[00:00:04] RoanKattouw, Niharika, and Urbanecm: #bothumor I � Unicode. All rise for Evening SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200213T0000). [00:00:04] Zoranzoki21: A patch you scheduled for Evening SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [00:00:55] Hi [00:01:01] jouncebot: now [00:01:01] For the next 0 hour(s) and 58 minute(s): Evening SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200213T0000) [00:01:36] I have one patch for this evening SWAT [00:03:07] Can someone deploy my patch? [00:05:39] 10Operations, 10ops-eqsin: snag asset tags from ulsfo, ship some to eqsin - https://phabricator.wikimedia.org/T245056 (10RobH) [00:06:20] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [00:06:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:06:51] 10Operations, 10ops-eqsin: apply asset tags to s[12]-60[34]-eqsin - https://phabricator.wikimedia.org/T244900 (10RobH) [00:06:53] 10Operations, 10ops-eqsin: snag asset tags from ulsfo, ship some to eqsin - https://phabricator.wikimedia.org/T245056 (10RobH) [00:07:18] (03CR) 10VolkerE: "Honestly, this is a simple fix that leads to a byte reduction. The valid question about title should be answered on a more general level f" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/571836 (owner: 10VolkerE) [00:08:04] 10Operations, 10ops-eqsin: snag asset tags from ulsfo, ship some to eqsin - https://phabricator.wikimedia.org/T245056 (10RobH) I'll mail this to Jin tomorrow (Friday) with directions to leave them in our racks at eqsin. They won't end up being applied for a couple of weeks, as he is working with another contr... 
[00:08:30] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [00:08:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:08:33] 10Operations, 10Toolforge, 10cloud-services-team (Kanban): mirrors.wikimedia.org libgtk-3-common all 3.22.11-1 hash mismatch - https://phabricator.wikimedia.org/T245071 (10bd808) [00:08:37] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [00:08:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:09:32] (03CR) 10Jforrester: [C: 03+1] "WFM." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/571836 (owner: 10VolkerE) [00:11:00] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [00:11:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:11:47] 10Operations, 10Toolforge, 10cloud-services-team (Kanban): mirrors.wikimedia.org libgtk-3-common all 3.22.11-1 hash mismatch - https://phabricator.wikimedia.org/T245071 (10bd808) @faidon did some investigation and found that this may have been caused by upstream repo state rather than our mirror being out of... [00:11:57] (03CR) 10VolkerE: "Follow-up fix at If460b13bf8f7543" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/557439 (https://phabricator.wikimedia.org/T240728) (owner: 10Ammarpad) [00:13:05] PROBLEM - Host mw2293 is DOWN: PING CRITICAL - Packet loss = 100% [00:13:29] RECOVERY - Host mw2293 is UP: PING OK - Packet loss = 0%, RTA = 36.17 ms [00:15:17] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2293.codfw.wmnet'] ` and were **ALL** successful. [00:16:24] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2294.codfw.wmnet'] ` and were **ALL** successful. 
[00:17:23] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2295.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020... [00:17:27] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2296.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020... [00:17:42] 10Operations, 10ops-eqiad, 10DBA: db1095 backup source crashed: broken BBU - https://phabricator.wikimedia.org/T244958 (10Jclark-ctr) @jcrespo. Battery replacement delivery date is 02/22/20 Please message me on irc for what time works best for you for replacement. I can accommodate your schedule [00:19:21] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10Papaul) [00:31:45] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [00:31:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:32:03] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [00:32:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:34:05] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [00:34:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:34:32] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: (Needed by 31st January) eqiad: rack/setup/install es102[0-5].eqiad.wmnet - https://phabricator.wikimedia.org/T241359 (10Jclark-ctr) a:05Jclark-ctr→03Cmjohnson Below are switch ports for host are racked cabled and updated netbox. Handing over to... 
[00:36:14] !log pt1979@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [00:36:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:36:40] (03PS4) 10Krinkle: etcd: Set globals explicitly in CommonSettings instead of etcd.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558774 [00:36:42] (03PS4) 10Krinkle: etcd: Set $wmfEtcdLastModifiedIndex from CommonSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558775 [00:36:44] (03PS5) 10Krinkle: etcd: Add $etcdHost parameter to wmfSetupEtcd() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558776 [00:36:46] (03PS5) 10Krinkle: etcd: Pass wmfSetupEtcd($etcdHost) from CommonSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558777 [00:36:48] (03PS2) 10Krinkle: etcd: Use require_once for etcd.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/571799 [00:39:02] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2295.codfw.wmnet'] ` and were **ALL** successful. [00:39:10] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2297.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020... [00:40:35] Zoranzoki21: I'm sorry I wasn't around earlier. [00:40:42] What's your patch? [00:41:11] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2296.codfw.wmnet'] ` and were **ALL** successful. 
[00:41:16] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2298.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020... [00:41:17] https://gerrit.wikimedia.org/r/c/570680/ [00:41:42] Zoranzoki21: Okay. I can deploy that now. [00:41:54] Cool, I will rollback it into the row for Evening SWAT [00:42:07] And also, as patch is "throttle related" mwdebug testing by me isn't needed [00:42:09] Thanks! [00:42:17] Great. [00:42:40] (03CR) 10Niharika29: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570680 (https://phabricator.wikimedia.org/T244488) (owner: 10Superzerocool) [00:43:26] Thanks! [00:43:39] (03Merged) 10jenkins-bot: Throttle rule for National Gallery of Canada Library and Archives edit-a-thon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570680 (https://phabricator.wikimedia.org/T244488) (owner: 10Superzerocool) [00:45:14] 10Operations, 10ops-eqiad, 10Analytics: rack/setup/install kafka-jumbo100[789].eqiad.wmnet - https://phabricator.wikimedia.org/T244506 (10Jclark-ctr) [00:45:53] !log niharika29@deploy1001 Synchronized wmf-config/throttle.php: Throttle rule for National Gallery of Canada Library and Archives edit-a-thon - T244488 (duration: 01m 07s) [00:45:56] Zoranzoki21: Deployed. Thanks for your patience. :) [00:45:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:45:57] T244488: Lift IP cap for National Gallery of Canada edit-a-thon - https://phabricator.wikimedia.org/T244488 [00:46:21] Great, thanks! 
[00:54:08] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [00:54:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:56:16] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [00:56:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:56:24] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [00:56:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:58:45] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [00:58:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:00:05] twentyafterfour: #bothumor My software never has bugs. It just develops random features. Rise for Phabricator update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200213T0100). [01:00:07] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2297.codfw.wmnet'] ` and were **ALL** successful. [01:01:09] !log starting phabricator deploy, momentary downtime expected while apache restarts [01:01:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:02:57] 10Operations, 10Wikimedia-Site-requests, 10Chinese-Sites, 10Community-consensus-needed: Enable "upload by url" feature at zhwiki - https://phabricator.wikimedia.org/T142991 (10Masumrezarock100) @WhitePhosphorus can you confirm the status of the discussion/if consensus has reached in that linked discussion?... [01:03:01] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2299.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020... 
[01:03:29] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2298.codfw.wmnet'] ` and were **ALL** successful. [01:03:51] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2300.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020... [01:10:24] !log no apparent problems with phabricator upgrade, all done [01:10:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:11:17] 10Operations, 10netops: Upgrade routers - https://phabricator.wikimedia.org/T243080 (10ayounsi) Feb. 18th - 13:00UTC - 2h - cr2/3-esams [01:17:47] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [01:17:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:18:52] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [01:18:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:20:00] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [01:20:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:22:18] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [01:22:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:24:24] (03CR) 10Mstyles: "> Patch Set 6: Code-Review+1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/559614 (https://phabricator.wikimedia.org/T219534) (owner: 10Mstyles) [01:24:44] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2299.codfw.wmnet'] ` and were **ALL** successful. 
[01:26:28] Operations, ops-codfw, serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2300.codfw.wmnet'] ` and were **ALL** successful. [01:32:20] (CR) Cwhite: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/571554 (https://phabricator.wikimedia.org/T234854) (owner: Herron) [01:36:26] (CR) Cwhite: "LGTM" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/571548 (https://phabricator.wikimedia.org/T234854) (owner: Herron) [01:37:33] (CR) Cwhite: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/571622 (https://phabricator.wikimedia.org/T234854) (owner: Herron) [01:46:27] (PS2) Aaron Schulz: Make deployment-prep $wgWANObjectCaches better match production [mediawiki-config] - https://gerrit.wikimedia.org/r/571792 [01:46:33] (PS2) Aaron Schulz: Set "coalesceKeys" for deployment-prep [mediawiki-config] - https://gerrit.wikimedia.org/r/571793 [01:51:17] Operations, Release-Engineering-Team-TODO, Patch-For-Review, Release-Engineering-Team (Deployment services), Technical-Debt: Investigate whether GD is still needed on appservers - https://phabricator.wikimedia.org/T227734 (Tgr) There seem to be [[https://codesearch.wmflabs.org/deployed/?q=%5C... 
[02:11:04] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /_info (retrieve service info) timed out before a response was received: / (spec from root) timed out before a response was received: /api (Scrapes sample page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [02:12:42] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [02:19:08] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2301.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020... [02:27:55] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2302.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020... [02:33:10] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [02:33:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:35:22] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [02:35:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:40:06] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2301.codfw.wmnet'] ` and were **ALL** successful. 
[02:41:57] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [02:41:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:44:15] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [02:44:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:45:14] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2303.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020... [02:47:58] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2302.codfw.wmnet'] ` and were **ALL** successful. [02:49:28] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2304.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020... [03:00:13] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [03:00:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:02:32] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [03:02:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:03:35] (03CR) 10Aaron Schulz: "It seems like an extension change should be made instead to clean up some logic." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/354608 (https://phabricator.wikimedia.org/T163107) (owner: 10Nemo bis) [03:04:26] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [03:04:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:06:37] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [03:06:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:07:15] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2303.codfw.wmnet'] ` and were **ALL** successful. [03:10:19] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2304.codfw.wmnet'] ` and were **ALL** successful. [03:12:48] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2305.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020... [03:13:03] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2306.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020... [03:19:55] PROBLEM - rpki grafana alert on icinga1001 is CRITICAL: CRITICAL: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is alerting: RRDP status alert. 
https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/ [03:27:50] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [03:27:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:28:04] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [03:28:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:30:01] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [03:30:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:32:25] !log pt1979@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [03:32:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:34:45] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2305.codfw.wmnet'] ` and were **ALL** successful. [03:37:09] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2306.codfw.wmnet'] ` and were **ALL** successful. [03:37:43] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2307.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020... [03:37:48] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2308.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020... 
[03:37:53] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2309.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2020... [03:52:37] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [03:52:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:52:49] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [03:52:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:52:53] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [03:52:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:54:50] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [03:54:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:56:17] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10Papaul) |servers Rack A3 |OS|ready for service| |servers Rack A6 |OS|ready for service| |mw2291|Stretch|yes| |mw2301|Stretch|yes| |... [03:57:09] !log pt1979@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [03:57:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:57:53] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [03:57:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:59:33] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2307.codfw.wmnet'] ` and were **ALL** successful. 
[04:01:51] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2308.codfw.wmnet'] ` and were **ALL** successful. [04:02:27] 10Operations, 10ops-codfw, 10serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2309.codfw.wmnet'] ` and were **ALL** successful. [04:52:52] (03PS1) 10Andrew Bogott: Remove nfs 'dumps' mount from wikidata-primary-sources-tool [puppet] - 10https://gerrit.wikimedia.org/r/571858 (https://phabricator.wikimedia.org/T208417) [04:54:11] (03CR) 10Andrew Bogott: [C: 03+2] Remove nfs 'dumps' mount from wikidata-primary-sources-tool [puppet] - 10https://gerrit.wikimedia.org/r/571858 (https://phabricator.wikimedia.org/T208417) (owner: 10Andrew Bogott) [05:08:42] (03PS1) 10Gergő Tisza: Allow non-autoconfirmed users to propose OAuth apps [mediawiki-config] - 10https://gerrit.wikimedia.org/r/571860 (https://phabricator.wikimedia.org/T213760) [05:09:44] !log depool cp20[04,11] and reimage as buster - T242093 [05:09:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:09:49] T242093: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 [05:11:18] 10Operations, 10Traffic: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp2004.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/202002130511_vgutie... 
[05:13:46] 10Operations, 10Traffic: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp2011.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/202002130513_vgutie... [05:14:07] 10Operations, 10Wikimedia-Mailing-lists: Request for creation: les sans pagEs Mailing List - https://phabricator.wikimedia.org/T245066 (10Aklapper) @Anthere: Please provide a [description of the list for the list info page](https://meta.wikimedia.org/wiki/Mailing_lists#Create_a_new_list). Plus I don't know wha... [05:25:44] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [05:25:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:28:01] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [05:28:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:28:13] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [05:28:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:30:32] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [05:30:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:32:45] 10Operations, 10Traffic: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2004.codfw.wmnet'] ` and were **ALL** successful. 
[05:34:57] !log pool cp2004 running buster - T242093 [05:35:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:35:01] T242093: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 [05:35:34] !log depool cp2001 and reimage as buster - T242093 [05:35:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:36:02] 10Operations, 10Traffic: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp2001.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/202002130535_vgutie... [05:36:04] 10Operations, 10Traffic: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2011.codfw.wmnet'] ` and were **ALL** successful. [05:37:47] !log pool cp2011 running buster - T242093 [05:37:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:38:24] !log depool cp2008 and reimage as buster - T242093 [05:38:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:39:06] 10Operations, 10Traffic: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp2008.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/202002130538_vgutie... 
[05:39:27] 10Operations, 10Traffic: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10Vgutierrez) [05:47:04] !log Silence m3 hosts for maintenance - T244566 [05:47:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:47:09] T244566: Upgrade and restart m3 (phabricator) master (db1128) - https://phabricator.wikimedia.org/T244566 [05:49:54] (03PS2) 10Vgutierrez: Release 8.0.6-rc0-1wm1 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/571660 [05:50:05] (03CR) 10jerkins-bot: [V: 04-1] Release 8.0.6-rc0-1wm1 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/571660 (owner: 10Vgutierrez) [05:50:27] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [05:50:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:52:43] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [05:52:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:53:17] !log Upgrade db1128 without restarting mysql - T244566 [05:53:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:53:21] T244566: Upgrade and restart m3 (phabricator) master (db1128) - https://phabricator.wikimedia.org/T244566 [05:53:31] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [05:53:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:55:45] (03PS1) 10Vgutierrez: install_server: Reiamge {upload,text}@eqiad as buster [puppet] - 10https://gerrit.wikimedia.org/r/571862 (https://phabricator.wikimedia.org/T242093) [05:55:48] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [05:55:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:58:31] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` 
['cp2001.codfw.wmnet'] ` and were **ALL** successful. [06:00:04] marostegui and twentyafterfour: Time to snap out of that daydream and deploy Phabricator database master restart. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200213T0600). [06:00:09] twentyafterfour: ready? [06:00:33] marostegui: yeah let me set read-only, gimme ~30 seconds [06:00:38] cool! [06:00:53] !log Start phabricator maintenance T244566 [06:00:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:00:57] T244566: Upgrade and restart m3 (phabricator) master (db1128) - https://phabricator.wikimedia.org/T244566 [06:01:06] !log set phabricator read-only [06:01:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:01:10] marostegui: go for it [06:01:13] ok [06:01:23] stopping [06:02:09] hmm rather than read-only phab I'm seeing (Can Not Connect to MySQL). [06:02:20] twentyafterfour: all done [06:02:22] oh well, this won't take long [06:02:27] and we are back [06:02:33] Phab isn't working for me: A Troublesome Encounter! [06:02:41] !log set phabricator read-only to false [06:02:41] DannyS712: see backlog, maintenance in progress [06:02:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:02:43] nvm, I guess I missed the memo [06:02:48] DannyS712: try again? [06:02:54] works now [06:03:08] marostegui: everything looks good to me... [06:03:16] yep, same: https://phabricator.wikimedia.org/T245102 [06:03:32] twentyafterfour: then we are done! thank you :) [06:03:47] twentyafterfour: the cannot connect is expected, as we were restarting mysql and the proxy didn't failover yet to the slave [06:03:56] !log pool cp20[01,08] running buster - T242093 [06:04:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:04:04] T242093: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 [06:04:05] ahh, how long does it take to fail over? 
[06:04:29] twentyafterfour: 1 minute [06:04:36] ah ok good to know. [06:04:39] !log depool cp20[02,05] and reimage as buster - T242093 [06:04:39] we check every 3 seconds, and we retry 20 times [06:04:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:04:59] marostegui: that seems reasonable. [06:05:05] thanks that was easy [06:05:07] twentyafterfour: yeah, to avoid false positives [06:05:16] twentyafterfour: thank you for being around so late for you :) [06:05:22] no problem at all [06:05:24] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp2002.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reima... [06:05:59] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10Vgutierrez) [06:06:50] (03CR) 10Marostegui: [C: 03+1] install_server: Reiamge {upload,text}@eqiad as buster [puppet] - 10https://gerrit.wikimedia.org/r/571862 (https://phabricator.wikimedia.org/T242093) (owner: 10Vgutierrez) [06:07:05] (03CR) 10Vgutierrez: [C: 03+2] install_server: Reiamge {upload,text}@eqiad as buster [puppet] - 10https://gerrit.wikimedia.org/r/571862 (https://phabricator.wikimedia.org/T242093) (owner: 10Vgutierrez) [06:08:42] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp2005.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reima... 
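Note on the failover timing discussed above: the "1 minute" answer follows from the two figures given in the channel (a check every 3 seconds, retried 20 times). The sketch below only reproduces that arithmetic and compares it with the downtime reported on T244566 (06:01:19 to 06:02:15). The constant and function names are hypothetical and are not taken from the proxy's configuration; how the proxy actually schedules its checks is not shown in this log.

```python
from datetime import datetime

# Figures quoted in the channel. The names are illustrative only and are
# not actual proxy settings.
CHECK_INTERVAL_SECONDS = 3
MAX_RETRIES = 20


def failover_window_seconds(interval: int, retries: int) -> int:
    """Worst-case time before failover, assuming every retry must fail first."""
    return interval * retries


def elapsed_seconds(start: str, end: str) -> int:
    """Seconds between two HH:MM:SS timestamps on the same day."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())


window = failover_window_seconds(CHECK_INTERVAL_SECONDS, MAX_RETRIES)
downtime = elapsed_seconds("06:01:19", "06:02:15")

print(f"failover window: {window}s, reported downtime: {downtime}s")
print("restart finished before failover would trigger:", downtime < window)
```

Under these assumptions the restart completed inside the retry window, which is consistent with the explanation that the proxy had not yet failed over when the connection errors appeared.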
[06:10:12] 10Operations, 10DBA, 10Phabricator, 10Release-Engineering-Team (Development services): Upgrade and restart m3 (phabricator) master (db1128) - https://phabricator.wikimedia.org/T244566 (10Marostegui) 05Open→03Resolved This was done successfully. Downtime was 56 seconds: 06:01:19 - 06:02:15 [06:10:13] 10Operations, 10Puppet, 10DBA, 10User-jbond: DB: perform rolling restart of mariadb daemons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (10Marostegui) [06:10:24] 10Operations, 10Puppet, 10DBA, 10User-jbond: DB: perform rolling restart of mariadb daemons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (10Marostegui) [06:11:19] (03PS1) 10Marostegui: Revert "db1099: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/571863 [06:11:51] !log depool cp10[89,90] and reimage as buster - T242093 [06:11:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:11:55] T242093: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 [06:12:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1099:3318, db1099:3311 T239453', diff saved to https://phabricator.wikimedia.org/P10396 and previous config saved to /var/cache/conftool/dbconfig/20200213-061219-marostegui.json [06:12:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:12:23] T239453: Remove partitions from revision table - https://phabricator.wikimedia.org/T239453 [06:12:33] (03CR) 10Marostegui: [C: 03+2] Revert "db1099: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/571863 (owner: 10Marostegui) [06:12:56] 10Operations, 10Traffic: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp1089.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/202002130612_vgutie... 
[06:17:03] 10Operations, 10Traffic: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp1090.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/202002130616_vgutie... [06:19:47] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [06:19:48] !log testing a new build of ATS 8.0.6 in cp40[26,32] [06:19:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:19:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:21:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1099:3318, db1099:3311 T239453', diff saved to https://phabricator.wikimedia.org/P10397 and previous config saved to /var/cache/conftool/dbconfig/20200213-062148-marostegui.json [06:21:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:21:53] T239453: Remove partitions from revision table - https://phabricator.wikimedia.org/T239453 [06:22:03] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [06:22:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:22:52] PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [06:23:05] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [06:23:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:23:21] 10Operations, 10Traffic: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2002.codfw.wmnet'] ` Of which those **FAILED**: ` ['cp2002.codfw.wmnet'] ` [06:25:23] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [06:25:25] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:26:36] (03PS1) 10Marostegui: db1087: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/571864 (https://phabricator.wikimedia.org/T232446) [06:26:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1099:3318, db1099:3311 T239453', diff saved to https://phabricator.wikimedia.org/P10398 and previous config saved to /var/cache/conftool/dbconfig/20200213-062642-marostegui.json [06:26:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:28:07] (03PS1) 10Vgutierrez: Release 8.0.6-rc0-1wm1 [debs/trafficserver] (8.0.6) - 10https://gerrit.wikimedia.org/r/571865 [06:28:19] (03CR) 10jerkins-bot: [V: 04-1] Release 8.0.6-rc0-1wm1 [debs/trafficserver] (8.0.6) - 10https://gerrit.wikimedia.org/r/571865 (owner: 10Vgutierrez) [06:28:20] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [06:28:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:28:27] (03CR) 10Marostegui: [C: 03+2] db1087: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/571864 (https://phabricator.wikimedia.org/T232446) (owner: 10Marostegui) [06:28:48] (03Abandoned) 10Vgutierrez: Release 8.0.6-rc0-1wm1 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/571660 (owner: 10Vgutierrez) [06:29:47] 10Operations, 10Traffic: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2005.codfw.wmnet'] ` and were **ALL** successful. 
[06:30:37] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [06:30:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:31:45] 10Operations, 10Security-Team, 10User-jbond: Determine any impacts to SRE from OIT's planned move to JumpCloud for LDAP - https://phabricator.wikimedia.org/T244792 (10MoritzMuehlenhoff) There are two different angles to consider: - From Jumpcloud we need the replication (as in a full copy) of the LDAP direc... [06:31:49] (03PS1) 10Marostegui: report_users: Add dbproxy1015 IP [software] - 10https://gerrit.wikimedia.org/r/571866 [06:32:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1099:3318, db1099:3311 T239453', diff saved to https://phabricator.wikimedia.org/P10399 and previous config saved to /var/cache/conftool/dbconfig/20200213-063207-marostegui.json [06:32:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:32:11] T239453: Remove partitions from revision table - https://phabricator.wikimedia.org/T239453 [06:32:29] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [06:32:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:33:02] (03CR) 10Marostegui: [C: 03+2] report_users: Add dbproxy1015 IP [software] - 10https://gerrit.wikimedia.org/r/571866 (owner: 10Marostegui) [06:33:29] 10Operations, 10Traffic: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp1089.eqiad.wmnet'] ` Of which those **FAILED**: ` ['cp1089.eqiad.wmnet'] ` [06:33:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1099:3318 into vslow for s8 T239453', diff saved to https://phabricator.wikimedia.org/P10400 and previous config saved to /var/cache/conftool/dbconfig/20200213-063334-marostegui.json [06:33:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:34:45] !log 
vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [06:34:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:35:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1087 for compression - T232446', diff saved to https://phabricator.wikimedia.org/P10401 and previous config saved to /var/cache/conftool/dbconfig/20200213-063535-marostegui.json [06:35:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:35:39] T232446: Compress new Wikibase tables - https://phabricator.wikimedia.org/T232446 [06:36:42] !log Upgrade and compress db1087, this will generate lag on s8 on the wiki replicas - T232446 [06:36:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:37:24] 10Operations, 10Traffic: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp1090.eqiad.wmnet'] ` Of which those **FAILED**: ` ['cp1090.eqiad.wmnet'] ` [06:40:45] 10Operations, 10Toolforge, 10cloud-services-team (Kanban): mirrors.wikimedia.org libgtk-3-common all 3.22.11-1 hash mismatch - https://phabricator.wikimedia.org/T245071 (10MoritzMuehlenhoff) I think we can rule out a change in the upstream repository; initially I had the hunch that this could be caused by a... [06:43:07] (03PS3) 10Muehlenhoff: Switch cloudnet2002-dev and cloudweb2001-dev to standard Partman recipes [puppet] - 10https://gerrit.wikimedia.org/r/571749 (https://phabricator.wikimedia.org/T156955) [06:43:57] 10Operations, 10Puppet, 10DBA, 10User-jbond: DB: perform rolling restart of mariadb daemons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (10Marostegui) [06:45:45] RECOVERY - rpki grafana alert on icinga1001 is OK: OK: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is not alerting. 
https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/ [06:46:00] (03CR) 10Muehlenhoff: [C: 03+2] Switch cloudnet2002-dev and cloudweb2001-dev to standard Partman recipes [puppet] - 10https://gerrit.wikimedia.org/r/571749 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff) [06:49:29] (03PS1) 10Elukey: profile::kerberos: switch an-tool1006 with stat1007 in email body [puppet] - 10https://gerrit.wikimedia.org/r/571868 [06:49:39] !log pool cp20[02,05] running buster - T242093 [06:49:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:49:43] T242093: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 [06:49:53] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:51:02] (03PS2) 10Elukey: profile::kerberos: switch an-tool1006 with stat1007 in email body [puppet] - 10https://gerrit.wikimedia.org/r/571868 [06:57:23] (03CR) 10Elukey: [C: 03+2] profile::kerberos: switch an-tool1006 with stat1007 in email body [puppet] - 10https://gerrit.wikimedia.org/r/571868 (owner: 10Elukey) [07:01:04] (03PS1) 10Vgutierrez: Release 8.0.5-1wm16 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/571869 (https://phabricator.wikimedia.org/T244464) [07:01:10] (03CR) 10jerkins-bot: [V: 04-1] Release 8.0.5-1wm16 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/571869 (https://phabricator.wikimedia.org/T244464) (owner: 10Vgutierrez) [07:01:57] (03CR) 10Giuseppe Lavagetto: [C: 03+2] httpd: log X-Request-Id header if any is incoming [puppet] - 10https://gerrit.wikimedia.org/r/570735 (https://phabricator.wikimedia.org/T244545) (owner: 10Dzahn) [07:01:59] !log pool cp10[89,90] running buster - T242093 [07:02:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[07:02:03] T242093: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 [07:03:01] 10Operations, 10Traffic: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10Vgutierrez) [07:03:12] !log depool cp10[87,88] and reimage as buster - T242093 [07:03:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:03:54] 10Operations, 10Traffic: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp1087.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/202002130703_vgutie... [07:09:15] 10Operations, 10Traffic: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp1088.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/202002130708_vgutie... [07:13:43] 10Operations, 10Traffic, 10Patch-For-Review: Investigate side-effects of enabling KA between ats-tls and varnish-fe - https://phabricator.wikimedia.org/T244464 (10Vgutierrez) 05Stalled→03Open After some investigations, it looks like [[ https://github.com/apache/trafficserver/pull/5811| PR 5811 ]] from up... [07:13:46] 10Operations, 10Traffic: ulsfo varnish-fe vcache processes overflow on FDs - https://phabricator.wikimedia.org/T243634 (10Vgutierrez) [07:14:11] 10Operations, 10Traffic, 10serviceops: Add x-request-id to httpd (apache) logs - https://phabricator.wikimedia.org/T244545 (10Joe) 05Open→03Resolved a:03Joe [07:16:05] PROBLEM - rpki grafana alert on icinga1001 is CRITICAL: CRITICAL: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is alerting: RRDP status alert. 
https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/ [07:21:28] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [07:21:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:23:45] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [07:23:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:24:23] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [07:24:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:26:38] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [07:26:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:26:51] 10Operations, 10Traffic: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp1087.eqiad.wmnet'] ` Of which those **FAILED**: ` ['cp1087.eqiad.wmnet'] ` [07:28:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1107 with weight 50 for 10.4 testing - T242702', diff saved to https://phabricator.wikimedia.org/P10402 and previous config saved to /var/cache/conftool/dbconfig/20200213-072839-marostegui.json [07:28:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:28:48] T242702: Test MariaDB 10.4 in production - https://phabricator.wikimedia.org/T242702 [07:29:26] 10Operations, 10Traffic: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp1088.eqiad.wmnet'] ` Of which those **FAILED**: ` ['cp1088.eqiad.wmnet'] ` [07:42:09] (03PS1) 10Vgutierrez: ATS: Test KA between ats-tls and varnish-fe on cp4031 + ATS 8.0.5-1wm16 [puppet] - 10https://gerrit.wikimedia.org/r/571876 (https://phabricator.wikimedia.org/T244464) [07:46:17] (03CR) 10Vgutierrez: [C: 03+2] ATS: Test KA between 
ats-tls and varnish-fe on cp4031 + ATS 8.0.5-1wm16 [puppet] - 10https://gerrit.wikimedia.org/r/571876 (https://phabricator.wikimedia.org/T244464) (owner: 10Vgutierrez) [07:47:05] !log installing Java security updates on stat/SWAP hosts [07:47:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:49:18] !log testing ATS 8.0.5-1wm16 + KA between ats-tls and varnish-fe in cp4031 - T244464 [07:49:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:49:22] T244464: Investigate side-effects of enabling KA between ats-tls and varnish-fe - https://phabricator.wikimedia.org/T244464 [07:49:50] !log pool cp10[87,88] running buster - T242093 [07:49:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:49:53] T242093: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 [07:52:22] 10Operations, 10Analytics, 10serviceops, 10vm-requests, 10User-Elukey: Create a replacement for kraz.wikimedia.org - https://phabricator.wikimedia.org/T244719 (10elukey) My bad, today I reviewed irc2001 with Moritz and it turned up that a GNOME deployment happened. This is because when d-i was running an... 
[07:53:18] !log rolling restart of restbase-dev to pick up Java security update [07:53:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:53:53] 10Operations, 10Traffic: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10Vgutierrez) [07:56:44] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: (Needed by 31st January) eqiad: rack/setup/install es102[0-5].eqiad.wmnet - https://phabricator.wikimedia.org/T241359 (10Marostegui) [07:57:08] !log depool cp10[85,86] and reimage as buster - T242093 [07:57:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:57:12] T242093: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 [07:57:41] !log installing Java security updates on Hadoop, Kafka/Jumbo, AQS and Druid canaries [07:57:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:57:59] (03PS1) 10Elukey: admin: add kerberos flag for user ppchelko [puppet] - 10https://gerrit.wikimedia.org/r/571918 (https://phabricator.wikimedia.org/T245091) [07:58:17] (03CR) 10Elukey: [C: 03+2] admin: add kerberos flag for user ppchelko [puppet] - 10https://gerrit.wikimedia.org/r/571918 (https://phabricator.wikimedia.org/T245091) (owner: 10Elukey) [08:00:32] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:01:02] PROBLEM - mediawiki originals uploads -hourly- for eqiad on icinga1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005:9112 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad [08:02:00] PROBLEM - mediawiki originals uploads -hourly- for codfw on icinga1001 is CRITICAL: account=mw-media class=originals 
cluster=swift instance=ms-fe2005:9112 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [08:07:28] (03PS1) 10Marostegui: dbproxy1015: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/571920 (https://phabricator.wikimedia.org/T202367) [08:08:46] (03CR) 10Marostegui: "I would like to restart this conversation and see if we can get this out of the way, so the only two pending old proxies (labs one) can be" [puppet] - 10https://gerrit.wikimedia.org/r/531670 (owner: 10Muehlenhoff) [08:09:15] (03CR) 10Marostegui: [C: 03+2] dbproxy1015: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/571920 (https://phabricator.wikimedia.org/T202367) (owner: 10Marostegui) [08:11:09] (03CR) 10KartikMistry: "Yes. Let me schedule deploying for this next week." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416973 (owner: 10KartikMistry) [08:11:31] (03PS3) 10KartikMistry: ContentTranslation: Set cookieDomain to null for Production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416973 [08:12:27] (03CR) 10Muehlenhoff: "Sure thing, I don't remember the finer details of my line of thought from back then, but I'll revisit this in more detail tomorrow and get" [puppet] - 10https://gerrit.wikimedia.org/r/531670 (owner: 10Muehlenhoff) [08:12:39] (03CR) 10jerkins-bot: [V: 04-1] ContentTranslation: Set cookieDomain to null for Production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416973 (owner: 10KartikMistry) [08:15:12] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [08:15:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:17:02] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [08:17:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:17:32] RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:18:59] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [08:19:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:20:39] 10Operations, 10Traffic: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp1085.eqiad.wmnet'] ` Of which those **FAILED**: ` ['cp1085.eqiad.wmnet'] ` [08:21:15] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [08:21:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:27:34] 10Operations, 10Traffic: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp1086.eqiad.wmnet'] ` and were **ALL** successful. [08:30:11] (03PS1) 10DannyS712: Update wgAvailableRights declaration of `autoreviewprotected` [mediawiki-config] - 10https://gerrit.wikimedia.org/r/571922 (https://phabricator.wikimedia.org/T230103) [08:30:35] (03PS2) 10DannyS712: Update wgAvailableRights declaration of `autoreviewprotected` [mediawiki-config] - 10https://gerrit.wikimedia.org/r/571922 (https://phabricator.wikimedia.org/T230103) [08:32:23] (03CR) 10Marostegui: "> Sure thing, I don't remember the finer details of my line of" [puppet] - 10https://gerrit.wikimedia.org/r/531670 (owner: 10Muehlenhoff) [08:48:29] 10Operations, 10DBA, 10Goal, 10Patch-For-Review: Implement logic to be able to perform partial/incremental backups of ES hosts - https://phabricator.wikimedia.org/T244884 (10Marostegui) We can also take into consideration reading the binlogs from a given file/position using the coordinates we store on the... 
[08:56:50] 10Operations: Migrate Cumin hosts to Buster - https://phabricator.wikimedia.org/T245114 (10MoritzMuehlenhoff) [08:57:24] !log restart elasticsearch on elastic2051 - JVM upgrade [08:57:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:59:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Increase weight for db1107 50 -> 100 for 10.4 testing - T242702', diff saved to https://phabricator.wikimedia.org/P10403 and previous config saved to /var/cache/conftool/dbconfig/20200213-085957-marostegui.json [09:00:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:02] T242702: Test MariaDB 10.4 in production - https://phabricator.wikimedia.org/T242702 [09:10:49] !log installing Java security updates on elastic* and relforge* [09:10:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:12:40] (03CR) 10Filippo Giunchedi: [C: 03+1] logstash: output logs ingested by deprecated inputs to kafka-logging [puppet] - 10https://gerrit.wikimedia.org/r/571548 (https://phabricator.wikimedia.org/T234854) (owner: 10Herron) [09:13:04] (03CR) 10Filippo Giunchedi: [C: 03+1] logstash: remove defalut value from kafka input type field [puppet] - 10https://gerrit.wikimedia.org/r/571813 (https://phabricator.wikimedia.org/T234854) (owner: 10Herron) [09:13:56] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, modulo optional 'type' fixed in I3862c56c42a" [puppet] - 10https://gerrit.wikimedia.org/r/571554 (https://phabricator.wikimedia.org/T234854) (owner: 10Herron) [09:21:12] (03PS1) 10Muehlenhoff: Switch cloudcontrol2001-dev to standard Partman recipes [puppet] - 10https://gerrit.wikimedia.org/r/571925 (https://phabricator.wikimedia.org/T156955) [09:22:18] (03CR) 10jerkins-bot: [V: 04-1] Switch cloudcontrol2001-dev to standard Partman recipes [puppet] - 10https://gerrit.wikimedia.org/r/571925 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff) [09:23:02] (03PS2) 10Muehlenhoff: Switch 
cloudcontrol2001-dev to standard Partman recipes [puppet] - 10https://gerrit.wikimedia.org/r/571925 (https://phabricator.wikimedia.org/T156955) [09:26:38] 10Operations, 10netops: RRDP status alert - https://phabricator.wikimedia.org/T245121 (10jijiki) [09:26:47] (03PS1) 10Muehlenhoff: Switch dns* to standard Partman recipes [puppet] - 10https://gerrit.wikimedia.org/r/571928 (https://phabricator.wikimedia.org/T156955) [09:27:52] (03CR) 10jerkins-bot: [V: 04-1] Switch dns* to standard Partman recipes [puppet] - 10https://gerrit.wikimedia.org/r/571928 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff) [09:28:42] (03PS2) 10Muehlenhoff: Switch dns* to standard Partman recipes [puppet] - 10https://gerrit.wikimedia.org/r/571928 (https://phabricator.wikimedia.org/T156955) [09:31:39] RECOVERY - mediawiki originals uploads -hourly- for eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad [09:31:53] (03Abandoned) 10Gehel: Reshard commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533023 (https://phabricator.wikimedia.org/T231446) (owner: 10Mathew.onipe) [09:32:37] RECOVERY - mediawiki originals uploads -hourly- for codfw on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [09:34:55] (03CR) 10Urbanecm: [C: 03+1] Update wgAvailableRights declaration of `autoreviewprotected` [mediawiki-config] - 10https://gerrit.wikimedia.org/r/571922 (https://phabricator.wikimedia.org/T230103) (owner: 10DannyS712) [09:36:13] (03PS1) 10Muehlenhoff: Remove obsolete lvm-noraid-large.a.cfg [puppet] - 10https://gerrit.wikimedia.org/r/571929 (https://phabricator.wikimedia.org/T156955) [09:36:37] (03PS1) 10Urbanecm: Grant autopatrol to azwiki patrollers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/571930 (https://phabricator.wikimedia.org/T244338) [09:38:18] (03PS5) 10Urbanecm: Create Test Custodians group at Beta Wikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/557671 (https://phabricator.wikimedia.org/T240438) (owner: 10RetroCraft) [09:42:02] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [dns] - 10https://gerrit.wikimedia.org/r/571753 (owner: 10Jbond) [09:45:06] !log pool cp10[85,86] running buster - T242093 [09:45:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:45:10] T242093: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 [09:45:49] !log depool cp10[83,84] and reimage as buster - T242093 [09:45:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:24] 10Operations, 10Traffic: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp1083.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/202002130946_vgutie... 
[09:46:50] 10Operations, 10Traffic: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10Vgutierrez) [09:48:30] 10Operations, 10Traffic: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp1084.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/202002130948_vgutie... [09:51:32] 10Operations: Full root partition/disk on gerrit1002 - https://phabricator.wikimedia.org/T245127 (10MoritzMuehlenhoff) [09:51:38] 10Operations: Full root partition/disk on gerrit1002 - https://phabricator.wikimedia.org/T245127 (10MoritzMuehlenhoff) p:05Triage→03Medium [09:52:40] 10Operations, 10Gerrit: gerrit1002 running out of space - https://phabricator.wikimedia.org/T243808 (10Marostegui) [09:52:43] 10Operations: Full root partition/disk on gerrit1002 - https://phabricator.wikimedia.org/T245127 (10Marostegui) [09:53:08] 10Operations, 10Gerrit: gerrit1002 running out of space - https://phabricator.wikimedia.org/T243808 (10Marostegui) Per the duplicate task I merged here filled by @MoritzMuehlenhoff: ` root@gerrit1002:~# df -hT / Filesystem Type Size Used Avail Use% Mounted on /dev/vda1 ext4 63G 60G 0 100% / ` [09:57:20] (03CR) 10Luke081515: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/571930 (https://phabricator.wikimedia.org/T244338) (owner: 10Urbanecm) [10:03:13] (03CR) 10Filippo Giunchedi: [C: 03+1] Switch cloudcontrol2001-dev to standard Partman recipes [puppet] - 10https://gerrit.wikimedia.org/r/571925 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff) [10:03:46] (03CR) 10Filippo Giunchedi: [C: 03+1] Switch dns* to standard Partman recipes [puppet] - 10https://gerrit.wikimedia.org/r/571928 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff) [10:03:54] (03CR) 10Filippo Giunchedi: [C: 03+2] Remove obsolete lvm-noraid-large.a.cfg [puppet] - 
10https://gerrit.wikimedia.org/r/571929 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff) [10:03:58] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [10:04:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:06:03] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [10:06:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:06:18] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [10:06:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:38] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [10:08:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:36] 10Operations, 10DBA, 10Goal, 10Patch-For-Review: Implement logic to be able to perform partial/incremental backups of ES hosts - https://phabricator.wikimedia.org/T244884 (10jcrespo) >>! In T244884#5879850, @Marostegui wrote: > We can also take into consideration reading the binlogs from a given file/posit... [10:11:02] 10Operations, 10DBA, 10Goal, 10Patch-For-Review: Implement logic to be able to perform full and incremental backups of ES hosts - https://phabricator.wikimedia.org/T244884 (10jcrespo) [10:11:16] 10Operations, 10Traffic: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp1084.eqiad.wmnet'] ` Of which those **FAILED**: ` ['cp1084.eqiad.wmnet'] ` [10:14:37] 10Operations, 10Traffic: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp1083.eqiad.wmnet'] ` and were **ALL** successful. [10:16:01] 10Operations, 10ops-eqiad, 10DBA: db1095 backup source crashed: broken BBU - https://phabricator.wikimedia.org/T244958 (10jcrespo) Thank, Jclark-ctl. 
No need to wait for us in this particular case, as it is as important that the service was immediately moved elsewhere and the data considered irrecoverable (b... [10:21:34] !log pool cp10[83,84] running buster - T242093 [10:21:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:39] T242093: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 [10:21:53] (03CR) 10Jbond: [C: 03+2] cas-graphite: add new domain to dns [dns] - 10https://gerrit.wikimedia.org/r/571753 (owner: 10Jbond) [10:22:05] (03PS2) 10Jbond: cas-graphite: add new domain to dns [dns] - 10https://gerrit.wikimedia.org/r/571753 [10:23:14] !log removing /root/.ssh/known_hosts in cumin1001 [10:23:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:40] and 2001 [10:24:03] !log depool cp10[81,82] and reimage as buster - T242093 [10:24:24] (03CR) 10Arturo Borrero Gonzalez: "Good finding!" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/571778 (https://phabricator.wikimedia.org/T244954) (owner: 10BryanDavis) [10:24:44] vgutierrez: Failed to log message to wiki. Somebody should check the error logs. [10:24:51] wonderful [10:25:10] maybe I'm being rate limited on the SAL /o\ [10:25:26] I wouldn't blame the poor bot for that TBH [10:25:45] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "please collect +1 from Andrew as well." [puppet] - 10https://gerrit.wikimedia.org/r/571925 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff) [10:27:05] hmm [10:27:09] or wikibugs is toasted [10:27:20] or last reimages didn't reach the phabricator task volans [10:27:26] (cp1081 && cp1082) [10:27:50] I can see the comments on the task (T242093) [10:27:50] T242093: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 [10:27:54] so it's an ircbot issue [10:28:43] vgutierrez: https://phabricator.wikimedia.org/T241109 maybe this? 
[10:30:20] vgutierrez: 2020-02-13 10:26:32 [INFO] (vgutierrez) wmf-auto-reimage::phabricator_task_update: Updated Phabricator task 'T242093' [10:30:54] so yeah,it's just phab->IRC [10:31:12] So looks related to the above task [10:31:39] wikibugs: you there? [10:34:48] (03CR) 10Ema: [C: 03+1] Release 8.0.5-1wm16 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/571869 (https://phabricator.wikimedia.org/T244464) (owner: 10Vgutierrez) [10:35:23] (03CR) 10Vgutierrez: [V: 03+2 C: 03+2] Release 8.0.5-1wm16 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/571869 (https://phabricator.wikimedia.org/T244464) (owner: 10Vgutierrez) [10:40:39] 10Operations, 10Traffic: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp1081.eqiad.wmnet'] ` Of which those **FAILED**: ` ['cp1081.eqiad.wmnet'] ` [10:41:54] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [10:41:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:57] 10Operations, 10Traffic: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp1082.eqiad.wmnet'] ` Of which those **FAILED**: ` ['cp1082.eqiad.wmnet'] ` [10:44:11] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [10:44:12] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [10:44:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:15] 10Operations, 10Discovery, 10Traffic, 10Wikidata, and 3 others: Wikidata maxlag repeatedly over 5s since Jan20, 2020 (primarily caused by the query service) - https://phabricator.wikimedia.org/T243701 (10Lea_Lacroix_WMDE) Thanks all for your feedback. Since the change we perform didn't have the expected re... 
[10:44:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:27] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [10:46:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:15] (03CR) 10Ema: [C: 03+2] vcl: merge wikimedia-common into wikimedia-frontend [puppet] - 10https://gerrit.wikimedia.org/r/571689 (https://phabricator.wikimedia.org/T241239) (owner: 10Ema) [10:49:40] !log cp4027 (cache_text): apply wikimedia-common/wikimedia-frontend VCL merge T241239 [10:49:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:44] T241239: Cleanup after varnish-be -> ats-be migration - https://phabricator.wikimedia.org/T241239 [10:51:26] 10Operations, 10MediaWiki-General, 10serviceops, 10Core Platform Team Workboards (Clinic Duty Team): siteinfo api calls should be cached for N minutes on the caching layer - https://phabricator.wikimedia.org/T244204 (10Joe) To be clear, I think what @Anomie described is a perfectly valid way of handling ca... [10:58:09] 10Operations, 10Release-Engineering-Team-TODO, 10Patch-For-Review, 10Release-Engineering-Team (Deployment services), 10Technical-Debt: Investigate whether GD is still needed on appservers - https://phabricator.wikimedia.org/T227734 (10jijiki) >>! In T227734#5879320, @Tgr wrote: > * getid3 (a TimedMediaHa... 
[10:59:37] !log cp4021 (cache_upload): apply wikimedia-common/wikimedia-frontend VCL merge T241239 [10:59:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:41] T241239: Cleanup after varnish-be -> ats-be migration - https://phabricator.wikimedia.org/T241239 [11:02:53] !log pool cp10[81,82] and reimage as buster - T242093 [11:02:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:56] arg [11:02:57] T242093: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 [11:03:03] let me fix that :) [11:07:02] 10Operations, 10Traffic: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10Vgutierrez) [11:07:11] 10Operations, 10conftool, 10serviceops-radar, 10Wikimedia-Incident: Create an automated alert for 'too many nodes depooled from a service' - https://phabricator.wikimedia.org/T245058 (10akosiaris) p:05Triage→03Medium Note that we currently have such an alert (or at least something close to it). The c... 
[11:08:29] !log upload trafficserver 8.0.5-1wm16 to apt.wm.o (buster) - T244464 [11:08:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:33] T244464: Investigate side-effects of enabling KA between ats-tls and varnish-fe - https://phabricator.wikimedia.org/T244464 [11:12:09] !log A:cp re-enable puppet, leave it to cron to apply wikimedia-common/wikimedia-frontend VCL merge T241239 [11:12:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:13] T241239: Cleanup after varnish-be -> ats-be migration - https://phabricator.wikimedia.org/T241239 [11:13:16] (03PS1) 10Jbond: graphite: remove trailing slash from proxy uri [puppet] - 10https://gerrit.wikimedia.org/r/571945 (https://phabricator.wikimedia.org/T244861) [11:13:20] (03PS4) 10Ema: vcl: simplify backend listing [puppet] - 10https://gerrit.wikimedia.org/r/571697 (https://phabricator.wikimedia.org/T241239) [11:13:56] (03Abandoned) 10Ema: VCL: improve handling of uncacheable responses [puppet] - 10https://gerrit.wikimedia.org/r/421542 (https://phabricator.wikimedia.org/T180712) (owner: 10Ema) [11:14:16] (03Abandoned) 10Ema: numa_networking: move setting to tlsproxy::instance [puppet] - 10https://gerrit.wikimedia.org/r/430902 (https://phabricator.wikimedia.org/T193865) (owner: 10Ema) [11:14:27] (03Abandoned) 10Ema: vcl: invoke builtin vcl_hit [puppet] - 10https://gerrit.wikimedia.org/r/433338 (https://phabricator.wikimedia.org/T192368) (owner: 10Ema) [11:14:54] (03CR) 10Jbond: [C: 03+2] graphite: remove trailing slash from proxy uri [puppet] - 10https://gerrit.wikimedia.org/r/571945 (https://phabricator.wikimedia.org/T244861) (owner: 10Jbond) [11:14:55] (03Abandoned) 10Ema: varnish: fix order of parameters in tests helper script [puppet] - 10https://gerrit.wikimedia.org/r/554287 (owner: 10Ema) [11:15:23] 10Operations, 10Traffic, 10Patch-For-Review: Investigate side-effects of enabling KA between ats-tls and varnish-fe - 
https://phabricator.wikimedia.org/T244464 (10Vgutierrez) The issue with timeouts and KeepAlive can be easily understood with a small environment using curl + ATS + httpbin. 1. curl requests /... [11:16:37] !log depool cp10[79,80] and reimage as buster - T242093 [11:16:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:41] T242093: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 [11:17:33] 10Operations, 10Traffic: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp1079.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/202002131117_vgutie... [11:18:21] !log rolling upgrade of ATS to version 8.0.5-1wm16 fleet wide - T244464 [11:18:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:25] T244464: Investigate side-effects of enabling KA between ats-tls and varnish-fe - https://phabricator.wikimedia.org/T244464 [11:19:26] 10Operations, 10DBA, 10procurement: eqiad: 9 database servers - https://phabricator.wikimedia.org/T245137 (10Marostegui) [11:20:18] 10Operations, 10Traffic: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp1080.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/202002131119_vgutie... 
[11:28:44] (03PS5) 10Ema: vcl: simplify backend listing [puppet] - 10https://gerrit.wikimedia.org/r/571697 (https://phabricator.wikimedia.org/T241239) [11:29:57] PROBLEM - Too many messages in kafka logging-eqiad on icinga1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group=logstash7-eqiad instance=kafkamon1001:9501 job=burrow partition={0,1,2,3,4,5} site=eqiad topic=udp_localhost-info https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&v [11:29:57] g-eqiad&var-topic=All&var-consumer_group=All [11:31:20] (03PS6) 10Ema: vcl: simplify backend listing [puppet] - 10https://gerrit.wikimedia.org/r/571697 (https://phabricator.wikimedia.org/T241239) [11:34:24] (03PS7) 10Ema: vcl: simplify backend listing [puppet] - 10https://gerrit.wikimedia.org/r/571697 (https://phabricator.wikimedia.org/T241239) [11:35:07] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [11:35:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:20] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:37:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:54] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [11:37:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:09] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:40:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:46] (03CR) 10Ema: "pcc looks sane: https://puppet-compiler.wmflabs.org/compiler1003/20781/" [puppet] - 10https://gerrit.wikimedia.org/r/571697 (https://phabricator.wikimedia.org/T241239) (owner: 10Ema) [11:41:37] (03CR) 10Ema: [C: 03+2] vcl: simplify backend listing [puppet] - 10https://gerrit.wikimedia.org/r/571697 (https://phabricator.wikimedia.org/T241239) 
(owner: 10Ema) [11:45:00] 10Operations, 10Android-app-Bugs, 10Traffic, 10Wikipedia-Android-App-Backlog, and 3 others: When fetching siteinfo from the MediaWiki API, set the `maxage` and `smaxage` parameters. - https://phabricator.wikimedia.org/T245033 (10jbond) p:05Triage→03Medium [11:45:50] 10Operations, 10conftool: Enforce in dbctl that core sections and es clusters always have at least two replicas - https://phabricator.wikimedia.org/T245036 (10jbond) p:05Triage→03Medium [11:46:17] 10Operations, 10conftool, 10Wikimedia-Incident: depool / confctl commands should print warnings or errors if too many nodes from that service are already depooled - https://phabricator.wikimedia.org/T245059 (10jbond) p:05Triage→03Medium [11:46:28] 10Operations, 10Pybal, 10Wikimedia-Incident: Pybal should reject a confctl configuration that indicates only one cp-text is pooled - https://phabricator.wikimedia.org/T245060 (10jbond) p:05Triage→03Medium [11:46:48] 10Operations, 10Traffic: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp1079.eqiad.wmnet'] ` and were **ALL** successful. [11:47:05] 10Operations, 10Toolforge, 10cloud-services-team (Kanban): mirrors.wikimedia.org libgtk-3-common all 3.22.11-1 hash mismatch - https://phabricator.wikimedia.org/T245071 (10jbond) p:05Triage→03Medium [11:47:59] 10Operations: Migrate Cumin hosts to Buster - https://phabricator.wikimedia.org/T245114 (10jbond) p:05Triage→03Medium [11:48:18] 10Operations, 10Traffic: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp1080.eqiad.wmnet'] ` and were **ALL** successful. 
[11:48:20] 10Operations, 10netops: RRDP status alert - https://phabricator.wikimedia.org/T245121 (10jbond) p:05Triage→03Medium [11:49:17] (03PS2) 10Ema: varnish::instance: remove 'layer' [puppet] - 10https://gerrit.wikimedia.org/r/571727 (https://phabricator.wikimedia.org/T241239) [11:49:50] 10Operations, 10Pybal, 10Wikimedia-Incident: Pybal should reject a confctl configuration that indicates only one cp-text is pooled - https://phabricator.wikimedia.org/T245060 (10mark) Personally I don't think Pybal should be rejecting that; it's a valid configuration from a technical standpoint, and there ca... [11:51:45] 10Operations: Migrate Cumin hosts to Buster - https://phabricator.wikimedia.org/T245114 (10jcrespo) Cumin should for the most part be able to be upgraded to 10.4 with ease as it only holds the client, not the server, and that is why easier to upgrade (and it should be transparent). The main issue on db/backups... [11:52:13] !log pool cp10[79,80] running buster - T242093 [11:52:16] RECOVERY - Too many messages in kafka logging-eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [11:52:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:17] T242093: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 [11:53:20] 10Operations, 10Traffic: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10Vgutierrez) [11:53:43] !log depool cp10[77,78] and reimage as buster - T242093 [11:53:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:09] 10Operations, 10Traffic: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp1077.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/202002131154_vgutie... [11:56:19] 10Operations, 10Traffic: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp1078.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/202002131156_vgutie... [12:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: Dear deployers, time to do the European Mid-day SWAT(Max 6 patches) deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200213T1200). [12:00:04] Urbanecm: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [12:00:21] I have some deployment, it can wait for a bit [12:03:24] here [12:03:29] Amir1: you want to deploy before me? 
[12:03:40] Urbanecm: no no, you go first [12:04:04] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/571930 (https://phabricator.wikimedia.org/T244338) (owner: 10Urbanecm) [12:05:02] (03Merged) 10jenkins-bot: Grant autopatrol to azwiki patrollers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/571930 (https://phabricator.wikimedia.org/T244338) (owner: 10Urbanecm) [12:08:51] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 176b0e8: Grant autopatrol to azwiki patrollers (T244338) (duration: 01m 05s) [12:08:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:56] T244338: Adding "autopatrolled" powers to the right of "patrol" in azwiki - https://phabricator.wikimedia.org/T244338 [12:09:52] (03PS1) 10Ladsgroup: Revert "Triple the factor of WDQS lag to maxlag for Wikidata" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/571956 [12:10:11] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/571922 (https://phabricator.wikimedia.org/T230103) (owner: 10DannyS712) [12:10:20] (03PS6) 10Urbanecm: Create Test Custodians group at Beta Wikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/557671 (https://phabricator.wikimedia.org/T240438) (owner: 10RetroCraft) [12:10:27] (03CR) 10Urbanecm: [C: 03+2] Create Test Custodians group at Beta Wikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/557671 (https://phabricator.wikimedia.org/T240438) (owner: 10RetroCraft) [12:11:06] Amir1: is your commit related to my earlier question in #wikidata? 
[12:11:09] (03Merged) 10jenkins-bot: Update wgAvailableRights declaration of `autoreviewprotected` [mediawiki-config] - 10https://gerrit.wikimedia.org/r/571922 (https://phabricator.wikimedia.org/T230103) (owner: 10DannyS712) [12:11:22] (03Merged) 10jenkins-bot: Create Test Custodians group at Beta Wikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/557671 (https://phabricator.wikimedia.org/T240438) (owner: 10RetroCraft) [12:11:42] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [12:11:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:27] Urbanecm: let me see [12:13:13] (03PS1) 10Jbond: authdns: create dns-admin group to allow pusing dns changes [puppet] - 10https://gerrit.wikimedia.org/r/571957 [12:13:34] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 0f035e4: Update wgAvailableRights declaration of autoreviewprotected (T230103) (duration: 01m 03s) [12:13:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:38] T230103: Systematize wgRestrictionLevels - https://phabricator.wikimedia.org/T230103 [12:13:54] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [12:13:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:56] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [12:13:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:59] !log disable ping offload in codfw [12:14:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:17] Urbanecm: it is related, we have been in this funny oscillation where bots edit too fast, the WDQS lags behind, the lag starts to get enforced, the bots stop, WDQS gets updated, the bots start again [12:14:41] this factor just changes the oscillation time period [12:14:50] gotcha [12:14:59] and why are you reverting the change, if I may ask? 
[12:15:34] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 1c81925: Create Test Custodians group at Beta Wikiversity (T240438) (duration: 01m 07s) [12:15:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:38] T240438: Create 'Test custodians' user group for betawv - https://phabricator.wikimedia.org/T240438 [12:15:39] Because increasing the oscillation time period caused more issues (unhappiness of the users) [12:16:09] Amir1: aha - do you have any link at hand? [12:16:15] btw, I'm done now - please go ahead [12:16:16] yeah yeah [12:16:18] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [12:16:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:31] https://phabricator.wikimedia.org/T243701 [12:16:51] cool [12:17:03] (03PS2) 10Ladsgroup: Revert "Triple the factor of WDQS lag to maxlag for Wikidata" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/571956 [12:17:44] (03CR) 10Ladsgroup: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/571956 (owner: 10Ladsgroup) [12:18:02] 10Operations, 10Release-Engineering-Team-TODO, 10Patch-For-Review, 10Release-Engineering-Team (Deployment services), 10Technical-Debt: Investigate whether GD is still needed on appservers - https://phabricator.wikimedia.org/T227734 (10TheDJ) 1. ID3: - module.audio-video.asf.php ASF fileformat metadata... 
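(Editor's note) The oscillation described at 12:14 can be sketched as a toy feedback loop. This is only an illustration under stated assumptions, not the production logic: it assumes the reported maxlag is the WDQS lag divided by the factor, that bots stop once it exceeds the 5 s named in the T243701 title, and that they resume only when the lag has fully drained. The function names, the rates, and the factor values 60 and 180 are hypothetical.

```python
def stop_ticks(factor, ticks, threshold_s=5, grow_s=1, drain_s=1):
    """Return the ticks (seconds) at which bots stop editing in a toy model."""
    lag_s = 0          # simulated WDQS lag, in seconds (hypothetical units)
    editing = True     # whether bots are currently editing
    stops = []
    for tick in range(ticks):
        if editing:
            lag_s += grow_s
            # Assumed rule: reported maxlag = WDQS lag / factor.
            # Bots back off once it exceeds the threshold.
            if lag_s > threshold_s * factor:
                editing = False
                stops.append(tick)
        else:
            # While bots are paused, the query service catches up.
            lag_s = max(0, lag_s - drain_s)
            if lag_s == 0:
                editing = True  # assumed: bots resume once lag is gone
    return stops


def period(factor, ticks=10_000):
    """Time between two consecutive bot stops for a given factor."""
    stops = stop_ticks(factor, ticks)
    return stops[1] - stops[0]


if __name__ == "__main__":
    for factor in (60, 180):  # illustrative values: a factor and its triple
        print(f"factor={factor}: bots stop every {period(factor)} ticks")
```

In this model, tripling the factor roughly triples the time between stops (602 vs. 1802 ticks) without removing the stop/start cycle, which matches the remark that the factor "just changes the oscillation time period".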
[12:18:12] thanks a lot Amir1 [12:18:21] !log re-image ping2001 to buster - T244584 [12:18:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:25] T244584: Upgrade ping VMs to buster - https://phabricator.wikimedia.org/T244584 [12:18:35] (03PS1) 10Alexandros Kosiaris: ganeti1009: Reserve ganeti1009 for k8s tests [puppet] - 10https://gerrit.wikimedia.org/r/571958 [12:18:42] (03Merged) 10jenkins-bot: Revert "Triple the factor of WDQS lag to maxlag for Wikidata" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/571956 (owner: 10Ladsgroup) [12:20:43] moritzm: to be sure, doing https://wikitech.wikimedia.org/wiki/Ganeti#Reinstall_/_Reimage_a_VM will install buster by default, right? [12:20:51] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:571956|Revert: Triple the factor of WDQS lag to maxlag for Wikidata (T244722)]] (duration: 01m 04s) [12:20:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:55] T244722: increase factor for query service that is taken into account for maxlag - https://phabricator.wikimedia.org/T244722 [12:23:05] (03PS1) 10Ladsgroup: Read and write more in the new term store [mediawiki-config] - 10https://gerrit.wikimedia.org/r/571961 (https://phabricator.wikimedia.org/T219123) [12:23:25] 10Operations, 10Traffic: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp1077.eqiad.wmnet'] ` and were **ALL** successful. 
[12:23:55] (03CR) 10Ladsgroup: [C: 03+2] Read and write more in the new term store [mediawiki-config] - 10https://gerrit.wikimedia.org/r/571961 (https://phabricator.wikimedia.org/T219123) (owner: 10Ladsgroup) [12:25:04] (03Merged) 10jenkins-bot: Read and write more in the new term store [mediawiki-config] - 10https://gerrit.wikimedia.org/r/571961 (https://phabricator.wikimedia.org/T219123) (owner: 10Ladsgroup) [12:25:21] 10Operations, 10Traffic: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp1078.eqiad.wmnet'] ` and were **ALL** successful. [12:25:48] PROBLEM - Nginx local proxy to apache on mwdebug1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [12:26:26] XioNoX: if the host is currently configured for Stretch, that needs to be dropped from modules/install_server/files/dhcpd/linux-host-entries.ttyS0-115200 [12:26:49] so just remove the option stanza for the stretch-installer and it will switch to Buster [12:27:14] and make sure to run puppet on A:installserver first [12:28:14] !log pool cp10[77,78] running buster - T242093 [12:28:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:18] T242093: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 [12:28:22] ahhh, I guess I have to re-image again [12:28:46] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [12:28:53] 10Operations, 10Traffic: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10Vgutierrez) [12:29:00] !log depool cp10[75,76] and reimage as buster - T242093 [12:29:01] !log 
ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:571961|Read and write more in the new term store]] (duration: 01m 03s) [12:29:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:39] 10Operations, 10Traffic: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp1075.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/202002131229_vgutie... [12:30:16] PROBLEM - PHP7 rendering on mwdebug1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [12:30:26] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:571961|Read and write more in the new term store]], take II, the cache issue (T219123 T225055) (duration: 01m 03s) [12:30:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:32] T219123: Migrate to and read from new store for item terms - https://phabricator.wikimedia.org/T219123 [12:30:32] T225055: Switch `tmpItemTermsMigrationStages` to MIGRATION_WRITE_BOTH - https://phabricator.wikimedia.org/T225055 [12:31:27] 10Operations, 10Traffic: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp1076.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/202002131231_vgutie... 
[12:32:13] (03PS1) 10Ayounsi: Remove option pxelinux.pathprefix stretch stanza for rpki and ping VMs [puppet] - 10https://gerrit.wikimedia.org/r/571962 (https://phabricator.wikimedia.org/T244584) [12:33:08] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [12:33:10] moritzm: https://gerrit.wikimedia.org/r/c/operations/puppet/+/571962 [12:34:00] (03CR) 10Ayounsi: [C: 03+2] Remove option pxelinux.pathprefix stretch stanza for rpki and ping VMs [puppet] - 10https://gerrit.wikimedia.org/r/571962 (https://phabricator.wikimedia.org/T244584) (owner: 10Ayounsi) [12:34:12] RECOVERY - Nginx local proxy to apache on mwdebug1001 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 633 bytes in 0.123 second response time https://wikitech.wikimedia.org/wiki/Application_servers [12:34:16] XioNoX: Looks good [12:34:22] thx! [12:34:44] RECOVERY - PHP7 rendering on mwdebug1001 is OK: HTTP OK: HTTP/1.1 200 OK - 74312 bytes in 2.951 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [12:34:56] !log EU SWAT is done [12:34:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:52] could the 571961 patch go into the wiki page for the swat, just for the record? I don't see it in there anyways [12:38:42] yeah, I should put both there [12:38:46] I will [12:38:50] 👍 [12:38:52] yaaa debian 10 [12:39:58] 10Operations, 10SRE-Access-Requests: Requesting access to ops group for dwisehaupt - https://phabricator.wikimedia.org/T244901 (10jbond) @Dwisehaupt "ops" is a very powerful set of permissions granting global root to the production cluster and is currently not very fine-grained (but we're working on improving... 
[12:41:35] (03PS2) 10Jbond: authdns: create dns-admin group to allow pushing dns changes [puppet] - 10https://gerrit.wikimedia.org/r/571957 [12:47:10] 10Operations, 10Wikimedia-Mailing-lists: Request for creation: les sans pagEs Mailing List - https://phabricator.wikimedia.org/T245066 (10jbond) p:05Triage→03Medium [12:47:12] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [12:47:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:48:59] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [12:49:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:31] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [12:49:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:53] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [12:51:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:06] 10Operations, 10Traffic: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp1075.eqiad.wmnet'] ` and were **ALL** successful. [13:00:18] 10Operations, 10Traffic: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp1076.eqiad.wmnet'] ` and were **ALL** successful. 
[13:00:51] !log pool cp10[75,76] running buster - T242093 [13:00:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:55] T242093: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 [13:01:29] 10Operations, 10Traffic: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10Vgutierrez) [13:01:44] 10Operations, 10Traffic: Upgrade cache cluster to debian buster - https://phabricator.wikimedia.org/T242093 (10Vgutierrez) 05Open→03Resolved a:03Vgutierrez `vgutierrez@cumin1001:~$ sudo -i cumin 'A:cp' 'cat /etc/debian_version' 78 hosts will be targeted: cp[2001-2002,2004-2008,2010-2014,2016-2020,2022-20... [13:01:51] \o/ [13:01:54] RECOVERY - traffic_server backend process restarted on cp3052 is OK: (C)2 ge (W)2 ge 1 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=esams+prometheus/ops&var-instance=cp3052&var-layer=backend [13:02:34] \o/ \o/ \o/ [13:02:54] Repeat after me: T L S. T L S. T L S. [13:03:18] yup, TLSv1.3 is getting closer and closer [13:03:35] !log re-enable ping offload in codfw - T244584 [13:03:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:39] T244584: Upgrade ping VMs to buster - https://phabricator.wikimedia.org/T244584 [13:06:24] !log disable ping offload in eqiad - T244584 [13:06:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:34] (03CR) 10Hnowlan: Migrate changeprop & cpjobqueue to kubernetes (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/554576 (owner: 10Holger Knust) [13:24:08] !log re-enable ping offload in eqiad - T244584 [13:24:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:13] T244584: Upgrade ping VMs to buster - https://phabricator.wikimedia.org/T244584 [13:28:41] (03CR) 10Ladsgroup: [C: 04-1] "One note." 
(031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/571738 (https://phabricator.wikimedia.org/T235420) (owner: 10Itamar Givon) [13:31:52] !log disable ping offload in esams - T244584 [13:31:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:58] T244584: Upgrade ping VMs to buster - https://phabricator.wikimedia.org/T244584 [13:43:10] 10Operations, 10Traffic, 10Patch-For-Review: Investigate side-effects of enabling KA between ats-tls and varnish-fe - https://phabricator.wikimedia.org/T244464 (10Vgutierrez) the issue described above it should be fixed almost everywhere: `===== NODE GROUP ===== (76) cp[2001-2002,2004-2008,2010-2014,2016-202... [13:48:47] 10Operations, 10Pybal, 10Wikimedia-Incident: Pybal should reject a confctl configuration that indicates only one cp-text is pooled - https://phabricator.wikimedia.org/T245060 (10CDanis) >>! In T245060#5880484, @mark wrote: > Personally I don't think Pybal should be rejecting that; it's a valid configuration... [13:54:43] (03PS1) 10Vgutierrez: ATS: Enable TLSv1.3 for ats-be <--> applayer communication [puppet] - 10https://gerrit.wikimedia.org/r/571976 (https://phabricator.wikimedia.org/T170567) [13:55:24] (03CR) 10Elukey: [C: 03+2] dumps:Add license footer to all analytics dumps pages [puppet] - 10https://gerrit.wikimedia.org/r/571672 (https://phabricator.wikimedia.org/T244685) (owner: 10Fdans) [13:55:55] 10Operations, 10conftool, 10serviceops-radar, 10Wikimedia-Incident: Create an automated alert for 'too many nodes depooled from a service' - https://phabricator.wikimedia.org/T245058 (10CDanis) >>! In T245058#5880313, @akosiaris wrote: > We can adapt it to do what we want. However from a cursory reading of... 
[13:58:12] (03PS2) 10Vgutierrez: ATS: Enable TLSv1.3 for ats-be <--> applayer communication [puppet] - 10https://gerrit.wikimedia.org/r/571976 (https://phabricator.wikimedia.org/T170567) [14:02:51] (03PS2) 10Itamar Givon: Add definitions for redirect badges [mediawiki-config] - 10https://gerrit.wikimedia.org/r/571738 (https://phabricator.wikimedia.org/T235420) [14:04:24] (03PS3) 10Vgutierrez: ATS: Enable TLSv1.3 for ats-be <--> applayer communication [puppet] - 10https://gerrit.wikimedia.org/r/571976 (https://phabricator.wikimedia.org/T170567) [14:04:26] (03PS1) 10Vgutierrez: ssl_ciphersuite: Fix TLSv1.3 ciphersuites names [puppet] - 10https://gerrit.wikimedia.org/r/571978 (https://phabricator.wikimedia.org/T170567) [14:04:40] (03PS3) 10Jbond: authdns: create dns-admin group to allow pushing dns changes [puppet] - 10https://gerrit.wikimedia.org/r/571957 [14:04:51] 10Operations, 10Traffic, 10Goal, 10Patch-For-Review, 10Performance-Team (Radar): Support TLSv1.3 - https://phabricator.wikimedia.org/T170567 (10Ladsgroup) >>! In T170567#5880790, @gerritbot wrote: > Change 571976 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez): > [operations/puppet@pr... [14:05:43] 10Operations, 10conftool, 10serviceops-radar, 10Wikimedia-Incident: Create an automated alert for 'too many nodes depooled from a service' - https://phabricator.wikimedia.org/T245058 (10akosiaris) From IRC ` akosiaris: ok, I think I am understanding now what you want to see as an alert. pybal /alerts aler... [14:05:44] (03CR) 10Itamar Givon: Add definitions for redirect badges (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/571738 (https://phabricator.wikimedia.org/T235420) (owner: 10Itamar Givon) [14:06:44] (03CR) 10Ladsgroup: [C: 03+1] "This can go live after the merged patch lands in production (wmf.20) meaning Thursday next week. Nothing to do for now. 
I can backport tha" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/571738 (https://phabricator.wikimedia.org/T235420) (owner: 10Itamar Givon) [14:11:09] RECOVERY - rpki grafana alert on icinga1001 is OK: OK: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is not alerting. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/ [14:11:17] 10Operations, 10Data-Services, 10Discovery-Search, 10Wikidata, and 2 others: Do not rate limit dumps from internal network - https://phabricator.wikimedia.org/T222349 (10Gehel) 05Open→03Resolved a:03Gehel Closing this as it seems that we have the rate limiting that we want at this point. We'll contin... [14:16:19] 10Operations, 10Patch-For-Review, 10Wikimedia-Incident: Tracking task: 2020-02-04 kartotherian outage - https://phabricator.wikimedia.org/T244278 (10AntiCompositeNumber) [14:17:04] if y'all could get around to making a public decision on that one... [14:17:24] (/s) [14:17:45] what is "that one"? 
[14:18:16] the maps outage -> all maps.wm.o requests from external referrers have a ratelimit of 0 [14:18:24] that's not quite true AntiComposite [14:18:41] external clients who request things that happen to be cache hits will get a 200 OK with the tile [14:19:07] but yes, I will be writing an incident doc today, and at least getting some public details up [14:19:12] good [14:22:57] the cache hits thing is probably the reason I haven't seen more complaints about it -- intermittent problems are harder to notice [14:23:47] PROBLEM - ores on ores1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Services/Monitoring/ores [14:24:13] !log re-enable ping offload in esams - T244584 [14:24:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:18] T244584: Upgrade ping VMs to buster - https://phabricator.wikimedia.org/T244584 [14:24:59] 10Operations: Upgrade ping VMs to buster - https://phabricator.wikimedia.org/T244584 (10ayounsi) 05Open→03Resolved Done. [14:27:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1107 after 10.4 testing - T242702', diff saved to https://phabricator.wikimedia.org/P10405 and previous config saved to /var/cache/conftool/dbconfig/20200213-142735-marostegui.json [14:27:36] can someone please change the clinic person to john ? [14:27:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:39] RECOVERY - ores on ores1001 is OK: HTTP OK: HTTP/1.0 200 OK - 6397 bytes in 7.114 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/ores [14:27:39] T242702: Test MariaDB 10.4 in production - https://phabricator.wikimedia.org/T242702 [14:27:57] effie: I can do that [14:28:02] thank you dear [14:28:20] Oh, it is actually there [14:28:25] you were too slow [14:28:36] Damn! 
[14:30:39] 10Operations, 10ops-eqiad, 10hardware-requests, 10Patch-For-Review: Decommission labsdb1002 - https://phabricator.wikimedia.org/T146455 (10faidon) @Jclark-ctr @wiki_willy what's the status here? It sounds like a decom that was only partial and that only needs a few more steps to finalize perhaps? [14:30:56] (03PS4) 10Effie Mouzeli: WIP mediawiki: send apache logs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/571239 (https://phabricator.wikimedia.org/T244472) [14:33:48] !log add routinator_0.6.4_amd64.deb to buster-wikimedia apt repo [14:33:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:47] (03CR) 10Jbond: [C: 03+2] puppet_compile: remove submodule support [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/571713 (owner: 10Jbond) [14:35:24] (03PS1) 10Vgutierrez: ssl_ciphersuite: Enable TLSv1.3 where available [puppet] - 10https://gerrit.wikimedia.org/r/571985 (https://phabricator.wikimedia.org/T170567) [14:35:42] 10Operations, 10Discovery, 10Traffic, 10Wikidata, and 3 others: Wikidata maxlag repeatedly over 5s since Jan20, 2020 (primarily caused by the query service) - https://phabricator.wikimedia.org/T243701 (10Bugreporter) The only way to resolve the issue is increase the rate of edit Query Updater can handle. 
[14:36:40] (03CR) 10Andrew Bogott: [C: 03+1] Switch cloudcontrol2001-dev to standard Partman recipes [puppet] - 10https://gerrit.wikimedia.org/r/571925 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff) [14:37:21] (03CR) 10jerkins-bot: [V: 04-1] ssl_ciphersuite: Enable TLSv1.3 where available [puppet] - 10https://gerrit.wikimedia.org/r/571985 (https://phabricator.wikimedia.org/T170567) (owner: 10Vgutierrez) [14:38:12] (03PS1) 10Filippo Giunchedi: prometheus: remove nginx jobs [puppet] - 10https://gerrit.wikimedia.org/r/571987 [14:39:43] (03PS2) 10Vgutierrez: ssl_ciphersuite: Enable TLSv1.3 where available [puppet] - 10https://gerrit.wikimedia.org/r/571985 (https://phabricator.wikimedia.org/T170567) [14:43:33] (03CR) 10Ema: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/20785/" [puppet] - 10https://gerrit.wikimedia.org/r/571727 (https://phabricator.wikimedia.org/T241239) (owner: 10Ema) [14:43:52] (03PS2) 10Vgutierrez: ssl_ciphersuite: Fix TLSv1.3 ciphersuites names [puppet] - 10https://gerrit.wikimedia.org/r/571978 (https://phabricator.wikimedia.org/T170567) [14:43:54] (03PS4) 10Vgutierrez: ATS: Enable TLSv1.3 for ats-be <--> applayer communication [puppet] - 10https://gerrit.wikimedia.org/r/571976 (https://phabricator.wikimedia.org/T170567) [14:43:56] (03PS3) 10Vgutierrez: ssl_ciphersuite: Enable TLSv1.3 where available [puppet] - 10https://gerrit.wikimedia.org/r/571985 (https://phabricator.wikimedia.org/T170567) [14:43:58] (03PS1) 10Vgutierrez: ATS: Test TLSv1.3 on ats-be <--> applayer communication on cp3050 [puppet] - 10https://gerrit.wikimedia.org/r/571988 (https://phabricator.wikimedia.org/T170567) [14:46:06] (03CR) 10Vgutierrez: "pcc looks happy: https://puppet-compiler.wmflabs.org/compiler1003/20786/" [puppet] - 10https://gerrit.wikimedia.org/r/571988 (https://phabricator.wikimedia.org/T170567) (owner: 10Vgutierrez) [14:47:32] (03CR) 10Ema: [C: 03+1] ATS: Test TLSv1.3 on ats-be <--> applayer communication on cp3050 [puppet] - 
10https://gerrit.wikimedia.org/r/571988 (https://phabricator.wikimedia.org/T170567) (owner: 10Vgutierrez) [14:47:32] !log re-image rpki2001 - T244585 [14:47:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:37] T244585: Upgrade rpki VMs to buster - https://phabricator.wikimedia.org/T244585 [14:49:24] (I have a UBN patch for Commons landing now.) [14:50:38] (03CR) 10Vgutierrez: [C: 03+2] ATS: Test TLSv1.3 on ats-be <--> applayer communication on cp3050 [puppet] - 10https://gerrit.wikimedia.org/r/571988 (https://phabricator.wikimedia.org/T170567) (owner: 10Vgutierrez) [14:50:50] jouncebot: next [14:50:50] In 2 hour(s) and 9 minute(s): Puppet SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200213T1700) [14:51:09] !log test TLSv1.3 between ats-be and applayer in cp3050 - T170567 [14:51:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:13] T170567: Support TLSv1.3 - https://phabricator.wikimedia.org/T170567 [14:52:05] PROBLEM - rpki grafana alert on icinga1001 is CRITICAL: CRITICAL: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is alerting: RRDP status alert. 
https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/ [14:55:43] (03PS1) 10Ottomata: Ensure absent no longer needed mediawiki Avro data Hadoop purge jobs [puppet] - 10https://gerrit.wikimedia.org/r/571990 [14:55:48] 10Operations, 10DC-Ops, 10decommission, 10fundraising-tech-ops: decommission americium.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T245038 (10Jgreen) [15:03:38] (03PS1) 10Jgreen: remove americium.frack.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/571992 (https://phabricator.wikimedia.org/T245038) [15:04:02] (03PS1) 10Jbond: puppet_compile: add changelog [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/571993 [15:04:44] (03PS1) 10Andrew Bogott: huggle cloud-vps project: remove 'scratch' and 'dumps' nfs mounts [puppet] - 10https://gerrit.wikimedia.org/r/571994 (https://phabricator.wikimedia.org/T208405) [15:04:46] (03CR) 10Jgreen: [C: 03+2] remove americium.frack.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/571992 (https://phabricator.wikimedia.org/T245038) (owner: 10Jgreen) [15:05:14] (03CR) 10Jbond: [C: 03+2] puppet_compile: add changelog [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/571993 (owner: 10Jbond) [15:05:58] (03CR) 10Andrew Bogott: [C: 03+2] huggle cloud-vps project: remove 'scratch' and 'dumps' nfs mounts [puppet] - 10https://gerrit.wikimedia.org/r/571994 (https://phabricator.wikimedia.org/T208405) (owner: 10Andrew Bogott) [15:06:07] 10Operations, 10DC-Ops, 10decommission, 10fundraising-tech-ops, 10Patch-For-Review: decommission americium.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T245038 (10Jgreen) a:05Jgreen→03Jclark-ctr [15:06:37] 10Operations: ganeti doesn't change the boot order to network - https://phabricator.wikimedia.org/T245158 (10ayounsi) p:05Triage→03High [15:07:03] 10Operations: Upgrade rpki VMs to buster - https://phabricator.wikimedia.org/T244585 (10ayounsi) [15:07:05] 10Operations: ganeti doesn't change the boot 
order to network - https://phabricator.wikimedia.org/T245158 (10ayounsi) [15:08:48] (03PS1) 10Volans: ganeti: add logging for GntInstance actions [software/spicerack] - 10https://gerrit.wikimedia.org/r/571997 (https://phabricator.wikimedia.org/T231068) [15:08:50] (03PS1) 10Volans: ganeti: add VM creation capability [software/spicerack] - 10https://gerrit.wikimedia.org/r/571998 (https://phabricator.wikimedia.org/T231068) [15:09:06] 10Operations, 10ops-eqiad, 10fundraising-tech-ops: (ASAP) rack/setup/install frnetmon1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T232137 (10Jgreen) @cmjohnson this hasn't moved since November, does it just need network port setup? [15:11:04] (03CR) 10Ottomata: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/20787/an-coord1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/571990 (owner: 10Ottomata) [15:13:10] (03PS4) 10Volans: sre.hosts.decommission: improve Ganeti VM support [cookbooks] - 10https://gerrit.wikimedia.org/r/567169 (https://phabricator.wikimedia.org/T231068) [15:13:12] (03PS1) 10Volans: sre.ganeti.makevm: refactor for new spicerack [cookbooks] - 10https://gerrit.wikimedia.org/r/571999 (https://phabricator.wikimedia.org/T231068) [15:15:57] (03CR) 10Volans: "Changes from PS2 are only in the module-level docstring." [cookbooks] - 10https://gerrit.wikimedia.org/r/567169 (https://phabricator.wikimedia.org/T231068) (owner: 10Volans) [15:17:02] 10Operations, 10Phatality, 10observability: Phatality deployments invoke oom-killer on logstash::collector nodes. 
- https://phabricator.wikimedia.org/T237706 (10Krinkle) [15:18:27] (03PS1) 10Andrew Bogott: Horizon: remove ACTIVE_REGION settings from local_settings.py [puppet] - 10https://gerrit.wikimedia.org/r/572002 [15:19:45] (03CR) 10Andrew Bogott: [C: 03+2] Horizon: remove ACTIVE_REGION settings from local_settings.py [puppet] - 10https://gerrit.wikimedia.org/r/572002 (owner: 10Andrew Bogott) [15:21:44] (03PS1) 10Vgutierrez: ATS: Use TLSv1.3 on ats-be <--> applayer on esams [puppet] - 10https://gerrit.wikimedia.org/r/572004 (https://phabricator.wikimedia.org/T170567) [15:22:03] !log jforrester@deploy1001 Synchronized php-1.35.0-wmf.19/extensions/WikibaseMediaInfo/resources/: UBN fix: Force non-value to be undefined (duration: 01m 06s) [15:22:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:37] (03CR) 10Ema: [C: 03+1] ATS: Use TLSv1.3 on ats-be <--> applayer on esams [puppet] - 10https://gerrit.wikimedia.org/r/572004 (https://phabricator.wikimedia.org/T170567) (owner: 10Vgutierrez) [15:23:43] (03PS1) 10Andrew Bogott: Horizon: remove ACTIVE_REGION settings from local_settings.py [puppet] - 10https://gerrit.wikimedia.org/r/572006 [15:23:45] (03CR) 10Vgutierrez: "pcc looks good: https://puppet-compiler.wmflabs.org/compiler1002/20788/" [puppet] - 10https://gerrit.wikimedia.org/r/572004 (https://phabricator.wikimedia.org/T170567) (owner: 10Vgutierrez) [15:23:49] (03CR) 10Vgutierrez: [C: 03+2] ATS: Use TLSv1.3 on ats-be <--> applayer on esams [puppet] - 10https://gerrit.wikimedia.org/r/572004 (https://phabricator.wikimedia.org/T170567) (owner: 10Vgutierrez) [15:24:53] (03CR) 10Andrew Bogott: [C: 03+2] Horizon: remove ACTIVE_REGION settings from local_settings.py [puppet] - 10https://gerrit.wikimedia.org/r/572006 (owner: 10Andrew Bogott) [15:27:19] !log turning on TLSv1.3 between ats-be and applayer in cp30[51-52] - T170567 [15:27:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:23] T170567: Support TLSv1.3 
- https://phabricator.wikimedia.org/T170567 [15:28:21] PROBLEM - Too many messages in kafka logging-eqiad on icinga1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group=logstash7-eqiad instance=kafkamon1001:9501 job=burrow partition={0,1,2,3,4,5} site=eqiad topic=udp_localhost-info https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&v [15:28:21] g-eqiad&var-topic=All&var-consumer_group=All [15:28:49] 10Operations, 10Puppet, 10Patch-For-Review, 10User-jbond: Investigate using the rich_data option to support Binary and binary_file for binary data - https://phabricator.wikimedia.org/T236481 (10jbond) Its now possible to use `rich_data` in the mode field to check the catalogue compilation with rich_data e... [15:29:53] I'm looking into the too many messages alert, it is related to elk7 cluster though (not in production) [15:30:05] ah yes I was about to ask [15:30:06] ack :) [15:33:07] RECOVERY - rpki grafana alert on icinga1001 is OK: OK: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is not alerting. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/ [15:33:08] mmhh yeah there was a spike of logs a little while ago and it looks like only logstash1023 is consuming, not 1024 or 1025. herron known/expected ? 
[15:33:32] the alert will recover in a little bit by itself regardless [15:33:53] godog yes that's expected [15:34:42] ack, thanks herron [15:35:03] but considering this I’ll fire those instances back up shortly [15:35:20] (03PS1) 10CRusnov: netbox report alerts: Remove trailing full-stop [puppet] - 10https://gerrit.wikimedia.org/r/572008 [15:36:28] SGTM [15:38:31] !log disable allow_half_open on ats-tls @ cp4031 - T236458 [15:38:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:36] T236458: ats-tls shows a huge amount of ESTABLISHED sockets even when the server is depooled - https://phabricator.wikimedia.org/T236458 [15:38:52] (03PS1) 10Ema: ATS: remove 'tls:' prefix from X-Analytics-TLS [puppet] - 10https://gerrit.wikimedia.org/r/572009 (https://phabricator.wikimedia.org/T237993) [15:39:09] 10Operations: ganeti doesn't change the boot order to network - https://phabricator.wikimedia.org/T245158 (10akosiaris) 05Open→03Resolved a:03akosiaris The VM had a `kvm_extra: -bios OVMF.fd` in its configuration. That meant it used UEFI, not BIOS and hence the usual `boot_order: network` ganeti functional... [15:39:11] 10Operations: Upgrade rpki VMs to buster - https://phabricator.wikimedia.org/T244585 (10akosiaris) [15:39:23] PROBLEM - Host ps1-a8-codfw is DOWN: PING CRITICAL - Packet loss = 100% [15:39:36] em, is that expected ^ ? 
[15:39:54] (03CR) 10Elukey: [C: 03+1] ATS: remove 'tls:' prefix from X-Analytics-TLS [puppet] - 10https://gerrit.wikimedia.org/r/572009 (https://phabricator.wikimedia.org/T237993) (owner: 10Ema) [15:41:04] (03CR) 10jerkins-bot: [V: 04-1] ATS: remove 'tls:' prefix from X-Analytics-TLS [puppet] - 10https://gerrit.wikimedia.org/r/572009 (https://phabricator.wikimedia.org/T237993) (owner: 10Ema) [15:41:34] ahahhaah [15:42:04] (03PS1) 10Filippo Giunchedi: varnish: use journald for varnishlog consumers [puppet] - 10https://gerrit.wikimedia.org/r/572012 (https://phabricator.wikimedia.org/T227108) [15:42:08] elukey: nice, CI tests work :) [15:42:09] ah tests! [15:42:20] !log rolling restart of ats-be on esams - T170567 [15:42:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:27] T170567: Support TLSv1.3 - https://phabricator.wikimedia.org/T170567 [15:42:37] (03PS2) 10Ema: ATS: remove 'tls:' prefix from X-Analytics-TLS [puppet] - 10https://gerrit.wikimedia.org/r/572009 (https://phabricator.wikimedia.org/T237993) [15:43:48] 10Operations: ganeti doesn't change the boot order to network - https://phabricator.wikimedia.org/T245158 (10MoritzMuehlenhoff) How did the OVMF come about? Some snowflake setting from earlier tests or something we might run into again? [15:44:12] (03PS3) 10Muehlenhoff: Switch cloudcontrol2001-dev to standard Partman recipes [puppet] - 10https://gerrit.wikimedia.org/r/571925 (https://phabricator.wikimedia.org/T156955) [15:44:59] (03CR) 10Filippo Giunchedi: "Buster upgrades are complete so we can go ahead with this" [puppet] - 10https://gerrit.wikimedia.org/r/572012 (https://phabricator.wikimedia.org/T227108) (owner: 10Filippo Giunchedi) [15:45:39] (03CR) 10CRusnov: "This change is ready for review." 
[puppet] - 10https://gerrit.wikimedia.org/r/572008 (owner: 10CRusnov) [15:47:05] (03PS1) 10Muehlenhoff: Remove two obsolete partman recipes [puppet] - 10https://gerrit.wikimedia.org/r/572013 (https://phabricator.wikimedia.org/T156955) [15:47:06] (03CR) 10Muehlenhoff: [C: 03+2] Switch cloudcontrol2001-dev to standard Partman recipes [puppet] - 10https://gerrit.wikimedia.org/r/571925 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff) [15:47:46] RECOVERY - Host ps1-a8-codfw is UP: PING OK - Packet loss = 0%, RTA = 38.53 ms [15:49:31] 10Operations: ganeti doesn't change the boot order to network - https://phabricator.wikimedia.org/T245158 (10MoritzMuehlenhoff) >>! In T245158#5881233, @MoritzMuehlenhoff wrote: > How did the OVMF come about? Some snowflake setting from ealier tests or something we might run into again? nvm, saw the discussion... [15:51:33] (03PS1) 10Filippo Giunchedi: Switch bast*/cumin*/scandium to standard Partman recipes [puppet] - 10https://gerrit.wikimedia.org/r/572014 (https://phabricator.wikimedia.org/T156955) [15:51:47] (03CR) 10Filippo Giunchedi: [C: 03+1] Remove two obsolete partman recipes [puppet] - 10https://gerrit.wikimedia.org/r/572013 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff) [15:58:18] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/572014 (https://phabricator.wikimedia.org/T156955) (owner: 10Filippo Giunchedi) [15:58:32] (03PS2) 10Volans: ganeti: add VM creation capability [software/spicerack] - 10https://gerrit.wikimedia.org/r/571998 (https://phabricator.wikimedia.org/T231068) [15:59:09] <_joe_> !log depooling mw1238 for analysis [15:59:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:51] (03PS1) 10Vgutierrez: ATS: Puppetize allow_half_open HTTP setting [puppet] - 10https://gerrit.wikimedia.org/r/572016 (https://phabricator.wikimedia.org/T236458) [16:01:06] (03PS1) 10Vgutierrez: ATS: Disable allow_half_open in 
cp4031 [puppet] - 10https://gerrit.wikimedia.org/r/572017 (https://phabricator.wikimedia.org/T236458) [16:02:14] <_joe_> HHVM, when it was malfunctioning, had the common decency not to recover its status as soon as depooled and pooled back again [16:02:21] <_joe_> !log pooled mw1238 again [16:02:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:08] (03CR) 10Vgutierrez: "pcc looks good: https://puppet-compiler.wmflabs.org/compiler1003/20790/" [puppet] - 10https://gerrit.wikimedia.org/r/572016 (https://phabricator.wikimedia.org/T236458) (owner: 10Vgutierrez) [16:03:19] 10Operations, 10ops-codfw: Alert for device ps1-a8-codfw.mgmt.codfw.wmnet - Device rebooted - https://phabricator.wikimedia.org/T245164 (10ayounsi) p:05Triage→03Medium [16:05:04] 10Operations, 10Toolforge, 10cloud-services-team (Kanban): mirrors.wikimedia.org libgtk-3-common all 3.22.11-1 hash mismatch - https://phabricator.wikimedia.org/T245071 (10bd808) 05Open→03Invalid >>! In T245071#5879649, @MoritzMuehlenhoff wrote: > But the file size appears to be truncated? It reports 107... [16:05:48] 10Operations, 10ops-codfw: Alert for device ps1-a8-codfw.mgmt.codfw.wmnet - Device rebooted - https://phabricator.wikimedia.org/T245164 (10ayounsi) Still from LibreNMS: > 2020-02-13 15:46:52 notice ps1-a8-codfw SENTRY3_5179AF] EVENT: System boot complete notice > 2020-02-13 15:46:52 notice ps1-a8-codfw **NO MA... [16:07:02] 10Operations, 10ops-codfw: Alert for device ps1-a8-codfw.mgmt.codfw.wmnet - Device rebooted - https://phabricator.wikimedia.org/T245164 (10RobH) > Uptime: 0 days 0 hours 20 minutes 27 seconds It did indeed reboot. [16:09:46] (03CR) 10BryanDavis: [C: 03+1] "The original deployment of OAuth took a very conservative approach to most configuration (and implementation) options. 
We have not seen an" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/571860 (https://phabricator.wikimedia.org/T213760) (owner: 10Gergő Tisza) [16:10:30] <_joe_> !log depooling/repooling mw1240 [16:10:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:32] PROBLEM - rpki grafana alert on icinga1001 is CRITICAL: CRITICAL: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is alerting: RRDP status alert. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/ [16:16:39] rpki2001 is back online on buster waiting to check if everything is good before doing eqiad [16:18:36] 10Operations, 10Analytics, 10Traffic: Wikipedia Accessibility, check false positives and false negatives of traffic alarms - https://phabricator.wikimedia.org/T245166 (10Nuria) [16:20:41] (03CR) 10Vgutierrez: [C: 03+2] ATS: Puppetize allow_half_open HTTP setting [puppet] - 10https://gerrit.wikimedia.org/r/572016 (https://phabricator.wikimedia.org/T236458) (owner: 10Vgutierrez) [16:21:01] !log restarting elastic2043 - T243715 [16:21:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:04] 10Operations, 10ops-codfw, 10Discovery: elastic2043 has hardware errors that trigger reboots - https://phabricator.wikimedia.org/T243715 (10Papaul) "i have a case open with Dell. See below for case reference. According to Dell like always they will not give you a solution to the problem like they don't know... [16:21:05] T243715: elastic2043 has hardware errors that trigger reboots - https://phabricator.wikimedia.org/T243715 [16:22:16] !log canceled the restart of elastic2043 - T243715 [16:22:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:20] RECOVERY - Too many messages in kafka logging-eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [16:24:19] (03CR) 10Vgutierrez: [C: 03+2] ATS: Disable allow_half_open in cp4031 [puppet] - 10https://gerrit.wikimedia.org/r/572017 (https://phabricator.wikimedia.org/T236458) (owner: 10Vgutierrez) [16:24:30] 10Operations, 10Analytics, 10Traffic: Wikipedia Accessibility, check false positives and false negatives of traffic alarms - https://phabricator.wikimedia.org/T245166 (10Nuria) We can also change the alarm to hourly and see events caught. Now, this might mean a super large number of false positives so we nee... [16:24:33] (03PS2) 10Vgutierrez: ATS: Disable allow_half_open in cp4031 [puppet] - 10https://gerrit.wikimedia.org/r/572017 (https://phabricator.wikimedia.org/T236458) [16:25:02] (03PS1) 10Dwisehaupt: Test comment to verify limits of access [dns] - 10https://gerrit.wikimedia.org/r/572019 (https://phabricator.wikimedia.org/T244901) [16:25:07] 10Operations, 10Analytics, 10Research, 10Traffic: Wikipedia Accessibility, check false positives and false negatives of traffic alarms - https://phabricator.wikimedia.org/T245166 (10Nuria) [16:28:39] !log rebooting elastic2043 for firmware upgrade [16:28:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:26] (03PS1) 10Ottomata: Remove unused absented avro hadoop deletion jobs [puppet] - 10https://gerrit.wikimedia.org/r/572020 [16:31:20] PROBLEM - Host elastic2043 is DOWN: PING CRITICAL - Packet loss = 100% [16:31:52] (03Abandoned) 10Dwisehaupt: Test comment to verify limits of access [dns] - 10https://gerrit.wikimedia.org/r/572019 (https://phabricator.wikimedia.org/T244901) (owner: 10Dwisehaupt) [16:32:25] 10Operations, 10ops-codfw: Alert for device ps1-a8-codfw.mgmt.codfw.wmnet - Device rebooted - 
https://phabricator.wikimedia.org/T245164 (10RobH) Ok, it was firmware ` Sentry Switched CDU Version 7.1b ` and is now upgraded to firmware ` Sentry Switched CDU Version 7.1d ` This caused the PDU interface to re... [16:32:44] !log ps1-a8-codfw.mgmt.codfw.wmnet firmware upgraded via T245164 [16:32:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:49] T245164: Alert for device ps1-a8-codfw.mgmt.codfw.wmnet - Device rebooted - https://phabricator.wikimedia.org/T245164 [16:33:04] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to ops group for dwisehaupt - https://phabricator.wikimedia.org/T244901 (10Jgreen) >>! In T244901#5880636, @jbond wrote: > @Dwisehaupt "ops" is a very powerful set of permissions granting global root to the production cluster and is c... [16:34:41] 10Operations, 10MediaWiki-General, 10serviceops: Revisit timeouts, concurrency limits in remote HTTP calls from MediaWiki - https://phabricator.wikimedia.org/T245170 (10Joe) [16:35:01] 10Operations, 10MediaWiki-General, 10serviceops: Revisit timeouts, concurrency limits in remote HTTP calls from MediaWiki - https://phabricator.wikimedia.org/T245170 (10Joe) [16:36:01] (03CR) 10Nikerabbit: [C: 04-1] WIP: Add config for OpusMT (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/563110 (https://phabricator.wikimedia.org/T234194) (owner: 10KartikMistry) [16:37:04] (03CR) 10Ottomata: [C: 03+2] Remove unused absented avro hadoop deletion jobs [puppet] - 10https://gerrit.wikimedia.org/r/572020 (owner: 10Ottomata) [16:38:36] RECOVERY - Host elastic2043 is UP: PING OK - Packet loss = 0%, RTA = 36.27 ms [16:39:12] (03PS1) 10Arturo Borrero Gonzalez: wikimediacloud.org: introduce new service names for DNS [dns] - 10https://gerrit.wikimedia.org/r/572023 (https://phabricator.wikimedia.org/T243766) [16:45:27] (03PS1) 10CRusnov: Bump Netbox revision to v2.7.4 [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/572025 [16:46:25] 
(03PS1) 10BryanDavis: rebuild_all: Ignore errors in Docker layer pruning [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/572026 (https://phabricator.wikimedia.org/T244795) [16:46:44] (03CR) 10jerkins-bot: [V: 04-1] rebuild_all: Ignore errors in Docker layer pruning [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/572026 (https://phabricator.wikimedia.org/T244795) (owner: 10BryanDavis) [16:47:39] (03PS2) 10Arturo Borrero Gonzalez: wikimediacloud.org: introduce new service names for DNS [dns] - 10https://gerrit.wikimedia.org/r/572023 (https://phabricator.wikimedia.org/T243766) [16:48:02] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: es1019: reseat IPMI - https://phabricator.wikimedia.org/T243963 (10wiki_willy) @Cmjohnson - looks like it's too late to do this one today, since they need 24hrs to depool. Chatted with @Marostegui briefly, so just let them know when's a good date for you t... [16:48:47] RECOVERY - rpki grafana alert on icinga1001 is OK: OK: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is not alerting. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/ [16:49:06] 10Operations, 10Core Platform Team, 10MediaWiki-General, 10serviceops: Revisit timeouts, concurrency limits in remote HTTP calls from MediaWiki - https://phabricator.wikimedia.org/T245170 (10WDoranWMF) p:05Triage→03High [16:49:32] (03CR) 10BryanDavis: "recheck" [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/572026 (https://phabricator.wikimedia.org/T244795) (owner: 10BryanDavis) [16:49:37] (03CR) 10Nray: [C: 03+1] "@Jforrester You have the Web's team approval to merge. I apparently don't have +2 rights to wmf-config, but feel free to self-merge" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570691 (https://phabricator.wikimedia.org/T244369) (owner: 10Jforrester) [16:50:56] (03CR) 10Bstorm: [C: 04-1] "Just thought of a problem with this: monitoring will be screwed up. 
It depended on the deprecated rpcbind and all that. Will add somethi" [puppet] - 10https://gerrit.wikimedia.org/r/571821 (https://phabricator.wikimedia.org/T224582) (owner: 10Bstorm) [16:50:59] (03PS3) 10Arturo Borrero Gonzalez: wikimediacloud.org: introduce new service names for DNS [dns] - 10https://gerrit.wikimedia.org/r/572023 (https://phabricator.wikimedia.org/T243766) [16:51:32] (03PS5) 10Effie Mouzeli: WIP mediawiki: send apache logs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/571239 (https://phabricator.wikimedia.org/T244472) [16:54:45] (03CR) 10BryanDavis: [V: 03+2 C: 03+2] "Jenkins seems to be busted in some way that is not related to tox or this change." [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/572026 (https://phabricator.wikimedia.org/T244795) (owner: 10BryanDavis) [16:55:26] (03CR) 10jerkins-bot: [V: 04-1] rebuild_all: Ignore errors in Docker layer pruning [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/572026 (https://phabricator.wikimedia.org/T244795) (owner: 10BryanDavis) [16:55:28] (03CR) 10Volans: [C: 03+1] "LGTM, thanks! :-)" [puppet] - 10https://gerrit.wikimedia.org/r/572008 (owner: 10CRusnov) [16:57:34] (03PS1) 10Cmjohnson: Adding mac to dhcpd for mw1351 [puppet] - 10https://gerrit.wikimedia.org/r/572031 (https://phabricator.wikimedia.org/T236437) [16:59:42] (03CR) 10Effie Mouzeli: [C: 03+1] "We are still using nginx somewhere else, but we are not gathering metrics for that instance and we will soon replace it anyway, LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/571987 (owner: 10Filippo Giunchedi) [17:00:05] godog and _joe_: I, the Bot under the Fountain, allow thee, The Deployer, to do Puppet SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200213T1700). [17:00:05] No GERRIT patches in the queue for this window AFAICS. 
[17:00:46] (03CR) 10Cmjohnson: [C: 03+2] Adding mac to dhcpd for mw1351 [puppet] - 10https://gerrit.wikimedia.org/r/572031 (https://phabricator.wikimedia.org/T236437) (owner: 10Cmjohnson) [17:01:31] (03CR) 10Jforrester: "recheck" [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/567081 (owner: 10BryanDavis) [17:02:34] I'll sling out a few things. [17:03:00] (03PS3) 10Jforrester: build: Upgrade composer testing stack to latest as used Wikimedia-wide [mediawiki-config] - 10https://gerrit.wikimedia.org/r/571804 [17:03:22] (03PS3) 10Jforrester: Test that every wiki has wgServer, wgCanonicalServer, and wgLanguageCode set or inherited [mediawiki-config] - 10https://gerrit.wikimedia.org/r/571803 [17:04:05] (03PS4) 10Jforrester: build: Upgrade composer testing stack to latest as used Wikimedia-wide [mediawiki-config] - 10https://gerrit.wikimedia.org/r/571804 [17:04:09] PROBLEM - Host ps1-a8-codfw is DOWN: PING CRITICAL - Packet loss = 100% [17:05:11] (03CR) 10Jforrester: [C: 03+2] Test that every wiki has wgServer, wgCanonicalServer, and wgLanguageCode set or inherited [mediawiki-config] - 10https://gerrit.wikimedia.org/r/571803 (owner: 10Jforrester) [17:05:20] (03PS1) 10Effie Mouzeli: thumbor: remove nginx code leftovers [puppet] - 10https://gerrit.wikimedia.org/r/572033 [17:05:54] (03CR) 10Jforrester: [C: 03+2] build: Upgrade composer testing stack to latest as used Wikimedia-wide [mediawiki-config] - 10https://gerrit.wikimedia.org/r/571804 (owner: 10Jforrester) [17:06:21] (03Merged) 10jenkins-bot: Test that every wiki has wgServer, wgCanonicalServer, and wgLanguageCode set or inherited [mediawiki-config] - 10https://gerrit.wikimedia.org/r/571803 (owner: 10Jforrester) [17:06:29] 10Operations, 10netops: RRDP status alert - https://phabricator.wikimedia.org/T245121 (10ayounsi) I think this means that the query to that URL times out. As it completes properly from codfw I'm wondering if it's not an issue with the webproxies (overloaded or similar). 
Any idea who can help looking into it? [17:07:07] (03Merged) 10jenkins-bot: build: Upgrade composer testing stack to latest as used Wikimedia-wide [mediawiki-config] - 10https://gerrit.wikimedia.org/r/571804 (owner: 10Jforrester) [17:07:22] (03CR) 10Andrew Bogott: [C: 04-1] wikimediacloud.org: introduce new service names for DNS (032 comments) [dns] - 10https://gerrit.wikimedia.org/r/572023 (https://phabricator.wikimedia.org/T243766) (owner: 10Arturo Borrero Gonzalez) [17:08:22] (03CR) 10Krinkle: [C: 03+2] Make deployment-prep $wgWANObjectCaches better match production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/571792 (owner: 10Aaron Schulz) [17:08:32] (03CR) 10Krinkle: [C: 03+1] Set "coalesceKeys" for deployment-prep [mediawiki-config] - 10https://gerrit.wikimedia.org/r/571793 (owner: 10Aaron Schulz) [17:08:41] (03CR) 10Jforrester: "OK, if/when this lands, it should remove the entry from StaticSettingsTest.php" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/228618 (https://phabricator.wikimedia.org/T90612) (owner: 10Legoktm) [17:09:27] (03Merged) 10jenkins-bot: Make deployment-prep $wgWANObjectCaches better match production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/571792 (owner: 10Aaron Schulz) [17:09:31] !log jforrester@deploy1001 Started scap: wmf-config/CommonSettings.php No-op (code style only) deploy sync [17:09:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:35] !log jforrester@deploy1001 sync aborted: wmf-config/CommonSettings.php No-op (code style only) deploy sync (duration: 00m 04s) [17:09:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:42] Hah, that was a silly typo. [17:10:21] (03CR) 10CRusnov: [C: 03+2] netbox report alerts: Remove trailing full-stop [puppet] - 10https://gerrit.wikimedia.org/r/572008 (owner: 10CRusnov) [17:10:27] Krinkle: That's OK to just pull and not sync, right? [17:10:34] James_F: Yep [17:10:38] Cool. 
[17:10:39] Was just gonna say [17:10:47] given train block, we can do some clean up I suppose. [17:10:51] .. which is what you're doing [17:10:55] Yes. :-) [17:10:58] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: No-op (code style only) deploy sync (duration: 01m 07s) [17:11:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:54] (03CR) 10Krinkle: Test that every wiki has wgServer, wgCanonicalServer, and wgLanguageCode set or inherited (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/571803 (owner: 10Jforrester) [17:12:41] (03CR) 10Jforrester: Test that every wiki has wgServer, wgCanonicalServer, and wgLanguageCode set or inherited (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/571803 (owner: 10Jforrester) [17:14:19] (03PS4) 10Arturo Borrero Gonzalez: wikimediacloud.org: introduce new service names for DNS [dns] - 10https://gerrit.wikimedia.org/r/572023 (https://phabricator.wikimedia.org/T243766) [17:14:58] (03CR) 10Andrew Bogott: [C: 03+1] wikimediacloud.org: introduce new service names for DNS [dns] - 10https://gerrit.wikimedia.org/r/572023 (https://phabricator.wikimedia.org/T243766) (owner: 10Arturo Borrero Gonzalez) [17:15:04] James_F: more comin'? [17:15:30] (03CR) 10Jforrester: [C: 03+1] etcd: Set globals explicitly in CommonSettings instead of etcd.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558774 (owner: 10Krinkle) [17:15:54] Krinkle: No, I'll wait on the trwiki change until post-train. [17:15:57] So go for it. [17:16:03] Was reviewing the UBNs. 
[17:16:11] k [17:16:15] (03CR) 10Krinkle: [C: 03+2] etcd: Set globals explicitly in CommonSettings instead of etcd.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558774 (owner: 10Krinkle) [17:16:35] * Krinkle touch /var/lock/scap-global-lock; and staging on mwdebug1002 [17:16:44] 10Operations, 10ops-codfw, 10Discovery: elastic2043 has hardware errors that trigger reboots - https://phabricator.wikimedia.org/T243715 (10Papaul) Before BIOS Version 1.5.6 iDRAC Firmware Version 3.21.21.21 After BIOS Version 2.4.8 iDRAC Firmware Version 3.21.21.21 can not upgrade IDRAC getting the e... [17:16:50] 10Operations, 10ops-eqiad, 10cloud-services-team (Hardware): cloudvirt1009: Device not healthy -SMART- - https://phabricator.wikimedia.org/T244986 (10aborrero) a:05aborrero→03Jclark-ctr General hard drive failure doesn't sound good. Please @Jclark-ctr @Cmjohnson advise how to proceed. [17:17:41] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] wikimediacloud.org: introduce new service names for DNS [dns] - 10https://gerrit.wikimedia.org/r/572023 (https://phabricator.wikimedia.org/T243766) (owner: 10Arturo Borrero Gonzalez) [17:22:23] (03PS1) 10Niedzielski: [prod] [Vector] Set skin version defaults [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572039 (https://phabricator.wikimedia.org/T242381) [17:23:35] (03PS1) 10Ottomata: Drop all raw event data and specific PII event data after 90 days [puppet] - 10https://gerrit.wikimedia.org/r/572041 (https://phabricator.wikimedia.org/T245124) [17:25:06] (03PS3) 10Bstorm: cloudstore: remove dependency on bind mounts [puppet] - 10https://gerrit.wikimedia.org/r/571821 (https://phabricator.wikimedia.org/T224582) [17:25:46] 10Operations, 10serviceops, 10Patch-For-Review: Test gutter pool failover in production and memcached 1.5.x - https://phabricator.wikimedia.org/T240684 (10jijiki) [17:25:58] (03CR) 10Bstorm: [C: 04-1] "Re-applying my -1 until we are totally ok with merging." 
[puppet] - 10https://gerrit.wikimedia.org/r/571821 (https://phabricator.wikimedia.org/T224582) (owner: 10Bstorm) [17:26:01] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission, 10fundraising-tech-ops: decommission frav1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T222109 (10Jgreen) [17:26:06] (03CR) 10jerkins-bot: [V: 04-1] Drop all raw event data and specific PII event data after 90 days [puppet] - 10https://gerrit.wikimedia.org/r/572041 (https://phabricator.wikimedia.org/T245124) (owner: 10Ottomata) [17:26:31] (03PS3) 10Bstorm: labstore: introduce a firewall for the old primary NFS cluster [puppet] - 10https://gerrit.wikimedia.org/r/571832 [17:27:01] (03PS5) 10Jforrester: etcd: Set globals explicitly in CommonSettings instead of etcd.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558774 (owner: 10Krinkle) [17:27:13] Krinkle: Didn't land, merge-conflicted. [17:27:28] ack [17:28:05] (03CR) 10Krinkle: [C: 03+2] etcd: Set globals explicitly in CommonSettings instead of etcd.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558774 (owner: 10Krinkle) [17:28:42] (03CR) 10jerkins-bot: [V: 04-1] cloudstore: remove dependency on bind mounts [puppet] - 10https://gerrit.wikimedia.org/r/571821 (https://phabricator.wikimedia.org/T224582) (owner: 10Bstorm) [17:29:07] (03Merged) 10jenkins-bot: etcd: Set globals explicitly in CommonSettings instead of etcd.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558774 (owner: 10Krinkle) [17:32:12] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: remove nginx jobs [puppet] - 10https://gerrit.wikimedia.org/r/571987 (owner: 10Filippo Giunchedi) [17:32:27] !log krinkle@deploy1001 Synchronized wmf-config/CommonSettings.php: I2e4fb0c086de0f8ac (duration: 01m 06s) [17:32:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:16] PROBLEM - Check whether microcode mitigations for CPU vulnerabilities are applied on authdns2001 is CRITICAL: CRITICAL - Server is missing the 
following CPU flags: {flush_l1d, md_clear, ssbd} https://wikitech.wikimedia.org/wiki/Microcode [17:34:42] (03PS5) 10Krinkle: etcd: Set $wmfEtcdLastModifiedIndex from CommonSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558775 [17:35:39] !log krinkle@deploy1001 Synchronized wmf-config/etcd.php: I2e4fb0c086de0f8ac (duration: 01m 06s) [17:35:39] (03CR) 10Krinkle: [C: 03+2] etcd: Set $wmfEtcdLastModifiedIndex from CommonSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558775 (owner: 10Krinkle) [17:35:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:37:06] (03Merged) 10jenkins-bot: etcd: Set $wmfEtcdLastModifiedIndex from CommonSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558775 (owner: 10Krinkle) [17:38:08] PROBLEM - rpki grafana alert on icinga1001 is CRITICAL: CRITICAL: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is alerting: RRDP status alert. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/ [17:39:08] PROBLEM - Host db1080.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:40:30] RECOVERY - Host db1080.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.74 ms [17:40:30] PROBLEM - Host db1081.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:40:30] PROBLEM - Host db1086.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:40:34] !log krinkle@deploy1001 Synchronized wmf-config/CommonSettings.php: Iefff596955e (duration: 01m 06s) [17:40:34] PROBLEM - Host ms-be1030.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:40:34] PROBLEM - Host mc1019.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:40:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:53] (03PS3) 10VolkerE: Fix latin Wikipedia (VICIPÆDIA) wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/571838 (https://phabricator.wikimedia.org/T240728) [17:41:00] (03PS6) 10Krinkle: etcd: Add $etcdHost parameter to 
wmfSetupEtcd() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558776 [17:41:08] PROBLEM - Host ps1-a6-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [17:41:12] PROBLEM - Host ps1-c8-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [17:41:17] (03CR) 10Krinkle: [C: 03+2] etcd: Add $etcdHost parameter to wmfSetupEtcd() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558776 (owner: 10Krinkle) [17:41:20] PROBLEM - Host ps1-d4-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [17:41:20] PROBLEM - Host ps1-d6-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [17:41:22] PROBLEM - Host ps1-d5-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [17:41:22] PROBLEM - Host elastic1038.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:41:22] PROBLEM - Host db1122.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:41:23] PROBLEM - Host cloudmetrics1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:41:32] PROBLEM - Host aqs1004.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:41:33] PROBLEM - Host analytics1076.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:41:33] PROBLEM - Host cloudvirt1009.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:41:33] PROBLEM - Host analytics1074.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:41:33] PROBLEM - Host restbase-dev1006.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:41:36] PROBLEM - Host druid1003.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:41:36] PROBLEM - Host druid1002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:41:36] PROBLEM - Host druid1004.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:41:36] PROBLEM - Host elastic1037.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:41:36] PROBLEM - Host druid1006.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:41:36] PROBLEM - Host elastic1035.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:41:36] PROBLEM - Host elastic1032.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:41:40] PROBLEM - Host 
elastic1062.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:41:40] PROBLEM - Host cloudvirt1012.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:41:41] ... [17:41:48] PROBLEM - Host labsdb1011.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:41:48] PROBLEM - Host labstore1007.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:41:48] PROBLEM - Host snapshot1007.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:41:58] !log krinkle@deploy1001 Synchronized wmf-config/etcd.php: Iefff596955e (duration: 01m 08s) [17:42:00] PROBLEM - Host notebook1003.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:42:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:42:02] PROBLEM - Host elastic1047.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:42:15] all mgmt, probably related to the ps? [17:42:17] (03Merged) 10jenkins-bot: etcd: Add $etcdHost parameter to wmfSetupEtcd() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558776 (owner: 10Krinkle) [17:42:18] PROBLEM - Host conf1004.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:42:18] PROBLEM - Host mc1023.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:42:18] PROBLEM - Host mc1027.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:42:18] PROBLEM - Host ms-be1023.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:42:18] PROBLEM - Host mc1032.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:42:18] PROBLEM - Host mc1033.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:42:22] PROBLEM - Host db1083.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:42:22] PROBLEM - Host db1088.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:42:26] PROBLEM - Host db1084.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:42:28] PROBLEM - Host mw1256.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:42:40] PROBLEM - Host mw1269.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:42:40] PROBLEM - Host dbstore1005.mgmt is DOWN: PING CRITICAL - Packet loss = 100% 
[17:42:40] PROBLEM - Host mw1300.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:42:40] PROBLEM - Host mc1029.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:42:40] PROBLEM - Host elastic1039.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:42:40] PROBLEM - Host mw1320.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:42:40] PROBLEM - Host mw1311.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:42:41] PROBLEM - Host ms-be1016.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:42:41] PROBLEM - Host mw1334.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:42:42] RECOVERY - Host ps1-d4-eqiad is UP: PING WARNING - DUPLICATES FOUND! Packet loss = 0%, RTA = 5.88 ms [17:42:56] RECOVERY - Host ps1-d5-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.56 ms [17:42:58] RECOVERY - Host elastic1047.mgmt is UP: PING OK - Packet loss = 16%, RTA = 1.02 ms [17:42:58] PROBLEM - Host ms-be1057.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:43:00] PROBLEM - Host conf1006.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:43:00] PROBLEM - Host ms-be1056.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:43:06] RECOVERY - Host labstore1007.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.54 ms [17:43:10] RECOVERY - Host mw1311.mgmt is UP: PING WARNING - Packet loss = 37%, RTA = 1.43 ms [17:43:14] PROBLEM - Host db1139.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:43:16] PROBLEM - Host scb1004.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:43:16] PROBLEM - Host cloudcontrol1003.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:43:16] PROBLEM - Host cloudmetrics1002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:43:17] PROBLEM - Host cloudnet1003.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:43:17] PROBLEM - Host cloudservices1003.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:43:18] PROBLEM - Host cloudstore1009.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:43:19] PROBLEM - Host cloudvirt1001.mgmt is DOWN: PING 
CRITICAL - Packet loss = 100% [17:43:36] poor icinga-wm [17:44:01] most likely all this spam is a PDU issue affecting only the management network, which means it's not anything in the critical production flow [17:44:39] PROBLEM - Host cloudvirt1008.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:44:39] PROBLEM - Host mwmaint1002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:44:39] RECOVERY - Host kafka-main1003.mgmt is UP: PING WARNING - DUPLICATES FOUND! Packet loss = 93%, RTA = 2.32 ms [17:44:41] PROBLEM - Host db1079.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:44:41] PROBLEM - Host db1074.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:44:41] PROBLEM - Host ms-be1022.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:44:43] RECOVERY - Host labsdb1010.mgmt is UP: PING WARNING - Packet loss = 86%, RTA = 39.08 ms [17:44:47] PROBLEM - Host db1092.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:44:47] PROBLEM - Host ms-be1018.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:44:47] PROBLEM - Host db1097.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:44:49] PROBLEM - Host db1090.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:44:51] PROBLEM - Host db1076.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:44:51] PROBLEM - Host db1085.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:44:51] PROBLEM - Host db1095.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:44:51] RECOVERY - Host cloudmetrics1002.mgmt is UP: PING WARNING - Packet loss = 93%, RTA = 1.87 ms [17:44:57] RECOVERY - Host cloudvirt1014.mgmt is UP: PING OK - Packet loss = 16%, RTA = 1.38 ms [17:44:59] PROBLEM - Host db1082.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:45:05] RECOVERY - Host ms-be1034.mgmt is UP: PING WARNING - Packet loss = 58%, RTA = 1298.02 ms [17:45:07] RECOVERY - Host es1015.mgmt is UP: PING WARNING - DUPLICATES FOUND! 
Packet loss = 0%, RTA = 1.16 ms [17:45:07] RECOVERY - Host ms-be1037.mgmt is UP: PING WARNING - Packet loss = 73%, RTA = 1.43 ms [17:45:09] RECOVERY - Host mw1242.mgmt is UP: PING WARNING - DUPLICATES FOUND! Packet loss = 28%, RTA = 1.76 ms [17:45:15] RECOVERY - Host cloudnet1003.mgmt is UP: PING WARNING - Packet loss = 93%, RTA = 17.78 ms [17:45:17] RECOVERY - Host ps1-c7-eqiad is UP: PING OK - Packet loss = 0%, RTA = 2.65 ms [17:45:17] RECOVERY - Host ms-be1055.mgmt is UP: PING WARNING - Packet loss = 93%, RTA = 1.05 ms [17:45:17] RECOVERY - Host mw1226.mgmt is UP: PING WARNING - DUPLICATES FOUND! Packet loss = 86%, RTA = 1759.56 ms [17:45:17] RECOVERY - Host scb1002.mgmt is UP: PING WARNING - DUPLICATES FOUND! Packet loss = 86%, RTA = 1257.92 ms [17:45:19] PROBLEM - Host an-worker1092.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:45:19] PROBLEM - Host analytics1037.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:45:19] PROBLEM - Host analytics1040.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:45:19] PROBLEM - Host analytics1039.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:45:19] PROBLEM - Host an-worker1094.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:45:21] PROBLEM - Host an-worker1095.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:45:21] PROBLEM - Host ores1006.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:45:23] PROBLEM - Host kafka-jumbo1005.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:45:23] PROBLEM - Host analytics1069.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:45:23] PROBLEM - Host analytics1055.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:45:23] PROBLEM - Host db1106.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:45:25] PROBLEM - Host db1125.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:45:25] PROBLEM - Host db1129.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:45:27] PROBLEM - Host db1137.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:45:27] PROBLEM - Host 
db1136.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:45:27] PROBLEM - Host db1135.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:45:27] PROBLEM - Host db1123.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:45:27] PROBLEM - Host kafka-jumbo1004.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:45:29] PROBLEM - Host analytics1060.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:45:29] PROBLEM - Host analytics1067.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:45:29] PROBLEM - Host cloudcontrol1004.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:45:30] PROBLEM - Host cloudstore1008.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:46:11] (03CR) 10Hnowlan: [C: 03+2] hiera: Add restbase202[123] to hiera [puppet] - 10https://gerrit.wikimedia.org/r/570094 (https://phabricator.wikimedia.org/T244178) (owner: 10Clarakosi) [17:46:43] RECOVERY - Host msw1-eqiad is UP: PING WARNING - Packet loss = 61%, RTA = 0.94 ms [17:46:43] RECOVERY - Host an-presto1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.28 ms [17:46:43] RECOVERY - Host an-worker1095.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.41 ms [17:46:43] RECOVERY - Host analytics1040.mgmt is UP: PING OK - Packet loss = 0%, RTA = 2.36 ms [17:46:43] RECOVERY - Host analytics1045.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.17 ms [17:46:43] RECOVERY - Host analytics1069.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.63 ms [17:46:43] RECOVERY - Host cloudstore1009.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.61 ms [17:46:44] RECOVERY - Host db1129.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.52 ms [17:46:45] RECOVERY - Host db1135.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.41 ms [17:46:45] RECOVERY - Host db1137.mgmt is UP: PING OK - Packet loss = 0%, RTA = 2.07 ms [17:46:46] RECOVERY - Host dumpsdata1003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.41 ms [17:47:43] PROBLEM - Host ps1-a6-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [17:47:56] PROBLEM - Host 
ps1-d5-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [17:47:58] PROBLEM - Host asw2-c-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [17:47:58] RECOVERY - Host ps1-a5-eqiad is UP: PING WARNING - Packet loss = 80%, RTA = 1.47 ms [17:48:08] RECOVERY - Host cloudstore1008.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.77 ms [17:48:08] RECOVERY - Host cp1086.mgmt is UP: PING OK - Packet loss = 0%, RTA = 2.02 ms [17:48:08] RECOVERY - Host cp1088.mgmt is UP: PING OK - Packet loss = 0%, RTA = 2.31 ms [17:48:08] RECOVERY - Host db1123.mgmt is UP: PING OK - Packet loss = 0%, RTA = 2.19 ms [17:48:08] RECOVERY - Host dumpsdata1001.mgmt is UP: PING WARNING - DUPLICATES FOUND! Packet loss = 0%, RTA = 2.74 ms [17:48:09] RECOVERY - Host elastic1061.mgmt is UP: PING WARNING - DUPLICATES FOUND! Packet loss = 0%, RTA = 2.35 ms [17:48:09] RECOVERY - Host ganeti1010.mgmt is UP: PING WARNING - DUPLICATES FOUND! Packet loss = 0%, RTA = 1.84 ms [17:48:10] RECOVERY - Host logstash1012.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.46 ms [17:48:26] (03PS6) 10Krinkle: etcd: Pass wmfSetupEtcd($etcdHost) from CommonSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558777 [17:48:45] RECOVERY - Host ganeti1020.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.15 ms [17:48:45] RECOVERY - Host mc-gp1002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 2.23 ms [17:48:48] RECOVERY - Host msw1-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.15 ms [17:48:50] PROBLEM - Host conf1004.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:48:50] PROBLEM - Host mc1023.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:48:50] PROBLEM - Host mc1027.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:48:50] PROBLEM - Host ms-be1023.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:48:50] PROBLEM - Host mc1032.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:48:50] PROBLEM - Host mc1033.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:48:52] PROBLEM - Host ms-be1052.mgmt is 
DOWN: PING CRITICAL - Packet loss = 100% [17:48:54] RECOVERY - Host ganeti1004.mgmt is UP: PING OK - Packet loss = 0%, RTA = 2.06 ms [17:48:54] RECOVERY - Host labstore1007.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.60 ms [17:48:54] RECOVERY - Host mw1252.mgmt is UP: PING OK - Packet loss = 0%, RTA = 2.51 ms [17:48:54] RECOVERY - Host rdb1006.mgmt is UP: PING OK - Packet loss = 0%, RTA = 2.43 ms [17:48:54] RECOVERY - Host wtp1046.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.95 ms [17:48:54] RECOVERY - Host wtp1043.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.99 ms [17:48:55] RECOVERY - Host mc1035.mgmt is UP: PING OK - Packet loss = 16%, RTA = 2.28 ms [17:48:55] RECOVERY - Host restbase-dev1006.mgmt is UP: PING OK - Packet loss = 16%, RTA = 2.54 ms [17:48:56] PROBLEM - Host db1088.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:48:56] PROBLEM - Host mw1222.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:48:57] RECOVERY - Host aqs1006.mgmt is UP: PING WARNING - Packet loss = 28%, RTA = 1.98 ms [17:48:57] PROBLEM - Host dbproxy1008.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:48:58] PROBLEM - Host stat1005.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:49:00] PROBLEM - Host db1084.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:49:02] RECOVERY - Host stat1004.mgmt is UP: PING WARNING - Packet loss = 61%, RTA = 4.22 ms [17:49:02] RECOVERY - Host thumbor1001.mgmt is UP: PING WARNING - Packet loss = 61%, RTA = 4.43 ms [17:49:02] RECOVERY - Host wdqs1007.mgmt is UP: PING WARNING - Packet loss = 61%, RTA = 4.45 ms [17:49:04] PROBLEM - Host an-conf1003.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:49:04] PROBLEM - Host an-conf1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:49:04] PROBLEM - Host an-coord1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:49:04] PROBLEM - Host an-presto1005.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:49:04] PROBLEM - Host an-worker1078.mgmt is DOWN: PING CRITICAL - 
Packet loss = 100% [17:49:04] RECOVERY - Host an-worker1087.mgmt is UP: PING WARNING - Packet loss = 93%, RTA = 5.49 ms [17:49:23] !log krinkle@deploy1001 Synchronized wmf-config/etcd.php: Ibfca686f681 (duration: 01m 06s) [17:49:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:49:54] (03CR) 10Krinkle: [C: 03+2] etcd: Pass wmfSetupEtcd($etcdHost) from CommonSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558777 (owner: 10Krinkle) [17:50:01] RECOVERY - Host ps1-c5-eqiad is UP: PING WARNING - Packet loss = 28%, RTA = 2.00 ms [17:50:17] RECOVERY - Host ps1-c6-eqiad is UP: PING WARNING - Packet loss = 73%, RTA = 333.42 ms [17:50:27] RECOVERY - Host ps1-c4-eqiad is UP: PING WARNING - Packet loss = 93%, RTA = 2.22 ms [17:50:29] RECOVERY - Host db1122.mgmt is UP: PING WARNING - Packet loss = 93%, RTA = 130.28 ms [17:50:29] RECOVERY - Host cloudmetrics1001.mgmt is UP: PING WARNING - Packet loss = 80%, RTA = 8.47 ms [17:50:31] RECOVERY - Host wtp1037.mgmt is UP: PING WARNING - Packet loss = 93%, RTA = 0.89 ms [17:50:37] RECOVERY - Host flerovium.mgmt is UP: PING WARNING - DUPLICATES FOUND! Packet loss = 0%, RTA = 0.99 ms [17:50:37] RECOVERY - Host cloudvirt1014.mgmt is UP: PING WARNING - Packet loss = 93%, RTA = 0.81 ms [17:50:39] RECOVERY - Host ps1-c7-eqiad is UP: PING OK - Packet loss = 0%, RTA = 3.10 ms [17:50:39] RECOVERY - Host analytics1042.mgmt is UP: PING WARNING - DUPLICATES FOUND! Packet loss = 0%, RTA = 1.20 ms [17:50:39] RECOVERY - Host elastic1032.mgmt is UP: PING WARNING - Packet loss = 73%, RTA = 1548.13 ms [17:50:41] RECOVERY - Host sessionstore1003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.60 ms [17:50:43] RECOVERY - Host cloudvirt1026.mgmt is UP: PING WARNING - Packet loss = 54%, RTA = 1.25 ms [17:50:43] RECOVERY - Host an-worker1092.mgmt is UP: PING WARNING - DUPLICATES FOUND! 
Packet loss = 37%, RTA = 2.35 ms [17:50:43] RECOVERY - Host cloudvirt1020.mgmt is UP: PING WARNING - Packet loss = 93%, RTA = 28.00 ms [17:50:45] RECOVERY - Host ms-be1033.mgmt is UP: PING WARNING - Packet loss = 37%, RTA = 2.88 ms [17:50:45] RECOVERY - Host pc1009.mgmt is UP: PING WARNING - Packet loss = 37%, RTA = 1.81 ms [17:50:45] RECOVERY - Host analytics1053.mgmt is UP: PING WARNING - Packet loss = 54%, RTA = 15.86 ms [17:50:45] RECOVERY - Host kubernetes1001.mgmt is UP: PING WARNING - DUPLICATES FOUND! Packet loss = 44%, RTA = 921.17 ms [17:50:45] RECOVERY - Host cescout1001.mgmt is UP: PING WARNING - DUPLICATES FOUND! Packet loss = 54%, RTA = 1754.32 ms [17:50:47] RECOVERY - Host kubernetes1004.mgmt is UP: PING WARNING - DUPLICATES FOUND! Packet loss = 50%, RTA = 9.27 ms [17:50:47] RECOVERY - Host mc1026.mgmt is UP: PING WARNING - Packet loss = 50%, RTA = 2.60 ms [17:50:47] RECOVERY - Host analytics1058.mgmt is UP: PING WARNING - Packet loss = 58%, RTA = 3.00 ms [17:50:47] RECOVERY - Host analytics1047.mgmt is UP: PING WARNING - Packet loss = 58%, RTA = 2.93 ms [17:50:51] RECOVERY - Host cloudvirt1003.mgmt is UP: PING WARNING - Packet loss = 93%, RTA = 2.24 ms [17:50:51] RECOVERY - Host cp1087.mgmt is UP: PING WARNING - DUPLICATES FOUND! Packet loss = 93%, RTA = 1639.62 ms [17:50:51] RECOVERY - Host ms-be1031.mgmt is UP: PING WARNING - Packet loss = 66%, RTA = 1577.10 ms [17:50:53] RECOVERY - Host dbproxy1015.mgmt is UP: PING WARNING - DUPLICATES FOUND! Packet loss = 73%, RTA = 5.56 ms [17:50:53] RECOVERY - Host es1018.mgmt is UP: PING WARNING - Packet loss = 80%, RTA = 751.81 ms [17:50:53] RECOVERY - Host francium.mgmt is UP: PING WARNING - DUPLICATES FOUND! Packet loss = 86%, RTA = 1620.08 ms [17:50:53] RECOVERY - Host kafka-jumbo1006.mgmt is UP: PING WARNING - DUPLICATES FOUND! 
Packet loss = 93%, RTA = 2.39 ms [17:50:53] RECOVERY - Host mc1030.mgmt is UP: PING WARNING - Packet loss = 86%, RTA = 2.82 ms [17:50:53] RECOVERY - Host ps1-a1-eqiad is UP: PING WARNING - Packet loss = 93%, RTA = 3.64 ms [17:50:54] RECOVERY - Host ps1-a5-eqiad is UP: PING WARNING - Packet loss = 80%, RTA = 4.70 ms [17:50:54] RECOVERY - Host sodium.mgmt is UP: PING WARNING - DUPLICATES FOUND! Packet loss = 73%, RTA = 3.03 ms [17:50:55] PROBLEM - Host asw2-b-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [17:50:57] RECOVERY - Host ps1-a6-eqiad is UP: PING WARNING - Packet loss = 86%, RTA = 78.51 ms [17:50:59] RECOVERY - Host ps1-d8-eqiad is UP: PING WARNING - DUPLICATES FOUND! Packet loss = 86%, RTA = 108.34 ms [17:50:59] RECOVERY - Host ps1-b6-eqiad is UP: PING WARNING - Packet loss = 50%, RTA = 3.28 ms [17:51:03] RECOVERY - Host ps1-d5-eqiad is UP: PING OK - Packet loss = 0%, RTA = 2.35 ms [17:51:11] RECOVERY - Host mc1021.mgmt is UP: PING WARNING - Packet loss = 80%, RTA = 0.93 ms [17:51:13] PROBLEM - Host db1077.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:51:14] (03Merged) 10jenkins-bot: etcd: Pass wmfSetupEtcd($etcdHost) from CommonSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558777 (owner: 10Krinkle) [17:51:21] RECOVERY - Host ms-be1018.mgmt is UP: PING WARNING - Packet loss = 73%, RTA = 1.36 ms [17:51:21] PROBLEM - Host logstash1020.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:51:21] PROBLEM - Host logstash1021.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:51:25] PROBLEM - Host lvs1013.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:51:25] PROBLEM - Host lvs1014.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:51:29] RECOVERY - Host ms-be1021.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.20 ms [17:51:29] PROBLEM - Host lvs1016.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:51:33] RECOVERY - Host db1125.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.11 ms [17:51:35] PROBLEM - Host db1096.mgmt 
is DOWN: PING CRITICAL - Packet loss = 100% [17:51:37] RECOVERY - Host ms-be1054.mgmt is UP: PING WARNING - Packet loss = 93%, RTA = 1.80 ms [17:51:37] PROBLEM - Host cloudvirt1023.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:51:37] PROBLEM - Host kubernetes1002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:51:37] PROBLEM - Host ms-be1043.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:51:37] PROBLEM - Host ms-be1047.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:51:41] PROBLEM - Host fasw-c-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [17:51:41] RECOVERY - Host mc1020.mgmt is UP: PING OK - Packet loss = 16%, RTA = 1.54 ms [17:51:45] RECOVERY - Host relforge1002.mgmt is UP: PING WARNING - Packet loss = 37%, RTA = 2.22 ms [17:51:51] PROBLEM - Host kafka-jumbo1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:51:53] PROBLEM - Host ps1-b7-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [17:51:59] RECOVERY - Host analytics1044.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.89 ms [17:51:59] PROBLEM - Host ms-be1057.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:52:03] PROBLEM - Host elastic1046.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:52:25] RECOVERY - Host analytics1034.mgmt is UP: PING WARNING - Packet loss = 44%, RTA = 0.93 ms [17:52:27] RECOVERY - Host mw1348.mgmt is UP: PING WARNING - DUPLICATES FOUND! Packet loss = 0%, RTA = 708.82 ms [17:52:34] 10Operations, 10observability, 10serviceops, 10Patch-For-Review: Stream a subset of mediawiki apache logs to logstash - https://phabricator.wikimedia.org/T244472 (10jijiki) I have uploaded a patch that could possibly work, my issue generally is that I can't find a sane and safe way to test if those logstas... [17:52:35] RECOVERY - Host mc-gp1003.mgmt is UP: PING WARNING - DUPLICATES FOUND! 
Packet loss = 44%, RTA = 3.94 ms [17:52:35] PROBLEM - Host ps1-b5-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [17:52:35] PROBLEM - Host ps1-b8-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [17:52:37] RECOVERY - Host ps1-b2-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.85 ms [17:52:37] PROBLEM - Host ganeti1002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:52:37] PROBLEM - Host ms-be1024.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:52:37] PROBLEM - Host thumbor1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:52:43] RECOVERY - Host wdqs1004.mgmt is UP: PING WARNING - DUPLICATES FOUND! Packet loss = 86%, RTA = 13.72 ms [17:52:45] RECOVERY - Host ps1-b1-eqiad is UP: PING WARNING - Packet loss = 50%, RTA = 1.56 ms [17:52:57] PROBLEM - Host wtp1029.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:52:59] PROBLEM - Host wtp1026.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:53:01] PROBLEM - Host dbproxy1003.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:53:01] PROBLEM - Host es1017.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:53:01] RECOVERY - Host ps1-c3-eqiad is UP: PING WARNING - Packet loss = 80%, RTA = 3.01 ms [17:53:01] (03PS4) 10Bstorm: labstore: introduce a firewall for the old primary NFS cluster [puppet] - 10https://gerrit.wikimedia.org/r/571832 [17:53:03] PROBLEM - Host elastic1062.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:53:03] PROBLEM - Host an-presto1004.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:53:03] PROBLEM - Host an-master1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:53:11] RECOVERY - Host ps1-c2-eqiad is UP: PING WARNING - Packet loss = 93%, RTA = 228.05 ms [17:53:11] RECOVERY - Host ms-be1024.mgmt is UP: PING WARNING - Packet loss = 86%, RTA = 3.58 ms [17:53:15] RECOVERY - Host an-worker1091.mgmt is UP: PING OK - Packet loss = 0%, RTA = 4.78 ms [17:53:15] RECOVERY - Host an-worker1093.mgmt is UP: PING OK - Packet loss = 0%, RTA = 3.93 ms [17:53:15] RECOVERY - 
Host an-worker1094.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.99 ms [17:53:15] RECOVERY - Host analytics1067.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.60 ms [17:53:15] RECOVERY - Host backup1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 2.07 ms [17:53:15] RECOVERY - Host cloudelastic1004.mgmt is UP: PING OK - Packet loss = 0%, RTA = 2.25 ms [17:53:16] RECOVERY - Host db1097.mgmt is UP: PING WARNING - DUPLICATES FOUND! Packet loss = 0%, RTA = 2.05 ms [17:53:16] RECOVERY - Host elastic1052.mgmt is UP: PING OK - Packet loss = 0%, RTA = 2.59 ms [17:53:17] RECOVERY - Host elastic1051.mgmt is UP: PING OK - Packet loss = 0%, RTA = 2.21 ms [17:53:17] RECOVERY - Host helium.mgmt is UP: PING OK - Packet loss = 0%, RTA = 3.45 ms [17:53:18] RECOVERY - Host labweb1002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 5.12 ms [17:53:18] RECOVERY - Host ms-be1039.mgmt is UP: PING OK - Packet loss = 0%, RTA = 329.25 ms [17:53:19] RECOVERY - Host ms-be1043.mgmt is UP: PING OK - Packet loss = 0%, RTA = 3.06 ms [17:53:19] RECOVERY - Host ms-be1055.mgmt is UP: PING OK - Packet loss = 0%, RTA = 2.52 ms [17:53:20] RECOVERY - Host mw1222.mgmt is UP: PING OK - Packet loss = 0%, RTA = 3.78 ms [17:53:20] RECOVERY - Host mw1230.mgmt is UP: PING WARNING - DUPLICATES FOUND! Packet loss = 0%, RTA = 2.79 ms [17:53:21] RECOVERY - Host mw1238.mgmt is UP: PING OK - Packet loss = 0%, RTA = 3.04 ms [17:53:21] RECOVERY - Host mw1246.mgmt is UP: PING OK - Packet loss = 0%, RTA = 2.91 ms [17:53:22] RECOVERY - Host ores1008.mgmt is UP: PING WARNING - DUPLICATES FOUND! 
Packet loss = 0%, RTA = 2.32 ms [17:53:22] RECOVERY - Host ores1007.mgmt is UP: PING OK - Packet loss = 0%, RTA = 2.73 ms [17:53:23] RECOVERY - Host pc1010.mgmt is UP: PING OK - Packet loss = 0%, RTA = 2.77 ms [17:53:28] (03PS2) 10Ottomata: Drop all raw event data and specific PII event data after 90 days [puppet] - 10https://gerrit.wikimedia.org/r/572041 (https://phabricator.wikimedia.org/T245124) [17:56:51] RECOVERY - Host cloudstore1008.mgmt is UP: PING OK - Packet loss = 0%, RTA = 2.71 ms [17:56:51] RECOVERY - Host db1082.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.23 ms [17:56:51] RECOVERY - Host dbproxy1016.mgmt is UP: PING OK - Packet loss = 0%, RTA = 2.27 ms [17:56:52] RECOVERY - Host ganeti1021.mgmt is UP: PING OK - Packet loss = 0%, RTA = 2.93 ms [17:56:52] RECOVERY - Host ganeti1022.mgmt is UP: PING OK - Packet loss = 0%, RTA = 2.08 ms [17:56:53] RECOVERY - Host kafka-main1003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 2.23 ms [17:56:53] RECOVERY - Host aqs1007.mgmt is UP: PING WARNING - DUPLICATES FOUND! Packet loss = 16%, RTA = 64.45 ms [17:56:54] RECOVERY - Host sessionstore1002.mgmt is UP: PING WARNING - DUPLICATES FOUND! 
Packet loss = 0%, RTA = 4.18 ms [17:56:54] RECOVERY - Host wtp1047.mgmt is UP: PING OK - Packet loss = 0%, RTA = 3.43 ms [17:56:55] RECOVERY - Host elastic1057.mgmt is UP: PING WARNING - Packet loss = 37%, RTA = 3.60 ms [17:56:55] RECOVERY - Host an-coord1001.mgmt is UP: PING WARNING - Packet loss = 44%, RTA = 4.98 ms [17:56:56] RECOVERY - Host aqs1005.mgmt is UP: PING WARNING - Packet loss = 50%, RTA = 171.57 ms [17:56:57] RECOVERY - Host cloudvirt1002.mgmt is UP: PING WARNING - Packet loss = 50%, RTA = 5.29 ms [17:56:57] RECOVERY - Host ms-be1034.mgmt is UP: PING WARNING - Packet loss = 44%, RTA = 4.70 ms [17:56:58] RECOVERY - Host dbprov1002.mgmt is UP: PING WARNING - Packet loss = 50%, RTA = 1.81 ms [17:56:58] RECOVERY - Host ores1006.mgmt is UP: PING WARNING - Packet loss = 44%, RTA = 3.43 ms [17:56:59] RECOVERY - Host pc1007.mgmt is UP: PING WARNING - Packet loss = 44%, RTA = 4.66 ms [17:57:04] huh [17:57:43] apergos: see -sre [17:57:43] RECOVERY - Host mw1267.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.09 ms [17:57:43] RECOVERY - Host re0.cr1-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.94 ms [17:57:43] PROBLEM - Host dbproxy1008.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:57:43] RECOVERY - Host ps1-c4-eqiad is UP: PING OK - Packet loss = 0%, RTA = 2.88 ms [17:57:43] RECOVERY - Host stat1006.mgmt is UP: PING WARNING - Packet loss = 44%, RTA = 1.67 ms [17:57:44] RECOVERY - Host ms-be1036.mgmt is UP: PING WARNING - Packet loss = 50%, RTA = 1.43 ms [17:57:45] (03PS3) 10Krinkle: etcd: Use require_once for etcd.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/571799 [17:57:49] PROBLEM - Host db1097.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:57:49] PROBLEM - Host mw1242.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:57:51] RECOVERY - Host wtp1045.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.91 ms [17:57:53] RECOVERY - Host aqs1008.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.94 ms [17:57:57] RECOVERY - Host 
ms-be1038.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.86 ms [17:57:57] RECOVERY - Host mw1301.mgmt is UP: PING OK - Packet loss = 16%, RTA = 2.07 ms [17:58:01] RECOVERY - Host ps1-d6-eqiad is UP: PING OK - Packet loss = 0%, RTA = 4.49 ms [17:58:03] looking [17:58:05] RECOVERY - Host mc1027.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.87 ms [17:58:05] PROBLEM - Host analytics1068.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:58:09] PROBLEM - Host labsdb1012.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:58:13] !log krinkle@deploy1001 Synchronized wmf-config/CommonSettings.php: Iae1f45896 (duration: 01m 08s) [17:58:15] RECOVERY - Host ps1-c3-eqiad is UP: PING OK - Packet loss = 0%, RTA = 2.44 ms [17:58:15] RECOVERY - Host druid1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.69 ms [17:58:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:15] RECOVERY - Host elastic1067.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.86 ms [17:58:15] RECOVERY - Host maps1003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.04 ms [17:58:15] RECOVERY - Host ms-be1019.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.09 ms [17:58:17] RECOVERY - Host snapshot1005.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.39 ms [17:58:17] RECOVERY - Host mw1314.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.92 ms [17:58:17] RECOVERY - Host mw1317.mgmt is UP: PING OK - Packet loss = 16%, RTA = 1.09 ms [17:58:19] RECOVERY - Host ps1-c7-eqiad is UP: PING OK - Packet loss = 0%, RTA = 2.18 ms [17:58:25] RECOVERY - Host mw1294.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.99 ms [17:58:27] PROBLEM - Host puppetmaster1002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:58:27] RECOVERY - Host ps1-b2-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.83 ms [17:58:27] RECOVERY - Host mc1022.mgmt is UP: PING WARNING - Packet loss = 93%, RTA = 1.01 ms [17:58:29] RECOVERY - Host analytics1035.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.20 ms 
[17:58:29] RECOVERY - Host elastic1039.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.45 ms [17:58:29] RECOVERY - Host elastic1047.mgmt is UP: PING OK - Packet loss = 0%, RTA = 2.03 ms [17:58:29] RECOVERY - Host ms-be1023.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.54 ms [17:58:29] RECOVERY - Host mc1021.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.51 ms [17:58:30] RECOVERY - Host labstore1006.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.55 ms [17:58:30] (03PS4) 10Bstorm: cloudstore: remove dependency on bind mounts [puppet] - 10https://gerrit.wikimedia.org/r/571821 (https://phabricator.wikimedia.org/T224582) [17:58:30] RECOVERY - Host maps1004.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.59 ms [17:58:31] RECOVERY - Host mw1251.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.55 ms [17:58:32] RECOVERY - Host mc1032.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.06 ms [17:58:32] RECOVERY - Host druid1005.mgmt is UP: PING WARNING - Packet loss = 28%, RTA = 1.41 ms [17:58:37] RECOVERY - Host db1121.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.46 ms [17:58:37] PROBLEM - Host mw1300.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:58:47] RECOVERY - Host cp1086.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.06 ms [17:58:53] PROBLEM - Host ores1002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:58:53] RECOVERY - Host cp1090.mgmt is UP: PING WARNING - Packet loss = 58%, RTA = 1.19 ms [17:58:55] (03CR) 10Bstorm: [C: 04-1] cloudstore: remove dependency on bind mounts [puppet] - 10https://gerrit.wikimedia.org/r/571821 (https://phabricator.wikimedia.org/T224582) (owner: 10Bstorm) [17:58:57] PROBLEM - Host cp1078.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:58:59] PROBLEM - Host mw1227.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:59:01] RECOVERY - Host elastic1040.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.80 ms [17:59:05] PROBLEM - Host ps1-b5-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [17:59:07] RECOVERY - Host 
ms-be1050.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.10 ms [17:59:12] (03PS1) 10C. Scott Ananian: WIP: load Parsoid from the vendor repo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572051 [17:59:13] RECOVERY - Host dbproxy1003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 2.70 ms [17:59:15] RECOVERY - Host analytics1039.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.87 ms [17:59:17] RECOVERY - Host mw1308.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.96 ms [17:59:19] RECOVERY - Host db1077.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.91 ms [17:59:19] RECOVERY - Host labsdb1010.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.63 ms [17:59:19] RECOVERY - Host mc1028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.41 ms [17:59:19] RECOVERY - Host elastic1033.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.43 ms [17:59:25] PROBLEM - Host analytics1029.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:59:25] PROBLEM - Host analytics1076.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:59:25] PROBLEM - Host cp1082.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:59:25] PROBLEM - Host db1115.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:59:25] PROBLEM - Host elastic1060.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:59:25] PROBLEM - Host es1011.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:59:26] PROBLEM - Host logstash1012.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:59:27] PROBLEM - Host mw1266.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:59:27] PROBLEM - Host mw1271.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:59:28] PROBLEM - Host ps1-a3-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [17:59:28] PROBLEM - Host mw1304.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:59:29] PROBLEM - Host ps1-a4-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [17:59:29] PROBLEM - Host restbase1017.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:59:29] PROBLEM - Host restbase1022.mgmt is DOWN: PING 
CRITICAL - Packet loss = 100% [17:59:30] PROBLEM - Host stat1005.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:59:30] PROBLEM - Host wtp1038.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:59:31] PROBLEM - Host wtp1046.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:59:37] RECOVERY - Host mw1272.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.81 ms [17:59:39] RECOVERY - Host cloudvirt1009.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.97 ms [17:59:39] RECOVERY - Host mc1036.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.31 ms [17:59:39] RECOVERY - Host an-presto1002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.07 ms [17:59:42] !log downtimed mgmt in eqiad for 1h [17:59:44] (03PS1) 10Jbond: admin: add fr-tech-admins and allow dns-admin and gitpuppt privileges [puppet] - 10https://gerrit.wikimedia.org/r/572052 [17:59:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:45] PROBLEM - Host ps1-c2-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [17:59:47] RECOVERY - Host ps1-c1-eqiad is UP: PING OK - Packet loss = 0%, RTA = 2.43 ms [17:59:49] RECOVERY - Host ps1-d1-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.45 ms [17:59:51] RECOVERY - Host db1107.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.85 ms [17:59:51] RECOVERY - Host analytics1067.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.81 ms [17:59:53] RECOVERY - Host ps1-c6-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.97 ms [17:59:53] RECOVERY - Host asw2-c-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.20 ms [17:59:54] (03CR) 10Krinkle: [C: 03+2] etcd: Use require_once for etcd.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/571799 (owner: 10Krinkle) [17:59:55] RECOVERY - Host mw1263.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.81 ms [17:59:55] RECOVERY - Host ps1-a4-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.58 ms [17:59:57] RECOVERY - Host scb1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.88 ms [18:00:04] cscott, arlolra, 
subbu, halfak, and accraze: Time to snap out of that daydream and deploy Services – Graphoid / Parsoid / Citoid / ORES. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200213T1800). [18:00:05] RECOVERY - Host ps1-b3-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.25 ms [18:00:09] RECOVERY - Host analytics1050.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.08 ms [18:00:12] RECOVERY - Host mw1231.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.87 ms [18:00:15] RECOVERY - Host relforge1002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.81 ms [18:00:21] RECOVERY - Host scb1002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.88 ms [18:00:21] RECOVERY - Host ps1-c8-eqiad is UP: PING OK - Packet loss = 0%, RTA = 2.58 ms [18:00:21] RECOVERY - Host dns1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.19 ms [18:00:21] RECOVERY - Host wtp1035.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.22 ms [18:00:21] RECOVERY - Host wtp1030.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.95 ms [18:00:22] RECOVERY - Host aqs1004.mgmt is UP: PING OK - Packet loss = 0%, RTA = 2.19 ms [18:00:23] RECOVERY - Host eventlog1002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.90 ms [18:00:23] RECOVERY - Host oresrdb1002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.08 ms [18:00:23] RECOVERY - Host ganeti1002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.93 ms [18:00:25] RECOVERY - Host ganeti1018.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.08 ms [18:00:25] RECOVERY - Host analytics1044.mgmt is UP: PING OK - Packet loss = 0%, RTA = 2.33 ms [18:00:25] RECOVERY - Host druid1003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 2.46 ms [18:00:25] RECOVERY - Host druid1002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 2.51 ms [18:00:25] RECOVERY - Host druid1004.mgmt is UP: PING OK - Packet loss = 0%, RTA = 2.49 ms [18:00:26] RECOVERY - Host druid1006.mgmt is UP: PING OK - Packet loss = 0%, RTA = 2.50 ms [18:00:27] RECOVERY - Host aqs1006.mgmt is UP: PING 
OK - Packet loss = 0%, RTA = 2.51 ms [18:00:27] RECOVERY - Host elastic1035.mgmt is UP: PING OK - Packet loss = 0%, RTA = 2.13 ms [18:00:28] RECOVERY - Host cloudvirt1025.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.12 ms [18:00:29] RECOVERY - Host cloudvirt1024.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.07 ms [18:00:30] RECOVERY - Host mw1278.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.10 ms [18:00:30] RECOVERY - Host db1094.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.25 ms [18:00:31] RECOVERY - Host db1093.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.44 ms [18:00:32] RECOVERY - Host cloudvirt1022.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.37 ms [18:00:32] RECOVERY - Host db1140.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.95 ms [18:00:33] RECOVERY - Host analytics1051.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.79 ms [18:00:33] RECOVERY - Host ms-be1020.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.24 ms [18:00:33] RECOVERY - Host ms-be1026.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.21 ms [18:00:34] RECOVERY - Host ms-be1037.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.07 ms [18:00:34] RECOVERY - Host wtp1040.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.03 ms [18:00:35] RECOVERY - Host ores1009.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.88 ms [18:00:35] (03CR) 10C. Scott Ananian: "James: I could use some advice on best practices here." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572051 (owner: 10C. 
Scott Ananian) [18:00:36] RECOVERY - Host cloudvirt1007.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.84 ms [18:00:37] RECOVERY - Host ps1-b8-eqiad is UP: PING OK - Packet loss = 0%, RTA = 2.02 ms [18:00:37] RECOVERY - Host elastic1038.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.39 ms [18:00:43] RECOVERY - Host ps1-d4-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.73 ms [18:00:43] RECOVERY - Host wtp1044.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.02 ms [18:00:46] (03Merged) 10jenkins-bot: etcd: Use require_once for etcd.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/571799 (owner: 10Krinkle) [18:00:51] RECOVERY - Host an-conf1003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.05 ms [18:00:51] RECOVERY - Host wtp1042.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.37 ms [18:00:51] RECOVERY - Host mw1283.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.31 ms [18:00:53] RECOVERY - Host labstore1007.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.92 ms [18:00:57] !log krinkle@deploy1001 Synchronized wmf-config/etcd.php: Iae1f45896 (duration: 01m 06s) [18:00:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:01] RECOVERY - Host ps1-b5-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.56 ms [18:01:03] RECOVERY - Host analytics1036.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.82 ms [18:01:05] RECOVERY - Host mw1243.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.95 ms [18:01:07] RECOVERY - Host ps1-a3-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.44 ms [18:01:07] PROBLEM - Host ps1-a2-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [18:01:11] RECOVERY - Host cloudnet1004.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.72 ms [18:01:11] RECOVERY - Host kafka-main1004.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.11 ms [18:01:11] RECOVERY - Host lvs1015.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.89 ms [18:01:13] RECOVERY - Host kafka-main1005.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.02 ms 
[18:01:15] RECOVERY - Host db1132.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.06 ms [18:01:15] RECOVERY - Host db1098.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.85 ms [18:01:18] (03CR) 10jerkins-bot: [V: 04-1] admin: add fr-tech-admins and allow dns-admin and gitpuppt privileges [puppet] - 10https://gerrit.wikimedia.org/r/572052 (owner: 10Jbond) [18:01:23] RECOVERY - Host mw1245.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.86 ms [18:01:23] RECOVERY - Host mc1033.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.12 ms [18:01:23] RECOVERY - Host mw1256.mgmt is UP: PING OK - Packet loss = 0%, RTA = 3.61 ms [18:01:25] RECOVERY - Host cloudelastic1003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.85 ms [18:01:25] RECOVERY - Host mw1332.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.03 ms [18:01:29] RECOVERY - Host labstore1005.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.83 ms [18:01:31] RECOVERY - Host ps1-a8-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.28 ms [18:01:31] RECOVERY - Host ps1-b4-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.27 ms [18:01:31] RECOVERY - Host ps1-a5-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.69 ms [18:01:31] RECOVERY - Host ps1-a2-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.69 ms [18:01:33] RECOVERY - Host ps1-a6-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.22 ms [18:01:33] RECOVERY - Host ps1-b7-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.58 ms [18:01:35] RECOVERY - Host mw1255.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.85 ms [18:01:35] RECOVERY - Host mw1327.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.84 ms [18:01:35] RECOVERY - Host dbstore1005.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.06 ms [18:01:35] RECOVERY - Host ms-be1016.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.88 ms [18:01:35] RECOVERY - Host mw1320.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.90 ms [18:01:37] RECOVERY - Host db1084.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.78 ms [18:01:39] RECOVERY - Host 
ps1-a1-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.46 ms [18:01:39] RECOVERY - Host mw1306.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.85 ms [18:01:41] RECOVERY - Host cp1085.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.08 ms [18:01:43] RECOVERY - Host ps1-b6-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.77 ms [18:01:45] RECOVERY - Host mw1335.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.11 ms [18:01:45] RECOVERY - Host mw1347.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.43 ms [18:01:45] RECOVERY - Host mw1264.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.81 ms [18:01:45] RECOVERY - Host db1081.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.03 ms [18:01:57] RECOVERY - Host wdqs1004.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.88 ms [18:01:57] RECOVERY - Host mc1025.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.74 ms [18:01:59] RECOVERY - Host ps1-c2-eqiad is UP: PING OK - Packet loss = 0%, RTA = 2.10 ms [18:02:04] (03PS2) 10Jbond: admin: add fr-tech-admins and allow dns-admin and gitpuppt privileges [puppet] - 10https://gerrit.wikimedia.org/r/572052 [18:02:13] RECOVERY - Host scb1003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.88 ms [18:02:17] RECOVERY - Host db1123.mgmt is UP: PING OK - Packet loss = 0%, RTA = 2.02 ms [18:02:17] RECOVERY - Host mw1240.mgmt is UP: PING OK - Packet loss = 0%, RTA = 2.03 ms [18:02:17] RECOVERY - Host db1102.mgmt is UP: PING OK - Packet loss = 0%, RTA = 2.01 ms [18:02:17] RECOVERY - Host mw1239.mgmt is UP: PING OK - Packet loss = 0%, RTA = 3.61 ms [18:02:17] RECOVERY - Host mc1030.mgmt is UP: PING OK - Packet loss = 0%, RTA = 2.11 ms [18:02:18] RECOVERY - Host ms-be1021.mgmt is UP: PING OK - Packet loss = 0%, RTA = 2.16 ms [18:02:18] RECOVERY - Host elastic1043.mgmt is UP: PING OK - Packet loss = 0%, RTA = 2.15 ms [18:02:19] RECOVERY - Host elastic1050.mgmt is UP: PING OK - Packet loss = 0%, RTA = 2.40 ms [18:02:19] RECOVERY - Host ms-fe1008.mgmt is UP: PING OK - Packet loss = 0%, RTA = 2.83 ms 
[18:02:19] RECOVERY - Host mw1225.mgmt is UP: PING OK - Packet loss = 0%, RTA = 2.72 ms [18:02:20] RECOVERY - Host ms-be1028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 2.62 ms [18:02:21] RECOVERY - Host ms-be1033.mgmt is UP: PING OK - Packet loss = 0%, RTA = 2.71 ms [18:02:21] RECOVERY - Host ms-be1054.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.90 ms [18:02:22] RECOVERY - Host pc1009.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.17 ms [18:02:22] RECOVERY - Host netmon1002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.90 ms [18:02:23] RECOVERY - Host mw1331.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.89 ms [18:02:23] RECOVERY - Host mw1325.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.89 ms [18:02:24] RECOVERY - Host mw1268.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.38 ms [18:02:24] RECOVERY - Host wdqs1007.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.40 ms [18:02:24] RECOVERY - Host ps1-a7-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.49 ms [18:02:25] RECOVERY - Host logstash1020.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.93 ms [18:02:29] RECOVERY - Host mw1275.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.84 ms [18:02:33] RECOVERY - Host an-master1002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.21 ms [18:02:33] RECOVERY - Host analytics1059.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.33 ms [18:02:33] RECOVERY - Host cloudcephmon1003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.05 ms [18:02:33] RECOVERY - Host cloudcephmon1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 2.59 ms [18:02:34] RECOVERY - Host analytics1061.mgmt is UP: PING OK - Packet loss = 0%, RTA = 2.80 ms [18:02:35] RECOVERY - Host cloudmetrics1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.40 ms [18:02:36] RECOVERY - Host cescout1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.05 ms [18:02:36] RECOVERY - Host an-worker1092.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.99 ms [18:02:37] RECOVERY - Host cloudvirt1026.mgmt is UP: PING OK - 
Packet loss = 0%, RTA = 1.60 ms [18:02:38] RECOVERY - Host cloudvirt1014.mgmt is UP: PING OK - Packet loss = 0%, RTA = 2.10 ms [18:02:38] RECOVERY - Host cloudvirt1018.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.87 ms [18:02:39] RECOVERY - Host thorium.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.84 ms [18:02:40] RECOVERY - Host mw1307.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.66 ms [18:02:41] RECOVERY - Host mw1293.mgmt is UP: PING OK - Packet loss = 0%, RTA = 2.46 ms [18:02:41] RECOVERY - Host mw1340.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.76 ms [18:02:41] RECOVERY - Host mw1241.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.84 ms [18:02:43] RECOVERY - Host conf1006.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.18 ms [18:02:51] RECOVERY - Host oresrdb1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.06 ms [18:02:51] RECOVERY - Host restbase1026.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.05 ms [18:02:51] RECOVERY - Host db1089.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.13 ms [18:02:53] RECOVERY - Host db1112.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.25 ms [18:02:53] RECOVERY - Host db1103.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.29 ms [18:02:53] RECOVERY - Host sessionstore1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.44 ms [18:02:53] RECOVERY - Host ores1005.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.12 ms [18:02:53] RECOVERY - Host db1139.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.12 ms [18:02:53] RECOVERY - Host db1119.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.15 ms [18:02:54] RECOVERY - Host db1118.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.93 ms [18:02:55] RECOVERY - Host analytics1055.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.79 ms [18:02:55] RECOVERY - Host scb1004.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.96 ms [18:02:55] RECOVERY - Host maps1002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.93 ms [18:02:57] RECOVERY - Host thumbor1004.mgmt is UP: PING OK - Packet 
loss = 0%, RTA = 0.97 ms [18:03:01] RECOVERY - Host db1135.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.03 ms [18:03:01] RECOVERY - Host labweb1002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.96 ms [18:03:03] RECOVERY - Host ganeti1015.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.02 ms [18:03:03] RECOVERY - Host analytics1033.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.85 ms [18:03:13] RECOVERY - Host mw1345.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.83 ms [18:03:17] RECOVERY - Host an-worker1082.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.00 ms [18:03:17] RECOVERY - Host an-worker1090.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.26 ms [18:03:17] RECOVERY - Host an-worker1078.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.90 ms [18:03:21] RECOVERY - Host wtp1046.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.85 ms [18:03:25] RECOVERY - Host elastic1046.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.84 ms [18:03:25] RECOVERY - Host db1106.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.93 ms [18:03:25] RECOVERY - Host ms-be1035.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.82 ms [18:03:25] RECOVERY - Host restbase1024.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.06 ms [18:03:25] RECOVERY - Host mw1304.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.79 ms [18:03:27] RECOVERY - Host mc1035.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.80 ms [18:03:29] RECOVERY - Host snapshot1008.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.13 ms [18:03:29] RECOVERY - Host an-conf1002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.11 ms [18:03:29] RECOVERY - Host an-worker1091.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.94 ms [18:03:29] RECOVERY - Host authdns1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.16 ms [18:03:29] RECOVERY - Host restbase-dev1004.mgmt is UP: PING OK - Packet loss = 0%, RTA = 2.29 ms [18:03:29] RECOVERY - Host analytics1071.mgmt is UP: PING OK - Packet loss = 0%, RTA = 2.78 ms [18:03:30] RECOVERY - Host 
analytics1065.mgmt is UP: PING OK - Packet loss = 0%, RTA = 2.65 ms [18:03:31] RECOVERY - Host relforge1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 2.65 ms [18:03:31] RECOVERY - Host analytics1041.mgmt is UP: PING OK - Packet loss = 0%, RTA = 2.70 ms [18:03:37] (03CR) 10jerkins-bot: [V: 04-1] admin: add fr-tech-admins and allow dns-admin and gitpuppt privileges [puppet] - 10https://gerrit.wikimedia.org/r/572052 (owner: 10Jbond) [18:03:43] Krinkle: Does https://gerrit.wikimedia.org/r/c/mediawiki/core/+/572022 look good to you? [18:04:18] (03CR) 10Bstorm: [C: 04-1] cloudstore: remove dependency on bind mounts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/571821 (https://phabricator.wikimedia.org/T224582) (owner: 10Bstorm) [18:06:08] (03PS2) 10C. Scott Ananian: Load Parsoid from the vendor repo, not from an ad-hoc deploy dir [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572051 [18:06:27] (03CR) 10jerkins-bot: [V: 04-1] Load Parsoid from the vendor repo, not from an ad-hoc deploy dir [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572051 (owner: 10C. Scott Ananian) [18:09:09] 10Operations, 10observability, 10serviceops, 10Patch-For-Review: Stream a subset of mediawiki apache logs to logstash - https://phabricator.wikimedia.org/T244472 (10herron) Hey @jijiki, usually to test/validate filters like this I'll cherry pick or live-hack the logstash config on the beta cluster and gene... [18:09:51] !log krinkle@deploy1001 Synchronized wmf-config/CommonSettings.php: I9d0c8af3c577 (duration: 01m 06s) [18:09:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:06] Good evening, can someone to check what's happening with https://integration.wikimedia.org/ci/job/beta-scap-eqiad [18:11:08] ? [18:11:09] James_F: I think so [18:11:22] documented my CR in the commitmsg [18:11:47] any other callers or use of mParams/$params there in the class? 
[18:12:09] (03PS5) 10Bstorm: cloudstore: remove dependency on bind mounts [puppet] - 10https://gerrit.wikimedia.org/r/571821 (https://phabricator.wikimedia.org/T224582) [18:13:17] (03CR) 10Bstorm: [C: 04-1] cloudstore: remove dependency on bind mounts [puppet] - 10https://gerrit.wikimedia.org/r/571821 (https://phabricator.wikimedia.org/T224582) (owner: 10Bstorm) [18:13:38] * Krinkle unlocks deploy handle [18:16:25] (03PS5) 10Bstorm: labstore: introduce a firewall for the old primary NFS cluster [puppet] - 10https://gerrit.wikimedia.org/r/571832 [18:16:35] (03PS3) 10Jbond: admin: add fr-tech-admins and allow dns-admin and gitpuppt privileges [puppet] - 10https://gerrit.wikimedia.org/r/572052 [18:16:57] Krinkle: would you be willing to do the honours at https://gerrit.wikimedia.org/r/#/c/mediawiki/tools/phan/+/571724/ ? So I can make a release and unblock Aaron [18:17:40] (03CR) 10Bstorm: [C: 04-1] "I'd almost forgot to change the mount paths in nfsclient.pp! Just caught that." [puppet] - 10https://gerrit.wikimedia.org/r/571821 (https://phabricator.wikimedia.org/T224582) (owner: 10Bstorm) [18:17:52] Daimona: just +2 or docker build as well? [18:18:00] Just +2, thanks [18:18:10] Now I'm going to make a release and try upgrading core [18:18:38] (03PS6) 10Effie Mouzeli: WIP mediawiki: send apache logs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/571239 (https://phabricator.wikimedia.org/T244472) [18:18:40] (03PS1) 10Effie Mouzeli: WIP hieradata: test streaming apache logs to logstash from mwdebug1001 [puppet] - 10https://gerrit.wikimedia.org/r/572057 (https://phabricator.wikimedia.org/T244472) [18:20:26] (03PS4) 10Jbond: admin: add fr-tech-admins and allow dns-admin and gitpuppt privileges [puppet] - 10https://gerrit.wikimedia.org/r/572052 [18:20:32] https://integration.wikimedia.org/ci/job/beta-scap-eqiad/?auto_refresh=true still no works jobs... 
[18:21:04] 10Operations, 10ops-codfw: Alert for device ps1-a8-codfw.mgmt.codfw.wmnet - Device rebooted - https://phabricator.wikimedia.org/T245164 (10RobH) So this came back after my firmware update, and I logged in, but then I logged out after looking that firmware updated. Then Arzhel pointed out it wasn't showing on... [18:27:18] (03PS1) 10EBernhardson: airflow: Turn off smtp starttls [puppet] - 10https://gerrit.wikimedia.org/r/572059 [18:28:08] (03PS2) 10EBernhardson: airflow: Turn off smtp starttls [puppet] - 10https://gerrit.wikimedia.org/r/572059 [18:29:34] 10Operations: Opcache corruption: Object and method magically missing - https://phabricator.wikimedia.org/T245183 (10Krinkle) [18:30:57] 10Operations, 10serviceops: Opcache corruption: Object and method magically missing - https://phabricator.wikimedia.org/T245183 (10jijiki) p:05Triage→03Medium [18:31:22] potential recovery spam from icinga-wm soon [18:32:53] (03PS3) 10Ottomata: Drop all raw event data and specific PII event data after 90 days [puppet] - 10https://gerrit.wikimedia.org/r/572041 (https://phabricator.wikimedia.org/T245124) [18:33:43] actually we avoided that too :) [18:34:33] (03CR) 10Ottomata: "https://puppet-compiler.wmflabs.org/compiler1002/20797/an-coord1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/572041 (https://phabricator.wikimedia.org/T245124) (owner: 10Ottomata) [18:35:36] (03PS4) 10Ottomata: Drop all raw event data and specific PII event data after 90 days [puppet] - 10https://gerrit.wikimedia.org/r/572041 (https://phabricator.wikimedia.org/T245124) [18:36:26] (03PS5) 10Ottomata: Drop all raw event data and specific PII event data after 90 days [puppet] - 10https://gerrit.wikimedia.org/r/572041 (https://phabricator.wikimedia.org/T245124) [18:36:48] !log ns1.wikimedia.org - re-routing from authdns2001 to dns2002 on cr[12]-codfw - T242017 [18:36:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:46] !log authdns2001 - reboot - 
T242017 [18:38:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:13] 10Operations, 10serviceops, 10Wikimedia-production-error: PHP7 corruption: Method call proxied to unrelated object (also: Call to undefined method) - https://phabricator.wikimedia.org/T245183 (10Krinkle) [18:39:28] 10Operations, 10serviceops, 10Wikimedia-production-error: PHP7 corruption: Method call executed on unrelated object (also: Call to undefined method) - https://phabricator.wikimedia.org/T245183 (10Krinkle) [18:45:01] Prod clear? I'm going to deploy a fix for the logspam from T245182 [18:45:02] T245182: Use of $wgLogoHD was deprecated in Rename configuration variable to $wgLogos - https://phabricator.wikimedia.org/T245182 [18:46:09] James_F: if the question is about the previous icinga storm: yes all clear (and was anyway not affecting production) [18:46:30] else: I don't know :-) [18:47:03] (03PS1) 10Cwhite: hiera: fix typo in ores statsd exporter mapping [puppet] - 10https://gerrit.wikimedia.org/r/572065 (https://phabricator.wikimedia.org/T233448) [18:48:14] !log ns1.wikimedia.org - re-routing back to authdns2001 instead of dns2002 on cr[12]-codfw - T242017 [18:48:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:25] (03CR) 10Cwhite: [C: 03+2] hiera: fix typo in ores statsd exporter mapping [puppet] - 10https://gerrit.wikimedia.org/r/572065 (https://phabricator.wikimedia.org/T233448) (owner: 10Cwhite) [18:48:29] volans: Mostly just checking people aren't about to scap something. 
:-) [18:49:48] (03PS4) 10Jforrester: Merge $wgLogo into $wgLogos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570379 (https://phabricator.wikimedia.org/T232140) [18:53:05] (03PS1) 10Cmjohnson: Adding mac/dhcpd file entry for mw1379[80-81] [puppet] - 10https://gerrit.wikimedia.org/r/572066 (https://phabricator.wikimedia.org/T236437) [18:54:38] (03CR) 10Cmjohnson: [C: 03+2] Adding mac/dhcpd file entry for mw1379[80-81] [puppet] - 10https://gerrit.wikimedia.org/r/572066 (https://phabricator.wikimedia.org/T236437) (owner: 10Cmjohnson) [18:55:59] 10Operations, 10serviceops, 10Wikimedia-production-error: PHP7 corruption: Method call executed on unrelated object (also: Call to undefined method) - https://phabricator.wikimedia.org/T245183 (10Krinkle) [18:56:41] cscott: So… composer is quite insistent that you can't possibly want parsoid 0.12.0-a1 because that's alpha and MW is real code? Fun. [18:58:25] really? i didn't have that problem. [18:58:50] i did push a bad tag at one point, you might want to `composer cacheclear` to ensure that's not what's going on [18:59:02] 10Operations, 10ops-eqiad: Audit msw1-eqiad cables - https://phabricator.wikimedia.org/T245188 (10ayounsi) p:05Triage→03Low [18:59:05] but we can use a suffix to the version spec to say that alphas are ok [19:00:04] RoanKattouw, Niharika, and Urbanecm: Your horoscope predicts another unfortunate Morning SWAT(Max 6 patches) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200213T1900). [19:00:05] maryum: A patch you scheduled for Morning SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [19:00:17] okay great! [19:01:24] maryum: want to deploy yourself? [19:01:32] I don't have +2 privileges [19:01:42] but you do have deployment, right? [19:02:02] anyway, I can SWAT today! [19:02:04] I don't have taht [19:02:14] James will do it! 
[19:02:22] Urbanecm: Staff deploy. I'm dealing. [19:02:27] okay :) [19:02:54] (03PS7) 10Jforrester: Add new MLR models [mediawiki-config] - 10https://gerrit.wikimedia.org/r/559614 (https://phabricator.wikimedia.org/T219534) (owner: 10Mstyles) [19:03:07] Urbanecm: You focus on supporting community members. :-) [19:03:32] yes sir! [19:04:10] (03CR) 10Jforrester: [C: 03+2] Add new MLR models [mediawiki-config] - 10https://gerrit.wikimedia.org/r/559614 (https://phabricator.wikimedia.org/T219534) (owner: 10Mstyles) [19:05:14] (03Merged) 10jenkins-bot: Add new MLR models [mediawiki-config] - 10https://gerrit.wikimedia.org/r/559614 (https://phabricator.wikimedia.org/T219534) (owner: 10Mstyles) [19:05:34] 10Operations, 10ops-eqiad: Audit msw1-eqiad cables - https://phabricator.wikimedia.org/T245188 (10RobH) Please note this appears to also be an ideal time to do T225121 perhaps? Rather than updating netbox for the old msw? [19:05:56] 10Operations, 10serviceops, 10Wikimedia-production-error: PHP7 corruption: Method call executed on unrelated object (also: Call to undefined method) - https://phabricator.wikimedia.org/T245183 (10Krinkle) Here is another mysterious mis-call which I can't explain: [Logstash single document](https://logstash.... 
[19:06:27] 10Operations, 10netops, 10Wikimedia-Incident: Investigate Juniper storm control - https://phabricator.wikimedia.org/T245192 (10ayounsi) p:05Triage→03Medium [19:06:37] (03PS2) 10Muehlenhoff: Remove two obsolete partman recipes [puppet] - 10https://gerrit.wikimedia.org/r/572013 (https://phabricator.wikimedia.org/T156955) [19:07:47] (03CR) 10Gehel: [C: 03+2] airflow: Turn off smtp starttls [puppet] - 10https://gerrit.wikimedia.org/r/572059 (owner: 10EBernhardson) [19:07:53] (03CR) 10Muehlenhoff: [C: 03+2] Remove two obsolete partman recipes [puppet] - 10https://gerrit.wikimedia.org/r/572013 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff) [19:08:00] ebernhardson: ^^ [19:08:23] gehel: shall I puppet-merge your airflow patch along? [19:08:30] moritzm: if you're puppet merging, you can merge mine as well [19:08:36] moritzm: yes, exactly :) [19:09:24] thanks! [19:09:32] done [19:10:12] !log installing e2fsprogs security updates [19:10:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:03] 10Operations, 10observability, 10serviceops, 10Patch-For-Review: Stream a subset of mediawiki apache logs to logstash - https://phabricator.wikimedia.org/T244472 (10jijiki) @herron I fiddled a bit on beta, it appears that for some reason, nothing is being streamed there since today, I am not sure if I brok... [19:17:02] 10Operations, 10serviceops: No mw canary servers in codfw - https://phabricator.wikimedia.org/T242606 (10jijiki) 05Open→03Resolved a:03jijiki thank you daniel! 
[19:17:28] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T219534 Add new MLR models for Cirrus on zh/ja/kowiki (duration: 01m 03s) [19:17:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:34] T219534: Test MLR models for zhwiki, jawiki and kowiki - https://phabricator.wikimedia.org/T219534 [19:20:19] !log jforrester@deploy1001 Synchronized php-1.35.0-wmf.19/includes/resourceloader/ResourceLoaderSkinModule.php: T245182 ResourceLoaderSkinModule: Don't hard-deprecate wgLogoHD just now (duration: 01m 03s) [19:20:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:23] T245182: Use of $wgLogoHD was deprecated in Rename configuration variable to $wgLogos - https://phabricator.wikimedia.org/T245182 [19:36:10] (03PS1) 10Andrew Bogott: nova: depool cloudvirt1022 and cloudvirt1014 [puppet] - 10https://gerrit.wikimedia.org/r/572072 (https://phabricator.wikimedia.org/T241494) [19:37:54] (03CR) 10Andrew Bogott: [C: 03+2] nova: depool cloudvirt1022 and cloudvirt1014 [puppet] - 10https://gerrit.wikimedia.org/r/572072 (https://phabricator.wikimedia.org/T241494) (owner: 10Andrew Bogott) [19:44:24] (03Abandoned) 10Jforrester: Restore default $wgFlaggedRevsStatsAge (2 hours) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354608 (https://phabricator.wikimedia.org/T163107) (owner: 10Nemo bis) [19:44:52] !log jforrester@deploy1001 Synchronized php-1.35.0-wmf.19/includes/api/ApiRollback.php: T245159 ApiRollback: Properly deal with UserIdentity (duration: 01m 04s) [19:44:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:44:56] T245159: PHP Warning: htmlspecialchars() expects string (from WikiPage.php) - https://phabricator.wikimedia.org/T245159 [19:45:55] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): cloudvirt1022 memory errors causing host to crash - https://phabricator.wikimedia.org/T243536 (10Andrew) This 
server is now drained and ready for whatever. [19:47:18] 10Operations, 10ops-eqiad, 10serviceops: (Need By Dec 20) rack/setup/install mw13[49-84].eqiad.wmnet - https://phabricator.wikimedia.org/T236437 (10Cmjohnson) [19:48:32] 10Operations, 10serviceops: (Need By Dec 20) rack/setup/install mw13[49-84].eqiad.wmnet - https://phabricator.wikimedia.org/T236437 (10Cmjohnson) a:05Cmjohnson→03Joe @Joe These are ready for you now, initial puppet run has been completed. I am removing from ops-eqiad tag. If you need any additional on-sit... [19:54:18] (03PS1) 10Jbond: add missing file [labs/private] - 10https://gerrit.wikimedia.org/r/572074 [19:55:10] (03CR) 10Jbond: [V: 03+2 C: 03+2] add missing file [labs/private] - 10https://gerrit.wikimedia.org/r/572074 (owner: 10Jbond) [19:58:20] (03CR) 10Nuria: "Thanks for doing this, I also corrected docs in wikitech" [puppet] - 10https://gerrit.wikimedia.org/r/571868 (owner: 10Elukey) [19:58:44] James_F, Krinkle: checking in about the last outstanding train blocker, https://phabricator.wikimedia.org/T245062 [19:59:11] looks like there's a patch and test. but failed simply due to the linter (missing space before paren) [19:59:25] 1) is this still considered a ubn and/or blocker? [19:59:48] 2) is there someone around who feels comfortable reviewing/merging the patches? [20:00:04] marxarelli and James_F: It is that lovely time of the day again! You are hereby commanded to deploy Mediawiki train - American Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200213T2000). [20:00:28] (03PS6) 10Ottomata: Drop all raw event data and specific PII event data after 90 days [puppet] - 10https://gerrit.wikimedia.org/r/572041 (https://phabricator.wikimedia.org/T245124) [20:03:37] My view is No, and No. 
:-( [20:03:47] (03CR) 10Ottomata: "https://puppet-compiler.wmflabs.org/compiler1002/20799/an-coord1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/572041 (https://phabricator.wikimedia.org/T245124) (owner: 10Ottomata) [20:10:56] alright. i'm making the call to roll to group2 [20:11:11] * marxarelli is only partially filled with dread [20:15:22] (03PS1) 10Dduvall: all wikis to 1.35.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572079 [20:15:24] (03CR) 10Dduvall: [C: 03+2] all wikis to 1.35.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572079 (owner: 10Dduvall) [20:15:26] (03PS1) 10Jforrester: [BETA CLUSTER] Enable DiscussionTools on Beta ar and nl Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572080 (https://phabricator.wikimedia.org/T245165) [20:16:19] (03Merged) 10jenkins-bot: all wikis to 1.35.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572079 (owner: 10Dduvall) [20:19:09] !log dduvall@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.35.0-wmf.19 [20:19:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:36] !log restarting blazegraph + updater on wdqs2006 [20:20:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:21:36] (03PS1) 10Cwhite: hiera: add uwsgi.core.busy_workers to ores statsd exporter mappings [puppet] - 10https://gerrit.wikimedia.org/r/572082 (https://phabricator.wikimedia.org/T233448) [20:22:29] (03CR) 10Mforns: [C: 03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/572041 (https://phabricator.wikimedia.org/T245124) (owner: 10Ottomata) [20:22:46] (03CR) 10Ottomata: [C: 03+2] Drop all raw event data and specific PII event data after 90 days [puppet] - 10https://gerrit.wikimedia.org/r/572041 (https://phabricator.wikimedia.org/T245124) (owner: 10Ottomata) [20:23:05] marxarelli: I don't think the patch is ready even if it passees tests [20:23:09] yes still ubn [20:24:02] (03CR) 10Cwhite: [C: 03+2] hiera: add uwsgi.core.busy_workers to ores statsd exporter mappings [puppet] - 10https://gerrit.wikimedia.org/r/572082 (https://phabricator.wikimedia.org/T233448) (owner: 10Cwhite) [20:24:14] Krinkle: ok. the logspam is definitely annoying but all else looks ok. i'll move it to block next week's train [20:24:30] marxarelli: it makes page views and edits aborted with a fatal error [20:24:47] some of them anyway [20:24:52] on commons and wikidata [20:24:58] might even want to go back to group0 [20:25:10] Krinkle: For a small handful of pages with hand-mis-coded URLs. [20:25:15] so inerestingly I don't see the logo and I see some strange font something on some wikipedia pages [20:25:16] I don't know what it'll do on enwiki, we'll have to wait and see [20:25:26] loaded just a couple minutes ago; some pages are ok and some are not [20:25:38] reload does not appear to clear it up [20:26:14] Krinkle: It renders fine: https://en.wikipedia.org/wiki/File:Example.jpg?uselang=fr?uselang=fr [20:26:15] logged in user [20:26:48] apergos: In your console, do you have network fetch failures for load.php or upload.wikimedia.org? [20:26:49] James_F: well that doesn't actually hit the code in question, so yeah that url is fine. [20:27:29] The bug in Wikibase isn't specific to uselang urls on Commons. It's not encoding the data properly. There's probably other ways it can be triggered under normal operation. This is what canary is for so that we see that there's a logic flaw [20:27:29] Krinkle: Exactly. 
[20:27:47] It's what Beta Cluster is for, but yes. [20:27:50] don't see any, no [20:27:51] all cache access from wikibase code is affected, and it is only invoked on a parser cache miss. [20:28:11] so we'll see it gradually break stuff if there's other ways to trigger it, which we'll see from high traffic group2 [20:28:28] apergos: So you have a fetch for https://en.wikipedia.org/static/images/project-logos/enwiki.png ? [20:28:31] I don't think we should roll forward unles wikibase maintainers are confident the makeKey bug cant be abused by any other way into that low level code [20:28:41] the uselang is just one random front-facing trigger that we happened to find first [20:28:43] well it shows up on some pages [20:29:00] apergos: I'll be due to the CSS rule missing, not the file missing [20:29:13] can you inspect the dom node on the page where it is missing and look at teh CSS rule? [20:29:16] sounds about right, it goes with the weird font something [20:29:44] the logo config we have inprod right now is a combination of four untested layers of code. [20:29:45] I've been having occasional load.php slowness today and yesterday, but it loads eventually (about half a second late). [20:30:22] and now at last it is back on that page [20:30:36] so I can't check csss there, I can check it on another one maybe [20:30:41] !log varnish 500 spike. rolling back [20:30:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:30:58] this was definitely not a 1/2 second delay... it was likely a not loaded at all thing [20:31:12] * James_F nods. [20:31:57] oh wow. 
I went to the css view on anotehr tab and it as if by magic loaded the asset in front of my eyes [20:32:01] that was special [20:32:35] !log dduvall@deploy1001 rebuilt and synchronized wikiversions files: (no justification provided) [20:32:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:40] now opening the web console in any of these tabs loads the mssing asset.... shrug [20:33:48] !log rollback to group1 due to 500 spike (2k/min) (T233867) [20:33:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:52] T233867: 1.35.0-wmf.19 deployment blockers - https://phabricator.wikimedia.org/T233867 [20:36:55] apergos: are you still seeing that problem. i've just rolled back [20:37:09] * apergos goes to load some more pages [20:37:13] (03PS1) 10Ottomata: Double escape \\\\w to get \w via puppet -> systemd -> shell in hadoop purge job [puppet] - 10https://gerrit.wikimedia.org/r/572087 (https://phabricator.wikimedia.org/T245124) [20:38:04] they all loaded fine [20:38:18] i don't see anything obvious in logstash for these 500s on the varnish side, so perhaps it related to the image loading [20:38:32] k. filing a task [20:40:29] well logged in user [20:40:52] and this was surely a css asset so i don't know how that works for a logged in user [20:40:54] (03CR) 10Ottomata: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/20800/an-coord1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/572087 (https://phabricator.wikimedia.org/T245124) (owner: 10Ottomata) [20:40:54] anyways, thanks [20:43:10] What's going on with train? [20:43:24] Zoranzoki21: Read scrollback. :-) [20:44:11] (Rolled back because of sustained Varnish 500 error rate spike, though nothing on the MW end.) 
[20:44:36] (03PS1) 10Jbond: add missing cassandra/restbase/restbase2020-a/restbase2020-a.kst file [labs/private] - 10https://gerrit.wikimedia.org/r/572088 [20:45:12] (03CR) 10Jbond: [V: 03+2 C: 03+2] add missing cassandra/restbase/restbase2020-a/restbase2020-a.kst file [labs/private] - 10https://gerrit.wikimedia.org/r/572088 (owner: 10Jbond) [20:45:38] Nice... [20:48:13] Now we have to wait one week more (it is unknown also)... ContentTranslation looks very bad in 1.35.0-wmf.18, and it is reslved with 1.35.0-wmf.19 [20:48:25] *resolved [20:49:32] Uh, I'm a bit nervous, sorry. I made cherry-pick https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/ContentTranslation/+/572089/ [20:49:55] could be that https/curl queueing issue. That tends to surface without MW errors making it to Logtash as php-fpm never starts those requests afaik. [20:50:05] Zoranzoki21: OK, talk to the ContentTranslation team about that? [20:50:16] not sure what would cause that thi time around though [20:50:21] Ok [20:50:23] Krinkle: Yeah. :-( [20:50:49] Given we have no stack traces, it makes it hard to know whether it'll be safe to roll forward. 
[20:51:30] (03PS2) 10Jforrester: [BETA CLUSTER] Enable DiscussionTools on Beta ar and nl Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572080 (https://phabricator.wikimedia.org/T245165) [20:51:36] (03CR) 10Jforrester: [C: 03+2] [BETA CLUSTER] Enable DiscussionTools on Beta ar and nl Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572080 (https://phabricator.wikimedia.org/T245165) (owner: 10Jforrester) [20:52:32] (03Merged) 10jenkins-bot: [BETA CLUSTER] Enable DiscussionTools on Beta ar and nl Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572080 (https://phabricator.wikimedia.org/T245165) (owner: 10Jforrester) [20:54:27] (03PS1) 10Jforrester: Revert "all wikis to 1.35.0-wmf.19" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572092 [20:54:41] (03CR) 10Jforrester: [C: 03+2] "Already deployed." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572092 (owner: 10Jforrester) [20:55:29] (03Merged) 10jenkins-bot: Revert "all wikis to 1.35.0-wmf.19" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572092 (owner: 10Jforrester) [20:57:46] James_F: I see that beta-mediawiki-config-update-eqiad still no works... 
[20:59:25] it's got clogged again then [20:59:50] it's going again [21:02:05] It's clean now [21:02:10] (03PS1) 10Ottomata: eventgate - For consistency, add common labels to networkpolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/572093 (https://phabricator.wikimedia.org/T242861) [21:04:38] (03PS2) 10Ottomata: eventgate - For consistency, add common labels to networkpolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/572093 (https://phabricator.wikimedia.org/T242861) [21:06:22] (03CR) 10Ottomata: [V: 03+2 C: 03+2] eventgate - For consistency, add common labels to networkpolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/572093 (https://phabricator.wikimedia.org/T242861) (owner: 10Ottomata) [21:18:46] (03PS4) 10VolkerE: Fix latin Wikipedia (VICIPÆDIA) wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/571838 (https://phabricator.wikimedia.org/T240728) [21:19:02] (03PS1) 10Ottomata: Configure production and canary release for eventgate-logging-external [deployment-charts] - 10https://gerrit.wikimedia.org/r/572095 (https://phabricator.wikimedia.org/T245203) [21:20:42] 10Operations, 10Traffic, 10Release, 10Release-Engineering-Team-TODO (2020-01 to 2020-03 (Q3)), 10Train Deployments: Varnish 500 spike following 1.35.0-wmf.19 all-wiki deployment - https://phabricator.wikimedia.org/T245202 (10greg) [21:21:09] (03CR) 10Ottomata: [C: 03+2] Configure production and canary release for eventgate-logging-external [deployment-charts] - 10https://gerrit.wikimedia.org/r/572095 (https://phabricator.wikimedia.org/T245203) (owner: 10Ottomata) [21:28:26] !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'production' . [21:28:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:29:06] !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'canary' . 
[21:29:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:32:15] 10Operations, 10Security-Team, 10User-jbond: Determine any impacts to SRE from OIT's planned move to JumpCloud for LDAP - https://phabricator.wikimedia.org/T244792 (10HMarcus) @MoritzMuehlenhoff I believe it would be best if we engaged in a call with Jumpcloud's LDAP engineers to find out exactly what can be... [21:35:42] 10Operations, 10Security-Team, 10User-jbond: Determine any impacts to SRE from OIT's planned move to JumpCloud for LDAP - https://phabricator.wikimedia.org/T244792 (10MoritzMuehlenhoff) @HMarcus Sure, we can do that. Let's do Thursday (2/20) - 7am PST, 4pm CET [21:35:58] !log deploying production and canary releases for eventgate-logging-external (and destroying the 'logging-external' release) (safe because eventgate-logging-external is not in use) - T245203 [21:36:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:36:02] T245203: Create production and canary releases for existent eventgate helmfile services - https://phabricator.wikimedia.org/T245203 [21:37:00] !log otto@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'production' . [21:37:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:38:21] !log otto@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'canary' . 
[21:38:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:40:33] 10Operations, 10serviceops-radar, 10Wikimedia-maintenance-script-run: special pages has not been updated since November 2019 in jawiki and several other projects - https://phabricator.wikimedia.org/T243599 (10Koavf) [21:40:48] !log refresh facts on compilers [21:40:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:41:18] !log otto@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'production' . [21:41:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:42:10] 10Operations, 10serviceops-radar, 10Wikimedia-maintenance-script-run: special pages has not been updated since November 2019 in jawiki and several other projects - https://phabricator.wikimedia.org/T243599 (10Koavf) I can confirm that this is impacting many projects. E.g. https://en.wikipedia.org/wiki/Specia... [21:42:16] !log otto@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'canary' . 
[21:42:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:42:19] (03PS1) 10Ottomata: eventgate-logging-external - Remove now unused 'logging-external' release [deployment-charts] - 10https://gerrit.wikimedia.org/r/572097 (https://phabricator.wikimedia.org/T245203) [21:43:50] 10Operations, 10LDAP-Access-Requests: Add Amooney to wmf LDAP Group - https://phabricator.wikimedia.org/T245215 (10WDoranWMF) [21:44:04] (03CR) 10Ottomata: [C: 03+2] eventgate-logging-external - Remove now unused 'logging-external' release [deployment-charts] - 10https://gerrit.wikimedia.org/r/572097 (https://phabricator.wikimedia.org/T245203) (owner: 10Ottomata) [21:44:28] 10Operations, 10LDAP-Access-Requests: Add Amooney to wmf LDAP Group - https://phabricator.wikimedia.org/T245215 (10WDoranWMF) [21:46:08] 10Operations, 10Fundraising-Backlog, 10fundraising-tech-ops, 10Spike: Spike: Enumerate remaining unported stats - https://phabricator.wikimedia.org/T175850 (10Dwisehaupt) [21:51:26] (03CR) 10Herron: "thx for this! left a few comments inline, and will coordinate testing via the task" (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/571239 (https://phabricator.wikimedia.org/T244472) (owner: 10Effie Mouzeli) [22:01:49] (03PS1) 10Ottomata: Configure production and canary release for eventgate-analytics [deployment-charts] - 10https://gerrit.wikimedia.org/r/572100 (https://phabricator.wikimedia.org/T245203) [22:02:02] (03PS1) 10Volans: apt: cleanup unused package_updates custom fact [puppet] - 10https://gerrit.wikimedia.org/r/572101 [22:03:58] 10Operations, 10serviceops-radar, 10Wikimedia-maintenance-script-run: special pages has not been updated since November 2019 in jawiki and several other projects - https://phabricator.wikimedia.org/T243599 (10Koavf) I can confirm that this is impacting many projects. E.g. https://en.wikipedia.org/wiki/Specia... 
[22:04:29] (PS2) Volans: apt: remove unused package_updates custom fact [puppet] - https://gerrit.wikimedia.org/r/572101
[22:09:08] (CR) Volans: apt: remove unused package_updates custom fact (1 comment) [puppet] - https://gerrit.wikimedia.org/r/572101 (owner: Volans)
[22:09:48] Operations, ops-eqiad, fundraising-tech-ops: (ASAP) rack/setup/install frnetmon1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T232137 (Jgreen) a:Cmjohnson→None Moving task back to the untriaged pool
[22:12:23] (CR) Krinkle: [C: +2] Set "coalesceKeys" for deployment-prep [mediawiki-config] - https://gerrit.wikimedia.org/r/571793 (owner: Aaron Schulz)
[22:12:49] (CR) jerkins-bot: [V: -1] Set "coalesceKeys" for deployment-prep [mediawiki-config] - https://gerrit.wikimedia.org/r/571793 (owner: Aaron Schulz)
[22:12:58] (PS3) Krinkle: Set "coalesceKeys" for deployment-prep [mediawiki-config] - https://gerrit.wikimedia.org/r/571793 (owner: Aaron Schulz)
[22:13:21] (PS4) Krinkle: Beta: Enable "coalesceKeys" for WANObjectCache in deployment-prep [mediawiki-config] - https://gerrit.wikimedia.org/r/571793 (owner: Aaron Schulz)
[22:13:25] (CR) Krinkle: [C: +2] Beta: Enable "coalesceKeys" for WANObjectCache in deployment-prep [mediawiki-config] - https://gerrit.wikimedia.org/r/571793 (owner: Aaron Schulz)
[22:13:30] !log running filesystem tests on cloudvirt1024 T241884
[22:13:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:13:34] T241884: Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T241884
[22:13:55] (PS1) Ottomata: Configure production and canary release for eventgate-main [deployment-charts] - https://gerrit.wikimedia.org/r/572106 (https://phabricator.wikimedia.org/T245203)
[22:14:27] (Merged) jenkins-bot: Beta: Enable "coalesceKeys" for WANObjectCache in deployment-prep [mediawiki-config] - https://gerrit.wikimedia.org/r/571793 (owner: Aaron Schulz)
[22:29:46] (CR) Muehlenhoff: [C: +1] "Looks good to me. What about the /usr/local/bin/apt2xml file, though? To be removed via Cumin or shall we absent it?" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/572101 (owner: Volans)
[22:30:35] Operations, observability, serviceops, Patch-For-Review: Stream a subset of mediawiki apache logs to logstash - https://phabricator.wikimedia.org/T244472 (herron) Fwiw I do see logs flowing into logstash-beta generally, but puppet was broken in the beta cluster because the master filled its disk....
[22:35:47] Operations, Security-Team, User-jbond: Determine any impacts to SRE from OIT's planned move to JumpCloud for LDAP - https://phabricator.wikimedia.org/T244792 (chasemp) @HMarcus would you mind looping me in on that meeting?
[22:36:30] (PS2) Jforrester: The preprocessorClass property in $wgParserConf doesn't do anything any more [mediawiki-config] - https://gerrit.wikimedia.org/r/567155 (https://phabricator.wikimedia.org/T204945) (owner: C. Scott Ananian)
[22:36:43] (CR) Jforrester: [C: +1] "Good to go whenever." [mediawiki-config] - https://gerrit.wikimedia.org/r/567155 (https://phabricator.wikimedia.org/T204945) (owner: C. Scott Ananian)
[22:36:51] (PS2) Jforrester: The $wgMaxGeneratedPPNodeCount configuration variable no longer has any effect [mediawiki-config] - https://gerrit.wikimedia.org/r/567157 (https://phabricator.wikimedia.org/T204945) (owner: C. Scott Ananian)
[22:37:01] (PS3) Volans: apt: remove unused package_updates custom fact [puppet] - https://gerrit.wikimedia.org/r/572101
[22:37:03] (PS1) Volans: apt: finally remove absented apt2xml file [puppet] - https://gerrit.wikimedia.org/r/572111
[22:38:03] (CR) Muehlenhoff: [C: +1] apt: remove unused package_updates custom fact [puppet] - https://gerrit.wikimedia.org/r/572101 (owner: Volans)
[22:38:20] (CR) Volans: "> Patch Set 2: Code-Review+1" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/572101 (owner: Volans)
[22:38:23] (CR) Muehlenhoff: [C: +1] apt: finally remove absented apt2xml file [puppet] - https://gerrit.wikimedia.org/r/572111 (owner: Volans)
[22:38:57] (PS1) Andrew Bogott: designate nova_fixed_multi: minor refactor [puppet] - https://gerrit.wikimedia.org/r/572112
[22:40:43] (CR) Andrew Bogott: [C: +2] designate nova_fixed_multi: minor refactor [puppet] - https://gerrit.wikimedia.org/r/572112 (owner: Andrew Bogott)
[22:57:22] (PS1) CRusnov: Fix -extras for netbox 2.7 upgrade [software/netbox-extras] - https://gerrit.wikimedia.org/r/572116
[22:59:36] Operations, Android-app-Bugs, RESTBase, Traffic, and 6 others: Varnish 500 spike of all /page/related/ hits following 1.35.0-wmf.19 all-wiki deployment - https://phabricator.wikimedia.org/T245202 (Jdforrester-WMF)
[23:01:49] Operations, Android-app-Bugs, RESTBase, Traffic, and 6 others: Varnish 500 spike of all /page/related/ hits following 1.35.0-wmf.19 all-wiki deployment - https://phabricator.wikimedia.org/T245202 (Jdforrester-WMF) We dug into the Web request logs. All the 500s were coming from `/api/rest_v1/page/...
[23:25:35] (PS1) Bstorm: toolforge-k8s: add kube-bench config files for use on the cluster [puppet] - https://gerrit.wikimedia.org/r/572117 (https://phabricator.wikimedia.org/T240009)
[23:42:09] (CR) Bstorm: toolforge-k8s: add kube-bench config files for use on the cluster (1 comment) [puppet] - https://gerrit.wikimedia.org/r/572117 (https://phabricator.wikimedia.org/T240009) (owner: Bstorm)
[23:51:06] Operations, ops-codfw: (No Need By Date Provided) codfw: rack/setup/install elastic20{55,56,57,58,59,60}.codfw.wmnet - https://phabricator.wikimedia.org/T241337 (Papaul)
[23:51:35] (PS1) Papaul: DNS: Add production DNS elastic2055 to elastic2060 [dns] - https://gerrit.wikimedia.org/r/572120
[23:53:15] (PS1) Andrew Bogott: nova_fixed_multi: support adding/deleting records in a 'legacy' domain [puppet] - https://gerrit.wikimedia.org/r/572122 (https://phabricator.wikimedia.org/T245173)
[23:53:36] (CR) Papaul: [C: +2] DNS: Add production DNS elastic2055 to elastic2060 [dns] - https://gerrit.wikimedia.org/r/572120 (owner: Papaul)