[00:01:43] mutante, hi! [00:03:16] Jhs: hello. thanks for catching that [00:03:27] np :) [00:03:52] i was going to just wait until tomorrow now because i know nobody will run createwiki tonight anyways..and then i know they all saw it to adjust the other changes [00:04:18] mutante, someone told me there's a special right in gerrit you have to request to be able to change other people's patches (due to vandalism issues in the past). do you know how/where to request that right? [00:04:22] or maybe the chapter people itself should be aware [00:05:06] yeah, Amir (ladsgroup) was just with me drinking beer here in Wikidatacon, so i doubt he'll be creating any wikis tonight ;P [00:06:03] Jhs: it's called "trusted-users" [00:06:20] Jhs: easiest is probably to fill out this form and say "i want to be a trusted-users" https://phabricator.wikimedia.org/maniphest/task/edit/form/1/?projects=Gerrit-Privilege-Requests [00:06:41] got there from https://www.mediawiki.org/wiki/Gerrit/Privilege_policy [00:07:27] mutante, thanks :o) [00:08:46] Oh, you want the trusted group Jhs [00:08:46] (03CR) 10Dzahn: [C: 03+1] admin: add Kevin Bazira to several groups [puppet] - 10https://gerrit.wikimedia.org/r/545418 (https://phabricator.wikimedia.org/T234209) (owner: 10Cwhite) [00:08:59] paladox: is there a better link? [00:09:22] there's: https://gerrit.wikimedia.org/r/#/admin/groups/1505,members [00:09:26] 10Operations, 10Gerrit, 10Release-Engineering-Team, 10Wikimedia Design Style Guide: Automatic pickup of Gerrit clone master doesn't happen (due to git-lfs not installed on production misc) - https://phabricator.wikimedia.org/T235677 (10Krinkle) [00:09:35] Jhs i can add you [00:09:44] paladox, that would be great :) [00:09:46] are you jhsoby@ [00:09:49] ? 
[00:09:51] yes [00:09:57] ok [00:10:04] Jhs {{done}} [00:10:19] i sometimes encounter patches where i could make a simple fix, but need to leave a comment or -1 instead [00:10:27] paladox, thank you :) [00:10:33] you're welcome :) [00:13:09] You can add users to it if you want (it's just like phabricator's trusted user group) [00:13:27] (PS4) Jon Harald Søby: mediawiki::web:prod_sites.pp: Apache config for ge.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/545889 (https://phabricator.wikimedia.org/T236389) (owner: MarcoAurelio) [00:31:23] (PS6) Jon Harald Søby: Initial configuration for ge.wikimedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/545909 (https://phabricator.wikimedia.org/T236389) (owner: MarcoAurelio) [00:44:44] Operations, ops-esams, Traffic, Patch-For-Review: rack/setup/install bast3004 - https://phabricator.wikimedia.org/T236394 (Dzahn) - home dirs copied to individual $user_bast1002.tar.gz files in each user home (where the user exists on both old and new server) so users have their old files if they... [00:48:58] (CR) Dzahn: "i think last time i moved the bast i had to ask network ops to change ACLs to make the install_server part work. 
i did not want to block y" [puppet] - 10https://gerrit.wikimedia.org/r/545973 (https://phabricator.wikimedia.org/T236394) (owner: 10Dzahn) [00:52:55] !log krinkle@deploy1001 Synchronized php-1.35.0-wmf.3/extensions/LiquidThreads/classes/View.php: (no justification provided) (duration: 00m 54s) [00:53:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:13:19] !log puppetmaster1001 - revoking parsoid.svc.eqiad / parsoid.svc.codfw / parsoid.discovery.wmnet certificates and creating new ones including parsoid-php.discovery.wmnet (T233654) [01:13:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:13:25] T233654: Make the parsoid cluster support parsoid/PHP - https://phabricator.wikimedia.org/T233654 [01:19:59] (03PS1) 10Dzahn: ssl: update certificates for parsoid/parsoid-php [puppet] - 10https://gerrit.wikimedia.org/r/545989 (https://phabricator.wikimedia.org/T233654) [01:23:40] (03CR) 10Dzahn: "openssl x509 -text -noout -in parsoid.discovery.wmnet.crt | grep DNS" [puppet] - 10https://gerrit.wikimedia.org/r/545989 (https://phabricator.wikimedia.org/T233654) (owner: 10Dzahn) [01:23:58] (03CR) 10Dzahn: [C: 03+2] ssl: update certificates for parsoid/parsoid-php [puppet] - 10https://gerrit.wikimedia.org/r/545989 (https://phabricator.wikimedia.org/T233654) (owner: 10Dzahn) [01:27:22] !log bblack@cumin1001 conftool action : set/pooled=no; selector: name=cp3045.esams.wmnet [01:27:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:27:36] !log bblack@cumin1001 conftool action : set/pooled=no; selector: name=cp3046.esams.wmnet [01:27:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:27:51] !log bblack@cumin1001 conftool action : set/pooled=yes; selector: name=cp3061.esams.wmnet [01:27:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:28:41] !log bblack@cumin1001 conftool action : set/pooled=no; selector: 
name=cp3041.esams.wmnet [01:28:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:28:54] !log bblack@cumin1001 conftool action : set/pooled=yes; selector: name=cp3052.esams.wmnet [01:28:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:39:51] !log bblack@cumin1001 conftool action : set/pooled=no; selector: name=cp3047.esams.wmnet [01:39:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:40:03] !log bblack@cumin1001 conftool action : set/pooled=yes; selector: name=cp3063.esams.wmnet [01:40:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:40:38] !log bblack@cumin1001 conftool action : set/pooled=no; selector: name=cp3042.esams.wmnet [01:40:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:40:48] !log bblack@cumin1001 conftool action : set/pooled=yes; selector: name=cp3062.esams.wmnet [01:40:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:46:03] !log cr2-esams + cr3-esams: switch ntp peers list to use dns300[12] instead of nescio/maerlant [01:46:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:50:35] !log asw2-esams: switch ntp peers list to use dns300[12] instead of nescio/maerlant [01:50:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:52:07] !log mr1-esams: switch ntp peers list to use dns300[12] instead of nescio/maerlant [01:52:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:00:35] so I'm using the instructions here: https://wikitech.wikimedia.org/wiki/How_to_run_queries_on_live_data [02:00:51] but I'm getting the error 'Error: Host not configured: "db1124"' [02:01:18] when using the command 'sql enwiki --host db1124' [02:01:27] any ideas? 
[02:08:10] 10Operations, 10DC-Ops, 10Traffic, 10decommission: decommission lvs300[1234] - https://phabricator.wikimedia.org/T236451 (10BBlack) [02:08:13] 10Operations, 10DC-Ops, 10Traffic, 10decommission: decommission lvs300[1234] - https://phabricator.wikimedia.org/T236451 (10BBlack) [02:08:15] 10Operations, 10ops-esams, 10DC-Ops, 10Patch-For-Review: ESAMS Refresh/Rebuild (October 2019) - https://phabricator.wikimedia.org/T235805 (10BBlack) [02:09:59] !log bblack@cumin1001 START - Cookbook sre.hosts.decommission [02:10:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:11:26] !log bblack@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [02:11:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:11:30] 10Operations, 10DC-Ops, 10Traffic, 10decommission: decommission lvs300[1234] - https://phabricator.wikimedia.org/T236451 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by bblack@cumin1001 for hosts: `lvs[3001-3004].esams.wmnet` - lvs3001.esams.wmnet (**PASS**) - Downtimed host on Icing... 
[02:13:43] PROBLEM - snapshot of s2 in eqiad on db1115 is CRITICAL: snapshot for s2 at eqiad taken more than 4 days ago: Most recent backup 2019-10-21 01:49:22 https://wikitech.wikimedia.org/wiki/MariaDB/Backups [02:36:48] (03PS1) 10BBlack: lvs3001-4 decom puppet removal [puppet] - 10https://gerrit.wikimedia.org/r/545991 (https://phabricator.wikimedia.org/T236451) [02:39:03] (03PS1) 10BBlack: lvs3001-4 decom dns removal [dns] - 10https://gerrit.wikimedia.org/r/545992 (https://phabricator.wikimedia.org/T236451) [02:39:58] (03CR) 10BBlack: [C: 03+2] lvs3001-4 decom puppet removal [puppet] - 10https://gerrit.wikimedia.org/r/545991 (https://phabricator.wikimedia.org/T236451) (owner: 10BBlack) [02:40:07] (03CR) 10BBlack: [C: 03+2] lvs3001-4 decom dns removal [dns] - 10https://gerrit.wikimedia.org/r/545992 (https://phabricator.wikimedia.org/T236451) (owner: 10BBlack) [02:40:12] (03PS2) 10BBlack: lvs3001-4 decom dns removal [dns] - 10https://gerrit.wikimedia.org/r/545992 (https://phabricator.wikimedia.org/T236451) [02:43:41] 10Operations, 10DC-Ops, 10Traffic, 10decommission, 10Patch-For-Review: decommission lvs300[1234] - https://phabricator.wikimedia.org/T236451 (10BBlack) [02:43:54] 10Operations, 10DC-Ops, 10Traffic, 10decommission, 10Patch-For-Review: decommission lvs300[1234] - https://phabricator.wikimedia.org/T236451 (10BBlack) a:03Papaul [02:44:20] RECOVERY - snapshot of s2 in eqiad on db1115 is OK: snapshot for s2 at eqiad taken less than 4 days ago and larger than 90 GB: Last one 2019-10-25 01:12:14 from db1095.eqiad.wmnet:3312 (813 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups [02:44:54] !log bblack@cumin1001 conftool action : set/pooled=no; selector: name=cp3043.esams.wmnet [02:44:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:45:07] !log bblack@cumin1001 conftool action : set/pooled=yes; selector: name=cp3064.esams.wmnet [02:45:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
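Editorial note: the conftool `!log` entries in this stretch all share one shape (`set/<key>=<value>; selector: name=<host>`), so the pool/depool sequence can be tallied mechanically. The following is a minimal sketch that assumes that shape holds. The regular expression is inferred only from the lines in this log, not from any conftool format specification, and the function names are hypothetical.

```python
import re

# Pattern inferred from the log lines in this channel; it is an assumption,
# not conftool's documented output format.
CONFTOOL_LOG = re.compile(
    r"conftool action : set/(?P<key>\w+)=(?P<value>\w+); "
    r"selector: name=(?P<selector>\S+)"
)


def parse_conftool_actions(text):
    """Return (key, value, selector) tuples for every conftool action in text."""
    return [(m["key"], m["value"], m["selector"]) for m in CONFTOOL_LOG.finditer(text)]


def last_state(actions, key="pooled"):
    """Map each selector to the last value logged for the given key."""
    state = {}
    for action_key, value, selector in actions:
        if action_key == key:
            state[selector] = value
    return state


# Two entries copied from the log above.
sample = (
    "[02:44:54] !log bblack@cumin1001 conftool action : set/pooled=no; "
    "selector: name=cp3043.esams.wmnet "
    "[02:45:07] !log bblack@cumin1001 conftool action : set/pooled=yes; "
    "selector: name=cp3064.esams.wmnet"
)
actions = parse_conftool_actions(sample)
print(last_state(actions))
```

Keeping only the last value per selector reflects the order of the log; it says nothing about what conftool itself currently holds.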
[02:45:35] !log bblack@cumin1001 conftool action : set/pooled=no; selector: name=cp3049.esams.wmnet [02:45:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:45:48] !log bblack@cumin1001 conftool action : set/pooled=yes; selector: name=cp3065.esams.wmnet [02:45:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:46:28] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 20446648 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [02:51:02] davidwbarratt: so, I don't know that script well at all, but I will note that I don't see db1124 anywhere on https://noc.wikimedia.org/db.php -- I think you probably want to pick one of the other non-master replicas from s1? [02:52:16] PROBLEM - Postgres Replication Lag on maps1001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 26827856 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [02:52:42] PROBLEM - Postgres Replication Lag on maps2001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 17733360 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [02:54:04] davidwbarratt: alternatively, clicking through on dbtree, you might need to specify it as db1124:3311 [02:54:12] (03PS1) 10BBlack: esams: remove maerlant/nescio from ntp peers [puppet] - 10https://gerrit.wikimedia.org/r/545994 [02:54:52] RECOVERY - Postgres Replication Lag on maps2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 59424 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [02:55:00] (03CR) 10BBlack: [C: 03+2] esams: remove maerlant/nescio from ntp peers [puppet] - 10https://gerrit.wikimedia.org/r/545994 (owner: 10BBlack) [02:55:36] RECOVERY - Postgres Replication Lag on maps1001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 282704 and 
77 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [02:56:04] RECOVERY - Postgres Replication Lag on maps2001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 182432 and 70 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [03:04:38] 10Operations, 10DC-Ops, 10Traffic, 10decommission: decommission nescio and maerlant - https://phabricator.wikimedia.org/T236452 (10BBlack) [03:05:14] PROBLEM - NTP peers on nescio is CRITICAL: NTP CRITICAL: No response from NTP server https://wikitech.wikimedia.org/wiki/NTP [03:05:46] !log bblack@cumin1001 START - Cookbook sre.hosts.decommission [03:05:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:06:20] !log bblack@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [03:06:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:06:24] 10Operations, 10DC-Ops, 10Traffic, 10decommission: decommission nescio and maerlant - https://phabricator.wikimedia.org/T236452 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by bblack@cumin1001 for hosts: `maerlant.wikimedia.org,nescio.wikimedia.org` - maerlant.wikimedia.org (**PASS**)... 
[03:08:47] !log cr2-esams + cr3-esams : remove nescio and maerlant from anycast4 neighbor list [03:08:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:11:30] PROBLEM - PyBal backends health check on lvs3007 is CRITICAL: PYBAL CRITICAL - CRITICAL - dns_rec6_53: Servers nescio.wikimedia.org are marked down but pooled: dns_rec_53: Servers nescio.wikimedia.org are marked down but pooled: dns_rec_53_udp: Servers nescio.wikimedia.org are marked down but pooled: dns_rec6_53_udp: Servers nescio.wikimedia.org are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [03:11:42] PROBLEM - PyBal IPVS diff check on lvs3006 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([nescio.wikimedia.org]) https://wikitech.wikimedia.org/wiki/PyBal [03:12:46] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [03:12:48] PROBLEM - PyBal backends health check on lvs3006 is CRITICAL: PYBAL CRITICAL - CRITICAL - dns_rec6_53: Servers nescio.wikimedia.org are marked down but pooled: dns_rec_53: Servers nescio.wikimedia.org are marked down but pooled: dns_rec6_53_udp: Servers nescio.wikimedia.org are marked down but pooled: dns_rec_53_udp: Servers nescio.wikimedia.org are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [03:13:53] ugh that thing is still around heh [03:16:22] PROBLEM - PyBal IPVS diff check on lvs3007 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([nescio.wikimedia.org]) https://wikitech.wikimedia.org/wiki/PyBal [03:17:14] (03PS1) 10BBlack: dns300[12] - fix resolv.conf and old lvs recdns [puppet] - 10https://gerrit.wikimedia.org/r/545998 (https://phabricator.wikimedia.org/T236217) [03:17:16] (03PS1) 10BBlack: puppet decom for nescio and maerlant [puppet] - 10https://gerrit.wikimedia.org/r/545999 (https://phabricator.wikimedia.org/T236452) [03:18:59] (03PS1) 10BBlack: dns decom 
for nescio and maerlant [dns] - 10https://gerrit.wikimedia.org/r/546000 (https://phabricator.wikimedia.org/T236452) [03:19:01] (03CR) 10BBlack: [C: 03+2] dns300[12] - fix resolv.conf and old lvs recdns [puppet] - 10https://gerrit.wikimedia.org/r/545998 (https://phabricator.wikimedia.org/T236217) (owner: 10BBlack) [03:19:32] (03CR) 10BBlack: [C: 03+2] dns decom for nescio and maerlant [dns] - 10https://gerrit.wikimedia.org/r/546000 (https://phabricator.wikimedia.org/T236452) (owner: 10BBlack) [03:19:37] (03PS2) 10BBlack: dns decom for nescio and maerlant [dns] - 10https://gerrit.wikimedia.org/r/546000 (https://phabricator.wikimedia.org/T236452) [03:19:55] (03CR) 10BBlack: [C: 03+2] puppet decom for nescio and maerlant [puppet] - 10https://gerrit.wikimedia.org/r/545999 (https://phabricator.wikimedia.org/T236452) (owner: 10BBlack) [03:21:13] (03PS1) 10Dzahn: switch esams prometheus node from bast3002 to bast3004 [puppet] - 10https://gerrit.wikimedia.org/r/546001 (https://phabricator.wikimedia.org/T236394) [03:22:34] (03CR) 10Dzahn: "@Filippo Can we do this? Otherwise it would block the decom of bast3002. 
I have rsynced the data but into a separate dir so far" [puppet] - 10https://gerrit.wikimedia.org/r/546001 (https://phabricator.wikimedia.org/T236394) (owner: 10Dzahn) [03:22:47] PROBLEM - Confd template for /srv/config-master/pybal/eqsin/dns_rec on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqsin/dns_rec is broken https://wikitech.wikimedia.org/wiki/Confd [03:23:13] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/dns_rec on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/dns_rec is broken https://wikitech.wikimedia.org/wiki/Confd [03:23:23] PROBLEM - Confd template for /srv/config-master/pybal/codfw/dns_rec on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/dns_rec is broken https://wikitech.wikimedia.org/wiki/Confd [03:23:25] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/dns_rec_udp on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/dns_rec_udp is broken https://wikitech.wikimedia.org/wiki/Confd [03:23:25] PROBLEM - Confd template for /srv/config-master/pybal/esams/dns_rec on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/esams/dns_rec is broken https://wikitech.wikimedia.org/wiki/Confd [03:23:29] PROBLEM - Confd template for /srv/config-master/pybal/codfw/dns_rec_udp on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/dns_rec_udp is broken https://wikitech.wikimedia.org/wiki/Confd [03:23:29] PROBLEM - Confd template for /srv/config-master/pybal/ulsfo/dns_rec on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/ulsfo/dns_rec is broken https://wikitech.wikimedia.org/wiki/Confd [03:23:31] PROBLEM - Confd template for /srv/config-master/pybal/esams/dns_rec on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/esams/dns_rec is broken https://wikitech.wikimedia.org/wiki/Confd [03:23:37] PROBLEM - Confd 
template for /srv/config-master/pybal/eqsin/dns_rec_udp on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqsin/dns_rec_udp is broken https://wikitech.wikimedia.org/wiki/Confd [03:23:53] PROBLEM - Confd template for /srv/config-master/pybal/eqsin/dns_rec on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqsin/dns_rec is broken https://wikitech.wikimedia.org/wiki/Confd [03:23:53] PROBLEM - Confd template for /srv/config-master/pybal/ulsfo/dns_rec_udp on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/ulsfo/dns_rec_udp is broken https://wikitech.wikimedia.org/wiki/Confd [03:23:55] PROBLEM - Confd template for /srv/config-master/pybal/codfw/dns_rec_udp on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/dns_rec_udp is broken https://wikitech.wikimedia.org/wiki/Confd [03:23:59] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/dns_rec_udp on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/dns_rec_udp is broken https://wikitech.wikimedia.org/wiki/Confd [03:24:04] !log bblack@cumin1001 conftool action : set/weight=1; selector: name=dns3001.* [03:24:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:24:18] !log bblack@cumin1001 conftool action : set/weight=1; selector: name=dns300.* [03:24:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:24:25] PROBLEM - Confd template for /srv/config-master/pybal/esams/dns_rec_udp on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/esams/dns_rec_udp is broken https://wikitech.wikimedia.org/wiki/Confd [03:24:26] ..... 
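Editorial note: the two weight changes logged at 03:24:04 and 03:24:18 differ by one character in the selector (`name=dns3001.*` versus `name=dns300.*`). The sketch below illustrates that difference under two assumptions: that the selector value is treated as a regular expression matched against the whole host name, and that the hosts carry the fully qualified names used here. Both the names and the `select` helper are hypothetical.

```python
import re

# Hypothetical host names; the log only refers to "dns300[12]".
HOSTS = ["dns3001.wikimedia.org", "dns3002.wikimedia.org"]


def select(pattern, hosts):
    """Return the hosts whose full name matches the selector pattern.

    Assumes whole-string regex matching; this is an illustration,
    not conftool's implementation.
    """
    regex = re.compile(pattern)
    return [host for host in hosts if regex.fullmatch(host)]


narrow = select("dns3001.*", HOSTS)  # first selector in the log
broad = select("dns300.*", HOSTS)    # second selector in the log
print(narrow, broad)
```

Under those assumptions the first selector reaches only one of the two hosts, which would explain why the command was repeated with the shorter pattern.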
[03:24:40] all for a useless and overcomplicated dead service :P [03:24:46] !log bblack@cumin1001 conftool action : set/pooled=yes; selector: name=dns300.* [03:24:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:24:59] RECOVERY - PyBal backends health check on lvs3006 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [03:25:19] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/dns_rec on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/dns_rec is broken https://wikitech.wikimedia.org/wiki/Confd [03:25:29] RECOVERY - PyBal backends health check on lvs3007 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [03:26:04] I'm guessing the config-master things will fix themselves at some point [03:26:07] RECOVERY - PyBal IPVS diff check on lvs3006 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [03:26:10] (03PS2) 10Dzahn: site: replace bast3002 with bast3004, remove from bastion list [puppet] - 10https://gerrit.wikimedia.org/r/545911 (https://phabricator.wikimedia.org/T236329) [03:26:39] RECOVERY - PyBal IPVS diff check on lvs3007 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [03:27:56] 10Operations, 10DC-Ops, 10Traffic, 10decommission, 10Patch-For-Review: decommission nescio and maerlant - https://phabricator.wikimedia.org/T236452 (10BBlack) [03:28:18] 10Operations, 10DC-Ops, 10Traffic, 10decommission, 10Patch-For-Review: decommission nescio and maerlant - https://phabricator.wikimedia.org/T236452 (10BBlack) a:03Papaul [03:28:39] 10Operations, 10DC-Ops, 10Traffic, 10decommission, 10Patch-For-Review: decommission nescio and maerlant - https://phabricator.wikimedia.org/T236452 (10BBlack) [03:28:44] 10Operations, 10ops-esams, 10DC-Ops, 10Patch-For-Review: ESAMS Refresh/Rebuild (October 2019) - https://phabricator.wikimedia.org/T235805 
(10BBlack) [03:29:13] PROBLEM - MegaRAID on db2120 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [03:29:14] ACKNOWLEDGEMENT - MegaRAID on db2120 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T236453 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [03:29:18] 10Operations, 10ops-codfw: Degraded RAID on db2120 - https://phabricator.wikimedia.org/T236453 (10ops-monitoring-bot) [03:32:26] guess not [03:33:06] one of those commands is: /usr/local/lib/nagios/plugins/check_confd_template '/srv/config-master/pybal/esams/dns_rec_udp' [03:33:14] Stale template error files present for '/srv/config-master/pybal/esams/dns_rec_udp' [03:33:27] that is different from the eqiad one about it somehow [03:33:45] yeah I'm not sure what's "stale" about it [03:33:47] the eqiad one says "is broken" [03:33:51] the contents all look right [03:34:07] even eqiad [03:34:48] oh "there are error files, which are stale", not "error, file is stale" [03:34:57] because it was temporarily broken [03:35:07] now if only it pointed at the error filenames instead of that filename :P [03:35:11] that makes sense. but why is the eqiad one "broken" [03:35:19] PROBLEM - Confd template for /srv/config-master/pybal/eqsin/dns_rec_udp on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqsin/dns_rec_udp is broken https://wikitech.wikimedia.org/wiki/Confd [03:36:22] if [ ${TS} -gt ${FILE_TS} ]; [03:36:35] the script makes me think they should be in the same directory with some tmpnam suffix, but I don't see them [03:37:13] they are starting with . 
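Editorial note: the check being debugged here reports either "Compilation of file ... is broken" or "Stale template error files present", and the error files turn out to be dotfiles such as `.dns_rec072354900.err` under `/var/run/confd-template`. The sketch below shows one way such a three-way classification could work. It is not the `check_confd_template` script: the file-name pattern is inferred from that single example, and the rule that an error file newer than the rendered file means "broken" while an older one means "stale" is an assumption based on the `${TS} -gt ${FILE_TS}` fragment and the discussion above.

```python
import os
import re


def classify_template(target, err_dir):
    """Classify a rendered file as 'ok', 'stale' or 'broken'.

    Assumptions (not taken from the real check script):
    - error files are named ".<basename><digits>.err" inside err_dir
    - an error file newer than the rendered file means the last
      compilation failed; an older one is a leftover ("stale")
    """
    name = os.path.basename(target)
    err_name = re.compile(r"^\." + re.escape(name) + r"\d+\.err$")
    err_files = [
        os.path.join(err_dir, entry)
        for entry in os.listdir(err_dir)
        if err_name.match(entry)
    ]
    if not err_files:
        return "ok"
    newest_error = max(os.path.getmtime(path) for path in err_files)
    if newest_error > os.path.getmtime(target):
        return "broken"
    return "stale"
```

Requiring digits directly after the base name keeps error files for `dns_rec_udp` from being counted against `dns_rec`. A call would look like `classify_template('/srv/config-master/pybal/esams/dns_rec_udp', '/var/run/confd-template')`.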
[03:37:20] .dns_rec072354900.err [03:37:28] in /var/run/confd-template [03:37:59] ah [03:38:32] fixing [03:39:01] RECOVERY - Confd template for /srv/config-master/pybal/codfw/dns_rec_udp on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd [03:39:01] RECOVERY - Confd template for /srv/config-master/pybal/codfw/dns_rec on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd [03:39:01] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/dns_rec_udp on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd [03:39:01] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/dns_rec on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd [03:39:01] RECOVERY - Confd template for /srv/config-master/pybal/eqsin/dns_rec_udp on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd [03:39:01] RECOVERY - Confd template for /srv/config-master/pybal/eqsin/dns_rec on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd [03:39:05] heh [03:39:21] RECOVERY - Check the Netbox report puppetdb for fail status. 
on netbox1001 is OK: puppetdb.PuppetDB OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports [03:39:37] RECOVERY - Confd template for /srv/config-master/pybal/esams/dns_rec_udp on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd [03:39:51] RECOVERY - Confd template for /srv/config-master/pybal/esams/dns_rec on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd [03:39:59] RECOVERY - Confd template for /srv/config-master/pybal/eqsin/dns_rec_udp on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd [03:40:01] RECOVERY - Confd template for /srv/config-master/pybal/esams/dns_rec on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd [03:40:12] so you deleted the stale files and that fixes esams and the warning makes sense. but why compilation was broken for all other templates [03:40:39] RECOVERY - Confd template for /srv/config-master/pybal/eqsin/dns_rec on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd [03:40:39] RECOVERY - Confd template for /srv/config-master/pybal/codfw/dns_rec_udp on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd [03:40:39] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/dns_rec on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd [03:40:39] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/dns_rec_udp on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd [03:40:39] RECOVERY - Confd template for /srv/config-master/pybal/ulsfo/dns_rec_udp on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd [03:40:39] RECOVERY - Confd template for /srv/config-master/pybal/ulsfo/dns_rec on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd [03:42:05] mutante: I'm not even going to try to 
understand how it broke them all. But I did temporarily have esams configured with zero entries (empty file) [03:42:35] ack! [03:46:07] i think the install_server needed an ACL change on router to be a working DHCP/tftp last time we moved it. i have not switched the DHCP server yet but there is a change in gerrit [03:46:48] yeah I think you're right, I'll go look [03:46:52] cool [03:52:46] (03PS1) 10Dzahn: install_server: set 'next-server' for bast3004 [puppet] - 10https://gerrit.wikimedia.org/r/546003 (https://phabricator.wikimedia.org/T236394) [03:54:08] (03CR) 10Vgutierrez: [C: 03+1] install_server: set 'next-server' for bast3004 [puppet] - 10https://gerrit.wikimedia.org/r/546003 (https://phabricator.wikimedia.org/T236394) (owner: 10Dzahn) [03:54:54] (03CR) 10Dzahn: [C: 03+2] install_server: set 'next-server' for bast3004 [puppet] - 10https://gerrit.wikimedia.org/r/546003 (https://phabricator.wikimedia.org/T236394) (owner: 10Dzahn) [03:57:14] (03PS2) 10Dzahn: switch esams prometheus node from bast3002 to bast3004 [puppet] - 10https://gerrit.wikimedia.org/r/546001 (https://phabricator.wikimedia.org/T236394) [04:01:13] (03PS1) 10Dzahn: install_server: remove bast3002 from DHCP, decom [puppet] - 10https://gerrit.wikimedia.org/r/546004 (https://phabricator.wikimedia.org/T236329) [04:02:02] (03PS3) 10Dzahn: DHCP: replace bast3002 with bast3004 as next-server [puppet] - 10https://gerrit.wikimedia.org/r/545973 (https://phabricator.wikimedia.org/T236394) [04:02:38] 10Operations, 10DC-Ops, 10Traffic, 10decommission: decommission cp3030-3049 - https://phabricator.wikimedia.org/T236454 (10BBlack) [04:02:55] (03PS4) 10Dzahn: DHCP: switch esams DHCP server from bast3002 to bast3004 [puppet] - 10https://gerrit.wikimedia.org/r/545973 (https://phabricator.wikimedia.org/T236394) [04:04:23] 10Operations, 10DC-Ops, 10Traffic, 10decommission: decommission cp3030-3049 - https://phabricator.wikimedia.org/T236454 (10BBlack) [04:04:28] 10Operations, 10ops-esams, 10DC-Ops, 
10Patch-For-Review: ESAMS Refresh/Rebuild (October 2019) - https://phabricator.wikimedia.org/T235805 (10BBlack) [04:05:33] 10Operations, 10DC-Ops, 10Traffic, 10decommission: decommission cp3030-3049 - https://phabricator.wikimedia.org/T236454 (10BBlack) [04:07:52] hmmm wikibugs needed a break [04:09:33] lol [04:09:50] mutante: I found the dhcp helper in the router config. Can I change it now? [04:10:06] bblack: yes, then i merge the puppet side now as well [04:10:19] oh wait, maybe I found the wrong thing [04:10:49] vgutierrez: https://gerrit.wikimedia.org/r/c/operations/puppet/+/545973 [04:11:02] ah, you did.. just the bot is gone :) [04:11:07] yeah.. [04:11:10] you need more than a +1? ;P [04:11:19] mutante: I think there's actually nothing to change. We currently use install1002 as the dhcp helper in the router config in esams, which is unchanged [04:11:34] (maybe we could/should improve that, but that can happen some other time) [04:11:47] yea, and i just added that to bast3004 DHCP config. that install1002 is the "next" [04:12:00] i just had vague memories i also had to ask netops back then [04:12:04] merging [04:12:14] probably memories from replacing install1001 with install1002 or similar [04:12:16] (03CR) 10Dzahn: [C: 03+2] DHCP: switch esams DHCP server from bast3002 to bast3004 [puppet] - 10https://gerrit.wikimedia.org/r/545973 (https://phabricator.wikimedia.org/T236394) (owner: 10Dzahn) [04:12:27] but yeah I donno [04:12:54] i just did not want to do that earlier and step away.. and then maybe you have to reinstall something and i broke it [04:12:56] that part you're editing isn't DHCP, it's tftp [04:12:59] which doesn't need router help [04:13:43] (well, you're editing the config of the dhcp server, but the lines are just specifying which tftp server to use, which can be routed-to normally) [04:13:47] right, the commit message should say tftp.. 
hrm [04:13:54] doesn't matter :) [04:14:50] i don't know about switching the prometheus node yet [04:14:57] that is also on bast3002 [04:15:23] ok [04:16:04] there is a patch where i added godog [04:16:23] i think he expected it can move straight to a ganeti VM [04:16:33] apologies in advance for potential icinga spam with the cp decom process [04:16:51] hmmm prometheus_nodes: [04:16:51] - bast3002.wikimedia.org [04:16:53] I'm trying to fit all the jenga pieces together right, but I doubt I'll get it all perfect :) [04:17:03] on hieradata/esams.yaml [04:17:07] vgutierrez: https://gerrit.wikimedia.org/r/c/operations/puppet/+/546001 [04:17:13] that and a second file [04:17:20] snmp_exporter [04:17:53] yup [04:18:01] but i don't know much about it [04:18:28] i did copy the data off of bast3002 though. it is backed up on bast3004 [04:18:33] if the snmp_exporter one grants access to some stuff maybe it makes sense to add bast3004 there, run puppet on the affected systems and then swap bast3004 on prometheus_nodes [04:18:39] 60GB prometheus data [04:19:01] (CR) ArielGlenn: "I'd set this at a different hour than the other jobs so we're not doing multiple pulls at once." 
[puppet] - 10https://gerrit.wikimedia.org/r/545982 (https://phabricator.wikimedia.org/T131280) (owner: 10Milimetric) [04:20:07] (03CR) 10Dzahn: "for appdata see https://phabricator.wikimedia.org/T236329#5605033" [puppet] - 10https://gerrit.wikimedia.org/r/546001 (https://phabricator.wikimedia.org/T236394) (owner: 10Dzahn) [04:22:19] vgutierrez: here is the data https://phabricator.wikimedia.org/T236329#5605033 [04:22:46] (03PS1) 10BBlack: puppet decom for cp3030-3049 [puppet] - 10https://gerrit.wikimedia.org/r/546005 (https://phabricator.wikimedia.org/T236454) [04:24:10] (03PS1) 10BBlack: dns decom for cp3030-3049 [dns] - 10https://gerrit.wikimedia.org/r/546006 [04:24:48] (03CR) 10BBlack: [C: 03+2] puppet decom for cp3030-3049 [puppet] - 10https://gerrit.wikimedia.org/r/546005 (https://phabricator.wikimedia.org/T236454) (owner: 10BBlack) [04:25:53] so yea, do you want to reinstall something anyways? do you want to test if install works before i go? [04:26:44] reimaging lvs300[567] are the last critical installs I think, but it will be a few before I get to them [04:27:06] (or vg) [04:27:47] i can do one of the ganeti ones [04:28:05] oh yeah that makes a good test too [04:30:15] 10Operations, 10ops-esams: rack/setup/install ganeti300[123] - https://phabricator.wikimedia.org/T236216 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` ganeti3003.esams.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/201910250429_dzahn_1682... 
[04:32:59] PROBLEM - Host ganeti3003 is DOWN: PING CRITICAL - Packet loss = 100% [04:33:20] can't ssh to ganeti3002/3003 mgmt to look at console ..uhm [04:34:07] they worked before, that's odd [04:34:13] PROBLEM - Aggregate IPsec Tunnel Status esams on icinga1001 is CRITICAL: instance={cp3030:9536,cp3032:9536,cp3033:9536,cp3040:9536,cp3041:9536,cp3042:9536,cp3043:9536} site=esams tunnel={cp1077_v4,cp1077_v6,cp1079_v4,cp1079_v6,cp1081_v4,cp1081_v6,cp1083_v4,cp1083_v6,cp1085_v4,cp1085_v6,cp1087_v4,cp1087_v6,cp1089_v4,cp1089_v6,cp2001_v4,cp2001_v6,cp2004_v4,cp2004_v6,cp2006_v4,cp2006_v6,cp2007_v4,cp2007_v6,cp2010_v4,cp2010_v6,cp2012 [04:34:13] 013_v4,cp2013_v6,cp2016_v4,cp2016_v6,cp2019_v4,cp2019_v6,cp2023_v4,cp2023_v6} https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [04:35:29] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be2047 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [04:35:31] RECOVERY - Host ganeti3003 is UP: PING OK - Packet loss = 0%, RTA = 83.45 ms [04:35:50] yea, so the reimage script is waiting for the reboot (earlier i just did the PXE setting and boot in racadm to continue) and mgmt not there.. [04:35:54] at least the host is back :p [04:36:17] PROBLEM - Check systemd state on ms-be2047 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:38:25] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be1045 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [04:38:59] RECOVERY - Check systemd state on ms-be1045 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:39:09] PROBLEM - MD RAID on ganeti3003 is CRITICAL: connect to address 10.20.0.33 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [04:39:25] PROBLEM - configured eth on ganeti3003 is CRITICAL: connect to address 10.20.0.33 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [04:39:27] bblack: 04:39:12 | ganeti3003.esams.wmnet | Host up (Debian installer) [04:39:32] i guess that counts ?:) [04:39:57] PROBLEM - Disk space on ganeti3003 is CRITICAL: connect to address 10.20.0.33 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ganeti3003&var-datasource=esams+prometheus/ops [04:39:57] PROBLEM - dhclient process on ganeti3003 is CRITICAL: connect to address 10.20.0.33 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [04:39:57] PROBLEM - DPKG on ganeti3003 is CRITICAL: connect to address 10.20.0.33 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [04:40:13] PROBLEM - Check systemd state on ganeti3003 is CRITICAL: connect to address 10.20.0.33 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:40:37] that is from wmf-auto-reimage output only because no mgmt .. 
and as we see the downtime part failed again [04:41:21] PROBLEM - puppet last run on ganeti3003 is CRITICAL: connect to address 10.20.0.33 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [04:41:43] ACKNOWLEDGEMENT - Check systemd state on ganeti3003 is CRITICAL: connect to address 10.20.0.33 port 5666: Connection refused daniel_zahn testing new esams install_server https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:41:43] ACKNOWLEDGEMENT - DPKG on ganeti3003 is CRITICAL: connect to address 10.20.0.33 port 5666: Connection refused daniel_zahn testing new esams install_server https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [04:41:43] ACKNOWLEDGEMENT - Disk space on ganeti3003 is CRITICAL: connect to address 10.20.0.33 port 5666: Connection refused daniel_zahn testing new esams install_server https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ganeti3003&var-datasource=esams+prometheus/ops [04:41:43] ACKNOWLEDGEMENT - MD RAID on ganeti3003 is CRITICAL: connect to address 10.20.0.33 port 5666: Connection refused daniel_zahn testing new esams install_server https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [04:41:43] ACKNOWLEDGEMENT - configured eth on ganeti3003 is CRITICAL: connect to address 10.20.0.33 port 5666: Connection refused daniel_zahn testing new esams install_server https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [04:41:43] ACKNOWLEDGEMENT - dhclient process on ganeti3003 is CRITICAL: connect to address 10.20.0.33 port 5666: Connection refused daniel_zahn testing new esams install_server https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [04:41:43] ACKNOWLEDGEMENT - puppet last run on ganeti3003 is CRITICAL: connect to address 10.20.0.33 port 5666: Connection refused daniel_zahn testing new esams install_server 
https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [04:44:55] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [04:49:29] !log bblack@cumin1001 START - Cookbook sre.hosts.decommission [04:49:29] !log bblack@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) [04:49:30] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [04:49:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:49:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:49:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:50:27] !log bblack@cumin1001 START - Cookbook sre.hosts.decommission [04:50:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:51:38] !log bblack@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) [04:51:38] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [04:51:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:51:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:51:48] 10Operations, 10DC-Ops, 10Traffic, 10decommission, 10Patch-For-Review: decommission cp3030-3049 - https://phabricator.wikimedia.org/T236454 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by bblack@cumin1001 for hosts: `cp[3030,3032-3035].esams.wmnet` - cp3030.esams.wmnet (**PASS**) -... 
[04:52:03] !log bblack@cumin1001 START - Cookbook sre.hosts.decommission [04:52:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:53:20] !log bblack@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [04:53:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:53:30] 10Operations, 10ops-esams: rack/setup/install ganeti300[123] - https://phabricator.wikimedia.org/T236216 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ganeti3003.esams.wmnet'] ` Of which those **FAILED**: ` ['ganeti3003.esams.wmnet'] ` [04:53:32] 10Operations, 10DC-Ops, 10Traffic, 10decommission, 10Patch-For-Review: decommission cp3030-3049 - https://phabricator.wikimedia.org/T236454 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by bblack@cumin1001 for hosts: `cp[3036,3038-3041].esams.wmnet` - cp3036.esams.wmnet (**PASS**) -... [04:53:42] !log bblack@cumin1001 START - Cookbook sre.hosts.decommission [04:53:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:54:57] !log bblack@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [04:55:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:55:06] 10Operations, 10DC-Ops, 10Traffic, 10decommission, 10Patch-For-Review: decommission cp3030-3049 - https://phabricator.wikimedia.org/T236454 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by bblack@cumin1001 for hosts: `cp[3042-3046].esams.wmnet` - cp3042.esams.wmnet (**PASS**) - Down... [04:55:16] !log bblack@cumin1001 START - Cookbook sre.hosts.decommission [04:55:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:55:49] RECOVERY - Aggregate IPsec Tunnel Status esams on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [04:56:03] !log bblack@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) [04:56:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:56:14] 10Operations, 10DC-Ops, 10Traffic, 10decommission, 10Patch-For-Review: decommission cp3030-3049 - https://phabricator.wikimedia.org/T236454 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by bblack@cumin1001 for hosts: `cp[3032,3047,3049].esams.wmnet` - cp3032.esams.wmnet (**FAIL**) -... [04:56:40] (03CR) 10BBlack: [C: 03+2] dns decom for cp3030-3049 [dns] - 10https://gerrit.wikimedia.org/r/546006 (owner: 10BBlack) [05:00:03] 10Operations, 10Traffic: ats-tls shows a huge amount of ESTABLISHED sockets even when the server is depooled - https://phabricator.wikimedia.org/T236458 (10Vgutierrez) [05:00:30] 10Operations, 10DC-Ops, 10Traffic, 10decommission, 10Patch-For-Review: decommission cp3030-3049 - https://phabricator.wikimedia.org/T236454 (10BBlack) a:05BBlack→03Papaul [05:03:09] 10Operations, 10ops-esams, 10Traffic, 10decommission: Decommission esams cache_misc hosts - https://phabricator.wikimedia.org/T208585 (10BBlack) [05:03:13] 10Operations, 10ops-esams, 10DC-Ops, 10Patch-For-Review: ESAMS Refresh/Rebuild (October 2019) - https://phabricator.wikimedia.org/T235805 (10BBlack) [05:03:15] !log Applying a SSL handshake timeout of 60 secs on ats-tls/cp5007 - T236458 [05:03:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:03:19] T236458: ats-tls shows a huge amount of ESTABLISHED sockets even when the server is depooled - https://phabricator.wikimedia.org/T236458 [05:04:45] PROBLEM - Check the Netbox report librenms for fail status. 
on netbox1001 is CRITICAL: librenms.LibreNMS CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [05:06:25] (03PS1) 10BBlack: cleanup for old cp3007-10 [puppet] - 10https://gerrit.wikimedia.org/r/546011 (https://phabricator.wikimedia.org/T208585) [05:08:02] (03PS1) 10BBlack: cleanup for old cp3007-10 [dns] - 10https://gerrit.wikimedia.org/r/546012 (https://phabricator.wikimedia.org/T208585) [05:08:04] (03CR) 10BBlack: [C: 03+2] cleanup for old cp3007-10 [puppet] - 10https://gerrit.wikimedia.org/r/546011 (https://phabricator.wikimedia.org/T208585) (owner: 10BBlack) [05:09:06] (03CR) 10BBlack: [C: 03+2] cleanup for old cp3007-10 [dns] - 10https://gerrit.wikimedia.org/r/546012 (https://phabricator.wikimedia.org/T208585) (owner: 10BBlack) [05:11:15] RECOVERY - Check the Netbox report puppetdb for fail status. on netbox1001 is OK: puppetdb.PuppetDB OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports [05:15:44] 10Operations, 10Traffic: ats-tls shows a huge amount of ESTABLISHED sockets even when the server is depooled - https://phabricator.wikimedia.org/T236458 (10Vgutierrez) I'm tracking used TCP sockets on eqsin text nodes in https://grafana.wikimedia.org/d/ivPJtZAWz/t236458?orgId=1&from=now-1h&to=now, I've manuall... [05:16:03] PROBLEM - Check the Netbox report librenms for fail status. 
on netbox1001 is CRITICAL: librenms.LibreNMS CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [05:31:08] 10Operations, 10Traffic: ats-tls shows a huge amount of ESTABLISHED sockets even when the server is depooled - https://phabricator.wikimedia.org/T236458 (10Vgutierrez) p:05Triage→03Normal [05:35:55] !log reimage lvs3007 to let it get the proper partman configuration - T236294 [05:36:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:36:01] T236294: rack/setup/install lvs300[567] - https://phabricator.wikimedia.org/T236294 [05:37:32] 10Operations, 10ops-esams, 10Traffic: rack/setup/install lvs300[567] - https://phabricator.wikimedia.org/T236294 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` ['lvs3007.esams.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/2019102... [05:56:18] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [05:56:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:57:32] (03PS1) 10Vgutierrez: ATS: Enable the SSL handshake timeout and set it to 60 seconds [puppet] - 10https://gerrit.wikimedia.org/r/546014 (https://phabricator.wikimedia.org/T236458) [05:58:11] PROBLEM - Check the Netbox report librenms for fail status. 
on netbox1001 is CRITICAL: librenms.LibreNMS CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [05:58:22] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [05:58:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:01:27] (03PS2) 10Vgutierrez: ATS: Enable the SSL handshake timeout and set it to 60 seconds [puppet] - 10https://gerrit.wikimedia.org/r/546014 (https://phabricator.wikimedia.org/T236458) [06:03:33] (03PS2) 10Giuseppe Lavagetto: conftool::scripts: add a helper script to initialize a node [puppet] - 10https://gerrit.wikimedia.org/r/545838 [06:03:35] (03PS1) 10Giuseppe Lavagetto: profile::cache::base: include initialize script [puppet] - 10https://gerrit.wikimedia.org/r/546015 [06:04:23] (03CR) 10Vgutierrez: "pcc seems happy" [puppet] - 10https://gerrit.wikimedia.org/r/546014 (https://phabricator.wikimedia.org/T236458) (owner: 10Vgutierrez) [06:04:35] (03CR) 10Vgutierrez: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/546014 (https://phabricator.wikimedia.org/T236458) (owner: 10Vgutierrez) [06:07:05] 10Operations, 10ops-esams, 10Traffic: rack/setup/install lvs300[567] - https://phabricator.wikimedia.org/T236294 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['lvs3007.esams.wmnet'] ` and were **ALL** successful. 
[06:11:13] (03CR) 10Giuseppe Lavagetto: blubberoid: Add TLS termination (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/544774 (https://phabricator.wikimedia.org/T210411) (owner: 10Giuseppe Lavagetto) [06:23:15] 10Operations, 10serviceops: SRE FY19-20 Q1 goal: complete the transition to PHP7 - https://phabricator.wikimedia.org/T219127 (10jijiki) [06:23:40] (03PS7) 10Giuseppe Lavagetto: blubberoid: Add TLS termination [deployment-charts] - 10https://gerrit.wikimedia.org/r/544774 (https://phabricator.wikimedia.org/T210411) [06:25:01] 10Operations, 10serviceops: SRE FY19-20 Q1 goal: complete the transition to PHP7 - https://phabricator.wikimedia.org/T219127 (10jijiki) 05Open→03Resolved a:03jijiki I believe this is done:) Removing HHVM is continuing under T229792 [06:38:09] (03CR) 10Elukey: "Andrew: not sure if we need to do it in here but the ferm config of HDFS-related daemons needs to be adjusted to include the labstore node" [puppet] - 10https://gerrit.wikimedia.org/r/545550 (https://phabricator.wikimedia.org/T234229) (owner: 10Ottomata) [06:47:15] (03PS1) 10Vgutierrez: ATS: Provide a generic ats-{instance_name}-restart script [puppet] - 10https://gerrit.wikimedia.org/r/546025 [06:49:52] (03PS2) 10Vgutierrez: ATS: Provide a generic ats-{instance_name}-restart script [puppet] - 10https://gerrit.wikimedia.org/r/546025 [06:54:10] (03CR) 10Vgutierrez: "pcc seems happy: https://puppet-compiler.wmflabs.org/compiler1002/19061/" [puppet] - 10https://gerrit.wikimedia.org/r/546025 (owner: 10Vgutierrez) [07:07:24] RECOVERY - Juniper alarms on csw2-esams is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [07:08:59] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/545652 (https://phabricator.wikimedia.org/T229792) (owner: 10Effie Mouzeli) [07:10:01] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/545418 
(https://phabricator.wikimedia.org/T234209) (owner: 10Cwhite) [07:12:18] (03CR) 10Muehlenhoff: "On the matter of LVM, I'd go with applying our fixes on top of what we have. Rebasing to the latest upstream version might also cause all " [puppet] - 10https://gerrit.wikimedia.org/r/545289 (https://phabricator.wikimedia.org/T162070) (owner: 10Jbond) [07:16:58] PROBLEM - Juniper alarms on csw2-esams is CRITICAL: JNX_ALARMS CRITICAL - 1 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [07:17:08] (03CR) 10Ema: ATS: Provide a generic ats-{instance_name}-restart script (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/546025 (owner: 10Vgutierrez) [07:20:00] PROBLEM - IPMI Sensor Status on ganeti3003 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [07:20:27] (03PS1) 10Ema: secret: dummy key for labweb [labs/private] - 10https://gerrit.wikimedia.org/r/546095 (https://phabricator.wikimedia.org/T210411) [07:21:41] (03CR) 10Muehlenhoff: puppet: manage localcacert in puppet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/545573 (https://phabricator.wikimedia.org/T236277) (owner: 10Jbond) [07:22:45] (03CR) 10Jeena Huneidi: [V: 03+2 C: 03+1] "I think it's an improvement over my approach :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/544629 (owner: 10Giuseppe Lavagetto) [07:26:07] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/545575 (https://phabricator.wikimedia.org/T234332) (owner: 10Jbond) [07:26:33] (03CR) 10Ema: [C: 03+1] ATS: Enable the SSL handshake timeout and set it to 60 seconds [puppet] - 10https://gerrit.wikimedia.org/r/546014 (https://phabricator.wikimedia.org/T236458) (owner: 10Vgutierrez) [07:26:55] (03CR) 10Ema: [V: 03+2 C: 03+2] secret: dummy key for labweb 
[labs/private] - 10https://gerrit.wikimedia.org/r/546095 (https://phabricator.wikimedia.org/T210411) (owner: 10Ema) [07:27:07] (03PS4) 10Jeena Huneidi: [DNM] Update scaffold template names to use chart name [deployment-charts] - 10https://gerrit.wikimedia.org/r/539220 [07:27:46] (03PS1) 10Ema: labweb: add TLS termination with envoy [puppet] - 10https://gerrit.wikimedia.org/r/546097 (https://phabricator.wikimedia.org/T210411) [07:27:48] (03PS1) 10Ema: Add labweb-ssl LVS service [puppet] - 10https://gerrit.wikimedia.org/r/546098 (https://phabricator.wikimedia.org/T210411) [07:27:50] (03PS1) 10Ema: ATS: use TLS to connect to labweb [puppet] - 10https://gerrit.wikimedia.org/r/546099 (https://phabricator.wikimedia.org/T210411) [07:29:29] !log reboot webperf2002 for disk resize T235455 [07:29:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:29:35] T235455: Resolve arclamp disk exhaustion problem (Oct 2019) - https://phabricator.wikimedia.org/T235455 [07:34:04] PROBLEM - Host webperf1002 is DOWN: PING CRITICAL - Packet loss = 100% [07:34:34] RECOVERY - Host webperf1002 is UP: PING OK - Packet loss = 0%, RTA = 0.45 ms [07:35:36] !log reboot webperf1002 for disk resize T235455 [07:35:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:35:41] T235455: Resolve arclamp disk exhaustion problem (Oct 2019) - https://phabricator.wikimedia.org/T235455 [07:35:41] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM, minor comment inline." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/545966 (https://phabricator.wikimedia.org/T236202) (owner: 10Bstorm) [07:36:06] RECOVERY - Juniper alarms on csw2-esams is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [07:38:50] (03CR) 10Ema: [C: 03+2] labweb: add TLS termination with envoy [puppet] - 10https://gerrit.wikimedia.org/r/546097 (https://phabricator.wikimedia.org/T210411) (owner: 10Ema) [07:38:57] (03PS8) 10Giuseppe Lavagetto: blubberoid: Add TLS termination [deployment-charts] - 10https://gerrit.wikimedia.org/r/544774 (https://phabricator.wikimedia.org/T210411) [07:47:30] PROBLEM - Host ganeti3001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [07:48:56] (03PS9) 10Giuseppe Lavagetto: blubberoid: Add TLS termination [deployment-charts] - 10https://gerrit.wikimedia.org/r/544774 (https://phabricator.wikimedia.org/T210411) [07:50:07] (03CR) 10Giuseppe Lavagetto: [C: 03+2] blubberoid: Add TLS termination [deployment-charts] - 10https://gerrit.wikimedia.org/r/544774 (https://phabricator.wikimedia.org/T210411) (owner: 10Giuseppe Lavagetto) [07:50:19] (03Merged) 10jenkins-bot: blubberoid: Add TLS termination [deployment-charts] - 10https://gerrit.wikimedia.org/r/544774 (https://phabricator.wikimedia.org/T210411) (owner: 10Giuseppe Lavagetto) [07:51:45] (03PS3) 10Vgutierrez: ATS: Provide a generic ats-{instance_name}-restart script [puppet] - 10https://gerrit.wikimedia.org/r/546025 [07:52:10] (03CR) 10Vgutierrez: ATS: Provide a generic ats-{instance_name}-restart script (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/546025 (owner: 10Vgutierrez) [07:53:12] RECOVERY - Host ganeti3001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 84.04 ms [07:54:08] RECOVERY - IPMI Sensor Status on ganeti3001 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [07:54:24] 
RECOVERY - IPMI Sensor Status on lvs3005 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [07:55:00] (03CR) 10Vgutierrez: "pcc is still happy: https://puppet-compiler.wmflabs.org/compiler1002/19064/" [puppet] - 10https://gerrit.wikimedia.org/r/546025 (owner: 10Vgutierrez) [07:55:57] (03CR) 10Vgutierrez: [C: 03+2] ATS: Enable the SSL handshake timeout and set it to 60 seconds [puppet] - 10https://gerrit.wikimedia.org/r/546014 (https://phabricator.wikimedia.org/T236458) (owner: 10Vgutierrez) [07:57:14] (03CR) 10Ema: [C: 03+1] "Nice!" [puppet] - 10https://gerrit.wikimedia.org/r/546025 (owner: 10Vgutierrez) [07:58:39] (03CR) 10Vgutierrez: [C: 03+2] ATS: Provide a generic ats-{instance_name}-restart script [puppet] - 10https://gerrit.wikimedia.org/r/546025 (owner: 10Vgutierrez) [08:00:08] PROBLEM - IPMI Sensor Status on lvs3007 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [08:02:22] !log rolling restart of ats-tls to introduce a SSL handshake timeout of 60 secs - T236458 [08:02:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:02:28] T236458: ats-tls shows a huge amount of ESTABLISHED sockets even when the server is depooled - https://phabricator.wikimedia.org/T236458 [08:04:12] (03PS1) 10Arturo Borrero Gonzalez: toolsdb: pin mariadb version [puppet] - 10https://gerrit.wikimedia.org/r/546102 (https://phabricator.wikimedia.org/T236384) [08:08:42] RECOVERY - IPMI Sensor Status on cp3050 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [08:10:35] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolsdb: pin mariadb version [puppet] - 
10https://gerrit.wikimedia.org/r/546102 (https://phabricator.wikimedia.org/T236384) (owner: 10Arturo Borrero Gonzalez) [08:11:08] RECOVERY - IPMI Sensor Status on cp3051 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [08:11:15] (03PS3) 10Muehlenhoff: Also use wmf-user LDAP schema on "labs" LDAP servers [puppet] - 10https://gerrit.wikimedia.org/r/545512 [08:16:26] RECOVERY - IPMI Sensor Status on cp3053 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [08:19:27] 10Operations, 10Traffic: Move cache text cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231627 (10Vgutierrez) [08:21:51] RECOVERY - IPMI Sensor Status on dns3001 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [08:23:48] (03CR) 10Muehlenhoff: [C: 03+2] Also use wmf-user LDAP schema on "labs" LDAP servers [puppet] - 10https://gerrit.wikimedia.org/r/545512 (owner: 10Muehlenhoff) [08:25:42] (03CR) 10Alexandros Kosiaris: "I don't easy a plausible way out of the key issue. 
Aside from using some other trustworthy store I don't see how we could trust the repo w" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/544964 (https://phabricator.wikimedia.org/T235675) (owner: 10Alexandros Kosiaris) [08:26:36] (03PS2) 10Ema: Add labweb-ssl LVS service [puppet] - 10https://gerrit.wikimedia.org/r/546098 (https://phabricator.wikimedia.org/T210411) [08:29:35] RECOVERY - IPMI Sensor Status on cp3054 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [08:29:48] (03CR) 10Ema: [C: 03+2] Add labweb-ssl LVS service [puppet] - 10https://gerrit.wikimedia.org/r/546098 (https://phabricator.wikimedia.org/T210411) (owner: 10Ema) [08:32:31] !log lvs1016: restart pybal to add labweb-ssl T210411 [08:32:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:32:36] T210411: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 [08:33:11] (03PS3) 10Filippo Giunchedi: add bast3004 to esams prometheus nodes [puppet] - 10https://gerrit.wikimedia.org/r/546001 (https://phabricator.wikimedia.org/T236394) (owner: 10Dzahn) [08:33:42] (03CR) 10Muehlenhoff: Add reprepo updates for cassandra311 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/544964 (https://phabricator.wikimedia.org/T235675) (owner: 10Alexandros Kosiaris) [08:34:52] !log ema@cumin1001 conftool action : set/pooled=yes; selector: service=labweb-ssl [08:35:12] (03CR) 10Filippo Giunchedi: [C: 03+2] "I've added bast3004 to the lists as opposed to replace to keep both working in the meantime, merging" [puppet] - 10https://gerrit.wikimedia.org/r/546001 (https://phabricator.wikimedia.org/T236394) (owner: 10Dzahn) [08:35:25] ema@cumin1001: Failed to log message to wiki. Somebody should check the error logs. 
[08:35:46] (03CR) 10Muehlenhoff: Add reprepo updates for cassandra311 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/544964 (https://phabricator.wikimedia.org/T235675) (owner: 10Alexandros Kosiaris) [08:36:18] !log test [08:36:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:37:56] !log lvs1015: restart pybal to add labweb-ssl T210411 [08:38:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:38:01] T210411: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 [08:38:39] (03PS1) 10MarcoAurelio: RESTRouter: Add ge.wm.org; remove ka.wm.org [deployment-charts] - 10https://gerrit.wikimedia.org/r/546111 [08:39:48] (03PS2) 10MarcoAurelio: RESTRouter: Add ge.wm.org; remove ka.wm.org [deployment-charts] - 10https://gerrit.wikimedia.org/r/546111 [08:40:39] (03CR) 10MarcoAurelio: [C: 03+1] add ge.wikimedia.org for Georgia chapter [dns] - 10https://gerrit.wikimedia.org/r/545979 (https://phabricator.wikimedia.org/T236389) (owner: 10Dzahn) [08:41:24] (03PS2) 10Ema: ATS: use TLS to connect to labweb [puppet] - 10https://gerrit.wikimedia.org/r/546099 (https://phabricator.wikimedia.org/T210411) [08:42:16] (03CR) 10Ema: [C: 03+2] ATS: use TLS to connect to labweb [puppet] - 10https://gerrit.wikimedia.org/r/546099 (https://phabricator.wikimedia.org/T210411) (owner: 10Ema) [08:42:47] (03PS7) 10MarcoAurelio: Initial configuration for ge.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545909 (https://phabricator.wikimedia.org/T236389) [08:44:54] (03PS5) 10MarcoAurelio: mediawiki::web:prod_sites.pp: Apache config for ge.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/545889 (https://phabricator.wikimedia.org/T236389) [08:44:58] 10Operations, 10Traffic, 10serviceops, 10Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (10ema) [08:45:03] (03CR) 10MarcoAurelio: "check experimental" [puppet] - 
10https://gerrit.wikimedia.org/r/545889 (https://phabricator.wikimedia.org/T236389) (owner: 10MarcoAurelio) [08:45:34] !log stop prometheus on bast300[24] and done last round of rsync data - T236329 [08:45:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:45:38] T236329: decommission bast3002 - https://phabricator.wikimedia.org/T236329 [08:45:50] I can't english today [08:46:36] ma riesci italiano? [08:47:04] si': pizza+mamma+mandolino [08:47:30] excellent! [08:47:38] (03PS1) 10Vgutierrez: hiera: Move nginx from port 443 to 4443 on cp3050 [puppet] - 10https://gerrit.wikimedia.org/r/546116 (https://phabricator.wikimedia.org/T231627) [08:47:40] (03PS1) 10Vgutierrez: hiera: Move ats-tls from port 8443 to port 443 on cp3050 [puppet] - 10https://gerrit.wikimedia.org/r/546117 (https://phabricator.wikimedia.org/T231627) [08:48:12] !log switch from nginx to ats-tls on cp3050 - T231627 [08:48:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:16] T231627: Move cache text cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231627 [08:48:43] (03CR) 10Vgutierrez: [C: 03+2] hiera: Move nginx from port 443 to 4443 on cp3050 [puppet] - 10https://gerrit.wikimedia.org/r/546116 (https://phabricator.wikimedia.org/T231627) (owner: 10Vgutierrez) [08:49:29] 10Operations, 10DBA, 10Data-Services: Prepare and check storage layer for ge.wikimedia.org - https://phabricator.wikimedia.org/T236404 (10MarcoAurelio) [08:49:57] 10Operations, 10DBA, 10Data-Services: Prepare and check storage layer for ge.wikimedia.org - https://phabricator.wikimedia.org/T236404 (10MarcoAurelio) Note: domain name changed from `kawikimedia` to `gewikimedia`. 
[08:51:07] (03CR) 10Vgutierrez: [C: 03+2] hiera: Move ats-tls from port 8443 to port 443 on cp3050 [puppet] - 10https://gerrit.wikimedia.org/r/546117 (https://phabricator.wikimedia.org/T231627) (owner: 10Vgutierrez) [08:52:34] (03PS7) 10Jbond: puppet: manage localcacert in puppet [puppet] - 10https://gerrit.wikimedia.org/r/545573 (https://phabricator.wikimedia.org/T236277) [08:52:41] PROBLEM - Prometheus bast3002/ops restarted: beware possible monitoring artifacts on bast3002 is CRITICAL: instance=127.0.0.1:9900 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=esams+prometheus/ops [08:52:46] known ^ [08:53:01] (03CR) 10Jbond: puppet: manage localcacert in puppet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/545573 (https://phabricator.wikimedia.org/T236277) (owner: 10Jbond) [08:53:34] (03PS1) 10Ema: cache: reimage cp4030 as text_ats [puppet] - 10https://gerrit.wikimedia.org/r/546118 (https://phabricator.wikimedia.org/T227432) [08:53:58] (03PS3) 10Jbond: puppetmnasters: use localcacert setting for CA file in apache [puppet] - 10https://gerrit.wikimedia.org/r/545575 (https://phabricator.wikimedia.org/T234332) [08:54:11] PROBLEM - Prometheus prometheus2003/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus2003 is CRITICAL: instance=127.0.0.1:9900 job=prometheus site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/global [08:54:25] (03CR) 10Jbond: puppetmnasters: use localcacert setting for CA file in apache (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/545575 (https://phabricator.wikimedia.org/T234332) (owner: 10Jbond) [08:54:35] PROBLEM - Prometheus bast3004/ops restarted: beware possible monitoring artifacts on bast3004 is CRITICAL: instance=127.0.0.1:9900 
job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=esams+prometheus/ops [08:54:51] PROBLEM - Prometheus prometheus1003/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus1003 is CRITICAL: instance=127.0.0.1:9900 job=prometheus site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/global [08:55:06] yes yes [08:55:07] PROBLEM - Prometheus prometheus2004/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus2004 is CRITICAL: instance=127.0.0.1:9900 job=prometheus site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/global [08:55:37] PROBLEM - Ensure traffic_manager binds on 8443 and responds to HTTP requests on cp3050 is CRITICAL: connect to address 10.20.0.50 and port 8443: Connection refused https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [08:55:39] PROBLEM - ats-tls HTTPS en.wikipedia.org ECDSA on cp3050 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [08:55:44] ^^ expected [08:56:54] (03PS2) 10Ema: cache: reimage cp4030 as text_ats [puppet] - 10https://gerrit.wikimedia.org/r/546118 (https://phabricator.wikimedia.org/T227432) [08:57:03] RECOVERY - ats-tls HTTPS en.wikipedia.org ECDSA on cp3050 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 553177 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2020-10-06 12:00:00 +0000 (expires in 347 days) https://wikitech.wikimedia.org/wiki/HTTPS [08:57:17] (03CR) 10Vgutierrez: [C: 03+1] cache: reimage cp4030 as 
text_ats [puppet] - 10https://gerrit.wikimedia.org/r/546118 (https://phabricator.wikimedia.org/T227432) (owner: 10Ema) [08:57:33] (03PS1) 10Filippo Giunchedi: esams: move prometheus to bast3004 [dns] - 10https://gerrit.wikimedia.org/r/546120 (https://phabricator.wikimedia.org/T236329) [08:58:31] !log depool cp4030 and reimage as text_ats T227432 [08:58:35] (03PS8) 10MarcoAurelio: Initial configuration for ge.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545909 (https://phabricator.wikimedia.org/T236389) [08:58:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:58:42] T227432: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 [08:59:02] (03CR) 10Ema: [C: 03+2] cache: reimage cp4030 as text_ats [puppet] - 10https://gerrit.wikimedia.org/r/546118 (https://phabricator.wikimedia.org/T227432) (owner: 10Ema) [08:59:57] (03CR) 10Filippo Giunchedi: [C: 03+2] esams: move prometheus to bast3004 [dns] - 10https://gerrit.wikimedia.org/r/546120 (https://phabricator.wikimedia.org/T236329) (owner: 10Filippo Giunchedi) [09:00:20] (03CR) 10MarcoAurelio: [C: 03+1] add ge.wikimedia.org for Georgia chapter (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/545979 (https://phabricator.wikimedia.org/T236389) (owner: 10Dzahn) [09:00:36] 10Operations, 10Traffic, 10Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp4030.ulsfo.wmnet'] ` The log can be found in `/var/log/wm... 
[09:01:24] !log ema@cumin1001 conftool action : set/weight=100; selector: name=cp4030.ulsfo.wmnet,service=ats-be [09:01:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:02:16] !log disabling persistent journald on db1074 [09:02:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:44] 10Operations, 10DC-Ops, 10decommission, 10Patch-For-Review: decommission bast3002 - https://phabricator.wikimedia.org/T236329 (10fgiunchedi) All data sync'd to bast3004 and DNS flipped, bast3002 can continue decom from my POV, thanks @Dzahn ! [09:04:48] (03Abandoned) 10Jcrespo: monitoring: Enable persistent journal storage for logs on test db hosts [puppet] - 10https://gerrit.wikimedia.org/r/535818 (owner: 10Jcrespo) [09:05:34] (03PS4) 10Jcrespo: Revert "nrpe: Don't set PrivateTmp=True" [puppet] - 10https://gerrit.wikimedia.org/r/464601 [09:05:46] (03PS1) 10MarcoAurelio: Revert "Restrict uploads on azwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/546127 (https://phabricator.wikimedia.org/T236307) [09:06:06] !log going to power down mr1-esams (esams mgmt is going to go down) for 30min the time to move power cables [09:06:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:10] (03CR) 10Jcrespo: [C: 03+1] "Let's do this in November!" 
[puppet] - 10https://gerrit.wikimedia.org/r/464601 (owner: 10Jcrespo) [09:07:20] (03PS1) 10MarcoAurelio: WIP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/546129 [09:10:31] PROBLEM - Host lvs3007.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [09:11:07] PROBLEM - Host ganeti3001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [09:11:11] (03PS2) 10MarcoAurelio: Adjust wgUploadNavigationUrl for azwiki to point to commons' UpWiz [mediawiki-config] - 10https://gerrit.wikimedia.org/r/546129 [09:11:24] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/545573 (https://phabricator.wikimedia.org/T236277) (owner: 10Jbond) [09:11:27] PROBLEM - Host cp3051.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [09:11:27] PROBLEM - Host cp3054.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [09:11:27] PROBLEM - Host cp3065.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [09:11:27] PROBLEM - Host dns3001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [09:11:27] PROBLEM - Host dns3002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [09:11:37] (03PS3) 10MarcoAurelio: Adjust wgUploadNavigationUrl for azwiki to point to commons' UpWiz [mediawiki-config] - 10https://gerrit.wikimedia.org/r/546129 [09:11:51] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/545575 (https://phabricator.wikimedia.org/T234332) (owner: 10Jbond) [09:12:09] 10Operations, 10ops-esams, 10Traffic: rack/setup/install bast3004 - https://phabricator.wikimedia.org/T236394 (10fgiunchedi) Prometheus data sync'd again from bast3002 and copied in place, DNS flipped, Prometheus is live on this host now and not active on bast3002 anymore [09:12:13] (03PS2) 10MarcoAurelio: Revert "Restrict uploads on azwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/546127 (https://phabricator.wikimedia.org/T236307) [09:12:23] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:12:34] (03PS4) 10MarcoAurelio: Adjust wgUploadNavigationUrl for azwiki to point to commons' UpWiz [mediawiki-config] - 10https://gerrit.wikimedia.org/r/546129 [09:12:41] PROBLEM - Host bast3004.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [09:12:55] PROBLEM - Host lvs3005.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [09:12:55] PROBLEM - Host lvs3006.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [09:13:03] PROBLEM - Host cp3060.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [09:13:09] PROBLEM - OSPF status on cr3-esams is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:13:17] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [09:13:37] PROBLEM - Host cp3050.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [09:13:47] PROBLEM - Host cp3052.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [09:13:47] PROBLEM - Host cp3053.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [09:13:49] PROBLEM - Host cp3059.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [09:13:49] PROBLEM - Host cp3061.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [09:13:49] PROBLEM - Host cp3062.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [09:13:49] PROBLEM - Host cp3063.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [09:13:49] PROBLEM - Host cp3064.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [09:14:55] PROBLEM - Host ganeti3003.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [09:15:03] PROBLEM - Host ganeti3002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [09:15:33] PROBLEM - Host cp3055.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [09:15:33] PROBLEM - Host cp3057.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [09:15:33] PROBLEM - Host cp3058.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [09:15:35] RECOVERY - 
BFD status on cr2-esams is OK: OK: UP: 9 AdminDown: 2 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [09:15:59] RECOVERY - BFD status on cr2-knams is OK: OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [09:18:05] RECOVERY - Prometheus prometheus2003/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus2003 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/global [09:19:09] RECOVERY - Host cp3059.mgmt is UP: PING WARNING - Packet loss = 93%, RTA = 84.12 ms [09:19:09] RECOVERY - Host cp3060.mgmt is UP: PING WARNING - Packet loss = 93%, RTA = 84.07 ms [09:19:09] RECOVERY - Host cp3061.mgmt is UP: PING WARNING - Packet loss = 93%, RTA = 84.08 ms [09:19:09] RECOVERY - Host cp3062.mgmt is UP: PING WARNING - Packet loss = 93%, RTA = 84.10 ms [09:19:09] RECOVERY - Host cp3063.mgmt is UP: PING WARNING - Packet loss = 93%, RTA = 84.06 ms [09:19:09] RECOVERY - Host cp3064.mgmt is UP: PING WARNING - Packet loss = 93%, RTA = 84.01 ms [09:19:17] PROBLEM - BFD status on cr2-knams is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [09:19:35] RECOVERY - OSPF status on cr3-esams is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:19:48] 10Operations, 10Traffic: Move cache text cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231627 (10Vgutierrez) [09:20:03] RECOVERY - Host ganeti3003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 84.07 ms [09:20:11] RECOVERY - Host ganeti3002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 84.05 ms [09:20:13] RECOVERY - Host cp3065.mgmt is UP: PING OK - Packet loss = 0%, RTA = 84.08 ms [09:20:21] RECOVERY - Prometheus prometheus1003/global -or a Prometheus 
it scrapes- was restarted: beware possible monitoring artifacts on prometheus1003 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/global [09:20:22] !log ema@cumin1001 START - Cookbook sre.hosts.downtime [09:20:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:20:41] RECOVERY - Host cp3055.mgmt is UP: PING OK - Packet loss = 0%, RTA = 85.77 ms [09:20:41] RECOVERY - Host cp3057.mgmt is UP: PING OK - Packet loss = 0%, RTA = 97.63 ms [09:20:41] RECOVERY - Host cp3058.mgmt is UP: PING OK - Packet loss = 0%, RTA = 98.56 ms [09:21:27] !log powering off mr1-esams again [09:21:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:21:31] RECOVERY - Host lvs3007.mgmt is UP: PING OK - Packet loss = 0%, RTA = 84.04 ms [09:22:28] !log ema@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [09:22:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:16] PROBLEM - Host cp3065.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [09:24:29] RECOVERY - Prometheus bast3004/ops restarted: beware possible monitoring artifacts on bast3004 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=esams+prometheus/ops [09:24:40] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM. For WMCS, please also get the +1 from Andrew Bogott." [puppet] - 10https://gerrit.wikimedia.org/r/545573 (https://phabricator.wikimedia.org/T236277) (owner: 10Jbond) [09:24:50] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM. For WMCS, please also get the +1 from Andrew Bogott." 
[puppet] - 10https://gerrit.wikimedia.org/r/545575 (https://phabricator.wikimedia.org/T234332) (owner: 10Jbond) [09:24:59] PROBLEM - OSPF status on cr3-esams is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:25:42] PROBLEM - Host cp3059.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [09:25:42] PROBLEM - Host cp3060.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [09:25:42] PROBLEM - Host cp3061.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [09:25:42] PROBLEM - Host cp3062.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [09:25:42] PROBLEM - Host cp3063.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [09:25:42] PROBLEM - Host cp3064.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [09:26:06] RECOVERY - Prometheus bast3002/ops restarted: beware possible monitoring artifacts on bast3002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=esams+prometheus/ops [09:26:38] PROBLEM - Host ganeti3003.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [09:26:48] PROBLEM - Host ganeti3002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [09:27:18] PROBLEM - Host cp3055.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [09:27:18] PROBLEM - Host cp3058.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [09:27:18] PROBLEM - Host cp3057.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [09:27:28] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [09:28:10] PROBLEM - Host lvs3007.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [09:30:34] RECOVERY - OSPF status on cr3-esams is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:30:34] PROBLEM - BFD status on cr2-esams is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [09:30:58] RECOVERY - Host cp3059.mgmt is UP: PING OK - Packet loss = 0%, RTA = 85.78 ms [09:30:59] RECOVERY - Host cp3061.mgmt is UP: PING OK - Packet loss = 0%, RTA = 85.85 ms [09:30:59] RECOVERY - Host cp3060.mgmt is UP: PING OK - Packet loss = 0%, RTA = 86.12 ms [09:30:59] RECOVERY - Host cp3062.mgmt is UP: PING OK - Packet loss = 0%, RTA = 87.33 ms [09:30:59] RECOVERY - Host cp3063.mgmt is UP: PING OK - Packet loss = 0%, RTA = 86.92 ms [09:30:59] RECOVERY - Host cp3064.mgmt is UP: PING OK - Packet loss = 0%, RTA = 90.22 ms [09:31:00] the mw errors seems related to wikidata and started at ~8:40, perhaps a script running ? [09:31:54] RECOVERY - Host ganeti3003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 84.07 ms [09:32:06] RECOVERY - Host ganeti3002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 84.05 ms [09:32:35] (03PS1) 10Faidon Liambotis: Make neighbor block comments consistent [dns] - 10https://gerrit.wikimedia.org/r/546132 [09:32:36] RECOVERY - Host cp3055.mgmt is UP: PING OK - Packet loss = 0%, RTA = 85.50 ms [09:32:36] RECOVERY - Host cp3058.mgmt is UP: PING OK - Packet loss = 0%, RTA = 85.36 ms [09:32:36] RECOVERY - Host cp3057.mgmt is UP: PING OK - Packet loss = 0%, RTA = 88.91 ms [09:32:37] (03PS1) 10Faidon Liambotis: Renumber cr2-knams<->cr2-esams neighbor block [dns] - 10https://gerrit.wikimedia.org/r/546133 [09:33:14] RECOVERY - Host bast3004.mgmt is UP: PING OK - Packet loss = 0%, RTA = 88.80 ms [09:33:28] RECOVERY - Host lvs3007.mgmt is UP: PING OK - Packet loss = 0%, RTA = 84.04 ms [09:33:38] RECOVERY - Host cp3051.mgmt is UP: PING OK - Packet loss = 0%, RTA = 85.31 ms [09:33:38] RECOVERY - Host cp3054.mgmt is UP: PING OK - Packet loss = 0%, RTA = 89.01 ms [09:33:40] RECOVERY - Host dns3001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 
84.04 ms [09:33:40] RECOVERY - Host dns3002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 83.98 ms [09:33:40] RECOVERY - Host ganeti3001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 84.02 ms [09:33:58] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:34:06] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/539336 (https://phabricator.wikimedia.org/T233937) (owner: 10Jbond) [09:34:42] RECOVERY - Host lvs3005.mgmt is UP: PING OK - Packet loss = 0%, RTA = 84.12 ms [09:34:42] RECOVERY - Host lvs3006.mgmt is UP: PING OK - Packet loss = 0%, RTA = 83.95 ms [09:35:04] RECOVERY - Host cp3050.mgmt is UP: PING OK - Packet loss = 0%, RTA = 84.03 ms [09:35:08] RECOVERY - Host cp3065.mgmt is UP: PING OK - Packet loss = 0%, RTA = 84.06 ms [09:35:16] RECOVERY - Host cp3052.mgmt is UP: PING OK - Packet loss = 0%, RTA = 84.06 ms [09:35:16] RECOVERY - Host cp3053.mgmt is UP: PING OK - Packet loss = 0%, RTA = 84.05 ms [09:35:32] 10Operations, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 2 others: Ghostscript outputs errors to stdout despite -q, preventing Thumbor from generating some thumbnails properly - https://phabricator.wikimedia.org/T236240 (10Aklapper) >>! In T236240#5601713, @Gilles wrote: > The use of -q mean... [09:36:12] RECOVERY - Prometheus prometheus2004/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus2004 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/global [09:38:44] 10Operations, 10Traffic, 10Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp4030.ulsfo.wmnet'] ` and were **ALL** successful. [09:39:14] !log pool cp4030 with ATS backend T227432 [09:39:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:19] T227432: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 [09:39:28] RECOVERY - Check the Netbox report puppetdb for fail status. on netbox1001 is OK: puppetdb.PuppetDB OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports [09:40:14] PROBLEM - Confd template for /etc/varnish/directors.backend.vcl on cp4030 is CRITICAL: NRPE: Command check_confd_etc_varnish_directors.backend.vcl not defined https://wikitech.wikimedia.org/wiki/Confd [09:40:14] PROBLEM - Varnish HTTP text-backend - port 3128 on cp4030 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 connect failed - 2016 bytes in 0.747 second response time https://wikitech.wikimedia.org/wiki/Varnish [09:40:54] ah, icinga still thinks there's varnish on cp4030 [09:42:41] (03PS4) 10Jbond: apereo_cas: add ability to use groovy script to determine MFA [puppet] - 10https://gerrit.wikimedia.org/r/539336 (https://phabricator.wikimedia.org/T233937) [09:45:02] (03PS20) 10Jcrespo: bacula: Create new backup jobs status check for icinga [puppet] - 10https://gerrit.wikimedia.org/r/544220 (https://phabricator.wikimedia.org/T234900) [09:45:04] (03CR) 10Jcrespo: bacula: Create new backup jobs status check for icinga (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/544220 (https://phabricator.wikimedia.org/T234900) (owner: 10Jcrespo) [09:46:18] (03CR) 10Jbond: [C: 03+2] apereo_cas: add ability to use 
groovy script to determine MFA [puppet] - 10https://gerrit.wikimedia.org/r/539336 (https://phabricator.wikimedia.org/T233937) (owner: 10Jbond) [09:47:07] 10Operations, 10Traffic: ats-tls shows a huge amount of ESTABLISHED sockets even when the server is depooled - https://phabricator.wikimedia.org/T236458 (10Vgutierrez) 05Open→03Resolved [09:47:10] 10Operations, 10Traffic: Move cache text cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231627 (10Vgutierrez) [09:49:04] (03CR) 10Faidon Liambotis: [C: 03+2] Make neighbor block comments consistent [dns] - 10https://gerrit.wikimedia.org/r/546132 (owner: 10Faidon Liambotis) [09:49:34] (03CR) 10Jcrespo: bacula: Create new backup jobs status check for icinga (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/544220 (https://phabricator.wikimedia.org/T234900) (owner: 10Jcrespo) [09:50:55] 10Operations, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 2 others: Ghostscript outputs errors to stdout despite -q, preventing Thumbor from generating some thumbnails properly - https://phabricator.wikimedia.org/T236240 (10Gilles) Indeed, nice find! Adding `-sstdout=%stderr` fixes the issue. [09:51:04] (03CR) 10Faidon Liambotis: [C: 03+2] Renumber cr2-knams<->cr2-esams neighbor block [dns] - 10https://gerrit.wikimedia.org/r/546133 (owner: 10Faidon Liambotis) [09:53:27] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [09:54:43] (03PS1) 10Ema: cache: reimage cp4031 as text_ats [puppet] - 10https://gerrit.wikimedia.org/r/546136 (https://phabricator.wikimedia.org/T227432) [09:55:37] (03CR) 10Vgutierrez: [C: 03+1] cache: reimage cp4031 as text_ats [puppet] - 10https://gerrit.wikimedia.org/r/546136 (https://phabricator.wikimedia.org/T227432) (owner: 10Ema) [09:56:30] !log depool cp4031 and reimage as text_ats T227432 [09:56:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:35] T227432: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 [09:56:55] (03CR) 10Ema: [C: 03+2] cache: reimage cp4031 as text_ats [puppet] - 10https://gerrit.wikimedia.org/r/546136 (https://phabricator.wikimedia.org/T227432) (owner: 10Ema) [09:57:16] 10Operations, 10DBA, 10Data-Services: Prepare and check storage layer for ge.wikimedia.org - https://phabricator.wikimedia.org/T236404 (10jcrespo) Thanks, being public we DBAs are not a blocker, and we just need a heads up when it is finally deployed to the database. [09:58:38] (03PS2) 10Urbanecm: add ge.wikimedia.org for Georgia user group [dns] - 10https://gerrit.wikimedia.org/r/545979 (https://phabricator.wikimedia.org/T236389) (owner: 10Dzahn) [09:58:47] (03CR) 10Urbanecm: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/545979 (https://phabricator.wikimedia.org/T236389) (owner: 10Dzahn) [09:59:03] 10Operations, 10Traffic, 10Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp4031.ulsfo.wmnet'] ` The log can be found in `/var/log/wm... 
[09:59:09] (03CR) 10Urbanecm: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/545889 (https://phabricator.wikimedia.org/T236389) (owner: 10MarcoAurelio) [09:59:42] !log ema@puppetmaster1001 conftool action : set/weight=100; selector: name=cp4031.ulsfo.wmnet,service=ats-be [09:59:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:51] PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on icinga1001 is CRITICAL: 49.76 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [10:01:19] !log swift eqiad-prod: add weight to ms-be105[1-6] - T232367 [10:01:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:24] T232367: (2019-09-15) rack/setup/install ms-be105[1-6].eqiad.wmnet - https://phabricator.wikimedia.org/T232367 [10:01:44] (03PS1) 10Vgutierrez: ATS: Keep retry time against parent servers (varnish-fe) to the bare minimum [puppet] - 10https://gerrit.wikimedia.org/r/546137 [10:06:33] (03CR) 10Filippo Giunchedi: "If we're going the apache proxy route this address could live in puppet only I think" [dns] - 10https://gerrit.wikimedia.org/r/545934 (https://phabricator.wikimedia.org/T231870) (owner: 10CDanis) [10:07:20] (03CR) 10Ema: [C: 03+1] "Looks good, perhaps we should also consider tweaking fail_threshold if we're seeing false positives?" 
[puppet] - 10https://gerrit.wikimedia.org/r/546137 (owner: 10Vgutierrez) [10:07:27] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [10:08:37] ^ it is wikidata again :/ [10:09:08] (03CR) 10Urbanecm: [C: 04-1] Initial configuration for ge.wikimedia.org (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545909 (https://phabricator.wikimedia.org/T236389) (owner: 10MarcoAurelio) [10:09:18] (03CR) 10Vgutierrez: [C: 03+2] ATS: Keep retry time against parent servers (varnish-fe) to the bare minimum [puppet] - 10https://gerrit.wikimedia.org/r/546137 (owner: 10Vgutierrez) [10:12:32] (03CR) 10Filippo Giunchedi: "LGTM, although the use of 'channel' here is a bit confusing to me: AFAICT the patch will change 'type' key in log messages, not 'channel' " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545944 (https://phabricator.wikimedia.org/T235899) (owner: 10Subramanya Sastry) [10:13:51] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [10:14:27] PROBLEM - Check the Netbox report puppetdb for fail status. 
on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [10:14:33] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/545574 (https://phabricator.wikimedia.org/T222074) (owner: 10Volans) [10:17:20] PROBLEM - Juniper alarms on cr3-esams is CRITICAL: JNX_ALARMS CRITICAL - 5 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [10:18:22] RECOVERY - Juniper alarms on cr3-esams is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [10:18:51] !log ema@cumin1001 START - Cookbook sre.hosts.downtime [10:18:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:20:54] !log ema@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [10:20:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:15] RECOVERY - IPMI Sensor Status on cp3062 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [10:22:46] 10Operations, 10Traffic: Elevated 502s observed in ulsfo - https://phabricator.wikimedia.org/T236130 (10ema) For the record, the User-Agent causing this is `FortiGate (FortiOS 5.0)`. 
[10:23:01] RECOVERY - IPMI Sensor Status on cp3061 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [10:23:07] (03PS1) 10Muehlenhoff: Fix groovy_source [puppet] - 10https://gerrit.wikimedia.org/r/546138 [10:23:45] (03PS21) 10Jcrespo: bacula: Create new backup jobs status check for icinga [puppet] - 10https://gerrit.wikimedia.org/r/544220 (https://phabricator.wikimedia.org/T234900) [10:24:45] RECOVERY - IPMI Sensor Status on ganeti3003 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [10:25:39] RECOVERY - IPMI Sensor Status on cp3064 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [10:26:18] 04Critical Alert for device asw2-esams.mgmt.esams.wmnet - Juniper alarm active [10:26:43] RECOVERY - IPMI Sensor Status on cp3063 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [10:27:47] PROBLEM - Router interfaces on cr3-esams is CRITICAL: CRITICAL: host 91.198.174.245, interfaces up: 88, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:29:22] (03CR) 10Jcrespo: "New format to make Riccardo happy:" [puppet] - 10https://gerrit.wikimedia.org/r/544220 (https://phabricator.wikimedia.org/T234900) (owner: 10Jcrespo) [10:29:31] RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on icinga1001 is OK: (C)60 le (W)70 le 71.44 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [10:30:11] RECOVERY - Router interfaces on 
cr3-esams is OK: OK: host 91.198.174.245, interfaces up: 95, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:31:17] RECOVERY - IPMI Sensor Status on lvs3007 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [10:32:36] 10Operations, 10ops-codfw: Degraded RAID on db2120 - https://phabricator.wikimedia.org/T236453 (10jcrespo) a:03Papaul Please ask for a replacement as usual, no emergency here. [10:33:43] (03PS1) 10Jbond: puppet-facts-export: Support multiple puppetdb uri's [puppet] - 10https://gerrit.wikimedia.org/r/546139 (https://phabricator.wikimedia.org/T235655) [10:35:19] 10Operations, 10Traffic, 10Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp4031.ulsfo.wmnet'] ` and were **ALL** successful. [10:35:59] (03CR) 10MarcoAurelio: Initial configuration for ge.wikimedia.org (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545909 (https://phabricator.wikimedia.org/T236389) (owner: 10MarcoAurelio) [10:37:11] (03PS9) 10MarcoAurelio: Initial configuration for ge.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545909 (https://phabricator.wikimedia.org/T236389) [10:40:33] RECOVERY - Check the Netbox report puppetdb for fail status. 
on netbox1001 is OK: puppetdb.PuppetDB OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports [10:47:37] 10Operations, 10Puppet, 10puppet-compiler: Cleanup the puppetmaster module so that we stop breaking expectations (and the puppet compiler) - https://phabricator.wikimedia.org/T211547 (10jbond) [10:51:02] PROBLEM - Host cp3008 is DOWN: PING CRITICAL - Packet loss = 100% [10:51:10] RECOVERY - BFD status on cr2-esams is OK: OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [10:51:31] is that cp3008 alert legitimate? [10:51:34] XioNoX: ^ [10:54:18] (03PS1) 10Filippo Giunchedi: bast3002: decom phase 1 [puppet] - 10https://gerrit.wikimedia.org/r/546140 (https://phabricator.wikimedia.org/T236329) [10:54:56] cp3008 is decommed on the software level; https://phabricator.wikimedia.org/T208585 maybe it's getting decommed currently? [10:55:02] ah o [10:55:02] k [10:55:59] (03PS1) 10Faidon Liambotis: Correct interface names in the esams neighbor blocks [dns] - 10https://gerrit.wikimedia.org/r/546142 [10:56:33] yeah [10:56:47] we're wiping all the old CP servers [10:56:52] (03CR) 10Faidon Liambotis: [C: 03+2] Correct interface names in the esams neighbor blocks [dns] - 10https://gerrit.wikimedia.org/r/546142 (owner: 10Faidon Liambotis) [10:56:53] is there any that shouldn't be? 
[10:57:07] (I wouldn't know) [11:00:02] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/546140 (https://phabricator.wikimedia.org/T236329) (owner: 10Filippo Giunchedi) [11:03:08] RECOVERY - Check systemd state on ms-be2047 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:03:14] (03CR) 10Filippo Giunchedi: [C: 03+2] bast3002: decom phase 1 [puppet] - 10https://gerrit.wikimedia.org/r/546140 (https://phabricator.wikimedia.org/T236329) (owner: 10Filippo Giunchedi) [11:05:36] RECOVERY - BFD status on cr2-knams is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:05:44] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be2047 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [11:07:30] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/546138 (owner: 10Muehlenhoff) [11:08:03] (03PS1) 10Jbond: hiera_lookup: add message pointing to `puppet lookup` [puppet] - 10https://gerrit.wikimedia.org/r/546143 [11:10:59] (03PS2) 10Muehlenhoff: Fix groovy_source [puppet] - 10https://gerrit.wikimedia.org/r/546138 [11:13:33] (03CR) 10Muehlenhoff: [C: 03+2] Fix groovy_source [puppet] - 10https://gerrit.wikimedia.org/r/546138 (owner: 10Muehlenhoff) [11:14:41] (03CR) 10Mobrovac: [V: 03+2 C: 03+2] RESTRouter: Add ge.wm.org; remove ka.wm.org [deployment-charts] - 10https://gerrit.wikimedia.org/r/546111 (owner: 10MarcoAurelio) [11:26:53] (03PS1) 10Muehlenhoff: Reference the correct groovy source [puppet] - 10https://gerrit.wikimedia.org/r/546144 [11:27:04] 10Operations, 10Puppet, 10Release-Engineering-Team, 10puppet-compiler: add compiler1003 to jenkins - https://phabricator.wikimedia.org/T236468 (10jbond) [11:31:18] 04̶C̶r̶i̶t̶i̶c̶a̶l Device asw2-esams.mgmt.esams.wmnet recovered from Juniper alarm active [11:39:14] 
(03PS1) 10Arturo Borrero Gonzalez: toolforge: k8s: ingress: fix binding reference names [puppet] - 10https://gerrit.wikimedia.org/r/546145 (https://phabricator.wikimedia.org/T236074) [11:40:19] 10Operations, 10Puppet, 10Release-Engineering-Team, 10puppet-compiler: add compiler1003 to jenkins - https://phabricator.wikimedia.org/T236468 (10jbond) @hashar has added this to Jenkins now https://integration.wikimedia.org/ci/label/puppet-compiler-node/. for future reference this is something done manua... [11:40:29] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: k8s: ingress: fix binding reference names [puppet] - 10https://gerrit.wikimedia.org/r/546145 (https://phabricator.wikimedia.org/T236074) (owner: 10Arturo Borrero Gonzalez) [11:52:34] (03PS1) 10Jbond: puppet_compiler: Add puppet version to the PCC report [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/546146 (https://phabricator.wikimedia.org/T236468) [11:59:07] (03CR) 10Jbond: [C: 03+1] Reference the correct groovy source [puppet] - 10https://gerrit.wikimedia.org/r/546144 (owner: 10Muehlenhoff) [12:03:42] (03CR) 10Muehlenhoff: [C: 03+2] Reference the correct groovy source [puppet] - 10https://gerrit.wikimedia.org/r/546144 (owner: 10Muehlenhoff) [12:05:07] (03PS1) 10Muehlenhoff: Rename groovy script [puppet] - 10https://gerrit.wikimedia.org/r/546155 [12:08:24] RECOVERY - snapshot of s4 in eqiad on db1115 is OK: snapshot for s4 at eqiad taken less than 4 days ago and larger than 90 GB: Last one 2019-10-25 09:38:46 from db1102.eqiad.wmnet:3314 (1080 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups [12:11:05] !log pool cp4031 with ATS backend T227432 [12:11:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:11] T227432: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 [12:13:16] (03CR) 10Jbond: [C: 03+1] Rename groovy script [puppet] - 10https://gerrit.wikimedia.org/r/546155 (owner: 10Muehlenhoff) 
[12:28:52] RECOVERY - Check the Netbox report librenms for fail status. on netbox1001 is OK: librenms.LibreNMS OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports [12:29:30] (03CR) 10Jcrespo: [C: 03+2] "I am going to deploy this based on +1/LGTM received, that way I can start working on enhancements while we observe its behaviour in produc" [puppet] - 10https://gerrit.wikimedia.org/r/544220 (https://phabricator.wikimedia.org/T234900) (owner: 10Jcrespo) [12:29:43] (03PS22) 10Jcrespo: bacula: Create new backup jobs status check for icinga [puppet] - 10https://gerrit.wikimedia.org/r/544220 (https://phabricator.wikimedia.org/T234900) [12:30:34] (03PS2) 10Faidon Liambotis: Remove esams exclusion [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/544389 (owner: 10Ayounsi) [12:30:45] (03CR) 10Faidon Liambotis: [C: 03+2] Remove esams exclusion [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/544389 (owner: 10Ayounsi) [12:31:02] (03PS2) 10Muehlenhoff: Rename groovy script [puppet] - 10https://gerrit.wikimedia.org/r/546155 [12:31:49] (03PS1) 10Ema: cache: reimage cp4032 as text_ats [puppet] - 10https://gerrit.wikimedia.org/r/546163 (https://phabricator.wikimedia.org/T227432) [12:33:58] (03CR) 10Gehel: [C: 04-1] "see inline" (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/545673 (owner: 10Mathew.onipe) [12:34:17] !log introducing new freshnesh check for bacula T234900 [12:34:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:22] T234900: Setup bacula backup monitoring - https://phabricator.wikimedia.org/T234900 [12:34:31] I expect that to fail^ (not the deploy, the check) [12:37:04] lol, it failed, but not rightly [12:37:58] PROBLEM - Check the Netbox report management for fail status. 
on netbox1001 is CRITICAL: management.ManagementConsole CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [12:42:22] (03PS1) 10Jcrespo: bacula: Fix bacula_check.py location [puppet] - 10https://gerrit.wikimedia.org/r/546164 (https://phabricator.wikimedia.org/T234900) [12:43:13] (03CR) 10Jcrespo: "No one saw this!" [puppet] - 10https://gerrit.wikimedia.org/r/546164 (https://phabricator.wikimedia.org/T234900) (owner: 10Jcrespo) [12:43:16] (03CR) 10Jcrespo: [C: 03+2] bacula: Fix bacula_check.py location [puppet] - 10https://gerrit.wikimedia.org/r/546164 (https://phabricator.wikimedia.org/T234900) (owner: 10Jcrespo) [12:44:36] (03CR) 10Muehlenhoff: [C: 03+2] Rename groovy script [puppet] - 10https://gerrit.wikimedia.org/r/546155 (owner: 10Muehlenhoff) [12:45:20] PROBLEM - Backup freshness on helium is CRITICAL: NRPE: Unable to read output https://wikitech.wikimedia.org/wiki/Backups%23Monitoring [12:52:08] 10Operations, 10Puppet, 10Release-Engineering-Team, 10puppet-compiler, 10Patch-For-Review: add compiler1003 to jenkins - https://phabricator.wikimedia.org/T236468 (10hashar) That is indeed done via the web interface. The new node can be seen at: https://integration.wikimedia.org/ci/computer/compiler1003.... [12:53:23] 10Operations, 10Puppet, 10observability: update failed puppet checkes so that they go critical 24 hours - https://phabricator.wikimedia.org/T236478 (10jbond) p:05Triage→03Normal [12:54:18] (03PS1) 10Jbond: check_puppetrun: dont alert for diabled puppet agents for 1 day [puppet] - 10https://gerrit.wikimedia.org/r/546165 (https://phabricator.wikimedia.org/T236478) [12:54:36] PROBLEM - Check the Netbox report librenms for fail status. 
on netbox1001 is CRITICAL: librenms.LibreNMS CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [12:58:24] (03PS4) 10Mathew.onipe: wdqs: Use a DRYer approach to check selected hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/545673 [12:58:26] (03PS13) 10Mathew.onipe: wdqs: add data-reload cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/540153 (https://phabricator.wikimedia.org/T230588) [13:01:28] 10Operations, 10Puppet, 10observability, 10Patch-For-Review: update failed puppet checkes so that they go critical 24 hours - https://phabricator.wikimedia.org/T236478 (10jbond) I toko a look at how the mod gets the last puppet run data and iut just dose the following `stat -c %Z /var/lib/puppet/state/clas... [13:04:46] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [13:05:29] !log depool cp4032 and reimage as text_ats T227432 [13:05:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:34] T227432: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 [13:06:22] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. 
https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [13:06:25] (03CR) 10Ema: [C: 03+2] cache: reimage cp4032 as text_ats [puppet] - 10https://gerrit.wikimedia.org/r/546163 (https://phabricator.wikimedia.org/T227432) (owner: 10Ema) [13:07:03] !log ema@cumin1001 conftool action : set/weight=100; selector: name=cp4032.ulsfo.wmnet,service=ats-be [13:07:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:47] 10Operations, 10Traffic, 10Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp4032.ulsfo.wmnet'] ` The log can be found in `/var/log/wm... [13:10:22] PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on icinga1001 is CRITICAL: 0.06839 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [13:11:56] 10Operations, 10Traffic: Temporarily use ganeti3003 as ns2 authdns - https://phabricator.wikimedia.org/T236479 (10BBlack) p:05Triage→03Normal [13:14:17] (03PS1) 10BBlack: ganeti3003: provision as authdns::server [puppet] - 10https://gerrit.wikimedia.org/r/546171 (https://phabricator.wikimedia.org/T236479) [13:14:19] (03PS1) 10BBlack: authdns_servers: add ganeti3003 [puppet] - 10https://gerrit.wikimedia.org/r/546172 (https://phabricator.wikimedia.org/T236479) [13:14:21] (03PS1) 10BBlack: authdns_servers: remove multatuli [puppet] - 10https://gerrit.wikimedia.org/r/546173 (https://phabricator.wikimedia.org/T236479) [13:14:58] (03PS1) 10Filippo Giunchedi: Set bast3002 to spare [puppet] - 10https://gerrit.wikimedia.org/r/546175 (https://phabricator.wikimedia.org/T236329) [13:15:16] (03CR) 10Ottomata: [C: 03+2] "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/545550 (https://phabricator.wikimedia.org/T234229) (owner: 
10Ottomata) [13:15:24] (03PS3) 10Ottomata: Include hadoop client packages and config on dumps distribution servers [puppet] - 10https://gerrit.wikimedia.org/r/545550 (https://phabricator.wikimedia.org/T234229) [13:15:58] (03PS1) 10Ema: prometheus: add text_ats mtail targets [puppet] - 10https://gerrit.wikimedia.org/r/546176 (https://phabricator.wikimedia.org/T227432) [13:16:10] (03CR) 10Filippo Giunchedi: [C: 03+1] "Email sent to ops@ about bast3002 going away" [puppet] - 10https://gerrit.wikimedia.org/r/546175 (https://phabricator.wikimedia.org/T236329) (owner: 10Filippo Giunchedi) [13:16:50] RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [13:17:00] (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus: add text_ats mtail targets [puppet] - 10https://gerrit.wikimedia.org/r/546176 (https://phabricator.wikimedia.org/T227432) (owner: 10Ema) [13:18:05] (03PS2) 10BBlack: ganeti3003: provision as authdns::server [puppet] - 10https://gerrit.wikimedia.org/r/546171 (https://phabricator.wikimedia.org/T236479) [13:18:50] (03PS1) 10Jcrespo: bacula: Allow nagios user to execute bacula check as bacula user [puppet] - 10https://gerrit.wikimedia.org/r/546177 (https://phabricator.wikimedia.org/T236423) [13:19:48] (03CR) 10Ema: [C: 03+2] prometheus: add text_ats mtail targets [puppet] - 10https://gerrit.wikimedia.org/r/546176 (https://phabricator.wikimedia.org/T227432) (owner: 10Ema) [13:21:13] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM, but note that with https://wikitech.wikimedia.org/wiki/Decom_script this intermediate step isn't really needed, you can simply drop " [puppet] - 10https://gerrit.wikimedia.org/r/546175 (https://phabricator.wikimedia.org/T236329) (owner: 10Filippo Giunchedi) [13:24:17] (03CR) 10Filippo Giunchedi: [C: 03+1] "> Patch Set 1: 
Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/546175 (https://phabricator.wikimedia.org/T236329) (owner: 10Filippo Giunchedi) [13:24:52] (03PS2) 10BBlack: authdns_servers: add ganeti3003 [puppet] - 10https://gerrit.wikimedia.org/r/546172 (https://phabricator.wikimedia.org/T236479) [13:24:54] (03PS2) 10BBlack: authdns_servers: remove multatuli [puppet] - 10https://gerrit.wikimedia.org/r/546173 (https://phabricator.wikimedia.org/T236479) [13:24:56] (03PS1) 10BBlack: Dots have meaning in regexes :P [puppet] - 10https://gerrit.wikimedia.org/r/546180 [13:26:22] (03PS2) 10Effie Mouzeli: hhvm: remove hhvm leftovers from apache configs [puppet] - 10https://gerrit.wikimedia.org/r/545652 (https://phabricator.wikimedia.org/T229792) [13:28:39] !log ema@cumin1001 START - Cookbook sre.hosts.downtime [13:28:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:08] (03PS2) 10Jcrespo: bacula: Allow nagios user to execute bacula check as bacula user [puppet] - 10https://gerrit.wikimedia.org/r/546177 (https://phabricator.wikimedia.org/T236423) [13:29:34] (03CR) 10BBlack: [C: 03+2] ganeti3003: provision as authdns::server [puppet] - 10https://gerrit.wikimedia.org/r/546171 (https://phabricator.wikimedia.org/T236479) (owner: 10BBlack) [13:30:40] !log filippo@cumin1001 START - Cookbook sre.hosts.decommission [13:30:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:47] !log ema@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:30:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:55] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [13:30:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:02] 10Operations, 10DC-Ops, 10decommission, 10Patch-For-Review: decommission bast3002 - https://phabricator.wikimedia.org/T236329 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by 
filippo@cumin1001 for hosts: `bast3002.wikimedia.org` - bast3002.wikimedia.org (**PASS**) - Downtimed host on... [13:31:16] (03PS3) 10CDanis: wmftest.org: add graphite [dns] - 10https://gerrit.wikimedia.org/r/545934 (https://phabricator.wikimedia.org/T231870) [13:31:33] 10Operations, 10Puppet: puppet should utilise Binary and binary_file for binary data - https://phabricator.wikimedia.org/T236481 (10jbond) p:05Triage→03Normal [13:31:44] (03CR) 10CDanis: "> Patch Set 2:" [dns] - 10https://gerrit.wikimedia.org/r/545934 (https://phabricator.wikimedia.org/T231870) (owner: 10CDanis) [13:32:06] (03PS3) 10Jcrespo: bacula: Allow nagios user to execute bacula check as bacula user [puppet] - 10https://gerrit.wikimedia.org/r/546177 (https://phabricator.wikimedia.org/T236423) [13:32:09] (03CR) 10CDanis: "> Patch Set 2:" [dns] - 10https://gerrit.wikimedia.org/r/545934 (https://phabricator.wikimedia.org/T231870) (owner: 10CDanis) [13:32:24] (03PS1) 10Jbond: apereo_case: use Binary for storing the keystore [puppet] - 10https://gerrit.wikimedia.org/r/546181 (https://phabricator.wikimedia.org/T236481) [13:32:45] (03PS2) 10Filippo Giunchedi: Decom bast3002 [puppet] - 10https://gerrit.wikimedia.org/r/546175 (https://phabricator.wikimedia.org/T236329) [13:35:41] (03PS1) 10Jhedden: ceph: add etcd and k8s profile for rook [puppet] - 10https://gerrit.wikimedia.org/r/546182 (https://phabricator.wikimedia.org/T236290) [13:36:01] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/546175 (https://phabricator.wikimedia.org/T236329) (owner: 10Filippo Giunchedi) [13:36:10] 10Operations, 10Traffic, 10Patch-For-Review: Temporarily use ganeti3003 as ns2 authdns - https://phabricator.wikimedia.org/T236479 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by bblack on cumin1001.eqiad.wmnet for hosts: ` ['ganeti3003.esams.wmnet'] ` The log can be found in `/var/log/wmf-aut... 
[13:36:50] (03CR) 10Filippo Giunchedi: [C: 03+2] Decom bast3002 [puppet] - 10https://gerrit.wikimedia.org/r/546175 (https://phabricator.wikimedia.org/T236329) (owner: 10Filippo Giunchedi) [13:36:53] (03PS5) 10Mathew.onipe: wdqs: Use a DRYer approach to check selected hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/545673 [13:36:55] (03PS14) 10Mathew.onipe: wdqs: add data-reload cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/540153 (https://phabricator.wikimedia.org/T230588) [13:36:59] (03PS3) 10Filippo Giunchedi: Decom bast3002 [puppet] - 10https://gerrit.wikimedia.org/r/546175 (https://phabricator.wikimedia.org/T236329) [13:37:01] (03CR) 10Jhedden: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/546182 (https://phabricator.wikimedia.org/T236290) (owner: 10Jhedden) [13:37:46] 10Operations, 10Traffic, 10observability: Add ats-tls status and availability graphs to frontend-traffic - https://phabricator.wikimedia.org/T236482 (10ema) [13:37:53] 10Operations, 10Traffic, 10observability: Add ats-tls status and availability graphs to frontend-traffic - https://phabricator.wikimedia.org/T236482 (10ema) p:05Triage→03Normal [13:39:26] (03CR) 10Jbond: [C: 03+2] apereo_case: use Binary for storing the keystore [puppet] - 10https://gerrit.wikimedia.org/r/546181 (https://phabricator.wikimedia.org/T236481) (owner: 10Jbond) [13:39:36] 10Operations, 10DC-Ops, 10decommission, 10Patch-For-Review: decommission bast3002 - https://phabricator.wikimedia.org/T236329 (10fgiunchedi) [13:40:44] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/546177 (https://phabricator.wikimedia.org/T236423) (owner: 10Jcrespo) [13:41:10] (03PS1) 10Jbond: Revert "apereo_case: use Binary for storing the keystore" [puppet] - 10https://gerrit.wikimedia.org/r/546183 [13:41:12] (03PS1) 10Filippo Giunchedi: Decom bast3002 [dns] - 10https://gerrit.wikimedia.org/r/546184 (https://phabricator.wikimedia.org/T236329) [13:43:21] (03CR) 
10jerkins-bot: [V: 04-1] Revert "apereo_case: use Binary for storing the keystore" [puppet] - 10https://gerrit.wikimedia.org/r/546183 (owner: 10Jbond) [13:43:23] (03CR) 10Jcrespo: [C: 03+2] bacula: Allow nagios user to execute bacula check as bacula user [puppet] - 10https://gerrit.wikimedia.org/r/546177 (https://phabricator.wikimedia.org/T236423) (owner: 10Jcrespo) [13:43:50] (03PS2) 10Jbond: Revert "apereo_case: use Binary for storing the keystore" [puppet] - 10https://gerrit.wikimedia.org/r/546183 [13:43:54] 10Operations, 10Traffic, 10Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp4032.ulsfo.wmnet'] ` and were **ALL** successful. [13:44:01] (03CR) 10Filippo Giunchedi: [C: 03+2] Decom bast3002 [dns] - 10https://gerrit.wikimedia.org/r/546184 (https://phabricator.wikimedia.org/T236329) (owner: 10Filippo Giunchedi) [13:46:12] 10Operations, 10DC-Ops, 10decommission, 10Patch-For-Review: decommission bast3002 - https://phabricator.wikimedia.org/T236329 (10fgiunchedi) [13:46:32] (03CR) 10Jbond: [C: 03+2] Revert "apereo_case: use Binary for storing the keystore" [puppet] - 10https://gerrit.wikimedia.org/r/546183 (owner: 10Jbond) [13:47:04] 10Operations, 10DC-Ops, 10decommission, 10Patch-For-Review: decommission bast3002 - https://phabricator.wikimedia.org/T236329 (10fgiunchedi) a:05Dzahn→03Papaul I took over from @Dzahn in the interest of time, @Papaul host is ready for on site steps! 
[13:48:57] !log depool mw1334 and pool back [13:48:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:03] 10Operations, 10Puppet, 10Release-Engineering-Team, 10puppet-compiler, 10Patch-For-Review: add compiler1003 to jenkins - https://phabricator.wikimedia.org/T236468 (10jbond) p:05Triage→03Normal [13:53:35] (03PS6) 10Mathew.onipe: wdqs: Use a DRYer approach to check selected hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/545673 [13:53:37] (03PS15) 10Mathew.onipe: wdqs: add data-reload cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/540153 (https://phabricator.wikimedia.org/T230588) [13:54:24] (03PS1) 10Jcrespo: bacula: Actually run the check with sudo -u bacula [puppet] - 10https://gerrit.wikimedia.org/r/546187 (https://phabricator.wikimedia.org/T236423) [13:54:56] (03PS2) 10Jhedden: ceph: add etcd and k8s master profile for rook [puppet] - 10https://gerrit.wikimedia.org/r/546182 (https://phabricator.wikimedia.org/T236290) [13:55:23] !log bblack@cumin1001 START - Cookbook sre.hosts.downtime [13:55:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:26] (03CR) 10jerkins-bot: [V: 04-1] bacula: Actually run the check with sudo -u bacula [puppet] - 10https://gerrit.wikimedia.org/r/546187 (https://phabricator.wikimedia.org/T236423) (owner: 10Jcrespo) [13:56:42] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10Scoring-platform-team, 10Patch-For-Review: Grant LDAP groups and deployment shell access to Kevin Bazira - https://phabricator.wikimedia.org/T234209 (10Halfak) a:05kevinbazira→03colewhite Hi @colewhite, I'm re-assigning to you given that... 
[13:57:05] !log pool cp4032 with ATS backend T227432 [13:57:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:10] T227432: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 [13:57:28] !log bblack@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:57:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:37] (03PS2) 10Jcrespo: bacula: Actually run the check with sudo -u bacula [puppet] - 10https://gerrit.wikimedia.org/r/546187 (https://phabricator.wikimedia.org/T236423) [13:58:55] (03PS1) 10Ottomata: Allow labstore hosts to contact Hadoop [puppet] - 10https://gerrit.wikimedia.org/r/546189 (https://phabricator.wikimedia.org/T234229) [13:59:51] (03CR) 10jerkins-bot: [V: 04-1] Allow labstore hosts to contact Hadoop [puppet] - 10https://gerrit.wikimedia.org/r/546189 (https://phabricator.wikimedia.org/T234229) (owner: 10Ottomata) [14:00:54] 10Operations, 10Traffic, 10Patch-For-Review: Temporarily use ganeti3003 as ns2 authdns - https://phabricator.wikimedia.org/T236479 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ganeti3003.esams.wmnet'] ` Of which those **FAILED**: ` ['ganeti3003.esams.wmnet'] ` [14:00:58] (03CR) 10Jcrespo: [C: 03+2] bacula: Actually run the check with sudo -u bacula [puppet] - 10https://gerrit.wikimedia.org/r/546187 (https://phabricator.wikimedia.org/T236423) (owner: 10Jcrespo) [14:01:28] 10Operations: Cannot access production replica database from mwmaint1002 - https://phabricator.wikimedia.org/T236486 (10dbarratt) [14:01:36] (03CR) 10Gehel: [C: 03+2] wdqs: Use a DRYer approach to check selected hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/545673 (owner: 10Mathew.onipe) [14:01:38] (03PS2) 10Ottomata: Allow labstore hosts to contact Hadoop [puppet] - 10https://gerrit.wikimedia.org/r/546189 (https://phabricator.wikimedia.org/T234229) [14:02:07] (03CR) 10Gehel: [C: 03+2] wdqs: add data-reload cookbook 
[cookbooks] - 10https://gerrit.wikimedia.org/r/540153 (https://phabricator.wikimedia.org/T230588) (owner: 10Mathew.onipe) [14:03:12] 10Operations: Cannot access production replica database from mwmaint1002 - https://phabricator.wikimedia.org/T236486 (10Reedy) You can just do `sql enwiki` which will connect you to a slave - in most cases you don't need to specify a replica [14:03:42] (03CR) 10jerkins-bot: [V: 04-1] Allow labstore hosts to contact Hadoop [puppet] - 10https://gerrit.wikimedia.org/r/546189 (https://phabricator.wikimedia.org/T234229) (owner: 10Ottomata) [14:03:49] RECOVERY - Backup freshness on helium is OK: All failures: 13 (bromine, ...), Stale: 5 (matomo1001, ...), Fresh: 78 jobs https://wikitech.wikimedia.org/wiki/Backups%23Monitoring [14:04:24] 10Operations: Cannot access production replica database from mwmaint1002 - https://phabricator.wikimedia.org/T236486 (10Reedy) I note https://noc.wikimedia.org/conf/highlight.php?file=db-eqiad.php doesn't know about db1114, which will explain the error [14:04:53] (03PS1) 10Ema: prometheus: load text_ats varnish targets [puppet] - 10https://gerrit.wikimedia.org/r/546190 (https://phabricator.wikimedia.org/T227432) [14:05:18] 10Operations: Cannot access production replica database from mwmaint1002 - https://phabricator.wikimedia.org/T236486 (10Reedy) And I guess as it's listed as a "backup testing host", it shouldn't be used for random queries anyway https://github.com/wikimedia/puppet/blob/f63030be76061096b3bc173957948b15ce5ae15b/m... [14:05:26] 10Operations: Cannot access production replica database from mwmaint1002 - https://phabricator.wikimedia.org/T236486 (10dbarratt) >>! In T236486#5606486, @Reedy wrote: > You can just do `sql enwiki` which will connect you to a slave - in most cases you don't need to specify a replica Oh! that's fancy. Thanks! 
[14:05:58] 10Operations, 10Traffic, 10Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (10ema) [14:06:00] 10Operations: Cannot access production replica database from mwmaint1002 - https://phabricator.wikimedia.org/T236486 (10dbarratt) 05Open→03Invalid [14:06:34] (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus: load text_ats varnish targets [puppet] - 10https://gerrit.wikimedia.org/r/546190 (https://phabricator.wikimedia.org/T227432) (owner: 10Ema) [14:06:52] (03CR) 10jerkins-bot: [V: 04-1] prometheus: load text_ats varnish targets [puppet] - 10https://gerrit.wikimedia.org/r/546190 (https://phabricator.wikimedia.org/T227432) (owner: 10Ema) [14:07:01] 10Operations: Cannot access production replica database from mwmaint1002 - https://phabricator.wikimedia.org/T236486 (10CDanis) @Reedy that's the wrong place to look nowadays. The two good places to look for production database configs: https://noc.wikimedia.org/db.php https://noc.wikimedia.org/dbconfig/eqiad.j... [14:07:54] 10Operations: Cannot access production replica database from mwmaint1002 - https://phabricator.wikimedia.org/T236486 (10Reedy) >>! In T236486#5606501, @CDanis wrote: > @Reedy that's the wrong place to look nowadays. The two good places to look for production database configs: > https://noc.wikimedia.org/db.php... 
[14:09:45] !log reboot ganeti3003 [14:09:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:01] !log gehel@cumin1001 START - Cookbook sre.wdqs.data-reload [14:10:01] !log gehel@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) [14:10:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:32] Reedy: indeed db1114 still isn't in the configs, but, wanted you to be aware :) [14:10:42] I've just fixed the docs [14:10:48] Telling people to select a host themselves is stupid [14:10:57] thanks! I was about to do that [14:10:59] +1 [14:11:01] (03CR) 10Ema: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/546190 (https://phabricator.wikimedia.org/T227432) (owner: 10Ema) [14:12:51] (03PS3) 10Ottomata: Allow labstore hosts to contact Hadoop [puppet] - 10https://gerrit.wikimedia.org/r/546189 (https://phabricator.wikimedia.org/T234229) [14:13:01] (03PS2) 10Subramanya Sastry: Direct Parsoid/PHP logs to a parsoid-php log "type" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545944 (https://phabricator.wikimedia.org/T235899) [14:14:13] PROBLEM - Check the Netbox report puppetdb for fail status. 
on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [14:15:00] !log gehel@cumin1001 START - Cookbook sre.wdqs.data-reload [14:15:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:19] hashar: I'm getting some unexpected errors from jenkins -> https://integration.wikimedia.org/ci/job/operations-puppet-tests-stretch-docker/24559/console [14:15:25] > 16:11:22 fatal: read error: Connection reset by peer [14:16:11] ema: o/ [14:16:47] ema: yeah recheck [14:16:49] digging into it [14:16:52] (03CR) 10BBlack: [C: 03+2] authdns_servers: add ganeti3003 [puppet] - 10https://gerrit.wikimedia.org/r/546172 (https://phabricator.wikimedia.org/T236479) (owner: 10BBlack) [14:16:53] I think that is a git-daemon fault [14:16:56] thanks hashar! [14:17:04] (03CR) 10Ema: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/546190 (https://phabricator.wikimedia.org/T227432) (owner: 10Ema) [14:17:09] (03PS3) 10BBlack: authdns_servers: add ganeti3003 [puppet] - 10https://gerrit.wikimedia.org/r/546172 (https://phabricator.wikimedia.org/T236479) [14:17:10] !log gehel@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) [14:17:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:13] Oct 25 14:11:22 contint2001 git-daemon[27075]: fatal: packet write with format failed: Connection reset by peer [14:18:19] ema: that is on the git-daemon side :\ [14:20:47] (03CR) 10Ottomata: "https://puppet-compiler.wmflabs.org/compiler1002/19072/" [puppet] - 10https://gerrit.wikimedia.org/r/546189 (https://phabricator.wikimedia.org/T234229) (owner: 10Ottomata) [14:21:06] (03PS1) 10Jcrespo: bacula: Fix check so it produces the right nagios error code [puppet] - 10https://gerrit.wikimedia.org/r/546192 (https://phabricator.wikimedia.org/T236423) [14:21:28] (03PS2) 10Ema: prometheus: load text_ats varnish targets [puppet] - 10https://gerrit.wikimedia.org/r/546190 
(https://phabricator.wikimedia.org/T227432) [14:21:59] ema: I remember that happens from time to time and probably digged into it at some point but never found the root cause :-\ [14:22:02] (03PS1) 10CRusnov: rename rotatedump to remove .bash suffix [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/546193 [14:22:32] (03CR) 10CRusnov: [V: 03+2 C: 03+2] rename rotatedump to remove .bash suffix [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/546193 (owner: 10CRusnov) [14:22:55] RECOVERY - Check systemd state on labweb1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:22:59] RECOVERY - Check systemd state on labweb1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:23:55] (03CR) 10Ema: [C: 03+2] prometheus: load text_ats varnish targets [puppet] - 10https://gerrit.wikimedia.org/r/546190 (https://phabricator.wikimedia.org/T227432) (owner: 10Ema) [14:25:48] (03PS2) 10Jcrespo: bacula: Fix check so it produces the right nagios error code [puppet] - 10https://gerrit.wikimedia.org/r/546192 (https://phabricator.wikimedia.org/T236423) [14:26:57] (03CR) 10Jcrespo: [C: 03+2] bacula: Fix check so it produces the right nagios error code [puppet] - 10https://gerrit.wikimedia.org/r/546192 (https://phabricator.wikimedia.org/T236423) (owner: 10Jcrespo) [14:27:08] !log crusnov@deploy1001 Started deploy [netbox/deploy@690f9ae]: deploy netbox scripts T223292 [14:27:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:40] T223292: Netbox: generate CSV backups - https://phabricator.wikimedia.org/T223292 [14:28:09] !log crusnov@deploy1001 Finished deploy [netbox/deploy@690f9ae]: deploy netbox scripts T223292 (duration: 01m 02s) [14:28:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:48] !log crusnov@deploy1001 Started deploy 
[netbox/deploy@690f9ae]: deploy netbox scripts (netbox2001) T223292 [14:30:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:53] !log crusnov@deploy1001 Finished deploy [netbox/deploy@690f9ae]: deploy netbox scripts (netbox2001) T223292 (duration: 00m 05s) [14:30:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:17] (03PS1) 10Mathew.onipe: wdqs: do not use external mirror [cookbooks] - 10https://gerrit.wikimedia.org/r/546194 (https://phabricator.wikimedia.org/T230588) [14:31:45] !log crusnov@deploy1001 Started deploy [netbox/deploy@690f9ae]: deploy netbox scripts (netbox2001) -T223292 [14:31:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:29] !log crusnov@deploy1001 Finished deploy [netbox/deploy@690f9ae]: deploy netbox scripts (netbox2001) -T223292 (duration: 00m 44s) [14:32:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:35] PROBLEM - Backup freshness on helium is CRITICAL: All failures: 13 (bromine, ...), Stale: 5 (matomo1001, ...), Fresh: 78 jobs https://wikitech.wikimedia.org/wiki/Backups%23Monitoring [14:32:43] ^habemus backup check 🎉 [14:33:33] \o/ \o/ jynus [14:34:07] (03CR) 10Gehel: [C: 03+2] wdqs: do not use external mirror [cookbooks] - 10https://gerrit.wikimedia.org/r/546194 (https://phabricator.wikimedia.org/T230588) (owner: 10Mathew.onipe) [14:34:23] (03PS1) 10Jbond: check_puppetrun: alert critical after 24 hours [puppet] - 10https://gerrit.wikimedia.org/r/546195 [14:35:01] PROBLEM - Check the Netbox report puppetdb for fail status. 
on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [14:36:30] (03CR) 10jerkins-bot: [V: 04-1] check_puppetrun: alert critical after 24 hours [puppet] - 10https://gerrit.wikimedia.org/r/546195 (owner: 10Jbond) [14:36:47] !log gehel@cumin1001 START - Cookbook sre.wdqs.data-reload [14:36:49] 10Operations, 10Discovery-Search, 10Wikidata, 10Wikidata-Query-Service: Do not rate limit dumps from internal network - https://phabricator.wikimedia.org/T222349 (10Gehel) [14:36:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:43] !log cr[23]-esams: re-route ns2 IP to ganeti3003 [14:41:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:50] 10Puppet, 10Cloud-VPS: geoipupdate missing on buster on Cloud VPS - https://phabricator.wikimedia.org/T236487 (10Tgr) [14:45:21] (03PS3) 10BBlack: authdns_servers: remove multatuli [puppet] - 10https://gerrit.wikimedia.org/r/546173 (https://phabricator.wikimedia.org/T236479) [14:47:02] (03CR) 10BBlack: [C: 03+2] authdns_servers: remove multatuli [puppet] - 10https://gerrit.wikimedia.org/r/546173 (https://phabricator.wikimedia.org/T236479) (owner: 10BBlack) [14:55:16] 10Operations, 10DC-Ops, 10Traffic, 10decommission: decommission multatuli - https://phabricator.wikimedia.org/T236489 (10BBlack) [14:58:15] (03PS1) 10BBlack: decom multatuli in puppet [puppet] - 10https://gerrit.wikimedia.org/r/546199 (https://phabricator.wikimedia.org/T236489) [15:00:08] ACKNOWLEDGEMENT - Backup freshness on helium is CRITICAL: All failures: 13 (bromine, ...), Stale: 5 (matomo1001, ...), Fresh: 78 jobs Jcrespo Working on fixing broken backups: https://phabricator.wikimedia.org/T235838 - The acknowledgement expires at: 2019-10-29 14:59:01. 
https://wikitech.wikimedia.org/wiki/Backups%23Monitoring [15:00:11] (03PS1) 10BBlack: decom multatuli from dns [dns] - 10https://gerrit.wikimedia.org/r/546200 (https://phabricator.wikimedia.org/T236489) [15:00:15] (03PS1) 10BBlack: decom bast3002 ipv6 reverse [dns] - 10https://gerrit.wikimedia.org/r/546201 (https://phabricator.wikimedia.org/T236329) [15:00:53] !log bblack@cumin1001 START - Cookbook sre.hosts.decommission [15:00:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:17] !log bblack@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [15:01:22] (03CR) 10BBlack: [C: 03+2] decom multatuli in puppet [puppet] - 10https://gerrit.wikimedia.org/r/546199 (https://phabricator.wikimedia.org/T236489) (owner: 10BBlack) [15:01:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:25] 10Operations, 10DC-Ops, 10Traffic, 10decommission, 10Patch-For-Review: decommission multatuli - https://phabricator.wikimedia.org/T236489 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by bblack@cumin1001 for hosts: `multatuli.wikimedia.org` - multatuli.wikimedia.org (**PASS**) - Dow... 
[15:02:06] (03CR) 10BBlack: [C: 03+2] decom multatuli from dns [dns] - 10https://gerrit.wikimedia.org/r/546200 (https://phabricator.wikimedia.org/T236489) (owner: 10BBlack) [15:02:09] (03CR) 10BBlack: [C: 03+2] decom bast3002 ipv6 reverse [dns] - 10https://gerrit.wikimedia.org/r/546201 (https://phabricator.wikimedia.org/T236329) (owner: 10BBlack) [15:03:08] 10Operations, 10DC-Ops, 10Traffic, 10decommission, 10Patch-For-Review: decommission multatuli - https://phabricator.wikimedia.org/T236489 (10BBlack) [15:03:10] 10Operations, 10ops-esams, 10DC-Ops, 10Patch-For-Review: ESAMS Refresh/Rebuild (October 2019) - https://phabricator.wikimedia.org/T235805 (10BBlack) [15:03:38] 10Operations, 10DC-Ops, 10Traffic, 10decommission, 10Patch-For-Review: decommission multatuli - https://phabricator.wikimedia.org/T236489 (10BBlack) [15:03:40] !log gehel@cumin1001 END (ERROR) - Cookbook sre.wdqs.data-reload (exit_code=97) [15:03:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:54] 10Operations, 10DC-Ops, 10Traffic, 10decommission, 10Patch-For-Review: decommission multatuli - https://phabricator.wikimedia.org/T236489 (10BBlack) a:03Papaul [15:08:29] RECOVERY - Check the Netbox report puppetdb for fail status. on netbox1001 is OK: puppetdb.PuppetDB OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports [15:08:35] 10Operations, 10ops-esams, 10netops: Setup esams atlas anchor - https://phabricator.wikimedia.org/T174637 (10faidon) [15:09:06] (03CR) 10Effie Mouzeli: "After checking the servers and our puppet repo, I had to add a couple of things I think they belong in this commit." [puppet] - 10https://gerrit.wikimedia.org/r/545652 (https://phabricator.wikimedia.org/T229792) (owner: 10Effie Mouzeli) [15:09:09] 10Operations: decom cookbook: dry-run mode not working / PuppetDB and Debmonitor removals can fail - https://phabricator.wikimedia.org/T229998 (10MoritzMuehlenhoff) >>! 
In T229998#5398742, @Volans wrote: > The solution that was agreed at the SRE summit for this is to add a `dd` to override the bootloader(s) so t... [15:10:34] (03PS3) 10Effie Mouzeli: hhvm: remove hhvm leftovers from apache configs [puppet] - 10https://gerrit.wikimedia.org/r/545652 (https://phabricator.wikimedia.org/T229792) [15:15:00] 10Operations, 10DC-Ops, 10decommission, 10Patch-For-Review: decommission bast3002 - https://phabricator.wikimedia.org/T236329 (10Dzahn) 05Stalled→03Open [15:15:03] 10Operations, 10ops-esams, 10Traffic: rack/setup/install bast3004 - https://phabricator.wikimedia.org/T236394 (10Dzahn) [15:15:05] 10Operations, 10ops-esams, 10DC-Ops, 10Patch-For-Review: ESAMS Refresh/Rebuild (October 2019) - https://phabricator.wikimedia.org/T235805 (10Dzahn) [15:15:14] (03PS3) 10Jhedden: ceph: add etcd and k8s master profile for rook [puppet] - 10https://gerrit.wikimedia.org/r/546182 (https://phabricator.wikimedia.org/T236290) [15:16:51] (03PS1) 10Mathew.onipe: Use proxy server to download dumps [cookbooks] - 10https://gerrit.wikimedia.org/r/546210 [15:20:19] 10Operations, 10ops-esams, 10Traffic: rack/setup/install bast3004 - https://phabricator.wikimedia.org/T236394 (10Dzahn) [15:21:29] 10Operations, 10ops-esams, 10Traffic: rack/setup/install bast3004 - https://phabricator.wikimedia.org/T236394 (10Dzahn) 05Open→03Resolved Thank you, @fgiunchedi! It's set to active in Netbox and i tested an install to confirm DHCP/tftpboot is working after that was switched too. Resolving. [15:21:33] 10Operations, 10ops-esams, 10DC-Ops, 10Patch-For-Review: ESAMS Refresh/Rebuild (October 2019) - https://phabricator.wikimedia.org/T235805 (10Dzahn) [15:32:18] 10Operations, 10observability, 10Availability, 10Goal, 10Patch-For-Review: Setup bacula backup monitoring - https://phabricator.wikimedia.org/T234900 (10jcrespo) @Volans @fgiunchedi In addition to: ` root@helium:~$ python3 check_bacula.py --icinga All failures: 13 (bromine, ...), Stale: 5 (matomo1001, .... 
[15:33:11] PROBLEM - Router interfaces on cr3-esams is CRITICAL: CRITICAL: host 91.198.174.245, interfaces up: 88, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:33:25] (03PS1) 10Andrew Bogott: openstack::wikitech::web: don't include python-keystone package [puppet] - 10https://gerrit.wikimedia.org/r/546213 [15:35:36] !log ps1-oe14-esams ip info set, rebooting (won't affect servers) via T184066 [15:35:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:41] T184066: rack/setup/install ps[12]-oe1[456]-esams - https://phabricator.wikimedia.org/T184066 [15:37:09] (03CR) 10Andrew Bogott: [C: 03+2] openstack::wikitech::web: don't include python-keystone package [puppet] - 10https://gerrit.wikimedia.org/r/546213 (owner: 10Andrew Bogott) [15:37:25] RECOVERY - Router interfaces on cr3-esams is OK: OK: host 91.198.174.245, interfaces up: 95, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:38:26] (03PS2) 10Dzahn: install_server: remove bast3002 from DHCP, decom [puppet] - 10https://gerrit.wikimedia.org/r/546004 (https://phabricator.wikimedia.org/T236329) [15:38:36] (03Abandoned) 10Dzahn: install_server: remove bast3002 from DHCP, decom [puppet] - 10https://gerrit.wikimedia.org/r/546004 (https://phabricator.wikimedia.org/T236329) (owner: 10Dzahn) [15:40:07] 10Operations, 10ops-esams, 10DC-Ops, 10Traffic: cp3056 hardware issue - https://phabricator.wikimedia.org/T236497 (10BBlack) p:05Triage→03Normal [15:40:09] (03Abandoned) 10Dzahn: site: replace bast3002 with bast3004, remove from bastion list [puppet] - 10https://gerrit.wikimedia.org/r/545911 (https://phabricator.wikimedia.org/T236329) (owner: 10Dzahn) [15:40:41] (03CR) 10RLazarus: [C: 03+1] "> @RLazarus: perfect example for a small change to Apache config that needs testing / deployment." 
[puppet] - 10https://gerrit.wikimedia.org/r/545889 (https://phabricator.wikimedia.org/T236389) (owner: 10MarcoAurelio) [15:46:49] PROBLEM - Router interfaces on cr3-esams is CRITICAL: CRITICAL: host 91.198.174.245, interfaces up: 88, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:48:05] (03PS1) 10CDanis: WIP port number / local service definitions [puppet] - 10https://gerrit.wikimedia.org/r/546216 [15:48:54] 10Operations, 10Traffic: Elevated 502s observed in ulsfo - https://phabricator.wikimedia.org/T236130 (10colewhite) [[ https://logstash.wikimedia.org/app/kibana#/visualize/create?type=histogram&indexPattern=logstash-*&_g=h@05ebc47&_a=h@29cbfe1 | And it's back! ]] [15:50:12] (03CR) 10jerkins-bot: [V: 04-1] WIP port number / local service definitions [puppet] - 10https://gerrit.wikimedia.org/r/546216 (owner: 10CDanis) [15:50:29] (03PS1) 10Jcrespo: bacula: Add verbose & single job modes for backup freshness check [puppet] - 10https://gerrit.wikimedia.org/r/546217 (https://phabricator.wikimedia.org/T234900) [15:51:00] (03CR) 10Cwhite: [C: 03+2] admin: add Kevin Bazira to several groups [puppet] - 10https://gerrit.wikimedia.org/r/545418 (https://phabricator.wikimedia.org/T234209) (owner: 10Cwhite) [15:52:51] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10Scoring-platform-team, 10Patch-For-Review: Grant LDAP groups and deployment shell access to Kevin Bazira - https://phabricator.wikimedia.org/T234209 (10colewhite) [15:52:53] (03PS2) 10Jcrespo: bacula: Add verbose & single job modes for backup freshness check [puppet] - 10https://gerrit.wikimedia.org/r/546217 (https://phabricator.wikimedia.org/T234900) [15:53:48] 10Operations, 10ops-esams: rack/setup/install ps[12]-oe1[456]-esams - https://phabricator.wikimedia.org/T184066 (10RobH) [15:53:50] (03PS3) 10Jcrespo: bacula: Add verbose & single job modes for backup freshness check [puppet] - 
10https://gerrit.wikimedia.org/r/546217 (https://phabricator.wikimedia.org/T234900) [15:54:44] (03PS1) 10CRusnov: dumpbackup.py: Allow retries and optimize all devices dump [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/546218 [15:54:49] RECOVERY - Router interfaces on cr3-esams is OK: OK: host 91.198.174.245, interfaces up: 95, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:56:07] (03PS2) 10CRusnov: dumpbackup.py: Allow retries and optimize all devices dump [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/546218 [15:56:41] (03CR) 10CRusnov: [V: 03+2 C: 03+2] dumpbackup.py: Allow retries and optimize all devices dump [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/546218 (owner: 10CRusnov) [15:58:10] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10Scoring-platform-team, 10Patch-For-Review: Grant LDAP groups and deployment shell access to Kevin Bazira - https://phabricator.wikimedia.org/T234209 (10colewhite) The necessary changes have been deployed. Please let me know if you encounter... 
[15:58:20] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10Scoring-platform-team, 10Patch-For-Review: Grant LDAP groups and deployment shell access to Kevin Bazira - https://phabricator.wikimedia.org/T234209 (10colewhite) 05Open→03Resolved [16:00:39] (03PS1) 10Effie Mouzeli: logstash: send PHP7 fatal-error messages type:mediawiki channel:fatal [puppet] - 10https://gerrit.wikimedia.org/r/546219 (https://phabricator.wikimedia.org/T234283) [16:01:36] (03PS1) 10Gergő Tisza: [WIP] Make lxc work on buster in Cloud VPS [puppet] - 10https://gerrit.wikimedia.org/r/546221 (https://phabricator.wikimedia.org/T236455) [16:04:08] !log crusnov@deploy1001 Started deploy [netbox/deploy@0f4c92d]: deploy netbox scripts update (netbox2001) T223292 [16:04:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:13] T223292: Netbox: generate CSV backups - https://phabricator.wikimedia.org/T223292 [16:04:51] !log crusnov@deploy1001 Finished deploy [netbox/deploy@0f4c92d]: deploy netbox scripts update (netbox2001) T223292 (duration: 00m 43s) [16:04:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:50] !log crusnov@deploy1001 Started deploy [netbox/deploy@0f4c92d]: deploy netbox scripts update (netbox1001) T223292 [16:05:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:45] 10Operations, 10Wikimedia-Mailing-lists: Create wikimedia sustainability mailing list - https://phabricator.wikimedia.org/T234999 (10colewhite) @mepps The list has been created and the password emailed to you. You may need to share it with your co-admin(s). The admin interface can be found here: https://lists... 
[16:07:58] 10Operations, 10Wikimedia-Mailing-lists: Create wikimedia sustainability mailing list - https://phabricator.wikimedia.org/T234999 (10colewhite) 05Open→03Resolved [16:08:18] (03PS1) 10CRusnov: scap.cfg: Fix deploy group, keyholder [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/546222 [16:09:54] 10Operations, 10Traffic, 10Wikidata, 10Wikidata-Query-Service: ulsfo returns 504 error (upstream request timeout) for WDQS requests - https://phabricator.wikimedia.org/T236500 (10Bugreporter) [16:10:06] (03PS2) 10CDanis: WIP port number / local service definitions [puppet] - 10https://gerrit.wikimedia.org/r/546216 [16:10:10] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545944 (https://phabricator.wikimedia.org/T235899) (owner: 10Subramanya Sastry) [16:10:12] (03PS4) 10Jcrespo: bacula: Add verbose & single job modes for backup freshness check [puppet] - 10https://gerrit.wikimedia.org/r/546217 (https://phabricator.wikimedia.org/T234900) [16:10:31] 10Operations, 10Mail, 10Wikimedia-Mailing-lists: Lengthy delays in emails being received from mailing lists in October 2019 - https://phabricator.wikimedia.org/T235983 (10colewhite) 05Open→03Resolved [16:10:34] (03PS2) 10CRusnov: netbox: Enable CSV dump rotations. [puppet] - 10https://gerrit.wikimedia.org/r/545123 (https://phabricator.wikimedia.org/T223292) [16:10:41] (03CR) 10CRusnov: [V: 03+2 C: 03+2] scap.cfg: Fix deploy group, keyholder [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/546222 (owner: 10CRusnov) [16:10:58] (03CR) 10Jcrespo: "+Order verbose job listings alphabetically." [puppet] - 10https://gerrit.wikimedia.org/r/546217 (https://phabricator.wikimedia.org/T234900) (owner: 10Jcrespo) [16:11:34] (03CR) 10CRusnov: "> Patch Set 1: Code-Review-1" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/545123 (https://phabricator.wikimedia.org/T223292) (owner: 10CRusnov) [16:12:09] (03CR) 10CRusnov: netbox: Enable CSV dump rotations. 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/545123 (https://phabricator.wikimedia.org/T223292) (owner: 10CRusnov) [16:12:11] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, nit inline but feel free to ignore" (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/545934 (https://phabricator.wikimedia.org/T231870) (owner: 10CDanis) [16:13:42] 10Operations, 10serviceops, 10HHVM, 10MW-1.35-notes (1.35.0-wmf.3; 2019-10-22), and 2 others: Remove HHVM from production - https://phabricator.wikimedia.org/T229792 (10jijiki) >>! In T229792#5604812, @Dzahn wrote: > merged the above because we were getting cron spam from appservers with "/usr/local/bin/hh... [16:14:39] (03CR) 10CRusnov: [C: 03+2] netbox: Enable CSV dump rotations. [puppet] - 10https://gerrit.wikimedia.org/r/545123 (https://phabricator.wikimedia.org/T223292) (owner: 10CRusnov) [16:16:00] 10Operations, 10observability, 10Availability, 10Goal, 10Patch-For-Review: Setup bacula backup monitoring - https://phabricator.wikimedia.org/T234900 (10jcrespo) Aside from the above, things pending: * Tune thresholds and in general success conditions/make them configurable * Handle "new backups" with "... 
[16:16:17] PROBLEM - IPMI Sensor Status on cp3062 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [16:17:51] 10Operations, 10ops-ulsfo, 10Traffic, 10Wikidata, 10Wikidata-Query-Service: ulsfo returns 504 error (upstream request timeout) for WDQS requests - https://phabricator.wikimedia.org/T236500 (10Bugreporter) Happens in ulsfo only [16:17:59] PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 5777 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:18:27] PROBLEM - IPMI Sensor Status on cp3061 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [16:19:20] !log crusnov@deploy1001 Finished deploy [netbox/deploy@0f4c92d]: deploy netbox scripts update (netbox1001) T223292 (duration: 13m 31s) [16:19:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:25] T223292: Netbox: generate CSV backups - https://phabricator.wikimedia.org/T223292 [16:19:33] RECOVERY - MediaWiki memcached error rate on icinga1001 is OK: (C)5000 gt (W)1000 gt 28 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:19:37] !log lvs3006 - reimaging to fix partman issue, high-traffic2 (upload/maps) to lvs3007 for the duration [16:19:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:29] PROBLEM - IPMI Sensor Status on cp3064 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = 
Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [16:22:17] PROBLEM - IPMI Sensor Status on cp3063 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [16:25:15] PROBLEM - IPMI Sensor Status on cp3065 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [16:25:30] (03PS1) 10Andrew Bogott: Puppet setup for cloudbackup2002 [puppet] - 10https://gerrit.wikimedia.org/r/546226 [16:26:49] (03CR) 10Andrew Bogott: [C: 03+2] Puppet setup for cloudbackup2002 [puppet] - 10https://gerrit.wikimedia.org/r/546226 (owner: 10Andrew Bogott) [16:29:20] (03CR) 10Krinkle: [C: 03+1] logstash: send PHP7 fatal-error messages type:mediawiki channel:fatal [puppet] - 10https://gerrit.wikimedia.org/r/546219 (https://phabricator.wikimedia.org/T234283) (owner: 10Effie Mouzeli) [16:34:00] (03PS1) 10Krinkle: mediawiki: Add 'caught_by' to php7-fatal-error log message [puppet] - 10https://gerrit.wikimedia.org/r/546230 (https://phabricator.wikimedia.org/T234283) [16:34:36] (03CR) 10Thcipriani: [V: 03+2 C: 03+2] Update scap targets [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/545416 (owner: 10Paladox) [16:35:14] (03CR) 10Krinkle: [C: 03+1] Set "allow_tcp_nagle_delay" to false in mc.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521967 (owner: 10Aaron Schulz) [16:37:05] 10Operations, 10ops-codfw, 10Cloud-Services: rack/setup codfw: cloudbackup2001.codfw.wmnet and cloudbackup2002.codfw.wmnet - https://phabricator.wikimedia.org/T224528 (10Andrew) a:05Papaul→03Andrew @Papaul, nevermind, it turns out I can do this from the mgmt console. 
[16:39:37] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [16:40:56] !log bblack@cumin1001 START - Cookbook sre.hosts.downtime [16:40:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:02] !log bblack@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [16:43:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:08] mutante: I think 545979 is okay now [16:46:17] RECOVERY - IPMI Sensor Status on cp3062 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [16:48:31] RECOVERY - IPMI Sensor Status on cp3061 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [16:49:01] (03CR) 10Bstorm: maintain-kubeusers: add ability to merge and update configs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/545966 (https://phabricator.wikimedia.org/T236202) (owner: 10Bstorm) [16:49:08] (03CR) 10Eevans: Add reprepo updates for cassandra311 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/544964 (https://phabricator.wikimedia.org/T235675) (owner: 10Alexandros Kosiaris) [16:51:37] RECOVERY - IPMI Sensor Status on cp3064 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [16:52:27] RECOVERY - IPMI Sensor Status on cp3063 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [16:55:31] RECOVERY - IPMI Sensor Status on cp3065 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [17:03:21] (03PS1) 10CRusnov: rotatedump: Change to overwriting the daily timestamp dump rather than hour timestamps [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/546241 [17:05:51] RECOVERY - Check the Netbox report puppetdb for fail status. on netbox1001 is OK: puppetdb.PuppetDB OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports [17:10:16] 10Operations, 10ops-eqiad, 10cloud-services-team: Degraded RAID on cloudvirt1024 -- Filesystem mounted read-only - https://phabricator.wikimedia.org/T230289 (10Andrew) I've stress-tested this box quite a bit; now I'm building a couple of VMs for the 'video' project (encoding04 and encoding05) over there for... [17:11:54] (03CR) 10Bstorm: [C: 03+1] "Looks good from my end." [puppet] - 10https://gerrit.wikimedia.org/r/546189 (https://phabricator.wikimedia.org/T234229) (owner: 10Ottomata) [17:12:14] /win 12 [17:18:00] (03PS3) 10Dzahn: add ge.wikimedia.org for Georgia user group [dns] - 10https://gerrit.wikimedia.org/r/545979 (https://phabricator.wikimedia.org/T236389) [17:18:06] hauskater: ack, 545979 doing [17:18:14] bien [17:19:25] hauskater: let's edit what is requested in the ticket? [17:19:42] mutante: hum? [17:19:51] what are you talking about? :) [17:20:17] well, it says "Language: Georgian (ka)" and that will be the wiki language, so it's ok [17:20:40] domain = ge.wikimedia; lang = ka [17:20:51] country code vs. lang. 
code [17:20:53] yes, it's correct now [17:20:56] all good [17:21:14] langcode is important for when sb does the addWiki.php thing [17:21:35] *nod* it's needed for the MW config too [17:21:52] I set that on the config/wmf-config repo [17:21:55] (03CR) 10Dzahn: [C: 03+2] add ge.wikimedia.org for Georgia user group [dns] - 10https://gerrit.wikimedia.org/r/545979 (https://phabricator.wikimedia.org/T236389) (owner: 10Dzahn) [17:22:15] good [17:24:33] great that we also have https://phabricator.wikimedia.org/T236389#5605773 [17:25:55] hauskater: done [17:26:08] great [17:26:14] !log lvs3005 - reimaging to fix partman issue, high-traffic1 (text) to lvs3007 for the duration [17:26:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:35] Haciendo ping a dyna.wikimedia.org [91.198.174.192] con 32 bytes de datos: [17:27:16] looks good [17:27:39] the apache one is still pending [17:27:40] "Haciendo ping" heh, trying to learn Spanish [17:28:03] having a Spanish nick, mutante, I thought you had some knowledge :P [17:28:24] you could say the same about me re. German [17:28:35] hauskater: https://en.wiktionary.org/wiki/Mutante :p [17:28:50] I just know how to say "Krankenhaus" [17:29:15] that's pretty hard [17:29:55] und "Ladesspital" [17:30:49] did you know there are .ogg files for individual words on wiktionary [17:30:59] like https://en.wiktionary.org/wiki/Krankenhaus#Pronunciation [17:31:07] Yes, I think so [17:32:50] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/546139 (https://phabricator.wikimedia.org/T235655) (owner: 10Jbond) [17:34:50] PROBLEM - Host asw-esams is DOWN: PING CRITICAL - Packet loss = 100% [17:34:56] 10Operations, 10Puppet, 10observability, 10Patch-For-Review: update failed puppet checkes so that they go critical 24 hours - https://phabricator.wikimedia.org/T236478 (10Volans) @jbond I had already opened T236345 for this. I guess that can probably be merged into this at this point. 
[17:38:56] (03PS1) 1020after4: scapify design/style-guide microsite (1 of 2) [puppet] - 10https://gerrit.wikimedia.org/r/546253 [17:38:58] (03PS1) 1020after4: scapify design/style-guide microsite (2 of 2) [puppet] - 10https://gerrit.wikimedia.org/r/546254 [17:41:21] 10Operations: decom cookbook: dry-run mode not working / PuppetDB and Debmonitor removals can fail - https://phabricator.wikimedia.org/T229998 (10Volans) @MoritzMuehlenhoff it failed the power off, as reported by the script, see https://phabricator.wikimedia.org/T208585#5599005 [17:44:46] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [17:47:15] !log bblack@cumin1001 START - Cookbook sre.hosts.downtime [17:47:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:48:23] 10Operations, 10observability: Monitor mailman outbound mail queue - https://phabricator.wikimedia.org/T236505 (10colewhite) [17:48:38] 10Operations, 10observability: Monitor mailman outbound mail queue - https://phabricator.wikimedia.org/T236505 (10colewhite) p:05Triage→03Normal a:03colewhite [17:49:21] !log bblack@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [17:49:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:13] 10Operations, 10ops-codfw, 10Cloud-Services: rack/setup codfw: cloudbackup2001.codfw.wmnet and cloudbackup2002.codfw.wmnet - https://phabricator.wikimedia.org/T224528 (10Andrew) [17:59:02] (03PS3) 10CDanis: WIP port number / local service definitions [puppet] - 10https://gerrit.wikimedia.org/r/546216 [18:01:25] (03PS1) 10Cwhite: prometheus, profile: add file count feature and enable lists queue tracking [puppet] - 10https://gerrit.wikimedia.org/r/546260 (https://phabricator.wikimedia.org/T236505) [18:01:37] 10Operations, 10ops-codfw, 10Cloud-Services: rack/setup codfw: cloudbackup2001.codfw.wmnet and 
cloudbackup2002.codfw.wmnet - https://phabricator.wikimedia.org/T224528 (10Andrew) [18:01:39] (03PS4) 10CDanis: WIP port number / local service definitions [puppet] - 10https://gerrit.wikimedia.org/r/546216 [18:01:52] 10Operations, 10ops-codfw, 10Cloud-Services: rack/setup codfw: cloudbackup2001.codfw.wmnet and cloudbackup2002.codfw.wmnet - https://phabricator.wikimedia.org/T224528 (10Andrew) 2001 is up and looks good. 2002 is blocked awaiting port setup. [18:03:33] (03CR) 10jerkins-bot: [V: 04-1] prometheus, profile: add file count feature and enable lists queue tracking [puppet] - 10https://gerrit.wikimedia.org/r/546260 (https://phabricator.wikimedia.org/T236505) (owner: 10Cwhite) [18:07:16] (03PS5) 10CDanis: WIP port number / local service definitions [puppet] - 10https://gerrit.wikimedia.org/r/546216 [18:08:31] (03PS2) 10Cwhite: prometheus, profile: add file count feature and enable lists queue tracking [puppet] - 10https://gerrit.wikimedia.org/r/546260 (https://phabricator.wikimedia.org/T236505) [18:10:26] (03PS2) 10Bstorm: maintain-kubeusers: add ability to merge and update configs [puppet] - 10https://gerrit.wikimedia.org/r/545966 (https://phabricator.wikimedia.org/T236202) [18:11:04] RECOVERY - Check the Netbox report puppetdb for fail status. on netbox1001 is OK: puppetdb.PuppetDB OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports [18:24:44] (03PS6) 10CDanis: base: shared definitions for port numbers in /etc/services [puppet] - 10https://gerrit.wikimedia.org/r/546216 [18:28:32] (03CR) 10CDanis: "Reviewers, please be nitpicky as I really don't know what I'm doing." 
[puppet] - 10https://gerrit.wikimedia.org/r/546216 (owner: 10CDanis) [18:28:54] (03PS1) 10Ayounsi: Remove old esams networking devices from monitoring [puppet] - 10https://gerrit.wikimedia.org/r/546270 (https://phabricator.wikimedia.org/T235805) [18:30:26] (03PS2) 10Ayounsi: Remove old esams networking devices from monitoring [puppet] - 10https://gerrit.wikimedia.org/r/546270 (https://phabricator.wikimedia.org/T235805) [18:38:36] (03PS7) 10CDanis: base: shared definitions for port numbers in /etc/services [puppet] - 10https://gerrit.wikimedia.org/r/546216 [18:40:13] (03CR) 10Dzahn: [C: 03+1] "confirmed the switches are not in DNS anymore, only mgmt" [puppet] - 10https://gerrit.wikimedia.org/r/546270 (https://phabricator.wikimedia.org/T235805) (owner: 10Ayounsi) [18:41:03] (03CR) 10Dzahn: [C: 03+2] Remove old esams networking devices from monitoring [puppet] - 10https://gerrit.wikimedia.org/r/546270 (https://phabricator.wikimedia.org/T235805) (owner: 10Ayounsi) [18:42:27] (03CR) 10Cwhite: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/546219 (https://phabricator.wikimedia.org/T234283) (owner: 10Effie Mouzeli) [18:46:18] (03PS3) 10Paladox: ircecho: Convert script to python3 [puppet] - 10https://gerrit.wikimedia.org/r/492314 [18:47:26] (03CR) 10Paladox: [C: 03+1] site: turn cobalt into a spare system (Do not merge) [puppet] - 10https://gerrit.wikimedia.org/r/545328 (https://phabricator.wikimedia.org/T236187) (owner: 10Dzahn) [18:47:36] (03CR) 10Paladox: [C: 03+1] ci: remove cobalt from firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/545330 (https://phabricator.wikimedia.org/T236187) (owner: 10Dzahn) [18:47:46] (03CR) 10Paladox: [C: 03+1] mariadb: remove cobalt from ferm_misc rules [puppet] - 10https://gerrit.wikimedia.org/r/545333 (https://phabricator.wikimedia.org/T236187) (owner: 10Dzahn) [18:47:58] (03CR) 10Paladox: [C: 03+1] acme_chief: remove cobalt from authorized hosts [puppet] - 10https://gerrit.wikimedia.org/r/545334 
(https://phabricator.wikimedia.org/T236187) (owner: Dzahn)
[18:48:08] (CR) Paladox: [C: +1] gerrit: remove cobalt from ssh known_hosts file [puppet] - https://gerrit.wikimedia.org/r/545335 (https://phabricator.wikimedia.org/T236187) (owner: Dzahn)
[18:48:18] (CR) Paladox: [C: +1] install_server: remove cobalt from DHCP and partman [puppet] - https://gerrit.wikimedia.org/r/545336 (https://phabricator.wikimedia.org/T236187) (owner: Dzahn)
[18:50:11] Operations, LDAP-Access-Requests, SRE-Access-Requests, Scoring-platform-team: Grant LDAP groups and deployment shell access to Kevin Bazira - https://phabricator.wikimedia.org/T234209 (Halfak) Thank you!
[18:50:23] (PS4) Dzahn: wmf_auto_reimage: Adjust message about waiting for puppet [puppet] - https://gerrit.wikimedia.org/r/522567
[18:50:53] (CR) Dzahn: wmf_auto_reimage: Adjust message about waiting for puppet (1 comment) [puppet] - https://gerrit.wikimedia.org/r/522567 (owner: Dzahn)
[18:54:54] (PS2) Dzahn: scapify design/style-guide microsite (1 of 2) [puppet] - https://gerrit.wikimedia.org/r/546253 (owner: 20after4)
[18:57:01] (PS3) Dzahn: scapify design/style-guide microsite (1 of 2) [puppet] - https://gerrit.wikimedia.org/r/546253 (https://phabricator.wikimedia.org/T235677) (owner: 20after4)
[18:57:33] (CR) Dzahn: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1001/19080/bromine.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/546253 (https://phabricator.wikimedia.org/T235677) (owner: 20after4)
[18:58:40] (CR) Dzahn: [C: +2] "actually does a lot. installs scap, firewall rules, deployment-user etc.. this host has not been a scap target before.
but compiler looks " [puppet] - https://gerrit.wikimedia.org/r/546253 (https://phabricator.wikimedia.org/T235677) (owner: 20after4)
[19:00:40] (PS4) Jhedden: ceph: add etcd and k8s master profile for rook [puppet] - https://gerrit.wikimedia.org/r/546182 (https://phabricator.wikimedia.org/T236290)
[19:01:01] Operations, Gerrit, Release-Engineering-Team, Wikimedia Design Style Guide, Patch-For-Review: Automatic pickup of Gerrit clone master doesn't happen (due to git-lfs not installed on production misc) - https://phabricator.wikimedia.org/T235677 (Volker_E) >>! In T235677#5604917, @mmodell wrote:...
[19:01:54] Operations, Gerrit, Release-Engineering-Team, Wikimedia Design Style Guide, Patch-For-Review: Automatic pickup of Gerrit clone master doesn't happen (due to git-lfs not installed on production misc) - https://phabricator.wikimedia.org/T235677 (Dzahn) Merged and deployed part 1 of the "scapify...
[19:03:57] (CR) Dzahn: "part 1 deployed - seeing an error but probably normal before first deploy? (pasted on ticket) - waiting with this one for your ok." [puppet] - https://gerrit.wikimedia.org/r/546254 (owner: 20after4)
[19:04:12] twentyafterfour: ^
[19:06:23] Operations, Gerrit, Release-Engineering-Team, Wikimedia Design Style Guide, Patch-For-Review: Automatic pickup of Gerrit clone master doesn't happen (due to git-lfs not installed on production misc) - https://phabricator.wikimedia.org/T235677 (Dzahn) This might be normal before the first depl...
[19:09:44] Volker_E: both of the "misc static" webservers hosting design and other pages have been made "scap targets" now. That means there is a deployment_user and rsync and it should be possible to deploy to them. The error i pasted is likely normal before the first deploy. It says your repo is missing a config file.
[19:10:58] after the first deploy and if it looks fine we can merge Mukunda's second change and point the webserver to the deployment_path and you would be unblocked
[19:19:18] I need a process clarification (shared q on task)
[19:21:27] Volker_E: i think it's more about unblocking you from deploying large files. The difference besides that is you get immediate results and dont need to wait up to 30 min.
[19:27:50] (CR) CDanis: base: shared definitions for port numbers in /etc/services (1 comment) [puppet] - https://gerrit.wikimedia.org/r/546216 (owner: CDanis)
[19:41:09] Volker_E: i pinged him about the deployment. we should get that done soon to avoid blocking git pull of other things on those servers
[19:41:21] mutante: sounds good!
[19:41:45] alright
[19:48:20] (PS1) Dzahn: requesttracker: re-enable envoy if on buster [puppet] - https://gerrit.wikimedia.org/r/546280 (https://phabricator.wikimedia.org/T210411)
[19:48:59] (PS4) CDanis: wmftest.org: add wpt-graphite [dns] - https://gerrit.wikimedia.org/r/545934 (https://phabricator.wikimedia.org/T231870)
[19:49:04] (CR) CDanis: wmftest.org: add wpt-graphite (1 comment) [dns] - https://gerrit.wikimedia.org/r/545934 (https://phabricator.wikimedia.org/T231870) (owner: CDanis)
[19:49:31] Operations, observability, Patch-For-Review: Monitor mailman outbound mail queue - https://phabricator.wikimedia.org/T236505 (colewhite) Historically, out queue monitoring has been noisy. One idea to have less noisy outbound monitoring is to take the queue depth and estimate how long it will take to...
[19:56:31] (CR) Dzahn: [C: +2] requesttracker: re-enable envoy if on buster [puppet] - https://gerrit.wikimedia.org/r/546280 (https://phabricator.wikimedia.org/T210411) (owner: Dzahn)
[20:00:13] mutante: thank you!
[20:04:48] !log twentyafterfour@deploy1001 Started deploy [design/style-guide@c69242e]: test deploy design/style-guide
[20:04:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:04:58] !log twentyafterfour@deploy1001 Finished deploy [design/style-guide@c69242e]: test deploy design/style-guide (duration: 00m 10s)
[20:05:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:07:59] Operations, Gerrit, Release-Engineering-Team, Wikimedia Design Style Guide, Patch-For-Review: Automatic pickup of Gerrit clone master doesn't happen (due to git-lfs not installed on production misc) - https://phabricator.wikimedia.org/T235677 (mmodell) ran `scap deploy` and it succeeded first...
[20:08:57] twentyafterfour: confirmed puppet is happy again ! nice
[20:09:34] Volker_E: looking good! I'll have to deploy it for you until we get you set up with deployer privileges but it looks like scap deploy is going to work fine
[20:11:25] i am starting the process to make you a deployer by creating a ticket for it
[20:11:36] clinic duty will pick it up
[20:11:38] mutante & Volker_E: we should probably also set up scap for the other two design microsite repos T235677
[20:11:39] T235677: Automatic pickup of Gerrit clone master doesn't happen (due to git-lfs not installed on production misc) - https://phabricator.wikimedia.org/T235677
[20:12:26] only really needed if you want to deploy large files there
[20:12:37] but if you dont want to switch between 2 methods..yes
[20:14:45] twentyafterfour: i assume this is the same as an MW deployer
[20:15:48] Operations, SRE-Access-Requests: deployment access for Volker Eckl - https://phabricator.wikimedia.org/T236518 (Dzahn)
[20:16:23] Operations, Gerrit, Release-Engineering-Team, Wikimedia Design Style Guide, Patch-For-Review: Automatic pickup of Gerrit clone master doesn't happen (due to git-lfs not installed on production misc) -
https://phabricator.wikimedia.org/T235677 (Dzahn)
[20:16:25] Operations, SRE-Access-Requests: deployment access for Volker Eckl - https://phabricator.wikimedia.org/T236518 (Dzahn)
[20:16:46] mutante: well, I set it to deploy-service instead of mwdeploy user
[20:16:54] maybe that wasn't the best choice though?
[20:19:49] Operations, ops-eqiad: Degraded RAID on cloudvirt1018 - https://phabricator.wikimedia.org/T236331 (Jclark-ctr) @Bstorm Is this host in use? when can we schedule a good time for me to troubleshoot?
[20:21:15] Operations, SRE-Access-Requests: deployment access for Volker Eckl - https://phabricator.wikimedia.org/T236518 (mmodell) This probably needs a new group since it's not really a service and it's not part of mediawiki, neither wikidev nor deploy-service are really appropriate.
[20:24:26] twentyafterfour: Volker_E: should we switch the document root?
[20:25:46] Operations, SRE-Access-Requests: deployment access for Volker Eckl - https://phabricator.wikimedia.org/T236518 (Dzahn) And then it would be best to have more than one person in that group. Could you list backup deployers?
[20:26:43] Operations, SRE-Access-Requests: deployment access for Volker Eckl - https://phabricator.wikimedia.org/T236518 (mmodell) I assume @Ladsgroup would be another deployer. @volker_e: is there anyone else who should be able to deploy the style guide?
[20:29:40] Operations, SRE-Access-Requests: deployment access for Volker Eckl - https://phabricator.wikimedia.org/T236518 (Volker_E) @Jdrewniak Would be needed as well!
[20:30:50] twentyafterfour: pls provide comment on process for deployers on https://phabricator.wikimedia.org/T235677
[20:32:13] Operations, SRE-Access-Requests: new deployment group and access for design site - Volker Eckl, Jan Drewniak, Amir Ladsgroup - https://phabricator.wikimedia.org/T236518 (Dzahn)
[20:34:08] Operations, SRE-Access-Requests: new deployment group and access for design site - Volker Eckl, Jan Drewniak, Amir Ladsgroup - https://phabricator.wikimedia.org/T236518 (Dzahn)
[20:34:26] (PS1) Cwhite: mtail,profile: add smtp metrics collection with mtail [puppet] - https://gerrit.wikimedia.org/r/546290 (https://phabricator.wikimedia.org/T236505)
[20:37:41] (PS3) Dzahn: ATS/varnish: replace director for RT with moscovium [puppet] - https://gerrit.wikimedia.org/r/544077 (https://phabricator.wikimedia.org/T180641)
[20:38:27] Operations, ops-eqiad: Degraded RAID on cloudvirt1018 - https://phabricator.wikimedia.org/T236331 (Bstorm) It is in use. It can handle a couple disk failures without evacuating. If it needs to be taken offline, it will need some work to get it ready on our end.
[20:38:54] Volker_E: sure thing. Also I'll be glad to walk you through it once you have privs.
[20:40:02] Operations, ops-eqiad: Degraded RAID on cloudvirt1018 - https://phabricator.wikimedia.org/T236331 (Bstorm) To ask more directly, do you need us to evacuate this host for troubleshooting? @Jclark-ctr
[20:41:07] twentyafterfour: and git lfs should be possible with scap?
[20:41:54] !log twentyafterfour@deploy1001 Started deploy [design/style-guide@c69242e]: deploying design/style-guide for demonstration purposes
[20:41:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:42:00] !log twentyafterfour@deploy1001 Finished deploy [design/style-guide@c69242e]: deploying design/style-guide for demonstration purposes (duration: 00m 06s)
[20:42:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:46:55] Operations, Gerrit, Release-Engineering-Team, Wikimedia Design Style Guide, Patch-For-Review: Automatic pickup of Gerrit clone master doesn't happen (due to git-lfs not installed on production misc) - https://phabricator.wikimedia.org/T235677 (mmodell) So the process to deploy is pretty simpl...
[20:48:00] Volker_E: I think scap can do git-lfs. It can definitely do git-fat
[20:49:37] https://phabricator.wikimedia.org/T180627
[20:50:48] Operations, Gerrit, Release-Engineering-Team, Wikimedia Design Style Guide, Patch-For-Review: Automatic pickup of Gerrit clone master doesn't happen (due to git-lfs not installed on production misc) - https://phabricator.wikimedia.org/T235677 (mmodell) and regarding git-lfs, see {T180627} sp...
[20:52:53] Operations, Gerrit, Release-Engineering-Team, Wikimedia Design Style Guide, Patch-For-Review: Automatic pickup of Gerrit clone master doesn't happen (due to git-lfs not installed on production misc) - https://phabricator.wikimedia.org/T235677 (mmodell) Also: https://wikitech.wikimedia.org/wik...
[20:55:03] Operations, Code-Stewardship-Reviews, Graphoid, Core Platform Team Legacy (Watching / External), and 3 others: graphoid: Code stewardship request - https://phabricator.wikimedia.org/T211881 (dr0ptp4kt) @Yair_rand agreed the relatively larger JS component size is not ideal. We'll need to compensat...
[21:03:10] (CR) Jeena Huneidi: [C: -1] scaffold: only expose one port as a service by default (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/544629 (owner: Giuseppe Lavagetto)
[21:05:55] (CR) CDanis: [C: +1] "Looks good to me!" (1 comment) [software/httpbb] - https://gerrit.wikimedia.org/r/545689 (owner: RLazarus)
[21:07:43] (CR) CDanis: [C: +1] Initial version of httpbb, the HTTP black box testing tool. (1 comment) [software/httpbb] - https://gerrit.wikimedia.org/r/545689 (owner: RLazarus)
[21:14:29] Operations, SRE-Access-Requests: new deployment group and access for design site - Volker Eckl, Jan Drewniak, Amir Sarabadani - https://phabricator.wikimedia.org/T236518 (Ladsgroup)
[21:15:04] twentyafterfour: I'd need help on choosing the right one for binary files. Don't want to run into a three-day endeavor with unknown result again
[21:16:24] twentyafterfour: so what are the next steps? Am I needed to follow the procedure with ssh login to make latest git repo status available?
[21:17:13] Operations, Gerrit, Release-Engineering-Team, Wikimedia Design Style Guide, Patch-For-Review: Automatic pickup of Gerrit clone master doesn't happen (due to git-lfs not installed on production misc) - https://phabricator.wikimedia.org/T235677 (Volker_E) @mmodell How important are the deploy m...
[21:22:37] Operations, Code-Stewardship-Reviews, Graphoid, Core Platform Team Legacy (Watching / External), and 3 others: graphoid: Code stewardship request - https://phabricator.wikimedia.org/T211881 (Yurik) @dr0ptp4kt not just JS -- data sources could be far larger component to the graphs - e.g. one graph...
[21:23:44] (PS4) RLazarus: Initial version of httpbb, the HTTP black box testing tool. [software/httpbb] - https://gerrit.wikimedia.org/r/545689
[21:24:07] (CR) RLazarus: Initial version of httpbb, the HTTP black box testing tool.
(1 comment) [software/httpbb] - https://gerrit.wikimedia.org/r/545689 (owner: RLazarus)
[21:26:21] PROBLEM - Check the Netbox report librenms for fail status. on netbox1001 is CRITICAL: librenms.LibreNMS CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[21:27:07] Volker_E: yes and you can have me deploy it in the meantime while we wait for your privs
[21:27:43] so just let me know if it needs updating I'll do the needful ;)
[21:31:41] Operations, FR-Q2-FY2019-20-cleanup-list, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, Traffic: Geoip lookup - Misidentifying country due to travelling - https://phabricator.wikimedia.org/T175691 (Volans) I can confirm this as it happened to me today. I'm seeing the fund raising b...
[21:35:14] (PS5) RLazarus: Initial version of httpbb, the HTTP black box testing tool. [software/httpbb] - https://gerrit.wikimedia.org/r/545689
[21:36:18] (PS6) RLazarus: Initial version of httpbb, the HTTP black box testing tool.
[software/httpbb] - https://gerrit.wikimedia.org/r/545689
[21:40:55] (CR) Dzahn: [C: +2] ATS/varnish: replace director for RT with moscovium [puppet] - https://gerrit.wikimedia.org/r/544077 (https://phabricator.wikimedia.org/T180641) (owner: Dzahn)
[21:55:45] !log running puppet on ulsfo cp-ats servers to pick up config change for RT backend
[21:55:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:12:43] (PS1) Dzahn: ssl: new certificate for RT to contain moscovium.eqiad.wmnet [puppet] - https://gerrit.wikimedia.org/r/546301 (https://phabricator.wikimedia.org/T180641)
[22:19:15] (PS1) Dzahn: admins: create new deploy group for design, add 3 users [puppet] - https://gerrit.wikimedia.org/r/546303 (https://phabricator.wikimedia.org/T236518)
[22:23:09] (CR) Dzahn: [C: +2] ssl: new certificate for RT to contain moscovium.eqiad.wmnet [puppet] - https://gerrit.wikimedia.org/r/546301 (https://phabricator.wikimedia.org/T180641) (owner: Dzahn)
[22:29:01] Operations, Code-Stewardship-Reviews, Graphoid, Core Platform Team Legacy (Watching / External), and 3 others: graphoid: Code stewardship request - https://phabricator.wikimedia.org/T211881 (Keegan) @dr0ptp4kt thanks for the mention, sorry for the delay. I'm at WikidataCon this weekend, I'll get...
[22:33:56] (CR) Jeena Huneidi: [C: -1] scaffold: only expose one port as a service by default (4 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/544629 (owner: Giuseppe Lavagetto)
[22:36:23] (PS1) Dzahn: ATS: fix envoy backend port for RT to 443 [puppet] - https://gerrit.wikimedia.org/r/546308 (https://phabricator.wikimedia.org/T210411)
[22:37:17] (CR) Dzahn: [C: +2] ATS: fix envoy backend port for RT to 443 [puppet] - https://gerrit.wikimedia.org/r/546308 (https://phabricator.wikimedia.org/T210411) (owner: Dzahn)
[22:40:00] (PS3) Jeena Huneidi: Modify Restrouter chart to allow for minikube development [deployment-charts] - https://gerrit.wikimedia.org/r/545421 (https://phabricator.wikimedia.org/T228910)
[22:42:31] !log moscovium - manually deleting envoy listener on 1443 and letting puppet recreate config because it's not removed if you change the port (T180641)
[22:42:34] Operations, Traffic, serviceops, Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (Dzahn)
[22:42:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:42:35] T180641: reinstall RT server with private IP and Buster - https://phabricator.wikimedia.org/T180641
[22:44:30] (CR) Gehel: [C: +2] Use proxy server to download dumps [cookbooks] - https://gerrit.wikimedia.org/r/546210 (owner: Mathew.onipe)
[22:46:25] PROBLEM - Check systemd state on moscovium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:47:07] ACKNOWLEDGEMENT - Check systemd state on moscovium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn .
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:54:23] RECOVERY - Check systemd state on moscovium is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:54:59] !log moscovium rm /dev/shm/envoy_shared_memory_0 to revive envoy which failed to run after changing ports and reinstalling it (T180641)
[22:55:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:55:04] T180641: reinstall RT server with private IP and Buster - https://phabricator.wikimedia.org/T180641