[00:00:04] MaxSem, RoanKattouw, Niharika, and Urbanecm: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Evening SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191107T0000).
[00:00:04] No GERRIT patches in the queue for this window AFAICS.
[00:00:11] hey
[00:00:16] i just added one
[00:00:29] jouncebot: reload
[00:00:32] jouncebot: refresh
[00:00:33] I refreshed my knowledge about deployments.
[00:00:35] gj
[00:00:44] !log ebernhardson@deploy1001 Finished deploy [search/mjolnir/deploy@d2ad2da]: bulk_daemon: support ltr model uploads (duration: 04m 29s)
[00:00:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:00:53] Operations, ops-eqiad: frqueue1001 system battery needs replacement - https://phabricator.wikimedia.org/T237582 (wiki_willy) a:Jclark-ctr @Jgreen - looks like the warranty ended for the server a few months ago in May. Let me know if you're looking to decommission this server soon or if you would lik...
[00:02:13] so, anyone wants to deploy https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/549227 ? beta-only config patch
[00:04:59] Operations, netops, observability: Determine & implement near-term method for escalating network alerts - https://phabricator.wikimedia.org/T237587 (ayounsi) > Interface saturation See also T224888 > What else is in scope here? That's everything I have in mind right now. > In terms of “how” I can t...
[00:05:51] (CR) Reedy: [C: +2] Set wgVisualEditorRestbaseParsoidVariant='php' on Beta enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/549227 (https://phabricator.wikimedia.org/T229074) (owner: Bartosz Dziewoński)
[00:05:58] Reedy: heh.
thanks
[00:07:03] (Merged) jenkins-bot: Set wgVisualEditorRestbaseParsoidVariant='php' on Beta enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/549227 (https://phabricator.wikimedia.org/T229074) (owner: Bartosz Dziewoński)
[00:08:41] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[00:08:59] !log reedy@deploy1001 Synchronized wmf-config/InitialiseSettings-labs.php: https://gerrit.wikimedia.org/r/549227 (duration: 01m 00s)
[00:09:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:09:51] (PS8) Dzahn: wmf_auto_reimage: Adjust message about waiting for puppet [puppet] - https://gerrit.wikimedia.org/r/522567
[00:10:14] (CR) Dzahn: wmf_auto_reimage: Adjust message about waiting for puppet (2 comments) [puppet] - https://gerrit.wikimedia.org/r/522567 (owner: Dzahn)
[00:11:06] Operations, serviceops, User-jijiki: mw2225 keeps sending cronspam for hhvm-needs-restart - https://phabricator.wikimedia.org/T236799 (Dzahn) Mysterious. I could not find "hhvm-needs-restart" anywhere. Not in any crontab, not in systemd timers.. not anywhere else in /etc....
[00:12:23] (CR) jerkins-bot: [V: -1] wmf_auto_reimage: Adjust message about waiting for puppet [puppet] - https://gerrit.wikimedia.org/r/522567 (owner: Dzahn)
[00:13:06] Reedy: should it be live now? i don't see the expected change
[00:13:53] Reedy: never mind, it's good, i think the module was cached or something.
it works in debug mode at least
[00:14:05] MatmaRex: Well, that and beta scap stuff only just ran :P
[00:14:11] on https://en.wikipedia.beta.wmflabs.org/wiki/Main_Page run `mw.config.get('wgVisualEditorConfig').parsoidVariant` in the browser console, should return 'php'
[00:14:20] It runs every ~10 minutes
[00:14:52] Reedy: oh, i see, when i saw the logmsgbot comment, i assumed it's live
[00:15:00] Nope, just making prod consistent
[00:15:20] okay, that makes sense. thanks!
[00:18:47] (PS9) Dzahn: wmf_auto_reimage: Adjust message about waiting for puppet [puppet] - https://gerrit.wikimedia.org/r/522567
[00:21:15] (CR) jerkins-bot: [V: -1] wmf_auto_reimage: Adjust message about waiting for puppet [puppet] - https://gerrit.wikimedia.org/r/522567 (owner: Dzahn)
[00:21:19] !log enable interface damping on primary eqsin-codfw link - T236878
[00:21:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:21:24] T236878: Improve resiliency of the eqsin transport link - https://phabricator.wikimedia.org/T236878
[00:24:38] (CR) Dzahn: "clearly can't make jenkins-bot happy.. even though none of the lines i changed are appearing in pep8 output. i can just give up on this. i" [puppet] - https://gerrit.wikimedia.org/r/522567 (owner: Dzahn)
[00:26:53] Operations, ops-eqiad: rack/setup/install ms-be105[7-9].eqiad.wmnet - https://phabricator.wikimedia.org/T237438 (Jclark-ctr) Rob Finished configuring bios and mgmt ` host switch ports ms-be1057 25 ms-be1058 23 ms-be1059 17 `
[00:27:04] Operations, netops, Wikimedia-Incident: Improve resiliency of the eqsin transport link - https://phabricator.wikimedia.org/T236878 (ayounsi) Open→Resolved a:ayounsi Damping configured.
[00:28:55] (PS10) Dzahn: wmf_auto_reimage: Adjust message about waiting for puppet [puppet] - https://gerrit.wikimedia.org/r/522567
[00:30:34] mdholloway: You about? I'm a little confused about the MachineVision rollout to production...
How has that happened without your dependencies being added to vendor?
[00:30:55] (CR) jerkins-bot: [V: -1] wmf_auto_reimage: Adjust message about waiting for puppet [puppet] - https://gerrit.wikimedia.org/r/522567 (owner: Dzahn)
[00:34:35] RECOVERY - ElasticSearch shard size check - 9243 on search.svc.codfw.wmnet is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23If_it_has_been_indexed
[00:35:23] (Abandoned) Dzahn: wmf_auto_reimage: Adjust message about waiting for puppet [puppet] - https://gerrit.wikimedia.org/r/522567 (owner: Dzahn)
[00:41:09] (PS3) Dzahn: cumin: fix some Python lint in wmf_auto_reimage_lib [puppet] - https://gerrit.wikimedia.org/r/545687
[00:44:49] (CR) Dzahn: [C: +2] cumin: fix some Python lint in wmf_auto_reimage_lib [puppet] - https://gerrit.wikimedia.org/r/545687 (owner: Dzahn)
[00:47:49] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[01:00:04] twentyafterfour: #bothumor My software never has bugs. It just develops random features. Rise for Phabricator update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191107T0100).
[01:04:40] (PS5) Jeena Huneidi: Modify Restrouter chart to allow for minikube development [deployment-charts] - https://gerrit.wikimedia.org/r/545421 (https://phabricator.wikimedia.org/T228910)
[01:06:37] (CR) EBernhardson: [C: +1] [cirrus] Enable Glent M0 for dewiki, enwiki and frwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/548751 (https://phabricator.wikimedia.org/T237365) (owner: DCausse)
[01:21:13] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[01:32:25] PROBLEM - Check the Netbox report puppetdb for fail status.
on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[01:37:05] (CR) Masumrezarock100: Give commonswiki filemovers `suppressredirect` rights (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/549194 (https://phabricator.wikimedia.org/T236348) (owner: DannyS712)
[01:48:43] (PS2) Tim Starling: Enable REST API on all WMF wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/549219 (https://phabricator.wikimedia.org/T237555)
[01:49:25] (CR) jerkins-bot: [V: -1] Enable REST API on all WMF wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/549219 (https://phabricator.wikimedia.org/T237555) (owner: Tim Starling)
[03:01:17] (CR) Vgutierrez: [C: +1] base: certificates: add new GlobalSign CA files [puppet] - https://gerrit.wikimedia.org/r/549058 (https://phabricator.wikimedia.org/T237066) (owner: Arturo Borrero Gonzalez)
[03:08:45] PROBLEM - BFD status on cr2-eqdfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[03:10:23] RECOVERY - BFD status on cr2-eqdfw is OK: OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[03:23:09] (PS6) Jeena Huneidi: Modify Restrouter chart to allow for minikube development [deployment-charts] - https://gerrit.wikimedia.org/r/545421 (https://phabricator.wikimedia.org/T228910)
[03:29:03] PROBLEM - BFD status on cr2-eqdfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[03:29:33] RECOVERY - BFD status on cr2-eqdfw is OK: OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[03:35:19] PROBLEM - Check the Netbox report puppetdb for fail status.
on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[03:36:49] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[03:38:21] PROBLEM - OSPF status on cr2-knams is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[03:39:57] RECOVERY - OSPF status on cr2-knams is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[03:40:01] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[04:13:00] (PS2) Vgutierrez: prometheus: Provide global aggregation rules for trafficserver requests [puppet] - https://gerrit.wikimedia.org/r/548954 (https://phabricator.wikimedia.org/T236482)
[04:25:37] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[04:53:35] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[05:03:14] Operations, observability, User-fgiunchedi, cloud-services-team (Kanban): Deprecate Diamond collectors in Cloud VPS - https://phabricator.wikimedia.org/T210993 (bd808)
[05:15:56] PROBLEM - Check the Netbox report puppetdb for fail status.
on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[05:24:52] Operations, ops-eqiad: frqueue1001 system battery needs replacement - https://phabricator.wikimedia.org/T237582 (Peachey88)
[05:44:47] (PS1) BryanDavis: cloud: Replace diamond::collector::minimalpuppetagent [puppet] - https://gerrit.wikimedia.org/r/549241 (https://phabricator.wikimedia.org/T210993)
[05:49:27] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[05:49:49] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1001), Fresh: 92 jobs https://wikitech.wikimedia.org/wiki/Backups%23Monitoring
[06:01:34] (PS1) BryanDavis: cloud: labstore::nfs_mount remove cleanup code [puppet] - https://gerrit.wikimedia.org/r/549242
[06:11:47] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[06:17:17] Amir1: how is migration of wb item terms going? asking so I know when to flip the switch and start dumping the new tables (i.e.
as soon as all the old data is in them and all new data gets written to them)
[06:39:29] (PS3) DannyS712: Give commonswiki filemovers `suppressredirect` rights [mediawiki-config] - https://gerrit.wikimedia.org/r/549194 (https://phabricator.wikimedia.org/T236348)
[06:40:50] (PS4) DannyS712: Give commonswiki filemovers `suppressredirect` rights [mediawiki-config] - https://gerrit.wikimedia.org/r/549194 (https://phabricator.wikimedia.org/T236348)
[06:51:07] PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:51:35] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:54:03] (CR) Elukey: "Thanks a lot! Sorry for the mistake :(" [homer/public] - https://gerrit.wikimedia.org/r/549206 (owner: Ayounsi)
[06:55:57] there is Zayo maintenance scheduled (not in the gcal but in the maintenance's emails)
[07:16:29] (CR) Mobrovac: [C: -1] allow different memory limit settings for parsoid-php servers (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/548944 (https://phabricator.wikimedia.org/T236833) (owner: Dzahn)
[07:19:31] (CR) Mobrovac: "> Like this https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/548944" [mediawiki-config] - https://gerrit.wikimedia.org/r/548923 (https://phabricator.wikimedia.org/T236833) (owner: Dzahn)
[07:33:17] Operations, Puppet, DBA, Patch-For-Review, User-jbond: Extend Puppet CA Expiry date - https://phabricator.wikimedia.org/T236277 (Joe) One suggestion: shouldn't we keep the old CA cert around while transitioning? What I mean is we should keep the current CA cert around in our approved certif...
[07:33:39] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:34:09] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 269, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:39:58] (CR) Giuseppe Lavagetto: [C: -1] "Overall seems sensible (but I didn't check functionality 1:1)." (2 comments) [puppet] - https://gerrit.wikimedia.org/r/544214 (owner: Jbond)
[07:45:53] Wait a sec, Telia link down as well?
[07:46:45] yes it seems from the calendar
[07:47:08] so we have eqord and eqdfw now to use?
[07:47:23] in theory yes and it shouldn't be a big issue, but let's triple check
[07:47:42] akosiaris, paravoid - around? Probably nothing but I'd like to triple check with you
[07:53:46] also mark (if you are around)
[07:54:35] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:54:47] eqord is definitely getting more traffic https://librenms.wikimedia.org/device/device=140/tab=overview/
[07:57:03] ok so there is a transport between cr1-eqiad and cr2-eqdfw https://librenms.wikimedia.org/device/device=139/tab=port/port=16739/
[07:57:09] so we have two links left
[07:58:32] (out of four)
[08:01:33] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[08:07:23] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:12:09] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[08:14:45] PROBLEM - Check the Netbox report puppetdb for fail status.
on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[08:29:51] (CR) Ema: [V: +2 C: +2] Add architecture diagram [software/fifo-log-demux] - https://gerrit.wikimedia.org/r/549064 (owner: Ema)
[08:30:40] !log upgrade and restart db2093
[08:30:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:14] Operations, Traffic: ATS skipping certain logs due to lack of buffer space - https://phabricator.wikimedia.org/T237608 (ema)
[08:40:22] Operations, Traffic: ATS skipping certain logs due to lack of buffer space - https://phabricator.wikimedia.org/T237608 (ema) p:Triage→Normal
[08:41:38] (PS2) Ema: ATS: double log_buffer_size and max_line_size [puppet] - https://gerrit.wikimedia.org/r/548258 (https://phabricator.wikimedia.org/T237608)
[08:48:33] !log stop and upgrade es2011
[08:48:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:51:33] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 93 jobs https://wikitech.wikimedia.org/wiki/Backups%23Monitoring
[08:51:42] (PS1) Joal: Update turnilo config for webrequest_sampled_128 [puppet] - https://gerrit.wikimedia.org/r/549423 (https://phabricator.wikimedia.org/T237117)
[08:51:42] cool
[08:51:47] elukey: --^
[08:56:15] (CR) Elukey: [C: +2] Update turnilo config for webrequest_sampled_128 [puppet] - https://gerrit.wikimedia.org/r/549423 (https://phabricator.wikimedia.org/T237117) (owner: Joal)
[08:56:48] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/548706 (https://phabricator.wikimedia.org/T237259) (owner: Jbond)
[09:00:40] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/545575 (https://phabricator.wikimedia.org/T234332) (owner: Jbond)
[09:02:58] Operations, Analytics, Analytics-Kanban, Traffic, and 2 others: Update webrequest_128 dataset in turnilo to include TLS fields once
available - https://phabricator.wikimedia.org/T237117 (JAllemandou) Done! Hidding the turnilo aweful link under [[ https://turnilo.wikimedia.org/#webrequest_sampled_...
[09:03:22] !log stop and upgrade es2012, es2014
[09:03:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:06:55] (PS1) Jcrespo: mariadb: Depool es1016 (read only es1) for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/549424
[09:09:11] (CR) Jcrespo: [C: +2] mariadb: Depool es1016 (read only es1) for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/549424 (owner: Jcrespo)
[09:09:56] (Merged) jenkins-bot: mariadb: Depool es1016 (read only es1) for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/549424 (owner: Jcrespo)
[09:10:27] (PS2) DCausse: [cirrus] remove cross_cluster_single_shard_search quirk [mediawiki-config] - https://gerrit.wikimedia.org/r/546970
[09:10:41] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[09:11:57] Operations, Analytics, Analytics-Kanban, Traffic, and 2 others: Publish tls related info to webrequest via varnish - https://phabricator.wikimedia.org/T233661 (JAllemandou) Done! Example: ` spark2-shell --master yarn --driver-memory 4G --executor-memory 8G --executor-cores 4 --conf spark.dynamicA...
[09:12:51] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: depool es1016 (duration: 01m 04s)
[09:12:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:15:06] Operations, ops-eqiad, Analytics, User-Elukey: Check if a GPU fits in any of the remaining stat or notebook hosts - https://phabricator.wikimedia.org/T220698 (elukey) @Cmjohnson we might need to add a new GPU next quarter (need to triple check with the Research team), is there any of the above ho...
[09:18:40] !log installing Java security updates on aqs/druid/Hadoop
[09:18:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:30:23] !log stop and upgrade es1016
[09:30:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:34:44] (PS1) Jcrespo: mariadb: Repool es1016 with low weight after maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/549429
[09:39:35] (CR) Jcrespo: [C: +2] mariadb: Repool es1016 with low weight after maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/549429 (owner: Jcrespo)
[09:40:19] (Merged) jenkins-bot: mariadb: Repool es1016 with low weight after maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/549429 (owner: Jcrespo)
[09:41:48] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: repool es1016 with low weight (duration: 01m 02s)
[09:41:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:41:54] Operations, Analytics, Analytics-Kanban, Traffic, observability: Update webrequest_128 dataset in turnilo to include TLS fields once available - https://phabricator.wikimedia.org/T237117 (Vgutierrez) @BBlack I'm seeing some "nil" values on the TLS KeyExchange field when AES128-SHA is being us...
[09:43:27] (PS1) Jcrespo: mariabd: Repool es1016 full after maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/549430
[09:47:33] Operations, Puppet, serviceops, Patch-For-Review, User-jbond: Rolling restart of etcd to pick up the renewed CA public certificate. - https://phabricator.wikimedia.org/T237362 (Joe) Correction: # We will need to restart etcd in eqiad as the CA is used in etcd::v3 for peer-to-peer communicati...
[09:49:04] Operations, Puppet, DBA, Patch-For-Review, User-jbond: Document all uses of the puppetCA certificate - https://phabricator.wikimedia.org/T237259 (Joe) >>!
In T237259#5641301, @jbond wrote: > @Eevans moritz mentioned there maybe some cassandra consideration to take into account and you could e...
[09:49:50] (PS2) ArielGlenn: store generated misc cron dump output on second nfs server [puppet] - https://gerrit.wikimedia.org/r/447402 (https://phabricator.wikimedia.org/T200180)
[09:49:52] (CR) Vgutierrez: [C: +1] ATS: double log_buffer_size and max_line_size [puppet] - https://gerrit.wikimedia.org/r/548258 (https://phabricator.wikimedia.org/T237608) (owner: Ema)
[09:50:29] (CR) Arturo Borrero Gonzalez: [C: +2] base: certificates: add new GlobalSign CA files [puppet] - https://gerrit.wikimedia.org/r/549058 (https://phabricator.wikimedia.org/T237066) (owner: Arturo Borrero Gonzalez)
[09:51:28] Operations, Analytics, Analytics-Kanban, Traffic, observability: Update webrequest_128 dataset in turnilo to include TLS fields once available - https://phabricator.wikimedia.org/T237117 (Vgutierrez) I'm already loving the data, thanks @JAllemandou <3
[09:51:53] (CR) jerkins-bot: [V: -1] store generated misc cron dump output on second nfs server [puppet] - https://gerrit.wikimedia.org/r/447402 (https://phabricator.wikimedia.org/T200180) (owner: ArielGlenn)
[09:54:43] (PS1) Ema: Revert "ATS: remap stream.wm.org websocket requests" [puppet] - https://gerrit.wikimedia.org/r/549431
[09:56:29] Operations, ops-eqiad, Discovery-Search (Current work), Patch-For-Review: (Aug 30th, 2019) rack/setup/install elastic10[53-67].eqiad.wmnet - https://phabricator.wikimedia.org/T230746 (Gehel) a:Gehel→Cmjohnson It looks like there is still a few steps before "handoff for service implementat...
[09:59:10] (PS1) Ema: Revert "ATS: remap stream.wmo.org requests on ats-tls as well" [puppet] - https://gerrit.wikimedia.org/r/549433
[09:59:47] (CR) Ema: [C: +2] Revert "ATS: remap stream.wm.org websocket requests" [puppet] - https://gerrit.wikimedia.org/r/549431 (owner: Ema)
[10:00:05] (PS1) Jbond: wmflib - puppet_config: use symbols for interpolate function [puppet] - https://gerrit.wikimedia.org/r/549435
[10:00:07] (CR) Ema: [C: +2] Revert "ATS: remap stream.wmo.org requests on ats-tls as well" [puppet] - https://gerrit.wikimedia.org/r/549433 (owner: Ema)
[10:04:08] (CR) Jbond: [C: +2] wmflib - puppet_config: use symbols for interpolate function [puppet] - https://gerrit.wikimedia.org/r/549435 (owner: Jbond)
[10:04:26] (CR) Jbond: [C: +2] profile::base: reorder class [puppet] - https://gerrit.wikimedia.org/r/548706 (https://phabricator.wikimedia.org/T237259) (owner: Jbond)
[10:11:04] (CR) Giuseppe Lavagetto: [C: -1] Modify Restrouter chart to allow for minikube development (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/545421 (https://phabricator.wikimedia.org/T228910) (owner: Jeena Huneidi)
[10:16:51] Operations, Puppet, DBA, Patch-For-Review, User-jbond: Document all uses of the puppetCA certificate - https://phabricator.wikimedia.org/T237259 (Joe) The calico/node service, and the kube-controller-manager service will need to be restarted on the kubernetes workers and masters respectively.
[10:17:02] (PS3) ArielGlenn: store generated misc cron dump output on second nfs server [puppet] - https://gerrit.wikimedia.org/r/447402 (https://phabricator.wikimedia.org/T200180)
[10:19:12] (CR) jerkins-bot: [V: -1] store generated misc cron dump output on second nfs server [puppet] - https://gerrit.wikimedia.org/r/447402 (https://phabricator.wikimedia.org/T200180) (owner: ArielGlenn)
[10:26:17] (PS1) Ema: envoyproxy: allow 'websocket' connection upgrades [puppet] - https://gerrit.wikimedia.org/r/549443
[10:27:49] !log jynus@cumin1001 dbctl commit (dc=all): 'Reduce db1080 weight', diff saved to https://phabricator.wikimedia.org/P9548 and previous config saved to /var/cache/conftool/dbconfig/20191107-102747-jynus.json
[10:27:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:28:29] !log upgrading mw1277-1279 servers to PHP 7.2.24 T237239
[10:28:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:28:34] T237239: Upgrade to PHP 7.2.24 - https://phabricator.wikimedia.org/T237239
[10:38:12] (PS2) Ema: envoyproxy: support 'websocket' connection upgrades [puppet] - https://gerrit.wikimedia.org/r/549443
[10:40:09] (CR) jerkins-bot: [V: -1] envoyproxy: support 'websocket' connection upgrades [puppet] - https://gerrit.wikimedia.org/r/549443 (owner: Ema)
[10:40:23] (PS1) Ema: phabricator: allow websockets via tls terminator [puppet] - https://gerrit.wikimedia.org/r/549444
[10:41:29] (PS3) Ema: envoyproxy: support 'websocket' connection upgrades [puppet] - https://gerrit.wikimedia.org/r/549443
[10:41:55] (PS2) Ema: phabricator: allow websockets via tls terminator [puppet] - https://gerrit.wikimedia.org/r/549444
[10:42:05] (CR) Vgutierrez: [C: +1] envoyproxy: support 'websocket' connection upgrades [puppet] - https://gerrit.wikimedia.org/r/549443 (owner: Ema)
[10:44:23] (PS4) Ema: envoyproxy: support 'websocket' connection upgrades [puppet] -
https://gerrit.wikimedia.org/r/549443
[10:44:55] (PS16) Jbond: puppetmnasters: use localcacert setting for CA file in apache [puppet] - https://gerrit.wikimedia.org/r/545575 (https://phabricator.wikimedia.org/T234332)
[10:45:42] (CR) Giuseppe Lavagetto: [C: -1] "Apart from my previous comments on volumes/volumemounts, see the comments to the helmfile values." (4 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/545421 (https://phabricator.wikimedia.org/T228910) (owner: Jeena Huneidi)
[10:46:09] (CR) jerkins-bot: [V: -1] puppetmnasters: use localcacert setting for CA file in apache [puppet] - https://gerrit.wikimedia.org/r/545575 (https://phabricator.wikimedia.org/T234332) (owner: Jbond)
[10:46:19] !log jynus@cumin1001 dbctl commit (dc=all): 'Fully depool db1080', diff saved to https://phabricator.wikimedia.org/P9549 and previous config saved to /var/cache/conftool/dbconfig/20191107-104618-jynus.json
[10:46:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:47:04] (PS3) Ema: phabricator: allow websockets via tls terminator [puppet] - https://gerrit.wikimedia.org/r/549444
[10:48:17] (PS4) Ema: etherpad: allow websockets via tls terminator [puppet] - https://gerrit.wikimedia.org/r/549444
[10:48:21] (CR) Giuseppe Lavagetto: [C: +1] "LGTM thanks!"
[puppet] - https://gerrit.wikimedia.org/r/549443 (owner: Ema)
[10:49:08] (PS17) Jbond: puppetmnasters: use localcacert setting for CA file in apache [puppet] - https://gerrit.wikimedia.org/r/545575 (https://phabricator.wikimedia.org/T234332)
[10:50:36] (PS1) Jbond: pupet_compiler: update compiler-update-fact [puppet] - https://gerrit.wikimedia.org/r/549446
[10:50:43] !log installing Java security updates on wdqs/maps
[10:50:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:51:28] (PS2) Jbond: pupet_compiler: update compiler-update-fact [puppet] - https://gerrit.wikimedia.org/r/549446
[10:52:11] (CR) Ema: [C: +2] envoyproxy: support 'websocket' connection upgrades [puppet] - https://gerrit.wikimedia.org/r/549443 (owner: Ema)
[10:58:28] !log installing Java security updates on kafka-main/logstash
[10:58:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:16:08] !log stop and upgrade db1080
[11:16:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:19:06] (CR) MarcoAurelio: [C: +1] "Looks good & uncontroversial."
[puppet] - https://gerrit.wikimedia.org/r/530989 (https://phabricator.wikimedia.org/T201491) (owner: DannyS712)
[11:20:57] (CR) MarcoAurelio: [C: +1] "recheck" [puppet] - https://gerrit.wikimedia.org/r/530989 (https://phabricator.wikimedia.org/T201491) (owner: DannyS712)
[11:20:57] (PS4) ArielGlenn: store generated misc cron dump output on second nfs server [puppet] - https://gerrit.wikimedia.org/r/447402 (https://phabricator.wikimedia.org/T200180)
[11:21:04] (PS1) Kosta Harlan: Flow: Configure enwiki beta to use Parsoid PHP [mediawiki-config] - https://gerrit.wikimedia.org/r/549448 (https://phabricator.wikimedia.org/T229078)
[11:22:23] (CR) jerkins-bot: [V: -1] store generated misc cron dump output on second nfs server [puppet] - https://gerrit.wikimedia.org/r/447402 (https://phabricator.wikimedia.org/T200180) (owner: ArielGlenn)
[11:24:07] (PS1) Daniel Kinzler: Set all sites to use the new MCR-only schema [mediawiki-config] - https://gerrit.wikimedia.org/r/549449 (https://phabricator.wikimedia.org/T198312)
[11:24:25] (CR) jerkins-bot: [V: -1] Set all sites to use the new MCR-only schema [mediawiki-config] - https://gerrit.wikimedia.org/r/549449 (https://phabricator.wikimedia.org/T198312) (owner: Daniel Kinzler)
[11:24:51] PROBLEM - Check the Netbox report puppetdb for fail status.
on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[11:29:39] (PS2) Daniel Kinzler: Set all sites to use the new MCR-only schema [mediawiki-config] - https://gerrit.wikimedia.org/r/549449 (https://phabricator.wikimedia.org/T198312)
[11:30:12] (CR) jerkins-bot: [V: -1] Set all sites to use the new MCR-only schema [mediawiki-config] - https://gerrit.wikimedia.org/r/549449 (https://phabricator.wikimedia.org/T198312) (owner: Daniel Kinzler)
[11:35:50] (PS1) Jbond: profile::base::puppet: move the puppetca to a global variable [puppet] - https://gerrit.wikimedia.org/r/549450 (https://phabricator.wikimedia.org/T234332)
[11:35:52] (PS1) Jbond: profile::puppetmaster: update to allow managing the CA file [puppet] - https://gerrit.wikimedia.org/r/549451 (https://phabricator.wikimedia.org/T234332)
[11:36:12] !log jynus@cumin1001 dbctl commit (dc=all): 'Repool db1080 at 10%', diff saved to https://phabricator.wikimedia.org/P9550 and previous config saved to /var/cache/conftool/dbconfig/20191107-113611-jynus.json
[11:36:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:40:01] (CR) Jcrespo: "I am thinking 'wikitech' and 'wikitech-test', assuming those are good identifiers of the function of the sections.
CC @aboggot @marostegui" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/547596 (owner: 10Jforrester) [11:43:10] (03PS2) 10Jcrespo: mariabd: Repool es1016 fully after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/549430 [11:43:39] (03CR) 10Jbond: [C: 03+2] profile::base::puppet: move the puppetca to a global variable [puppet] - 10https://gerrit.wikimedia.org/r/549450 (https://phabricator.wikimedia.org/T234332) (owner: 10Jbond) [11:43:51] (03CR) 10Jbond: [C: 03+2] profile::puppetmaster: update to allow managing the CA file [puppet] - 10https://gerrit.wikimedia.org/r/549451 (https://phabricator.wikimedia.org/T234332) (owner: 10Jbond) [11:49:53] all: im about to start a rebuild of compiler 1002 [11:50:02] !log rebuilding compiler1002 [11:50:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:43] (03CR) 10Jbond: [C: 03+2] CI rspec: update puppet version used in spec tests [puppet] - 10https://gerrit.wikimedia.org/r/545289 (https://phabricator.wikimedia.org/T162070) (owner: 10Jbond) [11:53:51] (03PS5) 10Jbond: CI rspec: update puppet version used in spec tests [puppet] - 10https://gerrit.wikimedia.org/r/545289 (https://phabricator.wikimedia.org/T162070) [11:54:21] !log update puppet_version used by CI 545289 [11:54:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:03] (03CR) 10Jcrespo: [C: 03+2] mariabd: Repool es1016 fully after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/549430 (owner: 10Jcrespo) [11:55:47] (03Merged) 10jenkins-bot: mariabd: Repool es1016 fully after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/549430 (owner: 10Jcrespo) [11:56:25] (03CR) 10jerkins-bot: [V: 04-1] CI rspec: update puppet version used in spec tests [puppet] - 10https://gerrit.wikimedia.org/r/545289 (https://phabricator.wikimedia.org/T162070) (owner: 10Jbond) [11:57:56] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: repool es1016 fully 
(duration: 01m 01s) [11:57:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:04] (03CR) 10Ema: [C: 03+2] etherpad: allow websockets via tls terminator [puppet] - 10https://gerrit.wikimedia.org/r/549444 (owner: 10Ema) [12:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: How many deployers does it take to do European Mid-day SWAT(Max 6 patches) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191107T1200). [12:00:04] dcausse and Zoranzoki21: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [12:00:17] I can SWAT today! [12:00:32] Thanks :-) [12:00:54] yw awight [12:00:58] dcausse: around? [12:01:37] (03CR) 10Urbanecm: [C: 03+2] Give commonswiki filemovers `suppressredirect` rights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/549194 (https://phabricator.wikimedia.org/T236348) (owner: 10DannyS712) [12:02:24] (03Merged) 10jenkins-bot: Give commonswiki filemovers `suppressredirect` rights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/549194 (https://phabricator.wikimedia.org/T236348) (owner: 10DannyS712) [12:03:13] (03PS6) 10Jbond: CI rspec: update puppet version used in spec tests [puppet] - 10https://gerrit.wikimedia.org/r/545289 (https://phabricator.wikimedia.org/T162070) [12:04:37] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 5253dec: Give commonswiki filemovers `suppressredirect` rights (T236348) (duration: 01m 03s) [12:04:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:47] T236348: Give suppressredirect right to filemovers on Commons - https://phabricator.wikimedia.org/T236348 [12:06:35] (03CR) 10Jbond: [C: 03+2] CI rspec: update puppet version used in spec tests [puppet] - 10https://gerrit.wikimedia.org/r/545289 (https://phabricator.wikimedia.org/T162070) (owner: 
10Jbond) [12:09:10] Urbanecm: yes, sorry! [12:09:29] (03CR) 10Urbanecm: [C: 03+2] [cirrus] remove cross_cluster_single_shard_search quirk [mediawiki-config] - 10https://gerrit.wikimedia.org/r/546970 (owner: 10DCausse) [12:09:34] thanks for the ping :) [12:09:38] (03PS3) 10Urbanecm: [cirrus] remove cross_cluster_single_shard_search quirk [mediawiki-config] - 10https://gerrit.wikimedia.org/r/546970 (owner: 10DCausse) [12:09:44] (03CR) 10Urbanecm: [C: 03+2] [cirrus] remove cross_cluster_single_shard_search quirk [mediawiki-config] - 10https://gerrit.wikimedia.org/r/546970 (owner: 10DCausse) [12:09:58] yw dcausse [12:10:19] (03PS5) 10ArielGlenn: store generated misc cron dump output on second nfs server [puppet] - 10https://gerrit.wikimedia.org/r/447402 (https://phabricator.wikimedia.org/T200180) [12:10:32] (03Merged) 10jenkins-bot: [cirrus] remove cross_cluster_single_shard_search quirk [mediawiki-config] - 10https://gerrit.wikimedia.org/r/546970 (owner: 10DCausse) [12:11:56] dcausse: please test at mwdebug1001 and lmk [12:11:59] sure [12:12:58] (03CR) 10jerkins-bot: [V: 04-1] store generated misc cron dump output on second nfs server [puppet] - 10https://gerrit.wikimedia.org/r/447402 (https://phabricator.wikimedia.org/T200180) (owner: 10ArielGlenn) [12:14:25] Hi [12:14:28] Sorry for lating [12:14:53] Urbanecm: looks good to me [12:14:59] Who works for SWAT today? [12:15:01] Zoranzoki21: please provide link to the patch you want to be SWATted [12:15:03] * Urbanecm does [12:15:09] It is in calendar [12:15:27] Zoranzoki21: there is no link in the calendar [12:15:30] dcausse: thanks, syncing [12:16:25] https://gerrit.wikimedia.org/r/549434 [12:16:40] My bad I forgot to add template [12:16:54] Zoranzoki21: could you please add it to the calendar as well? 
[12:16:59] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 2be3f86: [cirrus] remove cross_cluster_single_shard_search quirk (duration: 01m 02s) [12:17:00] Sure [12:17:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:07] dcausse: synced! [12:17:43] Urbanecm: great, thanks a lot! [12:17:49] yw dcausse [12:19:37] Zoranzoki21: your patch has -1 from jenkins. [12:19:43] Urbanecm: Done [12:19:45] RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:19:53] I don't know why he have - 1 [12:19:59] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:20:12] Jdlrobson? [12:20:16] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "thanks for working on this." [puppet] - 10https://gerrit.wikimedia.org/r/549446 (owner: 10Jbond) [12:20:38] Zoranzoki21: I'm not going to deploy a patch with -1 [12:20:51] Ok [12:21:06] I will talk with Jdlrobson as patch in master is his [12:21:24] But we need it deployed today because of train [12:21:34] 10Operations, 10Continuous-Integration-Infrastructure: Jessie rsyslog_8.1901.0-1~bpo8+wmf1_amd64.deb package fails to upgrade - https://phabricator.wikimedia.org/T222166 (10MoritzMuehlenhoff) 05Stalled→03Declined Closing the task, this was a one time migration and jessie is on it's way out. [12:22:58] Zoranzoki21: I'm not going to deploy a patch with -1 from jenkins [12:23:15] Ok I understand you [12:23:52] I told you who made the original patch, to it is deployed in master branch. 
[12:24:05] (03PS12) 10Jbond: puppet-merge: refactor [puppet] - 10https://gerrit.wikimedia.org/r/544214 [12:24:19] And to I will talk with him, and to it needs to be deployed today in wmf branch because of train [12:25:02] (03CR) 10Jbond: "Thanks updated" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/544214 (owner: 10Jbond) [12:25:23] Urbanecm: could I add a config patch to this swat window? [12:25:43] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:26:23] (03PS3) 10Daniel Kinzler: Set all sites to use the new MCR-only schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/549449 (https://phabricator.wikimedia.org/T198312) [12:26:41] kostajh: certainly [12:26:47] It's a no-op patch so it's not urgent, just want to get it out fo the way. https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/547831 [12:26:57] I'll add it to the deploy calendar [12:27:28] thx [12:27:39] (03CR) 10Arturo Borrero Gonzalez: "Thanks for dealing with this. I really think we need to take some time to improve our packaging workflows." 
(032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/549211 (owner: 10Bstorm) [12:27:51] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [12:27:59] (03PS2) 10Urbanecm: GrowthExperiments: Configure intro links for suggested edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/547831 (https://phabricator.wikimedia.org/T235723) (owner: 10Catrope) [12:28:01] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/547831 (https://phabricator.wikimedia.org/T235723) (owner: 10Catrope) [12:28:56] (03CR) 10Arturo Borrero Gonzalez: "> Patch Set 1:" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/549211 (owner: 10Bstorm) [12:28:56] (03Merged) 10jenkins-bot: GrowthExperiments: Configure intro links for suggested edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/547831 (https://phabricator.wikimedia.org/T235723) (owner: 10Catrope) [12:31:18] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 19034af: GrowthExperiments: Configure intro links for suggested edits (T235723) (duration: 01m 00s) [12:31:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:23] T235723: Newcomer tasks: intro and difficulty overlays - https://phabricator.wikimedia.org/T235723 [12:31:31] kostajh: here you are. Anything else? :) [12:31:53] PROBLEM - Check the Netbox report puppetdb for fail status. 
on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [12:32:49] Urbanecm: yes, one more coming in a minute [12:33:48] okay [12:33:50] 10Operations, 10Puppet, 10Continuous-Integration-Config, 10Patch-For-Review: puppet.git rake fails with ruby 2.5 - https://phabricator.wikimedia.org/T208566 (10jbond) 05Open→03Resolved i think this is resolved now, please reopen if there is still an issue [12:34:39] (03PS1) 10Kosta Harlan: GrowthExperiments: Configure testwiki for suggested edits testing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/549454 (https://phabricator.wikimedia.org/T237634) [12:34:50] Urbanecm: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/549454 [12:34:53] Adding to calendar now [12:35:13] Urbanecm: this one I would like to test with mwdebug please :) [12:35:21] sure [12:35:29] (03CR) 10Urbanecm: [C: 03+2] GrowthExperiments: Configure testwiki for suggested edits testing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/549454 (https://phabricator.wikimedia.org/T237634) (owner: 10Kosta Harlan) [12:36:40] (03Merged) 10jenkins-bot: GrowthExperiments: Configure testwiki for suggested edits testing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/549454 (https://phabricator.wikimedia.org/T237634) (owner: 10Kosta Harlan) [12:36:51] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:37:23] kostajh: pulled at mwdebug1001, test and let me know [12:37:31] Urbanecm: looking [12:38:25] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [12:38:41] Urbanecm: argh, forgot one important value (wgGERestbaseUrl) [12:38:50] kostajh: upload a follow up patch then :) [12:39:08] Urbanecm: doing [12:41:46] 
(03PS1) 10Kosta Harlan: GrowthExperiments: Set RestbaseUrl for suggested edits on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/549455 (https://phabricator.wikimedia.org/T237634) [12:42:23] kostajh: is that ^^ the patch? :) [12:42:46] Urbanecm: yes [12:42:50] (03PS2) 10Kosta Harlan: GrowthExperiments: Set RestbaseUrl for suggested edits on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/549455 (https://phabricator.wikimedia.org/T237634) [12:43:26] (03CR) 10Urbanecm: [C: 03+2] GrowthExperiments: Set RestbaseUrl for suggested edits on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/549455 (https://phabricator.wikimedia.org/T237634) (owner: 10Kosta Harlan) [12:44:09] (03Merged) 10jenkins-bot: GrowthExperiments: Set RestbaseUrl for suggested edits on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/549455 (https://phabricator.wikimedia.org/T237634) (owner: 10Kosta Harlan) [12:44:32] kostajh: please test at mwdebug1001 [12:44:37] Urbanecm: doing [12:45:26] Urbanecm: looks great :) and have confirmed that e.g. suggested edits doesn't load on cswiki (yet) [12:45:37] great. Syncing then [12:46:09] Thank you! [12:46:31] You're welcome :) [12:47:22] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 8e71601: a36ed85: GrowthExperiments: Configure testwiki for suggested edits testing + follow up patch (T237634) (duration: 00m 59s) [12:47:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:27] T237634: Configure testwiki for Suggested Edits testing - https://phabricator.wikimedia.org/T237634 [12:47:29] kostajh: done. Anything else? :) [12:47:40] Urbanecm: that's all, thanks again for accommodating this last minute request [12:47:48] happy to help!
[12:47:51] !log EU SWAT done [12:47:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:48:06] (03CR) 10Phamhi: [C: 03+1] "I reviewed the original changelog here https://gerrit.wikimedia.org/r/c/operations/puppet/+/306205 and it looks good" [puppet] - 10https://gerrit.wikimedia.org/r/549242 (owner: 10BryanDavis) [12:50:58] (03CR) 10Jcrespo: "> Patch Set 2:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/547596 (owner: 10Jforrester) [12:53:08] (03CR) 10Jcrespo: "> Wait, I thought wikitech.dblist was the list of the section, as s1.dblist. Is there a section list and a wikitech-related list of dbs? I" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/547596 (owner: 10Jforrester) [12:53:32] (03CR) 10Phamhi: [C: 03+2] cloud: labstore::nfs_mount remove cleanup code [puppet] - 10https://gerrit.wikimedia.org/r/549242 (owner: 10BryanDavis) [12:53:42] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] cloud: labstore::nfs_mount remove cleanup code [puppet] - 10https://gerrit.wikimedia.org/r/549242 (owner: 10BryanDavis) [12:58:06] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "LGTM, see my suggestion to improve the tests." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/549116 (owner: 10RLazarus) [13:03:51] (03CR) 10ArielGlenn: "> > Wait, I thought wikitech.dblist was the list of the section, as" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/547596 (owner: 10Jforrester) [13:05:41] Yes, I just found that wiki page. [13:05:44] Reedy: ^ [13:06:17] I plan to spend the morning dropping the dependencies, at least for now. [13:06:47] (03CR) 10Jcrespo: "> Patch Set 2:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/547596 (owner: 10Jforrester) [13:09:48] (03CR) 10ArielGlenn: ">>>>>>> ... 
AND the wikitech.dblist (which must" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/547596 (owner: 10Jforrester) [13:10:36] (03CR) 10Jcrespo: "> yeah there is a followup in Related Changes in gerrit: Iebc1d1eee35562fb301bfa56a607dbdb8ebeb033" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/547596 (owner: 10Jforrester) [13:12:40] (03CR) 10Jcrespo: "> Patch Set 2:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/547596 (owner: 10Jforrester) [13:22:21] 10Operations, 10Continuous-Integration-Infrastructure: Jessie rsyslog_8.1901.0-1~bpo8+wmf1_amd64.deb package fails to upgrade - https://phabricator.wikimedia.org/T222166 (10hashar) [13:22:45] 10Operations, 10Continuous-Integration-Infrastructure: Jessie rsyslog_8.1901.0-1~bpo8+wmf1_amd64.deb package fails to upgrade - https://phabricator.wikimedia.org/T222166 (10hashar) Really that is unfair, when all that was needed is to simply rebuild the Jessie base image :-\ [13:24:28] 10Operations, 10Continuous-Integration-Infrastructure: Jessie rsyslog_8.1901.0-1~bpo8+wmf1_amd64.deb package fails to upgrade - https://phabricator.wikimedia.org/T222166 (10MoritzMuehlenhoff) This isn't "unfair", this task is about the update error in the package itself and there's absolutely nothing we can do... [13:27:04] (03PS1) 10Phamhi: toollabs::maintain_kubeusers remove cleanup code [puppet] - 10https://gerrit.wikimedia.org/r/549460 [13:44:29] 10Operations, 10Continuous-Integration-Infrastructure: Jessie rsyslog_8.1901.0-1~bpo8+wmf1_amd64.deb package fails to upgrade - https://phabricator.wikimedia.org/T222166 (10hashar) I did, that was in a sub task blocking this change. Anyway, my notes have the workaround mentioned by Filippo so that is good eno... 
[13:50:19] !log jynus@cumin1001 dbctl commit (dc=all): 'Repool db1080 at 50%', diff saved to https://phabricator.wikimedia.org/P9551 and previous config saved to /var/cache/conftool/dbconfig/20191107-135018-jynus.json [13:50:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:58] elukey: Sorry, just saw this. did you figure out? is everything ok? [13:54:47] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:55:17] PROBLEM - Check the last execution of netbox_ganeti_eqiad_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_eqiad_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [13:58:50] akosiaris: o/ - all good, we are down to one link with issues, Telia transport between eqiad and codfw, the Zayo one recovered. Am I right to say that between eqiad and codfw we have 4 transport links?
(counting eqdfw and eqord) [13:59:04] I want to understand when to start calling people :D [14:01:40] yes, they can communicate via eqord as well [14:02:00] but really anytime we have 2 or more links dead for eqiad/codfw, we should be aware and paying attention [14:02:31] (03CR) 10Andrew Bogott: [C: 03+1] "if PCC is happy then I'm happy" [puppet] - 10https://gerrit.wikimedia.org/r/545575 (https://phabricator.wikimedia.org/T234332) (owner: 10Jbond) [14:03:01] elukey: yes I think so [14:03:08] ah brandon answered already [14:03:28] (03PS6) 10Arturo Borrero Gonzalez: toolforge: proxy: adjust setup for the new k8s cluster [puppet] - 10https://gerrit.wikimedia.org/r/543135 (https://phabricator.wikimedia.org/T234037) [14:03:37] afaik the wikitech diagram is correct (two direct links from telia and zayo, a telia path codfw<->eqord<->eqiad, and a gtt path from codfw<->eqdfw<->eqiad) [14:05:51] RECOVERY - Check the last execution of netbox_ganeti_eqiad_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_eqiad_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [14:05:53] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:06:40] bblack: when I was on clinic duty, I pinged netop if I saw more than 2 scheduled maintenance at the same time, or one for sing [14:06:58] not sure if sing state has improved since [14:07:27] (03CR) 10Jbond: [C: 03+2] puppetmnasters: use localcacert setting for CA file in apache [puppet] - 10https://gerrit.wikimedia.org/r/545575 (https://phabricator.wikimedia.org/T234332) (owner: 10Jbond) [14:07:38] yeah singapore still has a single true transport, the fallback is a tunnel which is not ideal [14:08:13] elukey: so maybe we can document that as a general rule for now [14:09:09] I think in general anytime we lose even a single transport link anywhere, it's something to be aware of, at
least among those awake and around, that we're at increased risk of [14:09:36] losing codfw<->eqiad traffic completely is a pretty bad scenario, but also pretty unlikely given all the redundancy [14:10:01] and singapore is "special" - losing that one transport for an extended period of time, we might be better off depooling [14:11:48] it'd be nice to have some kind of simpler operational dashboard about this stuff [14:12:25] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [14:12:42] (somehow succinctly and visually representing transport outages + scheduled transport maint window overlaps affecting various sites... so that it's obvious that we have overlapping maint on a pair of redundant links, or an outage on one while in a maint window on another, etc) [14:13:13] (03CR) 10CDanis: [C: 03+1] prometheus: Provide global aggregation rules for trafficserver requests [puppet] - 10https://gerrit.wikimedia.org/r/548954 (https://phabricator.wikimedia.org/T236482) (owner: 10Vgutierrez) [14:14:13] thx cdanis <3 [14:14:34] vgutierrez: it took me way too long to puzzle through the meanings of all our various overloaded nouns [14:14:40] cluster, job, layer [14:15:01] we can invent new ones to reduce confusion maybe [14:15:20] yeah I don't have anything specific in mind but I like the thought [14:15:33] alsy maybe I just need another ☕ :) [14:15:42] s/alsy/also/ [14:16:03] e.g. if we have two meanings of the word cluster, we can rename the lesser of the two "snarkle" or something. [14:16:20] there aren't enough synonyms for all the overloaded words like these anyways, may as well make stuff up [14:16:50] bblack: thanks! Just to confirm, the gtt link via eqdfw is another transport right? Say that today also the eqord link went down, we'd have had another link to use right?
So we are resilient to 3 out of 4 transports between codfw and eqiad, even if not ideal getting in that position of course [14:17:20] ah but the link's capacity is also a factor right [14:17:22] (03PS7) 10Arturo Borrero Gonzalez: [local] toolforge: proxy: adjust setup for the new k8s cluster [puppet] - 10https://gerrit.wikimedia.org/r/543135 (https://phabricator.wikimedia.org/T234037) [14:17:49] elukey: more or less, I believe that's correct. keep in mind that the path via eqord is built out of two transports serially (both telia), and the eqdfw also relies on the codfw<->eqdfw interconnect. [14:17:53] (03PS8) 10Arturo Borrero Gonzalez: toolforge: proxy: adjust setup for the new k8s cluster [puppet] - 10https://gerrit.wikimedia.org/r/543135 (https://phabricator.wikimedia.org/T234037) [14:17:59] all the link capacities should be 10G, so we're ok there [14:18:47] bblack: yep yep you are right [14:19:35] in my mind the multiple transport links that goes through eqord and eqdfw wouldn't have failed :D [14:19:38] you can also look at this stuff manually and see where the routers see the spaces advertised from to confirm the paths that are available and live [14:20:20] today I checked in librenms and eqord was definitely receiving more traffic, meanwhile eqdfw no, this is why I asked [14:20:49] I am still in my learning phase, checking routing tables is still difficult :D [14:20:52] e.g. 
on cr[12]-eqiad, "show route 208.80.153.0/24" [14:21:49] all the paths on ae0.0 are where cr[12]-eqiad have a path through the opposite cr[12]-eqiad [14:22:24] but the others are showing various transport paths that those eqiad routers can see codfw IP space advertised from (different on each eqiad router) [14:23:22] hmmm that command output is way too noisy though, so many sub-routes [14:24:05] but anyways, different transports terminate on exactly one of cr1 or cr2 at each site too, so there's some redundancy issues to consider there as well, if we had a router outage [14:24:17] (03PS2) 10RLazarus: httpbb: Add httpbb module and baseurls test suite cribbed from apache-fast-test. [puppet] - 10https://gerrit.wikimedia.org/r/549116 [14:25:00] perhaps a better command to look at this from eqiad pov: [14:25:10] bblack@re0.cr2-eqiad> show route 208.80.152.0/23 exact [14:25:48] (03PS3) 10RLazarus: httpbb: Add httpbb module and baseurls test suite cribbed from apache-fast-test. [puppet] - 10https://gerrit.wikimedia.org/r/549116 (https://phabricator.wikimedia.org/T236699) [14:25:58] which, if I hit that on cr[12]-eqiad, I see cr1-eqiad only gets it from cr2-eqiad currently, and cr2-eqiad gets it from the Zayo link. [14:26:07] (03CR) 10RLazarus: [C: 03+2] "Merging, thanks!" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/549116 (https://phabricator.wikimedia.org/T236699) (owner: 10RLazarus) [14:26:57] elukey: https://wikitech.wikimedia.org/wiki/Network_design#/media/File:Wikimedia_network_overview.png [14:26:58] which isn't showing me those backup paths via the eqdfw/eqord in play, but I think that may be because of aggregation issues and it's still not the best query command for this [14:27:04] that could be helpful [14:27:09] (03CR) 10Andrew Bogott: "@jbond, adding you because this removes the ruby-httpclient package from all puppetmasters. Harmless, I think."
[puppet] - 10https://gerrit.wikimedia.org/r/549224 (https://phabricator.wikimedia.org/T235708) (owner: 10Andrew Bogott) [14:27:37] all of these are links, and act on a shortest-path basis [14:28:16] paravoid: is there any easy way on juniper CLI to get a pared-down simple view of "crX-eqiad can see codfw address routes right now from these links?" [14:28:22] in theory even if all eqiad-codfw / eqord-codfw links were severed, traffic could flow via eqiad-eqord-ulsfo-codfw [14:28:29] or even via esams [14:28:40] without "exact" there's a lot of unrelated noise, and 152.0/23 only shows up for the direct transports apparently [14:29:15] (03Abandoned) 10Andrew Bogott: labtestwikitech: use local database for Echo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/547553 (https://phabricator.wikimedia.org/T119154) (owner: 10Andrew Bogott) [14:29:31] not really, no [14:29:34] I was hoping for a way to easily confirm that yes, there is a route available over these X other ports [14:29:38] you could look into the OSPF database [14:29:46] hmmm [14:33:52] paravoid: I am aware of that scheme, I also checked the OSPF one and I was reasonably sure, but didn't want to risk an outage :D [14:34:05] yeah [14:34:20] "show ospf route 208.80.153.0/24" does show the other links in the table [14:34:31] err sorry, "show ospf route 208.80.152.0/23" [14:35:28] * elukey takes notes [14:35:40] the whole reason I was wondering, was because otherwise I don't know to confirm that e.g. for some reason we don't actually have the route available via the eqdfw path, even though the physical possibility is there. 
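The codfw<->eqiad redundancy described above (two direct transports, one path via eqord built from two serial transports, one path via eqdfw that also depends on the codfw<->eqdfw interconnect) can be sketched as a small availability check. This is only an illustration under stated assumptions: the link names are made up for the example, the grouping of links into paths is a reading of the conversation rather than the real network inventory, and the threshold follows the "2 or more links dead" remark.

```python
from typing import Dict, List, Set

# Hypothetical link identifiers; each path lists the links it needs in series.
PATHS: Dict[str, List[str]] = {
    "direct-telia": ["telia:codfw-eqiad"],
    "direct-zayo": ["zayo:codfw-eqiad"],
    "via-eqord": ["telia:codfw-eqord", "telia:eqord-eqiad"],
    "via-eqdfw": ["interconnect:codfw-eqdfw", "gtt:eqdfw-eqiad"],
}


def live_paths(paths: Dict[str, List[str]], down: Set[str]) -> List[str]:
    """Return the names of paths whose links are all up, sorted by name."""
    return sorted(
        name for name, links in paths.items()
        if not any(link in down for link in links)
    )


def needs_attention(paths: Dict[str, List[str]], down: Set[str],
                    threshold: int = 2) -> bool:
    """True when at least `threshold` paths are currently unusable."""
    dead = len(paths) - len(live_paths(paths, down))
    return dead >= threshold


if __name__ == "__main__":
    down = {"telia:codfw-eqiad", "telia:eqord-eqiad"}
    print("live paths:", live_paths(PATHS, down))
    print("needs attention:", needs_attention(PATHS, down))
```

A serial path counts as dead as soon as any one of its links is down, which is why the eqord and eqdfw paths are modelled as lists rather than single links.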
[14:35:58] (because of some ongoing configuration battle or whatever) [14:36:05] +1 exactly [14:36:07] BGP only selects the best path and propagates that, so there is no way* to see all the paths from one vantage point [14:37:13] *: there is a relatively new BGP extension called BGP "add-path" that allows you to advertise multiple paths; we are not using that (and it's dubious whether it would provide any benefit) [14:37:19] although you could script something that pulls output from "show ospf route ..." + "show route ..." on cr[12]-site and filters down the data and shows you the interfaces it can see $othersite's space on [14:37:50] we run IGP (= OSPF + OSPF3) across all confederation subASes, even though we could in theory not do that and rely only on (e)BGP [14:37:51] or maybe just ospf [14:38:00] ok [14:38:07] so yeah, just ospf is enough now [14:38:16] so you could in theory grab that information by the OSPF database, but I don't have a handy way to do that quickly/easily [14:40:17] a more high level question - is there a task or some scheduled work to have a (semi) automated way to see maintenance scheduled on links? [14:40:26] what has been in our TODO for a while is https://phabricator.wikimedia.org/T167306 [14:41:10] that's a juniper configuration thing where it automatically sets up a backup route (installed in the HW) for very quick switchover to a backup link during an outage [14:41:49] also see https://www.juniper.net/documentation/en_US/junos/topics/task/configuration/ospf-link-protection-configuring.html [14:42:01] and https://www.juniper.net/documentation/en_US/junos/topics/task/configuration/ospf-node-link-protection-configuring.html and https://www.juniper.net/documentation/en_US/junos/topics/concept/ospf-loop-free-alternate-routes-overview.html [14:42:20] elukey: XioNoX was looking at https://github.com/wasabi222/janitor but we haven't found time for that yet [14:44:09] paravoid: ah nice!
I can have a chat with him about that, maybe I can help setting it up [14:44:44] elukey: that would be https://phabricator.wikimedia.org/T230835 [14:45:07] 10Operations, 10User-Elukey: Investigate janitor, maintenance emails parser - https://phabricator.wikimedia.org/T230835 (10elukey) [14:45:15] heh [14:45:22] <3 [14:46:56] ah! a network conversation ! [14:47:41] lol [14:48:49] another way to look at traffic between sites is the (outdated) diagram on https://phabricator.wikimedia.org/T200277 [14:49:22] I have a more up to date version on google drive [14:53:02] !log rebuilding compiler1001 [14:53:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:19] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:57:35] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [14:57:48] chaomodus: ^ [14:58:10] (03CR) 10Alexandros Kosiaris: "\o/" [puppet] - 10https://gerrit.wikimedia.org/r/549444 (owner: 10Ema) [15:01:30] 10Operations, 10ops-eqiad: frqueue1001 system battery needs replacement - https://phabricator.wikimedia.org/T237582 (10Jgreen) >>! In T237582#5642841, @wiki_willy wrote: > @Jgreen - looks like the warranty ended for the server a few months ago in May. Let me know if you're looking to decommission this server... 
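The wish expressed earlier for a view that makes overlapping maintenance on a pair of redundant links obvious could be prototyped roughly as below. This is a minimal sketch, not how janitor or any existing tool works: the `Window` type, the link names, and the dates are invented for illustration, and the function assumes every window passed in belongs to the same redundancy group (for example, the codfw<->eqiad transports).

```python
from datetime import datetime
from itertools import combinations
from typing import List, NamedTuple, Tuple


class Window(NamedTuple):
    """A scheduled maintenance window on one link (hypothetical shape)."""
    link: str
    start: datetime
    end: datetime


def overlapping(windows: List[Window]) -> List[Tuple[Window, Window]]:
    """Return pairs of windows on different links whose time ranges intersect."""
    return [
        (a, b)
        for a, b in combinations(windows, 2)
        if a.link != b.link and a.start < b.end and b.start < a.end
    ]


if __name__ == "__main__":
    # Illustrative data only.
    windows = [
        Window("telia:codfw-eqiad", datetime(2019, 11, 8, 2), datetime(2019, 11, 8, 6)),
        Window("zayo:codfw-eqiad", datetime(2019, 11, 8, 5), datetime(2019, 11, 8, 9)),
        Window("gtt:eqdfw-eqiad", datetime(2019, 11, 9, 2), datetime(2019, 11, 9, 4)),
    ]
    for a, b in overlapping(windows):
        print(f"overlap: {a.link} and {b.link}")
```

Windows that merely touch (one ends exactly when the next starts) are not reported, because the comparison uses strict inequalities; an ongoing outage could be fed in as a window with an open-ended far-future `end`.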
[15:06:27] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:08:13] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [15:12:33] (03PS2) 10Andrew Bogott: cloud: Replace diamond::collector::minimalpuppetagent [puppet] - 10https://gerrit.wikimedia.org/r/549241 (https://phabricator.wikimedia.org/T210993) (owner: 10BryanDavis) [15:15:14] (03CR) 10Andrew Bogott: [C: 03+2] cloud: Replace diamond::collector::minimalpuppetagent [puppet] - 10https://gerrit.wikimedia.org/r/549241 (https://phabricator.wikimedia.org/T210993) (owner: 10BryanDavis) [15:16:27] hm [15:16:49] (03CR) 10Andrew Bogott: [C: 03+1] "I've never used 'flat.cfg' but otherwise this looks good. Hoping partman treats you well!" [puppet] - 10https://gerrit.wikimedia.org/r/548878 (https://phabricator.wikimedia.org/T228102) (owner: 10Jhedden) [15:17:00] (03CR) 10Subramanya Sastry: "> Patch Set 2:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/548923 (https://phabricator.wikimedia.org/T236833) (owner: 10Dzahn) [15:17:21] (03CR) 10Andrew Bogott: [C: 03+1] dnsrecursor: Remove support for jessie [puppet] - 10https://gerrit.wikimedia.org/r/543803 (owner: 10Muehlenhoff) [15:25:42] 10Operations, 10ops-esams, 10DC-Ops, 10Traffic: cp3056 hardware issue - https://phabricator.wikimedia.org/T236497 (10herron) Since it looks like cp3056 might be down for some time could we remove it from the config until fixed? It would be good to let the ipsec checks in icinga return to green. https://i... 
[15:25:53] (03PS1) 10Herron: remove cp3056 from service configs due to host hardware problems [puppet] - 10https://gerrit.wikimedia.org/r/549474 (https://phabricator.wikimedia.org/T236497) [15:31:36] (03PS1) 10BBlack: Add test-lb to DNS for IPv[46] [dns] - 10https://gerrit.wikimedia.org/r/549476 (https://phabricator.wikimedia.org/T237492) [15:31:49] (03PS1) 10BBlack: LVS text: add test-lb IPs [puppet] - 10https://gerrit.wikimedia.org/r/549477 (https://phabricator.wikimedia.org/T237492) [15:35:35] (03PS1) 10BBlack: Remove cp3056 from cache::nodes temporarily [puppet] - 10https://gerrit.wikimedia.org/r/549480 [15:36:17] (03PS1) 10Herron: prometheus: add ipsec_status to prometheus/global [puppet] - 10https://gerrit.wikimedia.org/r/549481 (https://phabricator.wikimedia.org/T230236) [15:36:52] (03PS2) 10BBlack: Remove cp3056 from cache::nodes temporarily [puppet] - 10https://gerrit.wikimedia.org/r/549480 (https://phabricator.wikimedia.org/T236497) [15:37:33] (03PS1) 10Andrew Bogott: cloud-vps: remove duplicate definition of node_puppet_agent [puppet] - 10https://gerrit.wikimedia.org/r/549484 (https://phabricator.wikimedia.org/T210993) [15:38:22] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/549484 (https://phabricator.wikimedia.org/T210993) (owner: 10Andrew Bogott) [15:38:24] (03CR) 10CDanis: "FWIW, I'm pretty strongly in favor of *something* happening here. I think the change as it exists is a big improvement to the status quo." 
[puppet] - 10https://gerrit.wikimedia.org/r/510613 (owner: 10Jbond) [15:40:05] (03CR) 10Andrew Bogott: [C: 03+2] cloud-vps: remove duplicate definition of node_puppet_agent [puppet] - 10https://gerrit.wikimedia.org/r/549484 (https://phabricator.wikimedia.org/T210993) (owner: 10Andrew Bogott) [15:40:15] !log elukey@cumin1001 START - Cookbook sre.hadoop.roll-restart-workers [15:40:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:06] !log remove BGP to AS3491 on eqiad (left the IX) [15:41:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:53] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [15:42:04] 10Operations, 10ops-esams, 10DC-Ops, 10Traffic, 10Patch-For-Review: cp3056 hardware issue - https://phabricator.wikimedia.org/T236497 (10BBlack) Sorry I missed that you already had a patch! But in any case, we only need commenting from cache::nodes to fix up this case (there's no good reason to e.g. chu... [15:42:28] (03CR) 10BBlack: [C: 03+2] Remove cp3056 from cache::nodes temporarily [puppet] - 10https://gerrit.wikimedia.org/r/549480 (https://phabricator.wikimedia.org/T236497) (owner: 10BBlack) [15:44:41] (03CR) 10Dzahn: allow different memory limit settings for parsoid-php servers (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/548944 (https://phabricator.wikimedia.org/T236833) (owner: 10Dzahn) [15:45:33] (03CR) 10Jforrester: [C: 04-2] "> Patch Set 2:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/547596 (owner: 10Jforrester) [15:48:18] (03CR) 10Andrew Bogott: "> it's entirely unclear to me whether the WMCS vs. 
not-WMCS split is" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/547596 (owner: 10Jforrester) [15:49:26] 10Operations, 10User-ArielGlenn: missed pages from kafka outage on July 11 2018 - https://phabricator.wikimedia.org/T199890 (10Dzahn) @RobH Do you know if we are using the "Premium route" per above ? [15:51:40] 10Operations, 10Maps: OSM Replication failed at eqiad and codfw - https://phabricator.wikimedia.org/T237228 (10Arjunaraoc) As per https://grafana.wikimedia.org/d/000000305/maps-performances?orgId=1&var-cluster=maps1 the failure is now 10 days old. An update on the issue and the expected time to fix would be... [15:52:22] (03PS2) 10BBlack: Add test-lb to DNS for IPv[46] [dns] - 10https://gerrit.wikimedia.org/r/549476 (https://phabricator.wikimedia.org/T237492) [15:53:31] (03PS2) 10BBlack: LVS text: add test-lb IPs [puppet] - 10https://gerrit.wikimedia.org/r/549477 (https://phabricator.wikimedia.org/T237492) [15:53:33] (03CR) 10Dzahn: "thanks! By the way, there is a file called "typos" in the root of the repo. Probably not for these but if typos are common they can be add" [puppet] - 10https://gerrit.wikimedia.org/r/530989 (https://phabricator.wikimedia.org/T201491) (owner: 10DannyS712) [15:54:16] Amir1: not sure if you saw my earlier question: where are we in the wb item terms migration? [15:54:23] (03PS12) 10Dzahn: Fix some typos in code comments [puppet] - 10https://gerrit.wikimedia.org/r/530989 (https://phabricator.wikimedia.org/T201491) (owner: 10DannyS712) [15:54:32] (03CR) 10Dzahn: [C: 03+2] Fix some typos in code comments [puppet] - 10https://gerrit.wikimedia.org/r/530989 (https://phabricator.wikimedia.org/T201491) (owner: 10DannyS712) [15:54:47] ACKNOWLEDGEMENT - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP Ayounsi Telia fibercut (Ref: 01044239) - The acknowledgement expires at: 2019-11-08 15:54:11. 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:54:47] ACKNOWLEDGEMENT - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 269, down: 1, dormant: 0, excluded: 0, unused: 0: Ayounsi Telia fibercut (Ref: 01044239) - The acknowledgement expires at: 2019-11-08 15:54:11. https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:56:15] (03PS1) 10Reedy: Add 'authentication' => 'info' logging to beta temporarily [mediawiki-config] - 10https://gerrit.wikimedia.org/r/549487 (https://phabricator.wikimedia.org/T237554) [15:56:24] jouncebot: now [15:56:24] No deployments scheduled for the next 1 hour(s) and 3 minute(s) [15:56:26] jouncebot: next [15:56:26] In 1 hour(s) and 3 minute(s): Puppet SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191107T1700) [15:56:35] (03CR) 10Reedy: [C: 03+2] Add 'authentication' => 'info' logging to beta temporarily [mediawiki-config] - 10https://gerrit.wikimedia.org/r/549487 (https://phabricator.wikimedia.org/T237554) (owner: 10Reedy) [15:57:21] (03Merged) 10jenkins-bot: Add 'authentication' => 'info' logging to beta temporarily [mediawiki-config] - 10https://gerrit.wikimedia.org/r/549487 (https://phabricator.wikimedia.org/T237554) (owner: 10Reedy) [15:58:47] !log reedy@deploy1001 Synchronized wmf-config/InitialiseSettings-labs.php: beta logging (duration: 01m 00s) [15:58:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:13] 10Operations: Boron disk space alert - https://phabricator.wikimedia.org/T237649 (10ayounsi) p:05Triage→03High [16:02:03] !log mw2225 restart cron (T236799) [16:02:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:07] T236799: mw2225 keeps sending cronspam for hhvm-needs-restart - https://phabricator.wikimedia.org/T236799 [16:04:38] (03PS1) 10Reedy: Alphasort wmgMonologChannels in IS-labs.php [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/549488 [16:04:46] 10Operations, 10User-jbond: Boron disk space alert - https://phabricator.wikimedia.org/T237649 (10jbond) [16:07:00] (03PS1) 10Elukey: profile::hadoop::kerberos: remove unused parameter [puppet] - 10https://gerrit.wikimedia.org/r/549489 [16:07:59] (03PS3) 10Jforrester: Split out DB-related concerns for wikitech and test wikitech into s-wikitech, s-wikitech-wmcs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/547596 [16:08:27] jynus: Are you able to help deploy the s10/etc. patch? I assume it needs some etcd config changes? [16:08:44] James_F: we should wait for DBA input [16:08:48] (03CR) 10Jforrester: [C: 03+2] Follow-up 0f90f506: Leave labtestwiki in the wikitech dblist for config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/547597 (owner: 10Jforrester) [16:08:58] but yes, I or manuel can do it [16:08:59] OK. [16:09:14] (03CR) 10Jforrester: [C: 03+1] "Not yet." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/547597 (owner: 10Jforrester) [16:09:18] he will be back next week [16:09:23] (03PS2) 10Jforrester: Follow-up 0f90f506: Leave labtestwiki in the wikitech dblist for config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/547597 [16:09:38] * James_F nods. [16:10:07] I'll be at TechConf but I can help, of course. [16:10:12] (03CR) 10Elukey: [C: 03+2] profile::hadoop::kerberos: remove unused parameter [puppet] - 10https://gerrit.wikimedia.org/r/549489 (owner: 10Elukey) [16:10:26] yes, in terms of owning the deployment don't worry [16:10:48] it is that manuel is the best person to ok it [16:11:52] * James_F nods.
[16:11:54] ooohhh nice to see those two patches go through at last [16:12:25] James_F: if it was another wiki, I would deploy because unbreak now and change it later [16:12:27] (03CR) 10Jforrester: [C: 03+1] Alphasort wmgMonologChannels in IS-labs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/549488 (owner: 10Reedy) [16:12:29] but I think it can wait [16:12:33] a few days [16:12:40] Sure. [16:12:57] !log clear v4 BGP sessions to AS7713 in eqsin (hit max prefix limit) [16:13:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:28] 10Operations, 10Traffic: Renew and deploy GlobalSign unified cert (2019) - https://phabricator.wikimedia.org/T237650 (10BBlack) p:05Triage→03High [16:13:41] (03PS4) 10Jforrester: Split out DB-related concerns for real and test wikitechs into s10/s11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/547596 [16:13:47] 10Operations, 10Traffic: Renew and deploy GlobalSign unified cert (2019) - https://phabricator.wikimedia.org/T237650 (10BBlack) [16:15:29] !log add BGP sessions to AS57695 in esams and eqiad [16:15:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:08] !log add BGP sessions to AS64050 in eqiad [16:20:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:03] PROBLEM - Check the Netbox report puppetdb for fail status. 
on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [16:23:13] (03PS1) 10Elukey: Deploy kerberos keytabs on stat100[4,5,7] [puppet] - 10https://gerrit.wikimedia.org/r/549565 (https://phabricator.wikimedia.org/T237269) [16:24:31] (03CR) 10Elukey: [C: 03+2] Deploy kerberos keytabs on stat100[4,5,7] [puppet] - 10https://gerrit.wikimedia.org/r/549565 (https://phabricator.wikimedia.org/T237269) (owner: 10Elukey) [16:25:55] 10Operations, 10ops-esams, 10DC-Ops, 10Traffic, 10Patch-For-Review: cp3056 hardware issue - https://phabricator.wikimedia.org/T236497 (10herron) >>! In T236497#5644446, @BBlack wrote: > Sorry I missed that you already had a patch! But in any case, we only need commenting from cache::nodes to fix up this... [16:26:07] (03CR) 10Reedy: [C: 03+2] Alphasort wmgMonologChannels in IS-labs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/549488 (owner: 10Reedy) [16:26:47] (03Merged) 10jenkins-bot: Alphasort wmgMonologChannels in IS-labs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/549488 (owner: 10Reedy) [16:30:31] (03Abandoned) 10Herron: remove cp3056 from service configs due to host hardware problems [puppet] - 10https://gerrit.wikimedia.org/r/549474 (https://phabricator.wikimedia.org/T236497) (owner: 10Herron) [16:32:11] PROBLEM - Check the Netbox report puppetdb for fail status. 
on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [16:32:37] !log jynus@cumin1001 dbctl commit (dc=all): 'Repool db1080 at 100%', diff saved to https://phabricator.wikimedia.org/P9553 and previous config saved to /var/cache/conftool/dbconfig/20191107-163235-jynus.json [16:32:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:17] !log Homer push on cr2-knams: Sampling (disabled), enhanced-hash-key, ospf interfaces re-ordering (noop), policy-statement BGP_from_LVS (unused), lo0 term allow_vmhost [16:34:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:28] (03Abandoned) 10Ottomata: Add schema.discovery.wmnet [dns] - 10https://gerrit.wikimedia.org/r/549105 (https://phabricator.wikimedia.org/T233630) (owner: 10Ottomata) [16:34:43] (03PS1) 10Elukey: Enable Kerberos in Hadoop Analytics and Druid Analytics/Public [puppet] - 10https://gerrit.wikimedia.org/r/549566 (https://phabricator.wikimedia.org/T237269) [16:34:48] (03CR) 10Cwhite: "Does this work as is?" 
[puppet] - 10https://gerrit.wikimedia.org/r/510613 (owner: 10Jbond) [16:37:55] (03PS2) 10Ottomata: Set up cache routing for schema.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/549177 (https://phabricator.wikimedia.org/T233630) [16:39:10] (03Abandoned) 10Ottomata: Set up schema.discovery.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/549106 (https://phabricator.wikimedia.org/T233630) (owner: 10Ottomata) [16:39:40] (03CR) 10Jbond: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/510613 (owner: 10Jbond) [16:39:51] (03PS3) 10Ottomata: Set up cache routing for schema.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/549177 (https://phabricator.wikimedia.org/T233630) [16:40:31] (03PS6) 10Ayounsi: Initial forwarding-options templating [homer/public] - 10https://gerrit.wikimedia.org/r/547586 [16:40:33] (03PS1) 10Ayounsi: cr2-knams OSPF interfaces to match reality [homer/public] - 10https://gerrit.wikimedia.org/r/549568 [16:40:41] (03CR) 10Ottomata: [C: 03+2] Add schema.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/549173 (https://phabricator.wikimedia.org/T233630) (owner: 10Ottomata) [16:42:26] !log reedy@deploy1001 Synchronized wmf-config/InitialiseSettings-labs.php: some alphasorted config (duration: 01m 00s) [16:42:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:17] 10Operations, 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, and 5 others: Public schema.wikimedia.org endpoint for schema.svc - https://phabricator.wikimedia.org/T233630 (10Ottomata) @ema, @Joe informs me the that the nginx server serving schema.svc should terminate TLS for the 'encrypt all th... [16:46:23] 10Operations, 10Commons, 10Multimedia, 10SRE-swift-storage, 10User-Josve05a: Specific revisions of multiple files missing from Swift - 404 Not Found returned - https://phabricator.wikimedia.org/T124101 (10TheSandDoctor) >>! In T124101#2643954, @Platonides wrote: > Image https://commons.wikimedia.org/wiki... 
[16:53:03] (03CR) 10Ottomata: "COOOOL" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/549566 (https://phabricator.wikimedia.org/T237269) (owner: 10Elukey) [16:54:35] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [16:54:43] (03CR) 10Elukey: Enable Kerberos in Hadoop Analytics and Druid Analytics/Public (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/549566 (https://phabricator.wikimedia.org/T237269) (owner: 10Elukey) [16:58:25] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=wtp2020.codfw.wmnet [16:58:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:50] !log wtp2020 - depooled for T205712 [16:58:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:55] T205712: (OoW) wtp2020: correctable memory errors - https://phabricator.wikimedia.org/T205712 [17:00:04] godog and _joe_: That opportune time is upon us again. Time for a Puppet SWAT(Max 6 patches) deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191107T1700). [17:00:05] No GERRIT patches in the queue for this window AFAICS. [17:00:24] !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.roll-restart-workers (exit_code=0) [17:00:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:37] !log wtp2020 - 2 hours downtime - shut down (T205712) - go ahead @papaul [17:00:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:29] * elukey dances [17:01:57] hadoop workers restarted without any manual intervention [17:07:31] elukey: cool!
[17:08:31] !log add sampling stanza (disabled) to cr2-esams [17:08:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:09] !log elukey@cumin1001 START - Cookbook sre.hadoop.roll-restart-workers [17:09:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:30] (this is the test cluster) [17:09:59] PROBLEM - Host wtp2020.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:10:19] !log Homer push - forwarding-options - to all cr [17:10:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:50] 10Operations, 10ops-eqiad: frqueue1001 system battery needs replacement - https://phabricator.wikimedia.org/T237582 (10wiki_willy) [17:20:27] 10Operations, 10ops-eqiad: frqueue1001 system battery needs replacement - https://phabricator.wikimedia.org/T237582 (10wiki_willy) @Jgreen - Child task T237651 created to order the part. For any hardware repair requests going forward, can you follow the template here - https://phabricator.wikimedia.org/maniph... 
[17:21:41] (03CR) 10Ayounsi: [V: 03+2 C: 03+2] cr2-knams OSPF interfaces to match reality [homer/public] - 10https://gerrit.wikimedia.org/r/549568 (owner: 10Ayounsi) [17:21:52] (03CR) 10Ayounsi: [V: 03+2 C: 03+2] Initial forwarding-options templating [homer/public] - 10https://gerrit.wikimedia.org/r/547586 (owner: 10Ayounsi) [17:22:55] (03CR) 10Jbond: [C: 03+1] "LGTM, i don't see `require 'httpclient'` used anywhere and we can add the package back quickly if we see an issue" [puppet] - 10https://gerrit.wikimedia.org/r/549224 (https://phabricator.wikimedia.org/T235708) (owner: 10Andrew Bogott) [17:25:25] RECOVERY - Host wtp2020.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.84 ms [17:25:28] !log mholloway-shell@deploy1001 Synchronized php-1.35.0-wmf.5/extensions/MachineVision: Drop currently unsupported external dependencies (T227349) (duration: 05m 19s) [17:25:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:33] T227349: Deploy the MachineVision extension to production - https://phabricator.wikimedia.org/T227349 [17:26:26] 10Puppet, 10cloud-services-team, 10Patch-For-Review, 10User-jbond: ensure additional puppetmaster files are managed by puppet - https://phabricator.wikimedia.org/T234332 (10jbond) 05Open→03Resolved complet [17:26:29] 10Puppet, 10Patch-For-Review: upgrade puppet master frontends servers - https://phabricator.wikimedia.org/T234315 (10jbond) [17:27:00] 10Puppet, 10Patch-For-Review, 10User-jbond: Populate puppetdb1002 with live data - https://phabricator.wikimedia.org/T235655 (10jbond) new puppetdbs aren ow live [17:27:09] 10Puppet, 10Patch-For-Review, 10User-jbond: Upgrade Puppet Masters and Puppet DB servers - https://phabricator.wikimedia.org/T228657 (10jbond) [17:27:12] 10Puppet, 10Patch-For-Review, 10User-jbond: Populate puppetdb1002 with live data - https://phabricator.wikimedia.org/T235655 (10jbond) 05Open→03Resolved [17:28:11] 10Operations, 10Puppet: Plan Puppet 5 upgrade - 
https://phabricator.wikimedia.org/T184564 (10jbond) [17:28:13] 10Operations, 10Puppet, 10Packaging, 10Patch-For-Review, 10User-jbond: upgrade facter and puppet across the fleet - https://phabricator.wikimedia.org/T219803 (10jbond) 05Open→03Resolved This is now complte [17:29:18] (03CR) 10Anomie: Enable REST API on all WMF wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/549219 (https://phabricator.wikimedia.org/T237555) (owner: 10Tim Starling) [17:30:57] !log elukey@cumin1001 START - Cookbook sre.zookeeper.roll-restart-zookeeper [17:31:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:23] 10Operations, 10Puppet, 10Traffic, 10User-jbond: Serve volatile uri from local site - https://phabricator.wikimedia.org/T235427 (10jbond) 05Open→03Resolved a:03jbond This has now been implmented [17:31:54] 10Operations, 10Puppet, 10puppet-compiler, 10User-jbond: puppet-compiler: fix git permissions - https://phabricator.wikimedia.org/T236986 (10jbond) 05Open→03Resolved This is now implemented [17:33:12] (03CR) 10Dzahn: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/549208 (https://phabricator.wikimedia.org/T213223) (owner: 10Krinkle) [17:35:17] (03CR) 10Krinkle: "Yeah, that can be fixed in" [puppet] - 10https://gerrit.wikimedia.org/r/549208 (https://phabricator.wikimedia.org/T213223) (owner: 10Krinkle) [17:36:34] 10Operations, 10Puppet: Plan Puppet 5 upgrade - https://phabricator.wikimedia.org/T184564 (10jbond) The ops production puppet is now all running puppet 5. 
upgrades to wmcs are tracked in https://phabricator.wikimedia.org/T235218 [17:36:50] 10Operations, 10Puppet: Plan Puppet 5 upgrade - https://phabricator.wikimedia.org/T184564 (10jbond) 05Open→03Resolved a:03jbond [17:36:52] 10Operations, 10Puppet, 10Goal: Modernize Puppet Configuration Management (2017-18 Q3 Goal) - https://phabricator.wikimedia.org/T184561 (10jbond) [17:37:29] !log elukey@cumin1001 END (PASS) - Cookbook sre.zookeeper.roll-restart-zookeeper (exit_code=0) [17:37:29] 10Operations, 10Puppet, 10Packaging, 10Patch-For-Review, 10User-jbond: upgrade puppet master servers - https://phabricator.wikimedia.org/T227587 (10jbond) 05Open→03Resolved All puppet masters have been upgraded [17:37:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:37:34] 10Operations, 10Puppet, 10Packaging, 10Patch-For-Review, 10User-jbond: upgrade facter and puppet across the fleet - https://phabricator.wikimedia.org/T219803 (10jbond) [17:38:16] !log elukey@cumin1001 START - Cookbook sre.zookeeper.roll-restart-zookeeper [17:38:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:25] 10Operations, 10Puppet, 10Goal: Modernize Puppet Configuration Management (2017-18 Q3 Goal) - https://phabricator.wikimedia.org/T184561 (10jbond) 05Open→03Resolved a:03jbond suspect this can be closed but reopen if im wrong [17:38:41] the two zookeeper clusters that I am roll restarting are the druid ones [17:39:30] !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.roll-restart-workers (exit_code=0) [17:39:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:09] 10Operations, 10Puppet: puppetdb4: systemd config review - https://phabricator.wikimedia.org/T187257 (10jbond) What is the status of this considering we are now [i plan to drop the old puppetdb's completely next week] on puppetdb6 [17:41:55] 10Operations, 10observability, 10User-fgiunchedi, 10cloud-services-team (Kanban): Deprecate Diamond 
collectors in Cloud VPS - https://phabricator.wikimedia.org/T210993 (10bd808) [17:44:29] !log elukey@cumin1001 END (PASS) - Cookbook sre.zookeeper.roll-restart-zookeeper (exit_code=0) [17:44:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:44:59] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [17:51:04] 10Operations, 10ops-codfw: (OoW) wtp2020: correctable memory errors - https://phabricator.wikimedia.org/T205712 (10Papaul) Running ePSA on the system [17:51:46] 10Operations, 10ops-eqiad: frqueue1001 system battery needs replacement - https://phabricator.wikimedia.org/T237582 (10Jgreen) >>! In T237582#5644946, @wiki_willy wrote: > @Jgreen - can you follow the template here - https://phabricator.wikimedia.org/maniphest/task/edit/form/55/ . Absolutely. I added it to... [17:52:07] 10Operations, 10LDAP, 10User-jbond: Migrate web services using LDAP authentication towards the readonly LDAP replicas - https://phabricator.wikimedia.org/T227650 (10jbond) what is still left to change in order to complete this? [17:52:28] 10Operations, 10ops-eqiad: frqueue1001 system battery needs replacement - https://phabricator.wikimedia.org/T237582 (10wiki_willy) Thanks @Jgreen - much appreciated. [17:54:44] 10Operations, 10Wikimedia-Logstash, 10observability: Standardize the logging format - https://phabricator.wikimedia.org/T234565 (10Krinkle) I'm trying to centralise the conversation around field mapping limits and efforts to mitigate/support how we use Logstash in production. Should this be merged into T1800... 
[17:55:51] (03PS1) 10Elukey: Add defaults to argparse's output for kafka/hadoop/zookeeper [cookbooks] - 10https://gerrit.wikimedia.org/r/549582 [17:58:48] (03PS1) 10Elukey: cumin: add an alias for the new Zookeeper Analytics cluster [puppet] - 10https://gerrit.wikimedia.org/r/549583 [18:00:04] cscott, arlolra, subbu, halfak, accraze, and mdholloway: Your horoscope predicts another unfortunate Services – Graphoid / Parsoid / Citoid / ORES deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191107T1800). [18:00:13] (03PS2) 10Elukey: cumin: add an alias for the new Zookeeper Analytics cluster [puppet] - 10https://gerrit.wikimedia.org/r/549583 [18:02:46] (03PS1) 10Elukey: sre.zookeeper.roll-restart-zookeeper: add zookepeer-analytics cluster [cookbooks] - 10https://gerrit.wikimedia.org/r/549585 [18:04:00] (03CR) 10Dzahn: "https://gerrit.wikimedia.org/r/c/integration/docroot/+/549584" [puppet] - 10https://gerrit.wikimedia.org/r/549208 (https://phabricator.wikimedia.org/T213223) (owner: 10Krinkle) [18:08:59] (03PS2) 10Elukey: Add defaults to argparse's output for kafka/hadoop/zookeeper [cookbooks] - 10https://gerrit.wikimedia.org/r/549582 [18:09:02] (03PS2) 10Elukey: sre.zookeeper.roll-restart-zookeeper: add zookepeer-analytics cluster [cookbooks] - 10https://gerrit.wikimedia.org/r/549585 [18:14:15] (03PS3) 10Elukey: Add defaults to argparse's output for kafka/hadoop/zookeeper/cassandra [cookbooks] - 10https://gerrit.wikimedia.org/r/549582 [18:14:17] (03PS3) 10Elukey: sre.zookeeper.roll-restart-zookeeper: add zookepeer-analytics cluster [cookbooks] - 10https://gerrit.wikimedia.org/r/549585 [18:14:32] !log ebernhardson@deploy1001 Started deploy [search/mjolnir/deploy@afd41d7]: bulk_daemon: Adjust glent configuration [18:14:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:25] 10Operations, 10Puppet: puppetdb4: systemd config review - https://phabricator.wikimedia.org/T187257 
(10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff The systemd unit shipped in the Buster package is fine, this was specific to the Puppet Labs one, so closing. [18:15:27] 10Operations, 10Puppet: Port puppetlabs PuppetDB 4.4 package to stretch - https://phabricator.wikimedia.org/T185502 (10MoritzMuehlenhoff) [18:15:43] 10Operations, 10Traffic, 10Wikimedia-Logstash, 10observability, and 2 others: Changing Kibana filters is ridiculously slow - https://phabricator.wikimedia.org/T189333 (10Krinkle) Using the above method, I now get 11,086 unique fields in the dropdown menu. That's significantly more than last month. The fie... [18:17:18] 10Operations, 10Traffic, 10Wikimedia-Logstash, 10observability, and 2 others: Changing Kibana filters is ridiculously slow - https://phabricator.wikimedia.org/T189333 (10Krinkle) [18:17:21] 10Operations, 10Wikimedia-Logstash, 10observability: Standardize the logging format - https://phabricator.wikimedia.org/T234565 (10Krinkle) [18:17:23] 10Operations, 10Wikimedia-Logstash, 10observability, 10Core Platform Team Legacy (Watching / External), 10Services (watching): Reduce the number of fields declared in elasticsearch by logstash - https://phabricator.wikimedia.org/T180051 (10Krinkle) [18:19:57] (03CR) 10Dzahn: "we should first do https://gerrit.wikimedia.org/r/c/integration/docroot/+/549587 then https://gerrit.wikimedia.org/r/c/integration/docroot" [puppet] - 10https://gerrit.wikimedia.org/r/549208 (https://phabricator.wikimedia.org/T213223) (owner: 10Krinkle) [18:20:20] !log ebernhardson@deploy1001 Finished deploy [search/mjolnir/deploy@afd41d7]: bulk_daemon: Adjust glent configuration (duration: 05m 49s) [18:20:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:55] (03CR) 10Mobrovac: [C: 04-1] allow different memory limit settings for parsoid-php servers (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/548944 (https://phabricator.wikimedia.org/T236833) (owner: 10Dzahn) 
[18:22:23] (03CR) 10Elukey: [C: 03+2] cumin: add an alias for the new Zookeeper Analytics cluster [puppet] - 10https://gerrit.wikimedia.org/r/549583 (owner: 10Elukey) [18:23:17] (03PS1) 10Mholloway: MachineVision: Disable retrying annotation requests [mediawiki-config] - 10https://gerrit.wikimedia.org/r/549590 [18:24:51] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:24:52] (03CR) 10Mholloway: [C: 03+2] MachineVision: Disable retrying annotation requests [mediawiki-config] - 10https://gerrit.wikimedia.org/r/549590 (owner: 10Mholloway) [18:25:42] !log restart mjolnir-kafka-bulk-daemon and mjolnir-kafka-msearch-daemon across `cirrus` dsh group [18:25:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:17] (03CR) 10Anomie: "Personally, I'd probably take it slightly slower. Put it to group0, wait a few days (maybe a week), then group1, then all." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/549449 (https://phabricator.wikimedia.org/T198312) (owner: 10Daniel Kinzler) [18:29:20] 10Operations, 10LDAP, 10User-jbond: Migrate web services using LDAP authentication towards the readonly LDAP replicas - https://phabricator.wikimedia.org/T227650 (10MoritzMuehlenhoff) The last remaining patch should be https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/525220/ (and some semi-related refa... 
[18:30:59] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [18:31:08] 10Operations, 10LDAP, 10User-jbond: Migrate web services using LDAP authentication towards the readonly LDAP replicas - https://phabricator.wikimedia.org/T227650 (10jbond) awesome thanks [18:31:43] !log mholloway-shell@deploy1001 Synchronized wmf-config/InitialiseSettings.php: MachineVision: Disable retrying annotation requests (duration: 05m 17s) [18:31:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:47] (03PS4) 10Krinkle: Set all sites to use the new MCR-only schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/549449 (https://phabricator.wikimedia.org/T198312) (owner: 10Daniel Kinzler) [18:34:06] (03CR) 10Krinkle: "(removed newline so that Gerrit's index picks it up as meta data instead of body paragraph)." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/549449 (https://phabricator.wikimedia.org/T198312) (owner: 10Daniel Kinzler) [18:41:33] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [18:43:19] (03CR) 10Paladox: [C: 03+1] "I'm currently rebuilding gerrit-test5 as gerrit-test6 with buster, should be able to do this either tonight or tomorrow!" 
[puppet] - 10https://gerrit.wikimedia.org/r/548547 (owner: 10Dzahn) [18:43:59] 10Operations, 10observability, 10serviceops, 10Performance-Team (Radar): Messages in Logstash from php-fatal-error.php are missing from type:mediawiki/channel:fatal - https://phabricator.wikimedia.org/T234283 (10Krinkle) 05Open→03Resolved [18:44:02] 10Operations, 10serviceops: SRE FY19-20 Q1 goal: complete the transition to PHP7 - https://phabricator.wikimedia.org/T219127 (10Krinkle) [18:44:10] 10Operations, 10observability, 10serviceops, 10Performance-Team (Radar): Messages in Logstash from php-fatal-error.php are missing from type:mediawiki/channel:fatal - https://phabricator.wikimedia.org/T234283 (10Krinkle) Thanks, confirmed here as well. [18:44:13] 10Operations, 10CPT Initiatives (PHP7 (TEC4)), 10HHVM, 10MW-1.34-notes (1.34.0-wmf.22; 2019-09-10), and 3 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (10Krinkle) [18:47:35] (03PS2) 10Andrew Bogott: wmcs puppet: remove the mwyaml hiera backend [puppet] - 10https://gerrit.wikimedia.org/r/549224 (https://phabricator.wikimedia.org/T235708) [18:50:09] (03CR) 10Andrew Bogott: [C: 03+2] wmcs puppet: remove the mwyaml hiera backend [puppet] - 10https://gerrit.wikimedia.org/r/549224 (https://phabricator.wikimedia.org/T235708) (owner: 10Andrew Bogott) [18:56:44] (03Abandoned) 10Paladox: WIP: Update gerrit to 2.16.7 [software/gerrit] (deploy/wmf/stable-2.16) - 10https://gerrit.wikimedia.org/r/495012 (owner: 10Paladox) [18:56:48] (03Restored) 10Paladox: WIP: Update gerrit to 2.16.7 [software/gerrit] (deploy/wmf/stable-2.16) - 10https://gerrit.wikimedia.org/r/495012 (owner: 10Paladox) [18:56:56] (03Abandoned) 10Paladox: Testing: Do not merge [software/gerrit] (wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/525867 (owner: 10Paladox) [18:59:39] PROBLEM - Host wtp2020 is DOWN: PING CRITICAL - Packet loss = 100% [19:00:05] MaxSem, RoanKattouw, Niharika, and Urbanecm: #bothumor My software never has bugs. 
It just develops random features. Rise for Morning SWAT (Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191107T1900). [19:00:05] RoanKattouw: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [19:04:08] I'll do my own SWAT [19:05:14] huh, that's my patch :o [19:05:57] 10Operations, 10Traffic, 10Wikimedia-Logstash, 10observability, and 2 others: Changing Kibana filters is ridiculously slow - https://phabricator.wikimedia.org/T189333 (10EBernhardson) >>! In T189333#5488005, @Krinkle wrote: >>>! In T189333#5483346, @fgiunchedi wrote: >>>>! In T189333#5481492, @Krinkle wrot... [19:06:17] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:07:34] (03PS1) 10Mholloway: MachineVision: Enqueue annotation job on upload complete [mediawiki-config] - 10https://gerrit.wikimedia.org/r/549597 [19:08:49] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [19:18:10] 10Operations, 10Wikimedia-Logstash, 10observability, 10Core Platform Team Legacy (Watching / External), 10Services (watching): Reduce the number of fields declared in elasticsearch by logstash - https://phabricator.wikimedia.org/T180051 (10Eevans) An additional 2¢ When we implemented logging for Kask we... 
[19:20:17] 10Operations, 10ops-esams, 10DC-Ops, 10Traffic: cp3056 hardware issue - https://phabricator.wikimedia.org/T236497 (10wiki_willy) a:03RobH [19:23:15] PROBLEM - Host wtp2020.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:24:09] (03CR) 10Dzahn: [C: 03+2] doc: make published files group writable [puppet] - 10https://gerrit.wikimedia.org/r/484308 (https://phabricator.wikimedia.org/T137890) (owner: 10Hashar) [19:24:18] (03PS8) 10Dzahn: doc: make published files group writable [puppet] - 10https://gerrit.wikimedia.org/r/484308 (https://phabricator.wikimedia.org/T137890) (owner: 10Hashar) [19:25:17] ACKNOWLEDGEMENT - Host wtp2020.mgmt is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn https://phabricator.wikimedia.org/T205712 [19:28:16] * mdholloway is lurking waiting to do a config deploy [19:28:55] RECOVERY - Host wtp2020.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.84 ms [19:36:47] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [19:39:16] (03PS1) 10Mforns: analytics::refinery::job::data_purge Absent data quality deletion [puppet] - 10https://gerrit.wikimedia.org/r/549612 (https://phabricator.wikimedia.org/T235486) [19:40:28] (03PS1) 10Bstorm: new k8s: Fix ingress object and enable toolsbeta ingress creation [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/549613 (https://phabricator.wikimedia.org/T236202) [19:40:46] (03PS1) 10Mforns: analytics::refinery::job::data_purge Remove data quality deletion [puppet] - 10https://gerrit.wikimedia.org/r/549615 (https://phabricator.wikimedia.org/T235486) [19:41:14] (03CR) 10Dzahn: "thank you Paladox! 
let me know with a +1 once it's time" [puppet] - 10https://gerrit.wikimedia.org/r/548547 (owner: 10Dzahn) [19:42:27] ACKNOWLEDGEMENT - Host wtp2020 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn https://phabricator.wikimedia.org/T205712 [19:43:45] (03CR) 10Andrew Bogott: [C: 03+1] "good idea!" [puppet] - 10https://gerrit.wikimedia.org/r/549446 (owner: 10Jbond) [19:44:27] ACKNOWLEDGEMENT - SSH wtp2020.mgmt on wtp2020.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds daniel_zahn https://phabricator.wikimedia.org/T205712 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:44:39] mdholloway: Sorry, I lost track of time and forgot to do my deploy [19:44:50] mdholloway: Go ahead and do your config deploy any time, my SWAT is a non-config one anyway [19:45:09] RoanKattouw: ah, cool, thanks [19:46:05] (03CR) 10Mholloway: [C: 03+2] MachineVision: Enqueue annotation job on upload complete [mediawiki-config] - 10https://gerrit.wikimedia.org/r/549597 (owner: 10Mholloway) [19:46:33] (03CR) 10Andrew Bogott: [C: 03+1] pupet_compiler: update compiler-update-fact (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/549446 (owner: 10Jbond) [19:46:56] (03Merged) 10jenkins-bot: MachineVision: Enqueue annotation job on upload complete [mediawiki-config] - 10https://gerrit.wikimedia.org/r/549597 (owner: 10Mholloway) [19:50:22] (03PS3) 10Jbond: puppet_compiler: update compiler-update-fact [puppet] - 10https://gerrit.wikimedia.org/r/549446 [19:50:38] (03PS1) 10Hashar: gerrit: raise log retention from 7 to 30 days [puppet] - 10https://gerrit.wikimedia.org/r/549618 [19:51:47] (03CR) 10Jbond: [C: 03+2] puppet_compiler: update compiler-update-fact [puppet] - 10https://gerrit.wikimedia.org/r/549446 (owner: 10Jbond) [19:53:05] (03CR) 1020after4: [C: 03+1] gerrit: raise log retention from 7 to 30 days [puppet] - 10https://gerrit.wikimedia.org/r/549618 (owner: 10Hashar) [19:53:31] RECOVERY - Check the Netbox report puppetdb for fail 
status. on netbox1001 is OK: puppetdb.PuppetDB OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports [19:54:17] Uhh [19:54:20] !log mholloway-shell@deploy1001 Synchronized wmf-config/InitialiseSettings.php: MachineVision: Enqueue annotation job on upload complete (duration: 05m 19s) [19:54:22] 10Operations, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 2 others: Error restoring file: "The file … is in an inconsistent state within the internal storage backends" - https://phabricator.wikimedia.org/T236246 (10Krinkle) [19:54:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:54:26] Does X-Wikimedia-Debug not work for officewiki? [19:54:46] Huh my browser plugin doesn't work at all [19:55:31] I think officewiki is forced to run the latest branch? not sure about which server it runs on [19:55:35] (03CR) 10Dzahn: [C: 03+1] gerrit: raise log retention from 7 to 30 days [puppet] - 10https://gerrit.wikimedia.org/r/549618 (owner: 10Hashar) [19:55:41] Hmm but even on mediawiki.org, XWD doesn't work [19:55:52] twentyafterfour: it's on group0 like test wikis [19:55:53] oh ..hmm [19:55:53] I have the Chrome extension, and I see it sending x-wikimedia-debug: backend=mwdebug1002.eqiad.wmnet [19:55:57] (and mw.org) [19:56:12] But mw.config.get( 'wgHostname') return mw1322 [19:56:14] RoanKattouw: Are you refreshing with 304 allowance a cacheable page? 
[19:56:23] that makes the browser re-use old html [19:56:28] common pitfall :) [19:56:39] (03PS3) 10Dzahn: gerrit: remove pre-buster support [puppet] - 10https://gerrit.wikimedia.org/r/548547 [19:57:02] also mwdebug1002 is off limits currently, remember to use mwdebug1001 instead if you're looking at errors/logstash [19:57:13] Krinkle: No I'm navigating around and logged in [19:57:14] It was depooled afaik but looks like it started working again [19:57:18] Oh hmm OK [19:57:20] I'll try 1001 [19:57:33] logging in does not disable private http cache though [19:57:40] but yeah, if it's a new page that shouldn't be the case. [19:57:46] Still not working though [19:57:47] What does the Server: HTTP response header say? [19:57:50] I'm logged in at https://www.mediawiki.org/wiki/Special:Preferences [19:57:57] works for me in Chrome, getting mwdebug1001 back from wgHostname [19:58:08] server: mw1269.eqiad.wmnet [19:58:11] (03CR) 10Dzahn: [C: 03+1] "paladox said gerrit-test5 has been deleted" [puppet] - 10https://gerrit.wikimedia.org/r/548547 (owner: 10Dzahn) [19:58:16] Try curl? [19:58:22] And I'm sending x-wikimedia-debug: backend=mwdebug1001.eqiad.wmnet [19:58:39] or "copy as curl" from the devtools network panel [19:59:09] (03PS2) 10Ayounsi: Initial templating for CR routing-options [homer/public] - 10https://gerrit.wikimedia.org/r/547587 [19:59:45] curl -H "X-Wikimedia-Debug: backend=mwdebug1001.eqiad.wmnet" https://www.mediawiki.org/wiki/Blah [19:59:50] Also comes from an app server [19:59:56] Maybe it's working in esams but broken in ulsfo? [20:00:04] twentyafterfour and hashar: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for MediaWiki train - American version . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191107T2000). 
[20:00:30] that curl works for me [20:00:33] mwdebug1001 [20:01:17] hmm [20:01:17] server: mwdebug1001.eqiad.wmnet [20:01:17] x-cache: cp1077 pass, cp1081 pass [20:01:17] Hmm trying to figure out how the dynamic DNS works now, they changed it [20:01:31] Oh right you're on the east coast [20:01:38] So you're going through eqiad [20:01:41] x-cache: cp4032 miss, cp4028 pass [20:01:45] So I'm going through ulsfo [20:02:03] eh, shouldn't there be a codfw/eqiad at the end of that path? [20:02:20] bblack: ---^^ It looks like X-Wikimedia-Debug is not being respected by Varnishes in ulsfo? [20:02:24] Looks like something has broken in traffic routing in some way relating to the debug director [20:02:53] (03PS1) 10Andrew Bogott: Horizon puppet panel: default to yaml mode [puppet] - 10https://gerrit.wikimedia.org/r/549623 (https://phabricator.wikimedia.org/T149589) [20:03:01] (03CR) 10Hashar: "Bah that changed depended in Gerrit on another change which has not been merged:" [puppet] - 10https://gerrit.wikimedia.org/r/484308 (https://phabricator.wikimedia.org/T137890) (owner: 10Hashar) [20:03:32] (03CR) 10Hashar: "Supposedly should have happened before https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/484308/8/modules/profile/manifests/doc.pp :]" [puppet] - 10https://gerrit.wikimedia.org/r/484304 (https://phabricator.wikimedia.org/T137890) (owner: 10Hashar) [20:03:35] Krinkle: No my requests are not going through any eqiad caches [20:03:44] Even normal anon page views with no special headers: [20:03:53] $ curl -I https://www.mediawiki.org/wiki/Blah [20:03:58] x-cache: cp4027 miss, cp4028 miss [20:04:05] Second time: [20:04:06] x-cache: cp4027 miss, cp4028 hit/1 [20:04:52] that's weird. Do we not have layered varnish backends anymore? [20:04:52] Krinkle: Do the responses you get have a header like x-ats-timestamp: 1573156989 ? 
[20:04:56] (03CR) 10Dzahn: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/548439 (owner: 10Dzahn) [20:05:05] No [20:05:07] RoanKattouw: probably that's an oversight in ATS [20:05:12] And ATS indeed seems like a prime suspect [20:05:16] ulsfo doesn't have backend varnishes, it has backend ATSes now [20:05:34] Right, and they seem to not be routing XWD correctly [20:05:34] probably something overlooked in the conversion of all the things to ATS-land [20:06:07] This breaks the ability for people on the west coast to test MW code changes before deploying :/ [20:06:32] Now trying to figure out the DNS name / IP for the eqiad entry point so I can force my computer to talk to eqiad caches instead [20:06:44] yes, that would be your easiest path for now [20:06:57] 208.80.154.224 [20:08:26] even for eqiad, there's 1/N odds of going through ATS [20:08:31] but your odds are good [20:08:57] OK that worked [20:10:00] it's been this way for all of ulsfo for ~ 2 weeks now [20:11:41] Yikes! How has nobody noticed this before [20:12:03] Maybe the SWAT deploy testers were all in the eqiad/esams/eqsin service areas? 
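
The diagnosis in the exchange above came from reading two response headers: `x-cache` (which cache hosts handled the request) and `server` (which backend answered). A minimal Python sketch of that reading follows. The function names are hypothetical, and the site table is partly an assumption: only the cp1xxx = eqiad and cp4xxx = ulsfo mappings are confirmed by the conversation, the others follow the usual host numbering and are not stated in this log.

```python
import re

# First digit of the cache host number -> site.
# "1" (eqiad) and "4" (ulsfo) are confirmed by the conversation above;
# the remaining entries are an assumption based on the usual host numbering.
SITE_BY_DIGIT = {"1": "eqiad", "2": "codfw", "3": "esams", "4": "ulsfo", "5": "eqsin"}


def parse_x_cache(value):
    """Split an X-Cache header value into (host, site, status) tuples."""
    hops = []
    for part in value.split(","):
        match = re.match(r"\s*(cp(\d)\d+)\s+(\S+)", part)
        if match:
            host, digit, status = match.groups()
            hops.append((host, SITE_BY_DIGIT.get(digit, "unknown"), status))
    return hops


def debug_header_honoured(server_header):
    """True if the Server header names an mwdebug host rather than a regular app server."""
    return server_header.startswith("mwdebug")


# Header values quoted in the log.
print(parse_x_cache("cp4032 miss, cp4028 pass"))
print(debug_header_honoured("mw1269.eqiad.wmnet"))
```

With the values from the log, the broken case shows only ulsfo cache hosts and a regular `mw` app server in `server`, while the working case shows eqiad cache hosts and `mwdebug1001.eqiad.wmnet`.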
[20:12:51] (03PS10) 10Hashar: rsync: readd incoming and outgoing chmod [puppet] - 10https://gerrit.wikimedia.org/r/484304 (https://phabricator.wikimedia.org/T137890) [20:13:34] 10Operations, 10ops-codfw: (OoW) wtp2020: correctable memory errors - https://phabricator.wikimedia.org/T205712 (10Papaul) EPSA pass with no errors [20:13:46] possibly :) [20:14:10] (03CR) 10Hashar: "Rebased :]" [puppet] - 10https://gerrit.wikimedia.org/r/484304 (https://phabricator.wikimedia.org/T137890) (owner: 10Hashar) [20:14:57] (03PS2) 10Bstorm: new k8s: Fix ingress object and enable toolsbeta ingress creation [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/549613 (https://phabricator.wikimedia.org/T236202) [20:15:19] (03PS2) 10Herron: prometheus: add ipsec_status to prometheus/global [puppet] - 10https://gerrit.wikimedia.org/r/549481 (https://phabricator.wikimedia.org/T230236) [20:15:33] RECOVERY - Host wtp2020 is UP: PING OK - Packet loss = 0%, RTA = 36.14 ms [20:16:24] anyways, doing some phab/puppet archeology now to figure out where this X-Wikimedia-Debug -vs- ATS thing is really at (if it was missed or thought-handled or what), will make a ticket if there isn't already one [20:16:55] (03CR) 10Dzahn: "openstack-browser tool can help but it shows there are many:" [puppet] - 10https://gerrit.wikimedia.org/r/548439 (owner: 10Dzahn) [20:17:03] !log cluster restart for cloudelastic to pick JVM upgrade [20:17:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:13] (03PS1) 10Jhedden: wikimedia.org: add DNS entries for cloudceph mon and osd [dns] - 10https://gerrit.wikimedia.org/r/549625 (https://phabricator.wikimedia.org/T228102) [20:17:56] (03CR) 10Andrew Bogott: [C: 03+2] "I fixed a couple of doc pages." 
[puppet] - 10https://gerrit.wikimedia.org/r/549623 (https://phabricator.wikimedia.org/T149589) (owner: 10Andrew Bogott) [20:18:38] (03CR) 10Bstorm: new k8s: Fix ingress object and enable toolsbeta ingress creation (032 comments) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/549613 (https://phabricator.wikimedia.org/T236202) (owner: 10Bstorm) [20:20:45] (03CR) 10Jhedden: "per T224188 these host's public interface will be in the public1-b-eqiad subnet." [dns] - 10https://gerrit.wikimedia.org/r/549625 (https://phabricator.wikimedia.org/T228102) (owner: 10Jhedden) [20:21:38] !log performing rolling reboots of kafka-main hosts for security updates [20:21:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:55] (03PS2) 10Jhedden: wikimedia.org: add DNS entries for cloudceph mon and osd [dns] - 10https://gerrit.wikimedia.org/r/549625 (https://phabricator.wikimedia.org/T228102) [20:24:30] (03CR) 10Andrew Bogott: [C: 04-1] "you'll want to add reverse records to the corresponding files. That's usually where I start since it's easy to see what IPs are unused th" [dns] - 10https://gerrit.wikimedia.org/r/549625 (https://phabricator.wikimedia.org/T228102) (owner: 10Jhedden) [20:24:43] !log catrope@deploy1001 Synchronized php-1.35.0-wmf.5/extensions/VisualEditor/modules/ve-mw/init/targets/ve.init.mw.ArticleTarget.js: Fix error handling (duration: 01m 00s) [20:24:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:11] (03CR) 10Ayounsi: "LVS and LVS6 are currently defined globally, this drives the `policy-options prefix-list LVS-service-ips` (and v6). Which is used in the b" [homer/public] - 10https://gerrit.wikimedia.org/r/547587 (owner: 10Ayounsi) [20:30:35] Any objections to moving wmf.5 to group 2 now? 
I don't see any blockers on T233853 [20:30:36] T233853: 1.35.0-wmf.5 deployment blockers - https://phabricator.wikimedia.org/T233853 [20:31:17] (03PS1) 1020after4: group2 wikis to 1.35.0-wmf.5 refs T233853 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/549627 [20:31:17] (03CR) 1020after4: [C: 03+2] group2 wikis to 1.35.0-wmf.5 refs T233853 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/549627 (owner: 1020after4) [20:32:07] (03Merged) 10jenkins-bot: group2 wikis to 1.35.0-wmf.5 refs T233853 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/549627 (owner: 1020after4) [20:34:11] (03PS3) 10Jhedden: wikimedia.org: add DNS entries for cloudceph mon and osd [dns] - 10https://gerrit.wikimedia.org/r/549625 (https://phabricator.wikimedia.org/T228102) [20:36:58] (03CR) 10Andrew Bogott: [C: 03+1] "lgtm" [dns] - 10https://gerrit.wikimedia.org/r/549625 (https://phabricator.wikimedia.org/T228102) (owner: 10Jhedden) [20:38:11] (03CR) 10Jhedden: [C: 03+2] wikimedia.org: add DNS entries for cloudceph mon and osd [dns] - 10https://gerrit.wikimedia.org/r/549625 (https://phabricator.wikimedia.org/T228102) (owner: 10Jhedden) [20:42:11] !log twentyafterfour@deploy1001 rebuilt and synchronized wikiversions files: group2 wikis to 1.35.0-wmf.5 refs T233853 [20:42:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:42:16] T233853: 1.35.0-wmf.5 deployment blockers - https://phabricator.wikimedia.org/T233853 [20:43:54] (03PS3) 10Ayounsi: Initial templating for CR routing-options [homer/public] - 10https://gerrit.wikimedia.org/r/547587 [20:45:04] (03CR) 10Ayounsi: "Nevermind, I think PS 3 is a clean way around it." 
[homer/public] - 10https://gerrit.wikimedia.org/r/547587 (owner: 10Ayounsi) [20:52:24] (03PS4) 10Ayounsi: Initial templating for CR routing-options [homer/public] - 10https://gerrit.wikimedia.org/r/547587 [20:56:52] (03CR) 10Faidon Liambotis: Initial templating for CR routing-options (035 comments) [homer/public] - 10https://gerrit.wikimedia.org/r/547587 (owner: 10Ayounsi) [20:58:17] 10Operations, 10Traffic: ATS doesn't support X-Wikimedia-Debug - https://phabricator.wikimedia.org/T237687 (10BBlack) p:05Triage→03High [20:59:38] (03CR) 10Thcipriani: [C: 03+1] "I've run into this before as well." [puppet] - 10https://gerrit.wikimedia.org/r/549618 (owner: 10Hashar) [21:00:43] (03CR) 10Dzahn: [C: 03+2] gerrit: raise log retention from 7 to 30 days [puppet] - 10https://gerrit.wikimedia.org/r/549618 (owner: 10Hashar) [21:01:56] (03PS1) 10Bartosz Dziewoński: Explicitly set wgVisualEditorRestbaseParsoidVariant='js' everywhere else [mediawiki-config] - 10https://gerrit.wikimedia.org/r/549633 (https://phabricator.wikimedia.org/T229074) [21:02:35] (03CR) 10Bartosz Dziewoński: "@MObrovac: This will cause VE to send `X-Parsoid-Variant: js` with its requests. We can safely do this, right?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/549633 (https://phabricator.wikimedia.org/T229074) (owner: 10Bartosz Dziewoński) [21:03:33] bblack: What are hassium and hassaleh? [21:03:39] Are they mwdebug1001/1002 replacements? 
[21:04:40] (03PS1) 10Jhedden: wikimedia.org: update cloudcephmon and osd hostnames [dns] - 10https://gerrit.wikimedia.org/r/549634 (https://phabricator.wikimedia.org/T228102) [21:07:35] (03PS2) 10Jhedden: install_server: add cloudcephmon servers [puppet] - 10https://gerrit.wikimedia.org/r/548878 (https://phabricator.wikimedia.org/T228102) [21:09:43] (03CR) 10Jhedden: [C: 03+2] wikimedia.org: update cloudcephmon and osd hostnames [dns] - 10https://gerrit.wikimedia.org/r/549634 (https://phabricator.wikimedia.org/T228102) (owner: 10Jhedden) [21:10:31] 10Operations, 10SRE-Access-Requests: Requesting access to restricted production access for Nikki Nikkhoui - https://phabricator.wikimedia.org/T237689 (10nnikkhoui) [21:12:52] 10Operations, 10SRE-Access-Requests: Requesting access to restricted production access for Nikki Nikkhoui - https://phabricator.wikimedia.org/T237689 (10nnikkhoui) [21:13:11] 10Operations, 10SRE-Access-Requests: Requesting access to restricted production access for Nikki Nikkhoui - https://phabricator.wikimedia.org/T237689 (10nnikkhoui) Requesting approval for this request from my manager @Fjalapeno [21:13:43] 10Operations, 10SRE-Access-Requests: Requesting access to restricted production access for Nikki Nikkhoui - https://phabricator.wikimedia.org/T237689 (10nnikkhoui) [21:14:42] (03PS4) 10Dzahn: gerrit: remove pre-buster support [puppet] - 10https://gerrit.wikimedia.org/r/548547 [21:18:04] (03CR) 10Jhedden: [C: 03+2] install_server: add cloudcephmon servers [puppet] - 10https://gerrit.wikimedia.org/r/548878 (https://phabricator.wikimedia.org/T228102) (owner: 10Jhedden) [21:19:58] (03PS1) 10Catrope: beta: Update Parsoid port to 8001 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/549635 (https://phabricator.wikimedia.org/T231569) [21:20:21] (03CR) 10Herron: [C: 03+2] prometheus: add ipsec_status to prometheus/global [puppet] - 10https://gerrit.wikimedia.org/r/549481 (https://phabricator.wikimedia.org/T230236) (owner: 10Herron) [21:20:48] 
(03CR) 10Subramanya Sastry: [C: 03+1] beta: Update Parsoid port to 8001 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/549635 (https://phabricator.wikimedia.org/T231569) (owner: 10Catrope) [21:23:41] (03PS1) 10Catrope: beta: Point Parsoid to parsoid-php instead of parsoid-js [mediawiki-config] - 10https://gerrit.wikimedia.org/r/549637 (https://phabricator.wikimedia.org/T229078) [21:24:17] (03Abandoned) 10Catrope: Flow: Configure enwiki beta to use Parsoid PHP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/549448 (https://phabricator.wikimedia.org/T229078) (owner: 10Kosta Harlan) [21:28:43] !log boron apt-get clean (saved 9G on /) (T237649) [21:28:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:28:48] T237649: Boron disk space alert - https://phabricator.wikimedia.org/T237649 [21:30:37] 10Operations, 10User-jbond: Boron disk space alert - https://phabricator.wikimedia.org/T237649 (10Dzahn) home dirs over 1 Gigabyte: ` 2.1G elukey 2.2G ema 8.9G filippo 1.1G fsero 2.1G jiji 7.1G jmm 4.2G oblivian 11G otto 1.4G vgutierrez ` [21:31:46] 10Operations, 10User-jbond: Boron disk space alert - https://phabricator.wikimedia.org/T237649 (10Dzahn) @Ottomata @fgiunchedi Is there maybe something that is not needed anymore ^? 
[21:32:22] (03CR) 10Paladox: [C: 03+1] gerrit: raise log retention from 7 to 30 days [puppet] - 10https://gerrit.wikimedia.org/r/549618 (owner: 10Hashar) [21:36:04] (03CR) 10Catrope: [C: 03+2] beta: Update Parsoid port to 8001 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/549635 (https://phabricator.wikimedia.org/T231569) (owner: 10Catrope) [21:36:51] (03Merged) 10jenkins-bot: beta: Update Parsoid port to 8001 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/549635 (https://phabricator.wikimedia.org/T231569) (owner: 10Catrope) [21:39:42] (03CR) 10Dzahn: [C: 03+1] "https://tools.wmflabs.org/openstack-browser/puppetclass/role::gerrit" [puppet] - 10https://gerrit.wikimedia.org/r/548547 (owner: 10Dzahn) [21:42:37] (03PS5) 10Ayounsi: Initial templating for CR routing-options [homer/public] - 10https://gerrit.wikimedia.org/r/547587 [21:43:21] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=wtp2020.codfw.wmnet [21:43:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:43:26] (03CR) 10Ayounsi: Initial templating for CR routing-options (035 comments) [homer/public] - 10https://gerrit.wikimedia.org/r/547587 (owner: 10Ayounsi) [21:43:57] 10Operations, 10ops-codfw: (OoW) wtp2020: correctable memory errors - https://phabricator.wikimedia.org/T205712 (10Dzahn) repooled [21:45:47] (03PS1) 10Andrew Bogott: Move labtestpuppetmaster2001 to Stretch [puppet] - 10https://gerrit.wikimedia.org/r/549641 (https://phabricator.wikimedia.org/T235819) [21:47:41] (03CR) 10Andrew Bogott: [C: 03+2] Move labtestpuppetmaster2001 to Stretch [puppet] - 10https://gerrit.wikimedia.org/r/549641 (https://phabricator.wikimedia.org/T235819) (owner: 10Andrew Bogott) [21:50:18] 10Operations, 10User-jbond: Boron disk space alert - https://phabricator.wikimedia.org/T237649 (10Ottomata) Removed a buncha stuff! 
[21:54:33] !log rebuilding labtestpuppetmaster2001 w/Stretch [21:54:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:59:13] (03PS6) 10Ayounsi: Initial templating for CR routing-options [homer/public] - 10https://gerrit.wikimedia.org/r/547587 [22:03:30] !log volker-e@deploy1001 Started deploy [design/style-guide@4abbc70]: Update to latest master with components overview additions [22:03:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:03:36] !log volker-e@deploy1001 Finished deploy [design/style-guide@4abbc70]: Update to latest master with components overview additions (duration: 00m 06s) [22:03:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:11:22] (03CR) 10Ayounsi: "Full diff from the routers available on https://phabricator.wikimedia.org/P9556" [homer/public] - 10https://gerrit.wikimedia.org/r/547587 (owner: 10Ayounsi) [22:11:55] (03PS7) 10Ayounsi: Initial templating for CR routing-options [homer/public] - 10https://gerrit.wikimedia.org/r/547587 [22:12:00] RoanKattouw: hassium/hassaleh are proxy hosts used by Varnish to reach mwdebug1001, etc, although I'm not really sure why. It's some part of the original setup from o.ri [22:14:40] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster: analytics1062 lost one of its power supplies - https://phabricator.wikimedia.org/T237133 (10Jclark-ctr) Received new psu and replaced. 
Tracking for RMA {F31057616} [22:15:31] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster: analytics1062 lost one of its power supplies - https://phabricator.wikimedia.org/T237133 (10Jclark-ctr) 05Open→03Resolved [22:17:56] (03CR) 10Catrope: [C: 03+2] beta: Point Parsoid to parsoid-php instead of parsoid-js [mediawiki-config] - 10https://gerrit.wikimedia.org/r/549637 (https://phabricator.wikimedia.org/T229078) (owner: 10Catrope) [22:18:40] (03Merged) 10jenkins-bot: beta: Point Parsoid to parsoid-php instead of parsoid-js [mediawiki-config] - 10https://gerrit.wikimedia.org/r/549637 (https://phabricator.wikimedia.org/T229078) (owner: 10Catrope) [22:20:35] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 271, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:22:43] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:25:55] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:26:35] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:26:45] (03PS1) 10Mholloway: MachineVision: Remove annotation job delay [mediawiki-config] - 10https://gerrit.wikimedia.org/r/549656 [22:27:03] 10Operations, 10User-jbond: Boron disk space alert - https://phabricator.wikimedia.org/T237649 (10MoritzMuehlenhoff) One big disk space hog is the fact that we don't expire old builds in /var/cache/pbuilder/result/*, there are builds which date back to 2016. There's certain value to keep the last, say six mont... 
[22:28:30] (03CR) 10Bstorm: [C: 04-1] "Until we find out why the ingress object doesn't work in https://phabricator.wikimedia.org/T234037#5646239, we probably don't want to both" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/549613 (https://phabricator.wikimedia.org/T236202) (owner: 10Bstorm) [22:29:35] (03CR) 10Mholloway: [C: 03+2] MachineVision: Remove annotation job delay [mediawiki-config] - 10https://gerrit.wikimedia.org/r/549656 (owner: 10Mholloway) [22:32:30] !log mholloway-shell@deploy1001 Synchronized wmf-config/InitialiseSettings.php: MachineVision: Remove annotation job delay (duration: 00m 53s) [22:32:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:38:51] 10Operations, 10Traffic: ATS doesn't support X-Wikimedia-Debug - https://phabricator.wikimedia.org/T237687 (10BBlack) Reading up on the `debug_proxy` stuff a bit more.... currently hassium/hassaleh are proxies into mwdebug[12]00[12], and use the header to select the destination host, and also has some backward... [22:39:47] 10Operations, 10Traffic, 10Performance-Team (Radar): ATS doesn't support X-Wikimedia-Debug - https://phabricator.wikimedia.org/T237687 (10Krinkle) [22:45:02] (03PS1) 10Andrew Bogott: labtestpuppetmaster2001: manage_puppet_ca_file: false [puppet] - 10https://gerrit.wikimedia.org/r/549660 [22:45:55] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:46:40] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster: analytics1062 lost one of its power supplies - https://phabricator.wikimedia.org/T237133 (10Dzahn) 05Resolved→03Open it is still shown as CRIT in monitoring: https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=analytics1062&s... 
[22:46:50] (03CR) 10Andrew Bogott: [C: 03+2] labtestpuppetmaster2001: manage_puppet_ca_file: false [puppet] - 10https://gerrit.wikimedia.org/r/549660 (owner: 10Andrew Bogott) [22:50:05] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:50:31] (03PS1) 10Bstorm: kubectl: upgrade /usr/bin/kubectl to 1.15.5 [puppet] - 10https://gerrit.wikimedia.org/r/549661 (https://phabricator.wikimedia.org/T214513) [22:50:47] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:51:43] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:52:28] (03PS2) 10Bstorm: kubectl: upgrade /usr/bin/kubectl to 1.15.5 [puppet] - 10https://gerrit.wikimedia.org/r/549661 (https://phabricator.wikimedia.org/T214513) [22:53:00] !log volker-e@deploy1001 Started deploy [design/style-guide@4abbc70]: Update responsive Illustrations styles changes [22:53:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:53:05] !log volker-e@deploy1001 Finished deploy [design/style-guide@4abbc70]: Update responsive Illustrations styles changes (duration: 00m 05s) [22:53:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:53:11] (03PS1) 10Andrew Bogott: labtestpuppetmaster2001: mark out more unused puppetmaster features [puppet] - 10https://gerrit.wikimedia.org/r/549662 [22:55:19] (03CR) 10Andrew Bogott: [C: 03+2] labtestpuppetmaster2001: mark out more unused puppetmaster features [puppet] - 10https://gerrit.wikimedia.org/r/549662 (owner: 10Andrew Bogott) [22:57:17] !log twentyafterfour@deploy1001 Started deploy [releng/phatality@11d4ad8]: (no justification provided) [22:57:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:58:02] (03PS4) 
10Ammarpad: Rename DPL extension variable to non-ambiguous name [mediawiki-config] - 10https://gerrit.wikimedia.org/r/548569 [23:04:06] !log twentyafterfour@deploy1001 Finished deploy [releng/phatality@11d4ad8]: (no justification provided) (duration: 06m 48s) [23:04:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:04:11] PROBLEM - Check systemd state on logstash1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:04:17] PROBLEM - ElasticSearch health check for shards on 9200 on logstash1009 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(requests.packages.urllib3.connection.HTTPConnection object at 0x7f63bd829358: Failed to establish a new connection: [Errno 111] Connection [23:04:17] ://wikitech.wikimedia.org/wiki/Search%23Administration [23:06:24] 10Operations, 10Analytics, 10Analytics-Kanban, 10Traffic, and 2 others: Publish tls related info to webrequest via varnish - https://phabricator.wikimedia.org/T233661 (10Nuria) 05Open→03Resolved [23:06:45] 10Operations, 10Analytics, 10Analytics-Kanban, 10Traffic, 10observability: Update webrequest_128 dataset in turnilo to include TLS fields once available - https://phabricator.wikimedia.org/T237117 (10Nuria) 05Open→03Resolved [23:06:49] 10Operations, 10Analytics, 10Analytics-Kanban, 10Traffic, and 2 others: Publish tls related info to webrequest via varnish - https://phabricator.wikimedia.org/T233661 (10Nuria) [23:06:54] 10Operations, 10Analytics, 10Analytics-Kanban, 10Traffic, 10observability: Update webrequest_128 dataset in turnilo to include TLS fields once available - https://phabricator.wikimedia.org/T237117 (10Nuria) [23:07:39] That elasticsearch error is triggered by my kibana plugin deploy, it 
should come back momentarily [23:09:00] !log twentyafterfour@deploy1001 Started deploy [releng/phatality@11d4ad8]: (no justification provided) [23:09:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:09:06] !log twentyafterfour@deploy1001 Finished deploy [releng/phatality@11d4ad8]: (no justification provided) (duration: 00m 06s) [23:09:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:10:59] PROBLEM - Packet loss ratio for UDP on logstash1009 is CRITICAL: 0.9829 ge 0.1 https://wikitech.wikimedia.org/wiki/Logstash https://grafana.wikimedia.org/dashboard/db/logstash [23:11:14] ^ that one I don't know about [23:12:02] ACKNOWLEDGEMENT - ElasticSearch health check for shards on 9200 on logstash1009 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(requests.packages.urllib3.connection.HTTPConnection object at 0x7f4a8e485390: Failed to establish a new connection: [Errno 111] Co [23:12:02] )) 20after4 caused by phatality deployment, it should resolve itself, I hope https://wikitech.wikimedia.org/wiki/Search%23Administration [23:17:23] (03PS5) 10Ammarpad: Rename DPL extension variable to non-ambiguous name [mediawiki-config] - 10https://gerrit.wikimedia.org/r/548569 [23:18:24] ok I don't know what's up with logstash1009 [23:20:20] !log twentyafterfour@deploy1001 Started deploy [releng/phatality@62e2870]: revert to previous phatality plugin version [23:20:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:20:43] PROBLEM - puppetmaster backend https on labtestpuppetmaster2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Puppet%23Debugging [23:21:10] (03PS1) 10Jhedden: install_server: Update cloudcephmons pxe interface [puppet] - 
10https://gerrit.wikimedia.org/r/549664 (https://phabricator.wikimedia.org/T228102) [23:22:41] PROBLEM - puppetmaster https on labtestpuppetmaster2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Puppet%23Debugging [23:22:51] RECOVERY - Check systemd state on logstash1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:22:59] RECOVERY - ElasticSearch health check for shards on 9200 on logstash1009 is OK: OK - elasticsearch status production-logstash-eqiad: initializing_shards: 0, delayed_unassigned_shards: 0, status: green, timed_out: False, active_shards_percent_as_number: 100.0, task_max_waiting_in_queue_millis: 0, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_data_nodes: 3, active_shards: 646, active_primary_shards: 265, number_of_in_ [23:22:59] cluster_name: production-logstash-eqiad, relocating_shards: 0, number_of_nodes: 6 https://wikitech.wikimedia.org/wiki/Search%23Administration [23:23:15] !log twentyafterfour@deploy1001 Finished deploy [releng/phatality@62e2870]: revert to previous phatality plugin version (duration: 02m 55s) [23:23:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:23:55] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:24:17] (03CR) 10Jhedden: [C: 03+2] install_server: Update cloudcephmons pxe interface [puppet] - 10https://gerrit.wikimedia.org/r/549664 (https://phabricator.wikimedia.org/T228102) (owner: 10Jhedden) [23:26:43] RECOVERY - Packet loss ratio for UDP on logstash1009 is OK: (C)0.1 ge (W)0.05 ge 0.02472 https://wikitech.wikimedia.org/wiki/Logstash https://grafana.wikimedia.org/dashboard/db/logstash [23:27:10] ok trying this again [23:28:39] !log twentyafterfour@deploy1001 Started deploy [releng/phatality@11d4ad8]: trying again with a longer scap timeout [23:28:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:31:41] !log twentyafterfour@deploy1001 Finished deploy [releng/phatality@11d4ad8]: trying again with a longer scap timeout (duration: 03m 02s) [23:31:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:32:11] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [23:33:51] PROBLEM - Check systemd state on logstash1008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:34:03] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:34:08] what now... 
[23:34:33] PROBLEM - ElasticSearch health check for shards on 9200 on logstash1008 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(requests.packages.urllib3.connection.HTTPConnection object at 0x7f1dd38db390: Failed to establish a new connection: [Errno 111] Connection [23:34:33] ://wikitech.wikimedia.org/wiki/Search%23Administration [23:35:38] 10Operations, 10Traffic, 10Performance-Team (Radar): ATS doesn't support X-Wikimedia-Debug - https://phabricator.wikimedia.org/T237687 (10BBlack) Maybe this is closer to a Lua replacement for all of it, although it still has issues! ` local debug_map = { '1' => 'mwdebug1001.eqiad.w... [23:38:29] (03CR) 10Jforrester: [C: 03+1] "This works. Obviously the other two are needed too. :-)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/548569 (owner: 10Ammarpad) [23:38:49] !log twentyafterfour@deploy1001 Started deploy [releng/phatality@11d4ad8]: one more time [23:38:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:39:08] 10Operations, 10Wikimedia-Logstash, 10observability: Standardize the logging format - https://phabricator.wikimedia.org/T234565 (10colewhite) @Krinkle T180051 IMHO implies a different solution. That task, as well as speeding up Kibana, would be accomplished with the work intended here. The last comment fro... 
[23:39:24] 10Operations, 10ops-codfw: codfw: recycle Cisco old servers - https://phabricator.wikimedia.org/T235669 (10Papaul) [23:41:49] !log twentyafterfour@deploy1001 Finished deploy [releng/phatality@11d4ad8]: one more time (duration: 03m 00s) [23:41:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:42:43] PROBLEM - Packet loss ratio for UDP on logstash1008 is CRITICAL: 0.9801 ge 0.1 https://wikitech.wikimedia.org/wiki/Logstash https://grafana.wikimedia.org/dashboard/db/logstash [23:42:43] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [23:43:25] (03PS1) 10Ammarpad: Switch to use new config variable for DynamicPagelist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/549666 (https://phabricator.wikimedia.org/T237698) [23:44:20] !log twentyafterfour@deploy1001 Started deploy [releng/phatality@62e2870]: revert phatalaty again [23:44:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:44:36] !log start elasticsearch on logstash1008 [23:44:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:45:06] (03PS6) 10Ammarpad: Rename DPL extension variable to non-ambiguous name [mediawiki-config] - 10https://gerrit.wikimedia.org/r/548569 [23:45:13] RECOVERY - ElasticSearch health check for shards on 9200 on logstash1008 is OK: OK - elasticsearch status production-logstash-eqiad: unassigned_shards: 0, delayed_unassigned_shards: 0, cluster_name: production-logstash-eqiad, active_shards: 646, number_of_data_nodes: 3, status: green, active_shards_percent_as_number: 100.0, number_of_nodes: 6, number_of_pending_tasks: 0, task_max_waiting_in_queue_millis: 0, relocating_shards: 0, [23:45:13] ards: 265, initializing_shards: 0, number_of_in_flight_fetch: 0, timed_out: False 
https://wikitech.wikimedia.org/wiki/Search%23Administration [23:46:14] (03PS1) 10RLazarus: [poolcounter] (WIP) Add a poolcounter prometheus exporter. [puppet] - 10https://gerrit.wikimedia.org/r/549668 [23:46:45] shdubsh: can you look at the logs on logstash1008 and see what caused it to die? [23:47:07] yeah, looking now [23:47:23] (03CR) 10jerkins-bot: [V: 04-1] [poolcounter] (WIP) Add a poolcounter prometheus exporter. [puppet] - 10https://gerrit.wikimedia.org/r/549668 (owner: 10RLazarus) [23:47:24] it must be related to my deploy but I can't figure out how exactly [23:47:24] !log twentyafterfour@deploy1001 Finished deploy [releng/phatality@62e2870]: revert phatalaty again (duration: 03m 04s) [23:47:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:47:53] (03CR) 10Ammarpad: "> This isn't deploy-safe. Deploys are not atomic, they are" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/548569 (owner: 10Ammarpad) [23:48:44] twentyafterfour: looks like ES was oom-killed [23:48:56] oooh [23:49:02] that makes sense sortof [23:49:03] RECOVERY - Packet loss ratio for UDP on logstash1008 is OK: (C)0.1 ge (W)0.05 ge 0.01649 https://wikitech.wikimedia.org/wiki/Logstash https://grafana.wikimedia.org/dashboard/db/logstash [23:49:24] (03CR) 10RLazarus: "CDanis: This is very much WIP. Take an early look and tell me if I'm on the right track?" [puppet] - 10https://gerrit.wikimedia.org/r/549668 (owner: 10RLazarus) [23:49:40] !log removing one file for legal compliance [23:49:41] I guess it's because installing a kibana plugin causes a bunch of stuff to get reloaded and adds memory strain on the machine. 
that sucks [23:49:42] (03PS7) 10Ammarpad: Rename DPL extension variable to non-ambiguous name [mediawiki-config] - 10https://gerrit.wikimedia.org/r/548569 (https://phabricator.wikimedia.org/T237698) [23:49:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:50:43] PROBLEM - Unmerged changes on repository puppet on labtestpuppetmaster2001 is CRITICAL: CHECK_NRPE: Error - Could not connect to 208.80.153.108: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [23:50:54] twentyafterfour: does it try to restart elasticsearch? [23:50:55] PROBLEM - Disk space on labtestpuppetmaster2001 is CRITICAL: CHECK_NRPE: Error - Could not connect to 208.80.153.108: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=labtestpuppetmaster2001&var-datasource=codfw+prometheus/ops [23:50:55] PROBLEM - MD RAID on labtestpuppetmaster2001 is CRITICAL: CHECK_NRPE: Error - Could not connect to 208.80.153.108: Connection reset by peer https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [23:51:01] PROBLEM - Check size of conntrack table on labtestpuppetmaster2001 is CRITICAL: CHECK_NRPE: Error - Could not connect to 208.80.153.108: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [23:51:15] PROBLEM - configured eth on labtestpuppetmaster2001 is CRITICAL: CHECK_NRPE: Error - Could not connect to 208.80.153.108: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [23:51:15] PROBLEM - Check whether ferm is active by checking the default input chain on labtestpuppetmaster2001 is CRITICAL: CHECK_NRPE: Error - Could not connect to 208.80.153.108: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [23:51:45] PROBLEM - Check systemd state on labtestpuppetmaster2001 is 
CRITICAL: CHECK_NRPE: Error - Could not connect to 208.80.153.108: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:51:45] PROBLEM - Unmerged changes on repository labs-private on labtestpuppetmaster2001 is CRITICAL: CHECK_NRPE: Error - Could not connect to 208.80.153.108: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [23:51:45] PROBLEM - dhclient process on labtestpuppetmaster2001 is CRITICAL: CHECK_NRPE: Error - Could not connect to 208.80.153.108: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [23:51:45] PROBLEM - DPKG on labtestpuppetmaster2001 is CRITICAL: CHECK_NRPE: Error - Could not connect to 208.80.153.108: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [23:52:11] PROBLEM - labspuppetbackend uWSGI web app on labtestpuppetmaster2001 is CRITICAL: CHECK_NRPE: Error - Could not connect to 208.80.153.108: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/Services/labspuppetbackend [23:52:23] ouch... [23:53:21] RECOVERY - dhclient process on labtestpuppetmaster2001 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [23:53:21] RECOVERY - Unmerged changes on repository labs-private on labtestpuppetmaster2001 is OK: No changes to merge. https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [23:53:44] shdubsh: I think the plugin install command might do that [23:53:53] 10Operations, 10ops-codfw: codfw: recycle Cisco old servers - https://phabricator.wikimedia.org/T235669 (10Papaul) Pick up was done today at 12:00pm CT {F31057752} [23:53:57] RECOVERY - Unmerged changes on repository puppet on labtestpuppetmaster2001 is OK: No changes to merge. 
https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [23:53:57] RECOVERY - puppetmaster https on labtestpuppetmaster2001 is OK: HTTP OK: Status line output matched 400 - 399 bytes in 0.190 second response time https://wikitech.wikimedia.org/wiki/Puppet%23Debugging [23:54:05] RECOVERY - Disk space on labtestpuppetmaster2001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=labtestpuppetmaster2001&var-datasource=codfw+prometheus/ops [23:54:07] RECOVERY - MD RAID on labtestpuppetmaster2001 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [23:54:11] RECOVERY - Check size of conntrack table on labtestpuppetmaster2001 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [23:54:12] 10Operations, 10Commons, 10Multimedia, 10SRE-swift-storage, 10User-Josve05a: Specific revisions of multiple files missing from Swift - 404 Not Found returned - https://phabricator.wikimedia.org/T124101 (10Platonides) //(old comment that was waiting as draft in the browser)// Other urls found in the chan... [23:54:17] RECOVERY - puppetmaster backend https on labtestpuppetmaster2001 is OK: HTTP OK: Status line output matched 400 - 398 bytes in 0.172 second response time https://wikitech.wikimedia.org/wiki/Puppet%23Debugging [23:54:20] twentyafterfour: hrm, it shouldn't need to. 
[23:54:23] shdubsh: this is the command: /usr/bin/sudo -u kibana /usr/share/kibana/bin/kibana-plugin install file:///srv/deployment/releng/phatality/deploy/phatality-5.6.15.zip [23:54:27] RECOVERY - Check whether ferm is active by checking the default input chain on labtestpuppetmaster2001 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [23:54:27] RECOVERY - configured eth on labtestpuppetmaster2001 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [23:54:34] takes ~3 minutes to finish [23:54:47] so I assume it's reloading something in order for it to take that long to install a tiny file [23:54:53] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [23:55:00] 10Operations, 10Commons, 10Multimedia, 10SRE-swift-storage, 10User-Josve05a: Specific revisions of multiple files missing from Swift - 404 Not Found returned - https://phabricator.wikimedia.org/T124101 (10Platonides) The file was reuploaded by Denniss on September 2016, but the original file is still mis... [23:55:10] it shouldn't really affect elasticsearch though, just the kibana frontend [23:55:49] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:56:37] yeah, that's not affecting ES. might be something in the exec that's called in the end though... [23:59:02] 10Operations, 10ops-codfw: codfw: recycle Cisco old servers - https://phabricator.wikimedia.org/T235669 (10Papaul) 05Open→03Resolved This is complete resolving this task {F31057757} [23:59:14] 10Operations, 10decommission, 10Goal: reclaim and return all cisco servers - https://phabricator.wikimedia.org/T128821 (10Papaul)