[00:05:48] the biggest one is probably translate - that patch should be backported asap imo, everything else can wait for swat
[00:13:26] @brennen actually we're doing reverts instead, and will fix later; can you deploy the reverts?
[00:14:51] DannyS712: who else is we, and how many reverts are we talking?
[00:15:17] sorry, per Krinkle; 5 reverts, Translate is the most important imo
[00:15:30] yeah, i can deploy.
[00:15:45] Operations, Traffic, serviceops, Patch-For-Review, and 2 others: Make CDN purges reliable - https://phabricator.wikimedia.org/T133821 (aaron) >>! In T133821#6092867, @Joe wrote: > At a later time, we could think of changing the logic, and make purges avoid race conditions, removing the need for t...
[00:16:01] patches at https://gerrit.wikimedia.org/r/#/q/topic:revert-pageupdater+(status:open+OR+status:merged)
[00:16:44] DannyS712: It looks like there are unrelated changes to method visibility and return values
[00:16:58] Are we sure nothing else in this train started to rely on those changes?
[00:17:35] (looking at the first one, ApiTranslateSandbox.php)
[00:17:46] um, oops, not sure at all
[00:17:49] brennen: this will take a while, let's roll back for now.
[00:17:57] I can't review all these now at least. Going to bed soon.
[00:18:20] yeah, rolling back. this feels more complex than is safe to deal with at the moment.
[00:18:50] group 1 is the most impacted I think - translate is only used in group 1 if I understand correctly
[00:18:58] so group0
[00:19:02] wikipedias shouldn't be impacted at all
[00:19:03] back to group0
[00:19:09] ack, back to group0.
[00:19:28] !log rolling 1.35.0-wmf.31 train back to group0 for T252179
[00:19:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:19:32] T252179: Edits saved via PageUpdater need autopatrol status set - https://phabricator.wikimedia.org/T252179
[00:20:34] ...15 seconds too late - the only hard deprecation recently (last 4 weeks) that I see is PageArchive::getPreviousRevision
[00:20:44] DannyS712: I believe Babel is on all wikis, but not sure if this code path is relevant there. In any event, we don't want group2 to be ahead of group1
[00:21:01] it's only babel autocreation of new categories, but that makes sense
[00:22:16] we might even want to go all the way since mediawiki.org patrolling of translated pages would presumably be impacted as well. But depends on which direction it fails. Does it mark too many as already/auto-patrolled or not enough?
[00:22:29] not enough
[00:22:50] https://phabricator.wikimedia.org/T151172 suggests it means they were unable to patrol edits
[00:23:09] meaning they are all marked by default as patrolled, bypassing review?
[00:23:33] they are now; that's how I noticed it (fuzzybot wasn't being patrolled; I filed that as a normal bug report, and then realized that it was all translation edits that weren't being patrolled)
[00:25:05] DannyS712: ah okay, so it was not preventing any edits from being patrolled, it just meant too many showed up as needing review.
[00:25:09] https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/Translate/+/585319/
[00:25:25] what about this change, is that not needed if the underlying cause is fixed, or is this unrelated?
[00:25:40] so this is a: rollback to only testwiki-type situation?
[00:25:53] Krinkle, DannyS712: before i sync-wikiversions over here, do we still believe...
[00:25:59] er, starting over: revert all to .30?
[00:26:07] unrelated: I wanted to be able to patrol edits by non-autopatrolled editors to translation pages; now I can, but nothing is autopatrolled
[00:26:21] thcipriani: impact is minimal for group0 wikis, don't mind either way but yeah mediawiki.org translation review is affected as well
[00:26:27] so maybe to testwiki yeah
[00:26:40] I can do a run with RTRC on mediawiki.org and clean up there
[00:26:48] :)
[00:26:59] ok, reverting to testwiki.
[00:27:02] ack
[00:30:25] !log brennen@deploy1001 rebuilt and synchronized wikiversions files: Revert all wikis except test to 1.35.0-wmf.30 for T252179
[00:30:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:30:28] T252179: Edits saved via PageUpdater need autopatrol status set - https://phabricator.wikimedia.org/T252179
[00:35:19] (PS1) Brennen Bearnes: All wikis except test to 1.35.0-wmf.30 [mediawiki-config] - https://gerrit.wikimedia.org/r/595075 (https://phabricator.wikimedia.org/T249963)
[00:35:22] (CR) Brennen Bearnes: [C: +2] All wikis except test to 1.35.0-wmf.30 [mediawiki-config] - https://gerrit.wikimedia.org/r/595075 (https://phabricator.wikimedia.org/T249963) (owner: Brennen Bearnes)
[00:36:06] (Merged) jenkins-bot: All wikis except test to 1.35.0-wmf.30 [mediawiki-config] - https://gerrit.wikimedia.org/r/595075 (https://phabricator.wikimedia.org/T249963) (owner: Brennen Bearnes)
[00:36:37] okay. What can I do to help resolve my mess?
[00:40:40] at this point, the train is halted for the day, and (by policy at least) until monday.
[00:41:07] i'm assuming either fixes or reverts for these known things can be worked out by then.
[00:41:25] reverts are pending for all problematic patches against master
[00:42:34] yeah, so that's good, though it seems like they might need further review.
in terms of the whole train, i would prefer to avoid playing whack-a-mole with this class of bugs on monday with next week's train looming on tuesday, so i guess any extra assurances or warnings we might be able to get before then would also be pretty helpful.
[00:43:07] (hopefully this was the last one, but it would be nice to feel a little more sure of that if possible.)
[00:43:39] So I caused 7 train blockers for the same train; might be a good idea for me to pause the Revision work for a bit
[00:44:24] but if there is anything I can do to help with the cleanup now, just let me know
[00:44:24] i don't want to be discouraging, but i also think i'd like to avoid a repeat of this scenario.
[00:44:37] DannyS712: prep reverts for Monday. Don't worry about the rest for now :)
[00:44:53] should I cherry pick the reverts now, or wait until they are merged?
[00:45:35] Decoupling WikiPage/Revision is probably in the top 10 hardest things we've done in the past 5 years, so don't worry about how difficult it is. I don't think anyone could have known all the angles ahead of time.
[00:45:40] DannyS712: wait for merge
[00:46:04] I assume the merger will create the cherry picks
[00:46:27] but if not, can do it later when figuring out when to deploy
[00:46:53] DannyS712: yeah, and again, thanks for quick response times and good comms on all of this.
[00:47:30] * DannyS712 looks for a tiny sliver of a silver lining, and is surprised that https://versions.toolforge.org/ is completely accurate despite group0 being split between versions
[00:47:57] s/surprised/pleasantly surprised
[00:48:16] it does come up from time to time. :)
[00:48:25] on that note, i'm signing off for the evening; thanks all.
[00:48:59] good night, and again, sorry for the mess
[00:49:03] DannyS712: perhaps add some checkboxes to the new task for this revert to make sure we get them all in the deployed version
[00:49:14] easy to miss one otherwise :)
[00:49:18] doing
[00:56:12] Thanks.
[00:56:16] and good night
[01:28:05] PROBLEM - Ensure hosts are not performing a change on every puppet run on puppetdb1002 is CRITICAL: CRITICAL: the following (6) node(s) change every puppet run: cloudservices2003-dev.wikimedia.org, restbase1028.eqiad.wmnet, restbase1029.eqiad.wmnet, mw2165.codfw.wmnet, analytics1055.eqiad.wmnet, restbase1030.eqiad.wmnet https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes
[04:28:38] (CR) Nuria: "Nice idea." [mediawiki-config] - https://gerrit.wikimedia.org/r/595025 (https://phabricator.wikimedia.org/T238230) (owner: Ottomata)
[04:32:30] (PS2) Nuria: Count automated traffic as bots in turnilo's homescreen [puppet] - https://gerrit.wikimedia.org/r/594272 (https://phabricator.wikimedia.org/T238357)
[04:33:08] (CR) Nuria: "Tested now, please note corrections" [puppet] - https://gerrit.wikimedia.org/r/594272 (https://phabricator.wikimedia.org/T238357) (owner: Nuria)
[04:38:21] (CR) Nuria: [C: +1] "Tested this in turnilo and looks good."
[puppet] - https://gerrit.wikimedia.org/r/594472 (https://phabricator.wikimedia.org/T243090) (owner: Joal)
[05:10:38] !log Upgrade pc1010
[05:10:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:18:37] (PS1) Elukey: Add override to analytics1055 to skip one datanode partition [puppet] - https://gerrit.wikimedia.org/r/595084 (https://phabricator.wikimedia.org/T252070)
[05:19:43] (CR) Elukey: [C: +2] Add override to analytics1055 to skip one datanode partition [puppet] - https://gerrit.wikimedia.org/r/595084 (https://phabricator.wikimedia.org/T252070) (owner: Elukey)
[05:24:36] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime
[05:24:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:27:14] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[05:27:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:29:02] (PS1) Marostegui: Revert "install_server: Reimage pc1010|pc2010" [puppet] - https://gerrit.wikimedia.org/r/595085
[06:29:38] (PS2) Giuseppe Lavagetto: Improve the kafka consumer interface [software/purged] - https://gerrit.wikimedia.org/r/594953
[06:40:42] (PS1) Elukey: Disable Analytics refine failure check for sanitize timers [puppet] - https://gerrit.wikimedia.org/r/595087
[06:46:43] (CR) Elukey: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1003/22406/an-launcher1001.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/595087 (owner: Elukey)
[06:47:18] (CR) Muehlenhoff: [C: +1] admin: add Tobias Andersson to ldap only users [puppet] - https://gerrit.wikimedia.org/r/595051 (https://phabricator.wikimedia.org/T251997) (owner: Cwhite)
[06:48:01] (CR) Muehlenhoff: [C: +1] admin: add Nikolaos Gkountas to ldap only users [puppet] - https://gerrit.wikimedia.org/r/594974 (https://phabricator.wikimedia.org/T252100) (owner: Cwhite)
[06:49:36] (CR)
Dzahn: [C: +1] "when actually adding to groups, it is usually both wmde and nda group" [puppet] - https://gerrit.wikimedia.org/r/595051 (https://phabricator.wikimedia.org/T251997) (owner: Cwhite)
[06:53:12] Operations, LDAP-Access-Requests, WMF-Legal, Patch-For-Review: Add Tobias Andersson to the ldap/wmde group - https://phabricator.wikimedia.org/T251997 (Dzahn) @WMDE-leszek He still needs to be added to https://www.wikimedia.de/mitarbeitende/ i assume.
[06:54:43] (CR) Dzahn: [C: +1] "employeeType: Full Time" [puppet] - https://gerrit.wikimedia.org/r/594974 (https://phabricator.wikimedia.org/T252100) (owner: Cwhite)
[06:58:17] (PS1) Vgutierrez: ATS: Puppetize max_connections_in and max_connections_active_in [puppet] - https://gerrit.wikimedia.org/r/595088 (https://phabricator.wikimedia.org/T249335)
[07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200508T0700)
[07:00:47] (CR) Vgutierrez: "pcc looks sane & happy: https://puppet-compiler.wmflabs.org/compiler1003/22407/" [puppet] - https://gerrit.wikimedia.org/r/595088 (https://phabricator.wikimedia.org/T249335) (owner: Vgutierrez)
[07:01:24] !log installing php5 security updates
[07:01:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:03:08] (PS9) Dzahn: phabricator: Drop phd.pid-directory as it's now uneeded [puppet] - https://gerrit.wikimedia.org/r/594162 (owner: Paladox)
[07:03:40] (CR) Dzahn: [C: +2] phabricator: Drop phd.pid-directory as it's now uneeded [puppet] - https://gerrit.wikimedia.org/r/594162 (owner: Paladox)
[07:05:11] (CR) Vgutierrez: [C: +2] ATS: Puppetize max_connections_in and max_connections_active_in [puppet] - https://gerrit.wikimedia.org/r/595088 (https://phabricator.wikimedia.org/T249335) (owner: Vgutierrez)
[07:07:14] !log phabricator rmdir /var/run/phd/pid - empty and now unused
[07:07:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:09:51] (CR) Dzahn: "> Patch Set 3:" [puppet] - https://gerrit.wikimedia.org/r/594157 (owner: Paladox)
[07:11:53] (CR) Marostegui: [C: +2] Revert "install_server: Reimage pc1010|pc2010" [puppet] - https://gerrit.wikimedia.org/r/595085 (owner: Marostegui)
[07:12:30] (PS1) Vgutierrez: ATS: Tune max_connections_in and max_connections_active_in on cp3050 [puppet] - https://gerrit.wikimedia.org/r/595089 (https://phabricator.wikimedia.org/T249335)
[07:16:21] (CR) Dzahn: "Also: https://phabricator.wikimedia.org/rPHABb5da96f67af36376568432460d23300f6b7a5adb" [puppet] - https://gerrit.wikimedia.org/r/594157 (owner: Paladox)
[07:16:32] (CR) Vgutierrez: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1002/22408/" [puppet] - https://gerrit.wikimedia.org/r/595089 (https://phabricator.wikimedia.org/T249335) (owner: Vgutierrez)
[07:16:59] Operations, ops-eqiad, DC-Ops, netops: (Need By: TBD) rack/setup/install WMCS 10G switches - https://phabricator.wikimedia.org/T251632 (wiki_willy) a:Jclark-ctr
[07:19:17] !log ats-tls restart on cp3050 and cp3052 (max_connections_active_in experiment) - T249335
[07:19:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:19:21] T249335: Memory leak on ats-tls 8.0.6 - https://phabricator.wikimedia.org/T249335
[07:25:27] (PS4) Dzahn: create a role like simplelamp but using mariadb, not mysql [puppet] - https://gerrit.wikimedia.org/r/592653 (https://phabricator.wikimedia.org/T162070)
[07:26:16] (CR) jerkins-bot: [V: -1] create a role like simplelamp but using mariadb, not mysql [puppet] - https://gerrit.wikimedia.org/r/592653 (https://phabricator.wikimedia.org/T162070) (owner: Dzahn)
[07:28:21] (PS5) Dzahn: create simplelamp2, replace mysql->mariadb, apache->httpd [puppet] - https://gerrit.wikimedia.org/r/592653
(https://phabricator.wikimedia.org/T162070)
[07:28:56] Operations: debian-installer: partman doesn't allow lvm LVs to be reused when reimaging - https://phabricator.wikimedia.org/T252027 (Marostegui) Sad news :-/ If that is the case, I don't think there is much point on going forward with this, specially if we'd need to patch partman in many places - and then ma...
[07:29:10] (CR) jerkins-bot: [V: -1] create simplelamp2, replace mysql->mariadb, apache->httpd [puppet] - https://gerrit.wikimedia.org/r/592653 (https://phabricator.wikimedia.org/T162070) (owner: Dzahn)
[07:37:07] (CR) Filippo Giunchedi: [C: +2] prometheus: add thanos sidecar to k8s instances [puppet] - https://gerrit.wikimedia.org/r/594919 (https://phabricator.wikimedia.org/T233956) (owner: Filippo Giunchedi)
[07:40:07] (PS6) Dzahn: create simplelamp2, replace mysql->mariadb, apache->httpd [puppet] - https://gerrit.wikimedia.org/r/592653 (https://phabricator.wikimedia.org/T162070)
[07:45:20] (CR) Filippo Giunchedi: "> Patch Set 1:" [puppet] - https://gerrit.wikimedia.org/r/594316 (https://phabricator.wikimedia.org/T251466) (owner: Cwhite)
[07:50:52] (PS7) Dzahn: create simplelamp2, replace mysql->mariadb, apache->httpd [puppet] - https://gerrit.wikimedia.org/r/592653 (https://phabricator.wikimedia.org/T162070)
[07:58:23] Operations, LDAP-Access-Requests, SRE-Access-Requests, Discovery-Search (Current work), Patch-For-Review: SRE Onboarding - Ryan Kemper, Search Platform team - https://phabricator.wikimedia.org/T251572 (Dzahn) >>! In T251572#6117851, @RKemper wrote: > `define contactgroup { > contactgroup_...
[08:05:44] (PS8) Dzahn: create simplelamp2, replace mysql->mariadb, apache->httpd [puppet] - https://gerrit.wikimedia.org/r/592653 (https://phabricator.wikimedia.org/T162070)
[08:07:55] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=thanos_sidecar site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:09:04] (PS1) Muehlenhoff: Enable base::service_auto_restart for Apache on miscweb [puppet] - https://gerrit.wikimedia.org/r/595133 (https://phabricator.wikimedia.org/T135991)
[08:11:05] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:12:05] (CR) Dzahn: [C: +1] Enable base::service_auto_restart for Apache on miscweb [puppet] - https://gerrit.wikimedia.org/r/595133 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff)
[08:12:50] (PS1) Vgutierrez: ATS: Increase max_connections_in and max_connections_active_in on esams [puppet] - https://gerrit.wikimedia.org/r/595134 (https://phabricator.wikimedia.org/T249335)
[08:13:22] Operations, netops: Upgrade Routinator 3000 to 0.7.0 - https://phabricator.wikimedia.org/T252010 (ayounsi) Open→Resolved Added `routinator_rtr_current_connections` to the Grafana dashboard.
[08:15:33] (CR) Vgutierrez: [C: +2] "pcc: https://puppet-compiler.wmflabs.org/compiler1001/22409/" [puppet] - https://gerrit.wikimedia.org/r/595134 (https://phabricator.wikimedia.org/T249335) (owner: Vgutierrez)
[08:16:00] Operations, LDAP-Access-Requests, SRE-Access-Requests, Discovery-Search (Current work), Patch-For-Review: SRE Onboarding - Ryan Kemper, Search Platform team - https://phabricator.wikimedia.org/T251572 (Gehel) @RKemper : for the moment, you don't need to be in "team-interactive". We can discus...
[08:16:01] ACKNOWLEDGEMENT - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 62, down: 1, dormant: 0, excluded: 0, unused: 0: Ayounsi https://phabricator.wikimedia.org/T221259 - The acknowledgement expires at: 2020-05-08 14:15:20. https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:16:01] ACKNOWLEDGEMENT - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 74, down: 1, dormant: 0, excluded: 0, unused: 0: Ayounsi https://phabricator.wikimedia.org/T221259 - The acknowledgement expires at: 2020-05-08 14:15:20. https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:16:15] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=thanos_sidecar site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:16:28] Operations, netops: eqord - ulsfo Telia link down - IC-313592 - https://phabricator.wikimedia.org/T221259 (ayounsi) ACKed for 6 more hours the time Telia fixes it.
[08:18:01] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:18:43] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 76, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:19:19] (CR) Dzahn: [C: +2] "this is for cloud VPS systems to have something to use instead of simplelamp so we can get rid of mysql and apache puppet modules" [puppet] - https://gerrit.wikimedia.org/r/592653 (https://phabricator.wikimedia.org/T162070) (owner: Dzahn)
[08:20:23] !log rolling restart of ats-tls on esams - T249335
[08:20:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:20:28] T249335: Memory leak on ats-tls 8.0.6 - https://phabricator.wikimedia.org/T249335
[08:23:53] PROBLEM - PHP opcache health on mwdebug1001 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[08:24:19] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 74, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:25:41] RECOVERY - PHP opcache health on mwdebug1001 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[08:26:07] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 76, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:29:37] PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs.
2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:36:47] (CR) ArielGlenn: "What mechanism is in place to deal with failures, i.e. when the source directory is not present because the files were not generated, for " [puppet] - https://gerrit.wikimedia.org/r/594773 (https://phabricator.wikimedia.org/T251858) (owner: Mforns)
[08:40:21] (PS1) Filippo Giunchedi: thanos: fix query class arguments and path [puppet] - https://gerrit.wikimedia.org/r/595138 (https://phabricator.wikimedia.org/T233956)
[08:49:59] (PS4) Jcrespo: mariadb: Enable read_only monitoring on misc hosts [puppet] - https://gerrit.wikimedia.org/r/594905 (https://phabricator.wikimedia.org/T172489)
[08:52:17] (CR) Mforns: "@ArielGlenn" [puppet] - https://gerrit.wikimedia.org/r/594773 (https://phabricator.wikimedia.org/T251858) (owner: Mforns)
[08:59:03] (CR) Jcrespo: [C: +2] mariadb: Enable read_only monitoring on misc hosts [puppet] - https://gerrit.wikimedia.org/r/594905 (https://phabricator.wikimedia.org/T172489) (owner: Jcrespo)
[08:59:50] (PS1) Busecolak: add overlord to allowed druid daemon names [software/druid_exporter] - https://gerrit.wikimedia.org/r/595140
[09:00:02] Operations, DBA, Patch-For-Review: Cleanup or remove mysql puppet module; repurpose mariadb module to cover misc use cases - https://phabricator.wikimedia.org/T162070 (Dzahn) I have sent email to all administrators of the projects using it and told them I created a new role simplelamp2 to replace sim...
[09:02:15] (PS1) Zoranzoki21: Drop scowiki mainpage special casing [mediawiki-config] - https://gerrit.wikimedia.org/r/595141 (https://phabricator.wikimedia.org/T252048)
[09:03:30] (PS16) Dzahn: puppetmaster/configmaster: convert from apache to httpd module [puppet] - https://gerrit.wikimedia.org/r/451821
[09:04:32] (CR) jerkins-bot: [V: -1] puppetmaster/configmaster: convert from apache to httpd module [puppet] - https://gerrit.wikimedia.org/r/451821 (owner: Dzahn)
[09:07:05] (CR) Filippo Giunchedi: [C: +2] thanos: fix query class arguments and path [puppet] - https://gerrit.wikimedia.org/r/595138 (https://phabricator.wikimedia.org/T233956) (owner: Filippo Giunchedi)
[09:08:01] (PS1) Zoranzoki21: Drop itwiki mainpage special casing [mediawiki-config] - https://gerrit.wikimedia.org/r/595142 (https://phabricator.wikimedia.org/T252065)
[09:11:02] Operations, ops-codfw, serviceops: (Need by: TBD) rack/setup/install kubernetes20[07-14].codfw.wmnet and kubestage200[1-2].codfw.wmnet.
- https://phabricator.wikimedia.org/T252185 (akosiaris)
[09:18:38] (PS1) Jcrespo: mariadb: Add read_only monitoring to other misc dbs: tendril, phab, event [puppet] - https://gerrit.wikimedia.org/r/595143 (https://phabricator.wikimedia.org/T172489)
[09:24:57] (CR) Giuseppe Lavagetto: [C: +1] redis::instance: Deduplicate base_settings [puppet] - https://gerrit.wikimedia.org/r/595064 (owner: Alexandros Kosiaris)
[09:25:05] (PS17) Dzahn: puppetmaster/configmaster: convert from apache to httpd module [puppet] - https://gerrit.wikimedia.org/r/451821
[09:26:08] (CR) jerkins-bot: [V: -1] puppetmaster/configmaster: convert from apache to httpd module [puppet] - https://gerrit.wikimedia.org/r/451821 (owner: Dzahn)
[09:26:33] (PS1) JMeybohm: wikifeeds: enable TLS with chart defaults [deployment-charts] - https://gerrit.wikimedia.org/r/595144 (https://phabricator.wikimedia.org/T235411)
[09:26:49] (CR) Arturo Borrero Gonzalez: [C: +1] "thanks!" [puppet] - https://gerrit.wikimedia.org/r/595033 (https://phabricator.wikimedia.org/T251297) (owner: Bstorm)
[09:27:44] RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:29:18] (CR) Dzahn: "compiler compiles it now and has no complaint..yet jerkins-bot still talks about not finding resource Service[apache2] which should still " [puppet] - https://gerrit.wikimedia.org/r/451821 (owner: Dzahn)
[09:30:04] (CR) JMeybohm: "> Patch Set 1: Code-Review+1" [deployment-charts] - https://gerrit.wikimedia.org/r/594922 (owner: JMeybohm)
[09:30:21] (PS2) JMeybohm: _tls_helpers: Use defaults provided in docker image [deployment-charts] - https://gerrit.wikimedia.org/r/594922
[09:31:52] (PS3) JMeybohm: _tls_helpers: Use defaults provided in docker image [deployment-charts] - https://gerrit.wikimedia.org/r/594922 (https://phabricator.wikimedia.org/T235411)
[09:32:43] (CR)
Jcrespo: [C: +2] mariadb: Add read_only monitoring to other misc dbs: tendril, phab, event [puppet] - https://gerrit.wikimedia.org/r/595143 (https://phabricator.wikimedia.org/T172489) (owner: Jcrespo)
[09:33:02] (CR) JMeybohm: [C: +2] _tls_helpers: Use defaults provided in docker image [deployment-charts] - https://gerrit.wikimedia.org/r/594922 (https://phabricator.wikimedia.org/T235411) (owner: JMeybohm)
[09:33:14] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 74, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:33:19] (Merged) jenkins-bot: _tls_helpers: Use defaults provided in docker image [deployment-charts] - https://gerrit.wikimedia.org/r/594922 (https://phabricator.wikimedia.org/T235411) (owner: JMeybohm)
[09:34:08] (PS18) Dzahn: puppetmaster/configmaster: convert from apache to httpd module [puppet] - https://gerrit.wikimedia.org/r/451821
[09:35:08] (CR) jerkins-bot: [V: -1] puppetmaster/configmaster: convert from apache to httpd module [puppet] - https://gerrit.wikimedia.org/r/451821 (owner: Dzahn)
[09:39:40] (CR) Giuseppe Lavagetto: [C: +1] redis: Allow override of instance config [puppet] - https://gerrit.wikimedia.org/r/595065 (owner: Alexandros Kosiaris)
[09:39:54] (CR) Elukey: [V: +2 C: +2] add overlord to allowed druid daemon names [software/druid_exporter] - https://gerrit.wikimedia.org/r/595140 (owner: Busecolak)
[09:42:56] (PS19) Dzahn: puppetmaster/configmaster: convert from apache to httpd module [puppet] - https://gerrit.wikimedia.org/r/451821
[09:45:12] (PS1) Jcrespo: mariadb: Add replication monitoring to misc hosts [puppet] - https://gerrit.wikimedia.org/r/595145 (https://phabricator.wikimedia.org/T237927)
[09:45:31] !log oblivian@puppetmaster1001 conftool action : set/ttl=10; selector: dnsdisc=eventgate-analytics.*
[09:45:33] Logged the
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:45:47] (PS2) Jcrespo: mariadb: Add replication monitoring to misc hosts [puppet] - https://gerrit.wikimedia.org/r/595145 (https://phabricator.wikimedia.org/T237927)
[09:48:25] Operations, LDAP-Access-Requests, SRE-Access-Requests, Discovery-Search (Current work), Patch-For-Review: SRE Onboarding - Ryan Kemper, Search Platform team - https://phabricator.wikimedia.org/T251572 (Dzahn) @RKemper Your email address was ryankemper@ in the mail alias for root@ and wdqs-adm...
[09:51:51] !log akosiaris@cumin1001 conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=eventgate-analytics
[09:51:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:51:59] (CR) Dzahn: [C: +1] "https://puppet-compiler.wmflabs.org/compiler1002/22415/" [puppet] - https://gerrit.wikimedia.org/r/451821 (owner: Dzahn)
[09:52:11] <_joe_> akosiaris: set pooled=false :)
[09:52:17] !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=eventgate-analytics
[09:52:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:52:29] !log depool eqiad eventgate-analytics for a test involving reinitializing the eqiad kubernetes cluster
[09:52:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:52:31] _joe_: done
[09:52:35] let's see...
[09:53:43] <_joe_> akosiaris: as expected the p99 of calls to eventgate-analytics goes up
[09:53:46] <_joe_> https://grafana.wikimedia.org/d/VTCkm29Wz/envoy-telemetry?panelId=6&fullscreen&orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-origin=appserver&var-destination=eventgate-analytics&from=now-15m&to=now
[09:54:10] <_joe_> but it's not exploding for now
[09:54:40] !log disabling puppet on puppetmasters temporarily to switch them carefully to use httpd module and not apache module which we want to get rid of
[09:54:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:54:41] <_joe_> this is p99
[09:54:48] ouch
[09:54:53] <_joe_> mutante: on a friday?
[09:54:58] <_joe_> akosiaris: that's expected
[09:55:01] 10 to 250?
[09:55:02] Operations, netops: eqord - ulsfo Telia link down - IC-313592 - https://phabricator.wikimedia.org/T221259 (ayounsi) After ticket 01157098 was resolved, the link didn't come back up. Ticket 01157707 was opened. Telia setup a loop on the Chicago side towards SF, which brought the SF interface up, but the C...
[09:55:08] <_joe_> and it's not creating huge issues right now
[09:56:54] _joe_: note at https://grafana.wikimedia.org/d/ePFPOkqiz/eventgate?panelId=28&fullscreen&orgId=1&refresh=1m&var-dc=codfw%20prometheus%2Fk8s&var-service=eventgate-analytics&var-site=codfw&var-ops_datasource=codfw%20prometheus%2Fops&var-kafka_topic=All&var-kafka_broker=All&var-kafka_producer_type=All&from=now-30m&to=now how the envoy throttled cpu time becomes zero
[09:57:03] (CR) Jcrespo: "https://puppet-compiler.wmflabs.org/compiler1003/22417/" [puppet] - https://gerrit.wikimedia.org/r/595145 (https://phabricator.wikimedia.org/T237927) (owner: Jcrespo)
[09:57:06] (CR) Marostegui: [C: +1] "+1 per https://puppet-compiler.wmflabs.org/compiler1003/22417/" [puppet] - https://gerrit.wikimedia.org/r/595145 (https://phabricator.wikimedia.org/T237927) (owner: Jcrespo)
[09:57:08] damn scheduler artifact...
[09:57:19] <_joe_> yeah we need to upgrade the k8s workers
[09:57:26] we can use 4.19
[09:57:31] it has the patched
[09:57:35] it has the patches*
[09:57:48] or upgrade to buster, which is a bit harder
[09:57:56] _joe_: after talking with John, compiling it and agreeing to first run only on 2003 which we could take out if any issues. but i can wait if you'd prefer it
[09:58:28] <_joe_> mutante: no no if you are confident, definitely go on
[09:58:54] alright
[09:58:56] <_joe_> I was surprised by the courage - a puppetmaster debugging session is not what I'd want to spend the weekend on :P
[09:59:34] (CR) Jcrespo: [C: +2] mariadb: Add replication monitoring to misc hosts [puppet] - https://gerrit.wikimedia.org/r/595145 (https://phabricator.wikimedia.org/T237927) (owner: Jcrespo)
[09:59:34] <_joe_> akosiaris: i'd call the experiment a success
[09:59:45] <_joe_> as in - i can still browse the wikis just fine
[09:59:49] https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&from=now-30m&to=now says that everything looks fine
[09:59:58] so, I am pretty happy about that test
[10:00:32] <_joe_> envoy ftw!
[10:00:37] _joe_: let's leave it like that for 1h or so?
[10:01:05] that should simulate a bit better what we want to do, and also during high EU usage [10:02:03] <_joe_> yes [10:02:49] (CR) Alexandros Kosiaris: [C: -1] wikifeeds: enable TLS with chart defaults (3 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/595144 (https://phabricator.wikimedia.org/T235411) (owner: JMeybohm) [10:02:56] (PS1) Zoranzoki21: Add tw-photomedia.de in wgCopyUploadsDomains [mediawiki-config] - https://gerrit.wikimedia.org/r/595146 (https://phabricator.wikimedia.org/T252141) [10:03:39] ACKNOWLEDGEMENT - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 74, down: 1, dormant: 0, excluded: 0, unused: 0: Ayounsi T221259 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:06:33] <_joe_> so p99 is stable at 250 ms, why is that? [10:06:47] (PS20) Dzahn: puppetmaster/configmaster: convert from apache to httpd module [puppet] - https://gerrit.wikimedia.org/r/451821 [10:06:49] <_joe_> it's the cost of doing TLS whenever there is some packet loss cross-dc?
[10:08:55] <_joe_> RTT is stable at 36ms between a random appserver and eventgate-analytics, and given they're talking TLS 1.2 [10:11:24] Ignore that page [10:13:37] <_joe_> akosiaris: so for a simple TLS connection I get ~ 125-150 ms from curl [10:13:50] <_joe_> we still have 90 unaccounted ms there [10:15:04] (CR) Dzahn: [C: +2] puppetmaster/configmaster: convert from apache to httpd module [puppet] - https://gerrit.wikimedia.org/r/451821 (owner: Dzahn) [10:15:41] does not submit yet because disabling puppet on all masters did not actually work it seems, heh [10:19:09] (PS2) JMeybohm: wikifeeds: enable TLS with chart defaults [deployment-charts] - https://gerrit.wikimedia.org/r/595144 (https://phabricator.wikimedia.org/T235411) [10:19:23] did now and tests on master 2003 [10:20:23] only change at all on puppet run is LogFormat [10:22:14] Operations, ops-eqord, netops: eqord - ulsfo Telia link down - IC-313592 - https://phabricator.wikimedia.org/T221259 (ayounsi) a:ayounsi→None [10:22:30] <_joe_> I think we can live with that :P [10:24:05] _joe_: it's weird indeed... packet loss perhaps? [10:24:53] <_joe_> possibly!
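[editor's note] The latency reasoning above (36 ms RTT, TLS 1.2, a 250 ms p99) can be sketched as a rough round-trip budget. This is only an illustration under stated assumptions, not something run in the channel: it assumes a cold connection, a full TLS 1.2 handshake without session resumption, a small request and response, no packet loss, and negligible crypto and server time. The function name `cold_connection_ms` and the constants are hypothetical; only the 36 ms and 250 ms figures come from the log.

```python
# Rough round-trip budget for one cold HTTPS request across datacenters.
# Figures quoted in the log: RTT of 36 ms, observed p99 of 250 ms.
RTT_MS = 36.0
P99_MS = 250.0

# Assumptions (not from the log): round trips needed by each phase.
TCP_HANDSHAKE_RTTS = 1         # SYN / SYN-ACK before any data flows
TLS12_FULL_HANDSHAKE_RTTS = 2  # full TLS 1.2 handshake, no resumption
REQUEST_RESPONSE_RTTS = 1      # small request, small response


def cold_connection_ms(rtt_ms: float) -> dict:
    """Return the cumulative time at each phase of a cold connection."""
    connect = TCP_HANDSHAKE_RTTS * rtt_ms
    appconnect = connect + TLS12_FULL_HANDSHAKE_RTTS * rtt_ms
    total = appconnect + REQUEST_RESPONSE_RTTS * rtt_ms
    return {
        "connect_ms": connect,        # TCP established
        "appconnect_ms": appconnect,  # TLS established
        "total_ms": total,            # response received
    }


budget = cold_connection_ms(RTT_MS)
unexplained_ms = P99_MS - budget["total_ms"]

for phase, value in budget.items():
    print(f"{phase}: {value:.0f}")
print(f"unexplained_ms at p99: {unexplained_ms:.0f}")
```

Under these assumptions the handshake alone lands a little below the 125-150 ms measured with curl, and roughly 100 ms of the p99 stays unexplained by round trips alone, which is in the same ballpark as the "90 unaccounted ms" mentioned above. Requests on persistent connections skip both handshakes and would pay only the final round trip.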
[10:25:02] <_joe_> XioNoX: ^^ would love your insights :) [10:25:30] <_joe_> to be clear, I timed a curl from mw1331 to https://eventgate-analytics.discovery.wmnet:4592/?spec [10:25:48] looking [10:26:08] missed opportunity to use mw1337 btw [10:26:44] <_joe_> ahah [10:26:48] <_joe_> I never thought about it [10:26:57] <_joe_> so my naive test is done like this [10:27:00] <_joe_> curl -w "%{time_appconnect}s\n" -o /dev/null -s https://eventgate-analytics.discovery.wmnet:4592/?spec [10:27:10] <_joe_> that name now resolves to codfw [10:27:29] <_joe_> this way I get usually a time between 125 and 150 ms [10:27:41] yeah, mtr shows ~36.1ms latency and no packet loss [10:27:43] <_joe_> akosiaris: ohhh I know what it is [10:27:52] <_joe_> we're sending sometimes huge amounts of data [10:28:06] <_joe_> like megabytes of data IIRC [10:28:33] <_joe_> ofc in a local dc that's faster than across datacenters, but I don't know if it can account for 90-100ms [10:29:04] <_joe_> remember this is a p99, so it's influenced by the long tail events [10:29:47] <_joe_> in fact, p50 is 41 ms [10:29:53] <_joe_> because persistent connections [10:29:56] the 250ms is for the whole session? [10:30:16] <_joe_> XioNoX: request / response + (I think) tls negotiation [10:31:29] <_joe_> p75 is 50ms, again clearly p99 is strongly influenced by tail events, as expected [10:31:33] so yeah request/response size would impact that [10:32:54] <_joe_> it's amazing - p75 is 50 ms, p80 is 100ms! 
I think it tells us it probably depends on either the size distribution of events, or a specific eventgate-analytics worker misbehaving [10:33:58] hmm [10:34:18] with p80 at 100% more than p75 it's quite possible you are right about the request size [10:34:27] interesting [10:36:29] <_joe_> or, if it's 5 pods, one is responding slower than the others [10:37:21] _joe_: so, average packet size is 700bytes [10:37:40] https://w.wiki/QGm [10:37:49] so, nowhere close to the maxes [10:38:17] to the max of 1500bytes, this adds credence to your hypothesis [10:39:41] (PS1) Jcrespo: monitoring: remove usages of 'dba' contact group [puppet] - https://gerrit.wikimedia.org/r/595149 (https://phabricator.wikimedia.org/T237927) [10:40:06] but interestingly, max() seems to be around there too... hmm [10:41:03] <_joe_> somehow eventgate is not registering its telemetry from envoy in k8s [10:41:18] <_joe_> that's disappointing, we could test my idea if it did [10:41:35] <_joe_> probably it's still using an old version of the envoy code that didn't do telemetry correctly [10:42:09] (PS1) Jbond: apereo_cas: support staging environment [puppet] - https://gerrit.wikimedia.org/r/595150 [10:44:29] so, with an avg packet size reaching eventgate-analytics of 660 and a stddev of 127 and a max of ~700bytes I am starting to think less that this is the reason [10:46:47] (CR) Jcrespo: "This should stop the sms of non-critical alerts."
[puppet] - https://gerrit.wikimedia.org/r/595149 (https://phabricator.wikimedia.org/T237927) (owner: Jcrespo) [10:48:08] (PS2) Jbond: apereo_cas: support staging environment [puppet] - https://gerrit.wikimedia.org/r/595150 [10:49:58] (CR) Marostegui: monitoring: remove usages of 'dba' contact group (1 comment) [puppet] - https://gerrit.wikimedia.org/r/595149 (https://phabricator.wikimedia.org/T237927) (owner: Jcrespo) [10:50:35] _joe_: yeah, so p50, p75, p80 and p99 quantiles of packet sizes reaching eventgate-analytics are in the range of 670-700 with very small diffs between them. I doubt it's request sizes [10:52:40] (PS3) Jbond: apereo_cas: support staging environment [puppet] - https://gerrit.wikimedia.org/r/595150 [10:53:20] however, container_network_receive_packets_dropped_total reports a pretty constant 0.7pps being dropped [10:53:38] could be unrelated ofc [11:01:46] (CR) Hoo man: [C: +1] monitoring: remove usages of 'dba' contact group [puppet] - https://gerrit.wikimedia.org/r/595149 (https://phabricator.wikimedia.org/T237927) (owner: Jcrespo) [11:04:14] <_joe_> akosiaris: should we end the test? [11:04:45] yeah I was about to ask [11:04:55] look at the latency graphs btw at https://grafana.wikimedia.org/d/ePFPOkqiz/eventgate?orgId=1&refresh=1m&var-dc=codfw%20prometheus%2Fk8s&var-service=eventgate-analytics&var-site=codfw&var-ops_datasource=codfw%20prometheus%2Fops&var-kafka_topic=All&var-kafka_broker=All&var-kafka_producer_type=All&from=now-1h&to=now [11:05:00] (PS4) Jbond: apereo_cas: support staging environment [puppet] - https://gerrit.wikimedia.org/r/595150 [11:05:08] eventgate is reporting extremely low numbers... [11:05:14] something's wrong there [11:05:21] μs ?
that can't be [11:05:30] <_joe_> yeah that's wrong, it's ms [11:06:00] <_joe_> the numbers in the htts by source part are correct though [11:06:24] <_joe_> it's the numbers locally, so the latency envoy sees calling its backend, which is the expected 10ms [11:07:08] _joe_: all 7 prod puppetmasters are switched. only difference is logrotate/LogFormat as expected from apache->httpd switch :) [11:07:20] <_joe_> mutante: \o/ [11:08:10] anyway, rolling back the depooling, ending the test [11:08:21] !log akosiaris@cumin1001 conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=eventgate-analytics [11:08:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:23] now checking if we can finally delete the apache module [11:08:42] !log repool eqiad eventgate-analytics. Test concluded [11:08:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:07] <_joe_> latency going down fast :) [11:10:54] Operations, ops-eqiad, Core Platform Team Workboards (Clinic Duty Team): (Need by: TBD) rack/setup/install restbase1028, restbase1029, restbase1030 - https://phabricator.wikimedia.org/T241784 (ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: `... [11:15:49] (PS5) Jbond: apereo_cas: support staging environment [puppet] - https://gerrit.wikimedia.org/r/595150 [11:16:30] Operations, ops-eqiad, Analytics, DC-Ops, Patch-For-Review: Degraded RAID on analytics1055 - https://phabricator.wikimedia.org/T252070 (Cmjohnson) This server is out of warranty, @wiki_willy we can order a new 4TB disk.
[11:21:35] 10Operations: delete the puppet module "apache" - https://phabricator.wikimedia.org/T252190 (10Dzahn) [11:22:07] 10Operations: delete the puppet module "apache" - https://phabricator.wikimedia.org/T252190 (10Dzahn) p:05Triage→03Medium [11:22:30] 10Operations, 10serviceops: delete the puppet module "apache" - https://phabricator.wikimedia.org/T252190 (10Dzahn) [11:23:18] 10Operations, 10serviceops: delete the puppet module "apache" - https://phabricator.wikimedia.org/T252190 (10Dzahn) [11:23:34] (03PS1) 10Jcrespo: mariadb: Default monitor disk paging to false [puppet] - 10https://gerrit.wikimedia.org/r/595153 (https://phabricator.wikimedia.org/T172492) [11:24:09] 10Operations, 10ops-eqiad, 10Core Platform Team Workboards (Clinic Duty Team): (Need by: TBD) rack/setup/install restbase1028, restbase1029, restbase1030 - https://phabricator.wikimedia.org/T241784 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['restbase1028.eqiad.wmnet'] ` Of which those **FAIL... [11:25:35] (03CR) 10Marostegui: "I am not sure about this change, while I agree it is not a critical service as core, what if we see a sudden increase on disk space (ie: a" [puppet] - 10https://gerrit.wikimedia.org/r/595153 (https://phabricator.wikimedia.org/T172492) (owner: 10Jcrespo) [11:25:49] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime [11:25:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:03] (03PS6) 10Jbond: apereo_cas: support staging environment [puppet] - 10https://gerrit.wikimedia.org/r/595150 [11:28:10] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [11:28:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:20] (03CR) 10Jcrespo: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/595153 (https://phabricator.wikimedia.org/T172492) (owner: 10Jcrespo) [11:30:26] (03PS1) 10Dzahn: simplelap: add support for buster/PHP 7.3 [puppet] - 
10https://gerrit.wikimedia.org/r/595154 [11:31:10] (03CR) 10Alexandros Kosiaris: [C: 03+2] redis::instance: Deduplicate base_settings [puppet] - 10https://gerrit.wikimedia.org/r/595064 (owner: 10Alexandros Kosiaris) [11:31:12] (03CR) 10jerkins-bot: [V: 04-1] simplelap: add support for buster/PHP 7.3 [puppet] - 10https://gerrit.wikimedia.org/r/595154 (owner: 10Dzahn) [11:31:21] (03CR) 10Alexandros Kosiaris: [C: 03+2] redis: Allow override of instance config [puppet] - 10https://gerrit.wikimedia.org/r/595065 (owner: 10Alexandros Kosiaris) [11:31:39] (03CR) 10Marostegui: "> > Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/595153 (https://phabricator.wikimedia.org/T172492) (owner: 10Jcrespo) [11:31:54] (03PS3) 10Alexandros Kosiaris: ores: Pass the current config to redis::misc [puppet] - 10https://gerrit.wikimedia.org/r/595066 [11:33:14] (03CR) 10Jcrespo: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/595153 (https://phabricator.wikimedia.org/T172492) (owner: 10Jcrespo) [11:33:28] (03CR) 10Jbond: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/22427/" [puppet] - 10https://gerrit.wikimedia.org/r/595150 (owner: 10Jbond) [11:33:56] (03CR) 10Alexandros Kosiaris: [C: 03+2] ores: Pass the current config to redis::misc [puppet] - 10https://gerrit.wikimedia.org/r/595066 (owner: 10Alexandros Kosiaris) [11:34:43] (03CR) 10Jcrespo: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/595153 (https://phabricator.wikimedia.org/T172492) (owner: 10Jcrespo) [11:37:38] (03CR) 10Marostegui: "> > Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/595153 (https://phabricator.wikimedia.org/T172492) (owner: 10Jcrespo) [11:39:58] (03CR) 10Jbond: [C: 03+2] pcc: refactor code [puppet] - 10https://gerrit.wikimedia.org/r/594697 (owner: 10Jbond) [11:40:04] (03CR) 10Jbond: [C: 03+2] pcc: add ability to parse commit messages for `Hosts:` lines [puppet] - 10https://gerrit.wikimedia.org/r/594708 (owner: 10Jbond) [11:41:01] (03CR) 10Jbond: [C: 03+2] pcc: 
Parse `git log` correctly for HEAD/last/latest change number. [puppet] - 10https://gerrit.wikimedia.org/r/594797 (owner: 10RLazarus) [11:41:10] (03PS2) 10Jbond: pcc: Parse `git log` correctly for HEAD/last/latest change number. [puppet] - 10https://gerrit.wikimedia.org/r/594797 (owner: 10RLazarus) [11:46:32] (03PS1) 10Alexandros Kosiaris: redis: Followup for 63addad7b79d8dff7bbf [puppet] - 10https://gerrit.wikimedia.org/r/595155 [11:46:34] (03CR) 10Alexandros Kosiaris: [C: 03+1] wikifeeds: enable TLS with chart defaults [deployment-charts] - 10https://gerrit.wikimedia.org/r/595144 (https://phabricator.wikimedia.org/T235411) (owner: 10JMeybohm) [11:46:40] 10Operations, 10Puppet, 10Release-Engineering-Team-TODO, 10puppet-compiler, 10Release-Engineering-Team (CI & Testing services): Integrate the puppet compiler in the puppet CI pipeline - https://phabricator.wikimedia.org/T166066 (10jbond) Just a quick not that i added the ability to [[ https://gerrit.wiki... [11:48:50] (03CR) 10BearND: [C: 03+1] wikifeeds: enable TLS with chart defaults [deployment-charts] - 10https://gerrit.wikimedia.org/r/595144 (https://phabricator.wikimedia.org/T235411) (owner: 10JMeybohm) [11:49:03] (03CR) 10Alexandros Kosiaris: [C: 03+2] redis: Followup for 63addad7b79d8dff7bbf [puppet] - 10https://gerrit.wikimedia.org/r/595155 (owner: 10Alexandros Kosiaris) [11:49:38] !log restarting cassandra on restbase2009 for java updates [11:49:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:13] (03PS1) 10Muehlenhoff: Make /var/log/cas writable to the tomcat systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/595156 [11:58:41] (03PS1) 10Alexandros Kosiaris: DNM: Bypass a PCC issue [puppet] - 10https://gerrit.wikimedia.org/r/595157 [11:58:43] (03PS1) 10Alexandros Kosiaris: DNM: Force pass true to redis slave [puppet] - 10https://gerrit.wikimedia.org/r/595158 [12:00:17] (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1003/22432/" [puppet] - 
10https://gerrit.wikimedia.org/r/595156 (owner: 10Muehlenhoff) [12:04:03] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/595156 (owner: 10Muehlenhoff) [12:05:03] (03PS1) 10Muehlenhoff: Reenable the ssoSessions endpoint on the staging IDP [puppet] - 10https://gerrit.wikimedia.org/r/595159 [12:08:40] (03CR) 10Muehlenhoff: [C: 03+2] Make /var/log/cas writable to the tomcat systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/595156 (owner: 10Muehlenhoff) [12:09:45] 10Operations, 10netops: Peer with SFMIX at ulsfo - https://phabricator.wikimedia.org/T251536 (10faidon) [12:09:53] 10Operations, 10netops: Peer with SFMIX at ulsfo (May 2020) - https://phabricator.wikimedia.org/T251536 (10faidon) [12:10:33] 10Operations, 10netops: Peer with SFMIX at ulsfo (May 2020) - https://phabricator.wikimedia.org/T251536 (10faidon) LoA received and cross-connect task created. [12:10:40] 10Operations, 10netops: Peer with SFMIX at ulsfo (May 2020) - https://phabricator.wikimedia.org/T251536 (10faidon) [12:10:58] 10Operations, 10netops: Peer with SFMIX at ulsfo (May 2020) - https://phabricator.wikimedia.org/T251536 (10faidon) [12:11:03] (03PS2) 10Dzahn: simplelap: add support for buster/PHP 7.3 [puppet] - 10https://gerrit.wikimedia.org/r/595154 [12:14:04] (03PS1) 10QChris: Add qchris to gerrit-root [puppet] - 10https://gerrit.wikimedia.org/r/595162 (https://phabricator.wikimedia.org/T252194) [12:14:44] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Add qchris to gerrit-root to allow testing gerrit-3.1 deployment and eventually deploy it - https://phabricator.wikimedia.org/T252194 (10Dzahn) [12:14:45] (03PS3) 10Dzahn: simplelap: add support for buster/PHP 7.3 [puppet] - 10https://gerrit.wikimedia.org/r/595154 [12:15:45] (03CR) 10Dzahn: [C: 03+1] "+1 qchris will work on upgrading gerrit. this role will give access to gerrit1002 (temp. 
setup for testing the upgrade) as well as the act" [puppet] - 10https://gerrit.wikimedia.org/r/595162 (https://phabricator.wikimedia.org/T252194) (owner: 10QChris) [12:18:27] (03CR) 10Muehlenhoff: simplelap: add support for buster/PHP 7.3 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/595154 (owner: 10Dzahn) [12:20:05] (03PS2) 10Alexandros Kosiaris: DNM: Bypass a PCC issue [puppet] - 10https://gerrit.wikimedia.org/r/595157 [12:20:07] (03PS2) 10Alexandros Kosiaris: redis: Restore $slaveof parameter [puppet] - 10https://gerrit.wikimedia.org/r/595158 [12:20:09] (03PS1) 10Alexandros Kosiaris: Revert "redis: Followup for 63addad7b79d8dff7bbf" [puppet] - 10https://gerrit.wikimedia.org/r/595163 [12:22:00] (03CR) 10Alexandros Kosiaris: [C: 03+2] Revert "redis: Followup for 63addad7b79d8dff7bbf" [puppet] - 10https://gerrit.wikimedia.org/r/595163 (owner: 10Alexandros Kosiaris) [12:22:21] (03PS3) 10Alexandros Kosiaris: redis: Restore $slaveof parameter [puppet] - 10https://gerrit.wikimedia.org/r/595158 [12:23:15] (03PS4) 10Alexandros Kosiaris: redis: Restore $slaveof parameter [puppet] - 10https://gerrit.wikimedia.org/r/595158 [12:23:19] (03PS1) 10Muehlenhoff: Enable base::service_auto_restart for Apache on Grafana hosts [puppet] - 10https://gerrit.wikimedia.org/r/595164 (https://phabricator.wikimedia.org/T135991) [12:23:36] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] redis: Restore $slaveof parameter [puppet] - 10https://gerrit.wikimedia.org/r/595158 (owner: 10Alexandros Kosiaris) [12:23:46] (03PS5) 10Alexandros Kosiaris: redis: Restore $slaveof parameter [puppet] - 10https://gerrit.wikimedia.org/r/595158 [12:23:49] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] redis: Restore $slaveof parameter [puppet] - 10https://gerrit.wikimedia.org/r/595158 (owner: 10Alexandros Kosiaris) [12:25:52] 10Operations, 10Gerrit, 10Patch-For-Review: gerritro user getting access denied from gerrit1002 - https://phabricator.wikimedia.org/T243800 (10Dzahn) p:05Medium→03High Prio 
High now because work on the upgrade will start again. [12:28:00] (03Abandoned) 10Alexandros Kosiaris: DNM: Bypass a PCC issue [puppet] - 10https://gerrit.wikimedia.org/r/595157 (owner: 10Alexandros Kosiaris) [12:40:43] (03PS1) 10Alexandros Kosiaris: ores: Parameterize redis ports [puppet] - 10https://gerrit.wikimedia.org/r/595167 [12:41:46] (03CR) 10jerkins-bot: [V: 04-1] ores: Parameterize redis ports [puppet] - 10https://gerrit.wikimedia.org/r/595167 (owner: 10Alexandros Kosiaris) [12:45:50] (03PS2) 10Alexandros Kosiaris: ores: Parameterize redis ports [puppet] - 10https://gerrit.wikimedia.org/r/595167 [12:49:09] !log T243106 redo experiment with REJECT, DROP iptable rules now that we have envoy in the middle [12:49:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:15] T243106: Phased rollout of sessionstore to production fleet - https://phabricator.wikimedia.org/T243106 [12:49:32] !log T243106 redo experiment with REJECT, DROP iptable rules now that we have envoy in the middle. 
Use mw1331, mw1348 [12:49:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:52:04] (03PS1) 10Muehlenhoff: Enable base::service_auto_restart for Apache on dbmonitor [puppet] - 10https://gerrit.wikimedia.org/r/595171 (https://phabricator.wikimedia.org/T135991) [12:56:45] (03PS4) 10Dzahn: simplelap: add support for buster/PHP 7.3 [puppet] - 10https://gerrit.wikimedia.org/r/595154 [12:57:51] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/595154 (owner: 10Dzahn) [13:01:31] (03PS1) 10Vgutierrez: ATS: Increase max_connections_in and max_connections_active_in globally [puppet] - 10https://gerrit.wikimedia.org/r/595172 (https://phabricator.wikimedia.org/T249335) [13:08:15] 10Operations, 10Puppet, 10Mail, 10cloud-services-team (Kanban): Stop using letsencrypt::cert::integrated - https://phabricator.wikimedia.org/T252199 (10Krenair) [13:16:30] !log T243106 undo experiment with REJECT, DROP iptable rules now that we have envoy in the middle. Use mw1331, mw1348. Experiment done successfully, no issues to the infrastructure. [13:16:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:33] T243106: Phased rollout of sessionstore to production fleet - https://phabricator.wikimedia.org/T243106 [13:18:32] (03CR) 10Filippo Giunchedi: [C: 03+1] Enable base::service_auto_restart for Apache on Grafana hosts [puppet] - 10https://gerrit.wikimedia.org/r/595164 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [13:18:53] (03CR) 10Vgutierrez: [C: 03+2] "pcc looks good: https://puppet-compiler.wmflabs.org/compiler1001/22437/" [puppet] - 10https://gerrit.wikimedia.org/r/595172 (https://phabricator.wikimedia.org/T249335) (owner: 10Vgutierrez) [13:19:52] (03CR) 10Mholloway: [C: 03+1] "Awesome! Bernd or I will bump this to +2 and get it deployed on Monday." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/595144 (https://phabricator.wikimedia.org/T235411) (owner: 10JMeybohm) [13:20:52] !log T243106 redo experiment with DROP iptable rules this time around. Use mw1331, mw1348 [13:20:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:33] !log rolling restart of ats-tls on eqiad, codfw, ulsfo and eqsin - T249335 [13:22:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:35] T249335: Memory leak on ats-tls 8.0.6 - https://phabricator.wikimedia.org/T249335 [13:29:59] (03CR) 10JMeybohm: "> Patch Set 2: Code-Review+1" [deployment-charts] - 10https://gerrit.wikimedia.org/r/595144 (https://phabricator.wikimedia.org/T235411) (owner: 10JMeybohm) [13:30:28] (03CR) 10CDanis: [C: 03+1] Enable base::service_auto_restart for Apache on Grafana hosts [puppet] - 10https://gerrit.wikimedia.org/r/595164 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [13:37:00] (03PS3) 10BearND: wikifeeds: enable TLS with chart defaults [deployment-charts] - 10https://gerrit.wikimedia.org/r/595144 (https://phabricator.wikimedia.org/T235411) (owner: 10JMeybohm) [13:40:31] PROBLEM - Rate of JVM GC Old generation-s runs - logstash1010-production-logstash-eqiad on logstash1010 is CRITICAL: 100.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1010&panelId=37 [13:46:54] (03PS7) 10Ottomata: Initial debian commit [debs/anaconda] (debian) - 10https://gerrit.wikimedia.org/r/594204 (https://phabricator.wikimedia.org/T251006) [13:48:41] (03PS8) 10Ottomata: Initial debian commit [debs/anaconda] (debian) - 10https://gerrit.wikimedia.org/r/594204 (https://phabricator.wikimedia.org/T251006) [13:49:07] (03CR) 10Dzahn: [C: 03+2] simplelap: add support for buster/PHP 7.3 [puppet] - 
10https://gerrit.wikimedia.org/r/595154 (owner: 10Dzahn) [13:49:26] (03CR) 10Ottomata: Initial debian commit (032 comments) [debs/anaconda] (debian) - 10https://gerrit.wikimedia.org/r/594204 (https://phabricator.wikimedia.org/T251006) (owner: 10Ottomata) [13:50:45] (03CR) 10Jcrespo: [C: 03+1] Enable base::service_auto_restart for Apache on dbmonitor [puppet] - 10https://gerrit.wikimedia.org/r/595171 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [13:57:12] (03PS1) 10Apakhomov: Resolved merge conflicts in several files [deployment-charts] - 10https://gerrit.wikimedia.org/r/595177 [13:57:19] (03CR) 10Welcome, new contributor!: "Thank you for making your first contribution to Wikimedia! :) To learn how to get your code changes reviewed faster and more likely to get" [deployment-charts] - 10https://gerrit.wikimedia.org/r/595177 (owner: 10Apakhomov) [13:58:57] (03CR) 10Jdlrobson: [C: 03+1] "yay!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595142 (https://phabricator.wikimedia.org/T252065) (owner: 10Zoranzoki21) [14:00:00] (03PS2) 10Jcrespo: mariadb: Default monitor disk paging to false [puppet] - 10https://gerrit.wikimedia.org/r/595153 (https://phabricator.wikimedia.org/T172492) [14:05:53] !log T243106 undo experiment with DROP iptable rules this time around. Use mw1331, mw1348 [14:05:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:58] T243106: Phased rollout of sessionstore to production fleet - https://phabricator.wikimedia.org/T243106 [14:07:54] (03CR) 10Jdlrobson: [C: 03+1] "yay! this makes me happy!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/595141 (https://phabricator.wikimedia.org/T252048) (owner: 10Zoranzoki21) [14:09:23] (03CR) 10Ottomata: [C: 03+2] Update turnilo pageview config with new dimensions [puppet] - 10https://gerrit.wikimedia.org/r/594472 (https://phabricator.wikimedia.org/T243090) (owner: 10Joal) [14:15:12] RECOVERY - Rate of JVM GC Old generation-s runs - logstash1010-production-logstash-eqiad on logstash1010 is OK: (C)100 gt (W)80 gt 79.32 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1010&panelId=37 [14:16:29] (03PS1) 10Andrew Bogott: Add 'dumps' mount to the ores-staging project [puppet] - 10https://gerrit.wikimedia.org/r/595178 (https://phabricator.wikimedia.org/T252204) [14:17:07] (03PS3) 10Jcrespo: mariadb: Default monitor disk paging to false [puppet] - 10https://gerrit.wikimedia.org/r/595153 (https://phabricator.wikimedia.org/T172492) [14:18:13] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Default monitor disk paging to false [puppet] - 10https://gerrit.wikimedia.org/r/595153 (https://phabricator.wikimedia.org/T172492) (owner: 10Jcrespo) [14:19:40] (03PS1) 10Ottomata: Increase number of mappers used for Camus mediawiki_analtyics_events [puppet] - 10https://gerrit.wikimedia.org/r/595179 (https://phabricator.wikimedia.org/T252203) [14:20:40] (03PS4) 10Jcrespo: mariadb: Default monitor disk paging to false [puppet] - 10https://gerrit.wikimedia.org/r/595153 (https://phabricator.wikimedia.org/T172492) [14:21:53] (03CR) 10Jcrespo: "Once this is deployed, we will be close to remove the profile::mariadb::monitor and profile::mariadb::monitor::dba classes and create a ne" [puppet] - 10https://gerrit.wikimedia.org/r/595153 (https://phabricator.wikimedia.org/T172492) (owner: 10Jcrespo) [14:24:38] (03CR) 10Ottomata: [C: 03+2] Increase number of 
mappers used for Camus mediawiki_analtyics_events [puppet] - 10https://gerrit.wikimedia.org/r/595179 (https://phabricator.wikimedia.org/T252203) (owner: 10Ottomata) [14:30:41] (03PS2) 10Hnowlan: changeprop: add cpjobqueue configuration switching [deployment-charts] - 10https://gerrit.wikimedia.org/r/594973 (https://phabricator.wikimedia.org/T220399) [14:32:48] (03PS8) 10Rush: WIP: peek: security team PM tooling [puppet] - 10https://gerrit.wikimedia.org/r/594993 (https://phabricator.wikimedia.org/T251784) [14:32:56] 10Operations, 10ops-codfw, 10serviceops: (Need by: TBD) rack/setup/install kubernetes20[07-14].codfw.wmnet and kubestage200[1-2].codfw.wmnet. - https://phabricator.wikimedia.org/T252185 (10Papaul) p:05Triage→03Medium [14:39:07] (03CR) 10Hnowlan: changeprop: add cpjobqueue configuration switching (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/594973 (https://phabricator.wikimedia.org/T220399) (owner: 10Hnowlan) [14:42:40] (03PS1) 10Papaul: Partman: Add thanos-fe200[1-3] [puppet] - 10https://gerrit.wikimedia.org/r/595186 (https://phabricator.wikimedia.org/T251635) [14:45:09] (03CR) 10Papaul: [C: 03+2] Partman: Add thanos-fe200[1-3] [puppet] - 10https://gerrit.wikimedia.org/r/595186 (https://phabricator.wikimedia.org/T251635) (owner: 10Papaul) [14:48:58] (03CR) 10Muehlenhoff: "Looks great, one comment inline." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/595150 (owner: 10Jbond) [14:50:39] !log otto@deploy1001 Started deploy [analytics/refinery@4a2c530]: fix for camus wrapper, deploy to an-launcher1001 only [14:50:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:42] !log otto@deploy1001 Finished deploy [analytics/refinery@4a2c530]: fix for camus wrapper, deploy to an-launcher1001 only (duration: 00m 03s) [14:50:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:08] (03CR) 10Muehlenhoff: Initial debian commit (031 comment) [debs/anaconda] (debian) - 10https://gerrit.wikimedia.org/r/594204 (https://phabricator.wikimedia.org/T251006) (owner: 10Ottomata) [15:00:40] (03PS9) 10Rush: WIP: peek: security team PM tooling [puppet] - 10https://gerrit.wikimedia.org/r/594993 (https://phabricator.wikimedia.org/T251784) [15:05:12] (03PS3) 10Nuria: Count automated traffic as bots in turnilo's homescreen [puppet] - 10https://gerrit.wikimedia.org/r/594272 (https://phabricator.wikimedia.org/T238357) [15:07:37] 10Operations, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: TDB) rack/setup/install thanos-fe200[123] - https://phabricator.wikimedia.org/T251635 (10Papaul) ` edit interfaces interface-range vlan-private1-a-codfw] member xe-4/0/22 { ... } + member xe-2/0/3; [edit interfaces interface-range d... [15:07:41] (03PS10) 10Rush: peek: security team PM tooling [puppet] - 10https://gerrit.wikimedia.org/r/594993 (https://phabricator.wikimedia.org/T251784) [15:12:56] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TDB) rack/setup/install thanos-fe200[123] - https://phabricator.wikimedia.org/T251635 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` thanos-fe2001.codfw.wmnet ` The log can be found in `/var/log/wmf-au... 
[15:13:32] 10Operations, 10vm-requests: Eqiad: 1VM request for Peek (PM service in use by Security Team) - https://phabricator.wikimedia.org/T252210 (10chasemp) [15:13:38] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TDB) rack/setup/install thanos-fe200[123] - https://phabricator.wikimedia.org/T251635 (10Papaul) [15:13:51] 10Operations, 10Security-Team, 10vm-requests: Eqiad: 1VM request for Peek (PM service in use by Security Team) - https://phabricator.wikimedia.org/T252210 (10chasemp) [15:14:37] (03PS5) 10Jcrespo: mariadb: Default monitor disk paging to false [puppet] - 10https://gerrit.wikimedia.org/r/595153 (https://phabricator.wikimedia.org/T172492) [15:14:40] (03CR) 10Ottomata: [C: 03+2] Count automated traffic as bots in turnilo's homescreen [puppet] - 10https://gerrit.wikimedia.org/r/594272 (https://phabricator.wikimedia.org/T238357) (owner: 10Nuria) [15:14:43] 10Operations, 10Security-Team, 10vm-requests: Eqiad: 1VM request for Peek (PM service in use by Security Team) - https://phabricator.wikimedia.org/T252210 (10chasemp) [15:16:03] 10Operations, 10Security-Team, 10vm-requests: Eqiad: 1VM request for Peek (PM service in use by Security Team) - https://phabricator.wikimedia.org/T252210 (10chasemp) [15:17:38] (03PS3) 10EBernhardson: [WIP] Role for SDoC WDQS [puppet] - 10https://gerrit.wikimedia.org/r/595041 [15:18:41] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Role for SDoC WDQS [puppet] - 10https://gerrit.wikimedia.org/r/595041 (owner: 10EBernhardson) [15:25:14] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TDB) rack/setup/install thanos-fe200[123] - https://phabricator.wikimedia.org/T251635 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` thanos-fe2002.codfw.wmnet ` The log can be found in `/var/log/wmf-au... 
[15:27:10] !log stopping kafka broker on kafka-jumbo1006 to investigate camus import failures - T252203 [15:27:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:17] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TDB) rack/setup/install thanos-fe200[123] - https://phabricator.wikimedia.org/T251635 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['thanos-fe2001.codfw.wmnet'] ` Of which those **FAILED**: ` ['thanos-fe2001.codfw.wmnet'] ` [15:27:23] T252203: Camus failing to import eqiad.mediawiki.(api|cirrussearch)-request from partitions leaders on kafka-jumbo1006 - https://phabricator.wikimedia.org/T252203 [15:28:56] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [15:28:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:00] (03CR) 10Cwhite: [C: 03+2] update blog.wm.o to its new home [dns] - 10https://gerrit.wikimedia.org/r/594763 (https://phabricator.wikimedia.org/T251931) (owner: 10Cwhite) [15:31:24] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [15:31:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:46] (03PS3) 10Krinkle: mtail: update varnishrls compatibility with rc35 [puppet] - 10https://gerrit.wikimedia.org/r/594316 (https://phabricator.wikimedia.org/T251466) (owner: 10Cwhite) [15:34:54] (03CR) 10Krinkle: "Hm.. is there not a way to consume this optional match in another way that produces no warning? 
I find it hard to imagine upstream does no" [puppet] - 10https://gerrit.wikimedia.org/r/594316 (https://phabricator.wikimedia.org/T251466) (owner: 10Cwhite) [15:34:59] PROBLEM - Rate of JVM GC Old generation-s runs - logstash1010-production-logstash-eqiad on logstash1010 is CRITICAL: 100.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1010&panelId=37 [15:35:13] PROBLEM - Check systemd state on kafka-jumbo1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:35:37] 10Operations, 10DNS, 10Traffic, 10Wikimedia-Blog, 10Patch-For-Review: Update DNS records for blog.wikimedia.org - https://phabricator.wikimedia.org/T251931 (10colewhite) 05Open→03Resolved DNS update deployed. Please reach out if something is amiss. 
[15:36:40] !log starting kafka broker on kafka-jumbo1006, same issue on other brokers when they are leaders of offending partitions - T252203 [15:36:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:43] T252203: Camus failing to import eqiad.mediawiki.(api|cirrussearch)-request from partitions leaders on kafka-jumbo1006 - https://phabricator.wikimedia.org/T252203 [15:37:01] RECOVERY - Check systemd state on kafka-jumbo1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:39:38] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TDB) rack/setup/install thanos-fe200[123] - https://phabricator.wikimedia.org/T251635 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['thanos-fe2002.codfw.wmnet'] ` Of which those **FAILED**: ` ['thanos-fe2002.codfw.wmnet'] ` [15:40:40] 10Operations, 10ops-eqsin, 10Traffic: cp5012 memory errors - https://phabricator.wikimedia.org/T251219 (10Vgutierrez) do we have an ETA on this one? 
:) [15:41:16] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [15:41:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:46] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [15:43:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:14] (03CR) 10Ppchelko: changeprop: add cpjobqueue configuration switching (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/594973 (https://phabricator.wikimedia.org/T220399) (owner: 10Hnowlan) [15:50:45] (03PS1) 10Papaul: Fix TYPO for thanos-fe200[1-3] [puppet] - 10https://gerrit.wikimedia.org/r/595193 (https://phabricator.wikimedia.org/T251635) [15:51:19] PROBLEM - Rate of JVM GC Old generation-s runs - logstash1010-production-logstash-eqiad on logstash1010 is CRITICAL: 101.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1010&panelId=37 [15:51:43] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:53:27] (03CR) 10Papaul: [C: 03+2] Fix TYPO for thanos-fe200[1-3] [puppet] - 10https://gerrit.wikimedia.org/r/595193 (https://phabricator.wikimedia.org/T251635) (owner: 10Papaul) [15:53:33] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:55:43] 10Operations, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: TDB) rack/setup/install thanos-fe200[123] - https://phabricator.wikimedia.org/T251635 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` thanos-fe2001.codfw.wmnet ` The log can be fou... [15:55:47] 10Operations, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: TDB) rack/setup/install thanos-fe200[123] - https://phabricator.wikimedia.org/T251635 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['thanos-fe2001.codfw.wmnet'] ` Of which those **FAILED**: ` ['thanos-fe2001.codfw.wmnet'] ` [15:56:38] 10Operations, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: TDB) rack/setup/install thanos-fe200[123] - https://phabricator.wikimedia.org/T251635 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` thanos-fe2001.codfw.wmnet ` The log can be fou... [15:58:26] (03PS2) 10Cwhite: admin: add Tobias Andersson to ldap only users [puppet] - 10https://gerrit.wikimedia.org/r/595051 (https://phabricator.wikimedia.org/T251997) [15:59:29] (03CR) 10Cwhite: [C: 03+2] admin: add Tobias Andersson to ldap only users [puppet] - 10https://gerrit.wikimedia.org/r/595051 (https://phabricator.wikimedia.org/T251997) (owner: 10Cwhite) [16:02:35] 10Operations, 10LDAP-Access-Requests, 10WMF-Legal, 10Patch-For-Review: Add Tobias Andersson to the ldap/wmde group - https://phabricator.wikimedia.org/T251997 (10colewhite) 05Open→03Resolved User added to `wmde` and `nda` groups (thanks @Dzahn for the tip). Please feel free to reopen if you encounter... 
[16:06:58] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10Discovery-Search (Current work), 10Patch-For-Review: SRE Onboarding - Ryan Kemper, Search Platform team - https://phabricator.wikimedia.org/T251572 (10RKemper) @Dzahn Oh, thanks for catching/fixing that! `rkemper@` is fine. [16:07:39] RECOVERY - Rate of JVM GC Old generation-s runs - logstash1010-production-logstash-eqiad on logstash1010 is OK: (C)100 gt (W)80 gt 79.32 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1010&panelId=37 [16:11:13] (03PS2) 10Cwhite: admin: add Nikolaos Gkountas to ldap only users [puppet] - 10https://gerrit.wikimedia.org/r/594974 (https://phabricator.wikimedia.org/T252100) [16:12:34] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [16:12:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:59] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [16:15:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:13] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TDB) rack/setup/install thanos-fe200[123] - https://phabricator.wikimedia.org/T251635 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['thanos-fe2001.codfw.wmnet'] ` and were **ALL** successful. 
[16:20:30] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10Discovery-Search (Current work), 10Patch-For-Review: SRE Onboarding - Ryan Kemper, Search Platform team - https://phabricator.wikimedia.org/T251572 (10herron) [16:21:02] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TDB) rack/setup/install thanos-fe200[123] - https://phabricator.wikimedia.org/T251635 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` thanos-fe2002.codfw.wmnet ` The log can be found in `/var/log/wmf-au... [16:21:45] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TDB) rack/setup/install thanos-fe200[123] - https://phabricator.wikimedia.org/T251635 (10Papaul) [16:30:54] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10Discovery-Search (Current work), 10Patch-For-Review: SRE Onboarding - Ryan Kemper, Search Platform team - https://phabricator.wikimedia.org/T251572 (10RKemper) -----BEGIN PGP PUBLIC KEY BLOCK----- mQINBF61iMEBEAC6+rvVR3kAESzZ6jrftpDY8ELofcI4... [16:31:10] 10Operations, 10Core Platform Team, 10Traffic, 10serviceops, and 2 others: Reduce rate of purges emitted by MediaWiki - https://phabricator.wikimedia.org/T250205 (10Krinkle) [16:36:02] (03Abandoned) 10Bearloga: Add analytics-product system user [puppet] - 10https://gerrit.wikimedia.org/r/589320 (https://phabricator.wikimedia.org/T230743) (owner: 10Bearloga) [16:36:55] 10Operations, 10ops-eqiad, 10serviceops: (Need by: TBD) rack/setup/install kubernetes10[07-14].eqiad.wmnet - https://phabricator.wikimedia.org/T241850 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` kubernetes1014.eqiad.wmnet ` The log can be fou... 
[16:37:02] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [16:37:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:22] 10Operations, 10ops-eqiad, 10serviceops: (Need by: TBD) rack/setup/install kubernetes10[07-14].eqiad.wmnet - https://phabricator.wikimedia.org/T241850 (10Cmjohnson) The cable was bad and was not getting a link light, swapped the network cable and imaging 1014 now [16:37:27] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TDB) rack/setup/install thanos-fe200[123] - https://phabricator.wikimedia.org/T251635 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` thanos-fe2003.codfw.wmnet ` The log can be found in `/var/log/wmf-au... [16:39:33] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [16:39:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:27] 10Operations, 10ops-eqiad, 10DC-Ops: Netbox report accounting icinga alert - https://phabricator.wikimedia.org/T250053 (10Cmjohnson) @wiki_willy updated cable id numbers...please verify and resolve this [16:43:42] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TDB) rack/setup/install thanos-fe200[123] - https://phabricator.wikimedia.org/T251635 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['thanos-fe2002.codfw.wmnet'] ` and were **ALL** successful. [16:45:37] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TDB) rack/setup/install thanos-fe200[123] - https://phabricator.wikimedia.org/T251635 (10Papaul) [16:46:20] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: TBD) rack/setup/install backup1002 + array - https://phabricator.wikimedia.org/T250816 (10Cmjohnson) @jynus I moved the DAC cable to the correct network port now. 
You should be good to go [16:48:49] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime [16:48:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:00] (03CR) 10BryanDavis: [C: 03+2] "Obviously we need to deprecate the jessie containers as soon as we can. :(" [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/595062 (https://phabricator.wikimedia.org/T197930) (owner: 10Bstorm) [16:50:26] (03Merged) 10jenkins-bot: jessie: try to make jessie containers build at least one more time [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/595062 (https://phabricator.wikimedia.org/T197930) (owner: 10Bstorm) [16:51:20] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [16:51:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:27] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [16:53:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:07] 10Operations, 10DNS, 10Traffic, 10Wikimedia-Blog: Update DNS records for blog.wikimedia.org - https://phabricator.wikimedia.org/T251931 (10Varnent) It appears we are all set - thank you, @colewhite! [16:55:51] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [16:55:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:57:08] 10Operations, 10ops-eqiad, 10serviceops: (Need by: TBD) rack/setup/install kubernetes10[07-14].eqiad.wmnet - https://phabricator.wikimedia.org/T241850 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['kubernetes1014.eqiad.wmnet'] ` and were **ALL** successful. 
[17:01:15] (03PS1) 10Andrew Bogott: Make cloudcontrol1005 a cloudcontrol node [puppet] - 10https://gerrit.wikimedia.org/r/595196 (https://phabricator.wikimedia.org/T247471) [17:01:26] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TDB) rack/setup/install thanos-fe200[123] - https://phabricator.wikimedia.org/T251635 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['thanos-fe2003.codfw.wmnet'] ` and were **ALL** successful. [17:02:25] (03CR) 10Andrew Bogott: [C: 03+2] Make cloudcontrol1005 a cloudcontrol node [puppet] - 10https://gerrit.wikimedia.org/r/595196 (https://phabricator.wikimedia.org/T247471) (owner: 10Andrew Bogott) [17:05:27] (03PS1) 10Andrew Bogott: Add initial hiera hosts defs for cloudcontrol1005 [puppet] - 10https://gerrit.wikimedia.org/r/595197 (https://phabricator.wikimedia.org/T252121) [17:05:38] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TDB) rack/setup/install thanos-fe200[123] - https://phabricator.wikimedia.org/T251635 (10Papaul) [17:06:48] (03CR) 10Andrew Bogott: [C: 03+2] Add initial hiera hosts defs for cloudcontrol1005 [puppet] - 10https://gerrit.wikimedia.org/r/595197 (https://phabricator.wikimedia.org/T252121) (owner: 10Andrew Bogott) [17:06:57] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TDB) rack/setup/install thanos-fe200[123] - https://phabricator.wikimedia.org/T251635 (10Papaul) 05Open→03Resolved @fgiunchedi this is ready for service. Thanks. [17:11:24] 10Operations, 10ops-eqiad: msw1-a6-eqiad flopping up and down mgmt connections on A6 - https://phabricator.wikimedia.org/T250652 (10Cmjohnson) 05Open→03Resolved updated cable number [17:14:24] 10Operations, 10ops-eqiad, 10DC-Ops: hw troubleshooting: Memory correctable errors -EDAC- for elastic1029.eqiad.wmnet - https://phabricator.wikimedia.org/T233578 (10Cmjohnson) @wiki_willy This server is in netbox as decommissioning. I didn't do that? Can you ask around and figure out the status please. I a... 
[17:17:25] 10Operations, 10ops-eqiad, 10DC-Ops: Netbox report accounting icinga alert - https://phabricator.wikimedia.org/T250053 (10wiki_willy) 05Open→03Resolved Looks good now @Cmjohnson. Resolving task [17:26:27] 10Operations, 10ops-eqiad, 10serviceops: (Need by: TBD) rack/setup/install kubernetes10[07-14].eqiad.wmnet - https://phabricator.wikimedia.org/T241850 (10Cmjohnson) [17:26:46] 10Operations, 10ops-eqiad, 10serviceops: (Need by: TBD) rack/setup/install kubernetes10[07-14].eqiad.wmnet - https://phabricator.wikimedia.org/T241850 (10Cmjohnson) 05Open→03Resolved these are ready! resolving [17:27:26] PROBLEM - Rate of JVM GC Old generation-s runs - logstash1010-production-logstash-eqiad on logstash1010 is CRITICAL: 102.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1010&panelId=37 [17:32:07] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops: (Need By: 31st May) rack/setup/install db213[6-9] and db2140 - https://phabricator.wikimedia.org/T251639 (10Papaul) [17:39:02] (03PS1) 10Ayounsi: Cables: fix test_blank_cable_label [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/595200 (https://phabricator.wikimedia.org/T250405) [17:43:59] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install thanos-be200[1-4] - https://phabricator.wikimedia.org/T251634 (10Papaul) [17:45:42] (03PS1) 10Bstorm: wikireplicas: remove MCR-obsoleted fields from the replica views [puppet] - 10https://gerrit.wikimedia.org/r/595201 (https://phabricator.wikimedia.org/T252219) [17:48:24] 10Operations, 10ops-eqsin, 10Traffic: cp5012 memory errors - https://phabricator.wikimedia.org/T251219 (10wiki_willy) a:05Cmjohnson→03RobH @Vgutierrez - my apologies, I initially mistook this as a host in eqiad, instead of eqsin, so had assigned it to the wrong person last week. 
Re-assigning now to @Rob... [17:48:50] (03CR) 10Cwhite: [C: 03+2] admin: add Nikolaos Gkountas to ldap only users [puppet] - 10https://gerrit.wikimedia.org/r/594974 (https://phabricator.wikimedia.org/T252100) (owner: 10Cwhite) [17:51:55] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to LogStash for Nikolaos Gkountas - https://phabricator.wikimedia.org/T252100 (10colewhite) 05Open→03Resolved Now a member of `wmf` ldap group which has access to Logstash. Please feel free to reopen if you encounter any related... [17:53:45] 10Operations, 10ops-eqiad, 10Analytics, 10DC-Ops, 10Patch-For-Review: Degraded RAID on analytics1055 - https://phabricator.wikimedia.org/T252070 (10wiki_willy) It looks like the 5yr server lifecycle will be ending next month. @elukey - would it be possible to decom this server instead? Thanks, Willy [17:53:56] 10Operations, 10ops-eqiad, 10Analytics, 10DC-Ops, 10Patch-For-Review: Degraded RAID on analytics1055 - https://phabricator.wikimedia.org/T252070 (10wiki_willy) a:03wiki_willy [17:56:20] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Add qchris to gerrit-root to allow testing gerrit-3.1 deployment and eventually deploy it - https://phabricator.wikimedia.org/T252194 (10colewhite) p:05Triage→03Medium a:03colewhite [17:58:09] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Add qchris to gerrit-root to allow testing gerrit-3.1 deployment and eventually deploy it - https://phabricator.wikimedia.org/T252194 (10colewhite) @thcipriani do you approve of this request? [17:59:37] !log Extend /srv by 500G on labsdb1011 T249188 [17:59:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:43] T249188: Reimage labsdb1011 to Buster and 10.4 - https://phabricator.wikimedia.org/T249188 [18:00:12] 10Operations, 10ops-codfw, 10serviceops: (Need by: TBD) rack/setup/install kubernetes20[07-14].codfw.wmnet and kubestage200[1-2].codfw.wmnet. 
- https://phabricator.wikimedia.org/T252185 (10Papaul) [18:00:14] (03PS2) 10Cwhite: Add qchris to gerrit-root [puppet] - 10https://gerrit.wikimedia.org/r/595162 (https://phabricator.wikimedia.org/T252194) (owner: 10QChris) [18:01:58] (03PS1) 10Bstorm: wikireplicas: remove the wb_terms_no_long_updated view [puppet] - 10https://gerrit.wikimedia.org/r/595203 (https://phabricator.wikimedia.org/T251598) [18:05:26] (03CR) 10CRusnov: [C: 03+1] "LGTM" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/595200 (https://phabricator.wikimedia.org/T250405) (owner: 10Ayounsi) [18:07:42] (03CR) 10Ayounsi: [C: 03+2] Cables: fix test_blank_cable_label [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/595200 (https://phabricator.wikimedia.org/T250405) (owner: 10Ayounsi) [18:07:52] 10Operations, 10ops-eqiad, 10Analytics, 10DC-Ops, 10Patch-For-Review: Degraded RAID on analytics1055 - https://phabricator.wikimedia.org/T252070 (10elukey) @wiki_willy yes no problem, we are going to refresh this node soon! 
[18:08:43] 10Operations, 10ops-eqiad, 10Analytics, 10DC-Ops, 10Patch-For-Review: Degraded RAID on analytics1055 - https://phabricator.wikimedia.org/T252070 (10wiki_willy) Thanks @elukey [18:12:14] !log reprepro copy buster-wikimedia stretch-wikimedia prometheus-openstack-exporter for T252121 [18:12:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:17] T252121: Rebuild cloudcontrol hosts with Debian buster - https://phabricator.wikimedia.org/T252121 [18:17:01] (03CR) 10Andrew Bogott: [C: 03+2] Add 'dumps' mount to the ores-staging project [puppet] - 10https://gerrit.wikimedia.org/r/595178 (https://phabricator.wikimedia.org/T252204) (owner: 10Andrew Bogott) [18:28:11] (03PS1) 10Jeena Huneidi: blubberoid: update image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/595206 (https://phabricator.wikimedia.org/T250764) [18:34:14] 10Operations, 10ops-eqiad, 10Analytics: (Need by: TBD) rack/setup/install kafka-jumbo100[789].eqiad.wmnet - https://phabricator.wikimedia.org/T244506 (10Cmjohnson) The servers have been moved to 10G racks, in order to keep 2 in row D, KJ1008/1009 are in the same rack, D7. Once we are able to get a 3rd switc... [18:34:38] 10Operations, 10ops-eqiad, 10Analytics: (Need by: TBD) rack/setup/install kafka-jumbo100[789].eqiad.wmnet - https://phabricator.wikimedia.org/T244506 (10Cmjohnson) Network switch has been updated, old entries removed and ports disabled. 
[18:37:01] (03CR) 10Thcipriani: [C: 03+2] blubberoid: update image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/595206 (https://phabricator.wikimedia.org/T250764) (owner: 10Jeena Huneidi) [18:37:20] (03Merged) 10jenkins-bot: blubberoid: update image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/595206 (https://phabricator.wikimedia.org/T250764) (owner: 10Jeena Huneidi) [18:39:24] (03PS1) 10Andrew Bogott: mysql misc: add access for cloudcontrol1005 to m5-master [puppet] - 10https://gerrit.wikimedia.org/r/595207 (https://phabricator.wikimedia.org/T252121) [18:39:55] (03PS1) 10Cmjohnson: Fixing dhcp entries for kafkajumbo1007-9 to 10G [puppet] - 10https://gerrit.wikimedia.org/r/595208 (https://phabricator.wikimedia.org/T244506) [18:40:54] (03CR) 10Cmjohnson: [C: 03+2] Fixing dhcp entries for kafkajumbo1007-9 to 10G [puppet] - 10https://gerrit.wikimedia.org/r/595208 (https://phabricator.wikimedia.org/T244506) (owner: 10Cmjohnson) [18:50:56] 10Operations, 10MediaWiki-Authentication-and-authorization, 10Security-Team, 10Traffic, 10Security: Investigate usefulness of SameSite cookies for logged-in accounts - https://phabricator.wikimedia.org/T158604 (10Krinkle) [18:51:24] (03PS1) 10Cmjohnson: Adding kafka-jumbo100[789] to site.pp insetup role [puppet] - 10https://gerrit.wikimedia.org/r/595209 (https://phabricator.wikimedia.org/T244506) [18:52:14] (03CR) 10Cmjohnson: [C: 03+2] Adding kafka-jumbo100[789] to site.pp insetup role [puppet] - 10https://gerrit.wikimedia.org/r/595209 (https://phabricator.wikimedia.org/T244506) (owner: 10Cmjohnson) [18:52:16] (03PS2) 10Andrew Bogott: mysql misc: add access for cloudcontrol1005 to m5-master [puppet] - 10https://gerrit.wikimedia.org/r/595207 (https://phabricator.wikimedia.org/T252121) [18:52:18] (03PS1) 10Andrew Bogott: misc dbs: temporarily add a hacked openestack_controllers setting [puppet] - 10https://gerrit.wikimedia.org/r/595210 (https://phabricator.wikimedia.org/T252121) [18:54:08] (03PS2) 
10Andrew Bogott: misc dbs: temporarily add a hacked openestack_controllers setting [puppet] - 10https://gerrit.wikimedia.org/r/595210 (https://phabricator.wikimedia.org/T252121) [18:54:10] (03PS3) 10Andrew Bogott: mysql misc: add access for cloudcontrol1005 to m5-master [puppet] - 10https://gerrit.wikimedia.org/r/595207 (https://phabricator.wikimedia.org/T252121) [18:56:27] 10Operations, 10ops-eqiad, 10Analytics, 10Patch-For-Review: (Need by: TBD) rack/setup/install kafka-jumbo100[789].eqiad.wmnet - https://phabricator.wikimedia.org/T244506 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` kafka-jumbo1007.eqiad.wmne... [19:02:59] 10Operations, 10ops-eqiad, 10DC-Ops: hw troubleshooting: Memory correctable errors -EDAC- for elastic1029.eqiad.wmnet - https://phabricator.wikimedia.org/T233578 (10Cmjohnson) Network port is removed from private vlan and disabled for now. [edit interfaces interface-range disabled] member ge-6/0/35 { .... [19:06:22] (03PS1) 10Papaul: DNS: Add mgmt and public DNS entries for cloudceph200[1-3]-dev [dns] - 10https://gerrit.wikimedia.org/r/595212 [19:06:45] (03CR) 10jerkins-bot: [V: 04-1] DNS: Add mgmt and public DNS entries for cloudceph200[1-3]-dev [dns] - 10https://gerrit.wikimedia.org/r/595212 (owner: 10Papaul) [19:07:13] !log jhuneidi@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'blubberoid' for release 'staging' . [19:07:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:43] 10Operations, 10ops-eqiad, 10Analytics, 10Patch-For-Review: (Need by: TBD) rack/setup/install kafka-jumbo100[789].eqiad.wmnet - https://phabricator.wikimedia.org/T244506 (10Cmjohnson) {F31808337} @elukey, it doesn't appear to be a partman thing. Attached is a picture of the console monitor during the ini... 
[19:10:55] (03PS4) 10EBernhardson: [WIP] Role for SDoC WDQS [puppet] - 10https://gerrit.wikimedia.org/r/595041 [19:10:57] (03PS1) 10EBernhardson: logstash: Set tcp input type to class $title [puppet] - 10https://gerrit.wikimedia.org/r/595215 [19:12:26] !log jhuneidi@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'blubberoid' for release 'production' . [19:12:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:23] (03PS3) 10Andrew Bogott: misc dbs: temporarily add a hacked openstack_controllers setting [puppet] - 10https://gerrit.wikimedia.org/r/595210 (https://phabricator.wikimedia.org/T252121) [19:16:14] (03CR) 10Andrew Bogott: [C: 03+2] misc dbs: temporarily add a hacked openstack_controllers setting [puppet] - 10https://gerrit.wikimedia.org/r/595210 (https://phabricator.wikimedia.org/T252121) (owner: 10Andrew Bogott) [19:16:18] !log jhuneidi@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'blubberoid' for release 'production' . 
[19:16:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:44] (03PS1) 10Andrew Bogott: Revert "misc dbs: temporarily add a hacked openstack_controllers setting" [puppet] - 10https://gerrit.wikimedia.org/r/595220 [19:25:28] (03PS2) 10Andrew Bogott: Rework "misc dbs: temporarily add a hacked openstack_controllers setting" [puppet] - 10https://gerrit.wikimedia.org/r/595220 [19:27:27] (03CR) 10Andrew Bogott: [C: 03+2] Rework "misc dbs: temporarily add a hacked openstack_controllers setting" [puppet] - 10https://gerrit.wikimedia.org/r/595220 (owner: 10Andrew Bogott) [19:35:04] 10Operations, 10ops-eqiad, 10netops: Eqiad: C6 mgmt switch down - https://phabricator.wikimedia.org/T249309 (10Jclark-ctr) [19:50:42] (03PS1) 10Andrew Bogott: Keystone: update fernet key rotation plan for 3 hosts [puppet] - 10https://gerrit.wikimedia.org/r/595221 (https://phabricator.wikimedia.org/T252121) [19:56:34] (03PS2) 10Papaul: DNS: Add mgmt and public DNS entries for cloudceph200[1-3]-dev [dns] - 10https://gerrit.wikimedia.org/r/595212 [19:56:37] 10Operations, 10ops-eqiad, 10Analytics: (Need by: TBD) rack/setup/install kafka-jumbo100[789].eqiad.wmnet - https://phabricator.wikimedia.org/T244506 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['kafka-jumbo1007.eqiad.wmnet'] ` Of which those **FAILED**: ` ['kafka-jumbo1007.eqiad.wmnet'] ` [19:56:57] (03CR) 10jerkins-bot: [V: 04-1] DNS: Add mgmt and public DNS entries for cloudceph200[1-3]-dev [dns] - 10https://gerrit.wikimedia.org/r/595212 (owner: 10Papaul) [19:57:35] 10Operations, 10ops-eqiad, 10DC-Ops: hw troubleshooting: Memory correctable errors -EDAC- for elastic1029.eqiad.wmnet - https://phabricator.wikimedia.org/T233578 (10wiki_willy) Hi @Gehel - just wanted to follow up on this one, to hopefully wrap up the task. I couldn't find too much on the current status of... 
[20:05:00] 10Operations, 10ops-eqiad, 10DC-Ops: hw troubleshooting: Memory correctable errors -EDAC- for elastic1029.eqiad.wmnet - https://phabricator.wikimedia.org/T233578 (10Gehel) @wiki_willy / @Cmjohnson : this server has been decommed as part of T239821, so yep, nothing to do here except get rid of it. Thanks! And... [20:07:02] 10Operations, 10ops-eqiad, 10Core Platform Team Workboards (Clinic Duty Team): (Need by: TBD) rack/setup/install restbase1028, restbase1029, restbase1030 - https://phabricator.wikimedia.org/T241784 (10wiki_willy) [20:07:04] 10Operations, 10ops-eqiad, 10DC-Ops: hw troubleshooting: Memory correctable errors -EDAC- for elastic1029.eqiad.wmnet - https://phabricator.wikimedia.org/T233578 (10wiki_willy) 05Open→03Resolved >>! In T233578#6119916, @Gehel wrote: > @wiki_willy / @Cmjohnson : this server has been decommed as part of T2... [20:12:48] (03CR) 10Andrew Bogott: [C: 03+2] Keystone: update fernet key rotation plan for 3 hosts [puppet] - 10https://gerrit.wikimedia.org/r/595221 (https://phabricator.wikimedia.org/T252121) (owner: 10Andrew Bogott) [20:13:04] 10Operations, 10SRE-Access-Requests: Requesting Access to sites from Google Search Console - https://phabricator.wikimedia.org/T251128 (10colewhite) 05Open→03Resolved Added read-only access to the listed domains in Google Search Console. Please feel free to reach out if you encounter any related issue. 
[20:15:24] (03PS3) 10Papaul: DNS: Add mgmt and public DNS entries for cloudceph200[1-3]-dev [dns] - 10https://gerrit.wikimedia.org/r/595212 [20:15:45] (03PS5) 10EBernhardson: Role for SDoC WDQS [puppet] - 10https://gerrit.wikimedia.org/r/595041 (https://phabricator.wikimedia.org/T237089) [20:20:47] RECOVERY - Rate of JVM GC Old generation-s runs - logstash1010-production-logstash-eqiad on logstash1010 is OK: (C)100 gt (W)80 gt 77.29 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1010&panelId=37 [20:24:54] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Request for srv/phab/phabricator/bin/bulk make-silent --id * command via SSH for moving tasks quarterly - https://phabricator.wikimedia.org/T251349 (10MBinder_WMF) Thanks for moving this along. Please let me know if I can help with anything. :) [20:30:33] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Add qchris to gerrit-root to allow testing gerrit-3.1 deployment and eventually deploy it - https://phabricator.wikimedia.org/T252194 (10thcipriani) >>! In T252194#6119591, @colewhite wrote: > @thcipriani do you approve of this request? Approved. [20:40:46] 10Operations, 10Discovery-Search, 10SDC General, 10Structured Data Engineering, and 3 others: Create CQS puppet configs by applying query_service module - https://phabricator.wikimedia.org/T237089 (10EBernhardson) Instructions for booting a new instance. Currently this requires pointing the instance at a p... 
[20:45:33] (03PS1) 10Andrew Bogott: OpenStack: move all openstack API support to cloudcontrol1005 [puppet] - 10https://gerrit.wikimedia.org/r/595227 [20:45:55] (03CR) 10jerkins-bot: [V: 04-1] OpenStack: move all openstack API support to cloudcontrol1005 [puppet] - 10https://gerrit.wikimedia.org/r/595227 (owner: 10Andrew Bogott) [20:47:40] (03PS1) 10Thcipriani: Merge tag 'v2.15.14' into wmf/stable-2.15 [software/gerrit] (wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/595228 [20:48:27] (03CR) 10jerkins-bot: [V: 04-1] Merge tag 'v2.15.14' into wmf/stable-2.15 [software/gerrit] (wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/595228 (owner: 10Thcipriani) [20:50:56] (03CR) 10Paladox: "Failure expected as long as it builds for you it's ok to merge." [software/gerrit] (wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/595228 (owner: 10Thcipriani) [20:52:46] (03PS3) 10Cwhite: admin: Add qchris to gerrit-root [puppet] - 10https://gerrit.wikimedia.org/r/595162 (https://phabricator.wikimedia.org/T252194) (owner: 10QChris) [20:53:55] (03CR) 10Cwhite: [C: 03+2] admin: Add qchris to gerrit-root [puppet] - 10https://gerrit.wikimedia.org/r/595162 (https://phabricator.wikimedia.org/T252194) (owner: 10QChris) [20:54:53] (03PS2) 10Andrew Bogott: OpenStack: move all openstack API support to cloudcontrol1005 [puppet] - 10https://gerrit.wikimedia.org/r/595227 (https://phabricator.wikimedia.org/T252121) [20:55:23] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Add qchris to gerrit-root to allow testing gerrit-3.1 deployment and eventually deploy it - https://phabricator.wikimedia.org/T252194 (10colewhite) 05Open→03Resolved Change deployed. Should be applied in the next hour or so. 
[20:55:56] (PS1) Andrew Bogott: Openstack: move API traffic to cloudcontrol1005 [dns] - https://gerrit.wikimedia.org/r/595229 (https://phabricator.wikimedia.org/T252121) [20:57:33] (CR) Thcipriani: [C: -1] Merge tag 'v2.15.14' into wmf/stable-2.15 [software/gerrit] (wmf/stable-2.15) - https://gerrit.wikimedia.org/r/595228 (owner: Thcipriani) [21:06:32] !log running preferred replica election for kafka-jumbo to get preferred leaders back after reboot of broker earlier today - T252203 [21:06:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:06:35] T252203: Camus failing to import eqiad.mediawiki.(api|cirrussearch)-request from partitions leaders on kafka-jumbo1006 - https://phabricator.wikimedia.org/T252203 [21:16:04] (CR) Bstorm: [C: +2] wikireplicas: remove the wb_terms_no_long_updated view [puppet] - https://gerrit.wikimedia.org/r/595203 (https://phabricator.wikimedia.org/T251598) (owner: Bstorm) [21:23:04] Operations, Puppet, serviceops: delete the puppet module "apache" - https://phabricator.wikimedia.org/T252190 (Peachey88) [21:33:09] !log cleaning up wb_terms_no_longer_updated view on labsdb1009 T251598 [21:33:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:33:13] T251598: Clean up wb_terms related views - https://phabricator.wikimedia.org/T251598 [21:45:27] !log cleaned up wb_terms_no_longer_updated view on labsdb1012 T251598 [21:45:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:45:30] T251598: Clean up wb_terms related views - https://phabricator.wikimedia.org/T251598 [21:45:56] !log cleaned up wb_terms_no_longer_updated view for testwikidatawiki and testcommonswiki on labsdb1010 T251598 [21:45:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:54:43] PROBLEM - IPMI Sensor Status on sodium is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical] 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [21:56:43] RECOVERY - Check systemd state on netflow2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:15:02] (CR) Krinkle: [C: -1] "Recording here that Aaron found commit()/begin() statements in Flow's code that look like they might start fatalling if this is deplo" [mediawiki-config] - https://gerrit.wikimedia.org/r/525977 (owner: Aaron Schulz)