[00:00:04] RoanKattouw, Niharika, and Urbanecm: Your horoscope predicts another unfortunate Evening SWAT(Max 6 patches) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200303T0000).
[00:00:05] Jdlrobson: A patch you scheduled for Evening SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[00:00:21] I'll do SWAT in a bit.
[00:00:33] ok, i'm logged in to wtp1025 and watching logstash for it
[00:00:46] cscott: Have you pulled config?
[00:01:01] not yet, i was about to double check with mutante that wtp1025 has been depooled
[00:01:07] o/
[00:01:23] Sorry Jdlrobson, this is taking longer than anticipated.
[00:01:53] cscott: yea, it's depooled
[00:02:08] ok, i apologize, i'm trying to be slow & careful :-p
[00:02:23] that's good
[00:02:30] i need to be root to scap pull in /srv/mediawiki from wtp1025 or not?
[00:02:48] No, should Just Work™.
[00:02:49] no
[00:02:58] In fact, doing it as root will probably break things.
[00:03:08] ok. and /srv/mediawiki isn't a git checkout, so i can't just git log to see if it worked
[00:03:15] Oh, sorry, yeah.
[00:03:34] `cat wmf-config/CommonSettings.php | grep parsoidDir`
[00:03:37] scap pull should show you some rsync output
[00:03:46] James_F: yeah, just figured that out
[00:03:51] ok, here goes nothing!
[00:04:05] On a de-pooled server this should be very dull.
[00:04:16] cscott@wtp1025:/srv/mediawiki$ scap pull
[00:04:16] 00:03:57 Copying from deploy1001.eqiad.wmnet to wtp1025.eqiad.wmnet
[00:04:16] 00:03:57 Started rsync common
[00:04:16] sudo: a password is required
[00:04:25] Huh.
[00:04:36] followed by miscellaneous errors. looks like either i don't have the required perms or my sudo setup is weird
[00:04:39] Are you not in the wikidev group?
[00:04:42] you can see it's depooled on https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=wtp1025&var-datasource=eqiad%20prometheus%2Fops&var-cluster=parsoid&from=1583192620718&to=1583193867861
[00:04:54] cscott@wtp1025:/srv/mediawiki$ groups
[00:04:54] wikidev parsoid-admin
[00:05:19] Shouldn't you be in `deployment` too?
[00:05:24] let me try the scap pull then
[00:05:28] Thanks, mutante.
[00:05:51] !log wtp1025 - scap pull
[00:05:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:05:56] finished without errors
[00:06:38] OK, and now to manually trigger a parse on wtp1025 and check it works.
[00:06:51] James_F: yeah, probably (re: groups). will fix later, i guess.
[00:06:52] Then we can sync it everywhere and repool wtp1025?
[00:07:00] i'll trigger the parse, I think i've got enough access to do that.
[00:09:00] $ curl -x localhost:80 http://en.wikipedia.org/w/rest.php/en.wikipedia.org/v3/page/html/Main%20Page/939357440
[00:09:04] gives me a 404
[00:09:20] i think it's deploy-service group, btw
[00:09:20] no worries James_F i'm here when you are ready :)
[00:09:21] i think we saw this before when we were testing on scandium.
[00:09:51] it means that we're not actually loading the parsoid extension, but we're also not trying to (or we'd get a 500 instead)
[00:10:05] mutante: I'm in `wikidev deployment deploy-service`.
[00:10:11] cscott: Well that's not great.
[00:10:23] unless my curl command isn't right
[00:11:06] actually i take it back!
[00:11:15] curl -x wtp1025:80 http://en.wikipedia.org/w/rest.php/en.wikipedia.org/v3/page/html/Main%20Page/939357440
[00:11:19] seems to work fine
[00:11:22] Oh, wrong server? :-)
[00:11:24] but localhost:80 didn't work
[00:11:49] OK, let's sync?
[00:12:03] let me just double check that logstash doesn't have anything unusual for wtp1025
[00:12:07] mutante: FWIW I'm in `wikidev deployment deploy-service`.
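Editor's sketch: the two `curl` commands in this stretch of the log request the same Parsoid REST URL and differ only in the `-x` proxy target (`localhost:80` returned a 404, `wtp1025:80` worked). The Python below is a minimal illustration of how that URL and command line are put together. The function names `parsoid_html_url` and `curl_via_host` are hypothetical helpers invented for this note, not part of any Wikimedia tooling, and the URL layout is taken only from the commands quoted in the log.

```python
from typing import List
from urllib.parse import quote


def parsoid_html_url(domain: str, title: str, revid: int) -> str:
    """Build the page/html URL in the shape used by the curl commands in the log.

    Hypothetical helper: the path layout is copied from the log, not from API docs.
    """
    # quote() percent-encodes the space in "Main Page" as %20;
    # safe="" means a "/" inside a title would be encoded as well.
    encoded_title = quote(title, safe="")
    return (
        f"http://{domain}/w/rest.php/{domain}"
        f"/v3/page/html/{encoded_title}/{revid}"
    )


def curl_via_host(proxy_host: str, url: str) -> List[str]:
    """Return the curl argv that sends the request through proxy_host on port 80."""
    return ["curl", "-x", f"{proxy_host}:80", url]


url = parsoid_html_url("en.wikipedia.org", "Main Page", 939357440)
print(" ".join(curl_via_host("wtp1025", url)))
```

Swapping `"wtp1025"` for `"localhost"` reproduces the first command; the log itself only records that the two behaved differently, not why.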
[00:12:44] nothing in logstash for wtp1025 in the past 15 minutes, so that looks good
[00:12:58] Doing now.
[00:13:04] ok, fingers crossed!
[00:13:05] !log jforrester@deploy1001 Started scap: wmf-config/CommonSettings.php T240055: Point the Parsoid cluster at the train version of Parsoid, not a special check-out
[00:13:08] !log jforrester@deploy1001 sync aborted: wmf-config/CommonSettings.php T240055: Point the Parsoid cluster at the train version of Parsoid, not a special check-out (duration: 00m 03s)
[00:13:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:13:10] T240055: Craft a deployment strategy to transition Parsoid/PHP from a faux extension to a composer library without breaking incoming requests - https://phabricator.wikimedia.org/T240055
[00:13:13] Ha, whoops.
[00:13:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:13:23] Yeah, don't do that.
[00:13:26] * James_F coughs.
[00:14:11] James_F: deploy-service is the group that has sudo for stuff like restarting parsoid and checking if restart php-fpm is needed
[00:14:14] mutante: Maybe we should add a wtp host to the canaries?
[00:14:17] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: T240055: Point the Parsoid cluster at the train version of Parsoid, not a special check-out (duration: 00m 56s)
[00:14:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:14:24] OK, the Parsoid switch is live.
[00:14:26] * James_F checks.
[00:14:40] James_F: maybe one of the new parse* servers, yea
[00:14:50] they are coming soon
[00:14:51] Seems to work for me.
[00:14:54] ok, sbailey is checking as well
[00:14:58] Excellent.
[00:15:02] mutante: I'll file a task.
[00:15:20] re-pool wtp1025 as well?
[00:15:27] afair the localhost vs hostname thing is not unique to this, cscott
[00:15:40] ok, repooling
[00:15:55] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=wtp1025.eqiad.wmnet
[00:15:57] mutante: yeah, error between my head and keyboard
[00:15:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:16:33] i should make a note in our deploy docs to remind my future self about this though
[00:16:44] James_F: great!
[00:16:46] OK, are we done?
[00:16:49] cluster load looks fine, so we're not immediately failing at least.
[00:16:52] now we should see traffic coming back to it
[00:17:11] waiting for grafana to confirm it
[00:17:17] PROBLEM - PHP7 rendering on scandium is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2436 bytes in 0.009 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[00:17:20] heh
[00:17:58] Oh, yeah, we should ACK that.
[00:18:05] Scandium is now broken, as I foretold.
[00:18:23] can we fix it?
[00:18:23] cscott: Or you could fix scandium with a mere `git clone`. :-)
[00:18:33] James_F: yeah, let me try that.
[00:18:36] mutante: Yes, but cscott wanted to do that tomorrow?
[00:18:44] OK, I'm moving on to Jdlrobson's SWATs.
[00:18:51] (PS9) Jforrester: Enable lead paragraph in user namespace on nlwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/562486 (https://phabricator.wikimedia.org/T242030) (owner: Ammarpad)
[00:18:53] sweeeeettt
[00:18:54] hmm. i am still waiting for the "network utilization" to do something again https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=wtp1025&var-datasource=eqiad%20prometheus%2Fops&var-cluster=parsoid&from=1583193536774&to=1583193867861
[00:18:57] (CR) Jforrester: [C: +2] Enable lead paragraph in user namespace on nlwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/562486 (https://phabricator.wikimedia.org/T242030) (owner: Ammarpad)
[00:18:57] PROBLEM - Apache HTTP on scandium is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2437 bytes in 0.038 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[00:19:38] i do have root on scandium at least ;)
[00:19:44] Indeed.
[00:19:45] ACKNOWLEDGEMENT - Apache HTTP on scandium is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2437 bytes in 0.038 second response time daniel_zahn . https://wikitech.wikimedia.org/wiki/Application_servers
[00:19:45] ACKNOWLEDGEMENT - PHP7 rendering on scandium is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2437 bytes in 0.037 second response time daniel_zahn . https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[00:19:50] Thanks, mutante
[00:19:52] yea, i did that the other day
[00:19:57] (Merged) jenkins-bot: Enable lead paragraph in user namespace on nlwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/562486 (https://phabricator.wikimedia.org/T242030) (owner: Ammarpad)
[00:20:04] James_F: i wonder if we should push our luck and swat the preprocessor-related config changes.
[00:20:07] upgraded to test-roots
[00:20:54] James_F: i don't know why wtp1025 isn't getting any traffic after repooling it
[00:21:18] cscott: Schedule them tomorrow. I'm busy now.
[00:21:25] mutante: Maybe it'll tick over?
[00:21:39] James_F: it looked to me like wtp1025 never stopped getting some traffic? i didn't see the drop-off i expected when i zoomed out
[00:21:51] James_F: wed. but sure.
[00:21:54] Jdlrobson: Can you test? mwdebug1001 looks the same to me
[00:22:04] cscott: Or I can just do them.
[00:22:05] PROBLEM - MariaDB Slave Lag: s3 on db2098 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1321.57 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[00:22:07] PROBLEM - MariaDB Slave Lag: s3 on db1095 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1323.66 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[00:22:12] James_F: whichever you prefer
[00:22:13] cscott: https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=wtp1025&var-datasource=eqiad%20prometheus%2Fops&var-cluster=parsoid&from=1583193536774&to=1583193688438&fullscreen&panelId=8
[00:22:17] i just keep forgetting to do them
[00:22:34] mutante: https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=wtp1025&var-datasource=eqiad%20prometheus%2Fops&var-cluster=parsoid&from=now-1h&to=now&fullscreen&panelId=8
[00:22:41] Looks OK to me?
[00:23:06] James_F: ah.. that looks much more like what i expected
[00:23:13] mutante: i'm looking at https://grafana.wikimedia.org/d/000000607/cluster-overview?orgId=1&from=1583193173928&to=1583194973928&var-datasource=eqiad%20prometheus%2Fops&var-cluster=parsoid&var-instance=wtp1025
[00:23:16] Choice of time window is important.
[00:23:24] and the cpu load looks like i expect, down then up
[00:23:31] Excellent.
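Editor's sketch: "Choice of time window is important" can be checked directly from the Grafana links pasted above, because the `from` and `to` query parameters are either epoch milliseconds or relative expressions such as `now-1h`. The snippet below is a rough illustration using only the Python standard library; `grafana_window` is a hypothetical helper name, and reading `from`/`to` this way is an assumption based on the URLs in the log rather than on Grafana documentation. For the link posted at 00:22:13 the absolute window works out to about 152 seconds, against a full hour for the `now-1h` link that "looks OK".

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import parse_qs, urlsplit

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def grafana_window(url):
    """Return (start, end, length) for absolute from/to values, else None.

    Hypothetical helper: assumes absolute values are epoch milliseconds,
    as they appear to be in the links pasted in the log.
    """
    query = parse_qs(urlsplit(url).query)
    raw_from, raw_to = query["from"][0], query["to"][0]
    if not (raw_from.isdigit() and raw_to.isdigit()):
        return None  # relative window such as from=now-1h&to=now
    # Integer milliseconds are added as timedeltas to avoid float rounding.
    start = EPOCH + timedelta(milliseconds=int(raw_from))
    end = EPOCH + timedelta(milliseconds=int(raw_to))
    return start, end, end - start


BASE = "https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=wtp1025"
narrow = BASE + "&from=1583193536774&to=1583193688438&fullscreen&panelId=8"
relative = BASE + "&from=now-1h&to=now&fullscreen&panelId=8"

print(grafana_window(narrow))
print(grafana_window(relative))
```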
[00:23:44] the network pane didn't go down to zero like i expected in this view, but it did dip then rise
[00:23:47] (PS7) Jforrester: Drop legacy main page special casing on select projects [mediawiki-config] - https://gerrit.wikimedia.org/r/575376 (https://phabricator.wikimedia.org/T32405) (owner: Jdlrobson)
[00:23:50] cscott: ack, i was looking at only wtp1025, from host overview
[00:24:32] cscott: and that's the whole cluster, though it's confusing
[00:24:33] oh, yeah, i guess i was making that mistake too.
[00:24:44] network per host looks right
[00:24:47] you have the instance selected but that graph is still cluster
[00:25:00] which is confusing
[00:25:23] James_F: where did you think a simple 'git pull' would work? /srv/mediawiki is still not a git checkout
[00:25:40] cscott: `scap pull`. Not `git pull`.
[00:26:07] James_F: ah. makes sense. done.
[00:26:27] And on an MW server a plain `scap pull` will work from any directory and do it for /srv/mediawiki, so…
[00:26:42] It's a scap2 command.
[00:26:58] None of your scap3 luxuries of gradual deployment. ;-)
[00:27:17] well, that didn't by itself fix scandium, but the scap completed w/o error at least.
[00:27:35] Oh, it should, yeah, but it doesn't test anything.
[00:27:54] scandium is part of the mediawiki-installation "dsh group"
[00:27:58] did we get scandium added to SERVERGROUP=parsoid ?
[00:28:01] Yes
[00:28:03] so scap should update it as well
[00:28:12] yes, it is now part of the parsoid SERVERGROUP
[00:28:19] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T242030 Enable lead paragraph in user namespace on nlwiki (duration: 00m 56s)
[00:28:21] (CR) Jforrester: [C: +2] Drop legacy main page special casing on select projects [mediawiki-config] - https://gerrit.wikimedia.org/r/575376 (https://phabricator.wikimedia.org/T32405) (owner: Jdlrobson)
[00:28:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:28:24] T242030: Site request: Enable lead paragraph on user namespace for Dutch Wikipedia - https://phabricator.wikimedia.org/T242030
[00:28:57] by the way, about pooling/depooling, we can do that separately for each service
[00:29:03] there are 2 services on each wtp* machine
[00:29:07] parsoid and parsoid-php
[00:29:14] (Merged) jenkins-bot: Drop legacy main page special casing on select projects [mediawiki-config] - https://gerrit.wikimedia.org/r/575376 (https://phabricator.wikimedia.org/T32405) (owner: Jdlrobson)
[00:30:02] James_F: ah, yeah, the problem is that the magic parsoid-testing directory isn't there any more
[00:30:08] James_F: but i can hack around that temporarily
[00:31:31] mutante: Which is which? Is the "parsoid" service still running? Should it? We only talk to the "parsoid-php" service now, right?
[00:31:45] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Touch and secondary sync of IS for cache-busting (duration: 00m 56s)
[00:31:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:31:53] RECOVERY - Apache HTTP on scandium is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 0.051 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[00:31:55] James_F: ok, scandium is 'fixed'
[00:32:08] cscott: Does it run the right code?
[00:32:23] RECOVERY - PHP7 rendering on scandium is OK: HTTP OK: HTTP/1.1 200 OK - 77187 bytes in 0.230 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[00:32:38] i just commented out the line which sets $parsoidDir = __DIR__ . "/../parsoid-testing"
[00:32:44] so it's running the train code like everything else
[00:33:03] James_F: need me to test? (there's not much testing to do here but I can sanity check)
[00:33:05] which won't let us run rt testing, yet, but it will keep icinga happy at least
[00:33:28] Jdlrobson: LGTM. Syncing now.
[00:33:40] cscott: Aha, that'll get over-written the next time someone scaps CS.
[00:33:46] James_F: should i turn that one line hack into a proper patch then?
[00:33:47] cscott: Which will be early tomorrow morning.
[00:34:00] Yes, or just fix scandium with a `git clone` already.
[00:34:33] the parsoid-testing directory is gone, I guess you're saying to git-clone to (re)create it?
[00:34:37] Yes.
[00:34:56] `git clone …/parsoid.git /srv/parsoid-testing`
[00:35:15] !log jforrester@deploy1001 Synchronized dblists/mobilemainpagelegacy.dblist: T32405 Drop legacy main page special casing on select projects (duration: 00m 56s)
[00:35:18] it looked like the path was going to be /srv/mediawiki/parsoid-testing
[00:35:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:35:19] T32405: [EPIC] MobileFrontend extension should stop special-casing main page - https://phabricator.wikimedia.org/T32405
[00:35:23] i think we still need a patch to wmf-config
[00:35:25] Jdlrobson: It's live, e.g. https://ak.m.wikipedia.org/wiki/Krataafa_Titiriw
[00:35:37] https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/575336/7/wmf-config/CommonSettings.php
[00:35:52] __DIR__ is /srv/mediawiki/wmf-config I believe
[00:36:25] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Touch and secondary sync of IS for cache-busting (duration: 00m 56s)
[00:36:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:36:29] It should be /srv/mediawiki shouldn't it?
[00:36:48] If not, yes, we need an extra `dirname()`.
[00:36:49] cscott@scandium:/srv/mediawiki$ find . -name CommonSettings.php
[00:36:49] ./wmf-config/CommonSettings.php
[00:37:11] I know where the file is, but PHP is sometimes magical in what it thinks the "current" dir is.
[00:37:25] yeah, that's exactly why i'm talking it through with you ;)
[00:37:25] cscott: BTW, I deployed your parser cleanup patches last week, it seems.
[00:37:25] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:37:32] James_F: oh, wonderful
[00:37:39] https://gerrit.wikimedia.org/r/q/project:operations%252Fmediawiki-config+owner:cananian%2540wikimedia.org
[00:37:47] OK, SWAT done.
[00:37:59] James_F: ok, i'll create /srv/parsoid-testing and then see if it works before I besmirch your beautiful CommonSettings.php patch further
[00:38:05] thanks James_F - is the other change live?
[00:38:35] Jdlrobson: Both live, yes, but I didn't see the difference for nlwiki?
[00:38:44] Maybe I checked the wrong shape pages.
[00:39:35] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:39:37] i need to run some tests on nl
[00:39:39] is it synced?
[00:40:34] James_F: works. https://nl.m.wikipedia.org/wiki/Gebruiker:Jdlrobson/draft
[00:40:48] James_F: sorry, i'm being slow. but it looks like parsoid is checked out in /srv/deployment/parsoid/deploy
[00:41:07] (PS1) Jforrester: [Beta Cluster] Only soft-override wgLogos; let 'wordmark' continue [mediawiki-config] - https://gerrit.wikimedia.org/r/576168
[00:41:15] cscott: Yes, we didn't delete those existing ones (yet).
[00:41:18] thank you James_F !
[00:41:34] cscott: But it should now be using the one in /srv/mediawiki/…/vendor/wikimedia/parsoid?
[00:41:53] (CR) Jforrester: [C: +2] [Beta Cluster] Only soft-override wgLogos; let 'wordmark' continue [mediawiki-config] - https://gerrit.wikimedia.org/r/576168 (owner: Jforrester)
[00:42:21] i'm trying to understand your `git clone …/parsoid.git /srv/parsoid-testing` command (and feeling slightly stupid/in need of more coffee)
[00:42:33] what is your cwd there?
[00:42:48] (Merged) jenkins-bot: [Beta Cluster] Only soft-override wgLogos; let 'wordmark' continue [mediawiki-config] - https://gerrit.wikimedia.org/r/576168 (owner: Jforrester)
[00:43:06] The expected checkout for new parsoid is /srv/parsoid-testing (i.e., not inside the managed-by-scap /srv/mediawiki tree).
[00:43:49] (CR) Jforrester: "Ic6fd0a2f641e20e249b7041a3bb4d5017a380696 should be sufficient for this, I think." [mediawiki-config] - https://gerrit.wikimedia.org/r/576157 (https://phabricator.wikimedia.org/T232140) (owner: Jdlrobson)
[00:46:23] (CR) Jdlrobson: "HURRAH" [mediawiki-config] - https://gerrit.wikimedia.org/r/576168 (owner: Jforrester)
[00:47:38] Jdlrobson: Cheer me *if* it works. :-)
[00:48:41] James_F: ok, it's working, but it did end up needing an extra .. in $parsoidDir
[00:48:56] Jdlrobson: Aha, https://en.m.wikipedia.beta.wmflabs.org/wiki/Main_Page now has a wordmark again.
[00:49:12] cscott: Write it into a patch and I'll deploy it now?
[00:49:26] Better fix now than leave it in prod to blow up for someone else.
[00:49:37] yeah, doing so now
[00:50:42] (PS1) C. Scott Ananian: Update special scandium configuration to load from /srv/parsoid-testing [mediawiki-config] - https://gerrit.wikimedia.org/r/576169
[00:50:49] James_F: ^
[00:50:58] (CR) Jforrester: [C: +2] Update special scandium configuration to load from /srv/parsoid-testing [mediawiki-config] - https://gerrit.wikimedia.org/r/576169 (owner: C. Scott Ananian)
[00:51:00] Operations, Product-Infrastructure-Team-Backlog, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics-privatedata-users for Michael Holloway - https://phabricator.wikimedia.org/T246019 (Nuria) Approved on my end
[00:51:02] On it.
[00:51:56] (Merged) jenkins-bot: Update special scandium configuration to load from /srv/parsoid-testing [mediawiki-config] - https://gerrit.wikimedia.org/r/576169 (owner: C. Scott Ananian)
[00:52:58] * James_F wonders what scap will do to the local changes.
[00:53:03] Probably silently over-write.
[00:53:22] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: T240055 Update special scandium configuration to load from /srv/parsoid-testing (duration: 00m 58s)
[00:53:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:53:26] Live.
[00:53:27] T240055: Craft a deployment strategy to transition Parsoid/PHP from a faux extension to a composer library without breaking incoming requests - https://phabricator.wikimedia.org/T240055
[00:53:29] Thank you, cscott.
[00:54:45] pulled to scandium
[00:54:48] seems to work
[00:55:00] cscott: I pushed to all, including scandium.
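Editor's sketch: the "extra .." in `$parsoidDir` follows from simple path arithmetic. Per the log, `__DIR__` in `wmf-config/CommonSettings.php` is `/srv/mediawiki/wmf-config`, so one `..` lands inside `/srv/mediawiki`, while the checkout was created at `/srv/parsoid-testing`. The Python below mimics that resolution with `posixpath.normpath`. The two-level relative string is an assumption about the shape of the fix, since the log does not quote the final expression, and `resolve` is a hypothetical helper name.

```python
import posixpath

# Value of __DIR__ for wmf-config/CommonSettings.php, as stated in the log.
CONFIG_DIR = "/srv/mediawiki/wmf-config"


def resolve(base: str, relative: str) -> str:
    """Concatenate like PHP's `__DIR__ . $relative`, then collapse any `..`.

    normpath works purely on the string; it does not follow symlinks.
    """
    return posixpath.normpath(base + relative)


# Original setting quoted in the log: one level up from wmf-config.
print(resolve(CONFIG_DIR, "/../parsoid-testing"))
# Assumed shape of the fix: two levels up, reaching /srv.
print(resolve(CONFIG_DIR, "/../../parsoid-testing"))
```

The first call yields a path under `/srv/mediawiki`, the tree managed by scap; the second yields the `/srv/parsoid-testing` location named in the patch title.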
:-)
[00:55:06] hacked a quick update to https://www.mediawiki.org/wiki/Parsoid/Round-trip_testing
[00:55:15] (And 375 other boxes.)
[00:55:18] James_F: thanks!
[00:55:34] doesn't seem to have broken anything
[00:55:40] Jinx.
[00:56:05] ok, since i'm going to be largely offline tomorrow would you mind checking in w/ subbu at some point and just let him know what we did?
[00:56:12] Of course.
[00:56:59] he might want to make further tweaks to scandium, but things are at least Not Broken right now. :)
[01:02:44] Operations, Parsoid-PHP, SRE-Access-Requests, serviceops: Give all members of the Parsing team production `deployment` access ( add arlolra to deployers) - https://phabricator.wikimedia.org/T245877 (cscott) Ok, tested deploy rights today. We depooled wtp1025 temporarily for testing. On wtp1025...
[01:03:33] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 76, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:04:39] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 64, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:10:20] (CR) Jforrester: "> Patch Set 6:" [mediawiki-config] - https://gerrit.wikimedia.org/r/446507 (https://phabricator.wikimedia.org/T198844) (owner: Thcipriani)
[01:15:06] (PS2) Dzahn: installserver/apt: allow setting gpg_user different from reprepro user [puppet] - https://gerrit.wikimedia.org/r/576164 (https://phabricator.wikimedia.org/T224576)
[01:18:22] (CR) jerkins-bot: [V: -1] installserver/apt: allow setting gpg_user different from reprepro user [puppet] - https://gerrit.wikimedia.org/r/576164 (https://phabricator.wikimedia.org/T224576) (owner: Dzahn)
[01:20:30] (CR) Dzahn: [V: +2 C: +2] "https://puppet-compiler.wmflabs.org/compiler1002/21203/install1002.wikimedia.org/" [puppet] - https://gerrit.wikimedia.org/r/576164 (https://phabricator.wikimedia.org/T224576) (owner: Dzahn)
[01:20:54] (PS3) Dzahn: installserver/apt: allow setting gpg_user different from reprepro user [puppet] - https://gerrit.wikimedia.org/r/576164 (https://phabricator.wikimedia.org/T224576)
[01:23:34] (CR) jerkins-bot: [V: -1] installserver/apt: allow setting gpg_user different from reprepro user [puppet] - https://gerrit.wikimedia.org/r/576164 (https://phabricator.wikimedia.org/T224576) (owner: Dzahn)
[01:26:02] Operations: mw1248 - Memory correctable errors -EDAC- - https://phabricator.wikimedia.org/T246730 (Dzahn)
[01:27:09] Operations: mw1248 - Memory correctable errors -EDAC- - https://phabricator.wikimedia.org/T246730 (Dzahn) ` Mar 2 09:04:50 mw1248 kernel: [23463394.610091] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR Mar 2 09:04:50 mw1248 kernel: [23463394.610097] EDAC sbridge MC1: CPU 1: Machine Check Event: 0 Bank 11: 8c...
[01:29:35] Operations: mw1248 - Memory correctable errors -EDAC- - https://phabricator.wikimedia.org/T246730 (Dzahn) Open→Declined closing again per comments on similar T238018
[01:29:55] ACKNOWLEDGEMENT - Memory correctable errors -EDAC- on mw1248 is CRITICAL: 18 ge 4 daniel_zahn https://phabricator.wikimedia.org/T246730 https://wikitech.wikimedia.org/wiki/Monitoring/Memory%23Memory_correctable_errors_-EDAC- https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=mw1248&var-datasource=eqiad+prometheus/ops
[01:36:15] (CR) Dzahn: [V: +2 C: +2] "this is just to make sure there is no change on install1002 and the root user still owns its GPG keys like before (unlike on releases*)" [puppet] - https://gerrit.wikimedia.org/r/576164 (https://phabricator.wikimedia.org/T224576) (owner: Dzahn)
[01:48:57] RECOVERY - MariaDB Slave Lag: s3 on db1095 is OK: OK slave_sql_lag Replication lag: 0.38 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[01:49:25] (CR) Dzahn: "> Patch Set 1: Code-Review+1" [puppet] - https://gerrit.wikimedia.org/r/572381 (https://phabricator.wikimedia.org/T210411) (owner: Dzahn)
[01:51:11] RECOVERY - MariaDB Slave Lag: s3 on db2098 is OK: OK slave_sql_lag Replication lag: 0.50 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[02:09:41] RECOVERY - Old JVM GC check - cloudelastic1001-cloudelastic-chi-eqiad on cloudelastic1001 is OK: (C)100 gt (W)80 gt 78.31 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1
[02:11:23] (PS4) Sharvaniharan: Enabling depicts count [mediawiki-config] - https://gerrit.wikimedia.org/r/575611
[02:14:25] Operations, serviceops-radar, Wikimedia-maintenance-script-run: MWMultiVersion.php Fatal error: no version entry for `#` (was: special pages has not been updated since November 2019 in jawiki and several other projects) - https://phabricator.wikimedia.org/T243599 (Dzahn)
[02:14:44] Operations, Core Platform Team, serviceops-radar, Wikimedia-maintenance-script-run: MWMultiVersion.php Fatal error: no version entry for `#` (was: special pages has not been updated since November 2019 in jawiki and several other projects) - https://phabricator.wikimedia.org/T243599 (Dzahn)
[02:20:35] PROBLEM - Old JVM GC check - cloudelastic1001-cloudelastic-chi-eqiad on cloudelastic1001 is CRITICAL: 132.2 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1
[02:34:52] Operations, Core Platform Team, serviceops-radar, Wikimedia-maintenance-script-run: MWMultiVersion.php Fatal error: no version entry for `#` (was: special pages has not been updated since November 2019 in jawiki and several other projects) - https://phabricator.wikimedia.org/T243599 (Reedy) 1606b...
[02:37:27] Operations, Core Platform Team, serviceops-radar, Wikimedia-maintenance-script-run: MWMultiVersion.php Fatal error: no version entry for `#` (was: special pages has not been updated since November 2019 in jawiki and several other projects) - https://phabricator.wikimedia.org/T243599 (Reedy) p:...
[02:38:40] Operations, Core Platform Team, serviceops-radar, Wikimedia-maintenance-script-run: MWMultiVersion.php Fatal error: no version entry for `#` (was: special pages has not been updated since November 2019 in jawiki and several other projects) - https://phabricator.wikimedia.org/T243599 (Reedy) This...
[02:39:26] !log manually running updateSpecialPages.php maintenance cron on s8 for AncientPages to confirm it was fixed by gerrit:574726 a few days ago (T243599)
[02:39:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:39:32] T243599: MWMultiVersion.php Fatal error: no version entry for `#` (was: special pages has not been updated since November 2019 in jawiki and several other projects) - https://phabricator.wikimedia.org/T243599
[02:41:25] Operations, Core Platform Team, serviceops-radar, User-Urbanecm, Wikimedia-maintenance-script-run: MWMultiVersion.php Fatal error: no version entry for `#` (was: special pages has not been updated since November 2019 in jawiki and several other projects... - https://phabricator.wikimedia.org/T243599
[02:56:13] (CR) Andrew Bogott: [C: +1] "> but grafana-labs will stay, right?" [puppet] - https://gerrit.wikimedia.org/r/572381 (https://phabricator.wikimedia.org/T210411) (owner: Dzahn)
[02:57:10] !log manually running "Ancientpages" cron on s3 (T243599)
[02:57:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:57:16] T243599: MWMultiVersion.php Fatal error: no version entry for `#` (was: special pages has not been updated since November 2019 in jawiki and several other projects) - https://phabricator.wikimedia.org/T243599
[03:03:29] (PS1) Reedy: Make english [puppet] - https://gerrit.wikimedia.org/r/576176
[03:05:08] (PS2) Dzahn: scap/foreachwikiindblist: Make comment English [puppet] - https://gerrit.wikimedia.org/r/576176 (owner: Reedy)
[03:05:43] (CR) Dzahn: [C: +2] scap/foreachwikiindblist: Make comment English [puppet] - https://gerrit.wikimedia.org/r/576176 (owner: Reedy)
[03:31:43] (PS5) Andrew Bogott: wmcs-novastats-dnsleaks: make safe to run in codfw1dev [puppet] - https://gerrit.wikimedia.org/r/576155 (https://phabricator.wikimedia.org/T246551)
[03:33:35] (CR) Andrew Bogott: [C: +2] wmcs-novastats-dnsleaks: make safe to run in codfw1dev [puppet] - https://gerrit.wikimedia.org/r/576155 (https://phabricator.wikimedia.org/T246551) (owner: Andrew Bogott)
[04:06:33] Operations, WMF-Communications: Feasibility of hosting podcast setup on Wikimedia servers - https://phabricator.wikimedia.org/T148061 (Yair_rand) When viewing a category on Commons, there's a link in the sidebar titled "RSS feed", linking to a feed of that category, generated by the CatFood tool hosted o...
[04:09:45] PROBLEM - Backup freshness on backup1001 is CRITICAL: All failures: 4 (apt1001, ...), Fresh: 89 jobs https://wikitech.wikimedia.org/wiki/Backups%23Monitoring
[04:22:39] RECOVERY - Old JVM GC check - cloudelastic1001-cloudelastic-chi-eqiad on cloudelastic1001 is OK: (C)100 gt (W)80 gt 71.19 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1
[05:44:03] Operations, SRE-swift-storage: swift capacity planning - https://phabricator.wikimedia.org/T1268 (Aklapper) @fgiunchedi: Hi, all related patches in Gerrit have been merged. Can this task be resolved (via {nav name=Add Action... > Change Status} in the dropdown menu), or is there more to do in this task?...
[06:03:31] (PS1) Vgutierrez: ATS: Switch unified cert vendor to Let's Encrypt on eqiad & codfw [puppet] - https://gerrit.wikimedia.org/r/576188 (https://phabricator.wikimedia.org/T230687)
[06:17:19] (PS4) Marostegui: db-eqiad,db-codfw.php: Add es4 as new ES, for initial testing [mediawiki-config] - https://gerrit.wikimedia.org/r/574696 (https://phabricator.wikimedia.org/T246072)
[06:17:31] (PS2) Marostegui: db-eqiad,db-codfw.php: Add es4 to default ES cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/575016 (https://phabricator.wikimedia.org/T246072)
[06:20:05] (PS1) Marostegui: Revert "wikireplica_analytics.yaml: Decrease running time" [puppet] - https://gerrit.wikimedia.org/r/576190
[06:21:15] (CR) Marostegui: [C: +2] Revert "wikireplica_analytics.yaml: Decrease running time" [puppet] - https://gerrit.wikimedia.org/r/576190 (owner: Marostegui)
[06:23:06] (CR) Vgutierrez: [C: +2] "pcc is happy: https://puppet-compiler.wmflabs.org/compiler1002/21205/" [puppet] - https://gerrit.wikimedia.org/r/576188 (https://phabricator.wikimedia.org/T230687) (owner: Vgutierrez)
[06:25:31] !log Switch from globalsign to LE as unified cert vendor on codfw - T230687
[06:25:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:25:37] T230687: Decide/document criteria needed to serve acme-chief LE issued unified certificate to end users - https://phabricator.wikimedia.org/T230687
[06:33:23] !log Switch from globalsign to LE as unified cert vendor on eqiad - T230687
[06:33:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:33:27] T230687: Decide/document criteria needed to serve acme-chief LE issued unified certificate to end users - https://phabricator.wikimedia.org/T230687
[06:36:49] (PS3) Giuseppe Lavagetto: mediawiki: stop installing the nginx-based proxy [puppet] - https://gerrit.wikimedia.org/r/576078 (https://phabricator.wikimedia.org/T244843)
[06:38:20] (CR) Ori.livneh: [C: +1] "Nice :)" [mediawiki-config] - https://gerrit.wikimedia.org/r/576091 (https://phabricator.wikimedia.org/T169821) (owner: Krinkle)
[06:39:24] Operations: Install nodejs, nginx and other dependencies on francium - https://phabricator.wikimedia.org/T94457 (ArielGlenn) This should be closed as no longer relevant at this point. What status is that? :-)
[06:41:33] !log elukey@cumin1001 START - Cookbook sre.druid.roll-restart-workers
[06:41:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:42:39] (PS4) Marostegui: install_server: Allow manual reimage db109[6-9] [puppet] - https://gerrit.wikimedia.org/r/575991 (https://phabricator.wikimedia.org/T246604)
[06:42:49] (PS1) Vgutierrez: install_server,lvs: Reimage lvs3006 with buster [puppet] - https://gerrit.wikimedia.org/r/576191 (https://phabricator.wikimedia.org/T245984)
[06:43:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1096:3315, db1096:3316 for reimage T246604', diff saved to https://phabricator.wikimedia.org/P10587 and previous config saved to /var/cache/conftool/dbconfig/20200303-064316-marostegui.json
[06:43:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:43:21] T246604: Install 1 buster+10.4 host per section - https://phabricator.wikimedia.org/T246604
[06:44:07] (CR) Giuseppe Lavagetto: [C: +2] mediawiki: stop installing the nginx-based proxy [puppet] - https://gerrit.wikimedia.org/r/576078 (https://phabricator.wikimedia.org/T244843) (owner: Giuseppe Lavagetto)
[06:44:20] Operations, Analytics, User-Elukey: Refactor Analytics POSIX groups in puppet to improve maintainability - https://phabricator.wikimedia.org/T246578 (elukey) >>! In T246578#5933951, @nshahquinn-wmf wrote: > > I don't have any serious concerns, so you don't have to pay me too much attention; I joined...
[06:44:31] (03CR) 10Marostegui: [C: 03+2] install_server: Allow manual reimage db109[6-9] [puppet] - 10https://gerrit.wikimedia.org/r/575991 (https://phabricator.wikimedia.org/T246604) (owner: 10Marostegui) [06:45:13] _joe_: ok to merge your changes? [06:45:16] <_joe_> yes [06:45:19] ok, merging [06:45:21] <_joe_> I was about to ask [06:45:55] (03PS2) 10Vgutierrez: install_server,lvs: Reimage lvs3006 with buster [puppet] - 10https://gerrit.wikimedia.org/r/576191 (https://phabricator.wikimedia.org/T245984) [06:45:58] merged! [06:47:27] (03CR) 10Vgutierrez: [C: 03+2] install_server,lvs: Reimage lvs3006 with buster [puppet] - 10https://gerrit.wikimedia.org/r/576191 (https://phabricator.wikimedia.org/T245984) (owner: 10Vgutierrez) [06:48:50] (03PS1) 10Marostegui: db1096: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/576193 (https://phabricator.wikimedia.org/T246604) [06:49:49] !log Stop MySQL on db1096:3315,3316 for reimage - T246604 [06:49:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:49:54] T246604: Install 1 buster+10.4 host per section - https://phabricator.wikimedia.org/T246604 [06:50:10] (03CR) 10Marostegui: [C: 03+2] db1096: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/576193 (https://phabricator.wikimedia.org/T246604) (owner: 10Marostegui) [06:51:34] !log reimage lvs3006 with buster - T245984 [06:51:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:51:39] T245984: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 [06:52:33] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` lvs3006.esams.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/20... 
[07:07:10] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime [07:07:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:07:39] !log elukey@cumin1001 END (PASS) - Cookbook sre.druid.roll-restart-workers (exit_code=0) [07:07:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:09:35] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [07:09:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:09:59] !log elukey@cumin1001 START - Cookbook sre.druid.roll-restart-workers [07:10:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:17:33] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [07:17:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:19:52] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [07:19:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:29:38] 10Operations, 10Traffic: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['lvs3006.esams.wmnet'] ` and were **ALL** successful. [07:31:01] 10Operations: write up impact estimation procedure - https://phabricator.wikimedia.org/T246739 (10CDanis) [07:31:46] 10Operations: write up impact estimation procedure - https://phabricator.wikimedia.org/T246739 (10CDanis) [07:31:55] (03PS1) 10Vgutierrez: lvs: Re-enable BGP in lvs3006 [puppet] - 10https://gerrit.wikimedia.org/r/576265 (https://phabricator.wikimedia.org/T245984) [07:33:00] (03CR) 10DCausse: [C: 03+1] wdqs: added link to runbook entry for categories update lag. 
[puppet] - 10https://gerrit.wikimedia.org/r/576064 (https://phabricator.wikimedia.org/T246497) (owner: 10Gehel) [07:33:22] (03CR) 10Vgutierrez: [C: 03+2] lvs: Re-enable BGP in lvs3006 [puppet] - 10https://gerrit.wikimedia.org/r/576265 (https://phabricator.wikimedia.org/T245984) (owner: 10Vgutierrez) [07:39:11] !log elukey@cumin1001 END (PASS) - Cookbook sre.druid.roll-restart-workers (exit_code=0) [07:39:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:41:07] !log Re-enable BGP in lvs3006 - T245984 [07:41:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:41:11] T245984: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 [07:45:41] !log START warm cache for db1111 & db1126 for Q12-15 million T219123 (pass 1) [07:45:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:45:45] T219123: Migrate to and read from new store for item terms - https://phabricator.wikimedia.org/T219123 [07:46:57] 10Operations: write up impact estimation procedure - https://phabricator.wikimedia.org/T246739 (10CDanis) p:05Triage→03Medium [07:48:40] !log elukey@cumin1001 START - Cookbook sre.zookeeper.roll-restart-zookeeper [07:48:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:55:04] !log elukey@cumin1001 END (PASS) - Cookbook sre.zookeeper.roll-restart-zookeeper (exit_code=0) [07:55:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:57:49] (03PS1) 10Addshore: Read from the new term store up to Q15 mill everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576268 (https://phabricator.wikimedia.org/T219123) [08:01:23] (03CR) 10Muehlenhoff: [C: 03+2] Bump max CAS session life time to a day [puppet] - 10https://gerrit.wikimedia.org/r/576072 (owner: 10Muehlenhoff) [08:05:07] !log elukey@cumin1001 START - Cookbook sre.zookeeper.roll-restart-zookeeper [08:05:10] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:08:58] !log addshore@mwmaint1002:~$ time mwscript extensions/Wikibase/repo/maintenance/rebuildItemTerms.php --wiki=wikidatawiki --batch-size=25 --sleep=1 --file=27feb1125-30to40-holes # T219123 [08:09:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:09:03] T219123: Migrate to and read from new store for item terms - https://phabricator.wikimedia.org/T219123 [08:11:29] !log elukey@cumin1001 END (PASS) - Cookbook sre.zookeeper.roll-restart-zookeeper (exit_code=0) [08:11:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:13:47] !log START warm cache for db1111 & db1126 for Q12-15 million T219123 (pass 2) [08:13:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:11] !log addshore@mwmaint1002:~$ time mwscript extensions/Wikibase/repo/maintenance/rebuildItemTerms.php --wiki=wikidatawiki --batch-size=25 --sleep=1 --file=27feb1125-40to50-holes # T219123 [08:25:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:16] T219123: Migrate to and read from new store for item terms - https://phabricator.wikimedia.org/T219123 [08:26:33] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:28:41] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:29:58] 10Operations, 10cloud-services-team: Move labtestpuppetmaster2001 to Puppet 5 - https://phabricator.wikimedia.org/T246655 (10MoritzMuehlenhoff) 05Open→03Declined Declining in favour of T242607 [08:35:13] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=icinga site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:36:08] !log elukey@cumin1001 START - Cookbook sre.kafka.roll-restart-brokers [08:36:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:37:25] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:53:54] !log START warm cache for db1111 & db1126 for Q12-15 million T219123 (pass 3) [08:53:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:54:00] T219123: Migrate to and read from new store for item terms - https://phabricator.wikimedia.org/T219123 [08:55:10] (03PS1) 10Giuseppe Lavagetto: profile::tcp_fast_open: create tiny profile [puppet] - 10https://gerrit.wikimedia.org/r/576279 [08:55:12] (03PS1) 10Giuseppe Lavagetto: profile::service_proxy: absent everywhere [puppet] - 10https://gerrit.wikimedia.org/r/576280 [09:00:04] marostegui and jynus: I, the Bot under the Fountain, allow thee, The Deployer, to do es4 database deployment deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200303T0900). 
[09:00:17] (03CR) 10Marostegui: db-eqiad,db-codfw.php: Add es4 as new ES, for initial testing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/574696 (https://phabricator.wikimedia.org/T246072) (owner: 10Marostegui) [09:00:18] so what is the first step? [09:00:26] will you test on mwdebug? [09:00:29] deploying https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/574696/ which should be a NOOP [09:00:33] ok [09:00:38] yes, I am going to first test shell.php [09:00:43] once that's deployed [09:00:47] and then on mwdebug [09:00:52] ok [09:01:07] let me see what is the status of eqiad.json [09:01:27] ok, going to merge for now [09:01:29] (03CR) 10Marostegui: [C: 03+2] db-eqiad,db-codfw.php: Add es4 as new ES, for initial testing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/574696 (https://phabricator.wikimedia.org/T246072) (owner: 10Marostegui) [09:02:23] so, for what I see, es4 and es5 already has the correct weights there [09:02:27] (03Merged) 10jenkins-bot: db-eqiad,db-codfw.php: Add es4 as new ES, for initial testing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/574696 (https://phabricator.wikimedia.org/T246072) (owner: 10Marostegui) [09:02:41] (on etcd) [09:02:43] jynus: yep [09:03:40] 10Operations: Install nodejs, nginx and other dependencies on francium - https://phabricator.wikimedia.org/T94457 (10akosiaris) 05Open→03Invalid `Invalid` I 'd say. I went ahead and did that, feel free to undo. [09:03:44] 10Operations, 10Patch-For-Review: deploy francium for html/zim dumps - https://phabricator.wikimedia.org/T93113 (10akosiaris) [09:04:06] people know not to do mw deployments now, right? 
[09:04:17] I blocked it on the deployment page yeah [09:04:18] jouncebot: next [09:04:18] In 2 hour(s) and 55 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200303T1200) [09:04:19] I do :) [09:04:27] I have also created a lock [09:04:29] jouncebot: now [09:04:30] For the next 1 hour(s) and 55 minute(s): es4 database deployment (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200303T0900) [09:04:31] on scap [09:04:44] marostegui: cool [09:05:18] I am checking logs and metrics [09:05:40] I am going to deploy that first patch [09:05:48] marostegui: patch LGTM, btw [09:07:31] let's see what shell.php says [09:07:36] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Add es4 to the available es sections, not in use yet - T246072 (duration: 00m 57s) [09:07:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:07:40] T246072: Enable es4 and es5 as writable new external store sections - https://phabricator.wikimedia.org/T246072 [09:08:19] I am also monitoring es1020 binlog btw [09:08:23] to make sure there are no writes [09:08:31] I see no errors on log, metrics [09:08:39] shell.php looking good [09:08:53] going to deploy to codfw.php first, before going to mwdebug [09:09:03] could we maybe try to read a blob, and see it failing? [09:09:09] Yes, I tried that [09:09:10] ah, ok, too [09:09:12] with shell.php [09:09:25] Can I get a last review on https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/575016/ ? 
[09:09:26] ah, I didn't understand it [09:09:29] sorry [09:09:49] >>> ExternalStore::fetchFromURL( 'DB://cluster26/1' ) [09:09:50] => false [09:09:59] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Add es4 to the available es sections, not in use yet - T246072 (duration: 00m 57s) [09:10:00] (03CR) 10Jcrespo: [C: 03+1] db-eqiad,db-codfw.php: Add es4 to default ES cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575016 (https://phabricator.wikimedia.org/T246072) (owner: 10Marostegui) [09:10:00] no other errors as if I would try with cluster27 [09:10:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:06] (03CR) 10Muehlenhoff: cache: map logstash-next.wikimedia.org to kibana-next lvs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/576151 (https://phabricator.wikimedia.org/T234854) (owner: 10Herron) [09:10:21] wait [09:10:35] ExternalStoreDB::fetchBlob Table 'enwiki.blobs' doesn't exist (10.64.16.150) SELECT blob_text FROM `blobs` WHERE blob_id = '1111' LIMIT 1 [09:10:43] what's that? [09:10:50] 3 errors on log [09:10:51] the table isn't called blobs [09:10:56] I expected an error [09:10:59] but not that one [09:11:06] lets stop for a second [09:11:13] That comes from mwmaint1002? [09:11:24] Because if so, that's me [09:11:40] yes: /srv/mediawiki-staging/multiversion/MWScript.php shell.php --wiki=enwiki [09:11:47] yes, I was testing before doing the merge [09:11:52] to make sure we'd get a proper error [09:12:10] yes, but is the enwiki.blobs expected? [09:12:20] vs enwiki.blobs* [09:12:29] at this point [09:13:11] I am guessing yes, until this deployment, right? 
[09:13:21] Yeah, it switched as soon as I merged [09:13:28] and it started returning false, which is expected [09:13:44] cool, then [09:13:53] ok, going to modify mwdebug [09:13:55] sorry, just being very careful if I see errors [09:13:56] to start writing [09:14:00] No, please, be verbose [09:14:09] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 93 jobs https://wikitech.wikimedia.org/wiki/Backups%23Monitoring [09:14:21] Applying https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/575016/ to mwdebug1001 [09:15:01] ok [09:15:10] done [09:15:11] testing [09:16:49] PROBLEM - Old JVM GC check - cloudelastic1001-cloudelastic-chi-eqiad on cloudelastic1001 is CRITICAL: 136 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1 [09:17:45] I am seeing my inserts on the new host ### INSERT INTO `eswiki`.`blobs_cluster26` [09:17:47] jynus: ^ [09:17:56] I see the edit you tried being available from the whole cluster [09:18:09] if you confirm you see it inserted on the new server [09:18:17] that is most of the confirmation [09:18:23] otherwise I don't know what to test [09:18:29] yes, I see it on the master, checking now it is on all the slaves [09:18:29] any idea? [09:18:39] will check log or monitoring [09:19:23] I am going to make an edit on my page on enwiki too [09:19:38] no related db errors on log since 10 minutes ago [09:21:01] maybe also create a new page (a subpage) [09:21:16] and edit a few other projects? [09:21:22] makes sense [09:21:24] let me do that too [09:21:26] I don't know [09:21:36] I am thinking aloud about things that could fail [09:22:17] wait, how did your edit land on the new cluster, when it only had a 1/3 chance of doing so?
[09:22:33] I am doing multiple edits, to reduce that chance [09:22:40] not all of them are arriving to the new cluster of course [09:22:45] so it is random, I see [09:22:47] yeah [09:22:56] I am now editing wiktionary [09:22:58] that was another worry: all edits to new [09:27:06] I will also keep monitoring editing time and editing rate grafana dashboards [09:27:59] I see my edits also on wiktionary [09:28:06] I think we are good [09:28:19] ok, I see no errors [09:28:34] ok, so let's deploy https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/575016/ [09:28:38] do we prepared the rollback in advance? [09:28:40] (03PS3) 10Marostegui: db-eqiad,db-codfw.php: Add es4 to default ES cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575016 (https://phabricator.wikimedia.org/T246072) [09:28:41] *prepare [09:28:48] just to be fast? [09:28:59] the rollback would be basically: read-only to es1020 and reverting the config [09:29:04] I have the read_only ready [09:29:08] ok [09:29:23] (03CR) 10Marostegui: db-eqiad,db-codfw.php: Add es4 to default ES cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575016 (https://phabricator.wikimedia.org/T246072) (owner: 10Marostegui) [09:29:53] (03CR) 10Marostegui: [C: 03+2] db-eqiad,db-codfw.php: Add es4 to default ES cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575016 (https://phabricator.wikimedia.org/T246072) (owner: 10Marostegui) [09:30:11] (03PS1) 10Vgutierrez: install_server,lvs: Reimage lvs3005 with buster [puppet] - 10https://gerrit.wikimedia.org/r/576283 (https://phabricator.wikimedia.org/T245984) [09:30:12] oh, it wasn't merged [09:30:17] hehe no :) [09:30:18] did you applied it locally? 
[09:30:21] yeah [09:30:27] I usually just pull it with scap [09:30:29] after merge [09:30:44] although I don't know if people liked that [09:30:46] yeah, I didn't want any risks this time [09:30:50] (03Merged) 10jenkins-bot: db-eqiad,db-codfw.php: Add es4 to default ES cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575016 (https://phabricator.wikimedia.org/T246072) (owner: 10Marostegui) [09:30:52] I usually do that too [09:30:55] ok, let's deploy codfw first [09:31:28] did you check eqiad -> codfw replication, too? [09:31:30] btw [09:31:32] yep [09:31:35] cool [09:31:39] Posted above, that the change was present on all hosts [09:31:47] I just think of things that could be broken :-D [09:31:52] I 100% trust you [09:31:55] checked for all the projects I edited and saw on binlogs [09:32:11] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Enable es4 as new writable external store section - T246072 (duration: 00m 57s) [09:32:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:32:17] T246072: Enable es4 and es5 as writable new external store sections - https://phabricator.wikimedia.org/T246072 [09:32:22] and the readonly shouldn't need check or it would alarm [09:32:35] ^ that deployment did nothing as I didn't rebase hehe [09:32:40] rebasing and deploying again to codfw [09:32:40] ok [09:33:22] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Enable es4 as new writable external store section - T246072 (duration: 00m 56s) [09:33:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:25] Let's go for eqiad then.....https://giphy.com/gifs/hulu-scared-3ohhwF34cGDoFFhRfy [09:33:45] wait a sec [09:33:47] !log marostegui@deploy1001 sync-file aborted: Enable es4 as new writable external store section - T246072 (duration: 00m 02s) [09:33:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:51] canceled [09:33:52] I am seeing something
[09:33:57] what is it [09:34:06] the master of es4 is es1020 [09:34:08] right? [09:34:11] yes [09:34:33] sorry, I misread it with es1021 read_only: False [09:34:44] the state of the read_only [09:34:46] my fault [09:34:48] hahaha [09:34:49] good to go [09:34:56] * marostegui checks his pulse again [09:34:59] sorry [09:35:04] don't worry, better be safe [09:35:06] I prefer to stop you and check [09:35:08] deploying again! [09:35:12] sorry for that [09:35:26] I would measure the latency of the queries there [09:35:35] !log START warm cache for db1111 & db1126 for Q12-15 million T219123 (pass 4) [09:35:37] and see how high it gets with load [09:35:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:39] T219123: Migrate to and read from new store for item terms - https://phabricator.wikimedia.org/T219123 [09:35:44] ^^ ignore that, just keeping the caches warm for me after :) [09:36:04] yeah, the good thing is that it will slowly get 1/3 traffic [09:36:10] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Enable es4 as new writable external store section - T246072 (duration: 00m 56s) [09:36:11] not a super 0 -> 100% increase [09:36:13] ok, we are live [09:36:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:19] I can see writes [09:36:27] binlog description :-D [09:36:45] no mountain of errors so far [09:36:55] in fact no error at all [09:37:10] checking edit metrics [09:37:17] host metrics increasing as expected [09:38:02] 4 save failures (all conflicts), within the normal ratio [09:38:59] no errors, I think that happens when deployments slow down the activity [09:39:11] I am examining binlogs and all looking good [09:39:20] can people edit?
let me see rcs [09:39:37] yeah, I have it opened, and looking good [09:39:47] (03CR) 10Vgutierrez: [C: 03+2] install_server,lvs: Reimage lvs3005 with buster [puppet] - 10https://gerrit.wikimedia.org/r/576283 (https://phabricator.wikimedia.org/T245984) (owner: 10Vgutierrez) [09:40:37] The Edit Count graph is broken? [09:40:44] not sure [09:40:49] Ah no [09:40:50] https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&refresh=5m&fullscreen&panelId=8 [09:40:52] it works [09:41:00] I think it has lag? [09:42:04] the graph or the hosts? [09:42:36] the graph :-D [09:43:00] let me compare it on the aggregated metrics [09:43:04] the es* hosts [09:43:10] !log reimage lvs3005 with buster - T245984 [09:43:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:15] T245984: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 [09:43:36] query latency looking good on the new hosts too [09:44:18] deployment effect is nicely seen here: https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-group=core&var-shard=es2&var-shard=es3&var-shard=es4&var-role=All&from=1583225048118&to=1583228648118&fullscreen&panelId=7 [09:44:33] we will not see a lot of reads [09:44:56] because they will have 5000Mill edits vs 1000 [09:45:08] yeah [09:45:12] but there are some, which is good [09:45:31] well, my bet is that it is from monitoring and checks :-D [09:45:35] for the most part [09:45:52] but connections may be seen from mw [09:46:27] the aggregated number of writes seems to be at the same level [09:46:36] yeah [09:46:41] same for latency [09:46:51] but with edit counts, we will have to wait a bit to see if it had a neutral or bad effect [09:46:58] (03PS1) 10Arturo Borrero Gonzalez: openstack: serverpackages: queens: stretch: add nochange repository [puppet] - 10https://gerrit.wikimedia.org/r/576284 (https://phabricator.wikimedia.org/T246287) [09:47:45] so we leave it like that for some time, and later we
remove es2? [09:48:23] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: serverpackages: queens: stretch: add nochange repository [puppet] - 10https://gerrit.wikimedia.org/r/576284 (https://phabricator.wikimedia.org/T246287) (owner: 10Arturo Borrero Gonzalez) [09:49:07] 10Operations, 10SRE-swift-storage: investigate swift used space spikes since June 2016 - https://phabricator.wikimedia.org/T140075 (10fgiunchedi) 05Open→03Declined Resolving since the root cause has been found [09:49:09] 10Operations, 10SRE-swift-storage: swift capacity planning - https://phabricator.wikimedia.org/T1268 (10fgiunchedi) [09:49:11] 10Operations, 10SRE-swift-storage: swift capacity planning - https://phabricator.wikimedia.org/T1268 (10fgiunchedi) 05Open→03Resolved Sure, we can resolve this [09:49:20] 10Operations, 10Analytics, 10Research, 10Traffic, and 2 others: Enable layered data-access and sharing for a new form of collaboration - https://phabricator.wikimedia.org/T245833 (10elukey) [09:49:50] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` lvs3005.esams.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/20... [09:49:50] jynus: yeah, let's remove es2 tomorrow at the same time maybe? [09:49:54] to give it fully 24h? [09:49:56] I only see database-level errors the one with proxysql on es1 [09:49:57] I can prepare the patches now [09:50:17] and 1 error at 9.04 on es4, what you tried [09:50:27] 1 or very few [09:50:47] https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&from=1583225441278&to=1583229041278&var-dc=eqiad%20prometheus%2Fops&var-group=core&var-shard=All&var-role=All&fullscreen&panelId=10 [09:51:08] something else to check, aside from background monitoring? 
[09:52:00] I was making sure replication or anything else doesn't suffer [09:52:04] but there's not much load [09:52:09] so nothing to worry about I think [09:52:26] wow, es1020 is "suffering" from 60QPS! [09:52:29] :-D [09:52:34] hahaha [09:52:37] I know :) [09:53:14] 7 database connections more than idle [09:53:51] Let me comment on the task and prepare the patches for es2 going RO tomorrow [09:56:28] ok, so then let's close the deployment time "officially" [09:56:33] did you remove the lock? [09:56:50] yeah [09:56:52] lock removed [09:57:08] !log es4 deployment window finished [09:57:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:12] thanks [09:57:25] thank you for the support [09:59:58] (03PS1) 10Marostegui: db-eqiad,db-codfw.php: Set es2 on read only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576286 (https://phabricator.wikimedia.org/T246072) [10:00:11] (03CR) 10Marostegui: [C: 04-2] "Wait until 4th March" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576286 (https://phabricator.wikimedia.org/T246072) (owner: 10Marostegui) [10:01:08] marostegui: all done? 
[10:01:30] (03PS2) 10Addshore: Read from the new term store up to Q15 mill everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576268 (https://phabricator.wikimedia.org/T219123) [10:02:29] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['lvs3005.esams.wmnet'] ` Of which those **FAILED**: ` ['lvs3005.esams.wmnet'] ` [10:03:17] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [10:03:17] (03CR) 10Addshore: [C: 03+2] Read from the new term store up to Q15 mill everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576268 (https://phabricator.wikimedia.org/T219123) (owner: 10Addshore) [10:03:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:24] jouncebot next [10:03:24] In 1 hour(s) and 56 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200303T1200) [10:04:27] (03Merged) 10jenkins-bot: Read from the new term store up to Q15 mill everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576268 (https://phabricator.wikimedia.org/T219123) (owner: 10Addshore) [10:04:45] (03CR) 10Volans: [C: 03+1] "LGTM in general, a typo and an open question inline." 
(032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/526664 (https://phabricator.wikimedia.org/T229397) (owner: 10Jbond) [10:05:36] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [10:05:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:53] (03PS1) 10Vgutierrez: lvs: Test BGP in lvs2008 [puppet] - 10https://gerrit.wikimedia.org/r/576287 (https://phabricator.wikimedia.org/T196560) [10:06:16] !log END warm cache for db1111 & db1126 for Q12-15 million T219123 (pass 4) [10:06:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:06:21] T219123: Migrate to and read from new store for item terms - https://phabricator.wikimedia.org/T219123 [10:09:03] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Reading up to Q15M for the new term store everywhere (was Q12M) + warm db1126 & db1111 caches (T219123) (duration: 00m 56s) [10:09:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:10] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Reading up to Q15M for the new term store everywhere (was Q12M) + warm db1126 & db1111 caches (T219123) cache bust (duration: 00m 56s) [10:10:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:30] (03CR) 10Vgutierrez: [C: 03+2] "pcc is happy: https://puppet-compiler.wmflabs.org/compiler1001/21208/" [puppet] - 10https://gerrit.wikimedia.org/r/576287 (https://phabricator.wikimedia.org/T196560) (owner: 10Vgutierrez) [10:14:25] !log START warm cache for db1111 & db1126 for Q15-20 million T219123 (pass 1) [10:14:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:31] T219123: Migrate to and read from new store for item terms - https://phabricator.wikimedia.org/T219123 [10:15:46] 10Operations, 10Analytics, 10Research, 10Traffic, and 2 others: Enable layered data-access and sharing for a new form of 
collaboration - https://phabricator.wikimedia.org/T245833 (10elukey) >>! In T245833#5934803, @leila wrote: > > @Miriam @elukey the layered permission system can have internal use-cases,... [10:17:41] 10Operations, 10ops-codfw, 10DC-Ops, 10Traffic, 10decommission: decommission lvs2002.codfw.wmnet - https://phabricator.wikimedia.org/T246756 (10Vgutierrez) [10:17:46] (03PS1) 10Arturo Borrero Gonzalez: openstack: serverpackages: queens: stretch: introduce some apt pinnings [puppet] - 10https://gerrit.wikimedia.org/r/576290 (https://phabricator.wikimedia.org/T246671) [10:19:19] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: serverpackages: queens: stretch: introduce some apt pinnings [puppet] - 10https://gerrit.wikimedia.org/r/576290 (https://phabricator.wikimedia.org/T246671) (owner: 10Arturo Borrero Gonzalez) [10:22:29] (03CR) 10Muehlenhoff: "Looks good, a few comments inline" (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/576101 (https://phabricator.wikimedia.org/T206951) (owner: 10Jbond) [10:25:31] edit rate increased quite a lot since :13 [10:25:55] but nothing that hadn't happened before when wikidata bots start :-D [10:26:20] (03PS1) 10Vgutierrez: lvs: Decommission lvs2002.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/576293 (https://phabricator.wikimedia.org/T246756) [10:28:14] (03PS1) 10Vgutierrez: install_server: Reimage lvs3005 with buster [puppet] - 10https://gerrit.wikimedia.org/r/576294 (https://phabricator.wikimedia.org/T245984) [10:29:12] after the high load, db1021 peaked to 119 selects per second :-D [10:29:18] (03PS1) 10Arturo Borrero Gonzalez: openstack: serverpackages: introduce repo configuration for rocky/stretch [puppet] - 10https://gerrit.wikimedia.org/r/576295 (https://phabricator.wikimedia.org/T246287) [10:30:23] jynus: you mean es1021? 
[10:30:25] (CR) Vgutierrez: [C: +2] install_server: Reimage lvs3005 with buster [puppet] - https://gerrit.wikimedia.org/r/576294 (https://phabricator.wikimedia.org/T245984) (owner: Vgutierrez) [10:30:37] marostegui: yes, sorry [10:30:39] (CR) Arturo Borrero Gonzalez: [C: +2] openstack: serverpackages: introduce repo configuration for rocky/stretch [puppet] - https://gerrit.wikimedia.org/r/576295 (https://phabricator.wikimedia.org/T246287) (owner: Arturo Borrero Gonzalez) [10:33:02] (PS1) Volans: mysql: update CORE_SECTIONS for external storage [software/spicerack] - https://gerrit.wikimedia.org/r/576297 (https://phabricator.wikimedia.org/T226704) [10:34:19] (PS2) Vgutierrez: lvs: Decommission lvs2002.codfw.wmnet [puppet] - https://gerrit.wikimedia.org/r/576293 (https://phabricator.wikimedia.org/T246756) [10:34:34] Operations, Traffic, Patch-For-Review: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 (ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` lvs3005.esams.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/20... [10:37:11] Operations, Puppet, Patch-For-Review, User-jbond: Clean up SSL configuration - https://phabricator.wikimedia.org/T240941 (jbond) [10:39:22] (PS3) Vgutierrez: lvs: Decommission lvs2002.codfw.wmnet [puppet] - https://gerrit.wikimedia.org/r/576293 (https://phabricator.wikimedia.org/T246756) [10:39:51] (CR) Volans: [C: -1] "Needs to wait the transition to es4/5 to be over before merging it."
[software/spicerack] - https://gerrit.wikimedia.org/r/576297 (https://phabricator.wikimedia.org/T226704) (owner: Volans) [10:41:05] Operations, cloud-services-team (Kanban): Migrate Cloud VPS to Puppet 5 / facter 3 - https://phabricator.wikimedia.org/T241719 (MoritzMuehlenhoff) duplicate→Open [10:42:23] (CR) Vgutierrez: [C: +2] "pcc looks good: https://puppet-compiler.wmflabs.org/compiler1002/21210/" [puppet] - https://gerrit.wikimedia.org/r/576293 (https://phabricator.wikimedia.org/T246756) (owner: Vgutierrez) [10:42:44] Operations, cloud-services-team (Kanban): Migrate Cloud VPS to Puppet 5 / facter 3 - https://phabricator.wikimedia.org/T241719 (MoritzMuehlenhoff) >>! In T241719#5932488, @Krenair wrote: > Sure, let's make this the tracking task? Or do you think we should have a separate task to track custom puppetmaster... [10:43:23] (PS9) Jbond: netbox/puppet: Add machinery to get Puppet facts from Netbox [puppet] - https://gerrit.wikimedia.org/r/526664 (https://phabricator.wikimedia.org/T229397) [10:46:21] !log vgutierrez@cumin2001 START - Cookbook sre.hosts.decommission [10:46:23] !log running the decommission cookbook against lvs2002 - T246756 [10:46:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:29] T246756: decommission lvs2002.codfw.wmnet - https://phabricator.wikimedia.org/T246756 [10:46:39] (CR) jerkins-bot: [V: -1] netbox/puppet: Add machinery to get Puppet facts from Netbox [puppet] - https://gerrit.wikimedia.org/r/526664 (https://phabricator.wikimedia.org/T229397) (owner: Jbond) [10:47:04] !log vgutierrez@cumin2001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [10:47:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:11] Operations, ops-codfw, DC-Ops, Traffic, and 2 others: decommission lvs2002.codfw.wmnet -
https://phabricator.wikimedia.org/T246756 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by vgutierrez@cumin2001 for hosts: `lvs2002.codfw.wmnet` - lvs2002.codfw.wmnet (**PASS**) - Downtime... [10:47:28] !log elukey@cumin1001 END (PASS) - Cookbook sre.kafka.roll-restart-brokers (exit_code=0) [10:47:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:48] (CR) Volans: [C: -1] "There is still a bunch of work to do here, see details inline." (12 comments) [puppet] - https://gerrit.wikimedia.org/r/563186 (https://phabricator.wikimedia.org/T229397) (owner: Jbond) [10:48:41] (PS1) Hnowlan: jobrunner: add simple HTTP check [puppet] - https://gerrit.wikimedia.org/r/576301 (https://phabricator.wikimedia.org/T243096) [10:49:28] !log elukey@cumin1001 START - Cookbook sre.kafka.roll-restart-mirror-maker [10:49:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:28] Operations, cloud-services-team (Kanban): Migrate Cloud VPS to Puppet 5 / facter 3 - https://phabricator.wikimedia.org/T241719 (Volans) [10:50:42] Operations, ORES, Scoring-platform-team (Current): ORES uwsgi consumes a large amount of memory and CPU when shutting down (as part of a restart) - https://phabricator.wikimedia.org/T242705 (akosiaris) >>! In T242705#5926753, @Halfak wrote: > I was able to replicate the behavior with a very simple Fl...
[10:50:44] Operations, cloud-services-team (Kanban): Migrate Cloud VPS to Puppet 5 / facter 3 - https://phabricator.wikimedia.org/T241719 (Volans) [10:52:32] (PS5) Jbond: ferm: Add status check [puppet] - https://gerrit.wikimedia.org/r/576101 (https://phabricator.wikimedia.org/T206951) [10:53:18] Operations, cloud-services-team (Kanban): Migrate remaining self-hosted puppet masters to Puppet 5 / facter 3 - https://phabricator.wikimedia.org/T241719 (MoritzMuehlenhoff) [10:57:09] (CR) CDanis: [C: +1] db-eqiad,db-codfw.php: Set es2 on read only [mediawiki-config] - https://gerrit.wikimedia.org/r/576286 (https://phabricator.wikimedia.org/T246072) (owner: Marostegui) [11:01:36] !log elukey@cumin1001 END (PASS) - Cookbook sre.kafka.roll-restart-mirror-maker (exit_code=0) [11:01:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:46] \o/ [11:01:51] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [11:01:51] nice :D [11:01:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:13] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:04:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:46] * elukey needs to create a cookbook for Presto [11:05:22] RECOVERY - Old JVM GC check - cloudelastic1001-cloudelastic-chi-eqiad on cloudelastic1001 is OK: (C)100 gt (W)80 gt 75.25 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1 [11:06:05] (PS1) Vgutierrez: Remove lvs2002 production entries [dns] - https://gerrit.wikimedia.org/r/576305 (https://phabricator.wikimedia.org/T246756) [11:06:50] (CR) Vgutierrez: [C: +2] Remove lvs2002 production entries [dns] - https://gerrit.wikimedia.org/r/576305 (https://phabricator.wikimedia.org/T246756) (owner: Vgutierrez) [11:07:48] Operations,
ops-codfw, DC-Ops, Traffic, and 2 others: decommission lvs2002.codfw.wmnet - https://phabricator.wikimedia.org/T246756 (Vgutierrez) a:Vgutierrez→Papaul [11:08:20] !log START warm cache for db1111 & db1126 for Q15-20 million T219123 (pass 2) [11:08:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:25] T219123: Migrate to and read from new store for item terms - https://phabricator.wikimedia.org/T219123 [11:09:05] !log installing Java security updates on an-airflow, an-launcher and an-presto* [11:09:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:32] (CR) Jbond: "recheck" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/526664 (https://phabricator.wikimedia.org/T229397) (owner: Jbond) [11:12:46] (CR) Jbond: ferm: Add status check (5 comments) [puppet] - https://gerrit.wikimedia.org/r/576101 (https://phabricator.wikimedia.org/T206951) (owner: Jbond) [11:12:55] Operations, Traffic: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['lvs3005.esams.wmnet'] ` and were **ALL** successful. [11:17:55] (CR) Muehlenhoff: ferm: Add status check (1 comment) [puppet] - https://gerrit.wikimedia.org/r/576101 (https://phabricator.wikimedia.org/T206951) (owner: Jbond) [11:18:21] Operations, cloud-services-team (Kanban): Migrate remaining self-hosted puppet masters to Puppet 5 / facter 3 - https://phabricator.wikimedia.org/T241719 (jbond) in relation to jbond-stretch-pm.puppet.eqiad.wmflabs, this is avalible so i can continue to test changes work on puppet version 4. once every...
[11:20:56] (CR) Jbond: ferm: Add status check (1 comment) [puppet] - https://gerrit.wikimedia.org/r/576101 (https://phabricator.wikimedia.org/T206951) (owner: Jbond) [11:21:57] Operations, cloud-services-team (Kanban): Migrate remaining self-hosted puppet masters to Puppet 5 / facter 3 - https://phabricator.wikimedia.org/T241719 (fgiunchedi) [11:22:09] Operations, cloud-services-team (Kanban): Migrate remaining self-hosted puppet masters to Puppet 5 / facter 3 - https://phabricator.wikimedia.org/T241719 (fgiunchedi) [11:25:14] (PS10) Jbond: netbox/puppet: Add machinery to get Puppet facts from Netbox [puppet] - https://gerrit.wikimedia.org/r/526664 (https://phabricator.wikimedia.org/T229397) [11:28:40] (CR) jerkins-bot: [V: -1] netbox/puppet: Add machinery to get Puppet facts from Netbox [puppet] - https://gerrit.wikimedia.org/r/526664 (https://phabricator.wikimedia.org/T229397) (owner: Jbond) [11:32:52] (PS1) Filippo Giunchedi: syslog: allow DOMAIN_NETWORKS [puppet] - https://gerrit.wikimedia.org/r/576309 [11:32:54] (PS1) Filippo Giunchedi: graphite: allow DOMAIN_NETWORKS [puppet] - https://gerrit.wikimedia.org/r/576310 [11:32:56] (PS1) Filippo Giunchedi: icinga: allow DOMAIN_NETWORKS [puppet] - https://gerrit.wikimedia.org/r/576311 [11:32:58] (PS1) Filippo Giunchedi: statsd: allow DOMAIN_NETWORKS [puppet] - https://gerrit.wikimedia.org/r/576312 [11:36:36] (PS11) Jbond: netbox/puppet: Add machinery to get Puppet facts from Netbox [puppet] - https://gerrit.wikimedia.org/r/526664 (https://phabricator.wikimedia.org/T229397) [11:38:55] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/576309 (owner: Filippo Giunchedi) [11:39:19] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/576310 (owner: Filippo Giunchedi) [11:39:49] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/576311 (owner: Filippo
Giunchedi) [11:40:06] (CR) jerkins-bot: [V: -1] netbox/puppet: Add machinery to get Puppet facts from Netbox [puppet] - https://gerrit.wikimedia.org/r/526664 (https://phabricator.wikimedia.org/T229397) (owner: Jbond) [11:40:10] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/576312 (owner: Filippo Giunchedi) [11:49:25] (PS3) Arturo Borrero Gonzalez: codfw: cloudnet: allocate addresses in the cloud transport network [dns] - https://gerrit.wikimedia.org/r/574452 (https://phabricator.wikimedia.org/T245606) [11:50:10] (PS12) Jbond: netbox/puppet: Add machinery to get Puppet facts from Netbox [puppet] - https://gerrit.wikimedia.org/r/526664 (https://phabricator.wikimedia.org/T229397) [11:53:06] (CR) Arturo Borrero Gonzalez: [C: +2] codfw: cloudnet: allocate addresses in the cloud transport network [dns] - https://gerrit.wikimedia.org/r/574452 (https://phabricator.wikimedia.org/T245606) (owner: Arturo Borrero Gonzalez) [11:53:56] (CR) Jbond: [C: +2] netbox/puppet: Add machinery to get Puppet facts from Netbox [puppet] - https://gerrit.wikimedia.org/r/526664 (https://phabricator.wikimedia.org/T229397) (owner: Jbond) [11:54:09] !log disable puppet in order to add netbox hiera backend [11:54:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:49] (PS4) KartikMistry: ContentTranslation: Add URL campaign for WikiGapFinder [mediawiki-config] - https://gerrit.wikimedia.org/r/575974 (https://phabricator.wikimedia.org/T246335) [12:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for European Mid-day SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200303T1200). [12:00:04] kart_: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process.
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [12:00:15] * kart_ is around.. [12:00:30] Let me start. [12:00:31] kart_: please go ahead and ping me when you're done - thanks! [12:00:40] Sure Urbanecm [12:00:43] thanks [12:01:04] (CR) KartikMistry: [C: +2] "SWAT." [mediawiki-config] - https://gerrit.wikimedia.org/r/575974 (https://phabricator.wikimedia.org/T246335) (owner: KartikMistry) [12:02:15] (Merged) jenkins-bot: ContentTranslation: Add URL campaign for WikiGapFinder [mediawiki-config] - https://gerrit.wikimedia.org/r/575974 (https://phabricator.wikimedia.org/T246335) (owner: KartikMistry) [12:03:23] !log cutting branch for 1.35.0-wmf.22 train [12:03:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:49] (PS2) Jbond: netbox/puppet: Add machinery to get Puppet facts from Netbox [puppet] - https://gerrit.wikimedia.org/r/563186 (https://phabricator.wikimedia.org/T229397) [12:06:25] jouncebot: now [12:06:26] For the next 0 hour(s) and 53 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200303T1200) [12:06:41] !log START warm cache for db1111 & db1126 for Q15-20 million T219123 (pass 3) [12:06:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:46] T219123: Migrate to and read from new store for item terms - https://phabricator.wikimedia.org/T219123 [12:08:59] (PS3) Jbond: netbox/puppet: Add machinery to get Puppet facts from Netbox [puppet] - https://gerrit.wikimedia.org/r/563186 (https://phabricator.wikimedia.org/T229397) [12:09:09] !log kartik@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit|575974|ContentTranslation: Add URL campaign for WikiGapFinder (T246335)]] (duration: 00m 56s) [12:09:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:13] T246335: Create URL campaign for Wiki for WikiGap -
https://phabricator.wikimedia.org/T246335 [12:10:31] Urbanecm: finishing 2nd round of scap soon.. [12:10:35] k [12:10:43] !log kartik@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit|575974|ContentTranslation: Add URL campaign for WikiGapFinder (T246335)]], take II (duration: 00m 56s) [12:10:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:55] Urbanecm: done [12:10:58] thanks [12:12:02] !log urbanecm@deploy1001 update-interwiki-cache aborted: Update interwiki cache (duration: 00m 00s) [12:12:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:27] !log urbanecm@deploy1001 update-interwiki-cache aborted: Update interwiki cache (duration: 00m 01s) [12:12:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:44] (PS1) Urbanecm: Update interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/576316 [12:12:46] (CR) Urbanecm: [C: +2] Update interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/576316 (owner: Urbanecm) [12:13:44] Operations, ops-eqiad, serviceops: (Need by: 2020-02-28) rack/setup/install mw[1385-1413].eqiad.wmnet - https://phabricator.wikimedia.org/T241849 (Cmjohnson) [12:13:46] (Merged) jenkins-bot: Update interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/576316 (owner: Urbanecm) [12:14:45] !log urbanecm@deploy1001 Synchronized wmf-config/interwiki.php: Update interwiki cache (duration: 02m 10s) [12:14:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:26] (PS3) Urbanecm: Throttle rule for Czech Wikigap [mediawiki-config] - https://gerrit.wikimedia.org/r/576012 (https://phabricator.wikimedia.org/T246356) [12:16:50] (CR) Urbanecm: [C: +2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/576012 (https://phabricator.wikimedia.org/T246356) (owner: Urbanecm) [12:18:01] (Merged) jenkins-bot:
Throttle rule for Czech Wikigap [mediawiki-config] - https://gerrit.wikimedia.org/r/576012 (https://phabricator.wikimedia.org/T246356) (owner: Urbanecm) [12:19:35] !log urbanecm@deploy1001 Synchronized wmf-config/throttle.php: SWAT: 7b48737: Throttle rule for Czech Wikigap (T246356) (duration: 00m 56s) [12:19:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:40] T246356: Throttle rule for Czech WIkiGap - https://phabricator.wikimedia.org/T246356 [12:19:43] * Urbanecm is done [12:23:29] (CR) Hnowlan: [C: +2] changeprop: New helmfiles for deployment [deployment-charts] - https://gerrit.wikimedia.org/r/574094 (https://phabricator.wikimedia.org/T213193) (owner: Holger Knust) [12:23:49] (Merged) jenkins-bot: changeprop: New helmfiles for deployment [deployment-charts] - https://gerrit.wikimedia.org/r/574094 (https://phabricator.wikimedia.org/T213193) (owner: Holger Knust) [12:31:48] (PS1) Addshore: Read from the new term store up to Q20 mill everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/576319 (https://phabricator.wikimedia.org/T219123) [12:32:30] If noone objects then, I'll do the next stage of terms for wikidatat, 15 mill to 20 mill [12:32:56] (PS4) Jbond: netbox/puppet: Add machinery to get Puppet facts from Netbox [puppet] - https://gerrit.wikimedia.org/r/563186 (https://phabricator.wikimedia.org/T229397) [12:33:02] (CR) Addshore: [C: +2] Read from the new term store up to Q20 mill everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/576319 (https://phabricator.wikimedia.org/T219123) (owner: Addshore) [12:34:00] (CR) jerkins-bot: [V: -1] netbox/puppet: Add machinery to get Puppet facts from Netbox [puppet] - https://gerrit.wikimedia.org/r/563186 (https://phabricator.wikimedia.org/T229397) (owner: Jbond) [12:34:18] (Merged) jenkins-bot: Read from the new term store up to Q20 mill everywhere [mediawiki-config] -
https://gerrit.wikimedia.org/r/576319 (https://phabricator.wikimedia.org/T219123) (owner: Addshore) [12:34:23] (CR) Jbond: "i went through the puppet issues but im not familiar enough with the netbox api to address the python comments" (6 comments) [puppet] - https://gerrit.wikimedia.org/r/563186 (https://phabricator.wikimedia.org/T229397) (owner: Jbond) [12:35:27] !log END warm cache for db1111 & db1126 for Q15-20 million T219123 (pass 3) (finished it early) [12:35:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:33] T219123: Migrate to and read from new store for item terms - https://phabricator.wikimedia.org/T219123 [12:36:01] * Amir1 sites standby in case [12:36:04] *sits [12:36:06] :P [12:36:09] o/ Amir1 [12:36:14] o/ [12:36:40] the increase from 12 -> 15 showed basically no change in resource usage or even # of rows returned etc, probably a very quiet set of ids [12:37:07] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Reading up to Q20M for the new term store everywhere (was Q15M) + warm db1126 & db1111 caches (T219123) (duration: 00m 55s) [12:37:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:09] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Reading up to Q20M for the new term store everywhere (was Q15M) + warm db1126 & db1111 caches (T219123) cache bust (duration: 00m 56s) [12:38:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:21] do I still need to do a cache busting sync? I think I saw something get merged about that? [12:39:15] the ticket is not resolved As far as I can see [12:39:24] ack!
:) [12:40:02] well, that looked like an uneventful new 3 million being read from [12:41:25] addshore: We still have spikes of 300k reads (on the new term store) [12:42:04] https://usercontent.irccloud-cdn.com/file/i3ghAR5n/image.png [12:42:06] ^^ yup [12:42:13] would be interesting to figure out what that is [12:42:57] https://usercontent.irccloud-cdn.com/file/nQ4ZRCy8/image.png [12:43:03] and looking further back, also up to 500k ops [12:43:07] have you check it with lua calls? [12:44:20] it doesnt line up with a spike in formattercache access, so doubt it is lua [12:45:21] !log START warm cache for db1111 & db1126 for Q20-25 million T219123 (pass 1) [12:45:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:26] T219123: Migrate to and read from new store for item terms - https://phabricator.wikimedia.org/T219123 [12:47:03] that's interesting now [12:47:35] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:47:38] the ratio of the reads of the new to old term store is now 4:1 [12:47:43] \o/ [12:48:08] !log hnowlan@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' . [12:48:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:27] Amir1: indeed [12:49:42] I'm basically a cheerleader now [12:49:48] Amir1: and once this weeks train rolls past wikidata this week, we can probably turn off writing : [12:49:48] * Amir1 puts up his uniform [12:49:53] haha [12:50:18] yes, I think we can already start with properties and call them done [12:51:55] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:53:13] Operations, netops, cloud-services-team (Kanban): CloudVPS: enable BGP in the neutron transport network - https://phabricator.wikimedia.org/T245606 (aborrero) I'm seeing this in the openstack BGP speaker: ` 2020-03-03 12:29:28.322 2724 ERROR neutron_dynamic_routing.services.bgp.agent.bgp_dragent [re... [12:53:55] (PS1) Ayounsi: Add routers BGP to LVS/Pybal config [homer/public] - https://gerrit.wikimedia.org/r/576320 [12:55:18] (CR) Ayounsi: "Diff for 2 devices: ['cr1-codfw.wikimedia.org', 'cr2-codfw.wikimedia.org']" [homer/public] - https://gerrit.wikimedia.org/r/576320 (owner: Ayounsi) [12:58:09] (CR) Alexandros Kosiaris: [C: +1] ferm: enable ferm status script [puppet] - https://gerrit.wikimedia.org/r/576102 (https://phabricator.wikimedia.org/T206951) (owner: Jbond) [13:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200303T1300) [13:00:38] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' . [13:00:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:21] (PS1) Jbond: mediawiki::maintenance: force removal of directories [puppet] - https://gerrit.wikimedia.org/r/576323 [13:01:37] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' . [13:01:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:03] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
[13:02:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:45] (CR) jerkins-bot: [V: -1] mediawiki::maintenance: force removal of directories [puppet] - https://gerrit.wikimedia.org/r/576323 (owner: Jbond) [13:05:37] (PS2) Jbond: mediawiki::maintenance: force removal of directories [puppet] - https://gerrit.wikimedia.org/r/576323 [13:09:55] addshore: an idea for the spikes, let's move the formatter cache to the service instead of lua [13:10:11] what do you think? That sounds like a low hanging fruit [13:10:14] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' . [13:10:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:57] (CR) Filippo Giunchedi: [C: +2] "PCC https://puppet-compiler.wmflabs.org/compiler1002/21211/" [puppet] - https://gerrit.wikimedia.org/r/576312 (owner: Filippo Giunchedi) [13:11:03] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' . [13:11:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:06] (CR) Filippo Giunchedi: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1002/21211/" [puppet] - https://gerrit.wikimedia.org/r/576311 (owner: Filippo Giunchedi) [13:11:11] (CR) Filippo Giunchedi: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1002/21211/" [puppet] - https://gerrit.wikimedia.org/r/576310 (owner: Filippo Giunchedi) [13:11:16] (CR) Filippo Giunchedi: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1002/21211/" [puppet] - https://gerrit.wikimedia.org/r/576309 (owner: Filippo Giunchedi) [13:12:08] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'changeprop' for release 'staging' .
[13:12:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:46] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' . [13:12:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:59] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' . [13:13:02] (PS3) Jbond: mediawiki::maintenance: force removal of directories [puppet] - https://gerrit.wikimedia.org/r/576323 (https://phabricator.wikimedia.org/T242910) [13:13:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:12] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' . [13:13:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:47] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'changeprop' for release 'staging' . [13:13:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:41] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'changeprop' for release 'staging' . [13:14:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:24] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'changeprop' for release 'staging' .
[13:19:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:21] (PS1) Vgutierrez: lvs: Add lvs2007 as a high-traffic1 load balancer [puppet] - https://gerrit.wikimedia.org/r/576328 (https://phabricator.wikimedia.org/T196560) [13:27:11] Operations, ops-eqiad, serviceops: (Need by: 2020-02-28) rack/setup/install mw[1385-1413].eqiad.wmnet - https://phabricator.wikimedia.org/T241849 (Cmjohnson) [13:27:56] (PS2) Vgutierrez: lvs: Add lvs2007 as a high-traffic1 load balancer [puppet] - https://gerrit.wikimedia.org/r/576328 (https://phabricator.wikimedia.org/T196560) [13:29:32] (CR) Vgutierrez: [C: +2] "pcc looks happy: https://puppet-compiler.wmflabs.org/compiler1002/21212/" [puppet] - https://gerrit.wikimedia.org/r/576328 (https://phabricator.wikimedia.org/T196560) (owner: Vgutierrez) [13:30:27] Operations, ops-eqiad, serviceops: (Need by: 2020-02-28) rack/setup/install mw[1385-1413].eqiad.wmnet - https://phabricator.wikimedia.org/T241849 (Cmjohnson) Everything but the initial puppet run has been completed. Did the puppet certification process change? This fails now cmjohnson@puppetmaster... [13:30:37] (PS1) Vgutierrez: lvs: Re-enable BGP in lvs3005 [puppet] - https://gerrit.wikimedia.org/r/576332 (https://phabricator.wikimedia.org/T245984) [13:32:02] Operations, ops-codfw, Traffic, Patch-For-Review: (Need by: TBD) rack/setup/install LVS200[7-10] - https://phabricator.wikimedia.org/T196560 (ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` lvs2007.codfw.wmnet ` The log can be found in...
[13:32:26] (CR) Vgutierrez: [C: +2] lvs: Re-enable BGP in lvs3005 [puppet] - https://gerrit.wikimedia.org/r/576332 (https://phabricator.wikimedia.org/T245984) (owner: Vgutierrez) [13:33:23] PROBLEM - Nginx local proxy to apache on mw1315 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1310 bytes in 0.045 second response time https://wikitech.wikimedia.org/wiki/Application_servers [13:33:25] PROBLEM - PHP7 rendering on mw1315 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - header X-Powered-By: PHP/7. not found on http://en.wikipedia.org:80/wiki/Main_Page - 1310 bytes in 0.034 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:33:39] PROBLEM - Apache HTTP on mw1315 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1310 bytes in 0.034 second response time https://wikitech.wikimedia.org/wiki/Application_servers [13:33:55] uh.. [13:34:46] !log Re-enable BGP in lvs3005 - T245984 [13:34:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:51] T245984: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 [13:36:16] (PS1) Jbond: role::lists: use mod_cgid instead fo mod_cgi [puppet] - https://gerrit.wikimedia.org/r/576333 (https://phabricator.wikimedia.org/T242910) [13:37:39] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=idp site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:38:03] Operations, Traffic, Patch-For-Review: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 (Vgutierrez) [13:38:39] (CR) jerkins-bot: [V: -1] role::lists: use mod_cgid instead fo mod_cgi [puppet] - https://gerrit.wikimedia.org/r/576333 (https://phabricator.wikimedia.org/T242910) (owner: Jbond) [13:39:41] PROBLEM - Disk space on notebook1004 is CRITICAL: DISK
CRITICAL - free space: /srv 5229 MB (3% inode=77%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=notebook1004&var-datasource=eqiad+prometheus/ops [13:41:40] !log zpapierski@deploy1001 Started deploy [wdqs/wdqs@8da3ae6]: (no justification provided) [13:41:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:06] !log zpapierski@deploy1001 Finished deploy [wdqs/wdqs@8da3ae6]: (no justification provided) (duration: 00m 26s) [13:42:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:18] (PS1) Alexandros Kosiaris: changeprop: Package 0.9.5 [deployment-charts] - https://gerrit.wikimedia.org/r/576335 (https://phabricator.wikimedia.org/T213193) [13:42:41] (CR) Alexandros Kosiaris: [C: +2] changeprop: Package 0.9.5 [deployment-charts] - https://gerrit.wikimedia.org/r/576335 (https://phabricator.wikimedia.org/T213193) (owner: Alexandros Kosiaris) [13:42:58] (Merged) jenkins-bot: changeprop: Package 0.9.5 [deployment-charts] - https://gerrit.wikimedia.org/r/576335 (https://phabricator.wikimedia.org/T213193) (owner: Alexandros Kosiaris) [13:43:08] (CR) Marostegui: [C: +1] "+1 once es4 and es5 are fully deployed and es2 and es3 set to RO and replication disconnected" [software/spicerack] - https://gerrit.wikimedia.org/r/576297 (https://phabricator.wikimedia.org/T226704) (owner: Volans) [13:44:03] RECOVERY - Disk space on notebook1004 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=notebook1004&var-datasource=eqiad+prometheus/ops [13:44:09] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:45:04] !log zpapierski@deploy1001 Started deploy [wdqs/wdqs@8da3ae6]: (no justification provided) [13:45:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:47] (03PS1) 10Vgutierrez: install_server: Reimage lvs1016 with buster [puppet] - 10https://gerrit.wikimedia.org/r/576336 (https://phabricator.wikimedia.org/T245984) [13:48:24] (03CR) 10Vgutierrez: [C: 03+2] install_server: Reimage lvs1016 with buster [puppet] - 10https://gerrit.wikimedia.org/r/576336 (https://phabricator.wikimedia.org/T245984) (owner: 10Vgutierrez) [13:50:09] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [13:50:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:09] (03PS1) 10Lars Wirzenius: Group0 to 1.35.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576337 [13:52:05] (03PS2) 10Jbond: role::lists: use mod_cgid instead fo mod_cgi [puppet] - 10https://gerrit.wikimedia.org/r/576333 (https://phabricator.wikimedia.org/T242910) [13:52:38] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:52:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:04] (03CR) 10Marostegui: WIP - Introduce profile::mariadb::misc::analytics (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/553742 (https://phabricator.wikimedia.org/T234826) (owner: 10Elukey) [13:54:07] !log reimage lvs1016 with buster - T245984 [13:54:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:11] T245984: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 [13:54:22] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on 
cumin1001.eqiad.wmnet for hosts: ` lvs1016.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/20... [13:55:52] !log zpapierski@deploy1001 Finished deploy [wdqs/wdqs@8da3ae6]: (no justification provided) (duration: 10m 48s) [13:55:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:05] liw and Brennen: Your horoscope predicts another unfortunate Mediawiki train - European+American Version deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200303T1400). [14:01:28] 10Operations, 10ops-codfw, 10Traffic, 10Patch-For-Review: (Need by: TBD) rack/setup/install LVS200[7-10] - https://phabricator.wikimedia.org/T196560 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['lvs2007.codfw.wmnet'] ` and were **ALL** successful. [14:04:06] (03CR) 10Herron: cache: map logstash-next.wikimedia.org to kibana-next lvs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/576151 (https://phabricator.wikimedia.org/T234854) (owner: 10Herron) [14:04:26] (03PS2) 10Ottomata: Add eventgate-analytics-external.svc entries [dns] - 10https://gerrit.wikimedia.org/r/573362 (https://phabricator.wikimedia.org/T233629) [14:06:11] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [14:06:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:39] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:08:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:42] !log liw@deploy1001 Pruned MediaWiki: 1.35.0-wmf.20 (duration: 15m 54s) [14:08:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:09] (03PS1) 10Filippo Giunchedi: puppetdb: fix error log filename [puppet] - 10https://gerrit.wikimedia.org/r/576339 [14:10:41] I'm running late and still doing pre-deploymentwindow things, train will start soon [14:11:04] 10Operations, 10Parsoid-PHP, 10SRE-Access-Requests, 
10serviceops: Give all members of the Parsing team production `deployment` access ( add arlolra to deployers) - https://phabricator.wikimedia.org/T245877 (10Dzahn) 05Resolved→03Open [14:11:10] !log liw@deploy1001 Started scap: testwiki to php-1.35.0-wmf.22 and rebuild l10n cache [14:11:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:30] 10Operations, 10Product-Infrastructure-Team-Backlog, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for Michael Holloway - https://phabricator.wikimedia.org/T246019 (10Dzahn) a:03Dzahn [14:11:36] I am about to roll restart ES in codfw, I am not crazy, working with gehel on it :) [14:12:15] * gehel is keeping his fingers crossed, but has complete trust in elukey [14:12:46] !log elukey@cumin1001 START - Cookbook sre.elasticsearch.rolling-restart [14:12:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:50] (03CR) 10Ottomata: [C: 03+2] Add discovery for eventgate-analytics-external [puppet] - 10https://gerrit.wikimedia.org/r/573366 (https://phabricator.wikimedia.org/T233629) (owner: 10Ottomata) [14:13:02] (03PS2) 10Ottomata: Add discovery for eventgate-analytics-external [puppet] - 10https://gerrit.wikimedia.org/r/573366 (https://phabricator.wikimedia.org/T233629) [14:13:06] (03CR) 10Vgutierrez: Add routers BGP to LVS/Pybal config (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/576320 (owner: 10Ayounsi) [14:13:27] !log beginning procedure to add LVS and discovery for eventgate-analytics-external - T233629 [14:13:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:32] T233629: Public EventGate instance and endpoint for analytics event intake: eventgate-analytics-external - https://phabricator.wikimedia.org/T233629 [14:14:14] 10Operations, 10Traffic: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 (10ops-monitoring-bot) Completed auto-reimage 
of hosts: ` ['lvs1016.eqiad.wmnet'] ` and were **ALL** successful. [14:16:46] (03PS3) 10Ottomata: Add eventgate-analytics-external.svc entries [dns] - 10https://gerrit.wikimedia.org/r/573362 (https://phabricator.wikimedia.org/T233629) [14:16:58] (03PS2) 10Ayounsi: Add routers BGP to LVS/Pybal config [homer/public] - 10https://gerrit.wikimedia.org/r/576320 [14:18:22] (03PS1) 10Vgutierrez: lvs: Test BGP in lvs2007 [puppet] - 10https://gerrit.wikimedia.org/r/576342 (https://phabricator.wikimedia.org/T196560) [14:18:41] (03CR) 10Ottomata: [C: 03+2] Add eventgate-analytics-external.svc entries [dns] - 10https://gerrit.wikimedia.org/r/573362 (https://phabricator.wikimedia.org/T233629) (owner: 10Ottomata) [14:20:06] (03PS2) 10Ottomata: Add LVS for eventgate-analytics-external on port 4692 [puppet] - 10https://gerrit.wikimedia.org/r/573365 (https://phabricator.wikimedia.org/T233629) [14:20:08] (03CR) 10Vgutierrez: [C: 03+2] lvs: Test BGP in lvs2007 [puppet] - 10https://gerrit.wikimedia.org/r/576342 (https://phabricator.wikimedia.org/T196560) (owner: 10Vgutierrez) [14:20:19] (03PS1) 10Alexandros Kosiaris: changeprop: Add missing . 
to -tls-proxy-certs template call [deployment-charts] - 10https://gerrit.wikimedia.org/r/576343 [14:20:21] (03PS1) 10Alexandros Kosiaris: changeprop: Correctly align the prometheus-statsd.conf call [deployment-charts] - 10https://gerrit.wikimedia.org/r/576344 (https://phabricator.wikimedia.org/T213193) [14:22:32] PROBLEM - Improperly owned -0:0- files in /srv/mediawiki-staging on deploy2001 is CRITICAL: Improperly owned (0:0) files in /srv/mediawiki-staging https://wikitech.wikimedia.org/wiki/Monitoring/bad_directory_owner [14:23:46] 10Operations, 10Puppet, 10Patch-For-Review, 10User-jbond: Investigate using the rich_data option to support Binary and binary_file for binary data - https://phabricator.wikimedia.org/T236481 (10jbond) [14:24:00] 10Operations, 10Puppet, 10Patch-For-Review, 10User-jbond: Investigate using the rich_data option to support Binary and binary_file for binary data - https://phabricator.wikimedia.org/T236481 (10jbond) [14:24:10] (03PS3) 10Ottomata: Add LVS for eventgate-analytics-external on port 4692 [puppet] - 10https://gerrit.wikimedia.org/r/573365 (https://phabricator.wikimedia.org/T233629) [14:25:54] 10Operations, 10ops-codfw, 10DC-Ops, 10Traffic, 10decommission: decommission lvs2001.codfw.wmnet - https://phabricator.wikimedia.org/T246779 (10Vgutierrez) [14:28:19] (03CR) 10Ottomata: "Now that we use state: to rollout the new LVS, should the addition of the new monitoring stuff in monitor_services.pp be done in a separat" [puppet] - 10https://gerrit.wikimedia.org/r/573365 (https://phabricator.wikimedia.org/T233629) (owner: 10Ottomata) [14:28:48] RECOVERY - Maps tiles generation on icinga1001 is OK: OK: Less than 90.00% under the threshold [10.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [14:28:52] !log postponing LVS for eventgate-analytics-external unti tomorrow [14:28:55] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:28] (03PS1) 10Vgutierrez: lvs: Decommission lvs2001.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/576346 (https://phabricator.wikimedia.org/T246779) [14:29:33] !log update puppet compiler facts [14:29:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:37] (03CR) 10Herron: [C: 04-1] "-1 because I think this would break lists. please see comments inline, and https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/573711/ " (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/576333 (https://phabricator.wikimedia.org/T242910) (owner: 10Jbond) [14:29:56] 10Operations, 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, and 6 others: Public EventGate instance and endpoint for analytics event intake: eventgate-analytics-external - https://phabricator.wikimedia.org/T233629 (10Ottomata) [14:32:07] (03PS1) 10Marostegui: Revert "db1096: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/576348 [14:33:16] RECOVERY - Improperly owned -0:0- files in /srv/mediawiki-staging on deploy2001 is OK: Files ownership is ok. 
https://wikitech.wikimedia.org/wiki/Monitoring/bad_directory_owner [14:36:44] (03PS1) 10Andrew Bogott: Openstack: add 'queens' config files [puppet] - 10https://gerrit.wikimedia.org/r/576351 [14:37:12] (03PS1) 10Muehlenhoff: Enable cas-server-reports-core in CAS overlay [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/576352 [14:38:22] (03CR) 10Jbond: [C: 03+1] "LGTM" [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/576352 (owner: 10Muehlenhoff) [14:38:53] (03PS1) 10Ayounsi: GoBGP don't restart automatically the deamon [puppet] - 10https://gerrit.wikimedia.org/r/576353 [14:39:23] (03CR) 10Vgutierrez: [C: 03+2] "pcc seems happy: https://puppet-compiler.wmflabs.org/compiler1002/21213/" [puppet] - 10https://gerrit.wikimedia.org/r/576346 (https://phabricator.wikimedia.org/T246779) (owner: 10Vgutierrez) [14:40:09] (03CR) 10Andrew Bogott: [C: 03+2] Openstack: add 'queens' config files [puppet] - 10https://gerrit.wikimedia.org/r/576351 (owner: 10Andrew Bogott) [14:40:58] (03PS1) 10Ayounsi: check_bgp, add more AS# to --critasn [puppet] - 10https://gerrit.wikimedia.org/r/576354 [14:41:08] !log START warm cache for db1111 & db1126 for Q20-25 million T219123 (pass 2) [14:41:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:13] T219123: Migrate to and read from new store for item terms - https://phabricator.wikimedia.org/T219123 [14:41:37] (03CR) 10Marostegui: [C: 03+2] Revert "db1096: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/576348 (owner: 10Marostegui) [14:42:28] !log replace lvs2001 with lvs2007 - T196560 [14:42:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:33] T196560: (Need by: TBD) rack/setup/install LVS200[7-10] - https://phabricator.wikimedia.org/T196560 [14:43:10] (03PS1) 10Andrew Bogott: designate: add queens service manifest [puppet] - 10https://gerrit.wikimedia.org/r/576355 [14:43:20] !log running the decommission cookbook 
against lvs2001.codfw.wmnet - T246779 [14:43:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:24] T246779: decommission lvs2001.codfw.wmnet - https://phabricator.wikimedia.org/T246779 [14:43:28] !log vgutierrez@cumin2001 START - Cookbook sre.hosts.decommission [14:43:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:38] (03CR) 10Ayounsi: "Alex, let me know if I should also add the k8s AS#." [puppet] - 10https://gerrit.wikimedia.org/r/576354 (owner: 10Ayounsi) [14:44:09] !log vgutierrez@cumin2001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [14:44:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:15] 10Operations, 10ops-codfw, 10DC-Ops, 10Traffic, and 2 others: decommission lvs2001.codfw.wmnet - https://phabricator.wikimedia.org/T246779 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by vgutierrez@cumin2001 for hosts: `lvs2001.codfw.wmnet` - lvs2001.codfw.wmnet (**PASS**) - Downtime... 
[14:44:59] (03CR) 10Andrew Bogott: [C: 03+2] designate: add queens service manifest [puppet] - 10https://gerrit.wikimedia.org/r/576355 (owner: 10Andrew Bogott) [14:47:13] (03CR) 10Ayounsi: [C: 03+2] GoBGP don't restart automatically the deamon [puppet] - 10https://gerrit.wikimedia.org/r/576353 (owner: 10Ayounsi) [14:47:48] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is CRITICAL: 1.422e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [14:48:17] (03PS1) 10Andrew Bogott: cloudservices2002-dev: move to openstack queens [puppet] - 10https://gerrit.wikimedia.org/r/576356 [14:48:38] (03PS1) 10Vgutierrez: Remove lvs2001 production entries [dns] - 10https://gerrit.wikimedia.org/r/576357 (https://phabricator.wikimedia.org/T246779) [14:48:40] (03PS2) 10Gehel: wdqs: added link to runbook entry for categories update lag. [puppet] - 10https://gerrit.wikimedia.org/r/576064 (https://phabricator.wikimedia.org/T246497) [14:49:29] (03CR) 10Elukey: WIP - Introduce profile::mariadb::misc::analytics (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/553742 (https://phabricator.wikimedia.org/T234826) (owner: 10Elukey) [14:49:31] (03CR) 10Vgutierrez: [C: 03+2] Remove lvs2001 production entries [dns] - 10https://gerrit.wikimedia.org/r/576357 (https://phabricator.wikimedia.org/T246779) (owner: 10Vgutierrez) [14:50:10] (03CR) 10Gehel: [C: 03+2] wdqs: added link to runbook entry for categories update lag. 
[puppet] - 10https://gerrit.wikimedia.org/r/576064 (https://phabricator.wikimedia.org/T246497) (owner: 10Gehel) [14:51:19] 10Operations, 10ops-codfw, 10DC-Ops, 10Traffic, and 2 others: decommission lvs2001.codfw.wmnet - https://phabricator.wikimedia.org/T246779 (10Vgutierrez) a:05Vgutierrez→03Papaul [14:51:20] I'm going to overrun the train window [14:51:32] 10Operations, 10Traffic: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 (10Vgutierrez) [14:52:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1096:3315 and db1096:3316 after reimage to buster T246604', diff saved to https://phabricator.wikimedia.org/P10591 and previous config saved to /var/cache/conftool/dbconfig/20200303-145230-marostegui.json [14:52:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:37] T246604: Install 1 buster+10.4 host per section - https://phabricator.wikimedia.org/T246604 [14:53:04] PROBLEM - Mediawiki Cirrussearch update rate - codfw on icinga1001 is CRITICAL: CRITICAL: 10.00% of data under the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [14:53:33] elukey: ^ [14:54:08] that's probably going to recover by itself in a few seconds [14:54:17] (03PS1) 10Vgutierrez: install_server,lvs: Reimage lvs1015 with buster [puppet] - 10https://gerrit.wikimedia.org/r/576358 (https://phabricator.wikimedia.org/T245984) [14:54:40] gehel: what does the alarm mean?
[14:55:08] the codfw cluster is processing fewer updates than usual [14:55:53] we created this alert because at one point we had an issue with the cookbook to restart the cluster and it was left in read-only mode for way longer than expected [14:55:54] ah I see from https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1&from=now-3h&to=now [14:55:58] RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is OK: (C)1e+05 gt (W)1e+04 gt 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [14:56:37] there are only three hosts left I think [14:56:40] the alert might be too aggressive, it is expected that the update rate will be lower during a cluster restart, as we do put the cluster in read-only mode on purpose [14:56:59] (03CR) 10Vgutierrez: [C: 03+2] "pcc looks good: https://puppet-compiler.wmflabs.org/compiler1001/21214/" [puppet] - 10https://gerrit.wikimedia.org/r/576358 (https://phabricator.wikimedia.org/T245984) (owner: 10Vgutierrez) [14:56:59] I think 10 nodes left [14:57:44] yes my bad, didn't see the others [14:57:54] (03CR) 10Jbond: "lgtm" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/576354 (owner: 10Ayounsi) [14:57:56] you need a larger screen!
[14:58:08] guilty [14:58:09] :D [14:58:40] note that the issue with this alert on codfw might be that in codfw the cluster recovers much faster [14:59:11] so we iterate through nodes a lot faster, with shorter periods of read/write between two batches of restarts [15:00:23] (03PS6) 10Mholloway: Add chart for mobileapps [deployment-charts] - 10https://gerrit.wikimedia.org/r/570162 (https://phabricator.wikimedia.org/T218733) [15:00:34] (03PS1) 10BBlack: check_bgp: add optional descriptions to crit ASNs [puppet] - 10https://gerrit.wikimedia.org/r/576359 [15:00:36] (03CR) 10jerkins-bot: [V: 04-1] Add chart for mobileapps [deployment-charts] - 10https://gerrit.wikimedia.org/r/570162 (https://phabricator.wikimedia.org/T218733) (owner: 10Mholloway) [15:00:45] (03CR) 10Mholloway: Add chart for mobileapps (037 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/570162 (https://phabricator.wikimedia.org/T218733) (owner: 10Mholloway) [15:00:54] (03PS1) 10Ottomata: Remove monitoring and alerting for eventgate to-delete services [puppet] - 10https://gerrit.wikimedia.org/r/576361 (https://phabricator.wikimedia.org/T245203) [15:00:57] !log reimage lvs1015 with buster - T245984 [15:01:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:02] T245984: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 [15:01:26] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` lvs1015.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/20...
[15:01:53] (03PS2) 10Jcrespo: mysql: Fix mysql server configuration for the percona flavour [puppet] - 10https://gerrit.wikimedia.org/r/575496 (https://phabricator.wikimedia.org/T193224) [15:01:57] (03CR) 10Andrew Bogott: [C: 03+2] cloudservices2002-dev: move to openstack queens [puppet] - 10https://gerrit.wikimedia.org/r/576356 (owner: 10Andrew Bogott) [15:02:29] (03PS2) 10Ottomata: Remove monitoring and alerting for eventgate to-delete services [puppet] - 10https://gerrit.wikimedia.org/r/576361 (https://phabricator.wikimedia.org/T245203) [15:02:41] (03PS7) 10Mholloway: Add chart for mobileapps [deployment-charts] - 10https://gerrit.wikimedia.org/r/570162 (https://phabricator.wikimedia.org/T218733) [15:02:46] (03CR) 10Ppchelko: [C: 03+1] changeprop: Correctly align the prometheus-statsd.conf call [deployment-charts] - 10https://gerrit.wikimedia.org/r/576344 (https://phabricator.wikimedia.org/T213193) (owner: 10Alexandros Kosiaris) [15:04:43] (03PS1) 10Herron: lists: don't assume lists IP is an interface::alias [puppet] - 10https://gerrit.wikimedia.org/r/576362 (https://phabricator.wikimedia.org/T224586) [15:06:05] (03PS3) 10Dzahn: admins: add mholloway to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/575388 (https://phabricator.wikimedia.org/T246019) [15:07:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Decrease a bit the weight for db1126', diff saved to https://phabricator.wikimedia.org/P10594 and previous config saved to /var/cache/conftool/dbconfig/20200303-150712-marostegui.json [15:07:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:16] (03CR) 10Dzahn: [C: 03+2] "approved by Nuria, other check boxes already checked" [puppet] - 10https://gerrit.wikimedia.org/r/575388 (https://phabricator.wikimedia.org/T246019) (owner: 10Dzahn) [15:07:18] (03CR) 10jerkins-bot: [V: 04-1] lists: don't assume lists IP is an interface::alias [puppet] - 10https://gerrit.wikimedia.org/r/576362 
(https://phabricator.wikimedia.org/T224586) (owner: 10Herron) [15:08:19] (03CR) 10Ottomata: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/21215/icinga1001.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/576361 (https://phabricator.wikimedia.org/T245203) (owner: 10Ottomata) [15:08:50] (03PS2) 10Muehlenhoff: Enable cas-server-reports-core in CAS overlay [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/576352 [15:09:18] PROBLEM - Disk space on notebook1004 is CRITICAL: DISK CRITICAL - free space: /srv 3882 MB (2% inode=77%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=notebook1004&var-datasource=eqiad+prometheus/ops [15:10:48] (03PS1) 10Jbond: systemd::syslog: ensure log dir is removed if resource is absent [puppet] - 10https://gerrit.wikimedia.org/r/576364 (https://phabricator.wikimedia.org/T242910) [15:11:24] (03CR) 10Hnowlan: [C: 03+2] changeprop: Correctly align the prometheus-statsd.conf call [deployment-charts] - 10https://gerrit.wikimedia.org/r/576344 (https://phabricator.wikimedia.org/T213193) (owner: 10Alexandros Kosiaris) [15:11:33] (03PS2) 10Herron: lists: don't assume lists IP is an interface::alias [puppet] - 10https://gerrit.wikimedia.org/r/576362 (https://phabricator.wikimedia.org/T224586) [15:13:06] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [15:13:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:07] (03CR) 10Muehlenhoff: cache: map logstash-next.wikimedia.org to kibana-next lvs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/576151 (https://phabricator.wikimedia.org/T234854) (owner: 10Herron) [15:15:27] !log elukey@cumin1001 END (PASS) - Cookbook sre.elasticsearch.rolling-restart (exit_code=0) [15:15:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:34] \o/ [15:15:38] !log vgutierrez@cumin1001 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) [15:15:40] 10Operations, 10netops: Add graceful-restart to cr2-esams - https://phabricator.wikimedia.org/T246338 (10CDanis) Happy to help here, e.g. to perform this at an off-peak time in esams/knams. [15:15:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:42] elukey: congrats! [15:15:55] (03CR) 10Jbond: [C: 03+1] Enable cas-server-reports-core in CAS overlay [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/576352 (owner: 10Muehlenhoff) [15:16:24] 10Operations, 10Product-Infrastructure-Team-Backlog, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for Michael Holloway - https://phabricator.wikimedia.org/T246019 (10Dzahn) @Nuria Thanks! Done. @Mholloway You have been added to the analytics-privatedata-users... [15:16:43] 10Operations, 10Product-Infrastructure-Team-Backlog, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for Michael Holloway - https://phabricator.wikimedia.org/T246019 (10Dzahn) 05Open→03Resolved [15:17:19] (03CR) 10Hnowlan: [C: 03+2] changeprop: Add missing . to -tls-proxy-certs template call [deployment-charts] - 10https://gerrit.wikimedia.org/r/576343 (owner: 10Alexandros Kosiaris) [15:17:41] (03Merged) 10jenkins-bot: changeprop: Add missing . 
to -tls-proxy-certs template call [deployment-charts] - 10https://gerrit.wikimedia.org/r/576343 (owner: 10Alexandros Kosiaris) [15:17:43] (03Merged) 10jenkins-bot: changeprop: Correctly align the prometheus-statsd.conf call [deployment-charts] - 10https://gerrit.wikimedia.org/r/576344 (https://phabricator.wikimedia.org/T213193) (owner: 10Alexandros Kosiaris) [15:18:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1096:3315 and db1096:3316 after reimage to buster T246604', diff saved to https://phabricator.wikimedia.org/P10595 and previous config saved to /var/cache/conftool/dbconfig/20200303-151805-marostegui.json [15:18:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:10] T246604: Install 1 buster+10.4 host per section - https://phabricator.wikimedia.org/T246604 [15:18:48] (03CR) 10Ayounsi: [C: 03+1] "+1 based on IRC chat." [puppet] - 10https://gerrit.wikimedia.org/r/576359 (owner: 10BBlack) [15:19:41] 10Operations, 10Traffic: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['lvs1015.eqiad.wmnet'] ` and were **ALL** successful. [15:20:25] !log hnowlan@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' . 
[15:20:25] (03PS5) 10Mholloway: WIP: Proton charts first draft [deployment-charts] - 10https://gerrit.wikimedia.org/r/557090 (https://phabricator.wikimedia.org/T238830) (owner: 10MSantos) [15:20:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:39] (03CR) 10jerkins-bot: [V: 04-1] WIP: Proton charts first draft [deployment-charts] - 10https://gerrit.wikimedia.org/r/557090 (https://phabricator.wikimedia.org/T238830) (owner: 10MSantos) [15:20:49] (03PS3) 10Herron: lists: don't assume lists IP is an interface::alias [puppet] - 10https://gerrit.wikimedia.org/r/576362 (https://phabricator.wikimedia.org/T224586) [15:20:55] (03PS6) 10Mholloway: WIP: Add chart for chromium-render [deployment-charts] - 10https://gerrit.wikimedia.org/r/557090 (https://phabricator.wikimedia.org/T238830) (owner: 10MSantos) [15:21:19] (03CR) 10jerkins-bot: [V: 04-1] WIP: Add chart for chromium-render [deployment-charts] - 10https://gerrit.wikimedia.org/r/557090 (https://phabricator.wikimedia.org/T238830) (owner: 10MSantos) [15:21:24] (03PS1) 10Jcrespo: prometheus-mysqld-exporter: Fix options for multiinstance hosts [puppet] - 10https://gerrit.wikimedia.org/r/576368 (https://phabricator.wikimedia.org/T242702) [15:21:45] 10Operations, 10Parsoid-PHP, 10SRE-Access-Requests, 10serviceops: Give all members of the Parsing team production `deployment` access ( add arlolra to deployers) - https://phabricator.wikimedia.org/T245877 (10Dzahn) Running scap from the deployment server (what James does) and running scap pull on a single... 
[15:22:14] (03CR) 10jerkins-bot: [V: 04-1] prometheus-mysqld-exporter: Fix options for multiinstance hosts [puppet] - 10https://gerrit.wikimedia.org/r/576368 (https://phabricator.wikimedia.org/T242702) (owner: 10Jcrespo) [15:22:34] !log liw@deploy1001 Finished scap: testwiki to php-1.35.0-wmf.22 and rebuild l10n cache (duration: 71m 23s) [15:22:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:38] (03CR) 10Mholloway: WIP: Add chart for chromium-render (039 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/557090 (https://phabricator.wikimedia.org/T238830) (owner: 10MSantos) [15:23:18] (03PS1) 10Vgutierrez: lvs: Re-enable BGP in lvs1015 [puppet] - 10https://gerrit.wikimedia.org/r/576369 (https://phabricator.wikimedia.org/T245984) [15:25:02] (03CR) 10Ayounsi: check_bgp, add more AS# to --critasn (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/576354 (owner: 10Ayounsi) [15:25:13] (03PS1) 10Ottomata: Switch change-prop and restbase event_service_uri to new TLS eventgate-main [puppet] - 10https://gerrit.wikimedia.org/r/576370 (https://phabricator.wikimedia.org/T245203) [15:25:24] (03PS2) 10Ayounsi: check_bgp, add more AS# to --critasn [puppet] - 10https://gerrit.wikimedia.org/r/576354 [15:25:25] RECOVERY - Disk space on notebook1004 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=notebook1004&var-datasource=eqiad+prometheus/ops [15:25:33] (03PS7) 10Mholloway: WIP: Add chart for chromium-render [deployment-charts] - 10https://gerrit.wikimedia.org/r/557090 (https://phabricator.wikimedia.org/T238830) (owner: 10MSantos) [15:26:01] (03CR) 10jerkins-bot: [V: 04-1] WIP: Add chart for chromium-render [deployment-charts] - 10https://gerrit.wikimedia.org/r/557090 (https://phabricator.wikimedia.org/T238830) (owner: 10MSantos) [15:28:41] (03CR) 10Lars Wirzenius: [C: 03+2] Group0 to 1.35.0-wmf.22 [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/576337 (owner: 10Lars Wirzenius) [15:29:03] (03CR) 10Ppchelko: [C: 03+1] Switch change-prop and restbase event_service_uri to new TLS eventgate-main [puppet] - 10https://gerrit.wikimedia.org/r/576370 (https://phabricator.wikimedia.org/T245203) (owner: 10Ottomata) [15:29:41] (03CR) 10Vgutierrez: [C: 03+2] lvs: Re-enable BGP in lvs1015 [puppet] - 10https://gerrit.wikimedia.org/r/576369 (https://phabricator.wikimedia.org/T245984) (owner: 10Vgutierrez) [15:29:57] (03Merged) 10jenkins-bot: Group0 to 1.35.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576337 (owner: 10Lars Wirzenius) [15:30:46] you can run scap on just a single server also without going to that server directly, right? [15:30:58] !log hnowlan@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' . [15:31:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:14] !log liw@deploy1001 Started scap: group0 to 1.35.0-wmf.22 [15:31:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:20] !log Re-enable BGP in lvs1015 - T245984 [15:31:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:26] T245984: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 [15:31:27] so in addition to running 'scap pull' directly on server , you could achieve the same from deploy1001 [15:33:01] (03PS1) 10Vgutierrez: install_server,lvs: Reimage lvs1014 with buster [puppet] - 10https://gerrit.wikimedia.org/r/576371 (https://phabricator.wikimedia.org/T245984) [15:33:18] (03PS2) 10Muehlenhoff: cache: map logstash-next.wikimedia.org and cas-logstash to kibana-next lvs [puppet] - 10https://gerrit.wikimedia.org/r/576151 (https://phabricator.wikimedia.org/T234854) (owner: 10Herron) [15:33:29] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [15:34:22] (03PS8) 10Mholloway: WIP: Add chart for chromium-render [deployment-charts] - 10https://gerrit.wikimedia.org/r/557090 (https://phabricator.wikimedia.org/T238830) (owner: 10MSantos) [15:34:38] (03CR) 10jerkins-bot: [V: 04-1] WIP: Add chart for chromium-render [deployment-charts] - 10https://gerrit.wikimedia.org/r/557090 (https://phabricator.wikimedia.org/T238830) (owner: 10MSantos) [15:34:51] !log hnowlan@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' . [15:34:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:18] !log hnowlan@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' . [15:35:19] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [15:35:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:26] (03CR) 10Vgutierrez: [C: 04-1] "kibana-next.svc.eqiad.wmnet doesn't seem to be a valid SAN for the TLS certificate served on kibana-next.svc.eqiad.wmnet:443" [puppet] - 10https://gerrit.wikimedia.org/r/576151 (https://phabricator.wikimedia.org/T234854) (owner: 10Herron) [15:36:27] herron: that TLS cert needs to be updated with the new SANs [15:36:39] * addshore looks up [15:36:41] (03CR) 10Nikerabbit: [C: 03+1] cxserver: Remove logstash logging [deployment-charts] - 10https://gerrit.wikimedia.org/r/573240 (https://phabricator.wikimedia.org/T219921) (owner: 10Alexandros Kosiaris) [15:36:54] right now it contains "X509v3 Subject Alternative Name: DNS:kibana.discovery.wmnet, DNS:kibana.svc.eqiad.wmnet, DNS:kibana.svc.codfw.wmnet, DNS:logstash.wikimedia.org" [15:37:53] so ats-be will refuse to connect to that endpoint for logstash-next.wikimedia.org, kibana-next.svc.eqiad.wmnet or cas-logstash.wikimedia.org [15:38:47] herron: that should be https://wikitech.wikimedia.org/wiki/Cergen#Update_a_certificate -> /srv/private/modules/secret/secrets/certificates/certificate.manifests.d/kibana.certs.yaml [15:39:27] RECOVERY - Mediawiki Cirrussearch update rate - codfw on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [15:42:09] (03PS2) 10Jcrespo: prometheus-mysqld-exporter: Fix options for multiinstance hosts [puppet] - 10https://gerrit.wikimedia.org/r/576368 (https://phabricator.wikimedia.org/T242702) [15:42:18] (03CR) 10Vgutierrez: [C: 03+2] "pcc is happy: https://puppet-compiler.wmflabs.org/compiler1001/21219/" [puppet] - 
10https://gerrit.wikimedia.org/r/576371 (https://phabricator.wikimedia.org/T245984) (owner: 10Vgutierrez) [15:42:26] (03PS9) 10Mholloway: WIP: Add chart for chromium-render [deployment-charts] - 10https://gerrit.wikimedia.org/r/557090 (https://phabricator.wikimedia.org/T238830) (owner: 10MSantos) [15:43:14] 10Operations, 10Cassandra, 10Core Platform Team Workboards (Clinic Duty Team), 10Patch-For-Review, 10User-Eevans: enable authenticated access to Cassandra JMX - https://phabricator.wikimedia.org/T92471 (10Eevans) a:05Eevans→03None [15:43:21] (03PS1) 10Alexandros Kosiaris: Add a nutcracker container image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/576372 (https://phabricator.wikimedia.org/T21319) [15:43:23] !log wtp1025 - scap pull [15:43:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:59] (03CR) 10Alexandros Kosiaris: [C: 03+2] Add a nutcracker container image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/576372 (https://phabricator.wikimedia.org/T21319) (owner: 10Alexandros Kosiaris) [15:44:43] (03CR) 10Jbond: [C: 03+1] check_bgp, add more AS# to --critasn [puppet] - 10https://gerrit.wikimedia.org/r/576354 (owner: 10Ayounsi) [15:44:58] !log reimage lvs1014 with buster - T245984 [15:45:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:03] T245984: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 [15:45:09] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` lvs1014.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/20... 
[15:45:18] !log wtp1025 - scap pull as user cscott - testing sudo privs issue [15:45:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:25] !log Stopping pybal on lvs2009 to let lvs2010 get its traffic - T246686 [15:45:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:29] T246686: device asw-d-codfw.mgmt.codfw.wmnet - Inbound interface errors - https://phabricator.wikimedia.org/T246686 [15:45:39] (CR) Alexandros Kosiaris: [V: +2 C: +2] Add a nutcracker container image [docker-images/production-images] - https://gerrit.wikimedia.org/r/576372 (https://phabricator.wikimedia.org/T21319) (owner: Alexandros Kosiaris) [15:46:12] Operations, netops: Add graceful-restart to cr2-esams - https://phabricator.wikimedia.org/T246338 (ayounsi) Steps are: # Depool esams # Ssh to the mgmt interface `re0.cr2-esams.mgmt.esams.wmnet` less likely to be impacted by the flaps # run `conf` then `set routing-options graceful-restart` # `commit` #... [15:46:45] Operations, ops-codfw, fundraising-tech-ops: new payments2001 and payments2003 bonded ethernet network error/warning - https://phabricator.wikimedia.org/T246492 (Jgreen) p:Triage→Medium [15:46:52] Operations, ops-codfw, fundraising-tech-ops: new payments2001 and payments2003 bonded ethernet network error/warning - https://phabricator.wikimedia.org/T246492 (Jgreen) a:Jgreen→None [15:47:29] !log hnowlan@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
[15:47:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:43] 10Operations, 10ops-codfw, 10fundraising-tech-ops: rack/setup/install frpm2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T242269 (10Jgreen) [15:47:45] 10Operations, 10ops-codfw, 10Core Platform Team Workboards (Clinic Duty Team): Bootstrap new Cassandra instances: restbase202[123]-{a,b,c} - https://phabricator.wikimedia.org/T243000 (10Eevans) Yeah, it's ready to be closed, but AFAIK, we're supposed to wait for the PM (@CCicalese_WMF) to close it after movi... [15:49:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1096:3315 and db1096:3316 after reimage to buster T246604', diff saved to https://phabricator.wikimedia.org/P10596 and previous config saved to /var/cache/conftool/dbconfig/20200303-154913-marostegui.json [15:49:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:18] T246604: Install 1 buster+10.4 host per section - https://phabricator.wikimedia.org/T246604 [15:49:20] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Enable cas-server-reports-core in CAS overlay [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/576352 (owner: 10Muehlenhoff) [15:54:04] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/576339 (owner: 10Filippo Giunchedi) [15:55:44] !log liw@deploy1001 Finished scap: group0 to 1.35.0-wmf.22 (duration: 24m 29s) [15:55:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:57] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [15:57:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:00] PROBLEM - Check systemd state on mw2290 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:59:01] (CR) Ppchelko: jobrunner: add simple HTTP check (1 comment) [puppet] - https://gerrit.wikimedia.org/r/576301 (https://phabricator.wikimedia.org/T243096) (owner: Hnowlan) [15:59:27] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:59:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:04] Operations, procurement: eqiad: (16) Hadoop worker node refresh - FY19/20 Q3 - https://phabricator.wikimedia.org/T246784 (Ottomata) [16:00:15] Operations, procurement: eqiad: (16) Hadoop worker node refresh - FY19/20 Q3 - https://phabricator.wikimedia.org/T246784 (Ottomata) [16:00:39] (PS1) Hnowlan: changeprop: Bump CPU usage [deployment-charts] - https://gerrit.wikimedia.org/r/576377 (https://phabricator.wikimedia.org/T213193) [16:04:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1096:3315 and db1096:3316 after reimage to buster T246604', diff saved to https://phabricator.wikimedia.org/P10597 and previous config saved to /var/cache/conftool/dbconfig/20200303-160433-marostegui.json [16:04:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:39] T246604: Install 1 buster+10.4 host per section - https://phabricator.wikimedia.org/T246604 [16:05:05] Operations, Traffic, Patch-For-Review: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['lvs1014.eqiad.wmnet'] ` and were **ALL** successful.
[16:05:49] !log Starting pybal on lvs2009 - T246686 [16:05:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:53] T246686: device asw-d-codfw.mgmt.codfw.wmnet - Inbound interface errors - https://phabricator.wikimedia.org/T246686 [16:06:13] (03CR) 10BBlack: [C: 03+2] check_bgp: add optional descriptions to crit ASNs [puppet] - 10https://gerrit.wikimedia.org/r/576359 (owner: 10BBlack) [16:07:19] (03CR) 10Ppchelko: [C: 04-1] changeprop: Bump CPU usage (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/576377 (https://phabricator.wikimedia.org/T213193) (owner: 10Hnowlan) [16:07:26] vgutierrez: ah of course, thanks will fix [16:08:22] (03PS4) 10Krinkle: multiversion: Optimise readDbListFile() function by 40% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576091 (https://phabricator.wikimedia.org/T169821) [16:08:39] * Krinkle staging on mwdebug1002 [16:08:42] (03PS1) 10Vgutierrez: lvs: Re-enable BGP in lvs1014 [puppet] - 10https://gerrit.wikimedia.org/r/576379 (https://phabricator.wikimedia.org/T245984) [16:08:45] had planned it for a follow-up patch but you are right makes more sense to do ahead of time [16:08:50] (03CR) 10Krinkle: [C: 03+2] multiversion: Optimise readDbListFile() function by 40% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576091 (https://phabricator.wikimedia.org/T169821) (owner: 10Krinkle) [16:09:04] (03PS1) 10Alexandros Kosiaris: pdu: Fix type for $breaker variable [puppet] - 10https://gerrit.wikimedia.org/r/576380 [16:09:10] (03CR) 10Ayounsi: [C: 03+2] check_bgp, add more AS# to --critasn [puppet] - 10https://gerrit.wikimedia.org/r/576354 (owner: 10Ayounsi) [16:09:26] 10Operations, 10procurement: eqiad: (16) Hadoop worker node refresh - FY19/20 Q3 - https://phabricator.wikimedia.org/T246784 (10Ottomata) [16:10:05] (03Merged) 10jenkins-bot: multiversion: Optimise readDbListFile() function by 40% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576091 
(https://phabricator.wikimedia.org/T169821) (owner: 10Krinkle) [16:10:16] (03CR) 10Vgutierrez: [C: 03+2] lvs: Re-enable BGP in lvs1014 [puppet] - 10https://gerrit.wikimedia.org/r/576379 (https://phabricator.wikimedia.org/T245984) (owner: 10Vgutierrez) [16:10:46] XioNoX: merge faster man! ;P [16:11:07] (03PS2) 10Ottomata: Switch change-prop and restbase event_service_uri to new TLS eventgate-main [puppet] - 10https://gerrit.wikimedia.org/r/576370 (https://phabricator.wikimedia.org/T245203) [16:11:12] in a rush? :) [16:11:16] aahaha nah [16:11:23] just joking [16:12:10] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Switch change-prop and restbase event_service_uri to new TLS eventgate-main [puppet] - 10https://gerrit.wikimedia.org/r/576370 (https://phabricator.wikimedia.org/T245203) (owner: 10Ottomata) [16:12:39] 10Operations, 10Security-Team, 10Stewards-and-global-tools, 10Security, 10User-revi: Security Issue Access Request for 2020 Stewards - https://phabricator.wikimedia.org/T246449 (10jbond) > I'll ask folks on our end about duplicate entry thinking (also worth asking @jbond if the automated logic can handle... [16:12:41] PROBLEM - Check the last execution of php7.2-fpm_check_restart on mw2290 is CRITICAL: CRITICAL: Status of the systemd unit php7.2-fpm_check_restart https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [16:13:02] !log Re-enable BGP in lvs1014 - T245984 [16:13:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:07] T245984: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 [16:13:13] (03CR) 10Filippo Giunchedi: [C: 03+2] puppetdb: fix error log filename [puppet] - 10https://gerrit.wikimedia.org/r/576339 (owner: 10Filippo Giunchedi) [16:13:21] PROBLEM - Check systemd state on mw2178 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:13:49] Operations, Traffic, Patch-For-Review: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 (Vgutierrez) [16:14:06] !log switching restbase & change prop to new eventgate-main LVS TLS ports [16:14:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:23] PROBLEM - Check the last execution of php7.2-fpm_check_restart on mw2178 is CRITICAL: CRITICAL: Status of the systemd unit php7.2-fpm_check_restart https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [16:15:33] (PS1) Vgutierrez: install_server,lvs: Reimage lvs1013 with buster [puppet] - https://gerrit.wikimedia.org/r/576381 (https://phabricator.wikimedia.org/T245984) [16:16:21] Operations, ops-codfw, Core Platform Team Workboards (Clinic Duty Team): Bootstrap new Cassandra instances: restbase202[123]-{a,b,c} - https://phabricator.wikimedia.org/T243000 (CCicalese_WMF) Open→Resolved Thanks for letting me know. Marking resolved.
[16:16:23] 10Operations, 10ops-codfw, 10Core Platform Team Workboards (Clinic Duty Team): (No Need By Date Provided) rack/setup/install restbase202[123] - https://phabricator.wikimedia.org/T241790 (10CCicalese_WMF) [16:16:29] (03PS3) 10Jcrespo: prometheus-mysqld-exporter: Fix options for multiinstance hosts [puppet] - 10https://gerrit.wikimedia.org/r/576368 (https://phabricator.wikimedia.org/T242702) [16:17:44] !log restart restbase on 2009 for T242224 [16:17:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:48] T242224: Switch all eventgate clients to use new TLS port - https://phabricator.wikimedia.org/T242224 [16:17:57] 10Operations, 10Pybal, 10Traffic: Add graceful-restart capability to PyBal - https://phabricator.wikimedia.org/T246788 (10ayounsi) p:05Triage→03Lowest [16:18:17] (03PS4) 10Jcrespo: prometheus-mysqld-exporter: Fix options for multiinstance hosts [puppet] - 10https://gerrit.wikimedia.org/r/576368 (https://phabricator.wikimedia.org/T242702) [16:19:38] (03CR) 10Vgutierrez: [C: 03+2] "pcc looks good: https://puppet-compiler.wmflabs.org/compiler1001/21224/" [puppet] - 10https://gerrit.wikimedia.org/r/576381 (https://phabricator.wikimedia.org/T245984) (owner: 10Vgutierrez) [16:19:46] (03PS2) 10Hnowlan: changeprop: Bump CPU usage [deployment-charts] - 10https://gerrit.wikimedia.org/r/576377 (https://phabricator.wikimedia.org/T213193) [16:19:52] (03PS2) 10Vgutierrez: install_server,lvs: Reimage lvs1013 with buster [puppet] - 10https://gerrit.wikimedia.org/r/576381 (https://phabricator.wikimedia.org/T245984) [16:20:07] (03CR) 10Marostegui: WIP - Introduce profile::mariadb::misc::analytics (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/553742 (https://phabricator.wikimedia.org/T234826) (owner: 10Elukey) [16:20:53] (03CR) 10Jcrespo: "https://puppet-compiler.wmflabs.org/compiler1002/21223/" [puppet] - 10https://gerrit.wikimedia.org/r/576368 (https://phabricator.wikimedia.org/T242702) (owner: 10Jcrespo) 
[16:22:18] godog: may I merge puppetdb: fix error log filename (329c592cd0)? [16:23:27] (03PS5) 10CRusnov: netbox: Add framework for exposing scripts to internal services [puppet] - 10https://gerrit.wikimedia.org/r/575603 (https://phabricator.wikimedia.org/T243927) [16:23:48] 10Operations, 10netops, 10Wikimedia-Incident: Add graceful-restart to cr2-esams - https://phabricator.wikimedia.org/T246338 (10RLazarus) [16:24:10] 10Operations, 10ops-codfw, 10Traffic, 10netops: device asw-d-codfw.mgmt.codfw.wmnet - Inbound interface errors - https://phabricator.wikimedia.org/T246686 (10Papaul) 05Open→03Resolved I think all good now on the interface . closing this task. Just replaced the transceiver on the switch side. ` Laser... [16:24:24] (03CR) 10Alexandros Kosiaris: [C: 04-1] "1 typo inline but also to answer this." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/573365 (https://phabricator.wikimedia.org/T233629) (owner: 10Ottomata) [16:24:28] (03PS1) 10Dzahn: admins: let parsoid-admins run scap pull as mwdeploy [puppet] - 10https://gerrit.wikimedia.org/r/576383 (https://phabricator.wikimedia.org/T245877) [16:24:36] (03CR) 10Marostegui: [C: 03+1] "Heh, I was first confused on why db1095 would have changes, as it is stretch, but it is just the items order." 
[puppet] - 10https://gerrit.wikimedia.org/r/576368 (https://phabricator.wikimedia.org/T242702) (owner: 10Jcrespo) [16:24:53] (03PS3) 10Hnowlan: changeprop: Bump CPU usage [deployment-charts] - 10https://gerrit.wikimedia.org/r/576377 (https://phabricator.wikimedia.org/T213193) [16:25:40] (03CR) 10jerkins-bot: [V: 04-1] admins: let parsoid-admins run scap pull as mwdeploy [puppet] - 10https://gerrit.wikimedia.org/r/576383 (https://phabricator.wikimedia.org/T245877) (owner: 10Dzahn) [16:26:01] (03PS6) 10CRusnov: netbox: Add framework for exposing scripts to internal services [puppet] - 10https://gerrit.wikimedia.org/r/575603 (https://phabricator.wikimedia.org/T243927) [16:26:39] 10Operations, 10ops-codfw, 10Wikimedia-Logstash, 10Patch-For-Review: (Need by: TBD) rack/setup/install logstash202[6-9].codfw.wmnet - https://phabricator.wikimedia.org/T240882 (10Papaul) @herron is it possible to create another task to track this down and close the racking and setup task? thanks. [16:26:41] (03CR) 10Jcrespo: "> Patch Set 4: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/576368 (https://phabricator.wikimedia.org/T242702) (owner: 10Jcrespo) [16:27:24] (03PS4) 10Ottomata: Add LVS for eventgate-analytics-external on port 4692 [puppet] - 10https://gerrit.wikimedia.org/r/573365 (https://phabricator.wikimedia.org/T233629) [16:27:29] (03CR) 10Ottomata: Add LVS for eventgate-analytics-external on port 4692 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/573365 (https://phabricator.wikimedia.org/T233629) (owner: 10Ottomata) [16:28:05] (03CR) 10Alexandros Kosiaris: [C: 03+1] changeprop: Bump CPU usage [deployment-charts] - 10https://gerrit.wikimedia.org/r/576377 (https://phabricator.wikimedia.org/T213193) (owner: 10Hnowlan) [16:28:07] (03CR) 10Jcrespo: [C: 03+2] prometheus-mysqld-exporter: Fix options for multiinstance hosts [puppet] - 10https://gerrit.wikimedia.org/r/576368 (https://phabricator.wikimedia.org/T242702) (owner: 10Jcrespo) [16:28:08] !log stopping 
pybal on lvs5003 to test the new icinga checks (will cause a BGP alert, among others) [16:28:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:34] (CR) Alexandros Kosiaris: [C: +2] Add LVS for eventgate-analytics-external on port 4692 [puppet] - https://gerrit.wikimedia.org/r/573365 (https://phabricator.wikimedia.org/T233629) (owner: Ottomata) [16:30:14] Operations, ops-codfw, Traffic: lvs2002: raid battery failure - https://phabricator.wikimedia.org/T213417 (Papaul) Open→Declined declining this since there is a decommissioning task @ T246756 [16:30:37] Operations, ops-codfw: (OoW) lvs2002 repeated usb connect/disconnect message - https://phabricator.wikimedia.org/T148017 (Papaul) Open→Declined declining this since there is a decommissioning task @ T246756 [16:30:39] (PS4) Hnowlan: changeprop: Bump CPU usage [deployment-charts] - https://gerrit.wikimedia.org/r/576377 (https://phabricator.wikimedia.org/T213193) [16:30:49] PROBLEM - Unmerged changes on repository puppet on labtestpuppetmaster2001 is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production).
https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [16:30:55] (PS1) Elukey: Ensure readability settings for home dirs of Analytics clients [puppet] - https://gerrit.wikimedia.org/r/576384 (https://phabricator.wikimedia.org/T243934) [16:31:35] (PS5) Hnowlan: changeprop: Bump CPU usage [deployment-charts] - https://gerrit.wikimedia.org/r/576377 (https://phabricator.wikimedia.org/T213193) [16:31:38] (CR) Dzahn: "tested as "cscott@wtp1025":" [puppet] - https://gerrit.wikimedia.org/r/576383 (https://phabricator.wikimedia.org/T245877) (owner: Dzahn) [16:32:27] !log reimage lvs1013 with buster - T245984 [16:32:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:31] T245984: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 [16:32:35] PROBLEM - pybal on lvs5003 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [16:32:39] (PS2) Dzahn: admins: let parsoid-admins run scap pull as mwdeploy [puppet] - https://gerrit.wikimedia.org/r/576383 (https://phabricator.wikimedia.org/T245877) [16:32:43] PROBLEM - PyBal backends health check on lvs5003 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [16:32:43] ^^ that's bblack test [16:32:44] Operations, Traffic, Patch-For-Review: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 (ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` lvs1013.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/20...
[16:33:46] (03CR) 10Hnowlan: [C: 03+2] changeprop: Bump CPU usage [deployment-charts] - 10https://gerrit.wikimedia.org/r/576377 (https://phabricator.wikimedia.org/T213193) (owner: 10Hnowlan) [16:34:03] (03Merged) 10jenkins-bot: changeprop: Bump CPU usage [deployment-charts] - 10https://gerrit.wikimedia.org/r/576377 (https://phabricator.wikimedia.org/T213193) (owner: 10Hnowlan) [16:34:05] (03CR) 10Krinkle: [C: 03+2] multiversion: Remove support for passing file path to readDbListFile() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576097 (owner: 10Krinkle) [16:34:45] !log krinkle@deploy1001 Synchronized multiversion/MWWikiversions.php: I8815be28d6a26a1 - T169821 (duration: 01m 04s) [16:34:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:49] T169821: Investigate whether wmf-config's configuration cache is still needed - https://phabricator.wikimedia.org/T169821 [16:35:11] !log otto@deploy1001 Started restart [restbase/deploy@bfdd342] (dev-cluster): Restart (dev-cluster) to pick up new LVS TLS port for eventgate T242224 [16:35:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:15] T242224: Switch all eventgate clients to use new TLS port - https://phabricator.wikimedia.org/T242224 [16:35:37] (03PS1) 10Alexandros Kosiaris: create_new_service: Fix symlink creation [deployment-charts] - 10https://gerrit.wikimedia.org/r/576385 [16:36:35] (03PS2) 10Elukey: Ensure readability settings for home dirs of Analytics clients [puppet] - 10https://gerrit.wikimedia.org/r/576384 (https://phabricator.wikimedia.org/T243934) [16:37:05] RECOVERY - Unmerged changes on repository puppet on labtestpuppetmaster2001 is OK: No changes to merge. 
https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [16:37:11] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=idp site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:38:19] PROBLEM - PyBal connections to etcd on lvs5003 is CRITICAL: CRITICAL: 0 connections established with conf2003.codfw.wmnet:2379 (min=16) https://wikitech.wikimedia.org/wiki/PyBal [16:39:09] (03CR) 10Alexandros Kosiaris: [C: 03+2] create_new_service: Fix symlink creation [deployment-charts] - 10https://gerrit.wikimedia.org/r/576385 (owner: 10Alexandros Kosiaris) [16:39:38] (03Merged) 10jenkins-bot: create_new_service: Fix symlink creation [deployment-charts] - 10https://gerrit.wikimedia.org/r/576385 (owner: 10Alexandros Kosiaris) [16:40:48] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1002/21226/" [puppet] - 10https://gerrit.wikimedia.org/r/576384 (https://phabricator.wikimedia.org/T243934) (owner: 10Elukey) [16:42:43] (03CR) 10Jbond: [C: 03+1] "lgtm type checking would be nice" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/576380 (owner: 10Alexandros Kosiaris) [16:42:43] Krinkle: do you have an active scap deploy? [16:42:54] ottomata: I do [16:42:55] scap says it is globally locked [16:42:56] ok [16:43:33] (03PS4) 10Krinkle: multiversion: Remove support for passing file path to readDbListFile() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576097 [16:43:39] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:43:40] 10Operations, 10Parsoid-PHP, 10SRE-Access-Requests, 10serviceops, 10Patch-For-Review: Give all members of the Parsing team production `deployment` access ( add arlolra to deployers) - https://phabricator.wikimedia.org/T245877 (10Dzahn) The change above would let parsoid-admins run "scap pull" AS user "mw... [16:43:44] (03CR) 10Krinkle: [C: 03+2] "Try again" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576097 (owner: 10Krinkle) [16:43:52] ottomata: will be just a minute :) [16:44:32] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [16:44:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:37] ottomata: I've yielded the lock, assuming it's not for a MW deploy. [16:44:43] I'll do my testing on mwdebug1002 now without a lock [16:44:50] 10Operations, 10ops-eqiad, 10serviceops: (Need by: 2020-02-28) rack/setup/install mw[1385-1413].eqiad.wmnet - https://phabricator.wikimedia.org/T241849 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` mw1385.eqiad.wmnet ` The log can be found in `... [16:45:05] (03Merged) 10jenkins-bot: multiversion: Remove support for passing file path to readDbListFile() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576097 (owner: 10Krinkle) [16:45:19] PROBLEM - Disk space on notebook1004 is CRITICAL: DISK CRITICAL - free space: /srv 3174 MB (2% inode=77%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=notebook1004&var-datasource=eqiad+prometheus/ops [16:45:31] !log hnowlan@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' . 
[16:45:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:04] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [16:47:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:08] ok danke [16:47:13] !log otto@deploy1001 Started restart [restbase/deploy@bfdd342]: Restart to pick up new LVS TLS port for eventgate T242224 [16:47:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:18] T242224: Switch all eventgate clients to use new TLS port - https://phabricator.wikimedia.org/T242224 [16:47:33] !log hnowlan@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' . [16:47:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:10] Operations, ops-eqiad, serviceops: (Need by: 2020-02-28) rack/setup/install mw[1385-1413].eqiad.wmnet - https://phabricator.wikimedia.org/T241849 (ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` mw1385.eqiad.wmnet ` The log can be found in `...
[16:48:21] 10Operations, 10LDAP-Access-Requests, 10WMF-Legal: Add Abban Dunne to the ldap/wmde group - https://phabricator.wikimedia.org/T246664 (10Dzahn) p:05Triage→03Medium [16:48:22] (03CR) 10Ottomata: Ensure readability settings for home dirs of Analytics clients (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/576384 (https://phabricator.wikimedia.org/T243934) (owner: 10Elukey) [16:48:24] !log krinkle@deploy1001 Synchronized wmf-config/import.php: I9d658ff41b78 (duration: 01m 03s) [16:48:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:27] (03PS1) 10Muehlenhoff: Add cas-server-core-util to Gradle dependencies [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/576387 [16:49:12] !log reload icinga config on icinga1001 [16:49:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:26] 10Operations, 10ops-eqiad, 10serviceops: (Need by: 2020-02-28) rack/setup/install mw[1385-1413].eqiad.wmnet - https://phabricator.wikimedia.org/T241849 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` mw1386.eqiad.wmnet ` The log can be found in `... [16:49:47] !log hnowlan@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'changeprop' for release 'staging' . [16:49:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:01] 10Operations, 10ops-eqiad, 10serviceops: (Need by: 2020-02-28) rack/setup/install mw[1385-1413].eqiad.wmnet - https://phabricator.wikimedia.org/T241849 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` mw1387.eqiad.wmnet ` The log can be found in `... [16:50:02] !log hnowlan@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'changeprop' for release 'staging' . 
[16:50:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:17] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal, AS64600/IPv4: Connect - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:50:21] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:50:25] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:50:29] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:50:35] there we go [16:50:36] bblack: is that your new check? :) [16:50:49] the eqsin I expected, not the rest [16:50:55] bringing lvs5003 pybal back :) [16:51:00] !log krinkle@deploy1001 Synchronized multiversion/MWWikiversions.php: I9d658ff41b78 (duration: 01m 04s) [16:51:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:06] !log hnowlan@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'changeprop' for release 'production' . 
[16:51:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:13] !log lvs5003 - restart pybal, back to normal operations [16:51:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:16] lvs1013 is currently being reimaged on eqiad [16:51:17] (03CR) 10Alexandros Kosiaris: [C: 03+1] admins: let parsoid-admins run scap pull as mwdeploy [puppet] - 10https://gerrit.wikimedia.org/r/576383 (https://phabricator.wikimedia.org/T245877) (owner: 10Dzahn) [16:51:27] RECOVERY - BGP status on cr2-eqsin is OK: BGP OK - up: 16, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:51:34] but AFAIK everything should be ok in codfw [16:52:05] do we have excess peers configured from the transition? [16:52:10] (03CR) 10Elukey: Ensure readability settings for home dirs of Analytics clients (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/576384 (https://phabricator.wikimedia.org/T243934) (owner: 10Elukey) [16:52:16] XioNoX: maybe we are missing some of the recently decommed lvs in codfw? [16:52:18] e.g. lvs2001 is now-decommed but still listed as a peer in router config, it will trip that alert [16:52:31] AFAIK XioNoX deleted lvs2001 from the router config [16:52:42] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['lvs1013.eqiad.wmnet'] ` and were **ALL** successful. [16:52:45] maybe I've failed to ask him one of the others [16:53:08] let me check what I have in codfw [16:53:16] it's only one router too, so probably just a leftover [16:53:31] (03PS2) 10Alexandros Kosiaris: pdu: Fix type for $breaker variable [puppet] - 10https://gerrit.wikimedia.org/r/576380 [16:53:31] https://www.irccloud.com/pastebin/ttXPvajg/ [16:53:48] from memory... 
1.2 and 1.3 shouldn't be there anymore [16:53:52] * vgutierrez double checking [16:54:39] RECOVERY - PyBal connections to etcd on lvs5003 is OK: OK: 16 connections established with conf2003.codfw.wmnet:2379 (min=16) https://wikitech.wikimedia.org/wiki/PyBal [16:54:50] or could compare to cr2-codfw which isn't alerting [16:54:58] (03CR) 10Ottomata: Ensure readability settings for home dirs of Analytics clients (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/576384 (https://phabricator.wikimedia.org/T243934) (owner: 10Elukey) [16:55:19] XioNoX: right, you can remove 10.192.1.2 and 10.192.1.3 [16:55:45] (03PS1) 10Vgutierrez: lvs: Re-enable BGP in lvs1013 [puppet] - 10https://gerrit.wikimedia.org/r/576388 (https://phabricator.wikimedia.org/T245984) [16:55:47] 10Operations, 10LDAP-Access-Requests, 10WMF-Legal: Add Abban Dunne to the ldap/wmde group - https://phabricator.wikimedia.org/T246664 (10Dzahn) @RStallman-legalteam Could you please contact @AbbanWMDE for the NDA procedure? @AbbanWMDE Hi! Rachel will need your personal info to start with the NDA process.... 
[16:56:01] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal, AS64600/IPv4: Connect - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:56:02] ok [16:56:11] RECOVERY - PyBal backends health check on lvs5003 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:56:43] vgutierrez: removed [16:56:51] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 31, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:56:51] (03CR) 10Vgutierrez: [C: 03+2] lvs: Re-enable BGP in lvs1013 [puppet] - 10https://gerrit.wikimedia.org/r/576388 (https://phabricator.wikimedia.org/T245984) (owner: 10Vgutierrez) [16:56:59] (03CR) 10Elukey: Ensure readability settings for home dirs of Analytics clients (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/576384 (https://phabricator.wikimedia.org/T243934) (owner: 10Elukey) [16:57:13] 10Operations, 10LDAP-Access-Requests: access to Superset for Alex Hollender - https://phabricator.wikimedia.org/T244490 (10Dzahn) Hi, just a friendly ping for manager approval since this has been waiting for a bit. [16:58:33] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 32, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:58:37] !log Re-enable BGP in lvs1013 - T245984 [16:58:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:41] T245984: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 [16:59:06] (03CR) 10Jbond: [C: 03+1] pdu: Fix type for $breaker variable [puppet] - 10https://gerrit.wikimedia.org/r/576380 (owner: 10Alexandros Kosiaris) [16:59:15] 10Operations, 10LDAP-Access-Requests: access to Superset for Alex Hollender - https://phabricator.wikimedia.org/T244490 (10MNovotny_WMF) Approved!! apologies for my delay. 
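Editor's sketch of the stale-peer problem discussed above: a router that still lists a decommissioned load balancer as a BGP peer keeps a session that can never establish, which is what trips the alert. The snippet below is only an illustration of that comparison, assuming the configured peers and the active load balancer addresses are available as plain lists. The function name and the 10.192.1.7 / 10.192.1.8 addresses are hypothetical; only 10.192.1.2 and 10.192.1.3 come from the log, and this is not the actual Icinga check or any Wikimedia tooling.

```python
from ipaddress import ip_address


def stale_bgp_peers(configured_peers, active_lvs_addrs):
    """Return configured peer addresses with no matching active load balancer.

    Both arguments are iterables of IP address strings. This is an
    illustrative helper, not part of any real monitoring tool.
    """
    active = {ip_address(a) for a in active_lvs_addrs}
    # Keep the router-config order so the result is easy to compare by eye.
    return [p for p in configured_peers if ip_address(p) not in active]


# 10.192.1.2 and 10.192.1.3 are the peers named in the log; the other
# two addresses are made up for the example.
configured = ["10.192.1.2", "10.192.1.3", "10.192.1.7", "10.192.1.8"]
active = ["10.192.1.7", "10.192.1.8"]
print(stale_bgp_peers(configured, active))
```

Any address the helper returns is a candidate for removal from the router config, which is the manual step XioNoX performed here.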
[16:59:29] (03CR) 10Alexandros Kosiaris: pdu: Fix type for $breaker variable (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/576380 (owner: 10Alexandros Kosiaris) [16:59:32] 10Operations, 10Traffic: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 (10Vgutierrez) 05Open→03Resolved a:03Vgutierrez [16:59:33] RECOVERY - BGP status on cr1-eqiad is OK: BGP OK - up: 37, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:59:38] \o/ [16:59:45] RECOVERY - BGP status on cr2-eqiad is OK: BGP OK - up: 197, down: 1, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:59:47] vgutierrez: yessss [16:59:51] all LVS migrated to buster [16:59:52] (03PS3) 10Alexandros Kosiaris: pdu: Fix type for $breaker variable [puppet] - 10https://gerrit.wikimedia.org/r/576380 [16:59:54] (03PS1) 10Alexandros Kosiaris: facilities:monitor_pdu_service: Add types to parameters [puppet] - 10https://gerrit.wikimedia.org/r/576390 [16:59:55] * Amir1 dances a little [17:00:05] godog and _joe_: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Puppet SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200303T1700). [17:00:05] No GERRIT patches in the queue for this window AFAICS. [17:00:07] 10Operations, 10ops-eqiad, 10serviceops: (Need by: 2020-02-28) rack/setup/install mw[1385-1413].eqiad.wmnet - https://phabricator.wikimedia.org/T241849 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` mw1388.eqiad.wmnet ` The log can be found in `... 
[17:00:41] RECOVERY - pybal on lvs5003 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [17:00:56] (03CR) 10Thcipriani: [C: 03+1] admins: let parsoid-admins run scap pull as mwdeploy [puppet] - 10https://gerrit.wikimedia.org/r/576383 (https://phabricator.wikimedia.org/T245877) (owner: 10Dzahn) [17:01:05] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime [17:01:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:20] 10Operations, 10ops-eqiad, 10serviceops: (Need by: 2020-02-28) rack/setup/install mw[1385-1413].eqiad.wmnet - https://phabricator.wikimedia.org/T241849 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` mw1389.eqiad.wmnet ` The log can be found in `... [17:02:01] !log hnowlan@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'changeprop' for release 'production' . [17:02:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:20] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime [17:02:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:56] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime [17:02:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:04:11] 10Operations, 10ops-eqiad, 10serviceops: (Need by: 2020-02-28) rack/setup/install mw[1385-1413].eqiad.wmnet - https://phabricator.wikimedia.org/T241849 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` mw1391.eqiad.wmnet ` The log can be found in `... 
[17:04:12] vgutierrez: \o/ [17:04:52] 10Operations, 10ops-eqiad, 10serviceops: (Need by: 2020-02-28) rack/setup/install mw[1385-1413].eqiad.wmnet - https://phabricator.wikimedia.org/T241849 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` mw1392.eqiad.wmnet ` The log can be found in `... [17:05:07] (03PS3) 10Elukey: Ensure readability settings for home dirs of Analytics clients [puppet] - 10https://gerrit.wikimedia.org/r/576384 (https://phabricator.wikimedia.org/T243934) [17:05:44] 10Operations, 10Phabricator, 10Security-Team, 10Security: Adjust onboarding/offboarding logic to accommodate changes to #security (now acl*security) - https://phabricator.wikimedia.org/T245771 (10jbond) 05Resolved→03Open reopen this ticket as we need to ensure it handles the subgroups [17:06:38] (03PS1) 10CRusnov: reports/coherence.py: Add test for racked devices with no position [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/576391 (https://phabricator.wikimedia.org/T239244) [17:06:59] !log hnowlan@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' . 
[17:07:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:30] (03PS4) 10Elukey: Ensure readability settings for home dirs of Analytics clients [puppet] - 10https://gerrit.wikimedia.org/r/576384 (https://phabricator.wikimedia.org/T243934) [17:11:54] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1002/21229/" [puppet] - 10https://gerrit.wikimedia.org/r/576384 (https://phabricator.wikimedia.org/T243934) (owner: 10Elukey) [17:12:12] (03PS1) 10Papaul: DNS: Add mgmt DNS for civi2001 [dns] - 10https://gerrit.wikimedia.org/r/576392 [17:13:02] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime [17:13:02] (03PS4) 10Alexandros Kosiaris: pdu: Fix type for $breaker variable [puppet] - 10https://gerrit.wikimedia.org/r/576380 [17:13:04] (03PS2) 10Alexandros Kosiaris: facilities:monitor_pdu_service: Add types to parameters [puppet] - 10https://gerrit.wikimedia.org/r/576390 [17:13:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:06] (03CR) 10Jforrester: [C: 03+1] admins: let parsoid-admins run scap pull as mwdeploy [puppet] - 10https://gerrit.wikimedia.org/r/576383 (https://phabricator.wikimedia.org/T245877) (owner: 10Dzahn) [17:14:18] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime [17:14:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:38] 10Operations, 10ops-eqiad, 10serviceops: (Need by: 2020-02-28) rack/setup/install mw[1385-1413].eqiad.wmnet - https://phabricator.wikimedia.org/T241849 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` mw1393.eqiad.wmnet ` The log can be found in `... 
[17:14:43] !log otto@deploy1001 Started restart [changeprop/deploy@e2fe8ca]: Restart to pick up new LVS TLS port for eventgate T242224 [17:14:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:48] T242224: Switch all eventgate clients to use new TLS port - https://phabricator.wikimedia.org/T242224 [17:15:19] 10Operations, 10ops-eqiad, 10serviceops: (Need by: 2020-02-28) rack/setup/install mw[1385-1413].eqiad.wmnet - https://phabricator.wikimedia.org/T241849 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` mw1394.eqiad.wmnet ` The log can be found in `... [17:16:10] 10Operations, 10ops-eqiad, 10serviceops: (Need by: 2020-02-28) rack/setup/install mw[1385-1413].eqiad.wmnet - https://phabricator.wikimedia.org/T241849 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` mw1395.eqiad.wmnet ` The log can be found in `... [17:16:28] (03Abandoned) 10Jdlrobson: Restore beta cluster logo on mobile [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576157 (https://phabricator.wikimedia.org/T232140) (owner: 10Jdlrobson) [17:16:29] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime [17:16:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:03] (03CR) 10Jbond: [C: 03+1] "thanks" [puppet] - 10https://gerrit.wikimedia.org/r/576380 (owner: 10Alexandros Kosiaris) [17:17:07] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime [17:17:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:12] (03CR) 10Dzahn: [C: 03+2] admins: let parsoid-admins run scap pull as mwdeploy [puppet] - 10https://gerrit.wikimedia.org/r/576383 (https://phabricator.wikimedia.org/T245877) (owner: 10Dzahn) [17:17:22] 10Operations, 10Security-Team, 10Stewards-and-global-tools, 10Security, 10User-revi: Security Issue Access Request for 2020 Stewards - 
https://phabricator.wikimedia.org/T246449 (10Urbanecm) It refers to: should I be in both #acl_security_volunteer and #acl_security_steward, when I gained security access... [17:17:45] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime [17:17:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:55] !log hnowlan@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'changeprop' for release 'staging' . [17:17:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:50] (03CR) 10Jbond: [C: 03+1] "lgtm thanks" [puppet] - 10https://gerrit.wikimedia.org/r/576390 (owner: 10Alexandros Kosiaris) [17:19:54] 10Operations, 10ops-eqiad, 10serviceops: (Need by: 2020-02-28) rack/setup/install mw[1385-1413].eqiad.wmnet - https://phabricator.wikimedia.org/T241849 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` mw1396.eqiad.wmnet ` The log can be found in `... [17:20:22] !log hnowlan@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'changeprop' for release 'staging' . [17:20:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:27] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [17:20:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:46] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [17:22:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:13] 10Operations, 10netbox, 10Patch-For-Review: Netbox report check for no position set in rack - https://phabricator.wikimedia.org/T239244 (10RobH) Please note 0U PDUs have no position in rack, but are the ONLY devices that should be that way. 
[17:23:23] 10Operations, 10Parsoid-PHP, 10SRE-Access-Requests, 10serviceops, 10Patch-For-Review: Give all members of the Parsing team production `deployment` access ( add arlolra to deployers) - https://phabricator.wikimedia.org/T245877 (10Dzahn) @cscott @ssastry After the merge above and the next puppet run all ex... [17:23:31] 10Operations, 10ops-eqiad, 10serviceops: (Need by: 2020-02-28) rack/setup/install mw[1385-1413].eqiad.wmnet - https://phabricator.wikimedia.org/T241849 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` mw1397.eqiad.wmnet ` The log can be found in `... [17:25:05] (03CR) 10Alexandros Kosiaris: [C: 03+1] "PCC happy at https://puppet-compiler.wmflabs.org/compiler1002/21230/" [puppet] - 10https://gerrit.wikimedia.org/r/576380 (owner: 10Alexandros Kosiaris) [17:25:10] (03CR) 10Alexandros Kosiaris: [C: 03+2] pdu: Fix type for $breaker variable [puppet] - 10https://gerrit.wikimedia.org/r/576380 (owner: 10Alexandros Kosiaris) [17:25:18] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/576380 (owner: 10Alexandros Kosiaris) [17:26:56] (03PS1) 10Jhedden: keepalived: add initial module and toolforge profile [puppet] - 10https://gerrit.wikimedia.org/r/576395 (https://phabricator.wikimedia.org/T236606) [17:27:36] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime [17:27:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:54] 10Operations, 10ops-eqiad, 10serviceops: (Need by: 2020-02-28) rack/setup/install mw[1385-1413].eqiad.wmnet - https://phabricator.wikimedia.org/T241849 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` mw1399.eqiad.wmnet ` The log can be found in `... 
[17:28:17] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime [17:28:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:28:24] (03PS1) 10Ottomata: Set eventgate-*-to-delete LVS services to state: service_setup [puppet] - 10https://gerrit.wikimedia.org/r/576396 (https://phabricator.wikimedia.org/T245203) [17:29:07] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime [17:29:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:39] (03CR) 10CRusnov: [C: 03+1] "Looks good!" [software/spicerack] - 10https://gerrit.wikimedia.org/r/575353 (owner: 10Volans) [17:30:04] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [17:30:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:10] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [17:30:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:19] (03PS2) 10Ottomata: Set eventgate-*-to-delete LVS services to state: lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/576396 (https://phabricator.wikimedia.org/T245203) [17:30:51] (03CR) 10Volans: [C: 03+2] netbox: fine tune log and exception messages [software/spicerack] - 10https://gerrit.wikimedia.org/r/575353 (owner: 10Volans) [17:31:32] 10Operations, 10ops-eqiad, 10serviceops: (Need by: 2020-02-28) rack/setup/install mw[1385-1413].eqiad.wmnet - https://phabricator.wikimedia.org/T241849 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` mw1400.eqiad.wmnet ` The log can be found in `... [17:32:10] 10Operations, 10ops-eqiad, 10serviceops: (Need by: 2020-02-28) rack/setup/install mw[1385-1413].eqiad.wmnet - https://phabricator.wikimedia.org/T241849 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` mw1401.eqiad.wmnet ` The log can be found in `... 
[17:32:15] PROBLEM - Disk space on notebook1004 is CRITICAL: DISK CRITICAL - free space: /srv 349 MB (0% inode=77%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=notebook1004&var-datasource=eqiad+prometheus/ops [17:32:40] (03CR) 10CRusnov: "This change is ready for review." [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/576391 (https://phabricator.wikimedia.org/T239244) (owner: 10CRusnov) [17:32:50] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime [17:32:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:53] (03PS1) 10Jcrespo: prometheus-mysqld-exporter: Workaround upstream package regression [puppet] - 10https://gerrit.wikimedia.org/r/576398 (https://phabricator.wikimedia.org/T242702) [17:33:06] 10Operations, 10ops-eqiad, 10serviceops: (Need by: 2020-02-28) rack/setup/install mw[1385-1413].eqiad.wmnet - https://phabricator.wikimedia.org/T241849 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` mw1402.eqiad.wmnet ` The log can be found in `... [17:34:29] (03CR) 10Ottomata: [C: 03+2] Set eventgate-*-to-delete LVS services to state: lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/576396 (https://phabricator.wikimedia.org/T245203) (owner: 10Ottomata) [17:34:41] (03PS3) 10Ottomata: Set eventgate-*-to-delete LVS services to state: lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/576396 (https://phabricator.wikimedia.org/T245203) [17:35:34] 10Operations, 10ops-eqiad, 10serviceops: (Need by: 2020-02-28) rack/setup/install mw[1385-1413].eqiad.wmnet - https://phabricator.wikimedia.org/T241849 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` mw1403.eqiad.wmnet ` The log can be found in `... 
[17:35:55] (03Merged) 10jenkins-bot: netbox: fine tune log and exception messages [software/spicerack] - 10https://gerrit.wikimedia.org/r/575353 (owner: 10Volans) [17:36:22] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime [17:36:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:03] 10Operations, 10DC-Ops, 10decommission: decommission WMF6142 (old payments2003.frack.codfw.wmnet) - https://phabricator.wikimedia.org/T246699 (10Papaul) ` [edit interfaces interface-range disabled] member ge-1/0/1 { ... } + member "ge-[0-1]/0/1"; + member "ge-[0-1]/0/3"; [edit interfaces interface... [17:38:31] 10Operations, 10ops-eqiad, 10serviceops: (Need by: 2020-02-28) rack/setup/install mw[1385-1413].eqiad.wmnet - https://phabricator.wikimedia.org/T241849 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` mw1398.eqiad.wmnet ` The log can be found in `... [17:38:44] 10Operations, 10DC-Ops, 10decommission: decommission WMF6143 (old payments2002.frack.codfw.wmnet) - https://phabricator.wikimedia.org/T246698 (10Papaul) ` [edit interfaces interface-range disabled] member ge-1/0/1 { ... } + member "ge-[0-1]/0/1"; + member "ge-[0-1]/0/3"; [edit interfaces interface... 
[17:39:00] (03PS1) 10Andrew Bogott: designate: install python3 versions of sink handlers [puppet] - 10https://gerrit.wikimedia.org/r/576399 (https://phabricator.wikimedia.org/T242766) [17:39:30] (03CR) 10Jcrespo: "Proof of fix if applied manually: https://phab.wmfusercontent.org/file/data/wwdowwpipn2knf6rvl5c/PHID-FILE-o6kuufjv2hnrmwfu37o6/Screenshot" [puppet] - 10https://gerrit.wikimedia.org/r/576398 (https://phabricator.wikimedia.org/T242702) (owner: 10Jcrespo) [17:39:41] 10Operations, 10DC-Ops, 10decommission: decommission WMF6143 (old payments2002.frack.codfw.wmnet) - https://phabricator.wikimedia.org/T246698 (10Papaul) [17:40:10] 10Operations, 10DC-Ops, 10decommission: decommission WMF6142 (old payments2003.frack.codfw.wmnet) - https://phabricator.wikimedia.org/T246699 (10Papaul) [17:40:39] (03PS1) 10Hnowlan: changeprop: Create new changeprop release [deployment-charts] - 10https://gerrit.wikimedia.org/r/576400 (https://phabricator.wikimedia.org/T213193) [17:40:45] (03CR) 10Andrew Bogott: [C: 03+2] designate: install python3 versions of sink handlers [puppet] - 10https://gerrit.wikimedia.org/r/576399 (https://phabricator.wikimedia.org/T242766) (owner: 10Andrew Bogott) [17:40:49] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime [17:40:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:34] 10Operations, 10DC-Ops, 10decommission: decommission WMF6141 (old payments2001.frack.codfw.wmnet) - https://phabricator.wikimedia.org/T246697 (10Papaul) [17:42:37] (03CR) 10Ppchelko: [C: 03+1] changeprop: Create new changeprop release [deployment-charts] - 10https://gerrit.wikimedia.org/r/576400 (https://phabricator.wikimedia.org/T213193) (owner: 10Hnowlan) [17:42:42] 10Operations, 10ops-eqiad, 10serviceops: (Need by: 2020-02-28) rack/setup/install mw[1385-1413].eqiad.wmnet - https://phabricator.wikimedia.org/T241849 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` 
mw1404.eqiad.wmnet ` The log can be found in `... [17:43:53] (03CR) 10Hnowlan: [C: 03+2] changeprop: Create new changeprop release [deployment-charts] - 10https://gerrit.wikimedia.org/r/576400 (https://phabricator.wikimedia.org/T213193) (owner: 10Hnowlan) [17:44:25] (03Merged) 10jenkins-bot: changeprop: Create new changeprop release [deployment-charts] - 10https://gerrit.wikimedia.org/r/576400 (https://phabricator.wikimedia.org/T213193) (owner: 10Hnowlan) [17:44:31] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime [17:44:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:44:45] (03PS1) 10Ottomata: Set eventgate-*-to-delete LVS services to state: service_setup [puppet] - 10https://gerrit.wikimedia.org/r/576402 (https://phabricator.wikimedia.org/T245203) [17:45:05] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime [17:45:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:00] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime [17:46:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:30] (03CR) 10Papaul: [C: 03+2] DNS: Add mgmt DNS for civi2001 [dns] - 10https://gerrit.wikimedia.org/r/576392 (owner: 10Papaul) [17:46:31] PROBLEM - Disk space on notebook1004 is CRITICAL: DISK CRITICAL - free space: /srv 5218 MB (3% inode=77%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=notebook1004&var-datasource=eqiad+prometheus/ops [17:47:27] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [17:47:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:48:54] 10Operations, 10ops-eqiad, 10serviceops: (Need by: 2020-02-28) rack/setup/install mw[1385-1413].eqiad.wmnet - https://phabricator.wikimedia.org/T241849 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: 
` mw1405.eqiad.wmnet ` The log can be found in `... [17:49:17] 10Operations, 10ops-eqiad, 10serviceops: (Need by: 2020-02-28) rack/setup/install mw[1385-1413].eqiad.wmnet - https://phabricator.wikimedia.org/T241849 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` mw1406.eqiad.wmnet ` The log can be found in `... [17:49:36] 10Operations, 10ops-eqiad, 10serviceops: (Need by: 2020-02-28) rack/setup/install mw[1385-1413].eqiad.wmnet - https://phabricator.wikimedia.org/T241849 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` mw1407.eqiad.wmnet ` The log can be found in `... [17:50:35] !log hnowlan@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' . [17:50:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:29] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime [17:51:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:22] 10Operations, 10ops-eqiad, 10serviceops: (Need by: 2020-02-28) rack/setup/install mw[1385-1413].eqiad.wmnet - https://phabricator.wikimedia.org/T241849 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` mw1403.eqiad.wmnet ` The log can be found in `... [17:53:50] 10Operations, 10ops-eqiad, 10serviceops: (Need by: 2020-02-28) rack/setup/install mw[1385-1413].eqiad.wmnet - https://phabricator.wikimedia.org/T241849 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` mw1408.eqiad.wmnet ` The log can be found in `... 
[17:55:12] (03PS1) 10Andrew Bogott: designate: include python3-git -- we need this for wmfsink [puppet] - 10https://gerrit.wikimedia.org/r/576403 (https://phabricator.wikimedia.org/T242766) [17:55:39] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime [17:55:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:57:45] 10Operations, 10ops-eqiad, 10serviceops: (Need by: 2020-02-28) rack/setup/install mw[1385-1413].eqiad.wmnet - https://phabricator.wikimedia.org/T241849 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` mw1409.eqiad.wmnet ` The log can be found in `... [17:57:48] (03CR) 10Jcrespo: [C: 04-1] "the non-instanced class on the default port may need the same change." [puppet] - 10https://gerrit.wikimedia.org/r/576398 (https://phabricator.wikimedia.org/T242702) (owner: 10Jcrespo) [17:58:15] (03CR) 10Andrew Bogott: [C: 03+2] designate: include python3-git -- we need this for wmfsink [puppet] - 10https://gerrit.wikimedia.org/r/576403 (https://phabricator.wikimedia.org/T242766) (owner: 10Andrew Bogott) [17:58:15] PROBLEM - Disk space on notebook1004 is CRITICAL: DISK CRITICAL - free space: /srv 226 MB (0% inode=77%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=notebook1004&var-datasource=eqiad+prometheus/ops [17:59:35] (03PS2) 10Hnowlan: jobrunner: add simple HTTP check [puppet] - 10https://gerrit.wikimedia.org/r/576301 (https://phabricator.wikimedia.org/T243096) [17:59:49] (03CR) 10Hnowlan: jobrunner: add simple HTTP check (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/576301 (https://phabricator.wikimedia.org/T243096) (owner: 10Hnowlan) [17:59:55] RECOVERY - Disk space on notebook1004 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=notebook1004&var-datasource=eqiad+prometheus/ops 
[18:00:04] cscott, arlolra, subbu, halfak, and accraze: How many deployers does it take to do Services – Graphoid / Parsoid / Citoid / ORES deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200303T1800). [18:01:51] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime [18:01:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:16] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime [18:02:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:33] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime [18:02:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:37] 10Operations: Racktables: clearly show when hosts are decommissioned - https://phabricator.wikimedia.org/T164042 (10RobH) 05Open→03Declined racktables is dead/static/deprecated in favor of netbox, so this is defunct. [18:03:49] 10Operations, 10ops-eqiad, 10serviceops: (Need by: 2020-02-28) rack/setup/install mw[1385-1413].eqiad.wmnet - https://phabricator.wikimedia.org/T241849 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` mw1410.eqiad.wmnet ` The log can be found in `... [18:04:14] 10Operations, 10Wikimedia-Mailing-lists, 10Upstream: Ban email domains from emailing lists.wikimedia.org (mailman) - https://phabricator.wikimedia.org/T105093 (10RobH) [18:04:17] 10Operations, 10ops-eqiad, 10serviceops: (Need by: 2020-02-28) rack/setup/install mw[1385-1413].eqiad.wmnet - https://phabricator.wikimedia.org/T241849 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` mw1411.eqiad.wmnet ` The log can be found in `... 
[18:04:31] 10Operations, 10ops-eqiad, 10serviceops: (Need by: 2020-02-28) rack/setup/install mw[1385-1413].eqiad.wmnet - https://phabricator.wikimedia.org/T241849 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` mw1412.eqiad.wmnet ` The log can be found in `... [18:04:47] PROBLEM - Check correctness of the icinga configuration on icinga1001 is CRITICAL: Icinga configuration contains errors https://wikitech.wikimedia.org/wiki/Icinga [18:04:49] 10Operations, 10ops-eqiad, 10serviceops: (Need by: 2020-02-28) rack/setup/install mw[1385-1413].eqiad.wmnet - https://phabricator.wikimedia.org/T241849 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` mw1413.eqiad.wmnet ` The log can be found in `... [18:06:47] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime [18:06:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:35] 10Operations, 10Parsoid-PHP, 10SRE-Access-Requests, 10serviceops, 10Patch-For-Review: Give all members of the Parsing team production `deployment` access ( add arlolra to deployers) - https://phabricator.wikimedia.org/T245877 (10Dzahn) 05Open→03Resolved [18:10:44] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime [18:10:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:07] !log updating firmware on scs-oe16-esams via T174475 [18:11:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:11] T174475: update firmware on scs consoles - https://phabricator.wikimedia.org/T174475 [18:11:32] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [18:11:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:10] 10Operations, 10DC-Ops: update firmware on scs consoles - https://phabricator.wikimedia.org/T174475 (10RobH) a:05mark→03RobH Ok, so this is still open but thing 
have settled at esams and I know what's going on there: * scs-oe10-esams is defunct - no updates needed as it's gone * scs-oe16-esams is being upda... [18:13:09] (03PS1) 10Dzahn: add ganeti role to new eqiad ganeti expansion servers [puppet] - 10https://gerrit.wikimedia.org/r/576406 (https://phabricator.wikimedia.org/T228924) [18:14:05] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [18:14:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:13] 10Operations: Import Wikimania 2015 Videos - https://phabricator.wikimedia.org/T106565 (10RobH) 05Open→03Declined This is very old, and likely defunct, so I'm just going to close it. If it turns out this needs doing still (even though its last update was in 2015), it can be reopened. [18:16:48] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime [18:16:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:02] (03CR) 10Dzahn: "@akosiaris Does it make sense to start with the expansion before replacing existing servers? I expect I would do https://wikitech.wikimed" [puppet] - 10https://gerrit.wikimedia.org/r/576406 (https://phabricator.wikimedia.org/T228924) (owner: 10Dzahn) [18:17:11] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime [18:17:12] 10Operations, 10ops-eqiad, 10serviceops: (Need by: 2020-02-28) rack/setup/install mw[1385-1413].eqiad.wmnet - https://phabricator.wikimedia.org/T241849 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1408.eqiad.wmnet'] ` and were **ALL** successful. 
[18:17:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:29] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime [18:17:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:48] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime [18:17:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:19] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [18:19:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:40] Operations, ops-eqiad, serviceops: (Need by: 2020-02-28) rack/setup/install mw[1385-1413].eqiad.wmnet - https://phabricator.wikimedia.org/T241849 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1409.eqiad.wmnet'] ` and were **ALL** successful. [18:19:42] Operations, Analytics: setup/install weblog1001/WMF4750 as oxygen replacement - https://phabricator.wikimedia.org/T207760 (RobH) Open→Resolved [18:19:44] Operations, Analytics, hardware-requests: Refresh or replace oxygen - https://phabricator.wikimedia.org/T181264 (RobH) [18:20:57] Operations, DC-Ops: update firmware on scs consoles - https://phabricator.wikimedia.org/T174475 (RobH) Open→Resolved scs-oe16-esams updated, resolving task.
[18:21:47] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [18:21:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:12] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [18:22:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:30] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [18:22:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:26:05] Operations, ops-eqiad, serviceops: (Need by: 2020-02-28) rack/setup/install mw[1385-1413].eqiad.wmnet - https://phabricator.wikimedia.org/T241849 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1410.eqiad.wmnet'] ` and were **ALL** successful. [18:26:16] Operations, ops-eqiad, serviceops: (Need by: 2020-02-28) rack/setup/install mw[1385-1413].eqiad.wmnet - https://phabricator.wikimedia.org/T241849 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1411.eqiad.wmnet'] ` and were **ALL** successful. [18:26:21] Operations, ops-eqiad, serviceops: (Need by: 2020-02-28) rack/setup/install mw[1385-1413].eqiad.wmnet - https://phabricator.wikimedia.org/T241849 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1412.eqiad.wmnet'] ` and were **ALL** successful. [18:26:46] Operations, ops-eqiad, serviceops: (Need by: 2020-02-28) rack/setup/install mw[1385-1413].eqiad.wmnet - https://phabricator.wikimedia.org/T241849 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1413.eqiad.wmnet'] ` and were **ALL** successful. [18:28:51] (PS1) Dzahn: remove grafana-labs-admin [dns] - https://gerrit.wikimedia.org/r/576408 [18:29:37] (CR) Jforrester: [C: +1] "Well, oops."
[puppet] - https://gerrit.wikimedia.org/r/574726 (owner: Urbanecm) [18:30:05] (CR) Andrew Bogott: [C: +1] "related: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/575759/" [dns] - https://gerrit.wikimedia.org/r/576408 (owner: Dzahn) [18:30:25] (PS2) Dzahn: remove grafana-labs-admin [dns] - https://gerrit.wikimedia.org/r/576408 (https://phabricator.wikimedia.org/T246508) [18:30:42] (PS1) Elukey: role::swap: allow stat hosts to rsync home dir files from notebooks [puppet] - https://gerrit.wikimedia.org/r/576409 [18:30:44] (CR) Dzahn: "ok to remove from DNS already? https://gerrit.wikimedia.org/r/c/operations/dns/+/576408" [puppet] - https://gerrit.wikimedia.org/r/575758 (https://phabricator.wikimedia.org/T246508) (owner: Andrew Bogott) [18:31:57] (CR) Dzahn: [C: +2] "thanks for the prompt review!" [dns] - https://gerrit.wikimedia.org/r/576408 (https://phabricator.wikimedia.org/T246508) (owner: Dzahn) [18:32:29] (CR) Ottomata: [C: +1] role::swap: allow stat hosts to rsync home dir files from notebooks [puppet] - https://gerrit.wikimedia.org/r/576409 (owner: Elukey) [18:33:44] Operations, Traffic, serviceops, Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (Dzahn) [18:34:19] Operations, Traffic, serviceops, Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (Dzahn) grafana-labs-admin.wikimedia.org has been removed from DNS in https://gerrit.wikimedia.org/r/c/operations/dns/+/576408 therefore also removed here [18:35:26] (CR) Elukey: [C: +2] role::swap: allow stat hosts to rsync home dir files from notebooks [puppet] - https://gerrit.wikimedia.org/r/576409 (owner: Elukey) [18:37:18] (PS1) Andrew Bogott: wmf_sink: remove utf8 encoding [puppet] - https://gerrit.wikimedia.org/r/576410 (https://phabricator.wikimedia.org/T242766) [18:37:57] Operations, Traffic, serviceops,
Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (Dzahn) [18:38:42] (CR) Andrew Bogott: [C: +2] wmf_sink: remove utf8 encoding [puppet] - https://gerrit.wikimedia.org/r/576410 (https://phabricator.wikimedia.org/T242766) (owner: Andrew Bogott) [18:38:50] Operations, Traffic, serviceops, Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (Dzahn) labmon1001 has been replaced by cloudmetrics1002 and is still hosting grafana-labs and graphite-labs. [18:40:26] (PS1) Herron: add kibana-next SANs to kibana cert [puppet] - https://gerrit.wikimedia.org/r/576411 (https://phabricator.wikimedia.org/T234854) [18:46:56] (CR) Cwhite: [C: +1] "LGTM. It doesn't include a kibana-next.discovery.wmnet, but I can't tell if that's necessary." [puppet] - https://gerrit.wikimedia.org/r/576411 (https://phabricator.wikimedia.org/T234854) (owner: Herron) [18:48:37] (PS1) Andrew Bogott: designate codfw1dev: don't specify a legacy domain, we don't have one [puppet] - https://gerrit.wikimedia.org/r/576414 [18:48:49] (CR) Herron: "> LGTM.
It doesn't include a kibana-next.discovery.wmnet, but I" [puppet] - https://gerrit.wikimedia.org/r/576411 (https://phabricator.wikimedia.org/T234854) (owner: Herron) [18:49:23] (CR) Herron: [C: +2] add kibana-next SANs to kibana cert [puppet] - https://gerrit.wikimedia.org/r/576411 (https://phabricator.wikimedia.org/T234854) (owner: Herron) [18:55:07] (CR) Andrew Bogott: [C: +2] designate codfw1dev: don't specify a legacy domain, we don't have one [puppet] - https://gerrit.wikimedia.org/r/576414 (owner: Andrew Bogott) [18:56:27] (PS1) Jforrester: [vecwiki] Update project logo with temporary 20k branding [mediawiki-config] - https://gerrit.wikimedia.org/r/576415 (https://phabricator.wikimedia.org/T246808) [18:57:58] (PS1) Andrew Bogott: designate codfw1dev: don't specify a legacy domain, we don't have one [puppet] - https://gerrit.wikimedia.org/r/576416 [18:58:28] !log generating new certs for grafana-labs/graphite-labs [18:58:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:58:33] (CR) Jforrester: [C: +2] [vecwiki] Update project logo with temporary 20k branding [mediawiki-config] - https://gerrit.wikimedia.org/r/576415 (https://phabricator.wikimedia.org/T246808) (owner: Jforrester) [18:59:32] (Merged) jenkins-bot: [vecwiki] Update project logo with temporary 20k branding [mediawiki-config] - https://gerrit.wikimedia.org/r/576415 (https://phabricator.wikimedia.org/T246808) (owner: Jforrester) [18:59:40] (CR) Andrew Bogott: [C: +2] designate codfw1dev: don't specify a legacy domain, we don't have one [puppet] - https://gerrit.wikimedia.org/r/576416 (owner: Andrew Bogott) [19:00:04] RoanKattouw, Niharika, and Urbanecm: Your horoscope predicts another unfortunate Morning SWAT(Max 6 patches) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200303T1900).
[19:00:04] ottomata: A patch you scheduled for Morning SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [19:00:33] Hey ottomata, I'm just sneaking out a logo change. One sec. [19:00:36] (CR) Herron: "> kibana-next.svc.eqiad.wmnet doesn't seem to be a valid SAN for the" [puppet] - https://gerrit.wikimedia.org/r/576151 (https://phabricator.wikimedia.org/T234854) (owner: Herron) [19:00:59] Operations, fundraising-tech-ops, netops: DHCP routing issue with civi2001 - https://phabricator.wikimedia.org/T246812 (Jgreen) [19:01:28] !log jforrester@deploy1001 Synchronized static/images/project-logos/: T246808 [vecwiki] Update project logo with temporary 20k branding (duration: 01m 10s) [19:01:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:33] T246808: Change of VEC Wikipedia logo due to reaching 20.000 articles - https://phabricator.wikimedia.org/T246808 [19:02:04] !log Manually purged vecwiki logos from Varnish for T246808 [19:02:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:02:08] ottomata: Over to you.
[19:03:03] (PS4) Herron: lists: don't assume lists IP is an interface::alias [puppet] - https://gerrit.wikimedia.org/r/576362 (https://phabricator.wikimedia.org/T224586) [19:05:38] (CR) Herron: "https://puppet-compiler.wmflabs.org/compiler1002/21232/" [puppet] - https://gerrit.wikimedia.org/r/576362 (https://phabricator.wikimedia.org/T224586) (owner: Herron) [19:05:40] (PS1) Dzahn: add fake key for grafana-labs.discovery.wmnet cert [labs/private] - https://gerrit.wikimedia.org/r/576417 (https://phabricator.wikimedia.org/T210411) [19:06:57] (CR) Dzahn: [V: +2 C: +2] add fake key for grafana-labs.discovery.wmnet cert [labs/private] - https://gerrit.wikimedia.org/r/576417 (https://phabricator.wikimedia.org/T210411) (owner: Dzahn) [19:08:44] oh hi ya [19:09:09] James_F: i'm not sure i've done a code deploy before, only have done sync file [19:09:13] although... [19:09:15] this is just one file [19:09:23] * James_F looks. [19:09:42] Oh, a back-port? [19:09:47] I can do that for you if you want. [19:09:50] ya [19:09:55] would much appreciate it!
[19:11:11] (PS1) Andrew Bogott: Revert "designate codfw1dev: don't specify a legacy domain, we don't have one" [puppet] - https://gerrit.wikimedia.org/r/576418 [19:11:28] (PS1) Andrew Bogott: Revert "designate codfw1dev: don't specify a legacy domain, we don't have one" [puppet] - https://gerrit.wikimedia.org/r/576419 [19:11:55] (CR) jerkins-bot: [V: -1] Revert "designate codfw1dev: don't specify a legacy domain, we don't have one" [puppet] - https://gerrit.wikimedia.org/r/576419 (owner: Andrew Bogott) [19:12:20] (CR) Andrew Bogott: [C: +2] Revert "designate codfw1dev: don't specify a legacy domain, we don't have one" [puppet] - https://gerrit.wikimedia.org/r/576418 (owner: Andrew Bogott) [19:12:39] (PS2) Andrew Bogott: Revert "designate codfw1dev: don't specify a legacy domain, we don't have one" [puppet] - https://gerrit.wikimedia.org/r/576419 [19:14:02] (CR) Andrew Bogott: [C: +2] Revert "designate codfw1dev: don't specify a legacy domain, we don't have one" [puppet] - https://gerrit.wikimedia.org/r/576419 (owner: Andrew Bogott) [19:14:19] (PS2) Dzahn: wmcs::monitoring: add envoy for TLS termination for grafana-labs [puppet] - https://gerrit.wikimedia.org/r/572381 (https://phabricator.wikimedia.org/T210411) [19:17:38] ottomata: Live on mwdebug1001 – can you test? [19:18:07] James_F: i think it's good! thank you [19:18:27] Operations, Pybal, Traffic: Minor fixes in pybal checks - https://phabricator.wikimedia.org/T246431 (Dzahn) p:Triage→Medium [19:18:48] Syncing now.
[19:19:49] !log jforrester@deploy1001 Synchronized php-1.35.0-wmf.21/extensions/WikimediaEvents/includes/WikimediaEventsHooks.php: T246030 T226986: Set wgWMEClientErrorIntakeURL in onResourceLoaderGetConfigVars (duration: 01m 05s) [19:19:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:55] T246030: Enable client side error logging in prod for small wiki - https://phabricator.wikimedia.org/T246030 [19:19:56] T226986: Client side error logging production launch - https://phabricator.wikimedia.org/T226986 [19:25:30] (PS5) Herron: lists: don't assume lists IP is an interface::alias [puppet] - https://gerrit.wikimedia.org/r/576362 (https://phabricator.wikimedia.org/T224586) [19:25:52] thank you James_F! [19:26:59] ottomata: Any time. [19:28:07] (CR) jerkins-bot: [V: -1] lists: don't assume lists IP is an interface::alias [puppet] - https://gerrit.wikimedia.org/r/576362 (https://phabricator.wikimedia.org/T224586) (owner: Herron) [19:33:25] (PS6) Herron: lists: don't assume lists IP is an interface::alias [puppet] - https://gerrit.wikimedia.org/r/576362 (https://phabricator.wikimedia.org/T224586) [19:33:40] (CR) Dzahn: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1002/21233/cloudmetrics1001.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/572381 (https://phabricator.wikimedia.org/T210411) (owner: Dzahn) [19:36:16] (PS7) Herron: lists: don't assume lists IP is an interface::alias [puppet] - https://gerrit.wikimedia.org/r/576362 (https://phabricator.wikimedia.org/T224586) [19:37:00] (PS1) Dzahn: wmcs::monitoring: remove grafana-labs-admin from comments [puppet] - https://gerrit.wikimedia.org/r/576426 [19:37:26] (CR) Herron: "https://puppet-compiler.wmflabs.org/compiler1001/21236/" [puppet] - https://gerrit.wikimedia.org/r/576362 (https://phabricator.wikimedia.org/T224586) (owner: Herron) [19:37:31] (CR) jerkins-bot: [V: -1] wmcs::monitoring: remove
grafana-labs-admin from comments [puppet] - https://gerrit.wikimedia.org/r/576426 (owner: Dzahn) [19:38:37] (PS1) Dzahn: varnish: remove grafana-labs-admin [puppet] - https://gerrit.wikimedia.org/r/576427 [19:38:47] !log jforrester@deploy1001 Synchronized php-1.35.0-wmf.22/extensions/AbuseFilter: T213006 T246539: Minor fixes for the updateVarDumps script (duration: 01m 05s) [19:38:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:38:54] T246539: Dry-run, then actually run updateVarDumps - https://phabricator.wikimedia.org/T246539 [19:38:54] T213006: Create a script to update afl_var_dump, drop back-compat code - https://phabricator.wikimedia.org/T213006 [19:38:54] Daimona: Done. [19:39:10] (CR) jerkins-bot: [V: -1] varnish: remove grafana-labs-admin [puppet] - https://gerrit.wikimedia.org/r/576427 (owner: Dzahn) [19:39:39] (CR) Jbond: [C: +1] "LGTM thanks :)" [puppet] - https://gerrit.wikimedia.org/r/576362 (https://phabricator.wikimedia.org/T224586) (owner: Herron) [19:42:52] (PS1) Dzahn: ssl: add certificate for grafana-labs.discovery.wmnet [puppet] - https://gerrit.wikimedia.org/r/576428 (https://phabricator.wikimedia.org/T210411) [19:44:26] (CR) Dzahn: [C: +2] ssl: add certificate for grafana-labs.discovery.wmnet [puppet] - https://gerrit.wikimedia.org/r/576428 (https://phabricator.wikimedia.org/T210411) (owner: Dzahn) [19:47:02] (CR) Jforrester: "Oops, thanks."
[mediawiki-config] - https://gerrit.wikimedia.org/r/575692 (https://phabricator.wikimedia.org/T246535) (owner: Urbanecm) [19:50:50] !log cloudmetrics1002 - removed port 8080 from apache's ports.conf and restarted the service (cloudmetrics1001 did not have this) [19:50:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:52:51] (PS1) Ottomata: Add check_eventgate_analyltics_external_cluster [puppet] - https://gerrit.wikimedia.org/r/576431 (https://phabricator.wikimedia.org/T233629) [19:53:37] (CR) Ottomata: [C: -1] "To be done once LVS state: production" [puppet] - https://gerrit.wikimedia.org/r/576431 (https://phabricator.wikimedia.org/T233629) (owner: Ottomata) [19:55:38] (CR) jerkins-bot: [V: -1] Add check_eventgate_analyltics_external_cluster [puppet] - https://gerrit.wikimedia.org/r/576431 (https://phabricator.wikimedia.org/T233629) (owner: Ottomata) [19:59:03] (PS1) Dzahn: ATS: switch grafana-labs backends from http to https [puppet] - https://gerrit.wikimedia.org/r/576434 (https://phabricator.wikimedia.org/T210411) [20:00:04] liw and Brennen: (Dis)respected human, time to deploy Mediawiki train - European+American Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200303T2000). Please do the needful. [20:01:23] (PS2) Dzahn: ATS: switch grafana-labs backends from http to https [puppet] - https://gerrit.wikimedia.org/r/576434 (https://phabricator.wikimedia.org/T210411) [20:01:53] (PS2) Dzahn: wmcs::monitoring: remove grafana-labs-admin from comments [puppet] - https://gerrit.wikimedia.org/r/576426 [20:02:13] (CR) Dzahn: [C: +2] wmcs::monitoring: remove grafana-labs-admin from comments [puppet] - https://gerrit.wikimedia.org/r/576426 (owner: Dzahn) [20:02:25] Operations, LDAP-Access-Requests, WMF-Legal: Add Abban Dunne to the ldap/wmde group - https://phabricator.wikimedia.org/T246664 (RStallman-legalteam) Thanks @AbbanWMDE.
I'll send you the NDA via docusign and will update the ticket when it's fully executed. [20:04:40] (PS2) Dzahn: varnish: remove grafana-labs-admin [puppet] - https://gerrit.wikimedia.org/r/576427 [20:05:17] (CR) Dzahn: "..unless it makes no sense anymore to remove stuff from varnish ?" [puppet] - https://gerrit.wikimedia.org/r/576427 (owner: Dzahn) [20:11:38] Operations, DC-Ops: codfw spare pool system for partman testing - https://phabricator.wikimedia.org/T215301 (RobH) [20:15:37] Operations, hardware-requests, Performance-Team (Radar): eqiad: (1) misc single cpu server allocation for performance browser testing - https://phabricator.wikimedia.org/T204589 (RobH) Stalled→Declined So this has been sitting blocked for months and months. The asset/server referenced has be... [20:15:46] Operations, Performance-Team (Radar): eqiad: (1) misc single cpu server allocation for performance browser testing - https://phabricator.wikimedia.org/T204589 (RobH) [20:19:40] (PS1) RhinosF1: Insert the description of the change. [mediawiki-config] - https://gerrit.wikimedia.org/r/576437 [20:25:40] (CR) Jhedden: [C: +1] ATS: switch grafana-labs backends from http to https [puppet] - https://gerrit.wikimedia.org/r/576434 (https://phabricator.wikimedia.org/T210411) (owner: Dzahn) [20:28:09] (PS2) RhinosF1: Add throttle exempt for 2020-03-07 GenderGap Event [mediawiki-config] - https://gerrit.wikimedia.org/r/576437 (https://phabricator.wikimedia.org/T246813) [20:28:34] !log otto@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'production' . [20:28:34] !log otto@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'canary' .
[20:28:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:30:41] !log joal@deploy1001 Started deploy [analytics/refinery@264c7ec]: Regular [20:30:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:20] (CR) Urbanecm: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/576437 (https://phabricator.wikimedia.org/T246813) (owner: RhinosF1) [20:32:44] Operations, DC-Ops, decommission: decommission WMF6147 (old frpig2001.frack.codfw.wmnet) - https://phabricator.wikimedia.org/T246824 (Jgreen) [20:33:11] Operations, fundraising-tech-ops, netops: DHCP routing issue with civi2001 - https://phabricator.wikimedia.org/T246812 (Jgreen) Open→Stalled [20:33:31] (CR) jerkins-bot: [V: -1] Add throttle exempt for 2020-03-07 GenderGap Event [mediawiki-config] - https://gerrit.wikimedia.org/r/576437 (https://phabricator.wikimedia.org/T246813) (owner: RhinosF1) [20:33:55] PROBLEM - Check systemd state on lvs1013 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:34:32] Operations, DC-Ops, decommission: decommission WMF6147 (old frpig2001.frack.codfw.wmnet) - https://phabricator.wikimedia.org/T246824 (Jgreen) [20:35:07] (CR) Dzahn: [C: +2] ATS: switch grafana-labs backends from http to https [puppet] - https://gerrit.wikimedia.org/r/576434 (https://phabricator.wikimedia.org/T210411) (owner: Dzahn) [20:37:15] hmmm what's going on with lvs1013?
[20:38:33] vgutierrez: ● ifup@enp4s0f0.service loaded failed failed ifup for enp4s0f0 [20:38:51] jouncebot: now [20:38:51] For the next 0 hour(s) and 21 minute(s): Mediawiki train - European+American Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200303T2000) [20:39:15] (PS1) MarcoAurelio: [WIP] offboard-user: Include new security subprojects [puppet] - https://gerrit.wikimedia.org/r/576440 [20:39:34] convenient. [20:40:10] (CR) jerkins-bot: [V: -1] [WIP] offboard-user: Include new security subprojects [puppet] - https://gerrit.wikimedia.org/r/576440 (owner: MarcoAurelio) [20:40:43] vgutierrez: but ip link show says it's UP, so maybe just needs reset-failed? [20:40:50] nope [20:40:59] it is actually an issue [20:42:01] !log otto@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'production' . [20:42:01] !log otto@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'canary' .
[20:42:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:42:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:42:42] !log stopping pybal on lvs1013 [20:42:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:46] !log joal@deploy1001 Finished deploy [analytics/refinery@264c7ec]: Regular (duration: 13m 05s) [20:43:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:52] (PS3) RhinosF1: Add throttle exempt for 2020-03-07 GenderGap Event [mediawiki-config] - https://gerrit.wikimedia.org/r/576437 (https://phabricator.wikimedia.org/T246813) [20:44:16] (CR) Urbanecm: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/576437 (https://phabricator.wikimedia.org/T246813) (owner: RhinosF1) [20:44:16] !log joal@deploy1001 Started deploy [analytics/refinery@264c7ec] (thin): Regular weekly analytics deploy [20:44:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:44:24] !log joal@deploy1001 Finished deploy [analytics/refinery@264c7ec] (thin): Regular weekly analytics deploy (duration: 00m 07s) [20:44:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:45:00] (CR) Dzahn: "tcpdump port 443 on cloudmetrics1002 shows how this is now getting https traffic and web interfaces are still up and working" [puppet] - https://gerrit.wikimedia.org/r/576434 (https://phabricator.wikimedia.org/T210411) (owner: Dzahn) [20:46:43] (PS2) MarcoAurelio: offboard-user: Include new security subprojects [puppet] - https://gerrit.wikimedia.org/r/576440 [20:46:58] (CR) MarcoAurelio: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/576440 (owner: MarcoAurelio) [20:49:07] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:49:12] (PS3)
MarcoAurelio: offboard-user: Include new security subprojects [puppet] - https://gerrit.wikimedia.org/r/576440 [20:49:20] ^^ that's me as well [20:49:20] (CR) MarcoAurelio: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/576440 (owner: MarcoAurelio) [20:49:28] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:54:11] !log rebooting lvs1013 [20:54:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:54:44] vgutierrez: thanks [20:56:41] Operations, Traffic, serviceops, Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (Dzahn) [20:57:11] Operations, Traffic, serviceops, Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (Dzahn) grafana-labs and graphite-labs have switched to TLS now. [x] cloudmetrics1002.eqiad.wmnet - http://grafana-labs.wikimedia.org http://graphite-labs.wikimedia.org [20:59:38] !log Starting pybal on lvs1013 [20:59:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:11] RECOVERY - BGP status on cr1-eqiad is OK: BGP OK - up: 37, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:00:13] it looked like a weird race condition configuring the main interface [21:00:21] it's all good now :/ [21:00:31] RECOVERY - BGP status on cr2-eqiad is OK: BGP OK - up: 197, down: 1, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:01:10] alright, good [21:01:50] the good news is, successful test of the new BGP alerts for pybal 🙃 [21:01:51] RECOVERY - Check systemd state on lvs1013 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:02:36] Operations, cloud-services-team (Kanban): Migrate remaining self-hosted puppet
masters to Puppet 5 / facter 3 - https://phabricator.wikimedia.org/T241719 (Krenair) [21:03:08] cdanis: oh, we tested that a few hours ago as well :) [21:03:31] (Abandoned) Dzahn: ATS: switch backend URL to https/discovery for graphite-labs [puppet] - https://gerrit.wikimedia.org/r/572391 (https://phabricator.wikimedia.org/T210411) (owner: Dzahn) [21:03:38] anyways.. time to sleep here.. talk to you tomorrow folks :) [21:03:54] (Abandoned) Dzahn: ATS: switch backend URL to https for grafana-labs [puppet] - https://gerrit.wikimedia.org/r/572382 (https://phabricator.wikimedia.org/T210411) (owner: Dzahn) [21:04:21] good night, ttyl [21:05:56] night [21:06:04] (CR) Dzahn: "I forgot because this has been a couple months. Was there an issue with merging this?" [puppet] - https://gerrit.wikimedia.org/r/513201 (owner: Dzahn) [21:13:25] !log thcipriani@deploy1001 Synchronized php-1.35.0-wmf.22/includes/Defines.php: [[gerrit:576439|Update MW_VERSION to 1.35.0-wmf.22]] (duration: 01m 06s) [21:13:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:14:34] (PS1) RLazarus: Install httpbb on deployment servers, alongside apache-fast-test. [puppet] - https://gerrit.wikimedia.org/r/576448 (https://phabricator.wikimedia.org/T236699) [21:15:43] ^ thanks longma [21:16:21] thanks for doing the deploy :P [21:16:39] (CR) Jforrester: [C: +1] Scap: update-interwiki-cache for labs [mediawiki-config] - https://gerrit.wikimedia.org/r/446507 (https://phabricator.wikimedia.org/T198844) (owner: Thcipriani) [21:18:39] (CR) Dzahn: [C: +1] Install httpbb on deployment servers, alongside apache-fast-test. [puppet] - https://gerrit.wikimedia.org/r/576448 (https://phabricator.wikimedia.org/T236699) (owner: RLazarus) [21:18:59] (CR) RLazarus: [C: +2] Install httpbb on deployment servers, alongside apache-fast-test.
[puppet] - https://gerrit.wikimedia.org/r/576448 (https://phabricator.wikimedia.org/T236699) (owner: RLazarus) [21:33:12] (CR) Jhedden: [C: +2] keepalived: add initial module and toolforge profile [puppet] - https://gerrit.wikimedia.org/r/576395 (https://phabricator.wikimedia.org/T236606) (owner: Jhedden) [21:33:46] !log otto@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'production' . [21:33:46] !log otto@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'canary' . [21:33:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:33:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:38:59] (PS6) Jforrester: Synchronize and fix DisableQueryPageUpdate for wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/573969 (owner: Matěj Suchánek) [21:43:42] (CR) Jforrester: [C: +2] Synchronize and fix DisableQueryPageUpdate for wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/573969 (owner: Matěj Suchánek) [21:44:11] (CR) Jforrester: [C: +1] "> Patch Set 3:" [mediawiki-config] - https://gerrit.wikimedia.org/r/573558 (https://phabricator.wikimedia.org/T244549) (owner: Hnowlan) [21:44:26] (Merged) jenkins-bot: Synchronize and fix DisableQueryPageUpdate for wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/573969 (owner: Matěj Suchánek) [21:46:46] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [wikidatawiki] Note that MostRevisions and MostLinked have been disabled (duration: 01m 05s) [21:46:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:47:31] (PS1) Alex Monk: cloud: Clean cloud-puppetmaster hiera up and catch up with reality [puppet] - https://gerrit.wikimedia.org/r/576450 (https://phabricator.wikimedia.org/T235218) [21:47:44] (PS2) Jforrester: Set
GrowthExperiments help panel search API for beta viwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/575762 (https://phabricator.wikimedia.org/T246511) (owner: Gergő Tisza) [21:47:51] (CR) Jforrester: [C: +2] Set GrowthExperiments help panel search API for beta viwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/575762 (https://phabricator.wikimedia.org/T246511) (owner: Gergő Tisza) [21:48:00] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Touch and secondary sync of IS for cache-busting (duration: 01m 04s) [21:48:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:49:04] (PS6) CRusnov: tox: Support DNS_INCLUDE_DIR and generated DNS [dns] - https://gerrit.wikimedia.org/r/569340 (https://phabricator.wikimedia.org/T243362) [21:49:04] (Merged) jenkins-bot: Set GrowthExperiments help panel search API for beta viwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/575762 (https://phabricator.wikimedia.org/T246511) (owner: Gergő Tisza) [21:50:12] Operations, cloud-services-team (Kanban): Migrate remaining self-hosted puppet masters to Puppet 5 / facter 3 - https://phabricator.wikimedia.org/T241719 (Krenair) [21:50:18] (PS2) Jforrester: Log csp and csp-report-only channels in beta [mediawiki-config] - https://gerrit.wikimedia.org/r/575976 (owner: Brian Wolff) [21:50:48] (CR) Jforrester: [C: +2] Log csp and csp-report-only channels in beta [mediawiki-config] - https://gerrit.wikimedia.org/r/575976 (owner: Brian Wolff) [21:50:49] Urbanecm: Shall I stick https://phabricator.wikimedia.org/T246832 on the same patch or another? [21:51:14] RhinosF1: I'd go with another [21:51:27] kk [21:51:33] (CR) CRusnov: "As we discussed i have changed the puppet manifest."
(2 comments) [puppet] - https://gerrit.wikimedia.org/r/575603 (https://phabricator.wikimedia.org/T243927) (owner: CRusnov) [21:51:38] (Merged) jenkins-bot: Log csp and csp-report-only channels in beta [mediawiki-config] - https://gerrit.wikimedia.org/r/575976 (owner: Brian Wolff) [21:51:49] (PS1) RLazarus: httpbb: Add python3-clustershell to required packages. [puppet] - https://gerrit.wikimedia.org/r/576451 (https://phabricator.wikimedia.org/T236699) [21:52:31] (PS1) RhinosF1: IP Cap Lift for University of Mannheim Wikimedia Event (2020-04-01) [mediawiki-config] - https://gerrit.wikimedia.org/r/576452 [21:53:01] Operations, cloud-services-team (Kanban): Migrate remaining self-hosted puppet masters to Puppet 5 / facter 3 - https://phabricator.wikimedia.org/T241719 (Krenair) [21:53:46] (PS6) Jforrester: [cirrus] move similarity settings to IS.php [mediawiki-config] - https://gerrit.wikimedia.org/r/558576 (owner: DCausse) [21:54:10] (CR) Dzahn: "i'm afraid this is only in buster but deploy1001 is still stretch" [puppet] - https://gerrit.wikimedia.org/r/576451 (https://phabricator.wikimedia.org/T236699) (owner: RLazarus) [21:54:24] (CR) Jforrester: "LGTM." [mediawiki-config] - https://gerrit.wikimedia.org/r/558576 (owner: DCausse) [21:55:51] (CR) Dzahn: [C: +1] "nevermind, we have our own packages apparently:" [puppet] - https://gerrit.wikimedia.org/r/576451 (https://phabricator.wikimedia.org/T236699) (owner: RLazarus) [21:58:06] (CR) RLazarus: [C: +2] "Appreciate the reminder anyway -- I hadn't thought to check! Thanks."
[puppet] - 10https://gerrit.wikimedia.org/r/576451 (https://phabricator.wikimedia.org/T236699) (owner: 10RLazarus) [21:58:39] (03PS2) 10RhinosF1: IP Cap Lift for University of Mannheim Wikimedia Event (2020-04-01) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576452 (https://phabricator.wikimedia.org/T246832) [21:59:30] Urbanecm: ^ Shall I go for tommorow in a later window or same? My only thought is it will need a rebase. [21:59:51] RhinosF1: same is fine, and rebases are common and easily solved :) [22:00:37] Urbanecm: are you okay to go down as dev for that then? [22:00:56] yeah 🙂 [22:01:16] 10Operations, 10cloud-services-team (Kanban): Migrate remaining self-hosted puppet masters to Puppet 5 / facter 3 - https://phabricator.wikimedia.org/T241719 (10Krenair) [22:02:21] (03PS3) 10Jforrester: Update three logos with more detailed versions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/555620 (https://phabricator.wikimedia.org/T150618) (owner: 10Bjornskjald) [22:02:21] (03PS2) 10Jforrester: [arwikibooks] Add HD logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/555629 (https://phabricator.wikimedia.org/T150618) (owner: 10Bjornskjald) [22:02:21] (03PS1) 10Jforrester: [cawikibooks] Add HD logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576453 (https://phabricator.wikimedia.org/T150618) [22:02:21] (03PS1) 10Jforrester: [plwikivoyage] Add HD logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576454 (https://phabricator.wikimedia.org/T150618) [22:05:36] Urbanecm: can you set Jenkins off? [22:05:47] (03CR) 10Platonides: [C: 03+1] "Maybe add 'commonswiki' to the wiki list" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576452 (https://phabricator.wikimedia.org/T246832) (owner: 10RhinosF1) [22:05:49] RhinosF1: what do you mean? [22:05:57] Urbanecm: recheck [22:06:12] ah. 
The +1 Platonides just gave you should work in same way [22:06:48] 10Operations, 10cloud-services-team (Kanban): Migrate remaining self-hosted puppet masters to Puppet 5 / facter 3 - https://phabricator.wikimedia.org/T241719 (10Krenair) [22:07:06] 10Operations, 10cloud-services-team (Kanban): Migrate remaining self-hosted puppet masters to Puppet 5 / facter 3 - https://phabricator.wikimedia.org/T241719 (10Krenair) [22:07:51] (03PS3) 10Jforrester: Added betawikiversity and ukwikinews hd logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556863 (owner: 10TechneSiyam) [22:07:53] (03PS3) 10Jforrester: [betawikiversity] Add HD logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556864 (https://phabricator.wikimedia.org/T150618) (owner: 10TechneSiyam) [22:07:55] (03PS1) 10Jforrester: [ukwikinews] Add HD logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576455 (https://phabricator.wikimedia.org/T150618) [22:08:03] 10Operations, 10ops-eqiad, 10DC-Ops: (Need by: ASAP) rack/setup/install stat1008 - https://phabricator.wikimedia.org/T246472 (10RobH) Both of these items have arrived at the dc site: T238587 arrived today (@Jclark-ctr mentioned in irc) and T242149 arrived on 2020-03-02. [22:08:08] Urbanecm: cool. See https://phabricator.wikimedia.org/T246832#5939133? I'll add commonswiki but are we ok with enwiki + frwiki as well? [22:08:37] RhinosF1: anything that can make things work more smoothly at the event is possible here. 
Feel free to add both [22:08:50] {{doing}} [22:08:55] 10Operations, 10cloud-services-team (Kanban): Migrate remaining self-hosted puppet masters to Puppet 5 / facter 3 - https://phabricator.wikimedia.org/T241719 (10Krenair) [22:08:58] (03CR) 10RhinosF1: "Doing" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576452 (https://phabricator.wikimedia.org/T246832) (owner: 10RhinosF1) [22:09:46] (03PS3) 10RhinosF1: IP Cap Lift for University of Mannheim Wikimedia Event (2020-04-01) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576452 (https://phabricator.wikimedia.org/T246832) [22:09:56] Urbanecm: ^ [22:10:23] recheck submitted [22:10:34] (03CR) 10Urbanecm: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576452 (https://phabricator.wikimedia.org/T246832) (owner: 10RhinosF1) [22:10:46] (03CR) 10Urbanecm: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576437 (https://phabricator.wikimedia.org/T246813) (owner: 10RhinosF1) [22:12:14] Urbanecm: Thanks! I'm catching the bug quickly here. [22:13:06] 10Operations, 10cloud-services-team (Kanban): Migrate remaining self-hosted puppet masters to Puppet 5 / facter 3 - https://phabricator.wikimedia.org/T241719 (10Krenair) [22:17:10] (03PS12) 10Jforrester: [cywikiquote] Add custom logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/560386 (https://phabricator.wikimedia.org/T150618) (owner: 10Subscriptshoe9) [22:17:12] (03PS1) 10Jforrester: [jawikiquote] Add custom logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576457 (https://phabricator.wikimedia.org/T150618) [22:17:14] (03PS1) 10Jforrester: [fawikivoyage] Add custom logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/576458 (https://phabricator.wikimedia.org/T150618) [22:27:49] Urbanecm: I'll probably schedule Monday or Tuesday to remove any complete events. 
[22:28:54] (PS3) Jforrester: Modified files with correct sized logos [mediawiki-config] - https://gerrit.wikimedia.org/r/556212 (owner: TechneSiyam)
[22:29:35] (PS7) CRusnov: netbox: Add framework for exposing scripts to internal services [puppet] - https://gerrit.wikimedia.org/r/575603 (https://phabricator.wikimedia.org/T243927)
[22:29:37] (PS1) CRusnov: prometheus::ops: Add prometheus job to scrape Netbox scripts [puppet] - https://gerrit.wikimedia.org/r/576459 (https://phabricator.wikimedia.org/T243927)
[22:29:41] (CR) Herron: [C: +2] lists: don't assume lists IP is an interface::alias [puppet] - https://gerrit.wikimedia.org/r/576362 (https://phabricator.wikimedia.org/T224586) (owner: Herron)
[22:32:40] (CR) jerkins-bot: [V: -1] prometheus::ops: Add prometheus job to scrape Netbox scripts [puppet] - https://gerrit.wikimedia.org/r/576459 (https://phabricator.wikimedia.org/T243927) (owner: CRusnov)
[22:34:04] (PS1) Dzahn: installserver: allow stopping the DHCP server via Hiera [puppet] - https://gerrit.wikimedia.org/r/576460 (https://phabricator.wikimedia.org/T224576)
[22:37:28] (Abandoned) Herron: dns: add logstash-next.wikimedia.org record [dns] - https://gerrit.wikimedia.org/r/576152 (https://phabricator.wikimedia.org/T234854) (owner: Herron)
[22:37:44] (CR) Herron: [C: +1] Add new DNS entries for logstash-next plus the CAS counter parts [dns] - https://gerrit.wikimedia.org/r/575530 (owner: Muehlenhoff)
[22:37:53] (PS1) Volans: scripts: add decommission device script [software/netbox-extras] - https://gerrit.wikimedia.org/r/576461 (https://phabricator.wikimedia.org/T244315)
[22:38:03] (PS2) Dzahn: installserver: allow stopping DHCP server in Hiera, apply new role [puppet] - https://gerrit.wikimedia.org/r/576460 (https://phabricator.wikimedia.org/T224576)
[22:38:12] (CR) jerkins-bot: [V: -1] scripts: add decommission device script [software/netbox-extras] - https://gerrit.wikimedia.org/r/576461 (https://phabricator.wikimedia.org/T244315) (owner: Volans)
[22:39:44] (PS2) Volans: scripts: add decommission device script [software/netbox-extras] - https://gerrit.wikimedia.org/r/576461 (https://phabricator.wikimedia.org/T244315)
[22:40:32] Operations, ops-codfw, DC-Ops, Traffic, decommission: decommission lvs2001.codfw.wmnet - https://phabricator.wikimedia.org/T246779 (Papaul) ` [edit interfaces interface-range disabled] member ge-5/0/5 { ... } + member xe-2/0/47; [edit interfaces xe-2/0/47] - description lvs2001:eno1...
[22:40:54] Operations, ops-codfw, DC-Ops, Traffic, decommission: decommission lvs2001.codfw.wmnet - https://phabricator.wikimedia.org/T246779 (Papaul)
[22:41:13] (PS2) CRusnov: prometheus::ops: Add prometheus job to scrape Netbox scripts [puppet] - https://gerrit.wikimedia.org/r/576459 (https://phabricator.wikimedia.org/T243927)
[22:41:15] (CR) Herron: [C: -1] "Since we addressed the IP address bit with Ibafec8f7fb2779646156f85232019e59a6d682c0 I think all that's left here is cgi vs cgid." [puppet] - https://gerrit.wikimedia.org/r/576333 (https://phabricator.wikimedia.org/T242910) (owner: Jbond)
[22:41:25] (CR) Volans: "Example generated document with 2 decom hosts available here:" [software/netbox-extras] - https://gerrit.wikimedia.org/r/576461 (https://phabricator.wikimedia.org/T244315) (owner: Volans)
[22:42:35] * Krinkle testing on mwdebug1002
[22:45:35] (CR) CRusnov: "This change is ready for review." [puppet] - https://gerrit.wikimedia.org/r/576459 (https://phabricator.wikimedia.org/T243927) (owner: CRusnov)
[22:46:51] (CR) Dzahn: [C: +1] "https://puppet-compiler.wmflabs.org/compiler1003/21239/ if you click through to the change catalogs and search "ensure_service" you can s" [puppet] - https://gerrit.wikimedia.org/r/576460 (https://phabricator.wikimedia.org/T224576) (owner: Dzahn)
[22:52:54] (PS3) CRusnov: prometheus::ops: Add prometheus job to scrape Netbox scripts [puppet] - https://gerrit.wikimedia.org/r/576459 (https://phabricator.wikimedia.org/T243927)
[22:55:19] (CR) CRusnov: "compiler output:" [puppet] - https://gerrit.wikimedia.org/r/576459 (https://phabricator.wikimedia.org/T243927) (owner: CRusnov)
[22:56:12] (CR) CRusnov: "This change is ready for review." [puppet] - https://gerrit.wikimedia.org/r/575603 (https://phabricator.wikimedia.org/T243927) (owner: CRusnov)
[23:00:41] (CR) Dzahn: [C: +2] installserver: allow stopping DHCP server in Hiera, apply new role [puppet] - https://gerrit.wikimedia.org/r/576460 (https://phabricator.wikimedia.org/T224576) (owner: Dzahn)
[23:05:23] (CR) Dzahn: "noop on install1002/2002. adding DHCP/TFTP role on install1003/2003. pulling all the tftp files.." [puppet] - https://gerrit.wikimedia.org/r/576460 (https://phabricator.wikimedia.org/T224576) (owner: Dzahn)
[23:07:16] * Krinkle is done testing
[23:10:24] (PS1) RLazarus: cumin: Replace apache-fast-test with httpbb in reimage scripts [puppet] - https://gerrit.wikimedia.org/r/576464
[23:12:21] Operations, ops-eqiad, DC-Ops: audit/rebalance power in a5-eqiad - https://phabricator.wikimedia.org/T245655 (wiki_willy) a:wiki_willy Hi @ayounsi - since each line pair can go up to 30amps (for 30amp PDUs), we should probably set ours to 12amps for alerting. (which would still include the 20% bu...
[23:13:08] (PS1) Jforrester: [Beta Cluster] Tell parsoid10 that it's in the parsoid group [puppet] - https://gerrit.wikimedia.org/r/576466 (https://phabricator.wikimedia.org/T246833)
[23:13:47] PROBLEM - ElasticSearch shard size check - 9243 on search.svc.eqiad.wmnet is CRITICAL: CRITICAL - commonswiki_content_1582399079(65gb) https://wikitech.wikimedia.org/wiki/Search%23If_it_has_been_indexed
[23:14:47] (CR) Dzahn: [C: +2] [Beta Cluster] Tell parsoid10 that it's in the parsoid group [puppet] - https://gerrit.wikimedia.org/r/576466 (https://phabricator.wikimedia.org/T246833) (owner: Jforrester)
[23:16:08] (PS1) Papaul: DBS: Add mgmt DNS for pay-lvs200[1-2] [dns] - https://gerrit.wikimedia.org/r/576468
[23:17:27] (CR) Papaul: [C: +2] DBS: Add mgmt DNS for pay-lvs200[1-2] [dns] - https://gerrit.wikimedia.org/r/576468 (owner: Papaul)
[23:17:54] (CR) Dzahn: [C: +1] DBS: Add mgmt DNS for pay-lvs200[1-2] [dns] - https://gerrit.wikimedia.org/r/576468 (owner: Papaul)
[23:20:40] (PS1) Bstorm: toolforge: remove special configuration for kubernetes on proxy servers [puppet] - https://gerrit.wikimedia.org/r/576469 (https://phabricator.wikimedia.org/T214513)
[23:24:11] (CR) Dzahn: "No subnet declaration for ens5 (10.192.0.140)." [puppet] - https://gerrit.wikimedia.org/r/576460 (https://phabricator.wikimedia.org/T224576) (owner: Dzahn)
[23:29:03] (CR) Bstorm: "Well, that got the list of things to clean up at least from the PCC." [puppet] - https://gerrit.wikimedia.org/r/576469 (https://phabricator.wikimedia.org/T214513) (owner: Bstorm)
[23:36:01] (CR) Dzahn: "On jessie the DHCP server simply listens on all if "INTERFACES=" is not set. since that is the case on the old servers and they work. Thou" [puppet] - https://gerrit.wikimedia.org/r/576460 (https://phabricator.wikimedia.org/T224576) (owner: Dzahn)
[23:39:17] (PS4) Cwhite: monitoring: remove hostname from mgmt definitions [puppet] - https://gerrit.wikimedia.org/r/526165
[23:39:54] Operations, ops-codfw: (Need by: TBD) codfw: rack/setup/install wdqs200[7-8].codfw.wmnet - https://phabricator.wikimedia.org/T242301 (Papaul) @Gehel can you please make another task for service implementation and resolve this task? Thanks
[23:40:09] (PS5) Cwhite: monitoring: remove hostname from mgmt definitions [puppet] - https://gerrit.wikimedia.org/r/526165
[23:40:42] Operations, ops-codfw, Discovery-Search (Current work): (Need by: TBD) codfw: rack/setup/install elastic20{55,56,57,58,59,60}.codfw.wmnet - https://phabricator.wikimedia.org/T241337 (Papaul) @Gehel can you please make another task for service implementation and resolve this task? Thanks
[23:43:41] (CR) Cwhite: [C: +2] monitoring: remove hostname from mgmt definitions [puppet] - https://gerrit.wikimedia.org/r/526165 (owner: Cwhite)
[23:44:13] (CR) Dzahn: [C: +1] monitoring: remove hostname from mgmt definitions [puppet] - https://gerrit.wikimedia.org/r/526165 (owner: Cwhite)
[23:47:00] (PS2) Bstorm: toolforge: remove special configuration for kubernetes on proxy servers [puppet] - https://gerrit.wikimedia.org/r/576469 (https://phabricator.wikimedia.org/T214513)
[23:56:04] (PS3) Bstorm: toolforge: remove special configuration for kubernetes on proxy servers [puppet] - https://gerrit.wikimedia.org/r/576469 (https://phabricator.wikimedia.org/T214513)
[23:57:23] (CR) Volans: [C: +1] "LGTM, two questions inline." (2 comments) [puppet] - https://gerrit.wikimedia.org/r/576464 (owner: RLazarus)
[23:59:36] Operations, ops-codfw, fundraising-tech-ops: rack/setup/install frpm2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T242269 (Dwisehaupt) a:Jgreen→Dwisehaupt