[00:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Evening SWAT (Max 6 patches) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190212T0000). [00:00:05] ebernhardson: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [00:00:28] (03PS2) 10Jforrester: Stop setting wgSessionsInObjectCache, it's being removed from MW [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489408 [00:00:37] (03CR) 10Jforrester: [C: 03+2] Stop setting wgSessionsInObjectCache, it's being removed from MW [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489408 (owner: 10Jforrester) [00:01:01] James_F: hi! thx for SWATtin', I have a last-minute addition :) [00:01:06] ebernhardson: You around? [00:01:14] AndyRussG: Sure. :-) [00:01:18] thx! [00:01:36] cdanis: is icingen the plural of icinga? :-P [00:01:45] that was my intent 😂 [00:02:01] classic [00:02:01] (03Merged) 10jenkins-bot: Stop setting wgSessionsInObjectCache, it's being removed from MW [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489408 (owner: 10Jforrester) [00:03:11] i thought icinga was already plural of icingum [00:03:24] (03CR) 10Volans: [C: 03+1] "LGTM, nitpick on the naming" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/489914 (owner: 10CDanis) [00:03:32] cwd: well played ;) [00:03:35] icinga, icingæ, obviously. 
[00:03:46] aha [00:04:03] (03CR) 10CDanis: icinga: fix manual sync procedure during failovers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/489914 (owner: 10CDanis) [00:04:04] icinga / icingus [00:04:47] * volans sends everyone to #wikimedia-latin :-P [00:04:57] icingae [00:05:04] I like icingus just because it rhymes with dingus [00:05:12] AndyRussG: What patch do you want deployed? [00:05:15] I think that's as good a reason as any to prefer that pluralization [00:05:34] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: SWAT Stop setting wgSessionsInObjectCache, it's being removed from MW I2946b5b9a (duration: 00m 47s) [00:05:35] But …us is singular. [00:05:38] James_F: Just added it to the deployments page, https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/CentralNotice/+/489309/-1..1 [00:05:40] * James_F stops, he promises. [00:05:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:05:56] (03PS2) 10CDanis: icinga: fix manual sync procedure during failovers [puppet] - 10https://gerrit.wikimedia.org/r/489914 [00:06:04] AndyRussG: Ouch, a merge patch? That's… not normal. [00:06:13] James_F: correct [00:06:22] See preceding discussion with thcipriani ^ [00:06:29] Yeah. [00:06:42] And also his patch to fix the CN headaches real soon: https://gerrit.wikimedia.org/r/#/c/mediawiki/tools/release/+/489906/ [00:06:54] Yeah, was reviewing. [00:07:17] It's a small number of actual changes being merged in on this merge patch, so it shouldn't contravene that aspect of SWAT rules [00:07:49] So you want https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CentralNotice/+/489309 pushed on both wmf.14 and wmf.16? [00:07:59] Oh, wait, we don't have wmf.14 deployed, ignore me.
[00:08:01] (03CR) 10CDanis: "reran PCC out of an excess of paranoia, looks good still https://puppet-compiler.wmflabs.org/compiler1002/14611/" [puppet] - 10https://gerrit.wikimedia.org/r/489914 (owner: 10CDanis) [00:08:02] Ah no, wait, which one are we on now? [00:08:03] Okie-dokie. [00:08:05] (03CR) 10CDanis: [C: 03+2] icinga: fix manual sync procedure during failovers [puppet] - 10https://gerrit.wikimedia.org/r/489914 (owner: 10CDanis) [00:08:14] We're on wmf.16 everywhere. [00:08:26] James_F: ok yee [00:08:52] (03CR) 10jenkins-bot: Stop setting wgSessionsInObjectCache, it's being removed from MW [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489408 (owner: 10Jforrester) [00:09:06] So now that the CN merge patch has merged, the submodule pointer should point to the new stuff [00:09:10] mutante, cdanis got page, lol for the text ;) [00:09:22] lol [00:09:24] AndyRussG: Live on mwdebug1002; testable? [00:09:44] okay, I am going to email ops@ about the icinga failover and then sign off for the night [00:10:11] James_F: checking! [00:10:37] ebernhardson: I don't feel confident enough to deploy your Cirrus config patch without you here to sign-off, sorry. [00:10:53] cdanis: thanks, if it's too late I can do that too [00:11:11] you'd let me do this much of the work and then not get the credit?! ;) [00:11:50] James_F: looks good! [00:12:02] AndyRussG: OK, syncing now. [00:12:07] yeee [00:13:17] !log jforrester@deploy1001 Synchronized php-1.33.0-wmf.16/extensions/CentralNotice/: SWAT Merge branch 'master' into wmf_deploy I8e52d222eb (duration: 00m 49s) [00:13:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:14:35] OK, all looks quiet. SWAT is done. [00:14:54] James_F: thanks much!!!! [00:15:35] cdanis: lol, didn't mean that ofc :-P [00:21:49] James_F: no worries i completely spaced...can do it myself [00:22:02] volans: 💓 [00:22:06] ebernhardson: All clear, go for it.
:-) [00:22:11] emailed ops@, ciao all [00:22:14] sweet [00:23:00] (03PS2) 10EBernhardson: Promote new wbsearchentities profiles to default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489780 (https://phabricator.wikimedia.org/T214515) [00:23:07] (03CR) 10EBernhardson: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489780 (https://phabricator.wikimedia.org/T214515) (owner: 10EBernhardson) [00:23:33] (03PS5) 10Paladox: WIP: Update gerrit to 2.16.4 [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/486711 [00:23:52] thanks a lot cdanis! [00:23:58] thank you volans! [00:24:02] (03Merged) 10jenkins-bot: Promote new wbsearchentities profiles to default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489780 (https://phabricator.wikimedia.org/T214515) (owner: 10EBernhardson) [00:26:10] James_F: ooops I think we needed a full scap [00:26:33] AndyRussG: Ah, OK. ebernhardson, can you do one once you're done? [00:26:56] yeah forgot there were new messages [00:28:01] James_F: yup, i've started syncing mine now [00:28:05] !log ebernhardson@deploy1001 Synchronized wmf-config/WikibaseSearchSettings.php: gerrit:489780 T214515 Promote new wbsearchentities profiles to default in de, fr, es (duration: 00m 46s) [00:28:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:28:08] T214515: Run wikidata entitiy autocomplete AB test in de, fr, es - https://phabricator.wikimedia.org/T214515 [00:28:37] James_F ebernhardson thx! [00:28:38] AndyRussG: I saw there were message changes but didn't think to ask if you cared. :-) [00:28:39] simply `scap sync` should do the whole world right? Last time i ran one i think it was under 10 minutes (surprising to me, will see this time...) [00:29:24] ebernhardson: Yes. It will take between 10 and 70 minutes depending. I can run it if you're not sure how long you can be around. [00:29:35] ("Depending on what?" "Indeed.")
[00:30:00] (03PS6) 10Paladox: WIP: Update gerrit to 2.16.4 [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/486711 [00:30:48] James_F well most of them are just the normal translation updates, so not urgent. Mmmm yeah the reason I didn't notice when smoke testing on the mwdebug1002 was that I was just testing the code that runs everywhere (as opposed to the CN admin interface on Meta) [00:30:52] !log ebernhardson@deploy1001 Started scap: SWAT: full sync for gerrit:489309 i18n [00:30:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:32:03] (03CR) 10jenkins-bot: Promote new wbsearchentities profiles to default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489780 (https://phabricator.wikimedia.org/T214515) (owner: 10EBernhardson) [00:34:28] (03PS7) 10Paladox: WIP: Update gerrit to 2.16.5 [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/486711 [00:49:13] !log ebernhardson@deploy1001 Finished scap: SWAT: full sync for gerrit:489309 i18n (duration: 18m 20s) [00:49:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:50:59] AndyRussG: Should be all done. [00:53:33] (03CR) 10BryanDavis: "> For the record, there is a standard tool which can be used to" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/489409 (https://phabricator.wikimedia.org/T178601) (owner: 10BryanDavis) [00:54:12] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/14612/" [puppet] - 10https://gerrit.wikimedia.org/r/484811 (https://phabricator.wikimedia.org/T201366) (owner: 10Dzahn) [00:54:26] (03PS4) 10Dzahn: testreduce: no require_package for nodejs, avoid dependency cycle [puppet] - 10https://gerrit.wikimedia.org/r/484811 (https://phabricator.wikimedia.org/T201366) [00:58:22] James_F: thanks much! 
[00:58:32] apologies for not catching that before [00:58:45] (the messages I mean) [01:24:19] (03PS7) 10Dzahn: testreduce: pin npm to backports, use install_options [puppet] - 10https://gerrit.wikimedia.org/r/486185 (https://phabricator.wikimedia.org/T201366) [01:25:41] (03CR) 10jerkins-bot: [V: 04-1] testreduce: pin npm to backports, use install_options [puppet] - 10https://gerrit.wikimedia.org/r/486185 (https://phabricator.wikimedia.org/T201366) (owner: 10Dzahn) [01:27:50] (03PS1) 10Catrope: Enable ORES (damaging-only) on itwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489934 (https://phabricator.wikimedia.org/T211032) [01:33:18] (03PS8) 10Dzahn: testreduce: pin npm to backports, use install_options [puppet] - 10https://gerrit.wikimedia.org/r/486185 (https://phabricator.wikimedia.org/T201366) [01:34:35] (03CR) 10jerkins-bot: [V: 04-1] Enable ORES (damaging-only) on itwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489934 (https://phabricator.wikimedia.org/T211032) (owner: 10Catrope) [01:35:47] (03CR) 10Dzahn: [C: 04-2] testreduce: pin npm to backports, use install_options [puppet] - 10https://gerrit.wikimedia.org/r/486185 (https://phabricator.wikimedia.org/T201366) (owner: 10Dzahn) [01:45:01] 10Operations, 10Gerrit: Gerrit loads very slowly - https://phabricator.wikimedia.org/T215855 (10Paladox) [01:45:10] 10Operations, 10Gerrit: Gerrit loads very slowly - https://phabricator.wikimedia.org/T215855 (10Paladox) p:05Triage→03Unbreak! [01:47:04] Though why isn’t icinga reporting it as slow or timing out? [01:47:44] Look at what the plugin is saying? :) [01:47:46] Maybe there's a bug [01:48:40] https://gerrit.wikimedia.org/r/config/server/healthcheck~status [01:48:42] Passes [01:50:46] There’s a spike in threads [01:50:54] According to https://gerrit.wikimedia.org/r/monitoring [01:53:27] Maybe someone should contact releng so that this can be debuged ?
[01:53:32] *debugged [01:59:18] This feels like a bug reedy [01:59:35] Happening 2 times within 2 weeks [02:05:14] 10Operations, 10Gerrit, 10serviceops: Gerrit loads very slowly - https://phabricator.wikimedia.org/T215855 (10Paladox) [02:06:28] Loads getting higher :( [02:09:10] Gerrit just rejected my patchset after assigning it a number?! [02:09:24] https://www.irccloud.com/pastebin/elBULuJh/ [02:10:29] Internal error [02:14:09] You should be able to view gerrit error log in logstash [02:17:01] Any oppers around for gerrit? [02:18:31] oh good [02:18:58] !log restarting gerrit due to high load [02:18:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:19:27] I’m leaning towards this being a bug thcipriani [02:21:36] thcipriani: I sent the jvm a SIGQUIT I think just before you restarted it, so there are a bunch of thread stacks on the jvm's output now, which looks to have been caught by systemd journal [02:21:46] PROBLEM - Gerrit JSON on gerrit.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - page size 1529 too small - 1529 bytes in 0.156 second response time [02:21:53] 10Operations, 10Gerrit, 10serviceops: Gerrit loads very slowly - https://phabricator.wikimedia.org/T215855 (10Paladox) p:05Unbreak!→03High @thcipriani restarted gerrit. Keeping this task open for now. [02:22:56] RECOVERY - Gerrit JSON on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 27343 bytes in 0.444 second response time [02:24:46] Thanks cdanis [02:26:50] cdanis: k, thanks, yep I definitely see a ton of stuff in syslog [02:27:06] i'm extracting the stack traces to a phab paste in a sec [02:28:30] PROBLEM - puppet last run on kafka1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures.
Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] [02:30:09] 10Operations, 10Gerrit, 10serviceops: Gerrit loads very slowly - https://phabricator.wikimedia.org/T215855 (10CDanis) maybe this will be illuminating for someone -- it is stack traces from the gerrit jvm process at the time it was guzzling CPU {P8070} [02:32:21] cdanis: I’ve asked one of the gerrit maintainers about ^^ [02:32:59] thanks. nothing jumped out at me [02:33:06] anyway, I'm offline for the night [02:33:18] Apparently it’s been reported [02:33:40] By other users [02:34:46] cdanis: thcipriani it’s a locking issue [02:35:01] That someone tried to fix with another library but didn’t work [02:35:37] what do you mean a locking issue? what's being locked? [02:36:33] 10Operations, 10Gerrit, 10serviceops: Gerrit loads very slowly - https://phabricator.wikimedia.org/T215855 (10Paladox) I spoke with upstream who said another user had reported that ( it's a locking issue ) they tried to fix it with another library but that didn’t work. [02:36:59] thcipriani: the cache files I think [02:39:43] thcipriani: https://github.com/GerritCodeReview/gerrit/search?q=caffeine+&type=Commits [02:39:53] * https://github.com/GerritCodeReview/gerrit/commit/00fc15ac0073b86270e7c0f40d386f95dfe31e86 [02:40:09] That was the user that reported it and tried to fix it, but was later reverted [02:41:01] 10Operations, 10Gerrit, 10serviceops: Gerrit loads very slowly - https://phabricator.wikimedia.org/T215855 (10thcipriani) I noticed that we've been having high cpu usage at about this time every day, unsure if this is some cleanup or indexing that is run on a schedule. I captured a few things prior to resta... [02:43:20] alright, gerrit looks calm again. I pasted the few things I captured prior to restart on that task. Leaving keyboard again.
[02:44:05] Ok, /me goes too :) [02:53:11] (03PS2) 10Krinkle: mediawiki: Remove beta-cluster specific auto_prepend_file override [puppet] - 10https://gerrit.wikimedia.org/r/488524 (https://phabricator.wikimedia.org/T176370) [02:54:26] RECOVERY - puppet last run on kafka1003 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [03:49:40] (03PS1) 10Andrew Bogott: bootstrap-vz: set up a root terminal on S1 [puppet] - 10https://gerrit.wikimedia.org/r/489947 (https://phabricator.wikimedia.org/T215211) [04:09:07] 10Operations, 10monitoring: Implement an accurate and easy to understand status page for all wikis - https://phabricator.wikimedia.org/T202061 (10TJH2018) I think a good example we could build off of would be https://status.discordapp.com/ as it has the basics and explains why an issue happens. We could easily... [06:02:27] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10cloud-services-team, 10User-Smalyshev: Provide a way to have test servers on real hardware, isolated from production for Wikidata Query Service - https://phabricator.wikimedia.org/T206636 (10Smalyshev) Yes, judging from our preliminary test, if we get... [06:04:31] !log Fourth manual run of unpublished draft purge script (T203059) [06:04:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:04:34] T203059: Fourth manual run of unpublished draft purge script - https://phabricator.wikimedia.org/T203059 [06:09:26] (03PS1) 10Marostegui: db-eqiad.php: Depool db1099:3318 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489968 (https://phabricator.wikimedia.org/T210713) [06:10:00] 10Operations, 10Wikidata, 10Wikidata-Termbox-Hike, 10serviceops, and 4 others: New Service Request: Wikidata Termbox SSR - https://phabricator.wikimedia.org/T212189 (10Smalyshev) I get the idea of server-side HTML rendering to avoid delays. But I am kinda questioning whether the advantage of splitting code... 
[06:11:36] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Depool db1099:3318 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489968 (https://phabricator.wikimedia.org/T210713) (owner: 10Marostegui) [06:12:43] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1099:3318 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489968 (https://phabricator.wikimedia.org/T210713) (owner: 10Marostegui) [06:14:28] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1099:3318 (duration: 00m 52s) [06:14:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:14:43] !log Deploy schema change on db1099:3318 T210713 [06:14:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:14:45] T210713: Drop change_tag.ct_tag column in production - https://phabricator.wikimedia.org/T210713 [06:18:50] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1099:3318 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489968 (https://phabricator.wikimedia.org/T210713) (owner: 10Marostegui) [06:21:20] (03PS1) 10Marostegui: analytics-grans: Remove globaldev [puppet] - 10https://gerrit.wikimedia.org/r/489970 (https://phabricator.wikimedia.org/T200801) [06:22:34] (03PS2) 10Marostegui: analytics-grans: Remove globaldev user [puppet] - 10https://gerrit.wikimedia.org/r/489970 (https://phabricator.wikimedia.org/T200801) [06:23:20] (03PS3) 10Marostegui: analytics-grants: Remove globaldev user [puppet] - 10https://gerrit.wikimedia.org/r/489970 (https://phabricator.wikimedia.org/T200801) [06:26:03] (03PS1) 10Tulsi Bhagat: Enable Rollbackers User Group Right on azwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489971 [06:29:56] PROBLEM - puppet last run on authdns2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/usr/local/bin/prometheus-puppet-agent-stats] [06:33:25] !log Finished fourth manual run of unpublished draft purge script (T203059) [06:33:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:33:28] T203059: Fourth manual run of unpublished draft purge script - https://phabricator.wikimedia.org/T203059 [06:34:16] (03PS2) 10Tulsi Bhagat: Enable Rollbackers User Group Right on azwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489971 (https://phabricator.wikimedia.org/T215200) [06:40:35] 10Operations, 10DBA: Predictive failures on disk S.M.A.R.T. status - https://phabricator.wikimedia.org/T208323 (10Marostegui) [06:51:14] (03PS1) 10Marostegui: dbstore1003: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/489972 (https://phabricator.wikimedia.org/T210478) [06:52:25] (03CR) 10Marostegui: [C: 03+2] dbstore1003: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/489972 (https://phabricator.wikimedia.org/T210478) (owner: 10Marostegui) [06:55:42] 10Operations, 10Patch-For-Review, 10User-Elukey: tmpreaper doesn't play along with PrivateTmp systemd units - https://phabricator.wikimedia.org/T185195 (10Joe) FYI, I've merged a change yesterday that should've fixed the problem from now on. 
[06:55:58] RECOVERY - puppet last run on authdns2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:16] 10Operations, 10Patch-For-Review: Tracking and Reducing cron-spam to root@ - https://phabricator.wikimedia.org/T132324 (10Joe) [06:58:20] 10Operations, 10Patch-For-Review, 10User-Elukey: tmpreaper doesn't play along with PrivateTmp systemd units - https://phabricator.wikimedia.org/T185195 (10Joe) 05Open→03Resolved a:03Joe [07:09:48] !log Rename ep_* tables on db1089 (s1) - T174802 [07:09:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:09:51] T174802: Archive and drop education program (ep_*) tables on all wikis - https://phabricator.wikimedia.org/T174802 [07:24:51] (03PS1) 10Alexandros Kosiaris: package_builder: Export all env vars to debug shell [puppet] - 10https://gerrit.wikimedia.org/r/489979 [07:26:13] !log update analytics-in4 term mysql-dbstore on cr1/cr2 eqiad [07:26:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:28:34] (03PS1) 10Muehlenhoff: Remove apache systemd override now that tmpreaper is fixed [puppet] - 10https://gerrit.wikimedia.org/r/489982 (https://phabricator.wikimedia.org/T185195) [07:30:21] !log ebernhardson@deploy1001 Started deploy [search/mjolnir/deploy@125354e]: testing simplified virtualenv deploy [07:30:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:31:28] !log ebernhardson@deploy1001 Finished deploy [search/mjolnir/deploy@125354e]: testing simplified virtualenv deploy (duration: 01m 07s) [07:31:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:31:33] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "- We probably need to do the same for profile::mediawiki::webserver" [puppet] - 10https://gerrit.wikimedia.org/r/489982 (https://phabricator.wikimedia.org/T185195) (owner: 10Muehlenhoff) [07:35:49] !log ebernhardson@deploy1001 Started deploy [search/mjolnir/deploy@125354e]: 
testing simplified virtualenv deploy (take 2) [07:35:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:38:02] (03PS1) 10Elukey: profile::mariadb::dbstore_multiinstance: add a note for Analytics [puppet] - 10https://gerrit.wikimedia.org/r/489983 (https://phabricator.wikimedia.org/T210478) [07:39:13] (03CR) 10Muehlenhoff: "Ack, I'll take care of mediawiki::webserver in a separate patch when the jobrunners are done." [puppet] - 10https://gerrit.wikimedia.org/r/489982 (https://phabricator.wikimedia.org/T185195) (owner: 10Muehlenhoff) [07:40:03] !log ebernhardson@deploy1001 Finished deploy [search/mjolnir/deploy@125354e]: testing simplified virtualenv deploy (take 2) (duration: 04m 14s) [07:40:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:40:39] (03PS1) 10Catrope: Enable ORES on dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489984 (https://phabricator.wikimedia.org/T215354) [07:41:45] (03CR) 10jerkins-bot: [V: 04-1] Enable ORES on dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489984 (https://phabricator.wikimedia.org/T215354) (owner: 10Catrope) [07:44:40] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1099:3318" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489985 [07:46:04] !log ebernhardson@deploy1001 Started deploy [search/mjolnir/deploy@125354e]: maintain symlink for old venv path with new virtualenv deploy script [07:46:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:49:59] !log ebernhardson@deploy1001 Finished deploy [search/mjolnir/deploy@125354e]: maintain symlink for old venv path with new virtualenv deploy script (duration: 03m 55s) [07:50:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:55:04] (03PS3) 10KartikMistry: Add ExternalGuidance extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489627 (https://phabricator.wikimedia.org/T213076) [08:00:23] (03CR) 
10Marostegui: [C: 03+1] profile::mariadb::dbstore_multiinstance: add a note for Analytics [puppet] - 10https://gerrit.wikimedia.org/r/489983 (https://phabricator.wikimedia.org/T210478) (owner: 10Elukey) [08:01:32] (03CR) 10Elukey: [C: 03+2] profile::mariadb::dbstore_multiinstance: add a note for Analytics [puppet] - 10https://gerrit.wikimedia.org/r/489983 (https://phabricator.wikimedia.org/T210478) (owner: 10Elukey) [08:02:33] (03CR) 10Smalyshev: Add wdqs data transfer cookbook (035 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/488256 (https://phabricator.wikimedia.org/T213401) (owner: 10Mathew.onipe) [08:04:32] (03CR) 10Elukey: [C: 03+1] analytics-grants: Remove globaldev user [puppet] - 10https://gerrit.wikimedia.org/r/489970 (https://phabricator.wikimedia.org/T200801) (owner: 10Marostegui) [08:04:51] (03PS1) 10Muehlenhoff: Reimage stat1005 with buster [puppet] - 10https://gerrit.wikimedia.org/r/489987 [08:04:53] (03CR) 10Marostegui: [C: 03+2] analytics-grants: Remove globaldev user [puppet] - 10https://gerrit.wikimedia.org/r/489970 (https://phabricator.wikimedia.org/T200801) (owner: 10Marostegui) [08:04:58] (03PS4) 10Marostegui: analytics-grants: Remove globaldev user [puppet] - 10https://gerrit.wikimedia.org/r/489970 (https://phabricator.wikimedia.org/T200801) [08:11:46] moritzm: woooooo [08:12:06] (03PS2) 10Elukey: Reimage stat1005 with buster [puppet] - 10https://gerrit.wikimedia.org/r/489987 (owner: 10Muehlenhoff) [08:12:56] (03CR) 10Elukey: [C: 03+2] Reimage stat1005 with buster [puppet] - 10https://gerrit.wikimedia.org/r/489987 (owner: 10Muehlenhoff) [08:13:49] (03PS1) 10Vgutierrez: certcentral: Implement staging time [software/certcentral] (debian) - 10https://gerrit.wikimedia.org/r/489988 (https://phabricator.wikimedia.org/T213737) [08:13:51] (03PS1) 10Vgutierrez: Rename certcentral to acme-chief [software/certcentral] (debian) - 10https://gerrit.wikimedia.org/r/489989 (https://phabricator.wikimedia.org/T207389) [08:14:42] (03CR) 10Marostegui: 
[C: 03+2] Revert "db-eqiad.php: Depool db1099:3318" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489985 (owner: 10Marostegui) [08:15:49] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1099:3318" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489985 (owner: 10Marostegui) [08:16:48] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1099:3318 (duration: 00m 49s) [08:16:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:17:04] (03PS1) 10Marostegui: db-eqiad.php: Depool db1101:3318 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489991 (https://phabricator.wikimedia.org/T210713) [08:18:11] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Depool db1101:3318 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489991 (https://phabricator.wikimedia.org/T210713) (owner: 10Marostegui) [08:19:16] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1101:3318 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489991 (https://phabricator.wikimedia.org/T210713) (owner: 10Marostegui) [08:20:33] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1099:3318" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489985 (owner: 10Marostegui) [08:20:34] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1101:3318 (duration: 00m 46s) [08:20:34] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1101:3318 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489991 (https://phabricator.wikimedia.org/T210713) (owner: 10Marostegui) [08:20:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:20:38] !log Depool db1101:3318 - T210713 [08:20:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:20:41] T210713: Drop change_tag.ct_tag column in production - https://phabricator.wikimedia.org/T210713 [08:20:50] !log Deploy schema change on db1101:3318 - T210713 [08:20:52] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:27:29] (03PS1) 10Vgutierrez: debian: rename certcentral to acme-chief [software/certcentral] (debian) - 10https://gerrit.wikimedia.org/r/489993 (https://phabricator.wikimedia.org/T207389) [08:28:58] (03PS2) 10Muehlenhoff: prometheus: upgrade prometheus-node-exporter to latest patchset [puppet] - 10https://gerrit.wikimedia.org/r/489756 (https://phabricator.wikimedia.org/T213708) (owner: 10Cwhite) [08:37:14] (03CR) 10Muehlenhoff: [C: 03+2] prometheus: upgrade prometheus-node-exporter to latest patchset [puppet] - 10https://gerrit.wikimedia.org/r/489756 (https://phabricator.wikimedia.org/T213708) (owner: 10Cwhite) [08:38:01] (03PS2) 10Vgutierrez: debian: rename certcentral to acme-chief [software/certcentral] (debian) - 10https://gerrit.wikimedia.org/r/489993 (https://phabricator.wikimedia.org/T207389) [08:47:14] (03CR) 10Vgutierrez: "recheck" [software/certcentral] (debian) - 10https://gerrit.wikimedia.org/r/489993 (https://phabricator.wikimedia.org/T207389) (owner: 10Vgutierrez) [08:59:53] (03CR) 10Vgutierrez: [C: 03+1] "package build tested in boron successfully... the only warning got by lintian is the usual & expected one:" [software/certcentral] (debian) - 10https://gerrit.wikimedia.org/r/489993 (https://phabricator.wikimedia.org/T207389) (owner: 10Vgutierrez) [09:03:01] 10Operations, 10Analytics, 10WMF-Legal, 10Privacy: Honor DNT header for access logs & varnish logs - https://phabricator.wikimedia.org/T98831 (10Gilles) 05Open→03Declined Sure, there will probably be more precise standards in the future for this sort of thing. 
[09:05:53] (03PS7) 10Gehel: admin: create new system groups for cloudelastic nodes [puppet] - 10https://gerrit.wikimedia.org/r/487040 (https://phabricator.wikimedia.org/T214922) (owner: 10Mathew.onipe) [09:05:56] PROBLEM - HHVM rendering on mw1315 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:06:52] (03CR) 10Gehel: [C: 03+2] admin: create new system groups for cloudelastic nodes [puppet] - 10https://gerrit.wikimedia.org/r/487040 (https://phabricator.wikimedia.org/T214922) (owner: 10Mathew.onipe) [09:07:02] RECOVERY - HHVM rendering on mw1315 is OK: HTTP OK: HTTP/1.1 200 OK - 75065 bytes in 0.783 second response time [09:27:34] (03PS2) 10Alexandros Kosiaris: package_builder: Export all env vars to debug shell [puppet] - 10https://gerrit.wikimedia.org/r/489979 [09:27:45] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] package_builder: Export all env vars to debug shell [puppet] - 10https://gerrit.wikimedia.org/r/489979 (owner: 10Alexandros Kosiaris) [09:30:46] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1002/14613/" [puppet] - 10https://gerrit.wikimedia.org/r/489243 (https://phabricator.wikimedia.org/T212259) (owner: 10Elukey) [09:31:07] (03PS4) 10Elukey: role::analytics_test_cluster::coordinator: add basic camus support [puppet] - 10https://gerrit.wikimedia.org/r/489243 (https://phabricator.wikimedia.org/T212259) [09:35:56] (03CR) 10Elukey: [C: 03+2] role::analytics_test_cluster::coordinator: add basic camus support [puppet] - 10https://gerrit.wikimedia.org/r/489243 (https://phabricator.wikimedia.org/T212259) (owner: 10Elukey) [09:36:00] !log reimaging stat1005 to buster [09:36:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:07] (03PS1) 10Alexandros Kosiaris: Revert "Revert "Move evaluation of wikimedia_trust/nets to puppet"" [puppet] - 10https://gerrit.wikimedia.org/r/489999 (https://phabricator.wikimedia.org/T213475) [09:37:07] \o/ [09:42:18] hi, since the image cache issue yesterday, all thumbs 
disappeared here: https://commons.wikimedia.org/wiki/Category:GAP_works_in_The_J._Paul_Getty_Museum,_working_category [09:42:31] even if I purge the cat [09:42:45] should I open a new report? [09:44:13] idem here https://commons.wikimedia.org/wiki/Category:J._Paul_Getty_Museum [09:53:14] (03PS2) 10Mathew.onipe: use underscore for optional args [cookbooks] - 10https://gerrit.wikimedia.org/r/489751 [09:55:21] ACKNOWLEDGEMENT - EDAC syslog messages on thumbor1004 is CRITICAL: 5.001 ge 4 Effie Mouzeli Server has memory issues -T215411 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=thumbor1004&var-datasource=eqiad+prometheus/ops [09:55:21] ACKNOWLEDGEMENT - Memory correctable errors -EDAC- on thumbor1004 is CRITICAL: 5.001 ge 4 Effie Mouzeli Server has memory issues -T215411 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=thumbor1004&var-datasource=eqiad+prometheus/ops [09:57:31] yannf: I see the thumbs in that page [09:58:01] 10Operations, 10ops-eqiad, 10Thumbor, 10serviceops: thumbor1004 memory errors - https://phabricator.wikimedia.org/T215411 (10jijiki) 05Resolved→03Open ` [Tue Feb 12 06:13:31 2019] mce: [Hardware Error]: Machine check events logged [Tue Feb 12 06:13:31 2019] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR [... [09:58:03] (03CR) 10Filippo Giunchedi: "LGTM, description nit inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/489765 (https://phabricator.wikimedia.org/T212850) (owner: 10Mathew.onipe) [09:58:19] 10Operations, 10ops-eqiad, 10Thumbor, 10serviceops: thumbor1004 memory errors - https://phabricator.wikimedia.org/T215411 (10jijiki) a:05jijiki→03RobH [10:00:21] !log installing ghostscript security updates on thumbor1001 [10:00:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:34] akosiaris, what should I do to see them? 
[10:05:21] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM (cc traffic folks)" [puppet] - 10https://gerrit.wikimedia.org/r/489754 (https://phabricator.wikimedia.org/T213708) (owner: 10Cwhite) [10:05:22] yannf: I am not yet sure of the problem you are experiencing. You try to load https://commons.wikimedia.org/wiki/Category:GAP_works_in_The_J._Paul_Getty_Museum,_working_category and the images don't show up at all? [10:05:41] I guess you already tried the full reload (ctrl+f5) and so on? [10:06:04] (03PS1) 10Elukey: role::analytics_test_cluster::coordinator: add admin settings [puppet] - 10https://gerrit.wikimedia.org/r/490004 (https://phabricator.wikimedia.org/T212259) [10:06:47] for me, even the slideshow works fine [10:06:57] ah yes, it worked [10:07:11] probably some cache artifact of your browser then [10:07:12] :-) [10:07:19] if ctrl+f5 solved that issue, great! [10:07:29] ok, thanks [10:07:36] yw, thanks for the report [10:08:11] (03CR) 10Filippo Giunchedi: "LGTM, however I don't know what's the trusty VM count at the minute and/or we need to do anything in particular for that."
[puppet] - 10https://gerrit.wikimedia.org/r/489753 (https://phabricator.wikimedia.org/T213708) (owner: 10Cwhite) [10:08:35] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1101:3318" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490005 [10:08:59] (03CR) 10Elukey: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/14615/" [puppet] - 10https://gerrit.wikimedia.org/r/490004 (https://phabricator.wikimedia.org/T212259) (owner: 10Elukey) [10:09:06] (03CR) 10Filippo Giunchedi: [C: 03+1] Set expiry headers on thumbnails [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/489022 (https://phabricator.wikimedia.org/T211661) (owner: 10Gilles) [10:10:17] (03PS1) 10Vgutierrez: Release 0.9 This release includes the following changes: * Implement staging time * Rename certcentral to acme-chief [software/certcentral] - 10https://gerrit.wikimedia.org/r/490006 (https://phabricator.wikimedia.org/T213737) [10:12:01] (03CR) 10Urbanecm: [C: 04-1] "A reason why you remove rules that didn't expire yet?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/489819 (https://phabricator.wikimedia.org/T215839) (owner: 10Zoranzoki21) [10:12:42] (03CR) 10Vgutierrez: [C: 03+2] Release 0.9 This release includes the following changes: * Implement staging time * Rename certcentral to acme-chief [software/certcentral] - 10https://gerrit.wikimedia.org/r/490006 (https://phabricator.wikimedia.org/T213737) (owner: 10Vgutierrez) [10:14:26] (03Merged) 10jenkins-bot: Release 0.9 This release includes the following changes: * Implement staging time * Rename certcentral to acme-chief [software/certcentral] - 10https://gerrit.wikimedia.org/r/490006 (https://phabricator.wikimedia.org/T213737) (owner: 10Vgutierrez) [10:14:59] https://wikitech.wikimedia.org/wiki/Category:Runbooks feel free to add as new ones are created [10:16:14] (03CR) 10jenkins-bot: Release 0.9 This release includes the following changes: * Implement staging time * Rename certcentral to acme-chief [software/certcentral] - 10https://gerrit.wikimedia.org/r/490006 (https://phabricator.wikimedia.org/T213737) (owner: 10Vgutierrez) [10:17:44] (03CR) 10Vgutierrez: [C: 03+1] "recheck" [software/certcentral] (debian) - 10https://gerrit.wikimedia.org/r/489993 (https://phabricator.wikimedia.org/T207389) (owner: 10Vgutierrez) [10:17:46] 10Operations, 10Icinga, 10monitoring: icinga really needs to check puppet run success of passive icinga hosts - https://phabricator.wikimedia.org/T215848 (10Peachey88) [10:18:33] (03CR) 10Filippo Giunchedi: WIP: initial (strawman) configuration for session storage (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/487885 (owner: 10Eevans) [10:19:04] (03CR) 10Marostegui: [C: 03+2] Revert "db-eqiad.php: Depool db1101:3318" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490005 (owner: 10Marostegui) [10:20:09] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1101:3318" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490005 (owner: 10Marostegui) [10:20:53] 10Operations, 
10serviceops, 10Core Platform Team (Session Management Service (CDP2)), 10User-jijiki: Create puppet role for session storage service - https://phabricator.wikimedia.org/T215883 (10jijiki) p:05Triage→03Normal [10:21:04] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1101:3318 (duration: 00m 46s) [10:21:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:42] (03PS1) 10Marostegui: db-eqiad.php: Depool db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490010 (https://phabricator.wikimedia.org/T210713) [10:21:48] 10Operations, 10serviceops, 10Core Platform Team (Session Management Service (CDP2)), 10User-jijiki: Create puppet role for session storage service - https://phabricator.wikimedia.org/T215883 (10jijiki) [10:21:58] 10Operations, 10serviceops, 10Core Platform Team (Session Management Service (CDP2)), 10User-jijiki: Create puppet role for session storage service - https://phabricator.wikimedia.org/T215883 (10jijiki) [10:23:02] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Depool db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490010 (https://phabricator.wikimedia.org/T210713) (owner: 10Marostegui) [10:24:04] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490010 (https://phabricator.wikimedia.org/T210713) (owner: 10Marostegui) [10:24:06] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] "LGTM, merging" [deployment-charts] - 10https://gerrit.wikimedia.org/r/483035 (https://phabricator.wikimedia.org/T211247) (owner: 10Ottomata) [10:24:42] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] "Let's merge this as a dev only chart and work more on it in subsequent patches" (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/484498 (https://phabricator.wikimedia.org/T211247) (owner: 10Ottomata) [10:25:06] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1092 (duration: 00m 46s) [10:25:07] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:10] !log Deploy schema change on db1092 T210713 [10:25:11] (03PS3) 10Vgutierrez: debian: rename certcentral to acme-chief [software/certcentral] (debian) - 10https://gerrit.wikimedia.org/r/489993 (https://phabricator.wikimedia.org/T207389) [10:25:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:13] (03PS1) 10Vgutierrez: Release 0.9 This release includes the following changes: * Implement staging time * Rename certcentral to acme-chief [software/certcentral] (debian) - 10https://gerrit.wikimedia.org/r/490011 (https://phabricator.wikimedia.org/T213737) [10:25:13] T210713: Drop change_tag.ct_tag column in production - https://phabricator.wikimedia.org/T210713 [10:25:18] PROBLEM - HHVM rendering on mw2203 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:25:49] (03PS4) 10Effie Mouzeli: WIP: initial (strawman) configuration for session storage [puppet] - 10https://gerrit.wikimedia.org/r/487885 (https://phabricator.wikimedia.org/T215883) (owner: 10Eevans) [10:25:58] (03CR) 10Vgutierrez: [C: 03+2] Release 0.9 This release includes the following changes: * Implement staging time * Rename certcentral to acme-chief [software/certcentral] (debian) - 10https://gerrit.wikimedia.org/r/490011 (https://phabricator.wikimedia.org/T213737) (owner: 10Vgutierrez) [10:26:08] RECOVERY - HHVM rendering on mw2203 is OK: HTTP OK: HTTP/1.1 200 OK - 75058 bytes in 0.139 second response time [10:26:30] (03CR) 10jerkins-bot: [V: 04-1] WIP: initial (strawman) configuration for session storage [puppet] - 10https://gerrit.wikimedia.org/r/487885 (https://phabricator.wikimedia.org/T215883) (owner: 10Eevans) [10:26:50] (03PS2) 10Mathew.onipe: elasticsearch: unassigned shard icinga check [puppet] - 10https://gerrit.wikimedia.org/r/489765 (https://phabricator.wikimedia.org/T212850) [10:27:04] (03PS1) 10Muehlenhoff: Extend late_command script for buster/facter [puppet] - 
10https://gerrit.wikimedia.org/r/490012 [10:27:32] (03CR) 10Mathew.onipe: elasticsearch: unassigned shard icinga check (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/489765 (https://phabricator.wikimedia.org/T212850) (owner: 10Mathew.onipe) [10:31:31] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1101:3318" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490005 (owner: 10Marostegui) [10:31:33] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490010 (https://phabricator.wikimedia.org/T210713) (owner: 10Marostegui) [10:31:52] (03PS1) 10Muehlenhoff: role::statistics::gpu: Enable ferm [puppet] - 10https://gerrit.wikimedia.org/r/490013 [10:32:27] (03PS1) 10Alexandros Kosiaris: Package kafka-dev helm chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/490014 [10:33:48] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] Package kafka-dev helm chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/490014 (owner: 10Alexandros Kosiaris) [10:39:05] (03CR) 10Gehel: [C: 04-1] "minor comment inline" (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/489751 (owner: 10Mathew.onipe) [10:40:40] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] Drop requires_os checks for trusty [puppet] - 10https://gerrit.wikimedia.org/r/489625 (owner: 10Muehlenhoff) [10:41:00] (03PS3) 10Mathew.onipe: use hyphen(-) for optional args [cookbooks] - 10https://gerrit.wikimedia.org/r/489751 [10:43:01] (03PS1) 10Gehel: elasticsearch: cleanup typo [puppet] - 10https://gerrit.wikimedia.org/r/490017 [10:43:03] (03CR) 10GTirloni: [C: 03+1] "We only have a couple of months left for the Toolforge Trusty servers as we're actively moving to the new Stretch cluster. 
I wouldn't spen" [puppet] - 10https://gerrit.wikimedia.org/r/489753 (https://phabricator.wikimedia.org/T213708) (owner: 10Cwhite) [10:43:30] (03PS4) 10Mathew.onipe: use hyphen(-) for optional args [cookbooks] - 10https://gerrit.wikimedia.org/r/489751 [10:43:43] (03CR) 10Alex Monk: [C: 03+2] debian: rename certcentral to acme-chief [software/certcentral] (debian) - 10https://gerrit.wikimedia.org/r/489993 (https://phabricator.wikimedia.org/T207389) (owner: 10Vgutierrez) [10:44:00] (03CR) 10Mathew.onipe: use hyphen(-) for optional args (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/489751 (owner: 10Mathew.onipe) [10:45:04] (03CR) 10Arturo Borrero Gonzalez: "> > For the record, there is a standard tool which can be used to" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/489409 (https://phabricator.wikimedia.org/T178601) (owner: 10BryanDavis) [10:50:42] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/489765 (https://phabricator.wikimedia.org/T212850) (owner: 10Mathew.onipe) [10:52:22] (03PS1) 10Jcrespo: mariadb: Depool db1120 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490020 [10:52:48] (03PS3) 10Gehel: elasticsearch: unassigned shard icinga check [puppet] - 10https://gerrit.wikimedia.org/r/489765 (https://phabricator.wikimedia.org/T212850) (owner: 10Mathew.onipe) [10:53:55] (03CR) 10Gehel: [C: 03+2] elasticsearch: unassigned shard icinga check [puppet] - 10https://gerrit.wikimedia.org/r/489765 (https://phabricator.wikimedia.org/T212850) (owner: 10Mathew.onipe) [10:54:12] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGMT." [puppet] - 10https://gerrit.wikimedia.org/r/489947 (https://phabricator.wikimedia.org/T215211) (owner: 10Andrew Bogott) [10:54:19] (03CR) 10Mathew.onipe: [C: 03+1] "My bad! typo is on me! 
THanks" [puppet] - 10https://gerrit.wikimedia.org/r/490017 (owner: 10Gehel) [10:54:24] (03PS1) 10Jcrespo: mariadb: Depool es1014 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490021 [10:56:06] (03PS2) 10Muehlenhoff: Extend late_command script for buster/facter [puppet] - 10https://gerrit.wikimedia.org/r/490012 [10:59:58] (03CR) 10Vgutierrez: [C: 03+2] certcentral: Implement staging time [software/certcentral] (debian) - 10https://gerrit.wikimedia.org/r/489988 (https://phabricator.wikimedia.org/T213737) (owner: 10Vgutierrez) [11:00:03] (03CR) 10Vgutierrez: [C: 03+2] Rename certcentral to acme-chief [software/certcentral] (debian) - 10https://gerrit.wikimedia.org/r/489989 (https://phabricator.wikimedia.org/T207389) (owner: 10Vgutierrez) [11:00:20] (03PS2) 10Gehel: elasticsearch: cleanup typo [puppet] - 10https://gerrit.wikimedia.org/r/490017 [11:01:19] (03CR) 10Gehel: [C: 03+2] elasticsearch: cleanup typo [puppet] - 10https://gerrit.wikimedia.org/r/490017 (owner: 10Gehel) [11:01:39] (03Merged) 10jenkins-bot: certcentral: Implement staging time [software/certcentral] (debian) - 10https://gerrit.wikimedia.org/r/489988 (https://phabricator.wikimedia.org/T213737) (owner: 10Vgutierrez) [11:01:58] (03Merged) 10jenkins-bot: Rename certcentral to acme-chief [software/certcentral] (debian) - 10https://gerrit.wikimedia.org/r/489989 (https://phabricator.wikimedia.org/T207389) (owner: 10Vgutierrez) [11:01:59] (03Merged) 10jenkins-bot: Release 0.9 This release includes the following changes: * Implement staging time * Rename certcentral to acme-chief [software/certcentral] (debian) - 10https://gerrit.wikimedia.org/r/490011 (https://phabricator.wikimedia.org/T213737) (owner: 10Vgutierrez) [11:02:01] (03Merged) 10jenkins-bot: debian: rename certcentral to acme-chief [software/certcentral] (debian) - 10https://gerrit.wikimedia.org/r/489993 (https://phabricator.wikimedia.org/T207389) (owner: 10Vgutierrez) [11:03:13] (03CR) 10jenkins-bot: certcentral: Implement 
staging time [software/certcentral] (debian) - 10https://gerrit.wikimedia.org/r/489988 (https://phabricator.wikimedia.org/T213737) (owner: 10Vgutierrez) [11:03:31] (03CR) 10jenkins-bot: Rename certcentral to acme-chief [software/certcentral] (debian) - 10https://gerrit.wikimedia.org/r/489989 (https://phabricator.wikimedia.org/T207389) (owner: 10Vgutierrez) [11:03:50] (03CR) 10jenkins-bot: Release 0.9 This release includes the following changes: * Implement staging time * Rename certcentral to acme-chief [software/certcentral] (debian) - 10https://gerrit.wikimedia.org/r/490011 (https://phabricator.wikimedia.org/T213737) (owner: 10Vgutierrez) [11:03:52] (03CR) 10jenkins-bot: debian: rename certcentral to acme-chief [software/certcentral] (debian) - 10https://gerrit.wikimedia.org/r/489993 (https://phabricator.wikimedia.org/T207389) (owner: 10Vgutierrez) [11:05:24] (03PS3) 10Muehlenhoff: Extend late_command script for buster/facter [puppet] - 10https://gerrit.wikimedia.org/r/490012 [11:08:27] (03PS1) 10Alexandros Kosiaris: Package eventgate-analytics chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/490024 [11:08:32] (03CR) 10Muehlenhoff: [C: 03+2] Extend late_command script for buster/facter [puppet] - 10https://gerrit.wikimedia.org/r/490012 (owner: 10Muehlenhoff) [11:10:20] (03PS2) 10Muehlenhoff: role::statistics::gpu: Enable ferm [puppet] - 10https://gerrit.wikimedia.org/r/490013 [11:10:31] 10Operations, 10Thumbor, 10serviceops, 10Patch-For-Review, and 3 others: Upgrade Thumbor servers to Stretch - https://phabricator.wikimedia.org/T170817 (10jijiki) [11:10:37] 10Operations, 10Thumbor, 10serviceops, 10Patch-For-Review, and 2 others: Assess Thumbor upgrade options - https://phabricator.wikimedia.org/T209886 (10jijiki) 05Open→03Resolved [11:11:00] (03PS1) 10Jbond: Off board user Erik Zachte (ezachte) [puppet] - 10https://gerrit.wikimedia.org/r/490025 (https://phabricator.wikimedia.org/T215790) [11:11:34] (03CR) 10jerkins-bot: [V: 04-1] Off board user Erik 
Zachte (ezachte) [puppet] - 10https://gerrit.wikimedia.org/r/490025 (https://phabricator.wikimedia.org/T215790) (owner: 10Jbond) [11:11:39] (03CR) 10Muehlenhoff: [C: 03+2] role::statistics::gpu: Enable ferm [puppet] - 10https://gerrit.wikimedia.org/r/490013 (owner: 10Muehlenhoff) [11:13:47] (03PS2) 10Jbond: Off board user Erik Zachte (ezachte) [puppet] - 10https://gerrit.wikimedia.org/r/490025 (https://phabricator.wikimedia.org/T215790) [11:14:36] (03PS3) 10Jbond: Off board user Erik Zachte (ezachte) [puppet] - 10https://gerrit.wikimedia.org/r/490025 (https://phabricator.wikimedia.org/T215790) [11:15:20] (03CR) 10Jbond: [C: 03+2] Off board user Erik Zachte (ezachte) [puppet] - 10https://gerrit.wikimedia.org/r/490025 (https://phabricator.wikimedia.org/T215790) (owner: 10Jbond) [11:20:45] !log installing ghostscript security updates on remaining thumbor hosts [11:20:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:55] (03PS1) 10Muehlenhoff: role::statistics::gpu: Install AMD GPU firmware [puppet] - 10https://gerrit.wikimedia.org/r/490026 [11:22:16] (03PS2) 10Alexandros Kosiaris: Package eventgate-analytics chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/490024 [11:22:18] (03PS1) 10Alexandros Kosiaris: Add a simple README.md [deployment-charts] - 10https://gerrit.wikimedia.org/r/490027 [11:22:20] (03PS1) 10Alexandros Kosiaris: Add a default Apache-2 license to the repo [deployment-charts] - 10https://gerrit.wikimedia.org/r/490028 [11:22:48] (03PS2) 10Elukey: role::statistics::gpu: Install AMD GPU firmware [puppet] - 10https://gerrit.wikimedia.org/r/490026 (owner: 10Muehlenhoff) [11:22:51] (03PS3) 10Muehlenhoff: role::statistics::gpu: Install AMD GPU firmware [puppet] - 10https://gerrit.wikimedia.org/r/490026 [11:23:36] (03CR) 10Muehlenhoff: [C: 03+2] role::statistics::gpu: Install AMD GPU firmware [puppet] - 10https://gerrit.wikimedia.org/r/490026 (owner: 10Muehlenhoff) [11:23:54] (03CR) 10Elukey: [C: 03+1] "For the 
moment I'd say to use the role without any more specific profile to ease the testing, eventually I'll refactor the code :)" [puppet] - 10https://gerrit.wikimedia.org/r/490026 (owner: 10Muehlenhoff) [11:24:10] 10Operations, 10SRE-Access-Requests: Requesting access to production for dsharpe - https://phabricator.wikimedia.org/T214130 (10revi) 05Open→03Resolved [11:27:52] !log rebooting stat1005 [11:27:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:19] 10Operations, 10Thumbor, 10serviceops, 10User-jijiki: Thumbor upgrade to stretch plan - https://phabricator.wikimedia.org/T214597 (10jijiki) p:05Triage→03Normal [11:31:44] (03CR) 10Alexandros Kosiaris: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/14614/ says it's noop. This time around it was a fleet wide PCC so we should be ok. I 'll" [puppet] - 10https://gerrit.wikimedia.org/r/489999 (https://phabricator.wikimedia.org/T213475) (owner: 10Alexandros Kosiaris) [11:31:51] (03PS2) 10Alexandros Kosiaris: Revert "Revert "Move evaluation of wikimedia_trust/nets to puppet"" [puppet] - 10https://gerrit.wikimedia.org/r/489999 (https://phabricator.wikimedia.org/T213475) [11:32:04] 10Operations, 10Thumbor, 10serviceops, 10Patch-For-Review, and 3 others: Upgrade Thumbor servers to Stretch - https://phabricator.wikimedia.org/T170817 (10jijiki) [11:32:06] 10Operations, 10Thumbor, 10serviceops, 10User-jijiki: Thumbor upgrade to stretch plan - https://phabricator.wikimedia.org/T214597 (10jijiki) [11:33:13] (03PS1) 10Giuseppe Lavagetto: admin: add brennen to contint-{admins,docker}, deployment [puppet] - 10https://gerrit.wikimedia.org/r/490030 (https://phabricator.wikimedia.org/T215328) [11:34:26] 10Operations, 10Wikimedia-General-or-Unknown, 10Wikisource: Upgrade Ghostscript to 9.15 or later - https://phabricator.wikimedia.org/T110849 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff Closing this old bug, we're now using ghostscript 9.26 everywhere. 
If there's any specific other Gho... [11:35:25] (03PS2) 10Muehlenhoff: admins: create gpu-testers, add ebernhardson, root on stat1005 [puppet] - 10https://gerrit.wikimedia.org/r/488606 (https://phabricator.wikimedia.org/T215384) (owner: 10Dzahn) [11:40:13] (03CR) 10Muehlenhoff: [C: 03+2] admins: create gpu-testers, add ebernhardson, root on stat1005 [puppet] - 10https://gerrit.wikimedia.org/r/488606 (https://phabricator.wikimedia.org/T215384) (owner: 10Dzahn) [11:40:15] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] Package eventgate-analytics chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/490024 (owner: 10Alexandros Kosiaris) [11:43:03] (03PS3) 10Elukey: Revert "Revert "Move evaluation of wikimedia_trust/nets to puppet"" [puppet] - 10https://gerrit.wikimedia.org/r/489999 (https://phabricator.wikimedia.org/T213475) (owner: 10Alexandros Kosiaris) [11:43:32] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "LGTM (apache2 is a standard license after all), but I'm not sure if it's ok that sub-charts could have a different license. Actually I don" [deployment-charts] - 10https://gerrit.wikimedia.org/r/490028 (owner: 10Alexandros Kosiaris) [11:46:11] 10Operations, 10Analytics, 10SRE-Access-Requests, 10Patch-For-Review: Allow Erik Bernhardson to have root access on stat1005 for GPU testing - https://phabricator.wikimedia.org/T215384 (10MoritzMuehlenhoff) 05Open→03Resolved stat1005 is now running Debian buster and I've enabled Erik's access. [11:46:30] 10Operations, 10Thumbor, 10serviceops, 10Patch-For-Review, and 3 others: Upgrade Thumbor servers to Stretch - https://phabricator.wikimedia.org/T170817 (10jijiki) [11:47:10] 10Operations, 10Analytics, 10Research-management, 10User-Elukey: GPU upgrade for stats machine - https://phabricator.wikimedia.org/T148843 (10MoritzMuehlenhoff) > TODOS: > * stat1005 will be reimaged to Debian Stretch when the SRE team is ready (work is currently in progress to import Buster in production)... 
[11:53:47] 10Operations, 10Analytics, 10Research-management, 10User-Elukey: GPU upgrade for stats machine - https://phabricator.wikimedia.org/T148843 (10elukey) ` elukey@stat1005:~$ sudo dmesg | grep amdgpu [ 17.830797] [drm] amdgpu kernel modesetting enabled. ` \o/ Following https://rocm.github.io/ROCmInstall.ht... [11:56:25] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/490030 (https://phabricator.wikimedia.org/T215328) (owner: 10Giuseppe Lavagetto) [12:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Your horoscope predicts another unfortunate European Mid-day SWAT(Max 6 patches) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190212T1200). [12:00:05] TBhagat: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [12:00:46] TBhagat: around for swat? [12:00:47] (03CR) 10Arturo Borrero Gonzalez: "I would use a copyleft license if possible :-P" [deployment-charts] - 10https://gerrit.wikimedia.org/r/490028 (owner: 10Alexandros Kosiaris) [12:01:00] Hi zeljkof, [12:01:05] Yeah, Ready [12:01:24] TBhagat: ok, I'll get ready and ping you when the first patch is at mwdebug [12:01:31] I can SWAT today! [12:01:44] Great. 
Okay [12:02:40] TBhagat: the first patch is marked WIP 489971 [12:02:56] there should be a button like "start review" [12:03:00] that you should push :D [12:03:15] !log Stop MySQL on db1092 to upgrade mysql and kernel [12:03:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:18] Oops, [12:03:23] all good, no longer WIP : [12:03:24] :) [12:04:27] ;) [12:04:40] (03CR) 10Zfilipin: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489971 (https://phabricator.wikimedia.org/T215200) (owner: 10Tulsi Bhagat) [12:05:14] (03CR) 10Giuseppe Lavagetto: [C: 03+2] admin: add brennen to contint-{admins,docker}, deployment [puppet] - 10https://gerrit.wikimedia.org/r/490030 (https://phabricator.wikimedia.org/T215328) (owner: 10Giuseppe Lavagetto) [12:05:23] (03PS2) 10Giuseppe Lavagetto: admin: add brennen to contint-{admins,docker}, deployment [puppet] - 10https://gerrit.wikimedia.org/r/490030 (https://phabricator.wikimedia.org/T215328) [12:05:26] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Minor nitpick, but overall nice!" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/487882 (https://phabricator.wikimedia.org/T215275) (owner: 10Jbond) [12:05:32] (03CR) 10ArielGlenn: "Maybe a one-liner about _scaffold so people know to skip over it?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/490027 (owner: 10Alexandros Kosiaris) [12:08:06] (03Merged) 10jenkins-bot: Enable Rollbackers User Group Right on azwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489971 (https://phabricator.wikimedia.org/T215200) (owner: 10Tulsi Bhagat) [12:09:30] TBhagat: 489971 is at mwdebug1002, please test and let me know if I can deploy [12:09:39] zeljkof, Looks good. Please deploy. 
:D [12:10:00] ok, deploying [12:10:36] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:489971|Enable Rollbackers User Group Right on azwiki (T215200)]] (duration: 00m 47s) [12:10:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:39] T215200: Request for upload a rollbacker's right to azwiki - https://phabricator.wikimedia.org/T215200 [12:10:56] TBhagat: it's deployed, please check [12:11:08] (03PS4) 10Zfilipin: Create 'extendedconfirmed' user group for viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489612 (https://phabricator.wikimedia.org/T215493) (owner: 10Tulsi Bhagat) [12:11:10] (03PS19) 10Jbond: Improve CI checks to ensure a basic catalogue compiles on all supported OS's [puppet] - 10https://gerrit.wikimedia.org/r/487882 (https://phabricator.wikimedia.org/T215275) [12:11:35] zeljkof, Checked. Everything is fine. [12:11:47] (03PS20) 10Jbond: Improve CI checks to ensure a basic catalogue compiles on all supported OS's [puppet] - 10https://gerrit.wikimedia.org/r/487882 (https://phabricator.wikimedia.org/T215275) [12:12:22] (03CR) 10Zfilipin: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489612 (https://phabricator.wikimedia.org/T215493) (owner: 10Tulsi Bhagat) [12:13:13] (03CR) 10Jbond: "Thanks Alexandros, have updated" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/487882 (https://phabricator.wikimedia.org/T215275) (owner: 10Jbond) [12:13:27] (03Merged) 10jenkins-bot: Create 'extendedconfirmed' user group for viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489612 (https://phabricator.wikimedia.org/T215493) (owner: 10Tulsi Bhagat) [12:14:04] (03PS1) 10Marostegui: db-eqiad.php: Slowly repool db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490041 [12:14:09] TBhagat: 489612 is at mwdebug1002, please test and let me know if I can deploy [12:14:40] zeljkof, Tested. Looks good. Please deploy. 
:D [12:14:44] (03CR) 10jenkins-bot: Enable Rollbackers User Group Right on azwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489971 (https://phabricator.wikimedia.org/T215200) (owner: 10Tulsi Bhagat) [12:14:46] (03CR) 10jenkins-bot: Create 'extendedconfirmed' user group for viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489612 (https://phabricator.wikimedia.org/T215493) (owner: 10Tulsi Bhagat) [12:14:50] ok [12:15:46] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:489612|Create extendedconfirmed user group for viwiki (T215493)]] (duration: 00m 47s) [12:15:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:49] T215493: Create ‘extendedconfirmed’ for viwiki - https://phabricator.wikimedia.org/T215493 [12:15:55] TBhagat: deployed, please test [12:16:12] (03PS2) 10Zfilipin: Add https://polona.pl/ to $wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489614 (https://phabricator.wikimedia.org/T215501) (owner: 10Tulsi Bhagat) [12:16:28] (03CR) 10Marostegui: [C: 03+1] mariadb: Depool db1120 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490020 (owner: 10Jcrespo) [12:16:28] Tested. Looks good. [12:17:25] (03CR) 10Zfilipin: [C: 04-1] Add https://polona.pl/ to $wgCopyUploadsDomains (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489614 (https://phabricator.wikimedia.org/T215501) (owner: 10Tulsi Bhagat) [12:17:37] TBhagat: the last patch has a whitespace problem [12:17:44] "problem" :D [12:17:51] but anyway, should be fixed [12:18:36] Checking and fixing [12:18:42] PROBLEM - puppet last run on mw1333 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [12:18:42] thanks [12:19:00] PROBLEM - puppet last run on mw2268 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[deployment_ensure_members] [12:19:10] PROBLEM - puppet last run on mw2199 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [12:19:12] PROBLEM - puppet last run on mwmaint1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [12:19:43] !log install ghostscript security updates on scb* [12:19:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:58] PROBLEM - puppet last run on mw2210 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [12:20:18] PROBLEM - puppet last run on mw2200 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [12:20:20] PROBLEM - puppet last run on mw1269 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [12:20:20] PROBLEM - puppet last run on mw1240 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [12:20:26] PROBLEM - puppet last run on mw2192 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [12:20:30] PROBLEM - puppet last run on mw2243 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[deployment_ensure_members] [12:20:48] ACKNOWLEDGEMENT - MegaRAID on cloudvirt1024 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T215892 [12:20:52] 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T215892 (10ops-monitoring-bot) [12:20:57] (03PS3) 10Tulsi Bhagat: Add https://polona.pl/ to $wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489614 (https://phabricator.wikimedia.org/T215501) [12:21:30] PROBLEM - puppet last run on mw2205 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [12:21:34] PROBLEM - puppet last run on mw1250 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [12:22:35] zeljkof, Check it please [12:22:37] (03CR) 10Alexandros Kosiaris: [C: 04-1] bootstrap-vz: set up a root terminal on S1 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/489947 (https://phabricator.wikimedia.org/T215211) (owner: 10Andrew Bogott) [12:22:49] TBhagat: on it [12:23:04] PROBLEM - puppet last run on mw2276 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [12:23:06] PROBLEM - puppet last run on mw2263 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [12:23:08] TBhagat: looks good! :D [12:23:29] * zeljkof used to be in whitespace-scouts ;P [12:23:32] PROBLEM - puppet last run on mw1227 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[deployment_ensure_members] [12:23:38] PROBLEM - puppet last run on mw2171 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [12:23:38] (03CR) 10Zfilipin: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489614 (https://phabricator.wikimedia.org/T215501) (owner: 10Tulsi Bhagat) [12:23:50] PROBLEM - puppet last run on mw2284 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [12:24:18] PROBLEM - puppet last run on mw2187 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [12:24:20] PROBLEM - puppet last run on mw2181 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [12:24:42] (03Merged) 10jenkins-bot: Add https://polona.pl/ to $wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489614 (https://phabricator.wikimedia.org/T215501) (owner: 10Tulsi Bhagat) [12:25:30] RECOVERY - puppet last run on mw1269 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [12:26:12] (03CR) 10jenkins-bot: Add https://polona.pl/ to $wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489614 (https://phabricator.wikimedia.org/T215501) (owner: 10Tulsi Bhagat) [12:26:20] PROBLEM - puppet last run on mw1246 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [12:26:24] * TBhagat wondering for Wi-Fi connection. [12:26:24] PROBLEM - puppet last run on mw1317 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[deployment_ensure_members] [12:27:24] PROBLEM - puppet last run on mw1258 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [12:27:32] PROBLEM - puppet last run on mw2242 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [12:27:43] TBhagat: 489614 is at mwdebug1002, please test and let me know if I can deploy [12:28:02] PROBLEM - puppet last run on doc1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[contint-admins_ensure_members] [12:28:12] PROBLEM - puppet last run on mw2184 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [12:28:14] PROBLEM - puppet last run on bast4002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[all-users_ensure_members] [12:28:27] zeljkof, Sorry, I have no idea how to test this. [12:28:50] PROBLEM - puppet last run on mw2173 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [12:29:02] PROBLEM - puppet last run on mw1341 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [12:29:04] PROBLEM - puppet last run on mw1305 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [12:29:04] PROBLEM - puppet last run on mw2170 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[deployment_ensure_members] [12:29:04] the domain should probably appear at an upload page, but I don't know the specifics too :) [12:29:06] I'll deploy [12:29:32] PROBLEM - puppet last run on mw2247 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [12:29:32] PROBLEM - puppet last run on mw2241 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [12:29:35] TBhagat: in the future, ask how to test it in the task [12:29:36] PROBLEM - puppet last run on mw2197 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [12:29:59] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:489614|Add https://polona.pl/ to $wgCopyUploadsDomains (T215501)]] (duration: 00m 46s) [12:30:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:02] T215501: Add https://polona.pl/ to $wgCopyUploadsDomains - https://phabricator.wikimedia.org/T215501 [12:30:15] TBhagat: all deployed, thanks for deploying with #releng :) [12:30:27] Ok. :'( [12:30:34] PROBLEM - puppet last run on mw2166 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [12:30:45] !log EU SWAT finished [12:30:46] PROBLEM - puppet last run on mw2183 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [12:30:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:13] Thank you zeljkof! 
;P [12:31:25] TBhagat: no problemo :) [12:31:34] PROBLEM - puppet last run on bast3002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[all-users_ensure_members] [12:32:02] PROBLEM - puppet last run on mw2285 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [12:32:02] PROBLEM - puppet last run on mw2222 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [12:32:02] PROBLEM - puppet last run on mw1252 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [12:32:40] PROBLEM - puppet last run on mw1303 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [12:33:10] PROBLEM - puppet last run on mw1248 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [12:33:36] PROBLEM - puppet last run on mw2266 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [12:34:30] <_joe_> oh sigh the cumin command didn't work [12:34:42] PROBLEM - puppet last run on mw2281 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [12:34:50] PROBLEM - puppet last run on mw2234 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [12:34:50] PROBLEM - puppet last run on mw2165 is CRITICAL: CRITICAL: Puppet has 1 failures. 
Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [12:34:54] PROBLEM - puppet last run on mw2212 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [12:35:14] PROBLEM - puppet last run on mw1262 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [12:35:32] PROBLEM - puppet last run on snapshot1009 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [12:35:48] PROBLEM - puppet last run on contint2001 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 7 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[contint-admins_ensure_members],Exec[docker_ensure_members] [12:36:04] PROBLEM - puppet last run on mw2223 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [12:36:10] PROBLEM - puppet last run on mw2255 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [12:36:42] PROBLEM - puppet last run on bast2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[all-users_ensure_members] [12:39:00] (03PS1) 10Jbond: Add rasdaemon service to systems which support it. [puppet] - 10https://gerrit.wikimedia.org/r/490042 (https://phabricator.wikimedia.org/T205396) [12:39:47] 10Operations, 10Analytics, 10Research-management, 10User-Elukey: GPU upgrade for stats machine - https://phabricator.wikimedia.org/T148843 (10MoritzMuehlenhoff) There are no source packages for the debs, given that they seem are otherwise pretty focused on FLOSS (e.g. 
https://rocm.github.io/ROCmInstall.ht... [12:41:00] PROBLEM - puppet last run on contint1001 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 7 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[contint-admins_ensure_members],Exec[docker_ensure_members] [12:41:08] PROBLEM - puppet last run on bast5001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[all-users_ensure_members] [12:43:24] (03PS2) 10Jbond: Add rasdaemon service to systems which support it. [puppet] - 10https://gerrit.wikimedia.org/r/490042 (https://phabricator.wikimedia.org/T205396) [12:43:26] (03PS1) 10Giuseppe Lavagetto: admin: ensure correct ordering of resources [puppet] - 10https://gerrit.wikimedia.org/r/490043 [12:43:57] <_joe_> elukey, moritzm ^^ [12:44:56] PROBLEM - puppet last run on people1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[all-users_ensure_members] [12:46:24] _joe_ in theory that should be guaranteed by the ordering of the statements ? (just to understand) [12:46:32] <_joe_> yes [12:46:44] <_joe_> I'm starting to think we're not respecting that [12:46:54] <_joe_> probably some dependency below, I dunno [12:47:00] <_joe_> I'll have to dig deeper [12:47:06] (03CR) 10Elukey: [C: 03+1] admin: ensure correct ordering of resources [puppet] - 10https://gerrit.wikimedia.org/r/490043 (owner: 10Giuseppe Lavagetto) [12:49:37] 10Operations, 10Analytics, 10Research-management, 10User-Elukey: GPU upgrade for stats machine - https://phabricator.wikimedia.org/T148843 (10MoritzMuehlenhoff) >>! In T148843#4947165, @MoritzMuehlenhoff wrote: > we should ask them to also publish them. Did that in https://github.com/RadeonOpenCompute/ROC... 
[12:50:02] looking [12:50:13] elukey: elukey, _joe_ i don't think the order in the manifest guarantees they will be applied in that order so i think you do need that [12:50:34] * jbond42 searching the docs on ordering [12:50:46] <_joe_> jbond42: in theory after puppet 4, the default ordering is "manifest" [12:50:55] i think this is also something that has changed between versions [12:51:30] <_joe_> https://puppet.com/docs/puppet/4.8/lang_relationships.html [12:51:46] <_joe_> "By default, Puppet applies resources in the order they’re declared in their manifest. However, if a group of resources must always be managed in a specific order, you should explicitly declare such relationships with relationship metaparameters, chaining arrows, and the require function. [12:51:48] <_joe_> " [12:51:56] <_joe_> doesn't really clarify the issue :D [12:53:41] ack cheers [12:54:04] RECOVERY - puppet last run on doc1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:54:20] RECOVERY - puppet last run on bast4002 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [12:55:20] <_joe_> ok let's try this way [12:57:15] !log installing openssl1.0 security updates [12:57:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:32] (03CR) 10Giuseppe Lavagetto: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/14616/bast1002.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/490043 (owner: 10Giuseppe Lavagetto) [13:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190212T1300) [13:01:36] RECOVERY - puppet last run on snapshot1009 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:01:50] RECOVERY - puppet last run on contint2001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [13:02:46] RECOVERY - puppet last run on bast3002 is OK: OK: Puppet is currently
enabled, last run 4 minutes ago with 0 failures [13:07:04] RECOVERY - puppet last run on contint1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [13:07:10] RECOVERY - puppet last run on bast5001 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [13:07:58] RECOVERY - puppet last run on bast2001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [13:09:00] PROBLEM - puppet last run on mw2158 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [13:09:02] PROBLEM - puppet last run on mw2237 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [13:09:02] PROBLEM - puppet last run on mw1276 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [13:09:02] PROBLEM - puppet last run on mw2167 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [13:09:04] PROBLEM - puppet last run on mw2176 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [13:09:08] PROBLEM - puppet last run on mw1258 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [13:09:10] PROBLEM - puppet last run on mw1303 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 47 seconds ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [13:09:12] PROBLEM - puppet last run on mw1232 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[deployment_ensure_members] [13:09:20] PROBLEM - puppet last run on mw2242 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [13:09:21] PROBLEM - puppet last run on mwdebug1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [13:09:23] <_joe_> wat [13:09:26] PROBLEM - puppet last run on mw1328 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [13:09:26] PROBLEM - puppet last run on mw2253 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [13:09:34] PROBLEM - puppet last run on mw1332 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [13:09:44] PROBLEM - puppet last run on mw1248 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [13:09:56] <_joe_> oh ok it's our puppet check that's broken :D [13:10:04] PROBLEM - puppet last run on mw2184 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [13:10:08] PROBLEM - puppet last run on mw1242 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [13:10:14] PROBLEM - puppet last run on mw2266 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[deployment_ensure_members] [13:10:14] PROBLEM - puppet last run on mw2263 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [13:10:14] PROBLEM - puppet last run on mw2276 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [13:10:14] PROBLEM - puppet last run on mw2261 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [13:10:27] 10Operations, 10monitoring, 10Patch-For-Review: EDAC events not being reported by node-exporter? - https://phabricator.wikimedia.org/T214529 (10CDanis) Thanks @fgiunchedi, that's a good thought! However I couldn't find anything in the SEL for a selection of servers that are currently reporting / have recent... [13:10:40] PROBLEM - puppet last run on mw1227 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [13:10:46] PROBLEM - puppet last run on mw2173 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [13:10:46] PROBLEM - puppet last run on mw2171 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [13:10:51] <_joe_> sorry about this [13:10:58] PROBLEM - puppet last run on mw1341 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [13:11:00] PROBLEM - puppet last run on mw2284 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[deployment_ensure_members] [13:11:00] PROBLEM - puppet last run on mw1305 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [13:11:00] PROBLEM - puppet last run on mw2170 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [13:11:00] PROBLEM - puppet last run on mw1233 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [13:11:02] PROBLEM - puppet last run on mw1333 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [13:11:02] RECOVERY - puppet last run on people1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:11:03] <_joe_> but it was really a regression in the admin module [13:11:12] PROBLEM - puppet last run on mw1254 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [13:11:18] PROBLEM - puppet last run on mw2268 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [13:11:18] PROBLEM - puppet last run on mw1287 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [13:11:20] PROBLEM - puppet last run on mw2281 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [13:11:24] PROBLEM - puppet last run on mw2272 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[deployment_ensure_members] [13:11:24] PROBLEM - puppet last run on mw2187 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [13:11:26] PROBLEM - puppet last run on mw2246 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [13:11:26] PROBLEM - puppet last run on mw2233 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [13:11:26] PROBLEM - puppet last run on mw2247 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [13:11:26] PROBLEM - puppet last run on mw2259 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [13:11:26] PROBLEM - puppet last run on mw2270 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [13:11:26] PROBLEM - puppet last run on mw2241 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [13:11:26] PROBLEM - puppet last run on mw2186 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [13:11:27] PROBLEM - puppet last run on mw1304 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [13:11:27] PROBLEM - puppet last run on mw2181 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[deployment_ensure_members] [13:11:28] PROBLEM - puppet last run on mw2199 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [13:11:28] PROBLEM - puppet last run on mw2264 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [13:11:29] PROBLEM - puppet last run on mwmaint1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [13:11:29] PROBLEM - puppet last run on mw2234 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [13:11:30] PROBLEM - puppet last run on mw2165 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [13:11:30] PROBLEM - puppet last run on mw2209 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [13:11:31] PROBLEM - puppet last run on mw2197 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [13:11:34] PROBLEM - puppet last run on mw2212 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [13:11:52] PROBLEM - puppet last run on mw1262 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [13:12:08] PROBLEM - puppet last run on mw1337 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[deployment_ensure_members] [13:12:08] PROBLEM - puppet last run on mw2210 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [13:12:20] PROBLEM - puppet last run on mw2166 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [13:12:28] PROBLEM - puppet last run on mw1281 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [13:12:28] PROBLEM - puppet last run on mw2200 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [13:12:32] PROBLEM - puppet last run on mw1240 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [13:12:34] PROBLEM - puppet last run on mw2183 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [13:12:38] PROBLEM - puppet last run on mw2223 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [13:12:38] PROBLEM - puppet last run on mw2192 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [13:12:42] PROBLEM - puppet last run on mw2243 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [13:12:46] PROBLEM - puppet last run on mw2255 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[deployment_ensure_members] [13:12:48] PROBLEM - puppet last run on mw1301 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [13:12:49] 👀 [13:12:50] PROBLEM - puppet last run on mw1284 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [13:13:00] PROBLEM - puppet last run on mw1338 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [13:13:08] PROBLEM - puppet last run on mw1268 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [13:13:10] PROBLEM - puppet last run on mw1295 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [13:13:20] PROBLEM - puppet last run on mw1246 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [13:13:24] PROBLEM - puppet last run on mw1317 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [13:13:40] PROBLEM - puppet last run on mw2194 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [13:13:40] PROBLEM - puppet last run on mw2205 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [13:13:46] PROBLEM - puppet last run on mw1250 is CRITICAL: CRITICAL: Puppet has 1 failures. 
Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [13:13:46] RECOVERY - puppet last run on mw2222 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [13:13:48] PROBLEM - puppet last run on mw2285 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [13:13:48] PROBLEM - puppet last run on mw1252 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[deployment_ensure_members] [13:14:14] RECOVERY - puppet last run on mw2158 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [13:14:18] RECOVERY - puppet last run on mw2237 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [13:14:18] RECOVERY - puppet last run on mw2167 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:14:18] RECOVERY - puppet last run on mw1276 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [13:14:28] RECOVERY - puppet last run on mw1232 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [13:14:36] RECOVERY - puppet last run on mw2242 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:14:36] RECOVERY - puppet last run on mwdebug1002 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [13:14:40] RECOVERY - puppet last run on mw1328 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [13:14:42] RECOVERY - puppet last run on mw2253 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [13:15:00] RECOVERY - puppet last run on mw1248 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [13:15:18] RECOVERY - puppet last run on mw2184 is OK: OK: Puppet is currently enabled, 
last run 41 seconds ago with 0 failures [13:15:22] RECOVERY - puppet last run on mw1242 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [13:15:28] RECOVERY - puppet last run on mw2266 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [13:15:28] RECOVERY - puppet last run on mw2276 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [13:15:28] RECOVERY - puppet last run on mw2261 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [13:15:28] RECOVERY - puppet last run on mw2263 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [13:15:54] RECOVERY - puppet last run on mw1227 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [13:15:58] RECOVERY - puppet last run on mw2171 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [13:15:58] RECOVERY - puppet last run on mw2173 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [13:16:10] RECOVERY - puppet last run on mw1341 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:16:12] RECOVERY - puppet last run on mw2284 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:16:12] RECOVERY - puppet last run on mw2170 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [13:16:12] RECOVERY - puppet last run on mw1233 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [13:16:12] RECOVERY - puppet last run on mw1305 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:16:16] RECOVERY - puppet last run on mw1333 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [13:16:24] RECOVERY - puppet last run on mw1254 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [13:16:30] RECOVERY - puppet last run on mw2268 is 
OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:16:30] RECOVERY - puppet last run on mw1287 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [13:16:32] RECOVERY - puppet last run on mw2281 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [13:16:36] RECOVERY - puppet last run on mw2272 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [13:16:36] RECOVERY - puppet last run on mw2187 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [13:16:38] RECOVERY - puppet last run on mw2247 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:16:38] RECOVERY - puppet last run on mw2246 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:16:38] RECOVERY - puppet last run on mw2181 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:16:38] RECOVERY - puppet last run on mw2241 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:16:38] RECOVERY - puppet last run on mw2259 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [13:16:38] RECOVERY - puppet last run on mw2270 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:16:38] RECOVERY - puppet last run on mw2233 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [13:16:39] RECOVERY - puppet last run on mw2199 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [13:16:39] RECOVERY - puppet last run on mw2186 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:16:40] RECOVERY - puppet last run on mw1304 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [13:16:40] RECOVERY - puppet last run on mw2234 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [13:16:41] RECOVERY - 
puppet last run on mw2264 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:16:41] RECOVERY - puppet last run on mw2209 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [13:16:42] RECOVERY - puppet last run on mw2165 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:16:42] RECOVERY - puppet last run on mw2197 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:16:43] RECOVERY - puppet last run on mwmaint1002 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [13:16:46] RECOVERY - puppet last run on mw2212 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [13:17:04] RECOVERY - puppet last run on mw1262 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [13:17:20] RECOVERY - puppet last run on mw2210 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [13:17:21] RECOVERY - puppet last run on mw1337 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [13:17:34] RECOVERY - puppet last run on mw2166 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [13:17:40] RECOVERY - puppet last run on mw2200 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [13:17:40] RECOVERY - puppet last run on mw1281 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:17:44] RECOVERY - puppet last run on mw1240 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [13:17:46] RECOVERY - puppet last run on mw2183 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [13:17:50] RECOVERY - puppet last run on mw2223 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [13:17:50] RECOVERY - puppet last run on mw2192 is OK: OK: Puppet is currently enabled, last run 4 minutes ago 
with 0 failures [13:17:56] RECOVERY - puppet last run on mw2243 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:17:58] RECOVERY - puppet last run on mw2255 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [13:18:00] RECOVERY - puppet last run on mw1301 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [13:18:04] RECOVERY - puppet last run on mw1284 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [13:18:14] RECOVERY - puppet last run on mw1338 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [13:18:20] RECOVERY - puppet last run on mw1268 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [13:18:24] RECOVERY - puppet last run on mw1295 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [13:18:30] RECOVERY - puppet last run on mw1246 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [13:18:36] RECOVERY - puppet last run on mw1317 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [13:18:50] RECOVERY - puppet last run on mw2194 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [13:18:52] RECOVERY - puppet last run on mw2205 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:18:58] RECOVERY - puppet last run on mw1250 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [13:19:00] RECOVERY - puppet last run on mw2285 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [13:19:00] RECOVERY - puppet last run on mw1252 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [13:19:06] (03CR) 10Gehel: [C: 03+2] use hyphen(-) for optional args [cookbooks] - 10https://gerrit.wikimedia.org/r/489751 (owner: 10Mathew.onipe) [13:19:13] 10Operations, 10SRE-Access-Requests, 
10Patch-For-Review: Requesting access to deployment, contint-admins, and contint-docker for Brennen Bearnes - https://phabricator.wikimedia.org/T215328 (10Joe) 05Open→03Resolved @brennen your key was added to production; let me know if you have any problem accessing p... [13:19:30] RECOVERY - puppet last run on mw2176 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [13:19:36] RECOVERY - puppet last run on mw1258 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [13:19:38] RECOVERY - puppet last run on mw1303 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [13:20:02] RECOVERY - puppet last run on mw1332 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [13:28:11] 10Operations, 10monitoring, 10Patch-For-Review: Evaluate/integrate rasdaemon as a replacement for mcelog - https://phabricator.wikimedia.org/T205396 (10CDanis) It appears that how to make Prometheus node_exporter play nice with rasdaemon is an unresolved issue: https://github.com/prometheus/node_exporter/iss... [13:30:36] 10Operations: sw raid1 doesnt install grub on sdb - https://phabricator.wikimedia.org/T215183 (10CDanis) a:03CDanis [13:31:11] (03PS2) 10Marostegui: db-eqiad.php: Slowly repool db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490041 [13:35:10] (03PS3) 10GTirloni: toolforge::clush::master - Convert cronjob to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/489393 (https://phabricator.wikimedia.org/T210818) [13:35:47] (03PS3) 10Jbond: Add rasdaemon service to systems which support it. 
[puppet] - 10https://gerrit.wikimedia.org/r/490042 (https://phabricator.wikimedia.org/T205396) [13:36:34] (03CR) 10GTirloni: [C: 03+2] toolforge::clush::master - Convert cronjob to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/489393 (https://phabricator.wikimedia.org/T210818) (owner: 10GTirloni) [13:38:40] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Slowly repool db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490041 (owner: 10Marostegui) [13:39:41] !log uploaded acme-chief 0.9 to apt.wikimedia.org (stretch) - T207389 T213737 [13:39:42] (03Merged) 10jenkins-bot: db-eqiad.php: Slowly repool db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490041 (owner: 10Marostegui) [13:39:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:45] T213737: Allow specifying a custom period of time before deploying a newly issued certificate - https://phabricator.wikimedia.org/T213737 [13:39:46] T207389: Rename the Certcentral project - https://phabricator.wikimedia.org/T207389 [13:40:57] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Slowly repool db1092 (duration: 00m 47s) [13:40:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:22] PROBLEM - HHVM rendering on mw2169 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time [13:43:40] RECOVERY - HHVM rendering on mw2169 is OK: HTTP OK: HTTP/1.1 200 OK - 75076 bytes in 4.511 second response time [13:45:49] (03CR) 10jenkins-bot: db-eqiad.php: Slowly repool db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490041 (owner: 10Marostegui) [13:52:21] 10Operations, 10Thumbor, 10serviceops, 10Patch-For-Review, and 3 others: Upgrade Thumbor servers to Stretch - https://phabricator.wikimedia.org/T170817 (10Gilles) [13:53:20] (03PS1) 10Marostegui: db-eqiad.php: Give api traffic to db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490049 [13:53:40] 
10Operations, 10Thumbor, 10serviceops, 10Patch-For-Review, and 3 others: Upgrade Thumbor servers to Stretch - https://phabricator.wikimedia.org/T170817 (10Gilles) [13:54:38] (03PS2) 10Vgutierrez: acme_chief: Create acme_chief module as a duplicate of certcentral [puppet] - 10https://gerrit.wikimedia.org/r/489719 (https://phabricator.wikimedia.org/T207389) [13:54:40] (03PS2) 10Vgutierrez: site: Switch acmechief[12]001 to acme_chief role [puppet] - 10https://gerrit.wikimedia.org/r/489720 (https://phabricator.wikimedia.org/T207389) [13:54:42] (03PS1) 10Vgutierrez: site: Add acmechief[12]001 as spare servers [puppet] - 10https://gerrit.wikimedia.org/r/490050 [13:55:45] (03CR) 10jerkins-bot: [V: 04-1] acme_chief: Create acme_chief module as a duplicate of certcentral [puppet] - 10https://gerrit.wikimedia.org/r/489719 (https://phabricator.wikimedia.org/T207389) (owner: 10Vgutierrez) [13:56:15] 10Operations, 10Thumbor, 10serviceops, 10Patch-For-Review, and 3 others: Upgrade Thumbor servers to Stretch - https://phabricator.wikimedia.org/T170817 (10Gilles) [13:56:30] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Give api traffic to db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490049 (owner: 10Marostegui) [13:57:29] (03Merged) 10jenkins-bot: db-eqiad.php: Give api traffic to db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490049 (owner: 10Marostegui) [13:57:43] (03CR) 10jenkins-bot: db-eqiad.php: Give api traffic to db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490049 (owner: 10Marostegui) [13:58:27] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Give some api traffic to db1092 (duration: 00m 46s) [13:58:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:45] (03PS3) 10Vgutierrez: acme_chief: Create acme_chief module as a duplicate of certcentral [puppet] - 10https://gerrit.wikimedia.org/r/489719 (https://phabricator.wikimedia.org/T207389) [13:59:47] (03PS3) 10Vgutierrez: site: 
Switch acmechief[12]001 to acme_chief role [puppet] - 10https://gerrit.wikimedia.org/r/489720 (https://phabricator.wikimedia.org/T207389) [13:59:49] (03PS1) 10Vgutierrez: install_server: Use buster in acmechief[12]001 [puppet] - 10https://gerrit.wikimedia.org/r/490051 (https://phabricator.wikimedia.org/T207389) [14:00:04] Deploy window MediaWiki train - European version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190212T1400) [14:00:07] 10Operations, 10Thumbor, 10serviceops, 10Patch-For-Review, and 3 others: Upgrade Thumbor servers to Stretch - https://phabricator.wikimedia.org/T170817 (10Gilles) [14:00:44] (03CR) 10Vgutierrez: [C: 03+2] site: Add acmechief[12]001 as spare servers [puppet] - 10https://gerrit.wikimedia.org/r/490050 (owner: 10Vgutierrez) [14:00:48] (03CR) 10jerkins-bot: [V: 04-1] acme_chief: Create acme_chief module as a duplicate of certcentral [puppet] - 10https://gerrit.wikimedia.org/r/489719 (https://phabricator.wikimedia.org/T207389) (owner: 10Vgutierrez) [14:01:12] (03CR) 10Vgutierrez: [C: 03+2] install_server: Use buster in acmechief[12]001 [puppet] - 10https://gerrit.wikimedia.org/r/490051 (https://phabricator.wikimedia.org/T207389) (owner: 10Vgutierrez) [14:02:18] (03PS2) 10Alexandros Kosiaris: varnish: Add new WMCS IP space as trusted [puppet] - 10https://gerrit.wikimedia.org/r/488516 (https://phabricator.wikimedia.org/T213475) [14:02:31] (03PS2) 10Vgutierrez: site: Add acmechief[12]001 as spare servers [puppet] - 10https://gerrit.wikimedia.org/r/490050 [14:03:59] (03PS2) 10Vgutierrez: install_server: Use buster in acmechief[12]001 [puppet] - 10https://gerrit.wikimedia.org/r/490051 (https://phabricator.wikimedia.org/T207389) [14:05:35] (03PS1) 10GTirloni: toolforge::clush::master - Fix systemd timer definition [puppet] - 10https://gerrit.wikimedia.org/r/490052 (https://phabricator.wikimedia.org/T210818) [14:08:05] (03PS2) 10GTirloni: toolforge::clush::master - Fix systemd timer definition [puppet] - 
10https://gerrit.wikimedia.org/r/490052 (https://phabricator.wikimedia.org/T210818) [14:08:59] (03CR) 10GTirloni: [C: 03+2] toolforge::clush::master - Fix systemd timer definition [puppet] - 10https://gerrit.wikimedia.org/r/490052 (https://phabricator.wikimedia.org/T210818) (owner: 10GTirloni) [14:10:11] (03CR) 10Alexandros Kosiaris: [C: 03+2] "OK. I've added comments about changing that when T209011 is implemented." [puppet] - 10https://gerrit.wikimedia.org/r/488516 (https://phabricator.wikimedia.org/T213475) (owner: 10Alexandros Kosiaris) [14:10:18] (03PS3) 10Alexandros Kosiaris: varnish: Add new WMCS IP space as trusted [puppet] - 10https://gerrit.wikimedia.org/r/488516 (https://phabricator.wikimedia.org/T213475) [14:10:22] (03PS1) 10Marostegui: db-eqiad.php: More traffic to db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490053 [14:11:18] (03CR) 10Muehlenhoff: Add rasdaemon service to systems which support it. (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/490042 (https://phabricator.wikimedia.org/T205396) (owner: 10Jbond) [14:11:31] (03CR) 10Muehlenhoff: "Looks good, some comments inline" [puppet] - 10https://gerrit.wikimedia.org/r/490042 (https://phabricator.wikimedia.org/T205396) (owner: 10Jbond) [14:11:36] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: More traffic to db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490053 (owner: 10Marostegui) [14:12:40] (03Merged) 10jenkins-bot: db-eqiad.php: More traffic to db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490053 (owner: 10Marostegui) [14:13:40] 10Operations, 10monitoring, 10Patch-For-Review: Evaluate/integrate rasdaemon as a replacement for mcelog - https://phabricator.wikimedia.org/T205396 (10jbond) rasdaemon writes data to a sqlite3 file located in /var/lib/rasdaemon/ras-mc_event.db I'm not sure of the format other than below but perhaps we could use... 
[14:13:54] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: More traffic to db1092 (duration: 00m 46s) [14:13:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:26] (03PS1) 10Cmjohnson: Adding mgmt dns for db11[26-38] [dns] - 10https://gerrit.wikimedia.org/r/490054 (https://phabricator.wikimedia.org/T211613) [14:16:26] \o/ [14:17:18] (03PS11) 10Andrew Bogott: Cloud vms: enable a default tty [puppet] - 10https://gerrit.wikimedia.org/r/489299 (https://phabricator.wikimedia.org/T215211) [14:17:20] (03PS2) 10Andrew Bogott: bootstrap-vz: set up a root terminal on ttyS1 [puppet] - 10https://gerrit.wikimedia.org/r/489947 (https://phabricator.wikimedia.org/T215211) [14:18:25] (03PS4) 10Vgutierrez: acme_chief: Create acme_chief module as a duplicate of certcentral [puppet] - 10https://gerrit.wikimedia.org/r/489719 (https://phabricator.wikimedia.org/T207389) [14:18:27] (03PS4) 10Vgutierrez: site: Switch acmechief[12]001 to acme_chief role [puppet] - 10https://gerrit.wikimedia.org/r/489720 (https://phabricator.wikimedia.org/T207389) [14:18:29] (03PS1) 10Vgutierrez: install_server: provide netboot config for acmechief[12]001 [puppet] - 10https://gerrit.wikimedia.org/r/490055 (https://phabricator.wikimedia.org/T207389) [14:18:48] (03CR) 10jenkins-bot: db-eqiad.php: More traffic to db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490053 (owner: 10Marostegui) [14:18:55] (03CR) 10jerkins-bot: [V: 04-1] acme_chief: Create acme_chief module as a duplicate of certcentral [puppet] - 10https://gerrit.wikimedia.org/r/489719 (https://phabricator.wikimedia.org/T207389) (owner: 10Vgutierrez) [14:19:05] (03CR) 10Vgutierrez: [C: 03+2] install_server: provide netboot config for acmechief[12]001 [puppet] - 10https://gerrit.wikimedia.org/r/490055 (https://phabricator.wikimedia.org/T207389) (owner: 10Vgutierrez) [14:19:19] (03PS2) 10Vgutierrez: install_server: provide netboot config for acmechief[12]001 [puppet] - 
10https://gerrit.wikimedia.org/r/490055 (https://phabricator.wikimedia.org/T207389) [14:19:20] (03CR) 10jerkins-bot: [V: 04-1] site: Switch acmechief[12]001 to acme_chief role [puppet] - 10https://gerrit.wikimedia.org/r/489720 (https://phabricator.wikimedia.org/T207389) (owner: 10Vgutierrez) [14:19:26] (03CR) 10jerkins-bot: [V: 04-1] install_server: provide netboot config for acmechief[12]001 [puppet] - 10https://gerrit.wikimedia.org/r/490055 (https://phabricator.wikimedia.org/T207389) (owner: 10Vgutierrez) [14:19:35] uh? [14:19:52] (03PS1) 10GTirloni: toolforge::clush::master - Fix typo [puppet] - 10https://gerrit.wikimedia.org/r/490056 (https://phabricator.wikimedia.org/T210818) [14:20:59] (03PS4) 10Jbond: Add rasdaemon service to systems which support it. [puppet] - 10https://gerrit.wikimedia.org/r/490042 (https://phabricator.wikimedia.org/T205396) [14:21:01] (03CR) 10GTirloni: [C: 03+2] toolforge::clush::master - Fix typo [puppet] - 10https://gerrit.wikimedia.org/r/490056 (https://phabricator.wikimedia.org/T210818) (owner: 10GTirloni) [14:21:05] (03CR) 10Jbond: "thanks moritz, all comments fixed" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/490042 (https://phabricator.wikimedia.org/T205396) (owner: 10Jbond) [14:21:18] (03PS1) 10Marostegui: db-eqiad.php: More API traffic to db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490057 [14:21:35] (03PS3) 10Vgutierrez: install_server: provide netboot config for acmechief[12]001 [puppet] - 10https://gerrit.wikimedia.org/r/490055 (https://phabricator.wikimedia.org/T207389) [14:21:37] (03PS5) 10Vgutierrez: acme_chief: Create acme_chief module as a duplicate of certcentral [puppet] - 10https://gerrit.wikimedia.org/r/489719 (https://phabricator.wikimedia.org/T207389) [14:21:38] rebase all the things \o/ [14:21:39] (03PS5) 10Vgutierrez: site: Switch acmechief[12]001 to acme_chief role [puppet] - 10https://gerrit.wikimedia.org/r/489720 (https://phabricator.wikimedia.org/T207389) [14:22:43] (03PS12) 
10Andrew Bogott: Cloud vms: enable a default tty [puppet] - 10https://gerrit.wikimedia.org/r/489299 (https://phabricator.wikimedia.org/T215211) [14:22:46] (03CR) 10jerkins-bot: [V: 04-1] acme_chief: Create acme_chief module as a duplicate of certcentral [puppet] - 10https://gerrit.wikimedia.org/r/489719 (https://phabricator.wikimedia.org/T207389) (owner: 10Vgutierrez) [14:22:58] (03PS4) 10Vgutierrez: install_server: provide netboot config for acmechief[12]001 [puppet] - 10https://gerrit.wikimedia.org/r/490055 (https://phabricator.wikimedia.org/T207389) [14:23:05] 10Operations, 10Cloud-VPS, 10Toolforge, 10Traffic, 10Patch-For-Review: Wikimedia varnish rules no longer exempt all Cloud VPS/Toolforge IPs from rate limits (HTTP 429 response) - https://phabricator.wikimedia.org/T213475 (10akosiaris) Change has been deployed across the fleet. WMCS IP space `172.16.0.0/1... [14:23:38] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: More API traffic to db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490057 (owner: 10Marostegui) [14:24:52] (03Merged) 10jenkins-bot: db-eqiad.php: More API traffic to db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490057 (owner: 10Marostegui) [14:25:19] (03PS13) 10Andrew Bogott: Cloud vms: enable a default tty [puppet] - 10https://gerrit.wikimedia.org/r/489299 (https://phabricator.wikimedia.org/T215211) [14:26:19] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to merge, we should update the commit message given that we're only targeting > stretch initially." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/490042 (https://phabricator.wikimedia.org/T205396) (owner: 10Jbond) [14:26:25] (03CR) 10Andrew Bogott: [C: 03+2] Cloud vms: enable a default tty [puppet] - 10https://gerrit.wikimedia.org/r/489299 (https://phabricator.wikimedia.org/T215211) (owner: 10Andrew Bogott) [14:26:30] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: More api traffic to db1092 (duration: 00m 44s) [14:26:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:04] (03PS4) 10Alexandros Kosiaris: sca: Remove the cluster [puppet] - 10https://gerrit.wikimedia.org/r/483103 (https://phabricator.wikimedia.org/T212772) [14:29:51] (03CR) 10Alexandros Kosiaris: [C: 03+2] sca: Remove the cluster [puppet] - 10https://gerrit.wikimedia.org/r/483103 (https://phabricator.wikimedia.org/T212772) (owner: 10Alexandros Kosiaris) [14:30:58] (03CR) 10jenkins-bot: db-eqiad.php: More API traffic to db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490057 (owner: 10Marostegui) [14:32:02] (03PS1) 10Marostegui: db-eqiad.php: Fully repool db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490062 [14:32:53] (03CR) 10Fsero: [C: 03+1] "LGTM :)" (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/490027 (owner: 10Alexandros Kosiaris) [14:36:52] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Fully repool db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490062 (owner: 10Marostegui) [14:37:56] (03Merged) 10jenkins-bot: db-eqiad.php: Fully repool db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490062 (owner: 10Marostegui) [14:39:30] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Fully repool db1092 (duration: 00m 46s) [14:39:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:12] (03PS1) 10Elukey: role::analytics_test_cluster::coord: add kafkatee instance [puppet] - 10https://gerrit.wikimedia.org/r/490067 
(https://phabricator.wikimedia.org/T212259) [14:42:44] (03CR) 10jenkins-bot: db-eqiad.php: Fully repool db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490062 (owner: 10Marostegui) [14:43:45] (03PS1) 10Marostegui: dbstore1005: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/490068 (https://phabricator.wikimedia.org/T210478) [14:50:52] (03PS1) 10Alexandros Kosiaris: Remove all non kubernetes related zotero stuff from repo [puppet] - 10https://gerrit.wikimedia.org/r/490069 [14:51:07] 10Operations, 10Citoid, 10serviceops, 10Kubernetes, 10Wikimedia-Incident: Zotero service crashes and pages multiple times. - https://phabricator.wikimedia.org/T213693 (10fsero) 05Open→03Resolved a:03fsero After latest deployments of zotero this has been fixed [14:53:40] 10Operations, 10serviceops, 10vm-requests, 10Patch-For-Review, 10User-fsero: eqiad: 1-2 VM requests for docker-registry-beta.wikimedia.org - https://phabricator.wikimedia.org/T212212 (10fsero) 05Open→03Resolved a:05fsero→03None vms are already assigned and running. [14:59:59] (03PS1) 10Fsero: Enabling docker registry swift replication [puppet] - 10https://gerrit.wikimedia.org/r/490073 (https://phabricator.wikimedia.org/T214289) [15:00:28] (03CR) 10Ottomata: role::analytics_test_cluster::coord: add kafkatee instance (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/490067 (https://phabricator.wikimedia.org/T212259) (owner: 10Elukey) [15:00:32] (03PS2) 10Fsero: Enabling docker registry swift replication [puppet] - 10https://gerrit.wikimedia.org/r/490073 (https://phabricator.wikimedia.org/T214289) [15:01:28] (03CR) 10Fsero: "I will be very happy if you can review the code and the underlying objective." 
[puppet] - 10https://gerrit.wikimedia.org/r/490073 (https://phabricator.wikimedia.org/T214289) (owner: 10Fsero) [15:01:47] (03CR) 10jerkins-bot: [V: 04-1] Enabling docker registry swift replication [puppet] - 10https://gerrit.wikimedia.org/r/490073 (https://phabricator.wikimedia.org/T214289) (owner: 10Fsero) [15:02:14] (03PS3) 10Andrew Bogott: bootstrap-vz: set up a root terminal on ttyS1 [puppet] - 10https://gerrit.wikimedia.org/r/489947 (https://phabricator.wikimedia.org/T215211) [15:03:20] (03CR) 10Andrew Bogott: [C: 03+2] bootstrap-vz: set up a root terminal on ttyS1 [puppet] - 10https://gerrit.wikimedia.org/r/489947 (https://phabricator.wikimedia.org/T215211) (owner: 10Andrew Bogott) [15:05:05] 10Operations, 10ops-eqiad: mw1299 is down (jobrunner-canary, now up but depooled) - https://phabricator.wikimedia.org/T215569 (10Cmjohnson) The self-dispatch was approved and the part should hopefully be here by tomorrow. [15:05:43] (03PS3) 10Fsero: Enabling docker registry swift replication [puppet] - 10https://gerrit.wikimedia.org/r/490073 (https://phabricator.wikimedia.org/T214289) [15:06:42] (03CR) 10jerkins-bot: [V: 04-1] Enabling docker registry swift replication [puppet] - 10https://gerrit.wikimedia.org/r/490073 (https://phabricator.wikimedia.org/T214289) (owner: 10Fsero) [15:07:48] (03PS4) 10Fsero: Enabling docker registry swift replication [puppet] - 10https://gerrit.wikimedia.org/r/490073 (https://phabricator.wikimedia.org/T214289) [15:08:40] (03CR) 10jerkins-bot: [V: 04-1] Enabling docker registry swift replication [puppet] - 10https://gerrit.wikimedia.org/r/490073 (https://phabricator.wikimedia.org/T214289) (owner: 10Fsero) [15:09:31] (03PS5) 10Fsero: Enabling docker registry swift replication [puppet] - 10https://gerrit.wikimedia.org/r/490073 (https://phabricator.wikimedia.org/T214289) [15:10:42] (03CR) 10Elukey: role::analytics_test_cluster::coord: add kafkatee instance (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/490067 
(https://phabricator.wikimedia.org/T212259) (owner: 10Elukey) [15:11:00] 10Operations, 10Core Platform Team, 10MediaWiki-Database, 10Wikimedia-Logstash, and 2 others: MediaWiki errors overloading logstash - https://phabricator.wikimedia.org/T215611 (10EvanProdromou) I wonder if this is a good use case for the [[ https://martinfowler.com/bliki/CircuitBreaker.html | circuit break... [15:16:43] 10Operations, 10Core Platform Team, 10MediaWiki-Database, 10Performance-Team, and 3 others: MediaWiki errors overloading logstash - https://phabricator.wikimedia.org/T215611 (10kchapman) [15:16:59] (03PS5) 10Jbond: Add rasdaemon service to systems which support it. [puppet] - 10https://gerrit.wikimedia.org/r/490042 (https://phabricator.wikimedia.org/T205396) [15:17:53] 10Operations, 10MediaWiki-Database, 10Performance-Team, 10Wikimedia-Logstash, and 4 others: MediaWiki errors overloading logstash - https://phabricator.wikimedia.org/T215611 (10CCicalese_WMF) [15:20:54] (03PS3) 10Herron: logstash::collector: pull logs from both kafka-logging clusters [puppet] - 10https://gerrit.wikimedia.org/r/480787 (https://phabricator.wikimedia.org/T205849) [15:21:49] 10Operations, 10MediaWiki-Database, 10Performance-Team, 10Wikimedia-Logstash, and 4 others: MediaWiki errors overloading logstash - https://phabricator.wikimedia.org/T215611 (10CDanis) BTW @fgiunchedi authored an incident report at https://wikitech.wikimedia.org/wiki/Incident_documentation/20190208-logstas... [15:27:05] 10Operations, 10Analytics, 10Research-management, 10User-Elukey: GPU upgrade for stats machine - https://phabricator.wikimedia.org/T148843 (10elukey) Let's see if we can narrow down the packages needed: ` \-- rocm-dev |--hsa-rocr-dev |--hsa-ext-rocr-dev |--rocm-device-libs |--rocm... 
[15:27:10] (03CR) 10Marostegui: [C: 03+1] "+1 none of those look in use" [dns] - 10https://gerrit.wikimedia.org/r/490054 (https://phabricator.wikimedia.org/T211613) (owner: 10Cmjohnson) [15:27:47] cmjohnson1: do you want me to merge that change or just needed my +1? [15:28:21] Feel free to merge the change. Thank you! [15:28:24] Ah cool [15:28:31] (03CR) 10Marostegui: [C: 03+2] Adding mgmt dns for db11[26-38] [dns] - 10https://gerrit.wikimedia.org/r/490054 (https://phabricator.wikimedia.org/T211613) (owner: 10Cmjohnson) [15:28:56] cmjohnson1: merged and deployed [15:30:09] !log otto@deploy1001 scap-helm --help [namespace: --help, clusters: eqiad,codfw] [15:30:10] !log otto@deploy1001 scap-helm --help cluster eqiad completed [15:30:10] !log otto@deploy1001 scap-helm --help cluster codfw completed [15:30:10] !log otto@deploy1001 scap-helm --help finished [15:30:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:27] (03PS1) 10Giuseppe Lavagetto: profile::mediawiki::php: allow to fine-tune request timeouts [puppet] - 10https://gerrit.wikimedia.org/r/490077 [15:30:39] hmmm haha akosiaris ^^^ [15:30:46] scap-help --help does logbot! 
[15:30:49] helm* [15:31:13] "very simple and thin shim" indeed :) [15:32:04] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review, 10User-Marostegui: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10Marostegui) [15:32:19] yeah, can't wait to get rid of it [15:33:06] (03PS1) 10Alexandros Kosiaris: Add eventgate-analytics tokens [puppet] - 10https://gerrit.wikimedia.org/r/490078 [15:33:46] (03PS1) 10Marostegui: db-codfw.php: Depool db2085 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490079 (https://phabricator.wikimedia.org/T214840) [15:34:56] (03PS2) 10Marostegui: db-codfw.php: Depool db2085 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490079 (https://phabricator.wikimedia.org/T214840) [15:36:00] (03CR) 10Marostegui: [C: 03+2] db-codfw.php: Depool db2085 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490079 (https://phabricator.wikimedia.org/T214840) (owner: 10Marostegui) [15:36:58] (03Merged) 10jenkins-bot: db-codfw.php: Depool db2085 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490079 (https://phabricator.wikimedia.org/T214840) (owner: 10Marostegui) [15:37:13] (03CR) 10jenkins-bot: db-codfw.php: Depool db2085 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490079 (https://phabricator.wikimedia.org/T214840) (owner: 10Marostegui) [15:38:08] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Depool db2085 - T214840 (duration: 00m 47s) [15:38:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:11] !log Stop MySQL on db2085 - T214840 [15:38:11] T214840: db2085/db1106 don't boot with 4.9.0-8-amd64 - https://phabricator.wikimedia.org/T214840 [15:38:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:07] (03CR) 10Herron: [C: 03+2] logstash::collector: pull logs from both kafka-logging clusters [puppet] - 10https://gerrit.wikimedia.org/r/480787 (https://phabricator.wikimedia.org/T205849) (owner: 10Herron) 
[15:43:56] (03PS6) 10Vgutierrez: acme_chief: Create acme_chief module as a duplicate of certcentral [puppet] - 10https://gerrit.wikimedia.org/r/489719 (https://phabricator.wikimedia.org/T207389) [15:44:23] (03PS6) 10Vgutierrez: site: Switch acmechief[12]001 to acme_chief role [puppet] - 10https://gerrit.wikimedia.org/r/489720 (https://phabricator.wikimedia.org/T207389) [15:44:27] (03PS1) 10Ottomata: Update eventgate config.yaml template to use schema_base_uris key [deployment-charts] - 10https://gerrit.wikimedia.org/r/490080 (https://phabricator.wikimedia.org/T211247) [15:44:45] (03PS2) 10Alexandros Kosiaris: Add eventgate-analytics tokens [puppet] - 10https://gerrit.wikimedia.org/r/490078 [15:44:47] (03CR) 10jerkins-bot: [V: 04-1] acme_chief: Create acme_chief module as a duplicate of certcentral [puppet] - 10https://gerrit.wikimedia.org/r/489719 (https://phabricator.wikimedia.org/T207389) (owner: 10Vgutierrez) [15:45:05] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Update eventgate config.yaml template to use schema_base_uris key [deployment-charts] - 10https://gerrit.wikimedia.org/r/490080 (https://phabricator.wikimedia.org/T211247) (owner: 10Ottomata) [15:45:15] !log rebooting db2085 for some tests [15:45:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:53] (03PS3) 10Alexandros Kosiaris: Add eventgate-analytics tokens [puppet] - 10https://gerrit.wikimedia.org/r/490078 (https://phabricator.wikimedia.org/T211247) [15:46:21] 10Operations, 10Analytics, 10Research-management, 10User-Elukey: GPU upgrade for stats machine - https://phabricator.wikimedia.org/T148843 (10elukey) Nice summary in https://rocm-documentation.readthedocs.io/en/latest/Programming_Guides/Programming-Guides.html [15:46:28] !log create namespaces for eventgate-analytics on eqiad/codfw/staging cluster T211247 T213194 [15:46:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:33] T211247: Modern Event Platform: Stream Intake Service: 
Implementation: Deployment Pipeline - https://phabricator.wikimedia.org/T211247 [15:46:33] T213194: Migrate citoid to kubernetes - https://phabricator.wikimedia.org/T213194 [15:46:40] sigh wrong task [15:48:23] PROBLEM - HHVM rendering on mw2202 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:49:13] RECOVERY - HHVM rendering on mw2202 is OK: HTTP OK: HTTP/1.1 200 OK - 74948 bytes in 0.167 second response time [15:50:32] (03CR) 10Alexandros Kosiaris: [C: 03+2] Add eventgate-analytics tokens [puppet] - 10https://gerrit.wikimedia.org/r/490078 (https://phabricator.wikimedia.org/T211247) (owner: 10Alexandros Kosiaris) [15:50:35] (03CR) 10Alexandros Kosiaris: [C: 03+2] "PCC at https://puppet-compiler.wmflabs.org/compiler1002/14619/" [puppet] - 10https://gerrit.wikimedia.org/r/490078 (https://phabricator.wikimedia.org/T211247) (owner: 10Alexandros Kosiaris) [15:54:31] 10Operations, 10Discovery-Search, 10Elasticsearch: Add more metrics to upstream's elasticsearch exporter. - https://phabricator.wikimedia.org/T214547 (10Gehel) [15:54:34] 10Operations, 10Maps, 10Reading-Infrastructure-Team-Backlog: Map tile generation error - https://phabricator.wikimedia.org/T215120 (10MSantos) [15:56:05] 10Operations, 10Maps, 10Reading-Infrastructure-Team-Backlog, 10Patch-For-Review: Kartotherian service on maps100[2-4] timed out on when trying to get tiles. - https://phabricator.wikimedia.org/T214434 (10MSantos) [15:56:40] 10Operations, 10Maps, 10Reading-Infrastructure-Team-Backlog, 10Patch-For-Review: Kartotherian service on maps100[2-4] timed out on when trying to get tiles. - https://phabricator.wikimedia.org/T214434 (10Gehel) 05Open→03Resolved a:03Gehel The incident is documented in https://wikitech.wikimedia.org/... [15:57:20] 10Operations, 10Maps, 10Reading-Infrastructure-Team-Backlog, 10Patch-For-Review: Kartotherian service on maps100[2-4] timed out on when trying to get tiles. 
- https://phabricator.wikimedia.org/T214434 (10Gehel) A better procedure will be tested and documented as part of the reimage of the codfw maps cluster. [15:58:10] !log akosiaris@deploy1001 scap-helm eventgate-analytics [namespace: eventgate-analytics, clusters: eqiad,codfw] [15:58:10] !log akosiaris@deploy1001 scap-helm eventgate-analytics cluster eqiad completed [15:58:10] !log akosiaris@deploy1001 scap-helm eventgate-analytics cluster codfw completed [15:58:10] !log akosiaris@deploy1001 scap-helm eventgate-analytics finished [15:58:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:25] lol [15:58:28] need to fix this [15:59:14] (03CR) 10BryanDavis: "> LGTM, however I don't know what's the trusty VM count at the minute" [puppet] - 10https://gerrit.wikimedia.org/r/489753 (https://phabricator.wikimedia.org/T213708) (owner: 10Cwhite) [15:59:15] PROBLEM - Long running screen/tmux on an-coord1001 is CRITICAL: CRIT: Long running SCREEN process. (user: otto PID: 26051, 2503962s 1728000s). [16:01:35] PROBLEM - etcd request latencies on acrux is CRITICAL: instance=10.192.0.93:6443 operation=list https://grafana.wikimedia.org/dashboard/db/kubernetes-api [16:02:45] RECOVERY - etcd request latencies on acrux is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [16:04:58] 10Operations, 10Maps, 10Reading-Infrastructure-Team-Backlog: Map tile generation error - https://phabricator.wikimedia.org/T215120 (10Gehel) [16:09:35] 10Operations, 10Gerrit, 10serviceops: Gerrit loads very slowly - https://phabricator.wikimedia.org/T215855 (10thcipriani) Cleaner threaddump output I grabbed last night and forgot to paste: {P8073} [16:11:23] (03PS7) 10Vgutierrez: acme_chief: Create acme_chief module as a duplicate of certcentral [puppet] - 10https://gerrit.wikimedia.org/r/489719 (https://phabricator.wikimedia.org/T207389) [16:11:49] 10Operations, 10Analytics, 10Research-management, 10User-Elukey: GPU upgrade for stats machine - https://phabricator.wikimedia.org/T148843 (10elukey) First joy: ` (tensorflow_test) elukey@stat1005:~/tensorflow_test$ pip3 install tensorflow-rocm Collecting tensorflow-rocm Could not find a version that sa... [16:11:51] 10Operations, 10DBA, 10Packaging, 10Patch-For-Review: db2085/db1106 don't boot with 4.9.0-8-amd64 - https://phabricator.wikimedia.org/T214840 (10Marostegui) @MoritzMuehlenhoff has installed 4.9.144-3 on db2085. Out of 8 reboots, two of them got stuck (in a row). 1st reboot by @MoritzMuehlenhoff OK 2nd rebo... 
[16:12:07] (03PS7) 10Vgutierrez: site: Switch acmechief[12]001 to acme_chief role [puppet] - 10https://gerrit.wikimedia.org/r/489720 (https://phabricator.wikimedia.org/T207389) [16:12:18] (03CR) 10jerkins-bot: [V: 04-1] acme_chief: Create acme_chief module as a duplicate of certcentral [puppet] - 10https://gerrit.wikimedia.org/r/489719 (https://phabricator.wikimedia.org/T207389) (owner: 10Vgutierrez) [16:14:49] (03PS8) 10Vgutierrez: acme_chief: Create acme_chief module as a duplicate of certcentral [puppet] - 10https://gerrit.wikimedia.org/r/489719 (https://phabricator.wikimedia.org/T207389) [16:15:20] (03PS1) 10Alexandros Kosiaris: kubernetes default egress policy: Allow kafka [puppet] - 10https://gerrit.wikimedia.org/r/490083 [16:15:54] (03PS8) 10Vgutierrez: site: Switch acmechief[12]001 to acme_chief role [puppet] - 10https://gerrit.wikimedia.org/r/489720 (https://phabricator.wikimedia.org/T207389) [16:17:22] (03CR) 10jerkins-bot: [V: 04-1] acme_chief: Create acme_chief module as a duplicate of certcentral [puppet] - 10https://gerrit.wikimedia.org/r/489719 (https://phabricator.wikimedia.org/T207389) (owner: 10Vgutierrez) [16:17:34] 10Operations, 10serviceops, 10Core Platform Team (Session Management Service (CDP2)), 10Patch-For-Review, and 2 others: Create puppet role for session storage service - https://phabricator.wikimedia.org/T215883 (10Eevans) [16:18:08] !log refresh kubernetes default egress policy T211247 [16:18:09] (03PS6) 10Zoranzoki21: Enable VisualEditor at fiwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446583 (https://phabricator.wikimedia.org/T192135) [16:18:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:12] T211247: Modern Event Platform: Stream Intake Service: Implementation: Deployment Pipeline - https://phabricator.wikimedia.org/T211247 [16:18:28] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10cloud-services-team, 10User-Smalyshev: Provide a way to have test servers on real 
hardware, isolated from production for Wikidata Query Service - https://phabricator.wikimedia.org/T206636 (10bd808) >>! In T206636#4946372, @Smalyshev wrote: > Not sure... [16:19:33] (03CR) 10Zoranzoki21: "> A reason why you remove rules that didn't expire yet?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489819 (https://phabricator.wikimedia.org/T215839) (owner: 10Zoranzoki21) [16:20:12] (03CR) 10Zoranzoki21: "> > A reason why you remove rules that didn't expire yet?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489819 (https://phabricator.wikimedia.org/T215839) (owner: 10Zoranzoki21) [16:20:28] (03PS9) 10Vgutierrez: acme_chief: Create acme_chief module as a duplicate of certcentral [puppet] - 10https://gerrit.wikimedia.org/r/489719 (https://phabricator.wikimedia.org/T207389) [16:20:48] (03PS9) 10Vgutierrez: site: Switch acmechief[12]001 to acme_chief role [puppet] - 10https://gerrit.wikimedia.org/r/489720 (https://phabricator.wikimedia.org/T207389) [16:21:24] (03CR) 10jerkins-bot: [V: 04-1] acme_chief: Create acme_chief module as a duplicate of certcentral [puppet] - 10https://gerrit.wikimedia.org/r/489719 (https://phabricator.wikimedia.org/T207389) (owner: 10Vgutierrez) [16:21:59] (03PS2) 10Zoranzoki21: Add new throttle rule for T215839 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489819 (https://phabricator.wikimedia.org/T215839) [16:22:52] finally.. first successful run of pcc.. sorry about the noise :( [16:23:11] (03CR) 10Zoranzoki21: "poke jenkins" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489819 (https://phabricator.wikimedia.org/T215839) (owner: 10Zoranzoki21) [16:24:30] (03PS3) 10Zoranzoki21: Add new throttle rule for T215839 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489819 (https://phabricator.wikimedia.org/T215839) [16:25:07] (03CR) 10Zoranzoki21: "Why Jenkins no works?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/489819 (https://phabricator.wikimedia.org/T215839) (owner: 10Zoranzoki21) [16:26:10] (03PS10) 10Vgutierrez: acme_chief: Create acme_chief module as a duplicate of certcentral [puppet] - 10https://gerrit.wikimedia.org/r/489719 (https://phabricator.wikimedia.org/T207389) [16:26:26] (03PS10) 10Vgutierrez: site: Switch acmechief[12]001 to acme_chief role [puppet] - 10https://gerrit.wikimedia.org/r/489720 (https://phabricator.wikimedia.org/T207389) [16:27:01] (03CR) 10jerkins-bot: [V: 04-1] acme_chief: Create acme_chief module as a duplicate of certcentral [puppet] - 10https://gerrit.wikimedia.org/r/489719 (https://phabricator.wikimedia.org/T207389) (owner: 10Vgutierrez) [16:28:27] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [16:29:06] that's expected ^ [16:29:13] the calico policy controller restart [16:31:21] ACKNOWLEDGEMENT - ElasticSearch shard size check - codfw-search- on search.svc.codfw.wmnet is CRITICAL: CRITICAL - enwiki_content_1547089641(61gb) Gehel Looking at other shards of the same index, this one looks like an outlier. It should shrink with compaction, lets wait a bit before doing anything. [16:31:55] (03PS2) 10Cwhite: hiera: upgrade prometheus-node-exporter to 0.17 in eqsin [puppet] - 10https://gerrit.wikimedia.org/r/489754 (https://phabricator.wikimedia.org/T213708) [16:32:02] 10Operations, 10DBA, 10Packaging, 10Patch-For-Review: db2085/db1106 don't boot with 4.9.0-8-amd64 - https://phabricator.wikimedia.org/T214840 (10Marostegui) After restarting with the previous kernel 4.9.0-7-amd64, the first time it didn't boot up, the second time it did. 
[16:32:08] (03PS1) 10Ladsgroup: statistics: Add configs for new analytics db hosts [puppet] - 10https://gerrit.wikimedia.org/r/490085 (https://phabricator.wikimedia.org/T213894) [16:32:38] (03CR) 10Vgutierrez: "pcc seems happy and there is no mention of certcentral in the change catalog for both nodes: https://puppet-compiler.wmflabs.org/compiler1" [puppet] - 10https://gerrit.wikimedia.org/r/489719 (https://phabricator.wikimedia.org/T207389) (owner: 10Vgutierrez) [16:32:49] (03CR) 10Cwhite: [C: 03+2] hiera: upgrade prometheus-node-exporter to 0.17 in eqsin [puppet] - 10https://gerrit.wikimedia.org/r/489754 (https://phabricator.wikimedia.org/T213708) (owner: 10Cwhite) [16:34:39] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [16:35:34] (03CR) 10Fsero: [C: 04-1] "Looking up into 2.2.0 doc this should be possible." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/490083 (owner: 10Alexandros Kosiaris) [16:41:54] (03PS11) 10Vgutierrez: acme_chief: Create acme_chief module as a duplicate of certcentral [puppet] - 10https://gerrit.wikimedia.org/r/489719 (https://phabricator.wikimedia.org/T207389) [16:42:07] 10Operations, 10Analytics, 10Research-management, 10User-Elukey: GPU upgrade for stats machine - https://phabricator.wikimedia.org/T148843 (10Miriam) [16:42:33] 10Operations, 10Reading-Infrastructure-Team-Backlog, 10Traffic, 10Maps (Tilerator): Tilerator should purge Varnish cache - https://phabricator.wikimedia.org/T109776 (10Mholloway) [16:43:23] 10Operations, 10Wikidata, 10Wikidata-Termbox-Hike, 10serviceops, and 4 others: New Service Request: Wikidata Termbox SSR - https://phabricator.wikimedia.org/T212189 (10WMDE-leszek) @Smalyshev I believe the approach we are suggesting really makes a difference when thinking beyond just rendering a template f... 
[16:44:45] (03PS1) 10Muehlenhoff: service::node: Stop supporting trusty/upstart [puppet] - 10https://gerrit.wikimedia.org/r/490090 [16:45:21] (03CR) 10jerkins-bot: [V: 04-1] service::node: Stop supporting trusty/upstart [puppet] - 10https://gerrit.wikimedia.org/r/490090 (owner: 10Muehlenhoff) [16:47:37] (03PS2) 10Muehlenhoff: service::node: Stop supporting trusty/upstart [puppet] - 10https://gerrit.wikimedia.org/r/490090 [16:48:17] (03CR) 10jerkins-bot: [V: 04-1] service::node: Stop supporting trusty/upstart [puppet] - 10https://gerrit.wikimedia.org/r/490090 (owner: 10Muehlenhoff) [16:53:41] 10Operations, 10Acme-chief, 10Traffic: Upgrade acme-chief to run in debian buster - https://phabricator.wikimedia.org/T215925 (10Vgutierrez) [16:53:53] 10Operations, 10Acme-chief, 10Traffic: Upgrade acme-chief to run in debian buster - https://phabricator.wikimedia.org/T215925 (10Vgutierrez) p:05Triage→03Normal [16:53:56] (03PS2) 10Alexandros Kosiaris: kubernetes default egress policy: Allow kafka [puppet] - 10https://gerrit.wikimedia.org/r/490083 [16:54:16] (03Abandoned) 10Mathew.onipe: wdqs: prefix exporter with wdqs_updater_ [puppet] - 10https://gerrit.wikimedia.org/r/479395 (https://phabricator.wikimedia.org/T208215) (owner: 10Mathew.onipe) [16:54:38] (03PS1) 10Vgutierrez: acme-chief: Bump to buster [software/certcentral] - 10https://gerrit.wikimedia.org/r/490093 (https://phabricator.wikimedia.org/T215925) [16:55:31] (03Abandoned) 10Mathew.onipe: icinga: enable check for logstash [puppet] - 10https://gerrit.wikimedia.org/r/484685 (https://phabricator.wikimedia.org/T212850) (owner: 10Mathew.onipe) [16:55:55] (03PS1) 10RobH: setting logstash100[012] production dns entries [dns] - 10https://gerrit.wikimedia.org/r/490094 (https://phabricator.wikimedia.org/T214608) [16:56:23] I really like the new gerrit interface, comparing diffs is much better. 
[16:57:09] (03PS1) 10Cwhite: prometheus: use non-namespaced hiera key to enable site lookup [puppet] - 10https://gerrit.wikimedia.org/r/490095 (https://phabricator.wikimedia.org/T213708) [16:57:15] (03CR) 10RobH: [C: 03+2] setting logstash100[012] production dns entries [dns] - 10https://gerrit.wikimedia.org/r/490094 (https://phabricator.wikimedia.org/T214608) (owner: 10RobH) [16:57:41] (03CR) 10Vgutierrez: [C: 03+1] acme-chief: Bump to buster [software/certcentral] - 10https://gerrit.wikimedia.org/r/490093 (https://phabricator.wikimedia.org/T215925) (owner: 10Vgutierrez) [16:57:44] (03CR) 10jerkins-bot: [V: 04-1] prometheus: use non-namespaced hiera key to enable site lookup [puppet] - 10https://gerrit.wikimedia.org/r/490095 (https://phabricator.wikimedia.org/T213708) (owner: 10Cwhite) [16:57:48] (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus: use non-namespaced hiera key to enable site lookup [puppet] - 10https://gerrit.wikimedia.org/r/490095 (https://phabricator.wikimedia.org/T213708) (owner: 10Cwhite) [16:58:38] (03CR) 10Fsero: [C: 03+1] "nice!" [puppet] - 10https://gerrit.wikimedia.org/r/490083 (owner: 10Alexandros Kosiaris) [16:59:53] 10Operations, 10ops-eqiad: rack/setup/install logstash101[012].eqiad.wmnet - https://phabricator.wikimedia.org/T214608 (10RobH) [17:00:04] godog and _joe_: My dear minions, it's time we take the moon! Just kidding. Time for Puppet SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190212T1700). [17:00:04] No GERRIT patches in the queue for this window AFAICS. 
[17:01:43] \o/ [17:01:56] (03CR) 10Cwhite: [V: 03+2 C: 03+2] "Appears to do the right thing: https://puppet-compiler.wmflabs.org/compiler1001/14627/" [puppet] - 10https://gerrit.wikimedia.org/r/490095 (https://phabricator.wikimedia.org/T213708) (owner: 10Cwhite) [17:02:28] 10Operations, 10ExternalGuidance, 10Traffic, 10Patch-For-Review: Deliver mobile-based version for automatic translations - https://phabricator.wikimedia.org/T212197 (10dr0ptp4kt) @santhosh ^ would you please review and verify it has the intended effect? I need to reset my Vagrant stuff, but figured this wa... [17:02:40] 10Operations, 10DBA, 10Packaging, 10Patch-For-Review: db2085/db1106 don't boot with 4.9.0-8-amd64 - https://phabricator.wikimedia.org/T214840 (10Marostegui) @MoritzMuehlenhoff has removed -8 kernel from db2085 and I have rebooted it 8 times with -7 now 1st reboot: OK 2nd reboot: OK 3rd reboot: OK 4th rebo... [17:04:54] (03PS3) 10Muehlenhoff: service::node: Stop supporting trusty/upstart [puppet] - 10https://gerrit.wikimedia.org/r/490090 [17:04:56] !log Start MySQL again on db2085 for s1 and s8 - T214840 [17:04:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:04:59] T214840: db2085/db1106 don't boot with 4.9.0-8-amd64 - https://phabricator.wikimedia.org/T214840 [17:05:43] ottomata elukey FYI I see four UNKNOWN for mirrormaker on icinga2001 -- known? I suspect related to the icinga failover [17:08:47] godog: known UNKNOWN, nice [17:08:52] :D [17:09:14] it could be definitely it yes, not sure about why now [17:09:19] any idea? [17:10:30] I haven't checked but my hunch is that the check is now checking prometheus codfw and not prometheus codfw [17:10:36] prometheus eqiad that is [17:13:15] PROBLEM - puppet last run on bast5001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 53 seconds ago with 1 failures. 
Failed resources (up to 3 shown): Package[prometheus-node-exporter] [17:14:26] (03CR) 10Mobrovac: [C: 03+1] service::node: Stop supporting trusty/upstart [puppet] - 10https://gerrit.wikimedia.org/r/490090 (owner: 10Muehlenhoff) [17:14:36] eqsin puppet is me [17:14:45] PROBLEM - puppet last run on lvs5002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[prometheus-node-exporter] [17:15:43] PROBLEM - puppet last run on webperf1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:16:06] (03PS1) 10RobH: logstash101[012] puppet repo updates [puppet] - 10https://gerrit.wikimedia.org/r/490101 (https://phabricator.wikimedia.org/T214608) [17:16:10] godog: prometheus pkg failures? ^ [17:16:28] ah no, eqsin [17:16:40] (03CR) 10RobH: [C: 03+2] logstash101[012] puppet repo updates [puppet] - 10https://gerrit.wikimedia.org/r/490101 (https://phabricator.wikimedia.org/T214608) (owner: 10RobH) [17:16:48] mobrovac: yeah shdubsh is on it [17:17:06] (03PS2) 10RobH: logstash101[012] puppet repo updates [puppet] - 10https://gerrit.wikimedia.org/r/490101 (https://phabricator.wikimedia.org/T214608) [17:17:11] yup yup, sorry for the ping [17:17:24] (03PS1) 10Papaul: DNS: Remove mgmt DNS for baham [dns] - 10https://gerrit.wikimedia.org/r/490102 (https://phabricator.wikimedia.org/T199247) [17:17:56] np! thanks mobrovac for keeping an eye on it [17:18:29] 10Operations, 10monitoring, 10Patch-For-Review: Serve >= 50% of production Prometheus systems with Prometheus v2 - https://phabricator.wikimedia.org/T187987 (10fgiunchedi) A "big rsync + snapshot prometheus + final rsync" yields about ~2h30m for the final rsync to run, with the bottleneck being a gazillion f... 
[17:18:55] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review: Decommission baham - https://phabricator.wikimedia.org/T199247 (10Papaul) [17:19:08] 10Operations, 10ops-eqiad: rack/setup/install logstash101[012].eqiad.wmnet - https://phabricator.wikimedia.org/T214608 (10RobH) [17:20:55] RECOVERY - puppet last run on webperf1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:22:35] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:22:59] PROBLEM - puppet last run on dns5002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[prometheus-node-exporter] [17:23:37] RECOVERY - puppet last run on bast5001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [17:23:46] 10Operations, 10RESTBase, 10RESTBase-Cassandra, 10Core Platform Team Backlog (Watching / External), and 2 others: Memory error on restbase1016 - https://phabricator.wikimedia.org/T212418 (10Joe) this is a result of a defect in python3-etcd packaging (so, blame me!) ` oblivian@restbase1016:~$ sudo -i confc... [17:24:40] 10Operations, 10Operations-Software-Development, 10Patch-For-Review: python3-etcd needs python3-dnspython - https://phabricator.wikimedia.org/T209136 (10Joe) 05Resolved→03Open [17:24:42] (03PS1) 10WMDE-leszek: Added wmgWikibaseEntitySources setting for defining Wikibase "entity sources" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490104 (https://phabricator.wikimedia.org/T214557) [17:25:02] 10Operations, 10Operations-Software-Development, 10Patch-For-Review: python3-etcd needs python3-dnspython - https://phabricator.wikimedia.org/T209136 (10Joe) Please note this is fixed on jessie but not on stretch. I'm going to look into it now. 
[17:25:07] 10Operations, 10Gerrit, 10Release-Engineering-Team (Backlog): Reimage cobalt as stretch - https://phabricator.wikimedia.org/T176774 (10Paladox) We should rename to gerrit1001 at the same time (also this task is blocked on T211139 since gerrit2001 cannot be used until we stop using the db) [17:26:02] 10Operations, 10ops-eqiad: rack/setup/install logstash101[012].eqiad.wmnet - https://phabricator.wikimedia.org/T214608 (10RobH) Firmware is being updated on the bios and idrac before OS installation on all three hosts: installed bios: 1.7.0 installed ilom: 3.21.21.21 newest bios: 1.7.0 (no change, no need to... [17:26:07] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:27:50] 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review, 10User-herron: Onboard at least 10 new non-sensitive log producers to the logging pipeline - https://phabricator.wikimedia.org/T205852 (10Paladox) [17:30:19] RECOVERY - puppet last run on lvs5002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [17:33:23] RECOVERY - puppet last run on dns5002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:44:39] (03PS1) 10RobH: fixing netboot entry for new logstash systems [puppet] - 10https://gerrit.wikimedia.org/r/490107 (https://phabricator.wikimedia.org/T214608) [17:44:58] (03CR) 10RobH: [C: 03+2] fixing netboot entry for new logstash systems [puppet] - 10https://gerrit.wikimedia.org/r/490107 (https://phabricator.wikimedia.org/T214608) (owner: 10RobH) [17:45:37] (03PS1) 10WMDE-leszek: DNM Define Wikibase "entity sources" on beta commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490108 (https://phabricator.wikimedia.org/T214557) [17:45:54] (03CR) 10WMDE-leszek: [C: 04-1] DNM Define Wikibase "entity sources" on beta commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490108 (https://phabricator.wikimedia.org/T214557) (owner: 
10WMDE-leszek) [17:46:05] PROBLEM - Disk space on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused [17:46:33] PROBLEM - DPKG on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused [17:46:37] PROBLEM - MD RAID on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused [17:46:45] PROBLEM - Check systemd state on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused [17:46:47] (03CR) 10jerkins-bot: [V: 04-1] DNM Define Wikibase "entity sources" on beta commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490108 (https://phabricator.wikimedia.org/T214557) (owner: 10WMDE-leszek) [17:47:01] PROBLEM - configured eth on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused [17:47:03] PROBLEM - dhclient process on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused [17:47:19] nrpe crashed again [17:47:25] PROBLEM - puppet last run on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused [17:48:04] BAD nrpe BAD [17:48:20] ok what is notebook1003 [17:48:28] I have asked before but I am unable to remember [17:48:40] analytics [17:48:55] this poor thing must suffering [17:48:59] be* [17:49:07] it was crashing all the time a while ago [17:49:17] (03PS1) 10Giuseppe Lavagetto: Fix (again) dnspython dependency [debs/python-etcd] - 10https://gerrit.wikimedia.org/r/490109 [17:50:14] (03CR) 10jerkins-bot: [V: 04-1] Fix (again) dnspython dependency [debs/python-etcd] - 10https://gerrit.wikimedia.org/r/490109 (owner: 10Giuseppe Lavagetto) [17:51:11] jijiki: there is a task about it, sorry for the noise.. 
Need to apply some systemd cgroup limitations per user, it should fix the problem [17:52:01] the main issue is that sometimes people start multi-process jobs that eat all the memory, and when the OOM decides to party it kills nrpe too [17:52:51] hehe or give nrpe a better score and kill user processes :p [17:53:07] RECOVERY - dhclient process on notebook1003 is OK: PROCS OK: 0 processes with command name dhclient [17:53:21] that is another option but it doesn't prevent the host from getting saturated :) [17:53:23] RECOVERY - Disk space on notebook1003 is OK: DISK OK [17:53:45] weird thing is that this time I don't see OOM traces in dmesg [17:53:46] elukey: not at all, but it is more fair [17:53:49] RECOVERY - DPKG on notebook1003 is OK: All packages OK [17:53:55] RECOVERY - MD RAID on notebook1003 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [17:53:59] RECOVERY - Check systemd state on notebook1003 is OK: OK - running: The system is fully operational [17:54:00] elukey: ok I am curious now :p [17:54:12] !log notebook1003 - restarted nagios-nrpe-server T212824 [17:54:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:15] T212824: notebook/stat server(s) running out of memory - https://phabricator.wikimedia.org/T212824 [17:54:15] RECOVERY - configured eth on notebook1003 is OK: OK - interfaces up [17:54:40] but there was a bit spike https://grafana.wikimedia.org/d/000000274/prometheus-machine-stats?orgId=1&var-server=notebook1003&var-datasource=eqiad%20prometheus%2Fops&from=now-3h&to=now-1m [17:54:44] in cpu load [17:54:59] *big [17:55:44] and disk io [17:55:59] if it has swap that'd explain that? 
[17:56:15] yah it does have swap [17:56:17] (03CR) 10Jforrester: DNM Define Wikibase "entity sources" on beta commons (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490108 (https://phabricator.wikimedia.org/T214557) (owner: 10WMDE-leszek) [17:56:46] so you say this would explain the ~30% iowait [17:56:59] sorry ~26% [17:57:16] if it starts swapping pages to disk, idk that it'd totally explain it but it'd explain a spike in io and cpu [17:57:49] RECOVERY - puppet last run on notebook1003 is OK: OK: Puppet is currently enabled, last run 40 minutes ago with 0 failures [17:58:01] swap used is 976316 k [17:58:17] (03PS1) 10Paladox: Merge branch 'stable-2.15' into wmf/stable-2.15 [software/gerrit] (wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/490110 [17:58:29] it is not a lot [17:58:30] one thing to notice is that MemUsed is ~61 over 62 [17:58:34] that is not Cache [17:59:15] I have seen in syslog processes erroring with "cannot allocate memory" [17:59:24] yep. [17:59:27] see the ticket above [17:59:46] sure I am aware of it, I am working on it :D [18:00:04] cscott, arlolra, subbu, halfak, and Amir1: How many deployers does it take to do Services – Graphoid / Parsoid / Citoid / ORES deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190212T1800). 
[18:00:06] but this time it wasn't the OOM that caused the issue [18:00:10] this is my point [18:00:22] of course :) [18:00:36] (03CR) 10Ottomata: Adapt saltrotate and EventLoggingSanitization params in data_purge.pp (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/485063 (https://phabricator.wikimedia.org/T212014) (owner: 10Mforns) [18:01:08] the patch that I have in mind is like https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/488078/ [18:01:10] [5955048.761782] nrpe[14021]: segfault at 1 ip 000055be5969ce91 sp 00007ffd77d74d70 error 6 in nrpe[55be59696000+e000] [18:01:12] ah i see [18:01:23] ah lovely [18:01:32] it is getting better and better [18:02:23] but i betcha it's still like a memory error [18:02:31] the swap is totally full [18:03:17] well what elukey said, nrpe committed suicide [18:07:37] (03PS5) 10Elukey: Introduce profile::analytics::cluster::limits::statistics [puppet] - 10https://gerrit.wikimedia.org/r/488078 (https://phabricator.wikimedia.org/T212824) [18:11:25] godog: hm. [18:11:28] so to roll this thing out I'd need to make an announcement first [18:12:00] godog: interesting yes! [18:12:24] ottomata: does the above seem reasonable? Only memory limits (20G starts throttling, 30G kill) [18:12:28] for a group of processes [18:12:29] that check is check_prometheus [18:12:30] ottomata: ikr? I have to run now tho [18:12:46] and those metrics don't exist in codfw [18:13:21] (03PS1) 10Bstorm: labstore: convert our first systemd timer to the new format [puppet] - 10https://gerrit.wikimedia.org/r/490112 (https://phabricator.wikimedia.org/T210818) [18:13:30] elukey: i forget...we were going to go the route of everybody in the same group and just kill processes from the group if they use too much? 
[18:13:40] if so, i'd say let them use as much as possible without borking the machine [18:13:44] 10Operations, 10Proton, 10Security-Team, 10Reading-Infrastructure-Team-Backlog (Kanban): [2 hrs] Decide on handling system updates for Proton - https://phabricator.wikimedia.org/T213366 (10phuedx) [18:13:51] 64G on stat1007 [18:13:51] so [18:13:55] 60G then start killing? [18:13:58] (03CR) 10Alex Monk: [C: 03+2] acme-chief: Bump to buster [software/certcentral] - 10https://gerrit.wikimedia.org/r/490093 (https://phabricator.wikimedia.org/T215925) (owner: 10Vgutierrez) [18:14:18] 60G is probably too late [18:14:33] 30G is in my opinion a good sign of something getting out of hand [18:14:55] we could have different settings for different sets of hosts [18:15:03] like notebooks vs stat boxes [18:15:11] (03CR) 10Bstorm: "A bit later I'll try the compiler on this. There's a lot of ways it might not work to simply change over :-D" [puppet] - 10https://gerrit.wikimedia.org/r/490112 (https://phabricator.wikimedia.org/T210818) (owner: 10Bstorm) [18:15:15] elukey: why is 60G too late? [18:15:25] that's 4 G reserved for OS, ssh etc? [18:15:32] (03Merged) 10jenkins-bot: acme-chief: Bump to buster [software/certcentral] - 10https://gerrit.wikimedia.org/r/490093 (https://phabricator.wikimedia.org/T215925) (owner: 10Vgutierrez) [18:15:38] 10Operations, 10Proton, 10Security-Team, 10Reading-Infrastructure-Team-Backlog (Kanban): [2 hrs] Decide on handling system updates for Proton - https://phabricator.wikimedia.org/T213366 (10phuedx) I've removed this from our (Readers Web's) kanban board as Proton has been handed over to Readers Infrastructure. [18:16:35] ottomata: I am pretty sure that the host is already thrashing at that stage [18:17:17] elukey: ok! if you say so...! 
[18:17:18] (03CR) 10jenkins-bot: acme-chief: Bump to buster [software/certcentral] - 10https://gerrit.wikimedia.org/r/490093 (https://phabricator.wikimedia.org/T215925) (owner: 10Vgutierrez) [18:17:51] i don't really mind either way, just trying to be nice to our users...i think together they will want to use more than 30G on a machine that has 64G tho [18:17:51] ottomata: sorry didn't mean to say "this is it", it is a discussion, I should've written it differently :) [18:17:57] haha sure! [18:18:05] i took it as a discussion DUH [18:18:13] okok :) [18:18:31] yeah of course I am going to send an email to everybody before applying [18:18:39] k! do what you think is best, i will support it! [18:18:41] 10Operations, 10ops-eqiad: rack/setup/install logstash101[012].eqiad.wmnet - https://phabricator.wikimedia.org/T214608 (10RobH) [18:18:42] my opinion is: [18:18:50] if we have a compute machine with 64G, users should be able to make the most of it [18:18:59] ah yes for sure [18:19:07] saving > half the ram for OS doesn't seem right [18:19:19] well it is also for other users :D [18:20:32] maybe 30/40 limits could also be acceptable [18:21:11] other users? [18:21:16] confused [18:21:20] all users are in the same group, no? 
[18:21:29] nono in this case those are per-user settings [18:21:33] OH [18:21:38] the slice thing from systemd [18:21:47] i thought we were saying every user was going to be in the same group and share the same limits [18:21:52] this was just to keep the machine from borking, not to share fairly [18:22:32] yeah that is one option but I need to figure out how to make it work transparently for the user, the slice systemd thing is transparent and I wanted to use it as a safety net for the moment [18:22:47] the notebooks are getting borked every couple of days [18:23:04] 10Operations, 10ops-eqiad: rack/setup/install logstash101[012].eqiad.wmnet - https://phabricator.wikimedia.org/T214608 (10RobH) [18:23:15] this is a possible solution (already used successfully on the toolforge nodes, but it is a different use case) [18:25:09] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10cloud-services-team (Kanban): Degraded RAID on cloudvirt1019 - https://phabricator.wikimedia.org/T196507 (10Cmjohnson) @faidon battery replaced on cloudvirt1020 [18:26:43] all right will think about it more [18:27:00] will write a summary in the task so people can chime in [18:28:26] elukey: ok! [18:28:53] 10Operations: rack/setup/install logstash101[012].eqiad.wmnet - https://phabricator.wikimedia.org/T214608 (10RobH) a:05RobH→03herron @herron, Ok, these are calling into puppet with role spare. You can apply new roles and push into service. Feel free to resolve this task once you are aware of it! [18:29:19] thanks robh! [18:37:28] 10Operations, 10MediaWiki-Cache, 10serviceops, 10Core Platform Team (Security, stability, performance and scalability (TEC1)), and 3 others: Use a multi-dc aware store for ObjectCache's MainStash if needed. - https://phabricator.wikimedia.org/T212129 (10jijiki) @EvanProdromou After some digging in mc20* re... 
[18:38:05] PROBLEM - Host cloudvirt1020.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:40:30] (03CR) 10GTirloni: [C: 03+1] labstore: convert our first systemd timer to the new format (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/490112 (https://phabricator.wikimedia.org/T210818) (owner: 10Bstorm) [18:46:47] 10Operations, 10Analytics, 10Research-management, 10User-Elukey: GPU upgrade for stats machine - https://phabricator.wikimedia.org/T148843 (10elukey) Just as test, I downloaded several python3.6 packages from snapshot.debian.org and applied them to stat1005. This is the issue that I am facing: ` elukey@st... [18:48:30] !log make-wmf-branch 1.33.0-wmf.17 [18:48:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:58] 10Operations, 10Traffic, 10VisualEditor, 10Wikimedia-Apache-configuration: Visual Editor gets stuck opening article (net::ERR_SPDY_PROTOCOL_ERROR 200) - https://phabricator.wikimedia.org/T213214 (10matmarex) I did not experience this issue since my last report in January. Has anyone else ran into it again? [19:00:05] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190212T1900) [19:02:00] (03PS1) 10Dr0ptp4kt: WIP DO NOT MERGE enwiki source Google Translate [puppet] - 10https://gerrit.wikimedia.org/r/490120 (https://phabricator.wikimedia.org/T212197) [19:09:35] PROBLEM - Host cloudvirt1019.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:11:18] 10Operations, 10Operations-Software-Development, 10Core Platform Team Backlog (Watching / External), 10Patch-For-Review, 10Services (watching): python3-etcd needs python3-dnspython - https://phabricator.wikimedia.org/T209136 (10mobrovac) [19:11:20] 10Operations, 10ExternalGuidance, 10Traffic, 10Patch-For-Review: Deliver mobile-based version for automatic translations - https://phabricator.wikimedia.org/T212197 (10dr0ptp4kt) @BBlack ^ would you please review the enwiki VCL patch? 
We'll only want to merge it after ExternalGuidance has been tested with... [19:13:42] 10Operations, 10Wikimedia-Mailing-lists: Mailing list migration for Arbitration Committee - https://phabricator.wikimedia.org/T215940 (10elappen-WMF) [19:16:02] 10Operations, 10Wikimedia-Mailing-lists: Mailing list migration for Arbitration Committee - https://phabricator.wikimedia.org/T215940 (10elappen-WMF) [19:16:21] 10Operations, 10ops-eqiad, 10monitoring, 10Patch-For-Review: icinga1001 crashed - https://phabricator.wikimedia.org/T214760 (10CDanis) a:05CDanis→03RobH icinga2001 looks stable; go for it Rob [19:19:19] (03PS1) 10Thcipriani: Group0 to php-1.33.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490129 [19:20:13] RECOVERY - Host cloudvirt1019.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.64 ms [19:24:43] 10Operations, 10ops-eqiad, 10ops-eqsin, 10netops: Deploy cr2-eqsin - https://phabricator.wikimedia.org/T213121 (10RobH) >>! In T213121#4944125, @RobH wrote: > Chris shipped this, and I just put in an inbound shipemnt ticket for EQ Singapore SG#: 1-185487164544 > UPS tracking 1Z291X71DG27842078 EQ SG3... [19:31:23] (03CR) 10GTirloni: [C: 03+1] "Scratch what I said. 
The format accepts Systemd::Timer::Schedule and the interval fields is actually a variant (either Systemd::Timer::Int" [puppet] - 10https://gerrit.wikimedia.org/r/490112 (https://phabricator.wikimedia.org/T210818) (owner: 10Bstorm) [19:32:05] PROBLEM - Host cloudvirt1019.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:32:26] !log thcipriani@deploy1001 Pruned MediaWiki: 1.33.0-wmf.9 (duration: 10m 05s) [19:32:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:37:23] RECOVERY - Host cloudvirt1019.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.70 ms [19:37:33] !log thcipriani@deploy1001 Pruned MediaWiki: 1.33.0-wmf.12 (duration: 03m 10s) [19:37:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:38:10] (03CR) 10GTirloni: [C: 03+1] "The behavior is documented in modules/systemd/types/timer/datetime.pp (it says, "will match only normalized forms"). To get a normalized v" [puppet] - 10https://gerrit.wikimedia.org/r/490112 (https://phabricator.wikimedia.org/T210818) (owner: 10Bstorm) [19:40:48] !log thcipriani@deploy1001 Started scap: testwiki to php-1.33.0-wmf.17 and rebuild l10n [19:40:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:41:12] (03CR) 10GTirloni: [C: 03+1] labstore: convert our first systemd timer to the new format (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/490112 (https://phabricator.wikimedia.org/T210818) (owner: 10Bstorm) [19:42:54] (03PS4) 10GTirloni: wmcs::monitoring - Convert cronjob to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/489394 (https://phabricator.wikimedia.org/T210818) [19:45:02] thcipriani: BTW, I've got a stack of nine(!) WikibaseMediaInfo-related patches for legal compliance stuff to push into wmf.16 at some point once the train is done. Whee. :-( [19:45:27] James_F: neat :) will ping you when I'm done syncing [19:45:34] Thanks! 
[19:48:07] (03PS1) 10Ottomata: eventgate - Use topic_prefix instead of datacenter value [deployment-charts] - 10https://gerrit.wikimedia.org/r/490134 [19:55:16] (03CR) 10Ottomata: [C: 03+2] eventgate - Use topic_prefix instead of datacenter value [deployment-charts] - 10https://gerrit.wikimedia.org/r/490134 (owner: 10Ottomata) [19:55:17] (03CR) 10Ottomata: [V: 03+2 C: 03+2] eventgate - Use topic_prefix instead of datacenter value [deployment-charts] - 10https://gerrit.wikimedia.org/r/490134 (owner: 10Ottomata) [19:57:53] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 49.00, 23.31, 15.55 [19:57:55] 10Operations, 10Cloud-VPS, 10Toolforge, 10Traffic, 10Patch-For-Review: Wikimedia varnish rules no longer exempt all Cloud VPS/Toolforge IPs from rate limits (HTTP 429 response) - https://phabricator.wikimedia.org/T213475 (10Cyberpower678) Not hitting anymore Varnish error messages. Cyberbot's operation... [19:58:40] (03PS5) 10GTirloni: wmcs::monitoring - Convert cronjob to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/489394 (https://phabricator.wikimedia.org/T210818) [19:59:05] RECOVERY - High CPU load on API appserver on mw1227 is OK: OK - load average: 28.93, 23.39, 16.18 [19:59:42] !log thcipriani@deploy1001 Finished scap: testwiki to php-1.33.0-wmf.17 and rebuild l10n (duration: 18m 54s) [19:59:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:02] (03PS6) 10GTirloni: wmcs::monitoring - Convert cronjob to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/489394 (https://phabricator.wikimedia.org/T210818) [20:00:04] thcipriani: How many deployers does it take to do MediaWiki train - Americas version deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190212T2000). 
[20:00:13] * thcipriani doing [20:01:21] (03CR) 10GTirloni: [C: 03+2] wmcs::monitoring - Convert cronjob to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/489394 (https://phabricator.wikimedia.org/T210818) (owner: 10GTirloni) [20:02:51] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10cloud-services-team (Kanban): Degraded RAID on cloudvirt1019 - https://phabricator.wikimedia.org/T196507 (10Cmjohnson) @faidon and all, it looks like we were missing a connection from the raid card to the riser card. This was not anywhere on the instructio... [20:03:30] (03CR) 10Thcipriani: [C: 03+2] Group0 to php-1.33.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490129 (owner: 10Thcipriani) [20:04:22] (03PS1) 10GTirloni: wmcs::monitoring - Fix typo [puppet] - 10https://gerrit.wikimedia.org/r/490137 (https://phabricator.wikimedia.org/T210818) [20:04:34] (03Merged) 10jenkins-bot: Group0 to php-1.33.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490129 (owner: 10Thcipriani) [20:05:29] (03CR) 10GTirloni: [C: 03+2] wmcs::monitoring - Fix typo [puppet] - 10https://gerrit.wikimedia.org/r/490137 (https://phabricator.wikimedia.org/T210818) (owner: 10GTirloni) [20:06:45] PROBLEM - puppet last run on labmon1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:06:52] !log thcipriani@deploy1001 rebuilt and synchronized wikiversions files: Group0 to 1.33.0-wmf.17 [20:06:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:15] RECOVERY - Host cloudvirt1020.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.86 ms [20:11:59] RECOVERY - puppet last run on labmon1002 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [20:13:09] (03CR) 10jenkins-bot: Group0 to php-1.33.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490129 (owner: 10Thcipriani) [20:24:46] James_F: group0 all sync'd, deployment server is all yours! [20:24:59] Thanks! 
[20:35:29] 10Operations, 10ops-eqiad: Heating alerts for mw servers in eqiad - https://phabricator.wikimedia.org/T149287 (10jijiki) a:03RobH @RobH We had another alert for an mw server having a high load. After investigating with @CDanis, do you think we could add some thermal paste to the following servers? * mw1222... [20:44:10] I get *so much* e-mail read whilst waiting for CI to merge code for deployments. [20:47:59] James_F: feature! [20:48:25] thcipriani: :-D [20:52:10] !log jforrester@deploy1001 Synchronized php-1.33.0-wmf.16/resources/Resources.php: Hot-deploy If0d7b687e for other code (duration: 00m 54s) [20:52:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:54:39] Eurgh, several random terminations inside the phpunit runner in different jobs. [20:57:13] !log jforrester@deploy1001 Synchronized php-1.33.0-wmf.16/extensions/Wikibase/view/lib/resources.php: Hot-deploy I74f6389ae for other code, file 1 (duration: 00m 51s) [20:57:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:58:57] !log jforrester@deploy1001 Synchronized php-1.33.0-wmf.16/extensions/Wikibase/view/resources/resources.php: Hot-deploy I74f6389ae for other code, file 2 (duration: 00m 52s) [20:58:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:07:25] (03CR) 10Ori.livneh: Set expiry headers on thumbnails (031 comment) [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/489022 (https://phabricator.wikimedia.org/T211661) (owner: 10Gilles) [21:10:02] !log working on troubleshooting icinga1001 via T214760 [21:10:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:10:05] T214760: icinga1001 crashed - https://phabricator.wikimedia.org/T214760 [21:12:33] 10Operations, 10SRE-Access-Requests: Requesting access to Production Shell for julia.glen - https://phabricator.wikimedia.org/T215966 (10Julia.glen) [21:14:05] 10Operations, 10SRE-Access-Requests: Requesting 
access to Production Shell for julia.glen - https://phabricator.wikimedia.org/T215966 (10Julia.glen) [21:16:47] 10Operations, 10SRE-Access-Requests: Requesting access to Production Shell for julia.glen - https://phabricator.wikimedia.org/T215966 (10Julia.glen) a:05Julia.glen→03None [21:18:44] 10Operations, 10ops-eqiad, 10monitoring, 10Patch-For-Review: icinga1001 crashed - https://phabricator.wikimedia.org/T214760 (10RobH) Ok, rebooted the system and watched it POST, no errors. A quick grep of SEL shows no additional entries from T214760#4945789. I'm now flashing the bios from 1.3.7 (December... [21:26:25] (03PS1) 10Herron: WIP: logstash: split rsyslog udp_localhost kafka topics by channel [puppet] - 10https://gerrit.wikimedia.org/r/490193 [21:27:26] (03CR) 10jerkins-bot: [V: 04-1] WIP: logstash: split rsyslog udp_localhost kafka topics by channel [puppet] - 10https://gerrit.wikimedia.org/r/490193 (owner: 10Herron) [21:29:16] (03PS2) 10Herron: WIP: logstash: split rsyslog udp_localhost kafka topics by channel [puppet] - 10https://gerrit.wikimedia.org/r/490193 [21:30:06] (03CR) 10jerkins-bot: [V: 04-1] WIP: logstash: split rsyslog udp_localhost kafka topics by channel [puppet] - 10https://gerrit.wikimedia.org/r/490193 (owner: 10Herron) [21:37:36] (03PS3) 10Herron: WIP: logstash: split rsyslog udp_localhost kafka topics by channel [puppet] - 10https://gerrit.wikimedia.org/r/490193 [21:38:14] !log icinga1001 in hardware testing, dont mess with it T214760 [21:38:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:38:17] T214760: icinga1001 crashed - https://phabricator.wikimedia.org/T214760 [21:39:47] (03PS4) 10Herron: WIP: logstash: split rsyslog udp_localhost kafka topics by channel [puppet] - 10https://gerrit.wikimedia.org/r/490193 [21:47:32] (03PS1) 10GTirloni: openstack - Convert cron jobs to systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/490197 (https://phabricator.wikimedia.org/T210818) [21:48:09] (03CR) 10Herron: 
"marking as wip for now, but please have a look at the approach" [puppet] - 10https://gerrit.wikimedia.org/r/490193 (owner: 10Herron) [21:56:24] (03PS1) 10Herron: WIP: logstash: ingest udp_localhost messages by severity [puppet] - 10https://gerrit.wikimedia.org/r/490198 [21:57:52] 10Operations, 10ops-eqiad, 10monitoring, 10Patch-For-Review: icinga1001 crashed - https://phabricator.wikimedia.org/T214760 (10RobH) a:05RobH→03Volans Ok, I've run the hardware tests and nothing reports as broken. I'd suggest we return this to service, since we aren't seeing any further errors. A sin... [21:59:01] (03PS2) 10Herron: WIP: logstash: ingest udp_localhost messages by severity [puppet] - 10https://gerrit.wikimedia.org/r/490198 [21:59:22] 10Operations, 10SRE-Access-Requests: Requesting access to Production Shell for julia.glen - https://phabricator.wikimedia.org/T215966 (10Nuria) @Julia.glen can you not ssh to stat1007.eqiad.wmnet ? [22:01:13] 10Operations, 10ops-eqiad, 10monitoring, 10Patch-For-Review: icinga1001 crashed - https://phabricator.wikimedia.org/T214760 (10CDanis) OK, but IMO let's keep it passive. icinga can continue to run on `icinga2001` for now. [22:01:20] (03CR) 10Herron: "same here, leaving as wip but please have a look at the approach for comparison, etc." [puppet] - 10https://gerrit.wikimedia.org/r/490198 (owner: 10Herron) [22:03:03] (03CR) 10Herron: "https://puppet-compiler.wmflabs.org/compiler1002/14633/logstash1007.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/490193 (owner: 10Herron) [22:05:51] 10Operations, 10SRE-Access-Requests: Requesting access to Production Shell for julia.glen - https://phabricator.wikimedia.org/T215966 (10TJones) @Julia.glen, I think this [[ https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/488120/6/modules/admin/data/data.yaml | patch ]] should give you an account, but a... 
[22:06:17] (03CR) 10Herron: WIP: logstash: ingest udp_localhost messages by severity (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/490198 (owner: 10Herron) [22:10:31] 10Operations, 10ops-eqiad, 10monitoring, 10Patch-For-Review: icinga1001 crashed - https://phabricator.wikimedia.org/T214760 (10Volans) @RobH I actually disagree as the host has crashed already 2 times before it was even in production, so without any icinga-related load, than at least once a couple of month... [22:13:14] (03PS2) 10GTirloni: openstack - Convert cron jobs to systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/490197 (https://phabricator.wikimedia.org/T210818) [22:14:07] 10Operations, 10ops-eqiad, 10monitoring, 10Patch-For-Review: icinga1001 crashed - https://phabricator.wikimedia.org/T214760 (10Volans) >>! In T214760#4949330, @CDanis wrote: > OK, but IMO let's keep it passive. icinga can continue to run on `icinga2001` for now. @CDanis: being eqiad our main active datac... [22:21:20] 10Operations, 10ops-eqiad, 10monitoring, 10Patch-For-Review: icinga1001 crashed - https://phabricator.wikimedia.org/T214760 (10RobH) Ok, so I'm going to address some of the error messages and log messages here: >>! In T214760#4941030, @Volans wrote: > The host crashed again today and got rebooted, nothi... [22:25:48] 10Operations, 10ops-eqiad, 10monitoring, 10Patch-For-Review: icinga1001 crashed - https://phabricator.wikimedia.org/T214760 (10RobH) So with the comments from @volans on T214760#4941652, it seems this may be an issue with CPU#27, which is the second CPU. It may be enough to get another CPU sent by Dell, s... [22:28:00] OK, two hours in and everything is landed, all looks good, I'm going to do a full scap. Yay. 
[22:28:20] (03PS1) 10Herron: lists:drop if unknown host issues mail from cmd containing our domain [puppet] - 10https://gerrit.wikimedia.org/r/490200 (https://phabricator.wikimedia.org/T215251) [22:29:02] !log jforrester@deploy1001 Started scap: Full scap for new i18n and code for T214482 T215471 T215472 [22:29:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:29:08] T215472: Note the CC0 nature of structured data as users edit via the file page - https://phabricator.wikimedia.org/T215472 [22:29:08] T215471: Note the CC0 nature of structured data in the UploadWizard - https://phabricator.wikimedia.org/T215471 [22:29:08] T214482: CC0 license message for structured data contributions - https://phabricator.wikimedia.org/T214482 [22:29:37] (03PS1) 10Bstorm: toolforge-k8s: set up an haproxy load balancer for HA api servers [puppet] - 10https://gerrit.wikimedia.org/r/490201 (https://phabricator.wikimedia.org/T215530) [22:30:25] (03CR) 10jerkins-bot: [V: 04-1] toolforge-k8s: set up an haproxy load balancer for HA api servers [puppet] - 10https://gerrit.wikimedia.org/r/490201 (https://phabricator.wikimedia.org/T215530) (owner: 10Bstorm) [22:31:30] (03PS2) 10Bstorm: toolforge-k8s: set up an haproxy load balancer for HA api servers [puppet] - 10https://gerrit.wikimedia.org/r/490201 (https://phabricator.wikimedia.org/T215530) [22:32:02] (03CR) 10jerkins-bot: [V: 04-1] toolforge-k8s: set up an haproxy load balancer for HA api servers [puppet] - 10https://gerrit.wikimedia.org/r/490201 (https://phabricator.wikimedia.org/T215530) (owner: 10Bstorm) [22:43:28] 10Operations, 10ops-eqiad, 10monitoring, 10Patch-For-Review: icinga1001 crashed - https://phabricator.wikimedia.org/T214760 (10RobH) a:05Volans→03Cmjohnson Chris, Can you open a support request with Dell and insist on a replacement CPU due to the output of T214760#4941652 please? 
[22:44:31] 10Operations, 10Wikimedia-Mailing-lists: Mailing list migration for Arbitration Committee - https://phabricator.wikimedia.org/T215940 (10elappen-WMF) [22:47:05] !log jforrester@deploy1001 Finished scap: Full scap for new i18n and code for T214482 T215471 T215472 (duration: 18m 03s) [22:47:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:47:11] T215472: Note the CC0 nature of structured data as users edit via the file page - https://phabricator.wikimedia.org/T215472 [22:47:11] T215471: Note the CC0 nature of structured data in the UploadWizard - https://phabricator.wikimedia.org/T215471 [22:47:11] T214482: CC0 license message for structured data contributions - https://phabricator.wikimedia.org/T214482 [22:47:15] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10cloud-services-team (Kanban): Degraded RAID on cloudvirt1019 - https://phabricator.wikimedia.org/T196507 (10faidon) Before these are delivered for implementation, let's make sure that the two systems have identical settings, especially given we've tested var... [22:54:38] 10Operations, 10Office-IT, 10Wikimedia-Mailing-lists: Mailing list migration for Arbitration Committee - https://phabricator.wikimedia.org/T215940 (10Qgil) [22:54:59] 10Operations, 10Office-IT, 10Wikimedia-Mailing-lists: Mailing list migration for Arbitration Committee - https://phabricator.wikimedia.org/T215940 (10Qgil) a:05elappen-WMF→03None [22:58:09] PROBLEM - Host ms-be1033 is DOWN: PING CRITICAL - Packet loss = 100% [23:06:46] 10Operations, 10Release-Engineering-Team, 10Core Platform Team Backlog (Watching / External), 10Epic, 10Services (watching): FY2017/18 Program 6 - Outcome 2: Developers are able to develop and test their applications through a unified pipeline towards production ... 
- https://phabricator.wikimedia.org/T170480 [23:06:50] 10Operations, 10Release-Engineering-Team, 10Core Platform Team Backlog (Watching / External), 10Epic, 10Services (watching): FY2017/18 Program 6 - Outcome 2 - Objective 2: Set up a continuous integration and deployment pipeline - https://phabricator.wikimedia.org/T170481 (10greg) 05Open→03Invalid Thi... [23:07:11] 10Operations, 10Release-Engineering-Team, 10Category, 10Core Platform Team Backlog (Watching / External), and 2 others: FY2017/18 Program 6: Streamlined Service delivery - https://phabricator.wikimedia.org/T170453 (10greg) [23:07:13] 10Operations, 10Release-Engineering-Team, 10Core Platform Team Backlog (Watching / External), 10Epic, 10Services (watching): FY2017/18 Program 6 - Outcome 2: Developers are able to develop and test their applications through a unified pipeline towards production ... - https://phabricator.wikimedia.org/T170480 [23:08:19] Eurgh, the new i18n hasn't been picked up by RL. I'm going to do a second full scap. 
[23:08:38] !log jforrester@deploy1001 Started scap: Another full scap, hoping to find the new i18n in RL for T214482 T215471 T215472 [23:08:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:08:43] T215472: Note the CC0 nature of structured data as users edit via the file page - https://phabricator.wikimedia.org/T215472 [23:08:44] T215471: Note the CC0 nature of structured data in the UploadWizard - https://phabricator.wikimedia.org/T215471 [23:08:44] T214482: CC0 license message for structured data contributions - https://phabricator.wikimedia.org/T214482 [23:09:56] !log removed 4 files for legal compliance [23:09:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:10:38] 10Operations, 10Release-Engineering-Team, 10Core Platform Team Backlog (Watching / External), 10Epic, 10Services (watching): FY2017/18 Program 6 - Outcome 2: Developers are able to develop and test their applications through a unified pipeline towards production ... - https://phabricator.wikimedia.org/T170480 [23:10:41] 10Operations, 10MediaWiki-Containers, 10Release-Engineering-Team, 10Core Platform Team Kanban (Doing), and 4 others: FY2017/18 Program 6 - Outcome 2 - Objective 3: Integrated, container-based development environment - https://phabricator.wikimedia.org/T170456 (10greg) 05Open→03Invalid Being bold and cl... 
[23:14:39] !log jforrester@deploy1001 Finished scap: Another full scap, hoping to find the new i18n in RL for T214482 T215471 T215472 (duration: 06m 01s) [23:14:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:14:45] T215472: Note the CC0 nature of structured data as users edit via the file page - https://phabricator.wikimedia.org/T215472 [23:14:45] T215471: Note the CC0 nature of structured data in the UploadWizard - https://phabricator.wikimedia.org/T215471 [23:14:45] T214482: CC0 license message for structured data contributions - https://phabricator.wikimedia.org/T214482 [23:31:50] (03PS1) 10Cwhite: prometheus: do not change trusty hosts [puppet] - 10https://gerrit.wikimedia.org/r/490203 (https://phabricator.wikimedia.org/T213708) [23:37:46] (03PS1) 10Cwhite: prometheus: attempt to force apt update [puppet] - 10https://gerrit.wikimedia.org/r/490204 (https://phabricator.wikimedia.org/T213708) [23:38:34] (03CR) 10jerkins-bot: [V: 04-1] prometheus: attempt to force apt update [puppet] - 10https://gerrit.wikimedia.org/r/490204 (https://phabricator.wikimedia.org/T213708) (owner: 10Cwhite) [23:39:57] (03PS2) 10Cwhite: prometheus: attempt to force apt update [puppet] - 10https://gerrit.wikimedia.org/r/490204 (https://phabricator.wikimedia.org/T213708) [23:41:39] (03PS2) 10Cwhite: prometheus: do not change trusty hosts [puppet] - 10https://gerrit.wikimedia.org/r/490203 (https://phabricator.wikimedia.org/T213708) [23:55:30] 10Operations, 10Patch-For-Review: Provide cross-dc redundancy (active-active or active-passive) to all important misc services - https://phabricator.wikimedia.org/T156937 (10greg) [23:59:45] I can SWAT.