[00:00:04] RoanKattouw, ^d, marktraceur, James_F: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150114T0000).
[00:00:23] Hmm. That's done.
[00:00:32] * James_F edits.
[00:01:04] The evening swat is done?
[00:01:24] wut?
[00:01:32] what's done James_F?
[00:01:42] no, james is referring to a particular commit
[00:01:58] ah yes
[00:02:21] also, looks like swat is going to haunt me even beyond my grave
[00:02:35] LET'S GET STARTED
[00:02:40] You're doing it OuKB?
[00:02:44] aha
[00:03:08] legoktm, yt?
[00:03:51] aude or hoo?
[00:03:55] * aude here
[00:03:56] here
[00:05:22] (CR) Hoo man: "Has this been communicated? I'm a bit unhappy with just launching this into nothing and hopping for the best, but that's just my 2ct." [mediawiki-config] - https://gerrit.wikimedia.org/r/184757 (owner: Legoktm)
[00:05:35] (CR) Hoo man: "* hoping" [mediawiki-config] - https://gerrit.wikimedia.org/r/184757 (owner: Legoktm)
[00:05:44] erghwtf
[00:05:46] 00:05:16 3 apaches had sync errors
[00:06:03] OuKB: hi
[00:06:04] CalledProcessError: Command '['sudo', '-u', 'mwdeploy', '-n', '--', '/usr/bin/rsync', '--archive', '--delete-delay', '--delay-updates', '--compress', '--delete', '--exclude=**/.svn/lock', '--exclude=**/.git/objects', '--exclude=**/.git/**/objects', '--exclude=**/cache/l10n/*.cdb', '--no-perms', '--include=/php-1.25wmf14', '--include=/php-1.25wmf14/extensions', '--include=/php-1.25wmf14/extensions/Wikidata', '--include=/php-1.25wmf14/exten
[00:06:05] sions/Wikidata/***', '--exclude=*', 'mw1010.eqiad.wmnet::common', '/srv/mediawiki']' returned non-zero exit status 12
[00:06:11] 1 is known with a r/o fs
[00:06:18] that's scap proxy
[00:06:25] Reedy: mw1010 is only failing to set times on files
[00:06:28] contents still sync
[00:06:33] but no idea how hhvm handles taht
[00:12:32] (PS2) Kaldari: Turning on WikiGrok for anons on test and test2 wikis
[mediawiki-config] - https://gerrit.wikimedia.org/r/184810
[00:12:53] (PS1) Dzahn: remove mw1010 from all dsh groups [puppet] - https://gerrit.wikimedia.org/r/184813
[00:12:55] (PS2) Kaldari: Turning on WikiGrok for anons on English Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/184812
[00:13:40] (CR) Hoo man: [C: -1] "sort all the things?" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/184813 (owner: Dzahn)
[00:16:21] (CR) Legoktm: "Keegan is taking care of that :)" [mediawiki-config] - https://gerrit.wikimedia.org/r/184757 (owner: Legoktm)
[00:16:26] (CR) Keegan: "Has what been communicated to whom?" [mediawiki-config] - https://gerrit.wikimedia.org/r/184757 (owner: Legoktm)
[00:17:13] (CR) Keegan: "Oh, and this going live in production is going to be communicated to the stewards and renamers once this goes live." [mediawiki-config] - https://gerrit.wikimedia.org/r/184757 (owner: Legoktm)
[00:19:26] (PS1) Dzahn: remove mw1010 from dsh groups and scap proxy list [puppet] - https://gerrit.wikimedia.org/r/184814
[00:20:34] (PS2) Dzahn: remove mw1010 from dsh groups and scap proxy list [puppet] - https://gerrit.wikimedia.org/r/184814
[00:21:41] (CR) Dzahn: [C: -2] "eh, wrong branch" [puppet] - https://gerrit.wikimedia.org/r/184813 (owner: Dzahn)
[00:21:46] (Abandoned) Dzahn: remove mw1010 from all dsh groups [puppet] - https://gerrit.wikimedia.org/r/184813 (owner: Dzahn)
[00:22:24] (CR) Dzahn: [C: 2] remove mw1010 from dsh groups and scap proxy list [puppet] - https://gerrit.wikimedia.org/r/184814 (owner: Dzahn)
[00:22:40] (CR) 20after4: [C: 1] phab update (and peripherals) for T78243 [puppet] - https://gerrit.wikimedia.org/r/184802 (owner: Rush)
[00:22:58] Are we adding another proxy to replace it?
[00:23:04] mutante: ^
[00:23:19] no, i dont know
[00:23:19] tomorrow might be fun otherwise...
[00:23:38] Where did _joe_ get with provisioning more of them?
[00:23:41] I'm not sure which rack that will slam
[00:23:44] where was the hardware fail reported?
[00:24:24] that trace that OuKB pasted looks like a failure fetching from mw1010 not pusing to it
[00:24:28] *pushing
[00:25:05] bd808: 17:33 hoo: mw1010: rsync: failed to set times on "/srv/mediawiki/.": Read-only file system (30)
[00:25:09] that's what I got yesterday
[00:25:32] there was more (several files)
[00:25:37] but I didn't save all that
[00:25:42] (CR) EBernhardson: "baring unforeseen circumstances, i was intending to put this in the pst evening edition swat for next monday(jan 19). I don't believe a s" [mediawiki-config] - https://gerrit.wikimedia.org/r/170129 (https://phabricator.wikimedia.org/T51193) (owner: Spage)
[00:25:50] Ok. that seems bad :)
[00:26:15] files changing without timestamp updates will make hhvm (or APC) nuts
[00:26:37] i cant confirm it's readonly
[00:26:46] Yeah, it all seems to work
[00:26:55] even manually locally touching the same files worked for me
[00:27:15] I guess rsync is actually crying about something else, but bad at error messages
[00:27:16] fsck and reboot?
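One way to double-check the "i cant confirm it's readonly" observation is to look at the mount options rather than touching files by hand. The following is a minimal sketch, assuming a Linux host and text in the `/proc/mounts` layout; the function name `readonly_mounts` and the sample data are hypothetical and are not taken from mw1010 or any other host in this log.

```python
def readonly_mounts(mounts_text):
    """Return mount points whose option list contains the bare 'ro' flag.

    Expects text in the Linux /proc/mounts layout:
    device mountpoint fstype options dump pass
    """
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point = fields[1]
        options = fields[3].split(",")
        # Compare whole options so 'errors=remount-ro' does not count as 'ro'.
        if "ro" in options:
            result.append(mount_point)
    return result


# Hypothetical sample data for illustration only.
SAMPLE = """\
/dev/sda1 / ext4 rw,relatime,errors=remount-ro 0 0
/dev/sdb1 /srv ext4 ro,relatime 0 0
"""

if __name__ == "__main__":
    print(readonly_mounts(SAMPLE))
```

On a real host the same function could be fed the contents of `/proc/mounts`; a filesystem that the kernel remounted read-only after an I/O error would then show up in the returned list.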
[00:27:42] https://gerrit.wikimedia.org/r/#/c/174664/5/hieradata/common/dsh/config.yaml,unified
[00:27:43] mmm
[00:27:51] I should really dig into rack tables and get a few more added to that
[00:28:03] !log maxsem Synchronized php-1.25wmf14/extensions/Wikidata/: https://gerrit.wikimedia.org/r/#/c/184807/ (duration: 00m 12s)
[00:28:03] Actually
[00:28:04] https://phabricator.wikimedia.org/T1342#784056
[00:28:05] Logged the message, Master
[00:28:25] (Abandoned) Dzahn: remove mw1010 from dsh groups and scap proxy list [puppet] - https://gerrit.wikimedia.org/r/184814 (owner: Dzahn)
[00:29:12] Optimally we'd have one proxy in each rack
[00:29:15] yeah
[00:29:27] OuKB: Looks fine, wikidata wise :)
[00:29:38] At my basic investigation there, we need to add at least 3
[00:30:01] seems a bit like a jfdi
[00:30:55] Do we care much if a scap proxy is an image scaler/job runner?
[00:32:37] (CR) EBernhardson: "Talked to James, turns out monday is a holiday :) Going to do this tuesday instead. Will also do morning instead of night swat so more p" [mediawiki-config] - https://gerrit.wikimedia.org/r/170129 (https://phabricator.wikimedia.org/T51193) (owner: Spage)
[00:36:32] (PS1) Reedy: Add extra scap proxies for A7, B7 and B8 [puppet] - https://gerrit.wikimedia.org/r/184817 (https://phabricator.wikimedia.org/T1342)
[00:37:04] MediaWiki-Core-Team, Release-Engineering, operations: Update servers in scap rsync proxy pool - https://phabricator.wikimedia.org/T1342#974944 (Reedy) https://gerrit.wikimedia.org/r/#/c/184817/
[00:37:07] bblack: i'm trapped in a meeting for another 10 minutes. i was going to push it now but i don't feel *that* lucky.
[00:38:12] (CR) Reedy: "One question would be whether we care if scap proxies are in the jobrunner/image scaler pools" [puppet] - https://gerrit.wikimedia.org/r/184817 (https://phabricator.wikimedia.org/T1342) (owner: Reedy)
[00:42:11] (PS1) Dzahn: temp.
disable mw1010 as a scap proxy [puppet] - https://gerrit.wikimedia.org/r/184820
[00:42:30] (CR) Dzahn: [C: 2] temp. disable mw1010 as a scap proxy [puppet] - https://gerrit.wikimedia.org/r/184820 (owner: Dzahn)
[00:45:26] ori: np, I'm milling about the house doing food prep for dinner and such
[00:46:18] !log maxsem Synchronized php-1.25wmf14/extensions/Wikidata/: https://gerrit.wikimedia.org/r/#/c/184807/ (duration: 00m 11s)
[00:46:20] Logged the message, Master
[00:46:27] (CR) Dzahn: "@Reedy, joe said "It seems that some (all?) of our scap rsync proxies are in the job runner pool. This makes scap slow."" [puppet] - https://gerrit.wikimedia.org/r/184817 (https://phabricator.wikimedia.org/T1342) (owner: Reedy)
[00:47:28] legoktm, is https://gerrit.wikimedia.org/r/#/c/184757/ safe to push?
[00:47:52] (CR) MaxSem: [C: 2] Turning on WikiGrok for anons on test and test2 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/184810 (owner: Kaldari)
[00:47:59] (Merged) jenkins-bot: Turning on WikiGrok for anons on test and test2 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/184810 (owner: Kaldari)
[00:48:03] OuKB: yes
[00:48:27] (CR) MaxSem: [C: 2] Enable $wgCentralAuthEnableGlobalRenameRequest in production [mediawiki-config] - https://gerrit.wikimedia.org/r/184757 (owner: Legoktm)
[00:48:33] (Merged) jenkins-bot: Enable $wgCentralAuthEnableGlobalRenameRequest in production [mediawiki-config] - https://gerrit.wikimedia.org/r/184757 (owner: Legoktm)
[00:49:26] another option would be to hack scap and
[00:49:29] "Try adding the -O (for --omit-dir-times) parameter to your command."
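The "-O (for --omit-dir-times)" suggestion amounts to adding one flag to the rsync argument list from the traceback pasted earlier. A minimal sketch of that change, assuming the command is held as a Python list; the helper `add_omit_dir_times` is hypothetical and is not scap's actual code, and the argument list below is abbreviated from the traceback.

```python
def add_omit_dir_times(cmd):
    """Return a copy of an rsync argument list with --omit-dir-times added.

    The flag tells rsync not to set modification times on directories,
    which is the operation that failed in the pasted error message.
    """
    new_cmd = list(cmd)
    if "--omit-dir-times" in new_cmd or "-O" in new_cmd:
        return new_cmd  # already present, nothing to do
    # Insert right after the rsync executable so the flag precedes the
    # source and destination arguments.
    idx = next(i for i, arg in enumerate(new_cmd) if arg.endswith("rsync"))
    new_cmd.insert(idx + 1, "--omit-dir-times")
    return new_cmd


# Abbreviated from the traceback in this log; include/exclude rules omitted.
CMD = [
    "sudo", "-u", "mwdeploy", "-n", "--",
    "/usr/bin/rsync", "--archive", "--delete-delay", "--delay-updates",
    "--compress", "--delete",
    "mw1010.eqiad.wmnet::common", "/srv/mediawiki",
]

if __name__ == "__main__":
    print(" ".join(add_omit_dir_times(CMD)))
```

As the rest of the log shows, the underlying problem turned out to be a genuinely read-only filesystem on mw1062, so a flag like this would only have hidden the symptom.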
[00:50:08] !log maxsem Synchronized wmf-config/: https://gerrit.wikimedia.org/r/184810 https://gerrit.wikimedia.org/r/184757 (duration: 00m 06s)
[00:50:12] Logged the message, Master
[00:50:18] hoo, kaldari ^^^
[00:50:53] OuKB: thanks :D
[00:50:55] the issue is more like http://superuser.com/questions/200012/rsync-failed-to-set-times-on-dir-path
[00:52:56] mutante, it doesn't say Operation not permitted (1)
[00:53:06] it says ro fs
[00:53:56] but it's all just on directories and failed to set times
[00:54:41] and the fs is not really ro
[00:56:51] (PS1) MaxSem: Revert "Turning on WikiGrok for anons on test and test2 wikis" [mediawiki-config] - https://gerrit.wikimedia.org/r/184823
[00:57:01] kaldari, ^^
[00:59:01] bblack: ready
[00:59:03] ?
[00:59:12] !log maxsem Synchronized wmf-config: touch (duration: 00m 07s)
[00:59:15] Logged the message, Master
[01:01:14] !log maxsem Synchronized wmf-config: touch (duration: 00m 08s)
[01:01:16] Logged the message, Master
[01:03:24] bblack: going for it
[01:03:32] (PS21) Ori.livneh: varnish: Route requests with 'X-Wikimedia-Debug=1' to test_wikipedia backend [puppet] - https://gerrit.wikimedia.org/r/183171
[01:03:41] (CR) Ori.livneh: [C: 2 V: 2] varnish: Route requests with 'X-Wikimedia-Debug=1' to test_wikipedia backend [puppet] - https://gerrit.wikimedia.org/r/183171 (owner: Ori.livneh)
[01:03:43] yes, mw1062 is the actual culprit
[01:03:46] that one IS ro
[01:03:55] WTF RSYNC YOU SUCK
[01:07:12] and here is the ticket
[01:07:14] https://phabricator.wikimedia.org/T86542
[01:08:10] (PS1) Dzahn: remove mw1062 from dsh groups - read-only fs [puppet] - https://gerrit.wikimedia.org/r/184827 (https://phabricator.wikimedia.org/T86542)
[01:08:28] (Abandoned) MaxSem: Revert "Turning on WikiGrok for anons on test and test2 wikis" [mediawiki-config] - https://gerrit.wikimedia.org/r/184823 (owner: MaxSem)
[01:09:15] (CR) Dzahn: [C: 2] remove mw1062 from dsh groups - read-only fs [puppet] -
https://gerrit.wikimedia.org/r/184827 (https://phabricator.wikimedia.org/T86542) (owner: Dzahn)
[01:11:06] ops-eqiad: mw1062 needs a disk replacement - https://phabricator.wikimedia.org/T86542#975021 (Dzahn) https://gerrit.wikimedia.org/r/#/c/184820/ (temp. removed mw1010 from scap proxies) https://gerrit.wikimedia.org/r/184827 (removed mw1062 from dsh groups after realizing that one is the culprit) after deploy...
[01:11:37] ops-eqiad: mw1062 needs a disk replacement - https://phabricator.wikimedia.org/T86542#975023 (Dzahn) ^ please enable again after the fix
[01:12:18] ori: I'm disappointed in your lack of resulting icinga-wm spam and/or user complaints
[01:12:25] OuKB: ^ errors should disappear on next run
[01:12:39] :D
[01:13:46] Errmsg: Error executing row event: 'Table 'enwiki._externallinks_new' doesn't exist'
[01:13:50] ^ schema change ?
[01:14:07] Error executing row event: 'Table 'frwiki._externallinks_new' doesn't exist'
[01:14:31] yep, looks like tables were deleted, 13h ago
[01:15:39] springle: ^
[01:16:46] (CR) MaxSem: [C: 2] Turning on WikiGrok for anons on English Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/184812 (owner: Kaldari)
[01:16:53] (Merged) jenkins-bot: Turning on WikiGrok for anons on English Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/184812 (owner: Kaldari)
[01:17:51] !log maxsem Synchronized wmf-config: https://gerrit.wikimedia.org/r/184812 (duration: 00m 06s)
[01:17:59] Logged the message, Master
[01:18:59] kaldari, https://commons.wikimedia.org/wiki/File:IZBROKEIT.png
[01:19:50] anyone heard of "externallinks_new" as opposed to non -_new ?
[01:24:20] mutante: yep
[01:24:34] springle: you already know about the issue?
[01:24:36] (PS2) Dzahn: Kill udpprofile::collector, unused [puppet] - https://gerrit.wikimedia.org/r/184698 (owner: Faidon Liambotis)
[01:24:53] oh, it seems a bunch are already fixed even
[01:25:09] yep :)
[01:25:10] i just saw it pop up in Icinga
[01:25:13] cool:)
[01:25:59] (CR) Dzahn: [C: 2] "yes, confirmed unused. not a single server has that package installed, none of the monitoring still exists..." [puppet] - https://gerrit.wikimedia.org/r/184698 (owner: Faidon Liambotis)
[01:27:18] ori: where is that blogstats.py script please that you mentioned to HaeB
[01:34:31] mutante: sec, doing three different things
[01:37:00] mutante, HaeB: https://gerrit.wikimedia.org/r/#/admin/projects/analytics/blog
[01:43:40] ori: gotcha, thx
[01:50:32] bblack: by now puppet will have ran everywhere, right?
[01:52:48] ori: thanks!
[02:02:32] bblack: i think it's working.
[02:12:00] PROBLEM - mediawiki-installation DSH group on mw1062 is CRITICAL: Host mw1062 is not in mediawiki-installation dsh group
[02:16:13] !log l10nupdate Synchronized php-1.25wmf13/cache/l10n: (no message) (duration: 00m 01s)
[02:16:18] !log LocalisationUpdate completed (1.25wmf13) at 2015-01-14 02:16:18+00:00
[02:16:19] Logged the message, Master
[02:16:22] Logged the message, Master
[02:20:07] icinga-wm: that's right and intended ..
but nice that you noticed
[02:24:31] !log l10nupdate Synchronized php-1.25wmf14/cache/l10n: (no message) (duration: 00m 01s)
[02:24:36] !log LocalisationUpdate completed (1.25wmf14) at 2015-01-14 02:24:36+00:00
[02:24:40] Logged the message, Master
[02:24:43] Logged the message, Master
[02:25:59] (CR) Dzahn: [C: 1] Add missing m.{project}.org entries [dns] - https://gerrit.wikimedia.org/r/184690 (https://phabricator.wikimedia.org/T78421) (owner: Glaisher)
[02:28:10] operations: Put all zirconium vhosts behind misc varnish cluster - https://phabricator.wikimedia.org/T60048#975099 (Dzahn) a:JohnLewis>Dzahn
[02:44:10] operations: Decomission svn.wikimedia.org - https://phabricator.wikimedia.org/T86655#975142 (Dzahn) we have 17 days left until the SSL cert for svn.wm.org expires https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=antimony&service=HTTPS until then we need to either buy a new cert, or resolv...
[02:44:27] operations: Decomission svn.wikimedia.org - https://phabricator.wikimedia.org/T86655#975144 (Dzahn) p:Low>High
[03:46:10] PROBLEM - puppet last run on mw1026 is CRITICAL: CRITICAL: Puppet has 1 failures
[04:03:42] RECOVERY - puppet last run on mw1026 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[04:15:42] !log LocalisationUpdate ResourceLoader cache refresh completed at Wed Jan 14 04:15:41 UTC 2015 (duration 15m 40s)
[04:15:47] Logged the message, Master
[04:28:31] PROBLEM - graphite.wikimedia.org on tungsten is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 525 bytes in 0.005 second response time
[04:38:29] operations, Wikimedia-DNS, Wikimedia-General-or-Unknown: m.{project}.org portal/redirect consistency and i18n issues - https://phabricator.wikimedia.org/T78421#975213 (Glaisher) >>! In T78421#947412, @Krenair wrote: > m.wikiversity.org - redirect to beta.m.wikiversity.org, like oldwikisource? use the wwwport...
[04:39:10] RECOVERY - graphite.wikimedia.org on tungsten is OK: HTTP OK: HTTP/1.1 200 OK - 1607 bytes in 0.015 second response time
[05:07:51] RECOVERY - Disk space on search1021 is OK: DISK OK
[05:16:08] (PS1) 20after4: Observe the remote IP reported by X_FORWARDED_FOR header from proxy server [puppet] - https://gerrit.wikimedia.org/r/184837
[05:17:39] (PS2) 20after4: Observe the remote IP reported by X_FORWARDED_FOR header from proxy server [puppet] - https://gerrit.wikimedia.org/r/184837 (https://phabricator.wikimedia.org/T840)
[05:29:35] springle: Around, by any chance? A few weeks ago I declared an index on an eventlogging table on m2-master.eqiad.wmnet, and I expected the index to replicate to s1-analytics-slave.eqiad.wmnet, but I don't think it has. I can go ahead and create it on s1-analytics-slave.eqiad.wmnet, but I'm worried that I'd break replication by doing so. Any tips?
[05:33:20] ori: ah ok, two things. eventlogging replication to analytics-store and s1-analytics-slave is using (or rather, trialing) an external synchronization process using percona tools, so while new tables will auto-create, adding indexes will not.
[05:33:26] second, don't do that :)
[05:34:08] add indexes directly on s1-analytics-slave and analytics-store, unless it's needed for the consumer itself (which I didn't ask, so ignore #2 if so)
[05:34:57] it's not; it's actually only needed for data analysis scripts that run against the slaves, but i (moronically, it appears) thought i should declare it on the master as a way of keeping things consistent.
[05:35:20] would you like me to drop the index on the master?
[05:36:06] not moronic at all, just trying different approach to reduce write load on the master as well as further parallelize replication
[05:36:35] <_joe_> good idea, never did that :)
[05:36:40] <_joe_> (morning)
[05:36:48] hi _joe_
[05:36:48] ori: and fwiw, replication to codfw is still normal, so if the codfw master needs the index, leave it
[05:37:41] but i do think we should only add custom indexes on analytics slaves at first. indexes are kind of special, in that slaves can be customized without breaking replication of data
[05:38:51] ori: there are other examples too. the slaves have a cron job that adds indexes on (iirc) (wiki,timestamp), which wasn't done on the master at creation time. so there is precedent
[05:39:35] it makes sense. keeps writes on master fast.
[05:41:12] are you sure the consumer is not created wiki, timestamp indices on master?
[05:41:15] *creating
[05:41:45] let me check that's the combination i meant...
[05:42:17] in any case, if there is some index that isn't being created, i can make that happen in eventlogging itself
[05:44:33] actually i'd rather analytics start maintaining a list of their own custom indexes for eventlogging and production, that can be applied to a standard production slave.
[05:45:36] but ultimately the slave has to keep up with the write load, right? so even if inserts are fast on the master, if the slaves are habitually unable to keep up because of the number of indices that need to be locked and updated, that would be a problem, no?
[05:45:40] the same problem exists in labsdb; people need custom indexing for a tool but we struggle to keep it in sync with production schema
[05:45:57] ori: certainly would be
[05:46:23] we should really drive people toward hadoop
[05:46:35] although, just to make it more exciting, analytic-store is using tokudb tables, with can handle massive write load, while the master is still innodb.
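The idea of keeping "a list of their own custom indexes" that can be applied to a slave could look roughly like the sketch below. It assumes the list is kept as (table, index name, columns) tuples and that the set of indexes already present on the slave has been collected separately; the table name `ExampleSchema_123`, the index name, and the helper `missing_index_statements` are all hypothetical, and only the (wiki, timestamp) column pair comes from the discussion above.

```python
def missing_index_statements(wanted, existing):
    """Build CREATE INDEX statements for wanted indexes absent from a slave.

    wanted:   iterable of (table, index_name, columns) tuples
    existing: set of (table, index_name) pairs already present on the slave
    """
    statements = []
    for table, name, columns in wanted:
        if (table, name) in existing:
            continue  # index already there, leave it alone
        column_list = ", ".join(columns)
        statements.append(f"CREATE INDEX {name} ON {table} ({column_list});")
    return statements


# Hypothetical maintained list; names are illustrative only.
WANTED = [
    ("ExampleSchema_123", "ix_wiki_timestamp", ("wiki", "timestamp")),
]

if __name__ == "__main__":
    for statement in missing_index_statements(WANTED, existing=set()):
        print(statement)
```

The function only generates statements; running them against a slave, and deciding whether the slave can absorb the extra write cost that ori raises, is left outside the sketch.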
[05:46:52] there is more and more expertise in the analytics team in using hadoop for complex queries so maybe it's more feasible now than it was a year ago
[05:47:04] i'm fine with risking analytics-store on tokudb, new bugs etc, but not the master too just yet
[05:47:13] nod
[05:47:19] ori: i'll believe that when i see it ;)
[05:48:33] also experimenting with clustered secondary indexes on analytics tokudb. sacrifice disk space for data locality
[05:52:02] nod
[05:54:03] <_joe_> springle: so tokudb is kinda working after all?
[05:55:03] _joe_: as of tokudb 7.5 it's mostly behaving
[06:00:07] (CR) BryanDavis: Add extra scap proxies for A7, B7 and B8 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/184817 (https://phabricator.wikimedia.org/T1342) (owner: Reedy)
[06:02:08] ori: yes, it was (wiki, timestamp) index that needed to be added manually on analytics slaves. it's possible that's been fixed and the cron is now a no-op
[06:02:53] (and cron == mariadb event scheduler, technically)
[06:28:40] PROBLEM - puppet last run on mw1118 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:28:51] PROBLEM - puppet last run on elastic1022 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:20] PROBLEM - puppet last run on mw1061 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:31] PROBLEM - puppet last run on mw1166 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:30:00] PROBLEM - puppet last run on cp4003 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:20] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:47:50] RECOVERY - puppet last run on cp4003 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[07:05:51] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures
[07:06:01] RECOVERY - puppet last run on mw1061 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures
[07:06:11] RECOVERY - puppet last run on
mw1166 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:06:31] RECOVERY - puppet last run on mw1118 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures
[07:06:41] RECOVERY - puppet last run on elastic1022 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[07:23:46] _joe_: around?
[07:28:19] <_joe_> ori: yes
[07:28:30] _joe_: so, just a quick update on the current state of the world:
[07:28:59] the package worked great in labs, so i uploaded it to apt, and i upgraded the app servers in batches. so far, so good.
[07:29:14] i changed mw1230 and mw1231 to use the unix domain socket. still good, no errors.
[07:29:19] <_joe_> ori: ok
[07:29:40] i made all app servers listen on both 11212 and the unix domain socket, still good
[07:29:49] <_joe_> so we just need to expand that config
[07:29:56] <_joe_> yeah that makes sense in general
[07:29:57] so now all app servers listen on the unix domain socket, but only mw1230 and mw1231 are using it
[07:30:07] i'm just about to submit a patch to expand that to mw12*
[07:30:09] if that's cool with you
[07:30:12] just to keep it gradual
[07:30:32] <_joe_> it is, I can always revert it in case
[07:30:35] nod
[07:31:08] (PS1) Ori.livneh: Memcached: used UNIX domain socket on mw12* app servers [mediawiki-config] - https://gerrit.wikimedia.org/r/184843
[07:31:20] <_joe_> did you find the time to think about apache response time capturing? If not, I'll take a look
[07:31:32] i haven't, i'm sorry :/
[07:32:24] the other status update is that b.black approved the X-Wikimedia-Debug thing and we pushed it out and amazingly it appears to have worked without a hitch
[07:33:23] <_joe_> good, varnish large changes are always quite scary
[07:35:08] <_joe_> btw, tim backported his pcre patch (he's always awesome) and it's pretty big.
Not sure if we're going to land it in the upstream LTS
[07:36:41] i didn't even realize that that was the plan
[07:36:45] (to try and upstream it)
[07:36:59] <_joe_> oh it always is
[07:37:05] <_joe_> at least on my part :)
[07:37:12] yeah, it's a good idea
[07:37:30] <_joe_> my plan is typically: for things fb may like, try to upstream it there
[07:37:41] <_joe_> for things they don't, upstream them to debian
[07:40:45] Well, you are the better judge of version management issues, I don't have any wisdom to offer :)
[07:40:58] I'm sure you'll figure out a good plan. Happy to help if there's anything I can do.
[07:42:18] Out of curiosity, do you know what is the status of HHVM in Debian? The ITP thread seems to have died, but the last post was about uploading it to some server for testing, so I wasn't sure if it actually died or if it simply progressed past the ITP stage.
[07:43:12] (CR) Ori.livneh: [C: 2] Memcached: used UNIX domain socket on mw12* app servers [mediawiki-config] - https://gerrit.wikimedia.org/r/184843 (owner: Ori.livneh)
[07:43:16] (Merged) jenkins-bot: Memcached: used UNIX domain socket on mw12* app servers [mediawiki-config] - https://gerrit.wikimedia.org/r/184843 (owner: Ori.livneh)
[07:43:59] !log ori Synchronized wmf-config/mc.php: use UNIX domain socket on mw12* app servers (duration: 00m 06s)
[07:44:08] Logged the message, Master
[07:44:59] <_joe_> ori: I think paravoid and I are going to upload it when in SF
[07:45:08] <_joe_> I have some prep work to do this week
[07:45:09] oh, cool!
[07:45:11] that's awesome
[07:46:33] good morning
[07:46:47] the package is uploaded
[07:46:51] it's currently sitting in https://ftp-master.debian.org/new.html
[07:46:58] for 2 months now :(
[07:47:16] but I was suggesting ot upload a new version in SF
[07:47:25] and I'll ping ftp-masters
[07:54:31] <_joe_> paravoid: still in new?
[07:54:37] yeah
[07:54:54] a combination of the freeze slowing down things + hhvm being huge and hard to review
[07:54:57] <_joe_> I thought it was rejected for some mostly irrelevant reason like "php license"
[07:55:18] no
[07:56:14] <_joe_> oh, good then :)
[07:56:24] <_joe_> it's quite hard to review for sure
[08:12:15] <_joe_> whoa wtf happened yesterday evening?
[08:12:28] <_joe_> I see a series of 5xx spikes after 18 UTC
[08:13:41] there were several big code pushes, it looks like
[08:14:29] wmf-config/Wikibase.php: bump cache epoch, wmf-config/InitialiseSettings.php: monolog: enable for group0 + group1 wikis, Non Wikipedias to 1.25wmf14
[08:15:40] PROBLEM - puppet last run on mw1225 is CRITICAL: CRITICAL: Puppet has 2 failures
[08:16:44] mw1225 is: Error: Execution of '/usr/bin/apt-get -q -y -o DPkg::Options::=--force-confold install libssl1.0.0-dbg' returned 100: E: Could not get lock /var/lib/dpkg/lock - open (11: Resource temporarily unavailable)
[08:16:53] looks transient, i'll force another run
[08:16:55] <_joe_> ori: yes I was dist-upgrading it
[08:16:59] ah ok
[08:17:04] i'll go away then :)
[08:17:04] <_joe_> no don't worry, I'm rebooting it
[08:17:17] <_joe_> since it was out of rotation for no apparent reason
[08:17:44] <_joe_> YuviPanda: where did you land?
[08:17:47] <_joe_> :)
[08:17:59] _joe_: heh, Chennai :)
[08:18:11] spending a few days at my parents’ place before flying out to SF
[08:18:12] <_joe_> oh, that's home, right?
[08:18:24] well, its my parents’ home :)
[08:18:31] I hardly spend any time here these days
[08:18:34] <_joe_> your "hometown"
[08:18:54] <_joe_> I don't spend more than 10 days a year in my hometown since 15 years now
[08:18:56] hmm, in some form yeah. where I grew up.
[08:19:11] although I think I’ve spent far more time ‘on the internet’ than ‘in chennai'
[08:19:17] <_joe_> eheh
[08:19:27] <_joe_> I grew up before the internet
[08:19:48] I had a 5y period of a computer but no internet. that was a fun time.
[08:19:53] lots and lots of PDFs
[08:20:21] <_joe_> you don't imagine how hard is for my step-daughter to imagine that we lived without wikipedia and google
[08:20:45] :D It’s hard for *me* to imagine some times
[08:20:53] <_joe_> she always looks to me with pity, like she's talking to some caveman
[08:24:16] <_joe_> !log repooling mw1225, depooled for a long time
[08:24:20] Logged the message, Master
[08:24:20] _joe_: yup, yup. I’d still think it’s better than nothing, of course.
[08:25:12] * YuviPanda prepares the LDAP patch for SWAT
[08:25:24] even though that’s kind of ridiculous, since SWAT doesn’t actually affect wikitech
[08:29:01] <_joe_> YuviPanda: does OSM need to be local to virt1000?
[08:29:14] <_joe_> re: wikitech on hhvm discussion we had a couple of days ago
[08:29:20] _joe_: at this point? yeah.
[08:29:21] <_joe_> s/hhvm/trusty/
[08:29:28] it could be moved with some work
[08:29:28] <_joe_> :/
[08:29:35] <_joe_> ok
[08:29:41] <_joe_> not top priority anyway
[08:29:45] and I want to move it too, fwiw.
[08:29:53] right now there’s literally no way to test changes to OSM
[08:30:08] <_joe_> so, lemme merge some changes of mine, then I'll take a look at your beta-related changes
[08:30:27] there's nothing that OSM does locally
[08:30:31] but otoh it relies on LDAP
[08:30:47] paravoid: it’s mostly just a matter of ‘not sure what will break, but sure something is going to'
[08:30:51] so this would add a prod -> LDAP dependency + prod -> openstack dependency
[08:31:36] _joe_: woo, cool.
[08:32:57] so, when we get to horizon, OSM will be only used for OpenStreetMap ?
[08:33:10] can't wait for that day :-)
[08:34:14] yup, me neither :D
[08:34:22] * YuviPanda has been trying to bump up Horizon in our priorities list
[08:35:24] anyway, on to killing more local betacluster patches
[08:35:53] <_joe_> paravoid: well virt1000 is prod, right?
[08:36:14] <_joe_> my idea was to have a dedicated old appserver for wikitech
[08:36:22] also, I’ve been using an IDE for puppet for a while now. being able to jump to classes directly (rather than just files) is nice.
[08:36:28] <_joe_> not to use our whole infrastructure to serve it
[08:36:41] <_joe_> YuviPanda: which IDE?
[08:36:46] _joe_: IntelliJ
[08:36:57] <_joe_> is there anything able to edit programs besides emacs?
[08:37:03] <_joe_> ewww
[08:37:08] <_joe_> "J"
[08:37:54] _joe_: it’s the only Java program I can stand. I use it for a fair amount of things (Java, Python, PHP, JS)
[08:38:01] _joe_: it also has a fairly nice Vim plugin
[08:38:11] although not as good as evil mode, the only way to make emacs useful
[08:38:12] * YuviPanda ducks
[08:38:23] <_joe_> lol
[08:39:26] hmm
[08:39:34] the eventlogging base classes isn’t really configurable
[08:40:32] it's literally the first puppet i ever wrote. what do you need it to do?
[08:40:42] ori: oh, I’m trying to get rid of the local patch in betalabs
[08:40:54] - $log_dir = $::eventlogging::log_dir
[08:40:54] + $log_dir = "/var/log/eventlogging"
[08:41:31] and $::eventlogging::log_dir is hardcoded. am making that a class param...
[08:41:42] (in the eventloggin base class)
[08:41:46] or at least, looking around
[08:42:02] if you can't stand to wait, do -- i'm going to make the file-based logging way less verbose
[08:42:16] by logging validation errors back into the event stream as ValidationError events or something
[08:42:28] (don't ask me what happens if a ValidationError event fails to validate :P)
[08:42:38] ValidationErrorValidationError?
[08:42:39] :D
[08:42:46] turtles all the way down
[08:43:39] ori: yeah, I could definitely wait.
Lots of other patches to find ways to get rid of :)
[08:44:00] like the apache uid=48 thing
[08:44:02] * YuviPanda investigates
[08:45:10] (PS3) Giuseppe Lavagetto: puppet: move hiera lookups for the cluster to the actual classes [puppet] - https://gerrit.wikimedia.org/r/183879
[08:45:49] (CR) Giuseppe Lavagetto: [C: 2] puppet: move hiera lookups for the cluster to the actual classes [puppet] - https://gerrit.wikimedia.org/r/183879 (owner: Giuseppe Lavagetto)
[08:45:58] hmm, why do we even need uid to match?
[08:46:45] operations, Beta-Cluster: File upload area resorts to 0777 permissions to for uploaded content - https://phabricator.wikimedia.org/T75206#975349 (yuvipanda) Why is this needed again? Just having the user on the labstore machines seems to fix the permissions issues.
[08:47:16] <_joe_> YuviPanda: match with what?
[08:47:25] _joe_: with each other.
[08:47:27] https://phabricator.wikimedia.org/T78076
[08:47:37] is 48 in some places, 996 in others
[08:47:52] at least according to that bug.
[08:48:05] there’s a patch that renumbers it to 48 everywhere (theoretically) that’s been -2’d.
[08:48:09] it’s cherry picked on betalabs
[08:48:26] I’m digging through to see both why it was required in the first place and why it was -2’d
[08:48:28] <_joe_> of course it is
[08:48:36] i mean
[08:48:47] it’s fairly trivial why it’s -2’d (lots of pain)
[08:48:58] and trying now to see why it was requierd in the first place
[08:51:12] operations, Beta-Cluster: Renumber apache user/group to uid=48 - https://phabricator.wikimedia.org/T78076#975352 (yuvipanda) Why is this needed again? T76086 seems to have fixed T75206. And as @ori said, we should be agnostic about the numeric values unless there's a very good reason.
[08:51:17] (03PS1) 10Giuseppe Lavagetto: Revert "puppet: move hiera lookups for the cluster to the actual classes" [puppet] - 10https://gerrit.wikimedia.org/r/184851 [08:51:27] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] Revert "puppet: move hiera lookups for the cluster to the actual classes" [puppet] - 10https://gerrit.wikimedia.org/r/184851 (owner: 10Giuseppe Lavagetto) [08:52:01] <_joe_> shit. how can this ever work? [08:53:57] <_joe_> I can't understand why this did not work [08:56:19] <_joe_> ohhh, well. This is embarrassing :P [08:57:49] (03PS1) 10Giuseppe Lavagetto: puppet: move hiera lookups for the cluster to the actual classes [puppet] - 10https://gerrit.wikimedia.org/r/184853 [09:00:08] (03CR) 10Yuvipanda: "Are there plans to merge this anytime soon?" [puppet] - 10https://gerrit.wikimedia.org/r/173336 (owner: 10Anomie) [09:02:24] (03PS4) 10Yuvipanda: logstash: Split beta role into two composable clearer ones [puppet] - 10https://gerrit.wikimedia.org/r/184618 (https://phabricator.wikimedia.org/T86642) [09:02:33] (03PS2) 10Giuseppe Lavagetto: puppet: move hiera lookups for the cluster to the actual classes [puppet] - 10https://gerrit.wikimedia.org/r/184853 [09:03:15] (03CR) 10Yuvipanda: [C: 032] logstash: Split beta role into two composable clearer ones [puppet] - 10https://gerrit.wikimedia.org/r/184618 (https://phabricator.wikimedia.org/T86642) (owner: 10Yuvipanda) [09:03:27] (03CR) 10Giuseppe Lavagetto: [C: 032] puppet: move hiera lookups for the cluster to the actual classes [puppet] - 10https://gerrit.wikimedia.org/r/184853 (owner: 10Giuseppe Lavagetto) [09:03:54] (03PS1) 10Ori.livneh: Memcached: make remaining app servers use UNIX domain socket [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184854 [09:04:09] _joe_: any objections? things look rock-solid on mw12*.
[09:05:11] PROBLEM - MySQL Processlist on db1056 is CRITICAL: CRIT 318 unauthenticated, 0 locked, 0 copy to table, 0 statistics [09:06:15] <_joe_> ori: 1 sec [09:06:25] (03PS1) 10Giuseppe Lavagetto: Removed empty default hiera file [puppet] - 10https://gerrit.wikimedia.org/r/184855 [09:06:41] PROBLEM - puppet last run on mw1100 is CRITICAL: CRITICAL: puppet fail [09:06:41] PROBLEM - puppet last run on elastic1018 is CRITICAL: CRITICAL: puppet fail [09:06:41] PROBLEM - puppet last run on wtp1016 is CRITICAL: CRITICAL: puppet fail [09:06:41] PROBLEM - puppet last run on mc1003 is CRITICAL: CRITICAL: puppet fail [09:06:49] <_joe_> this is me ^^ [09:06:50] PROBLEM - puppet last run on db1040 is CRITICAL: CRITICAL: puppet fail [09:06:50] PROBLEM - puppet last run on mw1003 is CRITICAL: CRITICAL: puppet fail [09:06:50] PROBLEM - puppet last run on mw1173 is CRITICAL: CRITICAL: puppet fail [09:06:51] PROBLEM - puppet last run on capella is CRITICAL: CRITICAL: puppet fail [09:06:51] PROBLEM - puppet last run on db1050 is CRITICAL: CRITICAL: puppet fail [09:06:51] PROBLEM - puppet last run on cp4003 is CRITICAL: CRITICAL: puppet fail [09:06:51] PROBLEM - puppet last run on db1066 is CRITICAL: CRITICAL: puppet fail [09:06:55] (03PS2) 10Giuseppe Lavagetto: Removed empty default hiera file [puppet] - 10https://gerrit.wikimedia.org/r/184855 [09:07:00] PROBLEM - puppet last run on mw1189 is CRITICAL: CRITICAL: puppet fail [09:07:00] PROBLEM - puppet last run on mw1242 is CRITICAL: CRITICAL: puppet fail [09:07:00] PROBLEM - puppet last run on mw1144 is CRITICAL: CRITICAL: puppet fail [09:07:01] PROBLEM - puppet last run on elastic1008 is CRITICAL: CRITICAL: puppet fail [09:07:02] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] Removed empty default hiera file [puppet] - 10https://gerrit.wikimedia.org/r/184855 (owner: 10Giuseppe Lavagetto) [09:07:10] PROBLEM - puppet last run on mw1017 is CRITICAL: CRITICAL: puppet fail [09:07:11] PROBLEM - puppet last run on mw1150 is CRITICAL: 
CRITICAL: puppet fail [09:07:20] PROBLEM - puppet last run on elastic1012 is CRITICAL: CRITICAL: puppet fail [09:07:20] PROBLEM - puppet last run on ms1004 is CRITICAL: CRITICAL: puppet fail [09:07:21] PROBLEM - puppet last run on mw1046 is CRITICAL: CRITICAL: puppet fail [09:07:21] PROBLEM - puppet last run on db1018 is CRITICAL: CRITICAL: puppet fail [09:07:21] PROBLEM - puppet last run on cp1056 is CRITICAL: CRITICAL: puppet fail [09:07:21] PROBLEM - puppet last run on lvs3001 is CRITICAL: CRITICAL: puppet fail [09:07:21] PROBLEM - puppet last run on mw1042 is CRITICAL: CRITICAL: puppet fail [09:07:22] PROBLEM - puppet last run on mw1164 is CRITICAL: CRITICAL: puppet fail [09:07:22] PROBLEM - puppet last run on ruthenium is CRITICAL: CRITICAL: puppet fail [09:07:30] PROBLEM - puppet last run on mw1065 is CRITICAL: CRITICAL: puppet fail [09:07:31] PROBLEM - puppet last run on mw1069 is CRITICAL: CRITICAL: puppet fail [09:07:31] PROBLEM - puppet last run on mw1222 is CRITICAL: CRITICAL: puppet fail [09:07:31] PROBLEM - puppet last run on lvs2004 is CRITICAL: CRITICAL: puppet fail [09:07:31] PROBLEM - puppet last run on elastic1027 is CRITICAL: CRITICAL: puppet fail [09:07:31] PROBLEM - puppet last run on lead is CRITICAL: CRITICAL: puppet fail [09:07:31] PROBLEM - puppet last run on db1059 is CRITICAL: CRITICAL: puppet fail [09:07:31] PROBLEM - puppet last run on mw1008 is CRITICAL: CRITICAL: puppet fail [09:07:32] PROBLEM - puppet last run on mw1061 is CRITICAL: CRITICAL: puppet fail [09:07:32] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: puppet fail [09:07:40] PROBLEM - puppet last run on virt1006 is CRITICAL: CRITICAL: puppet fail [09:07:41] PROBLEM - puppet last run on mw1166 is CRITICAL: CRITICAL: puppet fail [09:07:41] PROBLEM - puppet last run on mw1092 is CRITICAL: CRITICAL: puppet fail [09:07:41] PROBLEM - puppet last run on cp3003 is CRITICAL: CRITICAL: puppet fail [09:07:41] PROBLEM - puppet last run on cp4008 is CRITICAL: CRITICAL: puppet 
fail [09:07:41] PROBLEM - puppet last run on db1046 is CRITICAL: CRITICAL: puppet fail [09:07:41] <_joe_> icinga-wm: shush! [09:07:50] PROBLEM - puppet last run on mw1025 is CRITICAL: CRITICAL: puppet fail [09:07:50] PROBLEM - puppet last run on iron is CRITICAL: CRITICAL: puppet fail [09:07:50] PROBLEM - puppet last run on cp3014 is CRITICAL: CRITICAL: puppet fail [09:07:51] PROBLEM - puppet last run on db2018 is CRITICAL: CRITICAL: puppet fail [09:07:51] PROBLEM - puppet last run on mw1118 is CRITICAL: CRITICAL: puppet fail [09:08:00] PROBLEM - puppet last run on cp1061 is CRITICAL: CRITICAL: puppet fail [09:08:01] PROBLEM - puppet last run on labcontrol2001 is CRITICAL: CRITICAL: puppet fail [09:08:08] should really setup a thing that handles IRC paging, maybe. [09:08:10] PROBLEM - puppet last run on mw1235 is CRITICAL: CRITICAL: puppet fail [09:08:10] PROBLEM - puppet last run on mw1211 is CRITICAL: CRITICAL: puppet fail [09:08:11] PROBLEM - puppet last run on db1002 is CRITICAL: CRITICAL: puppet fail [09:08:11] PROBLEM - puppet last run on mw1129 is CRITICAL: CRITICAL: puppet fail [09:08:11] PROBLEM - puppet last run on amssq48 is CRITICAL: CRITICAL: puppet fail [09:08:11] PROBLEM - puppet last run on amslvs1 is CRITICAL: CRITICAL: puppet fail [09:08:11] PROBLEM - puppet last run on amssq60 is CRITICAL: CRITICAL: puppet fail [09:08:13] :) [09:08:20] PROBLEM - puppet last run on db1042 is CRITICAL: CRITICAL: puppet fail [09:08:20] PROBLEM - puppet last run on labsdb1003 is CRITICAL: CRITICAL: puppet fail [09:08:30] PROBLEM - puppet last run on logstash1002 is CRITICAL: CRITICAL: puppet fail [09:08:30] PROBLEM - puppet last run on mw1177 is CRITICAL: CRITICAL: puppet fail [09:08:31] PROBLEM - puppet last run on mw1126 is CRITICAL: CRITICAL: puppet fail [09:08:31] PROBLEM - puppet last run on ms-fe1004 is CRITICAL: CRITICAL: puppet fail [09:08:31] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: puppet fail [09:08:31] PROBLEM - puppet last run on mw1251 
is CRITICAL: CRITICAL: puppet fail [09:08:31] PROBLEM - puppet last run on tin is CRITICAL: CRITICAL: puppet fail [09:08:32] PROBLEM - puppet last run on mw1052 is CRITICAL: CRITICAL: puppet fail [09:08:32] PROBLEM - puppet last run on db1051 is CRITICAL: CRITICAL: puppet fail [09:08:33] PROBLEM - puppet last run on db1015 is CRITICAL: CRITICAL: puppet fail [09:08:33] PROBLEM - puppet last run on mw1123 is CRITICAL: CRITICAL: puppet fail [09:08:34] PROBLEM - puppet last run on elastic1030 is CRITICAL: CRITICAL: puppet fail [09:08:40] PROBLEM - puppet last run on cp4004 is CRITICAL: CRITICAL: puppet fail [09:08:41] PROBLEM - puppet last run on db1034 is CRITICAL: CRITICAL: puppet fail [09:08:41] PROBLEM - puppet last run on lvs2001 is CRITICAL: CRITICAL: puppet fail [09:08:41] PROBLEM - puppet last run on amssq55 is CRITICAL: CRITICAL: puppet fail [09:08:51] PROBLEM - puppet last run on db1021 is CRITICAL: CRITICAL: puppet fail [09:08:51] PROBLEM - puppet last run on amssq46 is CRITICAL: CRITICAL: puppet fail [09:08:51] PROBLEM - puppet last run on amssq47 is CRITICAL: CRITICAL: puppet fail [09:08:51] PROBLEM - puppet last run on cp4019 is CRITICAL: CRITICAL: puppet fail [09:09:00] PROBLEM - puppet last run on polonium is CRITICAL: CRITICAL: puppet fail [09:09:00] PROBLEM - puppet last run on db1023 is CRITICAL: CRITICAL: puppet fail [09:09:00] PROBLEM - puppet last run on dbproxy1001 is CRITICAL: CRITICAL: puppet fail [09:09:01] PROBLEM - puppet last run on db1067 is CRITICAL: CRITICAL: puppet fail [09:09:01] PROBLEM - puppet last run on db1028 is CRITICAL: CRITICAL: puppet fail [09:09:06] <_joe_> it's solved btw [09:09:10] PROBLEM - puppet last run on mw1039 is CRITICAL: CRITICAL: puppet fail [09:09:10] PROBLEM - puppet last run on mw1162 is CRITICAL: CRITICAL: puppet fail [09:09:10] PROBLEM - puppet last run on mw1002 is CRITICAL: CRITICAL: puppet fail [09:09:10] PROBLEM - puppet last run on labstore1001 is CRITICAL: CRITICAL: puppet fail [09:09:11] PROBLEM - 
puppet last run on mw1076 is CRITICAL: CRITICAL: puppet fail [09:09:11] PROBLEM - puppet last run on elastic1022 is CRITICAL: CRITICAL: puppet fail [09:09:11] PROBLEM - puppet last run on cp1058 is CRITICAL: CRITICAL: puppet fail [09:09:14] <_joe_> sigh [09:09:20] PROBLEM - puppet last run on mw1172 is CRITICAL: CRITICAL: puppet fail [09:09:20] PROBLEM - puppet last run on mw1175 is CRITICAL: CRITICAL: puppet fail [09:09:20] PROBLEM - puppet last run on mw1206 is CRITICAL: CRITICAL: puppet fail [09:09:20] PROBLEM - puppet last run on mc1012 is CRITICAL: CRITICAL: puppet fail [09:09:20] PROBLEM - puppet last run on cp4001 is CRITICAL: CRITICAL: puppet fail [09:09:21] PROBLEM - puppet last run on db2007 is CRITICAL: CRITICAL: puppet fail [09:09:21] PROBLEM - puppet last run on mw1249 is CRITICAL: CRITICAL: puppet fail [09:09:21] PROBLEM - puppet last run on virt1003 is CRITICAL: CRITICAL: puppet fail [09:09:21] PROBLEM - puppet last run on wtp1012 is CRITICAL: CRITICAL: puppet fail [09:09:22] PROBLEM - puppet last run on gallium is CRITICAL: CRITICAL: puppet fail [09:09:23] PROBLEM - puppet last run on antimony is CRITICAL: CRITICAL: puppet fail [09:09:30] PROBLEM - puppet last run on amssq34 is CRITICAL: CRITICAL: puppet fail [09:09:31] PROBLEM - puppet last run on labnet1001 is CRITICAL: CRITICAL: puppet fail [09:09:31] PROBLEM - puppet last run on mw1213 is CRITICAL: CRITICAL: puppet fail [09:09:31] PROBLEM - puppet last run on mw1208 is CRITICAL: CRITICAL: puppet fail [09:09:31] RECOVERY - puppet last run on mw1017 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [09:09:31] PROBLEM - puppet last run on dataset1001 is CRITICAL: CRITICAL: puppet fail [09:09:40] PROBLEM - puppet last run on snapshot1001 is CRITICAL: CRITICAL: puppet fail [09:09:40] PROBLEM - puppet last run on mw1149 is CRITICAL: CRITICAL: puppet fail [09:09:40] PROBLEM - puppet last run on lithium is CRITICAL: CRITICAL: puppet fail [09:09:40] PROBLEM - puppet last run 
on es1007 is CRITICAL: CRITICAL: puppet fail [09:09:40] PROBLEM - puppet last run on mw1195 is CRITICAL: CRITICAL: puppet fail [09:09:40] PROBLEM - puppet last run on mw1054 is CRITICAL: CRITICAL: puppet fail [09:09:41] PROBLEM - puppet last run on db1016 is CRITICAL: CRITICAL: puppet fail [09:09:41] the technical term for this is "icinga bukake", i believe [09:09:41] PROBLEM - puppet last run on db2029 is CRITICAL: CRITICAL: puppet fail [09:09:41] PROBLEM - puppet last run on nembus is CRITICAL: CRITICAL: puppet fail [09:09:42] PROBLEM - puppet last run on install2001 is CRITICAL: CRITICAL: puppet fail [09:09:43] PROBLEM - puppet last run on cp4014 is CRITICAL: CRITICAL: puppet fail [09:09:43] PROBLEM - puppet last run on mw1114 is CRITICAL: CRITICAL: puppet fail [09:09:43] PROBLEM - puppet last run on mw1011 is CRITICAL: CRITICAL: puppet fail [09:09:49] <_joe_> ori: ahahah [09:09:50] PROBLEM - puppet last run on mw1237 is CRITICAL: CRITICAL: puppet fail [09:09:50] PROBLEM - puppet last run on db1043 is CRITICAL: CRITICAL: puppet fail [09:09:50] PROBLEM - puppet last run on mw1055 is CRITICAL: CRITICAL: puppet fail [09:09:55] hmm, really need to bring back quips in some form [09:10:00] RECOVERY - MySQL Processlist on db1056 is OK: OK 0 unauthenticated, 0 locked, 0 copy to table, 2 statistics [09:10:00] PROBLEM - puppet last run on ms-be3002 is CRITICAL: CRITICAL: puppet fail [09:10:00] PROBLEM - puppet last run on nescio is CRITICAL: CRITICAL: puppet fail [09:10:00] PROBLEM - puppet last run on cp3010 is CRITICAL: CRITICAL: puppet fail [09:10:01] PROBLEM - puppet last run on bast4001 is CRITICAL: CRITICAL: puppet fail [09:10:01] PROBLEM - puppet last run on virt1004 is CRITICAL: CRITICAL: puppet fail [09:10:01] PROBLEM - puppet last run on pc1002 is CRITICAL: CRITICAL: puppet fail [09:10:01] PROBLEM - puppet last run on db1003 is CRITICAL: CRITICAL: puppet fail [09:10:11] PROBLEM - puppet last run on mw1044 is CRITICAL: CRITICAL: puppet fail [09:10:11] PROBLEM - 
puppet last run on db1036 is CRITICAL: CRITICAL: puppet fail [09:10:20] PROBLEM - puppet last run on db2042 is CRITICAL: CRITICAL: puppet fail [09:10:20] PROBLEM - puppet last run on virt1001 is CRITICAL: CRITICAL: puppet fail [09:10:21] PROBLEM - puppet last run on wtp1005 is CRITICAL: CRITICAL: puppet fail [09:10:30] PROBLEM - puppet last run on mw1049 is CRITICAL: CRITICAL: puppet fail [09:10:30] PROBLEM - puppet last run on db1052 is CRITICAL: CRITICAL: puppet fail [09:10:31] PROBLEM - puppet last run on plutonium is CRITICAL: CRITICAL: puppet fail [09:10:41] PROBLEM - puppet last run on mw1168 is CRITICAL: CRITICAL: puppet fail [09:10:50] PROBLEM - puppet last run on elastic1006 is CRITICAL: CRITICAL: puppet fail [09:10:51] PROBLEM - puppet last run on lvs4003 is CRITICAL: CRITICAL: puppet fail [09:12:10] (03PS3) 10Giuseppe Lavagetto: mediawiki: move cluster definitions to hiera [puppet] - 10https://gerrit.wikimedia.org/r/183880 [09:20:42] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki: move cluster definitions to hiera [puppet] - 10https://gerrit.wikimedia.org/r/183880 (owner: 10Giuseppe Lavagetto) [09:24:05] RECOVERY - puppet last run on elastic1018 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [09:24:15] RECOVERY - puppet last run on lead is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [09:24:34] RECOVERY - puppet last run on wtp1016 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [09:24:34] RECOVERY - puppet last run on db1066 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [09:24:35] RECOVERY - puppet last run on mc1003 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [09:24:44] RECOVERY - puppet last run on elastic1008 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [09:24:54] RECOVERY - puppet last run on capella is OK: OK: Puppet is currently enabled, last run 51 
seconds ago with 0 failures [09:25:04] RECOVERY - puppet last run on elastic1012 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [09:25:04] RECOVERY - puppet last run on ms1004 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [09:25:05] RECOVERY - puppet last run on cp1056 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [09:25:15] RECOVERY - puppet last run on db1050 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:25:15] RECOVERY - puppet last run on lvs3001 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [09:25:25] RECOVERY - puppet last run on elastic1027 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [09:25:35] RECOVERY - puppet last run on virt1006 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [09:25:35] RECOVERY - puppet last run on db1046 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [09:25:36] RECOVERY - puppet last run on cp3003 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [09:25:44] RECOVERY - puppet last run on db1040 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [09:25:44] RECOVERY - puppet last run on iron is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [09:25:44] RECOVERY - puppet last run on db1002 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [09:25:45] RECOVERY - puppet last run on db1067 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [09:25:54] RECOVERY - puppet last run on labsdb1003 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [09:25:54] RECOVERY - puppet last run on db2018 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [09:25:55] RECOVERY - puppet last run 
on dbproxy1001 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [09:26:13] malkovich malkovich [09:26:14] RECOVERY - puppet last run on labcontrol2001 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [09:26:14] RECOVERY - puppet last run on ms-fe1004 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [09:26:24] RECOVERY - puppet last run on db1015 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:26:25] RECOVERY - puppet last run on db1051 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [09:26:25] RECOVERY - puppet last run on db1018 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:26:25] RECOVERY - puppet last run on elastic1030 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [09:26:25] RECOVERY - puppet last run on ruthenium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:26:25] RECOVERY - puppet last run on cp1061 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [09:26:35] RECOVERY - puppet last run on lvs2004 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [09:26:35] RECOVERY - puppet last run on db1021 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [09:26:35] RECOVERY - puppet last run on db1059 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:26:44] RECOVERY - puppet last run on elastic1022 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [09:26:44] RECOVERY - puppet last run on cp4003 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [09:26:45] RECOVERY - puppet last run on cp4008 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [09:26:54] RECOVERY - puppet last run on wtp1012 is OK: OK: Puppet 
is currently enabled, last run 8 seconds ago with 0 failures [09:26:55] RECOVERY - puppet last run on db2007 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [09:27:04] RECOVERY - puppet last run on db1042 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [09:27:04] RECOVERY - puppet last run on cp1058 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [09:27:04] RECOVERY - puppet last run on labnet1001 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [09:27:05] RECOVERY - puppet last run on amslvs1 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [09:27:05] RECOVERY - puppet last run on db2042 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [09:27:14] RECOVERY - puppet last run on dataset1001 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [09:27:14] RECOVERY - puppet last run on snapshot1001 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [09:27:15] RECOVERY - puppet last run on db1028 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [09:27:15] RECOVERY - puppet last run on es1007 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [09:27:15] RECOVERY - puppet last run on lithium is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [09:27:24] PROBLEM - puppet last run on mw1149 is CRITICAL: CRITICAL: puppet fail [09:27:24] RECOVERY - puppet last run on logstash1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:27:24] PROBLEM - puppet last run on mw1195 is CRITICAL: CRITICAL: puppet fail [09:27:24] PROBLEM - puppet last run on mw1150 is CRITICAL: CRITICAL: puppet fail [09:27:24] RECOVERY - puppet last run on db1023 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:27:24] 
PROBLEM - puppet last run on mw1054 is CRITICAL: CRITICAL: puppet fail [09:27:24] PROBLEM - puppet last run on mw1177 is CRITICAL: CRITICAL: puppet fail [09:27:25] RECOVERY - puppet last run on db1016 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [09:27:25] PROBLEM - puppet last run on mw1049 is CRITICAL: CRITICAL: puppet fail [09:27:26] PROBLEM - puppet last run on mw1126 is CRITICAL: CRITICAL: puppet fail [09:27:26] PROBLEM - puppet last run on mw1114 is CRITICAL: CRITICAL: puppet fail [09:27:27] PROBLEM - puppet last run on mw1100 is CRITICAL: CRITICAL: puppet fail [09:27:28] RECOVERY - puppet last run on db2029 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [09:27:29] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: puppet fail [09:27:29] PROBLEM - puppet last run on mw1011 is CRITICAL: CRITICAL: puppet fail [09:27:34] RECOVERY - puppet last run on virt1001 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [09:27:34] PROBLEM - puppet last run on mw1251 is CRITICAL: CRITICAL: puppet fail [09:27:34] RECOVERY - puppet last run on tin is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [09:27:34] RECOVERY - puppet last run on cp4014 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [09:27:35] RECOVERY - puppet last run on db1043 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [09:27:35] PROBLEM - puppet last run on mw1046 is CRITICAL: CRITICAL: puppet fail [09:27:35] PROBLEM - puppet last run on mw1052 is CRITICAL: CRITICAL: puppet fail [09:27:35] PROBLEM - puppet last run on mw1237 is CRITICAL: CRITICAL: puppet fail [09:27:35] PROBLEM - puppet last run on mw1055 is CRITICAL: CRITICAL: puppet fail [09:27:36] PROBLEM - puppet last run on mw1123 is CRITICAL: CRITICAL: puppet fail [09:27:37] PROBLEM - puppet last run on mw1042 is CRITICAL: CRITICAL: puppet fail [09:27:37] 
PROBLEM - puppet last run on mw1164 is CRITICAL: CRITICAL: puppet fail [09:27:37] RECOVERY - puppet last run on db1034 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:27:38] RECOVERY - puppet last run on labstore1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:27:44] PROBLEM - puppet last run on mw1039 is CRITICAL: CRITICAL: puppet fail [09:27:44] RECOVERY - puppet last run on cp4004 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [09:27:45] PROBLEM - puppet last run on mw1162 is CRITICAL: CRITICAL: puppet fail [09:27:45] PROBLEM - puppet last run on mw1065 is CRITICAL: CRITICAL: puppet fail [09:27:45] PROBLEM - puppet last run on mw1222 is CRITICAL: CRITICAL: puppet fail [09:27:45] RECOVERY - puppet last run on wtp1005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:27:45] PROBLEM - puppet last run on mw1069 is CRITICAL: CRITICAL: puppet fail [09:27:45] RECOVERY - puppet last run on lvs2001 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [09:27:45] RECOVERY - puppet last run on amssq55 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [09:27:46] RECOVERY - puppet last run on ms-be3002 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [09:27:52] ? 
[09:27:54] RECOVERY - puppet last run on db1003 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [09:27:54] RECOVERY - puppet last run on db1052 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [09:27:54] PROBLEM - puppet last run on mw1008 is CRITICAL: CRITICAL: puppet fail [09:27:54] PROBLEM - puppet last run on mw1061 is CRITICAL: CRITICAL: puppet fail [09:27:54] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: puppet fail [09:27:54] PROBLEM - puppet last run on mw1002 is CRITICAL: CRITICAL: puppet fail [09:27:55] PROBLEM - puppet last run on mw1044 is CRITICAL: CRITICAL: puppet fail [09:27:55] RECOVERY - puppet last run on pc1002 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [09:27:55] PROBLEM - puppet last run on mw1166 is CRITICAL: CRITICAL: puppet fail [09:27:56] PROBLEM - puppet last run on mw1092 is CRITICAL: CRITICAL: puppet fail [09:28:04] RECOVERY - puppet last run on polonium is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [09:28:04] RECOVERY - puppet last run on amssq47 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [09:28:05] RECOVERY - puppet last run on amssq46 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [09:28:05] PROBLEM - puppet last run on mw1025 is CRITICAL: CRITICAL: puppet fail [09:28:05] PROBLEM - puppet last run on mw1172 is CRITICAL: CRITICAL: puppet fail [09:28:05] RECOVERY - puppet last run on cp3014 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:28:05] PROBLEM - puppet last run on mw1235 is CRITICAL: CRITICAL: puppet fail [09:28:05] PROBLEM - puppet last run on mw1003 is CRITICAL: CRITICAL: puppet fail [09:28:06] PROBLEM - puppet last run on mw1206 is CRITICAL: CRITICAL: puppet fail [09:28:07] PROBLEM - puppet last run on mw1189 is CRITICAL: CRITICAL: puppet fail [09:28:07] PROBLEM - puppet last 
run on mw1242 is CRITICAL: CRITICAL: puppet fail [09:28:07] PROBLEM - puppet last run on mw1249 is CRITICAL: CRITICAL: puppet fail [09:28:08] (03PS1) 10Giuseppe Lavagetto: puppet: fix ganglia lookup scoping [puppet] - 10https://gerrit.wikimedia.org/r/184857 [09:28:08] RECOVERY - puppet last run on virt1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:28:08] PROBLEM - puppet last run on mw1211 is CRITICAL: CRITICAL: puppet fail [09:28:14] RECOVERY - puppet last run on gallium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:28:14] RECOVERY - puppet last run on plutonium is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [09:28:14] RECOVERY - puppet last run on antimony is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [09:28:14] PROBLEM - puppet last run on mw1168 is CRITICAL: CRITICAL: puppet fail [09:28:15] PROBLEM - puppet last run on mw1144 is CRITICAL: CRITICAL: puppet fail [09:28:15] PROBLEM - puppet last run on mw1129 is CRITICAL: CRITICAL: puppet fail [09:28:15] PROBLEM - puppet last run on mw1208 is CRITICAL: CRITICAL: puppet fail [09:28:15] RECOVERY - puppet last run on amssq48 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [09:28:15] RECOVERY - puppet last run on amssq34 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [09:28:16] RECOVERY - puppet last run on amssq60 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [09:28:25] PROBLEM - puppet last run on mw1213 is CRITICAL: CRITICAL: puppet fail [09:28:25] RECOVERY - puppet last run on mc1012 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:28:25] RECOVERY - puppet last run on db1036 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [09:28:25] PROBLEM - puppet last run on mw1076 is CRITICAL: CRITICAL: puppet fail [09:28:26] 
PROBLEM - puppet last run on mw1118 is CRITICAL: CRITICAL: puppet fail [09:28:26] PROBLEM - puppet last run on mw1175 is CRITICAL: CRITICAL: puppet fail [09:28:26] RECOVERY - puppet last run on elastic1006 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [09:28:26] PROBLEM - puppet last run on mw1173 is CRITICAL: CRITICAL: puppet fail [09:28:44] RECOVERY - puppet last run on install2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:28:44] RECOVERY - puppet last run on virt1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:28:45] RECOVERY - puppet last run on nembus is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:28:45] RECOVERY - puppet last run on lvs4003 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [09:29:04] RECOVERY - puppet last run on cp3010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:29:04] RECOVERY - puppet last run on nescio is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:29:16] RECOVERY - puppet last run on cp4019 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [09:29:19] (03CR) 10Giuseppe Lavagetto: [C: 032] puppet: fix ganglia lookup scoping [puppet] - 10https://gerrit.wikimedia.org/r/184857 (owner: 10Giuseppe Lavagetto) [09:29:25] RECOVERY - puppet last run on cp4001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:29:45] RECOVERY - puppet last run on bast4001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:30:07] <_joe_> grrr [09:30:27] <_joe_> old bugs that resurface when you fix things [09:41:51] <_joe_> ganglia is going to be ok in ~ 10 minutes [09:41:55] <_joe_> sorry for that [09:43:19] hmm, #puppet tagged tasks should also show up here, I think. 
[09:43:24] rather than in -dev [09:43:28] <_joe_> yes [09:43:34] * YuviPanda writes patch [09:44:38] _joe_: wanna +1 that? https://gerrit.wikimedia.org/r/184859 [09:44:43] RECOVERY - puppet last run on mw1003 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [09:44:43] RECOVERY - puppet last run on mw1242 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [09:45:02] RECOVERY - puppet last run on mw1173 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [09:45:02] RECOVERY - puppet last run on mw1189 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [09:45:13] RECOVERY - puppet last run on mw1164 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [09:45:27] _joe_: thanks [09:45:32] RECOVERY - puppet last run on mw1222 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [09:45:32] RECOVERY - puppet last run on mw1069 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [09:45:53] RECOVERY - puppet last run on mw1235 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [09:46:03] RECOVERY - puppet last run on mw1150 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [09:46:12] RECOVERY - puppet last run on mw1100 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [09:46:12] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [09:46:23] RECOVERY - puppet last run on mw1046 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [09:46:23] RECOVERY - puppet last run on mw1052 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [09:46:23] RECOVERY - puppet last run on mw1042 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [09:46:23] RECOVERY - 
puppet last run on mw1123 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [09:46:40] huh, that didn’t seem to work [09:46:43] RECOVERY - puppet last run on mw1065 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [09:46:52] RECOVERY - puppet last run on mw1008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:46:53] RECOVERY - puppet last run on mw1061 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [09:46:53] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [09:47:02] RECOVERY - puppet last run on mw1092 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:47:02] RECOVERY - puppet last run on mw1166 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:47:03] RECOVERY - puppet last run on mw1175 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [09:47:03] RECOVERY - puppet last run on mw1025 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:47:03] RECOVERY - puppet last run on mw1172 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [09:47:12] RECOVERY - puppet last run on mw1144 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [09:47:13] RECOVERY - puppet last run on mw1213 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [09:47:13] RECOVERY - puppet last run on mw1211 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [09:47:13] RECOVERY - puppet last run on mw1118 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [09:47:22] RECOVERY - puppet last run on mw1054 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [09:47:22] RECOVERY - puppet last run on mw1177 is OK: OK: Puppet is 
currently enabled, last run 26 seconds ago with 0 failures [09:47:23] RECOVERY - puppet last run on mw1114 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [09:47:23] RECOVERY - puppet last run on mw1126 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [09:47:23] RECOVERY - puppet last run on mw1011 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [09:47:32] RECOVERY - puppet last run on mw1251 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:47:43] RECOVERY - puppet last run on mw1055 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [09:47:44] RECOVERY - puppet last run on mw1039 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:47:52] RECOVERY - puppet last run on mw1129 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [09:47:52] RECOVERY - puppet last run on mw1208 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [09:47:52] RECOVERY - puppet last run on mw1162 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [09:48:03] RECOVERY - puppet last run on mw1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:48:03] RECOVERY - puppet last run on mw1044 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [09:48:22] RECOVERY - puppet last run on mw1206 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:48:23] RECOVERY - puppet last run on mw1249 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [09:48:23] RECOVERY - puppet last run on mw1149 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:48:32] RECOVERY - puppet last run on mw1195 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [09:48:52] RECOVERY - puppet 
last run on mw1237 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:49:02] RECOVERY - puppet last run on mw1076 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:49:43] RECOVERY - puppet last run on mw1168 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:49:43] RECOVERY - puppet last run on mw1049 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [09:53:55] _joe_: i can babysit the nutcracker patch [09:54:29] <_joe_> ori: I am mostly free now [09:54:41] cool [09:54:46] shall we go for it? [09:54:52] <_joe_> yes [09:56:22] (03CR) 10Ori.livneh: [C: 032] Memcached: make remaining app servers use UNIX domain socket [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184854 (owner: 10Ori.livneh) [09:56:26] (03Merged) 10jenkins-bot: Memcached: make remaining app servers use UNIX domain socket [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184854 (owner: 10Ori.livneh) [09:57:44] !log ori Synchronized wmf-config/mc.php: Memcached: make remaining app servers use UNIX domain socket (duration: 00m 06s) [09:57:47] <_joe_> whoa [09:57:48] Logged the message, Master [09:57:52] <_joe_> a shitton of errors [09:58:34] where? [09:58:41] <_joe_> on most servers [09:58:52] <_joe_> we have a file called 'nutcracker.sock 0666' [09:59:03] <_joe_> revert :) [09:59:39] <_joe_> did you deploy the package everywhere? [10:00:00] <_joe_> oh sorry [10:00:03] <_joe_> not everywhere [10:00:06] <_joe_> on the imagescalers [10:00:10] <_joe_> of course [10:00:28] <_joe_> so, please revert, we need to refine the patch a bit [10:00:46] <_joe_> or to recompile nutcracker there as well [10:00:52] <_joe_> which is probably the best thing to do [10:01:02] ohhh [10:01:11] UNIX domain socket being faster than TCP ones ? 
:D [10:01:13] yeah ok i'll make it conditional on the os [10:01:34] <_joe_> that would work [10:01:42] <_joe_> I completely forgot about that tbh [10:01:57] <_joe_> but I'll rebuild the package there as well [10:03:59] (03PS1) 10Ori.livneh: nutcracker: use UNIX domain socket only if on HHVM [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184865 [10:04:27] (03CR) 10Giuseppe Lavagetto: [C: 031] nutcracker: use UNIX domain socket only if on HHVM [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184865 (owner: 10Ori.livneh) [10:04:48] (03CR) 10Ori.livneh: [C: 032] nutcracker: use UNIX domain socket only if on HHVM [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184865 (owner: 10Ori.livneh) [10:04:51] thanks [10:04:52] (03Merged) 10jenkins-bot: nutcracker: use UNIX domain socket only if on HHVM [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184865 (owner: 10Ori.livneh) [10:05:12] <_joe_> I'll remove that conditional as soon as I have a new package [10:05:29] sweet [10:05:33] !log ori Synchronized wmf-config/mc.php: nutcracker: use UNIX domain socket only if on HHVM (duration: 00m 07s) [10:05:36] Logged the message, Master [10:05:40] (03PS1) 10Filippo Giunchedi: lsearchd: remove nagios check_lucene [puppet] - 10https://gerrit.wikimedia.org/r/184866 (https://phabricator.wikimedia.org/T86150) [10:07:54] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] lsearchd: remove nagios check_lucene [puppet] - 10https://gerrit.wikimedia.org/r/184866 (https://phabricator.wikimedia.org/T86150) (owner: 10Filippo Giunchedi) [10:09:08] <_joe_> ori: mmmh it seems not to have worked [10:09:14] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] remove service endpoints for lsearchd [dns] - 10https://gerrit.wikimedia.org/r/184624 (https://phabricator.wikimedia.org/T85009) (owner: 10Filippo Giunchedi) [10:09:20] <_joe_> meaning I still see errors on fluorine [10:10:06] could be udp2log's buffer is still getting flushed [10:11:02] <_joe_> ori: no, netstat agrees with me [10:11:15] 
<_joe_> ori: let me revert the whole thing and go on again at a later time [10:11:27] 3operations: Decommission lsearchd - https://phabricator.wikimedia.org/T85009#975470 (10fgiunchedi) [10:11:29] 3operations: remove lsearchd support from puppet - https://phabricator.wikimedia.org/T86150#975468 (10fgiunchedi) 5Open>3Resolved done, puppet role/class decommissioned and dns entries removed [10:12:53] _joe_: just a sec [10:14:13] <_joe_> I don't get why this happens btw, the code seems good. [10:14:33] oh duh [10:14:35] no, it's not [10:14:40] 'servers' key twice [10:15:05] (03PS1) 10Ori.livneh: Typo fix [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184867 [10:15:18] (03CR) 10Ori.livneh: [C: 032] Typo fix [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184867 (owner: 10Ori.livneh) [10:15:23] (03Merged) 10jenkins-bot: Typo fix [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184867 (owner: 10Ori.livneh) [10:15:31] <_joe_> sigh [10:15:46] !log ori Synchronized wmf-config/mc.php: typo fix for nutcracker socket (duration: 00m 05s) [10:15:51] Logged the message, Master [10:16:06] all better now. 
[10:16:15] <_joe_> yeah [10:16:24] terbium is still not quiet, because it has long-running jobs that haven't updated yet [10:16:24] <_joe_> terbium still complaining [10:16:41] <_joe_> but I guess it's just apc delay [10:17:01] nah, it's a cron job that is still running [10:17:17] <_joe_> yeah it's probably that [10:17:21] <_joe_> wikidatawiki [10:17:30] <_joe_> ok, good :) [10:18:41] (03PS3) 10Giuseppe Lavagetto: puppet: use the role keyword for all varnishes [puppet] - 10https://gerrit.wikimedia.org/r/183881 [10:20:35] (03CR) 10Giuseppe Lavagetto: [C: 032] puppet: use the role keyword for all varnishes [puppet] - 10https://gerrit.wikimedia.org/r/183881 (owner: 10Giuseppe Lavagetto) [10:24:34] hmm, should carefully design an ENC for Horizon that’s essentially just role keywords + hiera [10:24:58] <_joe_> hiera has nothing to do with the ENC [10:24:59] <_joe_> beware [10:25:16] <_joe_> the ENC should IMHO just give classes to include [10:25:16] true but wikitech sets puppet global vars [10:25:22] should be hiera instead [10:25:26] <_joe_> yes [10:25:51] yeah, should have said replace OSM with a role-only-ENC + hiera [10:32:12] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [10:32:31] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [10:33:41] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [10:34:32] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. 
[10:36:18] 3operations: Possible bandwidth saturation on bits esams - https://phabricator.wikimedia.org/T86749#975518 (10ori) 3NEW [10:43:37] (03PS3) 10Giuseppe Lavagetto: puppet: include admin in role classes for mediawiki and cache [puppet] - 10https://gerrit.wikimedia.org/r/183882 [10:54:13] (03CR) 10Giuseppe Lavagetto: [C: 032] puppet: include admin in role classes for mediawiki and cache [puppet] - 10https://gerrit.wikimedia.org/r/183882 (owner: 10Giuseppe Lavagetto) [10:55:17] (03PS2) 10Giuseppe Lavagetto: swift: use roles and other linting [puppet] - 10https://gerrit.wikimedia.org/r/184307 [10:57:16] (03CR) 10Giuseppe Lavagetto: [C: 032] swift: use roles and other linting [puppet] - 10https://gerrit.wikimedia.org/r/184307 (owner: 10Giuseppe Lavagetto) [10:59:22] PROBLEM - puppet last run on tmh1001 is CRITICAL: CRITICAL: puppet fail [11:00:56] PROBLEM - puppet last run on tmh1002 is CRITICAL: CRITICAL: puppet fail [11:03:46] (03PS1) 10Giuseppe Lavagetto: swift: move role def before ganglia inclusion [puppet] - 10https://gerrit.wikimedia.org/r/184870 [11:04:27] <_joe_> meh, I forgot the videoscalers [11:04:28] <_joe_> :) [11:04:47] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] swift: move role def before ganglia inclusion [puppet] - 10https://gerrit.wikimedia.org/r/184870 (owner: 10Giuseppe Lavagetto) [11:08:27] (03PS1) 10Giuseppe Lavagetto: puppet: use role for videoscalers as well [puppet] - 10https://gerrit.wikimedia.org/r/184871 [11:08:52] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] puppet: use role for videoscalers as well [puppet] - 10https://gerrit.wikimedia.org/r/184871 (owner: 10Giuseppe Lavagetto) [11:10:55] RECOVERY - puppet last run on tmh1001 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [11:12:25] (03PS2) 10Giuseppe Lavagetto: puppet: include admin in swift roles, not in site.pp [puppet] - 10https://gerrit.wikimedia.org/r/184308 [11:12:54] (03PS1) 10Dan-nl: Add *.bnf.fr to the wgCopyUploadsDomains whitelist 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/184872 (https://phabricator.wikimedia.org/T86699) [11:15:27] (03CR) 10Giuseppe Lavagetto: [C: 032] puppet: include admin in swift roles, not in site.pp [puppet] - 10https://gerrit.wikimedia.org/r/184308 (owner: 10Giuseppe Lavagetto) [11:16:45] PROBLEM - puppet last run on terbium is CRITICAL: CRITICAL: puppet fail [11:17:13] * _joe_ adds terbium to the list [11:18:06] (03CR) 10Steinsplitter: [C: 031] "ok." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184872 (https://phabricator.wikimedia.org/T86699) (owner: 10Dan-nl) [11:18:46] RECOVERY - puppet last run on tmh1002 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [11:20:41] (03CR) 10Reedy: [C: 04-1] Add extra scap proxies for A7, B7 and B8 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/184817 (https://phabricator.wikimedia.org/T1342) (owner: 10Reedy) [11:21:09] (03PS2) 10Giuseppe Lavagetto: videoscaler: use role keyword [puppet] - 10https://gerrit.wikimedia.org/r/184309 [11:22:01] PROBLEM - Slow CirrusSearch query rate on fluorine is CRITICAL: CirrusSearch-slow.log_line_rate CRITICAL: 0.02 [11:22:38] (03PS3) 10Giuseppe Lavagetto: videoscaler: use role keyword [puppet] - 10https://gerrit.wikimedia.org/r/184309 [11:22:56] (03CR) 10Giuseppe Lavagetto: [C: 032] videoscaler: use role keyword [puppet] - 10https://gerrit.wikimedia.org/r/184309 (owner: 10Giuseppe Lavagetto) [11:23:09] (03PS2) 10Giuseppe Lavagetto: puppet: use hiera for elasticsearch nodes [puppet] - 10https://gerrit.wikimedia.org/r/184310 [11:25:49] (03CR) 10Reedy: "So job runners are mw1001 to mw1016" [puppet] - 10https://gerrit.wikimedia.org/r/184817 (https://phabricator.wikimedia.org/T1342) (owner: 10Reedy) [11:27:11] RECOVERY - Slow CirrusSearch query rate on fluorine is OK: CirrusSearch-slow.log_line_rate OKAY: 0.0 [11:28:21] PROBLEM - puppet last run on tmh1001 is CRITICAL: CRITICAL: puppet fail [11:28:44] <_joe_> sigh [11:29:36] <_joe_> damn 
puppet scoping [11:29:47] (03PS1) 10Giuseppe Lavagetto: videoscaler: declare explicitly aggregation servers [puppet] - 10https://gerrit.wikimedia.org/r/184873 [11:29:57] (03CR) 10jenkins-bot: [V: 04-1] videoscaler: declare explicitly aggregation servers [puppet] - 10https://gerrit.wikimedia.org/r/184873 (owner: 10Giuseppe Lavagetto) [11:30:06] (03PS2) 10Giuseppe Lavagetto: videoscaler: declare explicitly aggregation servers [puppet] - 10https://gerrit.wikimedia.org/r/184873 [11:30:28] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] videoscaler: declare explicitly aggregation servers [puppet] - 10https://gerrit.wikimedia.org/r/184873 (owner: 10Giuseppe Lavagetto) [11:33:10] RECOVERY - puppet last run on tmh1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:33:44] <_joe_> Reedy: are you sure you still need to replace the jobrunner as a scap proxy? [11:33:54] <_joe_> even now after hhvm? [11:34:27] _joe_: nope [11:34:36] i noticed the cpu usage and stuff is a lot lower [11:34:41] <_joe_> yes [11:35:10] certainly adding one per rack would be good, and see how things look after a few scaps [11:35:40] Part of it is probably doing some work in scap itself to stop it overloading specific machines etc [11:35:49] (03PS3) 10Giuseppe Lavagetto: puppet: use hiera for elasticsearch nodes [puppet] - 10https://gerrit.wikimedia.org/r/184310 [11:37:26] (03PS2) 10Reedy: Add extra scap proxies for A7, B7 and B8 [puppet] - 10https://gerrit.wikimedia.org/r/184817 (https://phabricator.wikimedia.org/T1342) [11:37:29] _joe_: Can we get that deployed today? 
I just fixed the comment issue [11:37:35] (03CR) 10jenkins-bot: [V: 04-1] Add extra scap proxies for A7, B7 and B8 [puppet] - 10https://gerrit.wikimedia.org/r/184817 (https://phabricator.wikimedia.org/T1342) (owner: 10Reedy) [11:38:00] <_joe_> Reedy: for sure, lemme just finish my train of changes [11:38:06] sweet, thanks :) [11:38:29] <_joe_> we're down to 4 [11:38:31] oh, needs rebasing for that commented out one [11:38:49] i'll do that now [11:40:43] (03PS3) 10Reedy: Add extra scap proxies for A7, B7 and B8 [puppet] - 10https://gerrit.wikimedia.org/r/184817 (https://phabricator.wikimedia.org/T1342) [11:41:48] (03PS4) 10Giuseppe Lavagetto: puppet: use hiera for elasticsearch nodes [puppet] - 10https://gerrit.wikimedia.org/r/184310 [11:42:10] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] puppet: use hiera for elasticsearch nodes [puppet] - 10https://gerrit.wikimedia.org/r/184310 (owner: 10Giuseppe Lavagetto) [11:46:29] (03PS1) 10Giuseppe Lavagetto: elasticsearch: rename hiera file [puppet] - 10https://gerrit.wikimedia.org/r/184876 [11:46:46] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] elasticsearch: rename hiera file [puppet] - 10https://gerrit.wikimedia.org/r/184876 (owner: 10Giuseppe Lavagetto) [11:51:21] (03PS2) 10Giuseppe Lavagetto: puppet: use role for logstash [puppet] - 10https://gerrit.wikimedia.org/r/184311 [11:52:36] (03CR) 10Giuseppe Lavagetto: [C: 032] puppet: use role for logstash [puppet] - 10https://gerrit.wikimedia.org/r/184311 (owner: 10Giuseppe Lavagetto) [11:56:56] (03CR) 10Filippo Giunchedi: "thanks Daniel! 
I've merged yesterday https://gerrit.wikimedia.org/r/184620 which is a superset of this and the alerts are gone" [puppet] - 10https://gerrit.wikimedia.org/r/184380 (owner: 10Dzahn) [11:58:34] (03CR) 10Filippo Giunchedi: [C: 031] cxserver: Unify production and beta roles [puppet] - 10https://gerrit.wikimedia.org/r/184590 (https://phabricator.wikimedia.org/T86633) (owner: 10Yuvipanda) [12:03:07] (03PS2) 10Giuseppe Lavagetto: puppet: use role, hiera in rcstream [puppet] - 10https://gerrit.wikimedia.org/r/184312 [12:03:58] PROBLEM - Varnishkafka Delivery Errors per minute on cp3015 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [20000.0] [12:04:20] (03CR) 10Giuseppe Lavagetto: [C: 032] puppet: use role, hiera in rcstream [puppet] - 10https://gerrit.wikimedia.org/r/184312 (owner: 10Giuseppe Lavagetto) [12:05:11] godog: yay :) [12:05:33] (03PS2) 10Giuseppe Lavagetto: puppet: use role for ocg services [puppet] - 10https://gerrit.wikimedia.org/r/184325 [12:06:46] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] puppet: use role for ocg services [puppet] - 10https://gerrit.wikimedia.org/r/184325 (owner: 10Giuseppe Lavagetto) [12:12:27] RECOVERY - Varnishkafka Delivery Errors per minute on cp3015 is OK: OK: Less than 1.00% above the threshold [0.0] [12:19:34] <_joe_> well, it seems I managed not to break the site [12:21:23] YuviPanda: hehe I'll get to the others now too, needless to say it feels like a discussion on how to do SOA and puppet, they look all very similar [12:21:26] _joe_: \o/ [12:21:58] godog: yeah. Although in this specific instance I’m worried only about getting rid of the ::beta stuff and splitting whatever extra was in beta into more composable individual roles [12:22:07] that could hopefully be shared once they are all split out [12:22:19] a simple jenkins_access role that works across everywhere would be nice, for example [12:22:30] that's true indeed, that'd be nice [12:22:51] yeah.
but I want to kill all the ::beta ones first before generalizing. [12:23:18] there’s some questions about roles with params that have default values vs explicit hiera() calls. [12:23:50] I like the former, but we can punt that discussion to another time if required (I can update patches to just use hiera() calls) [12:24:20] I don't mind either way as long as there is one obvious way to do it :) [12:25:11] heh [12:28:15] (03CR) 10Filippo Giunchedi: [C: 031] apertium: Unify production and beta roles [puppet] - 10https://gerrit.wikimedia.org/r/184586 (https://phabricator.wikimedia.org/T86633) (owner: 10Yuvipanda) [12:28:53] <_joe_> oh, right, I have to unbreak terbium [12:29:25] (03CR) 10Filippo Giunchedi: [C: 031] citoid: Unify production and beta roles [puppet] - 10https://gerrit.wikimedia.org/r/184605 (https://phabricator.wikimedia.org/T86633) (owner: 10Yuvipanda) [12:29:46] godog: also, there’s a question about ferm rules I’m unsure about. Do we need to open up the ports explicitly in prod as well? [12:29:51] or is that only required in beta? [12:29:57] ori: any insight in https://gerrit.wikimedia.org/r/#/c/183568/ ? [12:30:06] I’d suppose that since we need it because base firewall drops everything, we’d need it in prod too... 
[12:30:44] yep I think so too, also can't hurt regardless [12:31:13] yeah [12:33:37] (03PS1) 10Giuseppe Lavagetto: terbium: use hiera [puppet] - 10https://gerrit.wikimedia.org/r/184882 [12:38:59] (03CR) 10Giuseppe Lavagetto: [C: 032] terbium: use hiera [puppet] - 10https://gerrit.wikimedia.org/r/184882 (owner: 10Giuseppe Lavagetto) [12:42:14] RECOVERY - puppet last run on terbium is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [12:53:05] (03PS2) 10Yuvipanda: citoid: Unify production and beta roles [puppet] - 10https://gerrit.wikimedia.org/r/184605 (https://phabricator.wikimedia.org/T86633) [12:53:07] (03PS3) 10Yuvipanda: cxserver: Unify production and beta roles [puppet] - 10https://gerrit.wikimedia.org/r/184590 (https://phabricator.wikimedia.org/T86633) [12:53:09] (03PS6) 10Yuvipanda: apertium: Unify production and beta roles [puppet] - 10https://gerrit.wikimedia.org/r/184586 (https://phabricator.wikimedia.org/T86633) [12:53:11] (03PS1) 10Yuvipanda: Unify mathoid production and beta roles [puppet] - 10https://gerrit.wikimedia.org/r/184884 (https://phabricator.wikimedia.org/T86633) [12:53:50] (03PS2) 10Yuvipanda: mathoid: Unify mathoid production and beta roles [puppet] - 10https://gerrit.wikimedia.org/r/184884 (https://phabricator.wikimedia.org/T86633) [12:53:57] godog: ^ fixed some issues with cxserver, and added mathoid [13:56:21] hmm, how long till SWAT? [13:57:14] YuviPanda, just over 2 hours I think [13:57:19] hmm, o [13:57:20] k [14:01:34] (03PS1) 10QChris: Clone analytics/aggregator through SSH to ease automated daily push of data [puppet] - 10https://gerrit.wikimedia.org/r/184885 [14:06:43] YuviPanda: ack, will take a look [14:06:56] godog: cool, thanks. I’ll make hiera changes for beta shortly [14:07:17] (03CR) 10Alexandros Kosiaris: [C: 04-1] "LGTM, but manifests/site.pp needs an update as well. 
Most likely it would be nice to convert the include role::citoid::production to role " [puppet] - 10https://gerrit.wikimedia.org/r/184605 (https://phabricator.wikimedia.org/T86633) (owner: 10Yuvipanda) [14:07:50] akosiaris: bah, I missed that one. (did it for the others) [14:08:57] (03PS3) 10Yuvipanda: citoid: Unify production and beta roles [puppet] - 10https://gerrit.wikimedia.org/r/184605 (https://phabricator.wikimedia.org/T86633) [14:08:59] (03PS3) 10Yuvipanda: mathoid: Unify mathoid production and beta roles [puppet] - 10https://gerrit.wikimedia.org/r/184884 (https://phabricator.wikimedia.org/T86633) [14:09:01] updated [14:10:22] 3Project-Creators, operations: HTTPS phabricator project(s) - https://phabricator.wikimedia.org/T86063#975827 (10Aklapper) So is there any vague ETA when "HTTPS by default" will be finished? Because my differentiation to a standard "component" tag would be "can and will this project realistically ever be defined... [14:12:53] (03PS2) 10Faidon Liambotis: Add missing m.{project}.org entries [dns] - 10https://gerrit.wikimedia.org/r/184690 (https://phabricator.wikimedia.org/T78421) (owner: 10Glaisher) [14:12:58] (03CR) 10Faidon Liambotis: [C: 032] Add missing m.{project}.org entries [dns] - 10https://gerrit.wikimedia.org/r/184690 (https://phabricator.wikimedia.org/T78421) (owner: 10Glaisher) [14:19:04] PROBLEM - graphite.wikimedia.org on tungsten is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 525 bytes in 0.002 second response time [14:19:10] 3Project-Creators, operations: HTTPS phabricator project(s) - https://phabricator.wikimedia.org/T86063#975841 (10faidon) Well, the ops-y parts of the "HTTPS by default" goal, i.e. scalability, performance & monitoring work is expected to be finished this quarter, so yes, there is an ETA. However, the actual swit... 
[14:23:54] RECOVERY - graphite.wikimedia.org on tungsten is OK: HTTP OK: HTTP/1.1 200 OK - 1607 bytes in 0.017 second response time [14:35:19] akosiaris: I updated that series of patchsets, btw. do take a look when you have time :) [14:36:47] that is what I am doing right now [14:37:27] akosiaris: wheee! :) [14:38:45] (03CR) 10Yuvipanda: "https://wikitech.wikimedia.org/w/index.php?title=Hiera%3ADeployment-prep&diff=140865&oldid=140717 is the hiera changes required for this." [puppet] - 10https://gerrit.wikimedia.org/r/184590 (https://phabricator.wikimedia.org/T86633) (owner: 10Yuvipanda) [14:51:02] (03PS11) 10Anomie: Configure Logstash and Elasticsearch for ApiFeatureUsage [puppet] - 10https://gerrit.wikimedia.org/r/173336 [14:51:18] mutante: thanks for doing the udp2log iptables merge! [14:51:23] ferm* [14:52:42] there's another patch that he didn't merge [14:52:48] oh? [14:52:54] I split them up to make the first one safer and easier to merge [14:52:55] (03PS1) 10Yuvipanda: tools: Install python-socketio-client [puppet] - 10https://gerrit.wikimedia.org/r/184887 (https://phabricator.wikimedia.org/T86015) [14:52:58] i didn't actually check, i just saw the email [14:53:01] https://gerrit.wikimedia.org/r/#/c/184695/ [14:53:24] can you take care of that? you know this infrastructure better than anyone [14:53:47] (03CR) 10Yuvipanda: [C: 032] tools: Install python-socketio-client [puppet] - 10https://gerrit.wikimedia.org/r/184887 (https://phabricator.wikimedia.org/T86015) (owner: 10Yuvipanda) [14:55:08] paravoid, the difference here is that now all is blocked by default except for udp, right? [14:55:29] yes [14:55:52] so I should triple check to make sure there isn't any tcp stuff that needs open too [14:55:53] hm. [14:56:01] ok, will try to do that today.
[14:56:01] so if there are any TCP services that do not have explicit ferm rules to pierce holes through the firewall, they'd be blocked [14:56:09] awesome :) [14:57:52] (03CR) 10Anomie: "It would be nice to get it merged." [puppet] - 10https://gerrit.wikimedia.org/r/173336 (owner: 10Anomie) [14:58:10] (03PS12) 10Anomie: Configure Logstash and Elasticsearch for ApiFeatureUsage [puppet] - 10https://gerrit.wikimedia.org/r/173336 [15:00:04] chasemp: Dear anthropoid, the time has come. Please deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150114T1500). [15:00:09] anomie: hmm, I suppose ottomata or manybubbles could take a look at ^ at some point? [15:01:28] (03PS4) 10Rush: phab update (and peripherals) for T78243 [puppet] - 10https://gerrit.wikimedia.org/r/184802 [15:03:17] (03CR) 10Rush: [C: 032 V: 032] phab update (and peripherals) for T78243 [puppet] - 10https://gerrit.wikimedia.org/r/184802 (owner: 10Rush) [15:05:19] (03PS1) 10Yuvipanda: ldap: Install ldapvi on terbium [puppet] - 10https://gerrit.wikimedia.org/r/184889 [15:05:26] paravoid: ^ [15:06:47] anomie: hmm, your cherry-pick on betalabs seems to have caused a puppetfailure on logstash1 [15:07:07] at least, PROBLEM - Puppet failure on deployment-logstash1 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [15:07:11] <_joe_> ensure latest, package [15:07:14] YuviPanda: I saw a puppet failure, but it was all about an out-of-date NTP package or something like that [15:07:34] anomie: oh, hmm. [15:07:55] (03CR) 10Alexandros Kosiaris: [C: 032] cxserver: Unify production and beta roles (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/184590 (https://phabricator.wikimedia.org/T86633) (owner: 10Yuvipanda) [15:10:25] PROBLEM - check if phabricator taskmaster is running on iridium is CRITICAL: PROCS CRITICAL: 0 processes with regex args PhabricatorTaskmasterDaemon [15:12:20] <_joe_> chasemp: are you doing something with phabricator? 
[15:12:31] its in a maint period yes [15:12:33] <_joe_> oh yes I see your commit above [15:12:37] <_joe_> sorry [15:12:38] https://www.mediawiki.org/wiki/Phabricator/Maintenance [15:13:27] sorry for the nimsoft pages [15:13:34] emails even I forget they exist [15:14:05] RECOVERY - check if phabricator taskmaster is running on iridium is OK: PROCS OK: 20 processes with regex args PhabricatorTaskmasterDaemon [15:16:00] akosiaris: whee! note that it won’t merge yet because of the dependencies. [15:16:47] plus I’ve to futz with wikitech before merging it as well [15:16:52] since the role names have changed [15:17:07] <_joe_> YuviPanda: yeah that's the most annoying part [15:18:36] _joe_: I thought you already added them in https://wikitech.wikimedia.org/w/index.php?title=Hiera%3ADeployment-prep&diff=140865&oldid=140717 [15:18:42] sigh I mean YuviPanda [15:18:55] akosiaris: that’s the hiera changes [15:19:02] akosiaris: but wikitech will still try to apply the ::beta roles [15:19:35] ah, yeah sorry [15:19:35] _joe_: yup. something to consider fixing in the horizon time. [15:19:37] <_joe_> akosiaris: you need to change the role names everywhere in ldap and in wikitech i suppose as well [15:19:46] yeah [15:19:50] well, I can do that, one patch at a time. [15:19:53] <_joe_> such a nice thing [15:20:07] no need to subject more people than necessary to the horror that is OSM :) [15:20:54] (03PS7) 10Yuvipanda: apertium: Unify production and beta roles [puppet] - 10https://gerrit.wikimedia.org/r/184586 (https://phabricator.wikimedia.org/T86633) [15:21:42] hmm, I can test these with cherry-picks, actually [15:24:53] YuviPanda: OpenStreetMap? [15:25:02] marktraceur: the other OSM [15:25:10] Optimize Social Media?
[15:25:22] marktraceur: no, that’s how us kids pronounce ‘awesome’ now [15:25:42] I believe it [15:25:45] Kids are weird [15:26:00] totes [15:26:08] (03CR) 10Yuvipanda: [C: 032] apertium: Unify production and beta roles [puppet] - 10https://gerrit.wikimedia.org/r/184586 (https://phabricator.wikimedia.org/T86633) (owner: 10Yuvipanda) [15:28:31] please be aware search is reindexing for phabricator, this may cause search oddness for ...an hour? [15:29:22] (03PS4) 10Yuvipanda: cxserver: Unify production and beta roles [puppet] - 10https://gerrit.wikimedia.org/r/184590 (https://phabricator.wikimedia.org/T86633) [15:29:50] hi [15:31:41] akosiaris: hmm [15:31:42] Warning: /Stage[main]/Cxserver/File[/srv/deployment/cxserver/cxserver/config.js]: Ensure set to :present but file type is link so no content will be synced [15:31:49] in beta [15:32:35] not sure if that symlink is hand done or what [15:32:40] kart_: ^ [15:32:49] lrwxrwxrwx 1 root root 34 Jan 12 17:29 config.js -> /srv/deployment/cxserver/config.js [15:32:56] vs prod which has an actual file there [15:34:53] YuviPanda: in beta, it is symlink [15:34:58] kart_: why so? [15:35:05] have changed anything there? [15:35:19] hmm? 
[15:35:25] nothing to affect it, no [15:35:31] but it’s giving me a warning, so am wondering :) [15:35:38] also am trying to eliminate differences between beta and prod [15:35:40] and this is one [15:36:01] YuviPanda: we use cxserver/deploy and use config.js maintained by puppet [15:36:06] That's why [15:37:25] YuviPanda: remove the link and let puppet populate the file [15:37:30] and delete the link's target [15:37:35] akosiaris: yeah, cool [15:38:09] done [15:38:19] (03PS4) 10Yuvipanda: citoid: Unify production and beta roles [puppet] - 10https://gerrit.wikimedia.org/r/184605 (https://phabricator.wikimedia.org/T86633) [15:38:24] (03PS1) 10Ottomata: Temporaily disable all bits varnishkafka production [puppet] - 10https://gerrit.wikimedia.org/r/184896 [15:39:24] YuviPanda: btw https://gerrit.wikimedia.org/r/#/c/184590/4/hieradata/role/common/cxserver/production.yaml this needs to be renamed as well me thinks. Since production was removed from the role name [15:39:26] no ? [15:40:01] akosiaris: oh, I’m not sure if the production is for realm or role name. [15:40:09] I just assumed it’s realm and didn’t do much there. [15:40:15] akosiaris: also I’m not sure if we should have that at all [15:40:25] akosiaris: since it’s the same as default [15:40:28] I think it should be /hieradata/role/common/cxserver.yaml [15:40:51] * _joe_ coughs [15:40:57] if we do decide to rely on the default, we should delete it yes [15:41:11] _joe_: please do tell :-) [15:41:20] (03CR) 10Ottomata: [C: 032] Temporaily disable all bits varnishkafka production [puppet] - 10https://gerrit.wikimedia.org/r/184896 (owner: 10Ottomata) [15:41:25] * YuviPanda realizes he hasn’t actually ever had to deal with hiera in prod yet [15:41:29] <_joe_> https://gerrit.wikimedia.org/r/#/c/184590/4/hieradata/role/common/cxserver/production.yaml is wrong [15:41:41] <_joe_> it's role/common/cxserver.yaml [15:41:45] right. [15:41:48] <_joe_> as akosiaris said :) [15:41:51] :-) [15:41:55] apologies. I’ll fix. 
[15:41:56] however [15:41:58] should I just delete it? [15:42:06] <_joe_> use https://wikitech.wikimedia.org/wiki/Puppet_Hiera as a reference [15:42:07] the default is good enough, and it’s confusing to specify the default twice, It hink [15:42:09] <_joe_> yes [15:42:15] yes, delete it [15:42:18] :D cool [15:42:21] and for apertium as well [15:42:26] <_joe_> that file works only if you have a role::cxserver::production [15:42:31] same thing there [15:44:20] btw, all this does make the path to an "nodejs service" module/role easier [15:44:21] yup. [15:44:21] have you used the (WMF) concur travel website now _joe_? [15:44:21] akosiaris: parsoid seems a bit more complicated though. [15:44:42] <_joe_> mark: ahah yes [15:44:46] <_joe_> it's horrible [15:44:59] <_joe_> well, not properly /used/ [15:45:01] (03CR) 10BBlack: "So this slipped through the cracks in my messy review queue obviously. At this point, I'm thinking maybe instead of a cookie on only text" [puppet] - 10https://gerrit.wikimedia.org/r/158016 (https://bugzilla.wikimedia.org/70181) (owner: 10Dduvall) [15:46:10] !log stopping all varnishkafka bits instances [15:46:15] Logged the message, Master [15:46:22] (03PS1) 10Yuvipanda: Kill apertium and cxserver hiera files [puppet] - 10https://gerrit.wikimedia.org/r/184899 [15:47:12] (03CR) 10Yuvipanda: [C: 032] Kill apertium and cxserver hiera files [puppet] - 10https://gerrit.wikimedia.org/r/184899 (owner: 10Yuvipanda) [15:47:21] akosiaris: ^ done [15:49:41] (03CR) 10Alexandros Kosiaris: [C: 032] citoid: Unify production and beta roles [puppet] - 10https://gerrit.wikimedia.org/r/184605 (https://phabricator.wikimedia.org/T86633) (owner: 10Yuvipanda) [15:50:15] marktraceur, manybubbles, ^d: Who wants to SWAT today? 
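A minimal sketch of the hiera naming convention that _joe_ and akosiaris describe in the exchange above (15:39 to 15:42): a role-level file lives under hieradata/role/common/ at a path derived from the role class name, which is why production.yaml would only apply to a class named role::cxserver::production. The function name hiera_role_path is hypothetical, and the code only illustrates the path mapping as stated in the chat, not how the hiera backend itself resolves lookups.

```python
def hiera_role_path(role_class: str) -> str:
    """Return the role-level hiera file for a Puppet role class name.

    Assumption: the mapping is exactly the one stated in the chat,
    i.e. role::a::b -> hieradata/role/common/a/b.yaml.
    """
    parts = role_class.split("::")
    if len(parts) < 2 or parts[0] != "role":
        raise ValueError(f"not a role class: {role_class!r}")
    return "hieradata/role/common/" + "/".join(parts[1:]) + ".yaml"


if __name__ == "__main__":
    # The class that exists after the beta/production roles were unified.
    print(hiera_role_path("role::cxserver"))
    # The class the leftover production.yaml would have needed.
    print(hiera_role_path("role::cxserver::production"))
```

Under this mapping, once the role is just role::cxserver, a file at cxserver/production.yaml is never consulted, which matches the decision in the chat to delete it.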
[15:50:30] <^d> jouncebot_: next [15:50:30] In 0 hour(s) and 9 minute(s): Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150114T1600) [15:50:34] i'm doing a meeting that day [15:50:36] there’s a swat patch for wikitech, should it go first or last? [15:50:47] hmm, probably last, in case the sync there kills wikitech :) [15:50:56] and im chilling my hand so slow typing [15:51:43] James_F|Away, tto, m4tx: Ping for SWAT in about 8.5 minutes [15:53:34] anomie: I can if you don't want to [15:53:34] marktraceur: Go for it [15:53:34] I am around and could swat, haven't done it before though [15:53:34] (03PS4) 10Yuvipanda: mathoid: Unify mathoid production and beta roles [puppet] - 10https://gerrit.wikimedia.org/r/184884 (https://phabricator.wikimedia.org/T86633) [15:53:34] Krenair: Sounds like fun, want a chaperone? [15:53:34] <^d> Krenair: It's like a normal deploy except it's not your own code :) [15:53:58] marktraceur, that would be good [15:54:03] Cool. [15:54:25] Got a full plate too. [15:54:37] James_F|Away's patch is supposed to be first but he's not here [15:55:03] I put my patch in the correct place for this SWAT, right? [15:55:14] Krenair: The order is up to you, if he's not here shunt him to the end, if he doesn't get here, tell him to wait until the evening SWAT [15:55:18] YuviPanda: Yup [15:55:22] cool [15:55:25] it says "(do this before other deploys!)" [15:55:54] (03CR) 10Yuvipanda: [C: 032] mathoid: Unify mathoid production and beta roles [puppet] - 10https://gerrit.wikimedia.org/r/184884 (https://phabricator.wikimedia.org/T86633) (owner: 10Yuvipanda) [15:56:00] Krenair: Ah, hm. [15:56:08] there he is [15:57:34] akosiaris: alright, allmerged. Thank you! :) [15:57:48] only parsoid left [15:57:52] and it seems to be somewhat more complex [15:58:10] Bah, ruddy WiFi. [15:58:12] You know me, always just on time. :-) [15:58:14] marktraceur: Jenkins won't let you merge things if you don't +2 Timo's patches first. 
[15:58:18] marktraceur: Hence why I put it in SWAT, rather than let people complain on a list. [15:59:24] okay [16:00:04] manybubbles, anomie, ^d, marktraceur, James_F: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150114T1600). [16:00:57] hm, chrome is misbehaving [16:01:50] Krenair: Sounds like the subtitle of RoanKattouw_away's talk. :-) [16:02:18] FF worked though [16:06:52] 3Phabricator, operations: The options of the Security dropdown in Phabricator need to be clear and documented - https://phabricator.wikimedia.org/T76564#976080 (10chasemp) [16:07:44] hey _joe_, yt? i got a system user git push ssh key question for ya [16:08:01] YuviPanda: BTW, we're still waiting for a decision on how we're exposing citoid; it's blocking deployment. :-( Do you know anything about that? [16:08:17] James_F: hmm, nope. akosiaris maybe? [16:08:28] 3operations, Multimedia, ops-core: Convert Imagescalers to HHVM, Trusty - https://phabricator.wikimedia.org/T84842#976103 (10Bawolff) A random assortment of images to test that covers a wide variety of formats we use (Just going to give thumbnail urls, hope that's good): * Super large tiff file: https://upload.... [16:08:34] I guess we could do it the same way we do cxserver (whatever that is) [16:08:42] 3operations, Beta-Cluster: Renumber apache user/group to uid=48 - https://phabricator.wikimedia.org/T78076#976105 (10bd808) >>! In T78076#975352, @yuvipanda wrote: > Why is this needed again? T76086 seems to have fixed T75206. And as @ori said, we should be agnostic about the numeric values unless there's a very... [16:09:23] marktraceur, does jenkins usually take this long to merge to the wmf branches? [16:09:34] Krenair: Yes. [16:09:38] sigh [16:09:40] Krenair: Same as master. 
[16:09:53] Krenair: Sorry :( [16:10:01] well it's not your fault :) [16:10:08] No, but it sucks [16:10:09] YuviPanda: It needs client direct access and it doesn't need caching. We were going to pipe requests through the misc Varnish setup, but it was veto'ed. [16:11:43] at least with master I can just leave it and will notice a few hours later if jenkins broke [16:12:07] True. [16:12:26] James_F: hmm, any discussions you can point me to? [16:12:46] or via notification etc. [16:13:07] um, James_F, marktraceur [16:13:08] YuviPanda: https://phabricator.wikimedia.org/T76949 (see https://gerrit.wikimedia.org/r/#/c/178419/ for the original intent). [16:13:13] jenkins failed [16:13:18] Krenair: It did? Oh dear. [16:13:22] _joe_: Any chance you could do my scap proxy updates before todays deploy? :) [16:13:45] after wasting 10 minutes... [16:13:46] Super. [16:13:47] Eurgh, yeah, the git clone failed? [16:13:48] Well... [16:13:53] That's pretty poor. [16:13:54] the other changes are for wmf14, maybe that could be tried instead [16:13:57] And novel. [16:14:04] Krenair: Yeah, I'd say try the other one [16:14:12] YuviPanda: But it then became "OK, let's build the Services cluster". [16:14:21] PROBLEM - Varnishkafka Delivery Errors on cp3016 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 539.424988 [16:14:33] Krenair: You can merge both branches at once, by the way. [16:14:46] <_joe_> Reedy: yep sorry [16:14:53] <_joe_> got lost in other things [16:14:56] James_F: yes, restbase has the exact same requirement [16:14:56] YuviPanda: But I don't know where that went. [16:15:04] 3Multimedia, operations, Wikimedia-Media-storage: Upgrade poppler-utils to at least 1.20 - https://phabricator.wikimedia.org/T57624#976120 (10Bawolff) [16:15:10] so did cxserver [16:15:11] Reedy: kart_ has some ContentTranslation stuff to push out today, he'll need some assistance, can you help before the MW window? 
[16:15:11] akosiaris: Yes, but this service was scheduled for deployment months ago and is still waiting. :-) [16:15:21] greg-g: I should be able to, yeah [16:15:28] Reedy: cool [16:15:29] akosiaris: TBF we weren't ready months ago, and it's not Operations' fault at all. [16:15:38] akosiaris: But making some progress would be great. [16:15:55] is that the only blocker ? [16:15:59] marktraceur, okay. will remember that for next time [16:16:13] akosiaris: Plus merging the code that uses it, pretty much. [16:16:14] there is a plan to have cxserver, restbase, citoid behind a "services" varnish cluster [16:16:27] akosiaris: This is an uncacheable service, though. [16:16:38] for now cxserver has gone uncached through the parsoid varnishes [16:16:49] akosiaris: I'm fine for it to go wherever, but if we're letting perfect be the enemy of the good I'll be sad. [16:17:02] we could do the same for citoid as a temporary solution to unblock that [16:17:04] That sounds fine for citoid. [16:17:05] (03PS4) 10Giuseppe Lavagetto: Add extra scap proxies for A7, B7 and B8 [puppet] - 10https://gerrit.wikimedia.org/r/184817 (https://phabricator.wikimedia.org/T1342) (owner: 10Reedy) [16:17:21] How much does that make you unhappy, though? :-) Don't want to make it too crufty. [16:17:21] same might happen for citoid, we are blocked on some procurement IIRC [16:17:31] Ah. [16:17:31] RECOVERY - Varnishkafka Delivery Errors on cp3016 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [16:17:43] (03CR) 10Giuseppe Lavagetto: [C: 032] Add extra scap proxies for A7, B7 and B8 [puppet] - 10https://gerrit.wikimedia.org/r/184817 (https://phabricator.wikimedia.org/T1342) (owner: 10Reedy) [16:17:45] In that case, yeah, shall we try that? Should I try to write a patch(!)? [16:18:33] yes, if that is the only blocker for citoid (aka nothing other blocks deployment) please do [16:18:46] akosiaris: Thanks! 
:-) [16:18:52] I 'll talk with the rest of the team, I don't see why they would object, just so everyone knows [16:18:59] * James_F nods. [16:19:04] <_joe_> I object! [16:19:08] <_joe_> :P [16:19:45] * James_F grins at _joe_. [16:19:54] James_F: https://gerrit.wikimedia.org/r/#/c/181613/ is probably going to be of help to you [16:20:02] akosiaris: Thanks. ;-) [16:20:44] alright that went through [16:22:16] Krenair: Try the wmf13 one again. [16:22:52] trying [16:23:45] (03PS3) 10Yuvipanda: Make wikitech default shell /bin/bash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184635 (https://phabricator.wikimedia.org/T86668) [16:24:04] marktraceur, so just sync-file php-1.25wmf14/Gruntfile.js and same for package.json? [16:24:32] Krenair: Not sure what the order should be there - a sync-dir may be safer, I don't know [16:25:22] considering it's renaming files... [16:25:34] will sync-dir php-1.25wmf14 then, ok marktraceur? [16:26:04] Mm-hmm [16:26:07] It'll take longer [16:26:11] But safer for sure [16:26:23] Then again, I don't think grunt or package.json need to be deployed, so it's up to you. [16:26:47] I wondered that, but it should be kept in sync shouldn't it? [16:26:49] <_joe_> Reedy: merged [16:27:39] 3operations, ops-requests: Configure twemproxy to bind a unix domain socket - https://phabricator.wikimedia.org/T83328#976155 (10Joe) The package is ready, uploaded to apt, tested on beta and deployed to prod. All hosts but non-HAT ones are using the UNIX socket [16:27:50] 3operations, ops-requests: Configure twemproxy to bind a unix domain socket - https://phabricator.wikimedia.org/T83328#976156 (10Joe) 5Open>3Resolved [16:27:57] marktraceur, is tto here? [16:28:00] 3operations, Code-Review, Wikimedia-Git-or-Gerrit: Chrome warns about insecure certificate on gerrit.wikimedia.org - https://phabricator.wikimedia.org/T76562#976157 (10mark) >>! 
In T76562#964609, @RobH wrote: > So this won't work behind misc-web-lb until after the gerrit dependency on using sshd on the same inte... [16:28:08] tt nope [16:28:22] m4tx, here? [16:28:24] Krenair: Give him a bit and if he doesn't show up, boot him to the evening slot [16:28:46] Maybe send a memoserv memo [16:28:48] ™ [16:29:01] I don't know their nickserv name [16:29:09] oh, is tto. okay [16:29:57] (03PS1) 10Jforrester: Re-use Parsoid Varnishes for citoid too [puppet] - 10https://gerrit.wikimedia.org/r/184904 (https://phabricator.wikimedia.org/T76949) [16:30:15] _joe_: woop [16:30:59] marktraceur, So James_F's wmf13 patch failed to merge, wmf14 version does not need syncing it seems (just touches gruntfile and stuff for tests to make +2 possible) [16:31:08] tto is not here [16:31:15] 3operations: restructure site.pp to use roles, hiera. - https://phabricator.wikimedia.org/T86774#976179 (10Joe) 3NEW a:3Joe [16:31:30] m4tx is here but not responding [16:31:42] /me waves [16:31:47] Krenair: Sync it anyway [16:31:47] that leaves YuviPanda's wikitech config [16:31:58] Krenair: even if it doesn't touch production, we sync it [16:31:58] Krenair: Neither the wmf13 nor wmf14 patches need syncing for production, but it's not cool to leave them around. [16:32:09] Also, marktraceur put it more succinctly. :-) [16:32:12] Krenair: I'm here [16:32:19] I just said it doesn't need to sync, so you have options [16:32:35] You don't need to worry about bugs arising from incorrect order of syncing the files or whatever [16:32:47] Krenair: BTW, -hhvm has already failed on https://gerrit.wikimedia.org/r/#/c/184821/ :-( [16:33:14] Krenair: Same recurring zuul.clone bug. [16:33:15] * James_F sighs. [16:33:19] PROBLEM - Varnishkafka Delivery Errors per minute on cp3015 is CRITICAL: CRITICAL: 37.50% of data above the critical threshold [20000.0] [16:33:25] okay, syncing wmf14 change for James_F [16:33:32] Thansk. [16:33:37] Also, Thanks. 
[16:33:41] m4tx is next [16:34:34] James_F: You said Jenkins will fail the other patches though, right? [16:34:59] marktraceur: If it doesn't we've got bigger problems. [16:35:04] * James_F sighs. [16:35:04] Fun. [16:36:41] via sync-dir because marktraceur said that'd be safer (multiple files renamed) [16:38:14] Really we should have just cowboy-deployed these last night. [16:38:16] * James_F sighs. [16:38:27] !log krenair Synchronized php-1.25wmf14: https://gerrit.wikimedia.org/r/#/c/184818/ (duration: 06m 14s) [16:38:31] Logged the message, Master [16:38:38] marktraceur, so that should be fine now [16:39:05] fyi: https://phabricator.wikimedia.org/T86730#976230 [16:39:11] 3operations, Code-Review, Wikimedia-Git-or-Gerrit: Chrome warns about insecure certificate on gerrit.wikimedia.org - https://phabricator.wikimedia.org/T76562#976232 (10Glaisher) [16:39:11] (re the -hhvm job failure) [16:41:23] marktraceur, should I have started jenkins on this one for m4tx earlier? [16:41:53] Krenair: No, one patch on each branch at a time [16:42:10] Krenair: And there isn't a deploy after you, so you aren't crunched for time [16:42:43] (03CR) 10Alexandros Kosiaris: [C: 032] ldap: Install ldapvi on terbium [puppet] - 10https://gerrit.wikimedia.org/r/184889 (owner: 10Yuvipanda) [16:48:00] m4tx, hey [16:48:01] !log krenair Synchronized php-1.25wmf14/extensions/MultimediaViewer/resources/mmv/mmv.lightboxinterface.js: https://gerrit.wikimedia.org/r/#/c/184633/ (duration: 00m 07s) [16:48:06] please check [16:48:09] Logged the message, Master [16:48:42] Yay! [16:49:01] Krenair: sec [16:50:09] RECOVERY - Varnishkafka Delivery Errors per minute on cp3015 is OK: OK: Less than 1.00% above the threshold [0.0] [16:52:30] m4tx, ... 
[16:52:48] (so close to fixing that -hhvm issue, once I get sudo) [16:53:08] Krenair: works, sorry for late reply [16:53:11] great [16:53:22] YuviPanda, ping [16:53:44] Krenair: pong [16:54:03] (03CR) 10Alex Monk: [C: 032] Make wikitech default shell /bin/bash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184635 (https://phabricator.wikimedia.org/T86668) (owner: 10Yuvipanda) [16:54:07] (03Merged) 10jenkins-bot: Make wikitech default shell /bin/bash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184635 (https://phabricator.wikimedia.org/T86668) (owner: 10Yuvipanda) [16:54:43] (03PS1) 10Ottomata: Puppetize automated git push via http for stats user [puppet] - 10https://gerrit.wikimedia.org/r/184909 [16:55:27] (03CR) 10jenkins-bot: [V: 04-1] Puppetize automated git push via http for stats user [puppet] - 10https://gerrit.wikimedia.org/r/184909 (owner: 10Ottomata) [16:55:41] !log krenair Synchronized wmf-config/wikitech.php: https://gerrit.wikimedia.org/r/#/c/184635/ (duration: 00m 06s) [16:55:43] YuviPanda, ^ [16:55:45] Logged the message, Master [16:55:55] Krenair: cool, thanks! [16:55:59] Krenair: I need to run sync now on virt1000 [16:56:04] wikitech isn’t touched by sync from tin [16:56:13] (03PS2) 10Ottomata: Puppetize automated git push via http for stats user [puppet] - 10https://gerrit.wikimedia.org/r/184909 [16:56:14] okay. and virt1000 is presumably more restricted than tin? [16:56:19] Krenair: yup [16:56:26] do you remember what I should run? [16:56:28] scap? [16:56:30] sync-common? 
[16:56:40] it was one file, sync-file wmf-config/wikitech.php [16:57:12] YuviPanda: on virt1000 you should run `sync-common --verbose` [16:57:21] (03PS3) 10Ottomata: Puppetize automated git push via http for stats user [puppet] - 10https://gerrit.wikimedia.org/r/184909 [16:57:22] right [16:57:37] !log running sync-common —verbose on virt1000 [16:57:42] Logged the message, Master [16:58:12] sync-file is a push command from tin but virt1000 isn't actually in the dsh group that is effected by pushes. It can only pull today. [16:58:20] yup [16:58:38] * affected. [16:58:40] (Sorry.) [16:58:54] tto's change will be bumped to the next swat window, James_F's wmf13 change can wait [16:58:58] (03CR) 10CSteipp: "Works for me." [puppet] - 10https://gerrit.wikimedia.org/r/158016 (https://bugzilla.wikimedia.org/70181) (owner: 10Dduvall) [16:59:08] (and can't go anyway until jenkins stops being silly) [16:59:12] Krenair: Apparently it should now work. [17:01:10] well.. it is technically at the end of the window and I'm not convinced it's really needed :) [17:01:37] the next deployment (mw train) is moving group2 from wmf13 to wmf14, so... [17:01:39] Krenair: TBF wmf13 is getting un-deployed in a few hours. [17:01:43] exactly [17:01:43] Indeed. [17:02:02] (03CR) 10Ottomata: [C: 032] Puppetize automated git push via http for stats user [puppet] - 10https://gerrit.wikimedia.org/r/184909 (owner: 10Ottomata) [17:05:00] RIGHT. [17:05:02] uh, YuviPanda [17:05:10] is everything okay on wikitech? [17:05:20] uh, it’s still deploying [17:05:27] ofcourse not [17:05:31] jesus [17:05:31] ah, ok :) [17:05:41] why the fuck [17:05:43] is it syncing [17:05:50] 1.25wmf2?! [17:06:28] we really need to add wikitech to the dsh group [17:06:32] or figure some other way of doing this [17:07:23] YuviPanda: What have you done!? 
[17:08:08] !log delete php-1.25wmf1 to 11 on virt1000 [17:10:04] [Wed Jan 14 17:05:26 2015] [error] [client 65.19.138.33] PHP Fatal error: Unsupported operand types in /srv/mediawiki/php-1.25wmf14/languages/Language.php on line 481 [17:10:07] uh oh [17:10:14] That usually means something is null [17:10:36] localisation cache not rebuilt? [17:10:36] (03CR) 10Ottomata: "I think hardcoding this IP here will bite us later. Luckily, I hope to be removing this udp2log sometime in the near future (as I always " [puppet] - 10https://gerrit.wikimedia.org/r/184803 (owner: 10Dzahn) [17:11:06] hmm, I ran sync-common [17:11:12] Krenair: https://gerrit.wikimedia.org/r/#/c/184821/ V+2'ed. [17:11:14] does that rebuild l10n cache? [17:11:43] Reedy: hmm, let me run scap after I’m done pruning all the old old php-* things [17:11:50] James_F, I was planning to abandon that [17:11:50] YuviPanda: scap-rebuild-cdbs [17:12:42] !log running scap-rebuild-cdbs on virt1000 [17:12:48] 3Wikimedia-DNS, operations, Wikimedia-General-or-Unknown: m.{project}.org portal/redirect consistency and i18n issues - https://phabricator.wikimedia.org/T78421#976356 (10Glaisher) DNS set up now. Thanks! Looks like missing.php is doing the redirects for those domains to incubator now. Do we now make missing.ph... [17:13:22] 3Wikimedia-DNS, operations, Wikimedia-General-or-Unknown: m.{project}.org portal/redirect consistency and i18n issues - https://phabricator.wikimedia.org/T78421#976364 (10Glaisher) scratch that, missing.php is for missing wikis, not mobile wikis. need an apache config for this. 
[17:13:50] 3Wikimedia-DNS, operations, Wikimedia-Apache-configuration, Wikimedia-General-or-Unknown: m.{project}.org portal/redirect consistency and i18n issues - https://phabricator.wikimedia.org/T78421#976366 (10Glaisher) [17:15:50] Reedy: Krenair wikitech is back up [17:15:57] lol [17:16:30] either we should add this to the dsh group, or I should learn how our deployment system works :P [17:16:38] !bug 1 [17:16:42] heh [17:16:59] but in theory, sync-common && scap-rebuild-cdbs should be enough [17:17:03] right [17:17:08] in that order? [17:17:14] indeed [17:17:34] though, wikitech might be broken for a few mins [17:17:38] hmm, since I interrupted sync-common earlier (because it was syncing really old branches) I suppose I should run them again [17:17:50] makes sure things are properly in sync [17:18:06] I guess I can rm everything except wmf13 and 14 [17:18:10] YuviPanda: no [17:18:18] Because next sync-common will just recreate everything eles [17:18:28] we have wmf5-14 on disk currently [17:18:28] Reedy: in /srv/mediawiki? [17:18:40] I think I need to delete some more... [17:18:57] Reedy: hmm, where is this copying from? tin? [17:19:08] 3ops-network: Possible bandwidth saturation on bits esams - https://phabricator.wikimedia.org/T86749#976394 (10faidon) I don't think it's suspicious, just the natural balance of traffic -- esams' traffic is more stable due to regional/scope differences, plus the scale of the graph can be misleading (every point... [17:19:22] yeah [17:19:33] if we work out what can be deleted, I can delete some [17:20:13] mmm [17:20:36] Anything less than 10 can go [17:20:40] * Reedy deletes for starters [17:21:24] rm: cannot remove `php-1.25wmf5/.git/modules/extensions/NavigationTiming/rr-cache/64a08cb15637108ae15a77d1258e29ba98c8c3aa/preimage': Permission denied [17:21:25] bah [17:22:10] YuviPanda: fancy rm -rf /srv/mediawiki-staging/php-1.25wmf5 on tin? 
[17:22:14] yeah sure [17:23:02] (03PS1) 10Reedy: Delete 1.25wmf[6-9] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184917 [17:23:13] !log rm -rf /srv/mediawiki-staging/php-1.25wmf5 on tin [17:23:17] Logged the message, Master [17:23:35] (03CR) 10Reedy: [C: 032] Delete 1.25wmf[6-9] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184917 (owner: 10Reedy) [17:23:40] (03Merged) 10jenkins-bot: Delete 1.25wmf[6-9] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184917 (owner: 10Reedy) [17:24:00] Reedy: alright, now let me run another sync-common && scap-rebuild-cdbs because none of the sync-commons before have completed. [17:24:17] should hopefully be faster [17:24:52] shouldn't need the rebuild-cdbs [17:25:04] Reedy: well, too late :P [17:25:14] Reedy: is our deployment system documented soeplace? [17:25:15] *someplace [17:25:38] mw1152: Permission denied (publickey). [17:25:40] Wikitech! [17:27:28] !log sync-common && scap-rebuild-cdbs completed on virt1000, wikitech still seems up. Tending to the wounded. [17:27:32] Logged the message, Master [17:27:42] !log deleted php-1.25wmf5 through php-1.25wmf9 from /srv/mediawiki [17:27:46] Logged the message, Master [17:27:50] 3operations: Requesting access to gallium for cmcmahon - https://phabricator.wikimedia.org/T86685#976418 (10Jgreen) a:3Jgreen [17:29:11] Reedy: Krenair yay, patch has taken effect :) [17:29:16] (03PS3) 10Ottomata: udp2log: tighten up firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/184695 (owner: 10Faidon Liambotis) [17:29:36] !log reedy Purged l10n cache for 1.25wmf12 [17:29:36] paravoid: i'm not 100% sure that is needed, but seeing that there is a ssh-bastion rule to open up ssh from bastion, I assume that ssh is not open by default [17:29:40] Logged the message, Master [17:29:50] nothing is open by default [17:29:57] the default ruleset has a DROP policy [17:30:05] anything that's not explicitly allowed is dropped [17:30:07] right, but there are a couple of holes, e.g. 
bastion ssh [17:30:10] in base::frirewall [17:30:12] yes [17:30:25] k [17:30:28] aye then this is likely needed [17:30:36] !log mw1152: Permission denied (publickey). [17:30:37] i just looked through puppet related to the udp2log boxes [17:30:40] Logged the message, Master [17:30:43] that's the only addition i could find [17:30:49] maybe there are more, but that's all I can think of [17:30:52] base::firewall is included [17:30:56] aye [17:30:59] so this would be included as well [17:31:00] i mean this: [17:31:00] https://gerrit.wikimedia.org/r/#/c/184695/3/manifests/site.pp [17:31:01] yeah [17:32:06] ok [17:32:24] why not in the webstatscollector role? [17:32:29] not include that* [17:33:11] i guess that would be fine. yeah that makes more sense, thanks [17:33:46] (03PS4) 10Ottomata: udp2log: tighten up firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/184695 (owner: 10Faidon Liambotis) [17:34:16] 3operations, Project-Creators: HTTPS phabricator project(s) - https://phabricator.wikimedia.org/T86063#976428 (10Aklapper) So I think we are done here (two projects created) and we can close this as resolved? Or anything left to sort out? [17:34:33] paravoid: i'm ok with merging that and watching it, if you like. after SoS perhaps? 
[17:34:35] 3operations, Project-Creators: HTTPS phabricator project(s) - https://phabricator.wikimedia.org/T86063#976429 (10faidon) 5Open>3Resolved [17:34:47] wfm [17:34:53] k [17:48:19] ottomata: glad it worked out, there was just that syntax issue in the ud2plog_rsync rule from some time before [17:49:22] 3ops-eqiad: please check db1051 and db1056 for 2.5" disk brackets - https://phabricator.wikimedia.org/T86788#976499 (10RobH) 3NEW a:3Cmjohnson [17:55:09] (03PS1) 10Rush: phab set owner and permissions on repo directory [puppet] - 10https://gerrit.wikimedia.org/r/184924 [17:55:11] (03PS1) 10Giuseppe Lavagetto: parsoid: use hiera, role [puppet] - 10https://gerrit.wikimedia.org/r/184925 [17:55:13] (03PS1) 10Giuseppe Lavagetto: restbase: use role, hiera [puppet] - 10https://gerrit.wikimedia.org/r/184926 [17:55:15] (03PS1) 10Giuseppe Lavagetto: lvs: use role, hiera [puppet] - 10https://gerrit.wikimedia.org/r/184927 [17:55:17] (03PS1) 10Giuseppe Lavagetto: memcached: use role, hiera [puppet] - 10https://gerrit.wikimedia.org/r/184928 [17:55:19] (03PS1) 10Giuseppe Lavagetto: redis: user role, hiera [puppet] - 10https://gerrit.wikimedia.org/r/184929 [17:55:59] <_joe_> I'm off for today, if anyone fancies taking a look :) [17:57:54] (03Abandoned) 10Rush: phab set owner and permissions on repo directory [puppet] - 10https://gerrit.wikimedia.org/r/184924 (owner: 10Rush) [18:00:11] (03PS1) 10Rush: phab set owner and permissions on repo directory [puppet] - 10https://gerrit.wikimedia.org/r/184930 [18:01:43] (03CR) 10Rush: [C: 032] phab set owner and permissions on repo directory [puppet] - 10https://gerrit.wikimedia.org/r/184930 (owner: 10Rush) [18:13:12] (03CR) 10Dzahn: [C: 032] Don't use logrotate for the wikidata dump logs [puppet] - 10https://gerrit.wikimedia.org/r/182173 (owner: 10Hoo man) [18:19:31] !log deleted /etc/logrotate.d/dumpwikidatajson on snapshot1003 (gerrit 182173) [18:19:43] Logged the message, Master [18:20:53] (03CR) 10Dzahn: "applied on snapshot1003, 
just had to manually deleted the file from /etc/cron.d/ , puppet would only do that if it had been ensure => abse" [puppet] - 10https://gerrit.wikimedia.org/r/182173 (owner: 10Hoo man) [18:26:36] (03PS1) 10RobH: virt1000.wikimedia.org revoked [puppet] - 10https://gerrit.wikimedia.org/r/184938 [18:27:45] (03CR) 10Dzahn: "typo, i meant /etc/logrotate.d/" [puppet] - 10https://gerrit.wikimedia.org/r/182173 (owner: 10Hoo man) [18:28:18] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 203, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-4/2/0: down - Core: cr2-codfw:xe-5/2/1 (Telia, IC-307236) (#3658) [10Gbps wave]BR [18:29:16] ? [18:29:21] that looks bad? [18:29:28] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 205, down: 0, dormant: 0, excluded: 0, unused: 0 [18:29:39] thanks. [18:29:51] :) [18:30:01] at least just one interface ?:p [18:32:46] (03CR) 10RobH: [C: 032] virt1000.wikimedia.org revoked [puppet] - 10https://gerrit.wikimedia.org/r/184938 (owner: 10RobH) [18:35:23] Hi Reedy and greg-g :) There's a small change to a EducationProgram extension API that I'm trying to get onto today's train directly out to production, if that's feasible :) Here's the first patch to merge to the mwf14 branch, hope it's correct. https://gerrit.wikimedia.org/r/#/c/184940/ (I don't have +2 powers for that branch.) Is this doable? If so could maybe you +2? Many thanks in advance! [18:36:10] It's already been tested on the beta cluster and only affects an API used by the WikiEducation foundation, who need it to continue some of their outside-MW development... [18:50:17] (03CR) 10RobH: [C: 04-1] "Shouldn't any permission change like this have an associated access request ticket/task?" 
[puppet] - 10https://gerrit.wikimedia.org/r/180221 (owner: 10Cscott) [19:00:04] Reedy, greg-g: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150114T1900). Please do the needful. [19:00:04] Dear anthropoid, the time has come. Please deploy ContentTranslation (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150114T1900). [19:05:46] (03PS5) 10Ottomata: udp2log: tighten up firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/184695 (owner: 10Faidon Liambotis) [19:07:24] (03CR) 10Ottomata: [C: 032] udp2log: tighten up firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/184695 (owner: 10Faidon Liambotis) [19:10:04] Reedy: When's https://www.mediawiki.org/w/index.php?title=MediaWiki_1.25/wmf15/Changelog going to be populated?! ;-) [19:10:25] James_F: When shit is branched :P [19:10:35] Reedy: Bah. :-P [19:10:50] It's on Translate atm [19:11:36] Reedy: preparing yet another backport for wmf14 [19:11:42] lol [19:11:47] probalby need 5 min or however fast jenkins allows [19:12:02] aude: So… 20 minutes? :-) [19:12:25] * aude hope not! [19:13:03] aude: presume it won't need a scap? [19:13:09] no [19:13:15] it's just moving some js files [19:13:45] since stuff in wikibase repo is not availble on clients... [19:14:01] (03Abandoned) 10QChris: Clone analytics/aggregator through SSH to ease automated daily push of data [puppet] - 10https://gerrit.wikimedia.org/r/184885 (owner: 10QChris) [19:23:57] 3Ops-Access-Requests, operations: Give parsoid admins the ability to update/restart the RT testing service - https://phabricator.wikimedia.org/T86804#976874 (10Dzahn) 3NEW [19:24:47] (03PS2) 10Dzahn: Give parsoid admins the ability to update/restart the RT testing service. [puppet] - 10https://gerrit.wikimedia.org/r/180221 (https://phabricator.wikimedia.org/T86804) (owner: 10Cscott) [19:25:18] (03CR) 10Dzahn: "created ticket. 
T86804" [puppet] - 10https://gerrit.wikimedia.org/r/180221 (https://phabricator.wikimedia.org/T86804) (owner: 10Cscott) [19:27:47] James_F: https://www.mediawiki.org/wiki/MediaWiki_1.25/wmf15/Changelog [19:28:30] (03PS1) 10Reedy: add symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184965 [19:28:51] (03CR) 10Reedy: [C: 032] add symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184965 (owner: 10Reedy) [19:29:06] (03CR) 10Dzahn: "I'll merge it if Krinkle is fine with the latest version that has been uploaded via Gerrit patch uploader." [puppet] - 10https://gerrit.wikimedia.org/r/177128 (https://phabricator.wikimedia.org/T75997) (owner: 10Krinkle) [19:35:46] (03Merged) 10jenkins-bot: add symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184965 (owner: 10Reedy) [19:35:47] 3ops-codfw: rack mw2135 through mw2215 - https://phabricator.wikimedia.org/T86806#976931 (10RobH) 3NEW a:3RobH [19:36:50] 3ops-codfw: rack and initial configuration of wtp2001-2020 - https://phabricator.wikimedia.org/T86807#976946 (10RobH) 3NEW a:3RobH [19:37:23] (03PS1) 10Reedy: testwiki to 1.25wmf15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184971 [19:37:25] (03PS1) 10Reedy: wikipedias to 1.25wmf14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184972 [19:37:27] (03PS1) 10Reedy: group0 to 1.25wmf15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184973 [19:38:05] (03CR) 10Reedy: [C: 032] testwiki to 1.25wmf15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184971 (owner: 10Reedy) [19:38:46] !log reedy Started scap: testwiki to 1.25wmf15 and build l10n caches. Some extension bumps in wmf14 [19:38:53] Logged the message, Master [19:39:02] Reedy: Thanks! [19:39:45] Reedy: BTW, are you suppressing items in the changelog that touch submodules? 
Most/many of the VisualEditor changes each week happen in the VisualEditor-core submodule… [19:40:06] Not explictly [19:40:14] I don't think the script has been setup for sub submodules [19:40:18] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There are 0 unmerged changes in mediawiki_config (dir /srv/mediawiki-staging/). [19:40:32] Does it ignore items that start with "Update" then? [19:41:09] $skipLines = array( [19:41:09] 'Localisation updates from', [19:41:09] 'COMMITMSG', // Fix for escaping fail leaving a commit summary of $COMMITMSG [19:41:09] 'Add (\.gitreview and )?\.gitignore', [19:41:10] 'Creating new WMF', [19:41:12] 'Commit of various live hacks', // Our catchall patch for live hacky stuff [19:41:14] 'Applied patches to new WMF', [19:41:17] '(Updat(e|ing) )? ?.*? to (master|head|[0-9a-f]{5,40}|production tip)', // Update foo to master [19:41:19] 'Bump .*? for deployment', [19:41:21] ); [19:41:50] (03Merged) 10jenkins-bot: testwiki to 1.25wmf15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184971 (owner: 10Reedy) [19:42:39] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge. 
[19:43:02] YuviPanda: [19:43:02] [19:43:13] https://wikitech.wikimedia.org/wiki/Special:SpecialPages [19:43:23] missing message in osm it seems [19:45:34] !log reedy scap failed: CalledProcessError Command '('/usr/bin/git', 'merge-base', 'HEAD', 'gerrit\norigin')' returned non-zero exit status 128 (duration: 06m 47s) [19:45:41] Logged the message, Master [19:45:46] (03PS2) 10Ottomata: Add logs from 'misc' caches to kafka pipeline [puppet] - 10https://gerrit.wikimedia.org/r/184183 (owner: 10QChris) [19:47:07] 3Ops-Access-Requests, operations: Give "hoo" sudo access to dataset snapshot hosts - https://phabricator.wikimedia.org/T86808#976980 (10Dzahn) 3NEW [19:47:09] PROBLEM - Varnishkafka Delivery Errors per minute on cp3015 is CRITICAL: CRITICAL: 37.50% of data above the critical threshold [20000.0] [19:47:19] that sucks [19:47:22] wtf [19:47:51] 3Ops-Access-Requests, operations: Give "hoo" sudo access to dataset snapshot hosts - https://phabricator.wikimedia.org/T86808#976994 (10Dzahn) p:5Triage>3Normal [19:48:12] (03CR) 10QChris: [C: 04-1] "We want to do some more kafka testing before increasing the load on them." [puppet] - 10https://gerrit.wikimedia.org/r/184183 (owner: 10QChris) [19:48:20] merge base? [19:48:30] norigin? [19:48:35] Reedy: "'(Updat(e|ing) )? ?.*? to (master|head|[0-9a-f]{5,40}|production tip)', // Update foo to master" will cover https://gerrit.wikimedia.org/r/#/c/184544/ and its ilk. [19:48:58] did that happen because gerrit had one of the tmp db not found errors? [19:49:08] James_F: We should probably just remove that line tbh [19:49:09] mark: https://etherpad.wikimedia.org/p/mobile_apps_onboarding_brian [19:49:15] Reedy: Yeah. [19:49:34] Reedy: I guess it was to ignore… deployment things in MW-core? But they don't normally take that form. 
[19:49:48] It was for submodule bumps in master [19:49:56] s/master/core/ [19:50:06] Where that information wasn't really needed, as it was listed in the submodule for the extension beneath [19:50:16] * James_F nods. [19:50:24] (03PS10) 10Dzahn: Allow "hoo" to sudo into datasets [puppet] - 10https://gerrit.wikimedia.org/r/152724 (https://phabricator.wikimedia.org/T86808) (owner: 10Hoo man) [19:50:32] But they often don't work. [19:50:33] (03CR) 10jenkins-bot: [V: 04-1] Allow "hoo" to sudo into datasets [puppet] - 10https://gerrit.wikimedia.org/r/152724 (https://phabricator.wikimedia.org/T86808) (owner: 10Hoo man) [19:50:40] /srv/deployment/scap/scap/scap/utils.py: ('/usr/bin/git', 'merge-base', 'HEAD', remote), [19:51:00] Why is the remote gerrit\norigin [19:51:02] tfinc: thanks [19:51:18] Reedy: "git #81f609fd - Update EventLogging for cherry-picks", "git #0b233c76 - Update WikimediaMessages, adds wikibase-otherprojects-wikidata", "git #7e3b1238 - Update PageTriage with easy JS fix"… from https://www.mediawiki.org/wiki/MediaWiki_1.25/wmf12/Changelog [19:51:30] Reedy: Doesn't appear to be having the desired effect. :-) Yeah, let's just kill it. [19:52:02] James_F: it's mediawiki/tools/release if you want to make a patchset [19:52:02] (03CR) 10Krinkle: "The code looks good and it might work, however (I asked Bartosz to double check) neither of us knows for sure they will run in the correct" [puppet] - 10https://gerrit.wikimedia.org/r/177128 (https://phabricator.wikimedia.org/T75997) (owner: 10Krinkle) [19:52:06] (03PS7) 10Krinkle: gerrit: Don't match Phabricator identifiers within urls [puppet] - 10https://gerrit.wikimedia.org/r/177128 (https://phabricator.wikimedia.org/T75997) [19:52:11] Reedy: Will do. 
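For readers following the $skipLines exchange, here is a minimal Python sketch of why the quoted pattern misses the wmf12 changelog entries James_F lists. The pattern is copied from the paste above; applying it with an unanchored search is an assumption, since the log does not show how the release script actually matches it, and `is_skipped` is a hypothetical helper name.

```python
import re

# Pattern copied verbatim from the $skipLines entry quoted in the log.
# Assumption: it is applied as an unanchored search over the summary line.
SKIP_UPDATE_TO = re.compile(
    r"(Updat(e|ing) )? ?.*? to (master|head|[0-9a-f]{5,40}|production tip)"
)


def is_skipped(summary):
    """Return True if this commit summary would be dropped from the changelog."""
    return SKIP_UPDATE_TO.search(summary) is not None


summaries = [
    "Update VisualEditor to master",          # classic submodule bump
    "Update EventLogging for cherry-picks",   # quoted from the wmf12 changelog
    "Update PageTriage with easy JS fix",     # quoted from the wmf12 changelog
]
for summary in summaries:
    print(is_skipped(summary), summary)
```

Under that assumption only summaries containing " to master", " to head", " to <hash>" or " to production tip" are suppressed, so "Update X for ..." and "Update X with ..." bumps still reach the changelog, which fits what the channel observed.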
[19:52:48] 3Ops-Access-Requests, operations: Give "hoo" sudo access to dataset snapshot hosts - https://phabricator.wikimedia.org/T86808#977025 (10Dzahn) [19:53:02] James_F: A couple of the others might want to go too [19:53:06] remote = subprocess.check_output( ('/usr/bin/git', 'remote'), cwd=repo_directory, stderr=dev_null).strip() [19:53:11] I guess that's leaving a newline? [19:54:09] 3operations: git::clone makes changed files root-only readable - https://phabricator.wikimedia.org/T86527#977026 (10Jgreen) p:5Triage>3Volunteer? [19:54:17] https://github.com/wikimedia/mediawiki-tools-scap/blob/master/scap/utils.py#L142-L144 [19:54:32] * James_F shrugs. [19:55:38] RECOVERY - Varnishkafka Delivery Errors per minute on cp3015 is OK: OK: Less than 1.00% above the threshold [0.0] [19:56:04] !log reedy Started scap: testwiki to 1.25wmf15 and build l10n caches. Some extension bumps in wmf14. take 2 [19:56:09] Logged the message, Master [19:57:04] Reedy: I did https://gerrit.wikimedia.org/r/#/c/184981/ for now; . [19:57:39] !log reedy scap failed: CalledProcessError Command '('/usr/bin/git', 'merge-base', 'HEAD', 'gerrit\norigin')' returned non-zero exit status 128 (duration: 01m 34s) [19:57:57] 'gerrit\norigin' [19:58:01] Ah, it's using both remotes [19:58:04] for whatever reason... 
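The 'gerrit\norigin' value in the failure above is what you get when `git remote` prints more than one remote and the output is only passed through `.strip()`. A small sketch of that behaviour and one possible way to handle it follows; `parse_remotes` and `pick_remote` are hypothetical helpers, and preferring `origin` is an assumption, not what scap does.

```python
def parse_remotes(git_remote_output):
    """Split raw `git remote` output into a list of remote names."""
    return [line.strip() for line in git_remote_output.splitlines() if line.strip()]


def pick_remote(git_remote_output, preferred="origin"):
    """Choose a single remote; fall back to the first one listed.

    Preferring 'origin' is an illustrative policy, not scap's behaviour.
    """
    remotes = parse_remotes(git_remote_output)
    if not remotes:
        raise ValueError("repository has no remotes")
    return preferred if preferred in remotes else remotes[0]


raw = "gerrit\norigin\n"      # what `git remote` prints with two remotes
print(repr(raw.strip()))      # strip() only trims the ends: 'gerrit\norigin'
print(pick_remote(raw))       # origin
```

Passing the unsplit string to `git merge-base HEAD <remote>` hands git a ref name with an embedded newline, which is consistent with the exit status 128 in the log.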
[19:59:08] 3Ops-Access-Requests, operations: Give "hoo" sudo access to dataset snapshot hosts - https://phabricator.wikimedia.org/T86808#977032 (10Dzahn) [19:59:53] 3Ops-Access-Requests, operations: Give "hoo" sudo access to dataset snapshot hosts - https://phabricator.wikimedia.org/T86808#977033 (10mark) [20:00:36] (03PS8) 10Krinkle: gerrit: Don't match Phabricator identifiers within urls [puppet] - 10https://gerrit.wikimedia.org/r/177128 (https://phabricator.wikimedia.org/T75997) [20:00:38] !log reedy Started scap: verbose [20:00:44] Logged the message, Master [20:00:50] 3ops-codfw: rack graphite2001 - https://phabricator.wikimedia.org/T86554 (10RobH) This doesn't respond to my attempts to reach its mgmt, please check and advise. (Also, were the SSDs installed into the system?) [20:00:55] mutante: MatmaRex: ^ I've documented the hack [20:00:59] !log reedy scap failed: CalledProcessError Command '('/usr/bin/git', 'merge-base', 'HEAD', 'gerrit\norigin')' returned non-zero exit status 128 (duration: 00m 21s) [20:01:09] bah [20:01:15] (03PS1) 10Ottomata: Only attempt to include ganglia diskstat monitor if ganglia is included [puppet] - 10https://gerrit.wikimedia.org/r/184984 [20:01:17] it won't tell me where it's found that remote [20:01:27] Reedy: Nice error [20:01:31] (03CR) 10Bartosz Dziewoński: [C: 031] gerrit: Don't match Phabricator identifiers within urls [puppet] - 10https://gerrit.wikimedia.org/r/177128 (https://phabricator.wikimedia.org/T75997) (owner: 10Krinkle) [20:01:55] (03CR) 10jenkins-bot: [V: 04-1] Only attempt to include ganglia diskstat monitor if ganglia is included [puppet] - 10https://gerrit.wikimedia.org/r/184984 (owner: 10Ottomata) [20:02:01] reedy@tin:/srv/mediawiki-staging/php-1.25wmf14$ git submodule foreach git remote | grep gerrit [20:02:01] gerrit [20:02:29] Entering 'extensions/CirrusSearch' [20:02:29] gerrit [20:02:36] (03PS2) 10Ottomata: Only attempt to include ganglia diskstat monitor if ganglia is included [puppet] - 
10https://gerrit.wikimedia.org/r/184984 [20:02:52] !log reedy Started scap: testwiki to 1.25wmf15 and build l10n caches. Some extension bumps in wmf14 [20:03:21] (03CR) 10Dzahn: [C: 032] gerrit: Don't match Phabricator identifiers within urls [puppet] - 10https://gerrit.wikimedia.org/r/177128 (https://phabricator.wikimedia.org/T75997) (owner: 10Krinkle) [20:03:25] (03CR) 10jenkins-bot: [V: 04-1] Only attempt to include ganglia diskstat monitor if ganglia is included [puppet] - 10https://gerrit.wikimedia.org/r/184984 (owner: 10Ottomata) [20:03:28] wha [20:03:45] just a sec please, will apply gerrit config change [20:03:50] pssh parens [20:03:59] aude: https://phabricator.wikimedia.org/T86811?workflow=create [20:04:06] it might cause a very short gerrit outage [20:04:13] (03PS1) 10RobH: setting asset tag mgmt dns entry [dns] - 10https://gerrit.wikimedia.org/r/184985 [20:04:21] (03PS3) 10Ottomata: Only attempt to include ganglia diskstat monitor if ganglia is included [puppet] - 10https://gerrit.wikimedia.org/r/184984 [20:04:34] ugh [20:04:57] No idea why CirrusSearch ended up with a gerrit remote [20:05:00] (03PS4) 10Ottomata: Only attempt to include ganglia diskstat monitor if ganglia is included [puppet] - 10https://gerrit.wikimedia.org/r/184984 [20:05:12] grrr :) [20:05:40] Scheduling refresh of Service[gerrit] [20:05:59] and done [20:06:08] (03PS2) 10RobH: setting asset tag mgmt dns entry [dns] - 10https://gerrit.wikimedia.org/r/184985 [20:06:14] i git git review the two seconds you had it down [20:06:15] heh [20:06:18] (03CR) 10Dzahn: "i agree, we are trying it. config has been applied and gerrit restarted" [puppet] - 10https://gerrit.wikimedia.org/r/177128 (https://phabricator.wikimedia.org/T75997) (owner: 10Krinkle) [20:06:19] hit even. 
[20:06:44] mutante: well, that didn't work :D [20:06:48] It's no longer matching [20:07:03] boo [20:07:04] :p [20:07:09] PROBLEM - Varnishkafka Delivery Errors on cp3005 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 53.366665 [20:07:45] (03PS1) 10Dzahn: Revert "gerrit: Don't match Phabricator identifiers within urls" [puppet] - 10https://gerrit.wikimedia.org/r/184986 [20:07:48] oh well ? [20:09:20] qchris: ^^^ i think even with all bits disabled, upload still has problems, [20:09:25] convinced? :) [20:09:37] (03CR) 10Dzahn: [C: 032] "well, this was ". Worst case scenario it'll match nothing and we can revert it soon after."" [puppet] - 10https://gerrit.wikimedia.org/r/184986 (owner: 10Dzahn) [20:09:54] ottomata: convinced. [20:10:02] (03PS5) 10Ottomata: Only attempt to include ganglia diskstat monitor if ganglia is included [puppet] - 10https://gerrit.wikimedia.org/r/184984 [20:10:07] :) [20:10:19] RECOVERY - Varnishkafka Delivery Errors on cp3005 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [20:11:17] yea, sorry about that. one more restart for the revert [20:11:22] GAH [20:11:24] haha [20:11:29] only enough time to rebase! [20:11:38] aude: dunno if you saw, but we're scapping with more scap proxies today :) [20:12:15] :) [20:12:27] ottomata: go ahead, that will be it for today [20:13:01] (03CR) 10Ottomata: [C: 032 V: 032] Only attempt to include ganglia diskstat monitor if ganglia is included [puppet] - 10https://gerrit.wikimedia.org/r/184984 (owner: 10Ottomata) [20:13:05] danke [20:16:16] (03CR) 10Dzahn: "unfortunately didn't work, it didn't match any T's anymore" [puppet] - 10https://gerrit.wikimedia.org/r/177128 (https://phabricator.wikimedia.org/T75997) (owner: 10Krinkle) [20:16:54] 20:05:27 Started sync-apaches [20:16:54] sync-common: 100% (ok: 265; fail: 0; left: 0) [20:16:54] 20:16:40 Finished sync-apaches (duration: 11m 12s) [20:16:57] That seems pretty good to me [20:19:18] ah, the exception logs....
[20:19:30] hopefully / think it's fixed with our updates [20:20:03] Reedy: sync-apaches is still a thing? [20:20:13] mutante: heh [20:20:23] just the job queue [20:20:31] I think the name was just kept when it was ported to python [20:20:34] oh, ah [20:21:32] 3operations: git::clone makes changed files root-only readable - https://phabricator.wikimedia.org/T86527#977083 (10Jgreen) git::clone doesn't seem to offer the granularity you need. There's a class param $shared which toggles git's core.sharedRepository to 'group', which makes the repo group writable. I think i... [20:22:45] !log reedy Finished scap: testwiki to 1.25wmf15 and build l10n caches. Some extension bumps in wmf14 (duration: 19m 53s) [20:22:49] Logged the message, Master [20:22:55] bam! [20:23:18] fuck yeah [20:23:20] less than 20 minutes [20:23:23] <20m for a full scap with a new branch :) [20:23:29] (after the initial screwing about) [20:24:00] oh, now i get it, because you got all the new proxies [20:24:16] yeah :) [20:24:20] cool [20:25:03] Still biased to mw1070.eqiad.wmnet quite a bit [20:25:37] really? [20:25:44] wonder what it is about that server [20:26:22] (03CR) 10Dzahn: "should the DNS change still wait a bit or are we good to go?" 
[puppet] - 10https://gerrit.wikimedia.org/r/180248 (owner: 10Ottomata) [20:26:30] Reedy: https://dpaste.de/TE3U [20:26:44] But that's all the syncs since midnight [20:27:28] that's interesting [20:27:32] nearly twice that of any other server [20:28:06] (03Abandoned) 10Dzahn: remove disabled search-pool monitoring [puppet] - 10https://gerrit.wikimedia.org/r/184380 (owner: 10Dzahn) [20:28:16] I'm going to prune the log and see what happened just in the last scap [20:29:10] (03CR) 10Dzahn: [C: 031] "let's merge it next week while we meet?:)" [puppet] - 10https://gerrit.wikimedia.org/r/96424 (owner: 10Dzahn) [20:29:38] (03CR) 10Reedy: [C: 032] wikipedias to 1.25wmf14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184972 (owner: 10Reedy) [20:29:45] (03Merged) 10jenkins-bot: wikipedias to 1.25wmf14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184972 (owner: 10Reedy) [20:30:21] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: wikipedias to 1.25wmf14 [20:30:27] Logged the message, Master [20:30:43] (03CR) 10Reedy: [C: 032] group0 to 1.25wmf15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184973 (owner: 10Reedy) [20:30:48] (03Merged) 10jenkins-bot: group0 to 1.25wmf15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184973 (owner: 10Reedy) [20:31:18] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: group0 to 1.25wmf15 [20:31:22] Logged the message, Master [20:32:02] (03PS1) 10Reedy: Update IW cache per request [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184991 [20:32:13] !log reedy Synchronized wmf-config/: Update IW cache (duration: 00m 06s) [20:32:19] Logged the message, Master [20:32:23] (03CR) 10Reedy: [C: 032] Update IW cache per request [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184991 (owner: 10Reedy) [20:32:25] (03PS9) 10Dzahn: generate wikimedia wiki entries from helper [dns] - 10https://gerrit.wikimedia.org/r/171769 (https://bugzilla.wikimedia.org/38799) [20:32:27] 
(03CR) 10jenkins-bot: [V: 04-1] generate wikimedia wiki entries from helper [dns] - 10https://gerrit.wikimedia.org/r/171769 (https://bugzilla.wikimedia.org/38799) (owner: 10Dzahn) [20:32:29] (03Merged) 10jenkins-bot: Update IW cache per request [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184991 (owner: 10Reedy) [20:33:02] (03PS10) 10Dzahn: generate wikimedia wiki entries from helper [dns] - 10https://gerrit.wikimedia.org/r/171769 (https://bugzilla.wikimedia.org/38799) (https://phabricator.wikimedia.org/T40799) [20:33:08] (03CR) 10jenkins-bot: [V: 04-1] generate wikimedia wiki entries from helper [dns] - 10https://gerrit.wikimedia.org/r/171769 (https://bugzilla.wikimedia.org/38799) (https://phabricator.wikimedia.org/T40799) (owner: 10Dzahn) [20:34:25] Reedy: For just the scap, mw1033 was actually the most used slave -- https://dpaste.de/2hrc [20:34:42] the rest look relatively even [20:35:06] mw1033 is the first in the list. I think we should shuffle it [20:35:35] and mw1070 was the first before mw1033 was added [20:35:54] so the first host in the list gets extra requests apparently [20:36:25] That probably means there are equal weight routes to multiple hosts [20:36:40] !log Zuul applied Ori patch to fix a git lock contention in Zuul-cloner {{bug|T86730}} . Tagged wmf-deploy-20150114-1 [20:36:42] and it keeps the first shortest one it finds [20:36:46] Logged the message, Master [20:36:54] hmm, that's make sense [20:37:07] where it should do a rand() array index if count > 1 [20:37:17] !log Restarting Zuul [20:37:18] (03CR) 10Reedy: [C: 04-1] Content Translation configuration for Production (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/181546 (owner: 10KartikMistry) [20:37:20] hashar: cool, I hope that works. BTW, if there are any lock files left over from before the patch was deployed, they would still need to be cleaned up manually. 
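The "shuffle it" suggestion above can be sketched roughly as follows. This is not scap's proxy-selection code: `shuffled_proxies`, `first_reachable` and the `is_reachable` callback are hypothetical names, and the two host names are simply the ones mentioned in the log.

```python
import random


def shuffled_proxies(proxies, rng=None):
    """Return a shuffled copy so no host is favoured for being listed first."""
    if rng is None:
        rng = random.Random()
    order = list(proxies)   # copy: leave the configured list untouched
    rng.shuffle(order)
    return order


def first_reachable(proxies, is_reachable, rng=None):
    """Probe proxies in random order and return the first that answers.

    `is_reachable` stands in for whatever check the real tool performs.
    """
    for host in shuffled_proxies(proxies, rng):
        if is_reachable(host):
            return host
    return None


proxies = ["mw1033.eqiad.wmnet", "mw1070.eqiad.wmnet"]
print(first_reachable(proxies, lambda host: True))  # either host, at random
```

With ties broken by list position, the head of the list absorbs every tie; randomising the order spreads that load without changing which hosts are eligible.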
[20:37:20] Logged the message, Master [20:37:29] yeah or just shuffle the list before checking which is easy [20:37:57] entropy is hard ;) [20:38:12] ori: I am not sure why the lock are not disposed when zuul-cloner terminates though. Maybe because jobs are abruptly terminated [20:38:14] (03Abandoned) 10Dzahn: generate wikimedia wiki entries from helper [dns] - 10https://gerrit.wikimedia.org/r/171769 (https://bugzilla.wikimedia.org/38799) (https://phabricator.wikimedia.org/T40799) (owner: 10Dzahn) [20:38:25] actually we return early so the first host that pings back wins [20:38:32] so shuffle for sure [20:38:45] hashar: https://docs.python.org/2/reference/datamodel.html#object.__del__ : "It is not guaranteed that __del__() methods are called for objects that still exist when the interpreter exits." [20:39:10] python !== php :) [20:39:42] That php behavior is nice but it leads to bad expectations for other languages [20:39:51] ori: greaaat [20:40:11] I've written my fair share of Python and I don't think I've ever used __del__ in code meant for others to use [20:40:36] GitPython is a bad shape :( [20:40:38] the Pythonic idiom for cleanup work is try / finally or context managers ('with') [20:40:59] or the atexit module for process-global cleanup [20:41:08] 3operations, Wikimedia-DNS: Missing mobile DNS entries for numerous *.wikimedia.org wikis (*.m.wikimedia.org not existing) - https://phabricator.wikimedia.org/T40799#977237 (10Dzahn) [20:41:14] as I understand it the lib is a one man effort that started long ago. 
So I am not surprised there is a bunch of corner case bugs in it [20:41:47] 3operations, Wikimedia-DNS: Missing mobile DNS entries for numerous *.wikimedia.org wikis (*.m.wikimedia.org not existing) - https://phabricator.wikimedia.org/T40799#471541 (10Dzahn) [20:41:48] yeah, if i can find the time I'd like to file a detailed issue upstream [20:43:24] (03PS15) 10KartikMistry: Content Translation configuration for Production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/181546 [20:45:46] (03PS16) 10Reedy: Content Translation configuration for Production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/181546 (owner: 10KartikMistry) [20:45:51] (03CR) 10Reedy: [C: 031] Content Translation configuration for Production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/181546 (owner: 10KartikMistry) [20:47:46] (03CR) 10Reedy: [C: 032] Content Translation configuration for Production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/181546 (owner: 10KartikMistry) [20:47:50] (03Merged) 10jenkins-bot: Content Translation configuration for Production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/181546 (owner: 10KartikMistry) [20:52:02] (03CR) 10Dzahn: "@Giuseppe: I have meanwhile added it to the deployment role like you suggested." 
[puppet] - 10https://gerrit.wikimedia.org/r/177080 (owner: 10Dzahn) [20:52:35] (03PS1) 10Reedy: Move ContentTranslation into production extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184994 [20:53:31] (03CR) 10Reedy: [C: 032] Move ContentTranslation into production extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184994 (owner: 10Reedy) [20:54:00] !log hashar Started scap: (no message) [20:54:02] !log hashar scap aborted: (no message) (duration: 00m 01s) [20:54:18] holy hell [20:54:26] I ran scap by mistake on tin [20:54:35] Logged the message, Master [20:54:40] lol [20:54:43] (03Merged) 10jenkins-bot: Move ContentTranslation into production extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/184994 (owner: 10Reedy) [20:54:59] here is my output http://paste.openstack.org/show/158015/ [20:55:00] hashar: ! [20:55:10] Logged the message, Master [20:55:19] !log reedy Started scap: Add ContentTranslation messages [20:55:27] Logged the message, Master [20:55:29] hopefully next scap is going to fix it :D [20:55:36] I'm still laughing [20:55:39] PROBLEM - puppet last run on virt1000 is CRITICAL: CRITICAL: Puppet has 1 failures [20:55:50] you can literally see how you started hitting Ctrl + C a bunch of times [20:56:05] -1 cookies for hashar from Reedy :) [20:56:47] let's make --version actually output version and not run it ? [20:57:03] feature request! [20:57:04] :) [20:57:12] patches welcome [20:57:25] I don't think --version is a flag to the script [20:59:08] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There are 2 unmerged changes in mediawiki_config (dir /srv/mediawiki-staging/). [20:59:26] scap --really [20:59:34] icinga-wm: sshh [21:00:04] gwicke, cscott, arlolra, subbu: Dear anthropoid, the time has come. Please deploy Parsoid/OCG (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150114T2100). 
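On the "--version should print the version and not run" feature request: Python's argparse has a built-in `version` action that prints and exits before any work starts, and it rejects unrecognised flags instead of ignoring them. The sketch below is a hypothetical stand-in, not scap's real command-line interface; the program name, version string and positional `message` argument are all assumptions.

```python
import argparse


def build_parser():
    """Build a parser for a hypothetical deploy command (not scap's real CLI)."""
    parser = argparse.ArgumentParser(prog="scap-sketch")
    # action="version" prints the string and exits with status 0.
    parser.add_argument("--version", action="version",
                        version="%(prog)s 0.0-hypothetical")
    parser.add_argument("message", nargs="*", help="log message for the sync")
    return parser


def main(argv):
    """Parse arguments and describe what would run; no sync is performed."""
    args = build_parser().parse_args(argv)
    return "would sync: " + (" ".join(args.message) or "(no message)")


print(main(["testwiki", "to", "1.25wmf15"]))
```

Calling `main(["--version"])` raises `SystemExit` with code 0 after printing the version, and an unknown flag such as `--really` exits with code 2, so neither path reaches the sync step.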
[21:04:01] 3operations, Code-Review, Wikimedia-Git-or-Gerrit: Chrome warns about insecure certificate on gerrit.wikimedia.org - https://phabricator.wikimedia.org/T76562#977293 (10RobH) Well, first Faidon advised this wouldn't work, and that I shouldn't waste time on it. I wanted to confirm why it wouldn't work, since I im... [21:11:12] !log deployed parsoid version 45b0aafb (deploy sha 88525538) [21:11:20] Logged the message, Master [21:12:48] RECOVERY - puppet last run on virt1000 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [21:14:53] (03PS1) 10Ori.livneh: VCL: Be consistent about using vmod_header to append header items [puppet] - 10https://gerrit.wikimedia.org/r/184997 [21:15:19] PROBLEM - HHVM rendering on mw1197 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:15:19] PROBLEM - Apache HTTP on mw1197 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:18:38] PROBLEM - HHVM busy threads on mw1197 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [115.2] [21:18:59] PROBLEM - HHVM queue size on mw1197 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [80.0] [21:24:34] !log reedy Finished scap: Add ContentTranslation messages (duration: 29m 14s) [21:24:40] Logged the message, Master [21:24:46] How was that one 10 minutes longer!? [21:25:29] Reedy: are we good? [21:25:32] !log reverted parsoid deploy to previous deployment (2cd6fefa) -- seeing a lot of dirty diffs post-deploy [21:25:34] kart_: yup [21:25:36] Logged the message, Master [21:25:57] kart_: you can effectively just do [21:25:59] git pull [21:26:09] sync-dir wmf-config Enable ContentTranslation [21:27:02] in side /srv/mediawiki-staging, right? [21:27:14] Reedy: ^ [21:27:18] Yup [21:27:22] exactly [21:27:41] Permission denied (publickey). [21:27:54] heh [21:28:00] How did you log into tin? [21:28:07] ssh tin [21:28:16] from bast1001? [21:28:41] yep. Using my ssh config hack. 
[21:28:50] bast1001 it is [21:28:58] ssh -A tin [21:29:30] -A forwards your key [21:29:40] yuk [21:29:43] thanks [21:30:10] same error though [21:31:03] Reedy: :( [21:31:43] really? [21:31:59] yes [21:32:46] What does ssh-agent say? [21:33:04] (03CR) 10Dzahn: "i tried to let this run in puppet-compiler to make it more merge-friendly, but i run into "Error: Could not find data item mediawiki_memca" [puppet] - 10https://gerrit.wikimedia.org/r/178873 (owner: 10Dzahn) [21:33:30] ssh-add -l ? [21:33:50] and/or that [21:34:15] kartik@tin:~$ ssh-add -l [21:34:15] The agent has no identities. [21:34:47] akosiaris: ping [21:35:55] Reedy: and [21:35:56] kartik@tin:~$ ssh-agent [21:35:56] SSH_AUTH_SOCK=/tmp/ssh-SpnLdUVY8753/agent.8753; export SSH_AUTH_SOCK; [21:35:59] SSH_AGENT_PID=8754; export SSH_AGENT_PID; [21:36:02] echo Agent pid 8754; [21:36:58] so, your agent must be loaded in properly as you wouldn't have been able to get onto tin from bast1001 [21:37:15] kart_: so you did ssh -A bast1001.wikimedia.org then ssh -A tin [21:37:15] ? [21:37:15] he probably uses ProxyCommand [21:37:25] hopefully... [21:37:26] mutante: yes. [21:37:26] which is good [21:37:37] do you have to forward for scapping? [21:37:45] if yea, then on local computer: ssh-add /path/to/key then use -A when connecting to bastion [21:37:45] yep :/ [21:37:50] But only the key for gerrit [21:37:50] kart_: ^ [21:39:15] 3operations, Code-Review, Wikimedia-Git-or-Gerrit: Chrome warns about insecure certificate on gerrit.wikimedia.org - https://phabricator.wikimedia.org/T76562#977352 (10RobH) Ok, I was wrong about some of the above, Chad volunteered a bit more time so we could properly detail this. Gerrit runs and listens to 808... [21:39:30] subbu: https://gerrit.wikimedia.org/r/#/c/181613/ looks okay at first sight [21:40:00] ok, when did that go live? [21:40:17] subbu: yesterday I think [21:40:24] Jan 13, 2015 4:06 AM [21:41:37] i see ..
so, i see a few dirty diffed pages an hour back [21:42:42] James_F: when was the last VE deploy? [21:43:38] Reedy: all ok [21:43:38] PROBLEM - HHVM queue size on mw1197 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [80.0] [21:43:49] Reedy: git pull; done [21:44:21] Reedy: sync-dir wmf-config Enable ContentTranslation ; now [21:44:28] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge. [21:45:04] kart_: yup [21:45:15] !log kartik Synchronized wmf-config: Enable ContentTranslation (duration: 00m 06s) [21:45:16] tarted [21:45:19] kart_, when to expect a mushroom cloud? :P [21:45:20] Logged the message, Master [21:45:29] heh [21:45:30] oh, now... :) [21:46:51] Reedy: thanks a lot! [21:47:01] Reedy: anything else? [21:47:07] :) [21:47:09] kart_: not if everything is working [21:47:16] :) [21:50:22] 3operations, Code-Review, Wikimedia-Git-or-Gerrit: Chrome warns about insecure certificate on gerrit.wikimedia.org - https://phabricator.wikimedia.org/T76562#977384 (10Dzahn) for the related issue that we want to make Gerrit listen on 22, and not on 29418, which we'd like anyways, besides the cert issue being in... [21:51:52] Reedy: well, I don't see SpecialPage for ContentTranslation, do we need to sync anything else? [21:51:53] 3operations, Code-Review, Wikimedia-Git-or-Gerrit: Chrome warns about insecure certificate on gerrit.wikimedia.org - https://phabricator.wikimedia.org/T76562#977392 (10Dzahn) the suggestion to use an iptables rule to forward 22 to 29418 came from Faidon recently on IRC [21:52:10] kart_: You shouldn't.. Which wiki? Does it show in Special:Version? [21:52:18] checking [21:53:12] yes. in Special:Version [21:53:19] eswiki/cawiki [21:54:23] kart_: where should it be? 
[21:55:04] In Special:Pages list [21:55:34] eg https://ca.wikipedia.org/wiki/Especial:P%C3%A0gines_especials [21:55:38] PROBLEM - Slow CirrusSearch query rate on fluorine is CRITICAL: CirrusSearch-slow.log_line_rate CRITICAL: 0.00326797385621 [21:56:48] kart_: what if you go directly to it? [21:57:31] No such special page :( [22:00:39] PROBLEM - Disk space on fluorine is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=96%): [22:00:57] That's interesting. [22:01:01] > var_dump( SpecialPageFactory::exists( 'ContentTranslation' ) ); [22:01:01] bool(true) [22:01:23] very JohnFLewis [22:03:05] !log apt-get clean on fluorine to get a little disk space [22:03:08] RECOVERY - Disk space on fluorine is OK: DISK OK [22:03:10] Logged the message, Master [22:03:46] mutante: sorry that was me killing diskspace on fluorine i think, i was decompressing logs in my home dir(to the tune of 10's of G, unexpected) [22:04:01] ahah [22:04:23] i see now 43G are free again [22:04:30] ebernhardson: ah :) ok [22:04:47] i was wondering where all that space came from again [22:05:32] exception logs, eh [22:05:48] trying to track down something, and zcat to grep was getting slow [22:05:54] gotcha [22:05:59] RECOVERY - Slow CirrusSearch query rate on fluorine is OK: CirrusSearch-slow.log_line_rate OKAY: 0.0 [22:06:03] Can you not write to /a? [22:06:17] yea, use /a [22:06:23] 599G there [22:06:33] hmm, i probably should? it only has /a/mw-log and i didn't want to muck with the actual logs. i'll make an extra dir [22:06:50] permission denied :) [22:07:16] mutante: fancy creating a /a/tmp or something? :) [22:07:47] !log reedy Synchronized wmf-config/InitialiseSettings.php: touch (duration: 00m 06s) [22:07:55] you cant duplicate the entire mw-log though, it's too big [22:08:04] sure [22:08:05] ofc, i was just decompressing one type of logs :) [22:08:12] but being able to decompress a few files... 
[22:08:38] ok try /a/tmp/ [22:08:56] Logged the message, Master [22:09:00] mutante: looks good, thanks! [22:09:06] i let wikidev write to it [22:09:21] (03PS1) 10Reedy: Enable ContentTranslation on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/185068 [22:09:29] 3ops-codfw: rack and initial configuration of wtp2001-2020 - https://phabricator.wikimedia.org/T86807#977439 (10RobH) Please rack these into rack b4-codfw BEFORE the new mw servers are racked there. I want to ensure we can fit these there first, and any space issues result in moving one of the planned new mw sy... [22:09:37] !log made an /a/tmp/ on fluorine to let wikidev write to for log analysis [22:09:40] (03CR) 10Reedy: [C: 032] Enable ContentTranslation on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/185068 (owner: 10Reedy) [22:09:41] Logged the message, Master [22:10:40] (03PS1) 10RobH: setting mgmt dns for incoming wtp order [dns] - 10https://gerrit.wikimedia.org/r/185069 [22:11:12] (03CR) 10RobH: [C: 032] setting asset tag mgmt dns entry [dns] - 10https://gerrit.wikimedia.org/r/184985 (owner: 10RobH) [22:11:46] (03CR) 10RobH: [C: 032] setting mgmt dns for incoming wtp order [dns] - 10https://gerrit.wikimedia.org/r/185069 (owner: 10RobH) [22:12:44] 3ops-codfw: rack and initial configuration of wtp2001-2020 - https://phabricator.wikimedia.org/T86807#977453 (10RobH) [22:13:01] 3ops-codfw: rack and initial configuration of wtp2001-2020 - https://phabricator.wikimedia.org/T86807#977456 (10RobH) p:5High>3Normal a:5RobH>3Papaul [22:13:15] !log reedy Synchronized wmf-config/InitialiseSettings.php: CT on testwiki (duration: 00m 06s) [22:13:19] Logged the message, Master [22:14:17] greg-g: Can I grab an emergency deploy? VisualEditor/Parsoid bug causing irritating wikitext corruptions in production. There's nothing in the calendar right now. [22:14:28] kart_: Stupid question.. It does work fine on beta, right? 
[22:14:49] Reedy: yep [22:14:56] (03Merged) 10jenkins-bot: Enable ContentTranslation on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/185068 (owner: 10Reedy) [22:14:58] PROBLEM - puppet last run on virt1000 is CRITICAL: CRITICAL: Puppet has 1 failures [22:15:01] since June 2014. [22:15:15] Reedy: Can you authorise in greg-g's absence? [22:15:19] James_F: WFM [22:15:23] Need me to deploy it? [22:15:24] OK. :-) [22:15:34] Reedy: Krenair offered but if you're already in tin. [22:15:50] Reedy: https://gerrit.wikimedia.org/r/#/c/185067/ [22:15:51] you asked me :p [22:15:58] * James_F grins at Krenair. [22:16:08] 14 and 15? [22:16:09] doit [22:16:09] Reedy: *** Krenair failed to refuse. ;-) [22:16:12] Reedy: Yeah. [22:16:15] greg-g: Thanks! [22:18:41] !log reedy Synchronized php-1.25wmf15/includes/libs/virtualrest/ParsoidVirtualRESTService.php: (no message) (duration: 00m 05s) [22:18:48] Logged the message, Master [22:18:53] !log reedy Synchronized php-1.25wmf15/includes/libs/virtualrest/ParsoidVirtualRESTService.php: (no message) (duration: 00m 05s) [22:20:01] Reedy: do we need wmgBetaFeaturesWhitelist? [22:20:09] kart_: Yes. [22:20:16] ? [22:20:19] kart_: People were deploying Beta Features accidentally. [22:20:27] kart_: It was really bad for users. [22:20:29] That'll stop the special pages? [22:20:39] Reedy: thanks! [22:20:59] James_F: that makes me laugh actually 'I accidentally deployed beta stuff :o' [22:21:02] James_F: thanks. Google helps :) [22:21:04] Reedy: It stops any Beta Feature being shown in the preferences, and so should stop any effect. [22:21:23] Bad James_F :) [22:21:25] CT has some API modules that show [22:21:31] JohnFLewis: Indeed. "Oh, damn, I didn't realise that master got deployed automatically!" "Err, really?". [22:21:45] kart_: Not bad. Heroic. Fixing wikis for users. :-) [22:21:51] kart_: know what you need to do? 
:P [22:22:11] James_F: it doesn't actually prevent the preference showing up in the preferences, as far as I can see [22:22:30] yeah [22:22:32] https://test.wikipedia.org/wiki/Special:Preferences#mw-prefsection-betafeatures [22:22:35] It shows there for me [22:22:40] Nikerabbit: It stops it being registered, I thought. [22:22:41] 3ops-codfw: rack mw2135 through mw2215 - https://phabricator.wikimedia.org/T86806#977488 (10RobH) I'll update this task later with the mgmt IP info. For now the racking locations: b4-codfw (AFTER new WTP systems are racked): mw2135 through mw2148 c3-codfw: mw2149 through mw2188 c4-codfw: mw2189 through mw2215 [22:23:00] 3ops-codfw: rack and initial configuration of wtp2001-2020 - https://phabricator.wikimedia.org/T86807#977490 (10RobH) [22:23:03] 3ops-codfw: rack mw2135 through mw2215 - https://phabricator.wikimedia.org/T86806#977489 (10RobH) [22:23:13] Anyway, I'm always happy to C+1 changes to the approved Beta Features list as long as they've gone through the normal process. [22:23:18] So it appears in the prefs, but BetaFeatures::isFeatureEnabled doesn't return it as enabled? [22:23:28] seems to be the case [22:23:38] any chance of fixing this now? [22:23:43] Yeah [22:23:55] I was wondering if kart_ was going to as he found out what it was :) [22:24:28] Reedy: yes. [22:25:30] cool [22:26:09] I'm confused on 'contenttranslation' is right or not :) [22:26:55] key. got it. [22:27:12] 3operations: Switch HAT appservers to trusty's ICU - https://phabricator.wikimedia.org/T86096#977493 (10PleaseStand) [22:28:14] (03PS1) 10KartikMistry: Whitelist ContentTranslation BetaFeature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/185074 [22:28:23] Reedy: Please ^^ [22:28:46] kart_: The date should be today+6 months. [22:28:46] or Nikerabbit if you're nearby [22:28:52] ah [22:28:57] kart_: It's the nominal termination date. [22:29:08] kart_: Note that a bunch of these need cleaning up. 
:-) [22:29:35] ah [22:29:53] s/cleaning up/de-deploying because no-one is looking after them/ [22:30:37] James_F: or graduate? [22:30:53] (03PS2) 10KartikMistry: Whitelist ContentTranslation BetaFeature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/185074 [22:31:02] Reedy, did you deploy that ve/parsoid/core patch to wmf15 twice? [22:31:17] damn it, didn't do 14 [22:31:38] !log reedy Synchronized php-1.25wmf14/includes/libs/virtualrest/ParsoidVirtualRESTService.php: (no message) (duration: 00m 05s) [22:31:43] James_F|Away, Reedy - isn't "New Seach" (sic) already default everywhere? [22:31:44] Logged the message, Master [22:31:56] aharoni: haha, yeah. ^d manybubbles [22:33:40] Reedy: deployment of this patch will be different I guess. [22:33:42] (03PS1) 10Amire80: New Search is default now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/185076 [22:33:48] kart_: not really [22:34:58] Reedy: should I go to tin? :) [22:35:40] James_F|Away: Do you want to +1 it? [22:35:55] (03CR) 10Chad: "Fine by me. We already disabled this on the Cirrus side awhile back." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/185076 (owner: 10Amire80) [22:36:18] (03CR) 10Reedy: [C: 031] Whitelist ContentTranslation BetaFeature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/185074 (owner: 10KartikMistry) [22:36:56] kart_: yeah, to tin! [22:37:24] Reedy: there, Sir! [22:38:46] Reedy: same stuff? git pull; and ? [22:38:59] sync-file wmf-config/InitialiseSettings.php [22:40:00] !log kartik Synchronized wmf-config/InitialiseSettings.php: (no message) (duration: 00m 06s) [22:40:04] Logged the message, Master [22:40:40] 3ops-codfw: rack mw2135 through mw2215 - https://phabricator.wikimedia.org/T86806#977542 (10RobH) mgmt ips: mw2135 1H IN A 10.193.2.35 mw2136 1H IN A 10.193.2.36 mw2137 1H IN A 10.193.2.37 mw2138 1H IN A 10.193.2.38 mw2139 1H IN A 10.193.2.39 mw21... 
[22:42:04] 3ops-codfw: rack mw2135 through mw2215 - https://phabricator.wikimedia.org/T86806#977547 (10RobH) a:5RobH>3Papaul Also, once these are racked, Papaul can setup the asset tag mgmt address entries (good practice!) [22:42:15] Reedy: thanks. Testing. [22:43:56] Reedy: it seems, page still not appearing. [22:44:04] (03PS1) 10RobH: setting mw2135-mw2215 mgmt entries [dns] - 10https://gerrit.wikimedia.org/r/185077 [22:44:53] (03CR) 10RobH: [C: 032] setting mw2135-mw2215 mgmt entries [dns] - 10https://gerrit.wikimedia.org/r/185077 (owner: 10RobH) [22:45:27] Reedy: that should be deployed, right? [22:46:11] the patch is still not merged, if that matters [22:46:15] ^ [22:46:16] lmfao [22:46:18] yup [22:46:24] (03CR) 10Reedy: [C: 032] Whitelist ContentTranslation BetaFeature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/185074 (owner: 10KartikMistry) [22:46:27] (03CR) 10Dzahn: "caused issues in labs, who need holes for rsyncd, it seems" [puppet] - 10https://gerrit.wikimedia.org/r/184695 (owner: 10Faidon Liambotis) [22:46:29] (03Merged) 10jenkins-bot: Whitelist ContentTranslation BetaFeature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/185074 (owner: 10KartikMistry) [22:46:54] kart_: pull and sync again :P [22:46:58] Bad kartik [22:47:15] !log kartik Synchronized wmf-config/InitialiseSettings.php: (no message) (duration: 00m 05s) [22:47:22] Logged the message, Master [22:47:29] WFM now [22:48:55] (03PS1) 10Ori.livneh: Restrict 'forceprofile' to requests that set X-Wikimedia-Debug header [mediawiki-config] - 10https://gerrit.wikimedia.org/r/185080 [22:48:57] AaronSchulz: ^ [22:51:45] wikitech wiki is broken [22:51:56] oh (Cannot access the database: Too many connections (208.80.154.18)) [22:52:12] and back [22:52:36] It was doing that earlier today I think [22:52:49] RECOVERY - puppet last run on virt1000 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [22:52:51] intermittently [22:53:49] 
deployment-bastion.eqiad.wmflabs [22:53:57] ^ it's a bastion but you only get to it via another bastion? [22:54:02] (03PS2) 10Ori.livneh: Restrict 'forceprofile' to requests that set X-Wikimedia-Debug header [mediawiki-config] - 10https://gerrit.wikimedia.org/r/185080 [22:54:24] (03PS1) 10Se4598: Disable compact personal bar (beta feature) on wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/185081 (https://phabricator.wikimedia.org/T86831) [22:54:29] (03CR) 10jenkins-bot: [V: 04-1] Disable compact personal bar (beta feature) on wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/185081 (https://phabricator.wikimedia.org/T86831) (owner: 10Se4598) [22:54:48] (03CR) 10Aaron Schulz: [C: 031] Restrict 'forceprofile' to requests that set X-Wikimedia-Debug header [mediawiki-config] - 10https://gerrit.wikimedia.org/r/185080 (owner: 10Ori.livneh) [22:54:56] (03PS3) 10Ori.livneh: Restrict 'forceprofile' to requests that set X-Wikimedia-Debug header [mediawiki-config] - 10https://gerrit.wikimedia.org/r/185080 [22:55:00] (03CR) 10Ori.livneh: [C: 032] Restrict 'forceprofile' to requests that set X-Wikimedia-Debug header [mediawiki-config] - 10https://gerrit.wikimedia.org/r/185080 (owner: 10Ori.livneh) [22:57:07] !log ori Synchronized wmf-config/StartProfiler.php: I5bd397456: Restrict "forceprofile" to requests that set X-Wikimedia-Debug header (duration: 00m 06s) [22:57:09] Reedy: we're good, but we've stuff for akosiaris :) [22:57:13] Logged the message, Master [22:58:36] (03PS2) 10Se4598: Disable compact personal bar (beta feature) on wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/185081 (https://phabricator.wikimedia.org/T86831) [22:59:07] (03CR) 10Legoktm: "Given that its time is up, can we just remove it from everywhere?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/185081 (https://phabricator.wikimedia.org/T86831) (owner: 10Se4598) [22:59:56] (03CR) 10Se4598: "code from I4a0743d6f071e40c38140658c01eacfc88945f9c and followup If3a338bbaaa303edd541480b64c43526e9f5fe32" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/185081 (https://phabricator.wikimedia.org/T86831) (owner: 10Se4598) [23:04:08] (03PS3) 10Ori.livneh: "Un-disable" xhprof [puppet] - 10https://gerrit.wikimedia.org/r/182992 [23:04:17] (03CR) 10Ori.livneh: [C: 032 V: 032] "As of I5bd397456, this is confined to mw1017, so it's not a DDoS vector." [puppet] - 10https://gerrit.wikimedia.org/r/182992 (owner: 10Ori.livneh) [23:05:04] (03PS1) 10Dzahn: beta: deployment-bastion, add ferm rule for rsyncd [puppet] - 10https://gerrit.wikimedia.org/r/185085 [23:07:30] (03PS2) 10Dzahn: beta: deployment-bastion, add ferm rule for rsyncd [puppet] - 10https://gerrit.wikimedia.org/r/185085 [23:09:55] (03CR) 10BryanDavis: [C: 031] beta: deployment-bastion, add ferm rule for rsyncd [puppet] - 10https://gerrit.wikimedia.org/r/185085 (owner: 10Dzahn) [23:10:30] (03CR) 10Dzahn: "< wmf-insecte> Project beta-scap-eqiad build #38235: STILL FAILING in 19 min: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/382" [puppet] - 10https://gerrit.wikimedia.org/r/185085 (owner: 10Dzahn) [23:15:36] (03CR) 10Dzahn: "follow-up https://gerrit.wikimedia.org/r/#/c/185085/2" [puppet] - 10https://gerrit.wikimedia.org/r/184695 (owner: 10Faidon Liambotis) [23:22:58] RECOVERY - Apache HTTP on mw1197 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.070 second response time [23:22:58] RECOVERY - HHVM rendering on mw1197 is OK: HTTP OK: HTTP/1.1 200 OK - 117468 bytes in 0.401 second response time [23:26:14] (03CR) 10Dzahn: "cherry-picked on beta, deployment-bastion has the new rule" [puppet] - 10https://gerrit.wikimedia.org/r/185085 (owner: 10Dzahn) [23:30:09] RECOVERY - HHVM queue size on mw1197 is OK: OK: Less than 30.00% above the 
threshold [10.0] [23:30:58] RECOVERY - HHVM busy threads on mw1197 is OK: OK: Less than 30.00% above the threshold [76.8] [23:35:29] PROBLEM - puppet last run on virt1000 is CRITICAL: CRITICAL: Puppet has 2 failures [23:37:05] (03CR) 1020after4: [C: 031] beta: deployment-bastion, add ferm rule for rsyncd [puppet] - 10https://gerrit.wikimedia.org/r/185085 (owner: 10Dzahn) [23:39:47] (03CR) 10Dzahn: [C: 032] "15:41 < bd808> 23:32:32 Finished sync-proxies (duration: 02m 16s)" [puppet] - 10https://gerrit.wikimedia.org/r/185085 (owner: 10Dzahn) [23:43:23] (03CR) 10Dzahn: "< wmf-insecte> Yippee, build fixed!" [puppet] - 10https://gerrit.wikimedia.org/r/185085 (owner: 10Dzahn) [23:52:29] RECOVERY - puppet last run on virt1000 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures