[00:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161208T0000). Please do the needful. [00:00:04] kaldari: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [00:01:11] PROBLEM - puppet last run on kafka2002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] [00:01:47] (03CR) 10Dzahn: [V: 040 C: 031] Gerrit: Remove useless space from config [puppet] - 10https://gerrit.wikimedia.org/r/325834 (owner: 10Chad) [00:03:51] RECOVERY - puppet last run on ms-be1011 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [00:05:30] Oh boy--I can't log into gerrit [00:05:32] > Cannot assign user name "awight" to account 4151; name already in use. [00:05:43] awight: Known. [00:05:47] k thanks [00:05:47] Try "Awight" [00:05:49] ? [00:05:53] bahaha [00:05:55] Might be a weird capitalization bug.... [00:05:58] ok one moment [00:06:15] ooh funky > Cannot assign user name "awight" to account 4152; name already in use. [00:06:16] heh, like mediawiki [00:06:43] awight: It's trying to make a new account (for some reason) but conflicts because your account already exists.... [00:06:46] It's weird. [00:07:26] I can live w/o it tonight, if you think it'll be solved magically, or I'm happy to be a beta tester, just lmk [00:08:20] I doubt it'll be magic, I gotta figure out what's up :( [00:08:52] awight: Weird part, it's not everybody. And logging out didn't break it for ok-logged-in users on next attempt. [00:08:54] Weird! [00:08:54] * TabbyCat whispers revert [00:09:01] TabbyCat: Yeah not that simple. [00:09:20] TabbyCat: Gerrit does not roll back nicely, at all. 
[00:09:25] ostriches i swear they fixed that. [00:09:32] Yeah well. [00:09:34] Clearly not [00:09:59] "This prevents creation of new accounts on every logout/login sequence." [00:10:12] Just ping me if you want me to try anything [00:10:12] looks like it could either be a regression or not actually fixed. [00:10:23] Yeah my bet is not actually fixed. [00:10:38] Technically we can work around it, but it's incredibly manual. [00:10:52] Requires a bunch of DB fixes. [00:10:54] (03PS1) 10RobH: setup gerrit2001.codfw.wmnet [dns] - 10https://gerrit.wikimedia.org/r/325860 [00:11:25] I'll stop logging in for now, cos it looks like I'm spamming the db [00:11:43] (03CR) 10RobH: [V: 040 C: 032] setup gerrit2001.codfw.wmnet [dns] - 10https://gerrit.wikimedia.org/r/325860 (owner: 10RobH) [00:13:11] 06Operations, 10Gerrit, 06Release-Engineering-Team: setup/install gerrit2001/WMF6408 - https://phabricator.wikimedia.org/T152525#2856022 (10RobH) [00:14:19] 06Operations, 10Gerrit, 06Release-Engineering-Team: setup/install gerrit2001/WMF6408 - https://phabricator.wikimedia.org/T152525#2851729 (10RobH) [00:14:40] kaldari: ping? [00:14:46] here [00:14:51] RECOVERY - MariaDB Slave Lag: m3 on db1048 is OK: OK slave_sql_lag Replication lag: 26.88 seconds [00:14:57] do you have someone already swatting to deploy your change? [00:15:24] awight: Try one last time, please. [00:15:27] I tried something [00:15:46] (03PS4) 10Dzahn: Gerrit: Remove exec/require/subscribe for auto-provisioning [puppet] - 10https://gerrit.wikimedia.org/r/325834 (owner: 10Chad) [00:16:15] (03CR) 10Dzahn: [V: 040 C: 032] "http://puppet-compiler.wmflabs.org/4832/cobalt.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/325834 (owner: 10Chad) [00:16:37] ostriches: with lowercase? [00:16:53] Whatever you usually use. 
* awight is scared to lose my last magic wish [00:16:55] k [00:17:01] Dereckson: if that question is for me, no [00:17:10] :( failure [00:17:11] (03PS5) 10Dzahn: Gerrit: Remove exec/require/subscribe for auto-provisioning [puppet] - 10https://gerrit.wikimedia.org/r/325834 (owner: 10Chad) [00:17:21] awight: dammit. ok thanks. [00:17:34] kaldari: okay, let's do it [00:19:23] ostriches: something weird is going on with Gerrit. When I try to log in I get: "Cannot assign user name "kaldari" to account 4160; name already in use." [00:19:41] Dammit, you too [00:19:42] Fuck [00:19:58] That's 4 people now [00:20:12] Let me test as well, my username is capitalized though [00:20:13] my login just worked, if you wish a report of someone not affected by the issue [00:20:22] * awight sends a big FU weather balloon over Mountain View [00:20:32] Dereckson: *Most* are working best I can tell, or I would have way more complaints ;-) [00:20:34] But thanks [00:21:02] I've logged in just fine :) [00:21:07] awight: lol [00:21:24] ostriches: random data point, kaldari has cn and sn that aren't exact matches. His cn is capitalized [00:21:34] Hmmmm [00:21:37] Ah, lemme check something [00:21:44] ok, it is applying the latest change right now [00:21:46] I swear... this open-source software gifted to us by mega-profit companies is really not working out well [00:21:54] the behavior of this over time has changed in wikitech when it makes accounts [00:21:55] which should stop any auto-restarts [00:22:04] My login also works [00:22:16] bd808, not to mention pre-wikitech accounts [00:22:27] A-ha! [00:22:31] Let's try this. [00:22:40] are shell names and wikitech names now the same? 
[00:22:41] Krenair: true [00:22:51] no TabbyCat [00:22:55] k [00:22:58] (03PS2) 10Smalyshev: Add configs for LDF server [puppet] - 10https://gerrit.wikimedia.org/r/317282 (https://phabricator.wikimedia.org/T136358) [00:23:00] Bah, I just realised who you are TabbyCat [00:23:03] I was looking for you earlier [00:23:07] I don't remember for what now [00:23:07] meow [00:23:13] ostriches: Logging in with "Kaldari" instead of "kaldari" works for me [00:23:20] TabbyCat: no, but the cn and sn will both be the same as the wikitech username for an account created today [00:23:54] I think that's it :) [00:24:03] I think basically they all need to match in the DB [00:24:23] Krenair we talked a few hours ago unless it is something different [00:24:33] TabbyCat, it was different [00:24:35] TabbyCat: I would have assumed you were a new AaronSchulz cat themed nick [00:24:43] doesn't quite explain awight though. [00:24:46] 143 | NULL | NULL | username:awight [00:24:47] TabbyCat, oh, right, you did this: https://www.mediawiki.org/w/index.php?title=Help:Extension:CentralAuth/Global_rename&diff=0&oldid=2303970 [00:24:55] yes [00:24:59] meow [00:25:07] TabbyCat, you rewrote half a page for wikitech, but it's on mediawiki.org [00:25:09] * AaronSchulz ate too much pizza [00:25:14] * AaronSchulz will eat more later anyway [00:25:33] so shall I move that stuff to wikitech instead? [00:25:53] legoktm had how to run eval.php there so I just tried to follow that logic [00:26:07] I'd leave it to legoktm personally [00:26:30] well, yeah, he will know better [00:26:56] :) I am 143! win! [00:27:08] * awight tucks feather in cap [00:27:10] we are 138 [00:27:28] fatal: krenair does not have "Access Database" capability. [00:27:29] pff [00:27:35] bd808: aargh now my victory song is stuck in a loop [00:27:54] * TabbyCat sleep [00:28:02] awight: and a loop of unknown meaning at that [00:28:49] yes! what *is* that about [00:29:15] ah ffs, why are the ACL history diffusion links now even MORE broken? 
[00:29:38] Diffusion links are dumb, I always said so [00:29:41] bd808: > Danzig's response to their statements: "They didn't write it, and they don't know what the f--k it's about." [00:29:44] Like gitblit before that [00:29:47] and gitweb before that [00:29:50] I used them [00:29:58] #sorrynotsorry [00:30:06] I got 99 problems but diffusion links ain't gonna be one [00:30:11] RECOVERY - puppet last run on kafka2002 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [00:30:37] seriously though, https://gerrit.wikimedia.org/r/#/admin/projects/All-Projects,access [00:30:43] the link is https://phabricator.wikimedia.orgrefs/meta/config [00:30:49] awight: *nod* And Glenn isn't telling for whatever random reason. [00:31:18] Krenair i fixed it [00:31:28] but you will have to wait for gerrit 2.13.4 [00:31:52] See https://gerrit-review.googlesource.com/#/c/92620/ [00:31:56] ostriches, is capability-access-database's database-accessing capability (or lack thereof) also not going to be your problem? [00:32:21] omg ] works in unified diff again [00:32:26] * bd808 hugs ostriches [00:32:43] kaldari: live on mwdebug1002 [00:32:52] Krenair: I already revoked that. [00:33:00] ostriches, why? [00:33:16] Because. [00:33:26] Dereckson: This is only a change to a (manually run) maintenance script, so can't test on mwdebug1002. [00:33:29] Nobody needs direct write access to the DB [00:33:42] And if they do, they have shell access [00:33:51] Sure but occasionally I've wanted to read the DB directly [00:33:59] Well, if it had a way to access w/o writing, sure. [00:34:11] But it doesn't. [00:34:15] So sorry [00:34:26] I guess I'll just have to put it back for my own account when I want it... 
at least there'll be a log that way [00:34:43] even if you can't access it from the web because we shut down gitblit and diffusion doesn't work properly [00:35:34] On the other hand it's going to be a pain in my ass [00:35:46] kaldari: yeah, I've checked the change before, but I thought perhaps you can check a dry run. But no, the script doesn't have a dry run mode. [00:35:51] PROBLEM - puppet last run on labstore1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:35:58] Dereckson: not this one [00:38:12] !log dereckson@tin Synchronized php-1.29.0-wmf.5/extensions/CentralAuth/maintenance/populateLocalAndGlobalIds.php: [[Gerrit:325733|Improve populateLocalAndGlobalIds maintenance script]] (T148242) (duration: 00m 46s) [00:38:19] Here you are. [00:38:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:38:28] T148242: Fully populate local_user_id and global_user_id fields in production - https://phabricator.wikimedia.org/T148242 [00:38:34] Next: there is a small issue in CommonSettings.php [00:38:35] 306 Notice: Use of undefined constant NS_TOPIC - assumed 'NS_TOPIC' in /srv/mediawiki/wmf-config/CommonSettings.php on line 2765 [00:38:59] another extension.json move perhaps? [00:39:02] To whom it may concern: use numerics for non-standard namespaces. [00:39:08] NS_TOPIC would be... Flow [00:39:12] ostriches according to https://gerrit.googlesource.com/gerrit/+/v2.13.2/ReleaseNotes/ReleaseNotes-2.13.2.txt there was no fix for capitals in login, contradicting what is said here https://www.gerritcodereview.com/releases/2.13.md#2.13.2 [00:39:26] so the same one that's been having problems for the last few days [00:39:28] RoanKattouw [00:41:20] paladox: I'm not sure that's the right issue though. It *was* indexed... [00:41:37] awight: Hmm, try one last time, for good measure.... [00:41:41] Yep, seems strange though. 
[00:41:49] awight: (all lowercase) [00:42:41] https://gerrit.googlesource.com/gerrit/+/4d07688b1e30ec1fad1def21fd5eac788ea19438 [00:43:30] https://gerrit.googlesource.com/gerrit/+/79ae5803bee893bf8f6fae2f17aa6b7dc4e67bb1 [00:44:02] paladox: 4d07688b1e30ec1fad1def21fd5eac788ea19438 totally unrelated. [00:44:12] oh [00:44:33] (03PS1) 10Dereckson: Use numerics instead of custom namespace constant for NS_TOPIC [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325863 [00:44:37] If you guys could review https://gerrit.wikimedia.org/r/#/c/325839/ that'd be great [00:44:47] I had it applied earlier, then took it off and it got restarted [00:44:52] So it's no longer live [00:44:54] The problem is only triggered for newly created accounts, after the migration to the 2.13 release, because the old accounts were reindexed during the migration. All newly created users after the migration are suffering from this bug. [00:44:58] But now the opposite is true. [00:45:05] New accounts are fine, ones before the migration aren't [00:45:09] Or they aren't getting indexed. [00:45:11] ostriches: private tab, lowercase, "remind me" checked--no dice [00:45:12] lol [00:45:28] Should an issue be raised on their tracker? 
[00:45:36] Not until I have some idea where to even look [00:45:40] Ok [00:45:59] (03PS1) 10Dzahn: install: copy/move apt.wm.org setup to aptrepo module [puppet] - 10https://gerrit.wikimedia.org/r/325864 (https://phabricator.wikimedia.org/T132757) [00:46:13] This looks like something of interest https://gerrit-review.googlesource.com/#/c/79089/ (i am digging through the commits) [00:46:28] What, I thought I'd fixed all the Flow constants in the config [00:46:29] (03CR) 10jenkins-bot: [V: 04-1 C: 040] install: copy/move apt.wm.org setup to aptrepo module [puppet] - 10https://gerrit.wikimedia.org/r/325864 (https://phabricator.wikimedia.org/T132757) (owner: 10Dzahn) [00:47:50] (03PS2) 10Dzahn: install: copy/move apt.wm.org setup to aptrepo module [puppet] - 10https://gerrit.wikimedia.org/r/325864 (https://phabricator.wikimedia.org/T132757) [00:48:31] So normally https://gerrit.wikimedia.org/r/#/c/325863 should fix it, and to test that, we need a non english wiki with LQT [00:48:39] (03CR) 10jenkins-bot: [V: 04-1 C: 040] install: copy/move apt.wm.org setup to aptrepo module [puppet] - 10https://gerrit.wikimedia.org/r/325864 (https://phabricator.wikimedia.org/T132757) (owner: 10Dzahn) [00:48:53] hu.wikipedia or sv.wikisource [00:48:57] or pt.wikibooks [00:50:21] (03PS3) 10Dzahn: install: copy/move apt.wm.org setup to aptrepo module [puppet] - 10https://gerrit.wikimedia.org/r/325864 (https://phabricator.wikimedia.org/T132757) [00:52:11] PROBLEM - MariaDB Slave IO: m2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:52:11] PROBLEM - MariaDB Slave IO: s7 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[00:52:55] jenkins-bot tell me about the _good_ things too [00:53:01] RECOVERY - MariaDB Slave IO: m2 on dbstore1001 is OK: OK slave_io_state not a slave [00:53:01] RECOVERY - MariaDB Slave IO: s7 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [00:53:35] (03PS2) 10Dereckson: Use numerics instead of custom namespace constant for NS_TOPIC [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325863 (https://phabricator.wikimedia.org/T152651) [00:54:20] (03CR) 10Dzahn: [V: 040 C: 040] "maybe first https://gerrit.wikimedia.org/r/#/c/325864/" [puppet] - 10https://gerrit.wikimedia.org/r/325739 (https://phabricator.wikimedia.org/T132757) (owner: 10Dzahn) [00:55:24] arf Krenair [00:55:26] https://pt.wikibooks.org/wiki/Especial:Todas_as_páginas [00:55:51] the current issue is funny, it adds a new namespace at the start of the list [00:56:16] What am I looking at here? Something funny with LQT vs. Flow? [00:56:18] https://pt.wikibooks.org/wiki/T%C3%B3pico:Pnp47r0sz4pfj18n and indeed Flow content isn't reachable [00:56:25] yep [00:56:36] It's almost 1AM here and I'm not quite up to digging back into LQT [00:56:38] LQT has to be Tópico: [00:56:42] and Flow has to be Topic: [00:56:57] that's the solution found for non-English wikis [00:57:18] Let's check that [00:57:43] (03CR) 10Dereckson: [V: 040 C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325863 (https://phabricator.wikimedia.org/T152651) (owner: 10Dereckson) [00:57:48] (03CR) 10jenkins-bot: [V: 040 C: 040] Use numerics instead of custom namespace constant for NS_TOPIC [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325863 (https://phabricator.wikimedia.org/T152651) (owner: 10Dereckson) [00:58:14] (03Merged) 10jenkins-bot: Use numerics instead of custom namespace constant for NS_TOPIC [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325863 (https://phabricator.wikimedia.org/T152651) (owner: 10Dereckson) [00:59:33] Live on mwdebug1002 [01:00:00] 
https://pt.wikibooks.org/wiki/T%C3%B3pico:Ajuda_Discuss%C3%A3o:Etapas_de_desenvolvimento/economia [01:00:03] LQT is back [01:00:04] twentyafterfour: Dear anthropoid, the time has come. Please deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161208T0100). [01:00:41] the 0's being bold and red is going to bug me :/ [01:01:52] https://pt.wikibooks.org/wiki/Topic:Pn326towg59zc9fk back too [01:01:55] so yes the fix works [01:03:51] RECOVERY - puppet last run on labstore1004 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [01:03:55] (03CR) 10BryanDavis: [V: 040 C: 04-1] "needs a manual rebase because the files have moved to modules/role/files/logstash" [puppet] - 10https://gerrit.wikimedia.org/r/299825 (owner: 10BryanDavis) [01:04:17] !log dereckson@tin Synchronized wmf-config/CommonSettings.php: Use numerics instead of custom namespace constant for NS_TOPIC (T152651) (duration: 00m 45s) [01:04:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:04:31] T152651: Fix NS_TOPIC Flow configuration issue - https://phabricator.wikimedia.org/T152651 [01:04:55] 0,0 [01:04:57] aww [01:05:00] awight: Try again? [01:05:01] <3 [01:05:08] Wait one sec [01:05:20] nop [01:05:28] k wait looping [01:06:27] Ok now [01:11:16] * ostriches hopes the silence is awight just overwhelmed with excitement at it working [01:11:31] no /o\ [01:11:40] dang it [01:11:59] ok we're gonna try something. you're gonna be a guinea pig [01:12:13] this might actually work [01:12:34] * awight squeezes blood from the stone [01:13:01] either that or it'll blow up in my face lol [01:13:32] no, not that wire! [01:17:33] awight: Now try [01:17:44] * ostriches is trying something crazy [01:17:48] O_O [01:17:53] wat did you do. it worked. [01:18:02] Ok, now let's see if I can clean this up, log out. 
[01:18:14] Technically you have a new account, but I'm going to try and give you your old account_id back [01:18:43] oh my. signed out. [01:18:56] https://en.wikipedia.org/wiki/Laika [01:19:04] I... don't think she made it. [01:20:02] awight: Ok, reassigned, flushed caches. Try.... [01:20:06] * ostriches crosses fingers [01:21:11] all systems go. houston, we have problems [01:21:20] .ashcasuhicas897hcas7asb8yascbuyasbhicasbh [01:21:22] Son of a [01:21:26] * awight kicks own account [01:22:08] I mean, I suppose the other way works, reassign all stuff to your *new* account_id [01:22:10] But seriously. [01:22:11] I need some non-work time, but will try again in a few hours & will either IRC or add it to the task [01:22:12] wtf. [01:22:24] (03PS1) 10Yurik: LABS: set license for structured data [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325866 [01:22:25] I am beyond stumped. [01:22:27] no such thing as a free software lunch [01:22:47] If it's a clue, my account was renamed from adamw once upon a time [01:22:56] Nah, probably isn't it. [01:23:06] ostriches, are you deploying something? I need to push labs config change ^^^^ [01:23:20] yurik: "beta cluster" [01:23:22] good luck, I'll check in maybe in 3 hours [01:23:39] yurik: No, I'm futzing with gerrit [01:23:41] * yurik throws a very heavy cookie at bd808 [01:23:50] bd808, you want to do it instead? :-P [01:24:00] I want someone to figure this out for me instead. [01:24:04] * ostriches has a headache [01:24:55] oki, i'm about to sync commonssetting-labs... bd808 -- LABS! 
:-P [01:25:05] Terrible file names [01:25:07] I should fix that [01:25:10] ohh boy [01:25:18] ostriches: yes plz [01:25:19] ostriches, you shouldn't touch production with a headache :-P [01:25:34] (03CR) 10Yurik: [V: 040 C: 032] LABS: set license for structured data [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325866 (owner: 10Yurik) [01:25:37] Well, the headache came as a result of it being broken and needing touching [01:25:39] (03CR) 10jenkins-bot: [V: 040 C: 040] LABS: set license for structured data [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325866 (owner: 10Yurik) [01:25:40] pushing ^^ [01:25:44] I didn't embark on this for fun :p [01:25:48] ostriches: what's the current lead btw? out of curiosity [01:26:05] (03Merged) 10jenkins-bot: LABS: set license for structured data [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325866 (owner: 10Yurik) [01:26:15] It has *something* to do with capitalization of usernames, caches, and lucene indexes. [01:26:23] For which there have been several upstream fixes in 2.12/2.13 [01:26:37] All of which we have, but who knows if they made things worse. [01:26:42] It does not affect *all* accounts [01:26:55] Closest I got to a workaround was: [01:27:03] 1) Delete old user, flush caches & reindex [01:27:12] 2) Have user login again, flush caches & reindex [01:27:19] [user now has a working account] [01:27:40] 3) Assign them their old userid back so we could keep their history, flush cache & reindex [01:27:44] 4) Boom, back to broken [01:28:17] I suppose I could technically stay at (2) and make (3) "Reassign all their old work to the new user id" [01:28:27] But that's a lot of work....especially if this ends up being more widespread.... 
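The manual workaround sketched in steps 1-4 above boils down to a couple of SQL statements against Gerrit's ReviewDb plus a cache flush. A minimal sketch of step 3 follows; everything in it is reconstructed from the chat and should be treated as an assumption (the 2.13-era table and column names, and the account IDs: 143 for awight's original account, 4188 for the stray new one). It only prints the commands rather than executing anything.

```shell
# Sketch only: the gsql-based account-id reassignment described above.
# Table/column names (Gerrit 2.13 ReviewDb) and account IDs are assumptions.
old_id=143
new_id=4188
sql="UPDATE account_external_ids SET account_id = ${old_id} WHERE account_id = ${new_id};"

# Print the commands rather than running them; on the real server these would
# go through the SSH admin interface, e.g. "ssh -p 29418 <host> gerrit gsql ...".
echo "gerrit gsql -c \"${sql}\""
echo "gerrit flush-caches --cache accounts"
```

As the chat shows, the flush/reindex after the UPDATE is the part that kept failing, so this is a record of what was tried, not a recommended procedure.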
[01:28:39] !log yurik@tin Synchronized wmf-config/CommonSettings-labs.php: set license for structured data 325866 (duration: 00m 45s) [01:28:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:29:14] ostriches: *nod* and the switching of userid happens via gsql ? [01:29:27] Yeah from cobalt [01:30:20] ostriches that sounds like a broken migration script in gerrit. Anyways we should report it to gerrit even if we don't have all the facts, like which file the bug is in. We can always add those details later. The quicker we bring this to their attention the more likely it will be fixed. [01:30:24] :) [01:34:25] https://groups.google.com/forum/#!msg/repo-discuss/U6qM_4j5wG0/TEZi-1RkCQAJ [01:36:00] ostriches: I was peeking at the logs and the last attempt also had this [2016-12-08 01:21:01,461] WARN com.google.gerrit.server.query.account.InternalAccountQuery : Ambiguous external ID gerrit:awight [01:36:04] for accounts: 4188, 143 [01:37:03] Um.... [01:37:08] Ambiguous would be bad. [01:37:15] 06Operations, 10Ops-Access-Requests, 06Research-and-Data: Request access to data/cluster for article expansion research - https://phabricator.wikimedia.org/T151969#2856209 (10RobH) a:05leila>03None Please note that ops clinic duty handles this, not @cmjohnson specifically. As this was missing the #ops-a... [01:37:31] godog, paladox: I just filed a task https://bugs.chromium.org/p/gerrit/issues/detail?id=5090 [01:37:43] ostriches thanks :) [01:37:51] thanks ostriches ! [01:39:01] PROBLEM - puppet last run on sca2004 is CRITICAL: CRITICAL: Catalog fetch fail. 
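The "Ambiguous external ID" warning quoted above is the key diagnostic: two account_ids (4188 and 143) claim the same external ID. A quick grep like the one sketched below is how one would pull such lines out of Gerrit's log; the real log path ($site_path/logs/error_log by default) is an assumption, so this recreates the quoted line in a scratch file first.

```shell
# Recreate the warning quoted above in a scratch file, then filter it the way
# one would against Gerrit's real error_log (default path is an assumption).
log=$(mktemp)
cat > "$log" <<'EOF'
[2016-12-08 01:21:01,461] WARN com.google.gerrit.server.query.account.InternalAccountQuery : Ambiguous external ID gerrit:awight for accounts: 4188, 143
EOF
# -o keeps only the useful part: which account_ids collide on one external ID
match=$(grep -o 'Ambiguous external ID.*' "$log")
echo "$match"
rm -f "$log"
```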
Either compilation failed or puppetmaster has issues [01:40:42] I know that workaround works as far as the "make a new account_id" [01:40:48] But trying to give back the old id fails [01:40:52] And something is clearly broken here [01:41:31] hopefully they will fix it :) [01:43:16] I think my old magic powers were why I was able to raise the priority to 1 :) [01:43:53] lol [01:43:58] you're a project member [01:44:42] Like I said, old magic powers! [01:44:49] yep [01:47:29] (03PS1) 10RobH: adding new shell users arnad & jgonsior [puppet] - 10https://gerrit.wikimedia.org/r/325868 (https://phabricator.wikimedia.org/T152023) [01:47:59] ostriches Starred by 2 users [01:48:48] 06Operations, 10Ops-Access-Requests, 06Research-and-Data, 13Patch-For-Review: Request access to data/cluster for understanding WDQS - https://phabricator.wikimedia.org/T152023#2835737 (10RobH) Please note that there are a couple things that'll make future requests easier: * tag with #ops-access-requests,... [01:50:44] * paladox has to go, 01:50am. [01:54:49] (03PS1) 10RobH: new shell user piccardi [puppet] - 10https://gerrit.wikimedia.org/r/325869 (https://phabricator.wikimedia.org/T151969) [01:55:05] 06Operations, 10Ops-Access-Requests, 06Research-and-Data, 13Patch-For-Review: Request access to data/cluster for article expansion research - https://phabricator.wikimedia.org/T151969#2856231 (10RobH) Please note that there are a couple things that'll make future requests easier: * tag with #ops-access-re... [01:57:08] 06Operations, 10Beta-Cluster-Infrastructure, 03Scap3 (Scap3-MediaWiki-MVP), 07WorkType-NewFunctionality: etcd/confd is not started on beta cluster Varnish caches - https://phabricator.wikimedia.org/T116224#2856262 (10demon) [02:04:31] PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [02:08:01] RECOVERY - puppet last run on sca2004 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [02:11:51] (03PS3) 10Tim Landscheidt: mwyaml: Accept existing, but empty "Hiera:" pages as well [puppet] - 10https://gerrit.wikimedia.org/r/325131 (https://phabricator.wikimedia.org/T152142) [02:26:27] 06Operations, 10Beta-Cluster-Infrastructure, 03Scap3 (Scap3-MediaWiki-MVP), 07WorkType-NewFunctionality: etcd/confd is not started on beta cluster Varnish caches - https://phabricator.wikimedia.org/T116224#2856292 (10Krenair) 05Open>03Resolved a:03Krenair https://wikitech.wikimedia.org/w/index.php?ti... [02:32:31] RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [02:55:12] 06Operations, 06Performance-Team: Upgrade labmon1001 Grafana to 4.0.1 - https://phabricator.wikimedia.org/T152473#2856349 (10Gilles) For alterting we might want to configure email. Afaik the Grafana ini file needs SMTP config, it doesn't use sendmail. [02:58:51] PROBLEM - puppet last run on pc1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [03:25:31] PROBLEM - puppet last run on dbstore1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [03:26:11] PROBLEM - puppet last run on lvs3001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [03:26:51] RECOVERY - puppet last run on pc1006 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [03:30:01] (03PS7) 10Dzahn: gerrit: Fix jenkins comments to pretty them again [puppet] - 10https://gerrit.wikimedia.org/r/325826 (owner: 10Paladox) [03:35:11] PROBLEM - check_disk on barium is CRITICAL: DISK CRITICAL - free space: / 5874 MB (10% inode=90%): /dev 7976 MB (99% inode=99%): /run 1558 MB (97% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 7985 MB (100% inode=99%): /archive 2789378 MB (97% inode=99%): /boot 202 MB (77% inode=99%): /archive/banner_logs 1587819 MB (34% inode=99%) [03:40:11] PROBLEM - check_disk on barium is CRITICAL: DISK CRITICAL - free space: / 5822 MB (10% inode=90%): /dev 7976 MB (99% inode=99%): /run 1558 MB (97% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 7985 MB (100% inode=99%): /archive 2789378 MB (97% inode=99%): /boot 202 MB (77% inode=99%): /archive/banner_logs 1587763 MB (34% inode=99%) [03:45:11] PROBLEM - check_disk on barium is CRITICAL: DISK CRITICAL - free space: / 5779 MB (10% inode=90%): /dev 7976 MB (99% inode=99%): /run 1558 MB (97% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 7985 MB (100% inode=99%): /archive 2789378 MB (97% inode=99%): /boot 202 MB (77% inode=99%): /archive/banner_logs 1587706 MB (34% inode=99%) [03:49:05] (03CR) 10Dzahn: [C: 032] gerrit: Fix jenkins comments to pretty them again [puppet] - 10https://gerrit.wikimedia.org/r/325826 (owner: 10Paladox) [03:49:30] (03PS1) 10Chad: Rewrite wmf-beta-autoupdate as a scap3 plugin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325875 (https://phabricator.wikimedia.org/T151519) [03:50:11] PROBLEM - check_disk on barium is CRITICAL: DISK CRITICAL - free space: / 5735 MB (10% inode=90%): /dev 7976 MB (99% inode=99%): /run 1558 MB (97% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 7985 MB (100% inode=99%): /archive 2789378 MB (97% inode=99%): 
/boot 202 MB (77% inode=99%): /archive/banner_logs 1587801 MB (34% inode=99%) [03:50:52] (03CR) 10Chad: "For comparison:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325875 (https://phabricator.wikimedia.org/T151519) (owner: 10Chad) [03:53:31] RECOVERY - puppet last run on dbstore1002 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [03:54:11] RECOVERY - puppet last run on lvs3001 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [03:56:31] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:57:21] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy [03:59:13] !log manually restarting gerrit to pick up config change to make jenkins comments pretty again (https://gerrit.wikimedia.org/r/#/c/325826/) (we stopped letting puppet do it for now) [03:59:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:59:49] already over [04:07:39] (03PS2) 10Dzahn: Move wikimedia-logo.svg to role module [puppet] - 10https://gerrit.wikimedia.org/r/325729 (owner: 10Tim Landscheidt) [04:15:11] PROBLEM - check_disk on barium is CRITICAL: DISK CRITICAL - free space: / 5875 MB (10% inode=90%): /dev 7976 MB (99% inode=99%): /run 1558 MB (97% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 7985 MB (100% inode=99%): /archive 2789360 MB (97% inode=99%): /boot 202 MB (77% inode=99%): /archive/banner_logs 1587683 MB (34% inode=99%) [04:17:51] (03PS1) 10EBernhardson: Add a few more php7.0 package dependencies [puppet] - 10https://gerrit.wikimedia.org/r/325877 [04:20:11] PROBLEM - check_disk on barium is CRITICAL: DISK CRITICAL - free space: / 5834 MB (10% inode=90%): /dev 7976 MB (99% inode=99%): /run 1558 MB (97% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 7985 MB (100% inode=99%): /archive 2789360 MB (97% inode=99%): /boot 202 MB (77% inode=99%): /archive/banner_logs 1587773 MB (34% inode=99%) 
[04:25:11] PROBLEM - check_disk on barium is CRITICAL: DISK CRITICAL - free space: / 5784 MB (10% inode=90%): /dev 7976 MB (99% inode=99%): /run 1558 MB (97% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 7985 MB (100% inode=99%): /archive 2789360 MB (97% inode=99%): /boot 202 MB (77% inode=99%): /archive/banner_logs 1587722 MB (34% inode=99%) [04:30:11] PROBLEM - check_disk on barium is CRITICAL: DISK CRITICAL - free space: / 5738 MB (10% inode=90%): /dev 7976 MB (99% inode=99%): /run 1558 MB (97% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 7985 MB (100% inode=99%): /archive 2789360 MB (97% inode=99%): /boot 202 MB (77% inode=99%): /archive/banner_logs 1587671 MB (34% inode=99%) [04:33:11] PROBLEM - puppet last run on nescio is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [04:34:03] hey - we have a disk space issue on the fundraising database [04:34:19] is anyone about to help? [04:35:11] PROBLEM - check_disk on barium is CRITICAL: DISK CRITICAL - free space: / 5695 MB (10% inode=90%): /dev 7976 MB (99% inode=99%): /run 1558 MB (97% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 7985 MB (100% inode=99%): /archive 2789360 MB (97% inode=99%): /boot 202 MB (77% inode=99%): /archive/banner_logs 1587758 MB (34% inode=99%) [04:37:12] eileen: do we need to kick some people? 
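The check_disk alerts above cram every mount point into a single NRPE line, which makes the actual trend (root filesystem free space shrinking from ~5.9 GB toward 10%) easy to miss. A small filter like this sketch pulls out just the root filesystem figure; the sample line is copied from the 03:35 barium alert.

```shell
# Extract the root filesystem's free space from a check_disk NRPE line.
# Sample input is taken from the 03:35 barium alert above (abbreviated).
line='DISK CRITICAL - free space: / 5874 MB (10% inode=90%): /dev 7976 MB (99% inode=99%): /boot 202 MB (77% inode=99%)'

# "/ <N> MB" (slash followed by a space) only matches the root mount;
# "/dev 7976 MB" etc. do not, since their slash is followed by a letter.
root_free=$(printf '%s\n' "$line" | grep -oE '/ [0-9]+ MB' | head -n1 | awk '{print $2}')
echo "root free: ${root_free} MB"
```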
[04:38:03] yeah - there are some log files that would clear up a tonne of space [04:38:12] but, I don't have permission to delete them [04:38:24] but, if I can craft the drush command correctly I can [04:39:34] who is about on this channel - I just need someone with more permissions on the server [04:40:11] PROBLEM - check_disk on barium is CRITICAL: DISK CRITICAL - free space: / 5652 MB (10% inode=90%): /dev 7976 MB (99% inode=99%): /run 1558 MB (97% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 7985 MB (100% inode=99%): /archive 2789360 MB (97% inode=99%): /boot 202 MB (77% inode=99%): /archive/banner_logs 1587709 MB (34% inode=99%) [04:44:31] I just tried texting Jeff [04:44:38] but it is late there [04:45:11] PROBLEM - check_disk on barium is CRITICAL: DISK CRITICAL - free space: / 5615 MB (10% inode=90%): /dev 7976 MB (99% inode=99%): /run 1558 MB (97% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 7985 MB (100% inode=99%): /archive 2789360 MB (97% inode=99%): /boot 202 MB (77% inode=99%): /archive/banner_logs 1587662 MB (34% inode=99%) [04:50:11] PROBLEM - check_disk on barium is CRITICAL: DISK CRITICAL - free space: / 5588 MB (10% inode=90%): /dev 7976 MB (99% inode=99%): /run 1558 MB (97% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 7985 MB (100% inode=99%): /archive 2789360 MB (97% inode=99%): /boot 202 MB (77% inode=99%): /archive/banner_logs 1587744 MB (34% inode=99%) [04:54:01] PROBLEM - puppet last run on sca1003 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [05:02:11] RECOVERY - puppet last run on nescio is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [05:04:26] mark are you there - trying to figure out who can give some ops support for fundraising server [05:15:38] 06Operations, 10Ops-Access-Requests, 06Research-and-Data, 13Patch-For-Review: Request access to data/cluster for article expansion research - https://phabricator.wikimedia.org/T151969#2856449 (10leila) >>! In T151969#2856231, @RobH wrote: > Please note that there are a couple things that'll make future req... [05:18:02] (03PS2) 10Legoktm: contint: Add a few more php7.0 package dependencies [puppet] - 10https://gerrit.wikimedia.org/r/325877 (owner: 10EBernhardson) [05:18:28] (03CR) 10Legoktm: [C: 031] "A little confusing that these don't have the php7.0 prefix, but okay. I'll cherry-pick this onto the puppetmaster." [puppet] - 10https://gerrit.wikimedia.org/r/325877 (owner: 10EBernhardson) [05:18:55] eileen: do people need to be paged? [05:20:16] legoktm: I did try jeff - not sure who is in the best timezone [05:20:50] it's a little late for robh and mutante, maybe _joe_ is awake early? [05:21:12] or apergos? [05:21:19] we might have a way to hack around the permissions -just discussing it with ejegg [05:21:58] emailing the ops list is probably a good idea if you can't get any ops right now [05:22:28] ok - I'll try that [05:23:01] RECOVERY - puppet last run on sca1003 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [05:23:02] what's the list email? [05:23:09] ops@lists.wikimedia.org [05:27:57] thanks legoktm [05:34:36] eileen: if it's urgent at all call someone [05:35:00] contacts are on the staff list on officewiki [05:35:09] I think we can get around it....
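While waiting for someone with root, the usual first step in an incident like this is working out which files are actually eating the disk (the log eventually notes the cleanup was done through a jenkins job). A generic sketch, not what FR-tech ran; the path you would pass in (e.g. the banner-log directory from the alerts) is up to the operator:

```python
import os

def largest_files(root, top_n=5):
    """Walk `root` and return up to top_n (size_bytes, path) pairs,
    biggest first -- a quick way to see which logs are worth clearing."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                sizes.append((os.path.getsize(path), path))
            except OSError:
                pass  # file vanished or unreadable mid-walk; skip it
    return sorted(sizes, reverse=True)[:top_n]
```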
[05:35:45] hurry up and get that second FR root hired :) [05:37:09] OK - looks like we can use jenkins to delete the log file that is unneeded [05:37:17] & will get our space back down [06:25:11] RECOVERY - check_disk on barium is OK: DISK OK - free space: / 11323 MB (21% inode=90%): /dev 7976 MB (99% inode=99%): /run 1558 MB (97% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 7985 MB (100% inode=99%): /archive 2789340 MB (97% inode=99%): /boot 202 MB (77% inode=99%): /archive/banner_logs 1587635 MB (34% inode=99%) [06:28:17] issue resolved! [06:34:41] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=576.10 Read Requests/Sec=523.40 Write Requests/Sec=33.50 KBytes Read/Sec=42091.60 KBytes_Written/Sec=434.40 [06:43:41] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=74.40 Read Requests/Sec=5.20 Write Requests/Sec=7.30 KBytes Read/Sec=32.40 KBytes_Written/Sec=251.20 [06:56:24] (03CR) 1020after4: [C: 031] Rewrite wmf-beta-autoupdate as a scap3 plugin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325875 (https://phabricator.wikimedia.org/T151519) (owner: 10Chad) [07:26:41] PROBLEM - puppet last run on mc1021 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [07:55:41] RECOVERY - puppet last run on mc1021 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [08:01:01] PROBLEM - puppet last run on sca2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:07:04] (03PS3) 10Legoktm: contint: Add a few more php7.0 package dependencies [puppet] - 10https://gerrit.wikimedia.org/r/325877 (owner: 10EBernhardson) [08:29:01] RECOVERY - puppet last run on sca2003 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [08:49:01] PROBLEM - puppet last run on db1034 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [08:51:21] PROBLEM - puppet last run on mw1295 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:53:01] PROBLEM - salt-minion processes on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:53:11] PROBLEM - dhclient process on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:53:51] RECOVERY - salt-minion processes on thumbor1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [08:54:01] RECOVERY - dhclient process on thumbor1001 is OK: PROCS OK: 0 processes with command name dhclient [08:55:56] (03CR) 10Hashar: [C: 031] "From a quick chat with Kunal:" [puppet] - 10https://gerrit.wikimedia.org/r/325877 (owner: 10EBernhardson) [09:15:45] jouncebot: next [09:15:45] In 4 hour(s) and 44 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161208T1400) [09:17:01] RECOVERY - puppet last run on db1034 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [09:18:07] (03PS1) 10Urbanecm: [throttle] Lift six-account limit - 2016-12-08 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325899 (https://phabricator.wikimedia.org/T152669) [09:19:32] Hi everyone, can somebody deploy https://gerrit.wikimedia.org/r/325899 right now? 
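The patch being requested here lifts MediaWiki's default limit of six account creations per IP per day for an edit-a-thon. In Wikimedia production this is done with a dated exception entry in wmf-config/throttle.php (a PHP array); a simplified Python model of the pattern, with a hypothetical event window and a hypothetical raised limit of 40:

```python
from datetime import datetime, timezone

# Simplified model; the real config is a PHP array in wmf-config/throttle.php.
# The window and raised limit below are hypothetical, for illustration only.
EXCEPTIONS = [{
    "from": datetime(2016, 12, 8, 0, 0, tzinfo=timezone.utc),
    "to": datetime(2016, 12, 9, 0, 0, tzinfo=timezone.utc),
    "value": 40,  # raised per-IP limit during the event window
}]

def throttle_limit(now, exceptions=EXCEPTIONS, default=6):
    """Return the account-creation limit in force at `now`
    (default is the normal six accounts per IP per day)."""
    for exc in exceptions:
        if exc["from"] <= now <= exc["to"]:
            return exc["value"]
    return default
```

Because the exceptions are dated, they expire on their own and no revert deploy is needed after the event.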
[09:19:52] See T152669 for details [09:19:52] T152669: Lift 6 account creation limit for #100womenwiki event on8th December - https://phabricator.wikimedia.org/T152669 [09:20:08] (03PS1) 10Yuvipanda: paws_internal: add statistics-privatedata-users to notebook* [puppet] - 10https://gerrit.wikimedia.org/r/325900 [09:20:19] (03CR) 10Legoktm: [C: 04-1] [throttle] Lift six-account limit - 2016-12-08 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325899 (https://phabricator.wikimedia.org/T152669) (owner: 10Urbanecm) [09:20:21] RECOVERY - puppet last run on mw1295 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [09:22:31] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 20 probes of 255 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [09:22:41] PROBLEM - IPv4 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 63 probes of 422 (alerts on 19) - https://atlas.ripe.net/measurements/1791307/#!map [09:23:05] (03PS2) 10Urbanecm: [throttle] Lift six-account limit - 2016-12-08 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325899 (https://phabricator.wikimedia.org/T152669) [09:23:11] (03PS1) 10Yuvipanda: paws_internal: Allow ops / statistics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/325901 [09:24:49] (03CR) 10Yuvipanda: [C: 032] paws_internal: add statistics-privatedata-users to notebook* [puppet] - 10https://gerrit.wikimedia.org/r/325900 (owner: 10Yuvipanda) [09:25:04] (03PS2) 10Yuvipanda: paws_internal: add statistics-privatedata-users to notebook* [puppet] - 10https://gerrit.wikimedia.org/r/325900 [09:25:09] (03CR) 10Yuvipanda: [V: 032 C: 032] paws_internal: add statistics-privatedata-users to notebook* [puppet] - 10https://gerrit.wikimedia.org/r/325900 (owner: 10Yuvipanda) [09:26:27] (03PS3) 10Legoktm: throttle: Lift six-account limit for BBC 100 Women event - 2016-12-08 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325899 
(https://phabricator.wikimedia.org/T152669) (owner: 10Urbanecm) [09:26:38] (03CR) 10Legoktm: [C: 032] throttle: Lift six-account limit for BBC 100 Women event - 2016-12-08 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325899 (https://phabricator.wikimedia.org/T152669) (owner: 10Urbanecm) [09:27:22] (03Merged) 10jenkins-bot: throttle: Lift six-account limit for BBC 100 Women event - 2016-12-08 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325899 (https://phabricator.wikimedia.org/T152669) (owner: 10Urbanecm) [09:27:31] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 9 probes of 255 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [09:27:41] RECOVERY - IPv4 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 1 probes of 422 (alerts on 19) - https://atlas.ripe.net/measurements/1791307/#!map [09:27:50] (03PS2) 10Yuvipanda: paws_internal: Allow ops / statistics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/325901 [09:27:57] (03CR) 10Yuvipanda: [V: 032 C: 032] paws_internal: Allow ops / statistics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/325901 (owner: 10Yuvipanda) [09:29:55] !log legoktm@tin Synchronized wmf-config/throttle.php: Lift six-account limit for BBC 100 Women event T152669 (duration: 01m 39s) [09:30:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:11] T152669: Lift 6 account creation limit for #100womenwiki event on8th December - https://phabricator.wikimedia.org/T152669 [09:40:51] PROBLEM - puppet last run on lvs1006 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [10:03:59] (03PS4) 10Gehel: contint: Add a few more php7.0 package dependencies [puppet] - 10https://gerrit.wikimedia.org/r/325877 (owner: 10EBernhardson) [10:05:55] (03CR) 10Gehel: [C: 032] contint: Add a few more php7.0 package dependencies [puppet] - 10https://gerrit.wikimedia.org/r/325877 (owner: 10EBernhardson) [10:08:51] RECOVERY - puppet last run on lvs1006 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [10:18:06] !log restarting elasticsearch codfw cluster restart for Java 8 upgrade - T151325 [10:18:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:19] T151325: Upgrade to Java 8 for cirrus / elasticsearch - https://phabricator.wikimedia.org/T151325 [10:43:18] (03CR) 10Ema: cache_misc req_handling: subpaths and defaulting (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/300655 (https://phabricator.wikimedia.org/T110717) (owner: 10BBlack) [10:44:22] 06Operations, 10Mail: Create an Alias - https://phabricator.wikimedia.org/T152641#2856783 (10Aklapper) --> #operations [10:44:35] 06Operations, 10Mail: Create email alias for benefactors@ - https://phabricator.wikimedia.org/T152641#2856786 (10Aklapper) [11:37:11] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 224, down: 2, dormant: 0, excluded: 0, unused: 0BRxe-4/0/3: down - Core: asw2-d-eqiad:xe-7/0/40 {#3465} [10Gbps DF]BRxe-4/3/3: down - Core: asw2-d-eqiad:xe-7/0/41 {#3537} [10Gbps DF]BR [11:40:29] that's me [11:41:11] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 234, down: 0, dormant: 0, excluded: 0, unused: 0 [12:31:41] PROBLEM - puppet last run on db1041 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:32:01] PROBLEM - puppet last run on sca2003 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. 
Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server] [12:37:31] PROBLEM - puppet last run on kubernetes1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:40:41] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 631 600 - REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4774021 keys, up 38 days 4 hours - replication_delay is 631 [12:42:01] PROBLEM - puppet last run on mw2099 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:48:01] !log restarting elasticsearch eqiad cluster restart for Java 8 upgrade - T151325 [12:48:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:48:15] T151325: Upgrade to Java 8 for cirrus / elasticsearch - https://phabricator.wikimedia.org/T151325 [12:51:41] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4753131 keys, up 38 days 4 hours - replication_delay is 0 [12:55:01] PROBLEM - MariaDB Slave SQL: s4 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1062, Errmsg: Error Duplicate entry 436249921 for key PRIMARY on query. Default database: commonswiki. Query: [snipped]2 [12:56:07] 06Operations, 10DBA, 10Monitoring: Create script to monitor db dumps for backups are successful (and if not, old backups are not deleted) - https://phabricator.wikimedia.org/T151999#2856939 (10akosiaris) So, Bacula does indeed honor an error return from a ClientRunBeforeJob script. http://www.bacula.org/5.... 
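The Bacula comment above notes that an error return from a ClientRunBeforeJob script is honored, which is exactly what T151999 needs: if the dump did not succeed, the backup job should abort rather than archive a bad dump and expire good old copies. A sketch of such a pre-job check, with hypothetical freshness and size limits:

```python
import os
import time

def dump_ok(path, max_age_hours=26, min_size=1024):
    """Pre-backup sanity check in the spirit of a Bacula
    ClientRunBeforeJob script: the dump must exist, be fresh, and not be
    suspiciously small. A wrapper would sys.exit(1) when this returns
    False, making Bacula abort the job. The age and size limits here
    are placeholder values."""
    if not os.path.exists(path):
        return False
    age_hours = (time.time() - os.path.getmtime(path)) / 3600
    return age_hours <= max_age_hours and os.path.getsize(path) >= min_size
```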
[12:59:41] RECOVERY - puppet last run on db1041 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [13:00:02] RECOVERY - puppet last run on sca2003 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [13:05:31] RECOVERY - puppet last run on kubernetes1004 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [13:07:41] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 616 600 - REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4753528 keys, up 38 days 4 hours - replication_delay is 616 [13:11:01] RECOVERY - puppet last run on mw2099 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [13:37:41] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4752683 keys, up 38 days 5 hours - replication_delay is 45 [13:38:11] jouncebot: Nemo_bis [13:38:14] jouncebot: next [13:38:14] In 0 hour(s) and 21 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161208T1400) [13:38:16] sorry nemo [13:52:05] !log upgrading cache_misc and cache_maps to varnish 4.1.4-1wm1 [13:52:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:24] (03CR) 10Faidon Liambotis: "This is orthogonal to this changeset, however I can't help but notice that you're encountering new issues because of finding a new NIH way" [puppet] - 10https://gerrit.wikimedia.org/r/325570 (owner: 10Hashar) [13:59:13] (03CR) 10Faidon Liambotis: [C: 04-2] "We can just import ieee-data 20160613.1 into our jessie repo. 
From a quick look, it looks like a super simple package that won't even need" [puppet] - 10https://gerrit.wikimedia.org/r/325699 (https://phabricator.wikimedia.org/T152440) (owner: 10Filippo Giunchedi) [14:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161208T1400). Please do the needful. [14:04:41] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 640 600 - REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4754225 keys, up 38 days 5 hours - replication_delay is 640 [14:05:01] PROBLEM - Redis status tcp_6481 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 612 600 - REDIS 2.8.17 on 10.192.48.44:6481 has 1 databases (db0) with 4003 keys, up 38 days 5 hours - replication_delay is 612 [14:06:01] PROBLEM - Redis status tcp_6480 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 644 600 - REDIS 2.8.17 on 10.192.48.44:6480 has 1 databases (db0) with 9172 keys, up 38 days 5 hours - replication_delay is 644 [14:07:29] European SWAT is empty [14:12:51] PROBLEM - Host ripe-atlas-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [14:13:27] (03PS1) 10Faidon Liambotis: aptrepo: pin ElasticSearch version [puppet] - 10https://gerrit.wikimedia.org/r/325931 (https://phabricator.wikimedia.org/T138608) [14:13:31] RECOVERY - Host ripe-atlas-eqiad is UP: PING OK - Packet loss = 0%, RTA = 0.42 ms [14:15:49] (03CR) 10Faidon Liambotis: "This is a hotfix for preventing an issue like T138608 happening again." 
[puppet] - 10https://gerrit.wikimedia.org/r/325931 (https://phabricator.wikimedia.org/T138608) (owner: 10Faidon Liambotis) [14:16:31] PROBLEM - Host ripe-atlas-ulsfo is DOWN: PING CRITICAL - Packet loss = 100% [14:19:31] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad is CRITICAL: CRITICAL - failed 197 probes of 261 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [14:20:11] PROBLEM - IPv4 ping to eqiad on ripe-atlas-eqiad is CRITICAL: CRITICAL - failed 102 probes of 408 (alerts on 19) - https://atlas.ripe.net/measurements/1790945/#!map [14:21:41] RECOVERY - Host ripe-atlas-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 78.96 ms [14:22:31] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 23 probes of 412 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [14:24:21] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 26 probes of 254 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [14:24:31] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad is OK: OK - failed 11 probes of 261 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [14:24:32] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 126 probes of 255 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [14:24:41] PROBLEM - IPv4 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 134 probes of 423 (alerts on 19) - https://atlas.ripe.net/measurements/1791307/#!map [14:25:11] RECOVERY - IPv4 ping to eqiad on ripe-atlas-eqiad is OK: OK - failed 1 probes of 408 (alerts on 19) - https://atlas.ripe.net/measurements/1790945/#!map [14:29:21] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 11 probes of 254 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [14:29:31] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 9 probes of 255 (alerts on 19) - 
https://atlas.ripe.net/measurements/1791309/#!map [14:29:41] RECOVERY - IPv4 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 1 probes of 423 (alerts on 19) - https://atlas.ripe.net/measurements/1791307/#!map [14:30:01] RECOVERY - Redis status tcp_6481 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6481 has 1 databases (db0) with 4750640 keys, up 38 days 6 hours - replication_delay is 0 [14:31:01] RECOVERY - Redis status tcp_6480 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6480 has 1 databases (db0) with 4753814 keys, up 38 days 6 hours - replication_delay is 0 [14:31:22] I think the above just means that RIPE NCC is rebooting their anchors [14:31:32] possibly rolling out a new version of their software [14:32:31] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 1 probes of 412 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [14:35:29] (03CR) 10DCausse: [C: 031] "Thanks" [puppet] - 10https://gerrit.wikimedia.org/r/325931 (https://phabricator.wikimedia.org/T138608) (owner: 10Faidon Liambotis) [14:46:34] Annoying https://gerrit.wikimedia.org/r/#/c/325820/ hasn't been reviewed yet. It's a annoying regression for some wikis' workflow where user rights changes are frequent (e.g. Commons to grant rename right). [14:46:57] (and so would have been a good fit for SWAT) [14:47:32] I'll watch it during the day to be sure we have it for this evening SWAT. [14:50:41] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4748425 keys, up 38 days 6 hours - replication_delay is 50 [14:59:21] (03CR) 10DCausse: [C: 031] "stepping back on this I think the root cause is that we do not really control the dependency between elasticsearch and our plugins. 
For pr" [puppet] - 10https://gerrit.wikimedia.org/r/325931 (https://phabricator.wikimedia.org/T138608) (owner: 10Faidon Liambotis) [15:14:50] (03PS5) 10Ottomata: Refactor eventlogging analytics role classes into many files [puppet] - 10https://gerrit.wikimedia.org/r/325838 (https://phabricator.wikimedia.org/T152621) [15:15:01] PROBLEM - dhclient process on thumbor1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:15:11] PROBLEM - salt-minion processes on thumbor1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:17:51] RECOVERY - dhclient process on thumbor1002 is OK: PROCS OK: 0 processes with command name dhclient [15:18:01] RECOVERY - salt-minion processes on thumbor1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [15:20:07] (03CR) 10Ottomata: "Merging this, there will likely be changes as I apply these roles individually to eventlog1001" [puppet] - 10https://gerrit.wikimedia.org/r/325838 (https://phabricator.wikimedia.org/T152621) (owner: 10Ottomata) [15:20:09] (03CR) 10Ottomata: [C: 032] Refactor eventlogging analytics role classes into many files [puppet] - 10https://gerrit.wikimedia.org/r/325838 (https://phabricator.wikimedia.org/T152621) (owner: 10Ottomata) [15:22:41] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 628 600 - REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4750614 keys, up 38 days 7 hours - replication_delay is 628 [15:24:00] (03PS1) 10Ottomata: role eventlogging::consumer::files -> eventlogging::analytics::files [puppet] - 10https://gerrit.wikimedia.org/r/325936 (https://phabricator.wikimedia.org/T152621) [15:26:17] (03PS2) 10Ottomata: role eventlogging::consumer::files -> eventlogging::analytics::files [puppet] - 10https://gerrit.wikimedia.org/r/325936 (https://phabricator.wikimedia.org/T152621) [15:29:01] (03PS3) 10Ottomata: role eventlogging::consumer::files -> eventlogging::analytics::files [puppet] - 
10https://gerrit.wikimedia.org/r/325936 (https://phabricator.wikimedia.org/T152621) [15:31:48] (03CR) 10Alexandros Kosiaris: [C: 031] "Looks fine to me as a safeguard." [puppet] - 10https://gerrit.wikimedia.org/r/325931 (https://phabricator.wikimedia.org/T138608) (owner: 10Faidon Liambotis) [15:32:32] (03PS4) 10Ottomata: role eventlogging::consumer::files -> eventlogging::analytics::files [puppet] - 10https://gerrit.wikimedia.org/r/325936 (https://phabricator.wikimedia.org/T152621) [15:33:24] (03PS1) 10Gehel: elasticsearch: Java 8 is now the default everywhere Bug: T151325 [puppet] - 10https://gerrit.wikimedia.org/r/325938 (https://phabricator.wikimedia.org/T151325) [15:35:58] (03CR) 10Ottomata: [V: 032 C: 032] "Looks good https://puppet-compiler.wmflabs.org/4838/eventlog1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/325936 (https://phabricator.wikimedia.org/T152621) (owner: 10Ottomata) [15:36:53] (03CR) 10Gehel: [C: 032] elasticsearch: Java 8 is now the default everywhere Bug: T151325 [puppet] - 10https://gerrit.wikimedia.org/r/325938 (https://phabricator.wikimedia.org/T151325) (owner: 10Gehel) [15:36:59] (03PS2) 10Gehel: elasticsearch: Java 8 is now the default everywhere Bug: T151325 [puppet] - 10https://gerrit.wikimedia.org/r/325938 (https://phabricator.wikimedia.org/T151325) [15:42:41] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4750100 keys, up 38 days 7 hours - replication_delay is 0 [15:43:23] (03PS1) 10ArielGlenn: move configuration of tables to be dumped out to a yaml file [puppet] - 10https://gerrit.wikimedia.org/r/325939 [15:44:26] !log restarting elasticsearch relforge cluster for Java 8 upgrade - T151325 [15:44:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:40] T151325: Upgrade to Java 8 for cirrus / elasticsearch - https://phabricator.wikimedia.org/T151325 [15:46:57] (03PS1) 10Ottomata: Apply 
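The elasticsearch restarts being !logged here (codfw, then eqiad, then relforge) follow the usual rolling pattern: one node at a time, with shard allocation paused around each restart so the cluster does not start rebalancing mid-upgrade. A simplified sketch of that ordering; the host names and step labels are illustrative, not the actual WMF tooling:

```python
def rolling_restart_plan(nodes):
    """Yield the step sequence of a one-node-at-a-time rolling restart:
    pause shard allocation, restart the node, wait for the cluster to
    recover, re-enable allocation, then move on to the next node."""
    for node in nodes:
        yield ("disable_allocation", node)
        yield ("restart", node)
        yield ("wait_for_green", node)
        yield ("enable_allocation", node)

plan = list(rolling_restart_plan(["elastic1001", "elastic1002"]))
```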
role::eventlogging::analytics::zeromq [puppet] - 10https://gerrit.wikimedia.org/r/325940 (https://phabricator.wikimedia.org/T152621) [15:47:01] (03PS1) 10ArielGlenn: remove some dblist paths from dump config settings, no longer needed [puppet] - 10https://gerrit.wikimedia.org/r/325941 [15:48:11] PROBLEM - puppet last run on cp4014 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:51:23] (03PS2) 10ArielGlenn: move table job info to a default config file and add setting for override [dumps] - 10https://gerrit.wikimedia.org/r/325844 (https://phabricator.wikimedia.org/T152679) [15:52:18] (03PS1) 10ArielGlenn: document the new table jobs yaml file [dumps] - 10https://gerrit.wikimedia.org/r/325943 (https://phabricator.wikimedia.org/T152679) [15:53:07] (03PS2) 10Ottomata: Apply role::eventlogging::analytics::{zeromq,mysql,processor} [puppet] - 10https://gerrit.wikimedia.org/r/325940 (https://phabricator.wikimedia.org/T152621) [15:53:49] (03PS1) 10ArielGlenn: remove unneeded dblists and references to them [dumps] - 10https://gerrit.wikimedia.org/r/325944 (https://phabricator.wikimedia.org/T152679) [15:54:33] (03PS1) 10ArielGlenn: cleanup of README for general configuration and sample config file [dumps] - 10https://gerrit.wikimedia.org/r/325945 (https://phabricator.wikimedia.org/T152679) [15:55:17] (03PS1) 10ArielGlenn: remove halt, last reference to forcenormal configuration settings [dumps] - 10https://gerrit.wikimedia.org/r/325946 [15:55:24] 06Operations, 10Ops-Access-Requests, 06Research-and-Data, 13Patch-For-Review: Request access to data/cluster for understanding WDQS - https://phabricator.wikimedia.org/T152023#2857212 (10leila) @RobH Thanks for double-checking re NDA. :) I confirm that the students and the WMF have signed NDA and MOU. The... 
[15:55:27] (03PS3) 10Ottomata: Apply role::eventlogging::analytics::{zeromq,mysql,processor} [puppet] - 10https://gerrit.wikimedia.org/r/325940 (https://phabricator.wikimedia.org/T152621) [15:55:35] 06Operations, 10Ops-Access-Requests, 06Research-and-Data, 13Patch-For-Review: Request access to data/cluster for understanding WDQS - https://phabricator.wikimedia.org/T152023#2857214 (10leila) a:05leila>03RobH [15:56:05] (03PS1) 10ArielGlenn: fix up silly handling of table job names [dumps] - 10https://gerrit.wikimedia.org/r/325947 (https://phabricator.wikimedia.org/T152679) [15:56:16] 06Operations, 10Ops-Access-Requests, 06Research-and-Data, 13Patch-For-Review: Request access to data/cluster for understanding WDQS - https://phabricator.wikimedia.org/T152023#2857216 (10RobH) Sounds good! Since the #ops-access-requests was added yesterday, I'll merge this on Friday if there are no object... [15:57:42] (03CR) 10Ottomata: [C: 032] "no op https://puppet-compiler.wmflabs.org/4842/" [puppet] - 10https://gerrit.wikimedia.org/r/325940 (https://phabricator.wikimedia.org/T152621) (owner: 10Ottomata) [15:58:57] (03PS1) 10Ottomata: Remove now unused eventlogging.pp role [puppet] - 10https://gerrit.wikimedia.org/r/325948 (https://phabricator.wikimedia.org/T152621) [16:00:52] (03CR) 10Ottomata: [C: 032] Remove now unused eventlogging.pp role [puppet] - 10https://gerrit.wikimedia.org/r/325948 (https://phabricator.wikimedia.org/T152621) (owner: 10Ottomata) [16:01:59] 06Operations, 10MediaWiki-extensions-UniversalLanguageSelector, 07I18n: MB Lateefi Fonts for Sindhi Wikipedia. - https://phabricator.wikimedia.org/T138136#2857222 (10mehtab.ahmed) @Aklapper: kindly vist this site, https://www.google.com/get/noto/ When I searched there for Sindhi (Arabic Script), site offered... [16:02:51] PROBLEM - puppet last run on mw1212 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [16:13:21] PROBLEM - puppet last run on eventlog1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:13:51] (03PS1) 10Rush: labsdb: cleanup maintain-meta_p enough to make it viable [puppet] - 10https://gerrit.wikimedia.org/r/325949 [16:14:44] (03CR) 10jenkins-bot: [V: 04-1] labsdb: cleanup maintain-meta_p enough to make it viable [puppet] - 10https://gerrit.wikimedia.org/r/325949 (owner: 10Rush) [16:17:11] RECOVERY - puppet last run on cp4014 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [16:18:06] 06Operations, 10Ops-Access-Requests, 06Research-and-Data, 13Patch-For-Review: Request access to data/cluster for article expansion research - https://phabricator.wikimedia.org/T151969#2857226 (10RobH) a:03RobH @leila: I don't have access to view/read T148546. Can its permissions be adjusted to allow #ac... [16:18:37] 06Operations, 10Ops-Access-Requests, 06Research-and-Data, 13Patch-For-Review: Tiziano Piccardi shell request + analytics-privatedata-users - https://phabricator.wikimedia.org/T151969#2857228 (10RobH) [16:19:23] 06Operations, 10Ops-Access-Requests, 06Research-and-Data, 13Patch-For-Review: adrian bielefeldt & julius gonsior shell request + analytics-privatedata-users - https://phabricator.wikimedia.org/T152023#2857230 (10RobH) [16:25:01] PROBLEM - puppet last run on sca1003 is CRITICAL: CRITICAL: Puppet has 7 failures. Last run 2 minutes ago with 7 failures. Failed resources (up to 3 shown): Package[tzdata],Service[zotero],Exec[zotero-admin_ensure_members],Exec[sc-admins_ensure_members] [16:27:20] hey robh. can you check to see if you can see the content of https://phabricator.wikimedia.org/T148546 now? [16:27:37] nope, have a 404 which is a permission thing [16:27:42] but i didnt log in and out... [16:27:43] lemme try [16:27:51] (normally dont have to when added to groups but who knows!)
[16:28:38] leila: Nope, still cannot! [16:28:55] basically my concern is if thats some NDA project, it'll likely come up in the future for ops to grant access requests =] [16:29:10] we're ok for the three you asked for already mind you, since you confirmed on task that the nda exists =] [16:29:35] we already have an nda thing for trusted volunteers, but its not the same as this it seems. [16:29:40] (we being wmf, not ops!) [16:29:51] RECOVERY - puppet last run on mw1212 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [16:30:08] robh: yeah. I added acl*operations-team to it, the issue is that "Visible To" for these tasks is set to S9 which is visible only to people who process research collaborations. There are a few solutions: 1) we go with one of us confirming that the NDA is signed and we agree that's enough. 2) We add all ops members to S9 so they can see the content of that task (the task itself is not visible to everyone because it may cont [16:30:36] Ohhhhhhhh, its s9 [16:30:39] !log disable puppet on californium to help debug T151422 with bd808 [16:30:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:52] T151422: Striker error logs not getting into ELK cluster - https://phabricator.wikimedia.org/T151422 [16:30:52] can you add #acl*operations-team to view s9? [16:31:19] how many S do we have? some research tell me that at least there's one space for each acl*policy-admins subproject [16:31:20] we maintain the acl group as we add and lose opsen, so just adding it would allow ALL ops clinic duty to view tasks in there [16:32:07] I like that solution, robh: adding acl*operations-team to s9. this is beyond my privileges though. ;) but I ask Dario to do it and report back. Just a note that we are in a research offsite and it may take a few days for this to happen. okay? 
[16:32:15] and yeah, space view settings override individual task view permissions [16:32:29] right [16:32:38] leila: absolutely ok! I just want to streamline future access requests from you guys so you dont wait the extra days =] [16:32:52] twentyafterfour is a Phab admin and maintainer, he can do that [16:32:56] yes, I'm with you robh. I report back on the task. thanks for your help. :) [16:33:12] yeah but asking mukunda to skip permissions on behalf of dario seems bad [16:33:18] like i can drop to phab root and do it right this second [16:33:23] but seems bad to me. [16:33:28] yeah, I'd say we let Dario do it. [16:33:34] sure sure [16:33:45] if it were urgent id just do it, but it isnt so im cool waiting =] [16:33:46] * TabbyCat goes back lazying [16:33:46] thanks TabbyCat. :) [16:34:07] TabbyCat: suggestions for assistance are always appreciated though i dont want it to seem like im being a shit ;] [16:34:43] robh: sure and sorry [16:34:51] no apologies needed! [16:35:30] ... [16:36:28] twentyafterfour: we were discussing a space that i dont have access to and it was suggested that you could fix. even though you could, it would bypass the folks who run that space, so I said we'd wait [16:36:44] for them to triage and approve within their team that runs said space [16:36:58] robh: yeah, I'm actually not sure I could fix it really [16:37:14] the only thing I can do as admin is remove all security from a task by way of the cli [16:37:19] but that would not be good [16:37:30] oh, i assumed you could drop to the root/phab admin user in the ui [16:37:46] but i now realize that requires password recovery string from the command line so dunno if you do that ever.
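The Spaces behaviour described here — a Space's view policy is applied on top of each task's own policy — means a task filed in S9 is effectively visible only to users who pass both checks. A toy model of that rule:

```python
def can_see_task(user, space_viewers, task_viewers):
    """Toy model of Phabricator Spaces visibility: a Space's view
    policy is applied on top of the task's own policy, so a viewer
    must satisfy both to see a task filed in that Space."""
    return user in space_viewers and user in task_viewers
```

This is why adding acl*operations-team to a task is not enough on its own: the group also has to be allowed by the Space.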
[16:37:51] robh: there is no real all-powerful phab user [16:37:54] it's very strict [16:38:14] yeah, i intentionally add the admin user (not admin right) to stuff ops makes so not only ops has access [16:38:26] but ive run into stuff that it cannot hit easily, as you say [16:40:24] I always try to add acl*operations_team to the policy on things I maintain but I never maintain anything that's super secret. I have no experience with the spaces as I am not a member of any of them [16:41:31] (03Abandoned) 10Filippo Giunchedi: base: get rid of monthly ieee-data cronjob [puppet] - 10https://gerrit.wikimedia.org/r/325699 (https://phabricator.wikimedia.org/T152440) (owner: 10Filippo Giunchedi) [16:42:45] it looks like the way Administratorship in Phabricator works somewhat hinders their work [16:45:52] TabbyCat: it's designed so that phabricator admins are not all-powerful. The user with root on the server is all-powerful but it generally avoids accidental disclosure or accidental deletion of data [16:47:42] twentyafterfour: from my inexpert POV, I think y'all should be able to do more things via the interface rather than going to the server, but it's just a comment, of course the designers of the software and its maintainers will know better as to why adminship works that way [16:48:46] There are some situations where it'd be nice to have a gui version of 'sudo' but generally I like the fact that my account doesn't have the ability to view content that I should not be reading [16:49:07] this way I don't have to think 'should I be reading this' every time I look at something [16:49:43] thats why i never even gave 'admin' rights to my normal daily user [16:49:46] its nothing special. [16:50:08] was chase's suggestion when we were first testing it and it was a good one, heh [16:50:33] overall i dont hate that phab requires a drop to a cli since indeed, we have a lot of private data in it and being overly paranoid about it is never bad.
[16:50:56] but im also one of the ones who has the rights to use said command line, so i'm quite biased [16:53:01] RECOVERY - puppet last run on sca1003 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [17:00:04] godog, moritzm, and _joe_: Respected human, time to deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161208T1700). Please do the needful. [17:00:49] no patches [17:01:24] (03PS4) 10Chad: Remove bits docroot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/317657 [17:02:01] PROBLEM - puppet last run on sca2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:03:15] (03PS1) 10Filippo Giunchedi: package_builder: add stretch [puppet] - 10https://gerrit.wikimedia.org/r/325956 [17:03:28] (03CR) 10Chad: [C: 032] Remove bits docroot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/317657 (owner: 10Chad) [17:04:07] (03Merged) 10jenkins-bot: Remove bits docroot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/317657 (owner: 10Chad) [17:07:45] (03PS1) 10Urbanecm: Enable SandboxLink on bnwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325957 (https://phabricator.wikimedia.org/T152692) [17:11:11] (03PS1) 10Chad: Point extract2.php symlink to proper location [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325959 [17:11:24] (03CR) 10Chad: [C: 032] Point extract2.php symlink to proper location [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325959 (owner: 10Chad) [17:12:28] (03Merged) 10jenkins-bot: Point extract2.php symlink to proper location [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325959 (owner: 10Chad) [17:13:56] !log demon@tin Synchronized docroot: cleanups: rm bits & fix wwwportal/w/extract2.php symlink (duration: 00m 48s) [17:14:04] ding dong bits is dead [17:14:05] * ostriches dances [17:14:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[17:16:55] (03CR) 10Alexandros Kosiaris: [C: 032] package_builder: add stretch [puppet] - 10https://gerrit.wikimedia.org/r/325956 (owner: 10Filippo Giunchedi) [17:16:58] (03PS2) 10Alexandros Kosiaris: package_builder: add stretch [puppet] - 10https://gerrit.wikimedia.org/r/325956 (owner: 10Filippo Giunchedi) [17:17:02] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] package_builder: add stretch [puppet] - 10https://gerrit.wikimedia.org/r/325956 (owner: 10Filippo Giunchedi) [17:19:59] 06Operations, 10Mobile-Content-Service, 10RESTBase, 10RESTBase-API, and 4 others: Refreshing mobile-sections does not purge mobile-sections-lead - https://phabricator.wikimedia.org/T152690#2857373 (10mobrovac) [17:22:49] ostriches: \o/ a long pull but worth it [17:23:26] I mean it's been dead in dns & varnish for awhile, just the last bit of apache docroot cleanups :) [17:23:31] Nice to have it all gone now [17:23:58] ./docroot is almost sane now! [17:24:02] (almost) [17:29:01] RECOVERY - puppet last run on sca2003 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [17:31:05] !log upload ieee-data 20160613.1 and upgrade jessie machines to it T152440 [17:31:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:17] T152440: cronspam cleanup: Cron test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.monthly ) - https://phabricator.wikimedia.org/T152440 [17:31:23] hasharCall, any luck with commons extension update? [17:31:38] yurik: I am in a meeting, sorry [17:31:46] will commute back home just after [17:32:07] for mediawiki/extensions.git poke Chad in #wikimedia-releng I guess [17:32:08] or here [17:32:33] mw/extensions is busted. [17:32:37] Yeh [17:32:40] it's a bug [17:32:48] I've always hated those auto-updating-repos. [17:32:50] * ostriches sighs [17:32:50] i can't get it to work on any test installs [17:34:11] But it seems to work for upstream [17:34:24] The documentation on this is terrible....
[17:34:28] We could just be Doing It Wrong [17:36:10] Yeh, likely [17:37:55] 06Operations, 10media-storage, 13Patch-For-Review: cronspam cleanup: Cron test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.monthly ) - https://phabricator.wikimedia.org/T152440#2857436 (10fgiunchedi) 05Open>03Resolved Resolving, let's reopen if it happens again in the fo... [17:41:52] (03PS1) 10Thiemo Mättig (WMDE): Revert "Add Abenaki language (abe) to Wikidata" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325962 (https://phabricator.wikimedia.org/T150633) [17:42:41] PROBLEM - puppet last run on labvirt1011 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:42:57] https://gerrit.googlesource.com/Public-Plugins/+/b96210dec456a96febd589213b5868b2e20d1ba9%5E%21/#F0 [17:43:15] (03PS2) 10Thiemo Mättig (WMDE): Revert "Add Abenaki language (abe) to Wikidata" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325962 (https://phabricator.wikimedia.org/T150633) [17:43:26] ostriches ^^ [17:43:38] upstream changed it from refs to matching and all [17:43:42] which one should we do? [17:43:47] (03CR) 10Urbanecm: [C: 031] "Looks good for me" [dns] - 10https://gerrit.wikimedia.org/r/323851 (https://phabricator.wikimedia.org/T151731) (owner: 10MarcoAurelio) [17:43:59] paladox: I used all. [17:44:08] Since matching is a little over-strict for our use [17:44:16] ok [17:44:17] thanks [17:55:13] ostriches: Now I can't sign on to Gerrit with either version of my username, lowercase or uppercase [17:55:22] GOD FUCKING DAMMIT [17:55:33] I HATE MY LIFE [17:55:51] 🍮 [17:57:08] gerrit...
[17:59:09] (03PS1) 10Filippo Giunchedi: prometheus: export gdnsd stats via node_exporter [puppet] - 10https://gerrit.wikimedia.org/r/325975 (https://phabricator.wikimedia.org/T147426) [18:00:04] 06Operations, 10Mobile-Content-Service, 10RESTBase, 10RESTBase-API, and 4 others: Refreshing mobile-sections does not purge mobile-sections-lead - https://phabricator.wikimedia.org/T152690#2857492 (10mobrovac) Ok, this should be resolved now. There was an issue with the EventBus HTTP Proxy service rejectin... [18:00:04] yurik, gwicke, cscott, arlolra, subbu, halfak, and Amir1: Respected human, time to deploy Services – Graphoid / Parsoid / OCG / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161208T1800). Please do the needful. [18:00:15] will try to make it, not sure [18:02:08] (03PS2) 10Filippo Giunchedi: role: include memcached_exporter in role::memcached [puppet] - 10https://gerrit.wikimedia.org/r/321725 (https://phabricator.wikimedia.org/T147326) [18:03:29] ostriches i think this https://gerrit-review.googlesource.com/#/c/91950/ is something of interest [18:03:31] PROBLEM - cassandra-b SSL 10.64.0.118:7001 on restbase1011 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [18:03:49] it was abandoned but talks about uppercase. [18:04:01] PROBLEM - cassandra-b CQL 10.64.0.118:9042 on restbase1011 is CRITICAL: connect to address 10.64.0.118 and port 9042: Connection refused [18:04:21] PROBLEM - cassandra-b service on restbase1011 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed [18:04:21] PROBLEM - Check systemd state on restbase1011 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[18:04:49] https://gerrit-review.googlesource.com/#/c/90336/1/gerrit-server/src/main/java/com/google/gerrit/server/schema/Schema_127.java [18:05:05] paladox: It has to do with a totally different table though [18:05:12] That's account_patch_reviews [18:05:14] Oh [18:05:24] Which contains no user names [18:05:40] is it the account_id table? [18:06:04] No, the issue is in account_external_ids [18:06:09] ok [18:06:22] * paladox searching up on github [18:06:26] * paladox for account_external_ids [18:06:43] 06Operations: rack/setup/install mw2051-mw2060 - https://phabricator.wikimedia.org/T152698#2857513 (10RobH) [18:06:47] 06Operations: rack/setup/install mw2051-mw2060 - https://phabricator.wikimedia.org/T152698#2857530 (10RobH) Assigning this task to @joe for his input on where to rack (if it matters) and what hosts to decom (if needed for racking.) Please detail/comment and assign back to @robh, thanks! [18:07:21] 06Operations, 10ops-codfw: rack/setup/install mw2051-mw2060 - https://phabricator.wikimedia.org/T152698#2857532 (10RobH) [18:07:36] paladox: I mean, we can work around it by having all broken accounts deleted, recreated, then old work reassigned to the new account, but that's ugly & manual. [18:07:46] Yep [18:07:47] Plus, there's clearly a real underlying bug here.... [18:07:48] !log cleanup nginx access logs on elasticsearch cluster (access logs are disabled, this is just left over) [18:07:55] I am searching for the commit that caused all this [18:07:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:46] (03CR) 10Filippo Giunchedi: "@bblack: question re: the RCODE counters, would the sum of those be meaningful in some way? 
That's the main factor between deciding " [puppet] - 10https://gerrit.wikimedia.org/r/325975 (https://phabricator.wikimedia.org/T147426) (owner: 10Filippo Giunchedi) [18:09:56] (03CR) 10jenkins-bot: [V: 04-1] prometheus: export gdnsd stats via node_exporter [puppet] - 10https://gerrit.wikimedia.org/r/325975 (https://phabricator.wikimedia.org/T147426) [18:10:33] modules/prometheus/manifests/node_gdnsd.pp:32 WARNING ensure found on line but it's not the first attribute (ensure_first_param) ;_; [18:10:51] ostriches unrelated to the breaking of logins, but i found the commit that broke submodules https://github.com/gerrit-review/gerrit/commit/c62f9fe5111af422d4e93cf48e1cb0dfe6a561dd [18:11:05] (03PS1) 10Legoktm: Deploy Linter extension to beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325976 (https://phabricator.wikimedia.org/T152620) [18:11:08] i found that when searching for that table account_external_ids [18:11:23] (03PS1) 10Mobrovac: [TEST] Scap: Do not redefine scap::sources in deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/325977 [18:11:30] Well, the broken submodules came about by adding those ACLs and removing the old style of stuff :p [18:11:41] RECOVERY - puppet last run on labvirt1011 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [18:11:56] (03PS2) 10Filippo Giunchedi: prometheus: export gdnsd stats via node_exporter [puppet] - 10https://gerrit.wikimedia.org/r/325975 (https://phabricator.wikimedia.org/T147426) [18:12:21] RECOVERY - cassandra-b service on restbase1011 is OK: OK - cassandra-b is active [18:12:21] RECOVERY - Check systemd state on restbase1011 is OK: OK - running: The system is fully operational [18:12:59] (03PS1) 10ArielGlenn: use explicit exclude/include list for dumps mirrors [puppet] - 10https://gerrit.wikimedia.org/r/325978 [18:13:39] (03CR) 10Filippo Giunchedi: [C: 032] role: include memcached_exporter in role::memcached [puppet] - 
10https://gerrit.wikimedia.org/r/321725 (https://phabricator.wikimedia.org/T147326) (owner: 10Filippo Giunchedi) [18:13:41] RECOVERY - cassandra-b SSL 10.64.0.118:7001 on restbase1011 is OK: SSL OK - Certificate restbase1011-b valid until 2017-09-12 15:34:06 +0000 (expires in 277 days) [18:14:02] RECOVERY - cassandra-b CQL 10.64.0.118:9042 on restbase1011 is OK: TCP OK - 0.001 second response time on 10.64.0.118 port 9042 [18:14:07] paladox: A flat revert doesn't work though, we'll have to find/fix the issue going forward.... [18:14:11] I'm pretty sure it *works* [18:14:16] We just haven't gotten it working [18:14:20] Ok [18:14:22] yeh it works [18:14:26] upstream have it working [18:14:35] just their docs do not clearly tell us how they did it [18:14:55] YuviPanda: re your account, how shall we proceed? [18:15:08] I'm a bit wary of using MergeAccount [18:15:25] (03PS2) 10ArielGlenn: use explicit exclude/include list for dumps mirrors [puppet] - 10https://gerrit.wikimedia.org/r/325978 [18:15:38] !log restarting broker on kafka1012 to repro T152674 [18:15:49] unless somebody is checking logs for fixing any possible issues [18:15:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:49] T152674: Webrequest dataloss registered during the last kafka restarts - https://phabricator.wikimedia.org/T152674 [18:15:54] TabbyCat: please don't use UserMerge. 
[18:15:56] TabbyCat: I don't know anything about it tho :) legoktm might know [18:16:15] legoktm: I won't, that's why I asked [18:16:46] (03PS3) 10ArielGlenn: use explicit exclude/include list for dumps mirrors [puppet] - 10https://gerrit.wikimedia.org/r/325978 [18:16:54] looks like it should be manually fixed then :S [18:18:06] (03CR) 10ArielGlenn: [C: 032] use explicit exclude/include list for dumps mirrors [puppet] - 10https://gerrit.wikimedia.org/r/325978 (owner: 10ArielGlenn) [18:18:08] brb [18:18:10] dinner [18:20:03] congrats on bye bye bits [18:20:06] that's amazing [18:20:33] (03CR) 10Mobrovac: "I cherry-picked this on beta's puppetmaster and ran puppet on deployment-tin. but no luck. Not all repositories are being cloned there." [puppet] - 10https://gerrit.wikimedia.org/r/325977 (owner: 10Mobrovac) [18:24:33] 06Operations, 10ops-codfw: rack/setup/install mw2051-mw2060 - https://phabricator.wikimedia.org/T152698#2857650 (10RobH) [18:26:08] 06Operations, 10ops-codfw: rack/setup/install mw2051-mw2060 - https://phabricator.wikimedia.org/T152698#2857513 (10RobH) [18:30:12] (03CR) 10Dereckson: "Should this merged before or after Id11c27431aa1a2e770699b7cd62579c288823504?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/325962 (https://phabricator.wikimedia.org/T150633) (owner: 10Thiemo Mättig (WMDE)) [18:30:37] (03CR) 10Arlolra: Deploy Linter extension to beta cluster (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325976 (https://phabricator.wikimedia.org/T152620) (owner: 10Legoktm) [18:32:35] (03CR) 10Legoktm: Deploy Linter extension to beta cluster (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325976 (https://phabricator.wikimedia.org/T152620) (owner: 10Legoktm) [18:37:16] im back [18:37:25] (03CR) 10Arlolra: [C: 031] Deploy Linter extension to beta cluster (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325976 (https://phabricator.wikimedia.org/T152620) (owner: 10Legoktm) [18:38:08] jouncebot: next [18:38:08] In 0 hour(s) and 21 minute(s): Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161208T1900) [18:38:08] In 0 hour(s) and 21 minute(s): Wikidata query service (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161208T1900) [18:38:17] (03CR) 10Legoktm: [C: 032] Deploy Linter extension to beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325976 (https://phabricator.wikimedia.org/T152620) (owner: 10Legoktm) [18:38:57] (03Merged) 10jenkins-bot: Deploy Linter extension to beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325976 (https://phabricator.wikimedia.org/T152620) (owner: 10Legoktm) [18:40:15] !log legoktm@tin Synchronized wmf-config: Deploy Linter extension to beta cluster - no-op (duration: 00m 50s) [18:40:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:21] PROBLEM - Verify internal DNS from within Tools on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/labs-dns/private - 341 bytes in 0.002 second response time [18:43:31] PROBLEM - 
check mtime mod from tools cron job on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/toolscron - 341 bytes in 0.002 second response time [18:43:36] PROBLEM - Test LDAP for query on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/ldap - 341 bytes in 0.001 second response time [18:43:41] PROBLEM - Start a job and verify on Trusty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/trusty - 341 bytes in 0.004 second response time [18:43:51] PROBLEM - All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 341 bytes in 0.003 second response time [18:43:56] PROBLEM - toolschecker service itself needs to return OK on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/self - 341 bytes in 0.003 second response time [18:43:57] PROBLEM - showmount succeeds on a labs instance on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/nfs/showmount - 341 bytes in 0.001 second response time [18:43:57] PROBLEM - Start a job and verify on Precise on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/precise - 341 bytes in 0.001 second response time [18:44:02] ostriches any fixes you want to try, we could try here https://gerrit.git.wmflabs.org/r/login/%23%2Fq%2Fstatus%3Aopen, but we will need to use a deb as i installed that with a deb [18:44:03] PROBLEM - NFS read/writeable on labs instances 
on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/nfs/home - 341 bytes in 0.002 second response time [18:44:03] PROBLEM - Redis set/get on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/redis - 341 bytes in 0.004 second response time [18:44:13] i got the error too when trying Paladox instead of paladox. [18:44:13] PROBLEM - All k8s etcd nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/etcd/k8s - 341 bytes in 0.002 second response time [18:44:13] PROBLEM - All Flannel etcd nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/etcd/flannel - 341 bytes in 0.001 second response time [18:44:33] PROBLEM - Make sure enwiki dumps are not empty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/dumps - 341 bytes in 0.002 second response time [18:44:33] andrewbogott: related to anything you're doing? ^ [18:44:43] * YuviPanda looks [18:44:56] was eating lunch so it's not me I'm guessing it's a false negative w/ toolschecker [18:44:57] YuviPanda: not as far as I know [18:45:18] labservices1001 is still up at least [18:45:48] ostriches: Any update? I need to get access to gerrit some time today if possible. I'll owe you 3 stroopwafels and a pepernoot. 
[18:46:08] PROBLEM - toolschecker service itself needs to return OK on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/self - 341 bytes in 0.002 second response time [18:46:08] PROBLEM - Start a job and verify on Precise on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/precise - 341 bytes in 0.002 second response time [18:46:08] PROBLEM - All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 341 bytes in 0.002 second response time [18:46:08] PROBLEM - showmount succeeds on a labs instance on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/nfs/showmount - 341 bytes in 0.001 second response time [18:46:22] YuviPanda, chasemp, the dns servers seem ok to me [18:46:48] this seemed to do a similar false alerts earlier this week iirc? [18:46:51] am trying to figure out what's going on [18:46:51] things seem fine andrewbogott afaict it's the check proxy I imagine, which sucks. YuviPanda are you failing it over? 
[18:46:55] kaldari: Update: I'm super duper angry now [18:46:55] when it did all the checks of the services seemed ok [18:47:03] robh: earlier this week was an actual issue [18:47:08] PROBLEM - NFS read/writeable on labs instances on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/nfs/home - 341 bytes in 0.002 second response time [18:47:08] i may not be recalling correctly, ohh [18:47:08] PROBLEM - Redis set/get on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/redis - 341 bytes in 0.002 second response time [18:47:08] PROBLEM - Start a job and verify on Trusty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/trusty - 341 bytes in 0.001 second response time [18:47:09] ok [18:47:13] kaldari: And have zero clue how to fix this, short of massive manual workarounds that do not scale beyond single users [18:47:17] i wasnt recalling correctly ;] [18:47:22] robh: over the weekend there was an actual failure that instigated similar alerts [18:47:45] actually all we need is a war. [18:48:03] I'm ready to declare a war.... 
[18:48:08] RECOVERY - NFS read/writeable on labs instances on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.014 second response time [18:48:08] RECOVERY - Redis set/get on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.019 second response time [18:48:08] RECOVERY - All k8s worker nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.128 second response time [18:48:08] RECOVERY - showmount succeeds on a labs instance on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.149 second response time [18:48:33] my assumption is a reboot or a failover on YuviPanda's part is happening [18:48:47] ostriches: Should I register a new account in the meantime? [18:49:02] chasemp: yeah I logged in -labs [18:49:07] I just restarted toolschecker process [18:49:08] ostriches https://github.com/gerrit-review/gerrit/commit/96e0d7fd8be23db14520a26cffa93eafed53c3fa ? [18:49:09] kaldari: I mean...your work won't be attributed to your old account.... [18:49:13] But I guess I can't stop you! [18:49:13] YuviPanda: tx, any ideas on why the failure? [18:49:16] that looks like some kind of fix for external ids [18:49:19] it's on master [18:49:20] chasemp: yeah, looking now [18:49:29] needs backporting if it is a fix. [18:49:29] jouncebot: refresh [18:49:31] I refreshed my knowledge about deployments. [18:50:01] paladox: That....could be it..... [18:50:12] Oh [18:50:12] paladox: Try pulling it into your dev install and test? [18:50:13] yay [18:50:15] ok [18:50:18] doing it now [18:50:41] I have to setup buck on gerrit-test3 now. [18:51:21] chasemp: it might actually be a legit outage for at least etcd [18:51:39] etcd outage crashed the checker possibly? 
[18:51:45] that's not cool but possible I guess [18:51:53] PROBLEM - Verify internal DNS from within Tools on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/labs-dns/private - 341 bytes in 0.005 second response time [18:52:08] PROBLEM - Test LDAP for query on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/ldap - 341 bytes in 0.002 second response time [18:52:08] PROBLEM - check mtime mod from tools cron job on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/toolscron - 341 bytes in 0.002 second response time [18:52:08] PROBLEM - Make sure enwiki dumps are not empty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/dumps - 341 bytes in 0.003 second response time [18:53:43] PROBLEM - Redis set/get on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/redis - 341 bytes in 0.002 second response time [18:53:43] PROBLEM - showmount succeeds on a labs instance on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/nfs/showmount - 341 bytes in 0.003 second response time [18:53:48] PROBLEM - NFS read/writeable on labs instances on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/nfs/home - 341 bytes in 0.003 second response time [18:53:48] PROBLEM - All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 341 
bytes in 0.003 second response time [18:53:57] PROBLEM - toolschecker service itself needs to return OK on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/self - 341 bytes in 0.003 second response time [18:53:57] PROBLEM - Start a job and verify on Trusty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/trusty - 341 bytes in 0.002 second response time [18:53:57] PROBLEM - Start a job and verify on Precise on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/precise - 341 bytes in 0.003 second response time [18:54:03] PROBLEM - All Flannel etcd nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/etcd/flannel - 341 bytes in 0.002 second response time [18:54:03] PROBLEM - All k8s etcd nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/etcd/k8s - 341 bytes in 0.002 second response time [18:54:04] PROBLEM - Verify internal DNS from within Tools on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/labs-dns/private - 341 bytes in 0.001 second response time [18:54:09] mind if I silence the checker for an hour to avoid pagerfest? [18:54:14] YuviPanda: are you bouncing the service? I'm going to silence it [18:54:18] ah godog :) yeah I'm on it [18:54:26] chasemp: thanks! [18:54:31] chasemp: uh, no not me this time. [18:54:36] chasemp: yes, please silence it [18:54:54] well that reboot didn't take :) [18:54:57] restart even [18:55:05] DAMN ! 
worker 1 (pid: 10124) died, killed by signal 9 :( trying respawn ... [18:55:08] we had this happen once before and madhuvishy diagnosed it iirc [18:55:25] chasemp: my theory is that one of the operations is taking too long some of the times, causing other threads to be killed [18:55:36] since this server is 'serialized' to do only one check at a time [18:56:05] ostriches it's building now, buck build release :) [18:56:05] i just restarted the service that time [18:56:11] i did git fetch https://gerrit.googlesource.com/gerrit refs/changes/10/92610/4 && git cherry-pick FETCH_HEAD [18:56:58] ok YuviPanda I'm going to finish my sandwich, you're on it? seems mostly false except what's the state of etcd? [18:57:04] chasemp: yeah, am on it. [18:57:15] chasemp: etcd looks fine, i just checked the hosts. will check again shortly. [18:57:37] chasemp: get back to sammic etc [18:57:43] RECOVERY - Redis set/get on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 57.614 second response time [18:57:46] right-o [18:57:57] RECOVERY - NFS read/writeable on labs instances on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.010 second response time [18:57:57] RECOVERY - All k8s worker nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.106 second response time [18:57:57] RECOVERY - showmount succeeds on a labs instance on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.120 second response time [18:58:03] RECOVERY - toolschecker service itself needs to return OK on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.003 second response time [18:58:13] RECOVERY - All Flannel etcd nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 1.166 second response time [18:58:18] RECOVERY - Test LDAP for query on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 1.533 second response time 
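[Editor's aside on the `git fetch … refs/changes/10/92610/4 && git cherry-pick FETCH_HEAD` command pasted above: Gerrit publishes every patchset under its refs/changes namespace, sharded by the last two digits of the change number. A minimal sketch of that layout — `change_ref` is a hypothetical helper written for illustration, not a Gerrit tool:]

```shell
# Gerrit change-ref scheme:
#   refs/changes/<last two digits of change, zero-padded>/<change number>/<patchset>
# change_ref is a hypothetical helper that builds such a ref.
change_ref() {
  change=$1
  patchset=$2
  printf 'refs/changes/%02d/%d/%d\n' "$((change % 100))" "$change" "$patchset"
}

# Change 92610, patchset 4 -- the ref paladox cherry-picked:
change_ref 92610 4   # -> refs/changes/10/92610/4
```

[The sharding by the last two digits keeps any one refs/changes/NN/ directory from growing unboundedly as changes accumulate.]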
[18:58:18] RECOVERY - Verify internal DNS from within Tools on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 1.534 second response time [18:58:18] RECOVERY - All k8s etcd nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 2.474 second response time [18:58:18] RECOVERY - check mtime mod from tools cron job on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 2.479 second response time [18:58:18] RECOVERY - Make sure enwiki dumps are not empty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 2.483 second response time [19:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161208T1900). Please do the needful. [19:00:04] Dereckson: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [19:00:04] gehel: Respected human, time to deploy Wikidata query service (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161208T1900). Please do the needful. [19:00:50] Ok i have to install nodejs [19:00:54] npm, bower, zip [19:00:55] !log Started rebuildItemsPerSite for Wikidata on terbium. Note: This can be killed at any time, if needed. [19:00:59] and a load of other tools [19:01:02] paladox: Are you building on master? [19:01:08] You shouldn't need those.... [19:01:09] Nope on stable-2.13 [19:01:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:19] Meh, whatevs :) [19:01:22] it fails buck if i don't have those. [19:01:28] i did a cherry-pick [19:01:30] Well they're all packaged so :) [19:01:35] Oh [19:02:01] No wdqs deploys... 
[19:02:33] i'll have a SWAT patch in a minute, gerrit is processing [19:03:00] I can SWAT today [19:03:27] Dereckson: ping for SWAT [19:06:28] thcipriani: my cherry-picks are on wikitech now [19:06:38] ebernhardson: okie doke [19:07:41] oh good, jenkins is angry at one https://gerrit.wikimedia.org/r/#/c/325811/ [19:07:53] RECOVERY - Start a job and verify on Precise on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.406 second response time [19:07:53] RECOVERY - Start a job and verify on Trusty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.758 second response time [19:08:11] maybe it will like a rebase [19:09:45] yea jenkins happy now [19:10:54] neat [19:13:32] ostriches i've built it now, deploying now to gerrit.git.wmflabs.org as it is the one that uses the puppet class so should be able to give us results on whether it worked. [19:14:34] ummm. it looks like the submodule bump didn't happen for those patches? [19:14:45] thcipriani: hmm, that's odd [19:15:27] (03PS5) 10Andrew Bogott: Add clientlib.pp and mwopenstackclients.py [puppet] - 10https://gerrit.wikimedia.org/r/325830 (https://phabricator.wikimedia.org/T150092) [19:15:29] (03PS1) 10Andrew Bogott: Designate policy: Add public read-only access [puppet] - 10https://gerrit.wikimedia.org/r/325994 (https://phabricator.wikimedia.org/T150092) [19:15:43] (03Abandoned) 10Mobrovac: [TEST] Scap: Do not redefine scap::sources in deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/325977 (owner: 10Mobrovac) [19:15:46] submodule bumps aren't happening automatically [19:15:47] Right now [19:15:59] They changed the config and the docs suck, #2 priority after the login issue [19:16:53] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 642 600 - REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4759014 keys, up 38 days 10 hours - replication_delay is 642 [19:19:04] 06Operations, 13Patch-For-Review,
05Prometheus-metrics-monitoring: Port memcached statistics from ganglia to prometheus - https://phabricator.wikimedia.org/T147326#2857865 (10fgiunchedi) This is rolling out now, I noticed there's a big number of metrics related to slabs (i.e. per-slab, and per-command/per-sl... [19:19:23] PROBLEM - Redis status tcp_6480 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 642 600 - REDIS 2.8.17 on 10.192.48.44:6480 has 1 databases (db0) with 4765138 keys, up 38 days 10 hours - replication_delay is 642 [19:19:53] so we can't update the wikimedia portals then [19:20:12] meh, not urgent [19:20:18] ebernhardson: here's wmf.4 https://gerrit.wikimedia.org/r/#/c/325996/ [19:20:37] (03PS2) 10Rush: labsdb: cleanup maintain-meta_p enough to make it viable [puppet] - 10https://gerrit.wikimedia.org/r/325949 [19:20:56] ^ [19:21:04] Woops TabbyCat [19:21:24] (03PS1) 10Kaldari: Temporarily disable centralauth-rename right [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325997 (https://phabricator.wikimedia.org/T148242) [19:21:29] paladox: then as we used to do [19:21:45] For now that will be the only way until we find a fix. [19:22:13] I forgot how to update a submodule but I wrote a list of steps in my laptop [19:22:35] ebernhardson: and here's wmf.5: https://gerrit.wikimedia.org/r/#/c/325998/ [19:22:38] TabbyCat git submodule update? [19:23:15] 06Operations, 06Labs, 10Labs-Infrastructure, 10netops, and 3 others: Provide read-only access to OpenStack APIs from WMF IP space - https://phabricator.wikimedia.org/T150092#2857881 (10Andrew) [19:23:54] (03CR) 10Urbanecm: "Again? It was disabled for some time (T151155 for details)." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/325997 (https://phabricator.wikimedia.org/T148242) (owner: 10Kaldari) [19:24:02] TabbyCat: Basically, checkout the branch/sha1/tag/whatever in the submodule, then commit the changed sha1 in the parent repo back upstream [19:24:13] PROBLEM - puppet last run on sca1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:24:28] ebernhardson: could you +1 or +2 those for me and I'll get them on the deployment servers [19:25:53] thcipriani, will you have space for me in Morning SWAT? I thought I would be unavailable in this window but apparently not... If possible I wish to deploy 325803 and 325957. [19:26:10] * thcipriani looks [19:27:12] thcipriani: ok [19:27:22] (03PS1) 10Yuvipanda: tools: Increase harakiri timeout for toolschecker [puppet] - 10https://gerrit.wikimedia.org/r/326004 [19:27:27] Urbanecm: yeah those seem like quick ones, please add [19:27:38] thcipriani, okay, going to re-schedule them. [19:27:44] ebernhardson: thanks! [19:28:06] non-auto-submodule means lots of waiting on jenkins :( [19:29:02] thcipriani, added. Thanks!
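[Editor's note: the manual submodule bump ostriches describes above (check out the wanted sha1 inside the submodule, then commit the updated pointer in the parent repo) can be sketched end-to-end with throwaway local repos. All paths and names below are illustrative, not the actual mediawiki repos; the final `git push` back upstream is omitted.]

```shell
set -e
tmp=$(mktemp -d)
# child repo standing in for the submodule's upstream
git init -q "$tmp/child"
git -C "$tmp/child" -c user.email=x@x -c user.name=x commit -q --allow-empty -m "v1"
# parent repo embedding it as a submodule
git init -q "$tmp/parent"
(cd "$tmp/parent" && git -c protocol.file.allow=always submodule add -q "$tmp/child" child)
git -C "$tmp/parent" -c user.email=x@x -c user.name=x commit -q -m "add submodule"
# a new commit lands upstream in the child
git -C "$tmp/child" -c user.email=x@x -c user.name=x commit -q --allow-empty -m "v2"
# the manual bump: fetch and check out the new sha1 inside the submodule...
git -C "$tmp/parent/child" fetch -q origin HEAD
git -C "$tmp/parent/child" checkout -q FETCH_HEAD
# ...then record the changed gitlink in the parent and commit it
git -C "$tmp/parent" add child
git -C "$tmp/parent" -c user.email=x@x -c user.name=x commit -q -m "Bump child submodule to v2"
git -C "$tmp/parent" submodule status child
```

The last command shows the sha1 now recorded by the parent, which is what the (normally automatic) bump commit carries.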
[19:29:24] ostriches testing now, it's online [19:29:27] 06Operations, 05Prometheus-metrics-monitoring: Provide authenticated access to Prometheus native web interface - https://phabricator.wikimedia.org/T151009#2857893 (10fgiunchedi) re: nginx+ldap, it looks like the ldap auth module isn't included, though we can use pam auth for nginx and libpam-ldap as ldap client [19:29:29] (03PS3) 10Thcipriani: Enable SandboxLink at sdwiki and sdwikt [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325803 (https://phabricator.wikimedia.org/T152609) (owner: 10Urbanecm) [19:29:36] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325803 (https://phabricator.wikimedia.org/T152609) (owner: 10Urbanecm) [19:30:00] ostriches doesn't work [19:30:07] though it may work for new users [19:30:23] and existing users [19:30:30] as long as you haven't hit the bug [19:30:33] Cannot assign user name "paladox" to account 8; name already in use. [19:30:47] (03Merged) 10jenkins-bot: Enable SandboxLink at sdwiki and sdwikt [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325803 (https://phabricator.wikimedia.org/T152609) (owner: 10Urbanecm) [19:31:39] Urbanecm: sdwiki and sdwiktionary are live on mwdebug1002, check please [19:31:52] ostriches try logging in with Chad and chad on https://gerrit.git.wmflabs.org/ [19:32:07] * Urbanecm is testing [19:32:35] thcipriani: pong [19:32:44] Dereckson: hello [19:32:46] Hi here :) [19:33:23] PROBLEM - Redis status tcp_6481 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 624 600 - REDIS 2.8.17 on 10.192.48.44:6481 has 1 databases (db0) with 4762911 keys, up 38 days 11 hours - replication_delay is 624 [19:33:26] thcipriani, sdwiki and sdwiktionary work, please deploy to the whole cluster.
[19:33:39] Urbanecm: ok, doing, thanks [19:35:08] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:325803|Enable SandboxLink at sdwiki and sdwikt]] T152609 (duration: 00m 46s) [19:35:13] ^ Urbanecm live everywhere [19:35:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:19] T152609: Enable "Sandbox" link for Sindhi Wikipedia & Wiktionary; adjust sidebar items and title - https://phabricator.wikimedia.org/T152609 [19:35:42] (03PS2) 10Thcipriani: Enable SandboxLink on bnwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325957 (https://phabricator.wikimedia.org/T152692) (owner: 10Urbanecm) [19:35:50] thcipriani, works. [19:36:08] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325957 (https://phabricator.wikimedia.org/T152692) (owner: 10Urbanecm) [19:36:23] RECOVERY - Redis status tcp_6481 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6481 has 1 databases (db0) with 4748451 keys, up 38 days 11 hours - replication_delay is 0 [19:36:23] RECOVERY - Redis status tcp_6480 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6480 has 1 databases (db0) with 4752187 keys, up 38 days 11 hours - replication_delay is 0 [19:36:58] (03Merged) 10jenkins-bot: Enable SandboxLink on bnwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325957 (https://phabricator.wikimedia.org/T152692) (owner: 10Urbanecm) [19:37:40] Urbanecm: bnwikisource sandboxlink live on mwdebug1002, check please [19:38:09] paladox: Works for initial login/signup for me. [19:38:15] Ok [19:38:19] thcipriani, working. [19:38:35] Urbanecm: ok, going live everywhere [19:38:37] Log out, log back in as "Chad" works [19:38:46] Logging in as "chad" gives me the error. [19:39:16] So, clearly it's being more strict on usernames. Question is: why? And why does fixing your capitalization work for *some* but not *all* users?
[19:40:05] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:325957|Enable SandboxLink on bnwikisource]] T152692 (duration: 00m 45s) [19:40:14] ^ Urbanecm live everywhere [19:40:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:40:16] T152692: Enable SandboxLink on bnwikisource - https://phabricator.wikimedia.org/T152692 [19:40:25] Oh [19:40:37] thcipriani, works, thanks for the deployment. [19:40:48] Urbanecm: yw :) [19:40:52] ostriches so the patch worked? [19:41:26] paladox: Well, I dunno? I mean I could login as myself "Chad" -- "chad" didn't work [19:41:33] But that capitalization workaround isn't working for everyone [19:41:34] Oh [19:41:42] Hmm [19:42:25] ebernhardson: PageImages changes for wmf.4 and wmf.5 are live on mwdebug1002, check please [19:42:38] 06Operations, 10Mobile-Content-Service, 10RESTBase, 10RESTBase-API, and 4 others: Refreshing mobile-sections does not purge mobile-sections-lead - https://phabricator.wikimedia.org/T152690#2857914 (10Dbrant) 05Open>03Resolved Looks resolved now; thanks! [19:42:44] ostriches https://github.com/gerrit-review/gerrit/commit/94732bfa1945a02bd4441b70a501cfe5d21eb907 [19:43:22] Probably related, yeah. All of this new account cache/index stuff is wonky for ldap :\ [19:43:33] Can you check gsql and see what my account_external_ids entries look like on the dev install we're using? Pastebin, if you would :) [19:43:37] Should be 2 entries. [19:44:02] ok [19:44:05] how do i do that? [19:44:09] never used gsql. [19:44:11] thcipriani: can't really check much, it only affects some jobs.
Should be safe enough though [19:44:22] ebernhardson: yup, ok, going live :) [19:46:10] paladox: `java -jar bin/gerrit.war gsql` [19:46:15] Opens up a sql prompt [19:46:18] Thanks [19:46:18] To your reviewdb [19:46:23] :) [19:46:36] !log thcipriani@tin Synchronized php-1.29.0-wmf.5/extensions/PageImages/includes/Job/InitImageDataJob.php: SWAT: [[gerrit:325989|Wrap waitForReplication in try/catch]] (duration: 00m 53s) [19:46:47] fatal: unknown command gsq [19:46:48] (no com.google.gerrit.pgm.gsq) [19:46:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:46:51] ostriches ^^ [19:46:53] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4749772 keys, up 38 days 11 hours - replication_delay is 33 [19:47:13] gerrit2@gerrit-test3:~/review_site$ java -jar bin/gerrit.war gsq [19:47:23] PROBLEM - puppet last run on mw1267 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:47:33] paladox: You forgot an "l" .. 
it's gsql [19:47:35] Not gsq [19:47:37] :) [19:47:45] yep, lol, just realised that [19:47:46] :) [19:47:52] Ok im in [19:48:01] gerrit> [19:48:29] So yeah, find my account_id by `select account_id from accounts where full_name = 'Chad'` [19:48:40] ok [19:48:41] thanks [19:48:44] Then use that account_id to `select * from account_external_ids where account_id = 121920912` or whatever the number is [19:49:10] https://phabricator.wikimedia.org/P4594 [19:49:12] ostriches ^^ [19:49:14] oh [19:49:40] !log thcipriani@tin Synchronized php-1.29.0-wmf.4/extensions/PageImages/includes/Job/InitImageDataJob.php: SWAT: [[gerrit:325811|Wrap waitForReplication in try/catch]] (duration: 00m 46s) [19:49:45] ostriches updated the paste too [19:49:46] :) [19:49:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:49:54] ^ ebernhardson live on wmf.4/5 [19:50:35] Though i did add you to the admin group a few days ago on the dev install so it could have registered you then [19:52:13] RECOVERY - puppet last run on sca1003 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [19:52:27] paladox: I'm curious if we could go the all-lowercase route https://gerrit.wikimedia.org/r/Documentation/config-gerrit.html#ldap.localUsernameToLowerCase [19:52:33] Would that even work? :p [19:53:11] Other option is ditching ldap for oauth, but group mgmt would be annoying :( [19:53:24] Not sure, want me to test? [19:53:31] 06Operations, 10Ops-Access-Requests, 06Research-and-Data: add the #acl*operations-team to the s9 analytics space for nda approvals - https://phabricator.wikimedia.org/T152718#2857932 (10RobH) [19:53:36] would oauth even work? 
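[Editor's note: pulling the scattered commands above together, the reviewdb inspection ostriches walks paladox through looks roughly like this when run from the Gerrit site directory. The account id in the second query is whatever the first query returns; 4151 below is just a placeholder.]

```
$ java -jar bin/gerrit.war gsql
gerrit> select account_id from accounts where full_name = 'Chad';
gerrit> select * from account_external_ids where account_id = 4151;
```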
[19:53:57] OAuth would work, but we'd lose our ldap/* groups, plus would need a manual migration [19:54:10] 06Operations, 10Ops-Access-Requests, 06Research-and-Data: add the #acl*operations-team to the s9 analytics space for nda approvals - https://phabricator.wikimedia.org/T152718#2857948 (10RobH) p:05Triage>03Low [19:54:13] paladox: Couldn't hurt to try the all-lowercase... [19:54:18] Ok [19:54:26] At least in dev [19:54:40] Dereckson: special:userrights change live on mwdebug1002, check please [19:56:02] ostriches how do i stop puppet [19:56:09] as puppet will override the change. [19:56:11] please [19:56:20] `puppet agent --disable` [19:56:27] thanks [19:56:42] And --enable when you want it back on [19:56:50] thcipriani: looks good to me. I've checked on Commons a Special:UserRights page, still working, apparently without issue, full test will need a further test from someone with self-removal group permission. [19:56:50] thanks [19:57:11] (but such full test was done before master merge, and worked fine too) [19:57:29] ostriches im restarting gerrit now [19:57:38] Dereckson: ok, I'll go ahead and push out live [19:57:44] will be a few mins, which i have no idea why it takes so long to start. [19:58:39] everything good to go with wmf.5? [19:59:01] * twentyafterfour prepares to deploy teh train [19:59:09] ostriches yay there's a tool to convert usernames to lowercase https://gerrit.wikimedia.org/r/Documentation/pgm-LocalUsernamesToLowerCase.html [19:59:15] should i run it? [19:59:29] Yeah :) [19:59:44] Ok [19:59:52] !log thcipriani@tin Synchronized php-1.29.0-wmf.5/includes/specials/SpecialUserrights.php: SWAT: [[gerrit:325964|Special:Userrights should set isself on page view, not just on submit]] T152600 (duration: 00m 45s) [20:00:02] ^ Dereckson live everywhere [20:00:04] twentyafterfour: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161208T2000).
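[Editor's note: the "all-lowercase route" being floated is a gerrit.config setting; the option name comes from the config-gerrit documentation linked earlier in this log. A sketch of the stanza, with the caveat the docs give: existing local usernames must first be migrated with the LocalUsernamesToLowerCase program, and the setting must not be changed back afterwards.]

```ini
[ldap]
	localUsernameToLowerCase = true
```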
[20:00:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:05] T152600: Special:UserRights fails to allow users to remove their rights when it is specified in $wgGroupsRemoveFromSelf - https://phabricator.wikimedia.org/T152600 [20:00:21] (03CR) 10Filippo Giunchedi: "@gehel interesting! I see text-https which also uses 'sh' with all weight 1 and .e.g upload-https at weight 4, https://config-master.wikim" [puppet] - 10https://gerrit.wikimedia.org/r/324371 (https://phabricator.wikimedia.org/T151971) (owner: 10Filippo Giunchedi) [20:00:36] ostriches i found they made a mistake in the docs for that [20:00:41] they added _ to java -jar gerrit.war _LocalUsernamesToLowerCase -d review_site [20:00:42] Thanks for the deploy thcipriani. [20:00:47] where as it has to be java -jar gerrit.war LocalUsernamesToLowerCase -d review_site [20:00:55] Dereckson: thanks for the checks :) [20:01:20] ostriches Converting local usernames: 100% (8/8) [20:01:52] Ok, now we'll restart and see if it works :) [20:02:06] Yep [20:02:17] starting it now [20:02:39] :) [20:02:43] If this works....and we can confirm there's no users on ldap who only differ in casing in their names (because ugh, that would *suck*), we could do this [20:02:52] Ok [20:02:53] :) [20:06:09] (03PS1) 1020after4: all wikis to 1.29.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326014 [20:06:11] (03CR) 1020after4: [C: 032] all wikis to 1.29.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326014 (owner: 1020after4) [20:06:53] (03Merged) 10jenkins-bot: all wikis to 1.29.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326014 (owner: 1020after4) [20:07:13] ostriches it's started [20:07:23] !log twentyafterfour@tin rebuilt wikiversions.php and synchronized wikiversions files: all wikis to 1.29.0-wmf.5 [20:07:30] ostriches guess what [20:07:33] Paladox works [20:07:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:46] And 
paladox? [20:07:48] yeh [20:07:51] ostriches it's fixed [20:08:01] yay, probably related to the patch we're testing [20:08:02] It could work :) [20:08:21] Mainly gotta make sure ldap entries are sane to let us do case-insensitive [20:08:24] Yeh ostriches you should use that patch just in case. looks like that script i ran and enabled the option worked [20:08:36] This is some weird bug [20:08:53] Nope, doesn't work for me [20:08:56] Oh [20:08:58] Chad, CHAD, chad all don't work [20:09:08] (03PS2) 10Yuvipanda: tools: Increase harakiri timeout for toolschecker [puppet] - 10https://gerrit.wikimedia.org/r/326004 [20:09:31] ostriches did it show the error? [20:09:34] (03CR) 10Yuvipanda: [V: 032 C: 032] tools: Increase harakiri timeout for toolschecker [puppet] - 10https://gerrit.wikimedia.org/r/326004 (owner: 10Yuvipanda) [20:09:34] Yep [20:09:38] Oh [20:09:41] Take a look at my account_external_ids again. I'm willing to bet I only have 1 entry now instead of 2. [20:09:45] ok [20:10:41] !log upgrade prometheus-node-exporter in labs - T152580 [20:10:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:55] T152580: rollout prometheus-node-exporter 0.13 - https://phabricator.wikimedia.org/T152580 [20:12:15] https://phabricator.wikimedia.org/P4595 [20:12:17] ostriches ^^ [20:12:28] i've also included mine, so you can compare [20:13:44] ostriches that just gave me a good idea: what about running that tool for just converting usernames to lowercase but not enabling the config, would that work in conjunction with the patch we were testing? [20:13:45] Ok, so yes. It's capitalization. Sorta. It's mostly to do with the cache/index building, because there's no real functional difference between our accounts.
[20:14:03] The fact that these account_external_ids rows get dropped is *bad* [20:14:04] Bad bad bad [20:14:17] Yep [20:14:32] 06Operations, 10Cassandra, 10RESTBase, 06Services (doing): RESTBase k-r-v as Cassandra anti-pattern - https://phabricator.wikimedia.org/T144431#2858030 (10GWicke) > Files that are not a part of a compaction can be ruled out as overlapping if they do not contain a partition, or if the maximum droppable tomb... [20:14:49] ostriches want me to disable the config again? [20:14:54] Eh, can leave it now [20:14:57] to see if running the script is sufficient for now. [20:15:00] oh [20:15:00] Or whatever, I don't mind :) [20:15:03] ok [20:15:04] It's your install [20:15:05] :) [20:15:23] RECOVERY - puppet last run on mw1267 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [20:15:34] I will leave it on, but just wondering for seeing if just the script will work. [20:18:03] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:19:53] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy [20:31:30] ostriches we should just open a group talk page on gerrit's google topic thingy. [20:31:42] since setting priority to 1 hasn't got anyone's attention yet [20:31:58] ostriches unless oauth would work? [20:32:03] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 623 600 - REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4751728 keys, up 38 days 12 hours - replication_delay is 623 [20:32:11] We use oauth for mediawiki right? [20:32:52] paladox: Well problem with oauth is we'd lose our ldap groups, like I said. [20:32:56] Oh [20:33:04] Which...we could work around I suppose.... [20:33:11] But I'd rather just see the damn thing fixed! [20:33:18] hmm, would using oauth be an improvement though, yeh.
[20:33:35] It'd be an improvement in some ways, yeah [20:33:40] But still -- migration costs [20:36:20] https://gerrit.wikimedia.org/r/#/dashboard/98 gives an error? [20:37:06] ostriches yeh, im writing on the gerrit group hoping to get a quicker response [20:38:11] Krenair: wfm [20:38:26] paladox: Wanna put it in an etherpad and write together? [20:38:28] :) [20:38:35] ostriches, yeah but logged out [20:38:40] Ok [20:39:03] it says "Error in operator star:ignore" [20:39:04] Krenair: "error in star:ignore" [20:39:05] Yeah [20:39:08] Weird. [20:39:09] * ostriches shrugs [20:39:23] Oh, star:ignore probably doesn't work for an anon [20:39:26] Since anons can't star [20:39:28] ostriches https://etherpad.wikimedia.org/p/Gerrit_login_problems [20:40:23] PROBLEM - puppet last run on sca2004 is CRITICAL: CRITICAL: Puppet has 7 failures. Last run 2 minutes ago with 7 failures. Failed resources (up to 3 shown): Package[tzdata],Service[zotero],Exec[zotero-admin_ensure_members],Exec[sc-admins_ensure_members] [20:44:08] ostriches looks like they are moving accounts to git https://gerrit-review.googlesource.com/#/c/79800/24/gerrit-server/src/main/java/com/google/gerrit/server/account/AccountManager.java [20:44:23] PROBLEM - Redis status tcp_6480 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 629 600 - REDIS 2.8.17 on 10.192.48.44:6480 has 1 databases (db0) with 4754722 keys, up 38 days 12 hours - replication_delay is 629 [20:44:23] PROBLEM - Redis status tcp_6481 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 632 600 - REDIS 2.8.17 on 10.192.48.44:6481 has 1 databases (db0) with 4751425 keys, up 38 days 12 hours - replication_delay is 632 [20:46:24] paladox: Yeah I know, I'm not a fan of that plan ;-) [20:46:32] Oh [20:46:43] paladox: I put some placeholders in the etherpad for the google group links & gerrit links that were sorta-related. [20:46:48] Could you add those? [20:47:04] ostriches not sure what you mean by add those?
[20:47:46] ah [20:47:48] now i know [20:48:08] https://github.com/gerrit-review/gerrit/commit/94732bfa1945a02bd4441b70a501cfe5d21eb907 [20:55:23] PROBLEM - Redis status tcp_6380 on rdb2002 is CRITICAL: CRITICAL: replication_delay is 636 600 - REDIS 2.8.17 on 10.192.0.120:6380 has 1 databases (db0) with 7570 keys, up 38 days 12 hours - replication_delay is 636 [20:55:53] PROBLEM - Redis status tcp_6381 on rdb2002 is CRITICAL: CRITICAL: replication_delay is 615 600 - REDIS 2.8.17 on 10.192.0.120:6381 has 1 databases (db0) with 11164 keys, up 38 days 12 hours - replication_delay is 615 [20:57:01] 06Operations, 10Cassandra, 10RESTBase, 06Services (doing): RESTBase k-r-v as Cassandra anti-pattern - https://phabricator.wikimedia.org/T144431#2858141 (10Eevans) >>! In T144431#2858030, @GWicke wrote: >> Files that are not a part of a compaction can be ruled out as overlapping if they do not contain a par... [20:59:09] paladox: We're not alone: https://groups.google.com/forum/#!topic/repo-discuss/fyVpg16B0pI [20:59:12] lol [20:59:16] i just found that too [20:59:55] Well if we file our report too, it will give them more info that it happens to other users than just one [21:00:08] Hmmm, one last question while we have the lowercase setting turned on..... [21:00:17] Did you run `index start --force accounts`? [21:00:21] Err, account [21:00:30] Curious if that'll fix.... [21:00:31] I doubt it [21:01:06] (03PS4) 10Andrew Bogott: Keystone hook: Change project id to == project name [puppet] - 10https://gerrit.wikimedia.org/r/324928 (https://phabricator.wikimedia.org/T150091) [21:02:08] Oh nope [21:02:13] should i run that? [21:02:48] java -jar gerrit.war index start --force accounts -d review_site [21:02:49] ? [21:02:53] how do i run that?
[21:02:56] ostriches ^^ [21:02:57] please [21:04:14] I ran it over ssh [21:05:23] PROBLEM - wikidata.org dispatch lag is higher than 300s on wikidata is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1762 bytes in 2.165 second response time [21:05:41] Oh [21:08:23] RECOVERY - puppet last run on sca2004 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [21:08:33] jhava -jar gerrit.war reindex -d review_site --threads 4 [21:08:36] java -jar gerrit.war reindex -d review_site --threads 4 [21:08:41] would that work ostriches ^^ [21:08:57] 06Operations, 10Cassandra, 10RESTBase, 06Services (doing): RESTBase k-r-v as Cassandra anti-pattern - https://phabricator.wikimedia.org/T144431#2858170 (10Eevans) [21:09:14] paladox: In theory, but I did that originally before it all broke :) [21:09:20] Anyway, e-mail sent. [21:09:23] Ok [21:09:23] I'm gonna go have lunch [21:09:24] thanks [21:09:25] :) [21:10:13] RECOVERY - wikidata.org dispatch lag is higher than 300s on wikidata is OK: HTTP OK: HTTP/1.1 200 OK - 1730 bytes in 0.269 second response time [21:11:14] 06Operations, 10Mobile-Content-Service, 10RESTBase, 10RESTBase-API, and 4 others: Refreshing mobile-sections does not purge mobile-sections-lead - https://phabricator.wikimedia.org/T152690#2858188 (10Mholloway) 05Resolved>03Open Back again: https://en.wikipedia.org/api/rest_v1/page/mobile-sections/Mon... 
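[Editor's note: for reference, the two reindex variants mentioned in this exchange take roughly these forms. Host, port, and site path are illustrative; the ssh form is the "over ssh" invocation ostriches mentions, using the `index start --force accounts` command quoted above, and requires Gerrit admin capability. The offline form is run with the daemon stopped.]

```
# online, over Gerrit's ssh admin interface
ssh -p 29418 admin@gerrit.example.org gerrit index start --force accounts

# offline, against the site directory
java -jar gerrit.war reindex -d review_site --threads 4
```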
[21:13:47] ostriches it's started [21:13:48] now [21:14:22] 06Operations, 10Mobile-Content-Service, 10RESTBase, 10RESTBase-API, and 4 others: Refreshing mobile-sections does not purge mobile-sections-lead - https://phabricator.wikimedia.org/T152690#2858199 (10Pchelolo) @Mholloway I see 'toy manufacturer' description in both links and on wikidata item [21:16:29] 06Operations, 10Mobile-Content-Service, 10RESTBase, 10RESTBase-API, and 4 others: Refreshing mobile-sections does not purge mobile-sections-lead - https://phabricator.wikimedia.org/T152690#2858212 (10Mholloway) Must have just synced. The mobile-sections-lead version was lagging with "company" (the old des... [21:16:53] RECOVERY - Redis status tcp_6381 on rdb2002 is OK: OK: REDIS 2.8.17 on 10.192.0.120:6381 has 1 databases (db0) with 4744265 keys, up 38 days 12 hours - replication_delay is 0 [21:17:23] RECOVERY - Redis status tcp_6481 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6481 has 1 databases (db0) with 4746698 keys, up 38 days 12 hours - replication_delay is 0 [21:17:23] RECOVERY - Redis status tcp_6380 on rdb2002 is OK: OK: REDIS 2.8.17 on 10.192.0.120:6380 has 1 databases (db0) with 4752448 keys, up 38 days 12 hours - replication_delay is 0 [21:18:23] RECOVERY - Redis status tcp_6480 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6480 has 1 databases (db0) with 4750080 keys, up 38 days 12 hours - replication_delay is 0 [21:19:30] !log mobrovac@tin Starting deploy [changeprop/deploy@2fe48e0]: Deploying fix for T152229 [21:19:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:19:42] T152229: Interleave processing of backlink jobs - https://phabricator.wikimedia.org/T152229 [21:20:22] !log mobrovac@tin Finished deploy [changeprop/deploy@2fe48e0]: Deploying fix for T152229 (duration: 00m 51s) [21:20:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:20:34] ostriches there's also 
https://github.com/gerrit-review/gerrit/commit/5e269758e8dd645f92f3c34ac6802282d1d6bbe3 [21:20:39] 06Operations, 10Mobile-Content-Service, 10RESTBase, 10RESTBase-API, and 4 others: Refreshing mobile-sections does not purge mobile-sections-lead - https://phabricator.wikimedia.org/T152690#2858235 (10Pchelolo) @Mdholloway Hm. So we have 2 queues in Change-Prop - one for main events and one for events deriv... [21:23:13] ostriches https://github.com/gerrit-review/gerrit/commit/79ae5803bee893bf8f6fae2f17aa6b7dc4e67bb1 [21:23:28] 06Operations, 10Mobile-Content-Service, 10RESTBase, 10RESTBase-API, and 4 others: Refreshing mobile-sections does not purge mobile-sections-lead - https://phabricator.wikimedia.org/T152690#2858257 (10Mholloway) @Pchelolo Sounds great, thank you! [21:24:03] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4748217 keys, up 38 days 13 hours - replication_delay is 19 [21:24:05] https://gerrit-review.googlesource.com/#/c/79089/ is the commit that broke this [21:26:21] 06Operations, 10Mobile-Content-Service, 10RESTBase, 10RESTBase-API, and 4 others: Refreshing mobile-sections does not purge mobile-sections-lead - https://phabricator.wikimedia.org/T152690#2858261 (10Pchelolo) Created https://github.com/wikimedia/change-propagation/pull/145 [21:28:32] ostriches ^^ [21:39:39] (03CR) 10Thiemo Mättig (WMDE): "Talking about "now handled directly" would not be correct. Adding this here was just wrong and must be reverted." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/325962 (https://phabricator.wikimedia.org/T150633) (owner: 10Thiemo Mättig (WMDE)) [21:43:31] !log created securepoll_elections.el_owner on all wikis T152721 [21:43:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:43:43] T152721: Undefined property: stdClass::$el_owner in SecurePoll/includes/main/Store.php on line 179 - https://phabricator.wikimedia.org/T152721 [21:47:36] ostriches could you try relogging in please? [21:47:41] I've reindexed :) [21:48:55] Nope [21:48:57] No bueno [21:50:13] PROBLEM - check_mysql on frdb1001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 1499 [21:50:25] Ok [21:50:45] ostriches i've figured it is https://gerrit-review.googlesource.com/#/c/79089/ that broke [21:51:11] i wonder how to fix this now, would a revert work?, or i guess we will have to go with outh? [21:51:13] oauth [21:51:22] I tried reverting. [21:51:26] Too many intermediate changes [21:51:29] oh [21:52:40] Let's avoid pointing blame at any one change, if anything it's probably several :) [21:52:51] I prefer saying "Hey, this may be related" -- rather than "This broke us" [21:53:02] Oh [21:53:06] ok [21:53:06] That way if we're wrong we're not lying hehe :) [21:53:11] Yep [21:53:34] ostriches i did https://gerrit-review.googlesource.com/#/c/92732/ , time for a rebase, lol [21:54:50] I want upstream to notice us! 
[21:54:51] lolol [21:54:54] FIX MAH BUG [21:55:02] Yep [21:55:11] more noise we make the more likely they will fix it [21:55:13] RECOVERY - check_mysql on frdb1001 is OK: Uptime: 2097342 Threads: 223 Questions: 323907773 Slow queries: 19769 Opens: 11509 Flush tables: 1 Open tables: 590 Queries per second avg: 154.437 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [21:55:15] to get rid of us lol [21:57:25] (03PS8) 10Legoktm: Set $wgUserEmailUseReplyTo = true; everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322243 (https://phabricator.wikimedia.org/T66795) [22:00:04] yurik and maxsem: Dear anthropoid, the time has come. Please deploy Enable structured data on Commons (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161208T2200). [22:00:30] (03CR) 10Legoktm: [C: 032] Set $wgUserEmailUseReplyTo = true; everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322243 (https://phabricator.wikimedia.org/T66795) (owner: 10Legoktm) [22:00:38] * legoktm quickly sneaks his config change in first [22:01:06] (03Merged) 10jenkins-bot: Set $wgUserEmailUseReplyTo = true; everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322243 (https://phabricator.wikimedia.org/T66795) (owner: 10Legoktm) [22:01:59] 06Operations, 06Parsing-Team, 06Release-Engineering-Team, 07HHVM, and 3 others: API cluster failure / OOM - https://phabricator.wikimedia.org/T151702#2858468 (10Theklan) We have made changes to Template:Frantziako udalerri infotaula INSEE and now the time for loading each article is considerably smaller....
[22:03:10] !log legoktm@tin Synchronized wmf-config/InitialiseSettings.php: Set $wgUserEmailUseReplyTo = true; everywhere - T66795 (duration: 00m 56s)
[22:03:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:03:25] T66795: Email server's DMARC config prevents users from sending emails via Special:EmailUser - https://phabricator.wikimedia.org/T66795
[22:04:09] I'm done
[22:05:17] any mushroom clouds on the horizon?
[22:05:39] (03PS3) 10Dereckson: Revert "Add Abenaki language (abe) to Wikidata" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325962 (https://phabricator.wikimedia.org/T150633) (owner: 10Thiemo Mättig (WMDE))
[22:05:50] (03CR) 10Dereckson: [C: 031] Revert "Add Abenaki language (abe) to Wikidata" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325962 (https://phabricator.wikimedia.org/T150633) (owner: 10Thiemo Mättig (WMDE))
[22:06:18] 06Operations, 06Parsing-Team, 06Release-Engineering-Team, 07HHVM, and 3 others: API cluster failure / OOM - https://phabricator.wikimedia.org/T151702#2858474 (10hashar) The [[ https://eu.wikipedia.org/w/index.php?title=Txantiloi:Frantziako_udalerri_infotaula_INSEE&diff=5656756&oldid=5479930 | Wiki diff ]]
[22:06:43] ostriches now that's how to get upstream attention https://gerrit-review.googlesource.com/#/c/92732/
[22:06:44] lol
[22:06:47] already a comment
[22:07:44] (03CR) 10Dereckson: [C: 031] "Yeah, it's so much more important and valuable to add a short explanation about WHY we revert and what's expected instead. I've updated th" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325962 (https://phabricator.wikimedia.org/T150633) (owner: 10Thiemo Mättig (WMDE))
[22:08:09] ostriches anyways i rebased it
[22:10:23] PROBLEM - puppet last run on mw1182 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[22:11:14] paladox: you should really put something in the commit message that describes the revert.
as it stands that's a pretty useless patch
[22:11:26] Oh
[22:11:27] ok
[22:12:04] 06Operations, 10Mail, 10MediaWiki-Email, 10Wikimedia-General-or-Unknown, and 3 others: Email server's DMARC config prevents users from sending emails via Special:EmailUser - https://phabricator.wikimedia.org/T66795#2858497 (10Legoktm) 05Open>03Resolved a:03Legoktm Deployed to all wikis now. Announced...
[22:16:47] !log mobrovac@tin Starting deploy [changeprop/deploy@ab552cd]: Deploying fix for T152690
[22:17:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:17:01] T152690: Refreshing mobile-sections does not purge mobile-sections-lead - https://phabricator.wikimedia.org/T152690
[22:17:36] !log mobrovac@tin Finished deploy [changeprop/deploy@ab552cd]: Deploying fix for T152690 (duration: 00m 49s)
[22:17:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:20:07] 06Operations, 10Mail, 10MediaWiki-Email, 10Wikimedia-General-or-Unknown, and 3 others: Email server's DMARC config prevents users from sending emails via Special:EmailUser - https://phabricator.wikimedia.org/T66795#2858539 (10AnnaMariaKoshka) @Legoktm - I have tested it (ru wiki) - it works! =) Thanks.
[22:20:39] (03PS1) 10Yurik: Enable structured data in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326046 (https://phabricator.wikimedia.org/T148745)
[22:20:58] 06Operations, 10Mobile-Content-Service, 10RESTBase, 10RESTBase-API, and 4 others: Refreshing mobile-sections does not purge mobile-sections-lead - https://phabricator.wikimedia.org/T152690#2858541 (10mobrovac) OK, this should be now fixed once and for all :) Please @Mholloway and @Dbrant recheck and resolv...
[22:25:19] (03CR) 10MaxSem: [C: 031] Enable structured data in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326046 (https://phabricator.wikimedia.org/T148745) (owner: 10Yurik)
[22:27:42] (03PS2) 10Yurik: Enable structured data in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326046 (https://phabricator.wikimedia.org/T148745)
[22:27:43] PROBLEM - puppet last run on mw1172 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[22:31:07] (03CR) 10Legoktm: [C: 031] Enable structured data in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326046 (https://phabricator.wikimedia.org/T148745) (owner: 10Yurik)
[22:38:23] RECOVERY - puppet last run on mw1182 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[22:40:42] ostriches https://groups.google.com/forum/#!topic/repo-discuss/SesSyGPbNoI :)
[22:40:54] fire in the .... i meant scaping
[22:41:31] paladox: I had sent as a reply to another thread, but it said waiting for approval :\
[22:41:32] Hmmm
[22:41:37] Oh
[22:41:50] is that your first time posting on that forum?
[22:41:53] I used to be on the list but left, guess I'm a "new" member now :)
[22:42:01] oh
[22:42:02] yep
[22:43:00] !log yurik@tin Started scap: Updating JsonConfig bump and i18n https://gerrit.wikimedia.org/r/#/c/326045/
[22:43:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:46:19] yurik: You may see some warnings during the cache_git_info stage for extensions & skins. It's fine. It's actually always done that, the error message is just new (still trying to sort it out)
[22:46:25] (for full scaps)
[22:46:43] ostriches, thanks for the heads up!
[22:51:31] https://github.com/gerrit-review/gerrit/commit/38b73e2358aea20396d0400138d834938b566f34
[22:52:13] ostriches https://github.com/gerrit-review/gerrit/commit/0a5d4632876a6c940db79d08119955d545c41744
[22:52:18] return Strings.nullToEmpty(input.getUserName()).toLowerCase();
[22:52:59] ostriches ah
[22:53:00] https://github.com/gerrit-review/gerrit/commit/d9d518e1f6747ced035948a1abd35cf6da93dc0b
[22:53:12] ^^ case insensitive
[22:54:23] 06Operations, 06Parsing-Team, 10uprightdiff, 13Patch-For-Review: Debian packaging for uprightdiff - https://phabricator.wikimedia.org/T152577#2853037 (10fgiunchedi) LGTM, I've uploaded uprightdiff 1.0-1 to carbon jessie-wikimedia
[22:54:43] legoktm: ^ should be available at the next 'apt update' :)
[22:56:23] https://github.com/gerrit-review/gerrit/commit/83415257b4e1d96e561560a0923a5b3cfacddb22
[22:56:43] RECOVERY - puppet last run on mw1172 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[22:56:58] paladox: Yeah, all those touching this. Something's not right though :)
[22:57:06] Yep
[22:57:08] There's like a half dozen commits touching this stuff
[22:57:12] Yeh
[22:57:12] So a revert isn't likely.
[22:57:18] More likely to find/fix.
[22:57:21] If someone can find it
[22:57:39] https://github.com/gerrit-review/gerrit/commit/d9d518e1f6747ced035948a1abd35cf6da93dc0b looks like it converts it to lowercase
[22:57:49] 06Operations, 06Parsing-Team, 10uprightdiff, 13Patch-For-Review: Debian packaging for uprightdiff - https://phabricator.wikimedia.org/T152577#2858663 (10fgiunchedi) 05Open>03Resolved a:03fgiunchedi
[22:57:57] I guess if upstream is gonna be all quiet, we'll need to reproduce locally and fix ourselves...
[22:58:16] (as in, I guess I need to install ldap :p)
[22:59:00] oh
[22:59:03] ostriches not really
[22:59:10] gerrit.git.wmflabs.org uses ldap
[22:59:22] since that is what puppet applied to it when i applied the gerrit class
[22:59:23] Well, I want to be able to attach a debugger to it :)
[22:59:28] oh
[22:59:43] Could turn logging way up on gerrit.git.wmflabs.org
[22:59:48] Like, DEBUG everything
[23:00:02] 06Operations, 10ChangeProp, 06Parsing-Team, 10Parsoid, and 7 others: Separate clusters for asynchronous processing from the ones for public consumption - https://phabricator.wikimedia.org/T152074#2858668 (10GWicke) According to https://www.mediawiki.org/wiki/Parsoid/Deployments#Wednesday.2C_December_7.2C_2...
[23:00:08] Ok
[23:00:09] yeh
[23:01:37] doing some tracebacks, i came across https://github.com/gerrit-review/gerrit/blob/b098096bdd9141c9dfcbf51220ffcde39bedf5cf/gerrit-elasticsearch/src/main/java/com/google/gerrit/elasticsearch/ElasticQueryBuilder.java#L163
[23:01:44] which is where external_id gets called in the Java code
[23:02:24] oh never mind, that is master only
[23:04:57] https://github.com/gerrit-review/gerrit/blob/3a4e10f3da402c4c6f8692787c16d531f503f16c/gerrit-lucene/src/main/java/com/google/gerrit/lucene/QueryBuilder.java#L215
[23:05:36] https://github.com/gerrit-review/gerrit/blob/3a4e10f3da402c4c6f8692787c16d531f503f16c/gerrit-lucene/src/main/java/com/google/gerrit/lucene/QueryBuilder.java#L222
[23:08:14] 06Operations, 06Discovery, 06Discovery-Search, 10Monitoring, 07Wikimedia-Incident: Alert when ES indexes are freezed for more than 30 minutes - https://phabricator.wikimedia.org/T110171#2858692 (10Deskana) p:05High>03Low This hasn't been touched in quite a while, so lowering priority and putting in t...
[23:08:46] 06Operations, 06Discovery, 06Discovery-Search, 10Elasticsearch: Investigate the need for master only (non data nodes) in our ES cluster - https://phabricator.wikimedia.org/T109090#1540084 (10EBernhardson) dedicating an entire node to master duties might be a bit much, especially since 2 of the 3 are only o...
[23:10:34] !log yurik@tin Finished scap: Updating JsonConfig bump and i18n https://gerrit.wikimedia.org/r/#/c/326045/ (duration: 27m 34s)
[23:10:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:11:54] paladox: I responded to Luca
[23:11:56] But pending approval
[23:12:03] Oh
[23:12:54] ostriches he responded, what did you write?
[23:13:06] (03CR) 10Yurik: [C: 032] Enable structured data in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326046 (https://phabricator.wikimedia.org/T148745) (owner: 10Yurik)
[23:13:35] paladox: Um, it disappeared into the approval queue :p
[23:13:41] (03Merged) 10jenkins-bot: Enable structured data in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/326046 (https://phabricator.wikimedia.org/T148745) (owner: 10Yurik)
[23:13:42] Oh
[23:13:43] lol
[23:13:53] Basically, "If you think a full offline reindex after fixing rows will work, we can do that"
[23:14:11] But, "If the secondary index results in DB rows being deleted, that's scary because data loss...."
[23:14:13] I think we will want to try ^^
[23:14:20] godog: awesome, thank you :)
[23:14:21] oh
[23:14:37] So, little bit of "Yeah, if you think that workaround will work...ok....but really, this needs a damn fix"
[23:14:39] ;-)
[23:14:41] But nicer!
[23:14:45] ostriches we have backups? is there a way we can put gerrit into read-only mode so that we could try without any data loss
[23:14:54] yep
[23:15:02] 06Operations, 10fundraising-tech-ops: Port fundraising stats off Ganglia - https://phabricator.wikimedia.org/T152562#2858713 (10fgiunchedi) @Jgreen I've outlined some of the deployment and architecture at https://wikitech.wikimedia.org/wiki/Prometheus plus docs at https://prometheus.io. By default metrics are...
[23:15:31] legoktm: no worries!
[23:15:55] paladox: I'm gonna keep thinking about it while I wait for a response/approval
[23:16:04] Ok
[23:16:23] ostriches your approval may take days / weeks, it took a long time for me to get approved.
[23:16:44] Freaking upstream....
[23:16:51] Anyway, still gonna think
[23:16:55] Yep
[23:20:45] (03PS1) 10Legoktm: visualdiff: Install uprightdiff package [puppet] - 10https://gerrit.wikimedia.org/r/326053
[23:22:10] ostriches, i replied and then he replied
[23:22:15] per the replier: "Yes, I believe it will work ... BUT, fix the DB first of all as you mentioned as well that some external ids were missing."
[23:22:22] what does he mean by fix the db?
[23:24:41] Yeah
[23:24:43] I know what he means.
[23:24:52] I want my e-mail approved lol
[23:24:56] yeh
[23:25:19] email google and get your email quickly approved :)
[23:26:03] Hah
[23:26:09] ostriches, legoktm, just to double check - i'm introducing a new settings var in InitialiseSettings, and use it in Commons. Do i scap-dir the whole config, or should i do it one at a time?
[23:26:27] yurik: sync-file initialise first, then do common
[23:26:46] if you do it all at the same time nothing bad will really happen, just a bunch of warnings
[23:26:52] Or third option: fix config mgmt ;-)
[23:27:07] * legoktm quickly hides
[23:27:11] legoktm, thx, makes sense
[23:28:15] LOL
[23:28:22] what is mgmt?
[23:29:32] !log yurik@tin Synchronized wmf-config/InitialiseSettings.php: Enable structured data on Commons https://gerrit.wikimedia.org/r/#/c/326046 (duration: 00m 45s)
[23:29:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:29:46] when building gerrit i get an error
[23:29:47] Exception in thread "main" org.jruby.exceptions.RaiseException: (LoadError) load error: asciidoctor/stylesheets -- java.lang.OutOfMemoryError: GC overhead limit exceeded
[23:29:48] now
[23:29:50] ostriches ^^
[23:31:43] ostriches your email was approved
[23:31:45] !log yurik@tin Synchronized wmf-config: Enable structured data on Commons - step 2 - https://gerrit.wikimedia.org/r/#/c/326046 (duration: 00m 46s)
[23:31:46] your comment is live
[23:31:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:33:53] PROBLEM - Apache HTTP on mw1204 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:34:43] PROBLEM - HHVM rendering on mw1204 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:35:03] PROBLEM - puppet last run on labtestservices2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[/usr/local/bin/labs-ip-alias-dump.py]
[23:43:19] 06Operations, 06Discovery, 06Discovery-Search, 10Monitoring, 07Wikimedia-Incident: Alert when ES indexes are freezed for more than 30 minutes - https://phabricator.wikimedia.org/T110171#1570456 (10greg) It's an explicit follow-up from an incident. These should be prioritized along side other "fun/new" wo...
[23:45:09] ostriches another reply.
[23:45:14] we have an explanation
[23:45:26] it is old and new code mixed for the migration to git
[23:45:30] ostriches, could you take a look - https://gerrit.wikimedia.org/r/#/c/326051 -- it shows that it was merged to wmf5, but git pull from tin is not showing it
[23:45:53] paladox: Reading.
[23:45:57] Ok thanks :)
[23:46:14] yurik: Auto submodule bumps are a little busted right now, needs a manual bump in core like we used to do
[23:50:46] ostriches https://github.com/gerrit-review/gerrit/commit/75eb96921f702c92d30185cbe978f35f4190b5a0
[23:51:01] that looks like a prep change and what Luca is talking about.
[23:52:21] I'm hoping we can come up with something here w/ Luca :)
[23:52:27] I've met him before, smart guy
[23:52:56] Oh yep :)
[23:58:58] ostriches "I believe so ... but it would be better to get Edwin's opinion on this, as he is the author of that change :-)"
[23:59:04] I saw
[23:59:04] i am going to try a revert
[23:59:11] Yeah try a revert locally
[23:59:13] We'll test at least
[23:59:19] https://gerrit-review.googlesource.com/#/c/92774/
[23:59:20] Ok
[23:59:51] whoever is doing SWAT: the structured data is live and kicking, but a tiny follow-up patch needs to be SWATed - we merged it, but because the auto-submodule bump is borked, MaxSem and I got a bit confused. Please sync-file https://gerrit.wikimedia.org/r/#/c/326051
[23:59:58] ostriches im not sure if i rebased correctly