[00:39:01] (03PS1) 10Dr0ptp4kt: Prepare for logging MCC-MNC. [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/130989 [00:51:12] PROBLEM - Puppet freshness on hafnium is CRITICAL: Last successful Puppet run was Thu May 1 18:49:24 2014 [01:17:12] PROBLEM - Puppet freshness on osmium is CRITICAL: Last successful Puppet run was Wed Apr 30 10:04:02 2014 [02:11:42] PROBLEM - Disk space on virt0 is CRITICAL: DISK CRITICAL - free space: /a 3791 MB (3% inode=99%): [02:18:42] PROBLEM - Disk space on virt0 is CRITICAL: DISK CRITICAL - free space: /a 3434 MB (3% inode=99%): [02:19:27] (03CR) 10Dr0ptp4kt: [C: 04-2] "Don't merge yet. We're looking at EventLogging as a more optimal route." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/130989 (owner: 10Dr0ptp4kt) [02:29:43] (03PS1) 10Chad: Fix LDAP admin privs [operations/puppet] - 10https://gerrit.wikimedia.org/r/131010 [02:40:42] !log LocalisationUpdate completed (1.24wmf2) at 2014-05-02 02:39:39+00:00 [02:40:50] Logged the message, Master [02:55:12] PROBLEM - Puppet freshness on holmium is CRITICAL: Last successful Puppet run was Thu May 1 23:54:55 2014 [03:00:01] (03Abandoned) 10Dr0ptp4kt: Prepare for logging MCC-MNC. [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/130989 (owner: 10Dr0ptp4kt) [03:00:42] RECOVERY - Disk space on virt0 is OK: DISK OK [03:10:20] !log LocalisationUpdate completed (1.24wmf3) at 2014-05-02 03:09:16+00:00 [03:10:27] Logged the message, Master [03:29:10] (03PS1) 10Springle: MariaDB multi-source replication monitoring. [operations/puppet] - 10https://gerrit.wikimedia.org/r/131015 [03:51:06] (03PS2) 10Springle: MariaDB multi-source replication monitoring. 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/131015 [03:52:12] PROBLEM - Puppet freshness on hafnium is CRITICAL: Last successful Puppet run was Thu May 1 18:49:24 2014 [04:02:10] !log LocalisationUpdate ResourceLoader cache refresh completed at Fri May 2 04:01:04 UTC 2014 (duration 1m 3s) [04:02:17] Logged the message, Master [04:08:59] 1 minute? [04:10:17] not unheard of I guess: #wikimedia-operations.04-07.log:17:13 <+logmsgbot> !log LocalisationUpdate ResourceLoader cache refresh completed at Mon Apr 7 21:13:14 UTC 2014 (duration 1m 59s) [04:10:48] still seeing 40-55 minute runs every now and then though [04:15:31] (03PS1) 10Andrew Bogott: Sync UIDs with ldap for a bunch of users: [operations/puppet] - 10https://gerrit.wikimedia.org/r/131018 [04:16:16] (03CR) 10Andrew Bogott: [C: 04-2] "Andrew will merge this as scheduled" [operations/puppet] - 10https://gerrit.wikimedia.org/r/131018 (owner: 10Andrew Bogott) [04:18:12] PROBLEM - Puppet freshness on osmium is CRITICAL: Last successful Puppet run was Wed Apr 30 10:04:02 2014 [04:22:15] (03PS3) 10Springle: MariaDB multi-source replication monitoring. [operations/puppet] - 10https://gerrit.wikimedia.org/r/131015 [04:39:41] (03PS4) 10Springle: MariaDB multi-source replication monitoring. [operations/puppet] - 10https://gerrit.wikimedia.org/r/131015 [04:42:47] (03CR) 10Springle: MariaDB multi-source replication monitoring. (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/131015 (owner: 10Springle) [04:44:34] (03CR) 10Springle: "Unsure if the way I'm calling mariadb::monitor_replication using $name is wise?" [operations/puppet] - 10https://gerrit.wikimedia.org/r/131015 (owner: 10Springle) [04:58:30] (03CR) 10Chad: Configure Swift-backed elasticsearch backups (031 comment) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/130760 (owner: 10Chad) [05:38:27] (03CR) 10Giuseppe Lavagetto: [C: 032] Add another check using graphite, small fixes. 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/130582 (owner: 10Giuseppe Lavagetto) [05:39:27] (03PS2) 10Giuseppe Lavagetto: Add another check using graphite, small fixes. [operations/puppet] - 10https://gerrit.wikimedia.org/r/130582 [05:46:32] RECOVERY - Disk space on dataset1001 is OK: DISK OK [05:56:12] PROBLEM - Puppet freshness on holmium is CRITICAL: Last successful Puppet run was Thu May 1 23:54:55 2014 [06:17:19] !log re-enabled puppet on osmium and hafnium [06:17:27] Logged the message, Master [06:18:24] (03PS1) 10Springle: Default temporary and custom tables to Aria engine for Analytics. Won't affect upstream schema changes and does not prevent researchers using InnoDB or TokuDB explicitly. [operations/puppet] - 10https://gerrit.wikimedia.org/r/131024 [06:19:22] RECOVERY - Puppet freshness on hafnium is OK: puppet ran at Fri May 2 06:19:17 UTC 2014 [06:20:05] (03CR) 10Springle: [C: 032] Default temporary and custom tables to Aria engine for Analytics. Won't affect upstream schema changes and does not prevent researchers usin [operations/puppet] - 10https://gerrit.wikimedia.org/r/131024 (owner: 10Springle) [06:21:21] <_joe_> springle: so Aria? let's see how it works [06:24:38] _joe_: :) [06:24:50] analysts are good for trying stuff out on [06:28:27] mainly, though, I hope to move some of Analytics reporting write-load out of innodb's log files [06:28:44] <_joe_> which sounds like a good plan [06:29:08] <_joe_> I thought analytics was more heavy on temp tables [06:29:16] it's possible that default should be toku instead, but aria has myisam's datawarehousing history [06:29:16] <_joe_> and read, apart from that [06:29:26] plus crash-safe nowadays [06:29:33] yeah [06:29:38] <_joe_> yes, that sounds like +1 [06:42:06] (03PS1) 10Giuseppe Lavagetto: Fix a typo in icinga checkcommands. [operations/puppet] - 10https://gerrit.wikimedia.org/r/131028 [06:42:33] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] Fix a typo in icinga checkcommands. 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/131028 (owner: 10Giuseppe Lavagetto) [06:51:46] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: (null) [06:51:59] <_joe_> oook, that is wrong :) [06:54:56] :-D [07:01:14] (03PS1) 10Giuseppe Lavagetto: Another small fix to icinga config. [operations/puppet] - 10https://gerrit.wikimedia.org/r/131029 [07:01:29] <_joe_> apergos: two typos in the same commit, both really stupid, still :( [07:01:54] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] Another small fix to icinga config. [operations/puppet] - 10https://gerrit.wikimedia.org/r/131029 (owner: 10Giuseppe Lavagetto) [07:02:00] it's early yet [07:02:20] <_joe_> apergos: this was from a few days back [07:02:33] well it is good you are finding them now at least [07:02:33] <_joe_> I waited for a couple of CRs and merged this morning [07:17:37] (03PS1) 10Giuseppe Lavagetto: Fixing checks for graphite and gdash. [operations/puppet] - 10https://gerrit.wikimedia.org/r/131030 [07:18:26] PROBLEM - Puppet freshness on osmium is CRITICAL: Last successful Puppet run was Wed Apr 30 10:04:02 2014 [07:27:16] RECOVERY - Puppet freshness on osmium is OK: puppet ran at Fri May 2 07:27:07 UTC 2014 [07:27:58] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Need to convert class into a define." (032 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/131015 (owner: 10Springle) [07:28:20] <_joe_> springle: if you need clarifications, just ask [07:29:15] (03CR) 10Giuseppe Lavagetto: [C: 032] Fixing checks for graphite and gdash. [operations/puppet] - 10https://gerrit.wikimedia.org/r/131030 (owner: 10Giuseppe Lavagetto) [07:33:43] RECOVERY - HTTP error ratio anomaly detection on tungsten is OK: OK: No anomaly detected [07:34:17] <_joe_> \o/ [07:37:48] :-) [07:45:27] (03PS5) 10Springle: MariaDB multi-source replication monitoring. 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/131015 [07:48:52] (03CR) 10Springle: MariaDB multi-source replication monitoring. (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/131015 (owner: 10Springle) [07:50:48] <_joe_> springle: I assume check_mariadb.pl comes from the internets? [07:51:08] <_joe_> if not, I can give it a thorough review [07:51:22] _joe_: nope, i wrote it today [07:51:24] please do [07:51:38] <_joe_> springle: oh ok, you want me to work, damn! [07:52:04] didn't spot any monitoring scheck scripts for multi-source repl out there yet [07:52:25] really this needs to use heartbeat [07:52:31] but need to start somewhere [07:52:47] <_joe_> on it :) [07:54:36] _joe_: see this is what you get for being so responsive with the gerrit reviews :) more work [07:55:24] <_joe_> springle: I really am not, I'm responsive on things where a) I think I have an idea b) I'm interested [08:02:39] <_joe_> springle: I'll try to make my reviews harsh and unpleasant. Like "-1 for whitespace" [08:03:05] RECOVERY - gdash.wikimedia.org on tungsten is OK: HTTP OK: HTTP/1.1 200 OK - 8477 bytes in 0.020 second response time [08:03:16] <_joe_> and \o/ again. [08:08:48] (03CR) 10Giuseppe Lavagetto: [C: 031] MariaDB multi-source replication monitoring. [operations/puppet] - 10https://gerrit.wikimedia.org/r/131015 (owner: 10Springle) [08:09:25] (03CR) 10Giuseppe Lavagetto: [C: 032] Move license text to LICENSE and start README [operations/software] - 10https://gerrit.wikimedia.org/r/129883 (owner: 10Hashar) [08:13:43] so why on osmium does /usr/bin/puppet start by invoking /usr/bin/ruby1.9.1 (which does not exist) but on the other trusty hosts it correctly wants ruby1.8? [08:14:07] <_joe_> apergos: other trusty hosts? where? [08:14:21] <_joe_> apergos: did not know we had more [08:14:25] copper, tantalum [08:14:32] <_joe_> oh ok [08:14:48] <_joe_> btw, which version of ruby 1.8? 
[08:15:08] uh "ruby1.8" [08:15:35] <_joe_> mh just looking at our puppet code for reviewing https://gerrit.wikimedia.org/r/#/c/120518/3 [08:15:52] I'd edit it in place but if the next install results in the same issue I would be sad [08:15:55] <_joe_> seems the class mha::manager that's inside role::mha is never ever called anywhere [08:18:06] <_joe_> mha::node is used in role::coredb::common [08:18:30] <_joe_> we really do need to put everything in autoload layout... [08:22:38] (03CR) 10Giuseppe Lavagetto: [C: 04-2] "Sorry matanya, I just noticed mha::manager (the class calling this method) does not get called anywhere. We should remove it instead of co" [operations/puppet] - 10https://gerrit.wikimedia.org/r/120518 (owner: 10Matanya) [08:22:54] (03CR) 10Alexandros Kosiaris: [C: 04-1] "So I tested the change and it works, however it does spew out a warning" [operations/puppet] - 10https://gerrit.wikimedia.org/r/130856 (owner: 10Ottomata) [08:37:23] apergos: https://rt.wikimedia.org/Ticket/Display.html?id=7407 ? I though I had fixed that :-( [08:38:03] well just seen on osmium [08:38:11] and that seems to be a newish install [08:38:29] a yeah, I fixed the other scripts, the puppet one slipped [08:38:32] ah ha [08:38:32] grrrr [08:38:43] ruby/puppet ... sigh... [08:38:47] yup [08:38:54] <_joe_> akosiaris: good morning :) [08:39:09] morning ? it's close to noon here :P [08:39:18] good morning to you too [08:39:26] that's still quite morning! [08:39:27] <_joe_> yes, well, I did not see you before :) [08:39:28] :-P [08:39:31] still morning [08:40:10] yeah, I am catching up on emails and viewing yesterday's metrics meeting [08:40:13] <_joe_> yes, although I woke up at 6 AM... it feels like I'm near to lunch time now :( [08:40:16] so we got a new ED ! [08:41:42] apparently [08:42:07] I don;t suppose the metrics meeting video is available yet? 
[08:42:14] oh found [08:56:55] PROBLEM - Puppet freshness on holmium is CRITICAL: Last successful Puppet run was Thu May 1 23:54:55 2014 [10:46:36] (03PS1) 10Ori.livneh: Add 'editstream' module for broadcasting recent changes over WebSockets [operations/puppet] - 10https://gerrit.wikimedia.org/r/131040 [11:19:40] (03CR) 10Giuseppe Lavagetto: [C: 04-2] "This patch misses the point, re-enginnering is needed." (032 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/118966 (owner: 10Matanya) [11:22:33] _joe_: you see all my bad patches :/ [11:23:24] <_joe_> matanya: sorry man :/ [11:23:38] no worries [11:23:46] i should do a better job [11:24:11] <_joe_> sorry, lunch :) [11:24:16] <_joe_> see you later [11:24:40] Didn't we have some outages this week? [11:25:34] Somewhere during one UTC night [11:27:00] (I can't fing anything in the SAL, and there was no message or anything on any of the lists I checked.) [11:27:03] find* [11:30:32] (Might have been the night of 28/29 April) [11:57:55] PROBLEM - Puppet freshness on holmium is CRITICAL: Last successful Puppet run was Thu May 1 23:54:55 2014 [12:06:25] !log updated our Jenkins Job Builder copy abbf318..8df6bab [12:06:35] Logged the message, Master [12:20:40] (03PS1) 10Alexandros Kosiaris: Update netmon1001 torrus configuration [operations/puppet] - 10https://gerrit.wikimedia.org/r/131044 [12:24:13] (03CR) 10Alexandros Kosiaris: [C: 032] Update netmon1001 torrus configuration [operations/puppet] - 10https://gerrit.wikimedia.org/r/131044 (owner: 10Alexandros Kosiaris) [12:30:04] (03PS4) 10Matanya: manutius: remove torrus [operations/puppet] - 10https://gerrit.wikimedia.org/r/130587 [12:56:23] (03PS1) 10Faidon Liambotis: blog: fix varnish configuration to handle the load [operations/puppet] - 10https://gerrit.wikimedia.org/r/131045 [12:58:37] (03PS2) 10Faidon Liambotis: blog: bump max backend connections to 256 [operations/puppet] - 10https://gerrit.wikimedia.org/r/131045 [12:58:56] (03CR) 10Faidon Liambotis: 
[C: 032 V: 032] blog: bump max backend connections to 256 [operations/puppet] - 10https://gerrit.wikimedia.org/r/131045 (owner: 10Faidon Liambotis) [12:59:56] RECOVERY - Puppet freshness on holmium is OK: puppet ran at Fri May 2 12:59:51 UTC 2014 [13:00:49] _joe_: hey [13:01:08] icinga alert for graphite, for host tungsten HTTP WARNING: HTTP/1.1 401 Authorization Required - 779 bytes in 0.004 second response time [13:23:33] (03CR) 10Faidon Liambotis: "One comment, otherwise looks good. Be sure to monitor the MW appserver load after this gets deployed; the first iteration of the MediaWiki" (031 comment) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/130634 (owner: 10BBlack) [13:26:03] <_joe_> paravoid: seen that, I'm on it [13:31:55] (03PS1) 10Alexandros Kosiaris: Exclude teredo interface from check_eth [operations/puppet] - 10https://gerrit.wikimedia.org/r/131051 [13:34:26] (03PS1) 10Giuseppe Lavagetto: Change the graphite url to check. [operations/puppet] - 10https://gerrit.wikimedia.org/r/131052 [13:36:10] (03CR) 10Giuseppe Lavagetto: [C: 032] Change the graphite url to check. [operations/puppet] - 10https://gerrit.wikimedia.org/r/131052 (owner: 10Giuseppe Lavagetto) [13:46:43] !log swift @ eqiad: setting zone 5 (ms-be1013/1014/1015) to weight 2000, i.e. 66% [13:46:50] Logged the message, Master [13:47:32] hey, puppet is very fast [13:47:44] I haven't seen puppet runs in 16s since forever [13:47:59] andrewbogott: your find / were killing swift, see ops@ [13:48:18] paravoid: ok, catching up... [13:49:52] (03CR) 10Faidon Liambotis: [C: 032] "Ideally we'd check for the interface type in the check itself, but this will do for now. Thanks!" [operations/puppet] - 10https://gerrit.wikimedia.org/r/131051 (owner: 10Alexandros Kosiaris) [13:50:29] (03CR) 10Ottomata: "Yeah, makes sense. I saw that error too, but was kinda in a hurry at the time to get the elasticsearch node reinstalled and back online. 
" [operations/puppet] - 10https://gerrit.wikimedia.org/r/130856 (owner: 10Ottomata) [13:51:49] paravoid: Yikes -- did that cause an outage, or just ganglia redlines? [13:52:21] performance degredation [13:52:51] Kind of obvious that that was going to happen, in retrospect :( [13:53:14] hindsight is 20/20. [13:53:20] Yep! [13:53:37] So, I have another batch of uid substitutions queued up for… right now. Suppose I should just do -x for all systems? [13:53:51] Also, "I shouldn't have done that" is probably the single most common saying after things break. :-P [13:53:54] (man page says that -x is the new -xdev) [13:54:27] <_joe_> Coren: or "we should have done that" [13:54:35] The problem with -x is that you also need to make sure to explicitly look at every local filesystem. [13:56:06] Technically speaking the swift boxes don't even /have/ user accounts do they? Just log in as root? [13:56:09] * andrewbogott looks in puppet [13:59:53] twkozlowski: yes, there was a partial outage the night of 28/29 around 1am, there was a report to ops+engineering lists the next day [14:00:27] (03CR) 10Krinkle: Add 'editstream' module for broadcasting recent changes over WebSockets (032 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/131040 (owner: 10Ori.livneh) [14:00:48] <_joe_> paravoid: I get you were not looking at puppet on neon (as you said it's fast) :) [14:01:05] no I was not [14:01:25] (03CR) 10Krinkle: "Nice! Do the channels for different wikis get handled transparently between mediawiki and redis? (I guess those redis instances will not h" [operations/puppet] - 10https://gerrit.wikimedia.org/r/131040 (owner: 10Ori.livneh) [14:01:51] if something's going wronger or righter with neon, you may want to keep in mind I did this recently: https://gerrit.wikimedia.org/r/#/c/130731/ [14:02:09] paravoid: can you clarify 'certain classes of boxes, like Swift'? What boxes are you thinking of, other than swift? 
[14:02:09] (the IPv6 DNS was for a bogus address that didn't exist on neon, I changed it to the one that did exist) [14:02:32] (03PS1) 10coren: Tool Labs: move git-review to exec_environ [operations/puppet] - 10https://gerrit.wikimedia.org/r/131053 [14:02:44] And, can I catch the swift boxes with ".*-[be|fe].*" y'think? [14:02:45] <_joe_> bblack: I think it's simply very slow, as it has always been in my (brief) experience [14:02:46] andrewbogott: maybe dumps, maybe analytics, labstore [14:03:05] hm, analytics are going to be the prime offenders, in terms of having user-owned files all over [14:03:10] RECOVERY - graphite.wikimedia.org on tungsten is OK: HTTP OK: HTTP/1.1 200 OK - 1607 bytes in 0.017 second response time [14:03:15] "ms-be*" should be anough [14:04:01] * andrewbogott is definitely going to need someone to proofread this regexp [14:04:40] (03PS2) 10Andrew Bogott: Sync UIDs with ldap for a bunch of users: [operations/puppet] - 10https://gerrit.wikimedia.org/r/131018 [14:05:55] (03CR) 10coren: [C: 032] Tool Labs: move git-review to exec_environ [operations/puppet] - 10https://gerrit.wikimedia.org/r/131053 (owner: 10coren) [14:07:19] (03PS3) 10Andrew Bogott: Sync UIDs with ldap for a bunch of users: [operations/puppet] - 10https://gerrit.wikimedia.org/r/131018 [14:10:58] (03CR) 10Andrew Bogott: [C: 032] Sync UIDs with ldap for a bunch of users: [operations/puppet] - 10https://gerrit.wikimedia.org/r/131018 (owner: 10Andrew Bogott) [14:18:03] bblack: Is there anything secret in how long the outage was, and what was the cause? [14:22:42] dang, I really don't see how to do negative matching in a salt run. Is there a regexp mode, or just globbing? [14:25:25] (03PS1) 10Ottomata: Removing nginx module in order to add it as a git submodule [operations/puppet] - 10https://gerrit.wikimedia.org/r/131057 [14:26:23] ottomata: (why?) 
[14:27:21] nm, found it [14:27:33] mark, see thread about vagrant shared submodules [14:27:44] 'git submodules for more shared puppet modules' [14:27:58] this is in prep for hackathon work where folks want to try to make vagrant more like production [14:28:41] ah right [14:28:47] the three slated for sharing are nginx, mariadb, and varnish [14:29:11] that nginx module is pretty bad/rudimentary though [14:29:15] but sure [14:29:26] i wouldn't say bad, just rudimentary [14:29:35] yeah rudimentary [14:29:36] for sure [14:30:13] right [14:30:21] a module that does almost nothing but the basics almost can't be bad [14:30:27] haha [14:30:32] as opposed to a puppet module that tries to do everything and usually fails at it! [14:30:35] I disagree on that one :) [14:30:47] grr, i can't push the submodule add for review...i think until the removal is merged? [14:31:10] plenty of bad small modules [14:35:14] ah! is nginx module not being used anywhere? [14:35:19] it is [14:35:27] I find it being used, but then the classes that are using it aren't including it? [14:35:28] or it was maybe? [14:35:29] sorry [14:35:35] aren't being included* [14:36:01] I see archiva using it :) [14:36:06] hah [14:36:08] uhhh [14:36:10] (and protoproxy) [14:36:11] <_joe_> a puppet question: if I find a class that is not referenced anywhere in the code, should I remove it [14:36:48] OH pffff [14:36:52] i was greppingIN the modules/ dir [14:36:53] duh [14:39:18] ok, hm, you guys mind if I push this submodule through? i was going to leave 2 commits up for review: rm modules/nginx + git submodule add modules/nginx [14:39:32] but i can't seem to push the submodule add until I merge the removal? [14:39:34] hm [14:39:56] (03CR) 10Faidon Liambotis: "We discussed it some more with Chad on IRC. 
Based on the size of the dataset (~1TB), let's wait until the new Swift servers to be fully we" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/130760 (owner: 10Chad) [14:40:15] ah wait, I think I got it.... [14:40:16] (03PS1) 10Ottomata: Adding nginx module as git submodule [operations/puppet] - 10https://gerrit.wikimedia.org/r/131058 [14:40:18] yeahhh [14:40:21] needed not to rebase [14:40:27] paravoid: Seems like my pattern should be ^(.(?!ms-be|labstore|snapshot))*$ but I can't find a single online regexp tester that will back me up :( Am I doing the exclusion wrong? [14:48:19] I love how the salt docs say that salt supports 'perl-form regular expressions' and then immediately link to the python regex docs [14:51:58] Can anyone help me understand how to use a negative lookahead? I'm copying a doc example verbatim and it still doesn't work [14:52:06] yeah [14:52:24] can you paste an example and what you expect it to do? [14:52:50] ^((?!ms-be|labstore|snapshot).*)$ should work I think [14:53:01] or without the outer brackets [14:53:03] This is a baby step, but… right now I'm trying to match all minions that aren't ms-be* [14:53:09] So '^(.(?!ms-be))*$' [14:53:15] * andrewbogott tries w/out the outer () [14:53:32] move the dot over [14:53:48] the brackets aren't the problem [14:53:50] <_joe_> the dot should be before * [14:54:38] <_joe_> like mark said before :) [14:54:54] that's a lookahead, and you're using it to look-behind, which is the same issue I had in the regex from mark/faidon about gettingSTartedUserToken :P [14:55:12] if you really want to look behind rather than ahead, it's (? mark, _joe_: I think that's working! You guys are better than the internet. [14:56:05] although maybe with the anchor, lookahead still works? 
[14:56:08] I donno :) [14:56:11] ahead from the ^ [14:56:26] <_joe_> bblack: the anchor makes it work [14:56:38] ok, so my full regex is now '^((?!ms-be|labstore|snapshot)).*$' [14:56:41] which is just what mark said [14:56:58] <_joe_> 1 parens too much [14:57:04] my next question is… does that exclude the right hosts such that my upcoming 'find' won't break everything? [14:57:08] there are some other rules about fixed-lengths too, but I think you're covered there [14:57:16] <_joe_> '^(?!ms-be|labstore|snapshot).*$' [14:57:23] <_joe_> this should work as well. [14:59:06] paravoid: check my work before I break things again? [14:59:19] if things aren't being captured, you could leave off the .*$ too right? [14:59:49] meanwhile… bblack, ori, ottomata, springle, are you guys ready for me to 'killall -u' you in the cluster? [15:00:12] uhm, maybe :) [15:00:17] bblack: I feel like without the .* it won't match anything. There has to be a positive match component in there doesn't there? [15:00:37] your anchor + lookahead? [15:00:43] ah, true [15:00:45] no idea, test it [15:00:56] bblack, I'll kill you last :) [15:02:00] Hi andrewbogott. I'm having trouble sshing into stat1003 this morning. Could it be related to Uid changes? FWIW, I have no problem connecting to bast1001 and I've changed nothing about my ssh-agent strategy. [15:02:14] <_joe_> andrewbogott: not necessarily. [15:02:28] halfak: possible -- but, can you ask me again in half an hour? [15:02:44] andrewbogott: testing in perl, assuming pcre behaves similarly, you don't need the trailing .*$ [15:03:22] andrewbogott, Sure. Anyone else I can both in the meantime? It turns out that this is a little time critical. :\ [15:03:30] *bother [15:04:49] Meh. [15:04:52] * halfak can wait 30 minutes.  [15:05:03] halfak: looking... [15:06:15] halfak: you had a running session on stat1003 which prevented the rename. 
Could be I forgot to boot you, or you logged back in right after I did… anyway if you stand clear for 5 mins it should get sorted out [15:06:39] try now? [15:07:47] (03PS1) 10Ottomata: Removing mariadb module in prep for adding it as a git submodule [operations/puppet] - 10https://gerrit.wikimedia.org/r/131060 [15:07:49] (03PS1) 10Ottomata: Adding modules/mariadb as a git submodule [operations/puppet] - 10https://gerrit.wikimedia.org/r/131061 [15:08:35] Worked! Thanks andrewbogott [15:08:45] cool [15:08:50] sorry for the mishap [15:13:08] !log resetting a bunch more UIDs. Running find-and-chown again, but this time not on the swifts: salt -E '^(?!ms-be|labstore|snapshot).*$' [15:13:14] Logged the message, Master [15:13:45] bblack: so any ideas how long the Apr 28/29 outage was, and what caused it? [15:14:09] twkozlowski: there was a detailed writeup on the engineering lists, did you not get it? [15:14:23] subject line is: Incident: Large 5xx spike 00:20 -> 01:00 UTC 29-Apr (and other fallout) [15:14:51] bblack: engineering@ is WMF/NDA only [15:15:11] ah sorry, sometimes my brain assumes that's everyone I'm talking to here! :) [15:15:23] :) [15:16:32] twkozlowski: Basically, a cookie was deployed that set a cookie for anon users that looked (to our varnish caching layer) like a login/session cookie, which caused our caches to kinda explode. [15:16:33] right [15:17:00] bblack: could you put that in the format of https://wikitech.wikimedia.org/wiki/Incident_documentation/Report_Template so it can be posted on wikitech? [15:18:59] bblack: oh, I saw the GettingStarted patch set on Gerrit, so I assume it was that [15:19:06] Thanks for the info, very helpful. 
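[editor's note] The negative-lookahead exchange above (14:40 to 15:03) can be checked with a short sketch. It assumes that salt -E matches minion IDs the way Python's re.match does, which the log only hints at ("immediately link to the python regex docs"). The names labstore1001, snapshot1001 and ms-fe1001 are made up for illustration; the other host names and both patterns are quoted from the log.

```python
import re

# Final pattern from the 15:13 !log entry.
FINAL = re.compile(r'^(?!ms-be|labstore|snapshot).*$')
# First attempt from 14:53: the dot is consumed BEFORE the lookahead runs,
# so a name that starts with "ms-be" is never tested at position 0.
EARLY = re.compile(r'^(.(?!ms-be))*$')
# bblack's variant without the trailing ".*$" (anchor plus lookahead only).
SHORT = re.compile(r'^(?!ms-be|labstore|snapshot)')

# Partly hypothetical minion IDs; see the note above.
HOSTS = ["ms-be1013", "labstore1001", "snapshot1001",
         "mw1053", "stat1003", "ms-fe1001"]


def targeted(pattern, hosts=HOSTS):
    """Return the hosts the pattern would select (assumes re.match semantics)."""
    return [h for h in hosts if pattern.match(h)]


if __name__ == "__main__":
    print("final:", targeted(FINAL))
    print("early:", targeted(EARLY))
    print("short:", targeted(SHORT))
```

Under that assumption the final pattern and the shortened one select the same hosts, while the early attempt still selects ms-be1013, which is the behaviour mark and _joe_ pointed out when they said to move the dot.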
[15:19:19] twkozlowski: yeah, that was it [15:21:02] (03CR) 10BryanDavis: [C: 031] "The CPU hit we saw with the first patch I wrote to implement this on the MW side was due to parsing all ~180 addresses in the list as CIDR" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/130634 (owner: 10BBlack) [15:21:37] bblack: So just to make sure I get this stuff right, was that just page loading problem, HTTP 502 errors or what exactly? [15:21:49] (03CR) 10Faidon Liambotis: [C: 04-1] "This is really great work. The architecture behind this looks very sane so far, so just -1 for a bunch of implementation comments (see inl" (0320 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/129501 (owner: 10Rush) [15:21:50] PROBLEM - check configured eth on ms1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:21:54] chasemp: ^^ [15:22:00] PROBLEM - RAID on ms1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:22:12] I only read a bit of your discussion in the backlog, since 00:20 UTC is kinda late here [15:22:29] someone else should probably review this as well, probably _joe_ :) [15:22:50] PROBLEM - Disk space on ms1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:23:13] twkozlowski: it caused intermittent 5xx errors on a subset of pages (include Main_Page), but there's no easy way to describe the subset. pseudo-random from an outside point of view. [15:23:30] PROBLEM - check if dhclient is running on ms1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:23:32] (03CR) 10Hashar: [C: 031] "Thanks! 
Forgot to +1 earlier :-(" [operations/puppet] - 10https://gerrit.wikimedia.org/r/125888 (https://bugzilla.wikimedia.org/36422) (owner: 10BryanDavis) [15:23:40] RECOVERY - check configured eth on ms1002 is OK: NRPE: Unable to read output [15:23:40] RECOVERY - Disk space on ms1002 is OK: DISK OK [15:23:43] and during the critical period 00:20 -> 01:00, that subset of pages was persistently 5xx (or nearly so) [15:23:47] paravoid: okidoke, will go through the comments, thanks [15:23:50] RECOVERY - RAID on ms1002 is OK: OK: optimal, 5 logical, 10 physical [15:23:54] good stuff [15:24:30] PROBLEM - twemproxy port on mw1151 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:24:30] PROBLEM - RAID on mw1151 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:24:45] <_joe_> mmmh icinga problems? [15:24:47] bblack: Thanks! Will describe it as "page loading problems due to high server load", then [15:25:20] RECOVERY - check if dhclient is running on ms1002 is OK: PROCS OK: 0 processes with command name dhclient [15:25:54] twkozlowski: that seems reasonable [15:26:20] RECOVERY - twemproxy port on mw1151 is OK: PROCS OK: 1 process with UID = 65534 (nobody), command name nutcracker [15:26:20] RECOVERY - RAID on mw1151 is OK: OK: no RAID installed [15:28:30] PROBLEM - check if dhclient is running on ms1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:28:30] PROBLEM - DPKG on ms1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:29:00] PROBLEM - Apache HTTP on mw1053 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:29:00] PROBLEM - RAID on ms1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:29:10] PROBLEM - Apache HTTP on mw1151 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:29:30] PROBLEM - twemproxy process on mw1151 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:29:30] PROBLEM - check configured eth on mw1151 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[15:30:00] RECOVERY - Apache HTTP on mw1151 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.088 second response time [15:30:20] RECOVERY - twemproxy process on mw1151 is OK: PROCS OK: 1 process with UID = 65534 (nobody), command name nutcracker [15:30:20] RECOVERY - check configured eth on mw1151 is OK: NRPE: Unable to read output [15:30:40] (03PS2) 10Ori.livneh: Add 'editstream' module for broadcasting recent changes over WebSockets [operations/puppet] - 10https://gerrit.wikimedia.org/r/131040 [15:30:50] RECOVERY - Apache HTTP on mw1053 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.070 second response time [15:31:21] RECOVERY - check if dhclient is running on ms1002 is OK: PROCS OK: 0 processes with command name dhclient [15:31:21] RECOVERY - DPKG on ms1002 is OK: All packages OK [15:31:50] RECOVERY - RAID on ms1002 is OK: OK: optimal, 5 logical, 10 physical [15:36:54] (03CR) 10Ori.livneh: "> Nice! Do the channels for different wikis get handled transparently between mediawiki and redis? (I guess those redis instances will not" [operations/puppet] - 10https://gerrit.wikimedia.org/r/131040 (owner: 10Ori.livneh) [15:37:00] ^ Krinkle [15:38:38] what's with mw1053, mw1151 and ms1002? [15:38:43] can someone have a look? [15:39:20] ms1002 should be not be used atm, right apergos? 
[15:40:13] <_joe_> paravoid: I'll take a look, I assumed this was a problem with checks in icinga, given the flapping nature of it [15:40:27] those checks are usually fairly reliable [15:40:45] paravoid: i had an idea while not sleeping [15:41:04] an 'infrastructure-lint' script that iterates through the json of all the 1-week graphs on ganglia [15:41:25] and prints the name of any for which the following is true: the N highest values are all in the right half of the graph [15:41:50] I'd rather do this with check_graphite icinga alerts [15:41:55] it's too coarse a criterion for alerts, because it'll have plenty of false positives (new hardware getting turned on, etc.) [15:43:09] ori: Hm.. I'm trying to trace how, in the current irc.wm.o setup, does the udp prefix (e.g. "#en.wikipedia\t") becomes the irc channel [15:43:23] ori: Looks like when RCFeeds got introduced, a few things changed [15:43:35] it used to actually be a prefix, and it just happened to work with how we pipe it into irc [15:43:40] how now it's part of the udp url [15:43:47] 'uri' => "udp://$wmgRC2UDPAddress:$wmgRC2UDPPort/$wmgRC2UDPPrefix", [15:43:59] Where wmgRC2UDPPrefix = "#wikichannel\t" [15:44:53] CommonSettings.php:2688 [15:45:05] <_joe_> paravoid: mmh high i/o wait on mw1053, while mw1151 is perfectly fine [15:45:08] if ( $wmgRC2UDPPrefix === false ) { $matches = null; if ( preg_match( '/^\/\/(.+).org$/', $wgServer, $matches ) && isset( $matches[1] ) ) { $wmgRC2UDPPrefix = "#{$matches[1]}\t"; } } [15:45:26] ori: https://github.com/etsy/skyline [15:45:46] bblack: rrdtool has holt-winters-based anomaly detection too [15:46:02] <_joe_> ori: graphite does as well [15:46:05] ori: Yep, I wrote those lines [15:46:11] <_joe_> (since we'd like to use it) [15:46:13] if I'm looking on tungsten where are new whisper creations being logged? [15:46:23] ori: But how do we turn the prefix into the first argument to PRIVMSG for ircd? 
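[editor's note] The 'infrastructure-lint' criterion described at 15:41 ("the N highest values are all in the right half of the graph") could be sketched as below. This is only an illustration of the stated rule, not anything that exists in the log: the function names, the default N, and the input shape (a dict mapping a graph name to a plain list of numeric samples, oldest first) are all assumptions, and fetching the 1-week JSON from ganglia is left out.

```python
def peaks_in_right_half(values, n=5):
    """True if the n highest samples all sit in the second half of the series."""
    if len(values) < 2 * n:
        # Assumption: a series this short cannot say anything useful.
        return False
    midpoint = len(values) // 2
    # Indices of the n largest samples. sorted() is stable, so ties go to
    # earlier samples, which makes a flat series come out as False.
    top = sorted(range(len(values)), key=lambda i: values[i], reverse=True)[:n]
    return all(i >= midpoint for i in top)


def lint(graphs, n=5):
    """Return the names of graphs whose n highest values are all recent."""
    return [name for name, values in graphs.items()
            if peaks_in_right_half(values, n)]


if __name__ == "__main__":
    # Hypothetical sample data, oldest sample first.
    sample = {
        "recent_growth": [1] * 10 + [1, 1, 9, 8, 7, 9, 8, 1, 1, 1],
        "old_spike": [9] + [1] * 11 + [8, 7] + [1] * 6,
        "flat": [1] * 20,
    }
    for name in lint(sample, n=3):
        print(name)
```

As noted in the discussion, a rule like this is coarse: newly enabled hardware would trip it too, so it fits a report meant for a human reader better than an alert.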
[15:46:24] the problem is that it is fiendishly difficult to find an optimum value. you inevitably either run yourself mad with false alerts or remain blind with respect to certain problems [15:46:42] chasemp: /var/log/upstart/(something with 'carbon').log [15:46:44] .../var/log/graphite is empty but /var/log/graphite-web (not there which I expect) is not [15:46:49] upstart ok [15:47:02] thanks [15:47:04] ori: It used to be part of the udp body, and we tossed it as-is to PRIVMSG, so the prefix naturally worked as channel name. I guess we extract it somewhere later on? [15:47:25] Krinkle: yeah, does it matter exactly where? the point is we can do that too [15:47:34] Sure, I'm just curious [15:47:52] it's probably true that we're not exploiting this nice namespacing abstraction when we just shove everything into 'edits' [15:47:53] https://github.com/etsy/skyline/blob/master/src/analyzer/algorithms.py <- is an interesting read as well [15:47:54] So that if stuff fails again, as long irc.wm.o exists, I know what's up. [15:48:14] ori: Well, there has to be a way for clients to subscribe to certain wikis only. [15:48:18] I'd rather not require clients to do that. [15:48:22] That's a lot of data [15:48:26] like, a whole lot of data. [15:48:32] yeah, fair [15:48:40] PROBLEM - HTTPS on ssl3001 is CRITICAL: SSL_CERT CRITICAL: Error: verify depth is 6 [15:48:55] i love that (verify depth is 6), it sounds very lovecraftian [15:48:55] Even CVNBot as full cap "only" listens to 300 wikis. 
And we have a separate instance for en.wiki and another one for commonswiki [15:49:01] it crashes otherwise [15:49:24] at full peak* [15:49:37] Krinkle: let's take it to -dev, sounds like ops has a full plate [15:49:40] RECOVERY - HTTPS on ssl3001 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 628 days) [15:49:42] k [15:51:24] <_joe_> bblack: I'd like to build most of those things into check_graphite [15:51:41] <_joe_> so that we can do real anomaly detection for any metric we want [15:52:28] (03CR) 10Krinkle: "Well, there definitely has to be a way for clients to only get subscribe to certain wikis. I was assuming that is already in place. I'm mo" [operations/puppet] - 10https://gerrit.wikimedia.org/r/131040 (owner: 10Ori.livneh) [15:53:50] Reedy: You have two accounts in puppet, reedy and samreed. samreed is marked as 'old' but it is still enabled. May I disable it, or are you using both accounts? [15:56:53] (03CR) 10coren: [C: 031] "If the objective is to have puppet-lint on every instance, this is the correct way to do it." 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/130847 (owner: 10Rush) [16:01:04] (03CR) 10Krinkle: Add 'editstream' module for broadcasting recent changes over WebSockets (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/131040 (owner: 10Ori.livneh) [16:04:14] <_joe_> !log depooled mw1053 for hardware problems [16:04:21] Logged the message, Master [16:07:15] (03CR) 10Mark Bergsma: [C: 04-2] "Don't merge, doesn't match the private key" [operations/puppet] - 10https://gerrit.wikimedia.org/r/130797 (owner: 10RobH) [16:07:28] (03PS1) 10Ottomata: Removing modules/varnish in prep for adding it as a git submodule [operations/puppet] - 10https://gerrit.wikimedia.org/r/131068 [16:07:30] (03PS1) 10Ottomata: Adding modules/varnish as a git submodule [operations/puppet] - 10https://gerrit.wikimedia.org/r/131069 [16:17:41] mutante, yet another dumb RT question: I have a ticket from Kevin asking me to create him an RT account. Obviously he has an RT account already, having opened a ticket via email. But are there some standard boxes I should check on his account to give him more privs? [16:36:52] (03PS1) 10Aaron Schulz: Cleaned up job loop types a bit [operations/puppet] - 10https://gerrit.wikimedia.org/r/131072 [16:39:47] ori: ^ [16:48:21] (03CR) 10Ori.livneh: [C: 032] Cleaned up job loop types a bit [operations/puppet] - 10https://gerrit.wikimedia.org/r/131072 (owner: 10Aaron Schulz) [17:12:30] andrewbogott: you able to login to zero-bdd.wmflabs.org? i'm trying from bastion.wmflabs.org, but it seems to be timing out. i can reboot the thing if needed [17:12:50] * andrewbogott tries it [17:13:35] (03CR) 10BBlack: "Holding this (+ ip6_mapped changeset) till Monday so as not to create weekend problems." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/130634 (owner: 10BBlack) [17:15:20] dr0ptp4kt: my root key worked… I'm poking around a bit. 
[17:15:22] puppet is unhappy [17:16:21] andrewbogott: thx [17:16:25] (03PS1) 10EBernhardson: Update flow cache versioning prefix [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/131084 [17:18:34] dr0ptp4kt: that instance is having some kind of identity crisis… like dhcp partially failed on startup. [17:18:36] Is it harmless to reboot? [17:19:05] andrewbogott: yeah, always okay to reboot that machine. want me to do it, or you got it? [17:19:15] I'll do it, we'll see if it works any better this time... [17:20:23] (03CR) 10MaxSem: "Reedy, I can deploy it with next SWAT, but I need your +1:)" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/130609 (https://bugzilla.wikimedia.org/42436) (owner: 10Withoutaname) [17:23:06] (03PS5) 10BryanDavis: Provision scap scripts using trebuchet [operations/puppet] - 10https://gerrit.wikimedia.org/r/129814 [17:23:11] (03PS6) 10BryanDavis: Provision scap scripts using trebuchet [operations/puppet] - 10https://gerrit.wikimedia.org/r/129814 [17:26:24] dr0ptp4kt: didn't help. I'm still looking but this will take a bit [17:26:38] andrewbogott: thx [17:31:11] dr0ptp4kt: ok, I fixed the problem I was looking at. Did that by chance fix your issue as well? [17:32:37] andrewbogott: i'm in, thx [17:32:46] cool [17:33:06] the resolv.conf still thought the box was in pmtpa for some reason [17:33:18] I changed it by hand, can only hope it doesn't revert. [17:43:42] (03PS1) 10RobH: subscribe the chained.pem file to the non-chained.pem file [operations/puppet] - 10https://gerrit.wikimedia.org/r/131087 [17:47:01] !log Restarting stuck jenkins [17:47:07] Logged the message, Mr. Obvious [17:47:24] (03CR) 10RobH: subscribe the chained.pem file to the non-chained.pem file (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/131087 (owner: 10RobH) [17:57:35] (03CR) 10BryanDavis: "Applied via cherry-pick in beta." 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/129814 (owner: 10BryanDavis) [17:58:41] (03CR) 10EBernhardson: "as opposed to websockets, what about server sent events[1]? They are ridiculously simple compared to websockets on both ends of the stream" [operations/puppet] - 10https://gerrit.wikimedia.org/r/131040 (owner: 10Ori.livneh) [18:25:02] (03PS1) 10Rush: .gitreview [operations/debs/python-diamond] - 10https://gerrit.wikimedia.org/r/131102 [18:27:28] (03CR) 10Rush: [C: 032 V: 032] "just .gitreview" [operations/debs/python-diamond] - 10https://gerrit.wikimedia.org/r/131102 (owner: 10Rush) [18:49:39] (03PS1) 10Ori.livneh: Get rid of $scriptpath [operations/puppet] - 10https://gerrit.wikimedia.org/r/131108 [18:54:32] (03PS2) 10Ori.livneh: Get rid of $scriptpath [operations/puppet] - 10https://gerrit.wikimedia.org/r/131108 [18:54:45] (03PS1) 10Ori.livneh: s/git-deploy/Trebuchet/g [operations/puppet] - 10https://gerrit.wikimedia.org/r/131110 [18:56:51] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 8 below the confidence bounds [18:57:35] _joe_: ooh, nice! [18:57:37] ^demon: you about? [18:57:40] much better phrasing [18:57:49] <^demon> chasemp: Yo, whassup? [18:58:03] so had a gerrit sitch, and I know how you love it when I bother you w/ these things :) [18:58:40] <^demon> What happened? [18:58:42] have a repo where I'm merging in from an upstream, and when if I do git-review w/ just our changes...it wants to submit every commit ever [18:58:45] as a changeset [18:59:17] do I need to push directly after merging upstream? and then git-review local changes post upstream matches our repo [18:59:22] unsure if that's a good description [18:59:31] <^demon> Yeah I'd do that. [18:59:40] I don't have perms to do that, I don't believe [18:59:41] <^demon> Push upstream in directly, then git-review your stuff. [18:59:49] <^demon> What's the repo? 
[19:00:01] operations/debs/python-diamond [19:00:49] <^demon> Gave ldap/ops direct push rights to it [19:01:02] <^demon> Oh, and force committer. [19:01:07] <^demon> *forge [19:01:16] cool thank you will try that, far easier than what I've been doing [19:01:41] although 300 or so merges in one day, I would look like a champ until further inspection [19:01:59] <^demon> Most productive code reviewer ever ;-) [19:03:50] (03CR) 10Chad: [C: 031] s/git-deploy/Trebuchet/g [operations/puppet] - 10https://gerrit.wikimedia.org/r/131110 (owner: 10Ori.livneh) [19:04:33] (03CR) 10Chad: [C: 031] Get rid of $scriptpath [operations/puppet] - 10https://gerrit.wikimedia.org/r/131108 (owner: 10Ori.livneh) [19:05:00] (03CR) 10Ori.livneh: [C: 032] Get rid of $scriptpath [operations/puppet] - 10https://gerrit.wikimedia.org/r/131108 (owner: 10Ori.livneh) [19:05:06] (03PS2) 10Ori.livneh: s/git-deploy/Trebuchet/g [operations/puppet] - 10https://gerrit.wikimedia.org/r/131110 [19:09:28] (03CR) 10Ori.livneh: [C: 032] s/git-deploy/Trebuchet/g [operations/puppet] - 10https://gerrit.wikimedia.org/r/131110 (owner: 10Ori.livneh) [19:15:39] <_joe_> ori: this is actually another alarm that uses H-W data from graphite [19:16:56] (03PS2) 10Chad: Fix LDAP admin privs [operations/puppet] - 10https://gerrit.wikimedia.org/r/131010 [19:20:08] (03CR) 10Ori.livneh: [C: 032] "My understanding is that this is not granting new privileges, merely completing a refactor. And it unblocks Chad, who is currently unable " [operations/puppet] - 10https://gerrit.wikimedia.org/r/131010 (owner: 10Chad) [19:22:51] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 6 below the confidence bounds [19:24:24] <_joe_> mmmh that is annoingly flapping [19:24:36] is there any other kind of flapping? :) [19:27:17] <_joe_> ori: as usual, the half-baked solution of using plain HW confidence bands does not work. 
I'll need to look at what etsy does [19:27:46] _joe_: make the problem domain appear shallower and simpler than it in fact is, and write hip blog posts about it :) [19:27:50] * ori snarks [19:27:52] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 5 below the confidence bounds [19:28:09] <_joe_> oook, acknowledging that. [19:29:02] _joe_: don't take my snarking seriously, you're doing insanely great work, i'm really happy to see it [19:29:11] <_joe_> add to that the changing message makes the irc bot spam us. [19:29:33] <_joe_> ori: I got your rant and I agree [19:29:55] <_joe_> my idea was - we need to start from some basics and improve that [19:30:15] <_joe_> (if I were hip as needed, I'd have said 'iterate on that') [19:31:11] ACKNOWLEDGEMENT - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 5 below the confidence bounds Giuseppe Lavagetto Acknowledged as its spamming the chat room - also this is clearly experimental [19:31:25] well, you know, there's a famous case in US law, jacobellis v. ohio, which dealt with how to identify obscenity or "hardcore" pornography [19:31:52] and potter stewart, in his opinion, famously wrote: "i know it when i see it" [19:32:29] <_joe_> lol. [19:38:25] i think anomaly detection might be similar [19:38:56] it's sort of obvious when you see it but hard to identify algorithmically [19:40:57] Parser cache doesn't expire after 30 days right, or does it? I mean the really low level parser cache, that's used by logged in users as well. [19:41:00] Only on edit? [19:41:05] Or does it do expire? [19:41:11] (aside from cache epoch) [19:41:29] (03PS1) 10Ori.livneh: Revert "Fix LDAP admin privs" [operations/puppet] - 10https://gerrit.wikimedia.org/r/131123 [19:42:00] (03CR) 10Ori.livneh: "^d, Coren, mutante - fyi. sorry. 
:-/" [operations/puppet] - 10https://gerrit.wikimedia.org/r/131123 (owner: 10Ori.livneh) [19:42:05] (03CR) 10Ori.livneh: [C: 032 V: 032] Revert "Fix LDAP admin privs" [operations/puppet] - 10https://gerrit.wikimedia.org/r/131123 (owner: 10Ori.livneh) [19:43:19] ori: Not sure I get why "sorry"? [19:44:10] well, i shouldn't have merged that in the first place unless it was a no-brainer, which i thought it was but it wasn't :/ [19:50:03] (03CR) 10Ori.livneh: "We can do better than that. Let's find a way to avoid copying these files to /usr/bin altogether by adding the scap dir to $PATH." [operations/puppet] - 10https://gerrit.wikimedia.org/r/129814 (owner: 10BryanDavis) [20:02:41] (03CR) 10Hashar: "Would it make sense to have the python server in a different git repo so we can scale maintenance to a wider audience than folks having me" [operations/puppet] - 10https://gerrit.wikimedia.org/r/131040 (owner: 10Ori.livneh) [20:03:10] (03PS1) 10Andrew Bogott: Change mhernandez uid to 4990, as in labs. [operations/puppet] - 10https://gerrit.wikimedia.org/r/131203 [20:08:36] (03PS1) 10Odder: Add SVG logos for nine Wikibooks wikis [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/131204 (https://bugzilla.wikimedia.org/52019) [20:08:38] <_joe_> Oh I HATE gerrit [20:09:20] <_joe_> ori: I'd like to propose a couple of patches to your code, but of course I'd just need to amend your own commit, right? [20:10:10] there are different workflows possible _joe_ :-] [20:10:23] 1) you can comment inline to ask the original author to fix/enhance its patch [20:10:42] 2) submit a follow up commit that has for parent the "offending" commit you want to improve [20:10:52] greg-g: bug 64765 fixed. When's the next deploy window and/or should we make one? [20:10:54] 3) sneak in and be bold by amending his patch. 
[20:11:03] <_joe_> hashar: oh ok so 2) [20:11:12] <_joe_> good, makes sense [20:11:15] <_joe_> thanks :) [20:11:33] _joe_: yeah 2) is nice if you introduce a bug that kill the cluster. This way you are the one to be git blamed for it :-D [20:11:48] <_joe_> eheh [20:11:58] Krinkle: how bad is it? is it worth a Friday deploy or can it wait for first thing Monday? [20:12:09] _joe_: inline diff is nice since it let you discuss with the author and iterating until the patch is "flawless" [20:13:07] <_joe_> hashar: I'd like to generalize the code a bit, allowing it to subscribe to more than one topic (now it's just 'edits'), and to use a config file [20:13:21] <_joe_> not the kind of patch you do with inline comments :) [20:13:28] greg-g: it's minor from a software point of view, but user wise, this is an encyclopedia where one would expect to be able to use the table of contents. [20:13:43] it's a 1 line fix though. I'd like to get it out. [20:13:53] <_joe_> but this is a topic for the weekend, if the rain keeps pouring :) [20:13:53] _joe_: so yeah maybe better with a follow up change :D [20:14:04] * ^d puts his contents on the table. [20:14:13] ^d: :) [20:14:35] contents as in table-of-contents table contents [20:14:40] (free as in free-table free) [20:14:56] Krinkle: :) [20:15:02] Krinkle: I trust you, wanna do it now? [20:15:13] in 15 minutes [20:15:17] Thanks [20:15:35] "I trust you, wanna do it now" [20:15:37] Hmmm... [20:15:50] One couldn't guess it was about software deploy on a Friday [20:16:04] * YuviPanda gives twkozlowski a phamplet [20:16:04] greg-g: wanna buy a bridge? [20:16:08] Only on a friday [20:16:09] xD [20:16:15] chrismcmahon: :P [20:16:21] some people you just have to trust :) [20:16:41] Krinkle: is the merge on Beta Cluster and not breaking anything weird there? 
;) [20:17:35] working as expected on beta [20:17:37] tested on http://en.wikipedia.beta.wmflabs.org/wiki/Dido_Sotiriou [20:17:44] mw.loader.moduleRegistry['mediawiki.util'].dependencies [20:17:44] ["jquery.accessKeyLabel", "jquery.mwExtension", "mediawiki.notify", "mediawiki.toc"] [20:18:28] <^d> chrismcmahon: Just wait til we have Replacement Greg around next week. [20:18:39] <^d> *That* is the time to ask for crazy thing. [20:18:43] <^d> +s [20:18:49] Oh? [20:19:32] shhhhhhhhh [20:19:34] <^d> Deskana Pt III - The Return of the Replacement Greg [20:19:36] <^d> :p [20:19:36] it's only one day [20:19:39] (03CR) 10Andrew Bogott: [C: 032] Change mhernandez uid to 4990, as in labs. [operations/puppet] - 10https://gerrit.wikimedia.org/r/131203 (owner: 10Andrew Bogott) [20:19:51] * greg-g sings "For just one day" [20:19:59] <^d> Replacement Greg + Chad's doing the deploy that day [20:20:02] <^d> GONNA BE A BLAST [20:20:14] Replacement Greg + Replacement Reedy, you should say :P [20:20:54] uhh, I'm gonna need you to send me a cashier's check... [20:23:29] greg-g: (which version of that song?) [20:24:51] Krinkle: https://www.youtube.com/watch?v=AQ0J602X53c [20:25:10] Ah, the bowie song [20:29:51] !log krinkle synchronized php-1.24wmf2/resources/Resources.php 'Ia12998fb11c686' [20:29:58] Logged the message, Master [20:31:44] !log krinkle synchronized php-1.24wmf3/resources/Resources.php 'Ia12998fb11c686' [20:31:51] Logged the message, Master [20:37:50] greg-g: done, confirmed fix in production [20:39:57] Krinkle: thanks much [21:02:51] RECOVERY - HTTP error ratio anomaly detection on tungsten is OK: OK: No anomaly detected [21:04:34] poor tungsten, much maligned [21:04:47] but it can take so much heat and not melt! 
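For reference, the flapping "HTTP error ratio anomaly detection" check reports how many data points fall above and below the Holt-Winters confidence bounds. A minimal sketch of that kind of count-and-threshold check follows. It is not the actual check_graphite code, which this log does not show: the function name, the `max_outliers` threshold and the input shape (three equal-length lists for the series and its lower and upper bands) are assumptions, and only the message wording is copied from the alerts above.

```python
def check_confidence_bands(series, lower, upper, max_outliers=5):
    """Count samples outside the [lower, upper] bands and pick a status.

    max_outliers is a hypothetical tuning knob: more than this many
    out-of-band samples in the window is reported as CRITICAL.
    """
    above = 0
    below = 0
    for value, lo, hi in zip(series, lower, upper):
        if value is None or lo is None or hi is None:
            continue  # skip gaps in the data
        if value > hi:
            above += 1
        elif value < lo:
            below += 1
    if above + below > max_outliers:
        return ("CRITICAL",
                "CRITICAL: Anomaly detected: %d data above and %d below "
                "the confidence bounds" % (above, below))
    return ("OK", "OK: No anomaly detected")
```

Because the counts are part of the message, every small change in the window produces a new alert text, which matches the complaint above that the changing message makes the IRC bot repeat itself.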
[21:14:15] !log zuul jenkins stuck again :( [21:14:22] Logged the message, Master [21:14:58] !log restarting Jenkins (making sure the java process properly disappear) [21:15:04] Logged the message, Master [21:16:47] (CR) Greg Grossmeier: "Please remove Commons from the list ;)" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/125035 (owner: MarkTraceur) [21:19:33] Does anyone know where I can go to find information about how CORS is set up on the cluster? [21:19:48] In particular I'm looking to figure out why CORS-y requests work on enwp but not on officewiki. [21:21:10] !log Jenkins is back :-] [21:21:17] Logged the message, Master [21:30:52] (PS1) Hashar: Jenkins job validation (DO NOT SUBMIT) [operations/puppet/varnish] - https://gerrit.wikimedia.org/r/131221 [21:31:12] (PS1) Hashar: Jenkins job validation (DO NOT SUBMIT) [operations/puppet/nginx] - https://gerrit.wikimedia.org/r/131222 [21:31:19] (PS1) Hashar: Jenkins job validation (DO NOT SUBMIT) [operations/puppet/mariadb] - https://gerrit.wikimedia.org/r/131223 [21:33:24] (PS2) Hashar: Jenkins job validation (DO NOT SUBMIT) [operations/puppet/varnish] - https://gerrit.wikimedia.org/r/131221 [21:33:26] (CR) jenkins-bot: [V: -1] Jenkins job validation (DO NOT SUBMIT) [operations/puppet/varnish] - https://gerrit.wikimedia.org/r/131221 (owner: Hashar) [21:33:28] (PS2) Hashar: Jenkins job validation (DO NOT SUBMIT) [operations/puppet/mariadb] - https://gerrit.wikimedia.org/r/131223 [21:33:30] (CR) jenkins-bot: [V: -1] Jenkins job validation (DO NOT SUBMIT) [operations/puppet/mariadb] - https://gerrit.wikimedia.org/r/131223 (owner: Hashar) [21:33:32] (PS2) Hashar: Jenkins job validation (DO NOT SUBMIT) [operations/puppet/nginx] - https://gerrit.wikimedia.org/r/131222 [21:33:34] (CR) jenkins-bot: [V: -1] Jenkins job validation (DO NOT SUBMIT) [operations/puppet/nginx] - https://gerrit.wikimedia.org/r/131222 (owner: Hashar)
[21:34:09] (CR) Hashar: "recheck" [operations/puppet/varnish] - https://gerrit.wikimedia.org/r/131221 (owner: Hashar) [21:34:38] (Abandoned) Hashar: Jenkins job validation (DO NOT SUBMIT) [operations/puppet/varnish] - https://gerrit.wikimedia.org/r/131221 (owner: Hashar) [21:34:40] marktraceur: search Bz, the original bz setup ticket should have details [21:34:44] (CR) Hashar: "recheck" [operations/puppet/nginx] - https://gerrit.wikimedia.org/r/131222 (owner: Hashar) [21:34:50] (CR) Hashar: "recheck" [operations/puppet/mariadb] - https://gerrit.wikimedia.org/r/131223 (owner: Hashar) [21:35:14] (PS3) Hashar: Jenkins job validation (DO NOT SUBMIT) [operations/puppet/nginx] - https://gerrit.wikimedia.org/r/131222 [21:36:33] (Abandoned) Hashar: Jenkins job validation (DO NOT SUBMIT) [operations/puppet/nginx] - https://gerrit.wikimedia.org/r/131222 (owner: Hashar) [21:36:39] (PS3) Hashar: Jenkins job validation (DO NOT SUBMIT) [operations/puppet/mariadb] - https://gerrit.wikimedia.org/r/131223 [21:36:55] (Abandoned) Hashar: Jenkins job validation (DO NOT SUBMIT) [operations/puppet/mariadb] - https://gerrit.wikimedia.org/r/131223 (owner: Hashar) [21:37:09] hashar-spam :) [21:38:18] greg-g: yeah moaaar jobs ! [21:38:29] i need to sleep actually [21:38:50] those 10+ hours long days are tiring [21:39:55] <^d> ori: I still can't hit silver :\ [21:40:26] <^d> permission denied (publickey) [21:42:46] ^d: i had to revert the patch, it broke on silver [21:42:54] <^d> Ah, missed that. [21:43:09] <^d> Sorry 'bout that. Something I can fix? [21:43:18] dunno [21:43:57] hashar: yuck, yes, please sleep [21:43:59] i'm looking for the error message in my scrollback buffer but i think i may have lost it -- something about the accounts not being present [21:46:30] <^d> ori: Ugh, ok. Don't worry, will deal with it later. [21:46:39] <^d> Not urgent. [21:46:56] greg-g: yeah doing so.
Have a good week-end :] [21:47:06] hashar: you too! [21:55:15] (03CR) 10BryanDavis: "I'd be willing to look into getting rid of the need for symlinks. I know that scap itself hardcodes the /usr/local/bin path in a couple of" [operations/puppet] - 10https://gerrit.wikimedia.org/r/129814 (owner: 10BryanDavis) [22:24:40] (03PS1) 10Aaron Schulz: Removed useless "handlerUrl" config [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/131228 [22:28:25] (03CR) 10Aaron Schulz: [C: 032] Removed useless "handlerUrl" config [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/131228 (owner: 10Aaron Schulz) [22:28:33] (03Merged) 10jenkins-bot: Removed useless "handlerUrl" config [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/131228 (owner: 10Aaron Schulz) [22:30:03] !log aaron synchronized wmf-config/filebackend.php 'Removed useless "handlerUrl" config' [22:30:10] Logged the message, Master [22:30:14] (03PS3) 10Ori.livneh: Add 'changes' module for broadcasting recent changes over WebSockets [operations/puppet] - 10https://gerrit.wikimedia.org/r/131040 [22:33:34] (03PS1) 10Aaron Schulz: Made private wikis use thumb_handler.php for thumbnails. [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/131230 [22:34:19] csteipp: ^ [22:34:54] (03PS2) 10Aaron Schulz: Made private wikis use thumb_handler.php for thumbnails [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/131230 [22:55:02] AaronSchulz, ping [22:59:30] gwicke: hm? [23:00:07] (03CR) 10GWicke: "The job runner restart triggered by this seems to have doubled the Parsoid job runner throughput, now back at old levels. 
Apparently job r" [operations/puppet] - 10https://gerrit.wikimedia.org/r/131072 (owner: 10Aaron Schulz) [23:00:15] AaronSchulz, ^^ [23:00:39] I was wondering what caused the Parsoid job queue processing rate to jump up to normal levels again [23:01:09] maybe jobs -rd acting weird...it is bash after all :) [23:01:33] yeah, or hanging / crashing workers that are now replaced by fresh ones [23:01:42] https://ganglia.wikimedia.org/latest/?r=day&cs=&ce=&m=cpu_report&s=by+name&c=Parsoid+eqiad&h=&host_regex=&max_graphs=0&tab=m&vn=&hide-hf=false&sh=1&z=small&hc=4 [23:02:02] bumped up at around 2pm [23:02:29] maybe doing ps -ef and looking at TIME would show something next time [23:02:48] I doubt the bash loops themselves have trouble...more likely the php procs [23:02:59] yeah [23:03:16] and the bash loop probably doesn't do much monitoring or respawning.. [23:04:06] which loop? [23:04:29] the top level loop doesn't, the next level loops will respawn the php procs (which come and go all the time anyway) [23:04:51] unless jobs -r was misreporting (I remember a bug with that a year back or so) [23:05:02] yeah, runJobsLoopService would be the interesting one [23:05:33] the php scripts have timeouts...though they don't count system time in i/o calls... [23:05:41] probably wrapping them in timeout would be easy enough [23:06:22] csteipp: I'll probably merge that soon [23:06:59] AaronSchulz, next time when the load drops I'll check for hung php job runners [23:07:13] the TIME should should be quite low [23:07:13] if there are any we'd have to do something about it [23:07:27] if it's like an hour you know for sure there is some odd thing going on [23:07:30] yes TIME low & start time a while in the past [23:09:57] I'll do the timeout thing too [23:10:27] (03CR) 10CSteipp: [C: 031] "thumb.php seems to do all the same checks as img_auth, so I think this is safe." 
[operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/131230 (owner: 10Aaron Schulz) [23:10:39] (03CR) 10Aaron Schulz: [C: 032] Made private wikis use thumb_handler.php for thumbnails [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/131230 (owner: 10Aaron Schulz) [23:10:46] (03Merged) 10jenkins-bot: Made private wikis use thumb_handler.php for thumbnails [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/131230 (owner: 10Aaron Schulz) [23:11:28] (03CR) 10Aaron Schulz: "Note: most things use thumbScriptUrl which uses thumb.php and works already...except TMH..." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/131230 (owner: 10Aaron Schulz) [23:12:58] AaronSchulz, thanks! [23:13:25] !log aaron synchronized wmf-config/filebackend.php 'Made private wikis use thumb_handler.php for thumbnails' [23:13:30] Logged the message, Master [23:21:39] somone from operations online, i need help: [23:22:09] this patch was merged https://gerrit.wikimedia.org/r/#/c/129167/ , but now the text schows up two times [23:23:10] it has worked hours ago, but now the old situation is back [23:24:08] ping: twkozlowski [23:26:43] (03PS1) 10Aaron Schulz: Use timeout to kill hanging PHP processes [operations/puppet] - 10https://gerrit.wikimedia.org/r/131236 [23:26:53] AaronSchulz: ??? [23:28:06] can somone pls emergencydisable the transclusion from MediaWiki:Newimages-summary to Special:NewFiles? [23:38:20] Steinsplitter, example? [23:38:49] https://en.wikipedia.org/wiki/Special:NewFiles and https://www.mediawiki.org/wiki/Special:NewFiles look normal [23:40:00] ah, commons [23:40:27] MaxSem: kaldari is helping me now. the patch affects only commons. (strange, one hour ago was all normal :() [23:55:31] Steinsplitter: where is kaldari helping you? [23:56:28] greg-g: on irc, it is possible to disable systemmsgs with "-" [23:56:40] Steinsplitter: what channel? [23:57:12] qery, i have contacted him. 
[23:57:40] k [23:57:45] he is very helpful :) [23:58:25] indeed :) [23:58:27] i was shocked, the special page broken. on commons the page is widely used :P [23:58:42] yeah, and you said only today/few hours ago? [23:59:04] yeah. after i have marked the template the 2nd time [23:59:16] i am not sure, maybe this or a mw change [23:59:31] which is weird because we haven't deployed anything that would touch that today [23:59:47] well, we haven't touched commons since Tuesday, other than this change https://gerrit.wikimedia.org/r/#/q/Ia12998fb11c686,n,z