[00:01:14] (03CR) 10Dzahn: [C: 031] "in the past i would have said the service/package part doesn't belong into a role class, but then i saw the discussion about getting rid o" [operations/puppet] - 10https://gerrit.wikimedia.org/r/145169 (owner: 10Rush) [00:02:26] come on, silly zuul [00:02:56] superm401, can you check if everything besides the config change works meanwhile? [00:03:14] MaxSem, I can test for regressions, but it won't really do anything without the config. [00:03:21] Except for the prior functionality. [00:03:25] ah, ok [00:04:08] zuul looks stuck, stuff in queue and no tests running [00:04:22] Quick test doesn't show any regressions. [00:04:30] Without the config I mean. [00:04:33] (03Merged) 10jenkins-bot: Re-enable the anonymous signup invite experiment [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/144938 (owner: 10Phuedx) [00:05:41] !log maxsem Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/144938 (duration: 00m 04s) [00:05:46] Logged the message, Master [00:05:50] superm401, ^ [00:06:13] Thanks MaxSem, will test now. [00:08:07] Thanks MaxSem! [00:09:20] (03CR) 10Andrew Bogott: [C: 031] phabricator class for installing in labs [operations/puppet] - 10https://gerrit.wikimedia.org/r/145169 (owner: 10Rush) [00:09:36] (03CR) 10Rush: [C: 032] phabricator class for installing in labs [operations/puppet] - 10https://gerrit.wikimedia.org/r/145169 (owner: 10Rush) [00:11:02] MaxSem, seems to be working. [00:13:10] wee [00:16:30] any maxage experts? My response has Age:870 [00:16:30] Cache-Control:s-maxage=300, must-revalidate, max-age=0 [00:16:40] how is that even possible? [00:17:07] shouldn't age be < 300? [00:17:42] bblack, MaxSem ^ debugging the new "ON" [00:18:09] oh, finally refreshed - i suspect after reaching about a 1000 seconds [00:18:51] what are we setting max 300 on? 
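
For reference on the Age/s-maxage question above, here is a minimal sketch of the freshness rule a shared cache is generally expected to apply, assuming the standard HTTP caching behaviour where `s-maxage` takes precedence over `max-age` for shared caches. The function names are hypothetical and this is not Varnish's actual implementation; it only shows that a response with `Age: 870` would count as stale under that rule, which is why the value looked surprising.

```python
def parse_cache_control(header):
    """Parse a Cache-Control header value into a dict of directives.

    Directives without a value (e.g. must-revalidate) map to None.
    """
    directives = {}
    for part in header.split(","):
        part = part.strip()
        if not part:
            continue
        name, _, value = part.partition("=")
        directives[name.strip().lower()] = value.strip().strip('"') or None
    return directives


def shared_freshness_lifetime(header):
    """Return the freshness lifetime in seconds for a shared cache.

    Assumption: s-maxage wins over max-age when both are present.
    Returns None when neither directive carries a usable number.
    """
    directives = parse_cache_control(header)
    for name in ("s-maxage", "max-age"):
        value = directives.get(name)
        if value is not None and value.isdigit():
            return int(value)
    return None


def is_fresh(header, age):
    """A cached response is fresh while its age is below the lifetime."""
    lifetime = shared_freshness_lifetime(header)
    return lifetime is not None and age < lifetime


if __name__ == "__main__":
    header = "s-maxage=300, must-revalidate, max-age=0"
    print(shared_freshness_lifetime(header))
    print(is_fresh(header, 870))
```

With the header quoted in the channel, the lifetime works out to 300 seconds, so an age of 870 is past it. The sketch says nothing about why the cache kept serving the object; that depends on the cache's own TTL handling.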
[00:20:08] bblack, https://en.m.wikipedia.org/w/index.php?title=Special:ZeroRatedMobileAccess&zcmd=js-banner [00:21:06] (03PS1) 10Reedy: Enable Flow on officewiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/145174 [00:21:40] (03CR) 10Reedy: [C: 04-1] "Database tables need to be created before enabling" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/145174 (owner: 10Reedy) [00:27:54] greg-g, can we do one more patch for wmf/1.24wmf11? We missed it before. [00:28:30] (03PS1) 10BBlack: no-op: set up for slow return of traffic to ulsfo [operations/dns] - 10https://gerrit.wikimedia.org/r/145176 [00:29:08] yurikR2: I'm seeing >300 Age in my browser there as well [00:29:24] GuidedTour change to support the previous GettingStarted one. [00:29:40] RECOVERY - HTTP error ratio anomaly detection on tungsten is OK: OK: No anomaly detected [00:29:52] yurikR2: but are you sure that's actually a problem? I'm no Cache-Control expert, but I thought it was possible to revalidate against HEAD + Last-Modified and then keep the cache longer? [00:30:00] It's https://gerrit.wikimedia.org/r/#/c/145175/ [00:32:36] (03CR) 10BBlack: [C: 032] no-op: set up for slow return of traffic to ulsfo [operations/dns] - 10https://gerrit.wikimedia.org/r/145176 (owner: 10BBlack) [00:35:34] superm401: sure, can you? [00:35:52] greg-g, yep, thanks. [00:38:19] (03PS1) 10BBlack: ulsfo return: OIT + generic-AP [operations/dns] - 10https://gerrit.wikimedia.org/r/145182 [00:38:43] bblack, the point is that it doesn't re-validate with the backend. 
We want this snippet of data frequently re-checked [00:38:44] (03CR) 10BBlack: [C: 032] ulsfo return: OIT + generic-AP [operations/dns] - 10https://gerrit.wikimedia.org/r/145182 (owner: 10BBlack) [00:39:14] yurikR2: apparently it does eventually, just not after 300s :) [00:39:21] that snippet is not per page, it's only per carrier, so the load will be tiny, but responsiveness when fixing issues, especially at first, is at a much higher premium [00:39:37] exactly. I wonder if varnish has cache override somewhere [00:45:05] yurikR2: I think varnish wants to look at max-age, not s-maxage [00:45:29] maybe, still reading [00:45:53] i thought s-maxage is specifically for "shared maxage" - which is exactly what varnish is [00:47:21] bblack, the bigger problem is that for some reason, my opera doesn't get detected :( [00:47:48] yeah varnish docs say that it obeys both s-maxage and max-age [00:47:58] https://www.varnish-cache.org/trac/wiki/VCLExampleLongerCaching <- interesting read [00:48:28] (we do have some ttl magic in various VCLs as well, but they all seem to be < 300 when set explicitly...) [00:49:29] bblack, aa, we do have an issue :((( patching varnish, sec... [00:49:34] (03CR) 10Dzahn: phabricator class for installing in labs (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/145169 (owner: 10Rush) [00:50:14] !log mattflaschen Synchronized php-1.24wmf11/extensions/GuidedTour/: GuidedTour cherry-pick to 1.24wmf11 in support of GettingStarted anonymous editor acquisition test (duration: 00m 09s) [00:50:19] Logged the message, Master [00:50:41] Done. [00:50:49] Thanks, greg-g [00:50:52] And thanks again, MaxSem [00:51:38] (03PS1) 10Yurik: Keeping req.http.X-Forwarded-By for the backend [operations/puppet] - 10https://gerrit.wikimedia.org/r/145188 [00:51:40] bblack, ^ [00:52:03] without it, all of our partners with opera and no direct zero are not getting whitelisted :( [00:52:56] ^ you realize we had that out for a reason right? 
it will fragment the cache [00:53:22] bblack, only by a factor of 2 [00:53:28] opera vs direct [00:53:37] vs nokia! [00:53:45] we only have one nokia [00:54:02] still "only by a factor of 2" is not insignificant. why the change? [00:54:12] or I should say, why is this broken now and not before? [00:54:17] bblack, how about we unset it if X-CS is not set? [00:54:29] in that case it will affect very few carriers [00:54:51] ok, but still, I'd like to understand what's going on with the new opera issue [00:55:14] sure, the problem is that with the multiple configs per carrier, i am forced to analyze each incoming header [00:55:24] to see what they have and don't have signed up to [00:55:41] it is no longer enough to just look at the X-CS being on and treat it as a "all's good" [00:56:09] i will make it unset only if X-CS is not set [00:56:14] this way the impact is tiny [00:56:29] (03PS1) 10BBlack: ulsfo return: JP, KP, KR [operations/dns] - 10https://gerrit.wikimedia.org/r/145191 [00:56:42] (03CR) 10BBlack: [C: 032] ulsfo return: JP, KP, KR [operations/dns] - 10https://gerrit.wikimedia.org/r/145191 (owner: 10BBlack) [00:57:16] yurikR2: ok [00:57:29] (03PS2) 10Yurik: Keeping req.http.X-Forwarded-By for the backend [operations/puppet] - 10https://gerrit.wikimedia.org/r/145188 [00:57:34] bblack, [00:57:35] ^ [00:58:46] (03CR) 10BBlack: [C: 032 V: 032] Keeping req.http.X-Forwarded-By for the backend [operations/puppet] - 10https://gerrit.wikimedia.org/r/145188 (owner: 10Yurik) [00:58:54] thx :) [01:06:53] !log cleared icinga downtimes for ulsfo (we now have some traffic back there) [01:07:01] Logged the message, Master [01:08:09] bblack, where can i see the cache hit rates for varnish? [01:09:19] ganglia! :) [01:09:56] (03CR) 10TTO: "Why was this abandoned? Could you leave a comment at the bug to explain it to the reporter?" 
[operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/145144 (https://bugzilla.wikimedia.org/67344) (owner: 10Steinsplitter) [01:12:52] yurikR2: e.g. http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&title=cp1046+cache+hit%2Fmiss&vl=&x=&n=&hreg[]=cp1046.eqiad.wmnet&mreg[]=varnish.cache_.%2A&gtype=line&glegend=show&aggregate=1&embed=1&_=1404954681703 [01:13:38] bblack, was that spike the depl? [01:13:46] (I suspect the ugly spike on the right is just vcl reload -> stats roll noise) [01:14:09] it stays pretty consistent before/after that spike. ganglia has lots of issues with graph anomalies on stats rollover type stuff [01:14:31] lovely. maybe the patch hasn't been deployed yet? [01:15:40] in any case, or current hitrate is abysmal. hopefully X-CS=ON will alleviate that :) [01:15:47] s/or/our/ [01:16:08] (03PS1) 10Yuvipanda: toollabs: Use appropriate ubuntu release version in mariadb src [operations/puppet] - 10https://gerrit.wikimedia.org/r/145195 [01:16:25] hmm, I guess all the labs folks will be asleep now [01:16:42] scfc_de: am fixing issues as I see 'em on the trusty node. I think ^ will fix some [01:21:08] (03PS2) 10Yuvipanda: toollabs: Don't add mariadb repo on trusty hosts [operations/puppet] - 10https://gerrit.wikimedia.org/r/145195 [01:24:49] (03PS1) 10BBlack: KH,MY,PH,SG,TW back to ulsfo [operations/dns] - 10https://gerrit.wikimedia.org/r/145197 [01:25:47] grr, debugging opera is painful - they seem to have removed all response headers :( [01:25:51] (03CR) 10BBlack: [C: 032 V: 032] KH,MY,PH,SG,TW back to ulsfo [operations/dns] - 10https://gerrit.wikimedia.org/r/145197 (owner: 10BBlack) [01:36:51] YuviPanda|zzz: I'm half asleep, I'll look into that tomorrow (today :-)). [01:37:02] scfc_de: :D I went to sleep, couldn't, so back. [01:37:07] scfc_de: added the two nodes, btw. 
[01:38:01] !log potassium,hydrogen,search1016,nitrogen,analytics1024,chromium - upgrade SSL [01:38:07] Logged the message, Master [01:38:59] (03PS1) 10Yurik: Allow all TEST* carriers to pass [operations/puppet] - 10https://gerrit.wikimedia.org/r/145208 [01:39:25] bblack, i kept wondering why i can't figure it out with the test carriers, but now with multiple sub-sets under one carrier, i need this ^ [01:39:40] sorry i keep bugging you ( [01:41:54] bblack, actually, it seems like we will have to switch detection to this pattern in many places, otherwise it won't work. X-CS no longer has just one number, it sometimes gets "|blah" appended at the end [01:43:48] so either we do a string subst() in front of all these if/elses, or we switch all if(x-cs=="111-11") to if(x-cs ~ "^111-11(\|.*)?") [01:45:32] we should probably refactor a bit and split into two distinct headers before the big if/else. it should simplify/replace the split down in the analytics part anyways [01:45:58] (instead of regexing all over the if/else blocks) [01:46:47] PROBLEM - puppet last run on ms-fe1001 is CRITICAL: CRITICAL: Puppet has 1 failures [01:46:58] bblack, could you +2 ^ for testing ones - this way i can already verify that opera works. [01:47:01] maybe leave X-CS at its original meaning (not containing the |blah part at the time of the big if/else block), and substr-split that off into X-CS-ZN for the zeronet part right off) [01:47:16] !log argon - Ignoring file 'puppet_base_2.7' in directory '/etc/apt/preferences.d/ [01:47:23] Logged the message, Master [01:47:49] bblack, oops, wrong key. As for regexes, sure, we could split them, but the entire if/else still needs a new variable name - we would have to compare on var, and set another [01:48:06] will get something in shortly [01:48:11] oops wrong key? 
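
The two options discussed above (regex-matching the carrier ID everywhere, or splitting X-CS into two values once up front) can be sketched roughly as follows. This is Python rather than VCL, purely as an illustration; the function names are hypothetical, and the trailing `$` anchor is an added assumption that is not in the pattern quoted in the channel (without it, `111-11` would also match `111-110`).

```python
import re

# Carrier ID, optionally followed by "|<suffix>" (the "|blah" part).
X_CS_PATTERN = re.compile(r"^(?P<carrier>[^|]+)(?:\|(?P<suffix>.*))?$")


def split_x_cs(value):
    """Split an X-CS value into (carrier, suffix).

    The suffix corresponds to what the chat proposes moving into a
    separate X-CS-ZN header; it is None when no "|" part is present.
    Returns (None, None) for an empty or malformed value.
    """
    match = X_CS_PATTERN.match(value)
    if match is None:
        return None, None
    return match.group("carrier"), match.group("suffix")


def is_carrier(value, carrier_id):
    """Regex-style check: exact carrier ID, with or without a suffix."""
    pattern = r"^" + re.escape(carrier_id) + r"(\|.*)?$"
    return re.match(pattern, value) is not None


if __name__ == "__main__":
    print(split_x_cs("111-11|blah"))
    print(split_x_cs("111-11"))
    print(is_carrier("111-110", "111-11"))
```

Splitting once keeps the later comparisons as plain string equality on the carrier part, which is the refactor suggested in the discussion; the per-branch regex is the smaller change but has to be repeated in every if/else.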
[01:48:30] (03CR) 10BBlack: [C: 032] Allow all TEST* carriers to pass [operations/puppet] - 10https://gerrit.wikimedia.org/r/145208 (owner: 10Yurik) [01:50:01] (03PS1) 10BBlack: MM, TH, VN back to ulsfo [operations/dns] - 10https://gerrit.wikimedia.org/r/145209 [01:50:16] (03PS1) 10BBlack: BD, ID, MN back to ulsfo [operations/dns] - 10https://gerrit.wikimedia.org/r/145210 [01:50:18] (03PS1) 10BBlack: BT, HK, MO back to ulsfo [operations/dns] - 10https://gerrit.wikimedia.org/r/145211 [01:50:20] (03PS1) 10BBlack: BN, CC, CX, LA, MV, NP, TL back to ulsfo [operations/dns] - 10https://gerrit.wikimedia.org/r/145212 [01:51:08] (03CR) 10BBlack: [C: 032] MM, TH, VN back to ulsfo [operations/dns] - 10https://gerrit.wikimedia.org/r/145209 (owner: 10BBlack) [01:58:02] (03CR) 10BBlack: [C: 032] BD, ID, MN back to ulsfo [operations/dns] - 10https://gerrit.wikimedia.org/r/145210 (owner: 10BBlack) [02:05:45] RECOVERY - puppet last run on ms-fe1001 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [02:06:08] (03CR) 10BBlack: [C: 032] BT, HK, MO back to ulsfo [operations/dns] - 10https://gerrit.wikimedia.org/r/145211 (owner: 10BBlack) [02:21:51] (03CR) 10BBlack: [C: 032] BN, CC, CX, LA, MV, NP, TL back to ulsfo [operations/dns] - 10https://gerrit.wikimedia.org/r/145212 (owner: 10BBlack) [02:25:35] That's all of Oceania/Asia that we previously mapped to ulsfo is back on ulsfo now [02:26:20] What remains is US/Canada states/territories to put back, but I'm going to wait until later in the evening here first to let the caches build a bit more (and come in during lower load here). [02:38:21] !log argon,calcium,iron,rhenium,bast1001,oxygen,netmon1001 - upgraded SSL [02:38:28] Logged the message, Master [02:38:37] bblack: cool!! 
[02:40:22] out, bbl [02:42:33] !log LocalisationUpdate completed (1.24wmf11) at 2014-07-10 02:41:29+00:00 [02:42:38] Logged the message, Master [02:48:32] !log netmon1001 - DocumentRoot [/etc/apache2/undef] does not exist [02:48:36] Logged the message, Master [02:49:49] !log argon,netmon1001, graceful'led apaches [02:49:54] Logged the message, Master [03:12:51] !log LocalisationUpdate completed (1.24wmf12) at 2014-07-10 03:11:48+00:00 [03:12:57] Logged the message, Master [03:25:36] PROBLEM - puppet last run on mw1145 is CRITICAL: CRITICAL: Puppet has 1 failures [03:37:28] this is weird - mw1151 is still not in sync [03:44:32] RECOVERY - puppet last run on mw1145 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [03:47:42] !log LocalisationUpdate ResourceLoader cache refresh completed at Thu Jul 10 03:46:36 UTC 2014 (duration 46m 35s) [03:47:46] Logged the message, Master [04:14:26] (03PS1) 10Springle: Start using MariaDB event scheduler for labsdb. [operations/software] - 10https://gerrit.wikimedia.org/r/145218 [04:17:19] (03PS2) 10Springle: Start using MariaDB event scheduler for labsdb. [operations/software] - 10https://gerrit.wikimedia.org/r/145218 [04:21:53] (03PS3) 10Springle: Start using MariaDB event scheduler for labsdb. [operations/software] - 10https://gerrit.wikimedia.org/r/145218 [04:33:22] (03PS1) 10BBlack: Eastern portion of Western US/Canada back to ulsfo [operations/dns] - 10https://gerrit.wikimedia.org/r/145220 [04:33:25] (03PS1) 10BBlack: Western-most Canada/US back to ulsfo [operations/dns] - 10https://gerrit.wikimedia.org/r/145221 [04:34:01] (03PS4) 10Springle: Start using MariaDB event scheduler for labsdb. [operations/software] - 10https://gerrit.wikimedia.org/r/145218 [04:34:34] (03CR) 10BBlack: [C: 032] Eastern portion of Western US/Canada back to ulsfo [operations/dns] - 10https://gerrit.wikimedia.org/r/145220 (owner: 10BBlack) [04:43:45] (03PS5) 10Springle: Start using MariaDB event scheduler for labsdb. 
[operations/software] - 10https://gerrit.wikimedia.org/r/145218 [04:50:34] (03PS6) 10Springle: Start using MariaDB event scheduler for labsdb. [operations/software] - 10https://gerrit.wikimedia.org/r/145218 [04:54:45] (03PS7) 10Springle: Use MariaDB event scheduler for labsdb. [operations/software] - 10https://gerrit.wikimedia.org/r/145218 [05:09:33] (03CR) 10Springle: [C: 032] Use MariaDB event scheduler for labsdb. [operations/software] - 10https://gerrit.wikimedia.org/r/145218 (owner: 10Springle) [05:42:48] PROBLEM - puppet last run on mchenry is CRITICAL: Timeout while attempting connection [05:43:27] PROBLEM - Host db60 is DOWN: PING CRITICAL - Packet loss = 100% [05:43:27] PROBLEM - Host db73 is DOWN: PING CRITICAL - Packet loss = 100% [05:43:37] PROBLEM - Host ps1-d1-pmtpa is DOWN: PING CRITICAL - Packet loss = 100% [05:43:37] PROBLEM - Host ps1-d3-pmtpa is DOWN: PING CRITICAL - Packet loss = 100% [05:43:37] PROBLEM - Host ps1-d2-pmtpa is DOWN: PING CRITICAL - Packet loss = 100% [05:44:07] PROBLEM - Host sanger is DOWN: PING CRITICAL - Packet loss = 100% [05:44:07] PROBLEM - Host tarin is DOWN: PING CRITICAL - Packet loss = 100% [05:44:07] PROBLEM - Host virt0 is DOWN: PING CRITICAL - Packet loss = 100% [05:44:07] PROBLEM - Host ns1.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [05:44:07] PROBLEM - Host fenari is DOWN: PING CRITICAL - Packet loss = 100% [05:44:08] PROBLEM - Host labs-ns0.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [05:44:08] PROBLEM - Host db69 is DOWN: PING CRITICAL - Packet loss = 100% [05:44:09] PROBLEM - Host es4 is DOWN: PING CRITICAL - Packet loss = 100% [05:44:17] PROBLEM - Host dataset2 is DOWN: CRITICAL - Time to live exceeded (208.80.152.185) [05:44:17] PROBLEM - Host linne is DOWN: CRITICAL - Time to live exceeded (208.80.152.167) [05:44:17] PROBLEM - Host dobson is DOWN: CRITICAL - Time to live exceeded (208.80.152.173) [05:44:17] PROBLEM - Host 208.80.152.132 is DOWN: PING CRITICAL - Packet loss = 100% 
[05:44:17] PROBLEM - Host ps1-c1-pmtpa is DOWN: PING CRITICAL - Packet loss = 100% [05:44:18] PROBLEM - Host db71 is DOWN: PING CRITICAL - Packet loss = 100% [05:44:18] PROBLEM - Host mchenry is DOWN: PING CRITICAL - Packet loss = 100% [05:44:19] PROBLEM - Host ps1-c3-pmtpa is DOWN: PING CRITICAL - Packet loss = 100% [05:44:27] PROBLEM - Host db72 is DOWN: PING CRITICAL - Packet loss = 100% [05:44:27] PROBLEM - Host nfs1 is DOWN: PING CRITICAL - Packet loss = 100% [05:44:27] PROBLEM - Host ps1-c2-pmtpa is DOWN: PING CRITICAL - Packet loss = 100% [05:44:37] PROBLEM - Host es7 is DOWN: PING CRITICAL - Packet loss = 100% [05:44:37] PROBLEM - Host es10 is DOWN: PING CRITICAL - Packet loss = 100% [05:44:37] PROBLEM - Host pdf3 is DOWN: PING CRITICAL - Packet loss = 100% [05:44:37] PROBLEM - Host db74 is DOWN: PING CRITICAL - Packet loss = 100% [05:44:37] PROBLEM - Host mexia is DOWN: PING CRITICAL - Packet loss = 100% [05:45:07] PROBLEM - Host 208.80.152.131 is DOWN: PING CRITICAL - Packet loss = 100% [05:49:21] ? [05:49:25] uh [05:49:36] springle: ^^ ? [05:49:48] odd huh [05:51:29] 32 hosts down? [05:51:38] (03PS1) 10Legoktm: Set $wgUserMergeEnableDelete = false; [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/145224 (https://bugzilla.wikimedia.org/67789) [05:51:39] pmtpa down [05:51:43] or a link [05:51:44] wait, pmtpa, yeah [05:51:50] "who cares" [05:51:57] * springle wandering in librenms [05:52:29] mail might still be affected with mchenry [05:52:44] or has that been switched... [05:52:49] yeah, and the pdf servers [05:53:16] call mark? [05:53:28] bblack: still awake? 
[05:53:58] need a network admin, yeah [05:56:52] I just texted mark [05:58:27] RECOVERY - Host es7 is UP: PING OK - Packet loss = 0%, RTA = 26.84 ms [05:58:27] RECOVERY - Host ps1-c1-pmtpa is UP: PING OK - Packet loss = 0%, RTA = 29.35 ms [05:58:27] RECOVERY - Host tarin is UP: PING OK - Packet loss = 0%, RTA = 26.68 ms [05:58:27] RECOVERY - Host virt0 is UP: PING OK - Packet loss = 0%, RTA = 27.10 ms [05:58:27] RECOVERY - Host nfs1 is UP: PING OK - Packet loss = 0%, RTA = 26.68 ms [05:58:28] RECOVERY - Host ps1-d3-pmtpa is UP: PING OK - Packet loss = 0%, RTA = 29.37 ms [05:58:28] RECOVERY - Host ps1-d1-pmtpa is UP: PING OK - Packet loss = 0%, RTA = 29.17 ms [05:58:29] RECOVERY - Host db60 is UP: PING OK - Packet loss = 0%, RTA = 26.67 ms [05:58:29] RECOVERY - Host db73 is UP: PING OK - Packet loss = 0%, RTA = 26.67 ms [05:58:30] RECOVERY - Host es4 is UP: PING OK - Packet loss = 0%, RTA = 26.72 ms [05:58:30] RECOVERY - Host 208.80.152.131 is UP: PING OK - Packet loss = 0%, RTA = 26.62 ms [05:58:31] RECOVERY - Host ps1-c2-pmtpa is UP: PING OK - Packet loss = 0%, RTA = 28.16 ms [05:58:37] RECOVERY - Host linne is UP: PING OK - Packet loss = 0%, RTA = 26.69 ms [05:58:37] RECOVERY - Host fenari is UP: PING OK - Packet loss = 0%, RTA = 27.49 ms [05:58:37] RECOVERY - Host db74 is UP: PING OK - Packet loss = 0%, RTA = 26.76 ms [05:58:37] RECOVERY - Host dobson is UP: PING OK - Packet loss = 0%, RTA = 26.97 ms [05:58:37] RECOVERY - Host mexia is UP: PING OK - Packet loss = 0%, RTA = 27.38 ms [05:58:38] RECOVERY - Host ns1.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 26.69 ms [05:58:38] RECOVERY - Host db71 is UP: PING OK - Packet loss = 0%, RTA = 26.69 ms [05:58:39] * greg-g breathes [05:58:43] heh [05:58:47] RECOVERY - Host ps1-d2-pmtpa is UP: PING OK - Packet loss = 0%, RTA = 30.32 ms [05:58:47] RECOVERY - Host ps1-c3-pmtpa is UP: PING OK - Packet loss = 0%, RTA = 29.11 ms [05:58:47] RECOVERY - Host dataset2 is UP: PING OK - Packet loss = 0%, RTA = 27.50 ms 
[05:58:47] RECOVERY - Host sanger is UP: PING OK - Packet loss = 0%, RTA = 27.02 ms [05:58:48] RECOVERY - Host db69 is UP: PING OK - Packet loss = 0%, RTA = 26.93 ms [05:58:48] RECOVERY - Host mchenry is UP: PING OK - Packet loss = 0%, RTA = 26.65 ms [05:58:48] RECOVERY - Host db72 is UP: PING OK - Packet loss = 0%, RTA = 26.69 ms [05:58:49] * greg-g texts mark again [05:58:57] RECOVERY - Host labs-ns0.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 28.20 ms [05:58:57] RECOVERY - Host es10 is UP: PING OK - Packet loss = 0%, RTA = 27.05 ms [05:59:07] RECOVERY - Host pdf3 is UP: PING OK - Packet loss = 0%, RTA = 28.39 ms [05:59:07] RECOVERY - Host 208.80.152.132 is UP: PING OK - Packet loss = 0%, RTA = 27.75 ms [05:59:30] and pdf rendering works again [06:00:21] so durign that, i couldn't get access even to ae2-1002.cr2-eqiad.wikimedia.org, from iron [06:00:36] :/ [06:01:01] PROBLEM - puppet last run on nfs1 is CRITICAL: CRITICAL: Puppet has 1 failures [06:01:01] PROBLEM - puppet last run on db72 is CRITICAL: CRITICAL: Puppet has 4 failures [06:01:01] PROBLEM - puppet last run on db73 is CRITICAL: CRITICAL: Puppet has 16 failures [06:01:01] PROBLEM - puppet last run on tridge is CRITICAL: CRITICAL: Puppet has 1 failures [06:01:01] PROBLEM - puppet last run on db60 is CRITICAL: CRITICAL: Puppet has 1 failures [06:01:07] springle: fwiw sodium is the new main MX [06:01:13] ok, just puppet now, shush puppet [06:01:17] matanya: oh? 
nice [06:01:32] i figured it had switched recently [06:01:41] PROBLEM - puppet last run on db69 is CRITICAL: CRITICAL: Puppet has 4 failures [06:02:01] PROBLEM - puppet last run on sanger is CRITICAL: CRITICAL: Puppet has 1 failures [06:02:14] springle: e.g : https://gerrit.wikimedia.org/r/#/c/143886/ [06:02:31] ok, on that note, I'm going to bed [06:02:52] I'll ping mar-k about it tomorrow [06:03:01] RECOVERY - puppet last run on db73 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [06:03:36] springle: I meant polonium not sodium, too many elements ... :) [06:03:48] night greg-g [06:03:55] :) [06:04:01] RECOVERY - puppet last run on tridge is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [06:04:39] oh, i see lead also serves as a MX [06:09:01] RECOVERY - puppet last run on nfs1 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [06:10:01] RECOVERY - puppet last run on db72 is OK: OK: Puppet is currently enabled, last run 1 seconds ago with 0 failures [06:10:41] RECOVERY - puppet last run on db69 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [06:13:02] RECOVERY - puppet last run on sanger is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [06:13:27] (03CR) 10Matanya: "compiled using puppet-compiler on:" [operations/puppet] - 10https://gerrit.wikimedia.org/r/144033 (owner: 10Matanya) [06:15:02] RECOVERY - puppet last run on db60 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [06:23:52] * matanya looks for _joe_ [06:28:44] PROBLEM - puppet last run on mw1061 is CRITICAL: CRITICAL: Puppet has 1 failures [06:28:45] PROBLEM - puppet last run on ms-fe1001 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:14] PROBLEM - puppet last run on mw1117 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:53] (03PS1) 10Matanya: zuul: fully qualify vars [operations/puppet] - 
10https://gerrit.wikimedia.org/r/145229 [06:34:44] PROBLEM - puppet last run on es1002 is CRITICAL: CRITICAL: Puppet has 1 failures [06:36:19] I missed some excitement in florida I see [06:37:30] (03CR) 10BBlack: [C: 032] Western-most Canada/US back to ulsfo [operations/dns] - 10https://gerrit.wikimedia.org/r/145221 (owner: 10BBlack) [06:40:35] !log all normally-ulsfo traffic is back on ulsfo [06:40:40] Logged the message, Master [06:41:25] PROBLEM - puppet last run on ms-be3004 is CRITICAL: CRITICAL: Puppet has 1 failures [06:45:08] ^ ms-be3004 puppetfail seems to have been a temporary network issue of some kind... Jul 10 06:38:07 ms-be3004 puppet-agent[13268]: (/Stage[main]/Apt/File[/usr/local/bin/apt2xml]) Could not evaluate: Connection timed out - connect(2) Could not retrieve file metadata for puppet:///modules/apt/apt2xml.py: Connection timed out - connect(2) [06:45:15] RECOVERY - puppet last run on mw1117 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [06:45:16] it ran fine manually afterwards [06:45:25] RECOVERY - puppet last run on ms-be3004 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [06:45:41] nite! [06:45:45] RECOVERY - puppet last run on mw1061 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [06:45:55] RECOVERY - puppet last run on ms-fe1001 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [06:51:35] RECOVERY - puppet last run on es1002 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [07:02:52] PROBLEM - MySQL Processlist on db1064 is CRITICAL: CRIT 78 unauthenticated, 0 locked, 0 copy to table, 1 statistics [07:03:52] RECOVERY - MySQL Processlist on db1064 is OK: OK 0 unauthenticated, 0 locked, 0 copy to table, 1 statistics [07:05:08] <_joe_> wow good job bblack [07:22:38] _joe_: will you be able to spend a few minutes with me today ? 
[07:23:11] and btw: http://blog.dustinkirkland.com/2014/07/scalable-parallel-video-transcoding-on.html [07:25:40] <_joe_> matanya: mmmh not really :) [07:25:41] <_joe_> matanya: maybe later, sorry :( [07:25:51] <_joe_> I have to rollout a big change for apache [07:26:01] sure :) [07:26:05] <_joe_> the kind of destroy-all-if-wrong change [07:26:12] oh nose [07:26:29] <_joe_> we're moving apache configs into puppet [07:26:37] <_joe_> :) [07:27:01] (03PS4) 10Giuseppe Lavagetto: mediawiki: manage with puppet on all nodes [operations/puppet] - 10https://gerrit.wikimedia.org/r/144917 [07:30:33] (03PS1) 10Matanya: deployment: fully qualify var [operations/puppet] - 10https://gerrit.wikimedia.org/r/145234 [07:31:12] <_joe_> matanya: btw, thanks (again) for the work you're doing on this [07:31:30] my pleasure [07:33:27] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki: manage with puppet on all nodes [operations/puppet] - 10https://gerrit.wikimedia.org/r/144917 (owner: 10Giuseppe Lavagetto) [07:33:38] <_joe_> come on jenkins [07:47:02] PROBLEM - Disk space on dataset1001 is CRITICAL: DISK CRITICAL - free space: /data 1519829 MB (3% inode=99%): [07:51:03] (03CR) 10Matanya: "Ran on puppet compiler - noop:" [operations/puppet] - 10https://gerrit.wikimedia.org/r/145234 (owner: 10Matanya) [07:52:00] <_joe_> !log doing a tagged run of puppet on all appservers to sync apache config [07:52:04] Logged the message, Master [07:54:31] <_joe_> load average: 26.20 on palladium, 5.46 on strontium [07:54:40] <_joe_> something is not correctly balanced I' [07:54:43] <_joe_> d say [07:56:52] PROBLEM - puppetmaster backend https on strontium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:57:42] RECOVERY - puppetmaster backend https on strontium is OK: HTTP OK: Status line output matched 400 - 335 bytes in 0.020 second response time [08:00:07] <_joe_> If you stop seeing wikipedias in a couple of minutes, that will be me [08:01:25] oblivian is doing a graceful restart of all apaches 
[08:02:41] !log oblivian gracefulled all apaches [08:02:46] Logged the message, Master [08:04:12] <_joe_> mmmh I'm gonna do that again in a few minutes I guess [08:04:37] (03PS4) 10Matanya: cache: lint [operations/puppet] - 10https://gerrit.wikimedia.org/r/140678 [08:05:03] thanks to puppet-compiler, found a typo! :) [08:06:00] <_joe_> matanya: that tool is awesome [08:06:29] totally. and the fact i have access to it saves so much time [08:09:52] (03CR) 10Matanya: "ran on puppet-compiler. noop:" [operations/puppet] - 10https://gerrit.wikimedia.org/r/140678 (owner: 10Matanya) [08:14:07] PROBLEM - Puppet freshness on db1006 is CRITICAL: Last successful Puppet run was Thu 10 Jul 2014 06:13:30 UTC [08:17:15] (03PS1) 10Giuseppe Lavagetto: mediawiki: make apache MaxClients be always smaller than ServerLimit [operations/puppet] - 10https://gerrit.wikimedia.org/r/145245 [08:20:22] (03PS1) 10Matanya: eventlogging: port is a fact, qualify [operations/puppet] - 10https://gerrit.wikimedia.org/r/145246 [08:23:58] (03CR) 10Matanya: "ran puppet-compiler, noop:" [operations/puppet] - 10https://gerrit.wikimedia.org/r/145246 (owner: 10Matanya) [08:34:58] (03PS2) 10Giuseppe Lavagetto: mediawiki: make apache MaxClients be always smaller than ServerLimit [operations/puppet] - 10https://gerrit.wikimedia.org/r/145245 [08:38:18] (03PS1) 10Matanya: ferm: qualify vars [operations/puppet] - 10https://gerrit.wikimedia.org/r/145250 [09:00:09] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] Added jobrunner.ini file [operations/puppet] - 10https://gerrit.wikimedia.org/r/145130 (owner: 10Aaron Schulz) [09:01:26] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] zuul: fully qualify vars [operations/puppet] - 10https://gerrit.wikimedia.org/r/145229 (owner: 10Matanya) [09:04:58] arghh [09:05:09] oh no that change seems fine [09:06:55] !log gallium deleted /var/lib/puppet/state/agent_catalog_run.lock from July 5th. 
Was preventing me to run puppet agent -tv [09:07:00] Logged the message, Master [09:07:29] !log gallium err was July 5th and file was from a minute ago ... ignore me [09:07:33] Logged the message, Master [09:09:52] (03PS3) 10Giuseppe Lavagetto: mediawiki: make apache MaxClients be always smaller than ServerLimit [operations/puppet] - 10https://gerrit.wikimedia.org/r/145245 [09:10:05] PROBLEM - Unmerged changes on repository puppet on virt0 is CRITICAL: Fetching origin [09:10:35] PROBLEM - puppet last run on gallium is CRITICAL: CRITICAL: Puppet has 1 failures [09:10:59] bah [09:11:01] hashar: i try not to break stuff :) thanks a lot for puppet-compiler [09:11:23] matanya: the job apparently fill /tmp pretty quickly when run against all nodes [09:12:03] not to run on all nodes? [09:12:34] then jenkins has a mechanism to remove a slave from the pool whenever a partition is filled :/ [09:13:25] RECOVERY - Puppet freshness on db1006 is OK: puppet ran at Thu Jul 10 09:13:17 UTC 2014 [09:13:31] <_joe_> hashar: no it's matanya running it on a lot of changes [09:13:36] RECOVERY - puppet last run on gallium is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [09:13:40] <_joe_> each change reproduces a chroot of sort [09:17:50] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki: make apache MaxClients be always smaller than ServerLimit [operations/puppet] - 10https://gerrit.wikimedia.org/r/145245 (owner: 10Giuseppe Lavagetto) [09:19:46] matanya: sodium is not the new main MX [09:19:56] polonium ? 
[09:20:04] polonium and lead [09:20:24] yes, so my correction was right, thanks para [09:20:27] paravoid: [09:23:46] <_joe_> !log doing a tagged run to sync apache config [09:23:51] Logged the message, Master [09:24:35] <_joe_> paravoid: tagged puppet runs take less that 3 minutes to complete when run via salt :) [09:25:54] the bottleneck being the puppetmaster I suppose [09:26:14] <_joe_> yes [09:26:24] <_joe_> I feared it would be much worse [09:26:40] so we lost both waves to pmtpa last night [09:26:41] <_joe_> now it took just under 2 minutes [09:26:48] <_joe_> :) [09:28:05] <_joe_> paravoid: uh? [09:28:54] oblivian is doing a graceful restart of all apaches [09:30:09] !log oblivian gracefulled all apaches [09:33:03] RECOVERY - Unmerged changes on repository puppet on virt0 is OK: Fetching origin [09:33:38] <_joe_> paravoid: oh that is you fixing things? [09:33:58] ? [09:34:01] no [09:35:31] <_joe_> wtf then [09:48:11] (03PS2) 10Reedy: Update size related dblists [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/145040 [09:48:19] (03CR) 10Reedy: [C: 032] Update size related dblists [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/145040 (owner: 10Reedy) [09:48:26] (03Merged) 10jenkins-bot: Update size related dblists [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/145040 (owner: 10Reedy) [09:48:33] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] swift: add swift-dispersion-report and stats [operations/puppet] - 10https://gerrit.wikimedia.org/r/144932 (owner: 10Filippo Giunchedi) [09:48:52] !log reedy Synchronized database lists: (no message) (duration: 00m 13s) [09:48:56] Logged the message, Master [09:50:13] (03CR) 10Nemo bis: Update size related dblists (031 comment) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/145040 (owner: 10Reedy) [09:51:13] (03CR) 10Matanya: "checked on puppet compiler, noop:" [operations/puppet] - 10https://gerrit.wikimedia.org/r/145250 (owner: 10Matanya) [09:56:01] (03PS1) 
10Filippo Giunchedi: swift: fix dispersion @proxy_address [operations/puppet] - 10https://gerrit.wikimedia.org/r/145259 [09:56:27] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] swift: fix dispersion @proxy_address [operations/puppet] - 10https://gerrit.wikimedia.org/r/145259 (owner: 10Filippo Giunchedi) [09:58:11] PROBLEM - Unmerged changes on repository puppet on virt0 is CRITICAL: Fetching origin [09:59:11] RECOVERY - Unmerged changes on repository puppet on virt0 is OK: Fetching origin [10:05:48] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: Fetching origin [10:06:20] <_joe_> what is wrong with this today? [10:16:24] (03PS1) 10Matanya: ganglia_new: qualify vars [operations/puppet] - 10https://gerrit.wikimedia.org/r/145266 [10:27:59] (03PS8) 10Hashar: zuul: phase out zuulwikimedia [operations/puppet] - 10https://gerrit.wikimedia.org/r/145047 [10:29:32] !log restart profiler-to-carbon on tungsten, seemingly cpu spinning [10:29:36] Logged the message, Master [10:54:43] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [500.0] [10:54:53] PROBLEM - puppet last run on ssl3002 is CRITICAL: CRITICAL: Puppet has 1 failures [11:02:09] (03CR) 10Giuseppe Lavagetto: Add init and upstart scripts (033 comments) [operations/debs/hhvm] - 10https://gerrit.wikimedia.org/r/144981 (owner: 10Giuseppe Lavagetto) [11:05:37] (03PS1) 10Matanya: geoip: qualify vars [operations/puppet] - 10https://gerrit.wikimedia.org/r/145269 [11:10:46] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [11:10:46] RECOVERY - puppet last run on ssl3002 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [11:31:49] any ops around? [11:32:37] (03CR) 10Steinsplitter: "my git is broken. i have asked to to so." 
[operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/145144 (https://bugzilla.wikimedia.org/67344) (owner: 10Steinsplitter) [11:32:41] <_joe_> Nikerabbit: I was about to go out for lunch [11:32:43] would like to change my ssh-key [11:32:53] Nikerabbit: Make a changeset [11:32:55] ;) [11:33:05] Reedy: oh, which repo is it? [11:33:25] operations/puppet [11:33:29] <_joe_> Reedy: or, he may file an RT ticket [11:33:41] <_joe_> and we'll take care of that [11:33:48] <_joe_> unless this is a security issue [11:33:56] <_joe_> in that case I can take care of this now [11:34:03] I guess it might be related to the translatewiki server [11:34:10] <_joe_> I guessed the same [11:34:13] <_joe_> :) [11:34:37] there is no evidence of compromise of the key but just in case [11:35:15] <_joe_> was your key on that server [11:35:20] <_joe_> the private key I mean [11:35:33] _joe_: I regularly use key forwarding to that server [11:35:47] <_joe_> he [11:36:04] <_joe_> so my next question "do you use a passphrase" is void of meaning [11:36:19] <_joe_> now, we need to find a good way for you to give me your new pubkey [11:37:03] <_joe_> but let's take this part of the conversation in private :) [11:37:28] _joe_: Ever been to Finland? :) [11:37:36] Reedy: have you [11:37:44] Nope [11:37:49] <_joe_> Reedy: well, technically yes [11:37:52] <_joe_> for about 1 hour [11:37:55] It's not that far from you [11:37:56] haha [11:38:20] And the Baltic is so warm for bathing :P [11:38:36] 3 hours flying or so [11:39:24] <_joe_> Nemo_bis: not that the North Atlantic in general is hot, btw [11:39:39] <_joe_> Nikerabbit: please see my PM [11:40:19] <_joe_> if you're not here now, I'm grabbing some lunch, ask to the next op around, or wait for me to get back [11:43:54] (03CR) 10Matanya: "compiled on pupet-compiler. 
https://integration.wikimedia.org/ci/view/operations/job/operations-puppet-catalog-compiler/139/console" [operations/puppet] - 10https://gerrit.wikimedia.org/r/145266 (owner: 10Matanya) [11:45:54] (03PS1) 10Steinsplitter: Adding new domains to wgCopyUploadsDomains. [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/145273 (https://bugzilla.wikimedia.org/67344) [11:53:40] (03PS1) 10Giuseppe Lavagetto: admin: change nikerabbit's key [operations/puppet] - 10https://gerrit.wikimedia.org/r/145274 [11:56:29] (03CR) 10TTO: Adding new domains to wgCopyUploadsDomains. (031 comment) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/145273 (https://bugzilla.wikimedia.org/67344) (owner: 10Steinsplitter) [11:57:38] (03PS2) 10Giuseppe Lavagetto: admin: change nikerabbit's key [operations/puppet] - 10https://gerrit.wikimedia.org/r/145274 [11:58:06] (03CR) 10Nikerabbit: [C: 031] admin: change nikerabbit's key [operations/puppet] - 10https://gerrit.wikimedia.org/r/145274 (owner: 10Giuseppe Lavagetto) [11:58:35] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] admin: change nikerabbit's key [operations/puppet] - 10https://gerrit.wikimedia.org/r/145274 (owner: 10Giuseppe Lavagetto) [11:59:36] aww I was too slow to fix the translatewiki.net name [11:59:47] RECOVERY - Unmerged changes on repository puppet on strontium is OK: Fetching origin [12:00:14] Reedy, Nikerabbit: What's up with translatewiki? [12:00:31] https://twitter.com/translatewiki/status/487183685448630272 [12:00:33] Elasticsearch was open to the public [12:01:25] wow [12:01:56] So hopefully no data compromised, right? [12:01:57] PROBLEM - puppet last run on neon is CRITICAL: CRITICAL: Puppet has 1 failures [12:02:47] PROBLEM - LDAP on virt1000 is CRITICAL: Connection refused [12:03:53] Krenair: AFAIK(!) the elasticsearch vulnerability allows remote code execution, so one really doesn't know. 
[12:04:38] <_joe_> Trminator: I don't think that was the goal of the attacker honestly [12:04:52] And someone stored their Wikimedia production SSH key there? :/ [12:04:59] (03PS2) 10Steinsplitter: Adding new domains to wgCopyUploadsDomains. [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/145273 (https://bugzilla.wikimedia.org/67344) [12:05:00] Krenair: as far as we know, no, but we cannot be 100% sure [12:05:01] <_joe_> no [12:05:16] It seems it was just used in a ddos attack or similar [12:05:46] "just" [12:05:55] _joe_: yea, I'd guess they wanted the drone for the ddos, but one can't know unless you'd do a full analysis of the disk, right? ;) [12:06:25] <_joe_> Trminator: yes I already advised to reinstall from scratch [12:06:37] <_joe_> I'd also check the db users [12:06:39] which they are doing [12:06:49] <_joe_> once you recover the db dump [12:06:57] PROBLEM - LDAPS on virt1000 is CRITICAL: Connection refused [12:06:58] <_joe_> who knows if they've added some [12:07:28] <_joe_> I'm at lunch [12:09:01] looking at virt1000 [12:13:51] (03PS1) 10Springle: Use MariaDB event scheduler on coredb slaves. 
[operations/software] - 10https://gerrit.wikimedia.org/r/145276 [12:14:30] (03PS3) 10Hashar: zuul: remove /var/lib/git from server [operations/puppet] - 10https://gerrit.wikimedia.org/r/144696 [12:14:32] (03PS9) 10Hashar: zuul: phase out zuulwikimedia [operations/puppet] - 10https://gerrit.wikimedia.org/r/145047 [12:14:34] (03PS8) 10Hashar: zuul: migrate settings to role::zuul::configuration [operations/puppet] - 10https://gerrit.wikimedia.org/r/144709 [12:14:36] (03PS3) 10Hashar: zuul: introduced config hash in role::zuul::configuration [operations/puppet] - 10https://gerrit.wikimedia.org/r/144708 [12:14:38] (03PS2) 10Hashar: zuul: install zuul from role classes [operations/puppet] - 10https://gerrit.wikimedia.org/r/144692 [12:14:40] (03PS3) 10Hashar: zuul: move Icinga checks to zuul::monitoring::server [operations/puppet] - 10https://gerrit.wikimedia.org/r/144693 [12:14:42] (03PS5) 10Hashar: zuul: remove $zuul_url from zuul::server [operations/puppet] - 10https://gerrit.wikimedia.org/r/144997 [12:14:44] (03PS4) 10Hashar: zuul: monitor Zuul merger via nrpe [operations/puppet] - 10https://gerrit.wikimedia.org/r/144694 [12:19:52] RECOVERY - puppet last run on neon is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [12:20:23] (03CR) 10Springle: [C: 04-1] "v0.1. Still testing." 
[operations/software] - 10https://gerrit.wikimedia.org/r/145276 (owner: 10Springle) [12:29:36] !log restarted opendj on virt1000, ran out of fd [12:29:41] Logged the message, Master [12:29:52] RECOVERY - LDAP on virt1000 is OK: TCP OK - 0.000 second response time on port 389 [12:29:52] RECOVERY - LDAPS on virt1000 is OK: TCP OK - 0.006 second response time on port 636 [12:30:10] (03PS1) 10Hashar: zuul: introduce 'zuul' system user [operations/puppet] - 10https://gerrit.wikimedia.org/r/145278 [12:31:44] (03PS2) 10Hashar: zuul: introduce 'zuul' system user [operations/puppet] - 10https://gerrit.wikimedia.org/r/145278 [12:43:21] !log restart opendj on virt1000 with higher ulimit -n [12:43:25] Logged the message, Master [12:44:04] (03PS1) 10Filippo Giunchedi: bump opendj open fd limit [operations/puppet] - 10https://gerrit.wikimedia.org/r/145282 [12:48:38] !log ongoing schema changes: pl_from_namespace gerrit 117373. on terbium, osc_host.sh processes ok to kill in emergency [12:48:42] Logged the message, Master [12:50:48] springle: yay! Quick question re https://gerrit.wikimedia.org/r/#/c/135756/ - Should we get rid of the enums, rather than adding values to them? [12:52:42] Reedy: i have no issue with enum. they allow online schema changes nowadays [12:52:59] aha [12:53:27] have we ever had a policy for or against? [12:53:32] (03PS3) 10Hashar: zuul: introduce 'zuul' system user [operations/puppet] - 10https://gerrit.wikimedia.org/r/145278 [12:53:34] (03PS1) 10Hashar: admin: contint-admins can now sudo as 'zuul' [operations/puppet] - 10https://gerrit.wikimedia.org/r/145289 [12:53:36] (03PS1) 10Hashar: zuul: switch to run as 'zuul' user BREAKING CHANGE [operations/puppet] - 10https://gerrit.wikimedia.org/r/145290 [12:53:43] Not really.. 
[12:54:13] Back in October 2010 I removed the enum from the code review extension as it was being changed on a semi-regular basis [12:54:13] https://github.com/wikimedia/mediawiki-extensions-CodeReview/commit/032861da22998d51d07f6d49ce6735dc75c89754 [12:54:49] the nice bit about enum is it makes devs think carefully about what they put in ;) [12:55:09] yeah fair enough [12:55:26] I think brion wasn't really a fan of them.. Might've been his suggestion to remove [12:55:27] if changes are frequent then it's just annoying [12:56:10] I was half wondering if/what/when we'd be adding more to it due to the nature of files that people will want to upload [12:57:40] hi apergos! [12:57:56] since you are on RT duty, would you mind checking this one out one last time and pushing it through? [12:57:56] https://gerrit.wikimedia.org/r/#/c/142483/6 [12:58:15] its just a spacing lint commit from a new volunteer [13:00:38] sec [13:00:52] PROBLEM - Puppet freshness on search1021 is CRITICAL: Last successful Puppet run was Thu 10 Jul 2014 12:58:44 UTC [13:02:52] PROBLEM - Puppet freshness on search1021 is CRITICAL: Last successful Puppet run was Thu 10 Jul 2014 12:58:44 UTC [13:04:52] PROBLEM - Puppet freshness on search1021 is CRITICAL: Last successful Puppet run was Thu 10 Jul 2014 12:58:44 UTC [13:05:37] springle: new percona-toolkit out today too [13:06:24] (03PS2) 10Ottomata: eventlogging: port is a fact, qualify [operations/puppet] - 10https://gerrit.wikimedia.org/r/145246 (owner: 10Matanya) [13:06:29] (03CR) 10Ottomata: [C: 032 V: 032] eventlogging: port is a fact, qualify [operations/puppet] - 10https://gerrit.wikimedia.org/r/145246 (owner: 10Matanya) [13:06:52] PROBLEM - Puppet freshness on search1021 is CRITICAL: Last successful Puppet run was Thu 10 Jul 2014 12:58:44 UTC [13:08:17] (03PS2) 10Hashar: zuul: switch to run as 'zuul' user BREAKING CHANGE [operations/puppet] - 10https://gerrit.wikimedia.org/r/145290 [13:08:52] PROBLEM - Puppet freshness on search1021 is CRITICAL: 
Last successful Puppet run was Thu 10 Jul 2014 12:58:44 UTC [13:09:48] !log restart pdns on virt1000 [13:09:54] Logged the message, Master [13:10:33] Reedy: heh.. we're behind a bit. 2.2.3 [13:10:52] PROBLEM - Puppet freshness on search1021 is CRITICAL: Last successful Puppet run was Thu 10 Jul 2014 12:58:44 UTC [13:11:07] SHPX LBH puppet [13:11:08] I am tired of it honestly [13:12:52] PROBLEM - Puppet freshness on search1021 is CRITICAL: Last successful Puppet run was Thu 10 Jul 2014 12:58:44 UTC [13:13:22] (03Abandoned) 10Yuvipanda: Tools: Add some i386 compat packages to exec nodes [operations/puppet] - 10https://gerrit.wikimedia.org/r/125241 (owner: 10Yuvipanda) [13:13:37] (03Abandoned) 10Yuvipanda: [WIP] toollabs: Create mongo accounts for all tool users [operations/puppet] - 10https://gerrit.wikimedia.org/r/139685 (owner: 10Yuvipanda) [13:13:50] (03Abandoned) 10Yuvipanda: toollabs: Remove libvips-dev from dev_environment [operations/puppet] - 10https://gerrit.wikimedia.org/r/142819 (owner: 10Yuvipanda) [13:14:04] (03Abandoned) 10Yuvipanda: [WIP]diamond: Disable on all projects except tools, beta & graphite [operations/puppet] - 10https://gerrit.wikimedia.org/r/144615 (owner: 10Yuvipanda) [13:14:52] PROBLEM - Puppet freshness on search1021 is CRITICAL: Last successful Puppet run was Thu 10 Jul 2014 12:58:44 UTC [13:16:52] PROBLEM - Puppet freshness on search1021 is CRITICAL: Last successful Puppet run was Thu 10 Jul 2014 12:58:44 UTC [13:18:42] RECOVERY - Puppet freshness on search1021 is OK: puppet ran at Thu Jul 10 13:18:39 UTC 2014 [13:25:21] <_joe_> Nemo_bis: ping [13:25:31] pong [13:29:28] (03PS1) 10Hashar: zuul: switch installer from setuptools to pip [operations/puppet] - 10https://gerrit.wikimedia.org/r/145300 [13:31:27] (03PS2) 10Hashar: zuul: switch installer from setuptools to pip [operations/puppet] - 10https://gerrit.wikimedia.org/r/145300 [13:33:11] (03CR) 10Hashar: "Notice: 
/Stage[main]/Zuul/Git::Clone[integration/zuul]/Exec[git_clone_integration/zuul]/returns: executed successfully" [operations/puppet] - 10https://gerrit.wikimedia.org/r/145300 (owner: 10Hashar) [13:35:52] _joe_: I got zuul installed on zuul instances properly \O/ :-D [13:36:07] and running as the zuul user instead of jenkins [13:36:32] <_joe_> hashar: :) [13:36:56] <_joe_> well done! [13:37:08] yeah I am quite happy [13:37:13] definitely took longer than expected :/ [13:37:26] you were wise to focus on hhvm! [13:37:49] integration-dev is a Zuul merger (role::zuul::merver) [13:37:49] integration-dev is a Zuul server (scheduler) (role::zuul::server) [13:37:49] That is beautiful [13:38:42] ottomata: I have not forgotten the changeset, was just in a short meeting [13:39:15] k, yeah no hurry at all [13:41:36] PROBLEM - Puppet freshness on db1007 is CRITICAL: Last successful Puppet run was Thu 10 Jul 2014 11:40:31 UTC [13:43:41] (03CR) 10Hashar: [C: 031 V: 032] "This is definitely good to be merged" [operations/puppet] - 10https://gerrit.wikimedia.org/r/144692 (owner: 10Hashar) [13:44:06] (03CR) 10Hashar: [C: 031 V: 032] "This is definitely good to be merged. Just moving things around." [operations/puppet] - 10https://gerrit.wikimedia.org/r/144693 (owner: 10Hashar) [13:45:41] (03CR) 10Hashar: [C: 031 V: 032] "This is good to be merged. Adds a new monitoring in production for the zuul-merger process and I have triple checked the command." [operations/puppet] - 10https://gerrit.wikimedia.org/r/144694 (owner: 10Hashar) [13:46:17] (03CR) 10Hashar: [C: 031 V: 032] "Will need to manually delete /var/lib/git to be a good citizen. Not a big deal though." [operations/puppet] - 10https://gerrit.wikimedia.org/r/144696 (owner: 10Hashar) [13:47:20] anyone willing to merge in the 4 patches above? I have exercised them all week long on different instances :D they are very trivial ones that are guaranteed to work. 
[13:47:22] :D [13:52:26] ottomata: so there's one double quote that should go to single quote, shall I flag that or just merge this? :-D [13:53:08] up to you :) [13:54:06] (03PS7) 10ArielGlenn: Fixed spacing and lint rules for manifests/misc files. [operations/puppet] - 10https://gerrit.wikimedia.org/r/142483 (owner: 10Scottlee) [13:54:24] rebase [13:55:00] (03CR) 10Hashar: [C: 04-1] "The contint::firewall::labs class is applied on labs instances. Maybe it should be made a role and included in roles that depends on it. " [operations/puppet] - 10https://gerrit.wikimedia.org/r/144503 (owner: 10Matanya) [13:58:31] (03CR) 10ArielGlenn: [C: 032] Fixed spacing and lint rules for manifests/misc files. [operations/puppet] - 10https://gerrit.wikimedia.org/r/142483 (owner: 10Scottlee) [14:01:08] (03PS1) 10BBlack: naggen2: only pick up resources older than 1 hour by default [operations/puppet] - 10https://gerrit.wikimedia.org/r/145315 [14:03:44] <_joe_> Nikerabbit: ping [14:04:10] _joe_: pong [14:04:46] and in the end the double quotes wouln't have gone because there's an escape sequence in the string so... 
[14:04:48] doe [14:04:52] !log cycle-restarting swift proxy-server on ms-fe to apply config updates [14:04:56] Logged the message, Master [14:10:29] !log ran swift-dispersion-populate on eqiad and esams swift clusters [14:10:33] Logged the message, Master [14:10:39] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "a small correction, otherwise LGTM" (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/145315 (owner: 10BBlack) [14:12:53] (03CR) 10Gilles: "I'm going to add the minimum distance variable as well, because without it there will be noticeable quality degradation for sizes very clo" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/145132 (https://bugzilla.wikimedia.org/67525) (owner: 10Gergő Tisza) [14:20:10] RECOVERY - Puppet freshness on db1007 is OK: puppet ran at Thu Jul 10 14:20:06 UTC 2014 [14:20:14] (03PS4) 10Gilles: Use reference thumbnails for JPEG/PNG thumbnailing on beta sites [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/145132 (https://bugzilla.wikimedia.org/67525) (owner: 10Gergő Tisza) [14:20:29] (03CR) 10Gilles: [C: 031] Use reference thumbnails for JPEG/PNG thumbnailing on beta sites [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/145132 (https://bugzilla.wikimedia.org/67525) (owner: 10Gergő Tisza) [14:25:05] ottomata: we're toying with an updated deployment schedule: https://www.mediawiki.org/wiki/Search#Wikis [14:28:11] can we have a talk about https://rt.wikimedia.org/Ticket/Display.html?id=7779 at some point today? 
[14:29:30] (03PS2) 10BBlack: naggen2: only pick up resources older than 1 hour by default [operations/puppet] - 10https://gerrit.wikimedia.org/r/145315 [14:31:48] (03PS4) 10Giuseppe Lavagetto: Add init and upstart scripts [operations/debs/hhvm] - 10https://gerrit.wikimedia.org/r/144981 [14:33:42] manybubbles: tomorrow might be better, if that's ok [14:34:01] got lots of stuff and a doctors appointment (i think) and an interview later [14:35:10] ottomata: invited [14:50:14] manybubbles: You want to SWAT today? [14:50:21] anomie: sure! [14:50:24] Thanks [14:51:40] hoo: do you have appropriate submodule updates for SWAT today? [14:53:30] manybubbles: We're not ready... I guess we'll reschedule for tonight or cancel it entirely [14:53:45] hoo: good luck! [14:53:49] anomie: so I'm off the hook [14:53:53] there's some weird bug which we would like to see fix before going [14:55:24] basically the tests pass, bug or no bug :( [14:55:43] would prefer the tests catch the bug [15:00:04] manybubbles, anomie: The time is nigh to deploy SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140710T1500) [15:02:29] (03PS5) 10Giuseppe Lavagetto: Add init and upstart scripts [operations/debs/hhvm] - 10https://gerrit.wikimedia.org/r/144981 [15:07:14] PROBLEM - Disk space on gallium is CRITICAL: DISK CRITICAL - free space: /var/lib/jenkins-slave/tmpfs 16 MB (3% inode=99%): [15:07:24] arf [15:08:14] RECOVERY - Disk space on gallium is OK: DISK OK [15:09:34] happens from time to time [15:10:11] mutante: is there a way for me to receive Icinga emails whenever gallium / lanthanum have some issues? Ie disk space [15:11:24] emails! [15:11:27] uh [15:11:28] sms! 
[15:11:45] both works :D [15:12:34] ah [15:12:36] (03CR) 10Andrew Bogott: [C: 031] bump opendj open fd limit [operations/puppet] - 10https://gerrit.wikimedia.org/r/145282 (owner: 10Filippo Giunchedi) [15:12:38] $nagios_contact_group [15:13:05] hashar: yes, the NRPE checks take contact group as an argument now [15:13:23] adding a group called CI ? [15:14:05] zuul.pp: contact_group => 'contint', [15:14:31] hashar: looks like it's already done and can be copied from the "zuul_gearman" check [15:14:43] (03PS1) 10Hashar: icinga: have gallium/lanthanum notified to contint [operations/puppet] - 10https://gerrit.wikimedia.org/r/145334 [15:14:47] mutante: ^^:D [15:14:50] :) [15:14:57] analytics did that on their node "base_analytics_logging_node" [15:15:06] maybe that will work [15:15:16] !log reinstalling analytics1026 and analytics1027 [15:15:22] Icinga has a 'contint' group already [15:15:23] Logged the message, Master [15:16:28] yea,confirmed it has the group. .. this part though i didnt now [15:16:32] "Note that there is no real node named "base_analytics_logging_node"." 
[15:16:39] compiling against neon https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/140/ [15:16:40] (03PS1) 10Reedy: Add symlinks [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/145336 [15:16:42] (03PS1) 10Reedy: testwiki to 1.24wmf13 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/145337 [15:16:44] (03PS1) 10Reedy: Wikipedias to 1.24wmf12 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/145338 [15:16:46] (03PS1) 10Reedy: group0 to 1.24wmf13 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/145339 [15:17:09] (03CR) 10Reedy: [C: 032] Add symlinks [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/145336 (owner: 10Reedy) [15:17:25] (03Merged) 10jenkins-bot: Add symlinks [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/145336 (owner: 10Reedy) [15:17:43] (03CR) 10Reedy: [C: 032] testwiki to 1.24wmf13 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/145337 (owner: 10Reedy) [15:17:49] mutante: yeah I think we added the contint together when adding the gearman check [15:17:50] (03Merged) 10jenkins-bot: testwiki to 1.24wmf13 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/145337 (owner: 10Reedy) [15:18:07] hashar: yes,it feels like it [15:18:15] !log reedy Started scap: testwiki to 1.24wmf13 and build l10n cache [15:18:17] (03CR) 10Hashar: "Puppet catalog compilation in Jenkins at https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/140/console" [operations/puppet] - 10https://gerrit.wikimedia.org/r/145334 (owner: 10Hashar) [15:18:20] Logged the message, Master [15:18:25] mutante: here is the diff http://puppet-compiler.wmflabs.org/140/change/145334/html/neon.wikimedia.org.html :D [15:18:35] it fails :-( [15:18:49] eh..why [15:18:53] Error: Failed to execute generator /usr/local/bin/naggen2 [15:18:54] hehe [15:19:02] the puppet compilation instance does not have icinga installed [15:19:03] so 
[15:19:05] cant compile! [15:19:08] Error: Failed to execute generator /usr/local/bin/naggen2: ? [15:19:09] sigh [15:19:24] hashar: i'm ok merging that anyways [15:19:29] (03CR) 10Hashar: "Error: Failed to execute generator /usr/local/bin/naggen2:" [operations/puppet] - 10https://gerrit.wikimedia.org/r/145334 (owner: 10Hashar) [15:19:47] hopefully not going to break anything [15:20:04] nah, the worst is that icinga does not restart [15:20:15] so it would not use broken config [15:20:20] and i know the group exists.. [15:20:43] (03CR) 10Dzahn: [C: 032] icinga: have gallium/lanthanum notified to contint [operations/puppet] - 10https://gerrit.wikimedia.org/r/145334 (owner: 10Hashar) [15:27:01] PROBLEM - Puppet freshness on mw1150 is CRITICAL: Last successful Puppet run was Thu 10 Jul 2014 15:24:43 UTC [15:28:35] ^ wth, one random appserver? [15:28:47] hashar: it doesnt break things but it also doesnt work..it seems [15:28:47] RECOVERY - Puppet freshness on mw1150 is OK: puppet ran at Thu Jul 10 15:28:45 UTC 2014 [15:28:51] :-( [15:28:59] I blame analytics [15:29:27] Reedy, _joe_: Graph of APC cache free on the cluster -- [15:30:28] url crashed my client :) [15:31:00] greg-g: Hey [15:31:07] hashar: monitor_ganglia and nrpe::monitor_service have contact_group parameters..afraid we'd have to do it in eeach service then [15:31:07] hahah [15:31:16] bd808: I was about to die of anticipation [15:31:22] Reedy, _joe_: Shortened url -- http://is.gd/n0TT4w [15:31:43] mutante: I grabbed it from manifests/role/logging.pp [15:32:18] Looks like we have a lot of hosts with less than 10% free with global average being about 10% free [15:32:19] mutante: ah it is applied on a fake node bah [15:33:01] hashar: ok,now it's gettting strange,see this: [15:33:08] puppet_services.cfg: contact_groups admins,analytics [15:33:21] bd808: be interesting to see what happens at deploy time today [15:33:28] that is directly from resulting icinga config on neon [15:33:43] but it does not have any 
"admins,contint" [15:33:54] mutante: so that works for analytics but not for gallium/lanthanum right? [15:34:01] yes [15:34:05] Reedy: This logstash dashboard doesn't look so good either -- https://logstash.wikimedia.org/#/dashboard/elasticsearch/APC%20thrash [15:34:26] bah, need to go get my yubikey [15:34:41] mutante: maybe I have to run puppet on those hosts [15:35:19] hashar: no, i think now..it never worked like that for analytics either to put it on the node [15:35:37] (03PS2) 10Ottomata: Add DNS entires for 14 new analytics nodes (analytics1028-analytics1041) [operations/dns] - 10https://gerrit.wikimedia.org/r/145024 [15:35:44] mutante: they must use some other trick [15:36:45] those that i see there in config are "nrpe_check" [15:36:45] (03PS1) 10Hashar: Revert "icinga: have gallium/lanthanum notified to contint" [operations/puppet] - 10https://gerrit.wikimedia.org/r/145344 [15:36:48] mutante: lets revert my change so :/ [15:36:53] hashar: they add it using "monitor_service" [15:36:55] (03PS2) 10Chad: Need oxygen access to get at lsearchd logs [operations/puppet] - 10https://gerrit.wikimedia.org/r/145054 [15:37:06] hashar: that's where those come from in the actual config [15:37:14] <^d> ottomata: Would you mind looking at 145054? ^^^ [15:37:41] hashar: example: modules/mysql_wmf/manifests/coredb/monitoring.pp line 86 [15:38:23] mutante: yeah that one would work. I was more interested in changing the contact group for the base monitors [15:39:02] mutante: I guess base::monitoring::host is realized before $nagios_contact_group [15:40:50] for Roan it was solved by adding him to the "admins" group [15:40:57] i guess he filters out the parsoid boxes :p [15:41:35] mutante: could it be that the monitor::host needs to be run on gallium first so it is later collected on puppetmaster with appropriate nagios group ? [15:41:43] hashar: oh.. 
we can do it via watchmouse [15:42:04] hashar: not sure [15:42:12] i doubt it [15:43:31] hashar: just found https://rt.wikimedia.org/Ticket/Display.html?id=6966 [15:44:22] oh [15:44:25] hashar: heh, and 4606 [15:44:44] I think $nagios_contact_group is not set / realized properly [15:46:16] (03CR) 10Dzahn: [C: 032] "yea,unfortunately this doesnt work like that just setting it on the node level. it seems analytics just tried it out as well. it works on " [operations/puppet] - 10https://gerrit.wikimedia.org/r/145344 (owner: 10Hashar) [15:47:56] one day we will figure it out hehe [15:48:05] not the first time we try :p [15:48:17] i cant find that one ticket i was expecting though [15:48:23] (03CR) 10Cmjohnson: [C: 032] Add DNS entires for 14 new analytics nodes (analytics1028-analytics1041) [operations/dns] - 10https://gerrit.wikimedia.org/r/145024 (owner: 10Ottomata) [15:49:38] (03PS1) 10Cmjohnson: Revert "Add DNS entires for 14 new analytics nodes (analytics1028-analytics1041)" [operations/dns] - 10https://gerrit.wikimedia.org/r/145345 [15:50:24] !log reedy Finished scap: testwiki to 1.24wmf13 and build l10n cache (duration: 32m 09s) [15:50:29] Logged the message, Master [15:52:15] hashar: manifests/nagios.pp lines 41-43 .. that's what makes stuff paging instead of just mail, via the "sms" contact group based on a service being critical, and it has "host_name => $title". ..maybe some hack there? [15:52:30] or..just watchmouse [15:53:33] (03CR) 10Cmjohnson: [C: 032] Revert "Add DNS entires for 14 new analytics nodes (analytics1028-analytics1041)" [operations/dns] - 10https://gerrit.wikimedia.org/r/145345 (owner: 10Cmjohnson) [15:56:07] mutante: yeah forget about it . It is not that much needed. [15:56:14] (03CR) 10coren: [C: 032] "That's even better in the general case anyways." 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/145195 (owner: 10Yuvipanda) [15:56:21] mutante: I receive mails for the service monitoring already :-) [15:56:52] !log gallium running a rather long du command in a screen. Need to have a good figure at how much disk space each jobs consume [15:56:56] Logged the message, Master [15:57:05] hashar: fair,ok [15:57:31] contact_groups => $hostname ? {... it would be ugly [15:57:57] ^d, we can copy those logs to stat1002 [15:58:00] if they aren't already there... [15:58:07] that is usually how people access them [15:58:24] ja they aren't there [15:59:38] mutante: na don't waste your time with it :-] [15:59:48] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] bump opendj open fd limit [operations/puppet] - 10https://gerrit.wikimedia.org/r/145282 (owner: 10Filippo Giunchedi) [15:59:52] mutante: there must be something badly broken that will take a long time to figure out [16:00:04] ^d, e.g. https://github.com/wikimedia/operations-puppet/blob/production/manifests/misc/statistics.pp#L556 [16:00:19] mutante: I am supposing include standard is realized before the nagios_group_contact is set , so that probably never works [16:00:43] mutante: analytics uses a virtual node which might have its statement realized before the child node [16:00:48] <_joe_> no. [16:01:03] <_joe_> argh I should close this computer [16:01:06] (03PS1) 10Cmjohnson: Revert "Revert "Add DNS entires for 14 new analytics nodes (analytics1028-analytics1041)"" [operations/dns] - 10https://gerrit.wikimedia.org/r/145349 [16:01:11] _joe_: yeah definitely [16:01:11] <_joe_> hashar: scoping scoping scoping :) [16:01:20] <_joe_> :P [16:01:24] told you [16:01:33] <^d> ottomata: I don't have stat1002 access. 
I was just doing what manybubbles did :) [16:01:35] we could solve that using stages I guess [16:01:39] (03CR) 10Cmjohnson: [C: 032] Revert "Revert "Add DNS entires for 14 new analytics nodes (analytics1028-analytics1041)"" [operations/dns] - 10https://gerrit.wikimedia.org/r/145349 (owner: 10Cmjohnson) [16:04:21] !log restarted pdns in turn on virt1000 and virt0 after opendj ulimit change [16:04:26] Logged the message, Master [16:05:05] oh, ^d, I must have missed that, or forgotten, i guess that's fine [16:05:27] i have to run to a doctors appointment, wonder if the wonderful apergos RT duty officer could take care of that [16:07:27] Reedy: There? [16:08:25] and I am off. Have a good day! [16:09:02] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: Fetching origin [16:09:16] goodnight hashar [16:12:05] Reedy: aude: Hope your ok with that: https://wikitech.wikimedia.org/wiki/Deployments#Week_of_July_7th (17:00-17:30 UTC today) [16:14:38] * aude ok with that [16:24:42] * apergos looks at the backread [16:26:13] ^d: care to fill me in? [16:26:18] aude: hoo looks fine [16:26:39] ah, nice :) [16:26:52] thanks [16:27:00] <^d> apergos: Just wanting to get at the lsearchd logs on oxygen. Copied what Nik had already done. [16:27:17] <^d> Was asking otto because he's our search liason. [16:27:48] ah [16:28:32] aude: hoo also, looks like the "In other projects sidebar" thing won't be "early july" any more eh? [16:28:43] https://bugzilla.wikimedia.org/show_bug.cgi?id=66226 [16:28:44] sadly [16:28:48] yeah :/ [16:29:03] would be nice to get it in our next branch... we hope so [16:29:06] * greg-g is just doing his assessment of "next month" thing right now [16:29:27] feel free to ping the right people to help with reviews [16:30:10] Reviews aren't the problem... more of the issue right now :D [16:30:17] The change is stuck on a -1 aFAIR [16:30:23] <^d> apergos: is the change in question. 
[16:30:24] ahh [16:30:33] hoo: good luck then :) [16:30:38] someone wants more tests for the code [16:31:00] tests are good :) [16:31:22] yeah [16:31:39] I not so secretly want Jenkins to auto-minus-one all changes that don't have associated tests [16:31:53] h3eh [16:31:55] * heh [16:32:04] some of the code that touches hooks is difficult for writing good tests, but yes we want [16:32:21] shall push for this to get done by branch day [16:32:24] yeah, always weird cases that might need over-riding some dumb algorithm ;) [16:39:19] ^d: is there an rt ticket for this? not being bureaucratic, we just need to have a record someplace, and let people comment if they have a better approach (I think it's fine to add you with the same access that e.g. manybubbles has) [16:40:33] <^d> No, I did not file rt. [16:40:35] <^d> I can do that. [16:41:00] yes please and I will comment on it immediately [16:42:05] (03PS1) 10Yuvipanda: tools: Don't specify postgresql-client version [operations/puppet] - 10https://gerrit.wikimedia.org/r/145360 [16:42:07] (03PS1) 10Yuvipanda: tools: Specify packages that are different in precuse vs trusty [operations/puppet] - 10https://gerrit.wikimedia.org/r/145361 [16:42:10] Coren: ^ [16:43:45] (03CR) 10coren: [C: 032] "Once upon a time, that installed an 8.3 by default. If we are sure that this is not the case on Precise, then all is well." [operations/puppet] - 10https://gerrit.wikimedia.org/r/145360 (owner: 10Yuvipanda) [16:44:51] Coren: apt-cache show tells me: Depends: postgresql-client-9.1 [16:44:54] so I guess it'll install 9.1 [16:45:05] (03PS3) 10Chad: Need oxygen access to get at lsearchd logs [operations/puppet] - 10https://gerrit.wikimedia.org/r/145054 [16:45:20] <^d> apergos: rt #7837. Also added to commit summary. [16:45:49] thank you [16:49:10] Coren: there's one more :D however, I can't find a libgdal replacement :( [16:50:33] YuviPanda: I'm still waiting for Jenkins to wake up. 
[16:50:37] Coren: aaah [16:50:38] Coren: ok [16:50:41] Coren: ty [16:50:50] (CR) jenkins-bot: [V: -1] tools: Specify packages that are different in precuse vs trusty [operations/puppet] - https://gerrit.wikimedia.org/r/145361 (owner: Yuvipanda) [16:50:58] hah [16:50:59] on time [16:52:11] updated [16:52:20] (PS2) Yuvipanda: tools: Specify packages that are different in precuse vs trusty [operations/puppet] - https://gerrit.wikimedia.org/r/145361 [16:54:03] (CR) jenkins-bot: [V: -1] tools: Specify packages that are different in precuse vs trusty [operations/puppet] - https://gerrit.wikimedia.org/r/145361 (owner: Yuvipanda) [16:54:06] what [16:55:05] (PS3) Yuvipanda: tools: Specify packages that are different in precuse vs trusty [operations/puppet] - https://gerrit.wikimedia.org/r/145361 [16:55:08] lol, so not elseif, not elif, not else if, but elsif [16:55:10] nice, puppet [16:55:56] They pupped it up there [16:58:48] there should be an elsest that you can put after else, which executes if the else clause throws an exception :) [16:59:12] I found out recently that you can put an *else* with a for in python [16:59:29] * bblack is gunning for a job at puppet labs [16:59:44] YuviPanda: what does for-else do? [17:00:02] bblack: executes if the for exited 'unnaturally' [17:00:04] hoo: The time is nigh to deploy Wikidata (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140710T1700) [17:00:10] heh [17:00:37] unnaturally means a break statement I assume [17:00:45] bblack: wait, I got it the other way around [17:00:54] bblack: else executes if it was *not* exited unnaturally [17:00:56] bblack: indeed. 
[17:01:01] exceptions don't count, IIRC [17:01:03] just break [17:01:42] heh that's awesome [17:02:01] in a sick sort of way [17:06:02] bblack: heh, yeah [17:06:06] bblack: while also has the same thing [17:08:25] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0] [17:08:49] ^ that's me, varnish being a bitch on restart about bad mmap addrs. it's very temporary [17:11:28] csteipp: ping ? [17:15:25] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 1 below the confidence bounds [17:20:09] PROBLEM - Puppet freshness on db1007 is CRITICAL: Last successful Puppet run was Thu 10 Jul 2014 15:19:35 UTC [17:21:57] !log hoo Synchronized php-1.24wmf12/extensions/Wikidata/: Fix a UI issue and two API related flaws (duration: 00m 14s) [17:21:58] Logged the message, Master [17:22:59] RECOVERY - Unmerged changes on repository puppet on strontium is OK: Fetching origin [17:25:37] (03CR) 10coren: [C: 032] "I dislike having this conditional stuff, but there doesn't seem to be any reasonable way around it." [operations/puppet] - 10https://gerrit.wikimedia.org/r/145361 (owner: 10Yuvipanda) [17:25:56] !log hoo Synchronized php-1.24wmf13/extensions/Wikidata/: Fix a UI issue and two API related flaws (same version as for wmf12) (duration: 00m 09s) [17:26:00] Logged the message, Master [17:26:18] all done [17:30:20] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [17:33:03] _joe_: still around ? [17:33:51] Unlikely [17:33:59] Might not be around till monday [17:40:26] RECOVERY - Puppet freshness on db1007 is OK: puppet ran at Thu Jul 10 17:40:24 UTC 2014 [17:57:06] (03PS1) 10BBlack: varnish (3.0.6plus~wm1) unstable; urgency=low [operations/debs/varnish] (3.0.6-plus-wm) - 10https://gerrit.wikimedia.org/r/145385 [17:58:34] matanya: Sorry, was on a call. What's up? 
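Note on the for-else exchange above (16:59 to 17:06): a minimal Python sketch of the behaviour as it was eventually described in the channel, namely that the else clause runs only when the loop finishes without hitting break, and that while loops follow the same rule. The function names and sample values are made up for illustration and do not come from the log.

```python
def find_first_negative(values):
    """Return the first negative number, or None if the loop never breaks."""
    for value in values:
        if value < 0:
            found = value
            break  # leaving via break skips the else clause below
    else:
        # Reached only when the loop ran to completion without a break,
        # including the case where `values` is empty.
        found = None
    return found


def count_down(counter, stop_at):
    """while/else follows the same rule as for/else."""
    while counter > 0:
        counter -= 1
        if counter == stop_at:
            result = "stopped early"
            break
    else:
        # The loop condition became false without any break.
        result = "ran to completion"
    return result
```

An exception raised inside the loop body also bypasses the else clause, but only because it propagates out of the whole statement; break is the only ordinary way to skip it.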
[17:59:09] hi csteipp are global blocks atomic action on the db side ? [17:59:56] Hmm... let me check [18:00:02] I got: A database query error has occurred. This may indicate a bug in the software. [18:00:02] Function: GlobalBlocking::insertBlock [18:00:02] Error: 1062 Duplicate entry '209.126.72.83-0' for key 'gb_address' (10.64.16.22) [18:00:04] Reedy, greg-g: The time is nigh to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140710T1800) [18:00:12] this is why i'm asking ^^ [18:00:38] array( 'IGNORE' ) [18:01:21] legoktm: want to shed some light for csteipp ? [18:01:23] (03CR) 10Reedy: [C: 032] Wikipedias to 1.24wmf12 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/145338 (owner: 10Reedy) [18:01:51] (03Merged) 10jenkins-bot: Wikipedias to 1.24wmf12 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/145338 (owner: 10Reedy) [18:02:50] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Wikipedias to 1.24wmf12 [18:02:55] Logged the message, Master [18:03:04] matanya: legoktm do we need to halt that ^^ [18:03:23] i don't think so [18:03:49] k [18:04:11] matanya: Is that fairly rare, or is that popping up a lot? [18:04:28] It does a check, then does the insert, so definitely a race condition. [18:04:39] (03PS2) 10BBlack: varnish (3.0.6plus~x-wm1) unstable; urgency=low [operations/debs/varnish] (3.0.6-plus-wm) - 10https://gerrit.wikimedia.org/r/145385 [18:04:39] out of 3 gblocks, happened twice [18:04:44] We can pretty easily just have it upsert [18:05:16] greg-g: no, I'm pretty sure the bug as been around since ever [18:05:17] Hmm... that makes me wonder if the select is working right. 
[18:05:25] legoktm: kk [18:06:37] legoktm, csteipp : https://bugzilla.wikimedia.org/show_bug.cgi?id=67815 [18:07:07] I'm thinking we should do an array( 'IGNORE' ), and then check if any rows were updated, if not, display a edit conflict-type error message [18:07:49] legoktm: we used to get that message [18:08:03] it disappeared recently [18:08:34] well, the check we do before hand uses the slave, not master [18:08:55] (03CR) 10BBlack: [C: 032 V: 032] varnish (3.0.6plus~x-wm1) unstable; urgency=low [operations/debs/varnish] (3.0.6-plus-wm) - 10https://gerrit.wikimedia.org/r/145385 (owner: 10BBlack) [18:09:06] (03CR) 10Reedy: [C: 032] group0 to 1.24wmf13 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/145339 (owner: 10Reedy) [18:09:14] (03Merged) 10jenkins-bot: group0 to 1.24wmf13 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/145339 (owner: 10Reedy) [18:09:22] legoktm: BUGBUGBUG [18:10:01] (03PS2) 10Reedy: Set $wgUserMergeEnableDelete = false; [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/145224 (https://bugzilla.wikimedia.org/67789) (owner: 10Legoktm) [18:10:08] (03CR) 10Reedy: [C: 032] Set $wgUserMergeEnableDelete = false; [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/145224 (https://bugzilla.wikimedia.org/67789) (owner: 10Legoktm) [18:10:09] ok, would love a fix for that, annoys the hell out of me when i need to gblock spammer/vandals, and i know I'll need to clean up more later [18:10:09] legoktm: I think we could get rid of the check, the delete, and the insert by just doing one upsert, right? Unless we intentionally want it to fail on conflicts that weren't supposed to be updates. 
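Note on the GlobalBlocking discussion above: the idea floated in the channel is to insert with an IGNORE option and then check whether a row was actually written, instead of doing a separate check followed by an insert. The sketch below illustrates that pattern only. It is not the GlobalBlocking code: it uses Python's sqlite3 module, with SQLite's `INSERT OR IGNORE` standing in for MySQL's `INSERT IGNORE`, and the table name, column names and `insert_block` helper are hypothetical.

```python
import sqlite3


def make_db():
    """Create an in-memory table with a unique key, loosely modelled on gb_address."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE blocks ("
        " address TEXT NOT NULL,"
        " anon_only INTEGER NOT NULL,"
        " reason TEXT,"
        " UNIQUE (address, anon_only))"
    )
    return conn


def insert_block(conn, address, anon_only, reason):
    """Insert a block atomically; return False if the same key already exists.

    The uniqueness check and the write happen in one statement, so two
    concurrent callers cannot both pass a separate "does it exist?" check.
    """
    cur = conn.execute(
        "INSERT OR IGNORE INTO blocks (address, anon_only, reason) VALUES (?, ?, ?)",
        (address, anon_only, reason),
    )
    conn.commit()
    # rowcount is 1 when a row was written and 0 when the insert was ignored.
    return cur.rowcount == 1
```

A caller would show a conflict-style error when `insert_block` returns False, which keeps the "fail on conflict" behaviour that was asked for while avoiding the raw duplicate-key database error.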
[18:10:14] (03Merged) 10jenkins-bot: Set $wgUserMergeEnableDelete = false; [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/145224 (https://bugzilla.wikimedia.org/67789) (owner: 10Legoktm) [18:10:34] csteipp: yeah, it should fail for conflicts IMO [18:10:47] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: group0 to 1.24wmf13 [18:10:50] yes, it was also a request from me [18:10:52] Logged the message, Master [18:10:59] please keep that [18:14:41] (03PS1) 10Tim Landscheidt: Tools: Use toollabs::hba in toollabs::webnode [operations/puppet] - 10https://gerrit.wikimedia.org/r/145388 [18:15:50] (03PS3) 10Reedy: Adding new domains to wgCopyUploadsDomains. [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/145273 (https://bugzilla.wikimedia.org/67344) (owner: 10Steinsplitter) [18:16:00] (03CR) 10Reedy: [C: 032] Adding new domains to wgCopyUploadsDomains. [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/145273 (https://bugzilla.wikimedia.org/67344) (owner: 10Steinsplitter) [18:16:17] (03PS2) 10Reedy: Amend last commonsuploads additions [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/144590 (owner: 10Nemo bis) [18:16:48] (03Merged) 10jenkins-bot: Adding new domains to wgCopyUploadsDomains. 
[operations/mediawiki-config] - https://gerrit.wikimedia.org/r/145273 (https://bugzilla.wikimedia.org/67344) (owner: Steinsplitter) [18:17:06] (PS3) Reedy: Amend last commonsuploads additions [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/144590 (owner: Nemo bis) [18:17:11] (CR) Reedy: [C: 2] Amend last commonsuploads additions [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/144590 (owner: Nemo bis) [18:18:12] (Merged) jenkins-bot: Amend last commonsuploads additions [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/144590 (owner: Nemo bis) [18:18:20] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: Fetching readonly [18:19:17] (PS5) Reedy: Use reference thumbnails for JPEG/PNG thumbnailing on beta sites [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/145132 (https://bugzilla.wikimedia.org/67525) (owner: Gergő Tisza) [18:19:22] (CR) Reedy: [C: 2] Use reference thumbnails for JPEG/PNG thumbnailing on beta sites [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/145132 (https://bugzilla.wikimedia.org/67525) (owner: Gergő Tisza) [18:19:55] (Merged) jenkins-bot: Use reference thumbnails for JPEG/PNG thumbnailing on beta sites [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/145132 (https://bugzilla.wikimedia.org/67525) (owner: Gergő Tisza) [18:20:37] (PS2) Reedy: Add Foreign Word of the Day featured feed for Wiktionaries [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/144425 (https://bugzilla.wikimedia.org/67563) (owner: TTO) [18:20:41] (CR) Reedy: [C: 2] Add Foreign Word of the Day featured feed for Wiktionaries [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/144425 (https://bugzilla.wikimedia.org/67563) (owner: TTO) [18:21:09] (Merged) jenkins-bot: Add Foreign Word of the Day featured feed for Wiktionaries [operations/mediawiki-config] - 
https://gerrit.wikimedia.org/r/144425 (https://bugzilla.wikimedia.org/67563) (owner: TTO) [18:21:38] (PS2) Reedy: Set up autopatrolled group for eswikisource [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/144419 (https://bugzilla.wikimedia.org/67557) (owner: TTO) [18:21:49] (CR) Reedy: [C: 2] Set up autopatrolled group for eswikisource [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/144419 (https://bugzilla.wikimedia.org/67557) (owner: TTO) [18:21:57] (Merged) jenkins-bot: Set up autopatrolled group for eswikisource [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/144419 (https://bugzilla.wikimedia.org/67557) (owner: TTO) [18:22:55] (PS2) Reedy: Add a complete list of local interwikis [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/144264 (https://bugzilla.wikimedia.org/954) (owner: TTO) [18:23:01] (CR) Reedy: [C: 2] Add a complete list of local interwikis [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/144264 (https://bugzilla.wikimedia.org/954) (owner: TTO) [18:26:36] (Merged) jenkins-bot: Add a complete list of local interwikis [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/144264 (https://bugzilla.wikimedia.org/954) (owner: TTO) [18:34:21] (PS1) Reedy: Bump apc.shm_size to 360M [operations/puppet] - https://gerrit.wikimedia.org/r/145397 [18:35:03] (CR) Chad: [C: -1] Bump apc.shm_size to 360M (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/145397 (owner: Reedy) [18:35:10] (CR) Reedy: [C: -1] "Needs rebasing :(" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/129464 (owner: Ricordisamoa) [18:35:20] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: Fetching readonly [18:36:10] (PS2) Reedy: Bump apc.shm_size to 360M [operations/puppet] - https://gerrit.wikimedia.org/r/145397 [18:36:43] !log reedy Synchronized database lists: (no 
message) (duration: 00m 14s) [18:36:48] Logged the message, Master [18:37:12] !log reedy Synchronized wmf-config/: (no message) (duration: 00m 13s) [18:37:16] Logged the message, Master [18:37:40] springle: is there any chance of https://ishmael.wikimedia.org/ being available again (assuming the load issue can be fixed)? [18:38:39] PROBLEM - Puppet freshness on db1009 is CRITICAL: Last successful Puppet run was Thu 10 Jul 2014 16:37:48 UTC [18:46:24] robla: yt [18:46:25] ? [18:48:43] RECOVERY - HTTP error ratio anomaly detection on tungsten is OK: OK: No anomaly detected [18:50:46] StevenW: what's up? [18:51:08] Just wanted to confirm: R35 is okay for the MediaWiki core team meeting Monday? [18:51:24] Growth team is in town all together for the first time, so we're booking R37 for Mon - Thurs. [18:51:55] If it's cool, Sarah (Rodlund) has it all set up. [18:52:32] StevenW: yeah, should be fine [18:52:37] Awesome, thanks! [18:58:03] RECOVERY - Puppet freshness on db1009 is OK: puppet ran at Thu Jul 10 18:58:01 UTC 2014 [18:58:14] hm, cmjohnson1, i'm attempting to pxe install the new nodes [18:58:16] getting this message: [18:58:22] You have ordered a Dell System with no OS installed. If you have ordered [18:58:22] direct attach 3TB or larger drives, please be aware that not all OSs have [18:58:22] support for these larger drives. Please consult the following blog for [18:58:22] support levels for various OS's and choose your OS to install accordingly. 
[18:58:23] http://en.community.dell.com/dell-blogs/enterprise/b/tech-center/archive/2010/ [18:58:23] 12/16/breaking-through-the-2tb-partition-limitation-3tb-hard-drives-and-beyond [18:58:24] .aspx [18:58:41] haha [18:58:48] i don't see any dhcp reqs in carbon logs [18:58:58] (03PS1) 10Reedy: git ignore 'private' directory [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/145403 [18:59:00] (03PS1) 10Reedy: Remove AdminSettings and PrivateSettings from .gitignore [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/145404 [18:59:15] (03CR) 10Reedy: [C: 04-1] Remove AdminSettings and PrivateSettings from .gitignore [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/145404 (owner: 10Reedy) [18:59:37] (03CR) 10Reedy: [C: 032] git ignore 'private' directory [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/145403 (owner: 10Reedy) [18:59:44] (03Merged) 10jenkins-bot: git ignore 'private' directory [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/145403 (owner: 10Reedy) [19:00:18] ottomata: that netboot.cfg will not work for 3TB disks [19:00:29] but it should be installing on the 2.5 disks [19:00:37] which are only 250G or so [19:00:45] exactly...let's look at the partman cfg [19:00:45] hmm [19:00:59] k, raid1-30G [19:01:14] guess we need to know what device the 2.5s are [19:01:15] hmm [19:02:32] attempting to boot 1028 into bios.. [19:02:38] ottomata: i think we have to change the h/w to use those 2 disks first. [19:02:45] hm [19:03:01] it's been awhile...i don't recall [19:04:15] ottomata: iirc dell ships the servers with the 2 rear disks as disk 13 and 14 [19:04:39] 13 and 14, so probably later in the alphabet of sd devices ? [19:05:11] yes...but I think we change their location in bios settings....booting an1030 now to see [19:05:15] k [19:08:44] ottomata: were you able to console into them using console com2? [19:09:30] yes [19:10:02] cmjohnson1: i'm looking at bios..or maybe dell system setup? 
on 1028 [19:10:13] see lots of disk properties, but no options to change device labels [19:10:28] try boot order [19:10:36] i think i might be in the wrong menu [19:10:38] i thin i'm in system setup [19:10:45] trying 1029 too [19:14:07] (03PS2) 10Reedy: Remove AdminSettings and PrivateSettings from .gitignore [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/145404 [19:14:18] (03CR) 10jenkins-bot: [V: 04-1] Remove AdminSettings and PrivateSettings from .gitignore [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/145404 (owner: 10Reedy) [19:14:26] jaa, hmm, not seeing anything useful, cmjohnson1 [19:14:33] (03PS1) 10Reedy: Remove old AdminSettings.php (symlink) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/145408 [19:14:35] i just got into an1041 [19:14:41] (03CR) 10jenkins-bot: [V: 04-1] Remove old AdminSettings.php (symlink) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/145408 (owner: 10Reedy) [19:14:58] looking [19:16:33] !log Killed jenkins :-( [19:16:39] Logged the message, Master [19:17:30] ottomata: do you want this setup as jbod? [19:17:39] other than the 2.5s, yes [19:17:43] the 2.5s should have OS in raid 1 [19:17:49] aside from that i don't care :) [19:18:17] okay..so we have to go to raid bios and and raid1 the ssds and make them 1st and create individual VG for the 12 other disks [19:19:27] more work to do!...i forgot all about it [19:23:13] YuviPanda: can you show me how to set up an instance running mediawiki using vagrant instead of puppet? [19:23:35] cscott: moment, let me find a link. [19:23:42] i'm tired of fighting with role::mediawiki-install::labs [19:23:44] cscott: https://wikitech.wikimedia.org/wiki/Labs-vagrant [19:24:05] cscott: btw, create a trusty image, since vagrant is now on trusty [19:24:40] * bd808 needs to test that [19:25:37] and i can submit patches to https://gerrit.wikimedia.org/r/mediawiki/vagrant to create roles for parsoid/ve/winter/togetherjs ? 
[19:25:50] cscott: Yes please [19:26:04] We have VE with parsoid already [19:26:19] and can i have my instance automagically slurp the latest code from gerrit? [19:27:11] It will when you install the roles initially. I'm not sure if the git update script works on a labs instance [19:27:18] But we can fix it if it doesn't [19:27:53] cscott: hmm, it doesn't do that yet, but should be easy to add. [19:27:56] (slurp from gerrit) [19:27:58] The updater script should end up installed as /usr/local/bin/run-git-update [19:28:03] oh [19:28:06] I didn't know that [19:28:09] bd808: nice! [19:28:14] where in https://gerrit.wikimedia.org/r/mediawiki/vagrant are the roles hidden? [19:28:34] cscott: puppet/manafests/roles/ [19:28:44] manifests [19:29:12] i am suspicious of the short length of https://gerrit.wikimedia.org/r/mediawiki/vagrant [19:29:18] i mean, of https://github.com/wikimedia/mediawiki-vagrant/blob/master/puppet/manifests/roles/parsoid.pp [19:30:03] Ah. That loads https://github.com/wikimedia/mediawiki-vagrant/blob/master/puppet/modules/mediawiki/manifests/parsoid.pp [19:31:40] ok, that looks reasonable. i guess all that's left is for me to download everything and get my hands dirty. [19:36:28] cscott: good luck :) do poke me if you have vagrant issues [19:36:29] err [19:36:31] labs-vagrant issues [19:37:09] cscott: btw, there was an issue that sometimes crops up that won't let you run labs-vagrant commands unless your cwd is /vagrant, so if that happens do try that [19:43:00] !log Jenkins upgrading Gearman plugin from 0.0.6 to 0.0.7 . 
That fix the way jobs labels are registered with Gearman [19:43:05] Logged the message, Master [19:45:41] !log deployed patch for bug65778 [19:45:44] Logged the message, Master [19:45:49] (03PS4) 10Ottomata: Production now uses CDH (CDH5) module, also refactor roles [operations/puppet] - 10https://gerrit.wikimedia.org/r/144242 [19:46:10] !log reedy Synchronized private: (no message) (duration: 00m 14s) [19:46:15] Logged the message, Master [19:46:40] (03CR) 10Ottomata: [C: 032 V: 032] Add settings to throttle scheduled runs [operations/puppet/wikimetrics] - 10https://gerrit.wikimedia.org/r/144154 (owner: 10Milimetric) [19:47:22] (03CR) 10Gage: [C: 032] Production now uses CDH (CDH5) module, also refactor roles [operations/puppet] - 10https://gerrit.wikimedia.org/r/144242 (owner: 10Ottomata) [19:50:42] (03CR) 10Ottomata: [C: 032 V: 032] Add CORS support to public files [operations/puppet/wikimetrics] - 10https://gerrit.wikimedia.org/r/144761 (owner: 10Milimetric) [19:52:44] (03CR) 10jenkins-bot: [V: 04-1] Production now uses CDH (CDH5) module, also refactor roles [operations/puppet] - 10https://gerrit.wikimedia.org/r/144242 (owner: 10Ottomata) [19:54:28] (03PS5) 10Ottomata: Production now uses CDH (CDH5) module, also refactor roles [operations/puppet] - 10https://gerrit.wikimedia.org/r/144242 [19:56:10] (03CR) 10Hashar: "recheck" [operations/puppet] - 10https://gerrit.wikimedia.org/r/143940 (owner: 10Ori.livneh) [19:56:20] (03CR) 10Gage: [C: 032] Production now uses CDH (CDH5) module, also refactor roles [operations/puppet] - 10https://gerrit.wikimedia.org/r/144242 (owner: 10Ottomata) [19:56:56] (03CR) 10Reedy: [C: 032 V: 032] Remove AdminSettings and PrivateSettings from .gitignore [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/145404 (owner: 10Reedy) [19:57:14] (03PS1) 10Milimetric: Update to latest [operations/puppet] - 10https://gerrit.wikimedia.org/r/145419 [20:00:13] !log reedy Synchronized wmf-config/: (no message) (duration: 00m 14s) 
[20:00:19] Logged the message, Master [20:05:30] (03CR) 10Reedy: "I've merged AdminSettings.php into PrivateSettings.php and switched for symlinks on the cluster.." [operations/puppet] - 10https://gerrit.wikimedia.org/r/145017 (owner: 10Reedy) [20:13:56] (03PS2) 10Reedy: Swap from AdminSettings to PrivateSettings for snapshots/dumps [operations/puppet] - 10https://gerrit.wikimedia.org/r/145017 [20:44:15] yurikR: ping [20:44:23] bblack, sup [20:44:41] hey do you still have anything set up that can test ESI brokenness in varnish? [20:44:51] bblack, not really [20:44:59] ok nevermind :) [20:45:33] bblack, the X-CS=ON code poached off the ESI stuff [20:45:41] so got rid of all the old stuff [20:47:02] ok [20:48:26] (03PS2) 10BBlack: varnish: remove default Content-Type coercion [operations/puppet] - 10https://gerrit.wikimedia.org/r/143940 (owner: 10Ori.livneh) [20:58:06] (03PS1) 10Chad: Disable incoming link counts on commons [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/145430 [20:58:25] (03CR) 10jenkins-bot: [V: 04-1] Disable incoming link counts on commons [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/145430 (owner: 10Chad) [20:59:06] <^d> Reedy: You did privatesettings stuff, right? [20:59:09] <^d> Broke tests ^ [21:04:16] bblack, maybe at some point we will resurect it - at this point we have a GIF image workaround, which has a lot of problems (like bad fonts for uncommon scripts, image sizing, etc). [21:08:28] ^d: bleugh [21:08:50] Looks trivially enough fixable... 
[21:13:48] (03PS1) 10Reedy: Fixup highlightTest.php [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/145438 [21:13:57] (03CR) 10jenkins-bot: [V: 04-1] Fixup highlightTest.php [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/145438 (owner: 10Reedy) [21:14:34] lol, even worse [21:16:01] (03PS2) 10Reedy: Fixup highlightTest.php [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/145438 [21:16:10] (03CR) 10jenkins-bot: [V: 04-1] Fixup highlightTest.php [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/145438 (owner: 10Reedy) [21:20:22] (03PS3) 10Reedy: Fixup highlightTest.php [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/145438 [21:20:50] PROBLEM - puppet last run on neon is CRITICAL: CRITICAL: Puppet has 1 failures [21:20:58] (03PS1) 10Ottomata: Using analytics-flex.cfg for analytics dell 720s that have 2 x 2.5 drives in flex bays [operations/puppet] - 10https://gerrit.wikimedia.org/r/145439 [21:22:04] (03CR) 10Reedy: [C: 032] Fixup highlightTest.php [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/145438 (owner: 10Reedy) [21:22:11] (03Merged) 10jenkins-bot: Fixup highlightTest.php [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/145438 (owner: 10Reedy) [21:22:54] (03PS2) 10Reedy: Disable incoming link counts on commons [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/145430 (owner: 10Chad) [21:23:32] <^d> ty Reedy [21:26:07] (03CR) 10Manybubbles: [C: 031] Disable incoming link counts on commons [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/145430 (owner: 10Chad) [21:26:54] Reedy: well done :) [21:27:04] hasharAccounting: For breaking it? [21:27:04] :D [21:27:12] and fixing it :D [21:27:22] you shall not self merge! 
*grin* [21:33:26] (03PS1) 10Tim Landscheidt: Tools: Only update ssh configuration when necessary [operations/puppet] - 10https://gerrit.wikimedia.org/r/145441 [21:39:47] (03PS2) 10Ottomata: Update to latest [operations/puppet] - 10https://gerrit.wikimedia.org/r/145419 (owner: 10Milimetric) [21:39:50] RECOVERY - puppet last run on neon is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [21:40:24] (03CR) 10Ottomata: [C: 032 V: 032] Update to latest [operations/puppet] - 10https://gerrit.wikimedia.org/r/145419 (owner: 10Milimetric) [21:40:49] thanks! [21:41:49] (03PS2) 10Ottomata: Using analytics-flex.cfg for analytics dell 720s that have 2 x 2.5 drives in flex bays [operations/puppet] - 10https://gerrit.wikimedia.org/r/145439 [21:43:24] (03PS2) 10Tim Landscheidt: Tools: Use apt::repository instead of file resources [operations/puppet] - 10https://gerrit.wikimedia.org/r/123882 [21:51:32] (03CR) 10Dzahn: [C: 031] naggen2: only pick up resources older than 1 hour by default [operations/puppet] - 10https://gerrit.wikimedia.org/r/145315 (owner: 10BBlack) [21:53:35] (03CR) 10Dzahn: [C: 032] "yep, that works on gallium" [operations/puppet] - 10https://gerrit.wikimedia.org/r/144694 (owner: 10Hashar) [21:54:14] mutante: that patch is part of a long chain of changes [21:54:57] (03CR) 10Dzahn: [C: 032] "ack" [operations/puppet] - 10https://gerrit.wikimedia.org/r/144693 (owner: 10Hashar) [21:56:47] (03CR) 10Dzahn: [C: 032] zuul: install zuul from role classes [operations/puppet] - 10https://gerrit.wikimedia.org/r/144692 (owner: 10Hashar) [21:58:43] hashar: gallium puppet run finished, applied all 3 [21:58:52] mutante: noop? 
:)D [21:59:04] not technically, but in a good way [21:59:06] /Nrpe::Check[check_zuul_merger]/File[/etc/nagios/nrpe.d/check_zuul_merger.cfg]/ensure: created [21:59:15] yeah that one was missing [21:59:16] and that's all [21:59:24] Zuul got split in two [21:59:30] completely forgot about that monitor :-/ [22:00:01] it already restarted nagios-nrpe-server on gallium and the config is there [22:00:05] just wait for neon [22:00:14] hopefully it will work :D [22:00:25] I triple checked the command being used [22:00:46] checked it too, wfm [22:00:51] great! [22:01:00] you can land https://gerrit.wikimedia.org/r/#/c/144696/ as well it just stop defining /var/lib/git [22:01:08] dir is empty already on gallium:) [22:01:41] you said you want to delete the entire dir? [22:01:52] it already has been deleted :) [22:02:00] that was to receive git replications from Gerrit [22:02:10] it's there, just empty [22:02:10] got moved age ago to the ssd disk [22:02:23] want me to rm it? [22:02:38] oh it does no harm :D [22:02:41] but yeah can be deleted [22:02:51] and need to be removed from puppet or it will recreate it [22:02:52] (03CR) 10MaxSem: [C: 04-2] "I'm still not convinced that TextExtracts should follow one user's preferences over all the other." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/126226 (https://bugzilla.wikimedia.org/63164) (owner: 10Prtksxna) [22:02:58] that is just some puppet cleanup [22:03:19] (03CR) 10Dzahn: [C: 032] "already empty on gallium, deleting the dir itself too" [operations/puppet] - 10https://gerrit.wikimedia.org/r/144696 (owner: 10Hashar) [22:04:07] hashar: rmdir'ed [22:04:25] (03CR) 10Hashar: "I am not sure whether it is a good idea. I find the role::cache hashes to be a nice way to handle different settings among realms. 
Follow " [operations/puppet] - 10https://gerrit.wikimedia.org/r/144708 (owner: 10Hashar) [22:04:41] mutante: awesome :) [22:05:04] [22:05:04] zuul_merger_service_running [22:05:04] OK [22:05:13] mutante: thank you! [22:06:08] sweet. i see in icinga, yep [22:06:38] yw [22:07:00] the rest I will figure out after my 1 week vacations :D [22:07:07] ah:) good timing then [22:07:10] enjoy that! [22:07:11] though it is probably sane [22:07:19] so you did not need paging, hehe [22:07:35] oh I am sure you guys will figure out how to restart the services :D [22:07:45] just /etc/init.d/ [22:07:48] :) yep [22:07:59] but Zuul is very stable, unless I play with it [22:08:10] Jenkins is another horse but I am sure everyone already had to restart it once or two [22:09:55] * hashar sleeps [22:31:29] (03PS1) 10Ori.livneh: doc.wikimedia.org: specify additional media-types & set a default [operations/puppet] - 10https://gerrit.wikimedia.org/r/145454 [22:32:34] bblack: ^^ this is so we can remove the default type from VCL [22:33:45] PROBLEM - Puppet freshness on db1006 is CRITICAL: Last successful Puppet run was Thu 10 Jul 2014 20:33:21 UTC [22:38:32] (03CR) 10Dzahn: "these just edit comments anyways,right" (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/145250 (owner: 10Matanya) [22:38:58] (03CR) 10Cmjohnson: [C: 032] Using analytics-flex.cfg for analytics dell 720s that have 2 x 2.5 drives in flex bays [operations/puppet] - 10https://gerrit.wikimedia.org/r/145439 (owner: 10Ottomata) [22:39:04] (03PS2) 10BBlack: doc.wikimedia.org: specify additional media-types & set a default [operations/puppet] - 10https://gerrit.wikimedia.org/r/145454 (owner: 10Ori.livneh) [22:39:14] (03CR) 10BBlack: [C: 032 V: 032] doc.wikimedia.org: specify additional media-types & set a default [operations/puppet] - 10https://gerrit.wikimedia.org/r/145454 (owner: 10Ori.livneh) [22:39:56] cmjohnson1: go ahead if you're in it [22:40:10] (03CR) 10Ori.livneh: "> it sounds like doc.wikimedia.org 
should get fixed before we make this change" [operations/puppet] - https://gerrit.wikimedia.org/r/143940 (owner: Ori.livneh) [22:40:18] bblack done [22:40:24] (CR) Dzahn: [C: 1] "http://puppet-compiler.wmflabs.org/137/change/145250/html/" [operations/puppet] - https://gerrit.wikimedia.org/r/145250 (owner: Matanya) [22:41:20] ori: I think we're good to merge the other as soon as puppet pushes that to doc/integration on gallium right? [22:41:29] (CR) Ori.livneh: [C: 1] Add init and upstart scripts [operations/debs/hhvm] - https://gerrit.wikimedia.org/r/144981 (owner: Giuseppe Lavagetto) [22:41:41] bblack: i think so, yeah. [22:45:40] (PS3) BBlack: varnish: remove default Content-Type coercion [operations/puppet] - https://gerrit.wikimedia.org/r/143940 (owner: Ori.livneh) [22:46:11] (CR) BBlack: [C: 2 V: 2] varnish: remove default Content-Type coercion [operations/puppet] - https://gerrit.wikimedia.org/r/143940 (owner: Ori.livneh) [22:58:17] Can I squeeze in a small cherry-pick? [22:58:21] Doing the module bump now. [22:58:39] Don't think anyone is doing anything... [22:59:12] also, swat is in 2 mins [22:59:28] I sort of presumed that's what he meant [23:00:04] RoanKattouw, mwalker, ori, MaxSem: The time is nigh to deploy SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140710T2300) [23:00:41] MaxSem, are you taking the swat then? [23:00:47] otherwise I can; I haven't done it this week [23:00:55] go ahead [23:01:38] Yeah, that's what I meant. [23:01:55] superm401, can you add your thing to the deployment calendar? [23:03:36] mwalker: Thanks! [23:03:38] mwalker, yeah, working on it. [23:07:15] mwalker, done, sorry for the delay. [23:07:35] mwalker, that's the submodule bump. Actual change is https://gerrit.wikimedia.org/r/#/c/145457/ [23:18:22] greg-g: " [23:18:22] We need to improve puppet monitoring so that puppet breakages produce some sort of notification. 
[23:18:29] greg-g: do you mean real SMS? [23:18:49] icinga alerts, at least :) [23:18:50] or an email! [23:18:55] that already exists [23:19:01] resolved :) [23:19:10] heh, not on betalabs :P [23:19:14] assuming that's where that came from [23:19:20] no, it came from: [23:19:27] https://wikitech.wikimedia.org/wiki/Incident_documentation/20140618-Wikitech [23:19:49] and i claim but here it is https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=virt1000&service=Puppet+freshness [23:20:46] YuviPanda: wanna copy the new production check? [23:20:53] it's better now [23:21:00] though maybe not perfect [23:21:17] mutante: but, but.. manifests/misc/icinga.pp! [23:21:20] then you'd want an icinga-wm [23:21:27] (PS3) BBlack: naggen2: only pick up resources older than 1 hour by default [operations/puppet] - https://gerrit.wikimedia.org/r/145315 [23:21:29] (PS1) BBlack: Add explicit homedirs for file_mover + pybal-check [operations/puppet] - https://gerrit.wikimedia.org/r/145463 [23:21:30] mutante: there's an ops@ email about this going on, btw. [23:22:01] mutante: end goal is to reuse whatever prod's using with super minimal branching, and I think if that's going to be icinga it'll first have to be put into a module. Plus resource collection needs a replacement for labs [23:22:16] !log mwalker Started scap: Updating Core, VE, and GuidedTour for scap, {{gerrit|145400}}, {{gerrit|145401}}, {{gerrit|145431}}, and {{gerrit|145460}} [23:22:19] Logged the message, Master [23:22:53] YuviPanda: it's odd that we have modules/icinga but there are just templates in there [23:23:02] i am not sure why that is yet [23:23:09] mutante: yeah, I was looking for the pp files there and was surprised to find it in mics [23:23:11] *misc [23:23:24] partial modularization ? 
hmm [23:24:00] well it's just the empty dir..wth [23:24:15] (PS2) BBlack: Add explicit homedirs for file_mover + pybal-check [operations/puppet] - https://gerrit.wikimedia.org/r/145463 [23:24:26] YuviPanda: we should make a module, but i also think that is not a requirement to apply the role class on an instance [23:24:57] arr, there is none, well i mean include icinga::monitor [23:25:02] mutante: :D [23:25:09] mutante: ther's also prod specific config in there [23:25:29] mutante: I don't think applying the current icinga classes, as is, to any instance will have any effect. [23:26:10] it will have an effect, tells us which error to fix first :p [23:27:00] yea,it's messy, we should just try clean that all up step by step [23:27:24] (CR) BBlack: [C: 2] Add explicit homedirs for file_mover + pybal-check [operations/puppet] - https://gerrit.wikimedia.org/r/145463 (owner: BBlack) [23:31:56] (PS1) BBlack: fix file_mover .ssh dir [operations/puppet] - https://gerrit.wikimedia.org/r/145467 [23:33:23] mutante: that's true, yeah :) [23:33:38] mutante: that'll probably be the process, I guess. self puppetmaster, apply, fix, repeat [23:33:49] while moving things into a module [23:33:57] (CR) BBlack: [C: 2] fix file_mover .ssh dir [operations/puppet] - https://gerrit.wikimedia.org/r/145467 (owner: BBlack) [23:35:35] (CR) Scottlee: "Ping -- let me know if I need to do anything else here." [operations/puppet] - https://gerrit.wikimedia.org/r/144737 (owner: Scottlee) [23:38:42] !log mwalker Finished scap: Updating Core, VE, and GuidedTour for scap, {{gerrit|145400}}, {{gerrit|145401}}, {{gerrit|145431}}, and {{gerrit|145460}} (duration: 16m 26s) [23:38:46] Logged the message, Master [23:38:56] puppet is so damned annoying about things sometimes [23:39:02] superm401, James_F; at long last scap has finished [23:39:09] please check your changes [23:39:10] Thanks, mwalker [23:39:21] (PS1) BBlack: perms fix for file_mover? 
[operations/puppet] - https://gerrit.wikimedia.org/r/145471 [23:39:31] That was actually faster than I expected. [23:39:37] (CR) BBlack: [C: 2 V: 2] perms fix for file_mover? [operations/puppet] - https://gerrit.wikimedia.org/r/145471 (owner: BBlack) [23:39:39] at long last? 16 minutes? [23:39:43] Mine was twice as long earlier ;) [23:40:09] it feels like an age [23:41:27] RECOVERY - puppet last run on erbium is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [23:43:44] mwalker, seems good, don't see any regressions. [23:44:09] fantastic [23:46:16] any known status on labstore1001:/exp/dumps filesystem full? [23:47:17] it's been that way for nearly two weeks apparently, and dataset2 puppet is failing indirectly because of it as well [23:48:22] (PS1) Dzahn: turn icinga into module pt1. separate classes [operations/puppet] - https://gerrit.wikimedia.org/r/145472 [23:48:24] i think even longer than that [23:48:30] andrewbogott: labstore? [23:48:43] YuviPanda: ^ :p [23:48:54] mutante: I think there was a thread about it on ops... [23:49:03] we need to change the dumps policy, or buy more hardware [23:49:13] right, I vaguely know that it happened, and Coren was going to reallocate some space from something else temporarily (and buy more hardware after?) [23:51:50] andrewbogott: bblack https://rt.wikimedia.org/Ticket/Display.html?id=7578 ? [23:52:05] ACKNOWLEDGEMENT - Disk space on labstore1001 is CRITICAL: DISK CRITICAL - free space: /exp/dumps 0 MB (0% inode=99%): daniel_zahn see RT #7578 [23:52:29] yep! [23:52:48] bblack: alright, so status "needs bigger disks" [23:52:54] andrewbogott: nod, thx [23:53:47] RECOVERY - Puppet freshness on db1006 is OK: puppet ran at Thu Jul 10 23:53:43 UTC 2014 [23:54:25] (CR) Dzahn: [C: -2] "Yuvi, i'll continue on it later :)" [operations/puppet] - https://gerrit.wikimedia.org/r/145472 (owner: Dzahn) [23:54:34] bbiaw