[00:00:12] PROBLEM - Host cp3031 is DOWN: PING CRITICAL - Packet loss = 100%
[00:00:12] PROBLEM - Host cp3036 is DOWN: PING CRITICAL - Packet loss = 100%
[00:00:12] PROBLEM - Host cp3035 is DOWN: PING CRITICAL - Packet loss = 100%
[00:00:13] PROBLEM - Host cp3032 is DOWN: PING CRITICAL - Packet loss = 100%
[00:00:13] PROBLEM - Host cp3034 is DOWN: PING CRITICAL - Packet loss = 100%
[00:00:13] PROBLEM - Host cp3033 is DOWN: PING CRITICAL - Packet loss = 100%
[00:00:13] PROBLEM - Host cp3039 is DOWN: PING CRITICAL - Packet loss = 100%
[00:00:14] PROBLEM - Host cp3037 is DOWN: PING CRITICAL - Packet loss = 100%
[00:00:32] PROBLEM - Host cp3038 is DOWN: PING CRITICAL - Packet loss = 100%
[00:00:32] RECOVERY - puppet last run on amssq61 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures
[00:00:42] PROBLEM - Host cp3030 is DOWN: PING CRITICAL - Packet loss = 100%
[00:04:55] kaldari: Now syncing Gather for wmf22
[00:05:51] !log catrope Synchronized php-1.25wmf22/extensions/Gather: SWAT (duration: 01m 06s)
[00:05:59] Logged the message, Master
[00:09:05] RoanKattouw: actually, I’m going to wait until wmf23 is synced too
[00:09:34] RoanKattouw: let me know if you want me to do the submodule update for wmf23
[00:10:22] !log catrope Synchronized php-1.25wmf23/extensions/VisualEditor: SWAT (duration: 01m 07s)
[00:10:28] Logged the message, Master
[00:11:29] !log catrope Synchronized php-1.25wmf23/extensions/ImageMetrics: SWAT (duration: 01m 07s)
[00:11:34] Logged the message, Master
[00:11:55] !log ssh: connect to host mw2213.codfw.wmnet port 22: Connection timed out
[00:12:00] Logged the message, Mr. Obvious
[00:12:36] !log catrope Synchronized php-1.25wmf23/extensions/Gather: SWAT (duration: 01m 07s)
[00:12:41] Logged the message, Master
[00:13:23] kaldari: tgr: Your stuff is deployed now, please check
[00:13:35] RoanKattouw, the mw2213 thing is known
[00:13:45] could not even ping that host earlier
[00:13:53] RoanKattouw: testing. Looks like there might still be a problem, so I’m not going to do the config change for now
[00:15:14] (CR) Kaldari: "Looks like there may still be some problems with Gather." [mediawiki-config] - https://gerrit.wikimedia.org/r/201065 (owner: Kaldari)
[00:15:17] springle: around?
[00:15:24] springle: mostly just ping about https://gerrit.wikimedia.org/r/#/c/200170/ :)
[00:17:41] YuviPanda: ok
[00:20:43] RoanKattouw_away: works, thanks!
[00:21:39] (CR) Springle: [C: 1] "Since this is 5.5, the log file size change will need: https://dev.mysql.com/doc/refman/5.5/en/innodb-data-log-reconfiguration.html" [puppet] - https://gerrit.wikimedia.org/r/200170 (https://phabricator.wikimedia.org/T88234) (owner: coren)
[00:23:19] (CR) Springle: Labs: puppetize labstore1005's mysql setup (1 comment) [puppet] - https://gerrit.wikimedia.org/r/200170 (https://phabricator.wikimedia.org/T88234) (owner: coren)
[00:24:06] (CR) Yuvipanda: Labs: puppetize labstore1005's mysql setup (1 comment) [puppet] - https://gerrit.wikimedia.org/r/200170 (https://phabricator.wikimedia.org/T88234) (owner: coren)
[00:25:29] Coren: I’m going to do my checklist (https://etherpad.wikimedia.org/p/labs-maint-checklist, not fully done) on ^, and see how that goes
[00:26:24] (CR) coren: Labs: puppetize labstore1005's mysql setup (1 comment) [puppet] - https://gerrit.wikimedia.org/r/200170 (https://phabricator.wikimedia.org/T88234) (owner: coren)
[00:27:55] urandom: no, it hasn't. would you be willing to do it?
[00:29:00] urandom: i didnt just run salt or something because < gwicke> with coordinated being 'wait until a node is back up & is processing requests' before proceeding
[00:29:51] springle: if we move to mariadb at some point, would that also require a window?
[00:30:15] mutante: I can restart them
[00:30:26] urandom: thanks
[00:30:28] !log restarting cassandra on restbase1002
[00:30:34] Logged the message, Master
[00:31:36] mutante: thanks for making this change!
[00:31:45] mutante: after the restart it typically takes ~1 minute for a node to re-join the cluster
[00:32:44] alright
[00:33:17] !log restarting cassandra on restbase1003
[00:33:22] Logged the message, Master
[00:33:27] YuviPanda: It /is/ mariadb; just not mariadb 10
[00:33:42] oh, right. I thought we were gonna go from mysql 5.5 to mariadb 10...
[00:33:51] Is jenkins just flooded or broken?
[00:34:13] Coren: https://phabricator.wikimedia.org/T94643 does that sound like too much overhead to file a ticket like that?
[00:34:17] springle: ^
[00:35:05] !log restarting cassandra on restbase1004
[00:35:10] Logged the message, Master
[00:36:01] YuviPanda: The ticket is nice and verbose, and can double as announcement, but omygersh bold oversize FONT! :-)
[00:36:24] Coren: haha :D yeah, I can basically copy paste that once filled in as an announcement...
[00:37:37] What could go wrong? is always fun
[00:40:35] !log restarting cassandra on restbase1005
[00:40:35] mutante: heh, I guess I wouldn’t put that in the announcement email
[00:40:37] Logged the message, Master
[00:40:54] !log restarting cassandra on restbase1006
[00:40:59] Logged the message, Master
[00:42:52] YuviPanda: "what could possibly go wrong" is a popular gerrit merge comment :)
[00:43:09] eh
[00:43:10] heh
[00:43:27] and then i think about an actual answer
[00:43:57] and your template asks for that, so you can come up with really bad but also really unlikely stuff
[00:44:02] I know of only one worse thing to say to call doom upon oneself.
[00:44:50] mutante: Coren I just added ‘an’ answer.
[00:44:58] mutante: Coren I guess the question needs refining then...
[00:46:11] it's seriously just hard to answer
[00:46:12] "At least things can't get any worse now."
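(Editorial sketch, not part of the channel log.) The Cassandra restarts above were done by hand, one node at a time, following the rule quoted earlier: wait until a node is back up and processing requests before proceeding. A minimal Python sketch of that loop follows. The names `rolling_restart`, `restart` and `is_serving` are hypothetical, and how a node is actually restarted or health-checked is left to the caller, since the log does not say.

```python
import time


def rolling_restart(nodes, restart, is_serving, timeout=300.0, poll=5.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Restart nodes one at a time, waiting for each to serve again.

    `restart` and `is_serving` are caller-supplied callables (hypothetical
    here); the function only encodes the ordering and the wait.
    """
    done = []
    for node in nodes:
        restart(node)
        deadline = clock() + timeout
        # Block on this node before touching the next one.
        while not is_serving(node):
            if clock() >= deadline:
                raise TimeoutError(
                    f"{node} not serving after {timeout}s; restarted so far: {done}"
                )
            sleep(poll)
        done.append(node)
    return done
```

Raising on timeout stops the run with the remaining nodes untouched, so a node that fails to rejoin never leads to a second node being taken down.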
[00:46:37] hehe
[00:46:42] Coren: mutante so the question in full says 'What could go wrong? Worst case predictions? '
[00:47:17] mutante: Coren if it seems hard to answer I can just take it out.
[00:47:23] I think > Is it possible to stop mid-way / roll back? If so, how? And what effects will that have?
[00:47:25] is more important...
[00:47:30] (and of course, the answer can be ‘no’)
[00:47:31] YuviPanda: The problem with that question is that truly worst cases are vanishingly unlikely but infinite in number.
[00:47:38] i guess it's all about where you draw the line what is still considered likely
[00:47:45] Coren: right. maybe phrase it like ‘most probable failure case’?
[00:47:52] Rather than estimate what could go wrong, having contingencies in place is key.
[00:47:59] yeah, totally.
[00:48:01] i could always fall on rm -rf /
[00:48:10] I guess part of the point of the question is to have you think about contingencies...
[00:48:25] Sure, but if you find the most probable failure case, why are you not guarding against it explicitly? :-)
[00:48:41] totally. That’s kind of the hope :D
[00:49:03] For instance, yesterday, had I considered Precise would crap its pants I wouldn't have done the switch this way at all.
[00:49:04] like, if we had done this for the NFS change, and gone ‘well, they might need a reboot’, then maybe we would’ve had a script to do the reboots? Or at least thought about the effects of mass reboot...
[00:49:05] or something.
[00:49:30] YuviPanda: See, the problem is I *thought* I ruled this out because I tested it first. :-)
[00:49:43] So the contingency wins over the scenario.
[00:49:52] that's also the thing about the known unknowns and the unknown unknowns
[00:49:53] I should have planned for a true outage in the first place.
[00:50:10] true.
[00:50:16] how would we have answered
[00:50:18] > Is it possible to stop mid-way / roll back? If so, how? And what effects will that have?
[00:50:20] Even though it seemed like a simple salt run would have done it.
[00:50:28] For the switch?
[00:50:31] yeah...
[00:50:41] * Coren ponders.
[00:51:36] "Not per se; rolling back has necessarily the same impact as the switch itself; if the new filesystem fails to work, however, we can switch back."
[01:02:07] (PS1) Andrew Bogott: Add a Horizon-specific nova policy file. [puppet] - https://gerrit.wikimedia.org/r/201088
[01:02:16] (CR) jenkins-bot: [V: -1] Add a Horizon-specific nova policy file. [puppet] - https://gerrit.wikimedia.org/r/201088 (owner: Andrew Bogott)
[01:03:35] Coren: mutante so I got rid of the ‘what could go wrong’ part.
[01:04:11] (PS2) Andrew Bogott: Add a Horizon-specific nova policy file [puppet] - https://gerrit.wikimedia.org/r/201088
[01:04:21] (CR) jenkins-bot: [V: -1] Add a Horizon-specific nova policy file [puppet] - https://gerrit.wikimedia.org/r/201088 (owner: Andrew Bogott)
[01:08:46] springle: how long do you think downtime is gonna be? If it’s short enough I guess we could just do it on thursday… If not next tuesday maybe?
[01:31:46] eh, i clicked a link on integration.wm and it responded with
[01:31:52] Please wait while Jenkins is restarting
[01:32:23] PROBLEM - zuul_service_running on gallium is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/local/bin/zuul-server
[01:32:41] it wasn't the "restart service" button :p
[01:32:53] PROBLEM - zuul_gearman_service on gallium is CRITICAL: Connection refused
[01:33:00] mutante: It's being restarted..
[01:33:05] That takes about 30 minutes on average
[01:33:28] ok, but did we expect the restart?
[01:33:47] Yes
[01:33:51] ok, good :)
[01:34:00] it just seemed to me like i caused it almost
[01:35:27] !log started zuul on gallium
[01:35:36] Logged the message, Master
[01:35:43] RECOVERY - zuul_service_running on gallium is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/local/bin/zuul-server
[01:36:13] RECOVERY - zuul_gearman_service on gallium is OK: TCP OK - 0.000 second response time on port 4730
[01:48:44] YuviPanda: say 15min
[01:48:58] might be less (should be, but hey)
[01:49:25] yeah, better buffer and not need it than otherwise, I guess :)
[01:50:34] PROBLEM - zuul_service_running on gallium is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/local/bin/zuul-server
[01:51:03] PROBLEM - zuul_gearman_service on gallium is CRITICAL: Connection refused
[02:06:02] RECOVERY - zuul_gearman_service on gallium is OK: TCP OK - 0.000 second response time on port 4730
[02:07:12] RECOVERY - zuul_service_running on gallium is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/local/bin/zuul-server
[02:25:19] (PS3) Yuvipanda: Labs: puppetize labsdb1005's mysql setup [puppet] - https://gerrit.wikimedia.org/r/200170 (https://phabricator.wikimedia.org/T88234) (owner: coren)
[02:25:30] (CR) jenkins-bot: [V: -1] Labs: puppetize labsdb1005's mysql setup [puppet] - https://gerrit.wikimedia.org/r/200170 (https://phabricator.wikimedia.org/T88234) (owner: coren)
[02:26:22] (PS1) Jalexander: Allow gather-hidelist to be used in global user groups [mediawiki-config] - https://gerrit.wikimedia.org/r/201093 (https://phabricator.wikimedia.org/T94652)
[02:26:25] (CR) jenkins-bot: [V: -1] Allow gather-hidelist to be used in global user groups [mediawiki-config] - https://gerrit.wikimedia.org/r/201093 (https://phabricator.wikimedia.org/T94652) (owner: Jalexander)
[02:26:36] damn that commit took a long time to submit
[02:26:52] Jamesofur: the -1 is just jenkins being dead
[02:27:02] oh good, because I just got massively confused
[02:27:04] thanks :)
[02:28:07] (CR) Alex Monk: "Is this something we need to do in config so that it takes effect on meta? :/" [mediawiki-config] - https://gerrit.wikimedia.org/r/201093 (https://phabricator.wikimedia.org/T94652) (owner: Jalexander)
[02:28:38] Jamesofur, jenkins is very dead very often
[02:29:03] !log l10nupdate Synchronized php-1.25wmf22/cache/l10n: (no message) (duration: 08m 46s)
[02:29:13] Logged the message, Master
[02:30:15] Krenair: amen
[02:30:22] (CR) Jalexander: "Sadly, yes, as far as I know we do :-/ if the extension was loaded on meta it should appear but it isn't (and I don't think there is a pla" [mediawiki-config] - https://gerrit.wikimedia.org/r/201093 (https://phabricator.wikimedia.org/T94652) (owner: Jalexander)
[02:30:26] :( poor Jenkins
[02:30:51] and that often leads to people +2ing over jenkins which is not good
[02:31:00] yup, it teaches bad habits
[02:31:03] Jamesofur, we also wouldn't get i18n for the right on meta
[02:31:08] (PS3) Andrew Bogott: Add a Horizon-specific nova policy file. [puppet] - https://gerrit.wikimedia.org/r/201088
[02:31:08] little cry wolf
[02:31:18] (CR) jenkins-bot: [V: -1] Add a Horizon-specific nova policy file. [puppet] - https://gerrit.wikimedia.org/r/201088 (owner: Andrew Bogott)
[02:31:25] Krenair: yeah, though I don't have an enormous issue just filling that in if necessary.
[02:31:30] but I don't think the stewards mind that too much
[02:31:37] yeah
[02:31:50] just a bit annoying we have to do it at all
[02:31:54] * Jamesofur nods
[02:31:56] agreed
[02:32:10] not pretty
[02:32:29] I've been trying to clear up Jenkins a bit
[02:32:55] I've been looking at the things it yells about often and clearing them up
[02:33:13] that's good, much needed
[02:33:45] as in, repositories/branches that it can't pass new commits on?
[02:35:09] no really just lint tests
[02:35:29] I haven't pushed most of them yet
[02:35:44] !log LocalisationUpdate completed (1.25wmf22) at 2015-04-01 02:34:41+00:00
[02:35:51] Logged the message, Master
[02:39:28] Wow! Google smartbox!
[02:40:10] (it's officially april fools at google)
[02:41:44] haha
[02:42:22] they're quite early this year
[02:42:33] early?
[02:42:48] relative to past years
[02:42:56] also google maps had a change...
[02:43:33] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:43:54] (CR) Tim Landscheidt: tools: Ensure that proxylistener service is running (1 comment) [puppet] - https://gerrit.wikimedia.org/r/199830 (https://phabricator.wikimedia.org/T93121) (owner: Yuvipanda)
[02:53:12] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 59669 bytes in 0.065 second response time
[02:55:17] (PS8) Dzahn: WIP cassandra: add ferm rules [puppet] - https://gerrit.wikimedia.org/r/197840 (https://phabricator.wikimedia.org/T92680)
[02:58:17] !log l10nupdate Synchronized php-1.25wmf23/cache/l10n: (no message) (duration: 08m 41s)
[02:58:26] Logged the message, Master
[03:00:31] (PS9) Dzahn: WIP cassandra: add ferm rules [puppet] - https://gerrit.wikimedia.org/r/197840 (https://phabricator.wikimedia.org/T92680)
[03:05:04] !log LocalisationUpdate completed (1.25wmf23) at 2015-04-01 03:04:00+00:00
[03:05:11] Logged the message, Master
[03:05:34] (CR) Dzahn: [C: 1] "this version confirmed with compiler:" [puppet] - https://gerrit.wikimedia.org/r/197840 (https://phabricator.wikimedia.org/T92680) (owner: Dzahn)
[03:06:41] (PS10) Dzahn: cassandra: add ferm rules using hiera data [puppet] - https://gerrit.wikimedia.org/r/197840 (https://phabricator.wikimedia.org/T92680)
[03:15:31] (CR) Alex Monk: "recheck" [puppet] - https://gerrit.wikimedia.org/r/201088 (owner: Andrew Bogott)
[03:15:41] (CR) Alex Monk: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/201093 (https://phabricator.wikimedia.org/T94652) (owner: Jalexander)
[03:15:58] (CR) Alex Monk: "recheck" [puppet] - https://gerrit.wikimedia.org/r/200170 (https://phabricator.wikimedia.org/T88234) (owner: coren)
[03:16:10] I never really found a reason to do a def main() in python
[03:16:12] until now...
[03:16:22] What's the reason?
[03:18:53] Krenair: scope clash.
[03:19:07] I have a ‘job’ variable outside, and one in a function, and they would clash
[03:21:39] (PS4) Andrew Bogott: Add a Horizon-specific nova policy file. [puppet] - https://gerrit.wikimedia.org/r/201088
[03:25:34] (CR) jenkins-bot: [V: -1] Add a Horizon-specific nova policy file. [puppet] - https://gerrit.wikimedia.org/r/201088 (owner: Andrew Bogott)
[03:28:28] (PS5) Andrew Bogott: Add a Horizon-specific nova policy file. [puppet] - https://gerrit.wikimedia.org/r/201088
[03:59:58] (PS1) Yuvipanda: tools: Make webservice2 block for start / stop [puppet] - https://gerrit.wikimedia.org/r/201100 (https://phabricator.wikimedia.org/T93334)
[04:00:06] legoktm: help review python? ^
[04:03:13] (PS2) Yuvipanda: tools: Make webservice2 block for start / stop [puppet] - https://gerrit.wikimedia.org/r/201100 (https://phabricator.wikimedia.org/T93334)
[04:25:32] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 7.69% of data above the critical threshold [500.0]
[04:37:03] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[05:15:13] PROBLEM - puppet last run on amssq38 is CRITICAL: CRITICAL: puppet fail
[05:31:52] RECOVERY - puppet last run on amssq38 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[05:52:53] PROBLEM - Disk space on uranium is CRITICAL: DISK CRITICAL - free space: / 348 MB (3% inode=84%):
[05:58:46] <_joe_> Apr 1 05:58:20 uranium /usr/sbin/gmetad[16597]: RRD_update (/srv/ganglia/rrds/Text caches ulsfo/cp4017.ulsfo.wmnet/kafka.rdkafka.brokers.analytics1022-eqiad-wmnet_9092.22.tx.per_second.rrd): rrdcached: illegal attempt to update using time 1427867895.000000 when last update time is 1427867895.000000 (minimum one second step)
[06:09:32] RECOVERY - Disk space on uranium is OK: DISK OK
[06:10:54] <_joe_> !log manually rotated and compressed syslog and apache logs on uranium, still being spammed by kafka brokers
[06:11:00] Logged the message, Master
[06:11:23] (CR) Gilles: "Easier/automated updating when security patches are issued is the main upside of pointing to system packages, yes." [puppet] - https://gerrit.wikimedia.org/r/199598 (https://phabricator.wikimedia.org/T84956) (owner: Gilles)
[06:29:43] PROBLEM - puppet last run on cp4003 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:43:27] !log LocalisationUpdate ResourceLoader cache refresh completed at Wed Apr 1 06:42:24 UTC 2015 (duration 42m 23s)
[06:43:34] Logged the message, Master
[06:44:47] (PS1) Gage: IPsec: improved cipher selection [puppet] - https://gerrit.wikimedia.org/r/201135
[06:46:22] RECOVERY - puppet last run on cp4003 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[06:56:11] hi everyone, my name is Moritz Mühlenhoff and I'll be working with you starting today :-)
[06:57:27] <_joe_> hi moritz
[06:57:31] Mark is about to send an announcement mail, but I don't think that went out since he said he'd be in the Amsterdam data centre most of the day
[06:57:31] <_joe_> and welcome
[06:57:34] <_joe_> Giuseppe here
[06:58:23] hi Giuseppe
[06:58:35] <_joe_> a few more europeans tech ops will show up in a few, I guess
[06:58:59] my Name is intricately German, so feel free to simply refer to me as jmm...
[06:59:51] <_joe_> eheh, ok fair enough
[06:59:59] <_joe_> I insist everyone calls me joe as well
[07:00:22] <_joe_> (Giuseppe is pretty hard to pronounce correctly for most native english speakers)
[07:07:55] jmm_: can't we call you "the hoff" instead? ;)
[07:08:00] jmm_: hey, welcome
[07:08:07] !
[07:10:22] gilles: hopefully not :-) the significance of "the hoff" in Germany is very much overrated, that's mostly a media phenomenon
[07:10:26] hi ori
[07:10:54] i wonder if we have anyone else whose first and last names form an alliteration
[07:11:23] bblack! :)
[07:11:28] <_joe_> yep
[07:11:35] <_joe_> I was about to say that
[07:12:13] all I can say is the name moritz brings me good memories http://upload.wikimedia.org/wikipedia/commons/3/33/Cervesa_Moritz_Llauna.jpg
[07:12:44] oh, and Krinkle
[07:13:07] <_joe_> right
[07:13:38] avoid rochester! (https://en.wikipedia.org/wiki/Alphabet_murders)
[07:14:05] gilles: I was very surprised when I saw that the first time in a supermarket
[07:14:10] not that there is a shortage of reasons to avoid rochester
[07:14:20] (CR) Krinkle: "fixme: T94669. For labs instances the ganglia_class property doesn't seem to be getting a default from anywhere." [puppet] - https://gerrit.wikimedia.org/r/198566 (https://phabricator.wikimedia.org/T93776) (owner: Giuseppe Lavagetto)
[07:16:57] <_joe_> Krinkle: ouch, I thought ganglia wasn't used in labs?
[07:17:33] _joe_: I dunno, it's being parsed at least. Which causes errors.
[07:17:52] <_joe_> Krinkle: oh, damn import.
[07:17:58] <_joe_> sigh, sorry
[07:18:01] <_joe_> fixing it
[07:18:05] It seems beta and staging also both set ganglia_class=old in their Hiera config
[07:18:11] but not the rest of labs :)
[07:19:27] <_joe_> Krinkle: yes, sorry'
[07:24:17] (PS1) Giuseppe Lavagetto: ganglia: set a default ganglia_class for labs in general as well [puppet] - https://gerrit.wikimedia.org/r/201139
[07:26:12] hi jmm_ ! welcome
[07:26:35] godog: can you restart apertium-apy on sca1001/sca1002 please?
[07:26:52] godog: hi Filippo!
[07:27:12] jmm_: welcome!
[07:27:18] kart_: sure, what's up?
[07:27:31] godog: need restart as new pairs are installed.
[07:27:53] they works fine in beta, but says not available in production. restart can help.
[07:28:20] !log bounce apertium-apy on sca1001/sca1002
[07:28:28] Logged the message, Master
[07:28:39] <_joe_> Krinkle: ^^
[07:28:42] legoktm: thanks :-)
[07:28:51] (CR) Krinkle: [C: 1] ganglia: set a default ganglia_class for labs in general as well [puppet] - https://gerrit.wikimedia.org/r/201139 (owner: Giuseppe Lavagetto)
[07:29:04] _joe_: cherry-picking to integration-puppetmaster now
[07:29:17] kart_: hah, mind opening a phab to get you access to restart apertium too?
[07:30:18] godog: well, I did: https://phabricator.wikimedia.org/T89222 :P
[07:30:35] godog: read it :)
[07:32:05] kart_: ah ok, by design
[07:33:04] godog: restart access is fine though. log is separate thing.
[07:44:56] (PS2) Krinkle: ganglia: set a default ganglia_class for labs in general as well [puppet] - https://gerrit.wikimedia.org/r/201139 (https://phabricator.wikimedia.org/T94669) (owner: Giuseppe Lavagetto)
[07:45:15] (CR) Krinkle: "Deployed on integration-puppetmaster. Works as expected :)" [puppet] - https://gerrit.wikimedia.org/r/201139 (https://phabricator.wikimedia.org/T94669) (owner: Giuseppe Lavagetto)
[07:45:41] (CR) Giuseppe Lavagetto: [C: 2] "Thanks for testing!" [puppet] - https://gerrit.wikimedia.org/r/201139 (https://phabricator.wikimedia.org/T94669) (owner: Giuseppe Lavagetto)
[07:45:52] (CR) Giuseppe Lavagetto: [V: 2] "Thanks for testing!" [puppet] - https://gerrit.wikimedia.org/r/201139 (https://phabricator.wikimedia.org/T94669) (owner: Giuseppe Lavagetto)
[07:46:20] godog: thanks!
[07:47:48] np
[08:09:57] (PS1) Filippo Giunchedi: wikimania_scholarships: don't manage open/close dates [puppet] - https://gerrit.wikimedia.org/r/201143 (https://phabricator.wikimedia.org/T92358)
[08:21:26] (CR) Alexandros Kosiaris: [C: 1] cassandra: add ferm rules using hiera data [puppet] - https://gerrit.wikimedia.org/r/197840 (https://phabricator.wikimedia.org/T92680) (owner: Dzahn)
[08:28:58] (PS6) KartikMistry: CX: Enable newarticle campaign in cawiki and eswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/197491 (https://phabricator.wikimedia.org/T90876)
[09:19:12] PROBLEM - puppet last run on virt1011 is CRITICAL: CRITICAL: Puppet has 9 failures
[09:27:39] hoo: here?
[09:27:43] yes
[09:27:51] I saw your message from last night
[09:27:59] (and then forgot about it)
[09:28:07] still an issue?
[09:28:22] paravoid: not this morning
[09:28:30] won't know until evening
[09:28:31] No, it's fine now... but yesterday evening it was insanely slow
[09:28:38] mtrs, yes :)
[09:28:47] https://phabricator.wikimedia.org/P463
[09:28:50] if that's helpful at all
[09:29:14] proxycommand made gerrit super fast
[09:35:52] RECOVERY - puppet last run on virt1011 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:43:22] (CR) Hashar: "I am tempted to entirely ignore the puppet_url_without_modules . From the issue conversation, it is not part of the style guide." [puppet] - https://gerrit.wikimedia.org/r/198116 (https://phabricator.wikimedia.org/T87132) (owner: Tim Landscheidt)
[09:43:54] (CR) Hashar: "Thanks!" [puppet] - https://gerrit.wikimedia.org/r/200770 (owner: Legoktm)
[09:50:10] (PS1) Giuseppe Lavagetto: ganglia: fix bits caches in ulsfo [puppet] - https://gerrit.wikimedia.org/r/201154
[09:53:01] (CR) Giuseppe Lavagetto: [C: 2 V: 2] ganglia: fix bits caches in ulsfo [puppet] - https://gerrit.wikimedia.org/r/201154 (owner: Giuseppe Lavagetto)
[09:58:58] _joe_: godog: akosiaris: when you have a minute, could you take a look at https://github.com/wikimedia/service-template-node/pull/29 ?
[09:59:07] it's a collection of service init scripts
[09:59:20] templates for service init scripts, better yet
[09:59:39] <_joe_> mobrovac: will do, but I think those should not be part of that template tbh, but templates in puppet
[10:00:01] _joe_: yes, that's the idea that they eventually end up in puppet
[10:00:08] but you gotta start somewhere :)
[10:00:42] <_joe_> mobrovac: we do have a nice module for handling multiple init scripts btw, base::something
[10:00:55] yep, base::service_init
[10:00:58] <_joe_> (and mind it, I wrote it, I can't remember the silly name I gave it)
[10:01:01] <_joe_> ahah ok
[10:01:01] no, service_unit
[10:01:01] <_joe_> :P
[10:01:26] if you look more closely to my PR, the names match what service_unit expects
[10:01:50] the idea is that you pass service_unit the templates, and it chooses the init script to use based on the system
[10:01:52] very neat
[10:02:10] tim rages against the fd limit
[10:02:49] he argues (compellingly) that it should be unlimited. not because services should be fast and loose with file descriptors, but because there really isn't any benefit in having upstart kill your service because you run into some arbitrary limit much lower than the system's
[10:03:19] i agree
[10:03:38] <_joe_> I don't
[10:03:49] <_joe_> and I already explained that in length to tim
[10:04:03] <_joe_> so I won't repeat myself
[10:04:05] hence, why i didn't put unlimited there :)
[10:04:42] <_joe_> esp if you're distributing your software with init scripts to everyone else, setting a reasonable open files limit for typical usage is a good idea
[10:05:03] <_joe_> I do agree we don't need a open file limit on single-purpose machines, say hhvm appservers
[10:05:09] <_joe_> well, sort of
[10:06:13] high-but-unlimited
[10:12:41] mobrovac: yep will take a look
[10:21:04] (CR) JanZerebecki: [C: 1] wikimania_scholarships: don't manage open/close dates [puppet] - https://gerrit.wikimedia.org/r/201143 (https://phabricator.wikimedia.org/T92358) (owner: Filippo Giunchedi)
[10:36:42] PROBLEM - Host cp3040 is DOWN: PING CRITICAL - Packet loss = 100%
[10:37:13] PROBLEM - Host cp3042 is DOWN: PING CRITICAL - Packet loss = 100%
[10:37:13] PROBLEM - Host cp3047 is DOWN: PING CRITICAL - Packet loss = 100%
[10:37:13] PROBLEM - Host cp3043 is DOWN: PING CRITICAL - Packet loss = 100%
[10:37:13] PROBLEM - Host cp3048 is DOWN: PING CRITICAL - Packet loss = 100%
[10:37:33] RECOVERY - Host cp3043 is UP: PING OK - Packet loss = 0%, RTA = 90.21 ms
[10:37:33] RECOVERY - Host cp3042 is UP: PING OK - Packet loss = 0%, RTA = 95.63 ms
[10:37:33] RECOVERY - Host cp3048 is UP: PING OK - Packet loss = 0%, RTA = 95.41 ms
[10:37:53] RECOVERY - Host cp3040 is UP: PING OK - Packet loss = 0%, RTA = 88.92 ms
[10:37:53] RECOVERY - Host cp3047 is UP: PING OK - Packet loss = 0%, RTA = 88.90 ms
[11:21:32] PROBLEM - Router interfaces on cr1-esams is CRITICAL: CRITICAL: host 91.198.174.245, interfaces up: 96, down: 1, dormant: 0, excluded: 0, unused: 0; xe-0/0/3: down - cp3018:eth1 (TEMP)
[12:03:10] godog: thnx for the review
[12:03:43] (CR) Gergő Tisza: [C: -1] "The exact steps needed to build/update this should be included in a readme file. Also, a list of what python packages are included (and wh" [software/sentry] - https://gerrit.wikimedia.org/r/201006 (https://phabricator.wikimedia.org/T84956) (owner: Gilles)
[12:03:53] (PS1) Faidon Liambotis: Drain esams of all traffic (scheduled maintenance) [dns] - https://gerrit.wikimedia.org/r/201172
[12:03:59] bblack: ^
[12:04:22] (CR) Filippo Giunchedi: [WIP] Add role::mediawiki_vagrant_lxc (2 comments) [puppet] - https://gerrit.wikimedia.org/r/193665 (owner: BryanDavis)
[12:04:41] mobrovac: yw!
[12:05:46] (CR) BBlack: [C: 1] Drain esams of all traffic (scheduled maintenance) [dns] - https://gerrit.wikimedia.org/r/201172 (owner: Faidon Liambotis)
[12:06:48] (CR) Faidon Liambotis: [C: 2] Drain esams of all traffic (scheduled maintenance) [dns] - https://gerrit.wikimedia.org/r/201172 (owner: Faidon Liambotis)
[12:07:08] (PS1) Faidon Liambotis: Revert "Drain esams of all traffic (scheduled maintenance)" [dns] - https://gerrit.wikimedia.org/r/201173
[12:07:12] let's have that ready too :)
[12:07:31] :)
[12:08:00] !log draining esams
[12:08:07] Logged the message, Master
[12:08:21] How long will that take?
[12:08:26] (maint. window)
[12:08:37] an hour or two?
[12:08:50] Ok, so not into the evening hours, good
[12:08:58] hopefully not :)
[12:41:00] (PS2) Gilles: Initial venv [software/sentry] - https://gerrit.wikimedia.org/r/201006 (https://phabricator.wikimedia.org/T84956)
[12:42:12] RECOVERY - Router interfaces on cr1-esams is OK: OK: host 91.198.174.245, interfaces up: 83, down: 0, dormant: 0, excluded: 0, unused: 0
[12:45:43] PROBLEM - puppet last run on wtp1006 is CRITICAL: CRITICAL: puppet fail
[12:48:58] gilles: I am wondering whether we should fill a task for ops to review the Sentry deployment strategy
[12:49:10] so they can triage / assign the review properly
[12:49:25] sure, why not
[12:49:32] gilles: could be made a sub task of https://phabricator.wikimedia.org/T84956 and added to their #operations project
[12:49:47] though adding reviewers on https://gerrit.wikimedia.org/r/#/c/201006/ might be enough hehe
[13:01:02] PROBLEM - puppet last run on mw2158 is CRITICAL: CRITICAL: Puppet has 1 failures
[13:04:13] RECOVERY - puppet last run on wtp1006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[13:06:07] (PS1) Hashar: zuul: replace 'zuul-merger' by $NAME [puppet] - https://gerrit.wikimedia.org/r/201181
[13:06:09] (PS2) Alexandros Kosiaris: Assign role::ganeti to ganeti boxes [puppet] - https://gerrit.wikimedia.org/r/200587
[13:07:22] PROBLEM - Router interfaces on cr1-esams is CRITICAL: CRITICAL: host 91.198.174.245, interfaces up: 67, down: 3, dormant: 0, excluded: 0, unused: 0; xe-0/0/2: down - Core: csw2-esams:xe-4/1/0 {#10614} [10Gbps DF]; xe-1/2/0: down - Core: csw2-esams:xe-5/0/39 {#10088} [10Gbps DF]; xe-1/1/0: down - Core: csw2-esams:xe-5/0/38 {#10089} [10Gbps DF]
[13:09:07] (ignore that)
[13:12:13] RECOVERY - Router interfaces on cr1-esams is OK: OK: host 91.198.174.245, interfaces up: 81, down: 0, dormant: 0, excluded: 0, unused: 0
[13:17:43] RECOVERY - puppet last run on mw2158 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[13:40:08] (PS1) Faidon Liambotis: Use csw2-esams' mgmt interface [puppet] - https://gerrit.wikimedia.org/r/201184
[13:42:30] (CR) BBlack: "My concern here is the CPU impact of selecting such strong ECDH." [puppet] - https://gerrit.wikimedia.org/r/201135 (owner: Gage)
[13:45:32] (CR) Faidon Liambotis: [C: 2] Use csw2-esams' mgmt interface [puppet] - https://gerrit.wikimedia.org/r/201184 (owner: Faidon Liambotis)
[13:45:39] jenkins broken
[13:45:48] (CR) Faidon Liambotis: [V: 2] Use csw2-esams' mgmt interface [puppet] - https://gerrit.wikimedia.org/r/201184 (owner: Faidon Liambotis)
[13:46:34] (PS1) coren: Tweaks to the conntrack collector: [puppet] - https://gerrit.wikimedia.org/r/201188 (https://phabricator.wikimedia.org/T90437)
[13:50:15] (CR) coren: [C: 2] "Trivial fix that affects a single (not working) collector on one box." [puppet] - https://gerrit.wikimedia.org/r/201188 (https://phabricator.wikimedia.org/T90437) (owner: coren)
[13:52:27] (PS6) Andrew Bogott: Add a Horizon-specific nova policy file. [puppet] - https://gerrit.wikimedia.org/r/201088
[13:53:12] akosiaris: ping
[13:53:28] akosiaris: i think i figured the zotero translation timeout issue
[13:53:45] s/figured/figured out/
[13:54:03] mobrovac: ok, please do tell
[13:54:29] akosiaris: it seems zotero <-> urldownloader comm does not work for https
[13:54:33] :/
[13:54:43] someone here which is sitting in datacenter?
[13:54:54] akosiaris: when i input the same url but http:// it works
[13:54:57] (PS7) Andrew Bogott: Add a Horizon-specific nova policy file. [puppet] - https://gerrit.wikimedia.org/r/201088
[13:55:24] mobrovac: interesting, let me check.. Weird though, in other cases https has worked fine through url-downloader
[13:56:12] Steinsplitter: what's up?
[13:57:01] godog: can you make some photos (when you have time, of course) for https://commons.wikimedia.org/wiki/Category:Wikimedia_servers - last are from 2013 [13:57:42] akosiaris: however, judging from the zotero logs, it seems that responses for https arrive too, but too late (the req times out 5 secs before zotero receives the response) [13:58:02] mobrovac: for example: curl -v -x url-downloader.wikimedia.org:8080 "https://www.google.com" [13:58:05] works fine [13:58:13] so this may be more specific than that [13:58:30] Steinsplitter: that's better tracked in a phab ticket (I'm not in the dc, just on ops duty) [13:58:36] akosiaris: yeah, i know, tried that myself [13:58:59] godog: phab. O_O. ok :-) [13:59:42] (CR) Andrew Bogott: [C: 2] Add a Horizon-specific nova policy file. [puppet] - https://gerrit.wikimedia.org/r/201088 (owner: Andrew Bogott) [13:59:51] (PS1) coren: More fix to conntrack collector [puppet] - https://gerrit.wikimedia.org/r/201190 (https://phabricator.wikimedia.org/T90437) [14:00:04] chasemp: Dear anthropoid, the time has come. Please deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150401T1400). [14:00:06] mobrovac: btw .. curl on books.google.com returns 401 [14:00:22] yep, seen that too [14:00:25] maybe user-agent ? 
[14:00:30] akosiaris: for both http(S) [14:00:33] doesn't seem to be IP or something [14:00:38] that could be a reason as well [14:00:44] (PS1) Faidon Liambotis: Remove csw2-esams public IPs, csw1-esams [dns] - https://gerrit.wikimedia.org/r/201191 [14:00:53] not sure if zotero is capable of doing anything with 401s [14:01:01] not that there is much to do [14:01:20] (CR) Faidon Liambotis: [C: 2] Remove csw2-esams public IPs, csw1-esams [dns] - https://gerrit.wikimedia.org/r/201191 (owner: Faidon Liambotis) [14:01:55] akosiaris: strange thing though - books.google.com returns 401 for both https?, but the same url with http works with zotero [14:01:58] https does not [14:02:02] well, times out [14:02:16] 401s wit curl [14:02:18] with [14:02:24] true. it also might be completely unrelated [14:02:31] as I said, UA filtering ? [14:02:32] :) [14:02:38] let's verify that [14:02:40] (PS1) Andrew Bogott: One-letter fix! s/horison/horizon/ [puppet] - https://gerrit.wikimedia.org/r/201192 [14:03:34] mobrovac: indeed [14:03:42] curl is blocked. unrelated [14:03:46] (CR) Andrew Bogott: [C: 2] One-letter fix! s/horison/horizon/ [puppet] - https://gerrit.wikimedia.org/r/201192 (owner: Andrew Bogott) [14:04:14] _joe_: hola, yt? [14:04:22] <_joe_> nuria: I am [14:04:28] Tada! http://article.gmane.org/gmane.comp.db.cassandra.user/45442 [14:04:41] mobrovac: that being said, curl -A "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.101 Safari/537.36" -v -x url-downloader.wikimedia.org:8080 "https://books.google.com" [14:04:45] works just fine [14:04:54] <_joe_> urandom: eheh [14:05:02] _joe_: this commit that got merged as of recent: ] https://gerrit.wikimedia.org/r/#/c/198566/ [14:05:25] akosiaris: let's try this - i'll up the timeout in zotero-server manually on sca1001, you then restart zotero and let's see if that's the root cause ? 
[14:05:26] _joe_: doesn't work on labs, returns an error [14:05:27] urandom: lol... not surprised [14:05:38] <_joe_> nuria: uhm we fixed that, supposedly [14:05:38]