[00:00:12] PROBLEM - Host cp3031 is DOWN: PING CRITICAL - Packet loss = 100%
[00:00:12] PROBLEM - Host cp3036 is DOWN: PING CRITICAL - Packet loss = 100%
[00:00:12] PROBLEM - Host cp3035 is DOWN: PING CRITICAL - Packet loss = 100%
[00:00:13] PROBLEM - Host cp3032 is DOWN: PING CRITICAL - Packet loss = 100%
[00:00:13] PROBLEM - Host cp3034 is DOWN: PING CRITICAL - Packet loss = 100%
[00:00:13] PROBLEM - Host cp3033 is DOWN: PING CRITICAL - Packet loss = 100%
[00:00:13] PROBLEM - Host cp3039 is DOWN: PING CRITICAL - Packet loss = 100%
[00:00:14] PROBLEM - Host cp3037 is DOWN: PING CRITICAL - Packet loss = 100%
[00:00:32] PROBLEM - Host cp3038 is DOWN: PING CRITICAL - Packet loss = 100%
[00:00:32] RECOVERY - puppet last run on amssq61 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures
[00:00:42] PROBLEM - Host cp3030 is DOWN: PING CRITICAL - Packet loss = 100%
[00:04:55] kaldari: Now syncing Gather for wmf22
[00:05:51] !log catrope Synchronized php-1.25wmf22/extensions/Gather: SWAT (duration: 01m 06s)
[00:05:59] Logged the message, Master
[00:09:05] RoanKattouw: actually, I’m going to wait until wmf23 is synced too
[00:09:34] RoanKattouw: let me know if you want me to do the submodule update for wmf23
[00:10:22] !log catrope Synchronized php-1.25wmf23/extensions/VisualEditor: SWAT (duration: 01m 07s)
[00:10:28] Logged the message, Master
[00:11:29] !log catrope Synchronized php-1.25wmf23/extensions/ImageMetrics: SWAT (duration: 01m 07s)
[00:11:34] Logged the message, Master
[00:11:55] !log ssh: connect to host mw2213.codfw.wmnet port 22: Connection timed out
[00:12:00] Logged the message, Mr. Obvious
[00:12:36] !log catrope Synchronized php-1.25wmf23/extensions/Gather: SWAT (duration: 01m 07s)
[00:12:41] Logged the message, Master
[00:13:23] kaldari: tgr: Your stuff is deployed now, please check
[00:13:35] RoanKattouw, the mw2213 thing is known
[00:13:45] could not even ping that host earlier
[00:13:53] RoanKattouw: testing.
Looks like there might still be a problem, so I’m not going to do the config change for now
[00:15:14] (CR) Kaldari: "Looks like there may still be some problems with Gather." [mediawiki-config] - https://gerrit.wikimedia.org/r/201065 (owner: Kaldari)
[00:15:17] springle: around?
[00:15:24] springle: mostly just ping about https://gerrit.wikimedia.org/r/#/c/200170/ :)
[00:17:41] YuviPanda: ok
[00:20:43] RoanKattouw_away: works, thanks!
[00:21:39] (CR) Springle: [C: 1] "Since this is 5.5, the log file size change will need: https://dev.mysql.com/doc/refman/5.5/en/innodb-data-log-reconfiguration.html" [puppet] - https://gerrit.wikimedia.org/r/200170 (https://phabricator.wikimedia.org/T88234) (owner: coren)
[00:23:19] (CR) Springle: Labs: puppetize labstore1005's mysql setup (1 comment) [puppet] - https://gerrit.wikimedia.org/r/200170 (https://phabricator.wikimedia.org/T88234) (owner: coren)
[00:24:06] (CR) Yuvipanda: Labs: puppetize labstore1005's mysql setup (1 comment) [puppet] - https://gerrit.wikimedia.org/r/200170 (https://phabricator.wikimedia.org/T88234) (owner: coren)
[00:25:29] Coren: I’m going to do my checklist (https://etherpad.wikimedia.org/p/labs-maint-checklist, not fully done) on ^, and see how that goes
[00:26:24] (CR) coren: Labs: puppetize labstore1005's mysql setup (1 comment) [puppet] - https://gerrit.wikimedia.org/r/200170 (https://phabricator.wikimedia.org/T88234) (owner: coren)
[00:27:55] urandom: no, it hasn't. would you be willing to do it?
[00:29:00] urandom: i didnt just run salt or something because < gwicke> with coordinated being 'wait until a node is back up & is processing requests' before proceeding
[00:29:51] springle: if we move to mariadb at some point, would that also require a window?
[00:30:15] mutante: I can restart them
[00:30:26] urandom: thanks
[00:30:28] !log restarting cassandra on restbase1002
[00:30:34] Logged the message, Master
[00:31:36] mutante: thanks for making this change!
[00:31:45] mutante: after the restart it typically takes ~1 minute for a node to re-join the cluster
[00:32:44] alright
[00:33:17] !log restarting cassandra on restbase1003
[00:33:22] Logged the message, Master
[00:33:27] YuviPanda: It /is/ mariadb; just not mariadb 10
[00:33:42] oh, right. I thought we were gonna go from mysql 5.5 to mariadb 10...
[00:33:51] Is jenkins just flooded or broken?
[00:34:13] Coren: https://phabricator.wikimedia.org/T94643 does that sound like too much overhead to file a ticket like that?
[00:34:17] springle: ^
[00:35:05] !log restarting cassandra on restbase1004
[00:35:10] Logged the message, Master
[00:36:01] YuviPanda: The ticket is nice and verbose, and can double as announcement, but omygersh bold oversize FONT! :-)
[00:36:24] Coren: haha :D yeah, I can basically copy paste that once filled in as an announcement...
[00:37:37] What could go wrong? is always fun
[00:40:35] !log restarting cassandra on restbase1005
[00:40:35] mutante: heh, I guess I wouldn’t put that in the announcement email
[00:40:37] Logged the message, Master
[00:40:54] !log restarting cassandra on restbase1006
[00:40:59] Logged the message, Master
[00:42:52] YuviPanda: "what could possibly go wrong" is a popular gerrit merge comment :)
[00:43:09] eh
[00:43:10] heh
[00:43:27] and then i think about an actual answer
[00:43:57] and your template asks for that, so you can come up with really bad but also really unlikely stuff
[00:44:02] I know of only one worse thing to say to call doom upon oneself.
[00:44:50] mutante: Coren I just added ‘an’ answer.
[00:44:58] mutante: Coren I guess the question needs refining then...
[00:46:11] it's seriously just hard to answer
[00:46:12] "At least things can't get any worse now."
[00:46:37] hehe
[00:46:42] Coren: mutante so the question in full says 'What could go wrong? Worst case predictions? '
[00:47:17] mutante: Coren if it seems hard to answer I can just take it out.
[00:47:23] I think > Is it possible to stop mid-way / roll back? If so, how? And what effects will that have?
[00:47:25] is more important...
[00:47:30] (and of course, the answer can be ‘no’)
[00:47:31] YuviPanda: The problem with that question is that truly worst cases are vanishingly unlikely but infinite in numbers.
[00:47:38] i guess it's all about where you draw the line what is still considered likely
[00:47:45] Coren: right. maybe phrase it like ‘most probable failure case’?
[00:47:52] Rather than estimate what could go wrong, having contingencies in place is key.
[00:47:59] yeah, totally.
[00:48:01] i could always fall on rm -rf /
[00:48:10] I guess part of the point of the question is to have you think about contingencies...
[00:48:25] Sure, but if you find the most probable failure case, why are you not guarding against it explicitly? :-)
[00:48:41] totally. That’s kind of the hope :D
[00:49:03] For instance, yesterday, had I considered Precise would crap its pants I wouldn't have done the switch this way at all.
[00:49:04] like, if we had done this for the NFS change, and gone ‘well, they might need a reboot’, then maybe we would’ve had a script to do the reboots? Or at least thought about the effects of mass reboot...
[00:49:05] or something.
[00:49:30] YuviPanda: See, the problem is I *thought* I ruled this out because I tested it first. :-)
[00:49:43] So the contingency wins over the scenario.
[00:49:52] that's also the thing about the known unknowns and the unknown unknowns
[00:49:53] I should have planned for a true outage in the first place.
[00:50:10] true.
[00:50:16] how would we have answered
[00:50:18] > Is it possible to stop mid-way / roll back? If so, how? And what effects will that have?
[00:50:20] Even though it seemed like a simple salt run would have done it.
[00:50:28] For the switch?
[00:50:31] yeah...
[00:50:41] * Coren ponders.
[00:51:36] "Not per se; rolling back has necessarily the same impact as the switch itself; if the new filesystem fails to work, however, we can switch back."
[01:02:07] (PS1) Andrew Bogott: Add a Horizon-specific nova policy file. [puppet] - https://gerrit.wikimedia.org/r/201088
[01:02:16] (CR) jenkins-bot: [V: -1] Add a Horizon-specific nova policy file. [puppet] - https://gerrit.wikimedia.org/r/201088 (owner: Andrew Bogott)
[01:03:35] Coren: mutante so I got rid of the ‘what could go wrong’ part.
[01:04:11] (PS2) Andrew Bogott: Add a Horizon-specific nova policy file [puppet] - https://gerrit.wikimedia.org/r/201088
[01:04:21] (CR) jenkins-bot: [V: -1] Add a Horizon-specific nova policy file [puppet] - https://gerrit.wikimedia.org/r/201088 (owner: Andrew Bogott)
[01:08:46] springle: how long do you think downtime is gonna be? If it’s short enough I guess we could just do it on thursday… If not next tuesday maybe?
[01:31:46] eh, i clicked a link on integration.wm and it responded with
[01:31:52] Please wait while Jenkins is restarting
[01:32:23] PROBLEM - zuul_service_running on gallium is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/local/bin/zuul-server
[01:32:41] it wasn't the "restart service" button :p
[01:32:53] PROBLEM - zuul_gearman_service on gallium is CRITICAL: Connection refused
[01:33:00] mutante: It's being restarted..
[01:33:05] That takes about 30 minutes on average
[01:33:28] ok, but did we expect the restart?
[01:33:47] Yes
[01:33:51] ok, good :)
[01:34:00] it just seemed to me like i caused it almost
[01:35:27] !log started zuul on gallium
[01:35:36] Logged the message, Master
[01:35:43] RECOVERY - zuul_service_running on gallium is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/local/bin/zuul-server
[01:36:13] RECOVERY - zuul_gearman_service on gallium is OK: TCP OK - 0.000 second response time on port 4730
[01:48:44] YuviPanda: say 15min
[01:48:58] might be less (should be, but hey)
[01:49:25] yeah, better buffer and not need it than otherwise, I guess :)
[01:50:34] PROBLEM - zuul_service_running on gallium is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/local/bin/zuul-server
[01:51:03] PROBLEM - zuul_gearman_service on gallium is CRITICAL: Connection refused
[02:06:02] RECOVERY - zuul_gearman_service on gallium is OK: TCP OK - 0.000 second response time on port 4730
[02:07:12] RECOVERY - zuul_service_running on gallium is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/local/bin/zuul-server
[02:25:19] (PS3) Yuvipanda: Labs: puppetize labsdb1005's mysql setup [puppet] - https://gerrit.wikimedia.org/r/200170 (https://phabricator.wikimedia.org/T88234) (owner: coren)
[02:25:30] (CR) jenkins-bot: [V: -1] Labs: puppetize labsdb1005's mysql setup [puppet] - https://gerrit.wikimedia.org/r/200170 (https://phabricator.wikimedia.org/T88234) (owner: coren)
[02:26:22] (PS1) Jalexander: Allow gather-hidelist to be used in global user groups [mediawiki-config] - https://gerrit.wikimedia.org/r/201093 (https://phabricator.wikimedia.org/T94652)
[02:26:25] (CR) jenkins-bot: [V: -1] Allow gather-hidelist to be used in global user groups [mediawiki-config] - https://gerrit.wikimedia.org/r/201093 (https://phabricator.wikimedia.org/T94652) (owner: Jalexander)
[02:26:36] damn that commit took a long time to submit
[02:26:52] Jamesofur: the -1 is just jenkins being dead
[02:27:02] oh good,
because I just got massively confused
[02:27:04] thanks :)
[02:28:07] (CR) Alex Monk: "Is this something we need to do in config so that it takes effect on meta? :/" [mediawiki-config] - https://gerrit.wikimedia.org/r/201093 (https://phabricator.wikimedia.org/T94652) (owner: Jalexander)
[02:28:38] Jamesofur, jenkins is very dead very often
[02:29:03] !log l10nupdate Synchronized php-1.25wmf22/cache/l10n: (no message) (duration: 08m 46s)
[02:29:13] Logged the message, Master
[02:30:15] Krenair: amen
[02:30:22] (CR) Jalexander: "Sadly, yes, as far as I know we do :-/ if the extension was loaded on meta it should appear but it isn't (and I don't think there is a pla" [mediawiki-config] - https://gerrit.wikimedia.org/r/201093 (https://phabricator.wikimedia.org/T94652) (owner: Jalexander)
[02:30:26] :( poor Jenkins
[02:30:51] and that often leads to people +2ing over jenkins which is not good
[02:31:00] yup, it teaches bad habits
[02:31:03] Jamesofur, we also wouldn't get i18n for the right on meta
[02:31:08] (PS3) Andrew Bogott: Add a Horizon-specific nova policy file. [puppet] - https://gerrit.wikimedia.org/r/201088
[02:31:08] little cry wolf
[02:31:18] (CR) jenkins-bot: [V: -1] Add a Horizon-specific nova policy file. [puppet] - https://gerrit.wikimedia.org/r/201088 (owner: Andrew Bogott)
[02:31:25] Krenair: yeah, though I don't have an enormous issue just filling that in if necessary.
[02:31:30] but I don't think the stewards mind that too much
[02:31:37] yeah
[02:31:50] just a bit annoying we have to do it at all
[02:31:54] * Jamesofur nods
[02:31:56] agreed
[02:32:10] not pretty
[02:32:29] I've been trying to clear up Jenkins a bit
[02:32:55] I've been looking at the things it yells about often and clearing them up
[02:33:13] that's good, much needed
[02:33:45] as in, repositories/branches that it can't pass new commits on?
[02:35:09] no really just lint tests
[02:35:29] I haven't pushed most of them yet
[02:35:44] !log LocalisationUpdate completed (1.25wmf22) at 2015-04-01 02:34:41+00:00
[02:35:51] Logged the message, Master
[02:39:28] Wow! Google smartbox!
[02:40:10] (it's officially april fools at google)
[02:41:44] haha
[02:42:22] they're quite early this year
[02:42:33] early?
[02:42:48] relative to past years
[02:42:56] also google maps had a change...
[02:43:33] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:43:54] (CR) Tim Landscheidt: tools: Ensure that proxylistener service is running (1 comment) [puppet] - https://gerrit.wikimedia.org/r/199830 (https://phabricator.wikimedia.org/T93121) (owner: Yuvipanda)
[02:53:12] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 59669 bytes in 0.065 second response time
[02:55:17] (PS8) Dzahn: WIP cassandra: add ferm rules [puppet] - https://gerrit.wikimedia.org/r/197840 (https://phabricator.wikimedia.org/T92680)
[02:58:17] !log l10nupdate Synchronized php-1.25wmf23/cache/l10n: (no message) (duration: 08m 41s)
[02:58:26] Logged the message, Master
[03:00:31] (PS9) Dzahn: WIP cassandra: add ferm rules [puppet] - https://gerrit.wikimedia.org/r/197840 (https://phabricator.wikimedia.org/T92680)
[03:05:04] !log LocalisationUpdate completed (1.25wmf23) at 2015-04-01 03:04:00+00:00
[03:05:11] Logged the message, Master
[03:05:34] (CR) Dzahn: [C: 1] "this version confirmed with compiler:" [puppet] - https://gerrit.wikimedia.org/r/197840 (https://phabricator.wikimedia.org/T92680) (owner: Dzahn)
[03:06:41] (PS10) Dzahn: cassandra: add ferm rules using hiera data [puppet] - https://gerrit.wikimedia.org/r/197840 (https://phabricator.wikimedia.org/T92680)
[03:15:31] (CR) Alex Monk: "recheck" [puppet] - https://gerrit.wikimedia.org/r/201088 (owner: Andrew Bogott)
[03:15:41] (CR) Alex Monk: "recheck" [mediawiki-config] -
https://gerrit.wikimedia.org/r/201093 (https://phabricator.wikimedia.org/T94652) (owner: Jalexander)
[03:15:58] (CR) Alex Monk: "recheck" [puppet] - https://gerrit.wikimedia.org/r/200170 (https://phabricator.wikimedia.org/T88234) (owner: coren)
[03:16:10] I never really found a reason to do a def main() in python
[03:16:12] until now...
[03:16:22] What's the reason?
[03:18:53] Krenair: scope clash.
[03:19:07] I have a ‘job’ variable outside, and one in a function, and they would clash
[03:21:39] (PS4) Andrew Bogott: Add a Horizon-specific nova policy file. [puppet] - https://gerrit.wikimedia.org/r/201088
[03:25:34] (CR) jenkins-bot: [V: -1] Add a Horizon-specific nova policy file. [puppet] - https://gerrit.wikimedia.org/r/201088 (owner: Andrew Bogott)
[03:28:28] (PS5) Andrew Bogott: Add a Horizon-specific nova policy file. [puppet] - https://gerrit.wikimedia.org/r/201088
[03:59:58] (PS1) Yuvipanda: tools: Make webservice2 block for start / stop [puppet] - https://gerrit.wikimedia.org/r/201100 (https://phabricator.wikimedia.org/T93334)
[04:00:06] legoktm: help review python?
^
[04:03:13] (PS2) Yuvipanda: tools: Make webservice2 block for start / stop [puppet] - https://gerrit.wikimedia.org/r/201100 (https://phabricator.wikimedia.org/T93334)
[04:25:32] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 7.69% of data above the critical threshold [500.0]
[04:37:03] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[05:15:13] PROBLEM - puppet last run on amssq38 is CRITICAL: CRITICAL: puppet fail
[05:31:52] RECOVERY - puppet last run on amssq38 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[05:52:53] PROBLEM - Disk space on uranium is CRITICAL: DISK CRITICAL - free space: / 348 MB (3% inode=84%):
[05:58:46] <_joe_> Apr 1 05:58:20 uranium /usr/sbin/gmetad[16597]: RRD_update (/srv/ganglia/rrds/Text caches ulsfo/cp4017.ulsfo.wmnet/kafka.rdkafka.brokers.analytics1022-eqiad-wmnet_9092.22.tx.per_second.rrd): rrdcached: illegal attempt to update using time 1427867895.000000 when last update time is 1427867895.000000 (minimum one second step)
[06:09:32] RECOVERY - Disk space on uranium is OK: DISK OK
[06:10:54] <_joe_> !log manually rotated and compressed syslog and apache logs on uranium, still being spammed by kafka brokers
[06:11:00] Logged the message, Master
[06:11:23] (CR) Gilles: "Easier/automated updating when security patches are issued is the main upside of pointing to system packages, yes."
[puppet] - https://gerrit.wikimedia.org/r/199598 (https://phabricator.wikimedia.org/T84956) (owner: Gilles)
[06:29:43] PROBLEM - puppet last run on cp4003 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:43:27] !log LocalisationUpdate ResourceLoader cache refresh completed at Wed Apr 1 06:42:24 UTC 2015 (duration 42m 23s)
[06:43:34] Logged the message, Master
[06:44:47] (PS1) Gage: IPsec: improved cipher selection [puppet] - https://gerrit.wikimedia.org/r/201135
[06:46:22] RECOVERY - puppet last run on cp4003 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[06:56:11] hi everyone, my name is Moritz Mühlenhoff and I'll be working with you starting today :-)
[06:57:27] <_joe_> hi moritz
[06:57:31] Mark is about to send an announcement mail, but I don't think that went out since he said he'd be in the Amsterdam data centre most of the day
[06:57:31] <_joe_> and welcome
[06:57:34] <_joe_> Giuseppe here
[06:58:23] hi Giuseppe
[06:58:35] <_joe_> a few more europeans tech ops will show up in a few, I guess
[06:58:59] my name is intricately German, so feel free to simply refer to me as jmm...
[06:59:51] <_joe_> eheh, ok fair enough
[06:59:59] <_joe_> I insist everyone calls me joe as well
[07:00:22] <_joe_> (Giuseppe is pretty hard to pronounce correctly for most native english speakers)
[07:07:55] jmm_: can't we call you "the hoff" instead? ;)
[07:08:00] jmm_: hey, welcome
[07:08:07] !
[07:10:22] gilles: hopefully not :-) the significance of "the hoff" in Germany is very much overrated, that's mostly a media phenomenon
[07:10:26] hi ori
[07:10:54] i wonder if we have anyone else whose first and last names form an alliteration
[07:11:23] bblack!
:)
[07:11:28] <_joe_> yep
[07:11:35] <_joe_> I was about to say that
[07:12:13] all I can say is the name moritz brings me good memories http://upload.wikimedia.org/wikipedia/commons/3/33/Cervesa_Moritz_Llauna.jpg
[07:12:44] oh, and Krinkle
[07:13:07] <_joe_> right
[07:13:38] avoid rochester! (https://en.wikipedia.org/wiki/Alphabet_murders)
[07:14:05] gilles: I was very surprised when I saw that the first time in a supermarket
[07:14:10] not that there is a shortage of reasons to avoid rochester
[07:14:20] (CR) Krinkle: "fixme: T94669. For labs instances the ganglia_class property doesn't seem to be getting a default from anywhere." [puppet] - https://gerrit.wikimedia.org/r/198566 (https://phabricator.wikimedia.org/T93776) (owner: Giuseppe Lavagetto)
[07:16:57] <_joe_> Krinkle: ouch, I thought ganglia wasn't used in labs?
[07:17:33] _joe_: I dunno, it's being parsed at least. Which causes errors.
[07:17:52] <_joe_> Krinkle: oh, damn import.
[07:17:58] <_joe_> sigh, sorry
[07:18:01] <_joe_> fixing it
[07:18:05] It seems beta and staging also both set ganglia_class=old in their Hiera config
[07:18:11] but not the rest of labs :)
[07:19:27] <_joe_> Krinkle: yes, sorry
[07:24:17] (PS1) Giuseppe Lavagetto: ganglia: set a default ganglia_class for labs in general as well [puppet] - https://gerrit.wikimedia.org/r/201139
[07:26:12] hi jmm_ ! welcome
[07:26:35] godog: can you restart apertium-apy on sca1001/sca1002 please?
[07:26:52] godog: hi Filippo!
[07:27:12] jmm_: welcome!
[07:27:18] kart_: sure, what's up?
[07:27:31] godog: need restart as new pairs are installed.
[07:27:53] they work fine in beta, but says not available in production. restart can help.
[07:28:20] !log bounce apertium-apy on sca1001/sca1002
[07:28:28] Logged the message, Master
[07:28:39] <_joe_> Krinkle: ^^
[07:28:42] legoktm: thanks :-)
[07:28:51] (CR) Krinkle: [C: 1] ganglia: set a default ganglia_class for labs in general as well [puppet] - https://gerrit.wikimedia.org/r/201139 (owner: Giuseppe Lavagetto)
[07:29:04] _joe_: cherry-picking to integration-puppetmaster now
[07:29:17] kart_: hah, mind opening a phab to get you access to restart apertium too?
[07:30:18] godog: well, I did: https://phabricator.wikimedia.org/T89222 :P
[07:30:35] godog: read it :)
[07:32:05] kart_: ah ok, by design
[07:33:04] godog: restart access is fine though. log is separate thing.
[07:44:56] (PS2) Krinkle: ganglia: set a default ganglia_class for labs in general as well [puppet] - https://gerrit.wikimedia.org/r/201139 (https://phabricator.wikimedia.org/T94669) (owner: Giuseppe Lavagetto)
[07:45:15] (CR) Krinkle: "Deployed on integration-puppetmaster. Works as expected :)" [puppet] - https://gerrit.wikimedia.org/r/201139 (https://phabricator.wikimedia.org/T94669) (owner: Giuseppe Lavagetto)
[07:45:41] (CR) Giuseppe Lavagetto: [C: 2] "Thanks for testing!" [puppet] - https://gerrit.wikimedia.org/r/201139 (https://phabricator.wikimedia.org/T94669) (owner: Giuseppe Lavagetto)
[07:45:52] (CR) Giuseppe Lavagetto: [V: 2] "Thanks for testing!" [puppet] - https://gerrit.wikimedia.org/r/201139 (https://phabricator.wikimedia.org/T94669) (owner: Giuseppe Lavagetto)
[07:46:20] godog: thanks!
[07:47:48] np
[08:09:57] (PS1) Filippo Giunchedi: wikimania_scholarships: don't manage open/close dates [puppet] - https://gerrit.wikimedia.org/r/201143 (https://phabricator.wikimedia.org/T92358)
[08:21:26] (CR) Alexandros Kosiaris: [C: 1] cassandra: add ferm rules using hiera data [puppet] - https://gerrit.wikimedia.org/r/197840 (https://phabricator.wikimedia.org/T92680) (owner: Dzahn)
[08:28:58] (PS6) KartikMistry: CX: Enable newarticle campaign in cawiki and eswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/197491 (https://phabricator.wikimedia.org/T90876)
[09:19:12] PROBLEM - puppet last run on virt1011 is CRITICAL: CRITICAL: Puppet has 9 failures
[09:27:39] hoo: here?
[09:27:43] yes
[09:27:51] I saw your message from last night
[09:27:59] (and then forgot about it)
[09:28:07] still an issue?
[09:28:22] paravoid: not this morning
[09:28:30] won't know until evening
[09:28:31] No, it's fine now... but yesterday evening it was insanely slow
[09:28:38] mtrs, yes :)
[09:28:47] https://phabricator.wikimedia.org/P463
[09:28:50] if that's helpful at all
[09:29:14] proxycommand made gerrit super fast
[09:35:52] RECOVERY - puppet last run on virt1011 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:43:22] (CR) Hashar: "I am tempted to entirely ignore the puppet_url_without_modules . From the issue conversation, it is not part of the style guide." [puppet] - https://gerrit.wikimedia.org/r/198116 (https://phabricator.wikimedia.org/T87132) (owner: Tim Landscheidt)
[09:43:54] (CR) Hashar: "Thanks!"
[puppet] - https://gerrit.wikimedia.org/r/200770 (owner: Legoktm)
[09:50:10] (PS1) Giuseppe Lavagetto: ganglia: fix bits caches in ulsfo [puppet] - https://gerrit.wikimedia.org/r/201154
[09:53:01] (CR) Giuseppe Lavagetto: [C: 2 V: 2] ganglia: fix bits caches in ulsfo [puppet] - https://gerrit.wikimedia.org/r/201154 (owner: Giuseppe Lavagetto)
[09:58:58] _joe_: godog: akosiaris: when you have a minute, could you take a look at https://github.com/wikimedia/service-template-node/pull/29 ?
[09:59:07] it's a collection of service init scripts
[09:59:20] templates for service init scripts, better yet
[09:59:39] <_joe_> mobrovac: will do, but I think those should not be part of that template tbh, but templates in puppet
[10:00:01] _joe_: yes, that's the idea that they eventually end up in puppet
[10:00:08] but you gotta start somewhere :)
[10:00:42] <_joe_> mobrovac: we do have a nice module for handling multiple init scripts btw, base::something
[10:00:55] yep, base::service_init
[10:00:58] <_joe_> (and mind it, I wrote it, I can't remember the silly name I gave it)
[10:01:01] <_joe_> ahah ok
[10:01:01] no, service_unit
[10:01:01] <_joe_> :P
[10:01:26] if you look more closely at my PR, the names match what service_unit expects
[10:01:50] the idea is that you pass service_unit the templates, and it chooses the init script to use based on the system
[10:01:52] very neat
[10:02:10] tim rages against the fd limit
[10:02:49] he argues (compellingly) that it should be unlimited.
not because services should be fast and loose with file descriptors, but because there really isn't any benefit in having upstart kill your service because you run into some arbitrary limit much lower than the system's
[10:03:19] i agree
[10:03:38] <_joe_> I don't
[10:03:49] <_joe_> and I already explained that at length to tim
[10:04:03] <_joe_> so I won't repeat myself
[10:04:05] hence, why i didn't put unlimited there :)
[10:04:42] <_joe_> esp if you're distributing your software with init scripts to everyone else, setting a reasonable open files limit for typical usage is a good idea
[10:05:03] <_joe_> I do agree we don't need an open file limit on single-purpose machines, say hhvm appservers
[10:05:09] <_joe_> well, sort of
[10:06:13] high-but-unlimited
[10:12:41] mobrovac: yep will take a look
[10:21:04] (CR) JanZerebecki: [C: 1] wikimania_scholarships: don't manage open/close dates [puppet] - https://gerrit.wikimedia.org/r/201143 (https://phabricator.wikimedia.org/T92358) (owner: Filippo Giunchedi)
[10:36:42] PROBLEM - Host cp3040 is DOWN: PING CRITICAL - Packet loss = 100%
[10:37:13] PROBLEM - Host cp3042 is DOWN: PING CRITICAL - Packet loss = 100%
[10:37:13] PROBLEM - Host cp3047 is DOWN: PING CRITICAL - Packet loss = 100%
[10:37:13] PROBLEM - Host cp3043 is DOWN: PING CRITICAL - Packet loss = 100%
[10:37:13] PROBLEM - Host cp3048 is DOWN: PING CRITICAL - Packet loss = 100%
[10:37:33] RECOVERY - Host cp3043 is UP: PING OK - Packet loss = 0%, RTA = 90.21 ms
[10:37:33] RECOVERY - Host cp3042 is UP: PING OK - Packet loss = 0%, RTA = 95.63 ms
[10:37:33] RECOVERY - Host cp3048 is UP: PING OK - Packet loss = 0%, RTA = 95.41 ms
[10:37:53] RECOVERY - Host cp3040 is UP: PING OK - Packet loss = 0%, RTA = 88.92 ms
[10:37:53] RECOVERY - Host cp3047 is UP: PING OK - Packet loss = 0%, RTA = 88.90 ms
[11:21:32] PROBLEM - Router interfaces on cr1-esams is CRITICAL: CRITICAL: host 91.198.174.245, interfaces up: 96, down: 1, dormant: 0, excluded: 0, unused:
0; xe-0/0/3: down - cp3018:eth1 (TEMP)
[12:03:10] godog: thnx for the review
[12:03:43] (CR) Gergő Tisza: [C: -1] "The exact steps needed to build/update this should be included in a readme file. Also, a list of what python packages are included (and wh" [software/sentry] - https://gerrit.wikimedia.org/r/201006 (https://phabricator.wikimedia.org/T84956) (owner: Gilles)
[12:03:53] (PS1) Faidon Liambotis: Drain esams of all traffic (scheduled maintenance) [dns] - https://gerrit.wikimedia.org/r/201172
[12:03:59] bblack: ^
[12:04:22] (CR) Filippo Giunchedi: [WIP] Add role::mediawiki_vagrant_lxc (2 comments) [puppet] - https://gerrit.wikimedia.org/r/193665 (owner: BryanDavis)
[12:04:41] mobrovac: yw!
[12:05:46] (CR) BBlack: [C: 1] Drain esams of all traffic (scheduled maintenance) [dns] - https://gerrit.wikimedia.org/r/201172 (owner: Faidon Liambotis)
[12:06:48] (CR) Faidon Liambotis: [C: 2] Drain esams of all traffic (scheduled maintenance) [dns] - https://gerrit.wikimedia.org/r/201172 (owner: Faidon Liambotis)
[12:07:08] (PS1) Faidon Liambotis: Revert "Drain esams of all traffic (scheduled maintenance)" [dns] - https://gerrit.wikimedia.org/r/201173
[12:07:12] let's have that ready too :)
[12:07:31] :)
[12:08:00] !log draining esams
[12:08:07] Logged the message, Master
[12:08:21] How long will that take?
[12:08:26] (maint. window)
[12:08:37] an hour or two?
[12:08:50] Ok, so not into the evening hours, good
[12:08:58] hopefully not :)
[12:41:00] (PS2) Gilles: Initial venv [software/sentry] - https://gerrit.wikimedia.org/r/201006 (https://phabricator.wikimedia.org/T84956)
[12:42:12] RECOVERY - Router interfaces on cr1-esams is OK: OK: host 91.198.174.245, interfaces up: 83, down: 0, dormant: 0, excluded: 0, unused: 0
[12:45:43] PROBLEM - puppet last run on wtp1006 is CRITICAL: CRITICAL: puppet fail
[12:48:58] gilles: I am wondering whether we should file a task for ops to review the Sentry deployment strategy
[12:49:10] so they can triage / assign the review properly
[12:49:25] sure, why not
[12:49:32] gilles: could be made a sub task of https://phabricator.wikimedia.org/T84956 and added to their #operations project
[12:49:47] though adding reviewers on https://gerrit.wikimedia.org/r/#/c/201006/ might be enough hehe
[13:01:02] PROBLEM - puppet last run on mw2158 is CRITICAL: CRITICAL: Puppet has 1 failures
[13:04:13] RECOVERY - puppet last run on wtp1006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[13:06:07] (PS1) Hashar: zuul: replace 'zuul-merger' by $NAME [puppet] - https://gerrit.wikimedia.org/r/201181
[13:06:09] (PS2) Alexandros Kosiaris: Assign role::ganeti to ganeti boxes [puppet] - https://gerrit.wikimedia.org/r/200587
[13:07:22] PROBLEM - Router interfaces on cr1-esams is CRITICAL: CRITICAL: host 91.198.174.245, interfaces up: 67, down: 3, dormant: 0, excluded: 0, unused: 0; xe-0/0/2: down - Core: csw2-esams:xe-4/1/0 {#10614} [10Gbps DF]; xe-1/2/0: down - Core: csw2-esams:xe-5/0/39 {#10088} [10Gbps DF]; xe-1/1/0: down - Core: csw2-esams:xe-5/0/38 {#10089} [10Gbps DF]
[13:09:07] (ignore that)
[13:12:13] RECOVERY - Router interfaces on cr1-esams is OK: OK: host 91.198.174.245, interfaces up: 81, down: 0, dormant: 0, excluded: 0, unused: 0
[13:17:43] RECOVERY - puppet last run on mw2158 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0
failures
[13:40:08] (PS1) Faidon Liambotis: Use csw2-esams' mgmt interface [puppet] - https://gerrit.wikimedia.org/r/201184
[13:42:30] (CR) BBlack: "My concern here is the CPU impact of selecting such strong ECDH." [puppet] - https://gerrit.wikimedia.org/r/201135 (owner: Gage)
[13:45:32] (CR) Faidon Liambotis: [C: 2] Use csw2-esams' mgmt interface [puppet] - https://gerrit.wikimedia.org/r/201184 (owner: Faidon Liambotis)
[13:45:39] jenkins broken
[13:45:48] (CR) Faidon Liambotis: [V: 2] Use csw2-esams' mgmt interface [puppet] - https://gerrit.wikimedia.org/r/201184 (owner: Faidon Liambotis)
[13:46:34] (PS1) coren: Tweaks to the conntrack collector: [puppet] - https://gerrit.wikimedia.org/r/201188 (https://phabricator.wikimedia.org/T90437)
[13:50:15] (CR) coren: [C: 2] "Trivial fix that affects a single (not working) collector on one box." [puppet] - https://gerrit.wikimedia.org/r/201188 (https://phabricator.wikimedia.org/T90437) (owner: coren)
[13:52:27] (PS6) Andrew Bogott: Add a Horizon-specific nova policy file. [puppet] - https://gerrit.wikimedia.org/r/201088
[13:53:12] akosiaris: ping
[13:53:28] akosiaris: i think i figured the zotero translation timeout issue
[13:53:45] s/figured/figured out/
[13:54:03] mobrovac: ok, please do tell
[13:54:29] akosiaris: it seems zotero <-> urldownloader comm does not work for https
[13:54:33] :/
[13:54:43] someone here which is sitting in datacenter?
[13:54:54] akosiaris: when i input the same url but http:// it works
[13:54:57] (PS7) Andrew Bogott: Add a Horizon-specific nova policy file. [puppet] - https://gerrit.wikimedia.org/r/201088
[13:55:24] mobrovac: interesting, let me check.. Weird though, in other cases https has worked fine through url-downloader
[13:56:12] Steinsplitter: what's up?
[13:57:01] godog: can you make some photos (when you have time, of course) for https://commons.wikimedia.org/wiki/Category:Wikimedia_servers - last are from 2013 [13:57:42] akosiaris: however, judging from the zotero logs, it seems that responses for https arrive too, but too late (the req times out 5 secs before zotero receives the response) [13:58:02] mobrovac: for example: curl -v -x url-downloader.wikimedia.org:8080 "https://www.google.com" [13:58:05] works fine [13:58:13] so this maybe more specific than that [13:58:30] Steinsplitter: that's better tracked in a phab ticket (I'm not in the dc, just on ops duty) [13:58:36] akosiaris: yeah, i know, tried that myself [13:58:59] godog: phab. O_O. ok :-) [13:59:42] (03CR) 10Andrew Bogott: [C: 032] Add a Horizon-specific nova policy file. [puppet] - 10https://gerrit.wikimedia.org/r/201088 (owner: 10Andrew Bogott) [13:59:51] (03PS1) 10coren: More fix to conntrack collector [puppet] - 10https://gerrit.wikimedia.org/r/201190 (https://phabricator.wikimedia.org/T90437) [14:00:04] chasemp: Dear anthropoid, the time has come. Please deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150401T1400). [14:00:06] mobrovac: btw .. curl on books.google.com returns 401 [14:00:22] yep, seen that too [14:00:25] maybe user-agent ? 
[14:00:30] akosiaris: for both http(S) [14:00:33] doesn't seem to be IP or something [14:00:38] that could be a reason as well [14:00:44] (03PS1) 10Faidon Liambotis: Remove csw2-esams public IPs, csw1-esams [dns] - 10https://gerrit.wikimedia.org/r/201191 [14:00:53] not sure if zotero is capable of doing anything with 401s [14:01:01] not that there is much to do [14:01:20] (03CR) 10Faidon Liambotis: [C: 032] Remove csw2-esams public IPs, csw1-esams [dns] - 10https://gerrit.wikimedia.org/r/201191 (owner: 10Faidon Liambotis) [14:01:55] akosiaris: strange thing though - book.google.com return 401 for both https?, but the same url with http works with zotero [14:01:58] https does not [14:02:02] well, times out [14:02:16] 401s wit curl [14:02:18] with [14:02:24] true. it also might be completely unrelated [14:02:31] as I said, UA filtering ? [14:02:32] :) [14:02:38] let's verify that [14:02:40] (03PS1) 10Andrew Bogott: One-letter fix! s/horison/horizon/ [puppet] - 10https://gerrit.wikimedia.org/r/201192 [14:03:34] mobrovac: indeed [14:03:42] curl is blocked. unrelated [14:03:46] (03CR) 10Andrew Bogott: [C: 032] One-letter fix! s/horison/horizon/ [puppet] - 10https://gerrit.wikimedia.org/r/201192 (owner: 10Andrew Bogott) [14:04:14] _joe_: hola, yt? [14:04:22] <_joe_> nuria: I am [14:04:28] Tada! http://article.gmane.org/gmane.comp.db.cassandra.user/45442 [14:04:41] mobrovac: that being said, curl -A "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.101 Safari/537.36" -v -x url-downloader.wikimedia.org:8080 "https://books.google.com" [14:04:45] works just fine [14:04:54] <_joe_> urandom: eheh [14:05:02] _joe_: this commit that got merged as of recent: ] https://gerrit.wikimedia.org/r/#/c/198566/ [14:05:25] akosiaris: let's try this - i'll up the timeout in zotero-server manually on sca1001, you restart then zotero and let's see if that's the root cause ? 
[14:05:26] _joe_: doesn't work on labs , returns an error [14:05:27] urandom: lol... not suprised [14:05:38] <_joe_> nuria: uhm we fixed that, supposedly [14:05:38] _joe_: [14:05:42] https://www.irccloud.com/pastebin/uMyPwIBC [14:05:43] <_joe_> where does it happens? [14:05:50] mobrovac: sure, but I fear it may not be the root cause [14:06:00] godog, thanks. i created https://phabricator.wikimedia.org/T94694 :-) [14:06:08] mobrovac: anyway, nothing to lose, let's do it [14:06:18] akosiaris: let's at least rule it out, ok will let you know when to restart [14:06:19] (03CR) 10Chad: [C: 032] Remove a me-only hack from git repos in prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/200938 (owner: 10Chad) [14:06:23] <_joe_> nuria: it was fixed in https://gerrit.wikimedia.org/r/#/c/201139/ [14:06:28] (03Merged) 10jenkins-bot: Remove a me-only hack from git repos in prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/200938 (owner: 10Chad) [14:06:31] (03CR) 10coren: [V: 031] "resolv.conf definitely should be under puppet control; tunables in there can make or break resolution." [puppet] - 10https://gerrit.wikimedia.org/r/200999 (https://phabricator.wikimedia.org/T93691) (owner: 10Andrew Bogott) [14:06:37] _joe_: ahhhh, you can tell i just woke up, Thnak you! [14:06:42] *thank you! [14:06:44] <_joe_> nuria: :) [14:06:52] (03CR) 10Chad: [C: 032] Checkout proper branch when using master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/200996 (owner: 10Chad) [14:06:58] (03Merged) 10jenkins-bot: Checkout proper branch when using master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/200996 (owner: 10Chad) [14:07:09] <_joe_> and sorry for the disruption, I forgot that file is imported everywhere rather than included [14:07:18] (03CR) 10coren: [C: 032] "Typo fxi." 
[puppet] - 10https://gerrit.wikimedia.org/r/201190 (https://phabricator.wikimedia.org/T90437) (owner: 10coren) [14:07:41] akosiaris: heh forgot i can't change it, no root perms :P [14:07:42] akosiaris: [14:08:06] akosiaris: /srv/deployment/zotero/translation-server/translation-server/server_translation.js:28 [14:08:14] change that to, e.g. 60 [14:08:26] on sca1001 [14:09:22] mobrovac: done [14:09:34] Steinsplitter: ack, thanks :D [14:11:11] akosiaris: did you restart zotero? [14:11:14] ah yes, you did [14:11:23] akosiaris: nope, didn't help [14:11:54] maybe the translator is malfunctioning ? [14:12:02] whatever that means [14:12:22] eqiad slowness starting again [14:12:48] akosiaris: if that were the case, the http://url would return an error, which it doesn't [14:13:17] akosiaris: i wonder if https://books.google.com is an exception or the rule [14:20:24] (03CR) 10Andrew Bogott: [C: 032] Disable automatic updating of resolv.conf [puppet] - 10https://gerrit.wikimedia.org/r/200999 (https://phabricator.wikimedia.org/T93691) (owner: 10Andrew Bogott) [14:20:43] !log Shutting down lvs3003 and lvs3004 [14:20:50] Logged the message, Master [14:23:01] akosiaris: i think i've solved the mystery [14:23:03] :P [14:23:10] akosiaris: http://kb.mozillazine.org/Network.proxy.%28protocol%29 [14:23:40] andrewbogott: mistery solved on cgroup and silver btw [14:23:43] akosiaris: in defaults.js, we've got only proxy.http, causing zotero not to use the proxy for https and consequently timing out [14:24:02] PROBLEM - Host lvs3003 is DOWN: PING CRITICAL - Packet loss = 100% [14:24:03] PROBLEM - Host lvs3004 is DOWN: PING CRITICAL - Packet loss = 100% [14:24:05] godog: yeah? What was it? [14:24:42] andrewbogott: missing package cgroup-bin, I've updated the ticket [14:24:59] huh [14:25:00] thanks! [14:25:21] np [14:27:59] mobrovac: I was coming to a similar conclusion. 
zotero has not sent a single request for HTTPS links to url-downloader [14:28:07] let me check your theory [14:28:11] yep [14:28:21] akosiaris: already preparing a puppet patch :P [14:28:28] but yeah check it first [14:28:52] !log asw-esams: mark@asw-esams> request system power-off member 3 [14:28:58] Logged the message, Master [14:29:05] (03CR) 10JanZerebecki: [C: 04-1] "See inline comments." (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/201003 (owner: 10Smalyshev) [14:29:27] (03PS1) 10Pmlineditor: Enable transwiki import from English Wikisource on Telugu Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201196 (https://phabricator.wikimedia.org/T94531) [14:29:29] (03PS1) 10Andrew Bogott: Raise use_dnsmasq for new instances [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201197 [14:30:09] (03PS1) 10Mobrovac: Let zotero use our proxy for HTTPS requests as well [puppet] - 10https://gerrit.wikimedia.org/r/201198 [14:31:26] mobrovac: success [14:31:36] yuhuu [14:31:40] akosiaris: the patch is ^^ [14:32:09] (03CR) 10Alexandros Kosiaris: [C: 04-1] Let zotero use our proxy for HTTPS requests as well (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/201198 (owner: 10Mobrovac) [14:32:24] !log mark@csw2-esams> request system power-off member 1 [14:32:29] Logged the message, Master [14:32:30] mobrovac: yeah, it's network.proxy.ssl, not https [14:32:33] but otherwise LGTM [14:32:38] ah ? [14:33:03] akosiaris: http://kb.mozillazine.org/Network.proxy.%28protocol%29 says possible values are http, https, ftp, ... [14:33:12] yeah I know [14:33:17] but open up a firefox [14:33:24] and do the about:config dance [14:33:32] akosiaris: yeah i did, i see .ssl there [14:33:46] akosiaris: anyhow, putting .ssl works ? 
[14:33:49] yes [14:33:51] ok [14:34:06] * mobrovac wonders if proxy.https would work as well [14:34:07] :P [14:34:14] I can try [14:34:16] gimme a sec [14:34:32] PROBLEM - Router interfaces on cr1-esams is CRITICAL: CRITICAL: host 91.198.174.245, interfaces up: 67, down: 2, dormant: 0, excluded: 0, unused: 0BRxe-0/0/3: down - Core: asw-esams:xe-3/0/44 {#10660} [10Gbps DF]BRxe-1/2/0: down - Core: asw-esams:xe-3/0/45 {#10661} [10Gbps DF]BR [14:35:10] mobrovac: nope [14:35:19] hmmm strange [14:35:21] doc fail i guess [14:35:22] documentation on mozillazine.org is not correct [14:35:25] !log upgrading junos on cr2-knams (esams is depooled, ignore alerts) [14:35:26] yep [14:35:30] Logged the message, Master [14:35:47] (03PS2) 10Mobrovac: Let zotero use our proxy for HTTPS requests as well [puppet] - 10https://gerrit.wikimedia.org/r/201198 [14:38:33] (03CR) 10Alexandros Kosiaris: [C: 032] Let zotero use our proxy for HTTPS requests as well [puppet] - 10https://gerrit.wikimedia.org/r/201198 (owner: 10Mobrovac) [14:42:24] mobrovac: seems like we are OK [14:42:35] nice figuring that out [14:43:03] PROBLEM - BGP status on cr2-knams is CRITICAL: CRITICAL: No response from remote host 91.198.174.246 [14:43:04] yey! [14:43:07] akosiaris: thnx! 
[14:43:33] PROBLEM - Host cp3009 is DOWN: PING CRITICAL - Packet loss = 100% [14:43:33] PROBLEM - Host amssq38 is DOWN: PING CRITICAL - Packet loss = 100% [14:43:33] (speaking of which, i should create a access-req ticket for restarting zotero) [14:43:33] PROBLEM - Host cp3006 is DOWN: PING CRITICAL - Packet loss = 100% [14:43:33] PROBLEM - Host amssq34 is DOWN: PING CRITICAL - Packet loss = 100% [14:43:33] PROBLEM - Host amssq49 is DOWN: PING CRITICAL - Packet loss = 100% [14:43:42] PROBLEM - Host cp3018 is DOWN: PING CRITICAL - Packet loss = 100% [14:43:42] PROBLEM - Host cp3019 is DOWN: PING CRITICAL - Packet loss = 100% [14:43:42] PROBLEM - Host cp3022 is DOWN: PING CRITICAL - Packet loss = 100% [14:43:42] PROBLEM - Host cp3021 is DOWN: PING CRITICAL - Packet loss = 100% [14:43:42] PROBLEM - Host cp3020 is DOWN: PING CRITICAL - Packet loss = 100% [14:43:43] PROBLEM - Host amssq32 is DOWN: PING CRITICAL - Packet loss = 100% [14:43:43] PROBLEM - Host amssq40 is DOWN: PING CRITICAL - Packet loss = 100% [14:43:44] PROBLEM - Host amssq35 is DOWN: PING CRITICAL - Packet loss = 100% [14:43:44] PROBLEM - Host amssq62 is DOWN: PING CRITICAL - Packet loss = 100% [14:43:53] PROBLEM - Host cp3014 is DOWN: PING CRITICAL - Packet loss = 100% [14:43:53] PROBLEM - Host amssq53 is DOWN: PING CRITICAL - Packet loss = 100% [14:43:53] PROBLEM - Host cp3007 is DOWN: PING CRITICAL - Packet loss = 100% [14:44:02] PROBLEM - Host multatuli is DOWN: PING CRITICAL - Packet loss = 100% [14:44:03] PROBLEM - Host cp3017 is DOWN: PING CRITICAL - Packet loss = 100% [14:44:03] PROBLEM - Host amssq37 is DOWN: PING CRITICAL - Packet loss = 100% [14:44:03] PROBLEM - Host amssq55 is DOWN: PING CRITICAL - Packet loss = 100% [14:44:03] PROBLEM - Host amssq54 is DOWN: PING CRITICAL - Packet loss = 100% [14:44:03] PROBLEM - Host cp3010 is DOWN: PING CRITICAL - Packet loss = 100% [14:44:03] PROBLEM - Host cp3013 is DOWN: PING CRITICAL - Packet loss = 100% [14:44:12] these are all okay, ignore them 
[14:44:13] PROBLEM - Host amssq46 is DOWN: PING CRITICAL - Packet loss = 100% [14:44:13] PROBLEM - Host amssq44 is DOWN: PING CRITICAL - Packet loss = 100% [14:44:13] PROBLEM - Host amssq31 is DOWN: PING CRITICAL - Packet loss = 100% [14:44:13] PROBLEM - Host amssq36 is DOWN: PING CRITICAL - Packet loss = 100% [14:44:23] PROBLEM - Host cp3005 is DOWN: PING CRITICAL - Packet loss = 100% [14:44:23] PROBLEM - Host amssq59 is DOWN: PING CRITICAL - Packet loss = 100% [14:44:23] PROBLEM - Host amssq41 is DOWN: PING CRITICAL - Packet loss = 100% [14:44:23] PROBLEM - Host cp3004 is DOWN: PING CRITICAL - Packet loss = 100% [14:44:33] PROBLEM - Host amssq43 is DOWN: PING CRITICAL - Packet loss = 100% [14:44:33] PROBLEM - Host amssq50 is DOWN: PING CRITICAL - Packet loss = 100% [14:44:33] PROBLEM - Host amssq52 is DOWN: PING CRITICAL - Packet loss = 100% [14:44:33] PROBLEM - Host amssq56 is DOWN: PING CRITICAL - Packet loss = 100% [14:44:41] Oh noes! Amesterdam asplode! [14:44:42] PROBLEM - Host nescio is DOWN: CRITICAL - Time to live exceeded (91.198.174.106) [14:44:43] PROBLEM - Host cp3003 is DOWN: PING CRITICAL - Packet loss = 100% [14:44:43] PROBLEM - Host cp3008 is DOWN: PING CRITICAL - Packet loss = 100% [14:44:43] PROBLEM - Host amssq57 is DOWN: PING CRITICAL - Packet loss = 100% [14:44:43] PROBLEM - Host amssq58 is DOWN: PING CRITICAL - Packet loss = 100% [14:44:43] PROBLEM - Host amssq39 is DOWN: PING CRITICAL - Packet loss = 100% [14:44:43] PROBLEM - Host amssq45 is DOWN: PING CRITICAL - Packet loss = 100% [14:44:44] PROBLEM - Host cp3016 is DOWN: PING CRITICAL - Packet loss = 100% [14:44:53] RECOVERY - Host multatuli is UP: PING OK - Packet loss = 0%, RTA = 83.72 ms [14:44:54] PROBLEM - Host amssq33 is DOWN: PING CRITICAL - Packet loss = 100% [14:44:54] PROBLEM - Host cp3015 is DOWN: PING CRITICAL - Packet loss = 100% [14:45:02] PROBLEM - Host ms-be3003 is DOWN: PING CRITICAL - Packet loss = 100% [14:45:02] PROBLEM - Host ms-be3002 is DOWN: PING CRITICAL - Packet 
loss = 100% [14:45:02] PROBLEM - Host lvs3002 is DOWN: PING CRITICAL - Packet loss = 100% [14:45:02] PROBLEM - Host ms-fe3001 is DOWN: PING CRITICAL - Packet loss = 100% [14:45:02] PROBLEM - Host ms-be3004 is DOWN: PING CRITICAL - Packet loss = 100% [14:45:03] PROBLEM - Host lvs3001 is DOWN: PING CRITICAL - Packet loss = 100% [14:45:03] PROBLEM - Host ms-fe3002 is DOWN: PING CRITICAL - Packet loss = 100% [14:45:04] PROBLEM - Host ms-be3001 is DOWN: PING CRITICAL - Packet loss = 100% [14:45:04] PROBLEM - Host amssq47 is DOWN: PING CRITICAL - Packet loss = 100% [14:45:12] PROBLEM - Host cr2-knams is DOWN: CRITICAL - Network Unreachable (91.198.174.246) [14:45:12] PROBLEM - Host amssq61 is DOWN: PING CRITICAL - Packet loss = 100% [14:45:12] PROBLEM - Host cp3044 is DOWN: PING CRITICAL - Packet loss = 100% [14:45:12] PROBLEM - Host amssq42 is DOWN: PING CRITICAL - Packet loss = 100% [14:45:12] PROBLEM - Host cp3045 is DOWN: PING CRITICAL - Packet loss = 100% [14:45:13] PROBLEM - Host cp3012 is DOWN: PING CRITICAL - Packet loss = 100% [14:45:13] PROBLEM - Host amssq51 is DOWN: PING CRITICAL - Packet loss = 100% [14:45:14] PROBLEM - Host amssq60 is DOWN: PING CRITICAL - Packet loss = 100% [14:45:14] PROBLEM - Host cp3046 is DOWN: PING CRITICAL - Packet loss = 100% [14:45:15] PROBLEM - Host cp3048 is DOWN: PING CRITICAL - Packet loss = 100% [14:45:15] PROBLEM - Host cp3047 is DOWN: PING CRITICAL - Packet loss = 100% [14:45:16] PROBLEM - Host cp3043 is DOWN: PING CRITICAL - Packet loss = 100% [14:45:16] PROBLEM - Host cp3040 is DOWN: PING CRITICAL - Packet loss = 100% [14:45:17] PROBLEM - Host cp3041 is DOWN: PING CRITICAL - Packet loss = 100% [14:53:20] greg-g: yt? 
[14:53:32] RECOVERY - Host amssq58 is UP: PING OK - Packet loss = 0%, RTA = 88.10 ms [14:53:32] RECOVERY - Host amssq50 is UP: PING OK - Packet loss = 0%, RTA = 89.65 ms [14:53:32] RECOVERY - Host amssq39 is UP: PING OK - Packet loss = 0%, RTA = 87.36 ms [14:53:32] RECOVERY - Host amssq36 is UP: PING OK - Packet loss = 0%, RTA = 87.39 ms [14:53:32] RECOVERY - Host amssq31 is UP: PING OK - Packet loss = 0%, RTA = 89.63 ms [14:54:48] RECOVERY - Host cr2-knams is UP: PING OK - Packet loss = 0%, RTA = 88.03 ms [14:55:14] PROBLEM - Host mobile-lb.esams.wikimedia.org is DOWN: CRITICAL - Network Unreachable (91.198.174.204) [14:55:18] paravoid: the bits-lb for esams IPv6 page is you, right ? [14:55:30] ignore [14:55:32] ok [14:55:33] PROBLEM - Host text-lb.esams.wikimedia.org is DOWN: CRITICAL - Network Unreachable (91.198.174.192) [14:55:53] PROBLEM - Host mobile-lb.esams.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:862:ed1a::1:c [14:55:58] PROBLEM - Host bits-lb.esams.wikimedia.org is DOWN: CRITICAL - Network Unreachable (91.198.174.202) [14:55:58] phuedx: what's up? [14:56:24] PROBLEM - Host text-lb.esams.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:862:ed1a::1 [14:56:34] greg-g: i wanted to let you know that i've been asked to swot deploy a fix for jdlrobson / the gather folk [14:57:01] the lstprop ones? [14:57:05] i've added myself to the calendar: https://wikitech.wikimedia.org/wiki/Deployments and am lining up the commits [14:57:07] yarrrp [14:57:10] cool [14:57:13] that cool? 
[14:57:23] PROBLEM - puppet last run on amssq53 is CRITICAL: CRITICAL: puppet fail [14:57:24] PROBLEM - puppet last run on cp3012 is CRITICAL: CRITICAL: puppet fail [14:57:28] should be, whoever takes on swat this morning will let you know :) [14:57:34] PROBLEM - puppet last run on hooft is CRITICAL: CRITICAL: puppet fail [14:57:43] PROBLEM - puppet last run on lvs3001 is CRITICAL: CRITICAL: puppet fail [14:57:48] it'll be my first proper derploy [14:57:54] RECOVERY - Host mobile-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 87.32 ms [14:58:00] RECOVERY - Host bits-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 91.57 ms [14:58:06] RECOVERY - Host text-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 93.45 ms [14:58:27] ^ongoing maintenance [14:58:34] PROBLEM - puppet last run on amssq51 is CRITICAL: CRITICAL: puppet fail [14:59:03] PROBLEM - Host wikidata is DOWN: CRITICAL - Network Unreachable (91.198.174.192) [14:59:30] phuedx: we all love derploys [14:59:48] * phuedx puts on his D hat [14:59:55] :) [15:00:02] RECOVERY - Host text-lb.esams.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 91.46 ms [15:00:04] manybubbles, anomie, ^d, thcipriani, marktraceur, jdlrobson, phuedx: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150401T1500). Please do the needful. [15:00:17] there are so many applicable D-words! :) [15:00:53] RECOVERY - Host mobile-lb.esams.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 89.15 ms [15:01:13] anyone around for deploy but me today? 
[15:01:25] i'm around [15:01:37] I’m here [15:01:39] (03CR) 10Eevans: [C: 031] "LGTM +1" [puppet] - 10https://gerrit.wikimedia.org/r/197840 (https://phabricator.wikimedia.org/T92680) (owner: 10Dzahn) [15:01:43] RECOVERY - Host bits-lb.esams.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 89.80 ms [15:02:22] cool, phuedx, did you want to deploy the gather patches? I'm fine with that. [15:02:52] thcipriani: sure -- i'll be updating testwiki first -- gather's been throwing exceptions and they're trying to fix 'em [15:02:56] just lining up patches :) [15:03:08] kk [15:03:12] PROBLEM - puppet last run on amssq32 is CRITICAL: CRITICAL: puppet fail [15:04:03] RECOVERY - puppet last run on amssq51 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [15:04:24] core patches are here: https://gerrit.wikimedia.org/r/#/c/201201/ and https://gerrit.wikimedia.org/r/#/c/201202/ [15:04:43] RECOVERY - puppet last run on amssq53 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:04:52] RECOVERY - Host wikidata is UP: PING OK - Packet loss = 0%, RTA = 90.53 ms [15:04:52] RECOVERY - puppet last run on amssq32 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [15:04:53] RECOVERY - puppet last run on cp3012 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:05:03] RECOVERY - puppet last run on lvs3001 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [15:05:26] don't know if i have to seek approval but it doesn't hurt the first few times, right? [15:05:32] greg-g, thcipriani ^ [15:06:10] jdfi [15:06:11] ;) [15:06:15] er, jfdi [15:06:48] i read it as jfdi anyway [15:06:57] i feel like such a monster +2ing my own patches [15:07:10] does the feeling go away? [15:07:26] It goes away for a while, and then after you cause a site outage it comes back [15:07:28] you can only numb it with whiskey [15:07:38] turns into power drunkeness. 
[15:07:38] andrewbogott: :) [15:09:40] * phuedx waits for jerkins [15:09:50] i'm coffee drunk, is that suitable? [15:10:10] i could even it out with whiskey [15:10:16] later though, obvs [15:10:27] phuedx: tottallly [15:11:34] RECOVERY - puppet last run on hooft is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [15:12:54] wow [15:13:07] merging takes a while o.O [15:13:22] phuedx: sometime 10-12 minutes :) [15:13:43] i can't maintain this level of nervousness for 12 minutes!!! [15:13:54] dat zend job. [15:13:58] what kind of #wikimedia-operations are you guys running here? [15:14:15] pfff zend [15:14:20] i use go-php [15:15:08] (03PS4) 10Chad: Better support checking out MediaWiki & extension masters [mediawiki-config] - 10https://gerrit.wikimedia.org/r/198783 [15:17:42] PROBLEM - Host lvs3003 is DOWN: PING CRITICAL - Packet loss = 100% [15:19:33] RECOVERY - Host lvs3003 is UP: PING OK - Packet loss = 0%, RTA = 87.24 ms [15:21:31] (03PS1) 10coren: Labs: monitor and alert labnet1001 for conntrack [puppet] - 10https://gerrit.wikimedia.org/r/201203 (https://phabricator.wikimedia.org/T90437) [15:23:06] (03CR) 10jenkins-bot: [V: 04-1] Labs: monitor and alert labnet1001 for conntrack [puppet] - 10https://gerrit.wikimedia.org/r/201203 (https://phabricator.wikimedia.org/T90437) (owner: 10coren) [15:23:26] ok [15:23:36] submodule updates are merged on 22 and 23 [15:24:11] (03PS2) 10coren: Labs: monitor and alert labnet1001 for conntrack [puppet] - 10https://gerrit.wikimedia.org/r/201203 (https://phabricator.wikimedia.org/T90437) [15:24:50] phuedx: nice, want me to git wrangle and sync or are you doing that? [15:25:07] thcipriani: i'll give it a go if it's ok with you? 
[15:25:15] just doing the checks for the extension [15:25:20] that's fine with me [15:27:43] thcipriani: can't ssh to the testwiki server [15:28:05] hrrm [15:28:52] thcipriani: while i figure out would you mind updating testwiki -- jdlrobson and joakino are anxious to test it out [15:28:58] not really sure how to sync just to testwiki. ^d ? [15:29:48] (03PS11) 10Filippo Giunchedi: cassandra: add ferm rules using hiera data [puppet] - 10https://gerrit.wikimedia.org/r/197840 (https://phabricator.wikimedia.org/T92680) (owner: 10Dzahn) [15:29:52] (03CR) 10GWicke: [C: 031] cassandra: add ferm rules using hiera data [puppet] - 10https://gerrit.wikimedia.org/r/197840 (https://phabricator.wikimedia.org/T92680) (owner: 10Dzahn) [15:30:12] <^d> thcipriani: Which mw* is it? Just ssh to that and run sync-common there [15:30:24] mw1017 looks like [15:30:42] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] cassandra: add ferm rules using hiera data [puppet] - 10https://gerrit.wikimedia.org/r/197840 (https://phabricator.wikimedia.org/T92680) (owner: 10Dzahn) [15:30:53] ^d: have to run as mwdeploy? [15:31:12] <^d> lemme see [15:31:32] <^d> No. Doing it now anyway [15:31:35] urandom gwicke testing [15:31:53] ^d: kk, thanks. [15:31:53] ta ^d [15:31:55] <^d> !log ran sync-common on mw1017 for testwiki fun [15:32:00] Logged the message, Master [15:32:04] jdlrobson, joakino ^^ [15:32:12] godog, urandom: would be good to deploy this to the staging cluster first (praseodymium, cerium, xenon) [15:32:32] thcipriani: jdlrobson and joakino will be poking testwiki for a while i guess [15:32:41] <^d> gwicke: mm staging? 
[15:32:47] gwicke: yep it will, restbase* doesn't have base::firewall [15:32:55] godog: cool [15:33:11] yaay the exception disappeared [15:33:30] ^d: the cassandra staging cluster, not labs staging [15:33:31] thcipriani: phuedx can you keep an eye on the exception log for a bit whilst joakino and i tear through the feature sets [15:33:45] !log disable puppet on xenon, praseodymium, cerium, restbase* to test https://gerrit.wikimedia.org/r/197840 [15:33:50] Logged the message, Master [15:33:58] <^d> gwicke: ok, nvm :) [15:34:10] * phuedx loads up fatal/exception monitor [15:36:13] phuedx: done with patches? [15:36:18] phuedx: joakino looks good to me [15:36:25] kart_: for now -- testing 'em on test wiki [15:36:37] permission to do enwiki now from me :) [15:37:33] kart_: let's get yours merged while this is happening... [15:37:45] thcipriani: sure. [15:38:02] thcipriani: these patches are backport to wmf23 [15:39:27] jdlrobson: +1 on my side [15:39:41] been poking around, everything works fine if there is nothing weird on the logs [15:39:47] phuedx: ^ [15:40:02] so i've updated the gather extension on tin for 25wmf23 and 22 [15:40:15] \o/ [15:40:16] thcipriani: are you merging? Specially, good if anyone who reviewed core patch do that :) [15:40:22] anomie: ^^ [15:40:35] jdlrobson: it's still disabled in prod, right [15:40:37] ? [15:40:47] kart_: yeah would be good to get someone to review 201131, especially [15:41:18] i see, it's on testwiki and test2wiki [15:41:21] otherwise it's disabled [15:41:46] thcipriani: i'm going to sync-dir the gather extension for wmf23 and 22 [15:41:56] anomie: can you check: https://gerrit.wikimedia.org/r/#/c/201131 ? [15:41:58] phuedx: okie doke. [15:42:19] i'm not expecting turbulence as the extension is disabled everywhere but on test [15:42:21] thcipriani: uls patch is okay to go. 
[15:43:18] (03CR) 10BryanDavis: [C: 031] wikimania_scholarships: don't manage open/close dates [puppet] - 10https://gerrit.wikimedia.org/r/201143 (https://phabricator.wikimedia.org/T92358) (owner: 10Filippo Giunchedi) [15:43:24] kart_: kk +2'd [15:44:51] i'm running sync-dir -- where's dem messages? [15:44:57] (03Abandoned) 10Hashar: contint: disable hhvm stacktraces / map [puppet] - 10https://gerrit.wikimedia.org/r/195035 (https://phabricator.wikimedia.org/T64788) (owner: 10Hashar) [15:45:02] !log phuedx Synchronized php-1.25wmf23/extensions/Gather: Updating the Gather extension for 1.25wmf23 (duration: 01m 07s) [15:45:07] oh [15:45:12] there's dem messages [15:45:12] Logged the message, Master [15:45:33] thcipriani: you can go ahead with sync-dir etc too :) [15:45:44] until I found anomie around :/ [15:45:45] saw a sync error with mw2213 [15:45:56] shall i dump it in chan? [15:46:14] phuedx: sure [15:46:27] 15:45:02 ['/srv/deployment/scap/scap/bin/sync-common', '--no-update-l10n', '--include', 'php-1.25wmf23', '--include', 'php-1.25wmf23/extensions', '--include', 'php-1.25wmf23/extensions/Gather', '--include', 'php-1.25wmf23/extensions/Gather/***', 'mw1010.eqiad.wmnet', 'mw1033.eqiad.wmnet', 'mw1070.eqiad.wmnet', 'mw1097.eqiad.wmnet', 'mw1216.eqiad.wmnet', [15:46:27] 'mw1161.eqiad.wmnet', 'mw1201.eqiad.wmnet', 'mw2001.codfw.wmnet', 'mw2041.codfw.wmnet', 'mw2080.codfw.wmnet', 'mw2119.codfw.wmnet', 'mw2187.codfw.wmnet'] on mw2213.codfw.wmnet returned [255]: ssh: connect to host mw2213.codfw.wmnet port 22: Connection timed out [15:47:43] !log ssh: connect to host mw2213.codfw.wmnet port 22: Connection timed out during sync-dir initiated from tin [15:47:48] Logged the message, Master [15:48:02] jdlrobson, joakino: update is deployed for both release branches -- now we have to get the config change sorted [15:48:03] !log phuedx Synchronized php-1.25wmf22/extensions/Gather/: Updating the Gather extension for 1.25wmf22 (duration: 01m 06s) [15:48:09] Logged the message, 
Master [15:48:17] phuedx: [15:48:22] oh, codfw. This may be a known thing. bd808 has that been happening? [15:48:27] there's 15 minutes left in the window, want to leave it on testwiki and test some more? [15:48:36] phuedx: errors from the mw2* hosts are acceptable right now but we should log them. [15:48:42] I !logged above [15:48:48] noted [15:48:53] thanks bd808 [15:49:08] yw [15:49:58] jdlrobson, joakino: 10 minutes left in the window -- test on testwiki some more? [15:50:00] thcipriani: We just put codfw back into the scap target list yesterday. It won't surprise me if there are a few messed up hosts there at the moment [15:50:10] phuedx: am doing but still not seeing an issues on test wiki [15:50:22] are logs clean? no fatals? [15:50:37] thcipriani: can you sync uls? [15:50:46] kart_: yup, working on it now [15:50:52] thcipriani: cool. [15:51:33] gwicke urandom mutante it worked however I'm surprised it applied wider rules [15:51:52] phuedx: is my patch likely to make it in? [15:52:05] It only affects wikitech, so pretty safe. [15:52:35] andrewbogott: mibad -- i didn't see that [15:52:46] * phuedx hasn't done a confploy before /cc thcipriani [15:53:22] jdlrobson: phuedx i've tried all use cases, crud, add and remove items, new user with CTA, all lists page [15:53:26] all seem to work fine [15:53:31] yeh i don't know what else i can test :) [15:53:35] so 👍 on me [15:53:40] godog: as in allows access to the app ports to all internal networks? [15:53:42] i'm trying to think of bizarre edge cases [15:54:05] gwicke: yes [15:54:29] godog: that was the old config, maybe ferm didn't update? [15:54:38] So I guess the plan is to continue logging errors about mw2213 until they go away? 
[15:54:41] :D [15:55:01] andrewbogott: in answer to your question: unfortunately i think not -- i was here to do two specific patches, which i did /very/ slowly (it's my first time) [15:55:24] it should only take ~10 seconds to merge, unless Jenkins is stuck [15:55:45] godog: I vaguely remember that mutante had to restart the ferm service once before to get changes to apply [15:55:46] gwicke: indeed [15:58:01] !log upgrading junos on cr1-esams (esams is depooled, ignore alerts) [15:58:07] Logged the message, Master [15:58:47] jdlrobson, joakino: i can't see anything in the logs on flourine [15:58:47] (03PS1) 10Hoo man: Add new wikidata folders, define dataset folders in puppet [puppet] - 10https://gerrit.wikimedia.org/r/201208 (https://phabricator.wikimedia.org/T72385) [15:59:05] phuedx: ok let us know when we are live :) [15:59:43] jdlrobson: evening swat? [16:00:15] i'm reading the config deploy procedure now ;) [16:00:21] and i've got a meeting :/! [16:00:27] I am going to bring fluorine down for a few minutes to fix bios setting...any objections? [16:00:34] kart_: ok, preped, preparing to sync now [16:01:24] <^d> Is swat still going on? [16:01:28] <^d> cmjohnson1: Might want to wait if so [16:01:38] ^d: sync is about to happen i think [16:01:40] I'm finishing swat up right now [16:01:43] ^ that [16:02:01] <^d> cmjohnson1: We just lose logging, so doing it not during a deploy window is best :) [16:02:14] "just" logging [16:02:23] lol [16:02:28] ^d k..thx [16:02:28] someday we will have logstash again. Someday [16:02:34] <^d> Yeah dunno how that just snuck in there [16:02:37] * cmjohnson1 goes to check cal [16:02:41] gwicke: I guess praseodymium was an attempt to get upgraded to jessie? looking at apt sources [16:02:56] andrewbogott: would you mind bumping to the evening swat window? 
it appears i have a meeting [16:03:10] or someone far more capable than i could help :) [16:03:11] :( ok [16:03:52] !log thcipriani Synchronized php-1.25wmf23/extensions/UniversalLanguageSelector: swat [[gerrit:201122]] (duration: 01m 07s) [16:03:53] andrewbogott: sorry about the slowness this morning. [16:03:59] Logged the message, Master [16:04:08] thcipriani: thanks! [16:04:18] kart_: yup. [16:04:33] that's on me -- i'll get better as i do more [16:05:14] andrewbogott: so we're bumping 201197 to evening? If so, SWAT complete. [16:05:20] phuedx: look at pile of patches next time :D [16:05:31] thcipriani: yeah, I’m updating the calendar [16:05:50] urandom gwicke the ferm config is live on the test cluster [16:05:51] kk, my bad, probably should have just cranked out that easy on first. [16:06:03] kart_: yeah :/ i didn't look close enough to the window [16:06:08] <~ rookie [16:06:23] PROBLEM - Host cr1-esams is DOWN: CRITICAL - Network Unreachable (91.198.174.245) [16:06:36] kart_ <-- newbie, novice etc :) [16:07:57] (03PS1) 10Chad: Make sure mediawiki-config is cloned before running sync-common [puppet] - 10https://gerrit.wikimedia.org/r/201209 [16:09:50] !log bounce cassandra on test cluster [16:09:55] Logged the message, Master [16:12:15] (03CR) 10ArielGlenn: [C: 032] "thanks for this." 
[puppet] - 10https://gerrit.wikimedia.org/r/201208 (https://phabricator.wikimedia.org/T72385) (owner: 10Hoo man) [16:13:17] (03CR) 10Jdlrobson: [C: 031] "Is now deployed and tested @Kaldari" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201065 (owner: 10Kaldari) [16:14:43] RECOVERY - Host cr1-esams is UP: PING OK - Packet loss = 0%, RTA = 89.14 ms [16:17:34] (03PS1) 10Filippo Giunchedi: ferm: install libnet-dns-perl [puppet] - 10https://gerrit.wikimedia.org/r/201210 (https://phabricator.wikimedia.org/T92680) [16:18:23] PROBLEM - puppet last run on cp3019 is CRITICAL: CRITICAL: Puppet has 3 failures [16:19:12] PROBLEM - puppet last run on lvs3003 is CRITICAL: CRITICAL: Puppet has 1 failures [16:19:23] PROBLEM - puppet last run on cp3021 is CRITICAL: CRITICAL: Puppet has 2 failures [16:19:38] (03CR) 10Dzahn: [C: 031] "thank you! first time we are actually using this and couldn't catch that in compiler" [puppet] - 10https://gerrit.wikimedia.org/r/201210 (https://phabricator.wikimedia.org/T92680) (owner: 10Filippo Giunchedi) [16:19:45] (03CR) 10BryanDavis: [C: 031] Make sure mediawiki-config is cloned before running sync-common [puppet] - 10https://gerrit.wikimedia.org/r/201209 (owner: 10Chad) [16:22:15] (03PS2) 10Filippo Giunchedi: wikimania_scholarships: don't manage open/close dates [puppet] - 10https://gerrit.wikimedia.org/r/201143 (https://phabricator.wikimedia.org/T92358) [16:22:21] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] wikimania_scholarships: don't manage open/close dates [puppet] - 10https://gerrit.wikimedia.org/r/201143 (https://phabricator.wikimedia.org/T92358) (owner: 10Filippo Giunchedi) [16:24:01] godog: cool, will look after meeting [16:24:03] PROBLEM - puppet last run on praseodymium is CRITICAL: CRITICAL: Puppet has 1 failures [16:25:32] gwicke: do you know why praseodymium had jessie in sources.list btw? 
[16:25:59] godog: likely to get cassandra 2.1.3, which is what we use in prod [16:26:13] there's a ticket to update the test cluster to jessie in general [16:26:47] yeah we should coordinate that next week [16:26:56] https://phabricator.wikimedia.org/T90955 [16:30:50] (03PS1) 10Nuria: Correcting end of line on wikimetrics.pp [puppet] - 10https://gerrit.wikimedia.org/r/201212 [16:33:42] PROBLEM - puppet last run on amssq36 is CRITICAL: CRITICAL: Puppet has 1 failures [16:33:53] PROBLEM - puppet last run on lvs3004 is CRITICAL: CRITICAL: Puppet has 1 failures [16:33:54] PROBLEM - puppet last run on ms-fe3002 is CRITICAL: CRITICAL: Puppet has 3 failures [16:34:52] PROBLEM - puppet last run on ms-be3001 is CRITICAL: CRITICAL: Puppet has 3 failures [16:35:13] RECOVERY - puppet last run on cp3019 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:35:14] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] create shell for lpintscher [puppet] - 10https://gerrit.wikimedia.org/r/200891 (https://phabricator.wikimedia.org/T94390) (owner: 10John F. 
Lewis) [16:35:33] PROBLEM - puppet last run on amssq48 is CRITICAL: CRITICAL: Puppet has 6 failures [16:35:52] RECOVERY - puppet last run on lvs3003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:36:03] RECOVERY - puppet last run on cp3021 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [16:38:31] (03PS1) 10Dzahn: create shell user for Moritz (jmm) [puppet] - 10https://gerrit.wikimedia.org/r/201213 (https://phabricator.wikimedia.org/T94707) [16:39:03] godog: :) you merged that like a second before i was uploading, rebase war :) [16:39:36] (03PS2) 10Filippo Giunchedi: zuul: replace 'zuul-merger' by $NAME [puppet] - 10https://gerrit.wikimedia.org/r/201181 (owner: 10Hashar) [16:39:41] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] zuul: replace 'zuul-merger' by $NAME [puppet] - 10https://gerrit.wikimedia.org/r/201181 (owner: 10Hashar) [16:40:01] mutante: aha whoops [16:41:11] (03CR) 10Dzahn: [C: 032] Correcting end of line on wikimetrics.pp [puppet] - 10https://gerrit.wikimedia.org/r/201212 (owner: 10Nuria) [16:41:41] godog: merging on master [16:42:20] mutante: sure I'm done [16:44:23] (03PS1) 10Faidon Liambotis: Fix reverse DNS for esams GRE tunnel from cr2->cr1 [dns] - 10https://gerrit.wikimedia.org/r/201214 [16:44:25] (03PS1) 10Faidon Liambotis: Add IP for mr1-esams.oob [dns] - 10https://gerrit.wikimedia.org/r/201215 [16:45:13] (03CR) 10Faidon Liambotis: [C: 032] Fix reverse DNS for esams GRE tunnel from cr2->cr1 [dns] - 10https://gerrit.wikimedia.org/r/201214 (owner: 10Faidon Liambotis) [16:46:01] (03CR) 10Faidon Liambotis: [C: 032] Add IP for mr1-esams.oob [dns] - 10https://gerrit.wikimedia.org/r/201215 (owner: 10Faidon Liambotis) [16:48:16] !log Updated Wikimania Scholarships to bde1a27 (Improve performance of phase2 report query) [16:48:23] Logged the message, Master [16:48:25] (03PS2) 10Faidon Liambotis: Revert "Drain esams of all traffic (scheduled maintenance)" [dns] - 
10https://gerrit.wikimedia.org/r/201173 [16:48:45] (03CR) 10Faidon Liambotis: [C: 031] ferm: install libnet-dns-perl [puppet] - 10https://gerrit.wikimedia.org/r/201210 (https://phabricator.wikimedia.org/T92680) (owner: 10Filippo Giunchedi) [16:50:02] RECOVERY - puppet last run on ms-be3001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:50:23] RECOVERY - puppet last run on amssq36 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:50:42] RECOVERY - puppet last run on lvs3004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:50:42] RECOVERY - puppet last run on ms-fe3002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:50:43] RECOVERY - puppet last run on amssq48 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:53:14] (03PS1) 10Shanmugamp7: Enable Extension:Shorturl on sa wiki projects Bug: T94660 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201216 (https://phabricator.wikimedia.org/T94660) [16:55:24] (03PS2) 10Dzahn: create shell account for Moritz (jmm) [puppet] - 10https://gerrit.wikimedia.org/r/201213 (https://phabricator.wikimedia.org/T94707) [16:59:47] RECOVERY - puppet last run on multatuli is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [16:59:56] (03CR) 10Giuseppe Lavagetto: [C: 032] create shell account for Moritz (jmm) [puppet] - 10https://gerrit.wikimedia.org/r/201213 (https://phabricator.wikimedia.org/T94707) (owner: 10Dzahn) [17:00:51] (03CR) 1020after4: [C: 031] "ok I like this solution. Do we update make-wmf-branch to use this as the new truth?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/198783 (owner: 10Chad) [17:01:44] godog: did some poking on the test cluster, LGTM [17:03:10] gwicke: cool [17:04:06] (03PS1) 10Dzahn: add jmm to ops [puppet] - 10https://gerrit.wikimedia.org/r/201227 (https://phabricator.wikimedia.org/T94707) [17:04:10] (03CR) 10Faidon Liambotis: [C: 032] Revert "Drain esams of all traffic (scheduled maintenance)" [dns] - 10https://gerrit.wikimedia.org/r/201173 (owner: 10Faidon Liambotis) [17:04:23] !log repooling esams [17:04:31] Logged the message, Master [17:10:13] RECOVERY - RAID on analytics1010 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [17:10:22] RECOVERY - salt-minion processes on wtp2004 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [17:10:53] ottomata: an1021 kafka alert [17:11:06] kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 3.93121685344e-132 [17:11:12] 3d 5h [17:11:24] RECOVERY - salt-minion processes on wtp2005 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [17:11:32] PROBLEM - puppet last run on carbon is CRITICAL: CRITICAL: puppet fail [17:12:19] thanks paravoid [17:12:40] (03CR) 10Dzahn: [C: 032] "signed L3, and adding to groups will be separate" [puppet] - 10https://gerrit.wikimedia.org/r/201213 (https://phabricator.wikimedia.org/T94707) (owner: 10Dzahn) [17:12:43] !log initiated kafka replica election [17:12:52] Logged the message, Master [17:13:12] RECOVERY - puppet last run on carbon is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:15:30] (03Abandoned) 10Andrew Bogott: Make labs resolv.conf play nice with resolvconf [puppet] - 10https://gerrit.wikimedia.org/r/200595 (https://phabricator.wikimedia.org/T93691) (owner: 10Andrew Bogott) [17:15:43] RECOVERY - Kafka Broker Messages In on analytics1021 is OK: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate OKAY: 5208.3904517 [17:15:52] 
ottomata: what was it? [17:16:07] the usual kafka zk timeout that causes 1021 to drop out from being a leader from any topics [17:16:16] it happens about once every week or month or so [17:16:22] :( [17:16:37] it isn't a big problem, as the other 3 brokers all take over [17:16:38] it only happens to 1021. [17:17:02] i have spent many hours trying to understand why, but my current plan is to replace it the next time we do a hardware order, and maybe try to use it as a hadoop node instead [17:17:04] not sure [17:19:51] (03PS1) 10Alexandros Kosiaris: ganeti.cfg partman configuration [puppet] - 10https://gerrit.wikimedia.org/r/201231 [17:20:34] <_joe_> !log restarted hhvm on mw1194, stuck in HPHP::StatCache::refresh [17:20:41] Logged the message, Master [17:20:41] (03PS1) 10coren: Labs: make net saturation monitoring configurable [puppet] - 10https://gerrit.wikimedia.org/r/201232 [17:21:32] (03CR) 10jenkins-bot: [V: 04-1] Labs: make net saturation monitoring configurable [puppet] - 10https://gerrit.wikimedia.org/r/201232 (owner: 10coren) [17:21:38] (03PS2) 10coren: Labs: make net saturation monitoring configurable [puppet] - 10https://gerrit.wikimedia.org/r/201232 [17:21:52] RECOVERY - Apache HTTP on mw1194 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.109 second response time [17:21:52] RECOVERY - HHVM rendering on mw1194 is OK: HTTP OK: HTTP/1.1 200 OK - 64282 bytes in 0.888 second response time [17:22:44] paravoid: Speaking of, we probably want to either finish solving that bonding problem or rip out the partial config. 
[17:24:58] (03PS1) 10Giuseppe Lavagetto: icinga: fix puppet scoping in template [puppet] - 10https://gerrit.wikimedia.org/r/201233 [17:25:10] (03PS2) 10Giuseppe Lavagetto: icinga: fix puppet scoping in template [puppet] - 10https://gerrit.wikimedia.org/r/201233 [17:25:28] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] icinga: fix puppet scoping in template [puppet] - 10https://gerrit.wikimedia.org/r/201233 (owner: 10Giuseppe Lavagetto) [17:25:50] godog: Care to sanity check https://gerrit.wikimedia.org/r/#/c/201232/2 for me? Should be trivial. [17:27:13] RECOVERY - HHVM busy threads on mw1194 is OK: OK: Less than 30.00% above the threshold [76.8] [17:27:16] (03CR) 10Filippo Giunchedi: [C: 031] Labs: make net saturation monitoring configurable [puppet] - 10https://gerrit.wikimedia.org/r/201232 (owner: 10coren) [17:27:19] Coren: yep, LGTM [17:27:23] <- off [17:27:28] Danke. o/ [17:27:45] o/~ [17:28:05] (03CR) 10coren: [C: 032] "Should fix non-labstore1001" [puppet] - 10https://gerrit.wikimedia.org/r/201232 (owner: 10coren) [17:28:59] (03PS2) 10Dzahn: access: add jmm (Moritz) to ops [puppet] - 10https://gerrit.wikimedia.org/r/201227 (https://phabricator.wikimedia.org/T94707) [17:29:23] RECOVERY - HHVM queue size on mw1194 is OK: OK: Less than 30.00% above the threshold [10.0] [17:30:07] (03PS2) 10Alexandros Kosiaris: ganeti.cfg partman configuration [puppet] - 10https://gerrit.wikimedia.org/r/201231 [17:30:28] (03CR) 10Mark Bergsma: [C: 031] access: add jmm (Moritz) to ops [puppet] - 10https://gerrit.wikimedia.org/r/201227 (https://phabricator.wikimedia.org/T94707) (owner: 10Dzahn) [17:31:30] (03CR) 10Alexandros Kosiaris: [C: 032] ferm: install libnet-dns-perl [puppet] - 10https://gerrit.wikimedia.org/r/201210 (https://phabricator.wikimedia.org/T92680) (owner: 10Filippo Giunchedi) [17:31:47] (03CR) 10Alexandros Kosiaris: [C: 032] ganeti.cfg partman configuration [puppet] - 10https://gerrit.wikimedia.org/r/201231 (owner: 10Alexandros Kosiaris) [17:31:54] (03CR) 10Dzahn: 
[C: 032] access: add jmm (Moritz) to ops [puppet] - 10https://gerrit.wikimedia.org/r/201227 (https://phabricator.wikimedia.org/T94707) (owner: 10Dzahn) [17:32:07] (03CR) 10Alexandros Kosiaris: [C: 032] Assign role::ganeti to ganeti boxes [puppet] - 10https://gerrit.wikimedia.org/r/200587 (owner: 10Alexandros Kosiaris) [17:32:24] and we managed to have 3 at the same time :) [17:32:29] shall i merge it all? [17:32:54] done [17:33:08] (03CR) 10Alexandros Kosiaris: [C: 032] Ganeti module/role introduced [puppet] - 10https://gerrit.wikimedia.org/r/198794 (https://phabricator.wikimedia.org/T87258) (owner: 10Alexandros Kosiaris) [17:33:28] (03CR) 10Alexandros Kosiaris: [C: 032] Ganeti eqiad cluster DNS and Service records [dns] - 10https://gerrit.wikimedia.org/r/200573 (owner: 10Alexandros Kosiaris) [17:34:33] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [500.0] [17:35:31] mutante: heh, we had way more [17:35:52] I was just submitting like a week's work :-) [17:36:00] heh:) [17:36:13] PROBLEM - LVS HTTPS IPv4 on mobile-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:36:22] PROBLEM - LVS HTTPS IPv6 on mobile-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:36:32] PROBLEM - LVS HTTP IPv4 on bits-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:37:03] PROBLEM - LVS HTTPS IPv4 on bits-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:37:09] PROBLEM - LVS HTTPS IPv6 on upload-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:37:15] PROBLEM - LVS HTTP IPv6 on mobile-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:37:20] PROBLEM - LVS HTTPS IPv6 on bits-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:37:26] PROBLEM - LVS HTTP IPv6 on 
text-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:37:32] PROBLEM - LVS HTTPS IPv4 on text-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:37:51] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 0 below the confidence bounds [17:37:56] well there it goes [17:38:12] PROBLEM - LVS HTTPS IPv6 on text-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:38:41] PROBLEM - LVS HTTP IPv4 on mobile-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:38:51] PROBLEM - LVS HTTP IPv6 on bits-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:39:12] RECOVERY - LVS HTTP IPv4 on bits-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 4044 bytes in 2.460 second response time [17:39:21] PROBLEM - puppet last run on cp3019 is CRITICAL: CRITICAL: puppet fail [17:39:33] (03CR) 10Chad: "I guess the question is do we want to use the extension meta repo in prod vs. just using submodules of core." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/198783 (owner: 10Chad) [17:39:42] RECOVERY - LVS HTTP IPv6 on mobile-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 18006 bytes in 5.027 second response time [17:39:48] RECOVERY - LVS HTTPS IPv6 on bits-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 4048 bytes in 4.898 second response time [17:40:04] RECOVERY - LVS HTTPS IPv4 on text-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 64665 bytes in 5.297 second response time [17:40:53] RECOVERY - LVS HTTP IPv4 on mobile-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 18038 bytes in 0.250 second response time [17:41:13] RECOVERY - LVS HTTPS IPv4 on mobile-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 18079 bytes in 0.435 second response time [17:41:44] RECOVERY - LVS HTTPS IPv4 on bits-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 4049 bytes in 0.338 second response time [17:42:04] RECOVERY - LVS HTTP IPv6 on text-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 64509 bytes in 9.755 second response time [17:42:24] PROBLEM - puppet last run on lvs3003 is CRITICAL: CRITICAL: puppet fail [17:42:43] PROBLEM - puppet last run on cp3021 is CRITICAL: CRITICAL: puppet fail [17:42:43] PROBLEM - puppet last run on cp3048 is CRITICAL: CRITICAL: puppet fail [17:43:47] PROBLEM - puppet last run on amssq56 is CRITICAL: CRITICAL: puppet fail [17:43:56] PROBLEM - puppet last run on amssq39 is CRITICAL: CRITICAL: puppet fail [17:44:58] PROBLEM - puppet last run on amssq31 is CRITICAL: CRITICAL: Puppet has 7 failures [17:45:06] PROBLEM - puppet last run on cp3017 is CRITICAL: CRITICAL: puppet fail [17:45:26] PROBLEM - puppet last run on amssq59 is CRITICAL: CRITICAL: puppet fail [17:45:36] RECOVERY - LVS HTTP IPv6 on bits-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 4044 bytes in 0.340 second response time [17:45:57] PROBLEM - puppet last run on amssq57 is CRITICAL: CRITICAL: puppet 
fail [17:46:06] PROBLEM - LVS HTTP IPv4 on mobile-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:47:37] RECOVERY - LVS HTTP IPv4 on mobile-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 18046 bytes in 0.288 second response time [17:47:57] RECOVERY - LVS HTTPS IPv6 on upload-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 758 bytes in 0.367 second response time [17:48:27] RECOVERY - LVS HTTPS IPv6 on mobile-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 18056 bytes in 0.509 second response time [17:48:57] RECOVERY - LVS HTTPS IPv6 on text-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 64631 bytes in 0.592 second response time [17:49:36] PROBLEM - puppet last run on cp3012 is CRITICAL: CRITICAL: puppet fail [17:49:56] PROBLEM - puppet last run on amssq35 is CRITICAL: CRITICAL: puppet fail [17:50:07] PROBLEM - puppet last run on cp3016 is CRITICAL: CRITICAL: puppet fail [17:50:37] PROBLEM - puppet last run on amssq54 is CRITICAL: CRITICAL: puppet fail [17:50:47] PROBLEM - puppet last run on cp3040 is CRITICAL: CRITICAL: puppet fail [17:51:57] PROBLEM - puppet last run on silver is CRITICAL: CRITICAL: puppet fail [17:52:02] (03PS1) 10Hoo man: Adopt dumpwikidatajson.sh to the new naming pattern [puppet] - 10https://gerrit.wikimedia.org/r/201238 (https://phabricator.wikimedia.org/T72385) [17:52:17] PROBLEM - puppet last run on cp3042 is CRITICAL: CRITICAL: puppet fail [17:53:58] RECOVERY - puppet last run on amssq31 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [17:55:07] RECOVERY - puppet last run on cp3019 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:55:07] RECOVERY - puppet last run on lvs3003 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [17:55:38] RECOVERY - puppet last run on cp3021 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [17:55:43] 
!log added jmm to ops and wmf LDAP groups [17:55:49] Logged the message, Master [17:55:56] RECOVERY - puppet last run on amssq39 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [17:56:27] RECOVERY - puppet last run on amssq57 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [17:57:18] RECOVERY - puppet last run on cp3048 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [17:57:19] RECOVERY - puppet last run on amssq56 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [17:57:19] RECOVERY - puppet last run on cp3017 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [17:57:37] PROBLEM - puppet last run on iodine is CRITICAL: CRITICAL: puppet fail [17:57:46] RECOVERY - puppet last run on amssq59 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [18:00:04] twentyafterfour, greg-g: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150401T1800). Please do the needful. [18:01:30] csteipp: yt? [18:01:40] nuria: Yeah [18:01:43] PROBLEM - Slow CirrusSearch query rate on fluorine is CRITICAL: CirrusSearch-slow.log_line_rate CRITICAL: 0.00333333333333 [18:02:02] csteipp: we changed expiration time of last-access-cookie to 30 days, it's on patch [18:02:32] nuria: Thanks, I'll look. [18:02:34] csteipp: can you explain me what is the issue with having that cookie not expire? 
[18:02:52] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [18:02:57] csteipp: if it is brief and you have time, if not just update wiki page at your convenience [18:04:42] RECOVERY - puppet last run on cp3016 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [18:04:52] RECOVERY - puppet last run on cp3040 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [18:04:55] (03CR) 10Smalyshev: Adopt dumpwikidatajson.sh to the new naming pattern (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/201238 (https://phabricator.wikimedia.org/T72385) (owner: 10Hoo man) [18:05:14] RECOVERY - puppet last run on cp3012 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:06:03] RECOVERY - puppet last run on amssq35 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:06:10] nuria: I'll update the wiki [18:07:10] csteipp: ok, also i think you should look at https://gerrit.wikimedia.org/r/#/c/199179/ regarding searches, cookies and storing searches client side. Just mentioned you on ticket. [18:07:12] RECOVERY - puppet last run on cp3042 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:07:12] RECOVERY - puppet last run on amssq54 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:07:23] RECOVERY - Slow CirrusSearch query rate on fluorine is OK: CirrusSearch-slow.log_line_rate OKAY: 0.0 [18:07:35] robh: i also think it should be possible for a staffer to volunteer a few minutes after work and upload a few photos. Victor = VGrigas ?
[18:13:13] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected [18:15:29] (03Abandoned) 10Tim Landscheidt: WIP: Snapshot [puppet] - 10https://gerrit.wikimedia.org/r/200648 (https://phabricator.wikimedia.org/T93691) (owner: 10Tim Landscheidt) [18:30:32] (03PS2) 10Negative24: Default ignore alternate file domain config [puppet] - 10https://gerrit.wikimedia.org/r/199564 (https://phabricator.wikimedia.org/T93837) [18:31:27] (03CR) 10Andrew Bogott: "Ah, crap, sorry, didn't realize you had an alternative solution in progress. It's not impossible to undo the change of my last patch, if " [puppet] - 10https://gerrit.wikimedia.org/r/200648 (https://phabricator.wikimedia.org/T93691) (owner: 10Tim Landscheidt) [18:35:34] RECOVERY - RAID on ms-be2002 is OK: OK: optimal, 13 logical, 13 physical [18:36:54] (03CR) 10Rush: [C: 032] "seems sensible thanks" [puppet] - 10https://gerrit.wikimedia.org/r/199564 (https://phabricator.wikimedia.org/T93837) (owner: 10Negative24) [18:37:02] (03CR) 10Rush: [V: 032] "seems sensible thanks" [puppet] - 10https://gerrit.wikimedia.org/r/199564 (https://phabricator.wikimedia.org/T93837) (owner: 10Negative24) [18:43:48] (03PS2) 10Alex Monk: Enable Extension:Shorturl on sa wiki projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201216 (https://phabricator.wikimedia.org/T94660) (owner: 10Shanmugamp7) [18:44:16] (03CR) 10Alex Monk: [C: 04-1] Enable Extension:Shorturl on sa wiki projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201216 (https://phabricator.wikimedia.org/T94660) (owner: 10Shanmugamp7) [18:47:47] (03CR) 10Alex Monk: [C: 031] Enable transwiki import from English Wikisource on Telugu Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201196 (https://phabricator.wikimedia.org/T94531) (owner: 10Pmlineditor) [18:52:34] (03CR) 10Cscott: "Yes, I should probably do it during a OCG deploy window (aka, Parsoid & services)." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/200038 (owner: 10Cscott) [19:08:22] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 220, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-5/2/1: down - asw-d-eqiad:xe-1/1/0 {#} [10Gbps DF]BR [19:08:43] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 206, down: 2, dormant: 0, excluded: 0, unused: 0BRxe-4/0/3: down - asw-d-eqiad:xe-6/0/31 {#} [10Gbps DF]BRxe-5/2/1: down - asw-d-eqiad:xe-1/1/2 {#} [10Gbps DF]BR [19:10:56] (03CR) 10Tim Landscheidt: "I didn't have a better solution, I just had more optimism to find one :-)." [puppet] - 10https://gerrit.wikimedia.org/r/200648 (https://phabricator.wikimedia.org/T93691) (owner: 10Tim Landscheidt) [19:15:37] (03PS1) 10Dzahn: add jmm to icinga SMS contact group [puppet] - 10https://gerrit.wikimedia.org/r/201250 (https://phabricator.wikimedia.org/T94717) [19:16:16] !log upgrading cr1/2-eqiad<->asw-d-eqiad capacity (T92914) [19:16:36] (03CR) 10Dzahn: [C: 032] add jmm to icinga SMS contact group [puppet] - 10https://gerrit.wikimedia.org/r/201250 (https://phabricator.wikimedia.org/T94717) (owner: 10Dzahn) [19:18:39] Logged the message, Master [19:19:34] (03PS1) 10Dzahn: icinga: give jmm permission to run commands [puppet] - 10https://gerrit.wikimedia.org/r/201251 (https://phabricator.wikimedia.org/T94717) [19:21:11] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 230, down: 0, dormant: 0, excluded: 0, unused: 0 [19:22:50] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 216, down: 0, dormant: 0, excluded: 0, unused: 0 [19:23:27] (03CR) 10Dzahn: [C: 04-1] "wait, do i have to use full realname here? 
that would add non-US-ASCII chars to puppet manifests --> T91453 :P" [puppet] - 10https://gerrit.wikimedia.org/r/201251 (https://phabricator.wikimedia.org/T94717) (owner: 10Dzahn) [19:23:58] (03PS1) 1020after4: Add 1.25wmf24 symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201253 [19:24:00] (03PS1) 1020after4: Wikipedias to 1.25wmf23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201254 [19:24:02] (03PS1) 1020after4: Group0 to 1.25wmf24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201255 [19:24:23] (03CR) 10Dzahn: "nevermind, it doesn't matter since it's not a manifest, just a file, but still: https://phabricator.wikimedia.org/T94729" [puppet] - 10https://gerrit.wikimedia.org/r/201251 (https://phabricator.wikimedia.org/T94717) (owner: 10Dzahn) [19:24:57] (03PS2) 10Dzahn: icinga: give jmm permission to run commands [puppet] - 10https://gerrit.wikimedia.org/r/201251 (https://phabricator.wikimedia.org/T94717) [19:28:44] (03PS1) 1020after4: Remove 1.25wmf17 symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201256 [19:28:52] Lydia_WMDE: your shell access should work. wanna try? 
[19:29:32] (03CR) 1020after4: [C: 032] Remove 1.25wmf17 symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201256 (owner: 1020after4) [19:29:36] (03Merged) 10jenkins-bot: Remove 1.25wmf17 symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201256 (owner: 1020after4) [19:30:06] (03PS2) 1020after4: Add 1.25wmf24 symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201253 [19:30:50] (03CR) 1020after4: [C: 032] Add 1.25wmf24 symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201253 (owner: 1020after4) [19:30:55] (03Merged) 10jenkins-bot: Add 1.25wmf24 symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201253 (owner: 1020after4) [19:32:14] mutante: not sure if she'll be around atm :) [19:32:36] JohnFLewis: yea, i'm updating on ticket [19:33:31] I spoke to her ever so briefly about it so she'll likely shout up if it doesn't work by tomorrow [19:33:49] !log twentyafterfour Started scap: testwiki to php-1.25wmf24 and rebuild l10n cache [19:33:54] Logged the message, Master [19:37:21] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet). [19:37:31] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet). 
[19:42:07] (03CR) 10Dduvall: [WIP] Add role::mediawiki_vagrant_lxc (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/193665 (owner: 10BryanDavis) [19:48:06] (03CR) 10Andrew Bogott: [C: 031] "I don't fully understand graphite config, but at the very least this shouldn't break anything" [puppet] - 10https://gerrit.wikimedia.org/r/201203 (https://phabricator.wikimedia.org/T90437) (owner: 10coren) [19:48:09] !log twentyafterfour scap failed: CalledProcessError Command 'cp '/srv/mediawiki-staging/php-1.25wmf24/cache/l10n/'*.cdb '/tmp/scap_l10n_1816369030'' returned non-zero exit status 1 (duration: 14m 20s) [19:48:17] Logged the message, Master [19:52:07] (03CR) 10coren: [C: 032] ""Shouldn't break anything" is a good objective." [puppet] - 10https://gerrit.wikimedia.org/r/201203 (https://phabricator.wikimedia.org/T90437) (owner: 10coren) [19:54:21] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [19:55:34] (03PS3) 10Yuvipanda: tools: Make webservice2 block for start / stop [puppet] - 10https://gerrit.wikimedia.org/r/201100 (https://phabricator.wikimedia.org/T93334) [19:55:49] (03CR) 10Yuvipanda: [C: 032 V: 032] "Let's do this!" [puppet] - 10https://gerrit.wikimedia.org/r/201100 (https://phabricator.wikimedia.org/T93334) (owner: 10Yuvipanda) [19:55:51] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [19:57:21] urandom: firewalling limited to the other cassandra seeds, taken from hiera, works now [19:57:38] mutante: that's awesome, thank you [19:58:36] urandom: for example: iptables -L | grep 7199 and cat /etc/ferm/conf.d/10_cassandra-cql [19:59:01] thanks to godog, he added the package needed for @resolve to work in ferm [19:59:14] mutante: this is just staging right now? 
[19:59:23] urandom: yes [20:00:04] gwicke, cscott, arlolra, subbu: Respected human, time to deploy Services – Parsoid / OCG / Citoid / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150401T2000). Please do the needful. [20:05:50] I can has root help on tin? there are bad permissions in /srv/deployment/scap/scap/.git [20:05:53] godog: Around? twentyafterfour needs a root to fix bad permissions on the scap trebuchet directory [20:05:56] PROBLEM - puppet last run on graphite1002 is CRITICAL: CRITICAL: Puppet has 2 failures [20:06:06] PROBLEM - graphite.wikimedia.org on graphite1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.003 second response time [20:06:13] (03PS1) 10Yuvipanda: tools: Make webservice2 support Restarted webservice jobs [puppet] - 10https://gerrit.wikimedia.org/r/201268 (https://phabricator.wikimedia.org/T93334) [20:06:21] Coren: ^ [20:06:21] Need a root to do `chmod -R g+w /srv/deployment/scap/scap/.git` [20:06:23] (03CR) 10jenkins-bot: [V: 04-1] tools: Make webservice2 support Restarted webservice jobs [puppet] - 10https://gerrit.wikimedia.org/r/201268 (https://phabricator.wikimedia.org/T93334) (owner: 10Yuvipanda) [20:06:30] bd808: heya! I can help! [20:06:36] sweet [20:06:44] * YuviPanda looks [20:06:55] (03PS2) 10Yuvipanda: tools: Make webservice2 support Restarted webservice jobs [puppet] - 10https://gerrit.wikimedia.org/r/201268 (https://phabricator.wikimedia.org/T93334) [20:07:20] also: why is there an s in there? drwxrwsr-x [20:07:29] is that sticky bit? is it needed? [20:07:57] It should make new files inherit the directory permissions [20:08:03] in theory [20:08:25] !log run chmod -R g+w . 
on tin with CWD /srv/deployment/scap/scap/.git [20:08:27] twentyafterfour: done [20:08:33] Logged the message, Master [20:15:36] (03CR) 10Yuvipanda: [C: 032] tools: Make webservice2 support Restarted webservice jobs [puppet] - 10https://gerrit.wikimedia.org/r/201268 (https://phabricator.wikimedia.org/T93334) (owner: 10Yuvipanda) [20:19:09] !log twentyafterfour Started scap: retrying: testwiki to php-1.25wmf24 and rebuild l10n cache [20:19:17] Logged the message, Master [20:22:03] !log twentyafterfour scap failed: CalledProcessError Command '/usr/local/bin/mwscript rebuildLocalisationCache.php --wiki="testwiki" --outdir="/tmp/scap_l10n_3136974758" --threads=4 --lang en --quiet' returned non-zero exit status 255 (duration: 02m 53s) [20:22:08] Logged the message, Master [20:23:29] grr [20:23:37] bd808: ^ [20:23:50] Is that all it told you? [20:24:11] yep [20:24:34] ok. Run that command by hand to see what really broke it [20:24:52] Usually it means there is a messed up extension [20:25:21] You'll need to change the tmp dir [20:26:06] huh. the tmp dir is still there so you should be able to just copy and run [20:26:13] yeah ... [20:26:16] missing extension [20:26:21] ok I can figure it out from here [20:26:42] The error reporting from scap still sux [20:28:42] !log deploying RESTBase 0.5.0 [20:28:49] Logged the message, Master [20:30:07] (03PS1) 10Yuvipanda: tools: Make bigbrother stop using webservice [puppet] - 10https://gerrit.wikimedia.org/r/201283 (https://phabricator.wikimedia.org/T90855) [20:30:11] Coren: ^ perl patch! :) [20:30:17] (to bigbrother) [20:31:00] (03CR) 10coren: [C: 031] "That'll work."
[puppet] - 10https://gerrit.wikimedia.org/r/201283 (https://phabricator.wikimedia.org/T90855) (owner: 10Yuvipanda) [20:32:56] !log finished RESTBase 0.5.0 deployment [20:33:01] Logged the message, Master [20:47:36] (03PS1) 10BryanDavis: Trebuchet: run all state changing git commands with umask 002 [puppet] - 10https://gerrit.wikimedia.org/r/201344 (https://phabricator.wikimedia.org/T94754) [20:52:15] (03PS1) 10EBernhardson: Use production settings for wgContentHandlerUseDB in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201345 [20:55:45] (03CR) 10Chad: [C: 032] Use production settings for wgContentHandlerUseDB in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201345 (owner: 10EBernhardson) [20:55:52] (03Merged) 10jenkins-bot: Use production settings for wgContentHandlerUseDB in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201345 (owner: 10EBernhardson) [20:56:13] (03PS2) 10Yuvipanda: tools: Make bigbrother stop using old webservice [puppet] - 10https://gerrit.wikimedia.org/r/201283 (https://phabricator.wikimedia.org/T90855) [20:56:27] (03CR) 10Yuvipanda: [C: 032 V: 032] tools: Make bigbrother stop using old webservice [puppet] - 10https://gerrit.wikimedia.org/r/201283 (https://phabricator.wikimedia.org/T90855) (owner: 10Yuvipanda) [21:18:23] !log twentyafterfour Started scap: testwiki to 1.25wmf24 and rebuild l10n cache [21:18:28] Logged the message, Master [21:18:46] (03PS2) 10Thcipriani: Allow override of sync_common config [puppet] - 10https://gerrit.wikimedia.org/r/198173 (https://phabricator.wikimedia.org/T91548) [21:19:10] !log twentyafterfour scap failed: CalledProcessError Command '/usr/local/bin/mwscript rebuildLocalisationCache.php --wiki="testwiki" --outdir="/tmp/scap_l10n_186708348" --threads=4 --lang en --quiet' returned non-zero exit status 255 (duration: 00m 46s) [21:19:53] good god [21:20:40] no details? 
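The repeated "returned non-zero exit status 255" reports above carry no output from the failing sub-command, which is why the command had to be re-run by hand. A minimal sketch of one way a wrapper could keep that output, assuming Python 3.7+; `run_and_report` is a hypothetical helper for illustration, not scap's actual code:

```python
import subprocess
import sys


def run_and_report(cmd):
    """Run cmd (a list of arguments); on failure, raise an error that
    includes the command's combined stdout/stderr, not just the exit code.

    Hypothetical helper for illustration only; not part of scap.
    """
    proc = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # fold stderr into stdout so nothing is lost
        text=True,
    )
    if proc.returncode != 0:
        raise RuntimeError(
            "command %r returned non-zero exit status %d\n--- output ---\n%s"
            % (cmd, proc.returncode, proc.stdout)
        )
    return proc.stdout


if __name__ == "__main__":
    # Stand-in for a failing maintenance script: prints a reason, exits 255.
    failing = [sys.executable, "-c",
               "import sys; print('missing extension'); sys.exit(255)"]
    try:
        run_and_report(failing)
    except RuntimeError as err:
        print(err)
```

With this shape, a log entry built from the exception would show the underlying reason (here the stand-in text "missing extension") next to the exit status.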
[21:20:41] !log twentyafterfour Started scap: once again: testwiki to 1.25wmf24 and rebuild l10n cache [21:20:46] Logged the message, Master [21:21:32] greg-g: aude no details. I'll try to fix that so that it captures the output from the sub-command. it's working now I think [21:21:48] hm [21:22:17] (03PS1) 10Yuvipanda: tools: Make webservice2 accept old style -{server} arguments [puppet] - 10https://gerrit.wikimedia.org/r/201352 (https://phabricator.wikimedia.org/T90855) [21:27:03] (03PS2) 10Yuvipanda: tools: Make webservice2 accept old style -{server} arguments [puppet] - 10https://gerrit.wikimedia.org/r/201352 (https://phabricator.wikimedia.org/T90855) [21:27:14] (03CR) 10Yuvipanda: [C: 032 V: 032] tools: Make webservice2 accept old style -{server} arguments [puppet] - 10https://gerrit.wikimedia.org/r/201352 (https://phabricator.wikimedia.org/T90855) (owner: 10Yuvipanda) [21:27:46] 21:27:27 Started sync-proxies - yay! [21:38:33] 21:37:43 ['/srv/deployment/scap/scap/bin/sync-common', '--no-update-l10n', 'mw1010.eqiad.wmnet', 'mw1033.eqiad.wmnet', 'mw1070.eqiad.wmnet', 'mw1097.eqiad.wmnet', 'mw1216.eqiad.wmnet', 'mw1161.eqiad.wmnet', 'mw1201.eqiad.wmnet', 'mw2001.codfw.wmnet', 'mw2041.codfw.wmnet', 'mw2080.codfw.wmnet', 'mw2119.codfw.wmnet', 'mw2187.codfw.wmnet'] on mw2213.codfw.wmnet returned [255]: Host key verification [21:38:34] failed. [21:39:03] mw2213 is still the same broken one, isn't it? [21:39:08] Yeah [21:39:21] But at least it's throwing a different error now [21:39:31] It's getting better. Last time it was timing out I think. [21:39:31] It's still slowing down syncs though, so getting that box fixed would be nice [21:39:32] Guys, I don't think ops is going to fix it simply by us complaining enough times, they do have a task open >_> [21:39:34] I'm guessing it's been rebuilt so now it's got a new key? 
[21:39:53] It is being installed, last thing I heard [21:39:57] should be collected by puppet and sent to tin [21:40:00] Before you couldn't even ping it [21:40:07] but that might take a while [21:40:22] We should remove it from the list of things to sync to in the mean time [21:41:08] oh, apparently installation was complete [21:41:31] it's just a cached key on tin ... [21:42:56] PROBLEM - puppet last run on mw2089 is CRITICAL: CRITICAL: puppet fail [21:43:27] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [21:43:48] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [21:45:07] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [21:45:14] 6operations, 10ops-codfw, 3codfw-appserver-setup, 3wikis-in-codfw: mw2208-2209, mw2213 have unreachable mgmt interfaces - https://phabricator.wikimedia.org/T93857#1172199 (10mmodell) how long does it take for puppet to collect the host keys from a newly installed machine? I assume it should happen with the... [21:45:28] 6operations, 10RESTBase, 10RESTBase-Cassandra, 6Security, 5Patch-For-Review: iptables firewall to limit access to Cassandra services - https://phabricator.wikimedia.org/T92680#1172200 (10GWicke) @dzahn, staging looks good to me. Could we carefully roll this out to one production box at a time? 
[21:56:05] !log repooled cp107[1234] in pybal (eqiad upload, row D) [21:56:09] Logged the message, Master [22:01:17] RECOVERY - puppet last run on mw2089 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [22:04:06] (03PS1) 10BBlack: repool cp107[1234] (eqiad upload) backends [puppet] - 10https://gerrit.wikimedia.org/r/201363 [22:04:44] (03CR) 10BBlack: [C: 032 V: 032] repool cp107[1234] (eqiad upload) backends [puppet] - 10https://gerrit.wikimedia.org/r/201363 (owner: 10BBlack) [22:05:15] Yuvipanda: tools: Make webservice2 accept old style -{server} arguments (0734af4) [22:05:19] ^ ok to merge? [22:06:12] (03PS3) 10Yuvipanda: mediawiki: Ensure that /etc/php5/apache dir exists [puppet] - 10https://gerrit.wikimedia.org/r/196773 (https://phabricator.wikimedia.org/T88442) [22:06:20] bblack: yeah, is ok. I thought I had merged... [22:06:26] ^d: bd808 ^^ [22:06:37] done! [22:06:41] not sure if that’ll work, since it assumes that that package is installed... [22:07:07] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [22:07:15] YuviPanda: interestingly, I think it was merged on strontium but not palladium? [22:07:21] never seen that before [22:07:42] bblack: ugh. [22:07:50] bblack: I manually merged it on strontium... [22:08:07] :) [22:08:11] bblack: I think what happened is I forgot to merge on palladium, saw the strontium error, still didn’t remember to merge on palladium but assumed strontium failed and merged in strontium... 
[22:08:20] haha [22:08:32] (03CR) 10Chad: [C: 031] "Worked as intended in staging and passed compiler: http://puppet-compiler.wmflabs.org/667/change/201209/html/" [puppet] - 10https://gerrit.wikimedia.org/r/201209 (owner: 10Chad) [22:08:50] <^d> YuviPanda: ^ [22:09:13] (03PS2) 10Yuvipanda: Make sure mediawiki-config is cloned before running sync-common [puppet] - 10https://gerrit.wikimedia.org/r/201209 (owner: 10Chad) [22:10:15] (03PS2) 10Hoo man: Adopt dumpwikidatajson.sh to the new naming pattern [puppet] - 10https://gerrit.wikimedia.org/r/201238 (https://phabricator.wikimedia.org/T72385) [22:10:27] (03CR) 10Hoo man: Adopt dumpwikidatajson.sh to the new naming pattern (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/201238 (https://phabricator.wikimedia.org/T72385) (owner: 10Hoo man) [22:10:33] (03CR) 10Yuvipanda: [C: 032] Make sure mediawiki-config is cloned before running sync-common [puppet] - 10https://gerrit.wikimedia.org/r/201209 (owner: 10Chad) [22:13:30] ^d: your patch went ok btw [22:13:36] <^d> \o/ [22:14:55] !log twentyafterfour Finished scap: once again: testwiki to 1.25wmf24 and rebuild l10n cache (duration: 54m 14s) [22:15:00] Logged the message, Master [22:15:07] blech [22:15:23] That should be <30 minutes [22:15:35] :( [22:16:19] (03CR) 10Krinkle: Move web::sites to web::prod_sites; begin unification in new class (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/197655 (owner: 10Chad) [22:16:29] 6operations, 10MediaWiki-extensions-GeoData, 10OpenStreetMap, 10Wikimedia-Search: Assess growing GeoData hardware requirements - https://phabricator.wikimedia.org/T94768#1172320 (10MaxSem) 3NEW [22:16:59] (03CR) 10Chad: Move web::sites to web::prod_sites; begin unification in new class (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/197655 (owner: 10Chad) [22:17:16] <^d> YuviPanda: Error: Failed to apply catalog: Could not find dependency Package[libapache2-mod-php5] for File[/etc/php5/apache2/php.ini] at 
/etc/puppet/modules/mediawiki/manifests/php.pp:21 [22:17:21] bd808: it would be a lot faster if we didn't duplicate the entire repo for each new branch [22:17:39] instead of syncing the delta we're syncing all of mediawiki-core every wednesday [22:17:43] twentyafterfour: for sure. And if we didn't rsync all the things [22:18:10] The rsync for scap covers everything in /srv/mediawiki-staging [22:18:21] that's a lot of fstat calls [22:18:29] (03CR) 10Krinkle: Better support checking out MediaWiki & extension masters (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/198783 (owner: 10Chad) [22:19:03] (03CR) 10Chad: Better support checking out MediaWiki & extension masters (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/198783 (owner: 10Chad) [22:19:43] We are getting a lot of 5+ minute individual host syncs now too [22:19:56] I wonder if we are swamping out the rsync servers? [22:20:00] (03CR) 1020after4: "@chad: I like the idea of using the extensions repo instead of core - it isolates the submodule fiddling into it's own dedicated repo, see" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/198783 (owner: 10Chad) [22:27:12] (03CR) 1020after4: [C: 032] Wikipedias to 1.25wmf23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201254 (owner: 1020after4) [22:29:26] (03PS2) 1020after4: Wikipedias to 1.25wmf23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201254 [22:30:07] andrewbogott, ... hi [22:30:11] you put https://gerrit.wikimedia.org/r/#/c/201264/ up for swat [22:30:16] but it's not merged to master? [22:30:34] I can merge it right now if you like. Not sure what the process is. [22:30:44] SWAT can't take patches that aren't on master at least [22:31:08] ok… is that better? [22:31:20] no [22:31:25] !log twentyafterfour rebuilt wikiversions.cdb and synchronized wikiversions files: wikipedias to 1.25wmf23 [22:31:26] that's much worse, you self-approved [22:31:30] Logged the message, Master [22:31:36] ... 
[22:31:48] There’s not really anyone who can review OSM code but me. [22:32:01] then how is it in production? :/ [22:32:03] But, tell me what you’d like me to do and I’ll do it. [22:32:13] It only runs on wikitech [22:33:21] andrewbogott, re-read https://www.mediawiki.org/wiki/Gerrit/%2B2#Self-merge [22:37:46] (03PS2) 1020after4: Group0 to 1.25wmf24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201255 [22:37:47] andrewbogott, why does OpenStackNovaProject have both getName and getProjectName which do the same thing? [22:37:58] (03CR) 1020after4: [C: 032] Group0 to 1.25wmf24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201255 (owner: 1020after4) [22:40:07] andrewbogott, anyway this seems harmless enough that I'll accept it for swat today. [22:40:15] Krenair: I don’t know, I presume that getName() was added after the fact to comply with some api standard. [22:40:33] Our long-term goal is to murder most of that extension; today’s patch is a tiny, transitional step in that direction :) [22:41:02] Thanks for accepting the change. If you’re interested in reviewing future OSM patches I’d love to add you; at the moment I don’t know that anyone cares. [22:41:33] (03Merged) 10jenkins-bot: Group0 to 1.25wmf24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201255 (owner: 1020after4) [22:41:48] I can't really claim to know a huge amount about OpenStack, but we'll see [22:42:06] greg-g: The mobile team was wondering if we could do our config change deployment a little before the SWAT window (nothing else is scheduled then). Should only take a few minutes. [22:43:57] kaldari: what's the config change? (I responded as such to another ping in another channel ;) ) [22:44:24] (03CR) 10Kaldari: Enable Gather on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201065 (owner: 10Kaldari) [22:44:30] that one, I guess :) [22:44:34] kaldari: kk [22:44:37] yep :) [22:44:53] OK, I’ll go ahead and do it now.
[22:46:01] (03CR) 10Kaldari: [C: 032] Enable Gather on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201065 (owner: 10Kaldari) [22:46:08] (03Merged) 10jenkins-bot: Enable Gather on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201065 (owner: 10Kaldari) [22:46:35] jouncebot, next [22:46:35] In 0 hour(s) and 13 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150401T2300) [22:47:47] greg-g: I think we need to lengthen Wednesday train deployment window. it's taken me waaaay longer than my window pretty much every week I've done it. [22:48:35] !log twentyafterfour rebuilt wikiversions.cdb and synchronized wikiversions files: group0 to 1.25wmf24 [22:48:40] Logged the message, Master [22:49:25] so now I'm getting yet a different error from mw2213: [22:49:27] sudo -u mwdeploy -n -- /usr/bin/rsync -l tin.eqiad.wmnet::common/wikiversions*.{json,cdb} /srv/mediawiki on mw2213.codfw.wmnet returned [255]: Permission denied (publickey,password). [22:49:27] andrewbogott, I think we also need wmf24 [22:49:48] * andrewbogott asks dumbly: [22:49:53] What’s master these days? 25? [22:50:02] 1.25 [22:50:31] Dang. Yeah, I guess it needs to be on the intervening branch(es) as well. [22:50:39] How much of a pain is that? [22:50:41] (03PS1) 10Faidon Liambotis: Drain esams of all traffic v2 (scheduled outage) [dns] - 10https://gerrit.wikimedia.org/r/201371 [22:50:47] bblack: ^ [22:50:51] twentyafterfour: I'm starting to get annoyed with the codfw mws [22:51:03] andrewbogott, ... not sure I understand [22:51:34] it needs to be on 1.25wmf23 because that's what wikitech currently runs, it needs to be on 1.25wmf24 because wikitech will start running that on tuesday [22:51:45] Ah, yeah, that’s what I meant too :) [22:52:21] Sorry, I only barely understand how the deploy process works, I only just started having to pay attention to it for wikitech. [22:52:40] 1.25wmf25 doesn’t yet (or will never) exist, right? 
[22:52:49] andrewbogott: I'd be happy to review changes to that extension as well. I might even poke at improving it if it wasn't on the hit list. what's the alternative if we kill it off? [22:52:58] andrewbogott: you can make the cherry-picks to the deployment branches from the gerrit interface [22:53:10] andrewbogott: right https://www.mediawiki.org/wiki/MediaWiki_1.25/Roadmap#Schedule_for_the_deployments [22:53:22] andrewbogott: Then you have to merge and prepare a submodule update for the branch [22:53:22] twentyafterfour: https://horizon.wikimedia.org/ [22:53:24] wmf24 is the last 1.25 [22:53:31] 1.26wmf1 is next [22:53:39] (03PS1) 10Hoo man: Generalize wikidata dump scripts [puppet] - 10https://gerrit.wikimedia.org/r/201372 [22:53:49] andrewbogott: is that open to everyone yet? I wanna try it ;) [22:53:57] twentyafterfour: shell name and password [22:54:06] But don’t create or delete instances, it’ll break stuff :) [22:54:13] * andrewbogott tried to disable that last night but failed [22:54:14] (03CR) 10BBlack: [C: 031] Drain esams of all traffic v2 (scheduled outage) [dns] - 10https://gerrit.wikimedia.org/r/201371 (owner: 10Faidon Liambotis) [22:54:35] andrewbogott: ok [22:55:46] (03CR) 10Jforrester: [C: 031] Make VisualEditor access RESTbase directly in group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/200098 (https://phabricator.wikimedia.org/T90374) (owner: 10Jforrester) [22:56:01] andrewbogott: that looks like it's going to be really nice [22:56:14] twentyafterfour: yeah, but it’ll take a while before it fits our use case [22:56:28] (03CR) 10GWicke: [C: 031] Make VisualEditor access RESTbase directly in group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/200098 (https://phabricator.wikimedia.org/T90374) (owner: 10Jforrester) [22:57:03] (03CR) 10Jforrester: [C: 04-2] "Not until enwiki is running wmf24 (8 April)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/200105 (https://phabricator.wikimedia.org/T90374) (owner: 10Jforrester) 
[22:57:17] !log kaldari Synchronized wmf-config/InitialiseSettings.php: enabling Gather on enwiki (duration: 00m 13s) [22:57:25] Logged the message, Master [22:57:36] (03CR) 10Jforrester: [C: 04-2] "Not until all wikis are in RESTbase, including private ones." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/200107 (https://phabricator.wikimedia.org/T90374) (owner: 10Jforrester) [22:57:43] !log twentyafterfour Purged l10n cache for 1.25wmf22 [22:57:47] Logged the message, Master [22:59:54] Krenair: Thanks for updating the deployment schedule [23:00:04] RoanKattouw, ^d, andrewbogott: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150401T2300). [23:00:11] (am doing it) [23:00:56] here [23:01:06] kaldari, twentyafterfour: you guys both all done, right? [23:01:16] Krenair: yep, all done now [23:01:28] (03CR) 10Alex Monk: [C: 032] Raise use_dnsmasq for new instances [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201197 (owner: 10Andrew Bogott) [23:01:35] (03Merged) 10jenkins-bot: Raise use_dnsmasq for new instances [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201197 (owner: 10Andrew Bogott) [23:01:40] 6operations, 7HTTPS, 3HTTPS-by-default: Expand HTTP frontend clusters with new hardware - https://phabricator.wikimedia.org/T86663#1172440 (10faidon) Now that asw-d has more capacity, the new eqiad-upload caches, cp107[1234], were repooled today and everything looks about the same as before. [23:01:42] I'm done [23:01:44] andrewbogott, can you check this is working when I sync it? [23:01:51] yep! 
[23:02:33] hmm [23:02:37] !log krenair Synchronized wmf-config/wikitech.php: https://gerrit.wikimedia.org/r/#/c/201197/ (duration: 00m 11s) [23:02:42] mw2213 is back to "Host key verification failed" [23:02:43] Logged the message, Master [23:02:48] andrewbogott, ok [23:02:53] (03CR) 10Catrope: [C: 04-1] "This needs the $wgExtensionFunctions thing from https://gerrit.wikimedia.org/r/#/c/200196/1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/200098 (https://phabricator.wikimedia.org/T90374) (owner: 10Jforrester) [23:03:19] (03PS4) 10Catrope: Make VisualEditor access RESTbase directly in group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/200098 (https://phabricator.wikimedia.org/T90374) (owner: 10Jforrester) [23:03:23] krenair: testing... [23:03:26] thanks [23:03:46] (03CR) 1020after4: [C: 031] Trebuchet: run all state changing git commands with umask 002 [puppet] - 10https://gerrit.wikimedia.org/r/201344 (https://phabricator.wikimedia.org/T94754) (owner: 10BryanDavis) [23:04:16] nice of you to join us, znc >_> [23:04:24] Krenair: yep, the use_dnsmasq thing is working great. Is the other change sync’d too now? [23:04:39] (03PS5) 10Catrope: Make VisualEditor access RESTbase directly in group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/200098 (https://phabricator.wikimedia.org/T90374) (owner: 10Jforrester) [23:04:41] nope, waiting on jenkins to merge it first [23:04:46] ‘k [23:04:51] (03CR) 10Catrope: [C: 031] Make VisualEditor access RESTbase directly in group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/200098 (https://phabricator.wikimedia.org/T90374) (owner: 10Jforrester) [23:05:45] oh there is : https://gerrit.wikimedia.org/r/#/c/169716/ [23:06:38] andrewbogott, jenkins can take anywhere from a few minutes to what sometimes feels like half an hour, depending on the queue. 
luckily we seem to be at the front of the queue today [23:07:41] Krenair: someone else suggested just +2'ing all the core bumps before SWAT, can rebase and deploy in the order they get merged instead of waiting [23:08:08] kind of irritating that we don't get to choose which tests to run. for OSM changes to WMF production there's no point running the hhvm tests, because silver runs zend still [23:08:25] ebernhardson, ok, I just +2'd your flow change as well then [23:08:59] Krenair: we definitely run more tests than needed for a lot of changes. not to mention the tests run like 3 times per patch revision [23:12:56] (03CR) 10Alex Monk: "> krenair@tin:/srv/mediawiki-staging$ mwscript eval.php zerowiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/200098 (https://phabricator.wikimedia.org/T90374) (owner: 10Jforrester) [23:14:43] (03PS3) 10Hoo man: Add a script to create Wikidata ttl dumps [puppet] - 10https://gerrit.wikimedia.org/r/201003 (https://phabricator.wikimedia.org/T93658) (owner: 10Smalyshev) [23:15:03] (03PS6) 10Catrope: Make VisualEditor access RESTbase directly in group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/200098 (https://phabricator.wikimedia.org/T90374) (owner: 10Jforrester) [23:15:16] (03CR) 10Hoo man: "Totally re-did this change on top of my commit chain." 
[puppet] - 10https://gerrit.wikimedia.org/r/201003 (https://phabricator.wikimedia.org/T93658) (owner: 10Smalyshev) [23:15:42] oh hurry up jenkins [23:16:11] (03CR) 10Smalyshev: [C: 031] Add a script to create Wikidata ttl dumps [puppet] - 10https://gerrit.wikimedia.org/r/201003 (https://phabricator.wikimedia.org/T93658) (owner: 10Smalyshev) [23:16:26] (03CR) 10Alex Monk: [C: 031] "ok" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/200098 (https://phabricator.wikimedia.org/T90374) (owner: 10Jforrester) [23:17:05] (03PS4) 10Hoo man: Add a script to create Wikidata ttl dumps [puppet] - 10https://gerrit.wikimedia.org/r/201003 (https://phabricator.wikimedia.org/T93658) (owner: 10Smalyshev) [23:17:52] blame zend [23:18:59] errr... ah. andrewbogott have to do a part of the process I've never run before :) [23:19:06] I have* [23:19:21] Krenair: I’m patient :) [23:23:21] hmm.. I think it's fine [23:23:21] * Jamesofur gets a drink while he waits [23:23:21] okay, the instructions are totally wrong [23:23:21] that sounds normal [23:23:21] andrewbogott, okay, syncing to wikitech [23:23:22] !log krenair Synchronized php-1.25wmf23/extensions/OpenStackManager/nova/OpenStackNovaHost.php: https://gerrit.wikimedia.org/r/#/c/201367/ (duration: 00m 15s) [23:23:22] mw2213... still upset about host keys [23:23:22] but should be good [23:23:22] Logged the message, Master [23:23:44] please test andrewbogott [23:23:54] ok! [23:25:00] Krenair: my patch is broken :( [23:25:06] Best to fix it or revert? [23:25:10] how broken? [23:25:48] (03PS1) 10EBernhardson: Enable VE for editing within Flow on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201374 [23:26:30] andrewbogott, are we throwing an exception when you try to create a new instance, or...? [23:26:37] Krenair: yeah, that’s right. [23:26:39] And I know the fix [23:27:09] how simple is it? [23:27:48] quite. One second... 
[23:28:15] Jamesofur [23:28:23] o\ [23:28:25] (03CR) 10Alex Monk: [C: 032] Allow gather-hidelist to be used in global user groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201093 (https://phabricator.wikimedia.org/T94652) (owner: 10Jalexander) [23:28:25] I had it right in the previous patch version and then overthought :( [23:28:32] (03Merged) 10jenkins-bot: Allow gather-hidelist to be used in global user groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201093 (https://phabricator.wikimedia.org/T94652) (owner: 10Jalexander) [23:28:42] Krenair: https://gerrit.wikimedia.org/r/#/c/201376/ [23:28:45] lemme hotfix and double-check [23:28:46] -- [Jamesofur] idle 00:00:00 [23:28:49] :O [23:28:52] :P [23:29:00] Hey James :D [23:29:06] howdy howdy :) [23:29:22] !log krenair Synchronized wmf-config/CommonSettings.php: https://gerrit.wikimedia.org/r/#/c/201093/ (duration: 00m 12s) [23:29:24] Jamesofur, ^ [23:29:29] Logged the message, Master [23:29:30] thank ye, checkin [23:29:51] Krenair: looks good, thanks [23:30:04] huh, it's internationalised too? [23:30:06] magic [23:30:08] it is, weird [23:30:09] lol [23:30:14] (03PS2) 10EBernhardson: Enable VE for editing within Flow on phase0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201374 [23:30:19] Was just checking that [23:30:22] Krenair: ok, verified that that change fixes things. [23:30:34] andrewbogott, okay [23:30:45] Thank you, sorry for the dumb mistake [23:31:06] andrewbogott, I have a feeling I've run into that exact thing myself before [23:31:48] moving on to ebernhardson while that goes through jenkins [23:32:00] 6operations, 10Labs-Vagrant: Backport Vagrant 1.7+ from Debian experimental to our Trusty apt repo - https://phabricator.wikimedia.org/T93153#1172491 (10bd808) I don't know if it would be acceptable, but there are [[https://www.vagrantup.com/downloads.html|universal deb packages]] provided by Vagrant directly... 
[23:33:33] ebernhardson [23:33:40] !log krenair Synchronized php-1.25wmf24/extensions/Flow/modules/editor/editors/visualeditor/ext.flow.editors.visualeditor.js: https://gerrit.wikimedia.org/r/#/c/201360/ (duration: 00m 11s) [23:33:47] Logged the message, Master [23:33:53] that's "Don't consider visualeditor-enable" [23:33:58] i.e. https://github.com/wikimedia/mediawiki-extensions-Flow/commit/31fb3020fec8f2ec1500f62a281187eab3d1b2b2 [23:34:16] Krenair: it needs the whole stack, plus the config change i just added to swat (https://gerrit.wikimedia.org/r/201374) to be able to test [23:34:24] its a new features thats not turned on [23:34:54] Krenair: oh, its actually like 3 patches in that bump, just sync-dir extensions/Flow [23:35:04] !log krenair Synchronized php-1.25wmf24/extensions/Flow: https://gerrit.wikimedia.org/r/#/c/201360/ (duration: 00m 13s) [23:35:05] yeah doing [23:35:09] Logged the message, Master [23:35:11] is that better now? [23:35:21] Krenair: yea, but i cant test without the config change too :) [23:35:42] sigh [23:35:47] that wasn't on the list when I loaded the page [23:35:52] Krenair: sorry, i forgot and just added it [23:35:57] realized i couldn't test [23:36:10] (03CR) 10Alex Monk: [C: 032] Enable VE for editing within Flow on phase0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201374 (owner: 10EBernhardson) [23:36:17] (03Merged) 10jenkins-bot: Enable VE for editing within Flow on phase0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201374 (owner: 10EBernhardson) [23:36:59] !log krenair Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/201374/ (duration: 00m 12s) [23:37:04] Logged the message, Master [23:37:06] ebernhardson [23:38:37] Krenair: checking, minor issue :S looking into it [23:39:19] Krenair: nm, worked fine with debug=true, probably just a cache error [23:39:23] ok [23:39:31] (03PS1) 10Yuvipanda: tools: make webservice2 behave like webservice when called so [puppet] - 
10https://gerrit.wikimedia.org/r/201383 (https://phabricator.wikimedia.org/T90855) [23:39:34] it's only test wikis and mw.org anyway [23:39:40] (03CR) 10jenkins-bot: [V: 04-1] tools: make webservice2 behave like webservice when called so [puppet] - 10https://gerrit.wikimedia.org/r/201383 (https://phabricator.wikimedia.org/T90855) (owner: 10Yuvipanda) [23:40:10] (03PS2) 10Yuvipanda: tools: make webservice2 behave like webservice when called so [puppet] - 10https://gerrit.wikimedia.org/r/201383 (https://phabricator.wikimedia.org/T90855) [23:40:56] (03CR) 10Alex Monk: [C: 032] Make VisualEditor access RESTbase directly in group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/200098 (https://phabricator.wikimedia.org/T90374) (owner: 10Jforrester) [23:40:58] James_F [23:41:03] (03Merged) 10jenkins-bot: Make VisualEditor access RESTbase directly in group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/200098 (https://phabricator.wikimedia.org/T90374) (owner: 10Jforrester) [23:41:06] Krenair [23:41:50] !log krenair Synchronized wmf-config: https://gerrit.wikimedia.org/r/200098 (duration: 00m 12s) [23:41:51] ^ please test [23:41:58] Logged the message, Master [23:42:20] (03CR) 10Yuvipanda: [C: 032] tools: make webservice2 behave like webservice when called so [puppet] - 10https://gerrit.wikimedia.org/r/201383 (https://phabricator.wikimedia.org/T90855) (owner: 10Yuvipanda) [23:43:30] Krenair, James_F: looking good so far [23:43:35] Yup, works well. [23:44:04] ok then [23:44:10] returning to finish fixing up wikitech with andrewbogott [23:45:25] James_F: but for me mw.org still loads through the PHP API [23:45:40] gwicke: debug=true or wait for RL cache to expire. [23:45:58] ah, k [23:45:58] (And the cache looks like it's now expired.) [23:46:01] 5 minutes? [23:46:23] yup, got it now [23:46:26] Cool. [23:46:54] ha! nice to see the RB html request handily beating the PHP API request [23:47:02] Yup. [23:47:08] 92ms vs. 465ms. 
[23:47:40] ..and we have a new bottleneck [23:47:47] ;) [23:47:54] -_- [23:47:55] gwicke: this reminds me, we should add to [23:48:20] I wonder how much that'll speed it up by [23:48:45] ori: another option would be to set up a proxy rule for the actual domain, like /rest/ or /api/ [23:49:01] better for spdy as well [23:49:33] gwicke: If you can spend a few minutes, please have a look at: https://phabricator.wikimedia.org/T93913 (not urgent, but having comments would be nice) [23:50:26] hoo: hadn't seen that yet, thanks for the pointer [23:50:53] gwicke: yeah, I think opsen want to have a single IP for all the various hostnames implicated in standard page views [23:52:16] yeah, I'm happy to see that we are now on track to leverage SPDY/HTTP2 rather than continuing to build new things around the limitations of HTTP1 [23:53:28] in general yeah, it would be better not to fragment the commonly-accessed-hostnames space any further if we can help it [23:53:44] I suspect we'll keep upload separate, and meta for login may have to stay separate. [23:54:02] 6operations, 10RESTBase, 10RESTBase-Cassandra, 6Security, 5Patch-For-Review: iptables firewall to limit access to Cassandra services - https://phabricator.wikimedia.org/T92680#1172550 (10Dzahn) @gwicke one box at a time is not really possible unless we break up the regex in site.pp node /^restbase100[1-... [23:54:52] (upload could merge up too in theory, but in the current architecture it's nice to have it split out for traffic balancing given how huge a bw chunk it is on its own) [23:55:01] 6operations, 10RESTBase, 10RESTBase-Cassandra, 6Security, 5Patch-For-Review: iptables firewall to limit access to Cassandra services - https://phabricator.wikimedia.org/T92680#1172551 (10Dzahn) ..well unless we stop puppet on all of them, merge, and then reactivate puppet one by one [23:55:21] (03PS1) 10Mattflaschen: Add NS_TALK to VE for Flow, only on MW.org, test, and test2. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/201388 (https://phabricator.wikimedia.org/T94282) [23:55:55] oh I guess /topic outdated re: esams, but it's about to become un-outdated again in 5 minutes :) [23:55:56] bblack: makes sense [23:57:00] Krenair: are we still waiting on Jenkins? [23:57:10] nope [23:57:31] or... are we. hmm [23:57:57] (03PS1) 10Dzahn: cassandra: add firewalling on prod [puppet] - 10https://gerrit.wikimedia.org/r/201389 (https://phabricator.wikimedia.org/T92680) [23:58:05] That's strange [23:59:00] apparently it's still going? [23:59:08] I bet it restarted them again. sigh [23:59:21] !log depooling esams ahead of 2h planned GTT link outage coming up in 1h [23:59:29] Logged the message, Master [23:59:33] (03CR) 10BBlack: [C: 032] Drain esams of all traffic v2 (scheduled outage) [dns] - 10https://gerrit.wikimedia.org/r/201371 (owner: 10Faidon Liambotis) [23:59:40] OK