[00:04:35] !log upgrading bugzilla to 4.2.7 [00:04:47] Logged the message, Master [00:06:46] can someone explain to me like i'm five [00:06:58] why wikiversions.dat says: test2wiki php-1.22wmf22 * [00:07:23] blargh. never mind. I refreshed and now test2 magically says 22 as well. [00:14:13] AaronSchulz: did you have a chance to look at https://gerrit.wikimedia.org/r/#/c/90280/ ? [00:15:02] seems fine, can't be merged yet of course [00:15:26] (CR) Aaron Schulz: [C: 1] "Can be merged around deployment" [operations/puppet] - https://gerrit.wikimedia.org/r/90280 (owner: Legoktm) [00:16:10] do you mean like the normal weekly deployment or full deployment of the extension? [00:17:32] (PS1) Reedy: Kill postrewrites.conf, already handled in main.conf under wikipedia [operations/apache-config] - https://gerrit.wikimedia.org/r/90460 [00:18:46] I just realised I didn't properly follow up from that math rendering bug [00:20:54] did we have a bug for that? [00:20:58] yes [00:21:08] I replied to it explaining what you found and how you fixed it [00:21:09] never mind, found it [00:21:44] hope you don't mind [00:22:17] actually the MW cgroup was missing on almost all apaches [00:22:26] sigh [00:22:30] on those ones you listed, the cgroup filesystem wasn't even mounted [00:22:34] legoktm: the extension [00:22:38] so cgconfig had failed [00:22:50] not just mw-cgroup [00:23:03] I think I copied those from SAL [00:23:40] AaronSchulz: it's already on test + test2, or does it need to be everywhere? 
[00:24:07] yeah, but there was also [00:24:09] 00:41 Tim: manually created MW cgroups on all apaches since apparently the init script is totally broken [00:24:30] legoktm: ahh, didn't know that [00:24:35] I'm at a conf this week and pretty busy, sorry if the "this should be fixed" reply on the bug sounds a bit "can someone fix it for me" :) [00:24:36] I guess it can be done anytime then [00:24:42] !log awight synchronized php-1.22wmf21/extensions/CentralNotice [00:24:44] I just didn't want empty queues checked [00:24:54] Logged the message, Master [00:25:06] (CR) Aaron Schulz: "Actually anytime :)" [operations/puppet] - https://gerrit.wikimedia.org/r/90280 (owner: Legoktm) [00:25:18] thanks :) [00:25:28] somehow it already needs a rebase... [00:25:47] greg-g: AaronSchulz and I want to switch mediawiki writing files in swift to both pmtpa/eqiad (multiwrite) sometime next week [00:26:35] greg-g: last time around we agreed to involve you and schedule it so as not to conflict with deployments -- that was a bit more risky (ceph) though [00:27:44] but we can schedule it this time too, is monday okay e.g. before or during lightning deploy time? [00:31:39] !log fixing MW cgroup on mw1109 [00:31:45] :( [00:31:51] Logged the message, Master [00:32:00] csteipp: hey [00:32:10] Hey paravoid [00:32:13] csteipp: remind me, is Special:CentralAutoLogin/start?type=script supposed to be cached? [00:32:30] let me check for sure.. [00:32:52] (it's not, ~30% of apache requests right now) [00:33:07] paravoid: Yes, it is [00:33:45] ok, I'll reopen #54195 [00:33:51] let me collect headers first [00:33:58] That would be really helpful [00:34:13] PROBLEM - Apache HTTP on mw1109 is CRITICAL: Connection refused [00:34:32] !log awight synchronized php-1.22wmf21/extensions/CentralNotice [00:35:33] (PS1) Andrew Bogott: Include php5-cli in mediawiki_singlenode. 
[operations/puppet] - https://gerrit.wikimedia.org/r/90462 [00:36:25] (CR) Andrew Bogott: [C: 2] Include php5-cli in mediawiki_singlenode. [operations/puppet] - https://gerrit.wikimedia.org/r/90462 (owner: Andrew Bogott) [00:36:39] (CR) Andrew Bogott: [C: 2] Switch labs instances to use the mysql module. [operations/puppet] - https://gerrit.wikimedia.org/r/89764 (owner: Andrew Bogott) [00:38:23] root@mw1109:/var/log# initctl stop cgconfig [00:38:23] cgconfig stop/waiting [00:38:23] root@mw1109:/var/log# initctl status mw-cgroup [00:38:23] mw-cgroup start/running [00:38:41] so if you then start cgconfig, it doesn't start mw-cgroup because it's already started [00:39:10] that's what I pointed out before and what https://gerrit.wikimedia.org/r/#/c/83067/ was supposed to fix [00:39:36] I confess I didn't validate the fix worked as intended after merging [00:40:35] not sure why it would help [00:41:23] my interpretation of the fix was that when you'd do "stop cgconfig" it would automatically stop mw-cgroup too [00:42:26] well, a few minutes after I stopped it, the status of cgconfig is still "stop/waiting" [00:42:34] not stop/stopped [00:42:45] so I guess the event wasn't emitted [00:42:57] I'm going to do some more testing along these lines [00:43:50] hm, maybe we should make it 'stop on stopping' then [00:44:10] i don't remember ever seeing stop/stopped, but I don't have much experience with upstart [00:45:01] thanks, I can handle it if you'd prefer that, although not now for sure [00:47:54] the event was emitted [00:48:19] it goes stopping->killed->post-stop->waiting [00:48:27] and post-stop->waiting triggers the stopped event [00:48:52] cgred uses "stop on stopped cgconfig" and it was correctly stopped [00:51:02] (PS1) Ori.livneh: Send VisualEditor metrics to Ganglia via StatsD [operations/puppet] - https://gerrit.wikimedia.org/r/90464 [00:51:39] could someone merge this ^ ? 
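The stop sequence described at 00:48 (stopping, killed, post-stop, waiting, with the last transition triggering the "stopped" event) can be sketched as a toy model. This is only an illustration, not upstart's actual implementation: `Job`, `stop_job`, and the event-matching logic are hypothetical names, and the assumption that a "stopping" event is emitted on entering the stopping state is inferred from the 'stop on stopping' suggestion in the log.

```python
# Toy model of the stop sequence discussed in the log. State and event names
# come from the conversation; Job, stop_job and the matching logic are
# hypothetical and are not upstart's real implementation.

STOP_SEQUENCE = ["stopping", "killed", "post-stop", "waiting"]


class Job:
    def __init__(self, name, stop_on=None):
        self.name = name
        self.stop_on = stop_on  # e.g. ("stopped", "cgconfig"), or None
        self.running = True


def stop_job(target, jobs):
    """Stop `target`, return the emitted events, and stop matching jobs."""
    events = []
    previous = None
    for state in STOP_SEQUENCE:
        if state == "stopping":
            # Assumption: entering this state emits a "stopping" event.
            events.append(("stopping", target.name))
        if previous == "post-stop" and state == "waiting":
            # Per the log, this transition is what triggers "stopped".
            events.append(("stopped", target.name))
        previous = state
    target.running = False
    # Deliver each event to every job that declared a matching trigger.
    for event in events:
        for job in jobs:
            if job.running and job.stop_on == event:
                job.running = False
    return events


cgconfig = Job("cgconfig")
cgred = Job("cgred", stop_on=("stopped", "cgconfig"))
mw_cgroup = Job("mw-cgroup")  # assumption: no stop trigger registered
events = stop_job(cgconfig, [cgred, mw_cgroup])
print(events, cgred.running, mw_cgroup.running)
```

In this model a job with a "stop on stopped cgconfig" trigger goes down together with cgconfig, while a job with no registered trigger keeps running, which matches the `initctl status` output pasted at 00:38.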
[00:56:09] paravoid: Could you grab a few more headers and confirm if all of the redirects for /start are redirecting to the special page's alias on the same wiki? [00:57:29] The cache'd version should be a redirect to login.wikimedia.org/..../checkLoggedIn [01:01:13] RECOVERY - Apache HTTP on mw1109 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.080 second response time [01:01:30] !log awight synchronized php-1.22wmf21/extensions/CentralNotice [01:02:16] !log awight synchronized php-1.22wmf21/extensions/CentralNotice [01:03:25] (03PS1) 10Dzahn: install the bug-attachment.wm.org cert on kaulen [operations/puppet] - 10https://gerrit.wikimedia.org/r/90468 [01:03:58] mutante: can my patch (90464) ride along? [01:05:13] PROBLEM - Apache HTTP on mw1109 is CRITICAL: Connection refused [01:05:44] ori-l: sorry, i don't know what that is doing, it's getting late and last time something with statsd caused reverts [01:06:09] ok, np. [01:07:24] without the one above any apache restart would have killed Bugzilla:P [01:07:27] bbiaw [01:08:30] (03CR) 10Dzahn: [C: 032] install the bug-attachment.wm.org cert on kaulen [operations/puppet] - 10https://gerrit.wikimedia.org/r/90468 (owner: 10Dzahn) [01:17:04] csteipp: I even see http->https redirects with cache-control: private -- is this because of XFF & geo? [01:17:45] arg, no, bugzilla issue with the Apache config [01:18:15] RECOVERY - Apache HTTP on mw1109 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.316 second response time [01:19:15] PROBLEM - HTTP on kaulen is CRITICAL: Connection refused [01:19:42] !log awight synchronized php-1.22wmf21/extensions/CentralNotice [01:20:55] brought it back up.. 
this sucks [01:21:08] we had changes merged to apache config a while back but apparently never tested [01:21:15] RECOVERY - HTTP on kaulen is OK: HTTP OK: HTTP/1.1 302 Found - 489 bytes in 0.056 second response time [01:21:25] those new SSL certs aren't right [01:22:56] so it seems that mw-cgroup was somehow missing the relevant triggers in init's soft state [01:23:17] and that when I edited mw-cgroup.conf to test another theory, the configuration was reloaded and the problem was resolved [01:23:43] as soon as I edited it, mw-cgroup started getting cgconfig's events [01:37:46] (03CR) 10Dzahn: "this broke things on kaulen. see RT #5011. momentarily live-fixed and puppet deactivated" [operations/puppet] - 10https://gerrit.wikimedia.org/r/82879 (owner: 10RobH) [02:15:02] !log LocalisationUpdate completed (1.22wmf21) at Fri Oct 18 02:15:02 UTC 2013 [02:15:16] Logged the message, Master [02:27:12] (03PS1) 10Springle: s3 master rotation for upgrade [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/90476 [02:28:34] (03CR) 10Springle: [C: 032] s3 master rotation for upgrade [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/90476 (owner: 10Springle) [02:33:23] !log springle synchronized wmf-config/db-eqiad.php 's3 master rotation for upgrade' [02:33:34] Logged the message, Master [02:34:35] springle, you know how the other day I needed you to figure out why mysql wouldn't come up on an instance? [02:34:45] yep [02:34:52] hi, just got the following error when trying to save an edit: [02:34:53] Error: 1290 The MariaDB server is running with the --read-only option so it cannot execute this statement (10.64.16.27) [02:34:57] Today I have exactly the same question… pretty sure it's not the same dumb mistake, although probably a similar one [02:34:57] is this known? 
[02:35:04] drdee: yes [02:35:13] ok [02:35:16] something about read-only mode isn't so readonly [02:35:27] andrewbogott: 5 mins [02:37:05] (03PS1) 10Springle: S3 master rotation done [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/90480 [02:37:33] (03CR) 10Springle: [C: 032] S3 master rotation done [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/90480 (owner: 10Springle) [02:37:50] !log springle synchronized wmf-config/db-pmtpa.php 's3 master rotation for upgrade' [02:38:02] Logged the message, Master [02:38:23] springle, any time you have a moment. [02:38:28] gee tin rsync is slow today [02:38:31] Damn, it /looks/ like the same problem as before... [02:39:40] PROBLEM - MySQL Replication Heartbeat on db1038 is CRITICAL: CRIT replication delay 315 seconds [02:39:40] PROBLEM - MySQL Replication Heartbeat on db66 is CRITICAL: CRIT replication delay 323 seconds [02:39:50] PROBLEM - MySQL Replication Heartbeat on db1010 is CRITICAL: CRIT replication delay 324 seconds [02:39:50] PROBLEM - MySQL Replication Heartbeat on db34 is CRITICAL: CRIT replication delay 325 seconds [02:40:10] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 349 seconds [02:40:10] PROBLEM - MySQL Replication Heartbeat on db39 is CRITICAL: CRIT replication delay 351 seconds [02:40:11] PROBLEM - MySQL Replication Heartbeat on db1003 is CRITICAL: CRIT replication delay 352 seconds [02:43:04] (03PS1) 10Springle: db1038 is s3 master [operations/puppet] - 10https://gerrit.wikimedia.org/r/90481 [02:43:36] !log springle synchronized wmf-config/db-eqiad.php 's3 master rotation done' [02:43:45] !log LocalisationUpdate completed (1.22wmf22) at Fri Oct 18 02:43:45 UTC 2013 [02:43:50] Logged the message, Master [02:44:04] Logged the message, Master [02:44:18] !log springle synchronized wmf-config/db-pmtpa.php 's3 master rotation done' [02:44:31] Logged the message, Master [02:45:03] andrewbogott: ok, where-abouts [02:45:11] 
puppet-testing-9.pmtpa.wmflabs [02:46:12] publickey denied [02:46:29] from bastion-restricted.wmflabs.org ? [02:48:02] mutante: i think you were truncated mid-sentence [02:48:09] on 2675 [02:48:14] springle, sorry, one minute... [02:48:48] springle, better now? [02:49:40] yep [02:50:42] springle, this is my first attempt at applying this puppet class to a fresh machine. I know it works when applied to one that already has the db set up. [02:51:56] /mnt/mysql hasn't been initialized. needs mysql_install_db run once to set up stuff [02:52:32] Hm, ok. Yet another failing of the puppetlabs module I guess :( [02:52:50] :) [02:53:45] thanks [02:53:46] i won't do it. presumably you want to tweak and test [02:53:49] np [02:55:41] yep, thanks. [02:55:45] I'll bug you again when I have a patch [02:56:02] :) [02:56:33] Well, mysql_install_db doesn't appear anywhere in puppet code. So how was this working before? [02:56:40] Clearly we modify datadir in several places [02:57:46] normally dpkg runs mysql_install_db via hook. maybe datadir is being modified afterwards [02:58:32] or previously datadirs were cloned directly. certainly the coredbs usually get a datadir copy to avoid slow reload [02:58:50] * springle guessing [03:00:50] andrewbogott: /var/lib/mysql does have stuff set up and dated today. mysql_install_db ran sometime, but in the wrong place [03:01:13] well, sure it's run by dpkg, but... [03:01:37] Surely the package is installed and then the datadir is set, right? It can't happen in the other order because my.cnf wouldn't exist before the package install. [03:01:48] So, we'll always get a /var/lib/mysql because that's what the dpkg installs. [03:02:28] Ah! 
It does dpkg-reconfigure [03:02:48] if mysqld is not running it's safe to just cp -r to the new datadir and set perms [03:04:52] well… that presumes that the datadir is only ever set once… if set a second time we'd have to know the old data dir to cp [03:05:05] I think that reconfigure is the right solution, just have to get the order of ops right [03:05:13] sounds good [03:10:02] !log LocalisationUpdate ResourceLoader cache refresh completed at Fri Oct 18 03:10:02 UTC 2013 [03:10:14] Logged the message, Master [03:48:43] andrewbogott: would you like me to look over the module? [03:49:00] ori-l, the mysql module you mean? [03:49:08] Sure, although I'm somewhat down the path at this point. [03:51:10] I didn't mean a thumbs up / thumbs down, just linting. Up to you. [03:51:32] Yeah, linting would be great. [03:52:02] It has a bunch of stuff for rpm in there… not sure if we should just commit to forking and rip that stuff out. [03:53:33] I... well, you know. :) [03:55:50] yeah :) [03:55:57] that's the change, right? https://gerrit.wikimedia.org/r/#/c/88666/ [03:56:08] * ori-l promises to not be a -1 terrorist [03:57:11] That change is of interest, although not the one I'm working on now... [03:57:59] I'm not sure I'm up to the task of reconciling mysql, mysql_wmf, and coredb_mysql. [03:58:15] So they'll coexist for a while. Right now I'm working on 'mysql' [03:58:35] 'mysql' is a puppetlabs module, mysql_wmf is just copypasta of some of our old mysql_wmf classes. [03:58:44] (which, oddly, do not install mysql) [04:05:41] (PS1) Andrew Bogott: dpkg-reconfigure after my.cnf changes [operations/puppet] - https://gerrit.wikimedia.org/r/90483 [04:06:22] springle: ^ [04:09:32] nice [04:09:44] Ugh, I'm having terrible latency connecting to labs right now. That's just me, right, not happening on your end? 
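The datadir reasoning between 02:51 and 03:05 (an uninitialized /mnt/mysql, copying only while mysqld is stopped, otherwise running mysql_install_db) can be written down as a small decision helper. This is a rough sketch under stated assumptions, not what the puppet module or the merged patch does: `datadir_initialized` and `plan_datadir_change` are hypothetical names, the returned strings are invented labels, and the check assumes an initialized datadir contains a `mysql` system-schema directory.

```python
import os


def datadir_initialized(datadir):
    """Return True if `datadir` holds a `mysql` system-schema directory.

    Assumption: mysql_install_db creates that directory when it runs.
    """
    return os.path.isdir(os.path.join(datadir, "mysql"))


def plan_datadir_change(old_datadir, new_datadir, mysqld_running):
    """Pick a (hypothetical) next step for moving to `new_datadir`."""
    if datadir_initialized(new_datadir):
        return "nothing-to-do"
    if mysqld_running:
        # The log only calls the copy safe when mysqld is not running.
        return "stop-mysqld-first"
    if datadir_initialized(old_datadir):
        return "copy-old-datadir-and-fix-perms"
    return "run-mysql_install_db"
```

For the case in the log, something like `plan_datadir_change("/var/lib/mysql", "/mnt/mysql", mysqld_running=False)` would choose the copy step if /var/lib/mysql had already been initialized by the package install. The helper needs the old path as an argument, which is the same limitation raised at 03:04 about a datadir that is changed a second time.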
[04:10:51] seems ok here [04:11:12] ok, good [04:12:13] gerrit won't talk to me either :( [04:12:23] Seems increasingly unlikely that I'll get this fixed before bed [04:13:34] i think coredb_mysql will stay separate from this for some time. i'd be little nervous to link the two too tightly [04:14:47] fine with me, I'm scared to mess with coredb anyway [04:15:37] although, that paranoia is mainly in the my.cnf config area, and core monitoring needs to be hard to break [04:15:41] (03CR) 10Andrew Bogott: [C: 032] dpkg-reconfigure after my.cnf changes [operations/puppet] - 10https://gerrit.wikimedia.org/r/90483 (owner: 10Andrew Bogott) [04:16:36] d'you mind logging into sockpuppet and merging that? I can't reach the cluster right now and don't want to leave a mess :( [04:16:49] ok [04:16:59] because you speak of mysql and monitoring, i'm gonna dump this link .. but then also disappear for bed [04:17:02] https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=all&type=detail&servicestatustypes=8&hoststatustypes=3&serviceprops=2097162&nostatusheader [04:17:08] andrewbogott: done [04:17:09] (they are UNKNOWN, not crit) [04:17:21] but shouldn't they be crit .. or removed [04:17:22] thanks! [04:17:36] db1023,db44,db64 .. [04:17:56] mutante: they should [04:18:06] can't be that urgent because they are 25d old [04:18:14] but would be nice if we can get rid of them [04:18:27] they aren't urgent. will do [04:18:47] cool, thanks :) [04:22:20] well… clearly I've used all of my internets for the day. Thanks for your help, springle. [04:22:32] I'll worry about merging the mysql_wmf module next week probably. [04:22:39] np :) [04:22:44] good effort [04:22:49] (03PS1) 10Ori.livneh: coredb_mysql: lint [operations/puppet] - 10https://gerrit.wikimedia.org/r/90486 [04:22:54] :P [04:23:19] * ori-l blows out his smoking pistol. 
[04:23:23] (03CR) 10jenkins-bot: [V: 04-1] coredb_mysql: lint [operations/puppet] - 10https://gerrit.wikimedia.org/r/90486 (owner: 10Ori.livneh) [04:23:41] god damn it, jenkins. [04:23:43] heh [04:24:28] (03PS2) 10Ori.livneh: coredb_mysql: lint [operations/puppet] - 10https://gerrit.wikimedia.org/r/90486 [04:31:42] (03CR) 10Springle: [C: 032] coredb_mysql: lint [operations/puppet] - 10https://gerrit.wikimedia.org/r/90486 (owner: 10Ori.livneh) [04:32:09] thanks [04:58:40] (03PS1) 10Springle: insert HAproxy for S2 master rotation [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/90487 [04:59:59] (03CR) 10Springle: [C: 032] insert HAproxy for S2 master rotation [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/90487 (owner: 10Springle) [05:00:50] !log springle synchronized wmf-config/db-eqiad.php 'insert HAproxy for S2 master rotation' [05:01:06] Logged the message, Master [05:07:38] (03PS1) 10Springle: depool new master db1036 from slaves ready for HAproxy write-traffic switch [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/90488 [05:08:07] (03CR) 10Springle: [C: 032] depool new master db1036 from slaves ready for HAproxy write-traffic switch [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/90488 (owner: 10Springle) [05:09:06] !log springle synchronized wmf-config/db-eqiad.php 'depool new master db1036 from slaves ready for HAproxy write-traffic switch' [05:09:19] Logged the message, Master [05:16:45] PROBLEM - MySQL Slave Delay on db1034 is CRITICAL: CRIT replication delay 45700 seconds [05:18:45] RECOVERY - MySQL Slave Delay on db1034 is OK: OK replication delay 0 seconds [05:19:05] PROBLEM - MySQL Replication Heartbeat on db1002 is CRITICAL: CRIT replication delay 302 seconds [05:19:25] PROBLEM - MySQL Replication Heartbeat on db57 is CRITICAL: CRIT replication delay 322 seconds [05:19:35] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 331 seconds [05:19:45] PROBLEM - 
MySQL Replication Heartbeat on db1009 is CRITICAL: CRIT replication delay 334 seconds [05:19:45] PROBLEM - MySQL Replication Heartbeat on db1018 is CRITICAL: CRIT replication delay 339 seconds [05:19:55] PROBLEM - MySQL Replication Heartbeat on db54 is CRITICAL: CRIT replication delay 346 seconds [05:19:55] PROBLEM - MySQL Replication Heartbeat on db52 is CRITICAL: CRIT replication delay 348 seconds [05:20:05] PROBLEM - MySQL Replication Heartbeat on db1036 is CRITICAL: CRIT replication delay 358 seconds [05:20:14] * springle need to silence heartbeat during rotation.. bogus.. [05:24:38] (03PS1) 10Springle: remove HAproxy after S2 master switch [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/90489 [05:25:29] (03PS1) 10Ori.livneh: Set one a one-year Cache-control: max-age header for fonts. [operations/apache-config] - 10https://gerrit.wikimedia.org/r/90490 [05:25:41] TimStarling: could you review that? ^ [05:26:00] (03CR) 10Springle: [C: 032] remove HAproxy after S2 master switch [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/90489 (owner: 10Springle) [05:27:23] !log springle synchronized wmf-config/db-eqiad.php 'remove HAproxy after S2 master switch' [05:27:37] Logged the message, Master [05:28:19] !log springle synchronized wmf-config/db-pmtpa.php 'sync new S6 master setting to pmtpa' [05:28:32] Logged the message, Master [06:02:35] (03PS1) 10Springle: insert HAproxy into S7 for master rotation [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/90491 [06:07:19] springle: when did we first start with haproxy? maybe we could get a wikitech page on it? [06:07:36] (or if you just tell me about it I can maybe write something) [06:08:13] jeremyb: today really. i'm still experimenting how to fit it in. this stuff is only for master rotations [06:08:40] springle: hah, i just found a museum piece. 
RT 1351 [06:08:40] whether it gets used for proper load balancing slaves or whatever, still to decide [06:10:18] springle: ok, well it's mediawiki -> haproxy -> current master? [06:10:26] heh.. yeah, won't be going back to that script [06:10:36] and it was mediawiki -> current master before ? [06:10:46] during the switch-over yes [06:10:54] for a few mins [06:10:59] then back to normal [06:11:36] ok, so it's only for during a master switch [06:11:46] so far, yes [06:11:49] k [06:12:02] other usage won't happen until next year (likely) as it's down the list a bit [06:12:15] if it gets used more than for during switches it would be cool to get that represented on dbtree [06:12:26] definitely [06:13:03] didn't see a ticket or anything on https://wikitech.wikimedia.org/wiki/Projects [06:13:09] i guess maybe it's just your own list [06:13:32] only my own stuff-to-think-about list so far [06:13:33] anyway, interesting choice. vs. e.g. lvs because we already have people that know that [06:13:37] right [06:13:55] * jeremyb will keep an eye out for more haproxy :) [06:14:31] springle: do you want to reject the ticket? :) [06:15:28] lvs is a possibility. 
depends on whether it's done for load balancing or HA reasons, imo [06:17:17] plus how MHA will be used with puppet, etc [06:17:22] many choices [06:18:23] yeah [06:18:35] * jeremyb goes to sleep [06:18:55] thanks for the info [06:20:12] (03CR) 10Springle: [C: 032] insert HAproxy into S7 for master rotation [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/90491 (owner: 10Springle) [06:21:18] !log springle synchronized wmf-config/db-eqiad.php 'insert temporary HAproxy for S7 master rotation' [06:21:34] Logged the message, Master [06:24:31] (03PS1) 10Springle: depool db1039 for master switch [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/90494 [06:24:53] (03CR) 10Springle: [C: 032] depool db1039 for master switch [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/90494 (owner: 10Springle) [06:25:45] !log springle synchronized wmf-config/db-eqiad.php 'depool db1039 for master switch' [06:25:58] Logged the message, Master [06:28:05] (03PS1) 10Springle: remove Haproxy after S7 master switch [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/90495 [06:28:45] PROBLEM - MySQL Replication Heartbeat on db1039 is CRITICAL: CRIT replication delay 313 seconds [06:28:45] PROBLEM - MySQL Replication Heartbeat on db68 is CRITICAL: CRIT replication delay 316 seconds [06:28:45] PROBLEM - MySQL Replication Heartbeat on db58 is CRITICAL: CRIT replication delay 317 seconds [06:28:46] PROBLEM - MySQL Replication Heartbeat on db56 is CRITICAL: CRIT replication delay 320 seconds [06:28:55] PROBLEM - MySQL Replication Heartbeat on db1007 is CRITICAL: CRIT replication delay 325 seconds [06:28:55] PROBLEM - MySQL Replication Heartbeat on db1024 is CRITICAL: CRIT replication delay 325 seconds [06:28:55] PROBLEM - MySQL Replication Heartbeat on db37 is CRITICAL: CRIT replication delay 326 seconds [06:29:05] PROBLEM - MySQL Replication Heartbeat on db1028 is CRITICAL: CRIT replication delay 334 seconds [06:29:13] (03CR) 10Springle: 
[C: 032] remove Haproxy after S7 master switch [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/90495 (owner: 10Springle) [06:30:02] !log springle synchronized wmf-config/db-eqiad.php 'remove Haproxy after S7 master switch' [06:30:17] Logged the message, Master [06:41:12] (03PS1) 10Springle: update topology after rotations and upgrades [operations/puppet] - 10https://gerrit.wikimedia.org/r/90496 [06:42:29] (03CR) 10Springle: [C: 032] update topology after rotations and upgrades [operations/puppet] - 10https://gerrit.wikimedia.org/r/90496 (owner: 10Springle) [06:49:45] RECOVERY - MySQL Replication Heartbeat on db1009 is OK: OK replication delay 0 seconds [06:49:45] RECOVERY - MySQL Replication Heartbeat on db1018 is OK: OK replication delay 0 seconds [06:49:55] RECOVERY - MySQL Replication Heartbeat on db54 is OK: OK replication delay 0 seconds [06:49:55] RECOVERY - MySQL Replication Heartbeat on db52 is OK: OK replication delay 0 seconds [06:49:55] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [06:50:05] RECOVERY - MySQL Replication Heartbeat on db1036 is OK: OK replication delay 0 seconds [06:50:05] RECOVERY - MySQL Replication Heartbeat on db1002 is OK: OK replication delay 0 seconds [06:50:25] RECOVERY - MySQL Replication Heartbeat on db57 is OK: OK replication delay 0 seconds [06:51:05] RECOVERY - MySQL Replication Heartbeat on db39 is OK: OK replication delay 0 seconds [06:51:15] RECOVERY - MySQL Replication Heartbeat on db1035 is OK: OK replication delay 0 seconds [06:51:15] RECOVERY - MySQL Replication Heartbeat on db1003 is OK: OK replication delay 0 seconds [06:51:35] RECOVERY - MySQL Replication Heartbeat on db1038 is OK: OK replication delay 0 seconds [06:51:45] RECOVERY - MySQL Replication Heartbeat on db1039 is OK: OK replication delay 0 seconds [06:51:45] RECOVERY - MySQL Replication Heartbeat on db68 is OK: OK replication delay 0 seconds [06:51:45] RECOVERY - MySQL Replication Heartbeat on db66 
is OK: OK replication delay 0 seconds [06:51:45] RECOVERY - MySQL Replication Heartbeat on db1010 is OK: OK replication delay 0 seconds [06:51:46] RECOVERY - MySQL Replication Heartbeat on db58 is OK: OK replication delay 0 seconds [06:51:46] RECOVERY - MySQL Replication Heartbeat on db34 is OK: OK replication delay 0 seconds [06:51:46] RECOVERY - MySQL Replication Heartbeat on db56 is OK: OK replication delay 0 seconds [06:51:55] RECOVERY - MySQL Replication Heartbeat on db1024 is OK: OK replication delay 0 seconds [06:51:55] RECOVERY - MySQL Replication Heartbeat on db1007 is OK: OK replication delay 0 seconds [06:51:55] RECOVERY - MySQL Replication Heartbeat on db37 is OK: OK replication delay 0 seconds [06:52:05] RECOVERY - MySQL Replication Heartbeat on db1028 is OK: OK replication delay 0 seconds [06:59:57] (03PS1) 10Springle: stop generating icinga messages for db hosts out of action [operations/puppet] - 10https://gerrit.wikimedia.org/r/90498 [07:00:58] (03CR) 10Springle: [C: 032] stop generating icinga messages for db hosts out of action [operations/puppet] - 10https://gerrit.wikimedia.org/r/90498 (owner: 10Springle) [08:23:05] PROBLEM - Puppet freshness on cp1022 is CRITICAL: No successful Puppet run in the last 10 hours [08:23:06] PROBLEM - Puppet freshness on cp1029 is CRITICAL: No successful Puppet run in the last 10 hours [08:23:06] PROBLEM - Puppet freshness on cp1036 is CRITICAL: No successful Puppet run in the last 10 hours [08:23:15] PROBLEM - Puppet freshness on cp1021 is CRITICAL: No successful Puppet run in the last 10 hours [08:23:15] PROBLEM - Puppet freshness on cp1028 is CRITICAL: No successful Puppet run in the last 10 hours [08:23:15] PROBLEM - Puppet freshness on cp1035 is CRITICAL: No successful Puppet run in the last 10 hours [08:23:15] PROBLEM - Puppet freshness on cp1042 is CRITICAL: No successful Puppet run in the last 10 hours [08:23:25] PROBLEM - Puppet freshness on cp1027 is CRITICAL: No successful Puppet run in the last 10 hours 
[08:23:25] PROBLEM - Puppet freshness on cp1034 is CRITICAL: No successful Puppet run in the last 10 hours [08:23:25] PROBLEM - Puppet freshness on cp1041 is CRITICAL: No successful Puppet run in the last 10 hours [08:23:35] PROBLEM - Puppet freshness on cp1026 is CRITICAL: No successful Puppet run in the last 10 hours [08:23:35] PROBLEM - Puppet freshness on cp1033 is CRITICAL: No successful Puppet run in the last 10 hours [08:23:45] PROBLEM - Puppet freshness on cp1025 is CRITICAL: No successful Puppet run in the last 10 hours [08:23:45] PROBLEM - Puppet freshness on cp1032 is CRITICAL: No successful Puppet run in the last 10 hours [08:23:55] PROBLEM - Puppet freshness on cp1024 is CRITICAL: No successful Puppet run in the last 10 hours [08:23:55] PROBLEM - Puppet freshness on cp1031 is CRITICAL: No successful Puppet run in the last 10 hours [08:24:05] PROBLEM - Puppet freshness on cp1023 is CRITICAL: No successful Puppet run in the last 10 hours [08:24:05] PROBLEM - Puppet freshness on cp1030 is CRITICAL: No successful Puppet run in the last 10 hours [08:25:49] ugh, well sorry about that, those are lies but it's the usual: they are in decom.pp and not all the way gone [08:25:55] PROBLEM - Disk space on cp1022 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=54%): [08:26:20] (03PS1) 10Ori.livneh: coredb_mysql: convert a few additional leftover tabs to spaces [operations/puppet] - 10https://gerrit.wikimedia.org/r/90506 [08:28:55] RECOVERY - Puppet freshness on cp1041 is OK: puppet ran at Fri Oct 18 08:28:49 UTC 2013 [08:29:25] PROBLEM - Puppet freshness on cp1041 is CRITICAL: No successful Puppet run in the last 10 hours [08:31:50] apergos: morning :-] since you have done a few sql changes recently, do you have any idea how we maintained the DB user grants/passwords ? 
[08:31:58] apergos: that is not obvious in our puppet manifests :( [08:32:46] tbh I don't know, when we had a problem with one of the slaves having not gotten the right info we dumped the table from a good host and shovelled it into the slave [08:32:55] RECOVERY - Puppet freshness on cp1029 is OK: puppet ran at Fri Oct 18 08:32:45 UTC 2013 [08:33:04] passwords are in the private repo so they won't get lost [08:33:05] PROBLEM - Puppet freshness on cp1029 is CRITICAL: No successful Puppet run in the last 10 hours [08:33:40] here is where I wish icinga would quit telling me 'not authorized', I would turn off notifications for cp1021-1036 [08:33:41] oh well [08:33:55] RECOVERY - Puppet freshness on cp1032 is OK: puppet ran at Fri Oct 18 08:33:50 UTC 2013 [08:34:31] I should stare at the decom logic in the manifests again cause clearly that's broken for this case [08:34:45] PROBLEM - Puppet freshness on cp1032 is CRITICAL: No successful Puppet run in the last 10 hours [08:35:45] RECOVERY - Puppet freshness on cp1027 is OK: puppet ran at Fri Oct 18 08:35:40 UTC 2013 [08:36:11] when I've found that a host didn't have the right permissions for a user (the subnet was too narrow) I've just changed them on the master so they look like the rest in the cluster... that seems to be how it's gone for now [08:36:22] apergos: so one would fetch the password from the puppet repo and manually create the user? [08:36:25] PROBLEM - Puppet freshness on cp1027 is CRITICAL: No successful Puppet run in the last 10 hours [08:36:26] I know springle is working on some method to sync those all up [08:36:48] the puppet labs mysql module would let us create the user / ensure appropriate grants [08:36:55] not sure we want to rely on it though [08:37:19] to rely on it in labs you mean? [08:38:33] why not? 
[08:41:20] na I mean in production [08:41:27] I looked at the mysql puppet labs module [08:41:35] it has some stuff like database_grant() and database_user() [08:41:44] so potentially we could use those wrappers to set up grants/user [08:41:55] RECOVERY - Puppet freshness on cp1036 is OK: puppet ran at Fri Oct 18 08:41:46 UTC 2013 [08:42:05] PROBLEM - Puppet freshness on cp1036 is CRITICAL: No successful Puppet run in the last 10 hours [08:42:12] i was merely asking because I have to create a Jenkins user on the continuous integration MySQL server, was wondering how to handle it [08:42:17] I guess I will submit the password in private [08:42:20] and set it up manually. [08:42:45] RECOVERY - Puppet freshness on cp1024 is OK: puppet ran at Fri Oct 18 08:42:42 UTC 2013 [08:42:55] PROBLEM - Puppet freshness on cp1024 is CRITICAL: No successful Puppet run in the last 10 hours [08:43:01] would recommend that for now [08:43:55] RECOVERY - Puppet freshness on cp1030 is OK: puppet ran at Fri Oct 18 08:43:47 UTC 2013 [08:43:56] I guess, lets be pragmatic :-] [08:44:01] now i have to figure out the grants I need hehe [08:44:05] PROBLEM - Puppet freshness on cp1030 is CRITICAL: No successful Puppet run in the last 10 hours [08:44:45] RECOVERY - Puppet freshness on cp1026 is OK: puppet ran at Fri Oct 18 08:44:42 UTC 2013 [08:45:35] PROBLEM - Puppet freshness on cp1026 is CRITICAL: No successful Puppet run in the last 10 hours [08:51:00] hashar: can you please teach me how to test puppet code in labs or point me to docs? 
I have an instance, but don't know how to access it [08:54:48] matanya, https://wikitech.wikimedia.org/wiki/Help:Self-hosted_puppetmaster [08:55:01] thanks MaxSem [08:55:03] matanya: in a few minutes yes [08:56:05] RECOVERY - Puppet freshness on cp1033 is OK: puppet ran at Fri Oct 18 08:55:59 UTC 2013 [08:56:35] PROBLEM - Puppet freshness on cp1033 is CRITICAL: No successful Puppet run in the last 10 hours [08:56:45] RECOVERY - Puppet freshness on cp1042 is OK: puppet ran at Fri Oct 18 08:56:39 UTC 2013 [08:57:15] PROBLEM - Puppet freshness on cp1042 is CRITICAL: No successful Puppet run in the last 10 hours [09:00:45] RECOVERY - Puppet freshness on cp1022 is OK: puppet ran at Fri Oct 18 09:00:41 UTC 2013 [09:00:55] RECOVERY - Puppet freshness on cp1031 is OK: puppet ran at Fri Oct 18 09:00:51 UTC 2013 [09:00:57] akosiaris: if you are around, the PHP segfault is still happening but I got my mediawiki coverage report :-] [09:01:05] PROBLEM - Puppet freshness on cp1022 is CRITICAL: No successful Puppet run in the last 10 hours [09:01:06]