[00:00:04] (PS4) Dzahn: search - replace generic::systemuser with user [operations/puppet] - https://gerrit.wikimedia.org/r/137996 (owner: Rush)
[00:00:16] (PS4) Dzahn: bugzilla - replace generic::systemuser with user [operations/puppet] - https://gerrit.wikimedia.org/r/137995 (owner: Rush)
[00:00:42] (PS3) Dzahn: deployment,replace generic::systemuser with user [operations/puppet] - https://gerrit.wikimedia.org/r/137993 (owner: Rush)
[00:00:52] (PS3) Dzahn: jenkins - replace generic::systemuser with user [operations/puppet] - https://gerrit.wikimedia.org/r/137992 (owner: Rush)
[00:01:18] (PS4) Dzahn: parsoid - replace generic::systemuser with user [operations/puppet] - https://gerrit.wikimedia.org/r/137997 (owner: Rush)
[00:01:25] (PS3) Dzahn: icinga - replace generic::systemuser with user [operations/puppet] - https://gerrit.wikimedia.org/r/138006 (owner: Rush)
[00:01:37] (PS3) Dzahn: rancid - replace generic::systemuser with user [operations/puppet] - https://gerrit.wikimedia.org/r/138005 (owner: Rush)
[00:01:49] (PS3) Dzahn: stats - replace generic::systemuser with user [operations/puppet] - https://gerrit.wikimedia.org/r/138004 (owner: Rush)
[00:01:59] (PS3) Dzahn: nfs - replace generic::systemuser with user [operations/puppet] - https://gerrit.wikimedia.org/r/138003 (owner: Rush)
[00:02:02] (PS1) Yurik: Updated labs config for new zero exts [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/138506
[00:02:13] (PS3) Dzahn: openstack-replace generic::systemuser with user [operations/puppet] - https://gerrit.wikimedia.org/r/138002 (owner: Rush)
[00:02:51] (PS3) Dzahn: install-server-replace generic::systemuser [operations/puppet] - https://gerrit.wikimedia.org/r/138001 (owner: Rush)
[00:03:01] (PS3) Dzahn: dataset-replace generic::systemuser with user [operations/puppet] - https://gerrit.wikimedia.org/r/138000 (owner: Rush)
[00:03:03] (PS1) Rush: phabricator.wikimedia.org MX & A records [operations/dns] - https://gerrit.wikimedia.org/r/138507
[00:03:10] (PS3) Dzahn: logging-replace generic::systemuser with user [operations/puppet] - https://gerrit.wikimedia.org/r/137999 (owner: Rush)
[00:09:45] (PS4) Dzahn: generic: remove systemuser definition [operations/puppet] - https://gerrit.wikimedia.org/r/138011 (owner: Rush)
[00:10:45] (CR) Dzahn: generic: remove systemuser definition (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/138011 (owner: Rush)
[00:12:55] (CR) jenkins-bot: [V: -1] phabricator.wikimedia.org MX & A records [operations/dns] - https://gerrit.wikimedia.org/r/138507 (owner: Rush)
[00:14:07] (CR) Dzahn: [C: 1] puppetproxy: match role name to class name [operations/puppet] - https://gerrit.wikimedia.org/r/138370 (owner: Matanya)
[00:45:47] is there a comprehensive list of all wmf wikis somewhere in machine-readable format?
[00:46:02] I thiiiiink so
[00:46:07] <^d> all.dblist
[00:46:18] http://lists.wikimedia.org/pipermail/analytics/2014-June/002159.html
[00:46:39] <^d> Can be grabbed from noc.wikimedia.org/conf/ or operations/mediawiki-config.git, take your pick.
[00:46:49] In other words https://www.mediawiki.org/w/api.php?action=sitematrix&format=jsonfm
[00:46:53] ^d: is there a mapping from those to hostnames?
[00:47:14] there's the SiteMatrix API endpoint
[00:47:43] jackmcbarn: https://www.mediawiki.org/w/api.php?action=sitematrix
[00:47:54] ori: thanks
[00:47:59] <^d> jackmcbarn: InitialiseSettings.php has some overrides, but it's a pretty easy 1:1 mapping to guess for most wikis.
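Editorial sketch of the database-name-to-hostname mapping asked about above, built from a SiteMatrix-style response. The `SAMPLE` data is a hypothetical, hand-trimmed stand-in shaped like `action=sitematrix&format=json` output (numbered language groups with a `site` list, plus a `specials` list); the field names are assumed from that shape and the function name is illustrative only. No network call is made.

```python
from urllib.parse import urlsplit

# Hypothetical, trimmed sample shaped like an action=sitematrix JSON response.
SAMPLE = {
    "sitematrix": {
        "count": 4,
        "0": {
            "code": "aa",
            "site": [
                {"url": "https://aa.wikipedia.org", "dbname": "aawiki", "code": "wiki"},
            ],
        },
        "1": {
            "code": "en",
            "site": [
                {"url": "https://en.wikipedia.org", "dbname": "enwiki", "code": "wiki"},
                {"url": "https://en.wiktionary.org", "dbname": "enwiktionary", "code": "wiktionary"},
            ],
        },
        "specials": [
            {"url": "https://commons.wikimedia.org", "dbname": "commonswiki", "code": "commons"},
        ],
    }
}


def dbname_to_hostname(response):
    """Map each wiki database name to the hostname taken from its URL."""
    mapping = {}
    for key, group in response["sitematrix"].items():
        if key == "count":
            continue  # scalar total, not a group of sites
        # "specials" is a flat list; language groups keep their sites under "site".
        sites = group if key == "specials" else group.get("site", [])
        for site in sites:
            mapping[site["dbname"]] = urlsplit(site["url"]).hostname
    return mapping


if __name__ == "__main__":
    for dbname, host in sorted(dbname_to_hostname(SAMPLE).items()):
        print(dbname, host)
```

Taking the hostname from the API response rather than guessing it from the database name sidesteps the InitialiseSettings.php overrides mentioned in the channel.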
[00:48:05] <^d> But I guess API works too :p
[00:48:31] ori: Tch, copycat
[00:48:53] oh, i missed your message
[00:53:07] PROBLEM - DPKG on elastic1016 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[00:56:07] RECOVERY - DPKG on elastic1016 is OK: All packages OK
[00:59:41] thats pretty odd
[00:59:57] I was using apt just then
[01:01:05] !log upgraded all elasticsearch servers in production to 1.2.1. They are just restoring the last few shards on the last node now and they'll spend a few hours tonight rebalancing after the upgrade but otherwise I'm done.
[01:01:10] Logged the message, Master
[01:08:07] PROBLEM - Puppet freshness on analytics1012 is CRITICAL: Last successful Puppet run was Mon 09 Jun 2014 19:06:33 UTC
[01:13:26] (PS2) Rush: phabricator.wikimedia.org MX & A records [operations/dns] - https://gerrit.wikimedia.org/r/138507
[01:13:36] (CR) jenkins-bot: [V: -1] phabricator.wikimedia.org MX & A records [operations/dns] - https://gerrit.wikimedia.org/r/138507 (owner: Rush)
[01:16:58] (PS3) Rush: phabricator.wikimedia.org MX & A records [operations/dns] - https://gerrit.wikimedia.org/r/138507
[01:19:05] (CR) Rush: [C: 2] phabricator.wikimedia.org MX & A records [operations/dns] - https://gerrit.wikimedia.org/r/138507 (owner: Rush)
[01:31:06] gerrit dead?
[01:31:10] Guice provision errors:
[01:31:10] 1) Cannot open ReviewDb
[01:31:10] at com.google.gerrit.server.util.ThreadLocalRequestContext$1.provideReviewDb(ThreadLocalRequestContext.java:70)
[01:31:12] while locating com.google.gerrit.reviewdb.server.ReviewDb
[01:31:14] 1 error
[01:32:12] it's worse than last time
[01:32:18] unconditional, every request
[01:32:28] <^d> fixing.
[01:32:35] 503 unavailable now, so I guess someones rebooting it
[01:32:38] cool :)
[01:32:54] <^d> Wait someone's rebooting already?
[01:32:57] PROBLEM - Check status of defined EventLogging jobs on vanadium is CRITICAL: CRITICAL: Stopped EventLogging jobs: consumer/mysql-m2-master
[01:32:57] I was restarting teh service, seems not to have been the solution
[01:33:06] ori: ugh my fault
[01:33:15] !log restarted gerrit on ytterbium
[01:33:20] Logged the message, Master
[01:33:27] PROBLEM - MySQL Replication Heartbeat on db1046 is CRITICAL: CRIT replication delay 343 seconds
[01:33:45] springle: eventlogging, you mean? 'sokay. tell me if i should restart it.
[01:34:07] PROBLEM - gerrit process on ytterbium is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^GerritCodeReview .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war
[01:34:23] i guess they're on the same host
[01:34:27] so you mean both :)
[01:34:29] what is the gerrit error?
[01:34:44] actuall connection fail reason i mean
[01:35:06] <^d> I'm trying to find out.
[01:35:13] right now just a 503 Service Temporarily Unavailable
[01:36:11] springle: Caused by: java.sql.SQLException: Unable to load authentication plugin ''.
[01:36:20] wtf
[01:36:43] <^d> That's weird.
[01:36:46] <^d> And new.
[01:36:56] ah i see the problem
[01:37:07] springle: http://p.defau.lt/?Ci5910vpXNqCvI6IftseAw , from ytterbium:/var/lib/gerrit2/review_site/logs/error_log
[01:38:13] <^d> springle: Ah? Please do share :)
[01:40:40] eventlogging sez: sqlalchemy.exc.OperationalError: (OperationalError) (1290, 'The MariaDB server is running with the --read-only option so it cannot execute this statement')
[01:41:44] i gotta run! sean, if you could, could you please run 'eventloggingctl start' on vanadium once the db is back and writable?
[01:43:36] gerrit should be back
[01:43:44] checking eventlogging
[01:43:56] ^d: ^
[01:44:07] RECOVERY - gerrit process on ytterbium is OK: PROCS OK: 1 process with regex args ^GerritCodeReview .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war
[01:44:16] <^d> gerrit's much happier now, yes.
[01:44:17] <^d> thx!
[01:44:30] i'll email the list with what I broke
[01:44:58] eventlogging's upstart thingy hit the respawn limit
[01:45:04] * ori questions the wisdom of having that directive there at all
[01:45:29] well that was fun
[01:45:36] one bug and one breakage
[01:45:53] we still love you
[01:45:57] RECOVERY - Check status of defined EventLogging jobs on vanadium is OK: OK: All defined EventLogging jobs are runnning.
[01:46:36] (PS3) Ori.livneh: Puppet compiler for Tim's redirects.dat DSL [operations/puppet] - https://gerrit.wikimedia.org/r/138292
[01:50:12] tho if i set 'respawn limit infinite' the process check will flap
[01:50:24] * ori runs for real
[02:03:25] (CR) Krinkle: Puppet compiler for Tim's redirects.dat DSL (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/138292 (owner: Ori.livneh)
[02:07:07] PROBLEM - Puppet freshness on labstore1001 is CRITICAL: Last successful Puppet run was Tue 03 Jun 2014 16:21:49 UTC
[02:15:28] <^d> springle: Thanks for the report. Soooo....MariaDB bug on upgrade?
[02:15:45] !log LocalisationUpdate completed (1.24wmf7) at 2014-06-10 02:14:41+00:00
[02:15:54] Logged the message, Master
[02:16:54] ^d: i'm not sure if it counts as a connector/j bug or a mysqld bug
[02:17:26] it would only trigger if grants with the old password format exist
[02:17:38] <^d> Hmm.
[02:17:42] why connector/j cares about the server-side password format, i don't know
[02:18:22] <^d> I'd imagine it doesn't. My guess is that mysql is expecting the new format because upgrade.
[02:18:28] <^d> Then barfs when the old format doesn't work.
[02:18:41] * ^d is guessing though
[02:18:45] no, mysql accepts both formats
[02:18:56] <^d> Well hmm x2 then.
[02:19:22] soemthing to do with: proceedHandshakeWithPluggableAuthentication
[02:19:37] but i'm not about to become a java person to find out
[02:20:18] <^d> If only we had some java folks around here ;-)
[02:20:41] oh, old connector/j
[02:21:00] <^d> Very likely. It doesn't get upgraded automatically.
[02:21:02] i guess we could upgrade. but it works now
[02:21:19] <^d> Gerrit doesn't bundle it with the .war (I assume licensing reasons)
[02:22:04] <^d> springle: We can look at upgrading. It'll cause downtime so I'd have to schedule it.
[02:24:03] presumably eventually gerrit and/or ytterbium will upgrade. leave it until then, since we won't hit this particular issue again
[02:25:01] that or phabricator will take over the world
[02:27:14] !log switched traffic db1048 to db1020. broke gerrit briefly; see ops email
[02:27:18] Logged the message, Master
[02:28:03] <^d> springle: It won't get an upgrade from upgrading ytterbium. Far more manual than that :(
[02:28:30] (PS1) Springle: switch m2-master to db1020. socat redirect in place for db. [operations/dns] - https://gerrit.wikimedia.org/r/138527
[02:28:31] <^d> Ask me again sometime when it's not dinnertime and after 7:30 :)
[02:28:42] :D
[02:29:17] !log LocalisationUpdate completed (1.24wmf8) at 2014-06-10 02:28:14+00:00
[02:29:22] Logged the message, Master
[02:29:45] (CR) Springle: [C: 2] switch m2-master to db1020. socat redirect in place for db. [operations/dns] - https://gerrit.wikimedia.org/r/138527 (owner: Springle)
[02:32:27] RECOVERY - MySQL Replication Heartbeat on db1046 is OK: OK replication delay 0 seconds
[03:05:16] (PS3) 01tonythomas: Styled the alias field value differently [wikimedia/bugzilla/modifications] - https://gerrit.wikimedia.org/r/124140 (https://bugzilla.wikimedia.org/62160)
[03:06:09] (CR) TTO: "The "small" relates to the "edit" link, visible in the DOM but not in the page source itself (it seems to be injected by JavaScript). This" [wikimedia/bugzilla/modifications] - https://gerrit.wikimedia.org/r/124140 (https://bugzilla.wikimedia.org/62160) (owner: 01tonythomas)
[03:06:58] (CR) TTO: "Ah yes, thanks Tony!" [wikimedia/bugzilla/modifications] - https://gerrit.wikimedia.org/r/124140 (https://bugzilla.wikimedia.org/62160) (owner: 01tonythomas)
[03:25:25] !log LocalisationUpdate ResourceLoader cache refresh completed at Tue Jun 10 03:24:19 UTC 2014 (duration 24m 18s)
[03:25:30] Logged the message, Master
[03:40:31] !log switched mchenry to use m2-master/m2-slave for OTRS address lookups
[03:40:36] Logged the message, Master
[04:09:08] PROBLEM - Puppet freshness on analytics1012 is CRITICAL: Last successful Puppet run was Mon 09 Jun 2014 19:06:33 UTC
[04:57:15] !log db1048 down for upgrade
[04:57:21] Logged the message, Master
[05:08:07] PROBLEM - Puppet freshness on labstore1001 is CRITICAL: Last successful Puppet run was Tue 03 Jun 2014 16:21:49 UTC
[05:19:06] (PS1) Springle: Remove db1048 from m2 [operations/puppet] - https://gerrit.wikimedia.org/r/138535
[05:21:32] (CR) Springle: [C: 2] Remove db1048 from m2 [operations/puppet] - https://gerrit.wikimedia.org/r/138535 (owner: Springle)
[05:40:15] PROBLEM - Disk space on analytics1013 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/h 123124 MB (6% inode=99%): /var/lib/hadoop/data/j 124275 MB (6% inode=99%): /var/lib/hadoop/data/e 97014 MB (5% inode=99%): /var/lib/hadoop/data/f 74695 MB (3% inode=99%): /var/lib/hadoop/data/g 102765 MB (5% inode=99%): /var/lib/hadoop/data/c 108696 MB (5% inode=99%): /var/lib/hadoop/data/k 121492 MB (6% inode=99%): /var/lib/hadoop/da
[05:47:39] (PS1) Springle: Move db1048 into m3 as future lvm snapshot slave for phabricator. Combine m2 boxes db1020 and db1046 into same role::mariadb::misc. [operations/puppet] - https://gerrit.wikimedia.org/r/138537
[05:50:23] (CR) Springle: [C: 2] Move db1048 into m3 as future lvm snapshot slave for phabricator. Combine m2 boxes db1020 and db1046 into same role::mariadb::misc. [operations/puppet] - https://gerrit.wikimedia.org/r/138537 (owner: Springle)
[05:57:58] <_joe|away> hey springle
[06:01:11] PROBLEM - Puppet freshness on cp4013 is CRITICAL: Last successful Puppet run was Tue 10 Jun 2014 05:58:29 UTC
[06:03:11] PROBLEM - Puppet freshness on cp4013 is CRITICAL: Last successful Puppet run was Tue 10 Jun 2014 05:58:29 UTC
[06:05:11] PROBLEM - Puppet freshness on cp4013 is CRITICAL: Last successful Puppet run was Tue 10 Jun 2014 05:58:29 UTC
[06:07:11] PROBLEM - Puppet freshness on cp4013 is CRITICAL: Last successful Puppet run was Tue 10 Jun 2014 05:58:29 UTC
[06:09:11] PROBLEM - Puppet freshness on cp4013 is CRITICAL: Last successful Puppet run was Tue 10 Jun 2014 05:58:29 UTC
[06:11:11] PROBLEM - Puppet freshness on cp4013 is CRITICAL: Last successful Puppet run was Tue 10 Jun 2014 05:58:29 UTC
[06:12:11] !log xtrabackup clone db1043 to db1048
[06:12:15] Logged the message, Master
[06:13:11] PROBLEM - Puppet freshness on cp4013 is CRITICAL: Last successful Puppet run was Tue 10 Jun 2014 05:58:29 UTC
[06:14:25] (PS2) Giuseppe Lavagetto: add haithams to analytics-users [operations/puppet] - https://gerrit.wikimedia.org/r/138495 (owner: Dzahn)
[06:15:11] PROBLEM - Puppet freshness on cp4013 is CRITICAL: Last successful Puppet run was Tue 10 Jun 2014 05:58:29 UTC
[06:16:16] (CR) Giuseppe Lavagetto: [C: 2] add haithams to analytics-users [operations/puppet] - https://gerrit.wikimedia.org/r/138495 (owner: Dzahn)
[06:17:11] PROBLEM - Puppet freshness on cp4013 is CRITICAL: Last successful Puppet run was Tue 10 Jun 2014 05:58:29 UTC
[06:19:11] PROBLEM - Puppet freshness on cp4013 is CRITICAL: Last successful Puppet run was Tue 10 Jun 2014 05:58:29 UTC
[06:21:11] PROBLEM - Puppet freshness on cp4013 is CRITICAL: Last successful Puppet run was Tue 10 Jun 2014 05:58:29 UTC
[06:23:11] PROBLEM - Puppet freshness on cp4013 is CRITICAL: Last successful Puppet run was Tue 10 Jun 2014 05:58:29 UTC
[06:25:11] PROBLEM - Puppet freshness on cp4013 is CRITICAL: Last successful Puppet run was Tue 10 Jun 2014 05:58:29 UTC
[06:27:11] PROBLEM - Puppet freshness on cp4013 is CRITICAL: Last successful Puppet run was Tue 10 Jun 2014 05:58:29 UTC
[06:28:11] RECOVERY - Puppet freshness on cp4013 is OK: puppet ran at Tue Jun 10 06:28:03 UTC 2014
[06:32:20] (PS4) Withoutaname: Reduce string URLs to defined constant [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/131914 (https://bugzilla.wikimedia.org/48618)
[06:41:31] (PS2) Withoutaname: Delete ve.wikimedia.org and leave redirect [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/131907 (https://bugzilla.wikimedia.org/55737)
[07:10:02] PROBLEM - Puppet freshness on analytics1012 is CRITICAL: Last successful Puppet run was Mon 09 Jun 2014 19:06:33 UTC
[07:39:44] _joe_: do you have any idea what's up with those tungsten alerts?
[07:39:51] esp. the UNKNOWN ones
[07:40:21] <_joe_> paravoid: where are they? I may have missed them
[07:40:27] icinga
[07:41:13] <_joe_> oh not right now, in general
[07:41:20] yeah
[07:41:31] <_joe_> ok yes
[07:41:41] <_joe_> both me and godog investigated a little
[07:42:04] <_joe_> it seems that under certain conditions uwsgi and apache will simply refuse to talk to each other
[07:42:09] !log enabled pt-slave-delay for dbstore1001, 24h all shards
[07:42:13] Logged the message, Master
[07:42:19] <_joe_> why so, we have no idea.
[07:42:36] <_joe_> usually restarting uwsgi solves the problem, sometimes it solves itself
[07:43:07] <_joe_> I've extensively used uwsgi in the past, but never with apache
[07:59:17] (PS2) Giuseppe Lavagetto: redis: restart service upon first install [operations/puppet] - https://gerrit.wikimedia.org/r/138317
[08:00:56] _joe_: how do you feel about notify/subscribe instead?
[08:01:12] I think the reason is not there is so that puppet won't ever restart redis automatically
[08:02:01] <_joe_> paravoid: yes, so I created a 'proxy exec'
[08:02:12] <_joe_> it will restart redis only if it's just listening on localhost
[08:02:20] <_joe_> in any other case, it wont
[08:02:46] <_joe_> (I am testing this right now just to be sure)
[08:03:43] I'm wondering if we should just enable a regular notify/subscribe
[08:04:05] <_joe_> yeah, that is scary. A change in puppet would invalidate all redis caches at once
[08:04:17] <_joe_> I do see the rationale behind that
[08:04:22] (PS1) Faidon Liambotis: Create a new role::mail hierarchy [operations/puppet] - https://gerrit.wikimedia.org/r/138543
[08:04:24] (PS1) Faidon Liambotis: mail: move clamav include in the role class [operations/puppet] - https://gerrit.wikimedia.org/r/138544
[08:04:26] (PS1) Faidon Liambotis: mail: move backup includes in the role class [operations/puppet] - https://gerrit.wikimedia.org/r/138545
[08:04:26] <_joe_> (not restarting)
[08:04:28] (PS1) Faidon Liambotis: Introduce a new minimal exim4 module [operations/puppet] - https://gerrit.wikimedia.org/r/138546
[08:04:30] (PS1) Faidon Liambotis: Replace exim::simple-mail-sender with a role class [operations/puppet] - https://gerrit.wikimedia.org/r/138547
[08:04:47] <_joe_> eheh quite the set of changes
[08:04:54] and I'm not done yet
[08:05:04] I'm untangling a mess
[08:05:16] your comparator can be very useful here
[08:05:25] <_joe_> isn't that our job description?
[08:05:29] <_joe_> I hope so
[08:05:41] <_joe_> I found an elegant way to make builds namespaced
[08:05:56] <_joe_> so that subsequent builds will not overwrite previous runs
[08:06:12] (PS1) Springle: Make dbstore non-delayed by default. Apply suitable replag thresholds to dbstore1001 for 24h pt-slave-delay, and dbstore1002 for analytics. [operations/puppet] - https://gerrit.wikimedia.org/r/138548
[08:06:20] <_joe_> in the afternoon, if RT leaves me some room, I'll upgrade that
[08:09:02] PROBLEM - Puppet freshness on labstore1001 is CRITICAL: Last successful Puppet run was Tue 03 Jun 2014 16:21:49 UTC
[08:09:26] (CR) Springle: [C: 2] Make dbstore non-delayed by default. Apply suitable replag thresholds to dbstore1001 for 24h pt-slave-delay, and dbstore1002 for analytics. [operations/puppet] - https://gerrit.wikimedia.org/r/138548 (owner: Springle)
[08:12:35] _joe_: jenkins says "pending—puppet-compiler02.eqiad.wmflabs is offline"
[08:12:47] <_joe_> yeah seen that
[08:14:49] (PS4) Odder: Move queries for bugs with ASSIGNED status [wikimedia/bugzilla/modifications] - https://gerrit.wikimedia.org/r/129671
[08:43:40] (PS1) Faidon Liambotis: install-server: move lvs3xxx stanzas [operations/puppet] - https://gerrit.wikimedia.org/r/138550
[08:43:42] (PS1) Faidon Liambotis: install-server: add server lead.wikimedia.org [operations/puppet] - https://gerrit.wikimedia.org/r/138551
[08:44:04] (CR) Faidon Liambotis: [C: 2 V: 2] install-server: move lvs3xxx stanzas [operations/puppet] - https://gerrit.wikimedia.org/r/138550 (owner: Faidon Liambotis)
[08:44:14] (CR) Faidon Liambotis: [C: 2 V: 2] install-server: add server lead.wikimedia.org [operations/puppet] - https://gerrit.wikimedia.org/r/138551 (owner: Faidon Liambotis)
[08:48:05] (PS1) Faidon Liambotis: install-server: switch server "lead" to trusty [operations/puppet] - https://gerrit.wikimedia.org/r/138552
[08:48:07] (PS1) Faidon Liambotis: install-server: use raid1-lvm for lead [operations/puppet] - https://gerrit.wikimedia.org/r/138553
[08:48:31] (CR) Faidon Liambotis: [C: 2 V: 2] install-server: switch server "lead" to trusty [operations/puppet] - https://gerrit.wikimedia.org/r/138552 (owner: Faidon Liambotis)
[08:48:41] (CR) Faidon Liambotis: [C: 2 V: 2] install-server: use raid1-lvm for lead [operations/puppet] - https://gerrit.wikimedia.org/r/138553 (owner: Faidon Liambotis)
[09:03:32] (CR) Filippo Giunchedi: [C: 1] monitoring: add check for git merging of important repos [operations/puppet] - https://gerrit.wikimedia.org/r/138313 (owner: Giuseppe Lavagetto)
[09:10:48] matanya: did you find out if other files needed fixing in facter with ruby 1.9? (https://gerrit.wikimedia.org/r/#/c/137940/2)
[09:11:03] godog: the rsync module
[09:11:18] * matanya digs the code again
[09:12:39] godog: modules/rsync/spec/defines/server_module_spec.rb
[09:13:00] and modules/rsync/templates/module.erb
[09:13:41] i think that is all
[09:13:48] <_joe_> matanya: I suppose none of the two is used in prod?
[09:14:05] <_joe_> matanya: btw, what problems did you and andrewbogott_afk had to solve yesterday evening?
[09:14:18] <_joe_> I was mounting furniture so I did not have time to follow it
[09:14:19] <_joe_> :P
[09:14:39] mainly path issues
[09:14:46] <_joe_> which issues?
[09:14:55] <_joe_> did you check the rest of the code for the same issues?
[09:15:02] i.e puppet:///modules/blah
[09:15:07] where blah was missing
[09:15:18] <_joe_> yeah path issues is the kind of things we won't get with the compiler
[09:15:19] or modules was missing
[09:15:40] or files was present (we didn't have this one!)
[09:15:59] and the one godog just asked about
[09:17:47] and some password module paths too
[09:19:43] (CR) Filippo Giunchedi: "I see a link from http://dumps.wikimedia.org/backup-index.html but yes you are right it hasn't been applied yet for reasons I'm not still " [operations/puppet] - https://gerrit.wikimedia.org/r/134121 (owner: Filippo Giunchedi)
[09:22:07] <_joe_> matanya: thanks a ton for the help
[09:23:02] :)
[09:23:10] <_joe_> now it's very very funny how the one server we can't upgrade to puppet 3 is the catalogs compiler
[09:23:13] <_joe_> :P
[09:23:23] ironic
[09:26:09] <_joe_> matanya: bundler black magic
[09:26:52] usual ruby packages weirdness
[09:27:54] (PS1) Aude: Enable data transclusion for wikiquote [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/138557
[09:28:20] (CR) Aude: [C: -2] "not until later, around general deployment time" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/138557 (owner: Aude)
[09:29:20] (PS5) Giuseppe Lavagetto: monitoring: add check for git merging of important repos [operations/puppet] - https://gerrit.wikimedia.org/r/138313
[09:29:44] (CR) Giuseppe Lavagetto: [C: 2] monitoring: add check for git merging of important repos [operations/puppet] - https://gerrit.wikimedia.org/r/138313 (owner: Giuseppe Lavagetto)
[09:33:43] (PS6) Giuseppe Lavagetto: monitoring: add check for git merging of important repos [operations/puppet] - https://gerrit.wikimedia.org/r/138313
[09:36:10] (PS2) Faidon Liambotis: Create a new role::mail hierarchy [operations/puppet] - https://gerrit.wikimedia.org/r/138543
[09:36:12] (PS2) Faidon Liambotis: Introduce a new minimal exim4 module [operations/puppet] - https://gerrit.wikimedia.org/r/138546
[09:36:14] (PS2) Faidon Liambotis: Replace exim::simple-mail-sender with a role class [operations/puppet] - https://gerrit.wikimedia.org/r/138547
[09:36:16] (PS2) Faidon Liambotis: mail: move clamav include in the role class [operations/puppet] - https://gerrit.wikimedia.org/r/138544
[09:36:18] (PS2) Faidon Liambotis: mail: move backup includes in the role class [operations/puppet] - https://gerrit.wikimedia.org/r/138545
[09:38:38] (CR) Giuseppe Lavagetto: [C: 2] monitoring: add check for git merging of important repos [operations/puppet] - https://gerrit.wikimedia.org/r/138313 (owner: Giuseppe Lavagetto)
[09:55:41] (CR) Hashar: [C: 1] "I am fine with the change and publishing the slow parse publicly. I am just too paranoid to formally validate it and would like more expe" [operations/puppet] - https://gerrit.wikimedia.org/r/49678 (owner: Ottomata)
[09:59:49] (PS1) Faidon Liambotis: mchenry: remove classes that don't work [operations/puppet] - https://gerrit.wikimedia.org/r/138561
[10:00:39] (CR) Faidon Liambotis: [C: 2] mchenry: remove classes that don't work [operations/puppet] - https://gerrit.wikimedia.org/r/138561 (owner: Faidon Liambotis)
[10:02:07] (PS3) Faidon Liambotis: Create a new role::mail hierarchy [operations/puppet] - https://gerrit.wikimedia.org/r/138543
[10:02:09] (PS3) Faidon Liambotis: Introduce a new minimal exim4 module [operations/puppet] - https://gerrit.wikimedia.org/r/138546
[10:02:11] (PS3) Faidon Liambotis: Replace exim::simple-mail-sender with a role class [operations/puppet] - https://gerrit.wikimedia.org/r/138547
[10:02:13] (PS3) Faidon Liambotis: mail: move clamav include in the role class [operations/puppet] - https://gerrit.wikimedia.org/r/138544
[10:02:15] (PS3) Faidon Liambotis: mail: move backup includes in the role class [operations/puppet] - https://gerrit.wikimedia.org/r/138545
[10:02:35] PROBLEM - Unmerged changes on repository puppet on virt0 is CRITICAL: NRPE: Command check_puppet_merged not defined
[10:02:45] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: NRPE: Command check_puppet_merged not defined
[10:02:55] PROBLEM - Unmerged changes on repository puppet on virt1000 is CRITICAL: NRPE: Command check_puppet_merged not defined
[10:06:35] _joe_: hi. The check_puppet_merged is apparently not defined in nrpe :-( ^^^^
[10:06:49] ( From gerrit https://gerrit.wikimedia.org/r/#/c/138313/ )
[10:09:18] <_joe_> argh
[10:09:35] <_joe_> I thought declaring an nrpe_service whould be enough
[10:09:48] <_joe_> let me check
[10:10:15] PROBLEM - Puppet freshness on analytics1012 is CRITICAL: Last successful Puppet run was Mon 09 Jun 2014 19:06:33 UTC
[10:11:25] <_joe_> and it should... wtf.
[10:13:32] <_joe_> hashar: I'm puzzled. https://gerrit.wikimedia.org/r/#/c/138313/6/modules/monitoring/manifests/icinga/git_merge.pp
[10:13:44] <_joe_> here I do define an nrpe::monitor_service
[10:13:54] <_joe_> which should create the config
[10:14:59] <_joe_> ok, found the problem
[10:17:36] (PS1) Giuseppe Lavagetto: monitoring: fix require name [operations/puppet] - https://gerrit.wikimedia.org/r/138564
[10:18:09] <_joe_> sigh, typos.
[10:18:20] (CR) Giuseppe Lavagetto: [C: 2 V: 2] monitoring: fix require name [operations/puppet] - https://gerrit.wikimedia.org/r/138564 (owner: Giuseppe Lavagetto)
[10:21:33] <_joe_> and more problems are there
[10:25:17] _joe_: well done :]
[10:26:21] <_joe_> hashar: I won't say so :P
[10:32:56] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: NRPE: Command check_mediawiki_config_merged not defined
[10:33:06] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: NRPE: Command check_puppet_merged not defined
[10:33:35] <_joe_> it will fail again, please disregard them
[10:34:07] <_joe_> I am trying to do one more patch to sum it all up
[10:42:05] (PS1) Giuseppe Lavagetto: check_git_merge: fix execution switches and permissions [operations/puppet] - https://gerrit.wikimedia.org/r/138567
[10:43:54] (PS1) Filippo Giunchedi: add carbon/statsd CNAMEs in ulsfo and esams [operations/dns] - https://gerrit.wikimedia.org/r/138568
[10:44:27] (PS2) Giuseppe Lavagetto: check_git_merge: fix execution switches and permissions [operations/puppet] - https://gerrit.wikimedia.org/r/138567
[10:46:13] (PS2) Filippo Giunchedi: add carbon/statsd CNAMEs in ulsfo and esams [operations/dns] - https://gerrit.wikimedia.org/r/138568
[10:47:08] (CR) TTO: [C: 1] "I agree with this change, and it seems few other people really care one way or the other, so I say to merge this as well." [wikimedia/bugzilla/modifications] - https://gerrit.wikimedia.org/r/106761 (https://bugzilla.wikimedia.org/59893) (owner: 01tonythomas)
[10:49:12] btw I'm not sure what is our policy re: https://gerrit.wikimedia.org/r/#/c/138568/ (relying on resolv.conf vs explicit fqdn)
[10:49:29] (CR) Giuseppe Lavagetto: [C: 2] check_git_merge: fix execution switches and permissions [operations/puppet] - https://gerrit.wikimedia.org/r/138567 (owner: Giuseppe Lavagetto)
[10:51:02] (CR) Faidon Liambotis: [C: 1] add carbon/statsd CNAMEs in ulsfo and esams [operations/dns] - https://gerrit.wikimedia.org/r/138568 (owner: Filippo Giunchedi)
[10:55:27] if we are trying to get rid of the former I'm happy to abandon it too
[10:55:49] well that can work, but isn't it easier to just do it explicitly with puppet/facts/etc?
[10:58:17] (PS4) Faidon Liambotis: Create a new role::mail hierarchy [operations/puppet] - https://gerrit.wikimedia.org/r/138543
[10:58:19] (PS4) Faidon Liambotis: Introduce a new minimal exim4 module [operations/puppet] - https://gerrit.wikimedia.org/r/138546
[10:58:21] (PS4) Faidon Liambotis: Replace exim::simple-mail-sender with a role class [operations/puppet] - https://gerrit.wikimedia.org/r/138547
[10:58:23] (PS4) Faidon Liambotis: mail: move clamav include in the role class [operations/puppet] - https://gerrit.wikimedia.org/r/138544
[10:58:25] (PS4) Faidon Liambotis: mail: move backup includes in the role class [operations/puppet] - https://gerrit.wikimedia.org/r/138545
[10:59:55] yep I think the latter is better e.g. when services are not localized, was confused by the fact that it did exist in pmtpa though
[11:02:09] (PS3) Filippo Giunchedi: leave only one statsd/carbon-relay CNAME [operations/dns] - https://gerrit.wikimedia.org/r/138568
[11:02:50] (PS1) Giuseppe Lavagetto: monitoring: make define name unique [operations/puppet] - https://gerrit.wikimedia.org/r/138573
[11:03:10] akosiaris: the 138543-138545 changes that are about to be merged are backup-related (was buggy before), cf. http://puppet-compiler.wmflabs.org/change/138545/html/
[11:03:23] (CR) Faidon Liambotis: [C: 2] Create a new role::mail hierarchy [operations/puppet] - https://gerrit.wikimedia.org/r/138543 (owner: Faidon Liambotis)
[11:03:36] (CR) Faidon Liambotis: [C: 2] mail: move clamav include in the role class [operations/puppet] - https://gerrit.wikimedia.org/r/138544 (owner: Faidon Liambotis)
[11:03:43] (CR) Faidon Liambotis: [C: 2] mail: move backup includes in the role class [operations/puppet] - https://gerrit.wikimedia.org/r/138545 (owner: Faidon Liambotis)
[11:04:30] (CR) jenkins-bot: [V: -1] monitoring: make define name unique [operations/puppet] - https://gerrit.wikimedia.org/r/138573 (owner: Giuseppe Lavagetto)
[11:06:25] (PS2) Giuseppe Lavagetto: monitoring: make define name unique [operations/puppet] - https://gerrit.wikimedia.org/r/138573
[11:06:44] (PS3) Giuseppe Lavagetto: monitoring: make define name unique [operations/puppet] - https://gerrit.wikimedia.org/r/138573
[11:08:18] (PS1) Filippo Giunchedi: enable statsd reporting for swift proxy [operations/puppet] - https://gerrit.wikimedia.org/r/138574
[11:08:21] (CR) Giuseppe Lavagetto: [C: 2] monitoring: make define name unique [operations/puppet] - https://gerrit.wikimedia.org/r/138573 (owner: Giuseppe Lavagetto)
[11:09:36] PROBLEM - Puppet freshness on labstore1001 is CRITICAL: Last successful Puppet run was Tue 03 Jun 2014 16:21:49 UTC
[11:09:46] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge.
[11:10:40] <_joe_> :)
[11:10:47] <_joe_> at last
[11:11:10] godog: <3
[11:11:34] \o/ (spring) cleanup and metrics FTW
[11:11:55] <_joe_> godog: yeah
[11:12:25] <_joe_> well, bbl (lunch time)
[11:12:37] (CR) Faidon Liambotis: "We had this enabled for pmtpa and it flooded graphite with *a lot* of data. Swift has a sampling rate config option that isn't configured " [operations/puppet] - https://gerrit.wikimedia.org/r/138574 (owner: Filippo Giunchedi)
[11:12:56] RECOVERY - Unmerged changes on repository puppet on virt1000 is OK: No changes to merge.
[11:13:40] speaking of which, re: dashboarding I was thinking we could simply have graph images in wikitech, plus perhaps templates to make it less painful
[11:15:36] why not gdash?
[11:15:44] (or whatever replaces it, like grafana)
[11:17:35] (CR) Faidon Liambotis: [C: 2] Introduce a new minimal exim4 module [operations/puppet] - https://gerrit.wikimedia.org/r/138546 (owner: Faidon Liambotis)
[11:18:53] woooo
[11:19:32] err: Could not apply complete catalog: Found 1 dependency cycle:
[11:19:32] (Class[Exim4] => Class[Exim::Roled::Mailman] => Exim4::Dkim[lists.wikimedia.org] => File[/etc/exim4/dkim/lists.wikimedia.org-wikimedia.key] => Service[exim4] => Class[Exim4])
[11:19:36] hrmm
[11:19:41] _joe|away: the compiler didn't find that
[11:19:43] that's weird
[11:21:22] gdash as it is now isn't very useful IMO (e.g. only one column, no timespan selection) though I had a look at the upstream version and it seemed better, graphana also is nice but way higher tech than "" :) anyways just a thought I think I'll give it a try and see what comes out
[11:28:32] (PS1) Faidon Liambotis: mail: fix dependency cycle [operations/puppet] - https://gerrit.wikimedia.org/r/138578
[11:28:37] RECOVERY - Unmerged changes on repository puppet on virt0 is OK: No changes to merge.
[11:29:14] (03CR) 10Faidon Liambotis: [C: 032] mail: fix dependency cycle [operations/puppet] - 10https://gerrit.wikimedia.org/r/138578 (owner: 10Faidon Liambotis) [11:35:06] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [11:40:34] (03PS1) 10Faidon Liambotis: exim4: revert files' permission to previous state [operations/puppet] - 10https://gerrit.wikimedia.org/r/138580 [11:40:52] (03CR) 10Faidon Liambotis: [C: 032 V: 032] exim4: revert files' permission to previous state [operations/puppet] - 10https://gerrit.wikimedia.org/r/138580 (owner: 10Faidon Liambotis) [11:40:54] <_joe|away> paravoid: the dependencies cycles get resolved at application time only [11:41:02] ah right [11:41:03] makes sense [11:41:12] (well, kind of) [11:41:15] <_joe|away> not really, but that's how it is [11:41:16] <_joe|away> :P [11:41:25] <_joe|away> I discovered that earlier today [12:19:16] that salt tab completion feature gets me every time [12:19:28] <_joe_> paravoid: same here [12:24:28] interesting [12:24:38] only 14 lucid hosts left [12:25:34] 4 hardy, 14 lucid, 641 precise, 6 trusty [12:25:40] er, 8 trusty, sorry [12:25:58] <_joe_> 4 hardys? [12:26:17] yeah [12:26:51] i said I thought there weren't that many anymore [12:26:57] i was being corrected :P [12:27:31] (03PS1) 10coren: Labs: sync federation up to replication [operations/software] - 10https://gerrit.wikimedia.org/r/138586 (https://bugzilla.wikimedia.org/59682) [12:28:49] (03CR) 10coren: [C: 032] "Trivial." 
[operations/software] - 10https://gerrit.wikimedia.org/r/138586 (https://bugzilla.wikimedia.org/59682) (owner: 10coren) [12:40:57] (03PS2) 10Faidon Liambotis: varnish: don't set X-WAP on mobile [operations/puppet] - 10https://gerrit.wikimedia.org/r/118249 [12:41:36] (03CR) 10Faidon Liambotis: [C: 032 V: 032] varnish: don't set X-WAP on mobile [operations/puppet] - 10https://gerrit.wikimedia.org/r/118249 (owner: 10Faidon Liambotis) [12:41:43] <_joe_> and that is the end of a terrible technology [12:44:05] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There are unmerged changes in puppet (dir /var/lib/git/operations/puppet) waiting since 3646 seconds [12:45:10] _joe_: ^ [12:46:53] <_joe_> matanya: uhm interesting [12:48:28] <_joe_> matanya: it is really behind 1 [12:48:35] <_joe_> (1 commit I mean) [12:48:42] <_joe_> so someone forgot to puppet-merge? [12:48:56] on strontium ? [12:49:26] <_joe_> no, the merge happens on palladium, but I thought it was propagated to the other servers quickly [12:49:56] instantly [12:50:10] well not in the academic sense but you get the idea [12:50:52] akosiaris: https://gerrit.wikimedia.org/r/#/c/138370/ in your spare time :) [12:50:54] <_joe_> akosiaris: ok so, why is strontium behind 1 commit? [12:51:39] (03CR) 10Alexandros Kosiaris: [C: 032] puppetproxy: match role name to class name [operations/puppet] - 10https://gerrit.wikimedia.org/r/138370 (owner: 10Matanya) [12:51:45] <_joe_> and it still is [12:52:17] thanks [12:52:20] <_joe_> well, the alarm did give us a hint [12:52:27] <_joe_> that something wrong is going on [12:52:35] someone did not use puppet-merge ? [12:52:38] <_joe_> akosiaris: how do we distribute the changes? [12:52:42] <_joe_> akosiaris: nope [12:52:45] <_joe_> palladium is OK [12:52:50] <_joe_> virt1000 as well [12:53:05] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. 
[12:53:21] seems like a matter of time [12:53:31] unless one of you did something [12:53:38] <_joe_> matanya: well it was some minutes late [12:53:43] <_joe_> quite some minutes I'd say [12:53:51] <_joe_> it's not acceptable anyways [12:54:09] and in this case, i would look why is the delay :) [12:54:12] no, it synced because I run pupept-merge [12:54:15] <_joe_> you may end up being served different catalogs on different machines of the same cluster [12:54:21] <_joe_> akosiaris: oh ok [12:54:33] <_joe_> so it's puppet-merge directly that syncs [12:54:38] <_joe_> and if the sync fails? [12:54:57] <_joe_> maybe the committer did not notice an error [12:55:24] _joe_: modules/puppetmaster/templates/post-merge.erb is the answer you are looking for [12:55:36] <_joe_> just found it [12:55:39] I am wondering as well what happened [12:56:19] <_joe_> well, if this happens often, it's something we need to work on [12:56:25] first time I see it [12:56:34] <_joe_> well, first time it's monitored [12:56:35] <_joe_> :) [12:57:12] I don't recall ever having the problem however [12:57:20] <_joe_> me neither [12:57:35] it would have bitten us somehow. Weird... [12:57:43] <_joe_> but is exactly the kind of situation I wanted to monitor by putting the alert on strontium as well [12:57:49] <_joe_> funny it happened on day one [12:58:24] <_joe_> the seasoned sysadmin in me would tend to think coincidences don't exist in our line of work [12:58:36] <_joe_> well, they seldom do [12:58:47] <_joe_> so let's hope it's one of those cases [12:58:49] wait and see how often it happens [12:59:28] hmmm [12:59:48] it could be a by-product of the admin changes [13:00:17] <_joe_> you think? if so, that would happen at every commit [13:00:18] I see a log line for ssh session opened for user oblivian [13:00:25] <_joe_> that is me [13:00:27] <_joe_> :) [13:01:03] maybe you are the first one to run puppet-merge with your user and not root ? 
[13:01:11] although it is unimportant [13:01:16] it uses gitpuppet and not root [13:01:19] <_joe_> no I did not :) [13:01:26] meh... wild goose chase... [13:01:55] <_joe_> akosiaris: let's see if that ever happens again :) [13:05:13] akosiaris / _joe_ any hosts left in 208.80.152.127 address space ? [13:05:58] <_joe_> matanya: what do you need to do? [13:06:24] clean tampa, specificly now modules/install-server/files/autoinstall/netboot.cfg [13:06:57] * matanya remindes himself to ask what he wants to do, and not the result [13:07:36] <_joe_> ok, so I figure we won't install new servers in tampa [13:07:52] <_joe_> If I do understand correctly what netboot.cfg is for [13:08:18] <_joe_> you just want to leave the entries for servers still in tampa [13:08:31] <_joe_> don't we have a page on wikitech for that? [13:08:40] <_joe_> or maybe some RT ticket [13:09:18] i looked _joe_ didn't find much [13:09:30] (03PS3) 10Giuseppe Lavagetto: redis: restart service upon first install [operations/puppet] - 10https://gerrit.wikimedia.org/r/138317 [13:09:30] <_joe_> in either place? [13:09:36] yes [13:09:45] <_joe_> that is underwhelmning [13:10:01] <_joe_> let me think of a way to find out [13:10:03] I know the misc, and few db's and es are on 12th [13:10:39] dns [13:10:43] is what we use for that [13:10:55] PROBLEM - Puppet freshness on analytics1012 is CRITICAL: Last successful Puppet run was Mon 09 Jun 2014 19:06:33 UTC [13:11:28] mark: i'm talking about auto-install, we are not going to install in tampa, are we? 
[13:11:51] <_joe_> matanya: no, but leave the configs for the existing pmtpa servers there [13:11:52] pretty unlikely [13:12:01] but no reason to remove them now [13:12:06] <_joe_> in case we need to reinstall one [13:14:27] no reason to remove them now <-- I have been hearing this for a long time :) when i remove stuff it pushes things forward, from my experience [13:15:04] no [13:15:08] this is the wrong kind of cleanup [13:15:16] we're unlikely to need it still, but we may need it [13:15:23] in which case it's more complicated to do so later [13:15:34] removing this stuff is trivial and it doesn't harm having it for a few months [13:15:39] and it doesn't really move anything else forward [13:16:26] well, you are the boss :D [13:16:58] manybubbles: re- bug 66243 yes, you are reading it wrong [13:17:22] and editor created a redirect to overcome the search issue [13:17:55] <_joe_> I think being the boss is less important than being right, matanya. And we did give you basically the same advice. [13:17:56] that's not an answer of "I agree" [13:18:22] well, in this sense, i agree [13:18:22] i'd love to hear a compelling reason for why removing such entries, which we may or may not need again, really matters or helps [13:19:37] for example, we have monitoring code of memcaches in tampa, is that the right clean up ? [13:19:38] (03PS4) 10Giuseppe Lavagetto: redis: restart service upon first install [operations/puppet] - 10https://gerrit.wikimedia.org/r/138317 [13:19:54] yes, since we no longer have nor need memcaches in tampa [13:20:28] (03CR) 10Filippo Giunchedi: "ah indeed! thanks for filling me in with the context, it looks like this is an instance of this bug https://bugs.launchpad.net/swift/+bug/" [operations/puppet] - 10https://gerrit.wikimedia.org/r/138574 (owner: 10Filippo Giunchedi) [13:21:41] (03CR) 10Giuseppe Lavagetto: "Note to code reviewer: puppet sucks." 
(031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/138317 (owner: 10Giuseppe Lavagetto) [13:34:04] (03PS1) 10Matanya: memcached: remove pmtpa virtual group and lint [operations/puppet] - 10https://gerrit.wikimedia.org/r/138590 [13:38:12] is role::puppetproxy even used anywhere? [13:39:11] paravoid: nope [13:39:12] (03PS1) 10Faidon Liambotis: Kill role::puppetproxy [operations/puppet] - 10https://gerrit.wikimedia.org/r/138592 [13:39:27] (03CR) 10Faidon Liambotis: [C: 032 V: 032] Kill role::puppetproxy [operations/puppet] - 10https://gerrit.wikimedia.org/r/138592 (owner: 10Faidon Liambotis) [13:39:54] we kept it around due to the fear of the ams <-> eqiad link not working correctly [13:45:05] (03PS5) 10Faidon Liambotis: Replace exim::simple-mail-sender with a role class [operations/puppet] - 10https://gerrit.wikimedia.org/r/138547 [13:45:45] (03CR) 10Faidon Liambotis: [C: 032] Replace exim::simple-mail-sender with a role class [operations/puppet] - 10https://gerrit.wikimedia.org/r/138547 (owner: 10Faidon Liambotis) [13:46:55] paravoid: exciting news re. exim4 module! [13:47:09] not nearly done yet [13:47:16] I want to get rid of all the enable_* clauses [13:47:37] oh i know. I actually spent a couple hours pondering this yesterday out of frustration with the mess that was [13:47:51] I've been playing with a couple of alternative solutions [13:47:54] what do you envision for a layout for exim4.config? [13:48:05] what are your ideas? [13:48:09] one is denormalizing and actually having e.g. an entirely different config for otrs [13:48:22] right [13:48:23] since some of them (otrs for example) are quite different [13:48:32] i was leaning toward that too actually [13:48:35] the other one is maybe using concat::fragment [13:48:56] ha! 
I was looking at a way of pulling in templates into a template with ruby [13:49:38] but you know, in the end i think denormalization will be easier to support [13:49:51] templates in templates is possible without concat, too [13:50:10] but concat could provide us with a way to mix-and-match roles [13:50:20] right [13:50:42] so sodium could have "include role::mail::lists" and "include role::mail::mx" at the same time [13:50:52] right. to me if you have to do a major brain exercise to envision what the end file will look like, its just hard to support [13:51:30] I think we have the worst of both world now [13:51:34] *worlds [13:52:01] indeed. what if we standardize on all the site variables, and have a dir of /exim4.conf-{whatever}.erb ? [13:52:07] if the config was split up into the different roles, but you could inspect the fragments independently, maybe it'd be cleaner, dunno [13:52:49] yeah [13:52:58] how do you mean? [13:53:09] the exim4.conf-whatever idea, I didn't get that [13:53:49] oh. so for example in a role class or whatever you would call class exim4 with template =>. 'exim4.conf-otrs.erb' [13:53:52] sec [13:54:06] well that's already kinda possible :) [13:54:10] with the current scheme [13:54:58] yes [13:55:16] i just didn;t want to go hog wild with the flexibility until I understood what our plan was going to be [13:55:27] sorry-there's a kid drop-off going on here. distracting [13:55:31] right, that's a good point [13:55:35] I haven't fully decided yet [13:55:56] I'm going to switch gears a bit, I need to provision what we have for a new MX in eqiad [13:56:02] ok [13:56:02] as mchenry is still in tampa [13:56:07] then I'll clean up further [13:56:19] then I'll take care of sodium, which also means cleaning up the mailman stuff [13:56:33] great. 
I'm happy to help adapt the otrs config as a guineapig when ready [13:56:33] then dovecot/sanger [13:56:38] awesome :) [13:56:50] I've previously split SA/clamav btw [13:56:57] that was weeks/months ago [13:56:58] also--there's a bunch of config in there for aluminium, which can be deprecated [13:57:05] yep, saw that. [13:57:20] oh? which part? [13:57:35] checking [13:58:35] ha. less than I thought actually, because I bypassed mail.pp [13:59:13] just templates and files in exim, private/dovecot, etc [14:01:40] i think it will be fine to simply rip that out of puppet. that box is on the long tail of deprecation, doing about 10% of what it used to [14:01:54] please do :) [14:02:02] k. [14:02:18] I don't know what's actually being used and what is not [14:02:34] yep. [14:02:35] so I can't clean up myself and it's quite likely I might waste time transitioning something to newer code [14:02:50] no problem, I can probably do it today [14:02:51] is this misc::fundraising::mail ? [14:03:26] I guess it is [14:03:32] I almost transitioned that to the new exim4 module [14:03:58] yes. 
and it's not worth the effort [14:04:16] hehe glad that I didn't then [14:04:20] :-) [14:07:26] (03Abandoned) 10Faidon Liambotis: contint: set up and maintain a coredumps directory [operations/puppet] - 10https://gerrit.wikimedia.org/r/119225 (https://bugzilla.wikimedia.org/62623) (owner: 10Faidon Liambotis) [14:09:55] PROBLEM - Puppet freshness on labstore1001 is CRITICAL: Last successful Puppet run was Tue 03 Jun 2014 16:21:49 UTC [14:13:37] (03PS1) 10Jgreen: check short_report_template as boolean in spamassassin local.cf template [operations/puppet] - 10https://gerrit.wikimedia.org/r/138593 [14:16:26] (03CR) 10Jgreen: [C: 032 V: 031] check short_report_template as boolean in spamassassin local.cf template [operations/puppet] - 10https://gerrit.wikimedia.org/r/138593 (owner: 10Jgreen) [14:24:21] (03PS1) 10Jgreen: remove deprecated fundraising mail config [operations/puppet] - 10https://gerrit.wikimedia.org/r/138604 [14:24:29] \o/ [14:25:19] I am developing a theory about puppet. Every time we rip out puppet content we get happy. I think the reasonable conclusion is that puppet code is the root of all unhappiness. [14:26:27] (03CR) 10Filippo Giunchedi: "we can certainly investigate aptly as well as it looks promising, I have some swift work to catch up with too so I might be able to get to" [operations/puppet] - 10https://gerrit.wikimedia.org/r/136128 (owner: 10Filippo Giunchedi) [14:27:37] go go jenkins [14:28:19] interesting. 
this time jenkins verified my commit but didn't bot-message [14:28:55] (03CR) 10Jgreen: [C: 032 V: 031] remove deprecated fundraising mail config [operations/puppet] - 10https://gerrit.wikimedia.org/r/138604 (owner: 10Jgreen) [14:29:09] Jeff_Green: that is grrrit-wm being lagged out I guess [14:29:16] Jeff_Green: the bot is independent [14:29:21] oic [14:31:06] grocery store [14:34:25] PROBLEM - graphite.wikimedia.org on tungsten is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:36:18] <_joe_> hey godog, want to try (again) to debug this? [14:36:38] hahah "sure" [14:37:24] <_joe_> godog: 22% iowait [14:37:35] <_joe_> this is not the uwsgi bug [14:44:15] RECOVERY - graphite.wikimedia.org on tungsten is OK: HTTP OK: HTTP/1.1 200 OK - 1607 bytes in 0.003 second response time [14:51:21] (03PS1) 10Dzahn: decom ekrem, former irc box [operations/puppet] - 10https://gerrit.wikimedia.org/r/138608 [14:59:21] (03CR) 10Dzahn: [C: 031] "there's still python /usr/local/bin//udpmxircecho.py rc-pmtpa ekrem.wikimedia.org but that does nothing anymore, right" [operations/puppet] - 10https://gerrit.wikimedia.org/r/138608 (owner: 10Dzahn) [14:59:34] mutante: anyone would think you've been waiting for that one ;) [15:00:04] manybubbles, anomie: The time is nigh to deploy SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140610T1500) [15:00:13] Reedy: hehe:) [15:01:04] Nothing to SWAT at the moment, unless MatmaRex decides to show up. [15:01:14] Reedy: root@ekrem:/etc/apache2/sites-enabled# ls [15:01:14] irc.wikimedia.org mobile.wikipedia.org wap.wikipedia.org [15:01:26] uh [15:01:41] remnants.. 
[15:01:45] heh [15:01:55] wap :p [15:01:59] ircecho shouldn't be doing anything either [15:02:04] (03CR) 10Matanya: "from RT 4784:" [operations/puppet] - 10https://gerrit.wikimedia.org/r/138608 (owner: 10Dzahn) [15:02:48] wap.wikimedia.org seems deaded [15:02:52] completely [15:02:54] yes [15:03:07] mobile should be on the apache redirects [15:03:10] matanya: good catch!! [15:03:23] and irc.wikimedia.org? o_0 [15:03:28] Reedy: "a setting in InitialiseSettings.php in the wikimedia-config repo that would be affected " [15:03:48] I already replaced that [15:03:48] thanks mutante :) [15:03:48] :) [15:03:50] Reedy: irc.wm.org existed only for the redirect to meta, i just did not like it was "It works" in the past [15:04:04] 'wmgRC2UDPAddress' => array( [15:04:04] 'default' => '208.80.154.160', // eqiad: argon [15:04:04] ), [15:04:17] :) [15:04:17] I did it when chasemp confirmed the ircd was down [15:04:22] great [15:04:27] thanks Reedy [15:04:37] gogogogog [15:04:37] :D [15:04:46] manybubbles: around ? [15:04:55] btw graphite choking there was because of requesting reqstats.edits.* which is just too much, probably wise to disable that dashboard if not overly important (ori (?)) [15:05:55] (03CR) 10Dzahn: [C: 032] "< Reedy> I'default' => '208.80.154.160', // eqiad: argon" [operations/puppet] - 10https://gerrit.wikimedia.org/r/138608 (owner: 10Dzahn) [15:06:48] (03PS2) 10Dzahn: decom ekrem, former irc box [operations/puppet] - 10https://gerrit.wikimedia.org/r/138608 [15:06:58] mutante: So how many left is that now? 
[15:07:23] 13 or so [15:07:31] matanya: yeah [15:07:35] 11 in misc_pmtpa [15:07:57] hi manybubbles an re 66243 [15:08:13] Reedy: dobson,fenari,linne,manutius,mchenry,mexia,pdf2,pdf3,sanger,tarin,tridge [15:08:14] an editor set a redirect to override the missing search results [15:08:42] long live fenari [15:08:42] (03CR) 10Dzahn: [C: 032] decom ekrem, former irc box [operations/puppet] - 10https://gerrit.wikimedia.org/r/138608 (owner: 10Dzahn) [15:09:16] the actual urls to compare are: https://www.wikidata.org/w/index.php?search=%D7%94%D7%95%D7%9C%D7%99%D7%93%D7%99%D7%99&title=Special%3ASearch&go=%D7%9C%D7%93%D7%A3 vs https://he.wikipedia.org/w/index.php?search=%D7%94%D7%95%D7%9C%D7%99%D7%93%D7%99%D7%99&title=%D7%9E%D7%99%D7%95%D7%97%D7%93%3A%D7%97%D7%99%D7%A4%D7%95%D7%A9&go=%D7%9C%D7%A2%D7%A8%D7%9A&fulltext=1 [15:09:28] so, manybubbles still 2 missing [15:12:22] matanya: got it [15:12:23] !log ekrem - revoke salt,puppet keys, stop agents/minion [15:12:29] Logged the message, Master [15:12:48] matanya: {"_index":"hewiki_content_1401724632","_type":"page","_id":"495403","found":false} [15:12:50] manybubbles: if you need help with that weird RTL language, do let me know [15:12:56] matanya: thanks! [15:13:07] I think I've got a non-rtl issue here though.... [15:13:15] gonna have to hunt it [15:15:08] but is the issue clear manybubbles i fear i didn't define it well enough [15:15:22] ottomata: elastic1017 cronspam? [15:17:11] elastic1017? 
hm [15:19:18] hm ok, puppet is off on elastic1017 and elasticsearch isoffline [15:19:27] gonna comment out the cron, puppet will restore whenever that is all fixed [15:20:16] (03PS1) 10Giuseppe Lavagetto: puppet-compiler: work better with jenkins + fixes [operations/software] - 10https://gerrit.wikimedia.org/r/138614 [15:21:19] !log ekrem - rm from stored configs/icinga [15:21:24] Logged the message, Master [15:22:35] (03PS2) 10Giuseppe Lavagetto: puppet-compiler: work better with jenkins + fixes [operations/software] - 10https://gerrit.wikimedia.org/r/138614 [15:23:36] (03CR) 10Dzahn: [C: 031] memcached: remove pmtpa virtual group and lint (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/138590 (owner: 10Matanya) [15:25:17] matanya: omg, you are mixing functional and lint change *g*,,kidding [15:25:32] i'll fix one tab and go ... [15:27:04] (03PS2) 10Dzahn: memcached: remove pmtpa virtual group and lint [operations/puppet] - 10https://gerrit.wikimedia.org/r/138590 (owner: 10Matanya) [15:36:54] (03CR) 10Dzahn: [C: 032] memcached: remove pmtpa virtual group and lint [operations/puppet] - 10https://gerrit.wikimedia.org/r/138590 (owner: 10Matanya) [15:37:26] (03CR) 10Dzahn: "removes Tampa monitoring and replaces Tabs-only" [operations/puppet] - 10https://gerrit.wikimedia.org/r/138590 (owner: 10Matanya) [15:42:22] matanya: do you know about CVN bots? [15:58:20] ori: Who do I need to bribe for https://gerrit.wikimedia.org/r/138216 ? [15:59:42] <_joe_> hoo: you could try with an opsen :) [16:00:11] (03CR) 10Dzahn: [C: 032] beta: fix scap for videoscalers [operations/puppet] - 10https://gerrit.wikimedia.org/r/137274 (owner: 10BryanDavis) [16:00:58] _joe_: What about you? :D [16:01:36] <_joe_> hoo: yeah, in ~ 30 mins [16:01:44] <_joe_> I do have a couple of things to do before [16:07:27] Not urgent at all... 
just want to get things doen [16:10:52] <_joe_> hoo: ok then maybe later [16:11:08] <_joe_> it's 6 PM and I'm about to take a break [16:11:08] PROBLEM - Puppet freshness on analytics1012 is CRITICAL: Last successful Puppet run was Mon 09 Jun 2014 19:06:33 UTC [16:11:23] <_joe_> hoo: add me as a reviewer (Giuseppe [16:13:52] (03PS3) 10Giuseppe Lavagetto: puppet-compiler: work better with jenkins + fixes [operations/software] - 10https://gerrit.wikimedia.org/r/138614 [16:15:14] (03PS4) 10Giuseppe Lavagetto: puppet-compiler: work better with jenkins + fixes [operations/software] - 10https://gerrit.wikimedia.org/r/138614 [16:16:05] (03CR) 10Giuseppe Lavagetto: [C: 032] puppet-compiler: work better with jenkins + fixes [operations/software] - 10https://gerrit.wikimedia.org/r/138614 (owner: 10Giuseppe Lavagetto) [16:20:28] (03CR) 10Dzahn: [C: 032] Redirect https traffic from old metrics sites to wikimetrics [operations/puppet] - 10https://gerrit.wikimedia.org/r/133089 (https://bugzilla.wikimedia.org/64276) (owner: 10QChris) [16:23:08] (03PS1) 10Filippo Giunchedi: Revert "Replace exim::simple-mail-sender with a role class" [operations/puppet] - 10https://gerrit.wikimedia.org/r/138623 [16:23:44] morning [16:23:46] volunteers for that last review? I think it is needed to unblock puppet in labs [16:23:53] hey ori [16:24:47] (03CR) 10Dzahn: "$redirect_target = "https://metrics.wmflabs.org/"" [operations/puppet] - 10https://gerrit.wikimedia.org/r/133089 (https://bugzilla.wikimedia.org/64276) (owner: 10QChris) [16:25:19] godog: i think it's an important dashboard. we can disable it temporarily if it helps graphite recover, but we should really start scaling out graphite imo [16:25:59] we have three logstash boxes and one graphite..! 
[16:26:14] ori: yes agreed it's only going to get more desperate [16:27:39] (03CR) 10Dzahn: [C: 04-1] "why should we delete README, tests and all that from an imported module" [operations/puppet] - 10https://gerrit.wikimedia.org/r/137884 (owner: 10Ori.livneh) [16:27:53] indeed, I said disabling it because one could casually click and bring down graphite in its current setup [16:28:11] well bring down for a while that is [16:28:14] yeah, that sounds fine [16:28:20] i mean, suboptimal but sensible [16:29:27] (03CR) 10Dzahn: [C: 031] Increase bacula pool size [operations/puppet] - 10https://gerrit.wikimedia.org/r/137564 (owner: 10Alexandros Kosiaris) [16:29:35] yup, speaking of that I was intrigued by that cassandra graphite backend, but didn't end up doing anything serious with it [16:29:39] the old graphite setup had some more aggressive caching configured at the reverse proxy layer [16:29:48] i'm not sure we ported that in the end [16:32:18] (03CR) 10Ori.livneh: "@Dzahn: this module came up several times on the lists and we agreed we should get rid of it and replace it with something more specific t" [operations/puppet] - 10https://gerrit.wikimedia.org/r/137884 (owner: 10Ori.livneh) [16:34:51] (03CR) 10Dzahn: "ok, i don't have a vote then, we have talked about replacing one apache setup with another one soo many times i can't even count it anymor" [operations/puppet] - 10https://gerrit.wikimedia.org/r/137884 (owner: 10Ori.livneh) [16:35:29] (03PS1) 10Giuseppe Lavagetto: admins: grant access to stat1002 to halfak [operations/puppet] - 10https://gerrit.wikimedia.org/r/138625 [16:35:45] <3 [16:36:05] <_joe_> :) [16:36:46] <_joe_> halfak: not merging it now though [16:37:33] it's the first time I touch the new admin.yaml thing [16:38:21] <_joe_> ori: I wrote a very thorough explanation of how we must keep into account the number of physical cores in determining maxclients and lost it in a browser crash [16:38:40] <_joe_> :/ [16:38:53] _joe_: do you remember the 
formula? [16:39:21] <_joe_> ori: kind of, trying to recreate it now [16:39:29] <_joe_> last item of my daily todo list :) [16:41:31] http://www.johnlund.com/Images/81888418.jpg [16:41:59] <_joe_> no it was a simple logarithmic formula [16:42:12] mutante: heya. Thanks for the merge of metrics redirect (133089) \o/. But I do not get your post merge comment. [16:42:34] mutante: The change seems to work for me as expected. Did it break something? [16:42:44] what rolls downstairs, alone or in pairs, rolls over your neighbor's dog? [16:43:15] what's great for a snack and fits on your back? it's log, log, log! [16:43:29] qchris: no, it works fine. the comment was supposed to say "yea, it redirects where it should now" [16:43:42] mutante: Ah. Ok :-) Thanks! [16:44:11] qchris: i watched it on stat1001, it just rewrote the site config and also refreshed service:) [16:44:11] mutante: Certificates do not match alias ... but that's expected and still an improvement over the previous situation. [16:44:22] qchris: yes [16:44:29] mutante: Danke! [16:44:33] bitte [16:47:04] <_joe_> ori: in the meantime, originating from a ticket of yours [16:47:25] <_joe_> change 138317 [16:47:33] <_joe_> if you care to take a look [16:54:03] (03PS1) 10Filippo Giunchedi: disable reqstats.edits dash involving many metrics [operations/puppet] - 10https://gerrit.wikimedia.org/r/138630 [16:54:49] ori (or anyone else interested) https://gerrit.wikimedia.org/r/#/c/138630/ [16:55:19] (03CR) 10Ori.livneh: [C: 031] disable reqstats.edits dash involving many metrics [operations/puppet] - 10https://gerrit.wikimedia.org/r/138630 (owner: 10Filippo Giunchedi) [16:57:22] (03CR) 10Rush: [C: 031] "yup, that's all there is to it for adding him to this group. looks good." 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/138625 (owner: 10Giuseppe Lavagetto) [16:59:27] (03CR) 10Giuseppe Lavagetto: role::mediawiki::webserver: set maxclients dynamically, dissolve bits role (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/137947 (owner: 10Ori.livneh) [17:00:13] (03PS2) 10Giuseppe Lavagetto: admins: grant access to stat1002 to halfak [operations/puppet] - 10https://gerrit.wikimedia.org/r/138625 [17:00:20] _joe_: sweet, thanks, i'll amend [17:00:25] _joe_: i'll review the redis patch too, thanks [17:00:26] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] admins: grant access to stat1002 to halfak [operations/puppet] - 10https://gerrit.wikimedia.org/r/138625 (owner: 10Giuseppe Lavagetto) [17:00:35] OMG pings! [17:02:03] <_joe__> nick _joe_ [17:02:10] <_joe__> oh well [17:05:30] (03PS1) 10Dzahn: add check_snmp_environment from Nagios exchange [operations/puppet] - 10https://gerrit.wikimedia.org/r/138632 [17:08:38] (03PS1) 10MarkTraceur: Remove completed surveys [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/138634 [17:09:45] (03PS2) 10Dzahn: add check_snmp_environment from Nagios exchange [operations/puppet] - 10https://gerrit.wikimedia.org/r/138632 [17:10:08] PROBLEM - Puppet freshness on labstore1001 is CRITICAL: Last successful Puppet run was Tue 03 Jun 2014 16:21:49 UTC [17:11:50] (03PS1) 10Ori.livneh: Update stdlib to latest supported release from PuppetLabs [operations/puppet] - 10https://gerrit.wikimedia.org/r/138635 [17:17:22] <_joe|away> ori: btw, the puppet compiler now has namespace support [17:17:39] what does that mean? 
[17:17:44] <_joe|away> so the jenkins builds do not get overwritten [17:18:16] <_joe|away> so you can have both one build for the change and one seeking puppet 3 incompatibilities for the change and they do not clash [17:20:34] (03PS2) 10Filippo Giunchedi: disable reqstats.edits dash involving many metrics [operations/puppet] - 10https://gerrit.wikimedia.org/r/138630 [17:20:39] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] disable reqstats.edits dash involving many metrics [operations/puppet] - 10https://gerrit.wikimedia.org/r/138630 (owner: 10Filippo Giunchedi) [17:20:52] Today's going to be full of finishing the last 7 reviews and doing wellness reimbursement grunt work. Ping me if you need me. [17:22:27] greg-g: Enjoy :p [17:22:32] <_joe|away> godog: btw, feeds from mwprof to graphite seem to be broken. again. [17:23:00] <_joe|away> it's the damn mwprof-to-carbon that gets stuck [17:23:07] <_joe|away> I'll try to debug it [17:23:08] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There are unmerged changes in puppet (dir /var/lib/git/operations/puppet) waiting since 1220 seconds [17:23:28] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There are unmerged changes in puppet (dir /var/lib/git/operations/puppet) waiting since 1220 seconds [17:24:08] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [17:24:28] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [17:24:50] (03CR) 10Ori.livneh: [C: 04-1] redis: restart service upon first install (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/138317 (owner: 10Giuseppe Lavagetto) [17:25:18] Tim-away: when you get back online ping [17:25:18] <_joe|away> ori: profiler-to-carbon is stuck waiting data from mwprof, it seems the socket just needs a timeout [17:25:27] <_joe|away> ori: which repository is that? 
[17:25:49] https://github.com/wikimedia/operations-software-mwprof , https://github.com/wikimedia/operations-software-mwprof-reporter [17:27:11] _joe|away: btw I think the check should compare the local time on the machine not the time of the last change, that palladium warning happened 3 minutes after I've merged from gerrit but the alarm says 10+ minutes [17:27:25] (03CR) 10Giuseppe Lavagetto: redis: restart service upon first install (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/138317 (owner: 10Giuseppe Lavagetto) [17:27:56] <_joe|away> godog: yeah you're right [17:28:04] <_joe|away> never thought of that. [17:28:09] <_joe|away> tomorrow :) [17:28:22] <_joe|away> ori: thanks [17:28:43] is there a way for teh unmerged changes check to show the who? [17:28:48] would flag ppl in chat that way [17:28:58] <_joe|away> !log restarted profiler-to-carbon, stuck waiting data from mwprof [17:29:03] Logged the message, Master [17:29:08] <_joe|away> chasemp: we could, yes [17:29:27] _joe|away: if you are in there tomorrow you and saw fit to do that I would think it's cool :) [17:37:49] volunteers for https://gerrit.wikimedia.org/r/#/c/138623/ ? I think it broke puppet in labs [17:38:15] paging paravoid [17:38:30] hoping there is a simple rollforward for it? I guess otherwise I can +1 [17:39:58] rollforward would be to change ldap attributes I think plus whatever system picks that, I'm not comfortable doing that though [17:40:24] (03CR) 10Rush: [C: 031] "As as simple revert should be gtg, I would run it manually on mchenry tho just in case, and probably ping faidon directly ?" 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/138623 (owner: 10Filippo Giunchedi) [17:40:33] godog: understood man, neither am I :) [17:46:19] <_joe|away> so my advice is: leave it to labs experts to fix [17:55:11] PROBLEM - check_mysql on lutetium is CRITICAL: Slave IO: Yes Slave SQL: No Seconds Behind Master: (null) [18:00:04] Reedy, greg-g: The time is nigh to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140610T1800) [18:00:11] PROBLEM - check_mysql on lutetium is CRITICAL: Slave IO: Yes Slave SQL: No Seconds Behind Master: (null) [18:03:27] ugh, my backports weren't merged? :/ [18:03:55] bloody hell [18:04:01] "Developer not present for SWAT window" [18:04:15] i said i was not going to be present and no one complained that it would be wrong… [18:04:36] Krinkle: the vector dropdowns being fucked up were not ficed on wmf8 :/ [18:04:44] not fixed* [18:05:11] PROBLEM - check_mysql on lutetium is CRITICAL: Slave IO: Yes Slave SQL: No Seconds Behind Master: (null) [18:08:28] HaithamS_: you shouldn't be editing your ssh config on bastion [18:08:38] those instructions are for your local .ssh/config [18:08:56] also, those kraken access instructions are mainly for the web gui [18:08:57] s [18:09:05] which you probably don't need access to (yet) [18:09:07] ottomata > oh, I see. [18:09:20] this is probably a better setup [18:09:21] https://wikitech.wikimedia.org/wiki/Server_access_responsibilities#SSH [18:09:23] MatmaRex: Can't care right now. [18:09:29] fixing ci.. [18:09:34] now when I try to access analytics1010 i keep getting an error. 
[18:09:38] alright then [18:09:38] and I"m supposed to not be working right now [18:10:01] HaithamS_: try stat1002.eqiad.wmnet [18:10:02] MatmaRex: ask Reedy nicely during the mw window [18:10:04] that is what you should be accessing [18:10:06] to get to hadoop [18:10:11] PROBLEM - check_mysql on lutetium is CRITICAL: Slave IO: Yes Slave SQL: No Seconds Behind Master: (null) [18:10:25] i've already asked nicely once. not my problem anymore [18:10:44] (03PS1) 10Reedy: Update non Wikipedias to 1.24wmf8 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/138649 [18:11:10] MatmaRex: patience and understanding are great virtues of someone who wants to work with others [18:11:34] MatmaRex: Have you tried bribing though? :p [18:11:46] * aude ask nicely https://gerrit.wikimedia.org/r/#/c/138644/ :) [18:11:55] can be after deploy but not too long [18:14:03] !log reedy Synchronized php-1.24wmf8/extensions/Wikidata/: (no message) (duration: 00m 16s) [18:14:07] (03CR) 10Reedy: [C: 032] Update non Wikipedias to 1.24wmf8 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/138649 (owner: 10Reedy) [18:14:07] Logged the message, Master [18:14:17] (03Merged) 10jenkins-bot: Update non Wikipedias to 1.24wmf8 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/138649 (owner: 10Reedy) [18:14:17] thanks [18:14:47] (03CR) 10Aude: "deploy time!" 
[operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/138557 (owner: 10Aude) [18:14:53] :D [18:15:11] PROBLEM - check_mysql on lutetium is CRITICAL: Slave IO: Yes Slave SQL: No Seconds Behind Master: (null) [18:15:15] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Update non Wikipedias to 1.24wmf8 [18:15:20] Logged the message, Master [18:15:40] !log reedy Synchronized docroot and w: Update non Wikipedias to 1.24wmf8 (duration: 00m 16s) [18:15:45] Logged the message, Master [18:16:34] (03PS2) 10Reedy: Enable data transclusion for wikiquote [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/138557 (owner: 10Aude) [18:17:04] (03CR) 10Reedy: [C: 032] Enable data transclusion for wikiquote [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/138557 (owner: 10Aude) [18:17:14] \o/ [18:17:20] :D [18:17:53] (03Merged) 10jenkins-bot: Enable data transclusion for wikiquote [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/138557 (owner: 10Aude) [18:18:39] !log reedy Synchronized wmf-config/InitialiseSettings.php: Enable data transclusion for wikiquote (duration: 00m 14s) [18:18:44] Logged the message, Master [18:19:06] looks good [18:20:11] RECOVERY - check_mysql on lutetium is OK: Uptime: 505463 Threads: 2 Questions: 2389151 Slow queries: 333 Opens: 561 Flush tables: 2 Open tables: 64 Queries per second avg: 4.726 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [18:20:21] Hey opsen, anyone here familiar with the !(path) syntax in bash expansion? [18:20:23] e.g. $ rm -rf !(extensions) [18:20:46] would that do the same as $ rm -rf ./ (current directory), but preserve extensions and eveyrthign inside of it [18:20:48] aude: but does it work is the question :p [18:20:59] it seems to do so, but I'd like a little more reassurance. 
[18:23:55] JohnLewis: it does [18:23:56] * aude preview but not save [18:24:04] aude: Perfect :) [18:24:55] aude: Want me to do the task of notify communities of this or shall we hold on for a while? [18:25:49] JohnLewis: i don't see lydia around [18:26:19] let's see if she's around, if not i say do it :) [18:27:01] I don't think she is (I poke her an hour and half ago) but I don't think she'll mind me doing her job, I did it for the announcement of the date anyway :) [18:27:22] I'll let you make that call though :p [18:27:39] * aude wait 5 min [18:27:57] just poke me when you want me to take her job over ;) [18:30:07] rm -rf !(keep_this_dir); works on my localhost (Mac) [18:30:11] but seems to not work in production [18:30:16] bash: !: event not found [18:30:25] I'm familiar with that error but not sure how to work around it in this case [18:31:04] RoanKattouw: ottomata: Reedy: bash experts :) [18:31:25] This is supposed to fix Jenkins for what its worth.. [18:33:02] Krinkle: Test it? :P [18:33:10] I did, that's how I got the error [18:33:14] I'd probably more likely ask akosiaris :) [18:33:42] I can disable that silly feature in bash scripts with set +H [18:33:50] but then it fails on [18:33:50] bash: syntax error near unexpected token `(' [18:34:29] this is part of a horrible workaround for another workaround for a problem in 2011, but I don't have time for that now. Ugh.. [18:34:58] for remove all but I usually do ls | grep -v 'not this one' | xargs rm [18:34:59] but that's not always perfect either [18:35:28] Hm.. 
I could use find [18:35:46] find also works well for this use case usually [18:36:43] Ha, gotcha [18:36:45] shopt -s extglob [18:36:48] not enabled by default [18:36:53] apparently it is enabled of rmy local bash [18:37:34] perfect, did a runthrough with rm -riv [18:58:31] !log shutting down ekrem [18:58:36] Logged the message, Master [18:59:26] !log reedy Synchronized php-1.24wmf8/skins/vector/components/tabs.less: (no message) (duration: 00m 14s) [18:59:31] Logged the message, Master [19:00:46] PROBLEM - Host ekrem is DOWN: PING CRITICAL - Packet loss = 100% [19:01:04] ^ mutante is that a matter of puppet running to remove monitoring there? [19:02:32] PROBLEM - Disk space on analytics1020 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/e 111565 MB (5% inode=99%): /var/lib/hadoop/data/d 74543 MB (3% inode=99%): /var/lib/hadoop/data/k 102933 MB (5% inode=99%): /var/lib/hadoop/data/c 120597 MB (6% inode=99%): /var/lib/hadoop/data/g 124044 MB (6% inode=99%): /var/lib/hadoop/data/h 96038 MB (5% inode=99%): /var/lib/hadoop/data/j 105180 MB (5% inode=99%): /var/lib/hadoop/da [19:04:02] chasemp: it's unbelievably annoying because i try hard every time for that NOT to happen.. it' [19:04:23] hmm [19:04:33] chasemp: it's a timing issue with removing the puppet stored configs and shutting down the host and the puppet agent running [19:04:35] (03PS3) 10Ottomata: Add partman recipe for 12 drive Kafka Brokers [operations/puppet] - 10https://gerrit.wikimedia.org/r/138451 [19:06:03] ACKNOWLEDGEMENT - Host ekrem is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn decom, RT #7659 [19:06:50] chasemp: so first puppet agent must stop, then it needs to be deleted from puppet stored configs, then puppet run on neon, then it's gone, but if puppet starts again it can come back on the next run on neon [19:07:13] good times [19:07:13] root@palladium:~# puppetstoredconfigclean.rb ekrem.wikimedia.org [19:07:13] Killing ekrem.wikimedia.org...done. 
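(Editor's sketch for the "remove everything except one directory" question from 18:20 to 18:37. The fix found in the log is bash's `shopt -s extglob`. The code below is a different technique, a plain Python loop with no shell globbing, shown only as an alternative. The name `remove_all_except` is hypothetical.)

```python
import shutil
from pathlib import Path


def remove_all_except(directory, keep):
    """Delete every entry directly inside `directory` except the one named `keep`.

    Hypothetical helper for illustration only.
    Returns the sorted names that were removed.
    """
    removed = []
    # Materialise the listing first so we never delete while iterating.
    for entry in list(Path(directory).iterdir()):
        if entry.name == keep:
            continue
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)   # real directory: remove recursively
        else:
            entry.unlink()         # file or symlink: remove the name only
        removed.append(entry.name)
    return sorted(removed)
```

One difference from the bash glob: `!(extensions)` skips dotfiles unless `dotglob` is set, while this loop also removes hidden entries. As in the log, a dry run on a scratch directory first is a sensible precaution.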
[19:07:28] papaul: welcome to wikimedia-operations [19:07:29] that is the one part, then puppet run on icinga [19:09:06] !log git-deploy: Deploying integration/slave-scripts I9521890b911714edf2 [19:09:12] Logged the message, Master [19:10:21] <_joe|away> papaul: hi! welcome! [19:12:09] paravoid: any idea why Varnish is configured to give "unknown, unknown, , , " XFF headers? [19:13:26] RECOVERY - Disk space on analytics1020 is OK: DISK OK [19:14:04] csteipp: we can't really avoid exceptions if the first entries are "unknown" and the server_addr is a proxy of ours [19:14:13] * AaronSchulz would like to have those not happen [19:15:07] (03PS4) 10Ottomata: Add partman recipe for 12 drive Kafka Brokers [operations/puppet] - 10https://gerrit.wikimedia.org/r/138451 [19:15:52] AaronSchulz: That means they were added before it got to varnish, right? [19:16:26] RECOVERY - Disk space on analytics1013 is OK: DISK OK [19:16:39] gah, you're right...that's the pre-flipped header, so it should work out [19:17:17] (03PS5) 10Ottomata: Add partman recipe for 12 drive Kafka Brokers [operations/puppet] - 10https://gerrit.wikimedia.org/r/138451 [19:20:20] (03PS6) 10Ottomata: Add partman recipe for 12 drive Kafka Brokers [operations/puppet] - 10https://gerrit.wikimedia.org/r/138451 [19:27:01] (03CR) 10Ottomata: [C: 032 V: 032] Add partman recipe for 12 drive Kafka Brokers [operations/puppet] - 10https://gerrit.wikimedia.org/r/138451 (owner: 10Ottomata) [19:28:28] (03PS3) 10Dzahn: inserted iotop to toollabs [operations/puppet] - 10https://gerrit.wikimedia.org/r/135185 (owner: 10Petrb) [19:30:47] (03CR) 10Dzahn: [C: 032] "makes sense to have for admins" [operations/puppet] - 10https://gerrit.wikimedia.org/r/135185 (owner: 10Petrb) [19:31:42] andrewbogott: can i ask you to look at ? 
[19:32:35] ori: I'm kind of buried in a hard problem right now, so won't be able to review immediately [19:33:00] andrewbogott: no problem [19:34:19] PROBLEM - DPKG on gallium is CRITICAL: DPKG CRITICAL dpkg reports broken packages [19:35:28] (03PS1) 10Ottomata: Fix miss aligned close brackets [operations/puppet] - 10https://gerrit.wikimedia.org/r/138664 [19:35:41] (03PS2) 10Ottomata: Fix miss aligned close brackets [operations/puppet] - 10https://gerrit.wikimedia.org/r/138664 [19:35:41] that gallium issue is me [19:36:06] (03CR) 10Ottomata: [C: 032 V: 032] Fix miss aligned close brackets [operations/puppet] - 10https://gerrit.wikimedia.org/r/138664 (owner: 10Ottomata) [19:36:53] (03CR) 10Dzahn: [C: 04-1] "i think there's a typo, $conf vs. $confs" (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/137498 (owner: 10Chad) [19:36:59] !log Broke Jenkins by silently upgrading it :-( [19:37:04] Logged the message, Master [19:38:17] hah [19:38:53] (03CR) 10Dzahn: [C: 031] "added RobH/Chris because it's procurement" [operations/puppet] - 10https://gerrit.wikimedia.org/r/137691 (owner: 10Alexandros Kosiaris) [19:41:13] (03PS1) 10Hashar: contint: update Jenkins command line [operations/puppet] - 10https://gerrit.wikimedia.org/r/138666 [19:41:19] RECOVERY - DPKG on gallium is OK: All packages OK [19:41:31] (03CR) 10Dzahn: [C: 031] torrus: csw2-esams in accessswitches, not corerouters [operations/puppet] - 10https://gerrit.wikimedia.org/r/131473 (owner: 10Alexandros Kosiaris) [19:42:10] !log Jenkins upgraded from 1.532.2 to 1.554.2 (i.e. bumped to a new LTS version). 
[19:42:10] :-( [19:42:14] Logged the message, Master [19:44:34] (03PS3) 10Ori.livneh: ::apache: delete everything we're not already using [operations/puppet] - 10https://gerrit.wikimedia.org/r/137884 [19:44:36] (03PS1) 10Ori.livneh: apache: clean up init.pp & params.pp [operations/puppet] - 10https://gerrit.wikimedia.org/r/138668 [19:45:08] !log power cycling analytics1012, attempting to reinstall as kafka broker with new kafka partman recipe [19:45:14] Logged the message, Master [19:46:19] (03PS1) 10Ori.livneh: apache::vhost: removed unused 'configure_firewall' option [operations/puppet] - 10https://gerrit.wikimedia.org/r/138669 [19:46:48] hashar: but there seem to be some nice fixes in the changelog.. right [19:46:52] " NullPointerException when trying to mark slave temporarily offline (issue 21875) " [19:46:56] f.e. [19:47:21] new LTS version sounds good? [19:47:37] !log Jenkins restarted apparently properly. Any breakage would probably be related to the version switch :-D [19:47:42] Logged the message, Master [19:48:06] (03CR) 10Dzahn: [C: 031] "recheck" [operations/puppet] - 10https://gerrit.wikimedia.org/r/138666 (owner: 10Hashar) [19:48:30] mutante: yeah that sounds good. Though new Jenkins versions often come with new pluginsthat have to be upgraded :D I tend to do the upgrade during the quiet european morning with prior announcement [19:48:37] hashar: it works apparently:) [19:48:42] Main test build succeeded. [19:48:48] yeah apparently :-D [19:49:07] (03CR) 10Dzahn: [C: 032] contint: update Jenkins command line [operations/puppet] - 10https://gerrit.wikimedia.org/r/138666 (owner: 10Hashar) [19:49:13] if I am still around in two hours you will know it ended up broken hehe :D [19:49:31] there, do you have it running with the right options already? [19:49:55] nop [19:50:00] that option is probably harmless [19:50:24] I sent in puppet to make sure I will remember to revisit eventually. 
I guess you can get it merged [19:50:33] i did [19:50:50] yes, just the "headless" option is going to be added [19:50:58] thx! [19:51:18] will look at the debian change log for clues [19:51:50] applied, but you'd want one more restart [19:52:26] will do! thank you! [19:52:31] yw [20:10:54] PROBLEM - Puppet freshness on labstore1001 is CRITICAL: Last successful Puppet run was Tue 03 Jun 2014 16:21:49 UTC [20:24:24] (03PS1) 10Ottomata: Add network configs for analytics1-a and analytics1-d networks [operations/puppet] - 10https://gerrit.wikimedia.org/r/138676 [20:25:04] mutante: ^ [20:25:32] looking [20:28:21] (03CR) 10Dzahn: [C: 031] "lgtm, consistent with the existing networks in network.pp, gateway IPs reachable from carbon, netmasks ok" [operations/puppet] - 10https://gerrit.wikimedia.org/r/138676 (owner: 10Ottomata) [20:28:52] danke, merging [20:30:10] (03CR) 10Ottomata: [C: 032 V: 032] Add network configs for analytics1-a and analytics1-d networks [operations/puppet] - 10https://gerrit.wikimedia.org/r/138676 (owner: 10Ottomata) [20:35:46] Reedy: thanks for the backports <3 [20:36:48] (03PS1) 10Jgreen: monitor incoming mail for OTRS [operations/puppet] - 10https://gerrit.wikimedia.org/r/138681 [20:41:31] (03PS1) 10Legoktm: Set $wgTitleBlacklistLogHits = true on all wikis [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/138684 (https://bugzilla.wikimedia.org/66450) [20:43:06] oof, mutante, ok so, i have a blank screen in the install right now...i guess I should just wait and see what happens? [20:43:12] or is this what it looks like if partman fails? [20:46:49] (03CR) 10Ottomata: [C: 031] monitor incoming mail for OTRS [operations/puppet] - 10https://gerrit.wikimedia.org/r/138681 (owner: 10Jgreen) [20:48:16] (03CR) 10Jgreen: [C: 032 V: 031] monitor incoming mail for OTRS [operations/puppet] - 10https://gerrit.wikimedia.org/r/138681 (owner: 10Jgreen) [20:52:45] oo, it is starting the partitioner... [20:54:44] mutante: still there? 
[20:56:11] (03PS1) 10Andrew Bogott: Include role::mail::sender in role::labs::instance [operations/puppet] - 10https://gerrit.wikimedia.org/r/138722 [20:58:15] (03CR) 10Andrew Bogott: [C: 032] Include role::mail::sender in role::labs::instance [operations/puppet] - 10https://gerrit.wikimedia.org/r/138722 (owner: 10Andrew Bogott) [21:00:04] bsitu: The time is nigh to deploy Flow (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140610T2100) [21:01:34] PROBLEM - Number of mediawiki jobs queued on tungsten is CRITICAL: CRITICAL: Anomaly detected: 54 data above and 9 below the confidence bounds [21:03:54] ottomata: back.. i see it's booting now.. nice [21:04:00] yup, booting [21:04:09] now on the partman craziness [21:04:12] i seem to have created 17 partitions! :p [21:04:16] on each device! [21:04:18] not intended! [21:04:22] :o [21:04:37] did you copy an existing recipe and adjust it? [21:04:42] yes [21:04:43] but a lot [21:04:44] https://gerrit.wikimedia.org/r/#/c/138451/6/modules/install-server/files/autoinstall/partman/analytics-kafka.cfg [21:04:45] its complicated [21:05:09] string /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sd [21:05:12] g /dev/sdh /dev/sdi /dev/sdj /dev/sdk /dev/sdl [21:05:15] ya, 12 drives [21:05:18] heh, you really have all those ? [21:05:24] yup [21:05:32] i want: [21:05:52] sda and sdb to have a 30G raid1 /, and a 1G raid 1 swap [21:05:53] then [21:06:05] that shoudl be sda1,sdb1 and sda2,sdb2 [21:06:07] in raid1 [21:06:07] then [21:06:23] sda3 and sdb3 fill up the remaining space on those drives [21:06:29] with a physical ext4 partition [21:06:32] then. 
[21:06:48] sd{c..l}1 should be physical ext4 partitions [21:06:50] that take the whole drive [21:07:25] it seems that with that current recipe [21:07:40] i have created sda1 and sdb1 as a 30G raid1, and sda2 and sdb2 as a 30G raid1 [21:07:43] dunno how that happened [21:07:49] and then also, a bunch more partitions for each device [21:07:58] 1..17 :/ [21:09:14] uhm, i don't know, i have gotten away with using existing recipes mostly, except one for swift or so [21:09:58] !log montly sms credit check: 1,447.36 SMS credits. will check again in 30 days [21:10:04] Logged the message, RobH [21:10:44] hmph [21:10:50] dunno if i should give up and do it by hand [21:10:55] this would be really nice to get working though [21:11:02] as it is basically the same recipe that hadoop nodes will use [21:11:08] there are only a few kafka brokers [21:11:12] ottomata: but the part that it creates all those partitions seems normal.. they all have $primary{ } [21:11:15] but will be 20 some hadoop nodes [21:11:17] vs. raid [21:11:38] yes, i have 2 30G raid partitions somehow, on sda1,sdb1 and sda2,sdb2 [21:11:40] dunno how that happened [21:11:42] but the others don't have raid [21:13:05] which one did you copy? [21:13:12] and are you sure that worked [21:13:21] there might be recipes in the repo that dont work [21:14:07] well, i mean [21:14:11] there are no other recipes that do this [21:14:14] i think i started from raid1-30G [21:14:36] there are no other recipes that try to do different things to differnent devices [21:14:56] all other recipes are either: do this thing to taht one device, OR do this thing to all devices [21:16:40] mutante: , the installer gave me this error: [21:16:40] Identical mount points for two file systems [21:16:40] Two file systems are assigned the same mount point [21:16:40] (/var/spool/kafka/a): SCSI1 (0,0,0), partition #6 (sda) and SCSI1 [21:16:40] (0,1,0), partition #6 (sdb). 
[21:17:16] ottomata: maybe you can do it manually in the installer, but then get the partman config that the installer creates and put it in puppet? [21:17:27] oh, it will create one? [21:17:31] PROBLEM - Puppet freshness on db1009 is CRITICAL: Last successful Puppet run was Tue 10 Jun 2014 18:17:10 UTC [21:17:37] andrewbogott: any other puppet 3 shortcomings ? [21:18:00] matanya: I'm fixing a labs bug right now but it's unrelated to puppet 3 [21:18:19] I absolutely cannot build a precise image with puppet 3 and am wasting a bunch of time on that. But it may be moot, I could just stop building precise images entirely... [21:18:41] PROBLEM - Number of mediawiki jobs running on tungsten is CRITICAL: CRITICAL: Anomaly detected: 54 data above and 9 below the confidence bounds [21:18:46] i'm all for such a change [21:19:27] ottomata: http://cptyesterday.wordpress.com/2012/06/17/notes-on-using-expert_recipe-in-debianubuntu-preseed-files/ hmmm [21:20:42] (03PS1) 10Jgreen: tweak icinga check intervals for exim incoming mail rate on OTRS server [operations/puppet] - 10https://gerrit.wikimedia.org/r/138733 [21:21:37] (03CR) 10PiRSquared17: [C: 04-1] "Please add" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/138684 (https://bugzilla.wikimedia.org/66450) (owner: 10Legoktm) [21:23:01] ottomata: "Unfortunately the expert_recipe part of partman can currently only handles a single disk for partition based recipes." ? [21:23:36] hmmmmMMM [21:23:46] (03CR) 10Jgreen: [C: 032 V: 031] tweak icinga check intervals for exim incoming mail rate on OTRS server [operations/puppet] - 10https://gerrit.wikimedia.org/r/138733 (owner: 10Jgreen) [21:23:47] that's not totally true, because you can do raid with multiple disks [21:23:49] buuuuuut [21:23:49] yeah [21:23:50] hm [21:23:53] If expert_recipe is used with a LVM setup then use multiple disks can be used [21:23:56] that was in 2012, wasn't that changed ? 
[21:24:38] i remember i read about some work around this issue, somewhere [21:24:47] hmm, i don't mind LVM [21:24:55] i am just using the whole disk so didn't seem necessary [21:25:01] there is that example in the link above [21:25:03] yeah [21:25:15] ok, let's do this piecemeal, will see if I can get the first 2 drives formatted without the ohter craziness first [21:25:33] mutante: wait, though, you said I could get a partman recipe after the fact? [21:25:38] if I partition manually? [21:25:52] On several Tools instances, /var is filling up due to python-diamond creating a huge log file in /var/log/diamond for today. Is that a general thing? [21:26:41] PROBLEM - Number of mediawiki jobs running on tungsten is CRITICAL: CRITICAL: Anomaly detected: 54 data above and 9 below the confidence bounds [21:27:05] andrewbogott: one of our TODOs for HHVM is to migrate the beta cluster to trusty, so that's another point in favor of that approach [21:27:27] ori: in favor of what approach? [21:27:42] not bothering with precise / puppet 3 if it's a bother [21:29:15] ottomata: i'm not sure, i was just thinking maybe the installer creates a preseed file from what the user selects and passes it to partman-auto [21:29:21] that was wild guess [21:29:40] i doubt it, i think its the other way around [21:29:41] ori: Ah, yeah. 
[21:29:58] the d-i partman stuff is an automated way to interact with the debian-installer [21:30:14] ori: I wrote such a magnificent hack, though, I''ll be sad to not finish it :) (It tries to unmount a volume, catches an exception, does an lsof, kills /every single process/ in the lsof, repeats) [21:30:17] its actually a way of describing how to push buttons in the debian installer :/ [21:30:22] ori: within context that actually sort of makes sense :) [21:30:30] haha, i know the feeling [21:30:35] that sounds awesome [21:30:54] well this is quite annoying, i dont think I can do what I want to do actually [21:30:58] andrewbogott: share the hack, but don't push to puppet :) [21:31:04] ok, will revert to somethign simpler and try again tomorrow [21:31:10] time to sign out for the day! [21:31:16] mutante: thanks for your help [21:31:26] cya [21:31:32] chasemp: What is /var/log/diamond/diamond.log and is it necessary? [21:31:37] matanya: It still doesn't work right, and takes 30 minutes to test. So… I'm not /that/ thrilled at ironing out the last 3 or 4 bugs... [21:31:54] ok, just let it go [21:32:37] or in the popular way of saying it: https://www.youtube.com/watch?v=moSFlvxnbgk [21:32:46] scfc_de: it's the local log of the statistics polling, it's not necessary in that it won't work without it, but it doesn't logging those details to stdout seemingly so upstart logging isn't useful in those cases. it's handy but not could-never-live-without [21:33:03] * andrewbogott still hasn't seen that movie and is increasingly alienated from his native culture as a result [21:33:27] <^d> I haven't seen any new movies in awhile. [21:33:30] PROBLEM - Puppet freshness on db1006 is CRITICAL: Last successful Puppet run was Tue 10 Jun 2014 18:33:10 UTC [21:33:40] RECOVERY - Puppet freshness on db1006 is OK: puppet ran at Tue Jun 10 21:33:33 UTC 2014 [21:34:39] ^d: except on planes [21:34:50] chasemp: Okay, then I'll delete them on Tools to get some room to breathe. 
[21:35:50] PROBLEM - Disk space on stat1002 is CRITICAL: DISK CRITICAL - free space: /tmp 0 MB (0% inode=99%): [21:35:52] <^d> mutante: I don't usually watch movies on planes either. Usually playing SNES games ;-) [21:35:59] (03CR) 10Jalexander: [C: 04-1] "PiR's suggestion may resolve my concerns but since I'm digging in to see whats being shown (and roping legal in for a quick review) -1ing " [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/138684 (https://bugzilla.wikimedia.org/66450) (owner: 10Legoktm) [21:37:57] (03CR) 10Dzahn: "matanya, did you recently fix this already or is it slightly different?" [operations/puppet] - 10https://gerrit.wikimedia.org/r/108086 (owner: 10Tim Landscheidt) [21:38:21] ^d: true, and then the entertainment system crashes and they reboot it [21:39:04] <^d> mutante: I meant playing them on my laptop :p [21:39:56] (03CR) 10Matanya: "fixed in https://gerrit.wikimedia.org/r/#/c/135757/2" [operations/puppet] - 10https://gerrit.wikimedia.org/r/108086 (owner: 10Tim Landscheidt) [21:40:14] (03PS1) 10Jgreen: change ganglia metric name 'exim messages in' to 'exim_messages_in' [operations/puppet] - 10https://gerrit.wikimedia.org/r/138736 [21:40:28] ^d: oh, heh:) did you know https://en.wikipedia.org/wiki/Kaillera ? [21:40:54] (03Abandoned) 10Tim Landscheidt: Report last successful Puppet run in 24-hour format [operations/puppet] - 10https://gerrit.wikimedia.org/r/108086 (owner: 10Tim Landscheidt) [21:40:56] ^d: remember the edithaton patch for Library of Israel, from last week? [21:41:08] it didn't work for some reason, can you check why ? [21:41:13] (03CR) 10PiRSquared17: "Yeah, I'm actually a bit surprised it was merged (I was about to abandon it, frankly, since pagemoves and editing cannot be logged), but I" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/138684 (https://bugzilla.wikimedia.org/66450) (owner: 10Legoktm) [21:41:31] <^d> mutante: I did not. 
[21:41:53] ^d: it made it possible to play original Pacman in MAME via the network across the world [21:42:04] too bad it's Windows, proprietary AND MAME dropped it [21:42:21] you could play ALL the old arcade games against remote people [21:42:26] ^d: https://gerrit.wikimedia.org/r/#/c/136750/ <-- this one [21:42:37] http://www.mamedb.com/ [21:43:09] (03CR) 10Jgreen: [C: 032 V: 031] change ganglia metric name 'exim messages in' to 'exim_messages_in' [operations/puppet] - 10https://gerrit.wikimedia.org/r/138736 (owner: 10Jgreen) [21:43:11] <^d> matanya: I wonder if loginwiki needs to be implicitly part of that list. [21:43:19] we still have the pending patch to add a link to Twitter to varnish error page [21:43:34] chasemp: How can I make diamond make reopen its log file? Currently, it holds a fds for the deleted ones which means the space doesn't get returned to the pool. [21:43:34] go vote on https://gerrit.wikimedia.org/r/#/c/97190/2 [21:43:41] ^d: I thought about it, but don't have enough knowledge to tell [21:44:33] matanya: scfc_de , thanks, one less in the queue :) [21:44:37] scfc_de: best idea I have is a restart? [21:45:30] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There are unmerged changes in puppet (dir /var/lib/git/operations/puppet) waiting since 959 seconds [21:45:56] (03CR) 10Dzahn: "bump, this is one of the oldest ones around" [operations/puppet] - 10https://gerrit.wikimedia.org/r/68584 (owner: 10Andrew Bogott) [21:46:10] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There are unmerged changes in puppet (dir /var/lib/git/operations/puppet) waiting since 959 seconds [21:46:20] (03CR) 10Dzahn: "1 year birthday in 3 days" [operations/puppet] - 10https://gerrit.wikimedia.org/r/68584 (owner: 10Andrew Bogott) [21:47:34] chasemp: That is service "diamond"? 
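(Editor's sketch for the 21:43:34 point that deleting a log file does not return its space while the daemon still holds a file descriptor. This is generic POSIX behaviour, not diamond-specific code. The function name `unlinked_but_open_demo` is hypothetical.)

```python
import os
import tempfile


def unlinked_but_open_demo():
    """Show that an unlinked file stays allocated while a descriptor is open.

    Hypothetical helper for illustration only; POSIX behaviour.
    Returns (link_count_after_unlink, size_after_unlink, path_exists).
    """
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"x" * 1024)
        os.unlink(path)        # same effect as rm on the log file
        info = os.fstat(fd)    # the descriptor still refers to the inode
        return info.st_nlink, info.st_size, os.path.exists(path)
    finally:
        os.close(fd)           # only now can the filesystem free the blocks
```

The name is gone after the unlink, but the data remains until the last descriptor is closed, which is why restarting the service (as suggested at 21:44:37) gives the space back.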
[21:47:40] it is [21:47:45] scfc_de: it is [21:48:20] (03CR) 10Dzahn: "quote: "I suppose if, at some point, a rebase of this results in a Jenkins +2 then on that day it'll be safe to merge." fixing path confl" [operations/puppet] - 10https://gerrit.wikimedia.org/r/109073 (owner: 10Andrew Bogott) [21:48:45] chasemp: Thanks. [21:49:10] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [21:49:30] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [21:50:31] unmerged change alerter ++ [21:50:56] ++ [21:51:09] (03CR) 10Aklapper: [C: 04-1] "Whoops, lines 139-140 ((with ASSIGNED status), in front of the
    ) still have to get removed, otherwise we have the query duplica" [wikimedia/bugzilla/modifications] - https://gerrit.wikimedia.org/r/129671 (owner: Odder)
[21:51:40] PROBLEM - Number of mediawiki jobs running on tungsten is CRITICAL: CRITICAL: Anomaly detected: 54 data above and 9 below the confidence bounds
[21:54:05] (PS2) Dzahn: ldap: Fix typo in usage messages [operations/puppet] - https://gerrit.wikimedia.org/r/114740 (owner: Tim Landscheidt)
[21:55:14] is that one just growing over time?
[21:55:23] the mediawiki job running on tungsten one
[21:55:27] (CR) Aklapper: [C: 1] "Tested patchset 3 locally and works as expected. Thanks! +1" [wikimedia/bugzilla/modifications] - https://gerrit.wikimedia.org/r/124140 (https://bugzilla.wikimedia.org/62160) (owner: 01tonythomas)
[21:55:58] (CR) Dzahn: [C: 1] "this will be hard to rebase now but still good" [operations/puppet] - https://gerrit.wikimedia.org/r/114736 (owner: Tim Landscheidt)
[21:56:30] greg-g: I think so. It's been popping up quite a bit recently.
[21:56:42] (CR) Dzahn: [C: 2] ldap: Fix typo in usage messages [operations/puppet] - https://gerrit.wikimedia.org/r/114740 (owner: Tim Landscheidt)
[21:56:43] (over a few days)
[21:59:03] greg-g, need to sync labs, noop for prod. https://gerrit.wikimedia.org/r/#/c/138506
[22:00:00] yurikR: kk
[22:00:27] http://somethingsinistral.net/blog/the-angry-guide-to-puppet-3/
[22:00:33] (CR) Yurik: [C: 2] Updated labs config for new zero exts [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/138506 (owner: Yurik)
[22:00:48] (Merged) jenkins-bot: Updated labs config for new zero exts [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/138506 (owner: Yurik)
[22:03:52] what's going on with the lack of data here: http://gdash.wikimedia.org/dashboards/jobq/
[22:05:37] PROBLEM - Number of mediawiki jobs running on tungsten is CRITICAL: CRITICAL: Anomaly detected: 54 data above and 9 below the confidence bounds
[22:10:17] PROBLEM - Number of mediawiki jobs queued on tungsten is CRITICAL: CRITICAL: Anomaly detected: 54 data above and 9 below the confidence bounds
[22:10:32] soo.....
[22:10:35] AaronSchulz: ^
[22:11:00] AaronSchulz: anything wrong with the job queue? icinga thinks so, and is complaining a lot about it
[22:14:23] it's hard to make sense of those warnings
[22:14:27] <^d> I think either the check is a little screwy or the data is. Job queue itself seems fine.
[22:16:00] k
[22:16:22] I don't see anything out of the ordinary though
[22:16:55] (CR) Tim Landscheidt: "Will this still allow different setups (Tools or the VERP project come to mind) to be possible?" [operations/puppet] - https://gerrit.wikimedia.org/r/138722 (owner: Andrew Bogott)
[22:21:17] PROBLEM - Number of mediawiki jobs queued on tungsten is CRITICAL: CRITICAL: Anomaly detected: 54 data above and 9 below the confidence bounds
[22:30:07] (PS2) Dzahn: Fix manage-keys-nfs misnomer in usage message [operations/puppet] - https://gerrit.wikimedia.org/r/114739 (owner: Tim Landscheidt)
[22:30:17] PROBLEM - Number of mediawiki jobs queued on tungsten is CRITICAL: CRITICAL: Anomaly detected: 54 data above and 9 below the confidence bounds
[22:31:13] (CR) Dzahn: [C: 2] Fix manage-keys-nfs misnomer in usage message [operations/puppet] - https://gerrit.wikimedia.org/r/114739 (owner: Tim Landscheidt)
[22:33:54] nice quit message :p
[22:38:27] (CR) Dzahn: [C: 2] "libpng3-dev - can't select versions from package 'libpng3-dev' as it is purely virtual" [operations/puppet] - https://gerrit.wikimedia.org/r/102630 (owner: Tim Landscheidt)
[22:39:50] (PS2) Dzahn: Tools: Resolve virtual package requirements [operations/puppet] - https://gerrit.wikimedia.org/r/102630 (owner: Tim Landscheidt)
[22:46:26] (CR) Dzahn: [C: 2] Tools: Resolve virtual package requirements [operations/puppet] - https://gerrit.wikimedia.org/r/102630 (owner: Tim Landscheidt)
[22:47:47] RECOVERY - Puppet freshness on db1009 is OK: puppet ran at Tue Jun 10 22:47:37 UTC 2014
[22:52:37] (CR) Legoktm: [C: -1] "The log is being improved in I1b7396cebaa528edca043d5d3dfbf9d950d0e116, let's wait until then." [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/138684 (https://bugzilla.wikimedia.org/66450) (owner: Legoktm)
[22:55:26] (CR) Dzahn: [C: 2] Styled the alias field value differently [wikimedia/bugzilla/modifications] - https://gerrit.wikimedia.org/r/124140 (https://bugzilla.wikimedia.org/62160) (owner: 01tonythomas)
[22:55:43] (CR) Dzahn: [V: 2] Styled the alias field value differently [wikimedia/bugzilla/modifications] - https://gerrit.wikimedia.org/r/124140 (https://bugzilla.wikimedia.org/62160) (owner: 01tonythomas)
[22:56:05] (PS3) Chad: Fixup lsearchd config slightly [operations/puppet] - https://gerrit.wikimedia.org/r/137498
[22:59:47] RECOVERY - Disk space on stat1002 is OK: DISK OK
[23:00:11] mwalker, ori, MaxSem: The time is nigh to deploy SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140610T2300)
[23:00:25] there aren't any
[23:00:28] so i'll volunteer
[23:00:32] wheee
[23:00:41] * ori is selfless like that
[23:00:46] haha
[23:00:59] I was just thinking I should do SWAT today, because I haven't done it in week
[23:01:00] s
[23:01:03] But 0 patches is easy :)
[23:01:06] you all get to do it!
[23:01:32] everyone can easily do nothing, right?
[23:01:55] there's nothing to step on each others' toes with, nothing complicated, nothing is awesome.
[23:02:07] marktraceur: is this resolved now?:) https://bugzilla.wikimedia.org/show_bug.cgi?id=62160
[23:02:20] Errr
[23:02:21] or, to reuse my quotable from yesterday's meditation thing "no-thing is easy"
[23:02:50] mutante: Ehhhh
[23:02:52] It's closer
[23:03:21] mutante: I'd like it to actually be different style, not just separated by punctuation
[23:03:29] 'cause I can put (whatever) in front of a title too
[23:04:11] marktraceur: it is italic now
[23:06:32] marktraceur: well.. in the line on top it is
[23:09:53] (CR) Dzahn: [C: 1] "lgtm, Chad, now?" [operations/puppet] - https://gerrit.wikimedia.org/r/137498 (owner: Chad)
[23:10:27] PROBLEM - swift-object-auditor on ms-be1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:10:37] PROBLEM - swift-object-updater on ms-be1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:10:47] PROBLEM - swift-object-server on ms-be1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:10:47] PROBLEM - swift-account-server on ms-be1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:10:47] PROBLEM - swift-account-auditor on ms-be1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:10:47] PROBLEM - swift-account-reaper on ms-be1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:10:53] uh
[23:10:57] PROBLEM - DPKG on ms-be1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:10:57] PROBLEM - Disk space on ms-be1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:11:05] mutante: ^
[23:11:07] PROBLEM - swift-container-server on ms-be1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:11:07] PROBLEM - check if dhclient is running on ms-be1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:11:07] PROBLEM - swift-container-replicator on ms-be1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:11:18] PROBLEM - puppet disabled on ms-be1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:11:18] PROBLEM - check configured eth on ms-be1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:11:18] PROBLEM - RAID on ms-be1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:11:18] PROBLEM - swift-account-replicator on ms-be1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:11:24] greg-g: server is still up
[23:11:27] PROBLEM - swift-object-replicator on ms-be1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:11:27] PROBLEM - swift-container-auditor on ms-be1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:11:27] PROBLEM - swift-container-updater on ms-be1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:11:30] i expected it to be frozen
[23:11:37] PROBLEM - Puppet freshness on labstore1001 is CRITICAL: Last successful Puppet run was Tue 03 Jun 2014 16:21:49 UTC
[23:11:38] nice lack of dependencies
[23:14:17] PROBLEM - exim incoming message rate on iodine is CRITICAL: exim_messages_in CRITICAL: 0.0
[23:14:44] greg-g: i'm hoping for recoveries since those things are running
[23:15:26] (CR) Chad: [C: 1] "Whenever is fine by me." [operations/puppet] - https://gerrit.wikimedia.org/r/137498 (owner: Chad)
[23:16:49] (CR) Andrew Bogott: [C: -2] "Rather than revert, I put in https://gerrit.wikimedia.org/r/#/c/138722/" [operations/puppet] - https://gerrit.wikimedia.org/r/138623 (owner: Filippo Giunchedi)
[23:17:38] greg-g: ugh, or not .. BUG: soft lockup - CPU#3 stuck for 22s! [swift-container
[23:19:49] !log rebooting unresponsive ms-be1003
[23:19:54] Logged the message, Master
[23:21:37] PROBLEM - SSH on ms-be1003 is CRITICAL: Connection refused
[23:23:37] PROBLEM - Host ms-be1003 is DOWN: PING CRITICAL - Packet loss = 100%
[23:23:53] (CR) Andrew Bogott: "> Will this still allow different setups (Tools or the VERP project come to mind) to be possible?" [operations/puppet] - https://gerrit.wikimedia.org/r/138722 (owner: Andrew Bogott)
[23:26:47] RECOVERY - Disk space on ms-be1003 is OK: DISK OK
[23:26:52] greg-g: ^
[23:26:58] RECOVERY - Host ms-be1003 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms
[23:26:58] RECOVERY - swift-container-server on ms-be1003 is OK: PROCS OK: 13 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server
[23:26:58] RECOVERY - check if dhclient is running on ms-be1003 is OK: PROCS OK: 0 processes with command name dhclient
[23:26:58] RECOVERY - swift-container-replicator on ms-be1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator
[23:27:07] RECOVERY - puppet disabled on ms-be1003 is OK: OK
[23:27:07] RECOVERY - check configured eth on ms-be1003 is OK: NRPE: Unable to read output
[23:27:07] RECOVERY - swift-account-replicator on ms-be1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator
[23:27:07] RECOVERY - RAID on ms-be1003 is OK: OK: optimal, 14 logical, 14 physical
[23:27:17] RECOVERY - swift-object-replicator on ms-be1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator
[23:27:17] RECOVERY - swift-container-auditor on ms-be1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[23:27:17] RECOVERY - swift-container-updater on ms-be1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater
[23:27:17] RECOVERY - swift-object-auditor on ms-be1003 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor
[23:27:27] RECOVERY - swift-object-updater on ms-be1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater
[23:27:37] RECOVERY - swift-object-server on ms-be1003 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server
[23:27:37] RECOVERY - SSH on ms-be1003 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.4 (protocol 2.0)
[23:27:37] RECOVERY - swift-account-server on ms-be1003 is OK: PROCS OK: 13 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server
[23:27:37] RECOVERY - swift-account-reaper on ms-be1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper
[23:27:48] RECOVERY - swift-account-auditor on ms-be1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor
[23:27:48] RECOVERY - DPKG on ms-be1003 is OK: All packages OK
[23:27:55] mutante: what just happened? oom?
[23:29:12] andrewbogott: Jun 10 23:18:26 ms-be1003 kernel: [5935664.442078] BUG: soft lockup - CPU#4 stuck for 23s! [xfsaild/sdn3:1146]
[23:29:20] let me forward you what i could save
[23:29:28] see SAL though, it's not the first time this box does it
[23:29:33] uh
[23:29:39] xfs?
[23:29:44] ok. No need to mail me details, was just wondering if this was some interesting Swift issue
[23:29:55] andrewbogott: got a few mins for CR?
[23:30:12] um… a few, sure.
[23:30:15] forwarded you something, andrewbogott , jgage .. does that tell us more?
[23:30:40] xfs does not have a glorious record as far as stability goes :(
[23:31:06] cpu stuck, xfs referenced.. i wonder if we're looking at a disk issue
[23:31:28] andrewbogott: https://gerrit.wikimedia.org/r/#/c/137884/ , https://gerrit.wikimedia.org/r/#/c/138668/ and https://gerrit.wikimedia.org/r/#/c/138669/
[23:31:50] andrewbogott: i can shepherd them to production
[23:32:25] * jgage installs smartmontools and checks out disk health on ms-be1003
[23:32:58] aw it's perc, no smart
[23:33:37] jgage: cool!
[23:34:23] ok smartctl works on perc, you just have to pass -d megaraid,N where N is the disk number
[23:34:40] !log updated labs Trusty image w/puppet3, made default
[23:34:45] Logged the message, Master
[23:35:11] ...fsvo works. doesn't expose the counters i want to see.
[23:35:14] ori: Just to confirm… the apache module is currently unused?
[23:35:36] it is currently used, but:
[23:35:41] a) we agreed we want to get rid of it
[23:35:54] b) i think we can renovate it from inside rather than create a brand new module and end up with dupes
[23:36:00] c) the patch removes things we *aren't* using
[23:36:28] the remaining code will be heavily refactored in place
[23:37:13] Can you tell me a bit about how you ensured that c) is true?
[23:37:14] (CR) Dzahn: [C: 2] Fixup lsearchd config slightly [operations/puppet] - https://gerrit.wikimedia.org/r/137498 (owner: Chad)
[23:38:09] laboriously and manually.. :/
[23:38:29] ^d: ^
[23:38:37] andrewbogott: everything in spec/* is obviously not required; that's for the tests
[23:38:50] tests/* ditto
[23:38:57] docs ditto
[23:39:02] ok based on this the xfs aild is a watchdog, so it may have simply detected the hang rather than caused it: http://oss.sgi.com/archives/xfs/2012-11/msg00570.html
[23:39:04] <^d> mutante: Just saw, thx
[23:39:37] andrewbogott: that leaves: dev.pp, php.pp, proxy.pp, mod/auth_kerb.pp, mod/disk_cache.pp, python.pp, ssl.pp
[23:41:05] (PS1) Yurik: Labs: Replaced $wgJsonConfigStorage with $wgJsonConfigs...['store'] [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/138754
[23:42:05] greg-g, syncing labs setting https://gerrit.wikimedia.org/r/#/c/138754/
[23:42:26] (CR) Yurik: [C: 2] Labs: Replaced $wgJsonConfigStorage with $wgJsonConfigs...['store'] [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/138754 (owner: Yurik)
[23:42:32] (Merged) jenkins-bot: Labs: Replaced $wgJsonConfigStorage with $wgJsonConfigs...['store'] [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/138754 (owner: Yurik)
[23:42:54] andrewbogott: http://p.defau.lt/?OfPRK31wKASuhOsJxhFiZQ
[23:43:46] (PS1) MaxSem: Beta: create a shared PageImages blacklist [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/138755
[23:49:49] ori, one other thing, did you preserve the upstream license?
[23:50:03] hacking up and rebuilding the upstream module seems good, as long as we don't forget that it's a derived work
[23:51:03] andrewbogott: yep
[23:51:24] modules/apache/LICENSE is untouched
[23:52:01] (CR) MaxSem: [C: 2] Beta: create a shared PageImages blacklist [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/138755 (owner: MaxSem)
[23:52:03] (CR) Andrew Bogott: [C: 1] "This seems like a fine approach. But don't merge at the end of a day :)" [operations/puppet] - https://gerrit.wikimedia.org/r/137884 (owner: Ori.livneh)
[23:52:09] (Merged) jenkins-bot: Beta: create a shared PageImages blacklist [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/138755 (owner: MaxSem)
[23:52:14] andrewbogott: :((
[23:52:45] ori: Well, I mean, at the end of your day :) If you're going to be around for a few hours to notice icinga puppet alerts...
[23:52:56] yep
[23:53:39] andrewbogott: sorry, i shouldn't have frowned. i asked for a review and you gave one, thanks very much for that. and it's useful to hear deployment concerns.
[23:53:49] that was just kneejerk
[23:54:17] RECOVERY - exim incoming message rate on iodine is OK: exim_messages_in OKAY: 4.0
[23:57:35] (CR) Andrew Bogott: apache: clean up init.pp & params.pp (2 comments) [operations/puppet] - https://gerrit.wikimedia.org/r/138668 (owner: Ori.livneh)
[23:58:22] (CR) Andrew Bogott: [C: 1] apache::vhost: removed unused 'configure_firewall' option [operations/puppet] - https://gerrit.wikimedia.org/r/138669 (owner: Ori.livneh)