[00:01:28] paravoid: is this exim config really faulty? i believe it is as Tim L. says
[00:01:31] https://gerrit.wikimedia.org/r/#/c/127481/1
[00:01:45] it's toollabs
[00:02:21] (CR) Faidon Liambotis: [C: 2] Tools: Remove faulty exim configuration [operations/puppet] - https://gerrit.wikimedia.org/r/127481 (owner: Tim Landscheidt)
[00:02:28] thx!
[00:02:35] yvw :)
[00:04:35] out now, have a nice weekend all
[00:04:40] I've just noticed Visual Editor is enabled on OTRS Wiki, but it doesn't seem to work. Clicking "edit" returns an error ("Error loading data from server: parsoidserver-http-bad-status: 404"). Is that currently being worked on and/or should that go to bugzilla?
[00:05:08] gwicke: ^^
[00:07:16] (PS1) Springle: db1047 mariadb 10 now replicates s2 [operations/puppet] - https://gerrit.wikimedia.org/r/144096
[00:12:23] pajz: hmm.. James_F ^^
[00:12:26] ;)
[00:12:35] hot potato routing
[00:12:54] subbu has left already
[00:14:14] gwicke: Bleh.
[00:14:38] (CR) Springle: [C: 2] db1047 mariadb 10 now replicates s2 [operations/puppet] - https://gerrit.wikimedia.org/r/144096 (owner: Springle)
[00:15:39] James_F: are you looking into it?
[00:15:45] Yeah.
[00:15:49] Didn't realise it went out.
[00:16:21] James_F: k, thx!
[00:27:07] James_F: deploying..
[00:27:18] gwicke: Thanks!
[00:28:15] !log deployed parsoid config change e21a534 to support VE on the OTRS wiki
[00:28:21] Logged the message, Master
[00:29:10] Thanks.
[00:31:20] pajz: Should be fixed now. Sorry!
[00:35:18] Thanks. Out of curiosity, was there a request to enable VE on OTRS Wiki?
[00:37:26] Also, it's still not working for me. The error is gone, but it's now just loading endlessly (both FF and Chrome).
[00:37:54] Looks like beta labs is in read-only mode right now.
[00:38:07] James_F, ^
[00:38:52] pajz: Yeah, I chatted with the OTRS admins and in the end we said it was OK as a Beta Feature (but not on by default).
[00:38:59] pajz: Wasn't expecting it to go out today, sorry.
[00:39:45] pajz: Do you get anything in the console?
[00:40:00] But, uhm, it is now enabled by default, isn't it?
[00:40:57] pajz: Indeed.
[00:44:33] jackmcbarn, I'm not really sure what I should be looking for in the console.
[00:45:03] Provided I'm even looking at the right thing.
[00:48:23] James_F: ^
[00:48:45] James_F, http://pastebin.com/TAvK7uTA
[00:48:47] pajz: Well, if there are any messages at all that's generally a bad sign. :-)
[00:49:27] Oh. Eurgh.
[00:49:35] Damn private wikis with terrible code.
[00:51:48] pajz: Yeah, I chatted with the OTRS admins and in the end we said it was OK as a Beta Feature (but not on by default).
[00:51:52] PROBLEM - puppet last run on lvs4003 is CRITICAL: CRITICAL: Puppet has 1 failures
[00:51:55] It was supposed to be opt in
[00:54:01] RD: Yeah. :-(
[00:54:11] RD, could you check and see if it works for you now? I'm seeing an error message in chrome which is related to a (now-deleted) .js file of mine, so I wonder if that could be related.
[00:54:54] Appears to be working now, yes. (FF)
[00:55:06] Now to make it opt-in.. :D
[00:55:06] pajz: Yeah, it works for me now.
[00:55:17] Sorry. I'm being bias
[00:55:17] Ah. Hmm.
[00:57:36] yeah, I second the request to make it opt-in if that's possible.
[01:03:33] Hmm, I can't get it to work.
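A note on that troubleshooting sequence: VE's "parsoidserver-http-bad-status" error surfaces the HTTP status the Parsoid service returned, so hitting Parsoid directly separates "the service has no entry for this wiki yet" (the 404, cleared by gwicke's config deploy) from a purely client-side problem (the later endless loading, which pajz suspected was down to a now-deleted user .js file). Below is a rough probe along those lines, assuming an old-style Parsoid HTTP API of GET <base>/<prefix>/<title>; the example host, port and otrs_wikiwiki prefix are guesses, not the production values.

    import sys
    import urllib2


    def probe_parsoid(base, prefix, title='Main_Page'):
        # Assumed API shape: GET <base>/<wiki prefix>/<page title>
        url = '%s/%s/%s' % (base.rstrip('/'), prefix, title)
        try:
            code = urllib2.urlopen(url, timeout=10).getcode()
        except urllib2.HTTPError as err:
            # A 404 here typically means Parsoid's config has no prefix for this wiki.
            code = err.code
        print('%s -> HTTP %s' % (url, code))
        return code


    if __name__ == '__main__':
        # e.g. python probe_parsoid.py http://parsoid.example:8000 otrs_wikiwiki
        probe_parsoid(*sys.argv[1:])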
[01:09:56] RECOVERY - puppet last run on lvs4003 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures
[01:33:41] PROBLEM - puppet last run on tungsten is CRITICAL: CRITICAL: Puppet has 1 failures
[01:51:35] RECOVERY - puppet last run on tungsten is OK: OK: Puppet is currently enabled, last run 60 seconds ago with 0 failures
[01:59:38] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:05:28] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 53279 bytes in 0.409 second response time
[02:15:18] PROBLEM - graphite.wikimedia.org on tungsten is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:19:28] PROBLEM - RAID on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:20:21] RECOVERY - RAID on fenari is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0
[02:33:33] !log LocalisationUpdate completed (1.24wmf11) at 2014-07-04 02:32:29+00:00
[02:33:38] Logged the message, Master
[02:34:01] RECOVERY - graphite.wikimedia.org on tungsten is OK: HTTP OK: HTTP/1.1 200 OK - 1607 bytes in 0.003 second response time
[03:03:52] !log LocalisationUpdate completed (1.24wmf12) at 2014-07-04 03:02:49+00:00
[03:03:56] Logged the message, Master
[03:29:35] !log LocalisationUpdate ResourceLoader cache refresh completed at Fri Jul 4 03:28:29 UTC 2014 (duration 28m 28s)
[03:29:41] Logged the message, Master
[04:31:25] (CR) Ori.livneh: [C: -1] "On reflection, I think that this is more risky than it needs to be. Let's defer the actual use of the Puppet-managed configuration files t" (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/143329 (owner: Giuseppe Lavagetto)
[04:50:13] PROBLEM - puppet last run on amssq55 is CRITICAL: CRITICAL: Puppet has 1 failures
[05:07:11] PROBLEM - RAID on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:07:41] PROBLEM - MediaWiki profile collector on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:08:01] RECOVERY - RAID on tungsten is OK: OK: optimal, 1 logical, 2 physical
[05:08:11] RECOVERY - puppet last run on amssq55 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
[05:08:31] RECOVERY - MediaWiki profile collector on tungsten is OK: OK: All defined mwprof jobs are runnning.
[06:27:32] PROBLEM - puppet last run on elastic1012 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:28:22] PROBLEM - puppet last run on db1022 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:28:32] PROBLEM - puppet last run on lvs1005 is CRITICAL: CRITICAL: Puppet has 3 failures
[06:28:32] PROBLEM - puppet last run on mw1217 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:28:42] PROBLEM - puppet last run on mw1189 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:28:42] PROBLEM - puppet last run on search1016 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:28:42] PROBLEM - puppet last run on db1059 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:28:52] PROBLEM - puppet last run on mw1009 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:28:52] PROBLEM - puppet last run on mw1153 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:28:52] PROBLEM - puppet last run on db1002 is CRITICAL: CRITICAL: Puppet has 4 failures
[06:28:52] PROBLEM - puppet last run on cp1061 is CRITICAL: CRITICAL: Puppet has 3 failures
[06:29:02] PROBLEM - puppet last run on mw1046 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:02] PROBLEM - puppet last run on db1040 is CRITICAL: CRITICAL: Puppet has 4 failures
[06:29:02] PROBLEM - puppet last run on mw1060 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:02] PROBLEM - puppet last run on mw1173 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:02] PROBLEM - puppet last run on search1001 is CRITICAL: CRITICAL: Puppet has 3 failures
[06:29:03] PROBLEM - puppet last run on mw1150 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:03] PROBLEM - puppet last run on elastic1008 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:29:04] PROBLEM - puppet last run on mw1088 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:04] PROBLEM - puppet last run on mw1100 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:29:05] PROBLEM - puppet last run on mw1187 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:29:12] PROBLEM - puppet last run on search1010 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:12] PROBLEM - puppet last run on mw1176 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:29:12] PROBLEM - puppet last run on db74 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:29:12] PROBLEM - puppet last run on amssq35 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:22] PROBLEM - puppet last run on iron is CRITICAL: CRITICAL: Puppet has 3 failures
[06:29:22] PROBLEM - puppet last run on mw1008 is CRITICAL: CRITICAL: Puppet has 4 failures
[06:29:22] PROBLEM - puppet last run on holmium is CRITICAL: CRITICAL: Puppet has 3 failures
[06:29:22] PROBLEM - puppet last run on mw1205 is CRITICAL: CRITICAL: Puppet has 3 failures
[06:29:22] PROBLEM - puppet last run on mw1164 is CRITICAL: CRITICAL: Puppet has 3 failures
[06:29:23] PROBLEM - puppet last run on search1018 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:32] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:29:32] PROBLEM - puppet last run on mw1117 is CRITICAL: CRITICAL: Puppet has 4 failures
[06:29:32] PROBLEM - puppet last run on cp3003 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:29:42] PROBLEM - puppet last run on mw1068 is CRITICAL: CRITICAL: Puppet has 3 failures
[06:29:42] PROBLEM - puppet last run on mw1003 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:42] PROBLEM - puppet last run on mw1025 is CRITICAL: CRITICAL: Puppet has 3 failures
[06:29:52] PROBLEM - puppet last run on mw1120 is CRITICAL: CRITICAL: Puppet has 4 failures
[06:29:52] PROBLEM - puppet last run on cp3014 is CRITICAL: CRITICAL: Puppet has 3 failures
[06:30:02] PROBLEM - puppet last run on mw1144 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:30:02] PROBLEM - puppet last run on db1028 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:02] PROBLEM - puppet last run on mw1099 is CRITICAL: CRITICAL: Puppet has 3 failures
[06:30:03] PROBLEM - puppet last run on mw1118 is CRITICAL: CRITICAL: Puppet has 3 failures
[06:30:12] PROBLEM - puppet last run on searchidx1001 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:30:12] PROBLEM - puppet last run on cp4008 is CRITICAL: CRITICAL: Puppet has 3 failures
[06:30:12] PROBLEM - puppet last run on cp3016 is CRITICAL: CRITICAL: Puppet has 3 failures
[06:30:22] PROBLEM - puppet last run on mw1166 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:42] PROBLEM - puppet last run on db1021 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:38:08] i'm not sure why icinga-wm has to speak in near-palindromes
[06:44:09] RECOVERY - puppet last run on mw1187 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures
[06:44:09] RECOVERY - puppet last run on db74 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures
[06:44:29] RECOVERY - puppet last run on mw1117 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures
[06:44:29] RECOVERY - puppet last run on cp3003 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures
[06:44:39] RECOVERY - puppet last run on elastic1012 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures
[06:44:39] RECOVERY - puppet last run on search1016 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures
[06:44:49] RECOVERY - puppet last run on mw1009 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[06:44:59] RECOVERY - puppet last run on mw1046 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[06:45:09] RECOVERY - puppet last run on mw1150 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures
[06:45:09] RECOVERY - puppet last run on elastic1008 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures
[06:45:09] RECOVERY - puppet last run on mw1100 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
[06:45:09] RECOVERY - puppet last run on searchidx1001 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[06:45:29] RECOVERY - puppet last run on mw1008 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures
[06:45:29] RECOVERY - puppet last run on mw1205 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures
[06:45:29] RECOVERY - puppet last run on lvs1005 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures
[06:45:29] RECOVERY - puppet last run on mw1217 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[06:45:39] RECOVERY - puppet last run on mw1068 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures
[06:45:39] RECOVERY - puppet last run on mw1003 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures
[06:45:49] RECOVERY - puppet last run on db1022 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[06:45:49] RECOVERY - puppet last run on db1059 is OK: OK: Puppet is currently enabled, last run 1 seconds ago with 0 failures
[06:45:49] RECOVERY - puppet last run on mw1120 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
[06:45:49] RECOVERY - puppet last run on mw1153 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures
[06:45:59] RECOVERY - puppet last run on db1002 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures
[06:45:59] RECOVERY - puppet last run on cp1061 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures
[06:45:59] RECOVERY - puppet last run on db1040 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
[06:45:59] RECOVERY - puppet last run on iron is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[06:46:00] RECOVERY - puppet last run on mw1060 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures
[06:46:00] RECOVERY - puppet last run on mw1099 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures
[06:46:00] RECOVERY - puppet last run on mw1173 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[06:46:09] RECOVERY - puppet last run on mw1088 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures
[06:46:09] RECOVERY - puppet last run on mw1118 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures
[06:46:09] RECOVERY - puppet last run on mw1176 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures
[06:46:09] RECOVERY - puppet last run on search1010 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures
[06:46:09] RECOVERY - puppet last run on amssq35 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures
[06:46:10] RECOVERY - puppet last run on cp3016 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
[06:46:19] RECOVERY - puppet last run on mw1166 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures
[06:46:29] RECOVERY - puppet last run on holmium is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[06:46:29] RECOVERY - puppet last run on mw1164 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures
[06:46:29] RECOVERY - puppet last run on search1018 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[06:46:29] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures
[06:46:39] RECOVERY - puppet last run on mw1189 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[06:46:39] RECOVERY - puppet last run on db1021 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures
[06:46:49] RECOVERY - puppet last run on mw1025 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures
[06:46:49] RECOVERY - puppet last run on cp3014 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures
[06:46:59] RECOVERY - puppet last run on mw1144 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures
[06:46:59] RECOVERY - puppet last run on db1028 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures
[06:46:59] RECOVERY - puppet last run on search1001 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures
[06:47:09] RECOVERY - puppet last run on cp4008 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures
[06:53:06] (CR) Gilles: [C: 1] Remove remaining surveys for Media Viewer [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/143750 (owner: MarkTraceur)
[07:39:34] (PS1) Yurik: Handle ZERO's new carrier ip subnets [operations/puppet] - https://gerrit.wikimedia.org/r/144131
[07:46:58] PROBLEM - graphite.wikimedia.org on tungsten is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:48:16] <_joe_> sigh
[07:52:48] RECOVERY - graphite.wikimedia.org on tungsten is OK: HTTP OK: HTTP/1.1 200 OK - 1607 bytes in 0.012 second response time
[07:59:39] (CR) Filippo Giunchedi: [C: 1] "one ignorable comment (prefixed with ~~) but looks good!" (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/143597 (owner: Giuseppe Lavagetto)
[08:01:31] PROBLEM - puppet last run on neon is CRITICAL: CRITICAL: Puppet has 1 failures
[08:05:39] good morning
[08:06:33] morning?
[08:06:45] it's night for us! :P
[08:07:14] <_joe_> it's always night for someone
[08:08:05] (PS8) Giuseppe Lavagetto: nutcracker: move config in puppet, work with new packages [operations/puppet] - https://gerrit.wikimedia.org/r/143597
[08:10:38] MaxSem: might technically stillb e night for me
[08:10:53] went back home at something like 3am.
[08:12:09] gj
[08:12:33] selebrated ID?
[08:14:27] <_joe_> hashar: you debauched frenchman! you're supposed to CODE until 3 am, not have fun like normal people!
[08:15:12] ho
[08:15:28] i would code if I was a software developer :D
[08:15:37] HAAAAAA
[08:16:02] CODE PUPPET THEN
[08:16:16] oh puppet is not code
[08:16:30] it just a a harness for masochists newbie sysadmins
[08:17:07] <_joe_> hashar: you never tried coding erlang I guess
[08:17:59] na
[08:18:07] but I could since I mostly copy paste from stackoverflow
[08:20:34] RECOVERY - puppet last run on neon is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[08:21:36] <_joe_> hashar: :P
[08:23:32] _joe_: have you starred at the zuul puppet patches?
[08:24:20] <_joe_> hashar: nope sorry
[10:02:58] (PS9) Giuseppe Lavagetto: nutcracker: move config in puppet, work with new packages [operations/puppet] - https://gerrit.wikimedia.org/r/143597
[10:04:05] (PS10) Giuseppe Lavagetto: nutcracker: move config in puppet, work with new packages [operations/puppet] - https://gerrit.wikimedia.org/r/143597
[10:57:49] (CR) Alexandros Kosiaris: [C: 2] deprecated syntax in mysql/generic_my.cnf.erb [operations/puppet] - https://gerrit.wikimedia.org/r/143529 (owner: Dzahn)
[10:59:18] (CR) Alexandros Kosiaris: [C: 2] apt: minor lint [operations/puppet] - https://gerrit.wikimedia.org/r/143271 (owner: Matanya)
[11:04:41] PROBLEM - puppet last run on tungsten is CRITICAL: CRITICAL: Complete puppet failure
[11:09:41] RECOVERY - puppet last run on tungsten is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures
[11:18:51] PROBLEM - puppet last run on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:19:01] PROBLEM - Graphite Carbon on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:19:11] PROBLEM - RAID on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:19:11] PROBLEM - SSH on tungsten is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:19:42] RECOVERY - puppet last run on tungsten is OK: OK: Puppet is currently enabled, last run 612 seconds ago with 0 failures
[11:19:52] RECOVERY - Graphite Carbon on tungsten is OK: OK: All defined Carbon jobs are runnning.
[11:20:02] RECOVERY - RAID on tungsten is OK: OK: optimal, 1 logical, 2 physical
[11:20:02] RECOVERY - SSH on tungsten is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.4 (protocol 2.0)
[11:40:40] PROBLEM - puppet last run on tungsten is CRITICAL: CRITICAL: Complete puppet failure
[11:43:20] (PS1) Alexandros Kosiaris: Fix eventlogging duplicate definition [operations/puppet] - https://gerrit.wikimedia.org/r/144146
[11:47:35] it seems wikitech is still vulnerable agains mitm as it has an old libssl. was reported here: https://bugzilla.wikimedia.org/show_bug.cgi?id=53259#c15
[11:50:40] RECOVERY - puppet last run on tungsten is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures
[11:50:51] _joe_: could you find someone to look at it?
[11:51:28] He's /away
[11:51:33] I've just poked some Opsen
[11:51:41] thx
[11:58:20] PROBLEM - DPKG on virt1000 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[11:59:28] jzerebecki: ^^ being worked on ;)
[12:00:50] (CR) Alexandros Kosiaris: [C: 2] Fix eventlogging duplicate definition [operations/puppet] - https://gerrit.wikimedia.org/r/144146 (owner: Alexandros Kosiaris)
[12:01:53] PROBLEM - puppetmaster https on virt1000 is CRITICAL: Connection refused
[12:01:53] PROBLEM - HTTP on virt1000 is CRITICAL: Connection refused
[12:02:13] PROBLEM - Memcached on virt1000 is CRITICAL: Connection refused
[12:03:13] RECOVERY - Memcached on virt1000 is OK: TCP OK - 0.001 second response time on port 11000
[12:03:45] (CR) Alexandros Kosiaris: "This broke puppet on tungsten due to a duplicate definition. The same name was re-used a couple of lines above. Fixed in Iff01dce7dc032f21" [operations/puppet] - https://gerrit.wikimedia.org/r/143865 (https://bugzilla.wikimedia.org/67073) (owner: Nuria)
[12:03:53] RECOVERY - puppetmaster https on virt1000 is OK: HTTP OK: Status line output matched 400 - 335 bytes in 0.157 second response time
[12:03:53] RECOVERY - HTTP on virt1000 is OK: HTTP OK: HTTP/1.1 302 Found - 457 bytes in 0.003 second response time
[12:15:53] PROBLEM - puppet last run on virt1000 is CRITICAL: CRITICAL: Puppet has 1 failures
[12:19:21] (CR) Alexandros Kosiaris: [C: -2] "NAK, See https://gerrit.wikimedia.org/r/#/c/143306, we should be merging this (and its dependencies) next week." [operations/puppet] - https://gerrit.wikimedia.org/r/144092 (owner: Dzahn)
[12:26:15] keystone upgrade sux
[12:26:26] RECOVERY - DPKG on virt1000 is OK: All packages OK
[12:26:31] and makes dpkg complain
[12:28:44] !log executed dist-upgrade on virt1000. Keystone configure phase failed in keystone-manage db-sync and hence dpkg configure failed. It was trying to create an already existing index in the database. Dropped the index, ran dpkg --configure -a to recreate the index (and whatever else keystone-manage db_sync does). All is back to normal.
[12:28:49] Logged the message, Master
[12:28:57] openstack sux
[12:29:08] that's just stupid
[12:29:46] RECOVERY - puppet last run on virt1000 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures
[12:29:51] thanks akosiaris
[12:48:58] <_joe_> akosiaris: talk about idempotent packages....
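Some context for that !log entry: keystone-manage db_sync died because a migration step tried to create an index that was already there, so the package's configure phase fell over until the index was dropped by hand, which is what _joe_'s "idempotent packages" jab is about. Below is a toy sketch of the guard such a step needs, using sqlite3 purely so it runs standalone; the table and index names are illustrative, not keystone's actual schema or migration code.

    import sqlite3


    def ensure_index(conn, index_name, table, column):
        # Create the index only if it does not already exist, so re-running the
        # migration after a half-finished upgrade is a no-op instead of an error.
        exists = conn.execute(
            "SELECT 1 FROM sqlite_master WHERE type = 'index' AND name = ?",
            (index_name,)).fetchone()
        if exists is None:
            conn.execute('CREATE INDEX %s ON %s (%s)' % (index_name, table, column))
            conn.commit()


    if __name__ == '__main__':
        conn = sqlite3.connect(':memory:')
        conn.execute('CREATE TABLE token (id INTEGER PRIMARY KEY, expires TEXT)')
        ensure_index(conn, 'ix_token_expires', 'token', 'expires')
        # Second call is harmless; a bare CREATE INDEX would raise here.
        ensure_index(conn, 'ix_token_expires', 'token', 'expires')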
[12:53:40] _joe_: btw, only late at night I realized that a big(?) part of your point was the use of exec rather than file, makes sense as well :)
[12:53:46] I'm investigating is_puppet_master now
[12:54:03] hmm, actually I should just use the sudo solution
[12:54:16] depending on is_puppet_master conflates the puppetmaster and diamond roles, sounds... wrong
[12:54:21] * YuviPanda goes back to writing code
[12:54:48] <_joe_> eheh
[12:56:03] at least cat is not a bash builtin :)
[13:35:23] (PS10) Yuvipanda: diamond: Let diamond read the puppet state file [operations/puppet] - https://gerrit.wikimedia.org/r/143861
[13:37:42] (PS11) Yuvipanda: diamond: Let diamond read the puppet state file [operations/puppet] - https://gerrit.wikimedia.org/r/143861
[13:38:34] hmm, every puppet run seems to be failing for me with 'Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Could not find class passwords::mysql::phabricator for graphite-test.eqiad.wmflabs on node graphite-test.eqiad.wmflabs'
[13:41:29] hmm, had to comment those out
[13:42:21] YuviPanda: That's usually defined in puppet private... not sure how you handle that for labs
[13:42:40] hoo: yeah, but I shouldn't need that at all, I guess, cine I'm not doing anything with phab
[13:42:48] the phab module should be fixed
[13:43:50] YuviPanda: Yeah... you could check whether the class is defined in there
[13:43:52] YuviPanda: it is probably missing from labs/private (gerrit repo, public)
[13:44:05] or that
[13:44:17] godog: aaah, right. I suppose labs puppetmaster add that as well, but this one just forgot to add it
[13:44:23] I can't access the private repo, can someone copy it over?
[13:44:52] YuviPanda: labs/private is public despite the name :) just contains dummy passwords
[13:45:38] hmm, right. but I guess it'll be easier to just copy them from the private repo while dummying out the passwords, but I guess I can also look for the individual variables being used and dummy them out myself
[13:45:49] but for now I just commented those out and my (unrelated) work continues...
[13:47:56] <_joe_> YuviPanda: labs/private has that class
[13:48:02] oh
[14:14:16] PROBLEM - graphite.wikimedia.org on tungsten is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:16:40] (PS12) Yuvipanda: diamond: Let diamond read the puppet state file [operations/puppet] - https://gerrit.wikimedia.org/r/143861
[14:17:06] RECOVERY - graphite.wikimedia.org on tungsten is OK: HTTP OK: HTTP/1.1 200 OK - 1607 bytes in 0.005 second response time
[14:17:16] _joe_: chasemp ^ python + sudo solution
[14:28:54] (CR) Yuvipanda: "Rewritten to fix issues pointed out by the comments." [operations/puppet] - https://gerrit.wikimedia.org/r/143861 (owner: Yuvipanda)
[14:32:53] <_joe_> YuviPanda: :)
[14:33:09] _joe_: it's somewhat ugly, but I guess beats messing around the puppet package
[14:33:29] _joe_: do remove the -2 (and maybe +2? :) ) if you're ok with the solution
[14:33:47] <_joe_> YuviPanda: yes, whenever I can
[14:34:01] _joe_: ty! I'll be afk for a while now, though.
[14:34:03] <_joe_> I'm kinda busy with something that incredibly decided to break on friday
[14:34:22] _joe_: ah, that never happens! :)
[14:34:28] _joe_: good luck, and thanks for all the comments/help!
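For anyone skimming the backlog: the "python + sudo solution" under review in Gerrit 143861 boils down to shipping a narrow sudoers rule that lets the diamond user run /bin/cat on puppet's state file, and having the Diamond collector shell out through it, rather than granting diamond direct read access or patching the puppet package (hence the "cat is not a bash builtin" remark). A minimal sketch of that shape, not the actual patch; the collector name, the state-file path and the sudoers text below are assumptions for illustration.

    import subprocess
    import yaml
    import diamond.collector


    class PuppetAgentCollector(diamond.collector.Collector):

        # Assumed companion sudoers rule (e.g. /etc/sudoers.d/50_diamond_sudo_for_puppet):
        #   diamond ALL = (root) NOPASSWD: /bin/cat /var/lib/puppet/state/last_run_summary.yaml
        # sudoers is unforgiving: one malformed line is enough to produce the
        # "parse error ... near line 6" cronspam that turns up later in the afternoon.
        STATE_FILE = '/var/lib/puppet/state/last_run_summary.yaml'

        def collect(self):
            try:
                output = subprocess.check_output(['sudo', '/bin/cat', self.STATE_FILE])
            except (subprocess.CalledProcessError, OSError):
                return
            summary = yaml.safe_load(output) or {}
            # Publish the numeric leaves, e.g. events.failure, resources.changed, time.total
            for section, values in summary.items():
                if not isinstance(values, dict):
                    continue
                for key, value in values.items():
                    if isinstance(value, (int, float)):
                        self.publish('%s.%s' % (section, key), value)

The trade-off Coren flags in his +1 is exactly this: the monitoring process now depends on sudo being configured correctly on every host, which is what ends up biting the labs instances a couple of hours later.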
[14:37:14] (PS13) Yuvipanda: diamond: Let diamond read the puppet state file [operations/puppet] - https://gerrit.wikimedia.org/r/143861
[14:39:36] <_joe_> YuviPanda|zzz: cronspam from graphite-test
[14:44:19] (CR) coren: [C: 1] "I'm really not in love with the idea of invoking sudo with configurable parameters from deep within a monitoring process; but this is a de" [operations/puppet] - https://gerrit.wikimedia.org/r/143861 (owner: Yuvipanda)
[14:53:19] PROBLEM - RAID on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:54:09] RECOVERY - RAID on tungsten is OK: OK: optimal, 1 logical, 2 physical
[14:57:49] PROBLEM - Puppet freshness on db1009 is CRITICAL: Last successful Puppet run was Fri 04 Jul 2014 12:56:58 UTC
[15:01:47] _joe_: oh? to you?
[15:02:17] I don't know how that'll happen :|
[15:04:01] <_joe_> diamond : parse error in /etc/sudoers.d/50_diamond_sudo_for_puppet near line 6 ; TTY=pts/2 ; PWD=/ ;
[15:04:06] <_joe_> YuviPanda: ^^
[15:04:09] ah
[15:04:13] yeah, I fixed that in a later patch
[15:04:17] let me update the puppetmaster
[15:05:27] _joe_: should be fixed now
[15:05:41] <_joe_> ok
[15:05:42] <_joe_> :)
[15:05:53] well, should be fixed once this puppet run completes
[15:06:23] yeah, seems to be fixed now
[15:10:58] (CR) Giuseppe Lavagetto: [C: 2] diamond: Let diamond read the puppet state file [operations/puppet] - https://gerrit.wikimedia.org/r/143861 (owner: Yuvipanda)
[15:12:16] (CR) Giuseppe Lavagetto: [C: 1] "de-moting to +1 as per small comment." (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/143861 (owner: Yuvipanda)
[15:12:38] <_joe_> YuviPanda: sorry to be your nightmare
[15:12:41] <_joe_> :)
[15:14:17] _joe_: updated
[15:14:24] (PS14) Yuvipanda: diamond: Let diamond read the puppet state file [operations/puppet] - https://gerrit.wikimedia.org/r/143861
[15:14:42] _joe_: and if you think this was a nightmare, you should work with more designers :)
[15:15:13] <_joe_> YuviPanda: I'm color and style blind, which guarantees I'm always left very far from any frontend work
[15:15:50] <_joe_> thank god :)
[15:16:00] _joe_: aaah! :) I've adopted a philosophy of never questioning designers' colors, but even then... :)
[15:19:48] _joe_: want to upgrade to a +2, since the comment was fixed? :)
[15:19:56] <_joe_> oh yes sorry
[15:20:08] <_joe_> I was deep down in apache configs right now :)
[15:20:30] _joe_: ah, ok :) Sorry to pester, just poked since you first +2'd and then +1'd :)
[15:20:46] <_joe_> don't worry
[15:20:54] :)
[15:21:26] (CR) Giuseppe Lavagetto: [C: 2] diamond: Let diamond read the puppet state file [operations/puppet] - https://gerrit.wikimedia.org/r/143861 (owner: Yuvipanda)
[15:21:30] _joe_: w00t
[15:21:35] _joe_: thanks a lot!
[15:21:51] <_joe_> YuviPanda: puppet-merged
[15:22:21] (PS1) Milimetric: Add setting for max instances per recurrent run [operations/puppet/wikimetrics] - https://gerrit.wikimedia.org/r/144154
[15:22:25] Anyone firm with salt around?
[15:22:41] <_joe_> I often spill it
[15:22:49] <_joe_> hoo: what do you need?
[15:23:02] _joe_: look at fenari pid 21625
[15:23:19] <_joe_> ok 1 sec
[15:23:27] heavy in ram and on cpu... running for a couple of days now
[15:24:32] <_joe_> hoo: is it salt that copies python files in tmp?
[15:24:39] <_joe_> ...
[15:24:47] _joe_: According to ppid, yes
[15:24:54] I just walked the process tree up :P
[15:24:55] <_joe_> hoo: I've seen that
[15:25:09] <_joe_> my question was out of shock actually
[15:25:10] <_joe_> :)
[15:26:19] <_joe_> hoo: seems like it's reading every file in /home/.snapshot/hourly.0/wikipedia/htdocs/foundation/w2/
[15:26:33] <_joe_> so probably in /home/.snapshot
[15:26:56] ah, that's why sda is so busy :P
[15:27:04] <_joe_> so, it's not stuck, it's just doing something apparently stupid
[15:27:23] <_joe_> we can just kill it, but I'm not sure that's a good idea honestly
[15:28:19] mh, yes... I'm not into that enoguh to know :/
[15:28:20] brb
[15:28:36] <_joe_> me neither
[15:37:06] RECOVERY - Puppet freshness on db1009 is OK: puppet ran at Fri Jul 4 15:37:03 UTC 2014
[15:40:33] <_joe_> !log restarting salt-minion, killing io hungry job on fenari running since jun 30, 00 AM
[15:40:38] Logged the message, Master
[15:42:08] PROBLEM - puppet last run on neon is CRITICAL: CRITICAL: Puppet has 1 failures
[16:00:04] RECOVERY - puppet last run on neon is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures
[16:01:10] see you on monday *wave*
[16:03:17] hello anyone, how can i acknowledge an icinga alarm?
[16:04:59] ping _joe_
[16:05:13] Unless you're ops, in most cases you can't
[16:05:38] if the alarm comes to my group .. i should be able right?
[16:06:03] Not necesserily
[16:06:08] Reedy: as in "nagios/icinga group"
[16:06:50] ahh I see,Reedy, can you tell me how you do it in the icinga UI and i will try?
[16:07:28] <_joe_> nuria: hey
[16:07:49] hello, _joe_
[16:08:21] <_joe_> nuria: what alarm do you need to acknowledge?
[16:08:23] i was trying to "acknowledge" teh troughput alarm (must be missconfigured, it is alrming for "over" but should alarm for "under")
[16:08:47] "throughput of event logging"
[16:09:02] i cannot do it on the uI ( i must not have permits)
[16:09:35] <_joe_> done
[16:10:38] ok, thanks _joe_
[16:10:57] will look at alarm and script to see if teh "under" parameter is not being passed correctly
[16:22:01] <_joe_> nuria: it used to...
[16:22:12] <_joe_> YuviPanda|zzz: more cronspam
[16:22:15] <_joe_> from labs
[16:22:22] <_joe_> well I'm off now
[16:22:34] <_joe_> someone will take care of it hopefully
[16:25:25] YuviPanda|zzz: Dude, diamond is spamming root again.
[16:27:14] (PS1) coren: Revert "diamond: Let diamond read the puppet state file" [operations/puppet] - https://gerrit.wikimedia.org/r/144161
[16:28:20] (CR) coren: [C: 2] "Reverting now as this will likely cause gmail to throttle opsen email again." [operations/puppet] - https://gerrit.wikimedia.org/r/144161 (owner: coren)
[16:29:16] Come /on/ Jenkins.
[16:30:25] (CR) coren: [V: 2] "+2V, this needs reversion now." [operations/puppet] - https://gerrit.wikimedia.org/r/144161 (owner: coren)
[16:32:56] <_joe_> Coren: and you'll need to run puppet on those machines
[16:33:07] <_joe_> Coren: the change seemed correct btw
[16:33:10] _joe_: I was about to do a salt run now.
[16:33:16] <_joe_> Coren: ok
[16:33:25] _joe_: Agreed, but some hosts seemingly do not apply the sudo rule right.
[16:33:30] <_joe_> so there must be something wrong on those hosts
[16:33:32] <_joe_> yes
[16:34:02] <_joe_> Coren: a badly configure puppetmaster maybe?
[16:35:23] _joe_: Proably. I'm still trying to find a matching grain or minion glob to match. Don't want to do puppet agent on '*'
[16:35:39] <_joe_> on labs?
[16:35:47] <_joe_> that would kill virt1000
[16:37:32] I know, that's why I'm trying to find a suitable grain.
[16:37:48] Doing it on -G puppet::self
[16:38:08] <_joe_> Coren: or, you could just kill diamond everywhere
[16:38:17] <_joe_> that will get restarted on the next puppet run
[16:38:21] <_joe_> with the correct code
[16:38:31] ... not insane.
[16:38:34] <_joe_> as long as the deployment puppetmaster is updated
[16:38:55] <_joe_> Coren: they're all deployment-*
[16:39:04] <_joe_> so you must find that puppetmaster
[16:40:19] And figure out how /it/ gets upstream.
[16:40:22] &^@%#
[16:40:47] It's deployment-salt
[16:41:47] Ah, and they have unmerged changes.
[16:42:51] I expect that's why things broke in the first place.
[16:44:08] * Coren manually fixes the merge conflict
[16:45:42] <_joe_> yes
[16:45:56] <_joe_> people should not automerge
[16:47:20] PROBLEM - puppet last run on ms-be1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:47:20] PROBLEM - DPKG on ms-be1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:47:20] PROBLEM - swift-account-reaper on ms-be1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:47:21] PROBLEM - RAID on ms-be1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:47:30] PROBLEM - swift-account-replicator on ms-be1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:47:30] PROBLEM - swift-object-replicator on ms-be1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:47:30] PROBLEM - check if dhclient is running on ms-be1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:47:30] PROBLEM - SSH on ms-be1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:47:30] PROBLEM - swift-container-server on ms-be1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:47:31] PROBLEM - swift-container-auditor on ms-be1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:47:31] PROBLEM - swift-object-auditor on ms-be1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:47:32] PROBLEM - swift-account-server on ms-be1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:47:32] PROBLEM - check configured eth on ms-be1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:47:40] PROBLEM - swift-container-replicator on ms-be1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
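(Picking the thread back up from just before the ms-be1005 alerts.) _joe_'s suggestion is the low-impact route: stop diamond on the affected instances now and let each host's next scheduled puppet run restart it once the reverted code is in place, rather than forcing an immediate agent run everywhere and stampeding virt1000. A sketch of that via salt's Python client; the grain expression is an assumption standing in for whatever '-G' target actually matched the instances.

    import salt.client


    def stop_diamond(grain='roles:diamond'):
        # expr_form='grain' is the API spelling of 'salt -G <key:value> ...'
        client = salt.client.LocalClient()
        results = client.cmd(grain, 'service.stop', ['diamond'], expr_form='grain')
        for minion, stopped in sorted(results.items()):
            print('%s: %s' % (minion, 'stopped' if stopped else 'not running or failed'))


    if __name__ == '__main__':
        stop_diamond()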
[16:48:10] RECOVERY - puppet last run on ms-be1005 is OK: OK: Puppet is currently enabled, last run 587 seconds ago with 0 failures
[16:48:10] RECOVERY - DPKG on ms-be1005 is OK: All packages OK
[16:48:11] RECOVERY - swift-account-reaper on ms-be1005 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper
[16:48:20] RECOVERY - swift-object-replicator on ms-be1005 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator
[16:48:20] RECOVERY - check if dhclient is running on ms-be1005 is OK: PROCS OK: 0 processes with command name dhclient
[16:48:20] RECOVERY - swift-account-replicator on ms-be1005 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator
[16:48:20] RECOVERY - SSH on ms-be1005 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.4 (protocol 2.0)
[16:48:20] RECOVERY - swift-container-server on ms-be1005 is OK: PROCS OK: 13 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server
[16:48:21] RECOVERY - swift-container-auditor on ms-be1005 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[16:48:21] RECOVERY - swift-account-server on ms-be1005 is OK: PROCS OK: 13 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server
[16:48:22] RECOVERY - swift-object-auditor on ms-be1005 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor
[16:48:22] RECOVERY - check configured eth on ms-be1005 is OK: NRPE: Unable to read output
[16:48:30] RECOVERY - swift-container-replicator on ms-be1005 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator
[16:48:30] PROBLEM - Disk space on ms-be1005 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdg1 is not accessible: Input/output error
[16:57:20] PROBLEM - RAID on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:57:20] PROBLEM - SSH on lvs1002 is CRITICAL: Server answer:
[16:57:21] PROBLEM - SSH on tungsten is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:57:21] PROBLEM - uWSGI web apps on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:57:30] PROBLEM - MediaWiki profile collector on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:57:30] PROBLEM - check configured eth on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:57:40] PROBLEM - check if dhclient is running on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:57:40] PROBLEM - puppet last run on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:57:40] PROBLEM - Graphite Carbon on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:58:20] RECOVERY - SSH on lvs1002 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.4 (protocol 2.0)
[16:59:09] RECOVERY - RAID on tungsten is OK: OK: optimal, 1 logical, 2 physical
[16:59:10] RECOVERY - SSH on tungsten is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.4 (protocol 2.0)
[16:59:19] RECOVERY - uWSGI web apps on tungsten is OK: OK: All defined uWSGI apps are runnning.
[16:59:19] RECOVERY - check configured eth on tungsten is OK: NRPE: Unable to read output
[16:59:20] RECOVERY - MediaWiki profile collector on tungsten is OK: OK: All defined mwprof jobs are runnning.
[16:59:29] RECOVERY - check if dhclient is running on tungsten is OK: PROCS OK: 0 processes with command name dhclient
[16:59:29] RECOVERY - puppet last run on tungsten is OK: OK: Puppet is currently enabled, last run 572 seconds ago with 0 failures
[16:59:29] RECOVERY - Graphite Carbon on tungsten is OK: OK: All defined Carbon jobs are runnning.
[17:01:09] PROBLEM - puppet last run on ms-be1005 is CRITICAL: CRITICAL: Puppet has 1 failures
[17:38:06] PROBLEM - Puppet freshness on db1009 is CRITICAL: Last successful Puppet run was Fri 04 Jul 2014 15:37:03 UTC
[17:57:18] (PS2) Milimetric: Add setting for max instances per recurrent run [operations/puppet/wikimetrics] - https://gerrit.wikimedia.org/r/144154
[18:17:29] RECOVERY - Puppet freshness on db1009 is OK: puppet ran at Fri Jul 4 18:17:24 UTC 2014
[19:20:12] sorry, I think I just accidentally created a new ticket in RT.
[19:20:28] it (and the original ticket) and be closed...
[19:28:57] (PS1) Nuria: Eventlogging monitoring, pass true w/o quotes [operations/puppet] - https://gerrit.wikimedia.org/r/144178
[20:05:45] !log Ran sync-common on fenari to update the docs on noc.wikimedia.org
[20:05:50] Logged the message, Master
[20:06:13] Nemo_bis: ^ that was what you asked for earlier (regarding dblists)
[20:08:46] thanks
[20:18:29] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0]
[20:35:47] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0]
[22:08:39] (PS1) Ottomata: Production now uses CDH (CDH5) module, also refactor hadoop.pp role [operations/puppet] - https://gerrit.wikimedia.org/r/144242