[00:05:00] RECOVERY - Solr on vanadium is OK: All OK
[00:05:21] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours
[00:06:50] PROBLEM - Disk space on ms-fe2 is CRITICAL: NRPE: Command check_disk_space not defined
[00:06:53] PROBLEM - Disk space on ms-be7 is CRITICAL: NRPE: Command check_disk_space not defined
[00:07:00] PROBLEM - Disk space on ms-be3 is CRITICAL: NRPE: Command check_disk_space not defined
[00:07:00] PROBLEM - Disk space on ms-fe1 is CRITICAL: NRPE: Command check_disk_space not defined
[00:07:00] PROBLEM - Disk space on ms-be6 is CRITICAL: NRPE: Command check_disk_space not defined
[00:07:10] PROBLEM - Disk space on ms-be1004 is CRITICAL: NRPE: Command check_disk_space not defined
[00:07:21] PROBLEM - Disk space on ms-be1012 is CRITICAL: NRPE: Command check_disk_space not defined
[00:07:30] PROBLEM - Disk space on ms-be1007 is CRITICAL: NRPE: Command check_disk_space not defined
[00:07:30] PROBLEM - Disk space on ms-fe4 is CRITICAL: NRPE: Command check_disk_space not defined
[00:07:40] PROBLEM - Disk space on ms-be1005 is CRITICAL: NRPE: Command check_disk_space not defined
[00:07:40] PROBLEM - Disk space on ms-be1003 is CRITICAL: NRPE: Command check_disk_space not defined
[00:07:50] RECOVERY - Puppet freshness on ms1 is OK: puppet ran at Tue Mar 26 00:07:46 UTC 2013
[00:08:26] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours
[00:10:10] RECOVERY - Puppet freshness on mw1104 is OK: puppet ran at Tue Mar 26 00:10:06 UTC 2013
[00:10:20] RECOVERY - Puppet freshness on gallium is OK: puppet ran at Tue Mar 26 00:10:10 UTC 2013
[00:10:27] New patchset: Yurik; "Update X-CS handling to new k/v pair spec" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/52606
[00:10:54] RECOVERY - Puppet freshness on ms1 is OK: puppet ran at Tue Mar 26 00:10:42 UTC 2013
[00:11:24] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours
[00:11:53] RECOVERY - Puppet freshness on mc1013 is OK: puppet ran at Tue Mar 26 00:11:42 UTC 2013
[00:12:10] RECOVERY - Puppet freshness on db71 is OK: puppet ran at Tue Mar 26 00:12:03 UTC 2013
[00:13:00] RECOVERY - Puppet freshness on db35 is OK: puppet ran at Tue Mar 26 00:12:57 UTC 2013
[00:13:00] RECOVERY - Puppet freshness on db43 is OK: puppet ran at Tue Mar 26 00:12:57 UTC 2013
[00:13:21] RECOVERY - Puppet freshness on search24 is OK: puppet ran at Tue Mar 26 00:13:12 UTC 2013
[00:13:32] RECOVERY - Puppet freshness on search31 is OK: puppet ran at Tue Mar 26 00:13:21 UTC 2013
[00:13:33] RECOVERY - Puppet freshness on db56 is OK: puppet ran at Tue Mar 26 00:13:28 UTC 2013
[00:13:40] RECOVERY - Puppet freshness on db51 is OK: puppet ran at Tue Mar 26 00:13:34 UTC 2013
[00:13:40] RECOVERY - Puppet freshness on db63 is OK: puppet ran at Tue Mar 26 00:13:36 UTC 2013
[00:13:51] RECOVERY - Puppet freshness on db58 is OK: puppet ran at Tue Mar 26 00:13:40 UTC 2013
[00:13:51] RECOVERY - Puppet freshness on ms1 is OK: puppet ran at Tue Mar 26 00:13:44 UTC 2013
[00:13:51] RECOVERY - Puppet freshness on db64 is OK: puppet ran at Tue Mar 26 00:13:50 UTC 2013
[00:13:51] RECOVERY - Puppet freshness on db32 is OK: puppet ran at Tue Mar 26 00:13:50 UTC 2013
[00:13:52] RECOVERY - Puppet freshness on mw1012 is OK: puppet ran at Tue Mar 26 00:13:50 UTC 2013
[00:14:24] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours
[00:14:42] PROBLEM - Puppet freshness on mw1126 is CRITICAL: Puppet has not run in the last 10 hours
[00:14:43] RECOVERY - Puppet freshness on db60 is OK: puppet ran at Tue Mar 26 00:14:34 UTC 2013
[00:14:43] * Bsadowski1 trips over a wire in the datacenter
[00:14:45] j/k
[00:15:10] :P
[00:15:40] PROBLEM - Puppet freshness on mw1099 is CRITICAL: Puppet has not run in the last 10 hours
[00:15:50] RECOVERY - Puppet freshness on ms1 is OK: puppet ran at Tue Mar 26 00:15:40 UTC 2013
[00:15:51] RECOVERY - Puppet freshness on mw1 is OK: puppet ran at Tue Mar 26 00:15:47 UTC 2013
[00:16:14] RECOVERY - Puppet freshness on mw10 is OK: puppet ran at Tue Mar 26 00:16:00 UTC 2013
[00:16:14] RECOVERY - Puppet freshness on mw1019 is OK: puppet ran at Tue Mar 26 00:16:01 UTC 2013
[00:16:14] RECOVERY - Puppet freshness on mw1017 is OK: puppet ran at Tue Mar 26 00:16:01 UTC 2013
[00:16:14] RECOVERY - Puppet freshness on mw1018 is OK: puppet ran at Tue Mar 26 00:16:02 UTC 2013
[00:16:14] RECOVERY - Puppet freshness on virt2 is OK: puppet ran at Tue Mar 26 00:16:03 UTC 2013
[00:16:14] RECOVERY - Puppet freshness on virt4 is OK: puppet ran at Tue Mar 26 00:16:08 UTC 2013
[00:16:14] RECOVERY - Puppet freshness on searchidx2 is OK: puppet ran at Tue Mar 26 00:16:08 UTC 2013
[00:16:15] RECOVERY - Puppet freshness on searchidx1001 is OK: puppet ran at Tue Mar 26 00:16:08 UTC 2013
[00:16:22] RECOVERY - Puppet freshness on virt3 is OK: puppet ran at Tue Mar 26 00:16:10 UTC 2013
[00:16:22] RECOVERY - Puppet freshness on mw8 is OK: puppet ran at Tue Mar 26 00:16:15 UTC 2013
[00:16:22] RECOVERY - Puppet freshness on mw9 is OK: puppet ran at Tue Mar 26 00:16:15 UTC 2013
[00:16:23] RECOVERY - Puppet freshness on mw7 is OK: puppet ran at Tue Mar 26 00:16:15 UTC 2013
[00:16:23] RECOVERY - Puppet freshness on mw3 is OK: puppet ran at Tue Mar 26 00:16:15 UTC 2013
[00:16:23] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours
[00:16:23] RECOVERY - Puppet freshness on mw4 is OK: puppet ran at Tue Mar 26 00:16:18 UTC 2013
[00:16:23] RECOVERY - Puppet freshness on mw6 is OK: puppet ran at Tue Mar 26 00:16:18 UTC 2013
[00:16:24] RECOVERY - Puppet freshness on mw2 is OK: puppet ran at Tue Mar 26 00:16:18 UTC 2013
[00:16:31] RECOVERY - Puppet freshness on mc13 is OK: puppet ran at Tue Mar 26 00:16:20 UTC 2013
[00:16:31] RECOVERY - Puppet freshness on mc1010 is OK: puppet ran at Tue Mar 26 00:16:21 UTC 2013
[00:16:31] RECOVERY - Puppet freshness on mc9 is OK: puppet ran at Tue Mar 26 00:16:21 UTC 2013
[00:16:32] RECOVERY - Puppet freshness on mc1001 is OK: puppet ran at Tue Mar 26 00:16:22 UTC 2013
[00:16:32] RECOVERY - Puppet freshness on mc1005 is OK: puppet ran at Tue Mar 26 00:16:22 UTC 2013
[00:16:32] RECOVERY - Puppet freshness on mc1006 is OK: puppet ran at Tue Mar 26 00:16:22 UTC 2013
[00:16:32] RECOVERY - Puppet freshness on mc1008 is OK: puppet ran at Tue Mar 26 00:16:22 UTC 2013
[00:16:33] RECOVERY - Puppet freshness on mw1116 is OK: puppet ran at Tue Mar 26 00:16:22 UTC 2013
[00:16:33] RECOVERY - Puppet freshness on mc5 is OK: puppet ran at Tue Mar 26 00:16:22 UTC 2013
[00:16:33] RECOVERY - Puppet freshness on mw5 is OK: puppet ran at Tue Mar 26 00:16:23 UTC 2013
[00:16:51] RECOVERY - Puppet freshness on ms1 is OK: puppet ran at Tue Mar 26 00:16:41 UTC 2013
[00:17:12] RECOVERY - Puppet freshness on mw1037 is OK: puppet ran at Tue Mar 26 00:17:02 UTC 2013
[00:17:13] RECOVERY - Puppet freshness on mw69 is OK: puppet ran at Tue Mar 26 00:17:07 UTC 2013
[00:17:13] RECOVERY - Puppet freshness on mw48 is OK: puppet ran at Tue Mar 26 00:17:07 UTC 2013
[00:17:21] RECOVERY - Puppet freshness on mw44 is OK: puppet ran at Tue Mar 26 00:17:11 UTC 2013
[00:17:22] RECOVERY - Puppet freshness on mw62 is OK: puppet ran at Tue Mar 26 00:17:13 UTC 2013
[00:17:22] RECOVERY - Puppet freshness on mw1102 is OK: puppet ran at Tue Mar 26 00:17:13 UTC 2013
[00:17:22] RECOVERY - Puppet freshness on mw1103 is OK: puppet ran at Tue Mar 26 00:17:14 UTC 2013
[00:17:22] RECOVERY - Puppet freshness on mw1126 is OK: puppet ran at Tue Mar 26 00:17:14 UTC 2013
[00:18:31] RECOVERY - Puppet freshness on ms1 is OK: puppet ran at Tue Mar 26 00:18:28 UTC 2013
[00:19:20] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours
[00:20:41] PROBLEM - Puppet freshness on grosley is CRITICAL: Puppet has not run in the last 10 hours
[00:20:41] PROBLEM - Puppet freshness on locke is CRITICAL: Puppet has not run in the last 10 hours
[00:20:41] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours
[00:20:41] PROBLEM - Puppet freshness on tola is CRITICAL: Puppet has not run in the last 10 hours
[00:21:20] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:22:13] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.131 second response time
[00:22:40] PROBLEM - Puppet freshness on hooft is CRITICAL: Puppet has not run in the last 10 hours
[00:22:40] PROBLEM - Puppet freshness on ms6 is CRITICAL: Puppet has not run in the last 10 hours
[00:22:40] PROBLEM - Puppet freshness on nfs1 is CRITICAL: Puppet has not run in the last 10 hours
[00:25:14] New patchset: Asher; "adding db1051" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/55831
[00:32:01] New patchset: Asher; "adding db1051" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/55831
[00:32:41] PROBLEM - Puppet freshness on strontium is CRITICAL: Puppet has not run in the last 10 hours
[00:32:53] Change merged: Asher; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/55831
[00:37:10] New review: Asher; "(1 comment)" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/55302
[00:39:10] !log asher synchronized wmf-config/db-eqiad.php 'adding db1051 at a low warmup weight'
[00:39:16] Logged the message, Master
[00:55:12] New patchset: Asher; "found another mariadb performance regression with extended_keys" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55834
[00:59:06] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55834
[01:06:35] PROBLEM - Disk space on tola is CRITICAL: NRPE: Command check_disk_space not defined
[01:09:54] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours
[01:10:09] New review: Faidon; "A combination of multiple unrelated things without being obvious why each of these classes are neede..." [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/53587
[01:12:05] RECOVERY - Puppet freshness on ms1 is OK: puppet ran at Tue Mar 26 01:11:55 UTC 2013
[01:12:34] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours
[01:14:11] New review: coren; "They're not unrelated, they are all roles within the tool labs; it's not clear to me how else they a..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53587
[01:16:27] paravoid: Perhaps you and Ryan need to... I dunno... confer about what is or is not the proper scope of a changeset and the right way to deploy all of this? :-)
[01:16:42] you want ryan to take a look at this?
[01:16:50] I'll let you try
[01:16:56] he'll just -2 it just for perl :-)
[01:17:44] paravoid: can you take a look at https://gerrit.wikimedia.org/r/#/c/55302/ ?
[01:17:45] PROBLEM - Varnish traffic logger on cp1033 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[01:18:03] New patchset: Dzahn; "fix sorting in wikimedias_html, add wikivoyage, and other minor fixes" [operations/debs/wikistats] (master) - https://gerrit.wikimedia.org/r/55837
[01:18:04] Heh. That's not it. He just specifically told me (to "one big changeset with all the config") something along the lines of "Oh god no, do it bite-size; classes first, then other changesets for bits of the configuration" :-)
[01:18:32] binasher: I can, but you seem to have reviewed this already?
[01:19:13] paravoid: yurik and the mobile team really don't like my review :)
[01:19:26] why?
[01:19:29] it sounds very reasonable to me
[01:19:34] heya guys
[01:19:47] also, another mobile header?
[01:19:55] wtf?
[01:19:58] andrew is in seder and we've got full root partitions on three boxes.
[01:20:10] why can't they do this default language thing on the backend?
[01:20:11] could i get some quick help to resolve it?
[01:20:20] paravoid: But what I don't get is what you're confused with; the module describes the tool labs; every class is one of the roles in it.
[01:20:41] oh, they just use it for the special page
[01:20:42] doh
[01:20:45] paravoid: i was looking at some of the varnish acl code and my read is that varnish is just sequentially matching ip's against each address or range in an acl definition instead of doing it more intelligently
[01:21:07] yes, in a series of nested ifs
[01:21:21] that's what I remember from last time i was looking at this
[01:21:27] anybody?
[01:21:36] this is a bit important.
[01:21:46] LeslieCarr: ^^^
[01:21:54] LeslieCarr: fun of rt duty I guess :-)
[01:22:06] paravoid: the mobile team doesn't trust their ability to actually compress the list without making mistakes
[01:22:37] lol
[01:23:15] New patchset: Dzahn; "fix sorting in wikimedias_html, add wikivoyage, and other minor fixes" [operations/debs/wikistats] (master) - https://gerrit.wikimedia.org/r/55837
[01:25:35] dschoon: which boxes are those?
[01:25:49] binasher: I can give my -1 too
[01:26:01] dschoon: what can we delete?
[01:26:48] paravoid: analytics1003, 1009 and 1026
[01:27:03] dschoon: /var/log/kafka would be a candidate
[01:27:25] sorry, texting with andrew at the same time
[01:27:29] 3.8G Mar 26 01:27 kafka.log
[01:27:51] i have an03 taken care of, i believe.
[01:28:04] as it has 700G of hadoop data :P
[01:28:10] which is not used at all.
[01:29:57] dschoon: an1009: /var/lib/hadoop 1018G
[01:30:07] hm.
[01:30:36] checking.
[01:30:39] it has subdirs called e, f, g, h, i and j
[01:30:45] each between 100 and 200G
[01:31:38] wait, no
[01:31:46] dschoon: ana1026: /var/log/udp2log 16G
[01:31:56] while it's true the data isn't used, /var/lib/hadoop is a mount for jbod
[01:32:01] so it doesn't affect /
[01:32:45] RECOVERY - Varnish traffic logger on cp1033 is OK: PROCS OK: 3 processes with command name varnishncsa
[01:32:45] RECOVERY - Puppet freshness on tola is OK: puppet ran at Tue Mar 26 01:32:42 UTC 2013
[01:32:55] * Bsadowski1 accidentally spills water on the datacenter servers
[01:33:09] j/k
[01:33:36] dschoon: i can just gzip kafka.log .. k?
[01:33:45] checking
[01:35:04] mutante: yup
[01:35:07] go for it
[01:36:16] !log analytics1003 - gzipping kafka.log to free some disk
[01:36:25] Logged the message, Master
[01:36:51] RECOVERY - Disk space on analytics1009 is OK: DISK OK
[01:36:59] not for long!
[01:37:10] i think something is filling it up.
[01:37:25] paravoid, hi, just spoke with Asher discussing the ACL issue
[01:37:51] !log analytics1003 - gzipping jmxtrans logs to free some disk
[01:37:54] probably udp2log
[01:37:57] ugh
[01:37:57] Logged the message, Master
[01:38:26] just the rotated ones, not the current one
[01:38:34] it sucks, i know, but here's the problem - partner has scheduled release tomorrow evening (wed morning their time)
[01:38:34] there it goes again.
[01:38:51] RECOVERY - Disk space on analytics1003 is OK: DISK OK
[01:38:55] dschoon: ana1003: /dev/md0 19G 17G 1.5G 93% /
[01:39:04] it might fill up again
[01:39:08] an09 is out
[01:39:51] PROBLEM - Disk space on analytics1009 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=87%):
[01:40:11] dschoon: what can we do with udp2log.log on 1026?
[01:40:29] you can rm
[01:40:37] fair enough:)
[01:40:59] !log analytics1026 - delete 16G udp2log.log to free disk
[01:41:06] Logged the message, Master
[01:41:14] an03 is the critical one, btw.
[01:41:28] i didn't realize that until now
[01:41:37] only an03 handles the mobile stream
[01:41:41] thank you SO much, mutante
[01:42:14] 1026 - it just has / mounted, so /var/log is in it, but that didn't resolve it yet? ehmm
[01:42:42] dschoon: but 1003 looks ok now per Icinga
[01:42:50] yeah
[01:42:58] ok, cool
[01:42:58] i think 09 and 26 are going to be hosed
[01:43:02] because i have no idea what's going on there
[01:43:04] >:(
[01:43:44] they look different because they don't have those extra mounts for /var/lib/hadoop .. nod
[01:44:00] yeah, an26-27 are utility boxes
[01:44:26] an01-10 are intended for cpu-intensive non-disk-intensive work
[01:45:52] there are another 5G in /home/otto on 1026 .. hrm
[01:46:24] i'm monitoring an03
[01:46:28] i think we're okay now
[01:46:31] thank you, mutante!
[01:46:37] ok, cool:)
[01:46:55] tytyty
[01:47:50] New review: Faidon; "What Asher said: please aggregate those routes. The C code that gets generated for those ACL matches..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55302
[01:47:54] yurik: ^^^
[01:48:09] dschoon: np, you'll want some logrotate for these. there are some examples in puppet repo ./files/logrotate ttyl then
[01:48:10] meh gerrit
[01:48:14] no kidding.
[01:48:18] paravoid, ?
[01:48:33] yurik: "New review"?
[01:48:45] mutante: there's surely something that ran away uncontrolled to generate this. we'll dig in tomorrow.
[01:48:46] "^^^" being an arrow pointing upwards
[01:48:46] ty again
[01:49:20] dschoon: yep yep
[01:49:55] paravoid, i understood the arrows, :) wasn't sure what exactly you were referring to
[01:49:58] Change merged: Dzahn; [operations/debs/wikistats] (master) - https://gerrit.wikimedia.org/r/55837
[01:50:11] yurik: the review, I think you're interested in that one, no?
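The review above asks for those routes to be aggregated before they land in the Varnish ACL, since Varnish compiles an ACL into sequential matches (the "series of nested ifs" mentioned earlier). A minimal sketch of that aggregation step using Python's standard ipaddress module; the input ranges below are illustrative placeholders, not the actual carrier list:

```python
# Sketch: collapse overlapping/adjacent CIDR blocks before pasting them
# into a Varnish ACL, so the generated match code stays small.
# These input ranges are illustrative placeholders only.
import ipaddress

raw_ranges = [
    "83.220.240.0/21",
    "83.220.248.0/21",  # adjacent to the block above; merges into a /20
    "85.115.243.0/24",
]

networks = [ipaddress.ip_network(r) for r in raw_ranges]

# collapse_addresses() merges overlapping and adjacent networks
# and yields the result in sorted order.
for net in ipaddress.collapse_addresses(networks):
    print(net)
# -> 83.220.240.0/20
#    85.115.243.0/24
```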
[01:53:54] 83.220.240.0/20
[01:54:06] http://myip.ms/view/ip_owners/2546/Jsc_Vimpelcom.html
[01:54:57] 85.115.243.0/24
[01:56:04] New patchset: Tim Starling; "Revert "lucene.jobs.sh to exit whenever an error exit"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55796
[02:00:47] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55796
[02:05:40] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours
[02:18:20] New review: Faidon; "I had a 30' chat with Yuri, in which he said there's not enough time until tomorrow to aggregate the..." [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/55302
[02:21:09] !log LocalisationUpdate completed (1.21wmf12) at Tue Mar 26 02:21:08 UTC 2013
[02:21:15] Logged the message, Master
[02:22:48] New patchset: Ram; "Bug: 43544 Dump entire global config to a file." [operations/debs/lucene-search-2] (master) - https://gerrit.wikimedia.org/r/55841
[02:25:51] PROBLEM - Puppet freshness on virt5 is CRITICAL: Puppet has not run in the last 10 hours
[02:38:42] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours
[02:38:43] PROBLEM - Puppet freshness on amslvs3 is CRITICAL: Puppet has not run in the last 10 hours
[02:38:43] PROBLEM - Puppet freshness on lvs1003 is CRITICAL: Puppet has not run in the last 10 hours
[02:38:43] PROBLEM - Puppet freshness on db1043 is CRITICAL: Puppet has not run in the last 10 hours
[02:38:43] PROBLEM - Puppet freshness on labstore1 is CRITICAL: Puppet has not run in the last 10 hours
[02:41:50] PROBLEM - Varnish traffic logger on cp1023 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[02:45:52] !log LocalisationUpdate completed (1.21wmf11) at Tue Mar 26 02:45:52 UTC 2013
[02:45:59] Logged the message, Master
[02:52:51] RECOVERY - Varnish traffic logger on cp1023 is OK: PROCS OK: 3 processes with command name varnishncsa
[02:56:24] PROBLEM - Disk space on mw1010 is CRITICAL: NRPE: Command check_disk_space not defined
[02:56:24] PROBLEM - RAID on mw1073 is CRITICAL: NRPE: Command check_raid not defined
[02:56:25] PROBLEM - Disk space on mw2 is CRITICAL: NRPE: Command check_disk_space not defined
[02:56:25] PROBLEM - Disk space on cp3009 is CRITICAL: NRPE: Command check_disk_space not defined
[02:56:25] PROBLEM - Disk space on mw1019 is CRITICAL: NRPE: Command check_disk_space not defined
[02:56:25] PROBLEM - DPKG on mw1006 is CRITICAL: NRPE: Command check_dpkg not defined
[02:56:25] PROBLEM - Disk space on mw1095 is CRITICAL: NRPE: Command check_disk_space not defined
[02:56:25] PROBLEM - RAID on mw1039 is CRITICAL: NRPE: Command check_raid not defined
[02:56:26] PROBLEM - Disk space on snapshot4 is CRITICAL: DISK CRITICAL - free space: / 30 MB (0% inode=50%): /var/lib/ureadahead/debugfs 30 MB (0% inode=50%):
[02:56:26] PROBLEM - Disk space on cp3010 is CRITICAL: NRPE: Command check_disk_space not defined
[02:56:27] PROBLEM - Disk space on virt3 is CRITICAL: NRPE: Command check_disk_space not defined
[02:56:27] PROBLEM - Disk space on mw7 is CRITICAL: NRPE: Command check_disk_space not defined
[02:56:27] PROBLEM - Lucene disk space on search30 is CRITICAL: NRPE: Command check_disk_6_3 not defined
[02:56:28] PROBLEM - Disk space on blondel is CRITICAL: DISK CRITICAL - free space: /a 0 MB (0% inode=5%):
[02:56:29] PROBLEM - Apache HTTP on mw1209 is CRITICAL: Connection refused
[02:56:29] PROBLEM - RAID on mw1010 is CRITICAL: NRPE: Command check_raid not defined
[02:56:29] PROBLEM - Disk space on mw1053 is CRITICAL: NRPE: Command check_disk_space not defined
[02:56:30] PROBLEM - Disk space on mw1025 is CRITICAL: NRPE: Command check_disk_space not defined
[02:56:30] PROBLEM - Disk space on mw14 is CRITICAL: NRPE: Command check_disk_space not defined
[02:56:34] PROBLEM - Disk space on mw1063 is CRITICAL: NRPE: Command check_disk_space not defined
[02:56:34] PROBLEM - RAID on mw1085 is CRITICAL: NRPE: Command check_raid not defined
[02:56:34] PROBLEM - RAID on mw1186 is CRITICAL: NRPE: Command check_raid not defined
[02:58:24] PROBLEM - MySQL disk space on db1051 is CRITICAL: NRPE: Command check_disk_6_3 not defined
[02:58:34] PROBLEM - mysqld processes on db1051 is CRITICAL: NRPE: Command check_mysqld not defined
[03:01:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:02:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.135 second response time
[03:05:00] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours
[03:06:00] Change abandoned: coren; "Will refactor before resubmitting" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53587
[03:31:45] Coren: multiple small commits means submit things as you go
[03:31:59] it doesn't mean to put a bunch of small unrelated things together ;)
[03:32:18] Ryan_Lane: :-P They're not unrelated; had a good talk with paravoid in re that
[03:32:33] also, why not python, rather than a mix of perl and bash
[03:32:35] ?
[03:32:56] hahahaha
[03:32:57] told you!
[03:33:01] the vast majority of what we write is python.
[03:33:08] Ryan_Lane: Python is the evil suxx0rz for doing command line processing and argument mashing. The right tool for the right task.
[03:33:17] it totally isn't
[03:33:27] Ryan_Lane: I promise to write at least one Python thing for you. :-)
[03:33:34] argparse isn't bad at all
[03:33:55] and bash sucks more than anything for command line stuff
[03:34:07] Ryan_Lane: It does, for all but the simplest tasks.
[03:34:18] let me rephrase (am being distracted by someone ;) )
[03:34:19] No argument there.
[03:34:19] * paravoid gets some popcorn
[03:34:27] bash sucks for parsing command line options
[03:34:53] perl isn't a bad language, it's just not what we're using for things
[03:35:01] it's nice to have a consistently used language
[03:35:35] you don't really believe that, do you? :)
[03:35:41] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: Puppet has not run in the last 10 hours
[03:35:41] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: Puppet has not run in the last 10 hours
[03:35:41] PROBLEM - Puppet freshness on msfe1002 is CRITICAL: Puppet has not run in the last 10 hours
[03:35:41] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: Puppet has not run in the last 10 hours
[03:35:42] that perl isn't a bad language :)
[03:35:51] paravoid: ;)
[03:36:04] ok. I also think perl isn't a very good language
[03:36:18] but I'm not going to argue based on that
[03:36:36] this perl code is fairly straight forward
[03:36:46] It is, but at this point I have a choice between (a) mildly proficient python after much effort or (b) solid perl that works now.
[03:36:55] give it a year or two and more than one author and it's going to be unreadable
[03:37:04] Ryan_Lane: Yeah, but I have a newbie security hole in one of 'em I am downright ashamed of. :-)
[03:37:11] heh
[03:37:30] also, I'm more comfortable reviewing python
[03:37:41] Although it's a zero-impact bug since it allows a user to run an arbitrary command from the shell... they ran an arbitrary command from. :-)
[03:37:56] It's still teh evil sux.
[03:38:28] http://docs.python.org/dev/library/argparse.html <— <3
[03:38:31] * Coren reads it.
[03:38:52] or http://docs.python.org/2/library/optparse.html if you want to use deprecated things. heh
[03:39:03] we're using optparse in a lot of code
[03:39:19] That's just a fancy getopt.
[03:40:27] I need to collect options I don't implement and emit them back in the same order to qsub; it might be possible to weasel around that library to do it, but it's going to be considerably less straightforward.
[03:40:56] you should also be able to do that in argparse
[03:41:13] And I need to use options in qsub's bizarre format (single dash long options with old-style args)
[03:41:41] I don't see why we don't write something sane and just have people switch
[03:41:58] we may not even stick with OGE forever
[03:42:17] Minimal disruption to existing tools is one of my stated requirements, remember? :-)
[03:42:23] if it's a pain in the ass to match what they are doing, write a sane wrapper ;)
[03:42:24] :(
[03:42:41] a lot of people don't actually use SGE
[03:42:48] they've been increasingly forced to it
[03:42:59] and as of late are now required to use it
[03:43:10] True, they don't -- hence the happy fun simplified wrapper.
[03:43:41] bleh. I just don't want to have to rewrite all of this stuff again later
[03:44:09] I guess that'll probably end up happening anyway, though
[03:44:16] Look, I don't mind v2 being in Python if you feel the irresistible compulsion; but it's going to be considerably uglier code and will take a while for me to master enough to do so.
[03:44:52] My objective now is "make it work with enough abstraction that we can tweak".
[03:44:58] * Ryan_Lane nods
[03:45:38] I've already got a couple of tools working right, and that critical mass is important. :-)
[03:46:12] In fact, I have people lined up for "when you got a stable filesystem" :-)
[03:47:54] as long as you're aware that I have a strong dislike of perl and a strong preference to python, and would like this cleaned up at some point
[03:48:22] I am aware.
[03:48:31] no worries then
[03:49:00] RECOVERY - Disk space on analytics1009 is OK: DISK OK
[03:49:08] Mind you, there's never going to be more than a dozen of those scripts, and jsub is likely to be the biggest one by far; so it's never going to be a nightmare to clean it up at need.
[03:49:15] there's no reason to fret over something when we may decide to change it later anyway
[03:49:21] Point.
[03:49:42] RECOVERY - Disk space on analytics1026 is OK: DISK OK
[03:49:44] Actually, the "really important" scripts are jstart/jstop
[03:49:58] And those I want to push people using the most for all bot-like things.
[03:50:12] Those will be easy to abstract to some other mechanism if needed.
[03:50:26] indeed
[03:50:42] Well, jstart is just a disguise for jsub, but the usage is accordingly simplified.
[03:54:42] anyway. tomorrow I'm going to install some eqiad labstore systems for testing
[03:55:01] I'm also likely to shrink a volume or two in pmtpa for testing
[03:55:06] to remove a couple bricks
[03:55:17] Yeay!
[03:55:24] [20:38:15] Ryan_Lane: FYI for your backlog; the manual puppet run installs the cron entry. You may have killed the agents too fast. :-)
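For the option pass-through problem described above (keep the flags the wrapper does not handle itself and hand them to qsub unchanged, in order), argparse's parse_known_args() is one workable approach. A minimal sketch under that assumption; the --mem option and the qsub resource flag shown are illustrative, not jsub's actual interface:

```python
#!/usr/bin/env python
# Sketch of a qsub wrapper: handle a couple of options locally and pass
# every unrecognized argument through to qsub in its original order.
# Option names here are illustrative, not the real jsub interface.
import argparse
import subprocess

parser = argparse.ArgumentParser(add_help=False)
parser.add_argument("--mem", default="256m", help="memory limit, handled by the wrapper")
parser.add_argument("--once", action="store_true", help="refuse to start a second copy")

# parse_known_args() returns (namespace, leftovers); the leftovers list
# preserves the unrecognized options in the order they were given.
opts, passthrough = parser.parse_known_args()

cmd = ["qsub", "-l", "h_vmem=%s" % opts.mem] + passthrough
print("would run:", " ".join(cmd))
# subprocess.call(cmd)  # uncommented, this would actually submit the job
```

One caveat from the discussion: qsub's single-dash long options don't follow the usual GNU conventions, so a wrapper like this has to be careful not to define option names whose prefixes collide with them.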
[03:55:24] if that goes well I'm likely to shrink all of the volumes
[03:55:49] then we can use two of the pmtpa glusters as the replacements
[03:55:54] bleh
[03:55:58] I did that like 2 hours later
[03:56:02] that's totally lame :)
[03:56:17] Odd. Something might have prevented puppet from running at the wrong time.
[03:56:34] I bet puppet was broken during part of that time
[03:56:43] oh well
[03:56:48] I can force a puppet run via salt
[03:57:07] * Ryan_Lane does that and watches the labspocolypse
[03:57:14] Just forcing a manual run solved it for all my instances; so just doing a forced run everywhere should fix any lingering instance.
[03:58:08] * Coren is inordinately pleased with the motd he did on a lark while watching TV. :-)
[03:58:20] motd?
[03:58:23] * Ryan_Lane goes to look
[03:59:17] I still need to fix the damn read-only issues on a bunch of projects
[03:59:27] hahaha
[03:59:29] nice
[04:00:05] It's recognizable, at least.
[04:00:28] I used to be better at this, but it's been AGES since I did AA. :-)
[04:00:31] indeed
[04:00:33] :D
[04:03:56] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours
[04:08:48] RECOVERY - Puppet freshness on ms1 is OK: puppet ran at Tue Mar 26 04:08:39 UTC 2013
[04:09:00] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours
[04:10:46] New patchset: Tim Starling; "Disable API action=imagerotate" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/55842
[04:11:18] RECOVERY - Puppet freshness on ms1 is OK: puppet ran at Tue Mar 26 04:11:08 UTC 2013
[04:11:56] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours
[04:12:47] RECOVERY - Puppet freshness on ms1 is OK: puppet ran at Tue Mar 26 04:12:36 UTC 2013
[04:12:57] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours
[04:13:37] RECOVERY - Puppet freshness on ms1 is OK: puppet ran at Tue Mar 26 04:13:27 UTC 2013
[04:13:57] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours
[04:14:09] RECOVERY - Puppet freshness on ms1 is OK: puppet ran at Tue Mar 26 04:14:05 UTC 2013
[04:14:57] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours
[04:15:32] Change merged: Tim Starling; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/55842
[04:15:47] PROBLEM - Puppet freshness on analytics1010 is CRITICAL: Puppet has not run in the last 10 hours
[04:15:47] PROBLEM - Puppet freshness on db1052 is CRITICAL: Puppet has not run in the last 10 hours
[04:15:47] PROBLEM - Puppet freshness on mw1214 is CRITICAL: Puppet has not run in the last 10 hours
[04:15:47] PROBLEM - Puppet freshness on mw1217 is CRITICAL: Puppet has not run in the last 10 hours
[04:15:47] PROBLEM - Puppet freshness on stafford is CRITICAL: Puppet has not run in the last 10 hours
[04:15:47] PROBLEM - Puppet freshness on zirconium is CRITICAL: Puppet has not run in the last 10 hours
[04:16:11] !log tstarling synchronized wmf-config/CommonSettings.php
[04:16:17] Logged the message, Master
[04:16:23] TimStarling: saw the RT?
[04:16:33] ah you did
[04:16:37] yes, and commented on it twice
[04:16:47] sorry, it's late :)
[04:17:49] PROBLEM - Puppet freshness on kuo is CRITICAL: Puppet has not run in the last 10 hours
[04:17:49] PROBLEM - Puppet freshness on ms-fe3001 is CRITICAL: Puppet has not run in the last 10 hours
[04:18:46] PROBLEM - Puppet freshness on virt1005 is CRITICAL: Puppet has not run in the last 10 hours
[04:27:48] PROBLEM - Puppet freshness on amssq58 is CRITICAL: Puppet has not run in the last 10 hours
[04:27:48] PROBLEM - Puppet freshness on hooper is CRITICAL: Puppet has not run in the last 10 hours
[04:27:48] PROBLEM - Puppet freshness on knsq17 is CRITICAL: Puppet has not run in the last 10 hours
[04:27:48] PROBLEM - Puppet freshness on knsq19 is CRITICAL: Puppet has not run in the last 10 hours
[04:27:48] PROBLEM - Puppet freshness on sq57 is CRITICAL: Puppet has not run in the last 10 hours
[04:28:47] PROBLEM - Puppet freshness on cp3022 is CRITICAL: Puppet has not run in the last 10 hours
[04:28:48] PROBLEM - Puppet freshness on iron is CRITICAL: Puppet has not run in the last 10 hours
[04:28:48] PROBLEM - Puppet freshness on knsq23 is CRITICAL: Puppet has not run in the last 10 hours
[04:28:48] PROBLEM - Puppet freshness on sq44 is CRITICAL: Puppet has not run in the last 10 hours
[04:30:46] PROBLEM - Puppet freshness on ms1001 is CRITICAL: Puppet has not run in the last 10 hours
[04:31:46] PROBLEM - Puppet freshness on amssq42 is CRITICAL: Puppet has not run in the last 10 hours
[04:32:46] PROBLEM - Puppet freshness on amssq33 is CRITICAL: Puppet has not run in the last 10 hours
[04:32:46] PROBLEM - Puppet freshness on amssq54 is CRITICAL: Puppet has not run in the last 10 hours
[04:32:46] PROBLEM - Puppet freshness on amssq61 is CRITICAL: Puppet has not run in the last 10 hours
[04:32:46] PROBLEM - Puppet freshness on tridge is CRITICAL: Puppet has not run in the last 10 hours
[04:33:46] PROBLEM - Puppet freshness on db1049 is CRITICAL: Puppet has not run in the last 10 hours
[04:33:46] PROBLEM - Puppet freshness on pdf1 is CRITICAL: Puppet has not run in the last 10 hours
[04:35:02] PROBLEM - Puppet freshness on amssq52 is CRITICAL: Puppet has not run in the last 10 hours
[04:35:03] PROBLEM - Puppet freshness on sq71 is CRITICAL: Puppet has not run in the last 10 hours
[04:35:03] PROBLEM - Puppet freshness on sq86 is CRITICAL: Puppet has not run in the last 10 hours
[04:36:02] PROBLEM - Puppet freshness on amssq55 is CRITICAL: Puppet has not run in the last 10 hours
[04:36:02] PROBLEM - Puppet freshness on lvs5 is CRITICAL: Puppet has not run in the last 10 hours
[04:36:02] PROBLEM - Puppet freshness on williams is CRITICAL: Puppet has not run in the last 10 hours
[04:37:02] PROBLEM - Puppet freshness on amssq34 is CRITICAL: Puppet has not run in the last 10 hours
[04:37:02] PROBLEM - Puppet freshness on sq45 is CRITICAL: Puppet has not run in the last 10 hours
[04:39:04] PROBLEM - Puppet freshness on amssq44 is CRITICAL: Puppet has not run in the last 10 hours
[04:39:04] PROBLEM - Puppet freshness on knsq28 is CRITICAL: Puppet has not run in the last 10 hours
[04:39:04] PROBLEM - Puppet freshness on pdf2 is CRITICAL: Puppet has not run in the last 10 hours
[04:39:04] PROBLEM - Puppet freshness on pdf3 is CRITICAL: Puppet has not run in the last 10 hours
[04:40:03] PROBLEM - Puppet freshness on sq42 is CRITICAL: Puppet has not run in the last 10 hours
[04:42:02] PROBLEM - Puppet freshness on mc1007 is CRITICAL: Puppet has not run in the last 10 hours
[04:43:02] PROBLEM - Puppet freshness on sq63 is CRITICAL: Puppet has not run in the last 10 hours
[04:44:02] PROBLEM - Puppet freshness on amssq48 is CRITICAL: Puppet has not run in the last 10 hours
[04:46:05] PROBLEM - Puppet freshness on lvs2 is CRITICAL: Puppet has not run in the last 10 hours
[04:46:05] PROBLEM - Puppet freshness on ocg1 is CRITICAL: Puppet has not run in the last 10 hours
[04:47:02] PROBLEM - Puppet freshness on amssq51 is CRITICAL: Puppet has not run in the last 10 hours
[04:47:02] PROBLEM - Puppet freshness on amssq56 is CRITICAL: Puppet has not run in the last 10 hours
[04:48:02] PROBLEM - Puppet freshness on amssq31 is CRITICAL: Puppet has not run in the last 10 hours
[04:48:02] PROBLEM - Puppet freshness on amssq37 is CRITICAL: Puppet has not run in the last 10 hours
[04:48:02] PROBLEM - Puppet freshness on knsq26 is CRITICAL: Puppet has not run in the last 10 hours
[04:48:02] PROBLEM - Puppet freshness on nickel is CRITICAL: Puppet has not run in the last 10 hours
[04:48:02] PROBLEM - Puppet freshness on knsq27 is CRITICAL: Puppet has not run in the last 10 hours
[04:49:02] PROBLEM - Puppet freshness on brewster is CRITICAL: Puppet has not run in the last 10 hours
[04:49:02] PROBLEM - Puppet freshness on dataset2 is CRITICAL: Puppet has not run in the last 10 hours
[04:49:02] PROBLEM - Puppet freshness on lvs4 is CRITICAL: Puppet has not run in the last 10 hours
[04:49:02] PROBLEM - Puppet freshness on sq50 is CRITICAL: Puppet has not run in the last 10 hours
[04:51:02] PROBLEM - Puppet freshness on amssq59 is CRITICAL: Puppet has not run in the last 10 hours
[04:51:02] PROBLEM - Puppet freshness on sq55 is CRITICAL: Puppet has not run in the last 10 hours
[04:51:02] PROBLEM - Puppet freshness on sq64 is CRITICAL: Puppet has not run in the last 10 hours
[04:51:02] PROBLEM - Puppet freshness on sq81 is CRITICAL: Puppet has not run in the last 10 hours
[04:52:02] PROBLEM - Puppet freshness on amssq43 is CRITICAL: Puppet has not run in the last 10 hours
[04:52:02] PROBLEM - Puppet freshness on amssq32 is CRITICAL: Puppet has not run in the last 10 hours
[04:52:02] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours
[04:52:02] PROBLEM - Puppet freshness on sq82 is CRITICAL: Puppet has not run in the last 10 hours
[04:53:02] PROBLEM - Puppet freshness on arsenic is CRITICAL: Puppet has not run in the last 10 hours
[04:53:02] PROBLEM - Puppet freshness on knsq21 is CRITICAL: Puppet has not run in the last 10 hours
[04:53:02] PROBLEM - Puppet freshness on linne is CRITICAL: Puppet has not run in the last 10 hours
[04:53:02] PROBLEM - Puppet freshness on chromium is CRITICAL: Puppet has not run in the last 10 hours
[04:53:02] PROBLEM - Puppet freshness on sq37 is CRITICAL: Puppet has not run in the last 10 hours
[04:53:02] PROBLEM - Puppet freshness on sq78 is CRITICAL: Puppet has not run in the last 10 hours
[04:53:02] PROBLEM - Puppet freshness on sq59 is CRITICAL: Puppet has not run in the last 10 hours
[04:53:03] PROBLEM - Puppet freshness on stat1001 is CRITICAL: Puppet has not run in the last 10 hours
[04:53:03] PROBLEM - Puppet freshness on ssl1002 is CRITICAL: Puppet has not run in the last 10 hours
[04:54:02] PROBLEM - Puppet freshness on amssq50 is CRITICAL: Puppet has not run in the last 10 hours
[04:54:02] PROBLEM - Puppet freshness on knsq22 is CRITICAL: Puppet has not run in the last 10 hours
[04:54:02] PROBLEM - Puppet freshness on ocg2 is CRITICAL: Puppet has not run in the last 10 hours
[04:54:02] PROBLEM - Puppet freshness on sq49 is CRITICAL: Puppet has not run in the last 10 hours
[04:54:02] PROBLEM - Puppet freshness on sq77 is CRITICAL: Puppet has not run in the last 10 hours
[04:55:03] PROBLEM - Puppet freshness on cp3019 is CRITICAL: Puppet has not run in the last 10 hours
[04:55:03] PROBLEM - Puppet freshness on cp3021 is CRITICAL: Puppet has not run in the last 10 hours
[04:55:03] PROBLEM - Puppet freshness on nitrogen is CRITICAL: Puppet has not run in the last 10 hours
[04:55:03] PROBLEM - Puppet freshness on sq75 is CRITICAL: Puppet has not run in the last 10 hours
[04:55:03] PROBLEM - Puppet freshness on sq36 is CRITICAL: Puppet has not run in the last 10 hours
[04:55:03] PROBLEM - Puppet freshness on hydrogen is CRITICAL: Puppet has not run in the last 10 hours
[04:55:04] PROBLEM - Puppet freshness on sq66 is CRITICAL: Puppet has not run in the last 10 hours
[04:56:02] PROBLEM - Puppet freshness on amssq36 is CRITICAL: Puppet has not run in the last 10 hours
[04:56:02] PROBLEM - Puppet freshness on amssq41 is CRITICAL: Puppet has not run in the last 10 hours
[04:56:02] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours
[04:56:02] PROBLEM - Puppet freshness on sq56 is CRITICAL: Puppet has not run in the last 10 hours
[04:56:02] PROBLEM - Puppet freshness on ssl1004 is CRITICAL: Puppet has not run in the last 10 hours
[04:57:03] PROBLEM - Puppet freshness on amssq60 is CRITICAL: Puppet has not run in the last 10 hours
[04:58:02] PROBLEM - Puppet freshness on sq53 is CRITICAL: Puppet has not run in the last 10 hours
[04:58:02] PROBLEM - Puppet freshness on sq69 is CRITICAL: Puppet has not run in the last 10 hours
[04:58:02] PROBLEM - Puppet freshness on sq61 is CRITICAL: Puppet has not run in the last 10 hours
[04:59:02] PROBLEM - Puppet freshness on formey is CRITICAL: Puppet has not run in the last 10 hours
[04:59:02] PROBLEM - Puppet freshness on sq70 is CRITICAL: Puppet has not run in the last 10 hours
[05:00:02] PROBLEM - Puppet freshness on amssq46 is CRITICAL: Puppet has not run in the last 10 hours
[05:00:02] PROBLEM - Puppet freshness on ssl3 is CRITICAL: Puppet has not run in the last 10 hours
[05:00:02] PROBLEM - Puppet freshness on amssq62 is CRITICAL: Puppet has not run in the last 10 hours
[05:01:02] PROBLEM - Puppet freshness on amssq38 is CRITICAL: Puppet has not run in the last 10 hours
[05:01:02] PROBLEM - Puppet freshness on sq80 is CRITICAL: Puppet has not run in the last 10 hours
[05:02:02] PROBLEM - Puppet freshness on amssq45 is CRITICAL: Puppet has not run in the last 10 hours
[05:02:02] PROBLEM - Puppet freshness on carbon is CRITICAL: Puppet has not run in the last 10 hours
[05:02:02] PROBLEM - Puppet freshness on kaulen is CRITICAL: Puppet has not run in the last 10 hours
[05:02:02] PROBLEM - Puppet freshness on sq60 is CRITICAL: Puppet has not run in the last 10 hours
[05:02:02] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours
[05:03:02] PROBLEM - Puppet freshness on amssq39 is CRITICAL: Puppet has not run in the last 10 hours
[05:03:02] PROBLEM - Puppet freshness on amssq49 is CRITICAL: Puppet has not run in the last 10 hours
[05:03:02] PROBLEM - Puppet freshness on knsq29 is CRITICAL: Puppet has not run in the last 10 hours
[05:03:02] PROBLEM - Puppet freshness on lvs1 is CRITICAL: Puppet has not run in the last 10 hours
[05:03:02] PROBLEM - Puppet freshness on cp3020 is CRITICAL: Puppet has not run in the last 10 hours
[05:03:02] PROBLEM - Puppet freshness on yvon is CRITICAL: Puppet has not run in the last 10 hours
[05:04:47] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours
[05:05:13] PROBLEM - Puppet freshness on cerium is CRITICAL: Puppet has not run in the last 10 hours
[05:05:13] PROBLEM - Puppet freshness on knsq20 is CRITICAL: Puppet has not run in the last 10 hours
[05:05:13] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[05:05:13] PROBLEM - Puppet freshness on sq67 is CRITICAL: Puppet has not run in the last 10 hours
[05:05:13] PROBLEM - Puppet freshness on sq83 is CRITICAL: Puppet has not run in the last 10 hours
[05:06:09] PROBLEM - Puppet freshness on sq65 is CRITICAL: Puppet has not run in the last 10 hours
[05:07:07] PROBLEM - Puppet freshness on amssq57 is CRITICAL: Puppet has not run in the last 10 hours
[05:07:08] PROBLEM - Puppet freshness on nescio is CRITICAL: Puppet has not run in the last 10 hours
[05:07:08] PROBLEM - Puppet freshness on sq72 is CRITICAL: Puppet has not run in the last 10 hours
[05:07:08] PROBLEM - Puppet freshness on sq73 is CRITICAL: Puppet has not run in the last 10 hours
[05:07:08] PROBLEM - Puppet freshness on gurvin is CRITICAL: Puppet has not run in the last 10 hours
[05:07:08] PROBLEM - Puppet freshness on dobson is CRITICAL: Puppet has not run in the last 10 hours
[05:08:09] PROBLEM - Puppet freshness on knsq24 is CRITICAL: Puppet has not run in the last 10 hours
[05:08:11] PROBLEM - Puppet freshness on sq79 is CRITICAL: Puppet has not run in the last 10 hours
[05:09:11] PROBLEM - Puppet freshness on sq51 is CRITICAL: Puppet has not run in the last 10 hours
[05:10:08] PROBLEM - Puppet freshness on sq48 is CRITICAL: Puppet has not run in the last 10 hours
[05:10:08] PROBLEM - Puppet freshness on sq43 is CRITICAL: Puppet has not run in the last 10 hours
[05:10:08] PROBLEM - Puppet freshness on sq74 is CRITICAL: Puppet has not run in the last 10 hours
[05:10:08] PROBLEM - Puppet freshness on sq85 is CRITICAL: Puppet has not run in the last 10 hours
[05:11:07] PROBLEM - Puppet freshness on amssq53 is CRITICAL: Puppet has not run in the last 10 hours
[05:11:07] PROBLEM - Puppet freshness on ekrem is CRITICAL: Puppet has not run in the last 10 hours
[05:11:07] PROBLEM - Puppet freshness on knsq16 is CRITICAL: Puppet has not run in the last 10 hours
[05:11:07] PROBLEM - Puppet freshness on ms10 is CRITICAL: Puppet has not run in the last 10 hours
[05:11:07] PROBLEM - Puppet freshness on praseodymium is CRITICAL: Puppet has not run in the last 10 hours
[05:11:07] PROBLEM - Puppet freshness on sq76 is CRITICAL: Puppet has not run in the last 10 hours
[05:11:07] PROBLEM - Puppet freshness on titanium is CRITICAL: Puppet has not run in the last 10 hours
[05:11:08] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours
[05:12:08] PROBLEM - Puppet freshness on sq54 is CRITICAL: Puppet has not run in the last 10 hours
[05:12:08] PROBLEM - Puppet freshness on sq58 is CRITICAL: Puppet has not run in the last 10 hours
[05:13:08] PROBLEM - Puppet freshness on calcium is CRITICAL: Puppet has not run in the last 10 hours
[05:13:08] PROBLEM - Puppet freshness on knsq18 is CRITICAL: Puppet has not run in the last 10 hours
[05:13:09] PROBLEM - Puppet freshness on ssl1001 is CRITICAL: Puppet has not run in the last 10 hours
[05:14:07] PROBLEM - Puppet freshness on amssq47 is CRITICAL: Puppet has not run in the last 10 hours
[05:14:07] PROBLEM - Puppet freshness on sq33 is CRITICAL: Puppet has not run in the last 10 hours
[05:15:07] PROBLEM - Puppet freshness on sq41 is CRITICAL: Puppet has not run in the last 10 hours
[05:15:07] PROBLEM - Puppet freshness on dataset1001 is CRITICAL: Puppet has not run in the last 10 hours
[05:15:07] PROBLEM - Puppet freshness on sq52 is CRITICAL: Puppet has not run in the last 10 hours
[05:16:11] PROBLEM - Puppet freshness on sq84 is CRITICAL: Puppet has not run in the last 10 hours
[05:16:11] PROBLEM - Puppet freshness on ssl1003 is CRITICAL: Puppet has not run in the last 10 hours
[05:17:08] PROBLEM - Puppet freshness on capella is CRITICAL: Puppet has not run in the last 10 hours
[05:17:09] PROBLEM - Puppet freshness on singer is CRITICAL: Puppet has not run in the last 10 hours
[05:17:10] PROBLEM - Puppet freshness on sq62 is CRITICAL: Puppet has not run in the last 10 hours
[05:18:07] PROBLEM - Puppet freshness on amssq35 is CRITICAL: Puppet has not run in the last 10 hours
[05:18:09] PROBLEM - Puppet freshness on niobium is CRITICAL: Puppet has not run in the last 10 hours
[05:18:09] PROBLEM - Puppet freshness on sq68 is CRITICAL: Puppet has not run in the last 10 hours
[05:19:08] PROBLEM - Puppet freshness on lvs3 is CRITICAL: Puppet has not run in the last 10 hours
[05:20:08] PROBLEM - Puppet freshness on lvs1001 is CRITICAL: Puppet has not run in the last 10 hours
[05:21:08] PROBLEM - Puppet freshness on amssq40 is CRITICAL: Puppet has not run in the last 10 hours
[05:21:08] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours
[05:21:08] PROBLEM - Puppet freshness on manutius is CRITICAL: Puppet has not run in the last 10 hours
[05:43:35] PROBLEM - Puppet freshness on constable is CRITICAL: Puppet has not run in the last 10 hours
[05:59:49] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:00:39] RECOVERY - RAID on searchidx1001 is OK: OK: State is Optimal, checked 4 logical device(s)
[06:03:36] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours
[06:03:48] PROBLEM - LVS HTTP IPv4 on bits-lb.eqiad.wikimedia.org is CRITICAL: Connection timed out
[06:04:37] RECOVERY - LVS HTTP IPv4 on bits-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 3847 bytes in 0.006 second response time
[06:31:50] RECOVERY - Puppet freshness on ms1 is OK: puppet ran at Tue Mar 26 06:31:37 UTC 2013
[06:32:37] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours
[06:33:09] RECOVERY - Puppet freshness on ms1 is OK: puppet ran at Tue Mar 26 06:33:04 UTC 2013
[06:33:37] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours
[06:34:07] RECOVERY - Puppet freshness on ms1 is OK: puppet ran at Tue Mar 26 06:33:57 UTC 2013
[06:34:40] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours
[06:34:48] RECOVERY - Puppet freshness on ms1 is OK: puppet ran at Tue Mar 26 06:34:40 UTC 2013
[06:50:30] New review: Nikerabbit; "(1 comment)" [operations/debs/wikistats] (master) - https://gerrit.wikimedia.org/r/55837
[07:00:08] PROBLEM - Swift HTTP on ms-fe4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:04:57] PROBLEM - Swift HTTP on ms-fe1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:06:24] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours
[07:12:14] PROBLEM - Apache HTTP on mw1156 is CRITICAL: Connection timed out
[07:12:14] PROBLEM - Apache HTTP on mw1159 is CRITICAL: Connection timed out
[07:12:24] PROBLEM - Apache HTTP on mw1154 is CRITICAL: Connection timed out
[07:12:35] PROBLEM - Apache HTTP on mw1157 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:12:44] PROBLEM - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is CRITICAL: Connection timed out
[07:13:05] PROBLEM - Apache HTTP on mw1158 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:13:05] PROBLEM - Apache HTTP on mw1160 is CRITICAL: Connection timed out
[07:13:05] PROBLEM - Apache HTTP on mw1155 is CRITICAL: Connection timed out
[07:13:14] PROBLEM - Apache HTTP on mw1153 is CRITICAL: Connection timed out
[07:13:34] PROBLEM - Swift HTTP on ms-fe3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:15:24] RECOVERY - Swift HTTP on ms-fe3 is OK: HTTP OK: HTTP/1.1 200 OK - 2503 bytes in 0.067 second response time
[07:18:44] PROBLEM - RAID on ms-fe3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:20:45] RECOVERY - RAID on ms-fe3 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[07:21:44] RECOVERY - Swift HTTP on ms-fe4 is OK: HTTP OK: HTTP/1.1 200 OK - 2503 bytes in 0.179 second response time
[07:24:45] PROBLEM - RAID on ms-fe3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:24:45] RECOVERY - Swift HTTP on ms-fe1 is OK: HTTP OK: HTTP/1.1 200 OK - 2503 bytes in 0.132 second response time
[07:25:24] PROBLEM - RAID on ms-fe2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:25:34] PROBLEM - SSH on ms-fe2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:26:14] RECOVERY - Apache HTTP on mw1153 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 2.966 second response time
[07:26:14] RECOVERY - Apache HTTP on mw1156 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 3.392 second response time
[07:26:14] RECOVERY - RAID on ms-fe2 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[07:26:24] RECOVERY - Apache HTTP on mw1159 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 7.927 second response time
[07:26:24] RECOVERY - Apache HTTP on mw1157 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.068 second response time
[07:26:24] RECOVERY - Apache HTTP on mw1154 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 3.787 second response time
[07:26:24] RECOVERY - SSH on ms-fe2 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[07:26:34] RECOVERY - RAID on ms-fe3 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[07:26:44] RECOVERY - Apache HTTP on mw1158 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.063 second response time
[07:26:45] RECOVERY - Apache HTTP on mw1160 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.069 second response time
[07:26:45] RECOVERY - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 62103 bytes in 0.226 second response time
[07:27:30] !log Restarted swift-proxy on ms-fe*
[07:27:37] Logged the message, Master
[07:27:55] * apergos peeks in
[07:29:33] RECOVERY - Apache HTTP on mw1155 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.067 second response time
[07:49:49] PROBLEM - Puppet freshness on barium is CRITICAL: Puppet has not run in the last 10 hours
[07:54:49] PROBLEM - Puppet freshness on marmontel is CRITICAL: Puppet has not run in the last 10 hours
[07:57:49] PROBLEM - SSH on lvs1001 is CRITICAL: Server answer:
[07:58:50] RECOVERY - SSH on lvs1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[08:04:52] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours
[08:08:52] RECOVERY - Puppet freshness on ms1 is OK: puppet ran at Tue Mar 26 08:08:46 UTC 2013
[08:08:52] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours
[08:09:42] RECOVERY - Puppet freshness on ms1 is OK: puppet ran at Tue Mar 26 08:09:35 UTC 2013
[08:09:52] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours
[08:10:36] RECOVERY - Puppet freshness on ms1 is OK: puppet ran at Tue Mar 26 08:10:31 UTC 2013
[08:10:56] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours
[08:11:25] RECOVERY - Puppet freshness on ms1 is OK: puppet ran at Tue Mar 26 08:11:17 UTC 2013
[08:11:53] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours
[08:20:42] PROBLEM - Puppet freshness on colby is CRITICAL: Puppet has not run in the last 10 hours
[08:24:13] PROBLEM - RAID on db45 is CRITICAL: CRITICAL: Degraded
[08:45:56] !log nikerabbit synchronized php-1.21wmf12/extensions/UniversalLanguageSelector
[08:46:02] Logged the message, Master
[08:47:12] !log nikerabbit synchronized php-1.21wmf12/extensions/Translate
[08:47:18] Logged the message, Master
[08:55:09] hello
[10:01:22] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:02:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time
[10:04:39] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours
[10:05:22] New patchset: Hashar; "futureproof for ruby1.9.x (not in use yet)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54692
[10:05:51] New review: Hashar; "I have updated commit summary to list out the ruby version tested :-) Still need an explanation abo..." [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/54692
[10:07:32] New review: Silke Meyer; "Yay! Works!" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/51797
[10:16:19] PROBLEM - Puppet freshness on mw1137 is CRITICAL: Puppet has not run in the last 10 hours
[10:17:56] New patchset: Hashar; "wikiversions.cdb for labs" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/47564
[10:18:19] PROBLEM - Puppet freshness on mw1031 is CRITICAL: Puppet has not run in the last 10 hours
[10:18:19] PROBLEM - Puppet freshness on mw1098 is CRITICAL: Puppet has not run in the last 10 hours
[10:19:10] New review: Hashar; "If you guys could please review this, that would be nice :-]" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/47564
[10:19:22] PROBLEM - Puppet freshness on mw1103 is CRITICAL: Puppet has not run in the last 10 hours
[10:21:19] PROBLEM - Puppet freshness on grosley is CRITICAL: Puppet has not run in the last 10 hours
[10:21:19] PROBLEM - Puppet freshness on locke is CRITICAL: Puppet has not run in the last 10 hours
[10:21:19] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours
[10:21:34] New patchset: Hashar; "WIP monitoring lucene search boxes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51174
[10:23:21] PROBLEM - Puppet freshness on hooft is CRITICAL: Puppet has not run in the last 10 hours
[10:23:22] PROBLEM - Puppet freshness on ms6 is CRITICAL: Puppet has not run in the last 10 hours
[10:23:22] PROBLEM - Puppet freshness on nfs1 is CRITICAL: Puppet has not run in the last 10 hours
[10:23:29] !log nikerabbit synchronized php-1.21wmf12/extensions/UniversalLanguageSelector/UniversalLanguageSelector.php 'Testing fix for wikidata'
[10:23:36] Logged the message, Master
[10:25:55] New review: Hashar; "rebased, the old nagios confs have been phased out." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51174
[10:28:56] New patchset: Silke Meyer; "Updated documentation" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55872
[10:33:26] PROBLEM - Puppet freshness on strontium is CRITICAL: Puppet has not run in the last 10 hours
[10:38:36] PROBLEM - RAID on searchidx2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:40:26] RECOVERY - RAID on searchidx2 is OK: OK: State is Optimal, checked 4 logical device(s)
[10:51:46] New patchset: Hashar; "monitoring lucene search boxes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51174
[11:03:39] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours
[11:10:19] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours
[11:25:48] New patchset: Matthias Mullie; "Enable AFTv5 on frwiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/55049
[11:44:30] PROBLEM - Puppet freshness on mw1143 is CRITICAL: Puppet has not run in the last 10 hours
[11:47:05] New patchset: Matthias Mullie; "Enable AFTv5 on frwiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/55049
[12:04:30] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours
[12:08:00] RECOVERY - Puppet freshness on ms1 is OK: puppet ran at Tue Mar 26 12:07:58 UTC 2013
[12:08:31] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours
[12:11:01] RECOVERY - Puppet freshness on ms1 is OK: puppet ran at Tue Mar 26 12:10:50 UTC 2013
[12:11:30] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours
[12:13:22] RECOVERY - Puppet freshness on ms1 is OK: puppet ran at Tue Mar 26 12:13:17 UTC 2013
[12:13:32] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours
[12:14:43] RECOVERY - Puppet freshness on ms1 is OK: puppet ran at Tue Mar 26 12:14:31 UTC 2013
[12:15:32] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours
[12:15:32] RECOVERY - Puppet freshness on ms1 is OK: puppet ran at Tue Mar 26 12:15:26 UTC 2013
[12:16:31] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours
[12:16:31] RECOVERY - Puppet freshness on ms1 is OK: puppet ran at Tue Mar 26 12:16:27 UTC 2013
[12:17:30] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours
[12:18:00] RECOVERY - Puppet freshness on ms1 is OK: puppet ran at Tue Mar 26 12:17:54 UTC 2013
[12:18:30] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours
[12:26:00] PROBLEM - Puppet freshness on virt5 is CRITICAL: Puppet has not run in the last 10 hours
[12:31:16] New patchset: Hashar; "sql script no more need /etc/cluster" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55877
[12:38:48] PROBLEM - Puppet freshness on amslvs1 is CRITICAL: Puppet has not run in the last 10 hours
[12:38:48] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours
[12:38:48] PROBLEM - Puppet freshness on amslvs3 is CRITICAL: Puppet has not run in the last 10 hours
[12:38:48] PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Puppet has not run in the last 10 hours
[12:38:48] PROBLEM - Puppet freshness on db1043 is CRITICAL: Puppet has not run in the last 10 hours
[13:04:07] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours
[13:36:05] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: Puppet has not run in the last 10 hours
[13:36:05] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: Puppet has not run in the last 10 hours
[13:36:05] PROBLEM - Puppet freshness on msfe1002 is CRITICAL: Puppet has not run in the last 10 hours
[13:36:05] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: Puppet has not run in the last 10 hours
[14:04:38] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours
PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [14:09:46] !log running sync-common on mw1209 [14:09:52] Logged the message, Master [14:11:10] New patchset: Hashar; "(bug 46104) reduce number of wikis on beta" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/55889 [14:14:40] PROBLEM - Puppet freshness on srv294 is CRITICAL: Puppet has not run in the last 10 hours [14:14:40] PROBLEM - Puppet freshness on cp1036 is CRITICAL: Puppet has not run in the last 10 hours [14:14:48] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/47564 [14:16:00] New patchset: Ottomata; "Moving university filters to gadolinium. Moving 5xx filter to gadolinium, attempting to use udp-filter instead of 5xx filter" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55890 [14:16:19] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55890 [14:16:42] PROBLEM - Puppet freshness on analytics1010 is CRITICAL: Puppet has not run in the last 10 hours [14:16:43] PROBLEM - Puppet freshness on db1052 is CRITICAL: Puppet has not run in the last 10 hours [14:16:43] PROBLEM - Puppet freshness on mw1214 is CRITICAL: Puppet has not run in the last 10 hours [14:16:43] PROBLEM - Puppet freshness on mw1217 is CRITICAL: Puppet has not run in the last 10 hours [14:16:43] PROBLEM - Puppet freshness on zirconium is CRITICAL: Puppet has not run in the last 10 hours [14:16:43] PROBLEM - Puppet freshness on stafford is CRITICAL: Puppet has not run in the last 10 hours [14:17:55] New review: Anomie; "We can always add wikis back if someone complains." [operations/mediawiki-config] (master) C: 2; - https://gerrit.wikimedia.org/r/55889 [14:18:12] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/55889 [14:18:40] PROBLEM - Puppet freshness on kuo is CRITICAL: Puppet has not run in the last 10 hours [14:18:40] PROBLEM - Puppet freshness on ms-fe3001 is CRITICAL: Puppet has not run in the last 10 hours [14:19:40] PROBLEM - Puppet freshness on virt1005 is CRITICAL: Puppet has not run in the last 10 hours [14:28:40] PROBLEM - Puppet freshness on amssq58 is CRITICAL: Puppet has not run in the last 10 hours [14:28:41] PROBLEM - Puppet freshness on hooper is CRITICAL: Puppet has not run in the last 10 hours [14:28:41] PROBLEM - Puppet freshness on knsq17 is CRITICAL: Puppet has not run in the last 10 hours [14:28:41] PROBLEM - Puppet freshness on knsq19 is CRITICAL: Puppet has not run in the last 10 hours [14:28:41] PROBLEM - Puppet freshness on sq57 is CRITICAL: Puppet has not run in the last 10 hours [14:28:59] New patchset: Ottomata; "Fixing comment" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55891 [14:29:24] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55891 [14:29:40] PROBLEM - Puppet freshness on cp3022 is CRITICAL: Puppet has not run in the last 10 hours [14:29:40] PROBLEM - Puppet freshness on iron is CRITICAL: Puppet has not run in the last 10 hours [14:29:40] PROBLEM - Puppet freshness on knsq23 is CRITICAL: Puppet has not run in the last 10 hours [14:29:40] PROBLEM - Puppet freshness on sq44 is CRITICAL: Puppet has not run in the last 10 hours [14:31:41] PROBLEM - Puppet freshness on ms1001 is CRITICAL: Puppet has not run in the last 10 hours [14:32:40] PROBLEM - Puppet freshness on amssq42 is CRITICAL: Puppet has not run in the last 10 hours [14:34:10] !log 
hashar synchronized multiversion '{{gerrit|47564}} wikiversions.cdb now vary by realm' [14:34:17] Logged the message, Master [14:34:31] PROBLEM - Puppet freshness on amssq33 is CRITICAL: Puppet has not run in the last 10 hours [14:34:32] PROBLEM - Puppet freshness on amssq54 is CRITICAL: Puppet has not run in the last 10 hours [14:34:32] PROBLEM - Puppet freshness on amssq61 is CRITICAL: Puppet has not run in the last 10 hours [14:34:32] PROBLEM - Puppet freshness on pdf1 is CRITICAL: Puppet has not run in the last 10 hours [14:34:32] PROBLEM - Puppet freshness on db1049 is CRITICAL: Puppet has not run in the last 10 hours [14:34:32] PROBLEM - Puppet freshness on tridge is CRITICAL: Puppet has not run in the last 10 hours [14:35:31] PROBLEM - Puppet freshness on amssq52 is CRITICAL: Puppet has not run in the last 10 hours [14:35:31] PROBLEM - Puppet freshness on sq71 is CRITICAL: Puppet has not run in the last 10 hours [14:35:31] PROBLEM - Puppet freshness on sq86 is CRITICAL: Puppet has not run in the last 10 hours [14:36:21] PROBLEM - Disk space on locke is CRITICAL: NRPE: Command check_disk_space not defined [14:36:31] PROBLEM - Puppet freshness on amssq55 is CRITICAL: Puppet has not run in the last 10 hours [14:36:31] PROBLEM - Puppet freshness on lvs5 is CRITICAL: Puppet has not run in the last 10 hours [14:36:31] PROBLEM - Puppet freshness on williams is CRITICAL: Puppet has not run in the last 10 hours [14:37:31] PROBLEM - Puppet freshness on amssq34 is CRITICAL: Puppet has not run in the last 10 hours [14:37:33] PROBLEM - Puppet freshness on sq45 is CRITICAL: Puppet has not run in the last 10 hours [14:39:31] PROBLEM - Puppet freshness on amssq44 is CRITICAL: Puppet has not run in the last 10 hours [14:39:31] PROBLEM - Puppet freshness on knsq28 is CRITICAL: Puppet has not run in the last 10 hours [14:39:31] PROBLEM - Puppet freshness on pdf2 is CRITICAL: Puppet has not run in the last 10 hours [14:39:31] PROBLEM - Puppet freshness on pdf3 is CRITICAL: Puppet has not run in the last 10 hours [14:40:31] PROBLEM - Puppet freshness on sq42 is CRITICAL: Puppet has not run in the last 10 hours [14:42:31] PROBLEM - Puppet freshness on mc1007 is CRITICAL: Puppet has not run in the last 10 hours [14:43:31] PROBLEM - Puppet freshness on sq63 is CRITICAL: Puppet has not run in the last 10 hours [14:44:31] PROBLEM - Puppet freshness on amssq48 is CRITICAL: Puppet has not run in the last 10 hours [14:45:10] New patchset: Faidon; "Make the integration/zuul git latest -> present" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55892 [14:45:53] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55892 [14:46:31] PROBLEM - Puppet freshness on lvs2 is CRITICAL: Puppet has not run in the last 10 hours [14:46:32] PROBLEM - Puppet freshness on ocg1 is CRITICAL: Puppet has not run in the last 10 hours [14:47:33] PROBLEM - Puppet freshness on amssq51 is CRITICAL: Puppet has not run in the last 10 hours [14:47:33] PROBLEM - Puppet freshness on amssq56 is CRITICAL: Puppet has not run in the last 10 hours [14:48:31] PROBLEM - Puppet freshness on amssq31 is CRITICAL: Puppet has not run in the last 10 hours [14:48:31] PROBLEM - Puppet freshness on knsq26 is CRITICAL: Puppet has not run in the last 10 hours [14:48:31] PROBLEM - Puppet freshness on knsq27 is CRITICAL: Puppet has not run in the last 10 hours [14:48:31] PROBLEM - Puppet freshness on amssq37 is CRITICAL: Puppet has not run in the last 10 hours [14:48:31] PROBLEM - Puppet freshness on nickel 
is CRITICAL: Puppet has not run in the last 10 hours [14:49:31] PROBLEM - Puppet freshness on brewster is CRITICAL: Puppet has not run in the last 10 hours [14:49:31] PROBLEM - Puppet freshness on dataset2 is CRITICAL: Puppet has not run in the last 10 hours [14:49:31] PROBLEM - Puppet freshness on lvs4 is CRITICAL: Puppet has not run in the last 10 hours [14:49:31] PROBLEM - Puppet freshness on sq50 is CRITICAL: Puppet has not run in the last 10 hours [14:51:31] PROBLEM - Puppet freshness on amssq59 is CRITICAL: Puppet has not run in the last 10 hours [14:51:31] PROBLEM - Puppet freshness on sq64 is CRITICAL: Puppet has not run in the last 10 hours [14:51:31] PROBLEM - Puppet freshness on sq55 is CRITICAL: Puppet has not run in the last 10 hours [14:51:31] PROBLEM - Puppet freshness on sq81 is CRITICAL: Puppet has not run in the last 10 hours [14:52:02] hi guys!, tons of 500s right now and for the last couple of days [14:52:04] to things like this [14:52:06] http://commons.wikimedia.org/w/index.php?title=MediaWiki:Filepage.css&action=raw&maxage=2678400&usemsgcache=yes&ctype=text%2Fcss&smaxage=2678400 [14:52:14] who should I notify? [14:52:31] PROBLEM - Puppet freshness on amssq32 is CRITICAL: Puppet has not run in the last 10 hours [14:52:31] PROBLEM - Puppet freshness on amssq43 is CRITICAL: Puppet has not run in the last 10 hours [14:52:31] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [14:52:31] PROBLEM - Puppet freshness on sq82 is CRITICAL: Puppet has not run in the last 10 hours [14:53:31] PROBLEM - Puppet freshness on arsenic is CRITICAL: Puppet has not run in the last 10 hours [14:53:31] PROBLEM - Puppet freshness on chromium is CRITICAL: Puppet has not run in the last 10 hours [14:53:31] PROBLEM - Puppet freshness on linne is CRITICAL: Puppet has not run in the last 10 hours [14:53:31] PROBLEM - Puppet freshness on knsq21 is CRITICAL: Puppet has not run in the last 10 hours [14:53:31] PROBLEM - Puppet freshness on sq37 is CRITICAL: Puppet has not run in the last 10 hours [14:53:31] PROBLEM - Puppet freshness on sq59 is CRITICAL: Puppet has not run in the last 10 hours [14:53:31] PROBLEM - Puppet freshness on sq78 is CRITICAL: Puppet has not run in the last 10 hours [14:53:32] PROBLEM - Puppet freshness on ssl1002 is CRITICAL: Puppet has not run in the last 10 hours [14:53:32] PROBLEM - Puppet freshness on stat1001 is CRITICAL: Puppet has not run in the last 10 hours [14:54:31] PROBLEM - Puppet freshness on amssq50 is CRITICAL: Puppet has not run in the last 10 hours [14:54:31] PROBLEM - Puppet freshness on knsq22 is CRITICAL: Puppet has not run in the last 10 hours [14:54:31] PROBLEM - Puppet freshness on sq49 is CRITICAL: Puppet has not run in the last 10 hours [14:54:31] PROBLEM - Puppet freshness on sq77 is CRITICAL: Puppet has not run in the last 10 hours [14:54:31] PROBLEM - Puppet freshness on ocg2 is CRITICAL: Puppet has not run in the last 10 hours [14:55:31] PROBLEM - Puppet freshness on cp3019 is CRITICAL: Puppet has not run in the last 10 hours [14:55:31] PROBLEM - Puppet freshness on cp3021 is CRITICAL: Puppet has not run in the last 10 hours [14:55:31] PROBLEM - Puppet freshness on hydrogen is CRITICAL: Puppet has not run in the last 10 hours [14:55:31] PROBLEM - Puppet freshness on nitrogen is CRITICAL: Puppet has not run in the last 10 hours [14:55:31] PROBLEM - Puppet freshness on sq36 is CRITICAL: Puppet has not run in the last 10 hours [14:55:31] PROBLEM - Puppet freshness on sq66 is CRITICAL: Puppet has not run in 
the last 10 hours [14:55:31] PROBLEM - Puppet freshness on sq75 is CRITICAL: Puppet has not run in the last 10 hours [14:56:31] PROBLEM - Puppet freshness on amssq36 is CRITICAL: Puppet has not run in the last 10 hours [14:56:31] PROBLEM - Puppet freshness on amssq41 is CRITICAL: Puppet has not run in the last 10 hours [14:56:31] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours [14:56:31] PROBLEM - Puppet freshness on sq56 is CRITICAL: Puppet has not run in the last 10 hours [14:56:31] PROBLEM - Puppet freshness on ssl1004 is CRITICAL: Puppet has not run in the last 10 hours [14:57:31] PROBLEM - Puppet freshness on amssq60 is CRITICAL: Puppet has not run in the last 10 hours [14:57:49] hi Jeff_Green! [14:57:53] you back and working this week? [14:58:31] PROBLEM - Puppet freshness on sq53 is CRITICAL: Puppet has not run in the last 10 hours [14:58:31] PROBLEM - Puppet freshness on sq61 is CRITICAL: Puppet has not run in the last 10 hours [14:58:31] PROBLEM - Puppet freshness on sq69 is CRITICAL: Puppet has not run in the last 10 hours [14:59:31] PROBLEM - Puppet freshness on formey is CRITICAL: Puppet has not run in the last 10 hours [14:59:31] PROBLEM - Puppet freshness on sq70 is CRITICAL: Puppet has not run in the last 10 hours [14:59:50] New review: Hashar; "+1 :-] (just publicly acknowledging)." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55892 [15:00:30] ottomata: ya, got back late last night. I'm trying to work starting today [15:00:31] PROBLEM - Puppet freshness on amssq46 is CRITICAL: Puppet has not run in the last 10 hours [15:00:31] PROBLEM - Puppet freshness on amssq62 is CRITICAL: Puppet has not run in the last 10 hours [15:00:31] PROBLEM - Puppet freshness on ssl3 is CRITICAL: Puppet has not run in the last 10 hours [15:00:40] welcome back Jeff_Green ! [15:00:45] thanks! 
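The recurring "NRPE: Command check_disk_space not defined" alerts above (e.g. on locke at 14:36) mean the monitoring server is asking the remote NRPE daemon for a command name that has no command[...] stanza in the daemon's config. A minimal sketch of the missing definition, assuming Debian-style paths and illustrative thresholds (the real puppetized values may differ):

    # /etc/nagios/nrpe_local.cfg (path assumed): map the requested name to a plugin invocation
    echo "command[check_disk_space]=/usr/lib/nagios/plugins/check_disk -w 10% -c 5% -l" \
      >> /etc/nagios/nrpe_local.cfg
    # restart NRPE so the new command is picked up, then retest from the monitor
    service nagios-nrpe-server restart
    /usr/lib/nagios/plugins/check_nrpe -H locke -c check_disk_space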
[15:01:33] PROBLEM - Puppet freshness on amssq38 is CRITICAL: Puppet has not run in the last 10 hours [15:01:33] PROBLEM - Puppet freshness on sq80 is CRITICAL: Puppet has not run in the last 10 hours [15:02:13] LeslieCarr: I have added an Icinga check for lucene :-] [15:02:31] PROBLEM - Puppet freshness on amssq45 is CRITICAL: Puppet has not run in the last 10 hours [15:02:31] PROBLEM - Puppet freshness on kaulen is CRITICAL: Puppet has not run in the last 10 hours [15:02:31] PROBLEM - Puppet freshness on carbon is CRITICAL: Puppet has not run in the last 10 hours [15:02:31] PROBLEM - Puppet freshness on sq60 is CRITICAL: Puppet has not run in the last 10 hours [15:02:31] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [15:03:44] PROBLEM - Puppet freshness on amssq39 is CRITICAL: Puppet has not run in the last 10 hours [15:03:44] PROBLEM - Puppet freshness on amssq49 is CRITICAL: Puppet has not run in the last 10 hours [15:03:44] PROBLEM - Puppet freshness on cp3020 is CRITICAL: Puppet has not run in the last 10 hours [15:03:44] PROBLEM - Puppet freshness on knsq29 is CRITICAL: Puppet has not run in the last 10 hours [15:03:44] PROBLEM - Puppet freshness on lvs1 is CRITICAL: Puppet has not run in the last 10 hours [15:03:44] PROBLEM - Puppet freshness on yvon is CRITICAL: Puppet has not run in the last 10 hours [15:04:44] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [15:05:46] PROBLEM - Puppet freshness on cerium is CRITICAL: Puppet has not run in the last 10 hours [15:05:46] PROBLEM - Puppet freshness on knsq20 is CRITICAL: Puppet has not run in the last 10 hours [15:05:46] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [15:05:46] PROBLEM - Puppet freshness on sq67 is CRITICAL: Puppet has not run in the last 10 hours [15:05:46] PROBLEM - Puppet freshness on sq83 is CRITICAL: Puppet has not run in the last 10 hours [15:06:44] PROBLEM - Puppet freshness on sq65 is CRITICAL: Puppet has not run in the last 10 hours [15:07:44] PROBLEM - Puppet freshness on amssq57 is CRITICAL: Puppet has not run in the last 10 hours [15:07:44] PROBLEM - Puppet freshness on dobson is CRITICAL: Puppet has not run in the last 10 hours [15:07:44] PROBLEM - Puppet freshness on gurvin is CRITICAL: Puppet has not run in the last 10 hours [15:07:44] PROBLEM - Puppet freshness on nescio is CRITICAL: Puppet has not run in the last 10 hours [15:07:44] PROBLEM - Puppet freshness on sq72 is CRITICAL: Puppet has not run in the last 10 hours [15:07:44] PROBLEM - Puppet freshness on sq73 is CRITICAL: Puppet has not run in the last 10 hours [15:08:45] PROBLEM - Puppet freshness on knsq24 is CRITICAL: Puppet has not run in the last 10 hours [15:08:45] PROBLEM - Puppet freshness on sq79 is CRITICAL: Puppet has not run in the last 10 hours [15:09:44] PROBLEM - Puppet freshness on sq51 is CRITICAL: Puppet has not run in the last 10 hours [15:09:56] welcome back! 
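For the commons.wikimedia.org 500s reported above at 14:52, a quick way to reproduce and narrow the failure is to request the reported URL, then variants with individual parameters removed, comparing only the status lines. A sketch using the exact URL from the report:

    # full URL from the report -- currently returning HTTP 500
    curl -sI 'http://commons.wikimedia.org/w/index.php?title=MediaWiki:Filepage.css&action=raw&maxage=2678400&usemsgcache=yes&ctype=text%2Fcss&smaxage=2678400' | head -1
    # same request minus usemsgcache=yes, to see whether that parameter is the trigger
    curl -sI 'http://commons.wikimedia.org/w/index.php?title=MediaWiki:Filepage.css&action=raw&maxage=2678400&ctype=text%2Fcss&smaxage=2678400' | head -1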
(sorrry, missed your messages) [15:10:15] welp, i'm trying to finalize the locke -> gadolinium move, if you have time today let's talk about what needs to happen for your stuff [15:10:24] Jeff_Green ^ [15:10:48] PROBLEM - Puppet freshness on sq43 is CRITICAL: Puppet has not run in the last 10 hours [15:10:48] PROBLEM - Puppet freshness on sq85 is CRITICAL: Puppet has not run in the last 10 hours [15:10:48] PROBLEM - Puppet freshness on sq74 is CRITICAL: Puppet has not run in the last 10 hours [15:10:48] PROBLEM - Puppet freshness on sq48 is CRITICAL: Puppet has not run in the last 10 hours [15:11:44] PROBLEM - Puppet freshness on amssq53 is CRITICAL: Puppet has not run in the last 10 hours [15:11:44] PROBLEM - Puppet freshness on ekrem is CRITICAL: Puppet has not run in the last 10 hours [15:11:44] PROBLEM - Puppet freshness on knsq16 is CRITICAL: Puppet has not run in the last 10 hours [15:11:44] PROBLEM - Puppet freshness on ms10 is CRITICAL: Puppet has not run in the last 10 hours [15:11:44] PROBLEM - Puppet freshness on praseodymium is CRITICAL: Puppet has not run in the last 10 hours [15:11:44] PROBLEM - Puppet freshness on sq76 is CRITICAL: Puppet has not run in the last 10 hours [15:11:44] PROBLEM - Puppet freshness on titanium is CRITICAL: Puppet has not run in the last 10 hours [15:11:45] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [15:12:45] PROBLEM - Puppet freshness on sq54 is CRITICAL: Puppet has not run in the last 10 hours [15:12:46] PROBLEM - Puppet freshness on sq58 is CRITICAL: Puppet has not run in the last 10 hours [15:13:44] PROBLEM - Puppet freshness on calcium is CRITICAL: Puppet has not run in the last 10 hours [15:13:44] PROBLEM - Puppet freshness on knsq18 is CRITICAL: Puppet has not run in the last 10 hours [15:13:44] PROBLEM - Puppet freshness on ssl1001 is CRITICAL: Puppet has not run in the last 10 hours [15:14:44] PROBLEM - Puppet freshness on amssq47 is CRITICAL: Puppet has not run in the last 10 hours [15:14:44] PROBLEM - Puppet freshness on sq33 is CRITICAL: Puppet has not run in the last 10 hours [15:15:44] PROBLEM - Puppet freshness on sq41 is CRITICAL: Puppet has not run in the last 10 hours [15:15:44] PROBLEM - Puppet freshness on sq52 is CRITICAL: Puppet has not run in the last 10 hours [15:15:44] PROBLEM - Puppet freshness on dataset1001 is CRITICAL: Puppet has not run in the last 10 hours [15:16:45] PROBLEM - Puppet freshness on sq84 is CRITICAL: Puppet has not run in the last 10 hours [15:16:45] PROBLEM - Puppet freshness on ssl1003 is CRITICAL: Puppet has not run in the last 10 hours [15:17:44] PROBLEM - Puppet freshness on capella is CRITICAL: Puppet has not run in the last 10 hours [15:17:45] PROBLEM - Puppet freshness on singer is CRITICAL: Puppet has not run in the last 10 hours [15:17:45] PROBLEM - Puppet freshness on sq62 is CRITICAL: Puppet has not run in the last 10 hours [15:18:44] PROBLEM - Puppet freshness on amssq35 is CRITICAL: Puppet has not run in the last 10 hours [15:18:44] PROBLEM - Puppet freshness on niobium is CRITICAL: Puppet has not run in the last 10 hours [15:18:44] PROBLEM - Puppet freshness on sq68 is CRITICAL: Puppet has not run in the last 10 hours [15:19:44] PROBLEM - Puppet freshness on lvs3 is CRITICAL: Puppet has not run in the last 10 hours [15:20:44] PROBLEM - Puppet freshness on lvs1001 is CRITICAL: Puppet has not run in the last 10 hours [15:21:49] PROBLEM - Puppet freshness on amssq40 is CRITICAL: Puppet has not run in the last 10 hours [15:21:49] PROBLEM - Puppet 
freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [15:21:49] PROBLEM - Puppet freshness on manutius is CRITICAL: Puppet has not run in the last 10 hours [15:24:00] !log swapping disk 6 ms-be1012 [15:24:07] Logged the message, Master [15:25:56] ottomata: ok, want to talk now? [15:26:09] sure [15:26:26] * Jeff_Green checking RT [15:26:50] not sure how up to date the RT is [15:26:50] so [15:27:01] almost all of the udp2log stuff is moved to gadolinium [15:27:06] i'm working on webstatscollector now [15:27:06] but [15:27:07] there's one dependency I put in, ma rk may have dealt with it [15:27:18] i haven't touched any custom FR stuff [15:27:23] your filters are on gadolinium [15:27:26] logging into a separate dir [15:27:34] /a/log/fundraising [15:27:45] there's really not much--the filters, an nfs mount, and a cron-script that sweeps things across nfs [15:27:49] oh i see it [15:27:53] the dep [15:28:02] i haven't managed to find either yet :-( [15:28:42] mark around? [15:29:54] https://rt.wikimedia.org/Ticket/Display.html?id=4710 [15:30:03] https://rt.wikimedia.org/Ticket/Display.html?id=4720 [15:30:28] yep [15:30:52] i need to get the nfs mount done for the rest to work [15:32:13] ah right [15:32:15] you need that now? [15:32:24] yeah--just the acl adjusted [15:37:00] PROBLEM - DPKG on mw1209 is CRITICAL: NRPE: Command check_dpkg not defined [15:37:10] PROBLEM - Disk space on mw1209 is CRITICAL: NRPE: Command check_disk_space not defined [15:37:20] PROBLEM - RAID on mw1209 is CRITICAL: NRPE: Command check_raid not defined [15:38:44] paravoid: around? [15:38:48] yes [15:40:05] i can't get megacli to work for me so i can add the new disk back...unless someone else is a megacli genius I may have to reboot and cfg via raid bios [15:40:13] need your okay to do that [15:41:22] I don't mind, but it kinda sucks to have to reboot machines to swap disks [15:41:48] so let me give it a go [15:42:05] that's ms-be1012, right? [15:42:09] correct [15:43:40] PROBLEM - Puppet freshness on constable is CRITICAL: Puppet has not run in the last 10 hours [15:46:00] btw, ms-be1012 has a bunch of broken disks [15:46:52] hah, I did a PDList and it's now completely stuck [15:48:10] PROBLEM - RAID on ms-be1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:48:15] mark thanks re. nfs [15:48:54] paravoid: give it a minute ...did that to me [15:49:08] i noticed disk 0/slot 0 is bad now [15:51:00] RECOVERY - RAID on ms-be1012 is OK: OK: State is Optimal, checked 1 logical device(s) [15:53:18] ottomata: I'll pick up where I left off re. gadolinium, however I think we still have banners up until the end of the month--and I don't want to do any cut over until fundraising is quiet [15:53:26] that's fine [15:53:38] we can leave the old stuff on locke running til then [15:53:47] just fyi the filters are up and running on gadolinium too [15:53:50] k [15:53:58] so you should be able to set everything up there and verify before we actually turn locke off [15:53:59] the fr logs are just growing? [15:54:24] (fine if so, just curious) [15:56:16] cmjohnson1: there's something weird with Ceph that I'm troubleshooting now [15:56:37] ok..i noticed that all but 2 osd's are down on ms-be1012 [15:56:51] it's not just that [15:56:59] this escalated to a full ceph outage [15:57:03] I'm debugging this with ceph people [15:57:09] ok...let me know [15:57:14] I will [15:57:16] question though [15:57:22] did you also ran megacli on saturday? 
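For the megacli trouble above (getting the swapped disk on ms-be1012 back into the array without rebooting into the RAID BIOS), the usual MegaCli sequence is roughly the sketch below. The enclosure:slot address [32:0] and adapter 0 are placeholders, and the binary name and path vary by install:

    # list drives and states; a freshly swapped disk often shows Unconfigured(bad)
    MegaCli -PDList -aALL | grep -E 'Enclosure Device|Slot Number|Firmware state'
    # mark it good and clear any foreign config left over from its previous life
    MegaCli -PDMakeGood -PhysDrv '[32:0]' -a0
    MegaCli -CfgForeign -Clear -a0
    # once the controller starts rebuilding onto it, watch progress
    MegaCli -PDRbld -ShowProg -PhysDrv '[32:0]' -a0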
[15:57:46] i don't think so [15:57:51] okay [16:02:43] hm, erg [16:02:47] how do I fix this? [16:02:47] err: Could not retrieve catalog from remote server: Error 400 on SERVER: Could not find resource 'Class[Nrpe::Packages]' for relationship on 'Nrpe::Check[check_dpkg]' on node stat1.wikimedia.org [16:02:54] started happening on a few nodes yesterday [16:03:22] LeslieCarr or notpeter? (did I ask about this yesterday? did you tell me to go do something on stafford?) [16:04:02] ah yep [16:04:17] the puppet clean configs - lemme look up the command again [16:04:19] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [16:04:38] often (but not always) it's because puppetmaster was overloaded and made some stupid compiling the config error [16:04:54] whenever you get an error 400 [16:05:35] puppetstoredconfigclean.rb $fqdn [16:08:19] RECOVERY - Puppet freshness on ms1 is OK: puppet ran at Tue Mar 26 16:08:14 UTC 2013 [16:09:19] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [16:10:39] RECOVERY - Puppet freshness on ms1 is OK: puppet ran at Tue Mar 26 16:10:36 UTC 2013 [16:10:56] New patchset: Demon; "Configure PoolCounter for search requests" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/55777 [16:11:27] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [16:13:09] RECOVERY - Puppet freshness on ms1 is OK: puppet ran at Tue Mar 26 16:13:01 UTC 2013 [16:13:20] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [16:15:40] RECOVERY - Puppet freshness on ms1 is OK: puppet ran at Tue Mar 26 16:15:31 UTC 2013 [16:16:21] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [16:16:39] RECOVERY - Puppet freshness on locke is OK: puppet ran at Tue Mar 26 16:16:33 UTC 2013 [16:18:19] RECOVERY - Puppet freshness on ms1 is OK: puppet ran at Tue Mar 26 16:18:17 UTC 2013 [16:19:21] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [16:20:20] RECOVERY - Puppet freshness on ms1 is OK: puppet ran at Tue Mar 26 16:20:14 UTC 2013 [16:20:20] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [16:20:32] RECOVERY - Puppet freshness on mw1209 is OK: puppet ran at Tue Mar 26 16:20:21 UTC 2013 [16:21:41] RECOVERY - Puppet freshness on ms1 is OK: puppet ran at Tue Mar 26 16:21:38 UTC 2013 [16:22:22] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [16:22:40] RECOVERY - Puppet freshness on ms1 is OK: puppet ran at Tue Mar 26 16:22:34 UTC 2013 [16:23:22] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [16:28:56] heya [16:29:09] woosters: did you follow up with binasher on https://bugzilla.wikimedia.org/46378 ? [16:29:42] oh, i did ask Peter to review it [16:29:50] i will followup with him [16:29:57] and get back to u later today [16:30:24] ^demon: i have another request, while we are waiting for that: the log of the dispatchCHanges cron job. it's not picking up as we hoped after yesterday's deployment, and I'd like to see what's up with that. [16:31:03] i may need some debug logs in addition to that later, but the log of the cron job would already help. [16:31:39] PROBLEM - Puppet freshness on mw1044 is CRITICAL: Puppet has not run in the last 10 hours [16:33:09] PROBLEM - RAID on ms-be1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
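The "Error 400 ... Could not find resource 'Class[Nrpe::Packages]'" failures above are what LeslieCarr's puppetstoredconfigclean.rb tip at 16:05 targets: stale stored-config rows on the puppetmaster. A sketch of the sequence, assuming the script is on the master's PATH as quoted (note that, as the later discussion shows, this does not help when the manifest itself no longer declares the class):

    # on the puppetmaster: purge the node's stale stored configs
    puppetstoredconfigclean.rb stat1.wikimedia.org
    # then trigger a fresh catalog run on the affected node
    # (2.7-era invocation assumed; older installs use `puppetd --test`)
    puppet agent --test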
[16:33:29] PROBLEM - SSH on ms-be1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:33:39] PROBLEM - Puppet freshness on mw1126 is CRITICAL: Puppet has not run in the last 10 hours [16:33:40] sigh [16:33:46] fcking hardware [16:33:49] PROBLEM - DPKG on ms-fe1002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [16:33:54] <^demon> DanielK_WMDE: /var/log/wikidata/dispatcher(2).log? [16:33:59] RECOVERY - RAID on ms-be1012 is OK: OK: State is Optimal, checked 1 logical device(s) [16:34:06] ^demon: i guess? [16:34:17] i don't know where it writes to... [16:34:21] RECOVERY - SSH on ms-be1012 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [16:34:29] <^demon> Hmm, I don't see said logs on hume. [16:34:36] o_O [16:34:39] <^demon> Probably because there's no ./wikidata/ in /var/log/ [16:34:40] huh? [16:34:46] <^demon> And I doubt that user can create it. [16:34:55] ugh [16:34:56] hm [16:35:03] paravoid: anything i can do to help? [16:35:11] i seem to recall reedy saying something about logrotate for this [16:35:12] not really [16:35:21] cmjohnson1: do you have spare disks to swap slot 0? [16:35:32] cmjohnson1: let's not yet, but I may ask you in 10' if you do have [16:35:38] ^demon: who shall we poke about this? shall we go via bugzilla? [16:35:46] PROBLEM - Puppet freshness on mw1073 is CRITICAL: Puppet has not run in the last 10 hours [16:35:46] PROBLEM - Puppet freshness on mw1087 is CRITICAL: Puppet has not run in the last 10 hours [16:35:57] i suppose while we are at it, we could request --verbose to be added to the cron job [16:35:58] i do have disk destined for other places but can use [16:36:07] LeslieCarr, I ran puppetstoredconfigclean.rb stat1.wikimedia.org on stafford [16:36:09] <^demon> BAH [16:36:09] still same error [16:36:11] <^demon> Wrong server. [16:36:16] err: Could not retrieve catalog from remote server: Error 400 on SERVER: Could not find resource 'Class[Nrpe::Packages]' for relationship on 'Nrpe::Check[check_dpkg]' on node stat1.wikimedia.org [16:37:22] aww [16:37:56] RECOVERY - DPKG on ms-fe1002 is OK: All packages OK [16:38:07] hrm, perhaps this is a "must clear out relevant bits on the db" -- which i don't actually know how to do, but i believe notpeter does --- and if we're really nice maybe we can get him to add the instructions onto a wikitech page :) [16:38:19] PROBLEM - RAID on ms-be1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:38:41] <^demon> DanielK_WMDE: Here's the tail of dispatcher.log: http://p.defau.lt/?qerxO_6TwO_k4ixnb4duFQ, dispatcher2.log: http://p.defau.lt/?_ESXfbN4lY1O_Qqdi_rp9w [16:39:02] interesting [16:42:06] RECOVERY - RAID on ms-be1012 is OK: OK: State is Optimal, checked 1 logical device(s) [16:42:13] ^demon: thanks. we'll need more info, but it's a good start [16:42:30] New patchset: Matthias Mullie; "Enable AFTv5 on frwiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/55049 [16:43:00] ^demon: can you grep to find some 'changes to enwiki' in there? [16:43:17] most likely to be connected to wikidata [16:44:05] <^demon> aude: 16:39:36 Posted 1000 changes to enwiki, up to ID 13925047, timestamp 20130325172254. Lag is 83802 seconds. Next ID is 13925047. [16:44:09] thanks [16:44:16] <^demon> yw [16:49:27] ^demon: can you get us profiling info for the dispatcher script too? [16:53:13] robla, woosters, ^demon: as a number of interest: we have 2 dispatchers, each seems to be pushing about 4000 changes per hour to all client wikis. 
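DanielK_WMDE's throughput figure here, combined with the capacity math that follows just below (about 8000 changes/hour of capacity against roughly 17000/hour incoming, addressed by letting each 5-minute cron entry run for 900 seconds so about three invocations overlap per entry), amounts to a crontab shaped like this sketch. The cron.d path, user, script location, and the 2-minute stagger are assumptions; only the dispatchChanges.php arguments come from the discussion:

    # /etc/cron.d/wikidata-dispatcher (path, user, script location assumed)
    # each entry fires every 5 min and runs up to 15 min (--max-time 900),
    # so ~3 invocations per entry overlap: ~6 dispatchers, ~24k changes/hour
    */5 * * * *    apache  mwscript extensions/Wikibase/lib/maintenance/dispatchChanges.php --wiki wikidatawiki --max-time 900 --verbose
    2-59/5 * * * * apache  mwscript extensions/Wikibase/lib/maintenance/dispatchChanges.php --wiki wikidatawiki --max-time 900 --verbose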
[16:54:32] so, we should be able to handle up to 8000 changes per hour right now. we should (!) be able to scale out by adding more dispatcher processes. [16:54:38] <^demon> The profiling info is available on noc: https://noc.wikimedia.org/cgi-bin/report.py [16:54:43] <^demon> (Although it seems to be lagging?) [16:54:43] we are currently seeing rouchly 17000 changes per hour though [16:54:53] so we need at least two more dispatcher processes... [16:54:57] fun. [16:56:52] ^demon: what is lagging? the profiler? [16:57:04] oh yea, slow... [16:57:05] <^demon> The script, I guess. [16:57:37] DanielK_WMDE: dispatchChanges.php --wiki wikidatawiki --max-time 300 (every 5 min) [16:57:40] New patchset: Yurik; "Beeline IPs, unified default lang redirect from m. & zero." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55302 [16:57:49] the other every 7 min, max-time 420 [17:00:09] aude: i'd say make them both run every 5 minutes with: dispatchChanges.php --wiki wikidatawiki --max-time 900 --verbose [17:00:18] ok [17:00:58] so each cron job is triggering 3 "overlapping" processes, giving us 6 total, 3 times what we have now. [17:01:11] that should cover us for 24k changes/hour (assuming linear scaling) [17:01:23] i'd prefer the second cron job to be on a different box... [17:01:30] +1 :) [17:01:51] ^demon: is there another box we could run a dispatcher on? [17:02:08] having two cron jobs on the same box is a bit silly, it was intended as a proof of concept. [17:02:13] <^demon> Not yet. We're working on firing up another maintenance box in eqiad. [17:02:23] hmmmmm [17:04:23] !log added (crappy, but better than nothing) .deb for webstatscollector to apt.wikimedia.org [17:04:29] Logged the message, Master [17:04:32] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [17:04:55] i can submit a patch for the 2 cronjobs on hume [17:05:20] aude: please do; so we at least have something concrete to discuss. [17:05:22] ottomata: as I couldn't keep an eye on this channel for the last ~2h: Did you get some feedback on http://commons.wikimedia.org/w/index.php?title=MediaWiki:Filepage.css&action=raw&maxage=2678400&usemsgcache=yes&ctype=text%2Fcss&smaxage=2678400 creating an HTTP 5** error, but that it works when dropping &usemsgcache=yes or &action=raw ? [17:05:23] * aude can't put them elsewhere [17:05:38] I didn't ask since LeslieCarr was up (since she's on RT duty) [17:05:45] LeslieCarr, there are TONS more 500 errors in the last couple of days [17:05:54] i'm just looking at filesizes of the 5xx logs [17:06:07] gah [17:06:10] why'd you do that ottomata ? [17:06:13] ah, I see [17:06:15] i checked what was coming in now, and saw a lot of 500s on requests like or identical to the ones andre__ just posted [17:06:54] New patchset: Aude; "Update cronjobs for wikidata" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55904 [17:07:03] DanielK_WMDE: 17:06 <+gerrit-wm> New patchset: Aude; "Update cronjobs for wikidata" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55904 [17:07:11] LeslieCarr, eehhh, do what? [17:07:12] look at things? [17:07:29] make the 500's happen [17:07:37] ah yeah [17:07:46] sorry, i just hit refresh 50 billion times [17:07:50] it was fun! [17:07:54] I had a big ol' refresh party [17:07:57] I invited mah friends [17:08:14] we put on techno music and just mashed refresh over and over again [17:08:36] But! 
LeslieCarr, I ask you because your name is in the topic :), who do you think would care or want to check up on that? [17:11:50] New review: Daniel Kinzler; "This will cause 3 "overlapping" processes to run from each of the two cron entries, effectively givi..." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/55904 [17:13:40] paravoid, i shortened the list using python lib (hope it worked ok) [17:14:10] !log Change I34a488247 is now merged in wmf/1.21wmf12 and pulled on fenari for test.wikipedia.org, do not sync. [17:14:16] Logged the message, Master [17:14:24] :) [17:14:27] hello [17:14:35] anybody responsible for the fundraising banners here? [17:14:39] K4-713: [17:14:47] What's up? [17:14:49] https://pl.wikipedia.org/wiki/Loren_Acton is displaying three banners, all the same, one under another (when logged out) - and if this isn't fixed promptly, i'm going to hide the banners in site CSS. (i'm a sysop at pl.wp) [17:15:18] (this seems to affect all articles) [17:15:21] Argh! We have had very sporadic reports of that happening, and can't reproduce the problem. [17:15:24] What are you running? [17:15:29] and this has been happening for at least a few hours now, consistently [17:15:33] on multiple computers and browsers [17:15:37] in the morning, there were even four [17:15:45] opera right now, but seen it on chrome as well [17:15:57] Okay, we'll take down the campaigns. [17:16:22] and it's only shown for logged out users, so i only realized this was happening when a non-wikipedia friend pointed it out :/ [17:17:07] it doesn't seem to happen in debug mode [17:17:13] (there's just one banner) [17:17:34] Huh. So, the act of trying to reproduce the problem, might prevent it from happening? [17:17:46] That's incredibly helpful, actually. [17:18:13] yeah, i was going to get dirty with debugging, and noticed :P [17:20:19] funnily, it seems to have fixed itself now - did you do anything? [17:20:24] RECOVERY - RAID on mw1054 is OK: OK: no RAID installed [17:20:24] RECOVERY - RAID on mw1004 is OK: OK: no RAID installed [17:20:24] RECOVERY - Disk space on mw1016 is OK: DISK OK [17:20:24] RECOVERY - DPKG on mw1010 is OK: All packages OK [17:20:24] RECOVERY - Disk space on mw1195 is OK: DISK OK [17:20:24] RECOVERY - Disk space on mw1063 is OK: DISK OK [17:20:24] RECOVERY - Disk space on mw1079 is OK: DISK OK [17:20:25] RECOVERY - Disk space on mw1076 is OK: DISK OK [17:20:32] RECOVERY - Disk space on mw1183 is OK: DISK OK [17:20:33] damned bots. [17:20:33] RECOVERY - DPKG on mw1073 is OK: All packages OK [17:20:33] RECOVERY - Disk space on mw115 is OK: DISK OK [17:20:33] RECOVERY - Disk space on mw48 is OK: DISK OK [17:20:33] RECOVERY - Disk space on mw2 is OK: DISK OK [17:20:33] RECOVERY - Disk space on mw96 is OK: DISK OK [17:20:33] RECOVERY - Disk space on mw11 is OK: DISK OK [17:20:34] funnily, it seems to have fixed itself now - did you do anything? [17:20:42] RECOVERY - Disk space on mw1116 is OK: DISK OK [17:20:46] K4-713: funnily, it seems to have fixed itself now - did you do anything? [17:20:52] RECOVERY - Disk space on mw1005 is OK: DISK OK [17:20:52] RECOVERY - Disk space on mw1109 is OK: DISK OK [17:20:52] RECOVERY - DPKG on mw1046 is OK: All packages OK [17:20:52] RECOVERY - Disk space on mw1073 is OK: DISK OK [17:20:52] RECOVERY - Disk space on mw1158 is OK: DISK OK [17:21:00] Argh. No, not yet. [17:21:08] :| [17:21:32] * MatmaRex tries another browser [17:21:54] But the campaigns are down now.
[17:22:02] MatmaRex -- so you just loaded the page and got multiple banners? [17:22:10] mwalker: yes [17:22:55] mwalker: and i saw that on at least two browsers, on two computers in different towns today [17:23:12] cool [17:23:12] it magically fixed itself when i started debugging it [17:23:18] of course :) [17:23:30] mark, or if you have a moment, could you review - i'm trying to launch a very large new partner tonight - https://gerrit.wikimedia.org/r/#/c/55302 [17:23:36] *crossing fingers* I can replicate it locally [17:23:47] (i mean - we are trying to launch, i'm trying to get the patch in :)) [17:27:51] mwalker: hey, it's back on [17:28:06] mwalker: if you want a quick fix - why don't you ensure window.insertBanner() is called just once [17:28:25] i just confirmed that it's called repeatedly, somehow [17:28:44] ya -- that's one solution -- I'd rather figure out why it's doing that though [17:28:55] and... you just got banners? [17:29:31] mwalker: yup. three of them [17:29:32] yurik: mark's in europe and is gone for the night [17:29:39] i'm trying to debug what exactly is happening [17:29:41] yurik: binasher may be able to properly review [17:31:02] MatmaRex: did you by any chance happen to have a network logger running when you got these? ie: dragonfly or firebug [17:31:02] PROBLEM - DPKG on ms-be1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:31:07] * Aaron|home still sees the same fucking PopulateFundraisingStatistics::updateDays errors [17:31:22] PROBLEM - RAID on ms-be1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:31:41] mwalker: i did, and i still do [17:31:47] okay, i think i see what's happening [17:31:50] Aaron|home: I dimly recall something about this -- can you refresh my context though? [17:31:53] Special:BannerRandom is loaded twice [17:31:56] with different URLs [17:32:00] i mean, at least twice [17:32:07] ok -- cool [17:32:08] yikes [17:32:10] i've got a breakpoint in that function :) and i'm stopped on it [17:32:13] https://meta.wikimedia.org/wiki/Special:BannerRandom?userlang=pl&sitename=Wikipedia&project=wikipedia&anonymous=true&bucket=1&country=PL&device=desktop&slot=2 [17:32:16] https://meta.wikimedia.org/wiki/Special:BannerRandom?userlang=pl&sitename=Wikipedia&project=wikipedia&anonymous=true&bucket=1&country=PL&device=desktop&slot=15 [17:32:21] they differ in the slot field [17:32:29] New patchset: Kaldari; "Turning Thanks extension on on MediaWiki wiki." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/55908 [17:32:32] PROBLEM - SSH on ms-be1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:32:47] hurm... so the controller initializer is getting called twice [17:32:48] fuuuun [17:33:58] mwalker: can you apply some sort of a quick fix now? 
i don't know, maybe just do window.insertBanner = $.noop at the end of insertBanner() or something [17:34:10] because it's pretty ugly and disruptive :/ [17:35:00] MatmaRex: we should be coming out of the 15 minutes of cache soon -- so even if it makes multiple calls it won't have content to server [17:35:04] *serve [17:35:16] RECOVERY - RAID on ms-be1012 is OK: OK: State is Optimal, checked 1 logical device(s) [17:35:17] RECOVERY - SSH on ms-be1012 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [17:35:23] hm [17:35:39] mwalker: look in dberrors.log [17:35:42] i have no idea how centralnotice works internally, so i'm going to trust you on this ;) [17:35:44] it's been spamming for ages [17:35:46] RECOVERY - DPKG on ms-be1012 is OK: All packages OK [17:35:50] * Aaron|home is tempted to disable the shit [17:36:14] mwalker: also, funnily [17:36:18] Aaron|home: it's not like it works anyways... [17:36:24] mwalker: the second call to insertBanner() inserts two banners at once [17:36:29] don't ask me how [17:36:53] MatmaRex: there's two centralnotice divs [17:36:55] so who is responsible for maintaining it and thus turning it off properly? [17:36:56] i'm now debugging inside loadRandomBanner() [17:37:17] Aaron|home: uh; technically fundraising; but we don't know anything about it either -- it's all legacy code [17:37:42] Jeff_Green: do you know anything about FundraiserStatistics? [17:38:10] MatmaRex: want to move over to #tech? [17:38:26] mwalker: hmm. it's not ringing a bell but my head is still on vacation. tell me more? [17:38:36] mwalker: sure [17:39:11] Jeff_Green: we're having database issues with it -- and it's not properly working for 2013 anyways; so Aaron proposes to just turn it off and I agree [17:39:25] but there's a question of how it actually works in order for us to do that [17:39:27] is this the public-facing report? [17:40:19] yep [17:41:46] mwalker: I don't know much, but there are a couple mentions in server admin log [17:41:56] how is this possible? [17:42:14] how is what possible? [17:42:35] that this thing is running and everyone knows fuck all about it? [17:43:00] oh; because up until the 1st of January of this year; it was just working [17:43:03] I don't think it's old as, say, the search system is it? [17:43:19] afaik there's not a lot to it [17:44:03] should read off of a fundraising slave db, db1025 [17:45:06] PROBLEM - Host mw1085 is DOWN: PING CRITICAL - Packet loss = 100% [17:45:13] mwalker: where are you seeing evidence of the db issue? [17:45:46] RECOVERY - DPKG on mw1085 is OK: All packages OK [17:45:46] RECOVERY - RAID on mw1085 is OK: OK: no RAID installed [17:45:56] RECOVERY - Disk space on mw1085 is OK: DISK OK [17:45:56] RECOVERY - Host mw1085 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [17:46:20] aaron says dberrors.log; I guess on flourine? [17:46:37] i can't even find the damned tool :-P [17:46:47] New review: Faidon; "Can you please split the unrelated to Beeline changes into a separate commit? Since this affects a l..." [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/55302 [17:46:51] yurik: ^ [17:47:08] oh there it is... 
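Given the two Special:BannerRandom requests quoted above differ only in the slot parameter, one way to confirm that each slot independently serves a banner payload (so the duplication lies in how many times the client fires the call, not in the responses) is a plain loop over the quoted URLs, sketched here:

    # fetch the banner endpoint for the two slot values seen on the page
    for slot in 2 15; do
      curl -s "https://meta.wikimedia.org/wiki/Special:BannerRandom?userlang=pl&sitename=Wikipedia&project=wikipedia&anonymous=true&bucket=1&country=PL&device=desktop&slot=$slot" | head -c 200
      echo    # newline between the two truncated payloads
    done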
[17:49:32] ugh duplicate key errors [17:50:07] PROBLEM - Puppet freshness on barium is CRITICAL: Puppet has not run in the last 10 hours [17:50:28] New patchset: Pyoungmeister; "only include nrpe checks on internal hosts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55910 [17:51:05] !log removing disk from slot 0 ms-be1012 [17:51:12] Logged the message, Master [17:52:13] New review: Faidon; "Drop the lsbdistid if, I think it's a remnant from when our puppet tried to support Solaris. No need..." [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/55910 [17:52:44] New patchset: Pyoungmeister; "only include nrpe checks on internal hosts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55910 [17:53:17] notpeter: see my review btw [17:53:43] mwalker: on hume "mwscript extensions/ContributionReporting/PopulateFundraisingStatistics.php foundationwiki --op populatefundraisers" runs every 5 minutes [17:53:48] paravoid: ok [17:54:03] paravoid: yeah, I was thinking the same thing :) [17:54:06] you uploaded a PS right after I did, so I thought it might get lost :) [17:54:06] mwalker: it does an insert which is failing on a primary key error [17:54:12] heh [17:55:07] PROBLEM - Puppet freshness on marmontel is CRITICAL: Puppet has not run in the last 10 hours [17:55:27] New patchset: Pyoungmeister; "only include nrpe checks on internal hosts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55910 [17:56:13] Jeff_Green: I'm guessing this is also why it's broken for 2013 -- I can take a poke at it after I finish putting out this CentralNotice problem [17:56:34] mwalker: k. [17:56:42] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55910 [17:57:19] Jeff_Green: and... if you find yourself having copious free time today -- if you can get the proxy/analytics box set up we can start retiring this legacy PoS [17:58:19] mwalker: that's my main priority for this week. today is screwy though, I have an appt and will be afk for a while this afternoon [17:58:31] so is this getting fixed? [18:01:13] Aaron|home: yes [18:01:28] hopefully you'll stop seeing it by the end of the day [18:02:39] did the dup key errors start today or have they been ongoing? [18:04:24] ongoing I'm guessing [18:05:06] we're consistently seeing varnish 500 errors on mobile watchlist views - anybody know what might be going on? 
[18:05:27] to replicate: log in at en.m.wikipedia.org, tap 'watchlist' from the nav, then tap 'modified' and experience the glory of varnish 500 [18:05:36] mwalker: appears to have started on 1/29/2013 [18:05:45] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/55908 [18:06:23] this apparently has been going on since at least last night (pacific) [18:08:32] New patchset: Ottomata; "Puppetizing webstatscollector on gadolinium" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55917 [18:09:29] New patchset: Matthias Mullie; "Enable AFTv5 on frwiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/55049 [18:10:13] New patchset: Dzahn; "move account awight from admins::restricted to admins::mortals (RT-4819)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55918 [18:10:24] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55917 [18:12:20] New patchset: Pyoungmeister; "check lucene disk space is now redundant with general space mon" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55919 [18:12:32] !log Change I34a488247 (mw.loader debug) has been reverted in wmf/1.21wmf12 and pulled on fenari [18:12:32] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [18:12:33] Logged the message, Master [18:13:36] !log krinkle synchronized php-1.21wmf12/resources/mediawiki/mediawiki.js 'I80af730daa815 fixing bug 46575' [18:13:42] Logged the message, Master [18:13:55] anybody available to look into the 500 errors we're seeing on mobile watchlist? ^^ [18:14:26] New patchset: Dzahn; "move account mwalker from admins::restricted to admins::mortals (RT-4820)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55920 [18:14:41] Reedy: I don't know what's going on but logmsgbot is in -operations and in -tech, causing duplicate entries [18:14:51] I know [18:14:54] Why tell me? [18:14:55] New patchset: Pyoungmeister; "unlcear why this is still be used, but oh god stop the spam" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55921 [18:14:55] Reedy: Puppet says it is in -operations. the other one must be a manual one? [18:14:56] I'm not ops [18:14:58] I can't do anything about it [18:15:06] Reedy: I figured you could fix it, but I guess not. [18:15:11] sorry :) [18:15:13] New patchset: Ottomata; "Fixing path to awk filters" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55922 [18:15:21] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55921 [18:17:48] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55919 [18:21:40] PROBLEM - Puppet freshness on colby is CRITICAL: Puppet has not run in the last 10 hours [18:21:54] awjr: all I've got is that varnish will 500 if it's out of resources -- but I don't have access nor time to help debug :( [18:22:30] apparently it's due to a php fatal coming from wikibase [18:24:52] !log mlitn synchronized wmf-config/InitialiseSettings.php 'Turning Thanks extension on on MediaWiki wiki' [18:24:58] Logged the message, Master [18:26:52] hi - a headsup: I'll be pushing some AFTv5 updates & enabling AFT5 on frwiki soon [18:31:20] New patchset: Mattflaschen; "Add labs redis subclassing the main one and setting directory." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/54970 [18:31:26] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55922 [18:33:29] PROBLEM - DPKG on terbium is CRITICAL: NRPE: Command check_dpkg not defined [18:33:40] PROBLEM - Disk space on terbium is CRITICAL: NRPE: Command check_disk_space not defined [18:33:45] Reedy: Weird, they're not two instances. It's one instance of ircecho [18:33:46] nobody 17204 0.0 0.1 198176 4304 ? Sl Mar20 2:01 python /usr/ircecho/bin/ircecho --infile=/var/log/logmsg:#wikimedia-operations,#wikimedia-tech #wikimedia-operations,#wikimedia-tech logmsgbot chat.freenode.net [18:34:01] PROBLEM - RAID on terbium is CRITICAL: NRPE: Command check_raid not defined [18:38:49] !log restarting snmtrapd on neon [18:38:55] Logged the message, Master [18:40:21] Supposedly fixed with https://github.com/wikimedia/operations-puppet/commit/db4beb13a5e82a4c1e856b13d70bcdb48e2b7228 [18:41:03] notpeter: load average: 241.69, 145.81, 124.77 [18:41:04] :) [18:41:09] LeslieCarr: Do you have root on fenari? Assuming so, can you please kill ircecho for logmsgbot and make sure the latest puppet is deployed there? [18:41:32] !log kaldari synchronized php-1.21wmf11/extensions/Echo/modules/icons 'syncing Echo icons dir' [18:41:38] Logged the message, Master [18:41:48] https://gerrit.wikimedia.org/r/#/c/55031/ fixed logmsgbot to not be in two channels (causing duplicate entries in https://wikitech.wikimedia.org/wiki/Server_admin_log ) [18:41:54] and has been merged, but not deployed yet [18:43:10] !log reedy synchronized php-1.21wmf12/extensions/MobileFrontend/includes/specials/SpecialMobileWatchlist.php [18:43:15] Logged the message, Master [18:43:34] New review: Mattflaschen; "(2 comments)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54970 [18:43:34] New review: Krinkle; "logmsgbot should not be in two channels, that causes it two log to Server admin log[1] twice." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54982 [18:43:52] New review: Krinkle; "(1 comment)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8344 [18:44:28] !log killing all snmp related processes on neon and restarting [18:44:35] Logged the message, Master [18:44:37] New review: Mattflaschen; "No, this gives:" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/54970 [18:46:03] New patchset: RobH; "added terbium into NFS allowed mounts, like hume" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55929 [18:47:47] New patchset: RobH; "added terbium into NFS allowed mounts, like hume" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55929 [18:48:02] RobH: I miss bastions named after people [18:48:10] and maintenance servers [18:48:17] !log stopping gmetad on neon [18:48:23] Logged the message, Master [18:48:26] tampa servers are named after encyclopedians [18:48:26] should anybody need me, i'll be in some other channel. i'm in way too many [18:48:29] eqiad are elements [18:48:33] ;] [18:48:33] boo [18:48:42] elements are way cooler [18:48:44] also, i've been having this recent conversation simultaneously here, on #-tech and on #-dev [18:48:48] and this is clearly bad [18:48:50] but oh well. 
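RobH's "NFS allowed mounts" patchset above (18:46) is an export-ACL change; on a plain Linux NFS server the equivalent looks like the sketch below. The export path, options, and hostnames are illustrative assumptions -- the production change lives in puppet, not in a hand-edited exports file:

    # /etc/exports (illustrative): allow terbium the same mount hume already has
    /home  hume.wikimedia.org(rw,sync,no_subtree_check)  terbium.eqiad.wmnet(rw,sync,no_subtree_check)
    # re-export without restarting the NFS server
    exportfs -ra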
not for bastions [18:48:58] bastions are snowflakes [18:48:58] * MatmaRex parts [18:49:29] RECOVERY - Puppet freshness on sq65 is OK: puppet ran at Tue Mar 26 18:49:26 UTC 2013 [18:49:29] RECOVERY - Puppet freshness on sq74 is OK: puppet ran at Tue Mar 26 18:49:26 UTC 2013 [18:49:29] RECOVERY - Puppet freshness on sq76 is OK: puppet ran at Tue Mar 26 18:49:26 UTC 2013 [18:49:29] RECOVERY - Puppet freshness on sq51 is OK: puppet ran at Tue Mar 26 18:49:26 UTC 2013 [18:49:29] RECOVERY - Puppet freshness on sq53 is OK: puppet ran at Tue Mar 26 18:49:26 UTC 2013 [18:49:34] yay [18:53:04] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55929 [18:53:23] Krinkle: isn't that latest change adding it back to 2 channels? [18:53:32] ircecho_logs = { "/var/log/logmsg" => ["#wikimedia-tech","#wikimedia-operations"] } [18:53:48] Krinkle: still need killing? [18:53:50] mutante: No, that was an odd change made by LeslieCarr 6 days ago [18:53:54] mutante: Yes [18:53:59] New review: Jeremyb; "(1 comment)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8344 [18:54:04] I don't know why leslie added it to both [18:54:25] But https://gerrit.wikimedia.org/r/#/c/55031/ fixed it and needs to be deployed [18:54:48] $ircecho_logs = { "/var/log/logmsg" => "#wikimedia-operations" } [18:54:56] exactly [18:54:57] <-- that is what is on sockpuppet [18:55:05] mutante: OK, then it just needs to be restarted [18:55:27] because, current process [18:55:28] nobody 17204 0.0 0.1 198176 4304 ? Sl Mar20 2:01 python /usr/ircecho/bin/ircecho --infile=/var/log/logmsg:#wikimedia-operations,#wikimedia-tech #wikimedia-operations,#wikimedia-tech logmsgbot chat.freenode.net [18:55:49] python /usr/ircecho/bin/ircecho --infile=/var/log/logmsg:#wikimedia-operations #wikimedia-operations .. [18:55:59] !log restarting ircecho on fenari [18:56:05] Logged the message, Master [18:56:42] New patchset: Pyoungmeister; "moving some raid check specific things to be with raid check def" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55933 [18:57:12] mutante: Thx :) [18:57:33] np [18:58:16] !log starting gmetad on neon - we failed to resolve data source name ms-fe1001.eqiad.wmnet [18:58:23] Logged the message, Master [18:59:49] i added it to both because people asked me to log it into both [18:59:51] New patchset: Pyoungmeister; "removing search::monitoring include" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55935 [19:00:04] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55933 [19:00:05] someone was like "oh no it's not in one of them" [19:00:56] LeslieCarr: It's been in operations only for a while, it only went back to wikimedia-tech because Jeremyb (accidentally?) moved it while refactoring in https://gerrit.wikimedia.org/r/#/c/8344/6 [19:01:08] LeslieCarr: It can run in both, but only if morebots is not in both as well. [19:01:13] anyhow, fixed now :) [19:01:17] i really don't care where it is :) [19:01:24] that makes two of us [19:01:25] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55935 [19:01:47] okay was there anything else people were pinging me about ?
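The duplicate Server admin log entries tracked down above come from a running ircecho whose command line still carried both channels while the puppetized config quoted at 18:54 lists only #wikimedia-operations; the fix is simply restarting the daemon so it relaunches with the current arguments. A sketch of the check and restart (the init script name is an assumption; on fenari it may just be a kill followed by a puppet run):

    # compare the live arguments against the intended config
    ps aux | grep '[i]rcecho'    # should list only #wikimedia-operations after the fix
    # restart so the process re-reads its puppet-managed arguments
    service ircecho restart || { pkill -f ircecho; puppet agent --test; }
    ps aux | grep '[i]rcecho'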
[19:02:08] !log kaldari synchronized php-1.21wmf12/extensions/Echo/modules/icons 'syncing Echo icons dir' [19:02:15] Logged the message, Master [19:03:52] !log mlitn synchronized php-1.21wmf11/extensions/ArticleFeedbackv5 'Update ArticleFeedbackv5 to master' [19:03:58] Logged the message, Master [19:04:07] notpeter: did you fix up the latest pw thing ? [19:04:17] !log mlitn synchronized php-1.21wmf12/extensions/ArticleFeedbackv5 'Update ArticleFeedbackv5 to master' [19:04:23] Logged the message, Master [19:04:59] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:05:07] !log reedy synchronized php-1.21wmf12/extensions/Collection/ [19:05:11] LeslieCarr: yeah [19:05:14] Logged the message, Master [19:05:17] cool [19:06:11] RECOVERY - Puppet freshness on db1045 is OK: puppet ran at Tue Mar 26 19:06:06 UTC 2013 [19:06:11] i love how you have to kill puppet with -9 sometimes ... [19:06:20] RECOVERY - Puppet freshness on db1043 is OK: puppet ran at Tue Mar 26 19:06:09 UTC 2013 [19:06:35] yeah, it's awesome.... [19:06:50] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.144 second response time [19:07:29] RECOVERY - Puppet freshness on db1049 is OK: puppet ran at Tue Mar 26 19:07:24 UTC 2013 [19:08:17] Change merged: Matthias Mullie; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/55049 [19:09:58] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [19:11:28] PROBLEM - DPKG on mw1212 is CRITICAL: NRPE: Command check_dpkg not defined [19:11:38] PROBLEM - Disk space on mw1212 is CRITICAL: NRPE: Command check_disk_space not defined [19:11:48] PROBLEM - RAID on mw1212 is CRITICAL: NRPE: Command check_raid not defined [19:12:31] !log mlitn synchronized wmf-config 'Enable AFTv5 on frwiki' [19:12:38] Logged the message, Master [19:13:09] PROBLEM - DPKG on virt2 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [19:14:08] PROBLEM - RAID on ms1 is CRITICAL: Connection refused by host [19:14:18] PROBLEM - SSH on ms1 is CRITICAL: Connection refused [19:14:29] PROBLEM - DPKG on ms1 is CRITICAL: Connection refused by host [19:14:40] PROBLEM - Disk space on ms1 is CRITICAL: Connection refused by host [19:15:26] just did an extension deployment 10 minutes ago that included some css updates. The new css has been picked up by palladium, but not any of the other bits servers. Is there any way to flush it on the others? [19:17:04] New patchset: Matthias Mullie; "Include AFTv5 on frwiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/55938 [19:18:15] Change merged: Kaldari; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/55938 [19:19:28] !log mlitn synchronized wmf-config/InitialiseSettings.php 'Include AFTv5 on frwiki' [19:19:35] Logged the message, Master [19:20:33] !log kaldari synchronized php-1.21wmf12/extensions/Echo/modules/icons 're-syncing Echo icons dir' [19:20:40] Logged the message, Master [19:21:47] New patchset: Krinkle; "Rename legacy 'live-1.5/' to 'w/'." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/53125 [19:22:16] Reedy: see kaldari's question above [19:22:35] tried touching and re-sycning, but that didn't fix it [19:22:41] New review: Krinkle; "Rebased (resolved conflicts). @Tim: Is this good to go? Or are there other components in the infrast..." 
[operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/53125 [19:22:51] would a scap flush them? [19:25:17] PROBLEM - DPKG on db1045 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [19:25:26] PROBLEM - Disk space on db1045 is CRITICAL: NRPE: Command check_disk_space not defined [19:25:36] PROBLEM - Disk space on db1043 is CRITICAL: NRPE: Command check_disk_space not defined [19:25:36] PROBLEM - Disk space on db1049 is CRITICAL: NRPE: Command check_disk_space not defined [19:26:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:26:36] PROBLEM - Host mw1209 is DOWN: PING CRITICAL - Packet loss = 100% [19:27:03] kaldari: What did you try touching? [19:27:16] !log mw1209 re-imaging [19:27:16] PROBLEM - NTP on ms1 is CRITICAL: NTP CRITICAL: No response from NTP server [19:27:23] Logged the message, Master [19:28:27] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 7.862 second response time [19:31:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:31:46] RECOVERY - Host mw1209 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [19:32:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time [19:32:44] !log reedy synchronized php-1.21wmf12/extensions/Wikibase/client [19:32:51] Logged the message, Master [19:34:13] PROBLEM - Apache HTTP on mw1209 is CRITICAL: Connection refused [19:34:21] PROBLEM - SSH on mw1209 is CRITICAL: Connection refused [19:37:38] Reedy: https://bits.wikimedia.org/static-1.21wmf12/extensions/Echo/modules/icons/icons.css [19:37:54] kaldari: Try resources/startup.js [19:37:59] I think that still makes a difference.. [19:38:07] ok... [19:38:11] btw Reedy / kaldari : K4-713 is going to deploy some fundraising stuff right now [19:40:03] New patchset: Ottomata; "Adding puppet-merge for sockpuppet puppet merges." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50452 [19:40:34] !log kaldari synchronized php-1.21wmf12/resources/startup.js 'syncing php-1.21wmf12/resources/startup.js' [19:40:40] Logged the message, Master [19:40:44] New patchset: Yurik; "Unified default lang redirect from m. & zero." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55302 [19:42:08] Reedy: no luck, palladium is still the only server serving the new css [19:42:21] RECOVERY - SSH on mw1209 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [19:43:04] New review: Ottomata; "I got rid of any references to origin/production, and replaced it with the SHA1 of FETCH_HEAD at the..." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/50452 [19:43:41] New patchset: Yurik; "Added all Beeline IPs to ACL" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55944 [19:45:54] !log khorn synchronized php-1.21wmf11/extensions/CentralNotice 'Fixing CentralNotice bug in which it was possible to initialize multiple times' [19:45:58] Logged the message, Master [19:46:31] PROBLEM - NTP on mw1209 is CRITICAL: NTP CRITICAL: No response from NTP server [19:46:42] New patchset: Asher; "adding new s1 dbs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55945 [19:47:21] paravoid, I split up the patch [19:47:54] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55945 [19:49:02] New review: Krinkle; "(1 comment)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/55049 [19:49:29] !log khorn synchronized php-1.21wmf12/extensions/CentralNotice 'Fixing CentralNotice bug in which it was possible to initialize multiple times' [19:49:35] Logged the message, Master [19:49:50] hashar: Looks like gerrit-wm (once again) is broken, it doesn't report the comment but the header (does exclude Patch Set, but doesn't exclude "(1 comment)") [19:50:02] hashar: Shouldn't it report the first line of the comment? [19:52:02] Krinkle: if I remember correctly, it should not report any comment. [19:52:14] Krinkle: I think we disabled that at one point to reduce the spam there [19:52:24] That's new to me. [19:52:24] most probably someone tweaked the python hooks [19:55:58] notpeter: Inc.Updater needs restarting [19:59:05] New patchset: Matthias Mullie; "Oversight request email address was invalid" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/55946 [20:01:01] New review: Matthias Mullie; "Krinkle: nice catch (and quite a stupid mistake)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/55049 [20:04:03] greg-g, Reedy: no luck getting the new CSS to take, so I'm just going to do a whole new deployment from scratch [20:04:57] oooo k [20:05:02] kaldari: do you get it served if you test it manually and append &whateveritdoesntreallymatter to the end? [20:05:51] Reedy: yeah, it works fine if I invalidate the cache with a query string param [20:06:29] touch the file on fenari and sync-file ? [20:06:36] yep, tried that [20:06:59] Reedy: Is there a way to flush a specific URL from bits? [20:07:12] yeah, you can ask ops to "ban" it [20:07:12] !log asher synchronized wmf-config/db-eqiad.php 'raising db1051 to full weight' [20:07:20] Logged the message, Master [20:07:51] RECOVERY - Puppet freshness on virt0 is OK: puppet ran at Tue Mar 26 20:07:41 UTC 2013 [20:08:06] RECOVERY - Puppet freshness on amslvs1 is OK: puppet ran at Tue Mar 26 20:08:01 UTC 2013 [20:08:15] kaldari: Have you asked Krinkle|detached? [20:08:18] Pfft, he's not here [20:08:36] RECOVERY - Puppet freshness on amslvs2 is OK: puppet ran at Tue Mar 26 20:08:26 UTC 2013 [20:08:46] RECOVERY - Puppet freshness on amslvs3 is OK: puppet ran at Tue Mar 26 20:08:36 UTC 2013 [20:08:46] RECOVERY - Puppet freshness on amslvs4 is OK: puppet ran at Tue Mar 26 20:08:38 UTC 2013 [20:11:03] Ryan_Lane: I have a specific URL in bits that won't let go of the old cached version (except on palladium, which successfully picked up the changes): https://bits.wikimedia.org/static-1.21wmf12/extensions/Echo/modules/icons/icons.css . Reedy mentioned that you could 'ban' this URL to flush it. Is that an easy thing to do?
If not, I'll just try doing the deployment over again. [20:13:49] kaldari: why is css being served from bits w/o using resourceloader? [20:14:25] it's served from resourceloader normally [20:15:26] what's the abnormal case? [20:15:32] debug=true [20:15:59] kaldari: the patch specifically is https://gerrit.wikimedia.org/r/#/c/46887/ ; but you should be able to just push master [20:16:11] mwalker: NP [20:17:16] PROBLEM - Puppet freshness on mw1137 is CRITICAL: Puppet has not run in the last 10 hours [20:19:18] PROBLEM - Puppet freshness on mw1031 is CRITICAL: Puppet has not run in the last 10 hours [20:19:18] PROBLEM - Puppet freshness on mw1098 is CRITICAL: Puppet has not run in the last 10 hours [20:20:16] PROBLEM - Puppet freshness on mw1103 is CRITICAL: Puppet has not run in the last 10 hours [20:20:23] mwalker: is it cool to just deploy CR to wmf12 or do you need wmf11 as well? [20:22:16] PROBLEM - Puppet freshness on grosley is CRITICAL: Puppet has not run in the last 10 hours [20:22:16] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours [20:23:02] kaldari: Still need me, or is it resolved now? [20:23:11] xyzram: ok [20:23:22] kaldari: purged it [20:23:43] Krinkle: Fixed now [20:23:49] binasher: Thanks! [20:23:57] lol. [20:24:18] PROBLEM - Puppet freshness on hooft is CRITICAL: Puppet has not run in the last 10 hours [20:24:18] PROBLEM - Puppet freshness on ms6 is CRITICAL: Puppet has not run in the last 10 hours [20:24:18] PROBLEM - Puppet freshness on nfs1 is CRITICAL: Puppet has not run in the last 10 hours [20:24:58] kaldari: just 12 [20:25:04] ok [20:26:55] xyzram: done [20:27:09] thanks. [20:28:13] notpeter: db11 -> RT-4828 [20:29:26] pgehres: i found "pgehres special project" in site.pp .. the db server it uses appears to have a broken disk. creating ticket to replace it [20:29:33] that is db29 [20:29:48] mutante: :-( thanks [20:29:56] RECOVERY - Puppet freshness on analytics1010 is OK: puppet ran at Tue Mar 26 20:29:46 UTC 2013 [20:31:30] ppph [20:31:32] *oooh [20:31:45] paravoid: Huge spam of swift related warnings [20:32:26] RECOVERY - Puppet freshness on barium is OK: puppet ran at Tue Mar 26 20:32:19 UTC 2013 [20:33:55] !log kaldari synchronized php-1.21wmf12/extensions/ContributionReporting 'syncing ContributionReporting extension' [20:34:01] Logged the message, Master [20:34:16] PROBLEM - Puppet freshness on strontium is CRITICAL: Puppet has not run in the last 10 hours [20:34:44] mwalker: CR is deployed and synched to wmf12 [20:34:54] kaldari: cool; thanks :) [20:34:55] lemme know if it needs a scap [20:35:03] I sincerely hope it doesn't [20:35:20] !log broken disks/degraded RAID on db servers: db11->RT-4828, db29->RT-4829, db45->RT-4831 [20:35:26] Logged the message, Master [20:35:54] Aaron|home: you should no longer be seeing cronspam/dberror spam from contribution reporting [20:37:18] RECOVERY - Apache HTTP on mw1209 is OK: HTTP OK: HTTP/1.1 200 OK - 455 bytes in 0.002 second response time [20:40:57] !log more degraded RAIDs: db1001->RT-4832, db1028->4834 [20:41:04] Logged the message, Master [20:45:38] RECOVERY - NTP on mw1209 is OK: NTP OK: Offset -0.09676086903 secs [20:47:18] PROBLEM - Apache HTTP on mw1209 is CRITICAL: Connection refused [20:51:45] !log kaldari synchronized wmf-config/CommonSettings.php 'syncing CommonSettings.php for Echo event tracking' [20:51:48] Logged the message, Master [20:55:23] yurik: this is the one we need to go live? 
https://gerrit.wikimedia.org/r/#/c/55944/ [20:55:41] brion, correct [20:55:44] <^demon> xyzram: solr + zookeeper = fun times :) [20:55:56] <^demon> Doing some benchmarking now. [20:56:09] RobH or someone: want to help us with that? :) varnish update for Zero, needs to go live today [20:56:22] binasher: https://gdash.wikimedia.org/dashboards/reqerror/ Is that really 10k 500s/min and not so many 5xxs/min [20:56:34] New review: Brion VIBBER; "Updated IPs needed for launch tonight." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/55944 [20:56:34] on the first graph [20:56:53] brion: i think both paravoid and asher shot this down [20:57:00] maybe not [20:57:03] RobH: this is a shorter one [20:57:15] just adds IPs to an ACL [20:57:15] ok, cool [20:57:19] chatting with opsen, this is much better [20:57:31] <^demon> I thought my gerrit dashboard was way shorter. [20:57:34] <^demon> Must've been a dream. [20:57:39] :) [20:57:47] so [20:57:50] looking at that ACL [20:58:03] why don't they take a few much bigger prefixes? [20:58:06] kaldari, are you done with your deployment? [20:58:11] why all the weird exceptions and /31s? [20:58:22] mark: we took the ip list they gave us... [20:58:24] mark, it's Beeline, the crap of craps [20:58:33] can we ask them to reduce it? [20:58:33] possibly we could get away with a shorter list with bigger ranges, dunno [20:58:35] heh [20:58:46] this translates to not very efficient code [20:58:53] just so they can leave out a few IPs [20:58:56] for whatever weird reason [20:58:59] paravoid, still around? [20:59:12] trying to figure out some git buildpackage best practices with this python jsonschema stuff [20:59:26] Reedy: yes, uploads.wikimedia.org throws mad 500's [20:59:31] it looks like it's mostly 3 big ranges [20:59:32] mark: partner turnaround time is probably not great; can we deploy this as-is now and see about tightening the ranges later? [20:59:57] Reedy: urls such as http://upload.wikimedia.org/wikipedia/commons/thumb/5/5a/Limina-Stemma.gif/180px-Limina-Stemma.gif [21:00:02] brion: if we can have assurance that someone's actually going to chase them down for it :) [21:00:24] :) [21:00:32] Ahhh [21:02:15] these are supposedly their ranges: http://myip.ms/view/ip_owners/2546/Jsc_Vimpelcom.html [21:02:25] another common 500: http://commons.wikimedia.org/w/index.php?title=MediaWiki:Filepage.css&action=raw&maxage=2678400&usemsgcache=yes&ctype=text%2Fcss&smaxage=2678400 [21:03:03] example referrer for the above = http://en.wikipedia.org/wiki/File:Soba_at_Mitsuwa.jpg [21:03:31] Exception from line 637 of /usr/local/apache/common-local/php-1.21wmf12/includes/cache/MessageCache.php: Message key 'Filepage.css' does not appear to be a full key.
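A quick way to reproduce one of the failing requests from the outside, as a sketch; which response headers are most telling here is a judgment call, not a known-good diagnostic:

    # Fetch only the response headers for a failing thumb URL; the status
    # line plus any X-Cache/Server headers hint at which layer (cache or
    # swift/scaler backend) produced the 500
    curl -sI 'http://upload.wikimedia.org/wikipedia/commons/thumb/5/5a/Limina-Stemma.gif/180px-Limina-Stemma.gif'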
[21:03:56] ooomg, someone broke mediawiki:P [21:06:43] PROBLEM - Disk space on fluorine is CRITICAL: NRPE: Command check_disk_space not defined [21:06:43] PROBLEM - DPKG on virt1 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [21:06:54] PROBLEM - Disk space on labstore1 is CRITICAL: NRPE: Command check_disk_space not defined [21:06:54] PROBLEM - Disk space on virt1 is CRITICAL: NRPE: Command check_disk_space not defined [21:07:03] PROBLEM - Disk space on kuo is CRITICAL: NRPE: Command check_disk_space not defined [21:07:04] PROBLEM - Disk space on virt1007 is CRITICAL: NRPE: Command check_disk_space not defined [21:07:04] PROBLEM - Disk space on virt5 is CRITICAL: NRPE: Command check_disk_space not defined [21:07:14] PROBLEM - Disk space on strontium is CRITICAL: NRPE: Command check_disk_space not defined [21:07:15] PROBLEM - Disk space on stafford is CRITICAL: NRPE: Command check_disk_space not defined [21:07:23] PROBLEM - Disk space on analytics1010 is CRITICAL: NRPE: Command check_disk_space not defined [21:07:33] PROBLEM - Disk space on constable is CRITICAL: NRPE: Command check_disk_space not defined [21:07:33] PROBLEM - Disk space on nfs1 is CRITICAL: NRPE: Command check_disk_space not defined [21:07:33] PROBLEM - Disk space on virt8 is CRITICAL: NRPE: Command check_disk_space not defined [21:07:41] hrmmmmmrmrm [21:07:46] that real? [21:08:13] those alerts are shit [21:08:19] they're not set up correctly yet [21:08:24] k [21:08:29] an10 is the namenode :) [21:08:50] (SPoF for hadoop) [21:09:00] ah, gotcha [21:09:15] those notices are actually saying that I just fixed puppet on all of those hosts [21:09:37] kaldari, not heard back from you - I'm deploying my stuff now, our window has begun [21:09:55] https://ccp.cloudera.com/display/FREE4DOC/Configuring+HDFS+High+Availability [21:10:00] yeah [21:10:08] not considered a priority atm [21:10:28] neither is responding to disk issues on the analytics cluster :) [21:10:33] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [21:10:38] there's also facebook's AvatarNode [21:11:00] aren't you running cloudera's cdh4 distro? [21:11:11] we are. [21:11:50] New patchset: Dzahn; "decom db11 - RT-4828" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55997 [21:12:21] we haven't tried out qjm yet, binasher [21:12:44] and imo using nfs for the journal seems a bit sketchy [21:13:03] but both are worth exploring if you're interested [21:13:10] http://archive.cloudera.com/cdh4/cdh/4/hadoop/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html [21:13:30] New review: Dzahn; "On Tue Mar 26 20:54:36 2013, cmjohnson wrote:" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/55997 [21:13:43] New review: Dzahn; "On Tue Mar 26 20:54:36 2013, cmjohnson wrote:" [operations/puppet] (production); V: 2 - https://gerrit.wikimedia.org/r/55997 [21:13:43] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55997 [21:14:58] binasher: merge coredb changes on sockpuppet.. k? and also just fyi: decom'ing db11, it did not appear anywhere in puppet and is old Sun [21:15:09] (and had broken disk) [21:15:14] mutante: yep, thanks [21:15:29] kk, done [21:15:49] mutante: did you point analytics towards logrotate for yesterday's issue?
[21:15:57] i did [21:16:05] awesome [21:16:24] like that there are some examples in ./files/logrotate in puppet repo [21:16:59] haha [21:17:01] New patchset: JGonera; "Add configuration for Special:LoginHandshake workaround" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56019 [21:17:33] PROBLEM - SSH on amslvs1 is CRITICAL: Server answer: [21:17:59] ori-l, you round? [21:18:34] RECOVERY - SSH on amslvs1 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [21:18:34] ottomata: binasher: to be fair, it was probably already full a bit earlier but then the reports came in when NRPE was fixed [21:18:56] New review: MaxSem; "(1 comment)" [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/56019 [21:19:24] !g 53714 [21:19:25] https://gerrit.wikimedia.org/r/#q,53714,n,z [21:19:40] yeah, i'm not sure what you guys worked on last night with those issues, but just so you know, an09 and an26 were not a worry, they aren't running anything, and it looks like udp2log had spawned itself back up. i had previously been using it to find packet loss issues, and was writing spurts of unsampled logs to files [21:19:56] an03 though, i'm not sure what was up, mutante, did you zip up a big log file or something? [21:20:37] <^demon> MaxSem: Soo, am I correct in guessing your solr setup is a traditional master+slave(s) setup? [21:20:47] yes [21:21:03] <^demon> Thought so :) [21:21:51] hehe, hashar is that for me :) [21:21:51] ottomata: fair, but i'd have to say that if they are not running anything then we could remove or deactivate monitoring [21:21:52] New patchset: JGonera; "Add configuration for Special:LoginHandshake workaround" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56019 [21:22:00] ottomata: hey; here now [21:22:15] what's up? [21:22:16] mutante, sure, that'd be nice [21:22:19] not really sure how to do that [21:22:22] so, ori-l [21:22:27] i'm trying python-jsonschema again [21:22:49] New review: Hashar; "looks like you are applying both role::db::redis::labs and role::db::redis . Since each of them cal..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54970 [21:22:50] i can't build it with the deb packaging from debian because whatever they have fails tests [21:22:56] same test fails with 0.8.0 as with 1.1.0 [21:23:04] i think it might have something to do with my version of nose or mock [21:23:06] but i have the latest [21:23:08] so i'm not sure [21:23:08] can you pastebin the output somewhere? [21:23:11] ja... [21:23:32] New review: Hashar; "I can give it a look on an instance whenever wmflabs is back up :-]" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54970 [21:23:40] mutante: can you do whatever magic to make neon actually eat the passive check packets again? [21:23:40] https://gist.github.com/ottomata/5249394 [21:23:48] ottomata: Dschoon alerted us that this was important. what i did can be seen in SAL [21:24:10] notpeter: ok :p.. but that would probably include stopping gmetad again ... [21:24:12] ty again, mutante [21:24:18] mutante: sgtm ;) [21:24:21] shouldn't normally be a thing -- seder and all [21:24:41] Change merged: MaxSem; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56019 [21:25:20] notpeter: odd, it doesn't look as bad as it did earlier .. [21:25:37] dschoon: np [21:27:17] ottomata: what's the output of python -c 'import mock;print(mock.__version__)' ? [21:27:50] mutante: huh.
well, lots of boxes are still listed as puppet fails that I know just ran puppet [21:28:08] !log stopping gmetad on neon [21:28:14] Logged the message, Master [21:28:25] 0.7.2 [21:28:35] ottomata: not the latest by a long stretch [21:28:47] well, latest according to what apt tells me :) [21:28:50] notpeter: it looks like those for solr2 and sq86 are stuck somehow ..hrmm [21:29:13] RECOVERY - Puppet freshness on ssl3001 is OK: puppet ran at Tue Mar 26 21:29:12 UTC 2013 [21:29:13] RECOVERY - Puppet freshness on cp1036 is OK: puppet ran at Tue Mar 26 21:29:12 UTC 2013 [21:29:23] notpeter: ^ :) [21:29:23] RECOVERY - Puppet freshness on grosley is OK: puppet ran at Tue Mar 26 21:29:13 UTC 2013 [21:29:23] RECOVERY - Puppet freshness on ssl3003 is OK: puppet ran at Tue Mar 26 21:29:13 UTC 2013 [21:29:23] RECOVERY - Puppet freshness on ssl3002 is OK: puppet ran at Tue Mar 26 21:29:13 UTC 2013 [21:29:23] RECOVERY - Puppet freshness on fluorine is OK: puppet ran at Tue Mar 26 21:29:13 UTC 2013 [21:29:23] RECOVERY - Puppet freshness on hooft is OK: puppet ran at Tue Mar 26 21:29:15 UTC 2013 [21:29:23] RECOVERY - Puppet freshness on colby is OK: puppet ran at Tue Mar 26 21:29:15 UTC 2013 [21:29:23] RECOVERY - Puppet freshness on constable is OK: puppet ran at Tue Mar 26 21:29:16 UTC 2013 [21:29:24] RECOVERY - Puppet freshness on labstore1 is OK: puppet ran at Tue Mar 26 21:29:16 UTC 2013 [21:29:24] RECOVERY - Puppet freshness on lvs1003 is OK: puppet ran at Tue Mar 26 21:29:17 UTC 2013 [21:29:25] RECOVERY - Puppet freshness on lvs1002 is OK: puppet ran at Tue Mar 26 21:29:17 UTC 2013 [21:29:25] RECOVERY - Puppet freshness on kuo is OK: puppet ran at Tue Mar 26 21:29:21 UTC 2013 [21:29:26] RECOVERY - Puppet freshness on lvs6 is OK: puppet ran at Tue Mar 26 21:29:21 UTC 2013 [21:29:26] RECOVERY - Puppet freshness on maerlant is OK: puppet ran at Tue Mar 26 21:29:21 UTC 2013 [21:29:27] RECOVERY - Puppet freshness on manganese is OK: puppet ran at Tue Mar 26 21:29:22 UTC 2013 [21:29:33] RECOVERY - Puppet freshness on marmontel is OK: puppet ran at Tue Mar 26 21:29:25 UTC 2013 [21:29:33] RECOVERY - Puppet freshness on silver is OK: puppet ran at Tue Mar 26 21:29:27 UTC 2013 [21:29:33] RECOVERY - Puppet freshness on nfs1 is OK: puppet ran at Tue Mar 26 21:29:27 UTC 2013 [21:29:33] RECOVERY - Puppet freshness on zirconium is OK: puppet ran at Tue Mar 26 21:29:29 UTC 2013 [21:29:33] RECOVERY - Puppet freshness on virt8 is OK: puppet ran at Tue Mar 26 21:29:29 UTC 2013 [21:29:33] RECOVERY - Puppet freshness on virt1 is OK: puppet ran at Tue Mar 26 21:29:30 UTC 2013 [21:29:33] RECOVERY - Puppet freshness on virt5 is OK: puppet ran at Tue Mar 26 21:29:30 UTC 2013 [21:29:34] RECOVERY - Puppet freshness on virt1007 is OK: puppet ran at Tue Mar 26 21:29:32 UTC 2013 [21:29:43] RECOVERY - Puppet freshness on stafford is OK: puppet ran at Tue Mar 26 21:29:33 UTC 2013 [21:29:43] RECOVERY - Puppet freshness on strontium is OK: puppet ran at Tue Mar 26 21:29:33 UTC 2013 [21:29:43] RECOVERY - Puppet freshness on ssl4 is OK: puppet ran at Tue Mar 26 21:29:34 UTC 2013 [21:29:43] RECOVERY - Puppet freshness on ssl1 is OK: puppet ran at Tue Mar 26 21:29:36 UTC 2013 [21:29:43] RECOVERY - Puppet freshness on ssl2 is OK: puppet ran at Tue Mar 26 21:29:37 UTC 2013 [21:29:43] RECOVERY - Puppet freshness on srv294 is OK: puppet ran at Tue Mar 26 21:29:38 UTC 2013 [21:29:43] RECOVERY - Puppet freshness on solr1 is OK: puppet ran at Tue Mar 26 21:29:39 UTC 2013 [21:29:44] RECOVERY - Puppet freshness on rdb1001 is OK: puppet ran at 
Tue Mar 26 21:29:41 UTC 2013 [21:29:44] RECOVERY - Puppet freshness on rdb1002 is OK: puppet ran at Tue Mar 26 21:29:41 UTC 2013 [21:29:47] !log killed stuck snmptt procs on neon to fix the others.. apparently :p [21:29:53] RECOVERY - Puppet freshness on mexia is OK: puppet ran at Tue Mar 26 21:29:43 UTC 2013 [21:29:53] RECOVERY - Puppet freshness on pc1001 is OK: puppet ran at Tue Mar 26 21:29:43 UTC 2013 [21:29:53] RECOVERY - Puppet freshness on ms-fe3001 is OK: puppet ran at Tue Mar 26 21:29:44 UTC 2013 [21:29:53] RECOVERY - Puppet freshness on mc1007 is OK: puppet ran at Tue Mar 26 21:29:45 UTC 2013 [21:29:53] RECOVERY - Puppet freshness on ms6 is OK: puppet ran at Tue Mar 26 21:29:46 UTC 2013 [21:29:54] Logged the message, Master [21:30:03] RECOVERY - Puppet freshness on mw1031 is OK: puppet ran at Tue Mar 26 21:30:00 UTC 2013 [21:30:22] and it doesn't even get kicked.. hooray [21:30:24] booya! [21:30:24] ottomata: where's the debian/ stuff for the package you're working with? [21:30:28] woot, thanks ori-l, I upgraded with pip, that built better [21:30:38] http://anonscm.debian.org/gitweb/?p=openstack/python-jsonschema.git [21:31:26] ok cool, so that works! now if only I knew the proper git-buildpackage procedure here [21:31:28] ... [21:31:32] !log restarted memcached on virt0 in a futile attempt to address https://bugzilla.wikimedia.org/show_bug.cgi?id=46583 [21:31:38] Logged the message, Master [21:31:51] ottomata: thanks very much for doing this, again [21:32:48] yup! i'm learning stuff, and it's fun, just wish I had paravoid in a magic 8 ball to consult or something [21:33:08] oh magic 8 ball paravoid, should I recreate the debian/experimental branch, or should I try to merge? [21:33:43] ottomata: reply hazy, try again [21:33:45] New review: Brion VIBBER; "Mark wants a more consolidated list than this; don't know if we can produce one on a reasonable turn..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55944 [21:34:39] heya ^demon, how goes those repository delete powers? :) [21:34:53] <^demon> Oh, that's totally up and running fine. [21:34:56] OH! [21:34:57] I can do that?! [21:34:57] <^demon> I should've sent an e-mail :) [21:35:09] <^demon> Lemme double check you've got the permission, but yeah [21:35:26] <^demon> You're in ldap/ops, right? [21:35:55] think so? [21:36:09] yes! [21:36:21] <^demon> Indeed, you are. [21:36:25] <^demon> So, `ssh -p 29418 gerrit.wikimedia.org delete-project delete --help` for all the options. [21:36:30] PROBLEM - Disk space on pc1001 is CRITICAL: NRPE: Command check_disk_space not defined [21:36:30] PROBLEM - RAID on rdb1001 is CRITICAL: NRPE: Command check_raid not defined [21:36:40] PROBLEM - Disk space on mexia is CRITICAL: NRPE: Command check_disk_space not defined [21:36:40] PROBLEM - DPKG on rdb1002 is CRITICAL: NRPE: Command check_dpkg not defined [21:36:45] <^demon> eg: `ssh -p 29418 gerrit.wikimedia.org delete-project delete --yes-really-delete -- foo/bar` [21:36:50] PROBLEM - Disk space on rdb1002 is CRITICAL: NRPE: Command check_disk_space not defined [21:37:00] PROBLEM - RAID on rdb1002 is CRITICAL: NRPE: Command check_raid not defined [21:37:00] PROBLEM - DPKG on rdb1001 is CRITICAL: NRPE: Command check_dpkg not defined [21:37:00] PROBLEM - Disk space on solr1 is CRITICAL: NRPE: Command check_disk_space not defined [21:37:01] COOL, and then I can recreate with the same name? [21:37:06] --yes-really-delete hah [21:37:07] <^demon> Yep.
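Putting ^demon's instructions together, a sketch of the delete-and-recreate cycle ottomata runs a little later; the delete-project call is as quoted above, while the create-project invocation is an assumption since its flags vary between Gerrit versions:

    # Irrecoverably delete the broken packaging project...
    ssh -p 29418 gerrit.wikimedia.org delete-project delete --yes-really-delete -- operations/debs/python-jsonschema
    # ...then recreate an empty project under the same name and re-push
    ssh -p 29418 gerrit.wikimedia.org gerrit create-project operations/debs/python-jsonschema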
[21:37:10] PROBLEM - Disk space on mc1007 is CRITICAL: NRPE: Command check_disk_space not defined [21:37:10] PROBLEM - Disk space on rdb1001 is CRITICAL: NRPE: Command check_disk_space not defined [21:37:11] anyone have a good example page of a bad certificate ? jeremyb_ ? [21:37:11] so awesome thank you [21:37:12] New patchset: MF-Warburg; "(bug 44285) config changes for eswikivoyage" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56055 [21:37:22] LeslieCarr: pa.us.wikimedia.org [21:37:26] thanks :) [21:37:32] RECOVERY - Apache HTTP on mw1209 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.083 second response time [21:37:47] LeslieCarr: jeremyb_ pa.us is gone already [21:37:58] mutante: but it still works for our purposes :) [21:38:09] mutante: the old name is still in DNS... [21:38:13] :) [21:38:17] it's still there and showing up [21:38:26] and still broken! [21:38:33] <^demon> jeremyb_: It's completely irrecoverable, so yeah, you should be really sure :) [21:39:11] and let me figure out the steps to show certificates in chrome [21:39:29] MaxSem: Sorry I missed your message, I'm all done with deployments [21:39:29] ungh, yeah, ori-l, there are so many ways to do this I don't know what the right one is! [21:39:31] LeslieCarr: on windows too! windows is harder than chrome [21:39:40] windows is crazy [21:39:45] and usually when I guess i'm wrong! [21:39:48] unrelated to the issue reported in RT-4827 [21:39:48] old versions of xp that have never been updated == eep [21:40:05] test.m.wikipedia.org is not working correctly - it is serving up the desktop version of the site rather than the mobile version like it used to [21:40:07] ^demon: i'm trying and failing to think of other commands i've seen with counterpart really i mean it options. but i'm sure i've seen at least one [21:40:11] LeslieCarr: http://i.imgur.com/ptsf6QX.png [21:40:27] im wondering if it might be related to https://gerrit.wikimedia.org/r/#/c/55555/ [21:40:37] which appears to cause device detection to not run for testwiki [21:40:41] $ host test.m.wikipedia.org [21:40:41] test.m.wikipedia.org is an alias for mobile-lb.eqiad.wikimedia.org. 
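One way to check awjr's report by hand; the phone User-Agent and the MobileFrontend marker class grepped for here are illustrative assumptions:

    # A phone UA against test.m should get the mobile variant back; a zero
    # count suggests device detection was skipped and the desktop page served
    curl -s -A 'Mozilla/5.0 (iPhone; CPU iPhone OS 6_0 like Mac OS X)' \
        'http://test.m.wikipedia.org/wiki/Main_Page' | grep -c 'mw-mf'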
[21:40:44] <^demon> jeremyb_: Well, we briefly had a --i-know-what-im-doing on update.php ;-) [21:40:54] mark ^^ [21:41:11] ottomata: let me see if i can find faidon's recommendation from a while back [21:42:21] welllllll, i think if I remember what that was, it won't be relevant here [21:42:40] so there is this debianization that someone else is doing a good job of maintaining [21:42:48] we want to take it and build a package [21:42:50] RECOVERY - Disk space on mc1010 is OK: DISK OK [21:43:10] PROBLEM - RAID on solr1002 is CRITICAL: Connection refused by host [21:43:10] PROBLEM - RAID on snapshot3 is CRITICAL: Connection refused by host [21:43:12] PROBLEM - RAID on solr2 is CRITICAL: Connection refused by host [21:43:12] PROBLEM - RAID on solr1 is CRITICAL: Connection refused by host [21:43:12] PROBLEM - RAID on wtp1003 is CRITICAL: Connection refused by host [21:43:12] PROBLEM - RAID on analytics1022 is CRITICAL: Connection refused by host [21:43:12] PROBLEM - RAID on analytics1014 is CRITICAL: Connection refused by host [21:43:12] PROBLEM - RAID on ms-be6 is CRITICAL: Connection refused by host [21:43:12] PROBLEM - RAID on snapshot1003 is CRITICAL: Connection refused by host [21:43:13] PROBLEM - RAID on snapshot2 is CRITICAL: Connection refused by host [21:43:13] PROBLEM - RAID on wtp1002 is CRITICAL: Connection refused by host [21:43:20] RECOVERY - Disk space on ms-be1012 is OK: DISK OK [21:43:25] 1. do I need to check this into gerrit? [21:43:25] 2. if so, how do I follow their branching structure? [21:43:25] 3. Do I need to fork all of their branches and tags? [21:43:28] etc. etc. [21:43:31] RECOVERY - Disk space on searchidx1001 is OK: DISK OK [21:43:32] RECOVERY - Disk space on virt3 is OK: DISK OK [21:43:32] RECOVERY - Disk space on db1049 is OK: DISK OK [21:43:32] RECOVERY - Disk space on ms-be1005 is OK: DISK OK [21:43:32] RECOVERY - Disk space on ms-fe1 is OK: DISK OK [21:43:32] RECOVERY - Disk space on labstore1 is OK: DISK OK [21:43:32] RECOVERY - Disk space on ms-fe2 is OK: DISK OK [21:43:32] RECOVERY - Disk space on pc1001 is OK: DISK OK [21:43:33] RECOVERY - Disk space on mc1005 is OK: DISK OK [21:43:33] RECOVERY - Disk space on virt5 is OK: DISK OK [21:43:34] RECOVERY - Disk space on constable is OK: DISK OK [21:43:34] RECOVERY - Disk space on ms-be1003 is OK: DISK OK [21:43:35] RECOVERY - Disk space on ms-be1004 is OK: DISK OK [21:43:35] RECOVERY - Disk space on nfs1 is OK: DISK OK [21:43:36] RECOVERY - Disk space on mc1008 is OK: DISK OK [21:43:36] RECOVERY - Disk space on virt8 is OK: DISK OK [21:43:40] RECOVERY - Disk space on mc9 is OK: DISK OK [21:43:40] RECOVERY - Disk space on mexia is OK: DISK OK [21:43:40] RECOVERY - Disk space on ms-be7 is OK: DISK OK [21:43:40] RECOVERY - Disk space on ms-fe4 is OK: DISK OK [21:43:40] RECOVERY - Disk space on ms-be6 is OK: DISK OK [21:43:40] RECOVERY - Disk space on fluorine is OK: DISK OK [21:43:40] RECOVERY - Disk space on mc1001 is OK: DISK OK [21:43:41] RECOVERY - Disk space on mc5 is OK: DISK OK [21:43:41] RECOVERY - Disk space on virt2 is OK: DISK OK [21:43:44] shhhh [21:43:50] RECOVERY - Disk space on ms-be1007 is OK: DISK OK [21:43:50] RECOVERY - Disk space on tola is OK: DISK OK [21:43:50] RECOVERY - Disk space on virt1 is OK: DISK OK [21:44:00] RECOVERY - Disk space on ms-be3 is OK: DISK OK [21:44:00] RECOVERY - MySQL disk space on pc1 is OK: DISK OK [21:44:00] RECOVERY - Disk space on solr1 is OK: DISK OK [21:44:00] RECOVERY - Disk space on virt1007 is OK: DISK OK [21:44:10] RECOVERY - Disk space on mc1006 is OK: 
DISK OK [21:44:10] RECOVERY - Disk space on mc1007 is OK: DISK OK [21:44:10] RECOVERY - Disk space on mc13 is OK: DISK OK [21:44:10] RECOVERY - Disk space on kuo is OK: DISK OK [21:44:10] RECOVERY - Disk space on virt4 is OK: DISK OK [21:44:10] RECOVERY - Disk space on strontium is OK: DISK OK [21:44:10] RECOVERY - Disk space on searchidx2 is OK: DISK OK [21:44:11] RECOVERY - Disk space on stafford is OK: DISK OK [21:44:14] New patchset: Reedy; "(bug 45968) Set $wgCategoryCollation to 'uca-pl' on Polish Wikivoyage" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/54352 [21:44:20] greg-g: don't shhh. we're getting to a point where we have some basic monitoring again :) [21:44:24] New patchset: awjrichards; "Ensures device detection happens for testwiki" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56056 [21:44:27] just type /ignore icinga-wm [21:44:39] awjr: ah right - that device detection call should move up [21:44:44] mark: https://gerrit.wikimedia.org/r/#/c/56056/ [21:44:50] ottomata: dch -v 1.1.0, then debcommit? [21:45:10] PROBLEM - Puppet freshness on mw1143 is CRITICAL: Puppet has not run in the last 10 hours [21:45:26] does that work with git-buildpackage setup? [21:45:31] notpeter: well ok then [21:45:38] ottomata: yes; you need to git-import-dsc first [21:45:41] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56056 [21:46:11] thanks mark - will puppet push that out automagically for testwiki? we're trying to do some testing there now [21:46:29] i'll trigger a puppet run on the 4 mobile boxes now [21:46:39] thanks mark [21:47:03] yeah, ori-l [21:47:08] i have the whole structure [21:47:10] i can build a deb [21:47:15] i'm just not sure what to push to our repo [21:48:46] i think both upstream and debian branches into equivalent branch names in gerrit [21:49:02] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/54352 [21:49:41] but this is the blind leading the blind -- maybe wait for faidon? [21:50:02] # git branch -a [21:50:02] debian/1.1.0-1 [21:50:02] * debian/experimental [21:50:02] master [21:50:02] remotes/github/master [21:50:03] remotes/origin/HEAD -> origin/debian/experimental [21:50:03] remotes/origin/debian/experimental [21:50:07] I created 1.1.0-1, not sure if I should ahve [21:50:19] i cloned from debian, [21:50:31] added a github remote and pulled 1.1.0 master (and tag) from there [21:50:43] yes, that's correct [21:50:56] created debian/1.1.0-1 from master, (copied the debian directory manually from debian/experimental) [21:51:20] then I deleted the previous debian/experimental and branched it again from debian/1.1.0-1 (they are the same now) [21:51:28] (oh and edited changelog) [21:51:41] debian/experimental is the —git-debian-branch [21:51:44] so, that all works [21:52:01] you could have done that with git-import-dsc --debian-branch=debian/experimental, I think, but what you did is OK too. [21:52:10] hmmmmmmmmmmmmmmMMMMMm, from what, wh [21:52:13] from master? [21:52:13] hmm [21:52:23] i thought import-dsc was for when you were starting a new repo [21:52:43] mark, ping [21:53:06] ok so what should I push to gerrit? [21:53:07] :p [21:53:10] for review? [21:53:15] anything? [21:53:19] pip! [21:53:34] mark, dfoy & I are here, just spoke with brion, we can't really shrink down the ip list beyond what it is already. it has already shrunk down 3+ fold. 
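On mark's point about the prefix list, a sketch of measuring how far the carrier ranges could collapse, assuming the ACL entries are dumped one CIDR per line and the aggregate tool (Debian package of the same name) is installed; beeline-ips.txt is a hypothetical file name:

    # Merge adjacent and overlapping CIDRs into the minimal covering set,
    # then compare line counts to see how much the varnish ACL could shrink
    sort -u beeline-ips.txt | aggregate > beeline-aggregated.txt
    wc -l beeline-ips.txt beeline-aggregated.txt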
we have a 56 million users release tonight [21:54:03] that's why so many ranges [21:54:33] yurik: maybe you can't, but that carrier can [21:54:45] mark, we could try to convince the carrier to do it later, [21:55:04] plus we could try to optimize varnish code to introduce a "switch (ip) { case ACL:... + [21:55:07] constructs [21:55:14] that would be a binary search based [21:55:33] yes [21:55:36] but that's conditional on us actually seeing a performance problem [21:55:46] we will add many many many more ip ranges soon [21:55:59] when we sign up more partners [21:56:14] ottomata: i think you should commit both branches as two separate commits, but be careful with git-review, so that it doesn't treat your branch names as topics and merges them onto master. i think you can do git push with refs/for with the right branch names [21:56:15] you know, when we started the initial discussions about zero, we were talking about a BGP based system [21:56:25] more efficient and manageable than this [21:56:36] may be time to start exploring that [21:56:39] yeah, but should I commit them for review? what does faidon want to review, the changelog? [21:56:42] for some reason I never heard anything afterwards [21:57:04] anyway, we can merge this now, but I'd like someone in contact with that carrier to followup with them [21:57:36] mark, brion, i agree that we need to look into it. dfoy will follow up with it [21:57:47] \o/ [21:57:48] mark, could you merge both patches? [21:57:53] :) [21:58:02] this way we will solve the default language issue as well [21:58:04] ottomata: have faidon review both? debian branch for proper packaging, upstream so he can sanity-check the quality of the code being packaged [21:58:22] and it will make brion even more happy [21:58:39] :D [21:58:58] hmm, ok, hmmm [21:59:05] i'm willing to merge the ACL one now, not the other one - i don't have time to review and babysit it now [21:59:12] ori-l, i'm going to delete (WEE!) gerrit's python-jsonschema and re-push [21:59:14] and see how it goes [21:59:24] ottomata: if your endorsement and mine is sufficient for upstream being good enough for inclusion in apt, then he can just merge it without further thought [21:59:33] cool, thanks [21:59:42] mark, that's acceptable for tonight's launch, but would be good to review it soon - it should make it slightly more efficient too [22:00:03] New review: Mark Bergsma; "22:57:36 mark, brion, i agree that we need to look into it. dfoy will follow up with it" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/55944 [22:00:03] mark, are you the person who posted about faking ip based on the header param in varnish? [22:00:05] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55944 [22:00:12] no [22:00:48] COOL ^demon that worked! [22:01:04] mark, thank you !!! when do you think it will be live? [22:01:11] in 30 mins [22:01:12] <^demon> sweet [22:01:28] sweet!!! [22:01:36] sweet!!!!!!! [22:01:48] sweeeeeeeeet!!!!!!!!!!!!!!!!!!!!!!!!! [22:01:54] <^demon> !!! [22:01:56] * brion wonders who can top THAT [22:02:23] <^demon> I can. Dinner and Adventure Time. [22:02:40] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:03:08] New review: Mattflaschen; "Thanks, hashar. I've been testing on https://wikitech.wikimedia.org/wiki/Nova_Resource:I-000005e8 ...."
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/54970 [22:03:30] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.130 second response time [22:03:36] The authenticity of host '[gerrit.wikimedia.org]:29418 ([208.80.154.152]:29418)' can't be established. [22:03:36] RSA key fingerprint is dc:e9:68:7b:99:1b:27:d0:f9:fd:ce:6a:2e:bf:92:e1. [22:03:36] Are you sure you want to continue connecting (yes/no)? [22:03:36] while(true){sweet!} [22:03:44] (on fenari) [22:04:07] <^demon|away> Nothing new on gerrit. [22:04:18] uh [22:04:20] <^demon|away> Definitely not a key change. [22:04:40] expired cert> [22:04:42] ? [22:05:18] no [22:05:37] debug1: Server host key: RSA dc:e9:68:7b:99:1b:27:d0:f9:fd:ce:6a:2e:bf:92:e1 [22:05:41] same key i see at home MaxSem [22:06:03] is this just first time connecting to gerrit on fenari? [22:06:07] New review: Mattflaschen; "Correct URL: http://goo.gl/Jf3V5" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54970 [22:06:14] so not a DNS spoofing... probably:P [22:06:30] brion, it worked for me today [22:06:37] weird [22:06:52] New patchset: Pyoungmeister; "fixing scope of icinga_config_dir in mysql.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56061 [22:07:38] uhhh, and now it pulled quietly [22:07:58] !log restarted nova-api on virt0 because it appeared to have crashed. [22:08:00] New review: Lcarr; ":)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/56061 [22:08:01] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56061 [22:08:04] Logged the message, Master [22:08:06] puppet run was updating some keys? [22:08:29] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [22:09:44] New patchset: Dzahn; "always redirect wikimediafoundation.org to https (RT-4830)" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/56062 [22:10:11] !log icinga mostly trustworthy again :) [22:10:17] Logged the message, notpeter [22:10:27] mostly... 
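For the host-key prompt brion hit on fenari, a sketch of checking the offered fingerprint out of band before answering yes; older OpenSSH may not read the scan from a pipe, hence the temp file:

    # Grab the RSA key gerrit's sshd presents on 29418 and print its
    # fingerprint; it should match the dc:e9:68:7b:... value quoted above
    ssh-keyscan -t rsa -p 29418 gerrit.wikimedia.org > /tmp/gerrit.key
    ssh-keygen -lf /tmp/gerrit.key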
[22:10:28] ;) [22:10:36] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1 [22:10:37] New patchset: Ottomata; "New 1.1.0 release deb packaging" [operations/debs/python-jsonschema] (debian/experimental) - https://gerrit.wikimedia.org/r/56064 [22:10:48] PROBLEM - Puppet freshness on mw1095 is CRITICAL: Puppet has not run in the last 10 hours [22:12:20] Change abandoned: Dzahn; "don't do more than one thing at a time, care about the quoted booleans in a future change" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21759 [22:13:08] hokey dokey ori-l [22:13:11] i feel good about this: [22:13:11] https://gerrit.wikimedia.org/r/#/c/56064/1 [22:15:17] RECOVERY - Disk space on db1043 is OK: DISK OK [22:15:26] ottomata: but debian/changelog doesn't contain any of the 0.8 - 1.1 commits [22:15:27] RECOVERY - Disk space on analytics1010 is OK: DISK OK [22:15:36] RECOVERY - Disk space on db1045 is OK: DISK OK [22:15:56] PROBLEM - Apache HTTP on mw109 is CRITICAL: Connection refused [22:15:56] PROBLEM - DPKG on mw1093 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [22:16:17] PROBLEM - DPKG on mw1092 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [22:16:17] PROBLEM - DPKG on mw1098 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [22:16:17] PROBLEM - DPKG on mw1096 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [22:16:18] PROBLEM - RAID on db59 is CRITICAL: CRITICAL: Degraded [22:16:27] PROBLEM - DPKG on mw1099 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [22:16:27] PROBLEM - DPKG on mw1090 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [22:16:28] RECOVERY - Disk space on erzurumi is OK: DISK OK [22:16:28] PROBLEM - DPKG on mw1097 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [22:16:46] PROBLEM - DPKG on mw1095 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [22:16:46] RECOVERY - Disk space on db59 is OK: DISK OK [22:16:46] PROBLEM - DPKG on mw1091 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [22:16:57] PROBLEM - DPKG on mw1094 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [22:17:27] those DPKG issues are real [22:17:28] ottomata: i think it can probably be removed -- it's a full log of upstream commits. debian/CHANGELOG is the right file. 
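A rough outline of the git-buildpackage flow being felt out above, using the branch names from ottomata's git branch -a output; the merge-based update and the explicit refs/for pushes are suggestions rather than the settled procedure:

    git checkout debian/experimental
    git merge v1.1.0                        # bring in the new upstream release tag
    dch -v 1.1.0-1 'New upstream release'   # open a new debian/changelog entry
    debcommit                               # commit using the changelog text as the message
    git-buildpackage --git-debian-branch=debian/experimental
    # Push each branch to its own refs/for target so git-review does not
    # treat the branch names as topics and aim everything at master
    git push origin master:refs/for/master
    git push origin debian/experimental:refs/for/debian/experimental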
[22:17:49] mutante: someone should fix that ;) [22:18:06] RECOVERY - Disk space on mw1212 is OK: DISK OK [22:18:06] RECOVERY - DPKG on mw1212 is OK: All packages OK [22:18:16] RECOVERY - DPKG on mw1209 is OK: All packages OK [22:18:16] RECOVERY - Disk space on mw1209 is OK: DISK OK [22:18:16] RECOVERY - RAID on mw1212 is OK: OK: no RAID installed [22:18:36] RECOVERY - RAID on mw1209 is OK: OK: no RAID installed [22:18:46] PROBLEM - Puppet freshness on mw13 is CRITICAL: Puppet has not run in the last 10 hours [22:18:46] PROBLEM - Puppet freshness on palladium is CRITICAL: Puppet has not run in the last 10 hours [22:18:46] RECOVERY - DPKG on mw1095 is OK: All packages OK [22:18:46] RECOVERY - DPKG on mw1091 is OK: All packages OK [22:18:46] RECOVERY - Disk space on rdb1002 is OK: DISK OK [22:18:46] PROBLEM - Apache HTTP on mw1093 is CRITICAL: Connection refused [22:18:56] RECOVERY - DPKG on mw1094 is OK: All packages OK [22:18:56] RECOVERY - DPKG on rdb1001 is OK: All packages OK [22:18:56] RECOVERY - DPKG on mw1093 is OK: All packages OK [22:19:06] PROBLEM - Apache HTTP on mw1096 is CRITICAL: Connection refused [22:19:06] PROBLEM - Apache HTTP on mw1098 is CRITICAL: Connection refused [22:19:06] RECOVERY - Disk space on rdb1001 is OK: DISK OK [22:19:21] PROBLEM - Apache HTTP on mw1099 is CRITICAL: Connection refused [22:19:21] RECOVERY - DPKG on mw1092 is OK: All packages OK [22:19:21] RECOVERY - DPKG on mw1098 is OK: All packages OK [22:19:21] https://gerrit.wikimedia.org/r/#/c/56037 [22:19:26] PROBLEM - Apache HTTP on mw1091 is CRITICAL: Connection refused [22:19:26] PROBLEM - Apache HTTP on mw1095 is CRITICAL: Connection refused [22:19:26] RECOVERY - DPKG on mw1096 is OK: All packages OK [22:19:26] RECOVERY - Disk space on terbium is OK: DISK OK [22:19:26] RECOVERY - DPKG on rdb1002 is OK: All packages OK [22:19:26] PROBLEM - Apache HTTP on mw1092 is CRITICAL: Connection refused [22:19:26] RECOVERY - DPKG on mw1099 is OK: All packages OK [22:19:27] RECOVERY - DPKG on mw1090 is OK: All packages OK [22:19:27] RECOVERY - DPKG on mw1097 is OK: All packages OK [22:19:36] PROBLEM - Apache HTTP on mw1097 is CRITICAL: Connection refused [22:19:40] I'd like that to get pushed out as soon as possible. [22:19:46] PROBLEM - Apache HTTP on mw1094 is CRITICAL: Connection refused [22:19:46] RECOVERY - DPKG on terbium is OK: All packages OK [22:19:56] PROBLEM - Apache HTTP on mw1090 is CRITICAL: Connection refused [22:20:44] Susan: why on this channel? [22:21:03] Where else? [22:22:23] Susan: wikimedia-tech [22:22:37] since ops doesn't deal with the core stuff (i mean technically i could, but not really supposed to [22:22:49] All right. [22:23:11] Yet if someone wanted to push that file out, they'd notify here. 
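For context, pushing a config file out is typically a one-file sync from fenari along the lines of the !log entries throughout this log; the message below is a placeholder, and the script name follows the then-current deployment tooling:

    # Run from the deployment checkout on fenari; syncs one config file to
    # the apaches and emits the 'synchronized' !log line seen above
    sync-file wmf-config/InitialiseSettings.php 'Deploying config change'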
[22:23:24] notpeter: joy http://paste.debian.net/244965/ let's see on mw1039 :) [22:23:36] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Tue Mar 26 22:23:29 UTC 2013 [22:23:41] make that 1093 [22:23:48] RECOVERY - Puppet freshness on mw1044 is OK: puppet ran at Tue Mar 26 22:23:37 UTC 2013 [22:23:48] RECOVERY - Puppet freshness on mw1087 is OK: puppet ran at Tue Mar 26 22:23:38 UTC 2013 [22:23:48] RECOVERY - Puppet freshness on mw1073 is OK: puppet ran at Tue Mar 26 22:23:38 UTC 2013 [22:23:48] RECOVERY - Puppet freshness on mw1103 is OK: puppet ran at Tue Mar 26 22:23:41 UTC 2013 [22:23:48] RECOVERY - Puppet freshness on mw1126 is OK: puppet ran at Tue Mar 26 22:23:41 UTC 2013 [22:23:48] RECOVERY - Puppet freshness on mw1098 is OK: puppet ran at Tue Mar 26 22:23:42 UTC 2013 [22:23:48] RECOVERY - Puppet freshness on mw1137 is OK: puppet ran at Tue Mar 26 22:23:43 UTC 2013 [22:23:49] RECOVERY - Puppet freshness on mw1143 is OK: puppet ran at Tue Mar 26 22:23:44 UTC 2013 [22:23:55] !log reedy synchronized wmf-config/InitialiseSettings.php [22:23:57] RECOVERY - Puppet freshness on mw1210 is OK: puppet ran at Tue Mar 26 22:23:48 UTC 2013 [22:23:57] RECOVERY - Puppet freshness on mw1211 is OK: puppet ran at Tue Mar 26 22:23:49 UTC 2013 [22:23:57] RECOVERY - Puppet freshness on mw1213 is OK: puppet ran at Tue Mar 26 22:23:49 UTC 2013 [22:23:57] RECOVERY - Puppet freshness on mw1214 is OK: puppet ran at Tue Mar 26 22:23:54 UTC 2013 [22:24:01] Logged the message, Master [22:24:08] RECOVERY - Puppet freshness on mw1215 is OK: puppet ran at Tue Mar 26 22:23:58 UTC 2013 [22:24:08] RECOVERY - Puppet freshness on mw1216 is OK: puppet ran at Tue Mar 26 22:24:02 UTC 2013 [22:24:08] RECOVERY - Puppet freshness on mw1217 is OK: puppet ran at Tue Mar 26 22:24:04 UTC 2013 [22:24:16] RECOVERY - Puppet freshness on mw1218 is OK: puppet ran at Tue Mar 26 22:24:09 UTC 2013 [22:24:16] RECOVERY - Puppet freshness on mw1219 is OK: puppet ran at Tue Mar 26 22:24:14 UTC 2013 [22:24:16] RECOVERY - Puppet freshness on mw1220 is OK: puppet ran at Tue Mar 26 22:24:15 UTC 2013 [22:24:22] Susan: yeah because deploys often cause major issues (this one wouldn't) so we want to know why the site just died ;) [22:24:26] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [22:24:40] icinga-wm: db11 is decom, forget it:) [22:25:17] I'd like to kill the site one day. [22:25:42] don't say that! it's fragile enough with all the unintentional killing! [22:25:56] Uptime is overrated. [22:25:58] (o; [22:26:47] New patchset: Reedy; "(bug 45596) Set $wgCategoryCollation to 'uca-hu' on Hungarian Wikipedia" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/54353 [22:27:23] as long as it's over 50% it's ok ? [22:27:32] notpeter: seems they were in the middle of doing an upgrade but are done now .... [22:27:43] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/54353 [22:28:17] RECOVERY - Puppet freshness on palladium is OK: puppet ran at Tue Mar 26 22:28:06 UTC 2013 [22:28:37] mutante: would you be willing to make tickets for all of the boxes that are showing broken disks? 
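Background on the NRPE criticals scattered through this log: a message like "NRPE: Command check_raid not defined" means the NRPE daemon on the target host has no such command stanza. A minimal sketch of the missing definitions, with plugin paths and thresholds as assumptions; in production these come from puppet rather than hand edits:

    cat <<'EOF' | sudo tee -a /etc/nagios/nrpe_local.cfg
    command[check_disk_space]=/usr/lib/nagios/plugins/check_disk -w 10% -c 5% -l
    command[check_raid]=/usr/lib/nagios/plugins/check_raid
    command[check_dpkg]=/usr/lib/nagios/plugins/check_dpkg
    EOF
    # NRPE only rereads its config on restart
    sudo service nagios-nrpe-server restart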
[22:28:38] !log reedy synchronized wmf-config/InitialiseSettings.php [22:28:40] looks like you did some [22:28:45] Logged the message, Master [22:28:46] RECOVERY - Apache HTTP on mw1093 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.089 second response time [22:28:56] RECOVERY - Apache HTTP on mw109 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.138 second response time [22:29:26] notpeter: already made 4 [22:29:38] Susan: but.. it's a contest http://www.uptimeprj.com/ [22:30:20] i'd like to have a 4 9's uptime [22:30:37] notpeter: some "command not defined" left that report as RAID though [22:30:57] RECOVERY - Apache HTTP on mw1090 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.059 second response time [22:31:46] RECOVERY - Apache HTTP on mw1094 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.088 second response time [22:32:06] RECOVERY - Apache HTTP on mw1096 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.076 second response time [22:32:06] RECOVERY - Apache HTTP on mw1098 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.071 second response time [22:32:16] RECOVERY - Apache HTTP on mw1099 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.070 second response time [22:32:16] RECOVERY - Apache HTTP on mw1091 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.046 second response time [22:32:23] Uptime is an odd metric. [22:32:28] RECOVERY - Apache HTTP on mw1095 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.067 second response time [22:32:28] RECOVERY - Apache HTTP on mw1092 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.068 second response time [22:32:36] RECOVERY - Apache HTTP on mw1097 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.079 second response time [22:32:39] I mean, server uptime is one thing. Reachability is another. And then there's "it loads but it's missing all the CSS." [22:32:43] And variants. [22:33:12] that's because there weren't any releases for those commits? [22:33:14] ori-l?
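On the changelog confusion: dpkg tooling only ever reads debian/changelog, which gives a quick way to confirm the upstream commit log is irrelevant to the package build; run from the source tree root, assuming dpkg-dev is installed:

    # Prints the version and distribution metadata of the newest
    # debian/changelog entry; debian/CHANGELOG is never consulted
    dpkg-parsechangelog | head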
[22:33:16] PROBLEM - DPKG on mw1088 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [22:33:25] notpeter: i think we have all that are RAID criticals but not unknown command , one ticket each [22:33:32] let me ACK them: [22:33:48] PROBLEM - DPKG on mw1082 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [22:33:56] PROBLEM - DPKG on mw1087 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [22:34:07] PROBLEM - Apache HTTP on mw108 is CRITICAL: Connection refused [22:34:07] PROBLEM - DPKG on mw1083 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [22:34:07] PROBLEM - DPKG on mw1081 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [22:34:28] PROBLEM - DPKG on mw1089 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [22:34:28] PROBLEM - DPKG on mw1084 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [22:34:29] PROBLEM - DPKG on mw1086 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [22:34:29] PROBLEM - DPKG on mw1080 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [22:35:04] ottomata: i dunno; i'm confused by there being two debian/changelog files (changelog and CHANGELOG), one of which seems to track non-version-bumping regular commits against the upstream repository (changelog), the other looking like a normal debian changelog (CHANGELOG) [22:35:49] RECOVERY - Disk space on db1051 is OK: DISK OK [22:35:49] ACKNOWLEDGEMENT - RAID on db1001 is CRITICAL: CRITICAL: Degraded daniel_zahn RT #4832 [22:35:49] RECOVERY - RAID on db1051 is OK: OK: State is Optimal, checked 2 logical device(s) [22:35:49] RECOVERY - MySQL Recent Restart on db1051 is OK: OK seconds since restart [22:35:49] RECOVERY - Full LVS Snapshot on db1051 is OK: OK no full LVM snapshot volumes [22:36:03] RECOVERY - MySQL Idle Transactions on db1051 is OK: OK longest blocking idle transaction sleeps for seconds [22:36:03] RECOVERY - mysqld processes on db1051 is OK: PROCS OK: 1 process with command name mysqld [22:36:03] RECOVERY - Apache HTTP on mw108 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.123 second response time [22:36:13] RECOVERY - mysqld processes on db1052 is OK: PROCS OK: 1 process with command name mysqld [22:36:13] PROBLEM - Apache HTTP on mw1088 is CRITICAL: Connection refused [22:36:13] RECOVERY - DPKG on mw1081 is OK: All packages OK [22:36:13] RECOVERY - DPKG on mw1083 is OK: All packages OK [22:36:13] RECOVERY - MySQL Replication Heartbeat on db1051 is OK: OK replication delay seconds [22:36:13] RECOVERY - MySQL Idle Transactions on db1052 is OK: OK longest blocking idle transaction sleeps for seconds [22:36:13] RECOVERY - MySQL Replication Heartbeat on db1056 is OK: OK replication delay seconds [22:36:14] RECOVERY - MySQL Idle Transactions on db1056 is OK: OK longest blocking idle transaction sleeps for seconds [22:36:14] RECOVERY - MySQL Slave Delay on db1051 is OK: OK replication delay seconds [22:36:15] RECOVERY - Full LVS Snapshot on db1052 is OK: OK no full LVM snapshot volumes [22:36:21] ACKNOWLEDGEMENT - RAID on db1028 is CRITICAL: CRITICAL: Degraded daniel_zahn RT #4834 [22:36:23] RECOVERY - MySQL Slave Running on db1051 is OK: OK replication [22:36:23] RECOVERY - DPKG on mw1088 is OK: All packages OK [22:36:23] RECOVERY - MySQL disk space on db1051 is OK: DISK OK [22:36:23] RECOVERY - MySQL Slave Delay on db1056 is OK: OK replication delay seconds [22:36:23] RECOVERY - MySQL Recent Restart on db1056 is OK: OK seconds since restart [22:36:23] RECOVERY - DPKG on db1051 is OK: All packages OK [22:36:23] RECOVERY - MySQL disk 
space on db1052 is OK: DISK OK [22:36:24] RECOVERY - MySQL Recent Restart on db1052 is OK: OK seconds since restart [22:36:24] PROBLEM - Apache HTTP on mw1083 is CRITICAL: Connection refused [22:36:25] RECOVERY - DPKG on mw1089 is OK: All packages OK [22:36:33] RECOVERY - DPKG on mw1084 is OK: All packages OK [22:36:33] RECOVERY - DPKG on mw1086 is OK: All packages OK [22:36:33] RECOVERY - MySQL Replication Heartbeat on db1052 is OK: OK replication delay seconds [22:36:33] RECOVERY - DPKG on mw1080 is OK: All packages OK [22:36:33] RECOVERY - MySQL Slave Running on db1056 is OK: OK replication [22:36:45] RECOVERY - MySQL Slave Delay on db1052 is OK: OK replication delay seconds [22:36:45] RECOVERY - MySQL Slave Running on db1052 is OK: OK replication [22:36:45] RECOVERY - Full LVS Snapshot on db1056 is OK: OK no full LVM snapshot volumes [22:36:56] PROBLEM - Apache HTTP on mw1084 is CRITICAL: Connection refused [22:36:56] PROBLEM - Apache HTTP on mw1089 is CRITICAL: Connection refused [22:36:56] RECOVERY - RAID on db1056 is OK: OK: State is Optimal, checked 2 logical device(s) [22:36:56] RECOVERY - MySQL disk space on db1056 is OK: DISK OK [22:36:56] RECOVERY - DPKG on mw1082 is OK: All packages OK [22:36:56] PROBLEM - Apache HTTP on mw1081 is CRITICAL: Connection refused [22:36:56] RECOVERY - DPKG on db1056 is OK: All packages OK [22:36:57] RECOVERY - DPKG on mw1087 is OK: All packages OK [22:37:03] RECOVERY - Disk space on db1056 is OK: DISK OK [22:37:13] PROBLEM - Apache HTTP on mw1082 is CRITICAL: Connection refused [22:37:13] PROBLEM - Apache HTTP on mw1087 is CRITICAL: Connection refused [22:37:23] PROBLEM - Apache HTTP on mw1086 is CRITICAL: Connection refused [22:37:33] PROBLEM - Apache HTTP on mw1080 is CRITICAL: Connection refused [22:37:53] PROBLEM - Disk space on mw1211 is CRITICAL: NRPE: Command check_disk_space not defined [22:37:53] PROBLEM - DPKG on mw1214 is CRITICAL: NRPE: Command check_dpkg not defined [22:37:53] PROBLEM - DPKG on mw1220 is CRITICAL: NRPE: Command check_dpkg not defined [22:37:53] PROBLEM - DPKG on mw1218 is CRITICAL: NRPE: Command check_dpkg not defined [22:37:53] PROBLEM - DPKG on mw1216 is CRITICAL: NRPE: Command check_dpkg not defined [22:38:03] PROBLEM - Disk space on mw1214 is CRITICAL: NRPE: Command check_disk_space not defined [22:38:03] PROBLEM - Disk space on mw1216 is CRITICAL: NRPE: Command check_disk_space not defined [22:38:03] PROBLEM - Disk space on mw1218 is CRITICAL: NRPE: Command check_disk_space not defined [22:38:03] PROBLEM - Disk space on mw1220 is CRITICAL: NRPE: Command check_disk_space not defined [22:38:03] PROBLEM - RAID on mw1211 is CRITICAL: NRPE: Command check_raid not defined [22:38:13] mark, ori-l found this code, we thought that mark was you :) http://codebaboon.com/varnish-casting-string-ip-address [22:38:14] PROBLEM - DPKG on mw1210 is CRITICAL: NRPE: Command check_dpkg not defined [22:38:14] PROBLEM - RAID on mw1220 is CRITICAL: NRPE: Command check_raid not defined [22:38:14] PROBLEM - RAID on mw1216 is CRITICAL: NRPE: Command check_raid not defined [22:38:14] PROBLEM - DPKG on mw1213 is CRITICAL: NRPE: Command check_dpkg not defined [22:38:14] PROBLEM - RAID on mw1218 is CRITICAL: NRPE: Command check_raid not defined [22:38:14] PROBLEM - RAID on mw1214 is CRITICAL: NRPE: Command check_raid not defined [22:38:23] PROBLEM - Disk space on mw1213 is CRITICAL: NRPE: Command check_disk_space not defined [22:38:23] PROBLEM - Disk space on mw1210 is CRITICAL: NRPE: Command check_disk_space not defined [22:38:23] PROBLEM - 
DPKG on mw1215 is CRITICAL: NRPE: Command check_dpkg not defined [22:38:23] PROBLEM - DPKG on mw1219 is CRITICAL: NRPE: Command check_dpkg not defined [22:38:23] PROBLEM - DPKG on mw1217 is CRITICAL: NRPE: Command check_dpkg not defined [22:38:32] MAY BE you [22:38:33] RECOVERY - Apache HTTP on mw1083 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.995 second response time [22:38:33] PROBLEM - Disk space on mw1215 is CRITICAL: NRPE: Command check_disk_space not defined [22:38:33] PROBLEM - RAID on mw1210 is CRITICAL: NRPE: Command check_raid not defined [22:38:33] PROBLEM - Disk space on mw1217 is CRITICAL: NRPE: Command check_disk_space not defined [22:38:33] PROBLEM - Disk space on mw1219 is CRITICAL: NRPE: Command check_disk_space not defined [22:38:33] RECOVERY - Apache HTTP on mw1080 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.117 second response time [22:38:43] PROBLEM - DPKG on mw1211 is CRITICAL: NRPE: Command check_dpkg not defined [22:38:43] PROBLEM - RAID on mw1217 is CRITICAL: NRPE: Command check_raid not defined [22:38:43] PROBLEM - RAID on mw1219 is CRITICAL: NRPE: Command check_raid not defined [22:38:43] PROBLEM - RAID on mw1213 is CRITICAL: NRPE: Command check_raid not defined [22:38:43] PROBLEM - RAID on mw1215 is CRITICAL: NRPE: Command check_raid not defined [22:38:44] i don't need no trouble, mister [22:38:53] RECOVERY - Apache HTTP on mw1084 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.085 second response time [22:38:53] RECOVERY - Apache HTTP on mw1089 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.078 second response time [22:38:53] RECOVERY - Apache HTTP on mw1081 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.060 second response time [22:39:14] RECOVERY - Apache HTTP on mw1088 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.054 second response time [22:39:14] RECOVERY - Apache HTTP on mw1082 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.068 second response time [22:39:14] RECOVERY - Apache HTTP on mw1087 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.076 second response time [22:39:23] RECOVERY - Apache HTTP on mw1086 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.080 second response time [22:39:34] !log did a service apache2 start on mw108* [22:39:38] :) [22:39:40] Logged the message, Mistress of the network gear. 
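
Note on the "NRPE: Command check_disk_space not defined" floods above: the NRPE daemon on each host only answers for commands it has a definition for, so a freshly added check returns "not defined" until the daemon re-reads its config. A minimal sketch of the kind of per-check definition involved, with an illustrative path and thresholds rather than the actual puppetized WMF config:

    # /etc/nagios/nrpe.d/check_disk_space.cfg  (illustrative path and thresholds)
    command[check_disk_space]=/usr/lib/nagios/plugins/check_disk -w 6% -c 3% -l -e

    # until the daemon picks the file up, it keeps answering
    # "NRPE: Command check_disk_space not defined":
    service nagios-nrpe-server restart

That NRPE sometimes fails to come back cleanly from such a restart is exactly what notpeter's "adding restart attr to puppet" patchset below works around.
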
[22:39:54] RECOVERY - DPKG on mw1214 is OK: All packages OK [22:40:03] RECOVERY - Disk space on mw1214 is OK: DISK OK [22:40:15] RECOVERY - RAID on mw1216 is OK: OK: no RAID installed [22:40:15] RECOVERY - RAID on mw1214 is OK: OK: no RAID installed [22:40:23] RECOVERY - DPKG on mw1215 is OK: All packages OK [22:40:23] RECOVERY - DPKG on mw1217 is OK: All packages OK [22:40:33] RECOVERY - Disk space on mw1219 is OK: DISK OK [22:40:35] RECOVERY - Disk space on mw1217 is OK: DISK OK [22:40:35] RECOVERY - Disk space on mw1215 is OK: DISK OK [22:40:44] RECOVERY - RAID on mw1217 is OK: OK: no RAID installed [22:40:44] RECOVERY - RAID on mw1215 is OK: OK: no RAID installed [22:40:44] RECOVERY - RAID on mw1219 is OK: OK: no RAID installed [22:40:53] RECOVERY - DPKG on mw1216 is OK: All packages OK [22:40:53] RECOVERY - DPKG on mw1218 is OK: All packages OK [22:40:53] RECOVERY - DPKG on mw1220 is OK: All packages OK [22:41:03] RECOVERY - Disk space on mw1216 is OK: DISK OK [22:41:03] RECOVERY - Disk space on mw1220 is OK: DISK OK [22:41:03] RECOVERY - Disk space on mw1218 is OK: DISK OK [22:41:18] RECOVERY - RAID on mw1218 is OK: OK: no RAID installed [22:41:18] RECOVERY - RAID on mw1220 is OK: OK: no RAID installed [22:41:23] RECOVERY - DPKG on mw1219 is OK: All packages OK [22:41:45] making everything better is hard [22:43:06] New patchset: Pyoungmeister; "stupid nrpe can't restart properly, adding restart attr to puppet" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56071 [22:44:43] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56071 [22:45:08] !log mw2 - broken DPKG: pi librsvg2-2 , pi librsvg2-bin [22:45:15] Logged the message, Master [22:47:13] RECOVERY - DPKG on mw1210 is OK: All packages OK [22:47:23] RECOVERY - Disk space on mw1210 is OK: DISK OK [22:47:26] !log search21 - broken DPKG: iF sun-j2sdk1.6 [22:47:33] RECOVERY - RAID on mw1210 is OK: OK: no RAID installed [22:47:33] Logged the message, Master [22:47:43] RECOVERY - RAID on mw1213 is OK: OK: no RAID installed [22:47:43] RECOVERY - DPKG on mw1211 is OK: All packages OK [22:47:53] RECOVERY - Disk space on mw1211 is OK: DISK OK [22:48:04] RECOVERY - RAID on mw1211 is OK: OK: no RAID installed [22:48:17] RECOVERY - DPKG on mw1213 is OK: All packages OK [22:48:23] RECOVERY - Disk space on mw1213 is OK: DISK OK [22:49:04] !log search21 - remove sun-j2sdk , replaced by oracle-jdsdk [22:49:10] Logged the message, Master [22:49:23] RECOVERY - DPKG on search21 is OK: All packages OK [22:49:25] New patchset: Reedy; "Remove checkers.php and various obsolete spam regexes" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/55525 [22:49:33] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/55525 [22:50:42] !log reedy synchronized wmf-config/ [22:50:51] Logged the message, Master [22:51:42] !log starting udpprofile (carbon-cache.py) on professor [22:51:48] Logged the message, Master [22:52:43] RECOVERY - Disk space on db10 is OK: DISK OK [22:52:56] RECOVERY - Disk space on db9 is OK: DISK OK [22:53:01] New patchset: Reedy; "Removed favicon.ico files obsoleted by I35d3af43" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/55014 [22:53:11] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/55014 [22:53:27] New review: Krinkle; "Which openstack change?" 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/53958 [22:53:54] !log reedy synchronized docroot [22:54:00] Logged the message, Master [22:54:54] RECOVERY - DPKG on mw1135 is OK: All packages OK [22:55:07] !log re-installing ganglia-monitor on mw1135, fix dpkg [22:55:14] Logged the message, Master [23:00:06] !log removing mysql packages from db1036 (broken dpkg, status "pi") [23:00:12] Logged the message, Master [23:00:43] RECOVERY - DPKG on db1036 is OK: All packages OK [23:04:03] New patchset: Reedy; "(bug 46005) Set $wgCategoryCollation to 'uca-be-tarask' on be-x-old.wikipedia" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/54364 [23:04:23] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/54364 [23:04:56] !log reedy synchronized wmf-config/InitialiseSettings.php [23:05:02] Logged the message, Master [23:05:30] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [23:07:00] PROBLEM - Disk space on db1052 is CRITICAL: NRPE: Command check_disk_space not defined [23:07:28] !log mw13 - killing puppet and re-running [23:07:35] Logged the message, Master [23:07:40] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1 [23:08:00] RECOVERY - Disk space on db1052 is OK: DISK OK [23:08:13] ACKNOWLEDGEMENT - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1 daniel_zahn decom - RT #4828 [23:10:17] !log db11 - disabling all notifications [23:10:24] Logged the message, Master [23:11:02] hey [23:11:09] paravoid: heya [23:11:19] ori-l, ottomata: saw you mentioned my name a bunch of times [23:11:21] what do you need :) [23:11:50] oh! I actually have a question for you: can you explain how the -k option in timeout is used and how it's different than just the regular duration [23:11:55] the man page is pretty unclear [23:12:14] paravoiiiiid! hiii [23:12:15] i'm heading out [23:12:16] but! [23:12:22] and we have a couple of these and similar: "DISK CRITICAL - /var/lib/ceph/osd/ceph-64 is not accessible: Input/output error " on ms-be boxes [23:12:31] notpeter: the regular duration sends a sigterm, -k sends a sigkill [23:12:43] cool! makes sense [23:12:43] was trying to figure out what to push from the packaging of python-jsonschema [23:12:44] https://gerrit.wikimedia.org/r/#/c/56064/ [23:12:49] i pushed that for review :) [23:13:00] is the -k value that much longer after the regular duration? [23:13:08] in other news, limn puppet module needs review too [23:13:09] mutante: holy crap, we actually have disk checks now?! [23:13:20] paravoid: we have lots of checks now :) [23:13:25] https://gerrit.wikimedia.org/r/#/c/49710/ [23:13:28] buuuuuut, i am out [23:13:28] paravoid: we do, and since then we detected like 5 broken arrays and 3 full disks or something :p [23:13:37] thank youuuuuu, laters all [23:13:43] and loads of broken packages :) [23:13:46] I know about the ceph disks, but it's *very* helpful to have them in nagios [23:13:52] that's just awesome [23:14:02] kudos to whoever added those checks [23:14:15] notpeter I guess :) [23:14:15] paravoid: we used to have them! I just dusted them off :) [23:15:56] PROCS CRITICAL: 0 processes with regex args '^/usr/bin/python /usr/bin/swift-account-reaper' ..
shrug, the account-reaper doesn't even sound like we use it, but i can also start it and be told it would already be running, but it's not [23:16:20] New patchset: Pyoungmeister; "add more aggressive killing to puppet for precise" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56073 [23:16:32] paravoid: would you be willing to take a look at ^ [23:16:43] sec [23:16:49] ok [23:16:51] no worries [23:18:02] RECOVERY - swift-account-reaper on ms-be11 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [23:18:20] RECOVERY - swift-account-reaper on ms-be12 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [23:18:48] hey, but: [23:18:50] start swift-account-reaper [23:18:50] start: Job is already running: swift-account-reaper [23:19:01] (it told me to not use the init script) [23:19:15] anyways, yay :) [23:19:32] New patchset: Faidon; "Install python-swiftclient on all Swift boxes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56074 [23:19:43] i think Icinga CRITs are down to a record low for now [23:19:47] no they're not :) [23:19:53] they were a month ago [23:20:05] but now we have more checks, so your comment probably stands percentage-wise :) [23:20:23] last month I did a sprint and we were down to 10 alerts or something [23:20:29] crit+warning [23:20:38] I was very noisy about it, I'm sure RobH will remember :) [23:20:44] TimStarling: When you have a minute, could you give your view on https://gerrit.wikimedia.org/r/#/c/53125/ ? [23:20:45] :) and all the broken RAID tickets are already picked up by dc people :) [23:20:49] the check-raid.py script also needs to be updated [23:21:03] there are lots of parse errors with megacli64, is that what you're referring to? [23:21:15] paravoid: yeah [23:21:15] TimStarling: btw, nice work on favicon cleanup. It's been a long time coming, good to see it finally go mainstream. [23:21:22] it's just what I pulled off of db9 [23:21:32] it doesn't know about our newer raid card versions at all [23:21:36] mark, when you've got a chance can you take a look at https://gerrit.wikimedia.org/r/#/c/52606/ ?
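
Note on the swift-account-reaper checks above: the recoveries report "1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper" because the check matches on the argument list, not the command name; a daemon started as /usr/bin/python registers with the command name "python", so a name-based check can never find it. A sketch with the standard nagios-plugins check_procs flags (Dzahn's carbon-cache.py fix just below relies on the same distinction):

    # -C matches the process *command name* only; a python daemon shows up as
    # "python", so this reports 0 processes even while the reaper is running:
    /usr/lib/nagios/plugins/check_procs -c 1:1 -C swift-account-reaper

    # matching a regex against the argument list finds it:
    /usr/lib/nagios/plugins/check_procs -c 1:1 \
        --ereg-argument-array '^/usr/bin/python /usr/bin/swift-account-reaper'
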
[23:21:37] like, the ciscos, for example [23:22:58] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56074 [23:23:23] mutante: the account reaper job was started, it just crashed because python-swiftclient wasn't installed and it imports that [23:23:31] New patchset: Krinkle; "contint: Move docs.pp into contint" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53958 [23:23:37] that's a package bug, as it should clearly Depend on python-swiftclient [23:23:57] paravoid: thanks for the info, cool [23:26:32] New patchset: Dzahn; "fix monitoring of carbon-cache.py on professor (check args instead of cmdline)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56076 [23:27:15] New review: Dzahn; "root@professor:~# /usr/lib/nagios/plugins/check_procs -c 1:1 -C carbon-cache.py" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/56076 [23:27:17] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56076 [23:27:50] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:28:46] !log olivneh synchronized php-1.21wmf12/extensions/NavigationTiming/modules/ext.navigationTiming.js 'Workaround for buggy NavTiming implementation in IE9 (bug 46474)' [23:28:52] Logged the message, Master [23:29:26] notpeter: I don't understand what you're trying to do :) [23:29:36] I think you misunderstood how timeout works, or I did :) [23:29:40] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.167 second response time [23:30:06] timeout -k M N will send a term after N seconds and if it doesn't die after M seconds it will send a -9 [23:30:29] timeout -k M -s 9 N doesn't make any sense [23:30:55] !log restarting nrpe on professor [23:31:01] Logged the message, Master [23:31:47] gmetad keeps messing up neon :p [23:32:05] it is like spence now [23:32:14] Krinkle: since you're basically doing what I told you to do, I think you can assume that I am fine with it ;) [23:32:49] paravoid: you are correct. I thought that -k M -s 9 N would send a 9 after M+N seconds [23:32:56] if it was still running [23:33:01] but [23:33:06] that's the default behavior [23:33:22] ah, ok! [23:33:25] then i just want the -k [23:33:27] yes :) [23:33:30] cool! [23:33:32] thanks :) [23:33:33] on a semi-related note [23:33:34] https://ganglia.wikimedia.org/latest/?r=month&cs=&ce=&c=Miscellaneous+pmtpa&h=stafford.pmtpa.wmnet&tab=m&vn=&mc=2&z=medium&metric_group=ALLGROUPS [23:33:44] TimStarling: I'm asking since you +1'd but didn't merge it. Though I think now that that is common in operations/* (peer approve is +1, merge always done by author) [23:33:45] New patchset: Tim Starling; "Rename legacy 'live-1.5/' to 'w/'." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/53125 [23:34:05] i'm suspecting he wasn't going to deploy it [23:34:08] so hence not merging it ;) [23:34:13] I can merge it [23:34:21] OK [23:34:27] paravoid: weird.... [23:34:27] it's just that merging it is slightly more work than approving it [23:34:35] a lot of the procs are getting hung over time [23:34:43] which is my impetus for doing this now [23:34:50] since it implies that you will take some measures to avoid instantly breaking the cluster [23:35:12] Yeah, mediawiki-config merging without deploying is frowned upon, and rightfully so.
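
Note on the timeout exchange above: with GNU coreutils timeout, the main DURATION argument always governs the initial signal (SIGTERM by default), and -k only adds a KILL fallback for commands that ignore it, which is what paravoid is explaining. A small demo of those semantics:

    # TERM after 5 seconds; a well-behaved command dies here, timeout exits 124
    timeout 5 sleep 100

    # TERM after 5 seconds is ignored by the trap, so -k sends KILL 3 seconds
    # later (about 8 seconds total) and timeout exits 137 (128+9)
    timeout -k 3 5 bash -c 'trap "" TERM; sleep 100'
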
[23:35:14] New patchset: Pyoungmeister; "add more aggressive killing to puppet for precise" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56073 [23:35:16] like, say, deploying it in two stages instead of just replacing all the live-1.5 directories with symlinks to nowhere and then populating the new directory [23:35:18] I understand, I was just making sure. [23:35:30] !log maxsem synchronized wmf-config 'https://gerrit.wikimedia.org/r/#/c/56019/' [23:35:37] Logged the message, Master [23:35:39] Change merged: Tim Starling; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/53125 [23:35:50] RECOVERY - mysqld processes on db1056 is OK: PROCS OK: 1 process with command name mysqld [23:36:11] TimStarling: Right, because our deployment system is rather fragile. This kind of change would likely hit that more visibly [23:36:40] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: Puppet has not run in the last 10 hours [23:36:40] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: Puppet has not run in the last 10 hours [23:36:40] PROBLEM - Puppet freshness on msfe1002 is CRITICAL: Puppet has not run in the last 10 hours [23:36:40] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: Puppet has not run in the last 10 hours [23:36:50] RobH: around? [23:37:10] error: Updating the following directories would lose untracked files in it: [23:37:10] live-1.5 [23:37:13] makes things interesting [23:37:15] I mean the fact that our sync system keeps the machine pooled while it is syncing the machine, thus exposing it to http requests while it is in a completely unpredictable state, not unlikely with references to nonexistent files or classes. [23:38:42] TimStarling: git status --ignored [23:39:01] .svn, live-1.5/load.php~ and our fatal/phpinfo files. [23:39:31] I wonder how those are ignored? [23:39:49] New patchset: Asher; "pulling db1001, adding db1052" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56081 [23:39:59] easy enough to deal with [23:40:22] Ah, .git/info/exclude, of course [23:40:24] wth, cpu_nice.rrd: illegal attempt to update using time 1364341199 when last update time is 1364341199 (minimum one second step) [23:41:55] hey notpeter, it looks like db29 has another "lock wait timeout exceeded" and replication is stalled. [23:41:59] I wonder if there's any way to convince people to stop doing syncs for a while [23:42:03] any idea what causes this? [23:42:26] gmetad can't get answers from data sources, reports those "illegal attempts" trying to update more than once a second and generally uses lots of CPU, sigh [23:43:28] TimStarling: I've been thinking about implementing a basic sync blocker. A file somewhere with a message in it. (e.g. "Don't sync --tstarling"), if the file is non-empty our sync-* bins will abort and display the message, allowing override with Y/N on "Are you sure". Or something like that. [23:44:15] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56073 [23:44:36] !log tstarling synchronized w [23:44:42] Logged the message, Master [23:46:15] scap probably won't replace a directory with a symlink anyway [23:46:46] notpeter: I notice a bunch of XML files under /a/search/dumps, some with today's timestamp; do we create/update indexes from dump files as part of the normal workflow or are they there for emergencies? [23:46:51] TimStarling: I see that the docroot/*/w still point to live-1.5. I missed those. Perhaps (after syncing /w) do those first?
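
Note: a minimal sketch of the sync blocker Krinkle floats above, as a guard at the top of the sync-* scripts; the file path and message here are hypothetical, nothing like this existed at the time:

    # hypothetical guard for the top of sync-file / sync-dir / scap
    BLOCKFILE=/home/wikipedia/common/DONT_SYNC      # hypothetical path
    if [ -s "$BLOCKFILE" ]; then                    # -s: file exists and is non-empty
        echo "Syncing is currently blocked:"
        cat "$BLOCKFILE"                            # e.g. "Don't sync --tstarling"
        read -r -p "Sync anyway? [y/N] " answer
        [ "$answer" = "y" ] || exit 1
    fi
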
[23:46:59] So that everything is pointing to the new place. [23:47:07] pgehres: you are running queries that lock centralauth.globaluser [23:47:21] binasher: it's possible [23:47:33] it's fact [23:47:34] xyzram: that is part of the workflow for private wikis [23:47:39] it's not a problem if the symlink exists, is it? [23:47:51] (and also for some reason dewikisource...) [23:47:52] binasher: ha, i meant that i could be, but unintentionally so [23:48:05] When do they run? [23:48:08] is this something I can fix, or does it require you [23:48:26] TimStarling: True, but if some script is in the middle of recursively deleting /live-1.5 in preparation for creating the symlink. [23:48:42] I'm not going to do it that way [23:49:04] xyzram: 0200 utc [23:49:08] pgehres: you should avoid doing "UPDATE mytable …. ( SELECT table i don't want to block writes to )… " [23:49:26] defined in manifests/search.pp search::indexer in a big block of crons [23:49:38] You're going to update all references and when done simply delete live-1.5? (not even a symlink) That would work too. [23:49:40] notpeter: thanks. [23:49:40] pgehres: updating the audit tables via a join against a prod table can also result in a gaplock that will block inserts [23:49:45] binasher: yeah, i was just looking for those. i guess I mean, when this timeout occurs, can I fix the underlying replication [23:49:46] import-private and also import-broken [23:49:52] pgehres: just stop slave ; start slave ; [23:49:57] xyzram: yep! [23:50:15] dsh -o-lroot -g mediawiki-installation -cM ' [23:50:17] binasher: awesome! thanks. it worked [23:50:21] mv /usr/local/apache/common-local/live-1.5 /usr/local/apache/common-local/live-1.5.old [23:50:30] ln -s w /usr/local/apache/common-local/live-1.5 [23:50:34] ' [23:50:57] should only leave a few ms when the dir is absent [23:51:12] pgehres: i just ran this on db29 for you: "set global slave_transaction_retries=100000;" [23:51:17] I'll test it on a single server first [23:51:30] Aha, manually moving them. Yeah, that's best. [23:52:57] !log on all apaches: replacing /usr/local/apache/common-local/live-1.5 with a symlink to w [23:53:03] Logged the message, Master [23:54:26] it's done [23:54:53] and the site still seems to be up [23:55:34] I'll remove the old directory now [23:57:09] finished [23:57:44] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [23:59:54] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1
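
Note on the db29 replication thread above: the shape binasher warns against, and the two knobs he uses, written out with hypothetical table and column names (the real CentralAuth audit schema isn't shown in the log):

    # the problematic shape: an UPDATE that joins against a production table
    # takes (gap) locks on it and can block replication and inserts;
    # audit_table/gu_id/checked are hypothetical names
    mysql -e "UPDATE audit_table a JOIN globaluser g ON g.gu_id = a.gu_id SET a.checked = 1;"

    # unwedging a slave stuck on "lock wait timeout exceeded", as done on db29
    mysql -e "STOP SLAVE; START SLAVE;"

    # letting replication retry instead of stalling (the setting binasher applied)
    mysql -e "SET GLOBAL slave_transaction_retries = 100000;"
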