[00:00:51] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 2.824 second response time [00:06:06] ^d: Gerrit is 503ing :( [00:06:25] <^d> Yeah sorry about that. puppet's fixing it. [00:06:28] <^d> But it's taking its time. [00:06:41] OK [00:07:10] $ ssh -p 29418 gerrit.wikimedia.org [00:07:12] ssh: connect to host gerrit.wikimedia.org port 29418: Connection refused [00:07:18] Right OK yeah I'll let that sort itself out then ^_^ [00:08:48] <^d> Ok all back. [00:08:51] <^d> I promise that's it. [00:09:15] <^d> "notice: Finished catalog run in 169.60 seconds" [00:09:17] <^d> Damn puppet. [00:09:28] thanks :) [00:10:09] The page you requested was not found, or you do not have permission to view this page. [00:10:09] on https://gerrit.wikimedia.org/r/#/c/82769/ (which is my patchset) [00:10:27] WFM [00:10:36] <^d> Me too. [00:10:46] legoktm: Refresh all the things [00:11:26] hm [00:11:27] works now [00:11:42] <^d> RoanKattouw: gerrit2 30318 22.6 2.6 29104352 882080 ? Sl 00:07 0:45 GerritCodeReview -Xmx20g -jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site --run-id=1378339676.30295 [00:11:51] <^d> That -Xmx20g is the main reason we upgraded ;-) [00:11:57] What's that do [00:12:05] <^d> jvm heap = 20g. [00:12:06] 20GB memories [00:12:53] Nice [00:13:48] <^d> Also more cores, so I'll probably raise the number of threads for various tasks as well. [00:22:52] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:23:51] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 6.112 second response time [00:30:09] I'm sure commons already has some sort of strange template to handle that >.> *runs away from Reedy* [00:30:42] Wrong channel? 
[00:31:00] Wouldn't surprise me if there was and it was a big hardcoded switch [00:32:01] apparently yes >.> [00:32:15] * p858snake|l gets the MIB memory eraser out [00:35:48] (PS1) Ori.livneh: Fix spoofed hostname format for StatsD's Ganglia emitter [operations/puppet] - https://gerrit.wikimedia.org/r/82778 [00:36:19] ^ paravoid [00:36:41] i'm going to scp the change into place meanwhile [00:37:19] (CR) Faidon Liambotis: [C: 2] Fix spoofed hostname format for StatsD's Ganglia emitter [operations/puppet] - https://gerrit.wikimedia.org/r/82778 (owner: Ori.livneh) [00:37:43] (CR) Faidon Liambotis: [V: 2] Fix spoofed hostname format for StatsD's Ganglia emitter [operations/puppet] - https://gerrit.wikimedia.org/r/82778 (owner: Ori.livneh) [00:49:06] ^d: gerrit does not seem to merge currently [00:49:19] <^d> gerrit isn't or zuul isn't? [00:49:25] <^d> James_F just pinged me about the latter. [00:49:27] zuul probably [00:49:42] not sure which part of the puzzle is doing what these days [00:51:32] <^d> We're working on making a diagram ;-) [00:52:00] <^d> tl;dr version: zuul is the bridge & pipeline thing that runs between jenkins and gerrit [00:54:43] Yeah Zuul is totally broken [00:54:48] It claims everything is unmergeable [00:55:02] See latest comment on https://gerrit.wikimedia.org/r/#/c/82777/ which is based directly on master [00:57:22] ( ^d ---^^ ) [00:57:39] I suppose the server move might be throwing off the create-merge-commit step? [00:57:56] <^d> Example job? 
[01:01:59] I don't know [01:02:06] I linked an example of a failing commit above [01:02:34] And I know that the first step in gate-and-submit is to create the hypothetical merge commit that would result from merging the change into master [01:02:43] <^d> Yeah I'm trying to find where that is in jenkins [01:02:48] If that fails, the change is -1ed with a message saying "this isn't mergeable" [01:02:56] So I suspect that's failing for different reasons [01:02:57] Hmm [01:07:26] <^d> zuul's logs aren't being useful enough. [01:07:32] <^d> I can see they're failing, but no idea why [01:08:00] I'm digging around the Jenkins web UI trying to find my jobs [01:08:05] I just uploaded a whole bunch of VE changes [01:08:16] ^d: restart it? Doesn't it listen to stream events, which would've been interrupted? [01:08:26] <^d> I just restarted a bit ago [01:08:30] (grrrit-wm's source automatically restarted, I think) [01:08:35] ah, hmm [01:10:37] PROBLEM - Disk space on analytics1004 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/j 11255 MB (3% inode=99%): [01:12:34] <^d> RoanKattouw: I would look for jobs, but jenkins ui is not...fast [01:12:56] You might say that [01:12:59] 502 Proxy Error [01:13:10] <^d> Hmm, worked for two pywikibot commits. [01:13:12] <^d> https://gerrit.wikimedia.org/r/#/q/status:open,n,z [01:15:30] <^d> Krinkle: Yo, you around? [01:16:44] * Elsie bugs ^d in ops. [01:22:30] <^d> A-ha!!!!! [01:22:52] <^d> demon@gallium:/srv/ssd/zuul/git/mediawiki/core$ git remote -v [01:22:52] <^d> origin ssh://jenkins-bot@manganese.wikimedia.org:29418/mediawiki/core (fetch) [01:22:52] <^d> origin ssh://jenkins-bot@manganese.wikimedia.org:29418/mediawiki/core (push) [01:22:55] <^d> RoanKattouw: ^ [01:23:20] <^d> Willing to bet that explains the other ones too. [01:23:45] Well yeah .... [01:23:52] Why would we ever hardcode a server name like that [01:26:03] <^d> So, how on earth to fix this. [01:27:15] <^d> Hmm, I guess some bash magic over all .git/config's. 
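[editor's sketch] The `git remote -v` output above shows Zuul's clones pointing at the old manganese host. One way to do the "bash magic over all .git/config's" is sketched below in Python. This is only an illustration under stated assumptions, not what was actually run: the function name `rewrite_git_remotes`, the `.bak` backup suffix, and the choice of gerrit.wikimedia.org as the replacement host are all assumptions, and the root directory would be whatever holds the clones (the log mentions /srv/ssd/zuul/git).

```python
import shutil
from pathlib import Path

OLD_HOST = "manganese.wikimedia.org"  # hardcoded host seen in `git remote -v`
NEW_HOST = "gerrit.wikimedia.org"     # assumed replacement host


def rewrite_git_remotes(root, old=OLD_HOST, new=NEW_HOST, backup_suffix=".bak"):
    """Replace `old` with `new` in every .git/config found under `root`.

    Each file that needs changing is first copied to <name><backup_suffix>.
    Returns the list of config paths that were modified.
    """
    changed = []
    for config in sorted(Path(root).rglob("config")):
        # Only touch files that really are <repo>/.git/config.
        if config.parent.name != ".git" or not config.is_file():
            continue
        text = config.read_text()
        if old not in text:
            continue
        # Keep a copy so the change can be reverted by hand.
        shutil.copy2(config, config.with_name(config.name + backup_suffix))
        config.write_text(text.replace(old, new))
        changed.append(config)
    return changed


if __name__ == "__main__":
    import sys

    # Usage: python rewrite_remotes.py /path/to/clones
    for path in rewrite_git_remotes(sys.argv[1]):
        print(f"updated {path}")
```

The sketch matches on the full hostname rather than the bare word "manganese", so unrelated strings in a config are less likely to be rewritten, and it leaves a backup next to each file it touches.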
[01:27:19] Well, the quick fix is to do git remote rm origin && git remote add ssh://.......gerrit.wikimedia.org....... [01:27:21] Or that [01:27:27] and bash magic it, yeah [01:28:53] Like, for f in `find | grep '.git/config$'`; do sed -i -e 's/manganese/gerrit/g' $f; done ? [01:29:43] Tested the components individually and that seems to work [01:29:50] Might want to back up the .git/config files too though [01:30:04] ^d: Are you gonna be OK handling this? I'm being dragged away for dinner [01:30:43] <^d> I should be fine now that I know what's up. [01:30:47] OK awesome [01:30:51] Then I'm gonna go [01:31:20] Message me on gchat/hangout if you need me, I have that on my phone for both my personal and my work accounts [01:34:48] ^d: hey [01:34:55] I'm guessing the new gerrit IP is a service ip? [01:35:18] (PS3) Mattflaschen: Add GuidedTour to additional languages [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/82533 [01:35:18] while at it, it'd be neat if we also did ipv6 :) [01:35:30] (CR) jenkins-bot: [V: -1] Add GuidedTour to additional languages [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/82533 (owner: Mattflaschen) [01:35:42] <^d> paravoid: That it is :) [01:41:41] (PS1) Faidon Liambotis: Update GeoIP.conf comment with the new account [operations/puppet] - https://gerrit.wikimedia.org/r/82793 [01:42:20] (CR) Faidon Liambotis: [C: 2] Update GeoIP.conf comment with the new account [operations/puppet] - https://gerrit.wikimedia.org/r/82793 (owner: Faidon Liambotis) [01:42:27] (CR) Faidon Liambotis: [V: 2] Update GeoIP.conf comment with the new account [operations/puppet] - https://gerrit.wikimedia.org/r/82793 (owner: Faidon Liambotis) [01:42:33] <^d> !log gallium: updated git remotes in /srv/ssd/zuul/git/* to not point to old manganese box and point to gerrit.wm.o instead [01:42:35] drdee: ^ [01:42:38] Logged the message, Master [01:44:15] ^d: does gerrit do anything with IPs? 
[01:44:18] authentication or whatever? [01:44:40] <^d> Nope, doesn't care. [01:44:48] so adding ipv6 might not be so hard [01:44:57] <^d> Should be easy. It sits behind apache and barely knows the outside world exists. [01:45:12] <^d> Well, for web. SSH might be a little more fun. [01:45:40] for web I'm sure ma rk would want to put it behind varnish :) [01:46:06] <^d> That's the goal, we talked about it but haven't finished. [01:46:15] <^d> It'll probably break things. [01:46:17] <^d> :) [01:46:20] yeah [01:46:23] also probably means lvs [01:46:36] well, certainly [01:46:44] and we'll need SSH LVS too, heh [01:46:48] fun [01:46:57] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:47:02] <^d> Well gerrit can only live on one box. [01:47:21] <^d> Unless you have super secret google code that lets it run in the cloud ;-) [01:47:27] but you do need to point :80 to varnish, :443 to ssl terminators and :22 to gerrit :) [01:47:45] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 4.023 second response time [01:48:15] (PS2) Springle: db maintenance role for terbium [operations/puppet] - https://gerrit.wikimedia.org/r/81633 [01:48:21] anyway [01:48:24] let's do ipv6 :) [01:48:38] not now, I'm dead [01:48:44] <^d> Yeah me too. [01:48:46] it's 5am and I woke up at 9 :) [01:48:46] (CR) Springle: [C: 2 V: 2] db maintenance role for terbium [operations/puppet] - https://gerrit.wikimedia.org/r/81633 (owner: Springle) [01:49:45] RECOVERY - check_job_queue on hume is OK: JOBQUEUE OK - all job queues below 10,000 [01:49:47] <^d> paravoid: You can also do authenticated things over https, which will make things double fun if we put it behind an extra layer. [01:50:05] technically, two extra layers :) [01:50:07] but yeah [01:50:31] I hope at least it doesn't do any SSL client auth [01:50:59] <^d> It can do some funky things with auth. 
[01:51:15] <^d> And Ryan was wanting to use something like mod_openid to make login seamless with wikitech. [01:51:15] PROBLEM - Puppet freshness on analytics1003 is CRITICAL: No successful Puppet run in the last 10 hours [01:52:55] PROBLEM - check_job_queue on hume is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:15:59] !log LocalisationUpdate completed (1.22wmf15) at Thu Sep 5 02:15:59 UTC 2013 [02:16:03] Logged the message, Master [02:23:55] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:24:45] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 1.613 second response time [02:30:04] !log LocalisationUpdate completed (1.22wmf14) at Thu Sep 5 02:30:03 UTC 2013 [02:30:07] Logged the message, Master [02:47:15] !log LocalisationUpdate ResourceLoader cache refresh completed at Thu Sep 5 02:47:15 UTC 2013 [02:47:19] Logged the message, Master [02:56:55] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:57:45] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 3.545 second response time [03:14:31] RECOVERY - Disk space on analytics1003 is OK: DISK OK [03:59:41] RECOVERY - check_job_queue on hume is OK: JOBQUEUE OK - all job queues below 10,000 [04:02:51] PROBLEM - check_job_queue on hume is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[04:13:16] PROBLEM - Puppet freshness on sq36 is CRITICAL: No successful Puppet run in the last 10 hours [05:22:19] (CR) Asher: [V: 1] add percona processlist monitoring, and start migrating to PMP v1.0.4 [operations/puppet] - https://gerrit.wikimedia.org/r/82221 (owner: Springle) [05:28:13] PROBLEM - Puppet freshness on stafford is CRITICAL: No successful Puppet run in the last 10 hours [06:43:13] PROBLEM - Puppet freshness on search1003 is CRITICAL: No successful Puppet run in the last 10 hours [06:43:13] PROBLEM - Puppet freshness on search1005 is CRITICAL: No successful Puppet run in the last 10 hours [06:44:13] PROBLEM - Puppet freshness on analytics1011 is CRITICAL: No successful Puppet run in the last 10 hours [06:44:13] PROBLEM - Puppet freshness on analytics1026 is CRITICAL: No successful Puppet run in the last 10 hours [07:14:34] (PS2) Springle: add percona processlist monitoring, and start migrating to PMP v1.0.4 [operations/puppet] - https://gerrit.wikimedia.org/r/82221 [07:25:42] (CR) Springle: [C: 2 V: 2] add percona processlist monitoring, and start migrating to PMP v1.0.4 [operations/puppet] - https://gerrit.wikimedia.org/r/82221 (owner: Springle) [07:34:06] PROBLEM - Puppet freshness on virt0 is CRITICAL: No successful Puppet run in the last 10 hours [07:39:36] RECOVERY - Disk space on analytics1004 is OK: DISK OK [07:47:39] re [07:53:47] (PS1) Ori.livneh: Replace Gmetric implementation [operations/puppet] - https://gerrit.wikimedia.org/r/82805 [07:57:53] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:58:43] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.130 second response time [08:32:48] PROBLEM - MySQL Slave Delay on db36 is CRITICAL: CRIT replication delay 35841 seconds [08:33:48] RECOVERY - MySQL Slave Delay on db36 is OK: OK replication delay 0 seconds [08:38:48] PROBLEM - MySQL Slave Delay on db36 
is CRITICAL: CRIT replication delay 35556 seconds [09:05:53] !log jenkins: killed two requesthandler threads that were eating CPU power. [09:05:56] Logged the message, Master [09:07:18] RECOVERY - MySQL Replication Heartbeat on db38 is OK: OK replication delay 142 seconds [09:07:38] RECOVERY - MySQL Slave Delay on db38 is OK: OK replication delay 0 seconds [09:07:43] pow pow [09:09:55] mark: morning are you around ? :-] lanthanum.eqiad.wmnet is showing high system proc again. Might be kernel issue [09:11:51] apergos: hi :) are you aware of any bug that would make a server use a lot of system CPU time for no reason ? [09:12:03] lanthanum.eqiad.wmnet is hit by it : https://ganglia.wikimedia.org/latest/?r=month&cs=&ce=&c=Miscellaneous+eqiad&h=lanthanum.eqiad.wmnet&tab=m&vn=&mc=2&z=medium&metric_group=ALLGROUPS [09:13:08] the watchdogs [09:13:10] that's interesting [09:13:21] I don't know, let's see what google says [09:13:28] on installation that machine apparently had the same high system cpu [09:13:36] last week I got its kernel upgraded and rebooted it [09:13:40] mark had to console reboot it [09:13:49] after 6days + it shows up the same issue again [09:14:05] so I was wondering if that might be a known issue :-] [09:14:28] lots of these: Sep 5 06:35:28 lanthanum kernel: [593173.826155] CPU2: Package power limit notification (total events = 559284) [09:14:46] the first being at [488647.093218] [09:14:51] no idea what time it is :-D [09:14:56] I guess seconds uptime [09:15:45] ahh dmesg -T|head [09:15:49] [Wed Sep 4 01:15:51 2013] CPU0: Package power limit notification (total events = 347340) [09:15:55] I see em from aug 25 [09:16:32] but your issue starts on the 1st or 2nd [09:18:49] and I have no idea how to find out what is causing the issue [09:18:56] I guess system CPU time is related to the kernel [09:19:37] ahh [09:19:51] power_saving command takes lot of %CPU [09:22:20] Aug 29 09:31:53 lanthanum kernel: [ 0.117213] CPU0: Thermal monitoring enabled (TM1) 
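[editor's sketch] The kernel lines quoted above each carry a running "total events" counter per CPU. A rough way to summarise them, for instance to compare counts before and after a BIOS change, is sketched below. It assumes the message format is exactly as pasted in the log; the function name `latest_event_counts` is made up for illustration, and the input could be the output of `dmesg` or lines from a syslog file.

```python
import re

# Matches e.g. "CPU2: Package power limit notification (total events = 559284)"
PATTERN = re.compile(
    r"CPU(?P<cpu>\d+): Package power limit notification "
    r"\(total events = (?P<events>\d+)\)"
)


def latest_event_counts(lines):
    """Return {cpu_number: highest 'total events' value seen} for the lines."""
    counts = {}
    for line in lines:
        match = PATTERN.search(line)
        if match:
            cpu = int(match.group("cpu"))
            events = int(match.group("events"))
            counts[cpu] = max(counts.get(cpu, 0), events)
    return counts


if __name__ == "__main__":
    import sys

    # Usage: dmesg | python power_limit_summary.py
    for cpu, events in sorted(latest_event_counts(sys.stdin).items()):
        print(f"CPU{cpu}: {events} power limit events")
```

Because the counter is cumulative since boot, the highest value seen per CPU is the current total; taking two snapshots some hours apart would give a rate.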
[09:29:01] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:30:20] filing an RT :-) [09:30:51] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.414 second response time [09:31:36] we could see if cpu stepping is enabled in the bios on this box and if so disable it, see if that helps [09:31:49] this would obviously require downtime [09:33:11] https://rt.wikimedia.org/Ticket/Display.html?id=5722 :-D [09:33:24] apergos: that machine is not yet used so it can go down whenever needed. [09:33:34] https://bugzilla.kernel.org/show_bug.cgi?id=36182 [09:34:27] and the patch just removes the notifications apparently [09:36:13] not thinking about the patch but about changing the bios setting [09:51:14] springle: Thanks for the help on the RT ticket, Sean. I've made a request for shell access. [09:51:38] springle: btw, where are you based? [09:52:05] siebrand: np. east-coast australia [09:52:19] springle: oh! Tim's not alone anymore :) [09:52:23] :) [09:56:53] all right lemme try rebooting and resetting this thing in the bios and we'll keep an eye on it, hashar... ok by you? [09:56:59] !log leaving sync_binlog=0 on db36 and db38 so replication keeps up. 
to be reviewed [09:57:01] apergos: yup [09:57:02] Logged the message, Master [09:57:16] apergos: would be lovely to one day use power saving though :-] [09:58:10] if the box stays busy it might not make that much difference [09:59:06] I got to get a bunch of patch deployed to be able to switch jobs to it [09:59:21] note that once the server has been rebooted, it will be idling again [09:59:35] maybe we can add an icinga check to monitor system cpu [10:01:52] doo dee doo dee doo [10:01:55] slow reboot is slow [10:02:18] last time it got stuck :/ [10:02:31] great [10:02:33] PROBLEM - SSH on lanthanum is CRITICAL: Connection refused [10:02:43] PROBLEM - Disk space on lanthanum is CRITICAL: Connection refused by host [10:02:46] if I gotta, I'll powercycle. but let's give it a few minutes [10:03:33] PROBLEM - RAID on lanthanum is CRITICAL: Connection refused by host [10:03:33] PROBLEM - DPKG on lanthanum is CRITICAL: Connection refused by host [10:07:03] PROBLEM - Host lanthanum is DOWN: PING CRITICAL - Packet loss = 100% [10:07:43] RECOVERY - MySQL Slave Delay on db36 is OK: OK replication delay 0 seconds [10:08:33] RECOVERY - MySQL Replication Heartbeat on db36 is OK: OK replication delay -0 seconds [10:12:05] (PS4) Hashar: contint: generate .gitconfig files for all jenkins users [operations/puppet] - https://gerrit.wikimedia.org/r/75856 [10:12:20] (CR) Hashar: "rebased" [operations/puppet] - https://gerrit.wikimedia.org/r/75856 (owner: Hashar) [10:12:33] apergos: did you check that power profile bios setting for lanthanum? 
[10:12:46] I'm waiting for the bios to come up now [10:13:18] there's two settings I want to look at, one is 'dell custom' or 'os' which is the power setting one, I forget the name, and the other is the cpu stepping one [10:14:08] performance per watt = os so that's all right [10:14:45] ok [10:15:46] actually I'm going to set it to custom so I can turn off the c states [10:19:12] (PS4) Hashar: contint: publish Zuul git over git protocol [operations/puppet] - https://gerrit.wikimedia.org/r/82625 [10:19:13] (PS2) Hashar: contint: prevents access to Zuul daemon [operations/puppet] - https://gerrit.wikimedia.org/r/82614 [10:20:46] (PS3) Hashar: contint: prevents access to Zuul daemon [operations/puppet] - https://gerrit.wikimedia.org/r/82614 [10:21:28] RECOVERY - DPKG on lanthanum is OK: All packages OK [10:21:29] RECOVERY - RAID on lanthanum is OK: OK: State is Optimal, checked 1 logical device(s) [10:21:29] RECOVERY - SSH on lanthanum is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [10:21:36] (CR) Hashar: "patchset2: move the git daemon firewall rule to https://gerrit.wikimedia.org/r/82625 which define git-daemon" [operations/puppet] - https://gerrit.wikimedia.org/r/82614 (owner: Hashar) [10:21:38] RECOVERY - Host lanthanum is UP: PING OK - Packet loss = 0%, RTA = 0.34 ms [10:21:39] RECOVERY - Disk space on lanthanum is OK: DISK OK [10:21:53] !log lanthanum bios processor settings set to 'custom' and c states disabled (attempt to fix power limit notification issue) [10:21:56] Logged the message, Master [10:21:59] (CR) Hashar: "Now comes with a firewall rule I extracted from https://gerrit.wikimedia.org/r/82614" [operations/puppet] - https://gerrit.wikimedia.org/r/82625 (owner: Hashar) [10:22:03] guess we'll see in a few days [10:23:40] (PS5) Hashar: contint: publish Zuul git over git protocol [operations/puppet] - https://gerrit.wikimedia.org/r/82625 [10:23:56] (CR) Hashar: "and rebased :-]" 
[operations/puppet] - https://gerrit.wikimedia.org/r/82625 (owner: Hashar) [10:24:38] apergos: thanks :-) can you update https://rt.wikimedia.org/Ticket/Display.html?id=5722 for later reference ? [10:24:52] anyone know what db1045 is for? some sort of test box? [10:25:20] done [10:26:13] springle: the only note I have found in server admin logs is Asher upgrading it to precise in July 2012 :( [10:26:13] no idea [10:27:45] I am off to lunch [10:27:53] enjoy [10:28:38] he's not gone yet, you can ask him I guess, it was 'upgrading to precise for testing' [10:29:09] * springle does [10:31:48] RECOVERY - DPKG on db1045 is OK: All packages OK [10:54:22] (CR) Faidon Liambotis: [C: 2] "Assuming you're cleaning up gmetric.js manually." [operations/puppet] - https://gerrit.wikimedia.org/r/82805 (owner: Ori.livneh) [11:27:27] PROBLEM - Ceph on ms-fe1003 is CRITICAL: Ceph HEALTH_WARN 340 pgs degraded: 322 pgs stuck unclean: recovery 6206025/923605575 degraded (0.672%) [11:27:27] PROBLEM - Ceph on ms-fe1001 is CRITICAL: Ceph HEALTH_WARN 340 pgs degraded: 324 pgs stuck unclean: recovery 6206025/923605578 degraded (0.672%) [11:27:35] uh oh [11:27:47] PROBLEM - Ceph on ms-fe1004 is CRITICAL: Ceph HEALTH_WARN 340 pgs degraded: 325 pgs stuck unclean: recovery 6206033/923606007 degraded (0.672%) [11:28:12] [1728184.013786] sd 0:2:1:0: [sdb] Unhandled error code [11:28:12] [1728184.013790] sd 0:2:1:0: [sdb] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK [11:28:15] [1728184.013793] sd 0:2:1:0: [sdb] CDB: Write(10): 2a 00 74 7f 77 a6 00 00 02 00 [11:28:17] PROBLEM - Disk space on ms-be1011 is CRITICAL: DISK CRITICAL - /var/lib/ceph/osd/ceph-120 is not accessible: Input/output error [11:28:18] cool [11:28:47] PROBLEM - MySQL Processlist on db1009 is CRITICAL: CRIT 1 unauthenticated, 0 locked, 0 copy to table, 44 statistics [11:29:47] paravoid: hey :) [11:29:47] RECOVERY - MySQL Processlist on db1009 is OK: OK 0 unauthenticated, 0 locked, 0 copy to table, 1 statistics 
[11:30:09] paravoid: if you have any knowledge about our swift install, I filed an RT about some unmounted device on ms-be5.pmtpa.wmnet ( https://rt.wikimedia.org/Ticket/Display.html?id=5719 ) [11:30:32] container-replicator spam syslog messages about it :D [11:31:07] apergos: know anything about that or should I start investigating? [11:31:45] hm I don't know [11:34:38] 720xd replacement gone wrong? [11:35:01] er [11:35:04] h710 that is [11:35:14] this seems to have / on sda/sdb [11:35:21] but is configured by puppet for sdm/sdn [11:41:53] (PS1) Faidon Liambotis: swift: ms-be5 is H710 [operations/puppet] - https://gerrit.wikimedia.org/r/82817 [11:41:58] apergos: ^ [11:42:07] Product Name: PERC H710 Mini [11:42:17] ok great [11:42:19] root@ms-be5:~# uptime 11:42:13 up 55 days, 14:42, 1 user, load average: 18.92, 15.31, 14.67 [11:42:22] nice [11:42:27] (CR) Faidon Liambotis: [C: 2] swift: ms-be5 is H710 [operations/puppet] - https://gerrit.wikimedia.org/r/82817 (owner: Faidon Liambotis) [11:43:14] ah regex and nodes [11:44:20] vhtcpd[1958]: TCP conn to 127.0.0.1:80: response too large, dropping request poor purges :D [11:50:48] not updated in rings either [11:50:49] sigh [11:52:06] PROBLEM - Puppet freshness on analytics1003 is CRITICAL: No successful Puppet run in the last 10 hours [11:54:35] apergos: what was the last box you did a controller swap? [12:06:19] 236 12 10.0.6.208 6000 sde1 0.00 2 999.99 [12:06:52] what's the deal with that? [12:06:55] wth? [12:06:56] apergos: ? [12:07:50] resolved 3 months ago [12:07:55] seriously [12:09:49] and un-rebalanced too [12:10:33] https://rt.wikimedia.org/Ticket/Display.html?id=5432 I see this but cannot remember anything about it [12:13:02] !log pushing new swift object/account/container rings: setting ms-be9's sde1 to 33, swapping ms-be5's sda/b <-> sdm/n, also 33 [12:13:05] Logged the message, Master [12:18:41] hashar: thanks a lot for spotting this [12:20:54] apergos: so when was the last controller swap? 
[12:21:09] I guess it was ms-be5 [12:21:16] 2 months ago? [12:21:20] yes [12:21:28] I thought you were still working on that [12:21:33] yes [12:24:52] paravoid: and I am happy to see you jumped in :) [12:24:53] it's been 6 weeks. the last two of those weeks I have mostly been gone (and there was a drive being replaced). it takes a couple weeks plus for the box with the controller to get back to speed. the two weeks in between there were swift spikes or other back end oddness every time I thought about touching a box [12:29:52] I am not sure how we could get that monitored [12:30:05] maybe the syslog conf can be made to send any line containing ERROR to a swift-error.log file [12:30:09] and we could monitor it [12:35:23] j^: could you put 6FA6DCF5 to some keyserver? [12:35:35] keyservers even [12:35:36] or even just one [12:36:16] paravoid: if you are in the mood, I could use a couple merges for contint. I have moved some iptables rules to the contint module :-] [12:36:37] and I will be veery close from phasing out manifests/misc/contint.pp \O/ [12:36:45] iptables? ew [12:37:10] yup gallium has a public address and some of its services have to be fire walled :D [12:37:18] that existed before our time [12:37:58] ferm? [12:40:28] yeah ryan raised that argument to me :-D [12:41:05] but apparently nobody got ferm yet and I am too lazy to get a look at it right now. I merely need one port firewalled and took the occasion to move the rules to contint module. 
[12:42:26] PROBLEM - DPKG on tmh1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:42:46] PROBLEM - DPKG on tmh1002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:43:15] (that would be me) [12:43:23] PROBLEM - DPKG on mw1156 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:43:33] PROBLEM - DPKG on mw1154 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:43:43] PROBLEM - DPKG on mw1157 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:43:43] PROBLEM - DPKG on mw1160 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:43:43] PROBLEM - MySQL Processlist on db1024 is CRITICAL: CRIT 40 unauthenticated, 0 locked, 0 copy to table, 1 statistics [12:43:59] oh shit [12:44:10] I did a dist-upgrade on all imagescalers but there's apache in there [12:44:19] it's going to stop all of them momentarily [12:44:21] pages coming [12:44:23] PROBLEM - DPKG on mw1158 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:44:23] PROBLEM - DPKG on mw1155 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:44:23] PROBLEM - DPKG on mw1159 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:44:35] :( [12:45:23] PROBLEM - DPKG on mw77 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:45:23] PROBLEM - DPKG on mw79 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:45:24] PROBLEM - DPKG on mw78 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:45:33] PROBLEM - DPKG on tmh2 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:45:34] PROBLEM - DPKG on tmh1 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:45:43] RECOVERY - MySQL Processlist on db1024 is OK: OK 0 unauthenticated, 0 locked, 0 copy to table, 1 statistics [12:45:53] PROBLEM - DPKG on mw80 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:46:13] PROBLEM - Apache HTTP on mw1157 is CRITICAL: Connection refused [12:46:25] PROBLEM - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is CRITICAL: Connection 
refused [12:46:27] PROBLEM - Apache HTTP on mw1154 is CRITICAL: Connection refused [12:46:27] RECOVERY - DPKG on mw1159 is OK: All packages OK [12:46:34] PROBLEM - Apache HTTP on mw1160 is CRITICAL: Connection refused [12:46:34] RECOVERY - DPKG on mw1154 is OK: All packages OK [12:46:34] RECOVERY - Puppet freshness on search1003 is OK: puppet ran at Thu Sep 5 12:46:32 UTC 2013 [12:46:34] PROBLEM - Apache HTTP on mw1158 is CRITICAL: Connection refused [12:46:34] PROBLEM - Apache HTTP on mw1159 is CRITICAL: Connection refused [12:46:43] damn [12:46:43] PROBLEM - Apache HTTP on mw1156 is CRITICAL: Connection refused [12:46:44] RECOVERY - DPKG on mw1160 is OK: All packages OK [12:46:44] RECOVERY - DPKG on mw1157 is OK: All packages OK [12:46:44] RECOVERY - DPKG on tmh1002 is OK: All packages OK [12:47:23] RECOVERY - DPKG on mw1156 is OK: All packages OK [12:47:23] RECOVERY - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 60835 bytes in 0.482 second response time [12:47:26] RECOVERY - DPKG on mw1158 is OK: All packages OK [12:47:26] RECOVERY - Apache HTTP on mw1154 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.075 second response time [12:47:26] RECOVERY - DPKG on mw1155 is OK: All packages OK [12:47:26] RECOVERY - DPKG on tmh1001 is OK: All packages OK [12:47:34] PROBLEM - Apache HTTP on mw77 is CRITICAL: Connection refused [12:47:41] sorry about that [12:47:44] PROBLEM - Apache HTTP on mw78 is CRITICAL: Connection refused [12:47:44] PROBLEM - Apache HTTP on mw79 is CRITICAL: Connection refused [12:47:53] PROBLEM - Apache HTTP on mw80 is CRITICAL: Connection refused [12:48:34] RECOVERY - DPKG on tmh2 is OK: All packages OK [12:48:45] RECOVERY - DPKG on mw80 is OK: All packages OK [12:48:53] PROBLEM - Apache HTTP on mw1028 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:48:53] PROBLEM - Apache HTTP on mw1086 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:48:53] PROBLEM - Apache HTTP on mw1094 is 
CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:49:34] RECOVERY - DPKG on tmh1 is OK: All packages OK [12:49:43] PROBLEM - Apache HTTP on mw1103 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:49:44] RECOVERY - Apache HTTP on mw1028 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.285 second response time [12:49:53] PROBLEM - Apache HTTP on mw1096 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:49:53] RECOVERY - Apache HTTP on mw1094 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.623 second response time [12:50:23] RECOVERY - DPKG on mw77 is OK: All packages OK [12:50:23] PROBLEM - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is CRITICAL: Connection refused [12:50:26] RECOVERY - DPKG on mw78 is OK: All packages OK [12:50:26] RECOVERY - DPKG on mw79 is OK: All packages OK [12:50:43] RECOVERY - Apache HTTP on mw1103 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.246 second response time [12:50:43] RECOVERY - Apache HTTP on mw1096 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.270 second response time [12:50:43] RECOVERY - Puppet freshness on search1005 is OK: puppet ran at Thu Sep 5 12:50:36 UTC 2013 [12:50:43] PROBLEM - Apache HTTP on mw1087 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:50:44] PROBLEM - Apache HTTP on mw1104 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:50:53] RECOVERY - Apache HTTP on mw1086 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 5.501 second response time [12:51:33] PROBLEM - Apache HTTP on mw1075 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:51:33] RECOVERY - Apache HTTP on mw1104 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.063 second response time [12:52:19] 2013-09-05 12:42:50.221438 [apaches] Could not depool server mw1093.eqiad.wmnet because of too many down! 
[12:52:22] 2013-09-05 12:42:50.267998 [apaches] Monitoring instance ProxyFetch reports server mw1074.eqiad.wmnet (enabled/up/pooled) down: Getting http://en.wikipedia.org/wiki/Main_Page took longer than 5 seconds. [12:52:24] RECOVERY - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 60835 bytes in 0.278 second response time [12:52:26] wait, I didn't touch these [12:52:27] RECOVERY - Apache HTTP on mw1075 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.070 second response time [12:52:29] I only touched imagescalers [12:52:29] wth [12:52:33] RECOVERY - Apache HTTP on mw1087 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 1.298 second response time [12:53:00] (PS1) Akosiaris: Adding scala as a build dependency [operations/debs/kafka] (debian) - https://gerrit.wikimedia.org/r/82822 [12:53:23] PROBLEM - Apache HTTP on mw1106 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:53:23] PROBLEM - Apache HTTP on mw1042 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:53:24] PROBLEM - Apache HTTP on mw1092 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:54:13] RECOVERY - Apache HTTP on mw1106 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.069 second response time [12:54:13] RECOVERY - Apache HTTP on mw1042 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.057 second response time [12:54:23] RECOVERY - Apache HTTP on mw1092 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 6.961 second response time [12:54:53] PROBLEM - Apache HTTP on mw1094 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:55:03] PROBLEM - Apache HTTP on mw1217 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:55:33] RECOVERY - Apache HTTP on mw1158 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.104 second response time [12:55:43] PROBLEM - Apache HTTP on mw1032 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:55:44] RECOVERY 
- Apache HTTP on mw1094 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.064 second response time [12:56:02] 7103950677 wikiadmin 10.64.16.142:44668 viwiki Query 635 Sending data SELECT /*!40001 SQL_NO_CACHE */ * FROM `pagelinks` [12:56:08] oh snapshot1004... [12:56:13] RECOVERY - Apache HTTP on mw1157 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.096 second response time [12:56:29] apergos: did you restart snapshot1004 jobs? [12:56:33] RECOVERY - Apache HTTP on mw1032 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.091 second response time [12:56:37] there are always dump jobs running there [12:56:43] PROBLEM - Apache HTTP on mw1162 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:56:50] they run on a rolling basis [12:56:53] RECOVERY - Apache HTTP on mw1217 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 2.180 second response time [12:56:58] I killed a bunch of workers the other day [12:57:01] as it was killing the site [12:57:07] i.e. one completes, the next starts; I restarted those when I got back on Tuesday [12:57:22] did you fix the bugs? [12:57:23] PROBLEM - Apache HTTP on mw1024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:57:33] PROBLEM - Apache HTTP on mw1034 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:57:33] RECOVERY - Apache HTTP on mw1162 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.041 second response time [12:57:44] RECOVERY - Apache HTTP on mw78 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.279 second response time [12:57:53] PROBLEM - Apache HTTP on mw1091 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:57:54] can you please kill them? [12:57:54]