[00:00:51] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 2.824 second response time [00:06:06] ^d: Gerrit is 503ing :( [00:06:25] <^d> Yeah sorry about that. puppet's fixing it. [00:06:28] <^d> But it's taking it's time. [00:06:41] OK [00:07:10] $ ssh -p 29418 gerrit.wikimedia.org [00:07:12] ssh: connect to host gerrit.wikimedia.org port 29418: Connection refused [00:07:18] Right OK yeah I'll let that sort itself out then ^_^ [00:08:48] <^d> Ok all back. [00:08:51] <^d> I promise that's it. [00:09:15] <^d> "notice: Finished catalog run in 169.60 seconds" [00:09:17] <^d> Damn puppet. [00:09:28] thanks :) [00:10:09] The page you requested was not found, or you do not have permission to view this page. [00:10:09] on https://gerrit.wikimedia.org/r/#/c/82769/ (which is my patchset) [00:10:27] WFM [00:10:36] <^d> Me too. [00:10:46] legoktm: Refresh all the things [00:11:26] hm [00:11:27] works now [00:11:42] <^d> RoanKattouw: gerrit2 30318 22.6 2.6 29104352 882080 ? Sl 00:07 0:45 GerritCodeReview -Xmx20g -jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site --run-id=1378339676.30295 [00:11:51] <^d> That -Xmx20g is the main reason we upgraded ;-) [00:11:57] What's that do [00:12:05] <^d> jvm heap = 20g. [00:12:06] 20GB memories [00:12:53] Nice [00:13:48] <^d> Also more cores, so I'll probably raise the number of threads for various tasks as well. [00:22:52] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:23:51] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 6.112 second response time [00:30:09] I'm sure commons already has some sort of strange template to handle that >.> *runs away from Reedy* [00:30:42] Wrong channel? 
[00:31:00] Wouldn't surprise me if there was and it was a big hardcoded switch [00:32:01] apparently yes >.> [00:32:15] * p858snake|l gets the MIB memory eraser out [00:35:48] (PS1) Ori.livneh: Fix spoofed hostname format for StatsD's Ganglia emitter [operations/puppet] - https://gerrit.wikimedia.org/r/82778 [00:36:19] ^ paravoid [00:36:41] i'm going to scp the change into place meanwhile [00:37:19] (CR) Faidon Liambotis: [C: 2] Fix spoofed hostname format for StatsD's Ganglia emitter [operations/puppet] - https://gerrit.wikimedia.org/r/82778 (owner: Ori.livneh) [00:37:43] (CR) Faidon Liambotis: [V: 2] Fix spoofed hostname format for StatsD's Ganglia emitter [operations/puppet] - https://gerrit.wikimedia.org/r/82778 (owner: Ori.livneh) [00:49:06] ^d: gerrit does not seem to merge currently [00:49:19] <^d> gerrit isn't or zuul isn't? [00:49:25] <^d> James_F just pinged me about the latter. [00:49:27] zuul probably [00:49:42] not sure which part of the puzzle is doing what these days [00:51:32] <^d> We're working on making a diagram ;-) [00:52:00] <^d> tl;dr version: zuul is the bridge & pipeline thing that runs between jenkins and gerrit [00:54:43] Yeah Zuul is totally broken [00:54:48] It claims everything is unmergeable [00:55:02] See latest comment on https://gerrit.wikimedia.org/r/#/c/82777/ which is based directly on master [00:57:22] ( ^d ---^^ ) [00:57:39] I suppose the server move might be throwing off the create-merge-commit step? [00:57:56] <^d> Example job? 
[01:01:59] I don't know [01:02:06] I linked an example of a failing commit above [01:02:34] And I know that the first step in gate-and-submit is to create the hypothetical merge commit that would result from merging the change into master [01:02:43] <^d> Yeah I'm trying to find where that is in jenkins [01:02:48] If that fails, the change is -1ed with a message saying "this isn't mergeable" [01:02:56] So I suspect that's failing for different reasons [01:02:57] Hmm [01:07:26] <^d> zuul's logs aren't being useful enough. [01:07:32] <^d> I can see they're failing, but no idea why [01:08:00] I'm digging around the Jenkins web UI trying to find my jobs [01:08:05] I just uploaded a whole bunch of VE changes [01:08:16] ^d: restart it? Doesn't it listen to stream events, which would've been interrupted? [01:08:26] <^d> I just restarted a bit ago [01:08:30] (grrrit-wm's source automatically restarted, I think) [01:08:35] ah, hmm [01:10:37] PROBLEM - Disk space on analytics1004 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/j 11255 MB (3% inode=99%): [01:12:34] <^d> RoanKattouw: I would look for jobs, but jenkins ui is not...fast [01:12:56] You might say that [01:12:59] 502 Proxy Error [01:13:10] <^d> Hmm, worked for two pywikibot commits. [01:13:12] <^d> https://gerrit.wikimedia.org/r/#/q/status:open,n,z [01:15:30] <^d> Krinkle: Yo, you around? [01:16:44] * Elsie bugs ^d in ops. [01:22:30] <^d> A-ha!!!!! [01:22:52] <^d> demon@gallium:/srv/ssd/zuul/git/mediawiki/core$ git remote -v [01:22:52] <^d> origin ssh://jenkins-bot@manganese.wikimedia.org:29418/mediawiki/core (fetch) [01:22:52] <^d> origin ssh://jenkins-bot@manganese.wikimedia.org:29418/mediawiki/core (push) [01:22:55] <^d> RoanKattouw: ^ [01:23:20] <^d> Willing to bet that explains the other ones too. [01:23:45] Well yeah .... [01:23:52] Why would we ever hardcode a server name like that [01:26:03] <^d> So, how on earth to fix this. [01:27:15] <^d> Hmm, I guess some bash magic over all .git/config's. 
[01:27:19] Well, the quick fix is to do git remote rm origin && git remote add ssh://.......gerrit.wikimedia.org....... [01:27:21] Or that [01:27:27] and bash magic it, yeah [01:28:53] Like, for f in `find | grep '.git/config$'`; do sed -i -e 's/manganese/gerrit/g' $f; done ? [01:29:43] Tested the components individually and that seems to work [01:29:50] Might want to back up the .git/config files too though [01:30:04] ^d: Are you gonna be OK handling this? I'm being dragged away for dinner [01:30:43] <^d> I should be fine now that I know what's up. [01:30:47] OK awesome [01:30:51] Then I'm gonna go [01:31:20] Message me on gchat/hangout if you need me, I have that on my phone for both my personal and my work accounts [01:34:48] ^d: hey [01:34:55] I'm guessing the new gerrit IP is a service ip? [01:35:18] (PS3) Mattflaschen: Add GuidedTour to additional languages [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/82533 [01:35:18] while at it, it'd be neat if we also did ipv6 :) [01:35:30] (CR) jenkins-bot: [V: -1] Add GuidedTour to additional languages [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/82533 (owner: Mattflaschen) [01:35:42] <^d> paravoid: That it is :) [01:41:41] (PS1) Faidon Liambotis: Update GeoIP.conf comment with the new account [operations/puppet] - https://gerrit.wikimedia.org/r/82793 [01:42:20] (CR) Faidon Liambotis: [C: 2] Update GeoIP.conf comment with the new account [operations/puppet] - https://gerrit.wikimedia.org/r/82793 (owner: Faidon Liambotis) [01:42:27] (CR) Faidon Liambotis: [V: 2] Update GeoIP.conf comment with the new account [operations/puppet] - https://gerrit.wikimedia.org/r/82793 (owner: Faidon Liambotis) [01:42:33] <^d> !log gallium: updated git remotes in /srv/ssd/zuul/git/* to not point to old manganese box and point to gerrit.wm.o instead [01:42:35] drdee: ^ [01:42:38] Logged the message, Master [01:44:15] ^d: does gerrit do anything with IPs? 
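The remote rewrite discussed above (a sed pass over every .git/config, with the suggested backups) can be sketched roughly as follows. This is only an illustration, not the command that was actually run on gallium: the function name rewrite_remotes is hypothetical, GNU sed's -i option is assumed, and the old hostname is passed as a regular expression, so its dots should be escaped by the caller.

```shell
#!/bin/sh
# Hypothetical helper (not from the log): replace a hostname in every
# .git/config found under a root directory, keeping a .bak copy of each file.
# Assumes GNU sed (for -i) and that neither hostname contains a '|' character.
rewrite_remotes() {
    root=$1   # directory tree to search
    old=$2    # old hostname, written as a sed regular expression
    new=$3    # replacement hostname
    find "$root" -type f -path '*/.git/config' | while IFS= read -r cfg; do
        cp -p "$cfg" "$cfg.bak"              # back up first, as suggested in the chat
        sed -i -e "s|$old|$new|g" "$cfg"     # rewrite the remote URLs in place
    done
}
```

Under those assumptions, a call such as `rewrite_remotes /srv/ssd/zuul/git 'manganese\.wikimedia\.org' 'gerrit.wikimedia.org'` would touch the same tree named in the !log entry; running `git remote -v` in one repository afterwards is a cheap way to confirm the result.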
[01:44:18] authentication or whatever? [01:44:40] <^d> Nope, doesn't care. [01:44:48] so adding ipv6 might not be so hard [01:44:57] <^d> Should be easy. It sits behind apache and barely knows the outside world exists. [01:45:12] <^d> Well, for web. SSH might be a little more fun. [01:45:40] for web I'm sure ma rk would want to put it behind varnish :) [01:46:06] <^d> That's the goal, we talked about it but haven't finished. [01:46:15] <^d> It'll probably break things. [01:46:17] <^d> :) [01:46:20] yeah [01:46:23] also probably means lvs [01:46:36] well, certainly [01:46:44] and we'll need SSH LVS too, heh [01:46:48] fun [01:46:57] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:47:02] <^d> Well gerrit can only live on one box. [01:47:21] <^d> Unless you have super secret google code that lets it run in the cloud ;-) [01:47:27] but you do need to point :80 to varnish, :443 to ssl terminators and :22 to gerrit :) [01:47:45] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 4.023 second response time [01:48:15] (PS2) Springle: db maintenance role for terbium [operations/puppet] - https://gerrit.wikimedia.org/r/81633 [01:48:21] anyway [01:48:24] let's do ipv6 :) [01:48:38] not now, I'm dead [01:48:44] <^d> Yeah me too. [01:48:46] it's 5am and I woke up at 9 :) [01:48:46] (CR) Springle: [C: 2 V: 2] db maintenance role for terbium [operations/puppet] - https://gerrit.wikimedia.org/r/81633 (owner: Springle) [01:49:45] RECOVERY - check_job_queue on hume is OK: JOBQUEUE OK - all job queues below 10,000 [01:49:47] <^d> paravoid: You can also do authenticated things over https, which will make things double fun if we put it behind an extra layer. [01:50:05] technically, two extra layers :) [01:50:07] but yeah [01:50:31] I hope at least it doesn't do any SSL client auth [01:50:59] <^d> It can do some funky things with auth. 
[01:51:15] <^d> And Ryan was wanting to use something like mod_openid to make login seamless with wikitech. [01:51:15] PROBLEM - Puppet freshness on analytics1003 is CRITICAL: No successful Puppet run in the last 10 hours [01:52:55] PROBLEM - check_job_queue on hume is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:15:59] !log LocalisationUpdate completed (1.22wmf15) at Thu Sep 5 02:15:59 UTC 2013 [02:16:03] Logged the message, Master [02:23:55] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:24:45] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 1.613 second response time [02:30:04] !log LocalisationUpdate completed (1.22wmf14) at Thu Sep 5 02:30:03 UTC 2013 [02:30:07] Logged the message, Master [02:47:15] !log LocalisationUpdate ResourceLoader cache refresh completed at Thu Sep 5 02:47:15 UTC 2013 [02:47:19] Logged the message, Master [02:56:55] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:57:45] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 3.545 second response time [03:14:31] RECOVERY - Disk space on analytics1003 is OK: DISK OK [03:59:41] RECOVERY - check_job_queue on hume is OK: JOBQUEUE OK - all job queues below 10,000 [04:02:51] PROBLEM - check_job_queue on hume is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[04:13:16] PROBLEM - Puppet freshness on sq36 is CRITICAL: No successful Puppet run in the last 10 hours [05:22:19] (CR) Asher: [V: 1] add percona processlist monitoring, and start migrating to PMP v1.0.4 [operations/puppet] - https://gerrit.wikimedia.org/r/82221 (owner: Springle) [05:28:13] PROBLEM - Puppet freshness on stafford is CRITICAL: No successful Puppet run in the last 10 hours [06:43:13] PROBLEM - Puppet freshness on search1003 is CRITICAL: No successful Puppet run in the last 10 hours [06:43:13] PROBLEM - Puppet freshness on search1005 is CRITICAL: No successful Puppet run in the last 10 hours [06:44:13] PROBLEM - Puppet freshness on analytics1011 is CRITICAL: No successful Puppet run in the last 10 hours [06:44:13] PROBLEM - Puppet freshness on analytics1026 is CRITICAL: No successful Puppet run in the last 10 hours [07:14:34] (PS2) Springle: add percona processlist monitoring, and start migrating to PMP v1.0.4 [operations/puppet] - https://gerrit.wikimedia.org/r/82221 [07:25:42] (CR) Springle: [C: 2 V: 2] add percona processlist monitoring, and start migrating to PMP v1.0.4 [operations/puppet] - https://gerrit.wikimedia.org/r/82221 (owner: Springle) [07:34:06] PROBLEM - Puppet freshness on virt0 is CRITICAL: No successful Puppet run in the last 10 hours [07:39:36] RECOVERY - Disk space on analytics1004 is OK: DISK OK [07:47:39] re [07:53:47] (PS1) Ori.livneh: Replace Gmetric implementation [operations/puppet] - https://gerrit.wikimedia.org/r/82805 [07:57:53] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:58:43] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.130 second response time [08:32:48] PROBLEM - MySQL Slave Delay on db36 is CRITICAL: CRIT replication delay 35841 seconds [08:33:48] RECOVERY - MySQL Slave Delay on db36 is OK: OK replication delay 0 seconds [08:38:48] PROBLEM - MySQL Slave Delay on db36 
is CRITICAL: CRIT replication delay 35556 seconds [09:05:53] !log jenkins: killed two requesthandler threads that were eating CPU power. [09:05:56] Logged the message, Master [09:07:18] RECOVERY - MySQL Replication Heartbeat on db38 is OK: OK replication delay 142 seconds [09:07:38] RECOVERY - MySQL Slave Delay on db38 is OK: OK replication delay 0 seconds [09:07:43] pow pow [09:09:55] mark: morning are you around ? :-] lanthanum.eqiad.wmnet is showing high system proc again. Might be kernel issue [09:11:51] apergos: hi :) are you aware of any bug that would make a server use lot of system CPU time for now reason ? [09:12:03] lanthanum.eqiad.wmnet is hit by it : https://ganglia.wikimedia.org/latest/?r=month&cs=&ce=&c=Miscellaneous+eqiad&h=lanthanum.eqiad.wmnet&tab=m&vn=&mc=2&z=medium&metric_group=ALLGROUPS [09:13:08] the watchdogs [09:13:10] that's interesting [09:13:21] I don't know, let's see what google says [09:13:28] on installation that machine apparently had the same high system cpu [09:13:36] last week I got its kernel upgraded and rebooted it [09:13:40] mark had to console reboot it [09:13:49] after 6days + it shows up the same issue again [09:14:05] so I was wondering if that might be a know issue :-] [09:14:28] lots of these: Sep 5 06:35:28 lanthanum kernel: [593173.826155] CPU2: Package power limit notification (total events = 559284) [09:14:46] the first being at [488647.093218] [09:14:51] no idea what time it is :-D [09:14:56] I guess seconds uptime [09:15:45] ahh dmesg -T|head [09:15:49] [Wed Sep 4 01:15:51 2013] CPU0: Package power limit notification (total events = 347340) [09:15:55] I see em from aug 25 [09:16:32] but your issue starts on the 1st or 2nd [09:18:49] and I have no idea how to find out what is causing the issue [09:18:56] I guess system CPU time is related to the kernel [09:19:37] ahh [09:19:51] power_saving command takes lot of %CPU [09:22:20] Aug 29 09:31:53 lanthanum kernel: [ 0.117213] CPU0: Thermal monitoring enabled (TM1) 
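The "Package power limit notification" lines pasted above can be summarised per CPU with a small filter. A minimal sketch, assuming the kernel message format matches the lines quoted in this log; the function name power_limit_summary is made up for illustration and is not something used in the conversation.

```shell
#!/bin/sh
# Hypothetical helper: read kernel log text on stdin and print, for each CPU,
# the highest "total events" counter seen in power limit notifications.
# The message format is assumed from the lines quoted in the chat.
power_limit_summary() {
    # Reduce each matching line to "CPUn count"; non-matching lines are dropped.
    sed -n 's/.*\(CPU[0-9][0-9]*\): Package power limit notification (total events = \([0-9][0-9]*\)).*/\1 \2/p' |
    # Keep the largest counter per CPU, then sort for stable output.
    awk '{ if (!($1 in max) || $2 + 0 > max[$1]) max[$1] = $2 + 0 }
         END { for (c in max) print c, max[c] }' |
    sort
}
```

It could be fed from `dmesg | power_limit_summary` or from a syslog file; comparing two runs a few minutes apart would show how fast the counters are growing.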
[09:29:01] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:30:20] filling a RT :-) [09:30:51] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.414 second response time [09:31:36] we could see if cpu stepping is enabled in the bios on this box and if so disable it, see if that helps [09:31:49] this would obviously require downtime [09:33:11] https://rt.wikimedia.org/Ticket/Display.html?id=5722 :-D [09:33:24] apergos: that machine is not yet used so it can go down whenever needed. [09:33:34] https://bugzilla.kernel.org/show_bug.cgi?id=36182 [09:34:27] and the patch just remove the notifications apparently [09:36:13] not thinking about the patch but about changing the bios setting [09:51:14] springle: Thanks for the help on the RT ticket, Sean. I've made a request for shell access. [09:51:38] springle: btw, where are you based? [09:52:05] siebrand: np. east-coast australia [09:52:19] springle: oh! Tim's not alone anymore :) [09:52:23] :) [09:56:53] all right lemme try rebooting and resetting this thing in the bios and we'll keep an eye on it, hashar... ok by you? [09:56:59] !log leaving sync_binlog=0 on db36 and db38 so replication keeps up. 
to be reviewed [09:57:01] apergos: yup [09:57:02] Logged the message, Master [09:57:16] apergos: would be lovely to one day use power saving though :-] [09:58:10] if the box stays busy it might not make that much difference [09:59:06] I got to get a bunch of patch deployed to be able to switch jobs to it [09:59:21] note that once the server has been rebooted, it will be idling again [09:59:35] maybe we can add an icinga check to monitor system cpu [10:01:52] doo dee doo dee doo [10:01:55] slow reboot is slow [10:02:18] last time it got stuck :/ [10:02:31] great [10:02:33] PROBLEM - SSH on lanthanum is CRITICAL: Connection refused [10:02:43] PROBLEM - Disk space on lanthanum is CRITICAL: Connection refused by host [10:02:46] if I gotta, I'll powercycle. but let's give it a few minutes [10:03:33] PROBLEM - RAID on lanthanum is CRITICAL: Connection refused by host [10:03:33] PROBLEM - DPKG on lanthanum is CRITICAL: Connection refused by host [10:07:03] PROBLEM - Host lanthanum is DOWN: PING CRITICAL - Packet loss = 100% [10:07:43] RECOVERY - MySQL Slave Delay on db36 is OK: OK replication delay 0 seconds [10:08:33] RECOVERY - MySQL Replication Heartbeat on db36 is OK: OK replication delay -0 seconds [10:12:05] (PS4) Hashar: contint: generate .gitconfig files for all jenkins users [operations/puppet] - https://gerrit.wikimedia.org/r/75856 [10:12:20] (CR) Hashar: "rebased" [operations/puppet] - https://gerrit.wikimedia.org/r/75856 (owner: Hashar) [10:12:33] apergos: did you check that power profile bios setting for lanthanum? 
[10:12:46] I'm waiting for the bios to come up now [10:13:18] there's two settings I want to look at, one is 'dell custom' or 'os' which is the power setting one, I forget the name, and the other is the cpu stepping one [10:14:08] performance per watt = os so that's all right [10:14:45] ok [10:15:46] actually I'm going to set it to custom so I can turn off the c states [10:19:12] (PS4) Hashar: contint: publish Zuul git over git protocol [operations/puppet] - https://gerrit.wikimedia.org/r/82625 [10:19:13] (PS2) Hashar: contint: prevents access to Zuul daemon [operations/puppet] - https://gerrit.wikimedia.org/r/82614 [10:20:46] (PS3) Hashar: contint: prevents access to Zuul daemon [operations/puppet] - https://gerrit.wikimedia.org/r/82614 [10:21:28] RECOVERY - DPKG on lanthanum is OK: All packages OK [10:21:29] RECOVERY - RAID on lanthanum is OK: OK: State is Optimal, checked 1 logical device(s) [10:21:29] RECOVERY - SSH on lanthanum is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [10:21:36] (CR) Hashar: "patchset2: move the git daemon firewall rule to https://gerrit.wikimedia.org/r/82625 which define git-daemon" [operations/puppet] - https://gerrit.wikimedia.org/r/82614 (owner: Hashar) [10:21:38] RECOVERY - Host lanthanum is UP: PING OK - Packet loss = 0%, RTA = 0.34 ms [10:21:39] RECOVERY - Disk space on lanthanum is OK: DISK OK [10:21:53] !log lanthanum bios processor settings set to 'custom' and c states disabled (attempt to fix power limit notification issue) [10:21:56] Logged the message, Master [10:21:59] (CR) Hashar: "Now comes with a firewall rule I extracted from https://gerrit.wikimedia.org/r/82614" [operations/puppet] - https://gerrit.wikimedia.org/r/82625 (owner: Hashar) [10:22:03] guess we'll see in a few days [10:23:40] (PS5) Hashar: contint: publish Zuul git over git protocol [operations/puppet] - https://gerrit.wikimedia.org/r/82625 [10:23:56] (CR) Hashar: "and rebased :-]" 
[operations/puppet] - https://gerrit.wikimedia.org/r/82625 (owner: Hashar) [10:24:38] apergos: thanks :-) can you update https://rt.wikimedia.org/Ticket/Display.html?id=5722 for later reference ? [10:24:52] anyone know what db1045 is for? some sort of test box? [10:25:20] done [10:26:13] springle: the only note I have found in server admin logs is Asher upgrading it to precise in July 2012 :( [10:26:13] no idea [10:27:45] I am off to lunch [10:27:53] enjoy [10:28:38] he's not gone yet, you can ask him I guess, it was 'upgrading to precise for testing' [10:29:09] * springle does [10:31:48] RECOVERY - DPKG on db1045 is OK: All packages OK [10:54:22] (CR) Faidon Liambotis: [C: 2] "Assuming you're cleaning up gmetric.js manually." [operations/puppet] - https://gerrit.wikimedia.org/r/82805 (owner: Ori.livneh) [11:27:27] PROBLEM - Ceph on ms-fe1003 is CRITICAL: Ceph HEALTH_WARN 340 pgs degraded: 322 pgs stuck unclean: recovery 6206025/923605575 degraded (0.672%) [11:27:27] PROBLEM - Ceph on ms-fe1001 is CRITICAL: Ceph HEALTH_WARN 340 pgs degraded: 324 pgs stuck unclean: recovery 6206025/923605578 degraded (0.672%) [11:27:35] uh oh [11:27:47] PROBLEM - Ceph on ms-fe1004 is CRITICAL: Ceph HEALTH_WARN 340 pgs degraded: 325 pgs stuck unclean: recovery 6206033/923606007 degraded (0.672%) [11:28:12] [1728184.013786] sd 0:2:1:0: [sdb] Unhandled error code [11:28:12] [1728184.013790] sd 0:2:1:0: [sdb] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK [11:28:15] [1728184.013793] sd 0:2:1:0: [sdb] CDB: Write(10): 2a 00 74 7f 77 a6 00 00 02 00 [11:28:17] PROBLEM - Disk space on ms-be1011 is CRITICAL: DISK CRITICAL - /var/lib/ceph/osd/ceph-120 is not accessible: Input/output error [11:28:18] cool [11:28:47] PROBLEM - MySQL Processlist on db1009 is CRITICAL: CRIT 1 unauthenticated, 0 locked, 0 copy to table, 44 statistics [11:29:47] paravoid: hey :) [11:29:47] RECOVERY - MySQL Processlist on db1009 is OK: OK 0 unauthenticated, 0 locked, 0 copy to table, 1 statistics 
[11:30:09] paravoid: if you have any knowledge about our swift install, I filed an RT about some unmounted device on ms-be5.pmtpa.wmnet ( https://rt.wikimedia.org/Ticket/Display.html?id=5719 ) [11:30:32] container-replicator spam syslog messages about it :D [11:31:07] apergos: know anything about that or should I start investigating? [11:31:45] hm I don't know [11:34:38] 720xd replacement gone wrong? [11:35:01] er [11:35:04] h710 that is [11:35:14] this seems to have / on sda/sdb [11:35:21] but is configured by puppet for sdm/sdn [11:41:53] (PS1) Faidon Liambotis: swift: ms-be5 is H710 [operations/puppet] - https://gerrit.wikimedia.org/r/82817 [11:41:58] apergos: ^ [11:42:07] Product Name: PERC H710 Mini [11:42:17] ok great [11:42:19] root@ms-be5:~# uptime 11:42:13 up 55 days, 14:42, 1 user, load average: 18.92, 15.31, 14.67 [11:42:22] nice [11:42:27] (CR) Faidon Liambotis: [C: 2] swift: ms-be5 is H710 [operations/puppet] - https://gerrit.wikimedia.org/r/82817 (owner: Faidon Liambotis) [11:43:14] ah regex and nodes [11:44:20] vhtcpd[1958]: TCP conn to 127.0.0.1:80: response too large, dropping request poor purges :D [11:50:48] not updated in rings either [11:50:49] sigh [11:52:06] PROBLEM - Puppet freshness on analytics1003 is CRITICAL: No successful Puppet run in the last 10 hours [11:54:35] apergos: what was the last box you did a controller swap? [12:06:19] 236 12 10.0.6.208 6000 sde1 0.00 2 999.99 [12:06:52] what's the deal with that? [12:06:55] wth? [12:06:56] apergos: ? [12:07:50] resolved 3 months ago [12:07:55] seriously [12:09:49] and un-rebalanced too [12:10:33] https://rt.wikimedia.org/Ticket/Display.html?id=5432 I see this but cannot remember anything about it [12:13:02] !log pushing new swift object/account/container rings: setting ms-be9's sde1 to 33, swapping ms-be5's sda/b <-> sdm/n, also 33 [12:13:05] Logged the message, Master [12:18:41] hashar: thanks a lot for spotting this [12:20:54] apergos: so when was the last controller swap? 
[12:21:09] I guess it was ms-be5 [12:21:16] 2 months ago? [12:21:20] yes [12:21:28] I thought you were still working on that [12:21:33] yes [12:24:52] paravoid: and I am happy to see you jumped in :) [12:24:53] it's been 6 weeks. the last two of those weeks I have mostly been gone (and there was a drive being replaced). it takes a couple weeks plus for the box with the controller to get back to speed. the two weeks in between there were swift spikes or other back end oddness every time I thought about touching a box [12:29:52] I am not sure how we could get that monitored [12:30:05] maybe the syslog conf can be made to send any line containing ERROR to a swift-error.log file [12:30:09] and we could monitor it [12:35:23] j^: could you put 6FA6DCF5 to some keyserver? [12:35:35] keyservers even [12:35:36] or even just one [12:36:16] paravoid: if you are in the mood, I could use a couple merges for contint. I have moved some iptables rules to the contint module :-] [12:36:37] and I will be veery close from phasing out manifests/misc/contint.pp \O/ [12:36:45] iptables? ew [12:37:10] yup gallium has a public address and some of its services have to be fire walled :D [12:37:18] that existed before our time [12:37:58] ferm? [12:40:28] yeah ryan raised that argument to me :-D [12:41:05] but apparently nobody got ferm yet and I am too lazy to get a look at it right now. I merely need one port firewalled and took the occasion to move the rules to contint module. 
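The suggestion above to send syslog lines containing ERROR to a swift-error.log file and monitor it could be paired with a small Nagios/Icinga-style check. This is a sketch under assumptions, not an existing check: the function name check_error_lines, the log path, and the thresholds are all hypothetical, and the return codes follow the usual plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```shell
#!/bin/sh
# Hypothetical check: count lines containing "ERROR" in a log file and map the
# count to a Nagios/Icinga-style status line and return code.
check_error_lines() {
    file=$1   # log file to inspect (e.g. a dedicated swift error log)
    warn=$2   # WARNING when the count reaches this value
    crit=$3   # CRITICAL when the count reaches this value
    if [ ! -r "$file" ]; then
        echo "UNKNOWN - cannot read $file"
        return 3
    fi
    # grep -c prints 0 but exits non-zero when nothing matches, hence "|| true".
    count=$(grep -c 'ERROR' "$file" || true)
    if [ "$count" -ge "$crit" ]; then
        echo "CRITICAL - $count ERROR lines in $file"
        return 2
    elif [ "$count" -ge "$warn" ]; then
        echo "WARNING - $count ERROR lines in $file"
        return 1
    fi
    echo "OK - $count ERROR lines in $file"
    return 0
}
```

A wrapper script would call it as, say, `check_error_lines /var/log/swift-error.log 1 10` and exit with its return code. Because the count covers the whole file, the thresholds only make sense if the log is rotated regularly.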
[12:42:26] PROBLEM - DPKG on tmh1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:42:46] PROBLEM - DPKG on tmh1002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:43:15] (that would be me) [12:43:23] PROBLEM - DPKG on mw1156 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:43:33] PROBLEM - DPKG on mw1154 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:43:43] PROBLEM - DPKG on mw1157 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:43:43] PROBLEM - DPKG on mw1160 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:43:43] PROBLEM - MySQL Processlist on db1024 is CRITICAL: CRIT 40 unauthenticated, 0 locked, 0 copy to table, 1 statistics [12:43:59] oh shit [12:44:10] I did a dist-upgrade on all imagescalers but there's apache in there [12:44:19] it's going to stop all of them momentarily [12:44:21] pages coming [12:44:23] PROBLEM - DPKG on mw1158 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:44:23] PROBLEM - DPKG on mw1155 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:44:23] PROBLEM - DPKG on mw1159 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:44:35] :( [12:45:23] PROBLEM - DPKG on mw77 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:45:23] PROBLEM - DPKG on mw79 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:45:24] PROBLEM - DPKG on mw78 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:45:33] PROBLEM - DPKG on tmh2 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:45:34] PROBLEM - DPKG on tmh1 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:45:43] RECOVERY - MySQL Processlist on db1024 is OK: OK 0 unauthenticated, 0 locked, 0 copy to table, 1 statistics [12:45:53] PROBLEM - DPKG on mw80 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:46:13] PROBLEM - Apache HTTP on mw1157 is CRITICAL: Connection refused [12:46:25] PROBLEM - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is CRITICAL: Connection 
refused [12:46:27] PROBLEM - Apache HTTP on mw1154 is CRITICAL: Connection refused [12:46:27] RECOVERY - DPKG on mw1159 is OK: All packages OK [12:46:34] PROBLEM - Apache HTTP on mw1160 is CRITICAL: Connection refused [12:46:34] RECOVERY - DPKG on mw1154 is OK: All packages OK [12:46:34] RECOVERY - Puppet freshness on search1003 is OK: puppet ran at Thu Sep 5 12:46:32 UTC 2013 [12:46:34] PROBLEM - Apache HTTP on mw1158 is CRITICAL: Connection refused [12:46:34] PROBLEM - Apache HTTP on mw1159 is CRITICAL: Connection refused [12:46:43] damn [12:46:43] PROBLEM - Apache HTTP on mw1156 is CRITICAL: Connection refused [12:46:44] RECOVERY - DPKG on mw1160 is OK: All packages OK [12:46:44] RECOVERY - DPKG on mw1157 is OK: All packages OK [12:46:44] RECOVERY - DPKG on tmh1002 is OK: All packages OK [12:47:23] RECOVERY - DPKG on mw1156 is OK: All packages OK [12:47:23] RECOVERY - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 60835 bytes in 0.482 second response time [12:47:26] RECOVERY - DPKG on mw1158 is OK: All packages OK [12:47:26] RECOVERY - Apache HTTP on mw1154 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.075 second response time [12:47:26] RECOVERY - DPKG on mw1155 is OK: All packages OK [12:47:26] RECOVERY - DPKG on tmh1001 is OK: All packages OK [12:47:34] PROBLEM - Apache HTTP on mw77 is CRITICAL: Connection refused [12:47:41] sorry about that [12:47:44] PROBLEM - Apache HTTP on mw78 is CRITICAL: Connection refused [12:47:44] PROBLEM - Apache HTTP on mw79 is CRITICAL: Connection refused [12:47:53] PROBLEM - Apache HTTP on mw80 is CRITICAL: Connection refused [12:48:34] RECOVERY - DPKG on tmh2 is OK: All packages OK [12:48:45] RECOVERY - DPKG on mw80 is OK: All packages OK [12:48:53] PROBLEM - Apache HTTP on mw1028 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:48:53] PROBLEM - Apache HTTP on mw1086 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:48:53] PROBLEM - Apache HTTP on mw1094 is 
CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:49:34] RECOVERY - DPKG on tmh1 is OK: All packages OK [12:49:43] PROBLEM - Apache HTTP on mw1103 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:49:44] RECOVERY - Apache HTTP on mw1028 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.285 second response time [12:49:53] PROBLEM - Apache HTTP on mw1096 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:49:53] RECOVERY - Apache HTTP on mw1094 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.623 second response time [12:50:23] RECOVERY - DPKG on mw77 is OK: All packages OK [12:50:23] PROBLEM - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is CRITICAL: Connection refused [12:50:26] RECOVERY - DPKG on mw78 is OK: All packages OK [12:50:26] RECOVERY - DPKG on mw79 is OK: All packages OK [12:50:43] RECOVERY - Apache HTTP on mw1103 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.246 second response time [12:50:43] RECOVERY - Apache HTTP on mw1096 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.270 second response time [12:50:43] RECOVERY - Puppet freshness on search1005 is OK: puppet ran at Thu Sep 5 12:50:36 UTC 2013 [12:50:43] PROBLEM - Apache HTTP on mw1087 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:50:44] PROBLEM - Apache HTTP on mw1104 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:50:53] RECOVERY - Apache HTTP on mw1086 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 5.501 second response time [12:51:33] PROBLEM - Apache HTTP on mw1075 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:51:33] RECOVERY - Apache HTTP on mw1104 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.063 second response time [12:52:19] 2013-09-05 12:42:50.221438 [apaches] Could not depool server mw1093.eqiad.wmnet because of too many down! 
[12:52:22] 2013-09-05 12:42:50.267998 [apaches] Monitoring instance ProxyFetch reports server mw1074.eqiad.wmnet (enabled/up/pooled) down: Getting http://en.wikipedia.org/wiki/Main_Page took longer than 5 seconds. [12:52:24] RECOVERY - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 60835 bytes in 0.278 second response time [12:52:26] wait, I didn't touch these [12:52:27] RECOVERY - Apache HTTP on mw1075 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.070 second response time [12:52:29] I only touched imagescalers [12:52:29] wth [12:52:33] RECOVERY - Apache HTTP on mw1087 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 1.298 second response time [12:53:00] (PS1) Akosiaris: Adding scala as a build dependency [operations/debs/kafka] (debian) - https://gerrit.wikimedia.org/r/82822 [12:53:23] PROBLEM - Apache HTTP on mw1106 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:53:23] PROBLEM - Apache HTTP on mw1042 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:53:24] PROBLEM - Apache HTTP on mw1092 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:54:13] RECOVERY - Apache HTTP on mw1106 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.069 second response time [12:54:13] RECOVERY - Apache HTTP on mw1042 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.057 second response time [12:54:23] RECOVERY - Apache HTTP on mw1092 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 6.961 second response time [12:54:53] PROBLEM - Apache HTTP on mw1094 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:55:03] PROBLEM - Apache HTTP on mw1217 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:55:33] RECOVERY - Apache HTTP on mw1158 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.104 second response time [12:55:43] PROBLEM - Apache HTTP on mw1032 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:55:44] RECOVERY 
- Apache HTTP on mw1094 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.064 second response time [12:56:02] 7103950677 wikiadmin 10.64.16.142:44668 viwiki Query 635 Sending data SELECT /*!40001 SQL_NO_CACHE */ * FROM `pagelinks` [12:56:08] oh snapshot1004... [12:56:13] RECOVERY - Apache HTTP on mw1157 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.096 second response time [12:56:29] apergos: did you restart snapshot1004 jobs? [12:56:33] RECOVERY - Apache HTTP on mw1032 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.091 second response time [12:56:37] there are always dump jobs running there [12:56:43] PROBLEM - Apache HTTP on mw1162 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:56:50] they run on a rolling basis [12:56:53] RECOVERY - Apache HTTP on mw1217 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 2.180 second response time [12:56:58] I killed a bunch of workers the other day [12:57:01] as it was killing the site [12:57:07] i.e. one completes, the next starts; I restarted those when I got back on Tuesday [12:57:22] did you fix the bugs? [12:57:23] PROBLEM - Apache HTTP on mw1024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:57:33] PROBLEM - Apache HTTP on mw1034 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:57:33] RECOVERY - Apache HTTP on mw1162 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.041 second response time [12:57:44] RECOVERY - Apache HTTP on mw78 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.279 second response time [12:57:53] PROBLEM - Apache HTTP on mw1091 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:57:54] can you please kill them? [12:57:54] the wikiexporter bug? that's been outstanding for ... well, ever. [12:58:02] for now? 
[12:58:23] RECOVERY - Apache HTTP on mw1024 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 6.508 second response time [12:58:33] PROBLEM - Apache HTTP on mw1065 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:58:33] RECOVERY - Apache HTTP on mw1034 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.246 second response time [12:58:33] PROBLEM - Apache HTTP on mw1039 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:58:53] PROBLEM - Apache HTTP on mw1026 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:58:53] PROBLEM - Apache HTTP on mw1094 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:59:00] there were two that were requesting tables, I have ^z them [12:59:23] RECOVERY - Apache HTTP on mw1039 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.067 second response time [12:59:24] RECOVERY - Apache HTTP on mw1065 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 6.686 second response time [12:59:44] RECOVERY - Apache HTTP on mw1091 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.045 second response time [12:59:53] RECOVERY - Apache HTTP on mw80 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.346 second response time [12:59:53] PROBLEM - Apache HTTP on mw1105 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:00:43] RECOVERY - Apache HTTP on mw1105 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.066 second response time [13:00:44] RECOVERY - Apache HTTP on mw1026 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.090 second response time [13:00:44] RECOVERY - Apache HTTP on mw1094 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.063 second response time [13:01:43] RECOVERY - Apache HTTP on mw1156 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.101 second response time [13:03:33] PROBLEM - Apache HTTP on mw1077 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
[13:03:34] PROBLEM - Apache HTTP on mw1019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:04:33] PROBLEM - Apache HTTP on mw1048 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:04:33] PROBLEM - Apache HTTP on mw1057 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:04:33] PROBLEM - Apache HTTP on mw1034 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:04:33] RECOVERY - Apache HTTP on mw77 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.298 second response time [13:05:33] RECOVERY - Apache HTTP on mw1034 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 5.038 second response time [13:05:33] RECOVERY - Apache HTTP on mw1160 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.063 second response time [13:06:23] RECOVERY - Apache HTTP on mw1077 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.058 second response time [13:06:23] RECOVERY - Apache HTTP on mw1048 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.079 second response time [13:06:23] RECOVERY - Apache HTTP on mw1057 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.077 second response time [13:06:33] RECOVERY - Apache HTTP on mw1019 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 3.556 second response time [13:06:40] what the hell [13:06:53] PROBLEM - Apache HTTP on mw1091 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:07:43] PROBLEM - Apache HTTP on mw1107 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:07:43] RECOVERY - Apache HTTP on mw1091 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.061 second response time [13:07:44] PROBLEM - Apache HTTP on mw1027 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:07:53] PROBLEM - Apache HTTP on mw1096 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:08:23] paravoid: need help ? 
[13:08:33] RECOVERY - Apache HTTP on mw1107 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.074 second response time [13:08:43] RECOVERY - Apache HTTP on mw1096 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.067 second response time [13:08:43] RECOVERY - Apache HTTP on mw1027 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 3.914 second response time [13:10:05] PROBLEM - Apache HTTP on mw1036 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:10:35] RECOVERY - Apache HTTP on mw1159 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.106 second response time [13:11:45] PROBLEM - Apache HTTP on mw1107 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:11:55] RECOVERY - Apache HTTP on mw1036 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.061 second response time [13:13:04] paravoid: dafuq did you trigger ? [13:13:35] RECOVERY - Apache HTTP on mw1107 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.080 second response time [13:13:35] PROBLEM - Apache HTTP on mw1041 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:13:45] RECOVERY - Apache HTTP on mw79 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.213 second response time [13:14:25] RECOVERY - Apache HTTP on mw1041 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 5.138 second response time [13:14:51] !log adjusting depool-threshold, restarting pybal on lvs1003 [13:14:54] Logged the message, Master [13:14:55] PROBLEM - Apache HTTP on mw1094 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:14:55] PROBLEM - Apache HTTP on mw1091 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:14:55] PROBLEM - Apache HTTP on mw1064 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:15:05] PROBLEM - Apache HTTP on mw1096 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:15:35] PROBLEM - Apache HTTP on mw1049 is CRITICAL: CRITICAL - Socket timeout after 
10 seconds [13:15:35] PROBLEM - Apache HTTP on mw1040 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:15:35] PROBLEM - Apache HTTP on mw1057 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:15:35] PROBLEM - Apache HTTP on mw1071 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:15:36] PROBLEM - Apache HTTP on mw1058 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:15:36] PROBLEM - Apache HTTP on mw1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:15:46] PROBLEM - Apache HTTP on mw1050 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:15:46] PROBLEM - Apache HTTP on mw1060 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:16:15] nice [13:16:25] RECOVERY - Apache HTTP on mw1049 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.354 second response time [13:16:35] RECOVERY - Apache HTTP on mw1040 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 9.604 second response time [13:16:35] PROBLEM - Apache HTTP on mw1219 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:16:35] PROBLEM - Apache HTTP on mw1099 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:16:35] RECOVERY - Apache HTTP on mw1023 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 8.758 second response time [13:16:35] PROBLEM - Apache HTTP on mw1185 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:16:36] RECOVERY - Apache HTTP on mw1050 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 2.619 second response time [13:16:36] PROBLEM - Apache HTTP on mw1075 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:16:37] PROBLEM - Apache HTTP on mw1183 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:16:40] and yet [13:16:40] [13:16:45] PROBLEM - Apache HTTP on mw1107 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:16:45] PROBLEM - Apache HTTP on mw1104 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:16:46] RECOVERY - 
Apache HTTP on mw1094 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 1.948 second response time [13:16:55] RECOVERY - Apache HTTP on mw1096 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.065 second response time [13:16:55] RECOVERY - Apache HTTP on mw1064 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 5.796 second response time [13:16:55] PROBLEM - Apache HTTP on mw1089 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:16:55] PROBLEM - Apache HTTP on mw1069 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:17:15] PROBLEM - Apache HTTP on mw1078 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:17:15] PROBLEM - Apache HTTP on mw1081 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:17:25] PROBLEM - Apache HTTP on mw1093 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:17:25] PROBLEM - Apache HTTP on mw1073 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:17:26] RECOVERY - Apache HTTP on mw1185 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.056 second response time [13:17:26] PROBLEM - Apache HTTP on mw1097 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:17:26] RECOVERY - Apache HTTP on mw1219 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 5.312 second response time [13:17:26] RECOVERY - Apache HTTP on mw1183 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.133 second response time [13:17:26] RECOVERY - Apache HTTP on mw1057 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.287 second response time [13:17:35] PROBLEM - Apache HTTP on mw1083 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:17:35] PROBLEM - Apache HTTP on mw1110 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:17:35] RECOVERY - Apache HTTP on mw1107 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 3.254 second response time [13:17:35] RECOVERY - Apache HTTP on mw1075 is OK: HTTP OK: 
HTTP/1.1 301 Moved Permanently - 808 bytes in 9.777 second response time [13:17:36] RECOVERY - Apache HTTP on mw1060 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.690 second response time [13:17:45] RECOVERY - Apache HTTP on mw1104 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 5.357 second response time [13:17:45] RECOVERY - Apache HTTP on mw1069 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.089 second response time [13:17:45] RECOVERY - Apache HTTP on mw1091 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.080 second response time [13:17:55] RECOVERY - Apache HTTP on mw1089 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 5.418 second response time [13:18:05] RECOVERY - Apache HTTP on mw1081 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.193 second response time [13:18:05] RECOVERY - Apache HTTP on mw1078 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.782 second response time [13:18:15] RECOVERY - Apache HTTP on mw1093 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 5.914 second response time [13:18:17] hey [13:18:25] RECOVERY - Apache HTTP on mw1083 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.080 second response time [13:18:26] RECOVERY - Apache HTTP on mw1099 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 3.192 second response time [13:18:26] RECOVERY - Apache HTTP on mw1097 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 9.566 second response time [13:18:26] RECOVERY - Apache HTTP on mw1071 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 2.824 second response time [13:18:26] RECOVERY - Apache HTTP on mw1110 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 6.747 second response time [13:18:26] hey [13:18:32] whats up? 
[13:18:37] i was about to hit the road ;) [13:18:41] dpkg broke soemething [13:18:44] nope [13:18:47] so I upgraded imagescalers [13:19:01] and I messed up and did it at the same time, so we had a very brief rendering outage [13:19:15] RECOVERY - Apache HTTP on mw1073 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.006 second response time [13:19:16] immediately after that, apaches started all to flap [13:19:22] 2013-09-05 13:10:12.288818 [apaches_80 ProxyFetch] mw1078.eqiad.wmnet (enabled/up/pooled): Fetch failed, 5.002 s [13:19:25] 2013-09-05 13:10:12.830266 [apaches_80 ProxyFetch] mw1042.eqiad.wmnet (enabled/up/pooled): Fetch failed, 5.003 s [13:19:26] RECOVERY - Apache HTTP on mw1058 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 2.768 second response time [13:19:31] could be a coincidence but no idea why it would do that [13:20:06] (03Abandoned) 10Hashar: contint: install python-virtualenv [operations/puppet] - 10https://gerrit.wikimedia.org/r/81649 (owner: 10Hashar) [13:20:38] the weirdest thing is [13:20:41] manually requesting [13:20:52] says [13:20:56] for /wiki/Main_Page [13:21:03] I tried a bunch of them multiple times [13:21:08] it's all < 0.200 [13:21:19] but both pybal and nagios disagree, so maybe I haven't hit it yet [13:21:38] strange [13:22:35] I had a theory it might have been pybal depooling a bunch of them then the rest getting overloaded [13:22:47] so I altered the threshold to 0.8 and restarted [13:23:09] it's better now but it wasn't for 3' after I did that [13:24:13] there are still occasional fetch failed [13:26:37] so weird [13:27:53] yeah hard to say [13:28:02] should add icmp ping checks to pybal [13:28:35] (03PS4) 10Hashar: contint: prevents access to Zuul daemon [operations/puppet] - 10https://gerrit.wikimedia.org/r/82614 [13:29:06] (03CR) 10Hashar: "added in iptables_purge_service and iptables_add_exec entries." 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/82614 (owner: 10Hashar) [13:29:08] I'd like to unsuspend my two dump jobs now, since they seem to be unrelated to anything [13:29:18] (03PS6) 10Hashar: contint: publish Zuul git over git protocol [operations/puppet] - 10https://gerrit.wikimedia.org/r/82625 [13:29:52] or not [13:30:15] I did find a long-running query from snapshot1004, I didn't blame them just by guessing [13:30:53] (03PS7) 10Hashar: contint: publish Zuul git over git protocol [operations/puppet] - 10https://gerrit.wikimedia.org/r/82625 [13:31:20] the table dumps are via mysqldump without locking [13:32:04] (03CR) 10Hashar: "added in iptables_purge_service and iptables_add_exec entries." [operations/puppet] - 10https://gerrit.wikimedia.org/r/82625 (owner: 10Hashar) [13:32:34] would someone be brave enough to merge in some iptables rules on gallium for me please ? Might need a console access in case augeas screw them :/ [13:39:01] ok, since you have some reservations I would like a judgment call then [13:39:21] since things appear to have settled down, on whether it would be fine to resume the jobs [13:42:07] https://commons.wikimedia.org/wiki/Special:NewFiles reports of breakage still, see wikimedia-tech [13:42:42] thumbs o new images not rendering [13:45:26] the thumbnail.log on fluoride shows a spam of thumbnail failing for /usr/bin/convert [13:45:47] RECOVERY - Ceph on ms-fe1004 is OK: Ceph HEALTH_OK [13:46:26] RECOVERY - Ceph on ms-fe1003 is OK: Ceph HEALTH_OK [13:46:26] RECOVERY - Ceph on ms-fe1001 is OK: Ceph HEALTH_OK [13:49:52] hashar@mw1154:/tmp$ convert -quality 80 -background white foobar.jpg -thumbnail '16x16!' -depth 8 -sharpen '0x0.8' -rotate -0 /tmp/FOOTHUMB.jpg [13:49:53] convert: Premature end of JPEG file `foobar.jpg' @ warning/jpeg.c/EmitMessage/231. [13:49:54] convert: Corrupt JPEG data: premature end of data segment `foobar.jpg' @ warning/jpeg.c/EmitMessage/231. 
[13:49:56] although the file is a jpg [13:49:57] $ file foobar.jpg [13:49:57] foobar.jpg: JPEG image data, EXIF standard 2.21 [13:50:22] yeah I just tried by hand on a box, same stuff [13:50:37] premature end of jpeg file [13:50:59] for which file? [13:51:16] any jpg apparently [13:51:35] on mw1154 I have copied a random file being processed [13:51:41] using cp $(ls -1 localcopy*.jpg|head -n1) foobar.jpg [13:51:51] and then tried the same command mediawiki is using to render the thumb [13:51:57] that has yielded the premature end of jpeg [13:52:24] well boxes are failing png files as well : ( [13:52:52] maybe not a fully uploaded file ? [13:53:03] I tried the same with no luck [13:53:36] I could try a thumb of an older file and see if that works [13:54:33] I'm trying [13:54:33] http://commons.wikimedia.org/w/thumb_handler.php/2/2f/Burntwood_Primitive_Methodists_Chapel_02.jpg/800px-Burntwood_Primitive_Methodists_Chapel_02.jpg [13:54:40] ah [13:54:49] it fails [13:54:53] but the orig jpg does not [13:55:23] success with old thumb [13:56:21] but not with the one you chose [13:56:28] so [13:56:39] I caught http://commons.wikimedia.org/wiki/File:2011_Tohoku_earthquake_aftershocks_1year_by_JMA.png not rendering on mw1156 [13:56:49] hashar@mw1156:/tmp$ /usr/bin/convert -quality 95 -background white File\:2011_Tohoku_earthquake_aftershocks_1year_by_JMA.png -thumbnail '50x50!' -depth 8 -rotate -0 /tmp/Toho-thumb.png [13:56:50] convert: improper image header `File:2011_Tohoku_earthquake_aftershocks_1year_by_JMA.png' @ error/png.c/ReadPNGImage/3246. [13:56:51] convert: missing an image filename `/tmp/Toho-thumb.png' @ error/convert.c/ConvertImageCommand/3011. 
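Editorial note: the exchange above turns on the gap between what `file` reports (it looks only at the leading bytes) and what the decoder needs (a complete stream). A minimal sketch of a quick sanity check is given below. It is a heuristic written for illustration, not anything MediaWiki or ImageMagick ships: the function name `sniff_image` and its return strings are hypothetical, and a JPEG with trailing data after the end-of-image marker would be misreported as truncated.

```python
# Heuristic check of raw image bytes (illustrative only; names are hypothetical).
JPEG_SOI = b"\xff\xd8"              # JPEG start-of-image marker
JPEG_EOI = b"\xff\xd9"              # JPEG end-of-image marker
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # fixed 8-byte PNG signature


def sniff_image(data: bytes) -> str:
    """Return 'png', 'jpeg', 'jpeg-truncated', 'html', or 'unknown'."""
    if data.startswith(PNG_SIGNATURE):
        return "png"
    if data.startswith(JPEG_SOI):
        # A complete JPEG stream normally ends with the EOI marker.
        return "jpeg" if data.endswith(JPEG_EOI) else "jpeg-truncated"
    head = data[:512].lstrip().lower()
    if head.startswith(b"<!doctype html") or head.startswith(b"<html"):
        # e.g. a saved wiki page mistaken for the image itself
        return "html"
    return "unknown"
```

Usage would be along the lines of `sniff_image(open("foobar.jpg", "rb").read())`; a result of `jpeg-truncated` matches the "premature end of JPEG file" symptom without invoking `convert`.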
[13:56:58] so something's really messing up imagemagick apparently [13:57:54] well the one truncated file I pulled off a host [13:58:09] was actually truncated (the original), ff didn't want to display it either [13:58:26] hashar: you're missing -comment [13:58:28] that command-line is invalid [13:58:39] damn sorry [13:58:45] I am trying to convert a HTML file :/ [13:59:06] nor gimp. and it's definitely jpeg data at least the first bit of it [13:59:20] can you find me the file where mediawiki spawns convert? [13:59:25] I'm grepping /usr/local/apache [14:00:09] something under includes/media I guess [14:00:31] media/Bitmap.php I guess [14:00:33] look for [14:00:39] wgImageMagickConvertCommand [14:00:40] yup and function transformImageMagick [14:01:19] the full command line you can get out of the logs on fluorine [14:05:51] what happened at 12:43 UTC? [14:06:28] dist-upgrade of image scalers [14:07:08] that's precisely when thumbs stopped appearing;) [14:07:24] bah [14:07:39] running the command manually as apache does not yield an error :( [14:07:40] hashar@mw1158:/tmp$ sudo -u apache '/usr/bin/convert' -background white 'Light_dispersion_conceptual_waves.gif' -coalesce -thumbnail '144x108!' -set comment 'File source: http://commons.wikimedia.org/wiki/File:Light_dispersion_conceptual_waves.gif' -depth 8 -rotate -0 -fuzz 5% -layers optimizeTransparency '/tmp/thumb.gif' [14:07:43] echo $? [14:07:44] 0 [14:08:29] well there is the ulimit too [14:08:43] but so unlikely for that to have any part [14:09:53] nothing in dmesg on that box :/ [14:11:44] fixed [14:12:12] !log recreating mediawiki cgroup lost after cgroup-bin upgrade; thumbnail issues fixed [14:12:14] Logged the message, Master [14:12:23] oh [14:12:43] it was broken on mw1153 all along... [14:12:48] confirmed [14:13:49] now we get some spam like "convert: no decode delegate for this image format `/a/magick-tmp/magick-L4Lgur5f' @ error/constitute.c/ReadImage/532." 
[14:14:08] PROBLEM - Puppet freshness on sq36 is CRITICAL: No successful Puppet run in the last 10 hours [14:14:09] apergos, i'm about to head to a cafe for a bit, but when I get there would you help me figure out this silly puppet ganglia aggregator problem? [14:14:43] ottomata: I will in a little bit, we have a thumbnail issue we're dealing with [14:14:51] and grr I have to make it to the bank before my landlord shows up [14:14:56] in 45 min [14:16:11] it's fixed, so I'm not sure what you're investigating exactly :) [14:16:41] there are some other errors showing in thumbnail.log [14:16:49] root@mw1154:~# dpkg-reconfigure cgroup-bin [14:16:49] cgconfig stop/pre-start, process 1422 [14:16:49] cgred start/running, process 1457 [14:16:49] root@mw1154:~# ls /sys/fs/cgroup/memory/mediawiki [14:16:49] ls: cannot access /sys/fs/cgroup/memory/mediawiki: No such file or directory [14:16:51] sometime convert is not even being passed an input filename [14:16:52] root@mw1154:~# mkdir -p /sys/fs/cgroup/memory/mediawiki [14:16:55] mkdir: cannot create directory `/sys/fs/cgroup/memory': No such file or directory [14:16:58] amazing [14:17:33] another apparently truncated file [14:17:45] it says it isn't but that error usually means something else broke instead, hashar [14:20:24] that sure looks better [14:23:31] paravoid: I am pretty sure I already have the cgroup issue on beta but I can't find the bug report :( [14:24:39] ah July 2nd 13:38 hashar: restarted mw-cgroup upstart service on apaches box. That recreated the wgCgroup directory /sys/fs/cgroup/memory/mediawiki [14:25:07] #53800 [14:25:11] just filed this [14:25:18] !b 53800 [14:25:18] https://bugzilla.wikimedia.org/53800 [14:27:30] apparently it tries to create them [14:29:33] paravoid: any chance you look at my contint iptables rules ? or should I postpone that after all staff? :D [14:29:42] not today... [14:29:48] apergos maybe? 
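Editorial note: the 14:16 paste shows two different states being conflated. The `mediawiki` group directory was missing, and on that box the `memory` controller hierarchy itself was missing, which is why `mkdir -p` failed. A minimal sketch that tells the two apart follows, assuming the cgroup v1 layout seen in the paste (`/sys/fs/cgroup/<controller>/<group>`). The function name and return strings are hypothetical and not part of any Wikimedia tooling.

```python
from pathlib import Path


def cgroup_status(group: str, controller: str = "memory",
                  root: str = "/sys/fs/cgroup") -> str:
    """Report whether a cgroup v1 group directory is present.

    Returns 'controller-not-mounted' when the controller hierarchy is
    absent, 'group-missing' when only the group directory is absent,
    and 'ok' otherwise.
    """
    controller_dir = Path(root) / controller
    if not controller_dir.is_dir():
        # mkdir of the group cannot work until the hierarchy is mounted
        return "controller-not-mounted"
    if not (controller_dir / group).is_dir():
        return "group-missing"
    return "ok"
```

Called as `cgroup_status("mediawiki")` on an affected host, a `controller-not-mounted` result would point at the cgroup service rather than at the group directory itself. The `root` parameter exists only so the check can be exercised against a scratch directory.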
[14:29:59] I'll look at them in a little while, sure [14:30:09] :) [14:30:23] my queue is now: bank, landlord/6 pm meeting, ottomata, hashar [14:30:42] paravoid: so do you think I can (once back from the bank) resume those two dumps and keep an eye on them? [14:30:43] so that would be post all staff :D [14:30:52] naw, it's later today is all [14:31:08] CommanderData: what kind of questions can I ask you? [14:31:08] yes but please fix the wikiexporter bug and other db bugs that sean points out [14:31:12] oops [14:31:14] wrong chat :p [14:31:37] it's very confusing to have an outage and finding snapshot1004 long-running queries at the same time [14:31:45] also, I pointed out the wikiadmin auth failure, did you see this? [14:31:46] apergos: I will not be there tonight, so don't bother :-] Will ping again after all staff [14:32:22] the wikiexporter bug isn't fixed because a) it's core, b) batching is not a good fix, as folks have pointed out, and c) sean thinks the proper solution is to have a dedicated slave [14:32:38] the long running queries are not wikiexporter [14:32:41] good [14:32:41] do that [14:32:42] those are mysqldump [14:33:15] they can't take less time, they take as long as the table takes to dump, but they are non-locking [14:38:53] be back later [14:48:08] PROBLEM - MySQL Processlist on db1024 is CRITICAL: CRIT 38 unauthenticated, 0 locked, 0 copy to table, 1 statistics [14:49:28] PROBLEM - Apache HTTP on mw1097 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:50:08] RECOVERY - MySQL Processlist on db1024 is OK: OK 0 unauthenticated, 0 locked, 0 copy to table, 0 statistics [14:50:18] RECOVERY - Apache HTTP on mw1097 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 2.779 second response time [14:51:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:53:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second 
response time [14:59:08] PROBLEM - MySQL Processlist on db1024 is CRITICAL: CRIT 36 unauthenticated, 0 locked, 0 copy to table, 0 statistics [15:00:08] (03PS1) 10Reedy: Add symlinks [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/82832 [15:00:09] (03PS1) 10Reedy: Phase 1 wikis to 1.22wmf16 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/82833 [15:00:10] (03PS1) 10Reedy: All wikipedia wikis to 1.22wmf15 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/82834 [15:00:11] (03PS1) 10Reedy: Move php symlink to point at 1.22wmf15 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/82835 [15:02:08] RECOVERY - MySQL Processlist on db1024 is OK: OK 4 unauthenticated, 0 locked, 0 copy to table, 0 statistics [15:02:36] (03CR) 10Reedy: [C: 032] Add symlinks [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/82832 (owner: 10Reedy) [15:02:46] (03Merged) 10jenkins-bot: Add symlinks [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/82832 (owner: 10Reedy) [15:03:46] heya paravoid, [15:03:52] you wrote the varnishkafka init script, right? 
[15:11:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:16:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.754 second response time [15:17:45] !log reedy synchronized php-1.22wmf16 'Initial file sync' [15:17:48] Logged the message, Master [15:18:22] !log reedy synchronized docroot and w [15:18:25] Logged the message, Master [15:22:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:24:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.130 second response time [15:28:26] !log replacing disk slot 0 ms-be1011 [15:28:29] Logged the message, Master [15:29:08] PROBLEM - Puppet freshness on stafford is CRITICAL: No successful Puppet run in the last 10 hours [15:32:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:32:58] That's intriguing. scap died due to a missing ExtensionMessages file [15:33:17] But it's not done that in the last few runs because it created them.. [15:33:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.143 second response time [15:36:45] ottomata: you're up [15:36:58] yoyo [15:37:12] ok so, let's look at analytics1003 [15:37:19] all right, lemme hop on [15:37:38] k, you can see in site.pp [15:37:39] node /analytics100[2-8]\.eqiad\.wmnet/ { [15:37:43] # ganglia aggregator for the Analytics cluster. [15:37:43] if ($hostname == 'analytics1003') { [15:37:43] $ganglia_aggregator = true [15:37:43] } [15:37:47] (and this has worked before [15:37:57] and when you try to apply, what happens? 
[15:38:10] - deaf = no [15:38:10] + deaf = yes [15:38:18] huh [15:38:28] guess I'd better read the code [15:38:52] (03PS1) 10Reedy: Kill DataValues/DataTypes/DataTypes.i18n.php [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/82841 [15:39:04] yeah so [15:39:07] i can guide you through it [15:39:09] its kinda weird [15:39:15] by setting $ganglia_aggregator [15:39:52] (03CR) 10Reedy: [C: 032] Kill DataValues/DataTypes/DataTypes.i18n.php [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/82841 (owner: 10Reedy) [15:40:04] ganglia.pp [15:40:08] line 42 [15:40:13] shoudl set $deaf properly [15:40:24] (03Merged) 10jenkins-bot: Kill DataValues/DataTypes/DataTypes.i18n.php [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/82841 (owner: 10Reedy) [15:40:48] which will then get rendered in gmond.conf [15:41:03] [@analytics1003:~]↥ 130 $ grep deaf /etc/ganglia/gmond.conf [15:41:03] deaf = no [15:41:40] apergos: im' going to run puppet with puppetd --test --trace --debug [15:41:40] and see if I can see anytihng [15:41:50] ok [15:42:09] pastebin if you do [15:42:27] k [15:42:50] RECOVERY - Puppet freshness on analytics1003 is OK: puppet ran at Thu Sep 5 15:42:45 UTC 2013 [15:42:50] PROBLEM - DPKG on analytics1003 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:43:47] apergos: nothing useful [15:43:47] https://gist.github.com/ottomata/6451953 [15:45:48] (03PS1) 10Reedy: Remove more DataValues crap [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/82843 [15:46:03] (03CR) 10Reedy: [C: 032] Remove more DataValues crap [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/82843 (owner: 10Reedy) [15:46:12] (03Merged) 10jenkins-bot: Remove more DataValues crap [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/82843 (owner: 10Reedy) [15:46:40] PROBLEM - Ceph on ms-fe1004 is CRITICAL: Ceph HEALTH_WARN 330 pgs degraded: 301 pgs stuck unclean: recovery 6091521/923801682 degraded (0.659%) 
[15:47:31] PROBLEM - Ceph on ms-fe1001 is CRITICAL: Ceph HEALTH_WARN 330 pgs degraded: 309 pgs stuck unclean: recovery 6091531/923802870 degraded (0.659%) [15:47:31] PROBLEM - Ceph on ms-fe1003 is CRITICAL: Ceph HEALTH_WARN 330 pgs degraded: 309 pgs stuck unclean: recovery 6091531/923802870 degraded (0.659%) [15:47:38] again? [15:47:42] I didn't expect anything tbh, in my experience debug/trace never give me useful detail [15:47:46] yeah [15:48:11] i'm not so sure how to check this either, aside from editing puppet configs and adding notice() on stafford [15:48:44] but editing manually (even for troubleshooting) is bad, ja? [15:48:54] i could commit some notice()es [15:49:30] you could puppet apply on the local box I guess [15:49:34] editing in /etc/puppet [15:50:38] uh, meeting in 10 mins? [15:50:40] good thing I noticed it in time [15:51:11] yup [15:51:14] am I in this meeting? [15:51:18] no [15:51:22] good! [15:51:31] !log reedy Started syncing Wikimedia installation... : testwiki to 1.22wmf16 and build l10n cache [15:51:34] Logged the message, Master [15:51:37] apergos: hm, i'd have to rsync all of the puppet manifests over to do that, no? [15:52:16] yeah, probably not a great idea either [15:52:37] and i can solve it for you [15:52:46] you're setting $ganglia_aggregator after the class include [15:52:52] always set variables first [15:52:53] (03PS1) 10Andrew Bogott: Move run_directory into the actual base class. [operations/puppet] - 10https://gerrit.wikimedia.org/r/82845 [15:52:54] bah! [15:52:56] that matters? crazy ok [15:52:58] crazy puppet [15:53:02] thanks mark! [15:53:29] given that it's a declarative language, yes, I agree [15:54:41] use $::hostname while you're at it [15:56:06] k [15:57:03] (03PS2) 10Andrew Bogott: Move run_directory into mysql.pp, the only place it is used. 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/82845 [15:57:30] PROBLEM - Apache HTTP on mw1070 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:58:20] RECOVERY - Apache HTTP on mw1070 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.071 second response time [15:58:27] (03CR) 10coren: [C: 032] "Yep. Globals are Evil." [operations/puppet] - 10https://gerrit.wikimedia.org/r/82845 (owner: 10Andrew Bogott) [15:59:34] (03PS1) 10Ottomata: Moving $ganglia_aggregator declaration above analytics class includes. [operations/puppet] - 10https://gerrit.wikimedia.org/r/82846 [15:59:49] (03PS2) 10Ottomata: Moving $ganglia_aggregator declaration above analytics class includes. [operations/puppet] - 10https://gerrit.wikimedia.org/r/82846 [16:00:02] (03CR) 10Ottomata: [C: 032 V: 032] Moving $ganglia_aggregator declaration above analytics class includes. [operations/puppet] - 10https://gerrit.wikimedia.org/r/82846 (owner: 10Ottomata) [16:02:51] yes, that works way better, thanks apergos+mark [16:02:57] puppet reenabled on an11 and an03 [16:03:05] didn't do anything but glad it's working now [16:03:08] :) [16:04:30] RECOVERY - Puppet freshness on analytics1011 is OK: puppet ran at Thu Sep 5 16:04:20 UTC 2013 [16:04:59] and two more hosts with fresh puppet runs [16:05:31] RECOVERY - DPKG on analytics1011 is OK: All packages OK [16:05:31] (03PS2) 10Akosiaris: Adding scala as a build dependency.Bumping version [operations/debs/kafka] (debian) - 10https://gerrit.wikimedia.org/r/82822 [16:06:06] (03CR) 10Akosiaris: [C: 032] Adding scala as a build dependency.Bumping version [operations/debs/kafka] (debian) - 10https://gerrit.wikimedia.org/r/82822 (owner: 10Akosiaris) [16:06:15] (03CR) 10Akosiaris: [V: 032] Adding scala as a build dependency.Bumping version [operations/debs/kafka] (debian) - 10https://gerrit.wikimedia.org/r/82822 (owner: 10Akosiaris) [16:08:24] (03PS1) 10Cmjohnson: reclaiming manganese -remove site.pp/add decom.pp [operations/puppet] - 
10https://gerrit.wikimedia.org/r/82849 [16:10:30] !log reedy Finished syncing Wikimedia installation... : testwiki to 1.22wmf16 and build l10n cache [16:10:34] Logged the message, Master [16:15:23] sbernardin: let me remove the failed disk on sq36 first [16:16:05] cmjohnson1: OK [16:23:35] Ryan_Lane: I'm trying to move a mysql data dir to /mnt on a labs vm, but am getting apparmor errors [16:23:40] apparmor="DENIED" operation="open" parent=1 profile="/usr/sbin/mysqld" name="/mnt/mysql/ibdata1" pid=18376 comm="mysqld" requested_mask="rw" denied_mask="rw" fsuid=113 ouid=113 [16:23:42] !log reedy synchronized php-1.22wmf16/extensions/ 'Minor wikibase updates' [16:23:45] Logged the message, Master [16:24:05] /var/lib/mysql is symlinked to /mnt/mysql [16:26:42] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: testwiki back to 1.22wmf15 till deployment window [16:26:45] Logged the message, Master [16:27:58] Ryan_Lane: nm, adding '/mnt/mysql/* rw,' in /etc/apparmor.d/usr.sbin.mysqld seems to have helped [16:28:56] (03PS2) 10Hashar: contint: move iptables under module [operations/puppet] - 10https://gerrit.wikimedia.org/r/82613 [16:29:48] awesoome, thanks Snaps_ [16:29:51] if you push the snappy stuff [16:29:57] i'll build new librdkafka and varnishkafka debs [16:30:03] and try those out where mark is producing right now [16:30:10] and chekcup on the parition balanace stuff [16:30:15] balance* [16:30:24] (03CR) 10ArielGlenn: [C: 032] contint: move iptables under module [operations/puppet] - 10https://gerrit.wikimedia.org/r/82613 (owner: 10Hashar) [16:30:40] ottomata: Thanks. I will, can you wait 3 more hours? 
[16:31:07] yeah no probs [16:31:45] splendid, I'll sort it out tonight [16:32:29] danke [16:35:28] cmjohnson1: let me know when you're ready for me to swap the disks on sq36 [16:37:18] (03PS5) 10Hashar: contint: prevents access to Zuul daemon [operations/puppet] - 10https://gerrit.wikimedia.org/r/82614 [16:41:34] !log powering down sq36 to swap disk [16:41:37] Logged the message, Master [16:44:16] sbernardin: once the servers powers off...replace the 2nd disk [16:44:27] Ok [16:44:49] PROBLEM - Puppet freshness on analytics1026 is CRITICAL: No successful Puppet run in the last 10 hours [16:45:16] shaat uuuup! :) [16:45:59] PROBLEM - Host sq36 is DOWN: PING CRITICAL - Packet loss = 100% [16:49:08] (03PS6) 10Hashar: contint: prevents access to Zuul daemon [operations/puppet] - 10https://gerrit.wikimedia.org/r/82614 [16:49:40] RECOVERY - Puppet freshness on analytics1026 is OK: puppet ran at Thu Sep 5 16:49:29 UTC 2013 [16:51:48] (03PS7) 10Hashar: contint: prevents access to Zuul daemon [operations/puppet] - 10https://gerrit.wikimedia.org/r/82614 [16:53:29] RECOVERY - RAID on analytics1026 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [16:53:39] RECOVERY - DPKG on analytics1026 is OK: All packages OK [16:53:41] (03CR) 10ArielGlenn: [C: 032] contint: prevents access to Zuul daemon [operations/puppet] - 10https://gerrit.wikimedia.org/r/82614 (owner: 10Hashar) [16:54:19] RECOVERY - Disk space on analytics1026 is OK: DISK OK [16:57:29] (03PS1) 10Hashar: contint: dupe define Iptables_add_exec[gallium] [operations/puppet] - 10https://gerrit.wikimedia.org/r/82857 [16:59:53] (03PS2) 10Hashar: contint: dupe define Iptables_add_exec[gallium] [operations/puppet] - 10https://gerrit.wikimedia.org/r/82857 [16:59:58] (03PS1) 10Akosiaris: Fix debian/changelog [operations/debs/kafka] (debian) - 10https://gerrit.wikimedia.org/r/82859 [17:00:17] (03CR) 10Akosiaris: [C: 032 V: 032] Fix debian/changelog [operations/debs/kafka] (debian) - 10https://gerrit.wikimedia.org/r/82859 
(owner: 10Akosiaris) [17:00:21] (03CR) 10jenkins-bot: [V: 04-1] contint: dupe define Iptables_add_exec[gallium] [operations/puppet] - 10https://gerrit.wikimedia.org/r/82857 (owner: 10Hashar) [17:02:51] (03PS3) 10Hashar: contint: dupe define Iptables_add_exec[gallium] [operations/puppet] - 10https://gerrit.wikimedia.org/r/82857 [17:03:42] sbernardin: are you getting any errors on reboot? [17:04:56] Stuck at remote access controller detected for 5 min....no errors on screen [17:05:56] cmjohnson1: servers seems to be hung [17:06:21] reboot it [17:07:33] (03CR) 10ArielGlenn: [C: 032] contint: dupe define Iptables_add_exec[gallium] [operations/puppet] - 10https://gerrit.wikimedia.org/r/82857 (owner: 10Hashar) [17:08:50] (03CR) 10Kaldari: [C: 031] Enable XFO: SAMEORIGIN for enwiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/82751 (owner: 10CSteipp) [17:09:14] cmjohnson1: just rebooted it for the 3rd time....same thing [17:10:09] RECOVERY - NTP on analytics1026 is OK: NTP OK: Offset -0.01382350922 secs [17:12:08] heya Snaps_ [17:14:36] (03PS2) 10Cmjohnson: reclaiming manganese -remove site.pp/add decom.pp [operations/puppet] - 10https://gerrit.wikimedia.org/r/82849 [17:16:57] (03CR) 10Cmjohnson: [C: 032 V: 032] reclaiming manganese -remove site.pp/add decom.pp [operations/puppet] - 10https://gerrit.wikimedia.org/r/82849 (owner: 10Cmjohnson) [17:18:10] i'm having trouble with paravoid's init.d script he wrote for varnishkafka [17:18:32] it sets DAEMON_OPTS to [17:18:32] DAEMON_OPTS="-a -w ${LOGFILE} -D -P ${PIDFILE}" [17:18:38] and passes those args to varnishkafka [17:18:39] but [17:18:47] cmjohnson1: put original disk back [17:19:01] those seem like ' standard Varnish VSL arguments', (at least they are to varnishlog?) 
[17:19:08] but varnishkafka doesn't like them [17:19:32] but, if I remove them, start-stop-daemon doesn't seem to create the pidfile properly…unless there is somethign else going on [17:19:34] i'm looking into it [17:19:35] cmjohnson1: getting pci express error interrupt at 1D76:002E [17:20:28] sbernardin: yep...that;s what i figured once you mentioned remote controller issue [17:21:13] that is the kiss of death for the 1950s...many others know this already but there is a capacitor on the HD controller card that goes bad [17:21:29] apergos: ^ [17:21:52] cmjohnson1: do I continue with boot or shut it down? [17:22:06] it's not going to boot [17:22:32] !log upload kafka 0.8-20130903-1 at apt.wikimedia.org, component universe [17:22:35] Logged the message, Master [17:22:37] it will need to be decom'd [17:22:40] (03PS8) 10Hashar: contint: publish Zuul git over git protocol [operations/puppet] - 10https://gerrit.wikimedia.org/r/82625 [17:22:47] OK...got it [17:32:52] (03CR) 10Cmjohnson: [C: 032 V: 032] "Robla approved in RT5691 and it has been more than 3 days." 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/81953 (owner: 10Dzahn) [17:34:52] PROBLEM - Puppet freshness on virt0 is CRITICAL: No successful Puppet run in the last 10 hours [17:35:14] ah it was the controller indeed [17:35:27] bye bye sq36 [17:46:12] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:47:56] paravoid: ok pushed 6FA6DCF5 to pgp.mit.edu and keyserver.ubuntu.com [17:48:12] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 9.683 second response time [17:54:12] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:56:02] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 2.940 second response time [17:56:59] sbernardin: https://rt.wikimedia.org/Ticket/Display.html?id=5647 [18:03:32] !log adding virt12 to the virt node pool [18:03:35] Logged the message, Master [18:06:24] Ryan_Lane: I am trying to dig through the email backlog and found the 'one api squid in tampa' email [18:06:30] sq36 bit the dust today [18:06:37] where does that leave us? 
[18:06:46] we have sq39 and 52 or so [18:07:00] or whichever ones I added [18:07:04] it's in a different email [18:07:07] ok whew [18:07:15] (03PS8) 10Ori.livneh: Continuing to clean up InitialiseSettings.php [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/78637 (owner: 10TTO) [18:07:35] in squid.php it only shows three of which presumably two are gone, comments there mustbe outa sync [18:07:35] thanks [18:16:12] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:19:04] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 4.731 second response time [18:20:36] (03PS1) 10Ori.livneh: Increase the rate of NavigationTiming events [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/82875 [18:22:15] (03CR) 10Ori.livneh: [C: 032] Continuing to clean up InitialiseSettings.php [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/78637 (owner: 10TTO) [18:22:26] (03Merged) 10jenkins-bot: Continuing to clean up InitialiseSettings.php [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/78637 (owner: 10TTO) [18:22:32] (03CR) 10Ori.livneh: [C: 032] Increase the rate of NavigationTiming events [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/82875 (owner: 10Ori.livneh) [18:23:12] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:23:19] (03Merged) 10jenkins-bot: Increase the rate of NavigationTiming events [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/82875 (owner: 10Ori.livneh) [18:24:02] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 6.803 second response time [18:24:43] (03PS1) 10RobH: RT#5011 bugzilla to use own certificate [operations/puppet] - 10https://gerrit.wikimedia.org/r/82877 [18:25:02] RECOVERY - Ceph on ms-fe1001 is OK: Ceph HEALTH_OK [18:25:09] (03PS1) 10coren: Tool Labs: 
Make tools-proxy infrastructure [operations/puppet] - 10https://gerrit.wikimedia.org/r/82878 [18:25:15] * YuviPanda clicks [18:25:22] RECOVERY - Ceph on ms-fe1004 is OK: Ceph HEALTH_OK [18:25:31] (03PS2) 10RobH: RT#5011 bugzilla to use own certificate [operations/puppet] - 10https://gerrit.wikimedia.org/r/82877 [18:25:32] RECOVERY - Ceph on ms-fe1003 is OK: Ceph HEALTH_OK [18:25:36] (03CR) 10jenkins-bot: [V: 04-1] RT#5011 bugzilla to use own certificate [operations/puppet] - 10https://gerrit.wikimedia.org/r/82877 (owner: 10RobH) [18:26:10] !log olivneh synchronized wmf-config/CommonSettings.php 'Increase rate of NavigationTiming events' [18:26:13] Logged the message, Master [18:26:58] (03Abandoned) 10RobH: RT#5011 bugzilla to use own certificate [operations/puppet] - 10https://gerrit.wikimedia.org/r/82877 (owner: 10RobH) [18:27:12] !log olivneh synchronized wmf-config/InitialiseSettings.php 'Change I34d1f29d2: clean-up of InitialiseSettings.php' [18:27:15] Logged the message, Master [18:33:41] (03PS1) 10RobH: RT#5011 bugzilla to use own certificate [operations/puppet] - 10https://gerrit.wikimedia.org/r/82879 [18:35:56] (03CR) 10RobH: [C: 031] "(1 comment)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/82879 (owner: 10RobH) [18:36:17] mutante: I added you as a reviewer to the above, its moving bugzilla to its own cert [18:36:29] im 99.99% sure its ok, but seems best for another set of eyes [18:37:28] (03PS1) 10Lcarr: adding new virt subnet [operations/dns] - 10https://gerrit.wikimedia.org/r/82880 [18:37:52] Ryan_Lane: ^^ [18:38:27] (03CR) 10Lcarr: [C: 032] adding new virt subnet [operations/dns] - 10https://gerrit.wikimedia.org/r/82880 (owner: 10Lcarr) [18:41:23] Ryan_Lane: and subnet is now routable [18:42:06] !log reedy synchronized php-1.22wmf16/extensions/UploadWizard/ [18:42:09] Logged the message, Master [18:43:55] heya ^d [18:44:00] gerrit moved servers, right? 
[18:44:04] <^d> Yuppp [18:44:07] should scp gerrit.wikimedia.org:hooks/commit-msg .git/hooks/commit-msg still work? [18:44:16] <^d> No reason it shouldn't. [18:44:25] i'm getting public key denied, [18:44:33] where is :hooks/commit-msg relative to? [18:44:49] (looking as root too :) ) [18:45:02] <^d> I don't have a good idea how that thing works. [18:45:15] <^d> I just tried and wfm'd. [18:45:49] oh, hm, ha, who knows? [18:47:08] <^d> !log gerrit coming down for a short bit. moving data around. [18:47:12] Logged the message, Master [18:48:26] ah, was about to try git review :p [18:48:36] <^d> Yeah sorry about that. [18:49:17] <^d> ytterbium wasn't partitioned right :) [18:49:22] ^d: you're moving the data, right? not me? [18:49:26] heh [18:49:33] git review isnt working noooooooo [18:49:46] i just tried to disable someone's access ;] [18:49:56] https://gerrit.wikimedia.org/r/ => Service Temporarily Unavailable :( [18:49:57] <^d> Ryan_Lane: Move it all into /mnt, then you'll change it back to /var/lib/gerrit2? [18:50:01] RobH: we just disabled your access to disable that access [18:50:06] ^d: yeah [18:50:09] anarchy! wooooo [18:50:19] rmoen: we're moving data around. the system was improperly partitioned [18:50:30] 503! [18:50:44] Ryane_Lane: Sounds justified to me :) [18:50:56] "I'm not slacking, gerrit is down" [18:50:57] Ryan_Lane ^_^ [18:51:25] sorry man, i botched your name [18:51:29] :D [18:51:35] you don't use tab completion? :) [18:51:58] <^d> Tab-completion is for slackers! [18:52:15] Indeed, I try to use it as much as possible ( aka slack ) [18:52:46] <^d> Ryan_Lane: It's all in /mnt/gerrit2 [18:52:58] you're done? [18:53:11] <^d> Yeah that's it. [18:53:45] you did a copy rather than a move? [18:53:48] tsk tsk [18:54:00] <^d> Oh, I thought you said cp. 
[18:54:02] <^d> Sorry :) [18:54:03] heh [18:54:27] please get out of mnt :) [18:54:39] <^d> k [18:54:39] ty [18:55:18] done [18:55:44] I moved the old data to gerrit2-bak [18:58:20] <^d> !log gerrit back up [18:58:24] Logged the message, Master [18:59:12] am I good to delete the old data? [18:59:33] Thanks guys :) [18:59:37] yw [18:59:52] I'm Receiving objects yay [18:59:53] (03PS2) 10Reedy: Phase 1 wikis to 1.22wmf16 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/82833 [19:00:01] ^d: am I good to delete the old data? [19:00:05] (03CR) 10Reedy: [C: 032] Phase 1 wikis to 1.22wmf16 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/82833 (owner: 10Reedy) [19:00:14] <^d> Yeah, we're good. [19:00:15] (03Merged) 10jenkins-bot: Phase 1 wikis to 1.22wmf16 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/82833 (owner: 10Reedy) [19:00:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:01:04] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: testwiki, mediawikiwiki, test2wiki, loginwiki and testwikidatawiki to 1.22wmf16 [19:01:06] (03PS1) 10RobH: disabling py shell access [operations/puppet] - 10https://gerrit.wikimedia.org/r/82882 [19:01:07] Logged the message, Master [19:01:21] ok. 
deleted [19:01:26] well, deleting [19:01:26] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 1.039 second response time [19:02:52] (03PS2) 10RobH: disabling py shell access [operations/puppet] - 10https://gerrit.wikimedia.org/r/82882 [19:05:08] (03CR) 10RobH: [C: 032] disabling py shell access [operations/puppet] - 10https://gerrit.wikimedia.org/r/82882 (owner: 10RobH) [19:06:01] (03PS2) 10Reedy: All wikipedia wikis to 1.22wmf15 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/82834 [19:06:08] (03CR) 10Reedy: [C: 032] All wikipedia wikis to 1.22wmf15 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/82834 (owner: 10Reedy) [19:06:39] (03CR) 10jenkins-bot: [V: 04-1] All wikipedia wikis to 1.22wmf15 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/82834 (owner: 10Reedy) [19:07:53] Ryon_Lane: hey so I have this problem -> fatal: Couldn't find remote ref refs/changes/36/79336/10 Any ideas? [19:08:07] (03CR) 10Reedy: [V: 032] All wikipedia wikis to 1.22wmf15 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/82834 (owner: 10Reedy) [19:08:51] Ryan_Lane: ^^ Lol i did it again [19:09:02] Lion_Rane! [19:09:05] LOL! 
[19:09:55] * Reedy kicks APC [19:10:05] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: all wikipedias to 1.22wmf15 [19:10:07] Logged the message, Master [19:10:50] rmoen: ^d [19:10:56] I have no idea what the issue is [19:11:28] (03PS1) 10Cmjohnson: adding sq36 to decom list [operations/puppet] - 10https://gerrit.wikimedia.org/r/82884 [19:11:39] Ryan_Lane: Ok ty [19:11:46] (03PS2) 10Reedy: Move php symlink to point at 1.22wmf15 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/82835 [19:11:52] (03CR) 10Reedy: [C: 032] Move php symlink to point at 1.22wmf15 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/82835 (owner: 10Reedy) [19:12:17] (03Merged) 10jenkins-bot: Move php symlink to point at 1.22wmf15 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/82835 (owner: 10Reedy) [19:14:14] (03PS1) 10Ottomata: varnishkafka module, first real commit! [operations/puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/82885 [19:14:41] PROBLEM - Puppet freshness on manganese is CRITICAL: No successful Puppet run in the last 10 hours [19:15:45] (03PS2) 10Cmjohnson: adding sq36 to decom list [operations/puppet] - 10https://gerrit.wikimedia.org/r/82884 [19:27:52] (03CR) 10coren: [C: 032] Tool Labs: Make tools-proxy infrastructure [operations/puppet] - 10https://gerrit.wikimedia.org/r/82878 (owner: 10coren) [19:28:08] paravoid, can I get a little help with signing a deb repo? I get 'no valid OpenPGP data found' no matter what I try. 
[19:36:31] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:37:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time [19:43:16] (03PS4) 10Reedy: $wgCategoryCollation to 'uca-ru' on all Russian-language wikis [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/79770 (owner: 10Andrey Kiselev) [19:48:15] (03CR) 10Edenhill: [C: 031] "(3 comments)" [operations/puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/82885 (owner: 10Ottomata) [19:50:07] PROBLEM - SSH on sq42 is CRITICAL: Server answer: [19:52:37] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:53:27] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [19:57:36] ^d: so hey i haz this problem => fatal: Couldn't find remote ref refs/changes/36/79336/10 any ideas? [19:57:57] i got this running git review -d and for sure in the correct repo [20:02:38] (03CR) 10Ottomata: "(2 comments)" [operations/puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/82885 (owner: 10Ottomata) [20:03:59] (03CR) 10Ottomata: "(1 comment)" [operations/puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/82885 (owner: 10Ottomata) [20:07:07] RECOVERY - SSH on sq42 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [20:08:17] RECOVERY - check_job_queue on fenari is OK: JOBQUEUE OK - all job queues below 10,000 [20:09:23] (03PS2) 10Ottomata: varnishkafka module, first real commit! [operations/puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/82885 [20:09:27] PROBLEM - Backend Squid HTTP on sq42 is CRITICAL: Connection refused [20:10:12] (03CR) 10Edenhill: [C: 031] varnishkafka module, first real commit! 
[operations/puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/82885 (owner: 10Ottomata) [20:10:37] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:11:27] PROBLEM - check_job_queue on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:11:37] PROBLEM - Puppet freshness on manganese is CRITICAL: No successful Puppet run in the last 10 hours [20:12:27] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [20:12:46] (03PS3) 10Ottomata: varnishkafka module. [operations/puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/82885 [20:22:37] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:23:49] (03CR) 10Edenhill: [C: 031] varnishkafka module. [operations/puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/82885 (owner: 10Ottomata) [20:24:27] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time [20:25:01] (03PS1) 10Ryan Lane: SimpleScheduler no longer exists; use FilterScheduler [operations/puppet] - 10https://gerrit.wikimedia.org/r/82934 [20:26:35] that fixes that... 
[20:27:37] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:28:22] (03PS2) 10Ryan Lane: SimpleScheduler no longer exists; use FilterScheduler [operations/puppet] - 10https://gerrit.wikimedia.org/r/82934 [20:28:27] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time [20:28:29] (03CR) 10Ryan Lane: [C: 032 V: 032] SimpleScheduler no longer exists; use FilterScheduler [operations/puppet] - 10https://gerrit.wikimedia.org/r/82934 (owner: 10Ryan Lane) [20:34:38] !log reedy synchronized php-1.22wmf16/extensions/ArticleFeedbackv5/ [20:34:41] Logged the message, Master [20:37:37] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:39:27] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [20:40:17] RECOVERY - Puppet freshness on manganese is OK: puppet ran at Thu Sep 5 20:40:09 UTC 2013 [20:40:37] PROBLEM - Puppet freshness on manganese is CRITICAL: No successful Puppet run in the last 10 hours [20:43:47] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:43:56] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time [20:46:45] (03PS1) 10RobH: RT#5010 blog.wikimedia.org getting its own cert [operations/puppet] - 10https://gerrit.wikimedia.org/r/83008 [20:46:55] (03PS2) 10RobH: RT#5010 blog.wikimedia.org getting its own cert [operations/puppet] - 10https://gerrit.wikimedia.org/r/83008 [20:50:57] <^d> marktraceur: So, gerrit UI thinks the change exists. Repo on the disk says it's there. But it doesn't seem to want to fetch from the ref. 
[20:51:25] !log reedy synchronized php-1.22wmf15/extensions/ArticleFeedbackv5/ [20:51:28] Logged the message, Master [20:52:15] (03CR) 10RobH: [C: 031] "Looks right to me, requesting a second set of eyes to look it over." [operations/puppet] - 10https://gerrit.wikimedia.org/r/83008 (owner: 10RobH) [20:52:35] ^d: So maybe replication or forwarding is broken? /me has no idea how Gerrit works [20:53:03] <^d> Well when we changed from the old box to the new box, I had been replicating all repos over so we wouldn't have to copy them all at once. [20:53:11] <^d> I've still got the old repos backed up. [20:58:32] (03CR) 10Reedy: [C: 032] $wgCategoryCollation to 'uca-ru' on all Russian-language wikis [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/79770 (owner: 10Andrey Kiselev) [20:58:47] (03Merged) 10jenkins-bot: $wgCategoryCollation to 'uca-ru' on all Russian-language wikis [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/79770 (owner: 10Andrey Kiselev) [20:59:13] is it possible that gerrit vanished a change? [20:59:27] yeah, i'm having issues with a change as well [20:59:29] i strongly suspect it's simply me being stupid and missing something [21:00:07] see https://gerrit.wikimedia.org/r/#/c/82868/ [21:00:20] rmoen: what are you seeing? [21:00:23] (03PS2) 10Dzahn: RT#5011 bugzilla to use own certificate [operations/puppet] - 10https://gerrit.wikimedia.org/r/82879 (owner: 10RobH) [21:00:41] i cannot checkout https://gerrit.wikimedia.org/r/#/c/79336/ A flow change [21:00:47] (03CR) 10Dzahn: [C: 032] "lgtm, checking on kaulen" [operations/puppet] - 10https://gerrit.wikimedia.org/r/82879 (owner: 10RobH) [21:00:48] ori-l ^ [21:01:30] yep, confirmed [21:01:33] !log reedy synchronized wmf-config/InitialiseSettings.php [21:01:36] Logged the message, Master [21:01:51] ^d: are you aware? [21:02:02] oh, sorry, I see that you are. I missed the backlog. 
[21:02:36] <^d> ori-l: I'm kinda baffled :\ [21:02:55] rmoen: you should be able to download it if you copy one of the commands from the "Download" section [21:03:07] (at least it just worked for me on a different changeset) [21:03:13] MatmaRex: nope [21:05:32] It seems the commit has even beet replicated to gitblit: https://git.wikimedia.org/commit/mediawiki%2Fextensions%2FFlow.git/4edd7bf7643c7e267da6a767df736d54fcaf3c3f [21:05:59] And gitblit has even ref associated to it [21:06:22] but git ls-remote doesn't show it [21:06:25] it only shows patches 1-9 [21:06:42] PROBLEM - Puppet freshness on manganese is CRITICAL: No successful Puppet run in the last 10 hours [21:07:27] <^d> qchris: And we've got the ref in the old repos on manganese. [21:07:38] <^d> Some refs (not all) seem to have not been replicated :\ [21:08:32] But the commit is in the new repo, only the ref is missing new repo? Or is the commit also missing in the new repo? [21:08:47] <^d> The commits are all there as far as I can tell. [21:09:30] is there a record of commits across all repositories somewhere, so we can precisely delimit the range of commits that were affected? [21:09:42] RECOVERY - Puppet freshness on manganese is OK: puppet ran at Thu Sep 5 21:09:34 UTC 2013 [21:09:54] <^d> ori-l: No. And it's not time-dependent as far as I can tell. [21:10:09] hrm. [21:10:10] <^d> They were all replicated prior to migration. [21:10:42] PROBLEM - Puppet freshness on manganese is CRITICAL: No successful Puppet run in the last 10 hours [21:12:09] <^d> I mean we could manually fix them as we see them, but that doesn't scale. [21:12:14] <^d> And doesn't ensure we fix them all. [21:12:45] <^d> I tried pushing refs/changes/*:refs/changes/* from the old to new repo but git decided everything's up to date. [21:12:58] We cold iterate over the repos, list the refs and compare the output to see which we are missing. 
(Brute force approach) [21:13:04] right, that's what i was thinking [21:13:10] (03CR) 10Dzahn: [C: 04-1] "wait, need to check about bug-attachment.wikimedia.org cert errors when this is not covered by star anymore" [operations/puppet] - 10https://gerrit.wikimedia.org/r/82879 (owner: 10RobH) [21:13:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:14:22] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.123 second response time [21:16:12] (03PS1) 10Ottomata: Updating comments for $zookeeper_chroot. Supporting array for $log_dir [operations/puppet/kafka] - 10https://gerrit.wikimedia.org/r/83015 [21:16:36] (03CR) 10RobH: "easy solution is two certs." [operations/puppet] - 10https://gerrit.wikimedia.org/r/82879 (owner: 10RobH) [21:16:47] (03CR) 10Ottomata: [C: 032 V: 032] Updating comments for $zookeeper_chroot. Supporting array for $log_dir [operations/puppet/kafka] - 10https://gerrit.wikimedia.org/r/83015 (owner: 10Ottomata) [21:17:23] (03CR) 10Dzahn: "we had brief IRC discussion, we're gonna get bug-attachment on a new cert" [operations/puppet] - 10https://gerrit.wikimedia.org/r/82879 (owner: 10RobH) [21:17:25] ^d: You said the refs we replicated prior to the migration, but when comparing server logs, it seems patch set 79336/10 was pushed during/after the migration. [21:17:45] heaaaay LeslieCarr, I'm about to repave/install kafka 0.8 brokers via puppet [21:17:49] you want to do with me? [21:18:58] <^d> qchris: 4edd7bf7643c7e267da6a767df736d54fcaf3c3f shows up on ytterbium but not manganese, as expected yeah. [21:19:53] <^d> But ytterbium doesn't know about refs/changes/36/79336/10 [21:20:06] So that commit went to ytterbium, and we do not see the ref? So it's not a replication manganese->ytterbium thing. [21:20:09] Yes. [21:20:11] Mhmmm. [21:20:39] ottomata: LeslieCarr just had to turn in her laptop to OIT for repair [21:20:40] <^d> Yeah that part isn't. 
But I think there's some examples of stuff from before the migration that's weird. [21:20:44] so she is afk and unable to reply [21:20:45] fti [21:20:47] fyi [21:20:52] Did you copy over the caches as well? [21:21:09] <^d> No, it was a fresh init. [21:21:16] ohh right [21:21:31] thanks RobH [21:21:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:21:33] ^d: Ok. [21:23:11] <^d> I've restarted gerrit a number of times, so caches have been pretty cold. [21:23:22] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time [21:24:17] I assume it is as the Web UI is working, but did you check in the database tables that patch set 10 is there and looks valid? [21:24:55] qchris / ^d: python script to do what qchris suggested https://dpaste.de/MhIEw/raw/ [21:25:11] you'll have to change BASE_URL_B to point to the old repos [21:26:09] ori-l: \o/ [21:27:02] ori-l: Oh. I am not sure whether we have https access to the old gerrit :-/ But yes. We could just walk the file system. [21:27:32] <^d> The old box is still online, yes. [21:27:35] it's just shelling out to git, so it should take any remote that git ls-remote does [21:27:45] i.e., file system path or git/ssh/https URI [21:28:22] ori-l: Stupid me. You just fetch the repo list via https. [21:28:43] ah that's what you meant. yeah [21:30:23] bbiab [21:30:52] (03CR) 10Dzahn: "is this gonna be holmium or marmontel, both still have misc::blogs::wikimedia" [operations/puppet] - 10https://gerrit.wikimedia.org/r/83008 (owner: 10RobH) [21:31:58] (03CR) 10Dzahn: [C: 032] "ok, lgtm, holmium just has this single Apache site" [operations/puppet] - 10https://gerrit.wikimedia.org/r/83008 (owner: 10RobH) [21:33:25] <^d> Running that script now. 
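[editor's note] The brute-force check qchris proposed above (list the refs in each copy of a repo and compare the output) can be sketched roughly as below. This is not the script ori-l pasted, whose contents are not in this log. It is a minimal sketch that assumes both copies are reachable by `git ls-remote` (a file system path or a git/ssh/https URI, as ori-l notes). The function names and the sample SHAs and refs are made up for illustration.

```python
import subprocess


def ls_remote(remote):
    """Return the raw text output of `git ls-remote <remote>`.

    `remote` may be anything git accepts: a local path or a URI.
    """
    return subprocess.check_output(["git", "ls-remote", remote], text=True)


def parse_ls_remote(output):
    """Map ref name -> commit SHA from `git ls-remote` output."""
    refs = {}
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue
        # Each line is "<sha><whitespace><refname>".
        sha, ref = line.split(None, 1)
        refs[ref] = sha
    return refs


def missing_refs(old_output, new_output, prefix="refs/changes/"):
    """List refs under `prefix` present in the old copy but absent in the new."""
    old = parse_ls_remote(old_output)
    new = parse_ls_remote(new_output)
    return sorted(r for r in old if r.startswith(prefix) and r not in new)


# Hypothetical sample data standing in for two `git ls-remote` runs.
OLD = (
    "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa\trefs/heads/master\n"
    "1111111111111111111111111111111111111111\trefs/changes/01/101/1\n"
    "2222222222222222222222222222222222222222\trefs/changes/01/101/2\n"
)
NEW = (
    "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa\trefs/heads/master\n"
    "1111111111111111111111111111111111111111\trefs/changes/01/101/1\n"
)

if __name__ == "__main__":
    for ref in missing_refs(OLD, NEW):
        print("missing on new host:", ref)
```

Against real repositories one would call `missing_refs(ls_remote(old_path), ls_remote(new_path))` once per repo. The comparison only goes one way (refs the old copy has and the new one lacks), so a ref that never reached either copy will not be reported.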
[21:40:02] RECOVERY - Puppet freshness on manganese is OK: puppet ran at Thu Sep 5 21:39:59 UTC 2013 [21:40:23] !log replaced SSL cert on blog.wikimedia.org, don't use star cert, have separate new one [21:40:26] Logged the message, Master [21:40:28] (03CR) 10Dzahn: "- SSLCertificateFile /etc/ssl/certs/star.wikimedia.org.pem" [operations/puppet] - 10https://gerrit.wikimedia.org/r/83008 (owner: 10RobH) [21:40:42] PROBLEM - Puppet freshness on manganese is CRITICAL: No successful Puppet run in the last 10 hours [21:42:17] (03CR) 10Dzahn: "openssl s_client -showcerts -CApath /etc/ssl/certs/ -connect blog.wikimedia.org:443" [operations/puppet] - 10https://gerrit.wikimedia.org/r/83008 (owner: 10RobH) [21:42:43] csteipp: ^ [21:47:11] <^d> *sigh* [21:47:14] <^d> What to do. [21:47:26] *sigh* need to roll that back because of techblog.wikimedia.org [21:47:35] otherwise SSL cert errors there [21:47:42] ^d is the script still running? [21:48:18] What does the output say so far? Like tons of missing refs, or just a few? [21:48:56] (03PS1) 10Dzahn: Revert "RT#5010 blog.wikimedia.org getting its own cert" [operations/puppet] - 10https://gerrit.wikimedia.org/r/83017 [21:49:24] <^d> qchris: Most seem to be there, some missing. No real pattern I see. [21:49:36] starting scap of E3's GuidedTour to mater [21:49:49] *update to master [21:50:41] (03CR) 10Dzahn: [C: 032] "reverting because the cert also needs techblog.wikimedia.org or that needs to go and otherwise we cause cert errors there (we still want t" [operations/puppet] - 10https://gerrit.wikimedia.org/r/83017 (owner: 10Dzahn) [21:52:43] There were reports of people not seeing some refs as they ran gerrit as user root for some short amount of time (to test things) and afterwards tried to run it as user gerrit afterwards. Since we're on puppet, this should not affect us. Right? [21:54:01] The init script would be there to start/stop gerrit through. [21:54:06] <^d> Right. 
[21:56:03] (03CR) 10Dzahn: "so, techblog is just a redirect to https://blog.wikimedia.org/c/technology/ though, can we just get rid of that or not care about cert err" [operations/puppet] - 10https://gerrit.wikimedia.org/r/83017 (owner: 10Dzahn) [21:56:25] (03PS1) 10Ottomata: Including role::analytics on an21 and an22. [operations/puppet] - 10https://gerrit.wikimedia.org/r/83019 [21:57:19] !log rolled back SSL cert on blog for now to discuss techblog.wm issue which would not be covered by cert but is just a redirect [21:57:22] Logged the message, Master [21:59:18] !log spage Started syncing Wikimedia installation... : Deploying latest GuidedTour to wmf15 and wmf16 [21:59:20] Logged the message, Master [22:00:20] (03CR) 10Ottomata: [C: 032 V: 032] Including role::analytics on an21 and an22. [operations/puppet] - 10https://gerrit.wikimedia.org/r/83019 (owner: 10Ottomata) [22:01:30] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:03:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time [22:04:08] <^d> qchris, ori-l: Output so far http://p.defau.lt/?awScnEjBSyTbAlUpwJ6x8A [22:04:47] Does not look too bad :-) [22:05:26] (03PS3) 10RobH: RT#5011 bugzilla to use own certificate [operations/puppet] - 10https://gerrit.wikimedia.org/r/82879 [22:06:22] (03PS4) 10RobH: RT#5011 bugzilla to use own certificate [operations/puppet] - 10https://gerrit.wikimedia.org/r/82879 [22:06:57] (03PS4) 10Spage: Add GuidedTour to additional languages [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/82533 (owner: 10Mattflaschen) [22:07:30] (03CR) 10Spage: [C: 032] "The right 7 languages." 
[operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/82533 (owner: 10Mattflaschen) [22:07:40] (03Merged) 10jenkins-bot: Add GuidedTour to additional languages [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/82533 (owner: 10Mattflaschen) [22:09:19] error in scap: snapshot1: sudo: no tty present and no askpass program specified [22:10:03] the usual glitch in scap: mw1089: Copying to mw1089 from mw1070.eqiad.wmnet...cannot delete non-empty directory: php-1.22wmf2/.git/modules/extensions/WikiLove [22:10:24] PROBLEM - Puppet freshness on manganese is CRITICAL: No successful Puppet run in the last 10 hours [22:11:00] !log spage Finished syncing Wikimedia installation... : Deploying latest GuidedTour to wmf15 and wmf16 [22:11:03] Logged the message, Master [22:11:22] (03PS1) 10Andrew Bogott: Lots of ruckus to get our apt repo 'signed'. [operations/puppet] - 10https://gerrit.wikimedia.org/r/83022 [22:13:21] (03CR) 10Andrew Bogott: [C: 032] Lots of ruckus to get our apt repo 'signed'. 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/83022 (owner: 10Andrew Bogott) [22:13:59] (03PS1) 10Dzahn: remove old node marmontel from site.pp [operations/puppet] - 10https://gerrit.wikimedia.org/r/83023 [22:15:28] (03PS12) 10Ottomata: Adding role/analytics/kafka.pp Also adding modules/kafka [operations/puppet] - 10https://gerrit.wikimedia.org/r/77971 [22:16:56] !log spage synchronized wmf-config/InitialiseSettings.php 'E3 deploy GuidedTour to 7 more wikis' [22:16:59] Logged the message, Master [22:17:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:17:49] (03CR) 10jenkins-bot: [V: 04-1] Adding role/analytics/kafka.pp Also adding modules/kafka [operations/puppet] - 10https://gerrit.wikimedia.org/r/77971 (owner: 10Ottomata) [22:17:58] (03CR) 10RobH: [C: 04-1] "add mormontel to decomissioning.pp with a note that its a reclaim, not a decom" [operations/puppet] - 10https://gerrit.wikimedia.org/r/83023 (owner: 10Dzahn) [22:18:14] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [22:19:33] (03CR) 10RobH: "I want to handle merge, just +1 or -1 please" [operations/puppet] - 10https://gerrit.wikimedia.org/r/82879 (owner: 10RobH) [22:20:43] (03PS13) 10Ottomata: Adding role/analytics/kafka.pp Also adding modules/kafka [operations/puppet] - 10https://gerrit.wikimedia.org/r/77971 [22:21:03] (03CR) 10Dzahn: [C: 031] "(1 comment)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/82879 (owner: 10RobH) [22:23:01] (03CR) 10Ottomata: [C: 032 V: 032] Adding role/analytics/kafka.pp Also adding modules/kafka [operations/puppet] - 10https://gerrit.wikimedia.org/r/77971 (owner: 10Ottomata) [22:23:09] (03PS5) 10RobH: RT#5011 bugzilla to use own certificate [operations/puppet] - 10https://gerrit.wikimedia.org/r/82879 [22:23:51] greg-g, bsitu : E3 deploy finished [22:24:03] thanks spagewmf [22:24:08] (03PS1) 10Dzahn: retab bugzilla.pp from 
tabs to 4 spaces [operations/puppet] - 10https://gerrit.wikimedia.org/r/83025 [22:24:14] spagewmf: thx [22:24:25] greg-g: I guess I will wait till 4? [22:25:28] (03PS1) 10Ottomata: Updating kafka to latest commit [operations/puppet] - 10https://gerrit.wikimedia.org/r/83026 [22:25:41] (03CR) 10Ottomata: [C: 032 V: 032] Updating kafka to latest commit [operations/puppet] - 10https://gerrit.wikimedia.org/r/83026 (owner: 10Ottomata) [22:25:59] bsitu: yeah, if you could [22:26:11] greg-g: sure [22:28:16] (03PS6) 10RobH: RT#5011 bugzilla to use own certificate [operations/puppet] - 10https://gerrit.wikimedia.org/r/82879 [22:29:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:31:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 3.429 second response time [22:35:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:37:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 7.725 second response time [22:38:12] ^d: Of those refs you gave us, it seems refs/changes/92/80192/1 from mediawiki/extensions/WikibaseQuery is the only change ref that's not on ytterbium. [22:38:22] Is the script still running? [22:39:08] <^d> It halted because a repo was fubar'd. Re-running. 
[22:39:14] RECOVERY - check_job_queue on fenari is OK: JOBQUEUE OK - all job queues below 10,000 [22:39:23] Ok :-) [22:40:04] RECOVERY - Puppet freshness on manganese is OK: puppet ran at Thu Sep 5 22:39:56 UTC 2013 [22:40:24] PROBLEM - Puppet freshness on manganese is CRITICAL: No successful Puppet run in the last 10 hours [22:41:22] (03PS1) 10Andrew Bogott: Use %transient-key when generating our apt-signing key [operations/puppet] - 10https://gerrit.wikimedia.org/r/83031 [22:41:50] (03CR) 10Andrew Bogott: [C: 032] Use %transient-key when generating our apt-signing key [operations/puppet] - 10https://gerrit.wikimedia.org/r/83031 (owner: 10Andrew Bogott) [22:42:24] PROBLEM - check_job_queue on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:43:34] (03Abandoned) 10RobH: retab bugzilla.pp from tabs to 4 spaces [operations/puppet] - 10https://gerrit.wikimedia.org/r/83025 (owner: 10Dzahn) [22:43:52] !log csteipp synchronized php-1.22wmf16/includes/actions [22:43:55] Logged the message, Master [22:44:31] !log csteipp synchronized php-1.22wmf16/includes/specials [22:44:34] Logged the message, Master [22:45:27] !log csteipp synchronized php-1.22wmf16/extensions/CentralNotice 'bug53032' [22:45:30] Logged the message, Master [22:47:19] anything has been changed with gerrit? i can't log in to the web interface... [22:48:14] Danny_B: gerrit has moved to a different machine. Logging in works for me. [22:48:33] (03PS2) 10RobH: remove old node marmontel from site.pp [operations/puppet] - 10https://gerrit.wikimedia.org/r/83023 (owner: 10Dzahn) [22:49:53] hmm, i tried every single passwd i can remember and none works [22:50:13] gerrit is a bit picky about user names as well. [22:50:13] can't find any "send recovery passwd" feature :-/ [22:50:24] Danny_B: reset it on wikitech? [22:50:28] Especially when it comes to spaces and underscores [22:50:44] login goes with email address, doesn't it? [22:50:50] Nope. [22:50:59] username? [22:51:00] Your ldap cn. 
[22:51:00] <^d> Login uses your normal wiki username. [22:51:06] <^d> aka ldap cn. [22:51:18] that may be the issue then ;-) [22:51:33] (03PS1) 10Dzahn: retab and quoting in blogs.pp [operations/puppet] - 10https://gerrit.wikimedia.org/r/83034 [22:51:47] ^d: bleh. sorry to keep bugging you, but is there an eta on when my repo is going to get fixed? :/ [22:51:58] bingo! [22:52:02] <^d> No. It's not just your repo. [22:52:07] thx for kick [22:52:30] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:52:46] well, i'll just keep waiting then [22:53:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [22:55:19] (03CR) 10Dzahn: [C: 032] remove old node marmontel from site.pp [operations/puppet] - 10https://gerrit.wikimedia.org/r/83023 (owner: 10Dzahn) [22:55:35] <^d> qchris: Finished http://p.defau.lt/?KSuOwL5LC_mNksA_MfavWg [22:56:03] (03CR) 10Dzahn: "it is powered on but seperated VLAN and (only) accesible via mgmt" [operations/puppet] - 10https://gerrit.wikimedia.org/r/83023 (owner: 10Dzahn) [22:56:51] ^d: Great. Script to check change refs is running ... [22:57:49] <^d> 151 affected repos. [22:58:15] ^d: what happened? [22:58:31] But for most of them, just HEAD changed, a new patch set got uploaded etc [22:58:38] <^d> Right. [22:58:42] <^d> Which is mostly ok. [22:59:05] <^d> Ryan_Lane: Some refs didn't get copied to ytterbium, but their commits are there. [22:59:11] ah [22:59:31] <^d> No data loss, just annoying as heck :) [23:03:40] PROBLEM - Puppet freshness on lvs1003 is CRITICAL: No successful Puppet run in the last 10 hours [23:05:18] ^d The script does not seem to do the trick. 79336/10 should be affected, but is not listed in the output you pasted. Just as 82783/1 (although 82783/2 is there). 
[23:05:30] <^d> Grr :\ [23:05:50] greg-g: ^^ [23:05:53] (03PS7) 10Dzahn: RT#5011 bugzilla to use own certificate [operations/puppet] - 10https://gerrit.wikimedia.org/r/82879 (owner: 10RobH) [23:06:25] (03CR) 10Dzahn: [C: 031] "does this contain the attachments domain now?" [operations/puppet] - 10https://gerrit.wikimedia.org/r/82879 (owner: 10RobH) [23:08:15] ^d On second thought. Thouse were the patch sets that have been uploaded to ytterbium and did not get a ref there. It's ok if they never reached manganese. [23:09:26] Of the output you pasted, I can see all change refs on ytterbium except refs/changes/92/80192/1 [23:10:40] <^d> I think the script doesn't work. [23:11:20] PROBLEM - Puppet freshness on manganese is CRITICAL: No successful Puppet run in the last 10 hours [23:11:24] It does not? [23:11:44] <^d> I think the script is over-reporting at least. [23:11:49] <^d> Telling us a ton of stuff we don't care. [23:11:55] Yes. It's overreporting. [23:12:00] But that's ok. [23:12:05] I checked all change refs. [23:12:16] They exist (except for one) [23:15:43] !log bsitu synchronized php-1.22wmf16/extensions/PageTriage 'Update PageTriage to master' [23:15:46] Logged the message, Master [23:15:55] (03PS1) 10Dzahn: tabs and lint stuff in outreach.pp [operations/puppet] - 10https://gerrit.wikimedia.org/r/83036 [23:16:11] !log bsitu synchronized php-1.22wmf15/extensions/PageTriage 'Update PageTriage to master' [23:16:14] Logged the message, Master [23:17:13] (03CR) 10Dzahn: [C: 032] tabs and lint stuff in outreach.pp [operations/puppet] - 10https://gerrit.wikimedia.org/r/83036 (owner: 10Dzahn) [23:17:20] <^d> qchris: I'm going to go away for awhile and think on this. [23:17:39] Ok. 
[23:20:09] (03PS1) 10Ottomata: Using JBOD Kafka mounts in production [operations/puppet] - 10https://gerrit.wikimedia.org/r/83038 [23:20:39] (03PS2) 10Ottomata: Using JBOD Kafka mounts in production [operations/puppet] - 10https://gerrit.wikimedia.org/r/83038 [23:20:45] (03CR) 10Ottomata: [C: 032 V: 032] Using JBOD Kafka mounts in production [operations/puppet] - 10https://gerrit.wikimedia.org/r/83038 (owner: 10Ottomata) [23:22:30] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:23:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.264 second response time [23:24:19] did plugin sync or something just get enabled? :p [23:24:48] (03CR) 10Dzahn: "i don't know any of the background story or this specific blog, so i truly abstain from an opinion, but to merge it you'd have to show som" [operations/puppet] - 10https://gerrit.wikimedia.org/r/80760 (owner: 10Dereckson) [23:24:52] Anything going on with irc.wikimedia.org? [23:26:18] Elsie: are you reporting it broken? [23:26:41] Not sure yet. [23:26:45] My bot has stopped working. [23:26:56] But I haven't connected to the network myself yet. [23:27:02] not aware of anyone touching that server [23:27:31] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:27:36] Elsie: 16:28 -!- I have 223 clients and 0 servers [23:27:38] wfm [23:28:02] well, the server connect [23:28:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.343 second response time [23:29:47] mutante: Do you see activity in #en.wikipedia? [23:30:15] Elsie: yes, i do, works [23:30:28] there's a bot called rc-pmtpa [23:30:28] Okay, thanks. [23:30:33] Something else must be broken. [23:30:39] ok. np [23:31:05] Huh... [23:31:21] The format changed. [23:31:24] heh, while on it: [23:31:26] Who deployed code today? [23:31:29] Or just recently? 
[23:31:30] !log installing package upgrades on ekrem [23:31:33] Logged the message, Master [23:32:28] Elsie: https://wikitech.wikimedia.org/wiki/SAL ? [23:33:24] 18:19 UTC is the last message I have coming in... [23:33:57] Reedy or ori-l: Did you do anything related to the IRC feed? [23:35:20] RECOVERY - check_job_queue on fenari is OK: JOBQUEUE OK - all job queues below 10,000 [23:36:01] csteipp: Same question. [23:36:37] There's nothing that matches the 1819 timestamp [23:37:07] It was a change that came after that, I think. [23:37:14] So something in InitialiseSettings.php, maybe. [23:37:27] Or whatever Chris was doing in includes/specials, maybe. [23:37:47] 2013-09-05 19:37:38-0400 [Snatch,client] ' [[Q14534990]] B http://www.wikidata.org/w/index.php?diff=68617208&oldid=65632585&rcid=68732900 * EmausBot * (+71) /* wbeditentity-update:0| */Adding labels' was not matched [23:37:56] I'm getting errors like that. [23:38:00] I think it may be the preceding space. [23:38:21] PROBLEM - check_job_queue on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:40:07] lol [23:40:18] What? [23:40:35] regex fail due to a space [23:40:41] It's a fucking API. [23:40:47] If someone changed the format, that would break scdripts. [23:40:48] Barely [23:40:49] scripts [23:40:54] Only in every sense of the term. [23:41:17] if it involves regexes, it's not an API. it's a fucking disaster;) [23:41:41] It's little more than screen scraping [23:41:49] Okay, why did the format change? [23:41:50] RECOVERY - Puppet freshness on manganese is OK: puppet ran at Thu Sep 5 23:41:45 UTC 2013 [23:41:54] And what needs to be done to un-change it. [23:42:20] PROBLEM - Puppet freshness on manganese is CRITICAL: No successful Puppet run in the last 10 hours [23:42:22] Dunno [23:42:25] Go look it up? [23:43:03] I'm trying. [23:43:12] I know you are. [23:43:27] The server admin log is useless. [23:43:38] git log? 
[23:43:59] I thought I recalled ori making a fix related to the irc rc in production [23:44:03] either earlier this week or last [23:44:15] It is that space, BTW. [23:44:30] ? [23:44:31] problem fix [23:44:33] ed [23:44:39] No, someone has changed the format. [23:44:42] It needs to be un-changed. [23:44:50] The new format now has a preceding space. [23:44:53] You can't do that. [23:44:57] You can [23:44:59] It's been done [23:45:21] Is the format set in InitialiseSettings.php? [23:45:37] I don't think so [23:45:42] http://git.wikimedia.org/commit/mediawiki%2Fcore.git/dc6f5c314de445ec170d706930ebec6a77113df4 is what I was thinking of [23:46:42] $this->mExtras becomes $this->mExtra [23:46:46] http://git.wikimedia.org/commit/mediawiki%2Fcore.git/2961884b43b505b7ddc30212f14eed56250e8e18 [23:46:59] ^ that seems a likely candidate [23:47:12] Right. [23:47:17] Did that get deployed this afternoon? [23:47:24] Sometime around 19:00 UTC? [23:47:25] 81 [23:47:25] + $fullString = "$titleString\0034 $flag\00310 " . [23:47:25] 82 [23:47:25] + "\00302$url\003 \0035*\003 \00303$user\003 \0035*\003 $szdiff \00310$comment\003\n"; [23:47:45] alright, I'm here now [23:47:47] what's up? [23:47:47] it would've been in 15 [23:47:57] Which went out to enwiki today (well, yesterday for me) [23:48:11] Did it also go out to Wikidata today? [23:48:17] No.. [23:48:24] That should've been monday [23:48:27] You'd think someone would've complained already. [23:48:28] or tuesday even [23:48:29] holiday [23:48:31] ori-l: Someone broke Elsies screen scraping [23:48:38] well, let's fix it [23:48:48] ori-l: There's a preceding space in IRC messages now. [23:48:49] ori-l: additional whitespace in feed format :p [23:48:57] as soon as Elsie is done with "you don't even understand just how serious business IRC is" [23:49:01] It's going to be in includes/rcfeed/IRCColourfulRCFeedFormatter.php [23:49:13] * Reedy waits for mutante to accidentally the server [23:49:14] * Elsie smiles at ori-l. 
[23:49:14] k, let me look [23:49:26] Even though IRC IS SERIOUS AND MOST ANTI-VANDALISM TOOLS RELY ON IT. [23:49:38] Reedy: hah, i just dared to install package upgrades :p [23:49:57] 33 [23:49:57] + // HACK: We need this hook for WMF's secure server setup [23:49:57] 34 [23:49:57] + wfRunHooks( 'IRCLineURL', array( &$url, &$query ) ); [23:50:04] my scrollback buffer doesn't go back far enough; can you paste an example indicating where the extra space is located? [23:50:27] [00:37:46] 2013-09-05 19:37:38-0400 [Snatch,client] ' [[Q14534990]] B http://www.wikidata.org/w/index.php?diff=68617208&oldid=65632585&rcid=68732900 * EmausBot * (+71) /* wbeditentity-update:0| */Adding labels' was not matched [23:50:34] Space before [[ [23:50:40] Oh, beaten. [23:50:46] Right, space at the very beginning. [23:50:55] what kind of log record is that? [23:50:56] Which is probably visible on the network itself. [23:51:04] I'm not sure what you're asking. [23:51:20] $fullString = "$titleString\0034 $flag\00310 " . "\00302$url\003 \0035*\003 \00303$user\003 \0035*\003 $szdiff \00310$comment\003\n"; [23:51:21] It affects all messages. [23:51:31] And title string is \00314[[ [23:51:46] Was it always? [23:52:27] Yup, looks to be a copy paste from the other file [23:52:30] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:53:50] I see it, I think [23:54:00] I don't. [23:54:06] https://gerrit.wikimedia.org/r/#/c/52922/25/includes/RecentChange.php [23:54:16] $line = $prefix . $line; [23:54:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time [23:54:23] that's now using wfErrorLog, which.... (/me looks) [23:55:06] $text = preg_replace( '/^/m', $prefix . ' ', $text ); [23:55:12] there you have it [23:55:22] oh, this is going to be a pain in the ass to fix. [23:55:34] Is a space really needed? >_> [23:55:50] ori-l: I can't tell if you're being sarcastic. 
[23:55:55] Elsie: I'm not [23:56:03] wfErrorLog is used by a bunch of other things [23:56:09] namely udp2log on fluorine which is expecting the space [23:56:10] Can't we just trim() before it gets sent to IRC? [23:56:28] yeah, that's a good idea [23:56:53] where does that happen, though? [23:56:58] sentToUDP? [23:56:59] send [23:57:06] no, that just sends it to a host on the cluster [23:57:10] whose job it is to write it to IRC [23:57:48] there's some python script that does this, I believe [23:57:53] and it's probably not puppetized [23:59:12] 'wgRC2UDPAddress' => array( [23:59:12] 'default' => '208.80.152.178', // pmtpa: ekrem [23:59:23] What about cleanupForIRC? You can't trim $text there? [23:59:23] yeah, ekrem is the IRC server [23:59:40] (Not having any idea what $text is at that point....)
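Note on the leading-space problem discussed above: the following is a minimal Python sketch, not the MediaWiki code. It assumes the feed line passes through a line-prefixing step equivalent to the quoted preg_replace( '/^/m', $prefix . ' ', $text ), that the prefix is empty for the IRC feed, and that a consumer matches lines with a regex anchored at the start. The names prefix_lines and LINE_RE and the pattern itself are hypothetical, and the sample line is the uncoloured form pasted in the log.

```python
import re

# Hypothetical consumer pattern, anchored at the start of the line.
# Real bots use their own patterns; this one only illustrates the failure.
LINE_RE = re.compile(
    r"^\[\[(?P<title>[^\]]+)\]\] (?P<flags>\S*) (?P<url>\S+) "
    r"\* (?P<user>.+?) \* \((?P<diff>[+-]\d+)\) (?P<comment>.*)$"
)


def prefix_lines(text: str, prefix: str) -> str:
    """Python analogue of preg_replace('/^/m', $prefix . ' ', $text)."""
    return re.sub(r"^", prefix + " ", text, flags=re.MULTILINE)


# Sample taken from the error message pasted in the channel.
sample = (
    "[[Q14534990]] B http://www.wikidata.org/w/index.php"
    "?diff=68617208&oldid=65632585&rcid=68732900 "
    "* EmausBot * (+71) /* wbeditentity-update:0| */Adding labels"
)

# Assumption: the prefix is empty, so only the separator space is added.
sent = prefix_lines(sample, "")

broken = LINE_RE.match(sent)         # anchored pattern no longer matches
fixed = LINE_RE.match(sent.strip())  # trimming restores the old format

print(repr(sent[:5]), broken is None, fixed is not None)
```

Trimming on the sending side, as suggested in the channel, would keep existing consumers working without touching the shared logging helper. A consumer could also strip leading whitespace itself before matching.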