[00:07:41] !log DNS update - point old bugzilla3 entry over to actual bugzilla server [00:07:48] Logged the message, Master [00:20:40] New patchset: Dzahn; "remove from singer: account awjrichards, group wikidev, svn client" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58844 [00:22:49] PROBLEM - Puppet freshness on cp3003 is CRITICAL: No successful Puppet run in the last 10 hours [00:32:52] New patchset: Ori.livneh; "Migrate scap-1, scap-2, & sync-common from wikimedia-task-appserver" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57854 [00:36:03] New patchset: Ori.livneh; "Drop scap-1, scap-2, & sync-common scripts; up version to 2.8-1" [operations/debs/wikimedia-task-appserver] (master) - https://gerrit.wikimedia.org/r/58671 [00:37:19] New patchset: Ori.livneh; "Drop scap-1, scap-2, & sync-common scripts; up version to 2.8-1" [operations/debs/wikimedia-task-appserver] (master) - https://gerrit.wikimedia.org/r/58671 [00:38:36] New patchset: Ori.livneh; "Migrate scap-1, scap-2, & sync-common from wikimedia-task-appserver" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57854 [00:40:32] New patchset: Ori.livneh; "Migrate scap-1, scap-2, & sync-common from wikimedia-task-appserver" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57854 [00:42:45] New review: coren; "There is a problem when applying this to extant puppetmaster::self instances; as previously configur..." [operations/puppet] (production) C: -2; - https://gerrit.wikimedia.org/r/58540 [00:44:44] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [01:01:44] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [01:36:04] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours [01:49:44] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 1 process with command name varnishncsa [02:02:44] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [02:08:37] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [02:15:30] !log LocalisationUpdate completed (1.22wmf1) at Fri Apr 12 02:15:30 UTC 2013 [02:15:38] Logged the message, Master [02:26:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:27:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time [02:31:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:32:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [02:32:37] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [02:36:44] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [03:03:44] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [03:16:47] PROBLEM - Puppet freshness on ms-fe3001 is CRITICAL: No successful Puppet run in the last 10 hours [03:19:12] New patchset: Asher; "schema" [operations/software/redactatron] (master) - https://gerrit.wikimedia.org/r/58848 [03:19:38] Change merged: Asher; [operations/software/redactatron] (master) - 
https://gerrit.wikimedia.org/r/58848 [03:57:03] !log csteipp synchronized php-1.22wmf1/extensions/RSS/ [03:57:06] What was that about? [04:09:50] spagewmf, are the whitespace changes in https://gerrit.wikimedia.org/r/#/c/57823/5/languages/messages/MessagesEn.php deliberate? [04:31:36] New patchset: Asher; "review page expected a previously saved review" [operations/software/redactatron] (master) - https://gerrit.wikimedia.org/r/58849 [04:44:25] Change merged: Asher; [operations/software/redactatron] (master) - https://gerrit.wikimedia.org/r/58849 [05:47:54] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:48:44] RECOVERY - Parsoid on wtp1004 is OK: HTTP OK: HTTP/1.1 200 OK - 1368 bytes in 2.127 second response time [05:49:44] RECOVERY - RAID on searchidx1001 is OK: OK: State is Optimal, checked 4 logical device(s) [05:51:54] PROBLEM - Parsoid on wtp1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:54:34] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [05:54:34] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [05:54:34] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [06:34:06] New patchset: Tim Starling; "Update documentation for /root/.ssh/authorized_keys" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58658 [06:34:12] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58658 [06:37:59] New patchset: Hashar; "zuul: in labs use the `labs` branch to install Zuul" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58738 [06:38:00] New patchset: Hashar; "zuul: no fetch from pypi and drop statsd dependency" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58827 [06:38:00] New patchset: Hashar; "zuul: support cloning from a different branch" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58737 [06:39:21] PROBLEM - Puppet freshness on virt1005 is CRITICAL: No successful Puppet run in the last 10 hours [06:48:03] New patchset: Hashar; "beta: fix sudo rights for mwdeploy user" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58859 [06:52:14] analinterns, heh. [07:02:24] New review: Nemo bis; "It *will* be a problem for humans if you don't apply the restriction to bot group only... It's easy ..." [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/58709 [07:07:39] New review: Legoktm; "All flagged bots have the "noratelimit" right, so won't that exempt them from this anyways?" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/58709 [07:08:54] New review: MZMcBride; "What would you restrict non-bots to? I think restricting bots but not restricting everyone else woul..." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/58709 [07:10:49] New patchset: MZMcBride; "Set rate limit for editing on Wikidata" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/58709 [08:12:19] PROBLEM - Puppet freshness on db1058 is CRITICAL: No successful Puppet run in the last 10 hours [08:21:43] New review: Daniel Kinzler; "@Nemo said:" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/58709 [08:25:05] New review: Peachey88; "Yes, It's very easy, I used to rack changes up in different tabs then save them one after the other." 
[operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/58709 [08:26:39] New review: Nemo bis; "Yes, really, even without scripts. And Wikidata has all those JS features that make editing very fast." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/58709 [08:26:41] New review: Daniel Kinzler; "I just checked and found that rate limits are not enforced by the Wikibase API at all. Filed as bug ..." [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/58709 [08:27:22] New review: Legoktm; "According to https://www.wikidata.org/w/index.php?title=User_talk:Docu&oldid=23761249#Edits its easi..." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/58709 [09:13:11] New review: Daniel Kinzler; "@Legoktm allowing that would defeat the purpose of having the limit in the first place: avoiding ver..." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/58709 [09:39:05] !log jenkins: updated mediawiki-core-whitespaces job to use ZUUL_COMMIT as a refspec specifier (for {{bug| 46723}} ) [09:39:12] Logged the message, Master [09:44:48] Change abandoned: Aude; "(no reason)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/58709 [10:13:25] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:14:15] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time [10:23:45] PROBLEM - Puppet freshness on cp3003 is CRITICAL: No successful Puppet run in the last 10 hours [10:26:52] New patchset: Hashar; "beta: fix sudo rights for mwdeploy user" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58859 [10:27:09] New review: Hashar; "PS2 fix whitespaces" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58859 [10:31:25] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:32:15] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time [10:43:21] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:44:11] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time [10:56:21] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:58:11] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time [11:36:58] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours [11:43:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:44:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time [11:56:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:58:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.137 second response time [12:05:28] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [12:08:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:10:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line 
output matched 400 - 336 bytes in 0.122 second response time [12:31:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:32:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.123 second response time [12:33:28] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [12:37:25] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [12:52:25] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:53:15] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time [12:55:17] cmjohnson1: good morning :-] [12:55:31] Good Afternoon! [12:56:02] should we wait for RobH ? [12:56:14] I can't help for hardware I am a noob on that area [12:56:20] robh: will not be online this early [12:56:46] well I guess you don't need any remote support anyway? if you get access to gallium console I guess we are fine [12:56:49] we are fine without him...the h/w aspect is simple and should take less than 5 mins [12:56:55] \O/ [12:57:29] at 1300 bring down gallium and I will add the disk [12:59:01] !log gracefully shut-downing Jenkins and Zuul for scheduled maintenance. Will shutdown server gallium just after. [12:59:08] Logged the message, Master [13:01:03] bah it doesn't want to go down :D [13:01:25] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:02:15] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [13:02:35] PROBLEM - zuul_service_running on gallium is CRITICAL: Connection refused by host [13:02:52] cmjohnson1: machine is going down. I lost access to it [13:02:53] :-D [13:02:57] hashar: looks down [13:03:25] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [13:04:12] PROBLEM - Host gallium is DOWN: PING CRITICAL - Packet loss = 100% [13:07:22] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [13:10:29] seem we want to use: XFS, set noatime,nobarrier,logbufs=8 [13:14:23] nobarrier iff there the backstore is battery backed up. [13:15:15] What do our RT priorities mean? [13:16:26] I mean, 50 is clearly the default; is like "0: Meh, if you got nothing to do" -> "100: OMG the machine room is on fire!!1!" ? [13:17:02] PROBLEM - Puppet freshness on ms-fe3001 is CRITICAL: No successful Puppet run in the last 10 hours [13:17:33] \O/ [13:17:39] Coren: I have no idea :( [13:18:01] hashar: new disk is in...it had to go in the last drive slot...the sata connector for slot 3 wouldn't reach the disk [13:18:07] powered on [13:18:11] pinging it [13:18:31] I'll ask Steve, since the ticket is for him. [13:19:17] cmjohnson1: will you handle the disk formatting ? [13:20:19] Coren: https://wikitech.wikimedia.org/wiki/RT#Priorities :-] [13:20:25] Coren: 51-60 => disaster [13:21:06] Ah. Didn't think of searching on wikitech. Silly me. Thanks, hashar. [13:21:12] RECOVERY - Host gallium is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [13:22:18] Although, admitedly, that makes 50 a dubious default. 
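The mount options quoted above (XFS with noatime, nobarrier and logbufs=8) would normally be persisted in fstab; a sketch of that line follows, using the device and mountpoint that get settled on later in the day rather than anything decided at this point. Per Coren's caveat, nobarrier only makes sense if the write cache is battery backed, so drop it if in doubt.

    # hypothetical /etc/fstab entry for the new SSD on gallium
    # drop "nobarrier" unless the controller cache is battery backed
    /dev/sdb1   /srv/ssd   xfs   noatime,nobarrier,logbufs=8   0   0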
[13:23:22] PROBLEM - HTTP on gallium is CRITICAL: Connection refused [13:23:22] PROBLEM - jenkins_service_running on gallium is CRITICAL: Connection refused by host [13:23:22] PROBLEM - SSH on gallium is CRITICAL: Connection refused [13:23:29] poor gallium :D [13:23:39] cmjohnson1: doesn't seem to want to come up or is that doing some fsck on the disks? [13:23:39] hashar: can you format? [13:23:48] let me connect to it [13:23:50] can't ssh [13:25:27] rebooting [13:25:47] sorry should have told you earlier [13:25:52] I have simply have no idea how fast our box comes up usually [13:26:14] depends on the server [13:27:22] RECOVERY - SSH on gallium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [13:30:40] cmjohnson1: I got some syslog entries [13:30:51] it's up but I don't see the disk [13:31:22] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:32:12] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time [13:32:23] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [13:33:19] hashar: please take a look if you don't see it either...than i may have to get the disk to work in slot 3 [13:33:27] yeah I am looking at dmesg [13:33:29] and syslog [13:33:39] any idea what would be the name ? [13:34:13] no but dmesg only shows sda and sdb...fdisk -l looks exactly the same as it did before adding the disk [13:34:26] i would think it would be sdc [13:34:49] yup [13:34:54] try another slot ? [13:35:00] okay [13:35:21] shutdowning [13:35:28] okay..cool [13:35:34] !log gallium can't find the SSD, shut downing, will attempt another slot [13:35:41] Logged the message, Master [13:37:03] PROBLEM - Host gallium is DOWN: PING CRITICAL - Packet loss = 100% [13:40:34] cmjohnson1: I can't find anything relevant in the logs :( [13:41:42] 2.566288] ata_piix 0000:00:1f.5: SCR access via SIDPR is available but doesn't work [13:41:42] oh [13:42:32] paste is http://dpaste.com/1055975/ [13:43:28] hashar: Those are just the sata controllers; try to grep sd[a-z] [13:43:47] can't it mean the sata controller has an issue? [13:44:18] That's just info; it's using another method to access de SCR. Not all controllers work with all methods. [13:44:29] s/de/the/ [13:44:42] * hashar doesn't even know what SCR stands for :D [13:44:56] S{CISI,ATA} command register [13:44:58] cmjohnson1: still can't ping it [13:45:42] Drive might be disabled in the BIOS (or the other controller). [13:46:57] booting...i think coren is right about disabled drive [13:47:09] ata1.00 and ata2.00 were up with some samsung disks [13:47:23] ata.2.01 and ata1.01 got: SATA link down (SStatus 4 SControl 0) [13:47:25] yeah..i just enabled port c on sata settings [13:48:27] New patchset: Hashar; "gallium: SSD drive using XFS" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58887 [13:48:48] I think that will do what we want :] [13:48:58] need to check whether the drives ends up being known as /dev/sdc [13:49:03] and it probably need to be formatted first [13:50:36] it pings [13:50:49] can't ssh still [13:52:53] RECOVERY - Host gallium is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [13:53:09] cmjohnson1: services don't seem to go :/ [13:53:50] ah I am on it [13:53:54] hashar: it was still booting up [13:54:26] so the 500GB disk sdb is now known has sdc [13:55:08] i see that and sdb is unk [13:56:14] hashar:have you tried to restart any services again? 
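The disk hunt above (dmesg only showing sda and sdb, fdisk -l unchanged, the SATA port disabled in the BIOS) can be reproduced with a couple of read-only commands; a sketch, run as root on the affected host:

    # did the kernel enumerate a new block device?
    dmesg | grep -E 'sd[a-z]'
    # list partition tables for every disk the kernel knows about
    fdisk -l
    # per-port SATA link state; "SATA link down" usually means an empty or BIOS-disabled port
    dmesg | grep -i 'sata link'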
[13:56:25] nop [13:56:35] though they start on boot [13:57:32] New patchset: Hashar; "gallium: SSD drive using XFS" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58887 [13:57:46] New review: Hashar; "SSD is /dev/sdb" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58887 [13:57:59] cmjohnson1: anything to check ? [13:58:37] i don't think so [13:59:23] PROBLEM - zuul_service_running on gallium is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/local/bin/zuul-server [13:59:29] so should I just create a primary partition on /dev/sdb using fdisk then format it with mkfs.xfs ? [14:00:42] correct [14:01:16] do you know where you are mounting? [14:02:03] /srv/ssd if that works for you https://gerrit.wikimedia.org/r/58887 [14:02:08] mark told me to not use /a [14:02:23] that works [14:02:26] proposed somewhere under /var/lib/jenkins but that disk is going to be used by other services beside jenkins [14:02:40] Hey, the dude in Tampa is Steve, right? Does he hang around IRC? [14:02:50] coren: he does but not till later [14:02:59] usually after lunch [14:03:11] something broke? [14:03:17] !log gallium: created primary partition on /dev/sdb . formatting using: mkfs.xfs /dev/sdb1 [14:03:24] Logged the message, Master [14:04:04] New patchset: Hashar; "gallium: SSD drive using XFS" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58887 [14:04:16] New review: Hashar; "/dev/sdb1" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58887 [14:04:24] Coren: cmjohnson1: could get https://gerrit.wikimedia.org/r/#/c/58887/ merged [14:04:25] cmjohnson1: I need him to play with shelf cabling in Tampa like you did in eqiad; it's a blocker for me, so I want to pounce on him the minute he gers here. :-) [14:04:31] that would let puppet mount the SSD disk on gallium :-] [14:06:55] Change merged: Cmjohnson; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58887 [14:06:56] hashar: "nobarrier"? You sure? If you don't have battery backup that can lead to data loss. [14:07:01] oh [14:07:13] hashar..sorry meant to plus 1 [14:07:17] not 2 [14:07:22] too late I guess hehe [14:07:37] Coren: how can we check whether there is battery backup? [14:07:59] since that is a working space, I think we can afford to loose data on that disk [14:08:06] hashar: Unless it's on a raid controller that has battery backed up cache, it doesn't have any. :-) [14:08:28] hashar: I'd turn off nobarrier anyways; there is not much to gain from it on a medium where writes are slow. [14:08:28] no battery backed controller on that server or any of the r410's for that matter [14:08:52] Coren: that is a SSD so supposed to be fast [14:09:16] For various values of "fast". If it doesn't have writeback ram, it's slow. :-) [14:09:35] Not as slow as a physical drive, but slow nonetheless. :-) [14:09:49] Well, caveat configurator. :-) [14:10:54] Just be aware that nobarrier looses some data coherency on XFS. [14:12:03] well we have to talk about it with mark because I barely knows how disk works :/ [14:14:39] In /my/ days, we had to wire wrap our own ram from scrap ferrite cores and hand-carved transistors. :-) [14:15:11] We didn't have fancy shmancy SATA drives, we just had a box with a scribe and a long scroll. [14:16:59] Poor scribes then got allmost all got replaced with monkeys. They had a higher error rate but cheaper (only needed bananas); and we only kelp a few scribes for ECC instead. 
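Back to the disk work: the steps hashar logs above (primary partition with fdisk, mkfs.xfs, then the puppet-managed mount from change 58887) amount to roughly the following, assuming the SSD really is /dev/sdb and that the disk is empty:

    # create one primary partition covering the whole disk (fdisk prompts: n, p, 1, defaults, w)
    fdisk /dev/sdb
    # format the new partition as XFS
    mkfs.xfs /dev/sdb1
    # mount it where the puppet change expects it, with the agreed options
    mkdir -p /srv/ssd
    mount -o noatime,nobarrier,logbufs=8 /dev/sdb1 /srv/ssd

In production the mount itself is left to puppet once https://gerrit.wikimedia.org/r/58887 is merged; the manual mount is only useful for checking the disk before the puppet run.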
[14:18:36] Legacy raid array: http://farm4.staticflickr.com/3019/2943407232_26c9510c20.jpg [14:19:31] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [14:20:18] Coren: only had to worry about dinosaurs destroying the raids back then? [14:20:19] i wonder if you could hot swap with that array? [14:20:36] !log restarting jenkins [14:20:43] Logged the message, Master [14:22:11] cmjohnson1: That depends. Ostensibly yes, but the controllers got all pissy when drives were mounted. :-) [14:22:28] hahaha [14:23:09] You can see from that picture that this was a high-end array: there are two parallel controllers. :-) [14:24:33] New patchset: Demon; "Updating for 2.6-rc0-322-geeed497" [operations/debs/gerrit] (master) - https://gerrit.wikimedia.org/r/58891 [14:29:15] New patchset: Hashar; "contint: jenkins master role + ssd directories" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58892 [14:29:48] pooor jenkins is restarting :D [14:30:53] cmjohnson1: Coren: I got to document the new Jenkins directories that lands on the SSD. https://gerrit.wikimedia.org/r/58892 that puppet validate :-] [14:30:56] Susan: never heard of analinterns before? [14:32:31] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [14:34:14] New review: coren; "lgm" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/58892 [14:35:06] Hm. Circular. Jenkins won't Verify the patch since it's down; the patch is needed to bring Jenkins back. :-) [14:35:33] New review: coren; "Jenkins can't verify the patch meant to bring it back up, can it? :-)" [operations/puppet] (production); V: 2 - https://gerrit.wikimedia.org/r/58892 [14:35:33] Change merged: coren; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58892 [14:39:02] PROBLEM - Host analytics1007 is DOWN: PING CRITICAL - Packet loss = 100% [14:40:49] Coren: that is true, sorry :( [14:40:54] it is still starting up [14:44:08] New patchset: Hashar; "zuul: support specifying the git directory" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58898 [14:44:08] New patchset: Hashar; "zuul: migrate git dir in production to the ssd" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58899 [14:44:11] RECOVERY - Host analytics1007 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [14:44:16] !log analytics1007 has never worked..but working on it [14:44:23] Logged the message, Master [14:44:24] I think I will avoid migrating zuul today :-D [14:44:42] it is friday! ..not a good idea ;-) [14:45:23] cmjohnson1: I think you can close the RT ticket I opened to add the SSD https://rt.wikimedia.org/Ticket/Display.html?id=4916 [14:45:31] the ops part is finished ;-] [14:45:38] Is the output of wfDebug() from our production systems viewable somewhere ? [14:45:39] okay..cool [14:45:45] thank you for the new disk \O/ [14:45:52] xyzram: it is not [14:45:57] xyzram: we use wfDebugLogGroup() [14:46:35] <^demon> I thought wfDebugLog() was sufficient. [14:46:36] I have a problem that is only showing up on production. [14:47:07] ah wfDebugLog might end up in the catchall log [14:47:24] So I'd like to see what MW is sending as a search request to lsearchd. 
[14:47:34] hm [14:47:41] can't remember the function name [14:48:13] so yeah wfDebugLog( 'somegroup', message ); the 'somegroup' should be defined in $wgDebugLogGroups or that will fallback to wfDebug() [14:48:21] That info is logged like this: wfDebug( "Fetching search data from $searchUrl\n" ); [14:48:58] hashar: So the only way is to add custom logging and deploy to production ? [14:49:13] Coren: can you merge on sockpuppet a0757126366 / https://gerrit.wikimedia.org/r/#/c/58892/ ? [14:49:19] Coren: does not seem to have landed [14:49:39] <^demon> xyzram: So, swap wfDebug( $foobar ) for wfDebugLog( 'groupname', $foobar ); [14:49:50] <^demon> Then just have to make sure 'groupname' is defined in $wgDebugLogGroups [14:50:09] where does the info end up ? [14:50:14] <^demon> fluorine [14:50:21] <^demon> In /a/mw-log/* [14:50:37] hashar: Ahcrap; forgot to forward my key [14:51:12] hashar: is why it didn't work. I never found out how to fix this without another merge. [14:51:20] ^demon: the might jenkins loves the build histories :( http://dpaste.com/1056027/ [14:51:34] ^demon: it passes most of the startup time reading a lll the old build history files [14:51:56] <^demon> Yeah. [14:52:12] that should be async [14:52:21] <^demon> You'd think. [14:52:39] ^demon: Ok, thanks. So I would still need to change code and have ops redeploy, right ? This might generate a lot of output, so it may have to be removed once I have the data I'm looking for. [14:52:39] hashar: Do you know how to fix that? The merge failed to rsync because of no ssh key. [14:52:48] <^demon> hashar: Why on earth it needs to load build $n of $randomProject at startup is beyond me. [14:52:53] <^demon> Should load them on demand.' [14:53:20] <^demon> xyzram: No, you or I could do it. $wgDebugLogGroups is defined in wmf-config. [14:53:29] <^demon> So just need the change to MWSearch and that. [14:53:32] ^demon: will fill a bug I guess [14:53:47] !log jenkins is back up. Starting Zuul [14:53:53] Logged the message, Master [14:53:56] <^demon> Or load the latest $n builds. [14:53:58] <^demon> Not all. [14:54:01] <^demon> Anyway [14:54:49] Coren: I have no idea :/ can't you rsync manually ? [14:55:01] RECOVERY - zuul_service_running on gallium is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/local/bin/zuul-server [14:57:08] hashar: It's not clear what is rsync'ed where; it's a hook done automagically during merge. [14:57:23] hashar: Not a big issue, though, since the next merge will push it. [14:57:37] New patchset: Ottomata; "Refactoring puppetmaster::self to allow for puppet clients." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58540 [14:58:12] I hate jenkins [14:58:50] ^demon: How do you deploy those changes to production ? I thought ops folks do that ... [14:59:04] <^demon> ops can, so can mortal shell users. [14:59:07] <^demon> I can do it [14:59:50] ^demon: Ok, I'll make the change and push to gerrit. [15:00:02] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [15:00:56] !log Changing Jenkins workspace directories from ${ITEM_ROOTDIR}/workspace to /srv/ssd/jenkins/workspace/${ITEM_FULLNAME} . My change has not been taken in account early on :( [15:01:02] Logged the message, Master [15:01:49] New patchset: Demon; "Adding mwsearch debug log group" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/58901 [15:02:11] ^demon: I changed the Jenkins workspaces to point to the new SSD . 
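A sketch of the logging change ^demon and hashar describe: in MWSearch, swap the wfDebug() call for wfDebugLog() with a named group, and register that group in wmf-config so the messages land in their own file under /a/mw-log on fluorine. The group name matches what gets deployed a few minutes later; the udp:// destination is a placeholder, not the real collector address.

    // MWSearch_body.php: log to a dedicated group instead of the catch-all debug log
    wfDebugLog( 'mwsearch', "Fetching search data from $searchUrl" );

    // wmf-config/InitialiseSettings.php: route the 'mwsearch' group to the log collector
    // (placeholder address; note the s/upd/udp/ typo fixed in change 58902)
    $wgDebugLogGroups['mwsearch'] = 'udp://logcollector.example:8420/mwsearch';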
[15:02:22] !log demon synchronized wmf-config/InitialiseSettings.php 'Debug group for mwsearch' [15:02:27] <^demon> xyzram: Just needs the change to MWSearch [15:02:28] Logged the message, Master [15:02:30] <^demon> I took care of wmf-config. [15:03:18] Change merged: Demon; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/58901 [15:03:32] ^demon: Quick question: When I run a search query directly on search1015 using: wget -O ukraine "http://search1015:8123/search/uawikimedia/%D0%A3%D0%BA%D1%80%D0%B0%D1%97%D0%BD%D0%B0?namespaces=0&offset=0&limit=20&version=2.1" [15:03:45] I get 8 results. [15:04:26] But from the web page, none. I thought wgDBname might be wrong but that is correct; any thoughts on what might be the issue ? [15:04:37] !log demon synchronized wmf-config/InitialiseSettings.php 'Debug group for mwsearch' [15:04:44] Logged the message, Master [15:04:47] New patchset: Demon; "s/upd/udp/" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/58902 [15:05:02] Change merged: Demon; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/58902 [15:05:31] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [15:05:47] sbernardin: are you in the data center? [15:06:21] PROBLEM - Host analytics1007 is DOWN: PING CRITICAL - Packet loss = 100% [15:07:12] ^demon: The change you just committed has "upd:" rather than "udp:" [15:07:17] <^demon> I fixed it. [15:07:21] <^demon> In the followup. [15:07:30] Oh, ok. [15:09:19] <^demon> I'm going to step out to grab lunch really quick. I'll deploy the mwsearch change as soon as I'm back. [15:09:26] <^demon> Should be back in ~15. [15:09:59] ok, I'm taking a break too, back in about an hour. [15:13:57] New patchset: Ottomata; "Refactoring puppetmaster::self to allow for puppet clients." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58540 [15:14:05] cmjohnson1: not yet [15:15:07] ping me when you get there....I believe there is a packet of information related to the cisco servers there. May be on the clipboard on the shelves or in the cabinet. I need you to look for it [15:15:07] please [15:15:21] and coren has something that is pretty important that needs to be done. thx [15:15:56] sbernardin = Steve@pmtpa? [15:16:38] Coren: yes [15:17:03] sbernardin: Well met. Please do ping me when your location is coincident with the DC's. :-) [15:19:45] Coren: will do [15:27:41] New patchset: Diederik; "Disable Fabian Kaelin's account, he no longer works with us." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58904 [15:28:08] New patchset: Ottomata; "Refactoring puppetmaster::self to allow for puppet clients." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58540 [15:28:24] cmjohnson1: thanks for the sfp's :) [15:28:33] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58904 [15:28:39] yw lesliecarr..did it work? [15:29:00] no, though i expected it to not work :) [15:29:14] i have a feeling that switch 1 in the stack has some sort of issue [15:29:29] are the fibers long enough that you could easily move the uplink module to switch 2 ? [15:29:35] in row c ? 
[15:29:52] lemme see [15:31:00] there is only 1 switch in c1 [15:31:22] oh i meant ot c2 [15:31:33] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [15:32:58] lesliecarr: i think there is enough slack to make that work [15:33:09] but I don't have anymore uplink modules atm [15:33:24] iirc ..robh ordered some or will be ordering [15:33:31] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [15:33:38] ? [15:33:49] okay [15:34:00] cmjohnson1: ordering....? ex4200 sfp modules? [15:34:07] well we can move the one from c1 if possible, because it's not working in c1 [15:34:08] :) [15:34:14] (cuz if thats it, it was ordered already and will arrive today) [15:34:29] arrive today ? damn! [15:34:30] super fast [15:35:07] robh: cool [15:35:32] RobH: Chris has setup the SSD in gallium successfully :-] [15:35:56] yep, was reading backread, faster? =] [15:36:01] lesliecarr: what is in c 1/2/0 [15:36:23] the uplink to cr2 [15:36:37] neither one is working, i wanted to keep the second link unchanged as a control [15:37:01] oh...well that makes it easier than..i can move it to c2 if you want [15:37:56] :) [15:37:59] awesome thank you [15:38:01] want me to drop a ticket ? [15:38:31] no..i will do it now [15:43:31] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [15:47:10] leslecarr moved to c2-asw [15:49:13] hashar / cmjohnson1 Thanks for working on zuul stuff today =] [15:49:22] the entire engineering department thanks you. [15:49:31] well [15:49:37] it is not going to suddenly become THHHHAT Fast :D [15:49:44] but I guess that will help a bit [15:49:44] thanks cmjohnson1 [15:49:46] woot [15:50:01] hashar: it isnt a magic bullet, but the wait io during lagged times on gallium was insane. [15:50:09] yup [15:50:17] I noticed that after we upgraded to Precise [15:50:31] some job suddenly doubled it is time [15:50:35] 30 sec -> 1 min [15:51:48] New review: Hashar; "Beware, this need a manual operation on the gallium server." [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/58899 [15:52:18] robh: I got a bunch of small Zuul changes to review / merge if you are up for some duty :-] [15:54:09] New patchset: Ottomata; "Refactoring puppetmaster::self to allow for puppet clients." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58540 [15:55:01] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [15:55:01] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [15:55:01] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [15:55:25] oh RobH i think we do need a spare 4200 now … though i may pack up and send the one from ulsfo out ? [15:58:58] robh: if you didn't see update on ticket...camera mounts go in on Monday [15:59:34] New patchset: Hashar; "icinga: fix jenkins monitoring" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58906 [15:59:39] !log adding another link to asw-c-eqiad's uplink [15:59:46] Logged the message, Mistress of the network gear. 
[15:59:52] New review: Hashar; "I have broken icinga check_jenkins test :-D Follow up with https://gerrit.wikimedia.org/r/58906" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58489 [16:00:06] LeslieCarr: check_jenkins is broken on icinga, https://gerrit.wikimedia.org/r/58906 should fix it :-] [16:00:40] hehe [16:00:43] good catch [16:01:14] do we have a disk checking plugin ? [16:01:26] gallium does not have any https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=gallium :-D [16:01:31] <^demon> LeslieCarr: Can you look at https://gerrit.wikimedia.org/r/#/c/58692/? hashar will love you. [16:01:47] <^demon> (I will too, fwiw) [16:02:02] * hashar order beers at the tap [16:03:12] haha [16:03:19] ok i am technically on vaation so that is the last work i do [16:03:32] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [16:03:45] New patchset: Lcarr; "Allow jenkins admins to manage replicated git repos" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58692 [16:03:57] why did we switch to having to rebase every change ? [16:04:12] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58692 [16:04:27] rebasing is evil [16:04:31] ;D [16:04:42] hashar...i don't think it matters because there is no raid installed on gallium [16:04:42] poor ops :( [16:04:51] <^demon> greg-g: aw, why? [16:04:53] cmjohnson1: there is a software raid though [16:05:02] okay, done, merged [16:05:05] and now i go away! [16:05:08] <^demon> LeslieCarr: Thank you! [16:05:09] :( [16:05:14] bye [16:05:25] ^demon: running puppet on gallium [16:05:34] ^demon: history man, history. merging is the way. :) [16:06:12] <^demon> Well, that's why for super active repos it's best to have merge-if-necessary, not cherry-pick or ff-only. [16:06:42] <^demon> If you're working on a repo with few authors or changes infrequently, a linear history is sometimes nicer. [16:07:21] why not cherry-pick ? [16:07:32] ^demon: didn't you just do ff-only? (or planning to?) [16:07:36] change do not need to be rebased and you maintain the linear history [16:07:44] (drawback: commit sha1 changes) [16:08:07] <^demon> sha1 changes makes it a dealbreaker imho. [16:08:24] <^demon> greg-g: It was enabled for operations/puppet, but was changed back because everyone's tired of clicking rebase. [16:09:26] heh [16:12:56] <^demon> The newest jgit now supports recursive merges. [16:13:09] <^demon> That might be nice to enable on something like core where we get lots of conflicts. [16:14:01] cmjohnson1: good news on cameras, thanks for workign on it [16:14:22] i had to go through our rep to get security to approve [16:14:23] ^demon: would that help reducing conflicts on release notes? [16:14:33] <^demon> potentially. [16:14:37] aww [16:14:56] <^demon> It still can't do an octopus merge, but it should at least solve some cases that currently conflict. [16:15:17] like adding a new bullet to a section? [16:15:42] (in the same place where everyone adds it i.e. after the last bullet in the section) [16:16:22] <^demon> Like adding to a different section in the same file. [16:16:29] <^demon> That currently conflicts when it never would locally. [16:17:45] <^demon> But yeah, "This only affects content merges, and is off by default as the [16:17:45] <^demon> upstream implementation is experimental and may still be buggy." [16:17:55] <^demon> So, prolly not best to enable just yet ;-) [16:22:14] New review: Ottomata; "Ok good catch! 
I fixed a couple of things:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58540 [16:26:51] off, might be there tonight *wave* [16:27:49] bye hashar [16:28:01] cmjohnson1: and thx again for the SSD :] [16:39:52] PROBLEM - Puppet freshness on virt1005 is CRITICAL: No successful Puppet run in the last 10 hours [16:57:32] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [17:02:32] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [17:05:15] PROBLEM - Host mw27 is DOWN: PING CRITICAL - Packet loss = 100% [17:06:35] RECOVERY - Host mw27 is UP: PING OK - Packet loss = 0%, RTA = 26.57 ms [17:09:25] PROBLEM - Apache HTTP on mw27 is CRITICAL: Connection refused [17:10:32] New patchset: Anomie; "Preserve timestamps when copying l10n cdb files" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58910 [17:11:07] New patchset: Anomie; "l10nupdate: Use refreshMessageBlobs.php instead of clearMessageBlobs.php" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58911 [17:23:11] New review: coren; "Duplicate parameter 'require' for on File[/etc/puppet/puppet.conf.d/10-self.conf] at /etc/puppet/man..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58540 [17:25:07] New review: coren; "(1 comment)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58540 [17:29:58] ottomata: Small error in your .pp; I'll test your next patchset if you fix it. [17:30:21] oop [17:30:49] ah! missed that danke [17:32:28] New patchset: Ottomata; "Refactoring puppetmaster::self to allow for puppet clients." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58540 [17:32:32] Coren ^ there we go. [17:42:15] New review: coren; "New class looks good (untested)." [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/58540 [17:43:04] ottomata: LGM. Merge? [17:43:52] cool! danke [17:43:57] i can do. [17:44:07] Have fun. [17:44:09] oh, i wanted to ask, should this be a main puppet group option? [17:44:18] right now you have to add it to your project's puppet group's manually [17:44:23] maybe after we test it out for a while? [17:44:27] Probably, once you played with it in its "proper" usage for a while. [17:44:30] yeah [17:44:30] k [17:44:32] cool [17:44:37] danke [17:46:57] ^demon: https://gerrit.wikimedia.org/r/58915 has the change to MWSearch [17:48:31] so, how brave am I feeling today? [17:48:39] thinking about upgrading the primary LDAP server :D [17:49:55] * Coren trembles in fear. [17:50:14] the secondary has been running the update for a week ;) [17:50:21] I'd be hard to break more things without being Leslie. :-) [17:51:46] <^demon> xyzram: 58921 ^ 58921 for updating deployment. [17:52:06] gonna do it :) [17:52:18] let's see what breaks when I stop the ldap server [17:52:35] it should all failover [17:52:50] New patchset: Andrew Bogott; "Marked a couple of bot manifests as deprecated." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/58922 [17:54:05] looks like everything fails over to me [17:54:30] PROBLEM - LDAP on virt0 is CRITICAL: Connection refused [17:54:40] PROBLEM - LDAPS on virt0 is CRITICAL: Connection refused [17:55:22] I say that [17:55:27] php doesn't seem to want to failover [17:56:02] ah ha [17:56:04] it's keystone [17:57:30] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [17:59:06] I need to add a secondary to keystone's config [17:59:29] anyway, time to upgrade :) [17:59:35] !log upgrade opendj on virt0 [17:59:44] Logged the message, Master [18:01:00] <^demon> xyzram: If those look fine to you, I'll go ahead and merge and get those live. [18:02:30] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [18:06:16] ^demon: what does "58921 ^ 58921" mean ? That change shows no changes at all ... [18:06:45] <^demon> Whoops, means & [18:07:00] <^demon> I meant 58920 and 58921 [18:07:15] <^demon> https://gerrit.wikimedia.org/r/#/c/58920/ and https://gerrit.wikimedia.org/r/#/c/58921/ [18:07:25] opsen: do you know if the people on ops@ is a subset of the people on engineering@ ? [18:07:31] 58921 shows +0 -0 [18:08:31] greg-g: I'm not sure it's a strict subset. [18:09:03] Coren: worth me cc'ing both engineering and ops with the WMF Deploy highlights email, then, you think? [18:09:24] ^demon: I'm seeing no changes in both. [18:09:58] greg-g: I think that the presumption that people who are on engineering that would care about the deploys are also on ops@ is reasonable; but I'm a relative newbie here so don't take my word for it. [18:10:30] <^demon> xyzram: -Subproject commit d714af877b16ee31306eaf7c711db333b002ff83 [18:10:30] <^demon> +Subproject commit 430196b92ed82d6533ea86cdf568a23eda38eec0 [18:10:35] I'll just do both and get yelled at if needed ;) [18:11:47] PROBLEM - LDAP on virt1000 is CRITICAL: Connection refused [18:11:54] -_- [18:11:57] wtf [18:12:17] now *that* is a problem [18:12:37] PROBLEM - Puppet freshness on db1058 is CRITICAL: No successful Puppet run in the last 10 hours [18:12:47] RECOVERY - LDAP on virt1000 is OK: TCP OK - 0.001 second response time on port 389 [18:13:39] ^demon: Not sure how to view those subproject commits on that page ... [18:14:51] <^demon> Don't worry about it, I went ahead and merged. [18:14:54] <^demon> Deploying in a second. [18:15:23] New patchset: Aaron Schulz; "Set pmtpa queue config to match eqiad." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/58953 [18:16:03] !log demon synchronized php-1.21wmf12/extensions/MWSearch/MWSearch_body.php 'Moar logging' [18:16:10] Logged the message, Master [18:16:31] !log demon synchronized php-1.22wmf1/extensions/MWSearch/MWSearch_body.php 'Moar logging' [18:16:38] <^demon> xyzram: Logging patches all live. [18:16:39] Logged the message, Master [18:16:40] New patchset: Aaron Schulz; "Set pmtpa queue config to match eqiad." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/58953 [18:18:17] ^demon: thanks. [18:18:34] <^demon> yw [18:24:09] ^demon: mysteriously, search suddenly seems to work on ua.wikemedia.org ! [18:24:28] <^demon> Magic ;-) [18:24:45] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/58953 [18:25:03] I just entered "Україна" and got back a bunch of results. 
[18:25:25] !log aaron synchronized wmf-config/jobqueue-pmtpa.php 'Set pmtpa queue config to match eqiad' [18:25:32] Logged the message, Master [18:25:38] ^demon: Not sure if I should be elated or depressed :-) [18:25:40] xyzram: nice to hear search works.. cool [18:26:20] Can anyone else confirm that ? Just go to ua.wikimedia.org and enter "Україна" in the search box ... [18:27:12] <^demon> Yupp, I got 6 results. [18:27:27] Результати 1 — 8 з 8 для Україна [18:27:36] 8 for me it seems [18:28:18] eh, well 6 in the list but says 1 -8 on top [18:28:32] <^demon> Ah, I had some namespace filtering on. [18:28:33] This was failing all week ... mighty strange. [18:28:36] <^demon> 8 if I enable all ns's. [18:29:01] <^demon> Heh, 8, but says out of 12. [18:29:06] <^demon> Maybe we have an off by 4 bug. [18:30:18] That log file is growing really quickly -- 198785054 bytes so far ... [18:30:44] xyzram: [OK] uawikimedia 2013-04-12 07:39:10 [18:30:48] Do we have limits on the size of those files ? I'm worried it might fill up disk. [18:31:06] mutante: where is that from ? [18:31:09] xyzram: that is from getting the lucene status page from a random search node. (search1015) [18:31:17] wget http://search1015.eqiad.wmnet:8123/statu [18:31:21] status [18:31:31] Ok [18:31:56] it used to say failed there a couple days ago [18:32:13] Oh, that's interesting. [18:32:25] what i did is the same the search monitoring check does [18:32:39] uses check_http to get that status page and check for string FAILED [18:33:21] we still have these failing now: bat_smgwiki, fiu_vrowiki, and 3 zh variants [18:33:30] but that is less than it was [18:33:42] mutante: sorry, what's your real name ? [18:33:49] <^demon> xyzram: We can cut the log off quickly if we don't need it anymore. [18:34:05] xyzram: Daniel Zahn [18:34:49] ^demon: Let me try a couple more tests on the Javanese wiki which was reporting a similar problem; then we can cut it off. [18:35:05] https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=search1015&service=search+indices+-+check+lucene+status+page [18:35:10] <^demon> okie dokie [18:35:14] that check isnt broken on all nodes though, just a few [18:35:28] mutante: Are you in SF ? [18:35:29] e.g. https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=search1017&service=search+indices+-+check+lucene+status+page [18:35:33] xyzram: yes [18:37:05] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58540 [18:41:28] mutante: search 15 through 22 are in pool 4 and have been the source of many issues. [18:41:54] Surprising that 15 shows a problem but 16 does not -- they should be identical. [18:43:10] hmm, yeah, pool 4 is the "everything else" pool, afaik, so there are all the small wikis in it , but a lot of them [18:47:12] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [18:57:10] New patchset: Odder; "(bug 47166) Enable Extension:Collection on sh.wikipedia" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/58986 [19:03:12] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [19:19:07] ^demon: can you turn off the mwsearch log ? 
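What mutante describes above, both the manual wget and the icinga check, boils down to fetching the per-node lucene status page and looking for FAILED entries; a rough command-line equivalent, with the host name taken from the conversation:

    # show any failed indices reported by one search node
    wget -q -O - http://search1015.eqiad.wmnet:8123/status | grep -i 'failed' || echo 'no failed indices reported'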
[19:20:04] <^demon> Yeah, on it [19:22:18] New patchset: Demon; "Disable mwsearch log" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/58989 [19:22:45] !log demon synchronized wmf-config/InitialiseSettings.php 'Disable mwsearch log' [19:23:00] Change merged: Demon; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/58989 [19:28:15] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [19:32:29] New patchset: Ori.livneh; "Add 'Contact Wikipedia' footer link on enwiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57649 [19:33:15] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [19:37:37] New review: Ori.livneh; "Agree with Kaldari re: scalability, but also think a config variable is just nicer, especially if th..." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57649 [19:48:09] https://bugzilla.wikimedia.org/quips.cgi?action=show [19:48:35] quips: one more reason to use bugzilla 8-] [19:48:44] :) [19:48:57] add some new ones:p [19:50:05] RECOVERY - Host labstore3 is UP: PING OK - Packet loss = 0%, RTA = 26.72 ms [19:50:15] PROBLEM - SSH on caesium is CRITICAL: Server answer: [19:51:15] RECOVERY - SSH on caesium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [19:52:05] PROBLEM - SSH on labstore3 is CRITICAL: Connection refused [19:52:35] PROBLEM - DPKG on labstore3 is CRITICAL: Connection refused by host [19:52:55] PROBLEM - Disk space on labstore3 is CRITICAL: Connection refused by host [19:52:55] PROBLEM - RAID on labstore3 is CRITICAL: Connection refused by host [19:55:05] RECOVERY - LDAP on virt0 is OK: TCP OK - 0.027 second response time on port 389 [19:55:35] RECOVERY - LDAPS on virt0 is OK: TCP OK - 0.027 second response time on port 636 [19:57:21] !log finished opendj upgrade on virt0 [19:57:36] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58859 [20:01:55] PROBLEM - LDAPS on virt1000 is CRITICAL: Connection refused [20:03:05] New patchset: Ori.livneh; "Update comment re: usage of UserDailyContribs" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/58999 [20:03:16] PROBLEM - LDAP on virt1000 is CRITICAL: Connection refused [20:04:05] PROBLEM - NTP on labstore3 is CRITICAL: NTP CRITICAL: No response from NTP server [20:04:06] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/58999 [20:06:55] hashar, have you every used the dh override stuff? [20:06:59] ever* [20:07:10] yup once I think [20:07:26] like dh_clean is called before building the package (I think) [20:07:31] it comes with some conventions [20:07:42] if you want to add some more cleaning task you can create a target dh_clean_override [20:07:46] put your specific stuff in it [20:07:50] override_dh_clean [20:07:50] then call again dh_clean [20:07:51] ? [20:07:54] i think so [20:08:05] yeah, makes sense, i think i got clean to work [20:08:12] but what if I need custom build/install steps? [20:08:15] that's supposed to work too, right? [20:08:41] our varnish package has a lot of dh_override :D [20:08:46] a simple one is in python-statsd [20:08:52] RECOVERY - LDAPS on virt1000 is OK: TCP OK - 0.000 second response time on port 636 [20:08:59] hm. 
not sure I like that opendj on virt1000 is crashing [20:09:12] RECOVERY - LDAP on virt1000 is OK: TCP OK - 0.001 second response time on port 389 [20:10:14] hmmk will look, tanks [20:12:14] ah [20:12:17] too many open files error [20:14:39] New patchset: Dzahn; "add iptables to deny access to NRPE on hosts with public IP except for our own networks" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/59001 [20:17:24] New review: Dzahn; "apply just on a single host (gallium) before requiring this in nrpe itself (on all hosts using it). " [operations/puppet] (production) - https://gerrit.wikimedia.org/r/59001 [20:19:02] RECOVERY - SSH on labstore3 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [20:20:12] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [20:22:05] New review: Demon; "War available at https://integration.wikimedia.org/nightly/gerrit/wmf/gerrit-2.6-rc0-322-geeed497.war" [operations/debs/gerrit] (master) C: 2; - https://gerrit.wikimedia.org/r/58891 [20:24:12] PROBLEM - Puppet freshness on cp3003 is CRITICAL: No successful Puppet run in the last 10 hours [20:24:28] New review: Dzahn; "don't merge, needs to allow more stuff from public (80/443), and check netstat -tulpen on gallium" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/59001 [20:24:46] Change merged: Ryan Lane; [operations/debs/gerrit] (master) - https://gerrit.wikimedia.org/r/58891 [20:27:04] mark, are you about? [20:30:02] RECOVERY - Puppet freshness on labstore3 is OK: puppet ran at Fri Apr 12 20:29:52 UTC 2013 [20:30:19] AH, hashar! if figured out my problem [20:30:24] i thought i was overriding all wrong [20:30:24] but [20:30:27] i have local changes, so I was doing [20:30:32] git-buildpackage --git-ignore-new [20:30:47] but since i an also using git-export-dir=../build-area (in gbp.conf) [20:30:51] it was exporting HEAD and building from there [20:30:55] NOT my local working directory [20:31:00] so my changes to rules were ignored [20:32:12] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [20:32:21] hmm, is https://lists.wikimedia.org// supposed to give me a 'default' http server page? [20:32:23] notice the double // [20:32:29] no bigie, just wondering [20:34:44] Fun. :-) [20:34:44] /say /dev/sda /dev/sdab /dev/sdae /dev/sdah /dev/sdb /dev/sdd /dev/sdg /dev/sdj /dev/sdm /dev/sdp /dev/sds /dev/sdv /dev/sdy [20:34:44] /say /dev/sda1 /dev/sdac /dev/sdaf /dev/sdai /dev/sdb1 /dev/sde /dev/sdh /dev/sdk /dev/sdn /dev/sdq /dev/sdt /dev/sdw /dev/sdz [20:34:44] /say /dev/sdaa /dev/sdad /dev/sdag /dev/sdaj /dev/sdc /dev/sdf /dev/sdi /dev/sdl /dev/sdo /dev/sdr /dev/sdu /dev/sdx [20:35:42] RECOVERY - NTP on labstore3 is OK: NTP OK: Offset -0.01021301746 secs [20:39:06] New patchset: J; "dont duplicate wikimedia-task-appserver dependencies" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/59004 [20:39:22] PROBLEM - Host labstore3 is DOWN: PING CRITICAL - Packet loss = 100% [20:40:04] Change abandoned: J; "package installed via wikimedia-task-appserver, removing lilypond in https://gerrit.wikimedia.org/r/..." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/58504 [20:40:12] RECOVERY - Host labstore3 is UP: PING OK - Packet loss = 0%, RTA = 26.55 ms [20:40:32] RECOVERY - Disk space on labstore3 is OK: DISK OK [20:40:32] RECOVERY - DPKG on labstore3 is OK: All packages OK [20:40:52] RECOVERY - RAID on labstore3 is OK: OK: State is Optimal, checked 1 logical device(s) [20:44:02] PROBLEM - RAID on labstore3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:44:52] RECOVERY - RAID on labstore3 is OK: OK: State is Optimal, checked 1 logical device(s) [20:45:59] ottomata: ahhh glad you figured it out :-] [20:46:13] ottomata: I don't think you want to use --git-ignore-new [20:46:19] just for now [20:46:21] as i'm testing it [20:46:28] certainly not for the final version [20:46:34] you can add some files in debian/options/clean or something like that [20:46:36] i just don't want to have to commit every time I make a change [20:47:09] git commit -a --amend -m 'foo' && git buildpackage ? [20:47:26] meh, —ignore-new is good for now [20:47:28] that's what it is for [20:47:48] i think the docs even recommend using that [20:52:33] !log install package upgrades on sanger [20:57:02] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [21:03:02] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [21:04:17] New patchset: Ottomata; "Initial debian packaging using git-buildpackage" [operations/debs/kafka] (master) - https://gerrit.wikimedia.org/r/59008 [21:05:18] New patchset: Ottomata; "Initial debian packaging using git-buildpackage" [operations/debs/kafka] (master) - https://gerrit.wikimedia.org/r/59008 [21:06:07] New patchset: Andrew Bogott; "Mark ldapsupportlib.py 0555." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/59009 [21:14:02] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [21:17:37] Change abandoned: Andrew Bogott; "This does not fix what I thought it fixed" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/59009 [21:33:02] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [21:37:33] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours [21:41:36] New patchset: Aklapper; "Comment out list of urgent issues; not working as expected" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/59014 [21:43:03] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [21:49:03] PROBLEM - RAID on labstore3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:49:53] RECOVERY - RAID on labstore3 is OK: OK: State is Optimal, checked 1 logical device(s) [21:56:00] Ryan_Lane: The icinga raid check is playing *havock* with IO when you are changing the configuration from underneath it. :-) [21:57:11] :D [21:57:23] PROBLEM - Host labstore3 is DOWN: PING CRITICAL - Packet loss = 100% [21:58:03] RECOVERY - Host labstore3 is UP: PING OK - Packet loss = 0%, RTA = 26.93 ms [22:02:03] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [22:04:39] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/59014 [22:07:03] PROBLEM - RAID on labstore3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
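The dh override pattern ottomata and hashar were discussing earlier this afternoon, relevant again now that the kafka packaging patchset has appeared, looks roughly like this in a debhelper 7+ debian/rules; the clean-up path and build command are placeholders, not lifted from the kafka package (recipe lines must use hard tabs in a real rules file):

    #!/usr/bin/make -f
    %:
            dh $@

    # extra cleanup on top of what dh_clean already does
    override_dh_clean:
            rm -rf build-tmp        # placeholder for package-specific build cruft
            dh_clean

    # replace the dh_auto_build default with a custom build step
    override_dh_auto_build:
            ./build.sh              # placeholder build command

As ottomata found out above, when gbp.conf sets an export-dir, git-buildpackage exports HEAD into that directory and builds there, so uncommitted edits to debian/rules are ignored even with --git-ignore-new; the flag only skips the "uncommitted changes" safety check.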
[22:11:53] RECOVERY - RAID on labstore3 is OK: OK: State is Optimal, checked 1 logical device(s)
[22:20:03] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[22:32:03] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa
[22:32:51] New patchset: Krinkle; "noc: Bring back langlist and wikiversions.dat as non-txt" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/59017
[22:33:03] mutante: ^
[22:34:53] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/59017
[22:38:34] mutante: What is the difference between "Reply" and "Comment" on RT?
[22:38:50] I'd like to say that I fixed it, with a link to the change in gerrit
[22:39:38] I clicked both, it is explained in the "To" header then ("Reply to requestor", "Comment (not sent to requestor)")
[22:40:32] !log temp depooling search1015
[22:40:35] Krenair: comment won't email
[22:40:37] err
[22:40:39] damn it
[22:40:41] Krinkle: ^^
[22:40:43] reply will
[22:41:08] Yeah, figured.
[22:41:22] Ryan_Lane: Hmm... Not sure about Purple IRC but Xchat has a feature to change autocompletion order to the last matching username which spoke
[22:41:33] I'm using adium
[22:41:37] which is weird, yes
[22:41:54] Or type 3 (!) characters before pressing tab
[22:42:09] I know, right.
[22:42:14] :)
[22:42:20] too much effort ;)
[22:43:33] If it's any consolation, Krenair|Krinkle happens a lot in pretty much any channel we're both in.
[22:48:19] We live in an environment where names ought to be unique based on the first two case-insensitive characters
[22:48:29] Let's trim our _hash tables ;)
[22:48:56] It's not too bad actually.
[22:49:36] Except for the Ryans, James, Marks and Robs we have.
[22:49:58] And more I'm sure.
[22:51:06] Heh, my dad met someone online last week who literally had the exact same name (first, middle and last). Unrelated for at least 2 generations.
[22:51:27] PROBLEM - RAID on labstore3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[22:52:17] RECOVERY - RAID on labstore3 is OK: OK: State is Optimal, checked 1 logical device(s)
[22:53:55] Ryan_Lane: Can you do another jenkins-bot V only access for me?
[22:53:55] https://gerrit.wikimedia.org/r/#/admin/projects/integration/jenkins-job-builder-config,access
[22:54:19] jenkins-bot is stable for the jjb-config repo as of this week
[22:54:29] I don't have a rights edit interface there
[22:54:42] Interesting
[22:54:55] this one (like the last one) is under integration/*
[22:55:13] yep. no clue on that
[22:55:22] Ryan_Lane, aren't you a gerrit admin?
[22:55:27] PROBLEM - RAID on labstore3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[22:55:27] yes
[22:55:33] this looks like a bug to me
[22:55:35] can't admins edit everything?
[22:55:35] * Ryan_Lane shrugs
[22:55:40] talk to ^demon about it
[22:56:41] Ryan_Lane: Since it is just about the autocomplete not listing Jenkins-Bot (and the UI forcing me to pick from auto complete)
[22:56:50] ... it should work if I edit the repo manually, right?
[22:57:05] It is a repo afaik, not sure if it takes pushes from cli though
[22:57:36] I can edit the rights, it just won't let me add certain user groups
[22:58:17] RECOVERY - RAID on labstore3 is OK: OK: State is Optimal, checked 1 logical device(s)
[22:59:28] error: pathspec 'remotes/gerrit/refs/meta/config' did not match any file(s) known to git.
[22:59:34] meh
[23:00:37] I see ^demon made edits outside the interface (since he provided a commit message)
[23:00:40] I'll ask him later
[23:01:12] k, cya later
[23:02:08] RobH, poke
[23:02:17] ?
[23:02:29] https://rt.wikimedia.org/Ticket/Display.html?id=4940 - if we wanted to add extra domains there, should Victor send it to the same ticket?
[23:02:32] Ryan_Lane: Something f'ed up is going on with check-raid.py. I think it's locking the entire raid controller for several seconds at regular intervals. Maybe it gets confused by the JBOD config?
[23:03:09] maybe so
[23:03:09] Ryan_Lane: Not much point keeping it running since it's a software raid anyways.
[23:03:21] Thehelpfulone: what do you mean add extra? as in other wikipediastories.whatevers?
[23:03:31] Do you know whereabouts in the puppet config I should hunt for this?
[23:04:48] I think it also checks software raids?
[23:05:27] Ryan_Lane: it D hangs in /usr/bin/MegaCli64 so it's the PERC part that goes crunch
[23:05:51] * Ryan_Lane nods
[23:07:05] Ah, ew. It's included in base.pp if $::network_zone == "internal"
[23:08:26] Lemme break check-raid.py locally, see if that stops the freezes
[23:09:05] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[23:09:46] well, we *should* be checking our raids :)
[23:10:27] Yeah, on those boxes it makes no sense to check the PERC though.
[23:10:43] indeed
[23:10:57] And yeah, I just tried it manually. Definitely the MegaCli64 that goes boom. It locks the entire filesystem for 5-10 secs.
[23:13:15] PROBLEM - RAID on labstore3 is CRITICAL: Timeout while attempting connection
[23:13:17] * Coren substituted /usr/bin/MegaCLI64 with a shell script that outputs the same status. :-)
[23:13:37] !log rebooted labstore3 -> hung MegaCLI processes
[23:14:45] PROBLEM - Host labstore3 is DOWN: PING CRITICAL - Packet loss = 100%
[23:15:25] RECOVERY - Host labstore3 is UP: PING OK - Packet loss = 0%, RTA = 26.57 ms
[23:16:15] RECOVERY - RAID on labstore3 is OK: OK: State is Optimal, checked 1 logical device(s)
[23:17:55] PROBLEM - Puppet freshness on ms-fe3001 is CRITICAL: No successful Puppet run in the last 10 hours
[23:21:27] New patchset: Ori.livneh; "Add 'Contact Wikipedia' footer link on test/test2" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57649
[23:22:59] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57649
[23:23:01] Krinkle|detached: comment should just be a comment on the ticket while Reply is also an email to all people added on the ticket
[23:30:39] !log olivneh synchronized wmf-config/InitialiseSettings.php 'Add 'Contact Wikipedia' footer link on test/test2 (Iab1f5f527)'
[23:30:52] !log olivneh synchronized wmf-config/CommonSettings.php 'Add 'Contact Wikipedia' footer link on test/test2 (Iab1f5f527)'
[23:33:05] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa
[23:37:26] PROBLEM - search indices - check lucene status page on search1016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:51:24] LeslieCarr: ping?
[23:52:46] https://bugzilla.wikimedia.org/show_bug.cgi?id=46086 Any shell users able to find this in a log?
[23:55:16] RECOVERY - search indices - check lucene status page on search1016 is OK: HTTP OK: HTTP/1.1 200 OK - 52993 bytes in 0.029 second response time
[23:56:24] Krenair: I don't know how I would go about finding that out, sorry. Not familiar with prod at all yet.
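Regarding Coren's workaround at 23:13 (swapping /usr/bin/MegaCli64 for a shell script so the RAID check stops hanging the PERC controller on a software-RAID box): a sketch of what such a stand-in might look like. The exact output check-raid.py expects is not visible in the log, so the echoed text below is an assumption; the point is only that the stub returns a healthy status instantly without issuing real controller commands.

    #!/bin/sh
    # Hypothetical stand-in for /usr/bin/MegaCli64 on a JBOD/software-RAID host.
    # Always reports an optimal virtual drive and exits 0, so the NRPE RAID
    # check passes without ever touching the controller.
    echo "Virtual Drive: 0 (Target Id: 0)"
    echo "State               : Optimal"
    exit 0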
[23:59:18] Ryan_Lane: Figured it out: `git fetch gerrit refs/meta/config:refs/remotes/gerrit/meta/config` http://www.lowlevelmanager.com/2012/09/modify-submit-type-for-gerrit-project.html
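For anyone retracing Krinkle's refs/meta/config detour: Gerrit keeps per-project access rules in a project.config file on the hidden refs/meta/config ref, which a normal clone does not fetch; that is why the earlier `git checkout remotes/gerrit/refs/meta/config` attempt failed with "pathspec did not match". The usual round trip looks roughly like this (remote name `gerrit`, commit message and group are illustrative):

    # Fetch the hidden config ref into a local tracking ref, then check it out.
    git fetch gerrit refs/meta/config:refs/remotes/gerrit/meta/config
    git checkout refs/remotes/gerrit/meta/config

    # Access rules live in project.config; the "groups" file maps group names
    # to their UUIDs and must list any group referenced in project.config.
    $EDITOR project.config groups

    # Commit and push straight back to the config ref (requires push rights
    # on refs/meta/config for the project).
    git commit -a -m "Grant Verified to jenkins-bot"
    git push gerrit HEAD:refs/meta/config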