[00:07:41] !log DNS update - point old bugzilla3 entry over to actual bugzilla server [00:07:48] Logged the message, Master [00:20:40] New patchset: Dzahn; "remove from singer: account awjrichards, group wikidev, svn client" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58844 [00:22:49] PROBLEM - Puppet freshness on cp3003 is CRITICAL: No successful Puppet run in the last 10 hours [00:32:52] New patchset: Ori.livneh; "Migrate scap-1, scap-2, & sync-common from wikimedia-task-appserver" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57854 [00:36:03] New patchset: Ori.livneh; "Drop scap-1, scap-2, & sync-common scripts; up version to 2.8-1" [operations/debs/wikimedia-task-appserver] (master) - https://gerrit.wikimedia.org/r/58671 [00:37:19] New patchset: Ori.livneh; "Drop scap-1, scap-2, & sync-common scripts; up version to 2.8-1" [operations/debs/wikimedia-task-appserver] (master) - https://gerrit.wikimedia.org/r/58671 [00:38:36] New patchset: Ori.livneh; "Migrate scap-1, scap-2, & sync-common from wikimedia-task-appserver" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57854 [00:40:32] New patchset: Ori.livneh; "Migrate scap-1, scap-2, & sync-common from wikimedia-task-appserver" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57854 [00:42:45] New review: coren; "There is a problem when applying this to extant puppetmaster::self instances; as previously configur..." [operations/puppet] (production) C: -2; - https://gerrit.wikimedia.org/r/58540 [00:44:44] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [01:01:44] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [01:36:04] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours [01:49:44] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 1 process with command name varnishncsa [02:02:44] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [02:08:37] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [02:15:30] !log LocalisationUpdate completed (1.22wmf1) at Fri Apr 12 02:15:30 UTC 2013 [02:15:38] Logged the message, Master [02:26:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:27:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time [02:31:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:32:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [02:32:37] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [02:36:44] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [03:03:44] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [03:16:47] PROBLEM - Puppet freshness on ms-fe3001 is CRITICAL: No successful Puppet run in the last 10 hours [03:19:12] New patchset: Asher; "schema" [operations/software/redactatron] (master) - https://gerrit.wikimedia.org/r/58848 [03:19:38] Change merged: Asher; [operations/software/redactatron] (master) - 
https://gerrit.wikimedia.org/r/58848 [03:57:03] !log csteipp synchronized php-1.22wmf1/extensions/RSS/ [03:57:06] What was that about? [04:09:50] spagewmf, are the whitespace changes in https://gerrit.wikimedia.org/r/#/c/57823/5/languages/messages/MessagesEn.php deliberate? [04:31:36] New patchset: Asher; "review page expected a previously saved review" [operations/software/redactatron] (master) - https://gerrit.wikimedia.org/r/58849 [04:44:25] Change merged: Asher; [operations/software/redactatron] (master) - https://gerrit.wikimedia.org/r/58849 [05:47:54] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:48:44] RECOVERY - Parsoid on wtp1004 is OK: HTTP OK: HTTP/1.1 200 OK - 1368 bytes in 2.127 second response time [05:49:44] RECOVERY - RAID on searchidx1001 is OK: OK: State is Optimal, checked 4 logical device(s) [05:51:54] PROBLEM - Parsoid on wtp1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:54:34] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [05:54:34] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [05:54:34] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [06:34:06] New patchset: Tim Starling; "Update documentation for /root/.ssh/authorized_keys" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58658 [06:34:12] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58658 [06:37:59] New patchset: Hashar; "zuul: in labs use the `labs` branch to install Zuul" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58738 [06:38:00] New patchset: Hashar; "zuul: no fetch from pypi and drop statsd dependency" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58827 [06:38:00] New patchset: Hashar; "zuul: support cloning from a different branch" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58737 [06:39:21] PROBLEM - Puppet freshness on virt1005 is CRITICAL: No successful Puppet run in the last 10 hours [06:48:03] New patchset: Hashar; "beta: fix sudo rights for mwdeploy user" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58859 [06:52:14] analinterns, heh. [07:02:24] New review: Nemo bis; "It *will* be a problem for humans if you don't apply the restriction to bot group only... It's easy ..." [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/58709 [07:07:39] New review: Legoktm; "All flagged bots have the "noratelimit" right, so won't that exempt them from this anyways?" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/58709 [07:08:54] New review: MZMcBride; "What would you restrict non-bots to? I think restricting bots but not restricting everyone else woul..." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/58709 [07:10:49] New patchset: MZMcBride; "Set rate limit for editing on Wikidata" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/58709 [08:12:19] PROBLEM - Puppet freshness on db1058 is CRITICAL: No successful Puppet run in the last 10 hours [08:21:43] New review: Daniel Kinzler; "@Nemo said:" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/58709 [08:25:05] New review: Peachey88; "Yes, It's very easy, I used to rack changes up in different tabs then save them one after the other." 
[operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/58709 [08:26:39] New review: Nemo bis; "Yes, really, even without scripts. And Wikidata has all those JS features that make editing very fast." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/58709 [08:26:41] New review: Daniel Kinzler; "I just checked and found that rate limits are not enforced by the Wikibase API at all. Filed as bug ..." [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/58709 [08:27:22] New review: Legoktm; "According to https://www.wikidata.org/w/index.php?title=User_talk:Docu&oldid=23761249#Edits its easi..." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/58709 [09:13:11] New review: Daniel Kinzler; "@Legoktm allowing that would defeat the purpose of having the limit in the first place: avoiding ver..." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/58709 [09:39:05] !log jenkins: updated mediawiki-core-whitespaces job to use ZUUL_COMMIT as a refspec specifier (for {{bug| 46723}} ) [09:39:12] Logged the message, Master [09:44:48] Change abandoned: Aude; "(no reason)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/58709 [10:13:25] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:14:15] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time [10:23:45] PROBLEM - Puppet freshness on cp3003 is CRITICAL: No successful Puppet run in the last 10 hours [10:26:52] New patchset: Hashar; "beta: fix sudo rights for mwdeploy user" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58859 [10:27:09] New review: Hashar; "PS2 fix whitespaces" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58859 [10:31:25] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:32:15] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time [10:43:21] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:44:11] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time [10:56:21] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:58:11] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time [11:36:58] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours [11:43:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:44:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time [11:56:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:58:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.137 second response time [12:05:28] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [12:08:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:10:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line 
output matched 400 - 336 bytes in 0.122 second response time [12:31:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:32:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.123 second response time [12:33:28] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [12:37:25] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [12:52:25] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:53:15] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time [12:55:17] cmjohnson1: good morning :-] [12:55:31] Good Afternoon! [12:56:02] should we wait for RobH ? [12:56:14] I can't help for hardware I am a noob on that area [12:56:20] robh: will not be online this early [12:56:46] well I guess you don't need any remote support anyway? if you get access to gallium console I guess we are fine [12:56:49] we are fine without him...the h/w aspect is simple and should take less than 5 mins [12:56:55] \O/ [12:57:29] at 1300 bring down gallium and I will add the disk [12:59:01] !log gracefully shut-downing Jenkins and Zuul for scheduled maintenance. Will shutdown server gallium just after. [12:59:08] Logged the message, Master [13:01:03] bah it doesn't want to go down :D [13:01:25] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:02:15] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [13:02:35] PROBLEM - zuul_service_running on gallium is CRITICAL: Connection refused by host [13:02:52] cmjohnson1: machine is going down. I lost access to it [13:02:53] :-D [13:02:57] hashar: looks down [13:03:25] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [13:04:12] PROBLEM - Host gallium is DOWN: PING CRITICAL - Packet loss = 100% [13:07:22] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [13:10:29] seem we want to use: XFS, set noatime,nobarrier,logbufs=8 [13:14:23] nobarrier iff there the backstore is battery backed up. [13:15:15] What do our RT priorities mean? [13:16:26] I mean, 50 is clearly the default; is like "0: Meh, if you got nothing to do" -> "100: OMG the machine room is on fire!!1!" ? [13:17:02] PROBLEM - Puppet freshness on ms-fe3001 is CRITICAL: No successful Puppet run in the last 10 hours [13:17:33] \O/ [13:17:39] Coren: I have no idea :( [13:18:01] hashar: new disk is in...it had to go in the last drive slot...the sata connector for slot 3 wouldn't reach the disk [13:18:07] powered on [13:18:11] pinging it [13:18:31] I'll ask Steve, since the ticket is for him. [13:19:17] cmjohnson1: will you handle the disk formatting ? [13:20:19] Coren: https://wikitech.wikimedia.org/wiki/RT#Priorities :-] [13:20:25] Coren: 51-60 => disaster [13:21:06] Ah. Didn't think of searching on wikitech. Silly me. Thanks, hashar. [13:21:12] RECOVERY - Host gallium is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [13:22:18] Although, admitedly, that makes 50 a dubious default. 
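The mount options quoted above (XFS with noatime, nobarrier and logbufs=8) would normally be persisted in fstab; a sketch of that line follows, using the device and mountpoint that get settled on later in the day rather than anything decided at this point. Per Coren's caveat, nobarrier only makes sense if the write cache is battery backed, so drop it if in doubt.

    # hypothetical /etc/fstab entry for the new SSD on gallium
    # drop "nobarrier" unless the controller cache is battery backed
    /dev/sdb1   /srv/ssd   xfs   noatime,nobarrier,logbufs=8   0   0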
[13:23:22] PROBLEM - HTTP on gallium is CRITICAL: Connection refused [13:23:22] PROBLEM - jenkins_service_running on gallium is CRITICAL: Connection refused by host [13:23:22] PROBLEM - SSH on gallium is CRITICAL: Connection refused [13:23:29] poor gallium :D [13:23:39] cmjohnson1: doesn't seem to want to come up or is that doing some fsck on the disks? [13:23:39] hashar: can you format? [13:23:48] let me connect to it [13:23:50] can't ssh [13:25:27] rebooting [13:25:47] sorry should have told you earlier [13:25:52] I have simply have no idea how fast our box comes up usually [13:26:14] depends on the server [13:27:22] RECOVERY - SSH on gallium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [13:30:40] cmjohnson1: I got some syslog entries [13:30:51] it's up but I don't see the disk [13:31:22] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:32:12] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time [13:32:23] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [13:33:19] hashar: please take a look if you don't see it either...than i may have to get the disk to work in slot 3 [13:33:27] yeah I am looking at dmesg [13:33:29] and syslog [13:33:39] any idea what would be the name ? [13:34:13] no but dmesg only shows sda and sdb...fdisk -l looks exactly the same as it did before adding the disk [13:34:26] i would think it would be sdc [13:34:49] yup [13:34:54] try another slot ? [13:35:00] okay [13:35:21] shutdowning [13:35:28] okay..cool [13:35:34] !log gallium can't find the SSD, shut downing, will attempt another slot [13:35:41] Logged the message, Master [13:37:03] PROBLEM - Host gallium is DOWN: PING CRITICAL - Packet loss = 100% [13:40:34] cmjohnson1: I can't find anything relevant in the logs :( [13:41:42] 2.566288] ata_piix 0000:00:1f.5: SCR access via SIDPR is available but doesn't work [13:41:42] oh [13:42:32] paste is http://dpaste.com/1055975/ [13:43:28] hashar: Those are just the sata controllers; try to grep sd[a-z] [13:43:47] can't it mean the sata controller has an issue? [13:44:18] That's just info; it's using another method to access de SCR. Not all controllers work with all methods. [13:44:29] s/de/the/ [13:44:42] * hashar doesn't even know what SCR stands for :D [13:44:56] S{CISI,ATA} command register [13:44:58] cmjohnson1: still can't ping it [13:45:42] Drive might be disabled in the BIOS (or the other controller). [13:46:57] booting...i think coren is right about disabled drive [13:47:09] ata1.00 and ata2.00 were up with some samsung disks [13:47:23] ata.2.01 and ata1.01 got: SATA link down (SStatus 4 SControl 0) [13:47:25] yeah..i just enabled port c on sata settings [13:48:27] New patchset: Hashar; "gallium: SSD drive using XFS" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58887 [13:48:48] I think that will do what we want :] [13:48:58] need to check whether the drives ends up being known as /dev/sdc [13:49:03] and it probably need to be formatted first [13:50:36] it pings [13:50:49] can't ssh still [13:52:53] RECOVERY - Host gallium is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [13:53:09] cmjohnson1: services don't seem to go :/ [13:53:50] ah I am on it [13:53:54] hashar: it was still booting up [13:54:26] so the 500GB disk sdb is now known has sdc [13:55:08] i see that and sdb is unk [13:56:14] hashar:have you tried to restart any services again? 
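The disk hunt above (dmesg only showing sda and sdb, fdisk -l unchanged, the SATA port disabled in the BIOS) can be reproduced with a couple of read-only commands; a sketch, run as root on the affected host:

    # did the kernel enumerate a new block device?
    dmesg | grep -E 'sd[a-z]'
    # list partition tables for every disk the kernel knows about
    fdisk -l
    # per-port SATA link state; "SATA link down" usually means an empty or BIOS-disabled port
    dmesg | grep -i 'sata link'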
[13:56:25] nop [13:56:35] though they start on boot [13:57:32] New patchset: Hashar; "gallium: SSD drive using XFS" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58887 [13:57:46] New review: Hashar; "SSD is /dev/sdb" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58887 [13:57:59] cmjohnson1: anything to check ? [13:58:37] i don't think so [13:59:23] PROBLEM - zuul_service_running on gallium is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/local/bin/zuul-server [13:59:29] so should I just create a primary partition on /dev/sdb using fdisk then format it with mkfs.xfs ? [14:00:42] correct [14:01:16] do you know where you are mounting? [14:02:03] /srv/ssd if that works for you https://gerrit.wikimedia.org/r/58887 [14:02:08] mark told me to not use /a [14:02:23] that works [14:02:26] proposed somewhere under /var/lib/jenkins but that disk is going to be used by other services beside jenkins [14:02:40] Hey, the dude in Tampa is Steve, right? Does he hang around IRC? [14:02:50] coren: he does but not till later [14:02:59] usually after lunch [14:03:11] something broke? [14:03:17] !log gallium: created primary partition on /dev/sdb . formatting using: mkfs.xfs /dev/sdb1 [14:03:24] Logged the message, Master [14:04:04] New patchset: Hashar; "gallium: SSD drive using XFS" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58887 [14:04:16] New review: Hashar; "/dev/sdb1" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58887 [14:04:24] Coren: cmjohnson1: could get https://gerrit.wikimedia.org/r/#/c/58887/ merged [14:04:25] cmjohnson1: I need him to play with shelf cabling in Tampa like you did in eqiad; it's a blocker for me, so I want to pounce on him the minute he gers here. :-) [14:04:31] that would let puppet mount the SSD disk on gallium :-] [14:06:55] Change merged: Cmjohnson; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58887 [14:06:56] hashar: "nobarrier"? You sure? If you don't have battery backup that can lead to data loss. [14:07:01] oh [14:07:13] hashar..sorry meant to plus 1 [14:07:17] not 2 [14:07:22] too late I guess hehe [14:07:37] Coren: how can we check whether there is battery backup? [14:07:59] since that is a working space, I think we can afford to loose data on that disk [14:08:06] hashar: Unless it's on a raid controller that has battery backed up cache, it doesn't have any. :-) [14:08:28] hashar: I'd turn off nobarrier anyways; there is not much to gain from it on a medium where writes are slow. [14:08:28] no battery backed controller on that server or any of the r410's for that matter [14:08:52] Coren: that is a SSD so supposed to be fast [14:09:16] For various values of "fast". If it doesn't have writeback ram, it's slow. :-) [14:09:35] Not as slow as a physical drive, but slow nonetheless. :-) [14:09:49] Well, caveat configurator. :-) [14:10:54] Just be aware that nobarrier looses some data coherency on XFS. [14:12:03] well we have to talk about it with mark because I barely knows how disk works :/ [14:14:39] In /my/ days, we had to wire wrap our own ram from scrap ferrite cores and hand-carved transistors. :-) [14:15:11] We didn't have fancy shmancy SATA drives, we just had a box with a scribe and a long scroll. [14:16:59] Poor scribes then got allmost all got replaced with monkeys. They had a higher error rate but cheaper (only needed bananas); and we only kelp a few scribes for ECC instead. 
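Back to the disk work: the steps hashar logs above (primary partition with fdisk, mkfs.xfs, then the puppet-managed mount from change 58887) amount to roughly the following, assuming the SSD really is /dev/sdb and that the disk is empty:

    # create one primary partition covering the whole disk (fdisk prompts: n, p, 1, defaults, w)
    fdisk /dev/sdb
    # format the new partition as XFS
    mkfs.xfs /dev/sdb1
    # mount it where the puppet change expects it, with the agreed options
    mkdir -p /srv/ssd
    mount -o noatime,nobarrier,logbufs=8 /dev/sdb1 /srv/ssd

In production the mount itself is left to puppet once https://gerrit.wikimedia.org/r/58887 is merged; the manual mount is only useful for checking the disk before the puppet run.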
[14:18:36] Legacy raid array: http://farm4.staticflickr.com/3019/2943407232_26c9510c20.jpg [14:19:31] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [14:20:18] Coren: only had to worry about dinosaurs destroying the raids back then? [14:20:19] i wonder if you could hot swap with that array? [14:20:36] !log restarting jenkins [14:20:43] Logged the message, Master [14:22:11] cmjohnson1: That depends. Ostensibly yes, but the controllers got all pissy when drives were mounted. :-) [14:22:28] hahaha [14:23:09] You can see from that picture that this was a high-end array: there are two parallel controllers. :-) [14:24:33] New patchset: Demon; "Updating for 2.6-rc0-322-geeed497" [operations/debs/gerrit] (master) - https://gerrit.wikimedia.org/r/58891 [14:29:15] New patchset: Hashar; "contint: jenkins master role + ssd directories" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58892 [14:29:48] pooor jenkins is restarting :D [14:30:53] cmjohnson1: Coren: I got to document the new Jenkins directories that lands on the SSD. https://gerrit.wikimedia.org/r/58892 that puppet validate :-] [14:30:56] Susan: never heard of analinterns before? [14:32:31] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [14:34:14] New review: coren; "lgm" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/58892 [14:35:06] Hm. Circular. Jenkins won't Verify the patch since it's down; the patch is needed to bring Jenkins back. :-) [14:35:33] New review: coren; "Jenkins can't verify the patch meant to bring it back up, can it? :-)" [operations/puppet] (production); V: 2 - https://gerrit.wikimedia.org/r/58892 [14:35:33] Change merged: coren; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58892 [14:39:02] PROBLEM - Host analytics1007 is DOWN: PING CRITICAL - Packet loss = 100% [14:40:49] Coren: that is true, sorry :( [14:40:54] it is still starting up [14:44:08] New patchset: Hashar; "zuul: support specifying the git directory" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58898 [14:44:08] New patchset: Hashar; "zuul: migrate git dir in production to the ssd" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58899 [14:44:11] RECOVERY - Host analytics1007 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [14:44:16] !log analytics1007 has never worked..but working on it [14:44:23] Logged the message, Master [14:44:24] I think I will avoid migrating zuul today :-D [14:44:42] it is friday! ..not a good idea ;-) [14:45:23] cmjohnson1: I think you can close the RT ticket I opened to add the SSD https://rt.wikimedia.org/Ticket/Display.html?id=4916 [14:45:31] the ops part is finished ;-] [14:45:38] Is the output of wfDebug() from our production systems viewable somewhere ? [14:45:39] okay..cool [14:45:45] thank you for the new disk \O/ [14:45:52] xyzram: it is not [14:45:57] xyzram: we use wfDebugLogGroup() [14:46:35] <^demon> I thought wfDebugLog() was sufficient. [14:46:36] I have a problem that is only showing up on production. [14:47:07] ah wfDebugLog might end up in the catchall log [14:47:24] So I'd like to see what MW is sending as a search request to lsearchd. 
[14:47:34] hm [14:47:41] can't remember the function name [14:48:13] so yeah wfDebugLog( 'somegroup', message ); the 'somegroup' should be defined in $wgDebugLogGroups or that will fallback to wfDebug() [14:48:21] That info is logged like this: wfDebug( "Fetching search data from $searchUrl\n" ); [14:48:58] hashar: So the only way is to add custom logging and deploy to production ? [14:49:13] Coren: can you merge on sockpuppet a0757126366 / https://gerrit.wikimedia.org/r/#/c/58892/ ? [14:49:19] Coren: does not seem to have landed [14:49:39] <^demon> xyzram: So, swap wfDebug( $foobar ) for wfDebugLog( 'groupname', $foobar ); [14:49:50] <^demon> Then just have to make sure 'groupname' is defined in $wgDebugLogGroups [14:50:09] where does the info end up ? [14:50:14] <^demon> fluorine [14:50:21] <^demon> In /a/mw-log/* [14:50:37] hashar: Ahcrap; forgot to forward my key [14:51:12] hashar: is why it didn't work. I never found out how to fix this without another merge. [14:51:20] ^demon: the might jenkins loves the build histories :( http://dpaste.com/1056027/ [14:51:34] ^demon: it passes most of the startup time reading a lll the old build history files [14:51:56] <^demon> Yeah. [14:52:12] that should be async [14:52:21] <^demon> You'd think. [14:52:39] ^demon: Ok, thanks. So I would still need to change code and have ops redeploy, right ? This might generate a lot of output, so it may have to be removed once I have the data I'm looking for. [14:52:39] hashar: Do you know how to fix that? The merge failed to rsync because of no ssh key. [14:52:48] <^demon> hashar: Why on earth it needs to load build $n of $randomProject at startup is beyond me. [14:52:53] <^demon> Should load them on demand.' [14:53:20] <^demon> xyzram: No, you or I could do it. $wgDebugLogGroups is defined in wmf-config. [14:53:29] <^demon> So just need the change to MWSearch and that. [14:53:32] ^demon: will fill a bug I guess [14:53:47] !log jenkins is back up. Starting Zuul [14:53:53] Logged the message, Master [14:53:56] <^demon> Or load the latest $n builds. [14:53:58] <^demon> Not all. [14:54:01] <^demon> Anyway [14:54:49] Coren: I have no idea :/ can't you rsync manually ? [14:55:01] RECOVERY - zuul_service_running on gallium is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/local/bin/zuul-server [14:57:08] hashar: It's not clear what is rsync'ed where; it's a hook done automagically during merge. [14:57:23] hashar: Not a big issue, though, since the next merge will push it. [14:57:37] New patchset: Ottomata; "Refactoring puppetmaster::self to allow for puppet clients." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58540 [14:58:12] I hate jenkins [14:58:50] ^demon: How do you deploy those changes to production ? I thought ops folks do that ... [14:59:04] <^demon> ops can, so can mortal shell users. [14:59:07] <^demon> I can do it [14:59:50] ^demon: Ok, I'll make the change and push to gerrit. [15:00:02] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [15:00:56] !log Changing Jenkins workspace directories from ${ITEM_ROOTDIR}/workspace to /srv/ssd/jenkins/workspace/${ITEM_FULLNAME} . My change has not been taken in account early on :( [15:01:02] Logged the message, Master [15:01:49] New patchset: Demon; "Adding mwsearch debug log group" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/58901 [15:02:11] ^demon: I changed the Jenkins workspaces to point to the new SSD . 
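A sketch of the logging change ^demon and hashar describe: in MWSearch, swap the wfDebug() call for wfDebugLog() with a named group, and register that group in wmf-config so the messages land in their own file under /a/mw-log on fluorine. The group name matches what gets deployed a few minutes later; the udp:// destination is a placeholder, not the real collector address.

    // MWSearch_body.php: log to a dedicated group instead of the catch-all debug log
    wfDebugLog( 'mwsearch', "Fetching search data from $searchUrl" );

    // wmf-config/InitialiseSettings.php: route the 'mwsearch' group to the log collector
    // (placeholder address; note the s/upd/udp/ typo fixed in change 58902)
    $wgDebugLogGroups['mwsearch'] = 'udp://logcollector.example:8420/mwsearch';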
[15:02:22] !log demon synchronized wmf-config/InitialiseSettings.php 'Debug group for mwsearch' [15:02:27] <^demon> xyzram: Just needs the change to MWSearch [15:02:28] Logged the message, Master [15:02:30] <^demon> I took care of wmf-config. [15:03:18] Change merged: Demon; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/58901 [15:03:32] ^demon: Quick question: When I run a search query directly on search1015 using: wget -O ukraine "http://search1015:8123/search/uawikimedia/%D0%A3%D0%BA%D1%80%D0%B0%D1%97%D0%BD%D0%B0?namespaces=0&offset=0&limit=20&version=2.1" [15:03:45] I get 8 results. [15:04:26] But from the web page, none. I thought wgDBname might be wrong but that is correct; any thoughts on what might be the issue ? [15:04:37] !log demon synchronized wmf-config/InitialiseSettings.php 'Debug group for mwsearch' [15:04:44] Logged the message, Master [15:04:47] New patchset: Demon; "s/upd/udp/" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/58902 [15:05:02] Change merged: Demon; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/58902 [15:05:31] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [15:05:47] sbernardin: are you in the data center? [15:06:21] PROBLEM - Host analytics1007 is DOWN: PING CRITICAL - Packet loss = 100% [15:07:12] ^demon: The change you just committed has "upd:" rather than "udp:" [15:07:17] <^demon> I fixed it. [15:07:21] <^demon> In the followup. [15:07:30] Oh, ok. [15:09:19] <^demon> I'm going to step out to grab lunch really quick. I'll deploy the mwsearch change as soon as I'm back. [15:09:26] <^demon> Should be back in ~15. [15:09:59] ok, I'm taking a break too, back in about an hour. [15:13:57] New patchset: Ottomata; "Refactoring puppetmaster::self to allow for puppet clients." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58540 [15:14:05] cmjohnson1: not yet [15:15:07] ping me when you get there....I believe there is a packet of information related to the cisco servers there. May be on the clipboard on the shelves or in the cabinet. I need you to look for it [15:15:07] please [15:15:21] and coren has something that is pretty important that needs to be done. thx [15:15:56] sbernardin = Steve@pmtpa? [15:16:38] Coren: yes [15:17:03] sbernardin: Well met. Please do ping me when your location is coincident with the DC's. :-) [15:19:45] Coren: will do [15:27:41] New patchset: Diederik; "Disable Fabian Kaelin's account, he no longer works with us." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58904 [15:28:08] New patchset: Ottomata; "Refactoring puppetmaster::self to allow for puppet clients." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58540 [15:28:24] cmjohnson1: thanks for the sfp's :) [15:28:33] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58904 [15:28:39] yw lesliecarr..did it work? [15:29:00] no, though i expected it to not work :) [15:29:14] i have a feeling that switch 1 in the stack has some sort of issue [15:29:29] are the fibers long enough that you could easily move the uplink module to switch 2 ? [15:29:35] in row c ? 
[15:29:52] lemme see [15:31:00] there is only 1 switch in c1 [15:31:22] oh i meant ot c2 [15:31:33] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [15:32:58] lesliecarr: i think there is enough slack to make that work [15:33:09] but I don't have anymore uplink modules atm [15:33:24] iirc ..robh ordered some or will be ordering [15:33:31] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [15:33:38] ? [15:33:49] okay [15:34:00] cmjohnson1: ordering....? ex4200 sfp modules? [15:34:07] well we can move the one from c1 if possible, because it's not working in c1 [15:34:08] :) [15:34:14] (cuz if thats it, it was ordered already and will arrive today) [15:34:29] arrive today ? damn! [15:34:30] super fast [15:35:07] robh: cool [15:35:32] RobH: Chris has setup the SSD in gallium successfully :-] [15:35:56] yep, was reading backread, faster? =] [15:36:01] lesliecarr: what is in c 1/2/0 [15:36:23] the uplink to cr2 [15:36:37] neither one is working, i wanted to keep the second link unchanged as a control [15:37:01] oh...well that makes it easier than..i can move it to c2 if you want [15:37:56] :) [15:37:59] awesome thank you [15:38:01] want me to drop a ticket ? [15:38:31] no..i will do it now [15:43:31] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [15:47:10] leslecarr moved to c2-asw [15:49:13] hashar / cmjohnson1 Thanks for working on zuul stuff today =] [15:49:22] the entire engineering department thanks you. [15:49:31] well [15:49:37] it is not going to suddenly become THHHHAT Fast :D [15:49:44] but I guess that will help a bit [15:49:44] thanks cmjohnson1 [15:49:46] woot [15:50:01] hashar: it isnt a magic bullet, but the wait io during lagged times on gallium was insane. [15:50:09] yup [15:50:17] I noticed that after we upgraded to Precise [15:50:31] some job suddenly doubled it is time [15:50:35] 30 sec -> 1 min [15:51:48] New review: Hashar; "Beware, this need a manual operation on the gallium server." [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/58899 [15:52:18] robh: I got a bunch of small Zuul changes to review / merge if you are up for some duty :-] [15:54:09] New patchset: Ottomata; "Refactoring puppetmaster::self to allow for puppet clients." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58540 [15:55:01] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [15:55:01] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [15:55:01] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [15:55:25] oh RobH i think we do need a spare 4200 now … though i may pack up and send the one from ulsfo out ? [15:58:58] robh: if you didn't see update on ticket...camera mounts go in on Monday [15:59:34] New patchset: Hashar; "icinga: fix jenkins monitoring" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58906 [15:59:39] !log adding another link to asw-c-eqiad's uplink [15:59:46] Logged the message, Mistress of the network gear. 
[15:59:52] New review: Hashar; "I have broken icinga check_jenkins test :-D Follow up with https://gerrit.wikimedia.org/r/58906" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58489 [16:00:06] LeslieCarr: check_jenkins is broken on icinga, https://gerrit.wikimedia.org/r/58906 should fix it :-] [16:00:40] hehe [16:00:43] good catch [16:01:14] do we have a disk checking plugin ? [16:01:26] gallium does not have any https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=gallium :-D [16:01:31] <^demon> LeslieCarr: Can you look at https://gerrit.wikimedia.org/r/#/c/58692/? hashar will love you. [16:01:47] <^demon> (I will too, fwiw) [16:02:02] * hashar order beers at the tap [16:03:12] haha [16:03:19] ok i am technically on vaation so that is the last work i do [16:03:32] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [16:03:45] New patchset: Lcarr; "Allow jenkins admins to manage replicated git repos" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58692 [16:03:57] why did we switch to having to rebase every change ? [16:04:12] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58692 [16:04:27] rebasing is evil [16:04:31] ;D [16:04:42] hashar...i don't think it matters because there is no raid installed on gallium [16:04:42] poor ops :( [16:04:51] <^demon> greg-g: aw, why? [16:04:53] cmjohnson1: there is a software raid though [16:05:02] okay, done, merged [16:05:05] and now i go away! [16:05:08] <^demon> LeslieCarr: Thank you! [16:05:09] :( [16:05:14] bye [16:05:25] ^demon: running puppet on gallium [16:05:34] ^demon: history man, history. merging is the way. :) [16:06:12] <^demon> Well, that's why for super active repos it's best to have merge-if-necessary, not cherry-pick or ff-only. [16:06:42] <^demon> If you're working on a repo with few authors or changes infrequently, a linear history is sometimes nicer. [16:07:21] why not cherry-pick ? [16:07:32] ^demon: didn't you just do ff-only? (or planning to?) [16:07:36] change do not need to be rebased and you maintain the linear history [16:07:44] (drawback: commit sha1 changes) [16:08:07] <^demon> sha1 changes makes it a dealbreaker imho. [16:08:24] <^demon> greg-g: It was enabled for operations/puppet, but was changed back because everyone's tired of clicking rebase. [16:09:26] heh [16:12:56] <^demon> The newest jgit now supports recursive merges. [16:13:09] <^demon> That might be nice to enable on something like core where we get lots of conflicts. [16:14:01] cmjohnson1: good news on cameras, thanks for workign on it [16:14:22] i had to go through our rep to get security to approve [16:14:23] ^demon: would that help reducing conflicts on release notes? [16:14:33] <^demon> potentially. [16:14:37] aww [16:14:56] <^demon> It still can't do an octopus merge, but it should at least solve some cases that currently conflict. [16:15:17] like adding a new bullet to a section? [16:15:42] (in the same place where everyone adds it i.e. after the last bullet in the section) [16:16:22] <^demon> Like adding to a different section in the same file. [16:16:29] <^demon> That currently conflicts when it never would locally. [16:17:45] <^demon> But yeah, "This only affects content merges, and is off by default as the [16:17:45] <^demon> upstream implementation is experimental and may still be buggy." [16:17:55] <^demon> So, prolly not best to enable just yet ;-) [16:22:14] New review: Ottomata; "Ok good catch! 
I fixed a couple of things:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58540 [16:26:51] off, might be there tonight *wave* [16:27:49] bye hashar [16:28:01] cmjohnson1: and thx again for the SSD :] [16:39:52] PROBLEM - Puppet freshness on virt1005 is CRITICAL: No successful Puppet run in the last 10 hours [16:57:32] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [17:02:32] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [17:05:15] PROBLEM - Host mw27 is DOWN: PING CRITICAL - Packet loss = 100% [17:06:35] RECOVERY - Host mw27 is UP: PING OK - Packet loss = 0%, RTA = 26.57 ms [17:09:25] PROBLEM - Apache HTTP on mw27 is CRITICAL: Connection refused [17:10:32] New patchset: Anomie; "Preserve timestamps when copying l10n cdb files" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58910 [17:11:07] New patchset: Anomie; "l10nupdate: Use refreshMessageBlobs.php instead of clearMessageBlobs.php" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58911 [17:23:11] New review: coren; "Duplicate parameter 'require' for on File[/etc/puppet/puppet.conf.d/10-self.conf] at /etc/puppet/man..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58540 [17:25:07] New review: coren; "(1 comment)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58540 [17:29:58] ottomata: Small error in your .pp; I'll test your next patchset if you fix it. [17:30:21] oop [17:30:49] ah! missed that danke [17:32:28] New patchset: Ottomata; "Refactoring puppetmaster::self to allow for puppet clients." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58540 [17:32:32] Coren ^ there we go. [17:42:15] New review: coren; "New class looks good (untested)." [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/58540 [17:43:04] ottomata: LGM. Merge? [17:43:52] cool! danke [17:43:57] i can do. [17:44:07] Have fun. [17:44:09] oh, i wanted to ask, should this be a main puppet group option? [17:44:18] right now you have to add it to your project's puppet group's manually [17:44:23] maybe after we test it out for a while? [17:44:27] Probably, once you played with it in its "proper" usage for a while. [17:44:30] yeah [17:44:30] k [17:44:32] cool [17:44:37] danke [17:46:57] ^demon: https://gerrit.wikimedia.org/r/58915 has the change to MWSearch [17:48:31] so, how brave am I feeling today? [17:48:39] thinking about upgrading the primary LDAP server :D [17:49:55] * Coren trembles in fear. [17:50:14] the secondary has been running the update for a week ;) [17:50:21] I'd be hard to break more things without being Leslie. :-) [17:51:46] <^demon> xyzram: 58921 ^ 58921 for updating deployment. [17:52:06] gonna do it :) [17:52:18] let's see what breaks when I stop the ldap server [17:52:35] it should all failover [17:52:50] New patchset: Andrew Bogott; "Marked a couple of bot manifests as deprecated." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/58922 [17:54:05] looks like everything fails over to me [17:54:30] PROBLEM - LDAP on virt0 is CRITICAL: Connection refused [17:54:40] PROBLEM - LDAPS on virt0 is CRITICAL: Connection refused [17:55:22] I say that [17:55:27] php doesn't seem to want to failover [17:56:02] ah ha [17:56:04] it's keystone [17:57:30] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [17:59:06] I need to add a secondary to keystone's config [17:59:29] anyway, time to upgrade :) [17:59:35] !log upgrade opendj on virt0 [17:59:44] Logged the message, Master [18:01:00] <^demon> xyzram: If those look fine to you, I'll go ahead and merge and get those live. [18:02:30] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [18:06:16] ^demon: what does "58921 ^ 58921" mean ? That change shows no changes at all ... [18:06:45] <^demon> Whoops, means & [18:07:00] <^demon> I meant 58920 and 58921 [18:07:15] <^demon> https://gerrit.wikimedia.org/r/#/c/58920/ and https://gerrit.wikimedia.org/r/#/c/58921/ [18:07:25] opsen: do you know if the people on ops@ is a subset of the people on engineering@ ? [18:07:31] 58921 shows +0 -0 [18:08:31] greg-g: I'm not sure it's a strict subset. [18:09:03] Coren: worth me cc'ing both engineering and ops with the WMF Deploy highlights email, then, you think? [18:09:24] ^demon: I'm seeing no changes in both. [18:09:58] greg-g: I think that the presumption that people who are on engineering that would care about the deploys are also on ops@ is reasonable; but I'm a relative newbie here so don't take my word for it. [18:10:30] <^demon> xyzram: -Subproject commit d714af877b16ee31306eaf7c711db333b002ff83 [18:10:30] <^demon> +Subproject commit 430196b92ed82d6533ea86cdf568a23eda38eec0 [18:10:35] I'll just do both and get yelled at if needed ;) [18:11:47] PROBLEM - LDAP on virt1000 is CRITICAL: Connection refused [18:11:54] -_- [18:11:57] wtf [18:12:17] now *that* is a problem [18:12:37] PROBLEM - Puppet freshness on db1058 is CRITICAL: No successful Puppet run in the last 10 hours [18:12:47] RECOVERY - LDAP on virt1000 is OK: TCP OK - 0.001 second response time on port 389 [18:13:39] ^demon: Not sure how to view those subproject commits on that page ... [18:14:51] <^demon> Don't worry about it, I went ahead and merged. [18:14:54] <^demon> Deploying in a second. [18:15:23] New patchset: Aaron Schulz; "Set pmtpa queue config to match eqiad." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/58953 [18:16:03] !log demon synchronized php-1.21wmf12/extensions/MWSearch/MWSearch_body.php 'Moar logging' [18:16:10] Logged the message, Master [18:16:31] !log demon synchronized php-1.22wmf1/extensions/MWSearch/MWSearch_body.php 'Moar logging' [18:16:38] <^demon> xyzram: Logging patches all live. [18:16:39] Logged the message, Master [18:16:40] New patchset: Aaron Schulz; "Set pmtpa queue config to match eqiad." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/58953 [18:18:17] ^demon: thanks. [18:18:34] <^demon> yw [18:24:09] ^demon: mysteriously, search suddenly seems to work on ua.wikemedia.org ! [18:24:28] <^demon> Magic ;-) [18:24:45] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/58953 [18:25:03] I just entered "Україна" and got back a bunch of results. 
[18:25:25] !log aaron synchronized wmf-config/jobqueue-pmtpa.php 'Set pmtpa queue config to match eqiad' [18:25:32] Logged the message, Master [18:25:38] ^demon: Not sure if I should be elated or depressed :-) [18:25:40] xyzram: nice to hear search works.. cool [18:26:20] Can anyone else confirm that ? Just go to ua.wikimedia.org and enter "Україна" in the search box ... [18:27:12] <^demon> Yupp, I got 6 results. [18:27:27] Результати 1 — 8 з 8 для Україна [18:27:36] 8 for me it seems [18:28:18] eh, well 6 in the list but says 1 -8 on top [18:28:32] <^demon> Ah, I had some namespace filtering on. [18:28:33] This was failing all week ... mighty strange. [18:28:36] <^demon> 8 if I enable all ns's. [18:29:01] <^demon> Heh, 8, but says out of 12. [18:29:06] <^demon> Maybe we have an off by 4 bug. [18:30:18] That log file is growing really quickly -- 198785054 bytes so far ... [18:30:44] xyzram: [OK] uawikimedia 2013-04-12 07:39:10 [18:30:48] Do we have limits on the size of those files ? I'm worried it might fill up disk. [18:31:06] mutante: where is that from ? [18:31:09] xyzram: that is from getting the lucene status page from a random search node. (search1015) [18:31:17] wget http://search1015.eqiad.wmnet:8123/statu [18:31:21] status [18:31:31] Ok [18:31:56] it used to say failed there a couple days ago [18:32:13] Oh, that's interesting. [18:32:25] what i did is the same the search monitoring check does [18:32:39] uses check_http to get that status page and check for string FAILED [18:33:21] we still have these failing now: bat_smgwiki, fiu_vrowiki, and 3 zh variants [18:33:30] but that is less than it was [18:33:42] mutante: sorry, what's your real name ? [18:33:49] <^demon> xyzram: We can cut the log off quickly if we don't need it anymore. [18:34:05] xyzram: Daniel Zahn [18:34:49] ^demon: Let me try a couple more tests on the Javanese wiki which was reporting a similar problem; then we can cut it off. [18:35:05] https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=search1015&service=search+indices+-+check+lucene+status+page [18:35:10] <^demon> okie dokie [18:35:14] that check isnt broken on all nodes though, just a few [18:35:28] mutante: Are you in SF ? [18:35:29] e.g. https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=search1017&service=search+indices+-+check+lucene+status+page [18:35:33] xyzram: yes [18:37:05] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58540 [18:41:28] mutante: search 15 through 22 are in pool 4 and have been the source of many issues. [18:41:54] Surprising that 15 shows a problem but 16 does not -- they should be identical. [18:43:10] hmm, yeah, pool 4 is the "everything else" pool, afaik, so there are all the small wikis in it , but a lot of them [18:47:12] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [18:57:10] New patchset: Odder; "(bug 47166) Enable Extension:Collection on sh.wikipedia" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/58986 [19:03:12] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [19:19:07] ^demon: can you turn off the mwsearch log ? 
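What mutante describes above, both the manual wget and the icinga check, boils down to fetching the per-node lucene status page and looking for FAILED entries; a rough command-line equivalent, with the host name taken from the conversation:

    # show any failed indices reported by one search node
    wget -q -O - http://search1015.eqiad.wmnet:8123/status | grep -i 'failed' || echo 'no failed indices reported'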
[19:20:04] <^demon> Yeah, on it [19:22:18] New patchset: Demon; "Disable mwsearch log" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/58989 [19:22:45] !log demon synchronized wmf-config/InitialiseSettings.php 'Disable mwsearch log' [19:23:00] Change merged: Demon; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/58989 [19:28:15] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [19:32:29] New patchset: Ori.livneh; "Add 'Contact Wikipedia' footer link on enwiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57649 [19:33:15] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [19:37:37] New review: Ori.livneh; "Agree with Kaldari re: scalability, but also think a config variable is just nicer, especially if th..." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57649 [19:48:09] https://bugzilla.wikimedia.org/quips.cgi?action=show [19:48:35] quips: one more reason to use bugzilla 8-] [19:48:44] :) [19:48:57] add some new ones:p [19:50:05] RECOVERY - Host labstore3 is UP: PING OK - Packet loss = 0%, RTA = 26.72 ms [19:50:15] PROBLEM - SSH on caesium is CRITICAL: Server answer: [19:51:15] RECOVERY - SSH on caesium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [19:52:05] PROBLEM - SSH on labstore3 is CRITICAL: Connection refused [19:52:35] PROBLEM - DPKG on labstore3 is CRITICAL: Connection refused by host [19:52:55] PROBLEM - Disk space on labstore3 is CRITICAL: Connection refused by host [19:52:55] PROBLEM - RAID on labstore3 is CRITICAL: Connection refused by host [19:55:05] RECOVERY - LDAP on virt0 is OK: TCP OK - 0.027 second response time on port 389 [19:55:35] RECOVERY - LDAPS on virt0 is OK: TCP OK - 0.027 second response time on port 636 [19:57:21] !log finished opendj upgrade on virt0 [19:57:36] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58859 [20:01:55] PROBLEM - LDAPS on virt1000 is CRITICAL: Connection refused [20:03:05] New patchset: Ori.livneh; "Update comment re: usage of UserDailyContribs" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/58999 [20:03:16] PROBLEM - LDAP on virt1000 is CRITICAL: Connection refused [20:04:05] PROBLEM - NTP on labstore3 is CRITICAL: NTP CRITICAL: No response from NTP server [20:04:06] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/58999 [20:06:55] hashar, have you every used the dh override stuff? [20:06:59] ever* [20:07:10] yup once I think [20:07:26] like dh_clean is called before building the package (I think) [20:07:31] it comes with some conventions [20:07:42] if you want to add some more cleaning task you can create a target dh_clean_override [20:07:46] put your specific stuff in it [20:07:50] override_dh_clean [20:07:50] then call again dh_clean [20:07:51] ? [20:07:54] i think so [20:08:05] yeah, makes sense, i think i got clean to work [20:08:12] but what if I need custom build/install steps? [20:08:15] that's supposed to work too, right? [20:08:41] our varnish package has a lot of dh_override :D [20:08:46] a simple one is in python-statsd [20:08:52] RECOVERY - LDAPS on virt1000 is OK: TCP OK - 0.000 second response time on port 636 [20:08:59] hm. 
not sure I like that opendj on virt1000 is crashing [20:09:12] RECOVERY - LDAP on virt1000 is OK: TCP OK - 0.001 second response time on port 389 [20:10:14] hmmk will look, tanks [20:12:14] ah [20:12:17] too many open files error [20:14:39] New patchset: Dzahn; "add iptables to deny access to NRPE on hosts with public IP except for our own networks" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/59001 [20:17:24] New review: Dzahn; "apply just on a single host (gallium) before requiring this in nrpe itself (on all hosts using it). " [operations/puppet] (production) - https://gerrit.wikimedia.org/r/59001 [20:19:02] RECOVERY - SSH on labstore3 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [20:20:12] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [20:22:05] New review: Demon; "War available at https://integration.wikimedia.org/nightly/gerrit/wmf/gerrit-2.6-rc0-322-geeed497.war" [operations/debs/gerrit] (master) C: 2; - https://gerrit.wikimedia.org/r/58891 [20:24:12] PROBLEM - Puppet freshness on cp3003 is CRITICAL: No successful Puppet run in the last 10 hours [20:24:28] New review: Dzahn; "don't merge, needs to allow more stuff from public (80/443), and check netstat -tulpen on gallium" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/59001 [20:24:46] Change merged: Ryan Lane; [operations/debs/gerrit] (master) - https://gerrit.wikimedia.org/r/58891 [20:27:04] mark, are you about? [20:30:02] RECOVERY - Puppet freshness on labstore3 is OK: puppet ran at Fri Apr 12 20:29:52 UTC 2013 [20:30:19] AH, hashar! if figured out my problem [20:30:24] i thought i was overriding all wrong [20:30:24] but [20:30:27] i have local changes, so I was doing [20:30:32] git-buildpackage --git-ignore-new [20:30:47] but since i an also using git-export-dir=../build-area (in gbp.conf) [20:30:51] it was exporting HEAD and building from there [20:30:55] NOT my local working directory [20:31:00] so my changes to rules were ignored [20:32:12] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [20:32:21] hmm, is https://lists.wikimedia.org// supposed to give me a 'default' http server page? [20:32:23] notice the double // [20:32:29] no bigie, just wondering [20:34:44] Fun. :-) [20:34:44] /say /dev/sda /dev/sdab /dev/sdae /dev/sdah /dev/sdb /dev/sdd /dev/sdg /dev/sdj /dev/sdm /dev/sdp /dev/sds /dev/sdv /dev/sdy [20:34:44] /say /dev/sda1 /dev/sdac /dev/sdaf /dev/sdai /dev/sdb1 /dev/sde /dev/sdh /dev/sdk /dev/sdn /dev/sdq /dev/sdt /dev/sdw /dev/sdz [20:34:44] /say /dev/sdaa /dev/sdad /dev/sdag /dev/sdaj /dev/sdc /dev/sdf /dev/sdi /dev/sdl /dev/sdo /dev/sdr /dev/sdu /dev/sdx [20:35:42] RECOVERY - NTP on labstore3 is OK: NTP OK: Offset -0.01021301746 secs [20:39:06] New patchset: J; "dont duplicate wikimedia-task-appserver dependencies" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/59004 [20:39:22] PROBLEM - Host labstore3 is DOWN: PING CRITICAL - Packet loss = 100% [20:40:04] Change abandoned: J; "package installed via wikimedia-task-appserver, removing lilypond in https://gerrit.wikimedia.org/r/..." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/58504 [20:40:12] RECOVERY - Host labstore3 is UP: PING OK - Packet loss = 0%, RTA = 26.55 ms [20:40:32] RECOVERY - Disk space on labstore3 is OK: DISK OK [20:40:32] RECOVERY - DPKG on labstore3 is OK: All packages OK [20:40:52] RECOVERY - RAID on labstore3 is OK: OK: State is Optimal, checked 1 logical device(s) [20:44:02] PROBLEM - RAID on labstore3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:44:52] RECOVERY - RAID on labstore3 is OK: OK: State is Optimal, checked 1 logical device(s) [20:45:59] ottomata: ahhh glad you figured it out :-] [20:46:13] ottomata: I don't think you want to use --git-ignore-new [20:46:19] just for now [20:46:21] as i'm testing it [20:46:28] certainly not for the final version [20:46:34] you can add some files in debian/options/clean or something like that [20:46:36] i just don't want to have to commit every time I make a change [20:47:09] git commit -a --amend -m 'foo' && git buildpackage ? [20:47:26] meh, —ignore-new is good for now [20:47:28] that's what it is for [20:47:48] i think the docs even recommend using that [20:52:33] !log install package upgrades on sanger [20:57:02] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [21:03:02] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [21:04:17] New patchset: Ottomata; "Initial debian packaging using git-buildpackage" [operations/debs/kafka] (master) - https://gerrit.wikimedia.org/r/59008 [21:05:18] New patchset: Ottomata; "Initial debian packaging using git-buildpackage" [operations/debs/kafka] (master) - https://gerrit.wikimedia.org/r/59008 [21:06:07] New patchset: Andrew Bogott; "Mark ldapsupportlib.py 0555." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/59009 [21:14:02] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [21:17:37] Change abandoned: Andrew Bogott; "This does not fix what I thought it fixed" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/59009 [21:33:02] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [21:37:33] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours [21:41:36] New patchset: Aklapper; "Comment out list of urgent issues; not working as expected" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/59014 [21:43:03] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [21:49:03] PROBLEM - RAID on labstore3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:49:53] RECOVERY - RAID on labstore3 is OK: OK: State is Optimal, checked 1 logical device(s) [21:56:00] Ryan_Lane: The icinga raid check is playing *havock* with IO when you are changing the configuration from underneath it. :-) [21:57:11] :D [21:57:23] PROBLEM - Host labstore3 is DOWN: PING CRITICAL - Packet loss = 100% [21:58:03] RECOVERY - Host labstore3 is UP: PING OK - Packet loss = 0%, RTA = 26.93 ms [22:02:03] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [22:04:39] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/59014 [22:07:03] PROBLEM - RAID on labstore3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
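The dh override pattern ottomata and hashar were discussing earlier this afternoon, relevant again now that the kafka packaging patchset has appeared, looks roughly like this in a debhelper 7+ debian/rules; the clean-up path and build command are placeholders, not lifted from the kafka package (recipe lines must use hard tabs in a real rules file):

    #!/usr/bin/make -f
    %:
            dh $@

    # extra cleanup on top of what dh_clean already does
    override_dh_clean:
            rm -rf build-tmp        # placeholder for package-specific build cruft
            dh_clean

    # replace the dh_auto_build default with a custom build step
    override_dh_auto_build:
            ./build.sh              # placeholder build command

As ottomata found out above, when gbp.conf sets an export-dir, git-buildpackage exports HEAD into that directory and builds there, so uncommitted edits to debian/rules are ignored even with --git-ignore-new; the flag only skips the "uncommitted changes" safety check.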
[22:11:53] RECOVERY - RAID on labstore3 is OK: OK: State is Optimal, checked 1 logical device(s)
[22:20:03] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[22:32:03] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa
[22:32:51] New patchset: Krinkle; "noc: Bring back langlist and wikiversions.dat as non-txt" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/59017
[22:33:03] mutante: ^
[22:34:53] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/59017
[22:38:34] mutante: What is the difference between "Reply" and "Comment" on RT?
[22:38:50] I'd like to say that I fixed it, with a link to the change in gerrit
[22:39:38] I clicked both, it is explained in the "To" header then ("Reply to requestor", "Comment (not sent to requestor)")
[22:40:32] !log temp depooling search1015
[22:40:35] Krenair: comment won't email
[22:40:37] err
[22:40:39] damn it
[22:40:41] Krinkle: ^^
[22:40:43] reply will
[22:41:08] Yeah, figured.
[22:41:22] Ryan_Lane: Hmm... Not sure about Purple IRC but Xchat has a feature to change autocompletion order to the last matching username which spoke
[22:41:33] I'm using adium
[22:41:37] which is weird, yes
[22:41:54] Or type 3 (!) characters before pressing tab
[22:42:09] I know, right.
[22:42:14] :)
[22:42:20] too much effort ;)
[22:43:33] If it's any consolation, Krenair|Krinkle happens a lot in pretty much any channel we're both in.
[22:48:19] We live in an environment where names ought to be unique based on the first two case-insensitive characters
[22:48:29] Let's trim our _hash tables ;)
[22:48:56] It's not too bad actually.
[22:49:36] Except for the Ryans, James, Marks and Robs we have.
[22:49:58] And more I'm sure.
[22:51:06] Heh, my dad met someone online last week who literally had the exact same name (first, middle and last). Unrelated for at least 2 generations.
[22:51:27] PROBLEM - RAID on labstore3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[22:52:17] RECOVERY - RAID on labstore3 is OK: OK: State is Optimal, checked 1 logical device(s)
[22:53:55] Ryan_Lane: Can you do another jenkins-bot V only access for me?
[22:53:55] https://gerrit.wikimedia.org/r/#/admin/projects/integration/jenkins-job-builder-config,access
[22:54:19] jenkins-bot is stable for the jjb-config repo as of this week
[22:54:29] I don't have a rights edit interface there
[22:54:42] Interesting
[22:54:55] this one (like the last one) is under integration/*
[22:55:13] yep. no clue on that
[22:55:22] Ryan_Lane, aren't you a gerrit admin?
[22:55:27] PROBLEM - RAID on labstore3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[22:55:27] yes
[22:55:33] this looks like a bug to me
[22:55:35] can't admins edit everything?
[22:55:35] * Ryan_Lane shrugs
[22:55:40] talk to ^demon about it
[22:56:41] Ryan_Lane: Since it is just about the autocomplete not listing Jenkins-Bot (and the UI forcing me to pick from auto complete)
[22:56:50] ... it should work if I edit the repo manually, right?
[22:57:05] It is a repo afaik, not sure if it takes pushes from cli though
[22:57:36] I can edit the rights, it just won't let me add certain user groups
[22:58:17] RECOVERY - RAID on labstore3 is OK: OK: State is Optimal, checked 1 logical device(s)
[22:59:28] error: pathspec 'remotes/gerrit/refs/meta/config' did not match any file(s) known to git.
[22:59:34] meh
[23:00:37] I see ^demon made edits outside the interface (since he provided a commit message)
[23:00:40] I'll ask him later
[23:01:12] k, cya later
[23:02:08] RobH, poke
[23:02:17] ?
[23:02:29] https://rt.wikimedia.org/Ticket/Display.html?id=4940 - if we wanted to add extra domains there, should Victor send it to the same ticket?
[23:02:32] Ryan_Lane: Something f'ed up is going on with check-raid.py. I think it's locking the entire raid controller for several seconds at regular intervals. Maybe it gets confused by the JBOD config?
[23:03:09] maybe so
[23:03:09] Ryan_Lane: Not much point keeping it running since it's a software raid anyways.
[23:03:21] Thehelpfulone: what do you mean add extra? as in other wikipediastories.whatevers?
[23:03:31] Do you know whereabouts in the puppet config I should hunt for this?
[23:04:48] I think it also checks software raids?
[23:05:27] Ryan_Lane: it D hangs in /usr/bin/MegaCli64 so it's the PERC part that goes crunch
[23:05:51] * Ryan_Lane nods
[23:07:05] Ah, ew. It's included in base.pp if $::network_zone == "internal"
[23:08:26] Lemme break check-raid.py locally, see if that stops the freezes
[23:09:05] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[23:09:46] well, we *should* be checking our raids :)
[23:10:27] Yeah, on those boxes it makes no sense to check the PERC though.
[23:10:43] indeed
[23:10:57] And yeah, I just tried it manually. Definitely the MegaCli64 that goes boom. It locks the entire filesystem for 5-10 secs.
[23:13:15] PROBLEM - RAID on labstore3 is CRITICAL: Timeout while attempting connection
[23:13:17] * Coren substituted /usr/bin/MegaCLI64 with a shell script that outputs the same status. :-)
[23:13:37] !log rebooted labstore3 -> hung MegaCLI processes
[23:14:45] PROBLEM - Host labstore3 is DOWN: PING CRITICAL - Packet loss = 100%
[23:15:25] RECOVERY - Host labstore3 is UP: PING OK - Packet loss = 0%, RTA = 26.57 ms
[23:16:15] RECOVERY - RAID on labstore3 is OK: OK: State is Optimal, checked 1 logical device(s)
[23:17:55] PROBLEM - Puppet freshness on ms-fe3001 is CRITICAL: No successful Puppet run in the last 10 hours
[23:21:27] New patchset: Ori.livneh; "Add 'Contact Wikipedia' footer link on test/test2" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57649
[23:22:59] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57649
[23:23:01] Krinkle|detached: comment should just be a comment on the ticket while Reply is also an email to all people added on the ticket
[23:30:39] !log olivneh synchronized wmf-config/InitialiseSettings.php 'Add 'Contact Wikipedia' footer link on test/test2 (Iab1f5f527)'
[23:30:52] !log olivneh synchronized wmf-config/CommonSettings.php 'Add 'Contact Wikipedia' footer link on test/test2 (Iab1f5f527)'
[23:33:05] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa
[23:37:26] PROBLEM - search indices - check lucene status page on search1016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:51:24] LeslieCarr: ping?
[23:52:46] https://bugzilla.wikimedia.org/show_bug.cgi?id=46086 Any shell users able to find this in a log?
[23:55:16] RECOVERY - search indices - check lucene status page on search1016 is OK: HTTP OK: HTTP/1.1 200 OK - 52993 bytes in 0.029 second response time
[23:56:24] Krenair: I don't know how I would go about finding that out, sorry. Not familiar with prod at all yet.
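Regarding Coren's workaround at 23:13 (swapping /usr/bin/MegaCli64 for a shell script so the RAID check stops hanging the PERC controller on a software-RAID box): a sketch of what such a stand-in might look like. The exact output check-raid.py expects is not visible in the log, so the echoed text below is an assumption; the point is only that the stub returns a healthy status instantly without issuing real controller commands.

    #!/bin/sh
    # Hypothetical stand-in for /usr/bin/MegaCli64 on a JBOD/software-RAID host.
    # Always reports an optimal virtual drive and exits 0, so the NRPE RAID
    # check passes without ever touching the controller.
    echo "Virtual Drive: 0 (Target Id: 0)"
    echo "State               : Optimal"
    exit 0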
[23:59:18] Ryan_Lane: Figured it out: `git fetch gerrit refs/meta/config:refs/remotes/gerrit/meta/config` http://www.lowlevelmanager.com/2012/09/modify-submit-type-for-gerrit-project.html
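For anyone retracing Krinkle's refs/meta/config detour: Gerrit keeps per-project access rules in a project.config file on the hidden refs/meta/config ref, which a normal clone does not fetch; that is why the earlier `git checkout remotes/gerrit/refs/meta/config` attempt failed with "pathspec did not match". The usual round trip looks roughly like this (remote name `gerrit`, commit message and group are illustrative):

    # Fetch the hidden config ref into a local tracking ref, then check it out.
    git fetch gerrit refs/meta/config:refs/remotes/gerrit/meta/config
    git checkout refs/remotes/gerrit/meta/config

    # Access rules live in project.config; the "groups" file maps group names
    # to their UUIDs and must list any group referenced in project.config.
    $EDITOR project.config groups

    # Commit and push straight back to the config ref (requires push rights
    # on refs/meta/config for the project).
    git commit -a -m "Grant Verified to jenkins-bot"
    git push gerrit HEAD:refs/meta/config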