[00:22:17] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:26:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 7.014 seconds [00:28:54] maplebed: thanks [00:29:05] yw. [01:02:11] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:06:14] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 4.419 seconds [01:36:18] New patchset: Lcarr; "Trying changing nagios-config dir to prevent the rewriting 50 times" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3840 [01:36:32] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3840 [01:36:53] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3840 [01:36:56] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3840 [01:41:47] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:47:56] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 2.828 seconds [02:05:56] PROBLEM - Puppet freshness on db59 is CRITICAL: Puppet has not run in the last 10 hours [02:07:35] PROBLEM - LVS HTTPS on wiktionary-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:07:35] PROBLEM - LVS HTTPS on wikiquote-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:07:35] PROBLEM - LVS HTTPS on wikimedia-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:09:32] RECOVERY - LVS HTTPS on wiktionary-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 59793 bytes in 0.776 seconds [02:09:32] RECOVERY - LVS HTTPS on wikiquote-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 52290 bytes in 0.774 seconds 
[02:09:32] RECOVERY - LVS HTTPS on wikimedia-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 80915 bytes in 0.875 seconds [02:16:53] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours [02:16:53] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours [02:23:47] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:27:50] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 5.938 seconds [02:28:53] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours [02:28:53] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours [02:50:29] RECOVERY - check_job_queue on spence is OK: JOBQUEUE OK - all job queues below 10,000 [04:48:44] RECOVERY - Disk space on search1022 is OK: DISK OK [04:53:53] PROBLEM - Disk space on search1021 is CRITICAL: DISK CRITICAL - free space: /a 3325 MB (2% inode=99%): [04:55:59] PROBLEM - Disk space on search1022 is CRITICAL: DISK CRITICAL - free space: /a 3326 MB (2% inode=99%): [05:13:59] PROBLEM - Puppet freshness on sq34 is CRITICAL: Puppet has not run in the last 10 hours [06:11:27] PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Puppet has not run in the last 10 hours [06:25:51] PROBLEM - Disk space on search1022 is CRITICAL: DISK CRITICAL - free space: /a 3298 MB (2% inode=99%): [06:45:36] PROBLEM - Disk space on search2 is CRITICAL: DISK CRITICAL - free space: /a 3239 MB (2% inode=99%): [07:00:59] PROBLEM - Disk space on search7 is CRITICAL: DISK CRITICAL - free space: /a 787 MB (0% inode=99%): [10:03:29] PROBLEM - check google safe browsing for wikinews.org on google is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:05:26] RECOVERY - check google safe browsing for wikinews.org on google is OK: HTTP OK HTTP/1.0 200 OK - 0.150 second response time [10:30:47] PROBLEM - Lucene on search15 is 
CRITICAL: Connection timed out [10:34:41] PROBLEM - Host db1012 is DOWN: PING CRITICAL - Packet loss = 100% [11:22:23] PROBLEM - Host search1020 is DOWN: PING CRITICAL - Packet loss = 100% [11:47:35] PROBLEM - Lucene on search9 is CRITICAL: Connection refused [12:07:41] PROBLEM - Puppet freshness on db59 is CRITICAL: Puppet has not run in the last 10 hours [12:18:47] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours [12:18:47] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours [12:25:49] PROBLEM - LVS Lucene on search-pool2.svc.pmtpa.wmnet is CRITICAL: Connection timed out [12:27:37] RECOVERY - LVS Lucene on search-pool2.svc.pmtpa.wmnet is OK: TCP OK - 0.002 second response time on port 8123 [12:30:28] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours [12:30:28] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours [12:59:30] New review: Demon; "(no comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/3733 [13:14:45] !log restarting puppet/puppetmaster on stafford to experiment with report settings [13:14:47] Logged the message, Master [13:28:57] New patchset: Jgreen; "disabled puppet client logging, so long unused yaml cruft" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3858 [13:29:10] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3858 [13:29:38] New review: Jgreen; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3858 [13:29:40] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3858 [13:34:31] RECOVERY - Lucene on search15 is OK: TCP OK - 8.996 second response time on port 8123 [13:49:08] PROBLEM - Lucene on search15 is CRITICAL: Connection timed out [13:54:23] PROBLEM - LVS Lucene on search-pool2.svc.pmtpa.wmnet is CRITICAL: Connection timed out [13:56:11] RECOVERY - LVS Lucene on search-pool2.svc.pmtpa.wmnet is OK: TCP OK - 0.011 second response time on port 8123 [14:06:59] PROBLEM - LVS Lucene on search-pool2.svc.pmtpa.wmnet is CRITICAL: Connection timed out [14:10:53] RECOVERY - LVS Lucene on search-pool2.svc.pmtpa.wmnet is OK: TCP OK - 0.002 second response time on port 8123 [14:26:29] RECOVERY - Disk space on search7 is OK: DISK OK [14:26:38] RECOVERY - Disk space on search2 is OK: DISK OK [14:34:44] !log lucene hosed on search9 and search15. restarting, then will look after cause [14:34:45] Logged the message, King of Search [14:35:20] RECOVERY - Lucene on search9 is OK: TCP OK - 0.007 second response time on port 8123 [14:36:25] !log got virt1001 to pxe, but dhcp doesnt know how to handle, need subnet details. [14:36:26] Logged the message, RobH [14:36:35] notpeter: i just have it say my name so i know it worked. [14:37:53] RECOVERY - Lucene on search15 is OK: TCP OK - 0.005 second response time on port 8123 [14:38:09] restartin morebots. [14:38:20] grrr [14:38:58] ok, come on back stupid bot. 
[14:39:21] !log restarted morebots in screen on wikitech, no longer as catrope, as roan has root on that box [14:39:23] Logged the message, RobH [14:39:30] notpeter: you are now simply notpeter ;] [14:52:22] I run it as root over there [14:52:58] I don't know the catrope convention [14:53:22] hehe [14:53:36] It's just that I was always the one that restarted it [14:53:43] Cause it would tend to die during European hours [14:53:58] oh [14:54:09] I restarted it yesterday, it had been dead for a day and something [14:54:19] but people (including me) just kept logging without noticing [14:54:57] I wonder why it died again so soon [15:01:13] i did it [15:01:21] to change peter's reply to logging [15:01:26] (just today) [15:01:46] ah ok then [15:06:31] is there a way that RT emails me when a ticket changes? [15:06:56] yep, if you are a requestor [15:07:06] cc, or admin cc [15:07:13] you can add yourself to a ticket via the 'people' tab [15:07:15] I am on multiple tickets and I never get emails [15:07:19] oh? [15:07:25] I get emails =/ [15:07:33] your account has a valid email address in the about right? [15:07:38] yes it does [15:07:42] preferences i mean, hrmm [15:07:51] it does, i checked it [15:07:54] and under mail you have it set to individual? 
[15:07:59] yes [15:08:26] i would suggest dropping a new ticket in the ops-requests describing the issue then, cuz it should be emailing you [15:08:34] for example, our conversation about oxygen does not show up in my email [15:08:37] i get emails just fine, so someone is going to have to trace it out [15:08:47] i just manually check the ticket and see if anything changed [15:08:51] yea that sucks [15:15:39] PROBLEM - Puppet freshness on sq34 is CRITICAL: Puppet has not run in the last 10 hours [15:17:09] PROBLEM - Disk space on srv223 is CRITICAL: DISK CRITICAL - free space: / 90 MB (1% inode=61%): /var/lib/ureadahead/debugfs 90 MB (1% inode=61%): [15:21:03] PROBLEM - Disk space on srv221 is CRITICAL: DISK CRITICAL - free space: / 101 MB (1% inode=61%): /var/lib/ureadahead/debugfs 101 MB (1% inode=61%): [15:29:36] PROBLEM - Disk space on srv224 is CRITICAL: DISK CRITICAL - free space: / 245 MB (3% inode=62%): /var/lib/ureadahead/debugfs 245 MB (3% inode=62%): [15:33:48] PROBLEM - Disk space on srv224 is CRITICAL: DISK CRITICAL - free space: / 206 MB (2% inode=62%): /var/lib/ureadahead/debugfs 206 MB (2% inode=62%): [15:33:57] PROBLEM - Disk space on srv223 is CRITICAL: DISK CRITICAL - free space: / 199 MB (2% inode=61%): /var/lib/ureadahead/debugfs 199 MB (2% inode=61%): [15:35:37] Reedy: looks like we have a lot of full servers [15:35:45] RECOVERY - Disk space on srv221 is OK: DISK OK [15:35:54] RECOVERY - Disk space on srv224 is OK: DISK OK [15:35:56] Yeah, it happens quite regularly [15:36:03] RECOVERY - Disk space on srv223 is OK: DISK OK [15:36:08] There's a cronjob to clean some of it up [15:37:05] would be better if that was just not filling up so easily [15:37:22] and it is filling / …. [15:37:25] bah [15:39:53] the apaches only have / i think. [15:40:11] they are just stupid processing nodes, in the past it was discussed how we could make them diskless ;] [15:40:22] but adding the ram to do that was, at the time, more expensive than the disks. 
[15:40:45] i mean apaches have no data partition other than / (i said it wrong before) [15:42:08] some, but not all have /a mostly empty [16:08:42] RobH: hey :) we can find the subnet info via the routers :) [16:12:17] PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Puppet has not run in the last 10 hours [16:24:10] I am going to be replacing msw-d1-pmtpa. Affecting ssl3/4 , es3/4, ms-fe1, locke, ms-be1, labstore 3/4, db60, db9, cr2. Is anyone working on any of these? [16:26:53] not i [16:34:07] !log replacing msw-d1-pmtpa per rt2639 [16:34:10] Logged the message, Master [16:37:38] PROBLEM - Host ps1-d1-pmtpa is DOWN: PING CRITICAL - Packet loss = 100% [16:39:35] PROBLEM - Router interfaces on cr2-pmtpa is CRITICAL: CRITICAL: host 208.80.152.197, interfaces up: 87, down: 1, dormant: 0, excluded: 0, unused: 0BRfxp0: down - BR [16:41:23] RECOVERY - Host ps1-d1-pmtpa is UP: PING OK - Packet loss = 0%, RTA = 3.69 ms [16:41:41] RECOVERY - Router interfaces on cr2-pmtpa is OK: OK: host 208.80.152.197, interfaces up: 89, down: 0, dormant: 0, excluded: 0, unused: 0 [16:46:08] hashar: any new patchsets need merging ? 
[16:46:46] LeslieCarr: https://gerrit.wikimedia.org/r/#change,3733 [16:46:56] LeslieCarr: you might want to have a review at it too [16:47:17] that is kind of a blocker for Jenkins continuous integration project [16:47:30] double checking you're not giving yourself root ;) [16:47:37] !log msw1-d1-pmtpa replacement complete [16:47:37] please note it add a nasty setgid bit on /var/lib/jenkins/jobs (IIRC) [16:47:39] Logged the message, Master [16:47:55] i just want to know who came up with the class name of admins::analinterns [16:48:57] git blame ;) [16:49:07] I love running blame to find out random things [16:49:32] Occasionally it allows you to write impressive commit messages like "fix bug X that had been present since 2005" [16:50:44] aahaha [16:51:10] I find fixing and reviewing old bug a lot more rewarding [16:51:45] LeslieCarr: admins::analinterns should probably be renamed "admins::colon" [16:52:09] RoanKattouw: gerrit 2.3 has the concept of drafts changes :D [16:52:30] RoanKattouw: which nobody has to review. That could probably be used as some kind of staging area for people to prepare their changes [16:53:17] Yeah I saw mention of that in the R-N [16:53:35] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3733 [16:53:38] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3733 [16:53:59] LeslieCarr: can you apply it on gallium ? [16:54:10] needy ! [16:54:14] LeslieCarr: will only be able to review it later tonight though [16:54:30] cmjohnson1: ping me when you got a sec [16:54:47] it's tie to reseat the usb sms thing on spence again :/ [16:54:53] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:55:42] notpeter: give me about 5 [16:55:43] RoanKattouw: I have CR +2 your click tracking hack https://gerrit.wikimedia.org/r/#change,3863 [16:55:52] cmjohnson1: cool. 
no particular hurry [16:55:57] Thanks [16:55:59] RoanKattouw: you even have one inline comment [16:56:47] and I get commit guidelines at https://www.mediawiki.org/wiki/Git/Commit_message_guidelines ;-D [16:56:59] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 7.452 seconds [16:57:29] I am out hacking place is closing [16:57:36] LeslieCarr: will review merge on gallium later tonight. [16:57:39] cya [16:58:09] bye [17:00:58] New patchset: Lcarr; "Fixing jenkins group" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3865 [17:01:13] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3865 [17:02:29] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3865 [17:02:32] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3865 [17:05:09] notpeter: i reseated the usb....did it work? [17:07:08] New patchset: Lcarr; "trying to fix again" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3866 [17:07:22] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3866 [17:07:29] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3866 [17:07:31] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3866 [17:08:05] cmjohnson1: hrm, not so far [17:08:09] I shall fiddle with it [17:08:23] but I think that there will not be need for more physical intervention [17:08:27] thank you! [17:09:09] okay...let me know [17:09:27] yep. thank you [17:10:57] oop, re. fenari load that's me [17:11:21] New patchset: Lcarr; "fixing jenkins" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3867 [17:11:34] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3867 [17:12:06] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3867 [17:12:09] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3867 [17:16:51] New patchset: Lcarr; "try 20 to fix jenkins" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3868 [17:17:05] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3868 [17:17:16] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3868 [17:17:18] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3868 [17:29:11] RobH: how long do you expect to have the swift backend host down for (to play with firmware)? [17:29:33] If you don't intend to leave it down overnight, I won't take it out of the rings - just do a normal shutdown on the host. [17:29:42] if you want it for a few days I'll take it out of the rings. [17:31:37] maplebed: i am not sure if the firmware i have will work, so best to take the one that rebooted most out of rotation [17:31:46] and i cannot do it right this second, so out of ring is indeed best i think [17:31:51] is that ok? [17:32:00] oh, wait [17:32:08] if i take it down, and it works, it will be hours not days [17:32:13] well, if it's down less than a day, then you don't need me. [17:32:16] so just pulling it with normal shutdown is ok i suppose. [17:32:18] you just shut it down whenever you're ready. [17:32:19] awesome! [17:32:33] ok, i didnt want to assume [17:32:38] thx! [17:32:41] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:32:42] I appreciate the confirmation. 
[17:38:50] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.036 seconds [17:45:21] maplebed ^ [17:45:44] woosters: the recovery? [17:46:05] tfinc / preilly told me they need some help to push their varnish stuff to production [17:46:20] can u pls take a look? [17:46:36] woosters: yes [17:46:44] indeed we do and next time we'll try and anticipate such low level changes sooner [17:46:45] if there's documentation I can follow, sure. [17:46:46] woosters: but, also need to test it [17:48:20] New patchset: Bhartshorne; "adding dsc to shell access on bastions and analytics hosts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3870 [17:48:33] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3870 [17:49:21] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3870 [17:49:24] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3870 [17:59:06] dschoon: would you attempt to log in to bast1001.wikimedia.org for me to verify I added your account right? [17:59:18] will do. [17:59:46] er, crap. [18:00:02] i'm on a different machine atm. i need to copy my wmf keys. [18:00:16] i'll get back to you. :) [18:11:44] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:17:53] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.030 seconds [18:20:25] notpeter: fenari:~$ tail -f /tmp/fire_in_the_hole-20120328-181334 [18:20:40] that's the search API tester running [18:23:24] Jeff_Green: if people are getting on your case about fenari you could join me on iron (so long as eqiad and no nfs is ok) [18:23:37] ohrly [18:23:53] no nfs is good, because nfs destroys fenari regularly [18:24:05] iron eh? 
[18:24:15] it does mean, of course that stuff you put there isn't backed up etc. [18:24:27] but as a working host, it's running my swift cleaner nicely. [18:24:30] oh that doesn't matter [18:24:59] i have long considered production hosts to be subject to instantaneous disappearance [18:25:04] New patchset: Bhartshorne; "adding a command to send to an email-to-sms gateway" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3872 [18:25:17] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3872 [18:25:40] notpeter: ^^^ first step... [18:26:03] the amusing thing is that if I run this script fast enough to be noticeable on fenari, it may destroy lucene [18:26:06] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3872 [18:26:09] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3872 [18:30:00] notpeter: when you've been working on nagios in the past, how do you trigger a page for just you to test the change? [18:33:20] maplebed: I have echoed straight to the sms daemon [18:33:22] such as [18:33:42] ok. [18:33:43] echo "test" | /usr/local/bin/gammu-smsd-inject TEXT +1 [18:33:47] I want nagios to trigger it though... [18:33:51] hrmp.h [18:34:01] I may have to configure a new service to use as my test. [18:34:08] hhhmmm, yeah [18:34:09] so much work! [18:34:20] feel free to play with the eqiad search vips [18:34:21] for this [18:34:31] they will page [18:34:34] and are not in use yet [18:34:51] nah... I think I'll create a service that tests for a specific file on spence or something that I can easily move around to trigger alerts and recoveries. [18:36:11] I think that will be after lunch. [18:36:16] okie dokie [18:43:01] LeslieCarr: so somehow Jenkins broke? 
:-D [18:44:12] yeah, the group seems to think it's specified multiple times [18:44:16] i tried some fixes but they didn't work [18:44:28] and then i was like "gah, need to get these other 50 things worked on" :( [18:44:28] it probably is, I will try to have a look at it [18:44:46] well at least you tried! =:D [18:45:03] I will investigate the issue a bit [18:46:32] PROBLEM - Router interfaces on cr1-sdtpa is CRITICAL: CRITICAL: host 208.80.152.196, interfaces up: 73, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-0/0/1: down - Core: cr1-eqiad:xe-5/2/1 (FPL/GBLX, CV71026) [10Gbps wave]BR [18:51:38] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:52:36] New patchset: Pyoungmeister; "simplifying some search" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3873 [18:52:50] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3873 [18:57:02] RECOVERY - Router interfaces on cr1-sdtpa is OK: OK: host 208.80.152.196, interfaces up: 75, down: 0, dormant: 0, excluded: 0, unused: 0 [18:57:47] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 5.742 seconds [19:08:31] RobH: are you with chad in eqiad ? [19:29:29] hashar: Yep [19:29:35] i put him to work with power stuff in new row [19:29:42] he is mounting all the new power strips and grounding them [19:30:10] if you need him, i can get him setup on wifi, but i am jealously guarding access to him ;] [19:31:21] * schoolcraftT slaps Thehelpfulone with a large smelly trout [19:31:40] hashar: if you just need to chat with him he can use my computer though easy. 
[19:32:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:35:58] this will be a couple extra minutes, this was the too short 7m fiber that i was due to swap [19:36:10] so gotta pull and label a new fiber, so the fiber # will change on both ends too [19:36:23] bah, wrong channel, but i guess we should work in here [19:36:26] LeslieCarr: ^ [19:36:32] shame on me for working in private ;] [19:36:44] ok [19:36:57] after this fiber, lunch for me [19:37:26] 5/3/3 is okay to swap since nothing is on psw2 [19:38:05] ok, this fiber is now # 2651 if ya wanna update [19:38:43] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.021 seconds [19:43:00] cool [19:43:06] lemme know when you're done [19:45:54] 5/3/0 done LeslieCarr [19:46:03] new cable # 2651 [19:46:21] dont forget to update label on cr2 ;] [19:46:33] you may already have [19:46:42] cool [19:46:44] yep [19:46:53] ok, i'm gonna find lunch, do 5/3/3 at your leisure [19:47:02] oh cr2 label, yes thanks [19:47:17] thx, you have time for more of this after lunch? 
[19:47:46] its not exactly breaking anything with the old runs, they just suck ;] [19:48:15] have a nice lunch =] [19:49:00] yeah, i have time for fixing network stuffs [19:49:10] just because it's the network :p [19:49:15] heh [19:49:28] want yer pretty fibers in their own special run dontcha ;] [20:00:56] RECOVERY - Host search1020 is UP: PING OK - Packet loss = 0%, RTA = 26.75 ms [20:12:19] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:15:35] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3873 [20:15:38] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3873 [20:18:37] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 9.555 seconds [20:21:53] yep [20:22:01] purple fibers, pink conduit [20:22:29] I'm not a big fan of pink unless it's flaming hot pink [20:22:33] in which case I'm in [20:23:22] oh yeah [20:23:24] hot pink [20:23:30] not cupcake pink [20:23:34] that's only good for cupcakes [20:23:38] good. no pastels [20:26:24] maybe fuschia instead [20:29:13] !log restarted search1020. nothing conspicuous in logs [20:29:15] Logged the message, notpeter [20:30:20] fuschia is nice [20:30:57] the plant and the color [20:31:37] mauve. [20:32:07] New patchset: Hashar; "create jenkins using user() and crafted group" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3881 [20:32:21] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3881 [20:32:51] New review: Hashar; "This caused issue with jenkins group being created twice :(" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3733 [20:33:21] RobH: fine. 
Just says Chad hello from me please :-] [20:33:54] mauve is just boring [20:34:18] New review: Hashar; "The issue reported by Leslie was:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3733 [20:34:32] LeslieCarr: but it's interesting because it's boring and annoying *at the same time* [20:36:06] LeslieCarr: Jenkins group on gallium might be fixed with https://gerrit.wikimedia.org/r/3881 ;) [20:37:53] you are not putting pink anything in my datacenter. [20:38:05] we already have a color scheme. [20:38:06] ;p [20:38:07] RobH: you're no fun [20:38:13] agreed with peter [20:38:14] :( [20:38:27] i will allow only one color change, if you wish to have peering fiber have different color. [20:38:42] fuchsia mauve whatever, they all look like pink to me! [20:38:43] black with black stripes [20:38:43] otherwise all normal traffic fibers are already yellow ;p [20:38:43] hashar: lets make that work [20:38:51] will that fit with your scheme? [20:38:54] ooo peering = pink [20:38:54] win ! [20:38:59] yes! [20:39:06] though replacing the fibers for that seems silly. [20:39:06] hashar: let's see if that works [20:39:15] the pink color for fibers ? :D [20:39:21] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3881 [20:39:23] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3881 [20:39:27] hehe the patchset, but i approve of pink fiber [20:40:07] RobH: it might not be that silly if it makes them all the same colors and save up lot of time to human beings [20:40:18] RobH: you might have a fast return on investment :-] [20:40:30] albeit in this case it is unlikely, it is not impossible though [20:41:08] hashar: trying the latest puppet run on gallium now.... [20:41:35] so far looking good ... [20:41:37] yeah spying you with watch -n 1 'w | egrep ^root' [20:41:38] created the user ... [20:41:39] hehehe [20:41:53] yay ! 
[20:41:57] it works [20:42:07] happy dance [20:42:22] good [20:42:34] need you on gallium though [20:42:36] for some quick fixes [20:42:48] chgrp -R jenkins /var/lib/jenkins/.git/ [20:42:57] I thought I made that recursive [20:43:11] but that directory just has chmod g+s [20:43:17] LeslieCarr: so i am moving the psw2 to cr2 fiber [20:43:19] and that does not magically fix user rights [20:43:24] but i would suggest we then connect it to psw1 [20:43:30] as psw1 is presently only on cr1. [20:43:39] psw2 wont be called psw2 when its returned, but ex4500-labs [20:43:43] yeah [20:43:45] so no reason to have it on both routers [20:44:03] so i will attach to psw1s other 10g port? [20:44:09] sure, let's hook up the other psw1 - i allocated 5/2/3 on cr2 [20:44:15] and 0/1/1 on psw1 [20:44:26] is that the old port it was on cr2 [20:44:29] or is it moving? [20:44:42] that's the allocated port, i don't believe there's anything in that [20:44:52] just that it's on 5/2/3 on cr1 and i like symmetry [20:44:58] ok, its the same, cool [20:45:14] yea it was psw2 on that port before, so its staying same in terms of physical connection, good times, running now [20:46:28] yay [20:46:58] LeslieCarr: could you run on gallium: chgrp -R jenkins /var/lib/jenkins/.git /var/lib/jenkins/jobs [20:47:30] or maybe I should make that a recursive declaration in puppet [20:47:32] sure, as soon as all my sessions come back from netflap :( [20:47:38]