[00:22:17] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:26:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 7.014 seconds [00:28:54] maplebed: thanks [00:29:05] yw. [01:02:11] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:06:14] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 4.419 seconds [01:36:18] New patchset: Lcarr; "Trying changing nagios-config dir to prevent the rewriting 50 times" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3840 [01:36:32] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3840 [01:36:53] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3840 [01:36:56] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3840 [01:41:47] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:47:56] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 2.828 seconds [02:05:56] PROBLEM - Puppet freshness on db59 is CRITICAL: Puppet has not run in the last 10 hours [02:07:35] PROBLEM - LVS HTTPS on wiktionary-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:07:35] PROBLEM - LVS HTTPS on wikiquote-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:07:35] PROBLEM - LVS HTTPS on wikimedia-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:09:32] RECOVERY - LVS HTTPS on wiktionary-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 59793 bytes in 0.776 seconds [02:09:32] RECOVERY - LVS HTTPS on wikiquote-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 52290 bytes in 0.774 seconds 
[02:09:32] RECOVERY - LVS HTTPS on wikimedia-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 80915 bytes in 0.875 seconds [02:16:53] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours [02:16:53] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours [02:23:47] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:27:50] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 5.938 seconds [02:28:53] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours [02:28:53] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours [02:50:29] RECOVERY - check_job_queue on spence is OK: JOBQUEUE OK - all job queues below 10,000 [04:48:44] RECOVERY - Disk space on search1022 is OK: DISK OK [04:53:53] PROBLEM - Disk space on search1021 is CRITICAL: DISK CRITICAL - free space: /a 3325 MB (2% inode=99%): [04:55:59] PROBLEM - Disk space on search1022 is CRITICAL: DISK CRITICAL - free space: /a 3326 MB (2% inode=99%): [05:13:59] PROBLEM - Puppet freshness on sq34 is CRITICAL: Puppet has not run in the last 10 hours [06:11:27] PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Puppet has not run in the last 10 hours [06:25:51] PROBLEM - Disk space on search1022 is CRITICAL: DISK CRITICAL - free space: /a 3298 MB (2% inode=99%): [06:45:36] PROBLEM - Disk space on search2 is CRITICAL: DISK CRITICAL - free space: /a 3239 MB (2% inode=99%): [07:00:59] PROBLEM - Disk space on search7 is CRITICAL: DISK CRITICAL - free space: /a 787 MB (0% inode=99%): [10:03:29] PROBLEM - check google safe browsing for wikinews.org on google is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:05:26] RECOVERY - check google safe browsing for wikinews.org on google is OK: HTTP OK HTTP/1.0 200 OK - 0.150 second response time [10:30:47] PROBLEM - Lucene on search15 is 
CRITICAL: Connection timed out [10:34:41] PROBLEM - Host db1012 is DOWN: PING CRITICAL - Packet loss = 100% [11:22:23] PROBLEM - Host search1020 is DOWN: PING CRITICAL - Packet loss = 100% [11:47:35] PROBLEM - Lucene on search9 is CRITICAL: Connection refused [12:07:41] PROBLEM - Puppet freshness on db59 is CRITICAL: Puppet has not run in the last 10 hours [12:18:47] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours [12:18:47] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours [12:25:49] PROBLEM - LVS Lucene on search-pool2.svc.pmtpa.wmnet is CRITICAL: Connection timed out [12:27:37] RECOVERY - LVS Lucene on search-pool2.svc.pmtpa.wmnet is OK: TCP OK - 0.002 second response time on port 8123 [12:30:28] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours [12:30:28] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours [12:59:30] New review: Demon; "(no comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/3733 [13:14:45] !log restarting puppet/puppetmaster on stafford to experiment with report settings [13:14:47] Logged the message, Master [13:28:57] New patchset: Jgreen; "disabled puppet client logging, so long unused yaml cruft" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3858 [13:29:10] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3858 [13:29:38] New review: Jgreen; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3858 [13:29:40] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3858 [13:34:31] RECOVERY - Lucene on search15 is OK: TCP OK - 8.996 second response time on port 8123 [13:49:08] PROBLEM - Lucene on search15 is CRITICAL: Connection timed out [13:54:23] PROBLEM - LVS Lucene on search-pool2.svc.pmtpa.wmnet is CRITICAL: Connection timed out [13:56:11] RECOVERY - LVS Lucene on search-pool2.svc.pmtpa.wmnet is OK: TCP OK - 0.011 second response time on port 8123 [14:06:59] PROBLEM - LVS Lucene on search-pool2.svc.pmtpa.wmnet is CRITICAL: Connection timed out [14:10:53] RECOVERY - LVS Lucene on search-pool2.svc.pmtpa.wmnet is OK: TCP OK - 0.002 second response time on port 8123 [14:26:29] RECOVERY - Disk space on search7 is OK: DISK OK [14:26:38] RECOVERY - Disk space on search2 is OK: DISK OK [14:34:44] !log lucene hosed on search9 and search15. restarting, then will look after cause [14:34:45] Logged the message, King of Search [14:35:20] RECOVERY - Lucene on search9 is OK: TCP OK - 0.007 second response time on port 8123 [14:36:25] !log got virt1001 to pxe, but dhcp doesnt know how to handle, need subnet details. [14:36:26] Logged the message, RobH [14:36:35] notpeter: i just have it say my name so i know it worked. [14:37:53] RECOVERY - Lucene on search15 is OK: TCP OK - 0.005 second response time on port 8123 [14:38:09] restartin morebots. [14:38:20] grrr [14:38:58] ok, come on back stupid bot. 
[14:39:21] !log restarted morebots in screen on wikitech, no longer as catrope, as roan has root on that box [14:39:23] Logged the message, RobH [14:39:30] notpeter: you are now simply notpeter ;] [14:52:22] I run it as root over there [14:52:58] I don't know the catrope convention [14:53:22] hehe [14:53:36] It's just that I was always the one that restarted it [14:53:43] Cause it would tend to die during European hours [14:53:58] oh [14:54:09] I restarted it yesterday, it had been dead for a day and something [14:54:19] but people (including me) just kept logging without noticing [14:54:57] I wonder why it died again so soon [15:01:13] i did it [15:01:21] to change peter's reply to logging [15:01:26] (just today) [15:01:46] ah ok then [15:06:31] is there a way that RT emails me when a ticket changes? [15:06:56] yep, if you are a requestor [15:07:06] cc, or admin cc [15:07:13] you can add yourself to a ticket via the 'people' tab [15:07:15] I am on multiple tickets and I never get emails [15:07:19] oh? [15:07:25] I get emails =/ [15:07:33] your account has a valid email address in the about right? [15:07:38] yes it does [15:07:42] preferences i mean, hrmm [15:07:51] it does, i checked it [15:07:54] and under mail you have it set to individual?
[15:07:59] yes [15:08:26] i would suggest dropping a new ticket in the ops-requests describing the issue then, cuz it should be emailing you [15:08:34] for example, our conversation about oxygen does not show up in my email [15:08:37] i get emails just fine, so someone is going to have to trace it out [15:08:47] i just manually check the ticket and see if anything changed [15:08:51] yea that sucks [15:15:39] PROBLEM - Puppet freshness on sq34 is CRITICAL: Puppet has not run in the last 10 hours [15:17:09] PROBLEM - Disk space on srv223 is CRITICAL: DISK CRITICAL - free space: / 90 MB (1% inode=61%): /var/lib/ureadahead/debugfs 90 MB (1% inode=61%): [15:21:03] PROBLEM - Disk space on srv221 is CRITICAL: DISK CRITICAL - free space: / 101 MB (1% inode=61%): /var/lib/ureadahead/debugfs 101 MB (1% inode=61%): [15:29:36] PROBLEM - Disk space on srv224 is CRITICAL: DISK CRITICAL - free space: / 245 MB (3% inode=62%): /var/lib/ureadahead/debugfs 245 MB (3% inode=62%): [15:33:48] PROBLEM - Disk space on srv224 is CRITICAL: DISK CRITICAL - free space: / 206 MB (2% inode=62%): /var/lib/ureadahead/debugfs 206 MB (2% inode=62%): [15:33:57] PROBLEM - Disk space on srv223 is CRITICAL: DISK CRITICAL - free space: / 199 MB (2% inode=61%): /var/lib/ureadahead/debugfs 199 MB (2% inode=61%): [15:35:37] Reedy: looks like we have a lot of full servers [15:35:45] RECOVERY - Disk space on srv221 is OK: DISK OK [15:35:54] RECOVERY - Disk space on srv224 is OK: DISK OK [15:35:56] Yeah, it happens quite regularily [15:36:03] RECOVERY - Disk space on srv223 is OK: DISK OK [15:36:08] There's a cronjob to clean some of it up [15:37:05] would be better if that was just not filling up so easily [15:37:22] and it is filling / …. [15:37:25] bah [15:39:53] the apaches only have / i think. [15:40:11] they are just stupid processing nodes, in the past it was discussed how we could make them diskless ;] [15:40:22] but adding the ram to do that was, at the time, more expensive than the disks. 
[15:40:45] i mean apaches have no data partition other than / (i said it wrong before) [15:42:08] some, but not all have /a mostly empty [16:08:42] RobH: hey :) we can find the subnet info via the routers :) [16:12:17] PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Puppet has not run in the last 10 hours [16:24:10] I am going to be replacing msw-d1-pmtpa. Affecting ssl3/4 , es3/4, ms-fe1, locke, ms-be1, labstore 3/4, db60, db9, cr2. Is anyone working on any of these? [16:26:53] not i [16:34:07] !log replacing msw-d1-pmtpa per rt2639 [16:34:10] Logged the message, Master [16:37:38] PROBLEM - Host ps1-d1-pmtpa is DOWN: PING CRITICAL - Packet loss = 100% [16:39:35] PROBLEM - Router interfaces on cr2-pmtpa is CRITICAL: CRITICAL: host 208.80.152.197, interfaces up: 87, down: 1, dormant: 0, excluded: 0, unused: 0BRfxp0: down - BR [16:41:23] RECOVERY - Host ps1-d1-pmtpa is UP: PING OK - Packet loss = 0%, RTA = 3.69 ms [16:41:41] RECOVERY - Router interfaces on cr2-pmtpa is OK: OK: host 208.80.152.197, interfaces up: 89, down: 0, dormant: 0, excluded: 0, unused: 0 [16:46:08] hashar: any new patchsets need merging ? 
[16:46:46] LeslieCarr: https://gerrit.wikimedia.org/r/#change,3733 [16:46:56] LeslieCarr: you might want to have a review at it too [16:47:17] that is kind of a blocker for Jenkins continuous integration project [16:47:30] double checking you're not giving yourself root ;) [16:47:37] !log msw1-d1-pmtpa replacement complete [16:47:37] please note it add a nasty setgid bit on /var/lib/jenkins/jobs (IIRC) [16:47:39] Logged the message, Master [16:47:55] i just want to know who came up with the class name of admins::analinterns [16:48:57] git blame ;) [16:49:07] I love running blame to find out random things [16:49:32] Occasionally it allows you to write impressive commit messages like "fix bug X that had been present since 2005" [16:50:44] aahaha [16:51:10] I find fixing and reviewing old bug a lot more rewarding [16:51:45] LeslieCarr: admins::analinterns should probably be renamed "admins::colon" [16:52:09] RoanKattouw: gerrit 2.3 has the concept of drafts changes :D [16:52:30] RoanKattouw: which nobody has to review. That could probably be used as some kind of staging area for people to prepare their changes [16:53:17] Yeah I saw mention of that in the R-N [16:53:35] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3733 [16:53:38] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3733 [16:53:59] LeslieCarr: can you apply it on gallium ? [16:54:10] needy ! [16:54:14] LeslieCarr: will only be able to review it later tonight though [16:54:30] cmjohnson1: ping me when you got a sec [16:54:47] it's tie to reseat the usb sms thing on spence again :/ [16:54:53] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:55:42] notpeter: give me about 5 [16:55:43] RoanKattouw: I have CR +2 your click tracking hack https://gerrit.wikimedia.org/r/#change,3863 [16:55:52] cmjohnson1: cool. 
no particular hurry [16:55:57] Thanks [16:55:59] RoanKattouw: you even have one inline comment [16:56:47] and I get commit guidelines at https://www.mediawiki.org/wiki/Git/Commit_message_guidelines ;-D [16:56:59] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 7.452 seconds [16:57:29] I am out hacking place is closing [16:57:36] LeslieCarr: will review merge on gallium later tonight. [16:57:39] cya [16:58:09] bye [17:00:58] New patchset: Lcarr; "Fixing jenkins group" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3865 [17:01:13] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3865 [17:02:29] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3865 [17:02:32] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3865 [17:05:09] notpeter: i reseated the usb....did it work? [17:07:08] New patchset: Lcarr; "trying to fix again" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3866 [17:07:22] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3866 [17:07:29] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3866 [17:07:31] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3866 [17:08:05] cmjohnson1: hrm, not so far [17:08:09] I shall fiddle with it [17:08:23] but I think that there will not be need for more physical intervention [17:08:27] thank you! [17:09:09] okay...let me know [17:09:27] yep. thank you [17:10:57] oop, re. fenari load that's me [17:11:21] New patchset: Lcarr; "fixing jenkins" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3867 [17:11:34] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3867 [17:12:06] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3867 [17:12:09] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3867 [17:16:51] New patchset: Lcarr; "try 20 to fix jenkins" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3868 [17:17:05] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3868 [17:17:16] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3868 [17:17:18] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3868 [17:29:11] RobH: how long do you expect to have the swift backend host down for (to play with firmware)? [17:29:33] If you don't intend to leave it down overnight, I won't take it out of the rings - just do a normal shutdown on the host. [17:29:42] if you want it for a few days I'll take it out of the rings. [17:31:37] maplebed: i am not sure if the firmware i have will work, so best to take the one that rebooted most out of rotation [17:31:46] and i cannot do it right this second, so out of ring is indeed best i think [17:31:51] is that ok? [17:32:00] oh, wait [17:32:08] if i take it down, and it works, it will be hours not days [17:32:13] well, if it's down less than a day, then you don't need me. [17:32:16] so just pulling it with normal shutdown is ok i suppose. [17:32:18] you just shut it down whenever you're ready. [17:32:19] awesome! [17:32:33] ok, i didnt want to assume [17:32:38] thx! [17:32:41] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:32:42] I appreciate the confirmation. 
[17:38:50] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.036 seconds [17:45:21] maplebed ^ [17:45:44] woosters: the recovery? [17:46:05] tfinc / preilly told me they need some help to push their varnish stuff to production [17:46:20] can u pls take a look? [17:46:36] woosters: yes [17:46:44] indeed we do and next time we'll try and anticipate such low level changes sooner [17:46:45] if there's documentation I can follow, sure. [17:46:46] woosters: but, also need to test it [17:48:20] New patchset: Bhartshorne; "adding dsc to shell access on bastions and analytics hosts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3870 [17:48:33] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3870 [17:49:21] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3870 [17:49:24] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3870 [17:59:06] dschoon: would you attempt to log in to bast1001.wikimedia.org for me to verify I added your account right? [17:59:18] will do. [17:59:46] er, crap. [18:00:02] i'm on a different machine atm. i need to copy my wmf keys. [18:00:16] i'll get back to you. :) [18:11:44] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:17:53] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.030 seconds [18:20:25] notpeter: fenari:~$ tail -f /tmp/fire_in_the_hole-20120328-181334 [18:20:40] that's the search API tester running [18:23:24] Jeff_Green: if people are getting on your case about fenari you could join me on iron (so long as eqiad and no nfs is ok) [18:23:37] ohrly [18:23:53] no nfs is good, because nfs destroys fenari regularly [18:24:05] iron eh? 
[18:24:15] it does mean, of course that stuff you put there isn't backed up etc. [18:24:27] but as a working host, it's running my swift cleaner nicely. [18:24:30] oh that doesn't matter [18:24:59] i have long considered production hosts to be subject to instantaneous disappearance [18:25:04] New patchset: Bhartshorne; "adding a command to send to an email-to-sms gateway" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3872 [18:25:17] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3872 [18:25:40] notpeter: ^^^ first step... [18:26:03] the amusing thing is that if I run this script fast enough to be noticeable on fenari, it may destroy lucene [18:26:06] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3872 [18:26:09] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3872 [18:30:00] notpeter: when you've been working on nagios in the past, how do you trigger a page for just you to test the change? [18:33:20] maplebed: I have echoed straight to the sms daemon [18:33:22] such as [18:33:42] ok. [18:33:43] echo "test" | /usr/local/bin/gammu-smsd-inject TEXT +1 [18:33:47] I want nagios to trigger it though... [18:33:51] hrmp.h [18:34:01] I may have to configure a new service to use as my test. [18:34:08] hhhmmm, yeah [18:34:09] so much work! [18:34:20] feel free to play with the eqiad search vips [18:34:21] for this [18:34:31] they will page [18:34:34] and are not in use yet [18:34:51] nah... I think I'll create a service that tests for a specific file on spence or something that I can easily move around to trigger alerts and recoveries. [18:36:11] I think that will be after lunch. [18:36:16] okie dokie [18:43:01] LeslieCarr: so somehow Jenkins broke? 
:-D [18:44:12] yeah, the group seems to think it's specified multiple times [18:44:16] i tried some fixes but they didn't work [18:44:28] and then i was like "gah, need to get these other 50 things worked on" :( [18:44:28] it probably is, I will try to have a look at it [18:44:46] well at least you tried! =:D [18:45:03] I will investigate the issue a bit [18:46:32] PROBLEM - Router interfaces on cr1-sdtpa is CRITICAL: CRITICAL: host 208.80.152.196, interfaces up: 73, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-0/0/1: down - Core: cr1-eqiad:xe-5/2/1 (FPL/GBLX, CV71026) [10Gbps wave]BR [18:51:38] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:52:36] New patchset: Pyoungmeister; "simplifying some search" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3873 [18:52:50] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3873 [18:57:02] RECOVERY - Router interfaces on cr1-sdtpa is OK: OK: host 208.80.152.196, interfaces up: 75, down: 0, dormant: 0, excluded: 0, unused: 0 [18:57:47] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 5.742 seconds [19:08:31] RobH: are you with chad in eqiad ? [19:29:29] hashar: Yep [19:29:35] i put him to work with power stuff in new row [19:29:42] he is mounting all the new power strips and grounding them [19:30:10] if you need him, i can get him setup on wifi, but i am jealously guarding access to him ;] [19:31:21] * schoolcraftT slaps Thehelpfulone with a large smelly trout [19:31:40] hashar: if you just need to chat with him he can use my computer though easy. 
[19:32:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:35:58] this will be a couple extra minutes, this was the too short 7m fiber that i was due to swap [19:36:10] so gotta pull and label a new fiber, so the fiber # will change on both ends too [19:36:23] bah, wrong channel, but i guess we should work in heere [19:36:26] LeslieCarr: ^ [19:36:32] shame on me for working in private ;] [19:36:44] ok [19:36:57] after this fiber, lunch for me [19:37:26] 5/3/3 is okay to swap since nothing is on psw2 [19:38:05] ok, this fiber is now # 2651 if ya wanna update [19:38:43] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.021 seconds [19:43:00] cool [19:43:06] lemme know when you're done [19:45:54] 5/3/0 done LeslieCarr [19:46:03] new cable # 2651 [19:46:21] dont forget to update label on cr2 ;] [19:46:33] you may already have [19:46:42] cool [19:46:44] yep [19:46:53] ok, i'm gonna find lunch, do 5/3/3 at your leisure [19:47:02] oh cr2 label, yes thanks [19:47:17] thx, you have time for more of this after lunch? 
[19:47:46] its not exactly breaking anything with the old runs, they just suck ;] [19:48:15] have a nice lunch =] [19:49:00] yeah, i have time for fixing network stuffs [19:49:10] just because it's the network :p [19:49:15] heh [19:49:28] want yer pretty fibers in their own special run dontcha ;] [20:00:56] RECOVERY - Host search1020 is UP: PING OK - Packet loss = 0%, RTA = 26.75 ms [20:12:19] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:15:35] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3873 [20:15:38] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3873 [20:18:37] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 9.555 seconds [20:21:53] yep [20:22:01] purple fibers, pink conduit [20:22:29] I'm not a big fan of pink unless it's flaming hot pink [20:22:33] in which case I'm in [20:23:22] oh yeah [20:23:24] hot pink [20:23:30] not cupcake pink [20:23:34] that's only good for cupcakes [20:23:38] good. no pastels [20:26:24] maybe fuschia instead [20:29:13] !log restarted search1020. nothing conspicuous in logs [20:29:15] Logged the message, notpeter [20:30:20] fuschia is nice [20:30:57] the plant and the color [20:31:37] mauve. [20:32:07] New patchset: Hashar; "create jenkins using user() and crafted group" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3881 [20:32:21] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3881 [20:32:51] New review: Hashar; "This caused issue with jenkins group being created twice :(" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3733 [20:33:21] RobH: fine. 
Just says Chad hello from me please :-] [20:33:54] mauve is just boring [20:34:18] New review: Hashar; "The issue reported by Leslie was:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3733 [20:34:32] LeslieCarr: but it's interesting because it's boring and annoying *at the same time* [20:36:06] LeslieCarr: Jenkins group on gallium might be fixed with https://gerrit.wikimedia.org/r/3881 ;) [20:37:53] you are not putting pink anything in my datacenter. [20:38:05] we already have a color scheme. [20:38:06] ;p [20:38:07] RobH: you're no fun [20:38:13] agreed with peter [20:38:14] :( [20:38:27] i will allow only one color change, if you wish to have peering fiber have different color. [20:38:42] fuchsia mauve whatever, they all look like pink to me! [20:38:43] black with black stripes [20:38:43] otherwise all normal traffic fibers are already yellow ;p [20:38:43] hashar: lets make that work [20:38:51] will that fit with your scheme? [20:38:54] ooo peering = pink [20:38:54] win ! [20:38:59] yes! [20:39:06] though replacing the fibers for that seems silly. [20:39:06] hashar: let's see if that works [20:39:15] the pink color for fibers ? :D [20:39:21] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3881 [20:39:23] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3881 [20:39:27] hehe the patchset, but i approve of pink fiber [20:40:07] RobH: it might not be that silly if it makes them all the same colors and save up lot of time to human beings [20:40:18] RobH: you might have a fast return on investment :-] [20:40:30] albeit in this case it is unlikely, it is not impossible though [20:41:08] hashar: trying the latest puppet run on gallium now.... [20:41:35] so far looking good ... [20:41:37] yeah spying you with watch -n 1 'w | egrep ^root' [20:41:38] created the user ... [20:41:39] hehehe [20:41:53] yay ! 
[20:41:57] it works [20:42:07] happy dance [20:42:22] good [20:42:34] need you on gallium though [20:42:36] for some quick fixes [20:42:48] chgrp -R jenkins /var/lib/jenkins/.git/ [20:42:57] I though I made that recursive [20:43:11] but that directory just have chmod g+s [20:43:17] LeslieCarr: so i am moving the psw2 to cr2 fiber [20:43:19] and that does not magically fix user rights [20:43:24] but i would suggest we then connect it to psw1 [20:43:30] as psw1 is presently only on cr1. [20:43:39] psw2 wont becalled psw2 when its returned, but ex4500-labs [20:43:43] yeah [20:43:45] so no reason to have it on both routers [20:44:03] so i will attach to psw1s other 10g port? [20:44:09] sure, let's hook up the other psw1 - i allocated 5/2/3 on cr2 [20:44:15] and 0/1/1 on psw1 [20:44:26] is that the old oprt it was on cr2 [20:44:29] or is it moving? [20:44:42] that's the allocated port, i don't believe there's anything in that [20:44:52] just that it's on 5/2/3 on cr1 and i like symmetry [20:44:58] ok, its the same, cool [20:45:14] yea it was psw2 on that port before, so its staying same in terms of physical connection, good times, running now [20:46:28] yay [20:46:58] LeslieCarr: could you run on gallium: chgrp -R jenkins /var/lib/jenkins/.git /var/lib/jenkins/jobs [20:47:30] or maybe I should make that a recursive declaration in puppet [20:47:32] sure, as soon as all my sessions come back from netflap :( [20:47:38] ;) [20:50:07] New patchset: Catrope; "Fixes for l10nupdate" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3885 [20:50:21] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3885 [20:51:47] hashar: damn there are a lot of files there … it's working on it [20:52:37] LeslieCarr: there is few thousands MediaWiki fetches :) [20:52:45] and at least as much test results [20:52:53] LeslieCarr: ok, its plugged in [20:53:43] cr1:5/3/1 should be next, it shows orange/down led [20:53:48] orange error even [20:54:07] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:54:34] PROBLEM - Disk space on srv221 is CRITICAL: DISK CRITICAL - free space: / 166 MB (2% inode=61%): /var/lib/ureadahead/debugfs 166 MB (2% inode=61%): [20:54:35] RobH: can you grant yuvipanda author status on the blog please [20:54:46] hashar: i told him, he stared at me blankly [20:54:48] ;] [20:55:28] oh my god [20:55:48] tfinc: done, i dont mind doing these but if someone gets mad they are gonna be pointed right at you my friend [20:56:02] thats fine [20:56:05] heh [20:57:44] LeslieCarr: lemme know when ya ready to keep migratin fiber =] [20:58:04] it looks lik eyou took down 5/3/1 already [20:58:08] xe-5/3/1 down down RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 6.377 seconds [20:58:18] oh 5/3/1 is the xo [20:58:21] it's not up yet [20:58:23] New patchset: Catrope; "Fixes for l10nupdate" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3885 [20:58:28] oh, so i can totally move it now then =] [20:58:30] right? [20:58:37] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3885 [20:58:41] yeah [20:58:42] moving it now [21:00:20] cr1 is starting to look much more squared away [21:00:41] hey xo just called [21:00:52] RECOVERY - Disk space on srv221 is OK: DISK OK [21:01:09] oh paperwork [21:09:04] RobH: hey do you handle x-connect orders in eqiad ? 
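A minimal sketch of what the `chgrp -R jenkins /var/lib/jenkins/.git /var/lib/jenkins/jobs` run discussed from 20:42:48 is doing, and why it was needed: as noted in the channel, `chmod g+s` on a directory does not fix the group of entries that already exist, so those have to be walked and changed one by one. The helper name `chgrp_recursive` is hypothetical, the caller is assumed to have permission to change group ownership, and this sketch changes symlinks themselves rather than their targets.

```python
import os


def chgrp_recursive(root, gid):
    """Set the group of root and everything below it to gid.

    Hypothetical helper for illustration. Returns how many entries
    actually had their group changed, so a second run reports 0.
    """
    # Collect the root itself plus every directory and file beneath it.
    paths = [root]
    for dirpath, dirnames, filenames in os.walk(root):
        paths.extend(os.path.join(dirpath, name) for name in dirnames + filenames)

    changed = 0
    for path in paths:
        if os.lstat(path).st_gid != gid:
            # uid -1 leaves the owner untouched; only the group changes.
            os.chown(path, -1, gid, follow_symlinks=False)
            changed += 1
    return changed
```

With a few thousand fetches and test results under those directories, as mentioned at 20:52:37, a walk like this touches every entry, which is consistent with the command taking a while.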
[21:10:14] hashar: that's all done now [21:11:52] LeslieCarr: so whats the deal with that, we are in negotiation on the xo link? [21:12:05] LeslieCarr: there is no port ont he dmarc for the xo link yet [21:12:06] great [21:12:23] LeslieCarr: I can do them, I think I also gave you permissions to do them as well [21:12:34] ah yeah, stupid portal stuff [21:12:36] let me see ... [21:12:42] LeslieCarr: something I noticed is that I have not been added to the jenkins group despite class admins::jenkins (in manifests/admins.pp ) [21:12:55] well, its all run and ready to plug in once the xconnect order is fufilled [21:12:57] I must have missed something [21:13:40] I also changed my mind on fiber in 5/2/3, i can run it in the duct and just snake it out the cut on the side of it [21:13:55] being in the duct for part of the interrack run is only going to protect it, and make it look nicer. [21:14:05] so when we migrate the HE to the raceway we can do that too. [21:14:32] once those are done, rack a1 fiber is all done with raceway migration [21:14:36] RobH: don't think I have perms - can you check that out ? [21:14:38] \o/ [21:14:43] sweet :) [21:14:46] yep, can you drain HE peer for me? [21:15:08] right after you check my equinix account permissions ;) [21:15:34] it says you have cross connect permissions [21:15:43] the only one you lack is power ordering, which i am giving you now anyhow [21:16:38] grrr, login through the "equinix direct" portal ? [21:16:46] portal.equinix.com [21:16:50] menu item services, cross connect [21:16:57] why the fuck does equinix have 20 different portals [21:16:58] rr [21:17:00] if its still not working, something is messed up, and i am happy to place it [21:17:08] i dont have access to their exchange portal like you [21:17:12] though i would like to ;] [21:17:49] which cabinet RobH ? [21:18:42] New patchset: Bhartshorne; "adding in a check to spence that is useful for checking that paging works." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/3887 [21:18:56] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3887 [21:19:13] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3887 [21:19:16] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3887 [21:19:22] the xo is running to a1, which is eq 101 [21:19:31] cuz we like letter rows =P [21:19:44] !!! [21:19:45] grrrrr [21:19:57] "select a patch panel" "no patch panel found" "value is required" [21:19:57] grrr [21:20:00] fyi: they all correspond to eq #, a1=101, a8=108, b1-201, b8-208 [21:20:10] wait [21:20:13] they dont run shit to our racks [21:20:20] we just need them to terminate on the dmarc [21:20:36] so, 0000 ? [21:20:42] yep! [21:20:54] sorry, following along on the form now ;] [21:21:03] 0000 would denote our dmarc [21:21:13] circuit ID ? [21:21:17] it is patch panel mounted on cage wal in a lockbox [21:21:18] got a number in mind ? [21:21:41] oh, they have to make a port [21:21:49] thats what you need i assume, lemme see what they have # [21:22:30] oh i mean circuit id like have you already decided a number ? [21:22:33] or else i'll make it up [21:22:33] oh i have no goddamn idea, the port # we hooked up for EQ peer ia 328187 so some such [21:22:36] oh, the cable #? [21:22:42] yeah [21:22:54] 2009 [21:23:26] its run to the panel and just hanging free since equinix has not provided a port(s) yet [21:23:37] the patch panel is blank, the ethernet term is keystone jack [21:23:40] i have no idea the fiber term [21:23:45] same? [21:23:54] so they need to add the coupler [21:23:56] yay order completed [21:24:18] drain HE so a1 will be completed ;] [21:24:45] when you do that i will also be d/c cr1:5/2/3 [21:24:56] :p [21:24:57] and making it look purdy. 
[21:25:03] ok fine [21:25:08] i did promise [21:25:10] so in the end does it matter whether the interface is disabled or the bgp neighbor is? [21:25:15] you never saw how horrid it was, or you would be very happy. [21:25:17] (and then I am going to bed :-P) [21:25:19] mark_ may do cartwheels. [21:25:37] well on HE i assume we need to drain the interface on psw1 [21:25:38] well first want to disable bgp neighbor [21:25:42] allow existing traffic to still go through [21:25:46] but i dunno [21:25:47] just new stuff won't as he's routes update [21:26:16] i guess no matter what HE will get an alarm eh? ;] [21:26:28] yes, but this way we don't fuck over traffic as much [21:26:51] ok, go for it rob [21:26:56] both are good to go? [21:27:12] 5/2/3 is good to go (the l3 config is on cr1) [21:27:29] ok have fun with the fibers, I'm off [21:27:32] see folks tomorrow [21:27:37] see you :) [21:28:43] damn, i fubared the run on psw1 link to cr2 [21:28:51] went over a nylon tie, not under [21:28:55] i can move that now too right? [21:29:03] as it has no traffic ;] [21:29:24] also good to pull psw1 to HE connection right ? [21:29:24] [21:29:24] [21:29:33] damn colloquy bug =P [21:29:49] yep [21:29:50] LeslieCarr: ^ [21:29:52] good to pull it all [21:29:55] cool, fixing them, thx! [21:30:02] right now it's basically all the same thing [21:34:37] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:39:10] LeslieCarr: all done, can reenable [21:39:48] reenabled [21:40:05] a1 looks awesome. [21:40:42] yay [21:40:46] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.039 seconds [21:40:46] pics or it didn't happen [21:41:59] i took em, will email em later =] [21:42:22] LeslieCarr: so i will leave this up to you, i am going to be back down here again tomorrow. 
i will be here for another 1h15m today [21:42:28] we can start in on cr2 fibers now [21:42:37] or you can reclaim the rest of your day and we do later ;] [21:42:55] since we are at an excellent stopping point if you have other shit to get done today ;] [21:43:19] we either have 4 or 5 left, cannot really tell [21:43:45] nm, 4 left. [21:44:04] let's keep going on most for now, i would like to hold off on the transit/peering until the new one is up [21:44:21] ahh we do indeed have transit on cr2 [21:44:31] yes [21:44:33] ok, so let me know what port to pull next on cr2 [21:44:33] our main one [21:44:50] actually, let me change locations, give me a minute [21:44:56] k, i snag a water [21:47:02] back [21:48:31] Hi Leslie, do you know whether varnish/squid support logging of the x-wap-profile http header? [21:48:55] don't know [21:49:03] :( [22:00:02] RobH: ready ? [22:03:21] LeslieCarr: back, yep! [22:03:50] next port? [22:03:58] let's do it :) lemme turn down cr2 5/0/0 [22:04:18] go on 5/0/0 [22:04:49] cool, migrating now [22:05:43] heh, we did that one already [22:06:05] lemme get the ones we didnt do [22:06:21] oh [22:06:22] hehe [22:07:04] LeslieCarr: would you mind flushing the mobile varnish cache again? you will be handsomely rewarded [22:07:12] 5/0/1, 5/1/1, 5/2/1, 5/3/1 [22:07:37] awjr , i don't see a glass of scotch in front of me .... [22:07:52] RobH: go for 5/0/1 [22:08:04] ok, 5/0/1 migrating now [22:08:27] LeslieCarr: yet [22:08:31] despite that, i flushed cache [22:08:41] i'm sitting on the couch … and laphroaig is on my desk [22:08:43] just a hint ;) [22:09:15] LeslieCarr: do you already have a glass? [22:09:16] PROBLEM - Puppet freshness on db59 is CRITICAL: Puppet has not run in the last 10 hours [22:09:34] no glass [22:09:46] rocks? 
[22:10:46] one [22:10:54] even though Ironholds wants to kill me for that [22:15:16] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:15:30] awjr: now gets unlimited mobile cache purges [22:20:22] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours [22:20:22] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours [22:21:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 2.874 seconds [22:23:09] \o/ [22:23:32] LeslieCarr: there is no shame in scotch on the singular rock [22:23:34] LeslieCarr: done [22:23:55] well, you can put one ice cube and a splash of water. [22:24:04] the water splash is required even on fine scotches to add depth. [22:24:53] LeslieCarr: we prolly have time for one more [22:25:02] i am out of here at 7pm on the dot, as i have poor chad with me [22:25:07] hehehe [22:25:09] and im buying him some 5guys [22:25:15] that's super nice of you [22:25:22] 5/1/0 ? [22:25:56] 5/1/0 is direct attached copper sfp [22:26:03] so it doesnt need to go in ducting, so its ok. [22:26:26] 5/1/1, 5/2/1, 5/3/1 are all the remaining fiber [22:26:31] i know one of those is transit though. [22:27:59] ok [22:28:05] 5/1/1 [22:28:11] where does that run to? [22:28:16] 1 sec lemme disable [22:28:21] just curious before i go pulling [22:28:22] asw-b-eqiad:xe-8/1/2 [22:28:29] ahh, short run, cool. 
[22:28:32] lemme know when to pull [22:29:09] pull it [22:29:34] ok, migrating now [22:32:13] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours [22:32:13] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours [22:39:41] LeslieCarr: done [22:39:49] ok, thats it for fiber tonight, we can move the last two later [22:39:57] i need to clean this place up before i leave =] [22:40:02] LeslieCarr: thanks for helping =] [22:40:14] cool [22:40:18] i guess the last two are the wave and transit right? [22:40:28] yep [22:40:33] the transit i really don't want to move yet [22:40:36] until we get the xo up [22:51:49] NP [22:51:51] np even [22:55:49] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:01:46] there's a sync-common process on mw22 that has been running since Feb 14 [23:02:12] and an apt-get that has been running since march 17 [23:02:34] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 7.436 seconds [23:03:00] no parent process for either of them [23:10:28] hung apt-get processes: http://paste.tstarling.com/p/zSuJAQ.html [23:20:32] actually almost all of those are zombies of hung puppet parents, only a few are still alive [23:22:38] !log cleaned up stuck apt-get processes on mw22,mw66,srv193,srv250,srv253 [23:22:40] Logged the message, Master [23:25:12] !log cleaned up stuck apt-get process on srv236 [23:25:14] Logged the message, Master [23:27:03] !log running apt-get upgrade on mw22,mw66,srv193,srv250,srv253,srv236 [23:27:05] Logged the message, Master [23:33:57] PROBLEM - Apache HTTP on srv236 is CRITICAL: Connection refused [23:36:03] RECOVERY - Apache HTTP on srv236 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.035 second response time [23:36:39] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:42:11] root@srv193:~# apache-start 
[23:42:11] System failed sanity check: VIP not configured on lo [23:42:48] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.055 seconds