[00:08:40] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours [00:08:40] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours [00:10:19] PROBLEM - Disk space on srv221 is CRITICAL: DISK CRITICAL - free space: / 279 MB (3% inode=61%): /var/lib/ureadahead/debugfs 279 MB (3% inode=61%): [00:19:37] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours [00:19:37] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours [00:20:49] RECOVERY - Disk space on srv221 is OK: DISK OK [00:29:13] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 182 seconds [00:31:19] RECOVERY - MySQL Slave Delay on db1035 is OK: OK replication delay 0 seconds [01:57:58] PROBLEM - Disk space on stafford is CRITICAL: DISK CRITICAL - free space: /var/lib/puppet 760 MB (3% inode=92%): [02:42:49] PROBLEM - MySQL Replication Heartbeat on db1018 is CRITICAL: CRIT replication delay 231 seconds [02:43:07] PROBLEM - MySQL Slave Delay on db1018 is CRITICAL: CRIT replication delay 239 seconds [02:46:34] PROBLEM - Puppet freshness on brewster is CRITICAL: Puppet has not run in the last 10 hours [02:51:13] RECOVERY - MySQL Replication Heartbeat on db1018 is OK: OK replication delay 0 seconds [02:51:31] RECOVERY - MySQL Slave Delay on db1018 is OK: OK replication delay 0 seconds [03:24:31] PROBLEM - Puppet freshness on db59 is CRITICAL: Puppet has not run in the last 10 hours [03:24:49] PROBLEM - Disk space on srv223 is CRITICAL: DISK CRITICAL - free space: / 192 MB (2% inode=61%): /var/lib/ureadahead/debugfs 192 MB (2% inode=61%): [03:35:28] RECOVERY - Disk space on srv223 is OK: DISK OK [03:40:45] New review: Tim Starling; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2675 [04:03:20] PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Puppet has not run in the last 10 hours [04:28:50] RECOVERY - Disk space on stafford is OK: DISK OK [04:53:19] PROBLEM - Disk space on search1022 is CRITICAL: DISK CRITICAL - free space: /a 4135 MB (3% inode=99%): [04:55:25] PROBLEM - Disk space on search1021 is CRITICAL: DISK CRITICAL - free space: /a 3747 MB (3% inode=99%): [05:08:01] PROBLEM - Disk space on search1022 is CRITICAL: DISK CRITICAL - free space: /a 4445 MB (3% inode=99%): [05:16:07] PROBLEM - Disk space on db1004 is CRITICAL: DISK CRITICAL - free space: / 284 MB (3% inode=89%): /var/lib/ureadahead/debugfs 284 MB (3% inode=89%): [05:16:43] PROBLEM - MySQL disk space on db1004 is CRITICAL: DISK CRITICAL - free space: / 284 MB (3% inode=89%): /var/lib/ureadahead/debugfs 284 MB (3% inode=89%): [06:00:53] PROBLEM - Disk space on search1021 is CRITICAL: DISK CRITICAL - free space: /a 4430 MB (3% inode=99%): [06:00:53] PROBLEM - Disk space on search1022 is CRITICAL: DISK CRITICAL - free space: /a 4430 MB (3% inode=99%): [06:22:02] PROBLEM - Disk space on search1021 is CRITICAL: DISK CRITICAL - free space: /a 3733 MB (3% inode=99%): [07:08:06] PROBLEM - Disk space on search1022 is CRITICAL: DISK CRITICAL - free space: /a 3705 MB (3% inode=99%): [07:08:06] PROBLEM - Disk space on search1021 is CRITICAL: DISK CRITICAL - free space: /a 3609 MB (3% inode=99%): [08:22:03] RECOVERY - Puppet freshness on mw19 is OK: puppet ran at Mon Mar 26 08:21:37 UTC 2012 [08:24:51] !log on several mw* boxes puppet did not run because .yaml files on the puppetmaster became corrupted. need to delete the $hostname files in /var/lib/puppet/yaml/node on stafford and re-run. puppet bug similar to http://projects.puppetlabs.com/issues/7836 [08:24:55] Logged the message, Master [08:26:33] RECOVERY - Puppet freshness on mw1073 is OK: puppet ran at Mon Mar 26 08:26:12 UTC 2012 [08:28:03] RECOVERY - Puppet freshness on mw27 is OK: puppet ran at Mon Mar 26 08:27:58 UTC 2012 [08:29:06] RECOVERY - Puppet freshness on mw30 is OK: puppet ran at Mon Mar 26 08:28:50 UTC 2012 [08:31:39] RECOVERY - Puppet freshness on mw33 is OK: puppet ran at Mon Mar 26 08:31:23 UTC 2012 [08:32:33] RECOVERY - Puppet freshness on mw45 is OK: puppet ran at Mon Mar 26 08:32:08 UTC 2012 [08:33:09] RECOVERY - Puppet freshness on mw59 is OK: puppet ran at Mon Mar 26 08:32:47 UTC 2012 [08:34:24] RECOVERY - Puppet freshness on mw72 is OK: puppet ran at Mon Mar 26 08:34:13 UTC 2012 [09:05:09] !log brewster was out of disk - deleted lighttpd access.log.1, gzipped access.log [09:05:14] Logged the message, Master [09:18:08] New patchset: Hashar; "remove +x bits from files of /srv/org/mediawiki/integration" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3433 [09:18:21] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3433 [09:31:37] New patchset: Hashar; "remove nagios bot from #wikimedia-tech" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2675 [09:31:51] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2675 [09:32:18] RECOVERY - Puppet freshness on brewster is OK: puppet ran at Mon Mar 26 09:31:54 UTC 2012 [09:32:18] RECOVERY - Squid on brewster is OK: TCP OK - 0.006 second response time on port 8080 [09:33:14] !log brewster - delete puppet lock file, restart lighttpd, puppet ... [09:33:18] Logged the message, Master [09:36:57] PROBLEM - Puppet freshness on ssl2 is CRITICAL: Puppet has not run in the last 10 hours [09:41:04] PROBLEM - Puppet freshness on ms-be3 is CRITICAL: Puppet has not run in the last 10 hours [09:41:40] RECOVERY - Puppet freshness on ssl2 is OK: puppet ran at Mon Mar 26 09:41:16 UTC 2012 [09:43:01] !log another corrupted .yaml file on ssl2 [09:43:04] Logged the message, Master [09:45:16] PROBLEM - Puppet freshness on search1016 is CRITICAL: Puppet has not run in the last 10 hours [09:58:10] RECOVERY - Puppet freshness on ms-be3 is OK: puppet ran at Mon Mar 26 09:57:41 UTC 2012 [09:58:28] PROBLEM - Puppet freshness on search1006 is CRITICAL: Puppet has not run in the last 10 hours [09:59:13] RECOVERY - Puppet freshness on db59 is OK: puppet ran at Mon Mar 26 09:59:05 UTC 2012 [09:59:36] !log ..and on ms-be-3. running puppet on db59 [09:59:39] Logged the message, Master [10:10:19] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours [10:10:19] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours [10:10:19] PROBLEM - RAID on searchidx2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:12:34] RECOVERY - RAID on searchidx2 is OK: OK: State is Optimal, checked 4 logical device(s) [10:21:16] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours [10:21:16] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours [10:34:40] mutante: apergos: hi :) Do you have any idea how I can pick a gid number for a new group ? [10:34:50] heh no :-D [10:35:25] I made some educatd guesses based on what was in the puppet files last time I needed a gid [10:35:35] going to do the same [10:35:38] 561 :-] [10:36:08] all righty then :-D [10:40:35] apergos: next question. How can I put a list of people in a group ? :-] [10:41:22] I thought about something similar to the sudo_user declarations in site.pp [10:45:37] I guess you would make a class lik for roots or restricted in admins.pp [10:46:03] or like analinterns [10:46:56] User[ [ "demon" , "hashar", "reed" ] ] { [10:46:56] groups +> 'jenkinsgroup', [10:46:56] require => Group['jenkinsgroup'] } [10:47:05] then you can include your groups::blah in it [10:47:10] that one is almost a guarantee to have mark to face palm :-) [10:47:15] oh. I like that less [10:47:36] make a class for them, maybe that' snot th ebest but at least it looks like what we already have in there [10:47:38] * hashar opens admin.pp [10:47:46] and if he wants it changed then those can get changed too [10:53:42] ::facepalm:: [10:58:51] :))) [11:15:36] New patchset: Hashar; "jenkins group for continuous integration" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3733 [11:15:47] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/3733 [11:17:12] New patchset: Hashar; "jenkins group for continuous integration" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3733 [11:17:25] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3733 [11:17:42] apergos: here we have some group madness :-] [11:20:07] good luck with that :-D [11:22:25] New patchset: Hashar; "gerrit.pp warned about invalid escape sequence" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3734 [11:22:39] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3734 [11:58:46] New patchset: Hashar; "strip long paths from puppet linter output" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3736 [11:59:00] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3736 [12:15:49] New patchset: Hashar; "makes puppet file mode always 4 digits" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3738 [12:16:03] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3738 [13:24:05] RECOVERY - MySQL disk space on db42 is OK: DISK OK [13:25:41] !log db42 was out of disk , caused by ~5G citations.csv in /tmp, gzipped the file [13:25:44] RECOVERY - Disk space on db42 is OK: DISK OK [13:25:45] Logged the message, Master [13:37:14] !log while on it, installing a whole bunch of package updates on db42 [13:37:17] Logged the message, Master [13:45:14] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 3.740 seconds [13:48:23] RECOVERY - DPKG on db42 is OK: All packages OK [13:54:41] PROBLEM - DPKG on db42 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [13:56:47] RECOVERY - DPKG on db42 is OK: All packages OK [14:01:39] "Hello, I'm looking to get most trees removed off a 3-acre property I have in Cleveland, TX." sent to noc@ . . . uh whut? [14:03:13] haha , yeah Jeff, i said the same thing "i understand spam if you wnt to sell something, but this is just _Weird_. even with the tree photos attached" [14:04:53] PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Puppet has not run in the last 10 hours [14:05:23] maybe it's a secret code [14:05:36] ohh steganography? [14:05:45] do we have a code breaking dept yet? [14:06:29] no, although I think we hav a few people interested in such things [14:06:35] I used to do applied crypto at one job [14:06:46] not peter is into it I think [14:13:54] "Stegdetect is an automated tool for detecting steganographic content in images. It is capable of detecting several different steganographic methods to embed hidden information in JPEG images." [14:14:24] and wikispecies.org should be ours and a redirect, not parked on sedo [14:14:40] was checking on "Cryptocephalus", some leaf eating beetle :p [14:14:46] heh, wikispecies is so dead no one has gone after that domain. [14:14:51] ok, back to grub2 and GPT :o [14:14:59] i guess we can ask our legal folks to go get it [14:15:04] RobH: i have some edits there :) [14:15:21] I have never met a wikispecies editor before! [14:15:28] linking taxonomic names from wiktionary to species and back :p [14:15:29] today is historic. [14:15:33] hehe [14:15:58] :) [14:16:06] ok, onsite at eqiad, gonna install the fiber ducting today, and later we migrate fibers, huzzah [14:17:08] oh [14:17:17] maybe I'll be around for some of the fibers [14:17:23] we have our meeting today though [14:17:29] which means it will likely be late [14:18:59] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:19:58] apergos, you want fibers in your home now instead of the powerline stuff:) [14:20:21] you're kidding right [14:20:53] I cna't even get a contract for better connectivity (I have 2 Mbps now I think) until I have paperwork once [14:21:32] apergos: Wait, you can't get fast internet without your immigration paperwork? [14:21:45] That sounds draconian [14:23:47] I can't sign contracts without a tax number [14:23:56] which relaly requires that I have a residence permit [14:23:58] which ... [14:24:02] there ya go [14:24:15] I could dins someone else to sign a contract but that kinda sucks [14:24:18] *find [14:25:08] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 4.144 seconds [14:25:09] Hmm [14:25:10] especially since there is still some slim possibility (I think it's unlikely but who knows right?) that they could deny me the permit at any time [14:25:15] That sounds worse that the US [14:25:17] since I've not gotten one yet [14:25:26] I don't have a SSN yet but I should be able to get internet in my apt without that [14:29:03] "grub-setup: warn: Embedding is not possible. GRUB can only be installed in this setup by using blocklists. However, blocklists are UNRELIABLE and its use is discouraged.." [14:29:11] _how_ unrealiable ? [14:30:37] 6! [14:30:50] ok:) [14:30:57] More than 3 [14:31:20] yes, and the scale is: 0 to Unreliable [14:38:22] can some take db59 down for me? I need to remove I/O cards [14:39:19] apergos: yea i wanna learn how to do it too so i figured we would do it post ops meetin gwith leslie [14:41:15] the fiber ducting i am installing would make any network admin feel warm and fuzzy [14:41:23] so much nicer than naked fiber draped around the cage =P [14:42:45] aww [14:43:24] heh ^ [14:44:12] robh or apergos can you take db59 down for me please [14:50:08] lemme take a look [14:50:20] I was on it but I don't know i [14:50:23] f it is serving anything [14:50:44] ahh, this was fusion io test [14:50:58] and asher is certainly done as he is no longer in country ;] [14:51:03] how did you find that out? [14:51:10] i looked in rt for db59 [14:51:17] robh: yes...the io test...the cards need to go out tomorrow [14:51:22] and checked the noc.w.o page for db to ensure its not in cluster [14:51:45] cmjohnson1: shutting it down now for you, you can feel free to ship them back, hrmm, i guess i need to get an address for ya ! [14:51:57] i will email our dell reps and CC you on the mail so they can reply back to you directly [14:52:00] would help [14:52:02] cool [14:52:07] thx [14:52:14] oh, you knew what to look for [14:52:17] I would have no idea [14:52:18] apergos: basically rt told me [14:52:30] well, i did rt search on db59 cuz chris doesnt touch anything without rt [14:52:39] and saw it was the test host for the io cards [14:52:39] ok [14:52:46] cmjohnson1: just so you know what else i did [14:52:55] i checked out http://noc.wikimedia.org/dbtree/ [14:53:10] yeah I looked at db.php directly [14:53:10] which shows if its in general DB use, however that does not show if its one of the 'misc db' servers [14:53:18] yea, i am lazy and the website is faster for me [14:53:19] heh [14:53:30] i used to do the db.php [14:53:43] right now though i just happen to know which the misc db servers are [14:53:48] though they need to be better documented [14:54:02] well I couldn't fid my bookmark for the diagram [14:54:04] if im not sure if its misc db, i have root, so i just login to the box and check out what databases it has [14:54:09] probably got lost in the upgrade [14:54:16] its linked off noc.wikimedia.org so i just do that [14:54:26] I would have had to remember it was on noc [14:54:33] !log db59 shutting down for io card removal per rt 2589 [14:54:36] Logged the message, RobH [14:54:37] heh [14:54:58] see, this is a perfect example of why cmjohnson1 needs root. [14:55:18] cmjohnson1: fyi, so on db servers, you cannot simply do shutdown -h now [14:55:25] as mysql takes longer to shutdown cleanly than that [14:55:38] you need to always stop mysql, let that finish, then shutdown the server. [14:55:46] ok [14:55:56] ok, its shutting off, when its powered off its all yours [14:56:39] thx [14:58:51] PROBLEM - Host db59 is DOWN: PING CRITICAL - Packet loss = 100% [15:00:39] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:04:51] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 9.102 seconds [15:11:18] RECOVERY - Host db59 is UP: PING OK - Packet loss = 0%, RTA = 0.37 ms [15:15:48] PROBLEM - DPKG on db59 is CRITICAL: Connection refused by host [15:16:15] PROBLEM - MySQL disk space on db59 is CRITICAL: Connection refused by host [15:16:24] PROBLEM - Disk space on db59 is CRITICAL: Connection refused by host [15:17:27] PROBLEM - RAID on db59 is CRITICAL: Connection refused by host [15:17:36] PROBLEM - SSH on db59 is CRITICAL: Connection refused [15:18:26] cmjohnson1: dont worry about it, its not in rotatoin and no doubt asher had done funky stuff to it [15:18:47] ok..sounds good [15:18:49] !log db59 has errors, but as it was a fusion io testbed server, it is more than likely tweaked for such, it is not in any rotation [15:18:53] Logged the message, RobH [15:19:02] if it was in a cluster we would be trying to fix it [15:20:36] PROBLEM - Disk space on srv220 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=61%): /var/lib/ureadahead/debugfs 0 MB (0% inode=61%): [15:28:37] ok, fiber raceways make running fiber a million times easier [15:28:41] \o/ [15:35:27] RECOVERY - Disk space on srv220 is OK: DISK OK [15:36:38] heh, i can do in 15m what used to take 20, awesome. [15:36:49] shorter fibers everywhere. [15:40:33] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:41:04] breathe on stafford and it gets angry . . . [15:42:16] load 64 whee [15:44:45] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 9.898 seconds [15:46:55] i think stafford is a r310 [15:47:00] a high performance host it aint if so [15:47:14] ok, gonna run and get lunch before the rush, and i have 1pm with leslie to turn up the new fiber [15:47:25] it's not an r310 [15:47:31] it's pretty high performance [15:47:33] ahh [15:47:41] puppet just doesn't like doing its job [15:47:52] indeed its showing 16 cores, so it has ht on [15:47:54] i modified about 15 files that are all installed only on grosley/aluminium [15:47:55] for some reason all puppet runs are queuing again [15:47:56] and dual cpu [15:48:01] ok, afk a bit, back shortly [15:48:03] I -think- since leslie installed nagios in eqiad [15:48:06] and it got very very angry [15:48:18] it gets very angry every 30 mins [15:48:22] hah [15:48:44] let's fork it and name it troglodyte [15:51:24] * Jeff_Green (looks at ganglia stafford page) you aren't kidding [15:52:13] re: stafford, /var/lib/puppet/reports is getting large again, recently deleted some to prevent out of disk.. [15:53:05] the ganglia report is bizarre for this host [15:53:35] and some .yaml files got corrupted, see SAL, either puppet bug or it was because the master got interrupted while writing them or something [15:54:50] http://ganglia.wikimedia.org/latest/graph.php?r=day&z=xlarge&h=stafford.pmtpa.wmnet&m=load_one&s=by+name&mc=2&g=cpu_report&c=Miscellaneous+pmtpa [15:56:19] I can't tell from that graph whether CPU utilization became very bursty (i.e. host is blocking on something else) after 12:00 or if it rose dramatically and reporting got spotty [15:59:07] mutante: when approx did you purge the reports? [16:00:58] Jeff_Green: March 23 01:55 deleting puppet report files older than 60hours on stafford to free disk space [16:01:06] k [16:01:49] Jeff_Green: puppet run completes on aluminium though [16:01:57] yeah, I ran them by hand on both [16:02:30] it worked generally as expected, except for the part where stafford freaked out [16:05:00] ok, might have been that you saw these as well ": Error 400 on SERVER: Could not parse YAML data .. syntax error". in that case need to delete the right yaml file on stafford [16:05:36] that happened on several mw boxex, ssl2, ms-be3 ... [16:07:06] mutante: there's a fair amount of chatter about that in stafford:/var/log/daemon [16:07:50] http://projects.puppetlabs.com/issues/1812 [16:08:18] yes, or http://projects.puppetlabs.com/issues/7836 [16:09:02] what happens if we purge *all* of that data? [16:09:31] actually i moved the files to /tmp and now to /root in case we want to report them [16:11:04] mmmm 5 guys. [16:11:18] the best of a list of poor in-n-out substitutes. [16:11:22] ha [16:12:23] mutante: do we ever use the yaml reports? [16:13:24] Jeff_Green: i asked ", do we want to keep those? " myself ;) [16:15:21] yeah, seems like a fair amount of overhead if we don't use it [16:15:54] Jeff_Green: 2 seprate things though: /var/lib/puppet/reports = gets large, and afaik just for human consumption (or the dashboard?) .. and /var/lib/puppet/yaml/node = other .yaml files that , if corrupted, break client runs. these are recreated by the next succesful run [16:16:29] yeah--it's the first one that I'm talking about [16:17:05] from puppet.conf: report = true [16:17:13] we could turn off client-side reporting [16:17:26] yep, i wasnt sure enough, so i asked when we got the first disk space warning, and just deleted the oldest ones when it got closer to running out of space [16:17:33] yeah [16:17:38] i'll post an RT ticket [16:19:42] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:25:51] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.950 seconds [16:33:56] woosters: anyone in Ops to handle search problems? http://bugzilla.wikimedia.org/35451 - Search engine does not index new pages on pl.wikipedia [16:34:55] mutante: message sent to ops@ [16:36:02] Jeff_Green: alright, cool [16:36:06] hexmode: notpeter [16:36:15] notpeter is the king of search you see. [16:36:25] hexmode ... let me create a rt ticket for it then [16:36:41] woosters: k, just let me know the # :) [16:36:49] hah, different kind of search ;) [16:37:14] oh [16:37:25] hrm, nope. assign to rainman? [16:37:37] or to me and I'll get in contact with him [16:37:46] notpeter: it is, but I'm not sure how much rainmain is available [16:37:47] heh, sorry if i was incorrect ;] [16:38:25] RobH: no, you're right. I thought that it was a robots.txt initially, but this is us [16:47:07] heh, i didnt even look at the bug [16:47:21] just when hexmode asked for search, i know yer the dude. [16:47:37] and if you werent, i assumed you were in better contact with rainman than the rest of us ;] [16:48:34] notpeter: https://rt.wikimedia.org/Ticket/Display.html?id=2700 [16:49:16] hexmode: cool. thank you. working on it presently :) [16:49:30] :) [16:49:34] lack of knowledge of polish slightly hindering... ;) [16:49:49] but, I'm a pattern recognition monkey [16:49:51] so that helps [16:51:08] notpeter: I know some -pl peeps are helpful and on IRC if you want me to ping them [16:52:11] I'll see what I can do with string matching first, but I might hop over to -pl [16:52:33] Beau -- guy who reported -- is there and helpful [16:52:52] er... could be a girl, I guess [16:55:23] ok, migrated leslie's new fiber to the raceway.... [16:55:39] woosters: is leslie in today? we have a conference call in 5 minutes [16:55:58] or any other ops person in the office would know ;] [16:56:29] she is not in office yet [16:56:53] hexmode: yes, the indexes on search7, the box that has plwiki on it, are from 2012-02-23. although the indexer has never ones. hurray.... [16:57:12] let me sms her ...peering port work isn't it? [16:57:50] thats my understanding, turning up the fiber for connection to EQ peering [16:58:04] i mean, its all ready on my end, but i am in call just in case something doesn't work. [17:01:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:05:36] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 7.401 seconds [17:14:54] !log backingup plwiki.nspart1 index on search7, deleting working copy, and restarting lsearchd. (note: this will probably cause some downtime on some languages while the proc restarts...) [17:14:59] Logged the message, and now dispaching a T1000 to your position to terminate you. [17:19:26] hrmm, i think morebots needs title updates. [17:19:54] hah [17:20:46] notpeter: i felt you needed a better response. [17:22:14] it's true, it's been issuing the same threats for a while now... [17:23:59] RoanKattouw: can you search for something on nl and tell me if it's returning real results? [17:24:08] Sure [17:24:56] https://nl.wikipedia.org/w/index.php?title=Speciaal%3AZoeken&profile=default&search=van+dam&fulltext=Search WFM [17:25:08] I searched for a common surname and it turned up a bunch of semi-well-known people with that surname [17:25:22] cool, thanks! [17:36:51] !log fluorine coming down for new disks [17:41:06] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:47:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.036 seconds [17:51:39] !log fluorine disk upgrade done, os install pending, details on rt 2350 [17:52:45] LeslieCarr: I see the fiber you are talking about now, well when we migrate it that will certainly see if it fixes it [17:53:34] !log cp1019 coming down for memory replacement per rt 2651 [18:06:45] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 8.32450678571 (gt 8.0) [18:13:03] RECOVERY - Packetloss_Average on locke is OK: OK: packet_loss_average is 1.14684903509 [18:21:09] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:27:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.028 seconds [18:29:47] New patchset: ArielGlenn; "ms1001 gets tweaks for high-bandwidth rsync" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3749 [18:30:01] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3749 [18:32:31] New review: ArielGlenn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3749 [18:32:34] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3749 [19:00:59] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:03:20] lol whoops [19:03:55] I just noticed that the job runners are semi-broken [19:04:11] I mean they're running but if someone tried to restart them, they'd all break [19:07:08] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.025 seconds [19:08:02] PROBLEM - Packetloss_Average on emery is CRITICAL: CRITICAL: packet_loss_average is 8.36590393162 (gt 8.0) [19:19:35] !log cp1019 memory replaced per rt 2651 [19:20:38] RECOVERY - Host cp1019 is UP: PING OK - Packet loss = 0%, RTA = 26.58 ms [19:22:44] RECOVERY - Packetloss_Average on emery is OK: OK: packet_loss_average is 3.55783869565 [19:25:53] PROBLEM - Frontend Squid HTTP on cp1019 is CRITICAL: Connection refused [19:26:47] PROBLEM - Backend Squid HTTP on cp1019 is CRITICAL: Connection refused [19:33:32] PROBLEM - Puppet freshness on brewster is CRITICAL: Puppet has not run in the last 10 hours [19:41:02] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:46:26] PROBLEM - Puppet freshness on search1016 is CRITICAL: Puppet has not run in the last 10 hours [19:47:02] RECOVERY - Frontend Squid HTTP on cp1019 is OK: HTTP OK HTTP/1.0 200 OK - 27535 bytes in 0.162 seconds [19:47:11] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.021 seconds [19:47:47] RECOVERY - Backend Squid HTTP on cp1019 is OK: HTTP OK HTTP/1.0 200 OK - 27399 bytes in 0.174 seconds [19:55:09] !log stopping puppet on search6 and search15 for 24 hours to test new log rotation script [20:00:32] PROBLEM - Puppet freshness on search1006 is CRITICAL: Puppet has not run in the last 10 hours [20:00:32] PROBLEM - Puppet freshness on db59 is CRITICAL: Puppet has not run in the last 10 hours [20:05:08] any ops around willing to review some of my changes on puppet please ? https://gerrit.wikimedia.org/r/#q,owner:hashar+project:operations/puppet+status:open,n,z [20:07:07] hashar: ok [20:07:59] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3531 [20:08:03] the topic branch regroup them [20:08:07] PROBLEM - Packetloss_Average on emery is CRITICAL: CRITICAL: packet_loss_average is 9.12571974359 (gt 8.0) [20:11:24] LeslieCarr: looks like I need to rebase some changes :) [20:11:43] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours [20:11:43] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours [20:12:19] RECOVERY - Packetloss_Average on emery is OK: OK: packet_loss_average is 1.44555983051 [20:20:43] woosters: shell access to bz -- https://rt.wikimedia.org/Ticket/Display.html?id=2584 -- robla gave his ok, what is left? [20:22:04] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:22:49] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours [20:22:49] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours [20:23:18] notpeter: is there anyway we could add monitors for freshness of the search index? [20:23:35] notpeter: so plwiki-like problems would show up sooner [20:23:46] yep [20:23:55] I'm working on getting some better monitoring in place [20:24:15] I shall make sure that something along those lines is part of that [20:24:51] notpeter: would you be offended if I created an RT ticket for this? or is there one already? [20:25:01] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3526 [20:25:04] go for it [20:25:14] hashar: sorry about getting this slowly, i'm doing other stuff at the same time [20:25:37] LeslieCarr: just focus on the other stuff so :-] [20:25:40] hexmode: I'm not doing a great job of creating sub-tickets for the work I'm doing on search stuffs [20:25:40] it is not that urgent! [20:25:53] trying to figure out how to rebase my change meanwhile [20:26:39] notpeter: I'm not faulting you :) I just know if I were to ask about it woosters would ask me where my ticket was ;) [20:27:07] heh, fair enough [20:28:13] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.045 seconds [20:51:00] New patchset: Hashar; "reindent / align hookconfig.py $filename hash" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3526 [20:51:15] New patchset: Hashar; "remove +x bits from files of /srv/org/mediawiki/integration" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3433 [20:51:28] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3526 [20:51:28] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3433 [20:52:33] New review: Hashar; "I think I have rebased it correctly :-]" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/3526 [20:58:09]