[00:08:40] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours
[00:08:40] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours
[00:10:19] PROBLEM - Disk space on srv221 is CRITICAL: DISK CRITICAL - free space: / 279 MB (3% inode=61%): /var/lib/ureadahead/debugfs 279 MB (3% inode=61%):
[00:19:37] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours
[00:19:37] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours
[00:20:49] RECOVERY - Disk space on srv221 is OK: DISK OK
[00:29:13] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 182 seconds
[00:31:19] RECOVERY - MySQL Slave Delay on db1035 is OK: OK replication delay 0 seconds
[01:57:58] PROBLEM - Disk space on stafford is CRITICAL: DISK CRITICAL - free space: /var/lib/puppet 760 MB (3% inode=92%):
[02:42:49] PROBLEM - MySQL Replication Heartbeat on db1018 is CRITICAL: CRIT replication delay 231 seconds
[02:43:07] PROBLEM - MySQL Slave Delay on db1018 is CRITICAL: CRIT replication delay 239 seconds
[02:46:34] PROBLEM - Puppet freshness on brewster is CRITICAL: Puppet has not run in the last 10 hours
[02:51:13] RECOVERY - MySQL Replication Heartbeat on db1018 is OK: OK replication delay 0 seconds
[02:51:31] RECOVERY - MySQL Slave Delay on db1018 is OK: OK replication delay 0 seconds
[03:24:31] PROBLEM - Puppet freshness on db59 is CRITICAL: Puppet has not run in the last 10 hours
[03:24:49] PROBLEM - Disk space on srv223 is CRITICAL: DISK CRITICAL - free space: / 192 MB (2% inode=61%): /var/lib/ureadahead/debugfs 192 MB (2% inode=61%):
[03:35:28] RECOVERY - Disk space on srv223 is OK: DISK OK
[03:40:45] New review: Tim Starling; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2675
[04:03:20] PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Puppet has not run in the last 10 hours
[04:28:50] RECOVERY - Disk space on stafford is OK: DISK OK
[04:53:19] PROBLEM - Disk space on search1022 is CRITICAL: DISK CRITICAL - free space: /a 4135 MB (3% inode=99%):
[04:55:25] PROBLEM - Disk space on search1021 is CRITICAL: DISK CRITICAL - free space: /a 3747 MB (3% inode=99%):
[05:08:01] PROBLEM - Disk space on search1022 is CRITICAL: DISK CRITICAL - free space: /a 4445 MB (3% inode=99%):
[05:16:07] PROBLEM - Disk space on db1004 is CRITICAL: DISK CRITICAL - free space: / 284 MB (3% inode=89%): /var/lib/ureadahead/debugfs 284 MB (3% inode=89%):
[05:16:43] PROBLEM - MySQL disk space on db1004 is CRITICAL: DISK CRITICAL - free space: / 284 MB (3% inode=89%): /var/lib/ureadahead/debugfs 284 MB (3% inode=89%):
[06:00:53] PROBLEM - Disk space on search1021 is CRITICAL: DISK CRITICAL - free space: /a 4430 MB (3% inode=99%):
[06:00:53] PROBLEM - Disk space on search1022 is CRITICAL: DISK CRITICAL - free space: /a 4430 MB (3% inode=99%):
[06:22:02] PROBLEM - Disk space on search1021 is CRITICAL: DISK CRITICAL - free space: /a 3733 MB (3% inode=99%):
[07:08:06] PROBLEM - Disk space on search1022 is CRITICAL: DISK CRITICAL - free space: /a 3705 MB (3% inode=99%):
[07:08:06] PROBLEM - Disk space on search1021 is CRITICAL: DISK CRITICAL - free space: /a 3609 MB (3% inode=99%):
[08:22:03] RECOVERY - Puppet freshness on mw19 is OK: puppet ran at Mon Mar 26 08:21:37 UTC 2012
[08:24:51] !log on several mw* boxes puppet did not run because .yaml files on the puppetmaster became corrupted. need to delete the $hostname files in /var/lib/puppet/yaml/node on stafford and re-run. puppet bug similar to http://projects.puppetlabs.com/issues/7836
[08:24:55] Logged the message, Master
[08:26:33] RECOVERY - Puppet freshness on mw1073 is OK: puppet ran at Mon Mar 26 08:26:12 UTC 2012
[08:28:03] RECOVERY - Puppet freshness on mw27 is OK: puppet ran at Mon Mar 26 08:27:58 UTC 2012
[08:29:06] RECOVERY - Puppet freshness on mw30 is OK: puppet ran at Mon Mar 26 08:28:50 UTC 2012
[08:31:39] RECOVERY - Puppet freshness on mw33 is OK: puppet ran at Mon Mar 26 08:31:23 UTC 2012
[08:32:33] RECOVERY - Puppet freshness on mw45 is OK: puppet ran at Mon Mar 26 08:32:08 UTC 2012
[08:33:09] RECOVERY - Puppet freshness on mw59 is OK: puppet ran at Mon Mar 26 08:32:47 UTC 2012
[08:34:24] RECOVERY - Puppet freshness on mw72 is OK: puppet ran at Mon Mar 26 08:34:13 UTC 2012
[09:05:09] !log brewster was out of disk - deleted lighttpd access.log.1, gzipped access.log
[09:05:14] Logged the message, Master
[09:18:08] New patchset: Hashar; "remove +x bits from files of /srv/org/mediawiki/integration" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3433
[09:18:21] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3433
[09:31:37] New patchset: Hashar; "remove nagios bot from #wikimedia-tech" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2675
[09:31:51] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2675
[09:32:18] RECOVERY - Puppet freshness on brewster is OK: puppet ran at Mon Mar 26 09:31:54 UTC 2012
[09:32:18] RECOVERY - Squid on brewster is OK: TCP OK - 0.006 second response time on port 8080
[09:33:14] !log brewster - delete puppet lock file, restart lighttpd, puppet ...
[09:33:18] Logged the message, Master
[09:36:57] PROBLEM - Puppet freshness on ssl2 is CRITICAL: Puppet has not run in the last 10 hours
[09:41:04] PROBLEM - Puppet freshness on ms-be3 is CRITICAL: Puppet has not run in the last 10 hours
[09:41:40] RECOVERY - Puppet freshness on ssl2 is OK: puppet ran at Mon Mar 26 09:41:16 UTC 2012
[09:43:01] !log another corrupted .yaml file on ssl2
[09:43:04] Logged the message, Master
[09:45:16] PROBLEM - Puppet freshness on search1016 is CRITICAL: Puppet has not run in the last 10 hours
[09:58:10] RECOVERY - Puppet freshness on ms-be3 is OK: puppet ran at Mon Mar 26 09:57:41 UTC 2012
[09:58:28] PROBLEM - Puppet freshness on search1006 is CRITICAL: Puppet has not run in the last 10 hours
[09:59:13] RECOVERY - Puppet freshness on db59 is OK: puppet ran at Mon Mar 26 09:59:05 UTC 2012
[09:59:36] !log ..and on ms-be-3. running puppet on db59
[09:59:39] Logged the message, Master
[10:10:19] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours
[10:10:19] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours
[10:10:19] PROBLEM - RAID on searchidx2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:12:34] RECOVERY - RAID on searchidx2 is OK: OK: State is Optimal, checked 4 logical device(s)
[10:21:16] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours
[10:21:16] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours
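The fix !logged above — clear the corrupted node caches on the puppetmaster so the next agent run regenerates them — could be scripted roughly as follows. This is a minimal illustration, not the script ops actually ran: it assumes PyYAML is available on the puppetmaster, and it moves suspect files aside (as was done with /tmp and /root later in the day) rather than deleting them.

    #!/usr/bin/env python
    # Hypothetical sweep of the puppetmaster's node cache: move aside any
    # .yaml file that no longer parses, so the next successful agent run
    # can recreate it.
    import os
    import shutil
    import yaml  # PyYAML, assumed installed

    NODE_DIR = "/var/lib/puppet/yaml/node"
    QUARANTINE = "/root/corrupt-yaml"  # illustrative destination

    if not os.path.isdir(QUARANTINE):
        os.makedirs(QUARANTINE)

    for name in os.listdir(NODE_DIR):
        path = os.path.join(NODE_DIR, name)
        try:
            with open(path) as f:
                # BaseLoader checks syntax only, so puppet's !ruby/... tags
                # do not trigger false positives.
                yaml.load(f, Loader=yaml.BaseLoader)
        except yaml.YAMLError:
            print("corrupt, quarantining: %s" % path)
            shutil.move(path, os.path.join(QUARANTINE, name))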
[10:34:40] mutante: apergos: hi :) Do you have any idea how I can pick a gid number for a new group?
[10:34:50] heh no :-D
[10:35:25] I made some educated guesses based on what was in the puppet files last time I needed a gid
[10:35:35] going to do the same
[10:35:38] 561 :-]
[10:36:08] all righty then :-D
[10:40:35] apergos: next question. How can I put a list of people in a group? :-]
[10:41:22] I thought about something similar to the sudo_user declarations in site.pp
[10:45:37] I guess you would make a class like the ones for roots or restricted in admins.pp
[10:46:03] or like analinterns
[10:46:56] User[ [ "demon" , "hashar", "reed" ] ] {
[10:46:56] groups +> 'jenkinsgroup',
[10:46:56] require => Group['jenkinsgroup'] }
[10:47:05] then you can include your groups::blah in it
[10:47:10] that one is almost a guarantee to have mark facepalm :-)
[10:47:15] oh. I like that less
[10:47:36] make a class for them, maybe that's not the best but at least it looks like what we already have in there
[10:47:38] * hashar opens admin.pp
[10:47:46] and if he wants it changed then those can get changed too
[10:53:42] ::facepalm::
[10:58:51] :)))
[11:15:36] New patchset: Hashar; "jenkins group for continuous integration" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3733
[11:15:47] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/3733
[11:17:12] New patchset: Hashar; "jenkins group for continuous integration" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3733
[11:17:25] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3733
[11:17:42] apergos: here we have some group madness :-]
[11:20:07] good luck with that :-D
[11:22:25] New patchset: Hashar; "gerrit.pp warned about invalid escape sequence" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3734
[11:22:39] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3734
[11:58:46] New patchset: Hashar; "strip long paths from puppet linter output" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3736
[11:59:00] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3736
[12:15:49] New patchset: Hashar; "makes puppet file mode always 4 digits" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3738
[12:16:03] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3738
[13:24:05] RECOVERY - MySQL disk space on db42 is OK: DISK OK
[13:25:41] !log db42 was out of disk, caused by ~5G citations.csv in /tmp, gzipped the file
[13:25:44] RECOVERY - Disk space on db42 is OK: DISK OK
[13:25:45] Logged the message, Master
[13:37:14] !log while on it, installing a whole bunch of package updates on db42
[13:37:17] Logged the message, Master
[13:45:14] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 3.740 seconds
[13:48:23] RECOVERY - DPKG on db42 is OK: All packages OK
[13:54:41] PROBLEM - DPKG on db42 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[13:56:47] RECOVERY - DPKG on db42 is OK: All packages OK
[14:01:39] "Hello, I'm looking to get most trees removed off a 3-acre property I have in Cleveland, TX." sent to noc@ . . . uh whut?
[14:03:13] haha, yeah Jeff, i said the same thing "i understand spam if you want to sell something, but this is just _Weird_. even with the tree photos attached"
[14:04:53] PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Puppet has not run in the last 10 hours
[14:05:23] maybe it's a secret code
[14:05:36] ohh steganography?
[14:05:45] do we have a code breaking dept yet?
[14:06:29] no, although I think we have a few people interested in such things
[14:06:35] I used to do applied crypto at one job
[14:06:46] notpeter is into it I think
[14:13:54] "Stegdetect is an automated tool for detecting steganographic content in images. It is capable of detecting several different steganographic methods to embed hidden information in JPEG images."
[14:14:24] and wikispecies.org should be ours and a redirect, not parked on sedo
[14:14:40] was checking on "Cryptocephalus", some leaf eating beetle :p
[14:14:46] heh, wikispecies is so dead no one has gone after that domain.
[14:14:51] ok, back to grub2 and GPT :o
[14:14:59] i guess we can ask our legal folks to go get it
[14:15:04] RobH: i have some edits there :)
[14:15:21] I have never met a wikispecies editor before!
[14:15:28] linking taxonomic names from wiktionary to species and back :p
[14:15:29] today is historic.
[14:15:33] hehe
[14:15:58] :)
[14:16:06] ok, onsite at eqiad, gonna install the fiber ducting today, and later we migrate fibers, huzzah
[14:17:08] oh
[14:17:17] maybe I'll be around for some of the fibers
[14:17:23] we have our meeting today though
[14:17:29] which means it will likely be late
[14:18:59] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:19:58] apergos, you want fibers in your home now instead of the powerline stuff:)
[14:20:21] you're kidding right
[14:20:53] I can't even get a contract for better connectivity (I have 2 Mbps now I think) until I have my paperwork done
[14:21:32] apergos: Wait, you can't get fast internet without your immigration paperwork?
[14:21:45] That sounds draconian
[14:23:47] I can't sign contracts without a tax number
[14:23:56] which really requires that I have a residence permit
[14:23:58] which ...
[14:24:02] there ya go
[14:24:15] I could find someone else to sign a contract but that kinda sucks
[14:25:08] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 4.144 seconds
[14:25:09] Hmm
[14:25:10] especially since there is still some slim possibility (I think it's unlikely but who knows right?) that they could deny me the permit at any time
[14:25:15] That sounds worse than the US
[14:25:17] since I've not gotten one yet
[14:25:26] I don't have an SSN yet but I should be able to get internet in my apt without that
[14:29:03] "grub-setup: warn: Embedding is not possible. GRUB can only be installed in this setup by using blocklists. However, blocklists are UNRELIABLE and its use is discouraged.."
[14:29:11] _how_ unreliable?
[14:30:37] 6!
[14:30:50] ok:)
[14:30:57] More than 3
[14:31:20] yes, and the scale is: 0 to Unreliable
[14:38:22] can someone take db59 down for me? I need to remove I/O cards
[14:39:19] apergos: yea i wanna learn how to do it too so i figured we would do it post ops meeting with leslie
[14:41:15] the fiber ducting i am installing would make any network admin feel warm and fuzzy
[14:41:23] so much nicer than naked fiber draped around the cage =P
[14:42:45] aww
[14:43:24] heh ^
[14:44:12] robh or apergos can you take db59 down for me please
[14:50:08] lemme take a look
[14:50:20] I was on it but I don't know if it is serving anything
[14:50:44] ahh, this was fusion io test
[14:50:58] and asher is certainly done as he is no longer in country ;]
[14:51:03] how did you find that out?
[14:51:10] i looked in rt for db59
[14:51:17] robh: yes...the io test...the cards need to go out tomorrow
[14:51:22] and checked the noc.w.o page for db to ensure its not in cluster
[14:51:45] cmjohnson1: shutting it down now for you, you can feel free to ship them back, hrmm, i guess i need to get an address for ya!
[14:51:57] i will email our dell reps and CC you on the mail so they can reply back to you directly
[14:52:00] would help
[14:52:02] cool
[14:52:07] thx
[14:52:14] oh, you knew what to look for
[14:52:17] I would have no idea
[14:52:18] apergos: basically rt told me
[14:52:30] well, i did rt search on db59 cuz chris doesnt touch anything without rt
[14:52:39] and saw it was the test host for the io cards
[14:52:39] ok
[14:52:46] cmjohnson1: just so you know what else i did
[14:52:55] i checked out http://noc.wikimedia.org/dbtree/
[14:53:10] yeah I looked at db.php directly
[14:53:10] which shows if its in general DB use, however that does not show if its one of the 'misc db' servers
[14:53:18] yea, i am lazy and the website is faster for me
[14:53:19] heh
[14:53:30] i used to do the db.php
[14:53:43] right now though i just happen to know which the misc db servers are
[14:53:48] though they need to be better documented
[14:54:02] well I couldn't find my bookmark for the diagram
[14:54:04] if im not sure if its misc db, i have root, so i just login to the box and check out what databases it has
[14:54:09] probably got lost in the upgrade
[14:54:16] its linked off noc.wikimedia.org so i just do that
[14:54:26] I would have had to remember it was on noc
[14:54:33] !log db59 shutting down for io card removal per rt 2589
[14:54:36] Logged the message, RobH
[14:54:37] heh
[14:54:58] see, this is a perfect example of why cmjohnson1 needs root.
[14:55:18] cmjohnson1: fyi, so on db servers, you cannot simply do shutdown -h now
[14:55:25] as mysql takes longer to shut down cleanly than that
[14:55:38] you need to always stop mysql, let that finish, then shut down the server.
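RobH's shutdown procedure, as a minimal sketch: stop MySQL, wait until mysqld has actually exited, and only then halt. The init-script path and the pgrep-based wait are illustrative assumptions for a 2012-era Ubuntu DB host, not a transcription of anything run here.

    #!/usr/bin/env python
    # Hypothetical helper: cleanly stop MySQL before halting a DB host.
    import subprocess
    import time

    def mysqld_running():
        # pgrep exits 0 while a mysqld process still exists
        return subprocess.call(["pgrep", "-x", "mysqld"]) == 0

    # Ask the init script to stop MySQL; on a busy slave this can take minutes.
    subprocess.check_call(["/etc/init.d/mysql", "stop"])

    # Poll until mysqld is really gone before touching the power state.
    while mysqld_running():
        time.sleep(5)

    # Only now is it safe to halt the box.
    subprocess.check_call(["shutdown", "-h", "now"])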
[14:55:46] ok
[14:55:56] ok, its shutting off, when its powered off its all yours
[14:56:39] thx
[14:58:51] PROBLEM - Host db59 is DOWN: PING CRITICAL - Packet loss = 100%
[15:00:39] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:04:51] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 9.102 seconds
[15:11:18] RECOVERY - Host db59 is UP: PING OK - Packet loss = 0%, RTA = 0.37 ms
[15:15:48] PROBLEM - DPKG on db59 is CRITICAL: Connection refused by host
[15:16:15] PROBLEM - MySQL disk space on db59 is CRITICAL: Connection refused by host
[15:16:24] PROBLEM - Disk space on db59 is CRITICAL: Connection refused by host
[15:17:27] PROBLEM - RAID on db59 is CRITICAL: Connection refused by host
[15:17:36] PROBLEM - SSH on db59 is CRITICAL: Connection refused
[15:18:26] cmjohnson1: dont worry about it, its not in rotation and no doubt asher had done funky stuff to it
[15:18:47] ok..sounds good
[15:18:49] !log db59 has errors, but as it was a fusion io testbed server, it is more than likely tweaked for such, it is not in any rotation
[15:18:53] Logged the message, RobH
[15:19:02] if it was in a cluster we would be trying to fix it
[15:20:36] PROBLEM - Disk space on srv220 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=61%): /var/lib/ureadahead/debugfs 0 MB (0% inode=61%):
[15:28:37] ok, fiber raceways make running fiber a million times easier
[15:28:41] \o/
[15:35:27] RECOVERY - Disk space on srv220 is OK: DISK OK
[15:36:38] heh, i can do in 15m what used to take 20, awesome.
[15:36:49] shorter fibers everywhere.
[15:40:33] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:41:04] breathe on stafford and it gets angry . . .
[15:42:16] load 64 whee
[15:44:45] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 9.898 seconds
[15:46:55] i think stafford is an r310
[15:47:00] a high performance host it aint if so
[15:47:14] ok, gonna run and get lunch before the rush, and i have 1pm with leslie to turn up the new fiber
[15:47:25] it's not an r310
[15:47:31] it's pretty high performance
[15:47:33] ahh
[15:47:41] puppet just doesn't like doing its job
[15:47:52] indeed its showing 16 cores, so it has ht on
[15:47:54] i modified about 15 files that are all installed only on grosley/aluminium
[15:47:55] for some reason all puppet runs are queuing again
[15:47:56] and dual cpu
[15:48:01] ok, afk a bit, back shortly
[15:48:03] I -think- since leslie installed nagios in eqiad
[15:48:06] and it got very very angry
[15:48:18] it gets very angry every 30 mins
[15:48:22] hah
[15:48:44] let's fork it and name it troglodyte
[15:51:24] * Jeff_Green (looks at ganglia stafford page) you aren't kidding
[15:52:13] re: stafford, /var/lib/puppet/reports is getting large again, recently deleted some to prevent running out of disk..
[15:53:05] the ganglia report is bizarre for this host
[15:53:35] and some .yaml files got corrupted, see SAL, either puppet bug or it was because the master got interrupted while writing them or something
[15:54:50] http://ganglia.wikimedia.org/latest/graph.php?r=day&z=xlarge&h=stafford.pmtpa.wmnet&m=load_one&s=by+name&mc=2&g=cpu_report&c=Miscellaneous+pmtpa
[15:56:19] I can't tell from that graph whether CPU utilization became very bursty (i.e. host is blocking on something else) after 12:00 or if it rose dramatically and reporting got spotty
[15:59:07] mutante: when approx did you purge the reports?
[16:00:58] Jeff_Green: March 23 01:55 deleting puppet report files older than 60 hours on stafford to free disk space
[16:01:06] k
[16:01:49] Jeff_Green: puppet run completes on aluminium though
[16:01:57] yeah, I ran them by hand on both
[16:02:30] it worked generally as expected, except for the part where stafford freaked out
[16:05:00] ok, might have been that you saw these as well: "Error 400 on SERVER: Could not parse YAML data .. syntax error". in that case need to delete the right yaml file on stafford
[16:05:36] that happened on several mw boxes, ssl2, ms-be3 ...
[16:07:06] mutante: there's a fair amount of chatter about that in stafford:/var/log/daemon
[16:07:50] http://projects.puppetlabs.com/issues/1812
[16:08:18] yes, or http://projects.puppetlabs.com/issues/7836
[16:09:02] what happens if we purge *all* of that data?
[16:09:31] actually i moved the files to /tmp and now to /root in case we want to report them
[16:11:04] mmmm 5 guys.
[16:11:18] the best of a list of poor in-n-out substitutes.
[16:11:22] ha
[16:12:23] mutante: do we ever use the yaml reports?
[16:13:24] Jeff_Green: i asked "do we want to keep those?" myself ;)
[16:15:21] yeah, seems like a fair amount of overhead if we don't use it
[16:15:54] Jeff_Green: 2 separate things though: /var/lib/puppet/reports = gets large, and afaik just for human consumption (or the dashboard?) .. and /var/lib/puppet/yaml/node = other .yaml files that, if corrupted, break client runs. these are recreated by the next successful run
[16:16:29] yeah--it's the first one that I'm talking about
[16:17:05] from puppet.conf: report = true
[16:17:13] we could turn off client-side reporting
[16:17:26] yep, i wasnt sure enough, so i asked when we got the first disk space warning, and just deleted the oldest ones when it got closer to running out of space
[16:17:33] yeah
[16:17:38] i'll post an RT ticket
[16:19:42] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:25:51] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.950 seconds
[16:33:56] woosters: anyone in Ops to handle search problems? http://bugzilla.wikimedia.org/35451 - Search engine does not index new pages on pl.wikipedia
[16:34:55] mutante: message sent to ops@
[16:36:02] Jeff_Green: alright, cool
[16:36:06] hexmode: notpeter
[16:36:15] notpeter is the king of search you see.
[16:36:25] hexmode ... let me create a rt ticket for it then
[16:36:41] woosters: k, just let me know the # :)
[16:36:49] hah, different kind of search ;)
[16:37:14] oh
[16:37:25] hrm, nope. assign to rainman?
[16:37:37] or to me and I'll get in contact with him
[16:37:46] notpeter: it is, but I'm not sure how much rainman is available
[16:37:47] heh, sorry if i was incorrect ;]
[16:38:25] RobH: no, you're right. I thought that it was a robots.txt issue initially, but this is us
[16:47:07] heh, i didnt even look at the bug
[16:47:21] just when hexmode asked for search, i know yer the dude.
[16:47:37] and if you werent, i assumed you were in better contact with rainman than the rest of us ;]
[16:48:34] notpeter: https://rt.wikimedia.org/Ticket/Display.html?id=2700
[16:49:16] hexmode: cool. thank you. working on it presently :)
[16:49:30] :)
[16:49:34] lack of knowledge of Polish slightly hindering... ;)
[16:49:49] but, I'm a pattern recognition monkey
[16:49:51] so that helps
[16:51:08] notpeter: I know some -pl peeps are helpful and on IRC if you want me to ping them
[16:52:11] I'll see what I can do with string matching first, but I might hop over to -pl
[16:52:33] Beau -- guy who reported -- is there and helpful
[16:52:52] er... could be a girl, I guess
[16:55:23] ok, migrated leslie's new fiber to the raceway....
[16:55:39] woosters: is leslie in today? we have a conference call in 5 minutes
[16:55:58] or any other ops person in the office would know ;]
[16:56:29] she is not in office yet
[16:56:53] hexmode: yes, the indexes on search7, the box that has plwiki on it, are from 2012-02-23. although the indexer has newer ones. hurray....
[16:57:12] let me sms her ...peering port work isn't it?
[16:57:50] thats my understanding, turning up the fiber for connection to EQ peering
[16:58:04] i mean, its all ready on my end, but i am in call just in case something doesn't work.
[17:01:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:05:36] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 7.401 seconds
[17:14:54] !log backing up plwiki.nspart1 index on search7, deleting working copy, and restarting lsearchd. (note: this will probably cause some downtime on some languages while the proc restarts...)
[17:14:59] Logged the message, and now dispatching a T1000 to your position to terminate you.
[17:19:26] hrmm, i think morebots needs title updates.
[17:19:54] hah
[17:20:46] notpeter: i felt you needed a better response.
[17:22:14] it's true, it's been issuing the same threats for a while now...
[17:23:59] RoanKattouw: can you search for something on nl and tell me if it's returning real results?
[17:24:08] Sure
[17:24:56] https://nl.wikipedia.org/w/index.php?title=Speciaal%3AZoeken&profile=default&search=van+dam&fulltext=Search WFM
[17:25:08] I searched for a common surname and it turned up a bunch of semi-well-known people with that surname
[17:25:22] cool, thanks!
[17:36:51] !log fluorine coming down for new disks
[17:41:06] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:47:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.036 seconds
[17:51:39] !log fluorine disk upgrade done, os install pending, details on rt 2350
[17:52:45] LeslieCarr: I see the fiber you are talking about now, well when we migrate it we will certainly see if it fixes it
[17:53:34] !log cp1019 coming down for memory replacement per rt 2651
[18:06:45] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 8.32450678571 (gt 8.0)
[18:13:03] RECOVERY - Packetloss_Average on locke is OK: OK: packet_loss_average is 1.14684903509
[18:21:09] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:27:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.028 seconds
[18:29:47] New patchset: ArielGlenn; "ms1001 gets tweaks for high-bandwidth rsync" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3749
[18:30:01] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3749
[18:32:31] New review: ArielGlenn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3749
[18:32:34] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3749
[19:00:59] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:03:20] lol whoops
[19:03:55] I just noticed that the job runners are semi-broken
[19:04:11] I mean they're running but if someone tried to restart them, they'd all break
[19:07:08] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.025 seconds
[19:08:02] PROBLEM - Packetloss_Average on emery is CRITICAL: CRITICAL: packet_loss_average is 8.36590393162 (gt 8.0)
[19:19:35] !log cp1019 memory replaced per rt 2651
[19:20:38] RECOVERY - Host cp1019 is UP: PING OK - Packet loss = 0%, RTA = 26.58 ms
[19:22:44] RECOVERY - Packetloss_Average on emery is OK: OK: packet_loss_average is 3.55783869565
[19:25:53] PROBLEM - Frontend Squid HTTP on cp1019 is CRITICAL: Connection refused
[19:26:47] PROBLEM - Backend Squid HTTP on cp1019 is CRITICAL: Connection refused
[19:33:32] PROBLEM - Puppet freshness on brewster is CRITICAL: Puppet has not run in the last 10 hours
[19:41:02] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:46:26] PROBLEM - Puppet freshness on search1016 is CRITICAL: Puppet has not run in the last 10 hours
[19:47:02] RECOVERY - Frontend Squid HTTP on cp1019 is OK: HTTP OK HTTP/1.0 200 OK - 27535 bytes in 0.162 seconds
[19:47:11] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.021 seconds
[19:47:47] RECOVERY - Backend Squid HTTP on cp1019 is OK: HTTP OK HTTP/1.0 200 OK - 27399 bytes in 0.174 seconds
[19:55:09] !log stopping puppet on search6 and search15 for 24 hours to test new log rotation script
[20:00:32] PROBLEM - Puppet freshness on search1006 is CRITICAL: Puppet has not run in the last 10 hours
[20:00:32] PROBLEM - Puppet freshness on db59 is CRITICAL: Puppet has not run in the last 10 hours
[20:05:08] any ops around willing to review some of my changes on puppet please? https://gerrit.wikimedia.org/r/#q,owner:hashar+project:operations/puppet+status:open,n,z
[20:07:07] hashar: ok
[20:07:59] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3531
[20:08:03] the topic branch regroups them
[20:08:07] PROBLEM - Packetloss_Average on emery is CRITICAL: CRITICAL: packet_loss_average is 9.12571974359 (gt 8.0)
[20:11:24] LeslieCarr: looks like I need to rebase some changes :)
[20:11:43] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours
[20:11:43] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours
[20:12:19] RECOVERY - Packetloss_Average on emery is OK: OK: packet_loss_average is 1.44555983051
[20:20:43] woosters: shell access to bz -- https://rt.wikimedia.org/Ticket/Display.html?id=2584 -- robla gave his ok, what is left?
[20:22:04] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:22:49] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours
[20:22:49] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours
[20:23:18] notpeter: is there any way we could add monitors for freshness of the search index?
[20:23:35] notpeter: so plwiki-like problems would show up sooner
[20:23:46] yep
[20:23:55] I'm working on getting some better monitoring in place
[20:24:15] I shall make sure that something along those lines is part of that
[20:24:51] notpeter: would you be offended if I created an RT ticket for this? or is there one already?
[20:25:01] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3526
[20:25:04] go for it
[20:25:14] hashar: sorry about getting to this slowly, i'm doing other stuff at the same time
[20:25:37] LeslieCarr: just focus on the other stuff so :-]
[20:25:40] hexmode: I'm not doing a great job of creating sub-tickets for the work I'm doing on search stuffs
[20:25:40] it is not that urgent!
[20:25:53] trying to figure out how to rebase my change meanwhile
[20:26:39] notpeter: I'm not faulting you :) I just know if I were to ask about it woosters would ask me where my ticket was ;)
[20:27:07] heh, fair enough
[20:28:13] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.045 seconds
[20:51:00] New patchset: Hashar; "reindent / align hookconfig.py $filename hash" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3526
[20:51:15] New patchset: Hashar; "remove +x bits from files of /srv/org/mediawiki/integration" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3433
[20:51:28] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3526
[20:51:28] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3433
[20:52:33] New review: Hashar; "I think I have rebased it correctly :-]" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/3526
[20:58:09] random question--is there a quick way to obtain a list correlating hostname to wiki database?
[20:59:30] meaning, e.g. which database hosts the wiki at rmy.wikipedia.org
[21:01:49] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:05:52] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.690 seconds
[21:06:41] Jeff_Green: fenari has a list of dbs per cluster
[21:07:13] so you'd need to find the file with it in, then from the cluster number, look up the dblist..
[21:07:14] a list of db hosts or a list of mysql databases?
[21:07:20] Jeff_Green: http://noc.wikimedia.org/dbtree/
[21:07:20] ?
[21:07:31] ohhh, hostname to wikidb, nm
[21:07:33] yeah
[21:07:37] so thats a basic convention
[21:07:52] and the initializesettings has the nonstandard ties
[21:07:55] if you look at it
[21:08:02] ooh looking
[21:09:14] # wgSitename @{
[21:09:19] I ask b/c I'm trying to simulate how apache+mw routes search api requests without going through apache/search
[21:09:20] // Wikis, alphabetically by DB name
[21:09:20] 'abwiki' => 'Авикипедиа',
[21:09:20] 'advisorywiki' => 'Advisory Board',
[21:09:28] etc...
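The convention Reedy describes — language code plus a per-project suffix, with the odd cases kept in InitialiseSettings.php and the *.dblist files — can be sketched like this. The suffix map and exception table below are illustrative stand-ins, not the production configuration:

    #!/usr/bin/env python
    # Rough sketch of the hostname -> wiki DB name convention; the suffix
    # map and the exception table are illustrative, not the real config.

    PROJECT_SUFFIX = {
        "wikipedia.org": "wiki",
        "wiktionary.org": "wiktionary",
        "wikibooks.org": "wikibooks",
    }

    # Non-standard wikis live in InitialiseSettings.php / special.dblist;
    # a single made-up entry stands in for them here.
    EXCEPTIONS = {
        "advisory.wikimedia.org": "advisorywiki",
    }

    def hostname_to_dbname(hostname):
        if hostname in EXCEPTIONS:
            return EXCEPTIONS[hostname]
        lang, _, domain = hostname.partition(".")
        suffix = PROJECT_SUFFIX.get(domain)
        if suffix is None:
            raise ValueError("no convention for %s" % hostname)
        return lang + suffix

    print(hostname_to_dbname("rmy.wikipedia.org"))  # -> rmywiki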
[21:09:34] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3526
[21:09:47] http://noc.wikimedia.org/conf/highlight.php?file=lucene.php
[21:09:56] } elseif ( in_array( $wgDBname, array( 'eswiki' ) ) ) {
[21:09:56] $wgLuceneHost = '10.0.3.14';
[21:09:57] etc
[21:10:01] not sure if that is exactly what you want
[21:10:04] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3433
[21:10:07] but its some of the data
[21:10:07] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3526
[21:10:08] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3433
[21:11:05] Reedy: yes, the issue is that I don't think I have $wgDBname without the apache+mw layer
[21:11:38] in most cases, it's languagecode followed by the project
[21:11:40] i.e. a request comes in to apache and gets routed to a virtualserver by ServerName and ServerAlias
[21:12:08] RobH: where is that config file exactly?
[21:12:20] baaaah
[21:12:27] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3738
[21:12:27] i just put it all in with a preceding / so it went into nothingness
[21:12:29] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3738
[21:12:36] heh /home/wikipedia/common/wmf-config/
[21:12:40] http://noc.wikimedia.org/conf/special.dblist is your exception list
[21:12:41] ah thanks
[21:12:58] root@fenari:~# view /home/wikipedia/common/wmf-config/InitialiseSettings.php
[21:13:02] sudo is for suckers.
[21:13:12] ;]
[21:13:12] ha
[21:13:17] * hashar feels lame :-/
[21:13:24] i would feel bad but it seems all of ops operate like me ;]
[21:13:26] beahaha
[21:13:30] bwahaha even.
[21:13:57] though I have a script to distribute my root tasks to root people
[21:14:03] makes things much more productive :-]
[21:14:05] RobH: why use a knife when you can use a chainsaw
[21:17:33] RobH: can you give jdlrobson Author wordpress perms please
[21:17:52] New patchset: Bhartshorne; "removing ms3 from the swift rings" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3766
[21:18:06] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3766
[21:18:21] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3766
[21:18:24] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3766
[21:19:50] LeslieCarr! you merged change 3738 but didn't deploy!
[21:20:02] because i hate freedom
[21:20:05] it's good to deploy
[21:20:07] haha
[21:20:08] maybe this makes more sense if I reverse the question--how do I determine every hostname served by the enwiki database
[21:20:13] tfinc: done
[21:20:16] you do hate freedom!
[21:20:16] RobH: thanks!
[21:20:21] maplebed: yes.
[21:20:21] (the changes are all permissions changes)
[21:20:41] shes a network admin, shes all about lockin shit down
[21:20:54] actually, not all.
[21:20:57] ;]
[21:20:57] there's also a gerrit change in there.
[21:21:10] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3736
[21:21:13] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3736
[21:21:19] yeah
[21:21:22] several changes
[21:22:05] ok, deploying them.
[21:23:55] oop. must start dinner or the children will track me down. ciao
[21:24:09] track you down... and eat you!
[21:24:13] oh that reminds me lunch would be good
[21:24:17] thanks for the merge :)
[21:24:39] the patch that strips long paths from the gerrit linter bot would need to be tested ( https://gerrit.wikimedia.org/r/#change,3736 )
[21:24:42] noms
[21:24:44] cause I am not sure it works
[21:24:48] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3527
[21:25:00] hashar: how does one test it?
[21:25:29] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3528
[21:26:33] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3529
[21:28:52] maplebed: merge in production & submit a faulty puppet file
[21:28:57] then the linter will complain
[21:29:22] doesn't the linter run on submit (before merge)?
[21:29:22] it should no longer show the long paths such as /var/tmp//file.pp
[21:29:37] it is part of the patchset-created hook
[21:29:40] so yes, on submit
[21:29:50] New review: Lcarr; "bye bye nagios bot from #tech" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2675
[21:29:53] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2675
[21:30:05] LeslieCarr: +111111 ^^^^
[21:31:31] I should probably have made separate commits
[21:31:37] those chained commits are a mess
[21:34:07] New patchset: Hashar; "abstract logic getting irc filename, add tests" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3527
[21:34:21] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3527
[21:34:28] New review: Hashar; "rebased" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/3527
[21:34:35] New review: Lcarr; "poor #dev !" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3530
[21:35:38] bed time, will poke that tomorrow
[21:35:45] LeslieCarr: feel free to skip if you are too busy
[21:36:01] will do the rebase tomorrow morning if you don't :-]
[21:36:05] byebye
[21:36:14] bye
[21:36:29] don't forget to go to lunch! ;)
[21:41:03] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:41:17] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3527
[21:41:20] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3527
[21:42:05] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3734
[21:42:07] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3734
[21:45:47] RobH: you guys made a strong case for SSDs on my FB post
[21:46:00] now i just have to figure out what price i'm going to pay
[21:46:10] thats the bad part =[
[21:46:22] i'm*
[21:46:45] RobH: the lowest decent drive will run me just under $350
[21:46:49] 330/340
[21:46:53] crucial has some nice ones
[21:46:58] alongside intel
[21:47:11] i have not used crucial, but i know they stand by their stuff
[21:47:18] used their memory before of course
[21:47:30] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 4.157 seconds
[21:47:41] RobH: i've always known them to produce top quality ram
[21:56:44] i always like crucial ram
[21:56:52] and corsair
[21:56:55] the c's make good ram
[22:28:10] !log pushing firmware updates to servertechs in sequence: ps1-[a2|a3|a4|a5|b2|b3|b4|b5|c1|c2|c3|d1|d2|d3]-sdtpa, disregard any errors from rebooting alerts
[22:29:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:32:22] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.021 seconds
[22:35:34] PROBLEM - Host ps1-a3-sdtpa is DOWN: PING CRITICAL - Packet loss = 100%
[22:35:52] PROBLEM - Host ps1-a4-sdtpa is DOWN: PING CRITICAL - Packet loss = 0%, RTA = 2111.02 ms
[22:37:04] RECOVERY - Host ps1-a3-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 1.94 ms
[22:37:31] PROBLEM - Host ps1-b2-sdtpa is DOWN: PING CRITICAL - Packet loss = 44%, RTA = 2162.35 ms
[22:39:19] RECOVERY - Host ps1-b2-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.77 ms
[22:39:55] RECOVERY - Host ps1-a4-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.25 ms
[22:44:53] !log also rolling firmware to ps1-d[1|2|3]-pmtpa
[23:03:46] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:09:55] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 1.251 seconds
[23:35:10] PROBLEM - LVS HTTP on m.wikimedia.org is CRITICAL: HTTP CRITICAL - pattern not found
[23:35:48] hrm, why did nagios alert for that?
[23:36:46] the mobile team just deployed something...
[23:37:24] * maplebed talks to preilly and arthur IRL
[23:38:01] maplebed: I think it's the updated footer
[23:40:08] it's looking for "Wikimedia Foundation, Inc" in the output
[23:40:33] (checking en.m.wikipedia.org)
[23:41:39] maplebed: it's now: Text is available under the Creative Commons Attribution-ShareAlike License; additional terms may apply. See Terms of use for details.
[23:41:40] Wikipedia® is a registered trademark of the Wikimedia Foundation, Inc., a non-profit organization.
[23:41:40] Contact us
[23:42:02] but, that's not on the main page
[23:42:13] that includes the substring Wikimedia Foundation, Inc...
[23:42:14] oh.
[23:45:40] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:51:49] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.036 seconds