[01:41:04] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 243 seconds [01:41:40] PROBLEM - MySQL Slave Delay on storage3 is CRITICAL: CRIT replication delay 272 seconds [01:46:01] PROBLEM - Puppet freshness on es1003 is CRITICAL: Puppet has not run in the last 10 hours [01:46:01] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours [01:46:01] PROBLEM - Puppet freshness on professor is CRITICAL: Puppet has not run in the last 10 hours [01:46:55] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 0 seconds [02:00:25] RECOVERY - MySQL Slave Delay on storage3 is OK: OK replication delay 15 seconds [02:49:18] PROBLEM - Puppet freshness on db29 is CRITICAL: Puppet has not run in the last 10 hours [04:13:27] New patchset: Logicwiki; "(bug 37447) Modify ZIM display name" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/10855 [04:13:33] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/10855 [04:20:48] New review: Peachey88; "Should this not be done on the extension side of things compared to a local change?" 
[operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/10855 [04:29:18] New patchset: Logicwiki; "(bug 37365) Install Narayam on Gujarati Projects" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/10401 [04:29:24] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/10401 [04:31:12] New patchset: Logicwiki; "(bug 37365) Install Narayam on Gujarati Projects" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/10401 [04:31:17] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/10401 [04:38:41] New review: Peachey88; "I'm not entirely fussed over it, but requests for individual projects should probably be done as ind..." [operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/10401 [04:54:18] PROBLEM - Puppet freshness on es1004 is CRITICAL: Puppet has not run in the last 10 hours [05:09:15] New review: Logicwiki; "It was not done in extension side earlier and probably for some reason. Collection.php on the extens..." 
[operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/10855 [05:46:22] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [06:39:09] PROBLEM - Puppet freshness on searchidx2 is CRITICAL: Puppet has not run in the last 10 hours [06:41:21] New review: Raimond Spekking; "(no comment)" [operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/10401 [06:54:34] Logged the message, Master [06:54:39] Logged the message, Master [06:54:43] Logged the message, Master [08:13:05] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [09:13:03] https://community.rapid7.com/community/metasploit/blog/2012/06/11/cve-2012-2122-a-tragically-comedic-security-flaw-in-mysql [09:16:07] for i in `seq 1 1000`; do mysql -u root --password=bad -h 127.0.0.1 [09:16:09] aah [09:16:18] paravoid: that is really helpful to "recover" your root password :-] [09:18:20] hashar: wut [09:18:34] petan: hello :) [09:18:48] 1000 times of that results in something cool? [09:19:52] :D [09:19:59] petan: https://community.rapid7.com/community/metasploit/blog/2012/06/11/cve-2012-2122-a-tragically-comedic-security-flaw-in-mysql [09:20:03] lol I just see that [09:20:04] posted by para void [09:20:12] (wasn't sure you have seen the link hehe) [09:21:44] damn ya. tragically-comedic :p [09:22:09] ouch [09:22:14] hi there mutante, had a nice trip back? [09:22:33] hi! yea, all worked out [09:22:46] thanks for showing us around, enjoyed it [09:23:01] tapas++ [09:23:18] hi mutante [09:23:26] hi [09:25:08] mutante: would you have couple mins to hack that second query for me, pls? [09:26:12] alright. let me check "rrank.php" again [09:26:14] brb [09:26:32] or do you already have the exact query? 
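Note on the CVE-2012-2122 link above: the public write-up attributes the flaw to the integer result of a memcmp() over the password hashes being narrowed to a one-byte boolean, so that some non-zero results read as zero ("hashes match"). The Python below is only a simulation of that narrowing on typical platforms, not the server code; both function names are hypothetical and nothing here contacts a MySQL server.

```python
def narrow_to_signed_char(value: int) -> int:
    """Mimic storing a C int in a signed 8-bit char: only the low byte survives."""
    low_byte = value & 0xFF
    return low_byte - 256 if low_byte >= 128 else low_byte


def looks_like_match(compare_result: int) -> bool:
    """Hypothetical check: a narrowed value of 0 is read as 'the hashes are equal'."""
    return narrow_to_signed_char(compare_result) == 0


if __name__ == "__main__":
    # Any result whose low byte is zero (256, -512, ...) is mistaken for a match.
    for result in (0, 1, -1, 128, 256, -512):
        print(result, looks_like_match(result))
```

Whether a given build is affected would then depend on whether its memcmp() can return values outside the one-byte range at all, which may be one reason for mixed results across platforms.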
[09:28:23] hashar: I tried it on my server [09:28:38] it resulted in 1000 * ERROR 1045 (28000): Access denied for user 'root'@'localhost' (using password: YES) [09:28:53] mutante: nope, because idk how your data is stored. but i can create artificial one which you can accomodate [09:29:27] it should be something like select lang from table where family = $family and order = $order [09:30:42] I run ubuntu 10.04 with very old kernel [09:30:45] it's fine :D [09:30:52] only latest version [09:30:54] is broken [09:34:28] there are conflicting reports [09:34:39] people not being able to reproduce it on platforms that are vulnerable [09:39:35] CVE-2012-2122 is all trending on google now [09:39:43] conflicting reports indeed [09:39:54] well strcmp() need to return 128 [09:40:44] which seems to be the difference of the byte values between the first two different characters [09:41:20] anyway the patch is easy [09:41:24] well [09:41:24] do note [09:41:40] if it would be widespread [09:41:48] any bruteforce attack would've been successful in the past [09:41:49] :) [09:42:02] ahah [09:47:18] I tried it on Linux 2.6.32-32-generic #62-Ubuntu SMP Wed Apr 20 21:52:38 UTC 2011 x86_64 GNU/Linux and 3.2.0-24-generic-pae #39-Ubuntu SMP Mon May 21 18:54:21 UTC 2012 i686 i686 i386 GNU/Linux [09:47:28] both didn't let me pass [09:47:41] first one is marked as vulnerable [09:47:45] second is 32 bit [09:47:46] pae [09:49:07] paravoid: could not reproduce on Ubuntu lucid in labs. mysql Ver 15.1 Distrib 5.3.7-MariaDB, for debian-linux-gnu (x86_64) using readline 5.2 [09:49:33] labs are x86 [09:49:39] these aren't marked [09:49:53] i see. ack [09:52:55] RECOVERY - MySQL Replication Heartbeat on db1047 is OK: OK replication delay seconds [09:53:49] RECOVERY - MySQL Slave Running on db1047 is OK: OK replication [09:53:58] I don't think this issue could affect prod anyway, because only ops can access mysql using shell [09:54:15] Danny_B|backup: how about this? 
https://wikistats.wmflabs.org/rrank.php (was just broken because it includes a config file which had been moved to correct path) [09:54:19] or, are there any mysql servers accessible from labs network? :P [09:54:33] I mean mysql on prod, accessible over labs [09:54:41] it's same network I guess [09:54:50] na, labs has its own network [09:55:04] ok, but is it possible to ping for example some machine on prod, using local IP? [09:55:10] from labs [09:55:13] no, it shouldnt be [09:55:15] ok [09:56:07] mutante: this does the rank for lang&family which we already have, not the language for rank & family [09:56:21] !log shut down mysqld on db1047, reparing tables [09:56:22] PROBLEM - mysqld processes on db1047 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [09:56:26] Logged the message, Master [09:57:26] apergos: you know you don't have to take mysql down [09:57:29] to 'repair tables' [09:57:31] right? [09:57:46] I'm having to make a copy of what's there, the quick repair failed [10:01:42] so my next q: do we have any idea if the table as it is repaired actually has the data in it that the master had at the point when slave replication broke? or is itlikely to be in some different state now? [10:02:08] /root/aft/readme.txt on the host has the check output and the repair output saved in it [10:02:43] domas: [10:03:54] if there's a MyISAM table on the cluster [10:03:59] that means people don't care about data in it [10:04:06] which means that you can safely truncate it [10:04:07] he he he he [10:04:34] :-P :-P [10:09:36] mutante: i think udp goes through to prod, icmp+tcp don't? 
(from labs) [10:09:50] unless it was fixed [10:10:20] apergos: if the table is tiny and doesn't get writes, you can just reload it on the slave once it has caught up replicating :) [10:10:41] ok, I'll see [10:10:43] thanks [10:11:03] I don't know he answer to any of those things but let me get it cranked up here [10:11:03] * jeremyb runs away [10:12:16] RECOVERY - mysqld processes on db1047 is OK: PROCS OK: 1 process with command name mysqld [10:14:31] PROBLEM - MySQL Replication Heartbeat on db1047 is CRITICAL: CRIT replication delay 407089 seconds [10:14:49] PROBLEM - MySQL Slave Delay on db1047 is CRITICAL: CRIT replication delay 407084 seconds [10:14:49] yeah it sureis [10:17:47] jeremyb: should have been fixed as well if referring to udp traffic for nagios snmp. labs nagios does it inside labs network and separated https://gerrit.wikimedia.org/r/#/c/2973/ [10:30:52] Danny_B|backup: like this? do you want even less output and just the prefix alone? https://wikistats.wmflabs.org/lrank.php?family=w&rank=9 [10:33:24] mutante: yup, nice. but some numbers do not return (proper) results - try 17 or 20 [10:35:56] mutante: lang code is good enough for me if it's a problem because of lang name [10:46:29] Danny_B|backup: yup, i see. still broken, i shall fix it after lunch. good enough to confirm that is what you intended. 
bbiab [11:46:23] New patchset: Mark Bergsma; "Add RFC 4760 attribute type constants" [operations/debs/pybal] (mp-bgp) - https://gerrit.wikimedia.org/r/10859 [11:46:23] New patchset: Mark Bergsma; "Add RFC4760 MP_REACH_NLRI attribute class" [operations/debs/pybal] (mp-bgp) - https://gerrit.wikimedia.org/r/10860 [11:46:24] New patchset: Mark Bergsma; "Add RFC4760 MP_UNREACH_NRLI attribute class" [operations/debs/pybal] (mp-bgp) - https://gerrit.wikimedia.org/r/10861 [11:46:25] New patchset: Mark Bergsma; "Fix BGPException name" [operations/debs/pybal] (mp-bgp) - https://gerrit.wikimedia.org/r/10862 [11:46:39] PROBLEM - Puppet freshness on es1003 is CRITICAL: Puppet has not run in the last 10 hours [11:46:39] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours [11:46:39] PROBLEM - Puppet freshness on professor is CRITICAL: Puppet has not run in the last 10 hours [12:01:27] New review: Faidon; "(no comment)" [operations/debs/pybal] (mp-bgp); V: 0 C: 1; - https://gerrit.wikimedia.org/r/10859 [12:01:33] New review: Faidon; "(no comment)" [operations/debs/pybal] (mp-bgp); V: 0 C: 1; - https://gerrit.wikimedia.org/r/10860 [12:01:40] New review: Faidon; "(no comment)" [operations/debs/pybal] (mp-bgp); V: 0 C: 1; - https://gerrit.wikimedia.org/r/10861 [12:01:45] New review: Faidon; "(no comment)" [operations/debs/pybal] (mp-bgp); V: 0 C: 1; - https://gerrit.wikimedia.org/r/10862 [12:45:06] PROBLEM - Apache HTTP on srv232 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:50:21] PROBLEM - Puppet freshness on db29 is CRITICAL: Puppet has not run in the last 10 hours [13:01:40] Can anyone help me with resending password to a list-admin on lists.wikimedia.org? [13:01:48] There is no forgot-password system that users can use. [13:02:16] I lost my cvn and cvn-private list-admin pwd code (one of those automatic generated ones), and like them to be resend or re-created. [13:20:39] Krinkle: yep, which list is it? 
[13:20:49] cvn@ and cvn-private@ [13:20:51] mutante: [13:20:52] New patchset: Mark Bergsma; "Add IPv6 support to IPPrefix classes" [operations/debs/pybal] (mp-bgp) - https://gerrit.wikimedia.org/r/10865 [13:28:46] Krinkle: the list owner email addresses should have received mail [13:28:59] thx [13:29:01] np [13:29:21] mutante: I kept receiving mails about stuff I need to approve or reject, but couldn't log in (they stack up every day) [13:29:25] Now I can fix them [13:30:51] note that you can also have/use separate "moderator" password who can do that but not change list settings. (you can now set that yourself and/or share the moderator pass) [13:33:42] the login URLs for admin and mod almost look the same but are "admin" vs. "admindb" [14:10:14] mutante: thx, I'll look into thatr [14:11:15] hi guys [14:11:31] i need to get a newer version of nodejs and npm into our apt repo [14:11:39] who should I ask about that? [14:17:19] mutante, would you know? [14:17:31] there is a guy who has made some good debs for it, and I can install if I use his repo [14:17:50] but we are supposed to put all the debs we install in our own wikimedia apt repo, right? [14:18:50] in production, yes. in labs you could use external [14:19:11] ottomata: is the version in precise newer? [14:19:23] or do you need a newer version than whats there? [14:20:05] the version in precise is new enough it hink [14:20:17] i think i saw that it is 0.6.12, which is new enough [14:20:25] and, this is for production on stat1 [14:20:37] we want to move the new reportcard stuff to stat1 and off of labs [14:20:47] * Ryan_Lane nods [14:20:55] we should upgrade to precise, then [14:20:56] ok, stat1 reinstall with precise? [14:21:12] we can do an upgrade, rather than a reinstall [14:21:16] we can? [14:21:19] why not? 
[14:21:20] true, yes [14:21:33] if we can just upgrade, that would be way easier than reinstalling [14:21:50] there's abunch of stuff there that we'd like to save (300GB or something), so an upgrade is much easier [14:21:59] i've got sudo there, can I do that? [14:22:13] or do I need console access of some kind? [14:22:39] unless there is something unexpected, no [14:23:24] would upgrade, dist-upgrade, edit apt sources files, upgrade, dist-upgrade, reboot ..or similar [14:23:45] if you want to i could watch mgmt during reboot [14:24:15] oh wait, we should use ubuntu way [14:24:44] yeah? I'm googling now [14:24:53] also [14:24:54] um [14:24:57] maybe this should happen first? [14:24:57] https://gerrit.wikimedia.org/r/#/c/9258/ [14:26:09] ottomata: http://wikitech.wikimedia.org/view/Distribution_upgrades [14:26:46] what i said before was more Debian, would most likely also work, but "do-release-upgrade" is preferred here [14:27:12] ok cool [14:27:21] should we get the amanda backups up and running first? [14:27:43] cant say no to backups before upgrades:) yes [14:35:27] New review: Dzahn; "i just meant the "include backup::client" line itself, which makes sure amanda is installed. the sta..." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/9258 [14:38:18] ottomata: ^. just the "install backup client" seems to fit in role class. no? and maybe that one red line 2115 in site.pp [14:39:29] ok [14:40:43] New patchset: Ottomata; "Setting up amanda backups for /a and /home on stat1" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/9258 [14:41:09] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/9258 [14:42:10] New review: Dzahn; "yea, thx. i like not even having to touch site.pp that way." 
[operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/9258 [14:42:12] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/9258 [14:44:55] danke [14:45:07] np. would still wait for succesful upgrades now [14:45:19] eh s/upgrades/backups :o [14:45:33] yeah [14:45:39] how do I know when they are finished? [14:45:42] i think it will take a while [14:45:47] i'm running puppet now... [14:49:04] ops list should receive email [14:50:43] hm, does the puppet master need to be updated? [14:51:18] i merged on sockpuppet [14:52:26] also logging in on stat1 [14:53:24] !log running puppet on stat1. installs plotting packages [14:53:28] huh? [14:53:29] Logged the message, Master [14:53:49] thought they had been installed all this time [14:54:03] but ensure changed 'purged' to 'present'.. well [14:54:47] PROBLEM - Puppet freshness on es1004 is CRITICAL: Puppet has not run in the last 10 hours [14:58:27] i saw that too, dunno what's up witht that [14:58:34] i don't see /etc/amanda/amanda-client.conf being created though [15:00:23] yea. confirmed..and odd. [15:15:12] hm, mutante, what should I do about that? [15:19:49] still wondering myself.. hrmm [15:20:15] looking at puppet more [15:22:17] Bleh, / is full on hume [15:24:26] Or not quite.. [15:24:26] /dev/md0 159G 113G 39G 75% / [15:24:53] /dev/sda1 6.8G 4.1G 2.4G 64% / [15:25:02] @hume [15:25:04] duh [15:25:11] I missed ssh out [15:25:12] hume == puppetmaster? [15:25:13] * Reedy facepalms [15:25:26] hume is used for batch mediawiki jobs and stuff [15:25:37] Reedy: oh, BUT: 5.0G 5.0G 24K 100% /usr/local/apache [15:25:38] oh sorry, not related to my problem :p [15:25:47] ah [15:26:08] Looks like the problem mutante :p [15:26:15] 5.0Gcommon-local [15:26:31] we're keeping more branches around due to cached stuffs [15:26:44] Reedy: php-1.20wmf2 wmf3 and wmf4 must all exist ? 
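The df lines pasted above can be checked mechanically. A minimal sketch, assuming whitespace-separated df-style lines in which the Use% column is the only field shaped like "75%" and the mount point follows it; the function name and the 90% threshold are illustrative choices, not anything used on hume.

```python
def full_filesystems(df_lines, threshold_percent=90):
    """Return (mount_point, percent_used) for lines at or above the threshold."""
    flagged = []
    for line in df_lines:
        fields = line.split()
        # Skip the last field: a percentage there would have no mount point after it.
        for index, field in enumerate(fields[:-1]):
            if field.endswith("%") and field[:-1].isdigit():
                percent_used = int(field[:-1])
                if percent_used >= threshold_percent:
                    flagged.append((fields[index + 1], percent_used))
                break
    return flagged


# The three lines as they were pasted in the channel (the last one has no device column).
PASTED = [
    "/dev/md0 159G 113G 39G 75% /",
    "/dev/sda1 6.8G 4.1G 2.4G 64% /",
    "5.0G 5.0G 24K 100% /usr/local/apache",
]

if __name__ == "__main__":
    print(full_filesystems(PASTED))
```

Mount points containing spaces are not handled; for live checks on the local host, shutil.disk_usage() avoids parsing text altogether.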
[15:26:59] i see [15:27:04] wmf4 is in active use [15:27:09] wmf5 is starting today [15:27:28] I can't remember how many versions we need to keep due to cached stuff [15:27:33] it was at least 3 I think [15:33:07] no Free PE to extend LVM volume [15:36:26] so mutante, should I just wait a day before trying upgrade? [15:37:39] ottomata: currently i dont know better, but i will look at it again [15:38:29] ok, thanks [15:38:40] do you see the change on puppetmaster? [15:38:57] PROBLEM - Lucene disk space on searchidx1001 is CRITICAL: DISK CRITICAL - free space: / 187 MB (2% inode=66%): /var/lib/ureadahead/debugfs 187 MB (2% inode=66%): [15:39:57] Reedy: _could_ probably take some space from /archive, but not sure about shrinking xfs [15:41:05] ottomata: well, yea, i saw the diff on sockpuppet and its merged and client talks to sockpuppet [15:41:06] It's not overly urgent.. [15:41:17] oh, good [15:41:21] ok [15:44:41] I'll check it with Roan when he's around [15:45:51] Hm... apparently RT does not automatically email me about bugs that I created. [15:46:26] Anyway... RobH, just to reconfirm: I should write my cisco partman scripts to start with sdc and just ignore sda and sdb? [15:46:41] !log hume /usr/local/apache is out of disk (just 5GB but more branches now). (LVM vg "tank" lv "tank-apache" ) but no free extents. could take from /archive but unsure about shrinking the xfs. [15:46:45] Logged the message, Master [15:46:56] andrewbogott: there is now analytics-cisco.cfg [15:47:07] yes, it starts with sdc [15:47:12] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [15:48:18] andrewbogott: re: RT, that probably depends on whether people use "comment" or "reply" or "resolve" [15:49:02] mutante: OK, I'll look. Do you know if it's better than/the same as virt-raid10.cfg? 
[15:49:16] andrewbogott: made 2 fixes analytics-cisco.cfg ,the latest one untested though..i hope they work but was also still going to actually test..(and paravoid as well maybe) [15:49:52] andrewbogott: not the same because raid10 != raid1, which is used in analytics-cisco [15:49:58] 'k [15:50:09] "better" depends on which level you want [15:51:14] that one started out as a copy of raid1-lvm.cfg [15:54:39] virt-raid-10.cfg works fine on /some/ of the ciscos, getting it to work elsewhere may be straightforward. [15:55:47] As if anything with partman is ever straightforward [15:55:54] PROBLEM - Host srv278 is DOWN: PING CRITICAL - Packet loss = 100% [15:56:39] RECOVERY - Host srv278 is UP: PING OK - Packet loss = 0%, RTA = 0.39 ms [15:56:42] /some/ sucks :/ [15:56:58] yeah, partman ..sigh [15:57:43] what could be the difference, some is weird [15:58:20] are you sure you didnt happen to hit Dells? (analytics has both, hence -cisco.cfg and -dell.cfg) [16:00:33] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused [16:01:31] woosters: nice hostname :) [16:01:58] the hotel wifi is flaky [16:02:03] New patchset: Andrew Bogott; "As per RobH's advice, use drives c-j rather than a-h." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10945 [16:02:26] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10945 [16:03:46] woosters: Are you going to put in some beach time, or are you planning to spend your extra time in Greece working from the hotel? [16:05:07] hotel, and staring out the balcony, overlooking Pantheon ;-P [16:05:09] New review: Dzahn; "since this is already used in preseed.cfg for anything virt*, you would have to be sure that any vir..." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/10945 [16:05:14] New review: Andrew Bogott; "I'm confident that no one but me is using this script." 
[operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/10945 [16:05:16] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10945 [16:07:41] New patchset: Andrew Bogott; "Revert "As per RobH's advice, use drives c-j rather than a-h."" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10946 [16:08:03] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10946 [16:08:06] New review: Andrew Bogott; "*hurried revert*" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/10946 [16:08:11] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10946 [16:09:07] andrewbogott: eh, no worries, i dont think anyone is reinstalling existing virt* now, just saying to not rely on anything called virt* being cisco [16:09:29] unless that is in the case. i wasnt using it but i saw it in preseed.cfg [16:10:58] New patchset: Andrew Bogott; "Just in case, create a cisco-specific partman." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10947 [16:11:22] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10947 [16:13:05] mutante: actually, in retrospect, I think that every virt* machine /is/ a cisco. Ryan_Lane, is that right? [16:13:17] no [16:13:20] the old ones aren't [16:13:33] also, virt1000 will not be, either [16:14:32] New review: Dzahn; "so then that looks better indeed as some are Dells" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/10947 [16:14:33] hm... [16:14:34] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10947 [16:14:43] no? [16:15:47] Ryan_Lane: But both kinds have eight drives that we want in a raid-10? [16:15:55] Can you suggest a pattern I can use to distinguish? 
[16:16:28] virt100[1-9] [16:16:30] or something like that [16:17:55] andrewbogott: if necessary you can use | , i had to go like analytics100[1-9]|analytics1010) vs. analytics101[1-9]|analytics102[0-9]) [16:18:28] files/autoinstall/preseed.cfg [16:18:49] looks like preseed is already right, but netboot is not. [16:19:46] preseed.cfg: symbolic link to `netboot.cfg [16:20:15] Well, that explains something [16:25:28] New patchset: Andrew Bogott; "Use virt-raid10-cisco.cfg for cisco virt boxes." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10951 [16:25:51] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10951 [16:27:59] !log installing security upgrades on sodium [16:28:03] Logged the message, Master [16:28:49] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.037 second response time [16:34:05] New review: Dzahn; "that looks right cause these are all in linux-host-entries.ttyS0-115200. the "S0" part tells you the..." [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/10951 [16:34:07] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10951 [16:39:46] PROBLEM - Puppet freshness on searchidx2 is CRITICAL: Puppet has not run in the last 10 hours [16:42:46] PROBLEM - Host virt1001 is DOWN: PING CRITICAL - Packet loss = 100% [16:46:22] RECOVERY - Host virt1001 is UP: PING OK - Packet loss = 0%, RTA = 26.45 ms [16:54:01] notpeter: searchidx1001 /dev/sda1 9.9G 9.4G 0 100% / [16:59:41] maybe move mw files to /a too [17:01:12] yes. hrm. ok. [17:01:22] damnit. I was hoping that that wouldn't happen for a week ro two... [17:02:03] haha [17:02:11] I need to confirm with roan how many we need to keep [17:02:13] and note it somewhere [17:04:10] well damnit. 
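Back on the virt* partman question: the first-match-wins hostname globbing suggested above (virt100[1-9], with | alternatives as in the analytics entries) can be sketched as follows. The recipe file names come from the discussion, but pairing virt100[1-9] with virt-raid10-cisco.cfg and falling back to virt-raid10.cfg for the remaining virt* hosts is an assumption about how the change was wired up, and the function name is hypothetical.

```python
from fnmatch import fnmatchcase

# Ordered like a shell `case` statement: the first matching pattern wins.
# The pattern-to-recipe pairing is assumed, not copied from netboot.cfg.
RECIPES = [
    ("virt100[1-9]", "virt-raid10-cisco.cfg"),
    ("virt*", "virt-raid10.cfg"),
]


def partman_recipe(hostname):
    """Return the recipe for the first pattern matching hostname, or None."""
    for pattern, recipe in RECIPES:
        if fnmatchcase(hostname, pattern):
            return recipe
    return None


if __name__ == "__main__":
    for host in ("virt1001", "virt1000", "virt1", "db29"):
        print(host, partman_recipe(host))
```

Because the specific pattern is listed before the catch-all, virt1000 falls through to the generic recipe, which agrees with the remark that virt1000 will not be a Cisco.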
[17:04:39] yeah, I guess I'm going to have to throw it on /a [17:04:52] the good news is that that box will be reimaged soon [17:15:38] andrewbogott: group="virt"; for MAC in $(grep -b1 $group linux-host-entries.ttyS* | grep hardware | cut -d " " -f3 | cut -d: -f1,2,3); do echo $MAC; if $(curl -s http://www.coffer.com/mac_find/?string=$MAC | grep -q Cisco); then echo "It's a Cisco"; else echo "Might be Dell or something else"; fi; done [17:16:02] :p yes, probably shorter with awk or something ;) bbl [17:16:19] mawk [17:22:11] in ./puppet/files/dhcpd that is, local git repo. ..out [17:27:18] RECOVERY - Lucene disk space on searchidx1001 is OK: DISK OK [17:28:47] !log moving /usr/local/apache to /a/apche with symbolic link on searchidx1001 as a temp measure until it can be reimaged [17:28:52] Logged the message, notpeter [17:36:51] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/10690 [17:36:53] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10690 [17:56:03] !log enabling TitleBlacklist on labsconsole [17:56:08] Logged the message, Master [17:59:21] can someone merge this https://gerrit.wikimedia.org/r/#/c/9627/ [17:59:55] Ryan_Lane: are we creating weird titles? [18:00:06] no [18:00:13] :o [18:00:16] I want to ensure certain user accounts are never created [18:13:39] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [18:39:03] New patchset: Hashar; "Upping scap forklimit from 5 to 10 to speed up sync" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/9130 [18:39:29] New review: Hashar; "Patchset 2 rewrite commit message and rebase change on latest master." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/9130 [18:39:30] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/9130 [18:41:30] can some ops please have a look at https://gerrit.wikimedia.org/r/#/c/9130/ ? It is about raising the fork limit from 5 to 10 when doing a scap so we get more boxes to sync at the sametime and hopefully make scap a bit faster [18:42:16] wasn't it set to 5 so that it didn't crash the nfs server? [18:42:35] !log resuming coversion of es1004 to innodb, using compact row format after testing dynamic and compressed [18:42:40] Logged the message, Master [18:42:51] Ryan_Lane: cause of too many parallel requests on the filesystem? Sounds weird [18:42:56] but not entirely impossible though :-( [18:43:54] * hashar checks log [18:46:58] Ryan_Lane: indeed, reduced from 30 to 5 by Tim Starling cause the main copy job was saturating the networking link out of nfs1 :-/ [18:47:15] * hashar wants QoS on our network links [18:48:04] QoS probably wouldn't help much there [18:48:13] the new deployment system will likely be slightly more efficient [18:48:34] we should really start sending the localization updates across compressed, as well [18:48:38] that right there would help a lot [18:49:05] <^demon> Or localize less ;-) [18:49:40] the localization stuff went from like 500MB to 50MB [18:49:43] compressed [18:49:47] \O/ [18:49:53] New review: Hashar; "The fork limit was reduced from 30 to 5 by Tim Starling with https://gerrit.wikimedia.org/r/#/c/6463..." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/9130 [18:49:54] so, we should do that ;) [18:50:05] I have quoted Tim's reasoning on change 9130 [18:50:12] feel free to veto the change :-] [18:50:53] rsync --compress ? 
;) [18:51:47] New review: Pyoungmeister; "(no comment)" [operations/mediawiki-config] (master); V: 0 C: 2; - https://gerrit.wikimedia.org/r/10416 [18:51:49] Change merged: Pyoungmeister; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/10416 [18:53:55] RobH, Ryan_Lane: I'm still confused about drive configuration on the ciscos. virt1001 (which I believe to be a cisco) is built and running and appears to have sda through sdh. So, if the ciscos don't have sda or sdb, what's with that? [18:54:17] umm [18:54:21] that would be strange [18:54:30] they should definitely have that [18:55:08] Ryan_Lane: Observe RT 3055. Robh thinks that none of the ciscos should have a or b. [18:55:17] Which also leaves me wondering if they should have c-h or c-j. [18:55:52] And regardless there's a clear difference between virt1001 and virt1002 which remains a problem regardless [18:56:02] um... boy, saying 'regardless' a lot. [18:56:22] hm [18:58:55] I just ran a c-j script ion virt1002 and it is also misbehaving in some unobvious way. [19:03:37] New review: Reedy; "I know. At 5 it leaves a lot of free network capacity, and also takes an age" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/9130 [19:04:15] Reedy: been talking about --forklimit with Ryan here :) [19:04:23] (aka change 9130) [19:04:45] andrewbogott: such as? [19:05:30] hashar: amusingly, sync-common-file is still on 30, and pushing a directory tree out was fine.. [19:06:07] Reedy: looks like the issue is when transferring the 600MB or so of l10n cachefiles [19:06:16] Yes, I know [19:06:34] I've caused the problem 2 or 3 times :p [19:06:37] But then again, when I did it today... It was fine.. [19:07:59] andrewbogott: sorry my irc was borked since i upgraded to lion [19:08:00] Ryan_Lane: Unclear. Usually when partman fails it bounces you back to a manual config screen. On virt1002 it made it through but then waited for user input to confirm changes. 
And, the config that it displays on the confirmation screen only shows six drives rather than eight. [19:08:04] finally fixed =P [19:08:08] I'm waiting for the build to finish to see what it really did. [19:08:29] RobH: Do you have a backscroll or should I start over? [19:08:44] so mutante is the person who advised it didn't see the sda sdb thing [19:09:00] if he is about, he would be more advisable to chat with for this particular issue [19:09:01] Hm... that's definitely true for some but not all of the ciscos. [19:09:27] if you have a list of some that do and some that don't, just append it to an RT ticket and assign to him or me, but make me CC on ticket no matter what [19:09:36] cuz we may need to do a comparison of the servers to see what is differing [19:09:40] they should all act the same. [19:09:45] Reedy: so that was just my 2 cents :-] I am removing myself from 9130 for now [19:10:11] I got too many changes in my review queue :) [19:10:48]
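On the --forklimit discussion for change 9130: the trade-off is a cap on how many hosts pull from the source at once. A minimal sketch of that idea, using a thread pool as a stand-in; this is not how scap itself is implemented, and sync_all, sync_one and the host names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def sync_all(hosts, sync_one, fork_limit=5):
    """Run sync_one(host) for every host, with at most fork_limit running at once."""
    with ThreadPoolExecutor(max_workers=fork_limit) as pool:
        # map() yields results in the same order as the input hosts.
        return list(pool.map(sync_one, hosts))


if __name__ == "__main__":
    def fake_sync(host):
        # Placeholder for the real per-host copy; returns a status string.
        return host + ": synced"

    print(sync_all(["srv1", "srv2", "srv3"], fake_sync, fork_limit=2))
```

Raising the limit from 5 to 10 roughly doubles the number of simultaneous copies the source has to serve, which is the saturation concern quoted from the earlier reduction; sending the l10n files compressed would shrink each copy instead.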