[01:41:04] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 243 seconds [01:41:40] PROBLEM - MySQL Slave Delay on storage3 is CRITICAL: CRIT replication delay 272 seconds [01:46:01] PROBLEM - Puppet freshness on es1003 is CRITICAL: Puppet has not run in the last 10 hours [01:46:01] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours [01:46:01] PROBLEM - Puppet freshness on professor is CRITICAL: Puppet has not run in the last 10 hours [01:46:55] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 0 seconds [02:00:25] RECOVERY - MySQL Slave Delay on storage3 is OK: OK replication delay 15 seconds [02:49:18] PROBLEM - Puppet freshness on db29 is CRITICAL: Puppet has not run in the last 10 hours [04:13:27] New patchset: Logicwiki; "(bug 37447) Modify ZIM display name" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/10855 [04:13:33] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/10855 [04:20:48] New review: Peachey88; "Should this not be done on the extension side of things compared to a local change?" 
[operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/10855 [04:29:18] New patchset: Logicwiki; "(bug 37365) Install Narayam on Gujarati Projects" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/10401 [04:29:24] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/10401 [04:31:12] New patchset: Logicwiki; "(bug 37365) Install Narayam on Gujarati Projects" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/10401 [04:31:17] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/10401 [04:38:41] New review: Peachey88; "I'm not entirely fussed over it, but requests for individual projects should probably be done as ind..." [operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/10401 [04:54:18] PROBLEM - Puppet freshness on es1004 is CRITICAL: Puppet has not run in the last 10 hours [05:09:15] New review: Logicwiki; "It was not done in extension side earlier and probably for some reason. Collection.php on the extens..." 
[operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/10855 [05:46:22] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [06:39:09] PROBLEM - Puppet freshness on searchidx2 is CRITICAL: Puppet has not run in the last 10 hours [06:41:21] New review: Raimond Spekking; "(no comment)" [operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/10401 [06:54:34] Logged the message, Master [06:54:39] Logged the message, Master [06:54:43] Logged the message, Master [08:13:05] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [09:13:03] https://community.rapid7.com/community/metasploit/blog/2012/06/11/cve-2012-2122-a-tragically-comedic-security-flaw-in-mysql [09:16:07] for i in `seq 1 1000`; do mysql -u root --password=bad -h 127.0.0.1 [09:16:09] aah [09:16:18] paravoid: that is really helpful to "recover" your root password :-] [09:18:20] hashar: wut [09:18:34] petan: hello :) [09:18:48] 1000 times of that results in something cool? [09:19:52] :D [09:19:59] petan: https://community.rapid7.com/community/metasploit/blog/2012/06/11/cve-2012-2122-a-tragically-comedic-security-flaw-in-mysql [09:20:03] lol I just see that [09:20:04] posted by para void [09:20:12] (wasn't sure you have seen the link hehe) [09:21:44] damn ya. tragically-comedic :p [09:22:09] ouch [09:22:14] hi there mutante, had a nice trip back? [09:22:33] hi! yea, all worked out [09:22:46] thanks for showing us around, enjoyed it [09:23:01] tapas++ [09:23:18] hi mutante [09:23:26] hi [09:25:08] mutante: would you have couple mins to hack that second query for me, pls? [09:26:12] alright. let me check "rrank.php" again [09:26:14] brb [09:26:32] or do you already have the exact query? 
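Note on the loop quoted at 09:16:07 and the "1000 times" question: as the linked advisory describes CVE-2012-2122, the server kept the integer result of a memcmp()-style comparison in a single byte, so a nonzero result whose low 8 bits are all zero was read as "match". The sketch below only models that truncation with shell arithmetic. The helper name `to_signed_byte` is made up for illustration, and nothing here talks to a MySQL server.

```shell
# Illustration only: model an int comparison result stored in one signed byte.
# to_signed_byte is a hypothetical helper, not part of MySQL or any tool.
to_signed_byte() {
    low=$(( $1 & 255 ))           # keep only the low 8 bits
    if [ "$low" -ge 128 ]; then
        low=$(( low - 256 ))      # reinterpret the byte as signed
    fi
    echo "$low"
}

# Count how many results in 1..1024 collapse to 0 (i.e. look like "equal").
hits=0
n=1
while [ "$n" -le 1024 ]; do
    if [ "$(to_signed_byte "$n")" -eq 0 ]; then
        hits=$(( hits + 1 ))
    fi
    n=$(( n + 1 ))
done
echo "$hits of 1024 nonzero results truncate to 0"
```

That works out to 1 in 256 under this simplified model, which would explain why a few hundred attempts could be enough on an affected build. A host whose comparison routine only ever returns small values would presumably never hit the zero case.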
[09:28:23] hashar: I tried it on my server [09:28:38] it resulted in 1000 * ERROR 1045 (28000): Access denied for user 'root'@'localhost' (using password: YES) [09:28:53] mutante: nope, because idk how your data is stored. but i can create artificial one which you can accomodate [09:29:27] it should be something like select lang from table where family = $family and order = $order [09:30:42] I run ubuntu 10.04 with very old kernel [09:30:45] it's fine :D [09:30:52] only latest version [09:30:54] is broken [09:34:28] there are conflicting reports [09:34:39] people not being able to reproduce it on platforms that are vulnerable [09:39:35] CVE-2012-2122 is all trending on google now [09:39:43] conflicting reports indeed [09:39:54] well strcmp() need to return 128 [09:40:44] which seems to be the difference of the byte values between the first two different characters [09:41:20] anyway the patch is easy [09:41:24] well [09:41:24] do note [09:41:40] if it would be widespread [09:41:48] any bruteforce attack would've been successful in the past [09:41:49] :) [09:42:02] ahah [09:47:18] I tried it on Linux 2.6.32-32-generic #62-Ubuntu SMP Wed Apr 20 21:52:38 UTC 2011 x86_64 GNU/Linux and 3.2.0-24-generic-pae #39-Ubuntu SMP Mon May 21 18:54:21 UTC 2012 i686 i686 i386 GNU/Linux [09:47:28] both didn't let me pass [09:47:41] first one is marked as vulnerable [09:47:45] second is 32 bit [09:47:46] pae [09:49:07] paravoid: could not reproduce on Ubuntu lucid in labs. mysql Ver 15.1 Distrib 5.3.7-MariaDB, for debian-linux-gnu (x86_64) using readline 5.2 [09:49:33] labs are x86 [09:49:39] these aren't marked [09:49:53] i see. ack [09:52:55] RECOVERY - MySQL Replication Heartbeat on db1047 is OK: OK replication delay seconds [09:53:49] RECOVERY - MySQL Slave Running on db1047 is OK: OK replication [09:53:58] I don't think this issue could affect prod anyway, because only ops can access mysql using shell [09:54:15] Danny_B|backup: how about this? 
https://wikistats.wmflabs.org/rrank.php (was just broken because it includes a config file which had been moved to correct path) [09:54:19] or, are there any mysql servers accessible from labs network? :P [09:54:33] I mean mysql on prod, accessible over labs [09:54:41] it's same network I guess [09:54:50] na, labs has its own network [09:55:04] ok, but is it possible to ping for example some machine on prod, using local IP? [09:55:10] from labs [09:55:13] no, it shouldn't be [09:55:15] ok [09:56:07] mutante: this does the rank for lang&family which we already have, not the language for rank & family [09:56:21] !log shut down mysqld on db1047, repairing tables [09:56:22] PROBLEM - mysqld processes on db1047 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [09:56:26] Logged the message, Master [09:57:26] apergos: you know you don't have to take mysql down [09:57:29] to 'repair tables' [09:57:31] right? [09:57:46] I'm having to make a copy of what's there, the quick repair failed [10:01:42] so my next q: do we have any idea if the table as it is repaired actually has the data in it that the master had at the point when slave replication broke? or is it likely to be in some different state now? [10:02:08] /root/aft/readme.txt on the host has the check output and the repair output saved in it [10:02:43] domas: [10:03:54] if there's a MyISAM table on the cluster [10:03:59] that means people don't care about data in it [10:04:06] which means that you can safely truncate it [10:04:07] he he he he [10:04:34] :-P :-P [10:09:36] mutante: i think udp goes through to prod, icmp+tcp don't? 
(from labs) [10:09:50] unless it was fixed [10:10:20] apergos: if the table is tiny and doesn't get writes, you can just reload it on the slave once it has caught up replicating :) [10:10:41] ok, I'll see [10:10:43] thanks [10:11:03] I don't know the answer to any of those things but let me get it cranked up here [10:11:03] * jeremyb runs away [10:12:16] RECOVERY - mysqld processes on db1047 is OK: PROCS OK: 1 process with command name mysqld [10:14:31] PROBLEM - MySQL Replication Heartbeat on db1047 is CRITICAL: CRIT replication delay 407089 seconds [10:14:49] PROBLEM - MySQL Slave Delay on db1047 is CRITICAL: CRIT replication delay 407084 seconds [10:14:49] yeah it sure is [10:17:47] jeremyb: should have been fixed as well if referring to udp traffic for nagios snmp. labs nagios does it inside labs network and separated https://gerrit.wikimedia.org/r/#/c/2973/ [10:30:52] Danny_B|backup: like this? do you want even less output and just the prefix alone? https://wikistats.wmflabs.org/lrank.php?family=w&rank=9 [10:33:24] mutante: yup, nice. but some numbers do not return (proper) results - try 17 or 20 [10:35:56] mutante: lang code is good enough for me if it's a problem because of lang name [10:46:29] Danny_B|backup: yup, i see. still broken, i shall fix it after lunch. good enough to confirm that is what you intended. 
bbiab [11:46:23] New patchset: Mark Bergsma; "Add RFC 4760 attribute type constants" [operations/debs/pybal] (mp-bgp) - https://gerrit.wikimedia.org/r/10859 [11:46:23] New patchset: Mark Bergsma; "Add RFC4760 MP_REACH_NLRI attribute class" [operations/debs/pybal] (mp-bgp) - https://gerrit.wikimedia.org/r/10860 [11:46:24] New patchset: Mark Bergsma; "Add RFC4760 MP_UNREACH_NRLI attribute class" [operations/debs/pybal] (mp-bgp) - https://gerrit.wikimedia.org/r/10861 [11:46:25] New patchset: Mark Bergsma; "Fix BGPException name" [operations/debs/pybal] (mp-bgp) - https://gerrit.wikimedia.org/r/10862 [11:46:39] PROBLEM - Puppet freshness on es1003 is CRITICAL: Puppet has not run in the last 10 hours [11:46:39] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours [11:46:39] PROBLEM - Puppet freshness on professor is CRITICAL: Puppet has not run in the last 10 hours [12:01:27] New review: Faidon; "(no comment)" [operations/debs/pybal] (mp-bgp); V: 0 C: 1; - https://gerrit.wikimedia.org/r/10859 [12:01:33] New review: Faidon; "(no comment)" [operations/debs/pybal] (mp-bgp); V: 0 C: 1; - https://gerrit.wikimedia.org/r/10860 [12:01:40] New review: Faidon; "(no comment)" [operations/debs/pybal] (mp-bgp); V: 0 C: 1; - https://gerrit.wikimedia.org/r/10861 [12:01:45] New review: Faidon; "(no comment)" [operations/debs/pybal] (mp-bgp); V: 0 C: 1; - https://gerrit.wikimedia.org/r/10862 [12:45:06] PROBLEM - Apache HTTP on srv232 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:50:21] PROBLEM - Puppet freshness on db29 is CRITICAL: Puppet has not run in the last 10 hours [13:01:40] Can anyone help me with resending password to a list-admin on lists.wikimedia.org? [13:01:48] There is no forgot-password system that users can use. [13:02:16] I lost my cvn and cvn-private list-admin pwd code (one of those automatic generated ones), and like them to be resend or re-created. [13:20:39] Krinkle: yep, which list is it? 
[13:20:49] cvn@ and cvn-private@ [13:20:51] mutante: [13:20:52] New patchset: Mark Bergsma; "Add IPv6 support to IPPrefix classes" [operations/debs/pybal] (mp-bgp) - https://gerrit.wikimedia.org/r/10865 [13:28:46] Krinkle: the list owner email addresses should have received mail [13:28:59] thx [13:29:01] np [13:29:21] mutante: I kept receiving mails about stuff I need to approve or reject, but couldn't log in (they stack up every day) [13:29:25] Now I can fix them [13:30:51] note that you can also have/use separate "moderator" password who can do that but not change list settings. (you can now set that yourself and/or share the moderator pass) [13:33:42] the login URLs for admin and mod almost look the same but are "admin" vs. "admindb" [14:10:14] mutante: thx, I'll look into thatr [14:11:15] hi guys [14:11:31] i need to get a newer version of nodejs and npm into our apt repo [14:11:39] who should I ask about that? [14:17:19] mutante, would you know? [14:17:31] there is a guy who has made some good debs for it, and I can install if I use his repo [14:17:50] but we are supposed to put all the debs we install in our own wikimedia apt repo, right? [14:18:50] in production, yes. in labs you could use external [14:19:11] ottomata: is the version in precise newer? [14:19:23] or do you need a newer version than whats there? [14:20:05] the version in precise is new enough it hink [14:20:17] i think i saw that it is 0.6.12, which is new enough [14:20:25] and, this is for production on stat1 [14:20:37] we want to move the new reportcard stuff to stat1 and off of labs [14:20:47] * Ryan_Lane nods [14:20:55] we should upgrade to precise, then [14:20:56] ok, stat1 reinstall with precise? [14:21:12] we can do an upgrade, rather than a reinstall [14:21:16] we can? [14:21:19] why not? 
[14:21:20] true, yes [14:21:33] if we can just upgrade, that would be way easier than reinstalling [14:21:50] there's abunch of stuff there that we'd like to save (300GB or something), so an upgrade is much easier [14:21:59] i've got sudo there, can I do that? [14:22:13] or do I need console access of some kind? [14:22:39] unless there is something unexpected, no [14:23:24] would upgrade, dist-upgrade, edit apt sources files, upgrade, dist-upgrade, reboot ..or similar [14:23:45] if you want to i could watch mgmt during reboot [14:24:15] oh wait, we should use ubuntu way [14:24:44] yeah? I'm googling now [14:24:53] also [14:24:54] um [14:24:57] maybe this should happen first? [14:24:57] https://gerrit.wikimedia.org/r/#/c/9258/ [14:26:09] ottomata: http://wikitech.wikimedia.org/view/Distribution_upgrades [14:26:46] what i said before was more Debian, would most likely also work, but "do-release-upgrade" is preferred here [14:27:12] ok cool [14:27:21] should we get the amanda backups up and running first? [14:27:43] cant say no to backups before upgrades:) yes [14:35:27] New review: Dzahn; "i just meant the "include backup::client" line itself, which makes sure amanda is installed. the sta..." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/9258 [14:38:18] ottomata: ^. just the "install backup client" seems to fit in role class. no? and maybe that one red line 2115 in site.pp [14:39:29] ok [14:40:43] New patchset: Ottomata; "Setting up amanda backups for /a and /home on stat1" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/9258 [14:41:09] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/9258 [14:42:10] New review: Dzahn; "yea, thx. i like not even having to touch site.pp that way." 
[operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/9258 [14:42:12] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/9258 [14:44:55] danke [14:45:07] np. would still wait for succesful upgrades now [14:45:19] eh s/upgrades/backups :o [14:45:33] yeah [14:45:39] how do I know when they are finished? [14:45:42] i think it will take a while [14:45:47] i'm running puppet now... [14:49:04] ops list should receive email [14:50:43] hm, does the puppet master need to be updated? [14:51:18] i merged on sockpuppet [14:52:26] also logging in on stat1 [14:53:24] !log running puppet on stat1. installs plotting packages [14:53:28] huh? [14:53:29] Logged the message, Master [14:53:49] thought they had been installed all this time [14:54:03] but ensure changed 'purged' to 'present'.. well [14:54:47] PROBLEM - Puppet freshness on es1004 is CRITICAL: Puppet has not run in the last 10 hours [14:58:27] i saw that too, dunno what's up witht that [14:58:34] i don't see /etc/amanda/amanda-client.conf being created though [15:00:23] yea. confirmed..and odd. [15:15:12] hm, mutante, what should I do about that? [15:19:49] still wondering myself.. hrmm [15:20:15] looking at puppet more [15:22:17] Bleh, / is full on hume [15:24:26] Or not quite.. [15:24:26] /dev/md0 159G 113G 39G 75% / [15:24:53] /dev/sda1 6.8G 4.1G 2.4G 64% / [15:25:02] @hume [15:25:04] duh [15:25:11] I missed ssh out [15:25:12] hume == puppetmaster? [15:25:13] * Reedy facepalms [15:25:26] hume is used for batch mediawiki jobs and stuff [15:25:37] Reedy: oh, BUT: 5.0G 5.0G 24K 100% /usr/local/apache [15:25:38] oh sorry, not related to my problem :p [15:25:47] ah [15:26:08] Looks like the problem mutante :p [15:26:15] 5.0Gcommon-local [15:26:31] we're keeping more branches around due to cached stuffs [15:26:44] Reedy: php-1.20wmf2 wmf3 and wmf4 must all exist ? 
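Note on the df lines pasted at 15:24 to 15:25: the root filesystem on hume looked fine while /usr/local/apache sat at 100%. A small sketch of filtering `df -P`-style output for mounts at or above a threshold follows. The function name `mounts_over` and the cutoff values are assumptions for illustration; the sample rows reuse the figures pasted in the channel, and the device name on the last row is a placeholder because the channel paste did not include one.

```shell
# Hypothetical helper: read df -P style output on stdin and print the
# mount point of every filesystem whose Use% is at or above $1.
mounts_over() {
    awk -v limit="$1" 'NR > 1 {
        use = $5
        sub(/%/, "", use)             # "100%" -> "100"
        if (use + 0 >= limit) print $6
    }'
}

# Sample rows based on the figures pasted in the channel; the device name
# on the last row is a placeholder.
sample='Filesystem Size Used Avail Use% Mounted on
/dev/sda1 6.8G 4.1G 2.4G 64% /
placeholder-device 5.0G 5.0G 24K 100% /usr/local/apache'

printf '%s\n' "$sample" | mounts_over 90
```

On a live host the same filter could be fed with `df -P | mounts_over 90`. It assumes mount points without spaces, since it splits on whitespace.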
[15:26:59] i see [15:27:04] wmf4 is in active use [15:27:09] wmf5 is starting today [15:27:28] I can't remember how many versions we need to keep due to cached stuff [15:27:33] it was at least 3 I think [15:33:07] no Free PE to extend LVM volume [15:36:26] so mutante, should I just wait a day before trying upgrade? [15:37:39] ottomata: currently i dont know better, but i will look at it again [15:38:29] ok, thanks [15:38:40] do you see the change on puppetmaster? [15:38:57] PROBLEM - Lucene disk space on searchidx1001 is CRITICAL: DISK CRITICAL - free space: / 187 MB (2% inode=66%): /var/lib/ureadahead/debugfs 187 MB (2% inode=66%): [15:39:57] Reedy: _could_ probably take some space from /archive, but not sure about shrinking xfs [15:41:05] ottomata: well, yea, i saw the diff on sockpuppet and its merged and client talks to sockpuppet [15:41:06] It's not overly urgent.. [15:41:17] oh, good [15:41:21] ok [15:44:41] I'll check it with Roan when he's around [15:45:51] Hm... apparently RT does not automatically email me about bugs that I created. [15:46:26] Anyway... RobH, just to reconfirm: I should write my cisco partman scripts to start with sdc and just ignore sda and sdb? [15:46:41] !log hume /usr/local/apache is out of disk (just 5GB but more branches now). (LVM vg "tank" lv "tank-apache" ) but no free extents. could take from /archive but unsure about shrinking the xfs. [15:46:45] Logged the message, Master [15:46:56] andrewbogott: there is now analytics-cisco.cfg [15:47:07] yes, it starts with sdc [15:47:12] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [15:48:18] andrewbogott: re: RT, that probably depends on whether people use "comment" or "reply" or "resolve" [15:49:02] mutante: OK, I'll look. Do you know if it's better than/the same as virt-raid10.cfg? 
[15:49:16] andrewbogott: made 2 fixes analytics-cisco.cfg ,the latest one untested though..i hope they work but was also still going to actually test..(and paravoid as well maybe) [15:49:52] andrewbogott: not the same because raid10 != raid1, which is used in analytics-cisco [15:49:58] 'k [15:50:09] "better" depends on which level you want [15:51:14] that one started out as a copy of raid1-lvm.cfg [15:54:39] virt-raid-10.cfg works fine on /some/ of the ciscos, getting it to work elsewhere may be straightforward. [15:55:47] As if anything with partman is ever straightforward [15:55:54] PROBLEM - Host srv278 is DOWN: PING CRITICAL - Packet loss = 100% [15:56:39] RECOVERY - Host srv278 is UP: PING OK - Packet loss = 0%, RTA = 0.39 ms [15:56:42] /some/ sucks :/ [15:56:58] yeah, partman ..sigh [15:57:43] what could be the difference, some is weird [15:58:20] are you sure you didnt happen to hit Dells? (analytics has both, hence -cisco.cfg and -dell.cfg) [16:00:33] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused [16:01:31] woosters: nice hostname :) [16:01:58] the hotel wifi is flaky [16:02:03] New patchset: Andrew Bogott; "As per RobH's advice, use drives c-j rather than a-h." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10945 [16:02:26] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10945 [16:03:46] woosters: Are you going to put in some beach time, or are you planning to spend your extra time in Greece working from the hotel? [16:05:07] hotel, and staring out the balcony, overlooking Pantheon ;-P [16:05:09] New review: Dzahn; "since this is already used in preseed.cfg for anything virt*, you would have to be sure that any vir..." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/10945 [16:05:14] New review: Andrew Bogott; "I'm confident that no one but me is using this script." 
[operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/10945 [16:05:16] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10945 [16:07:41] New patchset: Andrew Bogott; "Revert "As per RobH's advice, use drives c-j rather than a-h."" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10946 [16:08:03] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10946 [16:08:06] New review: Andrew Bogott; "*hurried revert*" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/10946 [16:08:11] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10946 [16:09:07] andrewbogott: eh, no worries, i dont think anyone is reinstalling existing virt* now, just saying to not rely on anything called virt* being cisco [16:09:29] unless that is in the case. i wasnt using it but i saw it in preseed.cfg [16:10:58] New patchset: Andrew Bogott; "Just in case, create a cisco-specific partman." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10947 [16:11:22] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10947 [16:13:05] mutante: actually, in retrospect, I think that every virt* machine /is/ a cisco. Ryan_Lane, is that right? [16:13:17] no [16:13:20] the old ones aren't [16:13:33] also, virt1000 will not be, either [16:14:32] New review: Dzahn; "so then that looks better indeed as some are Dells" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/10947 [16:14:33] hm... [16:14:34] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10947 [16:14:43] no? [16:15:47] Ryan_Lane: But both kinds have eight drives that we want in a raid-10? [16:15:55] Can you suggest a pattern I can use to distinguish? 
[16:16:28] virt100[1-9] [16:16:30] or something like that [16:17:55] andrewbogott: if necessary you can use | , i had to go like analytics100[1-9]|analytics1010) vs. analytics101[1-9]|analytics102[0-9]) [16:18:28] files/autoinstall/preseed.cfg [16:18:49] looks like preseed is already right, but netboot is not. [16:19:46] preseed.cfg: symbolic link to `netboot.cfg [16:20:15] Well, that explains something [16:25:28] New patchset: Andrew Bogott; "Use virt-raid10-cisco.cfg for cisco virt boxes." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10951 [16:25:51] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10951 [16:27:59] !log installing security upgrades on sodium [16:28:03] Logged the message, Master [16:28:49] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.037 second response time [16:34:05] New review: Dzahn; "that looks right cause these are all in linux-host-entries.ttyS0-115200. the "S0" part tells you the..." [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/10951 [16:34:07] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10951 [16:39:46] PROBLEM - Puppet freshness on searchidx2 is CRITICAL: Puppet has not run in the last 10 hours [16:42:46] PROBLEM - Host virt1001 is DOWN: PING CRITICAL - Packet loss = 100% [16:46:22] RECOVERY - Host virt1001 is UP: PING OK - Packet loss = 0%, RTA = 26.45 ms [16:54:01] notpeter: searchidx1001 /dev/sda1 9.9G 9.4G 0 100% / [16:59:41] maybe move mw files to /a too [17:01:12] yes. hrm. ok. [17:01:22] damnit. I was hoping that that wouldn't happen for a week ro two... [17:02:03] haha [17:02:11] I need to confirm with roan how many we need to keep [17:02:13] and note it somewhere [17:04:10] well damnit. 
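Note on the hostname patterns suggested at 16:16 and 16:17: they are shell `case` globs, where `|` separates alternatives. A minimal sketch of how such patterns could pick a partman recipe is below. The function name `recipe_for` is hypothetical, `analytics-dell.cfg` is inferred from the "-cisco.cfg and -dell.cfg" remark, the virt mapping is an assumption based on the "virt100[1-9] or something like that" suggestion, and none of this is the contents of the real netboot.cfg.

```shell
# Hypothetical helper: map a hostname to a partman recipe file name.
# Order matters: the narrow virt100[1-9] glob must come before virt*.
recipe_for() {
    case "$1" in
        virt100[1-9])                          echo "virt-raid10-cisco.cfg" ;;
        virt*)                                 echo "virt-raid10.cfg" ;;
        analytics100[1-9]|analytics1010)       echo "analytics-cisco.cfg" ;;
        analytics101[1-9]|analytics102[0-9])   echo "analytics-dell.cfg" ;;
        *)                                     echo "unknown" ;;
    esac
}

recipe_for virt1001
recipe_for virt1000
```

With these globs virt1001 falls in the Cisco branch while virt1000 falls through to the generic virt recipe, matching the remark that virt1000 will not be a Cisco.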
[17:04:39] yeah, I guess I'm going to have to throw it on /a [17:04:52] the good news is that that box will be reimaged soon [17:15:38] andrewbogott: group="virt"; for MAC in $(grep -b1 $group linux-host-entries.ttyS* | grep hardware | cut -d " " -f3 | cut -d: -f1,2,3); do echo $MAC; if $(curl -s http://www.coffer.com/mac_find/?string=$MAC | grep -q Cisco); then echo "It's a Cisco"; else echo "Might be Dell or something else"; fi; done [17:16:02] :p yes, probably shorter with awk or something ;) bbl [17:16:19] mawk [17:22:11] in ./puppet/files/dhcpd that is, local git repo. ..out [17:27:18] RECOVERY - Lucene disk space on searchidx1001 is OK: DISK OK [17:28:47] !log moving /usr/local/apache to /a/apche with symbolic link on searchidx1001 as a temp measure until it can be reimaged [17:28:52] Logged the message, notpeter [17:36:51] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/10690 [17:36:53] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10690 [17:56:03] !log enabling TitleBlacklist on labsconsole [17:56:08] Logged the message, Master [17:59:21] can someone merge this https://gerrit.wikimedia.org/r/#/c/9627/ [17:59:55] Ryan_Lane: are we creating weird titles? [18:00:06] no [18:00:13] :o [18:00:16] I want to ensure certain user accounts are never created [18:13:39] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [18:39:03] New patchset: Hashar; "Upping scap forklimit from 5 to 10 to speed up sync" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/9130 [18:39:29] New review: Hashar; "Patchset 2 rewrite commit message and rebase change on latest master." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/9130 [18:39:30] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/9130 [18:41:30] can some ops please have a look at https://gerrit.wikimedia.org/r/#/c/9130/ ? It is about raising the fork limit from 5 to 10 when doing a scap so we get more boxes to sync at the sametime and hopefully make scap a bit faster [18:42:16] wasn't it set to 5 so that it didn't crash the nfs server? [18:42:35] !log resuming coversion of es1004 to innodb, using compact row format after testing dynamic and compressed [18:42:40] Logged the message, Master [18:42:51] Ryan_Lane: cause of too many parallel requests on the filesystem? Sounds weird [18:42:56] but not entirely impossible though :-( [18:43:54] * hashar checks log [18:46:58] Ryan_Lane: indeed, reduced from 30 to 5 by Tim Starling cause the main copy job was saturating the networking link out of nfs1 :-/ [18:47:15] * hashar wants QoS on our network links [18:48:04] QoS probably wouldn't help much there [18:48:13] the new deployment system will likely be slightly more efficient [18:48:34] we should really start sending the localization updates across compressed, as well [18:48:38] that right there would help a lot [18:49:05] <^demon> Or localize less ;-) [18:49:40] the localization stuff went from like 500MB to 50MB [18:49:43] compressed [18:49:47] \O/ [18:49:53] New review: Hashar; "The fork limit was reduced from 30 to 5 by Tim Starling with https://gerrit.wikimedia.org/r/#/c/6463..." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/9130 [18:49:54] so, we should do that ;) [18:50:05] I have quoted Tim's reasoning on change 9130 [18:50:12] feel free to veto the change :-] [18:50:53] rsync --compress ? 
;) [18:51:47] New review: Pyoungmeister; "(no comment)" [operations/mediawiki-config] (master); V: 0 C: 2; - https://gerrit.wikimedia.org/r/10416 [18:51:49] Change merged: Pyoungmeister; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/10416 [18:53:55] RobH, Ryan_Lane: I'm still confused about drive configuration on the ciscos. virt1001 (which I believe to be a cisco) is built and running and appears to have sda through sdh. So, if the ciscos don't have sda or sdb, what's with that? [18:54:17] umm [18:54:21] that would be strange [18:54:30] they should definitely have that [18:55:08] Ryan_Lane: Observe RT 3055. Robh thinks that none of the ciscos should have a or b. [18:55:17] Which also leaves me wondering if they should have c-h or c-j. [18:55:52] And regardless there's a clear difference between virt1001 and virt1002 which remains a problem regardless [18:56:02] um... boy, saying 'regardless' a lot. [18:56:22] hm [18:58:55] I just ran a c-j script ion virt1002 and it is also misbehaving in some unobvious way. [19:03:37] New review: Reedy; "I know. At 5 it leaves a lot of free network capacity, and also takes an age" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/9130 [19:04:15] Reedy: been talking about --forklimit with Ryan here :) [19:04:23] (aka change 9130) [19:04:45] andrewbogott: such as? [19:05:30] hashar: amusingly, sync-common-file is still on 30, and pushing a directory tree out was fine.. [19:06:07] Reedy: looks like the issue is when transferring the 600MB or so of l10n cachefiles [19:06:16] Yes, I know [19:06:34] I've caused the problem 2 or 3 times :p [19:06:37] But then again, when I did it today... It was fine.. [19:07:59] andrewbogott: sorry my irc was borked since i upgraded to lion [19:08:00] Ryan_Lane: Unclear. Usually when partman fails it bounces you back to a manual config screen. On virt1002 it made it through but then waited for user input to confirm changes. 
And, the config that it displays on the confirmation screen only shows six drives rather than eight. [19:08:04] finally fixed =P [19:08:08] I'm waiting for the build to finish to see what it really did. [19:08:29] RobH: Do you have a backscroll or should I start over? [19:08:44] so mutante is the person who advised it didn't see the sda sdb thing [19:09:00] if he is about, he would be more advisable to chat with for this particular issue [19:09:01] Hm... that's definitely true for some but not all of the ciscos. [19:09:27] if you have a list of some that do and some that don't, just append it to an RT ticket and assign to him or me, but make me CC on ticket no matter what [19:09:36] cuz we may need to do a comparison of the servers to see what is differing [19:09:40] they should all act the same. [19:09:45] Reedy: so that was just my 2 cents :-] I am removing myself from 9130 for now [19:10:11] I got too many changes in my review queue :) [19:10:48] RobH: OK, I'm confused. You're telling me that mutante confirmed the problem that I described in the RT bug and... [19:11:10] I'm saying that he ran into the issue you are talking about when installing those same servers for you initially [19:11:49] are you getting mixed results, some with and some without an sda and sdb? [19:12:04] heya, can someone please merge https://gerrit.wikimedia.org/r/#/c/9627/ [19:12:15] Sorry, here's why I'm confused. I think you are asking me to file a ticket which, isn't 3055 already that exact ticket? [19:12:27] yes. [19:12:52] "while virt1001 has sda-sdh defined, virt1002 has sdc-sdj defined and no sda or sdb" [19:13:29] So, I guess that's just one example. I'll check 1001-1008 and update the ticket accordingly. [19:13:35] andrewbogott, robh: "no sda/sdb" was true for analytics1001 to 1010, which are all Cisco UCS C250 M1. analytics1001 indeed appears to be different from others regarding sda, but nevertheless also has sde to sdj. 
both say "part of 10/2011 donation" [19:13:40] you don't need to refile, I'm reading [19:13:55] hrmm [19:14:13] andrewbogott: confirm we aren't using virt1001 quite yet though right? so when I'm on site tomorrow i can take it down for this? [19:14:25] drdee: I am not op, but I am not sure they are going to allow a git clone made out of a repo outside of the WMF cluster (aka git.less.ly ) [19:14:25] i am going to have to poke at all the settings on the two systems to determine whats different [19:14:36] New review: Aaron Schulz; "(no comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/9130 [19:14:48] you can compare bios and such as well if you like, two sets of eyes and all [19:14:49] Ryan_Lane should confirm, but I believe that 1001-1008 are totally unused. [19:14:56] cuz all 40 systems should be identical. [19:15:04] I cycled/rebuilt 1001 an hour ago, so I sure as hell hope no one else is using it :) [19:15:12] the fact we have differing results for you guys is not ok =/ [19:16:05] RobH: eh sorry, i meant to say "virt1001" appears to be different from virt1002 as andrew points out, the analytics were the same [19:16:11] ok, well, i will be spending tomorrow onsite in eqiad, so i will take a look at this then, which incidentally will help both virt systems and analytics [19:16:15] Lemme see if I'm smart enough to tell where harddrives are mounted on a system w/out an OS... [19:16:38] hashar: yeah that was a desperate action on our side as we were waiting more than 10 days for a repo, guess we have to move it back now [19:16:41] RobH: OK, thank you! I wanted to make sure and catch you while you're in DC :) [19:16:43] the lights out mgmt web interface shows disks and the like [19:17:26] New patchset: Hashar; "make 'puppet parser validate' errors monospaced" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4145 [19:17:50] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4145 [19:17:56] New review: Hashar; "Patchset 4 is a rebase on top of latest master" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/4145 [19:19:20] New patchset: Hashar; "puppetize wikibugs (irc bot for bugzilla)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8339 [19:19:44] New review: Hashar; "Patchset 3 is a rebase on top of latest master" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/8339 [19:19:44] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/8339 [19:20:34] New patchset: Hashar; "disable "last message repeated n times"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7773 [19:20:58] New patchset: Demon; "Setting pack.deltacompression = true" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10956 [19:21:20] New review: Hashar; "Patchset 2 is a rebase on top of latest master" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/7773 [19:21:20] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/7773 [19:21:20] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10956 [19:24:03] New patchset: Ottomata; "Puppetizing gerrit-stats on stat1" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/9627 [19:24:27] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/9627 [19:27:15] can someone please review / merge https://gerrit.wikimedia.org/r/#/c/9627/ we moved the code to gerrit.wikimedia.org [19:30:07] New patchset: Pyoungmeister; "de-pooling one node per shard for kernel upgrade" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/10957 [19:30:14] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/10957 [19:46:07] New patchset: Ottomata; "Moving all Wikipedia Zero filters to oxygen" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10958 [19:46:23] could someone approve that real quick please? [19:46:30] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/10958 [19:48:59] Ryan_Lane, got a min to give drdee and I some reviews? [19:49:30] well, I'm sick and technically have already worked all day.... [19:49:36] I'll take a quick look, but no promises [19:50:16] there's no need for a custom user for these stats? [19:50:42] ah [19:50:46] it's diederik [19:50:51] that sounds like a *really* bad idea [19:51:01] i had a long discussion with people about this already [19:51:10] and they said ok to this? [19:51:17] who was it, so that I can give them shit? [19:51:57] RobH: Hey I hear you're the guy to talk to if I need a machine? 
I need a nodejs box for the VisualEditor demo deployment [19:51:59] i actually don't remember, it was in this room [19:52:03] and [19:52:08] If you have a box for me that'd be great, otherwise I can use cadmium temporarily [19:52:21] (the Wikimania transcoding box that's almost done with Wikimania stuff) [19:52:22] but [19:52:25] the user that runs the command [19:52:27] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: -1; - https://gerrit.wikimedia.org/r/9627 [19:52:30] needs to be able to ssh into gerrit and get stats [19:52:39] yes [19:52:46] which is exactly why it shouldn't be a real user [19:52:54] so what user? [19:53:02] we can make a user for it, probably [19:53:09] and yes, diederik went in and created a private ssh key on stat1 [19:53:14] and also added the public key for him in gerrit [19:53:17] <^demon> Should make a single-purpose user that only has read access. It's what I've done for a couple of things. [19:53:17] that's not acceptable [19:53:25] who can do that for us? [19:53:36] anyone who can make labs accounts [19:53:45] ^demon: can you do it for him? [19:53:52] since you kind of know how? [19:54:08] RoanKattouw: so yea, what ya should do is drop a ticket in procurement. usually list what you need spec-wise and such [19:54:24] if i have a server with the specs you need, i can then allocate it [19:54:31] realistically the user would have push access too, but that's not a big deal [19:54:53] it would have the same level of access as any other user with an account [19:54:53] RoanKattouw: so what will be run on it, if it needs cpu, disk, ram, etc, location dependent, .... [19:55:03] RobH: I have absolutely no idea what I need spec-wise, only in very general terms [19:55:05] but diederik has access to bypass review [19:55:10] and can +2 and merge, etc [19:55:21] <^demon> Ryan_Lane: So just a standard noob account, ok yeah I guess I can do that. 
[19:55:26] RoanKattouw: thats fine, be as specific as you can but otherwise no worries [19:55:35] Like, it's gonna be doing CPU-bound things, and it would be nice if it's in Tampa cause that's where the Apaches are [19:55:35] ^demon: thanks [19:55:36] mainly we have a standard misc server, single cpu, 4gb ram [19:55:47] I'm trying my best not to work, since I'm sick [19:55:47] Single CPU? [19:55:48] then we have a high performance misc box, dual cpu, more ram, more disk [19:56:00] OK that sounds better [19:56:00] <^demon> Ryan_Lane: I'm trying to not work either. Writing a paper :) [19:56:06] heh [19:56:27] <^demon> Last paper I'll ever write for college :D [19:56:32] so list in there that you would prefer the high performance misc box, all the details, if its public or private ip space, and if you care what datacenter it resides in (tampa or ashburn) [19:57:12] <^demon> ottomata: E-mail address for the account owner? It will send you a random password. Also: is the user "GerritStats" an ok name? Shell would be "gerritstats" [19:58:34] yeah that's cool [19:58:37] um, i guess my email? [19:58:40] otto@wikimedia.org [19:59:08] <^demon> "A randomly generated password for GerritStats has been sent to otto@wikimedia.org." [19:59:09] <^demon> :) [20:02:47] danke! [20:04:55] so I need to generate an ssh key for this, right? [20:05:03] what unix account should I use? [20:05:51] <^demon> Whatever account is going to be running the cron, I suppose? Doesn't really matter what the local account is, as long as the key matches the username you're connecting as. [20:06:23] ok, so I can run it as diederik? 
[20:06:27] no [20:06:32] you'll need to make a system account for this [20:06:47] crons should *never* run as normal users [20:06:55] I really wish we'd just turn cron off for non-system users [20:07:04] i put the cron in root's crontab [20:07:07] but sudo -u to the user [20:07:10] ewwwww [20:07:15] also don't do that :) [20:07:19] i like to keep all puppetized crons in root's crontab [20:07:22] noooooo [20:07:23] again, i don't care what user we run as [20:07:27] yesssssss [20:07:30] make it run as the user [20:07:36] not via root [20:07:39] that's dirty [20:07:45] why not? you want to look in a bunch of different files to figure out what is running a cron? [20:07:50] you look in puppet [20:07:57] or in /var/spool/crons [20:08:22] err /var/spool/cron/crontabs/ [20:08:39] anyway, das cool [20:08:43] so um, what user? [20:08:54] you want me to create a new user for every different cron job purpose? [20:08:56] what's the shell account name for this new user? [20:08:58] or do we have a more generic one? [20:09:04] you can use this one for all of them for the stats, if you want [20:09:10] gerritstats [20:09:12] heh [20:09:16] is what ^demon just made [20:09:19] * Ryan_Lane nods [20:09:25] <^demon|away> gerritstats is what I made in ldap/gerrit :) [20:09:34] should we scrap that and have something more generic? [20:09:45] <^demon|away> "user1" [20:09:49] haha [20:11:01] stats? [20:11:10] that's fine with me [20:11:26] as long as the user didn't log into gerrit we can delete the user [20:11:26] so: [20:11:32] i logged in [20:11:33] well, not from mediawiki, I guess [20:11:35] to gerrit? [20:11:37] yes [20:11:38] meh [20:11:42] <^demon|away> Ok, don't do anything else. [20:11:45] <^demon|away> We can still delete. [20:11:49] <^demon|away> Before it gets nasty [20:11:51] ok [20:11:52] heh [20:11:54] should I log out? 
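The point made above about finding out which user runs a cron (look in puppet, or in /var/spool/cron/crontabs/) can be sketched roughly as follows. This is only an illustration, assuming the common layout of one file per user in the spool directory; the helper names crontab_owners and crontab_lines are hypothetical and are not taken from the log or from any Wikimedia tooling, and reading the real spool directory normally requires root.

```python
from pathlib import Path


def crontab_owners(spool_dir):
    """Return the sorted names of users that have a crontab file in spool_dir.

    Assumption: one file per user, named after the user, as in
    /var/spool/cron/crontabs/ on Debian-style systems.
    """
    return sorted(p.name for p in Path(spool_dir).iterdir() if p.is_file())


def crontab_lines(spool_dir, user):
    """Return the non-empty, non-comment lines of one user's crontab.

    Environment assignments such as MAILTO=... are returned too; this
    helper does not parse the schedule fields.
    """
    text = (Path(spool_dir) / user).read_text()
    return [
        line.strip()
        for line in text.splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    ]


if __name__ == "__main__":
    # Hypothetical usage; point this at a copy of the spool directory
    # if the real one is not readable by the current user.
    spool = "/var/spool/cron/crontabs"
    for owner in crontab_owners(spool):
        print(owner, crontab_lines(spool, owner))
```

With a dedicated system user per purpose, the file name in the spool directory already says what the job is for, which is the argument made above against keeping everything in root's crontab.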
[20:11:56] yes [20:12:00] all I did was add an ssh key [20:12:13] <^demon|away> There went my snacktime :p [20:12:16] logged out [20:12:18] :D [20:15:14] <^demon|away> Ok, accounts, account_external_ids and account_ssh_keys cleaned up. That *should* be it if nothing else was done. [20:15:27] how do I delete it from mediawiki? [20:15:33] is it even possible? [20:15:59] re: cron jobs as system users (created via puppet only): in labs i had issues described in BZ 36206. (related to access.conf settings). should i also create ldap/gerrit users there as opposed to "puppet only"? [20:16:06] <^demon|away> Oh, and account_id. [20:16:09] mutante: that should be fixed [20:16:17] ah, cool, thx [20:16:35] it was a pam issue, I stopped using access.conf everywhere, and only used it for ssh [20:17:03] <^demon|away> Ryan_Lane: Um, if nothing else was done on the wiki, you'll need to delete the entry from `user` as well as the RC/log entry claiming to have created the account. [20:17:06] <^demon|away> I *think* that's it. [20:17:08] nice, feel free to close (some time.) [20:17:18] * Ryan_Lane twitches [20:18:38] Ryan_Lane: i'll switch back to other user and then close it. [20:18:44] * Ryan_Lane nods [20:22:42] easy enough [20:23:04] RobH: OK, I'll file a ticket in procurement. Are you otherwise fine with me using cadmium for this until a box is procured for real? I'm doing some final Wikimania video stuff on it now, should be done some time this week [20:25:21] ok, so, should I be waiting for an email about this new 'stats' user? [20:25:30] and, do I need to also create a unix account in puppet for this user? 
[20:26:24] New review: Asher; "(no comment)" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/10957 [20:26:26] Change merged: Asher; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/10957 [20:27:58] ottomata: yes on the puppet system account [20:34:49] I can make the other account [20:35:06] awesome thanks [20:35:36] which email address should I use for this? [20:35:47] otto@wikimedia.org is fine [20:36:59] oook, thank you! [20:38:16] and [20:38:18] since this is a system account [20:38:23] I will be creating a private key on stat1 [20:38:27] for this user [20:38:28] is that ok? [20:39:19] yeah [20:39:25] or we can stick it in the private repo [20:39:46] where is the proper place to add a system unix account in puppet? [20:39:48] admins.pp? [20:39:52] or somewhere else? [20:40:01] with the class [20:40:12] oh really? [20:40:18] statistics.pp [20:40:39] use systemuser definition [20:41:05] see apaches.pp [20:41:15] and use /var/lib/ [20:41:21] using /home for system users is evil :) [20:41:35] ah ok [20:41:53] it breaks horribly when you have an environment with shared home directories [20:42:16] can I do /a [20:42:17] ? [20:42:20] for homedir? [20:42:28] /var/lib/ is better [20:42:34] hmmmm, ok [20:42:39] i'd like /var/lib/home/ better [20:42:42] it'll be on all systems [20:42:45] but ok... [20:45:01] s'ok? [20:45:01] https://gist.github.com/2912566 [20:45:34] why group wikidev? 
[20:45:44] does it need group write access, or something? [20:45:54] well, other users are going to need access to the files it is going to create [20:46:06] * Ryan_Lane nods [20:46:11] I hate that everything is wikidev [20:46:19] no getting around that for now, though [20:46:37] yeah, this looks fine [20:46:42] k danke [20:48:29] yw [20:48:38] so you want me to run this in the user's crontab then [20:49:26] New patchset: Ottomata; "Puppetizing gerrit-stats on stat1" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/9627 [20:49:45] ok, could you review that too then? [20:49:49] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/9627 [20:51:56] heh, after the fix, yeah :) [20:55:22] haha oop [20:56:17] New patchset: Ottomata; "Puppetizing gerrit-stats on stat1" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/9627 [20:56:41] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/9627 [20:57:11] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/9627 [20:57:14] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/9627 [21:02:12] yayy [21:15:24] New patchset: Ottomata; "statistics.pp - ensuring that $gerrit_stats_path is a directory owned by $gerrit_stats_user" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11000 [21:15:47] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/11000 [21:16:59] Ryan_Lane, if you are still around, that one should be a simple approve [21:17:29] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/11000 [21:17:32] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11000 [21:17:42] done [21:17:44] thank you! [21:24:50] New patchset: Ottomata; "statistics.pp - need to cd to $gerrit_stats_path to run cron job" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11001 [21:25:05] Ryan_Lane, ergh, one more you are so kind to me I hope you feel better real soon :) [21:25:13] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/11001 [21:25:36] it's almost midnight here -_- [21:25:57] why not set CWD? [21:26:07] I know there's got to be a better way than cd :) [21:26:22] i'm not messing with the code [21:26:25] i'm just puppetizing [21:26:35] where are you? [21:27:02] berlin [21:27:06] ohhhhhh [21:27:08] you are a kind soul [21:27:15] no, I mean there's a way in cron to set the PWD [21:27:22] in puppet? 
[21:28:33] meh [21:28:48] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/11001 [21:28:50] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11001 [21:28:53] haha, thanks [21:29:44] done [21:47:10] PROBLEM - Puppet freshness on es1003 is CRITICAL: Puppet has not run in the last 10 hours [21:47:10] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours [21:47:11] PROBLEM - Puppet freshness on professor is CRITICAL: Puppet has not run in the last 10 hours [22:11:01] New patchset: Dereckson; "(bug 37456) Enable Narayam on kn.wikisource.org (bug 37472) Adding Narayam setup bugs in InitialiseSettings.php" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/11003 [22:11:07] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/11003 [22:14:42] New patchset: Dereckson; "(bug 37456) Enable Narayam on kn.wikisource.org (bug 37472) Adding Narayam setup bugs in InitialiseSettings.php" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/11003 [22:14:48] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/11003 [22:17:36] New patchset: Dereckson; "(bug 37456) Enable Narayam on kn.wikisource.org" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/11003 [22:17:42] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/11003 [22:51:11] PROBLEM - Puppet freshness on db29 is CRITICAL: Puppet has not run in the last 10 hours [23:13:56] Would someone mind power cycling srv232? 
[23:14:06] It's responding to ping, but ssh seems dead to the world [23:15:44] PROBLEM - check_all_memcacheds on spence is CRITICAL: MEMCACHED CRITICAL - Could not connect: 10.0.2.232:11000 (timeout) [23:16:56] PROBLEM - SSH on srv232 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:18:08] PROBLEM - Apache HTTP on mw9 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:18:56] * Reedy kicks nagios-wm [23:19:11] PROBLEM - Apache HTTP on mw57 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:20:05] PROBLEM - Apache HTTP on mw1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:20:32] RECOVERY - Apache HTTP on mw57 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.714 second response time [23:20:59] RECOVERY - Apache HTTP on mw9 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.362 second response time [23:21:26] RECOVERY - Apache HTTP on mw1 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.364 second response time [23:22:11] PROBLEM - Memcached on srv232 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:22:20] PROBLEM - Apache HTTP on mw35 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:23:41] RECOVERY - Apache HTTP on mw35 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.031 second response time [23:30:23] New review: Reedy; "(no comment)" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/10673 [23:30:25] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/10673 [23:33:39] do any of the IRC bots in here give special rights to people with mediawiki/* cloaks? 
[23:34:57] PROBLEM - NTP on srv232 is CRITICAL: NTP CRITICAL: No response from NTP server [23:41:23] PROBLEM - Host es3 is DOWN: PING CRITICAL - Packet loss = 100% [23:42:26] PROBLEM - MySQL Slave Delay on es1 is CRITICAL: CRIT replication delay 211 seconds [23:42:53] PROBLEM - MySQL Slave Delay on es1003 is CRITICAL: CRIT replication delay 237 seconds [23:43:02] PROBLEM - MySQL Slave Delay on es1004 is CRITICAL: CRIT replication delay 248 seconds [23:43:29] PROBLEM - MySQL Slave Delay on es2 is CRITICAL: CRIT replication delay 273 seconds [23:43:47] PROBLEM - MySQL Slave Delay on es4 is CRITICAL: CRIT replication delay 291 seconds [23:44:05] RECOVERY - MySQL Slave Delay on es1 is OK: OK replication delay NULL seconds [23:44:50] RECOVERY - MySQL Slave Delay on es2 is OK: OK replication delay NULL seconds [23:45:08] RECOVERY - MySQL Slave Delay on es4 is OK: OK replication delay NULL seconds [23:46:50] Thehelpfulone: I don't think so, but I don't know for sure. [23:47:00] I'm pretty sure no bot gives me any special rights... [23:47:01] :P [23:47:11] heh, you're on a wikimedia/* cloak :P [23:47:16] yes i am. [23:47:44] and I don't think a bot gives me anything special because of it... [23:48:10] !log preparing to switch es master to es1 [23:48:15] Logged the message, Master [23:49:17] oh good, you're both here [23:49:26] yeah I mean that because mediawiki/* was reserved for people with SVN access only [23:49:33] phew [23:49:35] I don't know if anyone decided they were special [23:50:22] oh, I misread your original question. I think the answer is still the same though. [23:50:53] hmm, maplebed isn't in -tech [23:50:57] fyi es3 errors [23:51:06] Reedy: binasher is doing a master rotation. [23:51:07] !log es1 is the new master, now switching mw conf [23:51:12] Logged the message, Master [23:52:43] articles cannot be edited [23:52:53] we can just perform log-actions [23:53:09] Vito: maintenance was underway. [23:53:12] should be finished now. 
[23:53:14] RECOVERY - MySQL Slave Delay on es1003 is OK: OK replication delay 0 seconds [23:53:14] RECOVERY - MySQL Slave Delay on es1004 is OK: OK replication delay 0 seconds [23:53:15] oh I see [23:53:28] works now [23:53:31] ty maplebed [23:53:34] cool. [23:54:47] heh [23:54:58] * AaronSchulz wonders why we have rows referring to es cluster 16 [23:55:27] secret revisions