[00:14:58] New patchset: Krinkle; "gerrit.config: +commentlink for git hash, fix bugzilla to use word foundry" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/9109
[00:15:19] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/9109
[00:44:46] PROBLEM - Puppet freshness on storage3 is CRITICAL: Puppet has not run in the last 10 hours
[00:51:40] PROBLEM - Puppet freshness on search17 is CRITICAL: Puppet has not run in the last 10 hours
[00:55:34] PROBLEM - Puppet freshness on search15 is CRITICAL: Puppet has not run in the last 10 hours
[00:55:34] PROBLEM - Puppet freshness on search13 is CRITICAL: Puppet has not run in the last 10 hours
[01:01:34] PROBLEM - Puppet freshness on search14 is CRITICAL: Puppet has not run in the last 10 hours
[01:01:34] PROBLEM - Puppet freshness on search19 is CRITICAL: Puppet has not run in the last 10 hours
[01:01:34] PROBLEM - Puppet freshness on search16 is CRITICAL: Puppet has not run in the last 10 hours
[01:12:40] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[01:41:55] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 270 seconds
[01:44:46] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 0 seconds
[01:53:37] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours
[01:55:34] PROBLEM - Puppet freshness on sodium is CRITICAL: Puppet has not run in the last 10 hours
[02:21:14] New review: Jeremyb; "Need to watch for merge of I326c581135b38bc (whichever of these 2 is merged first needs to change to..." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/8344
[02:46:21] RECOVERY - Puppet freshness on dobson is OK: puppet ran at Mon May 28 02:45:57 UTC 2012
[02:50:51] RECOVERY - Puppet freshness on virt2 is OK: puppet ran at Mon May 28 02:50:36 UTC 2012
[06:54:51] New review: ArielGlenn; "(no comment)" [operations/dumps] (ariel); V: 1 C: 2; - https://gerrit.wikimedia.org/r/8894
[06:54:53] Change merged: ArielGlenn; [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/8894
[07:10:46] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours
[07:10:46] PROBLEM - Puppet freshness on es1003 is CRITICAL: Puppet has not run in the last 10 hours
[07:10:46] PROBLEM - Puppet freshness on es1002 is CRITICAL: Puppet has not run in the last 10 hours
[07:10:46] PROBLEM - Puppet freshness on professor is CRITICAL: Puppet has not run in the last 10 hours
[07:52:46] PROBLEM - Puppet freshness on search20 is CRITICAL: Puppet has not run in the last 10 hours
[07:56:12] New patchset: ArielGlenn; "dvds for Malayalam Wikisource" [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/9115
[07:56:37] New review: ArielGlenn; "(no comment)" [operations/dumps] (ariel); V: 1 C: 2; - https://gerrit.wikimedia.org/r/9115
[07:56:39] Change merged: ArielGlenn; [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/9115
[08:19:31] PROBLEM - Puppet freshness on db29 is CRITICAL: Puppet has not run in the last 10 hours
[08:35:07] PROBLEM - Host snapshot1001 is DOWN: PING CRITICAL - Packet loss = 100%
[08:37:00] !log rebooted snapshot1001, security updates
[08:37:04] Logged the message, Master
[08:39:01] RECOVERY - Host snapshot1001 is UP: PING OK - Packet loss = 0%, RTA = 26.63 ms
[08:43:28] !log rebooted snapshot1002, security updates (will do the same for 1003, 1004 shortly)
[08:43:32] Logged the message, Master
[09:05:29] !log rebooted snapshot4, 3 for security updates
[09:05:32] Logged the message, Master
[10:20:14] New patchset: ArielGlenn; "larger root partition, make partition setup more readable" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/9118
[10:20:34] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/9118
[10:22:43] PROBLEM - Host mw1133 is DOWN: PING CRITICAL - Packet loss = 100%
[10:46:07] PROBLEM - Puppet freshness on storage3 is CRITICAL: Puppet has not run in the last 10 hours
[10:52:07] PROBLEM - Puppet freshness on search17 is CRITICAL: Puppet has not run in the last 10 hours
[10:56:10] PROBLEM - Puppet freshness on search13 is CRITICAL: Puppet has not run in the last 10 hours
[10:56:10] PROBLEM - Puppet freshness on search15 is CRITICAL: Puppet has not run in the last 10 hours
[11:02:51] PROBLEM - Puppet freshness on search14 is CRITICAL: Puppet has not run in the last 10 hours
[11:02:51] PROBLEM - Puppet freshness on search16 is CRITICAL: Puppet has not run in the last 10 hours
[11:02:51] PROBLEM - Puppet freshness on search19 is CRITICAL: Puppet has not run in the last 10 hours
[11:08:50] !log powercycled mw1133
[11:08:53] Logged the message, Master
[11:11:41] RECOVERY - Host mw1133 is UP: PING OK - Packet loss = 0%, RTA = 26.44 ms
[11:13:47] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[11:22:11] !log updated kernel etc on mw1133, reboot
[11:22:14] Logged the message, Master
[11:54:53] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours
[11:57:31] PROBLEM - Puppet freshness on sodium is CRITICAL: Puppet has not run in the last 10 hours
[12:46:43] PROBLEM - Puppet freshness on dobson is CRITICAL: Puppet has not run in the last 10 hours
[13:11:10] PROBLEM - Host mw1077 is DOWN: PING CRITICAL - Packet loss = 100%
[13:12:23] !log doing security updates for a batch of mws in eqiad
[13:12:27] Logged the message, Master
[13:14:01] RECOVERY - Host mw1077 is UP: PING OK - Packet loss = 0%, RTA = 26.45 ms
[13:38:06] PROBLEM - Host mw1031 is DOWN: PING CRITICAL - Packet loss = 100%
[13:41:15] RECOVERY - Host mw1031 is UP: PING OK - Packet loss = 0%, RTA = 26.41 ms
[14:11:33] PROBLEM - Apache HTTP on mw18 is CRITICAL: Connection refused
[14:11:46] hush
[14:36:00] RECOVERY - Apache HTTP on mw18 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.037 second response time
[14:43:44] New patchset: Hashar; "restrict swift.php to 'pmtpa' cluster" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/9125
[14:43:50] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/9125
[14:47:41] New review: Krinkle; "(no comment)" [operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/9125
[14:48:10] srv188: rsync: write failed on "/usr/local/apache/common-local/php-1.20wmf4/extensions/WebFonts/.git/objects/pack/pack-e0e7307c2ec029a7c157432514128b0228af01b9.pack": No space left on device (28)
[14:48:10] srv188: rsync error: error in file IO (code 11) at receiver.c(302) [receiver=3.0.7]
[14:48:15] notpeter: oh noes ^ ;)
[14:49:30] O_O
[14:50:01] I'll have a look at the box when scap runs
[14:51:16] New review: Reedy; "(no comment)" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/9125
[14:51:18] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/9125
[14:51:44] runs? finishes
[14:55:40] snapshot1: rsync: write failed on "/usr/local/apache/common-local/php-1.20wmf4/.git/objects/pack/pack-5b0a8f85406767a1a9bd988bcabfb487dc8ed17b.pack": No space left on device (28)
[14:55:40] snapshot1: rsync error: error in file IO (code 11) at receiver.c(302) [receiver=3.0.7]
[14:56:23] Reedy: thanks!
[14:56:58] snapshot1001, snapshot1002, snapshot1003
[14:57:49] grrr
[14:57:54] syncing a new branch are we?
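The "No space left on device (28)" rsync failures above came from hosts whose /usr/local/apache partition filled up during the branch sync. A simple way to catch such hosts before a sync is to parse `df -P` output for the target mount. This is an illustrative sketch, not part of scap; the `check_usage` helper, the 90% threshold, and the sample numbers are invented for the example:

```shell
#!/bin/sh
# check_usage: read `df -P` output for one mount on stdin; print
# "FULL <pct>" when the Use% column meets the limit given as $1.
check_usage() {
    awk -v limit="$1" 'NR==2 { sub(/%/, "", $5); if ($5 + 0 >= limit) print "FULL", $5 }'
}

# Sample df -P output resembling a nearly-full srv188 (numbers invented;
# in practice you would run `df -P /usr/local/apache` over ssh/dsh).
sample='Filesystem 1024-blocks Used Available Capacity Mounted-on
/dev/sda7 8000000 7760000 240000 97% /usr/local/apache'

printf '%s\n' "$sample" | check_usage 90
```

Running this over the mediawiki-installation dsh group before a scap would have flagged srv188 and the snapshot hosts ahead of time.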
[14:58:02] well snapshot2 and 3 and maybe 4 will run out too
[14:58:19] they just have small root partitions, I cleaned them up as best I could the other day but there wasn't much to be done
[14:58:25] I see an issue...
[14:58:27] /dev/sda6 on /tmp type ext3 (rw)
[14:58:27] /dev/sda7 on /a type jfs (rw)
[14:58:43] ^ 188
[14:58:43] /dev/sda7 on /usr/local/apache type jfs (rw)
[14:58:48] ^ 190
[14:58:50] is there any branch that can be taken out of the sync?
[14:59:08] this is gonna mean my dumps don't run maybe
[14:59:11] that's no goo
[14:59:13] d
[14:59:38] php-1.20wmf2 in theory.. But it was planned to keep files around longer due to cached stuff requesting them
[14:59:47] and just removing them on those boxes will make scap put them back
[15:01:01] Hmmm. Mounts look inconsistent..
[15:01:19] notpeter: about?
[15:03:02] actually having localisation files to scap is gonna cause even more problems I guess..
[15:04:19] Reedy: what's up?
[15:04:40] well lemme try a workaround on the snapshots. I'll do sn1 first
[15:05:11] notpeter: looks like some of the apaches have inconsistent /a /usr/local/apache
[15:05:39] srv188 is out of space, as were a couple of the snapshot boxes
[15:05:53] the snapshot hosts aren't apaches
[15:05:56] srv188 is no longer in use, I believe
[15:06:03] Oh
[15:06:10] Why are we still pushing files to it then? ;)
[15:06:12] they need copies of mw
[15:06:12] lemme double check
[15:06:34] and yes, I didn't touch the snapshot hosts, as I have no idea what they do or what their requirements are
[15:07:07] coincidentally today I pushed a new partman config for them but I want someone to review it, then I get to reinstall the eqiad hosts
[15:07:30] apergos: which "them"?
[15:07:34] the snaps
[15:07:53] ah, kk
[15:08:15] Reedy: api:{ 'host': 'srv188.pmtpa.wmnet', 'weight': 80, 'enabled': False }
[15:08:16] anyways for the immediate term I'm going to move stuff around and make a symlink and then ask reedy to push to sn1 (not yet, the copy is taking its time)
[15:09:28] apergos: I've found that partman confs involve enough trial and error that just stopping puppet on brewster and doing it live there is the best bet :/
[15:10:04] this is a relatively small (except for the line breaks) change, which is why I went that route
[15:10:30] each of these branched is 1.3 or 1.8gb or whatever
[15:10:33] *branches
[15:10:53] adds up quick
[15:10:59] 500 meg or so without l10n cache
[15:11:43] notpeter: can srv188 be removed from the mediawiki-installation dsh group?
[15:13:54] so wmf2 is no longer used right?
[15:13:59] Reedy: I think so?
[15:14:33] Reedy: I mean, it's not in any pools, and I think it's old enough to be decommed
[15:14:36] so, sure!
[15:14:41] Reedy:
[15:14:59] apergos: yeah.. sort of. cached pages refer to resources in it though...
[15:15:06] not for my backups they don't
[15:15:34] yeah
[15:15:43] ok wanna try syncing out to sn1?
[15:15:46] won't scap etc recreate it?
[15:15:51] I wanna make sure I didn't screw up the perms
[15:15:57] not if I did this right it won't
[15:16:21] just sn1, I haven't done the other hosts yet, wanna make sure this is right
[15:16:54] might be easier if you run sync-common on it locally
[15:17:14] or i can
[15:17:24] what's the full command and from what directory? (I assume I run it as root?)
[15:17:31] i.e. what args does it need
[15:17:57] no args
[15:18:14] ok. as root, yes? from what directory?
[15:18:20] can be run as anyone who can sudo to mwdeploy
[15:18:27] any dir
[15:18:49] it just calls /usr/bin/scap-1
[15:19:11] ok, I'm testing
[15:19:57] 536mb
[15:20:06] the i18n stuff isn't in right now?
[15:20:21] er l10n
[15:21:52] not in wmf4
[15:22:01] going to do that next
[15:22:01] New patchset: Hashar; "fix assignement in conditional switch" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/9128
[15:22:07] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/9128
[15:22:19] you'll need to give me about 5 mins and the rest of these hosts will be ok
[15:22:37] Reedy: another small one whenever you have time https://gerrit.wikimedia.org/r/9128 fix assignment in an if statement: if( $cluster = 'pmtpa' )
[15:26:42] apergos: sure, no rush. The actual deployment window isn't for another 2.5 hours :). We just wanted to get the prep done ahead of time so we can keep the deployment window hopefully short
[15:26:50] sure
[15:27:09] I gotta wait for these copies to finish and then tidy up, then run a local sync
[15:27:16] New review: Reedy; "(no comment)" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/9128
[15:27:18] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/9128
[15:27:19] at that point they should all look alike and you should be good to go
[15:27:29] Great
[15:27:42] I'll go get on with some other stuff. AFK for 10 minutes or so
[15:27:46] k
[15:39:34] Reedy: you should be good for the next round of syncs
[15:39:43] Great
[15:39:54] That appeared as I just sat down. Talk about good timing
[15:41:09] yay
[15:41:26] well this way I can watch the next sync go around and be sure it's in fact all good
[15:44:55] Updating ExtensionMessages-1.20wmf4.php...
[15:44:55] Loading data from /home/wikipedia/common/php-1.20wmf3/extensions/AbuseFilter/AbuseFilter.php
[15:45:08] ho hum
[15:45:19] wmf3?
[15:45:23] I'm not sure why it's loading stuff from wmf3
[15:45:23] yeah
[15:45:50] did the sync go around ok?
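The bug fixed in r/9128 is the classic assignment-where-comparison-was-meant mistake: in PHP, `if( $cluster = 'pmtpa' )` assigns the string and the condition is always truthy, so the branch runs for every cluster. The same trap exists in shell, where a bare variable assignment used as an `if` condition succeeds regardless of the value. A small demonstration (variable names invented to echo the patch):

```shell
#!/bin/sh
cluster="eqiad"

# BUG analog of if( $cluster = 'pmtpa' ): this is an assignment, its
# exit status is 0, so the branch is ALWAYS taken (and clobbers $cluster).
if cluster="pmtpa"; then
    buggy="matched"
fi

# Correct comparison uses test/[ with = :
cluster="eqiad"
if [ "$cluster" = "pmtpa" ]; then
    correct="matched"
else
    correct="no match"
fi

echo "$buggy / $correct"    # matched / no match
```

This is why the original swift.php restriction silently applied everywhere instead of only to pmtpa.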
[15:45:56] spacewise I mean
[15:46:06] it's just doing it now
[15:46:25] ah sorry, saw the log entry and thought it had already gone
[15:47:44] It should probably say "starting" and do a !log for when it's finished
[15:51:51] New patchset: Reedy; "Making scap sync messages more verbose/accurate" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/9129
[15:52:11] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/9129
[15:53:31] New patchset: Reedy; "Upping fan to 10 for syncing code. 5 is sloooooooow" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/9130
[15:53:52] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/9130
[15:54:01] * apergos twitches impatiently
[16:04:26] yup, snapshot hosts look to have gone through fine
[16:04:45] srv188 and srv187 are the only ones I've seen complaining so far
[16:06:35] great
[16:06:44] that should take care of them for the next 30 gb or so
[16:06:50] or until I get to repartition them
[16:07:58] heh
[16:08:14] I think based on the squid retention times, we'll have at maximum 6 copies checked out
[16:14:18] Updating ExtensionMessages-1.20wmf4.php...
[16:14:18] Loading data from /home/wikipedia/common/php-1.20wmf4/extensions/AbuseFilter/AbuseFilter.php
[16:14:30] apergos: I pressed up, changed the message and hit enter. Weird how it's fine the 2nd time round
[16:14:47] riiiggghht
[16:14:55] frickin computers
[16:14:59] that deserves a wut
[16:15:19] what's the phase of the moon?
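The confusion above ("saw the log entry and thought it had already gone") motivated r/9129: a single !log message is ambiguous about whether a sync has started or finished. A hypothetical wrapper illustrating the idea; `log_msg` is a stand-in for the real !log / logmsgbot mechanism, and `run_sync` is invented for the sketch, not scap's actual structure:

```shell
#!/bin/sh
# log_msg: stand-in for the !log bot; just prints the message here.
log_msg() { echo "!log $*"; }

# run_sync: bracket the (elided) sync work with explicit start and
# finish messages so readers of the log can tell which state it is in.
run_sync() {
    log_msg "started sync of $1"
    # ... real work (scap / sync-common across the dsh group) would go here ...
    log_msg "finished sync of $1"
}

run_sync "php-1.20wmf4"
# !log started sync of php-1.20wmf4
# !log finished sync of php-1.20wmf4
```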
[16:21:48] New patchset: Hashar; "import CommonSettings from wmflabs" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/9131
[16:21:48] New patchset: Hashar; "specific shell configuration for transcoding boxes" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/9132
[16:21:54] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/9131
[16:21:56] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/9132
[16:22:48] Reedy: some more changes for you :-]
[16:31:11] New patchset: Hashar; "move throttling related conf to throttle.php" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/9136
[16:31:17] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/9136
[16:31:51] finally done... that was a loooong time
[16:32:04] Reedy: I have added you as a reviewer for 9131 9132 & 9136
[16:32:11] Lucky me ;)
[16:32:20] They are trivial changes
[16:32:36] trying to get things that change often to be split out of CS.php
[16:32:47] how is 1.20wmf4 going on?
[16:37:02] It's all done bar also moving testwiki and mediawikiwiki later and actually testing it
[16:38:04] \O/
[17:12:03] PROBLEM - Puppet freshness on es1003 is CRITICAL: Puppet has not run in the last 10 hours
[17:12:03] PROBLEM - Puppet freshness on professor is CRITICAL: Puppet has not run in the last 10 hours
[17:12:03] PROBLEM - Puppet freshness on es1002 is CRITICAL: Puppet has not run in the last 10 hours
[17:12:03] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours
[17:51:51] New review: Hashar; "(no comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/9129
[17:54:07] PROBLEM - Puppet freshness on search20 is CRITICAL: Puppet has not run in the last 10 hours
[18:20:04] PROBLEM - Puppet freshness on db29 is CRITICAL: Puppet has not run in the last 10 hours
[18:29:13] PROBLEM - MySQL Replication Heartbeat on db1047 is CRITICAL: CRIT replication delay 221 seconds
[18:29:22] PROBLEM - MySQL Slave Delay on db1047 is CRITICAL: CRIT replication delay 216 seconds
[18:36:25] PROBLEM - MySQL Replication Heartbeat on db1047 is CRITICAL: CRIT replication delay 201 seconds
[18:36:25] PROBLEM - MySQL Slave Delay on db1047 is CRITICAL: CRIT replication delay 201 seconds
[18:47:24] PROBLEM - MySQL Replication Heartbeat on db1047 is CRITICAL: CRIT replication delay 189 seconds
[18:47:42] PROBLEM - MySQL Slave Delay on db1047 is CRITICAL: CRIT replication delay 198 seconds
[18:51:54] PROBLEM - MySQL Slave Delay on db1047 is CRITICAL: CRIT replication delay 185 seconds
[18:53:24] RECOVERY - MySQL Slave Delay on db1047 is OK: OK replication delay 0 seconds
[19:03:27] PROBLEM - MySQL Slave Delay on db1047 is CRITICAL: CRIT replication delay 197 seconds
[19:04:21] PROBLEM - MySQL Replication Heartbeat on db1047 is CRITICAL: CRIT replication delay 202 seconds
[19:05:25] New review: Helder.wiki; "(no comment)" [operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/9136
[19:50:15] PROBLEM - MySQL Replication Heartbeat on db1047 is CRITICAL: CRIT replication delay 204 seconds
[20:08:13] RECOVERY - MySQL Slave Delay on db1047 is OK: OK replication delay 30 seconds
[20:09:16] RECOVERY - MySQL Replication Heartbeat on db1047 is OK: OK replication delay 17 seconds
[20:47:13] PROBLEM - Puppet freshness on storage3 is CRITICAL: Puppet has not run in the last 10 hours
[20:53:13] PROBLEM - Puppet freshness on search17 is CRITICAL: Puppet has not run in the last 10 hours
[20:57:16] PROBLEM - Puppet freshness on search13 is CRITICAL: Puppet has not run in the last 10 hours
[20:57:16] PROBLEM - Puppet freshness on search15 is CRITICAL: Puppet has not run in the last 10 hours
[21:04:03] PROBLEM - Puppet freshness on search14 is CRITICAL: Puppet has not run in the last 10 hours
[21:04:03] PROBLEM - Puppet freshness on search16 is CRITICAL: Puppet has not run in the last 10 hours
[21:04:03] PROBLEM - Puppet freshness on search19 is CRITICAL: Puppet has not run in the last 10 hours
[21:07:03] PROBLEM - Host srv278 is DOWN: PING CRITICAL - Packet loss = 100%
[21:07:30] RECOVERY - Host srv278 is UP: PING OK - Packet loss = 0%, RTA = 0.44 ms
[21:10:39] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused
[21:13:10] > # srv258 - srv280 are application servers, job runners, memcached
[21:14:33] hrmmm, 10.0.8.28 is not in mc.php at all
[21:15:00] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[21:21:12] New patchset: Krinkle; "use protocol-relative url for nostalgiawiki wgSiteNotice" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/9204
[21:21:17] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/9204
[21:24:04] New patchset: Krinkle; "remove old wgNoticeBanner_Harvard2011 config" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/9205
[21:24:10] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/9205
[21:28:55] New patchset: Krinkle; "Clean up search-redirect.php per code conventions" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/9206
[21:29:00] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/9206
[21:32:15] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.476 second response time
[21:39:03] hi
[21:39:16] do you have an update on https://bugzilla.wikimedia.org/show_bug.cgi?id=37089 ?
[21:50:07] rguillebert: i can't remember exactly which day you were here before but there's a lot of people traveling (or packing bags!) now.
[21:50:37] so not paying attention to computers
[21:51:15] ok, it's just that I don't have access to the private bugtracker so I'm really in the dark
[21:51:31] Not really
[21:51:36] No replies have been made to it
[21:52:32] yeah, it's the private mailing list where the reply would be. i think
[21:52:36] not in RT
[21:56:06] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours
[21:58:04] PROBLEM - Puppet freshness on sodium is CRITICAL: Puppet has not run in the last 10 hours
[22:47:25] PROBLEM - Puppet freshness on dobson is CRITICAL: Puppet has not run in the last 10 hours
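The observation at 21:14:33 ("10.0.8.28 is not in mc.php at all") is the kind of thing a quick grep can confirm. A sketch of that check; the mc.php contents below are invented stand-ins (the real file lists the production memcached pool), and only the general shape of a server array is assumed:

```shell
#!/bin/sh
# Hypothetical stand-in for mc.php; the real file defines the memcached
# server pool for the MediaWiki cluster.
cat > /tmp/mc_sample.php <<'EOF'
<?php
$wgMemCachedServers = array(
    '10.0.8.26:11000',
    '10.0.8.27:11000',
);
EOF

ip="10.0.8.28"
# -F treats the IP as a literal string (dots are not regex wildcards).
if grep -Fq "$ip" /tmp/mc_sample.php; then
    echo "$ip present in mc.php"
else
    echo "$ip missing from mc.php"
fi
```

Hosts listed in site comments as "application servers, job runners, memcached" (like srv258-srv280 above) but absent from mc.php are exactly what this kind of cross-check surfaces.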