[00:05:07] Reedy: For https://gerrit.wikimedia.org/r/#/c/26428/ - would scap be appropiate? [00:05:12] its a lot of different directories [00:05:20] don't want it to get out of sync half-way [00:05:28] That and you have message changes [00:05:32] e.g. two sync-dirs might cause troubles I imagine [00:05:33] oh, right. [00:05:45] errr [00:06:02] yes? [00:06:18] You haven't updated messages.inc [00:06:38] 1 message changed, 2 added, 2 deleted [00:06:58] Guess that shouldn't matter for production, but will need doing into core etc for TW [00:08:35] Reedy: grr, I hate that file [00:08:36] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.027 seconds [00:08:38] extensions don't have it [00:08:42] why do we need it in core [00:08:42] heh [00:08:51] No idea [00:16:49] PROBLEM - Puppet freshness on cp1040 is CRITICAL: Puppet has not run in the last 10 hours [00:25:32] scap in progress to fix search suggestions front-end in 1.21wmf1 [00:31:49] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [00:32:02] New patchset: Dzahn; "wikistats - update debian control/copyright and add additional files created by dh_make" [operations/debs/wikistats] (master) - https://gerrit.wikimedia.org/r/26433 [00:33:30] Change merged: Dzahn; [operations/debs/wikistats] (master) - https://gerrit.wikimedia.org/r/26433 [00:35:17] New patchset: Tim Starling; "Add API log" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/26434 [00:35:39] Change merged: Tim Starling; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/26434 [00:41:07] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:42:35] I also see a lot of "mw24: No entry for terminal type "unknown"; " [00:42:49] mutante: is that one in puppet? 
[00:42:57] !log moved logmsgbot to -operations channel [00:43:04] (the script you just modified, wikitech says it is in /h/w/bin) [00:43:07] Logged the message, Master [00:43:17] afaik that's the old name of /usr/local/bin which is now in puppet [00:44:10] Krinkle: the bot is still in /h/w/bin [00:44:17] ok [00:44:20] Krinkle: https://svn.wikimedia.org/viewvc/mediawiki?view=revision&revision=115492 [00:45:06] Anyone here bureaucrat on officewiki? I'd like sysop to fixup local mediawiki: js there. security warnings in chrome are annoying me [00:45:15] Krinkle: but.. the channel is indeed define in site.pp in puppet.. [00:45:18] You've got shell ;) [00:45:22] thanks for pointing it [00:45:26] Reedy: I know, but I'd rather not do it directly. [00:45:35] Office IT, IIRC [00:46:06] k, I'll mail pbeaudette [00:46:39] New patchset: Dzahn; "move logmsgbot from wikimedia-tech to wikimedia-operations" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26435 [00:47:44] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26435 [00:48:24] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26435 [00:56:43] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 6.237 seconds [00:59:04] mutante: Hm.. those variables aren't actually used in /h/w/bin/logmsgbot [00:59:12] mutante: do you know what those vars in site.pp are for?. [00:59:49] Krinkle: no, i did not expect it to be in puppet, just grepped for logmsgbot in puppet after you asked if its in there [00:59:56] right [01:00:03] Hm.. the nagios one has the same var names [01:00:07] but it has the name logmsgbot right above it [01:00:12] maybe for debug stuff from puppet itself? 
[01:00:49] Krinkle: nagios has a different bot, "nagios-wm" there [01:01:03] yes, but same var names [01:01:11] which suggests they are consumed by a similar script [01:01:14] the variables can be used for different bot nicks, which are all instances of ircecho [01:01:49] PROBLEM - Puppet freshness on ms-be6 is CRITICAL: Puppet has not run in the last 10 hours [01:03:06] Krinkle: hmm, it appears that there used to be a ./misc/ircecho.pp [01:03:20] indeed [01:03:37] scripts write to a log file and then they are aggregated and sent to irc by one script [01:03:53] different scripts writing to different files and then ircecho.pp pushes it all to irc [01:04:00] logmsgbot now goes to irc directly [01:04:02] oh, nevermind ,it is in misc-servers.pp [01:04:27] class misc::ircecho { [01:04:46] uses those variables [01:04:46] but /var/log/logmsg/ is still being updated though on fenari [01:05:58] ./templates/ircecho/default.erb [01:17:36] !log krinkle Finished syncing Wikimedia installation... : Icd721011b40bb8d2c20aefa8b359a3e45413a07f [01:17:47] Logged the message, Master [01:19:14] over 40 mins [01:19:55] Reedy: [[Heterogeneous_deployment_v2]] on wikitech, is this a typo ? [01:19:57] "mwdeploy rm /home/wikipedia/common/php && ln -s mwdeploy rm /home/wikipedia/common/php-1.20wmf12 mwdeploy rm /home/wikipedia/common/php " [01:20:14] it seems odd to pass "mwdeploy rm .." to "ln -s" [01:20:40] or something [01:25:04] PROBLEM - Apache HTTP on srv233 is CRITICAL: Connection refused [01:27:55] TimStarling: 'live-1.5/fatal-ifjwghy9ibag.php' was created in 2007 afaik, any idea if it is still relevant? its keeping git status dirty. Would be nice if that '?' 
could find its way out of PS1 in my shel [01:28:03] (you own the file) [01:28:14] the filename is meant to be secret [01:28:39] I see [01:29:08] and fixed [01:29:30] same as the phpinfo() script [01:29:36] !log tstarling synchronized live-1.5 [01:29:47] Logged the message, Master [01:30:55] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:35:22] !log krinkle synchronized live-1.5 [01:35:33] Logged the message, Master [01:37:04] PROBLEM - Apache HTTP on mw69 is CRITICAL: Connection refused [01:37:49] PROBLEM - Apache HTTP on srv247 is CRITICAL: Connection refused [01:43:04] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 289 seconds [01:43:58] RECOVERY - Apache HTTP on srv247 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.182 second response time [01:44:16] PROBLEM - MySQL Slave Delay on storage3 is CRITICAL: CRIT replication delay 287 seconds [01:44:52] RECOVERY - Apache HTTP on mw69 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.040 second response time [01:44:52] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [01:45:01] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 1.874 seconds [01:46:40] RECOVERY - Apache HTTP on srv233 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.039 second response time [01:47:25] RECOVERY - MySQL Slave Delay on storage3 is OK: OK replication delay 20 seconds [01:49:13] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 1 seconds [02:09:52] !log tstarling synchronized wmf-config/InitialiseSettings.php 'API log' [02:10:03] Logged the message, Master [02:12:57] Reedy: okay, that's it. 
I think Jenkins/Gerrit is down entirely, nothing is getting +1 anymore it seems [02:12:57] not just wmf / REL branches [02:18:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:25:42] New patchset: DamianZaremba; "First hash at starting to puppetize bots in labs." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26441 [02:26:38] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26441 [02:31:56] !log LocalisationUpdate completed (1.20wmf12) at Wed Oct 3 02:31:56 UTC 2012 [02:32:09] Logged the message, Master [02:32:40] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.997 seconds [02:34:01] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours [02:42:59] New patchset: Tim Starling; "es10 down" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/26442 [02:43:10] Change merged: Tim Starling; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/26442 [02:43:43] !log tstarling synchronized wmf-config/db.php [02:43:54] Logged the message, Master [02:44:59] !log es10 not responding to ssh, rebooting [02:45:10] Logged the message, Master [02:48:52] RECOVERY - Host es10 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [02:51:36] !log LocalisationUpdate completed (1.21wmf1) at Wed Oct 3 02:51:35 UTC 2012 [02:51:47] Logged the message, Master [02:52:19] PROBLEM - mysqld processes on es10 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [03:03:07] RECOVERY - mysqld processes on es10 is OK: PROCS OK: 1 process with command name mysqld [03:04:55] New patchset: DamianZaremba; "First hash at starting to puppetize bots in labs." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26441 [03:05:44] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26441 [03:08:32] New patchset: Tim Starling; "es10 back up" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/26443 [03:10:13] Change merged: Tim Starling; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/26443 [03:11:25] !log tstarling synchronized wmf-config/db.php 'es10 back up' [03:11:36] Logged the message, Master [03:55:55] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [05:16:50] PROBLEM - Lucene on search1016 is CRITICAL: Connection timed out [05:18:11] RECOVERY - Lucene on search1016 is OK: TCP OK - 3.024 second response time on port 8123 [05:18:20] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [05:18:20] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [05:19:03] !log tstarling synchronized php-1.20wmf12/cache/trusted-xff.cdb [05:19:14] Logged the message, Master [05:22:03] !log tstarling synchronized php-1.21wmf1/cache/trusted-xff.cdb [05:22:14] Logged the message, Master [05:34:32] PROBLEM - Lucene on search1016 is CRITICAL: Connection timed out [06:16:14] RECOVERY - Lucene on search1016 is OK: TCP OK - 3.027 second response time on port 8123 [06:24:56] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [06:27:47] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.027 second response time on port 8123 [06:28:18] !log restarted lucene search on search1016 [06:28:28] Logged the message, Master [07:23:35] binasher: or not archive.new, I don't have that on ms10. :-P [07:32:50] apergos: just got archive.new from ms7 [07:33:03] ok cool [07:33:24] I guess the wikis after commons and up to cy still need to go (from ms7) [07:33:31] but then one last cleanup, right? 
[07:38:58] actually, i ran to c's after commons yesterday and that finished [07:39:28] private/archive is still going, on wikipedia/it right now [07:40:02] en and commons were the biggest there, so private/archive may be completed in a day [07:40:08] if not, we can break that up more [07:40:49] I bet it's done in a day [07:41:05] but that should be it, then just a last round of cleanups from ms7 [07:41:08] uh huh [07:41:25] but after writes are moved to the netapp [07:41:28] so scheduling the cutover; it actually needs to be announced and stuff, cause [07:42:06] there will be at least interruption in uploads, and prolly interruption in image service for a little bit [07:42:15] when the mounts are replaced [07:43:00] we could take batches of servers out of lvs while switching the mounts [07:44:13] still a good idea to announce a window I think, in case it doesn't go perfectly [07:46:16] what time frame are you thinking? (to decide if I can/need to be around) [07:52:09] binasher: [07:52:43] hmm [07:53:18] cause I think the rsyncs will be done by thurs am your time [07:53:29] it could even happen in your timeframe, if you and/or paravoid would be ok with doing it [07:53:48] and mark did the netapp volume setup [07:53:53] paravoid will be at some conference or other I think [07:54:07] I don't feel confident by myself on it [07:54:33] but if it's in your morning or evening I can be around [07:55:59] this is short notice for making an announcement [07:56:06] it is [07:56:12] there's also friday [07:56:17] yeah [07:56:19] lemme look at space over there [07:56:37] hate making changes on fridays, but maybe fri morning would be the best bet [07:56:41] 362G [07:57:09] thinking about whether we can afford to wait for monday [07:57:33] i'm ok with doing it friday [07:57:34] but friday morning, as you say, there's plenty of time then to patch things up if something goes awry [07:57:59] friday sort of sucks for me, but if we can't wait until monday, I guess we'll have to 
[07:58:00] it's not like we're 5pm switch and then go home without looking at the results [07:58:40] monday is a wmf holiday :-/ forgot [07:58:44] oh [07:58:54] those used to be the best days for doing stuff ;) [07:58:58] yep [07:59:48] one advantage to next week is also that snapmirror can finish then [07:59:57] i didn't enable that yet, as it slowed things down [07:59:59] i'm about to pass out.. i'm ok with friday or monday though, email and let me know [08:00:03] I preferred having all data on one netapp cluster first [08:00:23] i want to enable it today, after ms7 rsync is done [08:01:14] there are still 4 or 5 of the 16 commons subdirs in progress [08:01:39] a few of them just started up today, but they'll be done by tomorrow morning our time [08:02:22] so mark, it's your friday evening that will be ruined (I didn't schedule anythign for mine), your call [08:03:08] or friday day time [08:03:23] i don't really wanna do it in the evening, no time to fix anything [08:04:31] that's possible too [08:04:46] so what is the amount of space used, per day? [08:05:10] so that's a bit harder to get a good handle, we had wlm up til the end of the month [08:05:20] and probably a few late uploaders after that [08:05:41] worst case... 
[08:07:42] http://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Miscellaneous%20pmtpa&h=ms7.pmtpa.wmnet&v=389073468416&m=export-Upload_Free_space&r=hour&z=small&jr=&js=&st=1349251561&vl=Size&z=large [08:08:14] we can just mount the NFS share on a separate directory on all apaches [08:08:19] then move over the mount point in the mediawiki config [08:09:26] PROBLEM - Puppet freshness on knsq19 is CRITICAL: Puppet has not run in the last 10 hours [08:09:29] I remember we did that the last time, don't remember what the hiccups were but we had some [08:10:56] so last week was around 85gb a day average but that's with WLM [08:11:08] yesterday was 30gb [08:11:59] so we have like 10 days left at least [08:12:12] if we run zfs into 0% [08:12:17] but I'm not at all clear we can do that [08:12:22] we can remove some old snapshots [08:13:38] in unrelated news I am going to have to reboot ms8, those processes are indeed deadlocked >-< [08:13:40] >_< [08:14:14] go ahead [08:15:04] are we gonna make friday even? [08:15:07] only 16 TB copied now [08:15:54] yes, we will [08:18:07] !log rebooting ms8 to unstick deadlocked zfs recv and zfs list processes [08:18:18] Logged the message, Master [08:18:20] might have to hard power off, this may hang going down [08:29:57] let's see how that reset did [08:36:04] heya [08:36:09] what did I miss? :) [08:36:40] plans for world domination, as usual [08:36:41] us [08:38:07] !log stopped ms7 to ms8 replication [08:38:18] Logged the message, Master [08:42:00] aren't you at a conference or something? [08:44:24] New review: Werdna; "For general use, I don't think 50 is a good number. We have a few options:" [operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/25855 [08:44:37] I mam [08:47:21] how is it?any good sessions? 
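The "10 days left" estimate above is free space divided by daily growth. A minimal sketch in Python, using only the figures quoted in the channel (362 GB free, about 85 GB/day averaged over the WLM week, about 30 GB yesterday). The function name `days_of_runway` is illustrative and is not part of any Wikimedia tooling:

```python
def days_of_runway(free_gb: float, daily_growth_gb: float) -> float:
    """Return how many days of free space remain at a constant growth rate."""
    if daily_growth_gb <= 0:
        raise ValueError("daily growth must be positive")
    return free_gb / daily_growth_gb


if __name__ == "__main__":
    free_gb = 362  # free space quoted for ms7
    for label, rate in (("WLM-week average", 85), ("yesterday", 30)):
        print(f"{label}: {days_of_runway(free_gb, rate):.1f} days")
```

This ignores any space reclaimed by removing old snapshots and assumes the filesystem can be run all the way down to 0%, which the channel was not sure about.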
[08:48:05] it's still 09:47 :) [08:49:04] ah :-D [08:49:11] enjoy your coffee [08:49:38] I'm in a session, it's just too early to say if it's good or not :) [09:43:42] err [09:43:51] HA Group Notification from nas1-a (OUT OF INODES) ERROR [09:43:52] uh oh [09:43:58] gahh [09:44:27] nas1-a> Wed Oct 3 09:44:18 GMT [nas1-a:ems.engine.suppressed:notice]: Event 'wafl.vol.outOfInodes' suppressed 6424 times in last 61 seconds. [09:44:27] I suspended my rsyncs on ms10 [09:44:27] Wed Oct 3 09:44:18 GMT [nas1-a:wafl.vol.outOfInodes:notice]: file system on Volume originals is out of inodes [09:44:43] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours [09:44:43] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours [09:44:43] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours [09:44:43] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [09:44:43] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [09:45:20] Volume originals: maximum number of files is currently 31876689 (31876689 used). [09:45:50] suspended them on ms7 also [09:45:53] that should be all of them [09:45:54] it seems I can raise it [09:49:23] !log Doubled number of inodes on nas1-a:originals from 31876689 to 63753378 [09:49:32] restart the flooding! [09:49:33] Logged the message, Master [09:50:52] rsync: recv_generator: mkdir "/mnt/upload/private/archive/wikipedia/lv/n" failed: No space left on device (28) [09:51:05] still? 
[09:51:05] hm [09:51:06] ah those must have just been queued up [09:51:09] perhaps need to mount/umount [09:51:10] ah [09:51:16] (even though they showed up with delay) [09:52:07] ok all rsyncs back in business [09:52:45] cool [10:17:43] PROBLEM - Puppet freshness on cp1040 is CRITICAL: Puppet has not run in the last 10 hours [10:33:13] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [11:03:13] PROBLEM - Puppet freshness on ms-be6 is CRITICAL: Puppet has not run in the last 10 hours [11:17:44] New review: Matthias Mullie; "Should I make the change for enwiki alone?" [operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/25855 [11:46:06] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [11:57:58] http://wikitech.wikimedia.org/view/Media_server#Transitioning_from_one_nfs_mount_to_another_.28ie._ms7_to_netapp.29.2C_Oct_2012 a first stab at it if we go the new mountpoint route [12:08:54] i wouldn't know why not really [12:09:27] see email [12:11:27] http://wikitech.wikimedia.org/index.php?title=Media_server&diff=52019&oldid=52018 [12:12:48] sure [12:13:04] mmm misc-servers.pp is extension-dist [12:13:33] so that step needs to stay in there separate I guess [12:27:33] ah, I see that we simply have an unused variable in there [12:27:35] [ariel@trouble puppet]$ grep -r extdist_download_dir . [12:27:35] ./manifests/misc-servers.pp: $extdist_download_dir = "/mnt/upload6/ext-dist" [12:35:29] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours [13:12:27] New patchset: Mark Bergsma; "(Temporarily?) disable MIME type based streaming, as it complicates debugging" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26454 [13:13:27] New review: gerrit2; "Lint check passed." 
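The rsync error above says "No space left on device (28)" even though the volume still had free bytes. ENOSPC is also what a client sees when a filesystem runs out of inodes. A minimal Python sketch of checking inode usage from a client, assuming a Unix host where `os.statvfs` is available and a filesystem that reports inode counts (some report zero). The helper names and the `/mnt/upload` default path are illustrative and taken only from the rsync message:

```python
import os
from typing import Tuple


def fraction_used(used: int, total: int) -> float:
    """Fraction of inodes in use; 0.0 if the filesystem reports no total."""
    return used / total if total else 0.0


def inode_usage(path: str) -> Tuple[int, int, float]:
    """Return (used, total, fraction) of inodes for the filesystem at path."""
    st = os.statvfs(path)
    total = st.f_files          # total inodes the filesystem reports
    used = total - st.f_ffree   # f_ffree is the count of free inodes
    return used, total, fraction_used(used, total)


if __name__ == "__main__":
    # Figures from the channel: full before the change, half full after doubling.
    print(fraction_used(31876689, 31876689))
    print(fraction_used(31876689, 63753378))
    mount = "/mnt/upload" if os.path.isdir("/mnt/upload") else "/"
    print(inode_usage(mount))
```

Alerting on this fraction alongside byte usage would flag the condition before writes start failing.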
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26454 [13:13:34] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26454 [13:56:44] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [14:02:44] RECOVERY - Host analytics1007 is UP: PING OK - Packet loss = 0%, RTA = 26.45 ms [14:16:05] PROBLEM - Host analytics1007 is DOWN: PING CRITICAL - Packet loss = 100% [14:24:40] heya mutante, you around? [14:25:00] or someone else maybe, i'm still having reinstall troubles with some analytics machines [14:25:06] currently i'm working on analytics1007 [14:25:08] it reinstalls just fine [14:25:13] and the boot order is set to hdd [14:25:23] but it PXE boots whenever it starts [14:25:52] I loaded up the Cisco web interface to double check [14:25:59] http://cl.ly/image/04420U2F3h0I [14:26:06] it says boot order: hdd [14:26:22] but the actual boot order starts with PXE [14:26:23] ? [14:26:48] then it doesn't find a boot loader on the hard drive [14:26:54] aye [14:27:14] hm [14:29:34] any tips on how I can check that?, everything in the netboot install seems ok [14:29:50] probably it's installing to a different hard drive than the bios thinks is the first one [14:30:51] hm, ok, it is setting up raid, i'll see if I can tell which ones it is picking [14:38:35] RECOVERY - Host analytics1007 is UP: PING OK - Packet loss = 0%, RTA = 28.16 ms [14:51:56] PROBLEM - Host analytics1007 is DOWN: PING CRITICAL - Packet loss = 100% [15:02:14] mark, hm, i'm not sure what is going on [15:02:24] the installer chooses the correct drives to set up raid [15:02:28] as far as I can tell [15:02:33] they are the first devices listed [15:02:45] and they are the same devices that another cisco analytics machine is using [15:03:00] here is a question [15:03:15] when it boots, I ahve the option of selecting a boot device [15:03:25] if I choose it, I get a menu listing all of the disks, and then some other 
devices [15:04:42] but I don't see one that looks like a raid device (i somehow doubt that I should) [15:04:42] so [15:04:47] i'm not sure what to do [15:04:56] i thought it only had software raid? [15:05:09] i don't know, i haven't installed any ciscos [15:05:12] right, it does, which is why I don't think I should see it here [15:05:24] i'm just exploring all the options, cause I dunno what to do either [15:05:24] hm [15:05:25] ok [15:07:20] New patchset: Demon; "Make gerrit service subscribe to secure.config and replication.config" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26455 [15:07:33] perhaps daniel zahn knows? he's played with them some I think [15:07:41] i need to go now, back in a bit [15:08:10] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/26455 [15:09:52] yeah, I just PMed him [15:09:59] hopefully he'll be able to help me! [15:10:00] thanks mark! 
[15:18:10] !log test-deploying SPF DNS records [15:18:21] Logged the message, Master [15:19:50] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [15:19:50] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [15:20:10] annnnd ns1 faceplants [15:23:39] New patchset: Dereckson; "(bug 40732) Content namespaces configuration on hr.wikipedia" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/26456 [15:23:44] PROBLEM - Host ms-be6 is DOWN: PING CRITICAL - Packet loss = 100% [15:37:35] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/25760 [15:39:07] New patchset: Alex Monk; "(bug 40736) Lift account creation limit" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/26460 [15:41:52] !log reedy synchronized wmf-config/InitialiseSettings.php 'WikiLove on itwikiquote' [15:42:02] Logged the message, Master [15:42:34] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/25738 [15:44:09] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/25733 [15:44:50] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/26456 [15:53:53] New patchset: Alex Monk; "(bug 40736) Lift account creation limit" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/26460 [15:54:36] why no rdns for ipv6? (in this case I was looking at williams. 2620:0:860:2:21e:c9ff:feea:ab4a) [15:55:30] results in: Received: from [2620:0:860:2:21e:c9ff:feea:ab4a] (port=39439 helo=williams.wikimedia.org) by mchenry.wikimedia.org with esmtp (Exim 4.69) [16:04:03] jeremyb: very good question [16:05:11] Jeff_Green: should I RT? [16:05:19] yeah I think so [16:05:26] k [16:05:31] thank you [16:06:37] Jeff_Green: the zones don't exist at all? it's not just an isolated host? 
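For reference, the reverse-DNS name that a PTR lookup queries for an IPv6 address is the address's 32 hex nibbles in reverse order under `ip6.arpa`. A minimal sketch using Python's standard `ipaddress` module, applied to the williams address quoted above. The function name `ptr_name` is illustrative:

```python
import ipaddress


def ptr_name(address: str) -> str:
    """Return the reverse-DNS owner name (in-addr.arpa or ip6.arpa) for an IP."""
    return ipaddress.ip_address(address).reverse_pointer


if __name__ == "__main__":
    # williams, as quoted in the Received: header above
    print(ptr_name("2620:0:860:2:21e:c9ff:feea:ab4a"))
```

The name it prints is the one to query for a PTR record (for example with `host` or `dig -x`) to see whether a given host has reverse DNS.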
[16:07:20] * jeremyb wants DNS in public (or at least not top secret) version control ;-[ [16:07:52] there appear to be zone files but I'm not sure how to test them atm [16:08:05] ok [16:08:19] 7.2.0.c.d.d.e.f.f.f.9.b.9.1.2.0 1H IN PTR mchenry.wikimedia.org. [16:08:21] for example [16:09:08] great, thanks [16:09:38] $ host 2620:0:860:2:219:b9ff:fedd:c027 [16:09:38] 7.2.0.c.d.d.e.f.f.f.9.b.9.1.2.0.2.0.0.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa domain name pointer mchenry.wikimedia.org. [16:09:48] so it's not across the board [16:09:55] filing... [16:11:47] done [16:30:25] heya ^demon! [16:30:33] can I do that to? gerrit -> github? [16:30:35] too* [16:30:35] ? [16:30:54] You want to replicate yourself to github? [16:31:31] <^demon> ottomata: Right now I'm trying to get extensions replicated next. We have to manually create repos on the github side (yuck), so I'm working on something that takes care of that. [16:31:42] hm, ok [16:31:47] <^demon> Then we'll start replicating *everything*, rather than one-off requests :) [16:31:51] oh! [16:31:51] cool! [16:32:14] so [16:32:34] as for accounts: [16:32:35] mediawiki/* will be replicated from gerrit, and wikimedia/* will be plain github (as it is now) [16:32:37] ? [16:33:09] <^demon> I wouldn't mind replicating stuff to wikimedia/* as well. [16:33:12] <^demon> That's not hard to do. [16:33:22] <^demon> mediawiki/* is for mediawiki stuff :) [16:33:31] indeed [16:33:52] aye makes sense [16:35:39] <^demon> Should probably adjust the refspec...only should replicate refs/heads/* and refs/tags/* [16:35:46] <^demon> Refs/changes/ is just gerrit stuff. [16:47:13] * aude is breaking github ;) [16:47:23] https://github.com/mediawiki/core/graphs/contributors doesn't seem to work :O [16:49:08] ok, it finally generated [16:49:41] uh oh is that on demand? [16:50:02] <^demon> No, pretty sure it's cached. 
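The refspec change mentioned above (replicate only `refs/heads/*` and `refs/tags/*`, skip Gerrit's `refs/changes/`) comes down to a prefix filter on ref names. A minimal Python sketch of that selection logic. This is an illustration only, not Gerrit's replication plugin or its configuration syntax, and the names `REPLICATED_PREFIXES` and `should_replicate` are hypothetical:

```python
from typing import Iterable, List

# Only branches and tags are mirrored; refs/changes/* is Gerrit review metadata.
REPLICATED_PREFIXES = ("refs/heads/", "refs/tags/")


def should_replicate(ref: str) -> bool:
    """True if the ref is a branch or a tag."""
    return ref.startswith(REPLICATED_PREFIXES)


def refs_to_push(refs: Iterable[str]) -> List[str]:
    """Keep only the refs that would be replicated, preserving order."""
    return [r for r in refs if should_replicate(r)]


if __name__ == "__main__":
    sample = [
        "refs/heads/master",
        "refs/tags/1.20.0",
        "refs/changes/41/26441/3",
        "refs/meta/config",
    ]
    print(refs_to_push(sample))
```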
[16:50:07] <^demon> Anyway, github's problem :) [16:50:40] it's cached, but the first time takes a while :) [16:50:42] yeah it came up pretty quick for me [16:50:50] got abort:abort error [16:50:54] You should try loading the linux kernel one, it takes forever [16:51:28] heh [16:52:13] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26353 [16:53:50] <^demon> I dunno if those stats are right. I don't see myself [16:54:29] <^demon> Oh, I don't think it's properly counting the old svn-converted stuff. [16:54:30] <^demon> Oh well [16:55:01] no idea how it works [17:02:05] PROBLEM - Packetloss_Average on oxygen is CRITICAL: CRITICAL: packet_loss_average is 10.868404127 (gt 8.0) [17:04:48] ^demon: did you see brion's stats??? ;) [17:04:59] <^demon> No? [17:05:06] <^demon> Oh, heh [17:05:23] it's probably only doing stats for people that have github accounts with email addresses that match the commit address [17:05:33] so svn therefore won't count [17:05:45] (wild guess) [17:09:21] jeremyb: i think that's right [17:09:29] GitHub uses the email saved in a commit's header to link the commit to a GitHub user [17:09:35] https://help.github.com/articles/why-are-my-commits-linked-to-the-wrong-user [17:12:17] RECOVERY - Packetloss_Average on oxygen is OK: OK: packet_loss_average is 3.59633785714 [17:29:12] New review: Ryan Lane; "inline comments." [operations/puppet] (production); V: 0 C: -1; - https://gerrit.wikimedia.org/r/26407 [17:30:15] New patchset: DamianZaremba; "First hash at starting to puppetize bots in labs." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26441 [17:31:11] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26441 [17:32:26] New patchset: DamianZaremba; "First hash at starting to puppetize bots in labs." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26441 [17:33:22] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26441 [17:33:45] New patchset: DamianZaremba; "First hash at starting to puppetize bots in labs." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26441 [17:34:41] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26441 [17:38:30] New patchset: DamianZaremba; "First hash at starting to puppetize bots in labs." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26441 [17:39:24] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26441 [17:40:20] New patchset: DamianZaremba; "First hash at starting to puppetize bots in labs." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26441 [17:41:15] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26441 [18:10:14] PROBLEM - Puppet freshness on knsq19 is CRITICAL: Puppet has not run in the last 10 hours [18:19:15] New patchset: preilly; "fix DTAC Thailand (DT) IP range" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26493 [18:20:08] PROBLEM - Packetloss_Average on oxygen is CRITICAL: CRITICAL: packet_loss_average is 9.87912261905 (gt 8.0) [18:20:12] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26493 [18:20:13] notpeter: ping [18:21:15] paravoid: ping [18:23:35] mark: ping [18:23:48] Actually is there anybody from OPs actually around [18:25:38] New patchset: Demon; "Make gerrit service subscribe to secure.config and replication.config" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26455 [18:26:36] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26455 [18:27:12] preilly: ok, can push [18:27:36] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26493 [18:27:47] RECOVERY - Packetloss_Average on oxygen is OK: OK: packet_loss_average is 0.0 [18:27:57] preilly: merged on sockpuppet [18:28:22] notpeter: thanks [18:53:59] New patchset: Demon; "Only replicate branches and tags" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26495 [18:54:55] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26495 [19:03:24] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26495 [19:10:32] PROBLEM - Puppet freshness on oxygen is CRITICAL: Puppet has not run in the last 10 hours [19:16:41] PROBLEM - Packetloss_Average on oxygen is CRITICAL: CRITICAL: packet_loss_average is 8.94007206349 (gt 8.0) [19:26:27] RECOVERY - Puppet freshness on oxygen is OK: puppet ran at Wed Oct 3 19:26:19 UTC 2012 [19:28:25] !log preilly synchronized php-1.20wmf12/extensions/ZeroRatedMobileAccess 'update for landing page' [19:28:36] Logged the message, Master [19:29:26] RECOVERY - Packetloss_Average on oxygen is OK: OK: packet_loss_average is 3.4601276 [19:29:58] !log preilly synchronized php-1.21wmf1/extensions/ZeroRatedMobileAccess 'update for landing page' [19:30:09] Logged the message, Master [19:32:47] !log preilly synchronized php-1.20wmf12/extensions/MobileFrontend 'update for landing page' [19:32:58] Logged the message, Master [19:33:35] !log preilly synchronized php-1.21wmf1/extensions/MobileFrontend 'update for landing page' [19:33:46] Logged the message, Master [19:45:29] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [19:45:29] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours [19:45:29] PROBLEM - Puppet freshness on 
virt1001 is CRITICAL: Puppet has not run in the last 10 hours [19:45:29] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [19:45:29] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours [19:45:51] New patchset: Dzahn; "wikistats: add a debian .install file and logos for xhtml/css validator" [operations/debs/wikistats] (master) - https://gerrit.wikimedia.org/r/26510 [19:55:53] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26455 [20:06:40] AaronSchulz: do we have enough stuff migrated off of NFS to go Swift-only if we need to for a bit? [20:08:09] totally swift-only? [20:08:14] that would break captcha [20:09:38] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: special, private and closed to 1.21wmf1 [20:09:49] Logged the message, Master [20:11:41] woosters: ^ [20:12:28] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: wikiversity and wikiquote to 1.21wmf1 [20:12:38] Logged the message, Master [20:12:50] AaronSchulz: (sorry, I know you've told me before) What's the plan with Captcha? [20:12:54] Aaron, my question is this - should ms7 fall over now, would it bring down the scalers, and subsequently the site?
[20:13:41] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: wikibooks and wikinews to 1.21wmf1 [20:13:51] Logged the message, Master [20:14:10] the scalers shouldn't be using ms7/ms5 now [20:14:24] they read originals from swift and write them only to swift [20:15:26] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Wikimedia and wiktionary to 1.21wmf1 [20:15:37] Logged the message, Master [20:15:42] !log powercycling analytics1007 [20:15:53] Logged the message, Master [20:18:23] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Everything else non wikipedia to 1.21wmf1 [20:18:29] PROBLEM - Puppet freshness on cp1040 is CRITICAL: Puppet has not run in the last 10 hours [20:18:34] Logged the message, Master [20:18:44] so should ms7 stop now, only captcha would not be working … right? [20:19:07] well timeline and math would break for people unless the ACLs were deployed to swift [20:19:08] * robla disappears back into meeting [20:19:30] though that can be done quickly, it would still require somebody to do something [20:19:31] really, the W3C logo/image for a passed validator.w3.org is not under a free license? sigh [20:20:23] lol.
[20:20:35] http://en.wikipedia.org/wiki/File:Valid-xhtml10-v_%28W3C_Markup_Validation_Service%29.svg [20:20:55] of course all operations to originals would give errors too until someone pulls ms7 out of the MW config, in that vein [20:21:18] anyway, captcha is the main thing that could not be fixed with a fast config change if that's the question [20:21:22] * AaronSchulz looks for other stuff [20:22:02] ah, yes, extension distributor would break too, but oh well [20:22:32] and it would be nice for http://wikitech.wikimedia.org/view/Swift/Open_Issues_Aug_-_Sept_2012/Cruft_on_ms7#docroot to work [20:26:06] New review: Dzahn; "W3C logo icons used in accordance with http://www.w3.org/Consortium/Legal/logo-usage-20000308.html :p" [operations/debs/wikistats] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/26510 [20:26:12] New review: Dzahn; "W3C logo icons used in accordance with http://www.w3.org/Consortium/Legal/logo-usage-20000308.html :p" [operations/debs/wikistats] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/26510 [20:26:13] Change merged: Dzahn; [operations/debs/wikistats] (master) - https://gerrit.wikimedia.org/r/26510 [20:29:29] ottomata: powercycling it, can get to BIOS but not much more [20:29:44] looks like the other one where you just repeated install and it worked [20:30:02] ? [20:30:12] remember you mentioned something regarding a last step that failed with the installer [20:30:41] oh yeah kinda, that one I thought I had fixed by switching back to HDD boot while the installer was running [20:31:02] that way when it would reboot it would pick up the OS on the HDD after the install was finished [20:31:43] i confirmed it should try the first HDD first per BIOS boot settings [20:31:46] hrmm [20:32:05] i dont see any fails currently, just blank output on mgmt again [20:32:13] oh you get blank output? [20:32:16] on an07? 
[20:32:31] hmm, yeah maybe I got that too…….can't really remember now [20:32:34] but not always [20:32:40] lots of times it would just PXE boot after the install finished [20:33:45] heh, drdee, I should've run this on a smaller file first to check the results [20:33:50] buuut it is going, so we'll see: [20:33:50] http://analytics1001.wikimedia.org:8088/proxy/application_1349195921521_0019/mapreduce/job/job_1349195921521_0019 [20:35:30] ottomata: yes, on an07 [20:35:40] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [20:35:49] ooops, sorry [20:35:54] that last bit was the wrong chat [20:35:55] ottomata: check [20:36:26] yeah hmm, i dunno what's up mutante [20:36:34] it seems like it should HDD boot to you too, right? [20:36:46] is there a way we can check to see if the boot loader is on the disk? [20:36:51] it's supposed to be software raid on / [20:36:52] sooooooo [20:36:55] i dunno [20:36:56] woosters, AaronSchulz: hi [20:37:08] hi [20:37:40] how's London? [20:37:45] my net is really really bad, so pardon any lag [20:38:12] so, math/timeline were switched on both squids & varnish [20:38:19] ottomata: yes, it does look like it should boot from HDD to me, unfortunately i don't see anything though after exiting from BIOS [20:38:36] ottomata: might be console redirection settings [20:38:47] i keep trying [20:39:01] paravoid: already? [20:39:06] I thought that wasn't done yet [20:39:15] mmk, thank youuuu [20:40:13] nope, it's deployed, it's also on the server admin log [20:40:18] and puppet's git history too :) [20:40:52] captcha and the rest of the cruft can be switched on the netapp instantly, we don't have to wait for the rsync to the netapps to finish [20:41:32] did you copy the captchas? there are only like 100k, and you don't need all of them for it to work [20:41:42] the only "question" is if it's a simple config change to switch captcha etc.
to the new mount point [20:41:56] which I think is trivial enough, but you'll know better [20:42:21] should be simple [20:42:51] paravoid: also the ext-dist stuff shouldn't take long to copy [20:43:05] exactly, that was on the "rest of the cruft" [20:43:12] docroot too [20:43:16] and whatever else is left [20:44:10] we were pondering if there's going to be any /mnt/upload6 hardcoded anywhere [20:44:22] or if we should just change mountpoints [20:46:12] New patchset: Alex Monk; "(bug 40669) Lift account creation throttle for Grace Hopper 2012 event" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/26541 [20:49:08] AaronSchulz: so, yeah, if you could check if there are any mount points hardcoded that'd be great [20:49:22] so that we can mount the netapp everywhere and start switching configs to it [20:49:38] incl. captcha, ext-dist, new origs/thumbs etc. [20:49:57] both in MW and in squid/varnish's swift fallback [20:52:35] New review: Dzahn; "per Leslie, bug 40669" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/26541 [20:52:35] Change merged: Dzahn; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/26541 [20:55:01] PROBLEM - Apache HTTP on srv194 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:55:39] !log dzahn synchronized wmf-config/throttle.php 'IP cap lift for Grace Hopper event, bug 40669' [20:55:50] Logged the message, Master [20:56:22] RECOVERY - Apache HTTP on srv194 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.661 second response time [20:57:48] paravoid: I don't see upload6 in conf/squid [20:57:57] there is some stuff in wmf-config of course [20:59:07] nothing in MW code [21:13:38] paravoid: do you see anything else other than in wmf-config and puppet? [21:25:53] paravoid: ping [21:35:59] New review: Aaron Schulz; "Looks like it was merged."
[operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/23392 [21:38:49] !log deleting education-ar mailman list per bug 40749 [21:38:59] Logged the message, Master [21:46:55] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [21:48:21] PROBLEM - Apache HTTP on srv194 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:49:51] RECOVERY - Apache HTTP on srv194 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.422 second response time [22:19:54] notpeter, is our custom Solr package publicly available? [22:32:12] New patchset: Parent5446; "(bug 39380) Enabling secure login (HTTPS)." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/21322 [22:36:21] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours [22:46:33] RECOVERY - Host analytics1007 is UP: PING OK - Packet loss = 0%, RTA = 26.56 ms [22:47:28] !log reinstalling analytics1007 [22:47:38] Logged the message, Master [22:59:54] PROBLEM - Host analytics1007 is DOWN: PING CRITICAL - Packet loss = 100% [23:03:38] MaxSem: Presumably it'd be in our apt repo? [23:03:59] yeah, I've already found it [23:05:10] New patchset: Alex Monk; "(bug 40736) Lift account creation limit" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/26460 [23:08:20] nice http://commons.wikimedia.org/wiki/File:WALRUS_logo.svg [23:21:58] hmm.. Detected USB Devices " 3 Drives, 1 Keyboard, 2 Mice, 1 Hub " .. on a server in the dc? hmm [23:31:33] PROBLEM - Apache HTTP on srv194 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:33:03] RECOVERY - Apache HTTP on srv194 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.004 second response time [23:58:13] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours