[00:35:05] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 27 seconds [00:41:14] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [00:41:14] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [00:49:20] paravoid: mc2 came up without networking after boot but ixgbe-dkms was installed [00:49:35] looks like the .ko is installed to /lib/modules/3.2.0-29-generic/updates/dkms/ixgbe.ko [00:49:53] while /lib/modules/3.2.0-29-generic/kernel/drivers/net/ethernet/intel/ixgbe/ixgbe.ko is the original [00:50:15] i insmod'd the dkms version and it works at least [00:58:00] hm [00:58:36] let me SSH for a moment [00:59:04] root@mc2:~# modinfo ixgbe [00:59:04] filename: /lib/modules/3.2.0-29-generic/updates/dkms/ixgbe.ko [00:59:08] that's strange [00:59:12] aaaaaaaaaargh, I know [00:59:28] the old one is in the initramfs [00:59:28] the initramfs probably has the module too [00:59:32] which doesnt get recreated [00:59:34] yeah, just verified that [00:59:36] yeah, I just thought of that [00:59:47] we need to do update-initramfs -k all -u in the postinst [00:59:51] after the dkms stuff [00:59:52] dammit [01:00:03] i just ran a update-initramfs and am going to see if the new one has the dkms version [01:00:10] oh it's even worse than that [01:00:17] we need to do it on every dkms run too [01:00:22] not just at package installation [01:01:11] ok, the new initramfs contains the dkms and orig versions [01:02:49] paravoid: dkms.conf has a REMAKE_INITRD setting which defaults to no [01:03:03] if it works, this might not be too bad [01:03:04] that's what I'm looking at right now [01:05:41] yep [01:05:43] works in my test [01:05:48] yay, that was simple enough [01:05:52] yay [01:06:31] new ixgbe-dkms package? [01:06:38] yes, about to deploy [01:07:53] ok, updated [01:08:00] (didn't change the version number, feeling lazy) [01:08:06] can you reimage mc2 just to be sure? [01:08:13] yup [01:08:43] [ 1679.503747] ixgbe 0000:04:00.0: eth0: detected SFP+: 5 [01:09:25] interesting... [01:10:01] Failed to fetch http://apt.wikimedia.org/wikimedia/pool/universe/i/ixgbe/ixgbe-dkms_3.6.7-k+wmf1_amd64.deb Size mismatch [01:10:25] (and that was after an apt-get update) [01:10:25] laziness has its tolls [01:10:44] might need to removedeb first [01:10:48] since its the same version [01:11:05] 487 reprepro remove precise-wikimedia ixgbe [01:11:05] 488 reprepro remove precise-wikimedia ixgbe-dkms [01:11:05] 489 reprepro -C universe include precise-wikimedia ixgbe_3.6.7-k+wmf1_amd64.changes [01:11:36] that's strange [01:11:51] old .changes file? [01:12:08] oh [01:12:15] http://apt.wikimedia.org/wikimedia/pool/universe/i/ixgbe/ixgbe-dkms_3.6.7-k+wmf1_amd64.deb … i think that's behind squid [01:12:25] argh [01:12:43] the machines talk to it through brewster's squid anyway [01:12:53] we set http_proxy for the internal ones to be able to reach apt.wikimedia [01:13:30] X-Cache: HIT from brewster.wikimedia.org [01:13:30] X-Cache-Lookup: HIT from brewster.wikimedia.org:8080 [01:13:32] of course [01:13:51] good catch [01:15:10] New patchset: Tim Starling; "Remove ScanSet" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/24252 [01:15:25] TimStarling: yay! [01:16:09] Change merged: Tim Starling; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/24252 [01:16:22] paravoid: ok, thats out of the way, and testing an install of the new deb before i reimage [01:16:30] did you purge it? 
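(note: the fix settled on above is DKMS's own REMAKE_INITRD knob plus a one-off initramfs rebuild, so the stock ixgbe.ko baked into the initrd stops shadowing the dkms-built module. A minimal sketch of what that looks like; the dkms.conf field values are illustrative, the actual ixgbe-dkms packaging isn't shown in the log.)

    # /usr/src/ixgbe-3.6.7/dkms.conf (sketch)
    PACKAGE_NAME="ixgbe"
    PACKAGE_VERSION="3.6.7"
    BUILT_MODULE_NAME[0]="ixgbe"
    DEST_MODULE_LOCATION[0]="/updates/dkms"   # matches where modinfo found the module above
    AUTOINSTALL="yes"
    REMAKE_INITRD="yes"                       # rebuild the initramfs after every dkms build/install

    # one-off repair on hosts that already have the stale in-tree module in their initrd:
    update-initramfs -u -k all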
[01:16:54] yep, manually purged that file from squid [01:17:01] Making new initrd.img-3.2.0-29-generic [01:17:02] (If next boot fails, revert to initrd.img-3.2.0-29-generic.old-dkms image) [01:17:04] that sounds good [01:17:23] I tried it too and I got a 404 [01:17:37] then I retried fetching it and got a HIT [01:17:54] now that I retry everything seems fine, so I think what happened was a race between me and you :-) [01:18:19] you purged (200), then I purged (404) then you fetched (MISS) then I fetched (HIT) [01:18:24] and a new initramfs build when uninstalling [01:18:25] heh [01:18:26] heh [01:19:50] * paravoid crosses fingers [01:20:40] this is the first time I use dkms btw (apt-get install virtualbox-ose-dkms doesn't count) [01:20:40] its booting into pxe now [01:20:58] i'm going to go afk for a bit, i'll let you know :) and thanks, as always [01:21:31] after it boots we need to try to install a new kernel (e.g. an older one) to see what happens [01:22:01] better do it now than at some random point in the future and have it potentially fail [01:25:46] New patchset: Tim Starling; "Don't customise $wgCortadoJarFile" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/24253 [01:26:30] Change merged: Tim Starling; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/24253 [01:26:36] yay² [01:30:54] New patchset: Faidon; "swift: add support for container sync" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24254 [01:31:50] New patchset: Faidon; "swift: allow container sync between labs/labsupgrade" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24255 [01:32:41] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24254 [01:32:41] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24254 [01:32:41] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24255 [01:33:15] New review: Faidon; "Already deployed, tested with puppetmaster::self." 
[operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/24255 [01:33:16] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24255 [01:35:12] binasher: btw, swift still leaking memory with precise :( [01:35:20] http://ganglia.wikimedia.org/latest/graph_all_periods.php?h=ms-fe1.pmtpa.wmnet&m=cpu_report&r=hour&s=descending&hc=4&mc=2&st=1348018471&g=mem_report&z=large&c=Swift%20pmtpa [01:35:26] sigh [01:40:49] PROBLEM - MySQL Slave Delay on storage3 is CRITICAL: CRIT replication delay 253 seconds [01:41:16] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 280 seconds [01:44:16] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 28 seconds [01:45:20] RECOVERY - MySQL Slave Delay on storage3 is OK: OK replication delay 1 seconds [01:59:25] PROBLEM - Puppet freshness on ms-be6 is CRITICAL: Puppet has not run in the last 10 hours [02:12:20] PROBLEM - swift-container-server on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [02:31:22] PROBLEM - Puppet freshness on aluminium is CRITICAL: Puppet has not run in the last 10 hours [03:05:16] RECOVERY - swift-container-server on ms-be3 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [04:19:05] why is the wiki sooooo slooooooow [04:31:22] PROBLEM - Apache HTTP on srv224 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:32:25] PROBLEM - Apache HTTP on srv222 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:32:52] PROBLEM - LVS HTTP IPv4 on rendering.svc.pmtpa.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:32:52] PROBLEM - Apache HTTP on srv219 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:32:53] Reedy: ^ [04:34:40] PROBLEM - Apache HTTP on srv223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:34:40] PROBLEM - Apache HTTP on srv220 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:34:58] PROBLEM - Apache HTTP on srv221 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:35:42] 3 apaches wouldn't cause an issue [04:35:44] 6 is still meg [04:35:46] *meh [04:35:52] what about the LVS? [04:36:28] i'm not sure what rendering is [04:36:42] it's the imagescalers [04:36:54] I presumed that might be the case [04:37:02] a lot of the apaches look loaded based on ganglia [04:37:36] not much going on in dberror [04:38:23] does http://status.wikimedia.org/8777/156488/Ubuntu-mirror have anything to do with it? [04:38:29] no [04:38:41] Those apaches aren't in the app server or api app server pool... [04:38:54] this may be due to rendering.... [04:38:56] paravoid: ^^ [04:39:14] yup, all scalers [04:39:29] hm. now how do I get in touch with faidon [04:39:34] http://ganglia.wikimedia.org/latest/?c=Application%20servers%20pmtpa&h=srv271.pmtpa.wmnet&m=load_one&r=hour&s=by%20name&hc=4&mc=2 [04:39:39] wonder what that spike was [04:40:16] damn it. I have no clue how to contact faidon [04:40:17] scap? [04:40:24] someone scapped? [04:40:37] I'm running it now [04:40:41] why? [04:40:42] Ryan_Lane: not got his phone number? [04:40:51] nope [04:40:53] I've got it in my phone.. 
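(note: srv219-224 are the image scalers behind rendering.svc.pmtpa.wmnet, so this is the whole rendering pool timing out rather than a few stray apaches. A quick check from a bastion would look roughly like the sketch below, using only generic tools; the .pmtpa.wmnet suffix is assumed from the LVS name in the alerts.)

    for h in srv219 srv220 srv221 srv222 srv223 srv224; do
      printf '%-8s ' "$h"
      curl -s -o /dev/null -m 10 -w '%{http_code}\n' "http://$h.pmtpa.wmnet/"
    done
    # and the LVS service address itself:
    curl -s -o /dev/null -m 10 -w '%{http_code}\n' http://rendering.svc.pmtpa.wmnet/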
[04:40:55] Rebuild localisation cache for page triage [04:41:09] if only we had some better deployment system [04:42:24] the swift log was just completely idle for a minute but now spewing again [04:42:53] Sep 19 04:41:31 10.0.6.202 object-replicator @ERROR: max connections (2) reached -- try again later [04:42:57] there's a bunch of swift services showing red [04:43:27] Sep 19 04:41:31 10.0.6.208 object-replicator @ERROR: max connections (2) reached -- try again later [04:43:37] looks like thats being repeated for at least a bunch of the ms-be hosts [04:43:44] binasher: that logs tends to come in bursts [04:43:46] yep [04:43:48] I noticed that weeks ago [04:44:07] Aaron|home: good 'ol buffering, ok [04:44:19] Sep 19 04:41:31 10.0.6.215 proxy-server STDOUT: ERROR:root:Timeout talking to memcached: ms-fe1.pmtpa.wmnet:11211 (txn: txb31160c2e48d40598129e40174fc6aaf) (client_ip: 201.185.71.29) [04:44:44] faidon is coming on [04:45:47] attempts to ssh into ms-fe1 aren't working going anywhere [04:45:56] well, I surely hope it didn't die [04:46:00] well, it's a frontend, right? [04:46:08] so maybe that won't be as much of a problem [04:46:38] the memcached timeouts to it could be causing problems [04:46:43] yep [04:46:48] it's faster again [04:46:57] so Reedy's job /was/ the culprit after all [04:47:04] that [04:47:16] that's not the only problem, though [04:47:18] didn't this happen before? [04:47:34] some link getting saturated or something... [04:47:43] no clue. I'm not terribly familiar with swift yet [04:48:02] I mean due to scap [04:48:17] though the rsync deltas should have been smallish right? [04:48:19] looks like image scalers are down [04:48:28] http://commons.wikimedia.org/wiki/Special:NewFiles [04:48:35] no new thumbnails are showing [04:48:57] woosters_: yes. it's known [04:49:01] [54690.553782] TCP: too many of orphaned sockets [04:49:02] [54708.594438] Out of socket memory [04:49:15] it's still rather slow [04:49:17] ms-fe1.. paravoid upgraded it to precise earlier today [04:49:29] but better [04:50:07] Reedy: btw, someone really needs to fix those metadata exceptions :) [04:50:08] much closer to normal [04:50:11] hi [04:50:19] Jasper_Deng: please bring commentary to -tech [04:50:19] what's going on? [04:50:20] not here [04:50:32] Aaron|home: so do it then? :p [04:50:39] paravoid: ms-fe1 is having issues [04:50:46] just fe1? [04:51:30] well, I don't know if the other ms systems are supposed to be down or not in nagios [04:51:31] (fwiw, my number is in the office wiki) [04:51:37] there are possibly backend issues with swift [04:51:42] as well [04:51:45] paravoid: that's where I got it ;) [04:53:34] ariel upgraded ms-fe1 to precise yesterday our night [04:54:09] and I pushed a config file change at ~18:00 [04:54:55] PROBLEM - Swift HTTP on ms-fe1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:55:17] Aaron|home: did (I|we|you|someone) log a bug for it? [04:55:21] okay, that's down but pybal should depool it [04:55:51] lots of swift noise in the apache logs (unsuprising) [04:56:01] huh, ms-fe1 201/204 hits are 0 :( [04:56:14] Reedy: I thought there was one, and you closed it [04:56:16] RECOVERY - Swift HTTP on ms-fe1 is OK: HTTP OK HTTP/1.1 200 OK - 397 bytes in 0.009 seconds [04:56:40] i increased tcp_max_orphans, tcp_mem, and tcp_rmem [04:57:03] from what to what? [04:57:05] the tcp out of socket memory issues were fucking with memcached [04:57:21] and ironically all of my GET attempts work [04:57:34] Aaron|home: did I? 
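(note: the emergency change binasher describes here -- raising tcp_max_orphans, tcp_mem and tcp_rmem on ms-fe1 -- would look roughly like the sketch below. Only the max_orphans number (262144 raised to 2621440) is given later in the log; the rmem/mem triplets are placeholders, and tcp_mem is counted in pages rather than bytes.)

    # live, non-persistent change on ms-fe1
    sysctl -w net.ipv4.tcp_max_orphans=2621440
    sysctl -w net.ipv4.tcp_rmem='4096 87380 67108864'       # min default max -- placeholder values
    sysctl -w net.ipv4.tcp_mem='1540000 2050000 30800000'   # low pressure high, in pages -- placeholders

    # verify, and see how much socket memory is actually in use
    sysctl net.ipv4.tcp_max_orphans net.ipv4.tcp_rmem net.ipv4.tcp_mem
    cat /proc/net/sockstat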
[04:57:48] and i can ssh to it now [04:57:55] RECOVERY - Apache HTTP on srv222 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.029 second response time [04:58:05] binasher: and the graphs are on an upwards trend [04:58:08] the req ones [04:58:31] RECOVERY - Apache HTTP on srv219 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.041 second response time [04:58:35] binasher: you increased them on which box(es)? [04:58:40] RECOVERY - Apache HTTP on srv224 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.039 second response time [04:58:40] RECOVERY - LVS HTTP IPv4 on rendering.svc.pmtpa.wmnet is OK: HTTP OK HTTP/1.1 200 OK - 62244 bytes in 0.169 seconds [04:58:40] RECOVERY - Apache HTTP on srv223 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.052 second response time [04:58:40] RECOVERY - Apache HTTP on srv220 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.039 second response time [04:58:41] http://commons.wikimedia.org/wiki/Special:NewFiles is less broken [04:58:58] RECOVERY - Apache HTTP on srv221 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.828 second response time [05:00:36] paravoid: only ms-fe1 via the console.. compared to 2 and 3, it was the only one steadily spewing errors like [05:00:37] [55094.028931] TCP: too many of orphaned sockets [05:00:37] [55096.344298] Out of socket memory [05:01:13] none of that from any of the lucid fe's [05:01:17] okay [05:01:23] thanks :) [05:03:10] Sep 19 04:55:57 10.0.6.211 proxy-server STDOUT: ERROR:root:Timeout talking to memcached: ms-fe1.pmtpa.wmnet:11211 (txn: txeed8d608f769454883ba20e98f90dfd8) [05:03:13] that was the last of those [05:03:15] have you kept the values before? [05:03:28] Reedy: https://bugzilla.wikimedia.org/show_bug.cgi?id=40037 [05:03:36] ahh, brion closed it, then it was reopened [05:03:43] paravoid: the prior values were as set by out sysctl.d confs [05:03:58] that's what I'm looking, I'm wondering if there's a bug there [05:04:08] and i just increased with stupid "make them bigger now now now" values [05:04:10] e.g. sysctl not being applied for whatever buggy reason [05:05:09] i don't think so [05:05:28] the tcp_rmem and tcp_max_orphans both matched the configs [05:05:43] okay, good to know [05:05:48] and worrying at the same time [05:06:37] i increased tcp_max_orphans from 262144 to 2621440 which is probably stupid, and just added a 0 to the last fields of mem/rmem as well [05:06:59] heh I just saw that [05:07:50] so is the swift memory leak dangling sockets? [05:08:38] 60-swift-performance.conf.sysctl… performance! heh [05:08:48] there was a bug fixed 1.4.4 along those lines (fixing "socket hoarding") [05:09:31] i'm skeptical of it setting both tw_recycle and tw_reuse to 1, not that i think its relevant [05:09:49] 1.7 "Fixed a bug where an error may have caused the proxy to stop returning data to a client" [05:10:10] heh, seems like some of the errors I see in the logs about missing bytes [05:10:20] so, the max_orphans is a noop, the current orphan count is 0 [05:10:51] binasher: I think the docs recommend that [05:10:55] eh, it did log "TCP: too many of orphaned sockets" [05:11:44] oh it did? 
that's strange [05:12:28] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [05:12:28] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours [05:12:28] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours [05:12:28] PROBLEM - Puppet freshness on singer is CRITICAL: Puppet has not run in the last 10 hours [05:12:28] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours [05:12:29] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [05:13:39] TCP: inuse 1035 orphan 0 tw 1 alloc 21622 mem 791526 [05:13:43] erm [05:13:45] that's pages [05:14:05] 3.2G? [05:18:46] before i made the proc changes, sockstat was: [05:18:47] sockets: used 21933 [05:18:47] TCP: inuse 1093 orphan 0 tw 1 alloc 21761 mem 799052 [05:19:53] oh useful [05:20:02] so orphan was 0 before [05:20:10] orphan 0 at the time, weird that it logged too many literally a minute before [05:20:16] see dmesg [05:20:23] I believe you :) [05:20:40] I'll dig in the kernel code, see when that message is triggered [05:20:42] just in case [05:25:02] i set tcp_rmem and tcp_mem back to prior values [05:29:35] grr freaking wifi and mifi [05:30:25] paravoid: tcp_too_many_orphans() returns true if orphans > sysctl_tcp_max_orphans [05:30:26] OR [05:30:33] 287 if (sk->sk_wmem_queued > SOCK_MIN_SNDBUF && [05:30:34] 288 atomic_long_read(&tcp_memory_allocated) > sysctl_tcp_mem[2]) [05:30:35] 289 return true; [05:31:21] that's stupid [05:31:31] yes [05:31:40] I'm on my kernel tree too [05:32:03] but you beat me to it, friggin wifi and line wrapping [05:32:45] (grepping for "too many orphaned sockets" has no output, since " sockets" is on a separate line) [05:33:03] i wonder if the logging is improved in 3.4 or 5 [05:33:09] that was in tcp.h [05:33:16] I have a full git checkout, let me see [05:34:06] but anyways, just socket memory related.. i wonder what's different in precise [05:34:39] if mem is pages then the allocated memory is crazy [05:34:49] and I think it's pages [05:34:55] 3.2G for socket memory, that'd be fun [05:35:42] commit efcdbf24fd5daa88060869e51ed49f68b7ac8708 [05:35:52] net: Disambiguate kernel message [05:35:52] [05:35:52] Some of our machines were reporting: [05:35:52] [05:35:52] TCP: too many of orphaned sockets [05:35:54] [05:35:57] even when the number of orphaned sockets was well below the [05:35:59] limit. [05:36:11] v3.3-rc4~34^2~57 [05:36:38] good to know [05:37:36] tcp_mem = pages [05:37:56] its grown to 800738 [05:38:37] I'm wondering about the sockstat output [05:38:53] 800738 pages you think? [05:39:00] yah, i think mem in there is pages as well [05:39:03] that's ~3.2GB [05:39:11] root@ms-fe2:~# cat /proc/net/sockstat [05:39:12] sockets: used 1111 [05:39:12] TCP: inuse 906 orphan 0 tw 0 alloc 916 mem 23317 [05:39:23] yeah [05:39:55] a lot more reasonable [05:41:38] but sockets: used 22320 vs. sockets: used 1111 [05:42:10] oh hah [05:42:23] look at a swift process' fds [05:42:42] it's like 1685 [05:42:47] vs. 63 in ms-fe2 [05:43:04] 1981 vs. 
59 [05:43:51] 1034/fd 2182 [05:43:51] 1037/fd 1982 [05:43:52] 1035/fd 1980 [05:43:53] yeah [05:44:40] eventlet maybe [05:44:43] lsof shows most as "can't identify protocol" [05:44:54] yeah they don't appear in netstat either [05:45:39] (or ss, if you want to be trendy) [05:46:16] I think it's a socket leak [05:46:21] not being close()d [05:49:13] it's something like that [05:50:25] I'm pretty sure it's that or very similar to that [05:52:22] PROBLEM - Puppet freshness on ms-be10 is CRITICAL: Puppet has not run in the last 10 hours [05:59:32] binasher: I'm thinking to depool ms-fe1 and call it a day [05:59:37] unless you're still playing with it [06:01:55] !log depooling ms-fe1; broken after precise upgrade; pending further investigation [06:02:05] Logged the message, Master [06:02:09] go for it, i was trying to observe exactly what happens when it loses a socket under a couch cushion but, meh [06:03:00] i wonder if it should be taken out of the memcached list on the others.. maybe it won't matter of swift-proxy stops getting requests [06:03:23] yeah I think we'll be okay [06:06:51] i'm signing off for the night, seeya tmw [06:47:54] PROBLEM - Squid on brewster is CRITICAL: Connection refused [06:57:12] RECOVERY - Squid on brewster is OK: TCP OK - 0.000 second response time on port 8080 [07:09:03] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [08:00:03] PROBLEM - Puppet freshness on analytics1003 is CRITICAL: Puppet has not run in the last 10 hours [08:00:03] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours [09:06:21] well that was a fail (the ms-fe1 upgrade) [09:06:25] good I only did one [09:19:06] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [10:42:03] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [10:42:03] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [10:58:03] New review: ArielGlenn; "OK, if they're all eventually going to be short forms then there's no point in trying to keep this c..." [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/23309 [10:58:03] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/23309 [12:00:03] PROBLEM - Puppet freshness on ms-be6 is CRITICAL: Puppet has not run in the last 10 hours [12:00:17] New patchset: Hashar; "(bug 40163) Try to fix ltwiki import source for betawikiversity" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/23419 [12:01:11] New review: Hashar; "lets try it!" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/23419 [12:01:11] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/23419 [12:11:56] New patchset: Hashar; "(bug 39206) Namespaces configuration for se.wikipedia" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/23222 [12:12:30] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/23222 [12:17:08] New patchset: Hashar; "(bug 38840) Namespaces configuration on uz.wikipedia" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/23102 [12:17:24] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/23102 [12:23:38] New patchset: Hashar; "(bug 39866) Anexo: is a content namespace on es.wikipedia." 
[operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/23068 [12:23:52] New review: Hashar; "Looks like there is consensus for this." [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/23068 [12:23:52] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/23068 [12:27:33] New patchset: Hashar; "(bug 39264) Add Tudalen: and Indecs: namespaces to cy.wikipedia" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/23094 [12:27:40] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/23094 [12:31:06] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours [12:32:09] PROBLEM - Puppet freshness on aluminium is CRITICAL: Puppet has not run in the last 10 hours [12:43:16] New review: Hashar; "We definitely need tests coverage in mediawiki-config :-)" [operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/23935 [13:31:18] hey guys, what is the URL that installs the gerrit post-commit hook? I can't find it on https://labsconsole.wikimedia.org/wiki/Git [13:31:22] !log Deactivated subnet for vlan 104 (Tampa) [13:31:32] Logged the message, Master [13:35:46] https://gerrit.wikimedia.org/r/tools/hooks/commit-msg in theory [13:35:56] ty apergos [13:36:26] yw [13:43:29] mark: got some time to fix a parameterized class please? https://gerrit.wikimedia.org/r/#/c/23770/ Tim made the role::mediawiki::logger to require a log path but we can't pass parameters in labsconsole :-) [13:43:39] so I have changed the class to have a default value [13:45:58] what would be the reason for not having one wikipedia attached to sul? [13:45:59] apergos, drdee: for future reference https://gerrit.wikimedia.org/r/Documentation/cmd-index.html#_client [13:50:28] I looked at this: http://www.mediawiki.org/w/index.php?title=Git/Workflow&oldid=503559 t find it :-P [13:59:09] PROBLEM - Auth DNS on ns0.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [14:06:39] RECOVERY - Auth DNS on ns0.wikimedia.org is OK: DNS OK: 0.034 seconds response time. www.wikipedia.org returns 208.80.154.225 [14:12:03] PROBLEM - Puppet freshness on virt0 is CRITICAL: Puppet has not run in the last 10 hours [14:27:58] apergos1: our disks on a fedex truck roaming around the city now...I should have them by 2p est [14:28:24] which is 4 pm here [14:28:28] no [14:28:33] noon here [14:28:46] er [14:28:47] no [14:28:49] midnight [14:29:02] your +10 from here? [14:29:12] oh, tampa :-/ [14:29:14] uhhh [14:29:16] utc is +5 [14:29:23] yeah that doesn'thelp [14:29:49] 2pm there is 11 am sf which is 9 pm my time [14:30:06] odds are good I won't be here by then [14:30:57] okay...to be clear...we are going to use ms-be10 for this...correct? [14:31:00] yes [14:31:09] should not need a reinstall, just dropping in new disks [14:32:05] I think that the filesystems will get set up on the new disks after the first puppet run' [14:32:51] are you able to shutdown boxes nicely (as opposed to power off from ipmi/drac)? 
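(note: the expectation above is that after the disk swap on ms-be10 the swift puppet manifests re-create the XFS filesystems and mounts on the blank drives. A rough post-puppet sanity check from the host might look like this; the device letters are a guess based on the /srv/swift-storage/sdn1 mount point that shows up later in the log, not something the log confirms.)

    # after the first puppet run following the swap
    blkid /dev/sd[c-n]1                          # every swift data partition should show TYPE="xfs"
    grep -c swift-storage /proc/mounts           # expect one mount per data disk (12 here)
    df -h | grep swift-storage
    dmesg | egrep -i 'sd[c-n]|xfs' | tail -n 30  # any I/O or mount errors from the new drives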
[14:34:28] apergos1: yes [14:35:48] ok so if after the disks go in and the first puppet run completes, if you see a bunch of failures for the mounts, just shut down the box nicely and I'll look at it tomorrow morning [14:37:18] I'll check in when I'm back tonight but I don't know when that will be or how tired I might be by then [14:37:59] okay [14:39:06] i know how the problem ones went during the boot...i will update the ticket [14:40:00] these aren't hot swap right? [14:41:04] not in their current config as individual disk [14:41:24] ok [14:41:41] at least that is what ben told me before [14:41:54] well on boot up I expect mount failures cause they won't have filesystems on em [14:42:02] but puppet should take care of that [14:42:35] i want to see if they mount...or do we have to skip through [14:43:31] well they can't mount when you first put them in there [14:43:42] you'll have to wait for the first puppet run after they are installed [14:44:42] yep...that is what i meant...started the story 1 chapter in [14:44:49] ok [15:13:06] PROBLEM - Puppet freshness on singer is CRITICAL: Puppet has not run in the last 10 hours [15:13:06] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [15:13:06] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours [15:13:06] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours [15:13:06] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours [15:13:06] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [15:22:29] !log updating authdns. adding 12 new misc servers mgmt ip's [15:22:39] Logged the message, Master [15:24:36] bleh, dns zonefiles are a mess. [15:24:47] too many symlinks for secondary domains into primary templates [15:25:03] (just messy with lots of extra entries not needed since they are all redirected by apache) [15:25:21] we need to put it into git! [15:25:57] * RobH isnt sure if it should be git controlled, or just full out puppet controlled, which seems like a lot of work. [15:27:54] git it first [15:27:57] puppet it later ? :-)) [15:28:12] there might be a third party puppet module to manage dns [15:28:15] i don't see why you'd wanna do dns zonefiles with puppet [15:28:47] As long as there's nothing private, could at least get them in git. [15:29:37] http://forge.puppetlabs.com/ajjahn/puppet_dns ;) [15:31:52] mark: ok, just git [15:32:03] which is where im leaning, but i guess it uses svn. [15:32:10] but would be nice to see in gerrit and track changes [15:32:17] since on svn we all just root it. [15:32:21] (i think) [15:33:45] i would love some accountability considering i often make changes, and soon it will be cleanup domain changes. [15:53:00] PROBLEM - Puppet freshness on ms-be10 is CRITICAL: Puppet has not run in the last 10 hours [16:34:02] * Greenlet no longer leaks memory after thread termination, as long as   terminated thread has no running greenlets left at the time. (release 0.4.0) but that's not socket leakage. still the fact that they now gc is probably a good thing [17:09:58] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [17:13:21] New patchset: Aaron Schulz; "Added support for timeline/math extensions." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24303 [17:14:18] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24303 [17:24:44] !log moving search traffic to pmtpa for eqiad search package upgrades, starting with en [17:24:54] Logged the message, notpeter [17:28:04] New patchset: Pyoungmeister; "lucene.php: moving enwiki search traffic to pmtpa" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/24306 [17:29:25] Change merged: Pyoungmeister; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/24306 [17:34:56] apergos1: do you keep an eye on your gerrit dashboard? [17:35:08] not very well [17:35:15] would be nice too :) [17:35:20] a long time (many days) can go by without me checking it [17:36:02] but you get the emails right? [17:37:12] New patchset: Pyoungmeister; "lucene.php: moving all search traffic to pmtpa" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/24307 [17:38:05] Change merged: Pyoungmeister; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/24307 [17:38:26] now that's a good question [17:41:00] PROBLEM - udp2log log age for oxygen on oxygen is CRITICAL: CRITICAL: log files /a/squid/zero-dialog-sri-lanka.log, have not been written in a critical amount of time. For most logs, this is 4 hours. For slow logs, this is 4 days. [17:42:45] apergos1: ct mentioned you've got some ideas on dealing with https://rt.wikimedia.org/Ticket/Display.html?id=3111 (getting someone on the mobile team besides tomasz access to upload to dumps) - would you mind updating the rt ticket so we can follow along asynchronously? [17:44:13] New patchset: Aaron Schulz; "Use re.sub instead of weird while loop." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24308 [17:44:17] an rsync job should be set up from someplace to dataset2 [17:44:41] apergos1: well, expect to be tagged for stuff :) [17:44:51] AaronSchulz: I'm getting that idea [17:44:53] :-D [17:45:07] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24308 [17:45:09] well maybe this will get me sloooowly into doing tiny bits of code review [17:45:22] gone? [17:45:28] grr [17:48:57] apergos1: how good is your python library fu? [17:49:21] spotty, good in some areas and crapola in others [17:49:25] the "for header in ['Content-Length', 'Last-Modified', 'Accept-Ranges']:" code should be changed to just passthrough all the headers [17:49:31] rewrite.pu [17:49:32] *py [17:52:42] New patchset: Andrew Bogott; "Update wiki instance-status pages automatically." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/23155 [17:53:35] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/23155 [18:01:06] PROBLEM - Puppet freshness on analytics1003 is CRITICAL: Puppet has not run in the last 10 hours [18:01:06] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours [18:04:17] New review: Ryan Lane; "A few inline comments." [operations/puppet] (production); V: 0 C: -1; - https://gerrit.wikimedia.org/r/23155 [18:07:37] !log ms-be10 powering down to replace 12 disk drives [18:07:47] Logged the message, Master [18:08:03] all 12? [18:09:12] AaronSchulz: yep! [18:10:15] PROBLEM - SSH on ms-be10 is CRITICAL: Connection refused [18:10:43] problems with the c2100's ...hoping this will fix it [18:13:27] cmjohnson1: the best way to test the old disks is tape them to your feet and pretend they are skates. [18:13:30] datacenter hockey! 
[18:13:35] ssd for puck. [18:14:04] that would be fun...or i could just use them as "pucks" [18:14:29] dual purpose. [18:14:39] lose the puck, remove part of skate, new puck. [18:27:03] PROBLEM - Host ms-be10 is DOWN: PING CRITICAL - Packet loss = 100% [18:51:17] apergos: good news to start...all the disk are showing in post [18:51:47] fsck from util-linux 2.20.1 [18:51:47] /dev/md0: clean, 65646/3668976 files, 717622/14647296 blocks [18:51:47] The disk drive for /srv/swift-storage/sdn1 is not ready yet or not present. [18:51:48] Continue to wait, or Press S to skip mounting or M for manual recovery [18:52:02] expected [18:52:05] skip mounting [18:52:27] k [18:53:00] RECOVERY - SSH on ms-be10 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [18:53:09] RECOVERY - Host ms-be10 is UP: PING OK - Packet loss = 0%, RTA = 0.45 ms [18:54:03] RECOVERY - swift-container-auditor on ms-be10 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [18:54:04] i wonder if the disk replacement is going to work. [18:54:12] RECOVERY - swift-object-updater on ms-be10 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [18:54:12] RECOVERY - swift-container-replicator on ms-be10 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [18:54:39] RECOVERY - swift-account-reaper on ms-be10 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [18:54:39] RECOVERY - swift-account-server on ms-be10 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [18:55:05] puppet running... [18:55:24] RECOVERY - Puppet freshness on ms-be10 is OK: puppet ran at Wed Sep 19 18:55:19 UTC 2012 [18:58:05] ok [19:02:19] * apergos watches the log (since I'm still here :-/ ) [19:03:06] where else would you rather be at 10pm in Athens? [19:03:15] grrrr [19:05:49] so that's a successful puppet run [19:05:54] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 196 seconds [19:05:56] yep [19:05:59] next step I guess is to see if it survives some reboots [19:06:20] ah all ttoshibas I see [19:06:35] wanna do a few reboots and see if all the disks show up mounted? [19:06:47] yes...let's do that [19:07:10] i will cycle it through ...you wanna console in and watch the post? [19:09:08] apergos: starting now [19:11:09] PROBLEM - Host ms-be10 is DOWN: PING CRITICAL - Packet loss = 100% [19:12:57] RECOVERY - Host ms-be10 is UP: PING OK - Packet loss = 0%, RTA = 1.37 ms [19:20:00] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [19:24:48] cmjohnson1: dare i ask? [19:25:18] i was holding off until I heard what apergos has something but he went offline [19:27:29] cmjohnson1: oh, i thought you were in the powercycling. [19:27:41] i just did...going to watch the console and see how it goes [19:27:45] coolness [19:27:50] i did earlier but i thought apergos was watching [19:27:58] it went through though...so that is a good sign [19:28:13] well, as long as it can do it another half dozen times ;] [19:28:35] if this works, they are going to have to give us all the bad drive data [19:28:56] cuz we have ashburn hosts that exhibited this behavior and resolved themselves, which I imagine means the disks are part of that batch. [19:29:05] well, resolved with reinstall, not themselves. [19:29:17] eh? 
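(note: the "disk drive ... not ready yet or not present" prompt is mountall waiting for the swift data filesystems and blocking boot until someone presses S. If the goal is only to stop slow-appearing data disks from holding up boot -- it does nothing for the underlying controller/disk delay -- the usual answer on Ubuntu of this era is a nobootwait/nofail mount option. A sketch of an fstab-style line; the label and the swift-style XFS options are assumptions, not necessarily what puppet writes on these hosts.)

    # /etc/fstab (precise's mountall honours nobootwait; nofail is the portable equivalent)
    LABEL=sdn1  /srv/swift-storage/sdn1  xfs  noatime,nodiratime,nobarrier,logbufs=8,nobootwait  0  0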
[19:29:33] so I don't know what messags people got from me or what I missed from being disconnected [19:29:46] we need to dd the whole disks before calling it a success though [19:29:48] but the two reboots had not the same list of disks claim to not be ready [19:30:11] and then if I waited a while they presented as ready and the os could mount the filesystems [19:30:29] I think that should be reported to dell, it's not normal behavior [19:31:23] (remember we already have a 90 second delay in grub) [19:32:39] New patchset: Andrew Bogott; "Update wiki instance-status pages automatically." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/23155 [19:32:55] apergos: i went through a reboot and while a few disk were slow to present themselves they did mount on their own in a few seconds [19:33:33] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/23155 [19:33:38] don't know if it is a success or not...I would love to see this setup in something other than a c2100 [19:33:40] yes, I saw that on the two reboots, not all the same disks each time. it's not normal behavior [19:34:59] i think that issue of delayed response on different disks would be a controller issue...not able to keep up with individual disk presenting themselves [19:35:39] * cmjohnson1 wonders if we should ask for 12 more disk and setup be6 since it has a new controller/mobo/backplane and updated firmware [19:38:14] so aggravating [19:39:40] apergos: can we cycle one of the be's that work 1-4? let's see if it does the same thing? [19:39:56] I don't want to do that [19:42:22] I don't want to risk it not coming back up right [19:42:47] i understand...but the disk may present the same way...this may be normal for this setup [19:49:32] they shouldn't [19:50:00] so I can try to reach Ben, he should be back in the country by now [19:50:15] hmm also maybe rob remembers [19:50:52] about? [19:50:55] remember what? [19:51:15] RobH: do you remember seeing messages during bootup of ms-be 1 through 4, any of them, [19:51:22] of the sort [19:51:24] for timeout on disks? [19:51:32] The disk drive for /srv/swift-storage/sdn1 is not ready yet or not present. [19:51:37] or a delay in them presenting [19:51:39] not really, but i only worked on ashburn ones in detail, and those disk issues disappeared on ms-be1003 and 1005 [19:51:46] Continue to wait, or Press S to skip mounting or M for manual recovery [19:51:48] ? [19:51:50] yea, i didnt see that. [19:51:56] i would recall that. [19:51:57] ok, I should ask Ben then [19:52:04] he set those up? [19:52:08] yes [19:52:11] yep [19:59:49] New patchset: Andrew Bogott; "Update wiki instance-status pages automatically." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/23155 [20:00:45] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/23155 [20:10:15] anyone have a working email address for ben? [20:11:18] ^ I don't but if anyone does should remember this is an open channel [20:11:26] uh huh, I expect them to pm [20:20:17] ah nm found an address [20:27:36] New patchset: IAlex; "Remove ".gz" URL ending check that disables output compression." 
[operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/24326 [20:43:06] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [20:43:06] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [20:56:34] !log restarting gmetad on nickel [20:56:44] Logged the message, Master [21:22:42] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 3 seconds [21:27:01] rats, he didn't log the status of ms-be10 and neither did I [21:28:45] !log ms-be10: after chris installed the disks, filesystems were created ok by puppet and mounted but on a couple of reboots, there were boot messages about two not identical lists of disk not present or not ready; waiting a little bit allowed them to come up but it's not great. [21:28:56] Logged the message, Master [21:29:43] I just logged the status of ms be10 ( cmjohnson1 ). I sent email earlier to ben in case he remembers anything odd when he was booting be1-4. [21:30:36] okay...i am curious to find out what Ben has to say [21:30:56] me too [21:31:00] sigh [21:31:03] so what is the c2100 status ? [21:31:12] apergos: would still like to see this on a different box [21:31:14] lemme find the summary page I have [21:31:24] http://wikitech.wikimedia.org/view/Swift/Server_issues_Aug-Sept_2012 [21:31:25] that [21:31:56] there's always ms-be8 :-P [21:32:34] maybe we could convince mark to let us play with the 720 he has in esams [21:32:55] lesliecarr: are you going to the DC there at all? [21:33:11] not planning on it this time [21:33:13] need something? [21:33:22] next week i can probably get a bike and go over [21:33:38] maybe...i will get back to you [21:33:42] (omfg the train systems here…. if it was an average of 10c warmer throughout the year i would move… still hoping for global warming) [21:34:31] apergos1: a mail to ben about what exactly? [21:34:40] there is something nicer than bart/muni? [21:35:09] ;-] [21:35:27] I asked before, noone replied then you went and mailed Ben nevertheless [21:35:57] haha, cmjohnson1 god yes [21:36:06] imagine bart, but it runs more often and more places, and no urine [21:36:23] paravoid: we are curious to know if msbe1-5 had any issues presenting disks during the boot process. [21:36:36] LeslieCarr: watch out for trains that split and merge! [21:36:38] what do you mean by that? [21:36:58] when we replaced the disk in be10 they all presented but some took longer than others (we got the skip/mount question) [21:37:17] okay [21:37:18] what's the last message you got from me? [21:37:19] and the disks that did that changed each time [21:37:31] ooo, didn't see split and merging trains [21:37:46] we have a bunch of eqiad boxes we can reboot as often as we can [21:37:51] er, want [21:37:59] and do all kinds of comparisons [21:38:20] I thin he wanted to compare witha a known apparently working box [21:38:30] ^ yes [21:38:37] the eqiad boxes work atm [21:38:57] they've never been under real traffic so we don't know much [21:39:04] have they been in production? [21:39:06] that's the difference [21:39:21] we didn't see a problem until we put the tampa boxes in production [21:40:10] what does that have to do with what happens on boot? [21:40:43] it has to do with whether they are known good boxes or not [21:40:59] right now theyare basically "unknown" [21:41:34] apergos: can we put 10 into the ring and see how it responds? 
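(note: "putting 10 into the ring" means adding ms-be10's devices back into the swift object ring and rebalancing so the box starts taking real traffic. In generic swift CLI terms it is roughly the sketch below; the site-specific wrapper scripts and the actual zone/IP/port/weight values aren't in the log, so those are placeholders.)

    swift-ring-builder object.builder add z4-10.0.6.NN:6000/sdc1 100   # repeat per device
    swift-ring-builder object.builder rebalance
    # then distribute the regenerated *.ring.gz files to every proxy and storage node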
[21:41:58] I will let paravoid make that call [21:42:21] there is also ms-be7 [21:42:25] i would rather compare known working vs new hdd in be10 [21:42:35] first [21:43:06] which right now shows no errors on or after boot [21:43:13] dd the disks first? [21:43:24] dd if=/dev/zero of=/dev/sdN bs=1M [21:43:39] write to the whole disk and see what happens [21:44:29] then read the whole disk [21:44:44] if you're going to zero them you'll want to stop puppet on there and unmount them (and stop the swift processes) [21:44:51] yes [21:45:12] alteratively you can write a file within the filesystem [21:45:30] but try writing to the whole disk at least once before putting them in prod [21:45:55] mkfs doesn't touch the whole disk, so we basically have no idea if it actually works now or not [21:46:22] same when you mkfs the ms-be6 disks, a simple mkfs doesn't say much about the state of the disk [21:47:30] no, it doesn't (but we don't relly believe that all these disks are bad, do we?) [21:47:37] anyways it's certainly fine to check them [21:48:59] I find it extreme but plausible [21:49:11] it might be a bad shipment or something like that [21:49:44] in any case, a full dd will show us problems wherever they are [21:49:46] dell thinks it is the WD disk...that they're not compatible for what we're doing with them. [21:50:06] without actually using it for production traffic [21:50:38] !log temp stopping puppet on brewster [21:50:48] Logged the message, notpeter [21:53:52] sorry to upgrade and have fe1 break like that, I watched it for awhile and other than the memory leak it seemed ok [21:54:06] (of course I didn't have a clue that there was a socket issue to look for) [21:55:42] New patchset: Jgreen; "single ganglia cluster for fundraising" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24383 [21:55:57] I'd probably would have done the same :) [21:56:38] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24383 [21:56:57] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24383 [22:01:06] PROBLEM - Puppet freshness on ms-be6 is CRITICAL: Puppet has not run in the last 10 hours [22:23:00] New patchset: Jgreen; "messing with ganglia config for fundraising eqiad" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24386 [22:23:54] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24386 [22:24:30] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24386 [22:32:00] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours [22:33:03] PROBLEM - Puppet freshness on aluminium is CRITICAL: Puppet has not run in the last 10 hours [22:44:47] notpeter: I am around :) [22:45:15] https://gerrit.wikimedia.org/r/#/c/22673/ (extracting l10n out of scap) should not cause any troube [22:45:30] but it is better to double check after it has been deployed [22:45:39] I am not sure if we can run scap whenever we want [22:48:31] !log pulling mw58 from lvs pool and deploy groups for temporary testing [22:48:41] Logged the message, Master [22:50:14] hashar: hrm, ok [22:50:27] if you can find that out, i will merge :) [22:51:31] Reedy: https://gerrit.wikimedia.org/r/#/c/22673/ is changing the `scap` script [22:51:43] I moved out the portion that updates the l10ncache to another script [22:52:00] so that might break scap :-) [22:52:32] It's trivially different.. 
so one would hope it doesn't ;) [22:53:55] notpeter: go ahead :-) [22:54:45] ok, cool [22:55:02] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22673 [22:55:11] |log scap is going to die [22:55:23] lulz [22:55:31] it's merged on sockpuppet now [22:56:21] I'm currently working on making scap die [23:02:35] New patchset: Hashar; "(bug 39701) beta: automatic MediaWiki update" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22116 [23:03:28] New review: Hashar; "Added a call to mw-update-l10n from Ifd130c24" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/22116 [23:03:29] New patchset: Hashar; "(bug 39701) beta: automatic MediaWiki update" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22116 [23:04:20] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/22116 [23:04:34] Change abandoned: Hashar; "per ops-l discussion, will have to wait for syslog architecture to be rewritten." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16661 [23:15:21] I also got one adding a default parameter to a role class https://gerrit.wikimedia.org/r/#/c/23770/ [23:15:36] I use that class on labs, and it does not let us pass parameters (yet) ;-) [23:15:51] any comment would help I guess :) [23:22:33] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 182 seconds [23:34:43] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 1 seconds [23:41:19] answer from ben! [23:41:23] Yeah, I did see those messages on some of the hosts (most commonly after other indications told me a host had bad disks, on reboot it would throw up that message). Sure enough, pressing 's' would cause it to continue to boot and the disk would be unavailable once it came up. [23:41:35] I don't remember specifically whether I saw the message on any of be1-4, but I did see it. [23:41:39] that's all we get [23:42:23] next we need to see if any of those had bad disks... from what he described, if they didn't then he probably didn't see the message [23:42:29] ( cmjohnson1 ) [23:42:36] geez [23:42:39] I am so not here any more [23:43:38] apergos1 [23:43:42] yes [23:43:49] just reading now [23:43:50] I'm here long enough for an answer but then [23:44:07] (almost 3am) got to try to sleep [23:44:44] that really is not very helpful [23:45:06] i think he is talking about ms-be6 prior to him leaving [23:45:32] well he says some of the hosts [23:45:35] anyways [23:45:59] apergos1: not going to be solved tonight. get some sleep [23:46:11] you're right about that [23:46:17] gone! [23:46:26] g'night [23:46:28] have a good rest of the day if there's any of it left [23:46:57] thx...tty soon
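(note: to close the loop on the dd suggestion from earlier in the evening (~21:43): a full write-then-read pass over each new drive on ms-be10, with puppet and the swift services stopped and the filesystem unmounted, is the cheapest way to shake out bad disks before pooling the box. A sketch; the device name is illustrative, the drives hold nothing worth keeping at this point, and puppet will have to re-create the filesystems afterwards.)

    umount /srv/swift-storage/sdn1                    # if puppet already mounted it
    dd if=/dev/zero of=/dev/sdn bs=1M oflag=direct    # write the whole disk
    dd if=/dev/sdn of=/dev/null bs=1M iflag=direct    # read it all back
    dmesg | tail -n 50                                # look for I/O errors, resets, link drops
    smartctl -a /dev/sdn | egrep -i 'realloc|pending|uncorrect'   # SMART counters after the exercise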