[00:35:05] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 27 seconds [00:41:14] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [00:41:14] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [00:49:20] paravoid: mc2 came up without networking after boot but ixgbe-dkms was installed [00:49:35] looks like the .ko is installed to /lib/modules/3.2.0-29-generic/updates/dkms/ixgbe.ko [00:49:53] while /lib/modules/3.2.0-29-generic/kernel/drivers/net/ethernet/intel/ixgbe/ixgbe.ko is the original [00:50:15] i insmod'd the dkms version and it works at least [00:58:00] hm [00:58:36] let me SSH for a moment [00:59:04] root@mc2:~# modinfo ixgbe [00:59:04] filename: /lib/modules/3.2.0-29-generic/updates/dkms/ixgbe.ko [00:59:08] that's strange [00:59:12] aaaaaaaaaargh, I know [00:59:28] the old one is in the initramfs [00:59:28] the initramfs probably has the module too [00:59:32] which doesnt get recreated [00:59:34] yeah, just verified that [00:59:36] yeah, I just thought of that [00:59:47] we need to do update-initramfs -k all -u in the postinst [00:59:51] after the dkms stuff [00:59:52] dammit [01:00:03] i just ran a update-initramfs and am going to see if the new one has the dkms version [01:00:10] oh it's even worse than that [01:00:17] we need to do it on every dkms run too [01:00:22] not just at package installation [01:01:11] ok, the new initramfs contains the dkms and orig versions [01:02:49] paravoid: dkms.conf has a REMAKE_INITRD setting which defaults to no [01:03:03] if it works, this might not be too bad [01:03:04] that's what I'm looking at right now [01:05:41] yep [01:05:43] works in my test [01:05:48] yay, that was simple enough [01:05:52] yay [01:06:31] new ixgbe-dkms package? [01:06:38] yes, about to deploy [01:07:53] ok, updated [01:08:00] (didn't change the version number, feeling lazy) [01:08:06] can you reimage mc2 just to be sure? [01:08:13] yup [01:08:43] [ 1679.503747] ixgbe 0000:04:00.0: eth0: detected SFP+: 5 [01:09:25] interesting... [01:10:01] Failed to fetch http://apt.wikimedia.org/wikimedia/pool/universe/i/ixgbe/ixgbe-dkms_3.6.7-k+wmf1_amd64.deb Size mismatch [01:10:25] (and that was after an apt-get update) [01:10:25] laziness has its tolls [01:10:44] might need to removedeb first [01:10:48] since its the same version [01:11:05] 487 reprepro remove precise-wikimedia ixgbe [01:11:05] 488 reprepro remove precise-wikimedia ixgbe-dkms [01:11:05] 489 reprepro -C universe include precise-wikimedia ixgbe_3.6.7-k+wmf1_amd64.changes [01:11:36] that's strange [01:11:51] old .changes file? [01:12:08] oh [01:12:15] http://apt.wikimedia.org/wikimedia/pool/universe/i/ixgbe/ixgbe-dkms_3.6.7-k+wmf1_amd64.deb … i think that's behind squid [01:12:25] argh [01:12:43] the machines talk to it through brewster's squid anyway [01:12:53] we set http_proxy for the internal ones to be able to reach apt.wikimedia [01:13:30] X-Cache: HIT from brewster.wikimedia.org [01:13:30] X-Cache-Lookup: HIT from brewster.wikimedia.org:8080 [01:13:32] of course [01:13:51] good catch [01:15:10] New patchset: Tim Starling; "Remove ScanSet" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/24252 [01:15:25] TimStarling: yay! [01:16:09] Change merged: Tim Starling; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/24252 [01:16:22] paravoid: ok, thats out of the way, and testing an install of the new deb before i reimage [01:16:30] did you purge it? 
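(note: the fix settled on above is DKMS's own REMAKE_INITRD knob plus a one-off initramfs rebuild, so the stock ixgbe.ko baked into the initrd stops shadowing the dkms-built module. A minimal sketch of what that looks like; the dkms.conf field values are illustrative, the actual ixgbe-dkms packaging isn't shown in the log.)

    # /usr/src/ixgbe-3.6.7/dkms.conf (sketch)
    PACKAGE_NAME="ixgbe"
    PACKAGE_VERSION="3.6.7"
    BUILT_MODULE_NAME[0]="ixgbe"
    DEST_MODULE_LOCATION[0]="/updates/dkms"   # matches where modinfo found the module above
    AUTOINSTALL="yes"
    REMAKE_INITRD="yes"                       # rebuild the initramfs after every dkms build/install

    # one-off repair on hosts that already have the stale in-tree module in their initrd:
    update-initramfs -u -k all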
[01:16:54] yep, manually purged that file from squid [01:17:01] Making new initrd.img-3.2.0-29-generic [01:17:02] (If next boot fails, revert to initrd.img-3.2.0-29-generic.old-dkms image) [01:17:04] that sounds good [01:17:23] I tried it too and I got a 404 [01:17:37] then I retried fetching it and got a HIT [01:17:54] now that I retry everything seems fine, so I think what happened was a race between me and you :-) [01:18:19] you purged (200), then I purged (404) then you fetched (MISS) then I fetched (HIT) [01:18:24] and a new initramfs build when uninstalling [01:18:25] heh [01:18:26] heh [01:19:50] * paravoid crosses fingers [01:20:40] this is the first time I use dkms btw (apt-get install virtualbox-ose-dkms doesn't count) [01:20:40] its booting into pxe now [01:20:58] i'm going to go afk for a bit, i'll let you know :) and thanks, as always [01:21:31] after it boots we need to try to install a new kernel (e.g. an older one) to see what happens [01:22:01] better do it now than at some random point in the future and have it potentially fail [01:25:46] New patchset: Tim Starling; "Don't customise $wgCortadoJarFile" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/24253 [01:26:30] Change merged: Tim Starling; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/24253 [01:26:36] yay² [01:30:54] New patchset: Faidon; "swift: add support for container sync" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24254 [01:31:50] New patchset: Faidon; "swift: allow container sync between labs/labsupgrade" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24255 [01:32:41] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24254 [01:32:41] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24254 [01:32:41] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24255 [01:33:15] New review: Faidon; "Already deployed, tested with puppetmaster::self." 
[operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/24255 [01:33:16] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24255 [01:35:12] binasher: btw, swift still leaking memory with precise :( [01:35:20] http://ganglia.wikimedia.org/latest/graph_all_periods.php?h=ms-fe1.pmtpa.wmnet&m=cpu_report&r=hour&s=descending&hc=4&mc=2&st=1348018471&g=mem_report&z=large&c=Swift%20pmtpa [01:35:26] sigh [01:40:49] PROBLEM - MySQL Slave Delay on storage3 is CRITICAL: CRIT replication delay 253 seconds [01:41:16] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 280 seconds [01:44:16] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 28 seconds [01:45:20] RECOVERY - MySQL Slave Delay on storage3 is OK: OK replication delay 1 seconds [01:59:25] PROBLEM - Puppet freshness on ms-be6 is CRITICAL: Puppet has not run in the last 10 hours [02:12:20] PROBLEM - swift-container-server on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [02:31:22] PROBLEM - Puppet freshness on aluminium is CRITICAL: Puppet has not run in the last 10 hours [03:05:16] RECOVERY - swift-container-server on ms-be3 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [04:19:05] why is the wiki sooooo slooooooow [04:31:22] PROBLEM - Apache HTTP on srv224 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:32:25] PROBLEM - Apache HTTP on srv222 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:32:52] PROBLEM - LVS HTTP IPv4 on rendering.svc.pmtpa.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:32:52] PROBLEM - Apache HTTP on srv219 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:32:53] Reedy: ^ [04:34:40] PROBLEM - Apache HTTP on srv223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:34:40] PROBLEM - Apache HTTP on srv220 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:34:58] PROBLEM - Apache HTTP on srv221 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:35:42] 3 apaches wouldn't cause an issue [04:35:44] 6 is still meg [04:35:46] *meh [04:35:52] what about the LVS? [04:36:28] i'm not sure what rendering is [04:36:42] it's the imagescalers [04:36:54] I presumed that might be the case [04:37:02] a lot of the apaches look loaded based on ganglia [04:37:36] not much going on in dberror [04:38:23] does http://status.wikimedia.org/8777/156488/Ubuntu-mirror have anything to do with it? [04:38:29] no [04:38:41] Those apaches aren't in the app server or api app server pool... [04:38:54] this may be due to rendering.... [04:38:56] paravoid: ^^ [04:39:14] yup, all scalers [04:39:29] hm. now how do I get in touch with faidon [04:39:34] http://ganglia.wikimedia.org/latest/?c=Application%20servers%20pmtpa&h=srv271.pmtpa.wmnet&m=load_one&r=hour&s=by%20name&hc=4&mc=2 [04:39:39] wonder what that spike was [04:40:16] damn it. I have no clue how to contact faidon [04:40:17] scap? [04:40:24] someone scapped? [04:40:37] I'm running it now [04:40:41] why? [04:40:42] Ryan_Lane: not got his phone number? [04:40:51] nope [04:40:53] I've got it in my phone.. 
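(note: srv219-224 are the image scalers behind rendering.svc.pmtpa.wmnet, so this is the whole rendering pool timing out rather than a few stray apaches. A quick check from a bastion would look roughly like the sketch below, using only generic tools; the .pmtpa.wmnet suffix is assumed from the LVS name in the alerts.)

    for h in srv219 srv220 srv221 srv222 srv223 srv224; do
      printf '%-8s ' "$h"
      curl -s -o /dev/null -m 10 -w '%{http_code}\n' "http://$h.pmtpa.wmnet/"
    done
    # and the LVS service address itself:
    curl -s -o /dev/null -m 10 -w '%{http_code}\n' http://rendering.svc.pmtpa.wmnet/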
[04:40:55] Rebuild localisation cache for page triage [04:41:09] if only we had some better deployment system [04:42:24] the swift log was just completely idle for a minute but now spewing again [04:42:53] Sep 19 04:41:31 10.0.6.202 object-replicator @ERROR: max connections (2) reached -- try again later [04:42:57] there's a bunch of swift services showing red [04:43:27] Sep 19 04:41:31 10.0.6.208 object-replicator @ERROR: max connections (2) reached -- try again later [04:43:37] looks like thats being repeated for at least a bunch of the ms-be hosts [04:43:44] binasher: that logs tends to come in bursts [04:43:46] yep [04:43:48] I noticed that weeks ago [04:44:07] Aaron|home: good 'ol buffering, ok [04:44:19] Sep 19 04:41:31 10.0.6.215 proxy-server STDOUT: ERROR:root:Timeout talking to memcached: ms-fe1.pmtpa.wmnet:11211 (txn: txb31160c2e48d40598129e40174fc6aaf) (client_ip: 201.185.71.29) [04:44:44] faidon is coming on [04:45:47] attempts to ssh into ms-fe1 aren't working going anywhere [04:45:56] well, I surely hope it didn't die [04:46:00] well, it's a frontend, right? [04:46:08] so maybe that won't be as much of a problem [04:46:38] the memcached timeouts to it could be causing problems [04:46:43] yep [04:46:48] it's faster again [04:46:57] so Reedy's job /was/ the culprit after all [04:47:04] that [04:47:16] that's not the only problem, though [04:47:18] didn't this happen before? [04:47:34] some link getting saturated or something... [04:47:43] no clue. I'm not terribly familiar with swift yet [04:48:02] I mean due to scap [04:48:17] though the rsync deltas should have been smallish right? [04:48:19] looks like image scalers are down [04:48:28] http://commons.wikimedia.org/wiki/Special:NewFiles [04:48:35] no new thumbnails are showing [04:48:57] woosters_: yes. it's known [04:49:01] [54690.553782] TCP: too many of orphaned sockets [04:49:02] [54708.594438] Out of socket memory [04:49:15] it's still rather slow [04:49:17] ms-fe1.. paravoid upgraded it to precise earlier today [04:49:29] but better [04:50:07] Reedy: btw, someone really needs to fix those metadata exceptions :) [04:50:08] much closer to normal [04:50:11] hi [04:50:19] Jasper_Deng: please bring commentary to -tech [04:50:19] what's going on? [04:50:20] not here [04:50:32] Aaron|home: so do it then? :p [04:50:39] paravoid: ms-fe1 is having issues [04:50:46] just fe1? [04:51:30] well, I don't know if the other ms systems are supposed to be down or not in nagios [04:51:31] (fwiw, my number is in the office wiki) [04:51:37] there are possibly backend issues with swift [04:51:42] as well [04:51:45] paravoid: that's where I got it ;) [04:53:34] ariel upgraded ms-fe1 to precise yesterday our night [04:54:09] and I pushed a config file change at ~18:00 [04:54:55] PROBLEM - Swift HTTP on ms-fe1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:55:17] Aaron|home: did (I|we|you|someone) log a bug for it? [04:55:21] okay, that's down but pybal should depool it [04:55:51] lots of swift noise in the apache logs (unsuprising) [04:56:01] huh, ms-fe1 201/204 hits are 0 :( [04:56:14] Reedy: I thought there was one, and you closed it [04:56:16] RECOVERY - Swift HTTP on ms-fe1 is OK: HTTP OK HTTP/1.1 200 OK - 397 bytes in 0.009 seconds [04:56:40] i increased tcp_max_orphans, tcp_mem, and tcp_rmem [04:57:03] from what to what? [04:57:05] the tcp out of socket memory issues were fucking with memcached [04:57:21] and ironically all of my GET attempts work [04:57:34] Aaron|home: did I? 
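(note: the emergency change binasher describes here -- raising tcp_max_orphans, tcp_mem and tcp_rmem on ms-fe1 -- would look roughly like the sketch below. Only the max_orphans number (262144 raised to 2621440) is given later in the log; the rmem/mem triplets are placeholders, and tcp_mem is counted in pages rather than bytes.)

    # live, non-persistent change on ms-fe1
    sysctl -w net.ipv4.tcp_max_orphans=2621440
    sysctl -w net.ipv4.tcp_rmem='4096 87380 67108864'       # min default max -- placeholder values
    sysctl -w net.ipv4.tcp_mem='1540000 2050000 30800000'   # low pressure high, in pages -- placeholders

    # verify, and see how much socket memory is actually in use
    sysctl net.ipv4.tcp_max_orphans net.ipv4.tcp_rmem net.ipv4.tcp_mem
    cat /proc/net/sockstat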
[04:57:48] and i can ssh to it now [04:57:55] RECOVERY - Apache HTTP on srv222 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.029 second response time [04:58:05] binasher: and the graphs are on an upwards trend [04:58:08] the req ones [04:58:31] RECOVERY - Apache HTTP on srv219 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.041 second response time [04:58:35] binasher: you increased them on which box(es)? [04:58:40] RECOVERY - Apache HTTP on srv224 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.039 second response time [04:58:40] RECOVERY - LVS HTTP IPv4 on rendering.svc.pmtpa.wmnet is OK: HTTP OK HTTP/1.1 200 OK - 62244 bytes in 0.169 seconds [04:58:40] RECOVERY - Apache HTTP on srv223 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.052 second response time [04:58:40] RECOVERY - Apache HTTP on srv220 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.039 second response time [04:58:41] http://commons.wikimedia.org/wiki/Special:NewFiles is less broken [04:58:58] RECOVERY - Apache HTTP on srv221 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.828 second response time [05:00:36] paravoid: only ms-fe1 via the console.. compared to 2 and 3, it was the only one steadily spewing errors like [05:00:37] [55094.028931] TCP: too many of orphaned sockets [05:00:37] [55096.344298] Out of socket memory [05:01:13] none of that from any of the lucid fe's [05:01:17] okay [05:01:23] thanks :) [05:03:10] Sep 19 04:55:57 10.0.6.211 proxy-server STDOUT: ERROR:root:Timeout talking to memcached: ms-fe1.pmtpa.wmnet:11211 (txn: txeed8d608f769454883ba20e98f90dfd8) [05:03:13] that was the last of those [05:03:15] have you kept the values before? [05:03:28] Reedy: https://bugzilla.wikimedia.org/show_bug.cgi?id=40037 [05:03:36] ahh, brion closed it, then it was reopened [05:03:43] paravoid: the prior values were as set by out sysctl.d confs [05:03:58] that's what I'm looking, I'm wondering if there's a bug there [05:04:08] and i just increased with stupid "make them bigger now now now" values [05:04:10] e.g. sysctl not being applied for whatever buggy reason [05:05:09] i don't think so [05:05:28] the tcp_rmem and tcp_max_orphans both matched the configs [05:05:43] okay, good to know [05:05:48] and worrying at the same time [05:06:37] i increased tcp_max_orphans from 262144 to 2621440 which is probably stupid, and just added a 0 to the last fields of mem/rmem as well [05:06:59] heh I just saw that [05:07:50] so is the swift memory leak dangling sockets? [05:08:38] 60-swift-performance.conf.sysctl… performance! heh [05:08:48] there was a bug fixed 1.4.4 along those lines (fixing "socket hoarding") [05:09:31] i'm skeptical of it setting both tw_recycle and tw_reuse to 1, not that i think its relevant [05:09:49] 1.7 "Fixed a bug where an error may have caused the proxy to stop returning data to a client" [05:10:10] heh, seems like some of the errors I see in the logs about missing bytes [05:10:20] so, the max_orphans is a noop, the current orphan count is 0 [05:10:51] binasher: I think the docs recommend that [05:10:55] eh, it did log "TCP: too many of orphaned sockets" [05:11:44] oh it did? 
that's strange [05:12:28] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [05:12:28] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours [05:12:28] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours [05:12:28] PROBLEM - Puppet freshness on singer is CRITICAL: Puppet has not run in the last 10 hours [05:12:28] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours [05:12:29] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [05:13:39] TCP: inuse 1035 orphan 0 tw 1 alloc 21622 mem 791526 [05:13:43] erm [05:13:45] that's pages [05:14:05] 3.2G? [05:18:46] before i made the proc changes, sockstat was: [05:18:47] sockets: used 21933 [05:18:47] TCP: inuse 1093 orphan 0 tw 1 alloc 21761 mem 799052 [05:19:53] oh useful [05:20:02] so orphan was 0 before [05:20:10] orphan 0 at the time, weird that it logged too many literally a minute before [05:20:16] see dmesg [05:20:23] I believe you :) [05:20:40] I'll dig in the kernel code, see when that message is triggered [05:20:42] just in case [05:25:02] i set tcp_rmem and tcp_mem back to prior values [05:29:35] grr freaking wifi and mifi [05:30:25] paravoid: tcp_too_many_orphans() returns true if orphans > sysctl_tcp_max_orphans [05:30:26] OR [05:30:33] 287 if (sk->sk_wmem_queued > SOCK_MIN_SNDBUF && [05:30:34] 288 atomic_long_read(&tcp_memory_allocated) > sysctl_tcp_mem[2]) [05:30:35] 289 return true; [05:31:21] that's stupid [05:31:31] yes [05:31:40] I'm on my kernel tree too [05:32:03] but you beat me to it, friggin wifi and line wrapping [05:32:45] (grepping for "too many orphaned sockets" has no output, since " sockets" is on a separate line) [05:33:03] i wonder if the logging is improved in 3.4 or 5 [05:33:09] that was in tcp.h [05:33:16] I have a full git checkout, let me see [05:34:06] but anyways, just socket memory related.. i wonder what's different in precise [05:34:39] if mem is pages then the allocated memory is crazy [05:34:49] and I think it's pages [05:34:55] 3.2G for socket memory, that'd be fun [05:35:42] commit efcdbf24fd5daa88060869e51ed49f68b7ac8708 [05:35:52] net: Disambiguate kernel message [05:35:52] [05:35:52] Some of our machines were reporting: [05:35:52] [05:35:52] TCP: too many of orphaned sockets [05:35:54] [05:35:57] even when the number of orphaned sockets was well below the [05:35:59] limit. [05:36:11] v3.3-rc4~34^2~57 [05:36:38] good to know [05:37:36] tcp_mem = pages [05:37:56] its grown to 800738 [05:38:37] I'm wondering about the sockstat output [05:38:53] 800738 pages you think? [05:39:00] yah, i think mem in there is pages as well [05:39:03] that's ~3.2GB [05:39:11] root@ms-fe2:~# cat /proc/net/sockstat [05:39:12] sockets: used 1111 [05:39:12] TCP: inuse 906 orphan 0 tw 0 alloc 916 mem 23317 [05:39:23] yeah [05:39:55] a lot more reasonable [05:41:38] but sockets: used 22320 vs. sockets: used 1111 [05:42:10] oh hah [05:42:23] look at a swift process' fds [05:42:42] it's like 1685 [05:42:47] vs. 63 in ms-fe2 [05:43:04] 1981 vs. 
59 [05:43:51] 1034/fd 2182 [05:43:51] 1037/fd 1982 [05:43:52] 1035/fd 1980 [05:43:53] yeah [05:44:40] eventlet maybe [05:44:43] lsof shows most as "can't identify protocol" [05:44:54] yeah they don't appear in netstat either [05:45:39] (or ss, if you want to be trendy) [05:46:16] I think it's a socket leak [05:46:21] not being close()d [05:49:13] it's something like that [05:50:25] I'm pretty sure it's that or very similar to that [05:52:22] PROBLEM - Puppet freshness on ms-be10 is CRITICAL: Puppet has not run in the last 10 hours [05:59:32] binasher: I'm thinking to depool ms-fe1 and call it a day [05:59:37] unless you're still playing with it [06:01:55] !log depooling ms-fe1; broken after precise upgrade; pending further investigation [06:02:05] Logged the message, Master [06:02:09] go for it, i was trying to observe exactly what happens when it loses a socket under a couch cushion but, meh [06:03:00] i wonder if it should be taken out of the memcached list on the others.. maybe it won't matter of swift-proxy stops getting requests [06:03:23] yeah I think we'll be okay [06:06:51] i'm signing off for the night, seeya tmw [06:47:54] PROBLEM - Squid on brewster is CRITICAL: Connection refused [06:57:12] RECOVERY - Squid on brewster is OK: TCP OK - 0.000 second response time on port 8080 [07:09:03] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [08:00:03] PROBLEM - Puppet freshness on analytics1003 is CRITICAL: Puppet has not run in the last 10 hours [08:00:03] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours [09:06:21] well that was a fail (the ms-fe1 upgrade) [09:06:25] good I only did one [09:19:06] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [10:42:03] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [10:42:03] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [10:58:03] New review: ArielGlenn; "OK, if they're all eventually going to be short forms then there's no point in trying to keep this c..." [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/23309 [10:58:03] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/23309 [12:00:03] PROBLEM - Puppet freshness on ms-be6 is CRITICAL: Puppet has not run in the last 10 hours [12:00:17] New patchset: Hashar; "(bug 40163) Try to fix ltwiki import source for betawikiversity" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/23419 [12:01:11] New review: Hashar; "lets try it!" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/23419 [12:01:11] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/23419 [12:11:56] New patchset: Hashar; "(bug 39206) Namespaces configuration for se.wikipedia" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/23222 [12:12:30] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/23222 [12:17:08] New patchset: Hashar; "(bug 38840) Namespaces configuration on uz.wikipedia" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/23102 [12:17:24] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/23102 [12:23:38] New patchset: Hashar; "(bug 39866) Anexo: is a content namespace on es.wikipedia." 
[operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/23068 [12:23:52] New review: Hashar; "Looks like there is consensus for this." [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/23068 [12:23:52] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/23068 [12:27:33] New patchset: Hashar; "(bug 39264) Add Tudalen: and Indecs: namespaces to cy.wikipedia" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/23094 [12:27:40] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/23094 [12:31:06] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours [12:32:09] PROBLEM - Puppet freshness on aluminium is CRITICAL: Puppet has not run in the last 10 hours [12:43:16] New review: Hashar; "We definitely need tests coverage in mediawiki-config :-)" [operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/23935 [13:31:18] hey guys, what is the URL that installs the gerrit post-commit hook? I can't find it on https://labsconsole.wikimedia.org/wiki/Git [13:31:22] !log Deactivated subnet for vlan 104 (Tampa) [13:31:32] Logged the message, Master [13:35:46] https://gerrit.wikimedia.org/r/tools/hooks/commit-msg in theory [13:35:56] ty apergos [13:36:26] yw [13:43:29] mark: got some time to fix a parameterized class please? https://gerrit.wikimedia.org/r/#/c/23770/ Tim made the role::mediawiki::logger to require a log path but we can't pass parameters in labsconsole :-) [13:43:39] so I have changed the class to have a default value [13:45:58] what would be the reason for not having one wikipedia attached to sul? [13:45:59] apergos, drdee: for future reference https://gerrit.wikimedia.org/r/Documentation/cmd-index.html#_client [13:50:28] I looked at this: http://www.mediawiki.org/w/index.php?title=Git/Workflow&oldid=503559 t find it :-P [13:59:09] PROBLEM - Auth DNS on ns0.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [14:06:39] RECOVERY - Auth DNS on ns0.wikimedia.org is OK: DNS OK: 0.034 seconds response time. www.wikipedia.org returns 208.80.154.225 [14:12:03] PROBLEM - Puppet freshness on virt0 is CRITICAL: Puppet has not run in the last 10 hours [14:27:58] apergos1: our disks on a fedex truck roaming around the city now...I should have them by 2p est [14:28:24] which is 4 pm here [14:28:28] no [14:28:33] noon here [14:28:46] er [14:28:47] no [14:28:49] midnight [14:29:02] your +10 from here? [14:29:12] oh, tampa :-/ [14:29:14] uhhh [14:29:16] utc is +5 [14:29:23] yeah that doesn'thelp [14:29:49] 2pm there is 11 am sf which is 9 pm my time [14:30:06] odds are good I won't be here by then [14:30:57] okay...to be clear...we are going to use ms-be10 for this...correct? [14:31:00] yes [14:31:09] should not need a reinstall, just dropping in new disks [14:32:05] I think that the filesystems will get set up on the new disks after the first puppet run' [14:32:51] are you able to shutdown boxes nicely (as opposed to power off from ipmi/drac)? 
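(note: the expectation above is that after the disk swap on ms-be10 the swift puppet manifests re-create the XFS filesystems and mounts on the blank drives. A rough post-puppet sanity check from the host might look like this; the device letters are a guess based on the /srv/swift-storage/sdn1 mount point that shows up later in the log, not something the log confirms.)

    # after the first puppet run following the swap
    blkid /dev/sd[c-n]1                          # every swift data partition should show TYPE="xfs"
    grep -c swift-storage /proc/mounts           # expect one mount per data disk (12 here)
    df -h | grep swift-storage
    dmesg | egrep -i 'sd[c-n]|xfs' | tail -n 30  # any I/O or mount errors from the new drives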
[14:34:28] apergos1: yes [14:35:48] ok so if after the disks go in and the first puppet run completes, if you see a bunch of failures for the mounts, just shut down the box nicely and I'll look at it tomorrow morning [14:37:18] I'll check in when I'm back tonight but I don't know when that will be or how tired I might be by then [14:37:59] okay [14:39:06] i know how the problem ones went during the boot...i will update the ticket [14:40:00] these aren't hot swap right? [14:41:04] not in their current config as individual disk [14:41:24] ok [14:41:41] at least that is what ben told me before [14:41:54] well on boot up I expect mount failures cause they won't have filesystems on em [14:42:02] but puppet should take care of that [14:42:35] i want to see if they mount...or do we have to skip through [14:43:31] well they can't mount when you first put them in there [14:43:42] you'll have to wait for the first puppet run after they are installed [14:44:42] yep...that is what i meant...started the story 1 chapter in [14:44:49] ok [15:13:06] PROBLEM - Puppet freshness on singer is CRITICAL: Puppet has not run in the last 10 hours [15:13:06] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [15:13:06] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours [15:13:06] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours [15:13:06] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours [15:13:06] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [15:22:29] !log updating authdns. adding 12 new misc servers mgmt ip's [15:22:39] Logged the message, Master [15:24:36] bleh, dns zonefiles are a mess. [15:24:47] too many symlinks for secondary domains into primary templates [15:25:03] (just messy with lots of extra entries not needed since they are all redirected by apache) [15:25:21] we need to put it into git! [15:25:57] * RobH isnt sure if it should be git controlled, or just full out puppet controlled, which seems like a lot of work. [15:27:54] git it first [15:27:57] puppet it later ? :-)) [15:28:12] there might be a third party puppet module to manage dns [15:28:15] i don't see why you'd wanna do dns zonefiles with puppet [15:28:47] As long as there's nothing private, could at least get them in git. [15:29:37] http://forge.puppetlabs.com/ajjahn/puppet_dns ;) [15:31:52] mark: ok, just git [15:32:03] which is where im leaning, but i guess it uses svn. [15:32:10] but would be nice to see in gerrit and track changes [15:32:17] since on svn we all just root it. [15:32:21] (i think) [15:33:45] i would love some accountability considering i often make changes, and soon it will be cleanup domain changes. [15:53:00] PROBLEM - Puppet freshness on ms-be10 is CRITICAL: Puppet has not run in the last 10 hours [16:34:02] * Greenlet no longer leaks memory after thread termination, as long as   terminated thread has no running greenlets left at the time. (release 0.4.0) but that's not socket leakage. still the fact that they now gc is probably a good thing [17:09:58] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [17:13:21] New patchset: Aaron Schulz; "Added support for timeline/math extensions." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24303 [17:14:18] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24303 [17:24:44] !log moving search traffic to pmtpa for eqiad search package upgrades, starting with en [17:24:54] Logged the message, notpeter [17:28:04] New patchset: Pyoungmeister; "lucene.php: moving enwiki search traffic to pmtpa" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/24306 [17:29:25] Change merged: Pyoungmeister; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/24306 [17:34:56] apergos1: do you keep an eye on your gerrit dashboard? [17:35:08] not very well [17:35:15] would be nice too :) [17:35:20] a long time (many days) can go by without me checking it [17:36:02] but you get the emails right? [17:37:12] New patchset: Pyoungmeister; "lucene.php: moving all search traffic to pmtpa" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/24307 [17:38:05] Change merged: Pyoungmeister; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/24307 [17:38:26] now that's a good question [17:41:00] PROBLEM - udp2log log age for oxygen on oxygen is CRITICAL: CRITICAL: log files /a/squid/zero-dialog-sri-lanka.log, have not been written in a critical amount of time. For most logs, this is 4 hours. For slow logs, this is 4 days. [17:42:45] apergos1: ct mentioned you've got some ideas on dealing with https://rt.wikimedia.org/Ticket/Display.html?id=3111 (getting someone on the mobile team besides tomasz access to upload to dumps) - would you mind updating the rt ticket so we can follow along asynchronously? [17:44:13] New patchset: Aaron Schulz; "Use re.sub instead of weird while loop." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24308 [17:44:17] an rsync job should be set up from someplace to dataset2 [17:44:41] apergos1: well, expect to be tagged for stuff :) [17:44:51] AaronSchulz: I'm getting that idea [17:44:53] :-D [17:45:07] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24308 [17:45:09] well maybe this will get me sloooowly into doing tiny bits of code review [17:45:22] gone? [17:45:28] grr [17:48:57] apergos1: how good is your python library fu? [17:49:21] spotty, good in some areas and crapola in others [17:49:25] the "for header in ['Content-Length', 'Last-Modified', 'Accept-Ranges']:" code should be changed to just passthrough all the headers [17:49:31] rewrite.pu [17:49:32] *py [17:52:42] New patchset: Andrew Bogott; "Update wiki instance-status pages automatically." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/23155 [17:53:35] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/23155 [18:01:06] PROBLEM - Puppet freshness on analytics1003 is CRITICAL: Puppet has not run in the last 10 hours [18:01:06] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours [18:04:17] New review: Ryan Lane; "A few inline comments." [operations/puppet] (production); V: 0 C: -1; - https://gerrit.wikimedia.org/r/23155 [18:07:37] !log ms-be10 powering down to replace 12 disk drives [18:07:47] Logged the message, Master [18:08:03] all 12? [18:09:12] AaronSchulz: yep! [18:10:15] PROBLEM - SSH on ms-be10 is CRITICAL: Connection refused [18:10:43] problems with the c2100's ...hoping this will fix it [18:13:27] cmjohnson1: the best way to test the old disks is tape them to your feet and pretend they are skates. [18:13:30] datacenter hockey! 
[18:13:35] ssd for puck. [18:14:04] that would be fun...or i could just use them as "pucks" [18:14:29] dual purpose. [18:14:39] lose the puck, remove part of skate, new puck. [18:27:03] PROBLEM - Host ms-be10 is DOWN: PING CRITICAL - Packet loss = 100% [18:51:17] apergos: good news to start...all the disk are showing in post [18:51:47] fsck from util-linux 2.20.1 [18:51:47] /dev/md0: clean, 65646/3668976 files, 717622/14647296 blocks [18:51:47] The disk drive for /srv/swift-storage/sdn1 is not ready yet or not present. [18:51:48] Continue to wait, or Press S to skip mounting or M for manual recovery [18:52:02] expected [18:52:05] skip mounting [18:52:27] k [18:53:00] RECOVERY - SSH on ms-be10 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [18:53:09] RECOVERY - Host ms-be10 is UP: PING OK - Packet loss = 0%, RTA = 0.45 ms [18:54:03] RECOVERY - swift-container-auditor on ms-be10 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [18:54:04] i wonder if the disk replacement is going to work. [18:54:12] RECOVERY - swift-object-updater on ms-be10 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [18:54:12] RECOVERY - swift-container-replicator on ms-be10 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [18:54:39] RECOVERY - swift-account-reaper on ms-be10 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [18:54:39] RECOVERY - swift-account-server on ms-be10 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [18:55:05] puppet running... [18:55:24] RECOVERY - Puppet freshness on ms-be10 is OK: puppet ran at Wed Sep 19 18:55:19 UTC 2012 [18:58:05] ok [19:02:19] * apergos watches the log (since I'm still here :-/ ) [19:03:06] where else would you rather be at 10pm in Athens? [19:03:15] grrrr [19:05:49] so that's a successful puppet run [19:05:54] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 196 seconds [19:05:56] yep [19:05:59] next step I guess is to see if it survives some reboots [19:06:20] ah all ttoshibas I see [19:06:35] wanna do a few reboots and see if all the disks show up mounted? [19:06:47] yes...let's do that [19:07:10] i will cycle it through ...you wanna console in and watch the post? [19:09:08] apergos: starting now [19:11:09] PROBLEM - Host ms-be10 is DOWN: PING CRITICAL - Packet loss = 100% [19:12:57] RECOVERY - Host ms-be10 is UP: PING OK - Packet loss = 0%, RTA = 1.37 ms [19:20:00] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [19:24:48] cmjohnson1: dare i ask? [19:25:18] i was holding off until I heard what apergos has something but he went offline [19:27:29] cmjohnson1: oh, i thought you were in the powercycling. [19:27:41] i just did...going to watch the console and see how it goes [19:27:45] coolness [19:27:50] i did earlier but i thought apergos was watching [19:27:58] it went through though...so that is a good sign [19:28:13] well, as long as it can do it another half dozen times ;] [19:28:35] if this works, they are going to have to give us all the bad drive data [19:28:56] cuz we have ashburn hosts that exhibited this behavior and resolved themselves, which I imagine means the disks are part of that batch. [19:29:05] well, resolved with reinstall, not themselves. [19:29:17] eh? 
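(note: the "disk drive ... not ready yet or not present" prompt is mountall waiting for the swift data filesystems and blocking boot until someone presses S. If the goal is only to stop slow-appearing data disks from holding up boot -- it does nothing for the underlying controller/disk delay -- the usual answer on Ubuntu of this era is a nobootwait/nofail mount option. A sketch of an fstab-style line; the label and the swift-style XFS options are assumptions, not necessarily what puppet writes on these hosts.)

    # /etc/fstab (precise's mountall honours nobootwait; nofail is the portable equivalent)
    LABEL=sdn1  /srv/swift-storage/sdn1  xfs  noatime,nodiratime,nobarrier,logbufs=8,nobootwait  0  0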
[19:29:33] so I don't know what messags people got from me or what I missed from being disconnected [19:29:46] we need to dd the whole disks before calling it a success though [19:29:48] but the two reboots had not the same list of disks claim to not be ready [19:30:11] and then if I waited a while they presented as ready and the os could mount the filesystems [19:30:29] I think that should be reported to dell, it's not normal behavior [19:31:23] (remember we already have a 90 second delay in grub) [19:32:39] New patchset: Andrew Bogott; "Update wiki instance-status pages automatically." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/23155 [19:32:55] apergos: i went through a reboot and while a few disk were slow to present themselves they did mount on their own in a few seconds [19:33:33] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/23155 [19:33:38] don't know if it is a success or not...I would love to see this setup in something other than a c2100 [19:33:40] yes, I saw that on the two reboots, not all the same disks each time. it's not normal behavior [19:34:59] i think that issue of delayed response on different disks would be a controller issue...not able to keep up with individual disk presenting themselves [19:35:39] * cmjohnson1 wonders if we should ask for 12 more disk and setup be6 since it has a new controller/mobo/backplane and updated firmware [19:38:14] so aggravating [19:39:40] apergos: can we cycle one of the be's that work 1-4? let's see if it does the same thing? [19:39:56] I don't want to do that [19:42:22] I don't want to risk it not coming back up right [19:42:47] i understand...but the disk may present the same way...this may be normal for this setup [19:49:32] they shouldn't [19:50:00] so I can try to reach Ben, he should be back in the country by now [19:50:15] hmm also maybe rob remembers [19:50:52] about? [19:50:55] remember what? [19:51:15] RobH: do you remember seeing messages during bootup of ms-be 1 through 4, any of them, [19:51:22] of the sort [19:51:24] for timeout on disks? [19:51:32] The disk drive for /srv/swift-storage/sdn1 is not ready yet or not present. [19:51:37] or a delay in them presenting [19:51:39] not really, but i only worked on ashburn ones in detail, and those disk issues disappeared on ms-be1003 and 1005 [19:51:46] Continue to wait, or Press S to skip mounting or M for manual recovery [19:51:48] ? [19:51:50] yea, i didnt see that. [19:51:56] i would recall that. [19:51:57] ok, I should ask Ben then [19:52:04] he set those up? [19:52:08] yes [19:52:11] yep [19:59:49] New patchset: Andrew Bogott; "Update wiki instance-status pages automatically." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/23155 [20:00:45] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/23155 [20:10:15] anyone have a working email address for ben? [20:11:18] ^ I don't but if anyone does should remember this is an open channel [20:11:26] uh huh, I expect them to pm [20:20:17] ah nm found an address [20:27:36] New patchset: IAlex; "Remove ".gz" URL ending check that disables output compression." 
[operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/24326 [20:43:06] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [20:43:06] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [20:56:34] !log restarting gmetad on nickel [20:56:44] Logged the message, Master [21:22:42] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 3 seconds [21:27:01] rats, he didn't log the status of ms-be10 and neither did I [21:28:45] !log ms-be10: after chris installed the disks, filesystems were created ok by puppet and mounted but on a couple of reboots, there were boot messages about two not identical lists of disk not present or not ready; waiting a little bit allowed them to come up but it's not great. [21:28:56] Logged the message, Master [21:29:43] I just logged the status of ms be10 ( cmjohnson1 ). I sent email earlier to ben in case he remembers anything odd when he was booting be1-4. [21:30:36] okay...i am curious to find out what Ben has to say [21:30:56] me too [21:31:00] sigh [21:31:03] so what is the c2100 status ? [21:31:12] apergos: would still like to see this on a different box [21:31:14] lemme find the summary page I have [21:31:24] http://wikitech.wikimedia.org/view/Swift/Server_issues_Aug-Sept_2012 [21:31:25] that [21:31:56] there's always ms-be8 :-P [21:32:34] maybe we could convince mark to let us play with the 720 he has in esams [21:32:55] lesliecarr: are you going to the DC there at all? [21:33:11] not planning on it this time [21:33:13] need something? [21:33:22] next week i can probably get a bike and go over [21:33:38] maybe...i will get back to you [21:33:42] (omfg the train systems here…. if it was an average of 10c warmer throughout the year i would move… still hoping for global warming) [21:34:31] apergos1: a mail to ben about what exactly? [21:34:40] there is something nicer than bart/muni? [21:35:09] ;-] [21:35:27] I asked before, noone replied then you went and mailed Ben nevertheless [21:35:57] haha, cmjohnson1 god yes [21:36:06] imagine bart, but it runs more often and more places, and no urine [21:36:23] paravoid: we are curious to know if msbe1-5 had any issues presenting disks during the boot process. [21:36:36] LeslieCarr: watch out for trains that split and merge! [21:36:38] what do you mean by that? [21:36:58] when we replaced the disk in be10 they all presented but some took longer than others (we got the skip/mount question) [21:37:17] okay [21:37:18] what's the last message you got from me? [21:37:19] and the disks that did that changed each time [21:37:31] ooo, didn't see split and merging trains [21:37:46] we have a bunch of eqiad boxes we can reboot as often as we can [21:37:51] er, want [21:37:59] and do all kinds of comparisons [21:38:20] I thin he wanted to compare witha a known apparently working box [21:38:30] ^ yes [21:38:37] the eqiad boxes work atm [21:38:57] they've never been under real traffic so we don't know much [21:39:04] have they been in production? [21:39:06] that's the difference [21:39:21] we didn't see a problem until we put the tampa boxes in production [21:40:10] what does that have to do with what happens on boot? [21:40:43] it has to do with whether they are known good boxes or not [21:40:59] right now theyare basically "unknown" [21:41:34] apergos: can we put 10 into the ring and see how it responds? 
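(note: "putting 10 into the ring" means adding ms-be10's devices back into the swift object ring and rebalancing so the box starts taking real traffic. In generic swift CLI terms it is roughly the sketch below; the site-specific wrapper scripts and the actual zone/IP/port/weight values aren't in the log, so those are placeholders.)

    swift-ring-builder object.builder add z4-10.0.6.NN:6000/sdc1 100   # repeat per device
    swift-ring-builder object.builder rebalance
    # then distribute the regenerated *.ring.gz files to every proxy and storage node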
[21:41:58] I will let paravoid make that call [21:42:21] there is also ms-be7 [21:42:25] i would rather compare known working vs new hdd in be10 [21:42:35] first [21:43:06] which right now shows no errors on or after boot [21:43:13] dd the disks first? [21:43:24] dd if=/dev/zero of=/dev/sdN bs=1M [21:43:39] write to the whole disk and see what happens [21:44:29] then read the whole disk [21:44:44] if you're going to zero them you'll want to stop puppet on there and unmount them (and stop the swift processes) [21:44:51] yes [21:45:12] alteratively you can write a file within the filesystem [21:45:30] but try writing to the whole disk at least once before putting them in prod [21:45:55] mkfs doesn't touch the whole disk, so we basically have no idea if it actually works now or not [21:46:22] same when you mkfs the ms-be6 disks, a simple mkfs doesn't say much about the state of the disk [21:47:30] no, it doesn't (but we don't relly believe that all these disks are bad, do we?) [21:47:37] anyways it's certainly fine to check them [21:48:59] I find it extreme but plausible [21:49:11] it might be a bad shipment or something like that [21:49:44] in any case, a full dd will show us problems wherever they are [21:49:46] dell thinks it is the WD disk...that they're not compatible for what we're doing with them. [21:50:06] without actually using it for production traffic [21:50:38] !log temp stopping puppet on brewster [21:50:48] Logged the message, notpeter [21:53:52] sorry to upgrade and have fe1 break like that, I watched it for awhile and other than the memory leak it seemed ok [21:54:06] (of course I didn't have a clue that there was a socket issue to look for) [21:55:42] New patchset: Jgreen; "single ganglia cluster for fundraising" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24383 [21:55:57] I'd probably would have done the same :) [21:56:38] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24383 [21:56:57] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24383 [22:01:06] PROBLEM - Puppet freshness on ms-be6 is CRITICAL: Puppet has not run in the last 10 hours [22:23:00] New patchset: Jgreen; "messing with ganglia config for fundraising eqiad" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24386 [22:23:54] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24386 [22:24:30] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24386 [22:32:00] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours [22:33:03] PROBLEM - Puppet freshness on aluminium is CRITICAL: Puppet has not run in the last 10 hours [22:44:47] notpeter: I am around :) [22:45:15] https://gerrit.wikimedia.org/r/#/c/22673/ (extracting l10n out of scap) should not cause any troube [22:45:30] but it is better to double check after it has been deployed [22:45:39] I am not sure if we can run scap whenever we want [22:48:31] !log pulling mw58 from lvs pool and deploy groups for temporary testing [22:48:41] Logged the message, Master [22:50:14] hashar: hrm, ok [22:50:27] if you can find that out, i will merge :) [22:51:31] Reedy: https://gerrit.wikimedia.org/r/#/c/22673/ is changing the `scap` script [22:51:43] I moved out the portion that updates the l10ncache to another script [22:52:00] so that might break scap :-) [22:52:32] It's trivially different.. 
so one would hope it doesn't ;) [22:53:55] notpeter: go ahead :-) [22:54:45] ok, cool [22:55:02] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22673 [22:55:11] |log scap is going to die [22:55:23] lulz [22:55:31] it's merged on sockpuppet now [22:56:21] I'm currently working on making scap die [23:02:35] New patchset: Hashar; "(bug 39701) beta: automatic MediaWiki update" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22116 [23:03:28] New review: Hashar; "Added a call to mw-update-l10n from Ifd130c24" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/22116 [23:03:29] New patchset: Hashar; "(bug 39701) beta: automatic MediaWiki update" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22116 [23:04:20] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/22116 [23:04:34] Change abandoned: Hashar; "per ops-l discussion, will have to wait for syslog architecture to be rewritten." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16661 [23:15:21] I also got one adding a default parameter to a role class https://gerrit.wikimedia.org/r/#/c/23770/ [23:15:36] I use that class on labs, and it does not let us pass parameters (yet) ;-) [23:15:51] any comment would help I guess :) [23:22:33] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 182 seconds [23:34:43] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 1 seconds [23:41:19] answer from ben! [23:41:23] Yeah, I did see those messages on some of the hosts (most commonly after other indications told me a host had bad disks, on reboot it would throw up that message). Sure enough, pressing 's' would cause it to continue to boot and the disk would be unavailable once it came up. [23:41:35] I don't remember specifically whether I saw the message on any of be1-4, but I did see it. [23:41:39] that's all we get [23:42:23] next we need to see if any of those had bad disks... from what he described, if they didn't then he probably didn't see the message [23:42:29] ( cmjohnson1 ) [23:42:36] geez [23:42:39] I am so not here any more [23:43:38] apergos1 [23:43:42] yes [23:43:49] just reading now [23:43:50] I'm here long enough for an answer but then [23:44:07] (almost 3am) got to try to sleep [23:44:44] that really is not very helpful [23:45:06] i think he is talking about ms-be6 prior to him leaving [23:45:32] well he says some of the hosts [23:45:35] anyways [23:45:59] apergos1: not going to be solved tonight. get some sleep [23:46:11] you're right about that [23:46:17] gone! [23:46:26] g'night [23:46:28] have a good rest of the day if there's any of it left [23:46:57] thx...tty soon
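(note: to close the loop on the dd suggestion from earlier in the evening (~21:43): a full write-then-read pass over each new drive on ms-be10, with puppet and the swift services stopped and the filesystem unmounted, is the cheapest way to shake out bad disks before pooling the box. A sketch; the device name is illustrative, the drives hold nothing worth keeping at this point, and puppet will have to re-create the filesystems afterwards.)

    umount /srv/swift-storage/sdn1                    # if puppet already mounted it
    dd if=/dev/zero of=/dev/sdn bs=1M oflag=direct    # write the whole disk
    dd if=/dev/sdn of=/dev/null bs=1M iflag=direct    # read it all back
    dmesg | tail -n 50                                # look for I/O errors, resets, link drops
    smartctl -a /dev/sdn | egrep -i 'realloc|pending|uncorrect'   # SMART counters after the exercise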