[00:03:02] New patchset: MaxSem; "WLM banner" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/23021
[00:03:26] New review: MaxSem; "Not now." [operations/mediawiki-config] (master); V: 0 C: -2; - https://gerrit.wikimedia.org/r/23021
[00:03:36] PROBLEM - Puppet freshness on search18 is CRITICAL: Puppet has not run in the last 10 hours
[00:04:30] PROBLEM - Puppet freshness on search20 is CRITICAL: Puppet has not run in the last 10 hours
[00:04:30] PROBLEM - Puppet freshness on search35 is CRITICAL: Puppet has not run in the last 10 hours
[00:04:33] !log authdns-update to support wicipediacymraeg.org
[00:04:42] Logged the message, RobH
[00:05:33] PROBLEM - Puppet freshness on search32 is CRITICAL: Puppet has not run in the last 10 hours
[00:05:33] PROBLEM - Puppet freshness on search15 is CRITICAL: Puppet has not run in the last 10 hours
[00:05:33] PROBLEM - Puppet freshness on search34 is CRITICAL: Puppet has not run in the last 10 hours
[00:06:30] !log remove cn from uniqueness check in LDAP. it's a global check and causes problems
[00:06:39] Logged the message, Master
[00:07:30] PROBLEM - Puppet freshness on search36 is CRITICAL: Puppet has not run in the last 10 hours
[00:11:33] PROBLEM - Puppet freshness on search13 is CRITICAL: Puppet has not run in the last 10 hours
[00:11:33] New patchset: RobH; "added in redirect for wicipediacymraeg.org" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/23022
[00:13:30] PROBLEM - Puppet freshness on search27 is CRITICAL: Puppet has not run in the last 10 hours
[00:14:33] PROBLEM - Puppet freshness on search17 is CRITICAL: Puppet has not run in the last 10 hours
[00:14:33] PROBLEM - Puppet freshness on search23 is CRITICAL: Puppet has not run in the last 10 hours
[00:14:33] PROBLEM - Puppet freshness on search28 is CRITICAL: Puppet has not run in the last 10 hours
[00:15:36] PROBLEM - Puppet freshness on search22 is CRITICAL: Puppet has not run in the last 10 hours
[00:15:36] PROBLEM - Puppet freshness on search31 is CRITICAL: Puppet has not run in the last 10 hours
[00:15:47] New patchset: Aaron Schulz; "Added global backend config for things like math." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/22941
[00:16:30] PROBLEM - Puppet freshness on search25 is CRITICAL: Puppet has not run in the last 10 hours
[00:16:30] PROBLEM - Puppet freshness on search30 is CRITICAL: Puppet has not run in the last 10 hours
[00:16:30] PROBLEM - Puppet freshness on searchidx2 is CRITICAL: Puppet has not run in the last 10 hours
[00:17:33] PROBLEM - Puppet freshness on search19 is CRITICAL: Puppet has not run in the last 10 hours
[00:17:33] PROBLEM - Puppet freshness on search33 is CRITICAL: Puppet has not run in the last 10 hours
[00:19:13] Change merged: Aaron Schulz; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/22941
[00:21:36] PROBLEM - Puppet freshness on search29 is CRITICAL: Puppet has not run in the last 10 hours
[00:22:30] PROBLEM - Puppet freshness on search21 is CRITICAL: Puppet has not run in the last 10 hours
[00:22:51] New patchset: ArielGlenn; "db70 -> ms-be13" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/23023
[00:23:42] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/23023
[00:25:06] New patchset: ArielGlenn; "db70 -> ms-be13" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/23023
[00:25:56] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/23023
[00:26:09] Change abandoned: ArielGlenn; "committed a bunch of cruft instead of just what I wanted, grrr. need moar sleep." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/23023
[00:27:55] New patchset: ArielGlenn; "db70 -> ms-be13" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/23024
[00:28:46] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/23024
[00:29:23] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/23024
[00:31:58] ^demon: 'can not update the reference as a fast forward'
[00:32:19] ^demon: did push get disabled for the config repo?
[00:32:43] AaronSchulz: Possibly. Or you forgot to pull before you pushed?
[00:32:52] I already did pull --rebase
[00:33:00] Hmm right
[00:33:03] You should be able to push
[00:33:05] <^demon> I just quickly granted a couple of permissions to get it up and running earlier.
[00:33:20] <^demon> It may need more setup. Anyone in deployment can adjust those permissions.
[00:33:21] Right, the ACLs got hosed too
[00:33:51] * AaronSchulz sighs
[00:38:25] RoanKattouw: any idea how to fix?
[00:40:55] Let me see
[00:41:42] Try now
[00:41:59] worked
[00:59:52] New patchset: ArielGlenn; "ms-be13 disk layout without ssds" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/23028
[01:00:43] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/23028
[01:07:05] PROBLEM - Puppet freshness on search1003 is CRITICAL: Puppet has not run in the last 10 hours
[01:07:46] apergos: so where did we say ext-dist is going?
[01:07:53] we didn't
[01:08:18] my notes there say "it would be nice not to be in swift. but dunno where."
[01:08:20] ^h
[01:08:23] (on the cruft page)
[01:16:53] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/23028
[01:20:00] New patchset: Dzahn; "bug 31369 - make redirects protorel where possible" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/13293
[01:22:05] PROBLEM - Puppet freshness on search1001 is CRITICAL: Puppet has not run in the last 10 hours
[01:22:51] New patchset: Dzahn; "bug 31369 - make redirects protorel where possible" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/13293
[01:26:03] New review: Dzahn; "JeremyB: actually the problem was just the comment in line 164, Apache takes that as an argument to ..." [operations/apache-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/13293
[01:27:57] RECOVERY - Misc_Db_Lag on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 7s
[01:28:05] RECOVERY - MySQL Slave Delay on storage3 is OK: OK replication delay 17 seconds
[01:29:58] apergos: so it seems like you guys won't need me for the docroot/jar stuff right?
[01:30:11] I think not for the jar stuff for sure
[01:30:46] we'll be keeping the same path in timedmediahandler and in uh
[01:30:50] ogghandler
[01:31:13] just making sure that url winds up going somewhere else.
[01:32:24] and ext-dist doesn't need me if you just move it to another server and update the 3 globals
[01:33:31] * apergos goes grepping for 503s
[01:33:33] * jeremyb waves mutante
[01:33:52] feel free to propose that on the cruft page
[01:33:59] if it isn't already
[01:34:15] mutante: sorry, didn't get to that yet, thanks for the extra testing
[01:34:57] jeremyb: i rebased it. there was a major cleanup of redirects.conf in between.
[01:38:25] mutante: yup, just went through the diffs (but ignored the rebasing)
[01:40:32] PROBLEM - MySQL Slave Delay on storage3 is CRITICAL: CRIT replication delay 190 seconds
[01:41:26] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 289 seconds
[01:43:33] apergos: do you want the task of ext-dist?
[01:43:55] huh?
[01:44:03] I won't get to it right away
[01:44:26] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 5 seconds
[01:44:31] I need to do the r510, then the dell call, if there is anything left to do for ms-be6 then that, and then pybaltest.txt
[01:44:44] if I get that working I want to do jars/bla next
[01:45:02] RECOVERY - MySQL Slave Delay on storage3 is OK: OK replication delay 9 seconds
[01:45:14] basically I need to know where we want stuff like that, the mw tarballs, etc
[01:45:23] and I have no frickin clue
[01:45:31] that = ext dist files
[01:51:44] New patchset: Jeremyb; "tweak sync wrapper script" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/23033
[01:54:42] New review: Jeremyb; "do both paths exist? is there a symlink or bind mount in there somewhere? which is really canonical?" [operations/apache-config] (master) C: 0; - https://gerrit.wikimedia.org/r/23033
[01:55:53] logging off
[01:55:59] PROBLEM - ps1-d2-sdtpa-infeed-load-tower-A-phase-Z on ps1-d2-sdtpa is CRITICAL: ps1-d2-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2438*
[01:57:02] bye apergos!
[02:01:41] PROBLEM - ps1-d2-sdtpa-infeed-load-tower-A-phase-Y on ps1-d2-sdtpa is CRITICAL: ps1-d2-sdtpa-infeed-load-tower-A-phase-Y CRITICAL - *2438*
[02:07:50] PROBLEM - ps1-d2-sdtpa-infeed-load-tower-A-phase-Y on ps1-d2-sdtpa is CRITICAL: ps1-d2-sdtpa-infeed-load-tower-A-phase-Y CRITICAL - *2450*
[02:12:38] PROBLEM - ps1-a5-sdtpa-infeed-load-tower-A-phase-Y on ps1-a5-sdtpa is CRITICAL: ps1-a5-sdtpa-infeed-load-tower-A-phase-Y CRITICAL - *2575*
[02:18:38] PROBLEM - Host professor is DOWN: PING CRITICAL - Packet loss = 100%
[02:19:05] PROBLEM - ps1-a5-sdtpa-infeed-load-tower-A-phase-Z on ps1-a5-sdtpa is CRITICAL: ps1-a5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2575*
[02:22:31] PROBLEM - ps1-a5-sdtpa-infeed-load-tower-A-phase-Z on ps1-a5-sdtpa is CRITICAL: ps1-a5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2575*
[02:25:58] PROBLEM - mysqld processes on es9 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[02:26:34] PROBLEM - mysqld processes on es10 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[02:28:31] PROBLEM - ps1-a5-sdtpa-infeed-load-tower-A-phase-Z on ps1-a5-sdtpa is CRITICAL: ps1-a5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2588*
[02:30:37] PROBLEM - Puppet freshness on search1002 is CRITICAL: Puppet has not run in the last 10 hours
[02:31:40] PROBLEM - ps1-a5-sdtpa-infeed-load-tower-A-phase-X on ps1-a5-sdtpa is CRITICAL: ps1-a5-sdtpa-infeed-load-tower-A-phase-X CRITICAL - *2575*
[02:37:40] PROBLEM - ps1-a5-sdtpa-infeed-load-tower-A-phase-Z on ps1-a5-sdtpa is CRITICAL: ps1-a5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2600*
[02:38:52] RECOVERY - Puppet freshness on search1003 is OK: puppet ran at Fri Sep 7 02:38:31 UTC 2012
[02:39:28] RECOVERY - Puppet freshness on search16 is OK: puppet ran at Fri Sep 7 02:39:17 UTC 2012
[02:42:01] RECOVERY - Puppet freshness on search1001 is OK: puppet ran at Fri Sep 7 02:41:48 UTC 2012
[02:42:55] RECOVERY - Puppet freshness on search19 is OK: puppet ran at Fri Sep 7 02:42:39 UTC 2012
[02:43:41] PROBLEM - Puppet freshness on ms-be1006 is CRITICAL: Puppet has not run in the last 10 hours
[02:43:41] PROBLEM - Puppet freshness on ms-be1005 is CRITICAL: Puppet has not run in the last 10 hours
[02:43:41] PROBLEM - Puppet freshness on ms-be1009 is CRITICAL: Puppet has not run in the last 10 hours
[02:43:41] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[02:43:41] PROBLEM - Puppet freshness on singer is CRITICAL: Puppet has not run in the last 10 hours
[02:43:42] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours
[02:43:42] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours
[02:43:43] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours
[02:43:43] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours
[02:43:49] RECOVERY - ps1-a5-sdtpa-infeed-load-tower-A-phase-Z on ps1-a5-sdtpa is OK: ps1-a5-sdtpa-infeed-load-tower-A-phase-Z OK - 2375
[02:43:58] RECOVERY - Puppet freshness on search1002 is OK: puppet ran at Fri Sep 7 02:43:51 UTC 2012
[02:43:58] RECOVERY - Puppet freshness on search33 is OK: puppet ran at Fri Sep 7 02:43:51 UTC 2012
[02:44:07] RECOVERY - ps1-a5-sdtpa-infeed-load-tower-A-phase-X on ps1-a5-sdtpa is OK: ps1-a5-sdtpa-infeed-load-tower-A-phase-X OK - 2288
[02:44:34] PROBLEM - ps1-d2-sdtpa-infeed-load-tower-A-phase-Y on ps1-d2-sdtpa is CRITICAL: ps1-d2-sdtpa-infeed-load-tower-A-phase-Y CRITICAL - *2438*
[02:45:01] RECOVERY - Puppet freshness on search31 is OK: puppet ran at Fri Sep 7 02:44:54 UTC 2012
[02:45:10] RECOVERY - ps1-a5-sdtpa-infeed-load-tower-A-phase-Y on ps1-a5-sdtpa is OK: ps1-a5-sdtpa-infeed-load-tower-A-phase-Y OK - 2400
[02:45:28] PROBLEM - ps1-d2-sdtpa-infeed-load-tower-A-phase-Z on ps1-d2-sdtpa is CRITICAL: ps1-d2-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2475*
[02:45:55] RECOVERY - Puppet freshness on search18 is OK: puppet ran at Fri Sep 7 02:45:41 UTC 2012
[02:47:52] PROBLEM - ps1-d2-sdtpa-infeed-load-tower-A-phase-Y on ps1-d2-sdtpa is CRITICAL: ps1-d2-sdtpa-infeed-load-tower-A-phase-Y CRITICAL - *2413*
[02:48:01] RECOVERY - Puppet freshness on search26 is OK: puppet ran at Fri Sep 7 02:47:46 UTC 2012
[02:48:01] RECOVERY - Puppet freshness on searchidx2 is OK: puppet ran at Fri Sep 7 02:47:49 UTC 2012
[02:49:31] RECOVERY - Puppet freshness on search17 is OK: puppet ran at Fri Sep 7 02:49:20 UTC 2012
[02:50:25] RECOVERY - Puppet freshness on search20 is OK: puppet ran at Fri Sep 7 02:50:02 UTC 2012
[02:51:28] RECOVERY - Puppet freshness on search34 is OK: puppet ran at Fri Sep 7 02:51:22 UTC 2012
[02:51:55] RECOVERY - Puppet freshness on search15 is OK: puppet ran at Fri Sep 7 02:51:31 UTC 2012
[02:52:31] RECOVERY - Puppet freshness on search28 is OK: puppet ran at Fri Sep 7 02:52:05 UTC 2012
[02:52:31] RECOVERY - Puppet freshness on search23 is OK: puppet ran at Fri Sep 7 02:52:16 UTC 2012
[02:54:55] RECOVERY - Puppet freshness on search29 is OK: puppet ran at Fri Sep 7 02:54:55 UTC 2012
[02:56:25] RECOVERY - Puppet freshness on search25 is OK: puppet ran at Fri Sep 7 02:56:06 UTC 2012
[02:57:01] RECOVERY - Puppet freshness on search32 is OK: puppet ran at Fri Sep 7 02:56:51 UTC 2012
[02:58:58] RECOVERY - Puppet freshness on search30 is OK: puppet ran at Fri Sep 7 02:58:35 UTC 2012
[03:00:01] RECOVERY - Puppet freshness on search24 is OK: puppet ran at Fri Sep 7 02:59:32 UTC 2012
[03:00:55] RECOVERY - Puppet freshness on search13 is OK: puppet ran at Fri Sep 7 03:00:45 UTC 2012
[03:01:31] RECOVERY - Puppet freshness on search14 is OK: puppet ran at Fri Sep 7 03:01:08 UTC 2012
[03:03:28] RECOVERY - Puppet freshness on search36 is OK: puppet ran at Fri Sep 7 03:03:02 UTC 2012
[03:03:28] RECOVERY - Puppet freshness on search35 is OK: puppet ran at Fri Sep 7 03:03:04 UTC 2012
[03:04:58] RECOVERY - Puppet freshness on search21 is OK: puppet ran at Fri Sep 7 03:04:46 UTC 2012
[03:07:31] RECOVERY - poolcounter on ersch is OK: PROCS OK: 1 process with command name poolcounterd
[03:07:40] RECOVERY - Puppet freshness on search22 is OK: puppet ran at Fri Sep 7 03:07:28 UTC 2012
[03:07:49] RECOVERY - Puppet freshness on search27 is OK: puppet ran at Fri Sep 7 03:07:38 UTC 2012
[03:43:22] New review: Ori.livneh; "Might want to add your e-mail address to the comment." [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/22172
[03:50:45] apergos: pybaltestfile was/is on my radar...
[03:52:03] apergos: as for the jars, I'm not sure if I agree, I've added some comments on the wiki page, have you seen that?
[04:39:00] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours
[05:16:53] I didn't see your comments
[05:17:44] I was planning to move it into swift in a "jars" container
[05:17:59] maybe we'll wind up with other java crapola
[05:18:11] so might as well have a container for it
[05:18:14] paravoid:
[05:19:08] as for pybaltestfile, unless you were planning to do it in the next couple days, I might get to it before you
[05:19:16] we'll see
[05:26:12] the next couple days I don't plan to do anything
[05:26:14] (weekend)
[05:27:04] and for pybaltestfile I was looking at modifying rewrite.py to provide one or two status URLs
[05:27:19] like /status or something
[05:27:37] one to be handled internally in rewrite.py and one that checks backends
[05:27:48] and with pybal & squid using the former
[05:28:47] if we don't do that, then we're risking that the loss of that file (because e.g. we had a triple failure like the one that almost happened these days) would result in a DoS of the cluster
[05:29:19] also, I'd like to have a few URLs that tell us more about the overall cluster health and plug those into Nagios.
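The /status idea paravoid describes above can be pictured as a small WSGI middleware in front of Swift (rewrite.py is such a middleware). This is only a hypothetical sketch of the approach, not the code that was eventually written; the URL and response body are assumptions.

```python
# Minimal sketch of a health-check middleware: /status is answered by
# the proxy itself, so pybal/squid checks don't depend on fetching a
# real object, and losing one stored file can't depool the cluster.
def status_middleware(app):
    def wrapped(environ, start_response):
        if environ.get('PATH_INFO') == '/status':
            # Handled internally; no backend round trip.
            start_response('200 OK', [('Content-Type', 'text/plain')])
            return [b'OK\n']
        # Everything else falls through to the wrapped application.
        return app(environ, start_response)
    return wrapped
```

A second, separate URL that really queries the backends (as suggested above) could then be used for deeper Nagios checks rather than for LVS pooling decisions.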
[05:29:56] so, I may not have committed code yet, but as you see I've already planned it a bit :-)
[05:30:00] PROBLEM - Puppet freshness on oxygen is CRITICAL: Puppet has not run in the last 10 hours
[05:30:36] TimStarling: wouldn't you say that jars fit better into bits than upload?
[05:30:40] or am I way off?
[05:31:39] you are way off
[05:32:09] heh
[05:32:09] they would have been on bits to start with, except that it is required that they be hosted from the same domain as the video files, due to the same origin policy in Java
[05:32:32] java has crossdomain.xml nowadays
[05:32:50] it's always had it, I think
[05:33:05] no, it's >= 6 update 10
[05:33:15] (I checked that yesterday)
[05:33:26] I knew it was a Flash thing that Java copied at some point
[05:33:50] couldn't we just serve a crossdomain.xml from docroot and have cortado in bits?
[05:34:29] when was update 10 released?
[05:34:48] 2008 apparently
[05:35:02] but Java generally has autoupdate, so it's slightly better
[05:35:26] where is the documentation for it?
[05:35:40] for what? 2008 or crossdomain.xml?
[05:35:52] crossdomain.xml
[05:36:05] http://www.oracle.com/technetwork/java/javase/plugin2-142482.html
[05:36:12] links to http://www.adobe.com/devnet/articles/crossdomain_policy_file_spec.html
[05:36:31] Java Web Start and the Java Plug-In currently implement only a subset of the cross-domain policy file functionality. Specifically, site access restrictions are not supported. Access to a particular server is only granted if the crossdomain.xml file is present and contains only an entry granting access from all domains:
[05:36:36] hah
[05:36:54] so, yeah,
[05:36:58] should suffice
[05:41:18] * paravoid checks
[05:44:45] http://javasourcecode.org/html/open-source/jdk/jdk-6u23/com/sun/deploy/net/CrossDomainXML.html
[05:44:54] "This implementation only supports the cross-domain policy . More specific access policies are not implemented. This class will return false if they are specified."
[05:45:59] I just mentioned that above
[05:46:12] but allow from=* should be fine for us
[05:46:14] so you did
[05:47:29] how do you set a system property?
[05:48:44] maybe that is just java web start, not the plugin
[05:49:33] mmm, I guess so
[05:49:50] and if you run things with JWS you're probably screwed anyway, ask Leslie
[05:50:14] ok, I am fine with it
[06:18:44] heh, not exactly way off after all :-)
[06:22:48] so, I ran a grep on the squid logs for Java versions in UAs
[06:24:03] http://pastebin.com/EFyjQ6kE
[06:24:41] that includes all hits, incl. quite a few that are API hits
[06:25:16] and none of Java/1.6.0 with no update number are browser hits
[06:31:08] the cortado.jar sample is very small, 29 hits in the whole sampled-1000.log
[06:31:29] out of those only 2 are < se6 upd. 10, but the sample is too small to extrapolate
[06:31:55] so, yeah, looks good
[06:42:29] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: Puppet has not run in the last 10 hours
[06:42:29] PROBLEM - Puppet freshness on ms-be1011 is CRITICAL: Puppet has not run in the last 10 hours
[06:42:29] PROBLEM - Puppet freshness on ms-be1007 is CRITICAL: Puppet has not run in the last 10 hours
[06:43:55] TimStarling: I created a test page in my own domain, verified it did request '/crossdomain.xml' and didn't work by saying "Not allowed"
[06:44:12] TimStarling: then I placed the crossdomain.xml above in ms7 and retried and it works
[06:44:22] :-)
[06:44:31] so, who should I talk with to move this to bits?
[06:44:35] file a bug?
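Per the Oracle documentation quoted above, the Java plugin only honors a policy file that grants access from all domains, so the crossdomain.xml placed on ms7 for this test would have looked essentially like the following (a minimal sketch following the Adobe cross-domain policy file spec; the DOCTYPE line is optional):

```xml
<?xml version="1.0"?>
<!DOCTYPE cross-domain-policy SYSTEM
  "http://www.adobe.com/xml/dtds/cross-domain-policy.dtd">
<cross-domain-policy>
  <!-- Java accepts only an all-domains grant; any more specific
       access rule makes the plugin deny cross-domain access. -->
  <allow-access-from domain="*" />
</cross-domain-policy>
```

Note that "allow from=*" here means any site's applets may read from this host, which is acceptable for a public media server but worth keeping in mind.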
[06:48:29] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours
[06:48:56] I can probably do it
[06:49:26] that'd be great :)
[06:49:38] always happy to work with you :-)
[06:49:55] TMH sucks so much
[06:50:04] it's both OggHandler and TMH iirc
[06:50:09] oh and TMH has the path hardcoded in the code
[06:50:23] yeah I noticed
[06:51:43] you can open a bug against TMH asking them to clean up their code and support some sort of runtime configuration
[06:51:49] I'll do a patch for OggHandler
[06:53:50] thanks!
[06:54:05] is that crossdomain.xml file likely to disappear in the future?
[06:56:07] you know we're moving to swift right? ;)
[06:57:16] hahaha :)
[06:57:53] so, I've really liked your docroot idea
[06:58:06] and I also like having an index.html, favicon.ico, robots.txt etc.
[06:58:11] New patchset: Tim Starling; "Don't customise $wgCortadoJarFile" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/23036
[06:58:15] that's good
[06:58:27] Change merged: Tim Starling; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/23036
[06:59:29] I haven't decided where to host that docroot yet obviously
[06:59:37] but we could probably do it on all appservers, right?
[06:59:56] you're pushing docroot regularly to them aiui
[07:00:03] so it'd just be another vhost
[07:01:02] doesn't work for me
[07:01:12] what isn't?
[07:01:14] "Not allowed http://upload.wikimedia.org/..."
[07:01:19] cortado can't read from upload
[07:02:10] try http://tty.gr/cortado.html
[07:02:39] not allowed
[07:03:07] you have Java/1.6.0_24 from what I can see
[07:03:20] could you tcpdump and see if you request crossdomain.xml?
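The "just another vhost" idea floated above could look roughly like this Apache fragment. This is purely illustrative: the ServerName and DocumentRoot paths are assumptions, not the configuration actually deployed.

```apache
# Hypothetical sketch: serving the small static docroot
# (crossdomain.xml, robots.txt, favicon.ico, index.html)
# from the appservers as one more virtual host.
<VirtualHost *:80>
    ServerName upload.wikimedia.org
    DocumentRoot /usr/local/apache/common/docroot/upload
    <Directory /usr/local/apache/common/docroot/upload>
        Options None
        AllowOverride None
    </Directory>
</VirtualHost>
```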
[07:03:30] works for me
[07:04:38] I have icedtea-6-plugin 1.2-2ubuntu1.2
[07:04:53] 26.709004 192.168.24.60 -> 178.32.250.226 HTTP GET /cortado.jar HTTP/1.1
[07:04:56] 27.037665 192.168.24.60 -> 91.198.174.234 HTTP GET /crossdomain.xml HTTP/1.1
[07:04:59] 27.488855 192.168.24.60 -> 91.198.174.234 HTTP GET /wikipedia/commons/0/0e/Merikoski_Dam_Oulu_20120810.OGG HTTP/1.1
[07:07:44] no, it doesn't fetch crossdomain.xml, confirmed with tcpdump
[07:10:14] it is definitely IcedTea
[07:10:20] yeah I'm on their bugtracker
[07:10:29] searching
[07:11:19] sigh
[07:12:06] New patchset: Tim Starling; "Revert "Don't customise $wgCortadoJarFile"" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/23037
[07:12:27] Change merged: Tim Starling; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/23037
[07:14:02] your 99.9% figure probably still stands, but yeah, let's not be evil and break icedtea knowingly :)
[07:14:43] actually it seems to be broken with the old location too
[07:15:37] hm? broken how?
[07:16:24] New patchset: Dereckson; "(bug 39942) Disables patrol on fi.wikipedia" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/22999
[07:16:28] "no such element: httpsrc (check plugins.ini)"
[07:18:22] httpsrc = ElementFactory.makeByName("httpsrc", "httpsrc");
[07:18:22] if (httpsrc == null) {
[07:18:22] noSuchElement ("httpsrc");
[07:18:39] New review: Dereckson; "Amending the config change to include new page patrol." [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/22999
[07:22:09] so maybe it doesn't matter
[07:22:39] everyone who uses linux has a few different ways to play theora files, most notably the browser
[07:22:49] okay, a bit offtopic, but why are we using Java in the first place?
[07:22:53] instead of flash?
[07:23:24] open source
[07:23:46] and you can't play theora in flash
[07:23:57] probably the latter more than the former
[07:24:14] ha hm, apparently so
[07:24:25] I just googled that just after I asked
[07:25:13] cortado contains a complete port of ogg/theora/vorbis to java
[07:35:43] PROBLEM - Host search32 is DOWN: PING CRITICAL - Packet loss = 100%
[07:36:53] same here with icedtea plugin
[07:37:15] and that's with icedtea-7-plugin
[07:37:49] RECOVERY - Host search32 is UP: PING OK - Packet loss = 0%, RTA = 0.19 ms
[07:51:10] PROBLEM - udp2log log age for emery on emery is CRITICAL: CRITICAL: log files /var/log/squid/countries-100.log, /var/log/squid/countries-10.log, /var/log/squid/countries-1.log, have not been written in a critical amount of time. For most logs, this is 4 hours. For slow logs, this is 4 days.
[08:14:07] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours
[08:14:07] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours
[08:37:43] PROBLEM - Puppet freshness on virt0 is CRITICAL: Puppet has not run in the last 10 hours
[08:37:43] PROBLEM - Puppet freshness on virt1000 is CRITICAL: Puppet has not run in the last 10 hours
[09:08:43] New patchset: Dereckson; "(bug 37524) Change namespaces configuration - ku.wiktionary" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/12556
[09:09:42] New review: Dereckson; "Rebased to master" [operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/12556
[09:16:39] New patchset: Dereckson; "(bug 37524) Change namespaces configuration - ku.wiktionary" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/12556
[09:22:09] New patchset: Dereckson; "(bug 37524) Change namespaces configuration - ku.wiktionary" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/12556
[09:22:26] New review: Dereckson; "Added namespaces aliases." [operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/12556
[11:07:03] !log Sending Canadian upload traffic to upload-lb.eqiad (Varnish)
[11:07:13] Logged the message, Master
[11:11:13] PROBLEM - Auth DNS on ns1.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call
[11:12:34] RECOVERY - Auth DNS on ns1.wikimedia.org is OK: DNS OK: 0.023 seconds response time. www.wikipedia.org returns 208.80.154.225
[11:14:02] are you picking countries at random? :)
[11:14:08] no
[11:14:12] i'm picking small countries
[11:14:17] argentina was spanish speaking
[11:14:20] now an english speaking one
[11:14:25] and needs to be outside europe
[11:15:24] funny to read argentina & canada as small countries, but I know what you meant :)
[11:15:46] yeah, in terms of wikipedia traffic
[11:15:54] argentina was 1%, canada around 2% of our traffic iirc
[11:16:17] how do you get those numbers?
[11:20:12] stats.wikimedia.org
[11:20:48] amazing, I didn't know that
[11:20:52] I knew about reportcard but not that
[11:21:09] it has quite a lot
[11:24:18] are you planning to gradually switch all traffic?
[11:35:39] well, no
[11:35:41] did that last time
[11:36:02] and when I switched pretty much all traffic, noticed that all GigE links were at > 80% capacity ;-)
[11:36:10] 8 servers was plenty to handle the load in terms of varnish efficiency
[11:36:18] but forgot that with fewer servers, GigE links become a problem ;-)
[11:36:24] then added 8 more servers just to solve that problem
[11:36:36] but now I have 8 servers active again, the remaining 8 i'm gonna have moved and upgraded to 10G
[11:36:40] and then probably 8 is enough again
[11:36:50] but until then, these can't really handle the full traffic load
[11:36:54] but I want to seed the cache a bit
[11:37:02] and having a small mix of languages/countries active on it will help with that
[11:59:21] 10G? not bonds?
[11:59:33] like 2x1G?
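The link-capacity question above is easy to sanity-check with back-of-envelope arithmetic, using the figures already mentioned in the conversation (8 active servers, single GigE links at roughly 80% utilization):

```python
# Back-of-envelope: spread the same aggregate demand over
# single 1G links, 2x1G bonds, or 10G links per server.
servers = 8
demand_gbps = servers * 1.0 * 0.8                  # ~6.4 Gb/s aggregate

bonded_util = demand_gbps / (servers * 2 * 1.0)    # 2x1G per server
tengig_util = demand_gbps / (servers * 10.0)       # 10G per server

print(f"aggregate demand: {demand_gbps:.1f} Gb/s")
print(f"2x1G per-link utilization: {bonded_util:.0%}")   # 40%
print(f"10G  per-link utilization: {tengig_util:.0%}")   # 8%
```

Bonding halves the per-link load and adds NIC redundancy, while 10G leaves far more headroom per server; the trade-off discussed here is exactly headroom versus resiliency.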
[12:00:22] if we were at ~80%, then 2x1G would get us to 40% and better resiliency than a single network card
[12:11:11] morning
[12:12:24] apergos: morning, slowly drifting I see
[12:12:29] the reason I didn't really care about moving the pybal status check file from ms7 to swift is that retrieving it from ms7 was also a way to have the whole cluster fall over if ms7 was sad (but actually if ms7 dies then the upload squids are going to be unhappy anyways)
[12:12:34] no, I woke up at 4 am again
[12:12:49] just stubbornly stayed in the bed for an hour to see if I could sleep more. no dice
[12:13:14] I wonder if I'll go through the whole 2 weeks without any adjustment
[12:13:21] New patchset: Dereckson; "(bug 29902) Cleaning InitialiseSettings.php" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/23059
[12:13:52] paravoid: no not bonds
[12:14:01] those servers can do a whole lot more than 2G
[12:14:01] today I'll finish the r510 setup and, assuming all goes well, I plan to add it into the pool of backends, it's temporary though because asher wants it back
[12:14:06] and the switches only have 48 ports
[12:14:16] apergos: then don't
[12:14:41] filling up a backend server takes more than a week
[12:14:42] we need to know it's going to work
[12:14:50] why wouldn't it work
[12:14:53] we know that these servers work
[12:15:02] stop checking whether wheels are round, they are
[12:15:04] have we had them in the pool? no
[12:15:07] the only "question" is whether we can see individual disks
[12:15:13] instead of raid
[12:15:16] we don't know how they are performance wise
[12:15:19] you can always make individual raid arrays, which sucks a bit, but does work
[12:15:24] they are very good performance wise
[12:15:29] they run our mysql core databases
[12:15:39] we don't know how they are performance wise for swift
[12:15:47] ben said no, asher said yes, I tend to believe asher but let's double-check. we don't need to fill it for that
[12:15:48] again, stop checking whether wheels are round
[12:16:38] well I already got told that not having ssds is enough of an issue that we could wind up returning 503s to people purging thumbs
[12:16:49] so yes I want to test
[12:17:38] feel free to, but please don't move around 20TB back and forth just for that...
[12:17:50] it won't be 20t
[12:17:59] this box has smaller disks and I intend to weight it lower
[12:20:43] I also don't see the point of performance testing it, and I see it even less if you use different disks and downweight it in traffic
[12:20:45] ah paravoid do you think you will be on the dell call (do you care about being on it)? I think it will mostly be them asking about how the system is set up
[12:21:37] I won't be, I have a couple of friends visiting Greece and we're going out for dinner
[12:21:45] ok no worries
[12:22:45] replacements aren't even gonna be R510s
[12:22:49] since those are end of sale
[12:22:57] they will be 520s I expect
[12:23:03] i expect they'll be 720XDs
[12:23:44] what's the difference in the specs?
[12:23:54] this is the first I've heard that we would put 720xds iin there [12:23:57] http://www.dell.com/us/enterprise/p/poweredge-r720xd/pd?~ck=anav [12:24:02] no it's not [12:24:05] yes, it is [12:24:06] i've said this multiple times already [12:24:12] we bought these in esams [12:24:15] they're racked there [12:24:53] you said you had these in esams, we've (or at least while I've been in the channel) never discussed it as a replacement for the c2100s [12:25:00] well DUH [12:25:06] and I've heard about the 720xds exactly once before this [12:25:29] so testing an R510 now, which has a rather different configuration, seems pretty pointless [12:26:33] fine, I'll give theh box back and undo my puppet changes [12:27:16] it would have been nice to know this was the plan though, I have been talking about this for a few days (in the channel) [12:27:24] i've said it several times [12:27:45] it's not "the plan", I don't think there is "a plan", but it's what makes sense to me [12:27:56] in any case it's not gonna be end of sale R510s [12:28:14] R720XD looks like pretty similar to C2100 except in the proper dell poweredge R line [12:28:17] no, I assumed it would be 520s [12:29:22] dell NL didn't want to sell the C2100 to us as they were having many issues and were going out [12:30:05] what a good idea. too bad dell usa didn't do the same [12:31:54] do you know if you got the h310 or the h710 for the ones in amsterdam? [12:32:12] what do you mean? [12:32:15] oh the raid controller [12:33:04] H310 [12:33:10] which they assured me, can do JBOD [12:33:13] but i haven't tested that yet [12:33:20] RT #1961 [12:34:06] uh huh [12:38:27] ok so if the backplane doesn't do the trick on ms-be6, what do we want next? 
(this assumes no great insights are gained from the dell phone call) [12:39:47] i have not been following that, so no idea [12:39:57] i think those systems are broken and need to be replaced [12:40:18] basically what I'm asking is: we will have tried the controller, the motherboard, the backplane, and yes I think they are broken [12:40:36] but I don't want this to turn into "dell ships us new bits or a new box, we deploy it, it still sucks" like dataset1 [12:40:50] no, a different box, not the same box [12:40:54] I want to get the broken stuff replaced by working stuff as soon as possible [12:42:30] having not done this with a vendor before, would we be in a position to lean on them to send us a 720xd and refund us for the c2100 for example? [12:43:11] what do you think? [12:43:25] PROBLEM - ps1-d1-sdtpa-infeed-load-tower-A-phase-Y on ps1-d1-sdtpa is CRITICAL: ps1-d1-sdtpa-infeed-load-tower-A-phase-Y CRITICAL - *2563* [12:43:49] I think vendors don't like to send new gear, and they like to insist that the warranty means they will send a new box of the old type [12:43:57] that's why I'm asking for your input [12:44:11] I haven't spent really any time working with dell as a vendor [12:44:12] I think vendors like to keep getting more money [12:44:28] PROBLEM - Puppet freshness on ms-be1005 is CRITICAL: Puppet has not run in the last 10 hours [12:44:28] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [12:44:28] PROBLEM - Puppet freshness on singer is CRITICAL: Puppet has not run in the last 10 hours [12:44:28] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours [12:44:28] PROBLEM - Puppet freshness on ms-be1009 is CRITICAL: Puppet has not run in the last 10 hours [12:44:29] PROBLEM - Puppet freshness on ms-be1006 is CRITICAL: Puppet has not run in the last 10 hours [12:44:29] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours [12:44:30] PROBLEM - 
Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours [12:44:30] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [12:44:55] RECOVERY - ps1-d1-sdtpa-infeed-load-tower-A-phase-Y on ps1-d1-sdtpa is OK: ps1-d1-sdtpa-infeed-load-tower-A-phase-Y OK - 2363 [12:52:50] New patchset: Dereckson; "Cleaning Berlin hackathon tutorial configuration." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/23062 [13:01:14] apergos: I don't worry much about performance testing of boxes [13:01:27] we have the same traffic that was handled by two boxes spread out over twelve [13:01:41] I already said I would stop what I was doing, revert the puppet changes in the repo and give the box back [13:01:58] I'm not talking about this specific test but generally [13:03:22] in the specific case we would have the non ssd issue [13:03:36] yeah that's bad [13:03:58] which reminds me I was going to look to see if we are hitting any of those issues now [13:04:29] the summary is that thumb purges rely on container listings which if done on boxes where the containers are not on ssds can take longer than the 10 seconds we allow before timeout [13:05:10] if none of the three replicas have the copy on ssd, then (says aaron, which is what ben found I guess) the user will wait 30 seconds and then not actually get their thumbs purged [13:05:52] anyways I'll look at that today [13:07:12] is there a plan to remove the SSDs? [13:07:21] some boxes don't have ssds [13:07:40] and several with them are down or out of the rings [13:09:29] no new disks unreachable today it seems, that's good [13:15:21] two OLD boxes onto twelve NEW ones yeah ;) [13:17:57] the two boxes weren't handling all read traffic were they? I thought part of what we did recently was move a pile of read traffic over [13:18:40] were the 2 old boxes handling all thumb reads or something?
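(Editor's note: the container-listing timeout theory above could be sanity-checked by timing a listing through the swift proxy. This is only a sketch; the host, auth token, account, and container names below are placeholders, not the real cluster values.)

```shell
# Hypothetical check: time a thumb-container listing against the swift proxy.
# If this regularly approaches the 10-second proxy timeout, purges relying on
# the listing will fail. All names here are made up for illustration.
time curl -s -o /dev/null \
  -H "X-Auth-Token: $TOKEN" \
  "http://swift-proxy.example/v1/AUTH_example/wikipedia-commons-local-thumb.a0"
```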
[13:20:23] New patchset: Dereckson; "(bug 39942) Disables patrol on fi.wikipedia" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/22999 [13:20:37] New review: Dereckson; "shellpolicy" [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/22999 [13:24:55] how weren't they handling all read traffic? [13:26:21] ok which two boxes were these? [13:26:50] ms5 and ms7? [13:27:32] 12 + 4 proxies actually [13:29:14] ah those two boxes, but they were doing a lot less work than swift does [13:29:35] how? :) [13:30:43] one get via swift has a lot more associated with it than one get via nginx [13:31:15] that makes swift inefficient and my argument all the more true ;) [13:31:26] sure, it needs to replicate more, but even ms7 was replicating [13:38:48] hashar: you there? [13:38:58] notpeter: yup [13:39:14] hey, so I was looking at the labs imagescaler puppet configs [13:39:30] they look pretty (nearly completely) similar to the production ones, is this correct? [13:40:26] also, if I can get a test box up in production (but not live) by the tech days, would you be willing to help me test it?
[13:40:48] notpeter: yeah beta is almost using the same conf as production [13:40:55] imagescaler might have some specific ::labs classes though [13:41:15] it seems to have an nfs mount that prod doesn't, but that seems to be about it [13:42:14] also the imagescaler class is applied on apache boxes IIRC [13:42:26] yeah [13:42:33] it's sorta just an extension of the apache stuff [13:44:30] so the 'beta' Apache boxes are applying imagescaler::labs [13:44:31] but it does not seem to be there anymore [13:44:55] imagescaler::labs is in site.pp [13:45:05] (which is confusing, I admit) [13:45:20] (we're trying to get rid of having those role classes in site.pp) [13:47:48] bah [13:47:57] I forgot my labs ssh key password [13:48:13] doh [13:58:44] notpeter: so I re-ran puppet manually on the Apache boxes [13:58:49] seems the class works [14:05:41] woosters: added canadian traffic today [14:06:06] cool [14:06:23] and if something fails you know what to do [14:06:31] "blame Canada" [14:07:24] u think it has more traffic than Argentina? [14:07:46] can see the traffic surging 50% ahead already [14:08:36] hmm double actually [14:08:42] yes, quite a bit more [14:10:41] guess u could turn it up on Monday when u are here [14:10:59] can't do full traffic anyway [14:11:21] so it can simmer like this until rob is back at eqiad and we can upgrade network links [14:12:01] you are not using the other 8 then [14:12:08] indeed [14:12:23] don't you just love Varnish [14:12:27] quite some saving [14:12:36] this hw is optimized for squid [14:12:46] we could do even less [14:13:16] what do you mean optimized for squid?
[14:14:35] not a lot of cpu, squid does (almost) no multithreading anyway [14:14:38] and is quite inefficient [14:14:49] so with varnish we can do multiple cpus and many cores [14:14:56] and get a lot more out of a single box [14:15:05] and thus more memory too [14:15:51] hehe [14:16:17] it's also why they don't have 10G ;) [14:16:29] new varnish boxes we do generally buy with 10G [14:27:00] New patchset: Hashar; "meta favicon as symbolic links" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/23074 [14:39:57] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [14:44:41] ok so the good news is that I am not seeing a bunch of purge related 503s in the swift logs afaict [14:45:11] gonna check the rings for the containers and see how they are distributed [14:45:24] later, cause I should go into the office instead of typing from my hotel room [14:46:47] in other good news, Monday's rebalance seems to be nearing completion [14:46:56] yay for that [14:47:08] I'm going to push another rebalance on Monday to increase the weight from 66->100 [14:47:14] on two (out of three) of those boxes [14:47:22] ok [14:47:25] I'm going to keep ms-be12 at 66, since we need to empty that at some point [14:47:28] to change its zone [14:50:04] New patchset: Hashar; "migrate favicon from upload to bits" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/23077 [15:00:29] mark: can you verify that mrjp-c2-sdtpa port 14 on LC11 is enabled plz [15:01:25] it's down but not disabled [15:01:49] cmjohnson1: did you see the ticket about professor? [15:01:58] paravoid: that is what I am working on now.
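(Editor's note: the weight bump described above would look roughly like the following with swift-ring-builder. This is a sketch only; the builder file name and device IDs are placeholders for the real ring entries.)

```shell
# Sketch of the planned rebalance: raise two devices from weight 66 to 100
# so the ring gradually shifts partitions onto them. Device IDs d10/d11 and
# the builder file name are hypothetical.
swift-ring-builder object.builder set_weight d10 100
swift-ring-builder object.builder set_weight d11 100
# ms-be12's devices stay at 66, since that box must later be emptied
# to change its zone.
swift-ring-builder object.builder rebalance
# then distribute the regenerated object.ring.gz to the proxies and
# storage nodes
```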
[15:02:04] apparently the port is down [15:02:14] ah okay :) [15:02:36] not really involved with that, I just happened to be around when Tim was asking for remote hands [15:03:07] mark: please enable for me [15:03:16] it is enabled [15:03:20] i see its done ..thx [15:03:23] it wasn't disabled [15:03:25] i didn't do anything [15:03:50] flaky cable? [15:03:52] New review: Tpt; "It's not needed. Default will de set by the ProofreadPage extension at 250 for Page and 252 for Inde..." [operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/20876 [15:04:10] mark: it was a flaky cable [15:04:21] probably one from when the site was built [15:04:46] sdtpa? i think that was also stayonline back then ;) [15:05:10] that's pretty recent you know ;-) [15:07:29] !log moving en search traffic from pmtpa to eqiad [15:07:39] Logged the message, notpeter [15:08:27] RECOVERY - Host professor is UP: PING OK - Packet loss = 0%, RTA = 1.49 ms [15:09:06] New patchset: Pyoungmeister; "lucene.php: moving pool1 (en) traffic to eqiad" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/23080 [15:13:31] Change merged: Pyoungmeister; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/23080 [15:14:12] New review: Hashar; "They are different scripts :-/ We used to have all scripts in /home/wikipedia/bin, then they got m..." [operations/apache-config] (master) C: 0; - https://gerrit.wikimedia.org/r/23033 [15:14:19] bah [15:14:25] sync-apache has two different versions :( [15:16:47] ugh hashar [15:17:08] jeremyb: that is messy :-) [15:17:51] hashar: so it's unversioned then? 
[15:17:57] !log moving pool2,3,4 and prefix search traffic to eqiad [15:18:06] Logged the message, notpeter [15:19:12] jeremyb: /home/wikipedia/bin is full of old scripts [15:19:30] some of them being in the same state they were before being migrated to puppet :/ [15:20:10] * jeremyb closes his ears [15:20:23] * hashar hides his eyes [15:20:46] New patchset: Pyoungmeister; "lucene.php: moving all search traffic to eqiad" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/23081 [15:21:07] OHHHH [15:21:14] that is an apache-config [15:21:15] Change merged: Pyoungmeister; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/23081 [15:22:32] New review: Hashar; "The issue is caused by I6f2686844d916ffc6af95d9917193df0b57f5923 (by myself)." [operations/apache-config] (master) C: 1; - https://gerrit.wikimedia.org/r/23033 [15:22:39] jeremyb: good to me now :-) [15:23:13] New review: Hashar; "Path changed to the correct '/usr/local/bin' with If7a71a57" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/15440 [15:23:36] RECOVERY - udp2log log age for emery on emery is OK: OK: all log files active [15:23:41] jeremyb: we still need to phase out the /h/w/bin scripts though [15:24:15] hashar: shouldn't be too hard. can chat about it later (e.g. tomorrow or monday) [15:24:38] not there this weekend [15:24:45] and I am flying to SF on monday [15:25:01] I might raise the issue on the op list [15:25:26] New patchset: Ottomata; "filters.emery.erb - udp-filter no longer uses the -f flag" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/23082 [15:25:48] hashar: i was thinking have a single wrapper script for all of /home/w/b and symlink to it as needed [15:26:08] hashar: it detects what it needs to call in /usr/local/bin based on symlink's name [15:26:31] hashar: and could even send some stats to graphite before forwarding to /usr/local/bin ;-P [15:26:43] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/23082 [15:27:02] jeremyb: you got me at the 'graphite' part [15:27:04] anyone around to review that? [15:27:07] you are definitely evil :-]]]]]]] [15:27:16] PROBLEM - Varnish traffic logger on cp1022 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [15:27:16] PROBLEM - Varnish traffic logger on cp1024 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [15:27:16] PROBLEM - Varnish traffic logger on cp1028 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [15:27:16] PROBLEM - Varnish traffic logger on cp1026 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [15:27:16] PROBLEM - Varnish traffic logger on cp1042 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [15:27:26] hashar: /me ? why of course [15:27:50] jeremyb: the symlinking is a good idea [15:27:52] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/23082 [15:28:00] ottomata: done [15:28:01] PROBLEM - Varnish traffic logger on cp1025 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [15:28:01] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [15:28:01] PROBLEM - Varnish traffic logger on cp1043 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [15:28:06] thank youuuu! 
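(Editor's note: the wrapper-plus-symlinks idea for /home/wikipedia/bin discussed above could be sketched as a single dispatcher function. The `DISPATCH_DIR` override is an invention here so the target directory, /usr/local/bin in the discussion, can be substituted for testing; the graphite/stats hook is left out.)

```shell
# One dispatcher for all legacy scripts: each old script name becomes a
# symlink to a tiny wrapper that calls dispatch "$0" "$@". The dispatcher
# forwards to the script of the same name under the real bin directory.
dispatch() {
    name=$(basename "$1"); shift            # $1 is the invoked path, e.g. "$0"
    dir="${DISPATCH_DIR:-/usr/local/bin}"   # override for testing
    if [ ! -x "$dir/$name" ]; then
        echo "$name: no matching script in $dir" >&2
        return 127
    fi
    # a stats/graphite hook could be inserted here before forwarding
    "$dir/$name" "$@"
}
```

Installed for real, each symlink's target would just be a two-line wrapper running `dispatch "$0" "$@"`.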
[15:28:15] fresh air, brb [15:28:20] hashar: /me too [15:28:46] RECOVERY - Varnish traffic logger on cp1028 is OK: PROCS OK: 3 processes with command name varnishncsa [15:30:16] RECOVERY - Varnish traffic logger on cp1026 is OK: PROCS OK: 3 processes with command name varnishncsa [15:30:43] PROBLEM - Puppet freshness on oxygen is CRITICAL: Puppet has not run in the last 10 hours [15:38:58] RECOVERY - Varnish traffic logger on cp1043 is OK: PROCS OK: 3 processes with command name varnishncsa [15:43:55] RECOVERY - Varnish traffic logger on cp1024 is OK: PROCS OK: 3 processes with command name varnishncsa [15:45:34] RECOVERY - Varnish traffic logger on cp1042 is OK: PROCS OK: 3 processes with command name varnishncsa [15:48:34] RECOVERY - Varnish traffic logger on cp1022 is OK: PROCS OK: 3 processes with command name varnishncsa [15:49:19] RECOVERY - Varnish traffic logger on cp1025 is OK: PROCS OK: 3 processes with command name varnishncsa [15:54:16] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [15:57:27] New patchset: Hashar; "Deploying OSB to beta" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/22172 [15:58:03] New patchset: Hashar; "Deploying OSB to beta" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/22172 [16:01:21] New review: Hashar; "I am not sure what happened with the previous patchset. Anyway, this is fine." 
[operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/22172 [16:01:21] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/22172 [16:09:18] New patchset: Dereckson; "(bug 39432) Enable Narayam on ka.wikipedia" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/23085 [16:10:53] New patchset: Dereckson; "(bug 39432) Enable Narayam on ka.wikipedia" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/23085 [16:11:23] New review: Dereckson; "PS2: removed trailing space" [operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/23085 [16:15:41] New patchset: Pyoungmeister; "disabling lucene loggin fro prefix hosts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/23086 [16:16:33] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/23086 [16:17:29] New patchset: Pyoungmeister; "disabling lucene loggin fro prefix hosts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/23086 [16:18:19] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/23086 [16:21:36] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/23086 [16:23:17] New patchset: Pyoungmeister; "log4j for lucene: fixing typo" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/23087 [16:24:17] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/23087 [16:24:33] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/23087 [16:43:58] PROBLEM - Puppet freshness on ms-be1007 is CRITICAL: Puppet has not run in the last 10 hours [16:43:58] PROBLEM - Puppet freshness on ms-be1011 is CRITICAL: Puppet has not run in the last 10 hours [16:43:58] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: Puppet has not run in the last 10 hours [16:49:58] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [17:15:56] notpeter: i am looking into search 32 and despite the sys event log saying there is bad dimm everything else looks normal [17:16:19] i checked the avail mem and it shows [17:16:20] total used free shared buffers cached [17:16:20] Mem: 96675 6056 90618 0 102 1884 [17:16:20] -/+ buffers/cache: 4070 92604 [17:16:20] Swap: 951 0 951 [17:20:01] cmjohnson1: huh, weird [17:20:10] it seems to be staying on this time, though.... [17:20:54] it does… thought about reseating the dimm that shows an error but… it is working [17:21:39] heh, yes [17:21:48] let's not mess with this house of cards [17:22:36] sounds good to me [17:23:01] I guess feel free to close the ticket, as well. can always reopen if/when it dies again [17:24:05] right… it is not like we have done that 3-4 times before ;] [17:25:52] hehehe, yes indeed [17:41:37] Change abandoned: RobH; "(no reason)" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/23022 [17:53:40] New review: Dereckson; "shellpolicy" [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/23076 [18:00:47] New patchset: Ottomata; "Adding --settings flag to gerrit stats cronjob" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/23093 [18:01:40] New review: gerrit2; "Lint check passed."
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/23093 [18:03:06] heyo, mergey wergey? [18:03:08] notpeter? [18:14:54] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [18:14:54] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [18:20:38] sup [18:21:01] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/23093 [18:21:04] sure, sounds great [18:22:47] danke [18:23:37] New patchset: Dereckson; "(bug 39264) Add Tudalen: and Indecs: namespaces to cy.wikipedia" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/23094 [18:24:42] New review: Dereckson; "shellpolicy" [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/23094 [18:38:54] PROBLEM - Puppet freshness on virt0 is CRITICAL: Puppet has not run in the last 10 hours [18:38:54] PROBLEM - Puppet freshness on virt1000 is CRITICAL: Puppet has not run in the last 10 hours [19:00:53] New patchset: Asher; "moving carbon-collector pidfile somewhere ephemeral to prevent a stale file post-server crash" [operations/software] (master) - https://gerrit.wikimedia.org/r/23099 [19:29:17] New review: Helder.wiki; "Reedy, do you know if I missed something? Or is there a bug preventing this from working as desired?" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/21475 [19:38:31] http://www.reddit.com/r/unitedkingdom/comments/zhut0/dont_wait_for_the_uk_snoopers_charter_to_pass/ [19:49:01] New patchset: Dereckson; "(bug 38840) Namespaces configuration on uz.wikipedia" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/23102 [19:57:44] robh: after changing out the mobo… do i need to install OS again?
[19:57:59] usually yep [19:58:34] ok [19:59:01] !log loading new OS on mw8 after mother board swap [19:59:10] Logged the message, Master [20:15:31] robh: [20:23:48] New patchset: Pyoungmeister; "all mw, srv and bast hosts now use mw.cfg. simplified." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/23106 [20:24:40] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/23106 [20:24:52] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/23106 [20:26:25] cmjohnson1: you should be good to go now [20:26:39] notpeter: thx for fixing that [20:26:45] yep, no prob [20:35:03] RECOVERY - Host mw8 is UP: PING OK - Packet loss = 0%, RTA = 0.42 ms [20:43:00] RECOVERY - SSH on mw8 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [20:44:56] New patchset: Ottomata; "Rsyncing lucene logs over to dataset2." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/23107 [20:45:48] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/23107 [20:46:18] oo [20:46:56] New patchset: Ottomata; "Rsyncing lucene logs over to dataset2." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/23107 [20:47:50] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/23107 [20:48:23] zats better [20:48:27] notpeter could you do me a kindness? 
[20:54:05] ack cafe closing, gotta run [21:04:40] PROBLEM - NTP on mw8 is CRITICAL: NTP CRITICAL: No response from NTP server [21:07:17] Change restored: RobH; "(no reason)" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/23022 [21:08:46] New patchset: RobH; "added in redirect for wicipediacymraeg.org" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/23022 [21:11:33] Change merged: Dzahn; [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/23022 [21:20:34] PROBLEM - Puppet freshness on mw8 is CRITICAL: Puppet has not run in the last 10 hours [21:25:33] paravoid: I tagged you for a mw revision [21:25:56] hm? [21:26:01] gerrit? [21:26:14] ah [21:26:39] sorry for the delay, I am still dealing with dell [21:28:38] AaronSchulz: want to explain a bit more? I'm not familiar with your code :) [21:28:44] is that for the shorted thumb URLs? [21:31:03] yes [21:31:29] this patch just redirects requests for the long style names [21:34:05] did you see notmyname's patch? [21:35:12] New patchset: MaxSem; "Bug 39919 - weekly feeds for frwikisource" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/23112 [21:36:00] paravoid: yes, what about it? [21:36:12] New review: MaxSem; "Requires a FF commit to be merged first." [operations/mediawiki-config] (master); V: 0 C: -2; - https://gerrit.wikimedia.org/r/23112 [21:39:34] !log pulled mw8 out of dsh groups since its down [21:39:35] New review: Dereckson; "Config ok. Comments and commit message could be improved." [operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/23112 [21:39:44] Logged the message, RobH [21:42:04] !log For reference, the dsh groups Rob removed mw8 from are mediawiki-installation and apaches [21:42:14] Logged the message, Mr. 
Obvious [21:43:16] !log for more reference, mw8 is down for DIMM errors (RT-3499) [21:43:25] Logged the message, Master [21:46:38] RobH: apache-graceful-all logging should be fixed now [21:46:47] Also [21:46:51] Let me actually make it !log [21:47:30] Previously it would just output "robh has gracefulled all apaches" into #wikimedia-tech without prefixing it with !log [21:49:40] RECOVERY - Host ms-be6 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [21:50:52] RECOVERY - swift-account-replicator on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [21:51:09] that's a lie, it's only up so I can test with the latest mpt2sas driver [21:51:29] New review: MaxSem; "> No local consensus for this solution (please get a local consensus first)." [operations/mediawiki-config] (master); V: 0 C: -2; - https://gerrit.wikimedia.org/r/23112 [22:01:18] New review: Dereckson; "I only see at https://fr.wikisource.org/wiki/Wikisource:Scriptorium#Flux_des_.22textes_de_la_semaine..." 
[operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/23112 [22:22:19] !log creating new gerrit project for the wikibugs IRC bot as a .deb (operations/debs/wikibugs) [22:22:28] Logged the message, Master [22:36:39] PROBLEM - Host ms-be6 is DOWN: PING CRITICAL - Packet loss = 100% [22:39:03] RECOVERY - Host ms-be6 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [22:45:30] PROBLEM - Puppet freshness on ms-be1005 is CRITICAL: Puppet has not run in the last 10 hours [22:45:30] PROBLEM - Puppet freshness on ms-be1006 is CRITICAL: Puppet has not run in the last 10 hours [22:45:30] PROBLEM - Puppet freshness on ms-be1009 is CRITICAL: Puppet has not run in the last 10 hours [22:45:30] PROBLEM - Puppet freshness on singer is CRITICAL: Puppet has not run in the last 10 hours [22:45:30] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [22:45:31] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours [22:45:31] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours [22:45:32] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [22:45:32] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours [22:50:42] binasher: can you osc drop fj_path_sha1 from the filejournal tables? [22:50:53] another todo list item I guess [22:53:12] AaronSchulz: what's fj_new_sha1 ? [22:53:52] and is that droppable now? [22:53:55] the hash of the file that something attempted to store there [22:54:10] only fj_path_sha1 is droppable now [22:55:45] ok, added it to http://wikitech.wikimedia.org/view/Schema_changes#To_be_scheduled [22:56:15] maybe i'll run all of the pending ones during the allstaff [22:57:17] i'll just *have* to have my laptop open to keep an eye on them, damn! [23:02:05] New patchset: Andrew Bogott; "Update wiki instance-status pages automatically." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/23155 [23:02:59] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/23155 [23:16:00] binasher: where can i find redirector.c? i see the redirector binary in /puppet/files/squid/redirector but can't find the .c [23:16:46] it's with the squid pkg source [23:16:48] hold on [23:17:27] awjr: the operations/debs/squid repo [23:20:27] thanks binasher [23:21:15] binasher is there a bug or rt ticket open about the issue? [23:21:50] not that i know of, i think erik just emailed patrick from the offsite [23:36:13] binasher: you should ask about mem use in #swiftstack [23:36:31] bah, having weird issues compiling redirector.c to test - redirector.c:(.text.startup+0xa9): undefined reference to `pcre_compile' [23:36:59] AaronSchulz: yeah.. although, i wouldn't be surprised if the stock answer was "upgrade to 1.7.1" [23:38:38] awjr: -lpcre ? [23:38:55] binasher yeah im doing gcc -O3 -o redirector -lpcre redirector.c [23:46:46] awjr: does the same compile the version in master ok on your test system? [23:47:36] binasher: no [23:49:15] binasher: i was able to compile elsewhere; must be something peculiar to my test box. [23:51:46] New patchset: awjrichards; "Disable redirect to officewiki" [operations/debs/squid] (master) - https://gerrit.wikimedia.org/r/23161 [23:52:03] binasher: https://gerrit.wikimedia.org/r/#/c/23161/ [23:52:27] whoops need to amend [23:53:18] New patchset: awjrichards; "Disable redirect to officewiki" [operations/debs/squid] (master) - https://gerrit.wikimedia.org/r/23161 [23:53:36] binasher ^ [23:55:30] Change merged: Asher; [operations/debs/squid] (master) - https://gerrit.wikimedia.org/r/23161 [23:56:02] hm, i'd rather not deploy to the squids right now though [23:56:07] monday morning i guess [23:57:28] binasher: for sure.
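(Editor's note: the `undefined reference to pcre_compile` failure above is consistent with link-order sensitivity. GNU ld resolves libraries left to right, and toolchains that default to `--as-needed` drop a `-lpcre` that appears before any object referencing it. Reordering the same command, assuming libpcre's dev files are installed, is the usual fix; the box where it "compiled elsewhere" plausibly had an older, more forgiving toolchain.)

```shell
# Libraries go after the objects that reference them:
gcc -O3 -o redirector redirector.c -lpcre
```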