[00:12:31] !log updated Parsoid to 45944a0 [00:12:39] Logged the message, Master [00:19:08] PROBLEM - Puppet freshness on sodium is CRITICAL: No successful Puppet run in the last 10 hours [00:23:08] PROBLEM - Puppet freshness on magnesium is CRITICAL: No successful Puppet run in the last 10 hours [00:31:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:32:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.136 second response time [00:46:25] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:47:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.134 second response time [00:58:25] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:59:15] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.133 second response time [01:01:16] RECOVERY - NTP on ssl3003 is OK: NTP OK: Offset -0.0009568929672 secs [01:03:05] RECOVERY - NTP on ssl3002 is OK: NTP OK: Offset 0.002873778343 secs [01:08:58] PROBLEM - Host mediawiki-lb.esams.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100% [01:09:18] RECOVERY - Host mediawiki-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 90.61 ms [01:20:59] AaronSchulz: indeed [01:21:25] AaronSchulz: I checked the (incomplete) list you gave me the other time and it was all deleted files [02:07:13] !log LocalisationUpdate completed (1.22wmf7) at Thu Jun 20 02:07:13 UTC 2013 [02:07:23] Logged the message, Master [02:12:59] !log LocalisationUpdate completed (1.22wmf6) at Thu Jun 20 02:12:59 UTC 2013 [02:13:07] Logged the message, Master [02:22:16] !log LocalisationUpdate ResourceLoader cache refresh completed at Thu Jun 20 02:22:16 UTC 2013 [02:22:25] Logged the message, Master [02:51:58] New patchset: 
Ori.livneh; "Allow vanadium to log via logmsgbot" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69616 [02:52:58] PROBLEM - DPKG on mc15 is CRITICAL: Timeout while attempting connection [02:53:58] RECOVERY - DPKG on mc15 is OK: All packages OK [03:04:43] New patchset: Ori.livneh; "Set common rsync and dsh parameters in mw-deployment-vars" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57890 [04:30:38] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:31:37] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 4.140 second response time [04:49:00] PROBLEM - NTP on ssl3002 is CRITICAL: NTP CRITICAL: No response from NTP server [04:58:50] PROBLEM - Puppet freshness on manutius is CRITICAL: No successful Puppet run in the last 10 hours [04:59:30] PROBLEM - NTP on ssl3003 is CRITICAL: NTP CRITICAL: No response from NTP server [05:21:35] apergos: morning [05:33:18] morning [05:44:05] there's a swift USN but it doesn't affect us [05:45:28] glad to hear it [05:45:42] did you do another all nighter? [05:47:16] uhm [05:47:17] kind of :) [05:47:23] not really, I woke up at 4am [05:47:46] ouch! [05:48:03] nah it's fine [05:48:22] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69616 [05:58:06] paravoid: thanks [05:58:14] hahaha [05:58:22] :) [05:58:37] it's not like I know exactly what that does [05:58:43] but it seemed harmless enough [06:00:26] i wrote tcpircbot to tim's spec. 
tin doesn't have a public interface like fenari did, so we had to re-do logmsgbot [06:00:46] it's a simple python script that reads from socket and writes to irc, with CIDR based filtering [06:03:29] you should translate mapped ipv4-mapped ipv6 to v4 though :) [06:08:38] paravoid: yeah, that part was just a bad design decision [06:08:42] but the impact is small [06:08:45] seems easy to fix [06:08:47] I'm on it [06:08:58] if only netaddr 0.7.7 that debian unstable has wasn't broken [06:09:01] I'd have a fix already [06:09:34] oh, wheezy too [06:09:35] nice [06:10:32] omg, 0.7.4 is also broken but in a different way [06:14:03] PROBLEM - Host mw1085 is DOWN: PING CRITICAL - Packet loss = 100% [06:14:33] RECOVERY - Host mw1085 is UP: PING OK - Packet loss = 0%, RTA = 0.33 ms [06:17:23] PROBLEM - Puppet freshness on ms-be1001 is CRITICAL: No successful Puppet run in the last 10 hours [06:17:37] paravoid: what's broken? [06:19:01] sec [06:22:36] New patchset: Faidon; "tcpircbot: work with IPv6 & no ACL, clarify option" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69624 [06:22:36] New patchset: Faidon; "tcpircbot: IPv4 cidr instead of IPv4-mapped IPv6" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69625 [06:22:41] ori-l: ^ [06:25:52] paravoid: nice change! testing [06:32:22] New review: Ori.livneh; "Nice changed; verified." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/69625 [06:37:40] *change [06:39:29] did you see both? [06:39:42] ori-l: they're two [06:39:54] I +1'd the other one, too, but gerrit-wm didn't announce it [06:39:59] perhaps because I didn't leave a comment, just the score [06:40:11] ah [06:40:43] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69624 [06:40:52] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69625 [06:41:12] better! [06:41:39] yes! 
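The filtering fixed in the two changes just merged can be sketched in a few lines of Python. This is an illustrative sketch, not tcpircbot's actual code (which at the time used the netaddr library); it uses the stdlib `ipaddress` module instead, and the function name and example inputs are hypothetical:

```python
import ipaddress

def peer_allowed(peer_ip, allowed_cidrs):
    """True if peer_ip falls inside any allowed CIDR.

    On a dual-stack listening socket an IPv4 client is reported as an
    IPv4-mapped IPv6 address such as ::ffff:10.0.0.5; unmap it back to
    plain IPv4 first so it can match an IPv4 CIDR in the ACL.
    """
    addr = ipaddress.ip_address(peer_ip)
    if isinstance(addr, ipaddress.IPv6Address) and addr.ipv4_mapped:
        addr = addr.ipv4_mapped
    return any(addr in ipaddress.ip_network(cidr) for cidr in allowed_cidrs)
```

The unmapping step is the point paravoid raised above: without it, an IPv4 peer seen as `::ffff:a.b.c.d` never matches a plain IPv4 CIDR in the ACL.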
that was a bit ugly before, thanks for that [07:03:10] New patchset: Aklapper; "Bugzilla Weekly Report: Don't list random products but top 5" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69629 [07:04:15] New review: Aklapper; "...to make this consistent with the rest of the existing queries, like the totally similar "Componen..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69629 [07:51:35] gooood morning [07:51:49] hey, hashar [07:52:39] howdy? [07:52:41] <-- sick [07:53:43] oh, sorry to hear that. I'm fine, a bit bored [07:55:12] i'm checking out https://github.com/mozilla/wiki-tests which seems quite nice [07:56:32] one of the developers was on #mediawiki earlier but i missed him [07:56:49] zeljkof is your guy when it comes to selenium tests :) [07:57:31] though we are using a ruby implementation to drive selenium [07:57:58] hashar, ori-l: I have talked with mozilla guys at selenium conference last week [07:58:44] they write tests in python, as far as I know [07:58:53] howdy all looking for the ruby tests that I saw at sel conf I'm from Mozilla [07:59:03] marktraceur pointed him to qa/browsertests [07:59:16] he was looking for test case ideas, I think [08:00:30] ori-l: I do not remember talking to stephend at seconf [08:00:38] ori-l: regarding your Ganglia graph of mw exceptions & fatal, I replied on wikitech-l . 
There is a nagios plugin to check a ganglia metric :) [08:00:44] qa/browsertests is the right place [08:01:12] ori-l: you could follow up with Leslie / Daniel [08:02:17] Yeah, I saw your reply, haven't had the chance to check out the plug-in yet [08:02:41] I did ask for someone from ops to pair with me on this and as you can imagine my inbox was flooded with replies [08:02:58] RECOVERY - NTP on ssl3002 is OK: NTP OK: Offset 0.003545403481 secs [08:03:14] ori-l: you want Daniel :) [08:03:29] if you manage to setup a time with him ahead of time, I am sure I will be happy to help [08:03:54] yeah, i know :) i'm just being a little trollish. people are nice, just busy [08:04:21] do you think monitoring ganglia is the right approach? keep in mind that i wrote the python daemon that generates the ganglia stats, so i have direct high-level access to the underlying data [08:04:47] so that script (or a variant of it) could also emit alerts without having to rely on parsing ganglia [08:05:09] i was looking at various algorithms for anomaly detection but they're mostly too advanced for me [08:05:23] though etsy just open-sourced a library: https://github.com/etsy/skyline [08:05:31] is this for icinga checks? 
[08:05:37] yeah [08:05:51] I'd say go for the underlying data [08:05:59] sometimes ganglia has issues, why involve it [08:06:03] right [08:06:14] anomaly detection is probably too fancy, it's probably adequate to have a rule calibrated to an absolute threshold [08:06:25] for a first tak, absolutely [08:06:27] *take [08:07:02] i haven't written an icinga plugin before, i should take a look [08:07:19] i think i did once before and got a little lost [08:07:42] I've never worked with them either (a reason I didn't volunteer to your email ;-)) [08:10:14] ori-l: I saw skyline too [08:10:24] looks interesting [08:12:15] after reading about it and some other anomaly detection stuff i remember ryan e-mailed the list to say that canonical was interested in getting a dump of ganglia data, and i bet you they meant to use it as training data for a machine learning anomaly detection algorithm [08:13:24] what sort of anomalies? [08:14:26] Nemo_bis: the algorithms don't know/care about the meaning. any significant deviation from established patterns. in the context of failure analysis, you look for, say, spikes in CPU load [08:15:12] hm [08:15:32] so definitely not something like Mozilla's stats on error rates http://laxstrom.name/blag/2013/02/11/fosdem-talk-reflections-23-docs-code-and-community-health-stability/ [08:16:50] ori-l: I am not sure how the mwerrors are counted. Seems it is doing counter += 1 , so that is most probably saved as a counter and hence you should have a rate of the errors [08:17:16] ori-l: the nagios plugin could raise an error whenever the rate of errors is higher than something (like more than 5 errors per minutes) [08:17:26] Nemo_bis: not that exactly, but the overarching goal is the same [08:18:28] paravoid: if you google for 'anomaly detection ddos' you'll find a bunch of interesting papers [08:18:34] hashar: yeah, that's what i'm going to do, i think [08:18:35] hashar: shouldn't it divide by pages served? 
[08:19:06] Nemo_bis: I don't think that is relevant since we have a good steam of pages being served :D [08:19:11] right [08:19:20] regardless of the time of the day [08:24:59] yeah, when someone deploys a bug it's usually quite unambiguous [08:33:01] RECOVERY - NTP on ssl3003 is OK: NTP OK: Offset 0.005766034126 secs [08:33:02] paravoid: hi! Got a few minutes? apergos and I have a puppet layout question for you :) [08:34:13] go ahead [08:34:15] I got a template in the applicationserver module which need a variable to be set differently based on the realm. So I have added a class parameter, then I had to update all the callers in the role class to pass the variable https://gerrit.wikimedia.org/r/#/c/68831/2/manifests/role/applicationserver.pp,unified [08:34:45] So we end up requiring to call 5 times: class { 'applicationserver::config::php': [08:34:45] fatal_log_file => $role::applicationserver::configuration::fatal_log_file[$::realm] [08:35:46] there's only the two values, they only vary by realm... where should we better put that variation to avoid 5 calls like that? [08:35:55] ahh I could call that directly into role::applicationserver::configuration [08:36:40] and convert the fatal_log_file hash into a ? $::realm { 'production' => foo, 'labs' => bar } , then call that application::config::php with the resulting value [08:37:41] does it make any sense ? :-] [08:38:31] bleh the whole role class could use some refactoring [08:42:52] I am rewriting my patch to call the parameterized class in the configuration role class [08:43:57] move ::php into ::common? 
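The selector approach hashar settles on can be sketched in Puppet roughly as below. This is an illustrative fragment, not necessarily the exact code merged in change 68831; the two udp:// destinations are the ones shown in the verification diff later in this log:

```puppet
class role::applicationserver::configuration {
    # Resolve the realm-dependent value once, here, instead of
    # passing a hash lookup at all five call sites.
    $fatal_log_file = $::realm ? {
        'production' => 'udp://10.64.0.21:8420',
        'labs'       => 'udp://10.4.0.58:8420',
    }

    class { 'applicationserver::config::php':
        fatal_log_file => $fatal_log_file,
    }
}
```

Since this role class is included everywhere the app servers are configured, the parameterized class is called in exactly one place and the five duplicated call sites go away.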
[08:44:03] it's not like we have appservers without php [08:45:44] well, you can also cheat and not do this in puppet at all, since it's done in CommonSettings.php [08:46:06] that is true [08:46:07] see the switch( $wmfRealm ) block that sets $wmfUdp2logDest to different values based on the realm [08:47:07] that is for the wmerrors PHP Extensions [08:47:19] I am not sure the wmerrors.log_file is set in CommonSettings.php [08:47:59] well, what if you set $fatal_log_file to 'udp://$wmfUdp2logDest/wmerrors' [08:48:48] which wmerrors are you referring to, btw? [08:51:05] the PHP Extension [08:51:12] that catch the fatals and send them to a log file [08:51:19] (or over udp) [08:54:11] New patchset: Hashar; "vary wmerrors.ini 'fatal_log_file' per realm" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68831 [08:54:16] apergos: ^^^ :-) [08:56:01] trying on labs [08:58:03] btw, you guys saw closedmouth's report on #wikimedia-tech? i don't know how to diagnose that [08:58:28] New review: Hashar; "seems to work fine on integration-puppet.pmtpa.wmflabs labs instance :-)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68831 [08:59:08] I see it now [09:00:37] S1 slaves are lagged out http://noc.wikimedia.org/dbtree/ [09:00:49] db1043 db1049 db1050 db1051 and db1052 [09:00:58] though db63 seems fine [09:02:23] yes just the eqiad slaves [09:04:04] some lag started around 8:49 UTC [09:04:18] seems to be resolving [09:04:36] hourly lag graph http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&title=mysql_slave_lag&mreg[]=%5Emysql_slave_lag%24&hreg[]=db1051&aggregate=1&hl=db1051.eqiad.wmnet%7CMySQL+eqiad [09:05:37] yeah there's only one lagged now [09:06:35] I was looking at the processlist but things seem to be moving through on master [09:12:01] apergos: so the patch is a bit nicer now https://gerrit.wikimedia.org/r/#/c/68831/3/manifests/role/applicationserver.pp,unified [09:12:18] the call to the parameterized class is now in the 
role::applicationserver::configuration [09:12:24] Wikimedia Platform operations, serious stuff | Log: http://bit.ly/wikisal | Channel logs: http://ur1.ca/edq22 | MediaWiki error counts: https://tinyurl.com/n3twd8k | on RT duty: RobH [09:12:27] which is loaded from everywhere [09:12:39] yes, I have been looking at it [09:12:45] this definitely is nicer [09:14:05] \O/ [09:22:57] New patchset: Hashar; "vary wmerrors.ini 'fatal_log_file' per realm" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68831 [09:23:43] New review: Hashar; "Prefixed the configuration class with 'php':" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68831 [09:28:33] New patchset: Hashar; "vary wmerrors.ini 'fatal_log_file' per realm" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68831 [09:28:33] New patchset: Hashar; "PHP fatal destination is now a class parameter" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68830 [09:28:45] New review: Hashar; "rebased" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68831 [09:28:53] apergos: they are good to go :-) [09:29:02] once merged in puppet, I can try them out on the beta apaches [09:29:07] okay [09:29:07] sec [09:29:12] then merge in sock puppet and try out in prod :-] [09:30:07] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68830 [09:30:57] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68831 [09:31:04] ok they're both in [09:31:13] trying out on labs [09:31:16] err on beta [09:31:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:32:11] -wmerrors.log_file=udp://10.64.0.21:8420 [09:32:11] +wmerrors.log_file=udp://10.4.0.58:8420 [09:32:12] :-] [09:32:23] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [09:32:27] now I have no idea 
how to generate a fatal [09:33:13] uh oh [09:40:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:41:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time [09:43:04] arghg [09:43:12] unrelated screaming [09:51:14] PROBLEM - Puppet freshness on erzurumi is CRITICAL: No successful Puppet run in the last 10 hours [09:51:14] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [09:51:14] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [09:51:14] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [09:51:14] PROBLEM - Puppet freshness on mc15 is CRITICAL: No successful Puppet run in the last 10 hours [09:51:15] PROBLEM - Puppet freshness on ms-fe3001 is CRITICAL: No successful Puppet run in the last 10 hours [09:51:15] PROBLEM - Puppet freshness on spence is CRITICAL: No successful Puppet run in the last 10 hours [09:51:16] PROBLEM - Puppet freshness on virt1 is CRITICAL: No successful Puppet run in the last 10 hours [09:51:16] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours [09:51:17] PROBLEM - Puppet freshness on virt4 is CRITICAL: No successful Puppet run in the last 10 hours [09:56:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:57:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.157 second response time [10:00:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:03:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time [10:19:21] PROBLEM - Puppet freshness on sodium is CRITICAL: No successful Puppet run 
in the last 10 hours [10:23:21] PROBLEM - Puppet freshness on magnesium is CRITICAL: No successful Puppet run in the last 10 hours [10:26:33] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:27:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time [10:34:00] poor stafford [11:02:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:03:23] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time [11:11:09] New review: Nikerabbit; "Did you forgot to sync this?" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68947 [11:35:07] mark: https://gerrit.wikimedia.org/r/#/q/project:operations/debs/ircd-ratbox+owner:%22AzaToth+%253Cazatoth%2540gmail.com%253E%22,n,z [11:35:20] mark: ryan wanted you to perhaps look into it [11:45:47] snack time [11:59:17] New patchset: Nikerabbit; "ULS deployment phase 3" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69643 [12:01:59] New review: Nikerabbit; "Planned for 2013-06-25" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69643 [12:12:00] is there anybody here who can increase the max_user_connections to the replicas for a tool labs project? [12:22:39] JohannesK_WMDE: #wikimedia-labs :-D [12:22:48] JohannesK_WMDE: and/or fill in a bug :-] [12:23:55] hashar: folks in #wikimedia-labs directed me here [12:24:54] JohannesK_WMDE: so a bug will do it :-] [12:25:05] apparently asher can do it but he's not here yet, so i wanted to know if anybody else can do it. we need it urgently. [12:25:08] east coast staff will connect soon [12:25:55] follow up on -labs [12:54:59] PROBLEM - swift-object-server on ms-be2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[12:54:59] PROBLEM - swift-object-replicator on ms-be2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:54:59] PROBLEM - Disk space on ms-be2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:54:59] PROBLEM - swift-container-auditor on ms-be2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:55:18] PROBLEM - swift-account-reaper on ms-be2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:55:18] PROBLEM - RAID on ms-be2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:55:28] PROBLEM - swift-account-server on ms-be2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:55:28] PROBLEM - swift-account-replicator on ms-be2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:55:28] PROBLEM - swift-object-updater on ms-be2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:55:38] PROBLEM - swift-container-replicator on ms-be2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:55:48] PROBLEM - swift-container-server on ms-be2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:55:48] PROBLEM - swift-container-updater on ms-be2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:55:49] PROBLEM - swift-account-auditor on ms-be2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:55:58] PROBLEM - swift-object-auditor on ms-be2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:55:58] PROBLEM - DPKG on ms-be2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[13:27:37] !log dns update [13:27:46] Logged the message, Master [13:53:10] New patchset: coren; "Tool Labs: Bump max_user_connections to 512" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69648 [13:58:06] New review: Demon; "(1 comment)" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/69648 [13:59:08] New patchset: coren; "Tool Labs: Bump max_user_connections to 512" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69648 [14:01:16] New patchset: coren; "Tool Labs: Bump max_user_connections to 512" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69648 [14:08:51] mark, are you available to stand by and test after I merge https://gerrit.wikimedia.org/r/#/c/68584/? [14:16:57] !log jenkins updating all mediawiki extensions unit testing jobs ( mwext-.*-testextensions-master' [14:17:06] Logged the message, Master [14:39:10] New patchset: Cmjohnson; "adding cp1056-cp1070 dhcpd/fixing spaces" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69663 [14:41:10] New patchset: Cmjohnson; "adding cp1056-cp1070 dhcpd/fixing spaces" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69663 [14:41:54] Change merged: Cmjohnson; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69663 [14:59:10] PROBLEM - Puppet freshness on manutius is CRITICAL: No successful Puppet run in the last 10 hours [15:10:36] PROBLEM - SSH on ms-be2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:22:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:24:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time [15:26:06] New review: Andrew Bogott; "My previous comment is incorrect; in labs we need to replace ldap::client::wmf-test-cluster with lda..." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/69337 [15:39:49] hashar, faidon, any idea what the story is with puppet class nfs::server? It doesn't seem to be used anywhere. [15:52:04] New patchset: Andrew Bogott; "Moved nfs manifest into a module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69682 [15:52:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:52:50] New patchset: Andrew Bogott; "Moved nfs manifest into a module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69682 [15:53:11] New review: Andrew Bogott; "Work in Progress -- do not merge" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69682 [15:54:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.123 second response time [15:54:31] !log updated Parsoid to bf8d3df [15:54:41] Logged the message, Master [15:57:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:58:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.123 second response time [16:08:20] !log nuked neon puppet.log b/c /var/log was 99% full [16:08:29] Logged the message, Master [16:18:21] PROBLEM - Puppet freshness on ms-be1001 is CRITICAL: No successful Puppet run in the last 10 hours [16:20:26] apergos: did you see about ms-be2? [16:33:28] New review: Andrew Bogott; "This is now tested and ready for merge, pending explanation of the weird nfs::server class in the ol..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69682 [16:37:52] Hey Reedy, have you cut wmf8 yet? I want to make sure a last centralauth change goes in... [16:37:59] Yeah [16:38:38] 50 minutes ago apparently [16:38:38] https://git.wikimedia.org/log/mediawiki%2Fcore.git/refs%2Fheads%2Fwmf%2F1.22wmf8 [16:42:24] New review: Faidon; "See inline. 
You're using tabs & spaces inconsistently but in any case, the agreement is to use 4-spa..." [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/69682 [16:45:45] paravoid, oops, I forgot to type 'git review' after fixing all the whitespace [16:45:50] thanks for reading, new patch coming soon [16:47:11] paravoid, you think I should just excise the monitoring stuff? That's all cut-n-paste, not sure what it's about. [16:47:52] not the monitoring [16:48:03] just move the subclass's content outside [16:48:13] I didn't see the monitoring class being referenced anywhere else [16:48:18] (unless I was wrong) [16:51:12] the class is defined and then immediately included. So that means it does something, doesn't it? [16:51:29] Um… unlike the 'backup' class whcih is not included. Hm... [16:52:00] yes, that's my point [16:52:18] instead of class monitoring { foo } include monitoring [16:52:20] just do foo [16:52:35] it's a single definition inside, it's not like the class serves as a grouping [16:52:36] oh, I see what you're saying, ok. [16:53:00] but, the 'backup' class is just dead code, isn't it? Or am I misunderstanding how that works? [16:53:37] !log reedy synchronized php-1.22wmf8/ 'initial sync' [16:53:45] Logged the message, Master [16:53:57] no idea [16:55:44] !log reedy synchronized docroot and w [16:55:53] Logged the message, Master [17:00:54] Dang. Reedy, let me know when your wmf8 deploy is done, and I'll push the latest centralauth. Sorry about that. [17:04:49] New review: Andrew Bogott; "retabbed" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69682 [17:04:51] New patchset: Andrew Bogott; "Moved nfs manifest into a module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69682 [17:08:01] New patchset: Andrew Bogott; "Moved nfs manifest into a module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69682 [17:08:59] grrr [17:09:09] !log reedy Started syncing Wikimedia installation... 
: rebuild localisation cache and testwiki to 1.22wmf8 [17:09:17] Logged the message, Master [17:21:01] apergos: Could you kill php-1.22wmf2 from snapshot3 please? [17:21:44] Reedy, Campaigns is a new extension (already deployed in 1.22wmf6 and 7) but it wasn't in make-wmf-branch/default.conf. https://gerrit.wikimedia.org/r/#/c/69691/ adds it. Sorry I missed it [17:22:08] Feel free to make a patchset just adding it to wmf/1.22wmf8 and I'll make sure it's synced out [17:33:36] !log reedy Finished syncing Wikimedia installation... : rebuild localisation cache and testwiki to 1.22wmf8 [17:33:44] Logged the message, Master [17:37:47] !log reedy synchronized php-1.22wmf8/extensions/ 'Sync Campaigns and CentralAuth' [17:37:55] Logged the message, Master [17:38:16] New patchset: Reedy; "testwiki to 1.22wmf8" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69696 [17:38:26] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69696 [17:39:18] New patchset: Reedy; "(bug 49358) Remove MoodBar from it.wikivoyage" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68352 [17:39:39] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68352 [17:40:00] !log LocalisationUpdate completed (1.22wmf7) at Thu Jun 20 17:39:59 UTC 2013 [17:40:07] Logged the message, Master [17:40:47] !log LocalisationUpdate completed (1.22wmf6) at Thu Jun 20 17:40:46 UTC 2013 [17:40:49] New patchset: Reedy; "(bug 49575) Set up $wgImportSources for vec.wiktionary" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68655 [17:40:55] Logged the message, Master [17:41:10] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68655 [17:41:35] New patchset: Reedy; "(bug 49612) Localise $wgSitename for fr.wikibooks" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68858 [17:41:57] Change merged: 
jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68858 [17:42:14] New patchset: Reedy; "(bug 49335) Modify wgNamespacesToBeSearchedDefault for ukwikinews" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69160 [17:42:32] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69160 [17:48:21] New patchset: Reedy; "Remove narayam and webfonts from extension-list" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69699 [17:49:34] New patchset: Reedy; "Remove narayam and webfonts from extension-list" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69699 [17:49:42] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69699 [17:50:39] New patchset: Reedy; "Bug 48354: exclude MediaWiki: namespace" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68183 [17:51:00] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68183 [17:51:29] Change abandoned: Reedy; "I created a dupe of this and already merged" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68648 [17:51:51] !log LocalisationUpdate completed (1.22wmf8) at Thu Jun 20 17:51:50 UTC 2013 [17:52:03] Logged the message, Master [17:52:14] New patchset: Reedy; "(bug 46244) Enable wmgUseVectorFooterCleanup on ilowiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68298 [17:52:33] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68298 [17:54:09] !log reedy synchronized wmf-config/ [17:54:17] Logged the message, Master [17:57:02] !log LocalisationUpdate ResourceLoader cache refresh completed at Thu Jun 20 17:57:01 UTC 2013 [17:57:09] Logged the message, Master [17:57:54] Ryan_Lane: had time to test? 
[17:57:59] 19:47 < average> does the Depends: field accept any form of the "OR" operator ? [17:58:02] 19:47 < average> for example, there are multiple packages providing JDK (the Java JDK) [17:58:05] 19:47 < average> and I want to do stuff like [17:58:07] 19:47 < algernon> yes. "|" [17:58:10] 19:47 < average> Depends: sun-java6 OR sun-java7 OR gcj OR openjdk [17:58:12] 19:47 < algernon> look at any of the java packages for an example :) [17:58:15] 19:48 < average> algernon: could you please point me to an example ? [17:58:18] 19:49 < wRAR> average: you should read the policy [17:58:21] 19:49 < wRAR> 7.1 Syntax of relationship fields in this case [17:58:23] 19:50 < algernon> average: ant is one such example, but see the policy as wRAR mentioned [17:58:24] average_drifter: stop schpamming [17:58:26] 19:50 < babilen> average: You really shouldn't depend on sun-* java *at all* but on default-jdk (that is openjdk) [17:58:29] 19:51 < babilen> In particular not Sun's/Oracle's JDK6 as it hasn't been maintained in quite a while and is a security nightmare. [17:58:35] paravoid: what is your oppinion on the above ? [17:58:38] quoted from #debian-mentors on irc.debian.org [17:58:42] AzaToth: should have used a pastie or gist, sorry [18:00:05] average_drifter: you should never depend on sun(oracle) java [18:00:41] AzaToth: ok, would you agree that whenever JDK is a dependency, I should use default-jdk to provide it ? 
[18:00:59] average_drifter: difficult to answer [18:01:19] for buck I had to depend on openjdk-7-jdk [18:01:32] as default-jdk points to 6 [18:01:35] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: mediawikiwiki, test2wiki and testwikidatawiki to 1.22wmf8 [18:01:45] Logged the message, Master [18:02:01] average_drifter: but normally, default-jdk is the correct one [18:02:42] New patchset: Reedy; "test2wiki, testwikidatawiki and mediawikiwiki to 1.22wmf8" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69701 [18:03:14] AzaToth: average_drifter is working on dclass that needs to run in the hadoop cluster, the hadoop cluster is still using sun java6 so whatever jdk is chosen it needs to be compatible with sun java 6 [18:03:40] drdee: I would assume it's forward compatible [18:03:46] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: rest of wikipedias to 1.22wmf7 [18:03:49] i.e. it can be run on jdk7 [18:03:54] Logged the message, Master [18:04:19] well if you compile it with jdk7 then i don't think it will run with sun java6 [18:04:29] drdee: is the hadoop cluster using oracle java6 or openjdk 6? [18:05:45] drdee: afaik you are able to specify target version [18:05:48] oracle java6 [18:05:52] !log updated Parsoid to b206b54 [18:05:59] Logged the message, Master [18:06:28] drdee: ain't that against policy?
[18:06:36] AzaToth: not just yet [18:06:37] oracle java isn't opensource [18:06:59] right please don't beat a dead horse [18:07:00] Ryan_Lane: okidoki [18:07:09] that discussion has been had multiple times [18:07:20] drdee: sorry, didn't get the memo [18:07:21] we will migrate as soon as hadoop is openjdk compatible [18:07:35] New patchset: Reedy; "Wikipedias to 1.22wmf7" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69702 [18:10:32] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69701 [18:10:37] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69702 [18:11:43] PROBLEM - Apache HTTP on mw1041 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error - 50027 bytes in 0.009 second response time [18:13:24] drdee: damn, I was searching for info about hadoop on google and ended up on quora.com, and to read more than one "answer" you need to login using google or facebook, and it demands to be able to "Manage your contacts" [18:14:45] RECOVERY - Apache HTTP on mw1041 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.068 second response time [18:15:21] average_drifter: I would assume that even though hadoop required oracle java 6, you can still complile additions/plugins using openjdk [18:16:27] AzaToth: you would assume.. but I should try it to confirm it [18:16:38] AzaToth: openjdk and oracle java6 are compatible ? 
[18:17:08] average_drifter: they should make complatible byte code [18:17:24] compatible* [18:18:04] that sounds encouraging [18:18:13] New patchset: Andrew Bogott; "Moved nfs manifest into a module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69682 [18:18:13] New patchset: Andrew Bogott; "Move generic::rsyncd into its own module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69703 [18:18:20] I would say though to try to find an alternative to hadoop [18:19:40] pls 2 +2 https://gerrit.wikimedia.org/r/#/c/69648/ ? [18:20:26] AzaToth: you are courageous [18:20:34] AzaToth: there are alternatives like http://www.iterativemapreduce.org/ [18:21:05] AzaToth: but I think kraken is a mature codebase using hadoop. the team already has a lot of knowledge on hadoop [18:21:38] k [18:21:44] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69648 [18:23:19] Ryan_Lane, can you have a look at https://gerrit.wikimedia.org/r/#/c/69337/ ? [18:23:25] New patchset: coren; "Tool Labs: +qt4-make in dev (user request)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69704 [18:23:32] It's not as bad as it looks :) [18:23:47] average_drifter: I would assume it's a bit out of my league, so I'll keep my mouth shut [18:24:05] andrewbogott: heh. yeah. gimme a bit [18:24:46] Ryan_Lane, mostly I'd like to talk through the process of merging as I merge, since there are multiple steps. But, yeah, whenever you have a change. [18:24:52] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69704 [18:25:28] AzaToth: no no, it's not like that, I very much appreciate your opinion. don't worry, it's an open discussion [18:26:46] AzaToth: Kraken will be a recurring topic anyway. the package I'm trying to make with your help, Andrew's and Faidon's, is just one step towards that goal.
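On the forward-compatibility question discussed above: bytecode built by a newer JDK only loads on Java 6 if the class-file target is pinned. A sketch, assuming a Maven build (the actual dclass build system isn't stated in-channel); the equivalent plain invocation is `javac -source 1.6 -target 1.6`:

```xml
<!-- Hypothetical pom.xml fragment: emit Java 6 class files even when
     compiling under OpenJDK 7, so the jars still load on the Hadoop
     cluster's Oracle Java 6 runtime. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-compiler-plugin</artifactId>
  <configuration>
    <source>1.6</source>
    <target>1.6</target>
  </configuration>
</plugin>
```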
There will be other packages to come also [18:27:13] *chance [18:31:16] New patchset: Yurik; "Script-updated zero configs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69705 [18:31:36] ^demon: I'm going to work on some of the different search options that seem to be in use in production. I just pushed an update to the TODO file with a list. [18:32:13] <^demon> Okie dokie [18:32:28] andrewbogott: it's mostly just checking a few things [18:32:45] andrewbogott: I'm assuming you tested this on a puppetmaster::self instance? [18:32:58] I played for a bit with java packaging and whitelisting dependencies - that'll be exhausting if we go that way but less bad than .deb-ing all the dependencies from source. [18:33:57] Ryan_Lane, yep, tested in several configs [18:34:09] cool. that's my biggest concern [18:34:12] that and gerrit [18:34:18] and the other web services that use ldap [18:36:49] <^demon> Ryan_Lane: Solr won't use ldap :) [18:37:00] solr? [18:37:10] I don't think I mentioned ti :) [18:37:11] *it [18:37:30] <^demon> I missed the jump from java packaging to ldap. [18:37:42] manybubbles: I don't see how it'd be so exhausting but I don't mind that much either [18:38:07] average_drifter: haven't done any mapreducing in my life, so it is ooml [18:38:27] <^demon> Ryan_Lane: If we want saner gerrit packaging, someone other than me has to review AzaToth's work ;-) [18:38:52] don't know how to take that ツ [18:39:07] ^demon: yeah. I'll be taking a look at it soon [18:39:30] s/take that/analyse the meaning of that statement/ [18:39:35] <^demon> AzaToth: That I'm not qualified to review it :) [18:39:56] ^demon: I would assume you have some knowledge about gerrit right? [18:40:05] lol.
[18:40:14] <^demon> No tell me more :p [18:40:18] :-P [18:40:29] we're seeing 503s and 504s on some special pages on the mobile version of the site [18:40:52] <^demon> awjr: Get 505s and you take home a prize :) [18:41:13] hehehe [18:41:13] Ryan_Lane, the other steps are… 1) change defaultclasses to use ldap::role::client::labs 2) test new instance creation 3) change puppet class for all existing instances [18:41:25] I spiffed up puppetValues.php to make step 3 easy [18:42:34] ^demon: I assume you want to take a look at https://gerrit.wikimedia.org/r/#/q/project:operations/debs/ircd-ratbox+owner:%22AzaToth+%253Cazatoth%2540gmail.com%253E%22,n,z instead [18:42:38] New patchset: Reedy; "Update gitweb/gitblit RSS" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68415 [18:43:08] hmm now I'm seeing issues on special:watchlist on desktop enwiki [18:43:10] Due to high database server lag, changes newer than 72 seconds may not appear in this list. [18:43:11] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68415 [18:43:16] related? [18:43:53] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:44:43] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time [18:45:04] ^demon: actually, could you explain to me why buck can't use system jython and must use bundled standalone atm? [18:45:23] <^demon> Not a clue. [18:46:22] <^demon> Reedy: You seen/filed this ProofreadPage getCode() exception? [18:46:23] I had the same question [18:46:28] Yus [18:46:39] testwiki is rather broken currently [18:46:48] <^demon> I hate proofread page. [18:46:59] paravoid: I got funky errors when using /usr/share/java/jython.jar, even though they had the same version [18:47:00] phase it out ?
[18:47:11] I'm guessing it's related to the recent refactoring [18:47:30] AzaToth: that's weird [18:47:36] <^demon> Reedy: It's kind of spammy. fatal.log isn't useful atm. [18:47:58] Not so bad in the apache logs [18:48:03] I did notice more from job runners [18:48:32] https://www.mediawiki.org/wiki/MediaWiki_1.22/wmf8/Changelog#ProofreadPage [18:48:40] andrewbogott: cool [18:48:54] I'll move it back a version in wmf8 [18:49:05] ^demon/Ryan_Lane: could you ±2 https://gerrit.wikimedia.org/r/69607 https://gerrit.wikimedia.org/r/69608 and https://gerrit.wikimedia.org/r/69609 [18:49:26] <^demon> I can't, no permissions. [18:49:27] AzaToth: http://bugs.debian.org/589436 [18:49:35] you might want to coordinate with Thomas [18:49:41] Not sure if it's actually PP itself, or core [18:49:47] and I can sponsor all of them of course [18:50:05] <^demon> That dependency list is a mess. [18:50:06] ewwwy, 300s db lag [18:50:15] AzaToth: path conflict on 69607 [18:50:28] <^demon> Reedy: Any insight on this db lag? [18:50:30] Where? [18:50:36] Ryan_Lane, are you about to go to lunch? I'd prefer to have you on hand when I merge just in case. [18:50:37] Ryan_Lane: on what? [18:50:45] <^demon> !replag [18:50:47] enwiki :< [18:50:52] <^demon> @replag [18:50:53] ^demon: [s1] db1049: 339s [18:50:53] Ryan_Lane: remember you replaced the git yesterdaty [18:51:10] Just db1049 [18:51:21] db1049 [18:51:25] silly db [18:51:33] Ryan_Lane: i.e. fetch, reset, checkout wmf [18:51:34] Watchlists/recentchanges etc [18:52:03] <^demon> AzaToth, paravoid: Fwiw, one of those dependencies isn't very telling. h2 has to be decoupled from core and made optional. [18:52:16] <^demon> Which upstream hasn't done. [18:52:22] hm? [18:52:46] db1049 is full of mobile watchlist queries [18:53:03] h2? [18:53:10] <^demon> paravoid: There was some debate over requiring the h2 database (it's used for on-disk caching). 
[18:53:16] ah [18:53:41] awjr: ping [18:53:47] <^demon> So the ITP links to the "package h2" but iirc on-list there was a request to just drop the h2 requirement. [18:53:55] pong paravoid [18:53:57] SpecialMobileWatchlist::doFeedQuery [18:54:06] paravoid: yeah - can you do an explain on that? [18:54:31] paravoid: http://paste.debian.net/11630/ [18:55:35] paravoid: as long buck is used and buck is fuck, gerrit is nogo on debian proper [18:56:27] ^demon: I'm not depending on h2 packagewize [18:56:44] andrewbogott: well, I'm going home after lunch [18:56:52] <^demon> AzaToth: On-disk caching requires it... [18:57:01] <^demon> Oh, or you just letting buck deal with it? [18:57:01] (today is my birthday, so I may not be around a lot) [18:57:02] ^demon: don't know of what list of dependices you are referring to [18:57:16] <^demon> The one on the ITP: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=589436 [18:57:30] <^demon> Specifically http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=607891 [18:58:08] AzaToth: I'd discuss the whole situation with Thomas, he's been very active on the java packaging team [18:58:32] Ryan_Lane, ok, I'll start now and hope things break before you go :) [18:58:37] ^demon: currently buck maven gerrit deps [18:58:40] Also, happy birthday! Mine was Saturday. [18:59:06] <^demon> AzaToth: Yeah I know, but buck has an offline mode supposedly, to let you rely on system-provided dependencies. [18:59:16] it has? [18:59:23] <^demon> Supposedly. [18:59:42] well, buck is so fuck I've no idea what it can do [19:00:00] New patchset: Andrew Bogott; "Move ldap into a module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69337 [19:00:11] <^demon> I just type things into it and pray. 
[19:00:56] only thing I can see is "--build-dependencies (-b) [FIRST_ORDER : How to handle including dependencies [19:00:56] _ONLY | WARN_ON_TRANSITIVE | TRANSITIV : [19:00:56] E]" [19:01:34] Reedy: hm, the config is not merged is it [19:02:52] <^demon> Ugh, soy templates. [19:02:53] <^demon> I hate soy [19:02:54] andrewbogott: oh. cool. happy belated birthday to you! [19:03:00] New review: Andrew Bogott; "recheck" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69337 [19:03:12] oh [19:03:17] who has his birthday? [19:03:39] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69337 [19:04:44] Ryan_Lane, want to force a puppet run on gerrit and see how it copes? I'm doing the labs test... [19:07:47] paravoid: approx 10 million males have birthday today [19:08:04] paravoid, including Ryan. [19:08:40] Nemo_bis: It is, not synced [19:08:52] oki [19:09:45] oh! [19:09:48] Ryan_Lane: happy birthday! [19:09:52] andrewbogott: and you too :) [19:10:03] thanks :) [19:10:45] ^demon: maven install in gerrit is defined in tools/build.defs [19:11:00] thanks [19:11:54] Reedy, I'm sure you noticed but I'm seeing a lot of Fatal error: Call to a member function getCode() on a non-object at /usr/local/apache/common-local/php-1.22wmf8/includes/GlobalFunctions.php on line 1288 [19:12:44] spagewmf: There's already a bug logged and the ProofreadPage guy is looking into it [19:13:14] spagewmf: https://bugzilla.wikimedia.org/49897 [19:14:14] Reedy, paravoid any thoughts about what's causing the problem? 
as far as i can tell that query looks sane [19:14:20] MaxSem: ^ [19:14:41] 431 out of 494 queries is SpecialMobileWatchlist::doFeedQuery [19:14:45] I see a shitload of queries for the same people [19:14:50] all of them running for half an hour or so [19:14:51] The number of total queries is dropping [19:14:54] and it's all for WMF people [19:14:59] * AzaToth didn't even notice enwiki was down [19:15:17] There's maybe 20 queries copying to temporary tables [19:15:19] it's 4 people [19:15:24] 400 queries [19:15:33] can we just kill the queries for the wmf folks? [19:15:42] we're basically all in the same room right now anyway [19:15:47] Due to high database server lag, changes newer than 999 seconds may not appear in this list. [19:16:09] srsly. wikiadmin cannot kill wikiuser queries [19:16:13] <^demon> manybubbles: Can you rebuild the index for enwikiquote? I see the others were rebuilt but enwikiquote was still on the old one. [19:16:20] paravoid, hi, could you +2 minor IP fix for zero? https://gerrit.wikimedia.org/r/#/c/69705/ [19:16:24] <^demon> With the new schema, that is. [19:17:03] New patchset: Tpt; "(bug 49897) configure properly page and index namespaces for test2" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69755 [19:17:11] 403 active queries [19:17:17] 385 [19:17:18] Served by mw1214 in 56.965 secs [19:17:19] Ryan_Lane: I can create a new instance and log into it… that seems like a good sign. Do you want to test (or want me to test) anything else before I change all the labs ldap entries? [19:17:40] only 6 queries copying to tmp currently [19:18:07] andrewbogott: did the initial puppet run go as well? [19:18:18] yep, clean puppet runs on the new instance. 
[19:18:22] cool [19:18:26] I'd say go for it [19:18:50] On virto the puppet run had a big diff but all things like --A INPUT -m comment --comment deny_all_glance_api_glance_api -p tcp -j DROP --dport 9292 [19:19:02] which I presume is routine and unrelated… seems to be stuff like that frequently [19:19:08] *virt0 [19:19:56] the forced index is slowing it down, somehow [19:20:33] 1032 seconds lag now [19:20:52] AzaToth, we know [19:21:24] MaxSem: but I don't understand why pageload takes such a long time (60 seconds approx) [19:21:40] DB server overload [19:22:04] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69755 [19:22:52] !log reedy synchronized wmf-config/ [19:23:01] Logged the message, Master [19:26:30] paravoid: note that the explain doesn't show use index for rc_timestamp [19:26:39] that might be a symptom of the 'order by' rather than the force index [19:27:46] does "Sending data" mean that it's sending out a billion rows or it's just a hung connection? [19:27:47] and yet the force index one takes half an hour or more to run and removing that returns instanteously [19:30:59] commit 87e6622714c02d264d7032e9d2062fd192ca3156 [19:30:59] Author: Max Semenik [19:31:00] Date: Fri Jun 7 23:02:55 2013 +0400 [19:31:00] Force index to avoid filesort in feed query [19:31:04] when was this deployed? [19:31:36] also its parent [19:31:45] awjr, MaxSem: ^ [19:32:14] prolly this week [19:32:16] likely on the 11th, paravoid [19:32:18] or the 18th [19:32:30] okay, revert them. 
[19:32:37] awjr, it wasn't merged before this week's deployment [19:33:05] fairly sure it's either of the two [19:33:16] I'm double checking our logs [19:33:16] removing the force index makes the queries run instantaneously [19:33:30] the rc_type one otoh, we don't have an index that includes that [19:33:51] it was deployed the 18th, this week [19:33:51] http://www.mediawiki.org/wiki/Extension:MobileFrontend/Deployments/2013-06-18 [19:34:43] can you revert? [19:34:48] aye [19:35:04] judging from the queries no one uses the watchlist [19:35:19] so it didn't manifest until you had your meeting and all of you played with the feature :) [19:35:36] cascading failure [19:35:55] hehe [19:36:10] heheeh [19:38:27] MaxSem: think we can get that out after the MW deploy window? [19:38:54] not now? [19:38:58] now please [19:39:06] now is fine with me, but check with folks currently deploying [19:39:16] Reedy, i think? [19:39:43] I'm not doing anything atm [19:39:47] paravoid, if you're still working I'd appreciate a second read of https://gerrit.wikimedia.org/r/#/c/69682/ and its dependency, https://gerrit.wikimedia.org/r/#/c/69703/ [19:39:49] engage!
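The revert agreed on above removes a FORCE INDEX hint. As a rough sketch of the failure mode (hypothetical query shape and names, not the actual SpecialMobileWatchlist::doFeedQuery SQL):

```sql
-- Hypothetical reconstruction: the hint pins the optimizer to a
-- timestamp index to avoid a filesort, but for watchlist-style joins
-- that plan can scan far more rows than the one MySQL would pick itself.
SELECT rc_title, rc_timestamp
FROM recentchanges FORCE INDEX (rc_timestamp)  -- hint removed by the revert
JOIN watchlist ON wl_namespace = rc_namespace AND wl_title = rc_title
WHERE wl_user = 12345
ORDER BY rc_timestamp DESC
LIMIT 50;
```

Per the discussion in-channel, dropping the hint let the same queries return immediately.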
[19:39:50] Deleting a load of bad translation pages [19:40:34] MaxSem: ^^ [19:40:46] in process [19:42:11] PROBLEM - Host ms-fe3001 is DOWN: PING CRITICAL - Packet loss = 100% [19:42:59] (that's me, putting an H710 controller in) [19:46:47] !log maxsem synchronized php-1.22wmf8/extensions/MobileFrontend [19:46:56] Logged the message, Master [19:47:21] RECOVERY - Host ms-fe3001 is UP: PING OK - Packet loss = 0%, RTA = 87.87 ms [19:48:22] !log maxsem synchronized php-1.22wmf7/extensions/MobileFrontend [19:48:31] Logged the message, Master [19:52:11] PROBLEM - Puppet freshness on erzurumi is CRITICAL: No successful Puppet run in the last 10 hours [19:52:11] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [19:52:11] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [19:52:12] PROBLEM - Puppet freshness on mc15 is CRITICAL: No successful Puppet run in the last 10 hours [19:52:12] PROBLEM - Puppet freshness on ms-fe3001 is CRITICAL: No successful Puppet run in the last 10 hours [19:52:12] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [19:52:12] PROBLEM - Puppet freshness on spence is CRITICAL: No successful Puppet run in the last 10 hours [19:52:13] PROBLEM - Puppet freshness on virt1 is CRITICAL: No successful Puppet run in the last 10 hours [19:52:13] PROBLEM - Puppet freshness on virt4 is CRITICAL: No successful Puppet run in the last 10 hours [19:52:14] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours [19:52:21] PROBLEM - Host knsq17 is DOWN: PING CRITICAL - Packet loss = 100% [19:52:31] PROBLEM - Host knsq16 is DOWN: PING CRITICAL - Packet loss = 100% [19:52:41] PROBLEM - Host knsq20 is DOWN: PING CRITICAL - Packet loss = 100% [19:52:42] PROBLEM - Host knsq19 is DOWN: PING CRITICAL - Packet loss = 100% [19:52:42] PROBLEM - Host knsq18 is DOWN: PING CRITICAL - Packet loss = 
100% [19:52:42] PROBLEM - Host knsq23 is DOWN: PING CRITICAL - Packet loss = 100% [19:52:42] PROBLEM - Host knsq22 is DOWN: PING CRITICAL - Packet loss = 100% [19:53:22] PROBLEM - Host knsq21 is DOWN: PING CRITICAL - Packet loss = 100% [19:54:01] PROBLEM - Host knsq26 is DOWN: PING CRITICAL - Packet loss = 100% [19:54:01] PROBLEM - Host knsq24 is DOWN: PING CRITICAL - Packet loss = 100% [19:54:05] !lag [19:54:13] (that's me, turning them off) [19:54:51] PROBLEM - Host knsq28 is DOWN: PING CRITICAL - Packet loss = 100% [19:54:52] PROBLEM - Host knsq27 is DOWN: PING CRITICAL - Packet loss = 100% [19:54:52] PROBLEM - Host knsq29 is DOWN: PING CRITICAL - Packet loss = 100% [19:55:22] ok, time to pack and go home [19:55:58] New patchset: Demon; "How did I make this typo twice?" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69771 [20:03:48] New patchset: Ori.livneh; "Enable NavigationTiming on beta cluster" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69774 [20:04:44] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69774 [20:05:22] ori-l: you are too fast [20:05:52] ori-l: I have edited that bug like a minute ago [20:06:00] hashar: :) [20:06:00] !log olivneh synchronized wmf-config/InitialiseSettings-labs.php 'I83fcaa5ec: enable NavigationTiming on beta cluster' [20:06:07] Logged the message, Master [20:06:43] * hashar reads http://www.mediawiki.org/wiki/Extension:NavigationTiming [20:07:04] authors Asher + Ori + Preilly => too complicated for me :) [20:07:44] that is a nice ext [20:07:47] oh come on, its 2k of php and 3k of js :P [20:08:03] they took it simple for once :) [20:08:08] I can't wait to have all the event logging events available via LIMN or something [20:10:16] !log olivneh synchronized php-1.22wmf7/extensions/CoreEvents 'CoreEvents to d291b64248' [20:10:24] Logged the message, Master [20:10:46] !log olivneh synchronized php-1.22wmf8/extensions/CoreEvents 
'CoreEvents to d291b64248' [20:10:51] ebernhardson, hashar: :D [20:10:54] Logged the message, Master [20:11:11] we need to graph that data, definitely [20:12:39] * Reedy graphs ori-l [20:14:37] Reedy: plot shows suspicious spikes [20:14:55] /\_____/\/\/\/\/\____ [20:19:26] !log Gracefully reloading Zuul to deploy I8c0ac58d9498979b [20:19:34] Logged the message, Master [20:20:11] !log olivneh synchronized php-1.22wmf8/extensions/EventLogging 'EventLogging to 2351b4ccbb' [20:20:18] PROBLEM - Puppet freshness on sodium is CRITICAL: No successful Puppet run in the last 10 hours [20:20:20] Logged the message, Master [20:23:27] !log olivneh synchronized php-1.22wmf7/extensions/EventLogging 'EventLogging to 2351b4ccbb' [20:23:34] Logged the message, Master [20:23:42] !log reedy synchronized php-1.22wmf8/extensions/ProofreadPage/ [20:23:50] Logged the message, Master [20:24:18] PROBLEM - Puppet freshness on magnesium is CRITICAL: No successful Puppet run in the last 10 hours [20:27:00] !log Gracefully reloading Zuul to deploy Ie7513f356cd8 [20:27:07] Logged the message, Master [20:49:19] New patchset: Jgreen; "remove tridge from fundraising backup schemes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69788 [20:50:00] !log olivneh synchronized php-1.22wmf7/extensions/GuidedTour 'GuidedTour to a6fdf3c910' [20:50:08] Logged the message, Master [20:50:20] !log olivneh synchronized php-1.22wmf7/extensions/GettingStarted 'GettingStarted to bf05656766' [20:50:27] Logged the message, Master [20:51:08] New patchset: Demon; "Updating with some of the new options" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69789 [20:51:35] !log olivneh synchronized php-1.22wmf8/extensions/GuidedTour 'GuidedTour to a6fdf3c910' [20:51:43] Logged the message, Master [20:51:54] !log olivneh synchronized php-1.22wmf8/extensions/GettingStarted 'GettingStarted to bf05656766' [20:52:02] Logged the message, Master [20:57:24] New patchset: Ori.livneh; "Update 
EventLogging config to utilize API" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69790 [20:57:49] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69788 [20:58:17] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69790 [21:00:01] !log olivneh synchronized wmf-config/CommonSettings.php 'I24bbe0d4e: Update EventLogging config to utilize API' [21:00:09] Logged the message, Master [21:55:56] Reedy: done [21:56:54] apergos: thanks for the clean-up on fluorine [21:57:03] yw [21:57:09] good you caught it [22:01:13] !log olivneh synchronized php-1.22wmf7/extensions/GuidedTour 'Updating GuidedTour to 950ee2c70417be517f4cdce3a1b590f6fc28d388' [22:01:21] Logged the message, Master [22:05:23] !log olivneh synchronized php-1.22wmf8/extensions/GuidedTour 'Updating GuidedTour to 950ee2c70417be517f4cdce3a1b590f6fc28d388' [22:05:31] Logged the message, Master [22:08:55] New patchset: Ottomata; "Puppetizing hive client, server and metastore." [operations/puppet/cdh4] (master) - https://gerrit.wikimedia.org/r/69353 [22:09:23] New patchset: Ottomata; "Puppetizing oozie client and server" [operations/puppet/cdh4] (master) - https://gerrit.wikimedia.org/r/69804 [22:09:47] New patchset: Ottomata; "Puppetizing hue." [operations/puppet/cdh4] (master) - https://gerrit.wikimedia.org/r/69805 [22:11:03] ottomata: you're a machine! [22:11:11] haha [22:11:20] that's what happens when you code on a plane [22:11:50] have internet, time to push [22:12:47] yeah, plus doing puppet stuff is oddly gratifying, like popping bubblewrap [22:13:29] hha [22:13:31] yeah right! [22:13:35] its so cool when it works [22:13:51] i reinstalled and repuppetized two of the hadoop datanodes yesterday [22:13:57] the puppetization worked like a charm [22:14:07] i just formatted the partitions, ran puppet, and bam! a new datanode [22:15:28] that's pretty cool [22:15:40] btw, ori-l, I am on 3rd floor, woo! 
[22:15:56] wat! i'll come by and harass [22:22:15] <^demon> manybubbles: Minor schema change, namespace is now stored. [22:35:18] ^demon: fine by me. Why store it? just curious. [22:36:06] ^demon: while I've got you distracted you added this to the TODOs a while back: "Handle offsets in Solr rather than MediaWiki. Only search what you need." but it looks like you did it. [22:36:08] am I missing something? [22:36:45] PROBLEM - NTP on ssl3002 is CRITICAL: NTP CRITICAL: No response from NTP server [22:37:34] <^demon> manybubbles: Yeah I fixed that awhile ago. [22:37:49] <^demon> Was storing since we'll be searching by it, not just returning it as a result. [22:38:15] I think something came out backwards. [22:38:37] storing is what you do when you want to return it but you don't have to if all you are doing is searching by it [22:38:48] also, I've been playing with namespace searches today [22:39:17] the one where you click on "advanced" and pick a namespace. or where you start the query with a namespace name. [22:40:47] actually it looks like I'm having trouble with prefixing the search with the namespace name [22:41:42] manybubbles: do you work in the SF office? [22:41:56] ottomata: just today and tomorrow [22:41:59] oh! [22:42:02] i am here to [22:42:03] same time! [22:42:11] are you on 3rd floor right now? [22:43:16] right now! [22:43:23] in the middle [22:44:44] <^demon> manybubbles: Yeah, I was going to start playing with namespaces too. For some reason I thought we'd need to return and query by it. [22:45:32] ^demon: I don't _think_ we need to return it. But we certainly do need to query by it. I think I've got that mostly covered between our code and SearchEngine::replacePrefixes [22:50:24] I'm going to see if I can wrap the Solr exceptions so they don't cause us to give up. [22:50:39] we can show the user an error instead of a white page :) [22:51:51] <^demon> manybubbles: Yeah. Also, I need to put the PoolCounter stuff in there. 
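The stored-versus-indexed distinction manybubbles draws above maps onto Solr schema configuration like this (field name and type are assumed for illustration, not taken from the actual CirrusSearch schema):

```xml
<!-- Hypothetical schema.xml fragment: indexed="true" is enough to
     filter queries (e.g. fq=namespace:0); stored="true" would only be
     needed to return the raw value in search results. -->
<field name="namespace" type="int" indexed="true" stored="false"/>
```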
[22:52:04] PROBLEM - NTP on ssl3003 is CRITICAL: NTP CRITICAL: No response from NTP server [22:52:34] ^demon: you want me to have a look at it tonight? you've had a bunch of long days recently [22:52:53] <^demon> Na, I got a good night sleep last night :) [22:57:52] !log updated Parsoid to 6240c19 [22:58:00] Logged the message, Master [22:59:20] could anybody purge the Parsoid varnish caches? [22:59:37] cerium and titanium [23:11:05] <^demon> manybubbles: https://git.wikimedia.org/commitdiff/mediawiki%2Fextensions%2FCirrusSearch.git/672e28c445e2623567ac243bc8673e795ce386b6 - update/delete now wrapped in PoolCounter. [23:22:03] ^demon: nice and simple. thanks1 [23:29:04] !log updated Parsoid to 6dadde3 [23:29:12] Logged the message, Master