[00:26:45] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Tue Mar 12 00:26:40 UTC 2013 [00:27:25] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [00:29:04] New patchset: Ram; "Bug: 45266 Use sequence numbers instead of timestamps" [operations/debs/lucene-search-2] (master) - https://gerrit.wikimedia.org/r/53299 [00:38:33] !log kaldari synchronized php-1.21wmf11/extensions/UploadWizard/ 'deploying some bugfixes for UploadWizard' [00:38:39] Logged the message, Master [00:53:20] New review: Ram; "Not ready for merge; still being tested." [operations/debs/lucene-search-2] (master) C: -1; - https://gerrit.wikimedia.org/r/53299 [00:57:35] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Tue Mar 12 00:57:34 UTC 2013 [00:58:25] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [01:07:22] New review: Ram; "Since the changes corresponding to what was reverted in the OAI extension are still present here, I ..." [operations/debs/lucene-search-2] (master) - https://gerrit.wikimedia.org/r/53299 [01:28:26] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Tue Mar 12 01:28:22 UTC 2013 [01:29:25] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [01:37:55] PROBLEM - Varnish traffic logger on cp1033 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [01:50:55] RECOVERY - Varnish traffic logger on cp1033 is OK: PROCS OK: 3 processes with command name varnishncsa [01:55:35] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 188 seconds [01:55:55] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 192 seconds [01:59:16] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Tue Mar 12 01:59:10 UTC 2013 [01:59:25] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [02:10:06] !log deployed change 53304 to OATHAuth on wikitech [02:10:18] Logged the message, Master [02:10:32] morebots: what's your deal? why are you so slow right now? [02:10:32] I am a logbot running on wikitech-static. [02:10:32] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log. [02:10:32] To log a message, type !log . 
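For readers unfamiliar with the convention, the morebots help text above is the whole interface: anything said in the channel that starts with !log is copied by the bot to the Server Admin Log at wikitech.wikimedia.org/wiki/Server_Admin_Log, and the bot answers with "Logged the message, Master" once the entry is saved. A typical entry (the message text below is only an illustration, not taken from this log) looks like:

    !log restarted apache on srv123 to pick up the new configuration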
[02:10:34] oh [02:10:35] rigt [02:10:36] right [02:10:41] that box is doing an import [02:16:05] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [02:16:47] !log deploying change 53248 and 52951 to OpenStackManager on wikitech [02:16:53] Logged the message, Master [02:23:05] PROBLEM - Puppet freshness on virt0 is CRITICAL: Puppet has not run in the last 10 hours [02:23:05] PROBLEM - Puppet freshness on virt1000 is CRITICAL: Puppet has not run in the last 10 hours [02:29:35] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [02:29:45] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Tue Mar 12 02:29:40 UTC 2013 [02:29:55] !log LocalisationUpdate completed (1.21wmf11) at Tue Mar 12 02:29:54 UTC 2013 [02:29:55] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [02:30:03] Logged the message, Master [02:30:25] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [02:33:55] PROBLEM - Varnish traffic logger on cp1023 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [02:43:55] RECOVERY - Varnish traffic logger on cp1023 is OK: PROCS OK: 3 processes with command name varnishncsa [02:52:43] !log LocalisationUpdate completed (1.21wmf10) at Tue Mar 12 02:52:43 UTC 2013 [02:52:50] Logged the message, Master [03:00:15] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Tue Mar 12 03:00:10 UTC 2013 [03:00:26] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [03:21:45] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 190 seconds [03:21:55] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 191 seconds [03:30:55] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Tue Mar 12 03:30:52 UTC 2013 [03:31:26] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [03:40:45] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds [03:40:55] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds [04:01:25] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Tue Mar 12 04:01:23 UTC 2013 [04:02:30] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [04:32:05] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Tue Mar 12 04:31:56 UTC 2013 [04:32:26] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [04:46:25] PROBLEM - SSH on lvs1001 is CRITICAL: Server answer: [04:48:06] New review: MZMcBride; "\o/" [operations/debs/ircecho] (master) - https://gerrit.wikimedia.org/r/53276 [04:50:38] PROBLEM - MySQL Slave Delay on db78 is CRITICAL: CRIT replication delay 229 seconds [04:50:47] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 237 seconds [04:52:25] RECOVERY - SSH on lvs1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [04:56:25] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Tue Mar 12 04:56:19 UTC 2013 [04:56:25] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [04:57:37] RECOVERY - MySQL Slave Delay on db78 is OK: OK replication delay 0 seconds [04:57:45] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 15 seconds [04:59:56] jfyi - the replag on db1025 and db78 were me [05:02:18] !log tstarling synchronized php-1.21wmf11/extensions/Math [05:02:30] Logged the 
message, Master [05:02:50] !log tstarling synchronized php-1.21wmf10/extensions/Math [05:02:57] Logged the message, Master [05:26:45] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Tue Mar 12 05:26:44 UTC 2013 [05:27:26] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [05:36:58] New review: Tim Starling; "It's not just MediaWiki, it's MediaWiki minus some things and plus some other things. It's a subdire..." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/53125 [05:55:29] New review: Krinkle; "Ah, okay. That makes sense. We'd have /h/w/c/w but I suppose that's what we deserve." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/53125 [05:57:26] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Tue Mar 12 05:57:21 UTC 2013 [05:58:25] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [06:27:57] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Tue Mar 12 06:27:52 UTC 2013 [06:28:25] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [06:32:45] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 185 seconds [06:32:56] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 190 seconds [06:39:57] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 185 seconds [06:39:58] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: CRIT replication delay 185 seconds [06:41:05] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [06:45:57] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 0 seconds [06:45:58] RECOVERY - MySQL Replication Heartbeat on db32 is OK: OK replication delay 0 seconds [06:58:25] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Tue Mar 12 06:58:19 UTC 2013 [06:58:25] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [07:28:56] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Tue Mar 12 07:28:47 UTC 2013 [07:29:25] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [07:37:55] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 193 seconds [07:37:55] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: CRIT replication delay 192 seconds [07:40:56] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 0 seconds [07:40:56] RECOVERY - MySQL Replication Heartbeat on db32 is OK: OK replication delay 0 seconds [07:42:47] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 211 seconds [07:42:55] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 217 seconds [07:59:55] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Tue Mar 12 07:59:47 UTC 2013 [08:00:26] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [08:31:37] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Tue Mar 12 08:31:25 UTC 2013 [08:32:11] !log jenkins: live hacked the ant build script for MediaWiki jobs. Now using the local replicated git directory instead of a handmade replication. 
[08:32:22] Logged the message, Master [08:32:25] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [08:37:57] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 185 seconds [08:37:57] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: CRIT replication delay 185 seconds [08:38:57] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 0 seconds [08:38:57] RECOVERY - MySQL Replication Heartbeat on db32 is OK: OK replication delay 0 seconds [08:48:35] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Tue Mar 12 08:48:28 UTC 2013 [08:49:25] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [08:51:42] https://bugzilla.wikimedia.org/show_bug.cgi?id=46018 [08:52:33] New review: Nemo bis; "Weird side-effect? https://bugzilla.wikimedia.org/show_bug.cgi?id=46018" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/53264 [09:00:09] New review: Hashar; "(1 comment)" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/53264 [09:11:05] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: Puppet has not run in the last 10 hours [09:11:05] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: Puppet has not run in the last 10 hours [09:11:05] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: Puppet has not run in the last 10 hours [09:11:05] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours [09:11:05] PROBLEM - Puppet freshness on msfe1002 is CRITICAL: Puppet has not run in the last 10 hours [09:19:05] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Tue Mar 12 09:18:56 UTC 2013 [09:19:25] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [09:26:55] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 192 seconds [09:26:55] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 192 seconds [09:45:55] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds [09:45:55] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds [09:49:25] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Tue Mar 12 09:49:23 UTC 2013 [09:50:25] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [09:53:06] PROBLEM - Puppet freshness on virt1005 is CRITICAL: Puppet has not run in the last 10 hours [10:08:12] If anybody is awake: Looks like https://gerrit.wikimedia.org/r/#/c/53264/ created https://bugzilla.wikimedia.org/show_bug.cgi?id=46018 - could somebody revert / fix? [10:12:48] andre__: noticed that one already :-] [10:13:03] andre__: and proposed some fix to the rewrite rule. [10:13:05] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 185 seconds [10:13:13] guess we want to ping someone this afternoon [10:13:19] yeah. 
but having somebody to deploy it would be nice ;) [10:13:22] yeah [10:13:29] not a big issue anyway [10:13:47] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 205 seconds [10:13:55] not unless google crawls mediawiki.org [10:19:55] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Tue Mar 12 10:19:50 UTC 2013 [10:20:25] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [10:25:28] hmm [10:37:45] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [10:38:05] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [10:50:26] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Tue Mar 12 10:50:19 UTC 2013 [10:50:26] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [11:20:46] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Tue Mar 12 11:20:42 UTC 2013 [11:21:25] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [11:32:37] New patchset: Demon; "(bug 45911) Set $wgCategoryCollation to 'uca-pt' for the Portuguese Wikipedia and Wikibooks" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/52903 [11:33:48] New patchset: Mark Bergsma; "Handle average (sum/count) values properly" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53346 [11:51:35] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Tue Mar 12 11:51:28 UTC 2013 [11:52:25] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [11:58:24] New patchset: Krinkle; "Rename legacy 'live-1.5/' to 'w/'." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/53125 [12:15:31] Change abandoned: Mark Bergsma; "This was sort of implemented a while ago. GeoIP_last_netmask() is inherently thread unsafe, but this..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/28295 [12:17:05] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [12:23:55] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:24:05] PROBLEM - Puppet freshness on virt0 is CRITICAL: Puppet has not run in the last 10 hours [12:24:06] PROBLEM - Puppet freshness on virt1000 is CRITICAL: Puppet has not run in the last 10 hours [12:25:15] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Tue Mar 12 12:25:07 UTC 2013 [12:25:25] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [12:25:55] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 3.577 second response time [12:48:53] New patchset: Hashar; "(bug 44041) adapt role::cache::mobile for beta" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44709 [12:51:32] New patchset: Hashar; "Varnish rules for Beta cluster" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47567 [12:52:24] New review: Hashar; "rebased" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47567 [12:55:03] !log gallium : upgrading Jenkins (unscheduled) [12:55:10] Logged the message, Master [13:06:06] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Tue Mar 12 13:06:00 UTC 2013 [13:06:25] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [13:16:24] "(unscheduled)" can be appended to all my log messages [13:21:52] haha [13:29:53] !log Jenkins restarted! 
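The redirect breakage being discussed above (bug 46018, introduced by change 53264 to operations/apache-config) is identified later in this log as a stray [OR] flag and fixed by change 53369, "Fix wikiquote.net spurrious OR". The snippet below is only an illustrative sketch of that failure mode, not the actual contents of the repository: RewriteCond lines are ANDed by default and [OR] chains one condition with the next, so a leftover [OR] on the last condition of a block means a non-matching condition no longer stops the rule, and the wikiquote redirect ended up firing for unrelated hosts such as mediawiki.org.

    # Hypothetical sketch of the bug: the [OR] on the final condition
    # lets the rule fire even when neither Host header matches.
    RewriteCond %{HTTP_HOST} ^wikiquote\.net$ [OR]
    RewriteCond %{HTTP_HOST} ^www\.wikiquote\.net$ [OR]
    RewriteRule ^/(.*)$ http://www.wikiquote.org/$1 [R=301,L]

    # The fix is simply to drop the trailing flag so the block closes:
    RewriteCond %{HTTP_HOST} ^wikiquote\.net$ [OR]
    RewriteCond %{HTTP_HOST} ^www\.wikiquote\.net$
    RewriteRule ^/(.*)$ http://www.wikiquote.org/$1 [R=301,L]

As noted at 15:42 below, the resulting 301s were also cached by the Squids, which is why a purge was still needed after the Apache fix went out.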
[13:30:00] Logged the message, Master [13:36:37] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Tue Mar 12 13:36:32 UTC 2013 [13:37:25] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [13:40:05] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 185 seconds [13:40:45] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 199 seconds [13:43:03] paravoid: wanna merge some of my contint puppet modules ? :-D [13:44:35] got them tested in labs so they should be fine :-] [13:44:46] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 187 seconds [13:44:49] https://gerrit.wikimedia.org/r/#/c/47664/ annnd https://gerrit.wikimedia.org/r/#/c/47742/ =) [13:45:06] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 186 seconds [13:57:01] New patchset: Hashar; "contint::website regroups apache + basic files" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47742 [13:57:01] New patchset: Hashar; "Jenkins module created out of contint manifests" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47664 [13:57:01] New patchset: Mark Bergsma; "Create pbuilder images with ubuntu APT components main and universe" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53358 [13:58:30] hashar: looking now [13:58:44] paravoid: rebased them to make sure they works fine [13:58:49] and running on labs right now [13:59:22] you did review the first one (jenkins module), the other one is mostly to cleanup apache [13:59:45] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 197 seconds [14:00:05] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 205 seconds [14:07:25] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Tue Mar 12 14:07:18 UTC 2013 [14:07:25] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [14:07:56] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53346 [14:08:48] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53358 [14:09:02] hashar: that's a lot of patchsets :) [14:09:46] paravoid: lot of rebases too :-] [14:10:06] hashar: can we move groups::jenkins to the module? [14:10:19] I remember saying this before, I don't remember your reply, sorrY ;-) [14:10:24] yeah [14:10:33] I need the group for some other parts [14:10:59] such as adding users to the group, and making sure web files belong to group jenkins [14:11:11] so? [14:11:12] but we can create a "groups" module ;] [14:11:15] noo [14:11:38] what would be the problem with including the class from the module instead? [14:12:40] something like jenkins::user_group ? 
[14:13:04] that means that manifests/admins.pp will depend on the jenkins module
[14:13:31] hence any puppet run will have to load the jenkins manifest
[14:13:53] ah found my comment: """ We need it to be global since it is used to add contint admins in the jenkins group (somewhere in manifests/admins.pp """
[14:13:58] https://gerrit.wikimedia.org/r/#/c/47664/4/modules/jenkins/manifests/user.pp,unified
[14:14:26] jenkins::user would be fine
[14:14:42] puppet's autoloader won't load the whole module, just that specific class
[14:14:56] but preferably we wouldn't have that class included in all machines
[14:15:07] that also means loading one extra file for any puppet run on any server
[14:15:09] just gallium/other contint boxes
[14:15:22] unless I'm missing something?
[14:16:31] hmm now I am confused
[14:16:57] so yeah that is the whole point, sticking the group declaration in admins.pp since that is where it is used. Let us skip loading the jenkins::user class on all servers
[14:18:09] I don't understand
[14:18:12] and we also need the jenkins group to be defined next to the other groups to make sure nobody is going to steal its gid :D
[14:19:39] let me start over
[14:19:44] yes please :)
[14:20:04] the jenkins group is used to manage access between the jenkins daemon (which runs under jenkins:jenkins) and the jenkins administrators
[14:20:13] we add the jenkins administrators on gallium to the jenkins group
[14:20:34] files that are produced by jenkins but need to be somehow altered/moved by the admins belong to the group jenkins
[14:20:56] that group is defined in the global manifests/admins.pp just like wikidev
[14:21:12] and receives a unique (cluster-wide) GID
[14:21:48] since the admins are defined in the global file with (include groups::jenkins) it makes more sense to me to have the group there
[14:21:54] rather than having to require the module class
[14:21:59] why?
[14:22:08] that saves one extra file lookup for each run of puppet on the full cluster
[14:22:15] depending from manifests -> a module isn't bad
[14:22:19] AND, make sure that nobody is going to steal the GID of the jenkins group
[14:22:26] since all groups are defined at the same place
[14:23:34] so that is kind of a micro optimization and ensuring consistency of our GIDs
[14:23:36] if you just remove the gid, puppet will just create a system group
[14:23:44] if the group doesn't exist already
[14:23:50] which is also fine
[14:24:09] it isn't an optimization at all btw
[14:24:20] doh
[14:24:20] this just matters for gallium which will open that file anyway
[14:24:46] but even if it was, let's not structure our puppet repo based on how many files the puppetmaster needs to open :)
[14:25:03] hehe :-]
[14:25:09] we can put EVERYTHING in site.pp
[14:25:13] hehe
[14:25:43] which is more or less what we are doing right now hehe
[14:26:01] not by a long shot
[14:26:02] depending from manifests -> module is fine
[14:26:04] paravoid: so want me to move the groups::jenkins definition to jenkins::group ?
[14:26:16] having the module depend on admins.pp is not
[14:26:23] imho
[14:27:00] either that or move it within jenkins::user
[14:27:32] did I convince you or are you just doing whatever it takes for me to merge it? :P
[14:28:04] 2
[14:28:05] ;-]
[14:28:18] haha
[14:28:45] so I should just remove the gid => 561 ?
:-] [14:28:46] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 12 seconds [14:29:00] I think it should be fine, yes [14:29:05] my idea was merely to keep the way we define groups [14:29:07] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 4 seconds [14:29:13] probably does not matter since only gallium has it [14:33:20] rebasing and trying out on the labs instance [14:33:39] New patchset: Hashar; "contint::website regroups apache + basic files" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47742 [14:33:39] New patchset: Hashar; "Jenkins module created out of contint manifests" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47664 [14:35:13] New review: Hashar; "Per discussion with Faidon, moved the 'jenkins' group definition out of admins.pp to the Jenkins mod..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47664 [14:35:35] notice: Finished catalog run in 37.58 seconds \O/ [14:35:52] paravoid: https://gerrit.wikimedia.org/r/#/c/47664/13..14/manifests/admins.pp,unified :-] [14:36:05] the group is there https://gerrit.wikimedia.org/r/#/c/47664/13..14/modules/jenkins/manifests/group.pp,unified [14:36:23] just 40 seconds, that's \o/ indeed [14:36:24] * paravoid grins [14:36:39] yeah I got puppet to run in a ram disk [14:36:43] makes things faster [14:36:52] jk [14:37:55] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Tue Mar 12 14:37:45 UTC 2013 [14:38:25] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [14:38:44] I have a minor nitpick [14:38:46] https://gerrit.wikimedia.org/r/#/c/47664/14/modules/jenkins/manifests/user.pp [14:39:02] you don't need to require the class if you require => Group [14:39:17] you can either just include the class or drop the require => Group [14:39:24] personally I'd prefer the former but either works [14:40:06] ahh [14:40:23] very very minor as I said :) [14:40:36] I have to maintain by fame of badass reviewer though [14:41:20] done :-] [14:41:36] New patchset: Hashar; "contint::website regroups apache + basic files" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47742 [14:41:37] New patchset: Hashar; "Jenkins module created out of contint manifests" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47664 [14:41:47] New review: Hashar; "(1 comment)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47664 [14:43:19] New review: Faidon; "Thanks!" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/47664 [14:43:28] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47664 [14:44:48] hashar: why is there a "require" here: https://gerrit.wikimedia.org/r/#/c/47742/11/manifests/misc/contint.pp ? [14:45:01] cause I am a noob ? 
:-] [14:45:20] I still don't know when to use include or require [14:45:28] so to play it safe I usually use require [14:45:30] require is basically include + a require => Class [14:45:49] but it really should be avoided imho [14:49:20] New patchset: Hashar; "contint::website regroups apache + basic files" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47742 [14:51:30] New patchset: Hashar; "contint::website regroups apache + basic files" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47742 [14:51:43] paravoid: rebased and fixed the require [14:52:08] trying in labs [14:53:01] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47742 [14:53:33] my other "complaint" [14:53:44] (not a big one obviously, I just merged it :) [14:53:58] is the /srv tree organization [14:54:32] e.g. you define /srv/localhost which is very non-contint namespace and could potentially clash with another module in the future [14:54:36] but let's cross that bridge then :) [14:54:42] if ever [14:56:15] yeah that might be eventually a probel [14:56:26] should probably have been something like /srv/qunit/localhost [14:57:11] New patchset: Hashar; "pep8 configuration file" [operations/debs/ircecho] (master) - https://gerrit.wikimedia.org/r/53360 [14:57:11] New patchset: Hashar; "pass pep8 linting checks" [operations/debs/ircecho] (master) - https://gerrit.wikimedia.org/r/53361 [14:58:43] paravoid: thanks for your review sprint :-] [14:59:39] New review: Krinkle; "(1 comment)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47742 [14:59:51] hashar: ignore that comment [15:00:10] Krinkle: the one you have put in gerrit or your last comment asking me to ignore it ? :-] [15:00:10] Ignore tabs used as indentation (W191) since mark loves tabs. [15:00:11] hahahaha [15:00:19] I'm not +2 that [15:00:26] I'll let mark do it [15:01:01] New patchset: Hashar; "drop 'sys' import: unused" [operations/debs/ircecho] (master) - https://gerrit.wikimedia.org/r/53363 [15:02:02] What's the status of RT 4695? [15:02:49] reopened today [15:08:35] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Tue Mar 12 15:08:28 UTC 2013 [15:09:25] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [15:09:54] I suspect it's that extra [OR] which is causing issues [15:12:41] paravoid: turns out jenkins::user has to require jenkins::group despite the require => group['jenkins'] [15:12:51] New review: Faidon; "Andrew, are you taking over this? If not, maybe we should abandon?" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43886 [15:12:56] paravoid: I got err: Failed to apply catalog: Could not find dependency Group[jenkins] for User[jenkins] at /var/lib/git/operations/puppet/modules/jenkins/manifests/user.pp:13 [15:13:35] New review: Andrew Bogott; "Yep, I'll take it on." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43886 [15:13:39] that's not the problem [15:13:43] class jenkins::group { [15:13:43] class jenkins { [15:13:59] jenkins::group::jenkins ? :) [15:14:03] I don't know how I missed that [15:15:19] my class in the module is jenkins::group so ::jenkins:group ? [15:15:32] ? [15:15:37] remove the subclass [15:15:46] ohhh [15:15:47] sorry [15:15:51] class jenkins::group { group { 'jenkins': ... 
} } [15:16:13] system => true [15:16:40] I'll do it [15:16:46] got it [15:17:24] I'm ready to push :) [15:18:00] New patchset: Hashar; "fix jenkins::group class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53366 [15:18:06] paravoid: ^^^ [15:18:28] New review: Hashar; "recheck" [operations/debs/ircecho] (master) - https://gerrit.wikimedia.org/r/53363 [15:19:18] New review: Hashar; "recheck" [operations/debs/ircecho] (master) - https://gerrit.wikimedia.org/r/53363 [15:19:43] New patchset: Faidon; "fix jenkins::group class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53366 [15:20:43] jenkins is annoyingly slow lately [15:21:00] yeah I have been reloading it [15:21:22] so it waits for all jobs to complete before proceeding new ones :/ [15:23:50] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53366 [15:23:55] too slow [15:25:51] ahharhar [15:25:58] uh oh [15:25:59] what? [15:28:49] Again, if anybody is awake: Looks like https://gerrit.wikimedia.org/r/#/c/53264/ created https://bugzilla.wikimedia.org/show_bug.cgi?id=46018 - could somebody revert / fix? [15:28:57] (Redirects such as http://mediawiki.org are going to http://wikiquote.org (301 Moved Permanently)) [15:30:18] ugh [15:31:57] !log Restarted Zuul process. Was stuck somehow waiting for a job that has been canceled. [15:32:04] Logged the message, Master [15:33:22] New patchset: Faidon; "Fix wikiquote.net spurrious OR" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/53369 [15:33:27] RECOVERY - zuul_service_running on gallium is OK: PROCS OK: 1 process with args zuul-server [15:33:48] New patchset: Mark Bergsma; "Imported Upstream version 3.1.2" [operations/debs/ganglia] (master) - https://gerrit.wikimedia.org/r/53370 [15:34:13] Change merged: Mark Bergsma; [operations/debs/ganglia] (master) - https://gerrit.wikimedia.org/r/53370 [15:34:54] Change merged: Faidon; [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/53369 [15:34:56] 3.1.2? 
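To make the outcome of the jenkins group discussion above easier to follow: the change merged here (53366) puts the group inside the module instead of manifests/admins.pp, drops the hard-coded gid => 561 so Puppet just creates a system group, and removes the accidental nested class. A minimal sketch of that shape, with illustrative attribute values rather than a copy of the merged manifests:

    # modules/jenkins/manifests/group.pp
    class jenkins::group {
        # No fixed gid: Puppet allocates one when creating the system group.
        group { 'jenkins':
            ensure => present,
            system => true,
        }
    }

    # modules/jenkins/manifests/user.pp
    class jenkins::user {
        # Including the class is enough; an additional
        # require => Group['jenkins'] would be redundant.
        include jenkins::group

        user { 'jenkins':
            ensure => present,
            gid    => 'jenkins',
            home   => '/var/lib/jenkins',
        }
    }

The "Could not find dependency Group[jenkins] for User[jenkins]" error quoted above came from the earlier version of the class, where the group resource was wrapped in an extra subclass and so was never actually declared as a plain Group['jenkins'].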
[15:35:16] just as initial commit [15:35:32] it's the first commit from the debian repo [15:35:43] i couldn't get gerrit to accept it otherwise [15:35:52] New patchset: Hashar; "Jenkins job validation (DO NOT SUBMIT)" [operations/debs/ircecho] (master) - https://gerrit.wikimedia.org/r/53371 [15:36:25] PROBLEM - zuul_service_running on gallium is CRITICAL: PROCS CRITICAL: 2 processes with args zuul-server [15:38:24] New review: Hashar; "recheck" [operations/debs/ircecho] (master) - https://gerrit.wikimedia.org/r/53371 [15:38:43] root is doing a graceful restart of all apaches [15:38:53] that would be moi [15:39:06] !log root gracefulled all apaches [15:39:15] Logged the message, Master [15:39:15] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Tue Mar 12 15:39:06 UTC 2013 [15:39:25] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [15:39:39] Change abandoned: Hashar; "(no reason)" [operations/debs/ircecho] (master) - https://gerrit.wikimedia.org/r/53371 [15:39:53] New review: Hashar; "recheck" [operations/debs/ircecho] (master) - https://gerrit.wikimedia.org/r/53363 [15:40:36] New patchset: Mark Bergsma; "Imported Upstream version 3.5.0" [operations/debs/ganglia] (upstream) - https://gerrit.wikimedia.org/r/53372 [15:41:08] New patchset: Mark Bergsma; "Imported Upstream version 3.5.0" [operations/debs/ganglia] (master) - https://gerrit.wikimedia.org/r/53373 [15:41:08] New patchset: Mark Bergsma; "ganglia (3.5.0-wm1) precise; urgency=low" [operations/debs/ganglia] (master) - https://gerrit.wikimedia.org/r/53374 [15:41:49] !log apache graceful to fix bz 46018 / RT 4695 [15:41:55] Logged the message, Master [15:42:01] and of course the 301 is cached [15:43:01] mark: what's the easiest way to send a purge http://mediawiki.org/ (among others) across every squid? [15:43:12] Change merged: Mark Bergsma; [operations/debs/ganglia] (upstream) - https://gerrit.wikimedia.org/r/53372 [15:43:29] there's a purge maintenance script in mediawiki maintenance/ [15:43:33] it takes URLs as stdin [15:44:08] in /home/wikipedia/common/php/maintenance do: echo 'http://blahblahblah' | php ./purgeList.php --wiki aawiki [15:44:11] aawiki? [15:44:22] it doesn't matter [15:45:07] okay [15:45:10] thanks [15:45:11] Change merged: Mark Bergsma; [operations/debs/ganglia] (master) - https://gerrit.wikimedia.org/r/53373 [15:45:56] No MWMultiVersion instance initialized! MWScript.php wrapper not used? [15:46:09] heh, instructions are outdated, who would have thought [15:49:43] wtf [15:49:51] apache-graceful-all is still broken with eqiad? [15:49:55] dammit [15:52:12] and how are we pushing apache configs all these days if that's still broken? [15:58:56] New review: Demon; "recheck" [operations/debs/lucene-search-2] (master) - https://gerrit.wikimedia.org/r/52890 [16:02:16] New review: Faidon; "I tested it and it works. I was bitten by this today, really annoying. Could you merge, build the pa..." [operations/debs/wikimedia-task-appserver] (master) C: 2; - https://gerrit.wikimedia.org/r/49231 [16:02:30] mutante: can you please take care of https://gerrit.wikimedia.org/r/49231 ? [16:03:00] andre__: fixed, resolved the BZ & RT in case you didn't see it [16:03:16] New patchset: Hashar; "Jenkins job validation (DO NOT SUBMIT)" [operations/debs/lucene-search-2] (master) - https://gerrit.wikimedia.org/r/53378 [16:05:44] paravoid, big thanks! 
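On the multiversion cluster the quoted purge invocation fails exactly as shown ("No MWMultiVersion instance initialized!") because maintenance scripts have to be run through the MWScript.php wrapper; the usual way to do that is the mwscript helper on the deployment host. A sketch of the working equivalent, assuming that helper is in the path:

    # purgeList.php reads URLs from stdin, one per line, and sends the
    # purges to every Squid; as noted above, the --wiki value only picks
    # which configuration is loaded, so any wiki id works.
    echo 'http://mediawiki.org/' | mwscript purgeList.php --wiki=aawiki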
[16:06:47] New patchset: Hashar; "Jenkins job validation (DO NOT SUBMIT)" [operations/debs/lucene-search-2] (master) - https://gerrit.wikimedia.org/r/53378 [16:07:47] New patchset: Hashar; "Jenkins job validation (DO NOT SUBMIT)" [operations/debs/lucene-search-2] (master) - https://gerrit.wikimedia.org/r/53378 [16:08:26] Change abandoned: Demon; "Looks good, can abandon now :)" [operations/debs/lucene-search-2] (master) - https://gerrit.wikimedia.org/r/53378 [16:08:35] New review: Demon; "recheck" [operations/debs/lucene-search-2] (master) - https://gerrit.wikimedia.org/r/52890 [16:08:52] I think recheck is bugged [16:09:04] and it will not work anyway because that is only for the 'check' pipeline [16:09:05] :( [16:09:34] ^demon: the tweak I have made (and deployed) https://gerrit.wikimedia.org/r/#/c/53376/2..3/operations-debs.yaml,unified [16:09:35] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Tue Mar 12 16:09:33 UTC 2013 [16:09:42] ^demon: will be there later tonight [16:09:45] daughter time for now [16:09:52] <^demon> Awesome, thanks! [16:09:56] New review: Silke Meyer; "Still do not merge! WikibaseSolr is being re-checked, something has been changed in the database tha..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/52043 [16:10:03] ^demon: to retrrigerr the tests you will need a new patchset :( [16:10:25] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [16:42:07] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [16:42:07] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Tue Mar 12 16:42:00 UTC 2013 [16:42:25] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [16:49:38] is it no longer possible to check changes into mediawiki svn? [16:50:30] hrm, i think nm. operator error. [16:53:07] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 187 seconds [16:53:45] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 201 seconds [16:54:47] New patchset: Demon; "Revert "Fix for bug 45266. Needs parallel changes to OAI."" [operations/debs/lucene-search-2] (master) - https://gerrit.wikimedia.org/r/53385 [16:57:10] I think svn is supposed to be read only about now... [16:57:40] <^demon> All of mediawiki svn is. [16:57:43] i remember a message at some point about either wikimedia or mediawiki [16:57:47] ah mediawiki [16:58:06] <^demon> What're you trying to work on? If it's not already in git, we can move it there :) [16:58:07] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 183 seconds [16:58:07] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 182 seconds [16:58:17] otrs (/me shudders) [16:58:35] PROBLEM - HTTP on fenari is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:58:38] mediawiki/trunk/otrs [16:59:13] <^demon> Hmm, now...where to put this in git [16:59:26] RECOVERY - HTTP on fenari is OK: HTTP OK: HTTP/1.1 200 OK - 4915 bytes in 0.056 second response time [16:59:28] i'm not sure it's worth moving it to git at this point [16:59:58] in theory we'll upgrade before the apocalypse, and all the patches in svn will either have to be redone or will already be in the new version [17:00:15] i'm tempted to live-hack the one-line change [17:00:30] * ^demon will pretend he didn't hear that ;-) [17:00:47] lol. 
now I *really* want to live-hack it [17:01:05] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds [17:01:05] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds [17:01:26] New review: Ram; "Looks good" [operations/debs/lucene-search-2] (master) C: 1; - https://gerrit.wikimedia.org/r/53385 [17:01:51] New patchset: Faidon; "Cleanup the base class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33066 [17:02:36] PROBLEM - HTTP on fenari is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:05:41] Change merged: Demon; [operations/debs/lucene-search-2] (master) - https://gerrit.wikimedia.org/r/53385 [17:06:50] LeslieCarr: what ever happened with yesterday's mailman issues? [17:06:57] see 4716 [17:12:35] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Tue Mar 12 17:12:34 UTC 2013 [17:13:25] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [17:21:03] no idea jeremyb_ it went away on its own [17:21:17] LeslieCarr: want to look at the ticket? [17:21:42] hey binasher (morning), could you maybe quickly chime in on https://rt.wikimedia.org/Ticket/Display.html?id=4689 ? [17:21:47] a similar thing was happening to someone else [17:25:25] RECOVERY - HTTP on fenari is OK: HTTP OK: HTTP/1.1 200 OK - 4915 bytes in 0.771 second response time [17:25:38] drdee: europium is fully replacing locke, yeah? [17:26:38] yes [17:27:13] yay for europe [17:28:08] hmm i'd rather we use a high perf misc server instead of a misc server [17:28:16] !log reedy synchronized php-1.21wmf11/includes/Collation.php [17:28:23] Logged the message, Master [17:28:31] RobH: would it be possible to swap europium with a high perf misc? [17:28:59] LeslieCarr: i replied asking her to try again [17:29:10] thanks [17:29:14] !log reedy synchronized php-1.21wmf10/includes/Collation.php [17:29:19] Logged the message, Master [17:30:52] heh, /me confused Collation with Collection in that last !log :) [17:31:36] PROBLEM - HTTP on fenari is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:31:48] binasher: Yep, I just matched it to what it was already running on. [17:31:59] locke vs europium [17:32:17] can you drop a ticket for the request, and can I also take europium offilne and reclaim? [17:32:27] RECOVERY - HTTP on fenari is OK: HTTP OK: HTTP/1.1 200 OK - 4915 bytes in 0.054 second response time [17:34:14] New patchset: Aklapper; "bugzilla_report.php: Add query and formatting for list of urgent issues" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53387 [17:36:36] Change abandoned: Aklapper; "Gerrit very annoying. Was easier to create a new patchset than to amend and run into cryptic errors...." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53128 [17:37:00] RobH: https://rt.wikimedia.org/Ticket/Display.html?id=4718 [17:38:55] binasher: Cool, I will get this spun up for you today [17:39:03] New patchset: Mark Bergsma; "Prevent packages from getting automatically upgraded during Ganglia rework" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53388 [17:40:02] New review: Aklapper; "By the way, I have no idea if I have to sanitize somehow the bugsummary strings that I query from Bu..." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/53387 [17:40:22] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53388 [17:40:35] PROBLEM - HTTP on fenari is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:40:37] drdee: so that ticket is going to be closed now, but re: andrew's question, it would be nice to use the mcast relay from oxygen, but i'd rather not make oxygen more of a spof. [17:40:55] ok, got it [17:41:10] if we can make the mcast relay HA across more than one host, then that would be great [17:41:48] !log anomie synchronized php-1.21wmf11/extensions/Scribunto 'Update Scribunto to master' [17:41:50] maybe we should do that [17:41:56] Logged the message, Master [17:43:15] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Tue Mar 12 17:43:06 UTC 2013 [17:43:25] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [17:43:47] bbiab [17:50:35] RECOVERY - HTTP on fenari is OK: HTTP OK: HTTP/1.1 200 OK - 4915 bytes in 9.983 second response time [17:51:18] New patchset: RobH; "en-wp.com/org redirect support per rt4669" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/53391 [17:53:12] !log live-hack trivial OTRS patch to support filtering on X-Spam-Score for RT #4713, will check in once project is copied to git [17:53:19] Logged the message, Master [17:53:35] PROBLEM - HTTP on fenari is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:54:15] PROBLEM - MySQL Replication Heartbeat on db57 is CRITICAL: CRIT replication delay 213340 seconds [17:54:33] Change merged: RobH; [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/53391 [17:58:16] robh is doing a graceful restart of all apaches [17:58:23] not really logmsgbot [17:58:30] it still doesnt work for eqiad apaches ;_; [17:58:34] !log robh gracefulled all apaches [17:58:40] Logged the message, Master [18:00:27] !log ran apache restart as myself, not root, had a bunch of fails. restarted them as root, hopefully nothing noticable to users [18:00:37] Logged the message, RobH [18:01:55] !log en-wp.com/org now redirect to en.wikipedia per rt 4669 [18:02:02] Logged the message, RobH [18:04:55] PROBLEM - MySQL Replication Heartbeat on db66 is CRITICAL: CRIT replication delay 322411 seconds [18:04:55] PROBLEM - MySQL Replication Heartbeat on db65 is CRITICAL: CRIT replication delay 322382 seconds [18:05:36] paravoid: redirecting seems to be still an issue: https://bugzilla.wikimedia.org/show_bug.cgi?id=46018 [18:05:46] PROBLEM - MySQL Replication Heartbeat on db55 is CRITICAL: CRIT replication delay 322394 seconds [18:06:01] andre__: we can't purge all possible URLs I'm afraid [18:06:24] ah [18:06:45] PROBLEM - MySQL Replication Heartbeat on db50 is CRITICAL: CRIT replication delay 322407 seconds [18:07:05] PROBLEM - MySQL Slave Delay on db57 is CRITICAL: CRIT replication delay 213807 seconds [18:08:08] I should read the latest bug comments first before making noise here, darn. [18:09:05] PROBLEM - MySQL Slave Delay on db66 is CRITICAL: CRIT replication delay 322617 seconds [18:09:19] don't worry about it [18:11:47] PROBLEM - MySQL Slave Delay on db65 is CRITICAL: CRIT replication delay 321723 seconds [18:13:35] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Tue Mar 12 18:13:30 UTC 2013 [18:14:25] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [18:14:43] mutante: saw my ping above? 
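A note on the graceful restarts being logged here: apache-graceful-all apparently still does not cover the eqiad Apaches (see the complaints at 15:49 and 17:58), and the run "as myself, not root" failed until it was repeated as root. The one-liner below is only a hypothetical manual equivalent, assuming a dsh node group that lists all the Apaches and sudo rights on them; the group name is a guess, not necessarily the one the script actually uses:

    dsh -g mediawiki-installation -M -- 'sudo apache2ctl graceful'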
[18:14:57] PROBLEM - MySQL Slave Delay on db55 is CRITICAL: CRIT replication delay 322640 seconds [18:15:03] notpeter: so search23 had its mainboard swapped, but now its serial redirection is not properly set. I have a ticket in for Steve to fix it [18:15:16] just fyi, you are down a server in tampa search pool1. [18:15:28] mutante: also, I have to go in a while, could you watch https://bugzilla.wikimedia.org/show_bug.cgi?id=46018 and see if anything serious crops up? [18:15:35] in a bit even [18:15:36] for a while :) [18:15:41] RobH: wooo [18:15:42] andre__: see above [18:15:42] cool! [18:15:43] thank you [18:16:02] once its fixed, i'll reinstall and run puppet, etc [18:16:10] but i may not put it back into service without checking with ya ;] [18:16:45] PROBLEM - MySQL Slave Delay on db50 is CRITICAL: CRIT replication delay 321737 seconds [18:16:49] ACKNOWLEDGEMENT - Host search23 is DOWN: PING CRITICAL - Packet loss = 100% rhalsell rt3423 [18:16:52] RobH: cool! sounds good [18:17:32] New review: pugmajere; "Honestly, we could also just rename ircecho to ircecho.py and adjust the Makefile (aka debian/rules ..." [operations/debs/ircecho] (master) C: 1; - https://gerrit.wikimedia.org/r/53360 [18:17:55] PROBLEM - MySQL Replication Heartbeat on db68 is CRITICAL: CRIT replication delay 323021 seconds [18:18:55] paravoid: re.. on it now [18:19:13] i see, i put an [OR] too much in there [18:19:30] !log europium reclaimed per rt4689 [18:19:35] Logged the message, RobH [18:19:36] my ping above was for apache-graceful-all [18:19:48] PROBLEM - MySQL Slave Delay on db68 is CRITICAL: CRIT replication delay 322871 seconds [18:19:54] That reads really amusingly [18:20:44] paravoid: yep, ok [18:20:49] it bit me too :) [18:22:10] New patchset: RobH; "reclaiming europium" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53397 [18:27:07] Hi guys, does anyone know how we could restore access on EE-Prototype on WMFLabs? (http://ee-prototype.wmflabs.org/) We need the site for a critical deployment this week, led by matthiasmullie. Here's the Bugzilla ticket: https://bugzilla.wikimedia.org/show_bug.cgi?id=46035 [18:29:23] paravoid: Did Tim say anything to you about icu upgrades? [18:29:38] New review: pugmajere; "Thanks." [operations/debs/ircecho] (master) C: 1; - https://gerrit.wikimedia.org/r/53361 [18:29:39] he didn't but I saw the backlog [18:29:42] and built packages [18:29:46] and sent him a personal email [18:30:21] Awesome :D [18:30:32] New review: pugmajere; "Thanks." [operations/debs/ircecho] (master) C: 1; - https://gerrit.wikimedia.org/r/53363 [18:31:06] he shouldn't need me to do it but I'll try to be around when he's here too [18:31:31] Cool. He should be around in 4 or so hours I suspect [18:31:36] it should be just a "reprepro include" to put it into apt and apt-get update; apt-get upgrade [18:35:26] RECOVERY - HTTP on fenari is OK: HTTP OK: HTTP/1.1 200 OK - 4915 bytes in 0.078 second response time [18:37:19] paravoid: so the sync script is catching up through the last wiki (commons) [18:37:30] AaronSchulz: originals? [18:37:33] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53397 [18:37:46] at any rate there is still only like one rgw frontend running [18:37:54] journal replay? 
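The icu upgrade mentioned above is expected to be exactly the "reprepro include ... apt-get update; apt-get upgrade" dance paravoid describes. A rough sketch, assuming the same precise-wikimedia distribution used for the ganglia packages later in this log; the .changes filename is a placeholder:

    # On the APT repository host (run from the repository base directory,
    # or point reprepro at it with -b):
    reprepro include precise-wikimedia icu_<version>_amd64.changes

    # On each application server, once the packages are published:
    apt-get update && apt-get upgrade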
[18:37:56] when might that be straighted out [18:38:03] yes [18:38:14] great [18:38:23] yeah, I need to put a bit more work [18:38:32] I'd like that fixed before using multiwrite [18:38:33] we have two rgw frontends, unless one of them crashed or something [18:38:35] Krenair: i see you reopened that ticket for wikivoyage.com. that works for me meanwhile. redirect to wikivoyage.org [18:38:47] ok, that sounds reasonable [18:38:54] it's still redirecting me [18:39:03] only fe1002 has any real load [18:39:13] 1001 has a bit [18:39:14] Krenair: how about now? [18:39:26] still [18:39:38] I just purged it, so unlikely [18:39:41] try shift+f5 [18:39:46] AaronSchulz: ok, I'll have a look [18:39:50] works now [18:39:52] AaronSchulz: not now though, I really have to go [18:39:57] ok [18:41:36] PROBLEM - HTTP on fenari is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:42:25] RECOVERY - HTTP on fenari is OK: HTTP OK: HTTP/1.1 200 OK - 4915 bytes in 0.090 second response time [18:42:54] Krenair: paravoid , cool, thanks [18:42:57] !log Imported ganglia 3.5.0 packages into the precise-wikimedia APT repository [18:43:03] Logged the message, Master [18:43:27] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 10.0331889362 (gt 8.0) [18:44:15] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Tue Mar 12 18:44:09 UTC 2013 [18:44:26] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [18:45:35] PROBLEM - HTTP on fenari is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:47:16] !log europium shuttin down [18:47:23] Logged the message, RobH [18:48:35] RECOVERY - HTTP on fenari is OK: HTTP OK: HTTP/1.1 200 OK - 4915 bytes in 7.107 second response time [18:49:01] New review: Ram; "Looks good" [operations/debs/lucene-search-2] (master) C: 1; - https://gerrit.wikimedia.org/r/52890 [18:49:05] PROBLEM - Host europium is DOWN: CRITICAL - Plugin timed out after 15 seconds [18:51:35] PROBLEM - HTTP on fenari is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:52:21] New review: Ram; "Looks good." [operations/debs/lucene-search-2] (master) C: 1; - https://gerrit.wikimedia.org/r/52895 [18:53:39] RECOVERY - HTTP on fenari is OK: HTTP OK: HTTP/1.1 200 OK - 4915 bytes in 7.360 second response time [18:54:47] !log authdns-update for europium to gadolinium ip [18:54:53] Logged the message, RobH [18:56:35] PROBLEM - HTTP on fenari is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:57:25] RECOVERY - Packetloss_Average on locke is OK: OK: packet_loss_average is 3.45343707143 [18:59:26] RECOVERY - HTTP on fenari is OK: HTTP OK: HTTP/1.1 200 OK - 4915 bytes in 0.248 second response time [19:02:35] PROBLEM - HTTP on fenari is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:05:37] New patchset: RobH; "gadolinium replacing europium/locke" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53401 [19:06:55] New review: Krinkle; "Fixed in I6b4143de2299578." [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/53264 [19:09:53] New patchset: JanZerebecki; "Fold wikiquote rewrites into one." 
[operations/apache-config] (master) - https://gerrit.wikimedia.org/r/53403 [19:12:05] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: Puppet has not run in the last 10 hours [19:12:05] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: Puppet has not run in the last 10 hours [19:12:05] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: Puppet has not run in the last 10 hours [19:12:05] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours [19:12:05] PROBLEM - Puppet freshness on msfe1002 is CRITICAL: Puppet has not run in the last 10 hours [19:12:25] RECOVERY - HTTP on fenari is OK: HTTP OK: HTTP/1.1 200 OK - 4915 bytes in 0.066 second response time [19:14:46] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Tue Mar 12 19:14:39 UTC 2013 [19:15:25] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [19:18:24] Change merged: Dzahn; [operations/debs/wikimedia-task-appserver] (master) - https://gerrit.wikimedia.org/r/49231 [19:18:28] New patchset: Reedy; "Bug 45936 - multilingual babel boxes in Wikidata" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/53404 [19:19:13] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/53404 [19:20:05] !log reedy synchronized wmf-config/InitialiseSettings.php [19:20:11] Logged the message, Master [19:20:58] New patchset: Demon; "Fix a bunch of type mistakes that relied on casting" [operations/debs/lucene-search-2] (master) - https://gerrit.wikimedia.org/r/52894 [19:21:21] New review: Demon; "PS2 drops unrelated update to lib/mwdumper.jar" [operations/debs/lucene-search-2] (master) - https://gerrit.wikimedia.org/r/52894 [19:21:31] New patchset: Demon; "Useless warning supression" [operations/debs/lucene-search-2] (master) - https://gerrit.wikimedia.org/r/52895 [19:25:31] New review: Ram; "The change in PositionalScorer.java is risky since the HashMap discards duplicates (old code) but th..." [operations/debs/lucene-search-2] (master) C: -1; - https://gerrit.wikimedia.org/r/52894 [19:27:35] PROBLEM - HTTP on fenari is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:29:20] New review: Ram; "Looks good." [operations/debs/lucene-search-2] (master) C: 1; - https://gerrit.wikimedia.org/r/52893 [19:35:29] New review: Ram; "Looks good." 
[operations/debs/lucene-search-2] (master) C: 1; - https://gerrit.wikimedia.org/r/52895 [19:37:34] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/52953 [19:38:35] RECOVERY - HTTP on fenari is OK: HTTP OK: HTTP/1.1 200 OK - 4915 bytes in 6.939 second response time [19:38:57] New review: Jeremyb; "(it's actually RT 4695)" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/53369 [19:39:08] Change abandoned: RobH; "forgot to merge, someone else handled in another patchset" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51031 [19:40:12] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53401 [19:41:35] PROBLEM - HTTP on fenari is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:43:08] New patchset: RobH; "forgot to add gadolinium partman entry" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53405 [19:45:15] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Tue Mar 12 19:45:06 UTC 2013 [19:45:27] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [19:46:32] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53405 [19:46:46] New review: Ram; "Generally speaking, this is a good thing to do. However, one thing that worries me about this change..." [operations/debs/lucene-search-2] (master) - https://gerrit.wikimedia.org/r/52892 [19:47:34] New review: Ram; "Looks good." [operations/debs/lucene-search-2] (master) C: 1; - https://gerrit.wikimedia.org/r/52891 [19:54:05] PROBLEM - Puppet freshness on virt1005 is CRITICAL: Puppet has not run in the last 10 hours [19:54:35] RECOVERY - HTTP on fenari is OK: HTTP OK: HTTP/1.1 200 OK - 4915 bytes in 9.751 second response time [20:08:35] PROBLEM - HTTP on fenari is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:08:55] RECOVERY - Host europium is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [20:09:51] ......thats amazing since i took it offline, heh [20:11:29] RECOVERY - HTTP on fenari is OK: HTTP OK: HTTP/1.1 200 OK - 4915 bytes in 4.223 second response time [20:11:58] PROBLEM - SSH on europium is CRITICAL: Connection refused [20:12:08] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:12:29] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 183 seconds [20:12:29] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 183 seconds [20:13:30] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 9.29297223022 (gt 8.0) [20:13:41] PROBLEM - MySQL Replication Heartbeat on db1020 is CRITICAL: CRIT replication delay 194 seconds [20:13:41] PROBLEM - MySQL Slave Delay on db1020 is CRITICAL: CRIT replication delay 194 seconds [20:16:41] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Tue Mar 12 20:16:36 UTC 2013 [20:17:32] RECOVERY - Packetloss_Average on locke is OK: OK: packet_loss_average is 0.0 [20:17:36] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [20:20:47] trying sync-common on a single server for testing.. takes forever.. 
hmm [20:20:53] RECOVERY - SSH on europium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [20:23:02] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 3.443 second response time [20:24:23] root@mw1044:~# sync-common [20:24:23] Copying to mw1044 from 10.0.5.8... [20:24:31] RECOVERY - MySQL Replication Heartbeat on db1020 is OK: OK replication delay 2 seconds [20:24:31] RECOVERY - MySQL Slave Delay on db1020 is OK: OK replication delay 1 seconds [20:24:31] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Tue Mar 12 20:24:30 UTC 2013 [20:25:21] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [20:34:33] PROBLEM - NTP on europium is CRITICAL: NTP CRITICAL: No response from NTP server [20:35:13] Reedy: poke! [20:35:20] New patchset: RobH; "testing the new unified cert" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53415 [20:35:24] Reedy: https://bugzilla.wikimedia.org/show_bug.cgi?id=45446#c13 – "Maybe CACHE_ANYTHING goes to a different cache then was being purge(?)" ? [20:35:38] I tried it [20:35:38] No [20:35:41] Reedy: (this is the sv.wiki thing) [20:35:48] Oh [20:35:52] Maybe [20:35:53] No idea [20:36:36] > var_dump( wfGetCache( CACHE_ANYTHING ) ); [20:36:37] object(MemcachedPeclBagOStuff)#21 (2) { [20:39:01] Reedy: can you try purging that key again? [20:39:06] (just in case) [20:39:10] Just in case? [20:39:13] (won't hurt anything, will it.) [20:39:13] It doesn't [20:39:16] I tried earlie r:p [20:39:36] heh [20:41:46] New patchset: RobH; "testing the new unified cert" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53415 [20:45:51] New patchset: RobH; "testing the new unified cert" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53415 [20:47:46] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53415 [20:56:42] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Tue Mar 12 20:56:31 UTC 2013 [20:57:22] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [21:03:41] PROBLEM - HTTP on fenari is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:04:32] RECOVERY - HTTP on fenari is OK: HTTP OK: HTTP/1.1 200 OK - 4915 bytes in 0.055 second response time [21:06:40] New patchset: Hashar; "contint: move analytics packages to the contint module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53422 [21:06:41] New patchset: Hashar; "contint: get rid of Sun JDK" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53423 [21:06:42] New patchset: Hashar; "contint: move tmpfs disk to the module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53424 [21:06:43] New patchset: Hashar; "contint: move apache proxy configuration to module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53425 [21:07:31] ori-l: i'm on one of the comfy chairs near the mushroom kingdom [21:09:30] hmm [21:09:35] surprisingly my 4 patches works fine [21:12:21] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 30 seconds [21:12:21] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 30 seconds [21:17:48] Is there already a bug for the intermittent bits (css+js) failures I've been getting over the last few days? 
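The "> var_dump( wfGetCache( CACHE_ANYTHING ) )" lines above read like an eval.php session, which is also the simplest way to purge the stubborn sv.wiki key by hand. A sketch of what that would look like; the key below is a placeholder, since the log never shows the actual key being purged:

    mwscript eval.php --wiki=svwiki
    > $cache = wfGetCache( CACHE_ANYTHING );
    > var_dump( $cache->delete( 'placeholder-key-name' ) );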
[21:17:51] New patchset: Hashar; "contint: move apache proxy configuration to module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53425 [21:21:12] RobH: mutante: mediawiki.org no longer has a valid ssl cert [21:21:33] really? [21:21:59] hahahahaha [21:22:04] well [21:22:05] I know why [21:22:17] the new unified certificate is missing mediawiki.org [21:22:23] goddamn it [21:22:25] i blame the parents. [21:22:27] I should probably log what we're doing [21:22:44] !log installing the new unified ssl certificate on all ssl servers [21:22:51] Logged the message, Master [21:22:53] still gets the old one (for now) [21:23:25] !log seems that the unified certificate is missing mediawiki.org, depooling the ssl servers that are serving it [21:23:32] Logged the message, Master [21:24:50] New patchset: RobH; "cert is missing mediawiki.org" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53462 [21:24:59] ok. it should be ok now [21:25:45] or not. checking the pooling [21:27:11] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Tue Mar 12 21:27:09 UTC 2013 [21:27:31] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [21:28:46] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53462 [21:39:52] ah. mediawiki.org always goes to tamps [21:39:53] *tampa [21:40:59] ok. fixed [21:45:25] New patchset: Pyoungmeister; "adding to changelog for https://gerrit.wikimedia.org/r/#/c/52543/" [operations/debs/lucene-search-2] (master) - https://gerrit.wikimedia.org/r/53463 [21:47:01] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [21:47:21] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [21:47:32] Shall I take that as a "no" regarding intermittent issues with bits.wikimedia.org ? [21:48:00] Jarry1250: I haven't seen it, honestly. But, you mean just bits.w.o just not responding or what? [21:48:30] greg-g: Well, or any error, or something. Just a lack of CSS/JS on some pageloads [21:49:54] Jarry1250: haven't heard of anything, nothing in Bugzilla that I could find [21:50:26] I can try and track the next occurance, get some more details [21:53:46] Jarry1250: are you using https? [21:53:52] RobH: cert switch in progress still? [21:54:00] we are doing that, yes [21:54:01] but.... [21:54:10] we're ensuring all connections are gone from a host before restarting nginx [21:54:17] k [21:54:23] it shouldn't cause any user noticable issues [21:54:37] yes, https, but this has been going on like, at least least 24 hours [21:55:07] android chrome (nexus 4) is not trusting whatever I'm getting now. so presumably that's the new cert [21:55:22] yeah, definitely unrelated, then [21:55:53] Firefox on the same device is fine [21:57:41] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Tue Mar 12 21:57:33 UTC 2013 [21:58:21] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [21:58:56] something weird. maybe it's a missing part of the chain. I'm getting the old cert untrusted in chrome [21:59:26] (only one of them is 2016-01-20, right???) [21:59:35] are you in europe? [21:59:43] Brooklyn [22:00:30] jeremyb-phone: which site? [22:00:56] was just using https://en.wikipedia.org/w/api.php as a test [22:01:46] it's working properly for me [22:02:23] and it shows the full chain for me [22:02:47] back at a computer now [22:03:01] hrm, can you go to europe to prove our hypothesis correct? 
;) [22:03:26] * MatmaRex is in Europe, what are you breaking? [22:03:53] just got a watchmouse certificate alert [22:04:34] Change merged: Pyoungmeister; [operations/debs/lucene-search-2] (master) - https://gerrit.wikimedia.org/r/53463 [22:04:49] ALERT! https services: SSL certificate problem: unable to get local issuer certificate (Peer certificate cannot be authenticated with given CA certificates). [22:05:58] Ryan_Lane: it checks https://en.wikipedia.org/wiki/Main_Page [22:08:05] are we really sure all the servers are serving the new one ? with it being sporadic it sounds sort of like there's 1 server misbehaving [22:08:37] i'm getting somewhere with s_client i think [22:09:31] ssl1001-1004 are all serving the same cert [22:10:01] New patchset: Pyoungmeister; "CHANGELOG!!!!!!!!!!!!!!!!!!!!!!!!!!" [operations/debs/lucene-search-2] (master) - https://gerrit.wikimedia.org/r/53465 [22:10:13] let me use certificate verification with s_client [22:11:20] yeah, this is broken [22:11:28] be clearer [22:11:44] it's serving a chain of a good cert on top of a WMF self signed cert [22:11:50] that's it. just 2 in the chain [22:11:55] self-signed? ewwwwwww [22:11:59] god damnit [22:12:14] Issuer: C=US, ST=California, L=San Francisco, O=Wikimedia Foundation, CN=Wikimedia CA [22:12:23] that'll do it [22:12:44] ah [22:12:52] found it [22:13:07] New patchset: Pyoungmeister; "CHANGELOG!!!!!!!!!!!!!!!!!!!!!!!!!!" [operations/debs/lucene-search-2] (master) - https://gerrit.wikimedia.org/r/53465 [22:13:14] !!!!!!!!!!!!!!!!!!!!!!!! [22:13:27] most browsers actually won't complain about this [22:13:37] s_client does! :) [22:13:41] yeah, of course ;) [22:13:47] any strict client will and should [22:13:59] browsers will say "this isn't the right cert in the chain, let's actually just look it up" [22:14:28] some will [22:14:35] some are not deterministic [22:14:38] chrome, firefox, safari and opera don't [22:14:40] neither does IE [22:14:43] some depend if they've seen the other intermediate before [22:14:53] the intermediate is actually served [22:14:57] err [22:15:00] sorry [22:15:10] if the intermediate is missing, it looks it up [22:15:15] basically all browsers do [22:15:21] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 189 seconds [22:15:21] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 189 seconds [22:15:23] anyway, let me know when to test :) [22:15:24] the root CAs are installed in the browser [22:16:16] most libraries that bots and such use would complain, if they actually check. many libraries don't even check by default [22:16:43] maybe my info is outdated. but there was a point not too long ago when it was true. some depended on whether they had seen the intermediate before. 
some just refuse to work unless the intermediate is there [22:16:56] New patchset: Ryan Lane; "Name the chain for unified" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53466 [22:17:01] it's been years since I've seen an issue with that [22:17:01] Change merged: Pyoungmeister; [operations/debs/lucene-search-2] (master) - https://gerrit.wikimedia.org/r/53465 [22:17:20] we should really make puppet fail if the chain isn't listed [22:18:02] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [22:18:51] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53466 [22:20:13] Change merged: Demon; [operations/debs/lucene-search-2] (master) - https://gerrit.wikimedia.org/r/52890 [22:20:53] Change merged: jenkins-bot; [operations/debs/lucene-search-2] (master) - https://gerrit.wikimedia.org/r/52891 [22:21:00] !log depooling ssl3001 ssl3002 ssl1 ssl2 ssl1001 and ssl1002 [22:21:08] Logged the message, Master [22:23:05] New patchset: Demon; "Close a couple of leaking resources" [operations/debs/lucene-search-2] (master) - https://gerrit.wikimedia.org/r/52893 [22:23:06] !log pushing new version 2.7 of wikimedia-task-appserver to wikipedia-precise repo (to fix apache-sanity-check) [22:23:12] Logged the message, Master [22:23:16] wikimedia-precise of course [22:23:57] Change merged: jenkins-bot; [operations/debs/lucene-search-2] (master) - https://gerrit.wikimedia.org/r/52893 [22:25:01] PROBLEM - Puppet freshness on virt1000 is CRITICAL: Puppet has not run in the last 10 hours [22:25:01] PROBLEM - Puppet freshness on virt0 is CRITICAL: Puppet has not run in the last 10 hours [22:28:12] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Tue Mar 12 22:28:02 UTC 2013 [22:28:23] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [22:28:35] !log pooling ssl3001 ssl3002 ssl1 ssl2 ssl1001 and ssl1002 [22:28:41] Logged the message, Master [22:28:47] !log depooling ssl3003 ssl3 ssl4 ssl1003 ssl1004 [22:28:54] Logged the message, Master [22:31:07] New patchset: Demon; "Fix a bunch of type mistakes that relied on casting" [operations/debs/lucene-search-2] (master) - https://gerrit.wikimedia.org/r/52894 [22:32:36] New review: Demon; "PS3 restores the HashMap in PositionalScorer." [operations/debs/lucene-search-2] (master) - https://gerrit.wikimedia.org/r/52894 [22:32:45] New patchset: Demon; "Useless warning supression" [operations/debs/lucene-search-2] (master) - https://gerrit.wikimedia.org/r/52895 [22:32:48] jeremyb-phone: ok. should be good now [22:32:52] PROBLEM - Host europium is DOWN: CRITICAL - Plugin timed out after 15 seconds [22:33:03] !log pooling ssl3003 ssl3 ssl4 ssl1003 ssl1004 [22:33:09] Logged the message, Master [22:33:34] New review: Ram; "Looks ok." [operations/debs/lucene-search-2] (master) C: 1; - https://gerrit.wikimedia.org/r/52894 [22:33:50] New patchset: Dzahn; "up version of wikimedia-task-appserver to 2.7 in changelog" [operations/debs/wikimedia-task-appserver] (master) - https://gerrit.wikimedia.org/r/53469 [22:34:53] Change merged: jenkins-bot; [operations/debs/lucene-search-2] (master) - https://gerrit.wikimedia.org/r/52894 [22:35:29] Change merged: Dzahn; [operations/debs/wikimedia-task-appserver] (master) - https://gerrit.wikimedia.org/r/53469 [22:35:43] New review: Ram; "Looks good." 
[operations/debs/lucene-search-2] (master) C: 1; - https://gerrit.wikimedia.org/r/52895 [22:38:05] RECOVERY - Host europium is UP: PING OK - Packet loss = 0%, RTA = 0.38 ms [22:38:17] Ryan_Lane: looks good [22:38:22] great [22:39:00] * jeremyb_ is not crazy about having the same expiry for new and old though [22:39:10] they just reissued [22:39:14] e.g. on my phone it doesn't show subjectaltnames [22:39:21] but it does show expiry! [22:39:28] so that would have been a nice way to know [22:39:35] that's cause your phone's display of it sucks :) [22:39:44] New patchset: Lcarr; "removing celsus from decom" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53471 [22:39:49] put a bug in with your phone's browser's upstream [22:40:03] hey, chrome's display is what i was talking about and that's better than firefox's! [22:40:14] (both mobile) [22:40:19] put a bug in with both ;) [22:40:31] PROBLEM - SSH on europium is CRITICAL: Connection timed out [22:41:11] Change merged: jenkins-bot; [operations/debs/lucene-search-2] (master) - https://gerrit.wikimedia.org/r/52895 [22:42:06] hey notpeter looks like the mysql module has an issue that i cant figure out in 60 seconds, if you want to look at it -- err: Failed to apply catalog: Parameter alias failed: mysql-client-5.5 can not create alias mysql-client: object already exists at /var/lib/git/operations/puppet/manifests/mysql.pp:513 [22:44:50] PROBLEM - Host europium is DOWN: PING CRITICAL - Packet loss = 100% [22:45:25] Ryan_Lane, does this new cert cover pa.us.wikimedia.org? [22:45:38] wtf is pa.us.wikimedia.org? [22:45:39] and nop [22:45:44] *no. it doesn't [22:45:47] wikimedia chapter [22:45:48] a valid domain with https enabled :) [22:45:56] i suggested to use wikimedia.us for those chapters .. but ... [22:46:04] Pennsylvania, USA [22:46:05] too much of a fight [22:46:21] we don't support sub-sub domains [22:46:30] because that would get rid of sub.sub domains, the cert errors AND make use of wikimedia.us .. and they are US chapters .. [22:46:34] they should all be renamed [22:46:39] ... Well in that case you have a sub-sub domain to rename :) [22:46:43] At the moment it gives a cert for *.wikipedia.org [22:46:46] Which is invalid [22:46:49] yeah [22:46:54] Which is no different [22:47:02] it should be renamed to pa-us.wikimedia.org [22:47:16] or we should put all chapters under chapters.wikimedia.org [22:47:23] then we can add *.chapters.wikimedia.org [22:47:47] it's absurd to have randomly named sub-sub domains [22:47:53] get wikimedia.us cert and make it pa.wikimedia.us, wikimedia.us is unconfigured in Apache [22:48:04] mutante: that doesn't solve any prolems [22:48:06] *problems [22:48:10] that actually makes it worse [22:48:37] We've more of these stupid ones [22:48:37] arbcom.nl.wikipedia.org [22:48:43] we should stick all chapters under some sane url scheme [22:48:48] 4 arbcoms [22:49:06] most have their own TLDs [22:49:08] noboard_chapterswikimedia [22:49:11] those can redirect [22:49:18] *_labswikimedia need to go and die [22:49:24] Krenair: why do we need to support SSL for defunct chapters? [22:49:24] Reedy: +1000 [22:49:29] 'wg_enwiki' => '//wg.en.wikipedia.org', [22:49:42] jeremyb_, we don't. [22:49:51] But since we're supporting SSL, we should give a valid certificate. [22:49:52] Krenair: Pennsylvania's defunct [22:50:07] readerfeedback.labs [22:50:10] Pennsylvania has been claimed by NYC! 
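The certificate back-and-forth above (the watchmouse "unable to get local issuer certificate" alert, the chain that turned out to be the site certificate stacked on a WMF self-signed CA, and the unified certificate initially missing mediawiki.org) can all be checked from one place with a strict TLS client, much like the openssl s_client runs Ryan_Lane describes. The sketch below is illustrative only: the hostnames are the ones mentioned in the conversation, and which certificate you actually receive depends on which cluster answers, so it will not reproduce the 2013 incident. It does three things a lenient browser may paper over: it fails on an incomplete chain, it prints the subjectAltName entries so a missing name like mediawiki.org is visible, and it applies the one-label wildcard rule that leaves sub-sub domains such as pa.us.wikimedia.org uncovered by a *.wikimedia.org entry.

```python
#!/usr/bin/env python3
"""Strict certificate check: chain verification, SAN listing, wildcard coverage."""
import socket
import ssl

HOST = "en.wikipedia.org"   # same test target used in the conversation
CHECK_NAMES = ["en.wikipedia.org", "mediawiki.org", "pa.us.wikimedia.org"]


def covered(hostname, san_names):
    """A '*' in a SAN entry matches exactly one DNS label (leftmost position only),
    which is why a *.wikimedia.org entry does not cover pa.us.wikimedia.org."""
    host_labels = hostname.lower().split(".")
    for name in san_names:
        labels = name.lower().split(".")
        if len(labels) != len(host_labels):
            continue
        if labels[0] in ("*", host_labels[0]) and labels[1:] == host_labels[1:]:
            return True
    return False


ctx = ssl.create_default_context()  # verifies the chain; does not go fetch missing intermediates
try:
    with socket.create_connection((HOST, 443), timeout=10) as tcp:
        with ctx.wrap_socket(tcp, server_hostname=HOST) as tls:
            cert = tls.getpeercert()
except ssl.SSLCertVerificationError as err:
    # An incomplete chain (e.g. a missing intermediate) fails here with the same
    # complaint the monitoring raised: unable to get local issuer certificate.
    raise SystemExit("verification failed: %s" % err)

sans = [value for kind, value in cert.get("subjectAltName", ()) if kind == "DNS"]
print("subjectAltName entries:", ", ".join(sans))
for name in CHECK_NAMES:
    print("%-25s %s" % (name, "covered" if covered(name, sans) else "NOT covered"))
```

Unlike most browsers, which will often go and look up a missing intermediate themselves, a client built this way behaves like s_client and the watchmouse probe: if the served chain is not complete and properly ordered, the handshake is rejected outright.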
[22:50:16] www.meta [22:50:20] www.nl [22:51:09] liquidthreads.labs, flaggedrevs.labs, www.commons, de.labs, en.labs, those should be all in wikimedia.org zone [22:51:27] all of the labs ones should just redirect to wikitech [22:51:35] DIEDIEDIE [22:51:42] why are they still workin? [22:51:51] * jeremyb_ gets ready to run away [22:52:11] :D [22:52:47] http://pastebin.com/3LpHwtRX - lines from wgServer which match (\..*){3} [22:53:07] most of those are easy to kill off [22:53:18] we have more than just those, though [22:53:36] wg.en.wikipedia.org? [22:53:43] "workinggroup" [22:53:48] hahaha [22:54:43] What's stopping us from kill the labs wikis? [22:54:55] nothing really [22:55:03] we just need to serve a redirect for it [22:55:33] ipv6and4.labs :) [22:55:49] * Ryan_Lane sighs [22:55:51] funny how we have so many things called lab [22:55:53] labs [22:55:59] that is an A record directly to ssl3001 [22:56:04] I want that to die very badly [22:56:07] maybe wikimedians aren't very inventive when it comes to names [22:56:23] Couldn't we just delete *.labs.wikimedia.org? [22:56:23] it's to collect statistics on ipv6 usage [22:56:26] do we have beakers and test tubes? [22:56:31] Krenair: better to redirect [22:56:46] I thought they were all closed testing wikis [22:56:54] TimStarling: pretty sure all of those were named by the foundation [22:57:07] There's likely inbound links all over the place [22:57:30] hell, I wasn't even the one to use Labs as my project name [22:57:33] it was assigned to me [22:57:47] New review: Dzahn; "upped version to 2.7-1 in:" [operations/debs/wikimedia-task-appserver] (master) - https://gerrit.wikimedia.org/r/49231 [22:58:49] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Tue Mar 12 22:58:43 UTC 2013 [22:59:30] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [23:00:08] there they are: http://pastebin.com/Xcn87J9z [23:00:32] RECOVERY - MySQL Slave Delay on db65 is OK: OK replication delay 0 seconds [23:00:32] RECOVERY - MySQL Replication Heartbeat on db65 is OK: OK replication delay 0 seconds [23:00:49] duplicates! [23:00:59] I have some different ones: http://pastebin.com/3LpHwtRX [23:01:25] Oh mutante's are all *.labs.wm.o [23:02:42] So what needs to be done to move these domains? Just changes to apache-config, mediawiki-config, and DNS stuff? [23:03:08] not MediaWiki config [23:03:11] mediawiki config is just to drop the old stuff [23:03:13] if we want to keep and redirect, just Apache config, keep in DNS [23:03:27] add to deleted.dblist [23:03:28] already started to open redirects.conf [23:03:36] let's kill them [23:03:39] WITH PREJUDICE [23:03:54] But wouldn't we need to update wgServer in mediawiki-config? [23:04:04] Why? [23:04:44] It's pointing to the old domains... [23:06:00] woooo. kill kill kill [23:06:05] This is if we keep the wikis of course [23:06:25] soo.. redirect or kill .. kill is just dropping from DNS :) [23:06:43] results.labs - can't connect to server, btw [23:07:29] I can see that we might want to delete the *.labs.wm.o domains [23:08:25] But I don't think arbcom.en.wikipedia.org should be deleted :) [23:08:39] why not? 
:) [23:11:14] New patchset: Demon; "Update the ldap scripts to pep8 compliant" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53476 [23:19:30] New patchset: Dzahn; "redirect old labs sub.sub domains to wikitech" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/53478 [23:20:28] New patchset: Dzahn; "redirect old labs sub.sub domains to wikitech" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/53478 [23:22:40] New patchset: Dzahn; "redirect old labs sub.sub domains to wikitech, and also move the old labs.wm redirect to wikitech now" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/53478 [23:29:19] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Tue Mar 12 23:29:12 UTC 2013 [23:29:29] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [23:29:41] If we get Peter/Asher to run database backups for them, we can drop those out too [23:31:28] Change abandoned: Lcarr; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/52936 [23:31:34] Change abandoned: Lcarr; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/52938 [23:31:50] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53471 [23:33:48] !log tstarling synchronized php-1.21wmf10/maintenance/updateCollation.php [23:33:54] Logged the message, Master [23:34:27] !log tstarling synchronized php-1.21wmf11/maintenance/updateCollation.php [23:34:34] Logged the message, Master [23:38:10] New patchset: RobH; "changing ssl cert monitoring to 90 days" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53485 [23:39:29] !log upgrading pmtpa search nodes to lucene-search-2 2.1.7wm1 [23:39:37] Logged the message, notpeter [23:43:16] New patchset: Reedy; "Kill wikimedialabs wiki configs!" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/53487 [23:44:07] I'm doing package upgrades with dsh -F30 [23:44:22] since it all needs to be done as near to atomically as possible [23:44:44] but I guess the updateCollation.php will still take a day [23:46:16] maybe we should find a better way to do this when it comes to Ubuntu 14.04 [23:47:05] New review: RobH; "jenkins testing seems broken, so merging this simple change, full speed ahead and damn the consequen..." [operations/puppet] (production); V: 2 - https://gerrit.wikimedia.org/r/53485 [23:47:09] You mean so it's not a year after the release date before we deploy the updates? [23:47:12] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53485 [23:47:18] well, 11 months [23:48:32] well, that would be nice too [23:48:55] but I mean so that categories aren't broken for a day or two while the update is in progress [23:49:53] the ICU manual suggests compiling the application against both versions of ICU [23:50:47] if we did that, we could configure the ICU version on a per-wiki basis, which would be an improvement [23:51:08] maybe we could even encode the ICU version in the categorylinks row and then choose an appropriate version to use for sorting each category [23:51:08] storing 2 cl_sortykey columns? [23:51:23] Change merged: Dzahn; [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/53478 [23:51:28] sure, could do that too [23:51:34] with the chinese-collation branch [23:52:23] though, storing the version as you suggested would be cheaper (storage space wise). 
2 columns allows migration to occur while not changing the currently used version... [23:52:39] I haven't actually looked at that branch in quite a while [23:53:54] the chinese-collation branch just uses more than one categorylinks row per cl_from/cl_to pair [23:54:11] so cl_from/cl_to/cl_collation is unique [23:54:50] Aha [23:57:01] woosters: why is en.m throwing SSL warnings in chrome ? [23:57:05] did you guys change somethign ? [23:57:16] RobH: Ryan_Lane ^^ [23:57:25] we pushed the new unified certificate [23:57:26] tfinc: is it currently? [23:57:27] it's supposed to be fixed... [23:57:32] tfinc: for which site? [23:57:35] as in the past couple hours [23:57:37] nope [23:57:38] en.m [23:57:40] in mobile, or desktop? [23:57:41] like i said [23:57:45] both [23:57:47] en.m. what? [23:57:58] en.m.wikipedia.org on both desktop and mobile [23:58:26] it would have been useful to hear this was going to happen [23:59:49] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Tue Mar 12 23:59:45 UTC 2013
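Returning to the collation thread just above (updateCollation.php keeping categories mis-sorted for a day or more, and the idea of recording which collation produced each sortkey so that cl_from/cl_to/cl_collation is unique): the shape of that migration can be pictured with a small toy model. Everything here is illustrative — the collation identifiers and sortkeys are invented, and this is not the chinese-collation branch code — it only shows why tagging rows with a collation lets new sortkeys be backfilled while readers keep getting the old ordering:

```python
"""Toy model of per-row collation tagging for categorylinks-style data."""

# (page_id, category, collation_id) -> sortkey; the triple is unique,
# mirroring the cl_from/cl_to/cl_collation uniqueness discussed above.
rows = {}


def write_sortkeys(page_id, category, sortkeys_by_collation):
    """During migration, store one sortkey per collation still in play."""
    for collation, sortkey in sortkeys_by_collation.items():
        rows[(page_id, category, collation)] = sortkey


def category_members(category, active_collation):
    """Read path: sort using only the sortkeys of the wiki's active collation."""
    members = [(sortkey, page_id)
               for (page_id, cat, collation), sortkey in rows.items()
               if cat == category and collation == active_collation]
    return [page_id for _, page_id in sorted(members)]


# Hypothetical sortkeys; real ones would come from the two ICU builds.
write_sortkeys(1, "Writers", {"icu-old": "AAS", "icu-new": "ÅAS"})
write_sortkeys(2, "Writers", {"icu-old": "BERG", "icu-new": "BERG"})

print(category_members("Writers", "icu-old"))   # [1, 2] -- ordering readers see today
print(category_members("Writers", "icu-new"))   # [2, 1] -- ordering after the switch
```

That is the property the conversation converges on: rows (or a second sortkey column) for the new ICU build can be written gradually, the wiki keeps sorting by the old collation until everything is in place, and the cutover becomes a configuration change rather than a day or two of half-migrated categories.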