[00:00:00] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.037 seconds [00:02:05] New patchset: awjrichards; "ensure all photo uploads go to commons" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/36103 [00:03:00] Change merged: MaxSem; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/36103 [00:05:04] !log awjrichards synchronized wmf-config/InitialiseSettings.php 'Preparing to use commons API for mobile beta photo upload' [00:05:12] Logged the message, Master [00:06:08] !log awjrichards synchronized wmf-config/mobile.php 'Enabling use of commons API for mobile beta photo upload' [00:06:16] Logged the message, Master [00:06:28] Oh… mutante, I think your suggesting that I use the parameterized webserver::php5 broke things, because it conflicts with webserver::apache2 which is defined elsewhere [00:06:34] *suggestion [00:06:40] Hm [00:07:02] andrewbogott: how can you tell it breaks? [00:07:12] err: Could not retrieve catalog from remote server: Error 400 on SERVER: Duplicate definition: Package[apache2] is already defined in file /var/lib/git/operations/puppet/manifests/webserver.pp at line 91; cannot redefine at /var/lib/git/operations/puppet/manifests/webserver.pp:42 on node gallium.wikimedia.org [00:07:32] oh.. webserver::apache2 is already on gallium too? [00:08:00] yeah, that line in my class was a c/p from the integration class [00:08:32] eh..yeah.. in that case 2 different setups on one host... meh [00:08:55] So, I can change contint.pp to use the parameterized version, or change docs to use the simple version. 
[00:09:06] use the one with parameters [00:09:09] New patchset: awjrichards; "Disable commons API for mobile beta photo uploads on testwiki and test2wiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/36926 [00:09:16] and you just need it once [00:09:35] Well… you only need it once if you know that both classes are going to be on the same system. [00:09:52] A proper class should list all its dependencies, not just assume that they'll be there...? [00:09:55] i guess it should technically be 3 classes then :p [00:10:03] one that installs the webserver, and one per Apache site [00:10:29] and then all 3 applied on the node [00:10:40] But… isn't it a dependency? [00:10:48] I mean, every apache site class should specify that it requires apache [00:10:57] It's not harmful for each of them to have the same requirement [00:11:27] i guess it isnt, as long as you dont mix the 2 methods.. true [00:11:37] Change merged: MaxSem; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/36926 [00:11:50] the one with parameters is definitely newer than the other [00:12:35] New patchset: Andrew Bogott; "Use the slightly-more-modern parameterized def for apache." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36927 [00:12:50] So… that patch makes that change, although I'm somewhat skeptical that it won't break a bunch of things [00:13:11] There's one quick way to find out! [00:13:30] why would it break a bunch? 
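The Duplicate definition error andrewbogott hit, and the layout mutante sketches above (one class that installs Apache, one class per site, all applied on the node), can be illustrated in Puppet. Class, site, and file names below are hypothetical, not the actual production manifests:

```puppet
# Declaring Package['apache2'] in two classes applied to the same node
# fails the catalog compile with "Duplicate definition: Package[apache2]
# is already defined". The fix: declare the package in exactly one class.
class webserver::apache2 {
    package { 'apache2':
        ensure => latest,
    }
    service { 'apache2':
        ensure  => running,
        require => Package['apache2'],
    }
}

# Each per-site class pulls in the webserver class via 'include', which
# is idempotent, and orders itself after the package:
class sites::docs {
    include webserver::apache2

    file { '/etc/apache2/sites-available/docs':
        source  => 'puppet:///files/apache/sites/docs',
        require => Package['apache2'],
        notify  => Service['apache2'],
    }
}

class sites::integration {
    include webserver::apache2
    # ... site-specific configuration here ...
}

# Both site classes now coexist on one node without conflict:
node 'gallium.wikimedia.org' {
    include sites::docs
    include sites::integration
}
```

This matches the point made above: it is not harmful for every site class to state the same requirement, as long as the requirement is expressed with `include` (or `require`) rather than by each class re-declaring the resource itself.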
[00:13:35] we use it in other places for sure [00:13:48] *shrug* I'm just superstitious about touching a file that I didn't write and haven't thought about [00:13:56] !log awjrichards synchronized wmf-config/InitialiseSettings.php 'Disable use of commons api for mobile beta photo uploads on testwiki and test2wiki' [00:14:04] Logged the message, Master [00:14:14] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36927 [00:16:49] Ryan_Lane, LeslieCarr: yeah I've postponed merging it because it affects all hosts and want to reserve a nice time to try it/merge it (testing in labs doesn't help in this case) [00:17:07] Ryan_Lane, LeslieCarr: didn't know anyone else was interested, I'll have a look tomorrow. [00:17:36] New patchset: Ryan Lane; "Dep host needs appserver packages (like php)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36928 [00:17:57] LeslieCarr: as for the SSH module, it's outdated by now. I've refrained from merging it because ma rk didn't like the whitespace and we were supposed to discuss it further [00:18:10] cool [00:18:12] what? 2 spaces? [00:18:21] i was just all "egads so many commits" [00:18:34] mutante: Yeah, it breaks because the integration class explicitly declared a bunch of classes that are now created by the parametrized class... [00:19:37] …or maybe just one actually [00:19:39] LeslieCarr: thanks, that's awesome [00:19:58] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36928 [00:20:38] New patchset: Andrew Bogott; "Don't add libapache2-mod-php5 here, we're now getting it from webserver::php5." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36929 [00:21:41] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36929 [00:21:49] andrewbogott: yep..ack [00:22:50] andrewbogott: and one more.. SSL [00:22:53] i can do it :p [00:23:40] thanks! 
That's "apache_module { ssl: name => "ssl" }" right? [00:23:49] yes [00:24:16] New patchset: Dzahn; "dont need to load Apache module SSL, getting it from parameterized class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36930 [00:24:36] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36930 [00:25:33] Hm. [00:26:20] Oh! The thing you said before about only needing to define it in one place was right :( I was thinking of 'requires' which this is not. [00:26:23] So, I will fix that one. [00:26:53] Hm… there must be some way of properly declaring the dependency in two places... [00:26:56] i just saw that..yes [00:27:02] for now..just take it out of docs.pp [00:27:23] i think proper would be one class that _just_ sets up Apache [00:27:30] but not the sites [00:27:50] New patchset: Andrew Bogott; "Remove a duplicate declaration" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36931 [00:27:58] Yeah, and the sites should 'require' that rather than declaring anything… I think. [00:30:09] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36931 [00:30:30] andrewbogott: can you think of a better term than "doc site server"? [00:30:46] i was going to add a system role [00:31:11] It should be more specific, but I don't yet know what all this site will be used for. [00:31:30] Like, it's different from wikitech, somehow... [00:32:11] New patchset: Dzahn; "add a system role for the new misc/docs class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36933 [00:32:50] andrewbogott: ^ that is always nice to have.. these roles all show in motd when you connect to server [00:32:59] cool. [00:33:00] ..or should... 
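On "properly declaring the dependency in two places": a resource such as the `apache_module { ssl: name => "ssl" }` above may be declared only once per catalog, but wrapping it in a class makes it safe to reference from several site classes, because `include` is idempotent. A sketch under that assumption (the wrapper class name is made up, and `apache_module` is taken to be the local define the participants are using):

```puppet
# This fails if two classes in the catalog both contain it:
#   apache_module { 'ssl': name => 'ssl' }
#
# Wrapping the declaration in a class makes it repeat-safe, because a
# class can be include'd any number of times:
class apache::mod::ssl {
    apache_module { 'ssl':
        name => 'ssl',
    }
}

class docs_site {
    include apache::mod::ssl    # fine
}

class other_ssl_site {
    include apache::mod::ssl    # also fine: no duplicate definition
}
```

Later versions of puppetlabs-stdlib also offer `ensure_resource()` for declaring a bare resource from multiple places, but the class-wrapping pattern above works without any extra library.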
[00:33:11] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36933 [00:33:39] I'm not logged into sockpuppet atm, I'll leave that bit to you [00:33:39] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:37:23] !log rebooting es1004 for kernel upgrade before repooling [00:37:31] Logged the message, notpeter [00:40:54] Does stafford need attention or just a break? [00:41:56] New patchset: awjrichards; "Fix setting of wgMFPhotoUPloadEndpoint to happen AFTER inclusion of MObileFrontend.php" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/36936 [00:44:01] Change merged: awjrichards; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/36936 [00:44:57] andrewbogott: a little of both [00:45:06] !log awjrichards synchronized wmf-config/mobile.php 'Fix api endpoint definition for mobile beta' [00:45:06] we need to scale puppet [00:45:14] Logged the message, Master [00:48:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.525 seconds [00:48:23] !log mlitn synchronized php-1.21wmf4/extensions/ArticleFeedbackv5/modules 're-sync aftv5 js&css files with newer modification date' [00:48:31] Logged the message, Master [00:48:57] PROBLEM - MySQL Slave Delay on es1004 is CRITICAL: CRIT replication delay 6411790 seconds [00:55:29] mutante: It goes! http://doc.wikimedia.org/puppet/ [00:56:20] whoa [00:56:24] *click* [00:57:23] andrewbogott: woot [00:57:34] Ugly, but enough for today. [00:57:36] looks like javadoc [00:57:50] is this hooked up somehow to jenkins? or a cron? [00:58:34] I am pretty sure that puppet will update it. [00:58:46] oh, by puppet! 
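If "puppet will update it" means a scheduled regeneration of http://doc.wikimedia.org/puppet/, a minimal sketch is a cron resource driving `puppet doc`. The schedule, output path, and manifest path below are assumptions for illustration, not the actual production setup:

```puppet
# Hypothetical: regenerate the RDoc-style manifest documentation nightly
# and publish it under the doc site's docroot.
cron { 'update-puppet-docs':
    command => '/usr/bin/puppet doc --mode rdoc --outputdir /srv/org/wikimedia/doc/puppet --manifestdir /var/lib/git/operations/puppet/manifests >/dev/null 2>&1',
    user    => 'root',
    hour    => 3,
    minute  => 0,
}
```

The "looks like javadoc" remark fits: `puppet doc --mode rdoc` emits a browsable class/define index much like javadoc's package index.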
hah [01:10:51] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [01:10:51] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [01:22:42] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:26:06] New patchset: Dzahn; "contint - do not install all of the huge ia32-libs, instead just install what is actually needed by Android SDK on a multiarch system. thanks paravoid" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36945 [01:26:28] New review: Dzahn; "http://stackoverflow.com/questions/2710499/android-sdk-on-a-64-bit-linux-machine" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/36945 [01:27:48] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [01:27:48] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [01:29:02] New patchset: Dzahn; "contint - do not install all of the huge ia32-libs, instead just install what is actually needed by Android SDK on a multiarch system. thanks paravoid" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36945 [01:29:24] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36945 [01:37:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 5.300 seconds [01:45:51] !log gallium - puppet runs fine again, has our librsvg 2.36-1wm1 packages and Android SDK stuff should also work (@hashar) [01:46:04] Logged the message, Master [01:50:46] /msg NickServ identify WM$s01.B [01:51:26] /msg NickServ mike_wang WM$s01.B [02:03:19] test message [02:04:10] Success. 
[02:05:03] test message [02:05:48] test from Tampa [02:08:36] New patchset: Ryan Lane; "Add ability to set allows and denies for docroot" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36948 [02:09:00] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36948 [02:13:19] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:16:32] New patchset: Ryan Lane; "Limit deployment webserver access to our networks" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36949 [02:16:55] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36949 [02:18:04] good night! [02:18:18] mike_wang: night [02:18:28] mike_wang: you may want to change your identify password, btw [02:18:35] you had written it into the channel [02:18:40] (and this is a public channel) [02:19:18] I made a mistake. I will change my passwd. [02:24:05] !log LocalisationUpdate completed (1.21wmf5) at Wed Dec 5 02:24:05 UTC 2012 [02:24:14] Logged the message, Master [02:30:07] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [02:31:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.017 seconds [02:42:16] PROBLEM - MySQL disk space on db78 is CRITICAL: DISK CRITICAL - free space: /a 117053 MB (3% inode=99%): [02:44:52] !log LocalisationUpdate completed (1.21wmf4) at Wed Dec 5 02:44:51 UTC 2012 [02:45:00] Logged the message, Master [02:46:28] RECOVERY - Puppet freshness on ms1002 is OK: puppet ran at Wed Dec 5 02:46:12 UTC 2012 [03:31:10] PROBLEM - Puppet freshness on ms-be3 is CRITICAL: Puppet has not run in the last 10 hours [05:00:57] PROBLEM - Backend Squid HTTP on sq48 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:02:45] PROBLEM - Frontend Squid HTTP on sq48 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:03:59] Change abandoned: Ori.livneh; 
"(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31146 [05:15:30] PROBLEM - SSH on sq48 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:24:30] PROBLEM - Host sq48 is DOWN: PING CRITICAL - Packet loss = 100% [05:41:00] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [05:48:57] PROBLEM - Puppet freshness on ssl3001 is CRITICAL: Puppet has not run in the last 10 hours [06:17:27] AaronSchulz: was that query on frwiki? (which is now officially over 4 million in the job queue >_<) [06:52:24] PROBLEM - swift-container-server on ms-be1 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [06:52:33] PROBLEM - swift-account-server on ms-be1 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [06:54:21] PROBLEM - swift-container-replicator on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [06:55:07] PROBLEM - swift-account-auditor on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [06:55:34] PROBLEM - swift-account-reaper on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [06:57:31] PROBLEM - swift-object-server on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [06:57:40] PROBLEM - swift-container-server on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [06:57:40] PROBLEM - swift-account-replicator on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [06:57:58] PROBLEM - swift-account-server on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python 
/usr/bin/swift-account-server [07:01:22] hmm [07:03:13] RECOVERY - swift-account-auditor on ms-be2 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [07:03:23] yeah, those were bogus [07:03:49] RECOVERY - swift-container-replicator on ms-be2 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [07:03:58] RECOVERY - swift-account-reaper on ms-be2 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [07:04:07] RECOVERY - swift-object-server on ms-be2 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [07:04:07] RECOVERY - swift-account-replicator on ms-be2 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [07:04:16] RECOVERY - swift-container-server on ms-be2 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [07:04:25] RECOVERY - swift-account-server on ms-be2 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [07:07:16] RECOVERY - swift-account-server on ms-be1 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [07:08:28] RECOVERY - swift-container-server on ms-be1 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [07:33:47] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [07:39:49] New patchset: ArielGlenn; "change object replication max conns to 3 in rsync conf also" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36955 [07:40:14] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36955 [07:41:28] and now we see how that is [08:04:36] !log cleaning up stuff on ms-be5 /srv/swift-storage/sde1/objects/ again, data somehow wound up on the underlying root filesystem, root partition was full 
[08:04:46] Logged the message, Master [08:05:26] PROBLEM - swift-object-updater on ms-be5 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-updater [08:05:35] PROBLEM - swift-account-server on ms-be5 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [08:05:53] PROBLEM - swift-account-auditor on ms-be5 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [08:05:53] PROBLEM - swift-container-updater on ms-be5 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater [08:05:55] yep, we know [08:06:02] PROBLEM - swift-object-replicator on ms-be5 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [08:06:02] PROBLEM - swift-object-auditor on ms-be5 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [08:06:02] PROBLEM - swift-container-auditor on ms-be5 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [08:06:20] PROBLEM - swift-account-replicator on ms-be5 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [08:06:29] PROBLEM - swift-container-server on ms-be5 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [08:06:47] PROBLEM - swift-account-reaper on ms-be5 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [08:06:47] PROBLEM - swift-object-server on ms-be5 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [08:06:47] PROBLEM - swift-container-replicator on ms-be5 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [08:07:32] once these 
files are gone I can remount, turn puppet back on, restart all services, shouldn't take long [08:09:47] that should get it [08:09:47] RECOVERY - swift-container-server on ms-be5 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [08:09:47] RECOVERY - swift-account-replicator on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [08:10:05] RECOVERY - swift-object-server on ms-be5 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [08:10:05] RECOVERY - swift-account-reaper on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [08:10:05] RECOVERY - swift-container-replicator on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [08:10:14] RECOVERY - swift-object-updater on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [08:10:14] RECOVERY - swift-account-server on ms-be5 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [08:10:41] RECOVERY - swift-account-auditor on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [08:10:41] RECOVERY - swift-container-updater on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [08:10:50] RECOVERY - swift-object-replicator on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [08:10:51] RECOVERY - swift-object-auditor on ms-be5 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [08:10:51] RECOVERY - swift-container-auditor on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [08:14:00] ok looks like sde is bad. 
meh [08:27:42] PROBLEM - SSH on snapshot1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:40:41] !log powercycling snapshot1, was in swapdeath [08:40:54] Logged the message, Master [08:43:45] PROBLEM - Host snapshot1 is DOWN: PING CRITICAL - Packet loss = 100% [08:47:21] RECOVERY - SSH on snapshot1 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [08:47:30] RECOVERY - Host snapshot1 is UP: PING OK - Packet loss = 0%, RTA = 0.35 ms [09:10:27] PROBLEM - Puppet freshness on analytics1007 is CRITICAL: Puppet has not run in the last 10 hours [09:44:53] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [11:11:40] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [11:11:40] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [11:28:56] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [11:28:56] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [11:34:38] PROBLEM - Host ms-be3001 is DOWN: PING CRITICAL - Packet loss = 100% [11:39:44] RECOVERY - Host ms-be3001 is UP: PING OK - Packet loss = 0%, RTA = 110.34 ms [11:50:42] New patchset: Dereckson; "(bug 42720) Add ipblock-exempt right to bot group on cs.wikipedia" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/36970 [11:53:59] New review: Dereckson; "shellpolicy" [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/36970 [12:09:45] New review: Mormegil; "See http://cs.wikipedia.org/wiki/Wikipedie:Pod_l%C3%ADpou_%28n%C3%A1vrhy%29#V.C3.BDjimky_z_blokov.C3..." 
[operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/36970 [12:31:14] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [12:36:53] New patchset: Aude; "update settings for wikibase client and repo, in prep for next deployment" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/36205 [12:47:53] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [13:31:50] PROBLEM - Puppet freshness on ms-be3 is CRITICAL: Puppet has not run in the last 10 hours [13:39:21] hush whiner [13:40:16] hehe [13:56:59] Can we make it whine for puppet if it's like over 24 hours or something? [14:02:20] !log reedy synchronized php-1.21wmf5/includes/dao [14:02:30] Logged the message, Master [14:02:59] !log reedy synchronized php-1.21wmf5/includes/db [14:03:08] Logged the message, Master [14:03:32] Warning: the RSA host key for 'hume' differs from the key for the IP address '2620:0:860:2:21d:9ff:fe33:f235' [14:03:32] Offending key for IP in /etc/ssh/ssh_known_hosts:835 [14:03:32] Matching host key in /etc/ssh/ssh_known_hosts:603 [14:03:41] yay, ipv6 [14:03:43] !log reedy synchronized php-1.21wmf5/includes/AutoLoader.php [14:03:52] Logged the message, Master [14:49:59] New review: Dereckson; "Shellpolicy issue has been resolved." 
[operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/36970 [14:53:24] !log Phasing out the Jenkins job linting operations/puppet.git , replacing it with https://integration.mediawiki.org/ci/job/operations-puppet-validate/ [14:53:32] Logged the message, Master [14:57:43] New patchset: Hashar; "validate jenkins job" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37009 [15:01:16] New review: Hashar; "recheck" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/37009 [15:02:53] Change abandoned: Hashar; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37009 [15:12:19] baahhh [15:12:28] stupid stuck scrollback [15:17:18] !log pulled ms-be5:sde1 and ms-be12:sdi from rings, they were borking obj replication across the cluster [15:17:26] Logged the message, Master [15:19:56] sq48 seems to be still down (just looking at nagios-wm scrollback) [15:20:03] i've not a clue if that's important [15:24:14] I'm not ignoring you, I'm poking around [15:29:46] RECOVERY - SSH on sq48 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [15:29:55] RECOVERY - Host sq48 is UP: PING OK - Packet loss = 0%, RTA = 0.20 ms [15:34:16] !log powercycled sq48, hmm it was up for 214 days... fishy squid! 
[15:34:24] Logged the message, Master [15:39:04] RECOVERY - Backend Squid HTTP on sq48 is OK: HTTP OK HTTP/1.0 200 OK - 460 bytes in 0.024 seconds [15:40:07] RECOVERY - Frontend Squid HTTP on sq48 is OK: HTTP OK HTTP/1.0 200 OK - 605 bytes in 0.007 seconds [15:41:04] well almost [15:42:04] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [15:43:56] really not sure if that was sufficient for the replication issue, I see steadily increasing after a sharp decrease in eta and not sure why [15:46:42] may be too soon to tell, I'll look at in a while [15:46:51] out for a couple hours, back later [15:48:06] I suppose taking out the devices forces some data to be moved which will screw up the estimates for a bit too [15:48:10] anyways back in a while [15:49:00] bye [15:50:01] PROBLEM - Puppet freshness on ssl3001 is CRITICAL: Puppet has not run in the last 10 hours [15:50:28] RECOVERY - MySQL disk space on db78 is OK: DISK OK [16:13:31] New patchset: Jgreen; "fixed db name in dumper script" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37021 [16:14:37] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37021 [16:30:58] PROBLEM - Varnish traffic logger on cp1044 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [16:31:16] PROBLEM - Varnish traffic logger on cp1043 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [16:45:01] PROBLEM - Packetloss_Average on oxygen is CRITICAL: CRITICAL: packet_loss_average is 10.0815765079 (gt 8.0) [16:47:34] RECOVERY - Packetloss_Average on oxygen is OK: OK: packet_loss_average is 0.188751788618 [16:47:43] RECOVERY - Varnish traffic logger on cp1044 is OK: PROCS OK: 3 processes with command name varnishncsa [16:48:37] RECOVERY - Varnish traffic logger on cp1043 is OK: PROCS OK: 3 processes with command name varnishncsa [16:53:43] PROBLEM - Backend Squid HTTP on sq48 is CRITICAL: CRITICAL - Socket 
timeout after 10 seconds [17:34:40] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [18:01:27] New patchset: Ottomata; "Setting up blog.wikimedia.org to send varnish logs to main udp2log stream." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37037 [18:02:57] RobH review :)? [18:03:01] https://gerrit.wikimedia.org/r/37037 [18:08:33] ottomata: hey [18:08:45] your hadoop crons are no longer throwing errors [18:08:51] but they are spaming root [18:09:08] does ottomata not get his own spam? ;) [18:09:11] would you be willing to pipe output to whoever this data will be useful to? [18:09:16] yup [18:09:31] well, right now the whole ops team gets it... [18:09:34] but we don't want it [18:09:41] are you sure ? :D [18:10:49] java: a DSL for turning xml files into stacktraces ;) [18:10:54] yes, I'm pretty sure. [18:10:55] notpeter: ape rgos got several labs people to kill their cronspam iirc [18:13:04] paravoid: wait there is a ceph cluster now? [18:13:14] a small one :-) [18:13:39] we've been playing with it a bit, finally [18:13:50] hardware and Dell really didn't make us any favors [18:13:54] * AaronSchulz wishes he knew about that [18:13:57] how many servers? 
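Redirecting the Hadoop cron output as notpeter asks above (so it stops spamming root and goes to whoever the data is useful to) could look like this in Puppet; the job name, command, and recipient address are all made up:

```puppet
# Cron mails a job's stdout/stderr to the owning user (root by default),
# which is what spams the whole ops team. Two common fixes, combined
# here: set MAILTO for the job, and log routine output to a file so only
# genuine errors get mailed.
cron { 'hadoop-report':
    command     => '/usr/local/bin/hadoop-report >> /var/log/hadoop-report.log 2>&1',
    environment => 'MAILTO=analytics@example.org',  # hypothetical recipient
    user        => 'hdfs',
    hour        => 0,
    minute      => 15,
}
```

The `environment` parameter on Puppet's cron type injects `VAR=value` lines above the crontab entry, so cron itself handles the mail routing without any wrapper script.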
[18:14:19] we've built one in esams with one "backend" and one radosgw [18:14:23] we have more machines there [18:14:34] and we have the 720xds in eqiad that are supposed to be arriving today I think [18:14:44] that we can also format and use them as a ceph lab [18:17:13] preilly: If nothing else came up I'd ignore it, but the autoloader thing, unit tests being added later, and other people complaining is enough reason to just revert [18:17:55] AaronSchulz: okay [18:17:58] it can always just be added back, but it's hard to argue on the lists that we should keep it if given the fact that it was missing from the autoloader [18:19:21] hmm, I'll probably make another list post [18:19:32] AaronSchulz: I'm pretty upset about this right now so I'd like to not talk about it [18:19:33] cmjohnson1: hey, when you take down hume to replace disk , please let me know [18:19:40] preilly: sure [18:19:53] I reimaged w/o raid yesterday, so I'm going to need to reimage again when it's got its disks back [18:20:05] (I added link on ticket for disk) [18:20:08] New review: RobH; "Seems fine to me, but my experience with pushing things into memcached like that is limited. I appr..." [operations/puppet] (production); V: 0 C: 1; - https://gerrit.wikimedia.org/r/37037 [18:20:17] oh, but you're not in tampa anymore.... [18:20:25] heh [18:20:28] sbernardin: ^ [18:20:32] !seen volunteer [18:20:35] Steve is the Tampa dude now. [18:20:52] but yea, put it in the RT ticket for it to hunt you down bnefore doing something and he will [18:21:01] ^demon|away: do we have the Forge Author access control for Gerrit set? [18:21:07] or after, as need be. [18:21:24] <^demon|away> preilly: Forge Author is granted on all repos, yes. [18:21:32] <^demon|away> Not forge committer, though. [18:21:32] ^demon|away: okay cool thanks [18:21:54] <^demon|away> (You couldn't amend someone else's change without forge author :)) [18:22:12] <^demon|away> Well, not as easily at least. 
[18:22:18] ^demon|away: yes you could [18:22:29] ^demon|away: but, I'll agree NOT easily [18:24:02] yeah, put comment in [18:24:03] thanks! [18:29:29] binasher, does stuff like https://gerrit.wikimedia.org/r/#/c/36894/2/sql/externally-backed.sql require your approval, too? [18:30:47] MaxSem: it should, though a couple other people should be able to review schema stuff too [18:31:01] i +1'd it [18:31:28] thanks! [18:32:24] <^demon|away> binasher: That's not how it works. Once you've signed up, you've taken over sole responsibility ;-) [18:33:57] * binasher needs to get responsibility sharding merged [18:34:43] <^demon|away> binasher: Put the responsibilities in Git, then they'll be distributed. [18:35:44] speaking of which [18:35:57] binasher: where are the definitions/code for gdash? [18:38:32] Jeff_Green: did you see that db1025 is having a problem with snapshotting? [18:38:59] notpeter: no--where did you see that? [18:39:03] cronspam [18:39:21] 2 hours ago [18:39:28] blarney. [18:39:39] <^demon|away> mutante: Could you please? https://gerrit.wikimedia.org/r/#/c/36882/ [18:40:08] ^demon|away: ok [18:40:32] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36882 [18:40:57] notpeter: Could not retrieve catalog; skipping run ..hrmm [18:41:35] ^demon|away: fyi.. gallium should be fine again .. recent issue with librsvg packages and android sdk [18:41:55] even though i did not really build the android app yet [18:42:08] i just heard you can do it on the cmdline on gallium [18:42:32] <^demon|away> librsvg & android sdk don't affect me, but ok :) [18:42:42] Nemo_bis: they're in puppet/files/graphite/gdash-dashboards but i never actually made a puppet rules to install them.. i really need to, especially if others are going to contribute [18:43:36] ^demon|away: you saw doc.wikimedia.org already? [18:43:37] notpeter: found it. now the question is wtf to do about it? [18:43:48] <^demon|away> mutante: hashar pointed me to it. 
[18:43:55] ^demon|away: ohh..oops..it broke? :p [18:44:04] it looked different yesterday [18:44:24] or maybe it didnt.and i just did not try and open the index [18:44:24] * ^demon|away shrugs [18:44:26] <^demon|away> I dunno [18:44:27] mutante: andrewbogott is working on doc.wikimedia.orgI think [18:44:37] PROBLEM - swift-object-server on ms-be10 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:44:48] is today really an insert-leap-second day? [18:44:50] yep, nevermind,i think i just opened it with /puppet/ before [18:45:14] hashar: https://gerrit.wikimedia.org/r/#/c/36945/2/manifests/misc/contint.pp [18:45:23] mutante / notpeter what do you think of just adding screen to the base install ? [18:45:29] binasher: thanks, do you also happen to know about ganglia config for custom metrics like http://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Miscellaneous%20pmtpa&h=spence.wikimedia.org&v=506&m=enwiki_JobQueue_length ? [18:45:34] LeslieCarr: love it [18:45:40] screen++ [18:45:50] Jeff_Green: a fine question. [18:45:51] LeslieCarr: yes, is it not? i thought its just not there because we have no base yet..puppet did not run [18:45:55] oh [18:46:00] that's possible [18:46:01] :) [18:46:04] notpeter: Buffer I/O error on device dm-4, logical block 0 [18:46:08] that looks unsuperduper to me [18:46:16] oh [18:46:17] fu [18:46:30] LeslieCarr: notpeter ran screen from bastion host [18:46:32] mutante is right - it is in the base already [18:46:43] <^demon|away> While we're talking about base--we've talked about putting git(-core) there. Almost every host needs it these days. [18:46:48] man we're good :) [18:46:58] Jeff_Green: yeah, this be looking like a disk problem to me [18:47:04] yawp [18:47:10] Nemo_bis: see the oddly named puppet class nagios::ganglia::monitor::enwiki in puppet/manifests/nagios.pp [18:47:37] PROBLEM - swift-account-reaper on ms-be10 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[18:48:04] PROBLEM - swift-container-updater on ms-be10 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:48:04] PROBLEM - swift-account-replicator on ms-be10 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:48:07] binasher: thank you very much [18:48:22] PROBLEM - swift-object-updater on ms-be10 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:48:29] New patchset: Reedy; "Everything else over to 1.21wmf5" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37043 [18:48:31] PROBLEM - swift-account-auditor on ms-be10 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:48:40] PROBLEM - swift-container-server on ms-be10 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:48:49] PROBLEM - swift-object-replicator on ms-be10 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:48:49] PROBLEM - swift-container-auditor on ms-be10 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:48:49] PROBLEM - swift-account-server on ms-be10 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:48:49] PROBLEM - swift-object-auditor on ms-be10 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:48:58] PROBLEM - swift-container-replicator on ms-be10 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[18:49:25] RECOVERY - swift-object-server on ms-be10 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [18:49:34] RECOVERY - swift-account-replicator on ms-be10 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [18:49:34] RECOVERY - swift-container-updater on ms-be10 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater [18:49:52] RECOVERY - swift-object-updater on ms-be10 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [18:50:01] RECOVERY - swift-account-auditor on ms-be10 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [18:50:02] mutante, Jeff_Green, hashar: Previously doc.wikimedia.org just said "It works!" Now it points to a directory which may at some point contain a table of contents :) [18:50:10] RECOVERY - swift-container-server on ms-be10 is OK: PROCS OK: 13 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [18:50:20] RECOVERY - swift-container-auditor on ms-be10 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [18:50:20] RECOVERY - swift-object-replicator on ms-be10 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [18:50:20] RECOVERY - swift-account-server on ms-be10 is OK: PROCS OK: 13 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [18:50:20] RECOVERY - swift-object-auditor on ms-be10 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [18:50:21] andrewbogott: you are the boss :-] [18:50:27] But I don't have specific plans outside of tuning up the two directories that are already present. 
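The swift alerts above come from NRPE running a check_procs-style test: count the processes whose argument list matches a regex and alert when the count falls below a minimum. A minimal sketch of that logic (illustrative only, not the real Nagios plugin; the process table below is hypothetical):

```python
import re

def check_procs(cmdlines, pattern, min_procs=1):
    """Count command lines matching the regex; CRITICAL when below the minimum."""
    matched = [c for c in cmdlines if re.search(pattern, c)]
    status = "OK" if len(matched) >= min_procs else "CRITICAL"
    return status, len(matched)

# Hypothetical process table on a storage host:
procs = [
    "/usr/bin/python /usr/bin/swift-object-server /etc/swift/object-server.conf",
    "/usr/bin/python /usr/bin/swift-object-auditor /etc/swift/object-server.conf",
]
check_procs(procs, r"^/usr/bin/python /usr/bin/swift-object-server")  # ('OK', 1)
```

The anchored regex is what keeps `swift-object-server` from also matching `swift-object-auditor` or similar siblings.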
[18:50:28] RECOVERY - swift-container-replicator on ms-be10 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [18:50:43] .htaccess DENY FROM ALL [18:50:44] done. [18:50:46] RECOVERY - swift-account-reaper on ms-be10 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [18:55:46] andrewbogott: not sure if that is what you wanted..but i once used something like this to remove the "It works" stuff... apache_site { no_default: name => '000-default', ensure => absent } [18:57:01] mutante: I think in this case the apache config I added overrode the default. Probably you'll still get 'it works' if you visit by IP. [18:59:13] andrewbogott: hmm.. the default page should be called 000-default.. and it is in sites-enabled.. you just added another file besides it.. doc.wikimedia.org ..did not overwrite or remove the default page [19:00:21] write, the file is still there but the default is overridden by the site I added when a browser uses doc.wikimedia.org [19:00:21] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37043 [19:00:27] *right [19:00:47] There are multiple sites hosted on that one box [19:00:49] RECOVERY - Puppet freshness on zhen is OK: puppet ran at Wed Dec 5 19:00:39 UTC 2012 [19:04:13] andrewbogott: yes, 2 virtual hosts. integration. and doc. ..so what would you like to have instead of the directory ? an index.html with nicer links? [19:04:43] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Everything else to 1.21wmf5 [19:04:45] mutante: Yeah, an index.html with a table of contents probably. [19:04:55] Logged the message, Master [19:05:01] I am assuming that hashar has a vision for other doc sets he would like to put on that site, and that mine is just the first. [19:05:35] yes, thats true [19:07:14] mutante: I was responding to earlier discussion of whether or not the site broke. 
Just saying: it's not broken, it changed from nothing to something :) [19:08:52] oh yeah, that was just me :) i did not add the /puppet/ to the URL [19:10:49] !log reedy synchronized php-1.21wmf5/extensions/DataValues [19:10:57] Logged the message, Master [19:11:01] PROBLEM - Puppet freshness on analytics1007 is CRITICAL: Puppet has not run in the last 10 hours [19:11:11] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/36205 [19:12:52] !log reedy synchronized wmf-config/CommonSettings.php 'wikidatawiki readonly for maintenance' [19:13:01] Logged the message, Master [19:15:27] !log reedy synchronized php-1.21wmf5/extensions/Diff [19:15:35] Logged the message, Master [19:16:07] !log reedy synchronized php-1.21wmf5/extensions/Wikibase [19:16:16] Logged the message, Master [19:16:59] robh: ping [19:17:59] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: git pull is helpful. Actually put everything else to 1.21wmf5 [19:18:08] Logged the message, Master [19:18:56] !log reedy synchronized wmf-config/ 'Wikidata config updates' [19:19:04] Logged the message, Master [19:22:48] cmjohnson1: sup? [19:23:31] intel sfp's....one in the server only? [19:23:45] or both ends? [19:28:43] server end only [19:28:43] finisar on switch side [19:28:44] k...thought so but wanted to confirm...thx [19:28:44] np, thx for checking [19:28:46] notpeter: so the mc server sin eqiad can all go down right? [19:28:46] i would like chris to replace all the DCA cables with fiber [19:29:24] bleh... i may not have enough of something, nm... maybe. 
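The earlier doc.wikimedia.org back-and-forth ("It works!" vs. the new site) is Apache name-based virtual hosting: the request's Host header is matched against each vhost's ServerName, and when nothing matches (e.g. browsing by bare IP), the first-loaded vhost, 000-default, answers. A simplified model of that selection, purely illustrative:

```python
def pick_vhost(sites_enabled, host_header):
    """Name-based vhost selection: first ServerName match wins,
    otherwise the first-loaded vhost acts as the default."""
    for name in sites_enabled:
        if host_header == name:
            return name
    return sites_enabled[0]  # 000-default catches unmatched requests

sites = ["000-default", "doc.wikimedia.org", "integration.wikimedia.org"]
pick_vhost(sites, "doc.wikimedia.org")  # the new site answers
pick_vhost(sites, "10.0.0.1")           # a bare IP falls through to 000-default
```

This is why removing or overriding 000-default changes what a bare-IP visit shows without touching the named sites.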
[19:29:44] cmjohnson1: so there are finisar in the switch [19:29:47] as well as on the table [19:29:59] and there MAY be some in the peering switch in A1, but i think there are only sfp there, not sfp+ [19:30:12] the finisar sfp is the round lever (so you can look and see rather than remove) [19:30:18] the sfp+ finisar have the flat metal lever [19:30:37] if there isnt enough, drop me a procurement ticket and i will order them asap [19:30:50] and just replace enough for what you have, which is what you were doing anyhow =] [19:31:24] i know there were 5 on table, so thats only 2 in addition to the 3 that were down =P [19:31:27] crappy. [19:32:48] only 3 on the table...so 1009.1010 and 1002 may be the only ones swapped today. nothing on the switch...there are sfps in A1...so I will put a ticket in for ya [19:32:58] well damn. [19:33:08] so yea, count up the rest you need [19:33:16] plus lets try to keep 10 spare on site [19:33:25] that is only 5 new connections anyhow. [19:33:40] cmjohnson1: and you ahve like 2 spare in tampa right? [19:34:02] i have 8 spare in tampa [19:34:15] that seems fine. [19:34:27] i'll order a couple extra and they can go to tampa when you go down later [19:34:47] so they will have 10 spare too [19:34:52] PROBLEM - Host srv278 is DOWN: PING CRITICAL - Packet loss = 100% [19:34:55] ok [19:36:05] New patchset: Ottomata; "Installing libdb-dev on gallium for webstatscollector build" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37045 [19:36:27] New review: Ottomata; "Hashar has approved via email." 
[operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/37045 [19:36:33] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37045 [19:36:42] RECOVERY - Host srv278 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [19:37:06] New patchset: Reedy; "Add DataValues to extension-list" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37046 [19:37:15] sorry about that cmjohnson1 i should have noticed we were short [19:37:27] PROBLEM - Host db1025 is DOWN: PING CRITICAL - Packet loss = 100% [19:37:31] lemme know the rt # and i will push the aprrovals through [19:37:38] notpeter: hmm, "Host key verification failed" for hume [19:37:47] well I can get into fenari, meh [19:37:49] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37046 [19:39:19] robh: rt4031 [19:40:09] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused [19:40:50] !log reedy synchronized wmf-config/ [19:40:59] Logged the message, Master [19:41:18] AaronSchulz: it's ipv6ing [19:41:35] that's a verb now? [19:41:41] damn it [19:42:02] Reedy: oh, hume? [19:42:25] Warning: the RSA host key for 'hume' differs from the key for the IP address '2620:0:860:2:21d:9ff:fe33:f235' [19:43:09] RECOVERY - Host db1025 is UP: PING OK - Packet loss = 0%, RTA = 26.52 ms [19:45:24] AaronSchulz: yeah ,I reimaged recently [19:45:26] sorry about that [19:45:42] should I just update my key? [19:45:42] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [19:45:43] should be cleaned up by now... 
odd that it's not [19:45:43] yeah [19:46:19] heh, yeah, now I get Reedy's message [19:46:54] PROBLEM - mysqld processes on db1025 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [19:48:24] New patchset: Dzahn; "add simple index.html for doc.wm site" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37050 [19:48:53] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37050 [19:50:52] !log reedy Started syncing Wikimedia installation... : Rebuild localisationcache for wikidatawiki [19:51:01] Logged the message, Master [19:51:10] hasharEating: andrewbogott_afk: done ^^ .. and http://doc.wikimedia.org/ | http://validator.w3.org/check?uri=http%3A%2F%2Fdoc.wikimedia.org%2F&charset=%28detect+automatically%29&doctype=Inline&group=0 [19:51:35] ohh [19:51:42] RECOVERY - mysqld processes on db1025 is OK: PROCS OK: 1 process with command name mysqld [19:52:45] mutante: I noticed the https: versions complains about a wrong certification, namely *.mediawiki.org [19:52:50] typo in it ..grr [19:53:12] same issue happens on https://integration.wikimedia.org/ [19:53:20] oh..i did not mean that [19:53:34] hashar: oh yea, we just added new ServerAliases but not new certificates [19:53:43] -:-) [19:53:59] not sure we can have several server aliases and cert in the same virtual host though [19:54:06] not with one IP [19:54:09] (not yet) [19:54:15] hashar: https://gerrit.wikimedia.org/r/#/c/36684/ easy review [19:54:26] mutante: so I guess we want another vhost :-] [19:54:42] AaronSchulz: hey :-] Sorry about the morning drama regarding UUID Generator. 
[19:55:00] AaronSchulz: I thought that the uniq_id command from PHP would be good enough for our use ;) [19:55:27] hashar: meh, shit happens [19:56:53] hashar: would need a second IP on gallium to have 2 SSL hosts without cert issues, cant use SNI yet http://en.wikipedia.org/wiki/Server_Name_Indication [19:57:13] hashar: or it would have to be cluster apache and not standalone on gallium [19:57:40] bbiab [19:58:13] AaronSchulz: job queue stuff, I am not really confident in it. I barely know that code :( [20:00:32] ahh [20:04:59] notpeter: mc1002/1009/1010 have the new intel sfps and are yours when you are ready [20:05:33] RobH: You're awesome. [20:05:41] ? [20:05:46] yes, yes i am [20:05:52] though i have no idea why you are pointing it out ;] [20:06:18] cmjohnson1: thanks! [20:06:35] RobH: Random act of thanks. Or possibly triggered by an e-mail. :-) [20:06:50] Reedy: I think https://gerrit.wikimedia.org/r/#/c/36815/ is turning into a vote ;) [20:07:14] TimStarling counts as 5 if says anything ;) [20:07:27] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.056 second response time [20:07:37] <^demon|away> AaronSchulz: All gerrit admins are +5 ;-) [20:08:11] * AaronSchulz has +5 [20:12:16] !log temp stopping puppet on brewster [20:12:24] Logged the message, notpeter [20:13:36] <^demon|away> AaronSchulz: Heh, there's a way to write a prolog rule where 1 + 1 = 2. [20:13:50] <^demon|away> (Not that it would make sense for us) [20:14:40] New patchset: Matthias Mullie; "AFTv5 CTA buckets had no default, but _is_ enabled on test, causing notice" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37058 [20:22:00] New patchset: Dzahn; "temp. 
disable icinga class on neon" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37060 [20:23:41] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37060 [20:24:05] notpeter: Do you think you could review https://gerrit.wikimedia.org/r/#/c/36670/ soonish? It has the changes to the Parsoid deployment setup we discussed (mostly just deleting stuff that's now unnecessary), and it puts Parsoid in its own Ganglia group [20:26:30] RECOVERY - Puppet freshness on neon is OK: puppet ran at Wed Dec 5 20:26:03 UTC 2012 [20:28:53] notpeter, could Dec 5 19:59:45 10.0.11.26 apache2[24680]: [notice] child pid 28810 exit signal Segmentation fault (11) be related to precise upgrades? [20:30:16] !log reedy Finished syncing Wikimedia installation... : Rebuild localisationcache for wikidatawiki [20:30:26] Logged the message, Master [20:31:57] RoanKattouw: I'll take a look [20:32:01] MaxSem: when did it start? [20:32:27] notpeter: Thanks man [20:33:01] notpeter, I think I saw it first today, about 12 hours ago. since it's still there, it's probably not some random glitch [20:33:51] MaxSem: then seems unlikely that it's related to upgrades, as I finished those about a month ago [20:34:01] on what hosts(s) [20:34:01] okay [20:35:21] 10.0.11.70 10.0.11.42 10.0.11.46 10.0.11.31 10.0.11.32 10.0.11.29 10.0.11.26 [20:37:11] !log reedy synchronized wmf-config/CommonSettings.php 'wikidatawiki not readonly' [20:37:20] Logged the message, Master [20:38:57] hey Ryan_Lane, you around? [20:40:21] MaxSem: segfaults related to php and libxml2 are regular and have been for a very long while. 
mediawiki bugs are likely to blame for some, bugs in php core and libxml2 for others [20:40:39] MaxSem: please fix :) [20:41:00] gimme stacktrace:) [20:41:53] tim may have collected some actually [20:42:02] peculiarly, there were 3 dead processes on different machines within one second, 19:59:45 [20:45:05] MaxSem: look through the archive logs as well [20:45:38] you'll find lots of peculiar [20:46:45] RECOVERY - NTP on neon is OK: NTP OK: Offset -0.01428127289 secs [20:46:53] preilly will probably be looking into the now that he has full cluster access [20:47:17] binasher: indeed [20:47:19] !log reedy synchronized wmf-config/CommonSettings.php 'wikidatawiki not readonly' [20:47:27] Logged the message, Master [20:56:10] New patchset: MaxSem; "Update device detection to match MobileFrontend" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/35298 [20:59:50] RECOVERY - Host cp1031 is UP: PING OK - Packet loss = 0%, RTA = 26.50 ms [21:00:12] ottomata: was at lunch. what's up? [21:00:19] heya [21:00:35] i'm looking into this: [21:00:36] https://rt.wikimedia.org/Ticket/Display.html?id=3774 [21:00:52] yes [21:00:54] diederik mentions there that he briefly discussed this with you [21:00:57] yep [21:01:00] what's the networking issue (eqiad vs pmtpa?) [21:01:04] no [21:01:06] production vs labs [21:01:10] oh [21:01:13] you are going to enable ldap auth [21:01:16] which changes things [21:01:20] this isn't in labs [21:01:27] it's going to use labs auth [21:01:32] oh [21:01:33] ohhhh [21:01:37] k [21:01:39] that's the reason for ldap ;) [21:01:50] and it'll likely take jobs from labs users [21:01:51] binasher mentioned that I check out the graphite ldap stuff [21:02:02] well, you want actual system authentication, right? [21:02:07] so I was looking at the connection info there [21:02:09] well [21:02:10] yeahhh….. 
[21:02:11] hm [21:02:11] yea [21:02:15] maybe not system authentication [21:02:21] but at minumum for the system to know the users [21:02:23] so, nss [21:02:26] not necessarily pam [21:02:30] hm, aye right [21:02:31] yeah [21:02:33] just hue [21:02:40] they don't need ldap for shell stuff (if that's what you mean) [21:02:51] nss by default can give authentication, if coupled with ssh keys [21:02:52] https://ccp.cloudera.com/display/CDH4DOC/Hue+Installation#HueInstallation-EnablingtheLDAPServerforUserAuthentication [21:02:54] so be careful there [21:03:00] !g 37108 [21:03:01] https://gerrit.wikimedia.org/r/#q,37108,n,z [21:03:08] PROBLEM - Varnish HTTP upload-frontend on cp1031 is CRITICAL: Connection refused [21:03:09] well, the idea of ldap integration was for data protection [21:03:21] you can either use kerberos or file system permissions [21:03:44] PROBLEM - Varnish traffic logger on cp1031 is CRITICAL: Connection refused by host [21:03:50] hm, ok, the data we're talking about here is in hadoop, which has kerberos support too (haven't set that up) [21:03:53] PROBLEM - SSH on cp1031 is CRITICAL: Connection refused [21:03:59] yes, we don't want to use kerbers [21:04:01] kerberos [21:04:02] aye [21:04:02] PROBLEM - Varnish HTCP daemon on cp1031 is CRITICAL: Connection refused by host [21:04:02] if possible [21:04:05] file system stuff is default [21:04:07] anyway [21:04:09] file system permissions are way easier [21:04:11] PROBLEM - Varnish HTTP upload-backend on cp1031 is CRITICAL: Connection refused [21:04:12] yeah that's cool [21:04:21] i'm really just looking to make it so I don't ahve to create user accounts manually for hue [21:04:21] this is just for the web server, though, for now, right? [21:04:24] yeah [21:04:30] ok. that's easy enough [21:04:39] there's no network issues from that perspective [21:04:51] hm, ok cool, basically follow the instructions, fill in the values, hope it works? 
[21:04:52] though realistically we should move it into the labs network sooner than later [21:04:56] it'll be harder later [21:04:58] yes [21:05:04] hm [21:05:16] not sure I know the diff, but the ldap url I was going to use is [21:05:16] ldap://nfs1.pmtpa.wmnet nfs2.pmtpa.wmnet/ou=people,dc=wikimedia,dc=org?cn [21:05:23] no [21:05:25] don't use those [21:05:28] ok [21:05:30] graphite needs to change that too [21:05:39] use virt0.wikimedia.org and virt1000.wikimedia.org [21:05:47] I want to kill off nfs1/2 for ldap eventually [21:05:56] ok, same details for the rest though? [21:06:00] ou, dc, etc.? [21:06:00] yeah [21:06:02] k [21:06:16] we have a standard of using cn to authenticate via the web [21:06:21] and uid via shell [21:06:27] it makes things consistent [21:06:31] hm, ok [21:07:09] you'll want to use ldap_username_pattern [21:07:16] New patchset: Dzahn; "re-enabled icinga class on neon" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37114 [21:07:29] does the username need to map to the user's system username? [21:07:50] if so then you'll need to be inconsistent and use uid for the pattern [21:08:07] hm, i'm not sure [21:08:07] make sure to use tls or ssl [21:08:15] *ldap + start_tls or ldaps [21:08:15] thus far hadoop has wanted a user to have a shell account [21:08:26] which isn't ideal, but its ok [21:08:26] ok, if that's the case you'll need to use uid [21:08:36] and you'll also need to set up nss [21:08:36] ,dc=org?cn -> ,dc=org?uid [21:08:37] ? [21:08:50] dap_username_pattern="uid=<username>,ou=People,dc=mycompany,dc=com" [21:08:51] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37114 [21:09:06] ldap_username_pattern="uid=<username>,ou=people,dc=wikimedia,dc=org" [21:09:13] aye, ok [21:09:19] don't know much about nss [21:09:35] puppet can configure that for you [21:10:09] role::ldap::client::labs [21:10:29] and I can use this for pw? 
[21:10:29] $proxypass = $passwords::ldap::wmf_cluster::proxypass [21:10:57] for which user? [21:11:21] um, proxyagent? (i'm an ldap noob as well) [21:11:23] when you use role::ldap::client::labs, you'll also need to modify ldapincludes to pass in nss as well [21:11:24] yeah [21:11:30] those go together [21:11:35] its set in the graphite stuff [21:11:41] AuthLDAPBindDN cn=proxyagent,ou=profile,dc=wikimedia,dc=org [21:11:41] AuthLDAPBindPassword <%= proxypass %> [21:11:48] think of it like the mysql user you're connecting with [21:11:53] right ja [21:12:05] our ldap servers require bind to search [21:12:06] which will be querying for the authenticating user sinfo [21:12:14] yes [21:12:29] then when a user is found, it'll bind as that user to check the password [21:12:38] aye k [21:12:41] andrewbogott: one thing about the "docs" generation. i noticed if in actual site.pp there is something like this: node /^(grosley|aluminium)\.wikimedia\.org$/ .. that turns into this in the generated docs: node "grosleyaluminium.wikimedia.org" [21:12:56] ldap is just a database with a strict schema [21:13:02] and a different query language [21:13:02] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [21:13:02] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [21:13:04] aye [21:13:11] PROBLEM - Host cp1031 is DOWN: PING CRITICAL - Packet loss = 100% [21:13:11] so, re tls cert stuff? I need that? [21:13:15] yes [21:13:15] oh that is for nss? 
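The proxyagent arrangement described above is the classic search-then-bind flow: bind with a service account so you may search at all, find the user's entry by cn, then re-bind as that DN to check the password. A stubbed model (no real LDAP; the directory contents and names below are hypothetical):

```python
# dn -> attributes; stands in for the flat ou=people,dc=wikimedia,dc=org tree
DIRECTORY = {
    "cn=Example User,ou=people,dc=wikimedia,dc=org": {"cn": "Example User", "password": "s3cret"},
}

def search_then_bind(cn, password, proxy_bind_ok=True):
    """Step 1: bind as the service account (proxyagent) just to be allowed to search.
    Step 2: search for the entry whose cn matches the login name.
    Step 3: bind as the found DN with the supplied password."""
    if not proxy_bind_ok:
        return False
    for dn, attrs in DIRECTORY.items():
        if attrs["cn"] == cn:
            return attrs["password"] == password
    return False
```

The service account never sees user passwords; it only locates the DN that the user's own bind attempt is then made against.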
[21:13:18] no [21:13:26] ok [21:13:29] I can set this for hue [21:13:30] # Path to certificate for authentication over TLS [21:13:30] ## ldap_cert= [21:13:36] unless you want to send our passwords over the line in the clear, you'll need to use tls [21:13:47] RECOVERY - SSH on cp1031 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [21:13:56] RECOVERY - Host cp1031 is UP: PING OK - Packet loss = 0%, RTA = 26.53 ms [21:13:57] ok, so I need to generate my own cert for this and use that? [21:14:01] no [21:14:03] there's a system one [21:14:05] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:14:06] ah [21:14:14] I honestly don't know why you need to provide one [21:14:17] fucking java apps [21:14:25] lemme see which CA [21:14:42] ah certificates::wmf_ca [21:14:43] ok [21:14:45] no [21:14:48] or [21:14:48] certificates::wmf_labs_ca [21:14:49] ? [21:14:49] not that one [21:14:54] this uses the * cert [21:15:04] Equifax_Secure_CA.pem [21:15:09] <^demon|away> If you use install_certificate{}, it should do all that for you? [21:15:20] ^demon|away: it'll install the certificate and the CA [21:15:27] but it won't configure hue to use it [21:15:32] that's fine [21:15:34] i can find the path [21:15:38] no args to instsall_cert [21:15:38] ? [21:15:45] java apps have a different way to manage certificate trusts [21:15:54] (hue is django, btw) [21:16:12] this is the CA certificate hue wants [21:16:18] for ldap_cert [21:16:23] not the star cert [21:16:36] this is to trust the certificate the ldap server is presenting [21:16:40] <^demon|away> We don't do this with gerrit for ldap? [21:16:46] user -> https -> hue [21:16:52] hue -> ldaps -> virt0 [21:17:04] ^demon|away: it's smart enough to use the system trust [21:17:14] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.035 seconds [21:17:57] do I manage java's trust in the install certificate class? 
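The per-library segfault counts pasted in the log (42 libc, 491 libmemcached, and so on) are what a short tally over the kernel log produces. A sketch of that tally; the log lines below are hypothetical, in the usual kernel segfault format:

```python
import re
from collections import Counter

# Hypothetical kernel log lines of the form such counts come from:
LINES = [
    "Dec  5 19:59:45 srv281 kernel: apache2[24680]: segfault at 0 ip ... in libmemcached.so.10.0.0[...]",
    "Dec  5 20:01:12 srv282 kernel: apache2[10021]: segfault at 8 ip ... in libxml2.so.2.7.8[...]",
    "Dec  5 20:03:40 srv283 kernel: apache2[10400]: segfault at 0 ip ... in libmemcached.so.10.0.0[...]",
]

def tally_segfaults(lines):
    """Count segfault lines per shared object named after 'in'."""
    counts = Counter()
    for line in lines:
        m = re.search(r"segfault at .* in (\S+?)\[", line)
        if m:
            counts[m.group(1)] += 1
    return counts
```

Run over the archive logs, a skew like 491 hits in one library against a few dozen elsewhere points the investigation at that library first.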
[21:17:59] so [21:18:01] ca => "star.wikimedia.org" [21:18:02] ? [21:18:03] I kind of doubt it [21:18:07] ottomata: no [21:18:11] ha [21:18:11] binasher: we do have a lot of new segfaults in libmemcached.so though [21:18:19] star's CA [21:18:26] yay [21:18:30] Equifax_Secure_CA.pem [21:18:37] "star.wikimedia.org" => "Equifax_Secure_CA.pem", [21:18:41] preilly: ^^ re: segfaults in libmemcached [21:18:47] ottomata: wmf-ca.pem from ./files/ssl i believe [21:18:54] mutante: NO [21:18:55] oh. what Ryan says [21:18:58] ouch [21:18:59] heh [21:19:01] ha [21:19:03] you guys are killing me :) [21:19:12] i'm only barely following I think atm [21:19:17] summary: [21:19:29] i need to use define install_certificate to get a cert on the hue server [21:19:29] so, we're going to have two things providing tls/ssl here [21:19:38] hold on. let me explain [21:19:38] then I set ldap_cert= path to cert [21:19:41] ok ok go ahead [21:19:44] you have a web server [21:19:46] 42 libc-2.15.so [21:19:46] 491 libmemcached.so.10.0.0 [21:19:47] 42 libxml2.so.2.7.8 [21:19:47] 104 php5 [21:19:47] it needs https [21:19:56] that's segfaults [21:20:06] ottomata: it's using a wikimedia.org address, right? [21:20:13] is it going to be hue.wikimedia.org or something? [21:20:26] rigiht now, it is at http://hue.analytics.wikimedia.org/, but there is no dns for that [21:20:29] i'm doing it manually [21:20:29] and [21:20:32] we only have one public IP [21:20:35] don't use sub-sub domains [21:20:37] and this is running on a backend server [21:20:46] so i'm using haproxy [21:20:58] whatever, we can change the domain to wahtever we want, [21:21:00] bleh haproxy :D [21:21:11] hah, i like it! so easy and flexible! but whatever [21:21:14] i don't care we can use whatever [21:21:14] so. 
your proxy will need to provide https [21:21:19] i can do that [21:21:28] that will use the star certificate [21:21:29] i can put it back on apache or nginx if I need to [21:21:36] this is why you can't use sub-sub domains [21:21:38] that I can figure out i think (with maybe minimal help) [21:21:40] aye [21:21:41] makes sense [21:21:43] the star cert won't work with sub-sub domains [21:22:11] ok, but that's a separate bit, right? getting ssl to work? [21:22:17] paravoid: thanks, will investigate! (though right now, lunch) [21:22:18] right now i'm just looking at the ldap auth [21:22:22] so, that's one server that's providing ssl [21:22:30] it needs the star cert [21:22:42] the other server *providing* ssl is the ldap server [21:22:56] hue is performing authentication against the ldap server [21:23:02] so it's a consumer of ssl [21:23:09] which means it needs to trust the provider [21:23:20] so, you'll be configuring hue to trust the ldap server [21:23:25] which is why you need the CA certificate [21:23:37] that server doesn't need the star certificate at all [21:23:40] just its CA [21:23:45] oh [21:23:50] I think that may be loaded by defauly [21:23:53] *default [21:24:00] by what? 
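Ryan's split between the two TLS roles, serving https with the star certificate versus consuming ldaps and trusting the server's CA, maps directly onto how a TLS client context is built. A small sketch with Python's ssl module; the pinned-CA line mirrors hue's ldap_cert= setting and is commented out because it needs that file present:

```python
import ssl

# As a TLS *client* (hue talking to ldaps://virt0), you need a trust store,
# not a server certificate of your own.
ctx = ssl.create_default_context()  # loads the system CA bundle (ca-certificates)

# To pin a single CA instead, as hue's ldap_cert= does:
# ctx.load_verify_locations(cafile="/etc/ssl/certs/Equifax_Secure_CA.pem")

# Client-side contexts verify the peer by default:
assert ctx.verify_mode == ssl.CERT_REQUIRED and ctx.check_hostname
```

The star certificate only matters on the side that *presents* a cert (the https proxy); the ldaps consumer needs nothing but the CA.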
[21:24:06] sec [21:24:26] /etc/ssl/certs/Equifax_Secure_CA.pem [21:24:27] indeed it is [21:24:38] ah cool [21:24:40] i do have that [21:24:41] cool [21:24:42] I think it comes with the ca-certificates package [21:24:44] PROBLEM - Host cp1031 is DOWN: PING CRITICAL - Packet loss = 100% [21:24:48] which is installed on all systems [21:24:51] aye [21:25:03] so, you just need to configure hue to use that (likely using the full path) [21:25:12] ok cool, [21:26:29] so, real quick [21:26:34] here are a buncha things i'm about to try [21:26:35] https://gist.github.com/4219662 [21:26:53] use lowercase [21:26:57] for the base dn [21:27:04] k [21:27:14] (that was from hue instructions, but ok) [21:27:16] dns are case insensitive, but it's good practice [21:27:18] *practice [21:27:19] k [21:27:30] active directory uses upper case for their dns [21:28:15] ldap://virt0.wikimedia.org virt1000.wikimedia.org/ou=people,dc=wikimedia,dc=org?cn [21:28:19] ^^ that doesn't look right [21:28:46] hm, ok? [21:29:00] i just replaced from the value that was in graphite confs: [21:29:00] AuthLDAPURL "ldap://nfs1.pmtpa.wmnet nfs2.pmtpa.wmnet/ou=people,dc=wikimedia,dc=org?cn" [21:29:11] yeah, that's for mod_authzldap [21:29:19] New patchset: Dzahn; "require icinga::monitor::packages in icinga::monitor::service because the package needs to create /etc/icinga before stuff can be put there" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37116 [21:29:21] it has a different syntax [21:29:23] hm ok [21:29:26] https://ccp.cloudera.com/display/CDH4DOC/Hue+Installation#HueInstallation-EnablingtheLDAPServerforUserAuthentication [21:29:31] ldap_url [21:29:35] ldap_url=ldap://auth.mycompany.com [21:29:36] soo [21:29:57] ldap_url=ldaps://virt0.wikimedia.org [21:29:59] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [21:29:59] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [21:30:03] hmm,k [21:30:20] 
ldap_url=ldaps://virt0.wikimedia.org ldaps://virt1000.wikimedia.org [21:30:26] RECOVERY - Varnish HTTP upload-backend on cp1031 is OK: HTTP OK HTTP/1.1 200 OK - 632 bytes in 0.055 seconds [21:30:26] RECOVERY - Host cp1031 is UP: PING OK - Packet loss = 0%, RTA = 26.73 ms [21:30:27] ^^ I'd guess it would be like that for multiple [21:30:35] RECOVERY - Varnish HTCP daemon on cp1031 is OK: PROCS OK: 1 process with UID = 997 (varnishhtcpd), args varnishhtcpd worker [21:30:52] New review: Dzahn; "attempt to fix Failed to apply catalog: Could not find dependency File[/etc/icinga/icinga.cfg] for S..." [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/37116 [21:30:53] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37116 [21:31:14] dap_username_pattern="uid=<username>,ou=people,dc=wikimedia,dc=org" [21:31:16] ldap_username_pattern="uid=<username>,ou=people,dc=wikimedia,dc=org" [21:31:26] backend=desktop.auth.backend.LdapBackend [21:31:41] ldap_cert=/etc/ssl/certs/Equifax_Secure_CA.pem [21:32:07] aye [21:32:40] no need for a bind_dn or bind_password [21:32:43] so no proxy agent [21:32:57] it seems that the software only supports direct binds [21:33:02] which is ok because we use a flat tree [21:33:30] uh, yeah? ok will try that [21:34:03] this ldap support is kind of crappy :D [21:34:07] oh yeah? [21:34:28] it'll work [21:34:42] I'm assuming you'll be able to group users and such inside of the software [21:34:47] since it doesn't support ldap groups [21:35:42] mmmmmmm i really hope it does, else i'll probably file a feature request [21:35:57] yeah it does [21:36:09] wait, what doesn't support ldap groups? 
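With direct binds there is no search step at all: hue splices the login name into ldap_username_pattern (its docs use a `<username>` placeholder) and binds as the resulting DN, which only works on a flat tree where every user sits under the same OU. A sketch:

```python
def bind_dn(pattern, username):
    """Build the DN to bind as, by filling hue's <username> placeholder."""
    return pattern.replace("<username>", username)

pattern = "uid=<username>,ou=people,dc=wikimedia,dc=org"
bind_dn(pattern, "ottomata")  # 'uid=ottomata,ou=people,dc=wikimedia,dc=org'
```

No service account is needed because the user's own credentials are the first and only bind.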
[21:36:13] New patchset: Reedy; "Enable WikibaseClient for test2wiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37117 [21:36:22] I don't see support for groups [21:36:32] I see the ability to import groups [21:36:32] there are also these settings [21:36:33] https://gist.github.com/4219728 [21:37:02] that's for import [21:37:06] not for auth [21:37:06] ah, hm ok [21:37:23] import isn't going to work [21:37:30] since it won't have access to the passwords [21:38:25] I wonder if you can use auth and also use group import [21:38:33] that would be nice [21:38:57] heh. no base_dn for groups [21:39:06] it's going to search from the tree all the way down [21:39:08] so... [21:39:18] let's see... [21:40:47] I have a strong feeling group import won't work [21:40:58] group_filter="objectclass=groupofnames" [21:41:16] group_name_attr=cn [21:41:40] should I set that? [21:41:43] unless it's going to match the user's dn with the group [21:41:47] if you want to try group import, yes [21:41:48] 'groupofnames'? [21:41:58] I'd get auth working before trying to get the group stuff working [21:42:02] hm, right now i'm just trying to see if it will use ldap for general auth [21:42:02] yeah [21:46:57] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37117 [21:47:34] New patchset: Andrew Bogott; "Create a link in / to /srv/org/wikimedia/doc/puppetsource" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37118 [21:48:44] mutante: Any gues about a better way to accomplish ^ ? I'm pretty unhappy about dropping a random link into / [21:49:56] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:49:58] why is there a need to have /puppetsource? 
[21:50:53] Ryan_Lane: Note the absolute paths that litter these docs: http://doc.wikimedia.org/puppet/ [21:51:23] !log reedy synchronized wmf-config/InitialiseSettings.php 'Enable wikibaseclient on test2wiki' [21:51:23] ah. it's to get rid of the long paths? [21:51:25] Those absolute paths don't correspond to the actual web paths that I want in the docs. Pointing the doc script at that symlink is a hack to get the paths the way I want them. [21:51:28] Yeah. [21:51:32] Logged the message, Master [21:51:38] And it gets paths that are relative to the website so that links work. [21:51:42] (or, so the theory goes :) ) [21:52:00] Alternatives are… sedding the entire site after each generation, or modifying the doc script. [21:52:22] I like the 'sed' option except that I think the doc script is doing incremental updates and it might be confounded by that. [21:53:07] I'm not a fan of the sed option :) [21:53:22] fixing the script seems like the best option [21:53:23] !log authdns-update for new misc servers [21:53:31] Logged the message, RobH [21:53:59] Yeah, probably. Have you hacked on the puppet tools? Do they welcome patches? [21:54:18] it's also possible to alias in the long path [21:55:11] That would fix links but not the ugly display right? [21:55:16] Alias /puppet/files/srv/org/wikimedia/doc/puppetsource/ /srv/org/wikimedia/doc/puppetsource/ [21:55:19] indeed [21:55:19] New patchset: awjrichards; "Enabling mobile beta photo uploads to commons for testing on testwiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37120 [21:55:52] hm, maybe that's fine. [21:56:10] Change merged: MaxSem; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37120 [21:56:36] hm. I think I'm going to try to upgrade labsconsole today [21:56:53] great! 
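The Alias workaround mutante pastes above would sit in the doc site's Apache config roughly as follows (a sketch; the paths are from the log, the surrounding context is assumed). As noted in the discussion, it fixes the broken links but not the long paths displayed in the page text:

```apache
# Map the literal filesystem path that the doc generator embeds in its
# links back onto the real directory, so generated links resolve.
Alias /puppet/files/srv/org/wikimedia/doc/puppetsource/ /srv/org/wikimedia/doc/puppetsource/
```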
[21:58:02] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 2.961 seconds [21:58:34] new smw is out [21:59:14] ah, Ryan_Lane, so, nss [21:59:20] I need to just include that role on this machine? [21:59:32] role::ldap::client::labs [21:59:45] read the role [21:59:47] it's missing nss [21:59:51] for production [21:59:58] lesliecarr: rt4026 Layer42 OOB connection. did you get a time frame on when that will be dropped to our cage? [22:00:02] you need to call it with nss added [22:00:27] cmjohnson1: so i realized yesterday i hadn't ordered the x-connect [22:00:29] so ordering it now [22:00:41] with the same ones it comes with? [22:00:42] so [22:00:48] okay..sounds good [22:00:50] ldapincludes => ['openldap', 'utils', 'nss'] [22:00:51] ? [22:00:52] yep [22:00:55] k [22:01:12] RoanKattouw: that patch looks ok, but you need to do more to get ganglia actually working [22:01:19] you need to set hosts as aggregators [22:01:24] Ugh [22:01:29] and add them to a hash in ganglia.pp [22:01:34] lemme find line number [22:01:38] RECOVERY - Varnish traffic logger on cp1031 is OK: PROCS OK: 3 processes with command name varnishncsa [22:01:38] RECOVERY - Varnish HTTP upload-frontend on cp1031 is OK: HTTP OK HTTP/1.1 200 OK - 641 bytes in 0.054 seconds [22:01:40] it's only like 2 more lines [22:01:43] * RoanKattouw stabs ganglia [22:01:50] and at least you're not trying to hunt the problem down :) [22:01:52] that'd be hella annoying [22:02:32] RoanKattouw: $ganglia_clusters [22:02:33] and [22:02:43] New patchset: Reedy; "Fixing AFTv5 notice" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37122 [22:02:48] $data_sources [22:02:59] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37122 [22:03:05] pretty straight forward.
lemme know if you need any halp [22:04:45] !log awjrichards synchronized wmf-config/InitialiseSettings.php 'Enable mobile beta photo uploads to commons from testwiki' [22:04:53] Logged the message, Master [22:05:11] notpeter: What is ip_oct? [22:05:25] it's for the multicast address [22:05:30] just choose one that's not already in use [22:05:43] KO [22:05:52] KNOCKOUT PUNCH [22:06:04] haha [22:06:14] And I have to choose a host within the cluster to be the aggregator? Blegh [22:06:17] * RoanKattouw stabs Ganglia [22:07:27] New patchset: RobH; "adding misc servers kuo mexia lardner tola" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37124 [22:07:56] yes [22:08:06] one per DC, ideally [22:08:17] New patchset: Ottomata; "including ldap labs role on an27" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37125 [22:08:41] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37125 [22:08:42] RoanKattouw: come on, this is sysadmin level of programming: cut, paste, fill in the blanks ;) [22:08:46] cmjohnson1: wanna review that ^ ? [22:08:58] k [22:09:14] * AaronSchulz lols at https://bugs.launchpad.net/ubuntu/+source/network-manager/+bug/989900/comments/51 [22:10:34] <^demon|away> AaronSchulz: Ubuntu broke Comcast for millions of users? I must've gotten lucky ;-) [22:10:37] New review: Andrew Bogott; "There may be a better way..." [operations/puppet] (production); V: 0 C: -1; - https://gerrit.wikimedia.org/r/37118 [22:10:41] notpeter: :) That's gotten me in trouble before, e.g. wtp1001.pmpta.wmnet ) [22:10:47] New patchset: Catrope; "Parsoid puppetization changes:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36670 [22:10:59] cmjohnson1: so, I'm having no luck getting link on these mc boxes [22:11:06] let me know if you have time to take a look [22:11:11] RoanKattouw: lulz. fair enough [22:11:29] paravoid: give up yet?
:D [22:12:11] robh: looks okay...were there space issues with the lvs entries? [22:12:21] Ryan_Lane, in the nss confs installed by that class [22:12:22] yea was spaces not tab [22:12:23] i see [22:12:24] so i fixed it [22:12:26] /etc/nslcd.conf [22:12:30] cool [22:12:30] notpeter: Amended [22:12:36] i see [22:12:36] +uri ldap://virt1000.wikimedia.org:389 ldap://virt0.wikimedia.org:389 [22:12:38] should that be my url? [22:12:41] not ldaps [22:12:43] ? [22:13:52] RoanKattouw: you have extra / in site.pp [22:13:59] unless that's some syntax that I don't know about [22:14:10] er, 2 extra [22:14:32] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37124 [22:14:49] ottomata: yes, it should [22:14:56] ottomata: nslcd is configured to use start_tls [22:15:01] notpeter: Oops [22:15:08] AaronSchulz: any thoughts about the 4 million job queue of frwiki? [22:15:08] notpeter: Regex -> string conversion snafu [22:15:17] (as long as I'm still vaguely awake) [22:15:21] ldaps is ldap+ssl on another port [22:15:36] ldap with start_tls is on the normal ldap port [22:15:45] RoanKattouw: that's what I was guessing [22:15:52] New patchset: Catrope; "Parsoid puppetization changes:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36670 [22:15:54] no idea why http with start_tls never took off [22:16:12] hm, ok, so the ldaps in my hue conf is correct [22:16:16] ? [22:16:27] yes [22:16:33] also, are there logs on virt0 somewhere I can look at to see if hue is actually attempting to authenticate? [22:16:45] RoanKattouw: looks good! although needs rebase. wamp wamp [22:16:48] <^demon|away> Ryan_Lane: "Before STARTTLS was well established, a number of TCP ports were defined for well known protocols which established SSL security first, and then presented a communication stream identical to the old un-encrypted protocol."
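The "call it with nss added" fix from earlier might look like this in site.pp (a sketch: the role name and the `ldapincludes` value are straight from the log; treating the role as a parameterized class, and the node name `analytics1027.eqiad.wmnet` for "an27", are assumptions):

```puppet
# Include the labs LDAP client role with nss added, so the nss-related
# packages/config are managed alongside openldap and the utils.
node 'analytics1027.eqiad.wmnet' {
    class { 'role::ldap::client::labs':
        ldapincludes => ['openldap', 'utils', 'nss'],
    }
}
```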
[22:16:48] yeah, but it'll be spammy as hell [22:16:51] notpeter: On it [22:16:58] <^demon|away> So the answer is "Stuff already worked, so people didn't change" [22:17:03] RoanKattouw: do you have +2 or would you like me to? [22:17:08] ok, maybe I can grep for hostname or something, where is it? [22:17:10] HA [22:17:13] that's the same reason ldaps still exist, though it's deprecated [22:17:20] ottomata: you can grep by ip [22:17:26] I'm conflicting with RobH having already added my new boxes. [22:17:32] ottomata: but… it only lists it once, then it lists by connection id [22:17:36] heh, awesome :) [22:17:39] which is why this is a pain [22:17:39] best. conflict. ever. [22:17:42] I am actually happy with a conflicting change [22:17:44] yeah srsly [22:17:46] I should really feed these logs into a sysem [22:17:51] *system [22:17:59] RoanKattouw: ? [22:17:59] RobH: Thank you for your change! It conflicted with mine but that's OK cause I get my shiny boxes :) [22:18:06] oh, git conflict [22:18:09] haha, i win! [22:18:12] im merged ;] [22:18:14] Yes, you do [22:18:23] And this is the one time I'm actually happy with a conflict [22:18:29] ottomata: it's at /var/opendj/instance/logs/access [22:18:59] hm yeah i see what you mean [22:19:01] not much to grep on there [22:19:03] RobH: What is the up-ness of those boxes? [22:19:09] New patchset: Catrope; "Parsoid puppetization changes:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36670 [22:19:18] i am just now working on spinning them up and OS installed. [22:19:25] Excellent [22:19:27] notpeter: Amended [22:19:34] they are hardware raid, the new misc boxes [22:19:42] so have to confirm its raided before i spin installer.
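The two-step search Ryan describes (the client IP appears only on the line that opens the connection; everything after that is tagged only with a connection id) can be sketched in Python. The log-line format below is invented for illustration, not OpenDJ's actual access-log syntax:

```python
import re

def lines_for_client(log_lines, client_ip):
    """First pass: collect conn ids from lines mentioning the client IP.
    Second pass: return every line tagged with one of those conn ids."""
    conn_ids = set()
    for line in log_lines:
        if client_ip in line:
            m = re.search(r'conn=(\d+)', line)
            if m:
                conn_ids.add(m.group(1))
    return [l for l in log_lines
            if (m := re.search(r'conn=(\d+)', l)) and m.group(1) in conn_ids]

sample = [
    'CONNECT conn=7 from=10.64.21.27:44321',  # hypothetical log format
    'BIND conn=7 dn="uid=otto,ou=people,dc=wikimedia,dc=org"',
    'CONNECT conn=8 from=10.0.0.9:51000',
    'SEARCH conn=8 base="ou=people,dc=wikimedia,dc=org"',
    'UNBIND conn=7',
]
print(lines_for_client(sample, '10.64.21.27'))
```

This mirrors the manual workflow: grep for the IP to learn the conn id, then grep again for `conn=N`.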
[22:19:43] robh: did you have to enable the intel sfps [22:19:45] By the time that's done they should have Ganglia on them as well with that change I just rebased [22:19:54] cmjohnson1: there is no enabling no [22:20:05] apergos: hmm, 50% of those are for like 7 templates [22:20:11] ottomata: well, you grep by ip [22:20:16] why jenkins no do verified? [22:20:18] ottomata: find the connection id, then grep by that [22:20:35] I saw on enwiki something similar, where bots kept changing those pages several (even dozens) of times per day [22:21:11] well the oldest job in there is the same as it was yesterday mid day [22:21:19] aye ok, i think hue is not communicating properly [22:21:22] i don't see anything [22:21:23] although I saw some jobs processed in the log [22:21:27] for fr wiki I mean [22:21:29] and all the users in hue are the ones we have previously manually added [22:21:45] all refreshLinks2 [22:21:47] i gotta run real soon though [22:21:56] yep [22:22:15] Ryan_Lane, would you have a min today to look into this for me? [22:22:19] s'ok if not [22:22:31] cmjohnson1: are you getting a link light? [22:22:34] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36670 [22:22:35] I thought the refreshLinks jobs got processed first in first out but maybe that's wrong? [22:22:41] notpeter: robh: no link light [22:22:47] apergos: they are all random [22:22:48] ottomata: sure [22:22:50] hue is on analytics1027.eqiad.wmnet [22:22:58] oh. random out of 4 million, nice [22:23:02] cmjohnson1: So, we have a few servers that we know are working. [22:23:11] if notpeter has no issues, use the parts for 1015 [22:23:11] cmjohnson1: so I'm not crazy! [22:23:12] er [22:23:14] and try to see if it works [22:23:15] well....
[22:23:18] random per wiki seems a bit odd [22:23:23] dont confuse them [22:23:32] but try the fiber for 1015 (known good) for 1002 [22:23:32] and, there are access instructions for kraken stuff in general here: [22:23:32] yeha, none of these are in service yet [22:23:33] https://www.mediawiki.org/wiki/Analytics/Kraken/Access [22:23:41] if fiber doesnt fix, try intel sfp [22:23:50] just localize the fault with the known good stuff [22:23:59] but yea, there is no flashing of the intel sfp or anything [22:24:11] it just plugs in and works [22:24:32] okay, i am going to pull 1015 and plug into 1002 [22:24:44] ok. feel free to shut it down [22:24:51] if its just left online is that ok? [22:24:57] motherlover. I broken puppet. sorry [22:24:59] it just wont be in sync for a few minutes [22:25:18] as long as you didnt break my installer or apt repo ;] [22:25:59] apergos: http://fr.wikipedia.org/w/index.php?title=Mod%C3%A8le:Wikiprojet/cat%C3%A9gorisation&diff=86071362&oldid=85707076 [22:26:05] such an innocent change :) [22:26:22] yeah, real innocent :-D [22:26:26] New patchset: Pyoungmeister; "comma" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37126 [22:27:01] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37126 [22:27:21] why is jenkins not giving verified? [22:27:32] notpeter: The Jenkins workflow changed [22:27:42] I didn't know hashar had applied this to operations/puppet too [22:27:44] New patchset: Dzahn; "icinga - try to fix order of classes being applied by using arrow syntax" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37127 [22:27:55] he did? [22:28:13] it should give you a link to integration.mw though [22:28:15] notpeter: The new workflow is 1) Jenkins does CR+1 on submission, 2) a reviewer approves with CR+2 , 3) Jenkins runs again, sets V+1, submits [22:28:21] well.... 
can we get it to automatically execute all the checkingz for ops employees on the puppet repo at least? [22:28:28] ....eww, i just self verified [22:28:29] heh [22:28:30] a;lskjfda;lkjdsfa;lskjfd [22:28:37] Yeah there is talk of whitelisting staff and other trusted users [22:28:38] RoanKattouw: see, i broke it already. [22:28:42] Yup [22:28:44] me too ;) [22:28:56] I took away the right to V+1 for all humans in extensions/VisualEditor [22:28:56] well, technically, roan's code was borken... [22:28:57] !log awjrichards synchronized php-1.21wmf5/extensions/MobileFrontend/ 'Bug fixes for MobileFrontend' [22:28:58] ugh [22:29:05] Logged the message, Master [22:29:05] ... [22:29:11] well, ok [22:29:13] New review: Dzahn; "verified :p" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/37127 [22:29:13] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37127 [22:29:22] goes from +2 -> verified [22:29:24] Ugh, whoops, sorry about that not [22:29:24] RoanKattouw: So after I did +2 [22:29:26] *notpeter [22:29:26] sure, ican do this [22:29:27] i should have waited? [22:29:43] RoanKattouw: no prob, I was the cowboy who deployed it ;) [22:29:45] RobH: Yes, in theory Jenkins should take the +2 as a trigger to go and verify your change [22:29:56] and it would have auto reviewed and merged? [22:29:58] For ext/VE this is very fast, like, you refresh the page and it has already happened [22:29:59] Yes [22:30:06] ahh, it didnt do it fast enough for me! [22:30:09] puppet lint might take a bit longer [22:30:10] heh [22:30:15] ok, I can work with this. I didn't know that this was going to apply to ops as well [22:30:18] cool! [22:30:19] thanks, RoanKattouw [22:30:21] well, since its still automated, i will simply adjust. 
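The new workflow Roan lays out (Jenkins votes CR+1 on submission, a human approves with CR+2, then Jenkins runs again, votes V+1, and submits) can be modeled as a toy state machine. Everything here — the function, the event tuples, the actor names other than jenkins-bot — is invented for illustration, not Gerrit's API:

```python
def change_state(events):
    """Toy model: a change only merges once a human CR+2 has been
    followed by a jenkins-bot V+1, matching the flow described above."""
    approved = False
    verified = False
    for actor, vote in events:
        if vote == 'CR+2' and actor != 'jenkins-bot':
            approved = True
        elif vote == 'V+1' and actor == 'jenkins-bot' and approved:
            verified = True   # V+1 only counts after approval triggers the run
    return 'MERGED' if (approved and verified) else 'OPEN'

print(change_state([('jenkins-bot', 'CR+1'),
                    ('notpeter', 'CR+2'),
                    ('jenkins-bot', 'V+1')]))
```

It also captures the failure mode in the log: if the post-approval Jenkins run never happens, the change stays open until someone manually votes V+1.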
[22:30:25] thanks for explaining RoanKattouw [22:30:31] But ext/VE is just JS lint on not /that/ much code, it takes like 5s [22:30:31] you can still merge if the verification failed [22:30:36] happened the other day [22:30:40] Yes, you can vote V+1 yourself [22:30:44] indeed, i did. [22:31:00] Which is why in the VE repo, I restricted V+1 to jenkins-bot (and l10n-bot after it broke, oops) [22:31:29] RobH: yea, but it did not show a broken validation either https://integration.mediawiki.org/ci/view/All-enabled/job/operations-puppet/ [22:31:49] i just didnt give it enough time to get back around to reviewing me again. [22:31:59] ah [22:32:14] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:32:24] was +2 and merge... what you mean its not verified, i saw it said ok, so self verify! [22:32:25] heh [22:32:40] 'i know this code is perfect i wrote it' [22:32:40] heh [22:32:49] RobH: I said that once [22:33:02] I ended up being taken up on my promise to buy the entire team lunch if they found a bug [22:33:57] :-D [22:34:24] RoanKattouw: Ok, you have two choices. I am finished with the base OS install. [22:34:32] I can sign the puppet certs for you and get it going [22:34:44] or if you are making changes and crap and rather i leave them unsigned, i can do that instead [22:34:57] puppet run i assume, but best to ask. 
[22:35:29] Go ahead and run puppet on em [22:35:46] I don't have any more puppet changes to make [22:35:55] ok, will do, you'll have them shortly [22:36:00] New patchset: Jgreen; "adjusting time for fundraising dump runs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37130 [22:36:02] Excellent [22:36:35] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37130 [22:37:53] !log attempting upgrade of mediawiki on labsconsole to wmf/1.21wmf4 [22:38:02] !log running update.php on labsconsole [22:38:02] Logged the message, Master [22:38:10] Logged the message, Master [22:47:16] !log upgraded GeoData tables on testwiki by recreation [22:47:25] Logged the message, Master [22:47:36] binasher: does graphite account for the 1/50 sampling factor? [22:48:06] RobH: Do you have a way to let me know when the puppet runs are done (other than checking manually)? [22:48:19] they are being reallllly slow [22:48:41] Ya srsly [22:48:42] well, once it runs successfully for these, they will start showing in ganglia [22:49:02] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [22:49:07] When Leslie and I were doing the LVS stuff, she just went into sockpuppet and started killing stuff, because we needed the puppet runs on lvs* to just finish already :) [22:49:08] New patchset: MaxSem; "Switch GeoData to Solr schema in data collection mode" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37133 [22:49:35] RoanKattouw: err: Failed to apply catalog: Could not find dependency Exec[parsoid-npm-install] for Service[parsoid] at /var/lib/git/operations/puppet/manifests/misc/parsoid.pp:38 [22:49:37] Change merged: MaxSem; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37133 [22:49:50] you have broken parsoid stuff it seems. [22:49:56] or perhaps not, this your change? 
[22:50:00] Argh [22:50:02] That's me [22:50:04] Fixing [22:50:53] New patchset: Catrope; "Fix reference to removed resource" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37134 [22:51:02] There [22:51:33] well, merged to gerrit, not on sockpuppet i assume [22:51:48] RoanKattouw: right? [22:51:53] ? [22:51:53] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.028 seconds [22:52:04] RobH: No, it was a legitimate error, the patchset above fixes it [22:52:04] you just merged a change in gerrit only right [22:52:08] No [22:52:14] I cannot merge things in operations/puppet [22:52:28] ahhh, i need to [22:52:29] ok [22:52:35] puppet has totally gone to sleep :( stupid stafford [22:52:45] I mean I'm a Gerrit admin so theoretically I could give myself merge perms and then do it... but I get the feeling that wouldn't go over too well [22:52:58] :) [22:53:03] New review: RobH; "yay a fix" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/37134 [22:53:14] RoanKattouw: so now gerrit will come get it and merge it? [22:53:27] this is annoying for operations where we then after it gets around to it, go merge on sockpuppet. [22:53:32] Jenkins should [22:53:43] this is non ideal, i should have said something in the email thread [22:53:51] but i assumed it wouldnt change operations workflow. [22:54:04] There isn't really a reason for puppet to have this behavior [22:54:14] RoanKattouw: it will spam channel when it does? [22:54:20] The tests run on submission and on approval aren't different [22:54:23] It should, I think [22:54:46] so can operations not fall into this new system?
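A minimal sketch of the failure mode behind the earlier `Could not find dependency Exec[parsoid-npm-install] for Service[parsoid]` error: the resource titles come from the log, while the resource bodies are assumed for illustration. Catalog compilation fails because the service still references an Exec that was removed:

```puppet
# Removed in an earlier change:
# exec { 'parsoid-npm-install': ... }

service { 'parsoid':
    ensure  => running,
    # This reference must be deleted (or repointed) along with the Exec,
    # otherwise the agent fails with "Could not find dependency
    # Exec[parsoid-npm-install] for Service[parsoid]".
    require => Exec['parsoid-npm-install'],
}
```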
[22:54:50] and use the old review, heh [22:55:12] Hmm, looks like it's broken [22:55:17] Should complain to hashar when he wakes up [22:55:26] RobH: Please manually V+1 it [22:55:29] so should i just verify it myself, ok [22:55:47] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37134 [22:55:56] New patchset: MaxSem; "Fix module disable" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37140 [22:56:13] Change merged: MaxSem; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37140 [22:57:13] RoanKattouw: ok, merged it on sockpuppet and re-running puppet on the 4 new installs [22:57:42] !log upgraded labsconsole successfully [22:57:51] Logged the message, Master [22:58:03] RobH: Notified ops-l and hashar of the breakage [22:58:16] I'm really glad I hardly use any non-wmf extensions [23:00:13] Heh [23:00:19] hashar has now gotten that message 3 times [23:00:29] Because I sent it to ops-l@wm.o, then ops-l@lists.wm.o , then ops@lists.wm.o [23:00:41] RoanKattouw: Ok, servers are all yours, im resolving the ticket for the deployment (the procurement ticket stays open till we get replacements in) [23:01:33] Excellent [23:01:37] Thanks man [23:01:41] welcome =] [23:06:12] New patchset: Dzahn; "icinga - remove requirement for config in service class (debug)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37142 [23:06:35] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37142 [23:13:50] New patchset: MaxSem; "Now disable geosearch for real..." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37145 [23:14:39] Reedy: does php in cli mode log to fatal.log/exception.log?
[23:14:46] Change merged: MaxSem; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37145 [23:15:06] seems to [23:23:47] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:26:47] AaronSchulz: you need to account for the sampling factor, i just feed to graphite as i collect it, and things are collected per minute, so if you scale by 0.833, you get per second accounting for the 2% sampling [23:26:50] i.e. https://graphite.wikimedia.org/render?from=-9month&width=1280&height=600&target=cactiStyle%28scale%28-total.count%2C0.8333%29%29&uniq=0.7425310740537836&title=php.reqs.per.sec [23:32:24] !log maxsem synchronized php-1.21wmf5/extensions/GeoData/ [23:32:32] Logged the message, Master [23:32:38] PROBLEM - Puppet freshness on ms-be3 is CRITICAL: Puppet has not run in the last 10 hours [23:33:44] !log maxsem synchronized wmf-config/CommonSettings.php 'GeoData tweaks' [23:33:53] Logged the message, Master [23:36:24] New patchset: Dzahn; "create basic structure for turning icinga into a module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37153 [23:37:23] !log shutting down mysql on db1043 for upgrade and testing [23:37:32] Logged the message, Master [23:38:29] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 4.761 seconds [23:42:05] PROBLEM - mysqld processes on db1043 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [23:42:34] !log creating GeoData tables in all wikipedias [23:42:42] Logged the message, Master [23:45:21] !log created GeoData tables in all wikivoyages [23:45:30] Logged the message, Master [23:46:00] New patchset: Demon; "Tweak hooks so jenkins-bot is reported, just not on comments" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37155 [23:49:08] PROBLEM - MySQL Idle Transactions on db1043 is CRITICAL: Connection refused by host [23:49:08] PROBLEM - MySQL disk space on db1043 is 
CRITICAL: Connection refused by host [23:49:08] PROBLEM - MySQL Recent Restart on db1043 is CRITICAL: Connection refused by host [23:49:17] PROBLEM - MySQL Replication Heartbeat on db1043 is CRITICAL: Connection refused by host [23:49:35] PROBLEM - MySQL Slave Delay on db1043 is CRITICAL: Connection refused by host [23:49:44] PROBLEM - Full LVS Snapshot on db1043 is CRITICAL: Connection refused by host [23:50:02] PROBLEM - MySQL Slave Running on db1043 is CRITICAL: Connection refused by host [23:50:48] notpeter: Hmm [23:51:13] notpeter: If I'm to use bast1001 for Parsoid deployments, I'll need certain packages there. Like node and npm. Or git (!) [23:51:27] Will submit a change [23:51:44] if this truly is all going to be temporary, you can also just do it by hand [23:51:47] up to you [23:52:50] Right [23:53:02] I think technically we don't want those things there anyway [23:53:17] Do I have your blessing to just go ahead and install git-core and npm manually? [23:54:24] notpeter: ? [23:54:29] New patchset: MaxSem; "Test enable GeoData on enwiki in collection mode" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37157 [23:55:06] Change merged: MaxSem; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37157 [23:55:44] RECOVERY - MySQL Idle Transactions on db1043 is OK: OK longest blocking idle transaction sleeps for seconds [23:55:53] RECOVERY - MySQL Slave Delay on db1043 is OK: OK replication delay seconds [23:56:11] RECOVERY - Full LVS Snapshot on db1043 is OK: OK no full LVM snapshot volumes [23:56:16] !log upgrading labsconsole to wmf/1.21wmf6 [23:56:22] * Ryan_Lane grumbles [23:56:24] Logged the message, Master [23:56:29] RECOVERY - MySQL Slave Running on db1043 is OK: OK replication [23:57:05] RECOVERY - MySQL disk space on db1043 is OK: DISK OK [23:57:05] RECOVERY - MySQL Recent Restart on db1043 is OK: OK seconds since restart [23:57:14] RECOVERY - MySQL Replication Heartbeat on db1043 is OK: OK replication delay
seconds [23:58:28] RoanKattouw: sure [23:59:00] bah. that branch doesn't even exist yet [23:59:06] !log make that upgrading labsconsole to wmf/1.21wmf5 [23:59:08] !log Temporarily installing git-core and npm on bast1001 by hand with Peter's blessing [23:59:14] Logged the message, Master [23:59:24] Logged the message, Mr. Obvious [23:59:45] !log and build-essentials too [23:59:53] Logged the message, Mr. Obvious [23:59:57] !log *build-essential
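Returning to the graphite discussion earlier (binasher at 23:26): the 0.8333 scale factor falls out of undoing the 1-in-50 (2%) sampling and converting per-minute buckets to per-second. A quick sketch of the arithmetic (the function name is illustrative):

```python
def sampled_to_per_sec(count_per_min, sample_divisor=50):
    """Estimate the true requests-per-second rate from a per-minute
    counter of 1-in-50 sampled events: multiply by 50 to undo the
    sampling, divide by 60 seconds. 50/60 = 0.8333..., the scale
    factor quoted in the log."""
    return count_per_min * sample_divisor / 60

# 1200 sampled hits in one minute ~= 60000 real hits ~= 1000 req/s
print(sampled_to_per_sec(1200))  # -> 1000.0
```

In graphite itself this is what `scale(series, 0.8333)` applies to the raw per-minute counts, as in the URL above.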