[00:00:00] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.037 seconds [00:02:05] New patchset: awjrichards; "ensure all photo uploads go to commons" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/36103 [00:03:00] Change merged: MaxSem; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/36103 [00:05:04] !log awjrichards synchronized wmf-config/InitialiseSettings.php 'Preparing to use commons API for mobile beta photo upload' [00:05:12] Logged the message, Master [00:06:08] !log awjrichards synchronized wmf-config/mobile.php 'Enabling use of commons API for mobile beta photo upload' [00:06:16] Logged the message, Master [00:06:28] Oh… mutante, I think your suggesting that I use the parameterized webserver::php5 broke things, because it conflicts with webserver::apache2 which is defined elsewhere [00:06:34] *suggestion [00:06:40] Hm [00:07:02] andrewbogott: how can you tell it breaks? [00:07:12] err: Could not retrieve catalog from remote server: Error 400 on SERVER: Duplicate definition: Package[apache2] is already defined in file /var/lib/git/operations/puppet/manifests/webserver.pp at line 91; cannot redefine at /var/lib/git/operations/puppet/manifests/webserver.pp:42 on node gallium.wikimedia.org [00:07:32] oh.. webserver::apache2 is already on gallium too? [00:08:00] yeah, that line in my class was a c/p from the integration class [00:08:32] eh..yeah.. in that case 2 different setups on one host... meh [00:08:55] So, I can change contint.pp to use the parameterized version, or change docs to use the simple version. 
[00:09:06] use the one with parameters [00:09:09] New patchset: awjrichards; "Disable commons API for mobile beta photo uploads on testwiki and test2wiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/36926 [00:09:16] and you just need it once [00:09:35] Well… you only need it once if you know that both classes are going to be on the same system. [00:09:52] A proper class should list all its dependencies, not just assume that they'll be there...? [00:09:55] i guess it should technically be 3 classes then :p [00:10:03] one that installs the webserver, and one per Apache site [00:10:29] and then all 3 applied on the node [00:10:40] But… isn't it a dependency? [00:10:48] I mean, every apache site class should specify that it requires apache [00:10:57] It's not harmful for each of them to have the same requirement [00:11:27] i guess it isnt, as long as you dont mix the 2 methods.. true [00:11:37] Change merged: MaxSem; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/36926 [00:11:50] the one with parameters is definitely newer than the other [00:12:35] New patchset: Andrew Bogott; "Use the slightly-more-modern parameterized def for apache." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36927 [00:12:50] So… that patch makes that change, although I'm somewhat skeptical that it won't break a bunch of things [00:13:11] There's one quick way to find out! [00:13:30] why would it break a bunch? 
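The Duplicate definition error andrewbogott hit, and the layout mutante sketches above (one class that installs Apache, one class per site, all applied on the node), can be illustrated in Puppet. Class, site, and file names below are hypothetical, not the actual production manifests:

```puppet
# Declaring Package['apache2'] in two classes applied to the same node
# fails the catalog compile with "Duplicate definition: Package[apache2]
# is already defined". The fix: declare the package in exactly one class.
class webserver::apache2 {
    package { 'apache2':
        ensure => latest,
    }
    service { 'apache2':
        ensure  => running,
        require => Package['apache2'],
    }
}

# Each per-site class pulls in the webserver class via 'include', which
# is idempotent, and orders itself after the package:
class sites::docs {
    include webserver::apache2

    file { '/etc/apache2/sites-available/docs':
        source  => 'puppet:///files/apache/sites/docs',
        require => Package['apache2'],
        notify  => Service['apache2'],
    }
}

class sites::integration {
    include webserver::apache2
    # ... site-specific configuration here ...
}

# Both site classes now coexist on one node without conflict:
node 'gallium.wikimedia.org' {
    include sites::docs
    include sites::integration
}
```

This matches the point made above: it is not harmful for every site class to state the same requirement, as long as the requirement is expressed with `include` (or `require`) rather than by each class re-declaring the resource itself.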
[00:13:35] we use it in other places for sure [00:13:48] *shrug* I'm just superstitious about touching a file that I didn't write and haven't thought about [00:13:56] !log awjrichards synchronized wmf-config/InitialiseSettings.php 'Disable use of commons api for mobile beta photo uploads on testwiki and test2wiki' [00:14:04] Logged the message, Master [00:14:14] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36927 [00:16:49] Ryan_Lane, LeslieCarr: yeah I've postponed merging it because it affects all hosts and want to reserve a nice time to try it/merge it (testing in labs doesn't help in this case) [00:17:07] Ryan_Lane, LeslieCarr: didn't know anyone else was interested, I'll have a look tomorrow. [00:17:36] New patchset: Ryan Lane; "Dep host needs appserver packages (like php)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36928 [00:17:57] LeslieCarr: as for the SSH module, it's outdated by now. I've refrained from merging it because ma rk didn't like the whitespace and we were supposed to discuss it further [00:18:10] cool [00:18:12] what? 2 spaces? [00:18:21] i was just all "egads so many commits" [00:18:34] mutante: Yeah, it breaks because the integration class explicitly declared a bunch of classes that are now created by the parametrized class... [00:19:37] …or maybe just one actually [00:19:39] LeslieCarr: thanks, that's awesome [00:19:58] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36928 [00:20:38] New patchset: Andrew Bogott; "Don't add libapache2-mod-php5 here, we're now getting it from webserver::php5." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36929 [00:21:41] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36929 [00:21:49] andrewbogott: yep..ack [00:22:50] andrewbogott: and one more.. SSL [00:22:53] i can do it :p [00:23:40] thanks! 
That's "apache_module { ssl: name => "ssl" }" right? [00:23:49] yes [00:24:16] New patchset: Dzahn; "dont need to load Apache module SSL, getting it from parameterized class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36930 [00:24:36] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36930 [00:25:33] Hm. [00:26:20] Oh! The thing you said before about only needing to define it in one place was right :( I was thinking of 'requires' which this is not. [00:26:23] So, I will fix that one. [00:26:53] Hm… there must be some way of properly declaring the dependency in two places... [00:26:56] i just saw that..yes [00:27:02] for now..just take it out of docs.pp [00:27:23] i think proper would be one class that _just_ sets up Apache [00:27:30] but not the sites [00:27:50] New patchset: Andrew Bogott; "Remove a duplicate declaration" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36931 [00:27:58] Yeah, and the sites should 'require' that rather than declaring anything… I think. [00:30:09] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36931 [00:30:30] andrewbogott: can you think of a better term than "doc site server"? [00:30:46] i was going to add a system role [00:31:11] It should be more specific, but I don't yet know what all this site will be used for. [00:31:30] Like, it's different from wikitech, somehow... [00:32:11] New patchset: Dzahn; "add a system role for the new misc/docs class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36933 [00:32:50] andrewbogott: ^ that is always nice to have.. these roles all show in motd when you connect to server [00:32:59] cool. [00:33:00] ..or should... 
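On "properly declaring the dependency in two places": a resource such as the `apache_module { ssl: name => "ssl" }` above may be declared only once per catalog, but wrapping it in a class makes it safe to reference from several site classes, because `include` is idempotent. A sketch under that assumption (the wrapper class name is made up, and `apache_module` is taken to be the local define the participants are using):

```puppet
# This fails if two classes in the catalog both contain it:
#   apache_module { 'ssl': name => 'ssl' }
#
# Wrapping the declaration in a class makes it repeat-safe, because a
# class can be include'd any number of times:
class apache::mod::ssl {
    apache_module { 'ssl':
        name => 'ssl',
    }
}

class docs_site {
    include apache::mod::ssl    # fine
}

class other_ssl_site {
    include apache::mod::ssl    # also fine: no duplicate definition
}
```

Later versions of puppetlabs-stdlib also offer `ensure_resource()` for declaring a bare resource from multiple places, but the class-wrapping pattern above works without any extra library.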
[00:33:11] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36933 [00:33:39] I'm not logged into sockpuppet atm, I'll leave that bit to you [00:33:39] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:37:23] !log rebooting es1004 for kernel upgrade before repooling [00:37:31] Logged the message, notpeter [00:40:54] Does stafford need attention or just a break? [00:41:56] New patchset: awjrichards; "Fix setting of wgMFPhotoUPloadEndpoint to happen AFTER inclusion of MObileFrontend.php" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/36936 [00:44:01] Change merged: awjrichards; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/36936 [00:44:57] andrewbogott: a little of both [00:45:06] !log awjrichards synchronized wmf-config/mobile.php 'Fix api endpoint definition for mobile beta' [00:45:06] we need to scale puppet [00:45:14] Logged the message, Master [00:48:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.525 seconds [00:48:23] !log mlitn synchronized php-1.21wmf4/extensions/ArticleFeedbackv5/modules 're-sync aftv5 js&css files with newer modification date' [00:48:31] Logged the message, Master [00:48:57] PROBLEM - MySQL Slave Delay on es1004 is CRITICAL: CRIT replication delay 6411790 seconds [00:55:29] mutante: It goes! http://doc.wikimedia.org/puppet/ [00:56:20] whoa [00:56:24] *click* [00:57:23] andrewbogott: woot [00:57:34] Ugly, but enough for today. [00:57:36] looks like javadoc [00:57:50] is this hooked up somehow to jenkins? or a cron? [00:58:34] I am pretty sure that puppet will update it. [00:58:46] oh, by puppet! 
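If "puppet will update it" means a scheduled regeneration of http://doc.wikimedia.org/puppet/, a minimal sketch is a cron resource driving `puppet doc`. The schedule, output path, and manifest path below are assumptions for illustration, not the actual production setup:

```puppet
# Hypothetical: regenerate the RDoc-style manifest documentation nightly
# and publish it under the doc site's docroot.
cron { 'update-puppet-docs':
    command => '/usr/bin/puppet doc --mode rdoc --outputdir /srv/org/wikimedia/doc/puppet --manifestdir /var/lib/git/operations/puppet/manifests >/dev/null 2>&1',
    user    => 'root',
    hour    => 3,
    minute  => 0,
}
```

The "looks like javadoc" remark fits: `puppet doc --mode rdoc` emits a browsable class/define index much like javadoc's package index.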
hah [01:10:51] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [01:10:51] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [01:22:42] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:26:06] New patchset: Dzahn; "contint - do not install all of the huge ia32-libs, instead just install what is actually needed by Android SDK on a multiarch system. thanks paravoid" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36945 [01:26:28] New review: Dzahn; "http://stackoverflow.com/questions/2710499/android-sdk-on-a-64-bit-linux-machine" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/36945 [01:27:48] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [01:27:48] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [01:29:02] New patchset: Dzahn; "contint - do not install all of the huge ia32-libs, instead just install what is actually needed by Android SDK on a multiarch system. thanks paravoid" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36945 [01:29:24] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36945 [01:37:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 5.300 seconds [01:45:51] !log gallium - puppet runs fine again, has our librsvg 2.36-1wm1 packages and Android SDK stuff should also work (@hashar) [01:46:04] Logged the message, Master [01:50:46] /msg NickServ identify WM$s01.B [01:51:26] /msg NickServ mike_wang WM$s01.B [02:03:19] test message [02:04:10] Success. 
[02:05:03] test message [02:05:48] test from Tampa [02:08:36] New patchset: Ryan Lane; "Add ability to set allows and denies for docroot" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36948 [02:09:00] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36948 [02:13:19] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:16:32] New patchset: Ryan Lane; "Limit deployment webserver access to our networks" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36949 [02:16:55] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36949 [02:18:04] good night! [02:18:18] mike_wang: night [02:18:28] mike_wang: you may want to change your identify password, btw [02:18:35] you had written it into the channel [02:18:40] (and this is a public channel) [02:19:18] I made a mistake. I will change my passwd. [02:24:05] !log LocalisationUpdate completed (1.21wmf5) at Wed Dec 5 02:24:05 UTC 2012 [02:24:14] Logged the message, Master [02:30:07] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [02:31:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.017 seconds [02:42:16] PROBLEM - MySQL disk space on db78 is CRITICAL: DISK CRITICAL - free space: /a 117053 MB (3% inode=99%): [02:44:52] !log LocalisationUpdate completed (1.21wmf4) at Wed Dec 5 02:44:51 UTC 2012 [02:45:00] Logged the message, Master [02:46:28] RECOVERY - Puppet freshness on ms1002 is OK: puppet ran at Wed Dec 5 02:46:12 UTC 2012 [03:31:10] PROBLEM - Puppet freshness on ms-be3 is CRITICAL: Puppet has not run in the last 10 hours [05:00:57] PROBLEM - Backend Squid HTTP on sq48 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:02:45] PROBLEM - Frontend Squid HTTP on sq48 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:03:59] Change abandoned: Ori.livneh; 
"(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31146 [05:15:30] PROBLEM - SSH on sq48 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:24:30] PROBLEM - Host sq48 is DOWN: PING CRITICAL - Packet loss = 100% [05:41:00] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [05:48:57] PROBLEM - Puppet freshness on ssl3001 is CRITICAL: Puppet has not run in the last 10 hours [06:17:27] AaronSchulz: was that query on frwiki? (which is now officially over 4 million in the job queue >_<) [06:52:24] PROBLEM - swift-container-server on ms-be1 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [06:52:33] PROBLEM - swift-account-server on ms-be1 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [06:54:21] PROBLEM - swift-container-replicator on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [06:55:07] PROBLEM - swift-account-auditor on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [06:55:34] PROBLEM - swift-account-reaper on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [06:57:31] PROBLEM - swift-object-server on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [06:57:40] PROBLEM - swift-container-server on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [06:57:40] PROBLEM - swift-account-replicator on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [06:57:58] PROBLEM - swift-account-server on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python 
/usr/bin/swift-account-server [07:01:22] hmm [07:03:13] RECOVERY - swift-account-auditor on ms-be2 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [07:03:23] yeah, those were bogus [07:03:49] RECOVERY - swift-container-replicator on ms-be2 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [07:03:58] RECOVERY - swift-account-reaper on ms-be2 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [07:04:07] RECOVERY - swift-object-server on ms-be2 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [07:04:07] RECOVERY - swift-account-replicator on ms-be2 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [07:04:16] RECOVERY - swift-container-server on ms-be2 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [07:04:25] RECOVERY - swift-account-server on ms-be2 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [07:07:16] RECOVERY - swift-account-server on ms-be1 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [07:08:28] RECOVERY - swift-container-server on ms-be1 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [07:33:47] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [07:39:49] New patchset: ArielGlenn; "change object replication max conns to 3 in rsync conf also" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36955 [07:40:14] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36955 [07:41:28] and now we see how that is [08:04:36] !log cleaning up stuff on ms-be5 /srv/swift-storage/sde1/objects/ again, data somehow wound up on the underlying root filesystem, root partition was full 
[08:04:46] Logged the message, Master [08:05:26] PROBLEM - swift-object-updater on ms-be5 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-updater [08:05:35] PROBLEM - swift-account-server on ms-be5 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [08:05:53] PROBLEM - swift-account-auditor on ms-be5 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [08:05:53] PROBLEM - swift-container-updater on ms-be5 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater [08:05:55] yep, we know [08:06:02] PROBLEM - swift-object-replicator on ms-be5 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [08:06:02] PROBLEM - swift-object-auditor on ms-be5 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [08:06:02] PROBLEM - swift-container-auditor on ms-be5 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [08:06:20] PROBLEM - swift-account-replicator on ms-be5 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [08:06:29] PROBLEM - swift-container-server on ms-be5 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [08:06:47] PROBLEM - swift-account-reaper on ms-be5 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [08:06:47] PROBLEM - swift-object-server on ms-be5 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [08:06:47] PROBLEM - swift-container-replicator on ms-be5 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [08:07:32] once these 
files are gone I can remount, turn puppet back on, restart all services, shouldn't take long [08:09:47] that should get it [08:09:47] RECOVERY - swift-container-server on ms-be5 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [08:09:47] RECOVERY - swift-account-replicator on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [08:10:05] RECOVERY - swift-object-server on ms-be5 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [08:10:05] RECOVERY - swift-account-reaper on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [08:10:05] RECOVERY - swift-container-replicator on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [08:10:14] RECOVERY - swift-object-updater on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [08:10:14] RECOVERY - swift-account-server on ms-be5 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [08:10:41] RECOVERY - swift-account-auditor on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [08:10:41] RECOVERY - swift-container-updater on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [08:10:50] RECOVERY - swift-object-replicator on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [08:10:51] RECOVERY - swift-object-auditor on ms-be5 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [08:10:51] RECOVERY - swift-container-auditor on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [08:14:00] ok looks like sde is bad. 
meh [08:27:42] PROBLEM - SSH on snapshot1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:40:41] !log powercycling snapshot1, was in swapdeath [08:40:54] Logged the message, Master [08:43:45] PROBLEM - Host snapshot1 is DOWN: PING CRITICAL - Packet loss = 100% [08:47:21] RECOVERY - SSH on snapshot1 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [08:47:30] RECOVERY - Host snapshot1 is UP: PING OK - Packet loss = 0%, RTA = 0.35 ms [09:10:27] PROBLEM - Puppet freshness on analytics1007 is CRITICAL: Puppet has not run in the last 10 hours [09:44:53] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [11:11:40] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [11:11:40] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [11:28:56] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [11:28:56] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [11:34:38] PROBLEM - Host ms-be3001 is DOWN: PING CRITICAL - Packet loss = 100% [11:39:44] RECOVERY - Host ms-be3001 is UP: PING OK - Packet loss = 0%, RTA = 110.34 ms [11:50:42] New patchset: Dereckson; "(bug 42720) Add ipblock-exempt right to bot group on cs.wikipedia" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/36970 [11:53:59] New review: Dereckson; "shellpolicy" [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/36970 [12:09:45] New review: Mormegil; "See http://cs.wikipedia.org/wiki/Wikipedie:Pod_l%C3%ADpou_%28n%C3%A1vrhy%29#V.C3.BDjimky_z_blokov.C3..." 
[operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/36970 [12:31:14] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [12:36:53] New patchset: Aude; "update settings for wikibase client and repo, in prep for next deployment" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/36205 [12:47:53] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [13:31:50] PROBLEM - Puppet freshness on ms-be3 is CRITICAL: Puppet has not run in the last 10 hours [13:39:21] hush whiner [13:40:16] hehe [13:56:59] Can we make it whine for puppet if it's like over 24 hours or something? [14:02:20] !log reedy synchronized php-1.21wmf5/includes/dao [14:02:30] Logged the message, Master [14:02:59] !log reedy synchronized php-1.21wmf5/includes/db [14:03:08] Logged the message, Master [14:03:32] Warning: the RSA host key for 'hume' differs from the key for the IP address '2620:0:860:2:21d:9ff:fe33:f235' [14:03:32] Offending key for IP in /etc/ssh/ssh_known_hosts:835 [14:03:32] Matching host key in /etc/ssh/ssh_known_hosts:603 [14:03:41] yay, ipv6 [14:03:43] !log reedy synchronized php-1.21wmf5/includes/AutoLoader.php [14:03:52] Logged the message, Master [14:49:59] New review: Dereckson; "Shellpolicy issue has been resolved." 
[operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/36970 [14:53:24] !log Phasing out the Jenkins job linting operations/puppet.git , replacing it with https://integration.mediawiki.org/ci/job/operations-puppet-validate/ [14:53:32] Logged the message, Master [14:57:43] New patchset: Hashar; "validate jenkins job" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37009 [15:01:16] New review: Hashar; "recheck" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/37009 [15:02:53] Change abandoned: Hashar; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37009 [15:12:19] baahhh [15:12:28] stupid stuck scrollback [15:17:18] !log pulled ms-be5:sde1 and ms-be12:sdi from rings, they were borking obj replication across the cluster [15:17:26] Logged the message, Master [15:19:56] sq48 seems to be still down (just looking at nagios-wm scrollback) [15:20:03] i've not a clue if that's important [15:24:14] I'm not ignoring you, I'm poking around [15:29:46] RECOVERY - SSH on sq48 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [15:29:55] RECOVERY - Host sq48 is UP: PING OK - Packet loss = 0%, RTA = 0.20 ms [15:34:16] !log powercycled sq48, hmm it was up for 214 days... fishy squid! 
[15:34:24] Logged the message, Master [15:39:04] RECOVERY - Backend Squid HTTP on sq48 is OK: HTTP OK HTTP/1.0 200 OK - 460 bytes in 0.024 seconds [15:40:07] RECOVERY - Frontend Squid HTTP on sq48 is OK: HTTP OK HTTP/1.0 200 OK - 605 bytes in 0.007 seconds [15:41:04] well almost [15:42:04] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [15:43:56] really not sure if that was sufficient for the replication issue, I see steadily increasing after a sharp decrease in eta and not sure why [15:46:42] may be too soon to tell, I'll look at in a while [15:46:51] out for a couple hours, back later [15:48:06] I suppose taking out the devices forces some data to be moved which will screw up the estimates for a bit too [15:48:10] anyways back in a while [15:49:00] bye [15:50:01] PROBLEM - Puppet freshness on ssl3001 is CRITICAL: Puppet has not run in the last 10 hours [15:50:28] RECOVERY - MySQL disk space on db78 is OK: DISK OK [16:13:31] New patchset: Jgreen; "fixed db name in dumper script" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37021 [16:14:37] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37021 [16:30:58] PROBLEM - Varnish traffic logger on cp1044 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [16:31:16] PROBLEM - Varnish traffic logger on cp1043 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [16:45:01] PROBLEM - Packetloss_Average on oxygen is CRITICAL: CRITICAL: packet_loss_average is 10.0815765079 (gt 8.0) [16:47:34] RECOVERY - Packetloss_Average on oxygen is OK: OK: packet_loss_average is 0.188751788618 [16:47:43] RECOVERY - Varnish traffic logger on cp1044 is OK: PROCS OK: 3 processes with command name varnishncsa [16:48:37] RECOVERY - Varnish traffic logger on cp1043 is OK: PROCS OK: 3 processes with command name varnishncsa [16:53:43] PROBLEM - Backend Squid HTTP on sq48 is CRITICAL: CRITICAL - Socket 
timeout after 10 seconds [17:34:40] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [18:01:27] New patchset: Ottomata; "Setting up blog.wikimedia.org to send varnish logs to main udp2log stream." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37037 [18:02:57] RobH review :)? [18:03:01] https://gerrit.wikimedia.org/r/37037 [18:08:33] ottomata: hey [18:08:45] your hadoop crons are no longer throwing errors [18:08:51] but they are spaming root [18:09:08] does ottomata not get his own spam? ;) [18:09:11] would you be willing to pipe output to whoever this data will be useful to? [18:09:16] yup [18:09:31] well, right now the whole ops team gets it... [18:09:34] but we don't want it [18:09:41] are you sure ? :D [18:10:49] java: a DSL for turning xml files into stacktraces ;) [18:10:54] yes, I'm pretty sure. [18:10:55] notpeter: ape rgos got several labs people to kill their cronspam iirc [18:13:04] paravoid: wait there is a ceph cluster now? [18:13:14] a small one :-) [18:13:39] we've been playing with it a bit, finally [18:13:50] hardware and Dell really didn't make us any favors [18:13:54] * AaronSchulz wishes he knew about that [18:13:57] how many servers? 
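Redirecting the Hadoop cron output as notpeter asks above (so it stops spamming root and goes to whoever the data is useful to) could look like this in Puppet; the job name, command, and recipient address are all made up:

```puppet
# Cron mails a job's stdout/stderr to the owning user (root by default),
# which is what spams the whole ops team. Two common fixes, combined
# here: set MAILTO for the job, and log routine output to a file so only
# genuine errors get mailed.
cron { 'hadoop-report':
    command     => '/usr/local/bin/hadoop-report >> /var/log/hadoop-report.log 2>&1',
    environment => 'MAILTO=analytics@example.org',  # hypothetical recipient
    user        => 'hdfs',
    hour        => 0,
    minute      => 15,
}
```

The `environment` parameter on Puppet's cron type injects `VAR=value` lines above the crontab entry, so cron itself handles the mail routing without any wrapper script.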
[18:14:19] we've built one in esams with one "backend" and one radosgw [18:14:23] we have more machines there [18:14:34] and we have the 720xds in eqiad that are supposed to be arriving today I think [18:14:44] that we can also format and use them as a ceph lab [18:17:13] preilly: If nothing else came up I'd ignore it, but the autoloader thing, unit tests being added later, and other people complaining is enough reason to just revert [18:17:55] AaronSchulz: okay [18:17:58] it can always just be added back, but it's hard to argue on the lists that we should keep it if given the fact that it was missing from the autoloader [18:19:21] hmm, I'll probably make another list post [18:19:32] AaronSchulz: I'm pretty upset about this right now so I'd like to not talk about it [18:19:33] cmjohnson1: hey, when you take down hume to replace disk , please let me know [18:19:40] preilly: sure [18:19:53] I reimaged w/o raid yesterday, so I'm going to need to reimage again when it's got its disks back [18:20:05] (I added link on ticket for disk) [18:20:08] New review: RobH; "Seems fine to me, but my experience with pushing things into memcached like that is limited. I appr..." [operations/puppet] (production); V: 0 C: 1; - https://gerrit.wikimedia.org/r/37037 [18:20:17] oh, but you're not in tampa anymore.... [18:20:25] heh [18:20:28] sbernardin: ^ [18:20:32] !seen volunteer [18:20:35] Steve is the Tampa dude now. [18:20:52] but yea, put it in the RT ticket for it to hunt you down bnefore doing something and he will [18:21:01] ^demon|away: do we have the Forge Author access control for Gerrit set? [18:21:07] or after, as need be. [18:21:24] <^demon|away> preilly: Forge Author is granted on all repos, yes. [18:21:32] <^demon|away> Not forge committer, though. [18:21:32] ^demon|away: okay cool thanks [18:21:54] <^demon|away> (You couldn't amend someone else's change without forge author :)) [18:22:12] <^demon|away> Well, not as easily at least. 
[18:22:18] ^demon|away: yes you could [18:22:29] ^demon|away: but, I'll agree NOT easily [18:24:02] yeah, put comment in [18:24:03] thanks! [18:29:29] binasher, does stuff like https://gerrit.wikimedia.org/r/#/c/36894/2/sql/externally-backed.sql require your approval, too? [18:30:47] MaxSem: it should, though a couple other people should be able to review schema stuff too [18:31:01] i +1'd it [18:31:28] thanks! [18:32:24] <^demon|away> binasher: That's not how it works. Once you've signed up, you've taken over sole responsibility ;-) [18:33:57] * binasher needs to get responsibility sharding merged [18:34:43] <^demon|away> binasher: Put the responsibilities in Git, then they'll be distributed. [18:35:44] speaking of which [18:35:57] binasher: where are the definitions/code for gdash? [18:38:32] Jeff_Green: did you see that db1025 is having a problem with snapshotting? [18:38:59] notpeter: no--where did you see that? [18:39:03] cronspam [18:39:21] 2 hours ago [18:39:28] blarney. [18:39:39] <^demon|away> mutante: Could you please? https://gerrit.wikimedia.org/r/#/c/36882/ [18:40:08] ^demon|away: ok [18:40:32] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36882 [18:40:57] notpeter: Could not retrieve catalog; skipping run ..hrmm [18:41:35] ^demon|away: fyi.. gallium should be fine again .. recent issue with librsvg packages and android sdk [18:41:55] even though i did not really build the android app yet [18:42:08] i just heard you can do it on the cmdline on gallium [18:42:32] <^demon|away> librsvg & android sdk don't affect me, but ok :) [18:42:42] Nemo_bis: they're in puppet/files/graphite/gdash-dashboards but i never actually made a puppet rules to install them.. i really need to, especially if others are going to contribute [18:43:36] ^demon|away: you saw doc.wikimedia.org already? [18:43:37] notpeter: found it. now the question is wtf to do about it? [18:43:48] <^demon|away> mutante: hashar pointed me to it. 
[18:43:55] ^demon|away: ohh..oops..it broke? :p [18:44:04] it looked different yesterday [18:44:24] or maybe it didnt.and i just did not try and open the index [18:44:24] * ^demon|away shrugs [18:44:26] <^demon|away> I dunno [18:44:27] mutante: andrewbogott is working on doc.wikimedia.orgI think [18:44:37] PROBLEM - swift-object-server on ms-be10 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:44:48] is today really an insert-leap-second day? [18:44:50] yep, nevermind,i think i just opened it with /puppet/ before [18:45:14] hashar: https://gerrit.wikimedia.org/r/#/c/36945/2/manifests/misc/contint.pp [18:45:23] mutante / notpeter what do you think of just adding screen to the base install ? [18:45:29] binasher: thanks, do you also happen to know about ganglia config for custom metrics like http://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Miscellaneous%20pmtpa&h=spence.wikimedia.org&v=506&m=enwiki_JobQueue_length ? [18:45:34] LeslieCarr: love it [18:45:40] screen++ [18:45:50] Jeff_Green: a fine question. [18:45:51] LeslieCarr: yes, is it not? i thought its just not there because we have no base yet..puppet did not run [18:45:55] oh [18:46:00] that's possible [18:46:01] :) [18:46:04] notpeter: Buffer I/O error on device dm-4, logical block 0 [18:46:08] that looks unsuperduper to me [18:46:16] oh [18:46:17] fu [18:46:30] LeslieCarr: notpeter ran screen from bastion host [18:46:32] mutante is right - it is in the base already [18:46:43] <^demon|away> While we're talking about base--we've talked about putting git(-core) there. Almost every host needs it these days. [18:46:48] man we're good :) [18:46:58] Jeff_Green: yeah, this be looking like a disk problem to me [18:47:04] yawp [18:47:10] Nemo_bis: see the oddly named puppet class nagios::ganglia::monitor::enwiki in puppet/manifests/nagios.pp [18:47:37] PROBLEM - swift-account-reaper on ms-be10 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[18:48:04] PROBLEM - swift-container-updater on ms-be10 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:48:04] PROBLEM - swift-account-replicator on ms-be10 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:48:07] binasher: thank you very much [18:48:22] PROBLEM - swift-object-updater on ms-be10 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:48:29] New patchset: Reedy; "Everything else over to 1.21wmf5" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37043 [18:48:31] PROBLEM - swift-account-auditor on ms-be10 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:48:40] PROBLEM - swift-container-server on ms-be10 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:48:49] PROBLEM - swift-object-replicator on ms-be10 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:48:49] PROBLEM - swift-container-auditor on ms-be10 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:48:49] PROBLEM - swift-account-server on ms-be10 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:48:49] PROBLEM - swift-object-auditor on ms-be10 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:48:58] PROBLEM - swift-container-replicator on ms-be10 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[18:49:25] RECOVERY - swift-object-server on ms-be10 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [18:49:34] RECOVERY - swift-account-replicator on ms-be10 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [18:49:34] RECOVERY - swift-container-updater on ms-be10 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater [18:49:52] RECOVERY - swift-object-updater on ms-be10 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [18:50:01] RECOVERY - swift-account-auditor on ms-be10 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [18:50:02] mutante, Jeff_Green, hashar: Previously doc.wikimedia.org just said "It works!" Now it points to a directory which may at some point contain a table of contents :) [18:50:10] RECOVERY - swift-container-server on ms-be10 is OK: PROCS OK: 13 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [18:50:20] RECOVERY - swift-container-auditor on ms-be10 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [18:50:20] RECOVERY - swift-object-replicator on ms-be10 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [18:50:20] RECOVERY - swift-account-server on ms-be10 is OK: PROCS OK: 13 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [18:50:20] RECOVERY - swift-object-auditor on ms-be10 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [18:50:21] andrewbogott: you are the boss :-] [18:50:27] But I don't have specific plans outside of tuning up the two directories that are already present. 
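The swift alerts above come from NRPE running a check_procs-style test: count the processes whose argument list matches a regex and alert when the count falls below a minimum. A minimal sketch of that logic (illustrative only, not the real Nagios plugin; the process table below is hypothetical):

```python
import re

def check_procs(cmdlines, pattern, min_procs=1):
    """Count command lines matching the regex; CRITICAL when below the minimum."""
    matched = [c for c in cmdlines if re.search(pattern, c)]
    status = "OK" if len(matched) >= min_procs else "CRITICAL"
    return status, len(matched)

# Hypothetical process table on a storage host:
procs = [
    "/usr/bin/python /usr/bin/swift-object-server /etc/swift/object-server.conf",
    "/usr/bin/python /usr/bin/swift-object-auditor /etc/swift/object-server.conf",
]
check_procs(procs, r"^/usr/bin/python /usr/bin/swift-object-server")  # ('OK', 1)
```

The anchored regex is what keeps `swift-object-server` from also matching `swift-object-auditor` or similar siblings.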
[18:50:28] RECOVERY - swift-container-replicator on ms-be10 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [18:50:43] .htaccess DENY FROM ALL [18:50:44] done. [18:50:46] RECOVERY - swift-account-reaper on ms-be10 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [18:55:46] andrewbogott: not sure if that is what you wanted..but i once used something like this to remove the "It works" stuff... apache_site { no_default: name => '000-default', ensure => absent } [18:57:01] mutante: I think in this case the apache config I added overrode the default. Probably you'll still get 'it works' if you visit by IP. [18:59:13] andrewbogott: hmm.. the default page should be called 000-default.. and it is in sites-enabled.. you just added another file besides it.. doc.wikimedia.org ..did not overwrite or remove the default page [19:00:21] write, the file is still there but the default is overridden by the site I added when a browser uses doc.wikimedia.org [19:00:21] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37043 [19:00:27] *right [19:00:47] There are multiple sites hosted on that one box [19:00:49] RECOVERY - Puppet freshness on zhen is OK: puppet ran at Wed Dec 5 19:00:39 UTC 2012 [19:04:13] andrewbogott: yes, 2 virtual hosts. integration. and doc. ..so what would you like to have instead of the directory ? an index.html with nicer links? [19:04:43] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Everything else to 1.21wmf5 [19:04:45] mutante: Yeah, an index.html with a table of contents probably. [19:04:55] Logged the message, Master [19:05:01] I am assuming that hashar has a vision for other doc sets he would like to put on that site, and that mine is just the first. [19:05:35] yes, thats true [19:07:14] mutante: I was responding to earlier discussion of whether or not the site broke. 
Just saying: it's not broken, it changed from nothing to something :) [19:08:52] oh yeah, that was just me :) i did not add the /puppet/ to the URL [19:10:49] !log reedy synchronized php-1.21wmf5/extensions/DataValues [19:10:57] Logged the message, Master [19:11:01] PROBLEM - Puppet freshness on analytics1007 is CRITICAL: Puppet has not run in the last 10 hours [19:11:11] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/36205 [19:12:52] !log reedy synchronized wmf-config/CommonSettings.php 'wikidatawiki readonly for maintenance' [19:13:01] Logged the message, Master [19:15:27] !log reedy synchronized php-1.21wmf5/extensions/Diff [19:15:35] Logged the message, Master [19:16:07] !log reedy synchronized php-1.21wmf5/extensions/Wikibase [19:16:16] Logged the message, Master [19:16:59] robh: ping [19:17:59] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: git pull is helpful. Actually put everything else to 1.21wmf5 [19:18:08] Logged the message, Master [19:18:56] !log reedy synchronized wmf-config/ 'Wikidata config updates' [19:19:04] Logged the message, Master [19:22:48] cmjohnson1: sup? [19:23:31] intel sfp's....one in the server only? [19:23:45] or both ends? [19:28:43] server end only [19:28:43] finisar on switch side [19:28:44] k...thought so but wanted to confirm...thx [19:28:44] np, thx for checking [19:28:46] notpeter: so the mc server sin eqiad can all go down right? [19:28:46] i would like chris to replace all the DCA cables with fiber [19:29:24] bleh... i may not have enough of something, nm... maybe. 
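The earlier doc.wikimedia.org back-and-forth ("It works!" vs. the new site) is Apache name-based virtual hosting: the request's Host header is matched against each vhost's ServerName, and when nothing matches (e.g. browsing by bare IP), the first-loaded vhost, 000-default, answers. A simplified model of that selection, purely illustrative:

```python
def pick_vhost(sites_enabled, host_header):
    """Name-based vhost selection: first ServerName match wins,
    otherwise the first-loaded vhost acts as the default."""
    for name in sites_enabled:
        if host_header == name:
            return name
    return sites_enabled[0]  # 000-default catches unmatched requests

sites = ["000-default", "doc.wikimedia.org", "integration.wikimedia.org"]
pick_vhost(sites, "doc.wikimedia.org")  # the new site answers
pick_vhost(sites, "10.0.0.1")           # a bare IP falls through to 000-default
```

This is why removing or overriding 000-default changes what a bare-IP visit shows without touching the named sites.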
[19:29:44] cmjohnson1: so there are finisar in the switch [19:29:47] as well as on the table [19:29:59] and there MAY be some in the peering switch in A1, but i think there are only sfp there, not sfp+ [19:30:12] the finisar sfp is the round lever (so you can look and see rather than remove) [19:30:18] the sfp+ finisar have the flat metal lever [19:30:37] if there isnt enough, drop me a procurement ticket and i will order them asap [19:30:50] and just replace enough for what you have, which is what you were doing anyhow =] [19:31:24] i know there were 5 on table, so thats only 2 in addition to the 3 that were down =P [19:31:27] crappy. [19:32:48] only 3 on the table...so 1009.1010 and 1002 may be the only ones swapped today. nothing on the switch...there are sfps in A1...so I will put a ticket in for ya [19:32:58] well damn. [19:33:08] so yea, count up the rest you need [19:33:16] plus lets try to keep 10 spare on site [19:33:25] that is only 5 new connections anyhow. [19:33:40] cmjohnson1: and you ahve like 2 spare in tampa right? [19:34:02] i have 8 spare in tampa [19:34:15] that seems fine. [19:34:27] i'll order a couple extra and they can go to tampa when you go down later [19:34:47] so they will have 10 spare too [19:34:52] PROBLEM - Host srv278 is DOWN: PING CRITICAL - Packet loss = 100% [19:34:55] ok [19:36:05] New patchset: Ottomata; "Installing libdb-dev on gallium for webstatscollector build" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37045 [19:36:27] New review: Ottomata; "Hashar has approved via email." 
[operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/37045 [19:36:33] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37045 [19:36:42] RECOVERY - Host srv278 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [19:37:06] New patchset: Reedy; "Add DataValues to extension-list" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37046 [19:37:15] sorry about that cmjohnson1 i should have noticed we were short [19:37:27] PROBLEM - Host db1025 is DOWN: PING CRITICAL - Packet loss = 100% [19:37:31] lemme know the rt # and i will push the aprrovals through [19:37:38] notpeter: hmm, "Host key verification failed" for hume [19:37:47] well I can get into fenari, meh [19:37:49] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37046 [19:39:19] robh: rt4031 [19:40:09] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused [19:40:50] !log reedy synchronized wmf-config/ [19:40:59] Logged the message, Master [19:41:18] AaronSchulz: it's ipv6ing [19:41:35] that's a verb now? [19:41:41] damn it [19:42:02] Reedy: oh, hume? [19:42:25] Warning: the RSA host key for 'hume' differs from the key for the IP address '2620:0:860:2:21d:9ff:fe33:f235' [19:43:09] RECOVERY - Host db1025 is UP: PING OK - Packet loss = 0%, RTA = 26.52 ms [19:45:24] AaronSchulz: yeah ,I reimaged recently [19:45:26] sorry about that [19:45:42] should I just update my key? [19:45:42] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [19:45:43] should be cleaned up by now... 
odd that it's not [19:45:43] yeah [19:46:19] heh, yeah, now I get Reedy's message [19:46:54] PROBLEM - mysqld processes on db1025 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [19:48:24] New patchset: Dzahn; "add simple index.html for doc.wm site" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37050 [19:48:53] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37050 [19:50:52] !log reedy Started syncing Wikimedia installation... : Rebuild localisationcache for wikidatawiki [19:51:01] Logged the message, Master [19:51:10] hasharEating: andrewbogott_afk: done ^^ .. and http://doc.wikimedia.org/ | http://validator.w3.org/check?uri=http%3A%2F%2Fdoc.wikimedia.org%2F&charset=%28detect+automatically%29&doctype=Inline&group=0 [19:51:35] ohh [19:51:42] RECOVERY - mysqld processes on db1025 is OK: PROCS OK: 1 process with command name mysqld [19:52:45] mutante: I noticed the https: versions complains about a wrong certification, namely *.mediawiki.org [19:52:50] typo in it ..grr [19:53:12] same issue happens on https://integration.wikimedia.org/ [19:53:20] oh..i did not mean that [19:53:34] hashar: oh yea, we just added new ServerAliases but not new certificates [19:53:43] -:-) [19:53:59] not sure we can have several server aliases and cert in the same virtual host though [19:54:06] not with one IP [19:54:09] (not yet) [19:54:15] hashar: https://gerrit.wikimedia.org/r/#/c/36684/ easy review [19:54:26] mutante: so I guess we want another vhost :-] [19:54:42] AaronSchulz: hey :-] Sorry about the morning drama regarding UUID Generator. 
[19:55:00] AaronSchulz: I thought that the uniq_id command from PHP would be good enough for our use ;) [19:55:27] hashar: meh, shit happens [19:56:53] hashar: would need a second IP on gallium to have 2 SSL hosts without cert issues, cant use SNI yet http://en.wikipedia.org/wiki/Server_Name_Indication [19:57:13] hashar: or it would have to be cluster apache and not standalone on gallium [19:57:40] bbiab [19:58:13] AaronSchulz: job queue stuff, I am not really confident in it. I barely know that code :( [20:00:32] ahh [20:04:59] notpeter: mc1002/1009/1010 have the new intel sfps and are yours when you are ready [20:05:33] RobH: You're awesome. [20:05:41] ? [20:05:46] yes, yes i am [20:05:52] though i have no idea why you are pointing it out ;] [20:06:18] cmjohnson1: thanks! [20:06:35] RobH: Random act of thanks. Or possibly triggered by an e-mail. :-) [20:06:50] Reedy: I think https://gerrit.wikimedia.org/r/#/c/36815/ is turning into a vote ;) [20:07:14] TimStarling counts as 5 if says anything ;) [20:07:27] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.056 second response time [20:07:37] <^demon|away> AaronSchulz: All gerrit admins are +5 ;-) [20:08:11] * AaronSchulz has +5 [20:12:16] !log temp stopping puppet on brewster [20:12:24] Logged the message, notpeter [20:13:36] <^demon|away> AaronSchulz: Heh, there's a way to write a prolog rule where 1 + 1 = 2. [20:13:50] <^demon|away> (Not that it would make sense for us) [20:14:40] New patchset: Matthias Mullie; "AFTv5 CTA buckets had no default, but _is_ enabled on test, causing notice" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37058 [20:22:00] New patchset: Dzahn; "temp. 
disable icinga class on neon" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37060 [20:23:41] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37060 [20:24:05] notpeter: Do you think you could review https://gerrit.wikimedia.org/r/#/c/36670/ soonish? It has the changes to the Parsoid deployment setup we discussed (mostly just deleting stuff that's now unnecessary), and it puts Parsoid in its own Ganglia group [20:26:30] RECOVERY - Puppet freshness on neon is OK: puppet ran at Wed Dec 5 20:26:03 UTC 2012 [20:28:53] notpeter, could Dec 5 19:59:45 10.0.11.26 apache2[24680]: [notice] child pid 28810 exit signal Segmentation fault (11) be related to precise upgrades? [20:30:16] !log reedy Finished syncing Wikimedia installation... : Rebuild localisationcache for wikidatawiki [20:30:26] Logged the message, Master [20:31:57] RoanKattouw: I'll take a look [20:32:01] MaxSem: when did it start? [20:32:27] notpeter: Thanks man [20:33:01] notpeter, I think I saw it first today, about 12 hours ago. since it's still there, it's probably not some random glitch [20:33:51] MaxSem: then seems unlikely that it's related to upgrades, as I finished those about a month ago [20:34:01] on what hosts(s) [20:34:01] okay [20:35:21] 10.0.11.70 10.0.11.42 10.0.11.46 10.0.11.31 10.0.11.32 10.0.11.29 10.0.11.26 [20:37:11] !log reedy synchronized wmf-config/CommonSettings.php 'wikidatawiki not readonly' [20:37:20] Logged the message, Master [20:38:57] hey Ryan_Lane, you around? [20:40:21] MaxSem: segfaults related to php and libxml2 are regular and have been for a very long while. 
mediawiki bugs are likely to blame for some, bugs in php core and libxml2 for others [20:40:39] MaxSem: please fix :) [20:41:00] gimme stacktrace:) [20:41:53] tim may have collected some actually [20:42:02] peculiarly, there were 3 dead processes on different machines within one second, 19:59:45 [20:45:05] MaxSem: look through the archive logs as well [20:45:38] you'll find lots of peculiar [20:46:45] RECOVERY - NTP on neon is OK: NTP OK: Offset -0.01428127289 secs [20:46:53] preilly will probably be looking into the now that he has full cluster access [20:47:17] binasher: indeed [20:47:19] !log reedy synchronized wmf-config/CommonSettings.php 'wikidatawiki not readonly' [20:47:27] Logged the message, Master [20:56:10] New patchset: MaxSem; "Update device detection to match MobileFrontend" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/35298 [20:59:50] RECOVERY - Host cp1031 is UP: PING OK - Packet loss = 0%, RTA = 26.50 ms [21:00:12] ottomata: was at lunch. what's up? [21:00:19] heya [21:00:35] i'm looking into this: [21:00:36] https://rt.wikimedia.org/Ticket/Display.html?id=3774 [21:00:52] yes [21:00:54] diederik mentions there that he briefly discussed this with you [21:00:57] yep [21:01:00] what's the networking issue (eqiad vs pmtpa?) [21:01:04] no [21:01:06] production vs labs [21:01:10] oh [21:01:13] you are going to enable ldap auth [21:01:16] which changes things [21:01:20] this isn't in labs [21:01:27] it's going to use labs auth [21:01:32] oh [21:01:33] ohhhh [21:01:37] k [21:01:39] that's the reason for ldap ;) [21:01:50] and it'll likely take jobs from labs users [21:01:51] binasher mentioned that I check out the graphite ldap stuff [21:02:02] well, you want actual system authentication, right? [21:02:07] so I was looking at the connection info there [21:02:09] well [21:02:10] yeahhh….. 
[21:02:11] hm [21:02:11] yea [21:02:15] maybe not system authentication [21:02:21] but at minumum for the system to know the users [21:02:23] so, nss [21:02:26] not necessarily pam [21:02:30] hm, aye right [21:02:31] yeah [21:02:33] just hue [21:02:40] they don't need ldap for shell stuff (if that's what you mean) [21:02:51] nss by default can give authentication, if coupled with ssh keys [21:02:52] https://ccp.cloudera.com/display/CDH4DOC/Hue+Installation#HueInstallation-EnablingtheLDAPServerforUserAuthentication [21:02:54] so be careful there [21:03:00] !g 37108 [21:03:01] https://gerrit.wikimedia.org/r/#q,37108,n,z [21:03:08] PROBLEM - Varnish HTTP upload-frontend on cp1031 is CRITICAL: Connection refused [21:03:09] well, the idea of ldap integration was for data protection [21:03:21] you can either use kerberos or file system permissions [21:03:44] PROBLEM - Varnish traffic logger on cp1031 is CRITICAL: Connection refused by host [21:03:50] hm, ok, the data we're talking about here is in hadoop, which has kerberos support too (haven't set that up) [21:03:53] PROBLEM - SSH on cp1031 is CRITICAL: Connection refused [21:03:59] yes, we don't want to use kerbers [21:04:01] kerberos [21:04:02] aye [21:04:02] PROBLEM - Varnish HTCP daemon on cp1031 is CRITICAL: Connection refused by host [21:04:02] if possible [21:04:05] file system stuff is default [21:04:07] anyway [21:04:09] file system permissions are way easier [21:04:11] PROBLEM - Varnish HTTP upload-backend on cp1031 is CRITICAL: Connection refused [21:04:12] yeah that's cool [21:04:21] i'm really just looking to make it so I don't ahve to create user accounts manually for hue [21:04:21] this is just for the web server, though, for now, right? [21:04:24] yeah [21:04:30] ok. that's easy enough [21:04:39] there's no network issues from that perspective [21:04:51] hm, ok cool, basically follow the instructions, fill in the values, hope it works? 
[21:04:52] though realistically we should move it into the labs network sooner than later [21:04:56] it'll be harder later [21:04:58] yes [21:05:04] hm [21:05:16] not sure I know the diff, but the ldap url I was going to use is [21:05:16] ldap://nfs1.pmtpa.wmnet nfs2.pmtpa.wmnet/ou=people,dc=wikimedia,dc=org?cn [21:05:23] no [21:05:25] don't use those [21:05:28] ok [21:05:30] graphite needs to change that too [21:05:39] use virt0.wikimedia.org and virt1000.wikimedia.org [21:05:47] I want to kill off nfs1/2 for ldap eventually [21:05:56] ok, same details for the rest though? [21:06:00] ou, dc, etc.? [21:06:00] yeah [21:06:02] k [21:06:16] we have a standard of using cn to authenticate via the web [21:06:21] and uid via shell [21:06:27] it makes things consistent [21:06:31] hm, ok [21:07:09] you'll want to use ldap_username_pattern [21:07:16] New patchset: Dzahn; "re-enabled icinga class on neon" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37114 [21:07:29] does the username need to map to the user's system username? [21:07:50] if so then you'll need to be inconsistent and use uid for the pattern [21:08:07] hm, i'm not sure [21:08:07] make sure to use tls or ssl [21:08:15] *ldap + start_tls or ldaps [21:08:15] thus far hadoop has wanted a user to have a shell account [21:08:26] which isn't ideal, but its ok [21:08:26] ok, if that's the case you'll need to use uid [21:08:36] and you'll also need to set up nss [21:08:36] ,dc=org?cn -> ,dc=org?uid [21:08:37] ? [21:08:50] dap_username_pattern="uid=<username>,ou=People,dc=mycompany,dc=com" [21:08:51] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37114 [21:09:06] ldap_username_pattern="uid=<username>,ou=people,dc=wikimedia,dc=org" [21:09:13] aye, ok [21:09:19] don't know much about nss [21:09:35] puppet can configure that for you [21:10:09] role::ldap::client::labs [21:10:29] and I can use this for pw? 
[21:10:29] $proxypass = $passwords::ldap::wmf_cluster::proxypass [21:10:57] for which user? [21:11:21] um, proxyagent? (i'm an ldap noob as well) [21:11:23] when you use role::ldap::client::labs, you'll also need to modify ldapincludes to pass in nss as well [21:11:24] yeah [21:11:30] those go together [21:11:35] its set in the graphite stuff [21:11:41] AuthLDAPBindDN cn=proxyagent,ou=profile,dc=wikimedia,dc=org [21:11:41] AuthLDAPBindPassword <%= proxypass %> [21:11:48] think of it like the mysql user you're connecting with [21:11:53] right ja [21:12:05] our ldap servers require bind to search [21:12:06] which will be querying for the authenticating user sinfo [21:12:14] yes [21:12:29] then when a user is found, it'll bind as that user to check the password [21:12:38] aye k [21:12:41] andrewbogott: one thing about the "docs" generation. i noticed if in actual site.pp there is something like this: node /^(grosley|aluminium)\.wikimedia\.org$/ .. that turns into this in the generated docs: node "grosleyaluminium.wikimedia.org" [21:12:56] ldap is just a database with a strict schema [21:13:02] and a different query language [21:13:02] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [21:13:02] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [21:13:04] aye [21:13:11] PROBLEM - Host cp1031 is DOWN: PING CRITICAL - Packet loss = 100% [21:13:11] so, re tls cert stuff? I need that? [21:13:15] yes [21:13:15] oh that is for nss? 
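The proxyagent arrangement described above is the classic search-then-bind flow: bind with a service account so you may search at all, find the user's entry by cn, then re-bind as that DN to check the password. A stubbed model (no real LDAP; the directory contents and names below are hypothetical):

```python
# dn -> attributes; stands in for the flat ou=people,dc=wikimedia,dc=org tree
DIRECTORY = {
    "cn=Example User,ou=people,dc=wikimedia,dc=org": {"cn": "Example User", "password": "s3cret"},
}

def search_then_bind(cn, password, proxy_bind_ok=True):
    """Step 1: bind as the service account (proxyagent) just to be allowed to search.
    Step 2: search for the entry whose cn matches the login name.
    Step 3: bind as the found DN with the supplied password."""
    if not proxy_bind_ok:
        return False
    for dn, attrs in DIRECTORY.items():
        if attrs["cn"] == cn:
            return attrs["password"] == password
    return False
```

The service account never sees user passwords; it only locates the DN that the user's own bind attempt is then made against.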
[21:13:18] no [21:13:26] ok [21:13:29] I can set this for hue [21:13:30] # Path to certificate for authentication over TLS [21:13:30] ## ldap_cert= [21:13:36] unless you want to send our passwords over the line in the clear, you'll need to use tls [21:13:47] RECOVERY - SSH on cp1031 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [21:13:56] RECOVERY - Host cp1031 is UP: PING OK - Packet loss = 0%, RTA = 26.53 ms [21:13:57] ok, so I need to generate my own cert for this and use that? [21:14:01] no [21:14:03] there's a system one [21:14:05] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:14:06] ah [21:14:14] I honestly don't know why you need to provide one [21:14:17] fucking java apps [21:14:25] lemme see which CA [21:14:42] ah certificates::wmf_ca [21:14:43] ok [21:14:45] no [21:14:48] or [21:14:48] certificates::wmf_labs_ca [21:14:49] ? [21:14:49] not that one [21:14:54] this uses the * cert [21:15:04] Equifax_Secure_CA.pem [21:15:09] <^demon|away> If you use install_certificate{}, it should do all that for you? [21:15:20] ^demon|away: it'll install the certificate and the CA [21:15:27] but it won't configure hue to use it [21:15:32] that's fine [21:15:34] i can find the path [21:15:38] no args to instsall_cert [21:15:38] ? [21:15:45] java apps have a different way to manage certificate trusts [21:15:54] (hue is django, btw) [21:16:12] this is the CA certificate hue wants [21:16:18] for ldap_cert [21:16:23] not the star cert [21:16:36] this is to trust the certificate the ldap server is presenting [21:16:40] <^demon|away> We don't do this with gerrit for ldap? [21:16:46] user -> https -> hue [21:16:52] hue -> ldaps -> virt0 [21:17:04] ^demon|away: it's smart enough to use the system trust [21:17:14] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.035 seconds [21:17:57] do I manage java's trust in the install certificate class? 
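The per-library segfault counts pasted in the log (42 libc, 491 libmemcached, and so on) are what a short tally over the kernel log produces. A sketch of that tally; the log lines below are hypothetical, in the usual kernel segfault format:

```python
import re
from collections import Counter

# Hypothetical kernel log lines of the form such counts come from:
LINES = [
    "Dec  5 19:59:45 srv281 kernel: apache2[24680]: segfault at 0 ip ... in libmemcached.so.10.0.0[...]",
    "Dec  5 20:01:12 srv282 kernel: apache2[10021]: segfault at 8 ip ... in libxml2.so.2.7.8[...]",
    "Dec  5 20:03:40 srv283 kernel: apache2[10400]: segfault at 0 ip ... in libmemcached.so.10.0.0[...]",
]

def tally_segfaults(lines):
    """Count segfault lines per shared object named after 'in'."""
    counts = Counter()
    for line in lines:
        m = re.search(r"segfault at .* in (\S+?)\[", line)
        if m:
            counts[m.group(1)] += 1
    return counts
```

Run over the archive logs, a skew like 491 hits in one library against a few dozen elsewhere points the investigation at that library first.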
[21:17:59] so [21:18:01] ca => "star.wikimedia.org" [21:18:02] ? [21:18:03] I kind of doubt it [21:18:07] ottomata: no [21:18:11] ha [21:18:11] binasher: we do have a lot of new segfaults in libmemcached.so though [21:18:19] star's CA [21:18:26] yay [21:18:30] Equifax_Secure_CA.pem [21:18:37] "star.wikimedia.org" => "Equifax_Secure_CA.pem", [21:18:41] preilly: ^^ re: segfaults in libmemcached [21:18:47] ottomata: wmf-ca.pem from ./files/ssl i believe [21:18:54] mutante: NO [21:18:55] oh. what Ryan says [21:18:58] ouch [21:18:59] heh [21:19:01] ha [21:19:03] you guys are killing me :) [21:19:12] i'm only barely following I think atm [21:19:17] summary: [21:19:29] i need to use define install_certificate to get a cert on the hue server [21:19:29] so, we're going to have two things providing tls/ssl here [21:19:38] hold on. let me explain [21:19:38] then I set ldap_cert= path to cert [21:19:41] ok ok go ahead [21:19:44] you have a web server [21:19:46] 42 libc-2.15.so [21:19:46] 491 libmemcached.so.10.0.0 [21:19:47] 42 libxml2.so.2.7.8 [21:19:47] 104 php5 [21:19:47] it needs https [21:19:56] that's segfaults [21:20:06] ottomata: it's using a wikimedia.org address, right? [21:20:13] is it going to be hue.wikimedia.org or something? [21:20:26] rigiht now, it is at http://hue.analytics.wikimedia.org/, but there is no dns for that [21:20:29] i'm doing it manually [21:20:29] and [21:20:32] we only have one public IP [21:20:35] don't use sub-sub domains [21:20:37] and this is running on a backend server [21:20:46] so i'm using haproxy [21:20:58] whatever, we can change the domain to wahtever we want, [21:21:00] bleh haproxy :D [21:21:11] hah, i like it! so easy and flexible! but whatever [21:21:14] i don't care we can use whatever [21:21:14] so. 
your proxy will need to provide https [21:21:19] i can do that [21:21:28] that will use the star certificate [21:21:29] i can put it back on apache or nginx if I need to [21:21:36] this is why you can't use sub-sub domains [21:21:38] that I can figure out i think (with maybe minimal help) [21:21:40] aye [21:21:41] makes sense [21:21:43] the star cert won't work with sub-sub domains [21:22:11] ok, but that's a separate bit, right? getting ssl to work? [21:22:17] paravoid: thanks, will investigate! (though right now, lunch) [21:22:18] right now i'm just looking at the ldap auth [21:22:22] so, that's one server that's providing ssl [21:22:30] it needs the star cert [21:22:42] the other server *providing* ssl is the ldap server [21:22:56] hue is performing authentication against the ldap server [21:23:02] so it's a consumer of ssl [21:23:09] which means it needs to trust the provider [21:23:20] so, you'll be configuring hue to trust the ldap server [21:23:25] which is why you need the CA certificate [21:23:37] that server doesn't need the star certificate at all [21:23:40] just its CA [21:23:45] oh [21:23:50] I think that may be loaded by defauly [21:23:53] *default [21:24:00] by what? 
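Ryan's split between the two TLS roles, serving https with the star certificate versus consuming ldaps and trusting the server's CA, maps directly onto how a TLS client context is built. A small sketch with Python's ssl module; the pinned-CA line mirrors hue's ldap_cert= setting and is commented out because it needs that file present:

```python
import ssl

# As a TLS *client* (hue talking to ldaps://virt0), you need a trust store,
# not a server certificate of your own.
ctx = ssl.create_default_context()  # loads the system CA bundle (ca-certificates)

# To pin a single CA instead, as hue's ldap_cert= does:
# ctx.load_verify_locations(cafile="/etc/ssl/certs/Equifax_Secure_CA.pem")

# Client-side contexts verify the peer by default:
assert ctx.verify_mode == ssl.CERT_REQUIRED and ctx.check_hostname
```

The star certificate only matters on the side that *presents* a cert (the https proxy); the ldaps consumer needs nothing but the CA.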
[21:24:06] sec [21:24:26] /etc/ssl/certs/Equifax_Secure_CA.pem [21:24:27] indeed it is [21:24:38] ah cool [21:24:40] i do have that [21:24:41] cool [21:24:42] I think it comes with the ca-certificates package [21:24:44] PROBLEM - Host cp1031 is DOWN: PING CRITICAL - Packet loss = 100% [21:24:48] which is installed on all systems [21:24:51] aye [21:25:03] so, you just need to configure hue to use that (likely using the full path) [21:25:12] ok cool, [21:26:29] so, real quick [21:26:34] here are a buncha things i'm about to try [21:26:35] https://gist.github.com/4219662 [21:26:53] use lowercase [21:26:57] for the base dn [21:27:04] k [21:27:14] (that was from hue instructions, but ok) [21:27:16] dns are case insensitive, but it's good practice [21:27:18] *practice [21:27:19] k [21:27:30] active directory uses upper case for their dns [21:28:15] ldap://virt0.wikimedia.org virt1000.wikimedia.org/ou=people,dc=wikimedia,dc=org?cn [21:28:19] ^^ that doesn't look right [21:28:46] hm, ok? [21:29:00] i just replaced from the value that was in graphite confs: [21:29:00] AuthLDAPURL "ldap://nfs1.pmtpa.wmnet nfs2.pmtpa.wmnet/ou=people,dc=wikimedia,dc=org?cn" [21:29:11] yeah, that's for mod_authzldap [21:29:19] New patchset: Dzahn; "require icinga::monitor::packages in icinga::monitor::service because the package needs to create /etc/icinga before stuff can be put there" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37116 [21:29:21] it has a different syntax [21:29:23] hm ok [21:29:26] https://ccp.cloudera.com/display/CDH4DOC/Hue+Installation#HueInstallation-EnablingtheLDAPServerforUserAuthentication [21:29:31] ldap_url [21:29:35] ldap_url=ldap://auth.mycompany.com [21:29:36] soo [21:29:57] ldap_url=ldaps://virt0.wikimedia.org [21:29:59] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [21:29:59] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [21:30:03] hmm,k [21:30:20] 
ldap_url=ldaps://virt0.wikimedia.org ldaps://virt1000.wikimedia.org [21:30:26] RECOVERY - Varnish HTTP upload-backend on cp1031 is OK: HTTP OK HTTP/1.1 200 OK - 632 bytes in 0.055 seconds [21:30:26] RECOVERY - Host cp1031 is UP: PING OK - Packet loss = 0%, RTA = 26.73 ms [21:30:27] ^^ I'd guess it would be like that for multiple [21:30:35] RECOVERY - Varnish HTCP daemon on cp1031 is OK: PROCS OK: 1 process with UID = 997 (varnishhtcpd), args varnishhtcpd worker [21:30:52] New review: Dzahn; "attempt to fix Failed to apply catalog: Could not find dependency File[/etc/icinga/icinga.cfg] for S..." [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/37116 [21:30:53] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37116 [21:31:14] dap_username_pattern="uid=<username>,ou=people,dc=wikimedia,dc=org" [21:31:16] ldap_username_pattern="uid=<username>,ou=people,dc=wikimedia,dc=org" [21:31:26] backend=desktop.auth.backend.LdapBackend [21:31:41] ldap_cert=/etc/ssl/certs/Equifax_Secure_CA.pem [21:32:07] aye [21:32:40] no need for a bind_dn or bind_password [21:32:43] so no proxy agent [21:32:57] it seems that the software only supports direct binds [21:33:02] which is ok because we use a flat tree [21:33:30] uh, yeah? ok will try that [21:34:03] this ldap support is kind of crappy :D [21:34:07] oh yeah? [21:34:28] it'll work [21:34:42] I'm assuming you'll be able to group users and such inside of the software [21:34:47] since it doesn't support ldap groups [21:35:42] mmmmmmm i really hope it does, else i'll probably file a feature request [21:35:57] yeah it does [21:36:09] wait, what doesn't support ldap groups? 
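With direct binds there is no search step at all: hue splices the login name into ldap_username_pattern (its docs use a `<username>` placeholder) and binds as the resulting DN, which only works on a flat tree where every user sits under the same OU. A sketch:

```python
def bind_dn(pattern, username):
    """Build the DN to bind as, by filling hue's <username> placeholder."""
    return pattern.replace("<username>", username)

pattern = "uid=<username>,ou=people,dc=wikimedia,dc=org"
bind_dn(pattern, "ottomata")  # 'uid=ottomata,ou=people,dc=wikimedia,dc=org'
```

No service account is needed because the user's own credentials are the first and only bind.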
[21:36:13] New patchset: Reedy; "Enable WikibaseClient for test2wiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37117 [21:36:22] I don't see support for groups [21:36:32] I see the ability to import groups [21:36:32] there are also these settings [21:36:33] https://gist.github.com/4219728 [21:37:02] that's for import [21:37:06] not for auth [21:37:06] ah, hm ok [21:37:23] import isn't going to work [21:37:30] since it won't have access to the passwords [21:38:25] I wonder if you can use auth and also use group import [21:38:33] that would be nice [21:38:57] heh. no base_dn for groups [21:39:06] it's going to search from the tree all the way down [21:39:08] so... [21:39:18] let's see... [21:40:47] I have a strong feeling group import won't work [21:40:58] group_filter="objectclass=groupofnames" [21:41:16] group_name_attr=cn [21:41:40] should I set that? [21:41:43] unless it's going to match the user's dn with the group [21:41:47] if you want to try group import, yes [21:41:48] 'groupofnames'? [21:41:58] I'd get auth working before trying to get the group stuff working [21:42:02] hm, right now i'm just trying to see if it will use ldap for general auth [21:42:02] yeah [21:46:57] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37117 [21:47:34] New patchset: Andrew Bogott; "Create a link in / to /srv/org/wikimedia/doc/puppetsource" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37118 [21:48:44] mutante: Any gues about a better way to accomplish ^ ? I'm pretty unhappy about dropping a random link into / [21:49:56] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:49:58] why is there a need to have /puppetsource? 
[21:50:53] Ryan_Lane: Note the absolute paths that litter these docs: http://doc.wikimedia.org/puppet/ [21:51:23] !log reedy synchronized wmf-config/InitialiseSettings.php 'Enable wikibaseclient on test2wiki' [21:51:23] ah. it's to get rid of the long paths? [21:51:25] Those absolute paths don't correspond to the actual web paths that I want in the docs. Pointing the doc script at that symlink is a hack to get the paths the way I want them. [21:51:28] Yeah. [21:51:32] Logged the message, Master [21:51:38] And it gets paths that are relative to the website so that links work. [21:51:42] (or, so the theory goes :) ) [21:52:00] Alternatives are… sedding the entire site after each generation, or modifying the doc script. [21:52:22] I like the 'sed' option except that I think the doc script is doing incremental updates and it might be confounded by that. [21:53:07] I'm not a fan of the sed option :) [21:53:22] fixing the script seems like the best option [21:53:23] !log authdns-update for new misc servers [21:53:31] Logged the message, RobH [21:53:59] Yeah, probably. Have you hacked on the puppet tools? Do they welcome patches? [21:54:18] it's also possible to alias in the long path [21:55:11] That would fix links but not the ugly display right? [21:55:16] Alias /puppet/files/srv/org/wikimedia/doc/puppetsource/ /srv/org/wikimedia/doc/puppetsource/ [21:55:19] indeed [21:55:19] New patchset: awjrichards; "Enabling mobile beta photo uploads to commons for testing on testwiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37120 [21:55:52] hm, maybe that's fine. [21:56:10] Change merged: MaxSem; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37120 [21:56:36] hm. I think I'm going to try to upgrade labsconsole today [21:56:53] great! 
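The Alias workaround mutante pastes above would sit in the doc site's Apache config roughly as follows (a sketch; the paths are from the log, the surrounding context is assumed). As noted in the discussion, it fixes the broken links but not the long paths displayed in the page text:

```apache
# Map the literal filesystem path that the doc generator embeds in its
# links back onto the real directory, so generated links resolve.
Alias /puppet/files/srv/org/wikimedia/doc/puppetsource/ /srv/org/wikimedia/doc/puppetsource/
```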
[21:58:02] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 2.961 seconds [21:58:34] new smw is out [21:59:14] ah, Ryan_Lane, so, nss [21:59:20] I need to just include that role on this machine? [21:59:32] role::ldap::client::labs [21:59:45] read the role [21:59:47] it's missing nss [21:59:51] for production [21:59:58] lesliecarr: rt4026 Layer42 OOB connection. did you get a time frame on when that will be dropped to our cage? [22:00:02] you need to call it with nss added [22:00:27] cmjohnson1: so i realized yesterday i hadn't ordered the x-connect [22:00:29] so ordering it now [22:00:41] with the same ones it comes with? [22:00:42] so [22:00:48] okay..sounds good [22:00:50] ldapincludes => ['openldap', 'utils', 'nss'] [22:00:51] ? [22:00:52] yep [22:00:55] k [22:01:12] RoanKattouw: that patch looks ok, but you need to do more to get ganglia actually working [22:01:19] you need to set hosts as aggregators [22:01:24] Ugh [22:01:29] and add them to a hash in ganglia.pp [22:01:34] lemme find line number [22:01:38] RECOVERY - Varnish traffic logger on cp1031 is OK: PROCS OK: 3 processes with command name varnishncsa [22:01:38] RECOVERY - Varnish HTTP upload-frontend on cp1031 is OK: HTTP OK HTTP/1.1 200 OK - 641 bytes in 0.054 seconds [22:01:40] it's only like 2 more lines [22:01:43] * RoanKattouw stabs ganglia [22:01:50] and at least you're not trying to hunt the problem down :) [22:01:52] that'd be hella annoying [22:02:32] RoanKattouw: $ganglia_clusters [22:02:33] and [22:02:43] New patchset: Reedy; "Fixing AFTv5 notice" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37122 [22:02:48] $data_sources [22:02:59] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37122 [22:03:05] pretty straight forward.
lemme know if you need any halp [22:04:45] !log awjrichards synchronized wmf-config/InitialiseSettings.php 'Enable mobile beta photo uploads to commons from testwiki' [22:04:53] Logged the message, Master [22:05:11] notpeter: What is ip_oct? [22:05:25] it's for the multicast address [22:05:30] just choose one that's not already in use [22:05:43] KO [22:05:52] KNOCKOUT PUNCH [22:06:04] haha [22:06:14] And I have to choose a host within the cluster to be the aggregator? Blegh [22:06:17] * RoanKattouw stabs Ganglia [22:07:27] New patchset: RobH; "adding misc servers kuo mexia lardner tola" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37124 [22:07:56] yes [22:08:06] one per DC, ideally [22:08:17] New patchset: Ottomata; "including ldap labs role on an27" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37125 [22:08:41] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37125 [22:08:42] RoanKattouw: come on, this is sysadmin level of programming: cut, paste, fill in the blanks ;) [22:08:46] cmjohnson1: wanna review that ^ ? [22:08:58] k [22:09:14] * AaronSchulz lols at https://bugs.launchpad.net/ubuntu/+source/network-manager/+bug/989900/comments/51 [22:10:34] <^demon|away> AaronSchulz: Ubuntu broke Comcast for millions of users? I must've gotten lucky ;-) [22:10:37] New review: Andrew Bogott; "There may be a better way..." [operations/puppet] (production); V: 0 C: -1; - https://gerrit.wikimedia.org/r/37118 [22:10:41] notpeter: :) That's gotten me in trouble before, e.g. wtp1001.pmpta.wmnet ) [22:10:47] New patchset: Catrope; "Parsoid puppetization changes:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36670 [22:10:59] cmjohnson1: so, I'm having no luck getting link on these mc boxes [22:11:06] let me know if you have time to take a look [22:11:11] RoanKattouw: lulz. fair enough [22:11:29] paravoid: give up yet?
:D [22:12:11] robh: looks okay...were there space issues with the lvs entries? [22:12:21] Ryan_Lane, in the nss confs installed by that class [22:12:22] yea was spaces not tab [22:12:23] i see [22:12:24] so i fixed it [22:12:26] /etc/nslcd.conf [22:12:30] cool [22:12:30] notpeter: Amended [22:12:36] i see [22:12:36] +uri ldap://virt1000.wikimedia.org:389 ldap://virt0.wikimedia.org:389 [22:12:38] should that be my url? [22:12:41] not ldaps [22:12:43] ? [22:13:52] RoanKattouw: you have extra / in site.pp [22:13:59] unless that's some syntax that I don't know about [22:14:10] er, 2 extra [22:14:32] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37124 [22:14:49] ottomata: yes, it should [22:14:56] ottomata: nslcd is configured to use start_tls [22:15:01] notpeter: Oops [22:15:08] AaronSchulz: any thoughts about the 4 million job queue of frwiki? [22:15:08] notpeter: Regex -> string conversion snafu [22:15:17] (as long as I'm still vaguely awake) [22:15:21] ldaps is ldap+ssl on another port [22:15:36] ldap with start_tls is on the normal ldap port [22:15:45] RoanKattouw: that's what I was guessing [22:15:52] New patchset: Catrope; "Parsoid puppetization changes:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36670 [22:15:54] no idea why http with start_tls never took off [22:16:12] hm, ok, so the ldaps in my hue conf is correct [22:16:16] ? [22:16:27] yes [22:16:33] also, are there logs on virt0 somewhere I can look at to see if hue is actually attempting to authenticate? [22:16:45] RoanKattouw: looks good! although needs rebase. wamp wamp [22:16:48] <^demon|away> Ryan_Lane: "Before STARTTLS was well established, a number of TCP ports were defined for well known protocols which established SSL security first, and then presented a communication stream identical to the old un-encrypted protocol."
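The "call it with nss added" fix from earlier might look like this in site.pp (a sketch: the role name and the `ldapincludes` value are straight from the log; treating the role as a parameterized class, and the node name `analytics1027.eqiad.wmnet` for "an27", are assumptions):

```puppet
# Include the labs LDAP client role with nss added, so the nss-related
# packages/config are managed alongside openldap and the utils.
node 'analytics1027.eqiad.wmnet' {
    class { 'role::ldap::client::labs':
        ldapincludes => ['openldap', 'utils', 'nss'],
    }
}
```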
[22:16:48] yeah, but it'll be spammy as hell [22:16:51] notpeter: On it [22:16:58] <^demon|away> So the answer is "Stuff already worked, so people didn't change" [22:17:03] RoanKattouw: do you have +2 or would you like me to? [22:17:08] ok, maybe I can grep for hostname or something, where is it? [22:17:10] HA [22:17:13] that's the same reason ldaps still exist, though it's deprecated [22:17:20] ottomata: you can grep by ip [22:17:26] I'm conflicting with RobH having already added my new boxes. [22:17:32] ottomata: but… it only lists it once, then it lists by connection id [22:17:36] heh, awesome :) [22:17:39] which is why this is a pain [22:17:39] best. conflict. ever. [22:17:42] I am actually happy with a conflicting change [22:17:44] yeah srsly [22:17:46] I should really feed these logs into a sysem [22:17:51] *system [22:17:59] RoanKattouw: ? [22:17:59] RobH: Thank you for your change! It conflicted with mine but that's OK cause I get my shiny boxes :) [22:18:06] oh, git conflict [22:18:09] haha, i win! [22:18:12] im merged ;] [22:18:14] Yes, you do [22:18:23] And this is the one time I'm actually happy with a conflict [22:18:29] ottomata: it's at /var/opendj/instance/logs/access [22:18:59] hm yeah i see what you mean [22:19:01] not much to grep on there [22:19:03] RobH: What is the up-ness of those boxes? [22:19:09] New patchset: Catrope; "Parsoid puppetization changes:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36670 [22:19:18] i am just now working on spinning them up and OS installed. [22:19:25] Excellent [22:19:27] notpeter: Amended [22:19:34] they are hardware raid, the new misc boxes [22:19:42] so have to confirm its raided before i spin installer.
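The two-step search Ryan describes (the client IP appears only on the line that opens the connection; everything after that is tagged only with a connection id) can be sketched in Python. The log-line format below is invented for illustration, not OpenDJ's actual access-log syntax:

```python
import re

def lines_for_client(log_lines, client_ip):
    """First pass: collect conn ids from lines mentioning the client IP.
    Second pass: return every line tagged with one of those conn ids."""
    conn_ids = set()
    for line in log_lines:
        if client_ip in line:
            m = re.search(r'conn=(\d+)', line)
            if m:
                conn_ids.add(m.group(1))
    return [l for l in log_lines
            if (m := re.search(r'conn=(\d+)', l)) and m.group(1) in conn_ids]

sample = [
    'CONNECT conn=7 from=10.64.21.27:44321',  # hypothetical log format
    'BIND conn=7 dn="uid=otto,ou=people,dc=wikimedia,dc=org"',
    'CONNECT conn=8 from=10.0.0.9:51000',
    'SEARCH conn=8 base="ou=people,dc=wikimedia,dc=org"',
    'UNBIND conn=7',
]
print(lines_for_client(sample, '10.64.21.27'))
```

This mirrors the manual workflow: grep for the IP to learn the conn id, then grep again for `conn=N`.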
[22:19:43] robh: did you have to enable the intel sfps [22:19:45] By the time that's done they should have Ganglia on them as well with that change I just rebased [22:19:54] cmjohnson1: there is no enabling no [22:20:05] apergos: hmm, 50% of those are for like 7 templates [22:20:11] ottomata: well, you grep by ip [22:20:16] why jenkins no do verified? [22:20:18] ottomata: find the connection id, then grep by that [22:20:35] I saw on enwiki something similar, where bots kept changing those pages several (even dozens) of times per day [22:21:11] well the oldest job in there is the same as it was yesterday mid day [22:21:19] aye ok, i think hue is not communicating properly [22:21:22] i don't see anything [22:21:23] although I saw some jobs processed in the log [22:21:27] for fr wiki I mean [22:21:29] and all the users in hue are the ones we have previously manually added [22:21:45] all refreshLinks2 [22:21:47] i gotta run real soon though [22:21:56] yep [22:22:15] Ryan_Lane, would you have a min today to look into this for me? [22:22:19] s'ok if not [22:22:31] cmjohnson1: are you getting a link light? [22:22:34] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36670 [22:22:35] I thought the refreshLinks jobs got processed first in first out but maybe that's wrong? [22:22:41] notpeter: robh: no link light [22:22:47] apergos: they are all random [22:22:48] ottomata: sure [22:22:50] hue is on analytics1027.eqiad.wmnet [22:22:58] oh. random out of 4 million, nice [22:23:02] cmjohnson1: So, we have a few servers that we know are working. [22:23:11] if notpeter has no issues, use the parts for 1015 [22:23:11] cmjohnson1: so I'm not crazy! [22:23:12] er [22:23:14] and try to see if it works [22:23:15] well....
[22:23:18] random per wiki seems a bit odd [22:23:23] dont confuse them [22:23:32] but try the fiber for 1015 (known good) for 1002 [22:23:32] and, there are access instructions for kraken stuff in general here: [22:23:32] yeha, none of these are in service yet [22:23:33] https://www.mediawiki.org/wiki/Analytics/Kraken/Access [22:23:41] if fiber doesnt fix, try intel sfp [22:23:50] just localize the fault with the known good stuff [22:23:59] but yea, there is no flashing of the intel sfp or anything [22:24:11] it just plugs in and works [22:24:32] okay, i am going to pull 1015 and plug into 1002 [22:24:44] ok. feel free to shut it down [22:24:51] if its just left online is that ok? [22:24:57] motherlover. I broken puppet. sorry [22:24:59] it just wont be in sync for a few minutes [22:25:18] as long as you didnt break my installer or apt repo ;] [22:25:59] apergos: http://fr.wikipedia.org/w/index.php?title=Mod%C3%A8le:Wikiprojet/cat%C3%A9gorisation&diff=86071362&oldid=85707076 [22:26:05] such an innocent change :) [22:26:22] yeah, real innocent :-D [22:26:26] New patchset: Pyoungmeister; "comma" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37126 [22:27:01] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37126 [22:27:21] why is jenkins not giving verified? [22:27:32] notpeter: The Jenkins workflow changed [22:27:42] I didn't know hashar had applied this to operations/puppet too [22:27:44] New patchset: Dzahn; "icinga - try to fix order of classes being applied by using arrow syntax" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37127 [22:27:55] he did? [22:28:13] it should give you a link to integration.mw though [22:28:15] notpeter: The new workflow is 1) Jenkins does CR+1 on submission, 2) a reviewer approves with CR+2 , 3) Jenkins runs again, sets V+1, submits [22:28:21] well.... 
can we get it to automatically execute all the checkingz for ops employees on the puppet repo at least? [22:28:28] ....eww, i just self verified [22:28:29] heh [22:28:30] a;lskjfda;lkjdsfa;lskjfd [22:28:37] Yeah there is talk of whitelisting staff and other trusted users [22:28:38] RoanKattouw: see, i broke it already. [22:28:42] Yup [22:28:44] me too ;) [22:28:56] I took away the right to V+1 for all humans in extensions/VisualEditor [22:28:56] well, technically, roan's code was borken... [22:28:57] !log awjrichards synchronized php-1.21wmf5/extensions/MobileFrontend/ 'Bug fixes for MobileFrontend' [22:28:58] ugh [22:29:05] Logged the message, Master [22:29:05] ... [22:29:11] well, ok [22:29:13] New review: Dzahn; "verified :p" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/37127 [22:29:13] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37127 [22:29:22] goes from +2 -> verified [22:29:24] Ugh, whoops, sorry about that not [22:29:24] RoanKattouw: So after I did +2 [22:29:26] *notpeter [22:29:26] sure, ican do this [22:29:27] i should have waited? [22:29:43] RoanKattouw: no prob, I was the cowboy who deployed it ;) [22:29:45] RobH: Yes, in theory Jenkins should take the +2 as a trigger to go and verify your change [22:29:56] and it would have auto reviewed and merged? [22:29:58] For ext/VE this is very fast, like, you refresh the page and it has already happened [22:29:59] Yes [22:30:06] ahh, it didnt do it fast enough for me! [22:30:09] puppet lint might take a bit longer [22:30:10] heh [22:30:15] ok, I can work with this. I didn't know that this was going to apply to ops as well [22:30:18] cool! [22:30:19] thanks, RoanKattouw [22:30:21] well, since its still automated, i will simply adjust. 
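The new workflow Roan lays out (Jenkins votes CR+1 on submission, a human approves with CR+2, then Jenkins runs again, votes V+1, and submits) can be modeled as a toy state machine. Everything here — the function, the event tuples, the actor names other than jenkins-bot — is invented for illustration, not Gerrit's API:

```python
def change_state(events):
    """Toy model: a change only merges once a human CR+2 has been
    followed by a jenkins-bot V+1, matching the flow described above."""
    approved = False
    verified = False
    for actor, vote in events:
        if vote == 'CR+2' and actor != 'jenkins-bot':
            approved = True
        elif vote == 'V+1' and actor == 'jenkins-bot' and approved:
            verified = True   # V+1 only counts after approval triggers the run
    return 'MERGED' if (approved and verified) else 'OPEN'

print(change_state([('jenkins-bot', 'CR+1'),
                    ('notpeter', 'CR+2'),
                    ('jenkins-bot', 'V+1')]))
```

It also captures the failure mode in the log: if the post-approval Jenkins run never happens, the change stays open until someone manually votes V+1.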
[22:30:25] thanks for explaining RoanKattouw [22:30:31] But ext/VE is just JS lint on not /that/ much code, it takes like 5s [22:30:31] you can still merge if the verification failed [22:30:36] happened the other day [22:30:40] Yes, you can vote V+1 yourself [22:30:44] indeed, i did. [22:31:00] Which is why in the VE repo, I restricted V+1 to jenkins-bot (and l10n-bot after it broke, oops) [22:31:29] RobH: yea, but it did not show a broken validation either https://integration.mediawiki.org/ci/view/All-enabled/job/operations-puppet/ [22:31:49] i just didnt give it enough time to get back around to reviewing me again. [22:31:59] ah [22:32:14] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:32:24] was +2 and merge... what you mean its not verified, i saw it said ok, so self verify! [22:32:25] heh [22:32:40] 'i know this code is perfect i wrote it' [22:32:40] heh [22:32:49] RobH: I said that once [22:33:02] I ended up being taken up on my promise to buy the entire team lunch if they found a bug [22:33:57] :-D [22:34:24] RoanKattouw: Ok, you have two choices. I am finished with the base OS install. [22:34:32] I can sign the puppet certs for you and get it going [22:34:44] or if you are making changes and crap and rather i leave them unsigned, i can do that instead [22:34:57] puppet run i assume, but best to ask. 
[22:35:29] Go ahead and run puppet on em [22:35:46] I don't have any more puppet changes to make [22:35:55] ok, will do, you'll have them shortly [22:36:00] New patchset: Jgreen; "adjusting time for fundraising dump runs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37130 [22:36:02] Excellent [22:36:35] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37130 [22:37:53] !log attempting upgrade of mediawiki on labsconsole to wmf/1.21wmf4 [22:38:02] !log running update.php on labsconsole [22:38:02] Logged the message, Master [22:38:10] Logged the message, Master [22:47:16] !log upgraded GeoData tables on testwiki by recreation [22:47:25] Logged the message, Master [22:47:36] binasher: does graphite account for the 1/50 sampling factor? [22:48:06] RobH: Do you have a way to let me know when the puppet runs are done (other than checking manually)? [22:48:19] they are being reallllly slow [22:48:41] Ya srsly [22:48:42] well, once it runs successfully for these, they will start showing in ganglia [22:49:02] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [22:49:07] When Leslie and I were doing the LVS stuff, she just went into sockpuppet and started killing stuff, because we needed the puppet runs on lvs* to just finish already :) [22:49:08] New patchset: MaxSem; "Switch GeoData to Solr schema in data collection mode" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37133 [22:49:35] RoanKattouw: err: Failed to apply catalog: Could not find dependency Exec[parsoid-npm-install] for Service[parsoid] at /var/lib/git/operations/puppet/manifests/misc/parsoid.pp:38 [22:49:37] Change merged: MaxSem; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37133 [22:49:50] you have broken parsoid stuff it seems. [22:49:56] or perhaps not, this your change? 
[22:50:00] Argh [22:50:02] That's me [22:50:04] Fixing [22:50:53] New patchset: Catrope; "Fix reference to removed resource" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37134 [22:51:02] There [22:51:33] well, merged to gerrit, not on sockpuppet i assume [22:51:48] RoanKattouw: right? [22:51:53] ? [22:51:53] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.028 seconds [22:52:04] RobH: No, it was a legitimate error, the patchset above fixes it [22:52:04] you just merged a change in gerrit only right [22:52:08] No [22:52:14] I cannot merge things in operations/puppet [22:52:28] ahhh, i need to [22:52:29] ok [22:52:35] puppet has totally gone to sleep :( stupid stafford [22:52:45] I mean I'm a Gerrit admin so theoretically I could give myself merge perms and then do it... but I get the feeling that wouldn't go over too well [22:52:58] :) [22:53:03] New review: RobH; "yay a fix" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/37134 [22:53:14] RoanKattouw: so now gerrit will come get it and merge it? [22:53:27] this is annoying for operations where we then after it gets around to it, go merge on sockpuppet. [22:53:32] Jenkins should [22:53:43] this is non ideal, i should have said something in the email thread [22:53:51] but i assumed it wouldnt change operations workflow. [22:54:04] There isn't really a reason for puppet to have this behavior [22:54:14] RoanKattouw: it will spam channel when it does? [22:54:20] The tests run on submission and on approval aren't different [22:54:23] It should, I think [22:54:46] so can operations not fall into this new system?
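A minimal sketch of the failure mode behind the earlier `Could not find dependency Exec[parsoid-npm-install] for Service[parsoid]` error: the resource titles come from the log, while the resource bodies are assumed for illustration. Catalog compilation fails because the service still references an Exec that was removed:

```puppet
# Removed in an earlier change:
# exec { 'parsoid-npm-install': ... }

service { 'parsoid':
    ensure  => running,
    # This reference must be deleted (or repointed) along with the Exec,
    # otherwise the agent fails with "Could not find dependency
    # Exec[parsoid-npm-install] for Service[parsoid]".
    require => Exec['parsoid-npm-install'],
}
```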
[22:54:50] and use the old review, heh [22:55:12] Hmm, looks like it's broken [22:55:17] Should complain to hashar when he wakes up [22:55:26] RobH: Please manually V+1 it [22:55:29] so should i just verify it myself, ok [22:55:47] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37134 [22:55:56] New patchset: MaxSem; "Fix module disable" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37140 [22:56:13] Change merged: MaxSem; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37140 [22:57:13] RoanKattouw: ok, merged it on sockpuppet and re-running puppet on the 4 new installs [22:57:42] !log upgraded labsconsole successfully [22:57:51] Logged the message, Master [22:58:03] RobH: Notified ops-l and hashar of the breakage [22:58:16] I'm really glad I hardly use any non-wmf extensions [23:00:13] Heh [23:00:19] hashar has now gotten that message 3 times [23:00:29] Because I sent it to ops-l@wm.o, then ops-l@lists.wm.o , then ops@lists.wm.o [23:00:41] RoanKattouw: Ok, servers are all yours, im resolving the ticket for the deployment (the procurement ticket stays open till we get replacements in) [23:01:33] Excellent [23:01:37] Thanks man [23:01:41] welcome =] [23:06:12] New patchset: Dzahn; "icinga - remove requirement for config in service class (debug)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37142 [23:06:35] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37142 [23:13:50] New patchset: MaxSem; "Now disable geosearch for real..." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37145 [23:14:39] Reedy: does php in cli mode log to fatal.log/exception.log?
[23:14:46] Change merged: MaxSem; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37145 [23:15:06] seems to [23:23:47] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:26:47] AaronSchulz: you need to account for the sampling factor, i just feed to graphite as i collect it, and things are collected per minute, so if you scale by 0.833, you get per second accounting for the 2% sampling [23:26:50] i.e. https://graphite.wikimedia.org/render?from=-9month&width=1280&height=600&target=cactiStyle%28scale%28-total.count%2C0.8333%29%29&uniq=0.7425310740537836&title=php.reqs.per.sec [23:32:24] !log maxsem synchronized php-1.21wmf5/extensions/GeoData/ [23:32:32] Logged the message, Master [23:32:38] PROBLEM - Puppet freshness on ms-be3 is CRITICAL: Puppet has not run in the last 10 hours [23:33:44] !log maxsem synchronized wmf-config/CommonSettings.php 'GeoData tweaks' [23:33:53] Logged the message, Master [23:36:24] New patchset: Dzahn; "create basic structure for turning icinga into a module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37153 [23:37:23] !log shutting down mysql on db1043 for upgrade and testing [23:37:32] Logged the message, Master [23:38:29] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 4.761 seconds [23:42:05] PROBLEM - mysqld processes on db1043 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [23:42:34] !log creating GeoData tables in all wikipedias [23:42:42] Logged the message, Master [23:45:21] !log created GeoData tables in all wikivoyages [23:45:30] Logged the message, Master [23:46:00] New patchset: Demon; "Tweak hooks so jenkins-bot is reported, just not on comments" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37155 [23:49:08] PROBLEM - MySQL Idle Transactions on db1043 is CRITICAL: Connection refused by host [23:49:08] PROBLEM - MySQL disk space on db1043 is 
CRITICAL: Connection refused by host [23:49:08] PROBLEM - MySQL Recent Restart on db1043 is CRITICAL: Connection refused by host [23:49:17] PROBLEM - MySQL Replication Heartbeat on db1043 is CRITICAL: Connection refused by host [23:49:35] PROBLEM - MySQL Slave Delay on db1043 is CRITICAL: Connection refused by host [23:49:44] PROBLEM - Full LVS Snapshot on db1043 is CRITICAL: Connection refused by host [23:50:02] PROBLEM - MySQL Slave Running on db1043 is CRITICAL: Connection refused by host [23:50:48] notpeter: Hmm [23:51:13] notpeter: If I'm to use bast1001 for Parsoid deployments, I'll need certain packages there. Like node and npm. Or git (!) [23:51:27] Will submit a change [23:51:44] if this truly is all going to be temporary, you can also just do it by hand [23:51:47] up to you [23:52:50] Right [23:53:02] I think technically we don't want those things there anyway [23:53:17] Do I have your blessing to just go ahead and install git-core and npm manually? [23:54:24] notpeter: ? [23:54:29] New patchset: MaxSem; "Test enable GeoData on enwiki in collection mode" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37157 [23:55:06] Change merged: MaxSem; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37157 [23:55:44] RECOVERY - MySQL Idle Transactions on db1043 is OK: OK longest blocking idle transaction sleeps for seconds [23:55:53] RECOVERY - MySQL Slave Delay on db1043 is OK: OK replication delay seconds [23:56:11] RECOVERY - Full LVS Snapshot on db1043 is OK: OK no full LVM snapshot volumes [23:56:16] !log upgrading labsconsole to wmf/1.21wmf6 [23:56:22] * Ryan_Lane grumbles [23:56:24] Logged the message, Master [23:56:29] RECOVERY - MySQL Slave Running on db1043 is OK: OK replication [23:57:05] RECOVERY - MySQL disk space on db1043 is OK: DISK OK [23:57:05] RECOVERY - MySQL Recent Restart on db1043 is OK: OK seconds since restart [23:57:14] RECOVERY - MySQL Replication Heartbeat on db1043 is OK: OK replication delay
seconds [23:58:28] RoanKattouw: sure [23:59:00] bah. that branch doesn't even exist yet [23:59:06] !log make that upgrading labsconsole to wmf/1.21wmf5 [23:59:08] !log Temporarily installing git-core and npm on bast1001 by hand with Peter's blessing [23:59:14] Logged the message, Master [23:59:24] Logged the message, Mr. Obvious [23:59:45] !log and build-essentials too [23:59:53] Logged the message, Mr. Obvious [23:59:57] !log *build-essential
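Returning to the graphite discussion earlier (binasher at 23:26): the 0.8333 scale factor falls out of undoing the 1-in-50 (2%) sampling and converting per-minute buckets to per-second. A quick sketch of the arithmetic (the function name is illustrative):

```python
def sampled_to_per_sec(count_per_min, sample_divisor=50):
    """Estimate the true requests-per-second rate from a per-minute
    counter of 1-in-50 sampled events: multiply by 50 to undo the
    sampling, divide by 60 seconds. 50/60 = 0.8333..., the scale
    factor quoted in the log."""
    return count_per_min * sample_divisor / 60

# 1200 sampled hits in one minute ~= 60000 real hits ~= 1000 req/s
print(sampled_to_per_sec(1200))  # -> 1000.0
```

In graphite itself this is what `scale(series, 0.8333)` applies to the raw per-minute counts, as in the URL above.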