[00:00:08] can I request a varnish cache flush for mobile. We have some cached html which is stopping some important javascript loading causing quite a serious bug - 42749 [00:01:10] awjr asked half an hour or so [00:01:20] I think his ops bribe offer wasn't high enough [00:01:26] haha [00:01:29] I've just asked Ryan Lane [00:01:30] two whiskeys then? [00:01:37] Ryan_Lane: i'll get you a whiskey [00:01:56] Ryan_Lane: I offer 2 whiskeys [00:02:04] offer accepted from all of you [00:02:05] it's done [00:02:06] lol [00:02:09] hahaha [00:02:24] that's 5 whiskeys [00:02:26] confirmed fixed [00:02:31] thanks [00:02:47] Have you purged the bits varnish caches tooo? [00:02:53] Ryan_Lane: you flushed non-mobile cache as well? [00:02:57] oh, that's needed too? [00:03:03] I'm not sure that's a great idea [00:03:11] possibly, per bug 42452 [00:03:30] From what m ark has said before, it only takes a few minutes for them to be repopulated (due to the amount of data)... [00:03:42] So as long as all of them weren't done at once [00:03:51] ok [00:03:53] gimme a sec [00:03:54] New patchset: Dzahn; "add redirect for wikivoyage.net and save a few code lines by using a regex for .com, .de and .net" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/37345 [00:04:50] Change merged: Dzahn; [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/37345 [00:06:31] I'm flushing it now [00:06:33] eqiad first [00:08:27] thanks, let me know when they are all flushed and I'll ping the people who are complaining :) [00:11:06] kaldari: done [00:11:15] yay [00:11:58] Haha, nice [00:12:11] There's no sign of any changes on the ganglia graphs [00:12:41] Reedy: those caches refill so quickly [00:12:45] only slightly [00:12:49] on the app servers [00:12:50] Yup [00:12:52] http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&s=by+name&c=Bits%2520application%2520servers%2520pmtpa&tab=m&vn= [00:13:05] ah, more noticeable there [00:13:14] eqiad isn't [00:15:18] RECOVERY - 
Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.071 seconds [00:20:31] New patchset: Pyoungmeister; "coredb monitoring: remove uncalled for conditional" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37347 [00:20:53] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37347 [00:21:07] New patchset: Dzahn; "wikivoyage.[com|de|net] redirects - break out into seprate rules again, can just summarize the first part, second part would not work like this" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/37348 [00:21:45] New patchset: Dzahn; "wikivoyage.[com|de|net] redirects - break out into seprate rules again, can just summarize the first part, second part would not work like this" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/37348 [00:23:31] New review: Dzahn; "testing 12 urls on 1 servers, totalling 12 requests" [operations/apache-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/37348 [00:23:33] New patchset: Pyoungmeister; "add name for mysql user" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37349 [00:23:38] New review: Dzahn; "testing 12 urls on 1 servers, totalling 12 requests" [operations/apache-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/37348 [00:23:38] Change merged: Dzahn; [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/37348 [00:23:55] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37349 [00:27:03] dzahn is doing a graceful restart of all apaches [00:27:24] !log dzahn gracefulled all apaches [00:27:33] Logged the message, Master [00:31:38] New patchset: Pyoungmeister; "swapping roles on db61 for testing" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37351 [00:31:46] kaldari: did that fix the problem, btw? 
[00:32:06] Ryan_Lane: no idea, but we'll see if anyone else complains of the problem [00:32:09] !log DNS update - adding wikivoyage.net (link to .org) [00:32:12] heh [00:32:12] ok [00:32:17] Logged the message, Master [00:32:21] Ryan_Lane: I can't reproduce it myself [00:33:09] hooray for whiskey! [00:33:12] It seemed to have been affecting a large number of people on multiple wikis though [00:33:17] * Ryan_Lane nods [00:33:28] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37351 [00:33:45] I like to support the Tennessee economy when I can [00:34:11] :D [00:38:15] PROBLEM - swift-object-auditor on ms-be7 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:38:24] PROBLEM - swift-container-auditor on ms-be7 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:38:42] PROBLEM - swift-account-replicator on ms-be7 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:38:42] PROBLEM - swift-object-updater on ms-be7 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:38:42] PROBLEM - swift-account-reaper on ms-be7 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:38:51] PROBLEM - swift-container-updater on ms-be7 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:39:27] PROBLEM - swift-object-replicator on ms-be7 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:39:27] PROBLEM - swift-account-server on ms-be7 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:39:27] PROBLEM - swift-container-server on ms-be7 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:40:30] PROBLEM - swift-object-server on ms-be7 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:40:31] PROBLEM - SSH on ms-be7 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:40:48] PROBLEM - swift-account-auditor on ms-be7 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[00:40:57] PROBLEM - swift-container-replicator on ms-be7 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:42:27] RECOVERY - swift-container-replicator on ms-be7 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [00:42:36] RECOVERY - swift-container-server on ms-be7 is OK: PROCS OK: 13 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [00:42:36] RECOVERY - swift-object-replicator on ms-be7 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [00:42:36] RECOVERY - swift-account-server on ms-be7 is OK: PROCS OK: 13 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [00:43:03] RECOVERY - swift-object-auditor on ms-be7 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [00:43:12] RECOVERY - swift-container-auditor on ms-be7 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [00:43:31] RECOVERY - swift-account-reaper on ms-be7 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [00:43:31] RECOVERY - swift-account-replicator on ms-be7 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [00:43:31] RECOVERY - swift-object-updater on ms-be7 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [00:43:39] RECOVERY - SSH on ms-be7 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [00:43:39] RECOVERY - swift-container-updater on ms-be7 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater [00:43:39] RECOVERY - swift-object-server on ms-be7 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [00:43:57] RECOVERY - swift-account-auditor on ms-be7 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [00:45:22] oh 
yay [00:45:30] our first broken disk with the new systems [00:46:00] New patchset: Kaldari; "Re-enabling wgAllowCopyUploads for Commons for experimental Flickr uploading, see bug 20512." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37353 [00:46:14] iirc the devs/other operators say that swift is particularly good about detecting failing disks early [00:46:30] or error counts in some log are or something [00:47:42] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:48:56] Change merged: Kaldari; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37353 [00:54:27] !log reprepro changes: add precise-wikimedia deb universe amd64 mariadb-server-5.5 5.5.28-mariadb-wmf201212041~precise [00:54:35] Logged the message, Master [00:56:13] !log kaldari synchronized wmf-config/InitialiseSettings.php 'turning on experimental Flickr uploading for admins on Commons' [00:56:22] Logged the message, Master [01:03:08] !log reedy synchronized php-1.21wmf5/extensions/ParserFunctions/Expr.php [01:03:16] Logged the message, Master [01:04:03] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.103 seconds [01:15:47] binasher: where are you planning to use maria? 
[01:36:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:47:10] New patchset: Asher; "mariadb testing" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37365 [01:51:09] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 6.586 seconds [02:00:27] PROBLEM - Puppet freshness on analytics1007 is CRITICAL: Puppet has not run in the last 10 hours [02:00:27] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [02:00:28] PROBLEM - Puppet freshness on sq48 is CRITICAL: Puppet has not run in the last 10 hours [02:24:07] !log LocalisationUpdate completed (1.21wmf5) at Fri Dec 7 02:24:07 UTC 2012 [02:24:17] Logged the message, Master [02:25:30] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:33:27] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [02:38:33] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.067 seconds [03:16:30] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [03:16:30] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [03:33:27] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [03:33:27] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [03:33:38] New patchset: Catrope; "Another pmpta typo" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37371 [03:40:35] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37371 [03:48:34] New patchset: Catrope; "Try using check_http_on_port instead of check_lvs_http_on_port for Parsoid monitoring" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37375 [04:13:24] Change merged: Asher; [operations/puppet] 
(production) - https://gerrit.wikimedia.org/r/37365 [04:16:54] New patchset: Ori.livneh; "Yet another pmpta -> pmtpa" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37378 [04:21:22] New patchset: Asher; "our jenkins puppet test is broken :(" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37379 [04:25:30] PROBLEM - LVS HTTP IPv4 on parsoid.svc.pmtpa.wmnet is CRITICAL: (null) [04:35:11] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37379 [04:52:12] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [04:54:25] New patchset: Asher; "var check fix" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37382 [04:54:42] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37382 [04:59:00] !log kaldari synchronized php-1.21wmf5/extensions/UploadWizard/resources/mw.UploadWizardDetails.js 'fixing live geo bug in UploadWizard' [04:59:10] Logged the message, Master [05:04:26] New patchset: Asher; "reversing the logic for whether my.cnf should contain facebook-patch only options" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37385 [05:05:14] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37385 [05:15:09] PROBLEM - Puppet freshness on marmontel is CRITICAL: Puppet has not run in the last 10 hours [05:25:08] !log aaron synchronized php-1.21wmf5/includes/job/JobQueueDB.php 'deployed 78c63d4cafb4937e289856f66ac0e524fda79acb' [05:25:17] Logged the message, Master [05:35:15] PROBLEM - Puppet freshness on ms-be3 is CRITICAL: Puppet has not run in the last 10 hours [06:09:18] PROBLEM - Host srv278 is DOWN: PING CRITICAL - Packet loss = 100% [06:10:21] RECOVERY - Host srv278 is UP: PING OK - Packet loss = 0%, RTA = 0.45 ms [06:14:15] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused [06:39:26] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - 
HTTP/1.1 301 Moved Permanently - 0.082 second response time [07:45:49] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [07:53:46] PROBLEM - Puppet freshness on ssl3001 is CRITICAL: Puppet has not run in the last 10 hours [08:25:41] PROBLEM - LVS HTTP IPv4 on parsoid.svc.pmtpa.wmnet is CRITICAL: (null) [08:28:14] maybe that shouldn't page right now [08:30:04] given that the nagios checks are supposedly broken [08:30:48] Warning: Unknown: Input variables exceeded 1000. To increase the limit change max_input_vars in php.ini. [08:31:06] someone attempted to haXX0rize us?:P [08:31:27] :-D [08:31:36] no one would do that, we're the good guys [08:31:54] (that's sarcasm, I just couldn't find the sarcasm emoticon on my keyboard) [08:31:57] but that annoys bad guys! [08:36:36] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37375 [08:38:59] let's see if roan's change will make those go away [08:43:36] of course puppet takes forever to run on spence >_< [09:10:09] oh joy [09:10:24] "cannot override local resource" again, after all the waiting [09:12:40] hello [09:12:47] New patchset: Hashar; "rake validate now fail properly on .pp validation" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37394 [09:14:32] apergos: may you merge in https://gerrit.wikimedia.org/r/37394 ? That fixes the Jenkins job in charge of validating operations/puppet manifests :) [09:14:40] yeah just a sec [09:14:45] sure :-) [09:17:55] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37394 [09:20:13] waiting for the run to complete [09:20:34] thanks a lot! 
[09:21:04] err: /Stage[main]/Misc::Docsite/File[/srv/org/wikimediaq/doc/index.html]/ensure: change from absent to file failed: Could not set 'file on ensure: No such file or directory - /srv/org/wikimediaq/doc/index.html.puppettmp_4914 at /var/lib/git/operations/puppet/manifests/misc/docs.pp:17 [09:21:05] hm [09:21:39] ah that is Andrew Bogott's change [09:23:03] was tha rakefile something that would have been pulled in by git clone? [09:23:14] *that [09:23:37] apergos: I am not sure I understand your question [09:23:38] RECOVERY - LVS HTTP IPv4 on parsoid.svc.pmtpa.wmnet is OK: HTTP OK HTTP/1.1 200 OK - 1221 bytes in 0.016 seconds [09:23:53] the misc::docsite has an error in it, it requires a file which is never provided apparently [09:23:53] ha, that's roan's change working [09:24:50] well I'm asking about your change to rakefile, which I expected to show up as changed somehow in the puppet run [09:24:56] on gallium right? [09:25:05] ahh [09:25:23] Jenkins gets the latest version of the production branch then merges in the submitted change [09:25:24] but the only thing I saw that could have been it was [09:25:26] notice: /Stage[main]/Misc::Docs::Puppet/Git::Clone[puppetsource]/Exec[git_pull_puppetsource]/returns: executed successfully [09:25:35] so that's why I was asking if the git clone covered that [09:25:39] so as soon as you merged the rake file change, Jenkins knows about it [09:25:59] Git::Clone should get the latest version, I think that is the default [09:26:03] ok great [09:26:09] then you should be set to test that now [09:26:10] but Jenkins does not use the file from /etc/puppet/something [09:26:51] ok [09:27:03] New patchset: Hashar; "puppet manifest failure (do not submit)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37396 [09:27:08] I should do an office hour to explain jenkins to everyone :-) [09:27:22] +2! 
[09:27:52] looks like your failure was flagged as failure, w00t [09:27:57] \O/ [09:28:08] I still have to make the console log nicer [09:28:15] that is a bit hard to find out what is actually failing [09:29:17] Change abandoned: Hashar; "yeah that fails as expected!" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37396 [09:31:46] New review: Hashar; "recheck" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/35298 [09:32:26] New review: Hashar; "recheck" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/37118 [09:32:37] New review: Hashar; "recheck" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/37165 [09:32:43] New review: Hashar; "recheck" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/37153 [09:32:50] New review: Hashar; "recheck" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/37118 [09:33:27] that retriggered the lint checks [09:33:32] and made Jenkins to V+1 [09:40:41] New patchset: MaxSem; "Enable GeoData on all wikipedias and wikivoyages" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37398 [09:41:06] Change merged: MaxSem; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37398 [09:48:32] !log maxsem synchronized wmf-config/InitialiseSettings.php 'https://gerrit.wikimedia.org/r/37398' [09:48:39] Warning: the RSA host key for 'hume' differs from the key for the IP address '2620:0:860:2:21d:9ff:fe33:f235' [09:48:39] Offending key for IP in /etc/ssh/ssh_known_hosts:835 [09:48:39] Matching host key in /etc/ssh/ssh_known_hosts:603 [09:48:43] Logged the message, Master [09:52:28] ehm, why 'wikipedia' => somevalue doesn't worrk in InitialiseSettings? 
[09:52:35] New patchset: MaxSem; "Revert "Enable GeoData on all wikipedias and wikivoyages"" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37399 [09:53:09] Change merged: MaxSem; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37399 [09:54:02] !log maxsem synchronized wmf-config/InitialiseSettings.php 'https://gerrit.wikimedia.org/r/#/c/37399/' [09:54:08] o_0 [09:54:11] Logged the message, Master [10:27:54] RECOVERY - Parsoid on kuo is OK: HTTP OK HTTP/1.1 200 OK - 1221 bytes in 0.005 seconds [10:28:12] RECOVERY - Parsoid on wtp1001 is OK: HTTP OK HTTP/1.1 200 OK - 1221 bytes in 0.054 seconds [10:28:21] RECOVERY - Parsoid on lardner is OK: HTTP OK HTTP/1.1 200 OK - 1221 bytes in 0.009 seconds [10:28:30] RECOVERY - Parsoid on tola is OK: HTTP OK HTTP/1.1 200 OK - 1221 bytes in 0.003 seconds [10:28:30] RECOVERY - Parsoid on wtp1 is OK: HTTP OK HTTP/1.1 200 OK - 1221 bytes in 0.009 seconds [10:28:48] RECOVERY - Parsoid on mexia is OK: HTTP OK HTTP/1.1 200 OK - 1221 bytes in 0.003 seconds [10:29:25] heh [10:53:49] amusing names [10:54:02] why don't they use parsoid001 ? 
;) [11:05:00] New patchset: Mark Bergsma; "Handle many more error conditions" [operations/software] (master) - https://gerrit.wikimedia.org/r/37231 [11:07:06] New patchset: Mark Bergsma; "Randomize the order of containers" [operations/software] (master) - https://gerrit.wikimedia.org/r/37405 [11:07:06] New patchset: Mark Bergsma; "Use connection pooling for every Swift operation" [operations/software] (master) - https://gerrit.wikimedia.org/r/37406 [11:07:07] New patchset: Mark Bergsma; "Don't unnecessarily rerequest dst containers" [operations/software] (master) - https://gerrit.wikimedia.org/r/37407 [11:07:07] New patchset: Mark Bergsma; "Get rid of the useless HEAD request on every object creation" [operations/software] (master) - https://gerrit.wikimedia.org/r/37408 [11:51:23] PROBLEM - Puppet freshness on db59 is CRITICAL: Puppet has not run in the last 10 hours [12:01:26] PROBLEM - Puppet freshness on analytics1007 is CRITICAL: Puppet has not run in the last 10 hours [12:01:26] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [12:01:27] PROBLEM - Puppet freshness on sq48 is CRITICAL: Puppet has not run in the last 10 hours [12:33:23] PROBLEM - Puppet freshness on db9 is CRITICAL: Puppet has not run in the last 10 hours [12:34:26] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [12:35:12] New patchset: Hashar; "wikibugs perl dependencies are needed for Jenkins" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37419 [12:52:10] New patchset: Jgreen; "add backupmover account to fundraising db dump boxes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37422 [12:52:48] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37422 [12:54:44] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:57:53] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad 
Request - 336 bytes in 0.335 seconds [13:10:00] do mwscript and company work on spence too? [13:12:38] <^demon> apergos: Ping. [13:13:10] New review: Nemo bis; "By the thanks also to some CR via IRC by Ariel I sent an e-mail on this about a week ago but got no ..." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/33713 [13:14:08] !log demon synchronized php-1.21wmf5/extensions/Wikibase 'Syncing wikibase to 24d8471' [13:14:18] Logged the message, Master [13:17:41] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [13:17:41] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [13:18:43] New patchset: Demon; "Adding wikibase to debug log groups" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37426 [13:20:20] Change merged: Demon; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37426 [13:28:06] !log demon synchronized wmf-config/CommonSettings.php 'Deploying wikibase debug log groups' [13:28:14] Logged the message, Master [13:32:41] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:34:38] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [13:34:38] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [13:44:05] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 4.069 seconds [13:50:02] Change abandoned: Andrew Bogott; "Most links turn out to work OK without this, and I don't like this particular approach." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/37118 [14:04:26] !log demon synchronized php-1.21wmf5/extensions/Wikibase/client/includes/store/sql/WikiPageEntityLookup.php [14:04:35] Logged the message, Master [14:11:25] !log demon synchronized php-1.21wmf5/extensions/Wikibase/client/includes/store/sql/WikiPageEntityLookup.php 'More debugging' [14:11:33] Logged the message, Master [14:14:28] !log demon synchronized php-1.21wmf5/extensions/Wikibase/client/includes/store/sql/WikiPageEntityLookup.php 'More debugging, Iaf3cfff9' [14:14:36] Logged the message, Master [14:16:40] hashar, feel like reviewing a few lines of php? https://gerrit.wikimedia.org/r/#/c/37250/ [14:17:45] luuuve php [14:18:08] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:23:45] andrewbogott: reviewing [14:23:50] thanks [14:29:36] andrewbogott: done https://gerrit.wikimedia.org/r/#/c/37250/ [14:29:49] andrewbogott: implode( explode() ) is a nice trick :) [14:30:33] Easier than figuring out regexp in yet another language [14:31:26] <^demon> I tried to fix doc.wikimedia.org ssl yesterday. [14:32:28] <^demon> Hmm, the apache config got added, but still seems to be serving old cert & docroot :\ [14:32:32] <^demon> Wonder if apache got kicked. [14:33:45] <^demon> andrewbogott: Can you try restarting apache on gallium? [14:34:29] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.042 seconds [14:34:36] ^demon: Yep, done. [14:35:07] <^demon> Yay, fixed. [14:35:09] <^demon> Thanks! 
[14:35:12] <^demon> https://doc.wikimedia.org/ [14:38:44] New patchset: Demon; "Adding stopgap measure for testing wikidata client on test2wiki" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37429 [14:41:21] Jarry1250: /wii mutante [14:41:30] ehm [14:47:09] andrewbogott: regarding your Openstack manager change, I will let Ryan +2 it [14:47:20] * andrewbogott nods [14:47:25] andrewbogott: I don't want to mess with "his" extension :) [14:47:45] Don't you? [14:47:47] I do ;) [14:48:25] hehe [14:49:00] hashar, re-review? [14:50:32] hm [14:50:42] I need to enable lint checks on that extension :) [14:50:51] That's why I break the LDAP extension now and again ;D [14:51:34] andrewbogott: nice =) [14:52:49] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [14:59:54] andrewbogott: I got it [15:00:09] andrewbogott: it is a bit hacky to append the URL to the class name for display purpose [15:00:52] You mean, in the multiselect label? [15:02:24] $instanceInfo["${puppetgroupname}-puppetclasses"] had 'options' set to a list of plain text puppet classes [15:03:04] now it will be something like: ganglia::client ntp::client .. [15:04:45] Is that dict used anywhere apart from setting up the web form, though? [15:04:55] probably not [15:06:55] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:07:38] ^demon|brb: pong, sorry for the long delay [15:07:51] you caught me just after I had gone to do some errands [15:08:02] <^demon|brb> No worries, ended up figuring it out. [15:08:09] ok great [15:08:22] I am less here thn I would like [15:08:30] and getting pretty tired of my digestion recently [15:12:07] hashar: I find it weird that the widget uses the dict[key] as the field label. But, given that, the rest of the craziness follows :) [15:13:09] Best option would be to add a help url field to the widget. Dunno if anyone but me would use that though. 
[15:15:56] PROBLEM - Puppet freshness on marmontel is CRITICAL: Puppet has not run in the last 10 hours [15:16:00] paravoid: do mwscript and friends work on spence? [15:16:15] I don't know [15:16:21] what are you looking for? [15:17:38] paravoid: I'd like to submit a check like the one for the enwiki jobqueue [15:17:57] but that uses a direct DB query which requires passwords and whatnot [15:18:32] I don't think there is a local mw installation there [15:18:58] is the mysql password different for all clusters? [15:19:19] huh, I lie [15:19:19] there is one [15:19:34] oh [15:19:40] in which case multiversion and all the rest will work fine [15:19:44] wonderful [15:19:48] Nemo_bis: There's a script for doing that already [15:19:53] And I added a total count to it too... [15:19:58] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.383 seconds [15:20:03] Reedy: that only checks those above 10k I think? [15:20:07] hmm where? [15:20:09] No [15:20:27] https://gerrit.wikimedia.org/r/gitweb?p=mediawiki/extensions/WikimediaMaintenance.git;a=blob;f=getJobQueueLengths.php;h=25179a01093705ae4b101cb38367dcd4d4afc609;hb=HEAD [15:20:30] It does every one [15:21:06] ah, yes, maintenance script :) [15:21:07] where does it run? [15:21:17] or rather, where is output stashed? 
[15:21:51] Doing all of those sql connections seperately takes an age (I know, I tried it) [15:21:59] aww [15:22:50] !log reedy synchronized php-1.21wmf5/extensions/WikimediaMaintenance/ [15:22:58] Logged the message, Master [15:23:34] Total 5812836 [15:23:34] real 0m27.778s [15:25:02] fast enough [15:25:25] The other is multiple times longer [15:25:33] and outputting to console is probably the slowest part [15:27:46] PROBLEM - Puppet freshness on db1043 is CRITICAL: Puppet has not run in the last 10 hours [15:29:05] Total 5814623 [15:29:05] real 0m10.913s [15:29:15] a third of the time if you don't output all the zeros [15:29:44] seems easy enough to skip those [15:30:24] I just did it ;) [15:30:30] what command is that? [15:30:30] Reedy, why https://gerrit.wikimedia.org/r/#/c/37398/ didn't work? [15:30:30] https://gerrit.wikimedia.org/r/37437 [15:30:33] By default [15:30:58] ah [15:31:05] and what about getting only the total? [15:31:14] http://p.defau.lt/?79QoWQsguFdgb4r0rMheVw [15:31:27] MaxSem: didn't work for what? Wikipedia? [15:31:33] yes [15:31:51] worked for wikivoyages though [15:32:02] lol [15:32:20] Try 'wiki' [15:32:21] I think.. [15:32:31] I know there is 'wikipedia' => in other configs... [15:34:25] if it starts to resist, I'll have to punish it by deploying everywhere:P [15:36:20] Nemo_bis: https://gerrit.wikimedia.org/r/37438 [15:36:46] PROBLEM - Puppet freshness on ms-be3 is CRITICAL: Puppet has not run in the last 10 hours [15:37:20] mwscript extensions/WikimediaMaintenance/getJobQueueLengths.php --totalonly [15:37:56] Reedy: only the number or also "Total"? [15:38:02] Also Total [15:38:28] (\d+) would suffice to get the number from the line [15:38:46] apergos: how's swift? :) [15:38:53] sucky [15:39:32] the etas on these object repication runs are around 170 hours now [15:39:34] !log reedy synchronized php-1.21wmf5/extensions/WikimediaMaintenance [15:39:42] Logged the message, Master [15:39:53] for what exactly? 
[15:39:57] down from 210 hours yesterday morning but still [15:40:18] these are what shuffle the data around [15:42:12] Nemo_bis: Also, yes, spence should have mwscript etc set up [15:42:22] at worst, you might need to give it /path/to/mwscript [15:42:31] or php /path/to/MWScript.php [15:42:45] ok thanks [15:42:49] Reedy: spence appears to have a copy of the standard install [15:42:52] with wmf-config and all the rest [15:43:04] so stuff should run there just like anywhere else [15:43:05] yeah, it's in mediawiki-installation [15:43:38] reedy@spence:~$ mwscript [15:43:38] mwscript: command not found [15:44:07] It doesn't get the appserver package.. So no /usr/local/bin/mwscript [15:44:20] yeah but [15:44:49] Lol [15:44:49] reedy@spence:~$ php /home/wikipedia/common/multiversion/MWScript.php extensions/WikimediaMaintenance/getJobQueueLengths.php --totalonly [15:44:49] PHP Fatal error: Class 'Memcached' not found in /home/wikipedia/common/php-1.21wmf5/includes/objectcache/MemcachedPeclBagOStuff.php on line 57 [15:44:49] Fatal error: Class 'Memcached' not found in /home/wikipedia/common/php-1.21wmf5/includes/objectcache/MemcachedPeclBagOStuff.php on line 57 [15:44:56] /apache/common-local/multiversion [15:45:00] I guess someone will need to fix that first [15:45:26] so spence needs php5-memcached [15:45:39] try running the /apache/common/ version [15:45:42] any better? 
[15:46:30] ah I see [15:46:31] meh [15:46:32] Worse [15:46:32] PHP Warning: require(/usr/local/apache/common-local/php-1.21wmf5/../wmf-config/CommonSettings.php): failed to open stream: Permission denied in /usr/local/apache/common-local/php-1.21wmf5/maintenance/doMaintenance.php on line 88 [15:46:32] PHP Fatal error: require(): Failed opening required '/usr/local/apache/common-local/php-1.21wmf5/../wmf-config/CommonSettings.php' (include_path='.:/usr/local/apache/common-local/php-1.21wmf5:/usr/local/apache/common-local/php-1.21wmf5/includes:/usr/local/apache/common-local/php-1.21wmf5/languages:/usr/local/apache/common-local/php-1.21wmf5/maintenance') in /usr/local/apache/common-local/php-1.21wmf5/maintenance/doMaintenance.php [15:46:32] on line 88 [15:46:57] wll wtf not? [15:47:08] probably needs a sudo in it.. [15:47:23] Yeah [15:47:24] sudo -u mwdeploy php /usr/local/apache/common-local/multiversion/MWScript.php extensions/WikimediaMaintenance/getJobQueueLengths.php --totalonly [15:47:29] Gives the memcached errors again [15:47:53] drwx------ 5 mwdeploy mwdeploy 4096 2012-12-07 13:27 /usr/local/apache/common-local/php-1.21wmf5/../wmf-config/ [15:47:54] joy [15:48:03] yeah that's not installed [15:48:29] so if such a job gets puppetized, better add the php5-memcached to spence at the same time [15:50:18] yup [15:50:28] grrr, laptop decided with 15 minutes of battery life left to turn itself off [15:50:48] New patchset: Nemo bis; "Add ganglia graph for global jobqueue length" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37441 [15:50:56] reedy@fenari:/home/wikipedia/common$ aptitude why php5-memcached [15:50:56] i wikimedia-task-appserver Depends php5-memcached [15:51:05] ^ Should we just make sure it's got appserver too? 
[15:51:40] I meh [15:51:44] I'd rather not I guess [15:52:14] also I could be wrong but I thought we were slowly trying to get rid of that package (appserver) [15:52:57] We also need to make sure it's got the php5-redis package too.. [15:53:14] ok [15:53:26] spence is a very overworked host so [15:53:36] if this job is a resource intensive thing we might want to rethink it [15:54:37] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:55:50] real 0m10.913s [15:55:50] user 0m0.240s [15:55:50] sys 0m0.080s [15:55:54] Reedy: so could you comment in the patch above that it has to wait for mwscript to... [whatever] ? [15:56:01] http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=cpu_report&s=by+name&c=Miscellaneous+pmtpa&h=spence.wikimedia.org&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4 see that [15:56:03] mwscript is there [15:56:10] it needs to be have the right php packages [15:56:25] hmm that doesn't look like too much work then [15:56:27] good enough [15:57:23] !log Jenkins now auto submit changes in mediawiki/core.git whenever tests are successful {{gerrit|37420}} [15:57:32] Logged the message, Master [15:57:50] New review: Reedy; "Also needs to be ensured Spence has the correct php packages to run mediawiki. Needs php5-memcached ..." [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/37441 [15:58:05] well I added you two as reviewers because you alòready looked into it somehow, thanks! 
[15:58:12] and there it is [15:58:15] heh [15:58:32] notpeter robh: Dell is sending new 10G NIC for the mc1002/1009 servers...will be here Monday [15:58:45] awesome [15:58:53] New patchset: Demon; "Adding stopgap measure for testing wikidata client on test2wiki" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37429 [16:00:14] fast reviewer [16:01:14] heh [16:01:18] after that we'll need a smart way to feed each line of the getJobQueueLengths.php output to a separate graph for each wiki maybe? [16:01:22] apergos: did you see bad disk on ms-be7...faidon sent ticket last night [16:03:01] be7? [16:03:05] no, I saw one on 10 [16:03:27] aaand en.wiki jobqueue just crossed half a million [16:03:35] let's just swap those with disks out of the new systems while we RMA them. [16:03:57] cmjohnson1: cool! sounds good [16:04:02] Nemo_bis: THAT'S OVER 9000!? [16:04:26] Reedy: dunno, hard maths [16:05:09] sda1 I see [16:05:11] awesome [16:05:39] paravoid: okay..once sbernardin gets onsite I will ask him to swap a disk [16:05:55] can we not swap it right now? [16:06:01] cmjohnson1: I don't mind that much, but since we have all those systems sitting it may be worth it [16:06:17] apergos: I guess it's hard for Chris to swap it over from Virginia :P [16:06:33] I was asking the opposite question [16:06:39] slightly difficult :-P [16:06:46] I want to not swap the disk right now [16:06:53] oh [16:06:55] how come? [16:07:03] well lt me put it a different way [16:07:12] I don't want to do anything about the rings [16:07:29] I want to let them sit, really we *need* to let them sit, til we get a complete run of object reiplcation [16:07:38] exactly. 
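A rough sketch of the per-wiki graph idea raised at 16:01:18, in case it helps. It assumes that getJobQueueLengths.php (without --totalonly) prints one "<wiki> <count>" pair per line, which is not confirmed anywhere in this log. The function names are made up for illustration. The metric name follows the enwiki_JobQueue_length pattern seen in the ganglia URL, and the gmetric options used are its standard --name/--value/--type/--units flags.

```python
def parse_queue_lengths(text):
    """Parse lines of the assumed form '<wiki> <count>' into a dict."""
    lengths = {}
    for line in text.splitlines():
        parts = line.split()
        # Skip blank lines and anything that is not exactly 'name number'.
        if len(parts) != 2 or not parts[1].isdigit():
            continue
        lengths[parts[0]] = int(parts[1])
    return lengths


def gmetric_commands(lengths):
    """Build one gmetric argument list per wiki (nothing is executed here)."""
    commands = []
    for wiki, count in sorted(lengths.items()):
        commands.append([
            "gmetric",
            "--name", f"{wiki}_JobQueue_length",
            "--value", str(count),
            "--type", "uint32",
            "--units", "jobs",
        ])
    return commands


if __name__ == "__main__":
    sample = "enwiki 500000\nfrwiki 4000000\n"
    for cmd in gmetric_commands(parse_queue_lengths(sample)):
        print(" ".join(cmd))
```

Each argument list could be handed to subprocess.run from a cron job. Given the concern that spence is already overworked, one graph per wiki across several hundred wikis may be more than is wanted.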
[16:07:43] you don't need to change the rings [16:07:47] just swap the disks [16:07:48] if we can swap the disk without restarting those [16:07:55] then awesome [16:08:08] but if for some reason we need to reboot then we are starting the run over on that host [16:08:10] we can either adjust the rings to remove the broken disks, or swap the disks and let replication finish with the new empty disks. [16:09:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 2.276 seconds [16:09:44] apergos: have you filed an RT for ms-be10's disk? [16:09:50] (too lazy to check :) [16:09:50] no, not yet [16:11:02] we're looking at 8 days itil these things complete [16:11:16] maybe less if we get lucky (often the eta is a little off and we see acceleration near the end) [16:11:34] if we wind up needing to reboot then we are back at 250 hours or whatever it was [16:11:45] why? [16:11:58] because the run will start from the beginning, after the reboot [16:12:06] well hmm [16:12:06] a) we can hot-swap, no need for a reboot, b) why would replication start from the beginning? [16:12:12] after a reboot [16:12:16] of course it won't [16:12:26] it walks thourhg all partitions on the disk [16:12:33] polls the remote hosts for each object [16:12:43] it rsyncs [16:12:52] that's it [16:12:56] and if the remote hosts are in sync, it says, great [16:12:59] and moves on to the next object [16:13:09] if nt, it rsyncs [16:13:35] I've already seen it start over at 250hrs a couple times this week [16:13:40] so I really want to avoid that [16:14:02] Robh: idrac for server constable should be up now [16:14:11] how would it start from the beginnin when it already has copied half the data? 
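The replication behaviour described above (walk every partition, ask the remote side about each object, rsync only what differs) can be illustrated with a toy model. This is not Swift's replicator code. The data layout (partition -> object name -> content hash) and the function name are assumptions made only for this sketch.

```python
def replication_pass(local, remote):
    """Walk every local partition and push only objects the remote lacks.

    local and remote map partition -> {object name: content hash}.
    Returns the list of (partition, object name) pairs that were copied.
    """
    pushed = []
    for partition in sorted(local):
        remote_objects = remote.setdefault(partition, {})
        for name, digest in sorted(local[partition].items()):
            if remote_objects.get(name) == digest:
                continue  # already in sync, move on to the next object
            remote_objects[name] = digest  # stands in for the rsync step
            pushed.append((partition, name))
    return pushed


if __name__ == "__main__":
    local = {1: {"a": "h1", "b": "h2"}, 2: {"c": "h3"}}
    remote = {1: {"a": "h1"}}
    print(replication_pass(local, remote))  # copies only what is missing
    print(replication_pass(local, remote))  # a restarted pass copies nothing
```

In this model a restarted pass still visits every partition from the start, which is where a reset ETA could come from, but it does not copy data that already arrived.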
[16:14:15] otoh we are talking about ms7 and ms10 [16:14:35] no, I'm talking about ms-be7 [16:14:38] these run through faster, let me look at those (I have been watching [16:14:42] and hopefully you were talking about ms-be10 [16:14:43] yes, ms-be7 and ms-be10 [16:14:47] sorry for the abbreviation [16:14:55] that was confusing for a moment :) [16:15:08] I have been watching some other hosts so let me check those [16:15:42] sbernardin: Monday a Dell technician is going to call you...they are coming to swap the mainboard on srv266 [16:16:10] New patchset: Demon; "Adding stopgap measure for testing wikidata client on test2wiki" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37429 [16:16:17] ms-bs7 looks like it runs much quicker than the others, this is a good thing [16:17:47] same for ms-be10 [16:18:46] cmjohnson: ok [16:18:56] sbernardin: pinging now, thanks [16:20:18] ok well let's try one of them and see how it goes (disk swap, not new rings) [16:20:27] it will be nice to get srv266 going (sbernardin). You will have to escort him in the cage and watch him carefully. [16:20:48] cmjohnson: will do [16:20:50] apergos: let's do ms-be7 since it is on the floor sbernardin is already on [16:21:17] cmjohnson: what's wronh with ms-be7? [16:21:33] cmjohnson: *wrong [16:21:36] that's the one with a ticket, worksforme [16:21:42] sbernardin: open up a box and pull a disk out ...we are going to swap a bad disk...don't pull anything out yet [16:21:51] cmjohnson1, will he have to kill the Dell guy after he's done replacing? [16:22:06] lol [16:22:10] New patchset: Demon; "Only prune down to the last day, table grows fast" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37448 [16:22:10] !log authdns-update for durant mgmt [16:22:16] maxsem: collarteral damage...have to protect our trade secrets [16:22:19] Logged the message, RobH [16:22:35] what secrets are those? [16:22:58] <^demon> RobH: I've got a follow-up to yesterday's cron. 
Just adding a param: https://gerrit.wikimedia.org/r/#/c/37448/ [16:23:08] <^demon> (To keep wikidatawiki from getting too big :p) [16:23:15] apergos: need to know ....need to know! :-P [16:23:16] cool, im about to submit a operations change as well [16:23:19] will merge both [16:24:11] hmph.. uppity dc techs :-P got their own cabal or something [16:24:18] cmjohnson: one of the ssd's? [16:24:20] New patchset: Demon; "Only prune down to the last day, table grows fast" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37448 [16:24:35] no, it's sda, that should be a regular disk [16:24:36] New patchset: RobH; "adding in mac for constable" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37449 [16:24:38] negs..it is going to be a 3.5 [16:24:45] ok [16:25:29] New review: RobH; "Make it so." [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/37448 [16:25:29] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37448 [16:26:00] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37449 [16:27:07] ^demon: its all merged, running puppetd --test on hume now [16:27:26] sbernardin: did the replacement disk for hume show up yet? [16:27:59] * cmjohnson1 see's robh's puppet on hume and remembers then disk  [16:28:07] cmjohnson: yes they did [16:28:17] ^demon: all updated [16:28:28] <^demon> Thanks. [16:28:46] quite welcome, yay code review and not racking server [16:28:48] so odd. [16:28:56] im a bit giddy still. 
[16:29:03] * RobH has stupid ass grin [16:29:21] er [16:29:24] I don't want you doing code review [16:29:27] so sbernardin you will need to coordinate a disk swap with notpeter [16:29:51] mark: im not primarily [16:29:57] New patchset: Demon; "Adding stopgap measure for testing wikidata client on test2wiki" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37429 [16:30:02] but chad had one yesterday that needed root and was easy [16:30:08] cmjohnson: ok...what's the process for that? [16:30:13] so i did since i was in there doing my own gerrit reviews for installs [16:31:01] so sbernardin when told so you will need to pull the disk and replace w/one of the new ones you received and than insert back into the server [16:31:11] it should hopefully rebuild on it's own [16:31:21] cmjohnson: ok [16:31:58] sberbardin: don't forget to scan the packing slip and add to the ticket rt3916 [16:32:09] after you attach the scan ...close the ticket [16:36:33] !log maxsem synchronized php-1.21wmf5/extensions/GeoData 'https://gerrit.wikimedia.org/r/#/c/37446/' [16:36:41] Logged the message, Master [16:37:00] Warning: the RSA host key for 'hume' differs from the key for the IP address '2620:0:860:2:21d:9ff:fe33:f235' [16:37:00] Offending key for IP in /etc/ssh/ssh_known_hosts:835 [16:37:00] Matching host key in /etc/ssh/ssh_known_hosts:603 [16:37:21] RobH: ping [16:37:30] preilly: heya, im installing constable for you now. [16:37:32] RobH: is constable available for use now? [16:37:40] it will be in a few minutes, its installing base system. [16:37:41] RobH: Okay great thanks! 
[16:37:59] im going to finish install, get it running puppet, and will ping ya and resolve ticket [16:38:04] (the install ticket) [16:38:11] RobH: Okay wonderful [16:38:19] the procurement ticket will remain open until we get a spec on the parsoid cluster and varnish boxes [16:38:56] New patchset: Demon; "Adding stopgap measure for testing wikidata client on test2wiki" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37429 [16:42:43] RobH: Okay that makes total sense to me [16:43:11] New patchset: Demon; "Adding stopgap measure for testing wikidata client on test2wiki" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37429 [16:44:16] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:47:36] puppet is being sloooooow [16:48:27] cmjohnson: do I contact notpeter about hume..or do I wait for him? [16:51:43] sbernardin: use cmjohnson1 in irc...so it pings me [16:52:14] cmjohnson1: do I contact notpeter about hume..or do I wait for him? [16:52:22] either way [16:53:15] notpeter: please contact sbernardin w/a good time to replace the bad disk in Hume.....(sbernardin) just wait for peter [16:53:35] hey guys, anyone know if anything has suddenly changed with the jobs queue cause its shot up in the last 2-3 hours [16:53:37] http://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Miscellaneous%20pmtpa&h=spence.wikimedia.org&v=506&m=enwiki_JobQueue_length [16:53:49] argh blergh meh [16:53:54] what? [16:54:09] cmjohnson1: cool...no problem [16:54:28] my hope is that someone made a few edits to some heavily used templates [16:54:37] cmjohnson1: when do you want to replace the bad disk in ms-be7? [16:54:40] but afaik no, no one has been tinkering with the innards [16:55:10] Reedy: ^^^ job queue [16:55:27] let's coordinate w/apergos [16:55:29] Seddon: Someone edited a big template. Most likely [16:55:46] enwiki like doing shit like that [16:57:09] Reedy, ok cool. 
Ill let you know if it goes on for a significant period of time [16:57:37] φρ στιλλ > 4 μιλλιον Ι βετ [16:57:39] grrrr [16:57:44] fr still > 4 million I bet [16:57:56] why can't it just detect what layout I want :-P [16:58:32] apergos: I know what you need. You need two keyboards. [16:59:04] Put stickers with the Greek characters on one of the two keyboards, and have the OS switch layouts depending on which keyboard you type on [16:59:15] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.063 seconds [16:59:53] look at your screen when typing? :-) [16:59:55] even more inconvenient [17:00:05] dude, I look but I type fast enough for somet tings [17:00:14] that I've already hit return by the time I register that 'woops' [17:00:17] voice recogination software [17:00:18] hehe [17:00:25] I use caps lock for the layout switching [17:00:30] that helps too [17:00:31] preilly: constable puppet run done, its all yers [17:00:31] I use the win key [17:00:35] not good for anything else [17:01:09] so sbernardin and cmjohnson1 when do you want to do ms-be7? [17:01:17] I don't want to be here toooo too late [17:01:19] today. [17:01:45] sbernardin is ready now [17:01:51] οκ [17:01:54] * cmjohnson1 assumes [17:01:55] err! [17:02:30] only thing to make sure is that the new disk has no partition or filesystem or crap on it [17:02:32] but everything is right there so it should only take a few minutes [17:02:37] right [17:02:49] we are pulling it from a new machine still in box [17:02:51] ok [17:03:44] I will have it replaced as normal but for the sake of getting them up and running we can swap it right now. once the new disk comes in I will have sbernardin put in the donor server [17:04:58] yes, swap now and rma later [17:05:02] yep [17:13:57] apergos: am I ok to swap the drives now? 
[17:14:23] yes indeed, hot swap and let's see how it goes [17:14:33] apergos: ok [17:15:12] apergos: ok...it's been swapped out [17:15:41] ok [17:16:37] sbernardin: hey, I'm on pst, and going to head into the office soon. I'll ping you when I'm back online for hume drive switcharoo [17:18:53] notpeter...no problem [17:20:03] * apergos waits for the new disk to go in [17:30:08] http://tech.mit.edu/V132/N59/pressure/sleepinghours/index.htm [17:30:22] the only thing i have in common with an mit student is we dont sleep enough. [17:30:31] cool data =] [17:30:46] File not found: /v1/AUTH_43651b15-ed7a-40b6-b745-47666abf8dfe/wikipedia-mediawiki-local-public/c/c0/VisualEditor-ModuleStack.png [17:30:49] https://upload.wikimedia.org/wikipedia/mediawiki/c/c0/VisualEditor-ModuleStack.png [17:30:56] v1/AUH, looks like something is up [17:32:01] oh apparently that is what all 404s look like [17:32:05] sbernardin: did the other disk get swapped in yet? [17:32:06] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:32:13] Krinkle: yes I'm afraid [17:32:14] why is all that garbage exposed [17:32:20] it's on my todo [17:32:23] I thought some backend thing crashed or something [17:32:26] I was actually working on that yesterday [17:32:34] not a huge priority for obvious reasons [17:32:55] (this isn't a security-related leak, it's just ugly) [17:33:10] !log reedy synchronized php-1.21wmf5/extensions/CentralAuth [17:33:18] Logged the message, Master [17:33:39] apergos: yes it did [17:34:05] <^demon> Anyone mind poking https://gerrit.wikimedia.org/r/#/c/37429/ for me? It should be ready to go now. 
[17:34:35] ok no sign of it being recognized [17:34:54] guess I'll have to poke around [17:37:30] apergos: ok...let me know if I need to do anything else [17:37:39] ok, thanks [17:46:48] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.636 seconds [17:47:06] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [17:55:03] PROBLEM - Puppet freshness on ssl3001 is CRITICAL: Puppet has not run in the last 10 hours [18:05:49] !log Ran killall -u extdist on fenari per Chad's request [18:05:59] Logged the message, Mr. Obvious [18:07:31] rats [18:07:36] [1431846.186436] sd 0:0:0:0: rejecting I/O to offline device [18:08:29] apergos: lots of "table full" errors for pc server [18:08:50] dberror.log is massive [18:09:07] mostly due to printing the serialized parser output to the log ;) [18:09:11] out of space right? [18:09:18] yeah [18:09:19] what server is it? [18:09:22] * AaronSchulz checks ganglia [18:11:11] New patchset: Catrope; "Set ensure=>absent to kill the Nagios check for parsoid.svc.pmpta.wmnet (typo)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37460 [18:15:27] apergos: good thing I changed the code to catch the exceptions [18:15:32] otherwise this would have been downtime [18:15:36] yep [18:15:56] New patchset: Ottomata; "Setting varnich instance name for blog varnish logging to ""." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37461 [18:16:01] I guess robla or someone can just remember to ambush asher when he comes [18:16:21] I don't see any of pc1 2 or 3 full [18:16:30] you don't know which one it's complaining about? [18:16:33] * robla looks for what he needs to ambush Asher about [18:16:57] Fri Dec 7 18:16:49 UTC 2012 srv270 commonswiki SqlBagOStuff::set 10.0.0.221 1114 The table 'pc054' is full (10.0.0.221) [18:17:14] oh dberrorlog now looks like crap. [18:17:16] nice [18:17:38] I take it there was a change to the logging. 
who did that? [18:17:46] New review: Ottomata; "See also:" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/37461 [18:17:46] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37461 [18:17:53] apergos: tail -f dberror.log | cut -b -120 [18:17:56] pc1 [18:17:57] ;) [18:18:05] according to the ip anyways [18:18:54] since this is about the log, is this necessarily an Asher thing? [18:19:58] it is? [18:20:13] it's about the table claiming to be full [18:20:19] that's usually 'out of space' except that [18:20:23] I see it at 92% [18:20:33] the dberr log is elsewhere [18:21:12] 1.5T ibdata1 [18:21:35] this is the big ticket item, f we need to do something about it (which I assume so) then this is an asher thing, sadly [18:21:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:23:05] what's pc054 used for? [18:23:22] there is no pc054 (server), it must be a table name [18:23:44] I don't know the parser cache table naming scheme [18:23:51] * robla is trying to figure out how urgent this is [18:23:56] gotcha [18:23:57] apergos: there are 256 of them [18:24:06] so only one reports full [18:24:09] maybe it really is full [18:24:15] oh, pc == parser cache...got it [18:24:16] as opposed to being a disk space issue [18:24:26] no, lots of tables are giving this [18:24:29] ah [18:24:30] ok then [18:24:33] *sigh* [18:24:36] probably all ;) [18:24:37] there's a cron job that is supposed to be cleaning these out [18:24:42] is that actually happening? [18:24:43] there is?? [18:25:01] woosters: might have a high priority db thing [18:25:02] does that script still bring down the server [18:25:06] ? [18:25:14] ya [18:25:22] anyone know the name of that script? 
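On the question of why the error names pc054 when there is no such server: a sharded key/value store typically hashes each cache key to pick one of its tables. The sketch below is a hypothetical illustration of that idea only. The hash choice, the function name, and the zero-padded naming are assumptions, and the only facts taken from the log are the 'pc054' name and the count of 256 tables.

```python
import hashlib


def table_for_key(key, shards=256, prefix="pc"):
    """Map a cache key to one of `shards` tables named like pc000..pc255."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    index = int(digest[:8], 16) % shards
    width = len(str(shards - 1))  # 3 digits for 256 shards
    return f"{prefix}{index:0{width}d}"


if __name__ == "__main__":
    print(table_for_key("enwiki:pcache:idhash:12345"))
```

Because keys spread roughly evenly over the tables, a shared storage limit would be expected to show up as "table is full" on many tables at about the same time, which matches "lots of tables are giving this".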
[18:25:26] * robla looks [18:26:06] New patchset: Pyoungmeister; "icinga: add some requires" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37462 [18:26:10] purgeParserCache.php [18:26:31] heh I would never have guessed (is it in puppet? some shell script someplace? etc etc) [18:26:41] see RT 2108 [18:26:49] also, see my mail from October 18 on the subject [18:26:50] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37462 [18:27:04] to ops list, titled: "Parser cache database problem earlier this week" [18:28:18] http://ganglia.wikimedia.org/latest/graph_all_periods.php?c=MySQL%20pmtpa&h=pc1.pmtpa.wmnet&v=1126257&m=mysql_innodb_free_space&r=hour&z=default&jr=&js=&st=1350611180&vl=Mbytes&ti=mysql_innodb_free_space&z=large [18:28:57] does your mail say anything useful about the cron script? (orhterwise I will wait on it for now) [18:29:00] looks like out of innodb space [18:29:01] robla: [18:29:32] apergos: nah, the ticket is the better place to look right now [18:29:40] ok, I read that [18:29:54] well I could try running this script manually and see what it does [18:30:15] anyone have any ideas if the suggested age is good or not? [18:30:23] if no one does I'm just going to run it and watch [18:30:29] even though as the ticket says [18:30:44] You need to call it with --age=, where is an integer which relates to the resulting cache size in a complex manner which has not been fully determined [18:31:21] * apergos grits teeth and runs it [18:31:26] yeah, go for it [18:31:33] * robla may drop off of irc for a sec [18:31:45] you were 5 secs too slow. [18:31:50] watching the progress bar, it is slow [18:32:04] should have run it in screen I guess [18:32:38] bleep bleep it [18:32:46] so I interrupted it and [18:32:52] restat in screen but naturally [18:33:02] it refuses to run now [18:33:05] Cannot purge this kind of parser cache. 
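A small sketch of what an --age style purge amounts to, for anyone reading along. It assumes each cache entry is stored with an expiry time equal to its creation time plus a fixed expire time, so that "delete entries older than age" becomes "delete entries expiring before now + expire_time - age". That mapping is an assumption here, not something taken from the script, and the in-memory dict and function names are purely illustrative.

```python
from datetime import datetime, timedelta, timezone


def purge_cutoff(now, expire_seconds, age_seconds):
    """Expiry cutoff that corresponds to 'older than age_seconds'."""
    return now + timedelta(seconds=expire_seconds) - timedelta(seconds=age_seconds)


def delete_expiring_before(cache, cutoff):
    """Remove entries whose expiry is before cutoff and return how many went."""
    doomed = [key for key, (_value, expiry) in cache.items() if expiry < cutoff]
    for key in doomed:
        del cache[key]
    return len(doomed)


if __name__ == "__main__":
    day = 86400
    now = datetime(2012, 12, 7, 18, 30, tzinfo=timezone.utc)
    cache = {
        "old": ("...", now - timedelta(days=25) + timedelta(days=30)),
        "new": ("...", now - timedelta(days=5) + timedelta(days=30)),
    }
    removed = delete_expiring_before(cache, purge_cutoff(now, 30 * day, 20 * day))
    print(removed, sorted(cache))
```

A smaller age removes more entries, which is why the choice of value matters for how much space comes back.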
[18:33:22] apergos: its friday night for you, whatever you touch is going to break man [18:33:26] you know better! [18:33:31] thanks [18:33:33] ;] [18:33:37] well no one else is volunteering [18:33:38] glad I could help [18:33:53] * RobH is in peanut gallery [18:34:11] * AaronSchulz looks at rob's desk [18:34:26] I'll look at the script as soon as I have a current copy [18:34:30] see if I can figure ot anything [18:34:44] AaronSchulz: http://en.wikipedia.org/wiki/Peanut_gallery [18:35:01] im not sure if i mean im rowdy [18:35:02] AaronSchulz: I'm in R31 now [18:35:08] or about as mature as a small child. [18:35:10] no robh [18:35:11] or maybe both [18:35:18] ahhh, other rob [18:35:21] DAMN YOU ROBLA [18:35:25] $success = $pc->deleteObjectsExpiringBefore( $date, array( $this, 'showProgress' ) ); [18:35:29] RobH: no, you [18:35:31] if ( !$success ) { [18:35:34] New patchset: Lcarr; "adding fake contacts group for icinga testing without paging" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37467 [18:35:36] I think you have an empty desk here or something :) [18:35:38] New patchset: Dzahn; "icinga sprint: reverse order of config files and variables" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37468 [18:35:43] great. so it didn't succeed.. knowing why would be nice [18:35:43] ohhhh, yes [18:35:50] AaronSchulz: i move next month [18:36:03] apergos: check dberrors.log and grep for the host [18:36:04] then there will be three (or four now?) 
rob's in the same room [18:36:11] its gonna be confusing =P [18:36:15] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37468 [18:36:20] I am sure that something got locked or whatever [18:36:24] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37467 [18:37:57] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 1.429 seconds [18:39:32] New patchset: Lcarr; "changing nagios files to fake ones for testing icinga" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37469 [18:39:56] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37469 [18:40:21] RECOVERY - Puppet freshness on marmontel is OK: puppet ran at Fri Dec 7 18:39:54 UTC 2012 [18:41:07] there is an innodb recover in progress over there [18:41:11] it is rejecting all connections [18:41:43] * AaronSchulz really does not like using innodb as a key/value cache [18:42:06] RobH: thanks for all your help today [18:42:27] could be that the space issue crashed it, could be the script, really don't know [18:42:33] once it's recovered I'll try again [18:43:43] ok well that didn't go well [18:43:53] once it's recovered I will touch nothing [18:43:58] heh [18:44:05] has anyone disabled pc1? [18:44:16] not I [18:45:27] we will soon see if it stays up if I do nothing [18:45:37] are users seeing errors? [18:46:08] AaronSchulz: ? 
(I am guessing not) [18:46:25] notpeter: the errors are hidden [18:46:30] AaronSchulz: ok [18:46:37] connection problems might cause slowness though if they happen [18:46:41] ok well, when I do nothing the same thing happens [18:46:43] New patchset: Pyoungmeister; "disabling pc1" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37471 [18:46:44] so far it's just fast errors on REPLACE [18:47:01] so no script, and nothing else to be done by me, someone with actual expertise needs to touch it [18:47:15] me no talk php good. can someone look at that? [18:47:16] in the meantime it's in a crash/recovery loop [18:47:23] notpeter: that will break any cron script [18:47:32] though I guess that's ok for now [18:47:37] if there is a recovery problem [18:47:43] I mean, that box isn't doing anything right now [18:47:51] nope, not anything useful [18:48:02] ok, I'm going to merge [18:48:11] wait [18:48:14] waiting [18:48:18] New patchset: Dzahn; "icinga: fix class relationship after renamed config class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37472 [18:48:20] notpeter: I don't know if having no 'server' param will work [18:48:27] that was asher's suggestion [18:48:29] notpeter: just comment out the whole "1" array key/value [18:48:45] AaronSchulz: ok [18:48:50] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37472 [18:49:22] New review: Aaron Schulz; "Should comment out the whole mysql cache." [operations/mediawiki-config] (master); V: 0 C: -1; - https://gerrit.wikimedia.org/r/37471 [18:50:04] New patchset: Pyoungmeister; "disabling pc1" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37471 [18:50:36] AaronSchulz: ^^ [18:51:30] Change merged: Aaron Schulz; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37471 [18:51:54] AaronSchulz: you got sync? 
[18:52:21] !log aaron synchronized wmf-config/CommonSettings.php 'disabled use of pc1' [18:52:30] Logged the message, Master [18:52:49] woo wooo! [18:53:43] thanks for that [18:53:59] asher asked me to keep it down so it can be upgraded :) [18:55:16] New patchset: Lcarr; "Chanigng class name to icinga::monitor:;configuration::variables" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37474 [18:56:16] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37474 [18:56:40] * apergos vaguely remembers that there might be a config setting to limit the size of the tables or some such [18:56:49] independent of the underlying filesystem [18:56:56] which would be why we are 'out of space' [18:58:04] New patchset: Lcarr; "fixing icinga class name" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37476 [18:58:24] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37476 [18:58:28] yeah [18:59:06] innodb_data_file_path=ibdata1:1500G [18:59:08] there it is [18:59:23] and iirc touching that is not the way to fix this, it has bad consequences [18:59:30] and so once again asher :-( [19:00:12] yeah, I'm cool to wait [19:00:18] site is fine [19:01:18] ok, now I go back to trying to figure out how to get ms-be7 to see its disk [19:02:38] which I might give up on soon, 9 pm already [19:03:02] New patchset: Lcarr; "fixing class names again" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37477 [19:03:19] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37477 [19:05:30] when I look at the megacli output for the new drive and one that was already in there, they look identical [19:05:51] except for the drive serial number, and the slot of course [19:10:33] err: /Stage[main]//Monitor_group[mysql_eqiad]/Nagios_hostgroup[mysql_eqiad]: Could not evaluate: Puppet::Util::FileType::FileTypeF [19:10:33] d not write 
/etc/nagios/puppet_hostgroups.cfg: No such file or directory - /etc/nagios/puppet_hostgroups.cfg [19:10:40] thats the big error we're getting for every group [19:10:40] RECOVERY - HTTP on neon is OK: HTTP OK HTTP/1.1 200 OK - 452 bytes in 0.055 seconds [19:12:06] as well as /etc/nagios/puppet_servicegroups.cfg: [19:12:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:18:36] PROBLEM - Host ms-be3001 is DOWN: PING CRITICAL - Packet loss = 100% [19:20:51] RECOVERY - Host ms-be3001 is UP: PING OK - Packet loss = 0%, RTA = 110.27 ms [19:22:57] RECOVERY - Puppet freshness on neon is OK: puppet ran at Fri Dec 7 19:22:34 UTC 2012 [19:23:12] cmjohnson1: I don't know how to get ms-be7 to see that disk (from linux), it's just not happening [19:23:26] I updated the ticket with the various messages [19:26:14] apergos: did you try to add using MegaCli? [19:27:09] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.868 seconds [19:27:38] to add it to what? [19:27:42] try replacing the missing drive MegaCli -PdReplaceMissing -PhysDrv [E:S] -ArrayN -rowN -aN [19:28:28] but there is no array [19:29:10] err: /Stage[main]/Mysql::Monitor::Percona::Files/File[/etc/nagios/nrpe.d/nrpe_percona.cfg]/ensure: change from absent to file fail [19:29:12] d not set 'file on ensure: No such file or directory - /etc/nagios/nrpe.d/nrpe_percona.cfg.puppettmp_5176 at /var/lib/git/operatio [19:29:13] t/manifests/mysql.pp:286 [19:29:34] !log upgrading pc1 to precise and newer mysql-facebook build [19:29:43] Logged the message, Master [19:29:56] binasher: thank you for taking that over [19:30:04] I hit a brick wall pretty fast [19:30:10] binasher: nuke from orbit! 
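Regarding the innodb_data_file_path=ibdata1:1500G setting quoted at 18:59:06: that value declares a fixed-size system tablespace with no autoextend clause, which is consistent with "table is full" errors appearing while the filesystem itself shows 92%. The sketch below parses the documented file_name:file_size[:autoextend[:max:max_file_size]] form to show the check. The helper names are invented for this example and it ignores paths that themselves contain colons.

```python
UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def parse_size(text):
    """Convert sizes such as '1500G' or '10M' to bytes."""
    text = text.strip().upper()
    if text and text[-1] in UNITS:
        return int(text[:-1]) * UNITS[text[-1]]
    return int(text)


def parse_data_file_path(value):
    """Split an innodb_data_file_path value into per-file settings."""
    files = []
    for spec in value.split(";"):
        parts = spec.strip().split(":")
        autoextend = len(parts) > 2 and parts[2].lower() == "autoextend"
        max_size = None
        if autoextend and len(parts) >= 5 and parts[3].lower() == "max":
            max_size = parse_size(parts[4])
        files.append({
            "name": parts[0],
            "size": parse_size(parts[1]),
            "autoextend": autoextend,
            "max": max_size,
        })
    return files


def can_grow(files):
    """True if any data file is allowed to extend beyond its initial size."""
    return any(f["autoextend"] for f in files)


if __name__ == "__main__":
    files = parse_data_file_path("ibdata1:1500G")
    print(files[0]["size"], can_grow(files))
```

For the value in this log the result is a single 1500G file that cannot grow, so the tablespace can fill up even though the disk still has room.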
[19:30:21] New patchset: Dzahn; "add icinga::nsca package class to icinga to get rid of requirements out of nagios::*" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37480 [19:30:27] PROBLEM - HTTP on neon is CRITICAL: Connection refused [19:30:29] cmjohnson1: do you have any tricks for non-raid disks? [19:31:11] apergos: not w/out going into raid bios [19:31:19] * cmjohnson1 looking at options [19:31:20] and that's the problem [19:33:53] New patchset: Dzahn; "add icinga::nsca package class to icinga to get rid of requirements out of nagios::*" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37480 [19:35:03] New patchset: Lcarr; "siwtching nagios_config_dir to use /etc/icinga" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37482 [19:36:25] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37480 [19:36:59] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37482 [19:41:51] RECOVERY - HTTP on neon is OK: HTTP OK - HTTP/1.1 302 Found - 0.054 second response time [19:45:00] PROBLEM - MySQL disk space on pc1 is CRITICAL: Connection refused by host [19:51:23] New patchset: Lcarr; "making screen not suck so badly" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37486 [19:52:26] New patchset: Lcarr; "making screen not suck so badly" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37486 [19:52:33] ops folks ^^ if i can get some opinions [19:55:09] New patchset: Lcarr; "making screen not suck so badly" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37486 [19:55:48] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37486 [19:57:54] RECOVERY - MySQL disk space on pc1 is OK: DISK OK [20:00:30] hm I guess that was an opinion [20:00:36] PROBLEM - SSH on ms-be3001 is CRITICAL: Connection refused [20:00:37] !log rebooting pc1 [20:00:45] Logged the message, 
Master [20:01:12] 9000 is good [20:01:19] what are the strings? [20:01:39] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:01:48] LeslieCarr: [20:02:22] oh ? [20:02:23] hehe [20:02:35] the string along the bottom ? [20:02:51] PROBLEM - Host pc1 is DOWN: PING CRITICAL - Packet loss = 100% [20:03:11] left side = host, middle = list of term windows, right = date [20:03:23] it's only for running screen as root [20:03:37] now I need a host where it's live so I can check it out [20:03:47] LeslieCarr: have you used byobu ? [20:04:06] except that all the puppet runs are way behind [20:04:27] byobu ? nope [20:04:31] New patchset: Lcarr; "readding pager_testing group" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37490 [20:04:39] RECOVERY - Host pc1 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [20:04:45] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37490 [20:06:09] PROBLEM - Host ms-be3001 is DOWN: PING CRITICAL - Packet loss = 100% [20:10:03] PROBLEM - Host ms-be3002 is DOWN: PING CRITICAL - Packet loss = 100% [20:14:43] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 7.035 seconds [20:15:45] RECOVERY - Host ms-be3002 is UP: PING OK - Packet loss = 0%, RTA = 109.62 ms [20:16:29] Is there a way for me to ask Gerrit "what repos do you host?" Or, failing that, is there a canonical list of the projects that we host in git? [20:16:38] I'm looking for the adminbot source, and google is failing me [20:16:54] Oh! ^demon, you're just in time to answer the question that I just asked... [20:17:04] Is there a way for me to ask Gerrit "what repos do you host?" Or, failing that, is there a canonical list of the projects that we host in git? 
[20:17:29] <^demon> From the UI, there's https://gerrit.wikimedia.org/r/#/admin/groups/ [20:17:47] <^demon> If you want something programmatic, you can `ssh -p 29418 gerrit.wikimedia.org gerrit ls-projects` [20:18:03] ^demon: That link is perfect. Is there a way to navigate to that page or should I just bookmark it? [20:18:09] !log olivneh synchronized php-1.21wmf5/extensions/E3Experiments [20:18:17] Logged the message, Master [20:18:19] <^demon> andrewbogott: "Admin -> Projects" [20:18:30] <^demon> Once we upgrade, it'll get its own "Projects" top-level menu [20:19:07] <^demon> Whoops, and I copy+pasted the wrong URL. You want it ending in /admin/projects/ [20:19:08] <^demon> Sorry [20:19:39] PROBLEM - SSH on ms-be3002 is CRITICAL: Connection refused [20:19:41] Yep, this is all perfectly obvious now that I know about it :) Thanks [20:28:05] RECOVERY - SSH on ms-be3001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [20:28:14] RECOVERY - Host ms-be3001 is UP: PING OK - Packet loss = 0%, RTA = 110.28 ms [20:33:54] Ryan_Lane: Are you going to be in the office today?
[20:34:16] I am right now [20:35:23] lol [20:35:50] * RoanKattouw blames the monitor that's perfectly positioned to hide Ryan from me [20:37:44] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37429 [20:38:53] I was looking for one of our SAP people at work today - they were sat behind a flip chart heh [20:40:05] PROBLEM - NTP on ms-be3002 is CRITICAL: NTP CRITICAL: No response from NTP server [20:48:20] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:00:29] New patchset: Ori.livneh; "Set wgDebugLogGroups for EventLogging" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37494 [21:01:07] Change merged: Ori.livneh; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37494 [21:02:49] !log olivneh synchronized wmf-config/CommonSettings.php 'Updating wfDebugLogGroups dest for EventLogging' [21:02:59] Logged the message, Master [21:03:02] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 2.906 seconds [21:12:54] Change abandoned: Hashar; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/35135 [21:20:03] New patchset: Ori.livneh; "log to $wmfUdp2logDest" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37499 [21:20:41] Change merged: Ori.livneh; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37499 [21:22:29] !log olivneh synchronized wmf-config/CommonSettings.php 'EventLogging log group -> ' [21:22:38] Logged the message, Master [21:31:54] New patchset: Lcarr; "fixing fake contact groups" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37502 [21:34:21] New patchset: Demon; "Disable cron for wikibase client polling" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37504 [21:34:24] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37502
[21:36:40] Change merged: Lcarr; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37196 [21:36:45] Change merged: Lcarr; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37221 [21:36:45] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37504 [21:37:05] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:37:27] !log lcarr synchronized wmf-config/throttle.php [21:37:35] Logged the message, Master [21:38:01] I love it when our network engineers hack PHP files :-) [21:38:48] hehe [21:39:31] I am wondering if we could get puppet on JunOS [21:39:40] and write some pp manifests for the routers and switches [21:40:56] hehe [21:40:59] not really yet [21:41:02] there's some work towards it [21:41:06] that will be the future :) [21:41:48] \O/ [21:42:41] Meanwhile I have a tiny change pending to install liberal-mime-perl on gallium. The aim is to lint check the wikibugs Perl script https://gerrit.wikimedia.org/r/#/c/37419/ [21:42:52] * hashar pings Ryan_Lane ^^^ :D [21:43:28] hashar: yessir [21:43:29] ? [21:43:40] Ryan_Lane: https://gerrit.wikimedia.org/r/#/c/37419/ [21:43:40] are we ready to enable self-registration now? :) [21:43:54] Ryan_Lane: yeah most probably [21:44:01] still want to make sure everything works fine [21:44:06] but so far I have seen no issue [21:44:39] I think I got all jobs migrated to Zuul [21:44:59] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37419 [21:45:04] hashar, I'm afraid I just added you as a reviewer to half a dozen changes [21:45:44] hashar: how do I state in a unit test that certain userrights are expected? [21:45:46] Platonides: I think I merged some already and/or removed myself from some others [21:46:07] not unless you did that in the last 60 seconds :) [21:46:16] oh [21:46:22] hashar: cool, so should I aim to enable it next week? [21:46:35] Ryan_Lane: yes sir.
[21:46:36] they were check-vars.php related [21:46:40] \o/ [21:46:46] Ryan_Lane: still have to migrate some jobs by analytics team apparently. [21:46:51] ah [21:46:56] Ryan_Lane: but that is a matter of an hour or so [21:46:59] cool [21:47:10] thanks for getting this working! [21:47:19] I managed to set up regression jobs too https://integration.mediawiki.org/ci/view/Regressions/ (experimental still) [21:49:12] what's a "regression job" ? [21:49:32] my idea is to run the full test suite after a change has been merged [21:49:45] so if anything fails we know that we introduced a regression in master [21:50:00] maybe I will make it notify people by email [21:51:09] shouldn't jenkins have already run it just before the merge? [21:51:20] yup [21:51:27] it even tests it on latest master [21:51:31] so that is a duplicate [21:51:54] but if it ever fails that means we have trouble in the test suite or that Zuul does not properly test the changes [21:52:51] PROBLEM - Puppet freshness on db59 is CRITICAL: Puppet has not run in the last 10 hours [21:56:44] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.021 seconds [22:00:11] http://neon.wikimedia.org/icinga/ [22:01:48] LeslieCarr, it gives a Whoops! ...
[22:02:16] yeah [22:02:25] pasted in for the peeps sitting here - we're trying to fix it up [22:02:45] PROBLEM - Puppet freshness on analytics1007 is CRITICAL: Puppet has not run in the last 10 hours [22:02:45] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [22:02:45] PROBLEM - Puppet freshness on sq48 is CRITICAL: Puppet has not run in the last 10 hours [22:02:55] DOH [22:02:55] LeslieCarr: [1354908488] Error: Unable to rename file '/var/cache/icinga/icinga.tmp5Ki1Lu' to '/var/lib/icinga/status.dat': Permission denied [22:02:58] [1354908488] Error: Unable to update status data file '/var/lib/icinga/status.dat': Permission denied [22:03:02] I did not know we had icinga [22:03:44] we're working on it [22:04:04] New patchset: Catrope; "Add parsoid to git-deploy config" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37554 [22:04:32] hashar: Icinga sprint [22:04:34] Ryan_Lane: Jenkins says: perl -Mstrict -Mdiagnostics -cw wikibugs || wikibugs syntax OK [22:04:36] Ryan_Lane: thanks! [22:07:23] !log DNS update - add icinga.wm as cname for neon [22:07:34] Logged the message, Master [22:07:44] dns works for me :) [22:14:28] so we need /var/cache/icinga/* and /var/lib/icinga/* [22:14:36] to be writable [22:14:49] cool, i can grab that [22:15:01] specifically /var/lib/icinga/retention.dat [22:15:32] off for the weekend [22:15:40] have a good sprint ! 
[22:15:44] enjoy hashar, cu [22:16:17] New patchset: Ryan Lane; "Turn deployment minion regex into a hash" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37560 [22:16:44] hashar: yw [22:17:34] New patchset: Ryan Lane; "Turn deployment minion regex into a hash" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37560 [22:17:40] New patchset: Lcarr; "fixing some directory permissions" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37561 [22:17:57] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37560 [22:18:01] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37561 [22:18:16] Ryan_Lane: can i merge your change ? [22:18:17] LeslieCarr: one more i think [22:18:21] hashar, what does perl -cw do? [22:18:22] ok [22:18:30] /var/lib/icinga/spool [22:18:44] /var/lib/icinga/spool/checkresults [22:19:28] seems -c check syntax only (runs BEGIN and CHECK blocks) [22:19:30] -w enable many useful warnings [22:19:37] Platonides: -c compile, -w is like "use warnings;" [22:19:43] Platonides: so that is more or less like php -l [22:19:54] so the wikibugs there is the filename to be linted? [22:20:02] Platonides: though that also resolves inclusion of other modules [22:20:12] yeah "wikibugs" is the filename [22:21:15] it has no extension ? [22:22:17] * Platonides adds the command for linting perl to his new pre-commit hook [22:22:36] New patchset: Ryan Lane; "Add deployment pillars to all minions" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37563 [22:25:14] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37563 [22:25:39] New patchset: Reedy; "Adding large.dblist and medium.dblist.
Udate small.dblist" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37565 [22:26:17] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37565 [22:26:56] Platonides: ori-l has some pre commit hook too [22:27:10] I guess we should share them in some repository for anyone to enjoy [22:27:34] anyway, bed time. See you on monday! [22:27:49] New patchset: Jforrester; "(bug 42735) Add gerrit-wm IRC bot to #mediawiki-visualeditor" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37566 [22:29:00] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:30:16] New patchset: Reedy; "Add medium and large dblist symlinks to noc conf" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37568 [22:30:52] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37568 [22:31:17] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37554 [22:34:42] PROBLEM - Puppet freshness on db9 is CRITICAL: Puppet has not run in the last 10 hours [22:42:34] New patchset: Jforrester; "(bug 42735) Add wikibugs IRC bot to #mediawiki-visualeditor" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37570 [22:45:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 2.159 seconds [22:45:28] New patchset: Ryan Lane; "Ensure parsoid is listed in repo regex" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37571 [22:45:59] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37571 [22:48:17] New review: Demon; "Actually now that I look, I don't think this is actually used anywhere (so it's not actually puppeti..." [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/37570 [22:48:47] ^demon|away: Oh well. 
:-( [22:49:17] <^demon|away> Yeah, the class doesn't seem to be included anywhere. [23:00:53] New patchset: Pyoungmeister; "remerging mysql and coredb" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37576 [23:03:21] New patchset: Ryan Lane; "Add support for calling modules in checkout" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37577 [23:04:58] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37577 [23:07:18] New patchset: Ryan Lane; "Add missing variable to class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37578 [23:10:02] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37578 [23:11:01] New patchset: Lcarr; "fixing config dir" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37579 [23:11:28] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37579 [23:11:44] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37576 [23:12:41] New patchset: Pyoungmeister; "Revert "swapping roles on db61 for testing"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37580 [23:15:41] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37580 [23:18:39] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [23:18:39] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [23:19:51] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:22:22] New patchset: Ryan Lane; "Always pass the repo into called functions" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37583 [23:28:34] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37583 [23:29:24] New patchset: Catrope; "Add a Parsoid module for git-deploy" 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/37584 [23:30:10] Ryan_Lane: https://gerrit.wikimedia.org/r/37584 [23:34:33] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 6.486 seconds [23:35:45] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [23:35:45] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [23:40:01] New patchset: Ryan Lane; "Add minion regex into deployment pillars" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37585 [23:40:32] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37585 [23:42:20] New patchset: Catrope; "Add a Parsoid module for git-deploy" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37584 [23:43:23] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37584 [23:46:51] !log olivneh synchronized php-1.21wmf5/extensions/EventLogging 'Fix-up of mtime issue' [23:47:01] Logged the message, Master [23:47:11] New patchset: Ryan Lane; "Add parsoid module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37591 [23:47:35] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37591 [23:52:42] New patchset: Ryan Lane; "Switch " with ' in deployment pillar" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37593 [23:53:11] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37593