[02:12:27] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44844 [02:27:00] !log LocalisationUpdate completed (1.21wmf7) at Mon Jan 21 02:26:59 UTC 2013 [02:27:12] Logged the message, Master [02:50:21] !log LocalisationUpdate completed (1.21wmf8) at Mon Jan 21 02:50:20 UTC 2013 [02:50:32] Logged the message, Master [05:40:27] New patchset: Tim Starling; "Remove unused class "applicationserver_old"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44950 [05:43:36] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44950 [05:46:57] New patchset: Tim Starling; "Don't use deprecated class apaches::packages for blogs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44951 [05:47:52] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44951 [05:51:08] !log on marmontel: removed MW-specific packages php5-wmerrors, php-luasandbox, php-wikidiff2 [05:51:18] Logged the message, Master [06:26:14] New patchset: Tim Starling; "Better way to check for array membership" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44953 [06:31:27] New patchset: Tim Starling; "Test Ic5aab665 by temporarily "decommissioning" hume" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44954 [06:31:42] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44953 [06:31:50] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44954 [06:36:48] New patchset: Tim Starling; "Revert "Test Ic5aab665 by temporarily "decommissioning" hume"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44955 [06:37:14] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44955 [06:40:38] New review: Tim Starling; "hume.wikimedia.org.yaml size reduced by 23%." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44953 [08:14:06] !log jenkins: updating all Jenkins jobs based on d31c92e of integration/jenkins-job-builder-config.git [08:14:17] Logged the message, Master [08:23:25] New patchset: Hashar; "adapt role::cache::mobile for beta" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44709 [08:23:57] New review: Hashar; "PS3: removes role::cache::configuration::beta and integrate the $beta prefixed variables directly in..." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/44709 [08:25:04] poor labsconsole is dead / slow :D [08:25:37] I guess memcached died again on virt0 , I can't check on nagios though [08:53:58] hashar, according to the latest news, its due to borken RAM module [08:55:03] MaxSem: oh good to know [08:55:26] apparently, it sill hasn't been pulled out [08:55:45] oh [08:55:54] MaxSem: mind commenting on https://bugzilla.wikimedia.org/show_bug.cgi?id=42127 please ? [08:56:05] that is the bug for memcache dieing on labsconsole [08:56:53] hashar, I heard this from Ryan so someone with first-hand knowledge is preferred [08:57:09] yeah I will ping andrew this afternoon [08:57:10] I still don't know what did his experiments end with [08:57:11] to find out more [09:11:55] New patchset: Hashar; "adapt role::cache::mobile for beta" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44709 [09:13:12] New review: Hashar; "Makes role::cache::mobile to include role::cache::configuration so that the $beta variables are actu..." 
[operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/44709 [09:13:17] ahh [09:13:20] found the root cause [09:13:22] \O/ [09:30:29] New patchset: Hashar; "adapt role::cache::mobile for beta" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44709 [09:40:25] New patchset: Hashar; "adapt role::cache::mobile for beta" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44709 [09:47:48] New review: Hashar; "Fixed mount options on labs." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/44709 [09:52:23] !log jenkins: jobs refresh completed. [09:52:33] Logged the message, Master [09:57:27] !log relaying Ryan: he restarted ldap on virt0 (was hung after server restart). nscld was properly falling back to virt1000 but ldap was stuck there too. DNS got restarted. [09:57:37] Logged the message, Master [10:19:37] New review: Hashar; "The check_https_lvs macro does not exist in Nagios configuration :-]" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44750 [10:28:47] New patchset: ArielGlenn; "define check_http_lvs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44962 [10:29:36] New patchset: ArielGlenn; "define check_https_lvs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44962 [10:31:22] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44962 [10:50:09] RECOVERY - Puppet freshness on db1028 is OK: puppet ran at Mon Jan 21 10:50:03 UTC 2013 [10:50:09] RECOVERY - Puppet freshness on ms-be6 is OK: puppet ran at Mon Jan 21 10:50:03 UTC 2013 [10:50:09] RECOVERY - Puppet freshness on mw46 is OK: puppet ran at Mon Jan 21 10:50:08 UTC 2013 [10:50:10] RECOVERY - Puppet freshness on amssq34 is OK: puppet ran at Mon Jan 21 10:50:08 UTC 2013 [10:50:20] RECOVERY - Puppet freshness on mw48 is OK: puppet ran at Mon Jan 21 10:50:13 UTC 2013 [10:50:20] RECOVERY - Puppet freshness on ms-be11 is OK: puppet ran at Mon Jan 21 10:50:13 UTC 2013 [10:50:20] RECOVERY - Puppet freshness on mw1075 is OK: puppet ran at Mon Jan 21 10:50:13 UTC 2013 [10:50:20] RECOVERY - Puppet freshness on cp1020 is OK: puppet ran at Mon Jan 21 10:50:18 UTC 2013 [10:50:20] RECOVERY - Puppet freshness on cp1030 is OK: puppet ran at Mon Jan 21 10:50:18 UTC 2013 [10:50:20] RECOVERY - Puppet freshness on analytics1027 is OK: puppet ran at Mon Jan 21 10:50:18 UTC 2013 [10:50:20] RECOVERY - Puppet freshness on mc1004 is OK: puppet ran at Mon Jan 21 10:50:18 UTC 2013 [10:50:21] RECOVERY - Puppet freshness on sq72 is OK: puppet ran at Mon Jan 21 10:50:18 UTC 2013 [10:50:21] RECOVERY - Puppet freshness on sq63 is OK: puppet ran at Mon Jan 21 10:50:18 UTC 2013 [10:50:22] RECOVERY - Puppet freshness on db1043 is OK: puppet ran at Mon Jan 21 10:50:18 UTC 2013 [10:50:29] RECOVERY - Puppet freshness on tarin is OK: puppet ran at Mon Jan 21 10:50:23 UTC 2013 [10:50:29] RECOVERY - Puppet freshness on mw1063 is OK: puppet ran at Mon Jan 21 10:50:23 UTC 2013 [10:50:29] RECOVERY - Frontend Squid HTTP on sq72 is OK: HTTP OK: HTTP/1.0 200 OK - 1283 bytes in 0.056 second response time [10:50:29] RECOVERY - Puppet freshness on cp1019 is OK: puppet ran at Mon Jan 21 10:50:28 UTC 2013 [10:50:29] RECOVERY - Puppet freshness on db1019 is OK: puppet ran at Mon Jan 21 10:50:28 UTC 2013 [10:50:39] RECOVERY - Puppet freshness on es1009 is OK: puppet ran at Mon Jan 21 10:50:33 UTC 2013 [10:50:40] RECOVERY - Puppet freshness on mw45 is OK: puppet ran at Mon Jan 21 10:50:33 UTC 2013 [10:50:40] 
RECOVERY - Puppet freshness on mw1139 is OK: puppet ran at Mon Jan 21 10:50:33 UTC 2013 [10:50:49] RECOVERY - Memcached on virt0 is OK: TCP OK - 0.027 second response time on port 11000 [10:50:50] RECOVERY - Puppet freshness on amssq38 is OK: puppet ran at Mon Jan 21 10:50:43 UTC 2013 [10:50:50] RECOVERY - Puppet freshness on mw1050 is OK: puppet ran at Mon Jan 21 10:50:43 UTC 2013 [10:50:50] RECOVERY - Puppet freshness on mw1057 is OK: puppet ran at Mon Jan 21 10:50:43 UTC 2013 [10:50:50] RECOVERY - Puppet freshness on mw1123 is OK: puppet ran at Mon Jan 21 10:50:48 UTC 2013 [10:51:00] RECOVERY - Puppet freshness on mw33 is OK: puppet ran at Mon Jan 21 10:50:53 UTC 2013 [10:51:00] RECOVERY - Puppet freshness on mw1121 is OK: puppet ran at Mon Jan 21 10:50:53 UTC 2013 [10:51:00] RECOVERY - Puppet freshness on sq43 is OK: puppet ran at Mon Jan 21 10:50:53 UTC 2013 [10:51:00] RECOVERY - Puppet freshness on search1006 is OK: puppet ran at Mon Jan 21 10:50:53 UTC 2013 [10:51:00] RECOVERY - Puppet freshness on mw58 is OK: puppet ran at Mon Jan 21 10:50:58 UTC 2013 [10:51:10] PROBLEM - Puppet freshness on amslvs1 is CRITICAL: Puppet has not run in the last 10 hours [10:51:10] PROBLEM - Puppet freshness on amslvs3 is CRITICAL: Puppet has not run in the last 10 hours [10:51:10] PROBLEM - Puppet freshness on amssq32 is CRITICAL: Puppet has not run in the last 10 hours [10:51:10] PROBLEM - Puppet freshness on aluminium is CRITICAL: Puppet has not run in the last 10 hours [10:51:10] PROBLEM - Puppet freshness on amssq33 is CRITICAL: Puppet has not run in the last 10 hours [10:52:09] hahaha [10:56:57] New patchset: Hashar; "Resource references should now be capitalized" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44964 [10:59:34] apergos: do you know why puppet does not run on those machines ? :/ [11:01:57] no clue [11:09:05] RECOVERY - Backend Squid HTTP on sq72 is OK: HTTP OK HTTP/1.0 200 OK - 1258 bytes in 0.014 seconds [11:09:32] RECOVERY - Frontend Squid HTTP on sq72 is OK: HTTP OK HTTP/1.0 200 OK - 1393 bytes in 0.006 seconds [11:09:41] PROBLEM - Puppet freshness on aluminium is CRITICAL: Puppet has not run in the last 10 hours [11:09:42] PROBLEM - Puppet freshness on amssq35 is CRITICAL: Puppet has not run in the last 10 hours [11:09:42] PROBLEM - Puppet freshness on amssq37 is CRITICAL: Puppet has not run in the last 10 hours [11:09:42] PROBLEM - Puppet freshness on amssq33 is CRITICAL: Puppet has not run in the last 10 hours [11:09:42] PROBLEM - Puppet freshness on amssq31 is CRITICAL: Puppet has not run in the last 10 hours [11:09:42] PROBLEM - Puppet freshness on amssq32 is CRITICAL: Puppet has not run in the last 10 hours [11:09:42] PROBLEM - Puppet freshness on amssq34 is CRITICAL: Puppet has not run in the last 10 hours [11:09:50] yeahyeah [11:10:30] !log nagios was dead over the weekend (config broken), fixed in puppet and on spence, now back in action [11:10:40] Logged the message, Master [11:11:32] congrats [11:12:16] thanks to you [11:34:27] hey [11:34:31] what's going on? 
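The "Resource references should now be capitalized" patchset above refers to a Puppet deprecation: when one resource refers to another (for ordering, notification and the like), the resource type in the reference must be written with a capital letter. A minimal illustrative sketch, not the actual change, with hypothetical resource names:

```puppet
# Deprecated style (lowercase reference), which newer Puppet warns about:
#   require => package['glusterfs-client'],

file { '/etc/logrotate.d/glusterlogs':
    ensure  => present,
    mode    => '0664',
    require => Package['glusterfs-client'],   # capitalized resource reference
}
```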
[11:34:33] got pages [11:34:50] paravoid: nagios got broken [11:34:53] nagios revived after being dead over the weekend [11:35:31] you should be able to ignore it, and sorry for the noise [11:35:33] nice [12:28:15] New patchset: Hashar; "explicit 0664 mode for /etc/logrotate.d/glusterlogs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44972 [12:49:15] New patchset: Hashar; "adapt role::cache::mobile for beta" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44709 [12:52:47] New patchset: Hashar; "adapt role::cache::mobile for beta" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44709 [12:56:55] boo Running VCC-compiler failed, exit 1 [12:56:56] :_D [12:56:59] but I am almost there! [13:14:07] New patchset: Hashar; "adapt role::cache::mobile for beta" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44709 [13:16:21] holyhell [13:16:26] No -T arg in shared memory [13:16:27] again [13:30:23] New patchset: Hashar; "adapt role::cache::mobile for beta" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44709 [13:35:44] New patchset: Hashar; "adapt role::cache::mobile for beta" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44709 [13:44:05] New patchset: Hashar; "adapt role::cache::mobile for beta" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44709 [13:45:17] New patchset: Hashar; "(bug 44041) adapt role::cache::mobile for beta" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44709 [13:55:02] mark: hi around? :-] I am looking how to get logs for a varnish instance I am setting up in labs. [13:55:08] /var/log/varnish is empty :-] [13:57:02] ahh [13:57:07] the default varnishncsa does not run [13:58:04] doh [13:58:20] french national radio talking about Aaron SW.. [14:04:43] New review: Hashar; "Patchset 13 let puppet run properly on the instance and also have the backend/frontend varnish servi..." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/44709 [14:17:24] New patchset: Hashar; "(bug 44118) contint: install pyflakes on gallium" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44974 [14:17:39] !log gallium: manually installed pyflakes {{gerrit|44974}} [14:17:50] Logged the message, Master [14:18:04] New review: Hashar; "already installed pyflakes on gallium. Feel free to merge this whenever you want." [operations/puppet] (production); V: 0 C: 1; - https://gerrit.wikimedia.org/r/44974 [14:51:32] New review: Silke Meyer; "You get the point." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/44690 [15:17:09] wtf [15:21:27] apergos: what did you do with check_lvs_http? [15:21:30] what was broken? [15:21:38] I added the def [15:21:43] it didn't exist previosly [15:21:56] and so nagios would not start, broken configuration [15:22:50] ok [15:26:04] the ms-fe.eqiad check is completely broken [15:26:10] I wonder how it worked so fa [15:26:11] *far [15:26:16] I wonder if it did [15:28:32] it surely didn't page [15:29:37] that's true enough [15:32:38] !log depooling ms-fe1 for testing [15:32:47] Logged the message, Master [15:45:40] !log reedy synchronized php-1.21wmf7/includes/EditPage.php [15:45:44] Can someone poke mw1072, it's asking me for a password when sync-file [15:45:46] mw1072: Permission denied (publickey,password). 
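For context on the check_https_lvs discussion above (the macro that was missing and left Nagios with a broken configuration): Nagios check commands have to be declared before any service check can reference them. The following is only a sketch of such a definition, assuming the standard check_http plugin; the argument layout is an assumption, not the merged change:

```
# Hypothetical Nagios command definition for an HTTPS check against an LVS
# service IP; $ARG1$ would carry the Host header to request.
define command {
    command_name    check_https_lvs
    command_line    $USER1$/check_http -H $ARG1$ -I $HOSTADDRESS$ -u / --ssl
}
```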
[15:45:50] Logged the message, Master [15:48:47] !log reedy synchronized php-1.21wmf8/includes/EditPage.php [15:49:01] Logged the message, Master [15:50:52] New patchset: Hashar; "explicit 0664 mode for /etc/logrotate.d/glusterlogs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44972 [15:52:32] the changes required to the (varnish) puppet manifests for labs support are making me cry [15:53:58] * hashar hands mark a facial tissue [15:54:09] at least you did not hit your forehead with a huuuge facepalm [15:54:19] that's what made me cry actually [15:54:21] ;-) [15:54:45] I have gone with the same hack in use for the bits varnish :/ [15:54:51] i know [15:54:57] all the special casing is not helping at all :( [15:55:01] yeah [15:55:23] ideally we would have all the conf per realm in a different set of file [15:55:28] kind of a configuration database [15:55:34] and just fetch from it whatever value we need [15:55:39] ideally labs wouldn't differ from production much ;) [15:55:47] hehe [15:55:51] different ips but not much else [15:56:01] and different disks in this case :/ [15:56:05] New patchset: Silke Meyer; "Variables for the client config" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44690 [15:56:06] yeah I wonder about that [15:56:10] we might be able to fix that possibly [15:56:26] also, you can usually mount partitions multiple times [15:56:30] so no need to unmount I think [15:58:17] Change abandoned: Hashar; "(no reason)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/36552 [16:00:53] Change abandoned: Hashar; "been made by someone else in the 'newdeploy' branch" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/43964 [16:01:57] New patchset: Hashar; "(bug 44061) initial release" [operations/debs/python-voluptuous] (master) - https://gerrit.wikimedia.org/r/44408 [16:02:11] New review: Hashar; "fix typo in commit summary" [operations/debs/python-voluptuous] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/44408 [16:02:19] New patchset: Ottomata; "Sending blog.wikimedia.org traffic logs to analytics1001 udp2log instance." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44983 [16:02:24] moving out to get my daughter back home [16:02:28] might connect later tonight [16:08:25] hiya paravoid, you got a sec to review this one? [16:08:25] https://gerrit.wikimedia.org/r/#/c/44983/ [16:08:35] it is a simple change, but i think I should not self review it [16:09:04] not because its dangerous, it just feels like someone else should at least say, "hm. ok" [16:09:33] New review: Faidon; "hm. ok" [operations/puppet] (production); V: 0 C: 1; - https://gerrit.wikimedia.org/r/44983 [16:11:50] haha [16:11:51] thanks [16:12:17] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44983 [16:12:35] thanks paravoid! 
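The exchange above about labs ("beta") support in the varnish manifests sketches an approach: keep all realm-specific values in one data structure keyed by realm, so labs and production share the same manifest logic and differ only in data, mostly IPs. A minimal sketch of that idea, with hypothetical class and variable names rather than the actual role::cache::configuration code:

```puppet
class cache_config {
    # Per-realm data; the IPs below are placeholders, not real backends.
    $backends = {
        'production' => [ '10.2.1.1', '10.2.1.2' ],
        'labs'       => [ '10.4.0.1' ],
    }
    # Manifests consume $cache_config::active_backends and never branch
    # on the realm themselves.
    $active_backends = $backends[$::realm]
}
```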
[16:24:29] New review: Reedy; "Needs rebasing and probably somewhat re-doing" [operations/debs/wikimedia-task-appserver] (master) C: -1; - https://gerrit.wikimedia.org/r/43356 [16:26:09] New patchset: Reedy; "Remove $urlprotocol as it's set to """ [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42995 [16:26:31] New patchset: Faidon; "swift: add /monitoring/ to rewrite.py" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44984 [16:27:18] New review: Faidon; "Staged on ms-fe1" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/44984 [16:27:19] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44984 [16:29:11] !log Rerouted AS43821->AS14907 traffic [16:29:23] Logged the message, Master [16:30:36] !log repooling ms-fe1 [16:30:48] Logged the message, Master [16:36:08] !log depooling, restarting and repooling ms-fe2/3/4 one by one [16:36:19] Logged the message, Master [16:52:33] New review: Mwang; "yes" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/44715 [17:04:13] New patchset: Faidon; "Use /monitoring/backend to monitor Swift" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44985 [17:04:18] mark: wanna review that? [17:04:34] k [17:05:22] it's really trivial, but since it touches LVS, varnish and nagios [17:05:32] it's a good idea to have another set of eyes :) [17:07:31] New review: Mark Bergsma; "Looks fine." [operations/puppet] (production); V: 0 C: 1; - https://gerrit.wikimedia.org/r/44985 [17:07:40] thanks. [17:07:43] yes, but pybal doesn't restart on config changes [17:07:54] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44985 [17:07:56] so run puppet on an inactive lvs server (lvs1006), restart that one [17:11:50] New patchset: Reedy; "Update symlinks to PoolCounter, db and mc files to include eqiad/pmtpa" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44986 [17:12:11] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44986 [17:12:25] New patchset: Reedy; "Remove $urlprotocol as it's set to """ [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42995 [17:13:47] that's for Swift mostly [17:13:50] so pmtpa [17:16:38] hm [17:18:12] same there [17:18:13] lvs4 [17:18:16] vs lvs3 [17:18:22] lvs3 is inactive I think [17:23:26] New patchset: Reedy; "Update dbconfigtest" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44987 [17:27:18] !log Rerouted AS14907->AS43821 traffic [17:27:35] Logged the message, Master [17:27:56] New patchset: Reedy; "Update dbconfigtest" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44987 [17:35:11] !log db1038 swapping bad disk (slot 2) with new disk [17:35:22] Logged the message, Master [17:35:57] !log restarting pybal on lvs1002 [17:36:00] New patchset: Reedy; "Update dbconfigtest" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44987 [17:36:03] hrm [17:36:05] just got a page [17:36:06] Logged the message, Master [17:36:15] maybe I shouldn't have done that [17:37:50] looking [17:38:23] the page was for just upload-lb.esams ipv6 [17:38:28] no [17:38:32] also ipv4 [17:38:47] that arrived just now [17:39:16] I didn't restart anything esams-related (or pmtpa-related) [17:39:23] this could be the varnish change [17:39:23] i did routing changes just now [17:39:30] trying to get rid of the packet loss [17:39:44] but 
only upload complained [17:40:04] hrm, no varnish on the esams upload path though [17:40:12] so can't be it [17:40:13] indeed [17:40:55] I had a puppetd -vt running on spence [17:41:17] hrm [17:41:18] it's back [17:41:24] i think this is related to the ongoing packet loss [17:41:29] i'm going to try to find a better path after dinner [17:41:35] i'll keep an eye on my mobile phone [17:41:40] dinner's getting cold [17:41:42] okay [17:42:02] New patchset: Reedy; "Update dbconfigtest" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44987 [17:43:03] restarting nagios-wm [17:43:19] packet loss.. ok, something i can't help with :) off to breakfast [17:43:34] PROBLEM - Frontend Squid HTTP on knsq19 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:43:42] I think it's knsq19 [17:43:43] haha [17:43:43] PROBLEM - Backend Squid HTTP on amssq53 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:43:57] third time in a week [17:44:00] I'm going to reboot the box [17:44:10] RECOVERY - Frontend Squid HTTP on amssq62 is OK: HTTP OK HTTP/1.0 200 OK - 792 bytes in 0.330 seconds [17:44:11] RECOVERY - Frontend Squid HTTP on amssq60 is OK: HTTP OK HTTP/1.0 200 OK - 790 bytes in 0.344 seconds [17:44:11] PROBLEM - Host ms-be1012 is DOWN: PING CRITICAL - Packet loss = 100% [17:44:19] RECOVERY - Backend Squid HTTP on amssq55 is OK: HTTP OK HTTP/1.0 200 OK - 635 bytes in 1.284 seconds [17:44:54] RECOVERY - Backend Squid HTTP on amssq62 is OK: HTTP OK HTTP/1.0 200 OK - 635 bytes in 0.466 seconds [17:44:54] RECOVERY - Backend Squid HTTP on amssq53 is OK: HTTP OK HTTP/1.0 200 OK - 635 bytes in 1.513 seconds [17:45:02] PROBLEM - Backend Squid HTTP on amssq47 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:45:05] I don't think it's the packet loss [17:45:10] what's wrong with it? [17:45:17] I think it's just knsq19 misbehaving [17:45:20] http://ganglia.wikimedia.org/latest/?c=Upload%20squids%20esams&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2 [17:45:20] RECOVERY - Frontend Squid HTTP on amssq55 is OK: HTTP OK HTTP/1.0 200 OK - 795 bytes in 5.361 seconds [17:45:29] PROBLEM - Frontend Squid HTTP on amssq53 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:45:40] see how it spikes in cpu while the rest spike on i/o? [17:45:45] if you see e.g. 
network or packet graphs [17:45:52] New patchset: Reedy; "Update dbconfigtest" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44987 [17:45:52] it also spikes on network while the rest take a dive [17:45:58] I think it's just stops caching [17:46:05] RECOVERY - Frontend Squid HTTP on knsq19 is OK: HTTP OK HTTP/1.0 200 OK - 787 bytes in 9.261 seconds [17:46:10] !log restarting knsq19 backend squid [17:46:14] RECOVERY - LVS HTTPS IPv6 on upload-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 769 bytes in 6.015 seconds [17:46:20] Logged the message, Master [17:46:33] I SMS'ed Reedy to tell you that yesterday (my) night :) [17:46:42] PROBLEM - Host ms-be1010 is DOWN: PING CRITICAL - Packet loss = 100% [17:46:50] I did mention it in the channel ;) [17:47:08] RECOVERY - Frontend Squid HTTP on amssq56 is OK: HTTP OK HTTP/1.0 200 OK - 790 bytes in 2.042 seconds [17:47:08] RECOVERY - Host ms-be1011 is UP: PING OK - Packet loss = 0%, RTA = 26.69 ms [17:47:08] RECOVERY - Host ms-be1012 is UP: PING OK - Packet loss = 0%, RTA = 26.50 ms [17:47:08] RECOVERY - Frontend Squid HTTP on amssq54 is OK: HTTP OK HTTP/1.0 200 OK - 790 bytes in 5.187 seconds [17:47:08] RECOVERY - Frontend Squid HTTP on amssq61 is OK: HTTP OK HTTP/1.0 200 OK - 792 bytes in 6.396 seconds [17:47:17] PROBLEM - Frontend Squid HTTP on amssq48 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:47:18] PROBLEM - Frontend Squid HTTP on amssq47 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:47:44] PROBLEM - LVS HTTP IPv6 on upload-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:48:02] it's rebuilding coss [17:48:17] that would explain the amount of traffic increase to pmtpa [17:48:29] RECOVERY - Backend Squid HTTP on amssq47 is OK: HTTP OK HTTP/1.0 200 OK - 635 bytes in 1.993 seconds [17:48:39] those membufs messages [17:48:47] we may need to increase that param [17:48:54] but i want the packet loss gone first [17:48:56] yeah [17:48:56] RECOVERY - Frontend Squid HTTP on amssq53 is OK: HTTP OK HTTP/1.0 200 OK - 792 bytes in 2.016 seconds [17:49:02] since all kinds of symptoms can arise with packet loss [17:49:03] but this doesn't explain why knsq19 is different than the rest [17:49:05] no [17:49:07] New patchset: Reedy; "Update dbconfigtest" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44987 [17:49:12] depool it if it helps [17:49:15] and it's not just today, last time it was like that too [17:49:22] can't be coincidence [17:49:24] right [17:49:34] we have new varnish servers waiting [17:49:44] I think i'll start on them by the end of the week if all goes well ;) [17:49:51] although they have H310s :-( [17:49:51] you won't wait for H710s? [17:49:58] dunno [17:50:11] coss rebuilt [17:50:17] RECOVERY - Backend Squid HTTP on amssq50 is OK: HTTP OK HTTP/1.0 200 OK - 635 bytes in 0.223 seconds [17:50:17] RECOVERY - Backend Squid HTTP on amssq49 is OK: HTTP OK HTTP/1.0 200 OK - 635 bytes in 7.623 seconds [17:50:33] hm [17:50:37] lots of [17:50:37] 2013/01/21 17:49:32| storeSwapMetaUnpack: bad type (-16)! [17:50:37] 2013/01/21 17:49:34| storeSwapMetaUnpack: insane length (4128785)! [17:50:39] 2013/01/21 17:49:36| storeSwapMetaUnpack: insane length (319172897)! [17:50:43] 2013/01/21 17:49:40| storeSwapMetaUnpack: insane length (321979937)! [17:50:44] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 190 seconds [17:50:46] while rebuilding coss [17:50:49] maybe corrupted cache? 
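The storeSwapMetaUnpack errors seen while the backend rebuilds its COSS store are consistent with a corrupted on-disk cache, and the "membufs" messages refer to a tuning knob on COSS cache directories in Squid 2.x. Purely as an illustration of where that parameter lives (the device and values below are made up, not the production config):

```
# Hypothetical squid.conf fragment: a COSS cache_dir with the membufs
# option; raising membufs gives COSS more in-memory stripe buffers before
# it starts complaining under write pressure.
cache_dir coss /dev/sdb1 60000 max-size=524288 block-size=1024 membufs=20
```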
[17:51:11] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 207 seconds [17:52:14] PROBLEM - Backend Squid HTTP on amssq48 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:52:15] PROBLEM - Backend Squid HTTP on amssq60 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:52:15] PROBLEM - Backend Squid HTTP on amssq55 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:52:18] sigh [17:52:19] traffic to the pmtpa squids is still elevated [17:52:24] 2013/01/21 17:52:09| squidaio_queue_request: WARNING - Disk I/O overloading [17:52:32] RECOVERY - MySQL Replication Heartbeat on db1035 is OK: OK replication delay 13 seconds [17:52:33] RECOVERY - Frontend Squid HTTP on amssq49 is OK: HTTP OK HTTP/1.0 200 OK - 790 bytes in 3.580 seconds [17:52:33] RECOVERY - Host ms-be1010 is UP: PING OK - Packet loss = 0%, RTA = 26.51 ms [17:52:59] RECOVERY - MySQL Slave Delay on db1035 is OK: OK replication delay 0 seconds [17:53:01] I'll stop knsq19 [17:53:22] !log stopping knsq19 backend squid [17:53:32] Logged the message, Master [17:53:53] RECOVERY - Backend Squid HTTP on amssq48 is OK: HTTP OK HTTP/1.0 200 OK - 635 bytes in 5.869 seconds [17:53:54] RECOVERY - Backend Squid HTTP on amssq55 is OK: HTTP OK HTTP/1.0 200 OK - 635 bytes in 6.963 seconds [17:54:02] RECOVERY - Backend Squid HTTP on amssq60 is OK: HTTP OK HTTP/1.0 200 OK - 635 bytes in 9.225 seconds [17:54:20] RECOVERY - Frontend Squid HTTP on amssq50 is OK: HTTP OK HTTP/1.0 200 OK - 792 bytes in 0.515 seconds [17:54:29] RECOVERY - Frontend Squid HTTP on amssq48 is OK: HTTP OK HTTP/1.0 200 OK - 792 bytes in 9.779 seconds [17:54:47] RECOVERY - LVS HTTP IPv6 on upload-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 774 bytes in 4.632 seconds [17:55:50] PROBLEM - Backend Squid HTTP on amssq53 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:55:54] still not ok [17:56:17] PROBLEM - Frontend Squid HTTP on amssq62 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:56:22] ok [17:56:28] gonna create a new path from eu to us [17:57:44] RECOVERY - Backend Squid HTTP on amssq53 is OK: HTTP OK HTTP/1.0 200 OK - 635 bytes in 3.471 seconds [17:57:44] PROBLEM - Backend Squid HTTP on amssq62 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:58:05] PROBLEM - Backend Squid HTTP on knsq19 is CRITICAL: Connection refused [17:58:06] RECOVERY - Frontend Squid HTTP on amssq47 is OK: HTTP OK HTTP/1.0 200 OK - 792 bytes in 8.753 seconds [17:58:06] PROBLEM - Frontend Squid HTTP on amssq55 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:00:54] New patchset: Reedy; "Update dbconfigtest" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44987 [18:02:09] who broke gerrit? Lo [18:02:11] :o [18:02:41] hmmmm, is something very bad broken? [18:02:47] or is it just me? [18:03:26] aude: I suspect it's transit to the US [18:03:28] can't ping any wmf servers [18:03:29] yeah [18:03:41] mark: [18:03:49] who broke wikipedia [18:03:52] oh noes [18:04:03] * aude proxying via the us [18:04:11] wikipedia is fine from the eu :p [18:04:20] drdee is in canada, says it is fine there [18:04:24] wikipedia is fine from here as well [18:04:26] indeed Reedy [18:04:26] trying again [18:04:31] ok, works [18:04:48] mark was playing with transit [18:04:51] mark: was tracking down some packet loss issues [18:05:00] !log Rerouted AS43821->AS14907 traffic [18:05:00] hmmm [18:05:10] Logged the message, Master [18:05:12] meanwhile facebook, etc. 
worked so i know it was not me [18:05:31] ok, good via proxy [18:05:32] not fine for me [18:05:35] New review: Asher; "Reedy is working on making the test generally cover per realm/site db.php here - https://gerrit.wiki..." [operations/mediawiki-config] (master); V: 0 C: -2; - https://gerrit.wikimedia.org/r/44739 [18:05:36] I'm also US [18:08:36] traceoute to Prodego ends at ae0.cr1-eqiad.wikimedia.org [18:08:55] nope, still having trouble via us proxy [18:09:05] * aude forgot to click ok to save my settings [18:09:16] twitter, etc. is fine [18:09:34] also down in the UK according to deskana [18:09:34] aude: is wikipedia down? [18:10:10] Against manganese, it's timing out at xe-4-1-0.was10.ip4.tinet.net [18:10:37] mark: want me to page leslie? [18:11:02] no [18:11:38] 10 208.185.20.118.T01811-04.above.net (208.185.20.118) 49.842 ms 50.285 ms 50.064 ms [18:11:41] it gets stuck there [18:11:44] kaldari: us only [18:11:59] or everywhere served from teh us [18:12:16] aude: Deskana says it is down, but other UK users see it as up [18:12:22] Prodego: right [18:12:31] via ESAMS it's good (amsterdam) [18:12:45] wikipedia fell off BGP [18:12:47] although might not be able to edit [18:13:56] wikimedia didn't fall off bgp at all [18:14:09] PROBLEM - Host ms-be1010 is DOWN: PING CRITICAL - Packet loss = 100% [18:14:20] heh, that sounded exciting [18:15:24] Prodego: things better for you now? [18:15:31] how does it look now? [18:15:36] yep back now [18:15:56] yay I can reach bastion hosts w00t :-D [18:16:32] RECOVERY - Host ms-be1012 is UP: PING OK - Packet loss = 0%, RTA = 26.52 ms [18:17:01] thanks guys [18:18:36] yep, it works now [18:19:59] RECOVERY - Host ms-be1010 is UP: PING OK - Packet loss = 0%, RTA = 26.61 ms [18:21:30] PROBLEM - Frontend Squid HTTP on amssq50 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:23:01] New patchset: Reedy; "Update dbconfigtest" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44987 [18:23:08] RECOVERY - Frontend Squid HTTP on amssq50 is OK: HTTP OK HTTP/1.0 200 OK - 792 bytes in 5.665 seconds [18:26:14] New patchset: Reedy; "Remove var_dump" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44991 [18:26:29] did someone push a bad squid.conf or something [18:26:50] I didn't [18:26:54] didn't get the chance [18:26:56] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44991 [18:27:03] or are all these disks just broken [18:27:20] i guess that is the case [18:27:40] New patchset: Reedy; "Update dbconfigtest" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44987 [18:28:04] !log Rebooting knsq24 [18:28:14] Logged the message, Master [18:28:35] 24? 
[18:28:41] check nagios [18:28:57] not an upload squid though [18:29:35] no there's a bunch [18:29:53] upload esams is still broken [18:30:02] PROBLEM - Backend Squid HTTP on amssq55 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:30:46] text esams doesn't look affected [18:30:56] PROBLEM - Host amssq56 is DOWN: PING CRITICAL - Packet loss = 100% [18:31:42] RECOVERY - Backend Squid HTTP on amssq55 is OK: HTTP OK HTTP/1.0 200 OK - 635 bytes in 2.998 seconds [18:31:42] PROBLEM - Host knsq24 is DOWN: PING CRITICAL - Packet loss = 100% [18:32:02] !log power cycled amssq56 [18:32:12] Logged the message, Master [18:32:14] hmm [18:32:22] looking at the overview [18:32:29] all the ams* ones look borked [18:32:33] but the kn* ones look ok [18:32:47] well s/ok/better/ [18:33:05] http://ganglia.wikimedia.org/latest/?r=4hr&cs=&ce=&m=cpu_report&s=by+name&c=Upload+squids+esams&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4 [18:33:38] PROBLEM - Backend Squid HTTP on amssq49 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:33:54] Change abandoned: Reedy; "(no reason)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44987 [18:34:05] PROBLEM - Frontend Squid HTTP on amssq49 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:34:06] PROBLEM - Frontend Squid HTTP on amssq53 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:34:12] I don't know our network topology well dammit [18:34:22] New review: Reedy; "Mine was really hacky" [operations/mediawiki-config] (master); V: 2 C: 2; - https://gerrit.wikimedia.org/r/44739 [18:34:22] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44739 [18:34:24] PROBLEM - LVS HTTP IPv6 on upload-lb.esams.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway [18:34:46] they're all under memory pressure [18:35:00] RECOVERY - Host amssq56 is UP: PING OK - Packet loss = 0%, RTA = 111.04 ms [18:35:36] RECOVERY - Host knsq24 is UP: PING OK - Packet loss = 0%, RTA = 110.75 ms [18:36:12] RECOVERY - LVS HTTP IPv6 on upload-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 777 bytes in 6.228 seconds [18:36:12] PROBLEM - Host ms-be1011 is DOWN: PING CRITICAL - Packet loss = 100% [18:37:06] RECOVERY - Backend Squid HTTP on amssq49 is OK: HTTP OK HTTP/1.0 200 OK - 635 bytes in 5.885 seconds [18:37:22] don't push any squid configs now [18:37:28] doing manual changes to get backend squids up [18:37:32] !log Started knsq18 minus one disk [18:37:41] RECOVERY - Backend Squid HTTP on knsq18 is OK: HTTP OK HTTP/1.0 200 OK - 632 bytes in 1.223 seconds [18:37:42] Logged the message, Master [18:38:08] PROBLEM - LVS HTTP IPv4 on upload.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:38:17] PROBLEM - Frontend Squid HTTP on knsq21 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:38:18] PROBLEM - Frontend Squid HTTP on knsq22 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:38:32] !log Started knsq16 minus one disk [18:38:41] Logged the message, Master [18:38:53] PROBLEM - Backend Squid HTTP on amssq56 is CRITICAL: Connection refused [18:39:20] PROBLEM - Frontend Squid HTTP on amssq56 is CRITICAL: Connection refused [18:39:21] RECOVERY - Backend Squid HTTP on knsq16 is OK: HTTP OK HTTP/1.0 200 OK - 632 bytes in 0.234 seconds [18:39:29] RECOVERY - Frontend Squid HTTP on amssq53 is OK: HTTP OK HTTP/1.0 200 OK - 792 bytes in 9.232 seconds [18:39:41] New patchset: Reedy; "Rewrite 
testDoNotRemoveLinesInHostsbyname" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44994 [18:39:48] RECOVERY - LVS HTTP IPv4 on upload.esams.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 792 bytes in 2.064 seconds [18:39:56] RECOVERY - Frontend Squid HTTP on knsq21 is OK: HTTP OK HTTP/1.0 200 OK - 789 bytes in 0.346 seconds [18:40:04] !log Started amssq56 squid instances [18:40:06] RECOVERY - Frontend Squid HTTP on knsq22 is OK: HTTP OK HTTP/1.0 200 OK - 789 bytes in 9.390 seconds [18:40:15] Logged the message, Master [18:40:41] RECOVERY - Backend Squid HTTP on amssq56 is OK: HTTP OK HTTP/1.0 200 OK - 635 bytes in 0.371 seconds [18:41:08] RECOVERY - Frontend Squid HTTP on amssq56 is OK: HTTP OK HTTP/1.0 200 OK - 790 bytes in 3.605 seconds [18:41:09] RECOVERY - Frontend Squid HTTP on amssq49 is OK: HTTP OK HTTP/1.0 200 OK - 790 bytes in 6.546 seconds [18:41:38] New patchset: Reedy; "Rewrite testDoNotRemoveLinesInHostsbyname" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44994 [18:42:03] RECOVERY - Host ms-be1011 is UP: PING OK - Packet loss = 0%, RTA = 26.52 ms [18:42:38] PROBLEM - Backend Squid HTTP on amssq49 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:43:05] PROBLEM - Frontend Squid HTTP on amssq55 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:44:04] New patchset: Reedy; "Rewrite testDoNotRemoveLinesInHostsbyname" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44994 [18:44:17] RECOVERY - Backend Squid HTTP on amssq49 is OK: HTTP OK HTTP/1.0 200 OK - 635 bytes in 5.196 seconds [18:44:26] PROBLEM - Backend Squid HTTP on amssq54 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:44:27] PROBLEM - Backend Squid HTTP on amssq61 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:44:36] so quiet [18:44:53] RECOVERY - Frontend Squid HTTP on amssq55 is OK: HTTP OK HTTP/1.0 200 OK - 792 bytes in 9.573 seconds [18:44:54] PROBLEM - Frontend Squid HTTP on amssq50 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:44:55] and we can't even watch the servers whining on ganglia :) [18:45:03] * Damianz pats Nemo [18:45:57] PROBLEM - Host ms-be1012 is DOWN: PING CRITICAL - Packet loss = 100% [18:47:19] New patchset: Reedy; "Rewrite testDoNotRemoveLinesInHostsbyname" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44994 [18:47:54] RECOVERY - Backend Squid HTTP on amssq61 is OK: HTTP OK HTTP/1.0 200 OK - 635 bytes in 5.812 seconds [18:48:29] PROBLEM - Frontend Squid HTTP on amssq48 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:49:41] RECOVERY - Backend Squid HTTP on amssq54 is OK: HTTP OK HTTP/1.0 200 OK - 635 bytes in 0.339 seconds [18:50:09] RECOVERY - Frontend Squid HTTP on amssq48 is OK: HTTP OK HTTP/1.0 200 OK - 792 bytes in 0.636 seconds [18:50:09] RECOVERY - Frontend Squid HTTP on amssq50 is OK: HTTP OK HTTP/1.0 200 OK - 790 bytes in 4.553 seconds [18:50:15] !log Restarted knsq16 backend minus two disks [18:50:26] Logged the message, Master [18:51:47] RECOVERY - Host ms-be1012 is UP: PING OK - Packet loss = 0%, RTA = 26.67 ms [18:53:57] New patchset: Reedy; "Rewrite testDoNotRemoveLinesInHostsbyname" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44994 [18:55:18] !log starting knsq19 backend squid [18:55:28] Logged the message, Master [18:55:49] New patchset: Reedy; "Rewrite testDoNotRemoveLinesInHostsbyname" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44994 [18:55:50] RECOVERY - 
Backend Squid HTTP on knsq19 is OK: HTTP OK HTTP/1.0 200 OK - 633 bytes in 1.274 seconds [18:56:53] PROBLEM - Backend Squid HTTP on amssq49 is CRITICAL: Connection refused [18:58:55] New patchset: Reedy; "Rewrite testDoNotRemoveLinesInHostsbyname" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44994 [18:59:56] New patchset: Reedy; "Rewrite testDoNotRemoveLinesInHostsbyname" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44994 [19:02:32] New patchset: Reedy; "Rewrite testDoNotRemoveLinesInHostsbyname" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44994 [19:03:56] RECOVERY - Backend Squid HTTP on amssq49 is OK: HTTP OK HTTP/1.0 200 OK - 635 bytes in 0.223 seconds [19:06:31] New patchset: Reedy; "Rewrite testDoNotRemoveLinesInHostsbyname" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44994 [19:11:26] PROBLEM - Host ms-be1011 is DOWN: PING CRITICAL - Packet loss = 100% [19:11:51] !log Restarted oversized frontend on amssq50 [19:12:01] Logged the message, Master [19:13:24] PROBLEM - Host ms-be1012 is DOWN: PING CRITICAL - Packet loss = 100% [19:14:34] !log Restarting amssq* upload frontends in a slow loop [19:14:44] Logged the message, Master [19:15:38] PROBLEM - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is CRITICAL: Connection refused [19:16:23] :-P [19:17:13] PROBLEM - Backend Squid HTTP on amssq62 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:18:52] RECOVERY - Backend Squid HTTP on amssq62 is OK: HTTP OK HTTP/1.0 200 OK - 635 bytes in 3.653 seconds [19:19:20] RECOVERY - Host ms-be1012 is UP: PING OK - Packet loss = 0%, RTA = 26.53 ms [19:19:35] PROBLEM - NTP on ms-be1012 is CRITICAL: NTP CRITICAL: No response from NTP server [19:21:58] RECOVERY - Host ms-be1011 is UP: PING OK - Packet loss = 0%, RTA = 26.51 ms [19:24:20] I don't know if this is something that anyone cares about, but if you navigate to https://wikipedia.com you will get a cert error (since you actually are loading .org) [19:34:44] !log deploying squid config for upload's /monitoring/ [19:34:55] Logged the message, Master [19:39:41] PROBLEM - Host ms-be1012 is DOWN: PING CRITICAL - Packet loss = 100% [19:40:19] New review: Andrew Bogott; "So... we want these variables set in" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/44690 [19:40:48] New patchset: Reedy; "Rewrite testDoNotRemoveLinesInHostsbyname" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44994 [19:41:53] !log Restarting knsq* upload frontends manually [19:41:58] erm [19:42:02] I'm pushing a config [19:42:03] Logged the message, Master [19:42:06] i know [19:42:12] New patchset: Reedy; "Rewrite testDoNotRemoveLinesInHostsbyname" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44994 [19:42:46] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44994 [19:44:02] squids are now monitoring swift instead of NFS, woo! 
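"Squids are now monitoring swift instead of NFS" refers to the /monitoring/ change merged earlier: the health-check URL that PyBal's ProxyFetch monitor (and the Nagios LVS checks) fetch now hits a path served by Swift itself rather than a test file on NFS. A rough sketch of the kind of PyBal service stanza involved; the key names follow PyBal's ProxyFetch monitor, but the URL and other values are assumptions, not the deployed config:

```
[upload]
protocol = tcp
port = 80
monitors = [ "ProxyFetch", "IdleConnection" ]
proxyfetch.url = [ 'http://upload.wikimedia.org/monitoring/backend' ]
```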
[19:44:22] yay [19:44:41] now I have to fix / /index.html /favicon.ico /robots.txt [19:45:08] serve from varnish ;-) [19:45:28] well, we still have squids [19:45:31] RECOVERY - Host ms-be1012 is UP: PING OK - Packet loss = 0%, RTA = 26.51 ms [19:45:39] not for long [19:45:40] it's a bit annoying that we have to do changes on both [19:45:42] New review: Andrew Bogott; "I'm happy to take your word for this" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/44972 [19:45:43] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44972 [19:45:49] upload squids are gonna die really soon now [19:50:43] and there goes pybaltestfile.txt [19:51:11] :-( [19:52:51] you don't like how the filename doesn't have pybal anymore? [19:52:55] I can fix that [19:53:00] pybalmonitoringpybal/pybalbackendpybal [19:53:08] haha no i'm kidding [19:53:11] I remember putting that file in [19:53:24] I'm also kidding obviously :) [19:53:51] because i'm /soooo/ pushing pybal to the world ;-p [19:54:04] haha [19:54:12] you really should though [19:54:18] my offer still stands [19:54:26] I'll happily upload it to Debian [19:54:31] some day [19:55:15] there [19:55:19] all frontends restarted [19:55:24] pybal looks a *lot* happier now [19:55:26] PROBLEM - Host ms-be1011 is DOWN: PING CRITICAL - Packet loss = 100% [19:56:02] or maybe whatever triggered it is gone again [19:56:26] triggered what? [19:56:34] the whole outage [19:56:42] the memory usage [19:56:50] it was the 4th time this happened or so [19:57:05] did you restart all frontends? [19:57:13] no [19:57:19] then yeah [20:00:50] so the only remaining worry now is that packet loss [20:01:17] RECOVERY - Host ms-be1011 is UP: PING OK - Packet loss = 0%, RTA = 26.51 ms [20:02:32] New review: Andrew Bogott; "I can't tell the difference between patchset 2 and patchset 1. Am I missing a subtle change, or did..." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/43886 [20:08:24] andrewbogott: I guess that change 43886 is a rebase [20:08:36] andrewbogott: you can tell by looking at the Parent(s) field [20:08:51] Yeah, but right before he sent it Mike told me on IRC he was submitting changes... [20:09:32] The changed parent tells us that it was rebased but not that it's /just/ a rebase, right? 
[20:16:27] yeah [20:16:36] we don't have a script yet to show up it is a trivial rebase [20:16:43] Tim wrote a script that find out the common ancestor [20:16:48] and does a 3 ways diff [20:16:53] can't find it though [20:17:32] maybe git diff [20:17:34] (with 3 dots) [20:19:36] na not that one [20:19:38] bah :( [20:21:22] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 204 seconds [20:22:08] !log Rerouted AS43821->AS14907 traffic [20:22:18] Logged the message, Master [20:22:55] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 230 seconds [20:26:53] interesting [20:27:06] I see packet loss to the upload LVS service IP from everywhere [20:27:11] but not to amslvs2 [20:27:18] same box [20:29:39] RECOVERY - MySQL Replication Heartbeat on db1035 is OK: OK replication delay 0 seconds [20:30:16] RECOVERY - MySQL Slave Delay on db1035 is OK: OK replication delay 0 seconds [20:34:09] PROBLEM - Host ms-be1012 is DOWN: PING CRITICAL - Packet loss = 100% [20:38:04] PROBLEM - Host ms-be1010 is DOWN: PING CRITICAL - Packet loss = 100% [20:40:01] RECOVERY - Host ms-be1012 is UP: PING OK - Packet loss = 0%, RTA = 26.78 ms [20:43:54] RECOVERY - Host ms-be1010 is UP: PING OK - Packet loss = 0%, RTA = 26.53 ms [20:53:48] PROBLEM - Host ms-be1012 is DOWN: PING CRITICAL - Packet loss = 100% [20:56:58] RECOVERY - SSH on ms-be1012 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [20:57:06] RECOVERY - Host ms-be1012 is UP: PING OK - Packet loss = 0%, RTA = 26.60 ms [20:57:43] PROBLEM - Host ms-be1010 is DOWN: PING CRITICAL - Packet loss = 100% [20:57:51] RECOVERY - SSH on ms-be1011 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [21:06:20] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44084 [21:09:34] RECOVERY - Host ms-be1010 is UP: PING OK - Packet loss = 0%, RTA = 26.52 ms [21:10:36] PROBLEM - Puppet freshness on db1047 is CRITICAL: Puppet has not run in the last 10 hours [21:10:37] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [21:10:37] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: Puppet has not run in the last 10 hours [21:10:37] PROBLEM - Puppet freshness on ms-be1011 is CRITICAL: Puppet has not run in the last 10 hours [21:10:37] PROBLEM - Puppet freshness on srv247 is CRITICAL: Puppet has not run in the last 10 hours [21:10:37] PROBLEM - Puppet freshness on ms-be1012 is CRITICAL: Puppet has not run in the last 10 hours [21:10:37] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [21:10:38] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [21:10:39] PROBLEM - Puppet freshness on msfe1002 is CRITICAL: Puppet has not run in the last 10 hours [21:10:39] PROBLEM - Puppet freshness on vanadium is CRITICAL: Puppet has not run in the last 10 hours [21:10:39] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [21:17:31] PROBLEM - Host ms-be1010 is DOWN: PING CRITICAL - Packet loss = 100% [21:18:15] RECOVERY - Host ms-be1009 is UP: PING OK - Packet loss = 0%, RTA = 26.50 ms [21:22:19] PROBLEM - swift-container-auditor on ms-be1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:22:27] PROBLEM - swift-account-reaper on ms-be1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:22:28] PROBLEM - swift-container-replicator on ms-be1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
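The rebase question above (telling whether a new patchset is only a rebase of the previous one) can be answered roughly the way described: find each patchset's common ancestor with the target branch, diff each patchset against its own ancestor, and compare the two diffs. A rough shell sketch, not the script mentioned; the change and ref numbers are only examples:

```sh
# Fetch two patchsets of a Gerrit change (refs/changes/<NN>/<change>/<ps>).
git fetch origin refs/changes/86/43886/1 && ps1=$(git rev-parse FETCH_HEAD)
git fetch origin refs/changes/86/43886/2 && ps2=$(git rev-parse FETCH_HEAD)

# Diff each patchset against its merge-base with the target branch...
git diff "$(git merge-base origin/production "$ps1")" "$ps1" > /tmp/ps1.diff
git diff "$(git merge-base origin/production "$ps2")" "$ps2" > /tmp/ps2.diff

# ...and compare the diffs; if only context and blob hashes differ,
# the new patchset was a pure rebase.
diff -u /tmp/ps1.diff /tmp/ps2.diff
```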
[21:22:28] PROBLEM - swift-object-replicator on ms-be1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:22:42] lol [21:22:54] PROBLEM - swift-container-server on ms-be1009 is CRITICAL: Connection refused by host [21:22:54] PROBLEM - swift-object-server on ms-be1009 is CRITICAL: Connection refused by host [21:22:55] PROBLEM - swift-account-replicator on ms-be1009 is CRITICAL: Connection refused by host [21:23:13] PROBLEM - swift-container-updater on ms-be1009 is CRITICAL: Connection refused by host [21:23:13] PROBLEM - swift-account-server on ms-be1009 is CRITICAL: Connection refused by host [21:23:13] PROBLEM - swift-object-updater on ms-be1009 is CRITICAL: Connection refused by host [21:23:39] PROBLEM - swift-object-auditor on ms-be1009 is CRITICAL: Connection refused by host [21:23:40] PROBLEM - swift-account-auditor on ms-be1009 is CRITICAL: Connection refused by host [21:25:54] RECOVERY - SSH on ms-be1010 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [21:26:03] RECOVERY - Host ms-be1010 is UP: PING OK - Packet loss = 0%, RTA = 26.91 ms [21:26:30] RECOVERY - Puppet freshness on ms-be1010 is OK: puppet ran at Mon Jan 21 21:26:15 UTC 2013 [21:27:42] RECOVERY - Puppet freshness on ms-be1011 is OK: puppet ran at Mon Jan 21 21:27:34 UTC 2013 [21:28:01] RECOVERY - Puppet freshness on ms-be1012 is OK: puppet ran at Mon Jan 21 21:27:44 UTC 2013 [21:32:08] New patchset: Ryan Lane; "Adding info for virt9-11" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45067 [21:35:09] PROBLEM - Host ms-be1011 is DOWN: PING CRITICAL - Packet loss = 100% [21:35:10] PROBLEM - Host ms-be1012 is DOWN: PING CRITICAL - Packet loss = 100% [21:41:00] RECOVERY - Host ms-be1012 is UP: PING OK - Packet loss = 0%, RTA = 26.50 ms [21:41:01] RECOVERY - Host ms-be1011 is UP: PING OK - Packet loss = 0%, RTA = 26.63 ms [21:45:21] PROBLEM - SSH on ms-be1011 is CRITICAL: Connection refused [21:46:15] PROBLEM - SSH on ms-be1012 is CRITICAL: Connection refused [22:03:02] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45067 [22:10:53] PROBLEM - Host ms-be1012 is DOWN: PING CRITICAL - Packet loss = 100% [22:10:53] PROBLEM - Host ms-be1011 is DOWN: PING CRITICAL - Packet loss = 100% [22:13:25] RECOVERY - SSH on ms-be1012 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [22:13:34] RECOVERY - Host ms-be1012 is UP: PING OK - Packet loss = 0%, RTA = 26.56 ms [22:14:27] RECOVERY - SSH on ms-be1011 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [22:14:36] RECOVERY - Host ms-be1011 is UP: PING OK - Packet loss = 0%, RTA = 26.55 ms [22:22:08] New patchset: Ryan Lane; "Disable thin_storeconfigs on virt0" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45072 [22:22:16] PROBLEM - Host ms-be1011 is DOWN: PING CRITICAL - Packet loss = 100% [22:22:45] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45072 [22:23:45] RECOVERY - Host ms-be1011 is UP: PING OK - Packet loss = 0%, RTA = 26.60 ms [22:44:59] !g 44839 | anyone who's bored [22:44:59] anyone who's bored: https://gerrit.wikimedia.org/r/#q,44839,n,z [22:45:45] New patchset: Reedy; "Add wikivoyage to captcha whitelist" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44839 [22:46:48] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44839 [22:46:59] \o/ [22:47:00] New patchset: Ryan Lane; "Fix syntax error in dhcp file" [operations/puppet] 
(production) - https://gerrit.wikimedia.org/r/45073 [22:47:10] anomie: Wikidata is missing too [22:47:16] I'll deploy both [22:48:14] New patchset: Reedy; "Add wikidata to captcha whitelist" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/45074 [22:48:30] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/45074 [22:49:25] !log reedy synchronized wmf-config/CommonSettings.php [22:49:37] Logged the message, Master [22:50:42] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45073 [22:52:10] !log adding virt9-11 entries in dns [22:52:19] Logged the message, Master [23:02:37] New patchset: Tim Starling; "Log request duration on stafford" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45076 [23:03:44] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45076 [23:05:12] Ryan_Lane: ok to deploy these nginx proxy changes? [23:05:21] they stick [23:05:26] I think the repo needs a rebase [23:05:28] or reset [23:06:55] why is the "mikepatch" branch checked out? [23:06:58] mikepatch1 [23:07:29] ugh [23:07:42] someone likely did the incorrect thing [23:08:23] bash history shows some frustration [23:08:38] lots of resets and repeated commands [23:09:21] I'll fix it? [23:09:25] please do [23:11:09] !log on sockpuppet: fixed puppet checkout, switching branch from mikepatch1 to production, and then did fetch&&rebase for good measure [23:11:20] Logged the message, Master [23:11:23] whaa [23:11:43] mike doesn't have root in production [23:11:51] andrewbogott_afk: was it you that merged his changes? [23:16:48] PROBLEM - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is CRITICAL: Connection refused [23:18:08] uh oh http://devopsreactions.tumblr.com/post/37823969926/a-small-infrastructure-change-4pm-friday [23:20:51] paravoid: is that nagios alert a problem? [23:20:57] no [23:21:21] I'm trying to fix it since hours ago but getting distracted [23:21:25] will do before I go to bed [23:21:31] not sure why it started paging today though
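On the "Log request duration on stafford" change above (stafford being a puppetmaster): Apache can record how long each request took by adding %D (microseconds) to the access-log format, which is the usual way to get per-request timing for a puppetmaster served through Apache. A hypothetical fragment assuming mod_log_config, with made-up format and log names, not the merged patch:

```
LogFormat "%h %l %u %t \"%r\" %>s %b %D" timed
CustomLog /var/log/apache2/puppetmaster-access.log timed
```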