[02:12:27] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44844 [02:27:00] !log LocalisationUpdate completed (1.21wmf7) at Mon Jan 21 02:26:59 UTC 2013 [02:27:12] Logged the message, Master [02:50:21] !log LocalisationUpdate completed (1.21wmf8) at Mon Jan 21 02:50:20 UTC 2013 [02:50:32] Logged the message, Master [05:40:27] New patchset: Tim Starling; "Remove unused class "applicationserver_old"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44950 [05:43:36] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44950 [05:46:57] New patchset: Tim Starling; "Don't use deprecated class apaches::packages for blogs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44951 [05:47:52] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44951 [05:51:08] !log on marmontel: removed MW-specific packages php5-wmerrors, php-luasandbox, php-wikidiff2 [05:51:18] Logged the message, Master [06:26:14] New patchset: Tim Starling; "Better way to check for array membership" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44953 [06:31:27] New patchset: Tim Starling; "Test Ic5aab665 by temporarily "decommissioning" hume" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44954 [06:31:42] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44953 [06:31:50] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44954 [06:36:48] New patchset: Tim Starling; "Revert "Test Ic5aab665 by temporarily "decommissioning" hume"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44955 [06:37:14] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44955 [06:40:38] New review: Tim Starling; "hume.wikimedia.org.yaml size reduced by 23%." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44953 [08:14:06] !log jenkins: updating all Jenkins jobs based on d31c92e of integration/jenkins-job-builder-config.git [08:14:17] Logged the message, Master [08:23:25] New patchset: Hashar; "adapt role::cache::mobile for beta" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44709 [08:23:57] New review: Hashar; "PS3: removes role::cache::configuration::beta and integrate the $beta prefixed variables directly in..." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/44709 [08:25:04] poor labsconsole is dead / slow :D [08:25:37] I guess memcached died again on virt0 , I can't check on nagios though [08:53:58] hashar, according to the latest news, its due to borken RAM module [08:55:03] MaxSem: oh good to know [08:55:26] apparently, it sill hasn't been pulled out [08:55:45] oh [08:55:54] MaxSem: mind commenting on https://bugzilla.wikimedia.org/show_bug.cgi?id=42127 please ? [08:56:05] that is the bug for memcache dieing on labsconsole [08:56:53] hashar, I heard this from Ryan so someone with first-hand knowledge is preferred [08:57:09] yeah I will ping andrew this afternoon [08:57:10] I still don't know what did his experiments end with [08:57:11] to find out more [09:11:55] New patchset: Hashar; "adapt role::cache::mobile for beta" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44709 [09:13:12] New review: Hashar; "Makes role::cache::mobile to include role::cache::configuration so that the $beta variables are actu..." 
[operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/44709 [09:13:17] ahh [09:13:20] found the root cause [09:13:22] \O/ [09:30:29] New patchset: Hashar; "adapt role::cache::mobile for beta" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44709 [09:40:25] New patchset: Hashar; "adapt role::cache::mobile for beta" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44709 [09:47:48] New review: Hashar; "Fixed mount options on labs." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/44709 [09:52:23] !log jenkins: jobs refresh completed. [09:52:33] Logged the message, Master [09:57:27] !log relaying Ryan: he restarted ldap on virt0 (was hung after server restart). nscld was properly falling back to virt1000 but ldap was stuck there too. DNS got restarted. [09:57:37] Logged the message, Master [10:19:37] New review: Hashar; "The check_https_lvs macro does not exist in Nagios configuration :-]" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44750 [10:28:47] New patchset: ArielGlenn; "define check_http_lvs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44962 [10:29:36] New patchset: ArielGlenn; "define check_https_lvs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44962 [10:31:22] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44962 [10:50:09] RECOVERY - Puppet freshness on db1028 is OK: puppet ran at Mon Jan 21 10:50:03 UTC 2013 [10:50:09] RECOVERY - Puppet freshness on ms-be6 is OK: puppet ran at Mon Jan 21 10:50:03 UTC 2013 [10:50:09] RECOVERY - Puppet freshness on mw46 is OK: puppet ran at Mon Jan 21 10:50:08 UTC 2013 [10:50:10] RECOVERY - Puppet freshness on amssq34 is OK: puppet ran at Mon Jan 21 10:50:08 UTC 2013 [10:50:20] RECOVERY - Puppet freshness on mw48 is OK: puppet ran at Mon Jan 21 10:50:13 UTC 2013 [10:50:20] RECOVERY - Puppet freshness on ms-be11 is OK: puppet ran at Mon Jan 21 10:50:13 UTC 2013 [10:50:20] RECOVERY - Puppet freshness on mw1075 is OK: puppet ran at Mon Jan 21 10:50:13 UTC 2013 [10:50:20] RECOVERY - Puppet freshness on cp1020 is OK: puppet ran at Mon Jan 21 10:50:18 UTC 2013 [10:50:20] RECOVERY - Puppet freshness on cp1030 is OK: puppet ran at Mon Jan 21 10:50:18 UTC 2013 [10:50:20] RECOVERY - Puppet freshness on analytics1027 is OK: puppet ran at Mon Jan 21 10:50:18 UTC 2013 [10:50:20] RECOVERY - Puppet freshness on mc1004 is OK: puppet ran at Mon Jan 21 10:50:18 UTC 2013 [10:50:21] RECOVERY - Puppet freshness on sq72 is OK: puppet ran at Mon Jan 21 10:50:18 UTC 2013 [10:50:21] RECOVERY - Puppet freshness on sq63 is OK: puppet ran at Mon Jan 21 10:50:18 UTC 2013 [10:50:22] RECOVERY - Puppet freshness on db1043 is OK: puppet ran at Mon Jan 21 10:50:18 UTC 2013 [10:50:29] RECOVERY - Puppet freshness on tarin is OK: puppet ran at Mon Jan 21 10:50:23 UTC 2013 [10:50:29] RECOVERY - Puppet freshness on mw1063 is OK: puppet ran at Mon Jan 21 10:50:23 UTC 2013 [10:50:29] RECOVERY - Frontend Squid HTTP on sq72 is OK: HTTP OK: HTTP/1.0 200 OK - 1283 bytes in 0.056 second response time [10:50:29] RECOVERY - Puppet freshness on cp1019 is OK: puppet ran at Mon Jan 21 10:50:28 UTC 2013 [10:50:29] RECOVERY - Puppet freshness on db1019 is OK: puppet ran at Mon Jan 21 10:50:28 UTC 2013 [10:50:39] RECOVERY - Puppet freshness on es1009 is OK: puppet ran at Mon Jan 21 10:50:33 UTC 2013 [10:50:40] RECOVERY - Puppet freshness on mw45 is OK: puppet ran at Mon Jan 21 10:50:33 UTC 2013 [10:50:40] 
RECOVERY - Puppet freshness on mw1139 is OK: puppet ran at Mon Jan 21 10:50:33 UTC 2013 [10:50:49] RECOVERY - Memcached on virt0 is OK: TCP OK - 0.027 second response time on port 11000 [10:50:50] RECOVERY - Puppet freshness on amssq38 is OK: puppet ran at Mon Jan 21 10:50:43 UTC 2013 [10:50:50] RECOVERY - Puppet freshness on mw1050 is OK: puppet ran at Mon Jan 21 10:50:43 UTC 2013 [10:50:50] RECOVERY - Puppet freshness on mw1057 is OK: puppet ran at Mon Jan 21 10:50:43 UTC 2013 [10:50:50] RECOVERY - Puppet freshness on mw1123 is OK: puppet ran at Mon Jan 21 10:50:48 UTC 2013 [10:51:00] RECOVERY - Puppet freshness on mw33 is OK: puppet ran at Mon Jan 21 10:50:53 UTC 2013 [10:51:00] RECOVERY - Puppet freshness on mw1121 is OK: puppet ran at Mon Jan 21 10:50:53 UTC 2013 [10:51:00] RECOVERY - Puppet freshness on sq43 is OK: puppet ran at Mon Jan 21 10:50:53 UTC 2013 [10:51:00] RECOVERY - Puppet freshness on search1006 is OK: puppet ran at Mon Jan 21 10:50:53 UTC 2013 [10:51:00] RECOVERY - Puppet freshness on mw58 is OK: puppet ran at Mon Jan 21 10:50:58 UTC 2013 [10:51:10] PROBLEM - Puppet freshness on amslvs1 is CRITICAL: Puppet has not run in the last 10 hours [10:51:10] PROBLEM - Puppet freshness on amslvs3 is CRITICAL: Puppet has not run in the last 10 hours [10:51:10] PROBLEM - Puppet freshness on amssq32 is CRITICAL: Puppet has not run in the last 10 hours [10:51:10] PROBLEM - Puppet freshness on aluminium is CRITICAL: Puppet has not run in the last 10 hours [10:51:10] PROBLEM - Puppet freshness on amssq33 is CRITICAL: Puppet has not run in the last 10 hours [10:52:09] hahaha [10:56:57] New patchset: Hashar; "Resource references should now be capitalized" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44964 [10:59:34] apergos: do you know why puppet does not run on those machines ? :/ [11:01:57] no clue [11:09:05] RECOVERY - Backend Squid HTTP on sq72 is OK: HTTP OK HTTP/1.0 200 OK - 1258 bytes in 0.014 seconds [11:09:32] RECOVERY - Frontend Squid HTTP on sq72 is OK: HTTP OK HTTP/1.0 200 OK - 1393 bytes in 0.006 seconds [11:09:41] PROBLEM - Puppet freshness on aluminium is CRITICAL: Puppet has not run in the last 10 hours [11:09:42] PROBLEM - Puppet freshness on amssq35 is CRITICAL: Puppet has not run in the last 10 hours [11:09:42] PROBLEM - Puppet freshness on amssq37 is CRITICAL: Puppet has not run in the last 10 hours [11:09:42] PROBLEM - Puppet freshness on amssq33 is CRITICAL: Puppet has not run in the last 10 hours [11:09:42] PROBLEM - Puppet freshness on amssq31 is CRITICAL: Puppet has not run in the last 10 hours [11:09:42] PROBLEM - Puppet freshness on amssq32 is CRITICAL: Puppet has not run in the last 10 hours [11:09:42] PROBLEM - Puppet freshness on amssq34 is CRITICAL: Puppet has not run in the last 10 hours [11:09:50] yeahyeah [11:10:30] !log nagios was dead over the weekend (config broken), fixed in puppet and on spence, now back in action [11:10:40] Logged the message, Master [11:11:32] congrats [11:12:16] thanks to you [11:34:27] hey [11:34:31] what's going on? 
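The "Resource references should now be capitalized" patchset above refers to a Puppet deprecation: when one resource refers to another (for ordering, notification and the like), the resource type in the reference must be written with a capital letter. A minimal illustrative sketch, not the actual change, with hypothetical resource names:

```puppet
# Deprecated style (lowercase reference), which newer Puppet warns about:
#   require => package['glusterfs-client'],

file { '/etc/logrotate.d/glusterlogs':
    ensure  => present,
    mode    => '0664',
    require => Package['glusterfs-client'],   # capitalized resource reference
}
```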
[11:34:33] got pages [11:34:50] paravoid: nagios got broken [11:34:53] nagios revived after being dead over the weekend [11:35:31] you should be able to ignore it, and sorry for the noise [11:35:33] nice [12:28:15] New patchset: Hashar; "explicit 0664 mode for /etc/logrotate.d/glusterlogs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44972 [12:49:15] New patchset: Hashar; "adapt role::cache::mobile for beta" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44709 [12:52:47] New patchset: Hashar; "adapt role::cache::mobile for beta" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44709 [12:56:55] boo Running VCC-compiler failed, exit 1 [12:56:56] :_D [12:56:59] but I am almost there! [13:14:07] New patchset: Hashar; "adapt role::cache::mobile for beta" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44709 [13:16:21] holyhell [13:16:26] No -T arg in shared memory [13:16:27] again [13:30:23] New patchset: Hashar; "adapt role::cache::mobile for beta" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44709 [13:35:44] New patchset: Hashar; "adapt role::cache::mobile for beta" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44709 [13:44:05] New patchset: Hashar; "adapt role::cache::mobile for beta" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44709 [13:45:17] New patchset: Hashar; "(bug 44041) adapt role::cache::mobile for beta" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44709 [13:55:02] mark: hi around? :-] I am looking how to get logs for a varnish instance I am setting up in labs. [13:55:08] /var/log/varnish is empty :-] [13:57:02] ahh [13:57:07] the default varnishncsa does not run [13:58:04] doh [13:58:20] french national radio talking about Aaron SW.. [14:04:43] New review: Hashar; "Patchset 13 let puppet run properly on the instance and also have the backend/frontend varnish servi..." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/44709 [14:17:24] New patchset: Hashar; "(bug 44118) contint: install pyflakes on gallium" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44974 [14:17:39] !log gallium: manually installed pyflakes {{gerrit|44974}} [14:17:50] Logged the message, Master [14:18:04] New review: Hashar; "already installed pyflakes on gallium. Feel free to merge this whenever you want." [operations/puppet] (production); V: 0 C: 1; - https://gerrit.wikimedia.org/r/44974 [14:51:32] New review: Silke Meyer; "You get the point." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/44690 [15:17:09] wtf [15:21:27] apergos: what did you do with check_lvs_http? [15:21:30] what was broken? [15:21:38] I added the def [15:21:43] it didn't exist previosly [15:21:56] and so nagios would not start, broken configuration [15:22:50] ok [15:26:04] the ms-fe.eqiad check is completely broken [15:26:10] I wonder how it worked so fa [15:26:11] *far [15:26:16] I wonder if it did [15:28:32] it surely didn't page [15:29:37] that's true enough [15:32:38] !log depooling ms-fe1 for testing [15:32:47] Logged the message, Master [15:45:40] !log reedy synchronized php-1.21wmf7/includes/EditPage.php [15:45:44] Can someone poke mw1072, it's asking me for a password when sync-file [15:45:46] mw1072: Permission denied (publickey,password). 
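For context on the check_https_lvs discussion above (the macro that was missing and left Nagios with a broken configuration): Nagios check commands have to be declared before any service check can reference them. The following is only a sketch of such a definition, assuming the standard check_http plugin; the argument layout is an assumption, not the merged change:

```
# Hypothetical Nagios command definition for an HTTPS check against an LVS
# service IP; $ARG1$ would carry the Host header to request.
define command {
    command_name    check_https_lvs
    command_line    $USER1$/check_http -H $ARG1$ -I $HOSTADDRESS$ -u / --ssl
}
```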
[15:45:50] Logged the message, Master [15:48:47] !log reedy synchronized php-1.21wmf8/includes/EditPage.php [15:49:01] Logged the message, Master [15:50:52] New patchset: Hashar; "explicit 0664 mode for /etc/logrotate.d/glusterlogs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44972 [15:52:32] the changes required to the (varnish) puppet manifests for labs support are making me cry [15:53:58] * hashar hands mark a facial tissue [15:54:09] at least you did not hit your forehead with a huuuge facepalm [15:54:19] that's what made me cry actually [15:54:21] ;-) [15:54:45] I have gone with the same hack in use for the bits varnish :/ [15:54:51] i know [15:54:57] all the special casing is not helping at all :( [15:55:01] yeah [15:55:23] ideally we would have all the conf per realm in a different set of file [15:55:28] kind of a configuration database [15:55:34] and just fetch from it whatever value we need [15:55:39] ideally labs wouldn't differ from production much ;) [15:55:47] hehe [15:55:51] different ips but not much else [15:56:01] and different disks in this case :/ [15:56:05] New patchset: Silke Meyer; "Variables for the client config" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44690 [15:56:06] yeah I wonder about that [15:56:10] we might be able to fix that possibly [15:56:26] also, you can usually mount partitions multiple times [15:56:30] so no need to unmount I think [15:58:17] Change abandoned: Hashar; "(no reason)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/36552 [16:00:53] Change abandoned: Hashar; "been made by someone else in the 'newdeploy' branch" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/43964 [16:01:57] New patchset: Hashar; "(bug 44061) initial release" [operations/debs/python-voluptuous] (master) - https://gerrit.wikimedia.org/r/44408 [16:02:11] New review: Hashar; "fix typo in commit summary" [operations/debs/python-voluptuous] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/44408 [16:02:19] New patchset: Ottomata; "Sending blog.wikimedia.org traffic logs to analytics1001 udp2log instance." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44983 [16:02:24] moving out to get my daughter back home [16:02:28] might connect later tonight [16:08:25] hiya paravoid, you got a sec to review this one? [16:08:25] https://gerrit.wikimedia.org/r/#/c/44983/ [16:08:35] it is a simple change, but i think I should not self review it [16:09:04] not because its dangerous, it just feels like someone else should at least say, "hm. ok" [16:09:33] New review: Faidon; "hm. ok" [operations/puppet] (production); V: 0 C: 1; - https://gerrit.wikimedia.org/r/44983 [16:11:50] haha [16:11:51] thanks [16:12:17] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44983 [16:12:35] thanks paravoid! 
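The exchange above about labs ("beta") support in the varnish manifests sketches an approach: keep all realm-specific values in one data structure keyed by realm, so labs and production share the same manifest logic and differ only in data, mostly IPs. A minimal sketch of that idea, with hypothetical class and variable names rather than the actual role::cache::configuration code:

```puppet
class cache_config {
    # Per-realm data; the IPs below are placeholders, not real backends.
    $backends = {
        'production' => [ '10.2.1.1', '10.2.1.2' ],
        'labs'       => [ '10.4.0.1' ],
    }
    # Manifests consume $cache_config::active_backends and never branch
    # on the realm themselves.
    $active_backends = $backends[$::realm]
}
```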
[16:24:29] New review: Reedy; "Needs rebasing and probably somewhat re-doing" [operations/debs/wikimedia-task-appserver] (master) C: -1; - https://gerrit.wikimedia.org/r/43356 [16:26:09] New patchset: Reedy; "Remove $urlprotocol as it's set to """ [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42995 [16:26:31] New patchset: Faidon; "swift: add /monitoring/ to rewrite.py" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44984 [16:27:18] New review: Faidon; "Staged on ms-fe1" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/44984 [16:27:19] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44984 [16:29:11] !log Rerouted AS43821->AS14907 traffic [16:29:23] Logged the message, Master [16:30:36] !log repooling ms-fe1 [16:30:48] Logged the message, Master [16:36:08] !log depooling, restarting and repooling ms-fe2/3/4 one by one [16:36:19] Logged the message, Master [16:52:33] New review: Mwang; "yes" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/44715 [17:04:13] New patchset: Faidon; "Use /monitoring/backend to monitor Swift" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44985 [17:04:18] mark: wanna review that? [17:04:34] k [17:05:22] it's really trivial, but since it touches LVS, varnish and nagios [17:05:32] it's a good idea to have another set of eyes :) [17:07:31] New review: Mark Bergsma; "Looks fine." [operations/puppet] (production); V: 0 C: 1; - https://gerrit.wikimedia.org/r/44985 [17:07:40] thanks. [17:07:43] yes, but pybal doesn't restart on config changes [17:07:54] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44985 [17:07:56] so run puppet on an inactive lvs server (lvs1006), restart that one [17:11:50] New patchset: Reedy; "Update symlinks to PoolCounter, db and mc files to include eqiad/pmtpa" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44986 [17:12:11] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44986 [17:12:25] New patchset: Reedy; "Remove $urlprotocol as it's set to """ [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42995 [17:13:47] that's for Swift mostly [17:13:50] so pmtpa [17:16:38] hm [17:18:12] same there [17:18:13] lvs4 [17:18:16] vs lvs3 [17:18:22] lvs3 is inactive I think [17:23:26] New patchset: Reedy; "Update dbconfigtest" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44987 [17:27:18] !log Rerouted AS14907->AS43821 traffic [17:27:35] Logged the message, Master [17:27:56] New patchset: Reedy; "Update dbconfigtest" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44987 [17:35:11] !log db1038 swapping bad disk (slot 2) with new disk [17:35:22] Logged the message, Master [17:35:57] !log restarting pybal on lvs1002 [17:36:00] New patchset: Reedy; "Update dbconfigtest" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44987 [17:36:03] hrm [17:36:05] just got a page [17:36:06] Logged the message, Master [17:36:15] maybe I shouldn't have done that [17:37:50] looking [17:38:23] the page was for just upload-lb.esams ipv6 [17:38:28] no [17:38:32] also ipv4 [17:38:47] that arrived just now [17:39:16] I didn't restart anything esams-related (or pmtpa-related) [17:39:23] this could be the varnish change [17:39:23] i did routing changes just now [17:39:30] trying to get rid of the packet loss [17:39:44] but 
only upload complained [17:40:04] hrm, no varnish on the esams upload path though [17:40:12] so can't be it [17:40:13] indeed [17:40:55] I had a puppetd -vt running on spence [17:41:17] hrm [17:41:18] it's back [17:41:24] i think this is related to the ongoing packet loss [17:41:29] i'm going to try to find a better path after dinner [17:41:35] i'll keep an eye on my mobile phone [17:41:40] dinner's getting cold [17:41:42] okay [17:42:02] New patchset: Reedy; "Update dbconfigtest" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44987 [17:43:03] restarting nagios-wm [17:43:19] packet loss.. ok, something i can't help with :) off to breakfast [17:43:34] PROBLEM - Frontend Squid HTTP on knsq19 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:43:42] I think it's knsq19 [17:43:43] haha [17:43:43] PROBLEM - Backend Squid HTTP on amssq53 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:43:57] third time in a week [17:44:00] I'm going to reboot the box [17:44:10] RECOVERY - Frontend Squid HTTP on amssq62 is OK: HTTP OK HTTP/1.0 200 OK - 792 bytes in 0.330 seconds [17:44:11] RECOVERY - Frontend Squid HTTP on amssq60 is OK: HTTP OK HTTP/1.0 200 OK - 790 bytes in 0.344 seconds [17:44:11] PROBLEM - Host ms-be1012 is DOWN: PING CRITICAL - Packet loss = 100% [17:44:19] RECOVERY - Backend Squid HTTP on amssq55 is OK: HTTP OK HTTP/1.0 200 OK - 635 bytes in 1.284 seconds [17:44:54] RECOVERY - Backend Squid HTTP on amssq62 is OK: HTTP OK HTTP/1.0 200 OK - 635 bytes in 0.466 seconds [17:44:54] RECOVERY - Backend Squid HTTP on amssq53 is OK: HTTP OK HTTP/1.0 200 OK - 635 bytes in 1.513 seconds [17:45:02] PROBLEM - Backend Squid HTTP on amssq47 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:45:05] I don't think it's the packet loss [17:45:10] what's wrong with it? [17:45:17] I think it's just knsq19 misbehaving [17:45:20] http://ganglia.wikimedia.org/latest/?c=Upload%20squids%20esams&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2 [17:45:20] RECOVERY - Frontend Squid HTTP on amssq55 is OK: HTTP OK HTTP/1.0 200 OK - 795 bytes in 5.361 seconds [17:45:29] PROBLEM - Frontend Squid HTTP on amssq53 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:45:40] see how it spikes in cpu while the rest spike on i/o? [17:45:45] if you see e.g. 
network or packet graphs [17:45:52] New patchset: Reedy; "Update dbconfigtest" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44987 [17:45:52] it also spikes on network while the rest take a dive [17:45:58] I think it's just stops caching [17:46:05] RECOVERY - Frontend Squid HTTP on knsq19 is OK: HTTP OK HTTP/1.0 200 OK - 787 bytes in 9.261 seconds [17:46:10] !log restarting knsq19 backend squid [17:46:14] RECOVERY - LVS HTTPS IPv6 on upload-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 769 bytes in 6.015 seconds [17:46:20] Logged the message, Master [17:46:33] I SMS'ed Reedy to tell you that yesterday (my) night :) [17:46:42] PROBLEM - Host ms-be1010 is DOWN: PING CRITICAL - Packet loss = 100% [17:46:50] I did mention it in the channel ;) [17:47:08] RECOVERY - Frontend Squid HTTP on amssq56 is OK: HTTP OK HTTP/1.0 200 OK - 790 bytes in 2.042 seconds [17:47:08] RECOVERY - Host ms-be1011 is UP: PING OK - Packet loss = 0%, RTA = 26.69 ms [17:47:08] RECOVERY - Host ms-be1012 is UP: PING OK - Packet loss = 0%, RTA = 26.50 ms [17:47:08] RECOVERY - Frontend Squid HTTP on amssq54 is OK: HTTP OK HTTP/1.0 200 OK - 790 bytes in 5.187 seconds [17:47:08] RECOVERY - Frontend Squid HTTP on amssq61 is OK: HTTP OK HTTP/1.0 200 OK - 792 bytes in 6.396 seconds [17:47:17] PROBLEM - Frontend Squid HTTP on amssq48 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:47:18] PROBLEM - Frontend Squid HTTP on amssq47 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:47:44] PROBLEM - LVS HTTP IPv6 on upload-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:48:02] it's rebuilding coss [17:48:17] that would explain the amount of traffic increase to pmtpa [17:48:29] RECOVERY - Backend Squid HTTP on amssq47 is OK: HTTP OK HTTP/1.0 200 OK - 635 bytes in 1.993 seconds [17:48:39] those membufs messages [17:48:47] we may need to increase that param [17:48:54] but i want the packet loss gone first [17:48:56] yeah [17:48:56] RECOVERY - Frontend Squid HTTP on amssq53 is OK: HTTP OK HTTP/1.0 200 OK - 792 bytes in 2.016 seconds [17:49:02] since all kinds of symptoms can arise with packet loss [17:49:03] but this doesn't explain why knsq19 is different than the rest [17:49:05] no [17:49:07] New patchset: Reedy; "Update dbconfigtest" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44987 [17:49:12] depool it if it helps [17:49:15] and it's not just today, last time it was like that too [17:49:22] can't be coincidence [17:49:24] right [17:49:34] we have new varnish servers waiting [17:49:44] I think i'll start on them by the end of the week if all goes well ;) [17:49:51] although they have H310s :-( [17:49:51] you won't wait for H710s? [17:49:58] dunno [17:50:11] coss rebuilt [17:50:17] RECOVERY - Backend Squid HTTP on amssq50 is OK: HTTP OK HTTP/1.0 200 OK - 635 bytes in 0.223 seconds [17:50:17] RECOVERY - Backend Squid HTTP on amssq49 is OK: HTTP OK HTTP/1.0 200 OK - 635 bytes in 7.623 seconds [17:50:33] hm [17:50:37] lots of [17:50:37] 2013/01/21 17:49:32| storeSwapMetaUnpack: bad type (-16)! [17:50:37] 2013/01/21 17:49:34| storeSwapMetaUnpack: insane length (4128785)! [17:50:39] 2013/01/21 17:49:36| storeSwapMetaUnpack: insane length (319172897)! [17:50:43] 2013/01/21 17:49:40| storeSwapMetaUnpack: insane length (321979937)! [17:50:44] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 190 seconds [17:50:46] while rebuilding coss [17:50:49] maybe corrupted cache? 
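The storeSwapMetaUnpack errors seen while the backend rebuilds its COSS store are consistent with a corrupted on-disk cache, and the "membufs" messages refer to a tuning knob on COSS cache directories in Squid 2.x. Purely as an illustration of where that parameter lives (the device and values below are made up, not the production config):

```
# Hypothetical squid.conf fragment: a COSS cache_dir with the membufs
# option; raising membufs gives COSS more in-memory stripe buffers before
# it starts complaining under write pressure.
cache_dir coss /dev/sdb1 60000 max-size=524288 block-size=1024 membufs=20
```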
[17:51:11] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 207 seconds [17:52:14] PROBLEM - Backend Squid HTTP on amssq48 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:52:15] PROBLEM - Backend Squid HTTP on amssq60 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:52:15] PROBLEM - Backend Squid HTTP on amssq55 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:52:18] sigh [17:52:19] traffic to the pmtpa squids is still elevated [17:52:24] 2013/01/21 17:52:09| squidaio_queue_request: WARNING - Disk I/O overloading [17:52:32] RECOVERY - MySQL Replication Heartbeat on db1035 is OK: OK replication delay 13 seconds [17:52:33] RECOVERY - Frontend Squid HTTP on amssq49 is OK: HTTP OK HTTP/1.0 200 OK - 790 bytes in 3.580 seconds [17:52:33] RECOVERY - Host ms-be1010 is UP: PING OK - Packet loss = 0%, RTA = 26.51 ms [17:52:59] RECOVERY - MySQL Slave Delay on db1035 is OK: OK replication delay 0 seconds [17:53:01] I'll stop knsq19 [17:53:22] !log stopping knsq19 backend squid [17:53:32] Logged the message, Master [17:53:53] RECOVERY - Backend Squid HTTP on amssq48 is OK: HTTP OK HTTP/1.0 200 OK - 635 bytes in 5.869 seconds [17:53:54] RECOVERY - Backend Squid HTTP on amssq55 is OK: HTTP OK HTTP/1.0 200 OK - 635 bytes in 6.963 seconds [17:54:02] RECOVERY - Backend Squid HTTP on amssq60 is OK: HTTP OK HTTP/1.0 200 OK - 635 bytes in 9.225 seconds [17:54:20] RECOVERY - Frontend Squid HTTP on amssq50 is OK: HTTP OK HTTP/1.0 200 OK - 792 bytes in 0.515 seconds [17:54:29] RECOVERY - Frontend Squid HTTP on amssq48 is OK: HTTP OK HTTP/1.0 200 OK - 792 bytes in 9.779 seconds [17:54:47] RECOVERY - LVS HTTP IPv6 on upload-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 774 bytes in 4.632 seconds [17:55:50] PROBLEM - Backend Squid HTTP on amssq53 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:55:54] still not ok [17:56:17] PROBLEM - Frontend Squid HTTP on amssq62 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:56:22] ok [17:56:28] gonna create a new path from eu to us [17:57:44] RECOVERY - Backend Squid HTTP on amssq53 is OK: HTTP OK HTTP/1.0 200 OK - 635 bytes in 3.471 seconds [17:57:44] PROBLEM - Backend Squid HTTP on amssq62 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:58:05] PROBLEM - Backend Squid HTTP on knsq19 is CRITICAL: Connection refused [17:58:06] RECOVERY - Frontend Squid HTTP on amssq47 is OK: HTTP OK HTTP/1.0 200 OK - 792 bytes in 8.753 seconds [17:58:06] PROBLEM - Frontend Squid HTTP on amssq55 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:00:54] New patchset: Reedy; "Update dbconfigtest" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44987 [18:02:09] who broke gerrit? Lo [18:02:11] :o [18:02:41] hmmmm, is something very bad broken? [18:02:47] or is it just me? [18:03:26] aude: I suspect it's transit to the US [18:03:28] can't ping any wmf servers [18:03:29] yeah [18:03:41] mark: [18:03:49] who broke wikipedia [18:03:52] oh noes [18:04:03] * aude proxying via the us [18:04:11] wikipedia is fine from the eu :p [18:04:20] drdee is in canada, says it is fine there [18:04:24] wikipedia is fine from here as well [18:04:26] indeed Reedy [18:04:26] trying again [18:04:31] ok, works [18:04:48] mark was playing with transit [18:04:51] mark: was tracking down some packet loss issues [18:05:00] !log Rerouted AS43821->AS14907 traffic [18:05:00] hmmm [18:05:10] Logged the message, Master [18:05:12] meanwhile facebook, etc. 
worked so i know it was not me [18:05:31] ok, good via proxy [18:05:32] not fine for me [18:05:35] New review: Asher; "Reedy is working on making the test generally cover per realm/site db.php here - https://gerrit.wiki..." [operations/mediawiki-config] (master); V: 0 C: -2; - https://gerrit.wikimedia.org/r/44739 [18:05:36] I'm also US [18:08:36] traceoute to Prodego ends at ae0.cr1-eqiad.wikimedia.org [18:08:55] nope, still having trouble via us proxy [18:09:05] * aude forgot to click ok to save my settings [18:09:16] twitter, etc. is fine [18:09:34] also down in the UK according to deskana [18:09:34] aude: is wikipedia down? [18:10:10] Against manganese, it's timing out at xe-4-1-0.was10.ip4.tinet.net [18:10:37] mark: want me to page leslie? [18:11:02] no [18:11:38] 10 208.185.20.118.T01811-04.above.net (208.185.20.118) 49.842 ms 50.285 ms 50.064 ms [18:11:41] it gets stuck there [18:11:44] kaldari: us only [18:11:59] or everywhere served from teh us [18:12:16] aude: Deskana says it is down, but other UK users see it as up [18:12:22] Prodego: right [18:12:31] via ESAMS it's good (amsterdam) [18:12:45] wikipedia fell off BGP [18:12:47] although might not be able to edit [18:13:56] wikimedia didn't fall off bgp at all [18:14:09] PROBLEM - Host ms-be1010 is DOWN: PING CRITICAL - Packet loss = 100% [18:14:20] heh, that sounded exciting [18:15:24] Prodego: things better for you now? [18:15:31] how does it look now? [18:15:36] yep back now [18:15:56] yay I can reach bastion hosts w00t :-D [18:16:32] RECOVERY - Host ms-be1012 is UP: PING OK - Packet loss = 0%, RTA = 26.52 ms [18:17:01] thanks guys [18:18:36] yep, it works now [18:19:59] RECOVERY - Host ms-be1010 is UP: PING OK - Packet loss = 0%, RTA = 26.61 ms [18:21:30] PROBLEM - Frontend Squid HTTP on amssq50 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:23:01] New patchset: Reedy; "Update dbconfigtest" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44987 [18:23:08] RECOVERY - Frontend Squid HTTP on amssq50 is OK: HTTP OK HTTP/1.0 200 OK - 792 bytes in 5.665 seconds [18:26:14] New patchset: Reedy; "Remove var_dump" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44991 [18:26:29] did someone push a bad squid.conf or something [18:26:50] I didn't [18:26:54] didn't get the chance [18:26:56] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44991 [18:27:03] or are all these disks just broken [18:27:20] i guess that is the case [18:27:40] New patchset: Reedy; "Update dbconfigtest" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44987 [18:28:04] !log Rebooting knsq24 [18:28:14] Logged the message, Master [18:28:35] 24? 
[18:28:41] check nagios [18:28:57] not an upload squid though [18:29:35] no there's a bunch [18:29:53] upload esams is still broken [18:30:02] PROBLEM - Backend Squid HTTP on amssq55 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:30:46] text esams doesn't look affected [18:30:56] PROBLEM - Host amssq56 is DOWN: PING CRITICAL - Packet loss = 100% [18:31:42] RECOVERY - Backend Squid HTTP on amssq55 is OK: HTTP OK HTTP/1.0 200 OK - 635 bytes in 2.998 seconds [18:31:42] PROBLEM - Host knsq24 is DOWN: PING CRITICAL - Packet loss = 100% [18:32:02] !log power cycled amssq56 [18:32:12] Logged the message, Master [18:32:14] hmm [18:32:22] looking at the overview [18:32:29] all the ams* ones look borked [18:32:33] but the kn* ones look ok [18:32:47] well s/ok/better/ [18:33:05] http://ganglia.wikimedia.org/latest/?r=4hr&cs=&ce=&m=cpu_report&s=by+name&c=Upload+squids+esams&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4 [18:33:38] PROBLEM - Backend Squid HTTP on amssq49 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:33:54] Change abandoned: Reedy; "(no reason)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44987 [18:34:05] PROBLEM - Frontend Squid HTTP on amssq49 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:34:06] PROBLEM - Frontend Squid HTTP on amssq53 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:34:12] I don't know our network topology well dammit [18:34:22] New review: Reedy; "Mine was really hacky" [operations/mediawiki-config] (master); V: 2 C: 2; - https://gerrit.wikimedia.org/r/44739 [18:34:22] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44739 [18:34:24] PROBLEM - LVS HTTP IPv6 on upload-lb.esams.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway [18:34:46] they're all under memory pressure [18:35:00] RECOVERY - Host amssq56 is UP: PING OK - Packet loss = 0%, RTA = 111.04 ms [18:35:36] RECOVERY - Host knsq24 is UP: PING OK - Packet loss = 0%, RTA = 110.75 ms [18:36:12] RECOVERY - LVS HTTP IPv6 on upload-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 777 bytes in 6.228 seconds [18:36:12] PROBLEM - Host ms-be1011 is DOWN: PING CRITICAL - Packet loss = 100% [18:37:06] RECOVERY - Backend Squid HTTP on amssq49 is OK: HTTP OK HTTP/1.0 200 OK - 635 bytes in 5.885 seconds [18:37:22] don't push any squid configs now [18:37:28] doing manual changes to get backend squids up [18:37:32] !log Started knsq18 minus one disk [18:37:41] RECOVERY - Backend Squid HTTP on knsq18 is OK: HTTP OK HTTP/1.0 200 OK - 632 bytes in 1.223 seconds [18:37:42] Logged the message, Master [18:38:08] PROBLEM - LVS HTTP IPv4 on upload.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:38:17] PROBLEM - Frontend Squid HTTP on knsq21 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:38:18] PROBLEM - Frontend Squid HTTP on knsq22 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:38:32] !log Started knsq16 minus one disk [18:38:41] Logged the message, Master [18:38:53] PROBLEM - Backend Squid HTTP on amssq56 is CRITICAL: Connection refused [18:39:20] PROBLEM - Frontend Squid HTTP on amssq56 is CRITICAL: Connection refused [18:39:21] RECOVERY - Backend Squid HTTP on knsq16 is OK: HTTP OK HTTP/1.0 200 OK - 632 bytes in 0.234 seconds [18:39:29] RECOVERY - Frontend Squid HTTP on amssq53 is OK: HTTP OK HTTP/1.0 200 OK - 792 bytes in 9.232 seconds [18:39:41] New patchset: Reedy; "Rewrite 
testDoNotRemoveLinesInHostsbyname" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44994 [18:39:48] RECOVERY - LVS HTTP IPv4 on upload.esams.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 792 bytes in 2.064 seconds [18:39:56] RECOVERY - Frontend Squid HTTP on knsq21 is OK: HTTP OK HTTP/1.0 200 OK - 789 bytes in 0.346 seconds [18:40:04] !log Started amssq56 squid instances [18:40:06] RECOVERY - Frontend Squid HTTP on knsq22 is OK: HTTP OK HTTP/1.0 200 OK - 789 bytes in 9.390 seconds [18:40:15] Logged the message, Master [18:40:41] RECOVERY - Backend Squid HTTP on amssq56 is OK: HTTP OK HTTP/1.0 200 OK - 635 bytes in 0.371 seconds [18:41:08] RECOVERY - Frontend Squid HTTP on amssq56 is OK: HTTP OK HTTP/1.0 200 OK - 790 bytes in 3.605 seconds [18:41:09] RECOVERY - Frontend Squid HTTP on amssq49 is OK: HTTP OK HTTP/1.0 200 OK - 790 bytes in 6.546 seconds [18:41:38] New patchset: Reedy; "Rewrite testDoNotRemoveLinesInHostsbyname" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44994 [18:42:03] RECOVERY - Host ms-be1011 is UP: PING OK - Packet loss = 0%, RTA = 26.52 ms [18:42:38] PROBLEM - Backend Squid HTTP on amssq49 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:43:05] PROBLEM - Frontend Squid HTTP on amssq55 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:44:04] New patchset: Reedy; "Rewrite testDoNotRemoveLinesInHostsbyname" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44994 [18:44:17] RECOVERY - Backend Squid HTTP on amssq49 is OK: HTTP OK HTTP/1.0 200 OK - 635 bytes in 5.196 seconds [18:44:26] PROBLEM - Backend Squid HTTP on amssq54 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:44:27] PROBLEM - Backend Squid HTTP on amssq61 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:44:36] so quiet [18:44:53] RECOVERY - Frontend Squid HTTP on amssq55 is OK: HTTP OK HTTP/1.0 200 OK - 792 bytes in 9.573 seconds [18:44:54] PROBLEM - Frontend Squid HTTP on amssq50 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:44:55] and we can't even watch the servers whining on ganglia :) [18:45:03] * Damianz pats Nemo [18:45:57] PROBLEM - Host ms-be1012 is DOWN: PING CRITICAL - Packet loss = 100% [18:47:19] New patchset: Reedy; "Rewrite testDoNotRemoveLinesInHostsbyname" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44994 [18:47:54] RECOVERY - Backend Squid HTTP on amssq61 is OK: HTTP OK HTTP/1.0 200 OK - 635 bytes in 5.812 seconds [18:48:29] PROBLEM - Frontend Squid HTTP on amssq48 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:49:41] RECOVERY - Backend Squid HTTP on amssq54 is OK: HTTP OK HTTP/1.0 200 OK - 635 bytes in 0.339 seconds [18:50:09] RECOVERY - Frontend Squid HTTP on amssq48 is OK: HTTP OK HTTP/1.0 200 OK - 792 bytes in 0.636 seconds [18:50:09] RECOVERY - Frontend Squid HTTP on amssq50 is OK: HTTP OK HTTP/1.0 200 OK - 790 bytes in 4.553 seconds [18:50:15] !log Restarted knsq16 backend minus two disks [18:50:26] Logged the message, Master [18:51:47] RECOVERY - Host ms-be1012 is UP: PING OK - Packet loss = 0%, RTA = 26.67 ms [18:53:57] New patchset: Reedy; "Rewrite testDoNotRemoveLinesInHostsbyname" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44994 [18:55:18] !log starting knsq19 backend squid [18:55:28] Logged the message, Master [18:55:49] New patchset: Reedy; "Rewrite testDoNotRemoveLinesInHostsbyname" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44994 [18:55:50] RECOVERY - 
Backend Squid HTTP on knsq19 is OK: HTTP OK HTTP/1.0 200 OK - 633 bytes in 1.274 seconds [18:56:53] PROBLEM - Backend Squid HTTP on amssq49 is CRITICAL: Connection refused [18:58:55] New patchset: Reedy; "Rewrite testDoNotRemoveLinesInHostsbyname" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44994 [18:59:56] New patchset: Reedy; "Rewrite testDoNotRemoveLinesInHostsbyname" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44994 [19:02:32] New patchset: Reedy; "Rewrite testDoNotRemoveLinesInHostsbyname" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44994 [19:03:56] RECOVERY - Backend Squid HTTP on amssq49 is OK: HTTP OK HTTP/1.0 200 OK - 635 bytes in 0.223 seconds [19:06:31] New patchset: Reedy; "Rewrite testDoNotRemoveLinesInHostsbyname" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44994 [19:11:26] PROBLEM - Host ms-be1011 is DOWN: PING CRITICAL - Packet loss = 100% [19:11:51] !log Restarted oversized frontend on amssq50 [19:12:01] Logged the message, Master [19:13:24] PROBLEM - Host ms-be1012 is DOWN: PING CRITICAL - Packet loss = 100% [19:14:34] !log Restarting amssq* upload frontends in a slow loop [19:14:44] Logged the message, Master [19:15:38] PROBLEM - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is CRITICAL: Connection refused [19:16:23] :-P [19:17:13] PROBLEM - Backend Squid HTTP on amssq62 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:18:52] RECOVERY - Backend Squid HTTP on amssq62 is OK: HTTP OK HTTP/1.0 200 OK - 635 bytes in 3.653 seconds [19:19:20] RECOVERY - Host ms-be1012 is UP: PING OK - Packet loss = 0%, RTA = 26.53 ms [19:19:35] PROBLEM - NTP on ms-be1012 is CRITICAL: NTP CRITICAL: No response from NTP server [19:21:58] RECOVERY - Host ms-be1011 is UP: PING OK - Packet loss = 0%, RTA = 26.51 ms [19:24:20] I don't know if this is something that anyone cares about, but if you navigate to https://wikipedia.com you will get a cert error (since you actually are loading .org) [19:34:44] !log deploying squid config for upload's /monitoring/ [19:34:55] Logged the message, Master [19:39:41] PROBLEM - Host ms-be1012 is DOWN: PING CRITICAL - Packet loss = 100% [19:40:19] New review: Andrew Bogott; "So... we want these variables set in" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/44690 [19:40:48] New patchset: Reedy; "Rewrite testDoNotRemoveLinesInHostsbyname" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44994 [19:41:53] !log Restarting knsq* upload frontends manually [19:41:58] erm [19:42:02] I'm pushing a config [19:42:03] Logged the message, Master [19:42:06] i know [19:42:12] New patchset: Reedy; "Rewrite testDoNotRemoveLinesInHostsbyname" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44994 [19:42:46] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44994 [19:44:02] squids are now monitoring swift instead of NFS, woo! 
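"Squids are now monitoring swift instead of NFS" refers to the /monitoring/ change merged earlier: the health-check URL that PyBal's ProxyFetch monitor (and the Nagios LVS checks) fetch now hits a path served by Swift itself rather than a test file on NFS. A rough sketch of the kind of PyBal service stanza involved; the key names follow PyBal's ProxyFetch monitor, but the URL and other values are assumptions, not the deployed config:

```
[upload]
protocol = tcp
port = 80
monitors = [ "ProxyFetch", "IdleConnection" ]
proxyfetch.url = [ 'http://upload.wikimedia.org/monitoring/backend' ]
```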
[19:44:22] yay [19:44:41] now I have to fix / /index.html /favicon.ico /robots.txt [19:45:08] serve from varnish ;-) [19:45:28] well, we still have squids [19:45:31] RECOVERY - Host ms-be1012 is UP: PING OK - Packet loss = 0%, RTA = 26.51 ms [19:45:39] not for long [19:45:40] it's a bit annoying that we have to do changes on both [19:45:42] New review: Andrew Bogott; "I'm happy to take your word for this" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/44972 [19:45:43] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44972 [19:45:49] upload squids are gonna die really soon now [19:50:43] and there goes pybaltestfile.txt [19:51:11] :-( [19:52:51] you don't like how the filename doesn't have pybal anymore? [19:52:55] I can fix that [19:53:00] pybalmonitoringpybal/pybalbackendpybal [19:53:08] haha no i'm kidding [19:53:11] I remember putting that file in [19:53:24] I'm also kidding obviously :) [19:53:51] because i'm /soooo/ pushing pybal to the world ;-p [19:54:04] haha [19:54:12] you really should though [19:54:18] my offer still stands [19:54:26] I'll happily upload it to Debian [19:54:31] some day [19:55:15] there [19:55:19] all frontends restarted [19:55:24] pybal looks a *lot* happier now [19:55:26] PROBLEM - Host ms-be1011 is DOWN: PING CRITICAL - Packet loss = 100% [19:56:02] or maybe whatever triggered it is gone again [19:56:26] triggered what? [19:56:34] the whole outage [19:56:42] the memory usage [19:56:50] it was the 4th time this happened or so [19:57:05] did you restart all frontends? [19:57:13] no [19:57:19] then yeah [20:00:50] so the only remaining worry now is that packet loss [20:01:17] RECOVERY - Host ms-be1011 is UP: PING OK - Packet loss = 0%, RTA = 26.51 ms [20:02:32] New review: Andrew Bogott; "I can't tell the difference between patchset 2 and patchset 1. Am I missing a subtle change, or did..." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/43886 [20:08:24] andrewbogott: I guess that change 43886 is a rebase [20:08:36] andrewbogott: you can tell by looking at the Parent(s) field [20:08:51] Yeah, but right before he sent it Mike told me on IRC he was submitting changes... [20:09:32] The changed parent tells us that it was rebased but not that it's /just/ a rebase, right? 
[20:16:27] yeah [20:16:36] we don't have a script yet to show up it is a trivial rebase [20:16:43] Tim wrote a script that find out the common ancestor [20:16:48] and does a 3 ways diff [20:16:53] can't find it though [20:17:32] maybe git diff [20:17:34] (with 3 dots) [20:19:36] na not that one [20:19:38] bah :( [20:21:22] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 204 seconds [20:22:08] !log Rerouted AS43821->AS14907 traffic [20:22:18] Logged the message, Master [20:22:55] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 230 seconds [20:26:53] interesting [20:27:06] I see packet loss to the upload LVS service IP from everywhere [20:27:11] but not to amslvs2 [20:27:18] same box [20:29:39] RECOVERY - MySQL Replication Heartbeat on db1035 is OK: OK replication delay 0 seconds [20:30:16] RECOVERY - MySQL Slave Delay on db1035 is OK: OK replication delay 0 seconds [20:34:09] PROBLEM - Host ms-be1012 is DOWN: PING CRITICAL - Packet loss = 100% [20:38:04] PROBLEM - Host ms-be1010 is DOWN: PING CRITICAL - Packet loss = 100% [20:40:01] RECOVERY - Host ms-be1012 is UP: PING OK - Packet loss = 0%, RTA = 26.78 ms [20:43:54] RECOVERY - Host ms-be1010 is UP: PING OK - Packet loss = 0%, RTA = 26.53 ms [20:53:48] PROBLEM - Host ms-be1012 is DOWN: PING CRITICAL - Packet loss = 100% [20:56:58] RECOVERY - SSH on ms-be1012 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [20:57:06] RECOVERY - Host ms-be1012 is UP: PING OK - Packet loss = 0%, RTA = 26.60 ms [20:57:43] PROBLEM - Host ms-be1010 is DOWN: PING CRITICAL - Packet loss = 100% [20:57:51] RECOVERY - SSH on ms-be1011 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [21:06:20] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44084 [21:09:34] RECOVERY - Host ms-be1010 is UP: PING OK - Packet loss = 0%, RTA = 26.52 ms [21:10:36] PROBLEM - Puppet freshness on db1047 is CRITICAL: Puppet has not run in the last 10 hours [21:10:37] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [21:10:37] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: Puppet has not run in the last 10 hours [21:10:37] PROBLEM - Puppet freshness on ms-be1011 is CRITICAL: Puppet has not run in the last 10 hours [21:10:37] PROBLEM - Puppet freshness on srv247 is CRITICAL: Puppet has not run in the last 10 hours [21:10:37] PROBLEM - Puppet freshness on ms-be1012 is CRITICAL: Puppet has not run in the last 10 hours [21:10:37] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [21:10:38] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [21:10:39] PROBLEM - Puppet freshness on msfe1002 is CRITICAL: Puppet has not run in the last 10 hours [21:10:39] PROBLEM - Puppet freshness on vanadium is CRITICAL: Puppet has not run in the last 10 hours [21:10:39] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [21:17:31] PROBLEM - Host ms-be1010 is DOWN: PING CRITICAL - Packet loss = 100% [21:18:15] RECOVERY - Host ms-be1009 is UP: PING OK - Packet loss = 0%, RTA = 26.50 ms [21:22:19] PROBLEM - swift-container-auditor on ms-be1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:22:27] PROBLEM - swift-account-reaper on ms-be1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:22:28] PROBLEM - swift-container-replicator on ms-be1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
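The rebase question above (telling whether a new patchset is only a rebase of the previous one) can be answered roughly the way described: find each patchset's common ancestor with the target branch, diff each patchset against its own ancestor, and compare the two diffs. A rough shell sketch, not the script mentioned; the change and ref numbers are only examples:

```sh
# Fetch two patchsets of a Gerrit change (refs/changes/<NN>/<change>/<ps>).
git fetch origin refs/changes/86/43886/1 && ps1=$(git rev-parse FETCH_HEAD)
git fetch origin refs/changes/86/43886/2 && ps2=$(git rev-parse FETCH_HEAD)

# Diff each patchset against its merge-base with the target branch...
git diff "$(git merge-base origin/production "$ps1")" "$ps1" > /tmp/ps1.diff
git diff "$(git merge-base origin/production "$ps2")" "$ps2" > /tmp/ps2.diff

# ...and compare the diffs; if only context and blob hashes differ,
# the new patchset was a pure rebase.
diff -u /tmp/ps1.diff /tmp/ps2.diff
```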
[21:22:28] PROBLEM - swift-object-replicator on ms-be1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:22:42] lol [21:22:54] PROBLEM - swift-container-server on ms-be1009 is CRITICAL: Connection refused by host [21:22:54] PROBLEM - swift-object-server on ms-be1009 is CRITICAL: Connection refused by host [21:22:55] PROBLEM - swift-account-replicator on ms-be1009 is CRITICAL: Connection refused by host [21:23:13] PROBLEM - swift-container-updater on ms-be1009 is CRITICAL: Connection refused by host [21:23:13] PROBLEM - swift-account-server on ms-be1009 is CRITICAL: Connection refused by host [21:23:13] PROBLEM - swift-object-updater on ms-be1009 is CRITICAL: Connection refused by host [21:23:39] PROBLEM - swift-object-auditor on ms-be1009 is CRITICAL: Connection refused by host [21:23:40] PROBLEM - swift-account-auditor on ms-be1009 is CRITICAL: Connection refused by host [21:25:54] RECOVERY - SSH on ms-be1010 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [21:26:03] RECOVERY - Host ms-be1010 is UP: PING OK - Packet loss = 0%, RTA = 26.91 ms [21:26:30] RECOVERY - Puppet freshness on ms-be1010 is OK: puppet ran at Mon Jan 21 21:26:15 UTC 2013 [21:27:42] RECOVERY - Puppet freshness on ms-be1011 is OK: puppet ran at Mon Jan 21 21:27:34 UTC 2013 [21:28:01] RECOVERY - Puppet freshness on ms-be1012 is OK: puppet ran at Mon Jan 21 21:27:44 UTC 2013 [21:32:08] New patchset: Ryan Lane; "Adding info for virt9-11" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45067 [21:35:09] PROBLEM - Host ms-be1011 is DOWN: PING CRITICAL - Packet loss = 100% [21:35:10] PROBLEM - Host ms-be1012 is DOWN: PING CRITICAL - Packet loss = 100% [21:41:00] RECOVERY - Host ms-be1012 is UP: PING OK - Packet loss = 0%, RTA = 26.50 ms [21:41:01] RECOVERY - Host ms-be1011 is UP: PING OK - Packet loss = 0%, RTA = 26.63 ms [21:45:21] PROBLEM - SSH on ms-be1011 is CRITICAL: Connection refused [21:46:15] PROBLEM - SSH on ms-be1012 is CRITICAL: Connection refused [22:03:02] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45067 [22:10:53] PROBLEM - Host ms-be1012 is DOWN: PING CRITICAL - Packet loss = 100% [22:10:53] PROBLEM - Host ms-be1011 is DOWN: PING CRITICAL - Packet loss = 100% [22:13:25] RECOVERY - SSH on ms-be1012 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [22:13:34] RECOVERY - Host ms-be1012 is UP: PING OK - Packet loss = 0%, RTA = 26.56 ms [22:14:27] RECOVERY - SSH on ms-be1011 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [22:14:36] RECOVERY - Host ms-be1011 is UP: PING OK - Packet loss = 0%, RTA = 26.55 ms [22:22:08] New patchset: Ryan Lane; "Disable thin_storeconfigs on virt0" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45072 [22:22:16] PROBLEM - Host ms-be1011 is DOWN: PING CRITICAL - Packet loss = 100% [22:22:45] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45072 [22:23:45] RECOVERY - Host ms-be1011 is UP: PING OK - Packet loss = 0%, RTA = 26.60 ms [22:44:59] !g 44839 | anyone who's bored [22:44:59] anyone who's bored: https://gerrit.wikimedia.org/r/#q,44839,n,z [22:45:45] New patchset: Reedy; "Add wikivoyage to captcha whitelist" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44839 [22:46:48] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44839 [22:46:59] \o/ [22:47:00] New patchset: Ryan Lane; "Fix syntax error in dhcp file" [operations/puppet] 
(production) - https://gerrit.wikimedia.org/r/45073 [22:47:10] anomie: Wikidata is missing too [22:47:16] I'll deploy both [22:48:14] New patchset: Reedy; "Add wikidata to captcha whitelist" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/45074 [22:48:30] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/45074 [22:49:25] !log reedy synchronized wmf-config/CommonSettings.php [22:49:37] Logged the message, Master [22:50:42] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45073 [22:52:10] !log adding virt9-11 entries in dns [22:52:19] Logged the message, Master [23:02:37] New patchset: Tim Starling; "Log request duration on stafford" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45076 [23:03:44] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45076 [23:05:12] Ryan_Lane: ok to deploy these nginx proxy changes? [23:05:21] they stick [23:05:26] I think the repo needs a rebase [23:05:28] or reset [23:06:55] why is the "mikepatch" branch checked out? [23:06:58] mikepatch1 [23:07:29] ugh [23:07:42] someone likely did the incorrect thing [23:08:23] bash history shows some frustration [23:08:38] lots of resets and repeated commands [23:09:21] I'll fix it? [23:09:25] please do [23:11:09] !log on sockpuppet: fixed puppet checkout, switching branch from mikepatch1 to production, and then did fetch&&rebase for good measure [23:11:20] Logged the message, Master [23:11:23] whaa [23:11:43] mike doesn't have root in production [23:11:51] andrewbogott_afk: was it you that merged his changes? [23:16:48] PROBLEM - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is CRITICAL: Connection refused [23:18:08] uh oh http://devopsreactions.tumblr.com/post/37823969926/a-small-infrastructure-change-4pm-friday [23:20:51] paravoid: is that nagios alert a problem? [23:20:57] no [23:21:21] I'm trying to fix it since hours ago but getting distracted [23:21:25] will do before I go to bed [23:21:31] not sure why it started paging today though
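On the "Log request duration on stafford" change above (stafford being a puppetmaster): Apache can record how long each request took by adding %D (microseconds) to the access-log format, which is the usual way to get per-request timing for a puppetmaster served through Apache. A hypothetical fragment assuming mod_log_config, with made-up format and log names, not the merged patch:

```
LogFormat "%h %l %u %t \"%r\" %>s %b %D" timed
CustomLog /var/log/apache2/puppetmaster-access.log timed
```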