[00:04:13] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:18:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 4.425 seconds [00:52:40] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:08:53] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.028 seconds [01:40:23] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:42:20] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 306 seconds [01:45:20] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 2 seconds [01:52:50] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.307 seconds [02:14:17] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [02:14:17] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [02:25:12] !log LocalisationUpdate completed (1.21wmf1) at Mon Oct 15 02:25:12 UTC 2012 [02:25:28] Logged the message, Master [02:27:20] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:38:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.031 seconds [02:49:14] PROBLEM - Puppet freshness on cp1040 is CRITICAL: Puppet has not run in the last 10 hours [03:31:52] Brooke: you appear to be spreading sex a little too enthusiastically [03:34:16] PROBLEM - SSH on amslvs1 is CRITICAL: Server answer: [03:35:46] RECOVERY - SSH on amslvs1 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [03:47:45] ikr [04:10:34] PROBLEM - Apache HTTP on srv224 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:11:10] PROBLEM - Apache HTTP on srv223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:11:19] PROBLEM - Apache HTTP on srv221 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:13:25] PROBLEM - LVS HTTP IPv4 on rendering.svc.pmtpa.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:13:34] PROBLEM - Apache HTTP on srv222 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:13:34] PROBLEM - Apache HTTP on srv219 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:14:19] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [04:14:26] ok, who broke the site ? [04:14:46] PROBLEM - Apache HTTP on srv220 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:16:52] PROBLEM - Swift HTTP on ms-fe2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:16:52] PROBLEM - LVS HTTP IPv4 on ms-fe.pmtpa.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:17:08] looks like thumbnails again .. [04:17:10] PROBLEM - Swift HTTP on ms-fe3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:17:18] wtf [04:17:37] PROBLEM - Swift HTTP on ms-fe4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:18:35] hrm, those machines are unhappy [04:18:46] memory leak [04:18:48] probably [04:19:34] apergos - there? [04:20:19] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused [04:20:28] the swift machines are totally swapping [04:20:57] fe [04:21:02] yeah [04:21:24] try rebooting one of them? 
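[Editor's note, not part of the log] The exchange that follows settles on restarting the Swift proxies rather than rebooting the ms-fe hosts, based on the observation that the swift-proxy-server processes were sitting in swap. A minimal sketch of how one might confirm that per-process (run as root on an ms-fe host; the process name match and use of smaps are assumptions, not commands taken from the log):

```bash
#!/bin/bash
# Sum the per-mapping "Swap:" counters in /proc/<pid>/smaps for every
# swift-proxy-server process, to see how much of each is paged out.
for pid in $(pgrep -f swift-proxy-server); do
    swap_kb=$(awk '/^Swap:/ {sum += $2} END {print sum+0}' "/proc/$pid/smaps")
    printf '%s\t%s kB in swap\n' "$pid" "$swap_kb"
done
```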
[04:21:45] i think restarting the swift processes would be a lot safer than rebooting one [04:22:53] all the swift-proxy-server processes are in swap [04:25:07] RECOVERY - LVS HTTP IPv4 on ms-fe.pmtpa.wmnet is OK: HTTP OK HTTP/1.1 200 OK - 397 bytes in 0.012 seconds [04:25:52] RECOVERY - Swift HTTP on ms-fe4 is OK: HTTP OK HTTP/1.1 200 OK - 397 bytes in 0.027 seconds [04:25:53] restart them - just swift proxy [04:26:05] looks like u did that [04:26:13] !log restarted swift proxy on ms-fe4 - it was in swap doom [04:26:23] yeah, tim is looking though, don't want to interrupt his debugging process [04:26:26] Logged the message, Mistress of the network gear. [04:26:39] can the site break when i haven't had a lot of wine please ? ;) [04:29:46] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.350 second response time [04:30:31] RECOVERY - Swift HTTP on ms-fe3 is OK: HTTP OK HTTP/1.1 200 OK - 397 bytes in 0.020 seconds [04:32:10] RECOVERY - Apache HTTP on srv224 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 6.020 second response time [04:35:01] RECOVERY - Swift HTTP on ms-fe2 is OK: HTTP OK HTTP/1.1 200 OK - 397 bytes in 2.138 seconds [04:37:07] PROBLEM - Apache HTTP on srv224 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:44:12] all good, i'm outtie [05:02:19] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours [05:08:19] PROBLEM - Puppet freshness on db42 is CRITICAL: Puppet has not run in the last 10 hours [05:19:07] PROBLEM - Swift HTTP on ms-fe2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:20:37] RECOVERY - Swift HTTP on ms-fe2 is OK: HTTP OK HTTP/1.1 200 OK - 397 bytes in 0.012 seconds [05:25:28] really [05:35:54] PROBLEM - Swift HTTP on ms-fe4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:37:24] RECOVERY - Swift HTTP on ms-fe4 is OK: HTTP OK HTTP/1.1 200 OK - 397 bytes in 0.012 seconds [05:44:54] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:48:30] PROBLEM - LVS HTTP IPv4 on ms-fe.pmtpa.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:49:09] !log ms-be12 reporting errors on disk /dev/sdi, I've rt'ed it an unmounted it in the meantime [05:49:22] Logged the message, Master [05:50:00] RECOVERY - LVS HTTP IPv4 on ms-fe.pmtpa.wmnet is OK: HTTP OK HTTP/1.1 200 OK - 397 bytes in 0.016 seconds [05:54:30] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 8.707 seconds [06:02:36] I need to go stand in line to see about my permit again, I've put itoff for awhile due to various things, and there is another series of strikes coming so I need to get it done before that [06:02:47] see folks probably around mid-day [06:06:48] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 210 seconds [06:06:57] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 215 seconds [06:10:06] PROBLEM - Swift HTTP on ms-fe1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:11:36] RECOVERY - Swift HTTP on ms-fe1 is OK: HTTP OK HTTP/1.1 200 OK - 397 bytes in 0.009 seconds [06:11:36] RECOVERY - MySQL Replication Heartbeat on db1035 is OK: OK replication delay 0 seconds [06:11:36] RECOVERY - MySQL Slave Delay on db1035 is OK: OK replication delay 0 seconds [06:13:38] slowness on swift is causing lock wait timeouts on various wikis [06:14:37] apparently a lock is acquired with LocalFile::lock() and then held while swift communication 
is done [06:15:31] but swift just hangs for a long long time before aborting the connection [06:20:00] PROBLEM - LVS HTTP IPv4 on ms-fe.pmtpa.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:21:11] pybal is flapping them [06:21:39] RECOVERY - LVS HTTP IPv4 on ms-fe.pmtpa.wmnet is OK: HTTP OK HTTP/1.1 200 OK - 397 bytes in 0.020 seconds [06:24:14] PROBLEM - Swift HTTP on ms-fe1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:27:23] RECOVERY - Swift HTTP on ms-fe1 is OK: HTTP OK HTTP/1.1 200 OK - 397 bytes in 0.007 seconds [06:28:53] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [06:31:17] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:35:38] PROBLEM - LVS HTTP IPv4 on ms-fe.pmtpa.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:35:39] PROBLEM - Swift HTTP on ms-fe4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:37:08] RECOVERY - LVS HTTP IPv4 on ms-fe.pmtpa.wmnet is OK: HTTP OK HTTP/1.1 200 OK - 397 bytes in 0.012 seconds [06:37:17] PROBLEM - Swift HTTP on ms-fe1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:38:47] RECOVERY - Swift HTTP on ms-fe1 is OK: HTTP OK HTTP/1.1 200 OK - 397 bytes in 0.007 seconds [06:38:47] RECOVERY - Swift HTTP on ms-fe4 is OK: HTTP OK HTTP/1.1 200 OK - 397 bytes in 0.011 seconds [06:43:15] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 7.416 seconds [06:56:27] PROBLEM - Swift HTTP on ms-fe4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:56:46] PROBLEM - Swift HTTP on ms-fe2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:57:57] RECOVERY - Swift HTTP on ms-fe4 is OK: HTTP OK HTTP/1.1 200 OK - 397 bytes in 0.011 seconds [06:58:06] RECOVERY - Swift HTTP on ms-fe2 is OK: HTTP OK HTTP/1.1 200 OK - 397 bytes in 0.016 seconds [07:01:15] PROBLEM - Swift HTTP on ms-fe1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:04:24] RECOVERY - Swift HTTP on ms-fe1 is OK: HTTP OK HTTP/1.1 200 OK - 397 bytes in 0.011 seconds [07:18:21] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:22:06] PROBLEM - Swift HTTP on ms-fe1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:23:36] RECOVERY - Swift HTTP on ms-fe1 is OK: HTTP OK HTTP/1.1 200 OK - 397 bytes in 0.007 seconds [07:31:15] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 2.276 seconds [07:31:38] hello [07:44:46] New patchset: Hashar; "remove 'configchange' script" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25939 [07:45:50] New review: Hashar; "rebased." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/25939 [07:45:51] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/25939 [07:51:04] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [07:51:04] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [07:51:28] New patchset: Tim Starling; "Increase the number of object server workers to 100" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27976 [07:52:29] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/27976 [07:52:35] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27976 [08:06:31] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:12:26] New patchset: Hashar; "zuul configuration for Wikimedia" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27611 [08:13:27] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/27611 [08:13:52] PROBLEM - LVS HTTP IPv4 on rendering.svc.pmtpa.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:14:28] Change abandoned: Hashar; "Squashed in Ife7e6e7d "zuul configuration for Wikimedia" which uses an "instance" definition and thu..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25236 [08:15:04] PROBLEM - Puppet freshness on storage3 is CRITICAL: Puppet has not run in the last 10 hours [08:15:20] New review: Hashar; "I have merged in Ic7187035which was describing the lab role." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/27611 [08:19:34] New patchset: Hashar; "zuul configuration for Wikimedia" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27611 [08:20:36] New review: Hashar; "Load zuul.pp in site.pp since we do not have an autoloader :-]" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/27611 [08:20:36] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/27611 [08:22:43] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.071 seconds [08:23:25] what's going on? [08:23:26] sigh [08:23:31] ah hi [08:24:40] PROBLEM - swift-object-server on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [08:42:58] PROBLEM - Puppet freshness on lvs1001 is CRITICAL: Puppet has not run in the last 10 hours [08:54:33] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:01:51] New patchset: Hashar; "zuul configuration for Wikimedia" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27611 [09:02:55] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/27611 [09:03:02] grrùmnnm [09:04:29] New patchset: Hashar; "zuul configuration for Wikimedia" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27611 [09:05:33] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/27611 [09:09:15] RECOVERY - swift-object-server on ms-be3 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [09:10:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.044 seconds [09:27:24] PROBLEM - swift-account-auditor on ms-be1 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [09:27:33] PROBLEM - swift-account-reaper on ms-be1 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [09:27:33] PROBLEM - swift-container-auditor on ms-be1 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [09:27:42] PROBLEM - swift-container-replicator on ms-be1 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [09:28:09] PROBLEM - swift-account-replicator on ms-be1 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [09:28:19] PROBLEM - swift-container-server on ms-be1 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [09:28:19] PROBLEM - swift-account-server on ms-be1 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [09:31:50] New patchset: Hashar; "zuul configuration for Wikimedia" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27611 [09:32:50] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/27611 [09:33:30] stuuupiiid puppet [09:33:31] grmbmblbl [09:33:55] New patchset: Hashar; "zuul configuration for Wikimedia" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27611 [09:34:56] New patchset: Hashar; "zuul configuration for Wikimedia" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27611 [09:35:59] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/27611 [09:40:27] RECOVERY - Apache HTTP on srv224 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.752 second response time [09:40:29] New review: Hashar; "Here is the summary of changes I have made between patchset 2 and patchset 9:" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/27611 [09:43:09] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:45:51] RECOVERY - LVS HTTP IPv4 on rendering.svc.pmtpa.wmnet is OK: HTTP OK HTTP/1.1 200 OK - 62515 bytes in 0.615 seconds [09:46:34] Change abandoned: Hashar; "cant really use that while we import everything in site.pp. Will have to look at using rspec at anot..." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/16139 [09:46:54] RECOVERY - Apache HTTP on srv219 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.028 second response time [09:47:21] RECOVERY - Apache HTTP on srv220 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.042 second response time [09:47:24] Change abandoned: Hashar; "packages have been installed manually" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16952 [09:47:30] Change abandoned: Hashar; "packages have been installed manually" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16953 [09:47:35] Change abandoned: Hashar; "packages have been installed manually" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16954 [09:47:39] RECOVERY - Apache HTTP on srv222 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.031 second response time [09:47:48] RECOVERY - Apache HTTP on srv221 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.046 second response time [09:48:02] Change abandoned: Hashar; "packages have been installed manually" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24479 [09:48:29] New patchset: Hashar; "jenkins: OpenStack jenkins-job-builder" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24620 [09:49:00] RECOVERY - Apache HTTP on srv223 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.040 second response time [09:49:27] New patchset: Hashar; "mw udp2log filter did not honor $log_directory" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24432 [09:50:30] New patchset: Hashar; "put library after sourcefile in varnish cc command" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24797 [09:51:34] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24620 [09:51:34] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24432 [09:51:34] New patchset: Hashar; "git::clone now support a specific sha1" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27175 [09:52:38] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24797 [09:52:39] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/27175 [09:57:42] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 6.252 seconds [09:58:09] RECOVERY - swift-account-auditor on ms-be1 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [09:58:27] RECOVERY - swift-container-replicator on ms-be1 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [09:58:36] RECOVERY - swift-container-auditor on ms-be1 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [09:58:36] RECOVERY - swift-account-reaper on ms-be1 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [09:59:12] RECOVERY - swift-account-replicator on ms-be1 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [09:59:12] RECOVERY - swift-account-server on ms-be1 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [09:59:39] RECOVERY - swift-container-server on ms-be1 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [10:31:46] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:39:11] mark: ping me when you're back [10:47:31] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.082 seconds [10:53:41] i am back [10:53:58] paravoid: pong [11:10:11] New review: Dereckson; "This change should be deployed according the i18n team planning." [operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/27962 [11:13:21] hey, sorry just saw that [11:13:27] saw the ipv6 issue via cogent? [11:13:39] saw that there was an email, haven't read it yet [11:13:43] i'm currently debugging swift issues :P [11:13:47] heh [11:13:48] or squid issues [11:13:55] still cannot_forwards [11:14:02] in the meantime, any reason not to switch bits back to esams? [11:14:04] i want sq85 to cache that video better [11:14:09] what, you haven't done that already? :P [11:14:11] that would workaround the issue just fine [11:14:26] that was supposed to happen on friday [11:16:08] !log switching European bits back to esams [11:16:18] (I restarted cp3001/2 varnish first just in case) [11:16:19] Logged the message, Master [11:16:45] so sq85 is requesting that video several times a second... [11:16:50] so, no, didn't do it on Friday and didn't feel like playing with it during the weekend in case it needs flaps back and forth [11:16:56] that would probably page everyone [11:19:04] i'm modifying sq85's squid config locally [11:19:10] puppet will replace it [11:19:23] disable puppet? :) [11:20:21] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:20:21] PROBLEM - Auth DNS on ns2.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [11:20:52] GET /wikipedia/commons/3/3b/Gertie_the_Dinosaur.ogv HTTP/1.0..User-Agent: Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.4 (KHTML, like Gecko) Chrome/22.0.1229.94 Safari/537.4..Host: upload [11:20:53] .wikimedia.org..Accept: */*..Accept-Language: en-US,en;q=0.8..Range: bytes=10201750-..Referer: http://en.wikipedia.org/wiki/Winsor_McCay..Pragma: no-cache..Via: 1.0 sq85.wikimedia.org:312 [11:20:53] 8 (squid/2.7.STABLE9)..X-Forwarded-For: 213.165.40.2, 208.80.152.91..Connection: keep-alive.... 
[11:21:11] god damn range requests [11:23:39] RECOVERY - Auth DNS on ns2.wikimedia.org is OK: DNS OK: 0.152 seconds response time. www.wikipedia.org returns 208.80.154.225 [11:24:09] !log restarting pdns @ ns2 [11:24:21] Logged the message, Master [11:24:27] what's the problem with the range rquest? [11:34:45] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 8.441 seconds [11:45:15] PROBLEM - Backend Squid HTTP on sq85 is CRITICAL: Connection refused [11:45:51] PROBLEM - Apache HTTP on srv223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:45:51] PROBLEM - Apache HTTP on srv222 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:47:21] RECOVERY - Apache HTTP on srv222 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.044 second response time [11:48:15] RECOVERY - Backend Squid HTTP on sq85 is OK: HTTP OK HTTP/1.0 200 OK - 468 bytes in 0.005 seconds [11:48:51] RECOVERY - Apache HTTP on srv223 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.031 second response time [11:49:32] !log Deployed range_offset_limit 64 MB change (from 0) to all upload backend squids [11:49:43] Logged the message, Master [11:53:03] PROBLEM - Puppet freshness on sq42 is CRITICAL: Puppet has not run in the last 10 hours [11:53:03] PROBLEM - Puppet freshness on srv220 is CRITICAL: Puppet has not run in the last 10 hours [11:54:06] PROBLEM - Puppet freshness on srv297 is CRITICAL: Puppet has not run in the last 10 hours [12:01:39] New patchset: J; "make labs TMH setup more like production" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/28007 [12:04:59] New patchset: J; "remove unused role role::jobrunner::videoscaler" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/28008 [12:06:01] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/28008 [12:09:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:15:06] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [12:15:06] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [12:23:39] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.050 seconds [12:37:13] Importing MrFluffyCam20120930.ogv...PHP Warning: wfMkdirParents: failed to mkdir "/mnt/upload7/wikipedia/commons/a/a6" mode 0777 in /home/wikipedia/common/php-1.21wmf1/includes/GlobalFunctions.php on line 2546 [12:37:13] Warning: wfMkdirParents: failed to mkdir "/mnt/upload7/wikipedia/commons/a/a6" mode 0777 in /home/wikipedia/common/php-1.21wmf1/includes/GlobalFunctions.php on line 2546 failed. (Could not create directory "mwstore://local-multiwrite/local-public/a/a6".) [12:37:22] while trying to use importImages on fenari [12:37:51] Oh, because fenari doesn't have /upload7 [12:38:17] on fenari it's /mnt/originals [12:38:34] yeah [12:38:38] should probably make that consistent [12:38:56] heh [12:39:17] I just ended up doing is_dir( '/mnt/originals' ) ? '/mnt/originals/ext-dist' : '/mnt/upload7/ext-dist'; [12:41:03] do we use /mnt/originals anywhere on fenari? 
[12:42:13] no [12:42:20] i just mounted it there when I first created the volume [12:42:27] to work on it [12:44:07] okay, I'll remount it to upload7 then [12:44:27] might want to do that with puppet consistent with the apaches [12:44:32] the current mount was manual [12:45:10] grumble [12:45:19] can we please pretty please please get rid of ext-dist [12:45:23] in that form at least [12:49:44] there's a bash in a screen that has its cwd in /mnt/originals, I'm killing it [12:50:24] PROBLEM - Puppet freshness on cp1040 is CRITICAL: Puppet has not run in the last 10 hours [12:55:39] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:57:26] the ipv6 tele2 issue is peculiar [12:58:30] I guess I'll let our network mistress to handle it [13:08:30] what did you find? [13:09:54] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 5.556 seconds [13:12:18] the pmtpa ipv6 network from cogent goes to tele2 and somewhere on tele2 it responds network unreachable [13:12:35] so cogent->pmtpa is broken over ipv6, hence the increased impact [13:12:45] are we peering with tele2 on eqiad? [13:12:59] can it be internal breakage between eqiad->pmtpa? [13:13:00] we have transit from tele2 in eqiad [13:13:41] so this only affects the pmtpa prefix? that would be really weird [13:14:53] no, eqiad too it seems [13:14:57] I use Cogent's looking glass [13:15:16] that makes more sense [13:16:10] from which router? [13:16:44] london for example [13:17:00] but more [13:17:08] ah [13:17:11] mostly european routers [13:17:14] yes [13:17:28] so the ams router of tele2 doesn't have our US prefixes [13:17:30] hmm [13:26:14] we are advertising that prefix to them in eqiad [13:28:41] !log reedy synchronized wmf-config/CommonSettings.php 'Remove extdist conditional for mount path' [13:28:53] Logged the message, Master [13:29:56] !log reedy synchronized wmf-config/throttle.php 'Updated from gitrepo' [13:30:08] Logged the message, Master [13:30:18] robh: can you look into the partition on storage3 today. still not installing OS. [13:30:38] yep, will take a look at it shortly, had the grub fail right? [13:31:23] correct [13:31:30] cool, will take a gander in a bit [13:31:54] !log Pointing php to php-1.21wmf1 [13:32:05] Logged the message, Master [13:34:22] !log Killing php-1.20wmf11 l10n cache from apaches [13:34:33] Logged the message, Master [13:35:06] !log Killing php-1.20wmf11 l10n cache from apaches [13:35:18] Logged the message, Master [13:43:00] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:57:06] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 2.166 seconds [14:14:44] Importing Video Recording of President William Jefferson Clinton Delivering Remarks to the Community - NARA - 6037378.ogv...PHP Warning: SwiftFileBackend::doStoreInternal: Invalid response (503): Unexpected HTTP return code: 503 in /home/wikipedia/common/php-1.21wmf1/includes/filebackend/SwiftFileBackend.php on line 1364 [14:14:44] Warning: SwiftFileBackend::doStoreInternal: Invalid response (503): Unexpected HTTP return code: 503 in /home/wikipedia/common/php-1.21wmf1/includes/filebackend/SwiftFileBackend.php on line 1364 failed. (An unknown error occurred in storage backend "local-swift".) [14:14:54] Trying to upload a 897M file [14:15:08] I uploaded a 920.65 MB one 15 minutes ago.. 
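[Editor's note, not part of the log] The failed upload above came from the importImages.php maintenance script hitting a transient 503 from the Swift backend. For a one-off retry of just the failed file, the script is normally pointed at a directory containing only that file; the wiki name, paths and comment below are illustrative placeholders, not values from the log, and the multiversion `mwscript` wrapper is assumed:

```bash
# Hypothetical single-file retry; paths, wiki and comment are made up here.
mkdir -p /tmp/retry-import
cp '/path/to/failed-file.ogv' /tmp/retry-import/
mwscript importImages.php --wiki=commonswiki \
    --comment='Batch import (retry after transient 503)' /tmp/retry-import
```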
[14:15:24] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [14:32:48] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:33:06] PROBLEM - Packetloss_Average on oxygen is CRITICAL: CRITICAL: packet_loss_average is 8.37411144 (gt 8.0) [14:45:15] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 4.420 seconds [14:51:58] !log reedy synchronized php-1.21wmf2/ 'Initial sync' [14:52:09] Logged the message, Master [14:55:35] !log reedy synchronized wmf-config/ [14:55:45] PROBLEM - Varnish traffic logger on cp1044 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [14:55:45] PROBLEM - Varnish traffic logger on cp1042 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [14:55:46] Logged the message, Master [14:56:30] PROBLEM - Varnish traffic logger on cp1043 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [14:56:52] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: test2wiki to 1.21wmf2 [14:57:03] Logged the message, Master [15:03:05] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours [15:09:05] PROBLEM - Puppet freshness on db42 is CRITICAL: Puppet has not run in the last 10 hours [15:12:03] !log reedy Started syncing Wikimedia installation... : Build localisation cache for 1.21wmf2 [15:12:15] Logged the message, Master [15:17:11] !log reedy synchronized php-1.21wmf1/extensions/MobileFrontend [15:17:23] Logged the message, Master [15:19:16] !log reedy synchronized php-1.21wmf2/extensions/MobileFrontend [15:19:28] Logged the message, Master [15:20:47] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:26:33] Could someone run this on fenari for me please? chown mwdeploy /home/wikipedia/common/wmf-config/ExtensionMessages-1.2* [15:28:00] just the dirs? [15:28:21] or did you want -R? 
[15:28:23] Reedy: [15:28:40] they're php files [15:28:44] ah [15:28:47] well they're done [15:28:49] in any case [15:28:52] thanks [15:28:55] yw [15:28:58] that'll stop scap complaining [15:29:11] great [15:33:41] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.233 seconds [15:35:47] RECOVERY - Varnish traffic logger on cp1043 is OK: PROCS OK: 3 processes with command name varnishncsa [15:36:05] RECOVERY - Varnish traffic logger on cp1044 is OK: PROCS OK: 3 processes with command name varnishncsa [15:36:14] RECOVERY - Varnish traffic logger on cp1042 is OK: PROCS OK: 3 processes with command name varnishncsa [15:41:02] PROBLEM - Packetloss_Average on oxygen is CRITICAL: CRITICAL: packet_loss_average is 8.67320725806 (gt 8.0) [15:43:31] !log stopping puppet on oxygen, looking for cause of recent packet loss [15:43:43] Logged the message, Master [16:03:39] New patchset: Demon; "Add new suffix for wikidata wikis" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/28040 [16:04:44] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [16:05:20] PROBLEM - Varnish traffic logger on cp1025 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [16:05:29] PROBLEM - Varnish traffic logger on cp1022 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [16:05:29] PROBLEM - Varnish traffic logger on cp1029 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [16:05:38] PROBLEM - Varnish traffic logger on cp1043 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [16:05:38] PROBLEM - Varnish traffic logger on cp1036 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [16:05:47] PROBLEM - Varnish traffic logger on cp1030 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [16:05:47] PROBLEM - Varnish traffic logger on cp1028 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [16:05:47] PROBLEM - Varnish traffic logger on cp1026 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [16:05:56] PROBLEM - Varnish traffic logger on cp1032 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [16:06:05] PROBLEM - Varnish traffic logger on cp1044 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [16:06:05] PROBLEM - Varnish traffic logger on cp1034 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [16:06:05] PROBLEM - Varnish traffic logger on cp1024 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [16:06:14] PROBLEM - Varnish traffic logger on cp1042 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [16:06:23] PROBLEM - Varnish traffic logger on cp1021 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [16:06:23] PROBLEM - Varnish traffic logger on cp1023 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [16:06:32] PROBLEM - Varnish traffic logger on cp1035 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [16:06:32] PROBLEM - Varnish traffic logger on cp1033 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [16:06:37] !log reedy Finished syncing Wikimedia installation... 
: Build localisation cache for 1.21wmf2 [16:06:41] PROBLEM - Varnish traffic logger on cp1027 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [16:06:48] Logged the message, Master [16:09:41] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:16:27] !log reedy synchronized wmf-config/InitialiseSettings.php 'Re-enable education program on enwiki' [16:16:39] Logged the message, Master [16:19:51] 3 [client 10.64.0.141] Symbolic link not allowed or link target not accessible: /usr/local/apache/common/docroot/wikibooks.org/w/extensions [16:19:51] 3 [client 10.64.0.129] Symbolic link not allowed or link target not accessible: /usr/local/apache/common/docroot/wikibooks.org/w/extensions [16:19:51] 2 [client 10.64.0.138] Symbolic link not allowed or link target not accessible: /usr/local/apache/common/docroot/wikibooks.org/w/extensions [16:21:10] argh [16:21:12] check_command check_http_lvs!bits.wikimedia.org!/skins-1.5/common/images/poweredby_mediawiki_88x31.png [16:21:51] has that been removed or something? [16:22:17] RECOVERY - Varnish traffic logger on cp1033 is OK: PROCS OK: 3 processes with command name varnishncsa [16:22:39] I changed /usr/local/apache/common/php earlier... And just re-ran the dsh command [16:22:44] RECOVERY - Varnish traffic logger on cp1036 is OK: PROCS OK: 3 processes with command name varnishncsa [16:22:53] RECOVERY - Varnish traffic logger on cp1028 is OK: PROCS OK: 3 processes with command name varnishncsa [16:23:11] that looks to have fixed the spam in the apache logs [16:23:47] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.030 seconds [16:26:27] !log reedy synchronized php-1.21wmf1/extensions/MobileFrontend/ [16:26:40] Logged the message, Master [16:27:08] PROBLEM - Varnish traffic logger on cp1028 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [16:27:08] PROBLEM - Varnish traffic logger on cp1036 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [16:27:53] PROBLEM - Varnish traffic logger on cp1033 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [16:28:54] !log reedy synchronized php-1.21wmf2/extensions/MobileFrontend/ [16:29:05] Logged the message, Master [16:30:08] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [16:35:55] New patchset: Nemo bis; "Remove obsolete $wgRateLimitsExcludedGroups" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/28044 [16:44:14] so what was the alerts about bits? [16:49:43] I think it was my fault [16:49:59] I changed a symlink and it was creating spam. Running the same command again fixed it :/ [16:55:38] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:10:11] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.188 seconds [17:10:29] RECOVERY - Varnish traffic logger on cp1029 is OK: PROCS OK: 3 processes with command name varnishncsa [17:10:47] RECOVERY - Varnish traffic logger on cp1022 is OK: PROCS OK: 3 processes with command name varnishncsa [17:10:56] RECOVERY - Varnish traffic logger on cp1028 is OK: PROCS OK: 3 processes with command name varnishncsa [17:12:37] cmjohnson1: you onsite today? 
[17:14:59] PROBLEM - Varnish traffic logger on cp1029 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [17:15:35] PROBLEM - Varnish traffic logger on cp1028 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [17:15:35] PROBLEM - Varnish traffic logger on cp1022 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [17:22:47] PROBLEM - udp2log log age for lucene on oxygen is CRITICAL: CRITICAL: log files /a/log/lucene/lucene.log, have not been written in a critical amount of time. For most logs, this is 4 hours. For slow logs, this is 4 days. [17:25:51] Reedy: question about common [17:26:25] it's big! [17:26:27] notpeter: yes I am here [17:26:43] there's an rsync cron job on all the search hosts to grab common/php/languages/messages [17:26:58] it started throwing errors this morning. did those files move? [17:27:12] I seemingly broke the symlink earlier on [17:27:18] and fixed it again, with the same command [17:27:36] cmjohnson1: pc2 was being lame and wasn't imaging properly on friday, but now is... sooooo.... we all good :) [17:28:01] awesome! the mac was wrong eh? [17:28:27] cmjohnson1: nah, it wasn't able to get the base image via tftp, or would time out doing so... iunno, was weird. but all good now! [17:28:45] that is weird [17:28:57] Reedy: hrm, ok. still getting cron fail emails as of 13 minutes ago [17:29:26] notpeter: is /usr/local/apache/common/php valid? [17:30:27] !log reedy synchronized php-1.21wmf1/includes/resourceloader/ResourceLoaderFileModule.php [17:30:39] Logged the message, Master [17:31:30] Reedy: so, this cron doesn't pull the whole dir or put it in the normal place... [17:31:42] What's it do? :/ [17:31:53] rsync -a --delete --exclude=**/.svn/lock --no-perms 10.0.5.8::common/php/languages/messages /a/search/conf/ [17:31:57] It's weird, I do this symlink change every 2 weeks or so [17:32:07] and it spits out [17:32:07] rsync: change_dir "/php/languages" (in common) failed: No such file or directory (2) [17:32:36] that exclude is useless ;) [17:32:40] !log reedy synchronized wmf-config/CommonSettings.php 'wgDebugLogGroups for resourceloader' [17:32:51] Logged the message, Master [17:32:53] oh, true! [17:33:04] hmm [17:33:40] Is it using nfs-home correct? [17:33:55] Are the nas boxes replicating back to nfs1? [17:35:20] so, the cron syncs over two other files, all.dblist and initilaizesettings.php [17:35:54] Reedy: ooooo, no [17:36:02] this is probably grabbing some outdated junk.... [17:36:17] heh [17:37:02] Actually [17:37:10] scap-2 is still pulling from the same host apparently.. [17:37:36] yeah, this is right, I habeeb [17:37:40] it's nfs-home [17:37:44] wihch is now on the netapps [17:38:29] OH [17:38:30] I wonder [17:38:36] Ohio? [17:38:41] It's not a relative symlink [17:38:47] it's explicit to /usr/local/... [17:40:21] it looks to be relative [17:40:33] I just fixed it again with dsh... [17:40:33] lrwxrwxrwx 1 mwdeploy mwdeploy 12 2012-10-15 17:39 php -> php-1.21wmf1 [17:40:38] ah, ok :) [17:40:45] Get it to try again? [17:41:01] works now. thanks :) [17:41:12] cronspam: reduce-o! [17:41:13] usu [17:41:15] *yay [17:42:21] New review: Aude; "looks good :)" [operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/28040 [17:44:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:50:56] AaronSchulz: does adding a new log group require any action on flourine? 
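[Editor's note, not part of the log] The cron failures above trace back to the `/usr/local/apache/common/php` symlink briefly pointing at a missing target, which also explains the earlier Apache "Symbolic link not allowed or link target not accessible" spam. A sketch of how such a flip can be done without a dangling-link window; the dsh group, the use of sudo, and the exact command are assumptions, not copied from the log:

```bash
# Create the new link under a temporary name, then rename it over the old one;
# the rename is atomic, so Apache never sees a missing /usr/local/apache/common/php.
dsh -g mediawiki-installation -M -- '
    cd /usr/local/apache/common &&
    sudo -u mwdeploy ln -s php-1.21wmf1 php.new &&
    sudo -u mwdeploy mv -T php.new php
'
```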
[17:52:11] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [17:52:11] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [17:54:27] Reedy: you should create the log file for it with the same perms as the others [17:54:53] I was thinking that was the case from when they were on fenari.. [17:55:34] Looks like I'll need to get ops to do that.. [17:58:10] Can someone please create on flourine: /a/mw-log/udp2log/resourceloader.log chmod 644 and owned by udp2log:udp2log [17:59:05] RECOVERY - Varnish traffic logger on cp1029 is OK: PROCS OK: 3 processes with command name varnishncsa [17:59:14] RECOVERY - Varnish traffic logger on cp1032 is OK: PROCS OK: 3 processes with command name varnishncsa [17:59:14] RECOVERY - Varnish traffic logger on cp1024 is OK: PROCS OK: 3 processes with command name varnishncsa [17:59:23] RECOVERY - Varnish traffic logger on cp1025 is OK: PROCS OK: 3 processes with command name varnishncsa [18:01:23] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.023 seconds [18:03:38] PROBLEM - Varnish traffic logger on cp1024 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [18:03:38] PROBLEM - Varnish traffic logger on cp1032 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [18:04:32] PROBLEM - Varnish traffic logger on cp1025 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [18:04:41] PROBLEM - Varnish traffic logger on cp1029 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [18:05:23] Reedy: maybe roan can [18:05:41] * AaronSchulz still finds it amusing how roan has teh roots :) [18:07:41] !log reedy synchronized php-1.21wmf1/extensions/MobileFrontend [18:07:52] Logged the message, Master [18:11:30] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: testwiki and mediawikiwiki to 1.21wmf2 [18:11:42] Logged the message, Master [18:16:32] PROBLEM - Puppet freshness on storage3 is CRITICAL: Puppet has not run in the last 10 hours [18:19:36] !log reedy synchronized php-1.21wmf2/extensions/CodeReview [18:19:47] Logged the message, Master [18:32:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:38:25] Reedy: What'cha logging? [18:38:27] (rl reatled) [18:38:42] the same as the exception should show [18:38:52] ok [18:39:08] As running phpunit against the cluster ain't gonna happen [18:42:20] <^demon|lunch> Why isn't jenkins catching it? [18:42:30] I suspect it's from some extension [18:42:54] which needs a now core test running on a mw install, with said extension setup [18:43:00] <^demon|lunch> I thought we did MobileFrontend tests already [18:43:16] It's a core test... [18:43:23] <^demon|lunch> Oh, core test but with extension installed, got it. [18:43:24] Do we run core tests where we test extensions? [18:43:28] yeahj [18:43:32] <^demon|lunch> No, we don't. 
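[Editor's note, not part of the log] Per the discussion above, a new wgDebugLogGroups entry needs its log file pre-created on the udp2log host (referred to as flourine here) with the same ownership and mode as the existing logs. The path, owner and mode below are exactly as requested in the channel; run as root:

```bash
touch /a/mw-log/udp2log/resourceloader.log
chown udp2log:udp2log /a/mw-log/udp2log/resourceloader.log
chmod 644 /a/mw-log/udp2log/resourceloader.log
```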
[18:43:32] PROBLEM - Puppet freshness on lvs1001 is CRITICAL: Puppet has not run in the last 10 hours [18:43:35] <^demon|lunch> hashar: ^ [18:43:56] and versions on beta labs are behind right now :( [18:46:13] Yay for logging [18:47:53] RECOVERY - Varnish traffic logger on cp1025 is OK: PROCS OK: 3 processes with command name varnishncsa [18:47:53] RECOVERY - Varnish traffic logger on cp1023 is OK: PROCS OK: 3 processes with command name varnishncsa [18:48:02] RECOVERY - Varnish traffic logger on cp1035 is OK: PROCS OK: 3 processes with command name varnishncsa [18:48:20] RECOVERY - Varnish traffic logger on cp1021 is OK: PROCS OK: 3 processes with command name varnishncsa [18:48:38] RECOVERY - Varnish traffic logger on cp1034 is OK: PROCS OK: 3 processes with command name varnishncsa [18:49:32] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.050 seconds [18:52:32] PROBLEM - Varnish traffic logger on cp1025 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [18:52:32] PROBLEM - Varnish traffic logger on cp1023 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [18:52:41] PROBLEM - Varnish traffic logger on cp1035 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [18:52:59] PROBLEM - Varnish traffic logger on cp1021 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [18:53:17] PROBLEM - Varnish traffic logger on cp1034 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [18:55:16] !log reedy synchronized php-1.21wmf1/includes/resourceloader/ResourceLoaderFileModule.php [18:55:28] Logged the message, Master [18:58:18] New patchset: Asher; "fix rmem tuning for udplog" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/28073 [18:58:19] !log reedy synchronized wmf-config/ [18:58:31] Logged the message, Master [18:59:26] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/28073 [18:59:38] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/28073 [19:02:35] RECOVERY - udp2log log age for lucene on oxygen is OK: OK: all log files active [19:04:35] !log restarted udp2log on oxygen with correct --recv-queue [19:04:47] Logged the message, Master [19:07:05] RECOVERY - Packetloss_Average on oxygen is OK: OK: packet_loss_average is 0.0 [19:09:02] RECOVERY - Varnish traffic logger on cp1044 is OK: PROCS OK: 3 processes with command name varnishncsa [19:20:53] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:34:50] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 6.225 seconds [19:35:47] Reedy: Any useful results yet? 
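[Editor's note, not part of the log] The oxygen packet loss below clears after the "fix rmem tuning for udplog" change and a udp2log restart "with correct --recv-queue". The general relationship, sketched with placeholder numbers (the actual values and the units of --recv-queue are not shown in the log and are guesses here): the kernel silently caps SO_RCVBUF at net.core.rmem_max, so the sysctl must allow at least the receive queue the logger asks for, or traffic bursts overflow the socket buffer and register as packet loss.

```bash
# Illustrative values only; allow a 64 MB UDP receive buffer...
sysctl -w net.core.rmem_max=$((64 * 1024 * 1024))
# ...then restart the collector with a matching queue, e.g.
#   udp2log --recv-queue=$((64 * 1024 * 1024)) ...
# (--recv-queue is the flag named in the log; its value/units here are assumed)
```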
[19:36:02] RECOVERY - Varnish traffic logger on cp1023 is OK: PROCS OK: 3 processes with command name varnishncsa [19:36:20] RECOVERY - Varnish traffic logger on cp1025 is OK: PROCS OK: 3 processes with command name varnishncsa [19:36:20] RECOVERY - Varnish traffic logger on cp1035 is OK: PROCS OK: 3 processes with command name varnishncsa [19:36:20] RECOVERY - Varnish traffic logger on cp1043 is OK: PROCS OK: 3 processes with command name varnishncsa [19:36:38] RECOVERY - Varnish traffic logger on cp1034 is OK: PROCS OK: 3 processes with command name varnishncsa [19:36:38] RECOVERY - Varnish traffic logger on cp1029 is OK: PROCS OK: 3 processes with command name varnishncsa [19:36:47] RECOVERY - Varnish traffic logger on cp1030 is OK: PROCS OK: 3 processes with command name varnishncsa [19:36:47] RECOVERY - Varnish traffic logger on cp1024 is OK: PROCS OK: 3 processes with command name varnishncsa [19:36:47] RECOVERY - Varnish traffic logger on cp1022 is OK: PROCS OK: 3 processes with command name varnishncsa [19:36:47] RECOVERY - Varnish traffic logger on cp1021 is OK: PROCS OK: 3 processes with command name varnishncsa [19:36:47] RECOVERY - Varnish traffic logger on cp1027 is OK: PROCS OK: 3 processes with command name varnishncsa [19:36:48] RECOVERY - Varnish traffic logger on cp1026 is OK: PROCS OK: 3 processes with command name varnishncsa [19:36:48] RECOVERY - Varnish traffic logger on cp1032 is OK: PROCS OK: 3 processes with command name varnishncsa [19:36:58] RECOVERY - Varnish traffic logger on cp1036 is OK: PROCS OK: 3 processes with command name varnishncsa [19:37:05] RECOVERY - Varnish traffic logger on cp1033 is OK: PROCS OK: 3 processes with command name varnishncsa [19:37:14] RECOVERY - Varnish traffic logger on cp1042 is OK: PROCS OK: 3 processes with command name varnishncsa [19:37:14] RECOVERY - Varnish traffic logger on cp1028 is OK: PROCS OK: 3 processes with command name varnishncsa [19:37:32] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [19:41:07] Krinkle: no, the file is still completely empty [19:57:55] !log powering down srv266 troubleshooting per rt2896 [19:58:07] Logged the message, Master [20:00:27] PROBLEM - Host es10 is DOWN: PING CRITICAL - Packet loss = 100% [20:00:54] PROBLEM - Host srv266 is DOWN: PING CRITICAL - Packet loss = 100% [20:01:39] RECOVERY - Host es10 is UP: PING OK - Packet loss = 0%, RTA = 0.35 ms [20:04:57] PROBLEM - mysqld processes on es10 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [20:05:42] !log reedy synchronized php-1.21wmf2/extensions/Collection/ 'Revert back to 9f37005e78a068821c45d9d313394c6728f55b4a' [20:05:53] Logged the message, Master [20:07:30] RECOVERY - Host srv266 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [20:07:54] !log reedy synchronized php-1.21wmf1/extensions/Collection/ 'Revert back to 9f37005e78a068821c45d9d313394c6728f55b4a' [20:08:05] Logged the message, Master [20:08:42] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:10:36] !log asher synchronized wmf-config/db.php 'pulling es10, third crash in three weeks' [20:10:48] Logged the message, Master [20:12:09] PROBLEM - Apache HTTP on srv266 is CRITICAL: Connection refused [20:14:48] cmjohnson1: just opened https://rt.wikimedia.org/Ticket/Display.html?id=3724 about the db es10 crashing regularly. it's currently out of service since it just crashed again, can you take a look today? 
pretty high priority [20:18:32] binasher...yes [20:20:30] !log awjrichards synchronized php-1.21wmf1/extensions/MobileFrontend/ 'touch files' [20:20:42] Logged the message, Master [20:21:19] !log Attempting to purge out another 500k pages with CentralNotice skin problem (this time not crashing the cluster by filtering out media files and rate limiting to no more than 50hps). Estimated run time is 3 hours. [20:21:31] Logged the message, Master [20:23:33] RECOVERY - Memcached on srv193 is OK: TCP OK - 0.004 second response time on port 11000 [20:25:48] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.016 seconds [20:25:57] RECOVERY - Apache HTTP on srv266 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.337 second response time [20:35:40] !log awjrichards synchronized php-1.21wmf1/extensions/MobileFrontend/ 'touch files' [20:35:51] Logged the message, Master [20:37:15] !log rate limiting for CentralNotice purge is working too well... ie: script is not doing anything -- debugging [20:37:23] !log reedy synchronized php-1.21wmf1/includes/resourceloader/ResourceLoaderFileModule.php [20:37:27] Logged the message, Master [20:37:38] Logged the message, Master [20:43:13] !log reedy synchronized php-1.21wmf1/extensions/ArticleFeedback [20:43:24] Logged the message, Master [20:43:37] can someone please flush the mobile varnish cache? [20:43:48] binasher, LeslieCarr, mutante ^ [20:46:22] It's oh-so quiet.... [20:49:03] !log flushing mobile varnish cache [20:49:06] !log reedy synchronized php-1.21wmf2/extensions/ArticleFeedback [20:49:07] awjr: done [20:49:10] thank you mutante [20:49:15] Logged the message, Master [20:49:17] yw [20:49:27] Logged the message, Master [20:52:29] !log shutting down es10 for per rt3724 [20:52:41] Logged the message, Master [20:52:54] awjr: still need regular cache flushes with resourceloader? :( [20:52:59] !log reedy synchronized php-1.21wmf1/resources/ [20:53:11] Logged the message, Master [20:53:14] binasher: RL still isn't fully mobile-friendly; we're working on it [20:53:28] !log reedy synchronized php-1.21wmf2/resources/ [20:53:40] Logged the message, Master [20:55:25] binasher: es10 has a DIMM error ...probably explains your problem [20:55:39] ahh [20:56:06] PROBLEM - Host es10 is DOWN: PING CRITICAL - Packet loss = 100% [20:58:57] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:02:34] RECOVERY - Host es10 is UP: PING OK - Packet loss = 0%, RTA = 0.63 ms [21:11:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.400 seconds [21:19:58] https://bugzilla.wikimedia.org/show_bug.cgi?id=39322 bits.wikimedia.org don't load over 6to4 connection via Hurricane Electric [21:23:09] New patchset: Asher; "memcached conf for new mc servers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/28151 [21:24:12] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/28151 [21:24:40] New patchset: Asher; "memcached conf for new mc servers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/28151 [21:25:43] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/28151 [21:26:12] New patchset: Dereckson; "(bug 40616) Temporarily reenable wgUseNPPatrol on fi.wikipedia" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/28152 [21:27:34] New patchset: Hashar; "(bug 40616) Temporarily reenable wgUseNPPatrol on fi.wikipedia" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/28152 [21:28:20] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/28152 [21:29:10] !log hashar synchronized wmf-config/InitialiseSettings.php '(bug 40616) Temporarily reenable wgUseNPPatrol on fi.wikipedia' [21:29:14] New review: Hashar; "Deployed live!" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/28152 [21:29:22] Logged the message, Master [21:35:24] New patchset: Asher; "memcached conf for new mc servers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/28151 [21:36:27] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/28151 [21:39:16] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/28151 [21:40:47] New review: Dereckson; "Note: there is a typo in the changelog." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/28152 [21:43:01] New patchset: Asher; "remove dueling @monitor_groups" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/28153 [21:44:06] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/28153 [21:44:28] !log resuming purge of CentralNotice cache (now working) with rate limit of 50hps. estimated completion in 3hrs. [21:44:39] Logged the message, Master [21:47:24] New patchset: Asher; "add mecached ganglia group" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/28154 [21:48:29] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/28154 [21:49:51] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 249 seconds [21:51:21] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 10 seconds [21:54:01] !log catrope synchronized php-1.21wmf2/includes/resourceloader/ResourceLoaderFileModule.php '1f761e3bf349df8d48592f71ed4afff63471c6b4' [21:54:13] Logged the message, Master [21:54:21] PROBLEM - Puppet freshness on srv220 is CRITICAL: Puppet has not run in the last 10 hours [21:54:21] PROBLEM - Puppet freshness on sq42 is CRITICAL: Puppet has not run in the last 10 hours [21:55:09] !log catrope synchronized php-1.21wmf2/extensions/VisualEditor/ 'f25aaf5bc4449ea8c1ac749a96487485b5f0c6b3' [21:55:21] Logged the message, Master [21:55:24] PROBLEM - Puppet freshness on srv297 is CRITICAL: Puppet has not run in the last 10 hours [21:59:25] RoanKattouw, have you done deploying? my window starts soon [21:59:37] Yes, sorry, go ahead [22:02:08] paravoid: did you git blame those file paging lines in swift? [22:06:08] * jeremyb wonders what the h in hps means [22:06:31] AaronSchulz: seen bug 41028 ? 
[22:07:06] * AaronSchulz always clicks those and gets stuff like https://bugzilla.mozilla.org/show_bug.cgi?id=41028 [22:07:07] wah wah [22:07:26] haha [22:07:31] !b 41028 [22:07:31] https://bugzilla.wikimedia.org/41028 [22:09:52] * AaronSchulz loves the --si parameter to disk/file related programs [22:11:42] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/28154 [22:14:24] AaronSchulz: danke [22:14:42] the size of that file sounds close to the magic number [22:14:47] RoanKattouw would remember this too [22:15:30] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:15:38] could be something different though [22:15:44] AaronSchulz: Oh the 3 billion bytes limit? [22:15:52] Or trillion? [22:16:09] remember how copy operations would truncate around a certain amount [22:16:11] 3G using powers of 10, IIRC [22:16:13] Oh, right, that [22:16:15] you were debugging it for hours [22:16:15] But that was NFS [22:16:19] right [22:16:24] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [22:16:24] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [22:16:26] Good call on sending him to hume though [22:16:31] Cause that's what fixed it before [22:16:35] that's why I'm not sure it will help, but tis worth a try [22:16:44] also it should be on hume anyway, bad reedy! :) [22:21:35] New review: Aude; "This looks like the standard setup and I think is good for the main.conf file." [operations/apache-config] (master) C: 1; - https://gerrit.wikimedia.org/r/27546 [22:23:29] New patchset: Pyoungmeister; "adding rmoen to mortals group" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/28161 [22:24:11] Uploads? [22:24:32] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/28161 [22:29:16] New patchset: Pyoungmeister; "adding rmoen to mortals group" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/28161 [22:30:17] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/28161 [22:31:06] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.513 seconds [22:35:15] New patchset: Asher; "fix incorrect class scop" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/28164 [22:36:17] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/28164 [22:36:25] scop scoop scope [22:37:18] oh scap! [22:37:50] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/28161 [22:38:45] * AaronSchulz will have to start greeting asher each day with "Hey, so how is that ceph cluster coming along?" [22:39:44] troll [22:39:58] LeslieCarr: I thought you guys wanted to serve us better? [22:40:20] peter does, the rest of us are surly as ever [22:40:28] heh [22:41:46] hey, I'm surly [22:41:59] I'm surly interested in halping people get their jobs done ;) [22:42:23] New patchset: Asher; "misplaced ";". bad ";"!" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/28167 [22:43:22] notpeter: aaron needs you to be surly to dell to get eqiad r720's here faster. very surly. [22:43:27] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/28167 [22:43:42] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/28167 [22:44:35] let me go get my helping threatening stick. [22:45:04] !op_on_duty is notpeter [22:45:04] Key was added [22:45:26] ooh [22:45:32] !whofromopsiscurrentlyonduty [22:45:36] mutante: it's broken :P [22:45:39] lulz [22:45:40] mutante: Please don't abuse the pling-op stalkword. :-) [22:46:54] !op_on_duty [22:46:54] notpeter [22:47:01] sup? [22:47:07] Reedy: need infobot:) eh, i mean "flooterbuck" [22:47:11] binasher: Funny. [22:47:14] hmm. notpeter, do you know what's up with eqiad mc hosts? wm-bot says you might [22:47:42] binasher: yes, but that has nothing to do with what wm-bot knows [22:48:05] !wheremyserversat [22:48:20] !reboot analytics1007 [22:48:26] 1001,1003-1005,1007,1008, 1011-1014 are up [22:48:30] hah [22:48:34] 15 and 16 are waiting on doobers [22:48:44] 1002, 1006 have hardware issues [22:48:47] there are tickets for those [22:48:53] and 1009 and 1010 have some kinda network issues [22:49:09] 1009 and 1010 need cables that both intel and juniper agree are cables [22:49:14] blerg [22:49:14] which is surprisingly difficult [22:49:21] So not string? [22:49:29] Have you tried attaching plastic cups? [22:50:41] cups! That's the key [22:50:58] yeah, printing works with that [22:50:59] i believe we were using full cans of soda [22:51:21] PROBLEM - Puppet freshness on cp1040 is CRITICAL: Puppet has not run in the last 10 hours [22:53:47] cp1040, yep, frozen and down, i wonder it just reports Puppet freshness [22:54:11] how does one change the topic of an irc channel? [22:54:22] /topic #wikimedia-operations foo [22:54:50] but needs ops first... [22:54:54] how do we get ops? [22:55:01] you dont need [22:55:03] |ops [22:55:06] it was mode -t [22:55:10] last time at least [22:56:03] heh, yeah, which of the bots has opping feature [22:56:44] /msg chanserv op #wikimedia-operations [22:57:26] not authorized... [22:57:41] blah [22:57:45] silly irc [22:58:10] Mode lock : -t [22:58:29] you dont need ops [22:58:48] i saw the mode lock with /msg chanserver info #channel [22:58:52] chanserve [22:59:24] chanserv, damn [23:04:02] !log power cycling cp1040 [23:04:14] Logged the message, Master [23:05:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:06:42] New patchset: Pyoungmeister; "adding s page to mortals group" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/28173 [23:07:48] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/28173 [23:18:03] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.755 seconds [23:26:35] New patchset: Pyoungmeister; "adding s page to mortals group" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/28173 [23:27:58] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/28173 [23:28:32] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/28173 [23:52:15] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds