[00:04:13] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:18:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 4.425 seconds [00:52:40] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:08:53] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.028 seconds [01:40:23] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:42:20] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 306 seconds [01:45:20] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 2 seconds [01:52:50] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.307 seconds [02:14:17] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [02:14:17] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [02:25:12] !log LocalisationUpdate completed (1.21wmf1) at Mon Oct 15 02:25:12 UTC 2012 [02:25:28] Logged the message, Master [02:27:20] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:38:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.031 seconds [02:49:14] PROBLEM - Puppet freshness on cp1040 is CRITICAL: Puppet has not run in the last 10 hours [03:31:52] Brooke: you appear to be spreading sex a little too enthusiastically [03:34:16] PROBLEM - SSH on amslvs1 is CRITICAL: Server answer: [03:35:46] RECOVERY - SSH on amslvs1 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [03:47:45] ikr [04:10:34] PROBLEM - Apache HTTP on srv224 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:11:10] PROBLEM - Apache HTTP on srv223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:11:19] PROBLEM - Apache HTTP on srv221 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:13:25] PROBLEM - LVS HTTP IPv4 on rendering.svc.pmtpa.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:13:34] PROBLEM - Apache HTTP on srv222 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:13:34] PROBLEM - Apache HTTP on srv219 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:14:19] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [04:14:26] ok, who broke the site ? [04:14:46] PROBLEM - Apache HTTP on srv220 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:16:52] PROBLEM - Swift HTTP on ms-fe2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:16:52] PROBLEM - LVS HTTP IPv4 on ms-fe.pmtpa.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:17:08] looks like thumbnails again .. [04:17:10] PROBLEM - Swift HTTP on ms-fe3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:17:18] wtf [04:17:37] PROBLEM - Swift HTTP on ms-fe4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:18:35] hrm, those machines are unhappy [04:18:46] memory leak [04:18:48] probably [04:19:34] apergos - there? [04:20:19] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused [04:20:28] the swift machines are totally swapping [04:20:57] fe [04:21:02] yeah [04:21:24] try rebooting one of them? 
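[Editor's note, not part of the log] The exchange that follows settles on restarting the Swift proxies rather than rebooting the ms-fe hosts, based on the observation that the swift-proxy-server processes were sitting in swap. A minimal sketch of how one might confirm that per-process (run as root on an ms-fe host; the process name match and use of smaps are assumptions, not commands taken from the log):

```bash
#!/bin/bash
# Sum the per-mapping "Swap:" counters in /proc/<pid>/smaps for every
# swift-proxy-server process, to see how much of each is paged out.
for pid in $(pgrep -f swift-proxy-server); do
    swap_kb=$(awk '/^Swap:/ {sum += $2} END {print sum+0}' "/proc/$pid/smaps")
    printf '%s\t%s kB in swap\n' "$pid" "$swap_kb"
done
```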
[04:21:45] i think restarting the swift processes would be a lot safer than rebooting one [04:22:53] all the swift-proxy-server processes are in swap [04:25:07] RECOVERY - LVS HTTP IPv4 on ms-fe.pmtpa.wmnet is OK: HTTP OK HTTP/1.1 200 OK - 397 bytes in 0.012 seconds [04:25:52] RECOVERY - Swift HTTP on ms-fe4 is OK: HTTP OK HTTP/1.1 200 OK - 397 bytes in 0.027 seconds [04:25:53] restart them - just swift proxy [04:26:05] looks like u did that [04:26:13] !log restarted swift proxy on ms-fe4 - it was in swap doom [04:26:23] yeah, tim is looking though, don't want to interrupt his debugging process [04:26:26] Logged the message, Mistress of the network gear. [04:26:39] can the site break when i haven't had a lot of wine please ? ;) [04:29:46] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.350 second response time [04:30:31] RECOVERY - Swift HTTP on ms-fe3 is OK: HTTP OK HTTP/1.1 200 OK - 397 bytes in 0.020 seconds [04:32:10] RECOVERY - Apache HTTP on srv224 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 6.020 second response time [04:35:01] RECOVERY - Swift HTTP on ms-fe2 is OK: HTTP OK HTTP/1.1 200 OK - 397 bytes in 2.138 seconds [04:37:07] PROBLEM - Apache HTTP on srv224 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:44:12] all good, i'm outtie [05:02:19] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours [05:08:19] PROBLEM - Puppet freshness on db42 is CRITICAL: Puppet has not run in the last 10 hours [05:19:07] PROBLEM - Swift HTTP on ms-fe2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:20:37] RECOVERY - Swift HTTP on ms-fe2 is OK: HTTP OK HTTP/1.1 200 OK - 397 bytes in 0.012 seconds [05:25:28] really [05:35:54] PROBLEM - Swift HTTP on ms-fe4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:37:24] RECOVERY - Swift HTTP on ms-fe4 is OK: HTTP OK HTTP/1.1 200 OK - 397 bytes in 0.012 seconds [05:44:54] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:48:30] PROBLEM - LVS HTTP IPv4 on ms-fe.pmtpa.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:49:09] !log ms-be12 reporting errors on disk /dev/sdi, I've rt'ed it an unmounted it in the meantime [05:49:22] Logged the message, Master [05:50:00] RECOVERY - LVS HTTP IPv4 on ms-fe.pmtpa.wmnet is OK: HTTP OK HTTP/1.1 200 OK - 397 bytes in 0.016 seconds [05:54:30] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 8.707 seconds [06:02:36] I need to go stand in line to see about my permit again, I've put itoff for awhile due to various things, and there is another series of strikes coming so I need to get it done before that [06:02:47] see folks probably around mid-day [06:06:48] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 210 seconds [06:06:57] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 215 seconds [06:10:06] PROBLEM - Swift HTTP on ms-fe1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:11:36] RECOVERY - Swift HTTP on ms-fe1 is OK: HTTP OK HTTP/1.1 200 OK - 397 bytes in 0.009 seconds [06:11:36] RECOVERY - MySQL Replication Heartbeat on db1035 is OK: OK replication delay 0 seconds [06:11:36] RECOVERY - MySQL Slave Delay on db1035 is OK: OK replication delay 0 seconds [06:13:38] slowness on swift is causing lock wait timeouts on various wikis [06:14:37] apparently a lock is acquired with LocalFile::lock() and then held while swift communication 
is done [06:15:31] but swift just hangs for a long long time before aborting the connection [06:20:00] PROBLEM - LVS HTTP IPv4 on ms-fe.pmtpa.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:21:11] pybal is flapping them [06:21:39] RECOVERY - LVS HTTP IPv4 on ms-fe.pmtpa.wmnet is OK: HTTP OK HTTP/1.1 200 OK - 397 bytes in 0.020 seconds [06:24:14] PROBLEM - Swift HTTP on ms-fe1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:27:23] RECOVERY - Swift HTTP on ms-fe1 is OK: HTTP OK HTTP/1.1 200 OK - 397 bytes in 0.007 seconds [06:28:53] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [06:31:17] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:35:38] PROBLEM - LVS HTTP IPv4 on ms-fe.pmtpa.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:35:39] PROBLEM - Swift HTTP on ms-fe4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:37:08] RECOVERY - LVS HTTP IPv4 on ms-fe.pmtpa.wmnet is OK: HTTP OK HTTP/1.1 200 OK - 397 bytes in 0.012 seconds [06:37:17] PROBLEM - Swift HTTP on ms-fe1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:38:47] RECOVERY - Swift HTTP on ms-fe1 is OK: HTTP OK HTTP/1.1 200 OK - 397 bytes in 0.007 seconds [06:38:47] RECOVERY - Swift HTTP on ms-fe4 is OK: HTTP OK HTTP/1.1 200 OK - 397 bytes in 0.011 seconds [06:43:15] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 7.416 seconds [06:56:27] PROBLEM - Swift HTTP on ms-fe4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:56:46] PROBLEM - Swift HTTP on ms-fe2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:57:57] RECOVERY - Swift HTTP on ms-fe4 is OK: HTTP OK HTTP/1.1 200 OK - 397 bytes in 0.011 seconds [06:58:06] RECOVERY - Swift HTTP on ms-fe2 is OK: HTTP OK HTTP/1.1 200 OK - 397 bytes in 0.016 seconds [07:01:15] PROBLEM - Swift HTTP on ms-fe1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:04:24] RECOVERY - Swift HTTP on ms-fe1 is OK: HTTP OK HTTP/1.1 200 OK - 397 bytes in 0.011 seconds [07:18:21] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:22:06] PROBLEM - Swift HTTP on ms-fe1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:23:36] RECOVERY - Swift HTTP on ms-fe1 is OK: HTTP OK HTTP/1.1 200 OK - 397 bytes in 0.007 seconds [07:31:15] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 2.276 seconds [07:31:38] hello [07:44:46] New patchset: Hashar; "remove 'configchange' script" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25939 [07:45:50] New review: Hashar; "rebased." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/25939 [07:45:51] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/25939 [07:51:04] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [07:51:04] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [07:51:28] New patchset: Tim Starling; "Increase the number of object server workers to 100" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27976 [07:52:29] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/27976 [07:52:35] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27976 [08:06:31] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:12:26] New patchset: Hashar; "zuul configuration for Wikimedia" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27611 [08:13:27] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/27611 [08:13:52] PROBLEM - LVS HTTP IPv4 on rendering.svc.pmtpa.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:14:28] Change abandoned: Hashar; "Squashed in Ife7e6e7d "zuul configuration for Wikimedia" which uses an "instance" definition and thu..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25236 [08:15:04] PROBLEM - Puppet freshness on storage3 is CRITICAL: Puppet has not run in the last 10 hours [08:15:20] New review: Hashar; "I have merged in Ic7187035which was describing the lab role." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/27611 [08:19:34] New patchset: Hashar; "zuul configuration for Wikimedia" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27611 [08:20:36] New review: Hashar; "Load zuul.pp in site.pp since we do not have an autoloader :-]" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/27611 [08:20:36] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/27611 [08:22:43] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.071 seconds [08:23:25] what's going on? [08:23:26] sigh [08:23:31] ah hi [08:24:40] PROBLEM - swift-object-server on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [08:42:58] PROBLEM - Puppet freshness on lvs1001 is CRITICAL: Puppet has not run in the last 10 hours [08:54:33] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:01:51] New patchset: Hashar; "zuul configuration for Wikimedia" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27611 [09:02:55] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/27611 [09:03:02] grrùmnnm [09:04:29] New patchset: Hashar; "zuul configuration for Wikimedia" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27611 [09:05:33] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/27611 [09:09:15] RECOVERY - swift-object-server on ms-be3 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [09:10:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.044 seconds [09:27:24] PROBLEM - swift-account-auditor on ms-be1 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [09:27:33] PROBLEM - swift-account-reaper on ms-be1 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [09:27:33] PROBLEM - swift-container-auditor on ms-be1 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [09:27:42] PROBLEM - swift-container-replicator on ms-be1 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [09:28:09] PROBLEM - swift-account-replicator on ms-be1 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [09:28:19] PROBLEM - swift-container-server on ms-be1 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [09:28:19] PROBLEM - swift-account-server on ms-be1 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [09:31:50] New patchset: Hashar; "zuul configuration for Wikimedia" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27611 [09:32:50] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/27611 [09:33:30] stuuupiiid puppet [09:33:31] grmbmblbl [09:33:55] New patchset: Hashar; "zuul configuration for Wikimedia" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27611 [09:34:56] New patchset: Hashar; "zuul configuration for Wikimedia" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27611 [09:35:59] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/27611 [09:40:27] RECOVERY - Apache HTTP on srv224 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.752 second response time [09:40:29] New review: Hashar; "Here is the summary of changes I have made between patchset 2 and patchset 9:" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/27611 [09:43:09] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:45:51] RECOVERY - LVS HTTP IPv4 on rendering.svc.pmtpa.wmnet is OK: HTTP OK HTTP/1.1 200 OK - 62515 bytes in 0.615 seconds [09:46:34] Change abandoned: Hashar; "cant really use that while we import everything in site.pp. Will have to look at using rspec at anot..." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/16139 [09:46:54] RECOVERY - Apache HTTP on srv219 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.028 second response time [09:47:21] RECOVERY - Apache HTTP on srv220 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.042 second response time [09:47:24] Change abandoned: Hashar; "packages have been installed manually" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16952 [09:47:30] Change abandoned: Hashar; "packages have been installed manually" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16953 [09:47:35] Change abandoned: Hashar; "packages have been installed manually" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16954 [09:47:39] RECOVERY - Apache HTTP on srv222 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.031 second response time [09:47:48] RECOVERY - Apache HTTP on srv221 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.046 second response time [09:48:02] Change abandoned: Hashar; "packages have been installed manually" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24479 [09:48:29] New patchset: Hashar; "jenkins: OpenStack jenkins-job-builder" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24620 [09:49:00] RECOVERY - Apache HTTP on srv223 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.040 second response time [09:49:27] New patchset: Hashar; "mw udp2log filter did not honor $log_directory" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24432 [09:50:30] New patchset: Hashar; "put library after sourcefile in varnish cc command" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24797 [09:51:34] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24620 [09:51:34] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24432 [09:51:34] New patchset: Hashar; "git::clone now support a specific sha1" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27175 [09:52:38] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24797 [09:52:39] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/27175 [09:57:42] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 6.252 seconds [09:58:09] RECOVERY - swift-account-auditor on ms-be1 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [09:58:27] RECOVERY - swift-container-replicator on ms-be1 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [09:58:36] RECOVERY - swift-container-auditor on ms-be1 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [09:58:36] RECOVERY - swift-account-reaper on ms-be1 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [09:59:12] RECOVERY - swift-account-replicator on ms-be1 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [09:59:12] RECOVERY - swift-account-server on ms-be1 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [09:59:39] RECOVERY - swift-container-server on ms-be1 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [10:31:46] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:39:11] mark: ping me when you're back [10:47:31] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.082 seconds [10:53:41] i am back [10:53:58] paravoid: pong [11:10:11] New review: Dereckson; "This change should be deployed according the i18n team planning." [operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/27962 [11:13:21] hey, sorry just saw that [11:13:27] saw the ipv6 issue via cogent? [11:13:39] saw that there was an email, haven't read it yet [11:13:43] i'm currently debugging swift issues :P [11:13:47] heh [11:13:48] or squid issues [11:13:55] still cannot_forwards [11:14:02] in the meantime, any reason not to switch bits back to esams? [11:14:04] i want sq85 to cache that video better [11:14:09] what, you haven't done that already? :P [11:14:11] that would workaround the issue just fine [11:14:26] that was supposed to happen on friday [11:16:08] !log switching European bits back to esams [11:16:18] (I restarted cp3001/2 varnish first just in case) [11:16:19] Logged the message, Master [11:16:45] so sq85 is requesting that video several times a second... [11:16:50] so, no, didn't do it on Friday and didn't feel like playing with it during the weekend in case it needs flaps back and forth [11:16:56] that would probably page everyone [11:19:04] i'm modifying sq85's squid config locally [11:19:10] puppet will replace it [11:19:23] disable puppet? :) [11:20:21] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:20:21] PROBLEM - Auth DNS on ns2.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [11:20:52] GET /wikipedia/commons/3/3b/Gertie_the_Dinosaur.ogv HTTP/1.0..User-Agent: Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.4 (KHTML, like Gecko) Chrome/22.0.1229.94 Safari/537.4..Host: upload [11:20:53] .wikimedia.org..Accept: */*..Accept-Language: en-US,en;q=0.8..Range: bytes=10201750-..Referer: http://en.wikipedia.org/wiki/Winsor_McCay..Pragma: no-cache..Via: 1.0 sq85.wikimedia.org:312 [11:20:53] 8 (squid/2.7.STABLE9)..X-Forwarded-For: 213.165.40.2, 208.80.152.91..Connection: keep-alive.... 
[11:21:11] god damn range requests [11:23:39] RECOVERY - Auth DNS on ns2.wikimedia.org is OK: DNS OK: 0.152 seconds response time. www.wikipedia.org returns 208.80.154.225 [11:24:09] !log restarting pdns @ ns2 [11:24:21] Logged the message, Master [11:24:27] what's the problem with the range rquest? [11:34:45] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 8.441 seconds [11:45:15] PROBLEM - Backend Squid HTTP on sq85 is CRITICAL: Connection refused [11:45:51] PROBLEM - Apache HTTP on srv223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:45:51] PROBLEM - Apache HTTP on srv222 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:47:21] RECOVERY - Apache HTTP on srv222 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.044 second response time [11:48:15] RECOVERY - Backend Squid HTTP on sq85 is OK: HTTP OK HTTP/1.0 200 OK - 468 bytes in 0.005 seconds [11:48:51] RECOVERY - Apache HTTP on srv223 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.031 second response time [11:49:32] !log Deployed range_offset_limit 64 MB change (from 0) to all upload backend squids [11:49:43] Logged the message, Master [11:53:03] PROBLEM - Puppet freshness on sq42 is CRITICAL: Puppet has not run in the last 10 hours [11:53:03] PROBLEM - Puppet freshness on srv220 is CRITICAL: Puppet has not run in the last 10 hours [11:54:06] PROBLEM - Puppet freshness on srv297 is CRITICAL: Puppet has not run in the last 10 hours [12:01:39] New patchset: J; "make labs TMH setup more like production" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/28007 [12:04:59] New patchset: J; "remove unused role role::jobrunner::videoscaler" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/28008 [12:06:01] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/28008 [12:09:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:15:06] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [12:15:06] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [12:23:39] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.050 seconds [12:37:13] Importing MrFluffyCam20120930.ogv...PHP Warning: wfMkdirParents: failed to mkdir "/mnt/upload7/wikipedia/commons/a/a6" mode 0777 in /home/wikipedia/common/php-1.21wmf1/includes/GlobalFunctions.php on line 2546 [12:37:13] Warning: wfMkdirParents: failed to mkdir "/mnt/upload7/wikipedia/commons/a/a6" mode 0777 in /home/wikipedia/common/php-1.21wmf1/includes/GlobalFunctions.php on line 2546 failed. (Could not create directory "mwstore://local-multiwrite/local-public/a/a6".) [12:37:22] while trying to use importImages on fenari [12:37:51] Oh, because fenari doesn't have /upload7 [12:38:17] on fenari it's /mnt/originals [12:38:34] yeah [12:38:38] should probably make that consistent [12:38:56] heh [12:39:17] I just ended up doing is_dir( '/mnt/originals' ) ? '/mnt/originals/ext-dist' : '/mnt/upload7/ext-dist'; [12:41:03] do we use /mnt/originals anywhere on fenari? 
[12:42:13] no [12:42:20] i just mounted it there when I first created the volume [12:42:27] to work on it [12:44:07] okay, I'll remount it to upload7 then [12:44:27] might want to do that with puppet consistent with the apaches [12:44:32] the current mount was manual [12:45:10] grumble [12:45:19] can we please pretty please please get rid of ext-dist [12:45:23] in that form at least [12:49:44] there's a bash in a screen that has its cwd in /mnt/originals, I'm killing it [12:50:24] PROBLEM - Puppet freshness on cp1040 is CRITICAL: Puppet has not run in the last 10 hours [12:55:39] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:57:26] the ipv6 tele2 issue is peculiar [12:58:30] I guess I'll let our network mistress to handle it [13:08:30] what did you find? [13:09:54] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 5.556 seconds [13:12:18] the pmtpa ipv6 network from cogent goes to tele2 and somewhere on tele2 it responds network unreachable [13:12:35] so cogent->pmtpa is broken over ipv6, hence the increased impact [13:12:45] are we peering with tele2 on eqiad? [13:12:59] can it be internal breakage between eqiad->pmtpa? [13:13:00] we have transit from tele2 in eqiad [13:13:41] so this only affects the pmtpa prefix? that would be really weird [13:14:53] no, eqiad too it seems [13:14:57] I use Cogent's looking glass [13:15:16] that makes more sense [13:16:10] from which router? [13:16:44] london for example [13:17:00] but more [13:17:08] ah [13:17:11] mostly european routers [13:17:14] yes [13:17:28] so the ams router of tele2 doesn't have our US prefixes [13:17:30] hmm [13:26:14] we are advertising that prefix to them in eqiad [13:28:41] !log reedy synchronized wmf-config/CommonSettings.php 'Remove extdist conditional for mount path' [13:28:53] Logged the message, Master [13:29:56] !log reedy synchronized wmf-config/throttle.php 'Updated from gitrepo' [13:30:08] Logged the message, Master [13:30:18] robh: can you look into the partition on storage3 today. still not installing OS. [13:30:38] yep, will take a look at it shortly, had the grub fail right? [13:31:23] correct [13:31:30] cool, will take a gander in a bit [13:31:54] !log Pointing php to php-1.21wmf1 [13:32:05] Logged the message, Master [13:34:22] !log Killing php-1.20wmf11 l10n cache from apaches [13:34:33] Logged the message, Master [13:35:06] !log Killing php-1.20wmf11 l10n cache from apaches [13:35:18] Logged the message, Master [13:43:00] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:57:06] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 2.166 seconds [14:14:44] Importing Video Recording of President William Jefferson Clinton Delivering Remarks to the Community - NARA - 6037378.ogv...PHP Warning: SwiftFileBackend::doStoreInternal: Invalid response (503): Unexpected HTTP return code: 503 in /home/wikipedia/common/php-1.21wmf1/includes/filebackend/SwiftFileBackend.php on line 1364 [14:14:44] Warning: SwiftFileBackend::doStoreInternal: Invalid response (503): Unexpected HTTP return code: 503 in /home/wikipedia/common/php-1.21wmf1/includes/filebackend/SwiftFileBackend.php on line 1364 failed. (An unknown error occurred in storage backend "local-swift".) [14:14:54] Trying to upload a 897M file [14:15:08] I uploaded a 920.65 MB one 15 minutes ago.. 
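[Editor's note, not part of the log] The failed upload above came from the importImages.php maintenance script hitting a transient 503 from the Swift backend. For a one-off retry of just the failed file, the script is normally pointed at a directory containing only that file; the wiki name, paths and comment below are illustrative placeholders, not values from the log, and the multiversion `mwscript` wrapper is assumed:

```bash
# Hypothetical single-file retry; paths, wiki and comment are made up here.
mkdir -p /tmp/retry-import
cp '/path/to/failed-file.ogv' /tmp/retry-import/
mwscript importImages.php --wiki=commonswiki \
    --comment='Batch import (retry after transient 503)' /tmp/retry-import
```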
[14:15:24] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [14:32:48] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:33:06] PROBLEM - Packetloss_Average on oxygen is CRITICAL: CRITICAL: packet_loss_average is 8.37411144 (gt 8.0) [14:45:15] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 4.420 seconds [14:51:58] !log reedy synchronized php-1.21wmf2/ 'Initial sync' [14:52:09] Logged the message, Master [14:55:35] !log reedy synchronized wmf-config/ [14:55:45] PROBLEM - Varnish traffic logger on cp1044 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [14:55:45] PROBLEM - Varnish traffic logger on cp1042 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [14:55:46] Logged the message, Master [14:56:30] PROBLEM - Varnish traffic logger on cp1043 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [14:56:52] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: test2wiki to 1.21wmf2 [14:57:03] Logged the message, Master [15:03:05] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours [15:09:05] PROBLEM - Puppet freshness on db42 is CRITICAL: Puppet has not run in the last 10 hours [15:12:03] !log reedy Started syncing Wikimedia installation... : Build localisation cache for 1.21wmf2 [15:12:15] Logged the message, Master [15:17:11] !log reedy synchronized php-1.21wmf1/extensions/MobileFrontend [15:17:23] Logged the message, Master [15:19:16] !log reedy synchronized php-1.21wmf2/extensions/MobileFrontend [15:19:28] Logged the message, Master [15:20:47] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:26:33] Could someone run this on fenari for me please? chown mwdeploy /home/wikipedia/common/wmf-config/ExtensionMessages-1.2* [15:28:00] just the dirs? [15:28:21] or did you want -R? 
[15:28:23] Reedy: [15:28:40] they're php files [15:28:44] ah [15:28:47] well they're done [15:28:49] in any case [15:28:52] thanks [15:28:55] yw [15:28:58] that'll stop scap complaining [15:29:11] great [15:33:41] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.233 seconds [15:35:47] RECOVERY - Varnish traffic logger on cp1043 is OK: PROCS OK: 3 processes with command name varnishncsa [15:36:05] RECOVERY - Varnish traffic logger on cp1044 is OK: PROCS OK: 3 processes with command name varnishncsa [15:36:14] RECOVERY - Varnish traffic logger on cp1042 is OK: PROCS OK: 3 processes with command name varnishncsa [15:41:02] PROBLEM - Packetloss_Average on oxygen is CRITICAL: CRITICAL: packet_loss_average is 8.67320725806 (gt 8.0) [15:43:31] !log stopping puppet on oxygen, looking for cause of recent packet loss [15:43:43] Logged the message, Master [16:03:39] New patchset: Demon; "Add new suffix for wikidata wikis" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/28040 [16:04:44] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [16:05:20] PROBLEM - Varnish traffic logger on cp1025 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [16:05:29] PROBLEM - Varnish traffic logger on cp1022 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [16:05:29] PROBLEM - Varnish traffic logger on cp1029 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [16:05:38] PROBLEM - Varnish traffic logger on cp1043 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [16:05:38] PROBLEM - Varnish traffic logger on cp1036 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [16:05:47] PROBLEM - Varnish traffic logger on cp1030 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [16:05:47] PROBLEM - Varnish traffic logger on cp1028 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [16:05:47] PROBLEM - Varnish traffic logger on cp1026 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [16:05:56] PROBLEM - Varnish traffic logger on cp1032 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [16:06:05] PROBLEM - Varnish traffic logger on cp1044 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [16:06:05] PROBLEM - Varnish traffic logger on cp1034 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [16:06:05] PROBLEM - Varnish traffic logger on cp1024 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [16:06:14] PROBLEM - Varnish traffic logger on cp1042 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [16:06:23] PROBLEM - Varnish traffic logger on cp1021 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [16:06:23] PROBLEM - Varnish traffic logger on cp1023 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [16:06:32] PROBLEM - Varnish traffic logger on cp1035 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [16:06:32] PROBLEM - Varnish traffic logger on cp1033 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [16:06:37] !log reedy Finished syncing Wikimedia installation... 
: Build localisation cache for 1.21wmf2 [16:06:41] PROBLEM - Varnish traffic logger on cp1027 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [16:06:48] Logged the message, Master [16:09:41] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:16:27] !log reedy synchronized wmf-config/InitialiseSettings.php 'Re-enable education program on enwiki' [16:16:39] Logged the message, Master [16:19:51] 3 [client 10.64.0.141] Symbolic link not allowed or link target not accessible: /usr/local/apache/common/docroot/wikibooks.org/w/extensions [16:19:51] 3 [client 10.64.0.129] Symbolic link not allowed or link target not accessible: /usr/local/apache/common/docroot/wikibooks.org/w/extensions [16:19:51] 2 [client 10.64.0.138] Symbolic link not allowed or link target not accessible: /usr/local/apache/common/docroot/wikibooks.org/w/extensions [16:21:10] argh [16:21:12] check_command check_http_lvs!bits.wikimedia.org!/skins-1.5/common/images/poweredby_mediawiki_88x31.png [16:21:51] has that been removed or something? [16:22:17] RECOVERY - Varnish traffic logger on cp1033 is OK: PROCS OK: 3 processes with command name varnishncsa [16:22:39] I changed /usr/local/apache/common/php earlier... And just re-ran the dsh command [16:22:44] RECOVERY - Varnish traffic logger on cp1036 is OK: PROCS OK: 3 processes with command name varnishncsa [16:22:53] RECOVERY - Varnish traffic logger on cp1028 is OK: PROCS OK: 3 processes with command name varnishncsa [16:23:11] that looks to have fixed the spam in the apache logs [16:23:47] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.030 seconds [16:26:27] !log reedy synchronized php-1.21wmf1/extensions/MobileFrontend/ [16:26:40] Logged the message, Master [16:27:08] PROBLEM - Varnish traffic logger on cp1028 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [16:27:08] PROBLEM - Varnish traffic logger on cp1036 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [16:27:53] PROBLEM - Varnish traffic logger on cp1033 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [16:28:54] !log reedy synchronized php-1.21wmf2/extensions/MobileFrontend/ [16:29:05] Logged the message, Master [16:30:08] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [16:35:55] New patchset: Nemo bis; "Remove obsolete $wgRateLimitsExcludedGroups" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/28044 [16:44:14] so what was the alerts about bits? [16:49:43] I think it was my fault [16:49:59] I changed a symlink and it was creating spam. Running the same command again fixed it :/ [16:55:38] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:10:11] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.188 seconds [17:10:29] RECOVERY - Varnish traffic logger on cp1029 is OK: PROCS OK: 3 processes with command name varnishncsa [17:10:47] RECOVERY - Varnish traffic logger on cp1022 is OK: PROCS OK: 3 processes with command name varnishncsa [17:10:56] RECOVERY - Varnish traffic logger on cp1028 is OK: PROCS OK: 3 processes with command name varnishncsa [17:12:37] cmjohnson1: you onsite today? 
[17:14:59] PROBLEM - Varnish traffic logger on cp1029 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [17:15:35] PROBLEM - Varnish traffic logger on cp1028 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [17:15:35] PROBLEM - Varnish traffic logger on cp1022 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [17:22:47] PROBLEM - udp2log log age for lucene on oxygen is CRITICAL: CRITICAL: log files /a/log/lucene/lucene.log, have not been written in a critical amount of time. For most logs, this is 4 hours. For slow logs, this is 4 days. [17:25:51] Reedy: question about common [17:26:25] it's big! [17:26:27] notpeter: yes I am here [17:26:43] there's an rsync cron job on all the search hosts to grab common/php/languages/messages [17:26:58] it started throwing errors this morning. did those files move? [17:27:12] I seemingly broke the symlink earlier on [17:27:18] and fixed it again, with the same command [17:27:36] cmjohnson1: pc2 was being lame and wasn't imaging properly on friday, but now is... sooooo.... we all good :) [17:28:01] awesome! the mac was wrong eh? [17:28:27] cmjohnson1: nah, it wasn't able to get the base image via tftp, or would time out doing so... iunno, was weird. but all good now! [17:28:45] that is weird [17:28:57] Reedy: hrm, ok. still getting cron fail emails as of 13 minutes ago [17:29:26] notpeter: is /usr/local/apache/common/php valid? [17:30:27] !log reedy synchronized php-1.21wmf1/includes/resourceloader/ResourceLoaderFileModule.php [17:30:39] Logged the message, Master [17:31:30] Reedy: so, this cron doesn't pull the whole dir or put it in the normal place... [17:31:42] What's it do? :/ [17:31:53] rsync -a --delete --exclude=**/.svn/lock --no-perms 10.0.5.8::common/php/languages/messages /a/search/conf/ [17:31:57] It's weird, I do this symlink change every 2 weeks or so [17:32:07] and it spits out [17:32:07] rsync: change_dir "/php/languages" (in common) failed: No such file or directory (2) [17:32:36] that exclude is useless ;) [17:32:40] !log reedy synchronized wmf-config/CommonSettings.php 'wgDebugLogGroups for resourceloader' [17:32:51] Logged the message, Master [17:32:53] oh, true! [17:33:04] hmm [17:33:40] Is it using nfs-home correct? [17:33:55] Are the nas boxes replicating back to nfs1? [17:35:20] so, the cron syncs over two other files, all.dblist and initilaizesettings.php [17:35:54] Reedy: ooooo, no [17:36:02] this is probably grabbing some outdated junk.... [17:36:17] heh [17:37:02] Actually [17:37:10] scap-2 is still pulling from the same host apparently.. [17:37:36] yeah, this is right, I habeeb [17:37:40] it's nfs-home [17:37:44] wihch is now on the netapps [17:38:29] OH [17:38:30] I wonder [17:38:36] Ohio? [17:38:41] It's not a relative symlink [17:38:47] it's explicit to /usr/local/... [17:40:21] it looks to be relative [17:40:33] I just fixed it again with dsh... [17:40:33] lrwxrwxrwx 1 mwdeploy mwdeploy 12 2012-10-15 17:39 php -> php-1.21wmf1 [17:40:38] ah, ok :) [17:40:45] Get it to try again? [17:41:01] works now. thanks :) [17:41:12] cronspam: reduce-o! [17:41:13] usu [17:41:15] *yay [17:42:21] New review: Aude; "looks good :)" [operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/28040 [17:44:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:50:56] AaronSchulz: does adding a new log group require any action on flourine? 
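[Editor's note, not part of the log] The cron failures above trace back to the `/usr/local/apache/common/php` symlink briefly pointing at a missing target, which also explains the earlier Apache "Symbolic link not allowed or link target not accessible" spam. A sketch of how such a flip can be done without a dangling-link window; the dsh group, the use of sudo, and the exact command are assumptions, not copied from the log:

```bash
# Create the new link under a temporary name, then rename it over the old one;
# the rename is atomic, so Apache never sees a missing /usr/local/apache/common/php.
dsh -g mediawiki-installation -M -- '
    cd /usr/local/apache/common &&
    sudo -u mwdeploy ln -s php-1.21wmf1 php.new &&
    sudo -u mwdeploy mv -T php.new php
'
```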
[17:52:11] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [17:52:11] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [17:54:27] Reedy: you should create the log file for it with the same perms as the others [17:54:53] I was thinking that was the case from when they were on fenari.. [17:55:34] Looks like I'll need to get ops to do that.. [17:58:10] Can someone please create on flourine: /a/mw-log/udp2log/resourceloader.log chmod 644 and owned by udp2log:udp2log [17:59:05] RECOVERY - Varnish traffic logger on cp1029 is OK: PROCS OK: 3 processes with command name varnishncsa [17:59:14] RECOVERY - Varnish traffic logger on cp1032 is OK: PROCS OK: 3 processes with command name varnishncsa [17:59:14] RECOVERY - Varnish traffic logger on cp1024 is OK: PROCS OK: 3 processes with command name varnishncsa [17:59:23] RECOVERY - Varnish traffic logger on cp1025 is OK: PROCS OK: 3 processes with command name varnishncsa [18:01:23] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.023 seconds [18:03:38] PROBLEM - Varnish traffic logger on cp1024 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [18:03:38] PROBLEM - Varnish traffic logger on cp1032 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [18:04:32] PROBLEM - Varnish traffic logger on cp1025 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [18:04:41] PROBLEM - Varnish traffic logger on cp1029 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [18:05:23] Reedy: maybe roan can [18:05:41] * AaronSchulz still finds it amusing how roan has teh roots :) [18:07:41] !log reedy synchronized php-1.21wmf1/extensions/MobileFrontend [18:07:52] Logged the message, Master [18:11:30] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: testwiki and mediawikiwiki to 1.21wmf2 [18:11:42] Logged the message, Master [18:16:32] PROBLEM - Puppet freshness on storage3 is CRITICAL: Puppet has not run in the last 10 hours [18:19:36] !log reedy synchronized php-1.21wmf2/extensions/CodeReview [18:19:47] Logged the message, Master [18:32:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:38:25] Reedy: What'cha logging? [18:38:27] (rl reatled) [18:38:42] the same as the exception should show [18:38:52] ok [18:39:08] As running phpunit against the cluster ain't gonna happen [18:42:20] <^demon|lunch> Why isn't jenkins catching it? [18:42:30] I suspect it's from some extension [18:42:54] which needs a now core test running on a mw install, with said extension setup [18:43:00] <^demon|lunch> I thought we did MobileFrontend tests already [18:43:16] It's a core test... [18:43:23] <^demon|lunch> Oh, core test but with extension installed, got it. [18:43:24] Do we run core tests where we test extensions? [18:43:28] yeahj [18:43:32] <^demon|lunch> No, we don't. 
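[Editor's note, not part of the log] Per the discussion above, a new wgDebugLogGroups entry needs its log file pre-created on the udp2log host (referred to as flourine here) with the same ownership and mode as the existing logs. The path, owner and mode below are exactly as requested in the channel; run as root:

```bash
touch /a/mw-log/udp2log/resourceloader.log
chown udp2log:udp2log /a/mw-log/udp2log/resourceloader.log
chmod 644 /a/mw-log/udp2log/resourceloader.log
```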
[18:43:32] PROBLEM - Puppet freshness on lvs1001 is CRITICAL: Puppet has not run in the last 10 hours [18:43:35] <^demon|lunch> hashar: ^ [18:43:56] and versions on beta labs are behind right now :( [18:46:13] Yay for logging [18:47:53] RECOVERY - Varnish traffic logger on cp1025 is OK: PROCS OK: 3 processes with command name varnishncsa [18:47:53] RECOVERY - Varnish traffic logger on cp1023 is OK: PROCS OK: 3 processes with command name varnishncsa [18:48:02] RECOVERY - Varnish traffic logger on cp1035 is OK: PROCS OK: 3 processes with command name varnishncsa [18:48:20] RECOVERY - Varnish traffic logger on cp1021 is OK: PROCS OK: 3 processes with command name varnishncsa [18:48:38] RECOVERY - Varnish traffic logger on cp1034 is OK: PROCS OK: 3 processes with command name varnishncsa [18:49:32] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.050 seconds [18:52:32] PROBLEM - Varnish traffic logger on cp1025 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [18:52:32] PROBLEM - Varnish traffic logger on cp1023 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [18:52:41] PROBLEM - Varnish traffic logger on cp1035 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [18:52:59] PROBLEM - Varnish traffic logger on cp1021 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [18:53:17] PROBLEM - Varnish traffic logger on cp1034 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [18:55:16] !log reedy synchronized php-1.21wmf1/includes/resourceloader/ResourceLoaderFileModule.php [18:55:28] Logged the message, Master [18:58:18] New patchset: Asher; "fix rmem tuning for udplog" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/28073 [18:58:19] !log reedy synchronized wmf-config/ [18:58:31] Logged the message, Master [18:59:26] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/28073 [18:59:38] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/28073 [19:02:35] RECOVERY - udp2log log age for lucene on oxygen is OK: OK: all log files active [19:04:35] !log restarted udp2log on oxygen with correct --recv-queue [19:04:47] Logged the message, Master [19:07:05] RECOVERY - Packetloss_Average on oxygen is OK: OK: packet_loss_average is 0.0 [19:09:02] RECOVERY - Varnish traffic logger on cp1044 is OK: PROCS OK: 3 processes with command name varnishncsa [19:20:53] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:34:50] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 6.225 seconds [19:35:47] Reedy: Any useful results yet? 
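[Editor's note, not part of the log] The oxygen packet loss below clears after the "fix rmem tuning for udplog" change and a udp2log restart "with correct --recv-queue". The general relationship, sketched with placeholder numbers (the actual values and the units of --recv-queue are not shown in the log and are guesses here): the kernel silently caps SO_RCVBUF at net.core.rmem_max, so the sysctl must allow at least the receive queue the logger asks for, or traffic bursts overflow the socket buffer and register as packet loss.

```bash
# Illustrative values only; allow a 64 MB UDP receive buffer...
sysctl -w net.core.rmem_max=$((64 * 1024 * 1024))
# ...then restart the collector with a matching queue, e.g.
#   udp2log --recv-queue=$((64 * 1024 * 1024)) ...
# (--recv-queue is the flag named in the log; its value/units here are assumed)
```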
[19:36:02] RECOVERY - Varnish traffic logger on cp1023 is OK: PROCS OK: 3 processes with command name varnishncsa [19:36:20] RECOVERY - Varnish traffic logger on cp1025 is OK: PROCS OK: 3 processes with command name varnishncsa [19:36:20] RECOVERY - Varnish traffic logger on cp1035 is OK: PROCS OK: 3 processes with command name varnishncsa [19:36:20] RECOVERY - Varnish traffic logger on cp1043 is OK: PROCS OK: 3 processes with command name varnishncsa [19:36:38] RECOVERY - Varnish traffic logger on cp1034 is OK: PROCS OK: 3 processes with command name varnishncsa [19:36:38] RECOVERY - Varnish traffic logger on cp1029 is OK: PROCS OK: 3 processes with command name varnishncsa [19:36:47] RECOVERY - Varnish traffic logger on cp1030 is OK: PROCS OK: 3 processes with command name varnishncsa [19:36:47] RECOVERY - Varnish traffic logger on cp1024 is OK: PROCS OK: 3 processes with command name varnishncsa [19:36:47] RECOVERY - Varnish traffic logger on cp1022 is OK: PROCS OK: 3 processes with command name varnishncsa [19:36:47] RECOVERY - Varnish traffic logger on cp1021 is OK: PROCS OK: 3 processes with command name varnishncsa [19:36:47] RECOVERY - Varnish traffic logger on cp1027 is OK: PROCS OK: 3 processes with command name varnishncsa [19:36:48] RECOVERY - Varnish traffic logger on cp1026 is OK: PROCS OK: 3 processes with command name varnishncsa [19:36:48] RECOVERY - Varnish traffic logger on cp1032 is OK: PROCS OK: 3 processes with command name varnishncsa [19:36:58] RECOVERY - Varnish traffic logger on cp1036 is OK: PROCS OK: 3 processes with command name varnishncsa [19:37:05] RECOVERY - Varnish traffic logger on cp1033 is OK: PROCS OK: 3 processes with command name varnishncsa [19:37:14] RECOVERY - Varnish traffic logger on cp1042 is OK: PROCS OK: 3 processes with command name varnishncsa [19:37:14] RECOVERY - Varnish traffic logger on cp1028 is OK: PROCS OK: 3 processes with command name varnishncsa [19:37:32] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [19:41:07] Krinkle: no, the file is still completely empty [19:57:55] !log powering down srv266 troubleshooting per rt2896 [19:58:07] Logged the message, Master [20:00:27] PROBLEM - Host es10 is DOWN: PING CRITICAL - Packet loss = 100% [20:00:54] PROBLEM - Host srv266 is DOWN: PING CRITICAL - Packet loss = 100% [20:01:39] RECOVERY - Host es10 is UP: PING OK - Packet loss = 0%, RTA = 0.35 ms [20:04:57] PROBLEM - mysqld processes on es10 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [20:05:42] !log reedy synchronized php-1.21wmf2/extensions/Collection/ 'Revert back to 9f37005e78a068821c45d9d313394c6728f55b4a' [20:05:53] Logged the message, Master [20:07:30] RECOVERY - Host srv266 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [20:07:54] !log reedy synchronized php-1.21wmf1/extensions/Collection/ 'Revert back to 9f37005e78a068821c45d9d313394c6728f55b4a' [20:08:05] Logged the message, Master [20:08:42] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:10:36] !log asher synchronized wmf-config/db.php 'pulling es10, third crash in three weeks' [20:10:48] Logged the message, Master [20:12:09] PROBLEM - Apache HTTP on srv266 is CRITICAL: Connection refused [20:14:48] cmjohnson1: just opened https://rt.wikimedia.org/Ticket/Display.html?id=3724 about the db es10 crashing regularly. it's currently out of service since it just crashed again, can you take a look today? 
pretty high priority [20:18:32] binasher...yes [20:20:30] !log awjrichards synchronized php-1.21wmf1/extensions/MobileFrontend/ 'touch files' [20:20:42] Logged the message, Master [20:21:19] !log Attempting to purge out another 500k pages with CentralNotice skin problem (this time not crashing the cluster by filtering out media files and rate limiting to no more than 50hps). Estimated run time is 3 hours. [20:21:31] Logged the message, Master [20:23:33] RECOVERY - Memcached on srv193 is OK: TCP OK - 0.004 second response time on port 11000 [20:25:48] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.016 seconds [20:25:57] RECOVERY - Apache HTTP on srv266 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.337 second response time [20:35:40] !log awjrichards synchronized php-1.21wmf1/extensions/MobileFrontend/ 'touch files' [20:35:51] Logged the message, Master [20:37:15] !log rate limiting for CentralNotice purge is working too well... ie: script is not doing anything -- debugging [20:37:23] !log reedy synchronized php-1.21wmf1/includes/resourceloader/ResourceLoaderFileModule.php [20:37:27] Logged the message, Master [20:37:38] Logged the message, Master [20:43:13] !log reedy synchronized php-1.21wmf1/extensions/ArticleFeedback [20:43:24] Logged the message, Master [20:43:37] can someone please flush the mobile varnish cache? [20:43:48] binasher, LeslieCarr, mutante ^ [20:46:22] It's oh-so quiet.... [20:49:03] !log flushing mobile varnish cache [20:49:06] !log reedy synchronized php-1.21wmf2/extensions/ArticleFeedback [20:49:07] awjr: done [20:49:10] thank you mutante [20:49:15] Logged the message, Master [20:49:17] yw [20:49:27] Logged the message, Master [20:52:29] !log shutting down es10 for per rt3724 [20:52:41] Logged the message, Master [20:52:54] awjr: still need regular cache flushes with resourceloader? :( [20:52:59] !log reedy synchronized php-1.21wmf1/resources/ [20:53:11] Logged the message, Master [20:53:14] binasher: RL still isn't fully mobile-friendly; we're working on it [20:53:28] !log reedy synchronized php-1.21wmf2/resources/ [20:53:40] Logged the message, Master [20:55:25] binasher: es10 has a DIMM error ...probably explains your problem [20:55:39] ahh [20:56:06] PROBLEM - Host es10 is DOWN: PING CRITICAL - Packet loss = 100% [20:58:57] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:02:34] RECOVERY - Host es10 is UP: PING OK - Packet loss = 0%, RTA = 0.63 ms [21:11:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.400 seconds [21:19:58] https://bugzilla.wikimedia.org/show_bug.cgi?id=39322 bits.wikimedia.org don't load over 6to4 connection via Hurricane Electric [21:23:09] New patchset: Asher; "memcached conf for new mc servers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/28151 [21:24:12] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/28151 [21:24:40] New patchset: Asher; "memcached conf for new mc servers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/28151 [21:25:43] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/28151 [21:26:12] New patchset: Dereckson; "(bug 40616) Temporarily reenable wgUseNPPatrol on fi.wikipedia" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/28152 [21:27:34] New patchset: Hashar; "(bug 40616) Temporarily reenable wgUseNPPatrol on fi.wikipedia" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/28152 [21:28:20] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/28152 [21:29:10] !log hashar synchronized wmf-config/InitialiseSettings.php '(bug 40616) Temporarily reenable wgUseNPPatrol on fi.wikipedia' [21:29:14] New review: Hashar; "Deployed live!" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/28152 [21:29:22] Logged the message, Master [21:35:24] New patchset: Asher; "memcached conf for new mc servers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/28151 [21:36:27] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/28151 [21:39:16] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/28151 [21:40:47] New review: Dereckson; "Note: there is a typo in the changelog." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/28152 [21:43:01] New patchset: Asher; "remove dueling @monitor_groups" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/28153 [21:44:06] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/28153 [21:44:28] !log resuming purge of CentralNotice cache (now working) with rate limit of 50hps. estimated completion in 3hrs. [21:44:39] Logged the message, Master [21:47:24] New patchset: Asher; "add mecached ganglia group" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/28154 [21:48:29] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/28154 [21:49:51] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 249 seconds [21:51:21] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 10 seconds [21:54:01] !log catrope synchronized php-1.21wmf2/includes/resourceloader/ResourceLoaderFileModule.php '1f761e3bf349df8d48592f71ed4afff63471c6b4' [21:54:13] Logged the message, Master [21:54:21] PROBLEM - Puppet freshness on srv220 is CRITICAL: Puppet has not run in the last 10 hours [21:54:21] PROBLEM - Puppet freshness on sq42 is CRITICAL: Puppet has not run in the last 10 hours [21:55:09] !log catrope synchronized php-1.21wmf2/extensions/VisualEditor/ 'f25aaf5bc4449ea8c1ac749a96487485b5f0c6b3' [21:55:21] Logged the message, Master [21:55:24] PROBLEM - Puppet freshness on srv297 is CRITICAL: Puppet has not run in the last 10 hours [21:59:25] RoanKattouw, have you done deploying? my window starts soon [21:59:37] Yes, sorry, go ahead [22:02:08] paravoid: did you git blame those file paging lines in swift? [22:06:08] * jeremyb wonders what the h in hps means [22:06:31] AaronSchulz: seen bug 41028 ? 
[22:07:06] * AaronSchulz always clicks those and gets stuff like https://bugzilla.mozilla.org/show_bug.cgi?id=41028 [22:07:07] wah wah [22:07:26] haha [22:07:31] !b 41028 [22:07:31] https://bugzilla.wikimedia.org/41028 [22:09:52] * AaronSchulz loves the --si parameter to disk/file related programs [22:11:42] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/28154 [22:14:24] AaronSchulz: danke [22:14:42] the size of that file sounds close to the magic number [22:14:47] RoanKattouw would remember this too [22:15:30] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:15:38] could be something different though [22:15:44] AaronSchulz: Oh the 3 billion bytes limit? [22:15:52] Or trillion? [22:16:09] remember how copy operations would truncate around a certain amount [22:16:11] 3G using powers of 10, IIRC [22:16:13] Oh, right, that [22:16:15] you were debugging it for hours [22:16:15] But that was NFS [22:16:19] right [22:16:24] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [22:16:24] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [22:16:26] Good call on sending him to hume though [22:16:31] Cause that's what fixed it before [22:16:35] that's why I'm not sure it will help, but tis worth a try [22:16:44] also it should be on hume anyway, bad reedy! :) [22:21:35] New review: Aude; "This looks like the standard setup and I think is good for the main.conf file." [operations/apache-config] (master) C: 1; - https://gerrit.wikimedia.org/r/27546 [22:23:29] New patchset: Pyoungmeister; "adding rmoen to mortals group" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/28161 [22:24:11] Uploads? [22:24:32] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/28161 [22:29:16] New patchset: Pyoungmeister; "adding rmoen to mortals group" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/28161 [22:30:17] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/28161 [22:31:06] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.513 seconds [22:35:15] New patchset: Asher; "fix incorrect class scop" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/28164 [22:36:17] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/28164 [22:36:25] scop scoop scope [22:37:18] oh scap! [22:37:50] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/28161 [22:38:45] * AaronSchulz will have to start greeting asher each day with "Hey, so how is that ceph cluster coming along?" [22:39:44] troll [22:39:58] LeslieCarr: I thought you guys wanted to serve us better? [22:40:20] peter does, the rest of us are surly as ever [22:40:28] heh [22:41:46] hey, I'm surly [22:41:59] I'm surly interested in halping people get their jobs done ;) [22:42:23] New patchset: Asher; "misplaced ";". bad ";"!" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/28167 [22:43:22] notpeter: aaron needs you to be surly to dell to get eqiad r720's here faster. very surly. [22:43:27] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/28167 [22:43:42] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/28167 [22:44:35] let me go get my helping threatening stick. [22:45:04] !op_on_duty is notpeter [22:45:04] Key was added [22:45:26] ooh [22:45:32] !whofromopsiscurrentlyonduty [22:45:36] mutante: it's broken :P [22:45:39] lulz [22:45:40] mutante: Please don't abuse the pling-op stalkword. :-) [22:46:54] !op_on_duty [22:46:54] notpeter [22:47:01] sup? [22:47:07] Reedy: need infobot:) eh, i mean "flooterbuck" [22:47:11] binasher: Funny. [22:47:14] hmm. notpeter, do you know what's up with eqiad mc hosts? wm-bot says you might [22:47:42] binasher: yes, but that has nothing to do with what wm-bot knows [22:48:05] !wheremyserversat [22:48:20] !reboot analytics1007 [22:48:26] 1001,1003-1005,1007,1008, 1011-1014 are up [22:48:30] hah [22:48:34] 15 and 16 are waiting on doobers [22:48:44] 1002, 1006 have hardware issues [22:48:47] there are tickets for those [22:48:53] and 1009 and 1010 have some kinda network issues [22:49:09] 1009 and 1010 need cables that both intel and juniper agree are cables [22:49:14] blerg [22:49:14] which is surprisingly difficult [22:49:21] So not string? [22:49:29] Have you tried attaching plastic cups? [22:50:41] cups! That's the key [22:50:58] yeah, printing works with that [22:50:59] i believe we were using full cans of soda [22:51:21] PROBLEM - Puppet freshness on cp1040 is CRITICAL: Puppet has not run in the last 10 hours [22:53:47] cp1040, yep, frozen and down, i wonder it just reports Puppet freshness [22:54:11] how does one change the topic of an irc channel? [22:54:22] /topic #wikimedia-operations foo [22:54:50] but needs ops first... [22:54:54] how do we get ops? [22:55:01] you dont need [22:55:03] |ops [22:55:06] it was mode -t [22:55:10] last time at least [22:56:03] heh, yeah, which of the bots has opping feature [22:56:44] /msg chanserv op #wikimedia-operations [22:57:26] not authorized... [22:57:41] blah [22:57:45] silly irc [22:58:10] Mode lock : -t [22:58:29] you dont need ops [22:58:48] i saw the mode lock with /msg chanserver info #channel [22:58:52] chanserve [22:59:24] chanserv, damn [23:04:02] !log power cycling cp1040 [23:04:14] Logged the message, Master [23:05:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:06:42] New patchset: Pyoungmeister; "adding s page to mortals group" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/28173 [23:07:48] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/28173 [23:18:03] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.755 seconds [23:26:35] New patchset: Pyoungmeister; "adding s page to mortals group" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/28173 [23:27:58] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/28173 [23:28:32] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/28173 [23:52:15] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds