[00:04:22] okay, never mind, found it. [00:07:15] gn8 folks [01:19:39] !log synchronized payments cluster to r106909 [01:19:49] Logged the message, Master [01:27:57] zzz [01:34:37] editing is very slow for me... hm [01:34:50] and rotatebot needed 50 seconds to do a simple page edit [01:35:11] (from Germany at Commons) [01:37:46] same timing for me editing via API.. [01:39:47] 48 seconds for 1219225 bytes to upload on cluster [01:40:05] Saibo: thanks for saying, I thought it was my internetconnection [01:41:05] Reedy: I am not talking about uploads [01:41:07] page edits [01:41:18] it is just for viewing a page [01:41:21] Doesn't matter [01:41:21] no [01:41:24] for editing [01:41:24] It's slow to upload :p [01:41:41] via quickdelete - which uses JS and API (I guess) [01:42:00] https://commons.wikimedia.org/wiki/Special:Contributions/Saibo look [01:42:21] the file page edit and the talk page edit are usually only a second or so separated [01:42:26] I get a message that the servers are overloaded [01:42:43] ms1/2 are showing very high load... ms5 is also highish [01:42:43] -> Sorry, the servers are overloaded at the moment. Too many users are trying to view this page. Please wait a while before you try to access this page again. [01:42:43] Hmm [01:42:56] (nl-wiki) [01:43:00] commons was slow too [01:43:07] Api app servers incoming traffic dropped by around 50% 20 minutes ago [01:43:11] When did you notice it? [01:43:32] a minute ago [01:43:37] at nl-wiki [01:43:42] at commons the past hours [01:44:10] as you can see in my contribs: 2011-12-21T01:34:57 2011-12-21T01:35:47 [01:44:19] for two edits which belong together ;) [01:44:32] nothing earlier - but didn't edit then via api [01:44:50] *trying a normal edit* [01:45:23] sloooow too [01:45:41] -*waiting* [01:45:50] seems to be the same [01:45:55] roughly [01:46:03] at least 25 seconds [01:46:48] I've pinged ops [01:46:54] thanks [01:47:42] the API problems notice is simply made permanent, huh? 
;) [01:48:18] I wasn't sure of the state of that, so I left it :p [01:48:26] !replag [01:48:26] :D [01:48:29] @replag [01:48:31] Reedy: No replag currently. See also "replag all". [01:48:38] @replag all [01:48:39] Ryan_Lane: [s1] db32: 0s, db36: 0s, db12: 0s, db26: 0s, db38: 0s; [s2] db30: 0s, db13: 0s, db24: 0s; [s3] db34: 0s, db39: 0s, db25: 0s, db11: 0s [01:48:39] @replag all [01:48:40] Ryan_Lane: [s4] db22: 0s, db31: 0s, db33: 0s; [s5] db45: 0s, db35: 0s, db44: 0s; [s6] db47: 0s, db46: 0s, db50: 2s; [s7] db16: 0s, db37: 0s, db18: 0s [01:48:42] Reedy: [s1] db32: 0s, db36: 0s, db12: 0s, db26: 0s, db38: 0s; [s2] db30: 0s, db13: 0s, db24: 0s; [s3] db34: 0s, db39: 0s, db25: 0s, db11: 0s [01:48:42] maplebed, ^ [01:48:43] Reedy: [s4] db22: 0s, db31: 0s, db33: 0s; [s5] db45: 0s, db35: 0s, db44: 0s; [s6] db47: 0s, db46: 0s, db50: 0s; [s7] db16: 0s, db37: 0s, db18: 0s [01:48:50] thanks Reedy [01:49:38] did someone turn on debug mode? [01:50:01] see: /home/w/syslog/apache.log [01:50:46] no deploys today? [01:51:05] There's been allsorts [01:51:22] ah. ok [01:51:29] something is looking for /php-1.18/resources/mediawiki/mediawiki.debug.css and /php-1.18/resources/mediawiki/mediawiki.debug.js [01:51:34] over and over and over and over [01:51:41] Ah [01:51:45] obviously that's problematic [01:51:48] it started to get really slow betwee 2011-12-21T01:21:29 and 2011-12-21T01:23:30 [01:52:00] ganglia shows roughly 30 mins ago [01:52:05] (just was looking in rotatebot's contribs) [01:52:19] thanks :) [01:52:38] 2011-12-21T01:17:41 was slow, too.. sorry [01:52:49] meeh.. [01:52:53] Reedy: any idea why mediawiki would be looking for that? [01:52:54] Those files don't exist [01:52:54] much more earlier.. [01:53:01] I know they don't [01:53:09] !log synchronized payments cluster to r106917 [01:53:14] why is anything trying to load debug css and js? [01:53:18] Logged the message, Master [01:53:37] let's create some temp files [01:53:55] go for it [01:54:18] 2011-12-21T00:42:12 .. 
before it was much faster (3 seconds instead of 30+) [01:54:35] that is my last call :D [01:54:36] !log reedy synchronized php-1.18/resources/mediawiki 'creating empty mediawiki.debug.css/js' [01:54:45] Logged the message, Master [01:54:54] They're in trunk/phase3 [01:55:17] why are they trying to be loaded anyway? [01:55:32] No idea [01:55:37] also, note, this could be a red herring [01:55:57] yeah, making empty files should stop the error log spam at least [01:56:10] yeah [01:56:13] No code has been pushed for 6 hours or so [01:56:57] tstarling cleared profiling data [01:57:17] Error log looks more normal now at leat [01:57:25] lemme see if apaches are being depooled [01:58:38] I really doubt that [01:58:47] what about db43? [01:59:50] there are a number of srv and mw boxes depooled [02:00:02] fetches taking longer than 3 seconds [02:00:09] anyway. on to the database [02:00:25] could an apache restart help? [02:00:31] someone attempted to depool it by the looks [02:01:20] hrm [02:02:01] it's depooled, don't worry [02:02:06] just profiling is screwed up [02:04:09] Ryan_Lane, it's cause core has the mediawiki.debug defined [02:04:19] but no files [02:04:24] * Reedy looks who is to blame [02:04:54] root cleared profiling data [02:05:07] Neil in r106062 [02:05:08] that's better [02:06:08] !log LocalisationUpdate completed (1.18) at Wed Dec 21 02:06:08 UTC 2011 [02:06:17] Logged the message, Master [02:06:23] LocalisationCache::recache is running a lot [02:06:29] oh, LU? [02:06:48] it was fixed recently, wasn't it? [02:06:50] Reedy: ? [02:07:07] neilk_, you'd added mediawiki.debug to Resouces.php, but not copied in the files [02:07:18] It was spamming the log files a lot [02:07:23] LocalisationCache::recache profiling at 24% real time, 3.6 seconds each [02:07:51] Special:FundraiserLandingPage also has something very wrong with it [02:11:12] it is fast again [02:11:14] Reedy: ok I see. johnduhart added that to trunk, I accidentally pulled that in with my changes. 
Sorry. [02:12:00] neilk_, I don't think you did any damage, other than the spamming in the error logs... :) [02:12:16] he! he did - stealing our time! ;) [02:12:23] hi neilk_, btw :) [02:12:28] hi Saibo. [02:12:36] TimStarling, any more information than that? [02:12:44] tstarling cleared profiling data [02:12:56] .. shit happens :) [02:13:08] it's still broken [02:13:26] not for me and not for rotatebot [02:13:38] network traffic is going back to normal on the apaches... [02:14:16] api network is still lower [02:15:19] I wish torrus wasn't totally screwed. [02:15:53] ru.wikipedia.org seems especially slow and there are a ton of cache misses for just http://ru.wikipedia.org/ in the squid sampled 1000 log that seem to stand out over other stuff [02:16:49] just looking at the file timestamps in /tmp/mw-cache-1.18 tells you that all sorts of languages are being rebuilt [02:16:53] network traffic seems to have doubled on the squids. [02:16:57] * Ryan_Lane nods [02:17:29] that's probably not related to any network traffic spike [02:18:35] top 5 sorted cache misses in the current squid sampled log: [02:18:38] 5518 http://ru.wikipedia.org/ [02:18:39] 607 http://en.m.wikipedia.org/favicon.ico [02:18:39] 562 http://en.wikipedia.org/favicon.ico [02:18:39] 555 https://bits.wikimedia.org/donate.wikimedia.org/load.php [02:18:40] 510 http://en.wikipedia.org/w/index.php?title=Special:RecentChanges&feed=atom [02:18:54] *1000.. that's a lot beyond any other languages [02:19:01] binasher: you moved udpprofile to professor right? [02:19:07] right [02:19:21] can you update /usr/local/bin/clear-profile at some point? it's probably in puppet [02:19:30] aw, yup.. 
sorry bout that [02:19:55] not urgent [02:31:29] tstarling cleared profiling data [02:41:35] !log tstarling synchronized php-1.18/includes/LocalisationCache.php 'r106922' [02:41:44] Logged the message, Master [02:43:19] !log tstarling synchronized wmf-config/InitialiseSettings.php 'LC recache log' [02:43:27] Logged the message, Master [02:55:16] PROBLEM - mobile traffic loggers on cp1043 is CRITICAL: PROCS CRITICAL: 1 process with args varnishncsa [02:55:38] RECOVERY - Puppet freshness on ms1002 is OK: puppet ran at Wed Dec 21 02:55:20 UTC 2011 [03:30:00] !log tstarling synchronized php-1.18/includes/LocalisationCache.php 'r106927' [03:30:20] Logged the message, Master [03:51:29] !log doing a manual run of l10nupdate to check recache timings [03:51:37] Logged the message, Master [03:56:58] !log LocalisationUpdate completed (1.18) at Wed Dec 21 03:56:58 UTC 2011 [03:57:07] Logged the message, Master [04:19:45] images dat are purged aren't loading well [04:19:49] now [04:20:04] then only file name is shown [04:41:43] New patchset: tstarling; "Attempting to fix l10nupdate on the image scalers. Everything in the mediawiki-installation dsh node group should be able to get LU updates. Hume is also broken and should probably be in applicationserver::home-no-service, but I'll leave that for another " [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1653 [04:41:56] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1653 [04:42:44] New review: tstarling; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1653 [04:42:45] Change merged: tstarling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1653 [04:48:16] New review: tstarling; "Tested." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1653 [04:48:49] Romaine: still? [04:49:01] Romaine: and what is "that are purged"? 
[04:53:06] * jeremyb beeps Romaine… [05:47:53] PROBLEM - Puppet freshness on es1002 is CRITICAL: Puppet has not run in the last 10 hours [08:18:31] PROBLEM - MySQL slave status on es1004 is CRITICAL: CRITICAL: Slave running: expected Yes, got No [08:19:31] PROBLEM - Puppet freshness on gallium is CRITICAL: Puppet has not run in the last 10 hours [08:21:51] PROBLEM - Disk space on hume is CRITICAL: DISK CRITICAL - free space: / 341 MB (5% inode=79%): /a/static/uncompressed 23167 MB (2% inode=99%): [09:06:35] New review: Hashar; "(no comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/1606 [09:13:14] PROBLEM - Disk space on db9 is CRITICAL: DISK CRITICAL - free space: /a 10754 MB (3% inode=99%): [09:30:46] !log a few more binlogs deleted on db9... [09:30:56] Logged the message, Master [10:12:19] !log dataset1 kernel panics in lgo during copy :-( :-( [10:12:28] Logged the message, Master [10:15:48] Log s/lgo/log/ as in syslog. saving a copy of the bad log in fenari:/home/ariel/dataset1-syslog-dec-20-2012 [10:22:23] RECOVERY - MySQL slave status on es1004 is OK: OK: [10:26:27] apergos: you forgot the ! [10:26:39] thanks [10:27:01] !log s/lgo/log/ as in syslog. saving a copy of the bad log in fenari:/home/ariel/dataset1-syslog-dec-20-2012 [10:27:10] Logged the message, Master [11:18:04] !log rebooting ds1 as it's got the one cpu tied up with a hung scp process and continual spewing to syslog... [11:18:12] Logged the message, Master [11:18:22] if it actually reboots, that is... [11:22:02] bahhh [11:44:51] New patchset: Hashar; "nightly mobile build dir was duplicated" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1654 [11:45:01] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." 
[operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/1654 [11:46:01] New patchset: Hashar; "nightly mobile build dir was duplicated" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1654 [11:46:13] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1654 [11:47:23] New patchset: Dzahn; "make the process check on mobile traffic loggers a bit more relaxed" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1655 [11:47:35] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1655 [11:48:02] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1655 [11:48:03] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1655 [11:49:33] New review: Dzahn; "looks good. should fix gallium. checking" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1654 [11:49:33] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1654 [11:51:07] RECOVERY - Puppet freshness on gallium is OK: puppet ran at Wed Dec 21 11:50:55 UTC 2011 [12:45:45] PROBLEM - Puppet freshness on amssq53 is CRITICAL: Puppet has not run in the last 10 hours [13:09:16] "Status: Site slowness" - still? Doesn't look like by watching Rotatebot's edits. [13:10:38] site slowness? :) [13:11:01] how can that happen? [13:32:51] RECOVERY - mobile traffic loggers on cp1043 is OK: PROCS OK: 3 processes with args varnishncsa [13:32:51] RECOVERY - mobile traffic loggers on cp1044 is OK: PROCS OK: 1 process with args varnishncsa [13:38:00] PROBLEM - NFS on dataset1 is CRITICAL: Connection refused [13:40:53] New patchset: Hashar; "enable testswarm on gallium" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1646 [13:41:04] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1646 [13:53:21] PROBLEM - SSH on dataset1 is CRITICAL: Connection refused [13:58:27] !log powering on and off ds1 the hard way via the pdu. [13:58:35] Logged the message, Master [14:39:44] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [14:49:42] New patchset: Dzahn; "small fixes to make the "nightly builds"-page validate as XHTML 1.0 Strict" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1657 [14:49:54] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1657 [14:50:51] New review: Dzahn; "this makes it look nice on validator.w3.org" [operations/puppet] (production); V: 1 C: 1; - https://gerrit.wikimedia.org/r/1657 [14:56:59] New review: Hashar; "(no comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/1657 [15:02:11] New review: Dzahn; "re-enabling after manual package removal and fix for user account being created" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1646 [15:02:11] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1646 [15:17:14] !log installing 2.6.38 from natty backports on ds1 for further testing [15:17:22] Logged the message, Master [15:21:25] !log reboot dataset1 with new kernel [15:21:34] Logged the message, Master [15:21:52] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1647 [15:25:22] New patchset: Catrope; "script to fetch mediawiki + puppetization" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1647 [15:25:34] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1647 [15:26:50] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1647 [15:26:50] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1647 [15:30:05] !log and starting another huge copy from ds2 to ds1, let's see what happens... [15:30:15] Logged the message, Master [15:39:44] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1623 [15:39:44] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1623 [15:42:02] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1649 [15:44:25] New patchset: Hashar; "gallium: avoid duplicate sudo_user definitions" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1658 [15:44:31] RECOVERY - NFS on dataset1 is OK: TCP OK - 0.000 second response time on port 2049 [15:44:37] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1658 [15:46:41] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1658 [15:46:42] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1658 [15:49:17] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1606 [15:49:18] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1606 [15:56:31] RECOVERY - SSH on dataset1 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [16:09:46] New patchset: Hashar; "testswarm: make sure we have a system user" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1659 [16:09:59] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1659 [16:11:11] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1659 [16:11:12] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1659 [16:16:45] RECOVERY - Puppet freshness on ms1002 is OK: puppet ran at Wed Dec 21 16:16:37 UTC 2011 [16:21:07] !log so that was fast. barf from scp, nice call trace etc, shot the process on ds2, will email the vendor [16:21:16] Logged the message, Master [16:38:45] PROBLEM - Puppet freshness on es1002 is CRITICAL: Puppet has not run in the last 10 hours [16:46:01] New patchset: Hashar; "testswarm: fix /etc/testswarm permissions" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1660 [16:46:15] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1660 [16:47:16] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1660 [16:47:19] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1660 [16:52:39] New patchset: Hashar; "testswarm-checkouts.conf is not needed" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1661 [17:01:10] New patchset: Hashar; "add index.html to DirectoryIndex for integration" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1662 [17:01:57] New review: Hashar; "This is a cherry pick of bf0f391d from test to production" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/1662 [17:02:00] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1661 [17:02:01] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1661 [17:02:23] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - 
https://gerrit.wikimedia.org/r/1662 [17:02:24] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1662 [17:12:29] New patchset: Hashar; "testswarm: index.html as default for HTTPS too" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1663 [17:12:41] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1663 [17:15:14] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1663 [17:15:14] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1663 [17:17:16] New review: Hashar; "It fixed the issue! Thanks for the fast merge 8-)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1663 [17:25:17] New patchset: Dzahn; "change max_concurrent_checks from 8 to 64 - rebased" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1649 [17:25:30] New patchset: Dzahn; "remove special.cfg from nagios" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1648 [17:25:41] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1649 [17:25:41] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1648 [17:25:46] New review: Dzahn; "was already reviewed" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1649 [17:27:18] New patchset: Mark Bergsma; "Create separate Ganglia cluster(s) for Swift" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1664 [17:27:30] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1664 [17:27:46] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1664 [17:27:47] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1664 [17:31:06] New patchset: Mark Bergsma; "Setup ganglia aggregators for swift clusters" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1665 [17:31:18] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1665 [17:31:25] New patchset: Dzahn; "change max_concurrent_checks from 8 to 64 - rebased - remove special.cfg here so it doesnt break" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1649 [17:31:37] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1649 [17:32:30] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1648 [17:32:31] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1648 [17:32:41] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1649 [17:32:41] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1649 [17:36:16] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1665 [17:36:17] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1665 [17:51:12] !log catrope synchronized php-1.18/extensions/ArticleFeedbackv5/modules/jquery.articleFeedbackv5/jquery.articleFeedbackv5.js 'r106959' [17:51:20] Logged the message, Master [17:52:03] !log srv224 has a full disk [17:52:07] apergos: ---^^ [17:52:12] Logged the message, Mr. 
Obvious [17:54:20] (busy) [18:01:46] New patchset: Mark Bergsma; "Allow gmond access" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1666 [18:01:58] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1666 [18:02:12] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1666 [18:02:13] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1666 [18:11:13] New patchset: Bhartshorne; "allowing swift hosts to be gmond listeners" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1667 [18:11:25] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1667 [18:16:55] Change abandoned: Bhartshorne; "mark already did this." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1667 [18:22:50] New patchset: Bhartshorne; "allowing swift hosts to hear their peers' multicast traffic" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1668 [18:22:59] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/1668 [18:24:37] New patchset: Bhartshorne; "allowing swift hosts to hear their peers' multicast traffic" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1668 [18:24:49] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1668 [18:25:08] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1668 [18:25:09] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1668 [18:29:10] RECOVERY - Disk space on hume is OK: DISK OK [18:37:10] PROBLEM - HTTP on fenari is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:42:01] New patchset: Hashar; "bug 33301, bad SSL cert at integration.mediawiki.org" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1669 [18:42:14] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1669 [18:49:31] upload.wikimedia.org|91.198.174.234 doesn't give be this thumb,.... http://upload.wikimedia.org/wikipedia/commons/thumb/8/83/Postmeilensaeule_Wolkenstein2.jpg/39px-Postmeilensaeule_Wolkenstein2.jpg [18:50:04] hangs at connection establishment.. [18:50:20] oh.. now! 19:48:40-- 19:50:14 [18:50:23] great :D [18:50:28] only took 90 seconds [18:50:47] *trying with 40 px* [18:51:14] only 10 seconds [18:53:29] PROBLEM - Misc_Db_Slave on db10 is CRITICAL: CRITICAL: Slave running: expected Yes, got No [19:25:48] New patchset: Asher; "fix the nagios check for non port 80 varnish instances" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1670 [19:26:50] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1670 [19:26:53] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1670 [19:29:21] RECOVERY - Varnish HTTP mobile-backend on cp1043 is OK: HTTP OK HTTP/1.1 200 OK - 691 bytes in 0.064 seconds [19:31:39] ah. our nagios mysql replication checks are totally broken. [19:35:15] i can't really be upset about db10 breaking while i was out.. 
Misc_Db_Slave never went crit when replication break many days ago, only when i ran "stop slave" - it only seems to go crit if both the io and sql slave threads are down, but usually one is still running and pulling logs when replication breaks. and Misc_Db_Lag is in OK state with "CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : s".. grrr [19:35:55] looks like the "Misc_Db" checks are old crap that are only used on db9/10 and storage3. must kill. [19:36:27] !log catrope synchronized wmf-config/missing.php 'Update missing.php from trunk, see bug 30206' [19:36:37] Logged the message, Master [19:40:22] RoanKattouw: gj, thx! [19:41:10] RECOVERY - Varnish HTTP mobile-backend on cp1044 is OK: HTTP OK HTTP/1.1 200 OK - 691 bytes in 0.063 seconds [20:13:11] PROBLEM - NTP on dataset1 is CRITICAL: NTP CRITICAL: Offset unknown [20:18:02] New patchset: Mark Bergsma; "Template swift storage server configurations" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1671 [20:18:17] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1671 [20:18:55] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1671 [20:18:56] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1671 [20:21:28] New patchset: Mark Bergsma; "Experimentally raise worker counts on account/container/object servers to processorcount" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1672 [20:21:41] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1672 [20:21:51] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1672 [20:21:51] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1672 [20:26:46] New patchset: Hashar; "testswarm: disable mobile browsers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1673 [20:26:59] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1673 [20:27:21] PROBLEM - SSH on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:32:38] New patchset: Mark Bergsma; "Restart swift processes on config changes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1674 [20:32:59] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 1; - https://gerrit.wikimedia.org/r/1674 [20:37:42] RECOVERY - SSH on kaulen is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [20:40:51] New review: Demon; "(no comment)" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/1669 [20:48:02] New review: Hashar; "(no comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/1673 [20:50:39] New review: Krinkle; "OK" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/1673 [20:58:52] New patchset: Hashar; "bug 33301, bad SSL cert at integration.mediawiki.org" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1669 [21:00:46] New patchset: Hashar; "testswarm: disable mobile browsers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1673 [21:00:52] PROBLEM - Disk space on db9 is CRITICAL: DISK CRITICAL - free space: /a 10742 MB (3% inode=99%): [21:05:32] PROBLEM - MySQL disk space on db9 is CRITICAL: DISK CRITICAL - free space: /a 10724 MB (3% inode=99%): [21:07:57] !Log three more bin logs tossed from 
ds9 [21:08:08] let's see if it's case sensitive [21:08:42] !log three more bin logs tossed from ds9 [21:08:51] Logged the message, Master [21:11:34] it is ;) [21:12:14] thanks, I figured that out :-P [21:47:09] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 1; - https://gerrit.wikimedia.org/r/1674 [21:47:18] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1674 [21:47:19] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1674 [21:51:07] New patchset: Mark Bergsma; "Fix paths" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1675 [21:51:25] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1675 [21:51:26] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1675 [22:16:10] apergos: db9* ? [22:16:32] yeah db9 [22:16:49] anyone reading the backlogs will know [22:17:07] well i thought there may be a dataset9 i didn't know about [22:17:24] anyway, i can fix it if you like [22:42:34] New patchset: Pyoungmeister; "adding udplogging capabilites for varnish mobilez" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1676 [22:43:01] New patchset: Mark Bergsma; "Set proxy worker count to 2x # CPU cores" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1677 [22:43:22] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1677 [22:43:22] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1677 [22:59:28] PROBLEM - Puppet freshness on amssq53 is CRITICAL: Puppet has not run in the last 10 hours [23:07:54] New patchset: Bhartshorne; "apply swift TCP tuning settings to all high-http-performance hosts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1678 
[23:09:44] New patchset: Pyoungmeister; "adding udplogging capabilites for varnish mobilez" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1676 [23:14:23] Change abandoned: Bhartshorne; "need to do this separately for swift to avoid getting the time_wait stuff on the squids." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1678 [23:16:07] New patchset: Pyoungmeister; "adding udplogging capabilites for varnish mobilez" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1676 [23:19:29] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1676 [23:19:29] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1676 [23:27:36] New patchset: Asher; "fix template name for varnishncsa.init" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1679 [23:27:48] New patchset: Hashar; "bug 32645, add testswarm to integration homepage" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1680 [23:28:47] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1679 [23:28:47] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1679 [23:35:02] New patchset: Bhartshorne; "adding recommended tcp settings to swift hosts." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1681 [23:36:24] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1681 [23:36:25] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1681
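Following up on the 19:35 remark that Misc_Db_Slave only goes crit when both the io and sql slave threads are down, and that Misc_Db_Lag reports OK on an empty Seconds_Behind_Master: a minimal sketch of a stricter check, not the actual nagios plugin. It assumes the caller has already fetched one row of SHOW SLAVE STATUS as a dict; the function name and the lag thresholds are hypothetical.

```python
# Sketch only: evaluates one row of MySQL "SHOW SLAVE STATUS", passed in as a dict.
# check_replication, lag_warn and lag_crit are illustrative names and values,
# not taken from the real Misc_Db_* checks.

OK, WARNING, CRITICAL = 0, 1, 2  # standard nagios plugin exit codes


def check_replication(status, lag_warn=60, lag_crit=300):
    """Return (exit_code, message) for a replication status row."""
    io_running = status.get("Slave_IO_Running")
    sql_running = status.get("Slave_SQL_Running")
    lag = status.get("Seconds_Behind_Master")

    # Critical if EITHER thread is stopped, not only when both are.
    if io_running != "Yes" or sql_running != "Yes":
        return CRITICAL, (
            "CRITICAL: Slave_IO_Running=%s, Slave_SQL_Running=%s"
            % (io_running, sql_running)
        )

    # A missing or empty lag value must never be reported as OK.
    if lag is None or str(lag).strip() == "":
        return CRITICAL, "CRITICAL: Seconds_Behind_Master is empty"

    lag = int(lag)
    if lag >= lag_crit:
        return CRITICAL, "CRITICAL: lag %ds" % lag
    if lag >= lag_warn:
        return WARNING, "WARNING: lag %ds" % lag
    return OK, "OK: lag %ds" % lag
```

With this shape, a slave whose io thread is still pulling logs while the sql thread has stopped is reported as critical, which is the case described above as going unnoticed.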