[00:03:14] paravoid: you're not around, are you?
[00:05:36] New patchset: Lcarr; "addingi n minute field to cronjob" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44746
[00:06:14] mutante: can you check out my snmptrapd init script and see if you can figure out wtf it doesn't want to run ?
[00:06:24] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44746
[00:07:08] Aaron|laptop: could you help troubleshoot http://pastebin.mozilla.org/2069811
[00:07:37] * Aaron|laptop stops reading about cgroups for a second
[00:08:24] !log blog is fixed, in that it's the software equivalent of spinning plates
[00:08:34] Logged the message, RobH
[00:09:11] Aaron|laptop: exception = new MWException( "Could not acquire '$statusKey' lock." ); gets logged if
[00:09:12] $success = $this->mMemc->add( $statusKey, 'loading', MSG_LOAD_TIMEOUT );
[00:09:21] if ! $success
[00:10:48] so I tailed the exception log and didn't see much of that (lots of upload stuff though)
[00:11:10] Aaron|laptop: that's just because there isn't traffic hitting the eqiad apaches
[00:11:39] memcached-serious logging is working fine from there, i.e. 2013-01-18 23:43:51 mw1074 frwikisource: Memcached error for key "commonswiki:file:d2e0d294d8e6b18d051cd0f9a4f10564" on server "10.64.0.185:11211": ITEM TOO BIG
[00:11:40] oh, you are just testing there?
[00:12:23] i sent a bit of traffic from 1 tampa squid there… some people are hitting the tampa vip directly even though it's not pointed to in dns
[00:12:43] and have been doing direct testing myself
[00:13:08] 2013-01-18 23:38:51 srv245 ruwiki: Memcached error for key "ruwiki:infoaction:Половое_сношение_и_иные_действия_сексуального_характера_с_лицом,_не_достигшим_шестнадцатилетнего_возраста,_в_уголовном_праве_Р
[00:13:08] оссии:51757680" on server ":": A BAD KEY WAS PROVIDED/CHARACTERS OUT OF RANGE
[00:13:09] sigh
[00:13:39] 2013-01-18 23:37:15 mw26 tawikinews: Memcached error for key "tawikinews:shorturls:title:பால்வழியும்_அந்திரொமேடா_பேரடையும்_4_பில்லியன்_ஆண்டுகளில்_இணையும்,_வானியலாளர்கள்_கணிப்பு" on server ":": A BAD KEY WAS PROVIDED/
[00:13:40] CHARACTERS OUT OF RANGE
[00:13:58] I filed a but about the shorturls thing weeks ago
[00:14:01] *a bug
[00:14:01] i think wmf8 has some memcached fails..
[00:14:02] ok
[00:14:11] LeslieCarr: hmm... access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
[00:14:15] that's unrelated to the eqiad messagecache thing of course
[00:14:17] infoaction needs a bug report too I guess
[00:14:34] * Aaron|laptop wonders why those bad uses have been cropping up lately
[00:15:15] hrm, where do you see that ?
[00:15:19] and where is it calling that ?
[00:15:24] (to mutante, that is)
[00:15:30] LeslieCarr: strace /etc/init.d/snmptrapd restart
[00:16:29] binasher: funny, I changed that code lately, and what you pasted is the older code
[00:16:42] Aaron|laptop: there are >1k "Exception from line 352 of /usr/local/apache/common-local/php-1.21wmf7/includes/cache/MessageCache.php: Could not acquire 'ruwiki:messages:ru:status' lock." style messages from eqiad apaches, and i really didn't send that many requests there
[00:17:18] Aaron|laptop: is the change in master?
[00:17:23] asher:core asher$ git pull
[00:17:24] Already up-to-date.
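
For readers following the snippet Aaron quotes at 00:09, the pattern under discussion is memcached's atomic add() used as a lock around message-cache regeneration: whichever process adds the status key first gets to rebuild the cache, and everyone else sees the add() fail and logs "Could not acquire ... lock". The PHP sketch below is a minimal, self-contained illustration of that pattern only; the key name is borrowed from the error above, while the timeout, server address, and surrounding control flow are assumptions for illustration, not the actual MessageCache.php code.

    <?php
    // Minimal sketch of the add()-as-lock pattern (illustrative, not MediaWiki code).
    $timeout = 60; // hypothetical stand-in for MSG_LOAD_TIMEOUT

    $mc = new Memcached();
    $mc->addServer( '127.0.0.1', 11211 );

    $statusKey = 'ruwiki:messages:ru:status'; // example key from the log above

    if ( $mc->add( $statusKey, 'loading', $timeout ) ) {
        // add() succeeded, so we "hold the lock": regenerate the cache,
        // then clear the marker so other processes can proceed.
        // ... rebuild and store the message cache here ...
        $mc->delete( $statusKey );
    } else {
        // Another process is already loading (or a stale marker is still set).
        // This branch corresponds to the "Could not acquire ... lock" messages above.
        echo "Could not acquire '$statusKey' lock.\n";
    }
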
[00:17:49] it's in master
[00:17:56] I don't see the code you pasted
[00:18:11] PROBLEM - MySQL Slave Delay on db78 is CRITICAL: CRIT replication delay 183 seconds
[00:18:24] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 194 seconds
[00:19:01] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 236 seconds
[00:20:12] RECOVERY - MySQL Slave Delay on db78 is OK: OK replication delay 0 seconds
[00:20:17] Aaron|laptop: what are you looking at?
[00:20:31] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:20:52] what's up with:
[00:20:53] includes/cache/MessageCache.php: $memCached = wfGetCache( CACHE_NONE );
[00:21:01] PROBLEM - Puppet freshness on cp1023 is CRITICAL: Puppet has not run in the last 10 hours
[00:21:04] $this->mMemc = $memCached;
[00:22:00] It probably just gets an EmptyBagOStuff
[00:22:08] so the function calls work but do nothing
[00:22:21] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:22:23] that's only if there is no cache set
[00:23:01] PROBLEM - Puppet freshness on cp1015 is CRITICAL: Puppet has not run in the last 10 hours
[00:23:02] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 12 seconds
[00:23:42] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 0 seconds
[00:24:00] 2013-01-18 17:14:41 mw36 kkwikiquote: [6f8031b4] /w/index.php?title=%D0%90%D1%80%D0D%D00%D09%D1%8B:GlobalBlockList&dir=prev&offset=20120920214815&limit=100 Exception from line 345 of /usr/local/apache/common-local/php-1.21wmf7/includes/cache/MessageCache.php: Could not save cache for 'kk'.
[00:24:31] yeah, so some message loading failed and the key was set for a while to stop more load attempts from spamming
[00:24:47] that's different
[00:24:54] specifically, the caches could not be saved at some point
[00:25:18] just look at the eqiad ones, - all in the "lock acquire" stage
[00:25:22] we are talking about the "failed to acquire" error right?
[00:25:32] setting the key will cause that problem until it expires
[00:26:49] but I'm not seeing a cache fail exception for meta though
[00:27:06] !log DNS update - add wikivoyager.org/.de
[00:27:14] *cache save fail
[00:27:16] Logged the message, Master
[00:27:36] oh, it's an add
[00:27:44] Aaron|laptop: what does CACHE_ANYTHING map to?
[00:28:50] memcached pecl
[00:29:01] PROBLEM - Puppet freshness on ms-fe1001 is CRITICAL: Puppet has not run in the last 10 hours
[00:29:03] wgMainCacheType, which = that.. ok
[00:29:57] Aaron|laptop: ok, should i not be concerned with those messages at all?
[00:30:44] how much traffic was happening?
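
The wfGetCache( CACHE_NONE ) call discussed at 00:20-00:22 hands back what the log calls an EmptyBagOStuff: a cache object whose methods all succeed but never store anything, so callers keep working without real caching. The class below is a simplified stand-in for that idea, not the real MediaWiki class; its name and method set are illustrative assumptions.

    <?php
    // Simplified no-op cache sketch (illustrative stand-in for an "EmptyBagOStuff").
    class NoOpCache {
        public function get( $key ) {
            return false;            // nothing is ever cached, so every read misses
        }
        public function set( $key, $value, $ttl = 0 ) {
            return true;             // pretend the write succeeded, discard the value
        }
        public function add( $key, $value, $ttl = 0 ) {
            return true;             // "first writer" always appears to win
        }
        public function delete( $key ) {
            return true;
        }
    }

    // Usage: calling code is unchanged, it just never gets cache hits.
    $cache = new NoOpCache();
    $cache->set( 'commonswiki:file:somehash', 'metadata' );
    var_dump( $cache->get( 'commonswiki:file:somehash' ) ); // bool(false)

The practical effect is what Aaron describes in the log: the function calls work but do nothing, which only matters if no main cache is configured.
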
[00:31:55] looking at that code, any time more than one request is trying to regenerate that cache, you will see that error
[00:32:11] RECOVERY - Backend Squid HTTP on sq72 is OK: HTTP OK: HTTP/1.0 200 OK - 1250 bytes in 0.107 second response time
[00:32:21] RECOVERY - Frontend Squid HTTP on sq72 is OK: HTTP OK: HTTP/1.0 200 OK - 1285 bytes in 0.081 second response time
[00:32:24] RECOVERY - Backend Squid HTTP on sq72 is OK: HTTP OK HTTP/1.0 200 OK - 1257 bytes in 0.004 seconds
[00:32:51] RECOVERY - Frontend Squid HTTP on sq72 is OK: HTTP OK HTTP/1.0 200 OK - 1393 bytes in 0.008 seconds
[00:33:55] Aaron|laptop: around 40 reqs/sec
[00:34:31] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 9.349 second response time
[00:35:08] i turned that back on
[00:36:11] for the time being, the following hosts entry pointing at a tampa squid will let you hit eqiad apaches, hitting eqiad dbs/mc
[00:36:12] 208.80.152.82 en.wikipedia.org commons.wikimedia.org
[00:36:46] I'm trying to see why there really need to be two locks there
[00:38:33] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.034 seconds
[00:39:05] New patchset: Lcarr; "fixing /etc/default" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44749
[00:39:41] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44749
[00:40:36] Tim added the extra lock in 5f268028
[00:41:14] the only difference is that the outer lock relents after enough attempts to acquire and lets the process through
[00:41:15] PROBLEM - Puppet freshness on vanadium is CRITICAL: Puppet has not run in the last 10 hours
[00:42:11] PROBLEM - Backend Squid HTTP on sq72 is CRITICAL: Connection refused
[00:43:11] RECOVERY - Backend Squid HTTP on sq72 is OK: HTTP OK: HTTP/1.0 200 OK - 1250 bytes in 0.108 second response time
[00:44:01] RECOVERY - Puppet freshness on analytics1025 is OK: puppet ran at Sat Jan 19 00:43:53 UTC 2013
[00:44:02] RECOVERY - Puppet freshness on constable is OK: puppet ran at Sat Jan 19 00:43:58 UTC 2013
[00:44:31] RECOVERY - Puppet freshness on blondel is OK: puppet ran at Sat Jan 19 00:44:03 UTC 2013
[00:44:32] RECOVERY - Puppet freshness on srv202 is OK: puppet ran at Sat Jan 19 00:44:03 UTC 2013
[00:44:32] RECOVERY - Puppet freshness on db64 is OK: puppet ran at Sat Jan 19 00:44:03 UTC 2013
[00:52:47] LeslieCarr: Time Frame Services Checked
[00:52:48] <= 1 minute: 3616 (98.6%)
[00:52:52] woohoo!
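
The "extra lock" Aaron mentions at 00:40 is described as an outer lock that relents after enough acquisition attempts and lets the process through. The function below is only a rough sketch of that general idea, not the code from change 5f268028; the function name, attempt count, and TTL are made up for illustration.

    <?php
    // Rough sketch of an outer lock that gives up waiting after a bounded
    // number of attempts and lets the caller proceed anyway.
    function acquireWithRelent( Memcached $mc, $lockKey, $attempts = 5, $ttl = 30 ) {
        for ( $i = 0; $i < $attempts; $i++ ) {
            if ( $mc->add( $lockKey, 1, $ttl ) ) {
                return true;     // we really hold the lock
            }
            sleep( 1 );          // someone else holds it; wait and retry
        }
        // Gave up waiting: relent and proceed without the lock.
        return false;
    }

    $mc = new Memcached();
    $mc->addServer( '127.0.0.1', 11211 );
    $gotLock = acquireWithRelent( $mc, 'ruwiki:messages:ru' ); // hypothetical key

A caller that falls through here without the lock is then the one likely to trip the inner status-key check, which matches the behaviour discussed a little further down in the log.
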
[00:53:37] https://neon.wikimedia.org/cgi-bin/icinga/extinfo.cgi?&type=4
[00:53:50] :)
[01:03:14] PROBLEM - Puppet freshness on srv211 is CRITICAL: Puppet has not run in the last 10 hours
[01:05:57] binasher: hmm, I guess I wouldn't worry too much about those
[01:06:03] ok
[01:06:14] PROBLEM - Puppet freshness on locke is CRITICAL: Puppet has not run in the last 10 hours
[01:06:55] I a process got past lock() due to timeout then it will probably fail on that check... I guess this is to prevent a bunch of processes from piling up doing the same db queries
[01:07:14] AaronSchulz: 40 req/sec of european traffic is hitting eqiad apaches again, and it seems to have cleared up
[01:07:48] that code does look reasonable
[01:08:08] *If a
[01:11:52] RECOVERY - Puppet freshness on cp1026 is OK: puppet ran at Sat Jan 19 01:11:40 UTC 2013
[01:17:02] New patchset: Pyoungmeister; "adding moar lvs checks for eqiad" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44750
[01:20:38] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44750
[01:29:23] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours
[01:42:35] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 181 seconds
[01:42:47] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 187 seconds
[01:42:57] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 191 seconds
[01:44:14] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 222 seconds
[01:52:38] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours
[01:53:43] !log i was sending >40 reqs/sec from europe and asia to eqiad apaches for the last hour for real world non-english testing
[01:53:55] Logged the message, Master
[01:54:35] !log testing completed, no longer sending live traffic to eqiad
[01:54:37] PROBLEM - Frontend Squid HTTP on sq72 is CRITICAL: Connection refused
[01:54:47] Logged the message, Master
[01:55:16] PROBLEM - Backend Squid HTTP on sq72 is CRITICAL: Connection refused
[01:56:14] PROBLEM - Frontend Squid HTTP on sq72 is CRITICAL: Connection refused
[01:56:59] PROBLEM - Backend Squid HTTP on sq72 is CRITICAL: Connection refused
[02:01:38] PROBLEM - Puppet freshness on msfe1002 is CRITICAL: Puppet has not run in the last 10 hours
[02:03:16] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:04:38] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:06:06] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 2.477 second response time
[02:06:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.539 seconds
[02:09:39] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours
[02:18:08] RECOVERY - MySQL Replication Heartbeat on db1035 is OK: OK replication delay 26 seconds
[02:18:26] RECOVERY - MySQL Slave Delay on db1035 is OK: OK replication delay 0 seconds
[02:18:44] RECOVERY - MySQL Replication Heartbeat on db1035 is OK: OK replication delay 0 seconds
[02:18:59] RECOVERY - MySQL Slave Delay on db1035 is OK: OK replication delay 0 seconds
[02:24:18] !log LocalisationUpdate completed (1.21wmf7) at Sat Jan 19 02:24:18 UTC 2013
[02:24:29] Logged the message, Master
[02:39:18] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:40:38] PROBLEM - Puppet freshness on db10 is CRITICAL: Puppet has not run in the last 10 hours
[02:42:17] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:46:07] !log LocalisationUpdate completed (1.21wmf8) at Sat Jan 19 02:46:06 UTC 2013
[02:46:19] Logged the message, Master
[02:51:08] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 5.944 seconds
[02:51:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 9.593 second response time
[02:58:32] PROBLEM - Memcached on virt0 is CRITICAL: Connection refused
[03:00:44] PROBLEM - Memcached on virt0 is CRITICAL: Connection refused
[03:20:22] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:21:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:22:38] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[03:22:38] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours
[03:37:38] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.998 seconds
[03:38:12] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.121 second response time
[04:13:08] [23:12] Ignore List:
[04:13:08] [23:12] 1 nagios-wm #wikimedia-operations: ALL
[04:13:13] I guess I need to expand the list...
[13:01:29] is anyone around who can help me find the db config for foundationwiki?
[13:06:11] Jeff_Green, did you look at http://noc.wikimedia.org/ ?
[13:06:40] I didn't--but I'm looking on the server directly.
[13:08:08] ah HA. i think i just found what I was looking for...
[15:28:22] New review: Nikerabbit; "Inconsistent indentation with 2 or 4 spaces and some trailing whitespace." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/44712
[15:29:57] New review: Nikerabbit; "Did you mean to amend the previous commit?" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/44715
[16:52:42] aswikisource seems to be missing the pr_index table for ProofreadPage.
[19:22:08] !log pgehres synchronized wmf-config/reporting-setup.php 'Switching S:FundraiserStatistics read slave back to db1025'
[19:22:20] Logged the message, Master
[20:50:10] New review: Hashar; "ops, feel free to merge whenever you see this change ;-)" [operations/puppet] (production); V: 0 C: 1; - https://gerrit.wikimedia.org/r/44164
[20:53:53] anomie: was a bug filed for that or not?
[20:54:15] Nemo_bis- No idea. Someone mentioned it on #wikimedia-tech
[22:24:54] !log Created ProofreadPage table on aswikisource
[22:25:06] Logged the message, Master