[00:00:45] Wed Dec 26 23:54:35 UTC 2012 mw15 commonswiki BacklinkCache::getLinks 10.0.6.75 2008 MySQL client ran out of memory (10.0.6.75) SELECT /*! STRAIGHT_JOIN */ page_namespace,page_title,page_id FROM `templatelinks`,`page` WHERE tl_namespace = '10' AND tl_title = 'Date' AND (page_id=tl_from) ORDER BY tl_from
[00:01:29] that query returns 12384915 rows! BacklinkCache::getLinks needs batching..
[00:01:54] Aaron has been working on that already
[00:02:31] cool. is there already a bugzilla ticket?
[00:04:54] I am thinking of https://gerrit.wikimedia.org/r/#/c/32488/
[00:05:26] https://bugzilla.wikimedia.org/show_bug.cgi?id=37731 is the bug I suppose
[00:06:46] maybe it needs to be reopened
[00:08:01] hmm that adds a limit to getNumLinks but not the query in getLinks.. i'll reopen
[00:08:44] New patchset: Tim Starling; "beta: rm global $wmfRealm before including IS-labs.php" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/39744
[00:09:19] Change merged: Tim Starling; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/39744
[00:11:09] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.051 seconds
[00:13:27] New patchset: Andrew Bogott; "A couple of minor fixes for single-node mediawiki:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40764
[00:14:38] !log reedy synchronized php-1.21wmf6/cache/l10n
[00:14:47] Logged the message, Master
[00:16:12] New patchset: Andrew Bogott; "A couple of minor fixes for single-node mediawiki:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40764
[00:16:12] New patchset: Ori.livneh; "Change 40763 to MobileFrontend only loads EventLogging if it is defined, so this is not necessary." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/40765
[00:16:35] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40764
[00:21:38] New review: Tim Starling; "getRealmSpecifcFilename -> getRealmSpecificFilename" [operations/mediawiki-config] (master); V: 0 C: -1; - https://gerrit.wikimedia.org/r/32167
[00:21:49] !log powercycled cp1044, couldn't ssh in
[00:21:57] Logged the message, Master
[00:23:25] PROBLEM - Host cp1044 is DOWN: PING CRITICAL - Packet loss = 100%
[00:24:19] RECOVERY - SSH on cp1044 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0)
[00:24:28] RECOVERY - Host cp1044 is UP: PING OK - Packet loss = 0%, RTA = 26.50 ms
[00:24:46] RECOVERY - Varnish HTCP daemon on cp1044 is OK: PROCS OK: 1 process with UID = 1001 (varnishhtcpd), args varnishhtcpd worker
[00:25:13] RECOVERY - Varnish HTTP mobile-backend on cp1044 is OK: HTTP OK HTTP/1.1 200 OK - 696 bytes in 0.054 seconds
[00:25:18] Change merged: Ori.livneh; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/40765
[00:28:58] RECOVERY - Varnish HTTP mobile-frontend on cp1044 is OK: HTTP OK HTTP/1.1 200 OK - 641 bytes in 0.054 seconds
[00:36:45] New patchset: Reedy; "Ignore "SHA-1 metadata" in fatalmonitor" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40766
[00:37:26] New review: Tim Starling; "It should be $wmgRealm, for consistency with other globals. wmg = Wikimedia global, wmf = Wikimedia ..." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/39056
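A note on the BacklinkCache::getLinks out-of-memory error at the top of this block: the batching fix being discussed is not shown in the log, so the following is only a minimal, hypothetical PHP sketch of the idea — fetching the ~12.4 million templatelinks rows in keyset-paginated batches instead of one buffered result set. The connection details, batch size, and use of tl_from as the pagination cursor are assumptions for illustration, not the contents of change 32488 or Aaron's patch.

<?php
// Hypothetical batching sketch; not the actual MediaWiki/BacklinkCache code.
$pdo = new PDO( 'mysql:host=10.0.6.75;dbname=commonswiki', 'wikiuser', 'secret' ); // assumed credentials
$pdo->setAttribute( PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION );

$batchSize = 5000; // assumed batch size
$lastFrom  = 0;    // highest tl_from seen so far (keyset cursor)

do {
    // Each round fetches only the next slice ordered by tl_from, so the
    // client never has to buffer all ~12.4M rows at once.
    $stmt = $pdo->prepare(
        'SELECT /*! STRAIGHT_JOIN */ page_namespace, page_title, page_id, tl_from
         FROM templatelinks, page
         WHERE tl_namespace = 10 AND tl_title = ? AND page_id = tl_from AND tl_from > ?
         ORDER BY tl_from
         LIMIT ' . $batchSize
    );
    $stmt->execute( array( 'Date', $lastFrom ) );
    $rows = $stmt->fetchAll( PDO::FETCH_ASSOC );

    foreach ( $rows as $row ) {
        $lastFrom = (int)$row['tl_from'];
        // ... process one backlink row here ...
    }
} while ( count( $rows ) === $batchSize );

Paginating on the indexed tl_from column keeps each round trip small and restartable; an unbuffered (streaming) result set would be another way to avoid the client-side memory blowup, but it holds the query open much longer.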
[00:38:27] New patchset: Asher; "redis replication config" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40767
[00:42:55] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:44:24] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40767
[00:46:40] PROBLEM - Varnish traffic logger on cp1044 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishncsa
[00:52:27] New review: Tim Starling; "Each getRealmSpecificFilename() call takes about 29us on my laptop, so that implies that this change..." [operations/mediawiki-config] (master); V: 0 C: -1; - https://gerrit.wikimedia.org/r/32167
[00:52:31] New patchset: Asher; "lowering cache4xx to 1m (from 5m) for upload varnish instances. transient swift errors result in thumb.php 404's that should be valid, so shorter seems better here" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40768
[00:56:19] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40768
[01:00:46] RECOVERY - Varnish traffic logger on cp1044 is OK: PROCS OK: 3 processes with command name varnishncsa
[01:00:55] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.097 seconds
[01:04:12] New patchset: Asher; "redis slaveof cmd requires a port" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40769
[01:04:55] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40769
[01:09:19] PROBLEM - Varnish traffic logger on cp1044 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishncsa
[01:15:10] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 270 seconds
[01:16:49] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 10 seconds
[01:17:00] New patchset: Ori.livneh; "(RT 4094) Increase varnish SHM defaults" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40554
[01:17:36] binasher: ^ that specifies the value in bytes; sorry.
[01:18:01] RECOVERY - Varnish traffic logger on cp1044 is OK: PROCS OK: 3 processes with command name varnishncsa
[01:18:34] ori-l: see https://rt.wikimedia.org/Ticket/Display.html?id=4094#txn-91690 :(
[01:19:05] god damn it
[01:19:12] do you think it's worth trying with an even more modest limit?
[01:20:05] yeah
[01:20:06] even a setting of 1020 bytes would be an x4 increase over the current limit
[01:26:22] ori-l: i'm testing shm_reclen=1024 on one of the mobile servers, i'll let you know tomorrow if varnishncsa crashes
[01:26:53] didn't have to wait too long with the 12k setting
[01:26:56] binasher: excellent, thanks! also -
[01:27:16] it would be nice to track down the problem, but probably not easy
[01:27:29] do you think this could be caused by a lingering copy of varnishncsa that is working with the previous sizes?
[01:27:56] nope
[01:28:21] they all crashed on the test host, then i tried running one to stdout with no other running, to see if it was udp related
[01:28:44] that crashed as well and is where i got the backtrace from
[01:29:29] that sucks :/
[01:29:39] well, thanks for responding so quickly and thoroughly
[01:30:18] hopefully 1k works
[01:32:26] New patchset: Andrew Bogott; "Variable substitution between single-quotes doesn't work so well." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40770
[01:32:29] so far it's ok. munmap_chunk(): invalid pointer: 0x0000000000d9d360 - i wonder if there's a fixed size in the code that shm_reclen can't be larger than (smaller than the 16k mentioned on the list)
[01:32:59] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40770
[01:34:13] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:39:13] binasher: not sure. there are reports of people successfully setting shm_reclen to 64k, but not at web-scale™
[01:44:51] binasher: did you restart varnishd itself? shm_workspace is flagged 'delayed' in the docs, indicating "This parameter can be changed on the fly, but will not take effect immediately."
[01:44:55] (https://www.varnish-cache.org/docs/3.0/reference/varnishd.html#run-time-parameters)
[01:45:23] ori-l: yes, gave it as a command line option at start time.
[01:48:28] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 2.428 seconds
[01:50:09] bah :| out of ideas, then. i'll read the changelog and bug tracker carefully tonight to see if anything potentially related was reported.
[01:59:23] anybody know what the l/p for ganglia is now? (pm or e-mail)
[02:09:07] !log on db64: killed FlaggedRevsStats::getEditReviewTimes queries, were running for 40 days. Client host was hume but no process was found with the relevant ephemeral port.
[02:09:17] Logged the message, Master
[02:15:57] ori-l: /home/wikipedia/docs/ganglia or something there of
[02:16:32] /home/wikipedia/doc/ganglia.htaccess
[02:21:50] !log LocalisationUpdate completed (1.21wmf6) at Thu Dec 27 02:21:50 UTC 2012
[02:22:00] Logged the message, Master
[02:23:25] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:32:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 5.319 seconds
[02:37:46] Reedy: thanks
[02:44:25] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours
[02:53:25] PROBLEM - Puppet freshness on analytics1007 is CRITICAL: Puppet has not run in the last 10 hours
[02:53:25] PROBLEM - Puppet freshness on ms-be1007 is CRITICAL: Puppet has not run in the last 10 hours
[02:53:25] PROBLEM - Puppet freshness on ms-be1005 is CRITICAL: Puppet has not run in the last 10 hours
[02:53:25] PROBLEM - Puppet freshness on ms-be1006 is CRITICAL: Puppet has not run in the last 10 hours
[03:53:25] Reedy: so they gave permission denied but then what? still needs fixing?
[03:53:31] (l10nupdate)
[04:15:47] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[04:15:47] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours
[05:34:46] PROBLEM - Puppet freshness on db1047 is CRITICAL: Puppet has not run in the last 10 hours
[05:34:46] PROBLEM - Puppet freshness on ms-fe1003 is CRITICAL: Puppet has not run in the last 10 hours
[05:34:46] PROBLEM - Puppet freshness on ms-fe1004 is CRITICAL: Puppet has not run in the last 10 hours
[05:34:46] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: Puppet has not run in the last 10 hours
[05:34:47] PROBLEM - Puppet freshness on sq48 is CRITICAL: Puppet has not run in the last 10 hours
[05:34:47] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours
[05:51:21] New patchset: Tim Starling; "Allow per-realm and per-datacenter configuration" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/32167
[05:54:44] New patchset: Tim Starling; "Make getRealmSpecificFilename() faster" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/40775
[06:36:57] Change abandoned: Tim Starling; "(no reason)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31661
[06:38:36] New review: Tim Starling; "I don't understand why different wikis need different thumbnail sizes. It's not like they use differ..." [operations/mediawiki-config] (master); V: 0 C: -1; - https://gerrit.wikimedia.org/r/31580
[06:41:59] New review: Tim Starling; "You mean bug 42748." [operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/39775
[06:46:10] New patchset: Tim Starling; "Revert "Kill mobileRedirect.php, not used since forever"" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/39775
[06:46:38] New review: Tim Starling; "PS2: rebased and fixed commit message." [operations/mediawiki-config] (master); V: 2 C: 2; - https://gerrit.wikimedia.org/r/39775
[06:46:38] Change merged: Tim Starling; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/39775
[06:47:57] !log tstarling synchronized live-1.5/mobileRedirect.php
[06:48:06] Logged the message, Master
[07:00:27] New review: Brian Wolff; ">ii. This won't affect current users' preferences, but only >anonymous" [operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/31580
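For context on the two mediawiki-config patchsets uploaded at 05:51 and 05:54 above ("Allow per-realm and per-datacenter configuration", 32167, and "Make getRealmSpecificFilename() faster", 40775), here is a hedged PHP sketch of what a realm-aware filename resolver of that shape can look like. It is not the code from those changes: the candidate fallback order, the $wmgRealm/$wmgDatacenter globals, and the static cache (one plausible response to the "about 29us" per-call review comment earlier in the log) are assumptions for illustration only.

<?php
// Illustrative sketch; the real helper lives in wmf-config (MWRealm.php) and may differ.
$wmgRealm      = 'labs';   // assumed value: e.g. 'production' or 'labs'
$wmgDatacenter = 'pmtpa';  // assumed value: e.g. 'pmtpa' or 'eqiad'

function getRealmSpecificFilename( $filename ) {
    global $wmgRealm, $wmgDatacenter;
    static $cache = array();  // memoize so repeated calls don't re-stat the filesystem

    if ( isset( $cache[$filename] ) ) {
        return $cache[$filename];
    }

    $ext  = pathinfo( $filename, PATHINFO_EXTENSION );
    $base = substr( $filename, 0, -( strlen( $ext ) + 1 ) );

    // Try the most specific candidate first, then fall back to the plain file.
    $candidates = array(
        "$base-$wmgRealm-$wmgDatacenter.$ext",
        "$base-$wmgRealm.$ext",
        $filename,
    );
    foreach ( $candidates as $candidate ) {
        if ( file_exists( $candidate ) ) {
            return $cache[$filename] = $candidate;
        }
    }
    return $cache[$filename] = $filename;
}

// In a labs realm this would prefer e.g. InitialiseSettings-labs.php when present.
echo getRealmSpecificFilename( __DIR__ . '/InitialiseSettings.php' ), "\n";

The memoization only illustrates one way the per-call cost could be cut; the log does not show what change 40775 actually did.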
[07:28:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:33:52] PROBLEM - Puppet freshness on manganese is CRITICAL: Puppet has not run in the last 10 hours
[07:44:31] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.779 seconds
[08:17:40] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:32:04] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.033 seconds
[08:36:52] PROBLEM - Puppet freshness on search1001 is CRITICAL: Puppet has not run in the last 10 hours
[08:39:52] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours
[08:48:52] PROBLEM - Puppet freshness on ssl3001 is CRITICAL: Puppet has not run in the last 10 hours
[09:05:22] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:07:48] New patchset: ArielGlenn; "make deployment dirs if they don't exist" [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/40780
[09:10:16] Change merged: ArielGlenn; [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/33566
[09:10:58] Change merged: ArielGlenn; [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/36712
[09:10:59] Change merged: ArielGlenn; [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/35378
[09:11:43] Change merged: ArielGlenn; [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/40780
[09:18:07] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 5.824 seconds
[09:24:52] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours
[09:50:59] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:05:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.021 seconds
[10:11:41] PROBLEM - Puppet freshness on stat1 is CRITICAL: Puppet has not run in the last 10 hours
[10:38:23] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:39:44] PROBLEM - MySQL Replication Heartbeat on db1007 is CRITICAL: CRIT replication delay 203 seconds
[10:40:20] PROBLEM - MySQL Slave Delay on db1007 is CRITICAL: CRIT replication delay 219 seconds
[10:50:50] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.029 seconds
[10:55:11] RECOVERY - MySQL Replication Heartbeat on db1007 is OK: OK replication delay 0 seconds
[10:55:56] RECOVERY - MySQL Slave Delay on db1007 is OK: OK replication delay 0 seconds
[10:56:41] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours
[10:56:41] PROBLEM - Puppet freshness on silver is CRITICAL: Puppet has not run in the last 10 hours
[11:23:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:36:11] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.956 seconds
[12:09:44] New patchset: J; "add cgroup to limit memory of sub processes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40784
[12:11:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:12:38] oh?
[12:13:58] cool!
[12:22:05] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 7.051 seconds
[12:46:09] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours
[12:55:00] PROBLEM - Puppet freshness on analytics1007 is CRITICAL: Puppet has not run in the last 10 hours
[12:55:00] PROBLEM - Puppet freshness on ms-be1005 is CRITICAL: Puppet has not run in the last 10 hours
[12:55:00] PROBLEM - Puppet freshness on ms-be1006 is CRITICAL: Puppet has not run in the last 10 hours
[12:55:00] PROBLEM - Puppet freshness on ms-be1007 is CRITICAL: Puppet has not run in the last 10 hours
[12:56:39] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:09:06] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.232 seconds
[13:38:29] New patchset: Mark Bergsma; "Exit if the queue gets too full because workers are stuck" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40789
[13:39:03] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40789
[13:43:54] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:45:35] New patchset: Mark Bergsma; "Move the socket receive / purge enqueuing out of the eval block" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40790
[13:46:16] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40790
[13:47:07] New review: Alex Monk; "hello?" [operations/apache-config] (master) C: 0; - https://gerrit.wikimedia.org/r/34113
[13:54:33] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.656 seconds
[14:17:12] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[14:17:12] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours
[14:24:52] New patchset: ArielGlenn; "deploy script to update file paths in config files if desired" [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/40795
[14:29:39] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:40:45] PROBLEM - Host analytics1011 is DOWN: PING CRITICAL - Packet loss = 100%
[14:41:48] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.049 seconds
[14:42:15] RECOVERY - Host analytics1011 is UP: PING OK - Packet loss = 0%, RTA = 26.72 ms
[14:44:57] PROBLEM - Host analytics1012 is DOWN: PING CRITICAL - Packet loss = 100%
[14:46:45] PROBLEM - Host analytics1011 is DOWN: PING CRITICAL - Packet loss = 100%
[14:48:19] New patchset: Anomie; "Allow per-realm and per-datacenter configuration" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/32167
[14:48:52] New review: Anomie; "PS12: Fix spelling error getRealmSpecifcFilename → getRealmSpecificFilename" [operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/32167
[14:49:27] RECOVERY - Host analytics1011 is UP: PING OK - Packet loss = 0%, RTA = 26.83 ms
[14:50:12] RECOVERY - Host analytics1012 is UP: PING OK - Packet loss = 0%, RTA = 26.96 ms
[14:51:42] PROBLEM - Host analytics1015 is DOWN: PING CRITICAL - Packet loss = 100%
[14:51:42] PROBLEM - Host analytics1018 is DOWN: PING CRITICAL - Packet loss = 100%
[14:51:42] PROBLEM - Host analytics1020 is DOWN: PING CRITICAL - Packet loss = 100%
[14:51:51] PROBLEM - Host analytics1014 is DOWN: PING CRITICAL - Packet loss = 100%
[14:52:27] PROBLEM - Host analytics1017 is DOWN: PING CRITICAL - Packet loss = 100%
[14:52:36] PROBLEM - Host analytics1019 is DOWN: PING CRITICAL - Packet loss = 100%
[14:55:18] RECOVERY - Host analytics1014 is UP: PING OK - Packet loss = 0%, RTA = 26.53 ms
[14:55:46] afk for awhile, back later this evening
[14:56:03] RECOVERY - Host analytics1017 is UP: PING OK - Packet loss = 0%, RTA = 27.27 ms
[14:56:39] RECOVERY - Host analytics1015 is UP: PING OK - Packet loss = 0%, RTA = 26.55 ms
[14:57:06] RECOVERY - Host analytics1018 is UP: PING OK - Packet loss = 0%, RTA = 27.02 ms
[14:57:33] RECOVERY - Host analytics1020 is UP: PING OK - Packet loss = 0%, RTA = 26.50 ms
[14:57:42] PROBLEM - Host analytics1013 is DOWN: PING CRITICAL - Packet loss = 100%
[14:58:00] PROBLEM - Host analytics1016 is DOWN: PING CRITICAL - Packet loss = 100%
[14:58:18] RECOVERY - Host analytics1019 is UP: PING OK - Packet loss = 0%, RTA = 26.54 ms
[15:02:21] RECOVERY - Host analytics1013 is UP: PING OK - Packet loss = 0%, RTA = 26.53 ms
[15:02:48] RECOVERY - Host analytics1016 is UP: PING OK - Packet loss = 0%, RTA = 26.55 ms
[15:05:30] PROBLEM - MySQL Replication Heartbeat on db64 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:07:10] RECOVERY - MySQL Replication Heartbeat on db64 is OK: OK replication delay seconds
[15:07:18] PROBLEM - Puppet freshness on ms-fe1001 is CRITICAL: Puppet has not run in the last 10 hours
[15:13:45] PROBLEM - MySQL Slave Running on db64 is CRITICAL: CRIT replication Slave_IO_Running: No Slave_SQL_Running: No Last_Error: Rollback done for prepared transaction because its XID was not in the
[15:14:03] PROBLEM - MySQL Replication Heartbeat on db64 is CRITICAL: CRIT replication delay 534 seconds
[15:15:33] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:29:39] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.020 seconds
[15:33:17] !log shutting down mw57 to troubleshoot DIMM/Mem issue
[15:33:27] Logged the message, Master
[15:36:15] PROBLEM - Puppet freshness on db1047 is CRITICAL: Puppet has not run in the last 10 hours
[15:36:15] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: Puppet has not run in the last 10 hours
[15:36:15] PROBLEM - Puppet freshness on ms-fe1003 is CRITICAL: Puppet has not run in the last 10 hours
[15:36:15] PROBLEM - Puppet freshness on ms-fe1004 is CRITICAL: Puppet has not run in the last 10 hours
[15:36:16] PROBLEM - Puppet freshness on sq48 is CRITICAL: Puppet has not run in the last 10 hours
[15:36:16] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours
[15:36:42] PROBLEM - Host mw57 is DOWN: PING CRITICAL - Packet loss = 100%
[15:44:30] RECOVERY - Host mw57 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms
[15:49:09] PROBLEM - Apache HTTP on mw57 is CRITICAL: Connection refused
[15:58:00] RECOVERY - Apache HTTP on mw57 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.059 second response time
[16:01:09] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:13:36] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.644 seconds
[16:47:21] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:59:39] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.035 seconds
[17:16:12] apergos or any other op available? yuvipanda is experiencing problems reaching wikipedia
[17:16:39] he did a traceroute: http://pastebin.com/fCSNAmJh
[17:17:31] ew that's an ugly path
[17:18:12] LeslieCarr: around?
[17:18:16] hey
[17:18:18] Jeff_Green: aye
[17:18:22] * yuvipanda waves to awjr
[17:18:26] * awjr waves back
[17:18:34] so, I can successfully ping hop 3 from eqiad
[17:18:45] can I get your IP yuvipanda?
[17:18:46] private is fine
[17:18:51] sure
[17:19:02] 180.151.43.50
[17:21:05] i did another traceroute, same thing
[17:22:19] looking
[17:24:44] from a first glance it looks like a problem on your isp's side
[17:25:26] oh?
[17:26:03] hmm, i could try with my phone's 3g but that doesn't let any icmp go through.
[17:26:18] traceroute stops at AS 9498
[17:26:33] never reaches AS 10029 which is your ISP
[17:26:53] or perhaps stops at 10029's border
[17:27:05] 10029 is SPECTRANET
[17:27:17] that is the ISP I'm on.
[17:27:28] so the ISP is blocking this somehow?
[17:27:31] I figured as much :)
[17:28:24] paravoid: thanks! I'll poke them in the eye
[17:28:38] i'll check hrough my phone
[17:28:42] I wouldn't say blocked
[17:28:49] I'd say some kind of problem probably
[17:29:16] everything else goes through fine
[17:29:21] also, there were some unconfirmed reports before about gmail being blocked in India
[17:29:29] oh?
[17:29:32] works fine for me...
[17:29:35] right
[17:29:43] one second, hopping on to a different network
[17:29:44] so, it might be some kind of infrastructure issue?
[17:29:56] possibly
[17:30:05] spectranet isin't really a popular isp
[17:32:42] yuvipanda, might be related: https://twitter.com/dweekly/status/284183693596188672
[17:32:48] yeah that was the one
[17:32:56] they seem to be doing some traffic engineering too
[17:33:00] meh, too far out from the city, so no 3G either
[17:33:11] so, different paths for different smallish subnets of theirs
[17:33:15] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:33:16] i can ping twitter
[17:33:38] yuvipanda, are you hiding in the woods?
[17:34:58] yuvipanda: try reporting the problem to your isp. make sure to give them your IP and the IP from our side you're not able to reach
[17:34:58] MaxSem: no, in bangalore :)
[17:35:06] yeah, i'll do that.
[17:35:12] PROBLEM - Puppet freshness on manganese is CRITICAL: Puppet has not run in the last 10 hours
[17:35:21] you said that you're far from the city
[17:35:34] yeah
[17:35:35] noc@wikimedia.org is the primary contact for network issues from our side, although I don't see evidence to suggest it's something close to us, so far
[17:35:48] MaxSem: it is a big city
[17:41:10] yuvipanda: do you also have a problem reaching wikimedia-lb.pmtpa.wikimedia.org btw?
[17:42:19] we don't (yet?) have a way to selectively switch isps to different datacenters, but it might be good to know
[17:43:55] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 7.317 seconds
[17:44:39] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 210 seconds
[17:44:48] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 215 seconds
[17:51:21] paravoid: sorry stepped out to grab a drink
[17:51:26] paravoid: i can't reach that either, no
[17:52:02] oh hm, right, we're using the same path to reach you, so that figures
[17:53:16] yeah, same traceroute
[17:58:03] Jeff_Green: now around
[17:58:09] what's up ?
[17:58:15] yuvipanda: having problems ?
[17:58:20] hey--i was going to get you involved in that yeah
[17:58:53] but that was a while ago, not sure what is the urgency at this point
[18:00:27] New review: MaxSem; "This change would also fix the inability of Windows folks to contribute to this repo as the old cert..." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/32924
[18:00:28] LeslieCarr: yeah, looks like my ISP
[18:00:57] LeslieCarr: Will just tunnel through for a while, and see if there's anything I can do to report it to the ISP
[18:01:01] okay
[18:01:10] please do report it to your isp so they can fix their routing
[18:01:36] yup, will do :)
[18:18:51] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:20:57] RECOVERY - MySQL Replication Heartbeat on db1035 is OK: OK replication delay 0 seconds
[18:21:24] RECOVERY - MySQL Slave Delay on db1035 is OK: OK replication delay 0 seconds
[18:21:42] RobH: you don't happen to be in the ashburn area at the moment, do you ?
[18:34:54] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.027 seconds
[18:36:31] lesliecarr: what's up? (not there though)...do we need smart hands?
[18:36:46] i'm not sure i trust smart hands to do the pulling the sfp module in a switch
[18:36:53] it's just to get it rma'd
[18:38:12] PROBLEM - Puppet freshness on search1001 is CRITICAL: Puppet has not run in the last 10 hours
[18:41:12] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours
[18:45:12] LeslieCarr: Sorry, was getting groceries
[18:45:15] PROBLEM - Puppet freshness on ms-be3002 is CRITICAL: Puppet has not run in the last 10 hours
[18:45:24] im back in DC area
[18:45:43] until monday
[18:45:46] then i leave here forever.
[18:45:48] \o/
[18:46:10] :)
[18:46:15] if you have the opportunity - https://rt.wikimedia.org/Ticket/Display.html?id=4199
[18:46:57] this needs to happen asap?
[18:47:05] cuz chris is back onsite next tuesday.
[18:47:14] well, wednesday (tuesday we all have off)
[18:47:45] I'm trying to avoid driving out there since I am still packing up my place and stuff in Arlington ;]
[18:48:25] but otherwise if it needs to go sooner than later i can head over either late today or late tomorrow (have UPS picking up boxes of my stuff today and tomorrow)
[18:48:35] i can leave to head down after UPS shows up each day.
[18:49:51] btw, RobH: https://gerrit.wikimedia.org/r/#/c/39739/ :)
[18:50:12] PROBLEM - Puppet freshness on ssl3001 is CRITICAL: Puppet has not run in the last 10 hours
[18:50:53] maxsem: sorry i meant to merge that yesterday
[18:50:53] heh, i'll gladly merge that =]
[18:51:18] cmjohnson1: if you had, then you would have taken responsibility of letting me know I had yttrium back ;]
[18:51:33] my precious (servers)
[18:52:29] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39739
[18:52:42] ha...totally heard gollum there..."my precious"
[18:52:46] indeed
[18:52:57] so yea, my saying 'if you merge a decom, you have to tell me' is new as of this moment
[18:53:13] but it makes sense, if we reclaim servers, either flag the gerrit review for me to review, or just drop me a note
[18:53:26] pretty sure everyone already does that (the note) but doesnt hurt to say it again
[18:53:26] k
[18:54:47] MaxSem: thanks, change is merged and live, yttrium will just get pulled and reclaimed by me sometime soon.
[18:55:21] whee
[18:57:01] cmjohnson1: hey, what's up with search1001?
[18:57:15] just curious/need a reminder if you already told me :)
[18:58:24] it needs a reinstall...i didn't want to do it if I was going to move the servers to a different rack
[18:59:56] i can fix now if you like or wait
[19:06:42] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:17:39] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.452 seconds
[19:19:27] RECOVERY - Puppet freshness on search1001 is OK: puppet ran at Thu Dec 27 19:18:52 UTC 2012
[19:19:36] can someone create an account for me on http://wikitech.wikimedia.org? I heard I might want RobH for this maybe?
[19:22:42] chrismcmahon: i got it
[19:22:52] thanks LeslieCarr
[19:23:02] RECOVERY - Lucene disk space on search1001 is OK: DISK OK
[19:25:53] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours
[19:26:02] RECOVERY - Lucene on search1001 is OK: TCP OK - 0.027 second response time on port 8123
[19:38:56] RECOVERY - NTP on search1001 is OK: NTP OK: Offset -0.0162473917 secs
[19:53:20] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:56:20] New patchset: Ori.livneh; "(RT 4094) Increase varnish SHM defaults" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40554
[20:05:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 2.496 seconds
[20:12:50] PROBLEM - Puppet freshness on stat1 is CRITICAL: Puppet has not run in the last 10 hours
[20:40:08] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:52:08] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.435 seconds
[20:56:59] meeester notpeter, you around?
[20:57:21] i'm puppetizing a site for ryan faulkner on stat1001, and it needs a research db slave password in the configs
[20:57:35] shoudl I add a class to private/manifests/passwords.pp for this?
[20:58:02] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours
[20:58:02] PROBLEM - Puppet freshness on silver is CRITICAL: Puppet has not run in the last 10 hours
[20:58:04] actually, binasher, maybe you'd know the proper thing to do here, since you might know more about these db passwords
[21:01:51] yoooo binasher_, did you see my recent question (saw that you just entered the room)
[21:01:52] ?
[21:03:13] ottomata: that's done with other db passwords (as used by nagios, ganglia, etc.) so that would make sense
[21:03:30] ok cool, so i'm going to add one for the research user
[21:03:35] passwords::mysql::research
[21:03:36] or something
[21:03:42] sound good?
[21:03:53] also, to make sure I know the process of making this change:
[21:03:55] edit on sockpuppet
[21:03:57] svn commit
[21:04:04] ….then what?
[21:07:47] ottomata: git not svn, and see the REMINDER.. file in /root/private/
[21:08:03] ah its in git now, cool, yeah read that
[21:19:27] New patchset: Ottomata; "Puppetizing E3's metrics API on stat1001" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40866
[21:20:33] ops: The thank-you banner is now up; app api load could increase. Handy ganglia link: http://ganglia.wikimedia.org/latest/?c=API%20application%20servers%20pmtpa&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2 (I'm watching)
[21:21:02] I'm probably being excessively paranoid
[21:22:58] New patchset: Ottomata; "Puppetizing E3's metrics API on stat1001" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40866
[21:26:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:35:34] RECOVERY - Host silicon is UP: PING OK - Packet loss = 0%, RTA = 36.09 ms
[21:37:04] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.539 seconds
[21:42:05] New review: Ottomata; "I'm not sure if " [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/40866
[21:42:06] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40866
[21:45:45] New patchset: Ottomata; "Fixing docroot parameter for webserver::apache::site in metrics-api site" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40977
[21:46:40] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40977
[21:50:53] New patchset: Ottomata; "Removing invalid WSGIRestrictStdout Off from metrics api VirtualHost" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40979
[21:51:13] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40979
[21:52:17] New patchset: Ori.livneh; "Remove old config var; enable Ext:EventLogging" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/40980
[21:56:25] Change merged: Ori.livneh; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/40980
[22:00:19] New patchset: Ottomata; "Need to include passwords::mysql::research class, Also fixing commas in settings.py" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40982
[22:01:00] New patchset: Ottomata; "Need to include passwords::mysql::research class, Also fixing commas in settings.py" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40982
[22:01:19] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40982
[22:02:01] !log olivneh synchronized wmf-config/InitialiseSettings.php
[22:02:12] Logged the message, Master
[22:04:53] New patchset: MaxSem; "Solr monitoring" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40983
[22:06:36] New patchset: Tim Starling; "Allow per-realm and per-datacenter configuration" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/32167
[22:07:42] Change merged: Tim Starling; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/32167
[22:08:07] New patchset: Tim Starling; "Make getRealmSpecificFilename() faster" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/40775
[22:08:44] New patchset: Ottomata; "Don't need to symlink to E3Analysis/src anymore" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40984
[22:08:55] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40984
[22:08:58] Change merged: Tim Starling; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/40775
[22:12:08] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:13:05] !log tstarling Started syncing Wikimedia installation... :
[22:13:13] Logged the message, Master
[22:27:27] !log tstarling Finished syncing Wikimedia installation... :
[22:27:35] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.884 seconds
[22:27:35] Logged the message, Master
[22:28:38] !log tstarling Started syncing Wikimedia installation... :
[22:28:46] Logged the message, Master
[22:37:32] New patchset: Tim Starling; "Fix fatal error due to missing MWRealm.php" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/40988
[22:37:35] New patchset: Pyoungmeister; "enabling a cron for echo team" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40989
[22:38:17] Change merged: Tim Starling; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/40988
[22:39:29] !log tstarling Started syncing Wikimedia installation... :
[22:39:37] Logged the message, Master
[22:42:20] New review: MZMcBride; "manifests/misc/maintenance.pp now has inconsistent indentation (spaces were used in this changeset i..." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/40989
[22:42:46] !log tstarling Finished syncing Wikimedia installation... :
[22:42:54] Logged the message, Master
[22:46:39] New patchset: Tim Starling; "Fix incorrect dir from I7ef35304" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/40991
[22:46:56] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours
[22:47:00] Change merged: Tim Starling; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/40991
[22:47:39] New patchset: Pyoungmeister; "enabling a cron for echo team" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40989
[22:47:50] !log tstarling Started syncing Wikimedia installation... :
[22:47:58] Logged the message, Master
[22:51:55] !log tstarling Finished syncing Wikimedia installation... :
[22:52:03] Logged the message, Master
[22:53:44] !log tstarling Started syncing Wikimedia installation... :
[22:53:52] Logged the message, Master
[22:55:56] PROBLEM - Puppet freshness on ms-be1005 is CRITICAL: Puppet has not run in the last 10 hours
[22:55:56] PROBLEM - Puppet freshness on ms-be1007 is CRITICAL: Puppet has not run in the last 10 hours
[22:55:56] PROBLEM - Puppet freshness on ms-be1006 is CRITICAL: Puppet has not run in the last 10 hours
[22:55:56] PROBLEM - Puppet freshness on analytics1007 is CRITICAL: Puppet has not run in the last 10 hours
[22:57:44] !log tstarling Finished syncing Wikimedia installation... :
[22:57:52] Logged the message, Master
[23:02:14] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:07:53] New patchset: Pyoungmeister; "enabling a cron for echo team" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40989
[23:13:15] New patchset: Pyoungmeister; "enabling a cron for echo team" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40989
[23:16:11] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.019 seconds
[23:21:43] New patchset: Pyoungmeister; "adding data sources for eqiad ganglia groups" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/41001
[23:24:08] New patchset: Pyoungmeister; "enabling a cron for echo team" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40989
[23:24:41] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40989
[23:31:12] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/41001
[23:42:08] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 21.05242 (gt 8.0)
[23:49:02] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:58:20] RECOVERY - Packetloss_Average on locke is OK: OK: packet_loss_average is 3.65921103704