[00:00:45] Wed Dec 26 23:54:35 UTC 2012 mw15 commonswiki BacklinkCache::getLinks 10.0.6.75 2008 MySQL client ran out of memory (10.0.6.75) SELECT /*! STRAIGHT_JOIN */ page_namespace,page_title,page_id FROM `templatelinks`,`page` WHERE tl_namespace = '10' AND tl_title = 'Date' AND (page_id=tl_from) ORDER BY tl_from
[00:01:29] that query returns 12384915 rows! BacklinkCache::getLinks needs batching..
[00:01:54] Aaron has been working on that already
[00:02:31] cool. is there already a bugzilla ticket?
[00:04:54] I am thinking of https://gerrit.wikimedia.org/r/#/c/32488/
[00:05:26] https://bugzilla.wikimedia.org/show_bug.cgi?id=37731 is the bug I suppose
[00:06:46] maybe it needs to be reopened
[00:08:01] hmm that adds a limit to getNumLinks but not the query in getLinks.. i'll reopen
[00:08:44] New patchset: Tim Starling; "beta: rm global $wmfRealm before including IS-labs.php" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/39744
[00:09:19] Change merged: Tim Starling; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/39744
[00:11:09] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.051 seconds
[00:13:27] New patchset: Andrew Bogott; "A couple of minor fixes for single-node mediawiki:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40764
[00:14:38] !log reedy synchronized php-1.21wmf6/cache/l10n
[00:14:47] Logged the message, Master
[00:16:12] New patchset: Andrew Bogott; "A couple of minor fixes for single-node mediawiki:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40764
[00:16:12] New patchset: Ori.livneh; "Change 40763 to MobileFrontend only loads EventLogging if it is defined, so this is not necessary." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/40765
[00:16:35] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40764
[00:21:38] New review: Tim Starling; "getRealmSpecifcFilename -> getRealmSpecificFilename" [operations/mediawiki-config] (master); V: 0 C: -1; - https://gerrit.wikimedia.org/r/32167
[00:21:49] !log powercycled cp1044, couldn't ssh in
[00:21:57] Logged the message, Master
[00:23:25] PROBLEM - Host cp1044 is DOWN: PING CRITICAL - Packet loss = 100%
[00:24:19] RECOVERY - SSH on cp1044 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0)
[00:24:28] RECOVERY - Host cp1044 is UP: PING OK - Packet loss = 0%, RTA = 26.50 ms
[00:24:46] RECOVERY - Varnish HTCP daemon on cp1044 is OK: PROCS OK: 1 process with UID = 1001 (varnishhtcpd), args varnishhtcpd worker
[00:25:13] RECOVERY - Varnish HTTP mobile-backend on cp1044 is OK: HTTP OK HTTP/1.1 200 OK - 696 bytes in 0.054 seconds
[00:25:18] Change merged: Ori.livneh; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/40765
[00:28:58] RECOVERY - Varnish HTTP mobile-frontend on cp1044 is OK: HTTP OK HTTP/1.1 200 OK - 641 bytes in 0.054 seconds
[00:36:45] New patchset: Reedy; "Ignore "SHA-1 metadata" in fatalmonitor" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40766
[00:37:26] New review: Tim Starling; "It should be $wmgRealm, for consistency with other globals. wmg = Wikimedia global, wmf = Wikimedia ..." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/39056
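A note on the BacklinkCache::getLinks out-of-memory error at the top of this block: the batching fix being discussed is not shown in the log, so the following is only a minimal, hypothetical PHP sketch of the idea — fetching the ~12.4 million templatelinks rows in keyset-paginated batches instead of one buffered result set. The connection details, batch size, and use of tl_from as the pagination cursor are assumptions for illustration, not the contents of change 32488 or Aaron's patch.

<?php
// Hypothetical batching sketch; not the actual MediaWiki/BacklinkCache code.
$pdo = new PDO( 'mysql:host=10.0.6.75;dbname=commonswiki', 'wikiuser', 'secret' ); // assumed credentials
$pdo->setAttribute( PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION );

$batchSize = 5000; // assumed batch size
$lastFrom  = 0;    // highest tl_from seen so far (keyset cursor)

do {
    // Each round fetches only the next slice ordered by tl_from, so the
    // client never has to buffer all ~12.4M rows at once.
    $stmt = $pdo->prepare(
        'SELECT /*! STRAIGHT_JOIN */ page_namespace, page_title, page_id, tl_from
         FROM templatelinks, page
         WHERE tl_namespace = 10 AND tl_title = ? AND page_id = tl_from AND tl_from > ?
         ORDER BY tl_from
         LIMIT ' . $batchSize
    );
    $stmt->execute( array( 'Date', $lastFrom ) );
    $rows = $stmt->fetchAll( PDO::FETCH_ASSOC );

    foreach ( $rows as $row ) {
        $lastFrom = (int)$row['tl_from'];
        // ... process one backlink row here ...
    }
} while ( count( $rows ) === $batchSize );

Paginating on the indexed tl_from column keeps each round trip small and restartable; an unbuffered (streaming) result set would be another way to avoid the client-side memory blowup, but it holds the query open much longer.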
[00:38:27] New patchset: Asher; "redis replication config" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40767
[00:42:55] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:44:24] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40767
[00:46:40] PROBLEM - Varnish traffic logger on cp1044 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishncsa
[00:52:27] New review: Tim Starling; "Each getRealmSpecificFilename() call takes about 29us on my laptop, so that implies that this change..." [operations/mediawiki-config] (master); V: 0 C: -1; - https://gerrit.wikimedia.org/r/32167
[00:52:31] New patchset: Asher; "lowering cache4xx to 1m (from 5m) for upload varnish instances. transient swift errors result in thumb.php 404's that should be valid, so shorter seems better here" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40768
[00:56:19] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40768
[01:00:46] RECOVERY - Varnish traffic logger on cp1044 is OK: PROCS OK: 3 processes with command name varnishncsa
[01:00:55] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.097 seconds
[01:04:12] New patchset: Asher; "redis slaveof cmd requires a port" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40769
[01:04:55] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40769
[01:09:19] PROBLEM - Varnish traffic logger on cp1044 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishncsa
[01:15:10] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 270 seconds
[01:16:49] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 10 seconds
[01:17:00] New patchset: Ori.livneh; "(RT 4094) Increase varnish SHM defaults" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40554
[01:17:36] binasher: ^ that specifies the value in bytes; sorry.
[01:18:01] RECOVERY - Varnish traffic logger on cp1044 is OK: PROCS OK: 3 processes with command name varnishncsa
[01:18:34] ori-l: see https://rt.wikimedia.org/Ticket/Display.html?id=4094#txn-91690 :(
[01:19:05] god damn it
[01:19:12] do you think it's worth trying with an even more modest limit?
[01:20:05] yeah
[01:20:06] even a setting of 1020 bytes would be an x4 increase over the current limit
[01:26:22] ori-l: i'm testing shm_reclen=1024 on one of the mobile servers, i'll let you know tomorrow if varnishncsa crashes
[01:26:53] didn't have to wait too long with the 12k setting
[01:26:56] binasher: excellent, thanks! also -
[01:27:16] it would be nice to track down the problem, but probably not easy
[01:27:29] do you think this could be caused by a lingering copy of varnishncsa that is working with the previous sizes?
[01:27:56] nope
[01:28:21] they all crashed on the test host, then i tried running one to stdout with no other running, to see if it was udp related
[01:28:44] that crashed as well and is where i got the backtrace from
[01:29:29] that sucks :/
[01:29:39] well, thanks for responding so quickly and thoroughly
[01:30:18] hopefully 1k works
[01:32:26] New patchset: Andrew Bogott; "Variable substitution between single-quotes doesn't work so well." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40770
[01:32:29] so far it's ok. munmap_chunk(): invalid pointer: 0x0000000000d9d360 - i wonder if there's a fixed size in the code that shm_reclen can't be larger than (smaller than the 16k mentioned on the list)
[01:32:59] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40770
[01:34:13] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:39:13] binasher: not sure. there are reports of people successfully setting shm_reclen to 64k, but not at web-scale™
[01:44:51] binasher: did you restart varnishd itself? shm_workspace is flagged 'delayed' in the docs, indicating "This parameter can be changed on the fly, but will not take effect immediately."
[01:44:55] (https://www.varnish-cache.org/docs/3.0/reference/varnishd.html#run-time-parameters)
[01:45:23] ori-l: yes, gave it as a command line option at start time.
[01:48:28] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 2.428 seconds
[01:50:09] bah :| out of ideas, then. i'll read the changelog and bug tracker carefully tonight to see if anything potentially related was reported.
[01:59:23] anybody know what the l/p for ganglia is now? (pm or e-mail)
[02:09:07] !log on db64: killed FlaggedRevsStats::getEditReviewTimes queries, were running for 40 days. Client host was hume but no process was found with the relevant ephemeral port.
[02:09:17] Logged the message, Master
[02:15:57] ori-l: /home/wikipedia/docs/ganglia or something there of
[02:16:32] /home/wikipedia/doc/ganglia.htaccess
[02:21:50] !log LocalisationUpdate completed (1.21wmf6) at Thu Dec 27 02:21:50 UTC 2012
[02:22:00] Logged the message, Master
[02:23:25] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:32:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 5.319 seconds
[02:37:46] Reedy: thanks
[02:44:25] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours
[02:53:25] PROBLEM - Puppet freshness on analytics1007 is CRITICAL: Puppet has not run in the last 10 hours
[02:53:25] PROBLEM - Puppet freshness on ms-be1007 is CRITICAL: Puppet has not run in the last 10 hours
[02:53:25] PROBLEM - Puppet freshness on ms-be1005 is CRITICAL: Puppet has not run in the last 10 hours
[02:53:25] PROBLEM - Puppet freshness on ms-be1006 is CRITICAL: Puppet has not run in the last 10 hours
[03:53:25] Reedy: so they gave permission denied but then what? still needs fixing?
[03:53:31] (l10nupdate)
[04:15:47] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[04:15:47] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours
[05:34:46] PROBLEM - Puppet freshness on db1047 is CRITICAL: Puppet has not run in the last 10 hours
[05:34:46] PROBLEM - Puppet freshness on ms-fe1003 is CRITICAL: Puppet has not run in the last 10 hours
[05:34:46] PROBLEM - Puppet freshness on ms-fe1004 is CRITICAL: Puppet has not run in the last 10 hours
[05:34:46] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: Puppet has not run in the last 10 hours
[05:34:47] PROBLEM - Puppet freshness on sq48 is CRITICAL: Puppet has not run in the last 10 hours
[05:34:47] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours
[05:51:21] New patchset: Tim Starling; "Allow per-realm and per-datacenter configuration" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/32167
[05:54:44] New patchset: Tim Starling; "Make getRealmSpecificFilename() faster" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/40775
[06:36:57] Change abandoned: Tim Starling; "(no reason)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31661
[06:38:36] New review: Tim Starling; "I don't understand why different wikis need different thumbnail sizes. It's not like they use differ..." [operations/mediawiki-config] (master); V: 0 C: -1; - https://gerrit.wikimedia.org/r/31580
[06:41:59] New review: Tim Starling; "You mean bug 42748." [operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/39775
[06:46:10] New patchset: Tim Starling; "Revert "Kill mobileRedirect.php, not used since forever"" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/39775
[06:46:38] New review: Tim Starling; "PS2: rebased and fixed commit message." [operations/mediawiki-config] (master); V: 2 C: 2; - https://gerrit.wikimedia.org/r/39775
[06:46:38] Change merged: Tim Starling; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/39775
[06:47:57] !log tstarling synchronized live-1.5/mobileRedirect.php
[06:48:06] Logged the message, Master
[07:00:27] New review: Brian Wolff; ">ii. This won't affect current users' preferences, but only >anonymous" [operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/31580
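For context on the two mediawiki-config patchsets uploaded at 05:51 and 05:54 above ("Allow per-realm and per-datacenter configuration", 32167, and "Make getRealmSpecificFilename() faster", 40775), here is a hedged PHP sketch of what a realm-aware filename resolver of that shape can look like. It is not the code from those changes: the candidate fallback order, the $wmgRealm/$wmgDatacenter globals, and the static cache (one plausible response to the "about 29us" per-call review comment earlier in the log) are assumptions for illustration only.

<?php
// Illustrative sketch; the real helper lives in wmf-config (MWRealm.php) and may differ.
$wmgRealm      = 'labs';   // assumed value: e.g. 'production' or 'labs'
$wmgDatacenter = 'pmtpa';  // assumed value: e.g. 'pmtpa' or 'eqiad'

function getRealmSpecificFilename( $filename ) {
    global $wmgRealm, $wmgDatacenter;
    static $cache = array();  // memoize so repeated calls don't re-stat the filesystem

    if ( isset( $cache[$filename] ) ) {
        return $cache[$filename];
    }

    $ext  = pathinfo( $filename, PATHINFO_EXTENSION );
    $base = substr( $filename, 0, -( strlen( $ext ) + 1 ) );

    // Try the most specific candidate first, then fall back to the plain file.
    $candidates = array(
        "$base-$wmgRealm-$wmgDatacenter.$ext",
        "$base-$wmgRealm.$ext",
        $filename,
    );
    foreach ( $candidates as $candidate ) {
        if ( file_exists( $candidate ) ) {
            return $cache[$filename] = $candidate;
        }
    }
    return $cache[$filename] = $filename;
}

// In a labs realm this would prefer e.g. InitialiseSettings-labs.php when present.
echo getRealmSpecificFilename( __DIR__ . '/InitialiseSettings.php' ), "\n";

The memoization only illustrates one way the per-call cost could be cut; the log does not show what change 40775 actually did.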
[07:28:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:33:52] PROBLEM - Puppet freshness on manganese is CRITICAL: Puppet has not run in the last 10 hours
[07:44:31] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.779 seconds
[08:17:40] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:32:04] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.033 seconds
[08:36:52] PROBLEM - Puppet freshness on search1001 is CRITICAL: Puppet has not run in the last 10 hours
[08:39:52] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours
[08:48:52] PROBLEM - Puppet freshness on ssl3001 is CRITICAL: Puppet has not run in the last 10 hours
[09:05:22] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:07:48] New patchset: ArielGlenn; "make deployment dirs if they don't exist" [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/40780
[09:10:16] Change merged: ArielGlenn; [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/33566
[09:10:58] Change merged: ArielGlenn; [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/36712
[09:10:59] Change merged: ArielGlenn; [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/35378
[09:11:43] Change merged: ArielGlenn; [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/40780
[09:18:07] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 5.824 seconds
[09:24:52] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours
[09:50:59] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:05:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.021 seconds
[10:11:41] PROBLEM - Puppet freshness on stat1 is CRITICAL: Puppet has not run in the last 10 hours
[10:38:23] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:39:44] PROBLEM - MySQL Replication Heartbeat on db1007 is CRITICAL: CRIT replication delay 203 seconds
[10:40:20] PROBLEM - MySQL Slave Delay on db1007 is CRITICAL: CRIT replication delay 219 seconds
[10:50:50] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.029 seconds
[10:55:11] RECOVERY - MySQL Replication Heartbeat on db1007 is OK: OK replication delay 0 seconds
[10:55:56] RECOVERY - MySQL Slave Delay on db1007 is OK: OK replication delay 0 seconds
[10:56:41] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours
[10:56:41] PROBLEM - Puppet freshness on silver is CRITICAL: Puppet has not run in the last 10 hours
[11:23:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:36:11] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.956 seconds
[12:09:44] New patchset: J; "add cgroup to limit memory of sub processes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40784
[12:11:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:12:38] oh?
[12:13:58] cool!
[12:22:05] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 7.051 seconds
[12:46:09] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours
[12:55:00] PROBLEM - Puppet freshness on analytics1007 is CRITICAL: Puppet has not run in the last 10 hours
[12:55:00] PROBLEM - Puppet freshness on ms-be1005 is CRITICAL: Puppet has not run in the last 10 hours
[12:55:00] PROBLEM - Puppet freshness on ms-be1006 is CRITICAL: Puppet has not run in the last 10 hours
[12:55:00] PROBLEM - Puppet freshness on ms-be1007 is CRITICAL: Puppet has not run in the last 10 hours
[12:56:39] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:09:06] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.232 seconds
[13:38:29] New patchset: Mark Bergsma; "Exit if the queue gets too full because workers are stuck" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40789
[13:39:03] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40789
[13:43:54] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:45:35] New patchset: Mark Bergsma; "Move the socket receive / purge enqueuing out of the eval block" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40790
[13:46:16] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40790
[13:47:07] New review: Alex Monk; "hello?" [operations/apache-config] (master) C: 0; - https://gerrit.wikimedia.org/r/34113
[13:54:33] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.656 seconds
[14:17:12] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[14:17:12] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours
[14:24:52] New patchset: ArielGlenn; "deploy script to update file paths in config files if desired" [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/40795
[14:29:39] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:40:45] PROBLEM - Host analytics1011 is DOWN: PING CRITICAL - Packet loss = 100%
[14:41:48] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.049 seconds
[14:42:15] RECOVERY - Host analytics1011 is UP: PING OK - Packet loss = 0%, RTA = 26.72 ms
[14:44:57] PROBLEM - Host analytics1012 is DOWN: PING CRITICAL - Packet loss = 100%
[14:46:45] PROBLEM - Host analytics1011 is DOWN: PING CRITICAL - Packet loss = 100%
[14:48:19] New patchset: Anomie; "Allow per-realm and per-datacenter configuration" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/32167
[14:48:52] New review: Anomie; "PS12: Fix spelling error getRealmSpecifcFilename → getRealmSpecificFilename" [operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/32167
[14:49:27] RECOVERY - Host analytics1011 is UP: PING OK - Packet loss = 0%, RTA = 26.83 ms
[14:50:12] RECOVERY - Host analytics1012 is UP: PING OK - Packet loss = 0%, RTA = 26.96 ms
[14:51:42] PROBLEM - Host analytics1015 is DOWN: PING CRITICAL - Packet loss = 100%
[14:51:42] PROBLEM - Host analytics1018 is DOWN: PING CRITICAL - Packet loss = 100%
[14:51:42] PROBLEM - Host analytics1020 is DOWN: PING CRITICAL - Packet loss = 100%
[14:51:51] PROBLEM - Host analytics1014 is DOWN: PING CRITICAL - Packet loss = 100%
[14:52:27] PROBLEM - Host analytics1017 is DOWN: PING CRITICAL - Packet loss = 100%
[14:52:36] PROBLEM - Host analytics1019 is DOWN: PING CRITICAL - Packet loss = 100%
[14:55:18] RECOVERY - Host analytics1014 is UP: PING OK - Packet loss = 0%, RTA = 26.53 ms
[14:55:46] afk for awhile, back later this evening
[14:56:03] RECOVERY - Host analytics1017 is UP: PING OK - Packet loss = 0%, RTA = 27.27 ms
[14:56:39] RECOVERY - Host analytics1015 is UP: PING OK - Packet loss = 0%, RTA = 26.55 ms
[14:57:06] RECOVERY - Host analytics1018 is UP: PING OK - Packet loss = 0%, RTA = 27.02 ms
[14:57:33] RECOVERY - Host analytics1020 is UP: PING OK - Packet loss = 0%, RTA = 26.50 ms
[14:57:42] PROBLEM - Host analytics1013 is DOWN: PING CRITICAL - Packet loss = 100%
[14:58:00] PROBLEM - Host analytics1016 is DOWN: PING CRITICAL - Packet loss = 100%
[14:58:18] RECOVERY - Host analytics1019 is UP: PING OK - Packet loss = 0%, RTA = 26.54 ms
[15:02:21] RECOVERY - Host analytics1013 is UP: PING OK - Packet loss = 0%, RTA = 26.53 ms
[15:02:48] RECOVERY - Host analytics1016 is UP: PING OK - Packet loss = 0%, RTA = 26.55 ms
[15:05:30] PROBLEM - MySQL Replication Heartbeat on db64 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:07:10] RECOVERY - MySQL Replication Heartbeat on db64 is OK: OK replication delay seconds
[15:07:18] PROBLEM - Puppet freshness on ms-fe1001 is CRITICAL: Puppet has not run in the last 10 hours
[15:13:45] PROBLEM - MySQL Slave Running on db64 is CRITICAL: CRIT replication Slave_IO_Running: No Slave_SQL_Running: No Last_Error: Rollback done for prepared transaction because its XID was not in the
[15:14:03] PROBLEM - MySQL Replication Heartbeat on db64 is CRITICAL: CRIT replication delay 534 seconds
[15:15:33] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:29:39] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.020 seconds
[15:33:17] !log shutting down mw57 to troubleshoot DIMM/Mem issue
[15:33:27] Logged the message, Master
[15:36:15] PROBLEM - Puppet freshness on db1047 is CRITICAL: Puppet has not run in the last 10 hours
[15:36:15] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: Puppet has not run in the last 10 hours
[15:36:15] PROBLEM - Puppet freshness on ms-fe1003 is CRITICAL: Puppet has not run in the last 10 hours
[15:36:15] PROBLEM - Puppet freshness on ms-fe1004 is CRITICAL: Puppet has not run in the last 10 hours
[15:36:16] PROBLEM - Puppet freshness on sq48 is CRITICAL: Puppet has not run in the last 10 hours
[15:36:16] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours
[15:36:42] PROBLEM - Host mw57 is DOWN: PING CRITICAL - Packet loss = 100%
[15:44:30] RECOVERY - Host mw57 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms
[15:49:09] PROBLEM - Apache HTTP on mw57 is CRITICAL: Connection refused
[15:58:00] RECOVERY - Apache HTTP on mw57 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.059 second response time
[16:01:09] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:13:36] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.644 seconds
[16:47:21] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:59:39] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.035 seconds
[17:16:12] apergos or any other op available? yuvipanda is experiencing problems reaching wikipedia
[17:16:39] he did a traceroute: http://pastebin.com/fCSNAmJh
[17:17:31] ew that's an ugly path
[17:18:12] LeslieCarr: around?
[17:18:16] hey
[17:18:18] Jeff_Green: aye
[17:18:22] * yuvipanda waves to awjr
[17:18:26] * awjr waves back
[17:18:34] so, I can successfully ping hop 3 from eqiad
[17:18:45] can I get your IP yuvipanda?
[17:18:46] private is fine
[17:18:51] sure
[17:19:02] 180.151.43.50
[17:21:05] i did another traceroute, same thing
[17:22:19] looking
[17:24:44] from a first glance it looks like a problem on your isp's side
[17:25:26] oh?
[17:26:03] hmm, i could try with my phone's 3g but that doesn't let any icmp go through.
[17:26:18] traceroute stops at AS 9498
[17:26:33] never reaches AS 10029 which is your ISP
[17:26:53] or perhaps stops at 10029's border
[17:27:05] 10029 is SPECTRANET
[17:27:17] that is the ISP I'm on.
[17:27:28] so the ISP is blocking this somehow?
[17:27:31] I figured as much :)
[17:28:24] paravoid: thanks! I'll poke them in the eye
[17:28:38] i'll check hrough my phone
[17:28:42] I wouldn't say blocked
[17:28:49] I'd say some kind of problem probably
[17:29:16] everything else goes through fine
[17:29:21] also, there were some unconfirmed reports before about gmail being blocked in India
[17:29:29] oh?
[17:29:32] works fine for me...
[17:29:35] right
[17:29:43] one second, hopping on to a different network
[17:29:44] so, it might be some kind of infrastructure issue?
[17:29:56] possibly
[17:30:05] spectranet isin't really a popular isp
[17:32:42] yuvipanda, might be related: https://twitter.com/dweekly/status/284183693596188672
[17:32:48] yeah that was the one
[17:32:56] they seem to be doing some traffic engineering too
[17:33:00] meh, too far out from the city, so no 3G either
[17:33:11] so, different paths for different smallish subnets of theirs
[17:33:15] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:33:16] i can ping twitter
[17:33:38] yuvipanda, are you hiding in the woods?
[17:34:58] yuvipanda: try reporting the problem to your isp. make sure to give them your IP and the IP from our side you're not able to reach
[17:34:58] MaxSem: no, in bangalore :)
[17:35:06] yeah, i'll do that.
[17:35:12] PROBLEM - Puppet freshness on manganese is CRITICAL: Puppet has not run in the last 10 hours
[17:35:21] you said that you're far from the city
[17:35:34] yeah
[17:35:35] noc@wikimedia.org is the primary contact for network issues from our side, although I don't see evidence to suggest it's something close to us, so far
[17:35:48] MaxSem: it is a big city
[17:41:10] yuvipanda: do you also have a problem reaching wikimedia-lb.pmtpa.wikimedia.org btw?
[17:42:19] we don't (yet?) have a way to selectively switch isps to different datacenters, but it might be good to know
[17:43:55] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 7.317 seconds
[17:44:39] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 210 seconds
[17:44:48] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 215 seconds
[17:51:21] paravoid: sorry stepped out to grab a drink
[17:51:26] paravoid: i can't reach that either, no
[17:52:02] oh hm, right, we're using the same path to reach you, so that figures
[17:53:16] yeah, same traceroute
[17:58:03] Jeff_Green: now around
[17:58:09] what's up ?
[17:58:15] yuvipanda: having problems ?
[17:58:20] hey--i was going to get you involved in that yeah
[17:58:53] but that was a while ago, not sure what is the urgency at this point
[18:00:27] New review: MaxSem; "This change would also fix the inability of Windows folks to contribute to this repo as the old cert..." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/32924
[18:00:28] LeslieCarr: yeah, looks like my ISP
[18:00:57] LeslieCarr: Will just tunnel through for a while, and see if there's anything I can do to report it to the ISP
[18:01:01] okay
[18:01:10] please do report it to your isp so they can fix their routing
[18:01:36] yup, will do :)
[18:18:51] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:20:57] RECOVERY - MySQL Replication Heartbeat on db1035 is OK: OK replication delay 0 seconds
[18:21:24] RECOVERY - MySQL Slave Delay on db1035 is OK: OK replication delay 0 seconds
[18:21:42] RobH: you don't happen to be in the ashburn area at the moment, do you ?
[18:34:54] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.027 seconds
[18:36:31] lesliecarr: what's up? (not there though)...do we need smart hands?
[18:36:46] i'm not sure i trust smart hands to do the pulling the sfp module in a switch
[18:36:53] it's just to get it rma'd
[18:38:12] PROBLEM - Puppet freshness on search1001 is CRITICAL: Puppet has not run in the last 10 hours
[18:41:12] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours
[18:45:12] LeslieCarr: Sorry, was getting groceries
[18:45:15] PROBLEM - Puppet freshness on ms-be3002 is CRITICAL: Puppet has not run in the last 10 hours
[18:45:24] im back in DC area
[18:45:43] until monday
[18:45:46] then i leave here forever.
[18:45:48] \o/
[18:46:10] :)
[18:46:15] if you have the opportunity - https://rt.wikimedia.org/Ticket/Display.html?id=4199
[18:46:57] this needs to happen asap?
[18:47:05] cuz chris is back onsite next tuesday.
[18:47:14] well, wednesday (tuesday we all have off)
[18:47:45] I'm trying to avoid driving out there since I am still packing up my place and stuff in Arlington ;]
[18:48:25] but otherwise if it needs to go sooner than later i can head over either late today or late tomorrow (have UPS picking up boxes of my stuff today and tomorrow)
[18:48:35] i can leave to head down after UPS shows up each day.
[18:49:51] btw, RobH: https://gerrit.wikimedia.org/r/#/c/39739/ :)
[18:50:12] PROBLEM - Puppet freshness on ssl3001 is CRITICAL: Puppet has not run in the last 10 hours
[18:50:53] maxsem: sorry i meant to merge that yesterday
[18:50:53] heh, i'll gladly merge that =]
[18:51:18] cmjohnson1: if you had, then you would have taken responsibility of letting me know I had yttrium back ;]
[18:51:33] my precious (servers)
[18:52:29] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39739
[18:52:42] ha...totally heard gollum there..."my precious"
[18:52:46] indeed
[18:52:57] so yea, my saying 'if you merge a decom, you have to tell me' is new as of this moment
[18:53:13] but it makes sense, if we reclaim servers, either flag the gerrit review for me to review, or just drop me a note
[18:53:26] pretty sure everyone already does that (the note) but doesnt hurt to say it again
[18:53:26] k
[18:54:47] MaxSem: thanks, change is merged and live, yttrium will just get pulled and reclaimed by me sometime soon.
[18:55:21] whee
[18:57:01] cmjohnson1: hey, what's up with search1001?
[18:57:15] just curious/need a reminder if you already told me :)
[18:58:24] it needs a reinstall...i didn't want to do it if I was going to move the servers to a different rack
[18:59:56] i can fix now if you like or wait
[19:06:42] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:17:39] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.452 seconds
[19:19:27] RECOVERY - Puppet freshness on search1001 is OK: puppet ran at Thu Dec 27 19:18:52 UTC 2012
[19:19:36] can someone create an account for me on http://wikitech.wikimedia.org? I heard I might want RobH for this maybe?
[19:22:42] chrismcmahon: i got it
[19:22:52] thanks LeslieCarr
[19:23:02] RECOVERY - Lucene disk space on search1001 is OK: DISK OK
[19:25:53] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours
[19:26:02] RECOVERY - Lucene on search1001 is OK: TCP OK - 0.027 second response time on port 8123
[19:38:56] RECOVERY - NTP on search1001 is OK: NTP OK: Offset -0.0162473917 secs
[19:53:20] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:56:20] New patchset: Ori.livneh; "(RT 4094) Increase varnish SHM defaults" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40554
[20:05:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 2.496 seconds
[20:12:50] PROBLEM - Puppet freshness on stat1 is CRITICAL: Puppet has not run in the last 10 hours
[20:40:08] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:52:08] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.435 seconds
[20:56:59] meeester notpeter, you around?
[20:57:21] i'm puppetizing a site for ryan faulkner on stat1001, and it needs a research db slave password in the configs
[20:57:35] shoudl I add a class to private/manifests/passwords.pp for this?
[20:58:02] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours
[20:58:02] PROBLEM - Puppet freshness on silver is CRITICAL: Puppet has not run in the last 10 hours
[20:58:04] actually, binasher, maybe you'd know the proper thing to do here, since you might know more about these db passwords
[21:01:51] yoooo binasher_, did you see my recent question (saw that you just entered the room)
[21:01:52] ?
[21:03:13] ottomata: that's done with other db passwords (as used by nagios, ganglia, etc.) so that would make sense
[21:03:30] ok cool, so i'm going to add one for the research user
[21:03:35] passwords::mysql::research
[21:03:36] or something
[21:03:42] sound good?
[21:03:53] also, to make sure I know the process of making this change:
[21:03:55] edit on sockpuppet
[21:03:57] svn commit
[21:04:04] ….then what?
[21:07:47] ottomata: git not svn, and see the REMINDER.. file in /root/private/
[21:08:03] ah its in git now, cool, yeah read that
[21:19:27] New patchset: Ottomata; "Puppetizing E3's metrics API on stat1001" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40866
[21:20:33] ops: The thank-you banner is now up; app api load could increase. Handy ganglia link: http://ganglia.wikimedia.org/latest/?c=API%20application%20servers%20pmtpa&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2 (I'm watching)
[21:21:02] I'm probably being excessively paranoid
[21:22:58] New patchset: Ottomata; "Puppetizing E3's metrics API on stat1001" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40866
[21:26:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:35:34] RECOVERY - Host silicon is UP: PING OK - Packet loss = 0%, RTA = 36.09 ms
[21:37:04] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.539 seconds
[21:42:05] New review: Ottomata; "I'm not sure if " [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/40866
[21:42:06] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40866
[21:45:45] New patchset: Ottomata; "Fixing docroot parameter for webserver::apache::site in metrics-api site" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40977
[21:46:40] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40977
[21:50:53] New patchset: Ottomata; "Removing invalid WSGIRestrictStdout Off from metrics api VirtualHost" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40979
[21:51:13] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40979
[21:52:17] New patchset: Ori.livneh; "Remove old config var; enable Ext:EventLogging" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/40980
[21:56:25] Change merged: Ori.livneh; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/40980
[22:00:19] New patchset: Ottomata; "Need to include passwords::mysql::research class, Also fixing commas in settings.py" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40982
[22:01:00] New patchset: Ottomata; "Need to include passwords::mysql::research class, Also fixing commas in settings.py" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40982
[22:01:19] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40982
[22:02:01] !log olivneh synchronized wmf-config/InitialiseSettings.php
[22:02:12] Logged the message, Master
[22:04:53] New patchset: MaxSem; "Solr monitoring" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40983
[22:06:36] New patchset: Tim Starling; "Allow per-realm and per-datacenter configuration" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/32167
[22:07:42] Change merged: Tim Starling; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/32167
[22:08:07] New patchset: Tim Starling; "Make getRealmSpecificFilename() faster" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/40775
[22:08:44] New patchset: Ottomata; "Don't need to symlink to E3Analysis/src anymore" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40984
[22:08:55] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40984
[22:08:58] Change merged: Tim Starling; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/40775
[22:12:08] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:13:05] !log tstarling Started syncing Wikimedia installation... :
[22:13:13] Logged the message, Master
[22:27:27] !log tstarling Finished syncing Wikimedia installation... :
[22:27:35] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.884 seconds
[22:27:35] Logged the message, Master
[22:28:38] !log tstarling Started syncing Wikimedia installation... :
[22:28:46] Logged the message, Master
[22:37:32] New patchset: Tim Starling; "Fix fatal error due to missing MWRealm.php" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/40988
[22:37:35] New patchset: Pyoungmeister; "enabling a cron for echo team" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40989
[22:38:17] Change merged: Tim Starling; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/40988
[22:39:29] !log tstarling Started syncing Wikimedia installation... :
[22:39:37] Logged the message, Master
[22:42:20] New review: MZMcBride; "manifests/misc/maintenance.pp now has inconsistent indentation (spaces were used in this changeset i..." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/40989
[22:42:46] !log tstarling Finished syncing Wikimedia installation... :
[22:42:54] Logged the message, Master
[22:46:39] New patchset: Tim Starling; "Fix incorrect dir from I7ef35304" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/40991
[22:46:56] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours
[22:47:00] Change merged: Tim Starling; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/40991
[22:47:39] New patchset: Pyoungmeister; "enabling a cron for echo team" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40989
[22:47:50] !log tstarling Started syncing Wikimedia installation... :
[22:47:58] Logged the message, Master
[22:51:55] !log tstarling Finished syncing Wikimedia installation... :
[22:52:03] Logged the message, Master
[22:53:44] !log tstarling Started syncing Wikimedia installation... :
[22:53:52] Logged the message, Master
[22:55:56] PROBLEM - Puppet freshness on ms-be1005 is CRITICAL: Puppet has not run in the last 10 hours
[22:55:56] PROBLEM - Puppet freshness on ms-be1007 is CRITICAL: Puppet has not run in the last 10 hours
[22:55:56] PROBLEM - Puppet freshness on ms-be1006 is CRITICAL: Puppet has not run in the last 10 hours
[22:55:56] PROBLEM - Puppet freshness on analytics1007 is CRITICAL: Puppet has not run in the last 10 hours
[22:57:44] !log tstarling Finished syncing Wikimedia installation... :
[22:57:52] Logged the message, Master
[23:02:14] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:07:53] New patchset: Pyoungmeister; "enabling a cron for echo team" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40989
[23:13:15] New patchset: Pyoungmeister; "enabling a cron for echo team" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40989
[23:16:11] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.019 seconds
[23:21:43] New patchset: Pyoungmeister; "adding data sources for eqiad ganglia groups" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/41001
[23:24:08] New patchset: Pyoungmeister; "enabling a cron for echo team" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40989
[23:24:41] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40989
[23:31:12] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/41001
[23:42:08] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 21.05242 (gt 8.0)
[23:49:02] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:58:20] RECOVERY - Packetloss_Average on locke is OK: OK: packet_loss_average is 3.65921103704