[00:00:15] mutante: cool. We still need hashar to sign off though
[00:00:20] (CR) Dzahn: "ftfy" [puppet] - https://gerrit.wikimedia.org/r/189132 (owner: Krinkle)
[00:00:25] yep
[00:00:29] ok
[00:03:19] (PS1) RobH: fixing rbf2002 entry [dns] - https://gerrit.wikimedia.org/r/189137
[00:03:46] (CR) RobH: [C: 2] fixing rbf2002 entry [dns] - https://gerrit.wikimedia.org/r/189137 (owner: RobH)
[00:04:44] PROBLEM - manage_nfs_volumes_running on labstore1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/local/sbin/manage-nfs-volumes
[00:05:09] <^d> andrewbogott: ^^
[00:05:27] ^d: yeah, I stopped it
[00:05:42] <^d> Ah k, just making sure it didn't screw itself again
[00:06:22] at the moment, it can't get any more broken :(
[00:07:54] RECOVERY - manage_nfs_volumes_running on labstore1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/local/sbin/manage-nfs-volumes
[00:10:53] operations, hardware-requests, ops-codfw: Procure and setup rbf2001-2002 - https://phabricator.wikimedia.org/T86897#1022006 (RobH)
[00:10:54] operations, ops-codfw: reclaim rbf2002/WMF5833 back to spare, allocate WMF5845 as rbf2002 - https://phabricator.wikimedia.org/T88380#1022004 (RobH) Resolved>Open Papaul, When I attempt to ssh into 10.193.2.118, I get no response. Did you test mgmt ssh into rbf2002 when you finished setup? (This shoul...
[00:13:23] operations, hardware-requests, ops-codfw: Procure and setup rbf2001-2002 - https://phabricator.wikimedia.org/T86897#1022012 (RobH) The DNS issues for rbf2001 and rbf2002 mgmt have been fixed. However, rbf2002.mgmt is on 10.193.2.118, and it's not responsive to ping or ssh (both via fqdn or direct ip) I've re...
[00:14:33] RECOVERY - Disk space on analytics1027 is OK: DISK OK
[00:14:54] operations, hardware-requests, ops-codfw: Procure and setup rbf2001-2002 - https://phabricator.wikimedia.org/T86897#1022017 (RobH) a:RobH>Dzahn Well, we know that the install worked in Ubuntu before (since I had installed ubuntu on rbf2001). I'm not sure what issue would arise for its production DNS, a...
[00:15:25] RECOVERY - Disk space on stat1002 is OK: DISK OK
[00:18:28] (PS1) Dzahn: phab: direct_comments_allowed for Domains tickets [puppet] - https://gerrit.wikimedia.org/r/189140 (https://phabricator.wikimedia.org/T88842)
[00:18:51] (CR) jenkins-bot: [V: -1] phab: direct_comments_allowed for Domains tickets [puppet] - https://gerrit.wikimedia.org/r/189140 (https://phabricator.wikimedia.org/T88842) (owner: Dzahn)
[00:18:54] !log Manually bumped heap for the Hadoop namenodes and revived them after both ran out of heap and did not come back.
[00:19:01] Logged the message, Master
[00:20:44] PROBLEM - manage_nfs_volumes_running on labstore1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/local/sbin/manage-nfs-volumes
[00:22:36] operations, Incident-20150205-SiteOutage, MediaWiki-Core-Team, Wikimedia-Logstash: Prototype Monolog and rsyslog configuration to ship log events from MediaWiki to Logstash - https://phabricator.wikimedia.org/T88870#1022054 (bd808) NEW a:bd808
[00:22:41] operations, hardware-requests, Scrum-of-Scrums, RESTBase: RESTBase production hardware - https://phabricator.wikimedia.org/T76986#1022061 (GWicke) Just to document the latest status: - HP forgot 10G ethernet, shipping modules for arrival tomorrow - Racking early next week - Setup: Debian Jessie, small (~20G...
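The manage_nfs_volumes_running alerts above come from a plain process-count check. As a rough sketch only, not the actual WMF check definition, an NRPE command producing that PROCS CRITICAL/OK wording would look something like this; only the regex comes from the alert text, the 1:1 min/max process bounds are an assumption:

```bash
# Hypothetical stand-in for the check behind the manage_nfs_volumes_running
# alerts; only the regex comes from the alert text, the 1:1 process
# bounds are an assumption.
/usr/lib/nagios/plugins/check_procs -c 1:1 \
    --ereg-argument-array='^/usr/bin/python /usr/local/sbin/manage-nfs-volumes'
# Exits 2 ("PROCS CRITICAL: 0 processes ...") while the daemon is stopped,
# 0 ("PROCS OK: 1 process ...") once it is running again.
```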
[00:23:45] PROBLEM - Kafka Broker Messages In Per Second on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 0 data above and 46 below the confidence bounds
[00:25:22] (CR) Dzahn: "i don't see the syntax error right now.. do you?" [puppet] - https://gerrit.wikimedia.org/r/189140 (https://phabricator.wikimedia.org/T88842) (owner: Dzahn)
[00:26:03] RECOVERY - manage_nfs_volumes_running on labstore1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/local/sbin/manage-nfs-volumes
[00:26:54] PROBLEM - Kafka Broker Messages In Per Second on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 0 data above and 46 below the confidence bounds
[00:27:27] (CR) Dzahn: "the example is: direct_comments_allowed => { testproj => 'cisco.com,gmail.com'}," [puppet] - https://gerrit.wikimedia.org/r/189140 (https://phabricator.wikimedia.org/T88842) (owner: Dzahn)
[00:28:56] operations, Phabricator: enable email for tickets in domains project? - https://phabricator.wikimedia.org/T88842#1022068 (Dzahn) there's a patch now trying to enable the direct_comments, please see above. i'm currently not sure why jenkins dislikes it
[00:29:13] PROBLEM - manage_nfs_volumes_running on labstore1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/local/sbin/manage-nfs-volumes
[00:32:42] operations, Release-Engineering, Wikimedia-Extension-setup: ca.wikimedia wiki - sidebar in French won't work... - https://phabricator.wikimedia.org/T88843#1022081 (Dzahn)
[00:34:54] operations, Wikimedia-Git-or-Gerrit: stop gerrit from mailing every single change in operations to the ops mailing list - https://phabricator.wikimedia.org/T88388#1022083 (Dzahn)
[00:35:16] operations, Incident-20150205-SiteOutage, MediaWiki-Core-Team, Wikimedia-Logstash: Prototype Monolog and rsyslog configuration to ship log events from MediaWiki to Logstash - https://phabricator.wikimedia.org/T88870#1022085 (chasemp) Sounds good!
[00:36:24] PROBLEM - Kafka Broker Messages In Per Second on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 0 data above and 46 below the confidence bounds
[00:38:20] operations, hardware-requests, ops-codfw: Procure and setup rbf2001-2002 - https://phabricator.wikimedia.org/T86897#1022094 (Dzahn) >>! In T86897#1022012, @RobH wrote: > The DNS issues for rbf2001 and rbf2002 mgmt have been fixed. I don't see the change on iron yet. I'll check again later. ``` dzahn@iron:~$...
[00:38:57] mutante: ....
[00:39:05] but i tested it a second ago and it showed only one
[00:39:08] what the fuccccccccck
[00:39:16] well, not a second ago, but when i pasted that
[00:39:40] operations, hardware-requests, ops-codfw: Procure and setup rbf2001-2002 - https://phabricator.wikimedia.org/T86897#1022095 (Dzahn) >>! In T86897#1022017, @RobH wrote: > Well, we know that the install worked in Ubuntu before (since I had installed ubuntu on rbf2001). I'm not sure what issue would arise for i...
[00:39:43] PROBLEM - Kafka Broker Messages In Per Second on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 0 data above and 47 below the confidence bounds
[00:39:44] either way, im done dealing with dns on a friday afternoon, it's gonna wait till monday when im ready to deal with it again.
[00:40:53] RECOVERY - manage_nfs_volumes_running on labstore1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/local/sbin/manage-nfs-volumes
[00:41:22] operations, hardware-requests, ops-codfw: Procure and setup rbf2001-2002 - https://phabricator.wikimedia.org/T86897#1022097 (Dzahn) a:Dzahn>None
[00:41:53] PROBLEM - Kafka Broker Messages In Per Second on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 0 data above and 46 below the confidence bounds
[00:42:34] (PS1) QChris: Force 2GB of heap for name nodes [puppet/cdh] - https://gerrit.wikimedia.org/r/189143 (https://phabricator.wikimedia.org/T88871)
[00:44:54] PROBLEM - Kafka Broker Messages In Per Second on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 0 data above and 48 below the confidence bounds
[00:47:04] PROBLEM - Kafka Broker Messages In Per Second on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 0 data above and 47 below the confidence bounds
[00:49:15] PROBLEM - Kafka Broker Messages In Per Second on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 0 data above and 46 below the confidence bounds
[00:49:29] andrewbogott: We're having issues with the Hadoop name nodes dying due to running out of heap space and not coming back, which renders the Hadoop cluster unusable.
[00:49:43] Manually bumping heap allows restarting the node.
[00:49:50] Could you have a look at
[00:49:54] https://gerrit.wikimedia.org/r/#/c/189143/
[00:50:06] Which is a crude way to puppetize this ad-hoc fix.
[00:50:20] (I'll leave a proper fix to ottomata when he comes back)
[00:50:37] qchris: no theory about what's gobbling heap space?
[00:51:07] Vague theory is:
[00:51:07] Will lifting the heap size just increase the time between deaths, or does it actually stabilize somewhere under 2g?
[00:51:24] PROBLEM - Kafka Broker Messages In Per Second on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 0 data above and 45 below the confidence bounds
[00:51:25] Each file takes up space on the name node (hence heap)
[00:51:31] More files, more need for heap.
[00:51:53] ok
[00:51:57] We have been seeing this issue since 2015-02-04.
[00:52:03] Getting worse and worse.
[00:52:06] I can merge that patch for now, but seems like it will recur :(
[00:52:22] (CR) Andrew Bogott: [C: 2] Force 2GB of heap for name nodes [puppet/cdh] - https://gerrit.wikimedia.org/r/189143 (https://phabricator.wikimedia.org/T88871) (owner: QChris)
[00:53:07] (PS1) QChris: Bump cdh module to increase heap on name nodes [puppet] - https://gerrit.wikimedia.org/r/189146 (https://phabricator.wikimedia.org/T88871)
[00:53:16] andrewbogott: Thanks.
[00:53:36] Could you please also review the module bump ^
[00:53:40] operations, hardware-requests, ops-codfw: Procure and setup rbf2001-2002 - https://phabricator.wikimedia.org/T86897#1022139 (Dzahn) yea, no. i'm getting the "Malformed IP address" thing again...
[00:53:57] (CR) Andrew Bogott: [C: 2] Bump cdh module to increase heap on name nodes [puppet] - https://gerrit.wikimedia.org/r/189146 (https://phabricator.wikimedia.org/T88871) (owner: QChris)
[00:54:07] Yes, the issue will return. But the default of 1GB was sufficient for ~9 months.
[00:54:20] And basically, I just want the cluster to survive until ottomata is back.
[00:54:28] So researchers are not blocked.
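Some context on the heap bump merged above: the HDFS namenode holds metadata for every file and block in JVM heap, which is why qchris ties heap demand to file count. The gerrit patches themselves are not shown in the log; as an illustration only, forcing a 2 GB namenode heap on a CDH-style install typically comes down to a hadoop-env.sh override like the following (file path and service name are the stock CDH ones, not confirmed from the patch):

```bash
# Illustrative sketch -- not the actual puppet/cdh change from the log.
# On CDH-style installs the namenode heap is commonly forced in
# /etc/hadoop/conf/hadoop-env.sh via the namenode's JVM options:
export HADOOP_NAMENODE_OPTS="-Xms2g -Xmx2g ${HADOOP_NAMENODE_OPTS}"

# Restart the namenode so the new limit takes effect:
service hadoop-hdfs-namenode restart
```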
[00:54:33] PROBLEM - Kafka Broker Messages In Per Second on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 0 data above and 45 below the confidence bounds
[00:54:44] I am sure he'll want more parametrization :-)
[00:54:49] Thanks andrewbogott!
[00:56:35] PROBLEM - Kafka Broker Messages In Per Second on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 0 data above and 48 below the confidence bounds
[00:59:50] (PS1) Legoktm: Add "composer test" command to lint files and run tests [mediawiki-config] - https://gerrit.wikimedia.org/r/189148 (https://phabricator.wikimedia.org/T85947)
[01:00:04] PROBLEM - Kafka Broker Messages In Per Second on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 0 data above and 45 below the confidence bounds
[01:02:13] PROBLEM - Kafka Broker Messages In Per Second on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 0 data above and 48 below the confidence bounds
[01:07:24] PROBLEM - Kafka Broker Messages In Per Second on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 0 data above and 47 below the confidence bounds
[01:09:34] PROBLEM - Kafka Broker Messages In Per Second on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 0 data above and 45 below the confidence bounds
[01:11:43] PROBLEM - Kafka Broker Messages In Per Second on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 0 data above and 45 below the confidence bounds
[01:33:54] RECOVERY - Kafka Broker Messages In Per Second on graphite1001 is OK: OK: No anomaly detected
[01:48:17] !log leaving cp1064 (jessie upload eqiad) pooled front+back. it's experimental but looks stable. if upload-related 503 spikes and I'm not around, feel free to depool it.
[01:48:28] Logged the message, Master
[01:50:04] PROBLEM - Kafka Broker Messages In on analytics1021 is CRITICAL: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 885.983580392
[02:00:44] (CR) Vogone: [C: 1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/187183 (https://phabricator.wikimedia.org/T87797) (owner: Florianschmidtwelzow)
[02:11:15] !log Ran kafka leader re-election as analytics1021 dropped out of its partition leader role.
[02:11:23] Logged the message, Master
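The "kafka leader re-election" entry refers to Kafka's preferred-replica election: after a broker such as analytics1021 falls out of sync and rejoins, other brokers keep leadership of its partitions until an election hands it back. A sketch of how that is typically triggered with the stock Kafka tooling of that era (the ZooKeeper connect string is a placeholder, not the real analytics cluster setting):

```bash
# Sketch: move partition leadership back to the preferred replicas
# after a broker rejoins the in-sync replica set.
# The ZooKeeper address below is a placeholder.
kafka-preferred-replica-election.sh --zookeeper zk1.example.org:2181/kafka
```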
[02:11:27] (CR) Dzahn: "hashar wrote: "whenever someone adds in LDAP a key that he uses on production, the Jenkins job will fail and thus prevent new puppet cha" [puppet] - https://gerrit.wikimedia.org/r/175442 (owner: ArielGlenn)
[02:14:24] RECOVERY - Kafka Broker Messages In on analytics1021 is OK: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate OKAY: 3675.59574114
[02:19:05] !log l10nupdate Synchronized php-1.25wmf15/cache/l10n: (no message) (duration: 00m 02s)
[02:19:12] Logged the message, Master
[02:20:13] !log LocalisationUpdate completed (1.25wmf15) at 2015-02-07 02:19:09+00:00
[02:20:16] Logged the message, Master
[02:33:05] !log l10nupdate Synchronized php-1.25wmf16/cache/l10n: (no message) (duration: 00m 01s)
[02:33:14] Logged the message, Master
[02:34:13] !log LocalisationUpdate completed (1.25wmf16) at 2015-02-07 02:33:09+00:00
[02:34:15] Logged the message, Master
[02:49:14] PROBLEM - LVS HTTPS IPv4 on text-lb.eqiad.wikimedia.org is CRITICAL: Connection timed out
[02:50:14] RECOVERY - LVS HTTPS IPv4 on text-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 67263 bytes in 0.119 second response time
[02:56:38] (PS1) Ori.livneh: vbench improvements [puppet] - https://gerrit.wikimedia.org/r/189166
[02:58:06] (PS2) Ori.livneh: vbench improvements [puppet] - https://gerrit.wikimedia.org/r/189166
[02:58:18] (CR) Ori.livneh: [C: 2 V: 2] vbench improvements [puppet] - https://gerrit.wikimedia.org/r/189166 (owner: Ori.livneh)
[02:59:44] PROBLEM - puppet last run on osmium is CRITICAL: CRITICAL: Puppet last ran 4 hours ago
[03:00:44] RECOVERY - puppet last run on osmium is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[03:01:43] PROBLEM - Ori committing changes on the weekend on palladium is CRITICAL: CRITICAL: Ori committed a change on a weekend
[03:13:07] !log restarting parsoid cluster
[03:13:13] Logged the message, Master
[03:20:35] PROBLEM - Parsoid on wtp1018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:22:24] PROBLEM - puppet last run on wtp1021 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:25:44] RECOVERY - Parsoid on wtp1018 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.046 second response time
[03:39:03] PROBLEM - Disk space on vanadium is CRITICAL: DISK CRITICAL - free space: /srv 14255 MB (3% inode=99%):
[03:40:14] RECOVERY - puppet last run on wtp1021 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
[03:44:53] Invitation to Special Issue and Publish Papers for Free in It - https://phabricator.wikimedia.org/T88883#1022386 (emailbot)
[03:59:03] RECOVERY - Ori committing changes on the weekend on palladium is OK: OK: Ori is behaving himself
[04:47:14] RECOVERY - Disk space on vanadium is OK: DISK OK
[04:48:33] !log LocalisationUpdate ResourceLoader cache refresh completed at Sat Feb 7 04:47:30 UTC 2015 (duration 47m 29s)
[04:48:39] Logged the message, Master
[05:10:04] !log deployed parsoid hotfix 8ca7ef40 (cherry-pick of 447a0565)
[05:10:14] Logged the message, Master
[06:28:24] PROBLEM - puppet last run on mw1068 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:28:34] PROBLEM - puppet last run on mw1042 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:28:35] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:04] PROBLEM - puppet last run on cp1061 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:05] PROBLEM - puppet last run on elastic1027 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:29:44] PROBLEM - puppet last run on amssq35 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:44] PROBLEM - puppet last run on cp3016 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:45:13] RECOVERY - puppet last run on mw1068 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures
[06:45:24] RECOVERY - puppet last run on cp3016 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures
[06:45:53] RECOVERY - puppet last run on cp1061 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
[06:45:54] RECOVERY - puppet last run on elastic1027 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures
[06:46:24] RECOVERY - puppet last run on mw1042 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[06:46:24] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[06:46:24] RECOVERY - puppet last run on amssq35 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[08:07:53] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [500.0]
[08:16:24] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 0 below the confidence bounds
[09:11:24] PROBLEM - Host cp1070 is DOWN: PING CRITICAL - Packet loss = 100%
[09:23:55] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[09:33:26] !log rebooting cp1070 (dead network, dead console)
[09:33:30] Logged the message, Master
[09:35:24] RECOVERY - Host cp1070 is UP: PING OK - Packet loss = 0%, RTA = 10.44 ms
[09:38:04] PROBLEM - Host cp1070 is DOWN: PING CRITICAL - Packet loss = 100%
[09:38:30] operations, ops-eqiad: cp1070 hardware failure - https://phabricator.wikimedia.org/T88889#1022473 (BBlack) NEW
[09:40:57] !log depooled cp1070 in pybal
[09:40:58] <_joe_> bblack: hey, I can look after it
[09:41:00] Logged the message, Master
[09:41:08] <_joe_> is it an hw failure?
[09:41:13] yeah looks like it
[09:41:20] see paste in ticket
[09:41:34] hey
[09:41:40] <_joe_> oh, eh. Sorry, just woke up basically
[09:42:04] what are the chances :)
[09:42:17] yeah who knows
[09:42:28] could be related to the thermal stuff too for all I know
[09:42:29] that it's one of the debian boxes that died I mean
[09:43:19] the chances are 17% :)
[09:44:32] (PS1) BBlack: depool cp1070 in cache.pp: T88889 [puppet] - https://gerrit.wikimedia.org/r/189183
[09:44:49] (CR) BBlack: [C: 2 V: 2] depool cp1070 in cache.pp: T88889 [puppet] - https://gerrit.wikimedia.org/r/189183 (owner: BBlack)
[09:45:17] also: https://gdash.wikimedia.org/dashboards/reqerror/
[09:45:39] that huge 1h 5xx plateau ends when icinga reported cp1070 dying. probably not a coincidence.
[09:46:10] ouch
[09:48:13] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [500.0]
[09:54:48] (PS1) Faidon Liambotis: reprepro: switch Cassandra to 2.1, add to jessie [puppet] - https://gerrit.wikimedia.org/r/189186
[09:57:29] (CR) Faidon Liambotis: [C: 2] reprepro: switch Cassandra to 2.1, add to jessie [puppet] - https://gerrit.wikimedia.org/r/189186 (owner: Faidon Liambotis)
[10:00:54] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[10:01:04] operations, RESTBase: Import Cassandra packages for Jessie - https://phabricator.wikimedia.org/T88850#1022495 (faidon) I fixed this in a better way: adjusting our reprepro config to point to 2.1 (and also install to jessie). This means that a simple "reprepro update" should just get us the latest and greatest...
[10:03:14] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected
[10:06:22] Wikia does custom kernel? https://github.com/Wikia/kernel-configs
[10:10:17] <_joe_> Nemo_bis: the big news would be they still use 3-year-old configs for even older kernels
[10:11:21] Yeah, but they might just have discontinued publishing. :)
[10:11:55] As they have a private issue tracker etc.
[10:32:43] (PS1) BBlack: depool cp1064 upload backend [puppet] - https://gerrit.wikimedia.org/r/189190
[10:33:00] (CR) BBlack: [C: 2 V: 2] depool cp1064 upload backend [puppet] - https://gerrit.wikimedia.org/r/189190 (owner: BBlack)
[11:45:55] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds
[11:48:04] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds
[12:45:18] (PS1) John F. Lewis: apache: remove wikiversity.org from portal alias [puppet] - https://gerrit.wikimedia.org/r/189195 (https://phabricator.wikimedia.org/T88776)
[12:49:17] (CR) John F. Lewis: [C: 1] "per Brandon, lowercase file names would look somewhat nicer in a repo of lowercase file names." [dns] - https://gerrit.wikimedia.org/r/189102 (https://phabricator.wikimedia.org/T88722) (owner: Dzahn)
[13:08:24] (PS1) John F. Lewis: add network variables for dumps rsync clients [puppet] - https://gerrit.wikimedia.org/r/189196
[13:09:37] (CR) John F. Lewis: "Uploaded I8f4d653857e8414df4ef4fa5a2b7be5684963b78 which adds the values to network and makes them recognised by srange." [puppet] - https://gerrit.wikimedia.org/r/188204 (owner: Dzahn)
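On Faidon's reprepro change above: reprepro mirrors external repositories according to update rules in its conf/updates file, so "switch Cassandra to 2.1, add to jessie" amounts to editing one such rule and re-running the update. A hedged sketch of the mechanism (the basedir, suite name, and signing key below are placeholders, not the real WMF apt configuration):

```bash
# Illustrative only -- not the actual operations/puppet apt config;
# the basedir, suite and signing key below are placeholders.
cat >> /srv/wikimedia/conf/updates <<'EOF'
Name: cassandra
Method: http://www.apache.org/dist/cassandra/debian
Suite: 21x
Components: main
Architectures: amd64 source
VerifyRelease: UPSTREAM-KEY-ID
EOF

# Then, as faidon notes on T88850, a plain update re-fetches packages:
reprepro -b /srv/wikimedia update
```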
[13:43:46] operations: Invitation to Special Issue and Publish Papers for Free in It - https://phabricator.wikimedia.org/T88892#1022557 (emailbot)
[14:21:34] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds
[15:06:54] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected
[15:17:33] PROBLEM - HTTP on dataset1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:30:57] hm I did not get a page for that
[15:31:04] let's see what's up
[15:35:33] RECOVERY - HTTP on dataset1001 is OK: HTTP OK: HTTP/1.1 200 OK - 5114 bytes in 4.149 second response time
[15:35:55] !log started nginx on dataset1001, it was not running for some reason
[15:36:04] Logged the message, Master
[15:44:14] PROBLEM - HTTP on dataset1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:55:44] I see requests coming in and being answered
[15:57:58] (PS1) Glaisher: Enable VisualEditor on project namespace at cawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/189211 (https://phabricator.wikimedia.org/T88896)
[16:11:35] RECOVERY - HTTP on dataset1001 is OK: HTTP OK: HTTP/1.1 200 OK - 5114 bytes in 4.632 second response time
[16:13:24] operations, Incident-20150205-SiteOutage: Nutcracker needs to automatically recover from MC failure - https://phabricator.wikimedia.org/T88730#1022728 (Joe) Twemproxy docs state: "Enabling auto_eject_hosts: ensures that a dead server can be ejected out of the hash ring after server_failure_limit: consecutive...
[17:19:59] operations, ops-ulsfo: Invitation to Special Issue and Publish Papers for Free in It - https://phabricator.wikimedia.org/T88901#1022777 (emailbot)
[18:03:34] (PS3) Gergő Tisza: Add -labs settings for Score [mediawiki-config] - https://gerrit.wikimedia.org/r/181358 (https://phabricator.wikimedia.org/T85049)
[18:33:38] operations, ops-codfw: reclaim rbf2002/WMF5833 back to spare, allocate WMF5845 as rbf2002 - https://phabricator.wikimedia.org/T88380#1022902 (Papaul) I came in today, Saturday, just to check the settings on rbf2002 and I realized that the OS was already installed and I was also able to log in (ssh root@10.193.2.1...
[21:38:36] (PS1) Faidon Liambotis: icinga: remove mark from SMS [puppet] - https://gerrit.wikimedia.org/r/189287
[21:38:59] (CR) Faidon Liambotis: [C: 2] icinga: remove mark from SMS [puppet] - https://gerrit.wikimedia.org/r/189287 (owner: Faidon Liambotis)
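On the nutcracker ticket (T88730) quoted at 16:13: auto_eject_hosts, server_failure_limit, and server_retry_timeout are the twemproxy pool options Joe is citing; together they let a dead memcached backend be ejected from the hash ring and retried later. A minimal sketch of a pool using them (pool name, addresses, and thresholds are illustrative, not WMF's production nutcracker.yml):

```bash
# Illustrative twemproxy/nutcracker pool -- not the production WMF
# config; names, addresses and thresholds below are placeholders.
cat > /etc/nutcracker/nutcracker.yml <<'EOF'
memcached:
  listen: 127.0.0.1:11212
  hash: fnv1a_64
  distribution: ketama
  auto_eject_hosts: true        # eject a dead backend from the hash ring
  server_failure_limit: 3       # ...after this many consecutive failures
  server_retry_timeout: 30000   # ...and retry it after 30s (milliseconds)
  timeout: 250
  servers:
    - 10.0.0.1:11211:1
    - 10.0.0.2:11211:1
EOF
```

The trade-off behind the ticket is that ejection restores availability at the cost of remapping the ejected server's keys onto the rest of the pool, so each ejection and rejoin tends to show up as a burst of cache misses.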