[00:02:38] <wikibugs>	 operations, Parsoid: All parsoid servers almost unresponsive, under high cpu load - https://phabricator.wikimedia.org/T114558#1700248 (ssastry)  See attached screenshots.  {F2657213}  {F2657215}  {F2657217}
[00:38:43] <wikibugs>	 operations, Parsoid: All parsoid servers almost unresponsive, under high cpu load - https://phabricator.wikimedia.org/T114558#1700266 (GWicke) If you look at the graphs, similarly-sized documents were processed with reasonable latency and no outage before, for example one fairly large batch around Sept 13...
[00:51:55] <icinga-wm>	 PROBLEM - puppet last run on cp4005 is CRITICAL: CRITICAL: puppet fail
[01:18:44] <icinga-wm>	 RECOVERY - puppet last run on cp4005 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[02:17:03] <wikibugs>	 operations, Parsoid: All parsoid servers almost unresponsive, under high cpu load - https://phabricator.wikimedia.org/T114558#1700299 (ssastry) Reg. GC pressure because of increased tokens in flight due to batch, that is a possibility I hadn't thought about .. so, worth investigating. But, yes, gerrit 243...
[02:20:41] <wikibugs>	 operations, Parsoid: All parsoid servers almost unresponsive, under high cpu load - https://phabricator.wikimedia.org/T114558#1700301 (ssastry) And, the reason I am surprised by the graph I pasted above is because of what it seems to be saying:  it doesn't matter how many API calls you are making. It does...
[02:27:10] <logmsgbot>	 !log l10nupdate@tin Synchronized php-1.27.0-wmf.1/cache/l10n: l10nupdate for 1.27.0-wmf.1 (duration: 08m 06s)
[02:27:22] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:31:56] <logmsgbot>	 !log l10nupdate@tin LocalisationUpdate completed (1.27.0-wmf.1) at 2015-10-04 02:31:56+00:00
[02:32:01] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:07:36] <krrrit-wm>	 (CR) Zhuyifei1999: [C: 1] Enable WikidataPageBanner on Catalan wiki and zh wikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/242942 (https://phabricator.wikimedia.org/T114392) (owner: Jdlrobson)
[03:17:06] <krrrit-wm>	 (CR) Liuxinyu970226: [C: 1] Enable WikidataPageBanner on Catalan wiki and zh wikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/242942 (https://phabricator.wikimedia.org/T114392) (owner: Jdlrobson)
[03:18:14] <icinga-wm>	 PROBLEM - YARN NodeManager Node-State on analytics1034 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:21:15] <icinga-wm>	 RECOVERY - YARN NodeManager Node-State on analytics1034 is OK: OK: YARN NodeManager analytics1034.eqiad.wmnet:8041 Node-State: RUNNING
[03:35:15] <icinga-wm>	 PROBLEM - puppet last run on elastic1014 is CRITICAL: CRITICAL: Puppet has 1 failures
[04:01:34] <icinga-wm>	 RECOVERY - puppet last run on elastic1014 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[05:13:32] <logmsgbot>	 !log l10nupdate@tin ResourceLoader cache refresh completed at Sun Oct  4 05:13:32 UTC 2015 (duration 13m 31s)
[05:13:37] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[06:11:45] <icinga-wm>	 PROBLEM - puppet last run on mw1101 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:17:34] <icinga-wm>	 PROBLEM - puppet last run on mw2031 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:54] <icinga-wm>	 PROBLEM - puppet last run on db1015 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:14] <icinga-wm>	 PROBLEM - puppet last run on mw2023 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:33] <icinga-wm>	 PROBLEM - puppet last run on restbase1007 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:30:53] <icinga-wm>	 PROBLEM - puppet last run on db1028 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:53] <icinga-wm>	 PROBLEM - puppet last run on db2060 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:54] <icinga-wm>	 PROBLEM - puppet last run on mw2036 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:24] <icinga-wm>	 PROBLEM - puppet last run on holmium is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:13] <icinga-wm>	 RECOVERY - puppet last run on restbase1007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:32:14] <icinga-wm>	 PROBLEM - puppet last run on mw2129 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:32:25] <icinga-wm>	 PROBLEM - puppet last run on mw2045 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:45] <icinga-wm>	 PROBLEM - puppet last run on mw2207 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:03] <icinga-wm>	 PROBLEM - puppet last run on mw2073 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:38:14] <icinga-wm>	 RECOVERY - puppet last run on mw1101 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures
[06:45:34] <icinga-wm>	 RECOVERY - puppet last run on mw2031 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:55:35] <icinga-wm>	 RECOVERY - puppet last run on db2060 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures
[06:56:04] <icinga-wm>	 RECOVERY - puppet last run on holmium is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[06:56:23] <icinga-wm>	 RECOVERY - puppet last run on db1015 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures
[06:56:35] <icinga-wm>	 RECOVERY - puppet last run on mw2023 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:03] <icinga-wm>	 RECOVERY - puppet last run on mw2129 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures
[06:57:13] <icinga-wm>	 RECOVERY - puppet last run on db1028 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:13] <icinga-wm>	 RECOVERY - puppet last run on mw2045 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[06:57:14] <icinga-wm>	 RECOVERY - puppet last run on mw2036 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
[06:57:33] <icinga-wm>	 RECOVERY - puppet last run on mw2207 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures
[06:57:44] <icinga-wm>	 RECOVERY - puppet last run on mw2073 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:11:25] <icinga-wm>	 PROBLEM - puppet last run on mw1052 is CRITICAL: CRITICAL: Puppet has 1 failures
[07:37:45] <icinga-wm>	 RECOVERY - puppet last run on mw1052 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures
[07:42:34] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Seconds_Behind_Master: 647
[07:47:34] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Seconds_Behind_Master: 253
[07:48:29] <_joe_>	 uhm why is the SMS we receive so uninformative?
[08:05:34] <icinga-wm>	 PROBLEM - mailman_queue_size on fermium is CRITICAL: CRITICAL: 1 mailman queue(s) above 100
[08:17:04] <icinga-wm>	 RECOVERY - mailman_queue_size on fermium is OK: OK: mailman queues are below 100
[09:25:24] <icinga-wm>	 PROBLEM - puppet last run on ms-fe3001 is CRITICAL: CRITICAL: puppet fail
[09:54:24] <icinga-wm>	 RECOVERY - puppet last run on ms-fe3001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[10:16:19] <wikibugs>	 operations, Analytics, Services, Icinga, Patch-For-Review: Icinga configuration broken by aqs - https://phabricator.wikimedia.org/T114556#1700529 (mobrovac) >>! In T114556#1699808, @ArielGlenn wrote: > I wonder if the list of folks ought to be this: T114383 and T113416,   pretty unclear to me tho...
[10:27:04] <SPF|Cloud>	 http://ganglia.wikimedia.org/latest/?c=Analytics%20cluster%20eqiad&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2 I know analytics machines are regularly put under stress tests (why?), but is this normal? 
[11:29:08] <krrrit-wm>	 (PS1) Merlijn van Deen: toollabs-genpp: add simple tool to check package availability [puppet] - https://gerrit.wikimedia.org/r/243498
[11:29:47] <krrrit-wm>	 (CR) jenkins-bot: [V: -1] toollabs-genpp: add simple tool to check package availability [puppet] - https://gerrit.wikimedia.org/r/243498 (owner: Merlijn van Deen)
[11:29:50] <valhallasw`cloud>	 raah.
[11:31:27] <krrrit-wm>	 (PS2) Merlijn van Deen: toollabs-genpp: add simple tool to check package availability [puppet] - https://gerrit.wikimedia.org/r/243498
[11:33:56] <krrrit-wm>	 (PS1) Merlijn van Deen: toollabs: install hugin-tools [puppet] - https://gerrit.wikimedia.org/r/243500 (https://phabricator.wikimedia.org/T108210)
[12:57:54] <icinga-wm>	 PROBLEM - puppet last run on elastic1021 is CRITICAL: CRITICAL: Puppet has 1 failures
[13:07:44] <icinga-wm>	 PROBLEM - puppet last run on mw2007 is CRITICAL: CRITICAL: puppet fail
[13:14:03] <icinga-wm>	 PROBLEM - puppet last run on wtp2007 is CRITICAL: CRITICAL: puppet fail
[13:24:25] <icinga-wm>	 RECOVERY - puppet last run on elastic1021 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures
[13:35:55] <icinga-wm>	 RECOVERY - puppet last run on mw2007 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures
[13:42:14] <icinga-wm>	 RECOVERY - puppet last run on wtp2007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[14:22:30] <krrrit-wm>	 (CR) Luke081515: [C: 1] Smallest change needed to unbreak nagios config [puppet] - https://gerrit.wikimedia.org/r/243408 (https://phabricator.wikimedia.org/T114556) (owner: Jcrespo)
[14:27:19] <wikibugs>	 operations, Parsoid: All parsoid servers almost unresponsive, under high cpu load - https://phabricator.wikimedia.org/T114558#1700773 (cscott) Well, all of those other things correlate with output size.  I wouldn't get too misled by that.  A page can contain a single template inclusion, which then expands...
[14:30:33] <icinga-wm>	 PROBLEM - puppet last run on mw2061 is CRITICAL: CRITICAL: puppet fail
[15:00:34] <icinga-wm>	 RECOVERY - puppet last run on mw2061 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[15:48:04] <icinga-wm>	 PROBLEM - puppet last run on mw2199 is CRITICAL: CRITICAL: Puppet has 1 failures
[15:48:11] <Krenair>	 Nemo_bis, hey, do you know where the small.dblist/medium.dblist/large.dblist generation script is?
[15:51:24] <Nemo_bis>	 Krenair: originally WikimediaMaintenance https://gerrit.wikimedia.org/r/#/c/33694
[15:51:43] <Krenair>	 I found it, thanks
[15:54:29] <krrrit-wm>	 (PS1) Alex Monk: Update DB size lists [mediawiki-config] - https://gerrit.wikimedia.org/r/243517
[16:14:35] <icinga-wm>	 RECOVERY - puppet last run on mw2199 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[17:29:17] <krrrit-wm>	 (PS2) Alex Monk: Update DB size lists [mediawiki-config] - https://gerrit.wikimedia.org/r/243517 (https://phabricator.wikimedia.org/T114613)
[18:11:34] <icinga-wm>	 PROBLEM - puppet last run on mw2011 is CRITICAL: CRITICAL: Puppet has 1 failures
[18:39:55] <icinga-wm>	 RECOVERY - puppet last run on mw2011 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[20:33:24] <icinga-wm_>	 PROBLEM - Kafka Broker Replica Max Lag on kafka1013 is CRITICAL: CRITICAL: 12.50% of data above the critical threshold [5000000.0]
[20:33:54] <icinga-wm_>	 PROBLEM - YARN NodeManager Node-State on analytics1035 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:35:43] <icinga-wm_>	 PROBLEM - RAID on analytics1035 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:35:43] <icinga-wm_>	 PROBLEM - puppet last run on analytics1035 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:37:13] <icinga-wm_>	 RECOVERY - YARN NodeManager Node-State on analytics1035 is OK: OK: YARN NodeManager analytics1035.eqiad.wmnet:8041 Node-State: RUNNING
[20:37:14] <icinga-wm_>	 RECOVERY - RAID on analytics1035 is OK: OK: optimal, 13 logical, 14 physical
[20:37:14] <icinga-wm_>	 RECOVERY - puppet last run on analytics1035 is OK: OK: Puppet is currently enabled, last run 22 minutes ago with 0 failures
[20:40:04] <icinga-wm_>	 RECOVERY - Kafka Broker Replica Max Lag on kafka1013 is OK: OK: Less than 1.00% above the threshold [1000000.0]
[22:16:47] <krrrit-wm>	 (CR) Aaron Schulz: "Hmm, lots of the external clusters are read only right? I guess that could change just for heartbeats." [mediawiki-config] - https://gerrit.wikimedia.org/r/243116 (https://phabricator.wikimedia.org/T111266) (owner: Aaron Schulz)
[23:36:39] <logmsgbot>	 !log ori@tin Synchronized php-1.27.0-wmf.1/extensions/ContentTranslation/extension.json: 8c80ec1273: Updated mediawiki/core Project: mediawiki/extensions/ContentTranslation (duration: 00m 17s)
[23:36:45] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master