[00:02:38] operations, Parsoid: All parsoid servers almost unresponsive, under high cpu load - https://phabricator.wikimedia.org/T114558#1700248 (ssastry) See attached screenshots. {F2657213} {F2657215} {F2657217}
[00:38:43] operations, Parsoid: All parsoid servers almost unresponsive, under high cpu load - https://phabricator.wikimedia.org/T114558#1700266 (GWicke) If you look at the graphs, similarly-sized documents were processed with reasonable latency and no outage before, for example one fairly large batch around Sept 13...
[00:51:55] PROBLEM - puppet last run on cp4005 is CRITICAL: CRITICAL: puppet fail
[01:18:44] RECOVERY - puppet last run on cp4005 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[02:17:03] operations, Parsoid: All parsoid servers almost unresponsive, under high cpu load - https://phabricator.wikimedia.org/T114558#1700299 (ssastry) Reg. GC pressure because of increased tokens in flight due to batch, that is a possibility I hadn't thought about .. so, worth investigating. But, yes, gerrit 243...
[02:20:41] operations, Parsoid: All parsoid servers almost unresponsive, under high cpu load - https://phabricator.wikimedia.org/T114558#1700301 (ssastry) And, the reason I am surprised by the graph I pasted above is because of what it seems to be saying: it doesn't matter how many API calls you are making. It does...
[02:27:10] !log l10nupdate@tin Synchronized php-1.27.0-wmf.1/cache/l10n: l10nupdate for 1.27.0-wmf.1 (duration: 08m 06s)
[02:27:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:31:56] !log l10nupdate@tin LocalisationUpdate completed (1.27.0-wmf.1) at 2015-10-04 02:31:56+00:00
[02:32:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:07:36] (CR) Zhuyifei1999: [C: 1] Enable WikidataPageBanner on Catalan wiki and zh wikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/242942 (https://phabricator.wikimedia.org/T114392) (owner: Jdlrobson)
[03:17:06] (CR) Liuxinyu970226: [C: 1] Enable WikidataPageBanner on Catalan wiki and zh wikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/242942 (https://phabricator.wikimedia.org/T114392) (owner: Jdlrobson)
[03:18:14] PROBLEM - YARN NodeManager Node-State on analytics1034 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:21:15] RECOVERY - YARN NodeManager Node-State on analytics1034 is OK: OK: YARN NodeManager analytics1034.eqiad.wmnet:8041 Node-State: RUNNING
[03:35:15] PROBLEM - puppet last run on elastic1014 is CRITICAL: CRITICAL: Puppet has 1 failures
[04:01:34] RECOVERY - puppet last run on elastic1014 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[05:13:32] !log l10nupdate@tin ResourceLoader cache refresh completed at Sun Oct 4 05:13:32 UTC 2015 (duration 13m 31s)
[05:13:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[06:11:45] PROBLEM - puppet last run on mw1101 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:17:34] PROBLEM - puppet last run on mw2031 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:54] PROBLEM - puppet last run on db1015 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:14] PROBLEM - puppet last run on mw2023 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:33] PROBLEM - puppet last run on restbase1007 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:30:53] PROBLEM - puppet last run on db1028 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:53] PROBLEM - puppet last run on db2060 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:54] PROBLEM - puppet last run on mw2036 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:24] PROBLEM - puppet last run on holmium is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:13] RECOVERY - puppet last run on restbase1007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:32:14] PROBLEM - puppet last run on mw2129 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:32:25] PROBLEM - puppet last run on mw2045 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:45] PROBLEM - puppet last run on mw2207 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:03] PROBLEM - puppet last run on mw2073 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:38:14] RECOVERY - puppet last run on mw1101 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures
[06:45:34] RECOVERY - puppet last run on mw2031 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:55:35] RECOVERY - puppet last run on db2060 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures
[06:56:04] RECOVERY - puppet last run on holmium is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[06:56:23] RECOVERY - puppet last run on db1015 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures
[06:56:35] RECOVERY - puppet last run on mw2023 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:03] RECOVERY - puppet last run on mw2129 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures
[06:57:13] RECOVERY - puppet last run on db1028 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:13] RECOVERY - puppet last run on mw2045 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[06:57:14] RECOVERY - puppet last run on mw2036 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
[06:57:33] RECOVERY - puppet last run on mw2207 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures
[06:57:44] RECOVERY - puppet last run on mw2073 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:11:25] PROBLEM - puppet last run on mw1052 is CRITICAL: CRITICAL: Puppet has 1 failures
[07:37:45] RECOVERY - puppet last run on mw1052 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures
[07:42:34] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Seconds_Behind_Master: 647
[07:47:34] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Seconds_Behind_Master: 253
[07:48:29] <_joe_> uhm why is the SMS we receive so uninformative?
[08:05:34] PROBLEM - mailman_queue_size on fermium is CRITICAL: CRITICAL: 1 mailman queue(s) above 100
[08:17:04] RECOVERY - mailman_queue_size on fermium is OK: OK: mailman queues are below 100
[09:25:24] PROBLEM - puppet last run on ms-fe3001 is CRITICAL: CRITICAL: puppet fail
[09:54:24] RECOVERY - puppet last run on ms-fe3001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[10:16:19] operations, Analytics, Services, Icinga, Patch-For-Review: Icinga configuration broken by aqs - https://phabricator.wikimedia.org/T114556#1700529 (mobrovac) >>! In T114556#1699808, @ArielGlenn wrote: > I wonder if the list of folks ought to be this: T114383 and T113416, pretty unclear to me tho...
[10:27:04] http://ganglia.wikimedia.org/latest/?c=Analytics%20cluster%20eqiad&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2 I know analytics machines are regularly put under stress tests (why?), but is this normal?
[11:29:08] (PS1) Merlijn van Deen: toollabs-genpp: add simple tool to check package availability [puppet] - https://gerrit.wikimedia.org/r/243498
[11:29:47] (CR) jenkins-bot: [V: -1] toollabs-genpp: add simple tool to check package availability [puppet] - https://gerrit.wikimedia.org/r/243498 (owner: Merlijn van Deen)
[11:29:50] raah.
[11:31:27] (PS2) Merlijn van Deen: toollabs-genpp: add simple tool to check package availability [puppet] - https://gerrit.wikimedia.org/r/243498
[11:33:56] (PS1) Merlijn van Deen: toollabs: install hugin-tools [puppet] - https://gerrit.wikimedia.org/r/243500 (https://phabricator.wikimedia.org/T108210)
[12:57:54] PROBLEM - puppet last run on elastic1021 is CRITICAL: CRITICAL: Puppet has 1 failures
[13:07:44] PROBLEM - puppet last run on mw2007 is CRITICAL: CRITICAL: puppet fail
[13:14:03] PROBLEM - puppet last run on wtp2007 is CRITICAL: CRITICAL: puppet fail
[13:24:25] RECOVERY - puppet last run on elastic1021 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures
[13:35:55] RECOVERY - puppet last run on mw2007 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures
[13:42:14] RECOVERY - puppet last run on wtp2007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[14:22:30] (CR) Luke081515: [C: 1] Smallest change needed to unbreak nagios config [puppet] - https://gerrit.wikimedia.org/r/243408 (https://phabricator.wikimedia.org/T114556) (owner: Jcrespo)
[14:27:19] operations, Parsoid: All parsoid servers almost unresponsive, under high cpu load - https://phabricator.wikimedia.org/T114558#1700773 (cscott) Well, all of those other things correlate with output size. I wouldn't get too misled by that. A page can contain a single template inclusion, which then expands...
[14:30:33] PROBLEM - puppet last run on mw2061 is CRITICAL: CRITICAL: puppet fail
[15:00:34] RECOVERY - puppet last run on mw2061 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[15:48:04] PROBLEM - puppet last run on mw2199 is CRITICAL: CRITICAL: Puppet has 1 failures
[15:48:11] Nemo_bis, hey, do you know where the small.dblist/medium.dblist/large.dblist generation script is?
[15:51:24] Krenair: originally WikimediaMaintenance https://gerrit.wikimedia.org/r/#/c/33694
[15:51:43] I found it, thanks
[15:54:29] (PS1) Alex Monk: Update DB size lists [mediawiki-config] - https://gerrit.wikimedia.org/r/243517
[16:14:35] RECOVERY - puppet last run on mw2199 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[17:29:17] (PS2) Alex Monk: Update DB size lists [mediawiki-config] - https://gerrit.wikimedia.org/r/243517 (https://phabricator.wikimedia.org/T114613)
[18:11:34] PROBLEM - puppet last run on mw2011 is CRITICAL: CRITICAL: Puppet has 1 failures
[18:39:55] RECOVERY - puppet last run on mw2011 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[20:33:24] PROBLEM - Kafka Broker Replica Max Lag on kafka1013 is CRITICAL: CRITICAL: 12.50% of data above the critical threshold [5000000.0]
[20:33:54] PROBLEM - YARN NodeManager Node-State on analytics1035 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:35:43] PROBLEM - RAID on analytics1035 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:35:43] PROBLEM - puppet last run on analytics1035 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:37:13] RECOVERY - YARN NodeManager Node-State on analytics1035 is OK: OK: YARN NodeManager analytics1035.eqiad.wmnet:8041 Node-State: RUNNING
[20:37:14] RECOVERY - RAID on analytics1035 is OK: OK: optimal, 13 logical, 14 physical
[20:37:14] RECOVERY - puppet last run on analytics1035 is OK: OK: Puppet is currently enabled, last run 22 minutes ago with 0 failures
[20:40:04] RECOVERY - Kafka Broker Replica Max Lag on kafka1013 is OK: OK: Less than 1.00% above the threshold [1000000.0]
[22:16:47] (CR) Aaron Schulz: "Hmm, lots of the external clusters are read only right? I guess that could change just for heartbeats." [mediawiki-config] - https://gerrit.wikimedia.org/r/243116 (https://phabricator.wikimedia.org/T111266) (owner: Aaron Schulz)
[23:36:39] !log ori@tin Synchronized php-1.27.0-wmf.1/extensions/ContentTranslation/extension.json: 8c80ec1273: Updated mediawiki/core Project: mediawiki/extensions/ContentTranslation (duration: 00m 17s)
[23:36:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master