[00:00:09] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[00:01:19] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[00:01:43] (03CR) 10Filippo Giunchedi: [C: 032] prometheus: swap cluster/site returned by get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/318470 (owner: 10Filippo Giunchedi)
[00:03:01] fatalmonitor looks good
[00:03:27] the usual points must have either 4 or 2 values per line and not find/open font issues
[00:04:28] (03CR) 10Dzahn: "well, i see "Another downside to this algorithm in comparison to the parallel collector is that it uses more CPU in order to provide the a" [puppet] - 10https://gerrit.wikimedia.org/r/316983 (https://phabricator.wikimedia.org/T148478) (owner: 10Paladox)
[00:04:59] I've filed https://phabricator.wikimedia.org/T149389 with a new MobileFormatter issue
[00:05:49] RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[00:07:51] (03CR) 10Dzahn: "root@cobalt:~# jstat -gc 71966" [puppet] - 10https://gerrit.wikimedia.org/r/316983 (https://phabricator.wikimedia.org/T148478) (owner: 10Paladox)
[00:11:39] (03PS1) 10Filippo Giunchedi: prometheus: use lists as arguments for selectors for get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/318471
[00:13:28] (03CR) 10Filippo Giunchedi: [C: 032] prometheus: use lists as arguments for selectors for get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/318471 (owner: 10Filippo Giunchedi)
[00:13:55] (03CR) 10Dzahn: "root@cobalt:~# java -XX:+PrintFlagsFinal -version | grep HeapSize" [puppet] - 10https://gerrit.wikimedia.org/r/316983 (https://phabricator.wikimedia.org/T148478) (owner: 10Paladox)
[00:15:09] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [50.0]
[00:16:47] (03CR) 10Dzahn: "since we have heapLimit = 28g in Gerrit config and the blogger says "assuming that your heap is less than 4Gb in size. However, if it’s" [puppet] - 10https://gerrit.wikimedia.org/r/316983 (https://phabricator.wikimedia.org/T148478) (owner: 10Paladox)
[00:20:50] !log Testing logging to SAL via stashbot
[00:20:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:21:37] mutante, kaldari, care to add ([[w:Wikipedia:Tim Starling Day|Tim Starling Day]]) (and the other holidays) into Module:Deployment schedule so that it will show such "holidays" on the date headline as appropriate on Deployments? I'd find that cool, but unfortunately I have no lua skills, and the table headlines are autoset depending on what is calculated via the when function ...
[00:22:04] !log help
[00:22:04] See https://wikitech.wikimedia.org/wiki/Tool:Stashbot for help.
[00:22:29] stashbot: hello
[00:22:30] See https://wikitech.wikimedia.org/wiki/Tool:Stashbot for help.
[00:22:34] :)
[00:22:36] arseny92: My Lua skills are also subpar
[00:22:51] arseny92: sorry, ditto, i have no clue about Module:Deployment
[00:23:36] arseny92: would that mean jouncebot knows if it's a holiday?
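The jstat/PrintFlagsFinal commands Dzahn quotes above are the standard way to eyeball JVM heap sizing while digging into the Gerrit GC question (T148478). A minimal sketch of the same inspection done programmatically, assuming the JDK 8 `jstat -gc` column layout; the PID 71966 is the Gerrit process quoted in the review comments:

```
import subprocess

def heap_usage(pid):
    """Summarize young/old gen usage from `jstat -gc <pid>` (JDK 8 columns, values in KB)."""
    out = subprocess.check_output(['jstat', '-gc', str(pid)],
                                  universal_newlines=True)
    header, values = out.splitlines()[:2]
    stats = dict(zip(header.split(), map(float, values.split())))
    young_used = stats['S0U'] + stats['S1U'] + stats['EU']  # survivor spaces + eden
    young_cap = stats['S0C'] + stats['S1C'] + stats['EC']
    return {'young_kb': (young_used, young_cap),
            'old_kb': (stats['OU'], stats['OC']),
            'full_gcs': int(stats['FGC'])}

print(heap_usage(71966))  # 71966 = the Gerrit PID from the review comment above
```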
[00:24:08] (03PS1) 10Filippo Giunchedi: prometheus: move get_clusters to varnish_config.erb [puppet] - 10https://gerrit.wikimedia.org/r/318472
[00:24:10] 06Operations, 10Gerrit, 06Release-Engineering-Team, 13Patch-For-Review: Investigate why gerrit slowed down on 17/10/2016 / 18/10/2016 / 21/10/2016 - https://phabricator.wikimedia.org/T148478#2750748 (10Paladox) I've been talking to @dzahn about this. What we're thinking of is testing the different gc availa...
[00:24:56] mutante it's Halloween on Monday too here :)
[00:25:14] (03CR) 10Filippo Giunchedi: [C: 032] prometheus: move get_clusters to varnish_config.erb [puppet] - 10https://gerrit.wikimedia.org/r/318472 (owner: 10Filippo Giunchedi)
[00:27:49] RECOVERY - puppet last run on prometheus2001 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[00:30:16] mutante, yes, if that (https://phabricator.wikimedia.org/diffusion/GJOU/repository/master/) is also updated ;)
[00:31:06] (03PS1) 10Filippo Giunchedi: prometheus: fix ganglia varnish cluster name [puppet] - 10https://gerrit.wikimedia.org/r/318474
[00:31:13] paladox, there is a nickserv ghost command
[00:32:32] (03CR) 10Filippo Giunchedi: [C: 032] prometheus: fix ganglia varnish cluster name [puppet] - 10https://gerrit.wikimedia.org/r/318474 (owner: 10Filippo Giunchedi)
[00:32:38] 06Operations, 10Gerrit, 06Release-Engineering-Team, 13Patch-For-Review: Investigate why gerrit slowed down on 17/10/2016 / 18/10/2016 / 21/10/2016 - https://phabricator.wikimedia.org/T148478#2750773 (10Dzahn) I read some of http://blog.takipi.com/garbage-collectors-serial-vs-parallel-vs-cms-vs-the-g1-and-w...
[00:32:54] arseny92 what command is that?
[00:33:15] paladox_: release
[00:33:28] arseny92: it does sound nice to have, yea
[00:33:50] Oh
[00:34:02] re: gerrit and garbage collector .. hmm https://phabricator.wikimedia.org/T148478#2750773
[00:34:03] Did anyone see the password
[00:34:05] LOL
[00:34:06] oh
[00:34:10] paladox_: hunter2
[00:34:13] No
[00:34:20] lol
[00:34:58] mutante i am now paladox
[00:35:18] paladox: ok, i just commented on the ticket
[00:35:24] Thanks
[00:35:28] Yep i saw :)
[00:35:28] to summarize our talk
[00:35:41] Yep, you explained it very well
[00:35:43] :)
[00:39:19] bd808, where's the ACL for stashbot?
[00:40:28] (03PS1) 10Filippo Giunchedi: prometheus: change file_sd_config syntax after upgrade [puppet] - 10https://gerrit.wikimedia.org/r/318476
[00:40:39] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0]
[00:42:06] (03CR) 10Filippo Giunchedi: [C: 032] prometheus: change file_sd_config syntax after upgrade [puppet] - 10https://gerrit.wikimedia.org/r/318476 (owner: 10Filippo Giunchedi)
[00:44:57] Krenair: it's in the bot's config. I'll add you as maintainer
[00:45:35] I can sudo as stuff so that's not an issue
[00:45:38] mutante, https://wikitech.wikimedia.org/wiki/Module:Deployment_schedule , sounds like if one adds correctly somewhere near if ( cutcday ~= os.date( '%x', utc ) ) then something like if when = %Y1031 then that gets added to the date table header
[00:45:42] I'd prefer it be public for everyone to see if possible
[00:46:36] *nod* right now it's mixed in with secrets, but I've been thinking about ways to fix that
[00:47:05] I guess I could have the bot publish the ACL list to wikitech itself
[00:47:09] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0]
[00:47:59] btw, is there a reason I have to 'sudo sudo -u -i' in tool labs instead of just one sudo?
[00:49:10] hmmm... not that I know of. I usually just `sudo become $TOOL`
[00:49:19] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [50.0]
[00:49:47] that is itself doing a double sudo I guess
[00:49:50] I suppose that works
[00:49:56] yes
[00:51:35] (03CR) 10Paladox: [C: 031] add mapped IPv6 address for contint1001 [puppet] - 10https://gerrit.wikimedia.org/r/316040 (owner: 10Dzahn)
[00:51:47] (03CR) 10Paladox: add mapped IPv6 address for contint1001 [puppet] - 10https://gerrit.wikimedia.org/r/316040 (owner: 10Dzahn)
[00:51:51] ^ who has lua skills? I don't want to break the calendar messing with the module :p
[00:55:18] the idea is to have text showing in the table date header if the calculated date turns out to be one of the preset special dates
[00:57:55] i.e. append [[w:Wikipedia:Tim Starling Day|Tim Starling Day]] after the "Monday, October 31" header
[00:59:13] TimStarling ^^
[01:02:33] (03CR) 10Dereckson: Create patroller usergroup for enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/317824 (https://phabricator.wikimedia.org/T149019) (owner: 10Cenarium)
[01:04:29] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0]
[01:06:36] my lua skills are pretty rusty
[01:07:54] sounds like you need to just use the "Preview page with this template" feature
[01:12:36] (03CR) 10Dzahn: "> I am not quite sure how ferm/iptables firewall out IPv6 requests from the public internet. Maybe it properly drop them" [puppet] - 10https://gerrit.wikimedia.org/r/316040 (owner: 10Dzahn)
[01:14:14] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0]
[01:17:29] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [50.0]
[01:31:39] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0]
[01:39:09] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0]
[01:49:09] PROBLEM - Freshness of OCSP Stapling files on cp3048 is CRITICAL: CRITICAL: File /var/cache/ocsp/unified.ocsp is more than 18300 secs old!
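arseny92's Module:Deployment_schedule idea above boils down to a date-keyed lookup appended to the computed header. The module itself is Lua, but the logic is easy to sketch in Python; the HOLIDAYS table and function name here are hypothetical illustrations, not the module's actual API:

```
import datetime

# Hypothetical holiday table, keyed by month+day as arseny92 suggests
# with the "%Y1031"-style comparison above.
HOLIDAYS = {
    '1031': '[[w:Wikipedia:Tim Starling Day|Tim Starling Day]]',
}

def render_header(when):
    """when: datetime.date of the schedule row being rendered."""
    header = when.strftime('%A, %B %d')
    label = HOLIDAYS.get(when.strftime('%m%d'))
    return '%s (%s)' % (header, label) if label else header

print(render_header(datetime.date(2016, 10, 31)))
# Monday, October 31 ([[w:Wikipedia:Tim Starling Day|Tim Starling Day]])
```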
[01:52:41] 06Operations, 10ops-eqiad: decom palladium (datacenter) - https://phabricator.wikimedia.org/T149395#2750912 (10Dzahn)
[01:52:58] 06Operations, 10ops-eqiad: decom palladium (datacenter) - https://phabricator.wikimedia.org/T149395#2750927 (10Dzahn)
[01:53:33] 06Operations, 10ops-eqiad: decom palladium (datacenter) - https://phabricator.wikimedia.org/T149395#2750912 (10Dzahn)
[01:53:35] 06Operations, 10netops, 05Goal: Decomission palladium - https://phabricator.wikimedia.org/T147320#2688855 (10Dzahn)
[01:53:44] 06Operations, 10ops-eqiad: decom palladium (datacenter) - https://phabricator.wikimedia.org/T149395#2750912 (10Dzahn) p:05Triage>03Normal
[01:54:47] 06Operations, 10netops, 05Goal: Decomission palladium - https://phabricator.wikimedia.org/T147320#2688855 (10Dzahn) It was still in Icinga because puppet was disabled on einsteinium temp. It's gone now. done. resolving. T149395 is for datacenter follow-up work.
[01:55:00] 06Operations, 05Goal: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#2750937 (10Dzahn)
[01:55:02] 06Operations, 10netops, 05Goal: Decomission palladium - https://phabricator.wikimedia.org/T147320#2750936 (10Dzahn) 05Open>03Resolved
[02:20:39] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [50.0]
[02:27:19] !log l10nupdate@tin scap sync-l10n completed (1.28.0-wmf.23) (duration: 09m 07s)
[02:27:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:32:31] !log l10nupdate@tin ResourceLoader cache refresh completed at Fri Oct 28 02:32:31 UTC 2016 (duration 5m 12s)
[02:32:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:36:59] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0]
[02:43:29] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [50.0]
[02:45:29] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[02:50:49] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0]
[02:58:19] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0]
[03:00:50] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[03:04:51] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [50.0]
[03:13:09] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [50.0]
[03:23:30] (03PS6) 10BBlack: nginx (1.11.4-1+wmf9) jessie-wikimedia; urgency=medium [software/nginx] (wmf-1.11.4) - 10https://gerrit.wikimedia.org/r/318324
[03:23:32] (03PS2) 10BBlack: no OpenSSL buffer release [software/nginx] (wmf-1.11.4) - 10https://gerrit.wikimedia.org/r/318431
[03:24:39] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[03:27:49] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 723.86 seconds
[03:36:29] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 226.10 seconds
[03:43:59] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[03:50:29] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0]
[03:52:39] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[03:54:22] (03PS7) 10BBlack: nginx (1.11.4-1+wmf10) jessie-wikimedia; urgency=medium [software/nginx] (wmf-1.11.4) - 10https://gerrit.wikimedia.org/r/318324
[03:54:23] (03PS3) 10BBlack: no OpenSSL readahead [software/nginx] (wmf-1.11.4) - 10https://gerrit.wikimedia.org/r/318431
[03:55:49] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [50.0]
[04:02:09] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0]
[04:04:29] hmm
[04:04:31] opensearch?
[04:05:57] T149400
[04:05:58] T149400: ApiQuerySearch.php: Call to a member function getTotalHits() on a non-object (boolean) - https://phabricator.wikimedia.org/T149400
[04:09:39] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [50.0]
[04:11:49] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [50.0]
[04:13:58] T1
[04:13:59] T1: Get puppet runs into logstash - https://phabricator.wikimedia.org/T1
[04:14:36] mutante: T2001
[04:14:36] T2001: Documentation is out of date, incomplete (tracking) - https://phabricator.wikimedia.org/T2001
[04:14:45] bd808: that's the one :)
[04:14:59] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0]
[04:15:09] opens ticket to move T2001 to T1
[04:15:14] jk
[04:15:31] it was a bit goofy that we didn't
[04:15:58] * bd808 "owns" a lot of low number T* from early phab testing
[04:16:13] mutante bugzilla ids are incremented by 2k
[04:16:15] that's like Q numbers when wikidata just started
[04:17:59] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0]
[04:19:59] arseny92: yep, i remember. thanks. a fixed offset is better than random for sure, i'd kind of like one for the RT tickets too, but the advanced search with "reference" field works again, so it doesn't matter much
[04:20:09] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [50.0]
[04:22:07] 06Operations, 06Discovery-Search (Current work): Followup on elastic1026 blowing up May 9, 21:43-22:14 UTC - https://phabricator.wikimedia.org/T134829#2751078 (10Deskana)
[04:22:09] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [50.0]
[04:22:09] 06Operations, 06Discovery-Search (Current work), 13Patch-For-Review, 07Wikimedia-Incident: Enable GC (garbage collection) logs on Elasticsearch JVM - https://phabricator.wikimedia.org/T134853#2751076 (10Deskana) 05Open>03Resolved I will assume that @gehel's patch resolves this issue. Please reopen if t...
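The offset arseny92 mentions is why T1 was free for new use: Bugzilla bug numbers were carried over to Phabricator shifted by a fixed 2000, so old bug N became task T(N+2000). Trivial to express, purely illustrative:

```
def bugzilla_to_phab(bug_id):
    """Map a legacy Bugzilla bug id to its migrated Phabricator task id."""
    return 'T%d' % (bug_id + 2000)

assert bugzilla_to_phab(1) == 'T2001'  # matches mutante's T2001 example above
```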
[04:23:38] stashbot: rSVN18621
[04:23:38] See https://wikitech.wikimedia.org/wiki/Tool:Stashbot for help.
[04:23:46] 06Operations, 06Discovery, 10hardware-requests, 06Discovery-Search (Current work): Estimate hardware requirements for ordering new servers for Elasticsearch - https://phabricator.wikimedia.org/T148559#2751079 (10Deskana) I believe that this task to estimate the hardware is complete on Discovery's end. Howe...
[04:25:12] mutante: hmmm.. that would be interesting to support.
[04:25:43] :) can i search for commits ?
[04:25:50] P110
[04:26:18] reads about Tool:Bash
[04:27:01] i remember saving those quips from Bugzilla :)
[04:28:50] bd808: does the bot output random quips from https://tools.wmflabs.org/bash/random ?
[04:29:19] it doesn't, but PRs are welcome
[04:29:35] it might annoy the crap out of people though
[04:30:51] probably, yea. i was just reading that stashbot processes messages for Tool:Bash and Tool:SAL
[04:30:52] it doesn't do paste lookups anymore because the bot is in #wikidata and it was making them crazy with off content notices
[04:31:16] so it's storing data for it but doesn't output them
[04:31:20] It will give you the direct link if you !bash some new quip
[04:31:49] bd808: gotcha :) thanks
[04:31:49] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0]
[04:36:59] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [50.0]
[04:44:09] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0]
[04:44:39] RECOVERY - Freshness of OCSP Stapling files on cp3048 is OK: OK
[05:22:39] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0]
[05:24:49] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [50.0]
[05:38:30] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [50.0]
[05:42:39] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [50.0]
[05:54:09] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0]
[06:05:39] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0]
[06:19:39] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0]
[06:21:49] (03PS1) 10Brian Wolff: Expand Content-Security-Policy on upload test to fr. [puppet] - 10https://gerrit.wikimedia.org/r/318490 (https://phabricator.wikimedia.org/T117618)
[06:23:49] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0]
[06:28:50] PROBLEM - puppet last run on einsteinium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:36:11] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [50.0]
[06:47:39] PROBLEM - puppet last run on analytics1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[moreutils]
[06:48:20] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0]
[06:55:59] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [50.0]
[06:57:49] RECOVERY - puppet last run on einsteinium is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[06:59:59] (03PS2) 10Muehlenhoff: hue_server: Restrict to production networks [puppet] - 10https://gerrit.wikimedia.org/r/316363
[07:03:29] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[07:07:49] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0]
[07:11:57] <_joe_> so I guess in the last two hours nobody checked these errors
[07:12:09] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0]
[07:12:37] <_joe_> Count
[07:12:39] <_joe_> Warning: Invalid operand type was used: expecting an array in /srv/mediawiki/php-1.28.0-wmf.23/extensions/CirrusSearch/includes/Searcher.php on line 588
[07:12:56] <_joe_> that's what's causing all those errors
[07:14:33] <_joe_> dcausse: ^^
[07:15:10] RECOVERY - puppet last run on analytics1003 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[07:15:19] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [50.0]
[07:17:28] !log installing PHP security updates on jessie
[07:17:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:28:18] (03CR) 10Giuseppe Lavagetto: [C: 031] hieradata: add swift user for docker_registry [puppet] - 10https://gerrit.wikimedia.org/r/318148 (https://phabricator.wikimedia.org/T149098) (owner: 10Filippo Giunchedi)
[07:39:19] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [50.0]
[07:43:59] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[07:45:01] 06Operations, 10ops-codfw, 10DBA: db2011 disk media errors - https://phabricator.wikimedia.org/T149099#2751204 (10Marostegui) So I would like to get another pair of eyes here, as if this goes wrong, we might need to rebuild the whole server :-( There are currently 3 new disks there that were not included in...
[07:48:07] 06Operations, 10ops-codfw, 10DBA: db2011 disk media errors - https://phabricator.wikimedia.org/T149099#2751208 (10jcrespo) Looks good, maybe rebuilding one at a time, to avoid IO exhaustion?
[07:48:09] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0]
[07:48:55] 06Operations, 10ops-codfw, 10DBA: db2011 disk media errors - https://phabricator.wikimedia.org/T149099#2751209 (10Marostegui) Yeah - as I said, I would only add (and rebuild) one at a time.
[07:52:19] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0]
[07:52:52] (03CR) 10Muehlenhoff: [C: 032] hue_server: Restrict to production networks [puppet] - 10https://gerrit.wikimedia.org/r/316363 (owner: 10Muehlenhoff)
[07:53:20] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[07:55:57] (03PS2) 10Muehlenhoff: Also provide imagemagick wrapper in openstack::nova::manager [puppet] - 10https://gerrit.wikimedia.org/r/316545 (https://phabricator.wikimedia.org/T145811)
[07:56:30] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0]
[07:58:04] 06Operations, 10ops-codfw, 10DBA: db2011 disk media errors - https://phabricator.wikimedia.org/T149099#2751213 (10jcrespo) Sorry, I overlooked that and looked only at the commands.
[07:59:40] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [50.0]
[08:01:30] !log applying schema change (imagelinks) to s3 wikis T139090
[08:01:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:01:36] T139090: Deploy I2b042685 to all databases - https://phabricator.wikimedia.org/T139090
[08:02:02] I am starting to notice dbstore2001 improvements - alter tables now fly
[08:05:09] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[08:07:43] 06Operations, 10ops-codfw, 10DBA: db2011 disk media errors - https://phabricator.wikimedia.org/T149099#2751253 (10Marostegui) No worries! Better be safe than sorry :)
[08:08:19] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [50.0]
[08:08:47] jynus: Interesting, because it is also running a massive alter on pagelinks :p
[08:09:38] on pagelinks?
[08:10:02] Yeah, compression :)
[08:10:08] I assume not on s3?
[08:10:12] I have finished compressing all the tables in enwiki
[08:10:13] No, s1
[08:10:19] Only pagelinks and revision are left
[08:10:26] marostegui, all the tables?
[08:10:28] I had to adjust innodb_online_alter_log_max_size
[08:10:59] jynus: yes, except pagelinks and revision, which are coming next
[08:11:01] what is your projected size for /enwiki ?
[08:11:17] Right now it is 876G but I am hoping to make it around 600G
[08:11:37] Because of compression+defragmentation on those two massive tables
[08:12:38] compare when done with db1073
[08:12:49] PROBLEM - Freshness of OCSP Stapling files on cp3036 is CRITICAL: CRITICAL: File /var/cache/ocsp/unified.ocsp is more than 18300 secs old!
[08:12:59] Oh nice 568G :)
[08:13:03] there the total size, including logs etc., is 600GB
[08:13:09] Not too far from my predictions
[08:13:16] but I only compressed the top tables
[08:13:40] maybe it is not worth it for small tables
[08:14:23] Yeah, the small tables went from maybe 4G to 1G (and part of that is probably blank space after the table rebuild) or so, but I wanted to see how much the whole dataset compresses
[08:14:27] We will see
[08:14:32] 4->1 is ok
[08:14:34] it adds up
[08:14:49] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[08:14:50] I am mostly thinking it's not worth the time compressing them
[08:16:33] They were pretty fast, but yes, we can do the math
[08:16:39] I have all the times :)
[08:19:09] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0]
[08:25:39] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [50.0]
[08:40:01] (03PS2) 10Muehlenhoff: oozie: Restrict to analytics networks [puppet] - 10https://gerrit.wikimedia.org/r/316359
[08:40:59] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [50.0]
[08:41:39] (03CR) 10Elukey: [C: 031] oozie: Restrict to analytics networks [puppet] - 10https://gerrit.wikimedia.org/r/316359 (owner: 10Muehlenhoff)
[08:43:07] (03CR) 10Ema: [C: 032 V: 032] "Ottomata agreed with the liquidprompt change. Added a comment to test.sh pointing to T95064." [puppet] - 10https://gerrit.wikimedia.org/r/316529 (https://phabricator.wikimedia.org/T95064) (owner: 10Ema)
[08:43:14] (03PS4) 10Ema: Fix bashisms [puppet] - 10https://gerrit.wikimedia.org/r/316529 (https://phabricator.wikimedia.org/T95064)
[08:43:17] (03CR) 10Ema: [V: 032] Fix bashisms [puppet] - 10https://gerrit.wikimedia.org/r/316529 (https://phabricator.wikimedia.org/T95064) (owner: 10Ema)
[08:48:49] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0]
[08:55:41] (03CR) 10Elukey: [C: 031] "Summary of a conversation on IRC with Ema:" [puppet] - 10https://gerrit.wikimedia.org/r/318314 (owner: 10Ema)
[09:00:50] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0]
[09:06:34] !log Deploying schema change s1.enwiki - only codfw - T147166
[09:06:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:06:41] T147166: Apply change_tag and tag_summary primary key schema change to Wikimedia wikis - https://phabricator.wikimedia.org/T147166
[09:09:34] 06Operations, 10ChangeProp, 10MediaWiki-JobQueue, 06Performance-Team, and 2 others: Asynchronous processing in production: one queue to rule them all - https://phabricator.wikimedia.org/T149408#2751354 (10Joe)
[09:10:41] !log rebooting pool counters in codfw for kernel update
[09:10:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:14:25] (03PS2) 10Ema: varnishapi.py: import latest upstream version [puppet] - 10https://gerrit.wikimedia.org/r/318314
[09:14:34] (03CR) 10Ema: [C: 032 V: 032] varnishapi.py: import latest upstream version [puppet] - 10https://gerrit.wikimedia.org/r/318314 (owner: 10Ema)
[09:17:07] 06Operations, 06Services (next), 15User-mobrovac: Investigate better protection modes for electron render service (xvfb setuid) - https://phabricator.wikimedia.org/T143336#2751363 (10MoritzMuehlenhoff) FWIW, I'm currently in the process of packaging firejail 0.9.44 for jessie and trusty, there have been quit...
[09:18:28] one day I will have to speed up those tests
[09:19:27] dcausse: ok deploying on mw1099
[09:19:38] ok
[09:20:31] !log Pulling CirrusSearch patch https://gerrit.wikimedia.org/r/#/c/318505/ on mw1099 for T149254 (fix log spam/fatal/warnings)
[09:20:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:20:36] T149254: [bug] cirrussearch fatal in 1.28.0-wmf.23 - - https://phabricator.wikimedia.org/T149254
[09:20:48] done
[09:20:49] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0]
[09:20:52] (03PS4) 10Giuseppe Lavagetto: Initial debianization [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/288196 (https://phabricator.wikimedia.org/T132317)
[09:21:31] https://en.wikipedia.org/wiki/Special:ApiSandbox#action=query&format=json&generator=search&gsrsearch=test&gsrlimit=100&gsroffset=30000 is fixed on mw1099 !
[09:21:41] * hashar digs in logstash for host:mw1099
[09:22:16] dcausse: looks all good to me
[09:22:22] hashar: me too
[09:23:17] unleashing
[09:24:08] !log hashar@tin Synchronized /srv/mediawiki-staging/php-1.28.0-wmf.23/extensions/CirrusSearch: https://gerrit.wikimedia.org/r/#/c/318505/ for T149254 (fix log spam/fatal/warnings) (duration: 00m 56s)
[09:24:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:26:19] godog: so where should that tool script live for the simple-json-datasource? (where in puppet?)
[09:27:11] 271 Invalid operand type was used: expecting an array in /srv/mediawiki/php-1.28.0-wmf.23/extensions/CirrusSearch/includes/Searcher.php on line 588
[09:27:16] that one is dropping in fatalmonitor
[09:27:21] so I guess it is fixed :D
[09:27:45] hashar, dcausse: Thanks a bunch!
[09:27:48] hashar: I owe you many beers!
[09:28:26] well
[09:28:30] I am just being pedantic
[09:28:36] and pushing buttons !
[09:28:46] :)
[09:29:10] looks like it is gone now
[09:29:30] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[09:29:49] _joe_: alarm is gone :]
[09:32:10] (03PS1) 10Muehlenhoff: Temporarily disable poolcounter1001 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318509
[09:34:04] <_joe_> :))
[09:34:09] <_joe_> thanks everyone
[09:36:20] Hi! Running a script on the labs instance dwl I get the error '35 SSL connect error. The SSL handshaking failed.' about once an hour. What could be the reason?
[09:39:57] !log stopping slave on mariadb labsdb1005 for labsdb1004 reimporting
[09:40:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:48:53] (03CR) 10ArielGlenn: "Do we even use this anywhere? Probably worth asking Jaime, see https://wikitech.wikimedia.org/wiki/Database_snapshots which is quite old." [puppet] - 10https://gerrit.wikimedia.org/r/318450 (owner: 10Dzahn)
[09:57:47] (03CR) 10ArielGlenn: "I'd like to do a little less trial and error and a little more analysis first. If we understand what is happening with the younggen/oldge" [puppet] - 10https://gerrit.wikimedia.org/r/316983 (https://phabricator.wikimedia.org/T148478) (owner: 10Paladox)
[09:59:39] PROBLEM - thumbor@8836 service on thumbor1001 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8836 is inactive
[10:00:01] !log migrating nodes from ganeti1004 for kernel reboot
[10:00:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:00:47] (03PS1) 10Gehel: maps / postgresql: use replication user for monitoring [puppet] - 10https://gerrit.wikimedia.org/r/318511 (https://phabricator.wikimedia.org/T147194)
[10:04:15] (03PS2) 10Gehel: maps / postgresql: use replication user for monitoring [puppet] - 10https://gerrit.wikimedia.org/r/318511 (https://phabricator.wikimedia.org/T147194)
[10:08:09] RECOVERY - thumbor@8836 service on thumbor1001 is OK: OK - thumbor@8836 is active
[10:09:04] 06Operations, 10Gerrit, 06Release-Engineering-Team, 13Patch-For-Review: Investigate why gerrit slowed down on 17/10/2016 / 18/10/2016 / 21/10/2016 - https://phabricator.wikimedia.org/T148478#2751520 (10ArielGlenn) Let's look at the logs (I'm doing so). We still don't have logs for a slowdown event, but I...
[10:13:14] (03PS3) 10Gehel: maps / postgresql: use replication user for monitoring [puppet] - 10https://gerrit.wikimedia.org/r/318511 (https://phabricator.wikimedia.org/T147194)
[10:14:11] (03PS1) 10Ema: varnishlog.py: remove trailing NULL byte only if present [puppet] - 10https://gerrit.wikimedia.org/r/318514
[10:16:41] (03CR) 10Giuseppe Lavagetto: [C: 04-2] "We don't have a load balancer for "low-traffic" in esams, this won't work." [puppet] - 10https://gerrit.wikimedia.org/r/318145 (https://phabricator.wikimedia.org/T149098) (owner: 10Filippo Giunchedi)
[10:18:25] (03CR) 10Volans: [C: 031] "Looks sane to me" [puppet] - 10https://gerrit.wikimedia.org/r/318514 (owner: 10Ema)
[10:19:06] (03PS1) 10Cenarium: Remove patrol from autoconfirmed and reviewer for enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318515
[10:20:28] (03PS2) 10Cenarium: Remove patrol from autoconfirmed and reviewer for enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318515 (https://phabricator.wikimedia.org/T149019)
[10:20:38] (03CR) 10Ema: [C: 032 V: 032] varnishlog.py: remove trailing NULL byte only if present [puppet] - 10https://gerrit.wikimedia.org/r/318514 (owner: 10Ema)
[10:22:15] (03PS1) 10Gehel: fixed default configuration for maps / postgresql [labs/private] - 10https://gerrit.wikimedia.org/r/318516
[10:22:58] (03CR) 10Giuseppe Lavagetto: [C: 032] Add x-default-query functionality [software/service-checker] - 10https://gerrit.wikimedia.org/r/308020 (owner: 10Legoktm)
[10:23:31] (03CR) 10Giuseppe Lavagetto: [C: 032] Add .gitreview [software/service-checker] - 10https://gerrit.wikimedia.org/r/316484 (owner: 10Legoktm)
[10:25:59] !log upgrading python-varnishapi to v50.18 on all v4 cache hosts
[10:26:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:28:21] !log migrating nodes from ganeti1003 for kernel reboot
[10:28:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:39:28] (03CR) 10Gehel: [C: 032 V: 032] fixed default configuration for maps / postgresql [labs/private] - 10https://gerrit.wikimedia.org/r/318516 (owner: 10Gehel)
[10:41:49] PROBLEM - MariaDB Slave Lag: s1 on db2042 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 671.29 seconds
[10:42:17] Ignore that, looks like an icinga timeout when I was trying to silence it
[10:44:09] RECOVERY - MariaDB Slave Lag: s1 on db2042 is OK: OK slave_sql_lag Replication lag: 0.37 seconds
[10:46:36] !log migrating nodes from ganeti1003 for kernel reboot
[10:46:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:46:52] !log migrating nodes from ganeti1002 for kernel reboot (earlier entry was a typo)
[10:46:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:54:44] (03PS1) 10Giuseppe Lavagetto: Release 0.0.2 [software/service-checker] - 10https://gerrit.wikimedia.org/r/318517
[11:07:14] (03PS1) 10Bartosz Dziewoński: Verify license tags for custom license in Commons' UploadWizard [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318518 (https://phabricator.wikimedia.org/T140903)
[11:08:19] RECOVERY - Freshness of OCSP Stapling files on cp3036 is OK: OK
[11:08:34] (03PS1) 10Ema: cache_text varnishtest: set X-Carrier based on XCIP [puppet] - 10https://gerrit.wikimedia.org/r/318519 (https://phabricator.wikimedia.org/T131503)
[11:15:18] (03PS1) 10Jcrespo: labsdb-toolsdb: Cleaning up tls certificates [puppet] - 10https://gerrit.wikimedia.org/r/318520 (https://phabricator.wikimedia.org/T123731)
[11:30:07] (03PS2) 10Jcrespo: labsdb-toolsdb: Cleaning up tls certificates [puppet] - 10https://gerrit.wikimedia.org/r/318520 (https://phabricator.wikimedia.org/T123731)
[11:36:33] (03PS8) 10BBlack: nginx (1.11.4-1+wmf11) jessie-wikimedia; urgency=medium [software/nginx] (wmf-1.11.4) - 10https://gerrit.wikimedia.org/r/318324
[11:36:35] (03PS4) 10BBlack: no OpenSSL readahead [software/nginx] (wmf-1.11.4) - 10https://gerrit.wikimedia.org/r/318431
[11:46:22] (03PS9) 10BBlack: nginx (1.11.4-1+wmf11) jessie-wikimedia; urgency=medium [software/nginx] (wmf-1.11.4) - 10https://gerrit.wikimedia.org/r/318324
[11:46:24] (03PS5) 10BBlack: no OpenSSL readahead [software/nginx] (wmf-1.11.4) - 10https://gerrit.wikimedia.org/r/318431
[11:46:26] (03PS1) 10BBlack: update cloudflare dynamic record size patch for 1.11.5+ [software/nginx] (wmf-1.11.4) - 10https://gerrit.wikimedia.org/r/318522
[11:48:27] (03Abandoned) 10Hashar: zuul.eqiad.wmnet is no more of any use [dns] - 10https://gerrit.wikimedia.org/r/293288 (https://phabricator.wikimedia.org/T137265) (owner: 10Hashar)
[11:48:51] (03CR) 10Hashar: "Let's keep the zuul.eqiad.wmnet DNS entry and indeed switch it to the new host. Thanks Daniel!" [dns] - 10https://gerrit.wikimedia.org/r/318249 (https://phabricator.wikimedia.org/T95757) (owner: 10Dzahn)
[11:49:05] (03CR) 10Gehel: [C: 032] maps / postgresql: use replication user for monitoring [puppet] - 10https://gerrit.wikimedia.org/r/318511 (https://phabricator.wikimedia.org/T147194) (owner: 10Gehel)
[11:49:11] (03PS4) 10Gehel: maps / postgresql: use replication user for monitoring [puppet] - 10https://gerrit.wikimedia.org/r/318511 (https://phabricator.wikimedia.org/T147194)
[11:49:14] (03CR) 10Gehel: [V: 032] maps / postgresql: use replication user for monitoring [puppet] - 10https://gerrit.wikimedia.org/r/318511 (https://phabricator.wikimedia.org/T147194) (owner: 10Gehel)
[11:50:31] (03CR) 10Hashar: [C: 031] "Ok that makes sense :]" [puppet] - 10https://gerrit.wikimedia.org/r/316040 (owner: 10Dzahn)
[11:56:39] RECOVERY - Postgres Replication Lag on maps2002 is OK: OK - Rep Delay is: 0.0 Seconds
[12:06:11] (03PS2) 10BBlack: fixup debian perl buildflags patch [software/nginx] (wmf-1.11.4) - 10https://gerrit.wikimedia.org/r/318328
[12:06:13] (03PS2) 10BBlack: update cloudflare dynamic record size patch for 1.11.5+ [software/nginx] (wmf-1.11.4) - 10https://gerrit.wikimedia.org/r/318522
[12:06:15] (03PS10) 10BBlack: nginx (1.11.4-1+wmf11) jessie-wikimedia; urgency=medium [software/nginx] (wmf-1.11.4) - 10https://gerrit.wikimedia.org/r/318324
[12:06:17] (03PS6) 10BBlack: no OpenSSL readahead [software/nginx] (wmf-1.11.4) - 10https://gerrit.wikimedia.org/r/318431
[12:06:19] (03PS3) 10BBlack: update stapling proxy and ecdhe curve logging patches [software/nginx] (wmf-1.11.4) - 10https://gerrit.wikimedia.org/r/318327
[12:23:30] (03PS1) 10Muehlenhoff: Check whether ferm has been correctly started [puppet] - 10https://gerrit.wikimedia.org/r/318527 (https://phabricator.wikimedia.org/T148986)
[12:39:03] (03PS3) 10Jcrespo: labsdb-toolsdb: Cleaning up tls certificates [puppet] - 10https://gerrit.wikimedia.org/r/318520 (https://phabricator.wikimedia.org/T123731)
[12:40:58] (03CR) 10Jcrespo: [C: 032] labsdb-toolsdb: Cleaning up tls certificates [puppet] - 10https://gerrit.wikimedia.org/r/318520 (https://phabricator.wikimedia.org/T123731) (owner: 10Jcrespo)
[12:42:31] !log restarting and upgrading labsdb1004
[12:42:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:51:33] PROBLEM - mysqld processes on labsdb1004 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[12:52:09] PROBLEM - DPKG on labsdb1004 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[12:52:16] jynus: ^ that is you right?
[12:52:21] hm that paged
[12:52:25] yes, but I acked that
[12:52:29] downtimed it
[12:52:31] ok
[12:52:41] apparently, I didn't
[13:01:17] (03PS3) 10Elukey: oozie: Restrict to analytics networks [puppet] - 10https://gerrit.wikimedia.org/r/316359 (owner: 10Muehlenhoff)
[13:01:47] (03PS1) 10Thiemo Mättig (WMDE): Set enableLuaEntityFormatStatements for Wikidata Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318534
[13:03:33] (03CR) 10Elukey: [C: 032] oozie: Restrict to analytics networks [puppet] - 10https://gerrit.wikimedia.org/r/316359 (owner: 10Muehlenhoff)
[13:10:39] PROBLEM - NTP on seaborgium is CRITICAL: NTP CRITICAL: Offset unknown
[13:14:37] ^ the ntp alert should sort itself out, it's ganeti-related
[13:17:49] RECOVERY - NTP on seaborgium is OK: NTP OK: Offset -0.005016922951 secs
[13:27:40] 06Operations, 07HHVM: Long running mediawiki web requests impacts service availability, specially databases - https://phabricator.wikimedia.org/T149421#2751836 (10jcrespo)
[13:28:06] 06Operations, 07HHVM: Long running mediawiki web requests impacts service availability, specially databases - https://phabricator.wikimedia.org/T149421#2751856 (10jcrespo)
[13:29:33] 06Operations, 07HHVM: Long running mediawiki web requests impacts service availability, specially databases - https://phabricator.wikimedia.org/T149421#2751836 (10jcrespo)
[13:35:54] (03CR) 10Andrew Bogott: [C: 031] Also provide imagemagick wrapper in openstack::nova::manager [puppet] - 10https://gerrit.wikimedia.org/r/316545 (https://phabricator.wikimedia.org/T145811) (owner: 10Muehlenhoff)
[13:40:45] (03CR) 10Hoo man: [C: 032] "Beta only" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318534 (owner: 10Thiemo Mättig (WMDE))
[13:40:45] 06Operations, 10ops-esams: Move cp3030+ from OE14 to OE13 in racktables - https://phabricator.wikimedia.org/T136403#2751890 (10mark) p:05Normal>03Lowest Done. But I'll audit when at esams before resolving.
[13:41:19] (03Merged) 10jenkins-bot: Set enableLuaEntityFormatStatements for Wikidata Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318534 (owner: 10Thiemo Mättig (WMDE))
[13:42:40] !log hoo@tin Synchronized wmf-config/Wikibase-labs.php: For consistency (duration: 00m 46s)
[13:42:41] 06Operations, 10ops-esams, 10Traffic: cp3021 failed disk sdb - https://phabricator.wikimedia.org/T148983#2738539 (10mark) This was me, experimenting with ATA Secure Erase. :) I did log it in SAL.
[13:42:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:47:28] !log uploaded firejail 0.9.44 to carbon
[13:47:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:52:37] (03PS1) 10Jcrespo: labsdb-tools: Puppetize patch skipping replication on heavy hitters [puppet] - 10https://gerrit.wikimedia.org/r/318540
[13:57:33] (03CR) 10Jcrespo: [C: 032] labsdb-tools: Puppetize patch skipping replication on heavy hitters [puppet] - 10https://gerrit.wikimedia.org/r/318540 (owner: 10Jcrespo)
[14:09:23] (03PS1) 10Jcrespo: labsdb-tools: Fix one-entry-per-line bug on wild_ignore_table [puppet] - 10https://gerrit.wikimedia.org/r/318545
[14:11:54] (03PS2) 10Jcrespo: labsdb-tools: Fix one-entry-per-line bug on wild_ignore_table [puppet] - 10https://gerrit.wikimedia.org/r/318545
[14:13:09] (03CR) 10Jcrespo: [C: 032] "@manuel Heads up for a common mistake ^" [puppet] - 10https://gerrit.wikimedia.org/r/318545 (owner: 10Jcrespo)
[14:14:06] (03PS2) 10Ema: VCL: allow to load test versions of netmapper JSON files [puppet] - 10https://gerrit.wikimedia.org/r/318320
[14:14:13] (03CR) 10Ema: [C: 032 V: 032] VCL: allow to load test versions of netmapper JSON files [puppet] - 10https://gerrit.wikimedia.org/r/318320 (owner: 10Ema)
[14:20:12] (03PS1) 10Muehlenhoff: Update to 4.4.28 [debs/linux44] - 10https://gerrit.wikimedia.org/r/318549
[14:22:09] (03PS1) 10Volans: conftool: add --host option [software/conftool] - 10https://gerrit.wikimedia.org/r/318550 (https://phabricator.wikimedia.org/T149213)
[14:26:04] (03CR) 10Volans: "I'll look to add a test for the case in which a host has multiple services too." [software/conftool] - 10https://gerrit.wikimedia.org/r/318550 (https://phabricator.wikimedia.org/T149213) (owner: 10Volans)
[14:30:01] 06Operations, 10Cassandra, 06Services (doing): some cassandra instances restarted with java.lang.OutOfMemoryError: Java heap space - https://phabricator.wikimedia.org/T148516#2752058 (10Eevans) ```lang=shell-session $ cdsh -- "sudo find /srv/cassandra-{a,b,c} -maxdepth 1 -name \"*.hprof\"" restbase1010.eqiad...
[14:30:19] (03CR) 10Hashar: "Fails:" [puppet] - 10https://gerrit.wikimedia.org/r/317985 (owner: 10Hashar)
[14:31:57] 06Operations, 10Cassandra, 06Services (doing): some cassandra instances restarted with java.lang.OutOfMemoryError: Java heap space - https://phabricator.wikimedia.org/T148516#2752060 (10Eevans)
[14:35:01] 06Operations, 10Cassandra, 06Services (doing): some cassandra instances restarted with java.lang.OutOfMemoryError: Java heap space - https://phabricator.wikimedia.org/T148516#2752064 (10Eevans) 05Open>03Resolved >>! In T148516#2742878, @Eevans wrote: > I think this investigation has reached a point where...
[14:38:27] !log various reboots of multatuli for systemd tests
[14:38:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:38:53] (03PS4) 10Hashar: contint: move doxygen/graphviz to labs instances [puppet] - 10https://gerrit.wikimedia.org/r/317985
[14:40:39] (03PS5) 10Hashar: contint: move doxygen/graphviz to labs instances [puppet] - 10https://gerrit.wikimedia.org/r/317985
[14:47:27] (03CR) 10Hashar: [C: 031 V: 031] "Cherry picked on puppet master. All good on Precise/Trusty/Jessie." [puppet] - 10https://gerrit.wikimedia.org/r/317985 (owner: 10Hashar)
[14:53:53] (03CR) 10Hashar: "Bah the noop reported by the puppet compiler is actually a problem! https://puppet-compiler.wmflabs.org/4500/gallium.wikimedia.org/ shows" [puppet] - 10https://gerrit.wikimedia.org/r/317985 (owner: 10Hashar)
[14:54:41] could someone please land the patch above? That is some cleanup to remove doxygen/graphviz from gallium (not needed there anymore)
[14:56:56] !log upgrading openjdk-8/cassandra restart on restbase staging hosts
[14:57:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:00:05] andrewbogott: Dear anthropoid, the time has come. Please deploy Labs/Wikitech/Horizon Maintenance (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161028T1500).
[15:02:06] !log rebooting labnet1001
[15:02:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:20:19] PROBLEM - Start a job and verify on Precise on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/precise - 341 bytes in 0.004 second response time
[15:20:19] PROBLEM - All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 341 bytes in 0.007 second response time
[15:20:19] PROBLEM - All Flannel etcd nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/etcd/flannel - 341 bytes in 0.004 second response time
[15:20:19] PROBLEM - showmount succeeds on a labs instance on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/nfs/showmount - 341 bytes in 0.004 second response time
[15:20:19] PROBLEM - Verify internal DNS from within Tools on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/labs-dns/private - 341 bytes in 0.006 second response time
[15:20:28] you were eager to dive in?
[15:20:33] PROBLEM - toolschecker service itself needs to return OK on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/self - 341 bytes in 0.004 second response time
[15:20:33] PROBLEM - All k8s etcd nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/etcd/k8s - 341 bytes in 0.007 second response time
[15:20:34] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[15:20:49] PROBLEM - Redis set/get on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/redis - 341 bytes in 0.029 second response time
[15:21:59] RECOVERY - nodepoold running on labnodepool1001 is OK: PROCS OK: 1 process with UID = 113 (nodepool), regex args ^/usr/bin/python /usr/bin/nodepoold -d
[15:22:19] PROBLEM - Host tools.wmflabs.org is DOWN: check_ping: Invalid hostname/address - tools.wmflabs.org
[15:22:39] PROBLEM - All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 341 bytes in 0.002 second response time
[15:22:39] PROBLEM - Start a job and verify on Precise on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/precise - 341 bytes in 0.002 second response time
[15:22:39] PROBLEM - All Flannel etcd nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/etcd/flannel - 341 bytes in 0.001 second response time
[15:22:39] PROBLEM - showmount succeeds on a labs instance on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/nfs/showmount - 341 bytes in 0.001 second response time
[15:22:39] RECOVERY - Host tools.wmflabs.org is UP: PING OK - Packet loss = 0%, RTA = 0.85 ms
[15:23:06] !log Restarted nodepool
[15:23:09] PROBLEM - Host labs-ns1.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100%
[15:23:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:23:32] 06Operations, 06Labs, 07Tracking: Sync data for tools-project from labstore1001 to labstore1004/5 - https://phabricator.wikimedia.org/T144255#2752201 (10chasemp) @madhuvishy thoughts on truncating the disposal >10G files and kicking off an update of rsync over the weekend w/ the three largest excluded for no...
[15:23:50] PROBLEM - Verify internal DNS from within Tools on checker.tools.wmflabs.org is CRITICAL: Name or service not known
[15:24:39] PROBLEM - Host tools.wmflabs.org is DOWN: check_ping: Invalid hostname/address - tools.wmflabs.org
[15:25:17] (03PS1) 10Ottomata: Use same partman recipe for all kafka-main hosts [puppet] - 10https://gerrit.wikimedia.org/r/318558 (https://phabricator.wikimedia.org/T148849)
[15:25:49] PROBLEM - Check for gridmaster host resolution TCP on labs-ns0.wikimedia.org is CRITICAL: DNS CRITICAL - 0.048 seconds response time (No ANSWER SECTION found)
[15:25:59] PROBLEM - Host checker.tools.wmflabs.org is DOWN: check_ping: Invalid hostname/address - checker.tools.wmflabs.org
[15:26:53] (03CR) 10Ottomata: "Robh, FYI, I switched this to use raid10-gpt-srv-ext4 since that is what the other hosts use. LVM would be fine too, but we want to keep " [puppet] - 10https://gerrit.wikimedia.org/r/318558 (https://phabricator.wikimedia.org/T148849) (owner: 10Ottomata)
[15:28:19] RECOVERY - Host labs-ns1.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 0.68 ms
[15:28:54] andrewbogott: I have started Nodepool and it looks all fine :]
[15:29:01] 06Operations: Race condition in setting net.netfilter.nf_conntrack_tcp_timeout_time_wait - https://phabricator.wikimedia.org/T136094#2752223 (10MoritzMuehlenhoff) This also affects trusty hosts. I'll also make the net.netfilter.nf_conntrack_max value configurable via Hiera.
[15:29:03] I am going out :)
[15:29:29] PROBLEM - puppet last run on cp3036 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[15:30:29] RECOVERY - Check for gridmaster host resolution TCP on labs-ns0.wikimedia.org is OK: DNS OK - 0.096 seconds response time (tools-grid-master.tools.eqiad.wmflabs. 60 IN A 10.68.20.158)
[15:30:49] RECOVERY - Make sure enwiki dumps are not empty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.014 second response time
[15:30:59] RECOVERY - Host checker.tools.wmflabs.org is UP: PING OK - Packet loss = 0%, RTA = 1.78 ms
[15:30:59] RECOVERY - Host tools.wmflabs.org is UP: PING OK - Packet loss = 0%, RTA = 3.05 ms
[15:31:09] RECOVERY - showmount succeeds on a labs instance on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.179 second response time
[15:31:19] RECOVERY - Start a job and verify on Precise on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.349 second response time
[15:31:19] RECOVERY - Redis set/get on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.013 second response time
[15:31:20] RECOVERY - Check for gridmaster host resolution UDP on labs-ns0.wikimedia.org is OK: DNS OK - 0.148 seconds response time (tools-grid-master.tools.eqiad.wmflabs. 60 IN A 10.68.20.158)
[15:31:44] (03CR) 10RobH: "Can all the kafka hosts switch to use LVM over time, rather than shifting them off of it?" [puppet] - 10https://gerrit.wikimedia.org/r/318558 (https://phabricator.wikimedia.org/T148849) (owner: 10Ottomata)
[15:32:20] 06Operations, 06Analytics-Kanban, 10EventBus, 13Patch-For-Review: setup/install/deploy kafka1003 (WMF4723) - https://phabricator.wikimedia.org/T148849#2752246 (10RobH) I'd suggest that we shift the kafka hosts to using it, not shift away from using it. Most of the servers with multiple disks tend to use LVM.
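On T136094 above: the task body isn't quoted here, but the usual shape of that race (an assumption) is that net.netfilter.* sysctl keys only exist once the nf_conntrack kernel module is loaded, so a value applied too early at boot is silently lost. A defensive setter as a sketch, for illustration only (not the deployed fix; needs root, and the timeout value is an example):

```
import os
import time

def set_sysctl(key, value, attempts=10, delay=1):
    """Write a sysctl, retrying until the key exists (e.g. module just loaded)."""
    path = '/proc/sys/' + key.replace('.', '/')
    for _ in range(attempts):
        if os.path.exists(path):
            with open(path, 'w') as f:
                f.write(str(value))
            return True
        time.sleep(delay)  # nf_conntrack not loaded yet; wait and retry
    return False

set_sysctl('net.netfilter.nf_conntrack_tcp_timeout_time_wait', 65)
```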
[15:32:24] RECOVERY - Test LDAP for query on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 57.944 second response time
[15:32:24] RECOVERY - Start a job and verify on Trusty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 51.641 second response time
[15:32:24] RECOVERY - All Flannel etcd nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 52.723 second response time
[15:32:34] RECOVERY - NFS read/writeable on labs instances on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 48.664 second response time
[15:33:32] robh: we could, why?
[15:34:41] ottomata: everything tends to shift to use lvm in our cluster over time
[15:34:50] so it seems backwards to shift back
[15:34:54] and if you do, you shouldn't manually do it
[15:35:00] you should reinstall the host to ensure the recipe works
[15:35:09] =]
[15:35:19] yuck
[15:35:21] basically if something starts filling the disk having lvm lets you just expand the 10% or so left to go
[15:35:30] well, if you just change it
[15:35:32] right, but the whole disk was allocated
[15:35:38] you may find when you reinstall it later that there is an issue
[15:35:47] uh, in lvm? the recipe tends to leave 10% of the disk
[15:35:49] PROBLEM - check mtime mod from tools cron job on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/toolscron - 341 bytes in 0.001 second response time
[15:35:49] PROBLEM - Redis set/get on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/redis - 341 bytes in 0.001 second response time
[15:35:49] PROBLEM - Verify internal DNS from within Tools on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/labs-dns/private - 341 bytes in 0.002 second response time
[15:35:57] ok, i didn't check, i guess MOST of the disk was allocated
[15:36:03] indeed, most but not all
[15:36:06] (03CR) 10Jcrespo: "This needs testing, but you may get the starting idea." [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/318557 (https://phabricator.wikimedia.org/T149422) (owner: 10Jcrespo)
[15:36:21] Using the lvm would allow the kafka machines to fall into more of the best practices used on the rest of the cluster
[15:36:25] if we save 10% from being allocated
[15:36:29] and it fills up, then we have to allocate it
[15:36:36] or, we could just use all the space from the beginning
[15:36:36] and it stops the host from crashing
[15:36:37] and have 10% more
[15:36:49] can expand it and have a few more minutes to fix the core issue
[15:37:03] hm, couldn't we do the same by making it alert sooner?
[15:37:13] im just telling you what i was told [15:37:18] haha ok [15:37:20] PROBLEM - Start a job and verify on Trusty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/trusty - 341 bytes in 0.013 second response time [15:37:20] and that overall over time the shift has been to include lvm [15:37:25] PROBLEM - Test LDAP for query on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/ldap - 341 bytes in 0.008 second response time [15:37:29] if you choose to remove, please reinstall with the recipe you want [15:37:33] PROBLEM - NFS read/writeable on labs instances on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/nfs/home - 341 bytes in 0.003 second response time [15:37:38] and dont assume it'll just work or when we have to reinstall later you may have issues [15:37:39] PROBLEM - Make sure enwiki dumps are not empty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/dumps - 341 bytes in 0.010 second response time [15:37:40] =] [15:37:51] um, well, i think i'd rather keep them consistent for now, and if/when we want to switch these clusters over to do so we can do that [15:38:06] well, i had issues with the non lvm recipe on it [15:38:09] i'll keep that in mind though for other things [15:38:11] so you should reinstall. [15:38:11] oh really? [15:38:12] yes [15:38:17] the same one that kafka1001 uses? [15:38:18] then i recalled 'we should use lvm' [15:38:20] yep [15:38:22] ah [15:38:22] hm [15:38:25] gr [15:38:32] so rather than troubleshoot a non lvm recipe i switched [15:38:46] because non lvm ones arent as standard for our use (imo) [15:38:59] PROBLEM - puppet last run on cp3038 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:39:24] * ottomata not excited about troubleshooting partman right now [15:39:33] All of those toolslabs pages are being handled right? [15:39:44] PROBLEM - Test LDAP for query on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/ldap - 341 bytes in 0.011 second response time [15:39:44] PROBLEM - All Flannel etcd nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/etcd/flannel - 341 bytes in 0.003 second response time [15:39:54] PROBLEM - NFS read/writeable on labs instances on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/nfs/home - 341 bytes in 0.002 second response time [15:40:03] chasemp: ^ all these are being worked by labs team?
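For context on the alert flood: each toolschecker check is an Icinga HTTP probe that fetches an endpoint on checker.tools.wmflabs.org and requires the literal string OK in the response body, which is why a 502 from the service flips every check at once. The same probe can be reproduced by hand:
```
# Manual equivalent of one toolschecker probe (the /self endpoint is
# taken from the alerts above); prints PASS only if the body says OK.
curl -fsS http://checker.tools.wmflabs.org/self | grep -q OK && echo PASS
```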
[15:40:43] robh: yes, consequence of only a partial silence for expected reboots really [15:40:49] PROBLEM - Start a job and verify on Trusty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/trusty - 341 bytes in 0.011 second response time [15:41:05] although I'm not sure why still problems atm [15:41:16] chasemp: cool i just wanted to make sure we weren't blindly ignoring them expecting that you had it handled =] [15:41:19] RECOVERY - puppet last run on acamar is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:41:35] dammit I didn't get paged, debugging ensues [15:41:38] hey, what's up? [15:41:49] PROBLEM - All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 341 bytes in 0.049 second response time [15:42:40] robh: imo having different configs in a cluster is bad, I would not be happy to have lvm in one node and not in the other two [15:42:49] but I get your point about lvm [15:42:59] PROBLEM - All Flannel etcd nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/etcd/flannel - 341 bytes in 0.009 second response time [15:43:03] PROBLEM - Test LDAP for query on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/ldap - 341 bytes in 0.011 second response time [15:43:03] PROBLEM - All k8s etcd nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/etcd/k8s - 341 bytes in 0.013 second response time [15:43:08] PROBLEM - toolschecker service itself needs to return OK on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/self - 341 bytes in 0.008 second response time [15:43:08] PROBLEM - Redis set/get on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/redis - 341 bytes in 15.540 second response time [15:43:08] PROBLEM - showmount succeeds on a labs instance on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/nfs/showmount - 341 bytes in 15.574 second response time [15:43:08] PROBLEM - Start a job and verify on Precise on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/precise - 341 bytes in 15.568 second response time [15:43:14] hey, if otto wants to change it back that is fine by me, but i just wanted to let him know i had issues with the non lvm recipe [15:43:19] RECOVERY - Verify internal DNS from within Tools on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.012 second response time [15:43:19] RECOVERY - Make sure enwiki dumps are not empty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.017 second response time [15:43:29] RECOVERY - check mtime mod from tools cron job on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.008 second response time [15:43:45] sorry for the 
delay in reply, my backup personal laptop (that im on) is on a spinny disk hdd and it seems to be dying [15:43:56] everything is working like a latency delay in terminal =P [15:44:01] :) [15:44:11] (im getting a loaner from oit later today when i go into the office) [15:44:13] guys, what's going on? [15:44:16] !log toolschecker seems to have come up wonky, restarting service [15:44:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:23] akosiaris_: labs stuff is paging and labs team is aware [15:44:35] akosiaris_: kernel reboots for maint caused some side effect paging and one real issue which is ok now [15:44:45] ok, that what I wanted to know [15:44:56] akosiaris_: so your paging is working eh? ;] [15:45:06] godog: so the login for the carrier/sms provider is in the pwstore [15:45:13] and you can login and run a report to see if your number is listed [15:45:24] if you need any he'll tell me [15:45:55] akosiaris_: thanks man, sorry for the noise it was hard to separate false negative from issues there for a bit [15:45:56] robh, yeah pages are working fine [15:46:24] ok [15:46:29] robh: ah! thanks, I tried using wind canada email -> sms gateway but that doesn't seem to work, if I want to use the regular sms provider I need to whitelist my number I suppose? [15:48:06] you should be able to just change the icinga config to use your #@the aql email sms gateway address [15:48:14] oh, canadian number... [15:48:21] so aql works for my USA cell [15:48:27] it should work with canadian [15:48:54] robh: ack, I'll try that [15:49:12] surprised it didnt work with the cell providers email to sms address though [15:49:39] i'd try using that address in a normal mail client and see if it works. but since you arent exactly there for a long time, using the aql email gateway seems ok. [15:49:51] (if it was a year, it would seem silly to pay for those sms via aql, but whatever!) 
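Worth noting for the paging detour above: both options (a carrier's own email-to-SMS gateway, or AQL's) end up as nothing more than an email address on the Icinga contact, which is why switching is a one-line config change. A hedged sketch of such a contact; the contact name, address, and notification command names are illustrative, not the production definitions:
```
define contact {
    contact_name                   godog-pager           ; hypothetical
    host_notification_period       24x7
    service_notification_period    24x7
    host_notification_options      d,r
    service_notification_options   c,r
    host_notification_commands     host-notify-by-email  ; assumed name
    service_notification_commands  notify-by-email       ; assumed name
    email                          15551234567@sms.example-gateway.net
}
```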
[15:54:15] RECOVERY - NFS read/writeable on labs instances on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.017 second response time [15:54:15] RECOVERY - Redis set/get on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.031 second response time [15:54:15] RECOVERY - showmount succeeds on a labs instance on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.043 second response time [15:54:18] RECOVERY - Test LDAP for query on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.560 second response time [15:54:18] RECOVERY - Start a job and verify on Trusty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.929 second response time [15:54:18] RECOVERY - Start a job and verify on Precise on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 1.197 second response time [15:55:11] and the recovery pages are in at last [15:57:45] RECOVERY - puppet last run on cp3036 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [15:58:25] RECOVERY - All Flannel etcd nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 43.074 second response time [15:58:30] RECOVERY - toolschecker service itself needs to return OK on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 44.681 second response time [15:58:30] RECOVERY - All k8s worker nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 44.687 second response time [15:58:30] RECOVERY - All k8s etcd nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 36.355 second response time [15:59:08] robh: AQL works :D thanks [16:01:37] (03PS1) 10Gehel: maps / postgresql: monitoring uses template1 database [puppet] - 10https://gerrit.wikimedia.org/r/318560 [16:03:05] (03CR) 10Gehel: [C: 032] maps / postgresql: monitoring uses template1 database [puppet] - 10https://gerrit.wikimedia.org/r/318560 (owner: 10Gehel) [16:06:45] RECOVERY - puppet last run on cp3038 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [16:08:16] (03CR) 10Ottomata: [C: 032 V: 032] "Just talked with Robh about this. I'm going to try to reinstall with this recipe. Fingers crossed, maybe it'll just work this time!" 
[puppet] - 10https://gerrit.wikimedia.org/r/318558 (https://phabricator.wikimedia.org/T148849) (owner: 10Ottomata) [16:08:22] (03PS2) 10Ottomata: Use same partman recipe for all kafka-main hosts [puppet] - 10https://gerrit.wikimedia.org/r/318558 (https://phabricator.wikimedia.org/T148849) [16:08:24] (03CR) 10Ottomata: [V: 032] Use same partman recipe for all kafka-main hosts [puppet] - 10https://gerrit.wikimedia.org/r/318558 (https://phabricator.wikimedia.org/T148849) (owner: 10Ottomata) [16:13:45] PROBLEM - Host kafka1003 is DOWN: PING CRITICAL - Packet loss = 100% [16:15:45] RECOVERY - Host kafka1003 is UP: PING OK - Packet loss = 0%, RTA = 1.68 ms [16:17:55] PROBLEM - configured eth on kafka1003 is CRITICAL: Return code of 255 is out of bounds [16:18:07] PROBLEM - MD RAID on kafka1003 is CRITICAL: Return code of 255 is out of bounds [16:18:25] PROBLEM - DPKG on kafka1003 is CRITICAL: Return code of 255 is out of bounds [16:18:35] PROBLEM - salt-minion processes on kafka1003 is CRITICAL: Return code of 255 is out of bounds [16:18:45] PROBLEM - dhclient process on kafka1003 is CRITICAL: Return code of 255 is out of bounds [16:18:55] PROBLEM - puppet last run on kafka1003 is CRITICAL: Return code of 255 is out of bounds [16:19:05] we are reimaging it --^ [16:19:05] PROBLEM - Disk space on kafka1003 is CRITICAL: Return code of 255 is out of bounds [16:19:09] silencing icinga [16:19:29] oh woops [16:19:31] sorry thanks elukey [16:26:56] (03CR) 10Dzahn: "you dont want to use require_package anymore?" [puppet] - 10https://gerrit.wikimedia.org/r/317985 (owner: 10Hashar) [16:27:09] mutante: yup no more needed [16:27:31] as far as I can tell [16:27:32] 06Operations, 06Discovery, 06Maps, 03Interactive-Sprint, 13Patch-For-Review: reimage maps-test* servers - https://phabricator.wikimedia.org/T147194#2752361 (10Gehel) [16:27:34] 06Operations, 06Discovery, 06Maps, 03Interactive-Sprint: Maps - error when doing initial tiles generation: "Error: could not create converter for SQL_ASCII"" - https://phabricator.wikimedia.org/T148031#2752359 (10Gehel) 05Open>03Resolved This is now fixed, initial tiles generation is running on maps-te... [16:29:02] hasharAway: is there a reason to switch from require_package back to package{} ? [16:29:42] I thought it could cause a weird dependency cycle [16:29:46] but yeah [16:29:49] let me amend back [16:30:35] PROBLEM - Host labcontrol1002 is DOWN: PING CRITICAL - Packet loss = 100% [16:31:03] (03PS6) 10Hashar: contint: move doxygen/graphviz to labs instances [puppet] - 10https://gerrit.wikimedia.org/r/317985 [16:31:55] PROBLEM - Host labnet1002 is DOWN: PING CRITICAL - Packet loss = 100% [16:32:39] (Class[Contint::Packages::Doxygen] => Class[Contint::Packages::Doxygen] => Class[Contint::Packages::Doxygen]) [16:33:00] mutante: require_package causes a dependency cycle :( [16:33:16] sigh, but it prevents duplicate definitions :p [16:33:34] recompiles that to see if it shows a diff like that or not.. weird [16:33:36] (03PS7) 10Hashar: contint: move doxygen/graphviz to labs instances [puppet] - 10https://gerrit.wikimedia.org/r/317985 [16:33:56] no, it does not.. wtf [16:34:41] (03CR) 10Hashar: "PS6 tried to use require_package() but that causes a dependency cycle:" [puppet] - 10https://gerrit.wikimedia.org/r/317985 (owner: 10Hashar) [16:35:17] * hasharAway runs puppet [16:35:29] why is it always "no change" in compiler.. if we cant trust that that would be bad [16:35:46] guess the compiler is broken somehow ? 
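The cycle pasted above (Class[Contint::Packages::Doxygen] requiring itself) is unpacked in the lines that follow: require_package() declares a dynamically generated packages::<name> class, and resolved from inside the contint module that generated name collides with the real class. A hedged reconstruction of the clash; the file path and class body follow the discussion, not the exact patch content:
```
# modules/contint/manifests/packages/doxygen.pp (sketch)
class contint::packages::doxygen {
    # PS6 used require_package('doxygen'), which declares a generated
    # packages::doxygen class; resolved against this module's scope the
    # name becomes contint::packages::doxygen -- this very class -- so
    # the catalog ends up containing the class requiring itself.
    #
    # PS7's plain resources declare no helper class, hence no cycle:
    package { ['doxygen', 'graphviz']:
        ensure => present,
    }
}
```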
:D [16:35:55] so PS7 is all good on labs [16:36:10] and that is the same as PS5 against which I ran the compiler and manually did a diff of the pson catalogs [16:36:20] which would drop graphviz/doxygen from gallium/contint1001 :] [16:36:24] no more needed on those hosts [16:36:36] well, we expect it to be no-op on labs but not in prod [16:36:46] PROBLEM - NTP on kafka1003 is CRITICAL: NTP CRITICAL: No response from NTP server [16:37:37] (03PS8) 10Dzahn: contint: move doxygen/graphviz to labs instances [puppet] - 10https://gerrit.wikimedia.org/r/317985 (owner: 10Hashar) [16:37:37] probably the issue is that require_packages internally crafts a class named packages::doxygen [16:37:44] which ends up being namespaced under contint:: [16:38:01] so require_packages('doxygen') creates a Class['contint::packages::doxygen'] [16:38:25] which ends up being the same as the class defined by modules/contint/manifests/packages/doxygen.pp [16:38:29] ==> dependency cycle [16:38:52] hrmm.. ok.. i am more concerned that the compiler tells us something is no-op when it's not [16:39:00] cls = Puppet::Parser::Resource.new( [16:39:00] 'class', class_name, :scope => compiler.topscope) [16:39:09] pretty sure that compiler.topscope is the 'contint' module [16:39:19] and class_name = 'packages::' + package_name.tr('-+', '_') [16:39:24] it is a bug/corner case :] [16:40:51] sigh [16:41:10] package { 'doxygen': ensure => present } works all fine though :D [16:41:13] (03CR) 10Dzahn: [C: 032] "watching this on gallium... hrmmm" [puppet] - 10https://gerrit.wikimedia.org/r/317985 (owner: 10Hashar) [16:41:18] until it clashes with something else [16:41:30] yes, until you combine it with another role [16:41:30] 06Operations, 06Labs, 07Tracking: Sync data for tools-project from labstore1001 to labstore1004/5 - https://phabricator.wikimedia.org/T144255#2752383 (10madhuvishy) Started another sync now after truncating the >10G error/access log files from the above comment. New command (no >10G exclusion): ``` rsync --... [16:41:46] the compiler thing bugs me much more though [16:42:09] maybe it cant tell the difference for the same reason [16:42:12] no idea really [16:42:23] no, we built it with both versions [16:42:26] always no diff [16:42:34] and that was always reliable so far [16:42:39] modules/contint/manifests/packages.pp is almost empty now :] [16:43:15] also thanks for all the cleanup patches for gallium [16:43:23] too much centralization (irccloud , matrix, ...) [16:43:47] i wanna see the run on gallium now.. [16:43:57] !log gallium contint1001: apt-get remove --purge doxygen graphviz [16:44:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:05] wait, why did you do that [16:44:23] ok [16:44:34] the package{} resources are removed from puppet which does not garbage collect no-longer-existing resources [16:44:38] so I manually purged :D [16:44:48] saved an intermediate puppet patch that just does ensure => absent [16:45:28] * mutante checks the compiler output again [16:45:28] contint1001 all ok :] [16:47:00] hasharAway: ok!, so.. the "change catalog" actually is different from the "production catalog", the doxygen resource is not in it for example [16:47:37] and ya, nothing happens on gallium ( as expected) [16:47:42] yup [16:47:48] but there should clearly be a diff [16:47:58] between the catalogs being shown [16:48:28] feel free to fill a task and attach both catalogs to it ? [16:49:05] hmmm..
yea [16:49:40] (03PS3) 10Hashar: contint: move php5 install on jessie to nearest user [puppet] - 10https://gerrit.wikimedia.org/r/317987 [16:50:23] maybe the compiler is broken [16:50:33] I am passing above change through it at https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/4502/console [16:50:57] 06Operations: dbtree.wikimedia.org is broken - https://phabricator.wikimedia.org/T149430#2752385 (10kaldari) [16:51:14] https://puppet-compiler.wmflabs.org/4502/ hehe [16:53:33] (03CR) 10Hashar: [C: 031] "I have passed it via the compiler https://puppet-compiler.wmflabs.org/4502/ contint1001 which is jessie loses all those packages:" [puppet] - 10https://gerrit.wikimedia.org/r/317987 (owner: 10Hashar) [16:53:36] 06Operations: puppet compiler claims "no change" when catalogs are actually different - https://phabricator.wikimedia.org/T149432#2752413 (10Dzahn) [16:54:03] 06Operations, 06Analytics-Kanban, 10EventBus, 13Patch-For-Review: setup/install/deploy kafka1003 (WMF4723) - https://phabricator.wikimedia.org/T148849#2752425 (10Ottomata) [16:54:42] 06Operations, 06Analytics-Kanban, 10EventBus, 13Patch-For-Review: setup/install/deploy kafka1003 (WMF4723) - https://phabricator.wikimedia.org/T148849#2734746 (10Ottomata) I fixed the partman recipe to not use LVM, and I reinstalled the box with the non LVM recipe successfully. I'll wait til next week to p... [16:54:45] 06Operations: dbtree.wikimedia.org is broken - https://phabricator.wikimedia.org/T149430#2752431 (10Aklapper) [16:54:47] 06Operations, 10DBA: dbtree broken - https://phabricator.wikimedia.org/T149357#2752434 (10Aklapper) [16:55:09] yep, i made a ticket [16:56:00] 06Operations: puppet compiler claims "no change" when catalogs are actually different - https://phabricator.wikimedia.org/T149432#2752435 (10hashar) [16:56:03] [ 2016-10-28T16:50:53 ] INFO: Nodes: 1 NOOP 0 DIFF 0 ERROR 0 FAIL [16:56:04] [ 2016-10-28T16:50:53 ] INFO: Nodes: 1 NOOP 1 DIFF 0 ERROR 0 FAIL [16:56:06] ? [16:56:43] and I have uploaded the .pson files in Phabricator / attached them to the task thx! [16:56:45] <_joe_> catalogs are different or results on machines are different? [16:56:57] <_joe_> and yes, catalogs _are_ different [16:57:05] _joe_: the catalogs are different but the result page says "no change" [16:57:33] should we expect a diff there or not because it does not actually remove a package, just the resource [16:57:36] is the question i guess [16:58:17] <_joe_> it should be different, yes [16:58:26] <_joe_> it's a bug of puppet catalog differ [16:58:31] 06Operations: puppet compiler claims "no change" when catalogs are actually different - https://phabricator.wikimedia.org/T149432#2752413 (10hashar) Partial diff ``` --- prod.gallium.wikimedia.org.pson 2016-10-28 18:33:05.000000000 +0200 +++ change.gallium.wikimedia.org.pson 2016-10-28 18:33:05.000000000 +... [16:58:39] I have pasted a partial diff [16:58:48] showing some resources differences are missing [16:58:55] hasharAway: some of those php5-* packages are installed on gallium [16:58:59] but not all [16:59:08] mutante: yeah should be fine [16:59:16] we will phase out the machine in a few days anyway [16:59:21] but feel free to hold :] [16:59:24] I am off for dinner [16:59:34] well, but "Does not change on gallium.wikimedia.org since it is Precise and does not have those packages installed."
[16:59:41] conflicts with that [16:59:51] yeah [17:00:00] that is probably a left over from years ago [17:00:08] when we were running mediawiki tests directly on gallium [17:00:14] that is no more the case nowadays [17:00:26] RECOVERY - configured eth on kafka1003 is OK: OK - interfaces up [17:00:44] iirc the only thing we need on gallium is libapache2-mod-php5 which is provided by contint::website [17:00:46] RECOVERY - Disk space on kafka1003 is OK: DISK OK [17:00:56] RECOVERY - DPKG on kafka1003 is OK: All packages OK [17:00:56] RECOVERY - MD RAID on kafka1003 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0 [17:00:58] be back later! [17:01:16] RECOVERY - dhclient process on kafka1003 is OK: PROCS OK: 0 processes with command name dhclient [17:01:36] RECOVERY - puppet last run on kafka1003 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [17:02:36] ok, yes, the change itself looks good to me [17:04:34] (03CR) 10Dzahn: [C: 032] contint: move php5 install on jessie to nearest user [puppet] - 10https://gerrit.wikimedia.org/r/317987 (owner: 10Hashar) [17:06:58] The following packages have unmet dependencies: [17:06:59] libapache2-mod-php5 : Depends: php5-cli but it is not going to be installed [17:07:56] PROBLEM - puppet last run on cp3014 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:08:11] better if we remove all of that and check that puppet reinstalls what is actually needed [17:08:37] looking [17:12:50] now it froze while trying to disable php module.. wut.. [17:14:43] ok, maybe just my connection [17:15:11] !log contint1001 - removed php5-* packages (https://puppet-compiler.wmflabs.org/4502/contint1001.wikimedia.org/) [17:15:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:53] runs puppet to see that it re-installs libapache2-mod-php5 and it does, yep, done [17:18:26] RECOVERY - NTP on kafka1003 is OK: NTP OK: Offset 0.007885038853 secs [17:36:19] RECOVERY - puppet last run on cp3014 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [17:37:41] 06Operations, 10ChangeProp, 10MediaWiki-JobQueue, 06Performance-Team, and 2 others: Asynchronous processing in production: one queue to rule them all - https://phabricator.wikimedia.org/T149408#2752480 (10Pchelolo) [17:37:46] go _ale [17:37:50] nope! [17:40:29] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 610 600 - REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3074112 keys, up 219 days 9 hours - replication_delay is 610 [17:48:49] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3067520 keys, up 219 days 10 hours - replication_delay is 0 [17:58:53] (03CR) 10Daniel Kinzler: [C: 031] "yes, we want this enabled." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/317840 (https://phabricator.wikimedia.org/T142940) (owner: 10Thiemo Mättig (WMDE)) [18:03:43] PROBLEM - thumbor@8835 service on thumbor1001 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8835 is inactive [18:08:13] RECOVERY - thumbor@8835 service on thumbor1001 is OK: OK - thumbor@8835 is active [18:12:05] (03CR) 10Aude: "@daniel it's an option so that people can try and test it before deploying everywhere?"
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/317840 (https://phabricator.wikimedia.org/T142940) (owner: 10Thiemo Mättig (WMDE)) [18:15:23] RECOVERY - salt-minion processes on kafka1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [18:15:45] (03PS1) 10Ottomata: Add kafka1003 to main-eqiad Kafka cluster [puppet] - 10https://gerrit.wikimedia.org/r/318570 (https://phabricator.wikimedia.org/T148849) [18:16:34] (03CR) 10Ottomata: [C: 04-1] "Will merge this next week" [puppet] - 10https://gerrit.wikimedia.org/r/318570 (https://phabricator.wikimedia.org/T148849) (owner: 10Ottomata) [18:22:29] (03PS1) 10Andrew Bogott: Labs dns: Ensure the mysql server starts at boot [puppet] - 10https://gerrit.wikimedia.org/r/318572 [18:25:14] (03CR) 10Rush: "@jaime, we had to reboot the labservices hosts and DNS didn't come back as it relies on the DB layer which is not auto start on boot. does" [puppet] - 10https://gerrit.wikimedia.org/r/318572 (owner: 10Andrew Bogott) [18:25:43] (03CR) 10Andrew Bogott: [C: 031] "Why just tools and not labs-wide? Aren't we hoping to use clush everywhere?" [puppet] - 10https://gerrit.wikimedia.org/r/315736 (owner: 10Yuvipanda) [18:27:13] (03CR) 10Yuvipanda: "Yes, but this is setup only for tools just now. That's easier since I can hit tools from the tools puppetmaster. Will need to implement a " [puppet] - 10https://gerrit.wikimedia.org/r/315736 (owner: 10Yuvipanda) [18:27:38] (03PS2) 10Yuvipanda: tools: Update clush classifier prefix for static nodes [puppet] - 10https://gerrit.wikimedia.org/r/315735 [18:27:43] (03CR) 10Yuvipanda: [C: 032 V: 032] tools: Update clush classifier prefix for static nodes [puppet] - 10https://gerrit.wikimedia.org/r/315735 (owner: 10Yuvipanda) [18:34:38] mutante: howdy; busy? [18:36:14] (03CR) 10Andrew Bogott: "I'm pretty sure Jaime already approved this change and I just screwed up by confusing 'ensure' with 'enabled'." [puppet] - 10https://gerrit.wikimedia.org/r/318572 (owner: 10Andrew Bogott) [18:39:41] heads up - kartotherian has been polluting maps cache with the wrong tiles at some emptier regions. The fix is almost done, will be deployed soonish [18:40:51] (03PS1) 10Rush: k8s: install bridge-utils with docker::engine [puppet] - 10https://gerrit.wikimedia.org/r/318574 [18:41:32] (03PS2) 10Rush: k8s: install bridge-utils with docker::engine [puppet] - 10https://gerrit.wikimedia.org/r/318574 [18:45:51] (03CR) 10Rush: tools: Grant clush user complete sudo rights for everything (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/315736 (owner: 10Yuvipanda) [18:47:29] (03CR) 10Yuvipanda: tools: Grant clush user complete sudo rights for everything (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/315736 (owner: 10Yuvipanda) [18:47:48] (03CR) 10Rush: tools: Grant clush user complete sudo rights for everything (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/315736 (owner: 10Yuvipanda) [18:49:51] (03PS2) 10Yuvipanda: tools: Grant clush user complete sudo rights for everything [puppet] - 10https://gerrit.wikimedia.org/r/315736 [18:50:34] (03CR) 10Rush: tools: Grant clush user complete sudo rights for everything (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/315736 (owner: 10Yuvipanda) [18:51:20] (03CR) 10Yuvipanda: "Yes, because if they are root they have far more actual ways of sidestepping logging. If they can read the ssh private key then game over."
[puppet] - 10https://gerrit.wikimedia.org/r/315736 (owner: 10Yuvipanda) [18:51:46] chasemp: responded, and also added the infrastructure include [18:51:59] yep yep Guest44452 you lost your nick again fyi [18:52:18] I get that root can jump over any hoops, what I was getting at is making sure they know they are jumping over them [18:52:54] if you chmod +x and run a command w/ full path clearly you're into some shit [18:53:05] mainly looking for how hardnosed you are trying to be about the wrapper [18:53:12] chasemp: not very, yeah [18:53:17] mostly a 'oh shit I forgot' [18:53:33] similar to molly-guard really [18:54:01] Guest44452: can you run a command out that doesn't require perms now via clush? [18:54:09] that's the current state right? [18:54:27] as unprived clush user I mean [18:54:44] chasemp: yeah [18:55:11] chasemp: gotta run it from tools-puppetmaster-01, need to move that to -02 soon [19:03:13] (03CR) 10Filippo Giunchedi: [C: 032] hieradata: add swift user for docker_registry [puppet] - 10https://gerrit.wikimedia.org/r/318148 (https://phabricator.wikimedia.org/T149098) (owner: 10Filippo Giunchedi) [19:04:22] (03PS2) 10Filippo Giunchedi: hieradata: add swift user for docker_registry [puppet] - 10https://gerrit.wikimedia.org/r/318148 (https://phabricator.wikimedia.org/T149098) [19:06:58] (03PS3) 10Filippo Giunchedi: hieradata: add swift user for docker_registry [puppet] - 10https://gerrit.wikimedia.org/r/318148 (https://phabricator.wikimedia.org/T149098) [19:15:03] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [19:16:10] (03PS3) 10Rush: tools: Grant clush user complete sudo rights for everything [puppet] - 10https://gerrit.wikimedia.org/r/315736 (owner: 10Yuvipanda) [19:17:33] (03CR) 10Rush: [C: 031] "In essence anyone who can access the priv key portion here to do anything nefarious would already be root on a box that can be used for co" [puppet] - 10https://gerrit.wikimedia.org/r/315736 (owner: 10Yuvipanda) [19:18:12] 06Operations, 10MediaWiki-General-or-Unknown, 10Traffic, 10media-storage: Mediawiki thumbnail requests for 0px should result in http 400 not 500 - https://phabricator.wikimedia.org/T147784#2752759 (10fgiunchedi) p:05Normal>03High There's still significant traffic for 0px thumbs resulting in 500s, it'd... [19:19:03] (03PS3) 10Rush: k8s: install bridge-utils with docker::engine [puppet] - 10https://gerrit.wikimedia.org/r/318574 [19:19:03] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [19:20:33] (03CR) 10Yuvipanda: [C: 031] "+1, but there's also a profile/docker/engine now, so welcome to divergence I guess."
[puppet] - 10https://gerrit.wikimedia.org/r/318574 (owner: 10Rush) [19:22:08] (03CR) 10Rush: [C: 032] k8s: install bridge-utils with docker::engine [puppet] - 10https://gerrit.wikimedia.org/r/318574 (owner: 10Rush) [19:23:03] PROBLEM - Swift HTTP backend on ms-fe3001 is CRITICAL: connect to address 10.20.0.15 and port 80: Connection refused [19:23:13] PROBLEM - Swift HTTP frontend on ms-fe3001 is CRITICAL: connect to address 10.20.0.15 and port 80: Connection refused [19:24:12] !log deployed kartotherian https://gerrit.wikimedia.org/r/#/c/318575/ - caching is still broken [19:24:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:18] (03PS1) 10Yuvipanda: python: Add python3-venv package [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/318580 (https://phabricator.wikimedia.org/T149441) [19:24:25] bd808: ^ [19:25:01] Guest44452: that fixes it? awesome! [19:25:18] Guest44452 any reason you don't identify to nickserv? [19:25:31] his bridge is acting up [19:25:46] arseny92: any reason he actually has to? [19:25:58] (03CR) 10BryanDavis: [C: 032] python: Add python3-venv package [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/318580 (https://phabricator.wikimedia.org/T149441) (owner: 10Yuvipanda) [19:26:22] (03Merged) 10jenkins-bot: python: Add python3-venv package [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/318580 (https://phabricator.wikimedia.org/T149441) (owner: 10Yuvipanda) [19:26:56] bd808: do you wanna kick off the build? :D [19:27:16] bd808: isn't too hard :D [19:27:24] Guest44452: I've got to join a meeting in 2m [19:27:32] bd808: ok [19:27:37] bd808: i'll kick it off now [19:27:40] but I should learn how for sure [19:27:53] bd808: it's pretty trivial, we have the build.py script you CR'd [19:34:27] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [19:34:37] RECOVERY - Swift HTTP backend on ms-fe3001 is OK: HTTP OK: HTTP/1.1 200 OK - 393 bytes in 0.188 second response time [19:34:47] RECOVERY - Swift HTTP frontend on ms-fe3001 is OK: HTTP OK: HTTP/1.1 200 OK - 185 bytes in 0.174 second response time [19:35:37] 06Operations, 10RESTBase, 10Traffic, 06Services (doing): Restbase redirects with cors not working on Android 4 native browser - https://phabricator.wikimedia.org/T149295#2752824 (10GWicke) [19:35:37] swift/esams was me btw [19:52:55] 06Operations, 10RESTBase, 10Traffic, 06Services (doing): Restbase redirects with cors not working on Android 4 native browser - https://phabricator.wikimedia.org/T149295#2752861 (10GWicke) See also: {T149444} [20:04:15] urandom: hello. so .. i assume the log messages will come from "restbase::hosts" like restbase1007 but not from "restbase::seeds" like restbase1007-a ? [20:04:53] can we just use the existing list in hieradata/role/eqiad/restbase/server.yaml as the source? does it have to be all of them, yes, right? [20:04:55] mutante: let me look [20:05:03] all of them, right [20:05:10] and, it should be staging, and aqs as well [20:05:41] hmm, in that case i think first we need a new list in that hiera file [20:06:32] or we can combine the existing ones with another list of aqs and staging hosts [20:06:37] that exist in another place [20:07:54] there is role/common/aqs.yaml which has [20:08:05] aqs_hosts: [20:08:18] and it even says it's already there for firewall rules, so that's good [20:08:29] hrmm, yeah [20:08:36] that's kinda ad hoc [20:08:46] ?
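Back on the clush patch +1'd above: the grant itself is a one-line sudoers rule, and the "hoops" discussed earlier live in the wrapper and in sudo's own logging rather than in the rule. A minimal sketch of the drop-in; the file path and user name are assumptions, not the actual patch content:
```
# /etc/sudoers.d/clush (sketch; path and user name are hypothetical)
# Passwordless sudo for the clush remote user. Every invocation is still
# recorded by sudo in auth.log, which is the accidental-use tripwire.
clush ALL = (ALL) NOPASSWD: ALL
```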
[20:09:08] i meant, it's something specific to aqs [20:09:40] and, i'm guessing it was added to work-around this very problem, that there is nothing that represents the list of hosts [20:10:06] a list was created to work-around not having a list? [20:10:18] yeah, that wasn't well formatted [20:10:28] and i'm kind of thinking out loud here [20:10:53] the ::seeds list was meant to be templated into cassandra's config, for it's seed list [20:11:10] which by convention we maintained as "all hosts", but doesn't need to be [20:11:24] oh, i meant aqs_hosts not aqs_seeds [20:11:27] (03PS1) 10Yuvipanda: tools: increase k8s apiserver open files limit [puppet] - 10https://gerrit.wikimedia.org/r/318584 [20:11:32] yeah [20:11:41] chasemp: ^ [20:11:58] (03PS2) 10Yuvipanda: tools: increase k8s apiserver open files limit [puppet] - 10https://gerrit.wikimedia.org/r/318584 [20:12:03] mutante: at some point we introduced multi-instance, and started manually adding instances to ::seeds [20:12:19] so it's even less reliable as a list of hosts [20:12:35] i'm thinking, what we should have is a list of hosts [20:12:41] isnt it all manual as long as it's somewhere in the puppet repo at all? [20:12:54] and if multi-instance is used, then it should suss out a seeds list from that [20:12:58] (03CR) 10Yuvipanda: [C: 032 V: 032] tools: increase k8s apiserver open files limit [puppet] - 10https://gerrit.wikimedia.org/r/318584 (owner: 10Yuvipanda) [20:13:23] mutante: not sure i follow that [20:13:27] bah, fuck you gerrit for not making it clear when the fuck a patch depends on other patches [20:13:29] * yuvipanda rages one tiny bit [20:13:37] urandom: i mean i cant think of a non-manual way to add them to hiera [20:13:44] or to a puppet role [20:13:50] (03PS3) 10Yuvipanda: tools: increase k8s apiserver open files limit [puppet] - 10https://gerrit.wikimedia.org/r/318584 [20:14:02] we just need to maintain a list , right [20:14:32] (03CR) 10Yuvipanda: [V: 032] tools: increase k8s apiserver open files limit [puppet] - 10https://gerrit.wikimedia.org/r/318584 (owner: 10Yuvipanda) [20:14:33] mutante: i'm trying to think of a way of coming up with a hosts list that doesn't require updating more than one place when we stand up a new machine, and that is consistent across cassandra clusters [20:14:35] yuvipanda: it does, "submit including parents" [20:14:57] urandom: yes, that sounds good. yea [20:15:07] mutante: but I don't want to... [20:15:16] mutante: except that's also greyed out because parents haven't undergone CR yet [20:15:20] the old UI had a decent list [20:15:31] now it's 'related patches, and good luck figuring out how the fuck this list is generated' [20:15:46] anyway [20:15:50] raging isn't going to change anything [20:20:10] mutante: gah, this will be quite invasive [20:20:48] mutante: which also sort of convinces me even more that it needs doing... [20:22:17] urandom: hmm.. yea.. eh.. one source of truth would be good instead of creating just another list..
but i understand it could be a lot more than we anticipated for the logmsgbot change [20:24:01] (03CR) 10Jcrespo: "I do not think this will work until this is merged:" [puppet] - 10https://gerrit.wikimedia.org/r/318572 (owner: 10Andrew Bogott) [20:25:46] mutante: yeah, i'm convinced it needs to be done, so i guess it's a question of either a) doing it right now, or b) adding the equiv of aqs_hosts for the restbase cluster(s) for now, and leaving a phab to do it later [20:26:06] (03CR) 10Jcrespo: "I would add the functionality there (on the module, service class) with an optional parameter there, may be useful on other non-production" [puppet] - 10https://gerrit.wikimedia.org/r/318572 (owner: 10Andrew Bogott) [20:26:08] the crusader in me is saying (a) [20:27:59] urandom: i think so, yes. a) sounds better if we can wait for the logging stuff a bit more until that is done [20:42:29] !log Sending Tool Labs survey reminder emails from silver (T147336) [20:42:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:42:35] T147336: 2016 Tool Labs user survey - https://phabricator.wikimedia.org/T147336 [20:55:21] PROBLEM - puppet last run on cp3019 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:02:57] (03PS1) 10BryanDavis: Add service name "people" for rutherfordium [dns] - 10https://gerrit.wikimedia.org/r/318643 [21:05:33] (03CR) 10Hashar: [C: 031] "Would be quite welcome! rutherfordium is cumbersome to remember / autocomplete." [dns] - 10https://gerrit.wikimedia.org/r/318643 (owner: 10BryanDavis) [21:08:43] (03CR) 10Hashar: [C: 031] switch zuul CNAME from gallium to contint1001 [dns] - 10https://gerrit.wikimedia.org/r/318249 (https://phabricator.wikimedia.org/T95757) (owner: 10Dzahn) [21:13:25] 06Operations, 10Wikimedia-Logstash: Get 5xx logs into kibana/logstash - https://phabricator.wikimedia.org/T149451#2753028 (10fgiunchedi) [21:20:07] (03PS2) 10Dzahn: add mapped IPv6 address for eventlog1001 [puppet] - 10https://gerrit.wikimedia.org/r/317192 [21:23:11] RECOVERY - puppet last run on cp3019 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [21:39:11] PROBLEM - puppet last run on cp3004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:45:18] (03CR) 10Dzahn: [C: 032] "a first step to fixing the networking issues of phab2001" [dns] - 10https://gerrit.wikimedia.org/r/317291 (https://phabricator.wikimedia.org/T143363) (owner: 10Dzahn) [21:49:13] (03PS1) 10Odder: Account creation throttle exemption for WMCL editathon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318645 (https://phabricator.wikimedia.org/T149443) [21:49:47] 06Operations, 10Citoid, 06Services, 10VisualEditor: NIH db misbehaviour causing problems to Citoid - https://phabricator.wikimedia.org/T133696#2753108 (10Mvolz) What's the status on this one? We're still using the doi converter api on every request in order to fill in DOI, PMID, and PMC. 
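For the option (a) settled on above, the end state is one canonical host list in Hiera from which firewall rules, the logging allowlist, and Cassandra seeds can all be derived, with multi-instance seeds like restbase1007-a computed rather than hand-maintained. A sketch of what such a key might look like; the key name, hosts, and instance letters are invented for illustration:
```
# hieradata sketch -- the key name and values here are hypothetical
cassandra::cluster_hosts:
  restbase1007.eqiad.wmnet: ['a', 'b']
  restbase1008.eqiad.wmnet: ['a', 'b']
# per-instance seeds (restbase1007-a, ...) would then be generated in
# the cassandra module from this map instead of a hand-edited ::seeds.
```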
[21:49:58] 06Operations, 10Citoid, 06Services, 10VisualEditor: NIH db misbehaviour causing problems to Citoid - https://phabricator.wikimedia.org/T133696#2753109 (10Mvolz) 05Open>03stalled [21:50:14] (03PS2) 10Odder: Account creation throttle exemption for WCML editathon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318645 (https://phabricator.wikimedia.org/T149443) [21:50:46] 06Operations, 10Citoid, 06Services, 10VisualEditor: NIH db misbehaviour causing problems to Citoid - https://phabricator.wikimedia.org/T133696#2239664 (10Mvolz) If we need to assess community impact we'd need to talk to community liaisons so if this is still an issue we could ask them. [21:56:54] (03PS2) 10Dzahn: add phab2001-vcs.codfw.wmnet [dns] - 10https://gerrit.wikimedia.org/r/317291 (https://phabricator.wikimedia.org/T143363) [22:07:03] (03PS1) 10Filippo Giunchedi: add swift.svc CNAME for codfw/eqiad [dns] - 10https://gerrit.wikimedia.org/r/318646 (https://phabricator.wikimedia.org/T149098) [22:08:51] RECOVERY - puppet last run on cp3004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [22:10:14] (03PS3) 10Dzahn: add phab2001-vcs.codfw.wmnet [dns] - 10https://gerrit.wikimedia.org/r/317291 (https://phabricator.wikimedia.org/T143363) [22:11:09] (03CR) 10Filippo Giunchedi: [C: 032] add swift.svc CNAME for codfw/eqiad [dns] - 10https://gerrit.wikimedia.org/r/318646 (https://phabricator.wikimedia.org/T149098) (owner: 10Filippo Giunchedi) [22:11:14] (03PS2) 10Filippo Giunchedi: add swift.svc CNAME for codfw/eqiad [dns] - 10https://gerrit.wikimedia.org/r/318646 (https://phabricator.wikimedia.org/T149098) [22:11:43] mutante: hehe clash ^ [22:12:06] heh, yea, i "sniped" you, sry [22:12:27] (03PS2) 10Dzahn: add git-ssh.codfw.wikimedia.org service IP [dns] - 10https://gerrit.wikimedia.org/r/317296 (https://phabricator.wikimedia.org/T143363) [22:12:32] hehe np, merged [22:16:01] 06Operations, 10Prod-Kubernetes, 05Kubernetes-production-experiment, 13Patch-For-Review, 15User-Joe: Experiment with Swift as docker registry backend - https://phabricator.wikimedia.org/T149098#2753210 (10fgiunchedi) The `docker:registry` user is setup in eqiad/codfw/esams, though esams LVS setup isn't c... [22:17:46] 06Operations, 10ChangeProp, 10MediaWiki-JobQueue, 06Performance-Team, and 2 others: Asynchronous processing in production: one queue to rule them all - https://phabricator.wikimedia.org/T149408#2751310 (10bd808) $DAYJOB-1 operated at a much, much smaller scale than Wikimedia, but we got huge benefits from... [22:18:41] (03CR) 10Dzahn: [C: 032] add git-ssh.codfw.wikimedia.org service IP [dns] - 10https://gerrit.wikimedia.org/r/317296 (https://phabricator.wikimedia.org/T143363) (owner: 10Dzahn) [22:26:05] 06Operations, 10Phabricator, 13Patch-For-Review: networking: allow ssh between iridium and phab2001 - https://phabricator.wikimedia.org/T143363#2753243 (10Dzahn) radon:~] $ host git-ssh.wikimedia.org git-ssh.wikimedia.org has address 208.80.154.250 git-ssh.wikimedia.org has IPv6 address 2620:0:861:ed1a::3:16... [22:26:35] 06Operations, 10ops-codfw, 10DBA: install new disks into dbstore2001 - https://phabricator.wikimedia.org/T149457#2753248 (10RobH) [22:28:21] PROBLEM - puppet last run on cp3048 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [22:29:02] * paladox goes to amazon.co.uk and amazon.com (shopping for apple mac's) [22:30:29] (03PS3) 10Filippo Giunchedi: Introduce mtail module [puppet] - 10https://gerrit.wikimedia.org/r/316543 (https://phabricator.wikimedia.org/T147923) [22:31:35] (03PS1) 10Papaul: script to change mgmt password [puppet] - 10https://gerrit.wikimedia.org/r/318650 [22:31:58] (03PS4) 10Filippo Giunchedi: Introduce mtail module [puppet] - 10https://gerrit.wikimedia.org/r/316543 (https://phabricator.wikimedia.org/T147923) [22:32:09] (03CR) 10Filippo Giunchedi: "thanks Alex!" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/316543 (https://phabricator.wikimedia.org/T147923) (owner: 10Filippo Giunchedi) [22:32:35] (03CR) 10jenkins-bot: [V: 04-1] script to change mgmt password [puppet] - 10https://gerrit.wikimedia.org/r/318650 (owner: 10Papaul) [22:43:59] (03PS2) 10Dzahn: phabricator: add vcs::listen_addresses for codfw [puppet] - 10https://gerrit.wikimedia.org/r/317295 (https://phabricator.wikimedia.org/T143363) [22:44:00] 06Operations, 10ChangeProp, 10MediaWiki-JobQueue, 06Performance-Team, and 2 others: Asynchronous processing in production: one queue to rule them all - https://phabricator.wikimedia.org/T149408#2753327 (10Pchelolo) >>! In T149408#2753211, @bd808 wrote: > I think this functionality can be approximated in K... [22:44:30] (03CR) 10Dzahn: [C: 032] "[radon:~] $ host phab2001-vcs.codfw.wmnet" [puppet] - 10https://gerrit.wikimedia.org/r/317295 (https://phabricator.wikimedia.org/T143363) (owner: 10Dzahn) [22:45:33] 06Operations, 10Phabricator, 13Patch-For-Review: networking: allow ssh between iridium and phab2001 - https://phabricator.wikimedia.org/T143363#2753331 (10mmodell) @dzahn, the same but more readable: |hostname|IPv?|Address| |git-ssh.wikimedia.org|IPv4|208.80.154.250 |git-ssh.wikimedia.org|IPv6|2620:0:861:ed... [22:45:34] (03PS2) 10Papaul: script to change mgmt password [puppet] - 10https://gerrit.wikimedia.org/r/318650 [22:49:34] 06Operations, 10Phabricator, 13Patch-For-Review: networking: allow ssh between iridium and phab2001 - https://phabricator.wikimedia.org/T143363#2753335 (10Dzahn) > So 208.80.154 is in eqiad and 153 is codfw? Yes, so each DC has several rows and each row has a network. And yea, 154 is eqiad and 153 is codfw.... [22:56:41] RECOVERY - puppet last run on cp3048 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [22:57:11] (03CR) 10Dzahn: [C: 032] "the script works, tested :) very nice" [puppet] - 10https://gerrit.wikimedia.org/r/318650 (owner: 10Papaul) [23:02:40] (03PS3) 1020after4: phabricator: add vcs::listen_addresses for codfw [puppet] - 10https://gerrit.wikimedia.org/r/317295 (https://phabricator.wikimedia.org/T143363) (owner: 10Dzahn) [23:03:12] (03PS1) 10Dzahn: mgmt: add missing # in changepw script and some spaces [puppet] - 10https://gerrit.wikimedia.org/r/318652 [23:03:52] (03CR) 10Dzahn: [C: 032] mgmt: add missing # in changepw script and some spaces [puppet] - 10https://gerrit.wikimedia.org/r/318652 (owner: 10Dzahn) [23:04:01] PROBLEM - puppet last run on rcs1001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [23:07:25] 06Operations, 10Wikimedia-Logstash: Get 5xx logs into kibana/logstash - https://phabricator.wikimedia.org/T149451#2753028 (10bd808) It's actually possible to make logstash read things directly from a kafka topic ( (03PS4) 10Dzahn: phabricator: add vcs::listen_addresses for codfw [puppet] - 10https://gerrit.wikimedia.org/r/317295 (https://phabricator.wikimedia.org/T143363) [23:10:30] (03PS1) 10Dzahn: admin: let datacenter-ops run script to change mgmt passwords [puppet] - 10https://gerrit.wikimedia.org/r/318654 [23:11:51] PROBLEM - puppet last run on dbproxy1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:16:11] RECOVERY - salt-minion processes on phab2001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [23:19:11] PROBLEM - salt-minion processes on phab2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [23:19:36] !log re-enabled puppet on phab2001 temp, ran puppet. removed 10.64.31.186/21 from eth0, stopped puppet again [23:19:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:22:10] 06Operations, 10ChangeProp, 10MediaWiki-JobQueue, 06Performance-Team, and 2 others: Asynchronous processing in production: one queue to rule them all - https://phabricator.wikimedia.org/T149408#2753404 (10bd808) >>! In T149408#2753327, @Pchelolo wrote: > @bd808 What are the use-cases you have in mind that... [23:22:49] 06Operations, 10Wikimedia-Logstash: Get 5xx logs into kibana/logstash - https://phabricator.wikimedia.org/T149451#2753405 (10fgiunchedi) @bd808 interesting! Not sure what's the current hhvm.log volume, though we're talking about ~20 logs/s at peak (see https://grafana.wikimedia.org/dashboard/db/varnish-http-er... [23:24:38] 06Operations, 10ChangeProp, 10MediaWiki-JobQueue, 06Performance-Team, and 2 others: Asynchronous processing in production: one queue to rule them all - https://phabricator.wikimedia.org/T149408#2753406 (10Pchelolo) > I know that there are jobs which in turn spawn other jobs. One reason this is done is to t... [23:28:43] (03PS1) 10Filippo Giunchedi: Offboarding Rob Lanphier [puppet] - 10https://gerrit.wikimedia.org/r/318656 [23:29:42] (03CR) 10Mark Bergsma: [C: 032] Offboarding Rob Lanphier [puppet] - 10https://gerrit.wikimedia.org/r/318656 (owner: 10Filippo Giunchedi) [23:30:01] (03PS2) 10Filippo Giunchedi: Offboarding Rob Lanphier [puppet] - 10https://gerrit.wikimedia.org/r/318656 [23:31:34] RECOVERY - puppet last run on rcs1001 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [23:34:01] 06Operations, 10Wikimedia-Logstash: Get 5xx logs into kibana/logstash - https://phabricator.wikimedia.org/T149451#2753414 (10bd808) ~20/s is nothing at to worry about I don't think. Our MediaWiki log event volume is ~100/s but we handle spikes to 10x that fairly often. [23:39:33] RECOVERY - puppet last run on dbproxy1004 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures
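On bd808's suggestion above that Logstash can consume the 5xx stream straight from Kafka (and with ~20 events/s well under the ~100/s MediaWiki volume already handled), a hedged sketch of such an input. The topic name and connection details are hypothetical, and option names vary between logstash-input-kafka versions: older releases are ZooKeeper-based as shown here, newer ones take bootstrap_servers/topics instead.
```
# logstash pipeline sketch for the 5xx topic; all names are assumptions
input {
  kafka {
    zk_connect => "zookeeper.example.eqiad.wmnet:2181"
    topic_id   => "webrequest-5xx"
    codec      => "json"
  }
}
```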