[00:00:28] Krenair: cool, thanks a lot [00:00:29] (03Merged) 10jenkins-bot: Disable section collapsing on h1s in Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/234330 (https://phabricator.wikimedia.org/T110436) (owner: 10Jdlrobson) [00:01:01] !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/234330/ (duration: 00m 13s) [00:01:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:01:11] jdlrobson, please test [00:01:15] on it! [00:01:25] sorry for being slow with the other change [00:02:27] Krenair: mmm not seeing it in action on english wikivoyage. [00:02:29] (03PS3) 10Dzahn: mailman: apply list role on fermium [puppet] - 10https://gerrit.wikimedia.org/r/233873 (https://phabricator.wikimedia.org/T109925) [00:02:39] one sec [00:03:31] (03PS4) 10Dzahn: mailman: apply list role on fermium [puppet] - 10https://gerrit.wikimedia.org/r/233873 (https://phabricator.wikimedia.org/T109925) [00:03:45] You missed a change to mobile.php [00:04:08] * jdlrobson face palms [00:04:58] wgMF = wmgMF line? [00:05:02] jdlrobson, try now [00:05:03] yes [00:05:03] PROBLEM - Host mw2027 is DOWN: PING CRITICAL - Packet loss = 100% [00:05:17] Why do random codfw mw hosts keep dying like this? [00:05:25] thanks Krenair !!! yay it works! [00:05:32] I'll upload the commit in a sec [00:05:38] Once sync-file gives up on mw2027 [00:05:53] !log Another codfw host broke: mw2027 [00:05:56] !log krenair@tin Synchronized wmf-config/mobile.php: live hack to make previous commit work (duration: 01m 14s) [00:06:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:06:01] (03CR) 10Dzahn: [C: 032] mailman: apply list role on fermium [puppet] - 10https://gerrit.wikimedia.org/r/233873 (https://phabricator.wikimedia.org/T109925) (owner: 10Dzahn) [00:06:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:06:09] Krenair: it seems to be working as expected now [00:07:13] (03PS1) 10Alex Monk: Follow-up I030135ae: Add wg = wmg line to mobile.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/234446 [00:07:17] 6operations, 10Wikimedia-Mailing-lists, 5Patch-For-Review: apply regular lists role on fermium and confirm no issues - https://phabricator.wikimedia.org/T109925#1582554 (10Dzahn) Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Could not find data item mailman::lists_servername in a... [00:07:20] Krenair: i'm not sure that's needed [00:07:28] jdlrobson, why not? [00:07:34] It just fixed the issue didn't it? [00:07:41] oh you deployed it? okay ignore that then [00:07:50] i thought you hadn't deployed it yet :) [00:08:01] Yes, I wrote it on tin, deployed it, now I'm putting it in gerrit. [00:08:11] (03CR) 10Alex Monk: [C: 032] Follow-up I030135ae: Add wg = wmg line to mobile.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/234446 (owner: 10Alex Monk) [00:08:12] ok got it. 
my bad :) [00:08:18] (03Merged) 10jenkins-bot: Follow-up I030135ae: Add wg = wmg line to mobile.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/234446 (owner: 10Alex Monk) [00:10:42] PROBLEM - puppet last run on fermium is CRITICAL puppet fail [00:11:24] ^ that's me, already making the fix [00:12:07] (03PS1) 10Dzahn: mailman: use role-based hiera lookup on fermium [puppet] - 10https://gerrit.wikimedia.org/r/234449 (https://phabricator.wikimedia.org/T109925) [00:12:13] (03CR) 10jenkins-bot: [V: 04-1] mailman: use role-based hiera lookup on fermium [puppet] - 10https://gerrit.wikimedia.org/r/234449 (https://phabricator.wikimedia.org/T109925) (owner: 10Dzahn) [00:12:20] (03PS2) 10Dzahn: mailman: use role-based hiera lookup on fermium [puppet] - 10https://gerrit.wikimedia.org/r/234449 (https://phabricator.wikimedia.org/T109925) [00:12:52] (03CR) 10Dzahn: [C: 032] mailman: use role-based hiera lookup on fermium [puppet] - 10https://gerrit.wikimedia.org/r/234449 (https://phabricator.wikimedia.org/T109925) (owner: 10Dzahn) [00:13:51] (03PS1) 10Alex Monk: Follow-up I32bfdd14: Fix beta to contact restbase on http rather than https [mediawiki-config] - 10https://gerrit.wikimedia.org/r/234450 [00:15:29] (03CR) 10Alex Monk: [C: 032] Follow-up I32bfdd14: Fix beta to contact restbase on http rather than https [mediawiki-config] - 10https://gerrit.wikimedia.org/r/234450 (owner: 10Alex Monk) [00:15:36] (03Merged) 10jenkins-bot: Follow-up I32bfdd14: Fix beta to contact restbase on http rather than https [mediawiki-config] - 10https://gerrit.wikimedia.org/r/234450 (owner: 10Alex Monk) [00:15:57] 6operations, 10Wikimedia-Mailing-lists, 5Patch-For-Review: apply regular lists role on fermium and confirm no issues - https://phabricator.wikimedia.org/T109925#1582571 (10Dzahn) Notice: /Stage[main]/Mailman::Listserve/Exec[dpkg-reconfigure mailman]: Dependency Exec[debconf-communicate set mailman/gate_news]... [00:18:13] mutante, so, any idea what's up with mw2027? [00:19:00] !log krenair@tin Synchronized wmf-config/CommonSettings.php: https://gerrit.wikimedia.org/r/#/c/234450/ (duration: 01m 14s) [00:19:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:21:33] 6operations, 10Wikimedia-Mailing-lists, 5Patch-For-Review: apply regular lists role on fermium and confirm no issues - https://phabricator.wikimedia.org/T109925#1582576 (10Dzahn) But besides that it looks really good. It created ALL the mailman listinfo pages, , spamassassin is running, the webserver setup i... [00:23:17] Krenair: looks very much like it crashed. i'll attempt to get it back up [00:23:23] but then somebody needs to sync it [00:23:31] sync-common? [00:23:34] sure [00:24:42] !log powercycled mw2027 [00:24:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:25:28] Configuring memory. Please wait... [00:26:59] yep, it's back [00:27:03] RECOVERY - Host mw2027 is UPING OK - Packet loss = 0%, RTA = 52.13 ms [00:27:13] boy I can't wait until nobody has to manually sync to any machine re-entering the pool. [00:27:15] the "UPING" thing will always bug me :) [00:27:39] Krenair: please attempt sync [00:27:51] "hey man, quit lying down" "ok, I'm uping, give me a second" [00:28:10] it's an "optimization" that does this :p [00:28:17] it wasnt like this all the time [00:28:32] ? 
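[Editor's note] The odd "UPING OK" host-recovery line quoted above is tracked down a little further on in this log to a regex in ircecho's beautify_message ("before the regex change it was 'UP: PING OK'"). As a purely hypothetical illustration — not the actual ircecho code or ori's actual fix — this is how an over-eager substitution can glue the host state onto the check name, and how keeping a separator avoids it:

```python
import re

# Hypothetical illustration only -- not the real beautify_message code.
raw = "RECOVERY - Host mw2027 is UP: PING OK - Packet loss = 0%, RTA = 52.13 ms"

# An over-eager substitution that drops the ": P" separator glues the words together:
buggy = re.sub(r"UP: P", "UP", raw)
print(buggy)   # RECOVERY - Host mw2027 is UPING OK - Packet loss = 0%, RTA = 52.13 ms

# Treating the host state and the check result as separate tokens keeps them readable:
fixed = re.sub(r"is (UP|DOWN): PING", r"is \1, PING", raw)
print(fixed)   # RECOVERY - Host mw2027 is UP, PING OK - Packet loss = 0%, RTA = 52.13 ms
```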
[00:32:49] mutante, done [00:32:58] :) [00:33:02] krenair@mw2027:~$ sync-common [00:33:02] 00:31:12 Copying to mw2027.codfw.wmnet from tin.eqiad.wmnet [00:33:02] 00:31:12 Started rsync common [00:33:02] 00:32:21 Finished rsync common (duration: 01m 08s) [00:33:03] krenair@mw2027:~$ [00:33:40] so yea, it was just crashed. the symptons are just garbled output on console and when you powercycle it all is back to normal [00:34:02] that's not pretty [00:34:14] but not that uncommon either [00:35:58] PROBLEM - Exim SMTP on fermium is CRITICAL: Connection refused [00:36:10] icinga-wm: thanks, that's what i want it to be :) [00:36:19] and you confirm that things worked [00:37:27] (03PS1) 10Ori.livneh: Small improvements to ssh-agent-proxy [puppet] - 10https://gerrit.wikimedia.org/r/234454 [00:37:28] PROBLEM - mailman archives on fermium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Forbidden - string The Wikimedia-l Archives not found on https://lists.wikimedia.org:443/pipermail/wikimedia-l/ - 440 bytes in 0.021 second response time [00:37:31] ACKNOWLEDGEMENT - Exim SMTP on fermium is CRITICAL: Connection refused daniel_zahn T109925 et al [00:37:31] ACKNOWLEDGEMENT - mailman archives on fermium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Forbidden - string The Wikimedia-l Archives not found on https://lists.wikimedia.org:443/pipermail/wikimedia-l/ - 440 bytes in 0.021 second response time daniel_zahn T109925 et al [00:37:31] ACKNOWLEDGEMENT - mailman list info on fermium is CRITICAL: HTTP CRITICAL: HTTP/1.1 403 Forbidden - string Discussion list for the Wi... not found on https://lists.wikimedia.org:443/mailman/listinfo/wikimedia-l - 446 bytes in 0.058 second response time daniel_zahn T109925 et al [00:37:31] ACKNOWLEDGEMENT - puppet last run on fermium is CRITICAL Puppet has 2 failures daniel_zahn T109925 et al [00:37:40] (03PS2) 10Ori.livneh: Small improvements to ssh-agent-proxy [puppet] - 10https://gerrit.wikimedia.org/r/234454 [00:37:47] (03CR) 10Ori.livneh: [C: 032 V: 032] Small improvements to ssh-agent-proxy [puppet] - 10https://gerrit.wikimedia.org/r/234454 (owner: 10Ori.livneh) [00:38:09] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 25.00% of data above the critical threshold [500.0] [00:38:35] ori: remember the icinga-wm message change? [00:38:42] like where it actually was [00:39:44] mutante: not off the top of my head [00:41:41] ok [00:41:41] 6operations, 10Wikimedia-Mailing-lists, 5Patch-For-Review: apply regular lists role on fermium and confirm no issues - https://phabricator.wikimedia.org/T109925#1582609 (10Dzahn) monitoring issues: mailman I/O stats: ERROR: Device incorrectly specified all others are known (exim stopped, archives not imp... [00:42:39] RECOVERY - puppet last run on analytics1015 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [00:45:25] mutante: ./modules/ircecho/files/ircecho [00:45:57] ori: :) thanks [00:46:18] i asked because of it always says "UPING" [00:46:54] hrmm [00:46:57] probably a regex error [00:47:00] got an example? [00:47:17] < icinga-wm> RECOVERY - Host mw2027 is UPING OK - Packet loss = 0%, RTA = 52.13 ms [00:47:22] UPING OK [00:47:28] when a host comes back [00:47:58] right. looking at my irc logs, before the regex change it was 'UP: PING OK' [00:48:04] it's somewhre in beautify_message [00:48:12] yep [00:50:23] got it [00:52:19] (03PS1) 10Ori.livneh: UPING is not a word. 
[puppet] - 10https://gerrit.wikimedia.org/r/234455 [00:52:26] mutante: ^ [00:53:14] (03PS1) 10Dzahn: mailman: fix I/O monitoring on fermium [puppet] - 10https://gerrit.wikimedia.org/r/234456 (https://phabricator.wikimedia.org/T109925) [00:53:34] ori: thank you, it was sooo not important but i could not stop seeing it either :) [00:53:51] haha, well, it was my error originally [00:54:00] (03CR) 10Dzahn: [C: 032] UPING is not a word. [puppet] - 10https://gerrit.wikimedia.org/r/234455 (owner: 10Ori.livneh) [00:54:08] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [00:54:44] (03PS2) 10Dzahn: mailman: fix I/O monitoring on fermium [puppet] - 10https://gerrit.wikimedia.org/r/234456 (https://phabricator.wikimedia.org/T109925) [00:56:23] (03CR) 10Dzahn: [C: 04-1] "merge when sodium isn't active anymore, not worth adding another if-then-else" [puppet] - 10https://gerrit.wikimedia.org/r/234456 (https://phabricator.wikimedia.org/T109925) (owner: 10Dzahn) [00:57:46] hmm debconf::set , only used in mailman module, nowhere else [00:58:00] and i have an issue with that on the jessie box [00:58:08] RECOVERY - Keyholder SSH agent on mira is OK Keyholder is armed with all configured keys. [01:01:29] oh, we had to restart it? right [01:18:18] PROBLEM - IPsec on cp1060 is CRITICAL: Strongswan CRITICAL - ok: 23 connecting: (unnamed) not-conn: cp3017_v6 [01:20:18] RECOVERY - IPsec on cp1060 is OK: Strongswan OK - 24 ESP OK [01:21:08] (03PS1) 10Ori.livneh: Lint fixes for I695ce535 and I526d5737 [debs/pybal] - 10https://gerrit.wikimedia.org/r/234457 [01:21:19] (03CR) 10Ori.livneh: [C: 032] Lint fixes for I695ce535 and I526d5737 [debs/pybal] - 10https://gerrit.wikimedia.org/r/234457 (owner: 10Ori.livneh) [01:21:35] (03Merged) 10jenkins-bot: Lint fixes for I695ce535 and I526d5737 [debs/pybal] - 10https://gerrit.wikimedia.org/r/234457 (owner: 10Ori.livneh) [01:21:44] (03CR) 10Ori.livneh: "Very nice work, kudos!" [debs/pybal] - 10https://gerrit.wikimedia.org/r/234253 (owner: 10Giuseppe Lavagetto) [01:21:53] (03CR) 10Ori.livneh: "Very nice, kudos!" [debs/pybal] - 10https://gerrit.wikimedia.org/r/234008 (owner: 10Giuseppe Lavagetto) [01:21:59] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [500.0] [01:32:08] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [01:37:13] !log on ruthenium restarted parsoid-rt-client and parsoid-vd-client [01:37:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:43:24] it's being killed by oom-killer [01:43:46] ah [01:45:50] apparently this is not unusual: http://ganglia.wikimedia.org/latest/graph_all_periods.php?h=ruthenium.eqiad.wmnet&m=cpu_report&r=week&s=by%20name&hc=4&mc=2&st=1440726295&g=mem_report&z=large&c=Miscellaneous%20eqiad [01:46:19] PROBLEM - puppet last run on analytics1015 is CRITICAL: CRITICAL: Puppet has 1 failures [01:47:14] i have another unrelated question since you're here [01:47:51] how would you go about debugging why certain pages (which are not frequently edited) appear hundreds of times per day in the slow parse log? 
[01:49:20] well, you could log more information in the slow parse log, but it might be more interesting to look at the sources of parse operations generally [01:50:22] I think we have statistics breaking down parser cache misses by their causes [01:51:03] as well as uncacheable parse operations like users with stub thresholds [01:51:52] pcache_miss_stub dominates: http://graphite.wikimedia.org/render/?width=586&height=308&_salt=1440726697.198&target=MediaWiki.pcache_miss_expired.count&target=MediaWiki.pcache_miss_absent.count&target=MediaWiki.pcache_miss_stub.count [01:52:33] oh, no they don't. there's a better source for these: http://graphite.wikimedia.org/render/?width=586&height=308&_salt=1440726745.959&target=MediaWiki.pcache.miss.absent.count&target=MediaWiki.pcache.miss.expired.count&target=MediaWiki.pcache.miss.rejected.count&target=MediaWiki.pcache.miss.revid.count&target=MediaWiki.pcache.miss.stub.count [01:53:31] for sources of parse operations, i guess the xenon logs could show which code paths are the most frequent callers of the parse [01:53:33] r [01:58:13] !log on ruthenium, reduced parsoid-rt-client concurrency from 16 to 8 since it was OOM and oom-killer was killing random things [01:58:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:59:28] !log on ruthenium: started parsoid_vd which was previously killed by oom-killer [01:59:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:04:38] PROBLEM - Last backup of the tools filesystem on labstore1002 is CRITICAL: CRITICAL - Last run result was exit-code [02:07:49] 6operations, 10Wikimedia-Mailing-lists: publish statistics about number of held messages per list - https://phabricator.wikimedia.org/T110609#1582748 (10Dzahn) command to get stats, in /var/lib/mailman/data/ or /var/lib/mailman/restore/var/lib/mailman/data/ respectively: find . | sed -e "s/pck//g" -e "s/[0-9... [02:12:19] RECOVERY - puppet last run on analytics1015 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [02:13:28] PROBLEM - Outgoing network saturation on labstore1003 is CRITICAL: CRITICAL: 12.00% of data above the critical threshold [100000000.0] [02:16:53] 6operations, 10Wikimedia-Mailing-lists, 5Patch-For-Review: apply regular lists role on fermium and confirm no issues - https://phabricator.wikimedia.org/T109925#1582749 (10Dzahn) @fermium:~# echo get mailman/gate_news | debconf-communicate 10 mailman/gate_news doesn't exist @sodium:~# echo get mailman/gate... [02:17:59] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0] [02:29:09] RECOVERY - Outgoing network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0] [02:30:20] PROBLEM - SSH on analytics1037 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:30:20] PROBLEM - RAID on analytics1037 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:30:29] PROBLEM - YARN NodeManager Node-State on analytics1037 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
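[Editor's note] The MediaWiki.pcache.miss.* breakdown ori links above can also be pulled out of Graphite programmatically via the render API. A minimal sketch, assuming the graphite.wikimedia.org endpoint in those URLs is reachable and serves format=json (standard graphite-web behaviour); the metric names are the ones quoted in the conversation:

```python
import json
import urllib.parse
import urllib.request

METRICS = [
    "MediaWiki.pcache.miss.absent.count",
    "MediaWiki.pcache.miss.expired.count",
    "MediaWiki.pcache.miss.rejected.count",
    "MediaWiki.pcache.miss.revid.count",
    "MediaWiki.pcache.miss.stub.count",
]

def fetch(metrics, hours=24):
    """Return {target: sum of datapoints over the window} for each metric."""
    query = [("format", "json"), ("from", "-%dh" % hours)]
    query += [("target", m) for m in metrics]
    url = "http://graphite.wikimedia.org/render/?" + urllib.parse.urlencode(query)
    with urllib.request.urlopen(url) as resp:
        series = json.load(resp)
    # Graphite returns a list of {"target": ..., "datapoints": [[value, ts], ...]}
    return {
        s["target"]: sum(v for v, _ts in s["datapoints"] if v is not None)
        for s in series
    }

if __name__ == "__main__":
    for target, total in sorted(fetch(METRICS).items(), key=lambda kv: -kv[1]):
        print("%-45s %12.0f" % (target, total))
```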
[02:32:08] PROBLEM - Hadoop NodeManager on analytics1037 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [02:32:19] RECOVERY - SSH on analytics1037 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [02:32:19] RECOVERY - RAID on analytics1037 is OK: OK: optimal, 13 logical, 14 physical [02:33:36] (03PS1) 10Dzahn: mailman: remove gate_news debconf option [puppet] - 10https://gerrit.wikimedia.org/r/234462 (https://phabricator.wikimedia.org/T109925) [02:34:09] RECOVERY - Hadoop NodeManager on analytics1037 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [02:34:29] RECOVERY - YARN NodeManager Node-State on analytics1037 is OK: OK: YARN NodeManager analytics1037.eqiad.wmnet:8041 Node-State: RUNNING [02:35:10] !log l10nupdate@tin Synchronized php-1.26wmf20/cache/l10n: l10nupdate for 1.26wmf20 (duration: 10m 47s) [02:35:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:35:50] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [02:41:26] !log l10nupdate@tin LocalisationUpdate completed (1.26wmf20) at 2015-08-28 02:41:26+00:00 [02:41:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:51:30] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [500.0] [02:56:23] 6operations, 10Wikimedia-Mailing-lists: send follow-up email, announce changes with new mailman version if any that have user impact - https://phabricator.wikimedia.org/T110140#1582771 (10Dzahn) 2.1.18 (03-May-2014) Dependencies - There is a new dependency associated with the new Privacy options ->... 
[02:58:48] (03CR) 10Tim Landscheidt: "The patch doesn't introduce a dependency on a database, it just switches the existing on a flat-file, NFS, not-really-sure-what-it-means d" [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/233338 (owner: 10Tim Landscheidt) [03:05:59] PROBLEM - Last backup of the others filesystem on labstore1002 is CRITICAL: CRITICAL - Last run result was exit-code [03:07:24] (03PS2) 10Dzahn: mailman: remove gate_news debconf option [puppet] - 10https://gerrit.wikimedia.org/r/234462 (https://phabricator.wikimedia.org/T109925) [03:11:19] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [03:32:29] PROBLEM - Outgoing network saturation on labstore1003 is CRITICAL: CRITICAL: 14.81% of data above the critical threshold [100000000.0] [03:44:18] PROBLEM - Host analytics1037 is DOWN: PING CRITICAL - Packet loss = 100% [03:44:59] PROBLEM - puppet last run on analytics1015 is CRITICAL: CRITICAL: Puppet has 1 failures [03:52:09] RECOVERY - Outgoing network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0] [04:05:08] PROBLEM - Last backup of the maps filesystem on labstore1002 is CRITICAL: CRITICAL - Last run result was exit-code [04:27:36] 6operations, 7Database, 5Patch-For-Review: db1022 duplicate key errors - https://phabricator.wikimedia.org/T105879#1582864 (10Springle) 5Open>3Resolved [04:27:36] 6operations, 7Database: install/setup/deploy db2043-db2070 - https://phabricator.wikimedia.org/T96383#1582872 (10Springle) 5Open>3Resolved [04:28:22] 6operations: puppet stopped mysqld using orphan pid file from puppet agent - https://phabricator.wikimedia.org/T86482#1582877 (10Springle) [04:42:40] RECOVERY - puppet last run on analytics1015 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [04:58:42] !log ori@tin Synchronized php-1.26wmf20/includes/parser/Parser.php: 754b222daf: Add ParserOutput cache and expiry times to NewPP report (duration: 00m 13s) [04:58:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [05:12:22] 6operations, 10Wikidata: Deploy wikibase usage tracking on all client wikis on the wikimedia cluster - https://phabricator.wikimedia.org/T110339#1582915 (10Bugreporter) usage tracking is already deployed to zhwiki [05:23:41] (03PS4) 10Muehlenhoff: Logstash: make sure all input defines deal with ferm [puppet] - 10https://gerrit.wikimedia.org/r/233866 (owner: 10BryanDavis) [05:24:01] (03CR) 10Muehlenhoff: [C: 032 V: 032] Logstash: make sure all input defines deal with ferm [puppet] - 10https://gerrit.wikimedia.org/r/233866 (owner: 10BryanDavis) [05:31:06] (03PS3) 10Muehlenhoff: Exempt mediawiki/http from connection tracking [puppet] - 10https://gerrit.wikimedia.org/r/234221 [05:31:27] (03CR) 10Muehlenhoff: [C: 032 V: 032] Exempt mediawiki/http from connection tracking [puppet] - 10https://gerrit.wikimedia.org/r/234221 (owner: 10Muehlenhoff) [05:55:20] Attempt to connect to phuser@m3-master.eqiad.wmnet failed with error #1040: Too many connections. 
[05:55:48] random UI elements are failing to load [05:56:59] I re-opened https://phabricator.wikimedia.org/T109279 [05:59:10] !log l10nupdate@tin ResourceLoader cache refresh completed at Fri Aug 28 05:59:09 UTC 2015 (duration 59m 8s) [05:59:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:01:48] 6operations, 6Phabricator, 7Database, 5Patch-For-Review: Phabricator creates MySQL connection spikes - https://phabricator.wikimedia.org/T109279#1582949 (10Legoktm) 5Resolved>3Open I'm continually seeing errors like "Attempt to connect to phuser@m3-master.eqiad.wmnet failed with error #1040: Too many c... [06:30:38] PROBLEM - puppet last run on cp3008 is CRITICAL: CRITICAL: puppet fail [06:31:59] PROBLEM - puppet last run on mw2145 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:09] PROBLEM - puppet last run on wtp2008 is CRITICAL: CRITICAL: Puppet has 2 failures [06:32:18] PROBLEM - puppet last run on sca1001 is CRITICAL: CRITICAL: Puppet has 2 failures [06:32:19] PROBLEM - puppet last run on holmium is CRITICAL: CRITICAL: Puppet has 2 failures [06:32:39] PROBLEM - puppet last run on mw1008 is CRITICAL: CRITICAL: Puppet has 2 failures [06:32:48] PROBLEM - puppet last run on analytics1047 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:59] PROBLEM - puppet last run on lvs1003 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:59] PROBLEM - puppet last run on mc1017 is CRITICAL: CRITICAL: Puppet has 3 failures [06:33:00] PROBLEM - puppet last run on mw2021 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:00] PROBLEM - puppet last run on mw2036 is CRITICAL: CRITICAL: Puppet has 2 failures [06:34:09] PROBLEM - puppet last run on mw2126 is CRITICAL: CRITICAL: Puppet has 2 failures [06:34:19] PROBLEM - puppet last run on mw2018 is CRITICAL: CRITICAL: Puppet has 1 failures [06:55:40] RECOVERY - puppet last run on sca1001 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [06:55:50] RECOVERY - puppet last run on holmium is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [06:55:59] RECOVERY - puppet last run on mw1008 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [06:56:10] RECOVERY - puppet last run on analytics1047 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [06:56:19] RECOVERY - puppet last run on lvs1003 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [06:56:28] RECOVERY - puppet last run on mc1017 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:56:29] RECOVERY - puppet last run on mw2021 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [06:56:30] RECOVERY - puppet last run on mw2036 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [06:57:29] RECOVERY - puppet last run on mw2145 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:40] RECOVERY - puppet last run on wtp2008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:40] RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [06:57:50] RECOVERY - puppet last run on mw2018 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [06:58:19] RECOVERY - puppet last run on cp3008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:45:10] PROBLEM - puppet last run on 
analytics1015 is CRITICAL: CRITICAL: Puppet has 1 failures [07:49:58] (03CR) 10Merlijn van Deen: [C: 031] gridengine: Fix status check for gridengine-exec [puppet] - 10https://gerrit.wikimedia.org/r/234432 (https://phabricator.wikimedia.org/T110532) (owner: 10Tim Landscheidt) [07:51:26] <_joe_> valhallasw`cloud: I can merge ^^ for you, seems fairly straightforward [07:51:45] _joe_: yes, please :-) [07:52:21] <_joe_> valhallasw`cloud: but lemme take a look at what is happening there first for a sec [07:54:24] (03PS2) 10Giuseppe Lavagetto: gridengine: Fix status check for gridengine-exec [puppet] - 10https://gerrit.wikimedia.org/r/234432 (https://phabricator.wikimedia.org/T110532) (owner: 10Tim Landscheidt) [07:58:23] (03CR) 10Giuseppe Lavagetto: [C: 032] gridengine: Fix status check for gridengine-exec [puppet] - 10https://gerrit.wikimedia.org/r/234432 (https://phabricator.wikimedia.org/T110532) (owner: 10Tim Landscheidt) [07:59:36] <_joe_> valhallasw`cloud: I think, though, that we should solve this by writing a correct unit file. I mean this is ok as a temporary patch [07:59:48] _joe_: correct unit file? [07:59:51] <_joe_> but well, low-priority [07:59:55] <_joe_> for gridengine [08:00:00] <_joe_> gridengine-exec [08:00:17] <_joe_> right now I guess it uses some sysvinit script [08:00:27] Yeah, I think so. [08:00:30] <_joe_> while we should probably use a native upstart unit [08:00:31] (03PS1) 10Muehlenhoff: Remove some stray imports pkgdb code (not ready yet) [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/234477 [08:00:35] but it's what debian bundles [08:01:13] (03CR) 10Muehlenhoff: [C: 032 V: 032] Remove some stray imports pkgdb code (not ready yet) [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/234477 (owner: 10Muehlenhoff) [08:01:17] <_joe_> as I said, low-priority :) [08:01:21] (which doesn't mean we can't hack the old one out and replace it, but it's maybe more effort than it's worth) [08:02:45] (03PS1) 10Muehlenhoff: Add missing changelog entry [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/234478 [08:03:09] (03CR) 10Muehlenhoff: [C: 032 V: 032] Add missing changelog entry [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/234478 (owner: 10Muehlenhoff) [08:04:48] the Son of Grid Engine release docs don't mention anything about upstart, so I assume that's also still using sysvinit [08:13:10] RECOVERY - puppet last run on analytics1015 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:16:30] (03PS1) 10Jcrespo: Depool es1001 for cloning, increase es1011 weight, pool es1014 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/234479 (https://phabricator.wikimedia.org/T105843) [08:20:34] (03CR) 10Jcrespo: [C: 032] Depool es1001 for cloning, increase es1011 weight, pool es1014 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/234479 (https://phabricator.wikimedia.org/T105843) (owner: 10Jcrespo) [08:23:33] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool es1001, increas weight of es1011, pool es1014 for the first time (duration: 00m 13s) [08:23:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:25:59] (03CR) 10Filippo Giunchedi: [C: 031] "safe to merge I suppose?" 
[puppet] - 10https://gerrit.wikimedia.org/r/200625 (https://phabricator.wikimedia.org/T91553) (owner: 10Thcipriani) [08:28:59] PROBLEM - puppet last run on ganeti2002 is CRITICAL: CRITICAL: puppet fail [08:29:10] !log uploaded debdeploy 0.0.3 to carbon [08:29:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:36:24] (03CR) 10Filippo Giunchedi: [C: 04-1] Add service deploy via scap (032 comments) [tools/scap] - 10https://gerrit.wikimedia.org/r/224374 (owner: 10Thcipriani) [08:40:33] (03PS4) 1020after4: Setup Gerrit role account for Phabricator actions [puppet] - 10https://gerrit.wikimedia.org/r/234332 (owner: 10Chad) [08:40:41] (03CR) 1020after4: [C: 031] Setup Gerrit role account for Phabricator actions [puppet] - 10https://gerrit.wikimedia.org/r/234332 (owner: 10Chad) [08:45:33] 6operations, 6Phabricator, 7Database, 5Patch-For-Review: Phabricator creates MySQL connection spikes - https://phabricator.wikimedia.org/T109279#1583251 (10mmodell) I haven't seen such errors but there is definitely still a problem. I suspect there may be a root cause to this issue, and recent outages. The... [08:53:28] RECOVERY - puppet last run on ganeti2002 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [08:55:07] (03PS1) 10Muehlenhoff: Enable ferm on rhodium [puppet] - 10https://gerrit.wikimedia.org/r/234482 [08:56:14] (03CR) 10Muehlenhoff: [C: 032 V: 032] Enable ferm on rhodium [puppet] - 10https://gerrit.wikimedia.org/r/234482 (owner: 10Muehlenhoff) [08:56:25] 6operations, 6Phabricator, 7Database, 5Patch-For-Review: Phabricator creates MySQL connection spikes - https://phabricator.wikimedia.org/T109279#1583288 (10jcrespo) > I haven't seen such errors I can confirm, looking at the graphs above that those errors are real, hitting max connections as recently as 05... [08:56:54] (03CR) 10Filippo Giunchedi: cassandra: WIP support for multiple instances (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/231512 (https://phabricator.wikimedia.org/T95253) (owner: 10Filippo Giunchedi) [08:57:12] (03PS7) 10Filippo Giunchedi: cassandra: WIP support for multiple instances [puppet] - 10https://gerrit.wikimedia.org/r/231512 (https://phabricator.wikimedia.org/T95253) [09:00:22] !log enabled ferm on rhodium puppetmaster backend [09:00:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:06:09] PROBLEM - Outgoing network saturation on labstore1003 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [100000000.0] [09:10:04] (03PS1) 10Hashar: nodepool: user shell to /bin/bash [puppet] - 10https://gerrit.wikimedia.org/r/234483 [09:13:04] good morning [09:19:33] (03PS1) 10Muehlenhoff: also enable ferm on strontium (also move it to the role now) [puppet] - 10https://gerrit.wikimedia.org/r/234484 [09:20:17] 6operations, 10hardware-requests, 7Database, 5Patch-For-Review: new external storage cluster(s) - https://phabricator.wikimedia.org/T105843#1583322 (10jcrespo) The es1 servers may be some of our oldest servers for critical production usage (16GB of memory!). They still have 1.8TB drives, but they could be... 
[09:23:10] 6operations, 5Continuous-Integration-Isolation: python-openstackclient and python-openstackclient no more available in jessie backports - https://phabricator.wikimedia.org/T110656#1583324 (10hashar) 3NEW [09:23:20] (03CR) 10Muehlenhoff: [C: 032 V: 032] also enable ferm on strontium (also move it to the role now) [puppet] - 10https://gerrit.wikimedia.org/r/234484 (owner: 10Muehlenhoff) [09:25:48] some people unable to access their contributions and/or page histories [09:25:51] https://en.wikipedia.org/wiki/Wikipedia:Help_desk#My_own_contributions_list [09:26:29] https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#View_history_not_working.3F [09:26:51] sounds like a query performance problem ? [09:27:16] or maybe a gadget? [09:27:35] the contributions report states 'the browser locks up' which suggests javascript rather than a network/server issue [09:27:38] yeah or network level of course... [09:27:48] <_joe_> https://en.wikipedia.org/w/index.php?title=Jurassic_World&action=history is slow as hell though [09:27:55] <_joe_> or, it's not gonna load apparently [09:28:19] RECOVERY - Outgoing network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0] [09:28:45] !log enabled ferm on strontium puppetmaster backend [09:28:50] <_joe_> whoa If I require the page via curl, it loads just fine in 280 ms [09:28:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:28:58] <_joe_> so valhallasw`cloud is right :) [09:29:29] the browser does hang at 'waiting for en.wikipedia.org', strangely enough [09:29:58] but the 100% cpu usage does shout javascript again [09:29:59] <_joe_> valhallasw`cloud: I have no idea what's going on there atm, looking [09:30:09] <_joe_> valhallasw`cloud: that's js for sure [09:30:15] <_joe_> some weird loop [09:30:32] <_joe_> the browser tab is unkillable [09:30:43] <_joe_> I get it, we installed the crashme.de gadget :P [09:32:57] <_joe_> is that happening for any history page on enwiki? [09:33:24] <_joe_> nope [09:33:29] 6operations, 6Phabricator, 7Database, 5Patch-For-Review: Phabricator creates MySQL connection spikes - https://phabricator.wikimedia.org/T109279#1583345 (10akosiaris) >>! In T109279#1582035, @mmodell wrote: > Every time I type in the comment box, phabricator makes an async connection to render the comment... [09:35:29] PROBLEM - puppet last run on mw2021 is CRITICAL: CRITICAL: Puppet has 1 failures [09:36:00] (03PS1) 10Muehlenhoff: Enable ferm on swift/esams [puppet] - 10https://gerrit.wikimedia.org/r/234486 [09:36:09] PROBLEM - puppet last run on mw2023 is CRITICAL: CRITICAL: Puppet has 1 failures [09:36:09] PROBLEM - puppet last run on mw2081 is CRITICAL: CRITICAL: Puppet has 1 failures [09:37:59] PROBLEM - puppet last run on mw2043 is CRITICAL: CRITICAL: Puppet has 1 failures [09:38:19] PROBLEM - puppet last run on mw1215 is CRITICAL: CRITICAL: Puppet has 1 failures [09:40:36] (03CR) 10Filippo Giunchedi: [C: 031] Enable ferm on swift/esams [puppet] - 10https://gerrit.wikimedia.org/r/234486 (owner: 10Muehlenhoff) [09:41:27] _joe_: it also fails if i disable JS in my browser [09:42:45] (03CR) 10Muehlenhoff: [C: 032 V: 032] Enable ferm on swift/esams [puppet] - 10https://gerrit.wikimedia.org/r/234486 (owner: 10Muehlenhoff) [09:43:29] I can also reproduce it if I curl the page, then load it in chrome [09:43:44] wut ? [09:43:47] so it's the content ? [09:44:33] I think so. When I remove the first 40-or so lines, it does load [09:44:48] it seems to be the [09:44:58] huh. 
what [09:45:00] now it does load?! [09:45:14] ...and also on-wiki [09:45:28] until I resize the page?! [09:45:35] !log enabled ferm for swift on esams [09:45:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:48:33] !log Cloning es1001 database into es1012 [09:48:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:49:49] moritzm, can we postpone ferm on phab and do it on any other misc db server? [09:50:08] hey folks! I have lost a deb package from our apt.wm.o repositories and can't find it anymore :( [09:50:33] because of https://phabricator.wikimedia.org/T109279#1579991 [09:50:34] we used to have python-novaclient in jessie backports, but apt-cache policy no more shows it from backports and I am a bit lost in figuring out what happens [09:52:41] hashar, I've done a quick search, and I only can find python-novaclient in the debian and ubuntu upstream mirrors [09:52:55] moritzm: sure thing, I'll pick something else instead [09:52:59] 6operations, 5Continuous-Integration-Isolation: python-openstackclient and python-openstackclient no more available in jessie backports - https://phabricator.wikimedia.org/T110656#1583380 (10hashar) It is not showing up at http://apt.wikimedia.org/wikimedia/pool/backports/p/ The `python-gear` package is found... [09:53:09] jynus: apparently they have been removed from our backports :-/ [09:53:24] <_joe_> valhallasw`cloud: I'm not sure I can parse what you said :P [09:53:37] jynus: I guess I will check with andrewbogott this afternoon. [09:53:40] _joe_: it loaded. Then I resized my browser window and the page crashed. [09:53:49] hmm, the downloaded html file also locks up my browser.... [09:53:49] <_joe_> oh, wtf [09:53:58] <_joe_> thedj: ah! [09:54:00] i'll poke #webkit [09:54:18] seems to be in the skins.vector.styles css [09:54:52] <_joe_> thedj: tried in firefox? [09:55:01] firefox has no issues [09:55:04] <_joe_> valhallasw`cloud: why on that page and not on others? [09:55:11] <_joe_> oh, just chromium then [09:55:49] works just find. it' sounds like a webkit issue [09:55:59] moritzm, did you just pinged yourself? :-P [09:56:28] RECOVERY - puppet last run on mw2023 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [09:56:29] RECOVERY - puppet last run on mw2043 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [09:56:29] RECOVERY - puppet last run on mw2081 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [09:56:39] RECOVERY - puppet last run on mw1215 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:57:02] !sal [09:57:02] https://wikitech.wikimedia.org/wiki/Server_Admin_Log https://tools.wmflabs.org/sal/production See it and you will know all you need. [09:57:19] RECOVERY - puppet last run on mw2021 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:59:22] thedj: does https://tools.wmflabs.org/gerrit-reviewer-bot/test2.html work for you? [09:59:38] yes it does [09:59:59] thedj: that one has just mediawiki.legacy.shared removed from the RECOVERY - puppet last run on analytics1015 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:22:24] (03PS1) 10JanZerebecki: Another CentralAuth double cookie workaround [puppet] - 10https://gerrit.wikimedia.org/r/234510 (https://phabricator.wikimedia.org/T109038) [12:24:45] anyone up for deploying a change that fixes login on wikidata for some users? 
[12:25:06] a varnish change that is [12:25:17] (03PS6) 10Alexandros Kosiaris: maps: ensure PostgreSQL's logs as maps-admin [puppet] - 10https://gerrit.wikimedia.org/r/234273 (https://phabricator.wikimedia.org/T106637) [12:25:22] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] maps: ensure PostgreSQL's logs as maps-admin [puppet] - 10https://gerrit.wikimedia.org/r/234273 (https://phabricator.wikimedia.org/T106637) (owner: 10Alexandros Kosiaris) [12:28:40] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [12:32:50] (03PS1) 10BBlack: ssl_ciphersite: bugfix for apache-2.4.8+ DHE selection [puppet] - 10https://gerrit.wikimedia.org/r/234512 [12:34:06] (03PS1) 10Alexandros Kosiaris: maps: Fix typo in maps-admin name [puppet] - 10https://gerrit.wikimedia.org/r/234513 [12:34:08] bblack: deploy another centralauth cookie workaround, please?: https://gerrit.wikimedia.org/r/234510 [12:34:28] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] maps: Fix typo in maps-admin name [puppet] - 10https://gerrit.wikimedia.org/r/234513 (owner: 10Alexandros Kosiaris) [12:36:13] 10Ops-Access-Requests, 6operations, 6Discovery, 10Maps, and 2 others: Grant sudo on map-tests200* for maps team - https://phabricator.wikimedia.org/T106637#1583885 (10akosiaris) [12:36:31] 10Ops-Access-Requests, 6operations, 6Discovery, 10Maps, and 2 others: Grant sudo on map-tests200* for maps team - https://phabricator.wikimedia.org/T106637#1583886 (10akosiaris) 5Open>3Resolved With the last patched merged, I think this is done. Resolving [12:36:47] (03PS2) 10BBlack: Another CentralAuth double cookie workaround [puppet] - 10https://gerrit.wikimedia.org/r/234510 (https://phabricator.wikimedia.org/T109038) (owner: 10JanZerebecki) [12:36:54] (03CR) 10BBlack: [C: 032] Another CentralAuth double cookie workaround [puppet] - 10https://gerrit.wikimedia.org/r/234510 (https://phabricator.wikimedia.org/T109038) (owner: 10JanZerebecki) [12:37:01] (03CR) 10BBlack: [V: 032] Another CentralAuth double cookie workaround [puppet] - 10https://gerrit.wikimedia.org/r/234510 (https://phabricator.wikimedia.org/T109038) (owner: 10JanZerebecki) [12:37:35] jzerebecki: thanks! [12:38:38] bblack: I have someone next room who can reproduce and can verify if that fixed this case [12:41:34] (03PS1) 10Muehlenhoff: Allow SSH access from puppet master for puppet-merge [puppet] - 10https://gerrit.wikimedia.org/r/234514 [12:43:38] (03CR) 10BBlack: [C: 031] Allow SSH access from puppet master for puppet-merge [puppet] - 10https://gerrit.wikimedia.org/r/234514 (owner: 10Muehlenhoff) [12:44:56] jzerebecki: ok, it will take a few to deploy [12:45:32] (03PS2) 10Jcrespo: Save binary log coordinates from the master and the slave on backup [puppet] - 10https://gerrit.wikimedia.org/r/234503 [12:47:13] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "a small nit, LGTM otherwise" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/234514 (owner: 10Muehlenhoff) [12:49:02] jzerebecki: should be running everywhere now [12:50:13] (03CR) 10Jcrespo: [C: 04-1] "This has not been tested, and it is a critical service. Do not deploy yet." [puppet] - 10https://gerrit.wikimedia.org/r/234503 (owner: 10Jcrespo) [12:50:58] trying [12:54:29] bblack: worked. thx. 
[12:55:38] (03PS2) 10Muehlenhoff: Allow SSH access from puppet master for puppet-merge [puppet] - 10https://gerrit.wikimedia.org/r/234514 [13:01:55] (03PS1) 10BBlack: refactor wikidata CA cookies workarounds a bit [puppet] - 10https://gerrit.wikimedia.org/r/234517 (https://phabricator.wikimedia.org/T109038) [13:18:11] (03CR) 10Chad: "Safe enough. It still doesn't run (needs other config in hiera), but it's at least the right roles now :)" [puppet] - 10https://gerrit.wikimedia.org/r/200625 (https://phabricator.wikimedia.org/T91553) (owner: 10Thcipriani) [13:19:23] (03CR) 10Chad: "Why does it need access to port 22?" [puppet] - 10https://gerrit.wikimedia.org/r/234514 (owner: 10Muehlenhoff) [13:28:35] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "still a small issue" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/234514 (owner: 10Muehlenhoff) [13:32:21] (03PS3) 10Muehlenhoff: Allow SSH access from puppet master for puppet-merge [puppet] - 10https://gerrit.wikimedia.org/r/234514 [13:34:38] _joe_: no that's just the nrpe user unable to read root mounted snapshots [13:35:12] (03PS4) 10Giuseppe Lavagetto: Allow SSH access from puppet master for puppet-merge [puppet] - 10https://gerrit.wikimedia.org/r/234514 (owner: 10Muehlenhoff) [13:35:59] <_joe_> YuviPanda: safely landed? [13:36:03] (03CR) 10Giuseppe Lavagetto: [C: 032] Allow SSH access from puppet master for puppet-merge [puppet] - 10https://gerrit.wikimedia.org/r/234514 (owner: 10Muehlenhoff) [13:37:33] _joe_: yes! [13:38:57] <_joe_> next person that merges a puppet change, lemme know if it's fixed [13:42:17] (03CR) 10Andrew Bogott: "No objection, but, does this work? I would've thought that it would encounter the already-existing account and do nothing." [puppet] - 10https://gerrit.wikimedia.org/r/234483 (owner: 10Hashar) [13:44:19] hashar_: I didn’t mess with packages in jessie-backports, but I believe the new jessie base image doesn’t include the jessie-backports repo by default. Did we maybe lose that repo entirely when I reimaged nodepool? [13:47:48] (03PS4) 10Filippo Giunchedi: Assign swift roles via ENC [puppet] - 10https://gerrit.wikimedia.org/r/200625 (https://phabricator.wikimedia.org/T91553) (owner: 10Thcipriani) [13:47:53] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] Assign swift roles via ENC [puppet] - 10https://gerrit.wikimedia.org/r/200625 (https://phabricator.wikimedia.org/T91553) (owner: 10Thcipriani) [13:48:04] <_joe_> godog: lemme know if puppet-merge works now [13:48:15] _joe_: yup it does, no issues [13:48:31] <_joe_> moritzm: ^^ [13:48:32] <_joe_> :) [13:49:10] 6operations, 5Continuous-Integration-Isolation: python-openstackclient and python-openstackclient no more available in jessie backports - https://phabricator.wikimedia.org/T110656#1584020 (10Andrew) I suspect the issue is that those packages were in the upstream backports repo. Earlier Jessie base images incl... [13:51:19] (03CR) 10Cscott: "Thanks!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/234450 (owner: 10Alex Monk) [13:51:29] (03CR) 10JanZerebecki: [C: 031] refactor wikidata CA cookies workarounds a bit [puppet] - 10https://gerrit.wikimedia.org/r/234517 (https://phabricator.wikimedia.org/T109038) (owner: 10BBlack) [13:51:35] (03CR) 10Cscott: Always use VRS to configure Visual Editor (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/233439 (owner: 10Cscott) [13:51:37] (03CR) 10Giuseppe Lavagetto: Move web::sites to web::prod_sites; begin unification in new class (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/197655 (owner: 10Chad) [13:53:09] (03PS4) 10Ottomata: Improve description for statsistics-privatedata-users group in data.yaml [puppet] - 10https://gerrit.wikimedia.org/r/233985 [13:53:24] (03CR) 10Ottomata: [C: 032 V: 032] Improve description for statsistics-privatedata-users group in data.yaml [puppet] - 10https://gerrit.wikimedia.org/r/233985 (owner: 10Ottomata) [13:54:35] (03PS3) 10Ottomata: Puppetize multiple kafka eventlogging processors on analytics1010 [puppet] - 10https://gerrit.wikimedia.org/r/234415 (https://phabricator.wikimedia.org/T104228) [13:55:04] (03CR) 10JanZerebecki: [C: 031] ssl_ciphersite: bugfix for apache-2.4.8+ DHE selection [puppet] - 10https://gerrit.wikimedia.org/r/234512 (owner: 10BBlack) [14:00:27] 6operations, 10Deployment-Systems: Remove lanthanum.eqiad.wmnet from Trebuchet redis - https://phabricator.wikimedia.org/T110677#1584069 (10hashar) 3NEW [14:05:08] (03PS4) 10Ottomata: Puppetize multiple kafka eventlogging processors on analytics1010 [puppet] - 10https://gerrit.wikimedia.org/r/234415 (https://phabricator.wikimedia.org/T104228) [14:06:24] (03CR) 10Ottomata: [C: 032] Puppetize multiple kafka eventlogging processors on analytics1010 [puppet] - 10https://gerrit.wikimedia.org/r/234415 (https://phabricator.wikimedia.org/T104228) (owner: 10Ottomata) [14:09:46] (03PS1) 10Ottomata: Rename server-side kafka processor to server-side-0 [puppet] - 10https://gerrit.wikimedia.org/r/234532 [14:10:02] (03CR) 10Ottomata: [C: 032 V: 032] Rename server-side kafka processor to server-side-0 [puppet] - 10https://gerrit.wikimedia.org/r/234532 (owner: 10Ottomata) [14:10:32] 6operations, 5Continuous-Integration-Isolation: python-openstackclient and python-openstackclient no more available in jessie backports - https://phabricator.wikimedia.org/T110656#1584116 (10hashar) I noticed our OpenStack puppet manifests used the 'jessie-backports' pinning and found out it had the python mod... [14:11:35] release v=8,o=Wikimedia,a=jessie-wikimedia,n=jessie-wikimedia,l=Wikimedia,c=backports [14:11:42] ^^^ that is why I hate short forms [14:11:51] how can I tell the different between o= and l= :-} [14:13:49] PROBLEM - Check status of defined EventLogging jobs on analytics1010 is CRITICAL: CRITICAL: Stopped EventLogging jobs: processor/server-side-0 processor/client-side-7 processor/client-side-6 processor/client-side-5 processor/client-side-4 processor/client-side-3 processor/client-side-2 processor/client-side-1 processor/client-side-0 [14:14:25] oh psshhh [14:15:58] RECOVERY - Check status of defined EventLogging jobs on analytics1010 is OK: OK: All defined EventLogging jobs are runnning. 
[14:17:22] potentially NSFW but 'psshhhhh' related https://www.youtube.com/watch?v=ms3XjSfCaeI [14:19:20] 6operations: puppet compiler - puppet facts need refreshing - https://phabricator.wikimedia.org/T110546#1584132 (10Joe) Facts refreshed, procedure documented here: https://wikitech.wikimedia.org/wiki/User:Giuseppe_Lavagetto/RefreshPuppetCompiler [14:19:26] 6operations: puppet compiler - puppet facts need refreshing - https://phabricator.wikimedia.org/T110546#1584141 (10Joe) 5Open>3Resolved [14:22:32] (03CR) 10Filippo Giunchedi: Enable debdeploy for an initial set of servers (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/234499 (owner: 10Muehlenhoff) [14:22:56] (03CR) 10Filippo Giunchedi: [C: 031] Add debdeploy master to palladium [puppet] - 10https://gerrit.wikimedia.org/r/234495 (owner: 10Muehlenhoff) [14:24:09] (03PS1) 10Hashar: nodepool: update apt pinning [puppet] - 10https://gerrit.wikimedia.org/r/234535 (https://phabricator.wikimedia.org/T110656) [14:24:49] (03CR) 10Hashar: "On labnodepool1001.eqiad.wmnet apt-cache policy has:" [puppet] - 10https://gerrit.wikimedia.org/r/234535 (https://phabricator.wikimedia.org/T110656) (owner: 10Hashar) [14:26:19] andrewbogott: I think you can now copy the python modules I needed for nodepool. You were right, upstream jessie-backports is no more included [14:26:32] andrewbogott: I also updated the apt::pin in nodepool puppet manifest. [14:27:09] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.69% of data above the critical threshold [1000.0] [14:29:09] (03CR) 10Andrew Bogott: [C: 032] nodepool: update apt pinning [puppet] - 10https://gerrit.wikimedia.org/r/234535 (https://phabricator.wikimedia.org/T110656) (owner: 10Hashar) [14:29:40] hashar: ok, try now? [14:30:26] andrewbogott: gotta run puppet on labnodepool1001.eqiad.wmnet [14:30:32] 6operations, 5Continuous-Integration-Isolation, 5Patch-For-Review: python-openstackclient and python-openstackclient no more available in jessie backports - https://phabricator.wikimedia.org/T110656#1584184 (10Andrew) ok, I grabbed those two packages from debian.org and added them to jessie-wikimedia backpor... [14:30:33] and I lost root access there :\ [14:30:41] hashar: ah, right, ok, I’ll do it [14:32:37] maybe I should add a sudo rule for contint::admins to grant us the ability to run /usr/local/sbin/puppet-run :} [14:33:07] hashar: was puppet installing those two packages before? If so how did it work? [14:33:12] Or, working but wrong versions? [14:33:31] it did before we reimaged the server [14:33:37] I guess we had them from jessie-backports [14:33:43] aka the upstream repo [14:34:06] mind apt-get upgrading on labnodepool1001.eqiad.wmnet ? [14:35:15] (03CR) 10JanZerebecki: [C: 031] mailman: remove gate_news debconf option [puppet] - 10https://gerrit.wikimedia.org/r/234462 (https://phabricator.wikimedia.org/T109925) (owner: 10Dzahn) [14:35:53] uhoh, those packages have a cascade of dependencies [14:37:39] (03PS1) 10Hashar: admin: let contint-admins run puppet [puppet] - 10https://gerrit.wikimedia.org/r/234539 [14:38:03] (03CR) 10JanZerebecki: [C: 031] lists: lower A[AAA] records to 5M [dns] - 10https://gerrit.wikimedia.org/r/233049 (https://phabricator.wikimedia.org/T110132) (owner: 10John F. Lewis) [14:38:30] hashar: think I should copy the dozen or so dependencies into our repo as well? 
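[Editor's note] On hashar's o= versus l= question above: the one-letter keys in that "release ..." pin expression are the apt_preferences(5) shorthands for fields of the repository's Release file — o= is Origin, l= is Label, a= is archive/suite, n= is codename, c= is component, v= is version. A small sketch that expands the exact line quoted at 14:11:35:

```python
# Mapping of apt_preferences(5) release-pin shorthands to Release file fields.
RELEASE_KEYS = {
    "a": "archive / suite",
    "c": "component",
    "l": "label",
    "n": "codename",
    "o": "origin",
    "v": "version",
}

def explain_pin(pin):
    """Expand an apt 'release k=v,...' pin expression into readable fields."""
    for field in pin.split(" ", 1)[1].split(","):
        key, _, value = field.partition("=")
        print("%s=  %-16s %s" % (key, RELEASE_KEYS.get(key, "?"), value))

explain_pin("release v=8,o=Wikimedia,a=jessie-wikimedia,"
            "n=jessie-wikimedia,l=Wikimedia,c=backports")
```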
[14:39:15] if apt-get install fails, I guess so [14:39:28] I think jessie-backports as some more recent openstack version [14:39:36] so potentially we would end up needing them as well :( [14:39:47] or we could add back upstream jessie-backports :-D [14:39:57] saves you the troubles from copy pasting to our repo [14:40:55] btw related to jessie-backports is https://phabricator.wikimedia.org/T107507 [14:42:46] ok, this is going to take a few... [14:46:59] PROBLEM - puppet last run on analytics1015 is CRITICAL: CRITICAL: Puppet has 1 failures [14:49:23] (03CR) 10JanZerebecki: "If I understand correctly then this would remove any A(AAA) record from these domains, thus people with browser or web crawlers see nothin" [dns] - 10https://gerrit.wikimedia.org/r/197361 (owner: 10Dzahn) [14:49:29] (03CR) 10Zfilipin: [C: 031] admin: let contint-admins run puppet [puppet] - 10https://gerrit.wikimedia.org/r/234539 (owner: 10Hashar) [14:51:45] hashar: ok, 11 packages later, things are upgraded. [14:51:54] painful [14:52:54] 6operations, 5Continuous-Integration-Isolation, 5Patch-For-Review: python-openstackclient and python-openstackclient no more available in jessie backports - https://phabricator.wikimedia.org/T110656#1584271 (10Andrew) Here's the complete list of packages I added to our backports repo: python-cinderclient_1.... [14:55:36] (03CR) 10JanZerebecki: [C: 031] varnish: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/211352 (owner: 10Dzahn) [15:00:37] (03PS2) 10BBlack: ssl_ciphersite: bugfix for apache-2.4.8+ DHE selection [puppet] - 10https://gerrit.wikimedia.org/r/234512 [15:01:29] (03CR) 10BBlack: [C: 032] ssl_ciphersite: bugfix for apache-2.4.8+ DHE selection [puppet] - 10https://gerrit.wikimedia.org/r/234512 (owner: 10BBlack) [15:04:01] (03PS1) 10Ottomata: Set up varnishkafka instance on cache servers to log raw client side events to kafka [puppet] - 10https://gerrit.wikimedia.org/r/234543 (https://phabricator.wikimedia.org/T106255) [15:09:08] (03PS1) 10MarcoAurelio: T110674 Update of permissions at Meta-Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/234544 (https://phabricator.wikimedia.org/T110674) [15:10:17] (03PS1) 10Ottomata: Don't alert on stopped eventlogging jobs on analytics1010 [puppet] - 10https://gerrit.wikimedia.org/r/234546 [15:10:55] (03PS2) 10Ottomata: Don't alert on stopped eventlogging jobs on analytics1010 [puppet] - 10https://gerrit.wikimedia.org/r/234546 [15:11:19] RECOVERY - puppet last run on analytics1015 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [15:11:44] (03CR) 10jenkins-bot: [V: 04-1] Don't alert on stopped eventlogging jobs on analytics1010 [puppet] - 10https://gerrit.wikimedia.org/r/234546 (owner: 10Ottomata) [15:14:07] (03PS1) 10Rush: elasticsearch: ferm for 14-17 [puppet] - 10https://gerrit.wikimedia.org/r/234550 [15:14:11] (03CR) 10Milimetric: [C: 031] Don't alert on stopped eventlogging jobs on analytics1010 [puppet] - 10https://gerrit.wikimedia.org/r/234546 (owner: 10Ottomata) [15:14:15] (03CR) 10jenkins-bot: [V: 04-1] elasticsearch: ferm for 14-17 [puppet] - 10https://gerrit.wikimedia.org/r/234550 (owner: 10Rush) [15:14:17] (03PS2) 10Rush: elasticsearch: ferm for 14-17 [puppet] - 10https://gerrit.wikimedia.org/r/234550 [15:14:50] (03PS1) 10BBlack: ssl_ciphersuite: disable DHE for apache for now [puppet] - 10https://gerrit.wikimedia.org/r/234551 [15:16:17] (03PS3) 10Ottomata: Don't alert on stopped eventlogging jobs on analytics1010 [puppet] - 10https://gerrit.wikimedia.org/r/234546 [15:16:49] 
(03CR) 10BBlack: [C: 032] ssl_ciphersuite: disable DHE for apache for now [puppet] - 10https://gerrit.wikimedia.org/r/234551 (owner: 10BBlack) [15:17:39] (03PS3) 10Rush: elasticsearch: ferm for 14-17 [puppet] - 10https://gerrit.wikimedia.org/r/234550 [15:18:05] (03PS2) 10Muehlenhoff: Enable debdeploy for an initial set of servers [puppet] - 10https://gerrit.wikimedia.org/r/234499 [15:18:30] (03CR) 10Rush: [C: 032 V: 032] elasticsearch: ferm for 14-17 [puppet] - 10https://gerrit.wikimedia.org/r/234550 (owner: 10Rush) [15:19:40] (03PS4) 10Ottomata: Don't alert on stopped eventlogging jobs on analytics1010 [puppet] - 10https://gerrit.wikimedia.org/r/234546 [15:19:56] (03CR) 10Ottomata: [C: 032 V: 032] Don't alert on stopped eventlogging jobs on analytics1010 [puppet] - 10https://gerrit.wikimedia.org/r/234546 (owner: 10Ottomata) [15:20:19] RECOVERY - carbon-cache too many creates on graphite1001 is OK: OK: Less than 1.00% above the threshold [500.0] [15:22:32] 6operations, 6Phabricator, 7Database, 5Patch-For-Review: Phabricator creates MySQL connection spikes - https://phabricator.wikimedia.org/T109279#1584314 (10greg) >>! In T109279#1583345, @akosiaris wrote: >>>! In T109279#1582035, @mmodell wrote: >> Every time I type in the comment box, phabricator makes an... [15:23:06] !log ferm for elasticsearch10[14-17] [15:23:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:31:51] 6operations: migrate policy.wikimedia.org from WMF cluster to Wordpress - https://phabricator.wikimedia.org/T110203#1584336 (10RobH) I've had the questions back to Simon/Wordpress support now for over 24 hours, and no reply back advising how to send them the policy.wikimedia.org.key file securely. So far, I've... [15:36:36] (03PS2) 10BBlack: refactor wikidata CA cookies workarounds a bit [puppet] - 10https://gerrit.wikimedia.org/r/234517 (https://phabricator.wikimedia.org/T109038) [15:36:45] (03CR) 10BBlack: [C: 032 V: 032] refactor wikidata CA cookies workarounds a bit [puppet] - 10https://gerrit.wikimedia.org/r/234517 (https://phabricator.wikimedia.org/T109038) (owner: 10BBlack) [15:36:48] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet). [15:37:29] PROBLEM - Outgoing network saturation on labstore1003 is CRITICAL: CRITICAL: 15.38% of data above the critical threshold [100000000.0] [15:38:48] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [15:39:08] (03CR) 10Filippo Giunchedi: [C: 031] Enable debdeploy for an initial set of servers [puppet] - 10https://gerrit.wikimedia.org/r/234499 (owner: 10Muehlenhoff) [15:39:17] 6operations, 10Citoid, 10Graphoid, 6Services, 10service-template-node: SCA services should not use a proxy for our domains - https://phabricator.wikimedia.org/T97530#1584359 (10mobrovac) [15:41:06] (03PS1) 10Hashar: (WIP) admin: support members aliasing (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/234554 [15:41:49] (03CR) 10jenkins-bot: [V: 04-1] (WIP) admin: support members aliasing (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/234554 (owner: 10Hashar) [15:42:00] 6operations, 5Continuous-Integration-Isolation, 5Patch-For-Review: python-openstackclient and python-openstackclient no more available in jessie backports - https://phabricator.wikimedia.org/T110656#1584374 (10hashar) a:3Andrew hashar@labnodepool1001:~$ apt-cache policy python-novaclient python-novaclient:... 
[15:42:23] 6operations, 5Continuous-Integration-Isolation, 5Patch-For-Review: python-openstackclient and python-openstackclient no more available in jessie backports - https://phabricator.wikimedia.org/T110656#1584376 (10hashar) 5Open>3Resolved Looks all fine now: ``` hashar@labnodepool1001:~$ apt-cache policy pyth... [15:43:19] PROBLEM - puppet last run on mw2042 is CRITICAL: CRITICAL: puppet fail [15:48:20] (03PS2) 10Hashar: (WIP) admin: support members aliasing (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/234554 [15:52:35] (03PS3) 10Dzahn: mailman: remove gate_news debconf option [puppet] - 10https://gerrit.wikimedia.org/r/234462 (https://phabricator.wikimedia.org/T109925) [15:53:25] (03PS4) 10Dzahn: mailman: no gate_news debconf option on >= jessie [puppet] - 10https://gerrit.wikimedia.org/r/234462 (https://phabricator.wikimedia.org/T109925) [15:53:50] (03CR) 10Dzahn: [C: 032] mailman: no gate_news debconf option on >= jessie [puppet] - 10https://gerrit.wikimedia.org/r/234462 (https://phabricator.wikimedia.org/T109925) (owner: 10Dzahn) [15:57:34] (03PS3) 10Hashar: (WIP) admin: support members aliasing (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/234554 [15:59:39] RECOVERY - Outgoing network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0] [15:59:42] (03CR) 10Dzahn: "http://puppet-compiler.wmflabs.org/850/" [puppet] - 10https://gerrit.wikimedia.org/r/234462 (https://phabricator.wikimedia.org/T109925) (owner: 10Dzahn) [15:59:50] (03PS1) 10Rush: elasticsearch: ferm for [8-9][10-13] [puppet] - 10https://gerrit.wikimedia.org/r/234557 [15:59:56] (03CR) 10jenkins-bot: [V: 04-1] elasticsearch: ferm for [8-9][10-13] [puppet] - 10https://gerrit.wikimedia.org/r/234557 (owner: 10Rush) [16:00:09] (03PS2) 10Rush: elasticsearch: ferm for [8-9][10-13] [puppet] - 10https://gerrit.wikimedia.org/r/234557 [16:02:15] lol, thanks o'puppet [16:02:17] Error: invalid byte sequence in UTF-8 [16:02:37] (03CR) 10Rush: [C: 032] elasticsearch: ferm for [8-9][10-13] [puppet] - 10https://gerrit.wikimedia.org/r/234557 (owner: 10Rush) [16:02:52] JohnFLewis: hey :) [16:03:22] JohnFLewis: [/etc/mailman/fr/listinfo.html]/content .. invalid byte sequence in UTF-8 :p [16:03:44] this shows up now since the former issue is fixed [16:03:50] !log ferm for elasticsearch10(0[8-9|1[0-13]) [16:03:54] for some reason it doesnt care about it on sodium [16:03:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:05:26] mutante: I think file sequence may be wrong (well encoding). I think Jessie is more verbose than lucid? [16:06:13] yea, we gotta check all the templates or something [16:09:09] first let's see the other puppet error disappear though [16:11:38] RECOVERY - puppet last run on mw2042 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [16:13:40] (03CR) 10Hashar: "Pretty sure it will do, puppet is supposed to ensure the resource is up-to-date." [puppet] - 10https://gerrit.wikimedia.org/r/234483 (owner: 10Hashar) [16:15:04] JohnFLewis: https://phabricator.wikimedia.org/P1944 :p [16:18:15] 6operations, 10Wikimedia-Mailing-lists, 5Patch-For-Review: apply regular lists role on fermium and confirm no issues - https://phabricator.wikimedia.org/T109925#1584475 (10Dzahn) [16:18:39] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet). 
[16:18:53] godog: if you're around, I've been playing with sending data directly to graphite, and I see that it shows up as just the metric, instead of having 6 different things like count, min, max etc under it. Is this because I'm not sending it via statsd, or is it a configuration thing in graphite? [16:18:55] 6operations, 10Wikimedia-Mailing-lists, 5Patch-For-Review: apply regular lists role on fermium and confirm no issues - https://phabricator.wikimedia.org/T109925#1562942 (10Dzahn) now the next issue is: Error: invalid byte sequence in UTF-8 from templates --> T110695 [16:19:24] madhuvishy: that's correct, no statsd thus no derived metrics [16:20:07] godog: aah, alright. can you point me to code where this is configured? [16:21:21] madhuvishy: this you mean the statsd derived metrics? [16:21:30] godog: ah yes [16:26:04] madhuvishy: so we use statsite to implement statsd, https://github.com/wikimedia/operations-puppet/blob/production/modules/statsite/templates/statsite.ini.erb what would you like to do? [16:27:16] (03PS1) 10Rush: phab: disable tools crons [puppet] - 10https://gerrit.wikimedia.org/r/234563 [16:28:29] (03PS1) 10Muehlenhoff: Disable ferm on the puppetmaster backends [puppet] - 10https://gerrit.wikimedia.org/r/234564 [16:28:53] (03CR) 10Krinkle: [C: 031] admin: let contint-admins run puppet [puppet] - 10https://gerrit.wikimedia.org/r/234539 (owner: 10Hashar) [16:29:36] (03CR) 10Muehlenhoff: [C: 032 V: 032] Disable ferm on the puppetmaster backends [puppet] - 10https://gerrit.wikimedia.org/r/234564 (owner: 10Muehlenhoff) [16:31:55] (03CR) 10Merlijn van Deen: [C: 04-1] "Thanks!" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/232285 (owner: 10Andrew Bogott) [16:32:28] <_joe_> moritzm: what happened? [16:32:38] <_joe_> why are you disabling the firewalls there? [16:33:04] _joe_: there's still plenty of dropped packets during the puppet-merges [16:33:26] will properly sort that out Monday morning [16:33:34] <_joe_> moritzm: yeah, better that way [16:34:18] PROBLEM - Host strontium is DOWN: PING CRITICAL - Packet loss = 100% [16:36:09] (03PS1) 10Dzahn: mailman: listinfo templates from ISO-8859 to UTF-8 [puppet] - 10https://gerrit.wikimedia.org/r/234565 (https://phabricator.wikimedia.org/T110695) [16:36:38] PROBLEM - puppet last run on db1065 is CRITICAL: CRITICAL: Puppet has 1 failures [16:36:39] PROBLEM - puppet last run on mw1067 is CRITICAL: CRITICAL: Puppet has 1 failures [16:37:19] PROBLEM - puppet last run on elastic1002 is CRITICAL: CRITICAL: Puppet has 1 failures [16:37:41] (03PS2) 10Dzahn: mailman: listinfo templates from ISO-8859 to UTF-8 [puppet] - 10https://gerrit.wikimedia.org/r/234565 (https://phabricator.wikimedia.org/T110695) [16:38:09] PROBLEM - puppet last run on mw2205 is CRITICAL: CRITICAL: puppet fail [16:38:09] PROBLEM - puppet last run on elastic1005 is CRITICAL: CRITICAL: Puppet has 1 failures [16:38:19] PROBLEM - puppet last run on mw2074 is CRITICAL: CRITICAL: Puppet has 1 failures [16:38:49] PROBLEM - puppet last run on mw2169 is CRITICAL: CRITICAL: Puppet has 1 failures [16:39:18] PROBLEM - puppet last run on mw2034 is CRITICAL: CRITICAL: Puppet has 1 failures [16:41:06] (03CR) 10John F. 
Lewis: [C: 031] mailman: listinfo templates from ISO-8859 to UTF-8 [puppet] - 10https://gerrit.wikimedia.org/r/234565 (https://phabricator.wikimedia.org/T110695) (owner: 10Dzahn) [16:41:09] PROBLEM - puppet last run on wtp1009 is CRITICAL: CRITICAL: puppet fail [16:41:19] PROBLEM - puppet last run on neptunium is CRITICAL: CRITICAL: puppet fail [16:41:50] PROBLEM - puppet last run on cp3049 is CRITICAL: CRITICAL: puppet fail [16:42:29] PROBLEM - puppet last run on mw2108 is CRITICAL: CRITICAL: puppet fail [16:43:38] RECOVERY - Host strontium is UP: PING OK - Packet loss = 0%, RTA = 1.88 ms [16:45:38] PROBLEM - puppetmaster https on palladium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:46:08] PROBLEM - Outgoing network saturation on labstore1003 is CRITICAL: CRITICAL: 15.38% of data above the critical threshold [100000000.0] [16:49:00] PROBLEM - puppet last run on strontium is CRITICAL: CRITICAL: puppet fail [16:49:11] <_joe_> what's happening? [16:49:25] <_joe_> oh just strontium being back and needing some love [16:49:25] 502 Proxy Error [16:49:39] RECOVERY - puppetmaster https on palladium is OK: HTTP OK: Status line output matched 400 - 335 bytes in 3.977 second response time [16:50:34] 6operations, 10Continuous-Integration-Infrastructure, 6Discovery, 7Elasticsearch, 5Patch-For-Review: elasticsearch 1.6.0 fails to start after reboot - https://phabricator.wikimedia.org/T109497#1584579 (10dduvall) >>! In T109497#1583546, @hashar wrote: > Or we can remove ElasticSearch from the Jenkins CI... [16:51:18] PROBLEM - puppet last run on mw1229 is CRITICAL: CRITICAL: puppet fail [16:51:58] PROBLEM - puppet last run on mw2175 is CRITICAL: CRITICAL: puppet fail [16:54:48] RECOVERY - puppet last run on db1065 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:55:08] RECOVERY - puppet last run on strontium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:56:17] (03PS3) 10Dzahn: mailman: listinfo templates from ISO-8859 to UTF-8 [puppet] - 10https://gerrit.wikimedia.org/r/234565 (https://phabricator.wikimedia.org/T110695) [16:58:19] PROBLEM - puppet last run on analytics1015 is CRITICAL: CRITICAL: Puppet has 1 failures [17:02:34] RECOVERY - puppet last run on elastic1002 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [17:02:55] RECOVERY - puppet last run on wtp1009 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [17:03:05] RECOVERY - puppet last run on mw2034 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [17:03:44] RECOVERY - puppet last run on mw1067 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [17:03:55] RECOVERY - puppet last run on elastic1005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:04:05] RECOVERY - puppet last run on cp3049 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:04:14] RECOVERY - puppet last run on mw2169 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [17:04:15] RECOVERY - puppet last run on mw2108 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [17:04:15] RECOVERY - puppet last run on mw2205 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [17:04:32] (03PS1) 10Dzahn: mailman: fix "no gate_news on jessie" [puppet] - 10https://gerrit.wikimedia.org/r/234573 [17:04:45] RECOVERY - puppet last run on neptunium is 
OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:04:46] RECOVERY - puppet last run on mw1229 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:05:54] RECOVERY - puppet last run on mw2175 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:06:05] RECOVERY - puppet last run on mw2074 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:06:24] 6operations, 6Phabricator, 7Database, 5Patch-For-Review: Phabricator creates MySQL connection spikes - https://phabricator.wikimedia.org/T109279#1584629 (10mmodell) It has to validate your session, potentially it has to look up various references to inline references. [17:06:27] (03PS1) 10RobH: ldap-eqiad.wikimedia.org cert renewal [puppet] - 10https://gerrit.wikimedia.org/r/234576 [17:07:11] (03CR) 10John F. Lewis: [C: 031] mailman: fix "no gate_news on jessie" [puppet] - 10https://gerrit.wikimedia.org/r/234573 (owner: 10Dzahn) [17:08:16] RECOVERY - Outgoing network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0] [17:08:28] !log bouncing Cassandra on restbase1001 to apply default (puppet-managed) settings [17:08:34] (03PS1) 10RobH: ldap-codfw.wikimedia.org cert renewal [puppet] - 10https://gerrit.wikimedia.org/r/234577 [17:08:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:09:15] RECOVERY - Last backup of the tools filesystem on labstore1002 is OK: OK - Last run successful [17:09:55] RECOVERY - Cassandra database on restbase1001 is OK: PROCS OK: 1 process with UID = 111 (cassandra), command name java, args CassandraDaemon [17:10:05] !log catrope@tin Synchronized php-1.26wmf20/extensions/Flow/includes/Parsoid/Utils.php: T110676 (duration: 00m 13s) [17:10:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:11:15] RECOVERY - puppet last run on analytics1015 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [17:11:55] PROBLEM - Disk space on labstore1002 is CRITICAL: DISK CRITICAL - /run/lock/storage-replicate-labstore-tools/snapshot is not accessible: Permission denied [17:18:40] (03PS1) 10Muehlenhoff: Query the puppetmaster from hiera instead of $::serverip [puppet] - 10https://gerrit.wikimedia.org/r/234578 [17:25:51] 6operations, 10ops-ulsfo: connect ulsfo side of ulsfo-eqdfw connection - https://phabricator.wikimedia.org/T109788#1584758 (10RobH) 5Open>3Resolved I did this last Friday, and it illuminated (possibly due to the dwdm gear involved) [17:32:02] (03CR) 1020after4: [C: 031] phab: disable tools crons [puppet] - 10https://gerrit.wikimedia.org/r/234563 (owner: 10Rush) [17:32:09] (03PS2) 1020after4: phab: disable tools crons [puppet] - 10https://gerrit.wikimedia.org/r/234563 (owner: 10Rush) [17:43:58] (03PS1) 10Muehlenhoff: remove stray line [puppet] - 10https://gerrit.wikimedia.org/r/234583 [17:44:20] (03CR) 10Muehlenhoff: [C: 032 V: 032] remove stray line [puppet] - 10https://gerrit.wikimedia.org/r/234583 (owner: 10Muehlenhoff) [17:45:40] regarding https://phabricator.wikimedia.org/T104747 (updating tmh* videoscalers to trusty) -- is there any likelihood of getting it tested on beta cluster or should I just push to get them put into production rotation directly and see if they explode in production? :D [17:45:59] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. 
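To make the statsd question from earlier concrete (madhuvishy's 16:18 exchange with godog): a metric written straight to Graphite's plaintext port stays a single flat series, while the same measurement sent through statsd/statsite gets aggregated into count, min, max, mean and percentile series. A minimal sketch, with placeholder hostnames, ports and metric names rather than the production endpoints:

```python
# Minimal sketch (not WMF's actual tooling): the difference between writing a
# datapoint straight to Graphite's plaintext (carbon) port and sending a sample
# through statsd. Hostnames, ports and metric names are placeholders.
import socket
import time

GRAPHITE = ('graphite.example.org', 2003)   # carbon plaintext TCP port
STATSD   = ('statsd.example.org', 8125)     # statsd/statsite UDP port

def graphite_raw(path, value):
    """One datapoint; Graphite stores it as a single series and creates no
    derived count/min/max/percentile series."""
    msg = '%s %f %d\n' % (path, value, int(time.time()))
    sock = socket.create_connection(GRAPHITE)
    sock.sendall(msg.encode('ascii'))
    sock.close()

def statsd_timing(name, ms):
    """One timing sample; the statsd implementation (statsite in production)
    aggregates samples and emits count, min, max, mean, percentiles, etc."""
    msg = '%s:%d|ms' % (name, ms)
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(msg.encode('ascii'), STATSD)
    sock.close()

graphite_raw('test.example.raw_value', 42.0)     # -> one flat series
statsd_timing('test.example.request_time', 42)   # -> test.example.request_time.{count,min,max,...}
```

The derived series come from the aggregation step in statsite (the statsite.ini.erb template linked above), not from anything Graphite itself does.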
[17:46:53] brion: if the puppet work is all done then it should be fairly simple to build a new VM in beta cluster to replace the current one [17:47:22] I've never been sure that the videoscaler instance in beta cluster actually does anything though :/ [17:47:44] bd808: it doesn't, something's broken :( [17:47:44] (other than occasionally spam itself with jobrunner failures) [17:48:09] so it's probably easier to upgrade to new folgers crystals and see if anyone notices ;) [17:48:45] fixing beta cluster to test something better is almost never a waste of time [17:48:50] true :D [17:49:55] bd808: yeah I queued a couple videos a couple days ago and they just haven't run: http://en.wikipedia.beta.wmflabs.org/wiki/Special:TimedMediaHandler [17:50:34] (03PS2) 10Dzahn: mailman: fix "no gate_news on jessie" [puppet] - 10https://gerrit.wikimedia.org/r/234573 [17:51:07] (03CR) 10Dzahn: [C: 032] mailman: fix "no gate_news on jessie" [puppet] - 10https://gerrit.wikimedia.org/r/234573 (owner: 10Dzahn) [17:51:41] brion: want to wander over to #wikimedia-releng and see if we can figure out what to do about it? [17:52:03] sure! [17:53:54] 6operations, 10Deployment-Systems, 10Traffic: Varnish cache busting desired for /static/$VERSION/ resources which change within the lifetime of a WMF release branch - https://phabricator.wikimedia.org/T99096#1584876 (10mmodell) >>! In T99096#1583708, @Krinkle wrote: > > Using the `/w/static.php` approach we... [17:55:01] 6operations, 10Beta-Cluster, 10MediaWiki-extensions-TimedMediaHandler, 6Multimedia, and 2 others: Convert tmh100[12] to HHVM and trusty - https://phabricator.wikimedia.org/T104747#1584883 (10greg) >>! In T104747#1568015, @brion wrote: > Adding beta-cluster project for fixing/updating TMH video scaler job r... [17:55:11] 6operations, 10Beta-Cluster, 10MediaWiki-extensions-TimedMediaHandler, 6Multimedia, and 2 others: Convert tmh100[12] to HHVM and trusty - https://phabricator.wikimedia.org/T104747#1584885 (10greg) [17:58:07] 6operations, 10MediaWiki-extensions-TimedMediaHandler, 6Multimedia, 7HHVM, 5Patch-For-Review: Convert tmh100[12] to HHVM and trusty - https://phabricator.wikimedia.org/T104747#1584888 (10greg) [17:59:06] (03PS3) 10Rush: phab: disable tools crons [puppet] - 10https://gerrit.wikimedia.org/r/234563 [18:00:25] (03CR) 10Rush: [C: 032] phab: disable tools crons [puppet] - 10https://gerrit.wikimedia.org/r/234563 (owner: 10Rush) [18:01:36] 6operations, 10ContentTranslation-Deployments, 10ContentTranslation-cxserver, 10Parsoid, and 3 others: Decom parsoid-lb.eqiad.wikimedia.org entrypoint - https://phabricator.wikimedia.org/T110474#1584895 (10Jdforrester-WMF) I'm totally happy for the Parsoid endpoint to be replaced when RESTbase provides the... 
[18:05:54] (03PS1) 10Dzahn: mailman: convert French listinfo template [puppet] - 10https://gerrit.wikimedia.org/r/234589 (https://bugzilla.wikimedia.org/110695) [18:07:08] (03PS2) 10Dzahn: mailman: convert French listinfo template [puppet] - 10https://gerrit.wikimedia.org/r/234589 (https://bugzilla.wikimedia.org/110695) [18:08:13] (03PS3) 10Dzahn: mailman: convert French listinfo template [puppet] - 10https://gerrit.wikimedia.org/r/234589 (https://bugzilla.wikimedia.org/110695) [18:08:22] (03CR) 10Dzahn: [C: 032] mailman: convert French listinfo template [puppet] - 10https://gerrit.wikimedia.org/r/234589 (https://bugzilla.wikimedia.org/110695) (owner: 10Dzahn) [18:09:02] !log updating ldap-codfw cert [18:09:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:09:20] (03CR) 10RobH: [C: 032] ldap-codfw.wikimedia.org cert renewal [puppet] - 10https://gerrit.wikimedia.org/r/234577 (owner: 10RobH) [18:09:25] (03PS2) 10RobH: ldap-codfw.wikimedia.org cert renewal [puppet] - 10https://gerrit.wikimedia.org/r/234577 [18:09:54] (03CR) 10RobH: [V: 032] ldap-codfw.wikimedia.org cert renewal [puppet] - 10https://gerrit.wikimedia.org/r/234577 (owner: 10RobH) [18:11:03] RECOVERY - puppet last run on fermium is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [18:11:36] JohnFLewis: only fixing the French one was enough to make the whole run succesful [18:11:56] JohnFLewis: first time it finished without any errors on jessie [18:12:08] Heh and awesome :) [18:15:52] (03Abandoned) 10Dzahn: mailman: listinfo templates from ISO-8859 to UTF-8 [puppet] - 10https://gerrit.wikimedia.org/r/234565 (https://phabricator.wikimedia.org/T110695) (owner: 10Dzahn) [18:17:09] 6operations, 10Wikimedia-Mailing-lists, 5Patch-For-Review: apply regular lists role on fermium and confirm no issues - https://phabricator.wikimedia.org/T109925#1584911 (10Dzahn) [18:17:27] (03PS4) 10Rush: icinga: watch for the existence of certain html [puppet] - 10https://gerrit.wikimedia.org/r/234139 [18:18:29] (03CR) 10Rush: [C: 032] icinga: watch for the existence of certain html [puppet] - 10https://gerrit.wikimedia.org/r/234139 (owner: 10Rush) [18:18:42] PROBLEM - Outgoing network saturation on labstore1003 is CRITICAL: CRITICAL: 26.92% of data above the critical threshold [100000000.0] [18:18:44] 6operations: Run assert check to verify the existence of certain texts in the footer - https://phabricator.wikimedia.org/T108081#1584915 (10chasemp) [18:18:55] 6operations, 10ContentTranslation-Deployments, 10ContentTranslation-cxserver, 10Parsoid, and 3 others: Decom parsoid-lb.eqiad.wikimedia.org entrypoint - https://phabricator.wikimedia.org/T110474#1584916 (10cscott) (RESTBase does have a wt2html endpoint but the input is a single line text input widget, whic... [18:18:57] 6operations, 10Wikimedia-Mailing-lists: Mailman Upgrade (Jessie & Mailman 2.x) and migration to a VM - https://phabricator.wikimedia.org/T105756#1584919 (10Dzahn) [18:18:59] 6operations, 10Wikimedia-Mailing-lists, 5Patch-For-Review: apply regular lists role on fermium and confirm no issues - https://phabricator.wikimedia.org/T109925#1584917 (10Dzahn) 5Open>3Resolved fixed that too. now puppet finishes without any warnings or errors :) [fermium:~] $ puppet agent -tv Info: Re... 
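The "convert French listinfo template" change just merged boils down to re-encoding a template file from ISO-8859-1 to UTF-8 so that Puppet on jessie stops failing with "invalid byte sequence in UTF-8" when it reads the file content. Roughly the following one-off conversion, with an illustrative filename rather than the exact template path:

```python
# Rough equivalent of the template re-encoding (what iconv would do);
# the path below is illustrative, not the exact file touched by the patch.
import io

src = 'listinfo.html'

with io.open(src, 'r', encoding='iso-8859-1') as f:
    text = f.read()

with io.open(src, 'w', encoding='utf-8') as f:
    f.write(text)
```

After the one-time conversion, Puppet simply ships the UTF-8 file as before.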
[18:23:14] 6operations, 10ContentTranslation-Deployments, 10ContentTranslation-cxserver, 10Parsoid, and 3 others: Decom parsoid-lb.eqiad.wikimedia.org entrypoint - https://phabricator.wikimedia.org/T110474#1584924 (10BBlack) In any case (and this point has been confusing throughout the life of parsoidcache): the pars... [18:24:29] 6operations, 10ContentTranslation-Deployments, 10ContentTranslation-cxserver, 10Parsoid, and 3 others: Decom parsoid-lb.eqiad.wikimedia.org entrypoint - https://phabricator.wikimedia.org/T110474#1584927 (10GWicke) @jdforrester-wmf, @cscott: It sounds like the only missing bit to make the API easier to test... [18:35:21] (03PS1) 10Rush: icinga: apply icinga::monitor::legal to role [puppet] - 10https://gerrit.wikimedia.org/r/234590 [18:36:18] (03PS2) 10Rush: icinga: apply icinga::monitor::legal to role [puppet] - 10https://gerrit.wikimedia.org/r/234590 [18:37:22] (03PS1) 10Cmjohnson: Adding the remaining Hadoop nodes an1053 and an1057 [puppet] - 10https://gerrit.wikimedia.org/r/234591 [18:37:24] (03CR) 10Rush: [C: 032] icinga: apply icinga::monitor::legal to role [puppet] - 10https://gerrit.wikimedia.org/r/234590 (owner: 10Rush) [18:37:31] 6operations: Run assert check to verify the existence of certain texts in the footer - https://phabricator.wikimedia.org/T108081#1584948 (10chasemp) [18:42:20] yeehaw [18:42:25] RECOVERY - Outgoing network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0] [18:51:23] (03PS1) 10MarcoAurelio: Creating closed-labs.dblist and closing eswikibeta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/234594 (https://phabricator.wikimedia.org/T109157) [18:51:59] 6operations, 10ContentTranslation-Deployments, 10ContentTranslation-cxserver, 10Parsoid, and 3 others: Decom parsoid-lb.eqiad.wikimedia.org entrypoint - https://phabricator.wikimedia.org/T110474#1578472 (10cscott) @BBlack it's the "outside-world access to it" which is the primary issue, since that's used a... [18:52:58] 6operations, 10ContentTranslation-Deployments, 10ContentTranslation-cxserver, 10Parsoid, and 3 others: Decom parsoid-lb.eqiad.wikimedia.org entrypoint - https://phabricator.wikimedia.org/T110474#1585054 (10cscott) [18:54:26] 6operations, 10ContentTranslation-Deployments, 10ContentTranslation-cxserver, 10Parsoid, and 3 others: Decom parsoid-lb.eqiad.wikimedia.org entrypoint - https://phabricator.wikimedia.org/T110474#1578472 (10Ciencia_Al_Poder) [18:55:04] 6operations, 10ContentTranslation-Deployments, 10ContentTranslation-cxserver, 10Parsoid, and 3 others: Decom parsoid-lb.eqiad.wikimedia.org entrypoint - https://phabricator.wikimedia.org/T110474#1585078 (10Ciencia_Al_Poder) [18:56:02] 6operations, 6Services, 10Traffic: Define a standardized config mechanism for exposing services through the varnish - https://phabricator.wikimedia.org/T110717#1585084 (10BBlack) 3NEW [18:56:15] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). 
[18:56:22] 6operations, 6Services, 10Traffic: Define a standardized config mechanism for exposing services through varnish - https://phabricator.wikimedia.org/T110717#1585095 (10BBlack) [18:57:28] (03CR) 10Dzahn: [C: 032] "http://puppet-compiler.wmflabs.org/851/" [puppet] - 10https://gerrit.wikimedia.org/r/234025 (owner: 10Dzahn) [18:57:35] (03PS2) 10Dzahn: analytics: do not use node inheritance [puppet] - 10https://gerrit.wikimedia.org/r/234025 [18:57:53] (03CR) 10Dzahn: [C: 032] analytics: do not use node inheritance [puppet] - 10https://gerrit.wikimedia.org/r/234025 (owner: 10Dzahn) [18:58:15] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [19:00:42] (03PS1) 10Rush: icinga: dupe resource name fix for legal [puppet] - 10https://gerrit.wikimedia.org/r/234598 [19:00:48] (03CR) 10jenkins-bot: [V: 04-1] icinga: dupe resource name fix for legal [puppet] - 10https://gerrit.wikimedia.org/r/234598 (owner: 10Rush) [19:01:07] (03PS2) 10Rush: icinga: dupe resource name fix for legal [puppet] - 10https://gerrit.wikimedia.org/r/234598 [19:02:18] (03CR) 10Rush: [C: 032] icinga: dupe resource name fix for legal [puppet] - 10https://gerrit.wikimedia.org/r/234598 (owner: 10Rush) [19:03:55] PROBLEM - puppet last run on neon is CRITICAL: CRITICAL: puppet fail [19:07:08] RECOVERY - puppet last run on neon is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:08:12] (03PS1) 10BryanDavis: beta: Replace deployment-videoscaler01 with deployment-tmh01 [puppet] - 10https://gerrit.wikimedia.org/r/234599 [19:09:02] 6operations, 10Security-Reviews, 7Surveys: Re-evaluate Limesurvey - https://phabricator.wikimedia.org/T109606#1585156 (10egalvezwmf) Is anyone working on this? Thanks! [19:09:40] (03PS2) 10Cmjohnson: Adding the remaining Hadoop nodes an1053 and an1057 [puppet] - 10https://gerrit.wikimedia.org/r/234591 [19:10:33] (03CR) 10Cmjohnson: [C: 032] Adding the remaining Hadoop nodes an1053 and an1057 [puppet] - 10https://gerrit.wikimedia.org/r/234591 (owner: 10Cmjohnson) [19:11:53] (03PS1) 10Yurik: Added maps-cluster referer rules (e.g. Phab) [puppet] - 10https://gerrit.wikimedia.org/r/234600 [19:12:02] ottomata: the remaining servers for analytics have been replaced and ready for install [19:13:19] (03PS1) 10Rush: icinga: fix host def for en.m.wp.o-legal-html check [puppet] - 10https://gerrit.wikimedia.org/r/234602 [19:13:25] (03CR) 10jenkins-bot: [V: 04-1] icinga: fix host def for en.m.wp.o-legal-html check [puppet] - 10https://gerrit.wikimedia.org/r/234602 (owner: 10Rush) [19:13:32] (03PS2) 10Rush: icinga: fix host def for en.m.wp.o-legal-html check [puppet] - 10https://gerrit.wikimedia.org/r/234602 [19:14:58] (03CR) 10Rush: [C: 032] icinga: fix host def for en.m.wp.o-legal-html check [puppet] - 10https://gerrit.wikimedia.org/r/234602 (owner: 10Rush) [19:23:57] 6operations, 10Security-Reviews, 7Surveys: Re-evaluate Limesurvey - https://phabricator.wikimedia.org/T109606#1585193 (10csteipp) >>! In T109606#1568517, @Moushira wrote: > @Dzahn, please check: http://wikimedia.limeservice.org.. To clarify, are you saying you're asking if we can run this hosted on limese... [19:24:42] wooohoo, tahnks cmjohnson1 will get to those next week [19:37:04] 6operations, 10Security-Reviews, 7Surveys: Re-evaluate Limesurvey - https://phabricator.wikimedia.org/T109606#1585291 (10egalvezwmf) I think @moushira would be able to answer more clearly. But the goal here is; we want to have the option to use limesurvey because it is and open source survey software. 
We cur... [20:01:43] 6operations, 10Security-Reviews, 7Surveys: Re-evaluate Limesurvey - https://phabricator.wikimedia.org/T109606#1585389 (10csteipp) So the two uses (hosted ourselves vs hosted at another organization) have very different requirements. If we're hosting this ourselves, then we need to do a security review of th... [20:04:01] !log catrope@tin Synchronized php-1.26wmf20/resources/src/mediawiki.legacy/shared.css: T110716 (duration: 00m 12s) [20:04:07] James_F: --^^ [20:04:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:05:43] RoanKattouw: Thank you. [20:07:00] PROBLEM - Certificate expiration on nembus is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:IO::Socket::SSL: connect: Connection refused [20:08:30] PROBLEM - LDAP on nembus is CRITICAL: Connection refused [20:08:41] PROBLEM - LDAPS on nembus is CRITICAL: Connection refused [20:10:30] RECOVERY - LDAP on nembus is OK: TCP OK - 0.052 second response time on port 389 [20:10:42] RECOVERY - LDAPS on nembus is OK: TCP OK - 0.052 second response time on port 636 [20:10:52] (03PS3) 10Dzahn: mailman: fix I/O monitoring on virtual machine [puppet] - 10https://gerrit.wikimedia.org/r/234456 (https://phabricator.wikimedia.org/T109925) [20:13:32] (03CR) 10John F. Lewis: [C: 031] mailman: fix I/O monitoring on virtual machine [puppet] - 10https://gerrit.wikimedia.org/r/234456 (https://phabricator.wikimedia.org/T109925) (owner: 10Dzahn) [20:15:54] (03PS1) 10Hashar: admin: remove dupe 'haithams' from statistics-users [puppet] - 10https://gerrit.wikimedia.org/r/234669 [20:16:00] PROBLEM - puppet last run on analytics1015 is CRITICAL: CRITICAL: Puppet has 1 failures [20:16:12] (03CR) 10BryanDavis: "Cherry-picked to deployment-puppetmaster" [puppet] - 10https://gerrit.wikimedia.org/r/234599 (owner: 10BryanDavis) [20:16:22] PROBLEM - LDAP on nembus is CRITICAL: Connection refused [20:19:53] (03PS1) 10Hashar: admin: add 'demon' to gerrit-admins group [puppet] - 10https://gerrit.wikimedia.org/r/234670 [20:20:22] RECOVERY - LDAP on nembus is OK: TCP OK - 0.052 second response time on port 389 [20:20:51] RECOVERY - Certificate expiration on nembus is OK: SSL OK - Certificate ldap-codfw.wikimedia.org valid until 2016-10-20 19:36:03 +0000 (expires in 418 days) [20:21:50] (03PS4) 10Dzahn: mailman: fix I/O monitoring on virtual machine [puppet] - 10https://gerrit.wikimedia.org/r/234456 (https://phabricator.wikimedia.org/T109925) [20:22:46] (03CR) 10Hashar: "Added to next PuppetSWAT https://wikitech.wikimedia.org/wiki/Deployments" [puppet] - 10https://gerrit.wikimedia.org/r/234669 (owner: 10Hashar) [20:22:54] (03CR) 10Hashar: "Added to next PuppetSWAT https://wikitech.wikimedia.org/wiki/Deployments" [puppet] - 10https://gerrit.wikimedia.org/r/234670 (owner: 10Hashar) [20:23:21] (03PS5) 10Dzahn: mailman: fix I/O monitoring on virtual machine [puppet] - 10https://gerrit.wikimedia.org/r/234456 (https://phabricator.wikimedia.org/T109925) [20:24:17] (03PS6) 10Dzahn: mailman: fix I/O monitoring on virtual machine [puppet] - 10https://gerrit.wikimedia.org/r/234456 (https://phabricator.wikimedia.org/T109925) [20:24:51] (03CR) 10Hashar: "Added to next PuppetSWAT https://wikitech.wikimedia.org/wiki/Deployments" [puppet] - 10https://gerrit.wikimedia.org/r/234539 (owner: 10Hashar) [20:25:01] (03CR) 10Dzahn: [C: 032] "tested the is_virtual detection on fermium (ganeti/true) vs sodium (metal/false)" [puppet] - 10https://gerrit.wikimedia.org/r/234456 (https://phabricator.wikimedia.org/T109925) (owner: 
10Dzahn) [20:28:38] (03CR) 10Dzahn: "should fix the "UNKNOWN" status in Icinga @ https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=fermium&service=mailman+I%" [puppet] - 10https://gerrit.wikimedia.org/r/234456 (https://phabricator.wikimedia.org/T109925) (owner: 10Dzahn) [20:28:59] 6operations, 10Analytics-Cluster, 10hardware-requests: Hadoop worker node procurement - 2015 - https://phabricator.wikimedia.org/T100442#1585462 (10Cmjohnson) [20:29:04] 6operations, 10ops-eqiad, 10Analytics-Cluster, 5Patch-For-Review: rack new hadoop worker nodes - https://phabricator.wikimedia.org/T104463#1585460 (10Cmjohnson) 5Open>3Resolved All have been racked and setup...the replacements 1053 and 1057 have not been installed yet. I will leave that up to @ottomata [20:37:13] hashar, you have just "Already deployed on integration puppet master:" in puppet swat? :) [20:37:56] as it's own entry [20:38:28] 6operations, 10Wikimedia-Site-Requests, 5Patch-For-Review: Rename "chapcomwiki" to "affcomwiki" - https://phabricator.wikimedia.org/T41482#1585504 (10Varnent) Yay! *happy dance* Thank you! Yes - an update to project namespace and favicon change would be wonderful. :) [20:41:58] (03PS2) 10Andrew Bogott: ldap-eqiad.wikimedia.org cert renewal [puppet] - 10https://gerrit.wikimedia.org/r/234576 (owner: 10RobH) [20:42:31] RECOVERY - puppet last run on analytics1015 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:42:52] (03CR) 10Andrew Bogott: [C: 032] ldap-eqiad.wikimedia.org cert renewal [puppet] - 10https://gerrit.wikimedia.org/r/234576 (owner: 10RobH) [20:44:46] Krenair: bah [20:45:05] Krenair: that was to split the patches in two groups :/ [21:02:21] RECOVERY - Certificate expiration on neptunium is OK: SSL OK - Certificate ldap-eqiad.wikimedia.org valid until 2016-10-20 19:41:02 +0000 (expires in 418 days) [21:04:14] 6operations, 7Wikimedia-log-errors: Memcached TIMEOUT error spam from memcached log for global:slave_lag keys - https://phabricator.wikimedia.org/T108982#1585580 (10hashar) Steady rate of 5K errors per hour since at least August 22h. [21:04:35] 6operations, 7Wikimedia-log-errors: Memcached TIMEOUT error spam from memcached log for global:slave_lag keys - https://phabricator.wikimedia.org/T108982#1585584 (10hashar) [21:05:26] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1002 is CRITICAL: CRITICAL - Expecting active but unit is failed [21:06:12] 6operations, 10fundraising-tech-ops: package udp-filter for Trusty, for use on fundraising banner_logger - https://phabricator.wikimedia.org/T110592#1585598 (10Jgreen) [21:06:55] 6operations: reformulate kafkatee package to work with Trusty - https://phabricator.wikimedia.org/T110591#1585605 (10Jgreen) [21:07:38] 6operations: reformulate kafkatee package to work with Trusty - https://phabricator.wikimedia.org/T110591#1585610 (10faidon) Can't we just start upgrading fundraising systems to jessie? [21:09:36] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1002 is OK: OK - create-dbusers is active [21:11:45] (03PS1) 10Smalyshev: Set batch size to default for WDQS [puppet] - 10https://gerrit.wikimedia.org/r/234673 [21:16:41] 6operations, 10fundraising-tech-ops: package udp-filter for Trusty, for use on fundraising banner_logger - https://phabricator.wikimedia.org/T110592#1585632 (10Jgreen) p:5Triage>3Normal [21:19:13] gwicke: got opinions on what I should call this ticket i'm about to create? i think i shoudl cretae a new phab project for it. 
[21:19:21] scalable events system is kind of a bad name [21:19:44] i could call it 'Stream Data Platform' as that is what confluent is calling it, but hm, 'stream' is kind of a loaded word [21:21:33] So is platform [21:21:40] And data... [21:21:42] :D [21:21:53] ottomata: we have https://phabricator.wikimedia.org/T84923 [21:22:12] oh yeah! [21:22:23] could refactor that one [21:22:34] or create a new one, whichever you prefer [21:22:43] looking [21:22:57] the name is still fairly generic, although i like it better than mine [21:23:16] I think it's good to separate the core of it from the use cases [21:23:47] indeed [21:23:53] i guess this is a use case [21:24:51] 6operations: reformulate kafkatee package to work with Trusty - https://phabricator.wikimedia.org/T110591#1585641 (10Jgreen) >>! In T110591#1585610, @faidon wrote: > Can't we just start upgrading fundraising systems to jessie? No I don't think that's feasible this year. [21:25:43] ottomata: looking at the summary, it's currently very focused on the job queue stuff [21:25:57] Jeff_Green: 2015 or FY15-16? [21:26:08] ja, ok, i will make a new one as a blocker for that [21:26:16] 6operations, 10fundraising-tech-ops: reformulate kafkatee package to work with Trusty - https://phabricator.wikimedia.org/T110591#1585651 (10Jgreen) [21:26:22] ottomata: cool, might end up being a smaller edit distance [21:26:53] gwicke: Unified Event Service [21:26:54] ? [21:26:55] Jeff_Green: investing work into supporting a now deprecated system (upstart) is just wasted work, basically :) [21:27:34] well [21:28:07] 6operations, 7HTTPS, 7LDAP: ldap-codfw.wikimedia.org & ldap-eqiad.wikimedia.org expire in September 2015 - https://phabricator.wikimedia.org/T106604#1585655 (10Andrew) 5Open>3Resolved Certs are now up to date and valid through Oct 20 19:41:02 2016 GMT Getting the new certs in place was a gigantic pain.... [21:28:08] Event Service [21:28:09] just that? [21:28:10] HMmM [21:28:11] ? [21:28:39] madhuvishy: ^^^ ? [21:29:01] paravoid: i think you probably have a pretty good idea pxe -> build -> puppet and all the related crud -> maintenance implications of switching to debian [21:29:13] ottomata: hmmm, it's alright, not grand enough :P [21:29:20] Unified Event Service? [21:29:52] * gwicke likes 'event bus' [21:30:12] Unified Event Bus? drop unified? [21:30:13] paravoid: making that package support upstart (or at least not require systemd) shouldn't be very hard [21:30:39] Jeff_Green: yes, the difference is that one is something that will need to be done eventually anyway, while the other is just going to be wasted work :) [21:31:02] but your choice, I won't push to either direction :) [21:31:04] btw, this naming is all for phabricator [21:31:04] ottomata: I think 'event bus' already implies some amount of uninfication [21:31:09] k [21:31:10] *unification [21:31:11] ottomata: can we find a greek word :P [21:31:13] paravoid: agreed, but it's not by any means as simple as that math [21:31:31] Grand Event Bus [21:31:32] haha [21:31:40] Grand Central Event Bus [21:31:40] i'm cool with Event Bus [21:31:48] 6operations, 10Wikimedia-Mailing-lists, 5Patch-For-Review: apply regular lists role on fermium and confirm no issues - https://phabricator.wikimedia.org/T109925#1585658 (10Dzahn) Apache config still needs adjustment for 2.4 "client denied by server configuration" for the monitoring checks etc [21:32:00] paravoid: plus I've worked enough with systemd to half-expect it to be the next upstart :-) [21:32:01] that sounds a lot like Single Point Of ... 
[21:32:07] ottomata: yeah EventBus seems fine [21:32:14] Jeff_Green: not sure what you mean? [21:32:24] (03PS1) 10Dzahn: mailman: make Apache config 2.4 compatible [puppet] - 10https://gerrit.wikimedia.org/r/234676 (https://bugzilla.wikimedia.org/109925) [21:32:36] haha [21:33:35] paravoid: I mean, it doesn't look like the holy grail to me, it's messy and imperfect, at least so far as I've seen [21:34:04] paravoid: maybe it will become excellent over time, I just feel jaded about it [21:34:24] whether you like systemd or not though, the reality is that upstart is dead [21:34:46] 6operations, 10Datasets-General-or-Unknown: Add App Guidelines on Dumps Page - https://phabricator.wikimedia.org/T110742#1585667 (10Aklapper) a:5Qgil>3None [21:34:54] and that was the argument, not that systemd is better or worse [21:35:18] (current ubuntu uses systemd as well now, so will the next LTS release) [21:35:32] paravoid: yeah I know, everyone has gone that way [21:35:35] (03CR) 10John F. Lewis: [C: 031] mailman: make Apache config 2.4 compatible [puppet] - 10https://gerrit.wikimedia.org/r/234676 (https://bugzilla.wikimedia.org/109925) (owner: 10Dzahn) [21:36:04] paravoid: with upstart I just use old-fashioned init scripts and ignore the horror [21:36:26] you can do the same with (Debian's) systemd but I wouldn't recommend it [21:37:12] anyway, back to the original question [21:37:42] "can we do a couple hours of work to make the package usable" vs "can we add a couple-month project right now" [21:38:11] come on, it's not a couple-month project [21:38:23] then why aren't we done in production yet? [21:39:58] ottomata, paravoid: http://kafka.apache.org/documentation.html#compaction [21:40:05] that's the deduplication I meant [21:41:12] it's a slow process, though [21:41:51] 6operations, 10Wikimedia-Site-Requests, 5Patch-For-Review: Rename "chapcomwiki" to "affcomwiki" - https://phabricator.wikimedia.org/T41482#1585695 (10Keegan) >>! In T41482#1585504, @Varnent wrote: > > > Oh my. [21:43:40] ah, aye, yeah. hm, i knew of that feature but hadn't thought about it for deduplication [21:44:21] it wouldn't really work for your use case though, as likely duplicates due to produce hiccups are likely to happen at near the same time, and your consumers will want realtime updates [21:46:40] gwicke: i'm writing up this ticket, and now in our meeting, i think we should have talked about why confluent's stuff won't work for us. [21:46:48] all i have right now is the lack of jsonschema support [21:46:53] (not considering authentication) [21:47:24] generally the schema support is weak [21:47:37] hm, and I guess i'm really only thinking about production of messages too [21:47:38] you need to pass in your own schema (ick!), or a schema id [21:47:38] oh? [21:47:42] ah [21:47:46] right, you want topic == schema [21:47:47] right? [21:47:55] clients shouldn't care about such implementation details [21:48:06] i mean, schema id isn't so bad [21:48:14] it complicates clients a lot [21:48:15] because you can ask registry for latest schema id [21:48:19] but ja [21:48:30] also, not sure what happens if you get that wrong [21:48:53] it sounds like you could just use another schema id, and happily inject invalid events [21:48:58] 6operations, 10Datasets-General-or-Unknown: Add App Guidelines on Dumps Page - https://phabricator.wikimedia.org/T110742#1585708 (10VBaranetsky) Great! thanks. 
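On the log-compaction link above: compaction keeps only the newest record per message key, so it deduplicates by key over time rather than catching the near-simultaneous duplicate produces being worried about here. A sketch of what that looks like, assuming a recent kafka-python client and broker (placeholder broker address and topic name; the clients available at the time of this discussion had different APIs):

```python
# Sketch of key-based compaction. With cleanup.policy=compact, Kafka eventually
# keeps only the newest record per key; it is slow, background deduplication,
# not protection against duplicate produce requests arriving close together.
from kafka import KafkaProducer
from kafka.admin import KafkaAdminClient, NewTopic

BROKER = 'localhost:9092'  # placeholder

admin = KafkaAdminClient(bootstrap_servers=BROKER)
admin.create_topics([NewTopic(
    name='page-state',          # hypothetical compacted topic
    num_partitions=1,
    replication_factor=1,
    topic_configs={'cleanup.policy': 'compact'},
)])

producer = KafkaProducer(bootstrap_servers=BROKER)
# Two updates for the same key; once compaction has run, only the second survives.
producer.send('page-state', key=b'enwiki:Main_Page', value=b'{"rev": 1}')
producer.send('page-state', key=b'enwiki:Main_Page', value=b'{"rev": 2}')
producer.flush()
```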
[21:49:35] gwicke: it would throw an error if the the schema id isn't registered with the registry i think [21:49:51] yeah, I was thinking about a valid schema, but from another topic [21:49:53] yes, but i think it will let you produce any valid schema to any topic [21:50:16] it's not that unlikely to mix up schema ids in a client [21:50:24] gwicke: i think the rest proxy produces the schema id with the message [21:50:48] if you consume through rest proxy (or i think an avro console consumer that they provide), it will get the schema from the registry when consuming [21:51:02] so, not good, but ok because at least consumption is possible [21:51:06] with mixed schemas in a topic [21:51:26] supporting a mix of schemas sounds rather silly, imho [21:51:41] but ja, I will note this limitation, it would be good to force a topic to map to a schema [21:51:46] agree [21:53:16] (03PS2) 10Dzahn: mailman: make Apache config 2.4 compatible [puppet] - 10https://gerrit.wikimedia.org/r/234676 (https://phabricator.wikimedia.org/T109925) [21:55:56] gwicke: other gripes? [21:59:07] ottomata: the other gripes are about the consumer side [21:59:16] https://phabricator.wikimedia.org/T88459#1493911 [22:00:06] right, [22:00:25] btw, gwicke rest proxy now has json support, just no jsonschema validation [22:00:51] arbitrary json? [22:01:20] I thought it was limited to Avro-flavored json [22:01:30] which didn't look too natural [22:02:36] its new [22:02:46] https://github.com/confluentinc/kafka-rest/pull/89 [22:03:05] 6operations, 10Wikimedia-Site-Requests, 5Patch-For-Review: Rename "chapcomwiki" to "affcomwiki" - https://phabricator.wikimedia.org/T41482#1585730 (10Varnent) I admit - I am not yet sure how I feel about these memes..but...felt it was worth it here. ;) The first two I tried were....more alarming... (why are... [22:04:32] ottomata: that looks nicer [22:04:56] RECOVERY - Disk space on labstore1002 is OK: DISK OK [22:05:09] !log disabling puppet on tin for a few minutes to test an ssh-agent-proxy change [22:05:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:10:33] hm, gwicke that stateful consumer thing is kind of hard to address [22:10:43] if you were to implement that, how would you do it? [22:11:20] ottomata: I think in many cases it's better to let the client maintain its offset [22:12:22] basically, make the consumer stateless, and make the client pass in opaque page state which lets us iterate over partitions & offsets [22:13:01] ah, so a request for offsets from a particular partition? [22:13:03] like: [22:13:29] gimme 1 message from topic jobqueue partition 0 starting at offset 100 [22:13:30] ? [22:13:43] (03CR) 10Faidon Liambotis: [C: 031] mailman: make Apache config 2.4 compatible [puppet] - 10https://gerrit.wikimedia.org/r/234676 (https://phabricator.wikimedia.org/T109925) (owner: 10Dzahn) [22:14:23] ottomata: I think i [22:14:26] grr [22:14:53] I think I'd expose timestamps primarily, and leave the partition business to advanced use cases [22:15:04] that is complicated [22:15:11] the kafka api doesn't work with timestamps [22:15:15] PROBLEM - puppet last run on analytics1015 is CRITICAL: CRITICAL: Puppet has 1 failures [22:15:19] so you'd have to implement some kind of state to keep track of that [22:15:22] (03PS2) 10BryanDavis: beta: Replace deployment-videoscaler01 with deployment-tmh01 [puppet] - 10https://gerrit.wikimedia.org/r/234599 [22:15:27] it does not have failuers! gah. i need to fix that. 
i decommed an15 but haven't finished [22:15:52] anyway, ja gwicke, the kafka api works with offsets [22:16:05] you can consume from beginning, end or somewhere in the middle of a stream, but it is all based on message offset [22:16:09] I know, but IIRC there is a way to use timestamps too [22:16:18] don't think so... [22:16:29] googling! [22:17:04] I might misremember, googling as well ;) [22:18:21] oh, I think messages do have a timestamp, but there is no efficient way to jump to one (except using binary search) [22:18:51] right, you can put whatever you want in the message key or value [22:18:53] the node client does support limiting on the timestamp of events [22:18:55] including a timestamp [22:19:10] oh ja? how's it know where to start reading? [22:19:16] https://github.com/SOHU-Co/kafka-node#fetchpayloads-cb [22:19:17] or does it just read it all and filter for what you want? [22:19:27] it only limits on the *end* of the range [22:19:43] ah, i see [22:19:49] which means that it starts reading from the beginning, and probably looks at each event's timestamp until the limit is reached [22:20:02] what timestamp is it using? some internally maintained kafka one? like time kafka receives message? [22:20:39] 6operations, 7Surveys: Upload survey dataset to dumps.wikimedia.org - https://phabricator.wikimedia.org/T110746#1585755 (10Tbayer) 3NEW a:3Dzahn [22:20:39] https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Enriched+Message+Metadata [22:21:48] (03PS3) 10BryanDavis: beta: Replace deployment-videoscaler01 with deployment-tmh01 [puppet] - 10https://gerrit.wikimedia.org/r/234599 (https://phabricator.wikimedia.org/T110707) [22:21:59] 6operations, 7Surveys: Upload survey dataset to dumps.wikimedia.org - https://phabricator.wikimedia.org/T110746#1585772 (10Tbayer) [22:22:00] (03PS4) 10BryanDavis: beta: Replace deployment-videoscaler01 with deployment-tmh01 [puppet] - 10https://gerrit.wikimedia.org/r/234599 (https://phabricator.wikimedia.org/T110707) [22:22:15] hm, gwicke but that is not in kafka now, that is just an idea? [22:22:22] https://cwiki.apache.org/confluence/display/KAFKA/FAQ#FAQ-HowdoIaccuratelygetoffsetsofmessagesforacertaintimestampusingOffsetRequest? [22:22:55] hm, gwicke http://kafka.apache.org/documentation.html#simpleconsumerapi [22:23:00] public kafka.javaapi.OffsetResponse getOffsetsBefore(OffsetRequest request) [22:23:02] so there is built-in support for doing at least approximate retrieval by timestamp [22:23:20] "Kafka allows querying offsets of messages by time and it does so at segment granularity. " [22:23:37] huh! interesting. [22:24:21] ottomata: an1037 down? [22:24:23] the timestamp is added by the broker, afaik [22:24:24] so, what, you would use that to consume based on the last timestamp [22:25:24] bah [22:26:07] 6operations, 10Wikimedia-Site-Requests, 5Patch-For-Review: Rename "chapcomwiki" to "affcomwiki" - https://phabricator.wikimedia.org/T41482#1585783 (10Krenair) >>! In T41482#1585504, @Varnent wrote: > Yes - an update to project namespace and favicon change would be wonderful. :) Would you like to provide a n... [22:26:26] PROBLEM - Keyholder SSH agent on tin is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it. [22:26:36] (03PS1) 10Alex Monk: Update project namespace name and favicon on affcom.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/234679 (https://phabricator.wikimedia.org/T41482) [22:26:41] gwicke: so if you wrote it, your job queue consumer would prefer that?
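The offset-based reads being discussed ("give me one message from topic jobqueue, partition 0, starting at offset 100") look roughly like this with a recent kafka-python consumer; broker address, topic name, partition and offset are placeholders:

```python
# Sketch of stateless, offset-based consumption: the client, not Kafka,
# tracks its position and passes it back in. Assumes a recent kafka-python
# client; names and numbers are illustrative.
from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(
    bootstrap_servers='localhost:9092',
    enable_auto_commit=False,       # the caller keeps the offset, not a consumer group
    consumer_timeout_ms=5000,       # stop iterating if nothing arrives
)

tp = TopicPartition('jobqueue', 0)  # hypothetical topic, partition 0
consumer.assign([tp])
consumer.seek(tp, 100)              # start at offset 100, as in the example above

msg = next(consumer)                # one message; msg.offset + 1 is the next "page state"
print(msg.offset, msg.value)

# Newer brokers (>= 0.10.1) can also map a timestamp to an offset directly via
# consumer.offsets_for_times({tp: some_epoch_ms}); the 0.8-era getOffsetsBefore()
# mentioned above only resolves timestamps at log-segment granularity.
```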
[22:27:01] to say "gimme one message starting at last_time_I_consumed"? [22:27:11] or last offset, if that's available [22:27:20] but, there are applications where the timestamp is actually useful [22:28:00] the Java client has an "offsetsBeforeTime" method as well: http://supergsego.com/apache/kafka/0.8.2-beta/java-doc/org/apache/kafka/clients/consumer/KafkaConsumer.html [22:28:12] (03CR) 10BryanDavis: "A good candidate for the next PuppetSWAT®" [puppet] - 10https://gerrit.wikimedia.org/r/234599 (https://phabricator.wikimedia.org/T110707) (owner: 10BryanDavis) [22:29:24] ottomata: from my pov, there are two parts: 1) the first request, and 2) paging from there [22:29:25] !liog powercycling analytics1037 after it went down (why u go down?!) [22:29:45] so, first request says from end? [22:30:09] hm, but, i think that's why offset management is good. your consumer can die, and then start back up from where it left off [22:30:10] ottomata: for 1), I could see passing in a timestamp or offset/partition to be useful [22:30:12] by committing offsets [22:30:26] for 2), we can just use an opaque page state [22:30:46] RECOVERY - Keyholder SSH agent on tin is OK: OK: Keyholder is armed with all configured keys. [22:30:51] this is for rest consumers [22:30:56] not for streaming ones [22:31:17] gwicke: maybe jobqueue consuming via rest is not the best solution? [22:31:33] actually, for most of these use cases, rest consumption is kinda awkward, no? [22:32:04] possibly [22:32:09] 6operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-Requests, 5Patch-For-Review: Rename "be-x-old" to "be-tarask" - https://phabricator.wikimedia.org/T11823#1585808 (10Krenair) a:3Krenair [22:32:16] RECOVERY - Host analytics1037 is UP: PING WARNING - Packet loss = 64%, RTA = 18.83 ms [22:33:10] ottomata: it's all that's offered in the confluence proxy though, so I described that first [22:33:16] hm, aye. [22:33:22] websockets are probably nicer for many clients [22:33:34] gwicke: maybe we should limit our MVP to not consider consumer support, and leave that up to the consumer [22:33:40] since much of that will be implementation specific [22:33:49] allow consumers to consume from kafka [22:33:58] using whatever kafka client [22:34:00] the job stuff might warrant using kafka directly [22:34:57] you ok with me putting that in the ticket? that MVP Event Bus will not support consumption via http api? [22:35:36] yeah, although we should make it clear that it'll be added soon [22:35:46] Real Soon Now™ [22:35:57] at least for internal consumers [22:36:18] aye, a generic kafka -> websockets consumer would be pretty cool [22:36:35] yeah [22:36:46] I could actually seee node work okay for that use case [22:36:50] yeah for sure [22:40:40] 6operations, 10Analytics, 6Discovery, 10MediaWiki-General-or-Unknown, and 5 others: Reliable publish / subscribe event bus - https://phabricator.wikimedia.org/T84923#933968 (10Ottomata) [22:40:46] RECOVERY - puppet last run on analytics1015 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [22:42:07] (03PS1) 10Brion VIBBER: Use ffmpeg instead of avconv on labs beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/234681 (https://phabricator.wikimedia.org/T110707) [22:43:11] ottomata: I actually wonder if using confluence on the producer side is worth it [22:43:19] gwicke: howso? 
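For context on the producer-side question raised here: a produce request through the Confluent REST proxy (kafka-rest, v1 API) carries the Avro schema, or a schema id returned by an earlier request, alongside the records, which is exactly the per-producer coupling being debated. Roughly the following, with an illustrative proxy host, topic and schema (the kafka-rest documentation is authoritative on the exact format):

```python
# Roughly what producing through the Confluent REST proxy looks like under its
# v1 API; host, topic and schema are illustrative, and details should be
# checked against the kafka-rest docs.
import json
import requests

PROXY = 'http://kafka-rest.example.org:8082'   # placeholder

value_schema = json.dumps({
    'type': 'record',
    'name': 'TestEvent',
    'fields': [{'name': 'page', 'type': 'string'}],
})

resp = requests.post(
    PROXY + '/topics/test-events',
    headers={'Content-Type': 'application/vnd.kafka.avro.v1+json'},
    data=json.dumps({
        # The Avro schema (or a previously returned value_schema_id) rides
        # along with every produce request; that is the coupling discussed above.
        'value_schema': value_schema,
        'records': [{'value': {'page': 'Main_Page'}}],
    }),
)
print(resp.status_code, resp.json())   # response echoes offsets and a schema id
```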
[22:43:20] 6operations, 10Datasets-General-or-Unknown: Add App Guidelines on Dumps Page - https://phabricator.wikimedia.org/T110742#1585863 (10Krenair) perhaps something at https://dumps.wikimedia.org/legal.html too? [22:43:25] we'd have to store the schema - topic mapping somewhere [22:43:42] and once we do that, we might as well just store the schemas [22:43:54] (03CR) 10BryanDavis: [C: 032] Use ffmpeg instead of avconv on labs beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/234681 (https://phabricator.wikimedia.org/T110707) (owner: 10Brion VIBBER) [22:44:01] (03Merged) 10jenkins-bot: Use ffmpeg instead of avconv on labs beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/234681 (https://phabricator.wikimedia.org/T110707) (owner: 10Brion VIBBER) [22:44:45] ottomata: there are other api bits like the topic creation etc, but those are also liabilities from a security pov [22:44:50] gwicke: that's the schema registry, no? [22:45:13] oh i see, the mapping is not there. but it could be! :) [22:45:21] that's one of the things we were going to look into, right? [22:45:27] https://phabricator.wikimedia.org/T110750?workflow=110748 [22:45:49] ottomata: lets look into what the pros / cons of using confluence on the producer side are [22:45:54] k [22:46:44] gwicke: imo the main pros are that schema validation and evolution support [22:46:52] I'm not completely sold on it providing a huge advantage for your use case, but lets just do some basic benchmarking & code browsing [22:47:09] and that it is already written! [22:47:15] and supported by a larger community [22:47:37] (03PS1) 10Alex Monk: Add link to developer app guidelines from dumps pages footer [puppet] - 10https://gerrit.wikimedia.org/r/234685 (https://phabricator.wikimedia.org/T110742) [22:47:47] gwicke: if we did go with avro, and not jsonschema, the only limitation is one that *could* be mostly avoided by convention. [22:48:00] i.e. the topic <-> schema mapping [22:49:57] PROBLEM - Keyholder SSH agent on tin is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:51:21] !log bd808@tin Synchronized wmf-config/CommonSettings-labs.php: Use ffmpeg instead of avconv on labs beta (I250fe33) (duration: 06m 05s) [22:51:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:51:47] that was soooo slooooow [22:51:56] RECOVERY - Keyholder SSH agent on tin is OK: OK: Keyholder is armed with all configured keys. [22:52:10] I think the time has come to tell scap to always use ipv4 addresses [22:53:03] gwicke: have you left me ?! :) [22:56:40] (03PS1) 10BryanDavis: Force use of IPv4 addresses with ssh and rsync [tools/scap] - 10https://gerrit.wikimedia.org/r/234687 [22:56:46] PROBLEM - puppet last run on uranium is CRITICAL: CRITICAL: Puppet has 1 failures [22:58:02] bd808: uhm... that sounds wrong [22:58:07] why was it slow? [22:58:19] ottomata: sorry, was chatting with madhuvishy IRL [22:58:38] IRL BUT WE ARE ALL REMOTEES! [22:58:42] paravoid: I *think* because we have dual stack hosts with no AAAA record responses [22:58:54] ottomata: jaja [22:59:05] bd808: can you elaborate? 
I don't understand thisdddd [22:59:15] (03CR) 10BryanDavis: "I finally got one of the horribly slow sync-file runs that twentyafterfour has complained of before (06:05 for wmf-config/CommonSettings-l" [tools/scap] - 10https://gerrit.wikimedia.org/r/234687 (owner: 10BryanDavis) [22:59:57] ottomata: I'm keen on making the interfaces as obvious, fool-proof and backend-agnostic as possible, and not exposing things like schema ids is important for that, imho [23:01:04] paravoid: basically creating the initial ssh connections are slow to establish. I'm not sure if it is from tin trying to resolve mwXXXX or (more likely) mwXXXX trying to resolve the IPv6 address that tin connected from to stick in wtmp [23:01:52] neither of these should be slow [23:01:57] gwicke: why is schema id bad though? that's how eventlogging works [23:01:58] revision id [23:02:10] not sure how you are going to do versioned schemas without a revision [23:02:20] ottomata: the client should only need to know the topic name [23:02:37] I'm looking at the logs from that sync and all the rsyncs took >=00:01 to run but the whole thing took 6 minutes [23:02:49] ottomata: if the id is passed in, then it can be faked & confused [23:03:02] and it adds complexity to each producer [23:03:21] gwicke: the producer should know what schema they have to make the event into though [23:03:34] so they are going to have to have a reference to the schema somewhere [23:03:51] yeah, but the proxy should know what the schema should be for the topic [23:03:57] clients shouldn't be able to make that up [23:04:36] if a topic format changes in incompatible ways, then messages from outdated producers should be rejected [23:04:48] paravoid: can you think of something else in the ssh connection handshake that would introduce a second or two of lag and only happen when it had been quite a while in wall clock time since the last scap run? [23:05:30] gwicke: i think that's a fair point, i think that it is work around able, but we can look to see how we could implement that in confluent. but still, the producer is going to have to know more than just the topic [23:05:45] bd808: nothing in particular no... maybe memory pages swapped off? [23:06:01] I know it stated happening around the same time that we suddenly had firewall issues on tin because it was blocking IPv6 connections to the trebuchet http server [23:06:25] is that fixed? :) [23:06:25] hence my "Ah it's ipv6" reaction [23:06:28] yes [23:06:30] most producers will likely have the schema in their source in some way, which I htink is why confluent supports sneding the schema with the produce request, and normalizing all schemas that look the same [23:06:41] it won't create a new schema id for a schema that is the same as one it already has [23:08:00] ottomata: clients can do what they want, but the proxy should verify all messages to one specific schema per topic [23:08:45] gwicke: that's fine, and we can look into enforcing that. i guess i'm still asking about your requirement that client's don't know the schema [23:08:51] ottomata: I think it's worth benchmarking a simple rest service for the producer use case [23:10:02] performance benchmarking gwicke? might be a little early for that, no? [23:10:24] no, I think it's good to do that early to get a feel for the relative performance [23:10:28] ok, hm, gwicke, i'm asking these questions because my default is: if confluent serves our use cases, we should use it. 
less effort and custom code to maintain [23:10:50] but, implementing this is kinda complicated. schema validation, evolution, kafka production, etc. [23:10:53] if a 100 line service performs comparable to confluence, then we have more options [23:11:22] (03PS1) 10Tim Landscheidt: Tools: Replace reference to tools. in class toollabs [puppet] - 10https://gerrit.wikimedia.org/r/234688 (https://phabricator.wikimedia.org/T87387) [23:11:32] schema validation is a library call [23:11:44] where are the schemas stored? [23:11:44] (03PS2) 10Alex Monk: Update project namespace name on affcom.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/234679 (https://phabricator.wikimedia.org/T41482) [23:11:45] and JSON schema is not supported in confluence anyway [23:12:26] PROBLEM - puppet last run on mw1042 is CRITICAL: CRITICAL: Puppet has 1 failures [23:12:49] ottomata, re storage: for eventlogging, currently they are on the wiki [23:12:59] (03PS1) 10Tim Landscheidt: Tools: Replace reference to tools. in class toollabs::checker [puppet] - 10https://gerrit.wikimedia.org/r/234689 (https://phabricator.wikimedia.org/T87387) [23:13:12] ottomata: for the production events, a config should be fine [23:13:14] ha, ok, so we implement a rest endpoint for the eventlogging python lib? [23:13:17] 10Ops-Access-Requests, 6operations: Requesting research DB access for Alex Monk - https://phabricator.wikimedia.org/T110754#1585952 (10Krenair) 3NEW [23:13:57] (03PS1) 10Tim Landscheidt: Tools: Replace reference to tools. in project-make-access [puppet] - 10https://gerrit.wikimedia.org/r/234690 (https://phabricator.wikimedia.org/T87387) [23:14:00] ottomata: for any http client [23:14:12] aye, i mean a rest server using eventlogging code [23:14:40] that's not what I was thinking of, no [23:14:55] perhaps worth a try, but I'd be sceptical about its performance [23:15:43] in agile terms, I guess I'm proposing to spend a couple hours on a 'spike' [23:15:46] (03PS1) 10Tim Landscheidt: Tools: Replace reference to tools. in motd-tips.sh [puppet] - 10https://gerrit.wikimedia.org/r/234692 (https://phabricator.wikimedia.org/T87387) [23:15:49] 10Ops-Access-Requests, 6operations: Requesting research DB access for Alex Monk - https://phabricator.wikimedia.org/T110754#1585967 (10Jdforrester-WMF) +1 sort-of manager sign-off [23:16:16] gwicke: that's fair. you are the nodejs man though. I'm an infrastructure glue kinda guy myself :) [23:16:57] ottomata: do we have a kafka cluster in labs that we can use for testing? [23:17:15] gwicke: sorta, there is an instance in deployment-prep that we kidna maintain, but there is no official cluster. [23:17:17] it is easy to spawn one though [23:17:32] there is one right now in the analytics cluster that i was using to test the kafka upgrade [23:17:41] but it is a one off [23:18:04] not sure what version the deployment-prep one is using [23:18:06] probably shoudln't use that [23:18:14] gwicke: i can make a little cluster for us [23:18:22] ottomata: that would be great! [23:18:26] what project? [23:18:44] I could throw up a simple rest server that writes events to a topic [23:18:53] plus json schema validation [23:19:25] ottomata: could use the services project [23:19:44] guess its worth a try to see. i'm still a little stuck on avro support though :) ...maybe our use cases aren't so close gwicke? not sure. [23:19:45] let me add you [23:19:48] k [23:20:09] avro for storage or the API? [23:20:28] for what's in kafka, so i guess the storage [23:20:39] that's orthogonal, imho [23:20:43] ja? 
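[Editor's note: the "simple rest server that writes events to a topic plus json schema validation" and the "100 line service" remark above are easy to picture. The sketch below only illustrates how small such a service can be; it is written in Python with Flask, kafka-python and jsonschema, whereas the actual spike would presumably be Node.js and look different. The broker address, route and topic are placeholders.]

    import json

    from flask import Flask, jsonify, request
    from jsonschema import ValidationError, validate
    from kafka import KafkaProducer

    app = Flask(__name__)
    producer = KafkaProducer(bootstrap_servers='localhost:9092')  # placeholder broker

    # Placeholder topic -> schema map (see the validation sketch earlier).
    TOPIC_SCHEMAS = {'test.event': {'type': 'object', 'required': ['meta']}}

    @app.route('/v1/topics/<topic>', methods=['POST'])
    def produce(topic):
        schema = TOPIC_SCHEMAS.get(topic)
        if schema is None:
            return jsonify(error='unknown topic'), 404
        event = request.get_json(force=True)
        try:
            validate(event, schema)  # "schema validation is a library call"
        except ValidationError as e:
            return jsonify(error=e.message), 400
        producer.send(topic, json.dumps(event).encode('utf-8'))
        return jsonify(status='queued'), 201

    if __name__ == '__main__':
        app.run(port=8085)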
[23:20:44] (03PS1) 10Tim Landscheidt: Tools: Replace reference to tools. in tool-uwsgi-python [puppet] - 10https://gerrit.wikimedia.org/r/234697 (https://phabricator.wikimedia.org/T87387) [23:21:00] how so? i mean, i think there are use cases where json makes sense. and maybe those cases don't need schema validation? [23:21:49] gwicke: i'm just going to use one node for this, for both zk and kafka [23:21:53] json is ubiquitous and has well-optimized implementations [23:22:05] which makes it attractive for the api [23:22:06] RECOVERY - puppet last run on uranium is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [23:22:10] internally, we can use whatever [23:22:21] (03CR) 10Tim Landscheidt: "Tested with:" [puppet] - 10https://gerrit.wikimedia.org/r/234697 (https://phabricator.wikimedia.org/T87387) (owner: 10Tim Landscheidt) [23:22:36] avro in the interface could save some bytes, but complicates most clients [23:22:50] if you were thinking about binary avro [23:23:02] gwicke: no, i'm fine with the json representation of avro for production [23:23:08] but i think you are right, it does complicate consumers [23:23:31] especially if we don't like the built in consumption support from rest proxy [23:23:44] consuming directly from kafka would get you binary avro [23:23:53] which is quite annoying to work with unless your client is built for it [23:24:11] but, for analytics use cases they all are [23:24:12] there are libs, but it's not as easy as json [23:24:16] and it is much easier to work with [23:24:17] ja, indeed. [23:24:36] gwicke: i have only tried it in python really, and it is pretty similar there, as it gives you a python dict like json libs usually do [23:24:47] but still, adding avro deps to clients is annoying [23:25:17] if we play our cards right, then we should be able to swap out the internal representation later [23:25:28] this is why i think avro + json support would be good. if we didn't need to do validation for the json side of things, then confluent rest proxy as is would work [23:25:51] haha, gwicke, maybe we could use json schema, but force that they be in the avro format! :) [23:26:19] lets get some basic data first [23:26:56] for example, if json schema validation is prohibitively expensive, then that might influence our decisions [23:27:17] i guess so, but i don't see any of these things as prohibitively expensive, especially for production [23:27:47] production (of a single or few messages) would be stateless, no? therefore pretty easily scalable [23:28:06] http produce request -> load balanced service [23:28:07] sure, for prod almost anything should work [23:28:17] at 50k messages / second, it's a different game though [23:28:45] sorry, i mean 'production' as in produce requests [23:28:59] ah [23:29:25] you should be an admin in the services project [23:29:35] ja need to create puppet groups [23:30:04] 6operations, 10Wikimedia-Site-Requests, 5Patch-For-Review: Rename "chapcomwiki" to "affcomwiki" - https://phabricator.wikimedia.org/T41482#1586004 (10Varnent) Going with the WMF icon logo seems the most logical and easiest IMHO. https://commons.wikimedia.org/wiki/File:Wikimedia-logo.svg [23:31:12] ha, whats up labs, I can't see any analytics instances anymore?! [23:31:16] i logged out and back in too!
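[Editor's note: the binary-Avro-versus-JSON trade-off discussed above can be seen in a few lines. This assumes the fastavro library (the chat refers to the stock Python avro bindings, which likewise hand decoded records back as plain dicts); the schema and record are invented.]

    import io
    import json

    import fastavro

    schema = fastavro.parse_schema({
        'name': 'Edit', 'type': 'record',
        'fields': [
            {'name': 'title', 'type': 'string'},
            {'name': 'rev', 'type': 'long'},
        ],
    })
    record = {'title': 'Main_Page', 'rev': 123456}

    # Binary (schemaless) Avro: compact, but the reader must already know the schema.
    buf = io.BytesIO()
    fastavro.schemaless_writer(buf, schema, record)
    avro_bytes = buf.getvalue()
    json_bytes = json.dumps(record).encode('utf-8')
    print('avro: %d bytes, json: %d bytes' % (len(avro_bytes), len(json_bytes)))

    # Decoding yields an ordinary dict, much as a json library would.
    buf.seek(0)
    print(fastavro.schemaless_reader(buf, schema))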
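[Editor's note: the "is JSON Schema validation prohibitively expensive?" question above is the kind of thing the proposed spike can answer with a one-file microbenchmark before any service is written. A rough sketch, again with an invented schema; absolute numbers will vary with schema complexity and library version.]

    import timeit

    import jsonschema

    SCHEMA = {
        'type': 'object',
        'properties': {'title': {'type': 'string'}, 'rev': {'type': 'integer'}},
        'required': ['title', 'rev'],
    }
    EVENT = {'title': 'Main_Page', 'rev': 123456}

    # Re-using one compiled validator avoids re-parsing the schema per event,
    # which is the realistic setup for a long-running produce service.
    validator = jsonschema.Draft4Validator(SCHEMA)

    n = 100000
    secs = timeit.timeit(lambda: validator.validate(EVENT), number=n)
    print('%d validations in %.2fs (%.1f us each)' % (n, secs, secs / n * 1e6))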
[23:31:37] ottomata, someone else was reporting something like that [23:31:46] https://phabricator.wikimedia.org/T110629 [23:31:54] (03CR) 1020after4: "It happens almost every time for me, specifically with sync-dir, I still don't understand why that would be slower than a full sync?" [tools/scap] - 10https://gerrit.wikimedia.org/r/234687 (owner: 10BryanDavis) [23:32:12] 6operations, 10MediaWiki-extensions-TimedMediaHandler, 6Multimedia: Support VP9 in TMH (Unable to decode) - https://phabricator.wikimedia.org/T55863#1586011 (10brion) [23:32:16] PROBLEM - Keyholder SSH agent on tin is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it. [23:32:19] 6operations, 10Continuous-Integration-Infrastructure, 6Multimedia, 5Patch-For-Review: Investigate impact of switching from ffmpeg to libav (ffmpeg is not in Jessie) - https://phabricator.wikimedia.org/T103335#1586007 (10brion) 5Open>3Resolved a:3brion Ok, the details of setting up the updated ffmpeg... [23:32:27] 6operations, 10Wikimedia-Site-Requests, 5Patch-For-Review: Rename "chapcomwiki" to "affcomwiki" - https://phabricator.wikimedia.org/T41482#1586013 (10Krenair) Great, we already have that in the repository. [23:32:32] (03CR) 1020after4: [C: 031] Force use of IPv4 addresses with ssh and rsync [tools/scap] - 10https://gerrit.wikimedia.org/r/234687 (owner: 10BryanDavis) [23:34:31] (03PS3) 10Alex Monk: Update project namespace name and favicon on affcom.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/234679 (https://phabricator.wikimedia.org/T41482) [23:34:44] (03CR) 10Faidon Liambotis: [C: 04-1] "The suspicion that IPv6 is to blame for this remains unsubstantiated." [tools/scap] - 10https://gerrit.wikimedia.org/r/234687 (owner: 10BryanDavis) [23:35:29] hm, gwicke, can you log into the node i created? [23:35:32] kafka-event-bus [23:35:52] hm, maybe it is just still configuring [23:36:27] ottomata: no, IIRC it normally takes ~10-15 minutes until you can actually log in [23:37:02] (03CR) 10Faidon Liambotis: [C: 04-1] "I'd very much prefer something like" [puppet] - 10https://gerrit.wikimedia.org/r/148917 (owner: 10Tim Landscheidt) [23:37:06] (03CR) 10Alex Monk: [C: 032] Update project namespace name and favicon on affcom.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/234679 (https://phabricator.wikimedia.org/T41482) (owner: 10Alex Monk) [23:37:12] (03Merged) 10jenkins-bot: Update project namespace name and favicon on affcom.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/234679 (https://phabricator.wikimedia.org/T41482) (owner: 10Alex Monk) [23:37:34] hm, k, thought it was quicker these days [23:37:55] RECOVERY - puppet last run on mw1042 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:38:18] (03PS1) 10Brion VIBBER: Use backported ffmpeg for multimedia transcoding on Trusty [puppet] - 10https://gerrit.wikimedia.org/r/234699 (https://phabricator.wikimedia.org/T110707) [23:39:08] Where is the source for gerritbot? (the phabricator bot reporting relevant gerrit changes) [23:39:31] wow, sync-file is slow today [23:40:35] (03CR) 10BryanDavis: "Cherry-picked to beta cluster for testing" [puppet] - 10https://gerrit.wikimedia.org/r/234699 (https://phabricator.wikimedia.org/T110707) (owner: 10Brion VIBBER) [23:40:36] haven't seen it do this before [23:41:41] still having labs issues, ottomata? 
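[Editor's note: Faidon's -1 above ("the suspicion that IPv6 is to blame for this remains unsubstantiated") points at the obvious next step: measure instead of guessing. Something like the throwaway sketch below, run from the deployment host against a handful of app servers, would either support or kill the IPv6 theory. The host names are taken from the log only as examples, and this is not part of any actual patch.]

    import subprocess
    import time

    HOSTS = ['mw2009.codfw.wmnet', 'mw1042.eqiad.wmnet']  # placeholder targets

    def time_ssh(host, extra_args):
        """Time one non-interactive ssh round trip with the given extra flags."""
        start = time.time()
        subprocess.call(['ssh', '-oBatchMode=yes'] + extra_args + [host, 'true'],
                        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        return time.time() - start

    for host in HOSTS:
        default = time_ssh(host, [])     # whatever address family gets picked
        v4only = time_ssh(host, ['-4'])  # forced IPv4
        print('%s: default %.2fs, -4 %.2fs' % (host, default, v4only))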
[23:42:30] 10Ops-Access-Requests, 6operations: Requesting research DB access for Alex Monk - https://phabricator.wikimedia.org/T110754#1586031 (10TrevorParscal) Approved. [23:43:36] (03PS1) 10Ori.livneh: harden ssh-agent-proxy [puppet] - 10https://gerrit.wikimedia.org/r/234700 [23:43:44] paravoid: this is what i was working on ^ :) [23:44:58] !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/234679/ (duration: 06m 56s) [23:45:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:45:06] RECOVERY - Keyholder SSH agent on tin is OK: OK: Keyholder is armed with all configured keys. [23:46:17] 6operations, 10Wikimedia-Site-Requests, 7I18n, 7Tracking: Wikis waiting to be renamed (tracking) - https://phabricator.wikimedia.org/T21986#1586040 (10Krenair) [23:46:18] 6operations, 10Wikimedia-Site-Requests, 5Patch-For-Review: Rename "chapcomwiki" to "affcomwiki" - https://phabricator.wikimedia.org/T41482#1586039 (10Krenair) 5Open>3Resolved [23:46:45] 6operations, 10Wikimedia-Site-Requests, 5Patch-For-Review: Rename "chapcomwiki" to "affcomwiki" - https://phabricator.wikimedia.org/T41482#1586041 (10Varnent) {meme, src=votecat, below="Thank you!"} [23:46:54] Krenair: trying to log into a new instance, but maybe it is just not ready yet [23:47:00] (03PS3) 10Dzahn: mailman: make Apache config 2.4 compatible [puppet] - 10https://gerrit.wikimedia.org/r/234676 (https://phabricator.wikimedia.org/T109925) [23:47:02] also [23:47:23] i can't see instances in the analytics project [23:47:31] where i usuallywork with them [23:47:40] gabriel just added me to services project [23:47:42] (03CR) 10Dzahn: [C: 032] mailman: make Apache config 2.4 compatible [puppet] - 10https://gerrit.wikimedia.org/r/234676 (https://phabricator.wikimedia.org/T109925) (owner: 10Dzahn) [23:47:49] so maybe that broke something? [23:48:48] gwicke: i'm waiting for my brother to show up. i *think* that hte instances is good to go. if you run puppet there it should start everything up. ah crap no, i have to run a zookeeper command [23:48:56] ottomata, try now [23:49:23] ottomata: still no login for me [23:52:19] ori, are you still fiddling with tin? [23:53:13] if so - that wouldn't be why scap is behaving strangely, would it? [23:53:34] i'm not anymore, no [23:53:40] strangely how? [23:54:12] 6 minutes to sync one file [23:54:15] krenair@tin:~$ SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh mwdeploy@mw2009 [23:54:15] Error reading response length from authentication socket. [23:54:15] Permission denied (publickey). [23:54:39] Does not occur on mira. [23:54:42] Just tin. [23:55:18] Krenair: well with all the codfw MW host issues, the codfw deployment server has to work ;) [23:55:28] (Said out of context so, likely not relevant) [23:55:40] ah yeah, I did not restart the proxy after the agent restarted [23:55:44] so its file handle was invalid [23:56:00] try now [23:56:30] also, the proxy should detect that [23:56:54] 6operations, 10MediaWiki-extensions-TimedMediaHandler, 6Multimedia: Support VP9 in TMH (Unable to decode) - https://phabricator.wikimedia.org/T55863#1586079 (10brion) [23:57:04] Okay, so that fixed the connection issue. [23:57:10] Krenair: still no login [23:57:18] It's still ridiculously slow from tin only, but maybe that's a separate issue [23:57:37] paravoid: ^ wanna help debug it? 
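[Editor's note: the failure above, where the keyholder proxy kept a file handle to an ssh-agent that had since been restarted, is what "the proxy should detect that" refers to; the actual hardening went into the ssh-agent-proxy patch linked earlier (Gerrit 234700), which is not reproduced here. As a hedged sketch of the kind of liveness probe a proxy could run: ask the backing agent to list its identities and treat any failure as "reconnect needed". The agent socket path is assumed; the message numbers are from the standard ssh-agent protocol.]

    import socket
    import struct

    AGENT_SOCK = '/run/keyholder/agent.sock'  # assumed path of the backing agent

    SSH2_AGENTC_REQUEST_IDENTITIES = 11
    SSH2_AGENT_IDENTITIES_ANSWER = 12

    def agent_is_alive(path=AGENT_SOCK, timeout=2.0):
        """Ask the agent for its identities; any well-formed answer means it is up."""
        try:
            sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
            sock.settimeout(timeout)
            sock.connect(path)
            # ssh-agent framing: 4-byte big-endian length, then the message type byte.
            sock.sendall(struct.pack('>IB', 1, SSH2_AGENTC_REQUEST_IDENTITIES))
            length = struct.unpack('>I', sock.recv(4))[0]
            reply = sock.recv(length)
            return len(reply) > 0 and reply[0] == SSH2_AGENT_IDENTITIES_ANSWER
        except (OSError, struct.error):
            return False  # stale socket, dead agent or timeout: re-open the connection

    if __name__ == '__main__':
        print('agent alive:', agent_is_alive())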
[23:58:05] `time ssh -vvv mw2009 env` -- real 0m20.172s [23:59:34] (03CR) 10Dzahn: "Aug 28 23:58:51 fermium apache2[23067]: AH00526: Syntax error on line 26 of /etc/apache2/sites-en...onf:" [puppet] - 10https://gerrit.wikimedia.org/r/234676 (https://phabricator.wikimedia.org/T109925) (owner: 10Dzahn)
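[Editor's note: 20 seconds for a single ssh to run env looks more like name-resolution or reverse-DNS timeouts (5-second retry steps are a classic signature) than anything in the crypto handshake. Before settling on the IPv6 explanation, timing the lookups that ssh and sshd perform would narrow it down. A speculative diagnostic sketch only; the host is taken from the log and the client address is a documentation-range placeholder.]

    import socket
    import time

    HOST = 'mw2009.codfw.wmnet'  # target from the log
    CLIENT_ADDR = '2001:db8::1'  # placeholder for the connecting host's IPv6 address

    def timed(label, fn):
        """Run fn once and report how long it took (errors are reported, not raised)."""
        start = time.time()
        try:
            result = fn()
        except socket.error as exc:
            result = exc
        print('%-12s %6.2fs  %r' % (label, time.time() - start, result))

    # Forward lookups: does the A or the AAAA resolution hang?
    timed('A lookup', lambda: len(socket.getaddrinfo(HOST, 22, socket.AF_INET)))
    timed('AAAA lookup', lambda: len(socket.getaddrinfo(HOST, 22, socket.AF_INET6)))
    # Reverse lookup of the connecting address, roughly what sshd does for wtmp.
    timed('PTR lookup', lambda: socket.gethostbyaddr(CLIENT_ADDR))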