[00:02:16] Lightning deploy time!
[00:02:18] * RoanKattouw does deploy prep
[00:02:50] (03CR) 10Catrope: [C: 032] Make VisualEditor namespaces extend, not replace, default [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94420 (owner: 10Jforrester)
[00:03:01] (03Merged) 10jenkins-bot: Make VisualEditor namespaces extend, not replace, default [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94420 (owner: 10Jforrester)
[00:03:51] (03CR) 10Catrope: [C: 032] Create visualeditor-default.dblist to simplify config [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94471 (owner: 10Jforrester)
[00:05:05] (03CR) 10Catrope: [C: 032] Enable VisualEditor on board, collab, office and wikimaniateam wikis [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/93780 (owner: 10Jforrester)
[00:07:01] (03Merged) 10jenkins-bot: Create visualeditor-default.dblist to simplify config [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94471 (owner: 10Jforrester)
[00:07:18] (03Merged) 10jenkins-bot: Enable VisualEditor on board, collab, office and wikimaniateam wikis [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/93780 (owner: 10Jforrester)
[00:10:56] !log catrope synchronized visualeditor-default.dblist 'New dblist for wikis with VisualEditor enabled by default'
[00:11:20] Logged the message, Master
[00:14:39] !log catrope synchronized wmf-config/CommonSettings.php 'Add plumbing for visualeditor-default.dblist, wmgVisualEditorParsoidForwardCookies and wmgVisualEditorNamespaces'
[00:14:53] Logged the message, Master
[00:15:43] !log catrope synchronized wmf-config/InitialiseSettings.php 'Use visualeditor-default.dblist; enable VisualEditorParsoidForwardCookies on private wikis'
[00:15:57] Logged the message, Master
[00:16:42] (03PS1) 10Dzahn: remove frlabs.wikimedia.org [operations/dns] - 10https://gerrit.wikimedia.org/r/95077
[00:16:56] !log catrope synchronized visualeditor.dblist 'Enable VisualEditor on office, board, collab and wikimaniateam wikis'
[00:17:11] Logged the message, Master
[00:19:51] !log catrope synchronized php-1.23wmf3/resources/startup.js 'touch'
[00:20:06] Logged the message, Master
[00:24:38] !log catrope synchronized wmf-config/InitialiseSettings.php 'touch'
[00:24:52] Logged the message, Master
[00:46:17] (03CR) 10Jeremyb: [C: 031] remove frlabs.wikimedia.org [operations/dns] - 10https://gerrit.wikimedia.org/r/95077 (owner: 10Dzahn)
[01:02:43] 9 23.465340 hook: GetPreferences
[01:02:46] ohh, commonswiki
[01:14:36] i'm going to sync https://gerrit.wikimedia.org/r/#/q/I189ba71de869496a36f49283ec6dce7bbdfccd73,n,z unless anyone objects
[01:15:17] seems worth doing if it's bad enough to warrant "pt-kill jobs running on S1 slaves to snipe this query" (https://bugzilla.wikimedia.org/show_bug.cgi?id=56840)
[01:15:22] ^ greg-g
[01:16:29] Krinkle: does it make sense to add https://gerrit.wikimedia.org/r/#/c/91844/ ?
[01:16:57] Fine by me :)
[01:17:35] can you make the cherry-pick commits?
[01:17:44] oh, you're already on it
[01:18:49] ori-l: I'll let you merge them whenever :)
[01:26:07] !log ori synchronized php-1.23wmf3/includes/specials/SpecialAllpages.php 'I189ba71de: In Special:AllPages, limit the size of hierarchical lists (Bug: 56840)'
[01:26:20] Logged the message, Master
[01:29:16] !log ori synchronized php-1.23wmf3/includes/specials/SpecialAllpages.php 'I189ba71de: In Special:AllPages, limit the size of hierarchical lists (Bug: 56840)'
[01:29:57] Directory name fail.
[01:30:23] !log ori synchronized php-1.23wmf2/includes/specials/SpecialAllpages.php 'I189ba71de: In Special:AllPages, limit the size of hierarchical lists (Bug: 56840)'
[01:30:36] Logged the message, Master
[01:31:24] done; prod looks good. https://en.wikipedia.org/wiki/Special:AllPages looks good.
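The patch deployed here (I189ba71de) caps how far Special:AllPages will expand its hierarchical index, so a single request can no longer fan out into the slow query that pt-kill had been sniping (bug 56840). A minimal Python sketch of that idea; the names and the cap value are illustrative assumptions, not the actual MediaWiki code in SpecialAllpages.php:

```python
MAX_PER_LEVEL = 100  # hypothetical cap; the real limit lives in SpecialAllpages.php

def build_index(titles, cap=MAX_PER_LEVEL):
    """Group titles by first character, truncating each bucket so the
    index page stays bounded instead of expanding without limit."""
    buckets = {}
    for title in sorted(titles):
        bucket = buckets.setdefault(title[:1], [])
        if len(bucket) < cap:
            bucket.append(title)
    return buckets
```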
[01:32:19] springle: it should be safe to remove the pt-kill jobs
[01:33:48] ori-l: ok
[01:38:26] (03PS1) 10Mwalker: Remove Call to Donate from Site Error Page [operations/puppet] - 10https://gerrit.wikimedia.org/r/95090
[01:39:23] error page doesn't mean site is down. fwiw
[01:39:42] jeremyb: it's true
[01:39:45] but if it is down
[01:39:49] we cant serve donate
[01:40:46] (03CR) 10Adamw: [C: 031] "lol: utm_medium=varnisherr" [operations/puppet] - 10https://gerrit.wikimedia.org/r/95090 (owner: 10Mwalker)
[01:44:38] mwalker: i thought frack was pretty isolated?
[01:44:52] it is; but donate is served off of the application cluster
[01:44:55] mwalker: so send them totally off cluster instead? maybe to the wikimedia store?
[01:45:13] we could probably do the store
[01:45:17] orly. huh. i thought donate was frack
[01:45:49] anyway, error page doesn't even mean a wiki is down let alone the whole cluster
[01:46:01] every time I've seen that error page it has :)
[01:46:14] interesting
[01:46:25] it's mostly because all the appservers serve all wikis
[01:46:33] and if anything affects the appservers; then everything dies
[01:46:36] except srv193!!
[01:46:41] :P
[01:46:43] psh; testwiki hardly counts
[01:47:30] i don't really have a strong opinion about whether the plea should be there
[01:48:11] Elsie: ^^
[02:13:05] !log restarting externallinks.el_id schema change jobs
[02:13:20] Logged the message, Master
[02:17:07] !log LocalisationUpdate completed (1.23wmf3) at Wed Nov 13 02:17:07 UTC 2013
[02:17:19] Logged the message, Master
[02:20:41] mutante: still working? I'm wondering if we can prune out the misc::download-mediawiki class or if it's used for something… somewhere?
[02:24:59] mwalker: I can't tell if you added or removed the trailing newline.
[02:25:16] https://gerrit.wikimedia.org/r/#/c/95090/1/templates/varnish/errorpage.inc.vcl.erb,unified
[02:25:21] There.
[02:26:00] that is a 'bug' in gerrit; no new line was added or removed; it just can't handle files without them very well it seems. The side-by-side shows no change https://gerrit.wikimedia.org/r/#/c/95090/1/templates/varnish/errorpage.inc.vcl.erb
[02:26:21] I doubt that's a bug in Gerrit.
[02:27:02] My mouse has gone dead.
[02:27:32] ok; fair point; my local install diff's the same
[02:27:39] so; but in git
[02:27:42] *bug
[02:27:43] Right. I think Gerrit just relies on Git.
[02:27:45] It's not a bug.
[02:27:53] You removed the trailing newline and added a space instead, it looks like.
[02:28:01] I'd verify, but dead mouse takes priority.
[02:28:07] Tweak your text editor? :-)
[02:28:21] creating a patch file to view it
[02:28:48] Oh, mouse is back.
[02:28:59] New batteries.
[02:29:22] git format-patch does not have a whitespace error
[02:29:26] *does not show a
[02:30:00] cd Documents/grrrit
[02:30:03] I crack myself up.
[02:30:41] the grrrit-wm bot makes me happy every time I see it
[02:30:49] I'm updating my repo.
[02:31:21] !log LocalisationUpdate completed (1.23wmf2) at Wed Nov 13 02:31:21 UTC 2013
[02:31:23] i'm updating too :)
[02:31:30] Sigh, okay, updated and review -d'd.
[02:31:36] Logged the message, Master
[02:32:39] Hmmm.
[02:33:28] Ah, you added a trailing newline.
[02:33:38] Never mind. :-)
[02:33:40] I hate Gerrit.
[02:35:04] (03CR) 10Andrew Bogott: "(2 comments)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/90760 (owner: 10Matanya)
[02:35:27] Sorry about that, mwalker.
[02:35:38] no worries
[02:35:45] Elsie: idk, i think you guys just need to use unified instead of side-by-side
[02:36:04] You know Krenair says the exact opposite?
[02:36:06] both views have their issues
[02:36:10] Yeah.
[02:36:16] There's no fucking winning with Gerrit.
[02:36:26] If you use unified, you can't get +/- lines links.
[02:36:32] Inexplicably stupid UI.
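The back-and-forth here comes down to whether the patch touched the file's trailing newline: git (and therefore Gerrit) marks a missing final newline with "\ No newline at end of file" in unified diffs, which the side-by-side view doesn't surface. The underlying check is trivial; a Python sketch:

```python
def ends_with_newline(data: bytes) -> bool:
    # git prints "\ No newline at end of file" in a diff hunk when the
    # last line lacks this terminator; this reproduces that check on raw bytes.
    return data.endswith(b"\n")

print(ends_with_newline(b"errorpage template contents\n"))  # True
print(ends_with_newline(b"no terminator"))                  # False
```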
[02:39:56] (03PS1) 10Springle: move S3 Query::recache traffic to snapshot slave [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95096
[02:40:49] (03CR) 10Springle: [C: 032] move S3 Query::recache traffic to snapshot slave [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95096 (owner: 10Springle)
[02:42:08] !log springle synchronized wmf-config/db-eqiad.php 'move S3 Query::recache traffic to snapshot slave'
[02:42:19] Logged the message, Master
[02:43:28] (03CR) 10MZMcBride: "(1 comment)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/95090 (owner: 10Mwalker)
[02:43:54] mwalker: BTW: https://bugzilla.wikimedia.org/show_bug.cgi?id=18903 is fun.
[02:43:59] Though only tangentially related to what you're working on.
[02:53:59] (03PS1) 10Springle: move QueryPage::recache to snapshot slaves except S1. keep snapshot slaves slightly higher LB read priority than master. [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95098
[02:55:23] (03CR) 10Springle: [C: 032] move QueryPage::recache to snapshot slaves except S1. keep snapshot slaves slightly higher LB read priority than master. [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95098 (owner: 10Springle)
[02:56:32] !log springle synchronized wmf-config/db-eqiad.php 'QueryPage::recache traffic'
[02:56:45] Logged the message, Master
[03:17:40] !log LocalisationUpdate ResourceLoader cache refresh completed at Wed Nov 13 03:17:40 UTC 2013
[03:17:55] Logged the message, Master
[03:47:09] root@mw72:/etc/init# dpkg -L twemproxy
[03:47:09] /usr/local
[03:47:09] /usr/local/bin
[03:47:09] /usr/local/bin/nutcracker
[03:47:11] root@mw72:/etc/init#
[03:47:24] it seems to defeat the purpose of having a package...
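The complaint here is that the twemproxy .deb drops its files under /usr/local, which Debian policy reserves for the local administrator; a package installing there gains little over a manual `make install`. A quick way to spot such paths in `dpkg -L` output (a hypothetical helper, not an existing tool):

```python
def policy_violations(dpkg_l_output: str) -> list[str]:
    """Return the paths from `dpkg -L <pkg>` output that fall under
    /usr/local, where Debian policy says packaged files must not go."""
    return [line for line in dpkg_l_output.splitlines()
            if line.startswith("/usr/local/")]

listing = "/usr/local\n/usr/local/bin\n/usr/local/bin/nutcracker"
print(policy_violations(listing))  # ['/usr/local/bin', '/usr/local/bin/nutcracker']
```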
[04:19:57] (03PS2) 10Ori.livneh: updateBitsBranchPointers: get rid of 'static-stable' branch link [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94447
[04:19:58] (03PS1) 10Ori.livneh: Enable 'exception-json' debug log group; route to $wmfUdp2logDest [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95104
[04:35:39] (03CR) 10Ori.livneh: [C: 032] Enable 'exception-json' debug log group; route to $wmfUdp2logDest [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95104 (owner: 10Ori.livneh)
[04:37:05] !log ori updated /a/common to {{Gerrit|I5043210cb}}: Enable 'exception-json' debug log group; route to $wmfUdp2logDest
[04:37:17] Logged the message, Master
[04:38:48] !log ori synchronized wmf-config/InitialiseSettings.php 'I5043210cb: Enable 'exception-json' debug log group; route to '
[04:39:02] Logged the message, Master
[04:39:38] (03CR) 10MZMcBride: "(1 comment)" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/91344 (owner: 10Legoktm)
[04:43:43] (03CR) 10MZMcBride: "This is scheduled to be deployed to Wikimedia wikis Thursday, November 14, 2013." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/91344 (owner: 10Legoktm)
[04:46:21] (03CR) 10Krinkle: [C: 031] updateBitsBranchPointers: get rid of 'static-stable' branch link [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94447 (owner: 10Ori.livneh)
[04:56:20] Krinkle: I have negative one million review karma with you
[04:56:29] I am aware of the fact and feel appropriately guilty
[04:56:36] Not that that helps you, but you know.
[04:57:23] I will catch up, somehow.
[04:57:30] ori-l: Thx, I might collect on that some day ;-)
[04:58:11] anyway, my secret to review is, that I use it to charge my brain.
[04:59:07] Do you check out reviews and read them in the console? Or do you do a first pass on Gerrit?
[04:59:29] Depends on the size of the diff and whether I expect to submit a revision myself.
[04:59:58] I don't like to open a million tabs, though I do like the "Review and next" checkbox tracker in gerrit when reviewing on mobile.
[05:00:23] I can move back and still go "next" to an unreviewed file
[05:00:52] but in cli I don't need that since I'd just page through less (hey, lol, just realised that is also called 'less') and git-show
[05:13:27] I tried Gerrit on my phone once.
[05:13:34] I don't remember the experience fondly.
[05:17:01] it's only half-decent on mobile safari (iOS 7), but not too bad. I had an empty battery a few times in the last 2 weeks while traveling by train, used the phone to do some code review.
[05:18:16] at first i was thinking phone battery and wondering how you used it with a dead battery :P
[05:20:27] I don't understand why Gerrit doesn't say "jenkins-bot is going to submit this code" in the +2 user interface.
[05:20:38] And disable that part of the form.
[05:20:40] Or something.
[05:20:42] It seems stupid.
[05:57:56] PROBLEM - Varnish HTCP daemon on amssq52 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:57:56] PROBLEM - Varnish traffic logger on amssq52 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:58:16] PROBLEM - Varnish HTTP text-backend on amssq52 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:04:46] RECOVERY - Varnish traffic logger on amssq52 is OK: PROCS OK: 2 processes with command name varnishncsa
[06:04:46] RECOVERY - Varnish HTCP daemon on amssq52 is OK: PROCS OK: 1 process with UID = 110 (vhtcpd), args vhtcpd
[06:05:06] RECOVERY - Varnish HTTP text-backend on amssq52 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 0.192 second response time
[06:07:13] PROBLEM - DPKG on snapshot3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:14:13] RECOVERY - DPKG on snapshot3 is OK: All packages OK
[06:24:15] (03CR) 10MZMcBride: "Andrew + Nemo: I think you two should find some time to chat with each other on IRC. :-)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/90117 (owner: 10Nemo bis)
[06:25:04] PROBLEM - Varnish HTTP text-backend on amssq52 is CRITICAL: HTTP CRITICAL - No data received from host
[06:25:09] (03PS3) 10TTO: Make missing.php aware of interwiki prefixes [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94716
[06:25:53] PROBLEM - Varnish HTCP daemon on amssq52 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:25:53] PROBLEM - Varnish traffic logger on amssq52 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:31:33] !log rebooting amssq52, deadlock in kmem_alloc
[06:31:48] Logged the message, Master
[06:34:04] PROBLEM - SSH on amssq52 is CRITICAL: Connection refused
[06:36:02] Elsie: re Elsie, see this channel starting @ 20:38:59 UTC
[06:37:16] grrr
[06:37:30] Elsie: i mean re Nemo_bis+andrewbogott_afk
[06:39:15] Right.
[06:40:16] Okay.
[06:52:03] PROBLEM - Varnish HTTP text-frontend on amssq52 is CRITICAL: Connection timed out
[06:52:27] Elsie: IRC is overrated.
[06:53:32] Amen.
[06:53:53] Go in peace
[06:55:04] RECOVERY - SSH on amssq52 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0)
[06:55:44] RECOVERY - Varnish HTCP daemon on amssq52 is OK: PROCS OK: 1 process with UID = 110 (vhtcpd), args vhtcpd
[06:55:44] RECOVERY - Varnish traffic logger on amssq52 is OK: PROCS OK: 2 processes with command name varnishncsa
[06:55:53] RECOVERY - Varnish HTTP text-frontend on amssq52 is OK: HTTP OK: HTTP/1.1 200 OK - 198 bytes in 0.193 second response time
[06:56:03] RECOVERY - Varnish HTTP text-backend on amssq52 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 0.192 second response time
[07:22:19] PROBLEM - RAID on snapshot3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:23:09] RECOVERY - RAID on snapshot3 is OK: OK: no RAID installed
[07:32:19] PROBLEM - DPKG on snapshot3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
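The check_job_queue flapping in this stretch is a plain threshold check: it goes CRITICAL once the queued-job total reaches 100,000 (hence the "more than 99,999 jobs" wording) and recovers when the queue drains below that. A Python sketch of the assumed logic; the real check is a Nagios-style plugin run against terbium and arsenic, and these names are illustrative:

```python
THRESHOLD = 100_000  # per the "all job queues below 100,000" recovery text

def check_job_queue(queue_sizes: dict[str, int]) -> tuple[str, int]:
    """queue_sizes maps wiki -> queued jobs; return a Nagios-style state
    plus the aggregate total that the alert message reports."""
    total = sum(queue_sizes.values())
    state = "CRITICAL" if total >= THRESHOLD else "OK"
    return state, total

print(check_job_queue({"enwiki": 99_000, "commonswiki": 15_692}))
# ('CRITICAL', 114692)
```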
[07:33:09] RECOVERY - DPKG on snapshot3 is OK: All packages OK
[07:36:19] PROBLEM - DPKG on snapshot3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:37:09] RECOVERY - DPKG on snapshot3 is OK: All packages OK
[07:41:19] PROBLEM - RAID on snapshot3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:41:19] PROBLEM - DPKG on snapshot3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:43:19] RECOVERY - RAID on snapshot3 is OK: OK: no RAID installed
[07:46:19] RECOVERY - DPKG on snapshot3 is OK: All packages OK
[07:46:39] PROBLEM - Puppet freshness on amssq48 is CRITICAL: No successful Puppet run in the last 10 hours
[07:58:09] RECOVERY - check_job_queue on arsenic is OK: JOBQUEUE OK - all job queues below 100,000
[07:58:09] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 100,000
[07:58:19] PROBLEM - DPKG on snapshot3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:01:09] PROBLEM - check_job_queue on arsenic is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (114692)
[08:01:09] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (114693)
[08:02:19] RECOVERY - DPKG on snapshot3 is OK: All packages OK
[08:03:59] PROBLEM - Disk space on snapshot3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:04:59] RECOVERY - Disk space on snapshot3 is OK: DISK OK
[08:05:19] PROBLEM - DPKG on snapshot3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:07:31] PROBLEM - puppet disabled on snapshot3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:08:02] PROBLEM - Disk space on snapshot3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:08:21] PROBLEM - RAID on snapshot3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:09:21] RECOVERY - RAID on snapshot3 is OK: OK: no RAID installed
[08:10:31] RECOVERY - puppet disabled on snapshot3 is OK: OK
[08:13:01] RECOVERY - Disk space on snapshot3 is OK: DISK OK
[08:13:21] PROBLEM - RAID on snapshot3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:16:11] RECOVERY - RAID on snapshot3 is OK: OK: no RAID installed
[08:18:41] PROBLEM - puppet disabled on snapshot3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:20:01] PROBLEM - Disk space on snapshot3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:20:21] PROBLEM - RAID on snapshot3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:24:41] RECOVERY - puppet disabled on snapshot3 is OK: OK
[08:24:51] RECOVERY - Disk space on snapshot3 is OK: DISK OK
[08:25:11] RECOVERY - check_job_queue on arsenic is OK: JOBQUEUE OK - all job queues below 100,000
[08:25:11] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 100,000
[08:27:11] RECOVERY - RAID on snapshot3 is OK: OK: no RAID installed
[08:27:11] RECOVERY - DPKG on snapshot3 is OK: All packages OK
[08:27:41] PROBLEM - puppet disabled on snapshot3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:28:12] PROBLEM - check_job_queue on arsenic is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (109092)
[08:28:12] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (109092)
[08:29:41] RECOVERY - puppet disabled on snapshot3 is OK: OK
[08:34:21] PROBLEM - DPKG on snapshot3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:35:11] RECOVERY - DPKG on snapshot3 is OK: All packages OK
[08:37:51] PROBLEM - SSH on amssq48 is CRITICAL: Connection refused
[08:38:12] !log rebooting amssq48, deadlock in kmalloc (started about 10 hours ago)
[08:38:21] PROBLEM - DPKG on snapshot3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:38:27] Logged the message, Master
[08:42:21] PROBLEM - RAID on snapshot3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:42:41] PROBLEM - Varnish HTTP text-frontend on amssq48 is CRITICAL: Connection refused
[08:42:51] RECOVERY - SSH on amssq48 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0)
[08:43:11] RECOVERY - Varnish HTTP text-backend on amssq48 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 0.196 second response time
[08:43:11] RECOVERY - Varnish HTCP daemon on amssq48 is OK: PROCS OK: 1 process with UID = 110 (vhtcpd), args vhtcpd
[08:43:11] RECOVERY - Puppet freshness on amssq48 is OK: puppet ran at Wed Nov 13 08:43:06 UTC 2013
[08:43:21] RECOVERY - RAID on snapshot3 is OK: OK: no RAID installed
[08:43:21] RECOVERY - DPKG on snapshot3 is OK: All packages OK
[08:43:41] RECOVERY - Varnish HTTP text-frontend on amssq48 is OK: HTTP OK: HTTP/1.1 200 OK - 199 bytes in 0.191 second response time
[08:44:12] RECOVERY - Varnish traffic logger on amssq48 is OK: PROCS OK: 2 processes with command name varnishncsa
[08:46:21] PROBLEM - DPKG on snapshot3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:46:21] PROBLEM - RAID on snapshot3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:47:01] PROBLEM - Disk space on snapshot3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:47:11] RECOVERY - check_job_queue on arsenic is OK: JOBQUEUE OK - all job queues below 100,000
[08:47:11] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 100,000
[08:48:01] RECOVERY - Disk space on snapshot3 is OK: DISK OK
[08:49:11] RECOVERY - RAID on snapshot3 is OK: OK: no RAID installed
[08:49:41] PROBLEM - puppet disabled on snapshot3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:49:41] PROBLEM - Host text-lb.ulsfo.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100%
[08:50:12] PROBLEM - check_job_queue on arsenic is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (104680)
[08:50:12] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (104680)
[08:50:21] PROBLEM - Host upload-lb.ulsfo.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:863:ed1a::2:b
[08:50:26] PROBLEM - Host text-lb.ulsfo.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:863:ed1a::1
[08:50:31] PROBLEM - Host cp4011 is DOWN: PING CRITICAL - Packet loss = 100%
[08:50:31] PROBLEM - Host cp4005 is DOWN: PING CRITICAL - Packet loss = 100%
[08:50:31] PROBLEM - Host cp4020 is DOWN: PING CRITICAL - Packet loss = 100%
[08:50:31] PROBLEM - Host cp4016 is DOWN: PING CRITICAL - Packet loss = 100%
[08:50:31] PROBLEM - Host cp4003 is DOWN: PING CRITICAL - Packet loss = 100%
[08:50:31] PROBLEM - Host cp4012 is DOWN: PING CRITICAL - Packet loss = 100%
[08:50:31] PROBLEM - Host cp4015 is DOWN: PING CRITICAL - Packet loss = 100%
[08:50:32] PROBLEM - Host cp4009 is DOWN: PING CRITICAL - Packet loss = 100%
[08:50:32] PROBLEM - Host cp4006 is DOWN: PING CRITICAL - Packet loss = 100%
[08:50:33] PROBLEM - Host cp4017 is DOWN: PING CRITICAL - Packet loss = 100%
[08:50:33] PROBLEM - Host lvs4003 is DOWN: PING CRITICAL - Packet loss = 100%
[08:50:34] PROBLEM - Host cp4013 is DOWN: PING CRITICAL - Packet loss = 100%
[08:50:34] PROBLEM - Host cp4007 is DOWN: PING CRITICAL - Packet loss = 100%
[08:50:35] PROBLEM - Host upload-lb.ulsfo.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100%
[08:50:36] PROBLEM - Host cp4018 is DOWN: PING CRITICAL - Packet loss = 100%
[08:50:36] PROBLEM - Host bast4001 is DOWN: PING CRITICAL - Packet loss = 100%
[08:50:36] PROBLEM - Host cp4010 is DOWN: PING CRITICAL - Packet loss = 100%
[08:50:37] PROBLEM - Host lvs4004 is DOWN: PING CRITICAL - Packet loss = 100%
[08:50:37] PROBLEM - Host cp4004 is DOWN: PING CRITICAL - Packet loss = 100%
[08:50:38] PROBLEM - Host cp4002 is DOWN: PING CRITICAL - Packet loss = 100%
[08:50:38] PROBLEM - Host cp4019 is DOWN: PING CRITICAL - Packet loss = 100%
[08:50:39] PROBLEM - Host cp4001 is DOWN: PING CRITICAL - Packet loss = 100%
[08:50:39] PROBLEM - Host cp4008 is DOWN: PING CRITICAL - Packet loss = 100%
[08:50:40] PROBLEM - Host lvs4002 is DOWN: PING CRITICAL - Packet loss = 100%
[08:50:40] PROBLEM - Host cp4014 is DOWN: PING CRITICAL - Packet loss = 100%
[08:50:41] PROBLEM - Host mobile-lb.ulsfo.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:863:ed1a::c
[08:50:43] nic
[08:50:44] RECOVERY - puppet disabled on snapshot3 is OK: OK
[08:50:51] PROBLEM - Host mobile-lb.ulsfo.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100%
[08:50:53] PROBLEM - Host bits-lb.ulsfo.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100%
[08:50:56] PROBLEM - Host lvs4001 is DOWN: PING CRITICAL - Packet loss = 100%
[08:50:56] PROBLEM - Host bits-lb.ulsfo.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:863:ed1a::a
[08:51:01] PROBLEM - Disk space on snapshot3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:51:44] ouch
[08:51:49] hey
[08:52:07] mpls link seems down
[08:52:21] PROBLEM - RAID on snapshot3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:52:23] that or the router on the other end
[08:52:26] in any case, time for a dns change
[08:52:30] I can't see the transit either
[08:52:39] the route is completely lost over here
[08:52:41] might be the router then
[08:52:58] pushing dns
[08:53:36] (03PS1) 10Mark Bergsma: Move ulsfo traffic to eqiad [operations/dns] - 10https://gerrit.wikimedia.org/r/95118
[08:53:46] ah
[08:53:59] (03CR) 10Mark Bergsma: [C: 032] Move ulsfo traffic to eqiad [operations/dns] - 10https://gerrit.wikimedia.org/r/95118 (owner: 10Mark Bergsma)
[08:54:58] gah
[08:54:59] hey mark
[08:54:59] oh you're awake ?
[08:55:07] we're all here and irritated
[08:55:08] yes
[08:55:17] so I dunno what's up with cr1-ulsfo
[08:55:20] or cr2
[08:55:22] ok
[08:55:31] perhaps the whole site is down
[08:55:32] let's try going in via the oob
[08:55:47] how do we do that?
[08:55:56] do we have the info anywhere?
[08:55:58] mr1-ulsfo.oob.wikimedia.org
[08:56:01] RECOVERY - Disk space on snapshot3 is OK: DISK OK
[08:56:16] then to the scs
[08:57:42] logging into cr1-ulsfo now
[08:58:06] grrr
[08:58:09] the link is down
[08:58:12] RECOVERY - RAID on snapshot3 is OK: OK: no RAID installed
[08:58:12] RECOVERY - DPKG on snapshot3 is OK: All packages OK
[08:58:16] gtt?
[08:58:18] which link?
[08:58:29] also why we didn't get to connect, since that's the only transit in ulsfo
[08:58:32] gtt, yeah
[08:58:37] sorry, i just woke up
[08:58:41] RECOVERY - Host text-lb.ulsfo.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 72.00 ms
[08:58:45] RECOVERY - Host upload-lb.ulsfo.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 73.72 ms
[08:58:47] RECOVERY - Host bits-lb.ulsfo.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 74.99 ms
[08:58:48] :P
[08:58:48] to the sound of the phone going beepbeepbeep
[08:58:49] RECOVERY - Host lvs4001 is UP: PING OK - Packet loss = 0%, RTA = 75.03 ms
[08:58:49] RECOVERY - Host cp4001 is UP: PING OK - Packet loss = 0%, RTA = 72.04 ms
[08:58:49] RECOVERY - Host cp4002 is UP: PING OK - Packet loss = 0%, RTA = 73.69 ms
[08:58:49] RECOVERY - Host cp4004 is UP: PING OK - Packet loss = 0%, RTA = 75.11 ms
[08:58:49] RECOVERY - Host cp4015 is UP: PING OK - Packet loss = 0%, RTA = 75.10 ms
[08:58:49] RECOVERY - Host cp4020 is UP: PING OK - Packet loss = 0%, RTA = 75.03 ms
[08:58:49] RECOVERY - Host cp4008 is UP: PING OK - Packet loss = 0%, RTA = 75.04 ms
[08:58:50] RECOVERY - Host cp4006 is UP: PING OK - Packet loss = 0%, RTA = 73.18 ms
[08:58:50] RECOVERY - Host cp4011 is UP: PING OK - Packet loss = 0%, RTA = 73.75 ms
[08:58:51] RECOVERY - Host cp4013 is UP: PING OK - Packet loss = 0%, RTA = 75.07 ms
[08:58:51] RECOVERY - Host lvs4003 is UP: PING OK - Packet loss = 0%, RTA = 75.12 ms
[08:58:52] RECOVERY - Host cp4005 is UP: PING OK - Packet loss = 0%, RTA = 73.68 ms
[08:58:52] RECOVERY - Host cp4018 is UP: PING OK - Packet loss = 0%, RTA = 75.04 ms
[08:58:53] RECOVERY - Host cp4019 is UP: PING OK - Packet loss = 0%, RTA = 73.75 ms
[08:58:53] RECOVERY - Host cp4014 is UP: PING OK - Packet loss = 0%, RTA = 74.93 ms
[08:58:54] same here :P
[08:58:54] RECOVERY - Host cp4010 is UP: PING OK - Packet loss = 0%, RTA = 75.00 ms
[08:58:54] RECOVERY - Host cp4016 is UP: PING OK - Packet loss = 0%, RTA = 74.97 ms
[08:58:55] RECOVERY - Host cp4012 is UP: PING OK - Packet loss = 0%, RTA = 74.99 ms
[08:58:55] RECOVERY - Host text-lb.ulsfo.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 74.93 ms
[08:58:56] RECOVERY - Host cp4009 is UP: PING OK - Packet loss = 0%, RTA = 74.98 ms
[08:58:56] RECOVERY - Host cp4017 is UP: PING OK - Packet loss = 0%, RTA = 74.96 ms
[08:59:01] RECOVERY - Host cp4003 is UP: PING OK - Packet loss = 0%, RTA = 75.42 ms
[08:59:02] and the gtt link just came back on its own
[08:59:12] RECOVERY - Host bast4001 is UP: PING OK - Packet loss = 0%, RTA = 74.30 ms
[08:59:12] RECOVERY - Host cp4007 is UP: PING OK - Packet loss = 0%, RTA = 74.32 ms
[08:59:12] RECOVERY - Host lvs4002 is UP: PING OK - Packet loss = 0%, RTA = 72.44 ms
[08:59:12] RECOVERY - Host upload-lb.ulsfo.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 73.75 ms
[08:59:14] RECOVERY - Host lvs4004 is UP: PING OK - Packet loss = 0%, RTA = 73.96 ms
[08:59:14] RECOVERY - Host mobile-lb.ulsfo.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 74.05 ms
[08:59:16] RECOVERY - Host bits-lb.ulsfo.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 73.25 ms
[08:59:21] RECOVERY - Host mobile-lb.ulsfo.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 73.24 ms
[08:59:28] did it work from eqiad?
[08:59:34] your paging is still on EU timezone? :)
[08:59:35] I didn't get to try
[08:59:43] oh, oops
[09:00:04] let's see if the eqiad side was also down
[09:00:08] it wasn't
[09:00:12] first thing I checked
[09:00:16] okay
[09:00:21] PROBLEM - LVS HTTP IPv4 on text-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:00:26] PROBLEM - LVS HTTPS IPv4 on mediawiki-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:00:34] whaat
[09:00:36] ok, we can test again now
[09:00:39] :P
[09:01:07] awesome
[09:01:11] RECOVERY - LVS HTTP IPv4 on text-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 63358 bytes in 0.389 second response time
[09:01:15] RECOVERY - LVS HTTPS IPv4 on mediawiki-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 63475 bytes in 0.785 second response time
[09:01:15] it's up though
[09:01:23] home->ulsfo doesn't work for me
[09:01:46] i'd give it another 45 seconds, in case routing convergence
[09:01:51] also my question was silly, eqiad->ulsfo was obviously broken or else icinga wouldn't complain :)
[09:02:27] it's back
[09:04:08] ok just sent off the ticket
[09:04:53] ok
[09:05:02] i'll finish breakfast now
[09:05:07] then see how the links are doing
[09:05:18] logs show only xe-0/0/3 going down
[09:05:27] ulsfo better remain on eqiad for another while
[09:05:35] yeah you don't say
[09:06:05] aww, ct is still on paging
[09:06:07] thanks for getting up leslie :)
[09:06:10] haha what
[09:06:17] really??
[09:06:31] at least in the contact sfile, let's see if in the group definition
[09:06:49] at least he's getting love from icinga
[09:06:52] brb ;)
[09:07:12] PROBLEM - Packetloss_Average on erbium is CRITICAL: CRITICAL: packet_loss_average is 11.2605327972 (gt 8.0)
[09:07:29] thank youuuu
[09:07:49] oh not in the contact groups
[09:08:11] RECOVERY - check_job_queue on arsenic is OK: JOBQUEUE OK - all job queues below 100,000
[09:08:12] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 100,000
[09:11:12] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (100063)
[09:11:12] PROBLEM - check_job_queue on arsenic is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (100063)
[09:11:12] RECOVERY - Packetloss_Average on erbium is OK: OK: packet_loss_average is 2.58028264706
[09:14:11] PROBLEM - Packetloss_Average on oxygen is CRITICAL: CRITICAL: packet_loss_average is 8.65663552795 (gt 8.0)
[09:16:11] PROBLEM - Host upload-lb.ulsfo.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:863:ed1a::2:b
[09:16:33] ffs
[09:16:41] PROBLEM - Host bits-lb.ulsfo.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:863:ed1a::a
[09:16:44] PROBLEM - Host cp4006 is DOWN: PING CRITICAL - Packet loss = 100%
[09:16:44] PROBLEM - Host cp4003 is DOWN: PING CRITICAL - Packet loss = 100%
[09:16:44] PROBLEM - Host cp4015 is DOWN: PING CRITICAL - Packet loss = 100%
[09:16:44] PROBLEM - Host cp4007 is DOWN: PING CRITICAL - Packet loss = 100%
[09:16:44] PROBLEM - Host cp4013 is DOWN: PING CRITICAL - Packet loss = 100%
[09:16:44] PROBLEM - Host cp4011 is DOWN: PING CRITICAL - Packet loss = 100%
[09:16:44] PROBLEM - Host lvs4003 is DOWN: PING CRITICAL - Packet loss = 100%
[09:16:45] PROBLEM - Host cp4005 is DOWN: PING CRITICAL - Packet loss = 100%
[09:16:45] PROBLEM - Host cp4016 is DOWN: PING CRITICAL - Packet loss = 100%
[09:16:46] PROBLEM - Host cp4012 is DOWN: PING CRITICAL - Packet loss = 100%
[09:16:46] PROBLEM - Host cp4017 is DOWN: PING CRITICAL - Packet loss = 100%
[09:16:47] PROBLEM - Host cp4009 is DOWN: PING CRITICAL - Packet loss = 100%
[09:16:47] PROBLEM - Host cp4020 is DOWN: PING CRITICAL - Packet loss = 100%
[09:16:51] PROBLEM - Host cp4001 is DOWN: PING CRITICAL - Packet loss = 100%
[09:16:51] PROBLEM - Host cp4018 is DOWN: PING CRITICAL - Packet loss = 100%
[09:16:51] PROBLEM - Host bast4001 is DOWN: PING CRITICAL - Packet loss = 100%
[09:16:51] PROBLEM - Host cp4019 is DOWN: PING CRITICAL - Packet loss = 100%
[09:16:51] PROBLEM - Host cp4002 is DOWN: PING CRITICAL - Packet loss = 100%
[09:16:51] PROBLEM - Host cp4004 is DOWN: PING CRITICAL - Packet loss = 100%
[09:16:51] PROBLEM - Host cp4010 is DOWN: PING CRITICAL - Packet loss = 100%
[09:16:52] PROBLEM - Host cp4014 is DOWN: PING CRITICAL - Packet loss = 100%
[09:16:52] PROBLEM - Host cp4008 is DOWN: PING CRITICAL - Packet loss = 100%
[09:16:53] PROBLEM - Host lvs4004 is DOWN: PING CRITICAL - Packet loss = 100%
[09:17:02] PROBLEM - Host bits-lb.ulsfo.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100%
[09:17:06] PROBLEM - Host lvs4001 is DOWN: PING CRITICAL - Packet loss = 100%
[09:17:07] PROBLEM - Host mobile-lb.ulsfo.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100%
[09:17:11] PROBLEM - Host text-lb.ulsfo.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100%
[09:17:21] PROBLEM - Host text-lb.ulsfo.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:863:ed1a::1
[09:17:41] PROBLEM - Host mobile-lb.ulsfo.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:863:ed1a::c
[09:18:01] PROBLEM - Host lvs4002 is DOWN: PING CRITICAL - Packet loss = 100%
[09:18:01] PROBLEM - Host upload-lb.ulsfo.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100%
[09:18:11] RECOVERY - Packetloss_Average on oxygen is OK: OK: packet_loss_average is 2.24274715116
[09:18:18] (03PS1) 10Hashar: contint: jenkins git config core.packedGitLimit=2G [operations/puppet] - 10https://gerrit.wikimedia.org/r/95123
[09:25:51] RECOVERY - Host cp4011 is UP: PING OK - Packet loss = 0%, RTA = 73.76 ms
[09:25:51] RECOVERY - Host cp4003 is UP: PING OK - Packet loss = 0%, RTA = 74.99 ms
[09:25:51] RECOVERY - Host cp4005 is UP: PING OK - Packet loss = 0%, RTA = 73.74 ms
[09:25:51] RECOVERY - Host cp4016 is UP: PING OK - Packet loss = 0%, RTA = 74.96 ms
[09:25:51] RECOVERY - Host cp4020 is UP: PING OK - Packet loss = 0%, RTA = 74.94 ms
[09:25:51] RECOVERY - Host cp4009 is UP: PING OK - Packet loss = 0%, RTA = 75.01 ms
[09:25:51] RECOVERY - Host cp4006 is UP: PING OK - Packet loss = 0%, RTA = 73.26 ms
[09:25:52] RECOVERY - Host cp4017 is UP: PING OK - Packet loss = 0%, RTA = 75.01 ms
[09:25:52] RECOVERY - Host cp4012 is UP: PING OK - Packet loss = 0%, RTA = 74.98 ms
[09:25:53] RECOVERY - Host cp4002 is UP: PING OK - Packet loss = 0%, RTA = 73.75 ms
[09:25:53] RECOVERY - Host upload-lb.ulsfo.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 73.78 ms
[09:26:01] RECOVERY - Host cp4019 is UP: PING OK - Packet loss = 0%, RTA = 74.18 ms
[09:26:02] RECOVERY - Host cp4014 is UP: PING OK - Packet loss = 0%, RTA = 75.12 ms
[09:26:02] RECOVERY - Host lvs4003 is UP: PING OK - Packet loss = 0%, RTA = 76.00 ms
[09:26:02] RECOVERY - Host bits-lb.ulsfo.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 75.53 ms
[09:26:03] RECOVERY - Host text-lb.ulsfo.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 75.60 ms
[09:26:11] RECOVERY - Host cp4001 is UP: PING OK - Packet loss = 0%, RTA = 72.01 ms
[09:26:11] RECOVERY - Host lvs4004 is UP: PING OK - Packet loss = 0%, RTA = 73.28 ms
[09:26:11] RECOVERY - Host cp4008 is UP: PING OK - Packet loss = 0%, RTA = 74.96 ms
[09:26:11] RECOVERY - Host cp4010 is UP: PING OK - Packet loss = 0%, RTA = 75.00 ms
[09:26:11] RECOVERY - Host cp4018 is UP: PING OK - Packet loss = 0%, RTA = 75.01 ms
[09:26:11] RECOVERY - Host bast4001 is UP: PING OK - Packet
loss = 0%, RTA = 73.28 ms [09:26:11] RECOVERY - Host cp4007 is UP: PING OK - Packet loss = 0%, RTA = 73.74 ms [09:26:12] RECOVERY - Host cp4013 is UP: PING OK - Packet loss = 0%, RTA = 75.01 ms [09:26:12] RECOVERY - Host cp4015 is UP: PING OK - Packet loss = 0%, RTA = 74.96 ms [09:26:13] RECOVERY - Host mobile-lb.ulsfo.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 73.76 ms [09:26:14] RECOVERY - Host cp4004 is UP: PING OK - Packet loss = 0%, RTA = 75.03 ms [09:26:14] RECOVERY - Host lvs4002 is UP: PING OK - Packet loss = 0%, RTA = 72.00 ms [09:26:14] RECOVERY - Host upload-lb.ulsfo.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 73.73 ms [09:26:16] RECOVERY - Host text-lb.ulsfo.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 74.97 ms [09:26:32] RECOVERY - Host lvs4001 is UP: PING OK - Packet loss = 0%, RTA = 76.43 ms [09:26:32] RECOVERY - Host bits-lb.ulsfo.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 76.27 ms [09:26:47] RECOVERY - Host mobile-lb.ulsfo.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 73.25 ms [09:41:27] PROBLEM - Host bits-lb.ulsfo.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:863:ed1a::a [09:41:30] PROBLEM - Host upload-lb.ulsfo.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:863:ed1a::2:b [09:41:32] PROBLEM - Host text-lb.ulsfo.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:863:ed1a::1 [09:41:37] PROBLEM - Host cp4009 is DOWN: PING CRITICAL - Packet loss = 100% [09:41:37] PROBLEM - Host cp4006 is DOWN: PING CRITICAL - Packet loss = 100% [09:41:37] PROBLEM - Host cp4017 is DOWN: PING CRITICAL - Packet loss = 100% [09:41:37] PROBLEM - Host cp4020 is DOWN: PING CRITICAL - Packet loss = 100% [09:41:37] PROBLEM - Host cp4005 is DOWN: PING CRITICAL - Packet loss = 100% [09:41:37] PROBLEM - Host cp4012 is DOWN: PING CRITICAL - Packet loss = 100% [09:41:38] PROBLEM - Host cp4002 is DOWN: PING CRITICAL - Packet loss = 100% [09:41:38] PROBLEM - Host cp4011 is DOWN: 
PING CRITICAL - Packet loss = 100% [09:41:39] PROBLEM - Host cp4019 is DOWN: PING CRITICAL - Packet loss = 100% [09:41:39] PROBLEM - Host cp4016 is DOWN: PING CRITICAL - Packet loss = 100% [09:41:40] PROBLEM - Host cp4014 is DOWN: PING CRITICAL - Packet loss = 100% [09:41:40] PROBLEM - Host lvs4003 is DOWN: PING CRITICAL - Packet loss = 100% [09:41:57] PROBLEM - Host mobile-lb.ulsfo.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:863:ed1a::c [09:41:59] PROBLEM - Host cp4003 is DOWN: PING CRITICAL - Packet loss = 100% [09:41:59] PROBLEM - Host cp4008 is DOWN: PING CRITICAL - Packet loss = 100% [09:41:59] PROBLEM - Host cp4018 is DOWN: PING CRITICAL - Packet loss = 100% [09:41:59] PROBLEM - Host bast4001 is DOWN: PING CRITICAL - Packet loss = 100% [09:41:59] PROBLEM - Host lvs4004 is DOWN: PING CRITICAL - Packet loss = 100% [09:41:59] PROBLEM - Host cp4007 is DOWN: PING CRITICAL - Packet loss = 100% [09:41:59] PROBLEM - Host cp4001 is DOWN: PING CRITICAL - Packet loss = 100% [09:42:00] PROBLEM - Host cp4010 is DOWN: PING CRITICAL - Packet loss = 100% [09:42:00] PROBLEM - Host cp4004 is DOWN: PING CRITICAL - Packet loss = 100% [09:42:01] PROBLEM - Host cp4013 is DOWN: PING CRITICAL - Packet loss = 100% [09:42:01] PROBLEM - Host mobile-lb.ulsfo.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [09:42:02] PROBLEM - Host lvs4002 is DOWN: PING CRITICAL - Packet loss = 100% [09:42:02] PROBLEM - Host cp4015 is DOWN: PING CRITICAL - Packet loss = 100% [09:42:03] PROBLEM - Host upload-lb.ulsfo.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [09:42:03] PROBLEM - Host text-lb.ulsfo.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [09:42:07] PROBLEM - Host lvs4001 is DOWN: PING CRITICAL - Packet loss = 100% [09:42:07] PROBLEM - Host bits-lb.ulsfo.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [09:44:27] RECOVERY - Host cp4002 is UP: PING OK - Packet loss = 0%, RTA = 73.75 ms [09:44:28] RECOVERY - Host cp4014 is UP: PING OK - 
Packet loss = 0%, RTA = 75.86 ms [09:44:28] RECOVERY - Host cp4003 is UP: PING OK - Packet loss = 0%, RTA = 75.62 ms [09:44:28] RECOVERY - Host cp4001 is UP: PING OK - Packet loss = 0%, RTA = 72.30 ms [09:44:28] RECOVERY - Host cp4019 is UP: PING OK - Packet loss = 0%, RTA = 74.25 ms [09:44:28] RECOVERY - Host cp4008 is UP: PING OK - Packet loss = 0%, RTA = 75.70 ms [09:44:28] RECOVERY - Host cp4005 is UP: PING OK - Packet loss = 0%, RTA = 73.94 ms [09:44:29] RECOVERY - Host cp4018 is UP: PING OK - Packet loss = 0%, RTA = 75.07 ms [09:44:29] RECOVERY - Host cp4007 is UP: PING OK - Packet loss = 0%, RTA = 74.04 ms [09:44:30] RECOVERY - Host cp4010 is UP: PING OK - Packet loss = 0%, RTA = 75.77 ms [09:44:30] RECOVERY - Host cp4012 is UP: PING OK - Packet loss = 0%, RTA = 76.09 ms [09:44:31] RECOVERY - Host lvs4004 is UP: PING OK - Packet loss = 0%, RTA = 73.89 ms [09:44:31] RECOVERY - Host bast4001 is UP: PING OK - Packet loss = 0%, RTA = 74.09 ms [09:44:32] RECOVERY - Host lvs4003 is UP: PING OK - Packet loss = 0%, RTA = 75.93 ms [09:44:32] RECOVERY - Host mobile-lb.ulsfo.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 78.65 ms [09:44:33] RECOVERY - Host cp4011 is UP: PING OK - Packet loss = 0%, RTA = 73.93 ms [09:44:33] RECOVERY - Host cp4017 is UP: PING OK - Packet loss = 0%, RTA = 75.47 ms [09:44:34] RECOVERY - Host cp4020 is UP: PING OK - Packet loss = 0%, RTA = 77.80 ms [09:44:34] RECOVERY - Host cp4013 is UP: PING OK - Packet loss = 0%, RTA = 75.22 ms [09:44:35] RECOVERY - Host cp4006 is UP: PING OK - Packet loss = 0%, RTA = 74.92 ms [09:44:35] RECOVERY - Host cp4004 is UP: PING OK - Packet loss = 0%, RTA = 76.09 ms [09:44:36] RECOVERY - Host cp4016 is UP: PING OK - Packet loss = 0%, RTA = 75.60 ms [09:44:36] RECOVERY - Host lvs4002 is UP: PING OK - Packet loss = 0%, RTA = 74.12 ms [09:44:37] RECOVERY - Host cp4015 is UP: PING OK - Packet loss = 0%, RTA = 75.00 ms [09:44:37] RECOVERY - Host cp4009 is UP: PING OK - Packet loss = 0%, RTA = 75.28 ms 
[09:44:38] RECOVERY - Host upload-lb.ulsfo.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 74.16 ms [09:44:38] RECOVERY - Host bits-lb.ulsfo.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 75.14 ms [09:44:40] RECOVERY - Host text-lb.ulsfo.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 72.02 ms [09:44:42] RECOVERY - Host text-lb.ulsfo.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 76.43 ms [09:44:45] RECOVERY - Host bits-lb.ulsfo.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 73.30 ms [09:44:48] RECOVERY - Host lvs4001 is UP: PING OK - Packet loss = 0%, RTA = 75.08 ms [09:44:55] RECOVERY - Host mobile-lb.ulsfo.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 73.47 ms [09:45:05] RECOVERY - Host upload-lb.ulsfo.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 73.81 ms [09:51:35] PROBLEM - Host bits-lb.ulsfo.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:863:ed1a::a [09:51:37] PROBLEM - Host upload-lb.ulsfo.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:863:ed1a::2:b [09:51:39] PROBLEM - Host text-lb.ulsfo.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:863:ed1a::1 [09:51:43] PROBLEM - Host cp4008 is DOWN: PING CRITICAL - Packet loss = 100% [09:51:45] PROBLEM - Host cp4020 is DOWN: PING CRITICAL - Packet loss = 100% [09:51:45] PROBLEM - Host cp4015 is DOWN: PING CRITICAL - Packet loss = 100% [09:51:45] PROBLEM - Host cp4001 is DOWN: PING CRITICAL - Packet loss = 100% [09:51:45] PROBLEM - Host cp4013 is DOWN: PING CRITICAL - Packet loss = 100% [09:51:45] PROBLEM - Host cp4004 is DOWN: PING CRITICAL - Packet loss = 100% [09:51:45] PROBLEM - Host cp4009 is DOWN: PING CRITICAL - Packet loss = 100% [09:51:45] PROBLEM - Host cp4018 is DOWN: PING CRITICAL - Packet loss = 100% [09:51:46] PROBLEM - Host cp4012 is DOWN: PING CRITICAL - Packet loss = 100% [09:51:46] PROBLEM - Host cp4002 is DOWN: PING CRITICAL - Packet loss = 100% [09:51:47] PROBLEM - Host cp4005 
is DOWN: PING CRITICAL - Packet loss = 100% [09:51:47] PROBLEM - Host cp4019 is DOWN: PING CRITICAL - Packet loss = 100% [09:51:48] PROBLEM - Host cp4006 is DOWN: PING CRITICAL - Packet loss = 100% [09:51:48] PROBLEM - Host lvs4001 is DOWN: PING CRITICAL - Packet loss = 100% [09:51:49] PROBLEM - Host cp4016 is DOWN: PING CRITICAL - Packet loss = 100% [09:51:49] PROBLEM - Host cp4010 is DOWN: PING CRITICAL - Packet loss = 100% [09:51:50] PROBLEM - Host cp4003 is DOWN: PING CRITICAL - Packet loss = 100% [09:51:50] PROBLEM - Host cp4011 is DOWN: PING CRITICAL - Packet loss = 100% [09:51:51] PROBLEM - Host cp4017 is DOWN: PING CRITICAL - Packet loss = 100% [09:51:51] PROBLEM - Host cp4014 is DOWN: PING CRITICAL - Packet loss = 100% [09:51:52] PROBLEM - Host lvs4003 is DOWN: PING CRITICAL - Packet loss = 100% [09:51:52] PROBLEM - Host text-lb.ulsfo.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [09:51:55] PROBLEM - Host mobile-lb.ulsfo.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:863:ed1a::c [09:52:05] PROBLEM - Host cp4007 is DOWN: PING CRITICAL - Packet loss = 100% [09:52:05] PROBLEM - Host lvs4004 is DOWN: PING CRITICAL - Packet loss = 100% [09:52:05] PROBLEM - Host bast4001 is DOWN: PING CRITICAL - Packet loss = 100% [09:52:05] PROBLEM - Host mobile-lb.ulsfo.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [09:52:12] PROBLEM - Host lvs4002 is DOWN: PING CRITICAL - Packet loss = 100% [09:52:15] PROBLEM - Host bits-lb.ulsfo.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [09:52:17] PROBLEM - Host upload-lb.ulsfo.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [09:54:35] RECOVERY - Host cp4005 is UP: PING OK - Packet loss = 0%, RTA = 73.97 ms [09:54:36] RECOVERY - Host lvs4003 is UP: PING OK - Packet loss = 0%, RTA = 75.24 ms [09:54:36] RECOVERY - Host cp4001 is UP: PING OK - Packet loss = 0%, RTA = 72.64 ms [09:54:36] RECOVERY - Host cp4012 is UP: PING OK - Packet loss = 0%, RTA = 75.42 ms [09:54:36] RECOVERY - 
Host cp4003 is UP: PING OK - Packet loss = 0%, RTA = 75.59 ms [09:54:36] RECOVERY - Host cp4008 is UP: PING OK - Packet loss = 0%, RTA = 76.19 ms [09:54:36] RECOVERY - Host cp4018 is UP: PING OK - Packet loss = 0%, RTA = 75.19 ms [09:54:37] RECOVERY - Host cp4010 is UP: PING OK - Packet loss = 0%, RTA = 75.68 ms [09:54:37] RECOVERY - Host cp4002 is UP: PING OK - Packet loss = 0%, RTA = 74.15 ms [09:54:38] RECOVERY - Host cp4014 is UP: PING OK - Packet loss = 0%, RTA = 75.43 ms [09:54:38] RECOVERY - Host cp4019 is UP: PING OK - Packet loss = 0%, RTA = 73.87 ms [09:54:39] RECOVERY - Host lvs4004 is UP: PING OK - Packet loss = 0%, RTA = 73.47 ms [09:54:39] RECOVERY - Host bast4001 is UP: PING OK - Packet loss = 0%, RTA = 73.38 ms [09:54:40] RECOVERY - Host cp4007 is UP: PING OK - Packet loss = 0%, RTA = 74.09 ms [09:54:40] RECOVERY - Host mobile-lb.ulsfo.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 73.83 ms [09:54:41] RECOVERY - Host cp4016 is UP: PING OK - Packet loss = 0%, RTA = 76.42 ms [09:54:41] RECOVERY - Host cp4020 is UP: PING OK - Packet loss = 0%, RTA = 75.78 ms [09:54:42] RECOVERY - Host cp4013 is UP: PING OK - Packet loss = 0%, RTA = 75.04 ms [09:54:42] RECOVERY - Host cp4017 is UP: PING OK - Packet loss = 0%, RTA = 75.05 ms [09:54:43] RECOVERY - Host lvs4002 is UP: PING OK - Packet loss = 0%, RTA = 72.07 ms [09:54:43] RECOVERY - Host cp4011 is UP: PING OK - Packet loss = 0%, RTA = 73.72 ms [09:54:44] RECOVERY - Host cp4006 is UP: PING OK - Packet loss = 0%, RTA = 73.24 ms [09:54:44] RECOVERY - Host cp4004 is UP: PING OK - Packet loss = 0%, RTA = 75.12 ms [09:54:45] RECOVERY - Host cp4015 is UP: PING OK - Packet loss = 0%, RTA = 75.03 ms [09:54:45] RECOVERY - Host cp4009 is UP: PING OK - Packet loss = 0%, RTA = 75.75 ms [09:54:46] RECOVERY - Host upload-lb.ulsfo.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 74.08 ms [09:54:46] RECOVERY - Host bits-lb.ulsfo.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 75.04 ms [09:54:47] 
RECOVERY - Host text-lb.ulsfo.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 72.04 ms [09:54:47] RECOVERY - Host text-lb.ulsfo.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 75.32 ms [09:54:48] RECOVERY - Host mobile-lb.ulsfo.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 73.32 ms [09:54:48] RECOVERY - Host bits-lb.ulsfo.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 73.31 ms [09:54:50] RECOVERY - Host lvs4001 is UP: PING OK - Packet loss = 0%, RTA = 75.04 ms [09:55:06] RECOVERY - Host upload-lb.ulsfo.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 73.82 ms [10:31:16] (03CR) 10Faidon Liambotis: [C: 032] Using interface::add_ip6_mapped on analytics Kafka brokers. [operations/puppet] - 10https://gerrit.wikimedia.org/r/94933 (owner: 10Ottomata) [10:31:25] (03PS2) 10Faidon Liambotis: Using interface::add_ip6_mapped on analytics Kafka brokers. [operations/puppet] - 10https://gerrit.wikimedia.org/r/94933 (owner: 10Ottomata) [10:31:35] (03CR) 10Faidon Liambotis: [C: 032] Using interface::add_ip6_mapped on analytics Kafka brokers. 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/94933 (owner: 10Ottomata) [10:49:29] (03PS1) 10Hashar: deployment: integration/phpcs for Jenkins CI slaves [operations/puppet] - 10https://gerrit.wikimedia.org/r/95137 [10:53:04] !log Manually set vm.dirty_ratio to 10 on amssq57 [10:53:19] Logged the message, Master [10:54:32] (03CR) 10Mark Bergsma: [C: 032] deployment: integration/phpcs for Jenkins CI slaves [operations/puppet] - 10https://gerrit.wikimedia.org/r/95137 (owner: 10Hashar) [10:56:06] RECOVERY - check_job_queue on arsenic is OK: JOBQUEUE OK - all job queues below 100,000 [10:56:06] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 100,000 [11:00:56] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (103652) [11:00:57] PROBLEM - check_job_queue on arsenic is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (103652) [11:09:01] RECOVERY - check_job_queue on arsenic is OK: JOBQUEUE OK - all job queues below 100,000 [11:09:02] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 100,000 [11:12:01] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (110464) [11:12:01] PROBLEM - check_job_queue on arsenic is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (110464) [11:12:29] I think I preferred it when it was always critical [11:18:48] (03PS1) 10Ori.livneh: Disable spammy check_job_queue alert [operations/puppet] - 10https://gerrit.wikimedia.org/r/95139 [11:18:52] :r( [11:19:14] it's buggy AND spammy [11:19:29] it's a useful check [11:19:30] then we will end up not noticing we piled up 10M of jobs in [11:19:36] spam is better than being blind [11:19:41] exactly [11:19:53] snow-blindness is a form of blindness [11:19:54] I agree with you that the check should be done on the age 
of the oldest item [11:20:01] but we should surely not hide the issue [11:20:33] well, ok: at what value would you actually go investigate? [11:20:38] the problem is that we're always near 100k [11:20:44] so it goes over and under the line all the time [11:20:54] 100k doesn't sound like a reasonable amount of jobs to be in the queue though [11:21:11] agreed, but do we need to be reminded of that on IRC every 5 seconds? [11:21:41] let's set it to 200k for now [11:21:44] how does that sound? [11:21:52] 6748 itwiki [11:21:53] 36871 enwiki [11:21:54] 39722 frwiki [11:21:55] :( [11:22:27] it was 10k... [11:22:36] I was about to suggest 150k but I won't bikeshed over it [11:22:37] 200k means 2 million just for the top 10 wikis [11:22:47] 150k is fine by me [11:22:54] Nemo_bis: no, we have a total as well [11:23:08] paravoid: how much is that? [11:23:11] it's also stupid that we have the same threshold for individual wikis & the total, but oh well [11:23:20] Nemo_bis: the status quo is set by what people react to as critical [11:23:26] not by what gets spammed on IRC [11:23:34] indeed [11:23:36] or not necessarily, at least. [11:23:40] ori-l: sure, nobody cares about what icinga-wm thinks of the job queue :) [11:24:00] ori-l: sounds like a performance issue! [11:24:02] * paravoid ducks [11:24:24] ori-l: is there a way to know job queue length by job type? 
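The flapping complained about above — the total hovering just around the 100k line, tripping a fresh PROBLEM/RECOVERY pair every few minutes — is the textbook case for hysteresis: alarm above one threshold and only clear again below a lower one. A minimal sketch of that idea (purely illustrative; this is not how the actual check_job_queue plugin works, and the 100k/80k values are assumptions for the example):

```python
# Hysteresis for a flapping threshold check: go CRITICAL above `high`,
# but only return to OK once the value drops below a lower `low` mark.
# Illustrative sketch only -- not the real check_job_queue logic.

def make_hysteresis_check(high, low):
    state = {"critical": False}

    def check(value):
        if state["critical"] and value < low:
            state["critical"] = False
        elif not state["critical"] and value > high:
            state["critical"] = True
        return "CRITICAL" if state["critical"] else "OK"

    return check

check = make_hysteresis_check(high=100_000, low=80_000)
# A queue hovering around 100k raises one sustained alert instead of flapping:
statuses = [check(v) for v in [99_000, 100_063, 99_900, 103_652, 85_000, 79_000]]
# statuses == ["OK", "CRITICAL", "CRITICAL", "CRITICAL", "CRITICAL", "OK"]
```

With a two-line scheme like this, a queue sitting at 99–110k produces one alert and one recovery rather than spamming the channel on every poll; bumping the single threshold to 200k, as done below, only moves the flapping point.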
[11:24:39] i have no idea [11:24:42] ok [11:24:43] maybe nemo knows [11:24:45] sure it can be done [11:24:50] maybe not currently [11:24:52] i'm not a performance engineer, i just play one on television [11:24:55] iirc Tim has a script for it [11:24:59] ok [11:25:02] there is [11:25:14] mwscript showJobs.php --wiki=frwiki --group [11:25:20] cool [11:25:28] most of the jobs are refreshLinks2 [11:25:33] * aude nods [11:25:35] some huge template might have been edited [11:25:45] wikidata creates some of them, but shouldn't spike [11:26:07] should be a steady amount, unless someone floods wikidata with a bot [11:26:10] aude: https://bugzilla.wikimedia.org/show_bug.cgi?id=47628 btw [11:26:28] ah, yes [11:26:36] translate and wikibase together :) [11:26:48] actually.... [11:26:57] we have another bug for that [11:26:59] maybe [11:27:49] and the top jobs for frwiki were: [11:27:50] 4278 rootJobTimestamp=20131102202900 [11:27:51] 1500 rootJobTimestamp=20131102221601 [11:28:03] i think https://bugzilla.wikimedia.org/show_bug.cgi?id=50202 is related [11:28:16] which seems quite old [11:28:25] (03PS1) 10Faidon Liambotis: Bump jobqueue critical to 200k [operations/puppet] - 10https://gerrit.wikimedia.org/r/95140 [11:29:02] I have no clue why the job queue keeps such old jobs in though [11:30:09] (03CR) 10Faidon Liambotis: [C: 032] Bump jobqueue critical to 200k [operations/puppet] - 10https://gerrit.wikimedia.org/r/95140 (owner: 10Faidon Liambotis) [11:33:49] (03Abandoned) 10Ori.livneh: Disable spammy check_job_queue alert [operations/puppet] - 10https://gerrit.wikimedia.org/r/95139 (owner: 10Ori.livneh) [11:34:22] alternately rate of growth [11:34:43] sure, this has been proposed in the past [11:34:49] do you see anyone working on it? 
:P [11:35:10] it's also likely we need more job runners to lower the job count [11:35:22] * ori-l runs from jobs [11:35:28] perhaps temporarily, just to get it low again, since it didn't vary much [11:35:35] or perhaps permanently [11:35:37] who the fuck knows :) [11:36:27] well, my take-away is that it'd be worth the effort to spend some time being choosy with graphite alerting plugins [11:36:49] pick one that is best-of-breed, and make it easy to configure correctly [11:37:30] more job runners sounds good, tho [11:38:14] we could replace our entire software stack with uwsgi plugins [11:38:30] i'm pretty sure there's a --wikitext argument somewhere [11:38:38] hahahaha [11:38:50] rofl, that really made me laugh [11:39:12] heh [11:39:20] ok, bed time. 3:40 AM. fun. [11:40:55] bah gash no more graph the job queue rate hehe [11:41:16] http://gdash.wikimedia.org/dashboards/jobq/ :( [11:43:07] !log Starting amssq* persistent storage cache fragmentation rework, temporarily disabling puppet in the process [11:43:21] Logged the message, Master [11:43:43] (03PS2) 10Faidon Liambotis: Giving analytics1021/1022 static IPv6 addresses [operations/dns] - 10https://gerrit.wikimedia.org/r/93983 (owner: 10Ottomata) [11:44:11] PROBLEM - Varnish HTTP text-backend on amssq48 is CRITICAL: Connection refused [11:45:36] (03PS3) 10Faidon Liambotis: Giving analytics1021/1022 static IPv6 addresses [operations/dns] - 10https://gerrit.wikimedia.org/r/93983 (owner: 10Ottomata) [11:46:14] (03PS4) 10Faidon Liambotis: Giving analytics1021/1022 static IPv6 addresses [operations/dns] - 10https://gerrit.wikimedia.org/r/93983 (owner: 10Ottomata) [11:46:26] (03CR) 10Faidon Liambotis: [C: 032] Giving analytics1021/1022 static IPv6 addresses [operations/dns] - 10https://gerrit.wikimedia.org/r/93983 (owner: 10Ottomata) [11:48:03] RECOVERY - check_job_queue on arsenic is OK: JOBQUEUE OK - all job queues below 100,000 [11:48:03] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 
100,000 [11:53:18] (03PS1) 10Hashar: upgrade from upstream tip of master [operations/debs/jenkins-debian-glue] - 10https://gerrit.wikimedia.org/r/95142 [11:53:19] (03PS1) 10Hashar: bump debian/changelog [operations/debs/jenkins-debian-glue] - 10https://gerrit.wikimedia.org/r/95143 [11:53:25] AzaToth: ^^^ [11:53:51] and https://gerrit.wikimedia.org/r/#/c/91506/2 [11:54:20] got to find someone from to merge in those three changes, rebuild the package and upload it on apt.wikimedia.org [11:54:31] then we can get the package upgraded on the labs instance that is running the debian jobs :-] [12:01:28] off for lunch [12:15:23] whoah, MOAR 5xx https://gdash.wikimedia.org/dashboards/reqerror/ [12:15:31] I hope the graph lies :) [12:17:10] bunch of #0 /usr/local/apache/common-local/php-1.23wmf2/includes/job/JobQueueRedis.php(326): JobQueueRedis->throwRedisException('xxx', Object(RedisConnRef), Object(RedisException)) [12:17:18] 2013-11-13 11:25:34 mw1015 enwiki: [5eb46536] [no req] Exception from line 831 of /usr/local/apache/common-local/php-1.23wmf2/includes/job/JobQueueRedis.php: Redis server error: protocol error, got '?' 
as reply-type byte [12:17:31] which might explain why the job queue can't empty up some jobs [12:18:00] err there are not that many [12:18:25] + it is unrelated to 5xx errors [12:19:10] hashar: that's https://gerrit.wikimedia.org/r/#/c/94848/ [12:20:22] great [12:20:29] heading lunch now :] [12:25:17] (03PS1) 10AzaToth: adding .gitreview [operations/debs/kafka] (debian) - 10https://gerrit.wikimedia.org/r/95144 [12:31:29] (03CR) 10AzaToth: [C: 031] upgrade from upstream tip of master [operations/debs/jenkins-debian-glue] - 10https://gerrit.wikimedia.org/r/95142 (owner: 10Hashar) [12:43:06] RECOVERY - Varnish HTTP text-backend on amssq48 is OK: HTTP OK: HTTP/1.1 200 OK - 190 bytes in 0.191 second response time [12:53:06] hashar: seems to be a lack of a tag for one of the builds [12:53:20] https://integration.wikimedia.org/ci/job/operations-debs-jenkins-debian-glue-debian-glue/7/console [12:53:39] 12:00:04 fatal: Not a valid object name upstream/v0.7.1-6 [12:55:03] hashar: ping [12:56:05] AzaToth: ahhh [12:56:13] no clue how to fix that one [12:56:44] need to push a tag [12:57:06] or make gbp use the tip of the upstream branch ? [12:57:38] (03PS1) 10Mattflaschen: Remove "Your cache administrator is nobody" joke. [operations/puppet] - 10https://gerrit.wikimedia.org/r/95147 [12:59:23] mark: some folks apparently think that eyeballs might be confused by varnish admin being "nobody" [13:01:23] it's not advised to use the tip of a branch [13:04:57] hashar: zuul doesn't sync tags? [13:07:03] it does [13:07:10] hashar: imo, if a tag is pushed to gerrit it should be mirrored to integration [13:07:35] but the tags are probably not pushed to the repo [13:07:43] ah, you forgot to push it [13:08:27] I think the fix is to push it and rebuild [13:09:58] ! [remote rejected] v0.7.1 -> v0.7.1 (prohibited by Gerrit) [13:09:59] bah [13:10:20] got to finish up integration browser tests, would look at debian glue another day. 
sorry [13:10:40] oh [13:10:47] need to give yourself the perm ヾ [13:11:16] maybe gbp.conf can be pointed to a specific commit though [13:11:24] hmm [13:11:41] the tag is a pointer to a specific commit [13:13:50] (03CR) 10Akosiaris: [C: 032] "LGTM but mental note. I would love to see if we can migrate" [operations/puppet] - 10https://gerrit.wikimedia.org/r/94136 (owner: 10Hashar) [13:14:05] (03CR) 10Akosiaris: [C: 032] zuul: dependencies for Gearman based version [operations/puppet] - 10https://gerrit.wikimedia.org/r/93454 (owner: 10Hashar) [13:15:52] akosiaris: thanks :) [13:17:07] :-). Looking at the third one now [13:21:27] hashar: logging.handlers.TimedRotatingFileHandler at gearman-logging.conf [13:21:36] have you used it extensively ? [13:22:02] I have seen some real weird scenarios with it not rotating when it should and rotating when it should not [13:22:12] copy pasted that from OpenStack configuration :D [13:23:33] Sartoris [13:23:35] (Redirected from Git-deploy) [13:23:35] #REDIRECT Trebuchet [13:23:37] how confusing is that [13:23:49] i suggest [13:24:01] #REDIRECT "The Ryan deployment system" [13:24:12] and be done with that :-) [13:24:45] !log deploying phpcs on Jenkins slave using "the nameless wikimedia deployment system" [13:24:50] I asked ryan about this yesterday, apparently sartoris and trebuchet are different things [13:25:00] Logged the message, Master [13:25:06] ahh [13:25:08] sartoris is the git-deploy replacement command, trebuchet is the salt automation part [13:25:29] so we will end up with scap / jenkins / git-deploy perl script / the git-deploy python rewrite named sartoris and …. 
Trebuchet yet another thing [13:25:36] meh [13:25:46] most of them maintained by volunteers :-] [13:30:55] akosiaris: the logging.handlers.TimedRotatingFileHandler python thing is already used to rotate some existing logs [13:31:16] for example the debug.log and the zuul.log that contains >=INFO [13:33:53] I hate myself / augeas and puppet [13:34:59] PROBLEM - Varnish HTTP text-backend on amssq49 is CRITICAL: Connection refused [13:35:15] (03CR) 10Akosiaris: [C: 032] "(1 comment)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/93457 (owner: 10Hashar) [13:35:28] but not merged as requested ^ [13:35:56] thx [13:36:17] I am almost sure it can be merged but I need to double check that in labs first [13:36:19] RECOVERY - check_job_queue on hume is OK: JOBQUEUE OK - all job queues below 200,000 [13:38:15] akosiaris: so the iptables/Augeas trick in https://gerrit.wikimedia.org/r/#/c/94136/ does not work [13:38:26] I wanted to deny access to port 4370 (gearman) from labs [13:38:36] but augeas messes up the rules for some reason [13:39:19] PROBLEM - check_job_queue on hume is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:39:56] -A INPUT -m comment --comment private_all_all -p all -j ACCEPT -s 10.0.0.0/8 [13:39:57] -A INPUT -m comment --comment deny_all-gearman_gearman -p tcp -j DROP --dport 4370 [13:39:58] :( [13:41:14] hmmmm not an augeas issue I am afraid [13:41:32] more like puppet ? [13:41:39] might be puppet releasing the iptables_add_service in some specific order yea [13:41:45] how hard is it to migrate to ferm ? [13:41:52] in no specific order [13:42:02] that is why... [13:42:09] good thing you noticed. [13:42:21] It shouldn't be too hard [13:42:36] what's your ETA ? 
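The two iptables rules quoted above show the failure mode: under first-match evaluation, the catch-all ACCEPT for 10.0.0.0/8 is consulted before the gearman DROP on port 4370, so the DROP can never fire. A toy first-match evaluator (purely illustrative — not iptables itself, nor the iptables_add_service puppet code) makes the ordering sensitivity concrete:

```python
# Toy model of iptables-style first-match evaluation: the first rule whose
# predicates all match decides the verdict. Purely illustrative.
import ipaddress

def evaluate(rules, src, dport, default="ACCEPT"):
    """Each rule is (source network or None, dest port or None, verdict)."""
    for net, port, verdict in rules:
        net_ok = net is None or ipaddress.ip_address(src) in ipaddress.ip_network(net)
        port_ok = port is None or port == dport
        if net_ok and port_ok:
            return verdict
    return default  # chain policy

# Order as puppet happened to emit it: the broad ACCEPT shadows the DROP.
bad = [("10.0.0.0/8", None, "ACCEPT"), (None, 4370, "DROP")]
# Intended order: the specific gearman DROP must come first.
good = [(None, 4370, "DROP"), ("10.0.0.0/8", None, "ACCEPT")]

verdict_bad = evaluate(bad, "10.0.5.5", 4370)   # the DROP is dead code here
verdict_good = evaluate(good, "10.0.5.5", 4370)
```

This is why ferm comes up as the fix: it regenerates the whole ruleset from one config file in a deterministic order, instead of letting independently-declared puppet resources append rules in whatever order they happen to be applied.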
[13:42:41] Nov 20th [13:42:45] though I can still postpone it [13:42:48] Ok doable [13:42:54] but already waited a few months to get the python modules debianized :-D [13:43:04] and I really really want to offload Zuul from my brain :] [13:43:20] the reason I haven't migrated to ferm yet is that I have the feeling I will end up spending 3 days figuring it out [13:43:31] Can't do it today but i will make a note of helping you do it Friday morning ok ? [13:44:08] can look it up meanwhile [13:44:24] I 'll probably post a rough sketch replacement and will talk it a bit after [13:44:27] for puppet, I am wondering if I could use some 'before =>' statements in the iptables_add_service {} calls [13:44:39] it would work [13:45:00] so if we fail with ferm, you have a plan B :-) [13:45:39] or I plan B first and look at ferm next week :] [13:45:50] lol [13:45:51] running puppetd -tv --debug [13:45:56] then it's plan A but ok :P [13:46:01] the rules are released in some random order [13:46:11] not surpised [13:46:37] I understand now why there are four different classes and some require statement [13:46:39] lame ordering [13:46:41] grbmmb [13:47:19] the problem is that this is a state machine [13:47:25] * hashar wants to buy a bunch of hardware firewalls [13:47:34] ferm is not [13:47:47] it has files and on every file update it refreshes the firewall [13:47:55] what's up with ferm? [13:48:19] we are discussing about replacing modules/contint/manifests/firewall.pp with some simple ferm rules [13:48:27] which is pretty doable [13:48:36] rather easy i 'd say [13:48:57] yes yes yes please please [13:48:59] RECOVERY - Varnish HTTP text-backend on amssq49 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 0.191 second response time [13:49:14] :-) [13:49:25] do you have any doc on wikitech / some examples to build on ? 
[13:50:04] * hashar points at https://wikitech.wikimedia.org/w/index.php?title=Ferm&action=edit [13:50:48] there's almost nothing specific to our ferm setup [13:50:49] just man ferm? :) [13:50:57] I never RTFM [13:51:03] prefer following tutorials :D [13:51:09] if you never rtfm, why would you read wikitech? :) [13:51:20] that is the whole difference between smart people like you guys and me :D [13:51:32] oh come on :) [13:51:39] huh... nice excuse [13:51:41] :P [13:51:46] yeah [13:51:47] not convinced [13:51:57] smart people avoid doing unnecessary work [13:52:13] like writing tutorials for people who can read manuals just as well [13:52:13] (03CR) 10Hashar: "After talking with Greek ops, we should use ferm." [operations/puppet] - 10https://gerrit.wikimedia.org/r/94136 (owner: 10Hashar) [13:52:17] hahaha [13:52:19] "Greek ops" [13:52:22] looool [13:52:26] :-] [13:52:37] more seriously, I got a Jenkins job to fix up [13:52:45] then will man ferm / dig our puppet.git log [13:52:53] and bring a patch [13:52:58] I am sure I can figure it out [13:53:12] hopefully it is all about: ferm { "dowhatI want": { some nice hash } } [13:53:18] :-D [13:53:50] basically, yeah [13:53:58] there's a few ferm invocations in our manifests [13:54:24] not many [13:54:40] we have ferm::service and the more low-level ferm::rule [13:54:53] modules/base/manifests/init.pp has a very simple example with ferm::rule [13:54:57] \O/ [13:55:34] ferm::service is even easier [13:55:57] ferm::service { 'http': proto => 'tcp', port => '80' } [13:56:03] or even port => 'http' [13:56:29] nice, cause I can never remember the port numbers [13:56:37] why the hell do we use numbers for anyway [13:56:56] there's a couple things pending on the ferm side [13:57:16] one is to default to policy DROP again, which means we need to take care of gitblit (I think nothing else uses it yet) [13:57:43] the other one is to template the defs.pp with data from networks.pp, so you could use our networks automagically
[13:58:32] yes that we need to do soon [13:59:19] there were two patches which had -1 "do that" [13:59:26] I abandoned yours akosiaris yesterday [13:59:32] i know [13:59:32] the other one was older :) [14:00:00] !log adjusting new api appservers weight from 30 -> 20 to balance the load a little better [14:00:14] Logged the message, Master [14:00:18] ( http://ganglia.wikimedia.org/latest/?r=4hr&cs=&ce=&m=cpu_report&s=by+name&c=API+application+servers+eqiad&h=&host_regex=&max_graphs=0&tab=m&vn=&hide-hf=false&sh=1&z=small&hc=4 ) [14:12:58] php never cease to amaze me: libgcc_s.so.1 must be installed for pthread_cancel to work [14:12:59] :D [14:15:57] Yay, dependancies [14:18:44] so the thing about the old and new api servers is that the old ones have 12GB ram and the new ones have 64 [14:18:55] I wonder what it would take to give the older boxes a bit more memory [14:19:12] as far as $$ [14:20:00] cmjohnson1, lemme know when you wanna do the shuffle dance [14:20:01] :) [14:20:15] What's the ram? EEC/Registered/Speed/Dim size/DDR [123]? [14:20:27] no idea what's in them [14:21:42] these are r410s (so don't know how much they take but it's gotta be more than 12gb) [14:22:20] Single-bit ECC [14:22:21] can't servermon tell you? [14:22:56] ah no it is not smart enough yet [14:23:16] ottomata: i can move 4 today...an1019-1022...will those 4 work for you? [14:23:49] ah no I lie [14:24:08] ddr3 [14:24:17] 1333 MHz [14:24:31] Multi-bit ECC [14:24:48] and the board supposedly can take up to 128gb [14:24:55] hmm [14:25:00] 21 and 22 are important to do soon [14:25:09] so yes [14:25:30] Reedy: oh hey, were you investigating the apache load increase yesterday? 
[14:26:02] cmjohnson1: yeah, i'd really love for 21 and 22 to be in different racks asap, that will let me keep working on something [14:26:08] i will need leslie or mark help creating the vlan on the switch first [14:26:12] actually [14:26:13] ottomata: ROWS [14:26:14] :) [14:26:35] 23,24 and 25 are important to get in different ROWS sooner too [14:26:36] :) [14:27:04] cmjohnson1: could we do those 5 instead? [14:28:00] Something like this? [14:28:00] Row [14:28:00] X: 21, 23 [14:28:00] Y: 22, 24 [14:28:00] Z: 25 [14:28:12] paravoid: Not exactly. It seemed to be timed roughly with me deploying EducationProgram with elwiki, but that amount of load can't have been just that extension [14:29:17] apergos: wtf. Looking at Crucial, the ram is $280 cheaper in the UK :/. 448.79 GBP 717.381 USD. Or from the US $998.99 [14:29:21] 2013-11-12T19:42:00+00:00,42.965,0,1.9308333333,0.0041666666667,54.833888889 [14:29:24] 2013-11-12T19:48:00+00:00,61.430833333,0,2.0377777778,0,36.269444444 [14:29:51] 19:48 logmsgbot: reedy updated /a/common to I1bb030fce: Enable mwsearch logs [14:30:04] That's a lie [14:30:06] 19:44 logmsgbot: reedy synchronized php-1.23wmf2 'Support CIDR ranges in $wgSquidServersNoPurge' [14:30:15] how is that possible ( Reedy, about the prices) [14:30:39] Reedy: what's the truth, i.e. what exactly was deployed? [14:30:53] the timestamp matches exactly, it's too much of a coincidence to ignore [14:31:04] apergos: I really have nfi. http://www.crucial.com/store/mpartspecs.aspx?mtbpoid=CA4A7035A5CA7304 http://www.crucial.com/uk/store/mpartspecs.aspx?mtbpoid=5967B11DA5CA7304 [14:31:09] ottomata: i am going to move 20-25 [14:32:00] awesooom [14:32:13] paravoid: so, that updated /a/common that aren't true [14:32:17] ottomata: this means renumbering btw [14:32:18] so cmjohnson1 [14:32:26] 21 and 22 need to be in different rows [14:32:26] oh! 
[14:32:30] ok [14:32:31] well [14:32:42] The 19:48 update isn't a deployment [14:32:44] i will continue to refer to them with their numbers [14:32:48] until they change [14:32:57] cmjohnson1: so, ja, 21 and 22 need to be separate [14:33:07] and each of 23, 24 and 25 need to be separate [14:33:11] Reedy: could you tell me what exactly changed (commit ids?) at around that time? [14:33:19] clearly I shouldn't trust the logs :) [14:33:28] oh..okay...well we'll leave 22 where it is [14:33:36] k [14:33:38] 19-22, 23-25 [14:33:42] will move [14:33:47] ok [14:33:57] you could leave one of 23-25 where it is [14:33:58] paravoid: Anything that's "reedy synchronized" is a deployment and is correct [14:33:58] Reedy: the us one is low profile, that's why [14:34:01] if that is useful [14:34:22] apergos: They're the same part number [14:34:30] hrm..i meant 19-21, 23-25 can move [14:34:43] Reedy: synchronized what, though? [14:35:03] 17:38 logmsgbot: reedy synchronized wmf-config/InitialiseSettings.php 'Enable mwsearch logs' [14:35:04] Reedy: no they aren't: CT2715183 vs CT2714637 [14:35:06] can you give me a git commit range? [14:35:13] 19:44 logmsgbot: reedy synchronized php-1.23wmf2 'Support CIDR ranges in $wgSquidServersNoPurge' [14:35:13] 19:40 logmsgbot: reedy synchronized php-1.23wmf3 'Support CIDR ranges in $wgSquidServersNoPurge' [14:35:17] ^ Those 2 are backports [14:35:22] ottomata: no it's okay, i need to keep the racks somewhat balanced. Once Row D is up will move a few more to spread across 3 rows [14:35:28] * Reedy kicks gerrit [14:36:00] paravoid: CIDR ranges is https://gerrit.wikimedia.org/r/#/q/I49e34bdf13e8e8c6cd169c362c283fe1034bdc6d,n,z [14:36:04] don't assume I know how your deploys work and I'd prefer not to guess, so please commit ids :) [14:36:18] oh, so it's a cherry-pick, not everything up to that commit? 
[14:36:18] That's why I was kicking gerrit [14:36:28] it was being slow to let me find it ;) [14:36:29] Yup [14:36:32] Just a cherry pick [14:36:52] 19:48 logmsgbot: reedy updated /a/common to I1bb030fce: Enable mwsearch logs [14:36:58] That would've been a git pull for something else [14:36:59] let me check [14:37:06] yeah, this is the new hook [14:37:08] hmm, ok well cmjohnson1, just so it makes sense, 21 and 22 need to be separate, and 23, 24 and 25 each need to be separate [14:37:12] 23-25 are zookeeper nodes [14:37:20] they operate in a quorum and do elections and such [14:37:23] it'd much better if it gave the range of commits that were updated, not just tip [14:37:49] fortuntely they are easy to move one at a time [14:37:54] taking one down won't hurt production [14:38:04] so if we need to move one of those into row D later we can do that [14:38:10] ottomata: and 11-13 / journalnodes? [14:38:27] ottomata: to be clear 23,24 and 25 all need to be in a separate row? 23 in row A 24 row B 25 row C ? [14:38:29] https://git.wikimedia.org/log/operations%2Fmediawiki-config.git [14:38:35] but i suggested leaving one of 23-25 where it is so that there is at least one ZK node in a different rack than the others [14:38:40] ROW* [14:38:53] cmjohnson1: that is correct [14:38:53] no, what cmjohnson1 said [14:39:12] and that is the same with 11-13 [14:39:12] Reedy: where was it pointed before? [14:39:16] as paravoid just mentioned [14:39:20] but we can do that later [14:39:32] 19:55 logmsgbot: reedy synchronized wmf-config/InitialiseSettings.php 'Iab47779a2c0f9fe239676d75a279336156353c4b' [14:39:33] 19:48 logmsgbot: reedy updated /a/common to I1bb030fce: Enable mwsearch logs [14:39:44] hrm...okay...let me look at this again...i can't spread across 3 rows right now...i can give you 2 and 1 at a later date [14:39:49] that's fine [14:39:50] Reedy: so just those two? 
[14:39:54] https://gerrit.wikimedia.org/r/#/c/94964/ [14:40:04] if we get 21 and 22 separated, and then one of 23-25 separated right now [14:40:11] I can continue with my immediate tasks [14:40:19] so before wmf-config was pointed to 7abec00d1deea9384f430fd678d96cd54aa156b6, 'Merge "CirrusSearch as secondary for nlwiki"' [14:40:24] i was about to assign some IPs to 21 and 22, but am holding off on this reshuffle atm [14:40:30] and then you pushed just those two commits [14:40:41] correct? [14:40:52] 2? [14:41:09] "Enable mwsearch logs" & "Enable EducationProgram on elwiki" [14:41:25] Yup, mwsearch logs being 1738 [14:41:25] ottomata: besides 21,22 23-25 ...do you need any others separated? [14:41:33] ok [14:41:37] and 1955 for the education program [14:41:57] can we revert the CIDR change? [14:42:00] No deploy at the 1948, just the CIDR code cherry picks a few minutes before [14:42:07] Very likely, yeah [14:42:23] I don't think anyone has added any dependencies to it [14:42:23] it's very likely we can fix this in another way too [14:42:36] like remove all this huge list and just put 2-3 blocks [14:42:47] That was the idea of that commit [14:42:48] because now it tries to go through the whole list and interpret them as blocks [14:42:52] cmjohnson1: 11-13 need to be separated in the same way as 23-25, one in each of 3 rows [14:42:59] but that is less of a hurry [14:43:00] and actually [14:43:01] but let's make sure it's what produces the issue [14:43:02] So we don't have to have all the ips separately... [14:43:06] can be done as part of a larger spreading [14:43:13] 11-20 are hadoop datanodes [14:43:22] we should just spread those as evenly as possible [14:43:30] 11-13 are also hadoop journalnodes, and they do stuff in a quorum [14:43:32] Reedy: will you or should I?
[14:43:45] but, it is really easy for us to move the journalnode daemon to another hose [14:43:46] host [14:43:53] Reedy: I generally try to limit myself to -config commits :) [14:44:07] but I can if you can't, this is a bit of an emergency [14:44:09] so, i wouldn't factor that into the reshuffle so much, as long as the datanodes (11-20) are spread out between 3 rows [14:44:15] we can assign what needs to go where [14:44:41] hmm [14:44:48] ottomata: space is limited so if I am moving a few things to row A i would prefer to get them all...also ottomata...could you update the ticket with the servers that need to spread out plz [14:45:01] hmmm paravoid, i wonder if we should rename an23-25 to somethign non analytics related, to potentially encourage others to use the zookeeper cluster there? [14:45:15] sure, cmjohnson1 good idea [14:45:38] question cmjohnson1, we aren't really using an01-08 right now [14:45:43] it hink labs was going to grab them [14:45:51] maybe we'd keep one of them, not sure what the plan was [14:45:58] ottomata: I think there's a rule of not renaming servers because it's too confusing [14:46:00] but, should those be reshuffled too, or should we not think about it right now [14:46:04] paravoid: ok [14:46:08] cool with me [14:46:26] I see the point and I don't mind as long as it doesn't get too confusing [14:46:40] I can just picture RobH complaining :) [14:47:15] Reedy: ? [14:48:16] paravoid: Yeah, you'd see I was doing them if you were in #wikimedia-dev for the reverts and merges ;) [14:48:25] oh, sorry :) [14:48:43] hmm [14:49:05] * Reedy waits for rsync [14:49:06] * paravoid weights the usefulness of being in -dev vs. the number of irssi windows vs. 
the focus steal [14:49:18] I suspect it's not needed [14:49:26] !log reedy synchronized php-1.23wmf2/ 'Revert I49e34bdf13e8e8c6cd169c362c283fe1034bdc6d' [14:49:30] But based on what you said above I see one improvement to the original commit [14:49:34] cmjohnson1: updated [14:49:41] Logged the message, Master [14:50:32] !log reedy synchronized php-1.23wmf3/ 'Revert I49e34bdf13e8e8c6cd169c362c283fe1034bdc6d' [14:50:43] (03PS1) 10Hashar: contint: migrate firewall rules to ferm [operations/puppet] - 10https://gerrit.wikimedia.org/r/95162 [14:50:48] Logged the message, Master [14:51:00] akosiaris: paravoid: apergos: got some basic ferm rules for contint :-] https://gerrit.wikimedia.org/r/95162 [14:51:06] $wgSquidServersNoPurge is 225 entries [14:51:40] it currently does IP cidr changes up to 225 times [14:51:48] s/changes/matches/ [14:52:13] we can make wgSquidServersNoPurge accept a cidr block maybe [14:52:30] hashar: That was the point of the commit [14:52:31] well actually: [14:52:32] foreach ( $wgSquidServersNoPurge as $block ) { [14:52:32] if ( IP::isInRange( $ip, $block ) ) { [14:52:42] I just made https://gerrit.wikimedia.org/r/95163 [14:52:56] Which stops the iteration over everything when the ip is listed 1:1 in the array [14:53:24] indeed [14:53:43] Which may fix the perf issue itself... [14:53:46] http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&c=Application+servers+eqiad&m=cpu_report&s=by+name&mc=2&g=cpu_report [14:53:49] heh [14:53:51] or add the cidr in wgSquidServersNoPurge ?
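[Editor's note: the short-circuit discussed here can be illustrated with a sketch. This is Python rather than the actual MediaWiki PHP, and the function name is made up; the idea from the patch is just to try a cheap exact match against the list before falling back to the expensive per-entry CIDR test.]

```python
import ipaddress

def is_configured_proxy(ip, no_purge_list):
    """Sketch of the short-circuit: exact-string match before any CIDR scan."""
    if ip in no_purge_list:  # cheap compare, no CIDR parsing at all
        return True
    addr = ipaddress.ip_address(ip)
    for block in no_purge_list:
        # Only entries written as ranges need the expensive range test.
        if "/" in block and addr in ipaddress.ip_network(block):
            return True
    return False
```

With a list of ~225 literal IPs (as in production at the time), requests from known proxies match on the first branch and never reach the range loop, which is where the CPU was going.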
[14:53:53] mark: so, it was YOUR FAULT [14:53:55] :P [14:53:57] :D [14:54:01] That looks pretty conclusive [14:54:07] something like 208.80.152.0/22 or something [14:54:14] hashar: exactly [14:54:20] same for ipv6 [14:54:37] Which makes squid.php sane ;) [14:54:38] so it'd probably be equally as solved if we just trimmed the list to (ipv4, ipv6) * (esams, eqiad, ulsfo, pmtpa) [14:55:07] Having that extra code in core is nicer too [14:55:14] For other people who may use our silly software [14:55:15] indeed [14:55:26] That's slightly worrying [14:55:34] haha [14:56:01] But like you say, up to 225 iterations of isInRange() per request.. [14:56:42] Reedy: you want to do the === in IP::isInRange [14:56:48] that would benefit other lookup [14:57:22] definitely not 208.80.152.0/22 though [14:57:25] that includes labs [14:58:31] ah, paravoid, i was going to have to change those IPv6 addies again anyway, wasn't I? [14:58:37] if we move these nodes around? [14:58:40] ottomata: yes, I told you about this yesterday [14:58:43] but I merged anyway today [14:58:46] just to help you [14:59:21] ha, ok, thanks, i was just going to wait before submitting a new patchset on the dns ones, buuuuut, ok! [14:59:23] thanks! [14:59:24] :) [14:59:29] Reedy: merging change in [15:00:20] thanks [15:00:40] conflict while cherry picking to the wmf branches :( [15:00:49] ottomata: oh btw, please stop using overly long lines and periods in the first line of commit messages :P [15:01:35] hashar: The change I reverted needs un-reverting then that coming ontop of it [15:01:47] ahh [15:02:14] go ahead, will be happy to +2 all those reverts and cherry picks [15:02:46] paravoid: Are we ok if we revert the revert and bring that commit in? [15:02:59] Then we just need to know what ranges to simplify the config further... [15:03:17] we can try [15:03:21] hm [15:03:37] I wonder what percentage of requests have random XFFs set [15:03:48] you know what? [15:03:53] can't we just split the arrays? 
[15:04:08] it's silly to do CIDR requests on an array that didn't contain CIDRs until now and it's clearly more expensive [15:04:10] Have a range array and a non range? [15:04:19] yes [15:04:46] we can optimize that in IP::isInRange can't we? [15:04:47] no periods!? [15:04:51] it takes an $addr and a $range [15:04:55] if both are equals, simply return true [15:05:00] what's wrong with periods? [15:05:09] ottomata: http://www.mediawiki.org/wiki/Gerrit/Commit_message_guidelines [15:05:46] As being a title (not a sentence in a paragraph) it should not end in a full stop. Though one can argue about the semantics of it being a sentence or a title, consistency is important and we don't end in a full stop. [15:05:54] * paravoid slaps ottomata with the fine manual :P [15:06:07] * hashar hides as the main author of that manual. [15:06:27] hmm, ok ok ok ok ok [15:06:28] ottomata: the reason is that the first line of the commit message is used in Gerrit mail notification, patch formatting, git log --oneline etc.. [15:06:44] the reason is that it's a title, not a sentence [15:07:00] titles don't end in full stops [15:07:17] titles aren't in the imperative mode :-P [15:07:35] "Phrase your subject in imperative mood." 
for those who were about to [citation needed] me [15:07:35] damn [15:07:36] :) [15:08:16] anyway, the mediawiki.org docs is not our thing, these are fairly typical rules across git repos [15:08:47] load 30% down [15:08:48] amazing [15:09:03] this is one heck of an expensive call [15:09:55] bd808|BUFFER: strike 1 [15:10:01] ;) [15:10:07] haha [15:10:19] or step N in the employee orientation [15:10:55] https://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&me=Wikimedia&m=cpu_report&s=by+name&mc=2&g=cpu_report [15:10:59] Makes a difference clusterwide too [15:11:00] haha [15:11:04] of course [15:11:33] halfed api appservers load [15:12:32] amazing [15:14:08] the funniest part is that the load would vary a lot between hour of the day [15:14:15] because we have the list split by DC and hence service regions [15:14:36] so ulsfo is at the bottom and oceania/west north america traffic would traverse the whole list [15:19:05] anyone know whether something got changed in graphite ? [15:19:14] like what? [15:19:16] graphs stopped working a couple days ago [15:19:17] http://gdash.wikimedia.org/dashboards/totalphp/ [15:20:01] :( [15:20:05] ori-l is working on it [15:20:08] no idea what happened here though [15:20:24] hopefully ori is aware of it :-] [15:20:28] will drop in an email to be safe [15:31:54] Reedy/hashar: I reopened https://bugzilla.wikimedia.org/show_bug.cgi?id=52829 [15:33:55] Reedy: don't we want to reintroduce the wgSquidServersNoPurge range support ? [15:34:04] together with your in_array() optimization [15:35:21] We can, yeah [15:35:30] it's whether we do anything else to core mw first [15:36:18] squash both commits in one and reapply on wmf branches ? [15:36:33] then later on tweak $wgSquidServersNoPurge to use cidr [15:37:04] We can redeploy it now if paravoid is happy to let us do so [15:38:35] I think let's just get back on the bug report and let someone have a closer look as a non-emergency [15:38:43] e.g. 
bryan [15:44:10] (03PS1) 10Faidon Liambotis: Geolocate our own IP space manually [operations/dns] - 10https://gerrit.wikimedia.org/r/95169 [15:44:18] (03CR) 10jenkins-bot: [V: 04-1] Geolocate our own IP space manually [operations/dns] - 10https://gerrit.wikimedia.org/r/95169 (owner: 10Faidon Liambotis) [15:44:23] damn [15:45:12] update bug 52829 thx [15:45:46] (03PS2) 10Faidon Liambotis: Geolocate our own IP space manually [operations/dns] - 10https://gerrit.wikimedia.org/r/95169 [15:45:46] * hashar gives a beer to Jenkins for saving us from a dns outage [15:46:01] nah, my scripts would catch it before reloading [15:46:17] and we try gdnsd reload without a restart too [15:46:27] so there's more safeguards against this [15:46:36] but jenkins is useful :) [15:46:51] Coren: see the commit above [15:47:11] what I would like eventually is make sure A and PTR entries are matched [15:47:11] Coren: it should help by zeoring the impact of ulsfo issues to labs [15:47:25] and also reducing latency etc. [15:47:26] PROBLEM - Varnish HTTP text-backend on cp1055 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:48:00] cmjohnson1: hey :) [15:48:07] hey! [15:48:08] (03CR) 10coren: [C: 032] "Yes, it is silly." [operations/dns] - 10https://gerrit.wikimedia.org/r/95169 (owner: 10Faidon Liambotis) [15:48:16] RECOVERY - Varnish HTTP text-backend on cp1055 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 0.002 second response time [15:48:18] paravoid: Merged. [15:48:53] lesliecarr: need to setup analytics vlan on row A [15:49:03] uhm, I wanted to leave it for review a little while longer [15:49:10] can you help? [15:49:18] and it wouldn't be an issue right now, as ulsfo is already depooled now [15:49:35] ok [15:50:40] and row B, presumably? [15:51:18] i think row b already has one .... [15:51:23] row b does have one [15:51:27] oh, ok, sorry [15:51:29] paravoid: Hm. Sorry, I didn't consider you'd want to delay -- it's really straightforward. 
[15:51:33] (03PS1) 10AzaToth: adding gitreview [operations/debs/jenkins-debian-glue] - 10https://gerrit.wikimedia.org/r/95170 [15:51:50] Coren: kind of. i'm wondering if it makes sense to point esams to esams and not directly to eqiad, for example. [15:52:40] but no, that commit doesn't hurt, we can iterate [15:52:43] paravoid: I'd have though you'd want to avoid the extra breaking part for everything that's "us" (and expected you intended that) [15:52:54] what do you mean? [15:53:26] PROBLEM - Varnish HTTP text-backend on cp1055 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:53:42] hashar: ping [15:53:53] pong [15:54:07] Coren: sorry, I lost you :) [15:54:26] RECOVERY - Varnish HTTP text-backend on cp1055 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 9.479 second response time [15:54:41] paravoid: I meant that I thought that you intended to not use uslfo at all for our own subnets [15:54:44] can you +2 https://gerrit.wikimedia.org/r/95170 [15:54:49] ulsfo* [15:55:22] Coren: we could point ulsfo to ulsfo and esams to esams, the question is whether it makes sense consider the traffic patterns that come from our own caching centers/DCs [15:55:41] hashar: and I don't even have a upstream/v0.7.1-6 tag, so... [15:55:42] pmtpa & eqiad shouldn't be pointed to ulsfo, though, that's for sure :) [15:55:52] AzaToth: I crafted that one using git describe [15:56:03] Fer shure. [15:56:03] for ulsfo is also a bit silly, since we don't have recursor there anyway :) [15:56:04] (03CR) 10Hashar: [C: 032 V: 032] adding gitreview [operations/debs/jenkins-debian-glue] - 10https://gerrit.wikimedia.org/r/95170 (owner: 10AzaToth) [15:56:10] ah [15:56:20] (and our recursor doesn't do edns-client-subnet) [15:56:26] !log reedy synchronized wmf-config/CommonSettings.php 'Set $wgImgAuthDetails to true' [15:56:41] Logged the message, Master [15:56:41] AzaToth: merged in:] [15:56:59] good [15:58:25] AaronSchulz: About? [15:58:32] hashar: can you give me push tag right? 
[15:58:44] AaronSchulz: seems timeline is broken on private wikis [15:58:44] File "mwstore://local-multiwrite/local-public/timeline/92b6b06b956228bc173a1513808c91da.png" does not exist. [16:00:37] $wgTimelineSettings->fileBackend = 'local-multiwrite'; [16:02:09] The image is at https://upload.wikimedia.org/wikipedia/office/timeline/92b6b06b956228bc173a1513808c91da.png [16:02:10] easytimeline used to set chmods so that the files it created were not readable to current user :D [16:02:26] (until i fixed that a few months ago – and no, it's probably not related) [16:02:36] hashar: ? [16:03:38] AzaToth: busy sorry [16:03:46] will sort it out later on [16:05:22] Reedy, hashar, paravoid: Is there something I can do to help with the problem I created in the CIDR patch? [16:05:45] try harder to actually crash the cluster ? [16:05:53] it handled your DoS :-D [16:06:09] hashar: I gave it my best shot. I guess I'm not Sr material :) [16:06:18] bd808: more seriously have a look at https://bugzilla.wikimedia.org/show_bug.cgi?id=52829#c8 and comment #9 [16:06:46] basically, we got to pick your patch and Reedy short circuit patches [16:06:51] apply them to wmf branches and deploy [16:06:55] probably want to schedule it [16:07:23] bd808: I erplied on the bug report [16:08:52] Reedy's fix should effectively turn off my "enhancement" in our prod cluster until we start using CIDR instead of explicit IPs in the config correct? [16:09:03] Because we proxy everything [16:09:33] That was the idea, yeah [16:09:36] no [16:09:45] there can be multiple XFFs in chain [16:09:58] so I can just do a request with XFF and our caching layer will just append our IPs [16:10:08] and I think the function is called for every XFF in list, correct me if I'm wrong [16:11:52] paravoid: You are correct. It iterates the XFF chain until it finds a match or reaches the end. 
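[Editor's note: the back-to-front X-Forwarded-For walk described above can be modeled as follows. This is a simplified illustration, not MediaWiki's actual code, and the names are invented; each `is_trusted` call is where the proxy-list lookup happens, which is why a long CIDR list multiplies the per-request cost.]

```python
def real_client_ip(remote_addr, xff_header, is_trusted):
    """Walk the X-Forwarded-For chain from the back, popping entries while
    the current address is a trusted proxy, to find the address that first
    entered the infrastructure."""
    chain = [h.strip() for h in xff_header.split(",")] if xff_header else []
    client = remote_addr
    while chain and is_trusted(client):
        client = chain.pop()  # step one hop further away from us
    return client
```

Note the consequence discussed above: a client can append arbitrary XFF entries of its own, so after the trusted entries are popped the walk stops at the first untrusted one, but every trusted hop still costs one full proxy-list check.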
[16:12:26] so yes, it'd help a lot since it'd fix it for most of the requests [16:12:33] which have no XFF when they enter our infrastructure [16:12:50] but it'd still cause a full traverse on requests coming from third-party proxies [16:12:58] incl. some popular ones such as opera mini, nokia etc. [16:13:29] It works the XFF from back to front so really the first test should match a squid/varnish box [16:14:11] but it'd still pop from that list until a non-match is found or the list ends, right? [16:16:34] Oh right. It's trying to find the "real" request IP [16:18:08] (03PS2) 10AzaToth: bump debian/changelog [operations/debs/jenkins-debian-glue] - 10https://gerrit.wikimedia.org/r/95143 (owner: 10Hashar) [16:18:16] I gave isInRange a quick look [16:18:27] it looks overly complicated for something that could be as simple as a couple of bitwise operations [16:18:39] cidr matching can be made really fast [16:18:54] Yeah that function could be optimized [16:19:52] There's another short circuit that could be added in wfIsConfiguredProxy() too. Move on in the list if $block doesn't contain a '/' [16:20:27] so then you don't use the range features of isInRange, so you might just as well optimize it [16:22:24] uhhhhhh [16:22:27] * greg-g reads backlog [16:22:37] it's fine [16:22:40] no worries [16:22:43] bd808: Optimi[zs]e all the things [16:22:46] Then blame AaronSchulz [16:22:57] heh [16:23:14] I'll ignore them and move on to other morning readings [16:23:16] Reedy: It's more obviously the performance engineer's fault. He +2'd and everything [16:23:26] :) [16:23:36] it's just part of employee orientation [16:23:39] That too [16:23:44] bad ori-l [16:23:59] ottomata: I don't care if you guys rename them, so paravoid is incorrect there. We use asset tags to track [16:24:03] So server names are immaterial.
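[Editor's note: the "couple of bitwise operations" version of IPv4 CIDR matching looks like this. A Python sketch for illustration; the same mask-and-compare works in PHP on the integers returned by ip2long().]

```python
import socket
import struct

def ipv4_to_int(ip):
    # "!I" = one big-endian (network byte order) unsigned 32-bit integer
    return struct.unpack("!I", socket.inet_aton(ip))[0]

def in_cidr(ip, cidr):
    """IPv4 CIDR match as a single mask-and-compare on 32-bit integers."""
    net, bits = cidr.split("/")
    mask = (0xFFFFFFFF << (32 - int(bits))) & 0xFFFFFFFF
    return ipv4_to_int(ip) & mask == ipv4_to_int(net) & mask
```

No string slicing or per-octet loops: build the netmask from the prefix length, then compare the masked address against the masked network.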
[16:24:14] RobH: oh, sorry :) [16:24:21] you can rename them, but you have to put in a few tickets to complete it properly [16:24:28] thanks :) [16:24:29] so the onsite person knows to update the server label and racktables [16:24:41] otherwise its cool [16:25:14] So rename as needed, just ensure you update: switch port label, dns/dhcp/etc, and drop ticekt for the onsite to put a new label on it =] [16:25:31] this is the way we break the cluster [16:25:31] break the cluster [16:25:31] break the cluster [16:25:45] i'm indifferent actually, paravoid, what do you think? an23-25 are zk nodes, for which other teams might have use [16:25:46] if the analytics machines break the cluster we are in a bad position [16:26:00] RobH: different issue [16:26:00] heh [16:26:00] so naming them analytics* might discourage others from using? [16:26:11] abbreviate [16:26:13] that always works well [16:26:20] pls dont make them anal servers. [16:26:23] :D [16:26:24] haha [16:26:31] nono, zookeeper [16:26:32] lol [16:26:47] I'm not sure if a zk* notation would make sense, maybe misc... [16:26:48] well, if you use numbers, eqiad is 1k range [16:26:49] zookeeper1001-1003 fine with me [16:26:51] yeah [16:26:52] so zk1001+ is fine [16:26:53] or that [16:26:55] might be too specific? [16:27:02] what kind of boxes are these? [16:27:04] cisco? [16:27:12] i mean, they aren't high powered machines, so I doubt there will be other stuff there, but naming them such does lock them down a bit [16:27:14] no [16:27:15] umm [16:27:17] we tend to not name servers after a specific service, but generic service name (cp for cache proxy rather than varnishXXXX) [16:27:27] R310 [16:27:28] but meh, if zk is all that runs [16:27:36] 4 core X3430 @ 2.40GHz [16:27:37] 8G RAM [16:27:39] oh, these are the 5 r310s [16:27:46] yea, not powerful. 
[16:29:24] !log Jenkins switched PHP Codesniffer (phpcs) to use the version from integration/phpcs.git instead of the pear one {{gerrit|95172}} [16:29:33] zk is a coordination / distributed config management service [16:29:34] hmm [16:29:38] Logged the message, Master [16:30:13] manybubbles + greg-g, did you guys have potential needs for zookeeper one day? [16:31:33] !log Jenkins: applying label 'hasPhpcs' on lanthanum slave, letting php code sniffer jobs to roam there. [16:31:47] Logged the message, Master [16:35:10] a generic zk setup sounds a good idea nevertheless [16:35:27] I don't care if we name it zk10xx or misc names (or analytics names) [16:35:40] analytics seems the worst of these three options [16:35:46] like* [16:36:26] I could say the exact same about kafka brokers though :) [16:36:54] I know ori has some plans on using them? [16:37:01] paravoid: thanks for the community talking on #tech last night [16:37:04] i'm going through scrollback [16:37:11] which one? [16:37:38] ottomata: not that I know of. would like to run some elasticsearch masters on those machines one day, but otherwise, I don't think so. [16:38:09] manybubbles: there's a zookeeper plugin for elasticsearch, dunno if there's any upsides over zen though [16:39:03] cmjohnson1: what's one of the interfaces to go to analytics1-a ? [16:39:08] paravoid: someone once talked about how it doesn't suffer from split-brain problems [16:39:32] but I'm more inclined to stay stock because that gets more attention [16:40:52] lesliecarr: 2/0/17 will be one of the interfaces..but there is nothing there yet [16:41:09] manybubbles: nod [16:41:11] that's cool, for interface-range statements you need a member interface [16:41:24] cmjohnson1: should be good to go! [16:41:25] heh...yeah! [16:41:26] cool [16:41:27] thx [16:46:28] mark: ping re https://gerrit.wikimedia.org/r/#/c/93527/ ;) [16:48:36] !log Jenkins: phpcs jobs successfully running on lanthanum. !!! 
[16:48:51] Logged the message, Master [16:56:40] !log recreating esams varnish backend caches, preallocated with fallocate; the following have been done: amssq49, 50, 53, 55, 57, 58 [16:56:54] otherwise I will not be able to keep track [16:56:55] Logged the message, Master [17:07:24] (03PS1) 10coren: Tool Labs: Import many dependencies from TS's Jira [operations/puppet] - 10https://gerrit.wikimedia.org/r/95183 [17:09:19] Jira ? [17:09:28] (03CR) 10coren: [C: 032] "Package additions." [operations/puppet] - 10https://gerrit.wikimedia.org/r/95183 (owner: 10coren) [17:09:38] atlassian Jira ? [17:10:33] toolserver had jira.... I never knew... [17:14:04] (03CR) 10Hashar: "Will get it reviewed with Alexandros :-]" [operations/puppet] - 10https://gerrit.wikimedia.org/r/95162 (owner: 10Hashar) [17:14:22] I am off [17:14:25] *wave* [17:14:35] bye [17:15:23] gwicke: "Various improvements" :) [17:16:23] greg-g: you asked for it ;) [17:16:43] none of those commits is done yet, so that's about as much as I can tell you [17:19:55] greg-g: I'll keep the actual info on our deployments page, so that we have a simple overview [17:20:09] changed the links from the wikitech deployments page to directly point there [17:20:57] !log populateRevisionLength.php running in screen on terbium [17:21:11] Logged the message, Master [17:26:14] gwicke: thanks [17:26:25] gwicke: that's a good place for it [17:31:20] (03PS1) 10coren: Tool Labs: Dependencies for BZ 56996 [operations/puppet] - 10https://gerrit.wikimedia.org/r/95187 [17:31:56] ah paravoid, so you think we should rename all these nodes? :p [17:31:57] i don't mind [17:33:04] ottomata: so whatcha gonna call em? [17:33:08] (03CR) 10coren: [C: 032] "Package installs." [operations/puppet] - 10https://gerrit.wikimedia.org/r/95187 (owner: 10coren) [17:33:16] zk1XXX? 
[17:33:56] zookeeper10* [17:33:56] kafka10*,log10*,distlog10*,dl10*,logbuffer10*, :p [17:33:56] hadoop10*,analytics10*(?),batch10*,paralleljob10* hah [17:34:02] https://wikitech.wikimedia.org/wiki/Server_naming_conventions [17:34:10] well, whatever is settled on, update that page please =] [17:34:12] oo reading [17:34:21] just lists all our other service groups and the names [17:34:41] (doesnt include testsearch, cuz thats temp) [17:34:54] curious, robH, why is 2000-2999 reserved? [17:35:06] new DC [17:35:08] yep [17:35:10] pmtpa replacement [17:35:12] new us based dc [17:35:31] just dunno what it will be called yet, i guess i could put new us based DC, heh [17:35:38] ah [17:35:43] k [17:35:57] Did someone fiddle with the imagescaler class and roles lately? [17:35:57] RobH: fwiw, esams has no flowers any more [17:35:59] yeah all names I can thikn of are lame unless they are named after the software [17:36:00] dunno [17:36:09] zookeeper* [17:36:09] kafka* [17:36:09] hadoop* [17:36:10] meh [17:36:12] ? [17:36:13] RobH: just notable dutch people [17:36:19] so in the past both mark and myself kind of hated server renames, since the name is how we tracked them [17:36:19] Since Saturday: "Could not find class imagescaler for i-000005f9.pmtpa.wmflabs on node i-000005f9.pmtpa.wmflabs" breaks puppet runs on labs. [17:36:27] paravoid: cool [17:36:30] (not encyclopedians necessarily) [17:36:32] i'll upate page in a bit [17:36:41] I hadn't heard about the flower rule [17:36:45] it was old [17:36:52] so when I read it I was trying to think of any flower-named servers [17:36:54] when i first started there were flower names in both knams and pmtpa [17:36:55] and failed :) [17:37:00] but they have all gone away over time [17:37:35] you're also missing tmh* [17:37:39] although I'd like to see it gone [17:38:08] (in favor of mw*) [17:38:10] rose, I just removed it the other day [17:38:15] a left over from years ago [17:38:31] (03CR) 10GWicke: "That would be awesome. 
People are killing our puny labs VM currently, and mobile will start playing with the content soon as well." [operations/puppet] - 10https://gerrit.wikimedia.org/r/93527 (owner: 10Lcarr) [17:38:49] also missing amslvs ;) [17:39:02] ok, I'll stop doing edits over IRC now [17:39:03] add em! [17:39:06] yeah :P [17:39:13] i fixed the notable dutch thing [17:39:31] lucnh bbl [17:39:34] so yea, renames used to mean losing track of what server is what, and when it was ordered [17:39:43] but for the past few years we've been using asset tags [17:39:49] right, that's what I was remembering [17:39:51] and I didn't know that [17:39:55] my only annoyance is when folks rename servers, and then dont bother to update the racktables [17:39:58] or the label on server [17:40:07] so when it breaks later no one knows where it is [17:40:14] (see example history of professor ;) [17:40:28] Reedy: link? [17:40:44] I have turned up several of those in my cleanup certainly [17:40:47] makes the annual inventory kind of hell [17:41:00] which is now more Chris's problem than my own ;] [17:41:05] \o/ [17:41:10] don't sound so cheerful :-D [17:41:19] he isnt in channel, so it wont depress him [17:41:20] heh [17:41:30] plus now the audit goes really, really smoothly [17:41:37] cuz our accounting team is kinda awesome. [17:42:30] !add-labs-user [17:42:36] :( [17:47:45] (03Abandoned) 10Yuvipanda: Move toollabs specific configuration to their own file [operations/puppet] - 10https://gerrit.wikimedia.org/r/84926 (owner: 10Yuvipanda) [17:50:35] looks like still 503's ? [17:52:45] LeslieCarr: trying to find out if it's just the one page or if there's more than that going onn [17:53:11] so far just that one [17:53:28] how do the links look anyways? [17:54:01] which links ? 
[17:54:34] the link, should say, which we are not using because traffic is going through eqiad iirc :-P [17:57:30] andrewbogott: there is an open ticket that will eventually use it [17:57:42] "new download server for mediawiki" or so [17:58:26] mutante: OK, we can leave it be then, I'll add a note. [17:58:35] But, meanwhile, do you mind reviewing that patch? It looks pretty good to me. [17:58:56] andrewbogott: which patch? there are at least 2 related to download [17:59:02] https://gerrit.wikimedia.org/r/#/c/90760/ [17:59:29] (03CR) 10Andrew Bogott: "(1 comment)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/90760 (owner: 10Matanya) [17:59:58] andrewbogott: https://gerrit.wikimedia.org/r/#/c/94408/ [18:00:40] dang [18:00:59] eh, yeah, let me comment on the first one [18:01:26] both versions have their merits. [18:01:33] i would like apergos to take a look [18:01:41] * andrewbogott nods [18:02:26] (03CR) 10Andrew Bogott: "This would be a good candidate for an entry here: https://wikitech.wikimedia.org/wiki/Puppet_Todo" [operations/puppet] - 10https://gerrit.wikimedia.org/r/94408 (owner: 10Dzahn) [18:02:30] I'll do gerrit reviews tomorrow I guess [18:02:31] (03CR) 10Andrew Bogott: "This would be a good candidate for an entry here: https://wikitech.wikimedia.org/wiki/Puppet_Todo" [operations/puppet] - 10https://gerrit.wikimedia.org/r/90760 (owner: 10Matanya) [18:02:35] matanya was first though apparently [18:02:42] it's been a few days [18:03:43] Probably applying the role change to matanya's patch is the right way forward. 
[18:04:20] andrewbogott: apergos : so download-mediawiki was on kaulen i bet [18:04:28] but we can reuse it later [18:05:33] (03CR) 10Dzahn: "the download::mediawiki class doesn't seem to be in use right now, but i think it was used on kaulen before and we can make use of it agai" [operations/puppet] - 10https://gerrit.wikimedia.org/r/90760 (owner: 10Matanya) [18:05:38] yes kaulen I think [18:05:43] greg-g: we won't depl today, someone else could go [18:08:26] yurik_: thanks, how's your deploy access, btw? [18:09:32] greg-g: all good now :) [18:09:34] rcx [18:09:37] thx [18:09:39] :) [18:12:32] akosiaris: poke [18:13:42] AzaToth: in a meeting [18:13:49] k [18:17:52] Reedy: why is Extension: not in the default mw.org search namespaces? [18:32:43] (03CR) 10Dzahn: [C: 031] "matanya, this looks quite good. though i have created another patch forgetting this already existed. I'm ok with mine being the abandoned " [operations/puppet] - 10https://gerrit.wikimedia.org/r/90760 (owner: 10Matanya) [18:38:07] I notice I've got some changesets unreviewed at https://gerrit.wikimedia.org/r/#/q/status:open+project:operations/debs/ircd-ratbox,n,z [18:38:21] anyone able to review it? [18:39:45] Jeff_Green, do we already have hardware for PDF rendering? [18:39:53] (03CR) 10Matanya: "Thanks. if you think your is better just merge it. I can easily add the role, but if you want to merge yours, let me know, to prevent redu" [operations/puppet] - 10https://gerrit.wikimedia.org/r/90760 (owner: 10Matanya) [18:40:11] maxsem I believe we have three servers sitting idle for this purpose. rechecking that [18:41:32] mutante: lets finish it together [18:42:54] if you guys get it done and are happy with it you don't need to wait for me to sign off [18:43:06] MaxSem: they appear to have been pulled, so we'll need to commission new servers [18:43:21] hmm [18:43:39] okay, thanks [18:44:24] matanya: yes ok:) i'll take a look to create another patch set on yours then.. 
just not right now because we have meeting and such [18:44:34] MaxSem: yep, found the RT referencing wiping them [18:44:51] thanks mutante [18:46:16] (03CR) 10Matanya: "As agreed on IRC Daniel Will finish the role part based on his work in: https://gerrit.wikimedia.org/r/#/c/94408/2" [operations/puppet] - 10https://gerrit.wikimedia.org/r/90760 (owner: 10Matanya) [18:47:41] mutante, matanya, maybe we should just merge matanya's patch as is, then do the role refactor in a second patch? [18:47:43] Might be simpler [18:48:36] AaronSchulz: Because no one has added it? [18:48:38] i'd be fine with that too [18:49:15] Maybe that Sam guy could add it [18:49:37] ok… if I catch a gap between meetings today I'll merge the existing patch. [18:50:47] andrewbogott: cool [18:50:59] matanya: fwiw.. https://gerrit.wikimedia.org/r/#/c/94407/ so we dont duplicate again [18:51:16] but that one is WIP because i think wikibugs shouldnt be in it.. [18:51:39] it's both IRC but one is server and one is a client [18:52:16] AaronSchulz: It's already in... [18:52:16] '+mediawikiwiki' => array( 12 => 1, 100 => 1, 102 => 1 ), [18:52:27] Extension [18:52:54] it's not selected when I do search [18:53:34] It's there in $wgNamespacesToBeSearchedDefault when used via eval.php [18:56:08] (03PS1) 10Reedy: Disable and remove CommunityHiring and CommunityApplications [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95197 [18:56:44] (03PS2) 10Reedy: Disable and remove CommunityHiring and CommunityApplications [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95197 [18:57:31] ok, thanks andrewbogott and mutante [18:59:04] ottomata: busy? [18:59:48] hiyaa [18:59:50] about to have ops meeting [19:00:02] what'sup? [19:00:39] oh, ok. 
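The `'+mediawikiwiki'` key being puzzled over relies on a wmf-config convention: in MediaWiki's SiteConfiguration, a per-wiki key prefixed with `+` merges its array into the `default` value instead of replacing it — the same "extend, not replace" distinction as the VisualEditor namespaces change earlier in this log. A simplified Python model of that lookup (illustrative, not the real PHP implementation):

```python
def resolve(per_wiki, wiki):
    # per_wiki maps 'default', 'somewiki', or '+somewiki' to dicts of
    # namespace-id => flag, as in $wgNamespacesToBeSearchedDefault.
    if wiki in per_wiki:
        return per_wiki[wiki]                # plain key replaces the default
    if "+" + wiki in per_wiki:
        merged = dict(per_wiki.get("default", {}))
        merged.update(per_wiki["+" + wiki])  # '+' key extends the default
        return merged
    return per_wiki.get("default", {})
```

So `'+mediawikiwiki' => array( 12 => 1, 100 => 1, 102 => 1 )` adds Help, Manual and Extension on top of whatever namespaces the default already searches, rather than limiting search to only those three.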
just wondered since you are on RT duty, if you can help out a bit with training to a noob :) [19:01:23] (03PS3) 10Reedy: Disable and remove CommunityHiring and CommunityApplications [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95197 [19:01:47] (03CR) 10Reedy: [C: 032] Disable and remove CommunityHiring and CommunityApplications [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95197 (owner: 10Reedy) [19:01:57] (03Merged) 10jenkins-bot: Disable and remove CommunityHiring and CommunityApplications [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95197 (owner: 10Reedy) [19:03:22] !log reedy synchronized wmf-config/ 'Disable and remove CommunityHiring and CommunityApplications' [19:03:38] Logged the message, Master [19:07:04] !log reedy updated /a/common to {{Gerrit|I775790df6}}: Disable and remove CommunityHiring and CommunityApplications [19:07:15] Logged the message, Master [19:08:57] (03PS1) 10Reedy: Remove usages of $wmgUseUsabilityInitiative [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95199 [19:09:34] (03CR) 10Reedy: [C: 032] Remove usages of $wmgUseUsabilityInitiative [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95199 (owner: 10Reedy) [19:09:38] Reedy: and wmgUsabilityEnforce too? [19:09:42] (03Merged) 10jenkins-bot: Remove usages of $wmgUseUsabilityInitiative [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95199 (owner: 10Reedy) [19:09:55] * Reedy points MatmaRex at the git repo [19:09:56] :P [19:10:16] Reedy: lolno. [19:10:22] !log reedy updated /a/common to {{Gerrit|I2a05d4a97}}: Remove usages of $wmgUseUsabilityInitiative [19:10:28] (03PS1) 10Reedy: $wgImgAuthDetails = true; [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95200 [19:10:35] Reedy: i recently removed 60 lines of config related to Vector, you still owe me some. 
:P [19:10:36] Logged the message, Master [19:11:42] (03CR) 10Reedy: [C: 032] $wgImgAuthDetails = true; [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95200 (owner: 10Reedy) [19:11:52] (03Merged) 10jenkins-bot: $wgImgAuthDetails = true; [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95200 (owner: 10Reedy) [19:13:59] !log reedy synchronized wmf-config/ 'wgImgAuthDetails to true. Remove wmgUseUsabilityInitiative' [19:14:14] Logged the message, Master [19:30:57] PROBLEM - Varnish HTTP text-backend on cp1055 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:31:57] RECOVERY - Varnish HTTP text-backend on cp1055 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 6.567 second response time [19:45:57] PROBLEM - Varnish HTTP text-backend on cp1055 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:46:57] RECOVERY - Varnish HTTP text-backend on cp1055 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 3.827 second response time [19:58:26] (03PS1) 10Dr0ptp4kt: Rearrange W0 config-pages-supported variables order to be more logical. [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95212 [20:04:57] PROBLEM - Varnish HTTP text-backend on cp1055 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:05:55] RECOVERY - Varnish HTTP text-backend on cp1055 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 8.937 second response time [20:05:57] ^^^yurik yurik_ i believe change 95212 will address the non-display of the "Edit" button issue for users defined in the "admins" array of Zero: config blobs; if not it should be a no-op as far as i can tell. [20:06:11] LeslieCarr: regarding rt 5885, did you open a bz ticket? [20:06:27] yurik yurik_ i'm gonna eat lunch. be back later [20:06:37] (*i* will be back later) [20:06:42] (03CR) 10Yurik: [C: 032] Rearrange W0 config-pages-supported variables order to be more logical. 
[operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95212 (owner: 10Dr0ptp4kt) [20:06:51] (03Merged) 10jenkins-bot: Rearrange W0 config-pages-supported variables order to be more logical. [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95212 (owner: 10Dr0ptp4kt) [20:07:05] oh, i don't think i actually did [20:07:15] or maybe the other person did [20:07:26] but i think i opened that to remind myself.... which failed :-/ [20:08:47] LeslieCarr: well, i'm here to fullfil that role :) [20:10:42] hehe [20:14:55] PROBLEM - Varnish HTTP text-backend on cp1055 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:15:11] oh matanya, sorry [20:15:24] i can totally help, meeting should be over soon [20:15:28] but I am not on RT duty! :p [20:15:36] we forgot to change the topic again [20:15:44] s'ok they, happy to help iiiin 15 or 20 mins [20:15:56] RECOVERY - Varnish HTTP text-backend on cp1055 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 3.988 second response time [20:15:59] ottomata: thanks again. ping me when available [20:17:38] (03PS4) 10Ryan Lane: Add shadow_reference support to trebuchet [operations/puppet] - 10https://gerrit.wikimedia.org/r/94680 [20:26:44] (03Abandoned) 10Ottomata: analytics102{1,2} are Kafka brokers and need public addresses. [operations/puppet] - 10https://gerrit.wikimedia.org/r/93904 (owner: 10Ottomata) [20:27:10] (03PS2) 10Ottomata: Giving analytics102{1,2} a public address. [operations/dns] - 10https://gerrit.wikimedia.org/r/93903 [20:27:19] (03Abandoned) 10Ottomata: Giving analytics102{1,2} a public address. [operations/dns] - 10https://gerrit.wikimedia.org/r/93903 (owner: 10Ottomata) [20:28:04] (03Abandoned) 10Ottomata: Adding kafka::udp2log::relay define to consume from Kafka and send to udp2log. [operations/puppet] - 10https://gerrit.wikimedia.org/r/86894 (owner: 10Ottomata) [20:28:42] (03Abandoned) 10Ottomata: Creating new dclass module. 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/83960 (owner: 10Ottomata) [20:30:25] (03Abandoned) 10Ottomata: Making elasticsearch ganglia plugin query $ipaddress instead of localhost for ES stats. [operations/puppet] - 10https://gerrit.wikimedia.org/r/92325 (owner: 10Ottomata) [20:40:06] (03PS6) 10Ryan Lane: Add mediawiki module for fetch and checkout hooks [operations/puppet] - 10https://gerrit.wikimedia.org/r/94682 [20:40:14] (03PS8) 10Ryan Lane: Add recursive submodule support to trebuchet [operations/puppet] - 10https://gerrit.wikimedia.org/r/94688 [20:40:21] (03PS4) 10Ryan Lane: Make appserver common a mediawiki deploy target [operations/puppet] - 10https://gerrit.wikimedia.org/r/94832 [20:42:03] (03CR) 10Dzahn: [C: 032] "12:44 < mutante> Jeff_Green: "frlabs" .. i suppose it's very dead?" [operations/dns] - 10https://gerrit.wikimedia.org/r/95077 (owner: 10Dzahn) [20:42:25] mutante: re: salt master [20:42:51] we should set up a salt master in eqiad, and also setup syndics in each caching datacenter [20:43:06] Ryan_Lane: created a ticket because somebody on etherpad said it's missing [20:43:16] adding that:) [20:43:22] in the other primary datacenter, when it's available, we should set up a second master [20:43:28] salt has multi-master support [20:43:45] hm multiple masters, how does that work if I want to run a command on all the minions? [20:43:46] so that if one primary datacenter needs to be failed over, salt (and deployment) continues to work [20:43:55] I cn run from any master independently? [20:43:57] apergos: the minions subscribe to both masters [20:44:02] so, yeah [20:44:05] ah excellent [20:44:13] I think they only publish to their primary [20:44:17] and key acceptance? [20:44:21] Ryan_Lane: isn't salt and puppet overlapping effort? [20:44:23] needs to be replicated, I think [20:44:29] hm ok [20:44:31] publish to? [20:44:34] matanya: we don't use the overlapping portion :) [20:44:36] what does that mean? 
[20:44:40] yes, it should be replicated apergos [20:44:45] jeremyb: you can publish events/commands from the minions to the master [20:45:01] huh [20:45:11] and if their primary dies? then it picks a new one? [20:45:27] so it's more like we use it as a replacement for dsh [20:45:31] matanya: [20:45:39] Ryan_Lane: mean puppet is better in configuration and salt as multi apply tool? [20:45:59] some things you want to happen right now at the same time [20:46:00] we have a shitload of puppet code and people aren't really sold on moving from puppet to salt for that [20:46:00] i see mutante [20:46:12] we're using all the other features of salt, though [20:46:15] !log Jenkins: fixing up jobs *-phpcs-HEAD which were not using the proper phpcs path {{gerrit|95268}} [20:46:25] so, the deployment system is using runners, pillars, grains, modules, returners, etc.. [20:46:29] Logged the message, Master [20:46:34] and otherwise we use it for remote execution [20:46:40] Ryan_Lane: if you point me to that pile, i might help out with it, i like salt a lot [20:46:48] oh, awesome [20:47:11] I have a deployment system written using it [20:47:13] one sec [20:47:43] looking up for the name of it maybe? :D [20:47:54] it's trebuchet ;) [20:48:03] and soonish I'll move it out of puppet and into its own repo [20:48:05] some people told that Trebuchet is something different than sartoris [20:48:07] then have puppet use it as a submodule [20:48:11] sartoris is indeed different [20:48:13] I thought so, saw you commiting there the other day [20:48:17] anyway, I refer to all of that as "Wikimedia deployment system" [20:48:20] <- some people [20:48:29] but is that name really futureproof? 
:) [20:48:31] matanya: http://git.wikimedia.org/tree/operations%2Fpuppet.git/630ae0a4b5392da40e9f1917cb801b118860b1d2/modules%2Fdeployment [20:48:32] there is no "the"* deployment system [20:48:42] not yet [20:48:49] Ryan_Lane: nice name anyway [20:48:49] !log DNS update - kill frlabs, merge unmerged config-geo changes [20:48:51] Coren: ^ [20:48:54] trebuchet is slated to replace scap [20:48:59] * aude nods [20:49:03] Logged the message, Master [20:49:11] matanya: thanks :) [20:49:22] I like the idea of lobbing giant rocks into our servers [20:50:04] matanya: I'm sure everyone will be more than happy to have your help with salt :) [20:50:20] so far I've been the one mostly doing stuff with it [20:50:42] salt is replacing ssh isn't it ? [20:50:48] dsh, kind of [20:51:17] Ryan_Lane: I have a hadoop cluster at work manged by salt for execution and with puppet for configuation, so i see what you are trying to do [20:51:20] it can't fully replace ssh ;) [20:51:36] I would love to have salt on labs project [20:51:42] yes, that's a goal [20:51:42] upgrading al lthe instance is cumbersome [20:51:55] I should actually write up my plans for that [20:52:32] the idea is to let projects specify instances as "peer masters" [20:52:37] hashar: you can do it today by "hand" quite quickly [20:52:49] then from those instances you could do remote execution calls on any other instance in the project [20:53:12] using runners and pillars this should be doable [20:53:18] matanya: how can I helP? 
[20:53:23] it needs some openstack integration, though [20:53:25] or ldap [20:53:44] oh, there was also another possibility [20:53:52] which the salt folks told me about, which is super cool [20:54:02] I have signed an NDA ottomata to help out on RT, but a quick walkthrough would be great (policies, closing, opening etc) [20:54:12] set up your own salt master on an instance, then make it a syndic [20:54:18] @ ottomata [20:54:20] point your instances at your master [20:54:25] then I'd accept the syndic [20:54:35] so I'd be able to still make calls on all the instances, but you could from your master [20:54:46] oh haha [20:54:47] uhhhh [20:54:51] hashar: this would also work for trebuchet btw [20:54:59] you might be asking the wrong person then, I know very little about that [20:55:02] RobH i think knows most there [20:55:10] hashar: remember how I was mentioning it would be a problem to make calls between production and labs? [20:55:17] it's easy to make one way calls from production into labs :) [20:55:23] using syndics [20:55:36] matanya: uhh, first policy, never answer vendors [20:55:39] apparently it's the whole point of them [20:55:48] but you shouldnt have access to vendor queues as volunteer these days, so should be ok [20:56:11] im not sure we have any real policy outline [20:56:11] (03PS1) 10Dzahn: add missing terminal dot for analytics102[12] IPv6 [operations/dns] - 10https://gerrit.wikimedia.org/r/95273 [20:56:16] paravoid: this is not nearly ready to go, but would love review whenever you get a chance [20:56:16] https://gerrit.wikimedia.org/r/#/c/94169/ [20:56:17] no hurry [20:56:17] and you can change syndics to form a one way hierarchy [20:56:26] poor jeremyb and Thehelpfulone had to do it wrong and get someone mad at them [20:56:27] heh [20:56:30] <3 salt [20:56:34] (not wrong, but you know what i mean) [20:56:38] RobH: want to take it private? 
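The "missing terminal dot" patch above fixes a classic zone-file pitfall: in BIND-style zone files a name without a trailing dot is relative, so the `$ORIGIN` gets appended, and an absolute name missing its dot silently doubles the domain. A small Python model of that qualification rule (pure illustration, not the gdnsd/BIND parser):

```python
def qualify(name, origin):
    # Zone-file semantics: a trailing dot marks a name as fully
    # qualified; otherwise the zone's $ORIGIN is appended. Forgetting
    # the dot yields names like "host.wikimedia.org.wikimedia.org.".
    if name.endswith("."):
        return name
    return f"{name}.{origin}"
```

For example, `qualify('analytics1021.wikimedia.org', 'wikimedia.org.')` comes back as `analytics1021.wikimedia.org.wikimedia.org.` — exactly the warning class the merged change silences.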
[20:56:39] Ryan_Lane: great to know prod -> labs will be possible [20:56:41] ottomata: full stop on the commit message [20:56:47] nah, its not private really, i dont mind discussing in here [20:56:58] hashar: we could also have prod -> labs -> deployment-prep [20:56:58] ok [20:56:59] Ryan_Lane: also have you considered moving Trebuchet out of operations/puppet.git so more non ops can be involved in maintaining it ? [20:57:02] where you run your own master [20:57:13] hashar: yes, my plan is to move it into its own repo [20:57:15] I'm just not sure I have a whole lot of answers. So the person you should first ask on any given ticekt if you need help is whoever is on RT triage that week [20:57:18] which changes in this channel topic [20:57:25] but in terms of policy, hrmm [20:57:28] ottomata: https://gerrit.wikimedia.org/r/#/c/95273/ [20:58:23] Ryan_Lane: whenever you do, announce it somewhere so I can step in :] [20:58:46] hahaha [20:58:46] * Ryan_Lane nods [20:58:46] will do :) [20:58:46] we dont have anythign really laid out for RT volunteers, as we now have a total of 3 =] [20:58:46] (03CR) 10Dzahn: [C: 032] add missing terminal dot for analytics102[12] IPv6 [operations/dns] - 10https://gerrit.wikimedia.org/r/95273 (owner: 10Dzahn) [20:58:46] ok will fix paravoid :p [20:59:26] have we merged the submodule already? [20:59:26] RobH: you mean https://rt.wikimedia.org/Ticket/Display.html?id=4076 ... [20:59:26] Ryan_Lane: and if you are looking at a way to integrate distutils/setuptools whatever, definitely have a look at https://pypi.python.org/pypi/pbr by OpenStack. That makes it trivial to add setup.py configuration. [20:59:26] aand, who's on RT right now? 
[20:59:26] cause it still says its me [20:59:26] Ryan_Lane: pbr is one of the most downloaded module and it is only a few months old :] [20:59:26] !log DNS update - fix warnings for analytics IPv6 addresses [20:59:26] paravoid: yes, submodule merged [20:59:26] Logged the message, Master [20:59:26] matanya: heh, yes [20:59:26] well varnishkafka puppet module merge [20:59:26] this adds the submodule to production puppet [20:59:26] did I review that? [20:59:26] i think so? [20:59:26] someone did [20:59:26] checking [20:59:29] I don't remember :) [20:59:35] ok, RobH looking at https://rt.wikimedia.org/Ticket/Display.html?id=1839 [20:59:38] i had some recent small unreviewed additions and modifications [20:59:40] but the main thing yeah [20:59:57] yup [20:59:58] https://gerrit.wikimedia.org/r/#/c/82885/5 [21:00:00] I see it was supposed to be on a meeting, but no update since then [21:00:05] matanya: So mutante says you should also feel free to ask him as well [21:00:17] but as you come across stuff, you should feel free to ask and we'll help [21:00:21] (cuz yea, we have no doc for this) [21:00:30] though we should, wanna make it as a volunteer? ;] [21:00:35] ah, thanks mutante [21:00:36] i'm going to use this permission [21:01:03] (03PS4) 10Chad: Fix up multiversion to not require dba_* functions [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/93622 [21:01:10] (03CR) 10jenkins-bot: [V: 04-1] Fix up multiversion to not require dba_* functions [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/93622 (owner: 10Chad) [21:01:41] <^demon|away> stfu jenkins [21:01:42] (03PS6) 10Ottomata: Setting up varnishkafka on mobile varnish caches [operations/puppet] - 10https://gerrit.wikimedia.org/r/94169 [21:01:47] heh [21:01:50] paravoid ^ :p better? 
[21:01:52] would be glad to once i understand RobH [21:01:58] hashar: I'm not totally sure setuptools is needed [21:02:17] hashar: salt will always be needed [21:02:33] (03PS5) 10Chad: Fix up multiversion to not require dba_* functions [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/93622 [21:02:34] and either salt will install everything, or puppet or chef will, if necessary [21:02:44] matanya: i also maybe able to answer some questions (and i definitely have some of my own to ask) [21:04:49] hm… mutante, are you able to submit https://gerrit.wikimedia.org/r/#/c/90760/? The 'Publish and Submit' button is dimmed for me and I can't understand why... [21:05:18] (03CR) 10Andrew Bogott: [C: 032] "OK, I think we should merge this as is, then Daniel will do the role-refactor in a later patch." [operations/puppet] - 10https://gerrit.wikimedia.org/r/90760 (owner: 10Matanya) [21:05:35] andrewbogott: you can't submit a draft? [21:05:38] i guess [21:05:48] oh, nvm [21:05:58] Yeah, I don't think it's a draft. [21:06:04] Or is a secret draft [21:06:12] no, it was the trailing question mark [21:07:00] what does Auto Merge mean? [21:08:04] dunno, must be a new diff method [21:08:17] mutante: who is responsible for icinga issues? [21:08:21] I can think of what I'd like it to be... [21:12:36] http://dumps.wikimedia.org/other/pagecounts-raw/2013/2013-11/projectcounts-20131111-190001 [21:12:51] there is a line among the project list that says: r0w 4r0wonly@en 1 156 [21:13:05] anyone got a clue how that happened ? [21:13:12] what's wrong with icinga... I can't believe I'm asking this. I'm only asking in case it's something I can look at tomorrow morning [21:13:37] matanya: what's the issue? we try to share responsibility in the team [21:13:38] thedj: nope, the files get to dumps.wm.o well after being put together [21:14:32] shall i file a ticket about it ? cause it doesn't look normal [21:14:32] apergos: while at it.. 
db1033 and sq48 down [21:14:37] I know [21:14:39] apergos: already know why? [21:14:42] and there are tickets [21:14:47] kk, nice [21:15:08] and https://wikitech.wikimedia.org/wiki/User:ArielGlenn/Server_cleanup#Hosts_in_dns_and_not_in_dhcp [21:15:08] ah, yea, that one, i saw:) [21:15:09] (03PS20) 10Ottomata: (WIP) Initial Debian version [operations/software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/78782 (owner: 10Faidon Liambotis) [21:15:17] paravoid: so, on finalizing that package [21:15:20] they are even named there with links to the tickets! [21:15:27] should we get Snaps to tag varnishkafka then? [21:15:31] before we merge this? [21:15:42] bsically when you check pupet freshness/errors every day you run across a lot of s*&^ [21:15:46] (03PS1) 10Hashar: ldap: make ldaplist runnable by anyone [operations/puppet] - 10https://gerrit.wikimedia.org/r/95276 [21:15:54] addshore: ^^^^ [21:16:08] hashar: :D [21:16:45] (03CR) 10Addshore: [C: 031] ldap: make ldaplist runnable by anyone [operations/puppet] - 10https://gerrit.wikimedia.org/r/95276 (owner: 10Hashar) [21:16:57] addshore: I kept using a production machine to use ldaplist :-D [21:17:04] hahaaa [21:17:15] !log deployed Parsoid be03c28 [21:17:25] (03CR) 10Andrew Bogott: [C: 032] ldap: make ldaplist runnable by anyone [operations/puppet] - 10https://gerrit.wikimedia.org/r/95276 (owner: 10Hashar) [21:17:25] how did you even spot that bug I made hashar ? [21:17:26] stalker! [21:17:30] Logged the message, Master [21:17:38] andrewbogott: thank you :-] [21:17:47] how can i disable VE on wikitech? i seem to have it activated while at the same time my preferences do NOT have the box checked in Beta features [21:18:08] mutante: looking here: https://icinga.wikimedia.org/icinga/ [21:18:15] addshore: so now run puppet and owe a beer to andrewbogott :-] [21:18:21] i see a lot of disk space issues [21:18:37] those are (mostly) bogus, the cp ones [21:18:44] mutante: just click 'edit source'? 
[21:18:55] ther we need to be able to tell the disk script check that for the cp hosts a different higher value is ok [21:19:10] apergos: what about staffort dpkg? [21:19:42] ignore me about wikitech, i'm just confused [21:19:46] presumably when the puppetmaster stuff is straightened out that will be resolved, unless it's a different issue [21:20:39] matanya: sometimes you'll find ticket links on problems that have already been ACKed [21:20:41] there was an issue with cpu and the new puppetmaster version, but I don't know if that's still it [21:20:59] matanya: yea, the disk size there is fixed [21:21:27] eh, i mean the size of that file on the disk [21:21:36] that uses almost all of it, but won't change [21:22:08] (03CR) 10Edenhill: [C: 031] Setting up varnishkafka on mobile varnish caches [operations/puppet] - 10https://gerrit.wikimedia.org/r/94169 (owner: 10Ottomata) [21:23:36] (03CR) 10Edenhill: [C: 031] (WIP) Initial Debian version [operations/software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/78782 (owner: 10Faidon Liambotis) [21:23:59] ACKNOWLEDGEMENT - Host sq48 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn https://rt.wikimedia.org/Ticket/Display.html?id=6274 [21:26:51] matanya: and search indices - check lucene status page - being replaced by Cirrus [21:27:54] check_job_queue on hume/fenari - replaced by same on terbium , can be removed [21:28:15] so as you see, most are not really criticals (but cleanup is good) [21:34:43] (03PS1) 10Andrew Bogott: Move ldapsupportlib.py into /usr/bin to keep ldaplist company. 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/95280 [21:34:45] (03CR) 10Dzahn: [C: 031] "this sounds reasonable "1) We are already using 303, clients not supporting that are already unable to use these URLs."" [operations/apache-config] - 10https://gerrit.wikimedia.org/r/92925 (owner: 10Dzahn) [21:34:51] hashar: Right back at you: ^^ [21:40:57] (03CR) 10Ryan Lane: [C: 032] Add shadow_reference support to trebuchet [operations/puppet] - 10https://gerrit.wikimedia.org/r/94680 (owner: 10Ryan Lane) [21:41:21] (03PS7) 10Ryan Lane: Add mediawiki module for fetch and checkout hooks [operations/puppet] - 10https://gerrit.wikimedia.org/r/94682 [21:41:35] fucking gerrit and its trivial rebases [21:42:24] :-D [21:42:43] andrewbogott: ah [21:42:46] (03CR) 10Ryan Lane: [C: 032] Add mediawiki module for fetch and checkout hooks [operations/puppet] - 10https://gerrit.wikimedia.org/r/94682 (owner: 10Ryan Lane) [21:42:58] (03PS9) 10Ryan Lane: Add recursive submodule support to trebuchet [operations/puppet] - 10https://gerrit.wikimedia.org/r/94688 [21:43:03] andrewbogott: maybe some other scripts are depending on that module as well ? [21:43:22] hashar: With my patch it's in both places… we should be covered. [21:43:48] (03CR) 10Ryan Lane: [C: 032] Add recursive submodule support to trebuchet [operations/puppet] - 10https://gerrit.wikimedia.org/r/94688 (owner: 10Ryan Lane) [21:43:58] (03PS2) 10Hashar: Move ldapsupportlib.py into /usr/bin to keep ldaplist company. 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/95280 (owner: 10Andrew Bogott) [21:44:06] andrewbogott: I have added a reference to bug: 57028 [21:44:12] (03CR) 10RobH: [C: 031] "so other than the odd dependency, the changeset looks good to me" [operations/dns] - 10https://gerrit.wikimedia.org/r/94457 (owner: 10Dzahn) [21:44:35] (03CR) 10Hashar: "Thanks :-] should have looked at the actual code before moving files around :/" [operations/puppet] - 10https://gerrit.wikimedia.org/r/95280 (owner: 10Andrew Bogott) [21:44:52] andrewbogott: thank you very much [21:45:01] can't follow up though, heading bed right now [21:46:26] RECOVERY - check_job_queue on fenari is OK: JOBQUEUE OK - all job queues below 200,000 [21:46:40] addshore: what instance are you doing your ldaplist? [21:46:56] PROBLEM - Varnish HTTP text-backend on cp1055 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:47:04] s/// [21:48:00] andrewbogott: i WAS JUST DOING IT ON TOOLS-LOGIN ;P [21:48:05] eww caps :/ [21:48:56] RECOVERY - Varnish HTTP text-backend on cp1055 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 3.516 second response time [21:49:24] sorry to have broken it :-( [21:49:25] (03PS2) 10QChris: Serve geowiki's private data through statistics webserver [operations/puppet] - 10https://gerrit.wikimedia.org/r/94626 [21:49:36] PROBLEM - check_job_queue on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[21:50:14] (03CR) 10Andrew Bogott: [C: 032] Move ldapsupportlib.py into /usr/bin to keep ldaplist company [operations/puppet] - 10https://gerrit.wikimedia.org/r/95280 (owner: 10Andrew Bogott) [21:50:36] hehe, what a cute commit message :P [21:58:23] (03PS1) 10Reedy: Simplify wmfBlockJokerEmails [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95283 [21:59:45] (03PS5) 10Ryan Lane: Make appserver common a mediawiki deploy target [operations/puppet] - 10https://gerrit.wikimedia.org/r/94832 [22:02:30] (03CR) 10Ryan Lane: [C: 032] Make appserver common a mediawiki deploy target [operations/puppet] - 10https://gerrit.wikimedia.org/r/94832 (owner: 10Ryan Lane) [22:02:33] (03CR) 10Chad: "(1 comment)" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95283 (owner: 10Reedy) [22:05:17] (03PS1) 10Odder: (bug 29902) Clean up InitialiseSettings.php, step 4/∞ [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95284 [22:06:05] PROBLEM - Host mw31 is DOWN: PING CRITICAL - Packet loss = 100% [22:07:05] RECOVERY - Host mw31 is UP: PING OK - Packet loss = 0%, RTA = 36.12 ms [22:11:25] I'm planning to convert icinga into a module, anyone started working on this, or is it all mine? I think andrewbogott or LeslieCarr might know [22:11:34] matanya: no, but yay [22:11:40] thank you [22:11:47] :) [22:12:03] I come to serve LeslieCarr :) [22:12:14] and only me? ;) [22:12:27] matanya: In theory Alex is planning to do that, but I don't know if he's actually done anything. [22:12:28] because i can totally give you LOA's to get connected when they come in :) [22:12:50] Best to check in with him before you start. He is… akosiaris on IRC [22:12:53] um… (sp?) 
[22:13:08] matanya: if you do that, check for "nagios" remnants to be renamed to icinga [22:13:18] and anything nrpe related [22:13:27] matanya: In theory things like this are tracked here: https://wikitech.wikimedia.org/wiki/Puppet_Todo when you start a new project best to check there and make an entry. [22:13:27] i'd be happy to review stuff [22:13:31] (03CR) 10CSteipp: [C: 031] "After talking through this with QChris, this seems like the most secure option under the circumstances. Someone with merge in this repo sh" [operations/puppet] - 10https://gerrit.wikimedia.org/r/94626 (owner: 10QChris) [22:13:34] …not that that page is formatted in any useful way [22:13:49] give people a finger ... :) [22:14:45] i'll ping alex tomorrow, we meet at the normal hours [22:16:05] oh, and andrewbogott The reason i didn't create the download role in the first place is my policy, when refactoring code into a module, first don't change anything [22:16:14] then, patch the module. [22:16:20] twkozlowski: so in practice it's step 0? [22:16:44] RobH: this is HTML->wiki using parsoid. it works http://parsoid.wmflabs.org/_html/ [22:16:47] matanya, yep, that's a reasonable way to go about things. [22:16:58] so if you have some way to export to HTML , then paste there.. then wiki.. done [22:18:36] matanya: you will find "nagios" all over the place, thing is some of them do make sense to be renamed icinga and some don't or may not. f.e. probably don't want to rebuild package nagios-plugins so that it becomes icinga-plugins [22:19:02] and there were some remnants like /files/nagios or so being used by icinga [22:19:31] mutante: did you see my earlier message about not being able to submit https://gerrit.wikimedia.org/r/#/c/90760/ ? [22:19:32] my first puppet patch did something like that mutante, don't know why LeslieCarr gave me -2 on that :P [22:19:47] * mutante sees $USER1$/check_to_check_nagios_paging [22:20:28] oh that one [22:20:38] andrewbogott: no, why not ? 
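[Editor's note] The icinga cleanup mutante describes above is mostly a triage job: find every remaining "nagios" reference in the puppet tree, then decide case by case whether it should become "icinga" or stay (e.g. the nagios-plugins package name should not be renamed). A throwaway scan script along these lines could produce that worklist; this is a hypothetical helper, not anything that exists in operations/puppet, and the file extensions scanned are an assumption:

```python
#!/usr/bin/env python
# Walk a puppet tree and list every line still mentioning "nagios", so each
# hit can be triaged by hand: rename to icinga, or deliberately leave alone
# (package names like nagios-plugins, nrpe paths such as /var/nagios/rw, etc).
import os
import re

PATTERN = re.compile(r'nagios', re.IGNORECASE)
# Assumed set of interesting file types in a puppet repo.
EXTENSIONS = ('.pp', '.erb', '.cfg')

def find_remnants(root):
    """Return (path, line_number, line_text) for every matching line."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(EXTENSIONS):
                continue
            path = os.path.join(dirpath, name)
            with open(path, errors='replace') as f:
                for lineno, line in enumerate(f, 1):
                    if PATTERN.search(line):
                        hits.append((path, lineno, line.strip()))
    return hits

if __name__ == '__main__':
    # 'modules' is a placeholder for wherever the puppet tree is checked out.
    for path, lineno, line in find_remnants('modules'):
        print('%s:%d: %s' % (path, lineno, line))
```

The output is only a starting point: as noted in the channel, some hits (icinga reusing /files/nagios, the nagios-plugins package) are intentional and must survive the rename.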
[22:20:50] it got rolled back since all the nrpe-tools put the results of their checks in /var/nagios/rw [22:20:57] so all the nagios-nrpe-tools failed [22:21:04] so we need to fix those up [22:21:05] :( [22:21:07] mutante: I don't know why not! The 'submit' button is dimmed for me and I can't figure out why. [22:21:09] Same for you? [22:21:16] i owe you for that LeslieCarr [22:24:22] (03PS2) 10Odder: (bug 29902) Clean up InitialiseSettings.php, step 4/∞ [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95284 [22:24:22] andrewbogott: that's odd, yea the button is dimmed but it also doesn't show a rebase button and "merge if necessary" [22:24:37] Need Rebase or Has Dependency [22:24:53] yep [22:25:07] I guess I can do a local cherry-pick and submit it as a new patch... [22:25:11] but not right now :) [22:32:55] (03PS3) 10Odder: (bug 29902) Clean up InitialiseSettings.php, step 4/∞ [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95284 [22:34:58] (03PS2) 10Dzahn: remove osm-web1-4 and mgmt [operations/dns] - 10https://gerrit.wikimedia.org/r/94457 [22:36:56] oh oh oh, another set of stuff off my list! 
[22:37:00] (tomorrow) [22:39:17] apergos: just talked to rob, gotta amend that one more time and instead of removing them entirely turn their hostnames into asset tags [22:39:25] so cmjohnson1 can wipe them [22:39:28] right [22:39:32] but it doesn't have to wait either [22:39:36] on it [22:39:42] well I'm not editing my list today [22:39:51] nor running any new reports til tomorrow [22:39:59] so whatever shows up then will get folded in [22:40:05] looks up asset tags in racktables [22:40:08] yep [22:46:55] PROBLEM - SSH on lvs6 is CRITICAL: Server answer: [22:47:22] (03PS3) 10Dzahn: remove osm-web1-4 and mgmt [operations/dns] - 10https://gerrit.wikimedia.org/r/94457 [22:48:04] (03PS1) 10Chad: Don't explode when trying to use hhvm + caches [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95287 [22:48:55] RECOVERY - SSH on lvs6 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [22:53:10] (03CR) 10Chad: "Current version WFM in testing on arsenic." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/93622 (owner: 10Chad) [22:53:53] (03CR) 10Dzahn: [C: 032] remove osm-web1-4 and mgmt [operations/dns] - 10https://gerrit.wikimedia.org/r/94457 (owner: 10Dzahn) [22:54:35] !log DNS update - remove osm-web in Tampa, replaced with asset tags for wiping [22:54:49] Logged the message, Master [22:54:52] cmjohnson1: ^ [23:00:49] woo hoo [23:02:01] apergos: moving ticket to pmtpa queue and renaming. i think doing that is fine in a workflow and slightly better than multiple linked tickets where appropriate [23:02:22] ..the ticket moves in the work flow .. [23:02:27] I have no opinion, as long as osm is in the ticket name [23:02:35] nod, it is [23:02:40] and where is my 'any' keyword? 
:-( [23:02:45] even has (was: oldtitle) [23:02:50] oohhhh nice [23:08:42] (03PS1) 10Andrew Bogott: Remove duplicate definition of python-rsvg [operations/puppet] - 10https://gerrit.wikimedia.org/r/95296 [23:11:00] (03CR) 10Andrew Bogott: [C: 032] Remove duplicate definition of python-rsvg [operations/puppet] - 10https://gerrit.wikimedia.org/r/95296 (owner: 10Andrew Bogott) [23:18:10] gwicke: how are you managing external dependencies in node? (because I'm guessing it's not using npm) [23:19:33] mwalker: npm AFAIK [23:20:08] and that puppetized? [23:20:19] No :( [23:20:31] The way we deploy it is this (and it's horrible) [23:20:55] We build the node_modules directory on a separate server, I think bast1001 (because it needs internet access) [23:21:07] Then we copy it to tin:/srv/deployment/parsoid/config/node_modules [23:21:28] And then we deploy it as part of the config repo which really is only supposed to contain localsettings.js but also contains the node_modules stuff because we suck [23:21:40] mwalker: we plan to use a submodule instead [23:21:54] https://bugzilla.wikimedia.org/show_bug.cgi?id=53723 [23:21:56] gwicke: Using some sort of proxy? [23:22:23] Since neither tin nor any of the deployment targets have access to the public internet... [23:22:29] no, just node_modules being a straight submodule [23:22:37] git submodule [23:22:38] Right, OK [23:22:43] And then have the ugliness in there? 
[23:23:06] omg that is horrible [23:23:25] (reading the description of how it's done now) [23:23:46] Yes [23:24:41] RoanKattouw: some form of packaging is inevitable, and we won't be able to debianize each dependency in the right version [23:24:48] a submodule sucks but it sucks less imo [23:24:50] Right [23:24:54] than what you are doing now [23:25:13] at least not initially [23:25:14] (03PS2) 10Dzahn: remove db61 [operations/dns] - 10https://gerrit.wikimedia.org/r/94426 [23:25:52] <^demon|away> AaronSchulz: https://gerrit.wikimedia.org/r/#/c/95287/ and/or https://gerrit.wikimedia.org/r/#/c/93622/ ? :) [23:26:03] we already have another contrib repository that is used by Jenkins tests, both that and the config repo would basically be replaced with the single submodule [23:26:30] a submodule is likely best [23:26:50] hell, you can do sub-submodules if you really want [23:27:02] ;) [23:27:02] I just fixed recursive submodule support for git deploy ;) [23:27:04] erm [23:27:10] and this is where I check out [23:27:12] err [23:27:14] trebuchet [23:27:16] and I don't mean git checkout either :-D [23:27:31] ^demon|away: yeah, so that hack is clearly for labs right? Maybe it could use a comment to that effect? [23:27:40] <^demon|away> When using submodules, please don't use git:// or I track you down and eat you :p [23:27:41] also it will be called git-sartoris-ryan-deploy til the day you die [23:27:43] apergos: we have sub-submodules in mediawiki deployment [23:27:46] anyways, good night! [23:27:49] <^demon|away> AaronSchulz: Prod and labs for now. [23:28:01] * Ryan_Lane waves apergos [23:28:09] * Ryan_Lane waves at apergos [23:28:15] the former was funnier, though [23:28:18] (03CR) 10Dzahn: [C: 032] remove db61 [operations/dns] - 10https://gerrit.wikimedia.org/r/94426 (owner: 10Dzahn) [23:28:57] !log DNS update - remove db61 (former OTRS box) [23:29:12] Logged the message, Master [23:29:21] so, if we use logstash, can we stop logging from irc into a wiki? 
[23:29:30] and just have the irc bot log into logstash, via a tag? [23:30:36] gwicke: how do you get npm install to place things into a specific directory? [23:31:56] ^demon|away: ? [23:32:22] <^demon|away> The hack for using EmptyBagOStuff? That's a hack for production and labs. [23:32:25] <^demon|away> Not just labs. [23:32:28] paravoid: so I can make img_auth work for math/timeline but how much is there a use for this atm? [23:32:36] mwalker: normally we just place package.json in the directory we want, and then run npm install which populates the subdir node_modules [23:32:36] gwicke: oh it just does it by default; nvm [23:32:44] there might be a switch for it too [23:32:55] nods; I just found that documentation page :) [23:32:58] AaronSchulz: I didn't file the bug :-) [23:33:21] * AaronSchulz looks at Reedy [23:33:57] i'm just gonna power-off db61 and shutdown uncleanly because i suppose it's going to be wiped anyways and i can't login to it (unless one of you has the key) [23:34:11] RobH: [23:37:32] !log power down db61 [23:37:46] Logged the message, Master [23:40:16] (03PS2) 10Dzahn: remove db61 from DHCP [operations/puppet] - 10https://gerrit.wikimedia.org/r/94427 [23:41:05] mwalker: is there any reason to manually store pdf output rather than just cache it? [23:41:31] (03CR) 10Dzahn: [C: 032] remove db61 from DHCP [operations/puppet] - 10https://gerrit.wikimedia.org/r/94427 (owner: 10Dzahn) [23:41:45] with cache = varnish [23:41:51] "manual" is relative, right? 
:) [23:42:05] * jeremyb points to pigeonrank [23:43:03] gwicke: I didn't really see any advantages to using a varnish cache -- because we're going for existing API capability it would mean having the varnish server sit between MW and the render servers [23:43:10] and cache management might be harder [23:43:24] you'd avoid the need to implement your own cache [23:43:49] no need to implement your own carp-like functionality etc [23:44:36] and maybe even an easy way to purge things by subscribing to the existing purge feed [23:44:57] I thought about that; but I wasn't sure about how to do that in a time reasonable manner [23:45:06] because all varnish will have is the jobID [23:45:09] so purge a whole book if any article in it changes? [23:45:19] mwalker: how do you figure? [23:45:22] mwalker: will the book have a page name? [23:45:46] no; the books do not have to have a name [23:46:00] all the render servers get is a POST of metadata including the pages that they should include [23:46:07] I see, that could make it harder [23:46:13] especially invalidation [23:46:24] do you plan to build your own dependency tracking? [23:46:26] node can send a list of page IDs in a book in a response header [23:48:10] gwicke: nope; we will construct the job ID from a SHA of (render format, [(title, revid), ...]); so we can serve the same content if no content revids have changed [23:48:52] ah, so no template / image updates then [23:49:03] wasn't really planning on it [23:49:38] I figure we'll keep the render around for a couple of days and then trash it [23:49:47] if it's needed again we'll rerender [23:49:58] that's something varnish could do for you as well [23:50:02] including reall LRU [23:50:05] *real [23:51:37] how to introduce varnish though without also introducing a SPOF [23:51:38] ? [23:51:49] oh, right rev id SHA. 
so if you really need to purge and action=purge don't work then you have to null edit [23:51:55] the same way as the parsoid varnishes are set up [23:52:11] that's all puppetized, so should not be hard to adapt [23:52:23] i don't see where the SPOF would be [23:52:25] LVS -> 2 varnishes -> LVS -> backends [23:52:38] https://wikitech.wikimedia.org/wiki/Parsoid [23:54:33] > There are no threads on this page yet. [23:54:39] the blue link tricked me! [23:56:19] gwicke: do you CARP in your varnishes? [23:56:36] or do you just not care if the varnishes duplicate cached content? [23:57:04] mwalker: we do something carp-like in the frontend varnishes [23:57:39] they don't cache, but select the backend varnish so that we don't duplicate rendering / cache content [23:57:45] (03CR) 10TTO: [C: 031] "Looks good." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95284 (owner: 10Odder) [23:57:53] when one varnish goes down, only 50% of the cache is lost [23:58:07] or a smaller fraction with more backends [23:59:59] wait; so you have Interwebs -> LVS -> Parsoid Frontend Varnish -> Carp to backend varnish -> LVS -> Parsoid Cluster ?
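[Editor's note] To make the caching scheme discussed above concrete: the job ID mwalker describes is a digest over the render format plus the (title, revid) pairs of the book, and the "CARP-like" behaviour gwicke describes amounts to hashing each cache key against every backend and picking the winner, so a given key always lands on the same backend and losing one varnish only loses that backend's share of the cache. The sketch below is illustrative only (SHA-1, the separator bytes, and the backend names are assumptions, not the actual implementation):

```python
import hashlib

def job_id(render_format, pages):
    """Deterministic render-job ID: the same format and the same
    (title, revid) pairs always produce the same ID, so an unchanged
    book re-uses its cached render. As noted in the discussion,
    template/image updates are invisible to this key."""
    h = hashlib.sha1()
    h.update(render_format.encode('utf-8'))
    for title, revid in pages:
        # NUL separators keep ('ab', 1) distinct from ('a', 'b1')-style collisions.
        h.update(('\0%s\0%d' % (title, revid)).encode('utf-8'))
    return h.hexdigest()

def pick_backend(key, backends):
    """CARP-style highest-random-weight selection: hash the key with each
    backend name and take the maximum. Removing one backend only remaps
    the keys that lived on it, roughly 1/N of the cache."""
    return max(backends,
               key=lambda b: hashlib.sha1((key + b).encode('utf-8')).hexdigest())

# Hypothetical book and backend names for illustration.
book = [('Foo', 12345), ('Bar', 67890)]
jid = job_id('pdf', book)
backend = pick_backend(jid, ['cache1001', 'cache1002'])
```

This also shows why invalidation is the weak point raised in the channel: since only revision IDs enter the key, a forced re-render requires a new revid (a null edit) rather than a purge by name.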