[00:00:22] where frontend and backend varnish exists on the same machines? like we had the squids setup? [00:00:24] yup, all puppetized [00:00:42] frontend+backend on same machine is already how varnish is done [00:02:13] ok; that sounds somewhat reasonable; I'll add that option to my document [00:02:17] going back to purging though [00:02:45] action=purge ? or on writes? [00:03:00] for books that DO have pages; is there a way to add an arbitrary page to the dependency tree? or...? [00:03:07] jeremyb: on writes [00:03:08] writes should be taken care of by revid [00:03:12] in sha [00:03:12] if you include all relevant info in your hash, then you can skip purging [00:03:14] right? [00:03:23] (03PS4) 10TTO: Clean up wgSiteName in InitialiseSettings [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/86418 [00:03:35] instead of revid, you can use the page_touched timestamp [00:03:45] (name from foggy memory) [00:03:52] makes sense; but I'll have to check how the SHA gets cached in the Collection extension [00:04:00] that is incremented whenever the page needs to be re-rendered [00:04:01] mwalker: i'm a bit lost now... how can a book not have a page? [00:04:06] * jeremyb did read the wiki [00:04:23] it's not actually in the wiki; it's a vagary of the collection extension [00:04:31] you can have a temporary book stored in your session [00:04:39] I'm also wondering whether it would make sense to just have a page for each book [00:04:42] right, ok [00:04:54] but you can't have a PDF without an article [00:05:29] jeremyb: that's correct -- the book/collection may be ephemeral -- but it's always linking to existing content [00:06:05] gwicke: it would... but! we're trying to minimize churn in the collection extension itself right now; and the extension currently allows ephemeral collections [00:06:21] hrmmmmm, maybe i just don't know enough about Collection [00:07:00] mwalker: maybe architect things so that ephemeral collections are not cached? 
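A minimal sketch of the hash-based cache-key idea discussed above: if the key embeds each included article's `page_touched` timestamp (or revision id), any edit produces a new key, so explicit purging can be skipped. The function and field names here are illustrative assumptions, not actual Collection extension code:

```python
import hashlib

def render_cache_key(book_title, articles):
    """Build a cache key that embeds each article's page_touched
    timestamp, so any edit changes the key and no purge is needed.
    `articles` maps article title -> page_touched timestamp string."""
    h = hashlib.sha1()
    h.update(book_title.encode("utf-8"))
    # Sort so the key is independent of dict iteration order.
    for title, page_touched in sorted(articles.items()):
        h.update(("%s@%s" % (title, page_touched)).encode("utf-8"))
    return h.hexdigest()

old = render_cache_key("My Book", {"Foo": "20131113120000", "Bar": "20131113120000"})
new = render_cache_key("My Book", {"Foo": "20131114093000", "Bar": "20131113120000"})
assert old != new  # an edit to Foo (new page_touched) yields a fresh key
```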
[00:07:00] jeremyb: if you really want to play I can add you to our labs project [00:07:20] unless they only involve a single article, in which case you can just fetch the cached copy of that [00:07:22] so tempting... idk :P [00:07:58] it seems unlikely to get many cache hits on ephemeral multi-article collections [00:07:59] i was pruning windows earlier so that i could join your new channel [00:08:46] so those could be POSTs that aren't cached, and normal persistent books would be GETs with a normal page URL that are purged with the normal dependency tracking [00:08:55] mutante: great ...create a new rt ticket for wiping so i can track [00:09:10] nevermind...see them ^ [00:09:37] cmjohnson1: 6162->6300 6256->6301 [00:09:48] i made them 'children'.. shrug [00:10:00] but a parent can be resolved before the child, doesnt matter [00:10:02] gwicke: want to jump into #mediawiki-pdfhack to talk about this further? [00:10:15] mwalker: ok [00:11:16] cool..thanks! just in case steve doesn't get to them all [00:27:26] PROBLEM - profiler-to-carbon on professor is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/udpprofile/sbin/profiler-to-carbon [00:28:19] * jeremyb detects an o r i [00:31:26] RECOVERY - profiler-to-carbon on professor is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/udpprofile/sbin/profiler-to-carbon [00:32:38] !log restarted stuck profiler-to-carbon on professor [00:32:52] Logged the message, Master [00:35:38] (03CR) 10MZMcBride: "Thanks for working on this. 
:-)" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95284 (owner: 10Odder) [00:45:51] paravoid: so I don't think we need to mess with the journal stuff thursday, seems like we can just do the switch and then another copy script run using timestamps (and without --syncviadelete to avoid deletes) [00:46:14] * AaronSchulz thought about turning off the file change log a few times [01:17:54] AaronSchulz: https://bugzilla.wikimedia.org/show_bug.cgi?id=51136 [01:18:02] AaronSchulz: ombudsmenwiki wants it apparently [01:18:28] Reedy: https://gerrit.wikimedia.org/r/#/c/95286/ [01:18:59] I guess https://gerrit.wikimedia.org/r/#/c/95304/ + config would make those urls work [01:19:51] (03PS1) 10Ori.livneh: Enable Bug54847 debug log group and route to $wmfUdp2logDest [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95313 [01:20:45] (03CR) 10Ori.livneh: [C: 032] Enable Bug54847 debug log group and route to $wmfUdp2logDest [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95313 (owner: 10Ori.livneh) [01:23:59] (03PS1) 10Ori.livneh: Revert "Rearrange W0 config-pages-supported variables order to be more logical." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95315 [01:25:18] (03CR) 10Ori.livneh: [C: 032] "Merged but not deployed; can't verify that it is correct at the moment." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95315 (owner: 10Ori.livneh) [01:26:44] !log ori updated /a/common to {{Gerrit|I27cf69dd1}}: Revert "Rearrange W0 config-pages-supported variables order to be more logical." [01:27:00] Logged the message, Master [01:27:48] !log ori synchronized wmf-config/InitialiseSettings.php 'Iaae2da0accd: Enable Bug54847 debug log group and route to ' [01:28:03] Logged the message, Master [01:28:06] every. time. [01:28:26] i forget to quote $varNames, I mean. [01:28:51] bash is a cruel and fickle mistress. [01:29:05] don't i know it [01:30:12] Perhaps scap could be trained. 
"I see your deploy message is stilted. Are you drunk and/or careless?" ;-) [01:31:30] if we're borrowing ideas from Gmail Labs, a thirty-second undo window would be even better [01:32:01] Great for emergencies. [01:32:59] <^demon|away> wfCountDown( 30 ); [01:33:43] "You seem to be typing awfully quickly. Cool down for a few minutes and try again." [01:35:10] <^demon|away> Could add rate limiting as well. [01:35:29] <^demon|away> "You seem to keep syncing the same file. Have you tried testing your code somewhere other than production?" [01:37:27] ^demon|away: aren't you supposed to be away? [01:37:43] <^demon|away> I've been |away ever since lunch. [01:37:48] <^demon|away> I never came back, I suppose. [01:40:29] RobH, re your earlier mention, I found it quite amusing I was teaching ops RT practice at one point ;) [01:48:21] (03PS2) 10Dzahn: remove gurvin and gurvin.mgmt, decom [operations/dns] - 10https://gerrit.wikimedia.org/r/94448 [01:52:49] ori-l: or just echo it back to you and let you verify that's what you wanted to say [02:05:04] PROBLEM - Varnish HTTP text-backend on cp1055 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:07:56] !log LocalisationUpdate completed (1.23wmf3) at Thu Nov 14 02:07:56 UTC 2013 [02:08:15] Logged the message, Master [02:08:55] RECOVERY - Varnish HTTP text-backend on cp1055 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 2.783 second response time [02:13:58] !log LocalisationUpdate completed (1.23wmf2) at Thu Nov 14 02:13:58 UTC 2013 [02:14:13] Logged the message, Master [02:22:25] PROBLEM - MySQL Recent Restart on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:22:25] PROBLEM - MySQL Slave Running on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[02:24:15] RECOVERY - MySQL Recent Restart on db1047 is OK: OK 3063636 seconds since restart [02:24:15] RECOVERY - MySQL Slave Running on db1047 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: [02:35:33] (03PS1) 10Dzahn: disable wikistats update cron jobs. [operations/puppet] - 10https://gerrit.wikimedia.org/r/95317 [02:35:40] (03CR) 10jenkins-bot: [V: 04-1] disable wikistats update cron jobs. [operations/puppet] - 10https://gerrit.wikimedia.org/r/95317 (owner: 10Dzahn) [02:36:09] (03PS2) 10Dzahn: disable wikistats update cron jobs. [operations/puppet] - 10https://gerrit.wikimedia.org/r/95317 [02:36:09] !log LocalisationUpdate ResourceLoader cache refresh completed at Thu Nov 14 02:36:09 UTC 2013 [02:36:24] Logged the message, Master [03:39:53] PROBLEM - SSH on amslvs1 is CRITICAL: Server answer: [03:40:53] RECOVERY - SSH on amslvs1 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [05:04:12] RECOVERY - check_job_queue on fenari is OK: JOBQUEUE OK - all job queues below 200,000 [05:07:15] PROBLEM - check_job_queue on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:38:26] <\\\> Hey, is WMF doing anything for Google codein this year [05:39:08] yes. [05:39:30] \\\: https://www.mediawiki.org/wiki/Google_Code-in [05:40:38] <\\\> legoktm: danke [05:56:01] \\\: maybe more relevant to #wikimedia-tech or #wikimedia-dev [05:56:06] or #mediawiki even [06:27:50] PROBLEM - udp2log log age for lucene on oxygen is CRITICAL: CRITICAL: log files /a/log/lucene/lucene.log, have not been written in a critical amount of time. For most logs, this is 4 hours. For slow logs, this is 4 days. [06:29:50] RECOVERY - udp2log log age for lucene on oxygen is OK: OK: all log files active [07:25:14] PROBLEM - RAID on mw1210 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[07:26:04] PROBLEM - SSH on mw1210 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:33:56] !log powercycled mw1210, unreachable via mgmt, no messages on console [07:34:10] Logged the message, Master [07:35:04] PROBLEM - Host mw1210 is DOWN: PING CRITICAL - Packet loss = 100% [07:35:54] RECOVERY - SSH on mw1210 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [07:36:04] RECOVERY - Host mw1210 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [07:36:04] RECOVERY - RAID on mw1210 is OK: OK: no RAID installed [07:40:15] PROBLEM - Frontend Squid HTTP on sq80 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:41:14] RECOVERY - Frontend Squid HTTP on sq80 is OK: HTTP OK: HTTP/1.0 200 OK - 531 bytes in 4.127 second response time [07:44:05] PROBLEM - Backend Squid HTTP on sq80 is CRITICAL: Connection refused [07:46:07] !log powercycled sq80, accessible but responded 'bus error' or 'i/o error' to all commands [07:46:19] Logged the message, Master [07:47:34] PROBLEM - Host sq80 is DOWN: PING CRITICAL - Packet loss = 100% [07:48:05] RECOVERY - Backend Squid HTTP on sq80 is OK: HTTP OK: HTTP/1.0 200 OK - 486 bytes in 0.078 second response time [07:48:14] RECOVERY - Host sq80 is UP: PING OK - Packet loss = 0%, RTA = 35.51 ms [08:38:56] (03PS1) 10Ori.livneh: Add uWSGI module [operations/puppet] - 10https://gerrit.wikimedia.org/r/95331 [08:39:46] ^ paravoid: probably going to merge that, but post-merge comments still welcome; I'd be happy to go back and amend [08:42:15] ori-l: do you know how to add people to ldap? 
[08:42:33] you don't add people to ldap :) [08:42:42] Ryan_Lane adds people to ldap [08:42:48] when a user is created via wikitech, it adds people to ldap [08:43:07] the only thing you might add people to is unmanaged groups like wmf, ops or wmde [08:43:18] everything else is managed [08:43:24] (via wikitech) [08:43:43] i was asking about wmde Ryan_Lane [08:44:04] regarding https://rt.wikimedia.org/Ticket/Display.html?id=6298 Ryan_Lane [08:44:08] ah. on formey: modify-ldap-group --addmember= wmde [08:44:17] as root [08:44:39] or is it --addmembers? [08:44:42] meh --help will tell you ;) [08:44:46] pretty sure this is documented somewhere [08:45:01] Ryan_Lane: I don't have access to any server :) [08:45:05] yeah [08:45:09] ori can [08:45:16] or anyone on ops [08:45:16] heh [08:45:16] (03PS1) 10Ori.livneh: Port graphite module to use uwsgi::app [operations/puppet] - 10https://gerrit.wikimedia.org/r/95332 [08:45:19] FINE [08:45:49] :) [08:45:49] * matanya wistles [08:52:24] matanya: done [08:53:25] thanks :) i was just asking ... [08:54:22] i ran 'ldaplist -l group wmde' to confirm the group exists, then 'modify-ldap-group --addmembers=jeroendedauw wmde' [08:54:50] and it worked? 
[08:55:15] i ran 'ldaplist -l passwd jeroendedauw' to confirm, but it was not the right option [08:55:24] so i just decided to be overconfident and assume it worked [08:55:27] ldaplist -l group wmde [08:55:37] meh, certainty is boring [08:55:41] :) [08:55:46] if you didn't get an error it worked [08:55:48] I never check [08:55:59] yep, worked [08:56:04] i just checked [08:56:22] of course I wrote all of those scripts so I know how they react on failure [08:56:31] they are terrible, terrible python [08:56:35] don't look at the code [08:56:55] i am remarkably sympathetic to working software :) [08:56:59] heh [08:57:05] I wrote that like 6-7 years ago [08:58:47] (03CR) 10Ori.livneh: [C: 032] Add uWSGI module [operations/puppet] - 10https://gerrit.wikimedia.org/r/95331 (owner: 10Ori.livneh) [08:59:03] Ryan_Lane: no one is looking [08:59:09] (03CR) 10Ori.livneh: [C: 032] Port graphite module to use uwsgi::app [operations/puppet] - 10https://gerrit.wikimedia.org/r/95332 (owner: 10Ori.livneh) [08:59:10] :D [08:59:29] one day I should update all of that code, make it more generic and release it as a library [08:59:41] i no longer do drive-by pep8s now that i am entrusted to do real work [08:59:44] there's really very little for usable ldap tooling [08:59:48] which, i admit, is substantially less fun than i imagined [08:59:53] :D [09:00:02] so Ryan_Lane If i want to add wmde devs to some stuff, it would be :Require ldap-group cn=wmde,ou=groups,dc=wikimedia,dc=org ? [09:00:24] add them to stuff? what do you mean? [09:00:28] in apache? 
[09:00:37] yes Ryan_Lane [09:00:44] probably something like that :) [09:00:45] a.k.a https://rt.wikimedia.org/Ticket/Display.html?id=6293 [09:00:53] I'd have to look it up [09:01:05] I do that in apache infrequently enough to always have to check the reference [09:01:16] good for you :) [09:11:21] (03PS1) 10Matanya: allow wmde devs to access graphite and gdash [operations/puppet] - 10https://gerrit.wikimedia.org/r/95333 [09:16:03] matanya: Require ldap-group cn=wmde,ou=groups,dc=wikimeida,dc=org [09:16:09] spot the typo :) [09:16:50] (03CR) 10Aude: "(2 comments)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/95333 (owner: 10Matanya) [09:18:10] (03PS2) 10Matanya: allow wmde devs to access graphite and gdash [operations/puppet] - 10https://gerrit.wikimedia.org/r/95333 [09:18:14] arrg, never edit this early [09:19:13] thanks Aaron|home and aude [09:19:38] * Aaron|home listens to California by Phantom Planet [09:19:41] * Aaron|home is not ashamed [09:20:01] Ryan_Lane: still up? :) [09:20:07] yep [12:26:00] (03PS1) 10Hashar: zuul: refer to puppet variables with a @ [operations/puppet] - 10https://gerrit.wikimedia.org/r/95359 [12:27:03] (03CR) 10Hashar: "(1 comment)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/93457 (owner: 10Hashar) [12:37:02] (03CR) 10Hashar: "This patch is mostly about adding gearman and gearman_server sections in the configuration file. 
I have confirmed both in labs and locally" [operations/puppet] - 10https://gerrit.wikimedia.org/r/93457 (owner: 10Hashar) [12:37:31] (03Abandoned) 10Hashar: role::zuul::labs::gearman to test out in labs [operations/puppet] - 10https://gerrit.wikimedia.org/r/93458 (owner: 10Hashar) [12:43:34] (03PS1) 10QChris: Backup geowiki's data-private bare repository [operations/puppet] - 10https://gerrit.wikimedia.org/r/95363 [12:44:51] (03CR) 10QChris: "(1 comment)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/95363 (owner: 10QChris) [13:02:07] !log bad news: Nov 14 11:17:43 amssq54 kernel: [246380.560873] XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250) and this host has the preallocated cache file [13:02:19] Logged the message, Master [13:07:36] PROBLEM - LVS HTTP IPv4 on parsoid-lb.eqiad.wikimedia.org is CRITICAL: Connection refused [13:09:14] (03PS1) 10Mark Bergsma: Correct parsoid cache port [operations/puppet] - 10https://gerrit.wikimedia.org/r/95373 [13:09:36] !log preallocated cache file on amssq: 51, 52, 54, 56 59, 60 in addition to previously logged ones [13:09:42] (03CR) 10Mark Bergsma: [C: 032 V: 032] Correct parsoid cache port [operations/puppet] - 10https://gerrit.wikimedia.org/r/95373 (owner: 10Mark Bergsma) [13:09:50] Logged the message, Master [13:10:52] hashar: are you working on gerrit.pp? [13:11:07] matanya: nop [13:11:21] matanya: that would be Chad / Qchris / Ryan_Lane [13:11:31] I simply added some replication slaves by lamely copy/pasting some lines [13:12:05] is there a normal way to search gerrit and find out who is working on what? [13:13:44] I usually fetch the review notes and look at them [13:14:03] using aliases: [13:14:14] matanya: parsoid page is you, right?
[13:14:16] $ git fetchreviews --help [13:14:17] `git fetchreviews' is aliased to `fetch -v gerrit refs/notes/review:refs/notes/review' [13:14:18] $ git codereview --help [13:14:19] `git codereview' is aliased to `log --decorate --notes=review' [13:14:33] matanya: the first will get the Code-Review / Verified scores from gerrit [13:14:57] er [13:14:59] matanya: the later show them in git log, so you can later: git codereview manifests/gerrit.pp and have clue has to who reviewed previous changes on that file [13:15:03] mark: parsoid page is you, right? [13:16:00] commuting to coworking place brb [13:16:53] thanks hashsr [13:23:10] (03PS1) 10Faidon Liambotis: filebackend: switch multiMaster to swift-eqiad [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95376 [13:24:00] c'mon jenkins [13:24:31] no jenkins? [13:25:42] (03CR) 10Faidon Liambotis: [C: 032 V: 032] filebackend: switch multiMaster to swift-eqiad [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95376 (owner: 10Faidon Liambotis) [13:26:31] !log faidon updated /a/common to {{Gerrit|I898bfd6f1}}: filebackend: switch multiMaster to swift-eqiad [13:26:45] Logged the message, Master [13:27:14] !log faidon synchronized wmf-config/filebackend.php 'switch multiMaster to swift-eqiad' [13:27:28] Logged the message, Master [13:36:25] (03PS1) 10Faidon Liambotis: Varnish: switch upload to eqiad Swift cluster [operations/puppet] - 10https://gerrit.wikimedia.org/r/95379 [13:47:31] (03CR) 10Faidon Liambotis: [C: 032 V: 032] Varnish: switch upload to eqiad Swift cluster [operations/puppet] - 10https://gerrit.wikimedia.org/r/95379 (owner: 10Faidon Liambotis) [14:46:56] PROBLEM - SSH on amslvs1 is CRITICAL: Server answer: [14:47:56] RECOVERY - SSH on amslvs1 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [14:55:03] hashar: got more time now? 
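The two aliases quoted above correspond to a `~/.gitconfig` fragment like the following (a sketch; the `gerrit` remote name is whatever the local checkout calls its Gerrit remote):

```ini
[alias]
    # Fetch the code-review notes Gerrit stores under refs/notes/review
    fetchreviews = fetch -v gerrit refs/notes/review:refs/notes/review
    # Show those review notes alongside git log output
    codereview = log --decorate --notes=review
```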
[14:55:26] not really [14:55:33] k [14:56:33] if I had full access to the packages in gerrit and jenkins, I wouldn't have to poke you :-P [15:05:33] (03CR) 10Dereckson: "So if I understand well this patch, you've renamed EVERY $wmf... into $wmg...? Is that it?" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94598 (owner: 10Arav93) [15:05:57] RECOVERY - LVS HTTP IPv4 on parsoid-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 1463 bytes in 0.004 second response time [15:06:20] bad mark [15:10:55] (03CR) 10Dereckson: [C: 04-1] "Please read carefully the comment 2 of the bug:" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94598 (owner: 10Arav93) [15:12:02] AzaToth: so [15:12:28] AzaToth: we want to build a package from upstream using the latest tag + some changes [15:12:35] I have no clue how to achieve that in gbp though [15:12:50] iirc there is a way to instruct gbp to get a specific commit [15:16:42] something like: [15:16:43] upstream-tree=branch [15:16:43] upstream-branch= [15:22:38] (03Abandoned) 10Hashar: resync with upstream v0.7.0 [operations/debs/jenkins-debian-glue] - 10https://gerrit.wikimedia.org/r/91506 (owner: 10Hashar) [15:22:47] (03Abandoned) 10Hashar: upgrade from upstream tip of master [operations/debs/jenkins-debian-glue] - 10https://gerrit.wikimedia.org/r/95142 (owner: 10Hashar) [15:24:14] (03CR) 10Arav93: "Initially when I changed the ones mentioned here, I got an error which said I did not change a variable which was not mentioned in the lis" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94598 (owner: 10Arav93) [15:26:40] (03PS1) 10Hashar: reset repo by merging in tag 'v0.7.1' from upstream [operations/debs/jenkins-debian-glue] - 10https://gerrit.wikimedia.org/r/95422 [15:27:00] AzaToth: https://gerrit.wikimedia.org/r/95422 would bump debian glue to v0.7.1, I cleared out our entries in debian/changelog [15:29:12] hashar: actually, upstream-tree can take treeish [15:32:29] hashar: so 
upstream-tree=v0.7.1-6-gf618f4d should work [15:33:21] more or less [15:33:24] (03PS1) 10Hashar: bump wmf package to v0.7.1-6-gf618f4d [operations/debs/jenkins-debian-glue] - 10https://gerrit.wikimedia.org/r/95424 [15:33:27] upstream-tree expects a treeish [15:33:30] so the full commit [15:33:38] I have no clue what will happen in gbp though :-(( [15:33:41] we will see [15:34:01] hopefully it will discard the version given in the change log and actually use the commit I filled in debian/gbp.conf [15:36:17] dpkg-buildpackage: error: version number does not start with digit [15:36:18] bah [15:36:44] hah [15:36:50] hashar: I fixed that yesterday [15:36:54] https://integration.wikimedia.org/ci/job/operations-debs-jenkins-debian-glue-debian-glue/9/console [15:37:01] made a new changeset [15:37:08] but you didn't have time to review it [15:37:14] source version v0.7.1-6-gf618f4d+0~20131114153327.9~1.gbpb7cf17 [15:37:18] that is ugly :-] [15:38:04] (03Abandoned) 10Hashar: bump debian/changelog [operations/debs/jenkins-debian-glue] - 10https://gerrit.wikimedia.org/r/95143 (owner: 10Hashar) [15:38:07] PROBLEM - Varnish HTTP text-backend on cp1055 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:38:14] https://gerrit.wikimedia.org/r/#/c/95143/1..2/debian/changelog [15:38:34] hashar: ↑ [15:38:35] oh I haven't seen that one [15:39:08] hashar: generally, drop the "v" and add a "-0" as it's not a debian native package [15:39:38] (03PS2) 10Hashar: bump wmf package to 0.7.1-6-gf618f4d [operations/debs/jenkins-debian-glue] - 10https://gerrit.wikimedia.org/r/95424 [15:39:43] done [15:39:57] RECOVERY - Varnish HTTP text-backend on cp1055 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 5.144 second response time [15:40:11] one day I will have to actually read and remember how Debian versioning works [15:40:19] the -0 indicates it's the zeroth debian revision (real debian revisions start with 1) [15:40:24] https://gerrit.wikimedia.org/r/#/c/95424/1..2/debian/changelog,unified
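Pulling the options discussed above together, the relevant `debian/gbp.conf` section might look like this. This is a sketch: the treeish is the commit quoted in the conversation, and whether `upstream-branch` is strictly required alongside `upstream-tree` was left unresolved here ("I tried without, and it seemed to work"):

```ini
[DEFAULT]
# build the upstream tarball from an explicit commit rather than a tag
upstream-tree = f618f4d35a88efd1d3529217c49df5892899aecd
# reportedly needed in some setups alongside upstream-tree;
# "master" is an assumption about the upstream branch name
upstream-branch = master
```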
so -0 is a hack ? [15:40:34] ubuntu had -0ubuntu1 [15:40:56] you use -0 for packages/versions not yet in debian [15:41:10] so if/when included in debian, the versioning still works [15:41:25] we could have used -0wmf1 [15:41:38] or -0+wmf1 [15:42:08] -0 is a safe bet though ツ [15:42:25] but yea, you could call it a hack ツ [15:43:22] hashar: the build queue on integration looks fucked [15:43:45] mwext-Scribunto-testextensions-master * 50 [15:44:45] hashar: unstable result ツ [15:46:08] hashar: which version of python-defaults is installed on the builder? [15:47:06] no clue [15:47:08] how would I check? [15:47:26] I should grant you access on the integration project :] [15:48:08] heh [15:48:52] hexmode: hoooollyy shit [15:49:00] hexmode: what the hell are you doing with Scribunto and REL1_21 ? [15:49:22] hexmode: are you cherry-picking a bunch of unrelated changes or bumping REL1_21 up to master ? [15:49:29] if the latter, you should submit a single merge commit [15:52:26] upstream-tree=f618f4d35a88efd1d3529217c49df5892899aecd might suffice [15:54:31] to have gbp use it, I am pretty sure you need upstream-branch=branch as well [15:55:27] I tried without, and it seemed to work [15:55:46] (03PS1) 10Cmjohnson: Adding analytics vlan to 10.in, changing ips of an1009,11-13,21,23..removing ipv6 an21 [operations/dns] - 10https://gerrit.wikimedia.org/r/95426 [15:58:53] !log preallocated cache files for amssq61 and 62 with ext4, this requires removing logbuf option in fstab [15:59:09] Logged the message, Master [16:02:07] apergos: thanks :-) [16:02:51] of the ones we already did, only amssq54 whined since then, and only once, so we'll see [16:06:14] !log Gerrit/Git resetting mediawiki/extensions/Scribunto REL1_21 branch from fd1fbb4 to previous b5015a2. hexmode cherry-picked a bunch of changes from another branch/master instead of submitting a merge commit.
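As a rough illustration of why the `-0` revision discussed above keeps ordering safe: a deliberately simplified comparator (real dpkg version comparison also handles epochs, `~`, and non-numeric characters; this sketch covers dotted numeric versions with numeric revisions only):

```python
def parse(v):
    """Very simplified Debian version parser: 'upstream-revision' ->
    comparable tuples. Not dpkg's real algorithm; numeric parts only."""
    upstream, _, revision = v.partition("-")
    return tuple(int(x) for x in upstream.split(".")), int(revision or "0")

# A -0 "not yet in Debian" package sorts below a first real Debian
# upload (-1) of the same upstream version, and below the next
# upstream release, so later official packages always supersede it.
versions = ["0.7.2-0", "0.7.1-1", "0.7.1-0"]
assert sorted(versions, key=parse) == ["0.7.1-0", "0.7.1-1", "0.7.2-0"]
```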
[16:06:28] Logged the message, Master [16:09:31] all day, i have had periodic errors with gerrit like [16:09:32] "error: Could not resolve host: gerrit.wikimedia.org; nodename nor servname provided" [16:09:37] anyone else? [16:09:50] yup it happens to me from time to time [16:10:02] never investigated though since it resolve by itself pretty fast [16:10:08] been a lot to day [16:10:11] today* [16:10:25] aude: if you can't resolve it, then the dns is fubar [16:10:32] ok [16:10:48] at least the DNS entry is on all wikimedia nameservers right now [16:11:39] with a TTL of 60s :( [16:18:06] PROBLEM - Varnish HTTP text-backend on cp1055 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:18:56] RECOVERY - Varnish HTTP text-backend on cp1055 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 1.859 second response time [16:19:21] (03PS1) 10Hashar: raise TTL for gerrit.wikimedia.org from 60s to 1H [operations/dns] - 10https://gerrit.wikimedia.org/r/95428 [16:20:22] aude: raising the TTL of gerrit.wikimedia.org with ^^^ [16:20:24] that might help [16:20:29] ok [16:20:39] (03CR) 10Faidon Liambotis: [C: 032] "I doubt it's the cause of your issues, but it's nevertheless a good idea." [operations/dns] - 10https://gerrit.wikimedia.org/r/95428 (owner: 10Hashar) [16:21:00] good idea :) [16:26:20] that is the most common issue in DNS [16:26:30] lowering TTL and forgetting to bump it again post migration :-( [16:26:44] doesn't matter much [16:27:36] PROBLEM - Puppet freshness on sq44 is CRITICAL: No successful Puppet run in the last 10 hours [16:30:33] ^d: hey, btw, what was it that we were missing to enable ipv6 in gerrit? [16:30:58] <^d> We talked about that before? [16:31:04] * ^d doesn't remember [16:32:19] I think, I'm not sure :) [16:33:43] ottomata: all set here to make the move [16:33:50] oh woah awesome [16:34:20] hmm, let's be careful wtih an23-25 [16:34:24] can we do those one at a time? [16:34:33] are you moving all of those or is one staying where it is? 
[16:34:34] well we are only moving an23 [16:34:39] ok cool [16:34:39] <^d> paravoid: So, I can't think of anything that's really gerrit-specific. It's just a jetty app running behind an apache reverse proxy. [16:34:41] no probs then [16:34:44] you can move any whenever you want [16:34:49] <^d> It doesn't care much about IP addresses afaik. [16:34:55] we need to move an1009,11-12,21,23 [16:35:03] ottomata https://gerrit.wikimedia.org/r/95426 once the move is done [16:35:08] <^d> paravoid: (And I see no open bugs in the tracker upstream mentioning ipv6) [16:35:51] hm , why removing ipv6 on an21? [16:36:02] <^d> Let me double check, but I think we bind on *:8080 and *:29418, which means we should Just Work. [16:36:41] Elsie: does the reviewer count work now? [16:38:09] <^d> paravoid: Relevant bits of gerrit.config.erb: http://p.defau.lt/?yGhx3ZnET9LvJHVfFr7DvA. The sshd stuff should be fine as-is. And since we're behind apache the httpd stuff shouldn't matter. [16:39:06] ^d: cool, thanks [16:39:16] <^d> yw. [16:39:29] (03PS2) 10Akosiaris: Provide a force ssh command+key for private update [operations/puppet] - 10https://gerrit.wikimedia.org/r/94770 [16:40:47] akosiaris: did you see my email to ops list about rsyslog line length limit? [16:41:50] also, paravoid, do we need a varnishkafka tag before we merge? [16:41:54] yes [16:42:02] ok [16:42:05] so [16:42:05] and switch gbp.conf to tag [16:42:09] ja [16:42:09] k [16:42:21] wasn't sure if the initial commit should be merged first [16:42:26] and a separate changelog entry for the tag [16:42:26] but ok [16:42:35] (03CR) 10Dereckson: "There are two possibilities I would sugget." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94598 (owner: 10Arav93) [16:42:43] Snaps: we ready to tag? 
[16:45:34] (03CR) 10Akosiaris: [C: 032] Provide a force ssh command+key for private update [operations/puppet] - 10https://gerrit.wikimedia.org/r/94770 (owner: 10Akosiaris) [16:45:35] !log powering down analytics1009, analytics1011-13, analtyics1021, analytics1023 to relocate to rack A2 [16:45:50] Logged the message, Master [16:46:13] (03PS1) 10Faidon Liambotis: Add IPv6 address to Gerrit [operations/puppet] - 10https://gerrit.wikimedia.org/r/95433 [16:46:31] (03PS2) 10Faidon Liambotis: Add a static IPv6 address to Gerrit [operations/puppet] - 10https://gerrit.wikimedia.org/r/95433 [16:46:41] (03CR) 10Faidon Liambotis: [C: 032] Add a static IPv6 address to Gerrit [operations/puppet] - 10https://gerrit.wikimedia.org/r/95433 (owner: 10Faidon Liambotis) [16:48:06] PROBLEM - SSH on analytics1009 is CRITICAL: Connection refused [16:48:16] PROBLEM - puppet disabled on analytics1009 is CRITICAL: Connection refused by host [16:48:17] PROBLEM - Disk space on analytics1009 is CRITICAL: Connection refused by host [16:48:56] PROBLEM - SSH on analytics1012 is CRITICAL: Connection refused [16:48:56] PROBLEM - Disk space on analytics1011 is CRITICAL: Connection refused by host [16:48:56] PROBLEM - RAID on analytics1009 is CRITICAL: Connection refused by host [16:49:06] PROBLEM - SSH on analytics1021 is CRITICAL: Connection refused [16:49:06] PROBLEM - puppet disabled on analytics1011 is CRITICAL: Connection refused by host [16:49:06] PROBLEM - RAID on analytics1011 is CRITICAL: Connection refused by host [16:49:07] PROBLEM - Disk space on analytics1021 is CRITICAL: Connection refused by host [16:49:07] PROBLEM - SSH on analytics1011 is CRITICAL: Connection refused [16:49:07] PROBLEM - puppet disabled on analytics1013 is CRITICAL: Connection refused by host [16:49:07] PROBLEM - DPKG on analytics1009 is CRITICAL: Connection refused by host [16:49:07] PROBLEM - DPKG on analytics1011 is CRITICAL: Connection refused by host [16:49:08] PROBLEM - DPKG on analytics1013 is CRITICAL: 
Connection refused by host [16:49:09] PROBLEM - DPKG on analytics1012 is CRITICAL: Connection refused by host [16:49:16] PROBLEM - Disk space on analytics1013 is CRITICAL: Connection refused by host [16:49:17] PROBLEM - Disk space on analytics1012 is CRITICAL: Connection refused by host [16:49:17] PROBLEM - RAID on analytics1021 is CRITICAL: Connection refused by host [16:49:17] PROBLEM - RAID on analytics1012 is CRITICAL: Connection refused by host [16:49:17] PROBLEM - SSH on analytics1023 is CRITICAL: Connection refused [16:49:17] PROBLEM - SSH on analytics1013 is CRITICAL: Connection refused [16:49:18] PROBLEM - Disk space on analytics1023 is CRITICAL: Connection refused by host [16:49:18] PROBLEM - puppet disabled on analytics1012 is CRITICAL: Connection refused by host [16:49:26] PROBLEM - DPKG on analytics1021 is CRITICAL: Connection refused by host [16:49:27] hehe [16:49:29] (03PS1) 10Faidon Liambotis: Add AAAA to Gerrit [operations/dns] - 10https://gerrit.wikimedia.org/r/95434 [16:49:38] cmjohnson1: forgot to add maintenance mode in Icinga :-D [16:49:46] PROBLEM - puppet disabled on analytics1023 is CRITICAL: Connection refused by host [16:49:46] PROBLEM - puppet disabled on analytics1021 is CRITICAL: Connection refused by host [16:49:46] PROBLEM - RAID on analytics1023 is CRITICAL: Connection refused by host [16:49:51] uh oh [16:50:02] :-D [16:50:06] PROBLEM - RAID on analytics1013 is CRITICAL: Connection refused by host [16:50:07] PROBLEM - DPKG on analytics1023 is CRITICAL: Connection refused by host [16:50:08] why does icinga-wm always greet me the same way every day? [16:50:09] greg-g: planned maintenance [16:50:16] because it <3 you [16:50:20] get up, check some email, look at IRC and BAM! [16:50:21] just like our pages say... [16:50:22] (I think) [16:51:07] <^d> We should s/PROBLEM/BAM!/ in icinga alerts. 
[16:51:09] (03PS1) 10Manybubbles: CirrusSearch secondary for itwiki and plwiktionary [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95437 [16:51:17] or Oooops [16:51:21] <^d> BAM! - DPKG on analytics1023 is CRITICAL: Connection refused by host [16:51:33] heh [16:51:36] (03CR) 10Faidon Liambotis: [C: 032] Add AAAA to Gerrit [operations/dns] - 10https://gerrit.wikimedia.org/r/95434 (owner: 10Faidon Liambotis) [16:51:54] (03CR) 10MarkAHershberger: [C: 031] Remove "Your cache administrator is nobody" joke. [operations/puppet] - 10https://gerrit.wikimedia.org/r/95147 (owner: 10Mattflaschen) [16:52:03] gerrit now has ipv6 [16:52:45] ^d: ^ [16:52:49] ^d: :) :) [16:53:29] feel free to ping me with ipv4-only services and I'll put them in my queue ;) [16:53:35] <^d> :) [16:54:09] greg-g: https://ganglia.wikimedia.org/latest/?r=day&cs=&ce=&tab=v&vn=Media+storage&hide-hf=false [16:54:26] PROBLEM - Host analytics1009 is DOWN: PING CRITICAL - Packet loss = 100% [16:55:06] paravoid: lots o colors [16:55:16] top row [16:55:18] (03PS3) 10Ottomata: Serve geowiki's private data through statistics websever [operations/puppet] - 10https://gerrit.wikimedia.org/r/94626 (owner: 10QChris) [16:55:21] compare left and right [16:55:21] does those represent individual machines? stacked graph or something? [16:55:22] (03CR) 10Ottomata: [C: 032 V: 032] Serve geowiki's private data through statistics websever [operations/puppet] - 10https://gerrit.wikimedia.org/r/94626 (owner: 10QChris) [16:55:23] and their titles [16:55:26] yeah [16:55:32] swift pmtpa -> eqiad [16:55:37] switch over a bit obvious :) [16:55:51] my way of telling you that I'm done [16:55:51] what does "fe" stand for? [16:55:54] (03CR) 10Andrew Bogott: "OK, brace yourself for some silly gerrit gymnastics as I attempt to get this merged." 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/90760 (owner: 10Matanya) [16:55:55] paravoid: :) thanks [16:56:12] * matanya is holding fingers [16:56:44] greg-g: frontend [16:56:50] ah [16:56:59] but yeah, awesome [16:57:05] andrewbogott: I'm late to the party but "download" is a terrible name for a module [16:57:40] paravoid, matanya, have a better suggestion? [16:58:20] andrewbogott: paravoid maybe hosting? [16:58:27] (I've always been puzzled by what exactly was meant by 'download' but figured I was the only one.) [16:58:44] since it is sites we hsot for downloads? [16:58:51] *host [16:59:17] paravoid: that graph is pretty [16:59:36] PROBLEM - Host analytics1021 is DOWN: PING CRITICAL - Packet loss = 100% [17:00:18] <^d> manybubbles: You ready? [17:00:45] ^d: sure! to deploy the extension update, can I use sync_dir? [17:00:56] PROBLEM - Host analytics1012 is DOWN: PING CRITICAL - Packet loss = 100% [17:01:15] <^d> Yeah, just update the submodule then sync-dir. [17:01:21] ^d: k. starting. [17:01:23] <^d> Lemme +2 your changes so jenkins will do its think, [17:01:24] greg-g: ^ [17:01:25] <^d> *thing [17:01:30] oh yeah, that [17:01:44] (03CR) 10Chad: [C: 032] CirrusSearch secondary for itwiki and plwiktionary [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95437 (owner: 10Manybubbles) [17:01:46] PROBLEM - Host analytics1011 is DOWN: PING CRITICAL - Packet loss = 100% [17:02:20] manybubbles: what's even prettier is http://gdash.wikimedia.org/dashboards/filebackend/ [17:02:27] manybubbles: :) [17:02:43] (03Merged) 10jenkins-bot: CirrusSearch secondary for itwiki and plwiktionary [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95437 (owner: 10Manybubbles) [17:02:52] paravoid: that one is harder to read I think [17:03:11] just see the latency dropping at about 15:00 UTC [17:04:04] matanya: git tells me that your patch is a 'merge'. Any idea what that's about? 
[17:04:09] paravoid: wow, and that is log scale [17:04:28] yes [17:04:29] yes andrewbogott. i have rebased it against changes you did [17:04:46] ^d: another thing to confirm, I have to do the sync for both wmf3 and wmf2 [17:04:55] matanya: did you rebase or merge? Rebasing shouldn't cause that problem... [17:05:21] i have rebased, and had a conflict. the result was a merge [17:05:50] andrewbogott: any problem with that? [17:05:56] <^d> manybubbles: You only updated the submodule on wmf2 so just that one. [17:06:05] "fatal: Commit 81ea0840cc9ee65a678fc6f3f4e35d733a30ccd1 is a merge but no -m option was given." [17:06:06] (03PS1) 10Matanya: imagescaler: convert into a module [operations/puppet] - 10https://gerrit.wikimedia.org/r/95440 [17:06:21] ^d: let me update it on wmf3 as well [17:07:02] ugh [17:07:06] andrewbogott: jus try to cherry pick, i guess [17:07:10] *t [17:07:15] <^d> manybubbles: k :) [17:07:19] matanya: I can't, that's the point :) [17:07:24] I've never seen this error, actually. [17:07:25] I don't like all these "mv manifests/* modules/" attempts [17:07:48] PROBLEM - Host analytics1013 is DOWN: PING CRITICAL - Packet loss = 100% [17:07:51] matanya: can you create a new fresh 'origin' branch and see if it lest you cherry-pick locally? [17:07:56] paravoid, what would you prefer? [17:07:59] imagescaler needs to be merged with appserver [17:08:03] (for example) [17:08:41] paravoid: Sure, but… I don't think that's a general principle. Some groups of manifests are already coherent organizations, some aren't. [17:09:11] <^d> manybubbles: merged that too. 
[17:09:16] andrewbogott: don't merge my patches of paravoid is un happy with them, please [17:09:23] *if [17:09:26] haha [17:09:50] I'm not unhappy in general, no :) [17:09:56] quite the opposite [17:09:58] matanya: don't worry :) [17:09:59] PROBLEM - Host analytics1023 is DOWN: PING CRITICAL - Packet loss = 100% [17:10:17] (03PS1) 10QChris: Serve geowiki datafiles folder directly [operations/puppet] - 10https://gerrit.wikimedia.org/r/95443 [17:10:34] matanya: it might be worthwhile for you to broadcast future refactor attempts to a mailing list before you start work, and see if people have existing ideas for how things will be rearranged. [17:10:53] I haven't looked at the wikipage for a while [17:11:01] is it still being updated? [17:11:17] * paravoid looks [17:11:21] Rarely, and only by me. [17:11:54] heh, last revision is by matanya, about the imagescaler work [17:12:08] andrewbogott: my point of view is, refactoring everything into modules is a step we must take at first. then we can start arranging stuff [17:12:31] we've previously discussed this (to the death) [17:12:46] i'm unaware of that [17:12:50] matanya: that's not an unreasonable approach, but we've discussed this and agreed to do refactors & modularization in concert. 
[17:13:00] at least /some/ refactoring [17:13:02] (which you would have no way of knowing :) ) [17:13:05] and i guess i lost this debate without being in it :) [17:13:08] PROBLEM - Varnish HTTP text-backend on cp1055 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:13:17] this was before you started working on this :( [17:13:39] I can throw this patch away [17:13:43] !log manybubbles synchronized php-1.23wmf2/extensions/CirrusSearch/ 'Update CirrusSearch to master' [17:13:58] Logged the message, Master [17:13:59] RECOVERY - Varnish HTTP text-backend on cp1055 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 3.922 second response time [17:14:03] !log manybubbles synchronized php-1.23wmf3/extensions/CirrusSearch/ 'Update CirrusSearch to master' [17:14:17] Logged the message, Master [17:14:20] matanya: this particular patch (download) is definitely useful, pending a rename. [17:14:30] I just can't figure out how to get git to deal with the damn thing. [17:14:36] this == imagescaler [17:14:43] Oh, that I don't know about. [17:14:47] I have no good ideas, so I won't block you, just merge it and we'll figure it out later [17:14:54] about the "download" naming [17:15:07] (03CR) 10Ottomata: [C: 032 V: 032] Serve geowiki datafiles folder directly [operations/puppet] - 10https://gerrit.wikimedia.org/r/95443 (owner: 10QChris) [17:15:19] maybe "downloads"? dunno [17:15:20] andrewbogott: doesn't let me do this either [17:15:51] paravoid: any thoughts on rsyslog line length limit for varnishkafka stats? [17:15:56] matanya: You must've done something different from what I do when resolving rebase conflicts. [17:16:08] matanya: Worst case we can produce a text diff and re-apply it... [17:16:13] i did what RoanKattouw_away told me :) [17:16:16] Although that could be messy. [17:16:16] ottomata: what is it that is being logged exactly? [17:16:34] matanya: if it ever involved typing the word 'merge' then roan is wrong!
[17:16:39] paravoid: https://gist.github.com/ottomata/7457391 [17:17:02] andrewbogott: the word used was git rebase --skip [17:17:14] oh! That's even weirder. [17:17:26] after git rebase [17:17:33] I mean, sure enough, that will get you a valid patch locally, but inasmuch as the patch isn't a patch to origin it's not very helpful. [17:17:36] ottomata: uhm... [17:17:49] ottomata: to me, /var/log is logs for sysadmins, not stats like those [17:18:00] oook [17:18:04] !log manybubbles synchronized wmf-config/InitialiseSettings.php 'plwiktionary and itwiki get cirrus as secondary' [17:18:07] this is completely unreadable [17:18:10] well rsyslog is flexible, i could identify the lines somehow [17:18:16] Logged the message, Master [17:18:17] and direct them somewhere else [17:18:22] ^d and greg-g: I believe I'm done syncing code [17:18:24] but i'd still need to up the rsyslog line length limit [17:18:37] it's very borderline [17:18:55] but I think you're better off not using syslog for this at all [17:18:56] manybubbles: neat-o [17:19:06] manybubbles: how's things? broken? good? somewhere between? [17:19:06] just write to /var/cache/varnishkafka/stats for example [17:19:08] !log Jenkins: updated code sniffer style from 574f68d to 0bebf0f7b [17:19:08] <^d> manybubbles: I think so too. Easy to verify :) [17:19:23] Logged the message, Master [17:19:32] <^d> https://www.mediawiki.org/wiki/Special:Version - has Elasticsearch version now :) [17:19:39] matanya: does that make sense? I think this patch is a diff that applies to some other earlier (or partially merged) version of the repo. [17:19:43] alternatively, just push them to statsd, although I guess that's a larger project :) [17:19:45] So git can't apply it to origin. [17:19:51] I'm not sure how we unwind this… thinking. [17:20:20] i'm trying to push something here andrewbogott [17:20:21] (03CR) 10GWicke: "Awesome, thanks!"
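A side note on the gist linked above: varnishkafka's stats output is JSON, one object per line, so a logster-style parser only needs to pull a few counters out of each line. A minimal sketch, with field names that are illustrative stand-ins rather than the exact varnishkafka output:

```python
import json

def parse_stats_line(line: str) -> dict:
    """Extract a few counters from one JSON stats line.

    Each line is self-contained JSON, so no stateful parsing is
    needed; the 'kafka', 'txerrs' and 'msgq_cnt' names here are
    hypothetical examples of the kind of fields a producer emits.
    """
    record = json.loads(line)
    kafka = record.get("kafka", {})
    return {
        "txerrs": kafka.get("txerrs", 0),    # produce errors
        "msgq_cnt": kafka.get("msgq_cnt", 0),  # messages still queued
    }

sample = '{"kafka": {"txerrs": 2, "msgq_cnt": 150}}'
print(parse_stats_line(sample))
```

A parser like this could feed ganglia (via logster) or statsd without caring how the lines got onto disk.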
[operations/puppet] - 10https://gerrit.wikimedia.org/r/93527 (owner: 10Lcarr) [17:20:25] 'k [17:20:26] ^d: verified some expected change made it to plwiktionary. yay [17:21:05] nope, doesn't work andrewbogott [17:21:18] ! [remote rejected] HEAD -> refs/publish/production/download (no new changes) [17:21:34] paravoid: yeah i've been talking with Snaps a bit about that [17:21:39] not sure if we wanted to build that into varnishkafka or not [17:21:49] parsing them and sending them out isn't too hard [17:22:00] was going to use https://github.com/etsy/logster [17:22:10] matanya: OK. I think the way to handle this (clumsy!) is to do 'git diff origin' > foo.diff [17:22:12] the statsd protocol is really really trivial [17:22:16] (which, it seems, was based on maplebed's ganglia-logtailer) [17:22:29] then patch foo.diff into a clean checkout of origin... [17:22:40] logster seems nice [17:22:43] And then go through and resolve conflicts by hand, and then see what you get. [17:22:51] andrewbogott: ok, trying [17:22:55] !log rebuilding all CirrusSearch indexes in place to get config updates [17:22:59] haha [17:23:01] small world [17:23:04] I can do all that, but it might be more interesting for you to try :) [17:23:11] Logged the message, Master [17:23:28] yeah, and logster would let me not worry about statsd right now, and just deal with ganglia [17:23:34] since everything i'm doing is in ganglia anyway [17:23:56] * andrewbogott goes to peel parsnips, back in 5 [17:24:45] sigh, so anyway, paravoid, what should I do, I don't care where the stats go, and I'm sure I can use the rsyslogd conf file to filter these json ones to /var/cache [17:24:56] but I still have a rsyslog line length limit to deal with [17:25:00] i can up it [17:25:01] (03PS1) 10Hashar: deployment: mediawiki/tools/codesniffer for Jenkins CI slaves [operations/puppet] - 10https://gerrit.wikimedia.org/r/95446 [17:25:09] but I *think* I can only do so globally [17:25:14] which probably won't hurt anything really
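As noted above, the statsd protocol really is trivial: a single UDP datagram of the form `name:value|type`. A minimal sketch (the metric name, host, and port here are illustrative defaults, not anything configured in production):

```python
import socket

def send_statsd(metric: str, value: int, kind: str = "c",
                host: str = "127.0.0.1", port: int = 8125) -> bytes:
    """Format and send one statsd datagram.

    The wire format is just 'name:value|type' ('c' = counter),
    which is why it is so easy to emit directly from a producer.
    """
    payload = f"{metric}:{value}|{kind}".encode("ascii")
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(payload, (host, port))  # fire-and-forget; no listener required
    sock.close()
    return payload

# e.g. a hypothetical varnishkafka produce-error counter
send_statsd("varnishkafka.txerrs", 3)
```

Because it is UDP and fire-and-forget, emitting this from varnishkafka itself would add essentially no delivery risk.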
[17:25:25] the diff is empty andrewbogott, so can't apply it :/ [17:25:27] just isn't nice for an installed program to change the global settings [17:25:37] can I get a merge / deploy of https://gerrit.wikimedia.org/r/95446 , that adds a repository in the "nameless wikimedia deployment system based on the perl git-deploy script" [17:25:41] pleeaaase :] [17:26:17] (03CR) 10Ori.livneh: [C: 032] deployment: mediawiki/tools/codesniffer for Jenkins CI slaves [operations/puppet] - 10https://gerrit.wikimedia.org/r/95446 (owner: 10Hashar) [17:26:25] \O: [17:26:27] \O/ [17:26:32] thx [17:26:42] it isn't no [17:26:43] ori-l: don't bother running puppet on tin, that can wait. [17:27:07] paravoid ^^^^^ [17:27:08] ^d: so the in place reindex is pretty fast.... 20 processes going at 250 docs/sec. with our tiny hardware [17:27:26] oh sorry yeah [17:27:28] yeah not nice [17:27:33] ottomata: you have to puppet-merge on palladium [17:27:41] <^d> manybubbles: Some of the incremental updates we've done here and there have helped :) [17:27:43] oh! [17:27:50] ottomata: already running it tho [17:27:55] are my sockpuppet merges not working then? [17:27:56] oh [17:27:57] sorry [17:28:00] ^d: http://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Miscellaneous%20eqiad&h=testsearch1003.eqiad.wmnet&r=hour&z=default&jr=&js=&st=1384449925&v=12033012&m=es_indexes&vl=indexes%2Fsec&ti=es_indexes&z=large [17:28:07] if I do so on palladium will it go to sockpuppet/stafford too? 
[17:28:11] I will change my behavior [17:28:32] ottomata: you have to do both manually, but just as a temporary measure; see akosiaris's email [17:30:03] ok [17:30:05] thanks [17:30:09] np [17:31:51] (03PS1) 10Ottomata: Updating help message for puppet-merge [operations/puppet] - 10https://gerrit.wikimedia.org/r/95448 [17:32:10] (03CR) 10Ottomata: [C: 032 V: 032] Updating help message for puppet-merge [operations/puppet] - 10https://gerrit.wikimedia.org/r/95448 (owner: 10Ottomata) [17:33:26] paravoid, sorry to keep bothering, but I was actually working on logster stuff to parse and send varnishkafka stats to ganglia, so if you think we shouldn't be writing this stuff to files (via rsyslog?) then I should know sooner rather than later [17:33:29] so I don't keep working on it [17:33:45] I think it's a poor fit, yes [17:34:04] we can work around it if you want to be done with it, but I think it's not that great [17:34:31] so, what then, statsd support in vk? [17:35:10] for some parts of it, that'd be ideal I think, but in the meantime just have varnishkafka write to those files by itself? [17:35:22] instead of relying on syslog for writing stats? [17:35:46] (03PS1) 10Umherirrender: enable Echo on all beta.wmflabs.org-wikis [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95450 [17:35:49] we'd have to code support for that in, right now it is either stderr or syslog [17:35:56] got Faidon? [17:36:02] that would be me [17:36:10] handy [17:36:12] hi! [17:36:13] :-) [17:36:14] bwerrrrr [17:36:14] off for today, see you tomorrow [17:36:20] Snaps: i guess is not around? [17:36:28] (03CR) 10Umherirrender: "Untested" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95450 (owner: 10Umherirrender) [17:36:29] matanya: if the diff is empty then you aren't diffing the right things :) You want a diff between your local weird branch with 'download' refactor applied vs. origin
running $ git diff download origin > foo.diff [17:37:30] paravoid: Can I convince you to do a little bgp config for me? [17:37:35] and the diff is empty and zero size [17:38:08] I need an ibgp session for pmacct. -- the mx80's IPFIX dst_asn data is trash when you have lots of AS-PATHs. [17:38:37] hrm [17:38:49] do you need it badly? I'd prefer if we went through Leslie [17:39:08] Ok, I've been bugging her, but was looking for some redundancy.. :) [17:39:14] heh [17:39:16] not badly, but kinda stuck atm. [17:39:35] I'll give her a few more days. [17:39:44] yeah, let's please do that [17:40:00] were you able to build some debs? [17:40:09] I could test those out instead. [17:40:15] sorry, that command was pseudocode, not literal :) [17:40:15] I didn't find the time yet, but I didn't forget [17:40:27] If you're on a branch, to diff vs. another branch just 'git diff ' [17:40:30] ok, np.. [17:40:34] sorry :( [17:40:35] So, in your case, $ git diff origin [17:40:39] sorry, was cryptic before [17:41:56] still zero size andrewbogott [17:42:27] um… that means your 'download' branch == origin, which is intresting! [17:42:31] and tragic [17:42:49] anyway, I can do this :) hang on... [17:43:14] * matanya is lost in this git magic [17:43:55] cajoel: you know we have confeds now, right? [17:44:32] (03CR) 10Hashar: [C: 031] "Looks fine. Once merged / updated on beta, a wmf person can manually trigger https://integration.wikimedia.org/ci/job/beta-update-databas" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95450 (owner: 10Umherirrender) [17:46:34] matanya: Ok, for demonstration purposes only, this patch is the total applied diff (which means it will include conflicts as well, probably…) [17:46:44] paravoid: I think you can still do a rr-client session from a single device... [17:46:47] next I'll clean things up… hopefully it will be obvious how to do that. [17:47:05] the goal is to just get all bestpaths from that device -- not actually exchange any data... 
[17:47:19] paravoid: is the confed layout diagramed somewhere? [17:47:30] cajoel: the point is that you'd probably have to do multiple iBGP sessions with each private AS [17:47:43] I don't think it is, no [17:47:58] paravoid: it's only on the single egress device that I need the feed.. [17:48:08] what do you mean? [17:48:13] maybe I don't understand confeds as much as I think I do, but aren't they one AS per device? [17:48:41] it's the first time I'm doing them, but I think our setup is [17:49:01] one private AS per DC (2 routers) [17:49:01] each egress device (router) would have an ibgp feed back to the collector [17:49:12] we have 2 egress devices per site [17:49:25] that's ok [17:49:34] so one ibgp per site, at least [17:49:46] I think you need one per device [17:49:55] (03PS1) 10Andrew Bogott: Convert download into a module and clean up [operations/puppet] - 10https://gerrit.wikimedia.org/r/95454 [17:50:01] as one device might have different path selections due to igp costs? [17:50:13] or do you land all upstreams on both egress devices? [17:50:25] no, the exact opposite actually [17:50:34] each transit is only on one device [17:50:38] (03Abandoned) 10Andrew Bogott: Convert download into a module and clean up [operations/puppet] - 10https://gerrit.wikimedia.org/r/95454 (owner: 10Andrew Bogott) [17:50:41] paravoid: want to do a quick hangout? [17:50:47] ok! Well, that patch was unsalvageably bad. [17:51:08] what's wrong with IRC? :-) [17:51:10] pv: so you might have one router preferring isp1, and the other preferring isp2.. [17:51:17] So, matanya, I don't think there's an easy fix… you'll need to rebuild the patch by hand :( [17:51:29] ok andrewbogott thanks [17:51:40] if you have equal localprefs and as-paths, etc. [17:51:46] I don't think there's a lesson here except that --skip breaks everything. [17:51:53] If you have merge conflicts you have to actually resolve them by hand.
[17:52:05] hmm, yeah, I guess so [17:52:06] if you did me a table dump from both devices, I could confirm that theory, but I think it's likely.... [17:52:14] yeah, you're right [17:52:16] ok, note for RoanKattouw :) [17:52:22] that's why I believe you normally want ibgp from each egress device [17:52:30] heh [17:52:38] andrewbogott: how do i revert it back to a working status? [17:52:39] Yeah that --skip was my brain fart, sorrt [17:52:41] *sorry [17:52:50] mark wanted to split routers into core/edge (using juniper virtual routers) for this reason [17:52:59] matanya: just start a fresh branch... [17:53:01] but we've found a few limitations, so this probably won't happen [17:53:01] $ git fetch origin [17:53:08] ok andrewbogott [17:53:09] $ git checkout -b freshbranch origin [17:53:19] yes, that i should know by now :) [17:53:22] sure sure.. [17:54:14] ok, so you'll need to peer cr1-eqiad/65002, cr2-eqiad/65002, cr1-sdtpa/65001, cr2-pmtpa/65001, cr1-ulsfo/65003, cr2-ulsfo/65003 [17:54:14] oooof paravoid, is it so bad to just up the line limit for rsyslog on varnish hosts? i'm fine with doing this via puppet if we don't want to do it in the package [17:54:32] with* [17:54:38] esams is a completely different AS [17:54:50] so not part of the confederation [17:55:01] (14907 vs. 43821) [17:55:28] ottomata: you want to do a infrastructure-wide to logging because you want to abuse rsyslog for stats collection :) [17:55:38] haha [17:55:39] infrastructure-wide change [17:55:47] well if you put it that way of course it sounds horrible :p [17:55:50] :) [17:56:08] PROBLEM - Varnish HTTP text-backend on cp1055 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:56:31] for gathering stats about analytics, using a daemon that was made to use a special protocol for... transporting log lines [17:56:42] am I making it better?
:) [17:56:56] !log reedy synchronized php-1.23wmf3/extensions/Wikibase [17:56:58] I mean, sure, yes, do it if you want to be over with it, but don't expect me to like it :) [17:56:59] RECOVERY - Varnish HTTP text-backend on cp1055 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 2.148 second response time [17:57:13] Logged the message, Master [17:58:21] haha, we talked about sending these stats back into a special kafka topic [17:58:33] but i'd rather be able to see produce error counts directly from the producers [17:58:44] if there are produce errors, and we can't produce to kafka [17:58:47] how would we know? [17:58:47] :p [17:58:47] so yeah [17:58:50] pshh [18:04:03] ottomata: we could do both! syslog for fallback [18:05:06] what's that Snaps, log to a file? [18:05:08] PROBLEM - Varnish HTTP text-backend on cp1055 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:05:10] or statsd? [18:05:55] RECOVERY - Varnish HTTP text-backend on cp1055 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 0.001 second response time [18:07:32] ottomata : we send stats to kafka and if delivery fails we syslog it instead or write to file [18:08:17] manybubbles: just updated elasticsearch.deb in our apt [18:08:19] http://apt.wikimedia.org/wikimedia/pool/universe/e/elasticsearch/ [18:08:25] ahhhhhh noo [18:08:32] ottomata: yay! no? [18:08:35] haha [18:08:38] that was for Snaps [18:08:55] putting it in kafka vs. a file means I have to parse it differently in different places [18:09:01] just to get the stats to ganglia [18:09:04] ottomata: cool. I'll try the update soon in dev/beta/labs [18:09:06] i want the monitoring to just work [18:10:07] (03PS1) 10Mark Bergsma: Distribute all CentralAutoLogin requests randomly over backends [operations/puppet] - 10https://gerrit.wikimedia.org/r/95458 [18:10:10] Snaps: can we just separate the stats part of the logging from the regular vk_log stuff? [18:10:17] and always use a local file? [18:10:19] for that? 
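The fallback Snaps describes above (send stats to Kafka, and if delivery fails write them to a file or syslog instead) can be sketched in a few lines. Here `produce` is a stand-in for a real Kafka producer callable and the paths are illustrative, not anything deployed:

```python
import os
import tempfile

def emit_stat(line: str, produce, fallback_path: str) -> str:
    """Try the primary delivery path; on failure, append to a local file.

    If producing the stats line itself fails (e.g. all brokers are
    unreachable), the line still lands somewhere monitoring can read,
    which answers "if we can't produce to kafka, how would we know?"
    """
    try:
        produce(line)
        return "kafka"
    except Exception:
        with open(fallback_path, "a") as f:
            f.write(line + "\n")
        return "file"

def broken_produce(line):
    raise ConnectionError("all brokers down")

path = os.path.join(tempfile.mkdtemp(), "vk.stats")
print(emit_stat('{"txerrs": 2}', broken_produce, path))
```

The design point is that the fallback file doubles as the signal: a non-empty fallback file means the primary path is broken.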
[18:11:43] (03CR) 10Faidon Liambotis: [C: 032] Distribute all CentralAutoLogin requests randomly over backends [operations/puppet] - 10https://gerrit.wikimedia.org/r/95458 (owner: 10Mark Bergsma) [18:13:11] (03PS1) 10Odder: (bug 57042) Update $wgUploadNavigationUrl on tewiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95459 [18:14:07] (03PS2) 10Cmjohnson: Adding analytics vlan to 10.in, changing ips of an1009,11-13,21,23..removing ipv6 an21 [operations/dns] - 10https://gerrit.wikimedia.org/r/95426 [18:14:15] paravoid: heh, nice trick :) [18:14:23] Aaron|home: which one? [18:14:38] well, I guess it's mark's really [18:14:54] paravoid: 95458 [18:17:25] PROBLEM - MySQL Processlist on db1028 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:17:34] (03PS1) 10Matanya: download: convert into a module and clean up [operations/puppet] - 10https://gerrit.wikimedia.org/r/95460 [18:17:43] andrewbogott: ^ [18:18:05] PROBLEM - MySQL Replication Heartbeat on db1028 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:18:15] PROBLEM - MySQL InnoDB on db1028 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:18:15] PROBLEM - MySQL Slave Delay on db1028 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:18:16] PROBLEM - MySQL Recent Restart on db1028 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:18:16] PROBLEM - MySQL Idle Transactions on db1028 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:18:19] matanya: ok! I guess I should read this again for good measure... 
[18:18:36] good idea, i'm human and make lots of mistakes [18:18:38] Aaron|home: yeah, some really bad luck is killing cp1055 [18:18:55] RECOVERY - MySQL Replication Heartbeat on db1028 is OK: OK replication delay -1 seconds [18:19:00] it's #54195 + (some uls api bugs I don't have handy) + consistent hashing [18:19:05] RECOVERY - MySQL InnoDB on db1028 is OK: OK longest blocking idle transaction sleeps for 0 seconds [18:19:05] RECOVERY - MySQL Slave Delay on db1028 is OK: OK replication delay 1 seconds [18:19:06] RECOVERY - MySQL Recent Restart on db1028 is OK: OK 20809053 seconds since restart [18:19:06] RECOVERY - MySQL Idle Transactions on db1028 is OK: OK longest blocking idle transaction sleeps for 0 seconds [18:20:05] ottomata: local file is good [18:20:53] akosiaris: when you say "give it a simple Rdoc format.. so the documentation generator can parse it".. do you mean literally "rdoc foo.pp" or would "puppet doc foo.pp" finding it be sufficient [18:21:17] because the latter finds comments while rdoc sees 0 classes [18:21:35] though puppet doc is based on rdoc [18:21:52] (03CR) 10Andrew Bogott: [C: 032] download: convert into a module and clean up [operations/puppet] - 10https://gerrit.wikimedia.org/r/95460 (owner: 10Matanya) [18:22:15] RECOVERY - MySQL Processlist on db1028 is OK: OK 0 unauthenticated, 0 locked, 19 copy to table, 14 statistics [18:22:17] matanya: can you abandon the other patch in gerrit, to avoid confusion? [18:22:19] Snaps: are you ok with splitting them?
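For background on why cp1055 keeps flapping while its siblings stay healthy: with URL-based consistent hashing, one very hot URL (here the CentralAutoLogin path) always maps to the same backend, so that one machine absorbs the whole load; mark's change (Gerrit 95458) spreads those requests randomly instead. A toy illustration, with a hypothetical backend pool:

```python
import hashlib
import random

random.seed(0)  # deterministic demo
BACKENDS = ["cp1055", "cp1056", "cp1065", "cp1066"]  # hypothetical pool

def pick_backend_by_hash(url: str) -> str:
    """URL-hash director: every request for the same URL lands on the
    same backend, so one hot URL overloads a single machine."""
    h = int(hashlib.md5(url.encode()).hexdigest(), 16)
    return BACKENDS[h % len(BACKENDS)]

def pick_backend_random(url: str) -> str:
    """Random director for the hot path: spread it over all backends."""
    return random.choice(BACKENDS)

hot = "/wiki/Special:CentralAutoLogin/start"
hashed = {pick_backend_by_hash(hot) for _ in range(1000)}
spread = {pick_backend_random(hot) for _ in range(1000)}
print(len(hashed), "backend(s) via hashing,", len(spread), "via random")
```

Hashing is the right default for cacheable content (it maximizes hit rate); randomizing only the uncacheable hot path trades that away for load spreading.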
[18:22:24] Snaps: moving to PM [18:22:25] yes [18:22:35] ah:) hi [18:22:57] (03PS1) 10Mark Bergsma: Change IP addresses of upload-lb.eqiad to the new Zero scheme [operations/dns] - 10https://gerrit.wikimedia.org/r/95462 [18:23:11] (03Abandoned) 10Matanya: download: convert into a module and clean up [operations/puppet] - 10https://gerrit.wikimedia.org/r/90760 (owner: 10Matanya) [18:23:12] (03CR) 10Anomie: "> In this case, yes, you should only migrate the ones indicated in bug title and comment 2." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94598 (owner: 10Arav93) [18:23:18] Aaron|home: btw, swift change went in earlier today and I also switched varnish [18:23:43] * Aaron|home thought the varnish change idea was in jest :) [18:26:53] cajoel: hey [18:27:04] i was unable to figure out how to change the port on an ibgp session via junos [18:29:16] paravoid: I don't see that in SAL [18:29:40] SAL isn't used much for things in gerrit [18:30:39] (03CR) 10RobH: [C: 031] "one space issue (already chatted about in irc), otherwise this looks sane to me. (just doing +1 so Chris can merge)" [operations/dns] - 10https://gerrit.wikimedia.org/r/95426 (owner: 10Cmjohnson) [18:31:16] Aaron|home: git history :) [18:32:30] (03CR) 10jenkins-bot: [V: 04-1] download: convert into a module and clean up [operations/puppet] - 10https://gerrit.wikimedia.org/r/95460 (owner: 10Matanya) [18:34:18] (03CR) 10Mark Bergsma: [C: 032] Change IP addresses of upload-lb.eqiad to the new Zero scheme [operations/dns] - 10https://gerrit.wikimedia.org/r/95462 (owner: 10Mark Bergsma) [18:34:33] (03PS2) 10Matanya: download: convert into a module and clean up [operations/puppet] - 10https://gerrit.wikimedia.org/r/95460 [18:35:09] jenkins dead? 
[18:35:18] no, just gave me -1 [18:35:22] he is just slow [18:35:36] andrewbogott: uploaded a new one [18:36:39] !log reedy synchronized php-1.23wmf4/ [18:36:57] Logged the message, Master [18:37:27] !log reedy synchronized docroot and w [18:37:45] Logged the message, Master [18:38:16] !log reedy updated /a/common to {{Gerrit|I076d1eddf}}: CirrusSearch secondary for itwiki and plwiktionary [18:38:34] LeslieCarr: tricky... [18:38:36] Logged the message, Master [18:39:19] (03PS1) 10Reedy: Add/update symlinks [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95464 [18:39:21] (03PS1) 10Reedy: Everything else to 1.23wmf3 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95465 [18:39:21] (03PS1) 10Reedy: Phase 1 wikis to 1.23wmf4 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95466 [18:39:37] Leslie: Will the neighbor line take IP:PORT ? [18:39:48] (03CR) 10Cmjohnson: [C: 032] Adding analytics vlan to 10.in, changing ips of an1009,11-13,21,23..removing ipv6 an21 [operations/dns] - 10https://gerrit.wikimedia.org/r/95426 (owner: 10Cmjohnson) [18:40:11] !log dns update [18:40:30] Logged the message, Master [18:40:48] i can try, doubtful that will work since nothing else in junos works like that, but doesn't hurt to try [18:40:56] ottomata: I ran my regression tests both during and after the upgrade - not a hiccup. [18:41:20] well, maybe i can try [18:41:23] i'm on a plane [18:41:28] ssh currently not yet going through [18:41:45] great! [18:42:24] ok, ssh is unusable, 3 second delay, not gonna login on anything with that [18:42:37] impressive that it works on a plane at all [18:42:48] (03CR) 10Reedy: [C: 032] Add/update symlinks [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95464 (owner: 10Reedy) [18:45:14] Leslie: Is the session configured on normal port for now?
[18:46:14] (03PS1) 10Ori.livneh: Forward beta cluster UDP logs to logstash.pmtpa.wmflabs [operations/puppet] - 10https://gerrit.wikimedia.org/r/95468 [18:46:17] (03CR) 10Reedy: [V: 032] Add/update symlinks [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95464 (owner: 10Reedy) [18:46:19] it's not [18:46:29] i was trying to figure out how to put it on not normal port [18:46:29] yay re 95468 [18:46:47] lc: me too -- reading... [18:46:49] if someone (paravoid or mark?) wants to do this, i would be okay with it [18:46:59] odd that it's not an obvious option. [18:47:09] what are you trying to do? [18:47:14] paravoid wanted you to do it.. :) [18:47:35] mark: establishing an ibgp session to pmacct -- for AS-PATH correlation of flow data. [18:47:47] (03CR) 10Ori.livneh: [C: 032] Forward beta cluster UDP logs to logstash.pmtpa.wmflabs [operations/puppet] - 10https://gerrit.wikimedia.org/r/95468 (owner: 10Ori.livneh) [18:47:57] to a nonstandard port? [18:48:01] trying to figure out if it's possible to configure junos to use a nonstandard port for ibgp [18:48:04] yep [18:48:45] may not be possible indeed [18:49:06] why do you need a nonstandard port, < 1024 port needs root access? [18:49:22] I am getting 503s from Varnish [18:49:29] JIC it hasn't been reported [18:49:31] well i'm not logging into any routers with 3 second lags [18:49:37] :) [18:49:47] that's just basically saying "you will fuck up and destroy production" [18:49:54] that's a good reason, but also since we'll eventually need a bunch of feeds, and it's easier to keep them separate this way... [18:49:55] StevenonChromeOS: can you post more info, the line where it says "paste this:"? [18:50:06] PROBLEM - MySQL Replication Heartbeat on db1028 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:50:15] PROBLEM - MySQL InnoDB on db1028 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:50:15] PROBLEM - Apache HTTP on mw1064 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:50:15] PROBLEM - Apache HTTP on mw1164 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:50:15] PROBLEM - Apache HTTP on mw1216 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:50:16] PROBLEM - Apache HTTP on mw1171 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:50:16] PROBLEM - MySQL Recent Restart on db1028 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:50:16] PROBLEM - MySQL Slave Delay on db1028 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:50:16] PROBLEM - Apache HTTP on mw1169 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:50:17] PROBLEM - Apache HTTP on mw1092 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:50:17] PROBLEM - MySQL Idle Transactions on db1028 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:50:18] PROBLEM - Apache HTTP on mw1061 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:50:18] PROBLEM - Apache HTTP on mw1059 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:50:19] PROBLEM - Apache HTTP on mw1089 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:50:19] PROBLEM - Apache HTTP on mw1050 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:50:20] PROBLEM - Apache HTTP on mw1057 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:50:20] ^ [18:50:20] PROBLEM - Apache HTTP on mw1025 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:50:20] PROBLEM - Apache HTTP on mw1051 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:50:21] PROBLEM - Apache HTTP on mw1109 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:50:21] PROBLEM - Apache HTTP on mw1086 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:50:22] PROBLEM - Apache HTTP on mw1210 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:50:22] PROBLEM - Apache HTTP on mw1030 is CRITICAL: CRITICAL - 
Socket timeout after 10 seconds [18:50:23] PROBLEM - Apache HTTP on mw1104 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:50:23] PROBLEM - Apache HTTP on mw1178 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:50:24] PROBLEM - Apache HTTP on mw1027 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:50:25] PROBLEM - Apache HTTP on mw1184 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:50:45] mark https://dpaste.de/kiJj [18:50:46] so, yeah [18:50:52] I'm fine if it's in varnish but as soon as I try to log in… :-( https://www.mediawiki.org/w/index.php?title=Special:UserLogin&returnto=MediaWiki [18:50:54] s7, db1028 [18:51:08] nvm [18:51:29] s7 in general looks very very unhappy [18:51:45] RECOVERY - Apache HTTP on mw1163 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.161 second response time [18:51:46] RECOVERY - Apache HTTP on mw1075 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.189 second response time [18:51:46] RECOVERY - Apache HTTP on mw1106 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.065 second response time [18:51:46] RECOVERY - Apache HTTP on mw1220 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.044 second response time [18:51:46] RECOVERY - Apache HTTP on mw1036 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.071 second response time [18:51:46] RECOVERY - Apache HTTP on mw1090 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.063 second response time [18:51:47] RECOVERY - Apache HTTP on mw1078 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.064 second response time [18:51:47] RECOVERY - Apache HTTP on mw1173 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 2.468 second response time [18:51:48] RECOVERY - Apache HTTP on mw1069 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.054 second response time [18:51:48] RECOVERY - Apache HTTP on mw1077 is OK: HTTP OK: HTTP/1.1 301 
Moved Permanently - 808 bytes in 2.662 second response time [18:51:49] RECOVERY - Apache HTTP on mw1186 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 3.043 second response time [18:51:53] =/ [18:51:54] set protocols bgp group BLAH ? [18:51:55] RECOVERY - Apache HTTP on mw1108 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.277 second response time [18:51:56] RECOVERY - LVS HTTP IPv4 on appservers.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 64845 bytes in 2.024 second response time [18:51:58] wtf just happened? [18:51:59] what happened? [18:52:00] RECOVERY - Apache HTTP on mw1055 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 8.025 second response time [18:52:00] RECOVERY - Apache HTTP on mw1083 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 8.278 second response time [18:52:00] RECOVERY - Apache HTTP on mw1167 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 8.872 second response time [18:52:00] PROBLEM - Apache HTTP on mw1096 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:52:00] RECOVERY - Apache HTTP on mw1102 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.153 second response time [18:52:00] RECOVERY - Apache HTTP on mw1026 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.564 second response time [18:52:01] RECOVERY - Apache HTTP on mw1097 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 3.043 second response time [18:52:06] PROBLEM - Apache HTTP on mw1020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:52:07] PROBLEM - Apache HTTP on mw1058 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:52:07] set protocols bgp group BLAH neighbor ? 
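[editor's note: on the nonstandard-port question above — JunOS offers no per-neighbor TCP port knob for BGP, so the usual workaround lives on the pmacct side. A hypothetical sketch of the relevant pmacct daemon directives; all addresses, limits, and paths are placeholders, not the actual setup discussed here:]

```
! pmacct BGP-thread sketch -- hypothetical values throughout.
! bgp_daemon_port would let pmacct listen on an alternate port, but a
! JunOS peer can only speak to TCP/179, so binding 179 (as root, or via
! CAP_NET_BIND_SERVICE) is the practical choice.
bgp_daemon: true
bgp_daemon_ip: 192.0.2.5
bgp_daemon_port: 179
bgp_daemon_max_peers: 10
! maps flow-exporting agents to BGP peers, enabling the AS-PATH
! correlation of flow data mentioned at 18:47:35
bgp_agent_map: /etc/pmacct/agent_map.conf
```

[on Linux, alternatives to running the collector as root include `setcap cap_net_bind_service=+ep` on the daemon binary, or an iptables REDIRECT from 179 to an unprivileged port]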
[18:52:07] RECOVERY - Apache HTTP on mw1184 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.048 second response time [18:52:15] cajoel: outage now, just a minute [18:52:16] RECOVERY - Apache HTTP on mw1064 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.446 second response time [18:52:16] RECOVERY - Apache HTTP on mw1086 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 5.881 second response time [18:52:16] RECOVERY - Apache HTTP on mw1209 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.047 second response time [18:52:17] RECOVERY - Apache HTTP on mw1053 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.058 second response time [18:52:17] RECOVERY - Apache HTTP on mw1049 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 1.103 second response time [18:52:26] RECOVERY - Apache HTTP on mw1062 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 2.317 second response time [18:52:26] RECOVERY - Apache HTTP on mw1170 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 3.792 second response time [18:52:26] RECOVERY - Apache HTTP on mw1180 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 5.244 second response time [18:52:26] RECOVERY - Apache HTTP on mw1082 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 5.406 second response time [18:52:26] RECOVERY - Apache HTTP on mw1183 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.113 second response time [18:52:27] RECOVERY - Apache HTTP on mw1185 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 6.887 second response time [18:52:27] RECOVERY - Apache HTTP on mw1028 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 8.871 second response time [18:52:28] RECOVERY - Apache HTTP on mw1095 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 9.460 second response time [18:52:28] RECOVERY - Apache HTTP on mw1019 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 
808 bytes in 9.587 second response time [18:52:46] RECOVERY - Apache HTTP on mw1022 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 3.076 second response time [18:52:48] mark: no worries... I have some IT stuff to work on.. sync up later... [18:52:52] IndexPager::buildQueryInfo [18:52:53] as always [18:52:55] RECOVERY - Apache HTTP on mw1211 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 6.246 second response time [18:52:56] RECOVERY - Apache HTTP on mw1098 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 1.666 second response time [18:53:06] RECOVERY - Apache HTTP on mw1179 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.980 second response time [18:53:07] RECOVERY - Apache HTTP on mw1212 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 8.870 second response time [18:53:07] RECOVERY - Apache HTTP on mw1040 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 9.552 second response time [18:53:16] RECOVERY - Apache HTTP on mw1217 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.727 second response time [18:53:16] RECOVERY - Apache HTTP on mw1168 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 9.730 second response time [18:53:16] RECOVERY - Apache HTTP on mw1172 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.048 second response time [18:53:16] RECOVERY - Apache HTTP on mw1063 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.192 second response time [18:53:16] RECOVERY - Apache HTTP on mw1037 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 1.219 second response time [18:53:25] RECOVERY - Apache HTTP on mw1065 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 5.618 second response time [18:53:26] RECOVERY - Apache HTTP on mw1176 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 8.354 second response time [18:54:01] so yea, s7 as i think paravoid was pointing out seems not ok, repl slaves are 
really taxed or down [18:54:06] RECOVERY - Apache HTTP on mw1035 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 9.512 second response time [18:54:10] just observation [18:54:16] RECOVERY - Apache HTTP on mw1061 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 3.691 second response time [18:54:16] PROBLEM - Apache HTTP on mw1050 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:54:16] PROBLEM - Apache HTTP on mw1109 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:54:26] RECOVERY - Apache HTTP on mw1054 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 5.075 second response time [18:54:26] RECOVERY - Apache HTTP on mw1038 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.496 second response time [18:54:26] PROBLEM - Apache HTTP on mw1150 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:54:59] RECOVERY - Apache HTTP on mw1181 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 5.411 second response time [18:54:59] PROBLEM - Apache HTTP on mw1167 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:55:06] RECOVERY - Apache HTTP on mw1112 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.129 second response time [18:55:07] RECOVERY - Apache HTTP on mw1210 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.056 second response time [18:55:07] RECOVERY - Apache HTTP on mw1025 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 1.466 second response time [18:55:16] RECOVERY - Apache HTTP on mw1050 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 2.994 second response time [18:55:16] RECOVERY - Apache HTTP on mw1109 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 2.851 second response time [18:55:16] RECOVERY - Apache HTTP on mw1092 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.413 second response time [18:55:17] PROBLEM - Apache HTTP on mw1064 is CRITICAL: CRITICAL - Socket timeout after 10 
seconds [18:55:17] RECOVERY - Apache HTTP on mw1150 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.056 second response time [18:55:17] RECOVERY - Apache HTTP on mw1166 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.050 second response time [18:55:17] RECOVERY - Apache HTTP on mw1071 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.054 second response time [18:55:17] RECOVERY - Apache HTTP on mw1187 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.056 second response time [18:55:25] RECOVERY - Apache HTTP on mw1039 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 1.766 second response time [18:55:26] RECOVERY - Apache HTTP on mw1111 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.617 second response time [18:55:26] RECOVERY - Apache HTTP on mw1041 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.057 second response time [18:55:26] RECOVERY - Apache HTTP on mw1032 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.394 second response time [18:55:26] PROBLEM - Apache HTTP on mw1062 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:55:41] ^d: looks like we got some pool queue errors on elasticsearch - they seems to have come up when I was performing the in place reindex and stopped now that it is done [18:55:45] RECOVERY - Apache HTTP on mw1219 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 5.255 second response time [18:55:46] RECOVERY - Apache HTTP on mw1167 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.047 second response time [18:55:55] RECOVERY - Apache HTTP on mw1052 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.295 second response time [18:55:56] PROBLEM - Apache HTTP on mw1211 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:56:15] RECOVERY - Apache HTTP on mw1104 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.973 second response time [18:56:16] RECOVERY 
- Apache HTTP on mw1059 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 3.974 second response time [18:56:28] RECOVERY - Apache HTTP on mw1107 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 5.788 second response time [18:56:30] RECOVERY - Apache HTTP on mw1110 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 9.577 second response time [18:56:30] PROBLEM - Apache HTTP on mw1028 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:56:30] PROBLEM - Apache HTTP on mw1063 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:57:06] RECOVERY - Apache HTTP on mw1047 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 5.126 second response time [18:57:06] RECOVERY - Apache HTTP on mw1099 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 8.408 second response time [18:57:07] RECOVERY - Apache HTTP on mw1171 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 2.768 second response time [18:57:25] RECOVERY - Apache HTTP on mw1028 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 5.537 second response time [18:57:27] RECOVERY - Apache HTTP on mw1214 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.579 second response time [18:57:27] RECOVERY - Apache HTTP on mw1093 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.845 second response time [18:57:27] RECOVERY - Apache HTTP on mw1081 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 8.147 second response time [18:57:56] RECOVERY - Apache HTTP on mw1094 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.056 second response time [18:58:05] RECOVERY - Apache HTTP on mw1213 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.150 second response time [18:58:26] PROBLEM - Apache HTTP on mw1111 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:59:25] RECOVERY - Apache HTTP on mw1079 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 9.694 second 
response time [18:59:26] PROBLEM - Apache HTTP on mw1041 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:59:26] PROBLEM - Apache HTTP on mw1110 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:00:06] RECOVERY - Apache HTTP on mw1101 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 6.541 second response time [19:00:16] PROBLEM - Apache HTTP on mw1104 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:00:25] RECOVERY - Apache HTTP on mw1074 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 6.475 second response time [19:00:26] RECOVERY - Apache HTTP on mw1021 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.701 second response time [19:00:26] PROBLEM - Apache HTTP on mw1028 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:00:35] !log pt-kill on s7 to kill LogPager queries [19:00:52] Logged the message, Master [19:01:05] RECOVERY - Apache HTTP on mw1023 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.757 second response time [19:01:06] RECOVERY - Apache HTTP on mw1034 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 9.893 second response time [19:01:13] not sure [19:01:16] RECOVERY - Apache HTTP on mw1041 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.057 second response time [19:01:25] RECOVERY - Apache HTTP on mw1062 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 9.486 second response time [19:01:29] seems to be still happening though [19:01:49] robla: on it [19:01:50] (oops...accidentally scrolled back and responding way bacK) [19:01:54] ah :) [19:02:06] RECOVERY - Apache HTTP on mw1100 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.489 second response time [19:02:15] RECOVERY - Apache HTTP on mw1030 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 2.581 second response time [19:02:16] RECOVERY - Apache HTTP on mw1089 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.347 second 
response time [19:02:26] RECOVERY - Apache HTTP on mw1110 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 5.207 second response time [19:02:26] RECOVERY - Apache HTTP on mw1091 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 5.616 second response time [19:02:26] RECOVERY - Apache HTTP on mw1188 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 5.847 second response time [19:02:46] PROBLEM - Apache HTTP on mw1219 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:03:26] PROBLEM - Apache HTTP on mw1074 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:04:26] PROBLEM - Apache HTTP on mw1081 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:04:56] PROBLEM - Apache HTTP on mw1022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:05:06] PROBLEM - Apache HTTP on mw1100 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:05:06] PROBLEM - Apache HTTP on mw1101 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:05:06] PROBLEM - Apache HTTP on mw1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:05:12] ugh, again [19:05:16] PROBLEM - Apache HTTP on mw1217 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:05:16] PROBLEM - Apache HTTP on mw1089 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:05:16] PROBLEM - Apache HTTP on mw1059 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:05:16] PROBLEM - Apache HTTP on mw1025 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:05:36] PROBLEM - Apache HTTP on mw1109 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:05:37] PROBLEM - Apache HTTP on mw1030 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:05:37] PROBLEM - Apache HTTP on mw1046 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:05:37] PROBLEM - Apache HTTP on mw1182 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:05:37] PROBLEM - Apache HTTP on mw1172 is CRITICAL: CRITICAL - Socket 
timeout after 10 seconds [19:05:37] PROBLEM - Apache HTTP on mw1180 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:05:37] PROBLEM - Apache HTTP on mw1176 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:05:38] PROBLEM - Apache HTTP on mw1065 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:05:38] PROBLEM - Apache HTTP on mw1037 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:05:39] PROBLEM - Apache HTTP on mw1107 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:05:39] PROBLEM - Apache HTTP on mw1170 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:05:40] PROBLEM - Apache HTTP on mw1183 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:05:40] PROBLEM - Apache HTTP on mw1188 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:05:41] PROBLEM - Apache HTTP on mw1214 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:05:41] PROBLEM - Apache HTTP on mw1166 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:05:42] PROBLEM - Apache HTTP on mw1091 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:05:42] PROBLEM - Apache HTTP on mw1073 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:05:43] PROBLEM - Apache HTTP on mw1079 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:05:43] PROBLEM - Apache HTTP on mw1021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:05:44] PROBLEM - Apache HTTP on mw1110 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:05:44] PROBLEM - Apache HTTP on mw1041 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:05:45] PROBLEM - Apache HTTP on mw1062 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:05:45] PROBLEM - Apache HTTP on mw1093 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:05:46] PROBLEM - Apache HTTP on mw1071 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:05:47] RECOVERY - Apache HTTP on mw1107 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 
808 bytes in 7.645 second response time [19:05:47] PROBLEM - Apache HTTP on mw1039 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:05:47] RECOVERY - Apache HTTP on mw1170 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 2.350 second response time [19:05:56] RECOVERY - Apache HTTP on mw1079 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 8.412 second response time [19:05:57] RECOVERY - Apache HTTP on mw1062 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 9.607 second response time [19:05:57] PROBLEM - Apache HTTP on mw1184 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:05:57] RECOVERY - Apache HTTP on mw1172 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.032 second response time [19:05:57] PROBLEM - Apache HTTP on mw1090 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:05:57] RECOVERY - Apache HTTP on mw1180 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 5.173 second response time [19:06:07] RECOVERY - Apache HTTP on mw1101 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 2.972 second response time [19:06:07] RECOVERY - Apache HTTP on mw1041 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 8.117 second response time [19:06:07] RECOVERY - Apache HTTP on mw1084 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 9.538 second response time [19:06:07] PROBLEM - Apache HTTP on mw1035 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:06:07] PROBLEM - Apache HTTP on mw1026 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:06:07] PROBLEM - Apache HTTP on mw1034 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:06:08] PROBLEM - Apache HTTP on mw1099 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:06:08] RECOVERY - Apache HTTP on mw1214 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.982 second response time [19:06:09] RECOVERY - Apache HTTP on mw1064 is OK: HTTP OK: HTTP/1.1 
301 Moved Permanently - 808 bytes in 5.169 second response time [19:06:16] RECOVERY - Apache HTTP on mw1025 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.206 second response time [19:06:17] RECOVERY - Apache HTTP on mw1080 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.487 second response time [19:06:17] RECOVERY - Apache HTTP on mw1059 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.661 second response time [19:06:17] RECOVERY - Apache HTTP on mw1091 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.072 second response time [19:06:17] PROBLEM - Apache HTTP on mw1171 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:06:17] RECOVERY - Apache HTTP on mw1182 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 8.340 second response time [19:06:17] RECOVERY - Apache HTTP on mw1073 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 8.600 second response time [19:06:18] RECOVERY - Apache HTTP on mw1037 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 1.708 second response time [19:06:18] PROBLEM - Apache HTTP on mw1054 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:06:19] RECOVERY - Apache HTTP on mw1176 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.056 second response time [19:06:19] RECOVERY - Apache HTTP on mw1183 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.047 second response time [19:06:20] RECOVERY - Apache HTTP on mw1065 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 2.821 second response time [19:06:20] RECOVERY - Apache HTTP on mw1074 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.317 second response time [19:06:26] RECOVERY - Apache HTTP on mw1178 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 9.413 second response time [19:06:27] RECOVERY - Apache HTTP on mw1039 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 3.220 second response time 
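[editor's note: on the elasticsearch pool-queue errors mentioned at 18:55:41 — one way to confirm a reindex is causing rejections is to watch the per-pool rejected counters. A sketch using the cat thread-pool API of later Elasticsearch releases (the 0.90-era equivalent was the nodes-stats endpoint); the hostname is a placeholder:]

```
# Show active/queued/rejected counts per thread pool; a climbing
# "rejected" column on the search or bulk pools while an in-place
# reindex runs matches the symptom described above.
curl -s 'http://elastic1001:9200/_cat/thread_pool?v&h=host,name,active,queue,rejected'
```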
[19:06:27] RECOVERY - Apache HTTP on mw1046 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.932 second response time [19:06:27] RECOVERY - Apache HTTP on mw1166 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.617 second response time [19:06:27] PROBLEM - Apache HTTP on mw1066 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:06:48] RECOVERY - Apache HTTP on mw1043 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 5.123 second response time [19:06:57] RECOVERY - Apache HTTP on mw1184 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.747 second response time [19:06:57] RECOVERY - Apache HTTP on mw1211 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.220 second response time [19:06:57] RECOVERY - Apache HTTP on mw1103 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.050 second response time [19:07:06] RECOVERY - Apache HTTP on mw1042 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 3.753 second response time [19:07:07] (03CR) 10Andrew Bogott: [C: 032] download: convert into a module and clean up [operations/puppet] - 10https://gerrit.wikimedia.org/r/95460 (owner: 10Matanya) [19:08:07] RECOVERY - Apache HTTP on mw1026 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 9.888 second response time [19:08:27] RECOVERY - Apache HTTP on mw1188 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 8.123 second response time [19:08:27] RECOVERY - Apache HTTP on mw1067 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 5.668 second response time [19:08:27] RECOVERY - Apache HTTP on mw1174 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 9.430 second response time [19:08:27] RECOVERY - Apache HTTP on mw1031 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 8.839 second response time [19:08:27] RECOVERY - Apache HTTP on mw1066 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 8.970 second response time 
[19:08:47] RECOVERY - Apache HTTP on mw1109 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.057 second response time [19:08:57] RECOVERY - Apache HTTP on mw1090 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 5.012 second response time [19:08:57] RECOVERY - Apache HTTP on mw1022 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 5.844 second response time [19:08:57] PROBLEM - Apache HTTP on mw1079 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:08:57] RECOVERY - Apache HTTP on mw1035 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.065 second response time [19:09:06] RECOVERY - Apache HTTP on mw1100 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.731 second response time [19:09:07] RECOVERY - Apache HTTP on mw1023 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 6.129 second response time [19:09:07] RECOVERY - Apache HTTP on mw1216 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.046 second response time [19:09:07] RECOVERY - Apache HTTP on mw1217 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.382 second response time [19:09:07] RECOVERY - Apache HTTP on mw1111 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.042 second response time [19:09:07] RECOVERY - Apache HTTP on mw1054 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.057 second response time [19:09:07] RECOVERY - Apache HTTP on mw1089 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.897 second response time [19:09:08] RECOVERY - Apache HTTP on mw1171 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 2.163 second response time [19:09:08] RECOVERY - Apache HTTP on mw1104 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 1.710 second response time [19:09:17] RECOVERY - Apache HTTP on mw1044 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 9.052 second response time [19:09:17] PROBLEM - Apache HTTP on 
mw1080 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:09:27] RECOVERY - Apache HTTP on mw1081 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 5.213 second response time [19:09:47] RECOVERY - Apache HTTP on mw1096 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 2.069 second response time [19:09:47] RECOVERY - Apache HTTP on mw1079 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 2.898 second response time [19:09:56] RECOVERY - Apache HTTP on mw1027 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 6.426 second response time [19:09:57] RECOVERY - Apache HTTP on mw1161 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 5.840 second response time [19:09:57] RECOVERY - Apache HTTP on mw1034 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 1.563 second response time [19:10:07] RECOVERY - Apache HTTP on mw1028 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.603 second response time [19:10:07] RECOVERY - Apache HTTP on mw1099 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 8.863 second response time [19:10:18] RECOVERY - Apache HTTP on mw1080 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 9.944 second response time [19:10:18] RECOVERY - Apache HTTP on mw1218 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.960 second response time [19:10:26] RECOVERY - Apache HTTP on mw1021 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.723 second response time [19:10:46] RECOVERY - Apache HTTP on mw1219 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 5.842 second response time [19:10:56] RECOVERY - Apache HTTP on mw1071 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.115 second response time [19:11:07] RECOVERY - Apache HTTP on mw1063 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 5.706 second response time [19:12:17] PROBLEM - Apache HTTP on mw1064 is CRITICAL: CRITICAL - Socket 
timeout after 10 seconds [19:12:17] PROBLEM - Apache HTTP on mw1044 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:13:17] RECOVERY - Apache HTTP on mw1064 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 5.449 second response time [19:13:17] PROBLEM - Apache HTTP on mw1080 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:13:23] greg-g: Hi. [19:13:27] PROBLEM - Apache HTTP on mw1081 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:13:33] greg-g: Is MassMessage still scheduled for about now? [19:13:47] Marybelle: theoretically yes, but we're having unrelated site outage issues [19:13:57] RECOVERY - Apache HTTP on mw1113 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.575 second response time [19:14:00] RECOVERY - Apache HTTP on mw1020 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 3.208 second response time [19:14:00] Lame. [19:14:04] yeah [19:14:17] RECOVERY - Apache HTTP on mw1080 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 9.697 second response time [19:14:18] RECOVERY - Apache HTTP on mw1081 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.573 second response time [19:14:57] RECOVERY - Apache HTTP on mw1051 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 6.664 second response time [19:15:07] RECOVERY - Apache HTTP on mw1072 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 8.527 second response time [19:15:17] RECOVERY - Apache HTTP on mw1164 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 9.762 second response time [19:16:26] RECOVERY - Apache HTTP on mw1093 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 8.620 second response time [19:16:45] (03PS1) 10Andrew Bogott: Replace system_role with system::role in a recent patch. 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/95469 [19:16:57] RECOVERY - Apache HTTP on mw1058 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.170 second response time [19:17:06] RECOVERY - Apache HTTP on mw1033 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 8.805 second response time [19:17:07] RECOVERY - Apache HTTP on mw1110 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.054 second response time [19:17:17] RECOVERY - Apache HTTP on mw1030 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.075 second response time [19:17:17] RECOVERY - Apache HTTP on mw1044 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 5.469 second response time [19:17:17] RECOVERY - MySQL Recent Restart on db1028 is OK: OK 20812542 seconds since restart [19:17:17] RECOVERY - MySQL Slave Delay on db1028 is OK: OK replication delay 0 seconds [19:17:17] RECOVERY - MySQL Idle Transactions on db1028 is OK: OK longest blocking idle transaction sleeps for 0 seconds [19:17:19] RECOVERY - Apache HTTP on mw1048 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.093 second response time [19:17:19] RECOVERY - Apache HTTP on mw1076 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 2.023 second response time [19:17:26] RECOVERY - Apache HTTP on mw1056 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.088 second response time [19:17:47] RECOVERY - MySQL Slave Running on db1028 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: [19:17:57] RECOVERY - Apache HTTP on mw1087 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.120 second response time [19:17:58] RECOVERY - MySQL Replication Heartbeat on db1028 is OK: OK replication delay 3 seconds [19:18:07] RECOVERY - MySQL InnoDB on db1028 is OK: OK longest blocking idle transaction sleeps for 0 seconds [19:18:17] RECOVERY - MySQL Processlist on db1028 is OK: OK 0 unauthenticated, 0 locked, 4 copy to 
table, 20 statistics [19:18:35] !log logging for record: outage started at approx 13:50 GMT, seems to have ended well, now-ish [19:18:51] Logged the message, RobH [19:18:56] yea, there will be far more informative followup later. [19:19:11] !log may have spoken too soon about ending, we shall see [19:19:28] Logged the message, RobH [19:22:29] RobH, I think you mean that the outage started at about 1850 UTC? unless it's been going on for hours before anyone noticed? [19:22:33] wow, what happened? [19:22:57] PROBLEM - SSH on amslvs1 is CRITICAL: Server answer: [19:22:57] aude: still under investigation [19:23:04] ok [19:23:57] RECOVERY - SSH on amslvs1 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [19:24:25] Risker: bleh.... i did [19:24:42] !log 18:50 not 13:30 durrrrr [19:24:56] man i even typoed my log correction [19:24:58] Logged the message, RobH [19:25:00] but its good enough [19:25:22] RobH: :) [19:25:33] well, we couldn't have the log showing the team as being unresponsive for 5 hours now, could we? :) [19:25:58] incorrect logs are bad, period [19:26:52] heh, yea cuz the second this one started folks were on it [19:27:13] its awfully nice when outages occur in the overlap of timezones for the majority of our operations team. [19:27:15] even woke up the (new) australian! 
[19:27:26] if it had been an hour earlier waking sean up would have been mean [19:27:34] but not unreasonable [19:27:46] shouldda done it sooner, I wasn't thinking [19:28:29] at least it's not during european morning hours [19:28:30] brb, don't let anyone deploy anything [19:28:36] PROBLEM - Puppet freshness on sq44 is CRITICAL: No successful Puppet run in the last 10 hours [19:28:37] it's usually dead on irc [19:29:24] (03PS1) 10Ottomata: Writing JSON statistics to log file rather than syslog or stderr [operations/software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/95473 [19:29:40] i'm gonna deploy something [19:29:55] missed all the excitement; my phone rebooted itself into the 'please enter your pin' phase without telling me, so no messages [19:30:58] (03CR) 10Ottomata: "Magnus! Be brutal! :) I don't mind totally scrapping this commit if it is way off." [operations/software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/95473 (owner: 10Ottomata) [19:33:08] !log Changed IP addresses of upload-lb.eqiad.wikimedia.org [19:33:20] (03PS2) 10Ottomata: Writing JSON statistics to log file rather than syslog or stderr [operations/software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/95473 [19:33:25] Logged the message, Master [19:34:52] !log reedy Started syncing Wikimedia installation... : testwiki to 1.23wmf4 and build l10n cache [19:35:08] Logged the message, Master [19:36:49] :( [19:40:37] (03CR) 10Andrew Bogott: [C: 032] Replace system_role with system::role in a recent patch. 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/95469 (owner: 10Andrew Bogott) [19:41:54] (03PS3) 10Ottomata: Writing JSON statistics to log file rather than syslog or stderr [operations/software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/95473 [19:43:39] (03PS1) 10Andrew Bogott: s/generic::gluster-client/gluster::client/ [operations/puppet] - 10https://gerrit.wikimedia.org/r/95476 [19:45:07] (03PS2) 10Andrew Bogott: s/generic::gluster-client/gluster::client/ [operations/puppet] - 10https://gerrit.wikimedia.org/r/95476 [19:45:35] How to run bot via tools wmf labs ? [19:46:14] Kolega2357: #wikimedia-labs [19:47:04] (03CR) 10Andrew Bogott: [C: 032] s/generic::gluster-client/gluster::client/ [operations/puppet] - 10https://gerrit.wikimedia.org/r/95476 (owner: 10Andrew Bogott) [19:51:07] PROBLEM - Apache HTTP on mw1152 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:51:16] PROBLEM - Apache HTTP on mw1149 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:51:24] whoops... [19:51:27] PROBLEM - Apache HTTP on mw1150 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:51:32] gah,..... [19:51:35] not again [19:52:17] RECOVERY - Apache HTTP on mw1150 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 1.095 second response time [19:52:24] (03PS1) 10Andrew Bogott: Fixed a bunch of file resource paths. 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/95478 [19:52:24] * greg-g breathes [19:52:44] (03PS2) 10Reedy: Simplify wmfBlockJokerEmails [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95283 [19:53:16] RECOVERY - Apache HTTP on mw1149 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 8.229 second response time [19:53:26] PROBLEM - Apache HTTP on mw1151 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:53:49] gah [19:54:52] (03CR) 10jenkins-bot: [V: 04-1] Simplify wmfBlockJokerEmails [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95283 (owner: 10Reedy) [19:54:57] RECOVERY - Apache HTTP on mw1152 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.534 second response time [19:55:48] (03CR) 10Andrew Bogott: [C: 032] Fixed a bunch of file resource paths. [operations/puppet] - 10https://gerrit.wikimedia.org/r/95478 (owner: 10Andrew Bogott) [19:55:53] !sal [19:55:53] https://wikitech.wikimedia.org/wiki/Labs_Server_Admin_Log [19:56:11] labs? [19:56:16] PROBLEM - Apache HTTP on mw1149 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:56:17] ...apparently [19:56:43] what? 
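[Editor's note] Gerrit change 95473 above moves varnishkafka's JSON statistics from syslog/stderr into a plain log file. A minimal, hypothetical consumer for such newline-delimited JSON stats — the file path, field names, and values here are illustrative assumptions, not varnishkafka's actual schema:

```shell
# Assumed layout: one JSON object per line appended to a stats log file.
STATS=/tmp/varnishkafka.stats.json   # hypothetical path

# Fabricate two stat lines so the sketch is self-contained:
cat > "$STATS" <<'EOF'
{"time":1384460000,"kafka":{"outq_cnt":0},"seen":120}
{"time":1384460060,"kafka":{"outq_cnt":3},"seen":260}
EOF

# Pull one counter out of each line (python3 used only for JSON parsing):
python3 - "$STATS" <<'PY'
import json
import sys

for line in open(sys.argv[1]):
    rec = json.loads(line)
    # print timestamp and the (assumed) Kafka output-queue depth
    print(rec["time"], rec["kafka"]["outq_cnt"])
PY
```

The appeal of a plain file over syslog is exactly this: any line-oriented tool can tail and parse it without routing through the syslog daemon.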
[19:56:54] dumb [19:56:55] (03PS1) 10Reedy: Change numerous global functions for anonymous ones [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95479 [19:57:30] (03CR) 10jenkins-bot: [V: 04-1] Change numerous global functions for anonymous ones [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95479 (owner: 10Reedy) [19:57:42] matanya: Note the path changes in https://gerrit.wikimedia.org/r/#/c/95478/ [19:57:53] !sal del [19:57:53] Successfully removed sal [19:58:07] !sal is https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:58:07] Key was added [19:58:08] PROBLEM - Apache HTTP on mw1152 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:58:15] !lsal is https://wikitech.wikimedia.org/wiki/Labs_Server_Admin_Log [19:58:15] Key was added [19:58:27] !lsal del [19:58:27] Successfully removed lsal [19:58:31] (03PS3) 10Reedy: Simplify wmfBlockJokerEmails [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95283 [19:58:51] !sal is production: https://wikitech.wikimedia.org/wiki/Server_Admin_Log labs: https://wikitech.wikimedia.org/wiki/Labs_Server_Admin_Log [19:58:52] This key already exist - remove it, if you want to change it [19:58:53] marktraceur: yay shared botbrain [19:58:59] !sal del [19:58:59] Successfully removed sal [19:59:01] !sal is production: https://wikitech.wikimedia.org/wiki/Server_Admin_Log labs: https://wikitech.wikimedia.org/wiki/Labs_Server_Admin_Log [19:59:02] Key was added [19:59:20] it's nice to have a shared bot brain, generally [19:59:25] yeah [19:59:29] @info [19:59:29] http://bots.wmflabs.org/~wm-bot/dump/%23wikimedia-operations.htm [19:59:49] (i wonder how to get wm-bot to admit where is it getting its botbrain from) [19:59:57] RECOVERY - Apache HTTP on mw1152 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.090 second response time [20:00:08] ah, that actually says "Linked to #wikimedia-labs" [20:00:17] @help [20:00:17] I am running http://meta.wikimedia.org/wiki/WM-Bot 
version wikimedia bot v. 1.20.2.0 my source code is licensed under GPL and located at https://github.com/benapetr/wikimedia-bot I will be very happy if you fix my bugs or implement new features [20:00:32] @commands [20:00:32] Commands: there is too many commands to display on one line, see http://meta.wikimedia.org/wiki/wm-bot for a list of commands and help [20:00:56] (03PS2) 10Reedy: Change numerous global functions for anonymous ones [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95479 [20:03:08] RECOVERY - Apache HTTP on mw1151 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.048 second response time [20:03:26] RECOVERY - Apache HTTP on mw1149 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 2.270 second response time [20:04:29] PROBLEM - Apache HTTP on mw1150 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:05:30] RECOVERY - Apache HTTP on mw1150 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 1.102 second response time [20:05:43] !log reedy Finished syncing Wikimedia installation... : testwiki to 1.23wmf4 and build l10n cache [20:05:47] * apergos eyes the apaches warilhttp://download.fedoraproject.org/pub/fedora/linux/releases/test/20-Beta/Fedora/x86_64/iso/Fedora-20-Beta-x86_64-DVD.iso [20:06:02] ok that was weird [20:06:06] Logged the message, Master [20:06:17] PROBLEM - Apache HTTP on mw1149 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:06:21] nice. 
stuck control key [20:06:57] and how it thought that was on the clipboard who knows [20:07:07] PROBLEM - Apache HTTP on mw1152 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:07:48] apergos: coulda been worse [20:07:50] :) [20:08:06] RECOVERY - Apache HTTP on mw1149 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.055 second response time [20:08:35] well that particular thing is from much earlier in the day, and I guarantee I have copy pasted a pile of commands and stuff since then [20:08:49] who knows [20:09:23] RECOVERY - Apache HTTP on mw1152 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 6.911 second response time [20:11:27] (03CR) 10Reedy: [C: 032] Everything else to 1.23wmf3 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95465 (owner: 10Reedy) [20:11:36] (03Merged) 10jenkins-bot: Everything else to 1.23wmf3 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95465 (owner: 10Reedy) [20:12:17] !log reedy updated /a/common to {{Gerrit|I3034b1e4e}}: Add/update symlinks [20:12:31] Logged the message, Master [20:15:36] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Wikipedias to 1.23wmf3 [20:15:39] Logged the message, Master [20:15:41] stfu apc [20:15:41] Coren or Ryan_Lane: I updated CirrusSearch on the cluster earlier today. Does wikitech get the update automatically or do you have to do it? Either way, there are some scripts to run to clean up from a bug/complete the update. [20:15:44] I have to do it [20:15:48] is this a new MW release? [20:15:48] or in the same branch? [20:15:50] manybubbles: It has to be done by hand. [20:15:51] I can just do a submodule update [20:15:52] Yeah, what Ryan said. [20:15:52] Coren: be careful when updating [20:16:01] we haven't been very good about how we're managing things :) [20:16:01] let me fix that, in fact. [20:16:30] Ryan_Lane or Coren: cool. You can update CirrusSearch to master like we did on the cluster. 
I can send you the commands to run to rebuild the search index. [20:16:30] andrewbogott: let's start being really vigilant about pushing our changes into the wmf branches [20:16:42] andrewbogott: otherwise people may do: git submodule update --init --recursive [20:16:48] and overwrite our changes unwittingly [20:16:49] actually, it is here: https://wikitech.wikimedia.org/wiki/Search/New#Full_reindex [20:16:53] (03CR) 10Reedy: [C: 032] Phase 1 wikis to 1.23wmf4 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95466 (owner: 10Reedy) [20:17:04] (03Merged) 10jenkins-bot: Phase 1 wikis to 1.23wmf4 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95466 (owner: 10Reedy) [20:17:41] manybubbles: should I be on wmf/1.23wmf4 now? [20:17:44] Ryan_Lane: You're talking about getting submodule reference patches into core? [20:17:49] or wmf3? [20:17:57] Ryan_Lane: the cluster is wmf2 and wmf3 [20:18:02] ok [20:18:11] both have my changes [20:18:21] andrewbogott: I'm about to push in my change and I'll show you what I mean :) [20:19:08] ok... [20:19:23] anyone around know how to connect to beta's general purpose machine? [20:19:30] basically we should never do anything manual on virt0 [20:19:32] I'm getting ssh denied [20:19:41] manybubbles: I do not, sorry [20:19:47] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Phase 1 wikis to 1.23wmf4 [20:19:48] Ryan_Lane: thanks [20:19:58] Ryan_Lane: I may be missing a big piece of background. Are extensions actually managed as submodules in the core repo? [20:20:03] Logged the message, Master [20:20:15] andrewbogott: yep, but in the wmf branches [20:20:18] which we use on virt0 [20:20:24] Ah, ok. [20:20:43] our extensions are in there too [20:20:51] So that means that puppetizing mw + extensions should be handled differently too… [20:20:59] hm maybe, depending on which extensions. 
:/ [20:21:20] well, I wouldn't necessarily puppetize MW on virt0 [20:21:22] just the dependencies [20:21:23] anyone looking at the apache logs? they seem to be acting up [20:21:35] I'd actually like to use git-deploy for wikitech [20:21:58] so that whenever people deploy to the cluster it also lands on wikitech [20:22:18] Sure, that'd be good. [20:22:28] i assume that currently we're running a somewhat out-of-date version of MW on virt0 [20:22:33] unless you update it by hand periodically [20:22:39] I updated yesterday [20:22:44] ok then :) [20:26:19] !log ori synchronized php-1.23wmf3/extensions/UniversalLanguageSelector 'Update UniversalLanguageSelector to b2f9e4211efc' [20:26:34] Logged the message, Master [20:29:42] (03CR) 10Andrew Bogott: [C: 04-2] "Most of these changes have already been merged as part of a different patch. Keeping this one in gerrit for reference, as we want to keep" [operations/puppet] - 10https://gerrit.wikimedia.org/r/94408 (owner: 10Dzahn) [20:30:52] manybubbles: there's a problem [20:31:00] Tampa_cluster...A database query error has occurred. [20:31:03] tampa cluster? [20:31:07] Ryan_Lane: ? [20:31:17] why in the world is there something wikimedia specific in there? ;) [20:31:23] Error: 1100 Table 'page_restrictions' was not locked with LOCK TABLES (virt0.wikimedia.org) [20:31:44] ick [20:31:58] i am getting a ton of js errors on wikidata [20:31:58] Ryan_Lane: I'm pretty sure I'm not involved [20:32:07] whoops [20:32:10] wrong person [20:32:10] not convinced 100% it's related to uls but might be [20:32:11] or caching [20:32:15] wait, no [20:32:17] right person [20:32:24] heh [20:32:26] nice [20:32:26] manybubbles: this is from running updateSearchIndex.php [20:32:31] aude: like what? 
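[Editor's note] Ryan_Lane's warning a few lines up — that a routine `git submodule update --init --recursive` can overwrite hand-made changes — is easy to reproduce in a sandbox. A self-contained sketch (all repository names and paths are illustrative): the update moves the submodule checkout back to the commit pinned in the superproject, leaving any locally made submodule commit behind on a branch nobody is looking at.

```shell
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.org
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.org
tmp=$(mktemp -d)

# A toy upstream repo that will become the submodule:
git init -q "$tmp/sub"
git -C "$tmp/sub" commit -q --allow-empty -m 'initial'

# A superproject pinning that submodule at 'initial':
git init -q "$tmp/super"
cd "$tmp/super"
git -c protocol.file.allow=always submodule --quiet add "$tmp/sub" sub
git commit -qm 'pin submodule'

# Someone commits a fix directly inside the submodule checkout
# (like a manual tweak on a production host)...
git -C sub commit -q --allow-empty -m 'local fix'
before=$(git -C sub rev-parse HEAD)

# ...and the next routine update silently moves the checkout off it:
git submodule --quiet update
after=$(git -C sub rev-parse HEAD)
if [ "$before" != "$after" ]; then
    echo 'local submodule commit was discarded'
fi
```

This is why the advice in the log is to push local changes into the wmf branches first: once the superproject records the new commit, `submodule update` pulls it in instead of reverting it.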
[20:32:48] (03PS1) 10MarkTraceur: Enable VectorBeta on group0 wikis [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95486 [20:32:54] one thing is broken gadget [20:32:57] language seelct [20:33:06] PROBLEM - LVS HTTPS IPv4 on bits-lb.eqiad.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 7.020 second response time [20:33:10] Not testwikidatawiki? [20:33:11] https://commons.wikimedia.org/w/index.php?title=MediaWiki:Gadget-LanguageSelect.js&action=raw&ctype=text/javascript [20:33:11] Ryan_Lane: hmm - can you post the log somewhere? I don't reference that table directly. [20:33:14] wikidatawiki [20:33:22] oh [20:33:23] crap [20:33:24] ori-l: I think you just broke bits [20:33:26] PROBLEM - LVS HTTPS IPv6 on bits-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:33:31] sorry, I called the wrong maintenance script [20:33:48] :( [20:33:57] wait, no I didn't. :D [20:34:03] Ryan_Lane: ? is it because I have mwscript thing in there? [20:34:04] ugh [20:34:07] how? [20:34:11] PROBLEM - LVS HTTPS IPv4 on mediawiki-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:34:13] manybubbles: I don't use mwscript [20:34:17] the ULS change? [20:34:27] I've lost css and such on enwiki [20:34:28] i think it's something else maybe [20:34:30] or combo [20:34:30] oh, bakc [20:34:31] of stuff [20:34:38] manybubbles: I can't just copy/paste, so this is more difficult for me [20:34:41] or not? [20:34:50] one sec. I'll make them specific to MW [20:34:52] Ryan_Lane: I can send it to you without the mwscript in the wya [20:34:55] and not our cluster [20:34:55] localBasePath = new RegExp('^' + mw.util.wikiGetlink( mw.config.get( 'wgFormattedNamespaces' )['6'] + ':' ).replace(/[\-\[\]\/\{\}\(\)\*\+\?\.\\\^\$\|]/g, "\\$&")), [20:34:58] horrible js [20:35:01] in a gadget [20:35:08] but using wikiGetlink which i think was changed? 
[20:35:15] aude: b/c alias was kept [20:35:23] oh [20:35:31] gadget-imagelinks.js [20:35:32] bits is indeed having major issues [20:35:32] that one [20:35:37] manybubbles: it's working now [20:35:39] sorry :) [20:35:46] Ryan_Lane: sorry for making it cluster specific [20:35:50] no worries [20:35:57] RECOVERY - LVS HTTPS IPv4 on bits-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 3839 bytes in 4.149 second response time [20:35:59] you may want to update the docs on CirrusSearch [20:36:06] RECOVERY - LVS HTTPS IPv4 on mediawiki-lb.pmtpa.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 65061 bytes in 2.323 second response time [20:36:07] PROBLEM - LVS HTTPS IPv6 on wikinews-lb.pmtpa.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:36:14] gah [20:36:16] tampa? seriously? [20:36:17] PROBLEM - LVS HTTPS IPv6 on bits-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 0.010 second response time [20:36:26] PROBLEM - Varnish HTTP bits on cp3020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:36:43] paravoid: they are still being used for cache, right? [20:36:50] bits is taking forever now [20:36:57] PROBLEM - Varnish HTTP bits on cp3019 is CRITICAL: HTTP CRITICAL - No data received from host [20:36:57] PROBLEM - LVS HTTPS IPv4 on upload-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:36:57] probably will time out [20:37:10] esams bits uses tampa bits as a possible backend [20:37:12] Uncaught ReferenceError: $ is not defined [20:37:17] like jquery not found? 
[20:37:24] Ouch [20:37:27] PROBLEM - LVS HTTPS IPv4 on wikiversity-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:37:27] PROBLEM - LVS HTTPS IPv6 on wiktionary-lb.pmtpa.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:37:27] PROBLEM - LVS HTTPS IPv4 on foundation-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:37:27] PROBLEM - LVS HTTPS IPv4 on wikiquote-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:37:27] PROBLEM - LVS HTTPS IPv4 on wikisource-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:37:27] PROBLEM - LVS HTTPS IPv4 on wikidata-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:37:30] eqiad bits cache traffic has plumetted [20:37:38] Yeah if bits is having issues then all sorts of weird JS errors will happen [20:37:41] Ryan_Lane: the README has lots of stuff about this [20:37:44] OK, i'm going to revert ULS, just in case. 
[20:37:57] PROBLEM - LVS HTTPS IPv6 on wikipedia-lb.pmtpa.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:38:01] PROBLEM - LVS HTTP IPv4 on bits-lb.eqiad.wikimedia.org is CRITICAL: Connection timed out [20:38:11] that is logged out [20:38:17] PROBLEM - Varnish HTTP bits on cp1069 is CRITICAL: Connection timed out [20:38:26] RECOVERY - LVS HTTPS IPv4 on wikidata-lb.pmtpa.wikimedia.org is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 1021 bytes in 5.690 second response time [20:38:31] RECOVERY - LVS HTTPS IPv4 on wikiquote-lb.pmtpa.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 65063 bytes in 6.696 second response time [20:38:31] RECOVERY - LVS HTTPS IPv4 on foundation-lb.pmtpa.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 65063 bytes in 6.861 second response time [20:38:31] RECOVERY - LVS HTTPS IPv4 on wikiversity-lb.pmtpa.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 65063 bytes in 6.874 second response time [20:38:37] so, bits eqiad + all LVS pmtpa? [20:38:40] ori-l: did you just revert it? [20:38:40] how does that make any sense? [20:38:46] PROBLEM - LVS HTTPS IPv6 on wikiquote-lb.pmtpa.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:38:48] no [20:38:52] huh [20:39:03] bits caches are maxed out on sessions [20:39:09] oh [20:39:17] RECOVERY - Varnish HTTP bits on cp1069 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 0.003 second response time [20:39:17] RECOVERY - LVS HTTPS IPv4 on wikisource-lb.pmtpa.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 65063 bytes in 0.299 second response time [20:39:18] RECOVERY - LVS HTTPS IPv6 on wiktionary-lb.pmtpa.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 65063 bytes in 0.305 second response time [20:39:23] everyone hitting it, eveyringthing expiring? 
[20:39:25] everything* [20:39:26] paravoid: note, just https [20:39:26] RECOVERY - LVS HTTPS IPv6 on bits-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 3850 bytes in 8.309 second response time [20:39:35] so nginx is unhappy, which is shared across all clusters [20:39:37] RECOVERY - LVS HTTPS IPv6 on wikiquote-lb.pmtpa.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 65061 bytes in 0.295 second response time [20:39:39] ugh [20:39:46] RECOVERY - LVS HTTPS IPv6 on wikipedia-lb.pmtpa.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 65063 bytes in 0.293 second response time [20:39:50] RECOVERY - LVS HTTPS IPv4 on upload-lb.pmtpa.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 654 bytes in 0.188 second response time [20:39:52] mark: should I proceed with reverting the ULS? is it still plausibly the culprit? [20:39:55] yeah, that makes sense [20:39:57] RECOVERY - Varnish HTTP bits on cp3019 is OK: HTTP OK: HTTP/1.1 200 OK - 188 bytes in 0.205 second response time [20:39:57] RECOVERY - LVS HTTPS IPv6 on wikinews-lb.pmtpa.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 65063 bytes in 0.303 second response time [20:39:59] I hadn't seen that, duh [20:40:04] ori-l: yes [20:40:05] which is probably ipv6 [20:40:08] i don't know that it is, but it seems likely [20:40:09] i am not convinced, [20:40:16] and yeah, it's still ipv6 [20:40:17] RECOVERY - Varnish HTTP bits on cp3020 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 0.192 second response time [20:40:17] PROBLEM - LVS HTTP IPv4 on bits-lb.pmtpa.wikimedia.org is CRITICAL: Connection timed out [20:40:21] not convinced about uls, but good as precaution [20:40:26] PROBLEM - LVS HTTPS IPv4 on bits-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:40:31] top requests on bits are ULS right now [20:40:37] PROBLEM - LVS HTTP IPv6 on bits-lb.pmtpa.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL - No data received from host [20:40:39] * aude nods [20:40:40] PROBLEM - Varnish 
HTTP bits on sq67 is CRITICAL: Connection timed out [20:40:52] What is going on? [20:41:02] mark: going to switch IPv6 over to varnish before leaving for vacation? :) [20:41:07] RECOVERY - LVS HTTP IPv4 on bits-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 3833 bytes in 6.885 second response time [20:41:14] Bsadowski1: being diagnosed [20:41:16] PROBLEM - LVS HTTPS IPv6 on bits-lb.pmtpa.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:41:17] Now it's HTTP, mark [20:41:26] it's HTTP on bits [20:41:27] PROBLEM - Varnish HTTP bits on cp1057 is CRITICAL: Connection timed out [20:41:27] PROBLEM - Varnish HTTP bits on sq68 is CRITICAL: Connection timed out [20:41:33] Ah, okay. [20:41:33] and HTTPS on everything [20:41:41] because HTTPS is a shared resource [20:41:58] we could actually split that apart in the future [20:41:59] nostalgia skin [20:42:02] and likely should [20:42:07] PROBLEM - LVS HTTPS IPv4 on bits-lb.eqiad.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 7.028 second response time [20:42:17] Ryan_Lane: log that or something :) [20:42:17] HTTPS + varnish frontend per cluster [20:42:18] !log ori synchronized php-1.23wmf3/extensions/UniversalLanguageSelector 'Revert "Update UniversalLanguageSelector to b2f9e4211efc"' [20:42:27] RECOVERY - LVS HTTPS IPv6 on bits-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 3901 bytes in 0.426 second response time [20:42:27] greg-g: I'll be on contract for that [20:42:31] RECOVERY - LVS HTTPS IPv4 on bits-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 3899 bytes in 0.529 second response time [20:42:32] Ryan_Lane: cool [20:42:33] Logged the message, Master [20:42:34] Is Wikipedia being REALLY slow for anyone else? 
[20:42:36] PROBLEM - Varnish HTTP bits on sq70 is CRITICAL: Connection timed out [20:42:40] yes [20:42:42] Sven_Manguard: known issue, being worked on [20:42:43] Sven_Manguard: notice icinga [20:42:45] :) [20:42:47] problem with css / js /etc [20:43:23] Reedy: you're all done with the mw deploys, right? [20:43:25] thx [20:43:27] PROBLEM - Varnish HTTP bits on sq69 is CRITICAL: Connection timed out [20:43:27] RECOVERY - Varnish HTTP bits on cp1057 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 7.408 second response time [20:43:36] PROBLEM - LVS HTTPS IPv4 on bits-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:44:07] PROBLEM - LVS HTTP IPv4 on bits-lb.eqiad.wikimedia.org is CRITICAL: Connection timed out [20:44:08] experimentally raised the session limit to 400k on one eqiad bits box [20:44:28] it seems happier now [20:44:31] let me do the others temporarily [20:44:36] PROBLEM - LVS HTTPS IPv6 on bits-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 7.028 second response time [20:44:56] RECOVERY - Varnish HTTP bits on sq70 is OK: HTTP OK: HTTP/1.1 200 OK - 187 bytes in 7.924 second response time [20:44:57] RECOVERY - LVS HTTPS IPv4 on bits-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 3839 bytes in 0.051 second response time [20:45:22] !log ori synchronized php-1.23wmf3/resources/startup.js 'touch' [20:45:37] RECOVERY - Varnish HTTP bits on sq67 is OK: HTTP OK: HTTP/1.1 200 OK - 187 bytes in 1.418 second response time [20:45:38] RECOVERY - LVS HTTP IPv6 on bits-lb.pmtpa.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 3830 bytes in 0.112 second response time [20:45:38] Logged the message, Master [20:45:57] RECOVERY - LVS HTTP IPv4 on bits-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 3844 bytes in 0.001 second response time [20:46:01] it stabilized around 250k now [20:46:27] RECOVERY - LVS HTTPS IPv6 on bits-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 
200 OK - 3850 bytes in 0.010 second response time [20:46:36] RECOVERY - LVS HTTPS IPv4 on bits-lb.pmtpa.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 3846 bytes in 8.116 second response time [20:47:02] seems to be coming back [20:47:04] so, uls then? [20:47:06] RECOVERY - LVS HTTPS IPv6 on bits-lb.pmtpa.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 3836 bytes in 0.907 second response time [20:47:21] no [20:47:27] RECOVERY - Varnish HTTP bits on sq69 is OK: HTTP OK: HTTP/1.1 200 OK - 186 bytes in 0.072 second response time [20:47:27] i raised the session limit [20:47:27] RECOVERY - Varnish HTTP bits on sq68 is OK: HTTP OK: HTTP/1.1 200 OK - 186 bytes in 0.071 second response time [20:47:27] RECOVERY - LVS HTTP IPv4 on bits-lb.pmtpa.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 3840 bytes in 0.072 second response time [20:47:29] it's still quite slow [20:47:33] if i hadn't done that, it would still be dropping connections [20:47:48] at lease I got my watchlist [20:47:57] including the bits [20:48:14] what trigger that though [20:48:22] what he said [20:48:29] !log Raised session_max on eqiad bits caches from 200k to 400k [20:48:40] it might have been just a new URL because of ULS [20:48:43] what's the technical term for when a page loads, but it looks like basic HTML instead of having all of the pretty formatting? Because that's what's happening and I want to be able to say that correctly [20:48:44] Logged the message, Master [20:48:53] so clients fetching that again as it wasn't on their browser caches [20:48:58] Sven_Manguard: the css and javascript aren't loading [20:49:11] Sven_Manguard: nostalgia skin :) [20:49:16] Oh. That's what it is. Cool [20:49:18] Sven_Manguard: fubar [20:49:25] snafu, rather. 
[20:49:33] !log reedy synchronized php-1.23wmf4/includes/ [20:49:37] so if you ganglia for eqiad, it has a spike up to 740MB/s, then slowly going down [20:49:48] Logged the message, Master [20:51:14] ok, so I don't think it was ori-l's change per se [20:51:17] it was deploying bits [20:51:45] during what is high traffic hours probably [20:52:00] but we do that all the time [20:52:00] (03PS1) 10Dr0ptp4kt: Ensure that Googlebot-Mobile gets redirected to mobile. [operations/puppet] - 10https://gerrit.wikimedia.org/r/95532 [20:52:04] right [20:52:36] you said that last time too, ori [20:52:42] yet it broke then too [20:52:52] hehe yes [20:53:08] but impact should be lower due to ori-l's localstorage thing too, right? [20:53:15] or is this not deployed yet? [20:53:18] localStorage is off on all wikis except test / test2 / mw.o [20:53:22] ah [20:53:23] dammit [20:53:32] anyway, fwiw because it's not apparent [20:53:33] it wasn't really the last time; i synced JS module updates many times since, and so have others [20:53:46] Also, ULS probably has large-ish static assets (like fonts) [20:53:52] but the ULS change replaced the API called with a custom ResourceLoader module [20:53:54] probably? 
[20:53:56] :P [20:53:59] Which will have different ---- oh and that [20:54:00] That would help [20:54:03] my hunch is that it put unreasonable load on the apaches [20:54:05] "last time when it broke for similar reasons" [20:54:06] ori-l's deployment came after me asking about it because it was causing some other load issues [20:54:07] (03PS1) 10Manybubbles: Configure labs to have 2 search replicas [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95533 [20:54:13] Shifting what's probably quite a bit of data from one point to another [20:54:21] 132M including .git stuff [20:54:30] ori-l: no, apaches didn't seem to have any issues [20:54:38] no, it's too many sessions on the bits caches [20:54:46] simply too many clients downloading the 'new' assets it looks like [20:54:58] On the bits *caches* [20:55:03] yes [20:55:06] right, sorry [20:55:15] So that suggests that the only thing keeping that from happening during normal operation is client-side cache? [20:55:26] yes [20:55:28] Or are the requests cache misses that take a long time to fulfill? [20:55:33] it couldn't be caused by apaches being slow to respond? [20:55:39] I don't think so [20:55:41] and the sessions being kept open for longer? 
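[Editor's note] The session-limit bump logged earlier ("Raised session_max on eqiad bits caches from 200k to 400k") is a Varnish runtime-parameter change on the bits cache frontends. A hedged sketch of what that looks like — the parameter name is taken from the log line (Varnish 3 era); the actual puppet change is not shown in this log:

```
# Runtime, per host (assumes varnishadm can reach the default admin socket):
varnishadm param.set session_max 400000

# Persistent, as a varnishd start-up option (e.g. in /etc/default/varnish):
DAEMON_OPTS="... -p session_max=400000 ..."
```

When `session_max` is hit, Varnish stops accepting new client sessions, which is consistent with the symptom here: timeouts at the edge while the backends themselves look healthy.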
[20:55:50] Cause if there are too many simultaneous *hits* then that's scary [20:56:08] i haven't ruled that out yet but it doesn't look like it [20:56:17] Either something is dramatically increasing the number of requests that are being made, or our caching infrastructure doesn't have enough capacity to actually cache what it's supposed to be caching [20:56:59] RoanKattouw: that's not true, we're designing with client caches in mind [20:57:07] ULS is still on the same commit that I reverted in 1.23wmf4 [20:57:11] not for bits, in general I mean [20:57:23] To a certain extent that's fair [20:57:24] so if there is a problem, it should still be detectable on 1.23wmf4 wikis [20:57:37] which is wikidata [20:57:40] i don't see a problem [20:57:47] errr [20:57:50] test.wikidata [20:58:45] paravoid: Sure, but I think a situation in which the only difference between staying up and going down is mercy provided by client-side caches is a bit precarious [20:59:34] ori-l: paravoid the apaches showed a huge spikes in latency [21:00:00] http://gdash.wikimedia.org/dashboards/totalphp/deploys [21:02:30] https://bits.wikimedia.org/static-current/extensions/UniversalLanguageSelector/data/fontrepo/fonts/Autonym/Autonym.woff?version=20131112 [21:02:33] not loading [21:02:56] !log jenkins: migrated some phpcs job to use code sniffer style git-deployed on slaves in /srv/deployment/integration/mediawiki-tools-codesniffer/MediaWiki [21:03:12] Logged the message, Master [21:03:23] http://dpaste.com/1462691/ request headers (see "no-cache") [21:03:58] morebots: ping [21:03:58] I am a logbot running on tools-login. [21:03:58] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log. [21:03:58] To log a message, type !log . 
[21:03:59] comeon [21:04:39] (03PS1) 10Mark Bergsma: Double session limit for bits caches [operations/puppet] - 10https://gerrit.wikimedia.org/r/95534 [21:05:10] https://bits.wikimedia.org/static-current/extensions/UniversalLanguageSelector/data/fontrepo/fonts/Autonym/Autonym.ttf?version=20131112 [21:05:13] also [21:05:37] (03PS2) 10Mark Bergsma: Double session limit for bits caches [operations/puppet] - 10https://gerrit.wikimedia.org/r/95534 [21:05:40] ^^^ mark [21:05:41] that causes "English" at the top not to display [21:05:46] !log jenkins: migrated some phpcs job to use code sniffer style git-deployed on slaves in /srv/deployment/integration/mediawiki-tools-codesniffer/MediaWiki [21:05:47] !log vvvv {{bug|57064}} [21:05:48] it's trying to load the font [21:06:17] Logged the message, Master [21:08:29] heh: http://ganglia.wikimedia.org/latest/?r=2hr&cs=&ce=&m=cpu_report&s=by+name&c=SSL+cluster+esams&h=&host_regex=&max_graphs=0&tab=m&vn=&hide-hf=false&sh=1&z=small&hc=4 [21:08:32] that's not loading on at least the esams bits caches indeed [21:08:33] memory doubled [21:08:49] same: http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&s=by+name&c=SSL%2520cluster%2520pmtpa&tab=m&vn=&hide-hf=false [21:09:27] likely due to SSL cache [21:09:42] I think we're allowing 1m sessions in the ssl cache? 
[21:10:57] it's still half than in December :P [21:12:44] that was due to the number of nginx processes running [21:12:44] it was tuned for uploads, when uploads used nginx [21:12:44] I re-tuned it for ssl [21:12:44] cp3019 got: [21:12:44] 43615 RxStatus b 403 [21:12:44] 43615 RxResponse b Forbidden [21:12:44] 43615 RxHeader b Server: Apache [21:12:44] for url: /static-current/extensions/UniversalLanguageSelector/data/fontrepo/fonts/Autonym/Autonym.eot?version=20131112 [21:17:18] it's in wikimedia.conf, line 83 [21:17:18] so, the nice thing is that nginx handled this perfectly fine: http://ganglia.wikimedia.org/latest/?r=week&cs=&ce=&s=by+name&c=SSL%2520cluster%2520esams&tab=m&vn=&hide-hf=false [21:17:18] what is 'this', though? [21:17:18] confirmed, the bits caches are getting 403s for those urls [21:17:19] from nginx's perspective 'this' is a shitload of connections and waiting on backends [21:17:19] on cp1070: [21:17:19] 18942 TxURL b /static-current/extensions/UniversalLanguageSelector/data/fontrepo/fonts/Autonym/Autonym.ttf?version=20131112 [21:17:19] 18942 RxProtocol b HTTP/1.1 [21:17:19] 18942 RxStatus b 403 [21:17:19] 18942 RxResponse b Forbidden [21:17:19] 18942 RxHeader b Date: Thu, 14 Nov 2013 21:15:43 GMT [21:17:19] 18942 RxHeader b Server: Apache [21:17:46] why would 403s cause a request spike, though? [21:18:23] a request spike on what? [21:18:41] why so many more connections...? [21:19:15] i don't know why but it would be loaded everywhere [21:19:27] one theory is all requests for this getting blocked on the few backend requests that don't complete [21:19:35] a busy object [21:19:38] i think autonym might be new (not sure) [21:19:54] oh, so it was really just requests piling up [21:20:02] maybe [21:20:14] now, I still don't know what exactly ori deployed besides that he deployed "something with uls" [21:20:20] heh [21:20:24] the 403 is new [21:20:27] so ori-l: summary please? 
:) [21:20:44] OK, but just a last thought on the 403 [21:20:51] after setting up static-current [21:21:05] I chatted about it with Reedy, and mentioned https://bits.wikimedia.org/static-current/extensions/UniversalLanguageSelector/data/fontrepo/fonts/LohitNepali/Lohit-Nepali.woff as a test case [21:21:11] so I know for a fact that it was loading then [21:21:19] this is Oct. 17 [21:21:26] ori-l: but autonym might be a new one [21:21:37] yeah, but that URL is 403ing too now [21:21:38] and apparently used to display "English" [21:21:42] ok [21:22:07] re: what I deployed: ULS was making an API call on every page [21:22:09] this is bug https://bugzilla.wikimedia.org/show_bug.cgi?id=56509 [21:22:19] oh [21:22:21] there was a fix by the ULS team (which I did not review) [21:22:31] it was merged and had been deployed to 1.23wmf4 [21:23:00] so my thinking was: - resolves the extra API call, - successfully deployed to 1.23wmf4 & running in prod == should be safe [21:23:36] the autonym font is loaded on every page, AFAIK [21:23:39] so if the 403 is not cached [21:23:46] that would be bad [21:23:58] so, how does this api change cause a changed asset url (or at least, that's what I believe is happening) [21:24:03] api call change [21:24:25] because all the API call was doing was load a static JSON blob from disk [21:24:59] making it a resourceloader module means it can be retrieved in a single request along with other modules and have ResourceLoader manage client-side caching in a manner appropriate for a piece of static JS [21:25:16] right [21:25:39] (ori deployed the fix after a discussion with me, and after I pointed out production issues that we were having and this was fixing) [21:25:42] i think there were multiple changes in the deployment [21:26:01] separate issues probably [21:26:23] aude, yes, I did not cherry-pick just the change, but bumped the 1.23wmf3 ULS submodule to the same commit as 1.23wmf4 [21:26:31] reasoning that it was "known good" [21:26:47] i had to make 
changes to wikibase to keep compatibility [21:27:01] had to add jquery.uls.data as a dependency in certain places in wikibase [21:27:10] unrelated though to the issues [21:27:54] I checked with siebrand prior to bumping the submodule [21:28:04] it seemed fine when i tested it earlier [21:28:16] but i have no way to test all the caching (maybe on beta) [21:30:29] when I curled the 403ing Autonym font a few minutes ago, the X-Cache header was < X-Cache: cp1070 miss (0) [21:30:53] so perhaps it is varnish that is declining to cache backend responses with status code 403? [21:31:33] and client-side caching is not enough to mitigate the fact that everyone who doesn't already have this in their browser cache is going to be making a request to varnish that it has to retrieve from the backend [21:31:48] ^ mark [21:32:09] not expert but sounds plausible [21:32:09] yes [21:32:23] sorry, I think I'm a bit lagged. [21:32:35] mark@mw1150:~$ cd /usr/local/apache/common/docroot/bits/static-current/extensions [21:32:35] -bash: cd: /usr/local/apache/common/docroot/bits/static-current/extensions: No such file or directory [21:32:56] lrwxrwxrwx 1 mwdeploy mwdeploy 33 Nov 14 18:15 extensions -> /a/common/php-1.23wmf4/extensions [21:32:56] lrwxrwxrwx 1 mwdeploy mwdeploy 32 Nov 14 18:15 resources -> /a/common/php-1.23wmf4/resources [21:32:56] lrwxrwxrwx 1 mwdeploy mwdeploy 28 Nov 14 18:15 skins -> /a/common/php-1.23wmf4/skins [21:32:58] broken links [21:33:29] not on tin. was 1.23wmf4 not pushed out? [21:33:48] apaches don't even have /a [21:34:06] oh, fuck. yes, the symlinks are wrong. hang on. [21:36:21] (03PS1) 10Dr0ptp4kt: WIP: DO NOT MERGE YET. Allow Google's bots to scrape bits. 
[operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95548 [21:36:48] lol, perfect timing dr0ptp4kt [21:37:09] paravoid, oh funny [21:37:10] as for varnish caching that 403, the bits caches have this in vcl_fetch: [21:37:11] elsif (beresp.status >= 400 && beresp.status <= 499 && beresp.ttl > 1m) { [21:37:11] set beresp.ttl = 1m; [21:37:11] } [21:37:11] oh gosh [21:37:19] so they limit the cache to just 1 min [21:37:40] OK, I didn't break the symlinks [21:37:54] https://gerrit.wikimedia.org/r/#/c/95464/1/docroot/bits/static-stable/skins [21:37:57] e.g. [21:38:04] is that really correct? [21:38:52] (03CR) 10Dr0ptp4kt: [C: 04-2] WIP: DO NOT MERGE YET. Allow Google's bots to scrape bits. [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95548 (owner: 10Dr0ptp4kt) [21:38:56] no, I did break them. [21:39:39] guilty change: https://gerrit.wikimedia.org/r/#/c/94280/1 [21:40:20] huh, we have a script for that [21:40:56] emergencies happening here, or can cmjohnson1 and I ask a networking question (ahem, hi mark! :) ) [21:41:08] or LeslieCarr? [21:41:12] shoot [21:41:26] cmjohnson1: just moved analytics1011 (and some other nodes) [21:41:30] their ips have changed [21:41:32] things look ok [21:41:42] but it can't talk to puppetmaster / stafford [21:41:45] the weird bit is this [21:41:49] when I 'ping puppet' [21:41:51] i get this [21:42:00] root@analytics1011:~# ping puppet [21:42:01] PING stafford.pmtpa.wmnet (10.0.0.24) 56(84) bytes of data. [21:42:01] From analytics1012.eqiad.wmnet (10.64.5.3) icmp_seq=1 Packet filtered [21:42:06] (03PS1) 10Ori.livneh: Revert "Fix how 'current' branch is determined in updateBitsBranchPointers" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95550 [21:42:11] so, looks like maybe the network ACL or something weird [21:42:11] what does puppet resolve to? 
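[editor's note: the vcl_fetch snippet quoted above caps the TTL of 4xx responses at one minute, which is why the 403s could only pollute the bits caches briefly. A minimal Python model of that rule, for readers unfamiliar with VCL (the function name and plain-integer TTLs are illustrative, not Varnish's actual API):]

```python
# Model of the bits-cache vcl_fetch rule from the log:
# 4xx responses whose backend TTL exceeds one minute have their TTL
# capped to one minute, so a transient error can't stay cached long.

ONE_MINUTE = 60  # VCL's "1m", expressed in seconds

def cap_error_ttl(status: int, ttl_seconds: int) -> int:
    """Return the TTL the cache would use for a fetched object."""
    if 400 <= status <= 499 and ttl_seconds > ONE_MINUTE:
        return ONE_MINUTE
    return ttl_seconds

# A 403 with a day-long backend TTL is cached for only a minute;
# successful responses keep their full TTL.
print(cap_error_ttl(403, 86400))  # 60
print(cap_error_ttl(200, 86400))  # 86400
```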
[21:42:16] note that alex is working on that stuff [21:42:20] perhaps the ACL is not reflecting that yet [21:42:23] stafford [21:42:33] but also, notethat it says From analytics1012... [21:42:34] oh it did say above [21:42:36] this is analytics1011 [21:42:40] and i am pinging puppet [21:42:46] it's not just puppet [21:42:50] /etc/hosts [21:42:52] what does analytics1012 have to do with anything? [21:42:52] (03CR) 10Ori.livneh: [C: 032 V: 032] Revert "Fix how 'current' branch is determined in updateBitsBranchPointers" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95550 (owner: 10Ori.livneh) [21:42:53] yeah [21:43:01] so this is a new subnet in a new row? [21:43:03] !log ori updated /a/common to {{Gerrit|Id1a2bb2c8}}: Revert "Fix how 'current' branch is determined in updateBitsBranchPointers" [21:43:04] 10.64.5.2 analytics1011.eqiad.wmnet analytics1011 [21:43:16] row A? [21:43:16] paravoid 10.64.5.2 analytics1011.eqiad.wmnet analytics1011 [21:43:18] Logged the message, Master [21:43:21] hosts ^^ [21:43:31] yes row A [21:43:38] i bet the acl isn't correct for that yet [21:44:07] ja, i would guess that's why it can't talk to stafford [21:44:16] but, still weird about an12 being in the output of ping there [21:44:54] vrrp ips clashing? 
[21:45:15] so .3 being one of the two routers [21:45:16] !log ori synchronized docroot/bits/static-current 'Correct 'static-current' symlinks on bits' [21:45:32] Logged the message, Master [21:45:45] yes [21:45:50] heh [21:45:51] 10.64.5.2 = cr1-eqiad [21:45:53] that was a guess [21:45:54] 10.64.5.3 = cr2-eqiad [21:45:59] 10.64.5.1 = vrrp ip [21:45:59] right [21:46:00] don't use those [21:46:04] ori-l: much better [21:46:05] !log ori synchronized docroot/bits/static-stable 'Correct 'static-stable' symlinks on bits' [21:46:05] and packet filtered being missing ACLs [21:46:10] lemme fix that [21:46:10] besides that, the ACLs aren't updated yet [21:46:13] test.wikidata is quick and works [21:46:18] Logged the message, Master [21:46:23] it should be the same in the other analytics vlans [21:46:46] anyway, it's 11 pm here and I'm again backlogged on stuff I needed today/tomorrow before I leave [21:46:51] so can this wait for leslie tomorrow? :) [21:46:58] yah, ok, [21:47:04] OK, so the Autonym font URL load now [21:47:05] well, maybe we can get LeslieCarr today? ahhhhh [21:47:06] yeah [21:47:08] it can wait [21:47:11] i'm going to peace out sometime soon too [21:47:13] she's on a flight [21:47:16] some stuff is down til these are back up, but its ok [21:47:17] ah ok [21:47:24] mark, I apologize for the bits issue; I'll write a postmortem and e-mail the list. [21:47:26] ok, what do I need to do to get things up at least? 
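[editor's note: the confusing "From analytics1012" in the ping output is explained above: the first three addresses of the new subnet (.1 VRRP virtual IP, .2 and .3 the two physical routers) were already claimed by the routers, so DNS entries assigning them to hosts collided. A hedged sketch of that convention using Python's ipaddress module; the three-slot reservation is this site's practice, not anything the module enforces:]

```python
import ipaddress

# Per the log: in 10.64.5.0/24, .1 is the VRRP IP and .2/.3 are
# cr1-eqiad/cr2-eqiad, so hosts must start at .4.
RESERVED_GATEWAY_SLOTS = 3

def gateway_addresses(subnet: str) -> list[str]:
    """Addresses claimed by the routers at the start of the subnet."""
    hosts = list(ipaddress.ip_network(subnet).hosts())
    return [str(h) for h in hosts[:RESERVED_GATEWAY_SLOTS]]

def first_usable_host(subnet: str) -> str:
    """First address safe to assign to a server."""
    hosts = list(ipaddress.ip_network(subnet).hosts())
    return str(hosts[RESERVED_GATEWAY_SLOTS])

print(gateway_addresses("10.64.5.0/24"))  # ['10.64.5.1', '10.64.5.2', '10.64.5.3']
print(first_usable_host("10.64.5.0/24"))  # 10.64.5.4
```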
[21:47:27] ottomata fixing dns [21:47:33] ori-l: ok [21:47:36] ori-l: thanks :) [21:47:38] the ACLs need to work [21:47:40] but mark, its ok [21:47:50] i probably wont' have time to get everything back to normal before I leave anyway [21:47:51] so don't worry about it [21:48:05] ok [21:48:23] ask leslie to carefully check the full config of the new subnets, compare with the existing subnets, including ACLs [21:48:32] and multicast [21:48:41] and ipv6 router advertisements [21:48:44] and yadda yadda yadda [21:48:56] ok cool [21:49:03] i'm copy/pasting that for me to send to her tomorrow :) [21:49:25] :) [21:50:47] ori-l: i assume you'll fix it from here on too? :) [21:51:03] oh you did already [21:51:04] good [21:51:36] things look good [21:51:48] I feel like a massive idiot, sorry again. [21:52:07] well, it's good to know it's not necessarily uls to blame [21:52:14] it works fine now [21:55:39] (03CR) 10Dr0ptp4kt: "DO NOT MERGE YET" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95548 (owner: 10Dr0ptp4kt) [21:58:14] gwicke: http://focusdesigns.com/compare/ [21:59:45] !log Gerrit: added marktraceur to group [https://gerrit.wikimedia.org/r/#/admin/groups/539,members stream-events] [21:59:56] morebots: commeeeonn [22:00:01] Logged the message, Master [22:00:01] I am a logbot running on tools-login. [22:00:01] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log. [22:00:01] To log a message, type !log . [22:00:55] does it like [] [22:04:26] (03PS1) 10Cmjohnson: fixing dns entries for analytics boxes [operations/dns] - 10https://gerrit.wikimedia.org/r/95554 [22:05:40] !log hashar> !log Gerrit: added marktraceur to group https://gerrit.wikimedia.org/r/#/admin/groups/539,members stream-events [22:05:50] .. 
[22:05:56] Logged the message, Master [22:06:12] (03CR) 10Cmjohnson: [C: 032] fixing dns entries for analytics boxes [operations/dns] - 10https://gerrit.wikimedia.org/r/95554 (owner: 10Cmjohnson) [22:07:25] Nemo_bis: thank you :) [22:12:13] paravoid: i'm creating a .deb package for logster, [22:12:18] it installs an executable [22:12:26] should there be more than one package then? [22:12:37] python-logster (with just python libs) [22:12:37] and logster (with executable)? [22:12:41] or just logster [22:12:44] or just python-logster [22:12:45] with both? [22:12:52] the latter is fine [22:12:57] you can even build it as "logster" [22:13:02] logster, would be good [22:13:04] that's what I started as [22:13:06] is it better than ganglia-logtailer [22:13:10] ? [22:13:11] ha, k [22:13:16] yeah, it does more than ganglia [22:13:28] and is easier to extend i think [22:13:43] what does ganglia logtailer do? :) [22:14:05] it streams the rrds somewhere? [22:14:10] Ryan_Lane: this is https://github.com/etsy/logster [22:14:13] PROBLEM - Disk space on cerium is CRITICAL: DISK CRITICAL - free space: /mnt/data 15611 MB (3% inode=99%): [22:14:14] The logster project was created at Etsy as a fork of ganglia-logtailer (https://bitbucket.org/maplebed/ganglia-logtailer). [22:14:17] funny isn't it [22:14:24] hahaha [22:14:38] otto wants to use that [22:14:45] ha, actually it isn't that much different than ganglia-logtailer, except for the multiple outputs [22:14:46] it all comes full cycle! [22:15:11] I don't see how this is much different than ircecho [22:15:26] I took a look at fedmsg btw, did I tell you? [22:15:31] oh, right [22:15:33] how is it? 
[22:15:37] I have a vim open with 4-5 bugs I want to file [22:15:40] :D [22:15:51] paravoid, the master of bugs [22:15:57] hm [22:16:03] awesome stuff, like the source hardcoding the environments that the fedora infrastructure has [22:16:06] prod, stg, dev [22:16:06] god help salt stack if you ever start using it :D [22:16:23] I wanted production / labs / fundraising, obviously [22:16:25] (actually, they'd like the bufs) [22:16:27] *bugs [22:16:39] and it's hardcoded in the source :(( [22:16:44] bleh [22:16:53] yeah, it has all kinds of silly things like that [22:16:55] I wasn't very impressed [22:17:04] I'd actually like to use this: http://www.stackalytics.com/ [22:17:04] maaaybe with a little work [22:17:05] ottomata: ori-l is looking at logstash as well [22:17:12] if my phone stopped beeping for two days [22:17:17] logster [22:17:17] but like fedmsg, it has openstack stuff hardcoded in it [22:17:46] ottomata: logstash is a different thing [22:17:49] we've talked about it a few times [22:17:50] that's by far the best git/bug/mailing list/company/contributor stats thing I've ever seen [22:18:03] you definitely don't want to run logstash on each host ;) [22:18:05] I want to help with that too :(( [22:18:09] it eats a shit-ton of memory [22:18:11] yeah i know [22:18:13] i'm saying logster [22:18:17] i aint said logstash [22:18:20] yeah, logster [22:19:32] hm. 
lyft was telling me about some other thing that could be used too [22:19:54] that does exponential backoff on failures and such as well [22:20:06] and tracks multiple files and uses inotify [22:20:24] springle: http://pastebin.com/3MKTNZ2M (one of these seems a lot faster) [22:20:31] on enwiki [22:20:33] too bad I can't remember what it was [22:21:08] Ryan_Lane: well, you'll know soon enough [22:21:16] :'( [22:21:22] heh, yeah, but you'll be using logster by then ;) [22:21:39] (03PS1) 10Ottomata: Initial .deb packaging [operations/debs/logster] (debian) - 10https://gerrit.wikimedia.org/r/95556 [22:26:50] Aaron|home: with the force? quite possible, but how does it affect the other variations and wikis? [22:27:17] don't want to go back to ~10min examples :) [22:29:32] PROBLEM - Puppet freshness on sq44 is CRITICAL: No successful Puppet run in the last 10 hours [22:31:27] it works for the case of (per user,no filtering aside from log_type filter for permissions and the "spammy log" log_type filter ("review","patrol"),user logs are not mostly review/patrol or those are not filtered) [22:31:47] springle: this is fine for enwiki, for wikidatawiki, you may have uses with boatloads of mostly patrol log entries [22:32:10] that will still suck...we should probably disable patrolling or something there [22:32:38] * Aaron|home looks for bot users with patrol/autopatrol log spam [22:33:09] there is no good index plan for that afaik [22:33:48] i'm not a fan of reitroducing more forcing. it sets us up for a fall again in the future if dataset changes and makes it hard to react [22:34:32] example: the wikidata wb_terms query recently that caused spikes. without force it could be reindexed on the fly to allow patching at leisure. forced queries are intractable [22:34:36] well without it, you need some other way to have mysql not get tripped up by that IN clause [22:34:56] do we need to join tag_summary all the time? 
[22:35:24] most examples i see are ts_tags=NULL, and ts_log_id having such a log cardinality is a problem [22:37:01] s/log/low/ [22:38:13] PROBLEM - Disk space on cerium is CRITICAL: DISK CRITICAL - free space: /mnt/data 15924 MB (3% inode=99%): [22:39:02] remove the tag_summary join and the non-forced version is generally single-digit seconds, and may be more amenable to reindexing. it switches to Index Condition Pushdown (nice for cold data) [22:39:35] I see that " Show review log" is gone from wikidata [22:39:44] maybe someone turned off patrolling already....that's good [22:41:38] springle: in theory tags are shown next to log entries...though I never see any since they don't actually tend to get tagged [22:43:27] springle: wait, tag_summary has no PK right? [22:43:37] Aaron|home: correct [22:44:16] it has unique keys, but on nullable fields. so PK is internal [22:44:24] yep [22:45:37] review != patrol [22:46:43] yeah I meant "Patrol" [22:46:47] tag_summary ts_log_id cardinality is atrocious. hence the ref lookup is poor, not much better than index scan [22:47:56] for the case of my log entries, the FORCE INDEX(user_time) is still fast with the tag_summary query though [22:48:33] though no FORCE + no tag summary is indeed much faster than with the tag table [22:51:42] springle: anyway, maybe tag_summary should be 3 tables? [22:52:03] sure, i get that force is fast. but "Using index condition" plus a small filesort is a better target imho. it's more likely to scale when data is cold [22:52:22] tag_summary could be refactored, yes [22:53:02] sure, in fact that's a good option, despite the pain i foresee migrating a table without a PK :) [22:53:14] though that has to be solved one day regardless [22:55:39] (03PS1) 10MaxSem: Fix Solr OOM on labs [operations/puppet] - 10https://gerrit.wikimedia.org/r/95559 [22:56:16] Coren, andrewbogott ^^^ is an easy fix to GeoData broken in beta:) [22:57:06] (03CR) 10coren: [C: 032] "Trivial." 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/95559 (owner: 10MaxSem) [22:57:20] thanks:) [22:57:45] Merged and pushed. [23:00:49] paravoid: thanks for adding IPv6 on Gerrit :] [23:07:17] Aaron|home: tag_summary is pulled entirely from change_tag, right? [23:07:28] yep [23:13:17] PROBLEM - Disk space on cerium is CRITICAL: DISK CRITICAL - free space: /mnt/data 15719 MB (3% inode=99%): [23:14:43] Aaron|home: one possible substitute http://pastebin.com/XtGgSMNK . not pretty, but fast [23:15:23] might need an extra DISTINCT in there somewhere [23:20:10] hmm no actually that's still wrong [23:21:38] 2 [23:31:34] springle: we could probably just use GROUP_CONCAT, we already use subqueries in a few things, and sqlite/mysql support it and postgres has array_to_string (which also hides NULLs) [23:31:44] we'd just need a new Database class wrapper method [23:32:02] (subqueries implying mysql >= 4.1) [23:32:56] I remember when I was excited about that release [23:33:41] things die hard sometimes ;) [23:35:12] yeah, change_tag+group_concat is much better for logpager than tag_summary in its present form [23:36:34] it would at least allow me to continue to mess with indexes on problem wikis if some forms of logpager still misbehave [23:37:32] paravoid: awake enough to help with a weird dns issue over in #wikimedia-labs? [23:37:49] (not an emergency) [23:38:41] springle: does the GROUP BY log_id mess with the sorting performance? [23:40:39] though maybe that would all be a subquery [23:41:04] in which case you don't need GROUP BY log_id on the outer query [23:41:24] springle: do you have the query you'd do with group_concat? 
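[editor's note: the change_tag + GROUP_CONCAT substitution discussed above can be demonstrated against SQLite, which supports GROUP_CONCAT like MySQL does. The table and column names follow MediaWiki's schema as mentioned in the log; the sample rows and the exact query shape are illustrative, not the pastebin versions being compared:]

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE logging (log_id INTEGER PRIMARY KEY, log_action TEXT);
    CREATE TABLE change_tag (ct_log_id INTEGER, ct_tag TEXT);
    INSERT INTO logging VALUES (1, 'patrol'), (2, 'review');
    INSERT INTO change_tag VALUES (1, 'mobile'), (1, 'bot');
""")

# One row per log entry, with all tags collapsed into a single string,
# in place of a join against the denormalized tag_summary table.
# An untagged entry comes back with NULL, mirroring ts_tags=NULL.
rows = conn.execute("""
    SELECT l.log_id, GROUP_CONCAT(ct.ct_tag) AS ts_tags
    FROM logging l
    LEFT JOIN change_tag ct ON ct.ct_log_id = l.log_id
    GROUP BY l.log_id
    ORDER BY l.log_id
""").fetchall()
print(rows)  # log 1 carries both tags, comma-joined; log 2 has None
```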
[23:42:30] * Aaron|home stares at https://upload.wikimedia.org/wikipedia/commons/2/2f/Chipmunk%2C_AZ%2C_USA_%289536809478%29.jpg [23:43:00] Aaron|home: I thought it was an error or something [23:43:04] you had me there [23:44:26] nope, just an exercise in irresistible cuteness [23:45:22] Aaron|home: http://pastebin.com/wSW4UA9W two versions [23:46:04] yeah, that's about the two ways I was thinking of [23:46:04] probably the first is the better, for simplicity. doesn't get 'using index condition' but does avoid the filesort and subquery [23:46:44] both seem similar speed and rows touched on a cold pmtpa enwiki slave [23:49:21] I guess we'd have to check $dbr->implicitGroupby() in the code too [23:49:38] otherwise PG will whine about 'X must appear in the GROUP BY clause or be used in an aggregate function' [23:49:50] for the first one [23:50:21] hrmm [23:50:34] from a readability and cross compatibility standpoint, I prefer the 2nd [23:52:47] sure. six of one, half a dozen of the other [23:55:12] Aaron|home: actually pt-visual-explain makes the 2nd look better under the hood [23:56:02] (03PS3) 10Dzahn: move IRC stuff to module - WIP [operations/puppet] - 10https://gerrit.wikimedia.org/r/94407 [23:56:44] (03CR) 10Dzahn: "(11 comments)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/94407 (owner: 10Dzahn) [23:57:55] Aaron|home: http://pastebin.com/HQaqWd1b [23:58:17] * Aaron|home wonders how pt-visual-explain works [23:58:19] the subquery gets moved out entirely [23:58:40] andrewbogott: where was the puppet project list you mentioned pls [23:59:00] (03CR) 10TTO: "@Dereckson: "wmf" is a prefix that should only be used for functions, as I understand it. All variables should start with "$wmg". Per the " [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94598 (owner: 10Arav93) [23:59:14] link from 'puppet' or 'puppet coding'? 
we could use a landing page there [23:59:28] mutante: https://wikitech.wikimedia.org/wiki/Puppet_Todo [23:59:35] And, it is linked from… one of those pages I think. [23:59:40] i sat through an entire lecture by Baron one year at MySQLconf and still came away wondering about pt-visual-explain. it's a fancy script