[00:07:56] andrewbogott: https://wikitech.wikimedia.org/w/index.php?title=RT&diff=81539&oldid=47588 [00:08:54] andrewbogott: eh, if you have additions to https://wikitech.wikimedia.org/w/index.php?title=RT&diff=81539&oldid=81532 [00:09:44] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:10:13] the mail server setup part, meh, but i'll ttyl [00:10:34] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time [00:17:14] PROBLEM - Puppet freshness on ssl1 is CRITICAL: No successful Puppet run in the last 10 hours [00:23:14] PROBLEM - Puppet freshness on ssl1006 is CRITICAL: No successful Puppet run in the last 10 hours [00:30:14] PROBLEM - Puppet freshness on ssl1008 is CRITICAL: No successful Puppet run in the last 10 hours [00:34:14] PROBLEM - Puppet freshness on cp1044 is CRITICAL: No successful Puppet run in the last 10 hours [00:44:05] PROBLEM - Puppet freshness on ssl1001 is CRITICAL: No successful Puppet run in the last 10 hours [00:45:05] PROBLEM - Puppet freshness on amssq47 is CRITICAL: No successful Puppet run in the last 10 hours [00:48:05] PROBLEM - Puppet freshness on ssl1003 is CRITICAL: No successful Puppet run in the last 10 hours [00:48:05] PROBLEM - Puppet freshness on ssl1005 is CRITICAL: No successful Puppet run in the last 10 hours [00:48:05] PROBLEM - Puppet freshness on ssl4 is CRITICAL: No successful Puppet run in the last 10 hours [00:51:05] PROBLEM - Puppet freshness on ssl1007 is CRITICAL: No successful Puppet run in the last 10 hours [00:51:05] PROBLEM - Puppet freshness on cp1043 is CRITICAL: No successful Puppet run in the last 10 hours [00:54:05] PROBLEM - Puppet freshness on ssl1002 is CRITICAL: No successful Puppet run in the last 10 hours [00:54:05] PROBLEM - Puppet freshness on ssl3001 is CRITICAL: No successful Puppet run in the last 10 hours [00:54:48] (03PS1) 10Dzahn: remove old RT 3.8 stuff and just keep RT 4.x Apache stuff [operations/puppet] - 10https://gerrit.wikimedia.org/r/81443 [00:57:05] PROBLEM - Puppet freshness on ssl1004 is CRITICAL: No successful Puppet run in the last 10 hours [00:59:05] PROBLEM - Puppet freshness on ssl3003 is CRITICAL: No successful Puppet run in the last 10 hours [01:00:05] PROBLEM - Puppet freshness on ssl1009 is CRITICAL: No successful Puppet run in the last 10 hours [01:01:05] PROBLEM - Puppet freshness on ssl3002 is CRITICAL: No successful Puppet run in the last 10 hours [01:01:05] PROBLEM - Puppet freshness on ssl3 is CRITICAL: No successful Puppet run in the last 10 hours [01:04:05] PROBLEM - Puppet freshness on ssl2 is CRITICAL: No successful Puppet run in the last 10 hours [01:25:03] (grrrit-wm dead because labs is dead) [01:26:59] PROBLEM - DPKG on labstore3 is CRITICAL: Timeout while attempting connection [01:28:39] PROBLEM - Host labstore3 is DOWN: PING CRITICAL - Packet loss = 100% [01:29:09] RECOVERY - Host labstore3 is UP: PING OK - Packet loss = 0%, RTA = 26.58 ms [01:30:26] Ryan_Lane: around? got a moment for a patch? [01:30:37] sure. what's up? [01:30:47] https://gerrit.wikimedia.org/r/#/c/81447/ [01:31:24] goes the hafnium which i have sudo on, so i can force the puppet run and check for failures [01:33:30] ori-l: zsock.subscribe = b'' ? [01:34:35] the hipstery pyzmq interface relies on @property decorators [01:34:45] :D [01:34:46] an empty subscription in zmq pubsub is 'subscribe to everything' [01:35:17] it's a bit strange, but that's zmq pubsub semantics. 
if you don't declare a subscription you don't get anything. [01:38:06] gah, the print statement on line 56 is superfluous, i'll remove it [01:39:06] I'm still unfamiliar with the b'' notation [01:39:26] can you send me a pointer on how it's used? [01:39:39] just means it's bytes rather than unicode [01:40:09] the from __future__ import unicode_literals statement means that 'abc' (in the file) is a unicode string [01:40:20] and unicode strings aren't bytes [01:40:37] ah. ok. [01:40:58] i'll just update it to do it the non-hipstery way [01:41:11] nah, it's ok [01:41:23] k, thx [01:45:55] I hate how we have things spread between role, misc, modules and manifests [01:45:59] misc is especially fucked [01:46:18] yeah, i know :/ i'm cleaning it up gradually [01:46:22] i added a pystatsd module yesterday [01:46:43] nobody/nogroup? :( [01:46:44] heh [01:46:51] but the graphite situation is fucked, gdash is literally deployed via puppet and its code is in the puppet repo [01:47:00] yeah, gdash will move to git-deploy [01:47:29] i filed a bug with 'volunteer-needed' keyword because i figured it's a good opportunity for someone to get familiar with git-deploy [01:47:33] * Ryan_Lane nods [01:47:50] maybe I'll spend some time tomorrow cleaning up git-deploy's config in puppet [01:47:57] to make most of the config optional [01:48:25] hm. is this the best way to handle inits? [01:48:28] as a template? [01:48:35] upstart is kind of a shitty system [01:48:50] is /etc/default/ not used in upstart? [01:49:02] i think it's an init.d thing, but i dunno [01:49:26] I'm not totally sure how upstart will handle a change in its file [01:49:43] oh, did i not set a subscribe? [01:49:54] I'd imagine it should be fine since it's supposed to track the process itself [01:50:08] I think you did [01:50:23] oh [01:50:25] no, you didn't [01:51:02] I think changing the init and notifying should work fine [01:51:17] changing the init? [01:51:19] it should kill the tracked pid and restart based on the new file [01:51:45] doing this kind of thing in the old init system could occasionally lead to weirdness [01:51:48] ah no, that's not necessary because it doesn't fork; upstart will daemonize it [01:51:57] upstart discourages app developers from daemonizing [01:52:02] * Ryan_Lane nods [01:52:12] this looks fine, assuming you add a notify or subscribe [01:52:40] yep, done (notify) [01:55:36] merged [01:56:00] :D thank you! [01:56:06] i am hideously excited about this [01:57:47] heh [01:57:48] yw [01:59:06] gah, i'm an idiot [01:59:31] Ryan_Lane: https://gerrit.wikimedia.org/r/#/c/81448 . sorry. [02:01:14] merged [02:01:17] it happens [02:02:04] thanks [02:02:18] yw [02:06:17] works! [02:06:24] * ori-l waits for dataz. [02:07:04] * YuviPanda streams all cricket scores ever, into ori-l [02:07:11] MAKE SENSE OF THAT, HAH! [02:16:15] !log LocalisationUpdate completed (1.22wmf14) at Wed Aug 28 02:16:15 UTC 2013 [02:16:21] Logged the message, Master [02:16:46] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:16:51] Ryan_Lane: What's wrong with Labs? :-( [02:17:09] Elsie: do you mean tools specifically? [02:17:19] ask in #wikimedia-labs, rather than here [02:17:25] coren is working on it [02:17:46] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 5.282 second response time [02:17:51] All right. [02:17:53] should really specify tools rather than just labs. 
labs itself is fine ;) [02:22:06] PROBLEM - Puppet freshness on cp1063 is CRITICAL: No successful Puppet run in the last 10 hours [02:22:46] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:24:37] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [02:27:46] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:28:36] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.137 second response time [02:30:40] !log LocalisationUpdate completed (1.22wmf13) at Wed Aug 28 02:30:40 UTC 2013 [02:30:46] Logged the message, Master [02:31:22] Ryan_Lane: wmflabs.org doesn't resolve. ;-) [02:31:56] it never has [02:32:02] www.wmflabs.org did at some point [02:32:10] but I removed that from someone's project [02:32:24] let me just add that to virt0 [02:39:32] Elsie: there, now wmflabs.org and www.wmflabs.org redirect to wikitech [02:40:10] Still not resolving for me, but possibly cached. [02:40:13] yep [02:40:25] you already tried to hit it. so, your resolver has negative cache [02:40:49] ;wmflabs.org. IN A [02:40:49] in about an hour it'll be fixed [02:40:58] All right. [02:41:04] dig @labs-ns0.wikimedia.org wmflabs.org [02:41:07] There's an open bug about this somewhere. It'll be nice to have it resolved. :-) [02:41:36] 208.80.152.32 [02:41:37] Nice. [02:41:44] You see nytimes.com got hit today? [02:41:47] yep [02:41:55] and twitter [02:42:46] I know. I went to check Twitter when I noticed nytimes.com was down. I got so confused. [02:43:11] :D [02:43:49] Hrmmm, are they still down? [02:44:04] www.nytimes.com works. Silly DNS. [02:45:01] (03PS1) 10Ori.livneh: Log client-side latency measurements in graphite [operations/puppet] - 10https://gerrit.wikimedia.org/r/81447 [02:45:02] (03PS2) 10Ori.livneh: Log client-side latency measurements in graphite [operations/puppet] - 10https://gerrit.wikimedia.org/r/81447 [02:45:04] (03PS3) 10Ori.livneh: Log client-side latency measurements in graphite [operations/puppet] - 10https://gerrit.wikimedia.org/r/81447 [02:45:06] (03PS4) 10Ori.livneh: Log client-side latency measurements in graphite [operations/puppet] - 10https://gerrit.wikimedia.org/r/81447 [02:45:08] (03CR) 10Ryan Lane: [C: 032] Log client-side latency measurements in graphite [operations/puppet] - 10https://gerrit.wikimedia.org/r/81447 (owner: 10Ori.livneh) [02:45:13] (03PS1) 10Ori.livneh: Fix typo [operations/puppet] - 10https://gerrit.wikimedia.org/r/81448 [02:45:14] (03CR) 10Ryan Lane: [C: 032] Fix typo [operations/puppet] - 10https://gerrit.wikimedia.org/r/81448 (owner: 10Ori.livneh) [02:45:28] (03PS1) 10Ryan Lane: Add wmflabs.org to wikitech's apache config [operations/puppet] - 10https://gerrit.wikimedia.org/r/81451 [02:45:29] (03CR) 10Ryan Lane: [C: 032] Add wmflabs.org to wikitech's apache config [operations/puppet] - 10https://gerrit.wikimedia.org/r/81451 (owner: 10Ryan Lane) [02:45:30] (03CR) 10Ryan Lane: [C: 032] Add newer pmtpa virt nodes to netboot [operations/puppet] - 10https://gerrit.wikimedia.org/r/81418 (owner: 10Ryan Lane) [02:45:49] !log LocalisationUpdate ResourceLoader cache refresh completed at Wed Aug 28 02:45:49 UTC 2013 [02:45:55] Logged the message, Master [02:46:11] (03CR) 10MZMcBride: "https://bugzilla.wikimedia.org/show_bug.cgi?id=36885" [operations/puppet] - 10https://gerrit.wikimedia.org/r/81451 (owner: 10Ryan Lane) [02:46:17] heh, see? 
Redis didn't lose any messages :P [02:46:25] PROBLEM - Disk space on analytics1004 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/e 11258 MB (3% inode=99%): [03:40:58] PROBLEM - Puppet freshness on fenari is CRITICAL: No successful Puppet run in the last 10 hours [05:27:14] (03CR) 10Greg Grossmeier: [C: 031] "If this is good, let's get it in before Thursday's deploy." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/80717 (owner: 10Amire80) [06:44:46] (03CR) 10Nikerabbit: [C: 031] Don't show the IME in the CodeEditor textarea [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/80717 (owner: 10Amire80) [06:45:41] apergos: good morning :-] [06:48:20] hello [06:50:47] hashar: hello [06:51:15] apergos: I was wondering if you knew anything about the production memcached servers? [06:51:25] apparently we have a bunch of them with lot of memory allocated [06:51:32] but can't find the definition for it in puppet :/ [06:51:47] the memcached class only creates one instance of ~ 80MB memory which is not that much [06:51:57] lemme look [06:52:04] on beta Chad created two instances but each of them only have 80MB, I would like to allocate mooaaar mem [06:53:42] there are a pile of mc*** with role::memcached [06:53:50] or are you talking about something else? [06:54:16] ahh [06:54:25] 16 boxes looks like for eqiad [06:54:40] yeah that is the boxes [06:55:20] and that role calls ::memcached class with a size of 89088 [06:55:38] which would mean we only have 16 x 89088 memory allocated [06:55:56] yep [06:56:39] which is like … not that much :] [06:56:42] don't forget that a lot of stuff is in redis these days too [06:56:54] and we have a database parser cache [06:57:16] so yeah maybe that is enough [06:57:17] yep [06:57:31] I guess beta should use more redis and have a db backed parser cache hehe [06:57:43] prolly should, if it's going to act like production [07:00:27] yup [07:00:35] more bugs I need to fill [07:22:12] PROBLEM - Puppet freshness on mw1126 is CRITICAL: No successful Puppet run in the last 10 hours [07:27:40] (03PS1) 10Ori.livneh: Change metric name format for NavigationTiming events [operations/puppet] - 10https://gerrit.wikimedia.org/r/81464 [07:32:38] apergos: also, Bryan Davis (multimedia software engineer), posted a mail to engineering list about multimedia tech debt [07:32:54] apergos: he is referring to thumbnails so I though you might want to reply / get in touch with him [07:33:14] (sorry I am processing my mail backlog hehe) [07:34:43] so am I. now down to 700 messages unread... [07:34:55] I hav enot gotten to the tech debt thread yet. [07:35:28] and I found out that the production parser cache uses both memcached and the db hehe [07:44:05] and I filled a bunch of bugs to get MariaDB on beta .. [07:47:07] great [07:53:29] apergos: could you possibly merge change 81464 (see above)? [07:55:02] looking [07:55:11] thank you [07:59:07] mwalker: sleep you should! [07:59:12] mwalker: thanks for the pep8 merges :-] [07:59:23] hashar: probably [07:59:29] thanks for writing them [07:59:41] any thoughts on why the jobs aren't triggering yet though? 
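A minimal pyzmq sketch of the subscription pattern reviewed above, empty subscription plus bytes socket options (the endpoint and the receive loop are hypothetical, not the actual patch in r81447):

    from __future__ import unicode_literals  # a bare '' literal is now unicode, not bytes

    import zmq

    ctx = zmq.Context()
    sock = ctx.socket(zmq.SUB)
    sock.connect('tcp://127.0.0.1:8600')  # hypothetical publisher endpoint

    # PUB/SUB delivers nothing until at least one subscription is declared;
    # an empty prefix means "subscribe to everything".
    sock.setsockopt(zmq.SUBSCRIBE, b'')   # classic form
    # sock.subscribe = b''                # equivalent attribute-style pyzmq form

    # The subscription value must be bytes, hence b'': with unicode_literals
    # in effect, a plain '' would be a unicode string and be rejected.
    while True:
        print(sock.recv())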
[07:59:52] cause I haven't merged the change [07:59:55] arhar [08:00:07] makes sense [08:00:21] https://gerrit.wikimedia.org/r/#/c/81378/ [08:00:24] replied a bit there [08:00:33] you might want to have the linting job to vote Verified + 2 [08:00:48] and maybe add gate-and-submit which would make Jenkins merge the change for you whenever you CR+2 [08:00:55] (if jenkins is allowed to merge changes there) [08:01:07] ori-l: instead of site.country.metric it's now going to be metric.site.country, this is desired right? [08:01:22] yep [08:01:28] cool [08:01:54] (03CR) 10ArielGlenn: [C: 032] Change metric name format for NavigationTiming events [operations/puppet] - 10https://gerrit.wikimedia.org/r/81464 (owner: 10Ori.livneh) [08:01:58] hashar: I was going to wait until we had error free runs before doing V+2 on anything [08:02:37] apergos: thanks very much [08:03:21] hashar: unless there's an optoin to have it +2 but not -2 [08:03:40] mwalker: well they are non voting [08:03:47] so if the lint fails, jenkins is not going to vote -2 [08:03:56] it will even vote +2 regardless [08:03:57] this is for professor? [08:03:59] ( ori-l ) [08:04:03] hashar: interesting [08:04:20] sounds good to me; what change do I need to make? [08:04:22] apergos: no, hafnium [08:04:25] woops [08:04:26] mwalker: that is what 'voting: false' does, it basically discard the result of the job to find out what the voting score is. [08:04:40] i have access to hafnium, so i can update it [08:04:43] ok running puppet now [08:04:49] oh. heh well too late [08:04:53] mwalker: replace check-only by check-voter [08:04:55] np, saves me the trouble [08:05:10] mwalker: and copy paste the block below it, and replace check-voter with gate-and-submit. I can do it if you want [08:05:39] I can only learn by doing [08:05:50] some people are not willing to learn :-] [08:06:42] some of them want to abuse you, some of them want to be abused. [08:06:43] ok you're good to go ori-l [08:06:59] apergos: woot! thanks! :D [08:07:06] sure! [08:11:44] hashar: what is the point of having a gate-and-submit job if we're already running the tests in check-voter and the tests do not vote? [08:12:02] just to save us the trouble of adding it when we do make the tests voting? [08:12:17] yup [08:12:34] even if all tests are non voting, Jenkins will still vote Verified +2 [08:12:40] this way you do not have to vote verified :] [08:12:47] gotcha [08:12:50] saves you one click on each change [08:13:10] if it costs 0.10$ per click, we save money :) [08:14:02] ok; pushed [08:17:25] mwalker: nooow head to bed :] [08:17:43] indeed [08:17:45] sleeps time [08:18:03] thank you mwalker|sleeps ! 
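The metric renaming merged above only reorders the dotted graphite path so the metric name comes first. A rough sketch of the idea (field names are illustrative, not the actual NavigationTiming schema):

    def metric_path(event):
        """Build a graphite key as metric.site.country instead of the old
        site.country.metric, so related series share a common prefix."""
        return '.'.join([event['metric'], event['site'], event['country']])

    # {'metric': 'responseStart', 'site': 'enwiki', 'country': 'DE'}
    #   -> 'responseStart.enwiki.DE'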
[08:37:24] off will be back later on [08:52:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:52:35] PROBLEM - MySQL Slave Delay on db43 is CRITICAL: CRIT replication delay 194 seconds [08:53:05] PROBLEM - MySQL Replication Heartbeat on db43 is CRITICAL: CRIT replication delay 214 seconds [08:53:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time [09:00:05] RECOVERY - MySQL Replication Heartbeat on db43 is OK: OK replication delay -0 seconds [09:00:35] RECOVERY - MySQL Slave Delay on db43 is OK: OK replication delay 0 seconds [09:01:37] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:02:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time [09:22:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:23:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time [09:24:57] PROBLEM - LVS Lucene on search-pool5.svc.eqiad.wmnet is CRITICAL: Connection timed out [09:25:57] RECOVERY - LVS Lucene on search-pool5.svc.eqiad.wmnet is OK: TCP OK - 0.000 second response time on port 8123 [09:28:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:29:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time [09:52:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:53:26] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time [09:57:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:58:26] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time [10:14:23] PROBLEM - LVS HTTP IPv6 on wikimedia-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:14:23] PROBLEM - LVS HTTPS IPv6 on wikipedia-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:14:25] PROBLEM - LVS HTTPS IPv6 on wikiversity-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:14:25] PROBLEM - LVS HTTP IPv6 on wikiquote-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:14:25] PROBLEM - LVS HTTP IPv6 on wiktionary-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:14:33] PROBLEM - LVS HTTPS IPv6 on bits-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:14:35] PROBLEM - LVS HTTPS IPv6 on wikisource-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:14:35] PROBLEM - LVS HTTPS IPv6 on wikiquote-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:14:35] PROBLEM - LVS HTTPS IPv6 on foundation-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:14:35] PROBLEM - LVS HTTP IPv6 on mediawiki-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:14:35] PROBLEM - LVS HTTPS IPv6 on wikimedia-lb.esams.wikimedia.org_ipv6 is CRITICAL: 
CRITICAL - Socket timeout after 10 seconds [10:14:36] PROBLEM - LVS HTTPS IPv6 on mediawiki-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:14:36] PROBLEM - LVS HTTPS IPv6 on wikivoyage-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:14:37] PROBLEM - LVS HTTP IPv6 on wikipedia-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:14:38] PROBLEM - LVS HTTP IPv6 on wikisource-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:14:38] PROBLEM - LVS HTTP IPv6 on foundation-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:14:38] PROBLEM - LVS HTTP IPv6 on wikivoyage-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:14:43] PROBLEM - LVS HTTPS IPv6 on wiktionary-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:14:43] PROBLEM - LVS HTTP IPv6 on wikidata-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:14:43] PROBLEM - LVS HTTP IPv6 on wikiversity-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:14:54] PROBLEM - LVS HTTPS IPv6 on wikinews-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:14:54] PROBLEM - LVS HTTP IPv6 on wikibooks-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:15:13] RECOVERY - LVS HTTP IPv6 on wikimedia-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 90853 bytes in 0.436 second response time [10:15:13] RECOVERY - LVS HTTPS IPv6 on wikipedia-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 60685 bytes in 0.711 second response time [10:15:15] RECOVERY - LVS HTTPS IPv6 on wikiversity-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 60685 bytes in 0.705 second response time [10:15:15] RECOVERY - LVS HTTP IPv6 on wikiquote-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 60685 bytes in 0.435 second response time [10:15:15] RECOVERY - LVS HTTP IPv6 on wiktionary-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 60685 bytes in 0.436 second response time [10:15:23] RECOVERY - LVS HTTP IPv6 on mediawiki-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 60685 bytes in 0.349 second response time [10:15:23] RECOVERY - LVS HTTPS IPv6 on bits-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 3893 bytes in 0.534 second response time [10:15:25] RECOVERY - LVS HTTPS IPv6 on wikiquote-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 60685 bytes in 0.695 second response time [10:15:25] RECOVERY - LVS HTTPS IPv6 on foundation-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 60685 bytes in 0.714 second response time [10:15:25] RECOVERY - LVS HTTPS IPv6 on wikisource-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 60685 bytes in 0.714 second response time [10:15:25] RECOVERY - LVS HTTPS IPv6 on mediawiki-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 60685 bytes in 0.711 second response time [10:15:25] RECOVERY - LVS HTTPS IPv6 on wikimedia-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 90851 bytes in 0.799 second response time [10:15:26] RECOVERY - LVS HTTP IPv6 on wikivoyage-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 44563 bytes in 0.352 second response time [10:15:26] RECOVERY - LVS HTTP IPv6 on 
wikisource-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 60685 bytes in 0.357 second response time [10:15:27] RECOVERY - LVS HTTP IPv6 on foundation-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 60687 bytes in 0.435 second response time [10:15:27] RECOVERY - LVS HTTP IPv6 on wikipedia-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 60685 bytes in 0.440 second response time [10:15:28] RECOVERY - LVS HTTPS IPv6 on wikivoyage-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 44563 bytes in 0.616 second response time [10:15:33] RECOVERY - LVS HTTP IPv6 on wikidata-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 1186 bytes in 0.176 second response time [10:15:34] RECOVERY - LVS HTTP IPv6 on wikiversity-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 60687 bytes in 0.355 second response time [10:15:34] RECOVERY - LVS HTTPS IPv6 on wiktionary-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 60685 bytes in 0.710 second response time [10:15:43] RECOVERY - LVS HTTP IPv6 on wikibooks-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 60687 bytes in 0.349 second response time [10:15:43] RECOVERY - LVS HTTPS IPv6 on wikinews-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 60685 bytes in 0.696 second response time [10:18:03] PROBLEM - Puppet freshness on ssl1 is CRITICAL: No successful Puppet run in the last 10 hours [10:22:33] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:23:23] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.133 second response time [10:24:03] PROBLEM - Puppet freshness on ssl1006 is CRITICAL: No successful Puppet run in the last 10 hours [10:31:03] PROBLEM - Puppet freshness on ssl1008 is CRITICAL: No successful Puppet run in the last 10 hours [10:35:03] PROBLEM - Puppet freshness on cp1044 is CRITICAL: No successful Puppet run in the last 10 hours [10:43:49] (03CR) 10Hashar: "The files will still be owned by root/root arent they?" 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/79955 (owner: 10Mattflaschen) [10:44:59] PROBLEM - Puppet freshness on ssl1001 is CRITICAL: No successful Puppet run in the last 10 hours [10:45:59] PROBLEM - Puppet freshness on amssq47 is CRITICAL: No successful Puppet run in the last 10 hours [10:48:59] PROBLEM - Puppet freshness on ssl1005 is CRITICAL: No successful Puppet run in the last 10 hours [10:48:59] PROBLEM - Puppet freshness on ssl1003 is CRITICAL: No successful Puppet run in the last 10 hours [10:48:59] PROBLEM - Puppet freshness on ssl4 is CRITICAL: No successful Puppet run in the last 10 hours [10:51:59] PROBLEM - Puppet freshness on cp1043 is CRITICAL: No successful Puppet run in the last 10 hours [10:51:59] PROBLEM - Puppet freshness on ssl1007 is CRITICAL: No successful Puppet run in the last 10 hours [10:54:59] PROBLEM - Puppet freshness on ssl1002 is CRITICAL: No successful Puppet run in the last 10 hours [10:54:59] PROBLEM - Puppet freshness on ssl3001 is CRITICAL: No successful Puppet run in the last 10 hours [10:57:59] PROBLEM - Puppet freshness on ssl1004 is CRITICAL: No successful Puppet run in the last 10 hours [10:59:59] PROBLEM - Puppet freshness on ssl3003 is CRITICAL: No successful Puppet run in the last 10 hours [11:00:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:00:59] PROBLEM - Puppet freshness on ssl1009 is CRITICAL: No successful Puppet run in the last 10 hours [11:01:59] PROBLEM - Puppet freshness on ssl3 is CRITICAL: No successful Puppet run in the last 10 hours [11:01:59] PROBLEM - Puppet freshness on ssl3002 is CRITICAL: No successful Puppet run in the last 10 hours [11:02:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [11:04:59] PROBLEM - Puppet freshness on ssl2 is CRITICAL: No successful Puppet run in the last 10 hours [12:05:14] PROBLEM - Puppet freshness on pdf1 is CRITICAL: No successful Puppet run in the last 10 hours [12:22:52] PROBLEM - Puppet freshness on cp1063 is CRITICAL: No successful Puppet run in the last 10 hours [12:31:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:32:22] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.130 second response time [12:32:31] !log jenkins: updated various jobs to use $ZUUL_PROJECT when determining the git repository to fetch from {{bug|53470}} [12:32:37] Logged the message, Master [12:34:35] (03CR) 10Andrew Bogott: [C: 031] "I was expecting there to be a corresponding change in site.pp -- did we rip out the tampa RT server definition already?" [operations/puppet] - 10https://gerrit.wikimedia.org/r/81443 (owner: 10Dzahn) [13:05:33] paravoid, I'm looking at https://gerrit.wikimedia.org/r/#/admin/projects/operations/debs/nginx and it's not clear to me how to actually build. I presume I need to check out the actual nginx source someplace in there? 
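For scale, the memcached sizing discussed above works out roughly as follows (back-of-the-envelope only):

    per_box_mb = 89088   # size the memcached class is passed by role::memcached
    boxes = 16           # eqiad mc* hosts carrying that role
    per_box_gb = per_box_mb / 1024.0
    total_tb = per_box_mb * boxes / 1024.0 / 1024.0
    print('%.0f GB per box, %.2f TB total' % (per_box_gb, total_tb))
    # -> 87 GB per box, 1.36 TB total, versus two ~80 MB instances on beta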
[13:20:00] !log temporarily changing innodb purge settings on db55 (massive growing history list) [13:20:05] Logged the message, Master [13:22:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:22:40] (03PS1) 10Hashar: tcpircbot: rm unused `os` module import [operations/puppet] - 10https://gerrit.wikimedia.org/r/81489 [13:23:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time [13:24:04] PROBLEM - MySQL Replication Heartbeat on db55 is CRITICAL: CRIT replication delay 225 seconds [13:24:15] PROBLEM - MySQL Slave Delay on db55 is CRITICAL: CRIT replication delay 236 seconds [13:24:26] heh [13:27:14] RECOVERY - MySQL Slave Delay on db55 is OK: OK replication delay 0 seconds [13:31:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:32:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [13:35:29] !log stopped db55 slave threads while purge catches up; killed long running research user transaction apparently doing nothing but holding locks [13:35:34] Logged the message, Master [13:41:54] PROBLEM - Puppet freshness on fenari is CRITICAL: No successful Puppet run in the last 10 hours [13:43:39] !log same purge situation for db58 and db43, and remedy [13:43:45] Logged the message, Master [13:47:14] PROBLEM - MySQL Slave Delay on db1051 is CRITICAL: CRIT replication delay 192 seconds [13:47:24] PROBLEM - MySQL Replication Heartbeat on db1051 is CRITICAL: CRIT replication delay 202 seconds [13:47:25] PROBLEM - MySQL Replication Heartbeat on db43 is CRITICAL: CRIT replication delay 188 seconds [13:48:04] PROBLEM - MySQL Replication Heartbeat on db58 is CRITICAL: CRIT replication delay 232 seconds [13:49:07] andrewbogott_afk: no idea, but what do you want with it? in our HTTPS chats with Ryan, the consensus was that we were going to evaluate switching to pristine packages and this may happen soon [13:49:24] RECOVERY - MySQL Replication Heartbeat on db1051 is OK: OK replication delay 15 seconds [13:49:40] so it might not worth the trouble figuring it out [13:49:47] Ryan would know best either way though [13:50:07] paravoid, yuvi needs an updated version of nginx-extras. I was going to just make a tip package for him. [13:50:14] RECOVERY - MySQL Slave Delay on db1051 is OK: OK replication delay 0 seconds [13:50:22] But, OK, I can wait and ask ryan -- it looks like you've modified that code but maybe you never built. [13:50:40] I think this was like 18 months ago, right? [13:50:52] yeah 2012-05 [13:51:01] yeah, quite a while ago [13:51:21] sorry, I don't remember a thing :) [13:51:22] hiyaaa! mark, if you are you around, could you comment on this? https://rt.wikimedia.org/Ticket/Display.html?id=5678 [13:51:49] hiyaaa! [13:51:51] did you kill kafka? [13:51:58] hahaha [13:52:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:52:47] ha wha? 
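The "massive growing history list" noted above is InnoDB's backlog of unpurged undo log records, which a long-idle transaction holding an open read view can pin. One way to watch it, sketched against a generic DB-API connection (connection setup omitted):

    import re

    def history_list_length(conn):
        """Extract 'History list length N' from SHOW ENGINE INNODB STATUS,
        i.e. how far behind the purge thread is."""
        cur = conn.cursor()
        cur.execute('SHOW ENGINE INNODB STATUS')
        status_text = cur.fetchone()[2]   # rows are (Type, Name, Status)
        m = re.search(r'History list length (\d+)', status_text)
        return int(m.group(1)) if m else None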
[13:53:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.133 second response time [13:54:26] well varnishkafka suddenly broke [13:54:46] Aug 28 13:42:29 cp1048 varnishkafka[21693]: KAFKAERR: Kafka error (-196): analytics1004.eqiad.wmnet:9092/4: Failed to connect to broker at analytics1004.eqiad.wmnet:9092: Connection refused [13:55:54] !log jenkins: renamed Editcount jobs from EditCount to Editcount [13:56:00] Logged the message, Master [13:57:59] mark: and crashed after that? [13:58:06] no [13:58:14] phew [13:58:15] but running out of buffer space [13:58:20] and flooding rsyslog with that :) [13:58:29] good :) [13:58:33] Aug 28 13:41:07 cp1048 rsyslogd-2177: imuxsock begins to drop messages from pid 21693 due to rate-limiting [13:58:34] Aug 28 13:41:13 cp1048 rsyslogd-2177: imuxsock lost 44689 messages from pid 21693 due to rate-limiting [13:58:34] Aug 28 13:41:13 cp1048 varnishkafka[21693]: PRODUCE: Failed to produce kafka message: No buffer space available [13:58:34] !log jenkins: renamed OAuth jobs from Oauth to OAuth [13:58:40] Logged the message, Master [13:59:01] mark: should probably rate limit that output a bit in varnishkafka.. [13:59:10] would be nice [14:04:19] anyway [14:04:27] kinda difficult to do varnishkafka perf testing now ;) [14:04:34] but first impressions are good [14:04:43] it's using less than half of varnishncsa's cpu [14:04:52] plus the fact that we only need to run one instance instead of 2+ [14:05:47] mark, i responded on the RT, but the issue is cloudera hadoop packages depend on cloudera zookeeper package [14:05:53] which conflicts with ubuntu zookeeper package [14:05:57] which is what is installed on those machines [14:06:29] headache territory [14:06:48] and you can't run those journalnodes shared on existing hadoop machines either? [14:07:01] I started looking at that [14:07:07] but then remembered how we had that conversation [14:07:08] I always think all these quorum based services are so wasteful with machines, especially once you start to run multiple [14:07:17] which was basically [14:07:27] yeah, maybe, i think that it would be ok, except that it means we have to configure those machines differently, which is a bit annoying [14:07:34] that zookeeper is probably a useful piece of infrastructure to have and doesn't necessarily have anything to do with analytics [14:07:35] i don't htink we should run on namenode machines [14:07:40] beacuse that defeats the purpose [14:07:46] yeah zookeeper we can move out [14:07:55] it's a tiny piece of software though, so it could be easily run on a machine that runs something else [14:07:58] but I wonder if we ALSO need to assign 2/3 more machines to analytics [14:08:03] I would think so [14:08:05] and if we do it on the datanodes, then we have to remember this and partition them differently, and possibly configure job resources there differently [14:08:13] if we move zk out [14:08:17] we don't need 2/3 more machines [14:08:29] you do if we move those existing zk machines out ;) [14:08:32] oh ha [14:08:41] however way you look at it, it's 2-3 additional machines for this service [14:08:42] if you do that, then yeah we'd need 2 more, mayyyybe one [14:09:01] kind of, you could run zookeeper on some existing servers :) [14:09:01] we could get away with just one more, if that is a problem [14:09:09] misc servers [14:09:58] so, i think our planned zk usage will be pretty low. 
but, apparently linked in has had problems when there are a large number of kafka consumers all saving state in zookeeper [14:10:22] https://issues.apache.org/jira/browse/KAFKA-1000 [14:10:39] we don't have any high consumer use plans, so i htink it will be ok [14:10:48] camus is batched so it should be fine [14:14:01] (03PS3) 10Faidon Liambotis: Middle-East to esams [operations/dns] - 10https://gerrit.wikimedia.org/r/80972 [14:14:02] (03PS3) 10Faidon Liambotis: Africa to esams [operations/dns] - 10https://gerrit.wikimedia.org/r/80971 [14:14:20] (03CR) 10Faidon Liambotis: [C: 032] Africa to esams [operations/dns] - 10https://gerrit.wikimedia.org/r/80971 (owner: 10Faidon Liambotis) [14:15:09] !log authdns-update: switching Africa to esams [14:15:14] Logged the message, Master [14:15:17] Aug 28 14:14:03 cp1048 varnishd[1486]: Child (20318) said Could not destroy object 3290904632 in EXP_NukeLRU [14:15:21] that doesn't look good [14:15:25] paravoid, was that 'you could run zookeeper on misc servers' comment directed at me or to mark? :) [14:16:13] what are the I/O requirements for journalnodes? [14:16:41] (03PS3) 10Andrew Bogott: Make our checks for definitions a bit more explicit. [operations/puppet] - 10https://gerrit.wikimedia.org/r/77331 [14:20:06] Aug 28 14:19:37 cp1048 frontend[31119]: Child (31121) said : (malloc) Error in munmap(): #001<80><8A>#177 [14:20:07] Aug 28 14:19:37 cp1048 frontend[31119]: Child (31121) said : (malloc) Error in munmap(): [14:20:19] (03CR) 10Andrew Bogott: [C: 032] Make our checks for definitions a bit more explicit. [operations/puppet] - 10https://gerrit.wikimedia.org/r/77331 (owner: 10Andrew Bogott) [14:20:45] :( [14:22:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:25:28] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 8.380 second response time [14:26:09] africa isn't noticeable in graphs at all.. [14:26:12] mark, i'm not sure about io usage [14:26:18] but, i'm looking at some datanode disk configs [14:26:24] not even a little [14:26:54] i betcha I can create a raid 1 partition on sda and sdb on them [14:26:59] there's a lot of unused space there [14:27:12] and i think it should be ok to share io on the os disks [14:27:25] what does a journalnode do exactly? [14:27:42] they syncs namenode metatdata edits to the standby namenode [14:27:52] (03CR) 10Faidon Liambotis: [C: 032] Middle-East to esams [operations/dns] - 10https://gerrit.wikimedia.org/r/80972 (owner: 10Faidon Liambotis) [14:28:05] for HA namenode, you need to have the namenode metatdata highly available [14:28:19] there are 2 ways to do this: NFS, or Quorum Based JournalNodes [14:28:29] JournalNodes are the newer and preferred and less hacky way [14:28:32] http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/HDFSHighAvailabilityWithQJM.html [14:28:53] paravoid, mark did you have time to look at the varnish/streaming response issues (bug 52854) [14:29:05] j^: i'll look at that soon [14:31:30] so, mark, ok, i'm going to see if I can put journalnodes on a few datanodes, if we have problems with it we can revisit getting new nodes for this later. But! I'm still for a Zk outside of analytics. if we want to move 3 of the analytics nodes elsewhere for others to use as well, i'm for it! 
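For reference, the quorum-journal HA described above comes down to a few hdfs-site.xml properties (hostnames and paths here are hypothetical; the linked Apache doc is authoritative):

    # Rough shape of the QJM settings, expressed as a dict for readability.
    qjm_ha_properties = {
        'dfs.nameservices': 'analytics-hadoop',
        'dfs.ha.namenodes.analytics-hadoop': 'nn1,nn2',
        # The active namenode writes edits to a majority of these journalnodes
        # and the standby tails them, which keeps its metadata in sync.
        'dfs.namenode.shared.edits.dir':
            'qjournal://jn1:8485;jn2:8485;jn3:8485/analytics-hadoop',
        # Local directory where each journalnode keeps the edit segments.
        'dfs.journalnode.edits.dir': '/var/lib/hadoop/journal',
    }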
[14:32:15] i am for that too [14:32:19] i just updated the ticket [14:32:30] k danke [14:33:21] if we need the extra machines, then we can take them, I just hate to spend 3 machines on a service that does shit-all in terms of resources [14:33:38] so if you can, try to share with datanodes indeed [14:33:42] if that is problematic, let's move it over [14:33:44] is that ok? [14:34:17] !log authdns-update: switching Middle-East to esams [14:34:22] Logged the message, Master [14:34:24] yeah sounds good [14:34:29] thanks [14:38:36] ottomata: so, what happened to kraken? [14:38:36] er [14:38:38] kafka [14:39:16] oh what's up? [14:39:41] ohhhh looky there, [14:39:43] (no noticeable traffic increase with ME either) [14:39:49] an04's disk filled up! [14:39:55] didn't know you were gonna fill it~! [14:39:58] 276G [14:40:44] (03PS3) 10Faidon Liambotis: Add all Asian countries in the list [operations/dns] - 10https://gerrit.wikimedia.org/r/80974 [14:40:45] (03PS3) 10Faidon Liambotis: Switch Central/South Asia to esams [operations/dns] - 10https://gerrit.wikimedia.org/r/80973 [14:40:49] (rebasing) [14:41:43] (03CR) 10Faidon Liambotis: [C: 04-1] "Per discussion with Mark, postpone this decision later, perhaps after ulsfo." [operations/dns] - 10https://gerrit.wikimedia.org/r/80973 (owner: 10Faidon Liambotis) [14:42:06] heh [14:42:16] interesting [14:42:16] so [14:42:23] so i just ran varnishkafka on one box for a day [14:42:35] topic: varnish partition: 0 leader: 3 replicas: 3,4 isr: 3 [14:42:36] topic: varnish partition: 1 leader: 3 replicas: 4,3 isr: 3 [14:42:45] you can see that broker 3 (analytics1003) [14:42:54] is the leader, and has the only in-sync-replica (isr) [14:43:12] which means that you should still be able to produce [14:43:16] since its disk isn't full [14:43:27] varnishkafka got connrefused [14:43:35] Snaps_ has a thing to do in librdkafka where it is supposed to reconnect [14:43:50] if produce reqs fail due to broker problems [14:44:09] i'm not sure if he's got that yet, but you are probably not using a version of varnishkafka built with the fix even if it is done [14:44:36] so, when that actually, works, what should have happened is that the producers would have noticed that a new leader had been elected [14:44:39] and they'd start sending their data there [14:44:44] this would have filled up an03 eventually too [14:44:46] so this is right now: [14:44:47] Aug 28 14:44:29 cp1048 varnishkafka[8101]: KAFKAERR: Kafka error (-196): analytics1004.eqiad.wmnet:9092/4: Failed to connect to broker at analytics1004.eqiad.wmnet:9092: Connection refused [14:44:58] after a restart [14:45:02] that is with a restart? oh ok [14:45:04] hm [14:45:16] intersteing, yeah kafka is down on an04 [14:45:20] because the disk is full [14:45:32] but i'm not sure why that would cause the producer to not be able to start [14:45:50] librdkafka will automatically connect [14:45:54] no need to restart [14:46:16] Snaps_: if one of the listed brokers in metadata.broker.list is down [14:46:22] when varnishkafka starts up [14:46:31] what is supposed to happen? [14:46:57] ottomata: it will always strive to keep a connection to all known brokers. As soon as it connects to a broker it requests the metadata, possibly aquiring a list of new brokers, which it will connect to. [14:47:20] ottomata: the problem we talked about earlier was not about connections, but that the metadata was not automatically refreshed on a connection that did not go down. 
But thats fixed [14:47:37] ok, probably mark is not running with a varnishkafka built with that fix [14:47:38] ok [14:47:46] but, righ tnow, he is trying to start varnishkafka [14:47:57] one of the brokers is down [14:48:11] mark, does varnishkafka work even with that error message? [14:48:33] maybe it just prints the error because it can't connect, but it will produce to the partition leader (an03) anyway? [14:49:08] it doesn't seem to be doing much [14:49:21] yeah i'm not seeing any thing in the consumer i just spawned up [14:51:51] what version of librdkafka are you on, mark? [14:52:07] Installed: 0.8~wip20130807-1 [14:54:38] so, if we want a broker in esams and ulsfo [14:54:44] will we need zookeeper clusters there as well? [14:57:06] yes [14:57:14] are you fucking kidding me? [14:57:17] no way [14:57:22] mark: okay, theres been quite a few fixes the last couple of days to librdkafka. [14:57:46] we're not gonna set up 3 boxes just to support logging in a caching datacenter [14:58:00] that defeats the point entirely [14:58:17] we can run zk on other hosts [14:58:24] doesn't have to be dedicated boxes [14:58:25] like which? varnish? :) [14:58:29] there are no other hosts in a caching dc [14:58:59] maybe this is naive, but if there are 3 brokers, i don't see why we can't run it on the same boxes as the kafka brokers [14:59:10] 3 brokers per caching dc? [14:59:50] well, yes, at least, we have been using 2 in analytics, and i think 2 is fine to start with for now, might be all we need for a while, i don't know. but Zookeeper needs at least 3 nodes [14:59:53] since it is a quorum [15:00:16] i guess we're going with no brokers in caching dcs then [15:00:23] ? [15:00:32] needing 3 hosts per caching dc for this is insane [15:01:00] how many producers are there in esams? [15:01:08] 20 or so [15:01:25] PROBLEM - DPKG on ms-be1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:01:43] so, what we are trying to do is get the logs reliably streamed between datacenters [15:01:56] yes [15:02:01] and you'll need to do it without needing 3 boxes per dc [15:02:14] we! [15:02:14] we! [15:02:16] :) [15:02:17] right? [15:02:18] we? [15:02:29] if it was 'we', we didn't arrive here [15:02:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:02:35] !log restarting slave threads on db55, db58, db43. will lag for a while. [15:02:40] Logged the message, Master [15:03:16] right? [15:03:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.134 second response time [15:03:47] if there are other better ways of solving the problem, then we should figure that out, right? 
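To restate the topic metadata pasted earlier: each partition has one leader plus replicas, and only brokers in the ISR (in-sync replica set) can be elected leader if the current one dies. A small sketch mapping those describe-style lines to that (illustrative parsing only, not a Kafka client):

    import re

    LINE = 'topic: varnish partition: 0 leader: 3 replicas: 3,4 isr: 3'

    def parse_partition(line):
        """Pull partition / leader / replicas / ISR out of a describe-style line."""
        m = re.search(r'partition: (\d+) leader: (\d+) replicas: ([\d,]+) isr: ([\d,]+)', line)
        partition, leader, replicas, isr = m.groups()
        return {
            'partition': int(partition),
            'leader': int(leader),                    # broker taking produce requests now
            'replicas': [int(b) for b in replicas.split(',')],
            'isr': [int(b) for b in isr.split(',')],  # only these are failover candidates
        }

    # Broker 3 leads and is the only in-sync replica (broker 4 filled its disk),
    # so producers should keep working as long as broker 3 stays healthy.
    print(parse_partition(LINE))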
[15:03:54] PROBLEM - MySQL Slave Delay on db43 is CRITICAL: CRIT replication delay 4780 seconds [15:04:14] PROBLEM - MySQL Slave Delay on db58 is CRITICAL: CRIT replication delay 4802 seconds [15:04:15] PROBLEM - MySQL Slave Delay on db55 is CRITICAL: CRIT replication delay 6236 seconds [15:05:49] setting up a broker for intermediate storage in a caching dc has legal problems, but seems reasonable [15:06:10] also needing 3 zookeeper nodes with that is a huge waste, and contradictory to the idea of lightweight caching centers [15:06:43] hm, i'm not sure about this in 0.8, but in 0.7 you could manually just supply the list of brokers without using zookeeper…i think, not sure if that is still possible [15:07:36] that seems preferable to using at least 6 servers to figure that out ;) [15:07:59] lemme see what happens in the labs cluster... [15:08:07] hmmmmm nooooo [15:08:09] wait no [15:08:10] this won't work [15:08:21] because, we need zookeeper to consume [15:08:31] and the mirror is a consumer [15:08:56] yeah no, and the brokers definitely write to zk [15:09:21] and none of this seemed to be a problem before? :) [15:09:29] we haven't done multi data center before [15:10:49] mark, i think the plan (albeit not very well discussed, I suppose) was to have 2 brokers in each DC anyway, is it so bad to have a 3rd? :) [15:11:25] 2 brokers per dc, what for? [15:11:33] can't producers talk to brokers in other dcs? [15:12:09] i'm sure they can but it isn't recommended [15:12:20] (03PS2) 10Andrew Bogott: Turn on pluginsync. [operations/puppet] - 10https://gerrit.wikimedia.org/r/77378 [15:12:21] (03PS3) 10Andrew Bogott: Move base class and subclasses into a 'base' module. [operations/puppet] - 10https://gerrit.wikimedia.org/r/77332 [15:12:43] i'm starting to think we should have brokers only in our primary dcs [15:13:08] they do some buffering [15:13:15] 2 brokers per dc for HA, we want to do this reliably, right? having a single machine responsible for traffic defeats the purpose a bit [15:14:08] if we can figure out a way to reliably get the data from the caching DCs to the primary kafka brokers in eqiad without using kafka in the caching DCs, that's cool i suppose [15:14:21] i'm starting to think that udplog was pretty good [15:14:25] perhaps we should add that to varnishkafka ;-) [15:14:42] it didn't need 3 hosts to figure out where to send or find data [15:14:45] haha, uh huh, if you want all of your traffic to go to the same place [15:14:56] and if you wanted to lose data while you restart a udp2log instance [15:15:09] and if you wanted to deploy changes to N hosts when you need to move an instance [15:15:26] no [15:15:35] because maintaing 3 almost pointless boxes per dc is so much better [15:15:47] we have plenty of money and staff time, right [15:15:52] and rack space and power [15:16:18] imagine we lose a few log lines :) [15:16:27] i think the people who fund this stuff think getting reliable analytics is important? [15:16:51] i doubt the people who pay for this think wasting machines is important [15:17:40] data loss == lots of staff time spent figuring out what the heck went wrong, from analysts down to engineers [15:17:47] its not just the machine cost [15:18:09] so do it without data loss and without a lot of resource waste [15:18:29] do you have a better idea than the current plan? 
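The "3 boxes per DC" sticking point is plain quorum arithmetic: a ZooKeeper ensemble only serves while a strict majority of its members are up, so 3 is the smallest size that survives one failure (2 survives none, and 4 buys nothing over 3):

    def tolerated_failures(ensemble_size):
        """A quorum needs a strict majority, so an N-node ensemble keeps
        working only while more than N/2 members are alive."""
        majority = ensemble_size // 2 + 1
        return ensemble_size - majority

    for n in (1, 2, 3, 4, 5):
        print('%d-node ensemble tolerates %d failure(s)' % (n, tolerated_failures(n)))
    # 1 -> 0, 2 -> 0, 3 -> 1, 4 -> 1, 5 -> 2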
[15:18:46] yes [15:18:54] write something lightweight, custom [15:19:53] that is a pretty generic plan :p [15:20:15] I wish I would have spent the time already spent discussing kafka and various solutions on writing a custom solution [15:20:17] thanks for volunteering! [15:21:04] (or even investigate an alternative already out there) [15:21:10] <^d> manybubbles: Yo :) [15:21:21] ^d: yo! [15:21:32] so that library I mentioned works.... [15:21:36] mostly. [15:21:52] <^d> Meh, let's not worry about it today. Anything else we need to do before mw.org? [15:22:01] ^d: did you ever do a deploy to test2? [15:22:06] mark: i think the people who fund this stuff think getting reliable [15:22:08] oops [15:22:10] sorry, wrong paste! [15:22:15] https://docs.google.com/a/wikimedia.org/spreadsheet/ccc?key=0AvpRkIqSY9hNdFNoYWFIX29hMnFuenV1ckRvYzEzUVE#gid=0 [15:22:21] <^d> manybubbles: I haven't updated to master in awhile. Lemme do that. [15:22:37] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:22:41] ^d: cool. if you could, could you rebuild the search index with the two scripts? [15:23:25] <^d> Yeah [15:23:25] ^d: you have to remember to swap the --indexIdentifier arguments when you do an in place upgrade.... I should make that automatic. [15:23:27] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.136 second response time [15:23:29] i think i'll skip our meeting in half an hour [15:23:34] I can't stand this shit [15:23:58] varnishkafka seems to work fine [15:28:37] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:29:27] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.136 second response time [15:39:13] !log demon synchronized php-1.22wmf14/extensions/CirrusSearch 'CirrusSearch to master' [15:39:18] Logged the message, Master [15:43:33] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:44:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time [15:45:47] <^d> manybubbles: Not our fault, but I'm getting "PHP Notice: Undefined variable: function in /a/common/php-1.22wmf14/extensions/Scribunto/engines/LuaCommon/LuaCommon.php on line 729" on reindexing. [15:45:50] <^d> anomie: ^ [15:46:23] !log updated Parsoid to ca3f45f [15:46:29] Logged the message, Master [15:46:39] ^d: I'm glad it isn't our fault. that is likely going to make reindexing not too happy. If you updated the config that should still be ok. [15:48:02] <^d> I restarted it anyway, seems to be happy now. [15:51:03] ^d, manybubbles: Any idea how to reproduce? [15:51:46] <^d> Not particularly. It was transient, no stacktrace or anything since it was just a notice. [15:52:08] <^d> Ah, popped up again. Same uninformative message though. [15:52:21] ok [15:55:17] ^d: I'm able to shell in a look at the index and it looks like we've still got some cruft left over from a previous version. [15:55:29] <^d> :\ [15:56:36] (03PS1) 10Chad: Moving mediawikiwiki to Cirrus (as secondary backend) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/81505 [15:57:36] greg-g: I realized I didn't actually *answer* the email about the invite, I just mentally made a note to show up [15:58:21] apergos: hah. 
I assumed it wasn't going to happen since you didn't say yes on the cal invite, but then I realize that not everyone does that [15:58:37] so, you and I can just chat, or we can wait until next week to get hashar [15:58:43] ah no hashar today? [15:58:50] hmm maybe better for us all to be in [15:58:56] save you from two meetings [15:58:58] no, I thought it was a good time for him, but he has kid duty now [15:59:01] ok [15:59:02] yeah, thankya :) [15:59:06] all righty [15:59:15] ^d: https://gerrit.wikimedia.org/r/#/c/81506/ will probably fix it [16:03:03] <^d> manybubbles: test2 reconfiged and reindexed. [16:03:28] <^d> If https://gerrit.wikimedia.org/r/#/c/81505/1/wmf-config/InitialiseSettings.php looks good, we'll throw the switch on mw.org too [16:03:53] ^d: that looks good to me [16:04:04] (03CR) 10Chad: [C: 032] Moving mediawikiwiki to Cirrus (as secondary backend) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/81505 (owner: 10Chad) [16:04:13] (03Merged) 10jenkins-bot: Moving mediawikiwiki to Cirrus (as secondary backend) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/81505 (owner: 10Chad) [16:04:42] ^d: I've noticed a bug where when we reindex into a new, empty index while the old on still exists we serve some requests out of both the new and old index, causing dupes. Filing. [16:05:29] !log demon synchronized wmf-config/InitialiseSettings.php 'CirrusSearch on mediawiki.org \o/' [16:05:35] Logged the message, Master [16:07:03] <^d> !log ES: created mediawikiwiki index [16:07:08] Logged the message, Master [16:07:54] <^d> !log ES: indexing mediawikiwiki in screen on terbium [16:07:59] Logged the message, Master [16:08:24] ^d: in case you care, this is the bug number: 53484 [16:11:02] ^d: it ooks like a lot of templates aren't being expanded: http://www.mediawiki.org/w/index.php?search=search&title=Special%3ASearch&fulltext=1&srbackend=CirrusSearch [16:13:25] Ryan_Lane: err: Could not retrieve catalog from remote server: Error 400 on SERVER: Could not find class protoproxy::package for amssq47.esams.wikimedia.org at /etc/puppet/manifests/role/protoproxy.pp:20 on node amssq47.esams.wikimedia.org. [16:13:26] I think you missed that include. Is it safe to remove it? Replace it with include nginx::package? something else entirely ? [16:16:20] ^d: I'm pretty sure something is screwed up while indexing articles. Templates are not expanding as expected. [16:16:33] (03PS1) 10Asher: Revert "add missing uploadlb6 ips" [operations/puppet] - 10https://gerrit.wikimedia.org/r/81509 [16:20:13] <^d> manybubbles: I wonder... [16:20:15] !log Created EducationProgram tables on ptwiki [16:20:20] Logged the message, Master [16:20:31] ^d: I'm also not sure what is going on with namespaces. investigating that [16:20:33] <^d> If since we're running as secondary backend, it's picking up the wrong text formatting stuff. [16:21:32] ^d: grrrrrrrr! [16:21:44] ^d: I totally believe it. [16:21:59] <^d> Actually I've got a good suspicion that's it. [16:22:11] I can have a look at it. [16:22:38] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:23:49] (03PS1) 10Dr0ptp4kt: Adding DNS TXT entry for Google Webmaster Tools site verification. 
[operations/dns] - 10https://gerrit.wikimedia.org/r/81510 [16:24:28] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time [16:24:30] (03PS1) 10Reedy: Install Education Program extension on pt.wikipedia [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/81511 [16:25:05] (03PS2) 10Reedy: Install Education Program extension on pt.wikipedia [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/81511 [16:25:18] !log demon synchronized wmf-config/InitialiseSettings.php 'Move mw.org to primary backend as Cirrus to test my theory' [16:26:31] greg-g, no depl today for zero, can reallocate [16:26:52] hah [16:27:00] after I bumped you around so much you're just not going to do it [16:27:05] :) [16:27:16] ^d: I see the problem in forceSearchIndex.php. now to see if it is a problem with regular updates [16:29:10] ^d: so I just got a wmf error trying to update a page on mediawiki.org [16:30:23] <^d> Hmm, what was the error? [16:32:11] (03PS1) 10Chad: Move mw.org to primary backend as Cirrus to test my theory [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/81512 [16:32:18] (03CR) 10Chad: [C: 032 V: 032] Move mw.org to primary backend as Cirrus to test my theory [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/81512 (owner: 10Chad) [16:32:38] ^d: I'm actually not sure where to read the logs. it spun forever and then complained [16:32:48] <^d> Can you ssh to fluorine? [16:33:32] no [16:34:10] ^d: database locked? [16:34:16] did we do that? [16:36:04] PROBLEM - MySQL Replication Heartbeat on db1024 is CRITICAL: CRIT replication delay 199 seconds [16:36:09] <^d> I dunno, but I aborted the couple of indexing processes I had. Lemme do it a different way. [16:36:24] ^d: I'm pretty blind at this point. [16:36:24] PROBLEM - MySQL Slave Delay on db1028 is CRITICAL: CRIT replication delay 216 seconds [16:36:24] PROBLEM - MySQL Slave Delay on db1024 is CRITICAL: CRIT replication delay 216 seconds [16:36:34] PROBLEM - MySQL Replication Heartbeat on db1028 is CRITICAL: CRIT replication delay 222 seconds [16:36:35] those look relevent [16:37:14] <^d> @replag [16:37:15] ^d: [s7] db1024: 265s, db1028: 265s [16:38:19] <^d> manybubbles: mw.org is on s3, as is test2. [16:39:04] RECOVERY - MySQL Replication Heartbeat on db1024 is OK: OK replication delay 0 seconds [16:39:06] ^d: is that saying that those dbs have gone way out of sync from their master? [16:39:16] and now seem to be coming back [16:39:24] RECOVERY - MySQL Slave Delay on db1024 is OK: OK replication delay 0 seconds [16:39:46] (03CR) 10Helder.wiki: [C: 031] Install Education Program extension on pt.wikipedia [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/81511 (owner: 10Reedy) [16:40:16] @replag [16:40:17] Reedy: [s7] db1028: 204s [16:40:24] RECOVERY - MySQL Slave Delay on db1028 is OK: OK replication delay 3 seconds [16:40:30] Think it might have been my fault.. [16:40:34] RECOVERY - MySQL Replication Heartbeat on db1028 is OK: OK replication delay -1 seconds [16:40:35] ^d: I think the problem with not resolving templates is only in forceSearchIndex.php and I've found it. I think. [16:40:53] Reedy: what is up? [16:41:10] Running an update query [16:41:23] Even though I was doing them in much larger batches last night and it was fine [16:41:25] * Reedy kicks mysql [16:41:28] update globaluser set gu_home_db = NULL where gu_home_db = '' LIMIT 50 [16:41:54] indeed [16:41:54] Reedy: that^ ? 
[16:42:31] I wonder if that hit the same row that the maintenance script was trying to update [16:43:16] it's not indexed. limit 50 won't stop it scanning the table each time, all 26926625 rows on db1024 [16:43:29] batched, but slow batches :) [16:43:53] each one will be slower [16:44:05] (03PS2) 10Dr0ptp4kt: Adding DNS TXT entry for Google Webmaster Tools site verification. [operations/dns] - 10https://gerrit.wikimedia.org/r/81510 [16:45:07] ^d: after that merge we should be able to just restart the index build process [16:45:12] ^d: no need to update the config [16:45:25] (03PS1) 10Chad: Revert "Move mw.org to primary backend as Cirrus to test my theory" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/81514 [16:45:31] (03CR) 10Chad: [C: 032 V: 032] Revert "Move mw.org to primary backend as Cirrus to test my theory" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/81514 (owner: 10Chad) [16:45:39] (03PS3) 10Dr0ptp4kt: Adding DNS TXT entry for Google Webmaster Tools site verification. [operations/dns] - 10https://gerrit.wikimedia.org/r/81510 [16:46:07] !log demon synchronized wmf-config/InitialiseSettings.php [16:46:13] Logged the message, Master [16:46:22] Reedy: I'm not sure if this is good enough, but it would be better: update globaluser set gu_home_db = NULL where gu_home_db = '' ND gu_id > LIMIT 50 [16:46:38] That was to fix a few errant rows [16:46:57] ah, so there aren't many any way. it just takes forever to find them [16:47:36] Doing the batched reads in the maintenance script is using indexed gu_name [16:48:02] (03PS1) 10Jgreen: prep to make the spamd user configurable, and create spool dir for OTRS spam training [operations/puppet] - 10https://gerrit.wikimedia.org/r/81516 [16:48:13] !log demon synchronized php-1.22wmf14/extensions/CirrusSearch/forceSearchIndex.php 'Fix for reindexer' [16:48:19] Logged the message, Master [16:50:16] https://bugzilla.wikimedia.org/show_bug.cgi?id=53487 Error with CirrusSearch while deleting a page [16:50:28] Call to a member function getArticleID() on a non-object [16:50:59] $title passed in as null apparnetly [16:51:02] * Reedy find a stack trace [16:51:09] Reedy: looking [16:53:10] Reedy: was that a white page style error or did it let you through? [16:53:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:54:13] (03CR) 10Jgreen: [C: 032 V: 032] prep to make the spamd user configurable, and create spool dir for OTRS spam training [operations/puppet] - 10https://gerrit.wikimedia.org/r/81516 (owner: 10Jgreen) [16:54:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.132 second response time [16:56:10] manybubbles: Wasn't my error, but was in the fatal log so should've been a white page style error [16:57:06] ottomata: here ? [16:57:18] manybubbles: Oooh [16:57:30] manybubbles: Let it through via LQT [16:57:32] * Reedy screenshots [16:57:51] Reedy: I see the cause but I've never been able to reproduce using that method [16:57:56] heya, akosiaris1, yes but just started a meeting [16:58:28] ok will email you then and we talk tomorrow. [16:59:07] manybubbles: http://bug-attachment.wikimedia.org/attachment.cgi?id=13191 [17:00:05] Reedy: excuse my ignorance, but how do I create one of those to delete it? [17:00:12] heh [17:00:20] Have you sysop on mediawiki.org? 
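As manybubbles points out, a bare LIMIT on the unindexed gu_home_db column re-scans the table for every batch; anchoring each batch on the indexed gu_id bounds the work per statement. A minimal sketch of that pattern, assuming pymysql and made-up connection details and window size (the actual one-off in the log only had a few errant rows to fix, so this is purely illustrative):

```python
# Batched UPDATE that walks the primary key in fixed windows instead of
# re-scanning the whole table for each LIMITed batch.
import pymysql

conn = pymysql.connect(host="db-host", user="maint",      # placeholders
                       password="secret", database="centralauth")

def clear_empty_home_db(conn, window=50000):
    with conn.cursor() as cur:
        cur.execute("SELECT MAX(gu_id) FROM globaluser")
        max_id = cur.fetchone()[0] or 0
        low = 0
        while low < max_id:
            high = low + window
            # Each statement touches one bounded primary-key range, so it
            # uses the gu_id index rather than scanning all ~27M rows.
            cur.execute(
                "UPDATE globaluser SET gu_home_db = NULL "
                "WHERE gu_id > %s AND gu_id <= %s AND gu_home_db = ''",
                (low, high),
            )
            conn.commit()
            low = high

clear_empty_home_db(conn)
```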
[17:01:35] manybubbles: https://www.mediawiki.org/wiki/Project_talk:Support_desk Click "Add Topic" in tabs at the top of the page [17:06:35] Reedy: I don't seem to have the option to delete it.... [17:07:55] Reedy: can you nuke the dummy message I put on that page? [17:09:29] manybubbles: just given you sysop, so you should be able to delete it yourself now and see the error.. [17:11:00] Reedy: sweet! now I have to figure out how to get all that installed on my local mediawiki.... [17:11:13] Fairly easily [17:11:40] Checkout the repo for LiquidThreads if you haven't already [17:11:44] require_once( "$IP/extensions/LiquidThreads/LiquidThreads.php" ); [17:11:49] ^ Add that to your LocalSettings.php [17:11:52] Run maintenance/update.php [17:12:24] could be worse [17:15:50] (03CR) 10Asher: [C: 032 V: 032] Revert "add missing uploadlb6 ips" [operations/puppet] - 10https://gerrit.wikimedia.org/r/81509 (owner: 10Asher) [17:17:52] manybubbles: My IDE complains that for CirrusSearchUpdater::updateFromTitle( $title ); you're passing a string not a title objects [17:18:15] $normalTitle = $search->normalizeText( $indexTitle ); [17:18:21] Reedy: your IDE is right. [17:18:34] $search->updateTitle( $this->id, $normalTitle ); [17:18:38] Was just following it back to confirm ;) [17:18:58] Reedy: That method annoyed me because I could never trigger it. Any way, I put a patch on gerrit to fix it blind and now I'm trying to verify it. [17:20:31] manybubbles: Fix looks sane, but newFromID can return null. If you know that your always going to be calling it with a valid ID [17:21:34] (03PS1) 10Jgreen: various otrs + spamassassin and cron tweaks [operations/puppet] - 10https://gerrit.wikimedia.org/r/81521 [17:21:36] Reedy: you are right, I have that protection in other places. [17:22:14] PROBLEM - Puppet freshness on mw1126 is CRITICAL: No successful Puppet run in the last 10 hours [17:22:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:22:47] (03CR) 10Jgreen: [C: 032 V: 031] various otrs + spamassassin and cron tweaks [operations/puppet] - 10https://gerrit.wikimedia.org/r/81521 (owner: 10Jgreen) [17:24:12] <^d> manybubbles: Aww :( Catchable fatal error: Argument 2 passed to Parser::parse() must be an instance of Title, null given, called in /usr/local/apache/common-local/php-1.22wmf14/extensions/RSS/RSSParser.php on line 297 and defined in /usr/local/apache/common-local/php-1.22wmf14/includes/parser/Parser.php on line 351 [17:24:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time [17:24:51] ^d: looks like I missed lots of fun stuff [17:27:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:28:23] (03CR) 10coren: [C: 032] "Simple enough; worst that can happen is that Google doesn't like it." [operations/dns] - 10https://gerrit.wikimedia.org/r/81510 (owner: 10Dr0ptp4kt) [17:28:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.160 second response time [17:29:56] Oh, crap, I expect the procedure for DNS changed now. I don't see the docs as having been updated. [17:30:34] RECOVERY - Disk space on analytics1004 is OK: DISK OK [17:30:49] Reedy: got Liquidthreads installed and..... I can't reproduce it. More digging.... 
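For anyone else trying to reproduce bug 53487 locally: the fatal fires on the page-deletion hook, so a throwaway create-then-delete against a test wiki is enough to exercise the code path while watching the logs. A hedged sketch using the MediaWiki action API via requests; the wiki URL, credentials and page title are placeholders, and it assumes a MediaWiki recent enough to serve meta=tokens.

```python
# Create a dummy page, then delete it, to poke the deletion hook that was
# fataling in CirrusSearch (bug 53487). All identifiers are placeholders.
import requests

API = "http://localhost/w/api.php"
USER, PASSWORD = "TestSysop", "botpassword"

def main():
    s = requests.Session()

    # Log in.
    token = s.get(API, params={"action": "query", "meta": "tokens",
                               "type": "login", "format": "json"}
                  ).json()["query"]["tokens"]["logintoken"]
    s.post(API, data={"action": "login", "lgname": USER,
                      "lgpassword": PASSWORD, "lgtoken": token,
                      "format": "json"})

    csrf = s.get(API, params={"action": "query", "meta": "tokens",
                              "format": "json"}
                 ).json()["query"]["tokens"]["csrftoken"]

    # Create, then delete; the delete is what triggered the fatal.
    s.post(API, data={"action": "edit", "title": "Cirrus delete test",
                      "text": "dummy", "token": csrf, "format": "json"})
    r = s.post(API, data={"action": "delete", "title": "Cirrus delete test",
                          "reason": "reproducing bug 53487",
                          "token": csrf, "format": "json"})
    print(r.json())

if __name__ == "__main__":
    main()
```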
[17:31:20] manybubbles: Looking at the stack traces it seems on mw.org that some people experienced it on just deleting other pages [17:31:20] Ah, it mostly didn't. [17:31:31] You guys looking at another "Call to a member function on a non-object" LQT error [17:31:33] ? [17:31:42] Not quite [17:31:57] (03PS1) 10Jgreen: more OTRS-related mail/spam tweaks [operations/puppet] - 10https://gerrit.wikimedia.org/r/81523 [17:32:07] Deleting a LQT thread on mw.o gives an error like that in CirrusSearch code [17:32:34] https://bugzilla.wikimedia.org/show_bug.cgi?id=53487#c1 [17:33:18] But also occurs in non LQT threads [17:33:27] dr0ptp4kt: wikipedia.org. 600 IN TXT "google-site-verification=yVWXbnKBVBeI5wD3qQIQU8oEiexGBahqESBgrDDQTro" [17:34:11] (03CR) 10Jgreen: [C: 032 V: 031] more OTRS-related mail/spam tweaks [operations/puppet] - 10https://gerrit.wikimedia.org/r/81523 (owner: 10Jgreen) [17:34:12] So deleting stuff gives an error in CirrusSearch? Sometimes that happens to be an LQT thread? [17:34:37] Indeed [17:34:45] But apparently only on mediawikiwiki [17:35:02] yeah, I can't reproduce it locally. [17:35:15] I'd be tempted to put that fix live on mediawiki.org and see how it fares [17:35:41] Reedy: If you think it is good then we may as well. [17:35:53] I'm just happy to get another set of eyes on it [17:36:36] ^d: we'd like to push an update to stop some errors during deletes that seem to only come in production [17:38:43] Coren, thanks for the review. i see the TXT record in DNS already. does that autodeploy or something? [17:39:55] Coren, btw, the goog webmaster tools console verified the record, too. thanks again! [17:40:29] manybubbles: How far behind master is production currently? [17:40:40] ^demon: 1 commit I believe [17:40:54] ^demon: did you open up a bug for that other issue? [17:42:38] (03PS1) 10Jgreen: more spamassassin config variables in template [operations/puppet] - 10https://gerrit.wikimedia.org/r/81525 [17:43:25] <^demon> No, I didn't. [17:43:42] (03CR) 10Jgreen: [C: 032 V: 031] more spamassassin config variables in template [operations/puppet] - 10https://gerrit.wikimedia.org/r/81525 (owner: 10Jgreen) [17:43:56] https://gerrit.wikimedia.org/r/81526 Updates CirrusSearch to master in 1.22wmf14 [17:45:22] ^demon: Do you want to deploy it or shall I? [17:45:31] <^demon> Go ahead :) [17:46:45] ^d: that backtrace you linked above doesn't look like us. [17:46:46] !log reedy synchronized php-1.22wmf14/extensions/CirrusSearch/ 'Fix bug 53487' [17:46:51] Logged the message, Master [17:46:51] Not that it isn't bad [17:46:55] Coren or ^demon, would proper protocol dictate that I revert the now unnecessary change 80660? i'd like to have that removed from production now that we've gone dns txt on this. [17:47:22] <^demon> Not a clue, I know zilch about this really :) [17:47:47] I dunno. It's certainly harmless. Are you sure Google won't check it again in the future? 
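Coren's paste above shows the TXT record already answering. A quick programmatic check of the same thing (equivalent to `dig TXT wikipedia.org`), assuming dnspython is available; the token is the one quoted in the log:

```python
# Confirm a google-site-verification TXT record is visible in DNS.
import dns.resolver

def has_verification_token(name, token):
    answers = dns.resolver.resolve(name, "TXT")   # dnspython >= 2.0
    for rdata in answers:
        # TXT rdata is a sequence of byte strings; join and decode them.
        value = b"".join(rdata.strings).decode("utf-8", "replace")
        if value.startswith("google-site-verification=") and token in value:
            return True
    return False

if __name__ == "__main__":
    print(has_verification_token(
        "wikipedia.org", "yVWXbnKBVBeI5wD3qQIQU8oEiexGBahqESBgrDDQTro"))
```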
[17:48:34] !log restarting some datanodes [17:48:36] PROBLEM - DPKG on analytics1014 is CRITICAL: Connection refused by host [17:48:37] PROBLEM - DPKG on analytics1011 is CRITICAL: Connection refused by host [17:48:40] Logged the message, Master [17:48:44] they might, maybe look at other top 10 website's dns entries to see if they still have them (other than microsfot and yahoo, of course) [17:49:03] manybubbles: Looks to be fixed for me [17:49:16] PROBLEM - DPKG on analytics1013 is CRITICAL: Timeout while attempting connection [17:49:16] PROBLEM - Host analytics1012 is DOWN: PING CRITICAL - Packet loss = 100% [17:49:16] PROBLEM - Host analytics1014 is DOWN: PING CRITICAL - Packet loss = 100% [17:49:37] LinkedIn might be a good choice based on this list: http://www.alexa.com/topsites [17:50:09] or twitter, they use google analytics [17:50:14] <^demon> manybubbles: We're averaging about 4 pages/s for mw.org. We've gotta find a way to improve this :) [17:50:37] PROBLEM - Host analytics1011 is DOWN: PING CRITICAL - Packet loss = 100% [17:50:37] PROBLEM - Host analytics1013 is DOWN: PING CRITICAL - Packet loss = 100% [17:50:55] ^demon: Threads! [17:51:01] forkforkfork [17:51:08] <^demon> I tried farming it out to about 4 processes. [17:51:15] <^demon> They all averaged about 4 pages/s. [17:51:24] <^demon> Which means I was doing a whopping 16/s. Screw that :) [17:51:26] RECOVERY - Host analytics1012 is UP: PING OK - Packet loss = 0%, RTA = 1.82 ms [17:51:27] ^demon: yeah, I'm not happy about it [17:51:36] <^demon> I know the script can theoretically do around 50/s/thread. [17:51:36] RECOVERY - Host analytics1011 is UP: PING OK - Packet loss = 0%, RTA = 0.75 ms [17:51:37] RECOVERY - Host analytics1013 is UP: PING OK - Packet loss = 0%, RTA = 0.44 ms [17:51:38] ^demon: 16 is 4 times better! [17:51:46] RECOVERY - Host analytics1014 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [17:51:49] ^demon: not when it has lots of templates to resolve [17:52:05] <^demon> Yeah. But the fact that we're not hitting the cache as much kinda sucks :\ [17:56:32] ^demon: I haven't done enough digging to find out why our cache hit rate is so low. one of the other problems is that if we do start hitting the cache we could push user traffic out by crawling the whole thing [17:56:43] (03PS1) 10Jgreen: fix spam/ham mbox paths in spamassassin training script [operations/puppet] - 10https://gerrit.wikimedia.org/r/81530 [17:56:44] (03PS3) 10Andrew Bogott: Turn on pluginsync. [operations/puppet] - 10https://gerrit.wikimedia.org/r/77378 [17:56:45] (03PS4) 10Andrew Bogott: Move base class and subclasses into a 'base' module. [operations/puppet] - 10https://gerrit.wikimedia.org/r/77332 [17:57:44] (03CR) 10Jgreen: [C: 032 V: 031] fix spam/ham mbox paths in spamassassin training script [operations/puppet] - 10https://gerrit.wikimedia.org/r/81530 (owner: 10Jgreen) [17:58:40] akosiaris1: ah. sorry. yes, please fix that if you haven't already [18:00:41] ^demon and Reedy: question: have we slowed down page updates appreciably? I imagine they are slower but I wonder how much. [18:02:37] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:03:26] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time [18:03:57] ^demon: it looks like load has gone nuts on testsearch1002 [18:04:13] <^demon> testsearch1002 I think is still freaking out for other reasons. [18:04:17] <^demon> 1001 and 3 are just fine. 
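On the 4 pages/s indexing rate: the farming-out ^demon describes amounts to chunking the page-id space and handing ranges to a pool of workers. A rough sketch of that shape; index_range() is a stand-in for whatever actually indexes a range (the log doesn't show the real script's flags), and the counts are taken loosely from the conversation.

```python
# Fan indexing work out over a pool of worker processes, one id range each.
from multiprocessing import Pool

MAX_PAGE_ID = 130000      # ~130k pages on mediawiki.org per the log
WORKERS = 4
CHUNK = 5000

def index_range(bounds):
    start, end = bounds
    # Placeholder: run the real indexer for pages with start <= id < end.
    print("indexing pages %d..%d" % (start, end - 1))
    return end - start

def main():
    chunks = [(lo, min(lo + CHUNK, MAX_PAGE_ID + 1))
              for lo in range(0, MAX_PAGE_ID + 1, CHUNK)]
    with Pool(WORKERS) as pool:
        done = sum(pool.map(index_range, chunks))
    print("indexed %d page slots" % done)

if __name__ == "__main__":
    main()
```

As the rest of the exchange notes, each worker was bounded by template parsing and a low parser-cache hit rate, so adding processes only multiplied a slow baseline.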
[18:06:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:07:26] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.135 second response time [18:10:14] (03PS1) 10Ottomata: Moving Hadoop JournalNodes to analytics10[123] [operations/puppet] - 10https://gerrit.wikimedia.org/r/81533 [18:10:29] (03CR) 10Ottomata: [C: 032 V: 032] Moving Hadoop JournalNodes to analytics10[123] [operations/puppet] - 10https://gerrit.wikimedia.org/r/81533 (owner: 10Ottomata) [18:11:52] Reedy: You mind if I do a quick deploy? Looks like you and chad are the only ones doing stuff on tin right now.. [18:12:18] (03PS1) 10Jgreen: enable train_spamassassin script on iodine [operations/puppet] - 10https://gerrit.wikimedia.org/r/81534 [18:12:38] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:13:13] ^demon, Coren: will leave the file on webroot in place for the time being. [18:13:28] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 1.250 second response time [18:14:35] (03CR) 10Jgreen: [C: 032 V: 031] enable train_spamassassin script on iodine [operations/puppet] - 10https://gerrit.wikimedia.org/r/81534 (owner: 10Jgreen) [18:18:38] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:19:29] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 7.844 second response time [18:19:33] Coren: did you do any verification before merging that TXT record? [18:19:56] Coren: i.e. that it really came from dr0ptp4kt, not impersonated [18:21:13] paravoid: ... well, it came from someone having dr0ptp4kt's credentials and currently on IRC with his cloak at the very least, and consistent with the ongoing discussion. I wouldn't have gauged this to be a skype-worthy moment. [18:21:49] (i.e.: the email wouldn't have been enough, the gerrit changeset was for me) [18:22:18] <^demon> manybubbles: I might end up setting a new world record for thumb twiddling today if it takes this long to index :)\ [18:22:23] csteipp: I'm not doing anything so go ahead for me [18:22:54] mark, kafka brokers are back up, more using different partition data, you might want to restart your varnishkafka instance [18:22:55] ^demon: I'm just going go go back to fixing bugs I guess. how many articles is it going to index? [18:23:12] we should also get you testing with magnus' newer librdkafka soon [18:23:17] <^demon> manybubbles: ~130k I think. [18:23:40] paravoid: Feeling jittery given the recent antics of the SEA? [18:23:58] feeling paranoid in general :) [18:24:23] but if anything, the recent events support my social engineering paranoia :P [18:24:28] RECOVERY - MySQL Replication Heartbeat on db43 is OK: OK replication delay -1 seconds [18:24:29] RECOVERY - MySQL Slave Delay on db43 is OK: OK replication delay 0 seconds [18:25:06] paravoid: !!!! my roommate's brother was detained in Greece when he tried to fly to Turkey for "draft dodging" or whatever you all would call it. [18:25:49] paravoid: they're both Greece citizens along with US, but haven't lived there since they were really young, and they need to prove they haven't spent 90 days in Greece over the last 11 years or else they need to do the mandatory service thing [18:26:01] I had to scan my roommate's old passport to prove he hasn't. 
[18:26:04] paravoid: If it makes you feel any better, I'm generally paranoid as well -- I worked long enough in security for that. :-) But this request was both consistent, authenticated and expected so no alarm bells were rung. I'm generally imprevious to bavarian firedrills. :-) [18:26:17] greg-g: wtf [18:26:44] greg-g: This is why, if you're a citizen of a country with mandatory service, you don't enter that country [18:26:44] greg-g: 90 days over the last 11 years? that sounds wrong [18:26:53] paravoid: that's what I thought when I read the email from his this morning with the subject "Detained in Greece (maybe)" [18:27:18] greg-g: are they free now? do you need lawyer contacts? [18:27:23] it's a lot more days than that (but who knows what the particular official(s) knew about the law anyways) [18:27:38] (Of course apergos knows random things about Greek immigration law) [18:27:48] (Says the guy that knows all sorts of things he really shouldn't about US immigration law) [18:27:52] <^demon> RoanKattouw: This is why my brother doesn't return to Russia. [18:27:54] paravoid: I don't think yet. He has a decently well to do family, so I'm sure he can arrange as needed. He hasn't tried to leave yet so he's fine. Not sure about his brother. [18:27:58] Yeah exactly [18:28:01] pretty random yeah. although the military service thing I only know form other people who have *cough* dodged it [18:28:02] <^demon> He hasn't officially renounced his Russian citizenship. [18:28:07] but yeah, 90d in 11y is what his email said :/ [18:28:18] ^demon: At least Russia /allows/ you to renounce citizenship? [18:28:26] Morocco famously doesn't allow you to, except by royal decree [18:28:36] <^demon> I believe so. [18:28:43] Russian citizenship renounces you? [18:28:43] woah that's crazy [18:29:02] YuviPanda++ [18:29:07] greg-g: how long did they stay in greece? [18:29:14] Which has caused interesting controversies in .nl where some people are really anal about immigrants having to drop their other citizenships when they get naturalized, and so dual citizens from naturalization shouldn't exist, except they do if they're Moroccan [18:29:14] was it over 30 days? [18:29:36] paravoid: my roommate arrived on 8/19 [18:29:40] even if you're not considered a non-resident, you have the right to be in the country for 30 days [18:29:45] (plus a few days per election) [18:29:48] I think they also have some sort of mandatory service thing, which you can't get rid of by renouncing your citizenship because you can't [18:29:53] well us citizens can be in greece for 90 [18:29:54] paravoid: not sure about his brother who is detained [18:30:19] I'm reading the law now [18:31:02] to be considered a non-resident you must live outside the country for > 180 days every year for the past 7 or 11 years [18:31:18] Morocco also helpfully conveys citizenship on children of Moroccans born abroad, and so we have 2nd or 3rd generation immigrants that need to be careful about traveling to Morocco when they're a certain age, because if the government figures out they're a citizen they get in trouble for service stuff [18:31:19] 7 if you were there for work, 11 if not (e.g. student) [18:31:25] so, now I'm uber paranoid because I have a scan of his passport on my laptop, and my work laptop was stolen 2 weeks ago, so I'm going to clutch this thing like my life depended on it [18:31:27] greg-g, did you confirm that email isn't one of those frauds to steal money? 
[18:31:56] Platonides: it was either my roommate, or someone who knew exactly where in his room his passport was. [18:32:11] and if that's not the case you can visit for 30 days per year [18:32:21] plus "up to" 40 days per election period [18:32:23] paravoid: huh, then at least my roommate should be fine [18:32:39] (and how did he enter the country without the passport? :P) [18:32:43] (I looked over his passport stamps to see) [18:32:47] !log csteipp synchronized php-1.22wmf14/extensions/OAuth [18:32:47] Platonides: his old one [18:32:53] Logged the message, Master [18:33:04] I never got my passport stamped [18:33:05] Platonides: he needed his records for the past 11years (or whatever) [18:33:17] yeah, that's reasonable [18:33:29] fsvo reasonable [18:33:33] :) [18:33:35] which would clearly be much easier to fetch for the government than your brother [18:33:38] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:33:46] army is not [18:33:48] Platonides: yeah, like, they don't have the records themselves? [18:33:50] but that's what the law says [18:34:03] hahaha [18:34:06] that was funny [18:34:08] !log csteipp synchronized php-1.22wmf14/extensions/Wikibase [18:34:13] so, the law is ever worse than that [18:34:14] Logged the message, Master [18:34:28] it says that you have to apply a certificate from the Greek embassy in the US [18:34:31] !log csteipp synchronized php-1.22wmf14/extensions/CheckUser [18:34:33] that certifies that you're a resident in the US [18:34:37] Logged the message, Master [18:34:54] and you have to collect all proof, and pay 10 euros [18:35:10] then you get the certificate, that will be on record and they wouldn't have stopped them at all (presumably) [18:35:28] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.590 second response time [18:35:41] paravoid: wow [18:35:51] !log csteipp synchronized php-1.22wmf13/extensions/CheckUser [18:35:55] well, good luck to Ionas (my roommate) then [18:35:57] Logged the message, Master [18:35:58] makes sense [18:36:01] if you don't want to go through all that, there's a nice workaround [18:36:12] only fly intra-schengen with greece [18:36:13] although I would have expected that he needed the certificate from the US embassy in Greek instead [18:36:17] pay the official $EUR500? [18:36:18] go US -> frankfurt -> athens for example [18:36:20] !log csteipp synchronized php-1.22wmf13/extensions/Wikibase [18:36:25] Logged the message, Master [18:36:34] no passport check in athens, noone will ever care about all that in frankfurt, profit [18:36:36] *in Greece [18:36:39] PROBLEM - DPKG on analytics1027 is CRITICAL: Timeout while attempting connection [18:36:46] * greg-g notes that down [18:36:56] And the same when you leave [18:36:59] yes [18:37:00] Schengen has exit controls too [18:37:08] PROBLEM - Host analytics1026 is DOWN: PING CRITICAL - Packet loss = 100% [18:37:08] PROBLEM - Host analytics1027 is DOWN: PING CRITICAL - Packet loss = 100% [18:37:13] So in order to leave Greece "safely" you'd have to fly to another Schengen country first, then leave Schengen from there [18:37:22] yep [18:37:35] oh geopolitics [18:37:36] they probably went from athens -> turkey, which has border controls [18:37:39] It's easier in the Western Europe part of Schengen where there are land borders with other Schengen countries that you can just drive across like a US state border [18:38:03] did they use their Greek passports to pass? 
[18:38:04] * greg-g goes to get lunch and contemplate free-to-roam [18:38:09] So you can like fly into Frankfurt and drive or take a train from there to Amsterdam or Brussels or wherever it is that you want to go [18:38:10] paravoid: dunno [18:38:13] ah now there's a question [18:38:18] or did they get flagged with the US passport? [18:38:20] greg-g: I'm trying to sync [18:38:22] if they gave their us passports they hould never have run into a problem [18:38:23] paravoid, I suppose so [18:38:23] not sure [18:38:29] *should [18:38:33] with a US passport it would have been harder to flag them [18:38:35] greg-g: sorry, to sync CentralNotice in a few minutes. please let me know if there's a reason to hold off. [18:38:38] I'd be amazed if they did, but you never know [18:38:40] Although in some countries it's not technically legal to do that [18:38:53] Some countries require that, if you have a valid passport issued by them, you must use it to enter that country [18:38:57] and 'I have a US passport' should be a good hint that he is a US resident! [18:39:02] awight: yeah, you're good until 1pm [18:39:04] they would have to show some id on intra schengen even, but the us passport should be fine for that [18:39:10] greg-g: OK thanks! [18:39:21] apergos: technically illegal though [18:39:28] hush you [18:39:28] for the US as well [18:39:37] Yeah you have to show ID for intra-Schengen flights but only at security [18:39:38] perhaps they don't have a US passport [18:39:58] if you have a passport issued by the country you visit you are required to enter the country with that passport [18:40:04] at least my roommate does (I have no clue on what's going on with his detained brother) [18:40:16] so if you have dual Greek/US citizenship and you try to enter the US with the Greek one [18:40:17] ... has a US passport [18:40:24] ok, time for lunch [18:40:24] I'm assuming you'll be in deep trouble [18:40:53] they would screen you when with the us passsport they would (theoretically) say 'welcome home' [18:40:58] You may be [18:41:13] I mean they might try to kick you out after 90 days because you entered as a foreigner [18:41:15] in practice I have gotten worse treatment going into the us with a us passport than coming here with the same one [18:41:28] ouch :) [18:41:50] * paravoid is trying to get a second citizenship [18:41:57] I actually have an appointment tomorrow [18:42:01] but since greeks travel to the us without a special visa any more they are not likely to raise a fuss [18:42:06] oh? [18:42:16] RECOVERY - Host analytics1026 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [18:42:16] RECOVERY - Host analytics1027 is UP: PING OK - Packet loss = 0%, RTA = 0.44 ms [18:42:23] which country if I may pry? [18:42:27] cyprus [18:42:37] * apergos wonders if that's a good idea [18:42:44] I have my reasons [18:42:45] :) [18:42:54] I am sure you do! [18:42:59] very long story [18:43:00] and I'll hear 'em over beer someday maybe [18:43:22] as long as it doesn't land you in a bunch of new horrible obligations [18:44:00] ^demon, so you have Russian roots? 
[18:44:27] PROBLEM - RAID on analytics1027 is CRITICAL: Connection refused by host [18:44:27] PROBLEM - Disk space on analytics1026 is CRITICAL: Connection refused by host [18:44:27] PROBLEM - RAID on analytics1026 is CRITICAL: Connection refused by host [18:44:27] PROBLEM - SSH on analytics1027 is CRITICAL: Connection refused [18:44:27] PROBLEM - SSH on analytics1026 is CRITICAL: Connection refused [18:44:36] PROBLEM - Disk space on analytics1027 is CRITICAL: Connection refused by host [18:44:43] it's very much related to the above conversation [18:44:46] PROBLEM - DPKG on analytics1026 is CRITICAL: Connection refused by host [18:44:56] (03CR) 10Dzahn: "yea, you did in change 70242" [operations/puppet] - 10https://gerrit.wikimedia.org/r/81443 (owner: 10Dzahn) [18:45:21] hope you get it all worked out [18:45:33] hope so, it's the second appointment :) [18:45:48] my roots are such a mess that they got confused :) [18:45:56] :-D [18:47:07] I have papers from my grandfather that say "sujet Britannique est d'origine Grecque de Chypre" [18:47:45] in french *and* arabic [18:48:27] arabic? badass [18:49:26] PROBLEM - MySQL Replication Heartbeat on db43 is CRITICAL: CRIT replication delay 183 seconds [18:49:36] PROBLEM - MySQL Slave Delay on db43 is CRITICAL: CRIT replication delay 188 seconds [18:52:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:53:26] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [18:54:26] RECOVERY - SSH on analytics1027 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [18:54:26] RECOVERY - SSH on analytics1026 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [18:55:20] <^demon> MaxSem: I do not. Brother and I are both adopted, he's from Russia originally. [18:55:28] ah [18:56:10] ^demon: that's cool! your parents must be awesome. [18:56:46] PROBLEM - NTP on analytics1026 is CRITICAL: NTP CRITICAL: No response from NTP server [18:56:56] PROBLEM - NTP on analytics1027 is CRITICAL: NTP CRITICAL: No response from NTP server [18:57:35] travelling to Russia is generally safe [18:57:52] unless you're Snowden :P [18:57:52] you can't be drafted unless you've signed a summon paper [18:58:08] Platonides, even Cuba refused to accept him [18:58:15] I know [18:58:27] RECOVERY - MySQL Replication Heartbeat on db43 is OK: OK replication delay -1 seconds [18:58:36] RECOVERY - MySQL Slave Delay on db43 is OK: OK replication delay 0 seconds [18:59:11] I guess dual citizens need to visit a military comissariat themselves to get drafted if they're not residents:) [18:59:13] although IMHO this shows the pressure that was over all those countries [18:59:40] difference being that Russia has enough balls to say NSA to f/o [19:03:30] the FSB can take care of itself :P [19:05:11] (03CR) 10Ori.livneh: [C: 031] tcpircbot: rm unused `os` module import [operations/puppet] - 10https://gerrit.wikimedia.org/r/81489 (owner: 10Hashar) [19:08:31] ori-l: what does the Front-Side Bus have to do with this? [19:09:14] with the Nunavut Settlement Area, you mean? 
[19:12:50] yea, once Microsoft had a key for that [19:13:24] PROBLEM - MySQL Replication Heartbeat on db43 is CRITICAL: CRIT replication delay 191 seconds [19:13:34] PROBLEM - MySQL Slave Delay on db43 is CRITICAL: CRIT replication delay 196 seconds [19:21:25] RECOVERY - MySQL Replication Heartbeat on db43 is OK: OK replication delay -1 seconds [19:21:34] RECOVERY - MySQL Slave Delay on db43 is OK: OK replication delay 0 seconds [19:23:21] (03CR) 10Andrew Bogott: [C: 032] remove old RT 3.8 stuff and just keep RT 4.x Apache stuff [operations/puppet] - 10https://gerrit.wikimedia.org/r/81443 (owner: 10Dzahn) [19:28:48] !log awight synchronized php-1.22wmf13/extensions/CentralNotice 'Updating CentralNotice to wmf_deploy' [19:28:54] Logged the message, Master [19:29:24] !log awight synchronized php-1.22wmf14/extensions/CentralNotice 'Updating CentralNotice to wmf_deploy' [19:29:29] Logged the message, Master [19:32:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:32:50] andrewbogott: thx [19:33:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time [19:34:03] (03PS1) 10Dzahn: run stats update for wikivoyages at 8pm [operations/puppet] - 10https://gerrit.wikimedia.org/r/81558 [19:35:17] (03CR) 10Dzahn: [C: 032] run stats update for wikivoyages at 8pm [operations/puppet] - 10https://gerrit.wikimedia.org/r/81558 (owner: 10Dzahn) [19:40:38] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:44:28] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time [19:47:04] (03PS1) 10Ori.livneh: Drop 'country' from NavTiming graphs; rename to 'browser' [operations/puppet] - 10https://gerrit.wikimedia.org/r/81561 [19:47:56] (03PS2) 10Ori.livneh: Drop 'country' from NavTiming graphs; rename to 'browser' [operations/puppet] - 10https://gerrit.wikimedia.org/r/81561 [19:50:08] binasher: ! [19:50:44] (03CR) 10Ori.livneh: "The < 60000 test is a crude filter to exclude extreme outliers. I'd ideally like to use a subset of the W3C's test suite (http://w3c-test." [operations/puppet] - 10https://gerrit.wikimedia.org/r/81561 (owner: 10Ori.livneh) [19:51:55] binasher: got a sec to merge ^^ ? [19:52:38] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:53:29] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.136 second response time [19:54:46] or someone else, for that matter; it's a very small diff (+2 / -3) [20:00:12] blargh. [20:01:52] (03CR) 10coren: [C: 032] "Seems sane." [operations/puppet] - 10https://gerrit.wikimedia.org/r/81561 (owner: 10Ori.livneh) [20:02:01] danke [20:02:04] np [20:05:21] (03PS2) 10coren: sudo right for hashar on lanthanum (Jenkins slave) [operations/puppet] - 10https://gerrit.wikimedia.org/r/81203 (owner: 10Hashar) [20:05:42] (03CR) 10coren: [C: 032] "Given that he has access to the master, the slave is a no-brainer." [operations/puppet] - 10https://gerrit.wikimedia.org/r/81203 (owner: 10Hashar) [20:06:14] Coren: note jenkins is unable to merge on operations/puppet :) [20:06:35] hashar: How so? 
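On ori-l's NavTiming change and the "< 60000" review comment above: the idea is simply to drop implausible Navigation Timing samples before they reach statsd/graphite. A minimal sketch with placeholder metric names and statsd host; the 60-second cutoff is the one mentioned in the review, and the real navtiming service may of course do more.

```python
# Crude outlier filter for Navigation Timing samples sent to statsd.
import socket

STATSD = ("statsd.example.org", 8125)   # placeholder host
MAX_MS = 60000                          # drop anything >= one minute

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

def record_timing(metric, value_ms):
    """Send one timing sample, unless it is an obvious outlier."""
    if not isinstance(value_ms, (int, float)):
        return
    if value_ms <= 0 or value_ms >= MAX_MS:
        return                           # discard extreme or nonsensical values
    payload = "%s:%d|ms" % (metric, value_ms)
    sock.sendto(payload.encode("ascii"), STATSD)

# e.g. record_timing("frontend.navtiming.responseStart.browser.Firefox", 842)
```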
[20:06:36] coren, there's a three day wait policy is all [20:06:37] (03PS1) 10Chad: Secure login to true for all wikis [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/81563 [20:07:04] Coren: yeah, I think you get to wait for a few more people to cast their vote on the RT ticket, just in case [20:07:08] but thanks :-] [20:07:40] it's more like 'wait for no one to speak up. then merge'. anyways whatever [20:07:41] apergos: I thought we discussed this already; that once there was access the three day period was pointless for replicating to other equivalent boxen? [20:08:34] (03PS2) 10CSteipp: Secure login to true for all wikis [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/81563 (owner: 10Chad) [20:08:42] I probably don't remember the discussion [20:08:49] IMO, he got sudo with RT 4101; adding this to the slave is not a new access request. [20:09:32] (03CR) 10CSteipp: [C: 032] Secure login to true for all wikis [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/81563 (owner: 10Chad) [20:11:18] (03Merged) 10jenkins-bot: Secure login to true for all wikis [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/81563 (owner: 10Chad) [20:14:00] Coren: did you merge that change on puppetmaster? doing a manual puppet run on hafnium didn't pull in the change; odd.. [20:14:18] ori-l: Not yet, I was just about to after talking with apergos. :-) [20:14:29] oh, np. i was just confused. no rush. [20:14:50] ori-l: Done now though. [20:14:55] ori-l: btw, I switched Middle-East to esams today [20:15:00] much earlier [20:15:11] eh at the point it's merged in gerrit I consider it a done deal (as in might as well merge on sockpuppet) [20:15:11] csteipp: ^^ too [20:15:47] paravoid: ah, I kept twiddling with the graphite graphs over the last 24h, so they won't be useful, but the data on db1047 should be good (and probably revealing) [20:17:18] i'll set myself a reminder to do a 1-wk comparison before/after a week from now [20:18:07] ok [20:18:19] be also warned that services that were blocked are likely to not be blocked now [20:18:45] until the respective state catches up :) [20:18:49] PROBLEM - Puppet freshness on ssl1 is CRITICAL: No successful Puppet run in the last 10 hours [20:19:03] right [20:19:42] <^demon> manybubbles: I hit that stupid fatal from the RSS extension while reindexing over lunch. I'm in the middle of SSL right now, will have a look at getting mw.org properly indexed after. [20:20:04] ori-l: we should think at some point at providing EL at some non-geodns hostnames [20:20:18] ori-l: then run A/B comparisons between all of our sites across countries [20:20:31] ^demon: Ah! Now I understand. We might want to be more protectionist about that failures during forceSearchIndex so it just keeps going. [20:20:32] ori-l: that would help us shift traffic to the right DC [20:20:44] ori-l: but we should postpone this for after ulsfo is put in place [20:22:39] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:23:16] manybubbles: ^d cirrus search is on wikidata beta labs, right? [20:23:18] !log demon Started syncing Wikimedia installation... : Secure login on alllllll the wikis [20:23:24] Logged the message, Master [20:23:29] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.136 second response time [20:23:30] <^d> aude: Should be on all beta labs. [20:23:33] ok, cool [20:23:36] aude: yeah. 
[20:23:41] it found my new item immediately [20:23:46] in special search [20:24:36] * aude waits until it finds stuff in descriptions [20:24:48] might not work [20:24:49] PROBLEM - Puppet freshness on ssl1006 is CRITICAL: No successful Puppet run in the last 10 hours [20:25:08] (03PS1) 10Bsitu: Disable the job queue that processes echo notification [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/81568 [20:25:29] PROBLEM - MySQL Replication Heartbeat on db1046 is CRITICAL: CRIT replication delay 181 seconds [20:26:39] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:27:10] paravoid: filed as https://bugzilla.wikimedia.org/show_bug.cgi?id=53497 [20:27:29] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.172 second response time [20:27:29] PROBLEM - MySQL Replication Heartbeat on db1046 is CRITICAL: CRIT replication delay 182 seconds [20:27:37] ori-l: you're awesome [20:28:04] ori-l: I'm adding binasher to Cc, he brought up EL at some minimizing latency discussion we had a few ops meetings ago [20:28:15] sure [20:28:40] (i'm mentioning it to give credit :) [20:29:29] PROBLEM - MySQL Replication Heartbeat on db1046 is CRITICAL: CRIT replication delay 182 seconds [20:31:39] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:31:49] PROBLEM - Puppet freshness on ssl1008 is CRITICAL: No successful Puppet run in the last 10 hours [20:32:30] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.123 second response time [20:36:34] !log demon Finished syncing Wikimedia installation... : Secure login on alllllll the wikis [20:36:39] Logged the message, Master [20:36:45] PROBLEM - Puppet freshness on cp1044 is CRITICAL: No successful Puppet run in the last 10 hours [20:39:48] Oh, hey, is anyone able to poke Asher with some urgency? [20:39:52] binasher: !!! [20:40:44] !log demon synchronized wmf-config/InitialiseSettings.php 'All wikis to secure login' [20:40:51] Logged the message, Master [20:45:45] PROBLEM - Puppet freshness on ssl1001 is CRITICAL: No successful Puppet run in the last 10 hours [20:46:45] PROBLEM - Puppet freshness on amssq47 is CRITICAL: No successful Puppet run in the last 10 hours [20:46:55] Coren: I'm assuming he's at lunch with Ryan and Daniel? [20:47:10] greg-g: Curse that need for food! [20:47:29] I know, right? [20:49:45] PROBLEM - Puppet freshness on ssl1003 is CRITICAL: No successful Puppet run in the last 10 hours [20:49:45] PROBLEM - Puppet freshness on ssl1005 is CRITICAL: No successful Puppet run in the last 10 hours [20:49:45] PROBLEM - Puppet freshness on ssl4 is CRITICAL: No successful Puppet run in the last 10 hours [20:49:59] Whatever does the database replication for lab seems to have gone completely berkerk. [20:52:38] All the replicated databases from S3, S4 and S5 have disapeared in the last hour. [20:52:45] PROBLEM - Puppet freshness on cp1043 is CRITICAL: No successful Puppet run in the last 10 hours [20:52:45] PROBLEM - Puppet freshness on ssl1007 is CRITICAL: No successful Puppet run in the last 10 hours [20:53:36] (03PS1) 10Ori.livneh: Notify navtiming service when its Python code changes [operations/puppet] - 10https://gerrit.wikimedia.org/r/81572 [20:53:52] uh oh [20:53:57] maybe springle would be able to help [20:54:17] but away too hm [20:55:36] apergos: he's back now. 
[20:55:45] PROBLEM - Puppet freshness on ssl1002 is CRITICAL: No successful Puppet run in the last 10 hours [20:55:45] PROBLEM - Puppet freshness on ssl3001 is CRITICAL: No successful Puppet run in the last 10 hours [20:55:46] yep [20:58:45] PROBLEM - Puppet freshness on ssl1004 is CRITICAL: No successful Puppet run in the last 10 hours [21:00:45] PROBLEM - Puppet freshness on ssl3003 is CRITICAL: No successful Puppet run in the last 10 hours [21:01:45] PROBLEM - Puppet freshness on ssl1009 is CRITICAL: No successful Puppet run in the last 10 hours [21:02:45] PROBLEM - Puppet freshness on ssl3002 is CRITICAL: No successful Puppet run in the last 10 hours [21:02:45] PROBLEM - Puppet freshness on ssl3 is CRITICAL: No successful Puppet run in the last 10 hours [21:05:48] PROBLEM - Puppet freshness on ssl2 is CRITICAL: No successful Puppet run in the last 10 hours [21:13:40] * Coren prepares a dunce cap for himself. [21:15:57] there are some issues with certificate https://meta.wikimedia.org/wiki/Wikimedia_Forum#arbcom.nl.wikipedia.org.2C [21:17:40] they are not issued for sub-subdomains [21:17:41] For future reference, not finding S3's databases on S4 is perfectly normal. [21:18:29] Thanks Danny_B [21:18:31] Ryan_Lane: ^ [21:18:49] oh, right [21:18:56] heh :) [21:18:57] that's been a problem for ages [21:19:10] and those wikis are HTTPS only I believe [21:19:26] or are they not? [21:19:50] we have a bug in to rename those wikis [21:21:40] oh right [21:22:13] the problem with that was external storage db names [21:22:41] binasher: https://bugzilla.wikimedia.org/show_bug.cgi?id=31335 [21:23:21] binasher: Do you know if there is a nagios plugin to pop up queries over a certain duration? [21:23:37] binasher: https://bugzilla.wikimedia.org/show_bug.cgi?id=19986 [21:23:54] comment 12, 13 [21:34:08] PROBLEM - MySQL Slave Delay on db43 is CRITICAL: CRIT replication delay 183 seconds [21:34:28] PROBLEM - MySQL Replication Heartbeat on db43 is CRITICAL: CRIT replication delay 205 seconds [21:35:05] RECOVERY - MySQL Slave Delay on db43 is OK: OK replication delay 0 seconds [21:35:25] RECOVERY - MySQL Replication Heartbeat on db43 is OK: OK replication delay -0 seconds [21:35:35] RECOVERY - MySQL Slave Delay on db58 is OK: OK replication delay 1 seconds [21:35:55] RECOVERY - MySQL Replication Heartbeat on db58 is OK: OK replication delay 1 seconds [21:37:24] no explosions still, csteipp / ^d ? [21:42:13] Ryan_Lane: anything explode? [21:42:21] (03PS1) 10Dzahn: add missing file with large query that unions data from different tables for a all-in-one display and let that table have a http error column, requested by robih for debugging [operations/debs/wikistats] - 10https://gerrit.wikimedia.org/r/81581 [21:42:30] greg-g: it's all broken. everything. [21:42:36] oh god! [21:42:44] * greg-g hides [21:43:00] heh. hardly a blib, really [21:43:04] Ryan_Lane: are either chris or chad at their desks? [21:43:08] chris is [21:43:11] and chad [21:43:13] I wanna tell communications to publish the post [21:43:28] yell at them for me ;) [21:43:31] csteipp, ^d: ^^ [21:43:42] heh [21:44:09] <^d> Can we turn it back off again first? [21:44:18] <^d> It's not feeling like a secure login deploy if we leave it on. [21:44:27] shuddup [21:45:16] alright then, here we go [21:49:43] ^d: Ryan_Lane: random and old, but how about "#2466: rename gerrit2 account in LDAP" .."Will probably be easier to remove [21:49:46] that once 2.5/2.6 and more things are in the REST api." ? [21:49:52] 2012 [21:49:59] <^d> lol. 
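On Coren's question about a Nagios plugin for queries over a certain duration: a common shape for such a check is to read information_schema.processlist and map the worst offender onto the standard Nagios exit codes. A sketch under assumed host, credentials and thresholds; a real deployment would pull those from configuration.

```python
#!/usr/bin/env python
# Nagios/Icinga-style check: warn/crit if any query has been running
# longer than a threshold. All connection details are placeholders.
import sys
import pymysql

WARN, CRIT = 60, 300          # seconds

def main():
    conn = pymysql.connect(host="db-host", user="nagios", password="secret")
    with conn.cursor() as cur:
        cur.execute(
            "SELECT id, user, time, LEFT(info, 60) "
            "FROM information_schema.processlist "
            "WHERE command = 'Query' AND time >= %s "
            "ORDER BY time DESC", (WARN,))
        rows = cur.fetchall()

    if not rows:
        print("OK: no queries running longer than %ds" % WARN)
        sys.exit(0)
    worst = rows[0][2]
    status, code = ("CRITICAL", 2) if worst >= CRIT else ("WARNING", 1)
    print("%s: %d slow queries, longest %ds: %s"
          % (status, len(rows), worst, rows[0][3]))
    sys.exit(code)

if __name__ == "__main__":
    main()
```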
[21:55:12] binasher: I'd really want to stuff some of my DB scripts in git -- what'd be the apropriate project for 'em do you think? Where did you put your half of replication? [21:57:47] Coren: that's a really good question - i think i still need to check some things in [22:06:06] PROBLEM - Puppet freshness on pdf1 is CRITICAL: No successful Puppet run in the last 10 hours [22:19:46] PROBLEM - DPKG on ms-be1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:22:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:22:56] RECOVERY - MySQL Replication Heartbeat on db55 is OK: OK replication delay -1 seconds [22:23:06] PROBLEM - Puppet freshness on cp1063 is CRITICAL: No successful Puppet run in the last 10 hours [22:23:26] RECOVERY - MySQL Slave Delay on db55 is OK: OK replication delay 0 seconds [22:23:27] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time [22:47:48] !log demon synchronized php-1.22wmf13/extensions/CentralAuth [22:47:53] Logged the message, Master [22:48:12] !log demon synchronized php-1.22wmf14/extensions/CentralAuth [22:48:17] Logged the message, Master [22:49:17] (03PS1) 10Danny B.: cswikinews: Set AbuseFilter notifications [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/81592 [22:49:33] binasher: At first glance, it looks like operations/software is the right place for this. [23:00:56] RoanKattouw: Let me know when you are done with VE deploy, I will push out a config change [23:03:04] bsitu: will do [23:03:21] rmoen: thx, :) [23:03:27] Yeah rmoen is doing the deploy today [23:04:13] RoanKattouw, rmoen: no hurry, take your time [23:07:18] (03PS1) 10coren: maintain-replicas: the script that does the magic [operations/software] - 10https://gerrit.wikimedia.org/r/81593 [23:09:03] (03PS2) 10Dzahn: add missing file with large query that unions data from different tables for a all-in-one display. let that table have a http error column, requested by robih for debugging. 
add missing snipppet for grand total display in wiki syntax Change-Id: Ia400ee3f6 [operations/debs/wikistats] - 10https://gerrit.wikimedia.org/r/81581 [23:12:25] PROBLEM - MySQL Replication Heartbeat on db43 is CRITICAL: CRIT replication delay 188 seconds [23:13:04] PROBLEM - MySQL Slave Delay on db43 is CRITICAL: CRIT replication delay 226 seconds [23:16:34] !log rmoen synchronized php-1.22wmf14/extensions/VisualEditor/ 'Update VisualEditor to master' [23:16:39] Logged the message, Master [23:17:04] RECOVERY - MySQL Slave Delay on db43 is OK: OK replication delay 0 seconds [23:17:25] RECOVERY - MySQL Replication Heartbeat on db43 is OK: OK replication delay -1 seconds [23:22:33] bsitu: all clear :) [23:22:44] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:22:46] rmoen: thx [23:23:34] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time [23:23:38] bsitu: np [23:23:54] (03CR) 10Bsitu: [C: 032] Disable the job queue that processes echo notification [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/81568 (owner: 10Bsitu) [23:24:19] (03Merged) 10jenkins-bot: Disable the job queue that processes echo notification [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/81568 (owner: 10Bsitu) [23:28:22] !log bsitu synchronized wmf-config/InitialiseSettings.php 'Disable the job queue that processes echo notification' [23:28:27] Logged the message, Master [23:38:37] PROBLEM - MySQL Slave Delay on db58 is CRITICAL: CRIT replication delay 183 seconds [23:39:28] PROBLEM - MySQL Replication Heartbeat on db58 is CRITICAL: CRIT replication delay 239 seconds [23:42:17] PROBLEM - Puppet freshness on fenari is CRITICAL: No successful Puppet run in the last 10 hours [23:49:28] RECOVERY - MySQL Replication Heartbeat on db58 is OK: OK replication delay -0 seconds [23:49:37] RECOVERY - MySQL Slave Delay on db58 is OK: OK replication delay 0 seconds