[00:07:56] andrewbogott: https://wikitech.wikimedia.org/w/index.php?title=RT&diff=81539&oldid=47588 [00:08:54] andrewbogott: eh, if you have additions to https://wikitech.wikimedia.org/w/index.php?title=RT&diff=81539&oldid=81532 [00:09:44] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:10:13] the mail server setup part, meh, but i'll ttyl [00:10:34] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time [00:17:14] PROBLEM - Puppet freshness on ssl1 is CRITICAL: No successful Puppet run in the last 10 hours [00:23:14] PROBLEM - Puppet freshness on ssl1006 is CRITICAL: No successful Puppet run in the last 10 hours [00:30:14] PROBLEM - Puppet freshness on ssl1008 is CRITICAL: No successful Puppet run in the last 10 hours [00:34:14] PROBLEM - Puppet freshness on cp1044 is CRITICAL: No successful Puppet run in the last 10 hours [00:44:05] PROBLEM - Puppet freshness on ssl1001 is CRITICAL: No successful Puppet run in the last 10 hours [00:45:05] PROBLEM - Puppet freshness on amssq47 is CRITICAL: No successful Puppet run in the last 10 hours [00:48:05] PROBLEM - Puppet freshness on ssl1003 is CRITICAL: No successful Puppet run in the last 10 hours [00:48:05] PROBLEM - Puppet freshness on ssl1005 is CRITICAL: No successful Puppet run in the last 10 hours [00:48:05] PROBLEM - Puppet freshness on ssl4 is CRITICAL: No successful Puppet run in the last 10 hours [00:51:05] PROBLEM - Puppet freshness on ssl1007 is CRITICAL: No successful Puppet run in the last 10 hours [00:51:05] PROBLEM - Puppet freshness on cp1043 is CRITICAL: No successful Puppet run in the last 10 hours [00:54:05] PROBLEM - Puppet freshness on ssl1002 is CRITICAL: No successful Puppet run in the last 10 hours [00:54:05] PROBLEM - Puppet freshness on ssl3001 is CRITICAL: No successful Puppet run in the last 10 hours [00:54:48] (03PS1) 10Dzahn: remove old RT 3.8 stuff and just keep RT 4.x Apache stuff [operations/puppet] - 10https://gerrit.wikimedia.org/r/81443 [00:57:05] PROBLEM - Puppet freshness on ssl1004 is CRITICAL: No successful Puppet run in the last 10 hours [00:59:05] PROBLEM - Puppet freshness on ssl3003 is CRITICAL: No successful Puppet run in the last 10 hours [01:00:05] PROBLEM - Puppet freshness on ssl1009 is CRITICAL: No successful Puppet run in the last 10 hours [01:01:05] PROBLEM - Puppet freshness on ssl3002 is CRITICAL: No successful Puppet run in the last 10 hours [01:01:05] PROBLEM - Puppet freshness on ssl3 is CRITICAL: No successful Puppet run in the last 10 hours [01:04:05] PROBLEM - Puppet freshness on ssl2 is CRITICAL: No successful Puppet run in the last 10 hours [01:25:03] (grrrit-wm dead because labs is dead) [01:26:59] PROBLEM - DPKG on labstore3 is CRITICAL: Timeout while attempting connection [01:28:39] PROBLEM - Host labstore3 is DOWN: PING CRITICAL - Packet loss = 100% [01:29:09] RECOVERY - Host labstore3 is UP: PING OK - Packet loss = 0%, RTA = 26.58 ms [01:30:26] Ryan_Lane: around? got a moment for a patch? [01:30:37] sure. what's up? [01:30:47] https://gerrit.wikimedia.org/r/#/c/81447/ [01:31:24] goes the hafnium which i have sudo on, so i can force the puppet run and check for failures [01:33:30] ori-l: zsock.subscribe = b'' ? [01:34:35] the hipstery pyzmq interface relies on @property decorators [01:34:45] :D [01:34:46] an empty subscription in zmq pubsub is 'subscribe to everything' [01:35:17] it's a bit strange, but that's zmq pubsub semantics. 
if you don't declare a subscription you don't get anything. [01:38:06] gah, the print statement on line 56 is superfluous, i'll remove it [01:39:06] I'm still unfamiliar with the b'' notation [01:39:26] can you send me a pointer on how it's used? [01:39:39] just means it's bytes rather than unicode [01:40:09] the from __future__ import unicode_literals statement means that 'abc' (in the file) is a unicode string [01:40:20] and unicode strings aren't bytes [01:40:37] ah. ok. [01:40:58] i'll just update it to do it the non-hipstery way [01:41:11] nah, it's ok [01:41:23] k, thx [01:45:55] I hate how we have things spread between role, misc, modules and manifests [01:45:59] misc is especially fucked [01:46:18] yeah, i know :/ i'm cleaning it up gradually [01:46:22] i added a pystatsd module yesterday [01:46:43] nobody/nogroup? :( [01:46:44] heh [01:46:51] but the graphite situation is fucked, gdash is literally deployed via puppet and its code is in the puppet repo [01:47:00] yeah, gdash will move to git-deploy [01:47:29] i filed a bug with 'volunteer-needed' keyword because i figured it's a good opportunity for someone to get familiar with git-deploy [01:47:33] * Ryan_Lane nods [01:47:50] maybe I'll spend some time tomorrow cleaning up git-deploy's config in puppet [01:47:57] to make most of the config optional [01:48:25] hm. is this the best way to handle inits? [01:48:28] as a template? [01:48:35] upstart is kind of a shitty system [01:48:50] is /etc/default/ not used in upstart? [01:49:02] i think it's an init.d thing, but i dunno [01:49:26] I'm not totally sure how upstart will handle a change in its file [01:49:43] oh, did i not set a subscribe? [01:49:54] I'd imagine it should be fine since it's supposed to track the process itself [01:50:08] I think you did [01:50:23] oh [01:50:25] no, you didn't [01:51:02] I think changing the init and notifying should work fine [01:51:17] changing the init? [01:51:19] it should kill the tracked pid and restart based on the new file [01:51:45] doing this kind of thing in the old init system could occasionally lead to weirdness [01:51:48] ah no, that's not necessary because it doesn't fork; upstart will daemonize it [01:51:57] upstart discourages app developers from daemonizing [01:52:02] * Ryan_Lane nods [01:52:12] this looks fine, assuming you add a notify or subscribe [01:52:40] yep, done (notify) [01:55:36] merged [01:56:00] :D thank you! [01:56:06] i am hideously excited about this [01:57:47] heh [01:57:48] yw [01:59:06] gah, i'm an idiot [01:59:31] Ryan_Lane: https://gerrit.wikimedia.org/r/#/c/81448 . sorry. [02:01:14] merged [02:01:17] it happens [02:02:04] thanks [02:02:18] yw [02:06:17] works! [02:06:24] * ori-l waits for dataz. [02:07:04] * YuviPanda streams all cricket scores ever, into ori-l [02:07:11] MAKE SENSE OF THAT, HAH! [02:16:15] !log LocalisationUpdate completed (1.22wmf14) at Wed Aug 28 02:16:15 UTC 2013 [02:16:21] Logged the message, Master [02:16:46] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:16:51] Ryan_Lane: What's wrong with Labs? :-( [02:17:09] Elsie: do you mean tools specifically? [02:17:19] ask in #wikimedia-labs, rather than here [02:17:25] coren is working on it [02:17:46] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 5.282 second response time [02:17:51] All right. [02:17:53] should really specify tools rather than just labs. 
labs itself is fine ;) [02:22:06] PROBLEM - Puppet freshness on cp1063 is CRITICAL: No successful Puppet run in the last 10 hours [02:22:46] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:24:37] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [02:27:46] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:28:36] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.137 second response time [02:30:40] !log LocalisationUpdate completed (1.22wmf13) at Wed Aug 28 02:30:40 UTC 2013 [02:30:46] Logged the message, Master [02:31:22] Ryan_Lane: wmflabs.org doesn't resolve. ;-) [02:31:56] it never has [02:32:02] www.wmflabs.org did at some point [02:32:10] but I removed that from someone's project [02:32:24] let me just add that to virt0 [02:39:32] Elsie: there, now wmflabs.org and www.wmflabs.org redirect to wikitech [02:40:10] Still not resolving for me, but possibly cached. [02:40:13] yep [02:40:25] you already tried to hit it. so, your resolver has negative cache [02:40:49] ;wmflabs.org. IN A [02:40:49] in about an hour it'll be fixed [02:40:58] All right. [02:41:04] dig @labs-ns0.wikimedia.org wmflabs.org [02:41:07] There's an open bug about this somewhere. It'll be nice to have it resolved. :-) [02:41:36] 208.80.152.32 [02:41:37] Nice. [02:41:44] You see nytimes.com got hit today? [02:41:47] yep [02:41:55] and twitter [02:42:46] I know. I went to check Twitter when I noticed nytimes.com was down. I got so confused. [02:43:11] :D [02:43:49] Hrmmm, are they still down? [02:44:04] www.nytimes.com works. Silly DNS. [02:45:01] (03PS1) 10Ori.livneh: Log client-side latency measurements in graphite [operations/puppet] - 10https://gerrit.wikimedia.org/r/81447 [02:45:02] (03PS2) 10Ori.livneh: Log client-side latency measurements in graphite [operations/puppet] - 10https://gerrit.wikimedia.org/r/81447 [02:45:04] (03PS3) 10Ori.livneh: Log client-side latency measurements in graphite [operations/puppet] - 10https://gerrit.wikimedia.org/r/81447 [02:45:06] (03PS4) 10Ori.livneh: Log client-side latency measurements in graphite [operations/puppet] - 10https://gerrit.wikimedia.org/r/81447 [02:45:08] (03CR) 10Ryan Lane: [C: 032] Log client-side latency measurements in graphite [operations/puppet] - 10https://gerrit.wikimedia.org/r/81447 (owner: 10Ori.livneh) [02:45:13] (03PS1) 10Ori.livneh: Fix typo [operations/puppet] - 10https://gerrit.wikimedia.org/r/81448 [02:45:14] (03CR) 10Ryan Lane: [C: 032] Fix typo [operations/puppet] - 10https://gerrit.wikimedia.org/r/81448 (owner: 10Ori.livneh) [02:45:28] (03PS1) 10Ryan Lane: Add wmflabs.org to wikitech's apache config [operations/puppet] - 10https://gerrit.wikimedia.org/r/81451 [02:45:29] (03CR) 10Ryan Lane: [C: 032] Add wmflabs.org to wikitech's apache config [operations/puppet] - 10https://gerrit.wikimedia.org/r/81451 (owner: 10Ryan Lane) [02:45:30] (03CR) 10Ryan Lane: [C: 032] Add newer pmtpa virt nodes to netboot [operations/puppet] - 10https://gerrit.wikimedia.org/r/81418 (owner: 10Ryan Lane) [02:45:49] !log LocalisationUpdate ResourceLoader cache refresh completed at Wed Aug 28 02:45:49 UTC 2013 [02:45:55] Logged the message, Master [02:46:11] (03CR) 10MZMcBride: "https://bugzilla.wikimedia.org/show_bug.cgi?id=36885" [operations/puppet] - 10https://gerrit.wikimedia.org/r/81451 (owner: 10Ryan Lane) [02:46:17] heh, see? 
Redis didn't lose any messages :P [02:46:25] PROBLEM - Disk space on analytics1004 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/e 11258 MB (3% inode=99%): [03:40:58] PROBLEM - Puppet freshness on fenari is CRITICAL: No successful Puppet run in the last 10 hours [05:27:14] (03CR) 10Greg Grossmeier: [C: 031] "If this is good, let's get it in before Thursday's deploy." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/80717 (owner: 10Amire80) [06:44:46] (03CR) 10Nikerabbit: [C: 031] Don't show the IME in the CodeEditor textarea [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/80717 (owner: 10Amire80) [06:45:41] apergos: good morning :-] [06:48:20] hello [06:50:47] hashar: hello [06:51:15] apergos: I was wondering if you knew anything about the production memcached servers? [06:51:25] apparently we have a bunch of them with lot of memory allocated [06:51:32] but can't find the definition for it in puppet :/ [06:51:47] the memcached class only creates one instance of ~ 80MB memory which is not that much [06:51:57] lemme look [06:52:04] on beta Chad created two instances but each of them only have 80MB, I would like to allocate mooaaar mem [06:53:42] there are a pile of mc*** with role::memcached [06:53:50] or are you talking about something else? [06:54:16] ahh [06:54:25] 16 boxes looks like for eqiad [06:54:40] yeah that is the boxes [06:55:20] and that role calls ::memcached class with a size of 89088 [06:55:38] which would mean we only have 16 x 89088 memory allocated [06:55:56] yep [06:56:39] which is like … not that much :] [06:56:42] don't forget that a lot of stuff is in redis these days too [06:56:54] and we have a database parser cache [06:57:16] so yeah maybe that is enough [06:57:17] yep [06:57:31] I guess beta should use more redis and have a db backed parser cache hehe [06:57:43] prolly should, if it's going to act like production [07:00:27] yup [07:00:35] more bugs I need to fill [07:22:12] PROBLEM - Puppet freshness on mw1126 is CRITICAL: No successful Puppet run in the last 10 hours [07:27:40] (03PS1) 10Ori.livneh: Change metric name format for NavigationTiming events [operations/puppet] - 10https://gerrit.wikimedia.org/r/81464 [07:32:38] apergos: also, Bryan Davis (multimedia software engineer), posted a mail to engineering list about multimedia tech debt [07:32:54] apergos: he is referring to thumbnails so I though you might want to reply / get in touch with him [07:33:14] (sorry I am processing my mail backlog hehe) [07:34:43] so am I. now down to 700 messages unread... [07:34:55] I hav enot gotten to the tech debt thread yet. [07:35:28] and I found out that the production parser cache uses both memcached and the db hehe [07:44:05] and I filled a bunch of bugs to get MariaDB on beta .. [07:47:07] great [07:53:29] apergos: could you possibly merge change 81464 (see above)? [07:55:02] looking [07:55:11] thank you [07:59:07] mwalker: sleep you should! [07:59:12] mwalker: thanks for the pep8 merges :-] [07:59:23] hashar: probably [07:59:29] thanks for writing them [07:59:41] any thoughts on why the jobs aren't triggering yet though? 
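A minimal pyzmq sketch of the subscription pattern reviewed above, empty subscription plus bytes socket options (the endpoint and the receive loop are hypothetical, not the actual patch in r81447):

    from __future__ import unicode_literals  # a bare '' literal is now unicode, not bytes

    import zmq

    ctx = zmq.Context()
    sock = ctx.socket(zmq.SUB)
    sock.connect('tcp://127.0.0.1:8600')  # hypothetical publisher endpoint

    # PUB/SUB delivers nothing until at least one subscription is declared;
    # an empty prefix means "subscribe to everything".
    sock.setsockopt(zmq.SUBSCRIBE, b'')   # classic form
    # sock.subscribe = b''                # equivalent attribute-style pyzmq form

    # The subscription value must be bytes, hence b'': with unicode_literals
    # in effect, a plain '' would be a unicode string and be rejected.
    while True:
        print(sock.recv())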
[07:59:52] cause I haven't merged the change [07:59:55] arhar [08:00:07] makes sense [08:00:21] https://gerrit.wikimedia.org/r/#/c/81378/ [08:00:24] replied a bit there [08:00:33] you might want to have the linting job to vote Verified + 2 [08:00:48] and maybe add gate-and-submit which would make Jenkins merge the change for you whenever you CR+2 [08:00:55] (if jenkins is allowed to merge changes there) [08:01:07] ori-l: instead of site.country.metric it's now going to be metric.site.country, this is desired right? [08:01:22] yep [08:01:28] cool [08:01:54] (03CR) 10ArielGlenn: [C: 032] Change metric name format for NavigationTiming events [operations/puppet] - 10https://gerrit.wikimedia.org/r/81464 (owner: 10Ori.livneh) [08:01:58] hashar: I was going to wait until we had error free runs before doing V+2 on anything [08:02:37] apergos: thanks very much [08:03:21] hashar: unless there's an optoin to have it +2 but not -2 [08:03:40] mwalker: well they are non voting [08:03:47] so if the lint fails, jenkins is not going to vote -2 [08:03:56] it will even vote +2 regardless [08:03:57] this is for professor? [08:03:59] ( ori-l ) [08:04:03] hashar: interesting [08:04:20] sounds good to me; what change do I need to make? [08:04:22] apergos: no, hafnium [08:04:25] woops [08:04:26] mwalker: that is what 'voting: false' does, it basically discard the result of the job to find out what the voting score is. [08:04:40] i have access to hafnium, so i can update it [08:04:43] ok running puppet now [08:04:49] oh. heh well too late [08:04:53] mwalker: replace check-only by check-voter [08:04:55] np, saves me the trouble [08:05:10] mwalker: and copy paste the block below it, and replace check-voter with gate-and-submit. I can do it if you want [08:05:39] I can only learn by doing [08:05:50] some people are not willing to learn :-] [08:06:42] some of them want to abuse you, some of them want to be abused. [08:06:43] ok you're good to go ori-l [08:06:59] apergos: woot! thanks! :D [08:07:06] sure! [08:11:44] hashar: what is the point of having a gate-and-submit job if we're already running the tests in check-voter and the tests do not vote? [08:12:02] just to save us the trouble of adding it when we do make the tests voting? [08:12:17] yup [08:12:34] even if all tests are non voting, Jenkins will still vote Verified +2 [08:12:40] this way you do not have to vote verified :] [08:12:47] gotcha [08:12:50] saves you one click on each change [08:13:10] if it costs 0.10$ per click, we save money :) [08:14:02] ok; pushed [08:17:25] mwalker: nooow head to bed :] [08:17:43] indeed [08:17:45] sleeps time [08:18:03] thank you mwalker|sleeps ! 
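The metric renaming merged above only reorders the dotted graphite path so the metric name comes first. A rough sketch of the idea (field names are illustrative, not the actual NavigationTiming schema):

    def metric_path(event):
        """Build a graphite key as metric.site.country instead of the old
        site.country.metric, so related series share a common prefix."""
        return '.'.join([event['metric'], event['site'], event['country']])

    # {'metric': 'responseStart', 'site': 'enwiki', 'country': 'DE'}
    #   -> 'responseStart.enwiki.DE'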
[08:37:24] off will be back later on [08:52:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:52:35] PROBLEM - MySQL Slave Delay on db43 is CRITICAL: CRIT replication delay 194 seconds [08:53:05] PROBLEM - MySQL Replication Heartbeat on db43 is CRITICAL: CRIT replication delay 214 seconds [08:53:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time [09:00:05] RECOVERY - MySQL Replication Heartbeat on db43 is OK: OK replication delay -0 seconds [09:00:35] RECOVERY - MySQL Slave Delay on db43 is OK: OK replication delay 0 seconds [09:01:37] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:02:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time [09:22:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:23:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time [09:24:57] PROBLEM - LVS Lucene on search-pool5.svc.eqiad.wmnet is CRITICAL: Connection timed out [09:25:57] RECOVERY - LVS Lucene on search-pool5.svc.eqiad.wmnet is OK: TCP OK - 0.000 second response time on port 8123 [09:28:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:29:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time [09:52:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:53:26] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time [09:57:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:58:26] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time [10:14:23] PROBLEM - LVS HTTP IPv6 on wikimedia-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:14:23] PROBLEM - LVS HTTPS IPv6 on wikipedia-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:14:25] PROBLEM - LVS HTTPS IPv6 on wikiversity-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:14:25] PROBLEM - LVS HTTP IPv6 on wikiquote-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:14:25] PROBLEM - LVS HTTP IPv6 on wiktionary-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:14:33] PROBLEM - LVS HTTPS IPv6 on bits-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:14:35] PROBLEM - LVS HTTPS IPv6 on wikisource-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:14:35] PROBLEM - LVS HTTPS IPv6 on wikiquote-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:14:35] PROBLEM - LVS HTTPS IPv6 on foundation-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:14:35] PROBLEM - LVS HTTP IPv6 on mediawiki-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:14:35] PROBLEM - LVS HTTPS IPv6 on wikimedia-lb.esams.wikimedia.org_ipv6 is CRITICAL: 
CRITICAL - Socket timeout after 10 seconds [10:14:36] PROBLEM - LVS HTTPS IPv6 on mediawiki-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:14:36] PROBLEM - LVS HTTPS IPv6 on wikivoyage-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:14:37] PROBLEM - LVS HTTP IPv6 on wikipedia-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:14:38] PROBLEM - LVS HTTP IPv6 on wikisource-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:14:38] PROBLEM - LVS HTTP IPv6 on foundation-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:14:38] PROBLEM - LVS HTTP IPv6 on wikivoyage-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:14:43] PROBLEM - LVS HTTPS IPv6 on wiktionary-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:14:43] PROBLEM - LVS HTTP IPv6 on wikidata-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:14:43] PROBLEM - LVS HTTP IPv6 on wikiversity-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:14:54] PROBLEM - LVS HTTPS IPv6 on wikinews-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:14:54] PROBLEM - LVS HTTP IPv6 on wikibooks-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:15:13] RECOVERY - LVS HTTP IPv6 on wikimedia-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 90853 bytes in 0.436 second response time [10:15:13] RECOVERY - LVS HTTPS IPv6 on wikipedia-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 60685 bytes in 0.711 second response time [10:15:15] RECOVERY - LVS HTTPS IPv6 on wikiversity-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 60685 bytes in 0.705 second response time [10:15:15] RECOVERY - LVS HTTP IPv6 on wikiquote-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 60685 bytes in 0.435 second response time [10:15:15] RECOVERY - LVS HTTP IPv6 on wiktionary-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 60685 bytes in 0.436 second response time [10:15:23] RECOVERY - LVS HTTP IPv6 on mediawiki-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 60685 bytes in 0.349 second response time [10:15:23] RECOVERY - LVS HTTPS IPv6 on bits-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 3893 bytes in 0.534 second response time [10:15:25] RECOVERY - LVS HTTPS IPv6 on wikiquote-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 60685 bytes in 0.695 second response time [10:15:25] RECOVERY - LVS HTTPS IPv6 on foundation-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 60685 bytes in 0.714 second response time [10:15:25] RECOVERY - LVS HTTPS IPv6 on wikisource-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 60685 bytes in 0.714 second response time [10:15:25] RECOVERY - LVS HTTPS IPv6 on mediawiki-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 60685 bytes in 0.711 second response time [10:15:25] RECOVERY - LVS HTTPS IPv6 on wikimedia-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 90851 bytes in 0.799 second response time [10:15:26] RECOVERY - LVS HTTP IPv6 on wikivoyage-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 44563 bytes in 0.352 second response time [10:15:26] RECOVERY - LVS HTTP IPv6 on 
wikisource-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 60685 bytes in 0.357 second response time [10:15:27] RECOVERY - LVS HTTP IPv6 on foundation-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 60687 bytes in 0.435 second response time [10:15:27] RECOVERY - LVS HTTP IPv6 on wikipedia-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 60685 bytes in 0.440 second response time [10:15:28] RECOVERY - LVS HTTPS IPv6 on wikivoyage-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 44563 bytes in 0.616 second response time [10:15:33] RECOVERY - LVS HTTP IPv6 on wikidata-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 1186 bytes in 0.176 second response time [10:15:34] RECOVERY - LVS HTTP IPv6 on wikiversity-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 60687 bytes in 0.355 second response time [10:15:34] RECOVERY - LVS HTTPS IPv6 on wiktionary-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 60685 bytes in 0.710 second response time [10:15:43] RECOVERY - LVS HTTP IPv6 on wikibooks-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 60687 bytes in 0.349 second response time [10:15:43] RECOVERY - LVS HTTPS IPv6 on wikinews-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 60685 bytes in 0.696 second response time [10:18:03] PROBLEM - Puppet freshness on ssl1 is CRITICAL: No successful Puppet run in the last 10 hours [10:22:33] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:23:23] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.133 second response time [10:24:03] PROBLEM - Puppet freshness on ssl1006 is CRITICAL: No successful Puppet run in the last 10 hours [10:31:03] PROBLEM - Puppet freshness on ssl1008 is CRITICAL: No successful Puppet run in the last 10 hours [10:35:03] PROBLEM - Puppet freshness on cp1044 is CRITICAL: No successful Puppet run in the last 10 hours [10:43:49] (03CR) 10Hashar: "The files will still be owned by root/root arent they?" 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/79955 (owner: 10Mattflaschen) [10:44:59] PROBLEM - Puppet freshness on ssl1001 is CRITICAL: No successful Puppet run in the last 10 hours [10:45:59] PROBLEM - Puppet freshness on amssq47 is CRITICAL: No successful Puppet run in the last 10 hours [10:48:59] PROBLEM - Puppet freshness on ssl1005 is CRITICAL: No successful Puppet run in the last 10 hours [10:48:59] PROBLEM - Puppet freshness on ssl1003 is CRITICAL: No successful Puppet run in the last 10 hours [10:48:59] PROBLEM - Puppet freshness on ssl4 is CRITICAL: No successful Puppet run in the last 10 hours [10:51:59] PROBLEM - Puppet freshness on cp1043 is CRITICAL: No successful Puppet run in the last 10 hours [10:51:59] PROBLEM - Puppet freshness on ssl1007 is CRITICAL: No successful Puppet run in the last 10 hours [10:54:59] PROBLEM - Puppet freshness on ssl1002 is CRITICAL: No successful Puppet run in the last 10 hours [10:54:59] PROBLEM - Puppet freshness on ssl3001 is CRITICAL: No successful Puppet run in the last 10 hours [10:57:59] PROBLEM - Puppet freshness on ssl1004 is CRITICAL: No successful Puppet run in the last 10 hours [10:59:59] PROBLEM - Puppet freshness on ssl3003 is CRITICAL: No successful Puppet run in the last 10 hours [11:00:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:00:59] PROBLEM - Puppet freshness on ssl1009 is CRITICAL: No successful Puppet run in the last 10 hours [11:01:59] PROBLEM - Puppet freshness on ssl3 is CRITICAL: No successful Puppet run in the last 10 hours [11:01:59] PROBLEM - Puppet freshness on ssl3002 is CRITICAL: No successful Puppet run in the last 10 hours [11:02:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [11:04:59] PROBLEM - Puppet freshness on ssl2 is CRITICAL: No successful Puppet run in the last 10 hours [12:05:14] PROBLEM - Puppet freshness on pdf1 is CRITICAL: No successful Puppet run in the last 10 hours [12:22:52] PROBLEM - Puppet freshness on cp1063 is CRITICAL: No successful Puppet run in the last 10 hours [12:31:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:32:22] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.130 second response time [12:32:31] !log jenkins: updated various jobs to use $ZUUL_PROJECT when determining the git repository to fetch from {{bug|53470}} [12:32:37] Logged the message, Master [12:34:35] (03CR) 10Andrew Bogott: [C: 031] "I was expecting there to be a corresponding change in site.pp -- did we rip out the tampa RT server definition already?" [operations/puppet] - 10https://gerrit.wikimedia.org/r/81443 (owner: 10Dzahn) [13:05:33] paravoid, I'm looking at https://gerrit.wikimedia.org/r/#/admin/projects/operations/debs/nginx and it's not clear to me how to actually build. I presume I need to check out the actual nginx source someplace in there? 
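For scale, the memcached sizing discussed above works out roughly as follows (back-of-the-envelope only):

    per_box_mb = 89088   # size the memcached class is passed by role::memcached
    boxes = 16           # eqiad mc* hosts carrying that role
    per_box_gb = per_box_mb / 1024.0
    total_tb = per_box_mb * boxes / 1024.0 / 1024.0
    print('%.0f GB per box, %.2f TB total' % (per_box_gb, total_tb))
    # -> 87 GB per box, 1.36 TB total, versus two ~80 MB instances on beta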
[13:20:00] !log temporarily changing innodb purge settings on db55 (massive growing history list) [13:20:05] Logged the message, Master [13:22:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:22:40] (03PS1) 10Hashar: tcpircbot: rm unused `os` module import [operations/puppet] - 10https://gerrit.wikimedia.org/r/81489 [13:23:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time [13:24:04] PROBLEM - MySQL Replication Heartbeat on db55 is CRITICAL: CRIT replication delay 225 seconds [13:24:15] PROBLEM - MySQL Slave Delay on db55 is CRITICAL: CRIT replication delay 236 seconds [13:24:26] heh [13:27:14] RECOVERY - MySQL Slave Delay on db55 is OK: OK replication delay 0 seconds [13:31:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:32:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [13:35:29] !log stopped db55 slave threads while purge catches up; killed long running research user transaction apparently doing nothing but holding locks [13:35:34] Logged the message, Master [13:41:54] PROBLEM - Puppet freshness on fenari is CRITICAL: No successful Puppet run in the last 10 hours [13:43:39] !log same purge situation for db58 and db43, and remedy [13:43:45] Logged the message, Master [13:47:14] PROBLEM - MySQL Slave Delay on db1051 is CRITICAL: CRIT replication delay 192 seconds [13:47:24] PROBLEM - MySQL Replication Heartbeat on db1051 is CRITICAL: CRIT replication delay 202 seconds [13:47:25] PROBLEM - MySQL Replication Heartbeat on db43 is CRITICAL: CRIT replication delay 188 seconds [13:48:04] PROBLEM - MySQL Replication Heartbeat on db58 is CRITICAL: CRIT replication delay 232 seconds [13:49:07] andrewbogott_afk: no idea, but what do you want with it? in our HTTPS chats with Ryan, the consensus was that we were going to evaluate switching to pristine packages and this may happen soon [13:49:24] RECOVERY - MySQL Replication Heartbeat on db1051 is OK: OK replication delay 15 seconds [13:49:40] so it might not worth the trouble figuring it out [13:49:47] Ryan would know best either way though [13:50:07] paravoid, yuvi needs an updated version of nginx-extras. I was going to just make a tip package for him. [13:50:14] RECOVERY - MySQL Slave Delay on db1051 is OK: OK replication delay 0 seconds [13:50:22] But, OK, I can wait and ask ryan -- it looks like you've modified that code but maybe you never built. [13:50:40] I think this was like 18 months ago, right? [13:50:52] yeah 2012-05 [13:51:01] yeah, quite a while ago [13:51:21] sorry, I don't remember a thing :) [13:51:22] hiyaaa! mark, if you are you around, could you comment on this? https://rt.wikimedia.org/Ticket/Display.html?id=5678 [13:51:49] hiyaaa! [13:51:51] did you kill kafka? [13:51:58] hahaha [13:52:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:52:47] ha wha? 
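The "massive growing history list" noted above is InnoDB's backlog of unpurged undo log records, which a long-idle transaction holding an open read view can pin. One way to watch it, sketched against a generic DB-API connection (connection setup omitted):

    import re

    def history_list_length(conn):
        """Extract 'History list length N' from SHOW ENGINE INNODB STATUS,
        i.e. how far behind the purge thread is."""
        cur = conn.cursor()
        cur.execute('SHOW ENGINE INNODB STATUS')
        status_text = cur.fetchone()[2]   # rows are (Type, Name, Status)
        m = re.search(r'History list length (\d+)', status_text)
        return int(m.group(1)) if m else None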
[13:53:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.133 second response time [13:54:26] well varnishkafka suddenly broke [13:54:46] Aug 28 13:42:29 cp1048 varnishkafka[21693]: KAFKAERR: Kafka error (-196): analytics1004.eqiad.wmnet:9092/4: Failed to connect to broker at analytics1004.eqiad.wmnet:9092: Connection refused [13:55:54] !log jenkins: renamed Editcount jobs from EditCount to Editcount [13:56:00] Logged the message, Master [13:57:59] mark: and crashed after that? [13:58:06] no [13:58:14] phew [13:58:15] but running out of buffer space [13:58:20] and flooding rsyslog with that :) [13:58:29] good :) [13:58:33] Aug 28 13:41:07 cp1048 rsyslogd-2177: imuxsock begins to drop messages from pid 21693 due to rate-limiting [13:58:34] Aug 28 13:41:13 cp1048 rsyslogd-2177: imuxsock lost 44689 messages from pid 21693 due to rate-limiting [13:58:34] Aug 28 13:41:13 cp1048 varnishkafka[21693]: PRODUCE: Failed to produce kafka message: No buffer space available [13:58:34] !log jenkins: renamed OAuth jobs from Oauth to OAuth [13:58:40] Logged the message, Master [13:59:01] mark: should probably rate limit that output a bit in varnishkafka.. [13:59:10] would be nice [14:04:19] anyway [14:04:27] kinda difficult to do varnishkafka perf testing now ;) [14:04:34] but first impressions are good [14:04:43] it's using less than half of varnishncsa's cpu [14:04:52] plus the fact that we only need to run one instance instead of 2+ [14:05:47] mark, i responded on the RT, but the issue is cloudera hadoop packages depend on cloudera zookeeper package [14:05:53] which conflicts with ubuntu zookeeper package [14:05:57] which is what is installed on those machines [14:06:29] headache territory [14:06:48] and you can't run those journalnodes shared on existing hadoop machines either? [14:07:01] I started looking at that [14:07:07] but then remembered how we had that conversation [14:07:08] I always think all these quorum based services are so wasteful with machines, especially once you start to run multiple [14:07:17] which was basically [14:07:27] yeah, maybe, i think that it would be ok, except that it means we have to configure those machines differently, which is a bit annoying [14:07:34] that zookeeper is probably a useful piece of infrastructure to have and doesn't necessarily have anything to do with analytics [14:07:35] i don't htink we should run on namenode machines [14:07:40] beacuse that defeats the purpose [14:07:46] yeah zookeeper we can move out [14:07:55] it's a tiny piece of software though, so it could be easily run on a machine that runs something else [14:07:58] but I wonder if we ALSO need to assign 2/3 more machines to analytics [14:08:03] I would think so [14:08:05] and if we do it on the datanodes, then we have to remember this and partition them differently, and possibly configure job resources there differently [14:08:13] if we move zk out [14:08:17] we don't need 2/3 more machines [14:08:29] you do if we move those existing zk machines out ;) [14:08:32] oh ha [14:08:41] however way you look at it, it's 2-3 additional machines for this service [14:08:42] if you do that, then yeah we'd need 2 more, mayyyybe one [14:09:01] kind of, you could run zookeeper on some existing servers :) [14:09:01] we could get away with just one more, if that is a problem [14:09:09] misc servers [14:09:58] so, i think our planned zk usage will be pretty low. 
but, apparently linked in has had problems when there are a large number of kafka consumers all saving state in zookeeper [14:10:22] https://issues.apache.org/jira/browse/KAFKA-1000 [14:10:39] we don't have any high consumer use plans, so i htink it will be ok [14:10:48] camus is batched so it should be fine [14:14:01] (03PS3) 10Faidon Liambotis: Middle-East to esams [operations/dns] - 10https://gerrit.wikimedia.org/r/80972 [14:14:02] (03PS3) 10Faidon Liambotis: Africa to esams [operations/dns] - 10https://gerrit.wikimedia.org/r/80971 [14:14:20] (03CR) 10Faidon Liambotis: [C: 032] Africa to esams [operations/dns] - 10https://gerrit.wikimedia.org/r/80971 (owner: 10Faidon Liambotis) [14:15:09] !log authdns-update: switching Africa to esams [14:15:14] Logged the message, Master [14:15:17] Aug 28 14:14:03 cp1048 varnishd[1486]: Child (20318) said Could not destroy object 3290904632 in EXP_NukeLRU [14:15:21] that doesn't look good [14:15:25] paravoid, was that 'you could run zookeeper on misc servers' comment directed at me or to mark? :) [14:16:13] what are the I/O requirements for journalnodes? [14:16:41] (03PS3) 10Andrew Bogott: Make our checks for definitions a bit more explicit. [operations/puppet] - 10https://gerrit.wikimedia.org/r/77331 [14:20:06] Aug 28 14:19:37 cp1048 frontend[31119]: Child (31121) said : (malloc) Error in munmap(): #001<80><8A>#177 [14:20:07] Aug 28 14:19:37 cp1048 frontend[31119]: Child (31121) said : (malloc) Error in munmap(): [14:20:19] (03CR) 10Andrew Bogott: [C: 032] Make our checks for definitions a bit more explicit. [operations/puppet] - 10https://gerrit.wikimedia.org/r/77331 (owner: 10Andrew Bogott) [14:20:45] :( [14:22:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:25:28] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 8.380 second response time [14:26:09] africa isn't noticeable in graphs at all.. [14:26:12] mark, i'm not sure about io usage [14:26:18] but, i'm looking at some datanode disk configs [14:26:24] not even a little [14:26:54] i betcha I can create a raid 1 partition on sda and sdb on them [14:26:59] there's a lot of unused space there [14:27:12] and i think it should be ok to share io on the os disks [14:27:25] what does a journalnode do exactly? [14:27:42] they syncs namenode metatdata edits to the standby namenode [14:27:52] (03CR) 10Faidon Liambotis: [C: 032] Middle-East to esams [operations/dns] - 10https://gerrit.wikimedia.org/r/80972 (owner: 10Faidon Liambotis) [14:28:05] for HA namenode, you need to have the namenode metatdata highly available [14:28:19] there are 2 ways to do this: NFS, or Quorum Based JournalNodes [14:28:29] JournalNodes are the newer and preferred and less hacky way [14:28:32] http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/HDFSHighAvailabilityWithQJM.html [14:28:53] paravoid, mark did you have time to look at the varnish/streaming response issues (bug 52854) [14:29:05] j^: i'll look at that soon [14:31:30] so, mark, ok, i'm going to see if I can put journalnodes on a few datanodes, if we have problems with it we can revisit getting new nodes for this later. But! I'm still for a Zk outside of analytics. if we want to move 3 of the analytics nodes elsewhere for others to use as well, i'm for it! 
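For reference, the quorum-journal HA described above comes down to a few hdfs-site.xml properties (hostnames and paths here are hypothetical; the linked Apache doc is authoritative):

    # Rough shape of the QJM settings, expressed as a dict for readability.
    qjm_ha_properties = {
        'dfs.nameservices': 'analytics-hadoop',
        'dfs.ha.namenodes.analytics-hadoop': 'nn1,nn2',
        # The active namenode writes edits to a majority of these journalnodes
        # and the standby tails them, which keeps its metadata in sync.
        'dfs.namenode.shared.edits.dir':
            'qjournal://jn1:8485;jn2:8485;jn3:8485/analytics-hadoop',
        # Local directory where each journalnode keeps the edit segments.
        'dfs.journalnode.edits.dir': '/var/lib/hadoop/journal',
    }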
[14:32:15] i am for that too [14:32:19] i just updated the ticket [14:32:30] k danke [14:33:21] if we need the extra machines, then we can take them, I just hate to spend 3 machines on a service that does shit-all in terms of resources [14:33:38] so if you can, try to share with datanodes indeed [14:33:42] if that is problematic, let's move it over [14:33:44] is that ok? [14:34:17] !log authdns-update: switching Middle-East to esams [14:34:22] Logged the message, Master [14:34:24] yeah sounds good [14:34:29] thanks [14:38:36] ottomata: so, what happened to kraken? [14:38:36] er [14:38:38] kafka [14:39:16] oh what's up? [14:39:41] ohhhh looky there, [14:39:43] (no noticeable traffic increase with ME either) [14:39:49] an04's disk filled up! [14:39:55] didn't know you were gonna fill it~! [14:39:58] 276G [14:40:44] (03PS3) 10Faidon Liambotis: Add all Asian countries in the list [operations/dns] - 10https://gerrit.wikimedia.org/r/80974 [14:40:45] (03PS3) 10Faidon Liambotis: Switch Central/South Asia to esams [operations/dns] - 10https://gerrit.wikimedia.org/r/80973 [14:40:49] (rebasing) [14:41:43] (03CR) 10Faidon Liambotis: [C: 04-1] "Per discussion with Mark, postpone this decision later, perhaps after ulsfo." [operations/dns] - 10https://gerrit.wikimedia.org/r/80973 (owner: 10Faidon Liambotis) [14:42:06] heh [14:42:16] interesting [14:42:16] so [14:42:23] so i just ran varnishkafka on one box for a day [14:42:35] topic: varnish partition: 0 leader: 3 replicas: 3,4 isr: 3 [14:42:36] topic: varnish partition: 1 leader: 3 replicas: 4,3 isr: 3 [14:42:45] you can see that broker 3 (analytics1003) [14:42:54] is the leader, and has the only in-sync-replica (isr) [14:43:12] which means that you should still be able to produce [14:43:16] since its disk isn't full [14:43:27] varnishkafka got connrefused [14:43:35] Snaps_ has a thing to do in librdkafka where it is supposed to reconnect [14:43:50] if produce reqs fail due to broker problems [14:44:09] i'm not sure if he's got that yet, but you are probably not using a version of varnishkafka built with the fix even if it is done [14:44:36] so, when that actually, works, what should have happened is that the producers would have noticed that a new leader had been elected [14:44:39] and they'd start sending their data there [14:44:44] this would have filled up an03 eventually too [14:44:46] so this is right now: [14:44:47] Aug 28 14:44:29 cp1048 varnishkafka[8101]: KAFKAERR: Kafka error (-196): analytics1004.eqiad.wmnet:9092/4: Failed to connect to broker at analytics1004.eqiad.wmnet:9092: Connection refused [14:44:58] after a restart [14:45:02] that is with a restart? oh ok [14:45:04] hm [14:45:16] intersteing, yeah kafka is down on an04 [14:45:20] because the disk is full [14:45:32] but i'm not sure why that would cause the producer to not be able to start [14:45:50] librdkafka will automatically connect [14:45:54] no need to restart [14:46:16] Snaps_: if one of the listed brokers in metadata.broker.list is down [14:46:22] when varnishkafka starts up [14:46:31] what is supposed to happen? [14:46:57] ottomata: it will always strive to keep a connection to all known brokers. As soon as it connects to a broker it requests the metadata, possibly aquiring a list of new brokers, which it will connect to. [14:47:20] ottomata: the problem we talked about earlier was not about connections, but that the metadata was not automatically refreshed on a connection that did not go down. 
But thats fixed [14:47:37] ok, probably mark is not running with a varnishkafka built with that fix [14:47:38] ok [14:47:46] but, righ tnow, he is trying to start varnishkafka [14:47:57] one of the brokers is down [14:48:11] mark, does varnishkafka work even with that error message? [14:48:33] maybe it just prints the error because it can't connect, but it will produce to the partition leader (an03) anyway? [14:49:08] it doesn't seem to be doing much [14:49:21] yeah i'm not seeing any thing in the consumer i just spawned up [14:51:51] what version of librdkafka are you on, mark? [14:52:07] Installed: 0.8~wip20130807-1 [14:54:38] so, if we want a broker in esams and ulsfo [14:54:44] will we need zookeeper clusters there as well? [14:57:06] yes [14:57:14] are you fucking kidding me? [14:57:17] no way [14:57:22] mark: okay, theres been quite a few fixes the last couple of days to librdkafka. [14:57:46] we're not gonna set up 3 boxes just to support logging in a caching datacenter [14:58:00] that defeats the point entirely [14:58:17] we can run zk on other hosts [14:58:24] doesn't have to be dedicated boxes [14:58:25] like which? varnish? :) [14:58:29] there are no other hosts in a caching dc [14:58:59] maybe this is naive, but if there are 3 brokers, i don't see why we can't run it on the same boxes as the kafka brokers [14:59:10] 3 brokers per caching dc? [14:59:50] well, yes, at least, we have been using 2 in analytics, and i think 2 is fine to start with for now, might be all we need for a while, i don't know. but Zookeeper needs at least 3 nodes [14:59:53] since it is a quorum [15:00:16] i guess we're going with no brokers in caching dcs then [15:00:23] ? [15:00:32] needing 3 hosts per caching dc for this is insane [15:01:00] how many producers are there in esams? [15:01:08] 20 or so [15:01:25] PROBLEM - DPKG on ms-be1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:01:43] so, what we are trying to do is get the logs reliably streamed between datacenters [15:01:56] yes [15:02:01] and you'll need to do it without needing 3 boxes per dc [15:02:14] we! [15:02:14] we! [15:02:16] :) [15:02:17] right? [15:02:18] we? [15:02:29] if it was 'we', we didn't arrive here [15:02:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:02:35] !log restarting slave threads on db55, db58, db43. will lag for a while. [15:02:40] Logged the message, Master [15:03:16] right? [15:03:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.134 second response time [15:03:47] if there are other better ways of solving the problem, then we should figure that out, right? 
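To restate the topic metadata pasted earlier: each partition has one leader plus replicas, and only brokers in the ISR (in-sync replica set) can be elected leader if the current one dies. A small sketch mapping those describe-style lines to that (illustrative parsing only, not a Kafka client):

    import re

    LINE = 'topic: varnish partition: 0 leader: 3 replicas: 3,4 isr: 3'

    def parse_partition(line):
        """Pull partition / leader / replicas / ISR out of a describe-style line."""
        m = re.search(r'partition: (\d+) leader: (\d+) replicas: ([\d,]+) isr: ([\d,]+)', line)
        partition, leader, replicas, isr = m.groups()
        return {
            'partition': int(partition),
            'leader': int(leader),                    # broker taking produce requests now
            'replicas': [int(b) for b in replicas.split(',')],
            'isr': [int(b) for b in isr.split(',')],  # only these are failover candidates
        }

    # Broker 3 leads and is the only in-sync replica (broker 4 filled its disk),
    # so producers should keep working as long as broker 3 stays healthy.
    print(parse_partition(LINE))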
[15:03:54] PROBLEM - MySQL Slave Delay on db43 is CRITICAL: CRIT replication delay 4780 seconds [15:04:14] PROBLEM - MySQL Slave Delay on db58 is CRITICAL: CRIT replication delay 4802 seconds [15:04:15] PROBLEM - MySQL Slave Delay on db55 is CRITICAL: CRIT replication delay 6236 seconds [15:05:49] setting up a broker for intermediate storage in a caching dc has legal problems, but seems reasonable [15:06:10] also needing 3 zookeeper nodes with that is a huge waste, and contradictory to the idea of lightweight caching centers [15:06:43] hm, i'm not sure about this in 0.8, but in 0.7 you could manually just supply the list of brokers without using zookeeper…i think, not sure if that is still possible [15:07:36] that seems preferable to using at least 6 servers to figure that out ;) [15:07:59] lemme see what happens in the labs cluster... [15:08:07] hmmmmm nooooo [15:08:09] wait no [15:08:10] this won't work [15:08:21] because, we need zookeeper to consume [15:08:31] and the mirror is a consumer [15:08:56] yeah no, and the brokers definitely write to zk [15:09:21] and none of this seemed to be a problem before? :) [15:09:29] we haven't done multi data center before [15:10:49] mark, i think the plan (albeit not very well discussed, I suppose) was to have 2 brokers in each DC anyway, is it so bad to have a 3rd? :) [15:11:25] 2 brokers per dc, what for? [15:11:33] can't producers talk to brokers in other dcs? [15:12:09] i'm sure they can but it isn't recommended [15:12:20] (03PS2) 10Andrew Bogott: Turn on pluginsync. [operations/puppet] - 10https://gerrit.wikimedia.org/r/77378 [15:12:21] (03PS3) 10Andrew Bogott: Move base class and subclasses into a 'base' module. [operations/puppet] - 10https://gerrit.wikimedia.org/r/77332 [15:12:43] i'm starting to think we should have brokers only in our primary dcs [15:13:08] they do some buffering [15:13:15] 2 brokers per dc for HA, we want to do this reliably, right? having a single machine responsible for traffic defeats the purpose a bit [15:14:08] if we can figure out a way to reliably get the data from the caching DCs to the primary kafka brokers in eqiad without using kafka in the caching DCs, that's cool i suppose [15:14:21] i'm starting to think that udplog was pretty good [15:14:25] perhaps we should add that to varnishkafka ;-) [15:14:42] it didn't need 3 hosts to figure out where to send or find data [15:14:45] haha, uh huh, if you want all of your traffic to go to the same place [15:14:56] and if you wanted to lose data while you restart a udp2log instance [15:15:09] and if you wanted to deploy changes to N hosts when you need to move an instance [15:15:26] no [15:15:35] because maintaing 3 almost pointless boxes per dc is so much better [15:15:47] we have plenty of money and staff time, right [15:15:52] and rack space and power [15:16:18] imagine we lose a few log lines :) [15:16:27] i think the people who fund this stuff think getting reliable analytics is important? [15:16:51] i doubt the people who pay for this think wasting machines is important [15:17:40] data loss == lots of staff time spent figuring out what the heck went wrong, from analysts down to engineers [15:17:47] its not just the machine cost [15:18:09] so do it without data loss and without a lot of resource waste [15:18:29] do you have a better idea than the current plan? 
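The "3 boxes per DC" sticking point is plain quorum arithmetic: a ZooKeeper ensemble only serves while a strict majority of its members are up, so 3 is the smallest size that survives one failure (2 survives none, and 4 buys nothing over 3):

    def tolerated_failures(ensemble_size):
        """A quorum needs a strict majority, so an N-node ensemble keeps
        working only while more than N/2 members are alive."""
        majority = ensemble_size // 2 + 1
        return ensemble_size - majority

    for n in (1, 2, 3, 4, 5):
        print('%d-node ensemble tolerates %d failure(s)' % (n, tolerated_failures(n)))
    # 1 -> 0, 2 -> 0, 3 -> 1, 4 -> 1, 5 -> 2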
[15:18:46] yes [15:18:54] write something lightweight, custom [15:19:53] that is a pretty generic plan :p [15:20:15] I wish I would have spent the time already spent discussing kafka and various solutions on writing a custom solution [15:20:17] thanks for volunteering! [15:21:04] (or even investigate an alternative already out there) [15:21:10] <^d> manybubbles: Yo :) [15:21:21] ^d: yo! [15:21:32] so that library I mentioned works.... [15:21:36] mostly. [15:21:52] <^d> Meh, let's not worry about it today. Anything else we need to do before mw.org? [15:22:01] ^d: did you ever do a deploy to test2? [15:22:06] mark: i think the people who fund this stuff think getting reliable [15:22:08] oops [15:22:10] sorry, wrong paste! [15:22:15] https://docs.google.com/a/wikimedia.org/spreadsheet/ccc?key=0AvpRkIqSY9hNdFNoYWFIX29hMnFuenV1ckRvYzEzUVE#gid=0 [15:22:21] <^d> manybubbles: I haven't updated to master in awhile. Lemme do that. [15:22:37] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:22:41] ^d: cool. if you could, could you rebuild the search index with the two scripts? [15:23:25] <^d> Yeah [15:23:25] ^d: you have to remember to swap the --indexIdentifier arguments when you do an in place upgrade.... I should make that automatic. [15:23:27] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.136 second response time [15:23:29] i think i'll skip our meeting in half an hour [15:23:34] I can't stand this shit [15:23:58] varnishkafka seems to work fine [15:28:37] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:29:27] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.136 second response time [15:39:13] !log demon synchronized php-1.22wmf14/extensions/CirrusSearch 'CirrusSearch to master' [15:39:18] Logged the message, Master [15:43:33] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:44:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time [15:45:47] <^d> manybubbles: Not our fault, but I'm getting "PHP Notice: Undefined variable: function in /a/common/php-1.22wmf14/extensions/Scribunto/engines/LuaCommon/LuaCommon.php on line 729" on reindexing. [15:45:50] <^d> anomie: ^ [15:46:23] !log updated Parsoid to ca3f45f [15:46:29] Logged the message, Master [15:46:39] ^d: I'm glad it isn't our fault. that is likely going to make reindexing not too happy. If you updated the config that should still be ok. [15:48:02] <^d> I restarted it anyway, seems to be happy now. [15:51:03] ^d, manybubbles: Any idea how to reproduce? [15:51:46] <^d> Not particularly. It was transient, no stacktrace or anything since it was just a notice. [15:52:08] <^d> Ah, popped up again. Same uninformative message though. [15:52:21] ok [15:55:17] ^d: I'm able to shell in a look at the index and it looks like we've still got some cruft left over from a previous version. [15:55:29] <^d> :\ [15:56:36] (03PS1) 10Chad: Moving mediawikiwiki to Cirrus (as secondary backend) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/81505 [15:57:36] greg-g: I realized I didn't actually *answer* the email about the invite, I just mentally made a note to show up [15:58:21] apergos: hah. 
I assumed it wasn't going to happen since you didn't say yes on the cal invite, but then I realize that not everyone does that [15:58:37] so, you and I can just chat, or we can wait until next week to get hashar [15:58:43] ah no hashar today? [15:58:50] hmm maybe better for us all to be in [15:58:56] save you from two meetings [15:58:58] no, I thought it was a good time for him, but he has kid duty now [15:59:01] ok [15:59:02] yeah, thankya :) [15:59:06] all righty [15:59:15] ^d: https://gerrit.wikimedia.org/r/#/c/81506/ will probably fix it [16:03:03] <^d> manybubbles: test2 reconfiged and reindexed. [16:03:28] <^d> If https://gerrit.wikimedia.org/r/#/c/81505/1/wmf-config/InitialiseSettings.php looks good, we'll throw the switch on mw.org too [16:03:53] ^d: that looks good to me [16:04:04] (03CR) 10Chad: [C: 032] Moving mediawikiwiki to Cirrus (as secondary backend) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/81505 (owner: 10Chad) [16:04:13] (03Merged) 10jenkins-bot: Moving mediawikiwiki to Cirrus (as secondary backend) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/81505 (owner: 10Chad) [16:04:42] ^d: I've noticed a bug where when we reindex into a new, empty index while the old on still exists we serve some requests out of both the new and old index, causing dupes. Filing. [16:05:29] !log demon synchronized wmf-config/InitialiseSettings.php 'CirrusSearch on mediawiki.org \o/' [16:05:35] Logged the message, Master [16:07:03] <^d> !log ES: created mediawikiwiki index [16:07:08] Logged the message, Master [16:07:54] <^d> !log ES: indexing mediawikiwiki in screen on terbium [16:07:59] Logged the message, Master [16:08:24] ^d: in case you care, this is the bug number: 53484 [16:11:02] ^d: it ooks like a lot of templates aren't being expanded: http://www.mediawiki.org/w/index.php?search=search&title=Special%3ASearch&fulltext=1&srbackend=CirrusSearch [16:13:25] Ryan_Lane: err: Could not retrieve catalog from remote server: Error 400 on SERVER: Could not find class protoproxy::package for amssq47.esams.wikimedia.org at /etc/puppet/manifests/role/protoproxy.pp:20 on node amssq47.esams.wikimedia.org. [16:13:26] I think you missed that include. Is it safe to remove it? Replace it with include nginx::package? something else entirely ? [16:16:20] ^d: I'm pretty sure something is screwed up while indexing articles. Templates are not expanding as expected. [16:16:33] (03PS1) 10Asher: Revert "add missing uploadlb6 ips" [operations/puppet] - 10https://gerrit.wikimedia.org/r/81509 [16:20:13] <^d> manybubbles: I wonder... [16:20:15] !log Created EducationProgram tables on ptwiki [16:20:20] Logged the message, Master [16:20:31] ^d: I'm also not sure what is going on with namespaces. investigating that [16:20:33] <^d> If since we're running as secondary backend, it's picking up the wrong text formatting stuff. [16:21:32] ^d: grrrrrrrr! [16:21:44] ^d: I totally believe it. [16:21:59] <^d> Actually I've got a good suspicion that's it. [16:22:11] I can have a look at it. [16:22:38] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:23:49] (03PS1) 10Dr0ptp4kt: Adding DNS TXT entry for Google Webmaster Tools site verification. 
[operations/dns] - 10https://gerrit.wikimedia.org/r/81510 [16:24:28] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time [16:24:30] (03PS1) 10Reedy: Install Education Program extension on pt.wikipedia [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/81511 [16:25:05] (03PS2) 10Reedy: Install Education Program extension on pt.wikipedia [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/81511 [16:25:18] !log demon synchronized wmf-config/InitialiseSettings.php 'Move mw.org to primary backend as Cirrus to test my theory' [16:26:31] greg-g, no depl today for zero, can reallocate [16:26:52] hah [16:27:00] after I bumped you around so much you're just not going to do it [16:27:05] :) [16:27:16] ^d: I see the problem in forceSearchIndex.php. now to see if it is a problem with regular updates [16:29:10] ^d: so I just got a wmf error trying to update a page on mediawiki.org [16:30:23] <^d> Hmm, what was the error? [16:32:11] (03PS1) 10Chad: Move mw.org to primary backend as Cirrus to test my theory [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/81512 [16:32:18] (03CR) 10Chad: [C: 032 V: 032] Move mw.org to primary backend as Cirrus to test my theory [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/81512 (owner: 10Chad) [16:32:38] ^d: I'm actually not sure where to read the logs. it spun forever and then complained [16:32:48] <^d> Can you ssh to fluorine? [16:33:32] no [16:34:10] ^d: database locked? [16:34:16] did we do that? [16:36:04] PROBLEM - MySQL Replication Heartbeat on db1024 is CRITICAL: CRIT replication delay 199 seconds [16:36:09] <^d> I dunno, but I aborted the couple of indexing processes I had. Lemme do it a different way. [16:36:24] ^d: I'm pretty blind at this point. [16:36:24] PROBLEM - MySQL Slave Delay on db1028 is CRITICAL: CRIT replication delay 216 seconds [16:36:24] PROBLEM - MySQL Slave Delay on db1024 is CRITICAL: CRIT replication delay 216 seconds [16:36:34] PROBLEM - MySQL Replication Heartbeat on db1028 is CRITICAL: CRIT replication delay 222 seconds [16:36:35] those look relevent [16:37:14] <^d> @replag [16:37:15] ^d: [s7] db1024: 265s, db1028: 265s [16:38:19] <^d> manybubbles: mw.org is on s3, as is test2. [16:39:04] RECOVERY - MySQL Replication Heartbeat on db1024 is OK: OK replication delay 0 seconds [16:39:06] ^d: is that saying that those dbs have gone way out of sync from their master? [16:39:16] and now seem to be coming back [16:39:24] RECOVERY - MySQL Slave Delay on db1024 is OK: OK replication delay 0 seconds [16:39:46] (03CR) 10Helder.wiki: [C: 031] Install Education Program extension on pt.wikipedia [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/81511 (owner: 10Reedy) [16:40:16] @replag [16:40:17] Reedy: [s7] db1028: 204s [16:40:24] RECOVERY - MySQL Slave Delay on db1028 is OK: OK replication delay 3 seconds [16:40:30] Think it might have been my fault.. [16:40:34] RECOVERY - MySQL Replication Heartbeat on db1028 is OK: OK replication delay -1 seconds [16:40:35] ^d: I think the problem with not resolving templates is only in forceSearchIndex.php and I've found it. I think. [16:40:53] Reedy: what is up? [16:41:10] Running an update query [16:41:23] Even though I was doing them in much larger batches last night and it was fine [16:41:25] * Reedy kicks mysql [16:41:28] update globaluser set gu_home_db = NULL where gu_home_db = '' LIMIT 50 [16:41:54] indeed [16:41:54] Reedy: that^ ? 
[16:42:31] I wonder if that hit the same row that the maintenance script was trying to update [16:43:16] it's not indexed. limit 50 won't stop it scanning the table each time, all 26926625 rows on db1024 [16:43:29] batched, but slow batches :) [16:43:53] each one will be slower [16:44:05] (03PS2) 10Dr0ptp4kt: Adding DNS TXT entry for Google Webmaster Tools site verification. [operations/dns] - 10https://gerrit.wikimedia.org/r/81510 [16:45:07] ^d: after that merge we should be able to just restart the index build process [16:45:12] ^d: no need to update the config [16:45:25] (03PS1) 10Chad: Revert "Move mw.org to primary backend as Cirrus to test my theory" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/81514 [16:45:31] (03CR) 10Chad: [C: 032 V: 032] Revert "Move mw.org to primary backend as Cirrus to test my theory" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/81514 (owner: 10Chad) [16:45:39] (03PS3) 10Dr0ptp4kt: Adding DNS TXT entry for Google Webmaster Tools site verification. [operations/dns] - 10https://gerrit.wikimedia.org/r/81510 [16:46:07] !log demon synchronized wmf-config/InitialiseSettings.php [16:46:13] Logged the message, Master [16:46:22] Reedy: I'm not sure if this is good enough, but it would be better: update globaluser set gu_home_db = NULL where gu_home_db = '' ND gu_id > LIMIT 50 [16:46:38] That was to fix a few errant rows [16:46:57] ah, so there aren't many any way. it just takes forever to find them [16:47:36] Doing the batched reads in the maintenance script is using indexed gu_name [16:48:02] (03PS1) 10Jgreen: prep to make the spamd user configurable, and create spool dir for OTRS spam training [operations/puppet] - 10https://gerrit.wikimedia.org/r/81516 [16:48:13] !log demon synchronized php-1.22wmf14/extensions/CirrusSearch/forceSearchIndex.php 'Fix for reindexer' [16:48:19] Logged the message, Master [16:50:16] https://bugzilla.wikimedia.org/show_bug.cgi?id=53487 Error with CirrusSearch while deleting a page [16:50:28] Call to a member function getArticleID() on a non-object [16:50:59] $title passed in as null apparnetly [16:51:02] * Reedy find a stack trace [16:51:09] Reedy: looking [16:53:10] Reedy: was that a white page style error or did it let you through? [16:53:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:54:13] (03CR) 10Jgreen: [C: 032 V: 032] prep to make the spamd user configurable, and create spool dir for OTRS spam training [operations/puppet] - 10https://gerrit.wikimedia.org/r/81516 (owner: 10Jgreen) [16:54:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.132 second response time [16:56:10] manybubbles: Wasn't my error, but was in the fatal log so should've been a white page style error [16:57:06] ottomata: here ? [16:57:18] manybubbles: Oooh [16:57:30] manybubbles: Let it through via LQT [16:57:32] * Reedy screenshots [16:57:51] Reedy: I see the cause but I've never been able to reproduce using that method [16:57:56] heya, akosiaris1, yes but just started a meeting [16:58:28] ok will email you then and we talk tomorrow. [16:59:07] manybubbles: http://bug-attachment.wikimedia.org/attachment.cgi?id=13191 [17:00:05] Reedy: excuse my ignorance, but how do I create one of those to delete it? [17:00:12] heh [17:00:20] Have you sysop on mediawiki.org? 
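As manybubbles points out, a bare LIMIT on the unindexed gu_home_db column re-scans the table for every batch; anchoring each batch on the indexed gu_id bounds the work per statement. A minimal sketch of that pattern, assuming pymysql and made-up connection details and window size (the actual one-off in the log only had a few errant rows to fix, so this is purely illustrative):

```python
# Batched UPDATE that walks the primary key in fixed windows instead of
# re-scanning the whole table for each LIMITed batch.
import pymysql

conn = pymysql.connect(host="db-host", user="maint",      # placeholders
                       password="secret", database="centralauth")

def clear_empty_home_db(conn, window=50000):
    with conn.cursor() as cur:
        cur.execute("SELECT MAX(gu_id) FROM globaluser")
        max_id = cur.fetchone()[0] or 0
        low = 0
        while low < max_id:
            high = low + window
            # Each statement touches one bounded primary-key range, so it
            # uses the gu_id index rather than scanning all ~27M rows.
            cur.execute(
                "UPDATE globaluser SET gu_home_db = NULL "
                "WHERE gu_id > %s AND gu_id <= %s AND gu_home_db = ''",
                (low, high),
            )
            conn.commit()
            low = high

clear_empty_home_db(conn)
```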
[17:01:35] manybubbles: https://www.mediawiki.org/wiki/Project_talk:Support_desk Click "Add Topic" in tabs at the top of the page [17:06:35] Reedy: I don't seem to have the option to delete it.... [17:07:55] Reedy: can you nuke the dummy message I put on that page? [17:09:29] manybubbles: just given you sysop, so you should be able to delete it yourself now and see the error.. [17:11:00] Reedy: sweet! now I have to figure out how to get all that installed on my local mediawiki.... [17:11:13] Fairly easily [17:11:40] Checkout the repo for LiquidThreads if you haven't already [17:11:44] require_once( "$IP/extensions/LiquidThreads/LiquidThreads.php" ); [17:11:49] ^ Add that to your LocalSettings.php [17:11:52] Run maintenance/update.php [17:12:24] could be worse [17:15:50] (03CR) 10Asher: [C: 032 V: 032] Revert "add missing uploadlb6 ips" [operations/puppet] - 10https://gerrit.wikimedia.org/r/81509 (owner: 10Asher) [17:17:52] manybubbles: My IDE complains that for CirrusSearchUpdater::updateFromTitle( $title ); you're passing a string not a title objects [17:18:15] $normalTitle = $search->normalizeText( $indexTitle ); [17:18:21] Reedy: your IDE is right. [17:18:34] $search->updateTitle( $this->id, $normalTitle ); [17:18:38] Was just following it back to confirm ;) [17:18:58] Reedy: That method annoyed me because I could never trigger it. Any way, I put a patch on gerrit to fix it blind and now I'm trying to verify it. [17:20:31] manybubbles: Fix looks sane, but newFromID can return null. If you know that your always going to be calling it with a valid ID [17:21:34] (03PS1) 10Jgreen: various otrs + spamassassin and cron tweaks [operations/puppet] - 10https://gerrit.wikimedia.org/r/81521 [17:21:36] Reedy: you are right, I have that protection in other places. [17:22:14] PROBLEM - Puppet freshness on mw1126 is CRITICAL: No successful Puppet run in the last 10 hours [17:22:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:22:47] (03CR) 10Jgreen: [C: 032 V: 031] various otrs + spamassassin and cron tweaks [operations/puppet] - 10https://gerrit.wikimedia.org/r/81521 (owner: 10Jgreen) [17:24:12] <^d> manybubbles: Aww :( Catchable fatal error: Argument 2 passed to Parser::parse() must be an instance of Title, null given, called in /usr/local/apache/common-local/php-1.22wmf14/extensions/RSS/RSSParser.php on line 297 and defined in /usr/local/apache/common-local/php-1.22wmf14/includes/parser/Parser.php on line 351 [17:24:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time [17:24:51] ^d: looks like I missed lots of fun stuff [17:27:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:28:23] (03CR) 10coren: [C: 032] "Simple enough; worst that can happen is that Google doesn't like it." [operations/dns] - 10https://gerrit.wikimedia.org/r/81510 (owner: 10Dr0ptp4kt) [17:28:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.160 second response time [17:29:56] Oh, crap, I expect the procedure for DNS changed now. I don't see the docs as having been updated. [17:30:34] RECOVERY - Disk space on analytics1004 is OK: DISK OK [17:30:49] Reedy: got Liquidthreads installed and..... I can't reproduce it. More digging.... 
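For anyone else trying to reproduce bug 53487 locally: the fatal fires on the page-deletion hook, so a throwaway create-then-delete against a test wiki is enough to exercise the code path while watching the logs. A hedged sketch using the MediaWiki action API via requests; the wiki URL, credentials and page title are placeholders, and it assumes a MediaWiki recent enough to serve meta=tokens.

```python
# Create a dummy page, then delete it, to poke the deletion hook that was
# fataling in CirrusSearch (bug 53487). All identifiers are placeholders.
import requests

API = "http://localhost/w/api.php"
USER, PASSWORD = "TestSysop", "botpassword"

def main():
    s = requests.Session()

    # Log in.
    token = s.get(API, params={"action": "query", "meta": "tokens",
                               "type": "login", "format": "json"}
                  ).json()["query"]["tokens"]["logintoken"]
    s.post(API, data={"action": "login", "lgname": USER,
                      "lgpassword": PASSWORD, "lgtoken": token,
                      "format": "json"})

    csrf = s.get(API, params={"action": "query", "meta": "tokens",
                              "format": "json"}
                 ).json()["query"]["tokens"]["csrftoken"]

    # Create, then delete; the delete is what triggered the fatal.
    s.post(API, data={"action": "edit", "title": "Cirrus delete test",
                      "text": "dummy", "token": csrf, "format": "json"})
    r = s.post(API, data={"action": "delete", "title": "Cirrus delete test",
                          "reason": "reproducing bug 53487",
                          "token": csrf, "format": "json"})
    print(r.json())

if __name__ == "__main__":
    main()
```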
[17:31:20] manybubbles: Looking at the stack traces it seems on mw.org that some people experienced it on just deleting other pages [17:31:20] Ah, it mostly didn't. [17:31:31] You guys looking at another "Call to a member function on a non-object" LQT error [17:31:33] ? [17:31:42] Not quite [17:31:57] (03PS1) 10Jgreen: more OTRS-related mail/spam tweaks [operations/puppet] - 10https://gerrit.wikimedia.org/r/81523 [17:32:07] Deleting a LQT thread on mw.o gives an error like that in CirrusSearch code [17:32:34] https://bugzilla.wikimedia.org/show_bug.cgi?id=53487#c1 [17:33:18] But also occurs in non LQT threads [17:33:27] dr0ptp4kt: wikipedia.org. 600 IN TXT "google-site-verification=yVWXbnKBVBeI5wD3qQIQU8oEiexGBahqESBgrDDQTro" [17:34:11] (03CR) 10Jgreen: [C: 032 V: 031] more OTRS-related mail/spam tweaks [operations/puppet] - 10https://gerrit.wikimedia.org/r/81523 (owner: 10Jgreen) [17:34:12] So deleting stuff gives an error in CirrusSearch? Sometimes that happens to be an LQT thread? [17:34:37] Indeed [17:34:45] But apparently only on mediawikiwiki [17:35:02] yeah, I can't reproduce it locally. [17:35:15] I'd be tempted to put that fix live on mediawiki.org and see how it fares [17:35:41] Reedy: If you think it is good then we may as well. [17:35:53] I'm just happy to get another set of eyes on it [17:36:36] ^d: we'd like to push an update to stop some errors during deletes that seem to only come in production [17:38:43] Coren, thanks for the review. i see the TXT record in DNS already. does that autodeploy or something? [17:39:55] Coren, btw, the goog webmaster tools console verified the record, too. thanks again! [17:40:29] manybubbles: How far behind master is production currently? [17:40:40] ^demon: 1 commit I believe [17:40:54] ^demon: did you open up a bug for that other issue? [17:42:38] (03PS1) 10Jgreen: more spamassassin config variables in template [operations/puppet] - 10https://gerrit.wikimedia.org/r/81525 [17:43:25] <^demon> No, I didn't. [17:43:42] (03CR) 10Jgreen: [C: 032 V: 031] more spamassassin config variables in template [operations/puppet] - 10https://gerrit.wikimedia.org/r/81525 (owner: 10Jgreen) [17:43:56] https://gerrit.wikimedia.org/r/81526 Updates CirrusSearch to master in 1.22wmf14 [17:45:22] ^demon: Do you want to deploy it or shall I? [17:45:31] <^demon> Go ahead :) [17:46:45] ^d: that backtrace you linked above doesn't look like us. [17:46:46] !log reedy synchronized php-1.22wmf14/extensions/CirrusSearch/ 'Fix bug 53487' [17:46:51] Logged the message, Master [17:46:51] Not that it isn't bad [17:46:55] Coren or ^demon, would proper protocol dictate that I revert the now unnecessary change 80660? i'd like to have that removed from production now that we've gone dns txt on this. [17:47:22] <^demon> Not a clue, I know zilch about this really :) [17:47:47] I dunno. It's certainly harmless. Are you sure Google won't check it again in the future? 
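Coren's paste above shows the TXT record already answering. A quick programmatic check of the same thing (equivalent to `dig TXT wikipedia.org`), assuming dnspython is available; the token is the one quoted in the log:

```python
# Confirm a google-site-verification TXT record is visible in DNS.
import dns.resolver

def has_verification_token(name, token):
    answers = dns.resolver.resolve(name, "TXT")   # dnspython >= 2.0
    for rdata in answers:
        # TXT rdata is a sequence of byte strings; join and decode them.
        value = b"".join(rdata.strings).decode("utf-8", "replace")
        if value.startswith("google-site-verification=") and token in value:
            return True
    return False

if __name__ == "__main__":
    print(has_verification_token(
        "wikipedia.org", "yVWXbnKBVBeI5wD3qQIQU8oEiexGBahqESBgrDDQTro"))
```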
[17:48:34] !log restarting some datanodes [17:48:36] PROBLEM - DPKG on analytics1014 is CRITICAL: Connection refused by host [17:48:37] PROBLEM - DPKG on analytics1011 is CRITICAL: Connection refused by host [17:48:40] Logged the message, Master [17:48:44] they might, maybe look at other top 10 website's dns entries to see if they still have them (other than microsfot and yahoo, of course) [17:49:03] manybubbles: Looks to be fixed for me [17:49:16] PROBLEM - DPKG on analytics1013 is CRITICAL: Timeout while attempting connection [17:49:16] PROBLEM - Host analytics1012 is DOWN: PING CRITICAL - Packet loss = 100% [17:49:16] PROBLEM - Host analytics1014 is DOWN: PING CRITICAL - Packet loss = 100% [17:49:37] LinkedIn might be a good choice based on this list: http://www.alexa.com/topsites [17:50:09] or twitter, they use google analytics [17:50:14] <^demon> manybubbles: We're averaging about 4 pages/s for mw.org. We've gotta find a way to improve this :) [17:50:37] PROBLEM - Host analytics1011 is DOWN: PING CRITICAL - Packet loss = 100% [17:50:37] PROBLEM - Host analytics1013 is DOWN: PING CRITICAL - Packet loss = 100% [17:50:55] ^demon: Threads! [17:51:01] forkforkfork [17:51:08] <^demon> I tried farming it out to about 4 processes. [17:51:15] <^demon> They all averaged about 4 pages/s. [17:51:24] <^demon> Which means I was doing a whopping 16/s. Screw that :) [17:51:26] RECOVERY - Host analytics1012 is UP: PING OK - Packet loss = 0%, RTA = 1.82 ms [17:51:27] ^demon: yeah, I'm not happy about it [17:51:36] <^demon> I know the script can theoretically do around 50/s/thread. [17:51:36] RECOVERY - Host analytics1011 is UP: PING OK - Packet loss = 0%, RTA = 0.75 ms [17:51:37] RECOVERY - Host analytics1013 is UP: PING OK - Packet loss = 0%, RTA = 0.44 ms [17:51:38] ^demon: 16 is 4 times better! [17:51:46] RECOVERY - Host analytics1014 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [17:51:49] ^demon: not when it has lots of templates to resolve [17:52:05] <^demon> Yeah. But the fact that we're not hitting the cache as much kinda sucks :\ [17:56:32] ^demon: I haven't done enough digging to find out why our cache hit rate is so low. one of the other problems is that if we do start hitting the cache we could push user traffic out by crawling the whole thing [17:56:43] (03PS1) 10Jgreen: fix spam/ham mbox paths in spamassassin training script [operations/puppet] - 10https://gerrit.wikimedia.org/r/81530 [17:56:44] (03PS3) 10Andrew Bogott: Turn on pluginsync. [operations/puppet] - 10https://gerrit.wikimedia.org/r/77378 [17:56:45] (03PS4) 10Andrew Bogott: Move base class and subclasses into a 'base' module. [operations/puppet] - 10https://gerrit.wikimedia.org/r/77332 [17:57:44] (03CR) 10Jgreen: [C: 032 V: 031] fix spam/ham mbox paths in spamassassin training script [operations/puppet] - 10https://gerrit.wikimedia.org/r/81530 (owner: 10Jgreen) [17:58:40] akosiaris1: ah. sorry. yes, please fix that if you haven't already [18:00:41] ^demon and Reedy: question: have we slowed down page updates appreciably? I imagine they are slower but I wonder how much. [18:02:37] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:03:26] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time [18:03:57] ^demon: it looks like load has gone nuts on testsearch1002 [18:04:13] <^demon> testsearch1002 I think is still freaking out for other reasons. [18:04:17] <^demon> 1001 and 3 are just fine. 
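On the 4 pages/s indexing rate: the farming-out ^demon describes amounts to chunking the page-id space and handing ranges to a pool of workers. A rough sketch of that shape; index_range() is a stand-in for whatever actually indexes a range (the log doesn't show the real script's flags), and the counts are taken loosely from the conversation.

```python
# Fan indexing work out over a pool of worker processes, one id range each.
from multiprocessing import Pool

MAX_PAGE_ID = 130000      # ~130k pages on mediawiki.org per the log
WORKERS = 4
CHUNK = 5000

def index_range(bounds):
    start, end = bounds
    # Placeholder: run the real indexer for pages with start <= id < end.
    print("indexing pages %d..%d" % (start, end - 1))
    return end - start

def main():
    chunks = [(lo, min(lo + CHUNK, MAX_PAGE_ID + 1))
              for lo in range(0, MAX_PAGE_ID + 1, CHUNK)]
    with Pool(WORKERS) as pool:
        done = sum(pool.map(index_range, chunks))
    print("indexed %d page slots" % done)

if __name__ == "__main__":
    main()
```

As the rest of the exchange notes, each worker was bounded by template parsing and a low parser-cache hit rate, so adding processes only multiplied a slow baseline.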
[18:06:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:07:26] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.135 second response time [18:10:14] (03PS1) 10Ottomata: Moving Hadoop JournalNodes to analytics10[123] [operations/puppet] - 10https://gerrit.wikimedia.org/r/81533 [18:10:29] (03CR) 10Ottomata: [C: 032 V: 032] Moving Hadoop JournalNodes to analytics10[123] [operations/puppet] - 10https://gerrit.wikimedia.org/r/81533 (owner: 10Ottomata) [18:11:52] Reedy: You mind if I do a quick deploy? Looks like you and chad are the only ones doing stuff on tin right now.. [18:12:18] (03PS1) 10Jgreen: enable train_spamassassin script on iodine [operations/puppet] - 10https://gerrit.wikimedia.org/r/81534 [18:12:38] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:13:13] ^demon, Coren: will leave the file on webroot in place for the time being. [18:13:28] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 1.250 second response time [18:14:35] (03CR) 10Jgreen: [C: 032 V: 031] enable train_spamassassin script on iodine [operations/puppet] - 10https://gerrit.wikimedia.org/r/81534 (owner: 10Jgreen) [18:18:38] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:19:29] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 7.844 second response time [18:19:33] Coren: did you do any verification before merging that TXT record? [18:19:56] Coren: i.e. that it really came from dr0ptp4kt, not impersonated [18:21:13] paravoid: ... well, it came from someone having dr0ptp4kt's credentials and currently on IRC with his cloak at the very least, and consistent with the ongoing discussion. I wouldn't have gauged this to be a skype-worthy moment. [18:21:49] (i.e.: the email wouldn't have been enough, the gerrit changeset was for me) [18:22:18] <^demon> manybubbles: I might end up setting a new world record for thumb twiddling today if it takes this long to index :)\ [18:22:23] csteipp: I'm not doing anything so go ahead for me [18:22:54] mark, kafka brokers are back up, more using different partition data, you might want to restart your varnishkafka instance [18:22:55] ^demon: I'm just going go go back to fixing bugs I guess. how many articles is it going to index? [18:23:12] we should also get you testing with magnus' newer librdkafka soon [18:23:17] <^demon> manybubbles: ~130k I think. [18:23:40] paravoid: Feeling jittery given the recent antics of the SEA? [18:23:58] feeling paranoid in general :) [18:24:23] but if anything, the recent events support my social engineering paranoia :P [18:24:28] RECOVERY - MySQL Replication Heartbeat on db43 is OK: OK replication delay -1 seconds [18:24:29] RECOVERY - MySQL Slave Delay on db43 is OK: OK replication delay 0 seconds [18:25:06] paravoid: !!!! my roommate's brother was detained in Greece when he tried to fly to Turkey for "draft dodging" or whatever you all would call it. [18:25:49] paravoid: they're both Greece citizens along with US, but haven't lived there since they were really young, and they need to prove they haven't spent 90 days in Greece over the last 11 years or else they need to do the mandatory service thing [18:26:01] I had to scan my roommate's old passport to prove he hasn't. 
[18:26:04] paravoid: If it makes you feel any better, I'm generally paranoid as well -- I worked long enough in security for that. :-) But this request was both consistent, authenticated and expected so no alarm bells were rung. I'm generally imprevious to bavarian firedrills. :-) [18:26:17] greg-g: wtf [18:26:44] greg-g: This is why, if you're a citizen of a country with mandatory service, you don't enter that country [18:26:44] greg-g: 90 days over the last 11 years? that sounds wrong [18:26:53] paravoid: that's what I thought when I read the email from his this morning with the subject "Detained in Greece (maybe)" [18:27:18] greg-g: are they free now? do you need lawyer contacts? [18:27:23] it's a lot more days than that (but who knows what the particular official(s) knew about the law anyways) [18:27:38] (Of course apergos knows random things about Greek immigration law) [18:27:48] (Says the guy that knows all sorts of things he really shouldn't about US immigration law) [18:27:52] <^demon> RoanKattouw: This is why my brother doesn't return to Russia. [18:27:54] paravoid: I don't think yet. He has a decently well to do family, so I'm sure he can arrange as needed. He hasn't tried to leave yet so he's fine. Not sure about his brother. [18:27:58] Yeah exactly [18:28:01] pretty random yeah. although the military service thing I only know form other people who have *cough* dodged it [18:28:02] <^demon> He hasn't officially renounced his Russian citizenship. [18:28:07] but yeah, 90d in 11y is what his email said :/ [18:28:18] ^demon: At least Russia /allows/ you to renounce citizenship? [18:28:26] Morocco famously doesn't allow you to, except by royal decree [18:28:36] <^demon> I believe so. [18:28:43] Russian citizenship renounces you? [18:28:43] woah that's crazy [18:29:02] YuviPanda++ [18:29:07] greg-g: how long did they stay in greece? [18:29:14] Which has caused interesting controversies in .nl where some people are really anal about immigrants having to drop their other citizenships when they get naturalized, and so dual citizens from naturalization shouldn't exist, except they do if they're Moroccan [18:29:14] was it over 30 days? [18:29:36] paravoid: my roommate arrived on 8/19 [18:29:40] even if you're not considered a non-resident, you have the right to be in the country for 30 days [18:29:45] (plus a few days per election) [18:29:48] I think they also have some sort of mandatory service thing, which you can't get rid of by renouncing your citizenship because you can't [18:29:53] well us citizens can be in greece for 90 [18:29:54] paravoid: not sure about his brother who is detained [18:30:19] I'm reading the law now [18:31:02] to be considered a non-resident you must live outside the country for > 180 days every year for the past 7 or 11 years [18:31:18] Morocco also helpfully conveys citizenship on children of Moroccans born abroad, and so we have 2nd or 3rd generation immigrants that need to be careful about traveling to Morocco when they're a certain age, because if the government figures out they're a citizen they get in trouble for service stuff [18:31:19] 7 if you were there for work, 11 if not (e.g. student) [18:31:25] so, now I'm uber paranoid because I have a scan of his passport on my laptop, and my work laptop was stolen 2 weeks ago, so I'm going to clutch this thing like my life depended on it [18:31:27] greg-g, did you confirm that email isn't one of those frauds to steal money? 
[18:31:56] Platonides: it was either my roommate, or someone who knew exactly where in his room his passport was. [18:32:11] and if that's not the case you can visit for 30 days per year [18:32:21] plus "up to" 40 days per election period [18:32:23] paravoid: huh, then at least my roommate should be fine [18:32:39] (and how did he enter the country without the passport? :P) [18:32:43] (I looked over his passport stamps to see) [18:32:47] !log csteipp synchronized php-1.22wmf14/extensions/OAuth [18:32:47] Platonides: his old one [18:32:53] Logged the message, Master [18:33:04] I never got my passport stamped [18:33:05] Platonides: he needed his records for the past 11years (or whatever) [18:33:17] yeah, that's reasonable [18:33:29] fsvo reasonable [18:33:33] :) [18:33:35] which would clearly be much easier to fetch for the government than your brother [18:33:38] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:33:46] army is not [18:33:48] Platonides: yeah, like, they don't have the records themselves? [18:33:50] but that's what the law says [18:34:03] hahaha [18:34:06] that was funny [18:34:08] !log csteipp synchronized php-1.22wmf14/extensions/Wikibase [18:34:13] so, the law is ever worse than that [18:34:14] Logged the message, Master [18:34:28] it says that you have to apply a certificate from the Greek embassy in the US [18:34:31] !log csteipp synchronized php-1.22wmf14/extensions/CheckUser [18:34:33] that certifies that you're a resident in the US [18:34:37] Logged the message, Master [18:34:54] and you have to collect all proof, and pay 10 euros [18:35:10] then you get the certificate, that will be on record and they wouldn't have stopped them at all (presumably) [18:35:28] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.590 second response time [18:35:41] paravoid: wow [18:35:51] !log csteipp synchronized php-1.22wmf13/extensions/CheckUser [18:35:55] well, good luck to Ionas (my roommate) then [18:35:57] Logged the message, Master [18:35:58] makes sense [18:36:01] if you don't want to go through all that, there's a nice workaround [18:36:12] only fly intra-schengen with greece [18:36:13] although I would have expected that he needed the certificate from the US embassy in Greek instead [18:36:17] pay the official $EUR500? [18:36:18] go US -> frankfurt -> athens for example [18:36:20] !log csteipp synchronized php-1.22wmf13/extensions/Wikibase [18:36:25] Logged the message, Master [18:36:34] no passport check in athens, noone will ever care about all that in frankfurt, profit [18:36:36] *in Greece [18:36:39] PROBLEM - DPKG on analytics1027 is CRITICAL: Timeout while attempting connection [18:36:46] * greg-g notes that down [18:36:56] And the same when you leave [18:36:59] yes [18:37:00] Schengen has exit controls too [18:37:08] PROBLEM - Host analytics1026 is DOWN: PING CRITICAL - Packet loss = 100% [18:37:08] PROBLEM - Host analytics1027 is DOWN: PING CRITICAL - Packet loss = 100% [18:37:13] So in order to leave Greece "safely" you'd have to fly to another Schengen country first, then leave Schengen from there [18:37:22] yep [18:37:35] oh geopolitics [18:37:36] they probably went from athens -> turkey, which has border controls [18:37:39] It's easier in the Western Europe part of Schengen where there are land borders with other Schengen countries that you can just drive across like a US state border [18:38:03] did they use their Greek passports to pass? 
[18:38:04] * greg-g goes to get lunch and contemplate free-to-roam [18:38:09] So you can like fly into Frankfurt and drive or take a train from there to Amsterdam or Brussels or wherever it is that you want to go [18:38:10] paravoid: dunno [18:38:13] ah now there's a question [18:38:18] or did they get flagged with the US passport? [18:38:20] greg-g: I'm trying to sync [18:38:22] if they gave their us passports they hould never have run into a problem [18:38:23] paravoid, I suppose so [18:38:23] not sure [18:38:29] *should [18:38:33] with a US passport it would have been harder to flag them [18:38:35] greg-g: sorry, to sync CentralNotice in a few minutes. please let me know if there's a reason to hold off. [18:38:38] I'd be amazed if they did, but you never know [18:38:40] Although in some countries it's not technically legal to do that [18:38:53] Some countries require that, if you have a valid passport issued by them, you must use it to enter that country [18:38:57] and 'I have a US passport' should be a good hint that he is a US resident! [18:39:02] awight: yeah, you're good until 1pm [18:39:04] they would have to show some id on intra schengen even, but the us passport should be fine for that [18:39:10] greg-g: OK thanks! [18:39:21] apergos: technically illegal though [18:39:28] hush you [18:39:28] for the US as well [18:39:37] Yeah you have to show ID for intra-Schengen flights but only at security [18:39:38] perhaps they don't have a US passport [18:39:58] if you have a passport issued by the country you visit you are required to enter the country with that passport [18:40:04] at least my roommate does (I have no clue on what's going on with his detained brother) [18:40:16] so if you have dual Greek/US citizenship and you try to enter the US with the Greek one [18:40:17] ... has a US passport [18:40:24] ok, time for lunch [18:40:24] I'm assuming you'll be in deep trouble [18:40:53] they would screen you when with the us passsport they would (theoretically) say 'welcome home' [18:40:58] You may be [18:41:13] I mean they might try to kick you out after 90 days because you entered as a foreigner [18:41:15] in practice I have gotten worse treatment going into the us with a us passport than coming here with the same one [18:41:28] ouch :) [18:41:50] * paravoid is trying to get a second citizenship [18:41:57] I actually have an appointment tomorrow [18:42:01] but since greeks travel to the us without a special visa any more they are not likely to raise a fuss [18:42:06] oh? [18:42:16] RECOVERY - Host analytics1026 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [18:42:16] RECOVERY - Host analytics1027 is UP: PING OK - Packet loss = 0%, RTA = 0.44 ms [18:42:23] which country if I may pry? [18:42:27] cyprus [18:42:37] * apergos wonders if that's a good idea [18:42:44] I have my reasons [18:42:45] :) [18:42:54] I am sure you do! [18:42:59] very long story [18:43:00] and I'll hear 'em over beer someday maybe [18:43:22] as long as it doesn't land you in a bunch of new horrible obligations [18:44:00] ^demon, so you have Russian roots? 
[18:44:27] PROBLEM - RAID on analytics1027 is CRITICAL: Connection refused by host [18:44:27] PROBLEM - Disk space on analytics1026 is CRITICAL: Connection refused by host [18:44:27] PROBLEM - RAID on analytics1026 is CRITICAL: Connection refused by host [18:44:27] PROBLEM - SSH on analytics1027 is CRITICAL: Connection refused [18:44:27] PROBLEM - SSH on analytics1026 is CRITICAL: Connection refused [18:44:36] PROBLEM - Disk space on analytics1027 is CRITICAL: Connection refused by host [18:44:43] it's very much related to the above conversation [18:44:46] PROBLEM - DPKG on analytics1026 is CRITICAL: Connection refused by host [18:44:56] (03CR) 10Dzahn: "yea, you did in change 70242" [operations/puppet] - 10https://gerrit.wikimedia.org/r/81443 (owner: 10Dzahn) [18:45:21] hope you get it all worked out [18:45:33] hope so, it's the second appointment :) [18:45:48] my roots are such a mess that they got confused :) [18:45:56] :-D [18:47:07] I have papers from my grandfather that say "sujet Britannique est d'origine Grecque de Chypre" [18:47:45] in french *and* arabic [18:48:27] arabic? badass [18:49:26] PROBLEM - MySQL Replication Heartbeat on db43 is CRITICAL: CRIT replication delay 183 seconds [18:49:36] PROBLEM - MySQL Slave Delay on db43 is CRITICAL: CRIT replication delay 188 seconds [18:52:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:53:26] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [18:54:26] RECOVERY - SSH on analytics1027 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [18:54:26] RECOVERY - SSH on analytics1026 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [18:55:20] <^demon> MaxSem: I do not. Brother and I are both adopted, he's from Russia originally. [18:55:28] ah [18:56:10] ^demon: that's cool! your parents must be awesome. [18:56:46] PROBLEM - NTP on analytics1026 is CRITICAL: NTP CRITICAL: No response from NTP server [18:56:56] PROBLEM - NTP on analytics1027 is CRITICAL: NTP CRITICAL: No response from NTP server [18:57:35] travelling to Russia is generally safe [18:57:52] unless you're Snowden :P [18:57:52] you can't be drafted unless you've signed a summon paper [18:58:08] Platonides, even Cuba refused to accept him [18:58:15] I know [18:58:27] RECOVERY - MySQL Replication Heartbeat on db43 is OK: OK replication delay -1 seconds [18:58:36] RECOVERY - MySQL Slave Delay on db43 is OK: OK replication delay 0 seconds [18:59:11] I guess dual citizens need to visit a military comissariat themselves to get drafted if they're not residents:) [18:59:13] although IMHO this shows the pressure that was over all those countries [18:59:40] difference being that Russia has enough balls to say NSA to f/o [19:03:30] the FSB can take care of itself :P [19:05:11] (03CR) 10Ori.livneh: [C: 031] tcpircbot: rm unused `os` module import [operations/puppet] - 10https://gerrit.wikimedia.org/r/81489 (owner: 10Hashar) [19:08:31] ori-l: what does the Front-Side Bus have to do with this? [19:09:14] with the Nunavut Settlement Area, you mean? 
[19:12:50] yea, once Microsoft had a key for that [19:13:24] PROBLEM - MySQL Replication Heartbeat on db43 is CRITICAL: CRIT replication delay 191 seconds [19:13:34] PROBLEM - MySQL Slave Delay on db43 is CRITICAL: CRIT replication delay 196 seconds [19:21:25] RECOVERY - MySQL Replication Heartbeat on db43 is OK: OK replication delay -1 seconds [19:21:34] RECOVERY - MySQL Slave Delay on db43 is OK: OK replication delay 0 seconds [19:23:21] (03CR) 10Andrew Bogott: [C: 032] remove old RT 3.8 stuff and just keep RT 4.x Apache stuff [operations/puppet] - 10https://gerrit.wikimedia.org/r/81443 (owner: 10Dzahn) [19:28:48] !log awight synchronized php-1.22wmf13/extensions/CentralNotice 'Updating CentralNotice to wmf_deploy' [19:28:54] Logged the message, Master [19:29:24] !log awight synchronized php-1.22wmf14/extensions/CentralNotice 'Updating CentralNotice to wmf_deploy' [19:29:29] Logged the message, Master [19:32:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:32:50] andrewbogott: thx [19:33:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time [19:34:03] (03PS1) 10Dzahn: run stats update for wikivoyages at 8pm [operations/puppet] - 10https://gerrit.wikimedia.org/r/81558 [19:35:17] (03CR) 10Dzahn: [C: 032] run stats update for wikivoyages at 8pm [operations/puppet] - 10https://gerrit.wikimedia.org/r/81558 (owner: 10Dzahn) [19:40:38] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:44:28] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time [19:47:04] (03PS1) 10Ori.livneh: Drop 'country' from NavTiming graphs; rename to 'browser' [operations/puppet] - 10https://gerrit.wikimedia.org/r/81561 [19:47:56] (03PS2) 10Ori.livneh: Drop 'country' from NavTiming graphs; rename to 'browser' [operations/puppet] - 10https://gerrit.wikimedia.org/r/81561 [19:50:08] binasher: ! [19:50:44] (03CR) 10Ori.livneh: "The < 60000 test is a crude filter to exclude extreme outliers. I'd ideally like to use a subset of the W3C's test suite (http://w3c-test." [operations/puppet] - 10https://gerrit.wikimedia.org/r/81561 (owner: 10Ori.livneh) [19:51:55] binasher: got a sec to merge ^^ ? [19:52:38] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:53:29] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.136 second response time [19:54:46] or someone else, for that matter; it's a very small diff (+2 / -3) [20:00:12] blargh. [20:01:52] (03CR) 10coren: [C: 032] "Seems sane." [operations/puppet] - 10https://gerrit.wikimedia.org/r/81561 (owner: 10Ori.livneh) [20:02:01] danke [20:02:04] np [20:05:21] (03PS2) 10coren: sudo right for hashar on lanthanum (Jenkins slave) [operations/puppet] - 10https://gerrit.wikimedia.org/r/81203 (owner: 10Hashar) [20:05:42] (03CR) 10coren: [C: 032] "Given that he has access to the master, the slave is a no-brainer." [operations/puppet] - 10https://gerrit.wikimedia.org/r/81203 (owner: 10Hashar) [20:06:14] Coren: note jenkins is unable to merge on operations/puppet :) [20:06:35] hashar: How so? 
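On ori-l's NavTiming change and the "< 60000" review comment above: the idea is simply to drop implausible Navigation Timing samples before they reach statsd/graphite. A minimal sketch with placeholder metric names and statsd host; the 60-second cutoff is the one mentioned in the review, and the real navtiming service may of course do more.

```python
# Crude outlier filter for Navigation Timing samples sent to statsd.
import socket

STATSD = ("statsd.example.org", 8125)   # placeholder host
MAX_MS = 60000                          # drop anything >= one minute

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

def record_timing(metric, value_ms):
    """Send one timing sample, unless it is an obvious outlier."""
    if not isinstance(value_ms, (int, float)):
        return
    if value_ms <= 0 or value_ms >= MAX_MS:
        return                           # discard extreme or nonsensical values
    payload = "%s:%d|ms" % (metric, value_ms)
    sock.sendto(payload.encode("ascii"), STATSD)

# e.g. record_timing("frontend.navtiming.responseStart.browser.Firefox", 842)
```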
[20:06:36] coren, there's a three day wait policy is all [20:06:37] (03PS1) 10Chad: Secure login to true for all wikis [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/81563 [20:07:04] Coren: yeah, I think you get to wait for a few more people to cast their vote on the RT ticket, just in case [20:07:08] but thanks :-] [20:07:40] it's more like 'wait for no one to speak up. then merge'. anyways whatever [20:07:41] apergos: I thought we discussed this already; that once there was access the three day period was pointless for replicating to other equivalent boxen? [20:08:34] (03PS2) 10CSteipp: Secure login to true for all wikis [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/81563 (owner: 10Chad) [20:08:42] I probably don't remember the discussion [20:08:49] IMO, he got sudo with RT 4101; adding this to the slave is not a new access request. [20:09:32] (03CR) 10CSteipp: [C: 032] Secure login to true for all wikis [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/81563 (owner: 10Chad) [20:11:18] (03Merged) 10jenkins-bot: Secure login to true for all wikis [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/81563 (owner: 10Chad) [20:14:00] Coren: did you merge that change on puppetmaster? doing a manual puppet run on hafnium didn't pull in the change; odd.. [20:14:18] ori-l: Not yet, I was just about to after talking with apergos. :-) [20:14:29] oh, np. i was just confused. no rush. [20:14:50] ori-l: Done now though. [20:14:55] ori-l: btw, I switched Middle-East to esams today [20:15:00] much earlier [20:15:11] eh at the point it's merged in gerrit I consider it a done deal (as in might as well merge on sockpuppet) [20:15:11] csteipp: ^^ too [20:15:47] paravoid: ah, I kept twiddling with the graphite graphs over the last 24h, so they won't be useful, but the data on db1047 should be good (and probably revealing) [20:17:18] i'll set myself a reminder to do a 1-wk comparison before/after a week from now [20:18:07] ok [20:18:19] be also warned that services that were blocked are likely to not be blocked now [20:18:45] until the respective state catches up :) [20:18:49] PROBLEM - Puppet freshness on ssl1 is CRITICAL: No successful Puppet run in the last 10 hours [20:19:03] right [20:19:42] <^demon> manybubbles: I hit that stupid fatal from the RSS extension while reindexing over lunch. I'm in the middle of SSL right now, will have a look at getting mw.org properly indexed after. [20:20:04] ori-l: we should think at some point at providing EL at some non-geodns hostnames [20:20:18] ori-l: then run A/B comparisons between all of our sites across countries [20:20:31] ^demon: Ah! Now I understand. We might want to be more protectionist about that failures during forceSearchIndex so it just keeps going. [20:20:32] ori-l: that would help us shift traffic to the right DC [20:20:44] ori-l: but we should postpone this for after ulsfo is put in place [20:22:39] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:23:16] manybubbles: ^d cirrus search is on wikidata beta labs, right? [20:23:18] !log demon Started syncing Wikimedia installation... : Secure login on alllllll the wikis [20:23:24] Logged the message, Master [20:23:29] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.136 second response time [20:23:30] <^d> aude: Should be on all beta labs. [20:23:33] ok, cool [20:23:36] aude: yeah. 
[20:23:41] it found my new item immediately [20:23:46] in special search [20:24:36] * aude waits until it finds stuff in descriptions [20:24:48] might not work [20:24:49] PROBLEM - Puppet freshness on ssl1006 is CRITICAL: No successful Puppet run in the last 10 hours [20:25:08] (03PS1) 10Bsitu: Disable the job queue that processes echo notification [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/81568 [20:25:29] PROBLEM - MySQL Replication Heartbeat on db1046 is CRITICAL: CRIT replication delay 181 seconds [20:26:39] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:27:10] paravoid: filed as https://bugzilla.wikimedia.org/show_bug.cgi?id=53497 [20:27:29] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.172 second response time [20:27:29] PROBLEM - MySQL Replication Heartbeat on db1046 is CRITICAL: CRIT replication delay 182 seconds [20:27:37] ori-l: you're awesome [20:28:04] ori-l: I'm adding binasher to Cc, he brought up EL at some minimizing latency discussion we had a few ops meetings ago [20:28:15] sure [20:28:40] (i'm mentioning it to give credit :) [20:29:29] PROBLEM - MySQL Replication Heartbeat on db1046 is CRITICAL: CRIT replication delay 182 seconds [20:31:39] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:31:49] PROBLEM - Puppet freshness on ssl1008 is CRITICAL: No successful Puppet run in the last 10 hours [20:32:30] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.123 second response time [20:36:34] !log demon Finished syncing Wikimedia installation... : Secure login on alllllll the wikis [20:36:39] Logged the message, Master [20:36:45] PROBLEM - Puppet freshness on cp1044 is CRITICAL: No successful Puppet run in the last 10 hours [20:39:48] Oh, hey, is anyone able to poke Asher with some urgency? [20:39:52] binasher: !!! [20:40:44] !log demon synchronized wmf-config/InitialiseSettings.php 'All wikis to secure login' [20:40:51] Logged the message, Master [20:45:45] PROBLEM - Puppet freshness on ssl1001 is CRITICAL: No successful Puppet run in the last 10 hours [20:46:45] PROBLEM - Puppet freshness on amssq47 is CRITICAL: No successful Puppet run in the last 10 hours [20:46:55] Coren: I'm assuming he's at lunch with Ryan and Daniel? [20:47:10] greg-g: Curse that need for food! [20:47:29] I know, right? [20:49:45] PROBLEM - Puppet freshness on ssl1003 is CRITICAL: No successful Puppet run in the last 10 hours [20:49:45] PROBLEM - Puppet freshness on ssl1005 is CRITICAL: No successful Puppet run in the last 10 hours [20:49:45] PROBLEM - Puppet freshness on ssl4 is CRITICAL: No successful Puppet run in the last 10 hours [20:49:59] Whatever does the database replication for lab seems to have gone completely berkerk. [20:52:38] All the replicated databases from S3, S4 and S5 have disapeared in the last hour. [20:52:45] PROBLEM - Puppet freshness on cp1043 is CRITICAL: No successful Puppet run in the last 10 hours [20:52:45] PROBLEM - Puppet freshness on ssl1007 is CRITICAL: No successful Puppet run in the last 10 hours [20:53:36] (03PS1) 10Ori.livneh: Notify navtiming service when its Python code changes [operations/puppet] - 10https://gerrit.wikimedia.org/r/81572 [20:53:52] uh oh [20:53:57] maybe springle would be able to help [20:54:17] but away too hm [20:55:36] apergos: he's back now. 
[20:55:45] PROBLEM - Puppet freshness on ssl1002 is CRITICAL: No successful Puppet run in the last 10 hours [20:55:45] PROBLEM - Puppet freshness on ssl3001 is CRITICAL: No successful Puppet run in the last 10 hours [20:55:46] yep [20:58:45] PROBLEM - Puppet freshness on ssl1004 is CRITICAL: No successful Puppet run in the last 10 hours [21:00:45] PROBLEM - Puppet freshness on ssl3003 is CRITICAL: No successful Puppet run in the last 10 hours [21:01:45] PROBLEM - Puppet freshness on ssl1009 is CRITICAL: No successful Puppet run in the last 10 hours [21:02:45] PROBLEM - Puppet freshness on ssl3002 is CRITICAL: No successful Puppet run in the last 10 hours [21:02:45] PROBLEM - Puppet freshness on ssl3 is CRITICAL: No successful Puppet run in the last 10 hours [21:05:48] PROBLEM - Puppet freshness on ssl2 is CRITICAL: No successful Puppet run in the last 10 hours [21:13:40] * Coren prepares a dunce cap for himself. [21:15:57] there are some issues with certificate https://meta.wikimedia.org/wiki/Wikimedia_Forum#arbcom.nl.wikipedia.org.2C [21:17:40] they are not issued for sub-subdomains [21:17:41] For future reference, not finding S3's databases on S4 is perfectly normal. [21:18:29] Thanks Danny_B [21:18:31] Ryan_Lane: ^ [21:18:49] oh, right [21:18:56] heh :) [21:18:57] that's been a problem for ages [21:19:10] and those wikis are HTTPS only I believe [21:19:26] or are they not? [21:19:50] we have a bug in to rename those wikis [21:21:40] oh right [21:22:13] the problem with that was external storage db names [21:22:41] binasher: https://bugzilla.wikimedia.org/show_bug.cgi?id=31335 [21:23:21] binasher: Do you know if there is a nagios plugin to pop up queries over a certain duration? [21:23:37] binasher: https://bugzilla.wikimedia.org/show_bug.cgi?id=19986 [21:23:54] comment 12, 13 [21:34:08] PROBLEM - MySQL Slave Delay on db43 is CRITICAL: CRIT replication delay 183 seconds [21:34:28] PROBLEM - MySQL Replication Heartbeat on db43 is CRITICAL: CRIT replication delay 205 seconds [21:35:05] RECOVERY - MySQL Slave Delay on db43 is OK: OK replication delay 0 seconds [21:35:25] RECOVERY - MySQL Replication Heartbeat on db43 is OK: OK replication delay -0 seconds [21:35:35] RECOVERY - MySQL Slave Delay on db58 is OK: OK replication delay 1 seconds [21:35:55] RECOVERY - MySQL Replication Heartbeat on db58 is OK: OK replication delay 1 seconds [21:37:24] no explosions still, csteipp / ^d ? [21:42:13] Ryan_Lane: anything explode? [21:42:21] (03PS1) 10Dzahn: add missing file with large query that unions data from different tables for a all-in-one display and let that table have a http error column, requested by robih for debugging [operations/debs/wikistats] - 10https://gerrit.wikimedia.org/r/81581 [21:42:30] greg-g: it's all broken. everything. [21:42:36] oh god! [21:42:44] * greg-g hides [21:43:00] heh. hardly a blib, really [21:43:04] Ryan_Lane: are either chris or chad at their desks? [21:43:08] chris is [21:43:11] and chad [21:43:13] I wanna tell communications to publish the post [21:43:28] yell at them for me ;) [21:43:31] csteipp, ^d: ^^ [21:43:42] heh [21:44:09] <^d> Can we turn it back off again first? [21:44:18] <^d> It's not feeling like a secure login deploy if we leave it on. [21:44:27] shuddup [21:45:16] alright then, here we go [21:49:43] ^d: Ryan_Lane: random and old, but how about "#2466: rename gerrit2 account in LDAP" .."Will probably be easier to remove [21:49:46] that once 2.5/2.6 and more things are in the REST api." ? [21:49:52] 2012 [21:49:59] <^d> lol. 
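On Coren's question about a Nagios plugin for queries over a certain duration: a common shape for such a check is to read information_schema.processlist and map the worst offender onto the standard Nagios exit codes. A sketch under assumed host, credentials and thresholds; a real deployment would pull those from configuration.

```python
#!/usr/bin/env python
# Nagios/Icinga-style check: warn/crit if any query has been running
# longer than a threshold. All connection details are placeholders.
import sys
import pymysql

WARN, CRIT = 60, 300          # seconds

def main():
    conn = pymysql.connect(host="db-host", user="nagios", password="secret")
    with conn.cursor() as cur:
        cur.execute(
            "SELECT id, user, time, LEFT(info, 60) "
            "FROM information_schema.processlist "
            "WHERE command = 'Query' AND time >= %s "
            "ORDER BY time DESC", (WARN,))
        rows = cur.fetchall()

    if not rows:
        print("OK: no queries running longer than %ds" % WARN)
        sys.exit(0)
    worst = rows[0][2]
    status, code = ("CRITICAL", 2) if worst >= CRIT else ("WARNING", 1)
    print("%s: %d slow queries, longest %ds: %s"
          % (status, len(rows), worst, rows[0][3]))
    sys.exit(code)

if __name__ == "__main__":
    main()
```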
[21:55:12] binasher: I'd really want to stuff some of my DB scripts in git -- what'd be the apropriate project for 'em do you think? Where did you put your half of replication? [21:57:47] Coren: that's a really good question - i think i still need to check some things in [22:06:06] PROBLEM - Puppet freshness on pdf1 is CRITICAL: No successful Puppet run in the last 10 hours [22:19:46] PROBLEM - DPKG on ms-be1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:22:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:22:56] RECOVERY - MySQL Replication Heartbeat on db55 is OK: OK replication delay -1 seconds [22:23:06] PROBLEM - Puppet freshness on cp1063 is CRITICAL: No successful Puppet run in the last 10 hours [22:23:26] RECOVERY - MySQL Slave Delay on db55 is OK: OK replication delay 0 seconds [22:23:27] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time [22:47:48] !log demon synchronized php-1.22wmf13/extensions/CentralAuth [22:47:53] Logged the message, Master [22:48:12] !log demon synchronized php-1.22wmf14/extensions/CentralAuth [22:48:17] Logged the message, Master [22:49:17] (03PS1) 10Danny B.: cswikinews: Set AbuseFilter notifications [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/81592 [22:49:33] binasher: At first glance, it looks like operations/software is the right place for this. [23:00:56] RoanKattouw: Let me know when you are done with VE deploy, I will push out a config change [23:03:04] bsitu: will do [23:03:21] rmoen: thx, :) [23:03:27] Yeah rmoen is doing the deploy today [23:04:13] RoanKattouw, rmoen: no hurry, take your time [23:07:18] (03PS1) 10coren: maintain-replicas: the script that does the magic [operations/software] - 10https://gerrit.wikimedia.org/r/81593 [23:09:03] (03PS2) 10Dzahn: add missing file with large query that unions data from different tables for a all-in-one display. let that table have a http error column, requested by robih for debugging. 
add missing snipppet for grand total display in wiki syntax Change-Id: Ia400ee3f6 [operations/debs/wikistats] - 10https://gerrit.wikimedia.org/r/81581 [23:12:25] PROBLEM - MySQL Replication Heartbeat on db43 is CRITICAL: CRIT replication delay 188 seconds [23:13:04] PROBLEM - MySQL Slave Delay on db43 is CRITICAL: CRIT replication delay 226 seconds [23:16:34] !log rmoen synchronized php-1.22wmf14/extensions/VisualEditor/ 'Update VisualEditor to master' [23:16:39] Logged the message, Master [23:17:04] RECOVERY - MySQL Slave Delay on db43 is OK: OK replication delay 0 seconds [23:17:25] RECOVERY - MySQL Replication Heartbeat on db43 is OK: OK replication delay -1 seconds [23:22:33] bsitu: all clear :) [23:22:44] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:22:46] rmoen: thx [23:23:34] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time [23:23:38] bsitu: np [23:23:54] (03CR) 10Bsitu: [C: 032] Disable the job queue that processes echo notification [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/81568 (owner: 10Bsitu) [23:24:19] (03Merged) 10jenkins-bot: Disable the job queue that processes echo notification [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/81568 (owner: 10Bsitu) [23:28:22] !log bsitu synchronized wmf-config/InitialiseSettings.php 'Disable the job queue that processes echo notification' [23:28:27] Logged the message, Master [23:38:37] PROBLEM - MySQL Slave Delay on db58 is CRITICAL: CRIT replication delay 183 seconds [23:39:28] PROBLEM - MySQL Replication Heartbeat on db58 is CRITICAL: CRIT replication delay 239 seconds [23:42:17] PROBLEM - Puppet freshness on fenari is CRITICAL: No successful Puppet run in the last 10 hours [23:49:28] RECOVERY - MySQL Replication Heartbeat on db58 is OK: OK replication delay -0 seconds [23:49:37] RECOVERY - MySQL Slave Delay on db58 is OK: OK replication delay 0 seconds