[00:07:09] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:08:59] RECOVERY - RAID on searchidx1001 is OK: OK: State is Optimal, checked 1 logical drive(s), 4 physical drive(s)
[00:11:15] MaxSem: heh
[00:20:49] PROBLEM - MySQL Recent Restart on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:21:39] RECOVERY - MySQL Recent Restart on db1047 is OK: OK 2192283 seconds since restart
[00:23:29] PROBLEM - MySQL Slave Running on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:24:49] PROBLEM - MySQL Recent Restart on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:26:19] RECOVERY - MySQL Slave Running on db1047 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error:
[00:28:39] RECOVERY - MySQL Recent Restart on db1047 is OK: OK 2192703 seconds since restart
[00:32:49] PROBLEM - MySQL Recent Restart on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:34:29] PROBLEM - MySQL Slave Running on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:34:39] RECOVERY - MySQL Recent Restart on db1047 is OK: OK 2193063 seconds since restart
[00:35:19] RECOVERY - MySQL Slave Running on db1047 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error:
[00:53:38] purging issues with both pmtpa and esams
[00:53:43] going to comment on bug now
[00:54:06] or make a new one i think actually
[01:09:11] https://bugzilla.wikimedia.org/56545
[01:10:35] back in a bit. maybe
[01:27:47] PROBLEM - MySQL Recent Restart on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:29:17] PROBLEM - MySQL Slave Running on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:29:47] RECOVERY - MySQL Recent Restart on db1047 is OK: OK 2196363 seconds since restart
[01:30:07] RECOVERY - MySQL Slave Running on db1047 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error:
[01:32:47] PROBLEM - MySQL Recent Restart on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:33:17] PROBLEM - MySQL Slave Running on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:33:37] RECOVERY - MySQL Recent Restart on db1047 is OK: OK 2196603 seconds since restart
[01:35:07] RECOVERY - MySQL Slave Running on db1047 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error:
[01:45:26] springle: check out comment 1 here: https://bugzilla.wikimedia.org/show_bug.cgi?id=56545#c1
[01:45:40] that loosely corresponds with this:
[01:45:47] (Nov 1) 10:51 logmsgbot: springle synchronized wmf-config/db-pmtpa.php 'depool first batch of pmtpa boxes to be decommissioned/shipped'
[01:46:12] did a multicast proxy get decommissioned a little prematurely?
[01:46:27] PROBLEM - Puppet freshness on lvs1002 is CRITICAL: No successful Puppet run in the last 10 hours
[01:47:27] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours
[01:48:08] hrm....I suppose looking at the page history of that page, it could have happened any time between Nov 1 and Nov 3:
[01:48:13] https://de.wikipedia.org/w/index.php?title=Massenmedien&action=history
[01:48:42] also, i think i was not able to confirm the other one from volker
[01:48:51] i wonder how it got fixed
[01:48:57] i didn't try purging any of these myself
[01:49:21] i had a loop through all DCs fetch each page 5 times. just bash+curl
[01:50:39] robla: those pmtpa boxes were only db slaves, and most are still actually running. don't know about multicast
[01:51:16] Something doesn't look right: https://ganglia.wikimedia.org/latest/?r=4hr&cs=&ce=&m=vhtcpd_inpkts_sane&s=descending&c=Text+caches+esams&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4
[01:52:53] It looks like something is interrupting vhtcp traffic to esams
[01:55:37] PROBLEM - Host mediawiki-lb.pmtpa.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:860:ed1a::8
[01:55:57] RECOVERY - Host mediawiki-lb.pmtpa.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 35.38 ms
[01:57:57] yup
[02:01:24] !log LocalisationUpdate failed: git pull of extensions failed
[02:01:46] Logged the message, Master
[02:06:27] springle: I think you and TimStarling are the only ones with the access and know-how to fix the cache purging issue that are in working hours right now. Sean, are you on it, or do we need to get you some help?
[02:08:04] based on the graph that Bryan posted above, there's a legit problem there
[02:08:36] ah...looks like it might actually be going again
[02:08:44] I will look at it
[02:09:19] thanks
[02:09:57] thanks TimStarling
[02:10:12] i'm not familiar with the text caches
[02:11:29] relax everyone, i'm here
[02:11:47] i'll proceed to assist tim by making uninformed guesses about what might be the issue
[02:11:58] ori-l: that was my plan!
[02:12:09] bd808: :)
[02:12:11] seriously....me too
[02:12:14] does anyone know what happened last time?
[02:12:51] https://bugzilla.wikimedia.org/show_bug.cgi?id=54647#c5 is the closest I've gotten so far
[02:13:08] well, clearly, it was a network issue
[02:13:18] That's all I was told sadly.
[02:13:58] The ganglia graphs seem to be showing some traffic now.
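The "loop through all DCs, fetch each page 5 times, just bash+curl" check robla mentions isn't shown in the log. A hypothetical sketch of the same idea — fetch a page through each DC's frontend with the canonical Host header and compare Last-Modified values to spot a stale cache — might look like this (the DC names and placeholder IPs are assumptions, not the production addresses):

```python
# Sketch of a cross-DC cache-staleness check, in the spirit of the
# bash+curl loop described above. Frontend IPs here are placeholders.
from urllib.request import Request, urlopen

DC_FRONTENDS = {
    "eqiad": "192.0.2.10",   # placeholder, not real frontend addresses
    "esams": "192.0.2.20",
    "ulsfo": "192.0.2.30",
}

def fetch_last_modified(dc_ip, host, path):
    """Fetch `path` from one DC's frontend, sending the canonical Host header."""
    req = Request("http://%s%s" % (dc_ip, path), headers={"Host": host})
    with urlopen(req, timeout=10) as resp:
        return resp.headers.get("Last-Modified")

def find_stale_dcs(last_modified_by_dc):
    """Given {dc: Last-Modified header}, return DCs whose copy is older
    than the newest copy seen anywhere (i.e. likely missed a purge)."""
    from email.utils import parsedate_to_datetime
    parsed = {dc: parsedate_to_datetime(lm)
              for dc, lm in last_modified_by_dc.items() if lm}
    if not parsed:
        return []
    newest = max(parsed.values())
    return sorted(dc for dc, ts in parsed.items() if ts < newest)
```

This matches how jeremyb later says he confirmed ulsfo was fine: articles were stale in the same DCs as each other, and fresh in the same DCs as each other.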
[02:14:39] PROBLEM - search indices - check lucene status page on search14 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern found - 52969 bytes in 0.146 second response time
[02:14:39] It looks like there was a ~2h gap form 00:00 to 02:00
[02:14:56] s/form/from/
[02:15:58] * bd808 just got called to dinner
[02:16:10] i'm still reading up on the bug reports
[02:18:01] maybe it was related to the saturation and rerouting on the 26th
[02:25:26] lots of SmokeAlert packet loss emails about esams over the weekend
[02:31:51] well, dobson is not getting HTCP packets, and it is definitely running udpmcast.py and it has joined the relevant multicast group
[02:32:05] dobson had an icinga alert earlier
[02:32:29] but dobson is in pmtpa
[02:32:45] do we really send all HTCP traffic via pmtpa?
[02:32:59] why is ulsfo working then?
[02:33:02] I suppose it is not implausible
[02:33:24] the alerts jeremyb is referring to are: [21:54:45] PROBLEM - NTP peers on dobson is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:33:25] the dobson alerts were all NTP and recovered in ~a min
[02:33:51] also packet loss on emery
[02:34:10] where IIRC 'packet loss' == gaps in udp log seq ids
[02:34:26] obviously there shouldn't be more than one udpmcast in the world
[02:34:42] if there was, there would be duplicate packets sent
[02:42:28] jeremyb: why do you think ulsfo is working?
[02:44:07] UDP RcvbufErrors on dobson rises by about 5k / sec
[02:44:33] ok, so ulsfo is indeed working and is getting UDP packets directly from the apaches
[02:44:41] not from a relay
[02:44:58] ori-l: what does that mean?
[02:45:16] that something is not draining the buffer fast enough?
[02:45:57] you know dobson is a DNS server?
[02:47:26] oh, no. ok, disregard.
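The RcvbufErrors figure Tim quotes is a kernel UDP counter: it increments when a datagram arrives but the socket's receive buffer is full, i.e. the application isn't draining fast enough, exactly as ori-l guesses. The counters are cumulative, so a rate comes from differencing two samples. A small sketch of reading them from Linux's /proc/net/snmp (the 10-second sampling interval is an arbitrary choice for illustration):

```python
# Read the kernel's cumulative UDP counters and turn two samples into a
# drop rate, as in "UDP RcvbufErrors ... rises by about 5k / sec".
# On a live host you would read open("/proc/net/snmp").read() twice.

def parse_udp_counters(snmp_text):
    """Extract UDP counters from /proc/net/snmp-style text.

    The file holds a header line naming the fields and a value line per
    protocol; we zip the two "Udp:" lines together into a dict."""
    lines = [l.split() for l in snmp_text.splitlines() if l.startswith("Udp:")]
    header, values = lines[0], lines[1]
    return dict(zip(header[1:], (int(v) for v in values[1:])))

def rcvbuf_error_rate(sample_a, sample_b, seconds):
    """Receive-buffer drops per second between two samples `seconds` apart."""
    return (sample_b["RcvbufErrors"] - sample_a["RcvbufErrors"]) / seconds
```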
[02:48:00] it's not saturated
[02:48:47] CPU usage is not especially high
[02:48:56] it looked high at first, and then I realised it only has 2 cores
[02:49:28] anyway, it is unlikely that I can fix this
[02:51:23] what do you think might be going on?
[02:54:33] I don't know, I just know enough about networks to know when it's time to call Mark or Leslie
[02:55:02] it wouldn't be the first failure of multicast routing, would it?
[02:55:42] I can probably move udpmcast to eqiad
[02:55:53] then they can have their sleep
[02:57:31] i just came across https://wikitech.wikimedia.org/wiki/Multicast_HTCP_purging which has some useful tips, i assume you're aware but mentioning it just in case
[02:59:21] what's a quiet misc host at eqiad that I should put this on?
[03:00:12] tungsten
[03:00:26] slotted to be the graphite host, replacing professor
[03:00:41] right now it's just running statsd (udp metric aggregation) with very minimal traffic going on
[03:00:58] ok
[03:01:10] I don't know if it matters, but my irc logs say that the endpoint in esams is nescio now and not hooft. Hooft is still running a copy of the relay but the varnish boxes were seeing packets from nescio on 2013-10-16.
[03:04:55] yes, nescio
[03:05:01] I just copied the configuration from dobson
[03:05:28] !log moved udpmcast.py from dobson to tungsten to work around failure of multicast routing eqiad -> pmtpa
[03:05:45] Logged the message, Master
[03:07:29] yeah, ok, so it needs to be a server with an external IP address
[03:09:49] like... carbon
[03:10:17] or chromium, if we stick with sharing DNS recursors
[03:10:31] chromium has low CPU usage
[03:12:41] !log moved udpmcast to chromium since it actually has an external IP address
[03:12:55] so, it works now
[03:12:55] Logged the message, Master
[03:19:27] I see packets being logged in ganglia again.
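The purges being relayed here are HTCP CLR datagrams (RFC 2756) sent over UDP multicast, as described on the Multicast_HTCP_purging page linked above. As a rough illustration of what one of those datagrams looks like — the byte layout below is a simplified approximation of the RFC (nibble packing and AUTH handling are glossed over), not a byte-exact copy of what MediaWiki emits:

```python
# Approximate sketch of an HTCP CLR ("purge this URL") datagram per
# RFC 2756. Field layout is a simplification for illustration only.
import struct

HTCP_OP_CLR = 4  # CLR opcode per RFC 2756

def countstr(s):
    """RFC 2756 COUNTSTR: 2-byte big-endian length prefix, then the bytes."""
    b = s.encode("ascii")
    return struct.pack("!H", len(b)) + b

def build_htcp_clr(url, trans_id=0):
    # SPECIFIER: method, URI, version, request headers (empty here)
    specifier = countstr("HEAD") + countstr(url) + countstr("HTTP/1.0") + countstr("")
    # CLR op-data starts with a 2-byte reserved field before the specifier
    op_data = struct.pack("!H", 0) + specifier
    # DATA: its own length (incl. this 8-byte header), opcode/response byte
    # (nibble order glossed over), flags byte, 4-byte transaction id, op-data
    data_len = 8 + len(op_data)
    data = struct.pack("!HBBI", data_len, HTCP_OP_CLR, 0, trans_id) + op_data
    auth = struct.pack("!H", 2)  # empty AUTH block; its length includes itself
    # HEADER: total length, then major/minor version
    total_len = 4 + data_len + 2
    return struct.pack("!HBB", total_len, 2, 0) + data + auth
```

In production the relay's job is just to receive these datagrams on the multicast group and re-send them toward networks the multicast routing doesn't reach, which is why the box running it needs reachable addressing on both sides.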
[03:22:54] RECOVERY - check_job_queue on hume is OK: JOBQUEUE OK - all job queues below 10,000
[03:25:54] PROBLEM - check_job_queue on hume is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:26:05] RECOVERY - check_job_queue on fenari is OK: JOBQUEUE OK - all job queues below 10,000
[03:29:04] PROBLEM - check_job_queue on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:35:11] when I search bing for "htcp port", it can't even admit any possibility that i might have actually meant "htcp" and not made a typo
[03:35:33] I didn't know people binged.
[03:35:39] it just shows all pages containing "http", with no note to say that it has done so, and no link to actually search for htcp
[03:36:15] sure, I use bing, it is kinder to RequestPolicy since it doesn't track every outbound click
[03:36:52] google actually shows you links with the correct URL, but changes the href attribute on left/right/middle mouse down
[03:37:11] so when the click event occurs, it sends you via a tracking page
[03:37:14] PROBLEM - Host wikiversity-lb.pmtpa.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:860:ed1a::7
[03:37:28] whereas bing doesn't track outbound clicks at all
[03:37:34] RECOVERY - Host wikiversity-lb.pmtpa.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 35.43 ms
[03:38:02] it is kind of halfway between google and duck duck go :)
[03:38:27] but yes, you have to start your queries with "+" or else it thinks you are stupid
[03:38:34] PROBLEM - Host wikidata-lb.pmtpa.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:860:ed1a::12
[03:38:54] RECOVERY - check_job_queue on hume is OK: JOBQUEUE OK - all job queues below 10,000
[03:39:24] RECOVERY - Host wikidata-lb.pmtpa.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 16%, RTA = 31.05 ms
[03:39:34] PROBLEM - NTP peers on linne is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:41:24] RECOVERY - NTP peers on linne is OK: NTP OK: Offset -0.007041 secs
[03:41:54] PROBLEM - check_job_queue on hume is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:42:27] Ah, Google dropped "+" syntax support, I think.
[03:42:37] I use Chrome as browser, but default to Duck Duck Go. It's decent.
[03:42:56] Its !bang syntax is nice.
[03:45:24] PROBLEM - Host foundation-lb.pmtpa.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:860:ed1a::9
[03:45:44] RECOVERY - Host foundation-lb.pmtpa.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 35.58 ms
[03:48:54] RECOVERY - check_job_queue on hume is OK: JOBQUEUE OK - all job queues below 10,000
[03:51:55] PROBLEM - check_job_queue on hume is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:53:54] RECOVERY - check_job_queue on hume is OK: JOBQUEUE OK - all job queues below 10,000
[03:56:54] PROBLEM - check_job_queue on hume is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:15:03] !log LocalisationUpdate failed: git pull of extensions failed
[04:15:13] this is me
[04:15:19] Logged the message, Master
[04:19:48] !log l10nupdate failures are due to "No submodule mapping found in .gitmodules for path 'WikibaseDatabase'"; repo appears to have been deleted; fixed with git rm --cached WikibaseDatabase as l10nupdate on tin.
[04:20:02] Logged the message, Master
[04:30:48] !log LocalisationUpdate completed (1.23wmf1) at Mon Nov 4 04:30:48 UTC 2013
[04:31:02] Logged the message, Master
[04:33:23] yay
[04:40:40] !log LocalisationUpdate completed (1.23wmf2) at Mon Nov 4 04:40:40 UTC 2013
[04:40:56] Logged the message, Master
[04:49:54] Elsie: + for a google search? it's now just wrap in double quotes
[04:50:02] Elsie: they use to warn about the deprecation
[04:51:23] TimStarling: i knew ulsfo was working from last-modified headers as pasted on the bug. both unpurged some places articles were old in the same DCs as each other and new in the same DCs as each other
[04:51:54] where does SAL twitter relay run now? i guess it's still morebots?
[04:52:03] is apparently broken
[04:52:56] still morebots
[04:54:36] ahhh, ori-l is one of the morebots peoples :P
[04:54:50] per tools.wmflabs.org index :)
[04:54:59] no thank you
[04:55:12] i rewrote logmsgbot and it works very reliably
[04:55:25] morebots is a mess
[04:55:25] I am a logbot running on tools-exec-02.
[04:55:25] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log.
[04:55:25] To log a message, type !log .
[04:56:06] haha
[04:56:13] ori-l: add me to the morebots tool then? :)
[04:56:50] OK
[05:02:36] morning
[05:02:44] what's going on?
[05:03:24] see ops list
[05:03:51] k, I literally just started working so I haven't seen anything yet
[05:03:51] RECOVERY - check_job_queue on hume is OK: JOBQUEUE OK - all job queues below 10,000
[05:03:59] nothing urgent afaik
[05:04:35] rpf most likely
[05:05:00] TimStarling: think it's worth repurging the interim? (like was already done last month) do we have a well defined start time for the breakage?
[05:05:18] the graph shows sometime wed or thurs and again a day later i think
[05:05:33] but it also looks like the graph hasn't been around too long
[05:05:36] https://ganglia.wikimedia.org/latest/stacked.php?m=vhtcpd_inpkts_sane&c=Text%20caches%20esams&r=week&st=1383541389&host_regex=
[05:05:41] do we have a log of BGP sessions? maybe that will correlate to the ganglia data
[05:06:03] state transitions you mean?
[05:06:09] juniper logs have these, yes
[05:06:11] i *think* BGP logging may have been turned on at some point?
[05:06:21] and observium will probably have them as well
[05:06:27] not too long ago. e.g. last 2 months
[05:06:28] observium has some logs, yes
[05:06:58] PROBLEM - check_job_queue on hume is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:07:06] we had a flapping 10g wave yesterday
[05:07:34] between eqiad and sdtpa iirc
[05:08:08] !log LocalisationUpdate ResourceLoader cache refresh completed at Mon Nov 4 05:08:07 UTC 2013
[05:08:24] Logged the message, Master
[05:09:03] still flapping...
[05:10:34] jeremyb: how do i add you to a particular tool?
[05:10:44] maybe YuviPanda knows
[05:10:53] ori-l: https://tools.wmflabs.org/ has links for add and remove per tool
[05:10:57] sure
[05:11:01] yeah, what jeremyb said
[05:11:54] jeremyb: what's your username?
[05:12:07] flappped 1298 times since yesterday
[05:12:09] =ircnick
[05:12:17] autocomplete didn't think so
[05:12:22] huh
[05:12:52] well i'm already listed on tools.wmflabs.org...
[05:13:18] RECOVERY - check_job_queue on fenari is OK: JOBQUEUE OK - all job queues below 10,000
[05:13:33] ok, added
[05:14:34] hrmmm, no enotif (no echo even)
[05:14:37] danke :)
[05:16:28] PROBLEM - check_job_queue on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:16:57] ori-l: so to clarify, morebots is the admin tool not the morebots tool? :)
[05:18:20] I can still remove you if you like
[05:18:29] anyways, added to morebots
[05:18:37] TimStarling: re: Google changing links -- they've made a standard out of tracking clicks now
[05:18:50] (I finished reading emails & backlog :)
[05:19:13]
[05:19:20] bing tracks link clicks as well
[05:19:45] http://www.whatwg.org/specs/web-apps/current-work/multipage/links.html#dfnReturnLink-0
[05:20:33] For URLs that are HTTP URLs, the requests must be performed using the POST method, with an entity body with the MIME type text/ping consisting of the four-character string "PING".
[05:20:37] * paravoid pukes
[05:21:18] well, if it's standard, it's easy to block
[05:21:34] oh the spec says that user agents should allow the user to block
[05:21:49] "User agents should allow the user to adjust this behavior, for example in conjunction with a setting that disables the sending of HTTP Referer (sic) headers. Based on the user's preferences, UAs may either ignore the ping attribute altogether, or selectively ignore URLs in the list (e.g. ignoring any third-party URLs)."
[05:22:18] IE will probably spoil it all by disabling it by default
[05:22:28] like DNT
[05:22:38] yeah I got the reference :)
[05:23:05] (grrrit-wm was down again, I restarted it, seems to work now)
[05:24:48] RECOVERY - check_job_queue on hume is OK: JOBQUEUE OK - all job queues below 10,000
[05:24:58] caniuse.com doesn't have the ping attribute
[05:25:01] too new I guess
[05:25:21] Google mobile search is getting faster - to be exact, 200-400 milliseconds faster! We are gradually rolling out this improvement to all browsers that support the attribute (currently, mobile Chrome and Safari).
[05:25:41] heh, nice overstatement
[05:27:58] PROBLEM - check_job_queue on hume is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:35:13] (PS2) Ori.livneh: [WIP] Add Graphite module & role [operations/puppet] - https://gerrit.wikimedia.org/r/92271
[05:37:22] (CR) Ori.livneh: "Still WIP. Now uses Debian packages + FHS paths. The Debian Carbon package uses the same crummy init script that is not multi-instance awa" [operations/puppet] - https://gerrit.wikimedia.org/r/92271 (owner: Ori.livneh)
[05:37:28] if only we had a memcached module...
[05:37:47] too much baggage
[05:38:08] they won't conflict because the things that currently use the memcached module and graphite won't ever be applied on the same host anyway
[05:38:19] sure
[05:39:00] carbonctl, heh
[05:39:08] reminds of me something :P
[05:39:41] yeah, the reuse was conscious. it was a good idea that has been working well (/sbin/eventloggingctl)
[05:40:01] note that there's no upstart service module any more, that is a bit alarming but sensible ultimately
[05:40:21] the problem is that carbon/init is a task job that goes from starting to stopped
[05:40:34] which is normal, that's how task jobs are supposed to function
[05:40:49] it spins up carbon/cache & carbon/relay instances per what is configured
[05:41:15] the upstart service provider in puppet is not aware of task jobs, and it's not aware of instance jobs either
[05:41:34] so everything is managed via carbonctl
[05:41:35] why do you need multiple instances?
[05:41:50] it's explained in a comment
[05:42:01] carbon cache is CPU bound and the python gil prevents it from utilizing multiple cores
[05:42:03] ah, found it
[05:44:28] may I suggest to file a Debian bug about this use case at some point?
[05:45:03] I'd do it, but you're clearly better informed than me
[05:45:31] sure, will do
[05:46:01] on professor we have profiler-to-carbon sitting in front of carbon
[05:46:15] and it is indeed maxing out one core
[05:46:51] (changeset looks good so far, fwiw)
[05:47:07] glad to see the debian packages worked out
[05:47:09] yay, coo
[05:47:09] l
[05:47:23] oh, that reminds me
[05:47:52] python-carbon can be removed from apt.wm.o, i think
[05:48:05] (the debian package is 'graphite-carbon')
[05:48:15] professor is on lucid so it's not using that anyway
[05:48:27] right
[05:48:28] done
[05:49:28] thanks
[05:49:37] morebots: ping
[05:49:37] I am a logbot running on tools-exec-01.
[05:49:37] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log.
[05:49:37] To log a message, type !log .
[05:50:01] jeremyb: make !logs structured data
[05:50:18] ori-l: elaborate? :)
[05:50:29] !log test
[05:50:43] Logged the message, Master
[05:51:26] make references to individuals, commits, and hosts machine-readable
[05:51:42] ori-l: were you around when I was proposing/discussing fedmsg?
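Ori's rationale for multiple carbon-cache instances — CPython's GIL lets only one thread execute Python bytecode at a time, so a CPU-bound daemon can't use more than one core from threads alone — is the standard motivation for running one process per core. A generic sketch of that pattern (this illustrates the idea only; it is not carbon's actual code, and it assumes the default fork start method on Linux):

```python
# Process-per-core pattern: CPython threads share one GIL, so CPU-bound
# work gets real parallelism only from multiple processes. This mirrors
# why one carbon-cache instance per core is run instead of one big one.
from multiprocessing import Pool, cpu_count

def cpu_bound_work(n):
    """Stand-in for one instance's CPU-bound workload."""
    total = 0
    for i in range(n):
        total += i * i
    return total

def run_instances(workloads):
    """Fan one workload out per process, up to one per core, the way
    several carbon-cache instances would run side by side."""
    with Pool(processes=min(len(workloads), cpu_count())) as pool:
        return pool.map(cpu_bound_work, workloads)
```

With threads instead of a Pool, the same workloads would serialize on the GIL and peg a single core, which is the "maxing out one core" behavior described on professor.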
[05:51:44] so that you can easily grep the SAL
[05:52:00] paravoid: that's the fedora message bus built on top of 0mq, right?
[05:52:04] yes
[05:52:20] i found out about it through you, but i don't recall a discussion, so i probably missed that part
[05:52:29] it looked good, didn't look into it in detail
[05:52:29] not much of a discussion
[05:53:24] there's definitely a need to standardize inter-service communication somehow
[05:53:38] ori-l: errr, the input sent to morebots is free text. how do you suggest to force the users to format properly?
[05:53:43] stop using !log on IRC?
[05:53:55] it even has a mediawiki module already, so funny
[05:54:00] note: i have thought about the very same issue myself
[05:54:07] jeremyb: IRC-generated !logs are only a subset of messages
[05:54:17] many are generated by scripts
[05:54:22] right
[05:54:54] though it's the sort of thing that could and should be replaced by fedmsg i guess
[05:55:05] much as having two bots interact is quaint
[05:57:13] one of the things the core team has been talking about is replacing free-text wfDebugs with some sort of standard log message descriptor
[05:57:21] indicating type / severity / component
[05:57:32] i think bd808 had some ideas and was going to write an rfc
[05:58:02] !log test2
[05:58:14] Logged the message, Master
[06:02:15] wtf
[06:02:21] 0mq went from 2.2 to 13.1?
[06:02:50] pyzmq that is
[06:02:56] trying to remember who else did that :P
[06:03:18] that's why debian packages can reset the version number for the version number :)
[06:03:23] i forget what that's called
[06:03:40] [09:13:02 PM] hrmmm, no enotif (no echo even) <-- file a bug for the echo notifications and I can look at it sometime
[06:03:46] doesn't matter in this case though, only if you go backwards
[06:03:59] legoktm: uga, too many bugs to file!
[06:04:09] PyZMQ releases ≤ 2.2.0 matched libzmq versioning, but this will no longer be the case. To avoid confusion with the contemporary libzmq-3.2 major version release, PyZMQ is jumping to 13.0 (it will be the thirteenth release, so why not?).
[06:05:47] [09:55:04 PM] much as having two bots interact is quaint <-- there's a bug somewhere about issues when morebots and logmsgbot end up on different ends of netsplits
[06:06:55] shouldn't that be mitigated by having them connect to the exact same server?
[06:08:16] hm. they probably should be on dickson
[06:08:26] [10:08:19 PM] morebots is connected on rothfuss.freenode.net (FR)
[06:08:28] [10:08:23 PM] logmsgbot is connected on leguin.freenode.net (Umeå, SE, EU)
[06:09:42] (PS1) Jeremyb: tool labs exec_environ: add python-twitter package [operations/puppet] - https://gerrit.wikimedia.org/r/93426
[06:09:58] legoktm: no. see paravoid for why not
[06:10:16] can i get a merge? ^^
[06:10:22] why not?
[06:10:45] so then if we have an outage that affects dickson we can't !log ?
[06:11:25] why would we connect to dickson?
[06:11:44] jeremyb: you set fallback servers. so if it cant connect to the first server, it tries another one
[06:11:50] paravoid: because its inside labs?
[06:11:56] it's not inside labs
[06:12:01] it's in the sandbox vlan
[06:12:08] but still, what's the benefit?
[06:12:28] legoktm: dickson is dedicated not virtual
[06:12:35] i uh...don't know what that means.
[06:12:48] maybe paravoid wants to merge? :)
[06:12:58] we have a separate (new) network for "hosted" equipment
[06:13:01] legoktm: bare metal
[06:13:10] no, i meant "sandbox vlan"
[06:13:20] which is separate from production and separate from labs
[06:13:26] oh, ok
[06:13:32] legoktm: that means it has about as much access to our networks as any other outside host
[06:13:35] and it's sandboxed, there's some constraints into it for security purposes
[06:13:37] well, then it doesn't really matter
[06:14:08] legoktm: *but* if eqiad falls off the grid then dickson probably does too
[06:14:19] if eqiad fails, then we lose morebots too.
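ori-l's suggestion above — make !log references to individuals, commits, and hosts machine-readable so the SAL can be grepped and queried — could start as simply as extracting them from the free text with patterns. A hypothetical sketch (the regexes, field names, and the sample entry are made up for illustration; this is not how morebots works):

```python
# Hypothetical structuring of a free-text !log line into a queryable
# record, per "make references to individuals, commits, and hosts
# machine-readable". Patterns here are illustrative guesses.
import re

HOST_RE = re.compile(r"\b(?:db|mw|srv|es|cp|lvs|search)\d+\b")
SHA1_RE = re.compile(r"\b[0-9a-f]{7,40}\b")
GERRIT_RE = re.compile(r"https://gerrit\.wikimedia\.org/r/(\d+)")

def structure_log_entry(nick, message):
    """Turn one SAL line into a record indexable by host, commit, or change."""
    return {
        "who": nick,
        "message": message,
        "hosts": sorted(set(HOST_RE.findall(message))),
        "commits": SHA1_RE.findall(message),
        "changes": [int(n) for n in GERRIT_RE.findall(message)],
    }
```

Entries generated by scripts (sync tools, l10nupdate) could emit these fields directly, which is where something like fedmsg would slot in.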
[06:14:43] not true for morebots' old home
[06:14:55] anyway...
[06:15:15] should i break tool labs rules and leave the bot working or break the bot or does someone want to deploy that for me? :)
[06:16:04] jeremyb: why not just install the package locally for now with pip?
[06:16:36] (CR) Jeremyb: "note, this package is already installed on tools-login somehow. twitter works when run on tools-login but not when run from the grid" [operations/puppet] - https://gerrit.wikimedia.org/r/93426 (owner: Jeremyb)
[06:16:53] legoktm: ewww? i hate pip? idk
[06:17:19] I don't.
[06:17:26] legoktm: but i wonder how it ever broke to begin with. maybe some hosts have it and some don't?
[06:17:39] if that's the case thats a bug
[06:17:43] all hosts should be equal
[06:18:18] right, hence i sent a patch to equalize
[06:18:26] i was guessing could be manual installs
[06:18:55] the twitter breakage started when apergos booted morebots. so i guess before that point it was in a place with twitter and after not
[06:24:52] !log left morebots running on tools-login because it doesn't work on all grid hosts. see https://gerrit.wikimedia.org/r/93426
[06:25:07] Logged the message, Master
[06:30:42] PROBLEM - MySQL Replication Heartbeat on db57 is CRITICAL: CRIT replication delay 319 seconds
[06:31:12] PROBLEM - MySQL Slave Delay on db57 is CRITICAL: CRIT replication delay 330 seconds
[06:38:42] PROBLEM - MySQL Replication Heartbeat on db57 is CRITICAL: CRIT replication delay 319 seconds
[06:39:13] * springle glares at db57
[06:39:56] springle: does it hear you?
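The replication alerts above (and the "-1 second" recovery that follows) boil down to comparing a delay reading against warning/critical thresholds. A sketch of that check logic — the thresholds here are guesses for illustration, not the production values:

```python
# Sketch of the threshold logic behind alerts like
# "CRIT replication delay 319 seconds". Thresholds are illustrative.

def check_replication(delay_seconds, warn=180, crit=300):
    """Classify a replication-delay reading the way the paged checks do.

    A small negative reading (like db1007's "-1") just means the
    heartbeat row is fresher than the comparison clock; treat it as OK.
    """
    if delay_seconds >= crit:
        return "CRITICAL", "CRIT replication delay %d seconds" % delay_seconds
    if delay_seconds >= warn:
        return "WARNING", "WARN replication delay %d seconds" % delay_seconds
    return "OK", "OK replication delay %d seconds" % delay_seconds
```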
[06:40:02] PROBLEM - MySQL Slave Delay on db1007 is CRITICAL: CRIT replication delay 313 seconds
[06:40:12] PROBLEM - MySQL Replication Heartbeat on db1007 is CRITICAL: CRIT replication delay 321 seconds
[06:40:15] hah, db1007 joins the party
[06:40:44] * jeremyb goes to sleep
[06:42:02] RECOVERY - MySQL Slave Delay on db1007 is OK: OK replication delay 0 seconds
[06:42:12] RECOVERY - MySQL Replication Heartbeat on db1007 is OK: OK replication delay -1 seconds
[06:43:16] -1? :)
[06:43:28] it's that fast :)
[06:43:33] time slave
[06:49:22] PROBLEM - MySQL Slave Running on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:51:12] RECOVERY - MySQL Slave Running on db1047 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error:
[06:53:52] PROBLEM - MySQL Recent Restart on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:54:22] PROBLEM - MySQL Slave Running on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:55:12] RECOVERY - MySQL Slave Running on db1047 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error:
[06:55:24] (PS1) ArielGlenn: payments as empty hash for lvs service ips in eqiad [operations/puppet] - https://gerrit.wikimedia.org/r/93432
[06:55:42] RECOVERY - MySQL Recent Restart on db1047 is OK: OK 2215924 seconds since restart
[06:56:46] (CR) ArielGlenn: [C: 2] payments as empty hash for lvs service ips in eqiad [operations/puppet] - https://gerrit.wikimedia.org/r/93432 (owner: ArielGlenn)
[06:59:02] RECOVERY - Puppet freshness on lvs1002 is OK: puppet ran at Mon Nov 4 06:58:52 UTC 2013
[07:00:42] RECOVERY - Puppet freshness on lvs1005 is OK: puppet ran at Mon Nov 4 07:00:33 UTC 2013
[07:38:44] PROBLEM - Host dobson is DOWN: PING CRITICAL - Packet loss = 100%
[07:38:44] PROBLEM - Host emery is DOWN: PING CRITICAL - Packet loss = 100%
[07:38:44] PROBLEM - Host db53 is DOWN: PING CRITICAL - Packet loss = 100%
[07:38:44] PROBLEM - Host es4 is DOWN: PING CRITICAL - Packet loss = 100%
[07:38:54] PROBLEM - Host es10 is DOWN: PING CRITICAL - Packet loss = 100%
[07:38:54] PROBLEM - Host db9 is DOWN: PING CRITICAL - Packet loss = 100%
[07:39:04] PROBLEM - Host mw90 is DOWN: PING CRITICAL - Packet loss = 100%
[07:39:05] PROBLEM - Host mw5 is DOWN: PING CRITICAL - Packet loss = 100%
[07:39:05] PROBLEM - Host mw100 is DOWN: PING CRITICAL - Packet loss = 100%
[07:39:05] PROBLEM - Host mw41 is DOWN: PING CRITICAL - Packet loss = 100%
[07:39:14] PROBLEM - Host 208.80.152.132 is DOWN: PING CRITICAL - Packet loss = 100%
[07:39:24] PROBLEM - Host ms-be4 is DOWN: CRITICAL - Time to live exceeded (10.0.6.203)
[07:39:24] PROBLEM - Host es3 is DOWN: CRITICAL - Time to live exceeded (10.0.0.227)
[07:39:24] PROBLEM - Host mw105 is DOWN: CRITICAL - Time to live exceeded (10.0.11.105)
[07:39:24] PROBLEM - Host mw70 is DOWN: CRITICAL - Time to live exceeded (10.0.11.70)
[07:39:24] PROBLEM - Host srv290 is DOWN: CRITICAL - Time to live exceeded (10.0.8.40)
[07:39:24] PROBLEM - Host mw116 is DOWN: CRITICAL - Time to live exceeded (10.0.11.116)
[07:41:34] PROBLEM - Host pappas is DOWN: PING CRITICAL - Packet loss = 100%
[07:41:34] PROBLEM - Host grosley is DOWN: PING CRITICAL - Packet loss = 100%
[07:42:04] PROBLEM - Host loudon is DOWN: PING CRITICAL - Packet loss = 100%
[07:42:14] PROBLEM - Host db78 is DOWN: PING CRITICAL - Packet loss = 100%
[07:42:20] wtf
[07:42:28] eeep
[07:42:33] argh
[07:42:36] network is split somehow
[07:42:38] everything at pmtpa
[07:42:49] the fiber is permanently cut now I guess
[07:42:54] but something is funky with the network
[07:42:54] RECOVERY - Host db34 is UP: PING WARNING - Packet loss = 86%, RTA = 35.49 ms
[07:43:04] RECOVERY - Host mw124 is UP: PING WARNING - Packet loss = 54%, RTA = 30.11 ms
[07:43:04] RECOVERY - Host mw43 is UP: PING WARNING - Packet loss = 54%, RTA = 30.11 ms
[07:43:04] RECOVERY - Host ps1-d2-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 29.14 ms
[07:43:04] RECOVERY - Host mw73 is UP: PING OK - Packet loss = 0%, RTA = 26.72 ms
[07:43:05] RECOVERY - Host db67 is UP: PING OK - Packet loss = 0%, RTA = 26.83 ms
[07:44:07] pages
[07:45:14] RECOVERY - Host loudon is UP: PING OK - Packet loss = 0%, RTA = 27.28 ms
[07:45:14] RECOVERY - Host pappas is UP: PING OK - Packet loss = 0%, RTA = 26.75 ms
[07:45:15] RECOVERY - Host grosley is UP: PING OK - Packet loss = 0%, RTA = 26.56 ms
[07:45:15] RECOVERY - Host db78 is UP: PING OK - Packet loss = 0%, RTA = 26.85 ms
[08:00:34] !log cr2-eqiad set xe-5/2/1 disable; 10g wave to cr1-sdtpa, flapping since yesterday, causing packet loss and outages
[08:00:55] Logged the message, Master
[08:02:24] PROBLEM - Packetloss_Average on emery is CRITICAL: CRITICAL: packet_loss_average is 28.0715908029 (gt 8.0)
[08:06:28] RECOVERY - Packetloss_Average on emery is OK: OK: packet_loss_average is 0.1014765625
[08:09:09] page? and yet no message here
[08:13:58] sms is acting up
[08:14:02] I'm getting pages but usually delayed
[08:14:06] and not from "Wikimedia" anymore
[08:14:10] but some random +38 number
[08:18:28] awesome...
[08:24:03] (PS1) Reedy: mediawikiwiki and testwikidatawiki to 1.23wmf2 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/93433
[08:24:04] (PS1) Reedy: Non wikipedias to 1.23wmf2 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/93434
[08:32:12] !log reedy synchronized php-1.23wmf2/ 'Ibcf77ed7f04c14a477d7cfd0e244929c552c3394'
[08:32:25] Logged the message, Master
[08:51:38] RECOVERY - search indices - check lucene status page on search14 is OK: HTTP OK: HTTP/1.1 200 OK - 52993 bytes in 0.155 second response time
[08:59:08] (PS1) ArielGlenn: remove dhcp entries for cp1021-1036, 1041-42, reclaimed (rt #5981) [operations/puppet] - https://gerrit.wikimedia.org/r/93436
[09:00:20] (CR) ArielGlenn: [C: 2] remove dhcp entries for cp1021-1036, 1041-42, reclaimed (rt #5981) [operations/puppet] - https://gerrit.wikimedia.org/r/93436 (owner: ArielGlenn)
[09:48:05] !log Jenkins: upgrading PHP_CodeSniffer from 1.4.6 to 1.4.7
[09:48:20] Logged the message, Master
[09:56:22] !log added python-apscheduler to apt.wikimedia.org
[09:56:25] hashar: ^
[09:56:32] !!!
[09:56:38] Logged the message, Master
[09:57:44] akosiaris: thank you! I will play with Zuul on labs and will hopefully be able to close down the RT ticket requesting some packages :D
[10:00:32] :-)
[10:02:01] thanks :)
[10:10:55] apergos: puppetd --enable; puppetd -vt --noop; puppetd --disable
[10:11:23] for streber and similar cases like them
[10:11:30] paravoid: don't disable, set high ospf metrics instead
[10:11:44] mark: hi
[10:11:48] hi
[10:11:54] * mark changes that
[10:11:54] that won't tell me why someone disabled puppet on a host
[10:12:07] mark: why?
[10:12:08] people do it for testing something so their stuff won't get overwritten
[10:12:10] so we can monitor it?
[10:12:22] yes [10:12:24] apergos: it might tell you what, which might help you find who :) [10:12:34] also less disruptive, although it doesn't matter much if the link was already flapping like hell anyway [10:12:49] hell = over a thousand times [10:12:59] we also had a multicast outage that Tim dealt with before I came online [10:15:41] nice [10:15:49] tim moving udpmcast literally hours before I'm decommissioning it [10:15:55] heh [10:19:31] oh you fixed the metric already? [10:19:41] as I announced [10:20:05] ah it was an /me and I missed it [10:20:20] log it or leslie will get confused by SAL :) [10:20:36] leslie doesn't read SAL [10:20:40] she reads rancid :) [10:21:13] !log Reenabled 10G wave between cr2-eqiad and cr1-sdtpa, set high OSPF metrics instead [10:21:26] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:21:26] Logged the message, Master [10:22:26] RECOVERY - RAID on searchidx1001 is OK: OK: State is Optimal, checked 1 logical drive(s), 4 physical drive(s) [10:25:50] so pmtpa squids are also not receiving multicast now [10:47:25] !log Deactivated anycast PIM RP on cr2-eqiad [10:47:41] Logged the message, Master [10:48:56] RECOVERY - check_job_queue on fenari is OK: JOBQUEUE OK - all job queues below 10,000 [10:51:56] PROBLEM - check_job_queue on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:52:36] !log Deactivated BFD on flapping link ae0 between cr1-eqiad and cr2-eqiad [10:52:52] Logged the message, Master [10:55:00] cr1-eqiad is in trouble [10:55:04] oh? 
[10:59:06] PROBLEM - Host wikipedia-lb.pmtpa.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:860:ed1a::1 [10:59:08] PROBLEM - Host mediawiki-lb.pmtpa.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:860:ed1a::8 [10:59:26] RECOVERY - Host wikipedia-lb.pmtpa.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 35.42 ms [10:59:46] RECOVERY - Host mediawiki-lb.pmtpa.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 35.48 ms [11:22:17] !log multicast routing to pmtpa restored for now [11:22:32] Logged the message, Master [11:24:31] (03PS1) 10Hashar: zuul: dependencies for Gearman based version [operations/puppet] - 10https://gerrit.wikimedia.org/r/93454 [11:30:04] what's up? [11:30:11] in trouble how? [11:51:21] (03PS1) 10Hashar: zuul: configuration for gearman [operations/puppet] - 10https://gerrit.wikimedia.org/r/93457 [11:59:46] (03PS1) 10Hashar: role::zuul::labs::gearman to test out in labs [operations/puppet] - 10https://gerrit.wikimedia.org/r/93458 [12:18:11] (03PS2) 10Hashar: role::zuul::labs::gearman to test out in labs [operations/puppet] - 10https://gerrit.wikimedia.org/r/93458 [12:22:04] (03PS3) 10Hashar: role::zuul::labs::gearman to test out in labs [operations/puppet] - 10https://gerrit.wikimedia.org/r/93458 [12:36:59] (03PS1) 10Mark Bergsma: Revoke Yuri's shell access [operations/puppet] - 10https://gerrit.wikimedia.org/r/93464 [12:38:04] (03CR) 10Mark Bergsma: [C: 032] Revoke Yuri's shell access [operations/puppet] - 10https://gerrit.wikimedia.org/r/93464 (owner: 10Mark Bergsma) [12:41:05] (03CR) 10Faidon Liambotis: [C: 031] "+1, so far." 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/92271 (owner: 10Ori.livneh) [12:43:18] (03PS2) 10Hashar: zuul: configuration for gearman [operations/puppet] - 10https://gerrit.wikimedia.org/r/93457 [12:43:19] (03PS4) 10Hashar: role::zuul::labs::gearman to test out in labs [operations/puppet] - 10https://gerrit.wikimedia.org/r/93458 [12:44:31] (03CR) 10Hashar: "Fixed a typo in zuul.conf template: gearman_server -> gearman_server_start" [operations/puppet] - 10https://gerrit.wikimedia.org/r/93457 (owner: 10Hashar) [13:01:41] !log Jenkins upgrading gearman (0.0.4 -> 0.0.5) plugin on gallium from http://repo.jenkins-ci.org/repo/org/jenkins-ci/plugins/gearman-plugin/0.0.5/ [13:01:58] Logged the message, Master [13:02:39] commuting [13:05:33] (03CR) 10Faidon Liambotis: [C: 032] Further constrain W0 X-CS setting to mobile Wikipedia, for now. [operations/puppet] - 10https://gerrit.wikimedia.org/r/92818 (owner: 10Dr0ptp4kt) [13:06:13] (03CR) 10Faidon Liambotis: [C: 032] Enable mobile redirect for wikimania2014.wikimedia.org [operations/puppet] - 10https://gerrit.wikimedia.org/r/92671 (owner: 10MaxSem) [13:23:26] paravoid: do you have a moment to also review https://gerrit.wikimedia.org/r/#/c/93006/ pls? Analytics will be happy [13:24:55] sigh [13:26:50] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:26:50] PROBLEM - DPKG on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[13:27:40] RECOVERY - RAID on searchidx1001 is OK: OK: State is Optimal, checked 1 logical drive(s), 4 physical drive(s) [13:27:50] RECOVERY - DPKG on searchidx1001 is OK: All packages OK [13:34:47] paravoid: we believe in you :) [13:34:56] (zero + analytics) [13:36:45] (03PS5) 10Hashar: role::zuul::labs::gearman to test out in labs [operations/puppet] - 10https://gerrit.wikimedia.org/r/93458 [13:48:41] (03PS1) 10Mark Bergsma: Add neighbor blocks for eqiad/esams link [operations/dns] - 10https://gerrit.wikimedia.org/r/93468 [13:49:18] LeslieCarr: mark: palladium and strontium have LACP over 4 interfaces and an MTU of 9192. We can probably do without both. I can also not touch any configuration on the switch and leave it as it is. Any standards we follow ? [13:49:39] yeah remove that [13:49:58] it's been problematic since ubuntu precise [13:50:09] so now if we need more than GigE, we just order 10G [13:50:17] and jumbo frames we've never used [13:50:20] how problematic ? [13:50:49] (03CR) 10Mark Bergsma: [C: 032] Add neighbor blocks for eqiad/esams link [operations/dns] - 10https://gerrit.wikimedia.org/r/93468 (owner: 10Mark Bergsma) [13:50:51] ok i will just use eth0 and have the other ports disabled with the description intact [13:51:05] yes [13:51:12] check if they are in the access-ports interface range [13:51:21] due to lacp their config may be a bit different right now [13:51:35] ok thanks for the hint [13:55:10] RECOVERY - Disk space on wtp1008 is OK: DISK OK [13:55:37] mark: setting up eqiad<->esams? [13:55:44] yes [13:55:48] \o/ [13:55:56] don't celebrate yet [13:55:59] i bet multicast won't work [13:56:08] heh [13:58:35] 97ms, not bad [14:07:16] any dev interested in investigating a possible weird bug? [14:08:16] Have you tried Bugzilla?
[14:09:50] akosiaris: paravoid: thanks to your help on packaging, I now got a Zuul instance in labs that is using Gearman :-]]] [14:12:03] :-) [14:12:50] Elsie: I'm poking here for some reasons [14:13:16] !log Configured OSPF/OSPF3 on cr1-eqiad:xe-4/2/2 <--> cr2-knams:xe-1/1/0 [14:13:32] Logged the message, Master [14:15:49] Vito: You probably want #wikimedia-tech or #mediawiki. [14:15:53] multicast does work now... [14:15:55] Though you're being so vague it's difficult to know which. :-) [14:16:01] \o/ [14:16:19] it pings! [14:16:36] omg, it works [14:16:42] Vito: this channel is merely for the server / infrastructure team. Not that much for bugs with the software :-D [14:16:59] so that's the right place ;) [14:17:03] Vito: you really want #wikimedia-tech and do file a bug at https://bugzilla.wikimedia.org/ [14:17:08] I think there's some error in the db [14:17:15] Vito: no. There are no devs here :-D [14:17:36] i'm starting to wonder if that old core switch in pmtpa is the culprit [14:17:56] akosiaris: you lucky bastard [14:18:02] ?? [14:18:19] not arguing with you but why ? [14:18:21] akosiaris: no weird haproxy and such for the new puppet setup [14:18:29] ah :-) [14:18:38] hashar: https://it.wikipedia.org/w/index.php?title=Speciale:Ripristina&target=Utente%3ADott_Alessandro_Sartore <-- there are no deleted nor suppressed revs, I'm wondering if something went wrong with the db [14:24:05] !log Killed udpmcast on chromium [14:24:17] Logged the message, Master [14:25:56] the link is doing ~ 900 Mbps now [14:26:20] oh wow [14:35:04] damn tampa [14:35:55] what? [14:36:40] what's wrong?
[14:39:23] (03PS1) 10Akosiaris: Adding palladium/strontium (new puppetmasters) IPs [operations/dns] - 10https://gerrit.wikimedia.org/r/93475 [14:40:56] !log removed LACP configuration from asw-a-eqiad for strontium and set it to standard access-port and private1-a-eqiad vlan [14:41:08] Logged the message, Master [14:41:17] i think the old foundry csw1-sdtpa is causing our multicast problems [14:41:18] !log removed LACP configuration from asw-b-eqiad for palladium and set it to standard access-port and private1-b-eqiad vlan [14:41:35] Logged the message, Master [14:42:10] (03PS1) 10Hashar: Merge upstream 'v0.7.1' into master [operations/debs/jenkins-debian-glue] - 10https://gerrit.wikimedia.org/r/93476 [14:42:51] the lazy person I am let Jenkins build the package now : https://integration.wikimedia.org/ci/job/operations-debs-jenkins-debian-glue-debian-glue/4/ :D [14:44:10] !log Disabled multicast traffic reduction on csw1-sdtpa [14:44:26] Logged the message, Master [14:45:49] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:47:39] RECOVERY - RAID on searchidx1001 is OK: OK: State is Optimal, checked 1 logical drive(s), 4 physical drive(s) [15:07:49] perhaps i'll sacrifice one of the two fibers between sdtpa and pmtpa and connect that eqiad link over that [15:08:05] don't really need > 10 Gbps between the floors anymore anyway [15:08:19] (03CR) 10Akosiaris: [C: 032] Adding palladium/strontium (new puppetmasters) IPs [operations/dns] - 10https://gerrit.wikimedia.org/r/93475 (owner: 10Akosiaris) [15:23:51] manybubbles: "Thanks go to Nik Everett, a frequent contributor to Elasticsearch" \o/ [15:47:13] ottomata1: heya [15:47:20] hiya [15:47:45] apt.wm.org has librdkafka 0.8.0 now [15:47:47] enjoy [15:47:55] 0.8.0-1~precise1 to be exact [15:50:03] awesome, saw that [15:50:05] thanks so much! 
[16:16:12] (03PS4) 10Andrew Bogott: Switch to using uwsgi for the proxy api [operations/puppet] - 10https://gerrit.wikimedia.org/r/92664 [16:22:13] paravoid: what'd I do? [16:22:42] manybubbles: http://www.elasticsearch.org/blog/0-90-6-released/ [16:22:51] paravoid: ah cool! lets upgrade! [16:22:55] "Other highlighting improvements" [16:23:12] I had no idea when 0.90.6 was coming [16:23:28] I assumed soon but I'm not sure how they make that decision [16:23:42] if you're in their irc channel or otherwise talking with them [16:23:50] you should tell them to make an apt repository [16:23:59] since they make .debs, having an apt is a tiny step over that [16:24:02] (03PS1) 10Cmjohnson: Removing misc servers tola, celsus, lardner, wtp1, kuo from pmtpa dsh, wtp1 from dhcpd (being decom'd) : [operations/puppet] - 10https://gerrit.wikimedia.org/r/93493 [16:24:12] and it's very useful [16:24:26] we wouldn't put the apt repo directly in our servers of course (although others might) [16:24:35] but we have a mechanism for importing third-party repositories [16:24:43] that includes verifying cryptographic signatures [16:24:57] https://github.com/elasticsearch/elasticsearch/issues/3286 [16:25:16] ok, slapped with the bug report [16:25:16] paravoid: I'd like that very much [16:25:22] ...to which you replied 4 months ago [16:25:26] yeah [16:25:49] that was something I started looking at a while ago [16:25:57] it is silly they don't just have one [16:26:10] sorry, that delayed slap was me finding the issue number. 
[16:29:15] paravoid: the "Pretty is prettier" is actually one I'm excited about [16:29:39] the lack of "\n" was making my terminal angry from time to time [16:31:54] !log Disabled multicast routing between eqiad and pmtpa, setup udpmcast between chromium and dobson instead [16:32:07] Logged the message, Master [16:33:45] AaronSchulz: ping [16:34:41] (03PS1) 10Cmjohnson: Removing dns entries for celsus,kuo, lardner,tola,wtp1 [operations/dns] - 10https://gerrit.wikimedia.org/r/93497 [16:34:45] AaronSchulz: thumbs are very close to completion (~1.5 days, plus another day for a final run?), so now we just miss 5T of temp files [16:35:12] AaronSchulz: so I need to either start copying all that or preferrably have a fix for #56401 and let that finish :) [16:39:50] (03CR) 10Cmjohnson: [C: 032] Removing misc servers tola, celsus, lardner, wtp1, kuo from pmtpa dsh, wtp1 from dhcpd (being decom'd) : [operations/puppet] - 10https://gerrit.wikimedia.org/r/93493 (owner: 10Cmjohnson) [16:41:20] (03PS5) 10Andrew Bogott: Switch to using uwsgi for the proxy api [operations/puppet] - 10https://gerrit.wikimedia.org/r/92664 [16:41:34] (03CR) 10Cmjohnson: [C: 032] Removing dns entries for celsus,kuo, lardner,tola,wtp1 [operations/dns] - 10https://gerrit.wikimedia.org/r/93497 (owner: 10Cmjohnson) [16:42:00] !log dns update [16:42:16] Logged the message, Master [16:45:07] what are you using ES for? [16:45:55] it's the new infrastructure that's being built for the search box you see on the sites [16:46:31] https://wikitech.wikimedia.org/wiki/Elasticsearch is the docs I guess [16:47:08] the old one was a custom written daemon based on lucene [16:47:13] unmaintained for years [16:47:49] okay, nice [16:48:24] but ES will be populated solely for searching, it wont be used as the authoritative store for the wiki content itself? 
(i.e., replace your mysqls) [16:48:48] correct [16:49:24] too bad, that would be interesting :) [16:56:05] oh we do use ES as the authoritative store for wiki content [16:56:12] except we call it External Storage, and isn't ElasticSearch ;) [16:56:25] the ES acronym has become mightily confusing lately [17:07:14] mark: he, close enough :) [17:27:34] if i wanted to get an idea of what effects an extension utilizing 50M, 250M, or even >1G of memcache would have on the other users of memcache, where should i start? [17:27:37] (03PS1) 10Dzahn: add sockpuppet to misc_pmtpa [operations/puppet] - 10https://gerrit.wikimedia.org/r/93502 [17:27:38] (03PS1) 10Dzahn: ensure /srv/org/wikimedia exists on bugzilla server ..and some minor formatting [operations/puppet] - 10https://gerrit.wikimedia.org/r/93503 [17:28:28] (03PS2) 10Dzahn: add sockpuppet to misc_pmtpa [operations/puppet] - 10https://gerrit.wikimedia.org/r/93502 [17:29:08] (03CR) 10Dzahn: [C: 032] "sockpuppet is a misc. as well" [operations/puppet] - 10https://gerrit.wikimedia.org/r/93502 (owner: 10Dzahn) [17:31:09] RECOVERY - DPKG on xenon is OK: All packages OK [17:31:51] (03PS2) 10Dzahn: ensure /srv/org/wikimedia exists on bugzilla server ..and some minor formatting [operations/puppet] - 10https://gerrit.wikimedia.org/r/93503 [17:38:01] (03PS1) 10Mark Bergsma: Cleanup old/now unused ulsfo upload service IPs [operations/puppet] - 10https://gerrit.wikimedia.org/r/93504 [17:39:09] (03CR) 10Mark Bergsma: [C: 032] Cleanup old/now unused ulsfo upload service IPs [operations/puppet] - 10https://gerrit.wikimedia.org/r/93504 (owner: 10Mark Bergsma) [17:40:44] (03PS11) 10Akosiaris: Modularizing puppetmaster [operations/puppet] - 10https://gerrit.wikimedia.org/r/91353 [17:43:50] (03CR) 10Akosiaris: [C: 032] "Removed puppetmaster::self and merging. 
More cleanup to follow" [operations/puppet] - 10https://gerrit.wikimedia.org/r/91353 (owner: 10Akosiaris) [17:45:09] (03PS6) 10Akosiaris: Puppetmaster module multi-master capable [operations/puppet] - 10https://gerrit.wikimedia.org/r/93061 [17:47:17] (03CR) 10Akosiaris: [C: 032] Puppetmaster module multi-master capable [operations/puppet] - 10https://gerrit.wikimedia.org/r/93061 (owner: 10Akosiaris) [17:47:45] (03Abandoned) 10Chad: Shut down search_pool[1-3] in pmtpa [operations/puppet] - 10https://gerrit.wikimedia.org/r/92017 (owner: 10Chad) [17:49:44] !log Deactivated ae1.101 on cr1-esams and cr2-knams [17:50:01] Logged the message, Master [17:53:03] (03PS1) 10Akosiaris: Fix a erroneus check in post-merge [operations/puppet] - 10https://gerrit.wikimedia.org/r/93507 [17:54:19] I broke post-merge... fixing it now [17:56:52] (03CR) 10Akosiaris: [C: 032] Fix a erroneus check in post-merge [operations/puppet] - 10https://gerrit.wikimedia.org/r/93507 (owner: 10Akosiaris) [18:01:16] fixed [18:02:36] (03PS2) 10Akosiaris: palladium/strontium as puppetmasters [operations/puppet] - 10https://gerrit.wikimedia.org/r/93082 [18:05:35] (03CR) 10Akosiaris: [C: 032] palladium/strontium as puppetmasters [operations/puppet] - 10https://gerrit.wikimedia.org/r/93082 (owner: 10Akosiaris) [18:16:43] !IE6 [18:16:47] !IE6 is April 8 2014 - celebrate end of extended support [18:16:48] Key was added [18:21:00] (03PS4) 10Mark Bergsma: Repartition esams LVS service IPs [operations/dns] - 10https://gerrit.wikimedia.org/r/92344 [18:25:42] !log aaron synchronized php-1.23wmf2/maintenance/cleanupUploadStash.php 'f3332a3d932c6919b3311d89edee20f886e4e86e' [18:25:56] Logged the message, Master [18:26:03] thanks :) [18:26:08] although it started working now [18:26:37] but useful nonetheless [18:27:02] (03PS1) 10Chad: sudo for myself on arsenic [operations/puppet] - 10https://gerrit.wikimedia.org/r/93511 [18:27:11] <^d> RobH: ^ [18:27:44] cool [18:27:52] waiting on tests.... 
[18:29:26] (03CR) 10RobH: [C: 032] "who watches the watchers? the honor system." [operations/puppet] - 10https://gerrit.wikimedia.org/r/93511 (owner: 10Chad) [18:30:01] ^d: merging now, will force a run on arsenic for ya [18:30:39] !log Configured cr1-esams and cr2-knams for new LVS service IP range 91.198.174.192/27 [18:30:46] <^d> \o/ [18:30:57] Logged the message, Master [18:32:00] ^d: you should have it now [18:32:06] saw it slap in your account [18:33:00] <^d> Working, thanks! [18:39:03] (03PS1) 10Mark Bergsma: Add new esams upload LVS service IPs [operations/puppet] - 10https://gerrit.wikimedia.org/r/93515 [19:15:45] hmm, the node_modules directory in /srv/deployment/parsoid/config is owned by a user called 'sartoris' now [19:15:59] so I can't deploy as I can't update those files as gwicke [19:17:19] apergos: ping [19:18:07] ask Ryan_Lane about that [19:18:16] Vito: ponng but I am cooking dinner [19:18:17] sartoris seems to be the new name for git-deploy [19:18:18] and will eat soon [19:18:25] apergos: yeah, thanks [19:18:52] apergos: heheheh, me too, btw there's a possibile db issue I think it deserves your attention [19:19:14] maybe someone who is actually in working hours now? [19:19:18] basically there are some lost revs on it.wiki [19:19:25] I mean I'm still hanging out but it's well after my 'shift' [19:19:29] ugh [19:19:34] Ryan_Lane: ping [19:19:37] point me out someone as trustworthy as you ;p [19:19:48] is there a bug report? like with what page, when it was noticed, etc [19:20:02] any weird actions done to the page or revs.... [19:20:15] gwicke: it should be group writeable... [19:20:34] and sgid. is it not? [19:20:41] Ryan_Lane: I'm getting permission errors [19:20:48] really? 
one sec [19:20:55] touch node_modules/test [19:20:57] touch: cannot touch `node_modules/test': Permission denied [19:21:08] no opened bugs [19:21:11] but some example [19:21:14] https://it.wikipedia.org/w/index.php?title=Speciale:Ripristina&target=Utente%3AVandra%2FSandbox3 [19:21:39] https://it.wikipedia.org/w/index.php?title=Speciale:Ripristina&target=Utente%3ADott_Alessandro_Sartore [19:21:47] gwicke: which repo? [19:21:58] no suppressed revs btw [19:22:00] Ryan_Lane: /srv/deployment/parsoid/config [19:22:16] ah, yeah, bad permissions [19:22:17] one sec [19:22:25] I did not try /Parsoid yet [19:22:34] need to update the config first [19:23:09] gwicke: whenever you want to try node 0.10... :) [19:23:32] paravoid: we should test it in rt testing first [19:23:39] gwicke: it's an issue with the directory missing write and sgid bits for group [19:23:40] yes [19:23:46] I'm fixing it for all repos, just in case [19:23:48] gwicke: ok, try now [19:23:51] paravoid: should we use the ppa now that it has 0.10.21 too? [19:24:09] no, I backported packages from Debian [19:24:11] or do you have a package ready? [19:24:14] 0.10.21 as well [19:24:16] ah, awesome [19:24:19] I needed to backport c-ares first though [19:24:22] the ppa is very non-standard [19:24:57] where are you doing the roundtrips? [19:25:03] labs? [19:25:05] apergos: so, the shadow reference feature is something for mediawiki localization [19:25:06] (03PS1) 10Lcarr: adding new parsoid public IP [operations/dns] - 10https://gerrit.wikimedia.org/r/93522 [19:25:36] apergos: during the fetch stage the shadow reference repo would be checked out to the version that's going to be checked out during the checkout phase [19:25:37] Ryan_Lane: that seems to have worked, thanks! 
[19:25:41] gwicke: yw [19:25:50] apergos: then, we'll have a custom mediawiki module [19:25:58] paravoid: yes, parsoid.wmflabs.org [19:26:06] it'll run the localization script on the target, which will generate the cache [19:26:06] and a few clients that do the actual work [19:26:24] ahhh [19:26:39] right and there is where you need the repo as it would be deployed, gotcha [19:26:42] so, we'll generate the cache on all targets, rather than transferring the cache [19:27:15] and since the shadow reference feature is generic, it can be used by other repos, if they need it [19:27:28] ^d: git.wm.org isn't very happy .. it seems [19:27:52] paravoid: the first half of this week is a bit busy, might have to defer the 0.10 upgrade until Thursday [19:28:02] k [19:28:02] (03CR) 10Lcarr: [C: 032] adding new parsoid public IP [operations/dns] - 10https://gerrit.wikimedia.org/r/93522 (owner: 10Lcarr) [19:28:52] apergos: so, the way I was going to implement this was to add a config option to the deployment hash: 'shadow_repo' => true [19:29:09] then, during the fetch stage, add a clone/fetch for the reference repo [19:29:42] then, replace all .gitmodules files in the checkout to reference the local submodules from the normal checkout [19:29:54] then do a recursive submodule update --init [19:30:54] eeww [19:30:57] I tested it for MW on my local system and it finished the clone/submodule clone for everything in about 30 seconds [19:31:00] (the git modules step) [19:31:01] apergos, Ryan_Lane: https://bugzilla.wikimedia.org/show_bug.cgi?id=56577 [19:31:08] yeah, not much you can do about that [19:31:25] 30 secs isn't awful [19:31:27] git submodules kind of suck [19:31:34] yeah, 30 secs was for a full clone, too [19:31:46] further fetches should occur much faster [19:31:49] yep [19:32:21] and since it's references/hard links it shouldn't take up much disk space
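The "references/hard links" point above is git's alternates mechanism: a clone made with `--reference` records a local object store in `.git/objects/info/alternates` and borrows objects from it instead of copying them, which is why a full clone plus submodules can finish in seconds with little extra disk. A self-contained demo with stock git (throwaway temp directories, not the real /srv/deployment layout):

```shell
#!/bin/sh
set -e
tmp=$(mktemp -d)
cd "$tmp"

# A stand-in for a gerrit-hosted repository.
git init -q upstream
(cd upstream && git config user.email t@example.org && git config user.name t &&
  echo v1 > f && git add f && git commit -qm init)

# A local mirror that will act as the shared object store.
git clone -q upstream mirror

# A clone that references the mirror: objects are borrowed via the
# alternates file rather than copied into the new clone.
git clone -q --reference mirror upstream checkout
cat checkout/.git/objects/info/alternates
```

The same flag works for submodule checkouts, which is what makes the shadow-reference idea cheap for a repository the size of MediaWiki plus extensions.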
[19:33:28] (03PS1) 10Dzahn: remove search-pool[1-3].svc.pmtpa.wmnet from monitoring [operations/puppet] - 10https://gerrit.wikimedia.org/r/93523 [19:33:33] (hard link source and target) [19:33:42] ah, that's handled by git [19:33:45] but... [19:34:26] yay mutante for removing those [19:34:38] the submodules have something like this: url = https://gerrit.wikimedia.org/r/p/mediawiki/extensions/AntiBot.git [19:35:01] so you need to modify that to point to the repo on the local filesystem [19:35:07] I'm already doing that in the main repo [19:35:13] to not reference gerrit [19:35:30] it's pretty simple to do it to point to the filesystem, since we already have the repo location [19:35:49] Ryan_Lane: oh, btw- is there a way to suppress Parsoid restarts when pushing out the config repository? [19:36:08] s#https://gerrit.wikimedia.org/r/p/mediawiki/#/srv/deployment/mediawiki/slot0/# [19:36:14] (03CR) 10Dzahn: "https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=search-pool" [operations/puppet] - 10https://gerrit.wikimedia.org/r/93523 (owner: 10Dzahn) [19:36:15] gwicke: at this time? no [19:36:33] hmm, k [19:36:34] I should make that another allowed runner call so that it can be done separately [19:36:39] that will take out Parsoid for a bit then [19:36:49] old code won't work with new dependencies [19:37:11] !change 93523 | apergos [19:37:11] apergos: https://gerrit.wikimedia.org/r/#q,93523,n,z [19:37:13] I'll try to be quick [19:37:41] (03CR) 10Mark Bergsma: [C: 032] Add new esams upload LVS service IPs [operations/puppet] - 10https://gerrit.wikimedia.org/r/93515 (owner: 10Mark Bergsma) [19:37:56] so please ignore Parsoid warnings in the next minutes [19:39:22] I already looked at it, did you not see my yay? 
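The substitution quoted above maps each submodule's gerrit URL onto the local filesystem checkout. Applied to the example url line from the chat (both the `.gitmodules`-style url line and the slot0 path are quoted from the log):

```shell
#!/bin/sh
# Apply the exact substitution quoted in the chat to a .gitmodules-style
# url line: the gerrit prefix is swapped for the local slot0 path, and the
# extension path plus .git suffix are preserved unchanged.
url="url = https://gerrit.wikimedia.org/r/p/mediawiki/extensions/AntiBot.git"
echo "$url" | sed 's#https://gerrit.wikimedia.org/r/p/mediawiki/#/srv/deployment/mediawiki/slot0/#'
# -> url = /srv/deployment/mediawiki/slot0/extensions/AntiBot.git
```

With every url rewritten this way, a recursive `git submodule update --init` never has to contact gerrit at all.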
:-D [19:39:39] Ryan_Lane: it seems that a commit --amend confuses git-deploy a bit [19:39:46] gwicke: yeah, it would [19:39:49] badly [19:39:54] you should never ever do that [19:39:57] apergos: i didn't, now i did :) [19:40:06] one file was not committed along with the rest [19:40:06] :-D [19:40:12] which caused sync to fail [19:40:15] then you should do another commit [19:40:20] oh suck [19:40:31] Ryan_Lane: ok- next time [19:40:35] gwicke: when you amend a commit, you destroy the history [19:40:41] how can I get back to where I started? [19:40:42] and any repo that's downstream will break [19:40:55] there are no downstream repos ;) [19:41:01] gwicke: yes, there are [19:41:05] every single target [19:41:16] if it was synced, yes [19:41:19] git is the transport mechanism, remember? :) [19:41:19] it is not [19:41:24] oh [19:41:25] good [19:41:50] so you amended a commit that hadn't been synced yet? [19:41:54] then you should have no issues [19:42:00] git deploy --force sync [19:42:10] otherwise it has no clue anything changed [19:43:19] it says 'It looks like you havent started yet!' [19:43:28] git deploy start [19:43:30] :) [19:43:42] hm, ok [19:43:49] I had started earlier [19:44:12] force-syncing now [19:45:01] and done [19:45:30] RECOVERY - Disk space on wtp1018 is OK: DISK OK [19:45:40] RECOVERY - Disk space on wtp1023 is OK: DISK OK [19:47:59] !log updated Parsoid to d7b556f25353 [19:48:17] Logged the message, Master [19:51:38] git.wikimedia having troubles? [19:51:57] (03PS1) 10Lcarr: adding new public IP for parsoid [operations/puppet] - 10https://gerrit.wikimedia.org/r/93527 [19:52:09] marktraceur: yes. 
there's a bug about that [19:52:16] (03PS2) 10Lcarr: adding new public IP for parsoid [operations/puppet] - 10https://gerrit.wikimedia.org/r/93527 [19:52:19] https://bugzilla.wikimedia.org/show_bug.cgi?id=56557 [19:52:22] Fun times [19:52:25] naturally no one wants to fix it [19:52:31] since today's morning UTC [19:52:45] so pretty long-ish now [19:52:50] (03CR) 10jenkins-bot: [V: 04-1] adding new public IP for parsoid [operations/puppet] - 10https://gerrit.wikimedia.org/r/93527 (owner: 10Lcarr) [19:54:08] Sigh [19:57:45] (03CR) 10Mark Bergsma: [C: 04-2] "Totally the wrong IP range for that service IP. Also, no IPv6?" [operations/puppet] - 10https://gerrit.wikimedia.org/r/93527 (owner: 10Lcarr) [19:58:09] Reedy: online again yet? [19:58:48] !log restarted gitblit on antimony [19:58:49] (03PS6) 10Andrew Bogott: Switch to using uwsgi for the proxy api [operations/puppet] - 10https://gerrit.wikimedia.org/r/92664 [19:59:04] Logged the message, Master [20:00:05] mutante: you know that's bug 56557? [20:00:22] jeremyb: i do, just commented [20:00:29] "mid-air" [20:00:59] (03CR) 10Mark Bergsma: "Sorry, not outdated lvs.pp, but put the monitoring lines in the external services section" [operations/puppet] - 10https://gerrit.wikimedia.org/r/93527 (owner: 10Lcarr) [20:01:24] (03PS1) 10Lcarr: fixed the ip for parsoid-lb.eqiad.wikimedia.org [operations/dns] - 10https://gerrit.wikimedia.org/r/93567 [20:01:32] mark: i mixed up the line numbers and ip's [20:01:47] need to change my colors to be a bit more different [20:02:04] (03CR) 10Lcarr: [C: 032] fixed the ip for parsoid-lb.eqiad.wikimedia.org [operations/dns] - 10https://gerrit.wikimedia.org/r/93567 (owner: 10Lcarr) [20:02:39] LeslieCarr: why is it in the multimedia range? [20:03:03] i didn't realize we had already separated out the ranges [20:03:17] for eqiad [20:03:29] let's put a comment in the rdns file ? 
[20:03:58] we haven't yet, but you should at least take it into account for new ips [20:04:10] greg-g: Am now [20:04:15] An hour wasn't a bad guess [20:04:34] Reedy: weee [20:04:43] go forth and deploy and such [20:05:04] I actually did most of the prep this morning [20:05:05] after i get lunch i'll put comments in the file and make sure it's in the "misc" section [20:05:15] (03CR) 10Reedy: [C: 032] mediawikiwiki and testwikidatawiki to 1.23wmf2 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/93433 (owner: 10Reedy) [20:06:12] LeslieCarr: if you leave it for review I can roll them in with my other lvs changes tomorrow [20:06:52] Who do I need to slap for the bugzilla spam? [20:06:54] okay [20:07:35] Reedy: ? [20:07:51] 40 new bugzilla emails [20:09:20] * Reedy waits for jenkins [20:11:47] (03PS2) 10Dzahn: remove Tampa search-pool monitoring [operations/puppet] - 10https://gerrit.wikimedia.org/r/93523 [20:12:36] (03PS3) 10Dzahn: remove Tampa search-pool monitoring [operations/puppet] - 10https://gerrit.wikimedia.org/r/93523 [20:14:39] Has Jenkins died? [20:15:04] (03CR) 10Reedy: [V: 032] "Dead Jenkins is apparently dead" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/93433 (owner: 10Reedy) [20:16:46] Reedy: nah, it's just very busy [20:19:51] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: mediawikiwiki and testwikidatawiki to 1.23wmf2 [20:20:06] Logged the message, Master [20:22:23] !log Created BetaFeatures tables on mediawikiwiki and testwikidatawiki [20:22:36] Logged the message, Master [20:22:37] marktraceur: Are we enabling it on testwikidatawiki? [20:23:32] what does it do? 
[20:23:38] probably fine [20:24:25] yeah, looks fine [20:24:39] aude: Better we break testwikidatawiki with it than wikidatawiki ;) [20:25:03] definitely [20:25:44] (03PS1) 10Aaron Schulz: Bumped jobqueue warning threshhold [operations/puppet] - 10https://gerrit.wikimedia.org/r/93585 [20:27:04] MultimediaViewer seems a bit pointless [20:27:34] CommonsMetadata for the same reason.. Though the idea of it does seem wikidata-sih [20:28:33] can always disable it [20:29:13] i doubt they break anything though doubt they are needed for wikidata :) [20:29:23] * aude doesn't care [20:29:52] !log reedy synchronized wmf-config/InitialiseSettings.php 'Enable BetaFeatures and friends on mediawikiwiki and testwikidatawiki' [20:30:05] ^demon: PHP Warning: Search backend error during full text search for ''. Error message is: IndexMissingException[[zuwikibooks_content] missing] [20:30:07] Logged the message, Master [20:30:11] (03CR) 10Dzahn: [C: 031] decommision db3[29] db4[2-6] db5[1235689] [operations/puppet] - 10https://gerrit.wikimedia.org/r/93052 (owner: 10Springle) [20:30:24] (03PS1) 10Reedy: Enable BetaFeatures and friends on mediawikiwiki and testwikidatawiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/93586 [20:31:01] (03CR) 10Reedy: [C: 032] Enable BetaFeatures and friends on mediawikiwiki and testwikidatawiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/93586 (owner: 10Reedy) [20:31:12] (03Merged) 10jenkins-bot: Enable BetaFeatures and friends on mediawikiwiki and testwikidatawiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/93586 (owner: 10Reedy) [20:31:24] <^demon> Reedy: Ohdarn. 
[20:31:39] (03PS2) 10Reedy: Non wikipedias to 1.23wmf2 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/93434 [20:31:43] (03CR) 10Reedy: [C: 032] Non wikipedias to 1.23wmf2 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/93434 (owner: 10Reedy) [20:31:45] That's better [20:32:02] (03Merged) 10jenkins-bot: Non wikipedias to 1.23wmf2 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/93434 (owner: 10Reedy) [20:32:53] Reedy: Yeah, they're all getting pushed to mw.o and testwikidatawiki - I dunno why exactly, but that's what I was told [20:33:01] (I mean re: test.wikidata) [20:33:18] more testing the better [20:33:32] Absolutely [20:33:44] and eventually commons will get some of the wikibase repo features [20:33:50] probably* [20:33:51] Wonderful [20:33:55] aude: We sure hope so [20:34:13] marktraceur: it's the suite of testwikis :) [20:34:32] (why you were told what you were told) :) [20:34:36] Right [20:34:38] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Non wikipedias to 1.23wmf2 [20:34:48] * greg-g crosses fingers [20:34:53] Logged the message, Master [20:35:26] PHP Fatal error: Cannot use object of type EchoEvent as array in /usr/local/apache/common-local/php-1.23wmf2/extensions/Echo/includes/EventLogging.php on line 72 [20:35:34] Did a fix get committed for that already... [20:35:35] yeah, that one again? [20:35:37] I deleted the email notification [20:35:51] let me check [20:35:54] backporting it if so [20:36:06] https://gerrit.wikimedia.org/r/#/c/93516/ [20:36:09] Reedy: ^ [20:36:11] thanks [20:36:18] I'd go so far as the bug ;) [20:36:45] hth [20:36:57] a gerrit patch uploader example, ^demon ^^ [20:39:30] (03CR) 10Dzahn: [C: 032] remove Tampa search-pool monitoring [operations/puppet] - 10https://gerrit.wikimedia.org/r/93523 (owner: 10Dzahn) [20:40:14] Is Ryan Lane around? [20:40:28] Eloquence: he was, but he's in Hong.... [20:40:30] there. [20:40:31] Eloquence: he is! 
:) [20:40:34] Hi Ryan_Lane :) [20:40:40] haha [20:40:41] howdy [20:40:46] Ryan_Lane, gwicke needs some help with an emergency parsoid revert [20:40:57] git deploy start [20:41:01] git reset --hard [20:41:09] git deploy sync [20:41:53] boarding is starting right now, but I should have a few mins [20:42:00] trying with --force [20:42:23] was there an issue with the deploy, or the code that was deployed? [20:42:42] there's a complaint on wikitech-l. i assume this is related [20:42:44] the code [20:42:47] ok [20:42:47] code that was deployed, I believe (see wikitech-l, all edits on french wikipedia are junk) [20:42:52] https://pl.wikipedia.org/w/index.php?title=Waldemar_Nol&curid=1602043&diff=37817412&oldid=27653361 [20:42:57] greg-g: all edits on all wikipedias [20:43:03] that include any non-ascii characters [20:43:04] MatmaRex: wonderful [20:43:11] yeah, reverting to the old commit id will work, then [20:43:12] which is all edits for all wikipedias except for the english one [20:43:17] so, english is maybe part ok! [20:43:17] Ryan_Lane: that seemed to work [20:43:27] great [20:44:24] gwicke, I am getting "error contacting parsoid" on fr.wp now instead [20:44:26] PROBLEM - Parsoid on wtp1001 is CRITICAL: Connection refused [20:44:26] PROBLEM - Parsoid on wtp1017 is CRITICAL: Connection refused [20:44:26] PROBLEM - Parsoid on wtp1006 is CRITICAL: Connection refused [20:44:26] PROBLEM - Parsoid on wtp1022 is CRITICAL: Connection refused [20:44:36] PROBLEM - Parsoid on wtp1019 is CRITICAL: Connection refused [20:44:36] PROBLEM - Parsoid on wtp1003 is CRITICAL: Connection refused [20:44:41] .... 
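[editor's note] The revert sequence Ryan dictated above is worth unpacking: the heavy lifting is a plain `git reset --hard` to the last known-good commit, bracketed by `git deploy start` / `git deploy sync` so the deploy system notices and fans the rollback out. A minimal, self-contained sketch of that core step (repo contents, file names, and commit messages are invented for illustration; `git deploy` itself is only shown in comments since it is site-specific tooling):

```shell
# Roll a checkout back to a known-good commit -- the middle step of
# "git deploy start; git reset --hard <sha>; git deploy sync".
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q repo && cd repo
git config user.email ops@example.org
git config user.name ops
echo good > app.js && git add app.js && git commit -qm 'known-good deploy'
good=$(git rev-parse HEAD)                  # remember the good SHA
echo broken > app.js && git commit -qam 'bad deploy'
git reset --hard -q "$good"                 # discard the bad deploy locally
cat app.js                                  # prints "good"
# In production this local reset is bracketed by:
#   git deploy start; git reset --hard <good-sha>; git deploy sync
```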
[20:44:43] heh [20:44:46] PROBLEM - LVS HTTP IPv4 on parsoid.svc.eqiad.wmnet is CRITICAL: Connection refused [20:44:49] PROBLEM - Parsoid on wtp1009 is CRITICAL: Connection refused [20:44:50] PROBLEM - Parsoid on wtp1011 is CRITICAL: Connection refused [20:44:50] PROBLEM - Parsoid on wtp1008 is CRITICAL: Connection refused [20:44:50] PROBLEM - Parsoid on wtp1004 is CRITICAL: Connection refused [20:44:50] PROBLEM - Parsoid on wtp1015 is CRITICAL: Connection refused [20:44:56] PROBLEM - Parsoid on wtp1010 is CRITICAL: Connection refused [20:45:02] we get it, icinga-wm [20:45:05] gwicke: want me to restart parsoid via salt? [20:45:06] PROBLEM - Parsoid on wtp1018 is CRITICAL: Connection refused [20:45:06] PROBLEM - Parsoid on wtp1024 is CRITICAL: Connection refused [20:45:06] PROBLEM - Parsoid on wtp1016 is CRITICAL: Connection refused [20:45:06] PROBLEM - Parsoid on wtp1014 is CRITICAL: Connection refused [20:45:06] PROBLEM - Parsoid on wtp1021 is CRITICAL: Connection refused [20:45:07] PROBLEM - Parsoid on wtp1012 is CRITICAL: Connection refused [20:45:16] PROBLEM - Parsoid on wtp1005 is CRITICAL: Connection refused [20:45:16] PROBLEM - Parsoid on wtp1007 is CRITICAL: Connection refused [20:45:16] PROBLEM - Parsoid on wtp1013 is CRITICAL: Connection refused [20:45:16] PROBLEM - Parsoid on wtp1023 is CRITICAL: Connection refused [20:45:16] PROBLEM - Parsoid on wtp1020 is CRITICAL: Connection refused [20:45:17] PROBLEM - Parsoid on wtp1002 is CRITICAL: Connection refused [20:46:36] PROBLEM - LVS HTTP IPv4 on parsoidcache.svc.eqiad.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 769 bytes in 0.004 second response time [20:46:41] gwicke: if yes, let me know now, since I need to board.... [20:47:02] Ryan_Lane: type the command here if you have it ready [20:47:13] performing roan extraction ... 
[20:47:28] salt -G 'deployment_target:parsoid' parsoid.restart_parsoid [20:47:32] Eloquence: well, that's certainly progress :) [20:47:33] k [20:47:36] RECOVERY - Parsoid on wtp1019 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.004 second response time [20:47:36] RECOVERY - LVS HTTP IPv4 on parsoidcache.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1461 bytes in 0.002 second response time [20:47:39] RECOVERY - Parsoid on wtp1003 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.007 second response time [20:47:44] (the error i mean.) [20:47:46] RECOVERY - LVS HTTP IPv4 on parsoid.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.005 second response time [20:47:48] OMG, it worked from IRC :P [20:47:49] RECOVERY - Parsoid on wtp1009 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.006 second response time [20:47:50] RECOVERY - Parsoid on wtp1008 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.004 second response time [20:47:50] RECOVERY - Parsoid on wtp1011 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.004 second response time [20:47:50] RECOVERY - Parsoid on wtp1015 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.007 second response time [20:47:50] RECOVERY - Parsoid on wtp1004 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.026 second response time [20:47:56] RECOVERY - Parsoid on wtp1010 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.009 second response time [20:48:01] salt -G 'deployment_target:parsoid' parsoid.restart_parsoid parsoid [20:48:06] RECOVERY - Parsoid on wtp1018 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.004 second response time [20:48:06] RECOVERY - Parsoid on wtp1024 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.007 second response time [20:48:06] RECOVERY - Parsoid on wtp1014 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.003 second response time [20:48:06] RECOVERY - Parsoid on wtp1016 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.008 second response time [20:48:06] RECOVERY - Parsoid on 
wtp1021 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.005 second response time [20:48:07] RECOVERY - Parsoid on wtp1012 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.002 second response time [20:48:16] RECOVERY - Parsoid on wtp1005 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.002 second response time [20:48:16] RECOVERY - Parsoid on wtp1007 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.005 second response time [20:48:16] RECOVERY - Parsoid on wtp1023 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.002 second response time [20:48:16] RECOVERY - Parsoid on wtp1020 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.003 second response time [20:48:16] RECOVERY - Parsoid on wtp1002 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.004 second response time [20:48:17] RECOVERY - Parsoid on wtp1013 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.008 second response time [20:48:26] RECOVERY - Parsoid on wtp1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.004 second response time [20:48:26] RECOVERY - Parsoid on wtp1017 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.007 second response time [20:48:26] RECOVERY - Parsoid on wtp1006 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.002 second response time [20:48:26] RECOVERY - Parsoid on wtp1022 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.003 second response time [20:48:41] when is it a good time for me to say "So... about getting Parsoid on the deploy calendar....." (we were just deploying MediaWiki right now, too) [20:48:46] it's silly that we need a custom module to restart parsoid [20:48:57] why do we need a custom module? 
[20:49:03] but parsoid still doesn't have a proper init script [20:49:17] oh [20:49:21] I can give that a stab [20:49:27] I think ori wrote one [20:49:46] knowing ori, he probably wrote an upstart job [20:50:08] yep [20:50:21] heheh [20:50:23] http://git.wikimedia.org/blob/mediawiki%2Fvagrant.git/8edc7847d5f8247a576905a2b7a6915b47915c97/puppet%2Fmodules%2Fmediawiki%2Ftemplates%2Fparsoid.conf.erb [20:50:50] ahoy [20:50:58] hi joel! [20:50:59] it's a joel! [20:51:00] hi cajoel [20:51:09] looking to secure some internal web services, and I need the SSL star cert+key, etc.. [20:51:17] https ftw [20:51:30] I think there was a separate wildcard for *.corp [20:51:34] there is [20:51:38] can someone point me in the right direction. [20:51:40] !log reedy synchronized php-1.23wmf2/extensions/Echo 'bug 56521' [20:51:43] our * doesn't support corp [20:51:51] cool, np [20:51:54] cajoel: there should be *.corp.wm [20:51:59] Logged the message, Master [20:52:01] no clue where corp's is :) [20:52:03] expecting IT to have that? :) [20:52:04] hah [20:52:13] we don't expect anything [20:52:19] I'll rescan for corp , that's easier to seach for... [20:52:21] ...except everything [20:53:15] cajoel: it's the one used on mingle [20:53:40] cajoel: see cert details on https://mingle.corp.wikimedia.org/ [20:54:13] Great, I can dig that out of there. [20:54:49] ok, boarding time [20:54:53] * Ryan_Lane waves [20:54:59] have a nice flight! [20:55:13] too late to point to https://www.youtube.com/watch?v=DtyfiPIHsIg [20:55:49] (03PS1) 10Odder: (bug 56570) Set up 'accountcreator' user group on cawiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/93591 [20:56:25] gwicke: so, about adding Parsoid deploys to the Deployments calendar.... 
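[editor's note] For context on the Upstart remark: a job for a node service like Parsoid is only a few stanzas. A hypothetical sketch — this is not the contents of the parsoid.conf.erb linked above; paths, user, port, and start/stop events are all assumptions:

```
# /etc/init/parsoid.conf -- hypothetical Upstart job sketch
description "Parsoid HTTP service"

start on (local-filesystems and net-device-up IFACE!=lo)
stop on runlevel [!2345]

setuid parsoid
respawn

exec /usr/bin/nodejs /var/lib/parsoid/api/server.js
```

With something like this in place, `service parsoid restart` replaces the custom salt module Ryan was grumbling about.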
[20:56:39] (03PS1) 10Cmjohnson: Removing mgmt dns entries for osm-cp1-4 osm-db1-2 [operations/dns] - 10https://gerrit.wikimedia.org/r/93592 [20:59:07] (03PS1) 10Reedy: Remove old commented out pmtpa solr box config for GeoData [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/93593 [21:00:07] (03CR) 10Reedy: [C: 032] Remove old commented out pmtpa solr box config for GeoData [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/93593 (owner: 10Reedy) [21:00:09] (03CR) 10Cmjohnson: [C: 032] Removing mgmt dns entries for osm-cp1-4 osm-db1-2 [operations/dns] - 10https://gerrit.wikimedia.org/r/93592 (owner: 10Cmjohnson) [21:00:27] (03Merged) 10jenkins-bot: Remove old commented out pmtpa solr box config for GeoData [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/93593 (owner: 10Reedy) [21:00:33] !log dns update [21:00:49] Logged the message, Master [21:02:04] (03PS1) 10RobH: new bug-attachment.wikimedia.org cert [operations/puppet] - 10https://gerrit.wikimedia.org/r/93594 [21:03:22] (03PS1) 10Cmjohnson: Removing mgmt dns entries for wmf5815/wmf5821 (otto/varro) [operations/dns] - 10https://gerrit.wikimedia.org/r/93595 [21:03:52] cajoel: https://git.wikimedia.org/history/operations%2Fpuppet.git/153e3dc9b411c8d9ce9d3b2c56d71e4ba51cc0d1/files%2Fssl%2Fstar.wikimedia.org.pem [21:04:26] PROBLEM - MySQL Slave Running on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
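[editor's note] For anyone, like cajoel, trying to identify which wildcard cert a live service presents: `openssl` will print the subject of whatever the server hands back. A sketch using a throwaway self-signed cert so it runs anywhere; the `s_client` form in the trailing comment is what you would actually point at mingle.corp (hostname per the chat, everything else illustrative):

```shell
# Generate a throwaway self-signed wildcard cert, then inspect its subject.
tmp=$(mktemp -d)
openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
  -subj "/CN=*.corp.wikimedia.org" \
  -keyout "$tmp/key.pem" -out "$tmp/cert.pem" 2>/dev/null
openssl x509 -noout -subject -in "$tmp/cert.pem"
# Against a live host, fetch the presented cert instead:
#   openssl s_client -connect mingle.corp.wikimedia.org:443 </dev/null 2>/dev/null \
#     | openssl x509 -noout -subject -dates
```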
[21:06:06] RECOVERY - MySQL Slave Running on db1047 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: [21:06:45] (03CR) 10RobH: [C: 032] new bug-attachment.wikimedia.org cert [operations/puppet] - 10https://gerrit.wikimedia.org/r/93594 (owner: 10RobH) [21:29:05] !log schedule icinga downtime/disable notifications for db3[29] db4[2-6] db5[1235689] [21:29:29] Logged the message, Master [21:29:34] cmjohnson1: ^ there, that took a while, but i added them all to a scheduled downtime of 1 year and disabled notifications for hosts and services on them [21:30:15] mutante: great thank you [21:33:05] (03CR) 10Dzahn: "< mutante> !log schedule icinga downtime/disable notifications for db3[29] db4[2-6] db5[1235689]" [operations/puppet] - 10https://gerrit.wikimedia.org/r/93052 (owner: 10Springle) [21:39:33] (03PS1) 10Cmjohnson: Removing dhcpd and dsh entries for storage3 [operations/puppet] - 10https://gerrit.wikimedia.org/r/93598 [21:40:59] (03PS2) 10Cmjohnson: Removing dhcpd and dsh entries for storage3 [operations/puppet] - 10https://gerrit.wikimedia.org/r/93598 [21:41:05] (03CR) 10Cmjohnson: [C: 032] Removing dhcpd and dsh entries for storage3 [operations/puppet] - 10https://gerrit.wikimedia.org/r/93598 (owner: 10Cmjohnson) [21:41:31] (03CR) 10Cmjohnson: [C: 032] Removing mgmt dns entries for wmf5815/wmf5821 (otto/varro) [operations/dns] - 10https://gerrit.wikimedia.org/r/93595 (owner: 10Cmjohnson) [21:41:56] (03PS1) 10Cmjohnson: Removing dns entries for storage3 [operations/dns] - 10https://gerrit.wikimedia.org/r/93600 [21:42:29] (03PS1) 10Lcarr: moved parsoid into proper ssubnet Created comments about LVS subnet plans [operations/dns] - 10https://gerrit.wikimedia.org/r/93601 [21:42:40] (03CR) 10Cmjohnson: [C: 032] Removing dns entries for storage3 [operations/dns] - 10https://gerrit.wikimedia.org/r/93600 (owner: 10Cmjohnson) [21:43:59] !log on mediawiki/core.git deleted and retagged 1.22.0rc0 (same commit: 
c00622b7f6f207f1a2056d47437a5b1891b490a7 ) [21:43:59] !log dns update [21:44:19] Logged the message, Master [21:46:44] (03CR) 10Lcarr: [C: 032] moved parsoid into proper ssubnet Created comments about LVS subnet plans [operations/dns] - 10https://gerrit.wikimedia.org/r/93601 (owner: 10Lcarr) [21:55:29] (03CR) 10Hashar: "You could also introduce a warning level at 10k or 50k or whatever by replicating the logic and using exit code 1. Doc at http://nagios.so" [operations/puppet] - 10https://gerrit.wikimedia.org/r/93585 (owner: 10Aaron Schulz) [22:02:14] (03PS2) 10Mwalker: Enable CentralNotice CrossWiki Hiding [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/92817 [22:07:32] !log Reindexing GeoData [22:07:52] Logged the message, Master [22:15:11] !log mwalker synchronized php-1.23wmf1/extensions/CentralNotice/ 'Updating CentralNotice to master - mobile redirects and cross wiki hiding' [22:15:29] Logged the message, Master [22:15:44] !log mwalker synchronized php-1.23wmf2/extensions/CentralNotice/ 'Updating CentralNotice to master - mobile redirects and cross wiki hiding' [22:15:59] Logged the message, Master [22:16:58] !log mwalker synchronized php-1.23wmf2/resources/mediawiki/mediawiki.inspect.js 'Pushing https://gerrit.wikimedia.org/r/#/c/93587/' [22:17:14] Logged the message, Master [22:17:18] mwalker: <3; thanks. [22:17:28] (03PS1) 10MarkTraceur: Set up Beta Commons as an API repo for beta sites [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/93610 [22:19:06] tgr: D'you want to make sure 93610 there is good? 
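[editor's note] Hashar's review comment on the jobqueue change points at the standard Nagios plugin convention: exit 0 for OK, 1 for WARNING, 2 for CRITICAL. A hedged sketch of the two-threshold shape he suggests — thresholds, message format, and function name are illustrative, not the contents of the actual check being reviewed:

```shell
# Two-threshold Nagios-style check: WARNING (exit 1) past a lower bound,
# CRITICAL (exit 2) past a higher one, OK (exit 0) otherwise.
check_jobqueue() {
  local size=$1 warn=${2:-10000} crit=${3:-100000}
  if [ "$size" -ge "$crit" ]; then
    echo "JOBQUEUE CRITICAL: $size jobs"; return 2
  elif [ "$size" -ge "$warn" ]; then
    echo "JOBQUEUE WARNING: $size jobs"; return 1
  fi
  echo "JOBQUEUE OK: $size jobs"
}

check_jobqueue 5000   # prints "JOBQUEUE OK: 5000 jobs", exit status 0
```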
[22:19:25] Also chrismcmahon, if you want to make sure that's a sane way to do that, it'd be nice [22:19:41] I didn't want to take away the commons setup, but now betacommons is another repo [22:20:25] (03CR) 10Mwalker: [C: 032] Enable CentralNotice CrossWiki Hiding [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/92817 (owner: 10Mwalker) [22:21:12] (03PS3) 10Lcarr: adding new public IP VIP for parsoid [operations/puppet] - 10https://gerrit.wikimedia.org/r/93527 [22:22:01] (03CR) 10jenkins-bot: [V: 04-1] adding new public IP VIP for parsoid [operations/puppet] - 10https://gerrit.wikimedia.org/r/93527 (owner: 10Lcarr) [22:22:54] !log mwalker synchronized wmf-config/CommonSettings.php 'Enabling cross wiki banner hiding for CentralNotice' [22:23:08] Logged the message, Master [22:24:56] (03PS4) 10Lcarr: adding new public IP VIP for parsoid [operations/puppet] - 10https://gerrit.wikimedia.org/r/93527 [22:25:03] marktraceur: I'm not entirely sure what I'm seeing there, but I added hashar as a reviewer. Are you sure you want to make beta commons point to commons.wikimedia.org/w/api.php ? that seems sketchy to me, seems like it should be commons.wikimedia.beta.wmflabs.org/w/api.php [22:25:29] Argh, did I do that [22:25:30] * marktraceur fails [22:25:50] :D [22:26:03] (03PS2) 10MarkTraceur: Set up Beta Commons as an API repo for beta sites [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/93610 [22:26:11] * marktraceur apologizes for wasting everyone's time [22:26:32] marktraceur: note that commons.beta has commons.prod as a foreign repo [22:26:45] marktraceur: APOLOGY NOT ACCEPTED! [22:26:53] marktraceur: and I thought instant commons managed that by itself [22:27:24] hashar: I guess I should turn off betacommons being a repo on betacommons [22:27:49] <^d> marktraceur: I heard you like commons in your commons. [22:27:50] marktraceur: I guess. 
the whole conf is a bit nasty for sure :( [22:28:18] hashar: I'd be OK replacing the instantcommons config with betacommons, but I'm afraid that would screw up testing somehow [22:28:48] marktraceur: as long as chrismcmahon is happy, any change is fine to me :] [22:28:54] marktraceur: maybe just replace $wgForeignFileRepos[x] where $wgForeignFileRepos[x]['name'] = wikimediacommons? [22:29:33] chrismcmahon: Is it a bad idea to no longer have prod commons images on beta sites? [22:30:10] (03CR) 10Gergő Tisza: [C: 031] "(1 comment)" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/93610 (owner: 10MarkTraceur) [22:30:43] Anyway, it seems like there's no good way to do that with the existing setup in InitSettings-labs [22:30:46] But [22:30:46] Ah well [22:30:51] Someone has to go first [22:31:18] hashar marktraceur this is not my area of expertise :-) . my concern is that we need a consistent set of code on beta, namely everything that is in master but not necessarily yet deployed, including API and config. [22:31:31] Yeah [22:31:56] marktraceur: the data is of less concern than the code that manipulates the data, if that makes sense [22:32:02] I see [22:32:15] So if changes to the data cause test failures [22:32:24] You're not going to be super angry with me? :D [22:32:51] marktraceur: as long as the code being tested fails on legit data, I don't care so much where that data lives [22:33:11] marktraceur: and if we fuck up, we can change it later. beta is like that. [22:33:59] Cool [22:34:46] Agh, damn it [22:34:54] tgr: BetaCommons is already a remote repo I think [22:35:21] * marktraceur tries to confirm [22:35:55] marktraceur: so lets say that we have e.g. UploadWizard on beta commons pointing an api.php on beta commons but the images it's checking against are at the real commons. If something failed, then that would indicate that the new code would fail on the production data, which would be a Good Thing. 
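[editor's note] The repo chain being debated here (beta wiki → beta commons → production commons) is wired up through MediaWiki's `$wgForeignFileRepos`. A hypothetical sketch of what a ForeignAPIRepo entry for beta commons might look like — field values are assumptions for illustration, not the actual InitialiseSettings-labs.php contents:

```php
// Hypothetical sketch only -- not the real beta configuration.
$wgForeignFileRepos[] = array(
	'class'                  => 'ForeignAPIRepo',
	'name'                   => 'betacommons',
	'apibase'                => 'http://commons.wikimedia.beta.wmflabs.org/w/api.php',
	'hashLevels'             => 2,
	'fetchDescription'       => true,
	'descriptionCacheExpiry' => 43200,
	'apiThumbCacheExpiry'    => 86400,
);
```

Swapping `apibase` between the production and beta Commons API endpoints is exactly the mistake caught in review above — tgr's comment flagged that the first patch set pointed beta at commons.wikimedia.org.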
[22:36:11] Right [22:36:59] greg-g: if you have not been following along, I think I nutshelled it right there for ya ^^ [22:37:36] marktraceur: that said, hashar is way better than me at this config stuff. :-) [22:39:07] !log mwalker synchronized php-1.23wmf1/skins/vector/ 'Deploying fix for bug 56366 : https://gerrit.wikimedia.org/r/#/c/93408/' [22:39:09] marktraceur: we needed the production commons as a foreign repo, cause there are a bunch of images used from it [22:39:21] marktraceur: an example are the kittens for Wikilove [22:39:23] Logged the message, Master [22:39:47] !log mwalker synchronized php-1.23wmf2/skins/vector/ 'Deploying fix for bug 56366 : https://gerrit.wikimedia.org/r/#/c/93408/' [22:39:54] Right [22:40:01] Logged the message, Master [22:40:04] hashar: But, we also have betacommons set up, I'm just testing now [22:40:22] thanks hashar marktraceur yes exactly, examples are great [22:40:33] i think the chain is : beta -> beta commons (via instant commons maybe) -> foreignApiRepo of commons in prod [22:40:56] Yeah [22:41:01] http://en.wikipedia.beta.wmflabs.org/wiki/Lightbox_demo/Using_betacommons [22:41:05] So that works [22:41:13] Except it reveals bugs, apparently, in MMV [22:42:59] Wellp, yeah, CMD bugs ahoy [22:43:02] Thanks y'all [22:49:54] RoanKattouw: gwicke - when mark does the lvs changes tomorrow, the public parsoid ip should be live [22:50:03] parsoid-lb.eqiad.wikimedia.org [22:50:25] Awesome [22:50:28] YuviPanda: ---^^ [22:50:40] woooot! [22:50:57] and it has ipv6 [22:51:06] becuse it's 2 more awesome than ipv4 [22:51:08] LeslieCarr: yay, thanks! [22:51:12] but no ssl, I think? [22:51:19] thanks LeslieCarr! [22:51:28] we didn't add any new services, so no ssl [22:51:37] so they aren't live yet but will be tomorrow? 
[22:54:56] yep [22:55:04] marktraceur: yay for bugs not in prod [22:56:07] Heh [22:56:11] chrismcmahon: Always a good plan [22:56:58] I am off, see you tomorrow [23:08:18] (03PS1) 10Dzahn: fix bugzilla SSL cert in puppet [operations/puppet] - 10https://gerrit.wikimedia.org/r/93617 [23:11:53] !log csteipp synchronized php-1.23wmf2/includes 'bug 55332' [23:12:07] Logged the message, Master [23:12:47] !log csteipp synchronized php-1.23wmf1/includes 'bug 55332' [23:13:00] Logged the message, Master [23:19:56] * AaronSchulz chuckles [23:20:23] (03CR) 10Dzahn: [C: 032] "semi-revert of Change-Id: I3e163570feecab48809d86b5c8c7ddaa629babbe" [operations/puppet] - 10https://gerrit.wikimedia.org/r/93617 (owner: 10Dzahn) [23:23:56] !log ori synchronized php-1.23wmf2/resources/mediawiki/mediawiki.inspect.js 'Ib2252003f2: mediawiki.inspect#dumpTable: fix broken FF workaround' [23:24:12] Logged the message, Master [23:28:20] PROBLEM - MySQL Slave Running on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:29:10] RECOVERY - MySQL Slave Running on db1047 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: [23:37:35] YuviPanda: have a few minutes to help me debug something? Currently the api can read from data.db but not write to it... [23:37:37] or so it seems [23:38:01] labs instance proxy-abogott-8 [23:38:09] andrewbogott: weird. sshing [23:38:26] andrewbogott: give me a couple of minutes, new machine, so setting up a new key [23:38:49] YuviPanda: uwsgi (and, theoretically the unicorn) log to /var/log/uwsgi/app/ [23:38:51] thanks [23:40:20] PROBLEM - Host virt0 is DOWN: PING CRITICAL - Packet loss = 100% [23:42:10] um… someone reboot virt0? [23:42:21] andrewbogott: okay, I'm in [23:42:25] looking at it now [23:42:40] PROBLEM - Host labs-ns0.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [23:42:45] Is wikitech.wikimedia.org broken? 
[23:42:52] YuviPanda: ok… I'm going to be distracted now on account of virt0 [23:42:53] holy cross post batman [23:42:57] andrewbogott: heh, ok [23:43:10] RECOVERY - Host virt0 is UP: PING WARNING - Packet loss = 80%, RTA = 40.65 ms [23:43:10] RECOVERY - Host labs-ns0.wikimedia.org is UP: PING WARNING - Packet loss = 80%, RTA = 40.63 ms [23:43:15] Yes [23:43:32] cmjohnson1: still around? [23:43:54] * Elsie beats Reedy. [23:43:55] (03PS1) 10Dzahn: fix Bugzilla SSL cert in puppet [operations/puppet] - 10https://gerrit.wikimedia.org/r/93620 [23:44:16] (03PS1) 10RobH: new bugzilla.wikimedia.org cert [operations/puppet] - 10https://gerrit.wikimedia.org/r/93621 [23:44:30] PROBLEM - Puppetmaster HTTPS on virt0 is CRITICAL: Connection timed out [23:45:20] PROBLEM - SSH on virt0 is CRITICAL: Connection timed out [23:45:32] andrewbogott: let me know when you aren't distractex [23:45:37] *distracted [23:45:50] PROBLEM - HTTP on virt0 is CRITICAL: Connection timed out [23:45:59] Damn, when I can't ping or ssh into a host, I don't really know how to start debugging :( [23:46:00] PROBLEM - LDAPS on virt0 is CRITICAL: Connection timed out [23:46:30] PROBLEM - Host virt0 is DOWN: PING CRITICAL - Packet loss = 100% [23:47:17] connects to virt0.mgmt [23:47:23] andrewbogott: [23:47:30] RECOVERY - Host virt0 is UP: PING WARNING - Packet loss = 93%, RTA = 40.69 ms [23:47:33] Oh, I guess I know how to that... [23:47:38] Just don't know what I would learn [23:47:53] so, i'm on the shell [23:48:51] anyone know of springle-afk's availability today? [23:48:57] we have a set of data loss bugs we need a little help with: bugs 53687/56589/56577 [23:49:08] !log starting puppetmaster on virt0 [23:49:25] andrewbogott: opendj is up, puppetmaster starting.. ehm.. what else [23:49:33] https://bugzilla.wikimedia.org/53687 [23:49:37] mutante, does it look like the machine cycled? [23:49:42] https://bugzilla.wikimedia.org/56589 [23:49:46] And does it have a network connection? 
From here looks like not really [23:49:48] https://bugzilla.wikimedia.org/56577 [23:49:50] PROBLEM - LDAP on virt0 is CRITICAL: Connection timed out [23:49:51] andrewbogott: actually..no. uptime 172d [23:50:00] PROBLEM - Auth DNS on labs-ns0.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [23:50:17] robla: Given 53687's age, I'm not sure it can be considered critical. [23:50:26] mutante: I would say that someone tripped over the network cable, except I'm pretty sure that's not a thing that can happen :) [23:50:42] Elsie: it was attempted to be fixed, but that issue is remaining [23:50:50] RECOVERY - HTTP on virt0 is OK: HTTP OK: HTTP/1.1 302 Found - 457 bytes in 8.534 second response time [23:50:50] RECOVERY - LDAP on virt0 is OK: TCP OK - 1.041 second response time on port 389 [23:50:51] so yeah, still critical, dataloss [23:50:56] Elsie: something may have gotten worse in the latest deploy (hence 56577 and 56589) [23:50:57] (potential) [23:51:00] RECOVERY - LDAPS on virt0 is OK: TCP OK - 0.041 second response time on port 636 [23:51:35] mutante, adjacent machines (e.g. virt5) seem to be fine. [23:51:56] A couple of comments from Tim on #wikimedia-tech: [23:52:11] Bye, morebots. [23:52:11] [15:10:38] [[2019_apres_la_chute_de_New_York]] is missing on all the s6 servers except the master [23:52:12] [15:11:50] which is pretty scary [23:52:21] probably be on in a couple hours robla [23:52:37] I left him a couple messages about this earlier, figuring he would at least want to know what's up [23:53:06] mutante, probably time to try a reboot if you don't see anything... [23:53:07] thanks apergos [23:53:46] are you guys liable to be around in a couple hours? [23:53:53] Tim will be [23:53:54] it's not the same bug as 53687 [23:54:13] I'm not sure it's dataloss, per se, if master is okay. 
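[editor's note] The virt0 triage above follows the usual pattern: if a host answers neither ping nor ssh, the out-of-band management interface (`virt0.mgmt` here) is the next stop, since it stays up even when the host's OS or NIC is wedged. A toy sketch of that first decision — host and mgmt naming are illustrative:

```shell
# If a host doesn't answer ping, point the operator at its mgmt console.
probe() {
  local host=$1
  if ping -c 1 -W 1 "$host" >/dev/null 2>&1; then
    echo "up: $host"
  else
    echo "down: $host (next stop: out-of-band console, e.g. ssh $host.mgmt)"
  fi
}

probe 203.0.113.1   # TEST-NET address, never routable
```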
[23:54:20] andrewbogott: ok, rebooting it [23:54:20] (03PS1) 10Chad: Fix up multiversion to not require dba_* functions [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/93622 [23:54:36] Elsie: you want to wait until it turns into data loss? [23:54:45] No. [23:54:45] Elsie: hence my "(potential)" :) [23:54:53] also, are you sure there is no data loss? [23:54:58] because I am not sure [23:55:14] I add an if clause. ;-) But I'll stop distracting you. [23:55:15] !log rebooting virt0 [23:55:40] PROBLEM - Host virt0 is DOWN: PING CRITICAL - Packet loss = 100% [23:57:10] PROBLEM - LVS HTTP IPv6 on foundation-lb.pmtpa.wikimedia.org_ipv6 is CRITICAL: Connection timed out [23:57:10] PROBLEM - LVS HTTP IPv6 on wikimedia-lb.pmtpa.wikimedia.org_ipv6 is CRITICAL: Connection timed out [23:57:10] PROBLEM - SSH on ssl1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:57:10] PROBLEM - SSH on ssl2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:57:11] PROBLEM - SSH on lvs5 is CRITICAL: Connection timed out [23:57:11] PROBLEM - LVS HTTP IPv6 on upload-lb.pmtpa.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 1.301 second response time [23:57:14] PROBLEM - check_mysql on payments4 is CRITICAL: Slave IO: No Slave SQL: Yes Seconds Behind Master: (null) [23:57:20] PROBLEM - LVS HTTP IPv6 on wiktionary-lb.pmtpa.wikimedia.org_ipv6 is CRITICAL: Connection timed out [23:57:20] PROBLEM - LVS HTTPS IPv6 on wikinews-lb.pmtpa.wikimedia.org_ipv6 is CRITICAL: Connection timed out [23:57:20] PROBLEM - LVS HTTP IPv4 on wikibooks-lb.pmtpa.wikimedia.org is CRITICAL: Connection timed out [23:57:20] PROBLEM - LVS HTTP IPv6 on wikipedia-lb.pmtpa.wikimedia.org_ipv6 is CRITICAL: Connection timed out [23:57:22] PROBLEM - LVS HTTPS IPv4 on foundation-lb.pmtpa.wikimedia.org is CRITICAL: Connection timed out [23:57:22] PROBLEM - LVS HTTP IPv4 on wikiversity-lb.pmtpa.wikimedia.org is CRITICAL: Connection timed out [23:57:46] yeah, 
there's a reason I don't like to use this channel for discussion [23:58:00] RECOVERY - SSH on ssl1 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [23:58:00] RECOVERY - SSH on ssl2 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [23:58:00] RECOVERY - LVS HTTP IPv6 on foundation-lb.pmtpa.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 68416 bytes in 0.183 second response time [23:58:00] RECOVERY - LVS HTTP IPv6 on wikimedia-lb.pmtpa.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 91562 bytes in 0.217 second response time [23:58:00] RECOVERY - SSH on lvs5 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [23:58:10] RECOVERY - LVS HTTP IPv4 on wikibooks-lb.pmtpa.wikimedia.org is OK: HTTP OK: HTTP/1.0 200 OK - 67964 bytes in 0.180 second response time [23:58:10] RECOVERY - LVS HTTP IPv6 on wiktionary-lb.pmtpa.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 68416 bytes in 0.181 second response time [23:58:10] RECOVERY - LVS HTTP IPv6 on wikipedia-lb.pmtpa.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 68416 bytes in 0.180 second response time [23:58:10] maybe we can use #wikimedia-tech? 
[23:58:14] RECOVERY - LVS HTTPS IPv4 on foundation-lb.pmtpa.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 68416 bytes in 0.299 second response time [23:58:14] RECOVERY - LVS HTTPS IPv6 on wikinews-lb.pmtpa.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 68414 bytes in 0.296 second response time [23:58:14] RECOVERY - LVS HTTP IPv4 on wikiversity-lb.pmtpa.wikimedia.org is OK: HTTP OK: HTTP/1.0 200 OK - 67964 bytes in 0.182 second response time [23:58:14] RECOVERY - LVS HTTP IPv6 on upload-lb.pmtpa.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 654 bytes in 0.073 second response time [23:58:16] PROBLEM - Apache HTTP on mw1160 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:58:16] PROBLEM - Apache HTTP on mw1159 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:59:00] RECOVERY - Apache HTTP on mw1160 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.065 second response time [23:59:01] RECOVERY - Apache HTTP on mw1159 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.069 second response time [23:59:10] RECOVERY - SSH on virt0 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [23:59:13] sure [23:59:30] RECOVERY - Host virt0 is UP: PING WARNING - Packet loss = 86%, RTA = 40.79 ms