[14:35:33] Seems like several sites died in Norway, … Wikidata nowiki
[14:36:39] enwiki loaded with errors
[14:36:48] dewp very slow, too
[14:39:11] Wikimedia websites are currently very slow at home, and some files refuse to load
[14:39:20] Am I the only one?
[14:40:26] !admin
[14:40:49] Hm, any bells and whistles here?
[14:42:59] yah its basically dead
[14:43:16] There are known issues and the operations team are looking
[14:45:00] Thanks p858snake|L
[14:51:07] we're looking into it and looking at remedies
[14:51:28] should be just a few minutes to get things together, updates here as soon as that happens
[14:58:03] some funky stuff going on
[15:01:21] (04:51:07 PM) apergos: we're looking into it and looking at remedies
[15:01:21] (04:51:28 PM) apergos: should be just a few minutes to get things together, updates here as soon as that happens
[15:12:17] worked for me until a couple min ago but nothing loads now
[15:13:28] we'll give updates as soon as they are available
[15:13:36] "thanks in advance for your patience" etc
[15:14:17] :)
[15:18:22] for me wikidata is fine, just slower loading
[15:27:58] here's your periodic "no new updates yet" message
[15:32:00] I just had an edit for wikidata needing 2 mins to go through
[15:32:15] rest of the sites are fine for me]
[15:44:03] we have put mitigations in place which should bring things more or less back to normal shortly
[15:44:45] (y)
[15:51:26] * tomreyn is watching https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&fullscreen&panelId=6&from=1580047200000&to=now - assumes things will be fine once we're back to more or less a single line (like it was between 14:00 and 14:30).
[15:55:36] thanks
[16:01:26] still working on things
[16:04:43] is it just me or is esams always the weakest link
[16:05:10] not necessarily
[16:08:04] AntiComposite: it's the site that gets the most traffic
[16:19:08] still working on mitigations
[16:19:32] So... Cna I have the evening news summary on what's happening?
[16:19:35] *can
[16:19:49] we are working on mitigations; once they are finally in place we'll update here
[16:20:22] apergos: I was also having some issues with a different web site entirely
[16:20:35] can't tell you anything about that ;-)
[16:20:40] but they don't have any similarity in routes
[16:20:51] which was what I thought might have happened
[16:52:41] Things appear to have stabilized
[16:53:07] that's my impression, too.
[16:54:23] frontend traffic is back to normal https://grafana.wikimedia.org/d/000000479/frontend-traffic?orgId=1&from=1580047200000&to=now&fullscreen&panelId=12
[16:55:03] icinga-wm isn't running around with it's hair on fire anymore either :)
[16:55:35] <_joe_> We're taking some countermeasures, but things aren't stable still
[16:55:52] <_joe_> it would be very useful if you could report if you have issues seeing the wikis
[16:56:08] Working for me, Boston MA area
[16:56:12] <_joe_> and also referenced what ip you get when resolving en.wikipedia.org
[16:56:13] yes and please let us know about errors, timeouts, or just slowness
[16:56:24] <_joe_> or any other wiki ofc
[17:01:15] Troubles with Wikimedia sites: slow or down. Is a common issue, or is it just me?
[17:01:25] Known issue
[17:02:09] <_joe_> MezzeStagioni: ongoing issue, yes
[17:02:23] <_joe_> I can guess you're located in southern europe from your nick, correct?
[17:02:30] <_joe_> (sorry it's important for debugging)
[17:03:08] I can not find any advice or link.. Yes, as you've guessed it's italian. I'm connected from Italy
[17:03:44] mark: can you change the topic here
[17:05:00] ty
[17:05:34] <_joe_> MezzeStagioni: how are things now?
[17:05:56] <_joe_> (we could switch to italian but it would be disrespectful to others :P)
[17:06:25] _joe_: but we can't see your hands on IRC
[17:07:16] <_joe_> good point
[17:07:52] Hello from Czech Republic, I cant load wiki too, had problems yesterday around 20:44 UTC, but I think you already know all the information :))
[17:08:08] Now it's far better. But it's very variable, since yesterday evening (Europe time). Sometime you can load page, sometime they are slow and unformatted, somtime all is fine
[17:08:42] <_joe_> MezzeStagioni: yeah we're aware thanks
[17:08:44] much better here (France) since 4.30 UTC
[17:08:48] https://smokeping.illyse.org/?target=Alexafr.Wikipedia_v6
[17:08:52] (times are UTC+1)
[17:09:22] No more slow from France en.wikipedia.org → (91.198.174.192)
[17:09:24] yeah, I was able to save only one edit this afternoon (UTC)
[17:09:39] <_joe_> Pols12: thanks
[17:10:42] not sure how useful this is... https://www.site24x7.com/public/t/results-1580058341058.html
[17:11:39] <_joe_> tomreyn: given how transient the issues are, it's hard to make use of things like those monitors to extract if what we do is helping
[17:11:46] <_joe_> but thanks for the link :)
[17:11:55] hehe, ok
[17:12:47] your own monitoring is awesome, though, i like it a lot.
[17:13:45] that is, while it's reachable ;)
[17:14:29] <_joe_> heh
[17:14:53] <_joe_> tomreyn: can you see the wikis right now?
[17:15:18] _joe_: yes, pages load fast and fine for me everywhere
[17:15:29] <_joe_> ack, thanks :)
[17:16:17] no, no, thank *you*.
[17:53:29] report in -en that no WM sites are accessable from Pune, India or the UK
[17:54:09] Again down from France, dyna.wikimedia.org [91.198.174.192]
[17:54:13] <_joe_> thanks
[17:54:39] down in Finland from desktop, works in mobile
[17:55:29] <_joe_> Stryn: if you are able to give me a traceroute from where it doesn't work
[17:55:55] from here (France) to esams, IPv6 works fine but IPv4 is in complete blackhout
[17:55:58] down in Norway
[17:56:28] <_joe_> if you can post me a traceroute from your connection that is down to 91.198.174.192, that could help
[17:56:38] IPv4 AS path (broken): https://lg.grenode.net/prefix_bgpmap/safran+batture/ipv4?q=text-lb.esams.wikimedia.org
[17:56:45] IPv6 (working): https://lg.grenode.net/prefix_bgpmap/safran+batture/ipv6?q=text-lb.esams.wikimedia.org
[17:57:17] <_joe_> zorun: thanks
[17:57:29] ae2.cr2-esams.wikimedia.org [80.249.209.176]
[17:57:36] text-lb.esams.wikimedia.org [91.198.174.192]
[17:57:43] I just ran a mtr but now it works again...
[17:57:47] TCP traceroute output appreciated: traceroute --tcp --port 443 en.wikipedia.org
[17:57:59] Down in Jordan, Lebanon, Palestine and Egypt
[17:58:01] for mtr, mtr -z --report-wide --tcp --port 443 en.wikipedia.org
[17:58:09] working in Finland now
[17:58:23] <_joe_> Alaa|away: still down?
[17:58:51] when IPv4 was broken, it seems it was going through LGI-UPC
[17:58:56] now it takes a different AS path
[17:59:09] <_joe_> zorun: yes, it's what we just changed
[17:59:32] _joe_: Works in Kuwait, Egypt, Jordan, Palestine and Lebanon
[17:59:32] <_joe_> (more or less, sorry I can't get into specifics)
[17:59:40] <_joe_> Alaa|away: thanks a lot :)
[18:00:00] works in Norway
[18:00:10] works again in Austria
[18:00:14] Works again in France
[18:00:25] there is no such thing as a good commercial internet provider
[18:00:25] no issues in SIngapore
[18:00:29] <_joe_> we might have more instabilities, be advised
[18:00:45] <_joe_> the underlying issue is still ongoing from time to time
[18:01:10] I haz edit… nowiki
[18:01:38] _joe_: yup in Levant region + Egypt sinve 2 hrs
[18:02:31] we also have no control over the underlying cause here, but are doing what we can to mitigate.
[18:04:00] Stops working intermittently in Canada (Freedom Mobile (Shaw Communications)). Can't connect right now as well as a few days this week. It was working 2 minutes ago. It's also working fine through Bell and Rogers
[18:04:48] <_joe_> danielzgtg: uh can you get me a traceroute to en.wikipedia.org?
[18:06:14] is there a task?
[18:06:33] # mtr -bwc 10 wikipedia.org
Start: 2020-01-26T13:05:16-0500
HOST: localhost                                     Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- ???                                        100.0    10    0.0   0.0   0.0   0.0   0.0
  2.|-- 10.20.118.65                                0.0%    10   38.0  34.7  28.0  50.6   6.2
  3.|-- 10.20.101.29                                0.0%    10   34.9  33.2  28.9  39.8   3.0
  4.|-- ???                                        100.0    10    0.0   0.0   0.0   0.0   0.0
  5.|-- 199-7-156-196.eng.wind.ca (199.7.156.196)   0.0%    10   34.7  35.9  29.8  58.7   8.3
  6.|-- rc3fs-be8.mt.shawcable.net (66.163.75.237)  0.0%    10   27.3  33.0  27.3  40.0   4.3
  7.|-- rc3hu-be6.ny.shawcable.net (66.163.78.146)  0.0%    10   71.1  68.5  42.8 102.5  14.6
  8.|-- rc2as-be11.vx.shawcable.net (66.163.75.81)  0.0%    10   70.0  68.3  55.6  76.5   6.8
  9.|-- ???                                        100.0    10    0.0   0.0   0.0   0.0   0.0
[18:06:55] <_joe_> hauskatze: as you can guess from how little detail we're giving, this is not easily discussed in public
[18:07:05] danielzgtg, careful pasting text in here, automated anti-spam measures don't like it, use a pastebin next time
[18:07:21] Ok
[18:07:35] _joe_: ack. I've just logged in and worked for me, but some others were complaining so I decided to take a look
[18:07:49] <_joe_> hauskatze: do you happen to know where these people are?
[18:08:08] The one I talked, I presume in Spain
[18:08:19] *talked to
[18:09:08] cit.: "sometimes up, sometimes down"
[18:09:26] <_joe_> yeah which is consistent with the transient nature of the issue
[18:09:42] <_joe_> but things should be more or less stable right now
[18:12:00] _joe_: Wikimedia doesn't work for me locally, but appears to work via curl at my (also Czech-based) VPS. If there's anything I can do, lmk
[19:24:28] if anyone is currently experiencing connectivity issues, please let us know, and follow the instructions at https://bpaste.net/BUBA to gather us some extra details
[19:24:31] thank you!
[19:53:24] Gateway timeout from Norway
[19:53:40] Slow/difficulty loading from MA (US)
[19:53:50] Got a 502 Bad Gateway from Nginx attempting to load a page.
[19:55:23] icinga-wm is back to running around with it's hair on fire
[19:55:26] https://gist.github.com/jeblad/a758e965a258e73ede76c826e8929b70
[19:55:55] Down in Egypt
[19:56:00] I can add more entries if necessary
[19:56:21] https://www.irccloud.com/pastebin/1153Q4dg/
[19:56:53] slow to load here too, Austria and VPS in Germany -- traceroute from the latter: https://bpaste.net/B4HA
[19:57:07] doesn't look like the same issue
[19:57:45] slowness from Spain to mediawiki.org. Ping, mtrr, and traceroute: https://dpaste.org/0Qas/raw
[19:57:58] Experiencing again Wikimedia freeze from France. fr.wp returns error 502, Commons returns 504.
[19:58:54] It is known and being invesetigated
[19:59:31] we're aware of the issues, checking it out
[20:01:12] From arwiki> Lebanon + Morocco also
[20:02:06] 502 from Austria: Request from - via cp3056.esams.wmnet, ATS/8.0.5 /// Error: 502, Next Hop Connection Failed at 2020-01-26 20:01:04 GMT
[20:02:06] just getting blank pages from Finland
[20:08:28] Nemo_bis: Are you getting the WMF timeout or nothing at all?
[20:08:36] I get the WMF timeout...
[20:08:52] "Request from - via cp3064.esams.wmnet, ATS/8.0.5
[20:08:52] Error: 502, Next Hop Connection Failed at 2020-01-26 20:07:49 GMT"
[20:09:31] And I have a gap in the traceroute
[20:09:40] (504 Gateway Time-out) From Iraq
[20:10:15] Does WMF use a mitigation service?
[20:10:15] Issues are known and ops folks are working on them, see #wikimedia-operations
[20:10:28] tassu: Should I report here or there?
[20:10:31] here
[20:10:36] Okay..
[20:10:56] The gap in the traceroute from my end is at 141.101.70.122
[20:10:58] -operations is one of our work channels so we like to keep the discussion limited there during events like these
[20:11:01] during an incident, it's best to leave -operations for alerting and coordination, watching is fine
[20:11:44] The traceroute is fine upto when it leaves the UK backbone
[20:11:50] from what I can tell
[20:12:10] On the next hop from there is where I don't see anything
[20:13:14] Given past incidents it wouldn't suprise me if WMF was using an external load balancer
[20:13:56] Is someone looging the traceroutes?
[20:14:02] Looking for a pattern...?
[20:14:14] loging
[20:14:48] WMF operations is working on investigating and resolving the issue. Connection reports are usually part of that investigation
[20:14:55] Thanks..
[20:15:01] Hope the above information is useful
[20:15:36] For me it's saying the traceroute to 91.198.174.192 failed
[20:16:02] But as this isn't just the UK I am wondering if something went wrong in the backbone
[20:16:32] https://www.irccloud.com/pastebin/gkjMFw2t/
[20:16:35] Speculation, though, usually isn't helpful
[20:19:50] AntiCOmposite : https://phabricator.wikimedia.org/T243713 this the incident ticket?
[20:21:22] It is _a_ ticket, _the_ ticket I don't know
[20:21:47] AntiComposite: Presumably you also don't want to hear the wild ceramic brain shield theories?
[20:22:41] Yeah no, I already have to take a class on psychokinesis and the human consciousness this semester, I don't need more crazy things people have said about brains
[20:23:29] AntiComposite: I meant the wild theories... tinfoil hats focus the 'mind control' rays apparently...
[20:23:32] :ROFL
[20:32:27] I wrote a story about a real device ment to zap peoples brain. It was a real research project at a real research institute. I was so tired of one of the tinfoilhat-guys. Thought it was a fun article, as I noticed he was following what I did at Wikipedia. Apparently the guy went into a deep psykosis. Then it wasn't fun anymore. Don't joke about psychokinesis and other nonsense, you don't know who read it.
[20:32:42] https://twitter.com/Wikipedia/status/1221513346781982722
[20:34:08] Just thought I'd put it out there. I'm sure y'all are already aware. But it's extremely slow for me in Florida.
[20:34:15] are additional traceroutes from europe helpful?
[20:35:16] TrueCRaysball known and being handled
[20:35:36] Just making sure! :)
[20:35:44] Thanks for the report
[20:36:11] redacted trace if it helps https://pastebin.com/dK1LaLEt
[20:36:30] we think we've actually mitigated the connectivity-related issues for now
[20:36:42] the thing that is currently ongoing is a new, different problem in our application servers
[20:36:44] we're working on it :)
[20:37:26] JanSch: https://phabricator.wikimedia.org/T243713 fwiw
[20:39:27] thanks! an app server issue would be consistent with what I observed (I can now connect but requests time out). Thanks for the hard work and let us know if there's anything we can provide to help.
[20:43:34] This is, uh, off topic, so feel free to ignore it, but when setting gadget definitions for the RL, does |rights=autoconfirmed also include confirmed users, or does that need to be set separately
[20:48:50] you would need to try, but i suspect not
[20:54:26] MediaWiki login to Phabricator no working for me. Returning 'Unhandled Exception ("HTTPFutureHTTPResponseStatus")' with page text showing the html from a HTTP/504 Wikimedia Error
[20:55:11] are traceroutes still helpful, or is this a serverside issue?
[20:55:31] Likely due to the current issue William_Avery, try again later
[20:55:48] markspolakovs, if you're getting a WMF timeout, probably not
[21:01:26] markspolakovs: we believe the current issue to be server-side, especially if you're getting a response that has a Wikimedia-rendered timeout page
[21:03:44] cdanis, AntiComposite: seems like it, thanks
[21:04:22] Everything seems to have suddenly returned to normal for me now
[21:04:42] same
[21:04:59] Grafana shows application server responses returning to normal for the moment
[21:05:26] -operations logs seem to indicate that they are normalizing
[21:07:01] this is shaping up to be a mighty interesting incident report
[21:07:49] The "one report or two" decision is going to be interesting on it's own
[21:08:09] it will be two
[21:08:42] also two lines in tech news for interested users? :D
[21:09:23] Nah, I think we can squeeze it into one line for tech news, not that there's really anything else to write about this week
[21:09:34] :D
[21:09:52] "Everything caught fire. Now it's not on fire."
[21:09:53] :p
[21:11:13] “plumber accidently welded the wrong pipe in a server”
[21:12:00] "Wimedia servers not affected by new virus says WMF.."
[21:12:06] :D
[21:12:26] telnet towel.blinkenlights.nl 666: "It's not plugged in."
[21:12:56] or "All we know is.. something broke... we not sure what.."
[21:13:38] One user at nowiki claims the reason is Wikidata…
[21:13:49] Oh?
[21:13:55] Was there a configuration change?
[21:14:07] * jeblad want to disconnect some users from internet
[21:14:25] this all happened because you didn't donate enough last year
[21:14:46] give jimbo One Money or the servers go down for one day
[21:27:54] <_joe_> more servers would help
[21:27:58] <_joe_> for sure
[21:28:13] <_joe_> haha thanks for the laugh people :D
[21:28:20] <_joe_> reading backscroll made me giggle
[22:05:26] Is search on wikitech-static known to be broken? Special:Search results in [7cffe91bd97b733bea47d0e6] 2020-01-26 22:05:00: Fatal exception of type "Wikimedia\Rdbms\DBQueryError"
[22:07:23] AntiComposite: I mean, it's wikitech-static…
[22:07:37] But probably not. File a Phab task?
[22:08:03] Yeah, not exactly the highest of priorities at the moment :)
[22:08:10] I'll go look at phab
[22:08:27] Thanks
[22:16:24] https://phabricator.wikimedia.org/T243730
[22:34:30] https://873gear.com/irc/uploads/6150b16f900d7e0a/image.png
[23:12:02] JanSch:yes
[23:12:20] The additional traces are helpful
[23:12:29] Sorry... they left earlier..?
[23:12:32] * ShakespeareFan00 out
[23:44:09] congrats. you sure had a long fire fighting day there.
[23:45:49] Yeah, lots of fun today.