[00:01:05] why would it only affect wikimedia?
[00:03:29] it could be that our endpoints are problematic for his host being able to detect the path mtu problem
[00:04:09] the only thing I know of that's particularly crazy about our SSL setup is the number of SANs, but I'd think that'd show up after Server Hello?
[00:05:45] yeah I don't mean e.g. our SSL being bad, but lower-level tuning of our edge servers could be wrong in related ways that only affect these minority cases
[00:05:55] https://blog.cloudflare.com/path-mtu-discovery-in-practice/ covers a lot of the ground I'm thinking of
[00:06:29] we don't use ECMP though, so our solution probably doesn't have to be as complex as theirs
[00:07:06] I'm vaguely aware ICMP is involved in fragmentation problems
[00:07:29] ECMP is something different, not a typo for ICMP
[00:07:34] I know
[00:08:04] but anyways, for instance we don't currently turn on /proc/sys/net/ipv4/tcp_mtu_probing like they're suggesting in that blog post
[00:08:38] but that's for the v4 case. for v6 the only solutions they offer there are dumping our server-side MTU down to 1280, or using something like their pmtu daemon
[00:09:27] err wait, PMTUD is their long-term v4 solution. I don't think they really call out a better answer than 1280 for the v6 version of the problem
[00:09:51] maybe lvs icmp routing isn't working right, too
[00:10:05] (we do have some config turned on for that, but that doesn't necessarily mean it's working right)
[00:11:09] traceroute to wikimedia is broken from my machine and from ZxelA's
[00:11:34] that'd indicate ICMP getting blocked, wouldn't it?
[00:11:37] broken in what sense?
[00:11:54] (and no, traceroute usually uses UDP to unreachable ports, but then there are many different "traceroute" tools these days)
[00:12:41] Presume he means broken as in it's not responding for all hosts down the chain
[00:13:03] yeah that's often the case
[00:13:08] like https://phabricator.wikimedia.org/P6121
[00:13:11] "mtr" might give a better idea
[00:13:47] mtr works
[00:13:51] Krenair: TalkTalk? Seriously?
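For context on the sysctl mentioned at 00:08:04: below is a minimal sketch, assuming a Linux host, that just reports the PMTUD-related knobs the Cloudflare post discusses. It is not Wikimedia tooling. The 0/1/2 meanings of tcp_mtu_probing (disabled / probe only after a blackhole is suspected / always probe) come from the kernel documentation, and tcp_base_mss is the fallback MSS used once probing kicks in.

```python
#!/usr/bin/env python3
"""Minimal sketch (not Wikimedia tooling): print the PMTUD-related sysctls
discussed above on a Linux host. No privileges needed to read them."""
from pathlib import Path

# tcp_mtu_probing: 0 = disabled, 1 = enabled after an ICMP blackhole is
# suspected, 2 = always probe (RFC 4821 packetization-layer PMTUD).
KNOBS = {
    "net/ipv4/tcp_mtu_probing": "TCP MTU probing (the knob from the Cloudflare post)",
    "net/ipv4/tcp_base_mss": "starting MSS used once probing kicks in",
}

def read_sysctl(name: str) -> str:
    try:
        return (Path("/proc/sys") / name).read_text().strip()
    except OSError:
        return "unreadable"

if __name__ == "__main__":
    for name, desc in KNOBS.items():
        print(f"{name} = {read_sysctl(name)}   # {desc}")
```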
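And to illustrate the point at 00:11:54 that classic traceroute sends UDP probes to high, normally-closed ports and relies on ICMP replies coming back: a bare-bones sketch, not a substitute for mtr or the real traceroute. It needs root for the raw ICMP receive socket, and a hop that filters or rate-limits ICMP shows up as "*", which is the failure mode being discussed. The hostname in the example is just for illustration.

```python
#!/usr/bin/env python3
"""Bare-bones UDP traceroute sketch (illustrative only; needs root for the
raw ICMP receive socket). A real tool would match replies to its own probes."""
import socket

def traceroute(host: str, max_hops: int = 30, port: int = 33434, timeout: float = 2.0) -> None:
    dest = socket.gethostbyname(host)
    for ttl in range(1, max_hops + 1):
        # Raw socket to catch ICMP Time Exceeded / Port Unreachable replies.
        recv = socket.socket(socket.AF_INET, socket.SOCK_RAW, socket.IPPROTO_ICMP)
        recv.settimeout(timeout)
        # Plain UDP probe to a high, normally-closed port, with a limited TTL.
        send = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        send.setsockopt(socket.IPPROTO_IP, socket.IP_TTL, ttl)
        send.sendto(b"", (dest, port))
        try:
            _, (hop, _) = recv.recvfrom(512)
        except socket.timeout:
            hop = "*"  # no ICMP came back: filtered, rate-limited, or lost
        finally:
            send.close()
            recv.close()
        print(f"{ttl:2d}  {hop}")
        if hop == dest:
            break

if __name__ == "__main__":
    traceroute("en.wikipedia.org")
```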
[00:14:13] Reedy, I didn't get to choose the ISP :)
[00:14:24] but that was essentially my reaction
[00:14:35] I've always chosen the ISP even when at my parents
[00:16:55] I get the choice of shitty or shittier when it comes to ISPs around here
[00:17:14] ^
[00:17:30] If you can get TT, you can get most of the retail ISPs in the UK
[01:04:16] PROBLEM - Check health of redis instance on 6479 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 1508029451 600 - REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 4163902 keys, up 4 minutes 8 seconds - replication_delay is 1508029451
[01:04:16] PROBLEM - Check health of redis instance on 6480 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 1508029451 600 - REDIS 2.8.17 on 127.0.0.1:6480 has 1 databases (db0) with 4163534 keys, up 4 minutes 8 seconds - replication_delay is 1508029451
[01:04:26] PROBLEM - Check health of redis instance on 6379 on rdb2001 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6379
[01:04:46] PROBLEM - Check health of redis instance on 6481 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 1508029479 600 - REDIS 2.8.17 on 127.0.0.1:6481 has 1 databases (db0) with 4160453 keys, up 4 minutes 36 seconds - replication_delay is 1508029479
[01:05:16] RECOVERY - Check health of redis instance on 6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 4160318 keys, up 5 minutes 7 seconds - replication_delay is 0
[01:05:26] RECOVERY - Check health of redis instance on 6379 on rdb2001 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 8866052 keys, up 5 minutes 20 seconds - replication_delay is 0
[01:05:47] RECOVERY - Check health of redis instance on 6481 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6481 has 1 databases (db0) with 4157727 keys, up 5 minutes 41 seconds - replication_delay is 0
[01:06:16] RECOVERY - Check health of redis instance on 6480 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6480 has 1 databases (db0) with 4159319 keys, up 6 minutes 8 seconds - replication_delay is 0
[01:07:26] PROBLEM - puppet last run on mw1226 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:19:22] 10Operations, 10Deployments, 10Beta-Cluster-reproducible, 10HHVM, and 3 others: Switch mwscript from Zend PHP5 to default php alternative (e.g. HHVM or PHP7) - https://phabricator.wikimedia.org/T146285#3685666 (10Smalyshev) Speaking of which, if we're moving to php7 on mwscript and off hhvm, should we also...
[01:20:08] 10Operations, 10Deployments, 10Beta-Cluster-reproducible, 10HHVM, and 2 others: Switch mwscript from Zend PHP5 to default php alternative (e.g. HHVM or PHP7) - https://phabricator.wikimedia.org/T146285#3685667 (10Reedy)
[01:32:26] RECOVERY - puppet last run on mw1226 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures
[02:16:53] (03Draft1) 10Paladox: Enable auto submodule updates [puppet] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/384306
[02:16:55] (03PS2) 10Paladox: Enable auto submodule updates [puppet] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/384306
[02:20:11] (03Draft1) 10Paladox: Add branch field to .gitmodules [puppet] - 10https://gerrit.wikimedia.org/r/384307
[02:20:16] (03PS2) 10Paladox: Add branch field to .gitmodules [puppet] - 10https://gerrit.wikimedia.org/r/384307
[02:24:22] (03PS3) 10Paladox: Enable auto submodule updates [puppet] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/384306
[02:24:43] (03CR) 10Paladox: "See https://gerrit-review.googlesource.com/Documentation/user-submodules.html" [puppet] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/384306 (owner: 10Paladox)
[03:27:46] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 605.35 seconds
[03:46:47] PROBLEM - IPv4 ping to eqiad on ripe-atlas-eqiad is CRITICAL: Traceback (most recent call last)
[03:48:26] PROBLEM - IPv4 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: Traceback (most recent call last)
[03:49:17] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: Traceback (most recent call last)
[03:50:17] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: Traceback (most recent call last)
[03:50:27] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: Traceback (most recent call last)
[03:50:57] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: Traceback (most recent call last)
[03:56:46] PROBLEM - Make sure enwiki dumps are not empty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/dumps - 288 bytes in 0.021 second response time
[03:56:46] RECOVERY - IPv4 ping to eqiad on ripe-atlas-eqiad is OK: OK - failed 0 probes of 290 (alerts on 19) - https://atlas.ripe.net/measurements/1790945/#!map
[03:58:26] RECOVERY - IPv4 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 0 probes of 291 (alerts on 19) - https://atlas.ripe.net/measurements/1791307/#!map
[03:59:17] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 8 probes of 278 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[04:00:16] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 8 probes of 278 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[04:00:36] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 0 probes of 288 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map
[04:00:57] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 9 probes of 279 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[04:33:06] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 247.12 seconds
[10:46:46] RECOVERY - Make sure enwiki dumps are not empty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.021 second response time
[11:14:28] (03CR) 10Dereckson: [C: 04-1] "Config looks good, but wikidata client should be double checked." (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372798 (https://phabricator.wikimedia.org/T173643) (owner: 10Urbanecm)
[11:15:56] !log mobrovac@tin Started restart [electron-render/deploy@8dd5f13]: Electron hanging - T174916
[11:16:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:16:03] T174916: electron/pdfrender hangs - https://phabricator.wikimedia.org/T174916
[11:16:08] 10Operations, 10DBA, 10Support-and-Safety, 10Patch-For-Review, 10Wiki-Setup (Create): Create elections committee private wiki - https://phabricator.wikimedia.org/T174370#3685824 (10Dereckson) p:05Low>03Normal There is hi.wiktionary to create soon, I'll see next week to plan a window for it, and if so...
[12:43:16] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 29 probes of 279 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[12:48:16] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 9 probes of 279 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[13:41:26] PROBLEM - puppet last run on labtestweb2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:49:16] PROBLEM - puppet last run on cp1068 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[14:08:46] PROBLEM - puppet last run on mw1319 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[14:11:26] RECOVERY - puppet last run on labtestweb2001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[14:19:16] RECOVERY - puppet last run on cp1068 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[14:38:46] RECOVERY - puppet last run on mw1319 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[14:50:28] (03PS2) 10Giuseppe Lavagetto: Port docker builder [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/384081 (https://phabricator.wikimedia.org/T177276)
[15:08:37] PROBLEM - puppet last run on lvs1010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[15:38:37] RECOVERY - puppet last run on lvs1010 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[15:45:41] (03CR) 10MarcoAurelio: [C: 04-1] Enable blocking feature of abuse filter in fawikiquote (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384252 (https://phabricator.wikimedia.org/T178227) (owner: 10Ladsgroup)
[15:49:49] (03CR) 10MarcoAurelio: "After some time without complains I guess we can safely assume this is correct IMHO." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370791 (https://phabricator.wikimedia.org/T101983) (owner: 10TerraCodes)
[15:50:05] (03CR) 10MarcoAurelio: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370791 (https://phabricator.wikimedia.org/T101983) (owner: 10TerraCodes)
[16:40:46] PROBLEM - Check systemd state on relforge1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[16:41:37] PROBLEM - Check systemd state on relforge1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[18:12:37] PROBLEM - puppet last run on rdb1007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[18:29:41] (03CR) 10Huji: Enable blocking feature of abuse filter in fawikiquote (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/384252 (https://phabricator.wikimedia.org/T178227) (owner: 10Ladsgroup)
[18:42:36] RECOVERY - puppet last run on rdb1007 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[18:47:16] (03CR) 10Luke081515: [C: 031] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370791 (https://phabricator.wikimedia.org/T101983) (owner: 10TerraCodes)
[19:04:43] (03PS7) 10TerraCodes: Remove overlapping userrights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370791 (https://phabricator.wikimedia.org/T101983)
[20:04:21] 10Operations, 10Cloud-Services, 10Toolforge, 10Traffic, 10HTTPS: Migrate tools.wmflabs.org to https only (and set HSTS) - https://phabricator.wikimedia.org/T102367#3686086 (10bd808)
[20:20:22] 10Operations, 10Cloud-Services, 10Toolforge, 10Traffic, 10HTTPS: Migrate tools.wmflabs.org to https only (and set HSTS) - https://phabricator.wikimedia.org/T102367#3686094 (10bd808) >>! In T128409#2233337, @BBlack wrote: > What you can do to help modern browsers, though, without taking the redirect and/o...
[20:22:17] (03PS1) 10Mforns: Fix cron job for refinery data drop of MediaWiki snapshots [puppet] - 10https://gerrit.wikimedia.org/r/384346 (https://phabricator.wikimedia.org/T178256)
[20:56:04] 10Operations, 10Cloud-Services, 10Toolforge, 10Traffic, 10HTTPS: Migrate tools.wmflabs.org to https only (and set HSTS) - https://phabricator.wikimedia.org/T102367#3686149 (10bd808)
[21:11:34] (03PS2) 10Dereckson: Add additional namespaces to search results for bnwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383794 (https://phabricator.wikimedia.org/T178041) (owner: 10DCausse)
[21:12:35] (03CR) 10Dereckson: "PS2: namespace numbers can be cryptic, and especially confusing for wikisource, as they change from one wiki to another, so we can documen" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/383794 (https://phabricator.wikimedia.org/T178041) (owner: 10DCausse)
[22:07:18] (03PS2) 10ArielGlenn: Increase the shard count for Wikidata entity dumps from 5 to 6 [puppet] - 10https://gerrit.wikimedia.org/r/383414 (https://phabricator.wikimedia.org/T177486) (owner: 10Hoo man)
[22:08:35] (03CR) 10ArielGlenn: [C: 032] Increase the shard count for Wikidata entity dumps from 5 to 6 [puppet] - 10https://gerrit.wikimedia.org/r/383414 (https://phabricator.wikimedia.org/T177486) (owner: 10Hoo man)
[22:11:28] (03CR) 10ArielGlenn: [C: 032] Test different batch sizes in dumpwikidatajson.sh [puppet] - 10https://gerrit.wikimedia.org/r/384204 (https://phabricator.wikimedia.org/T177486) (owner: 10Hoo man)
[22:12:09] (03PS3) 10ArielGlenn: Test different batch sizes in dumpwikidatajson.sh [puppet] - 10https://gerrit.wikimedia.org/r/384204 (https://phabricator.wikimedia.org/T177486) (owner: 10Hoo man)
[22:18:22] (03PS2) 10ArielGlenn: Do not make dumps of wb_entity_per_page [puppet] - 10https://gerrit.wikimedia.org/r/352797 (https://phabricator.wikimedia.org/T140890) (owner: 10Ladsgroup)
[22:19:02] (03CR) 10ArielGlenn: [C: 032] Do not make dumps of wb_entity_per_page [puppet] - 10https://gerrit.wikimedia.org/r/352797 (https://phabricator.wikimedia.org/T140890) (owner: 10Ladsgroup)
[22:24:25] consider it a very early monday deploy :-P
[22:24:35] (01:24 am here!)
[23:31:31] 10Operations, 10Cloud-Services, 10Toolforge, 10Traffic, 10HTTPS: Migrate tools.wmflabs.org to https only (and set HSTS) - https://phabricator.wikimedia.org/T102367#3686227 (10BBlack) Be careful with `preload`. Its only purpose is to signal to the Chromium list maintainers that it's ok to you preload yo...
[23:34:39] 10Operations, 10Cloud-Services, 10Toolforge, 10Traffic, 10HTTPS: Migrate tools.wmflabs.org to https only (and set HSTS) - https://phabricator.wikimedia.org/T102367#3686228 (10bd808) >>! In T102367#3686227, @BBlack wrote: > Be careful with `preload`. Its only purpose is to signal to the Chromium list ma...
[23:39:05] (03PS2) 10Ori.livneh: Drop support for the legacy configuration format [debs/pybal] - 10https://gerrit.wikimedia.org/r/317823
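To make BBlack's `preload` caveat at 23:31:31 concrete, here is a toy sketch of the header being discussed; the host, port, and max-age are made up and this is not the tools.wmflabs.org configuration. Browsers only honor Strict-Transport-Security when it arrives over HTTPS, and appending `preload` additionally signals that the domain may be baked into the Chromium preload list, which is hard to undo quickly, hence the warning.

```python
#!/usr/bin/env python3
"""Toy illustration of the HSTS header discussed in T102367 (not the real
tools.wmflabs.org config). Browsers ignore HSTS over plain HTTP; this
unencrypted example only shows the header's shape."""
from http.server import BaseHTTPRequestHandler, HTTPServer

# Without "preload": only browsers that have already visited enforce HTTPS-only.
# Adding "; preload" also asserts the domain may be hard-coded into browser
# preload lists -- only do that once HTTPS-only is meant to be permanent.
HSTS = "max-age=31536000; includeSubDomains"

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Strict-Transport-Security", HSTS)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(b"ok\n")

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8080), Handler).serve_forever()
```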