[00:11:10] PROBLEM - Puppet freshness on knsq9 is CRITICAL: Puppet has not run in the last 10 hours
[00:45:18] zzz ==_____==
[00:47:11] New review: tstarling; "(no comment)" [analytics/udp-filters] (master) - https://gerrit.wikimedia.org/r/2142
[01:36:03] PROBLEM - MySQL Slave Delay on db42 is CRITICAL: CRIT replication delay 1321 seconds
[01:40:57] New review: tstarling; "Looks good overall, it's certainly much better than the collection of filters that came before. Ther..." [analytics/udp-filters] (master) C: 0; - https://gerrit.wikimedia.org/r/2142
[01:43:33] PROBLEM - MySQL Replication Heartbeat on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:06:09] !log LocalisationUpdate completed (1.18) at Tue Jan 31 02:06:09 UTC 2012
[02:06:10] Logged the message, Master
[02:08:23] PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 865s
[02:21:21] PROBLEM - MySQL replication status on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 1642s
[02:30:01] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:41:21] PROBLEM - MySQL Replication Heartbeat on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:43:41] RECOVERY - MySQL replication status on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 0s
[02:44:41] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:02:31] RECOVERY - Misc_Db_Lag on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 39s
[03:07:11] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 8.920 seconds
[03:15:50] PROBLEM - Disk space on srv223 is CRITICAL: DISK CRITICAL - free space: / 120 MB (1% inode=60%): /var/lib/ureadahead/debugfs 120 MB (1% inode=60%):
[03:15:50] PROBLEM - Disk space on srv219 is CRITICAL: DISK CRITICAL - free space: / 17 MB (0% inode=60%): /var/lib/ureadahead/debugfs 17 MB (0% inode=60%):
[03:17:51] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 8.744 seconds
[03:27:00] RECOVERY - Disk space on srv223 is OK: DISK OK
[03:49:40] PROBLEM - Disk space on srv221 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=60%): /var/lib/ureadahead/debugfs 0 MB (0% inode=60%):
[03:51:30] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:51:50] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:02:50] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.421 seconds
[04:05:13] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 8.750 seconds
[04:07:23] RECOVERY - Disk space on srv221 is OK: DISK OK
[04:12:06] who knows about the new mailman setup?
[04:12:12] how many web backends?
[04:12:19] is there an LVS or a squid or a varnish?
[04:12:41] Dmcdevit: is getting vastly different results than i but ping gives him the same IP i have
[04:13:13] and i see no relevant headers to tell me if there's a proxy
[04:18:33] RECOVERY - Disk space on srv219 is OK: DISK OK
[04:23:23] RECOVERY - MySQL disk space on es1004 is OK: DISK OK
[04:23:33] RECOVERY - Disk space on es1004 is OK: DISK OK
[04:35:12] * jeremyb gives up on searching the puppet repo
[04:36:08] Ryan_Lane: multichill: any ideas? ^^
[04:36:22] err, that was supposed to be mutante
[04:36:32] jeremyb: huh?
[04:36:48] multichill: < jeremyb> err, that was supposed to be mutante
[04:37:08] multichill: wow, it's early there! good morning!
[04:37:09] Ah, I'm generally not awake at half past 6 in the morning ;-)
[04:37:22] yeah, i chose based on TZ. i left out ma rk
[04:38:53] I was just called out of bed
[04:39:01] oh, page?
[04:39:30] * jeremyb has some trouble reconciling half past 6. isn't it half past 5?
[04:39:33] PROBLEM - MySQL slave status on es1004 is CRITICAL: CRITICAL: Slave running: expected Yes, got No
[04:40:24] Yes, it is ..... too early
[04:40:25] multichill: are you up for good?
[04:42:31] i guess not!
[04:43:00] wb multichill ;)
[04:43:42] wtf
[04:45:13] Looks like I'm not getting anymore sleep
[04:56:27] well in the absence of an actual sysadmin, maybe i could have a couple guinea pigs send themselves password reminders for any list via the web UI?
[04:56:38] i just want to know if the form submits ok
[04:57:19] RoanKattouw: hey, you're the one with kinda root but don't really use it?
[04:57:25] New patchset: tstarling; "Disable wmerrors log_backtrace since it is buggy and causes segfaults." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2152
[04:57:52] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2152
[04:57:52] That's me
[04:57:57] New review: tstarling; "(no comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/2152
[04:58:36] where did the +2 option go?
[04:59:00] TimStarling: looks like you only did +1?
[04:59:13] that was the maximum
[04:59:22] someone changed your bits?
[04:59:23] Then you don't have sufficient rights on that repo
[04:59:38] I used to have +2 on the test branch, I don't have that anymore either
[04:59:42] can you poke sodium and see if it looks normal? also any idea if it's behind some kind of load balancing? (lvs/squid/varnish)
[05:00:05] I don't even know what sodium is or does
[05:00:29] Dmcdevit is having issues with the web UI (chrome saying "Error 324 (net::ERR_EMPTY_RESPONSE): The server closed the connection without sending any data")
[05:00:30] presumably it is a mistake
[05:00:42] RoanKattouw: mailman
[05:00:47] Ryan also said he'd accidentally kicked me out of the testlabs group
[05:01:13] he and I get the same IP with ping and it works fine for me
[05:01:44] Sodium responds to ping and ssh, and it seems to be idle according to top(1)
[05:02:42] Oh, hmm, OK
[05:02:56] New review: tstarling; "(no comment)" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/2152
[05:03:15] heh, No score
[05:04:06] lighttpd is running and I see nothing weird in its logs
[05:06:27] the fact that puppet gives mailman the star cert makes me think that it's not proxied
[05:13:00] !log since puppet is broken, disabled wmerrors backtrace logging by adding a separate configuration file in /etc/php5/conf.d and reloading apache
[05:13:02] Logged the message, Master
[05:13:33] puppet is broken or gerrit is?
[05:14:08] puppet is broken in that it uses a broken instance of gerrit
[05:14:16] which is broken in the sense that it won't let me do things on it
[05:23:39] !log the segfaults didn't stop, so I'm disabling wmerrors entirely for now
[05:23:40] Logged the message, Master
[05:24:47] guess what?
[05:25:38] they stopped?
[05:25:48] not giving me a lot of choices
[05:25:49] they didn't stop
[05:26:12] is this dsh or just one box or what?
[05:27:27] never mind, it looks like the wmerrors-related segfaults did stop
[05:27:35] we've just got other segfaults now
[05:27:51] and it's everywhere, not one box
[05:28:46] basically you get a fatal error and then a segfault as the process shuts down
[05:29:31] Jan 31 05:18:26 10.0.11.49 apache2[18807]: PHP Fatal error: Allowed memory size of 125829120 bytes exhausted (tried to allocate 72 bytes) in /usr/local/apache/common-local/php-1.18/includes/parser/Preprocessor_DOM.php on line 797
[05:29:31] Jan 31 05:18:26 10.0.11.49 apache2[5585]: [notice] child pid 18807 exit signal Segmentation fault (11)
[05:29:34] like that
[05:30:42] this is the same thing you enabled core dumps for the other day?
[05:30:48] yes
[05:31:12] but I don't have time to fix it right now so I'm trying to get the site into some approximate working order
[05:31:47] did someone disable core dumps on that box?
[05:32:00] i've not a clue
[05:32:20] so, that limit it's hitting is exactly 120MB fwiw
[05:32:30] actually I put it in a puppetized configuration file, and puppet is running again
[05:32:33] so it'll be reverted
[05:34:25] ok there's been no more segfaults since 05:24
[05:34:40] I only did a graceful restart so it was probably just a process that hadn't finished yet
[05:34:53] no more relevant segfaults, I should say
[05:42:38] PROBLEM - MySQL Replication Heartbeat on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:43:37] Change abandoned: tstarling; "superseded by configuration pushed out with dsh" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2152
[05:48:09] PROBLEM - Full LVS Snapshot on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:56:54] RECOVERY - Full LVS Snapshot on db42 is OK: OK no full LVM snapshot volumes
[05:57:04] PROBLEM - MySQL disk space on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:08:14] RECOVERY - MySQL disk space on db42 is OK: DISK OK
[06:34:26] !log added myself to the gerrit "administrators" group
[06:34:27] Logged the message, Master
[07:02:32] TimStarling: hm. you should be able to do everything in gerrit
[07:02:41] lemme make sure you are in the ops groups
[07:03:04] you're in the ops group...
[07:12:33] PROBLEM - MySQL Replication Heartbeat on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:16:51] Any server admins around?
[07:17:07] Got a tech question re pending changes
[07:17:31] Go ahead and ask
[07:17:36] So, if pending changes were to be disabled on new
[07:17:39] Enwp
[07:17:55] Would it make a mess of the logs?
[07:18:13] Read: disabled = extension deleted
[07:19:15] I'm looking at closing an RFC on the removal of pending changes on enwp
[07:19:41] Some think it's complete removal including the user right could make a mess
[07:19:45] hmm, I'm not sure
[07:19:52] You'd have to ask Aaron Schulz
[07:20:02] Hence I thought I should ask.
[07:20:05] Who is not on IRC right now, so I suggest you e-mail him
[07:20:18] What's their email?
[07:20:34] aschulz at wikimedia
[07:20:41] Ta
[07:20:48] :)
[07:22:53] RoanKattouw: which change were you guys trying to merge?
[07:23:17] I wasn't trying to do anything
[07:23:23] Tim was trying to change something in wmerrors I think
[07:23:26] I don't see one merged or still around
[07:23:45] When he mentioned he couldn't +2 anymore, I mentioned I had a similar issue a while ago but that you'd fixed it by putting me back into testlabs
[07:23:50] ah https://gerrit.wikimedia.org/r/2152
[07:24:00] that's the production branch
[07:24:02] he's in the ops group
[07:24:05] it should just work
[07:24:16] I'm betting it had a hiccup with ldap at that point
[07:24:24] I need to figure out how to point it to multiple LDAP servers
[07:24:39] damn documentation doesn't say how
[07:40:39] Ryan_Lane: run LDAP behind LVS?
[07:41:00] LDAP failover is client-side, not server-side
[07:41:08] i don't follow
[07:41:17] i'm just saying if gerrit doesn't have that option...
[07:41:20] clients automatically failover to another server
[07:41:24] I'm sure gerrit has it
[07:41:34] it's using LDAP libraries for java
[07:41:41] it's absolutely supported there
[07:42:09] worst case I just read the code and see how they are putting in the server lists
[07:42:44] i saw a place that had a custom service that would just check backends periodically and update iptables to change where requests went. i.e. all reqs went to the same place until a change was made and then all went to the new place
[07:43:02] o.O
[07:43:12] was called pmilb. poor mans iptables lb
[07:43:13] that's a terrible hack
[07:44:00] i noticed
[07:44:06] I guess it works when you have nothing else :D
[07:45:32] the thing that really sucked about it was that it only worked on traffic coming from other machines, not on traffic from the box where the iptables was. (maybe fixable i guess but i didn't really try) so services on the same box that used LDAP auth had to hard code a backend
[07:45:59] wait, this was for LDAP?
[07:46:04] * Ryan_Lane groans
[07:46:16] yes...
[07:46:25] I don't know of a single LDAP library that doesn't support client failover
[07:46:33] actually the backends where AD
[07:46:35] were*
[07:46:42] that doesn't matter
[07:46:49] of course not
[07:46:57] AD is just LDAP with a bastardized kerberos and a weird schema
[07:47:10] the main consumers were java and apache i think
[07:47:12] *weird proprietary schema
[07:47:21] and both support client failover
[07:47:38] * jeremyb has no answers
[07:47:40] heh
[07:47:57] someone needs to hit that admin with a cluestick :)
[08:23:18] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: Puppet has not run in the last 10 hours
[08:23:18] PROBLEM - Puppet freshness on lvs1003 is CRITICAL: Puppet has not run in the last 10 hours
[08:44:14] TimStarling: the wikitech page should probably mention that it runs under jetty. in particular I wonder how this part works: Gerrit shells out to GitWeb, a Perl application which is installed using the stock package.
[08:44:19] does it use jetty's CGI support or some other method?
[08:52:00] * jeremyb wonders if the prod apc is somehow shared between servers or each has it's own cache?
[08:52:06] * jeremyb sleeps
[08:54:12] That's wrong, it doesn't shell out
[08:54:16] afaict
[08:54:20] gitweb is just a web app afaict
[09:14:13] is there any reason we use svn+ssh:// to fetch the wmf branch on fenari ?
[09:14:34] It's for laziness, allows people to commit from fenari
[09:14:47] that also prevents me from fetching the change
[09:14:53] since my ssh key is not there :-)
[09:16:18] Can't you just forward your agent or put your key there?
[09:16:53] not really willing to share that private key to every WMF roots :D
[09:16:59] but I have a WMF ssh key on the cluster
[09:17:18] guess I can have it added to svn.wm.org
[09:17:43] In the meantime, would you like me to run svn up for you?
[09:17:52] PROBLEM - Puppet freshness on brewster is CRITICAL: Puppet has not run in the last 10 hours
[09:18:04] http://www.google.com/search?q=ssh+commit <-- we are the first link
[09:18:09] hashar: use SSH agent...
[09:18:49] lol
[09:19:06] RoanKattouw: that is r110001 , makes Badtitle error pages emits a 400 HTTP status code
[09:19:18] RoanKattouw: that was requested by the mobile team since it seems to break some mobile agents
[09:19:26] bug report : https://bugzilla.wikimedia.org/show_bug.cgi?id=33646
[09:19:33] rev: https://www.mediawiki.org/wiki/Special:Code/MediaWiki/110001
[09:19:48] You haven't actually merged it to 1.18wmf1
[09:20:00] saper: yeah I could use SSH agent forwarding. But our guru told me it was bad practice to forward a key :-D
[09:20:13] yeah I have also lost access to 1.18wmf1
[09:20:21] Oh, meh
[09:20:24] I'll merge it then
[09:20:31] * hashar wonders why I tried to find out how to sync when I can't even merge ...
[09:20:34] sorry roan :-(
[09:20:40] I am probably still asleep
[09:20:47] * RoanKattouw has been up since 5am
[09:21:22] been up for 20 hours yesterday, reached bed at 3am
[09:21:30] ouch
[09:21:40] daughter has teethes pain
[09:21:46] I was up for 24 hours on Saturday/Sunday
[09:22:03] at your age, you could probably stay up for 48 hours straight
[09:22:27] Well, I did discover what it takes for me to sleep on a plane on this trip
[09:22:54] I got on a plane at 11pm, flew for 13 hours, walked around an airport for two hours, got back on the plane, flew for another 2 hours, then slept for 3
[09:23:55] So I'd been up for, let's see ...
[09:24:19] have you slept during the 13hours fly?
[09:24:20] 9am to 4pm the next day, so that's 31 hours
[09:24:28] No, I didn't sleep during the first leg
[09:24:38] doh
[09:24:45] Tried to but couldn't
[09:25:06] I felt really tired around 2am, tried to sleep, then at 3:30am I felt wide awake and watched a movie
[09:25:34] Grr, someone's been touching 1.18wmf1 without syncing
[09:25:38] * RoanKattouw finds out who
[09:25:43] hashar: the agent does not send the key anywhere, it just responds to challenges. If PKCS#11 smartcard API is used, it even cannot hase a key
[09:27:47] saper: as I understand it, any root user on the server would be able to hijack my identity
[09:27:50] Reedy, of course
[09:27:50] !log catrope synchronized php-1.18/includes/Exception.php 'r110368'
[09:27:52] Logged the message, Master
[09:28:05] yeah Reedy is our merging / shell bug bot
[09:28:25] !log catrope synchronized php-1.18/includes/Wiki.php 'r110368'
[09:28:26] Logged the message, Master
[09:29:14] hashar: There you go ---^^
[09:29:38] $ curl -I 'http://en.wikipedia.org/wiki/%5B%5B'
[09:29:38] HTTP/1.0 400 Bad Request
[09:29:40] \o/
[09:31:22] PROBLEM - MySQL Replication Heartbeat on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:31:30] RoanKattouw: thanks a ton!
[09:46:52] PROBLEM - MySQL Slave Running on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:46:52] PROBLEM - Disk space on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:49:10] PROBLEM - DPKG on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:55:10] PROBLEM - Disk space on es1004 is CRITICAL: DISK CRITICAL - free space: /a 404454 MB (3% inode=99%):
[09:58:30] PROBLEM - MySQL disk space on es1004 is CRITICAL: DISK CRITICAL - free space: /a 383892 MB (3% inode=99%):
[10:01:07] RECOVERY - DPKG on db42 is OK: All packages OK
[10:01:07] RECOVERY - Disk space on db42 is OK: DISK OK
[10:01:07] RECOVERY - MySQL Slave Running on db42 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error:
[10:16:20] when is the editing from the current diff thing going to be fixed
[10:16:42] cause it is getting really tiresome
[10:18:20] What "thing"?
[10:18:23] I'm not aware of this bug
[10:18:48] You can't edit from the current diff window?
[10:18:51] Or at least I cannot
[10:19:01] link?
[10:19:11] http://en.wikipedia.org/w/index.php?title=User_talk%3ARyulong&action=historysubmit&diff=474158643&oldid=474158598
[10:19:15] no edit links anywhere
[10:19:28] and by "edit from the current diff window" I mean edit sections
[10:19:40] aaah
[10:19:43] there's a bugzilla report about it
[10:20:10] how was this something that the dev went "Hmm, no I don't think anyone wants that anymore"?
[10:20:45] it maight not have been removed deliberately
[10:22:00] PROBLEM - Puppet freshness on knsq9 is CRITICAL: Puppet has not run in the last 10 hours
[10:23:24] New review: Dzahn; "https://developer.mozilla.org/en/Mobile/Viewport_meta_tag" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2110
[10:23:24] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2110
[10:31:12] New review: Dzahn; "looks good and already got a +2, just not verified" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2109
[10:31:13] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2109
[11:08:43] PROBLEM - MySQL Replication Heartbeat on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:11:23] RECOVERY - MySQL slave status on es1004 is OK: OK:
[11:11:53] PROBLEM - MySQL Idle Transactions on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:12:43] PROBLEM - MySQL Slave Running on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:12:43] PROBLEM - Disk space on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:13:23] PROBLEM - mysqld processes on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:22:43] PROBLEM - RAID on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:25:43] PROBLEM - Full LVS Snapshot on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:25:53] PROBLEM - MySQL disk space on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:27:23] PROBLEM - DPKG on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:33:23] PROBLEM - MySQL Recent Restart on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:50:08] New patchset: Tim Starling; "Disable wmerrors since it is buggy and causes segfaults." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2154
[11:50:26] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2154
[11:52:20] New review: Tim Starling; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2154
[11:52:26] New review: Tim Starling; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2154
[11:52:27] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2154
[12:14:59] PROBLEM - Puppet freshness on mw8 is CRITICAL: Puppet has not run in the last 10 hours
[12:20:47] TimStarling: the wikitech page should probably mention that it runs under jetty. in particular I wonder how this part works: Gerrit shells out to GitWeb, a Perl application which is installed using the stock package.
[12:20:49] That's wrong, it doesn't shell out
[12:20:55] final Process proc =
[12:20:56] Runtime.getRuntime().exec(new String[] {gitwebCgi.getAbsolutePath()},
[12:20:56] makeEnv(req, project), repo.getDirectory());
[12:21:07] looks like shelling out to me
[12:22:43] this is GitWebServlet.java
[12:23:06] Ah, so it does
[12:23:16] I thought gitweb had its own web frontend, guess not
[12:23:58] gitweb is a perl web app
[12:24:25] gerrit is pretending to be a regular web server
[12:24:31] Oh, that's right
[12:24:38] shelling out to perl with the environment set up like ordinary CGI
[12:24:38] Gerrit impersonates every server type on the planet
[12:24:53] Or at least HTTP, SSH and git
[12:57:20] RECOVERY - mysqld processes on db42 is OK: PROCS OK: 1 process with command name mysqld
[12:57:20] RECOVERY - DPKG on db42 is OK: All packages OK
[12:57:50] RECOVERY - MySQL Slave Running on db42 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error:
[12:59:30] RECOVERY - MySQL disk space on db42 is OK: DISK OK
[12:59:30] RECOVERY - Full LVS Snapshot on db42 is OK: OK no full LVM snapshot volumes
[13:01:20] RECOVERY - MySQL Idle Transactions on db42 is OK: OK longest blocking idle transaction sleeps for 0 seconds
[13:03:10] RECOVERY - RAID on db42 is OK: OK: State is Optimal, checked 2 logical device(s)
[13:09:00] RECOVERY - Disk space on db42 is OK: DISK OK
[13:14:20] RECOVERY - MySQL Recent Restart on db42 is OK: OK 3416479 seconds since restart
[13:45:50] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:50:10] PROBLEM - MySQL Replication Heartbeat on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:30:01] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:35:11] PROBLEM - MySQL Replication Heartbeat on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:58:11] PROBLEM - MySQL Replication Heartbeat on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:02:43] saper: are you on?
[15:06:08] mutante: yes
[15:06:18] hallochen
[15:06:22] !log changed nameservers for wikimedia.pl per [[RT:2277]]/[[bugzilla:33509]]
[15:06:24] Logged the message, Master
[15:06:31] <<-- :)
[15:06:42] saper: i _just_ hit save
[15:06:55] great
[15:07:19] gerrit is taking over the world
[15:07:21] https://github.com/openstack-ci/git-review/pull/3#issuecomment-3740052
[15:07:41] my pull request to github was dismissed automatically and force me to use … gerrit!! :D
[15:07:50] saper: fns1.42.pl 79.98.145.34 / fns2.42.pl 195.80.237.194
[15:09:00] perfect
[15:09:14] 86400 seconds are cached anyway
[15:09:50] mutante: do they accept IPv6 glues?
[15:10:01] fns2.42.pl is 2a02:2978::a503:4209:2
[15:11:48] saper: "Wrong format of Ip Address!" :/ ..i'll try to find out
[15:12:08] don't worry
[15:12:12] it's in a different domain
[15:12:16] ok
[15:12:23] so glue is not really necessary
[15:12:35] alright
[15:12:36] you might ask them or change registrars
[15:13:01] i'll bring it up, might be something for IPv6 enable day or so
[15:13:22] good test when looking for registrars - do they support IPv6 glues and DNSSEC
[15:13:37] * mutante nods
[15:17:34] sixxs has a list http://www.sixxs.net/faq/dns/?faq=ipv6glue
[15:20:58] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 4.968 seconds
[15:21:08] RECOVERY - Puppet freshness on brewster is OK: puppet ran at Tue Jan 31 15:20:53 UTC 2012
[15:28:15] mutante: maybe not up to date as many things there unfortunately
[15:41:55] Dmcdevit: still reliably broken?
[15:59:38] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.373 seconds
[16:03:18] PROBLEM - MySQL Replication Heartbeat on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:03:48] PROBLEM - Host ekrem is DOWN: CRITICAL - Host Unreachable (208.80.152.178)
[16:13:48] is the RC feed down?
[16:15:17] probably a TS issue
[16:15:19] Thehelpfulone: you mean the wmf ircd?
[16:15:56] hrmm, ekrem is dead? is that still the WAP proxy? can't remember what else
[16:16:12] Thehelpfulone: http://en.wikipedia.org/w/index.php?title=Special:RecentChanges&feed=rss ? worksforme ?
[16:16:22] yes
[16:16:26] sorry the IRC one
[16:16:27] mutante, the RC to IRC bridge
[16:16:29] Oh, that'd be the IRC box, yeah
[16:17:12] 16:20 <+nagios-wm> RECOVERY - HTTP on ekrem is OK:
[16:17:13] hmm
[16:17:15] oh, ekrem is also IRC? i forgot
[16:17:29] mutante: whole host down is more recent
[16:17:39] !log ekrem suddenly died around 16:03 UTC, breaking the RC IRC feed
[16:17:41] Logged the message, Mr. Obvious
[16:17:57] mark: ----^^
[16:19:19] connecting to ekrem mgmt
[16:19:46] last output i can see is just: * Stopping web
[16:20:06] frozen, will powercycle
[16:21:02] !log powercycling ekrem - mgmt just showed "Stopping web" and was frozen completely
[16:21:03] Logged the message, Master
[16:22:23] ..recovering journal..
[16:24:18] /-\|/-\|/-\
[16:24:32] ekrem login:
[16:24:47] enjoy this hold music while we replay the journal
[16:25:29] whats the process you need for the IRC bridge? checking
[16:26:22] does it not just fix itself on boot?
[16:26:28] RECOVERY - Host ekrem is UP: PING OK - Packet loss = 0%, RTA = 0.17 ms
[16:27:09] jeremyb: well, docs say "you will want to start up the bot if it is not running already. :)
[16:27:20] how useful
[16:27:28] It should start itself
[16:27:30] ircd
[16:27:51] At least according to puppet, ircd should start automatically
[16:28:07] !log ekrem - su -c /usr/local/ircd-ratbox/bin/ircd irc
[16:28:09] Logged the message, Master
[16:28:11] Nikerabbit you about?
[16:28:58] RoanKattouw: umpf, it cant run puppet :/
[16:29:21] Well ircd seems to be running
[16:29:25] good
[16:29:42] The IRC server let me connec
[16:30:07] !log reedy synchronized php-1.18/extensions/SpamBlacklist/SpamBlacklist_body.php 'r110401'
[16:30:08] !log ekrem - gets Error 500 on SERVER when running puppet
[16:30:08] Logged the message, Master
[16:30:10] Logged the message, Master
[16:31:11] However, the bridge is not running
[16:32:06] which is weird
[16:32:19] I mean the relay does seem to be running, it's just not working
[16:32:27] killed it again, started again (because it says to start AFTER ircd runs (of course)
[16:32:30] hmm
[16:32:48] i looked at http://wikitech.wikimedia.org/view/IRC#Starting_the_bot
[16:32:53] aah
[16:33:01] I guess that dependency isn't in puppet
[16:33:13] well, puppet run breaks with Error 500 here
[16:33:35] That shouldn't matter, puppet should have installed the init scripts a long time ago, right?
[16:33:38] Hm, or maybe not
[16:33:48] Well the relay thingy is running
[16:33:52] cool
[16:35:14] !log restarted IRC bot on ekrem (needs dependency to start after ircd)
[16:35:16] Logged the message, Master
[16:36:48] PROBLEM - MySQL Replication Heartbeat on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:37:06] the channels themselves still are down
[16:37:21] I can't join #en.wikipedia for example
[16:38:31] Yeah
[16:38:41] The IRC server is up, but the MW->IRC bridge isn't working
[16:39:21] hrmm, sigh, this is about where docs are running out
[16:39:22] Oh, hah
[16:39:33] There are TWO udpmwircecho processes
[16:39:41] aah
[16:39:51] PIDs 1362 and 1668
[16:39:51] still? supposed to be killed
[16:40:13] i see ,yeah
[16:40:32] are you killing one?
[16:42:39] irc://irc.wikimedia.org isn't working normally
[16:43:09] I can connect, but none of the channels are join-able
[16:43:25] Yeah, we know
[16:43:26] pir^2: we're on it
[16:43:37] The IRC server died and was restarted, but now the wiki->IRC bridge isn't working
[16:44:02] so now there are 0 udpmwirc
[16:45:27] No, I still see 2
[16:45:31] The same PIDs
[16:45:43] ( ps aux | grep irc )
[16:46:51] Thehelpfulone: yes
[16:47:10] great question re: translation extension
[16:47:12] for {{Special:LanguageStats/nl|x=D}}
[16:47:21] is there an equivalent for {{Special:MessageGroupStats|x=D}} ?
[16:47:35] yes
[16:47:46] Thehelpfulone: see https://www.mediawiki.org/wiki/Help:Extension:Translate/Statistics_and_reporting
[16:48:05] I mean to put on a user page or main space page
[16:48:20] ooh
[16:49:16] althoug if the group has spaces in the name, it wont work on wmf cluster yet, the fix is not yet tehre
[16:49:39] ah :S
[16:49:51] yes it's {{Special:MessageGroupStats/page-Wikimania 2012/CentralNotice}}
[16:52:15] hey guys,please try to join channels again
[16:52:33] JOIN #es.wikipedia
[16:52:34] No such channel
[16:52:50] !log 17:52 [Users #en.wikipedia]
[16:52:50] 17:52 -!- Irssi: #en.wikipedia: Total of 0 nicks [0 ops, 0 halfops, 0 voices, 0 normal]
[16:52:52] Logged the message, Master
[16:53:00] arg, didnt want to log that:)
[16:53:09] mutante: i even /quit and still can't get in
[16:53:47] hmm, weird. but i am on the channel
[16:54:58] mutante: tried from off cluster?
[16:55:08] yes
[16:55:48] but i cant repeat it after quitting ....
[16:55:51] foo
[16:56:08] PROBLEM - MySQL Idle Transactions on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:56:31] * jeremyb runs away
[16:56:49] !log restarted ircd on ekrem once again because we still cant join channels .. problem remains
[16:56:51] Logged the message, Master
[17:03:08] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:07:51] mutante: any ideas?
[17:08:35] jeremyb: not really, therefore started to escalate to other ops as well
[17:09:13] well, brb
[17:09:21] jeremyb: it's not the ircd itself
[17:10:01] mutante: i got that ;)
[17:14:18] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 8.954 seconds
[17:16:49] mutante: did you find out how to bring back rc bot?
[17:19:22] saper: not really. kicked it, made sure its not running twice, but not yet..
[17:19:35] strange
[17:21:16] saper: ah, im being told it takes a while to start.. there might still be a chance..
[17:29:58] RECOVERY - MySQL Idle Transactions on db42 is OK: OK longest blocking idle transaction sleeps for 0 seconds
[17:30:18] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:48:18] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:55:39] PROBLEM - MySQL Replication Heartbeat on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:32:11] !log catrope synchronized wmf-config/InitialiseSettings.php 'Change wgRC2UDPAddress to the new ekrem IP'
[18:32:13] Logged the message, Master
[18:33:38] PROBLEM - Puppet freshness on lvs1003 is CRITICAL: Puppet has not run in the last 10 hours
[18:33:38] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: Puppet has not run in the last 10 hours
[18:34:29] irc.wikimedia.org is FIXED, folks
[18:34:31] Yay!
[18:34:39] saper: :) see above
[18:36:48] !log IRC breakage postmortem: MediaWiki was configured to send UDP packets to .179 (ekrem-old) instead of .178 (ekrem)
[18:36:49] Logged the message, Mr. Obvious
[18:37:07] mr obvious
[18:37:17] well #wikipedia-nl-vandalism works again :DD
[18:37:22] wc
[18:38:12] #en.wikipedia is very active, ack:)
[18:51:44] mutante: my bots restarted, let's see
[18:52:18] PROBLEM - MySQL Replication Heartbeat on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:53:25] Aaron|away: Poke [18:56:18] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.841 seconds [19:16:00] New patchset: Pyoungmeister; "adding support for other paging groups into our nagios. testing with the udp2log alerts. maybe also breaking nagios ;)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2155 [19:26:43] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2155 [19:26:44] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2155 [19:30:18] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:39:36] New patchset: Diederik; "Initial commit, feedback Catrope incorporated, feedback Tim Starling incorporated" [analytics/udp-filters] (master) - https://gerrit.wikimedia.org/r/2142 [19:46:28] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused [19:48:38] PROBLEM - MySQL Replication Heartbeat on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:57:28] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.442 seconds [20:25:49] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.274 second response time [20:32:39] PROBLEM - Puppet freshness on knsq9 is CRITICAL: Puppet has not run in the last 10 hours [20:38:02] AaronSchulz: Ping? 
[20:38:28] * SteveMobile hopes you're around, just the guy I'm looking fr [20:39:16] * RoanKattouw points out it's lunchtime at the office [20:39:31] it's usually more effective to just post the question :) [20:40:41] AaronSchulz: I posted yesterday, was told I'd need to ask you specifically :) [20:40:56] but he might have answered by now [20:41:16] Ok, so there's an RFC at enwp about the deletion/removal of pending changes [20:41:19] 31 07:19:41 < Steven_Zhang> Some think it's complete removal including the user right could make a mess [20:41:19] Complete removal [20:41:24] Ya [20:41:26] 31 07:19:15 < Steven_Zhang> I'm looking at closing an RFC on the removal of pending changes on enwp [20:41:59] So I wonder if it's removal will cause any issues on a technical level [20:42:08] Ya know, if the extension was deleted [20:42:30] e.g. will logs be available of PC stuff that happened while PC was on [20:42:34] the only issue would be existing log entries [20:42:55] What would happen to them? [20:43:02] similar to the removal of Makebot...i18n would be need for backwards compatibility [20:43:08] *be needed [20:43:19] otherwise it should be fine [20:43:21] Make not? [20:43:26] *bot [20:43:48] make bot! [20:43:55] on old extension we don't use anymore [20:44:55] Hmm, ok, so I say, community says to can the reviewer user right completely [20:45:04] You say "ok, no worries"? 
[20:46:48] fuck it someone already closed it [20:48:05] snooze you looze :p [20:49:49] it was a non admin [20:49:54] bah they got it wrong [20:49:56] -.- [20:50:02] * Steven_Zhang goes off to berate them :P [20:55:09] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.463 seconds [21:14:49] PROBLEM - udp2log log age on emery is CRITICAL: CRITICAL: log files /fake.log, have not been written to in 6 hours [21:14:59] PROBLEM - udp2log processes on emery is CRITICAL: CRITICAL: filters absent: /var/local/fakefilter, [21:38:55] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:38:59] New patchset: Pyoungmeister; "adding nimish and erikz to alert group for udp2log checks" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2158 [21:39:28] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2158 [21:39:28] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2158 [21:39:42] !log awjrichards synchronized php/extensions/DonationInterface/payflowpro_gateway/payflowpro_gateway.alias.php [21:39:44] Logged the message, Master [21:40:31] !log awjrichards synchronized php/extensions/DonationInterface/payflowpro_gateway/payflowpro_gateway.i18n.php 'r110433' [21:40:32] Logged the message, Master [21:41:05] !log awjrichards synchronized php/extensions/DonationInterface/globalcollect_gateway/globalcollect_gateway.alias.php 'r110433' [21:41:07] Logged the message, Master [21:41:53] !log awjrichards synchronized php/extensions/DonationInterface/globalcollect_gateway/globalcollect_gateway.i18n.php 'r110433' [21:41:54] Logged the message, Master [21:43:17] !log awjrichards synchronized php/extensions/DonationInterface/gateway_common/countries.i18n.php 'r110433' [21:43:19] Logged the message, Master [21:44:17] !log awjrichards synchronized php/extensions/DonationInterface/gateway_common/interface.i18n.php 
'r110433' [21:44:18] Logged the message, Master [21:44:43] !log awjrichards synchronized php/extensions/DonationInterface/gateway_common/us-states.i18n.php 'r110433' [21:44:44] Logged the message, Master [21:45:31] binasher: mysql doesn't offer any way to nice a query/process, right? [21:45:53] (after it's started so you can't e.g. add a DELAYED) [21:47:09] AFAIK not, no [21:47:45] RECOVERY - udp2log log age on emery is OK: OK: all log files active [21:48:04] domas: ^ ? [21:48:41] mutante: what ever happened with ekrem don't want to run puppet? [21:50:25] RECOVERY - udp2log processes on emery is OK: OK: all filters present [21:53:01] what is 'DELAYED'? [21:53:48] domas: http://dev.mysql.com/doc/refman/5.5/en/insert-delayed.html ?! [21:54:24] :-) if I ask a stupid-sounding question, there's always a deeper meaning to it [21:54:28] no need to paste me a manual [21:54:33] :-D [21:55:04] :P [21:56:53] domas: ocwiki seems to be fixed now, btw :P [21:57:18] domas: anyway, so there's no such thing as nice, right? [21:57:26] domas: (or renice) [21:57:44] jeremyb: the resource that mysql queries are usually competing for is I/O [21:57:58] I/O doesn't have such perfect granularity as CPU [21:58:31] at FB we have extension that allows to provide concurrency limits on per-account basis [21:58:38] domas: i'm asking so that i can quote you when replying ;) http://lists.wikimedia.org/pipermail/toolserver-l/2012-January/004690.html [21:59:04] anyway, with replication [21:59:07] it has to compete for I/O [21:59:14] and priorities don't really work that well on I/O [21:59:33] unless, of course, you queue in OS and not at device level, what lowers throughput significantly [22:01:05] domas: You're working at FB? :P [22:01:06] domas: so, can i quote you? 
[22:01:09] hoo: yes [22:01:22] hoo: as for ocwiki, they caused quite some drama [22:01:32] there was some broken telephone at multiple levels [22:01:46] hah, i heard the brits call it chinese whispers [22:01:49] which is why now foundation sees me as some huge liability causing projects to fork by being an asshole, or something, and is willing to put me into a prison [22:02:28] ... [22:02:30] someone have a summary write up of this ocwiki thing? [22:03:29] jeremyb: it is difficult to provide a decent summary, as it is way more about someone's insulted emotions [22:03:35] than anything else [22:03:53] domas: i mean about what was lost in translation [22:04:07] domas: what did ppl want you to do and what did you actually do? [22:04:20] I deleted some templates that were causing mediawiki to OOM [22:04:27] I didn't write highest quality deletion messages though :-) [22:04:35] hah [22:04:42] if you read them in right order, there's nothing wrong in them [22:04:53] if you read them in opposite order, story is different [22:04:55] :) [22:05:09] http://oc.wikipedia.org/w/index.php?title=Especial%3AJornal&type=&user=midom&page=&year=&month=-1 [22:06:00] this grew to an incident where people at WMF decided it is enough to threaten me with loss of root access ;-) [22:06:13] southern france [22:06:25] domas: srsly? [22:06:32] sounds like something that you didn't evne use root to do [22:06:49] * jeremyb adds uselang to that link ;) [22:07:10] anyway, each of those templates was taking hundreds of megabytes of mediawiki memory to render [22:07:19] doesn't work well, when 10k pages on their site had 25 of these templates embedded [22:07:23] essentially whole wiki was broken [22:07:28] :) [22:07:44] even* [22:08:26] oh well [22:10:17] ugh, i don't even want to know what that template does [22:10:26] puts a tiny demographic growth chart [22:10:33] not for human consumption [22:10:56] it stored the population of the given zip codes, no? [22:11:31] ahem, wikidata anyone? 
[22:11:34] well, yeah, templates store all sorts of information about zip codes [22:11:40] jeremyb: THATS WHAT I SAID IN DELETE MESSAGES! [22:12:07] 00:07, 23 January 2012 Midom (Talk | contribs) deleted "Modèl:Popfr1968" ‎ (we need wikidata) [22:12:12] but apparently that was stepping over every possible line! [22:12:14] very insulting! [22:12:21] "templates are not databases" [22:12:23] asshole! [22:12:35] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 8.384 seconds [22:12:49] domas: You can hardly imagine how easy it is to create drama... [22:12:59] especially with global rights [22:13:59] hoo: I guess people on enwiki or dewiki have broad enough rights that they don't care about this "omg global power" [22:14:11] they understand, that if they screw up, they screw up [22:14:28] on a tiny wiki it is all about being oppressed by whomever is there out that can oppress [22:14:31] I already fixed CSS on dewiki and nobody minded... [22:14:44] but I got blocks and threats on small wikis... [22:15:10] (it was a syntax error breaking the whole style sheet) [22:15:10] :-) [22:15:28] you should've involved volunteer coordination community management committee [22:16:39] We got a urgent issues, let's discuss whether we are allowed to fix it and who we need to talk to before... [22:16:45] - a [22:17:28] * jeremyb waves a swalling [22:17:39] * StevenW waves back [22:17:55] TimStarling: is ok to backport & sync the cloudfiles code to get it out there? [22:18:24] what cloudfiles code? [22:18:32] when does /a/a2 go live again? 
[22:18:41] the one we reviewed, /trunk/extensions/SwiftCloudFiles [22:20:30] yes, that can go live [22:26:51] New review: Tim Starling; "(no comment)" [analytics/udp-filters] (master) C: 0; - https://gerrit.wikimedia.org/r/2142 [22:27:56] good night ;) [22:30:23] PROBLEM - check_minfraud2 on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:30:23] PROBLEM - check_minfraud2 on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:30:23] PROBLEM - check_minfraud3 on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:30:23] PROBLEM - check_minfraud3 on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:30:31] * AaronSchulz wonders for how many ages srv256 complains about keys [22:30:53] a few eons [22:31:32] then it is not that urgend [22:31:41] * AaronSchulz wonders why sync gives sudo prompts [22:32:33] PROBLEM - Puppet freshness on mw8 is CRITICAL: Puppet has not run in the last 10 hours [22:35:23] RECOVERY - check_minfraud2 on payments1 is OK: HTTP OK: HTTP/1.1 200 OK - 8643 bytes in 0.253 second response time [22:35:23] RECOVERY - check_minfraud3 on payments3 is OK: HTTP OK: HTTP/1.1 200 OK - 8644 bytes in 0.225 second response time [22:35:23] RECOVERY - check_minfraud2 on payments4 is OK: HTTP OK: HTTP/1.1 200 OK - 8643 bytes in 0.219 second response time [22:35:23] RECOVERY - check_minfraud3 on payments2 is OK: HTTP OK: HTTP/1.1 200 OK - 8644 bytes in 0.269 second response time [22:39:13] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:47:25] AaronSchulz: you can wonder together with apergos iirc [22:49:42] gn8 folks [22:56:19] any ops around? [22:56:35] yesterday's query about Dmcdevit's problems with mailman still stands [22:56:58] GET seems fine but when he submits a form it gives him an error in chrome [22:57:14] 31 22:53:38 < Dmcdevit> Yep, same. 
"Error 324 (net::ERR_EMPTY_RESPONSE): The server closed the connection without sending any data." [22:57:32] Dmcdevit: how many internet connections have you tried? [22:57:43] I can't submit a confirmation string, get a password reminder, or access archives. [22:57:45] when he edits enwikip with ssl it works fine [22:57:54] Just this one. [22:58:08] (no error and the edit shows up) [22:58:27] mark, Ryan_Lane, mutante? [22:59:18] how does enwiki have anything at all to do with mailman? [22:59:36] there were a couple of red herrings in there [22:59:38] just a way to diagnose whether he can do a POST over ssl at all [22:59:48] ahhhh. ok [23:00:03] and specifically that he can do so to pmtpa [23:00:09] it's probably not a client-side or network issue [23:00:29] (oh, and it works fine for me yesterday and today) [23:00:37] well, assuming Dmcdevit is not in mainland china [23:00:46] Boston, MA [23:01:08] anwyay, it's a mystery [23:01:18] to me [23:01:23] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.728 seconds [23:01:26] you know the GFW will send forged RST packets after sniffing inappropriate URLs [23:01:40] ew [23:01:40] probably not the problem ;) [23:01:42] Not quite China, just the People's Republic of Massachusetts. ;-) [23:01:46] =) [23:02:03] back in ~10 mins [23:07:42] Dmcdevit: does it give you the error immediately, or after a delay? [23:08:02] After about a minute of trying to load the page. [23:09:26] New patchset: Diederik; "Initial commit, feedback Catrope incorporated, feedback Tim Starling (2x) incorporated" [analytics/udp-filters] (master) - https://gerrit.wikimedia.org/r/2142 [23:12:47] which page? [23:13:54] probably https://lists.wikimedia.org/mailman/private/wikipedia_ambassadors_sc/ [23:14:12] lighttpd error.log doesn't give IP addresses or anything useful like that [23:14:36] Yeah, that was the page. 
[23:14:42] there are a few SSL connection setup errors, maybe if we had a timestamp then we could correlate with that [23:15:11] it seems unlikely that it's an MTU issue given the small size of the form [23:17:34] TimStarling: he's on a mac so he could tcpdump... [23:17:54] also, i realized while walking over here: i haven't actually tested chrome [23:18:06] maybe someone has chrome handy [23:18:14] kind of hard to debug SSL with tcpdump [23:18:17] (Dmcdevit=chrome) [23:18:26] TimStarling: to see if it's mtu i mean [23:18:56] I guess the outgoing packet size would be useful information, it would let us rule it out [23:19:16] maybe there's a monster 1000 byte cookie or something [23:19:23] heh [23:19:54] jeremyb, I have no problem with chromium to lists.wikimedia.org [23:20:02] Platonides: did you POST? [23:20:10] 31 22:56:58 < jeremyb> GET seems fine but when he submits a form it gives him an error in chrome [23:20:12] well, I'm not in that ml, so I can only try a dummy password [23:20:17] but with https://lists.wikimedia.org/mailman/private/otrs-permissions-l/ worked fine [23:20:23] (and yes, that's POST) [23:20:29] Platonides: just try to give yourself a passwd reset on any list [23:20:39] (reminder rather) [23:21:30] https://lists.wikimedia.org/mailman/private/otrs-permissions-l/ worked [23:21:50] * jeremyb saw [23:22:01] errors are fast, too [23:22:28] well it's not an HTTP error, it's an error from chrome [23:22:43] chrome reports an error from the server [23:23:29] I wonder if that message means that not even the ssl handshake worked [23:23:59] which would be silly given that HTTPS GET work for him :P [23:26:28] you could probably work that out from tcpdump packet counts [23:27:10] i was thinking he could maybe use a combination of s_client + hosts file to get better tcpdump? [23:27:38] anyway, back in ~4 hrs unless i get on from the bar. 
off to meet about wikimania takes manhattan with pharos and figment [23:31:03] Dmcdevit: try now [23:32:36] No difference, it seems. (Page still loading...) [23:33:29] just let it load until it hits that error [23:34:23] Yeah, same error. [23:34:31] Dmcdevit: Are there any proxies/web filters between you and the web? [23:34:57] Not as far as I know. I'm just in my apartment. [23:37:28] http://paste.tstarling.com/p/jiTgyH.html [23:38:27] no reset packet apparently [23:39:28] oh, F=FIN [23:40:37] so the server closed its side of the first connection at 23:30:23.918929 [23:41:08] and it took until 23:30:27.342561 for the other side to be shut down [23:41:25] then 10 seconds later the second connection started, maybe a browser retry [23:50:10] at 23:30:17 it was just a GET, then I think we're actually seeing a keepalive timeout [23:53:51] so I suppose at 23:30:37 the POST starts, but it's not in access.log [23:54:06] no relevant SSL setup errors [23:54:22] Dmcdevit: maybe you should try a different browser and/or internet connection [23:56:16] Oh! It does work in Safari. [23:56:33] No one ever suggested it might be browser-related, and I never tried. [23:59:43] New review: Tim Starling; "(no comment)" [analytics/udp-filters] (master) C: 0; - https://gerrit.wikimedia.org/r/2142