[00:00:18] thanks Coren [00:00:54] I see at least three different things crashing at different points, the only common point seem to be the memory thing. If I had to guess, those are bits of code that don't guard against malloc() returning NULL properly when resources are scarce then using the pointer. [00:01:35] aha, https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep [00:02:00] yeah, that [00:03:43] related? [00:03:43] 19:02 < ireas> hi there. it seems impossible to edit [[:de:Benutzer Diskussion:Emergency doc]], error: https://tools.wmflabs.org/paste/view/a811e165 [00:03:47] 19:02 < ireas> (database error) [00:03:49] 19:02 < ireas> other pages work well [00:04:07] PROBLEM - MySQL InnoDB on db1040 is CRITICAL: CRIT longest blocking idle transaction sleeps for 2417 seconds [00:04:47] !log maxsem scap aborted: MobileApp deployment (duration: 06m 33s) [00:04:53] Logged the message, Master [00:05:03] crap. srv291: Permission denied (publickey). [00:05:07] RECOVERY - MySQL InnoDB on db1040 is OK: OK longest blocking idle transaction sleeps for 0 seconds [00:05:33] <^demon> I always get pubkey denied for crap in tampa. [00:05:38] <^demon> hume, fenari, srv291... [00:05:57] haven't seen this before [00:06:14] !log maxsem started scap: MobileApp deployment [00:06:16] RECOVERY - Apache HTTP on mw1206 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.063 second response time [00:06:22] Logged the message, Master [00:06:37] I shouldn't be that nervous - just usual scap errors... [00:07:31] After earlier I don't blame you [00:07:37] * MaxSem prays for death of pmtpa [00:09:06] PROBLEM - MySQL Idle Transactions on db1058 is CRITICAL: CRIT longest blocking idle transaction sleeps for 2730 seconds [00:09:07] PROBLEM - MySQL InnoDB on db1058 is CRITICAL: CRIT longest blocking idle transaction sleeps for 2732 seconds [00:09:54] srv291 has had repeated issues, see it in SAL [00:10:02] wasn't only it [00:10:07] those boxes that just cause trouble, should just be removed from dsh [00:10:12] and decom'ed [00:10:24] ok [00:11:32] you can just hit enter, right [00:11:35] to continue [00:16:43] (03CR) 10TTO: "???? The bug report refers to frwikiVERSITY not frwikTIONARY!" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/112582 (owner: 10Jforrester) [00:17:16] (03PS3) 10Ebe123: Add namespace aliases for ang.wikipedia and wiktionary [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/112409 [00:17:31] RECOVERY - MySQL Idle Transactions on db1058 is OK: OK longest blocking idle transaction sleeps for 0 seconds [00:17:31] RECOVERY - MySQL InnoDB on db1058 is OK: OK longest blocking idle transaction sleeps for 0 seconds [00:19:20] (03CR) 10PiRSquared17: [C: 031] Add namespace aliases for ang.wikipedia and wiktionary [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/112409 (owner: 10Ebe123) [00:23:07] mutante: is your scap still going? [00:23:12] er MaxSem [00:23:24] yep [00:23:37] I have to go in about 5ish minutes [00:23:42] (03CR) 10Espreon: [C: 031] "Looks fine to me." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/112409 (owner: 10Ebe123) [00:23:48] just fyi [00:24:03] ok [00:24:30] but things are looking better this time anyways :) [00:24:46] No random database failure at least. [00:25:17] * greg-g knocks on wood [00:29:01] ori, looks like the scap is using the same proxy, mw1010 for all hosts [00:29:40] interesting, will look in a moment [00:29:43] bd808: ^ [00:30:29] Problem with the ping script again? 
[00:34:30] (03PS2) 10Ori.livneh: scap: ensure local copy is up-to-date; chgrp wikidev [operations/puppet] - 10https://gerrit.wikimedia.org/r/112503 [00:34:34] !log maxsem finished scap: MobileApp deployment (duration: 28m 19s) [00:34:42] Logged the message, Master [00:36:42] (03CR) 10Dzahn: [C: 031] "thanks for confirming, yea, that filter was already removed in one of Matanya's changes, so it doesn't get removed here anymore after reba" [operations/puppet] - 10https://gerrit.wikimedia.org/r/112587 (owner: 10Dzahn) [00:38:14] okay, congratulate me everyone: I scapped without pulling the deploying change [00:38:49] (03PS1) 10MaxSem: Revert "Deploy MobileApp on test and test2" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/112609 [00:38:58] (03CR) 10MaxSem: [C: 032] Revert "Deploy MobileApp on test and test2" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/112609 (owner: 10MaxSem) [00:39:00] MaxSem: high-five [00:39:04] what, wait, revert?:) [00:39:05] (03Merged) 10jenkins-bot: Revert "Deploy MobileApp on test and test2" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/112609 (owner: 10MaxSem) [00:39:15] not going to reattempt scap so late [00:39:45] oh, i didnt even understand it right first [00:39:52] i thought you meant it worked without having to revert [00:39:57] gotcha now [00:40:31] pffffffft. [00:40:42] ori: I think I see the bug in the scap.py file. Bad quoting / "interesting" behavior in scap-1 [00:40:51] yes, it uses $1 [00:40:59] so the scap proxies should be a single quoted arg [00:41:12] Yeah and utils.dsh('/usr/local/bin/scap-1 %s' % ' '.join(scap_proxies) [00:41:34] doesn't quote so only first server is seen in ping test [00:43:45] greg-g, I'm afraid that if I continue I will just make more mistakes [00:45:50] greg-g, thus I call this deployment failed and will reschedule it to a later date [00:46:13] (03CR) 10GWicke: [C: 031] "parsoid.py is no longer needed, so the removal was intentional. Forgot to remove the resource too, thanks for cleaning that up." [operations/puppet] - 10https://gerrit.wikimedia.org/r/112004 (owner: 10Alexandros Kosiaris) [00:46:26] greg-g: https://gerrit.wikimedia.org/r/#/c/112612/ explains the load spike on mw1010 earlier today [00:46:35] the cluster now has the extension but it's not enabled so it's safe for me to sleep [00:49:03] (03PS2) 10Alexandros Kosiaris: Remove parsoid.py file resource from deployment [operations/puppet] - 10https://gerrit.wikimedia.org/r/112004 [00:49:34] <^demon> Gerrit reboot in ~10 minutes. [00:49:42] hurries [00:56:17] <^demon> Gerrit upgrade done. [00:56:25] <^demon> !log gerrit upgraded from 2.8.1 stable to 2.8.1-1-g83098d0 (custom build) to work around mysql issue pending upstream release. 
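A minimal sketch of the quoting bug bd808 describes above. The utils.dsh call and scap_proxies are names taken from the conversation; the hostnames and the quote()-based fix are illustrative assumptions, not the actual patch referenced in change 112612.

```python
try:
    from shlex import quote   # Python 3
except ImportError:
    from pipes import quote   # Python 2

# Hypothetical proxy list; mw1010 is the host mentioned above.
scap_proxies = ['mw1010.eqiad.wmnet', 'mw1011.eqiad.wmnet']

# As described above, each proxy becomes its own shell word, so a helper
# script that only reads "$1" sees just the first proxy:
broken = '/usr/local/bin/scap-1 %s' % ' '.join(scap_proxies)
# -> /usr/local/bin/scap-1 mw1010.eqiad.wmnet mw1011.eqiad.wmnet

# Quoting the joined list delivers the whole proxy list as a single
# argument, i.e. all of it lands in "$1":
fixed = '/usr/local/bin/scap-1 %s' % quote(' '.join(scap_proxies))
# -> /usr/local/bin/scap-1 'mw1010.eqiad.wmnet mw1011.eqiad.wmnet'

print(broken)
print(fixed)
```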
[00:56:25] (03PS1) 10Dzahn: add account for Sahar Massachi (stat servers) [operations/puppet] - 10https://gerrit.wikimedia.org/r/112613 [00:56:34] Logged the message, Master [00:56:53] interesting rebase on https://gerrit.wikimedia.org/r/#/c/112004/1 [00:57:02] (03CR) 10jenkins-bot: [V: 04-1] add account for Sahar Massachi (stat servers) [operations/puppet] - 10https://gerrit.wikimedia.org/r/112613 (owner: 10Dzahn) [00:57:03] it turned into +/- 0 [00:58:28] (03PS2) 10Dzahn: add account for Sahar Massachi (stat servers) [operations/puppet] - 10https://gerrit.wikimedia.org/r/112613 [00:59:22] (03CR) 10Dzahn: [C: 04-1] "it's a noop after rebasing it, guess it's been done elsewhere and this can be abandoned" [operations/puppet] - 10https://gerrit.wikimedia.org/r/112004 (owner: 10Alexandros Kosiaris) [01:02:04] (03CR) 10Dzahn: [C: 04-2] "already done by Ryan in Change-Id: I74d63e9dfd516f7afc27f111d5919f6f81db813a" [operations/puppet] - 10https://gerrit.wikimedia.org/r/112004 (owner: 10Alexandros Kosiaris) [01:31:09] (03CR) 10Jforrester: "Good point. Whoops." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/112582 (owner: 10Jforrester) [01:31:24] (03PS2) 10Jforrester: Enable VisualEditor in opt-in mode for French Wikiversity [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/112582 [01:31:52] * James_F coughs. [01:46:48] (03CR) 10Tim Starling: [C: 031] scap: ensure local copy is up-to-date; chgrp wikidev [operations/puppet] - 10https://gerrit.wikimedia.org/r/112503 (owner: 10Ori.livneh) [01:47:37] (03CR) 10Ori.livneh: [C: 032] scap: ensure local copy is up-to-date; chgrp wikidev [operations/puppet] - 10https://gerrit.wikimedia.org/r/112503 (owner: 10Ori.livneh) [02:10:38] !log LocalisationUpdate completed (1.23wmf12) at 2014-02-11 02:10:38+00:00 [02:10:48] Logged the message, Master [02:19:13] !log LocalisationUpdate completed (1.23wmf13) at 2014-02-11 02:19:13+00:00 [02:19:21] Logged the message, Master [02:27:09] What was up with Wikipedia about 4 hours ago. I logged some interesting error messages. [02:33:27] T13|sleeps: db failed. [02:34:07] Ahh. Then most of those db error messages make sense. [02:34:26] Got depooled since. [02:40:12] !log LocalisationUpdate ResourceLoader cache refresh completed at 2014-02-11 02:40:11+00:00 [02:40:45] Logged the message, Master [02:43:37] !log restarting apache on servers with workers in a futex wait: mw1056,mw1082,mw1189,mw1196,mw1199,mw1201,mw1202,mw1203,mw1204,mw1208 [02:43:44] Logged the message, Master [02:44:06] TimStarling: That happened again? Hmm. [02:44:18] what do you mean again? [02:44:57] PROBLEM - Disk space on wtp1010 is CRITICAL: DISK CRITICAL - free space: / 343 MB (3% inode=70%): [02:45:45] it was only a few hours ago that springle was complaining about it, I assumed it was just unfixed since then [02:45:47] TimStarling: Email from Sean earlier on ops-l; he had noticed this on the first scap attempt [02:45:59] yes, he noticed it, but he didn't restart apache [02:46:04] Ah. My understanding is that it had been fixed since but happened again. [02:46:15] Nevermind, then. [02:48:28] hmm, didn't know it was happening on all those nodes [02:49:16] <^d> TimStarling: What apc bug are we hitting? Or is it the familiar cache-too-small problem? 
[02:50:16] I'm looking at an affected process on mw1184, it's in APCBagOStuff::set() [02:51:31] called from LanguageConverter::loadTables() [02:53:31] hmm, lots of things failing to get and lock the memc lock for doing the slave check at a certain time [02:53:35] 336 2014-02-10 22:31 [02:53:36] 1173 2014-02-10 22:32 [02:53:38] 312 2014-02-10 22:33 [02:53:53] gets that matches "22:32 springle: pt-kill jobs on s1 slaves killing anything sleeping longer than 10s " [02:54:33] springle: were connections to db1050 hanging or failing fast then? [02:54:40] there's no symbols for apc, so I'm not sure how to work out if it's full [02:55:04] they connected then slept, see ops list [02:55:09] if it was the former, it would makes sense that lots of boxes would have connections to other servers while hanging on that one (inside getLagTimes()) [02:55:58] AaronSchulz: hanging, at least some of the time [02:56:44] probably several things going wrong then [03:00:56] PROBLEM - LVS HTTP IPv6 on bits-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:01:07] PROBLEM - LVS HTTP IPv6 on mobile-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:01:09] PROBLEM - LVS HTTPS IPv6 on bits-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:01:16] PROBLEM - LVS HTTPS IPv6 on mobile-lb.esams.wikimedia.org_ipv6 is CRITICAL: Connection timed out [03:01:46] RECOVERY - LVS HTTP IPv6 on bits-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 3897 bytes in 0.162 second response time [03:01:57] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [03:01:57] RECOVERY - LVS HTTPS IPv6 on bits-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 3903 bytes in 0.761 second response time [03:01:59] RECOVERY - LVS HTTP IPv6 on mobile-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 27648 bytes in 0.863 second response time [03:02:07] RECOVERY - LVS HTTPS IPv6 on mobile-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 27705 bytes in 0.527 second response time [03:03:07] PROBLEM - Disk space on wtp1013 is CRITICAL: DISK CRITICAL - free space: / 351 MB (3% inode=70%): [03:03:46] PROBLEM - Disk space on wtp1011 is CRITICAL: DISK CRITICAL - free space: / 235 MB (2% inode=69%): [03:04:06] PROBLEM - Disk space on wtp1009 is CRITICAL: DISK CRITICAL - free space: / 287 MB (3% inode=70%): [03:04:06] PROBLEM - Disk space on wtp1017 is CRITICAL: DISK CRITICAL - free space: / 323 MB (3% inode=70%): [03:04:57] PROBLEM - Disk space on wtp1005 is CRITICAL: DISK CRITICAL - free space: / 261 MB (2% inode=70%): [03:05:59] springle: it really sucks that the locking breaks in getLagTimes() so much due to constant twemproxy errors [03:06:46] PROBLEM - Disk space on wtp1020 is CRITICAL: DISK CRITICAL - free space: / 94 MB (1% inode=70%): [03:07:06] PROBLEM - Parsoid on wtp1013 is CRITICAL: Connection refused [03:07:17] PROBLEM - Parsoid on wtp1011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:07:17] PROBLEM - Disk space on wtp1006 is CRITICAL: DISK CRITICAL - free space: / 292 MB (3% inode=70%): [03:07:56] PROBLEM - Parsoid on wtp1017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:07:57] PROBLEM - Parsoid on wtp1010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:08:07] PROBLEM - Disk space on wtp1021 is CRITICAL: DISK CRITICAL - free space: / 184 MB (2% inode=70%): [03:08:36] PROBLEM - Parsoid on wtp1009 is CRITICAL: 
Connection refused [03:08:46] PROBLEM - Disk space on wtp1023 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=70%): [03:08:56] PROBLEM - Disk space on wtp1012 is CRITICAL: DISK CRITICAL - free space: / 74 MB (0% inode=69%): [03:08:56] PROBLEM - Disk space on wtp1018 is CRITICAL: DISK CRITICAL - free space: / 125 MB (1% inode=70%): [03:09:46] PROBLEM - Disk space on wtp1016 is CRITICAL: DISK CRITICAL - free space: / 86 MB (0% inode=70%): [03:09:46] PROBLEM - Disk space on wtp1008 is CRITICAL: DISK CRITICAL - free space: / 104 MB (1% inode=79%): [03:09:46] PROBLEM - Disk space on wtp1014 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=70%): [03:09:56] PROBLEM - Parsoid on wtp1020 is CRITICAL: Connection refused [03:09:56] PROBLEM - Disk space on wtp1019 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=70%): [03:09:56] PROBLEM - Disk space on wtp1022 is CRITICAL: DISK CRITICAL - free space: / 163 MB (1% inode=70%): [03:10:06] PROBLEM - Parsoid on wtp1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:10:07] PROBLEM - Disk space on wtp1007 is CRITICAL: DISK CRITICAL - free space: / 33 MB (0% inode=70%): [03:10:47] PROBLEM - Disk space on wtp1024 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=70%): [03:10:56] PROBLEM - Parsoid on wtp1006 is CRITICAL: Connection refused [03:11:00] those look real. root full on wtp1008 [03:11:07] PROBLEM - Parsoid on wtp1021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:11:07] PROBLEM - Parsoid on wtp1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:11:36] PROBLEM - Parsoid on wtp1012 is CRITICAL: Connection refused [03:11:36] PROBLEM - Parsoid on wtp1019 is CRITICAL: Connection refused [03:11:46] PROBLEM - Parsoid on wtp1018 is CRITICAL: Connection refused [03:12:01] parsoid log [03:12:36] PROBLEM - Parsoid on wtp1022 is CRITICAL: Connection refused [03:12:46] PROBLEM - Parsoid on wtp1014 is CRITICAL: Connection refused [03:12:46] PROBLEM - Parsoid on wtp1008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:12:56] PROBLEM - Parsoid on wtp1024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:13:06] PROBLEM - Parsoid on wtp1016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:13:07] PROBLEM - Parsoid on wtp1007 is CRITICAL: Connection refused [03:13:26] PROBLEM - Disk space on wtp1015 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=70%): [03:13:46] RECOVERY - Disk space on wtp1008 is OK: DISK OK [03:13:48] Oh bah. Did the earlier deploy touch parsoid at all? [03:14:18] dunno. wto1008 was me trying a rm and parsoid restart [03:14:28] that best course? [03:14:36] PROBLEM - Parsoid on wtp1015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:15:16] PROBLEM - LVS HTTP IPv4 on parsoid.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:15:46] springle: keep an eye on that log, if it's growing too fast there may be something that'll just keep coming back [03:16:36] RECOVERY - Parsoid on wtp1008 is OK: HTTP OK: HTTP/1.1 200 OK - 970 bytes in 0.006 second response time [03:17:28] Looking at what's up with the Parsoid servers [03:18:20] springle: I'll go in and truncate all the Parsoid logs [03:18:20] RoanKattouw, prod seems mostly fine but I did get an error on preview in like 1 out of 20 tries. not sure it's related. 
[03:18:29] And save a copy of one of them for Gabriel [03:18:45] RoanKattouw: thank you [03:19:41] Eloquence: Well 6/24 backends are up according to icinga so yeah that's probably related [03:19:47] ok [03:19:54] Varnish seems to be holding things together [03:19:56] RECOVERY - Parsoid on wtp1021 is OK: HTTP OK: HTTP/1.1 200 OK - 970 bytes in 0.011 second response time [03:20:03] As parsoid-lb is fine but parsoid.svc is critical [03:20:07] RECOVERY - LVS HTTP IPv4 on parsoid.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 970 bytes in 0.003 second response time [03:20:11] holding things together until you save? :) [03:20:41] want me to page .. nm :) [03:20:55] !log Copying wtp1005's Parsoid log for future reference [03:21:04] Logged the message, Mr. Obvious [03:22:56] PROBLEM - Parsoid on wtp1021 is CRITICAL: Connection refused [03:23:34] !log Doing rolling restart of the Parsoid cluster [03:23:41] Logged the message, Mr. Obvious [03:24:46] RECOVERY - Parsoid on wtp1010 is OK: HTTP OK: HTTP/1.1 200 OK - 970 bytes in 0.004 second response time [03:26:16] PROBLEM - LVS HTTP IPv4 on parsoid.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:26:26] PROBLEM - Parsoid on wtp1001 is CRITICAL: Connection refused [03:26:32] Ahm, wat [03:26:46] PROBLEM - Parsoid on wtp1004 is CRITICAL: Connection refused [03:26:46] PROBLEM - Parsoid on wtp1002 is CRITICAL: Connection refused [03:27:26] RECOVERY - Parsoid on wtp1001 is OK: HTTP OK: HTTP/1.1 200 OK - 970 bytes in 0.005 second response time [03:27:57] PROBLEM - Parsoid on wtp1010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:28:47] RECOVERY - Disk space on wtp1024 is OK: DISK OK [03:28:56] RECOVERY - Parsoid on wtp1020 is OK: HTTP OK: HTTP/1.1 200 OK - 970 bytes in 0.007 second response time [03:28:56] RECOVERY - Parsoid on wtp1006 is OK: HTTP OK: HTTP/1.1 200 OK - 970 bytes in 0.008 second response time [03:28:56] RECOVERY - Disk space on wtp1010 is OK: DISK OK [03:28:56] RECOVERY - Disk space on wtp1019 is OK: DISK OK [03:28:56] RECOVERY - Parsoid on wtp1021 is OK: HTTP OK: HTTP/1.1 200 OK - 970 bytes in 0.006 second response time [03:28:56] RECOVERY - Disk space on wtp1012 is OK: DISK OK [03:28:57] RECOVERY - Disk space on wtp1018 is OK: DISK OK [03:28:57] RECOVERY - Parsoid on wtp1005 is OK: HTTP OK: HTTP/1.1 200 OK - 970 bytes in 0.005 second response time [03:28:58] RECOVERY - Disk space on wtp1022 is OK: DISK OK [03:28:58] RECOVERY - Disk space on wtp1005 is OK: DISK OK [03:28:59] RECOVERY - Parsoid on wtp1023 is OK: HTTP OK: HTTP/1.1 200 OK - 970 bytes in 0.013 second response time [03:29:06] RECOVERY - Parsoid on wtp1013 is OK: HTTP OK: HTTP/1.1 200 OK - 970 bytes in 0.007 second response time [03:29:07] RECOVERY - LVS HTTP IPv4 on parsoid.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 970 bytes in 0.006 second response time [03:29:08] RECOVERY - Parsoid on wtp1011 is OK: HTTP OK: HTTP/1.1 200 OK - 970 bytes in 0.007 second response time [03:29:09] RECOVERY - Disk space on wtp1017 is OK: DISK OK [03:29:09] RECOVERY - Disk space on wtp1009 is OK: DISK OK [03:29:09] RECOVERY - Disk space on wtp1013 is OK: DISK OK [03:29:09] RECOVERY - Disk space on wtp1007 is OK: DISK OK [03:29:10] RECOVERY - Disk space on wtp1021 is OK: DISK OK [03:29:10] RECOVERY - Parsoid on wtp1007 is OK: HTTP OK: HTTP/1.1 200 OK - 970 bytes in 0.005 second response time [03:29:16] RECOVERY - Disk space on wtp1006 is OK: DISK OK [03:29:19] * RoanKattouw wishes the documented restart script actually worked [03:29:26] RECOVERY - 
Parsoid on wtp1015 is OK: HTTP OK: HTTP/1.1 200 OK - 970 bytes in 0.003 second response time [03:29:36] RECOVERY - Parsoid on wtp1022 is OK: HTTP OK: HTTP/1.1 200 OK - 970 bytes in 0.002 second response time [03:29:36] RECOVERY - Parsoid on wtp1012 is OK: HTTP OK: HTTP/1.1 200 OK - 970 bytes in 0.005 second response time [03:29:36] RECOVERY - Parsoid on wtp1009 is OK: HTTP OK: HTTP/1.1 200 OK - 970 bytes in 0.004 second response time [03:29:36] RECOVERY - Parsoid on wtp1019 is OK: HTTP OK: HTTP/1.1 200 OK - 970 bytes in 0.004 second response time [03:29:46] RECOVERY - Parsoid on wtp1004 is OK: HTTP OK: HTTP/1.1 200 OK - 970 bytes in 0.006 second response time [03:29:46] RECOVERY - Parsoid on wtp1017 is OK: HTTP OK: HTTP/1.1 200 OK - 970 bytes in 0.006 second response time [03:29:46] RECOVERY - Parsoid on wtp1024 is OK: HTTP OK: HTTP/1.1 200 OK - 970 bytes in 0.010 second response time [03:29:46] RECOVERY - Parsoid on wtp1018 is OK: HTTP OK: HTTP/1.1 200 OK - 970 bytes in 0.003 second response time [03:29:46] RECOVERY - Disk space on wtp1011 is OK: DISK OK [03:29:47] RECOVERY - Disk space on wtp1020 is OK: DISK OK [03:29:47] RECOVERY - Parsoid on wtp1002 is OK: HTTP OK: HTTP/1.1 200 OK - 970 bytes in 0.003 second response time [03:29:47] RECOVERY - Disk space on wtp1023 is OK: DISK OK [03:29:48] RECOVERY - Disk space on wtp1016 is OK: DISK OK [03:29:48] RECOVERY - Parsoid on wtp1014 is OK: HTTP OK: HTTP/1.1 200 OK - 970 bytes in 0.005 second response time [03:29:49] RECOVERY - Disk space on wtp1014 is OK: DISK OK [03:29:49] RECOVERY - Parsoid on wtp1010 is OK: HTTP OK: HTTP/1.1 200 OK - 970 bytes in 0.003 second response time [03:29:57] !log Managed to restart Parsoid cleanly in the end. Turns out dsh -g parsoid service restart parsoid doesn't work but dsh -cM -g parsoid /etc/init.d/parsoid restart does work [03:30:04] Logged the message, Mr. Obvious [03:30:27] shiny, thanks RoanKattouw [03:30:32] do we know why it died in the first place? [03:31:02] Eloquence: Log file growth [03:31:09] We still don't have any log rotation for Parsoid at all [03:31:09] :P [03:31:41] So the log files grow, and as they grow mostly uniformly and the machines are mostly identical, all machines run out of disk space around the same time [03:31:56] RECOVERY - Parsoid on wtp1016 is OK: HTTP OK: HTTP/1.1 200 OK - 970 bytes in 0.008 second response time [03:32:03] There's something weird on wtp1016 where it's still not up and using a lot more disk than it should [03:32:07] Oh look there it is [03:32:17] Still using many gigabytes more than its siblings though [03:32:41] Actually, wtf [03:33:01] Why are almost all wtp* machines using 88-90% of their disks even with freshly cleared logs? [03:33:11] Except for wtp1008 somehow [03:33:26] gwicke, if you're around ^^ ? [03:33:45] Also, the log level is very verbose [03:33:55] starting parsing of itwiki:Stura_di_Demonte [03:33:56] completed parsing of svwiki:Sennevoy-le-Bas in 74760 ms [03:34:09] This means I now already have ~150 KB of logs per backend [03:34:16] In like 5 minutes [03:34:48] RoanKattouw: wtp1008 was the one i messed with manually earlier ^^, before you came to save the day [03:34:56] Right [03:34:58] Still [03:35:17] springle: Did you do anything to free up disk space other than clear /var/log/parsoid/nohup.out ? 
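Extrapolating the ~150 KB per 5 minutes figure quoted above (assuming that rate is roughly typical, which it may not be) shows why backends with only a few hundred MB of free root space all fill up within days of each other:

```python
# Rough back-of-the-envelope growth of the Parsoid log, assuming the
# ~150 KB per 5 minutes rate quoted above holds steady.
kb_per_5min = 150.0
mb_per_day = kb_per_5min * (60 / 5) * 24 / 1024
print(round(mb_per_day, 1))       # ~42.2 MB per day
print(round(mb_per_day * 7, 1))   # ~295 MB per week, per backend
```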
[03:35:34] it's /var/log/parsoid/parsoid.log [03:35:45] and i removed it, and restarted the service [03:35:58] the other boxes still have large parsoid.log [03:36:02] wtf [03:36:17] Are there two different init scripts writing to two different files or something? [03:36:49] * springle doesn't know [03:36:52] :) [03:37:08] OK yeah that log file is the culprit alright [03:37:31] And it's undocumented, awesome [03:39:47] !log Freed up disk space on wtp* by blanking /var/log/parsoid/parsoid.log [03:39:48] RECOVERY - Disk space on wtp1015 is OK: DISK OK [03:39:49] OK now I probably have to restart the service again, to make it start the right way [03:39:56] Logged the message, Mr. Obvious [03:40:55] RoanKattouw: looks like parsoid.log arrived in https://gerrit.wikimedia.org/r/#/c/107492/ [03:43:34] no, https://gerrit.wikimedia.org/r/#/c/99656/ [03:43:44] along with logrotate stuff.. heh [03:45:03] Hah [03:45:06] Not good enough then? [03:46:02] (03CR) 10Catrope: beta: manage parsoid using upstart (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/99656 (owner: 10Hashar) [03:47:24] (03CR) 10Catrope: beta: manage parsoid using upstart (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/99656 (owner: 10Hashar) [03:47:28] Hah, log rotation is broken [03:47:36] PROBLEM - Parsoid on wtp1008 is CRITICAL: Connection refused [03:47:39] The puppet code that generates the logrotate.d file generates it wrong [03:48:36] RECOVERY - Parsoid on wtp1008 is OK: HTTP OK: HTTP/1.1 200 OK - 970 bytes in 0.003 second response time [03:50:46] PROBLEM - Parsoid on wtp1018 is CRITICAL: Connection refused [03:50:56] PROBLEM - Parsoid on wtp1023 is CRITICAL: Connection refused [03:51:46] RECOVERY - Parsoid on wtp1018 is OK: HTTP OK: HTTP/1.1 200 OK - 970 bytes in 0.007 second response time [03:51:57] RECOVERY - Parsoid on wtp1023 is OK: HTTP OK: HTTP/1.1 200 OK - 970 bytes in 0.003 second response time [03:58:16] RoanKattouw, fixed in the deb ;) [03:59:34] the quick fix would be to remove the variable and just insert the path [03:59:43] as in the package [03:59:50] gwicke: Where can I find the commit that fixes it? [04:00:10] there is no commit for puppet yet [04:00:32] Are you saying the deb source isn't versioned? [04:00:51] deb package != puppet [04:01:19] let me fix puppet too [04:01:30] Well ,sure [04:01:34] But what is the deb package built from? [04:01:40] Hopefully it's built from things that are versioned somewhere? [04:01:54] sure, the deploy repo [04:02:00] https://gerrit.wikimedia.org/r/#/c/110666/ [04:02:39] Aha [04:02:57] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000 [04:03:10] I was just about to reboot my machine after a kernel upgrade, so now reopening everything.. [04:05:06] (03PS1) 10GWicke: Fix parsoid log rotation [operations/puppet] - 10https://gerrit.wikimedia.org/r/112637 [04:06:06] (03PS2) 10GWicke: Fix parsoid log rotation [operations/puppet] - 10https://gerrit.wikimedia.org/r/112637 [04:06:46] (03CR) 10Catrope: [C: 031] Fix parsoid log rotation [operations/puppet] - 10https://gerrit.wikimedia.org/r/112637 (owner: 10GWicke) [04:10:04] (03PS3) 10GWicke: Fix parsoid log rotation [operations/puppet] - 10https://gerrit.wikimedia.org/r/112637 [04:10:29] anybody with +2 powers around? [04:11:25] springle: You maybe? 
---^^ [04:11:49] gwicke: I'm about to go home and eat, but after I do so, I'll write an outage report and send it to ops-l, if your commit isn't merged by then that'll hopefully get someone's attention [04:12:18] RoanKattouw, k [04:12:25] were all hosts down? [04:12:57] (03CR) 10Springle: [C: 032] Fix parsoid log rotation [operations/puppet] - 10https://gerrit.wikimedia.org/r/112637 (owner: 10GWicke) [04:13:04] It looks like 5 of them had magically shorter logs and stayed up [04:13:13] Sean fixed wtp1008 so that made for 6 backends up [04:13:28] But we have a depool threshold that's greater than 6/24 , so LVS was sending requests to broken hosts [04:13:47] meh [04:13:55] Also, the restart command change was documented on wikitech, but: [04:14:02] * the command as documented there didn't actually work [04:14:23] * the log file location change was undocumented, so I was scratching my head as to why the disk space wasn't freeing up until Sean pointed me to the other log file [04:14:51] it is no longer in nohup.out.. [04:14:58] Yeah, I know that now [04:15:02] But the wikitech page still claims it is [04:15:14] And the old /etc/init.d scripts are still there and still work [04:15:17] I made a first pass on that earlier today [04:15:35] In fact, I used them to get the service back up because the first restart command I tried took all hosts down [04:15:48] That was dsh -g parsoid service parsoid restart [04:15:53] we only got the current code rolled out Thursday or so, after Ryan manually worked around broken Trebuchet [04:15:58] RoanKattouw: You can't merge operations/puppet? [04:16:21] I then ran that command on groups of 8 hosts at a time, and every time 1 or2 of them broke and had to be restarted again [04:16:24] So the restarts aren't very reliable [04:16:34] did you use service restart? [04:16:40] Yes [04:16:42] that will *not* use the init.d script [04:16:44] But sometimes with service restart, the service doesn't restart correctly [04:16:46] I know that [04:16:49] upstart takes precedence [04:16:58] So to clean up after myself, I did /etc/init.d/parsoid stop ; service parsoid restart [04:17:05] And in every batch of 8 hosts, 1 or 2 would break [04:17:09] And I'd have to restart those again [04:17:29] The upstart version would be running and the init version wouldn't be running, and everything would look fine, but traffic just wouldn't reach it somehow [04:17:44] * Gloria git reviews. [04:17:46] I suppose I should have kept a broken backend around for research [04:17:47] (03PS1) 10MZMcBride: Remove ArticleFeedbackv5 from Wikimedia wikis [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/112639 [04:17:52] we recently switched from init.d to upstart (last week) and wanted to make sure everything is working before removing the init.d script [04:18:01] Maybe if I restart the same backend over and over again I can get it to break [04:18:14] Right, OK that makes sense why the docs and presence of things were inconsistent [04:18:18] As we're in a transitional phase [04:18:32] I could say that it is all documented in puppet, but I know just as well how much it sucks to go through that.. [04:19:01] Yeah I really don't have the time or inclination to go read puppet code when a service is down [04:19:16] Gloria: I do have +2 on that repo according to the ACL, but not socially [04:19:40] Roans are not to be trusted. 
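A toy illustration of the depool-threshold behaviour described above; this is not the load balancer's actual code, and the function name and the 50% threshold are assumptions. The point is that when more backends fail than the threshold allows to be depooled, broken hosts stay pooled and keep receiving traffic.

```python
def pooled_backends(backends, healthy, depool_threshold):
    """Keep at least depool_threshold of the backends pooled, even if
    that means leaving unhealthy ones in rotation."""
    min_pooled = int(len(backends) * depool_threshold)
    pooled = [b for b in backends if b in healthy]
    for b in backends:                 # top up with unhealthy hosts if needed
        if len(pooled) >= min_pooled:
            break
        if b not in pooled:
            pooled.append(b)
    return pooled

backends = ['wtp10%02d' % i for i in range(1, 25)]   # 24 Parsoid backends
healthy = set(backends[:6])                          # only 6 survived, as above
print(len(pooled_backends(backends, healthy, 0.5)))  # 12 pooled, half of them broken
```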
[04:19:55] ops was too lazy to split LDAP groups just for me, so they had to give me +2 in ops/puppet as collateral damage, and just told me not to use it [04:19:56] we are slowly cleaning up at least [04:20:15] Of course I've also been a Gerrit admin for years so I was already able to give myself +2 on any repo in theory [04:20:24] You know, in case I wanted to get in trouble with my employer :) [04:21:23] gwicke: Anyway I worry about this upstart issue, I'll play with it a bit more later tonight [04:21:25] RoanKattouw, if you used the init script then issues are not surprising [04:21:25] (03CR) 10MZMcBride: "This change isn't ready to be merged yet." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/112639 (owner: 10MZMcBride) [04:21:53] gwicke: Sure. I think I did things correctly so as to not have interference with the init script, but that could be related [04:22:00] Also, the cluster broke the first time I tried to restart it [04:22:09] with service restart? [04:22:13] using upstart [04:22:21] Yup [04:22:28] anything in the logs? [04:22:29] I did a service restart via dsh, then two of the six hosts that were up went down [04:22:40] Didn't check, they'll have gotten wiped [04:23:13] Given that this happened for about 1 in every 4 machines when I did my last wave of restarts, when I get home tonight I'll just grab 8 backends and restart them a couple times and see if I can reproduce [04:23:17] mutante reported odd restart behavior earlier today, but we put that down to having to nuke the repo from under the process to work around broken trebuchet and then calling restart [04:23:42] he said that he had to call restart twice on some machines [04:23:48] Yeah [04:23:51] I had the same experience [04:23:56] So if the probability of failure really is 1/4 [04:24:20] Then the probability that the entire cluster will restart correctly is 0.1% [04:24:43] So if I take 8 machines and restart them 3 times (so as to not endanger the health of the entire cluster) [04:24:49] just took a statistics course? ;) [04:25:02] the probability of me observing a failure is 99.9% [04:25:05] A while ago, but yes :) [04:25:21] mine is a few years back already.. [04:25:30] And for 1/8 it's 96% [04:25:49] anyway, back up your 1/4 ;) [04:25:51] So, I'll sit down tonight and prod a couple boxes, and if there's anything wrong with upstart I should discover that really quickly [04:26:28] I tweaked a few things in the packaged upstart script [04:27:50] trying a few restarts with that on ruthenium now.. [04:28:00] OK [04:28:03] Let me know how it goes [04:28:07] I'll be back in 30-60 mins [04:34:59] aha! after around 30 'service parsoid restart's I managed to get two parsoid instances running at the same time [04:41:01] or so it seemed [04:47:57] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [04:50:12] PROBLEM - LVS HTTPS IPv6 on bits-lb.ulsfo.wikimedia.org_ipv6 is CRITICAL: Connection timed out [04:50:12] PROBLEM - LVS HTTPS IPv6 on text-lb.ulsfo.wikimedia.org_ipv6 is CRITICAL: Connection timed out [04:50:57] RECOVERY - LVS HTTPS IPv6 on bits-lb.ulsfo.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 3903 bytes in 0.402 second response time [04:50:58] RECOVERY - LVS HTTPS IPv6 on text-lb.ulsfo.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 64811 bytes in 0.630 second response time [04:52:13] hm [04:56:35] gwicke: Ha! Did that result in breakage, too? 
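The restart arithmetic a few lines above checks out; a quick sketch of the numbers, with p and n as stated in the conversation:

```python
p = 0.25   # assumed per-restart failure probability
n = 24     # whole cluster, or 8 machines restarted 3 times each

print((1 - p) ** n)          # ~0.001 -> ~0.1% chance every restart succeeds
print(1 - (1 - p) ** n)      # ~0.999 -> ~99.9% chance of observing a failure
print(1 - (1 - 1/8.0) ** n)  # ~0.96  -> the "for 1/8 it's 96%" figure
```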
[04:56:37] I guess if not then it'll probably break on the next restart after that [04:58:24] i was not able to reproduce it since [04:58:40] I switched from grep node to grep -c node [04:58:46] and the count actually stayed constant [04:58:58] another 300 restarts in now [04:59:30] so I likely overlooked something in the verbose output when I thought I had seen two copies [04:59:42] I'd say the package version works fine [05:00:14] https://gerrit.wikimedia.org/r/#/c/110666/25/debian/parsoid.upstart [05:01:06] Getting a 503 from en.wikipedia.org [05:02:38] let me remove the transitional init.d stop bit [05:02:56] PROBLEM - LVS HTTP IPv6 on text-lb.ulsfo.wikimedia.org_ipv6 is CRITICAL: Connection timed out [05:02:58] PROBLEM - Packetloss_Average on erbium is CRITICAL: packet_loss_average CRITICAL: 9.46399239583 [05:03:07] PROBLEM - LVS HTTPS IPv6 on bits-lb.ulsfo.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:03:09] PROBLEM - LVS HTTPS IPv6 on text-lb.ulsfo.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:03:26] PROBLEM - LVS HTTPS IPv6 on upload-lb.ulsfo.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:03:46] PROBLEM - LVS HTTP IPv6 on mobile-lb.ulsfo.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:03:56] PROBLEM - LVS HTTPS IPv6 on mobile-lb.ulsfo.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:03:58] RECOVERY - LVS HTTPS IPv6 on bits-lb.ulsfo.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 3893 bytes in 0.428 second response time [05:04:06] RECOVERY - LVS HTTPS IPv6 on text-lb.ulsfo.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 64689 bytes in 0.638 second response time [05:05:16] PROBLEM - Varnish traffic logger on cp4010 is CRITICAL: Timeout while attempting connection [05:05:16] PROBLEM - Varnish traffic logger on cp4007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:05:16] PROBLEM - DPKG on cp4002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:05:16] PROBLEM - Disk space on cp4016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:05:16] PROBLEM - puppet disabled on cp4020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:05:17] PROBLEM - LVS HTTPS IPv4 on bits-lb.ulsfo.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:05:18] PROBLEM - HTTPS on cp4019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:05:18] PROBLEM - HTTPS on cp4020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:05:18] PROBLEM - LVS HTTP IPv4 on text-lb.ulsfo.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:05:19] PROBLEM - HTTPS on cp4016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:05:26] RECOVERY - LVS HTTPS IPv6 on upload-lb.ulsfo.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 573 bytes in 2.801 second response time [05:05:27] PROBLEM - RAID on lvs4001 is CRITICAL: Timeout while attempting connection [05:05:27] PROBLEM - LVS HTTPS IPv4 on text-lb.ulsfo.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:05:29] PROBLEM - puppet disabled on cp4011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[05:05:36] RECOVERY - LVS HTTP IPv6 on mobile-lb.ulsfo.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 27814 bytes in 0.252 second response time [05:05:37] PROBLEM - LVS HTTPS IPv4 on mobile-lb.ulsfo.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:05:40] PROBLEM - RAID on cp4003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:05:46] PROBLEM - Disk space on cp4004 is CRITICAL: Timeout while attempting connection [05:05:46] RECOVERY - LVS HTTPS IPv6 on mobile-lb.ulsfo.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 27866 bytes in 0.515 second response time [05:05:48] RECOVERY - LVS HTTP IPv6 on text-lb.ulsfo.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 64684 bytes in 0.398 second response time [05:05:56] PROBLEM - Varnishkafka log producer on cp4003 is CRITICAL: Timeout while attempting connection [05:05:56] PROBLEM - HTTPS on cp4001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:05:56] PROBLEM - DPKG on cp4001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:05:57] PROBLEM - RAID on lvs4002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:05:57] PROBLEM - puppet disabled on cp4018 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:05:57] PROBLEM - Varnish traffic logger on cp4016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:05:57] PROBLEM - puppet disabled on cp4002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:05:58] PROBLEM - Varnish traffic logger on cp4006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:05:58] PROBLEM - Disk space on cp4017 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:06:03] ugh [05:06:07] RECOVERY - Varnish traffic logger on cp4010 is OK: PROCS OK: 2 processes with command name varnishncsa [05:06:07] RECOVERY - Disk space on cp4016 is OK: DISK OK [05:06:07] RECOVERY - Varnish traffic logger on cp4007 is OK: PROCS OK: 2 processes with command name varnishncsa [05:06:07] RECOVERY - HTTPS on cp4019 is OK: OK - Certificate will expire on 01/20/2016 12:00. [05:06:07] RECOVERY - puppet disabled on cp4020 is OK: OK [05:06:08] RECOVERY - HTTPS on cp4020 is OK: OK - Certificate will expire on 01/20/2016 12:00. [05:06:09] RECOVERY - DPKG on cp4002 is OK: All packages OK [05:06:09] RECOVERY - HTTPS on cp4016 is OK: OK - Certificate will expire on 01/20/2016 12:00. [05:06:09] RECOVERY - LVS HTTPS IPv4 on bits-lb.ulsfo.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 3903 bytes in 0.389 second response time [05:06:10] RECOVERY - LVS HTTP IPv4 on text-lb.ulsfo.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 64683 bytes in 0.381 second response time [05:06:10] Looks like ulsfo is falling over? [05:06:11] this sucks :\ [05:06:16] RECOVERY - RAID on lvs4001 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [05:06:26] RECOVERY - puppet disabled on cp4011 is OK: OK [05:06:30] hey I found an android bug. if the rate of inbound texts is too high, I can't unlock my phone screen to read them, because each incoming message resets the process of unlocking it :P [05:06:46] RECOVERY - Disk space on cp4004 is OK: DISK OK [05:06:46] PROBLEM - DPKG on cp4006 is CRITICAL: Timeout while attempting connection [05:06:46] RECOVERY - HTTPS on cp4001 is OK: OK - Certificate will expire on 01/20/2016 12:00. 
[05:06:56] RECOVERY - Varnish traffic logger on cp4016 is OK: PROCS OK: 2 processes with command name varnishncsa [05:06:56] RECOVERY - RAID on lvs4002 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [05:06:56] RECOVERY - Varnishkafka log producer on cp4003 is OK: PROCS OK: 1 process with command name varnishkafka [05:06:56] RECOVERY - puppet disabled on cp4002 is OK: OK [05:06:56] RECOVERY - Varnish traffic logger on cp4006 is OK: PROCS OK: 2 processes with command name varnishncsa [05:06:57] RECOVERY - Disk space on cp4017 is OK: DISK OK [05:06:57] RECOVERY - DPKG on cp4001 is OK: All packages OK [05:07:16] RECOVERY - LVS HTTPS IPv4 on text-lb.ulsfo.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 64690 bytes in 0.643 second response time [05:07:20] hahaha [05:07:26] RECOVERY - LVS HTTPS IPv4 on mobile-lb.ulsfo.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 27864 bytes in 0.475 second response time [05:07:28] RECOVERY - RAID on cp4003 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [05:07:36] RECOVERY - DPKG on cp4006 is OK: All packages OK [05:07:45] I wonder why I'm not getting paged [05:07:46] RECOVERY - puppet disabled on cp4018 is OK: OK [05:08:07] I also didn't get a page when parsoid.svc.eqiad.wmnet (but not parsoid-lb) went CRITICAL earlier [05:08:08] Did anyone else? [05:08:14] yes [05:08:50] OK so that did page [05:08:54] But paging is just broken for me somehow [05:09:01] I did get paged at 2pm for today's main outage [05:11:20] it's because you're set for CET time zone [05:11:39] (03PS1) 10GWicke: Remove old parsoid init script [operations/puppet] - 10https://gerrit.wikimedia.org/r/112640 [05:13:11] Oh! [05:13:13] hah [05:13:19] jgage: Could you move me back to PST? [05:13:24] I'll be in PST for the next 3 months or so [05:13:28] sure [05:13:35] I guess I asked someone to move me to CET back in December and never set it back [05:13:55] To be fair I went to AWDT afterwards, then PST, then CET, then PST again [05:14:03] But now I'm stable in PST [05:16:36] PROBLEM - Packetloss_Average on oxygen is CRITICAL: packet_loss_average CRITICAL: 9.52673494845 [05:16:56] RECOVERY - Packetloss_Average on erbium is OK: packet_loss_average OKAY: 1.83953969388 [05:18:00] after 500 restarts the packaged parsoid is still at 20 workers on a 16 core machine [05:19:15] But does it respond to HTTP? [05:19:19] Did you test that after each restart? [05:19:43] RoanKattouw, I'm suspecting that the transitional /etc/init.d/parsoid stop call caused issues even though no init-based parsoid should be running [05:20:15] Then why did it also fail the first time (before I'd used /etc/init.d) and why did mutante have problems [05:20:44] that call was in the upstart config for the transition from init.d [05:21:20] see https://gerrit.wikimedia.org/r/#/c/112640/ [05:21:25] aha [05:21:51] is there a saner way to do this kind of thing in puppet btw? [05:22:10] Do what in puppet? 
[05:22:30] switching from one init system to another [05:22:49] killing the old version when the new one is about to be installed [05:22:56] then removing the old init script [05:23:22] Probably not [05:23:25] something like debian postrm scripts [05:23:32] I've never known puppet to be good at transitional anything [05:23:59] just getting rid of the old init file is a two-step process [05:24:14] first ensure => absent, then remove the stanza when that has run [05:26:36] RECOVERY - Packetloss_Average on oxygen is OK: packet_loss_average OKAY: 1.62411907216 [05:37:53] ok RoanKattouw, confirmed you are switched to PST/PDT for icinga alerts [05:38:54] Sweet thanks [05:49:56] fwiw, the puppet init script works fine on a test host too [05:50:06] eh, upstart config [06:09:12] (03CR) 10GWicke: "When deploying earlier today Daniel observed that some parsoid processes needed to be restarted twice to come up correctly. We put this do" [operations/puppet] - 10https://gerrit.wikimedia.org/r/112640 (owner: 10GWicke) [06:12:20] gwicke: so restart works properly when using upstart? [06:12:32] Ryan_Lane: Not in my experience [06:12:36] with the service-restart command? [06:12:37] What I just noticed while writing the post mortem: [06:12:45] The first restart that I did didn't clear the log file [06:12:49] the init script definitely didn't work [06:12:55] because it was simply broken [06:12:58] Because disk utilization was still at 88%-90% after I restarted using the init script [06:13:16] because that had cleared nohup.out but not parsoid.log [06:13:26] erm, what is " 273 fault (11) " in fatalmonitor? [06:13:29] right, we know the init script is broken [06:13:35] the upstart should work, though [06:13:37] Ryan_Lane: No, not really [06:13:44] The init script *saved* us initially [06:13:58] I'm writing a post mortem, but the rough order of events was: [06:14:10] 1) disk space exhaustion, 19/24 backends die within 6 minutes of each other [06:14:18] 2) Sean manually fixes wtp1008 [06:14:30] (by removing the log file and restarting the service) [06:14:31] restarting is not supposed to clear the log file [06:14:33] what caused disk space exhaustion? [06:14:41] Broken logrotate config [06:14:43] log rotate should clear the log right? [06:14:43] running the log rotation manually will help though [06:14:44] ah [06:14:59] gwicke: Hah, that explains why the restart didn't work the first time [06:15:00] it is set to only run daily otherwise [06:15:27] the init script definitely makes it impossible to use the service-restart command [06:15:27] 3) I do a rolling restart using sequential dsh as documented on wikitech; more backends die [06:15:32] similar with dsh [06:15:41] it doesn't exit properly [06:15:45] I know [06:15:54] For #3 I used 'service parsoid restart' as the dsh command [06:15:59] Which does exit properly [06:16:09] * Ryan_Lane nods [06:16:19] sorry, but the number in the fatalmonitor is growing. Is it anything major? 
About to look deeper [06:16:45] 4) I restart the service using the old init script, this clears nohup.out which was 10% of disk space, all backends come back up [06:17:07] 5) Sean tells me the (undocumented) new location of the log file, I figure out what's going on with init script vs upstart and clean up [06:17:16] RoanKattouw, there is no mention of /etc/init.d/parsoid in the docs [06:17:21] only service parsoid restart [06:17:28] Where "clean up" meant /etc/init.d/parsoid stop ; service parsoid restart [06:17:37] gwicke: What I meant by undocumented was /var/log/parsoid/parsoid.log [06:17:43] That's not mentioned in the docs, only nohup.out is [06:17:51] yeah, that's now fixed [06:17:51] We had this discussion earlier, the docs are in an in-between state [06:17:59] btw, we should delete nohup.out [06:17:59] or were at the time, at least [06:18:04] and you explained why [06:18:04] could you do so? [06:18:06] Yeah we should [06:18:09] Sure [06:18:11] the only thing that sucks about 'service' is that it delegates to one of only two possible init systems [06:18:15] i hope that with systemd it'll be three [06:18:29] that should make debugging fun [06:18:50] ori: https://gerrit.wikimedia.org/r/#/c/112640/ ;) [06:18:50] Anyway, after stopping the init service and starting the upstart service, about 1/4 didn't come up correctly, so I had to restart those again [06:19:08] This seems to have been experienced before while migrating from init to upstart, and Gabriel hasn't been able to reproduce it since [06:19:24] I'm guessing that it was the log file still being too large [06:19:37] which makes sense as you didn't truncate it [06:20:27] there are 300+ apache2[2392]: [notice] child pid 9298 exit signal Segmentation fault (11) -- style errors. Sorry if this is a known issue [06:20:36] yurik: That just means PHP is segfaulting a lot [06:20:45] Which is not good but probably not urgent [06:20:52] RoanKattouw, ok, thx [06:20:54] Probably Yet Another PRCE Bug [06:21:17] gwicke: Yeah probably what happened with my first restart is nothing [06:21:36] Then a few minutes later, a couple of other machines organically ran out of disk space and broke, and I blamed that on the restart [06:22:20] the load balancing looks pretty good with so many machines running out of space close to each other after a week [06:22:23] what are you guys writing into those log files anyway? [06:22:35] /var/log/parsoid/ [06:22:51] the puppetized version of the logrotate.d config didn't work [06:23:42] ori: Lots and lots and lots of CACHE MISS: \n Started parsing \n Finished parsing in