[00:00:18] thanks Coren [00:00:54] I see at least three different things crashing at different points, the only common point seem to be the memory thing. If I had to guess, those are bits of code that don't guard against malloc() returning NULL properly when resources are scarce then using the pointer. [00:01:35] aha, https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep [00:02:00] yeah, that [00:03:43] related? [00:03:43] 19:02 < ireas> hi there. it seems impossible to edit [[:de:Benutzer Diskussion:Emergency doc]], error: https://tools.wmflabs.org/paste/view/a811e165 [00:03:47] 19:02 < ireas> (database error) [00:03:49] 19:02 < ireas> other pages work well [00:04:07] PROBLEM - MySQL InnoDB on db1040 is CRITICAL: CRIT longest blocking idle transaction sleeps for 2417 seconds [00:04:47] !log maxsem scap aborted: MobileApp deployment (duration: 06m 33s) [00:04:53] Logged the message, Master [00:05:03] crap. srv291: Permission denied (publickey). [00:05:07] RECOVERY - MySQL InnoDB on db1040 is OK: OK longest blocking idle transaction sleeps for 0 seconds [00:05:33] <^demon> I always get pubkey denied for crap in tampa. [00:05:38] <^demon> hume, fenari, srv291... [00:05:57] haven't seen this before [00:06:14] !log maxsem started scap: MobileApp deployment [00:06:16] RECOVERY - Apache HTTP on mw1206 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.063 second response time [00:06:22] Logged the message, Master [00:06:37] I shouldn't be that nervous - just usual scap errors... [00:07:31] After earlier I don't blame you [00:07:37] * MaxSem prays for death of pmtpa [00:09:06] PROBLEM - MySQL Idle Transactions on db1058 is CRITICAL: CRIT longest blocking idle transaction sleeps for 2730 seconds [00:09:07] PROBLEM - MySQL InnoDB on db1058 is CRITICAL: CRIT longest blocking idle transaction sleeps for 2732 seconds [00:09:54] srv291 has had repeated issues, see it in SAL [00:10:02] wasn't only it [00:10:07] those boxes that just cause trouble, should just be removed from dsh [00:10:12] and decom'ed [00:10:24] ok [00:11:32] you can just hit enter, right [00:11:35] to continue [00:16:43] (03CR) 10TTO: "???? The bug report refers to frwikiVERSITY not frwikTIONARY!" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/112582 (owner: 10Jforrester) [00:17:16] (03PS3) 10Ebe123: Add namespace aliases for ang.wikipedia and wiktionary [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/112409 [00:17:31] RECOVERY - MySQL Idle Transactions on db1058 is OK: OK longest blocking idle transaction sleeps for 0 seconds [00:17:31] RECOVERY - MySQL InnoDB on db1058 is OK: OK longest blocking idle transaction sleeps for 0 seconds [00:19:20] (03CR) 10PiRSquared17: [C: 031] Add namespace aliases for ang.wikipedia and wiktionary [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/112409 (owner: 10Ebe123) [00:23:07] mutante: is your scap still going? [00:23:12] er MaxSem [00:23:24] yep [00:23:37] I have to go in about 5ish minutes [00:23:42] (03CR) 10Espreon: [C: 031] "Looks fine to me." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/112409 (owner: 10Ebe123) [00:23:48] just fyi [00:24:03] ok [00:24:30] but things are looking better this time anyways :) [00:24:46] No random database failure at least. [00:25:17] * greg-g knocks on wood [00:29:01] ori, looks like the scap is using the same proxy, mw1010 for all hosts [00:29:40] interesting, will look in a moment [00:29:43] bd808: ^ [00:30:29] Problem with the ping script again? 
[00:34:30] (03PS2) 10Ori.livneh: scap: ensure local copy is up-to-date; chgrp wikidev [operations/puppet] - 10https://gerrit.wikimedia.org/r/112503 [00:34:34] !log maxsem finished scap: MobileApp deployment (duration: 28m 19s) [00:34:42] Logged the message, Master [00:36:42] (03CR) 10Dzahn: [C: 031] "thanks for confirming, yea, that filter was already removed in one of Matanya's changes, so it doesn't get removed here anymore after reba" [operations/puppet] - 10https://gerrit.wikimedia.org/r/112587 (owner: 10Dzahn) [00:38:14] okay, congratulate me everyone: I scapped without pulling the deploying change [00:38:49] (03PS1) 10MaxSem: Revert "Deploy MobileApp on test and test2" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/112609 [00:38:58] (03CR) 10MaxSem: [C: 032] Revert "Deploy MobileApp on test and test2" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/112609 (owner: 10MaxSem) [00:39:00] MaxSem: high-five [00:39:04] what, wait, revert?:) [00:39:05] (03Merged) 10jenkins-bot: Revert "Deploy MobileApp on test and test2" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/112609 (owner: 10MaxSem) [00:39:15] not going to reattempt scap so late [00:39:45] oh, i didnt even understand it right first [00:39:52] i thought you meant it worked without having to revert [00:39:57] gotcha now [00:40:31] pffffffft. [00:40:42] ori: I think I see the bug in the scap.py file. Bad quoting / "interesting" behavior in scap-1 [00:40:51] yes, it uses $1 [00:40:59] so the scap proxies should be a single quoted arg [00:41:12] Yeah and utils.dsh('/usr/local/bin/scap-1 %s' % ' '.join(scap_proxies) [00:41:34] doesn't quote so only first server is seen in ping test [00:43:45] greg-g, I'm afraid that if I continue I will just make more mistakes [00:45:50] greg-g, thus I call this deployment failed and will reschedule it to a later date [00:46:13] (03CR) 10GWicke: [C: 031] "parsoid.py is no longer needed, so the removal was intentional. Forgot to remove the resource too, thanks for cleaning that up." [operations/puppet] - 10https://gerrit.wikimedia.org/r/112004 (owner: 10Alexandros Kosiaris) [00:46:26] greg-g: https://gerrit.wikimedia.org/r/#/c/112612/ explains the load spike on mw1010 earlier today [00:46:35] the cluster now has the extension but it's not enabled so it's safe for me to sleep [00:49:03] (03PS2) 10Alexandros Kosiaris: Remove parsoid.py file resource from deployment [operations/puppet] - 10https://gerrit.wikimedia.org/r/112004 [00:49:34] <^demon> Gerrit reboot in ~10 minutes. [00:49:42] hurries [00:56:17] <^demon> Gerrit upgrade done. [00:56:25] <^demon> !log gerrit upgraded from 2.8.1 stable to 2.8.1-1-g83098d0 (custom build) to work around mysql issue pending upstream release. 
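A minimal sketch of the quoting bug bd808 describes above. The utils.dsh call and scap_proxies are names taken from the conversation; the hostnames and the quote()-based fix are illustrative assumptions, not the actual patch referenced in change 112612.

```python
try:
    from shlex import quote   # Python 3
except ImportError:
    from pipes import quote   # Python 2

# Hypothetical proxy list; mw1010 is the host mentioned above.
scap_proxies = ['mw1010.eqiad.wmnet', 'mw1011.eqiad.wmnet']

# As described above, each proxy becomes its own shell word, so a helper
# script that only reads "$1" sees just the first proxy:
broken = '/usr/local/bin/scap-1 %s' % ' '.join(scap_proxies)
# -> /usr/local/bin/scap-1 mw1010.eqiad.wmnet mw1011.eqiad.wmnet

# Quoting the joined list delivers the whole proxy list as a single
# argument, i.e. all of it lands in "$1":
fixed = '/usr/local/bin/scap-1 %s' % quote(' '.join(scap_proxies))
# -> /usr/local/bin/scap-1 'mw1010.eqiad.wmnet mw1011.eqiad.wmnet'

print(broken)
print(fixed)
```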
[00:56:25] (03PS1) 10Dzahn: add account for Sahar Massachi (stat servers) [operations/puppet] - 10https://gerrit.wikimedia.org/r/112613 [00:56:34] Logged the message, Master [00:56:53] interesting rebase on https://gerrit.wikimedia.org/r/#/c/112004/1 [00:57:02] (03CR) 10jenkins-bot: [V: 04-1] add account for Sahar Massachi (stat servers) [operations/puppet] - 10https://gerrit.wikimedia.org/r/112613 (owner: 10Dzahn) [00:57:03] it turned into +/- 0 [00:58:28] (03PS2) 10Dzahn: add account for Sahar Massachi (stat servers) [operations/puppet] - 10https://gerrit.wikimedia.org/r/112613 [00:59:22] (03CR) 10Dzahn: [C: 04-1] "it's a noop after rebasing it, guess it's been done elsewhere and this can be abandoned" [operations/puppet] - 10https://gerrit.wikimedia.org/r/112004 (owner: 10Alexandros Kosiaris) [01:02:04] (03CR) 10Dzahn: [C: 04-2] "already done by Ryan in Change-Id: I74d63e9dfd516f7afc27f111d5919f6f81db813a" [operations/puppet] - 10https://gerrit.wikimedia.org/r/112004 (owner: 10Alexandros Kosiaris) [01:31:09] (03CR) 10Jforrester: "Good point. Whoops." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/112582 (owner: 10Jforrester) [01:31:24] (03PS2) 10Jforrester: Enable VisualEditor in opt-in mode for French Wikiversity [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/112582 [01:31:52] * James_F coughs. [01:46:48] (03CR) 10Tim Starling: [C: 031] scap: ensure local copy is up-to-date; chgrp wikidev [operations/puppet] - 10https://gerrit.wikimedia.org/r/112503 (owner: 10Ori.livneh) [01:47:37] (03CR) 10Ori.livneh: [C: 032] scap: ensure local copy is up-to-date; chgrp wikidev [operations/puppet] - 10https://gerrit.wikimedia.org/r/112503 (owner: 10Ori.livneh) [02:10:38] !log LocalisationUpdate completed (1.23wmf12) at 2014-02-11 02:10:38+00:00 [02:10:48] Logged the message, Master [02:19:13] !log LocalisationUpdate completed (1.23wmf13) at 2014-02-11 02:19:13+00:00 [02:19:21] Logged the message, Master [02:27:09] What was up with Wikipedia about 4 hours ago. I logged some interesting error messages. [02:33:27] T13|sleeps: db failed. [02:34:07] Ahh. Then most of those db error messages make sense. [02:34:26] Got depooled since. [02:40:12] !log LocalisationUpdate ResourceLoader cache refresh completed at 2014-02-11 02:40:11+00:00 [02:40:45] Logged the message, Master [02:43:37] !log restarting apache on servers with workers in a futex wait: mw1056,mw1082,mw1189,mw1196,mw1199,mw1201,mw1202,mw1203,mw1204,mw1208 [02:43:44] Logged the message, Master [02:44:06] TimStarling: That happened again? Hmm. [02:44:18] what do you mean again? [02:44:57] PROBLEM - Disk space on wtp1010 is CRITICAL: DISK CRITICAL - free space: / 343 MB (3% inode=70%): [02:45:45] it was only a few hours ago that springle was complaining about it, I assumed it was just unfixed since then [02:45:47] TimStarling: Email from Sean earlier on ops-l; he had noticed this on the first scap attempt [02:45:59] yes, he noticed it, but he didn't restart apache [02:46:04] Ah. My understanding is that it had been fixed since but happened again. [02:46:15] Nevermind, then. [02:48:28] hmm, didn't know it was happening on all those nodes [02:49:16] <^d> TimStarling: What apc bug are we hitting? Or is it the familiar cache-too-small problem? 
[02:50:16] I'm looking at an affected process on mw1184, it's in APCBagOStuff::set() [02:51:31] called from LanguageConverter::loadTables() [02:53:31] hmm, lots of things failing to get and lock the memc lock for doing the slave check at a certain time [02:53:35] 336 2014-02-10 22:31 [02:53:36] 1173 2014-02-10 22:32 [02:53:38] 312 2014-02-10 22:33 [02:53:53] gets that matches "22:32 springle: pt-kill jobs on s1 slaves killing anything sleeping longer than 10s " [02:54:33] springle: were connections to db1050 hanging or failing fast then? [02:54:40] there's no symbols for apc, so I'm not sure how to work out if it's full [02:55:04] they connected then slept, see ops list [02:55:09] if it was the former, it would makes sense that lots of boxes would have connections to other servers while hanging on that one (inside getLagTimes()) [02:55:58] AaronSchulz: hanging, at least some of the time [02:56:44] probably several things going wrong then [03:00:56] PROBLEM - LVS HTTP IPv6 on bits-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:01:07] PROBLEM - LVS HTTP IPv6 on mobile-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:01:09] PROBLEM - LVS HTTPS IPv6 on bits-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:01:16] PROBLEM - LVS HTTPS IPv6 on mobile-lb.esams.wikimedia.org_ipv6 is CRITICAL: Connection timed out [03:01:46] RECOVERY - LVS HTTP IPv6 on bits-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 3897 bytes in 0.162 second response time [03:01:57] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [03:01:57] RECOVERY - LVS HTTPS IPv6 on bits-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 3903 bytes in 0.761 second response time [03:01:59] RECOVERY - LVS HTTP IPv6 on mobile-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 27648 bytes in 0.863 second response time [03:02:07] RECOVERY - LVS HTTPS IPv6 on mobile-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 27705 bytes in 0.527 second response time [03:03:07] PROBLEM - Disk space on wtp1013 is CRITICAL: DISK CRITICAL - free space: / 351 MB (3% inode=70%): [03:03:46] PROBLEM - Disk space on wtp1011 is CRITICAL: DISK CRITICAL - free space: / 235 MB (2% inode=69%): [03:04:06] PROBLEM - Disk space on wtp1009 is CRITICAL: DISK CRITICAL - free space: / 287 MB (3% inode=70%): [03:04:06] PROBLEM - Disk space on wtp1017 is CRITICAL: DISK CRITICAL - free space: / 323 MB (3% inode=70%): [03:04:57] PROBLEM - Disk space on wtp1005 is CRITICAL: DISK CRITICAL - free space: / 261 MB (2% inode=70%): [03:05:59] springle: it really sucks that the locking breaks in getLagTimes() so much due to constant twemproxy errors [03:06:46] PROBLEM - Disk space on wtp1020 is CRITICAL: DISK CRITICAL - free space: / 94 MB (1% inode=70%): [03:07:06] PROBLEM - Parsoid on wtp1013 is CRITICAL: Connection refused [03:07:17] PROBLEM - Parsoid on wtp1011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:07:17] PROBLEM - Disk space on wtp1006 is CRITICAL: DISK CRITICAL - free space: / 292 MB (3% inode=70%): [03:07:56] PROBLEM - Parsoid on wtp1017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:07:57] PROBLEM - Parsoid on wtp1010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:08:07] PROBLEM - Disk space on wtp1021 is CRITICAL: DISK CRITICAL - free space: / 184 MB (2% inode=70%): [03:08:36] PROBLEM - Parsoid on wtp1009 is CRITICAL: 
Connection refused [03:08:46] PROBLEM - Disk space on wtp1023 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=70%): [03:08:56] PROBLEM - Disk space on wtp1012 is CRITICAL: DISK CRITICAL - free space: / 74 MB (0% inode=69%): [03:08:56] PROBLEM - Disk space on wtp1018 is CRITICAL: DISK CRITICAL - free space: / 125 MB (1% inode=70%): [03:09:46] PROBLEM - Disk space on wtp1016 is CRITICAL: DISK CRITICAL - free space: / 86 MB (0% inode=70%): [03:09:46] PROBLEM - Disk space on wtp1008 is CRITICAL: DISK CRITICAL - free space: / 104 MB (1% inode=79%): [03:09:46] PROBLEM - Disk space on wtp1014 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=70%): [03:09:56] PROBLEM - Parsoid on wtp1020 is CRITICAL: Connection refused [03:09:56] PROBLEM - Disk space on wtp1019 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=70%): [03:09:56] PROBLEM - Disk space on wtp1022 is CRITICAL: DISK CRITICAL - free space: / 163 MB (1% inode=70%): [03:10:06] PROBLEM - Parsoid on wtp1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:10:07] PROBLEM - Disk space on wtp1007 is CRITICAL: DISK CRITICAL - free space: / 33 MB (0% inode=70%): [03:10:47] PROBLEM - Disk space on wtp1024 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=70%): [03:10:56] PROBLEM - Parsoid on wtp1006 is CRITICAL: Connection refused [03:11:00] those look real. root full on wtp1008 [03:11:07] PROBLEM - Parsoid on wtp1021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:11:07] PROBLEM - Parsoid on wtp1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:11:36] PROBLEM - Parsoid on wtp1012 is CRITICAL: Connection refused [03:11:36] PROBLEM - Parsoid on wtp1019 is CRITICAL: Connection refused [03:11:46] PROBLEM - Parsoid on wtp1018 is CRITICAL: Connection refused [03:12:01] parsoid log [03:12:36] PROBLEM - Parsoid on wtp1022 is CRITICAL: Connection refused [03:12:46] PROBLEM - Parsoid on wtp1014 is CRITICAL: Connection refused [03:12:46] PROBLEM - Parsoid on wtp1008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:12:56] PROBLEM - Parsoid on wtp1024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:13:06] PROBLEM - Parsoid on wtp1016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:13:07] PROBLEM - Parsoid on wtp1007 is CRITICAL: Connection refused [03:13:26] PROBLEM - Disk space on wtp1015 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=70%): [03:13:46] RECOVERY - Disk space on wtp1008 is OK: DISK OK [03:13:48] Oh bah. Did the earlier deploy touch parsoid at all? [03:14:18] dunno. wto1008 was me trying a rm and parsoid restart [03:14:28] that best course? [03:14:36] PROBLEM - Parsoid on wtp1015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:15:16] PROBLEM - LVS HTTP IPv4 on parsoid.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:15:46] springle: keep an eye on that log, if it's growing too fast there may be something that'll just keep coming back [03:16:36] RECOVERY - Parsoid on wtp1008 is OK: HTTP OK: HTTP/1.1 200 OK - 970 bytes in 0.006 second response time [03:17:28] Looking at what's up with the Parsoid servers [03:18:20] springle: I'll go in and truncate all the Parsoid logs [03:18:20] RoanKattouw, prod seems mostly fine but I did get an error on preview in like 1 out of 20 tries. not sure it's related. 
[03:18:29] And save a copy of one of them for Gabriel [03:18:45] RoanKattouw: thank you [03:19:41] Eloquence: Well 6/24 backends are up according to icinga so yeah that's probably related [03:19:47] ok [03:19:54] Varnish seems to be holding things together [03:19:56] RECOVERY - Parsoid on wtp1021 is OK: HTTP OK: HTTP/1.1 200 OK - 970 bytes in 0.011 second response time [03:20:03] As parsoid-lb is fine but parsoid.svc is critical [03:20:07] RECOVERY - LVS HTTP IPv4 on parsoid.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 970 bytes in 0.003 second response time [03:20:11] holding things together until you save? :) [03:20:41] want me to page .. nm :) [03:20:55] !log Copying wtp1005's Parsoid log for future reference [03:21:04] Logged the message, Mr. Obvious [03:22:56] PROBLEM - Parsoid on wtp1021 is CRITICAL: Connection refused [03:23:34] !log Doing rolling restart of the Parsoid cluster [03:23:41] Logged the message, Mr. Obvious [03:24:46] RECOVERY - Parsoid on wtp1010 is OK: HTTP OK: HTTP/1.1 200 OK - 970 bytes in 0.004 second response time [03:26:16] PROBLEM - LVS HTTP IPv4 on parsoid.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:26:26] PROBLEM - Parsoid on wtp1001 is CRITICAL: Connection refused [03:26:32] Ahm, wat [03:26:46] PROBLEM - Parsoid on wtp1004 is CRITICAL: Connection refused [03:26:46] PROBLEM - Parsoid on wtp1002 is CRITICAL: Connection refused [03:27:26] RECOVERY - Parsoid on wtp1001 is OK: HTTP OK: HTTP/1.1 200 OK - 970 bytes in 0.005 second response time [03:27:57] PROBLEM - Parsoid on wtp1010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:28:47] RECOVERY - Disk space on wtp1024 is OK: DISK OK [03:28:56] RECOVERY - Parsoid on wtp1020 is OK: HTTP OK: HTTP/1.1 200 OK - 970 bytes in 0.007 second response time [03:28:56] RECOVERY - Parsoid on wtp1006 is OK: HTTP OK: HTTP/1.1 200 OK - 970 bytes in 0.008 second response time [03:28:56] RECOVERY - Disk space on wtp1010 is OK: DISK OK [03:28:56] RECOVERY - Disk space on wtp1019 is OK: DISK OK [03:28:56] RECOVERY - Parsoid on wtp1021 is OK: HTTP OK: HTTP/1.1 200 OK - 970 bytes in 0.006 second response time [03:28:56] RECOVERY - Disk space on wtp1012 is OK: DISK OK [03:28:57] RECOVERY - Disk space on wtp1018 is OK: DISK OK [03:28:57] RECOVERY - Parsoid on wtp1005 is OK: HTTP OK: HTTP/1.1 200 OK - 970 bytes in 0.005 second response time [03:28:58] RECOVERY - Disk space on wtp1022 is OK: DISK OK [03:28:58] RECOVERY - Disk space on wtp1005 is OK: DISK OK [03:28:59] RECOVERY - Parsoid on wtp1023 is OK: HTTP OK: HTTP/1.1 200 OK - 970 bytes in 0.013 second response time [03:29:06] RECOVERY - Parsoid on wtp1013 is OK: HTTP OK: HTTP/1.1 200 OK - 970 bytes in 0.007 second response time [03:29:07] RECOVERY - LVS HTTP IPv4 on parsoid.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 970 bytes in 0.006 second response time [03:29:08] RECOVERY - Parsoid on wtp1011 is OK: HTTP OK: HTTP/1.1 200 OK - 970 bytes in 0.007 second response time [03:29:09] RECOVERY - Disk space on wtp1017 is OK: DISK OK [03:29:09] RECOVERY - Disk space on wtp1009 is OK: DISK OK [03:29:09] RECOVERY - Disk space on wtp1013 is OK: DISK OK [03:29:09] RECOVERY - Disk space on wtp1007 is OK: DISK OK [03:29:10] RECOVERY - Disk space on wtp1021 is OK: DISK OK [03:29:10] RECOVERY - Parsoid on wtp1007 is OK: HTTP OK: HTTP/1.1 200 OK - 970 bytes in 0.005 second response time [03:29:16] RECOVERY - Disk space on wtp1006 is OK: DISK OK [03:29:19] * RoanKattouw wishes the documented restart script actually worked [03:29:26] RECOVERY - 
Parsoid on wtp1015 is OK: HTTP OK: HTTP/1.1 200 OK - 970 bytes in 0.003 second response time [03:29:36] RECOVERY - Parsoid on wtp1022 is OK: HTTP OK: HTTP/1.1 200 OK - 970 bytes in 0.002 second response time [03:29:36] RECOVERY - Parsoid on wtp1012 is OK: HTTP OK: HTTP/1.1 200 OK - 970 bytes in 0.005 second response time [03:29:36] RECOVERY - Parsoid on wtp1009 is OK: HTTP OK: HTTP/1.1 200 OK - 970 bytes in 0.004 second response time [03:29:36] RECOVERY - Parsoid on wtp1019 is OK: HTTP OK: HTTP/1.1 200 OK - 970 bytes in 0.004 second response time [03:29:46] RECOVERY - Parsoid on wtp1004 is OK: HTTP OK: HTTP/1.1 200 OK - 970 bytes in 0.006 second response time [03:29:46] RECOVERY - Parsoid on wtp1017 is OK: HTTP OK: HTTP/1.1 200 OK - 970 bytes in 0.006 second response time [03:29:46] RECOVERY - Parsoid on wtp1024 is OK: HTTP OK: HTTP/1.1 200 OK - 970 bytes in 0.010 second response time [03:29:46] RECOVERY - Parsoid on wtp1018 is OK: HTTP OK: HTTP/1.1 200 OK - 970 bytes in 0.003 second response time [03:29:46] RECOVERY - Disk space on wtp1011 is OK: DISK OK [03:29:47] RECOVERY - Disk space on wtp1020 is OK: DISK OK [03:29:47] RECOVERY - Parsoid on wtp1002 is OK: HTTP OK: HTTP/1.1 200 OK - 970 bytes in 0.003 second response time [03:29:47] RECOVERY - Disk space on wtp1023 is OK: DISK OK [03:29:48] RECOVERY - Disk space on wtp1016 is OK: DISK OK [03:29:48] RECOVERY - Parsoid on wtp1014 is OK: HTTP OK: HTTP/1.1 200 OK - 970 bytes in 0.005 second response time [03:29:49] RECOVERY - Disk space on wtp1014 is OK: DISK OK [03:29:49] RECOVERY - Parsoid on wtp1010 is OK: HTTP OK: HTTP/1.1 200 OK - 970 bytes in 0.003 second response time [03:29:57] !log Managed to restart Parsoid cleanly in the end. Turns out dsh -g parsoid service restart parsoid doesn't work but dsh -cM -g parsoid /etc/init.d/parsoid restart does work [03:30:04] Logged the message, Mr. Obvious [03:30:27] shiny, thanks RoanKattouw [03:30:32] do we know why it died in the first place? [03:31:02] Eloquence: Log file growth [03:31:09] We still don't have any log rotation for Parsoid at all [03:31:09] :P [03:31:41] So the log files grow, and as they grow mostly uniformly and the machines are mostly identical, all machines run out of disk space around the same time [03:31:56] RECOVERY - Parsoid on wtp1016 is OK: HTTP OK: HTTP/1.1 200 OK - 970 bytes in 0.008 second response time [03:32:03] There's something weird on wtp1016 where it's still not up and using a lot more disk than it should [03:32:07] Oh look there it is [03:32:17] Still using many gigabytes more than its siblings though [03:32:41] Actually, wtf [03:33:01] Why are almost all wtp* machines using 88-90% of their disks even with freshly cleared logs? [03:33:11] Except for wtp1008 somehow [03:33:26] gwicke, if you're around ^^ ? [03:33:45] Also, the log level is very verbose [03:33:55] starting parsing of itwiki:Stura_di_Demonte [03:33:56] completed parsing of svwiki:Sennevoy-le-Bas in 74760 ms [03:34:09] This means I now already have ~150 KB of logs per backend [03:34:16] In like 5 minutes [03:34:48] RoanKattouw: wtp1008 was the one i messed with manually earlier ^^, before you came to save the day [03:34:56] Right [03:34:58] Still [03:35:17] springle: Did you do anything to free up disk space other than clear /var/log/parsoid/nohup.out ? 
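Extrapolating the ~150 KB per 5 minutes figure quoted above (assuming that rate is roughly typical, which it may not be) shows why backends with only a few hundred MB of free root space all fill up within days of each other:

```python
# Rough back-of-the-envelope growth of the Parsoid log, assuming the
# ~150 KB per 5 minutes rate quoted above holds steady.
kb_per_5min = 150.0
mb_per_day = kb_per_5min * (60 / 5) * 24 / 1024
print(round(mb_per_day, 1))       # ~42.2 MB per day
print(round(mb_per_day * 7, 1))   # ~295 MB per week, per backend
```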
[03:35:34] it's /var/log/parsoid/parsoid.log [03:35:45] and i removed it, and restarted the service [03:35:58] the other boxes still have large parsoid.log [03:36:02] wtf [03:36:17] Are there two different init scripts writing to two different files or something? [03:36:49] * springle doesn't know [03:36:52] :) [03:37:08] OK yeah that log file is the culprit alright [03:37:31] And it's undocumented, awesome [03:39:47] !log Freed up disk space on wtp* by blanking /var/log/parsoid/parsoid.log [03:39:48] RECOVERY - Disk space on wtp1015 is OK: DISK OK [03:39:49] OK now I probably have to restart the service again, to make it start the right way [03:39:56] Logged the message, Mr. Obvious [03:40:55] RoanKattouw: looks like parsoid.log arrived in https://gerrit.wikimedia.org/r/#/c/107492/ [03:43:34] no, https://gerrit.wikimedia.org/r/#/c/99656/ [03:43:44] along with logrotate stuff.. heh [03:45:03] Hah [03:45:06] Not good enough then? [03:46:02] (03CR) 10Catrope: beta: manage parsoid using upstart (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/99656 (owner: 10Hashar) [03:47:24] (03CR) 10Catrope: beta: manage parsoid using upstart (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/99656 (owner: 10Hashar) [03:47:28] Hah, log rotation is broken [03:47:36] PROBLEM - Parsoid on wtp1008 is CRITICAL: Connection refused [03:47:39] The puppet code that generates the logrotate.d file generates it wrong [03:48:36] RECOVERY - Parsoid on wtp1008 is OK: HTTP OK: HTTP/1.1 200 OK - 970 bytes in 0.003 second response time [03:50:46] PROBLEM - Parsoid on wtp1018 is CRITICAL: Connection refused [03:50:56] PROBLEM - Parsoid on wtp1023 is CRITICAL: Connection refused [03:51:46] RECOVERY - Parsoid on wtp1018 is OK: HTTP OK: HTTP/1.1 200 OK - 970 bytes in 0.007 second response time [03:51:57] RECOVERY - Parsoid on wtp1023 is OK: HTTP OK: HTTP/1.1 200 OK - 970 bytes in 0.003 second response time [03:58:16] RoanKattouw, fixed in the deb ;) [03:59:34] the quick fix would be to remove the variable and just insert the path [03:59:43] as in the package [03:59:50] gwicke: Where can I find the commit that fixes it? [04:00:10] there is no commit for puppet yet [04:00:32] Are you saying the deb source isn't versioned? [04:00:51] deb package != puppet [04:01:19] let me fix puppet too [04:01:30] Well ,sure [04:01:34] But what is the deb package built from? [04:01:40] Hopefully it's built from things that are versioned somewhere? [04:01:54] sure, the deploy repo [04:02:00] https://gerrit.wikimedia.org/r/#/c/110666/ [04:02:39] Aha [04:02:57] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000 [04:03:10] I was just about to reboot my machine after a kernel upgrade, so now reopening everything.. [04:05:06] (03PS1) 10GWicke: Fix parsoid log rotation [operations/puppet] - 10https://gerrit.wikimedia.org/r/112637 [04:06:06] (03PS2) 10GWicke: Fix parsoid log rotation [operations/puppet] - 10https://gerrit.wikimedia.org/r/112637 [04:06:46] (03CR) 10Catrope: [C: 031] Fix parsoid log rotation [operations/puppet] - 10https://gerrit.wikimedia.org/r/112637 (owner: 10GWicke) [04:10:04] (03PS3) 10GWicke: Fix parsoid log rotation [operations/puppet] - 10https://gerrit.wikimedia.org/r/112637 [04:10:29] anybody with +2 powers around? [04:11:25] springle: You maybe? 
---^^ [04:11:49] gwicke: I'm about to go home and eat, but after I do so, I'll write an outage report and send it to ops-l, if your commit isn't merged by then that'll hopefully get someone's attention [04:12:18] RoanKattouw, k [04:12:25] were all hosts down? [04:12:57] (03CR) 10Springle: [C: 032] Fix parsoid log rotation [operations/puppet] - 10https://gerrit.wikimedia.org/r/112637 (owner: 10GWicke) [04:13:04] It looks like 5 of them had magically shorter logs and stayed up [04:13:13] Sean fixed wtp1008 so that made for 6 backends up [04:13:28] But we have a depool threshold that's greater than 6/24 , so LVS was sending requests to broken hosts [04:13:47] meh [04:13:55] Also, the restart command change was documented on wikitech, but: [04:14:02] * the command as documented there didn't actually work [04:14:23] * the log file location change was undocumented, so I was scratching my head as to why the disk space wasn't freeing up until Sean pointed me to the other log file [04:14:51] it is no longer in nohup.out.. [04:14:58] Yeah, I know that now [04:15:02] But the wikitech page still claims it is [04:15:14] And the old /etc/init.d scripts are still there and still work [04:15:17] I made a first pass on that earlier today [04:15:35] In fact, I used them to get the service back up because the first restart command I tried took all hosts down [04:15:48] That was dsh -g parsoid service parsoid restart [04:15:53] we only got the current code rolled out Thursday or so, after Ryan manually worked around broken Trebuchet [04:15:58] RoanKattouw: You can't merge operations/puppet? [04:16:21] I then ran that command on groups of 8 hosts at a time, and every time 1 or2 of them broke and had to be restarted again [04:16:24] So the restarts aren't very reliable [04:16:34] did you use service restart? [04:16:40] Yes [04:16:42] that will *not* use the init.d script [04:16:44] But sometimes with service restart, the service doesn't restart correctly [04:16:46] I know that [04:16:49] upstart takes precedence [04:16:58] So to clean up after myself, I did /etc/init.d/parsoid stop ; service parsoid restart [04:17:05] And in every batch of 8 hosts, 1 or 2 would break [04:17:09] And I'd have to restart those again [04:17:29] The upstart version would be running and the init version wouldn't be running, and everything would look fine, but traffic just wouldn't reach it somehow [04:17:44] * Gloria git reviews. [04:17:46] I suppose I should have kept a broken backend around for research [04:17:47] (03PS1) 10MZMcBride: Remove ArticleFeedbackv5 from Wikimedia wikis [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/112639 [04:17:52] we recently switched from init.d to upstart (last week) and wanted to make sure everything is working before removing the init.d script [04:18:01] Maybe if I restart the same backend over and over again I can get it to break [04:18:14] Right, OK that makes sense why the docs and presence of things were inconsistent [04:18:18] As we're in a transitional phase [04:18:32] I could say that it is all documented in puppet, but I know just as well how much it sucks to go through that.. [04:19:01] Yeah I really don't have the time or inclination to go read puppet code when a service is down [04:19:16] Gloria: I do have +2 on that repo according to the ACL, but not socially [04:19:40] Roans are not to be trusted. 
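A toy illustration of the depool-threshold behaviour described above; this is not the load balancer's actual code, and the function name and the 50% threshold are assumptions. The point is that when more backends fail than the threshold allows to be depooled, broken hosts stay pooled and keep receiving traffic.

```python
def pooled_backends(backends, healthy, depool_threshold):
    """Keep at least depool_threshold of the backends pooled, even if
    that means leaving unhealthy ones in rotation."""
    min_pooled = int(len(backends) * depool_threshold)
    pooled = [b for b in backends if b in healthy]
    for b in backends:                 # top up with unhealthy hosts if needed
        if len(pooled) >= min_pooled:
            break
        if b not in pooled:
            pooled.append(b)
    return pooled

backends = ['wtp10%02d' % i for i in range(1, 25)]   # 24 Parsoid backends
healthy = set(backends[:6])                          # only 6 survived, as above
print(len(pooled_backends(backends, healthy, 0.5)))  # 12 pooled, half of them broken
```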
[04:19:55] ops was too lazy to split LDAP groups just for me, so they had to give me +2 in ops/puppet as collateral damage, and just told me not to use it [04:19:56] we are slowly cleaning up at least [04:20:15] Of course I've also been a Gerrit admin for years so I was already able to give myself +2 on any repo in theory [04:20:24] You know, in case I wanted to get in trouble with my employer :) [04:21:23] gwicke: Anyway I worry about this upstart issue, I'll play with it a bit more later tonight [04:21:25] RoanKattouw, if you used the init script then issues are not surprising [04:21:25] (03CR) 10MZMcBride: "This change isn't ready to be merged yet." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/112639 (owner: 10MZMcBride) [04:21:53] gwicke: Sure. I think I did things correctly so as to not have interference with the init script, but that could be related [04:22:00] Also, the cluster broke the first time I tried to restart it [04:22:09] with service restart? [04:22:13] using upstart [04:22:21] Yup [04:22:28] anything in the logs? [04:22:29] I did a service restart via dsh, then two of the six hosts that were up went down [04:22:40] Didn't check, they'll have gotten wiped [04:23:13] Given that this happened for about 1 in every 4 machines when I did my last wave of restarts, when I get home tonight I'll just grab 8 backends and restart them a couple times and see if I can reproduce [04:23:17] mutante reported odd restart behavior earlier today, but we put that down to having to nuke the repo from under the process to work around broken trebuchet and then calling restart [04:23:42] he said that he had to call restart twice on some machines [04:23:48] Yeah [04:23:51] I had the same experience [04:23:56] So if the probability of failure really is 1/4 [04:24:20] Then the probability that the entire cluster will restart correctly is 0.1% [04:24:43] So if I take 8 machines and restart them 3 times (so as to not endanger the health of the entire cluster) [04:24:49] just took a statistics course? ;) [04:25:02] the probability of me observing a failure is 99.9% [04:25:05] A while ago, but yes :) [04:25:21] mine is a few years back already.. [04:25:30] And for 1/8 it's 96% [04:25:49] anyway, back up your 1/4 ;) [04:25:51] So, I'll sit down tonight and prod a couple boxes, and if there's anything wrong with upstart I should discover that really quickly [04:26:28] I tweaked a few things in the packaged upstart script [04:27:50] trying a few restarts with that on ruthenium now.. [04:28:00] OK [04:28:03] Let me know how it goes [04:28:07] I'll be back in 30-60 mins [04:34:59] aha! after around 30 'service parsoid restart's I managed to get two parsoid instances running at the same time [04:41:01] or so it seemed [04:47:57] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [04:50:12] PROBLEM - LVS HTTPS IPv6 on bits-lb.ulsfo.wikimedia.org_ipv6 is CRITICAL: Connection timed out [04:50:12] PROBLEM - LVS HTTPS IPv6 on text-lb.ulsfo.wikimedia.org_ipv6 is CRITICAL: Connection timed out [04:50:57] RECOVERY - LVS HTTPS IPv6 on bits-lb.ulsfo.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 3903 bytes in 0.402 second response time [04:50:58] RECOVERY - LVS HTTPS IPv6 on text-lb.ulsfo.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 64811 bytes in 0.630 second response time [04:52:13] hm [04:56:35] gwicke: Ha! Did that result in breakage, too? 
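The restart arithmetic a few lines above checks out; a quick sketch of the numbers, with p and n as stated in the conversation:

```python
p = 0.25   # assumed per-restart failure probability
n = 24     # whole cluster, or 8 machines restarted 3 times each

print((1 - p) ** n)          # ~0.001 -> ~0.1% chance every restart succeeds
print(1 - (1 - p) ** n)      # ~0.999 -> ~99.9% chance of observing a failure
print(1 - (1 - 1/8.0) ** n)  # ~0.96  -> the "for 1/8 it's 96%" figure
```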
[04:56:37] I guess if not then it'll probably break on the next restart after that [04:58:24] i was not able to reproduce it since [04:58:40] I switched from grep node to grep -c node [04:58:46] and the count actually stayed constant [04:58:58] another 300 restarts in now [04:59:30] so I likely overlooked something in the verbose output when I thought I had seen two copies [04:59:42] I'd say the package version works fine [05:00:14] https://gerrit.wikimedia.org/r/#/c/110666/25/debian/parsoid.upstart [05:01:06] Getting a 503 from en.wikipedia.org [05:02:38] let me remove the transitional init.d stop bit [05:02:56] PROBLEM - LVS HTTP IPv6 on text-lb.ulsfo.wikimedia.org_ipv6 is CRITICAL: Connection timed out [05:02:58] PROBLEM - Packetloss_Average on erbium is CRITICAL: packet_loss_average CRITICAL: 9.46399239583 [05:03:07] PROBLEM - LVS HTTPS IPv6 on bits-lb.ulsfo.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:03:09] PROBLEM - LVS HTTPS IPv6 on text-lb.ulsfo.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:03:26] PROBLEM - LVS HTTPS IPv6 on upload-lb.ulsfo.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:03:46] PROBLEM - LVS HTTP IPv6 on mobile-lb.ulsfo.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:03:56] PROBLEM - LVS HTTPS IPv6 on mobile-lb.ulsfo.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:03:58] RECOVERY - LVS HTTPS IPv6 on bits-lb.ulsfo.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 3893 bytes in 0.428 second response time [05:04:06] RECOVERY - LVS HTTPS IPv6 on text-lb.ulsfo.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 64689 bytes in 0.638 second response time [05:05:16] PROBLEM - Varnish traffic logger on cp4010 is CRITICAL: Timeout while attempting connection [05:05:16] PROBLEM - Varnish traffic logger on cp4007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:05:16] PROBLEM - DPKG on cp4002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:05:16] PROBLEM - Disk space on cp4016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:05:16] PROBLEM - puppet disabled on cp4020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:05:17] PROBLEM - LVS HTTPS IPv4 on bits-lb.ulsfo.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:05:18] PROBLEM - HTTPS on cp4019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:05:18] PROBLEM - HTTPS on cp4020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:05:18] PROBLEM - LVS HTTP IPv4 on text-lb.ulsfo.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:05:19] PROBLEM - HTTPS on cp4016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:05:26] RECOVERY - LVS HTTPS IPv6 on upload-lb.ulsfo.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 573 bytes in 2.801 second response time [05:05:27] PROBLEM - RAID on lvs4001 is CRITICAL: Timeout while attempting connection [05:05:27] PROBLEM - LVS HTTPS IPv4 on text-lb.ulsfo.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:05:29] PROBLEM - puppet disabled on cp4011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[05:05:36] RECOVERY - LVS HTTP IPv6 on mobile-lb.ulsfo.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 27814 bytes in 0.252 second response time [05:05:37] PROBLEM - LVS HTTPS IPv4 on mobile-lb.ulsfo.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:05:40] PROBLEM - RAID on cp4003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:05:46] PROBLEM - Disk space on cp4004 is CRITICAL: Timeout while attempting connection [05:05:46] RECOVERY - LVS HTTPS IPv6 on mobile-lb.ulsfo.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 27866 bytes in 0.515 second response time [05:05:48] RECOVERY - LVS HTTP IPv6 on text-lb.ulsfo.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 64684 bytes in 0.398 second response time [05:05:56] PROBLEM - Varnishkafka log producer on cp4003 is CRITICAL: Timeout while attempting connection [05:05:56] PROBLEM - HTTPS on cp4001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:05:56] PROBLEM - DPKG on cp4001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:05:57] PROBLEM - RAID on lvs4002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:05:57] PROBLEM - puppet disabled on cp4018 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:05:57] PROBLEM - Varnish traffic logger on cp4016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:05:57] PROBLEM - puppet disabled on cp4002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:05:58] PROBLEM - Varnish traffic logger on cp4006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:05:58] PROBLEM - Disk space on cp4017 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:06:03] ugh [05:06:07] RECOVERY - Varnish traffic logger on cp4010 is OK: PROCS OK: 2 processes with command name varnishncsa [05:06:07] RECOVERY - Disk space on cp4016 is OK: DISK OK [05:06:07] RECOVERY - Varnish traffic logger on cp4007 is OK: PROCS OK: 2 processes with command name varnishncsa [05:06:07] RECOVERY - HTTPS on cp4019 is OK: OK - Certificate will expire on 01/20/2016 12:00. [05:06:07] RECOVERY - puppet disabled on cp4020 is OK: OK [05:06:08] RECOVERY - HTTPS on cp4020 is OK: OK - Certificate will expire on 01/20/2016 12:00. [05:06:09] RECOVERY - DPKG on cp4002 is OK: All packages OK [05:06:09] RECOVERY - HTTPS on cp4016 is OK: OK - Certificate will expire on 01/20/2016 12:00. [05:06:09] RECOVERY - LVS HTTPS IPv4 on bits-lb.ulsfo.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 3903 bytes in 0.389 second response time [05:06:10] RECOVERY - LVS HTTP IPv4 on text-lb.ulsfo.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 64683 bytes in 0.381 second response time [05:06:10] Looks like ulsfo is falling over? [05:06:11] this sucks :\ [05:06:16] RECOVERY - RAID on lvs4001 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [05:06:26] RECOVERY - puppet disabled on cp4011 is OK: OK [05:06:30] hey I found an android bug. if the rate of inbound texts is too high, I can't unlock my phone screen to read them, because each incoming message resets the process of unlocking it :P [05:06:46] RECOVERY - Disk space on cp4004 is OK: DISK OK [05:06:46] PROBLEM - DPKG on cp4006 is CRITICAL: Timeout while attempting connection [05:06:46] RECOVERY - HTTPS on cp4001 is OK: OK - Certificate will expire on 01/20/2016 12:00. 
[05:06:56] RECOVERY - Varnish traffic logger on cp4016 is OK: PROCS OK: 2 processes with command name varnishncsa [05:06:56] RECOVERY - RAID on lvs4002 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [05:06:56] RECOVERY - Varnishkafka log producer on cp4003 is OK: PROCS OK: 1 process with command name varnishkafka [05:06:56] RECOVERY - puppet disabled on cp4002 is OK: OK [05:06:56] RECOVERY - Varnish traffic logger on cp4006 is OK: PROCS OK: 2 processes with command name varnishncsa [05:06:57] RECOVERY - Disk space on cp4017 is OK: DISK OK [05:06:57] RECOVERY - DPKG on cp4001 is OK: All packages OK [05:07:16] RECOVERY - LVS HTTPS IPv4 on text-lb.ulsfo.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 64690 bytes in 0.643 second response time [05:07:20] hahaha [05:07:26] RECOVERY - LVS HTTPS IPv4 on mobile-lb.ulsfo.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 27864 bytes in 0.475 second response time [05:07:28] RECOVERY - RAID on cp4003 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [05:07:36] RECOVERY - DPKG on cp4006 is OK: All packages OK [05:07:45] I wonder why I'm not getting paged [05:07:46] RECOVERY - puppet disabled on cp4018 is OK: OK [05:08:07] I also didn't get a page when parsoid.svc.eqiad.wmnet (but not parsoid-lb) went CRITICAL earlier [05:08:08] Did anyone else? [05:08:14] yes [05:08:50] OK so that did page [05:08:54] But paging is just broken for me somehow [05:09:01] I did get paged at 2pm for today's main outage [05:11:20] it's because you're set for CET time zone [05:11:39] (03PS1) 10GWicke: Remove old parsoid init script [operations/puppet] - 10https://gerrit.wikimedia.org/r/112640 [05:13:11] Oh! [05:13:13] hah [05:13:19] jgage: Could you move me back to PST? [05:13:24] I'll be in PST for the next 3 months or so [05:13:28] sure [05:13:35] I guess I asked someone to move me to CET back in December and never set it back [05:13:55] To be fair I went to AWDT afterwards, then PST, then CET, then PST again [05:14:03] But now I'm stable in PST [05:16:36] PROBLEM - Packetloss_Average on oxygen is CRITICAL: packet_loss_average CRITICAL: 9.52673494845 [05:16:56] RECOVERY - Packetloss_Average on erbium is OK: packet_loss_average OKAY: 1.83953969388 [05:18:00] after 500 restarts the packaged parsoid is still at 20 workers on a 16 core machine [05:19:15] But does it respond to HTTP? [05:19:19] Did you test that after each restart? [05:19:43] RoanKattouw, I'm suspecting that the transitional /etc/init.d/parsoid stop call caused issues even though no init-based parsoid should be running [05:20:15] Then why did it also fail the first time (before I'd used /etc/init.d) and why did mutante have problems [05:20:44] that call was in the upstart config for the transition from init.d [05:21:20] see https://gerrit.wikimedia.org/r/#/c/112640/ [05:21:25] aha [05:21:51] is there a saner way to do this kind of thing in puppet btw? [05:22:10] Do what in puppet? 
[05:22:30] switching from one init system to another [05:22:49] killing the old version when the new one is about to be installed [05:22:56] then removing the old init script [05:23:22] Probably not [05:23:25] something like debian postrm scripts [05:23:32] I've never known puppet to be good at transitional anything [05:23:59] just getting rid of the old init file is a two-step process [05:24:14] first ensure => absent, then remove the stanza when that has run [05:26:36] RECOVERY - Packetloss_Average on oxygen is OK: packet_loss_average OKAY: 1.62411907216 [05:37:53] ok RoanKattouw, confirmed you are switched to PST/PDT for icinga alerts [05:38:54] Sweet thanks [05:49:56] fwiw, the puppet init script works fine on a test host too [05:50:06] eh, upstart config [06:09:12] (03CR) 10GWicke: "When deploying earlier today Daniel observed that some parsoid processes needed to be restarted twice to come up correctly. We put this do" [operations/puppet] - 10https://gerrit.wikimedia.org/r/112640 (owner: 10GWicke) [06:12:20] gwicke: so restart works properly when using upstart? [06:12:32] Ryan_Lane: Not in my experience [06:12:36] with the service-restart command? [06:12:37] What I just noticed while writing the post mortem: [06:12:45] The first restart that I did didn't clear the log file [06:12:49] the init script definitely didn't work [06:12:55] because it was simply broken [06:12:58] Because disk utilization was still at 88%-90% after I restarted using the init script [06:13:16] because that had cleared nohup.out but not parsoid.log [06:13:26] erm, what is " 273 fault (11) " in fatalmonitor? [06:13:29] right, we know the init script is broken [06:13:35] the upstart should work, though [06:13:37] Ryan_Lane: No, not really [06:13:44] The init script *saved* us initially [06:13:58] I'm writing a post mortem, but the rough order of events was: [06:14:10] 1) disk space exhaustion, 19/24 backends die within 6 minutes of each other [06:14:18] 2) Sean manually fixes wtp1008 [06:14:30] (by removing the log file and restarting the service) [06:14:31] restarting is not supposed to clear the log file [06:14:33] what caused disk space exhaustion? [06:14:41] Broken logrotate config [06:14:43] log rotate should clear the log right? [06:14:43] running the log rotation manually will help though [06:14:44] ah [06:14:59] gwicke: Hah, that explains why the restart didn't work the first time [06:15:00] it is set to only run daily otherwise [06:15:27] the init script definitely makes it impossible to use the service-restart command [06:15:27] 3) I do a rolling restart using sequential dsh as documented on wikitech; more backends die [06:15:32] similar with dsh [06:15:41] it doesn't exit properly [06:15:45] I know [06:15:54] For #3 I used 'service parsoid restart' as the dsh command [06:15:59] Which does exit properly [06:16:09] * Ryan_Lane nods [06:16:19] sorry, but the number in the fatalmonitor is growing. Is it anything major? 
About to look deeper [06:16:45] 4) I restart the service using the old init script, this clears nohup.out which was 10% of disk space, all backends come back up [06:17:07] 5) Sean tells me the (undocumented) new location of the log file, I figure out what's going on with init script vs upstart and clean up [06:17:16] RoanKattouw, there is no mention of /etc/init.d/parsoid in the docs [06:17:21] only service parsoid restart [06:17:28] Where "clean up" meant /etc/init.d/parsoid stop ; service parsoid restart [06:17:37] gwicke: What I meant by undocumented was /var/log/parsoid/parsoid.log [06:17:43] That's not mentioned in the docs, only nohup.out is [06:17:51] yeah, that's now fixed [06:17:51] We had this discussion earlier, the docs are in an in-between state [06:17:59] btw, we should delete nohup.out [06:17:59] or were at the time, at least [06:18:04] and you explained why [06:18:04] could you do so? [06:18:06] Yeah we should [06:18:09] Sure [06:18:11] the only thing that sucks about 'service' is that it delegates to one of only two possible init systems [06:18:15] i hope that with systemd it'll be three [06:18:29] that should make debugging fun [06:18:50] ori: https://gerrit.wikimedia.org/r/#/c/112640/ ;) [06:18:50] Anyway, after stopping the init service and starting the upstart service, about 1/4 didn't come up correctly, so I had to restart those again [06:19:08] This seems to have been experienced before while migrating from init to upstart, and Gabriel hasn't been able to reproduce it since [06:19:24] I'm guessing that it was the log file still being too large [06:19:37] which makes sense as you didn't truncate it [06:20:27] there are 300+ apache2[2392]: [notice] child pid 9298 exit signal Segmentation fault (11) -- style errors. Sorry if this is a known issue [06:20:36] yurik: That just means PHP is segfaulting a lot [06:20:45] Which is not good but probably not urgent [06:20:52] RoanKattouw, ok, thx [06:20:54] Probably Yet Another PRCE Bug [06:21:17] gwicke: Yeah probably what happened with my first restart is nothing [06:21:36] Then a few minutes later, a couple of other machines organically ran out of disk space and broke, and I blamed that on the restart [06:22:20] the load balancing looks pretty good with so many machines running out of space close to each other after a week [06:22:23] what are you guys writing into those log files anyway? [06:22:35] /var/log/parsoid/ [06:22:51] the puppetized version of the logrotate.d config didn't work [06:23:42] ori: Lots and lots and lots of CACHE MISS: \n Started parsing \n Finished parsing in