[02:06:54] ori, https://gerrit.wikimedia.org/r/#/c/123830/ is ready, and it will even ask for your username on the first start [02:12:13] PROBLEM - Disk space on virt0 is CRITICAL: DISK CRITICAL - free space: /a 3234 MB (3% inode=99%): [02:19:13] PROBLEM - Disk space on virt0 is CRITICAL: DISK CRITICAL - free space: /a 3390 MB (3% inode=99%): [02:29:51] !log LocalisationUpdate completed (1.23wmf22) at 2014-04-21 02:29:49+00:00 [02:29:59] Logged the message, Master [02:42:32] !log LocalisationUpdate completed (1.24wmf1) at 2014-04-21 02:42:30+00:00 [02:42:39] Logged the message, Master [03:00:13] RECOVERY - Disk space on virt0 is OK: DISK OK [03:12:14] PROBLEM - Apache HTTP on mw1159 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:12:14] PROBLEM - Apache HTTP on mw1153 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:12:23] PROBLEM - Apache HTTP on mw1157 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:12:43] PROBLEM - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:13:03] RECOVERY - Apache HTTP on mw1159 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 809 bytes in 0.054 second response time [03:13:14] RECOVERY - Apache HTTP on mw1157 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 809 bytes in 0.061 second response time [03:13:33] RECOVERY - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 69615 bytes in 0.239 second response time [03:14:03] RECOVERY - Apache HTTP on mw1153 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 809 bytes in 0.066 second response time [03:15:13] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [03:20:26] what was that? [03:26:39] !log ap_busy_workers spike on image scalers eqiad, started ~2:55, subsided around ~3:20 [03:26:45] Logged the message, Master [03:26:59] wow, fatals are high [03:30:20] !log LocalisationUpdate ResourceLoader cache refresh completed at Mon Apr 21 03:30:15 UTC 2014 (duration 30m 14s) [03:30:26] Logged the message, Master [03:32:57] !log 5.5k fatals over last 20 hrs, of which 3.5k are calls to doTransform() on a non-object at TimedMediaThumbnail.php:201, and 0.9k are Lua API OOMs at LuaSandbox/Engine.php:264 [03:33:03] Logged the message, Master [03:37:31] tgr: hello [03:38:17] http://commons.wikimedia.org/w/thumb_handler.php/2/2c/Closed_Friedmann_universe_zero_Lambda.ogg/220px--Closed_Friedmann_universe_zero_Lambda.ogg.jpg fatals [03:40:34] all 3.5k fatals are for that URL [03:55:33] !log reset pc100* slaves previously replicating from pmtpa [03:55:39] Logged the message, Master [04:13:13] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000 [06:04:13] PROBLEM - Apache HTTP on mw1154 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:04:23] PROBLEM - Apache HTTP on mw1155 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:05:14] PROBLEM - Apache HTTP on mw1160 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:05:14] PROBLEM - Apache HTTP on mw1159 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:05:14] PROBLEM - Apache HTTP on mw1153 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:05:23] PROBLEM - Apache HTTP on mw1157 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:06:13] RECOVERY - Apache HTTP on mw1159 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 809 bytes in 4.212 second response time [06:06:13] RECOVERY - Apache HTTP on mw1157 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 809 bytes 
in 0.046 second response time [06:06:23] RECOVERY - Apache HTTP on mw1155 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 809 bytes in 7.624 second response time [06:06:53] PROBLEM - Apache HTTP on mw1156 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:07:03] RECOVERY - Apache HTTP on mw1153 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 809 bytes in 0.078 second response time [06:07:03] RECOVERY - Apache HTTP on mw1160 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 809 bytes in 0.071 second response time [06:07:14] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [06:07:43] RECOVERY - Apache HTTP on mw1156 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 809 bytes in 0.065 second response time [06:08:03] RECOVERY - Apache HTTP on mw1154 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 809 bytes in 0.066 second response time [06:47:34] <_joe_> mh what was that now [07:15:16] (03CR) 10TTO: New namespace aliases for lt test project on BetaWV [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/127278 (owner: 10Gerrit Patch Uploader) [08:04:13] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000 [08:25:13] PROBLEM - Apache HTTP on mw1154 is CRITICAL: Connection timed out [08:25:14] PROBLEM - Apache HTTP on mw1160 is CRITICAL: Connection timed out [08:25:23] PROBLEM - Apache HTTP on mw1157 is CRITICAL: Connection timed out [08:25:23] PROBLEM - Apache HTTP on mw1155 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:25:33] PROBLEM - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is CRITICAL: Connection timed out [08:25:53] PROBLEM - Apache HTTP on mw1156 is CRITICAL: Connection timed out [08:26:13] PROBLEM - Apache HTTP on mw1153 is CRITICAL: Connection timed out [08:26:16] <_joe_> mmh let me see [08:26:33] PROBLEM - Apache HTTP on mw1158 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:27:13] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [08:27:14] PROBLEM - Apache HTTP on mw1159 is CRITICAL: Connection timed out [08:32:03] RECOVERY - Apache HTTP on mw1159 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 809 bytes in 0.067 second response time [08:32:03] RECOVERY - Apache HTTP on mw1153 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 809 bytes in 1.034 second response time [08:32:14] RECOVERY - Apache HTTP on mw1154 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 809 bytes in 8.853 second response time [08:32:14] RECOVERY - Apache HTTP on mw1157 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 809 bytes in 3.515 second response time [08:32:33] RECOVERY - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 70996 bytes in 0.370 second response time [08:32:35] RECOVERY - Apache HTTP on mw1158 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 809 bytes in 5.309 second response time [08:32:43] RECOVERY - Apache HTTP on mw1156 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 809 bytes in 0.067 second response time [08:33:04] RECOVERY - Apache HTTP on mw1160 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 809 bytes in 0.051 second response time [08:33:05] <_joe_> I was just looking at this- seems genuine load or something [08:33:14] RECOVERY - Apache HTTP on mw1155 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 809 bytes in 0.057 second response time [08:36:53] PROBLEM - Kafka Broker Messages In on analytics1021 is CRITICAL: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 
955.037817366 [08:37:46] it's commons User:Fae [08:37:56] files such as https://commons.wikimedia.org/wiki/File:Manhattan,_V._9,_Double_Page_Plate_No._210_%28Map_bounded_by_St._Ann%27s_Ave.,_East_156th_St.,_Concord_Ave.,_East_149th_St.%29_NYPL1998898.tiff [08:38:03] 150MB tiff files [08:38:11] see https://commons.wikimedia.org/wiki/Special:Contributions/F%C3%A6 for more [08:38:54] our imagescaler setup is very easy to saturate [08:44:30] <_joe_> paravoid: how did you get that info? [08:44:45] <_joe_> paravoid: I got to strace the processes around recovery time [08:45:14] <_joe_> so I could not get the reason of the problem (fatal logs don't help either) [08:46:58] paravoid, for reference: https://commons.wikimedia.org/wiki/Commons:Village_pump#Images_so_big_they_break_Commons.3F [08:57:31] thanks, andre__ [08:57:33] I left https://commons.wikimedia.org/wiki/User_talk:F%C3%A6#Large_file_uploads [08:57:44] and I've mentioned it over at #wikimedia-commons [08:58:41] hah paravoid [08:58:50] We only just enabled uploads from NYPL a week ago or so. [08:58:57] through GWToolset [08:59:04] No one foresaw that thou [08:59:27] noone thought that uploading hundreds of 100+MB files would create a problem? :) [09:00:05] next time, please check with us first [09:00:14] How would we know that? [09:00:22] People upload hundreds of 100+MB files every single day [09:01:03] well, maybe tens of them :-P [09:01:24] :) [09:01:24] <_joe_> that's an order of magnitude [09:01:26] <_joe_> :) [09:27:23] PROBLEM - Apache HTTP on mw1157 is CRITICAL: Connection timed out [09:27:23] PROBLEM - Apache HTTP on mw1155 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:27:33] PROBLEM - Apache HTTP on mw1158 is CRITICAL: Connection timed out [09:27:33] PROBLEM - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is CRITICAL: Connection timed out [09:27:53] PROBLEM - Apache HTTP on mw1156 is CRITICAL: Connection timed out [09:28:14] PROBLEM - Apache HTTP on mw1154 is CRITICAL: Connection timed out [09:28:14] PROBLEM - Apache HTTP on mw1160 is CRITICAL: Connection timed out [09:28:14] PROBLEM - Apache HTTP on mw1153 is CRITICAL: Connection timed out [09:29:14] PROBLEM - Apache HTTP on mw1159 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:31:03] RECOVERY - Apache HTTP on mw1159 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 809 bytes in 0.066 second response time [09:31:13] RECOVERY - Apache HTTP on mw1153 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 809 bytes in 5.726 second response time [09:31:14] RECOVERY - Apache HTTP on mw1155 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 809 bytes in 2.615 second response time [09:31:33] RECOVERY - Apache HTTP on mw1158 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 809 bytes in 1.035 second response time [09:31:33] RECOVERY - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 70996 bytes in 0.405 second response time [09:31:43] RECOVERY - Apache HTTP on mw1156 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 809 bytes in 0.067 second response time [09:32:03] RECOVERY - Apache HTTP on mw1160 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 809 bytes in 1.657 second response time [09:32:04] RECOVERY - Apache HTTP on mw1154 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 809 bytes in 2.431 second response time [09:33:13] RECOVERY - Apache HTTP on mw1157 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 809 bytes in 0.068 second response time [10:32:13] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000 [11:27:13] 
PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [11:32:33] PROBLEM - Varnishkafka Delivery Errors on cp3012 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 595.733337 [11:39:33] RECOVERY - Varnishkafka Delivery Errors on cp3012 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [11:41:27] ping apergos [11:41:42] it's a holiday here today [11:42:02] umn, indeed. sorry :P [11:42:15] paravoid: for you thunbnail generation slow too? [11:42:17] no worries, I'm just setting expectations :) [11:42:26] https://ganglia.wikimedia.org/latest/?c=Swift%20eqiad&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2 O_O [11:42:33] PROBLEM - Varnishkafka Delivery Errors on cp3012 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 463.600006 [11:42:55] we had a thumb issue a few hours ago, but it should be okay now [11:43:06] what is the thumbnail that you're seeing an issue with specifically? [11:43:26] on all, som tubs take +30 seconds to load [11:43:33] RECOVERY - Varnishkafka Delivery Errors on cp3012 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [11:44:14] PROBLEM - LVS HTTPS IPv4 on mobile-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:45:06] !log reedy synchronized php-1.23wmf22/extensions/TimedMediaHandler 'I7483c8b7ec75f5149998da2b530ca0467ac70de7' [11:45:12] Logged the message, Master [11:45:21] HE trouble again [11:45:30] the cp3012 issue is stemming from a bunch of 503s from mw hackends [11:45:35] *backends [11:45:38] 200 packets transmitted, 7 received, 96% packet loss, time 20229ms [11:45:42] ugh [11:45:43] PROBLEM - LVS HTTP IPv6 on mobile-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:45:45] PROBLEM - LVS HTTP IPv4 on mobile-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:45:51] <_joe_> shit. [11:45:56] on it [11:46:02] it's HE lon<->nyc [11:46:06] ok [11:46:14] PROBLEM - Varnish HTTP mobile-frontend on cp3012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:47:23] PROBLEM - Varnish HTTP mobile-frontend on cp3013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:47:33] PROBLEM - Varnish HTTP mobile-frontend on cp3014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:47:47] what's going on? just got woken by pages (mid-day nap, so rare) [11:48:19] * apergos does the backread [11:48:43] PROBLEM - Varnish HTTP mobile-frontend on cp3011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:49:04] RECOVERY - Varnish HTTP mobile-frontend on cp3012 is OK: HTTP OK: HTTP/1.1 200 OK - 262 bytes in 0.176 second response time [11:49:04] RECOVERY - LVS HTTPS IPv4 on mobile-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 20422 bytes in 0.596 second response time [11:49:13] RECOVERY - Varnish HTTP mobile-frontend on cp3013 is OK: HTTP OK: HTTP/1.1 200 OK - 262 bytes in 0.174 second response time [11:49:19] * Steinsplitter bring apergos biscuits and tea. 
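For context: the TimedMediaHandler sync just above is presumably the fix for the "call to doTransform() on a non-object" fatal counted earlier (3.5k hits at TimedMediaThumbnail.php:201, all for the same thumb_handler.php URL). The actual patch is not shown in this log; the sketch below only illustrates the general guard that avoids that class of fatal in a MediaWiki context, and the variable names ($file, $params) and the error message are assumptions.

<?php
// Illustrative sketch, not the change that was deployed: verify the object you
// are about to delegate to actually exists before calling doTransform() on it,
// so a broken or unsupported file yields a transform error instead of a fatal.
$handler = ( $file instanceof File ) ? $file->getHandler() : null;

if ( !is_object( $handler ) ) {
    // Fail gracefully; the scaler request returns an error result instead of a 500.
    return new MediaTransformError( 'thumbnail_error', 0, 0, 'no media handler for this file' );
}

return $handler->doTransform( $file, $params['dstPath'], $params['dstUrl'], $params );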
[11:49:20] !log deactivating eqiad<->HE peering, >90% packet loss between lon<->nyc [11:49:24] RECOVERY - Varnish HTTP mobile-frontend on cp3014 is OK: HTTP OK: HTTP/1.1 200 OK - 261 bytes in 0.191 second response time [11:49:28] Logged the message, Master [11:49:33] RECOVERY - LVS HTTP IPv6 on mobile-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 20357 bytes in 0.259 second response time [11:49:36] RECOVERY - Varnish HTTP mobile-frontend on cp3011 is OK: HTTP OK: HTTP/1.1 200 OK - 262 bytes in 0.195 second response time [11:49:36] RECOVERY - LVS HTTP IPv4 on mobile-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 20359 bytes in 0.267 second response time [11:50:55] !log deactivating esams<->HE peering, >90% packet loss between lon<->nyc [11:51:01] Logged the message, Master [11:51:51] !log reedy synchronized php-1.23wmf22/extensions/TimedMediaHandler 'I7483c8b7ec75f5149998da2b530ca04' [11:51:57] Logged the message, Master [11:52:44] much better [12:00:29] (03PS1) 10Reedy: Add ttf-vlgothic to imagescalers [operations/puppet] - 10https://gerrit.wikimedia.org/r/127623 [12:07:13] PROBLEM - Apache HTTP on mw1159 is CRITICAL: Connection timed out [12:07:23] PROBLEM - Apache HTTP on mw1157 is CRITICAL: Connection timed out [12:07:24] PROBLEM - Apache HTTP on mw1155 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:07:43] PROBLEM - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:08:04] for crying out loud [12:08:11] <_joe_> and... again. [12:08:13] PROBLEM - Apache HTTP on mw1154 is CRITICAL: Connection timed out [12:08:13] PROBLEM - Apache HTTP on mw1160 is CRITICAL: Connection timed out [12:08:14] PROBLEM - Apache HTTP on mw1153 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:08:24] ? 
[12:08:43] PROBLEM - Apache HTTP on mw1158 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:08:53] PROBLEM - Apache HTTP on mw1156 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:10:10] <_joe_> paravoid: same as before, also, mw is writing a local copy of the tiff image to /tmp, so maybe we have some i/o issues as well [12:10:21] no, it's just maxclients [12:10:26] dear users: please don't upload humongous globs of data that break us [12:10:27] I still see uploads from Fae [12:10:43] RECOVERY - Apache HTTP on mw1156 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 809 bytes in 3.694 second response time [12:11:04] RECOVERY - Apache HTTP on mw1153 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 809 bytes in 0.054 second response time [12:11:04] RECOVERY - Apache HTTP on mw1154 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 809 bytes in 1.722 second response time [12:11:04] RECOVERY - Apache HTTP on mw1160 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 809 bytes in 1.452 second response time [12:11:07] <_joe_> nevermind, php is the bottleneck here [12:11:14] RECOVERY - Apache HTTP on mw1157 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 809 bytes in 0.075 second response time [12:11:14] RECOVERY - Apache HTTP on mw1155 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 809 bytes in 0.066 second response time [12:11:33] RECOVERY - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 70996 bytes in 0.390 second response time [12:12:02] twkozlowski: ping [12:12:03] RECOVERY - Apache HTTP on mw1159 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 809 bytes in 0.605 second response time [12:12:33] RECOVERY - Apache HTTP on mw1158 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 809 bytes in 0.061 second response time [12:12:35] Reedy: how well do you know GWToolset? [12:13:03] Not that well.. Why? [12:13:10] Might be able to help [12:13:19] Fae said that "There are a few more committed to GWT (I estimate 300 images in the queue) which I cannot halt" [12:13:32] 300 is 300 too many [12:13:32] where is that queue? [12:14:08] Redis job queue I think [12:14:12] let me have a look [12:14:40] I'm checking out the extension to read its code [12:14:54] GWToolset\Jobs\UploadMediafileJob [12:15:51] gwtoolsetUploadMediafileJob: 0 queued; 124 claimed (31 active, 93 abandoned); 0 delayed [12:16:21] That doesn't look like 300 to me [12:16:33] that was 3 hours ago [12:17:18] I think the extension's throttling is crap [12:17:34] I see burst of uploads spaced 20-25' apart [12:17:56] It has throttling in its own code? [12:18:04] supposedely [12:18:35] <_joe_> looking into it [12:19:44] If it's just throttling what it adds to the job queue, it's not exactly sufficient [12:19:52] The runners will essentially "do what they want" [12:21:42] paravoid: public static $mediafile_job_queue_max = 1000; [12:21:48] lol [12:21:52] "the maximum number of UploadMediafileJob’s that should exist in the overall job queue." [12:22:05] <_joe_> ... [12:22:24] Suggestions of a value to drop that to? [12:22:43] it's not GWToolset that kills us right now, it's Fae's uploads specifically [12:23:49] I have a more basic question [12:24:31] why do these uploads happen from the user's account? [12:24:39] instead of say, a bot account, that we can ban? 
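For context on the throttle being criticised above: GWToolset's $mediafile_job_queue_max is a cap on how many upload jobs may sit in the queue, roughly of the shape sketched below (illustrative, not the extension's actual code; the exception message is made up, the job type name comes from the showJobs output above). As noted in the discussion, this bounds the backlog of queued jobs, not how many claimed jobs the runners execute in parallel, so it cannot protect the image scalers on its own.

<?php
// Sketch of a "max jobs in the overall queue" throttle; assumes a MediaWiki context.
$queue = JobQueueGroup::singleton( 'commonswiki' )->get( 'gwtoolsetUploadMediafileJob' );

if ( $queue->getSize() >= 1000 ) { // GWToolset's $mediafile_job_queue_max
    throw new MWException( 'GWToolset media file job queue is full; try again later.' );
}

$queue->push( $uploadJob ); // $uploadJob: an UploadMediafileJob instance (assumed)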
[12:26:14] Maybe no one has thought about it [12:27:30] Should be easy enough to change over - create account and assign to the gwtoolset group [12:34:52] PROBLEM - Puppet freshness on wtp1011 is CRITICAL: Last successful Puppet run was Mon 21 Apr 2014 12:31:23 PM UTC [12:36:52] PROBLEM - Puppet freshness on wtp1011 is CRITICAL: Last successful Puppet run was Mon 21 Apr 2014 12:31:23 PM UTC [12:38:52] PROBLEM - Puppet freshness on wtp1011 is CRITICAL: Last successful Puppet run was Mon 21 Apr 2014 12:31:23 PM UTC [12:38:54] there is a wikipedia-commons-gwtoolset-metadata swift container [12:39:09] 29 files under /Fæ there [12:39:34] 444 filenames [12:39:42] I'm going to delete those 29 XMLs [12:40:45] !log deleting 29 GWToolset XML under Swift's wikipedia-commons-gwtoolset-metadata container for user Fæ/ [12:40:48] oh also [12:40:51] Logged the message, Master [12:40:52] PROBLEM - Puppet freshness on wtp1011 is CRITICAL: Last successful Puppet run was Mon 21 Apr 2014 12:31:23 PM UTC [12:41:08] !log escalating myself to Commons bureaucrat/admin, then adding GWToolset privileges [12:41:14] Logged the message, Master [12:41:22] Do you not have +sysop on the site? [12:41:26] no [12:41:44] I just used createAndPromote.php [12:41:49] :) [12:41:56] now I just want to remove that again :) [12:42:08] Special:Userrights [12:42:33] that's what I used to add GWToolset user [12:42:39] but admin/crat is under "groups you can't change" :) [12:42:52] PROBLEM - Puppet freshness on wtp1011 is CRITICAL: Last successful Puppet run was Mon 21 Apr 2014 12:31:23 PM UTC [12:43:19] You can delete them from the database... [12:44:49] maybe it just doesn't let me do it myself? [12:44:52] PROBLEM - Puppet freshness on wtp1011 is CRITICAL: Last successful Puppet run was Mon 21 Apr 2014 12:31:23 PM UTC [12:45:43] Was this intervention really necessary? [12:45:50] which one? [12:45:55] You know there are Commons bureaucrats on IRC right now, right? 
[12:46:04] I pinged you before [12:46:25] see backlog [12:46:31] Yeah, and logged your action at 14:41 CEST [12:46:52] PROBLEM - Puppet freshness on wtp1011 is CRITICAL: Last successful Puppet run was Mon 21 Apr 2014 12:31:23 PM UTC [12:47:24] after gettign reports for slow thumbs and a page at 14:12 CEST [12:48:52] PROBLEM - Puppet freshness on wtp1011 is CRITICAL: Last successful Puppet run was Mon 21 Apr 2014 12:31:23 PM UTC [12:49:21] no contributions from Fae since 12:39 UTC, my removal from swift probably worked [12:50:52] PROBLEM - Puppet freshness on wtp1011 is CRITICAL: Last successful Puppet run was Mon 21 Apr 2014 12:31:23 PM UTC [12:52:52] PROBLEM - Puppet freshness on wtp1011 is CRITICAL: Last successful Puppet run was Mon 21 Apr 2014 12:31:23 PM UTC [12:54:27] !log demoting myself, removing Commons crat/admin rights [12:54:33] Logged the message, Master [12:54:52] PROBLEM - Puppet freshness on wtp1011 is CRITICAL: Last successful Puppet run was Mon 21 Apr 2014 12:31:23 PM UTC [12:55:54] Did not appear to have worked, paravoid [12:56:08] Member of: Bureaucrats, GWToolset users and Administrators [12:56:13] User cache [12:56:19] that's just cached, it will expire soon I think [12:56:52] PROBLEM - Puppet freshness on wtp1011 is CRITICAL: Last successful Puppet run was Mon 21 Apr 2014 12:31:23 PM UTC [12:58:52] PROBLEM - Puppet freshness on wtp1011 is CRITICAL: Last successful Puppet run was Mon 21 Apr 2014 12:31:23 PM UTC [13:00:52] PROBLEM - Puppet freshness on wtp1011 is CRITICAL: Last successful Puppet run was Mon 21 Apr 2014 12:31:23 PM UTC [13:00:52] RECOVERY - Puppet freshness on wtp1011 is OK: puppet ran at Mon Apr 21 13:00:48 UTC 2014 [13:01:04] (03CR) 10Nemo bis: "It seems this isn't working but I don't know why. Docs don't mention it explicitly but as the planet outputs atom I assume it also reads i" [operations/puppet] - 10https://gerrit.wikimedia.org/r/127222 (owner: 10Dzahn) [13:02:52] PROBLEM - Puppet freshness on wtp1011 is CRITICAL: Last successful Puppet run was Mon 21 Apr 2014 01:00:48 PM UTC [13:03:53] twkozlowski: should be better now [13:05:37] !log De-activated status.wm.o monitor for icinga due to false positive from HTTP auth [13:05:43] Logged the message, Master [13:07:19] PROBLEM - Apache HTTP on mw1153 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:07:29] PROBLEM - Apache HTTP on mw1157 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:08:19] PROBLEM - Apache HTTP on mw1154 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:08:29] PROBLEM - Apache HTTP on mw1155 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:08:39] PROBLEM - Apache HTTP on mw1158 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:09:39] PROBLEM - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:10:29] PROBLEM - Apache HTTP on mw1156 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:11:19] PROBLEM - Apache HTTP on mw1160 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:12:09] PROBLEM - Apache HTTP on mw1159 is CRITICAL: Connection timed out [13:14:33] paravoid: Maybe we should have stricter rules about who we assign GWToolset user rights to. [13:14:49] gwtoolsetUploadMediafileJob: 0 queued; 122 claimed (0 active, 122 abandoned); 0 delayed [13:14:50] Given that it can kill the servers dead [13:17:03] <_joe_> Reedy: how do you get that number? 
[13:17:25] Run this on tin [13:17:26] mwscript showJobs.php commonswiki --group | grep gw [13:17:36] I presume those abandoned are dead and won't get retried, right? or is that a different group? [13:17:42] <_joe_> ok thanks [13:18:37] apergos: Yup. I think they've exhausted their attempt count [13:18:45] right [13:31:39] RECOVERY - Puppet freshness on wtp1011 is OK: puppet ran at Mon Apr 21 13:31:29 UTC 2014 [13:38:19] RECOVERY - Apache HTTP on mw1156 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 809 bytes in 0.090 second response time [13:39:39] RECOVERY - Apache HTTP on mw1158 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 809 bytes in 7.692 second response time [13:40:19] RECOVERY - Apache HTTP on mw1155 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 809 bytes in 4.775 second response time [13:40:39] RECOVERY - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 70996 bytes in 4.628 second response time [13:41:22] !log rolling restart on some of the Elasticsearch servers to pick up new plugins. should not cause any trouble. [13:41:27] Logged the message, Master [13:42:19] RECOVERY - Apache HTTP on mw1157 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 809 bytes in 2.606 second response time [13:42:39] PROBLEM - Apache HTTP on mw1158 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:43:29] PROBLEM - Apache HTTP on mw1155 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:43:29] PROBLEM - Apache HTTP on mw1156 is CRITICAL: Connection timed out [13:43:39] PROBLEM - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:45:29] PROBLEM - Apache HTTP on mw1157 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:48:09] RECOVERY - Apache HTTP on mw1154 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 809 bytes in 6.164 second response time [13:48:29] RECOVERY - Apache HTTP on mw1157 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 809 bytes in 9.457 second response time [13:49:09] RECOVERY - Apache HTTP on mw1159 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 809 bytes in 2.804 second response time [13:49:12] (03PS1) 10Gerrit Patch Uploader: Change "Open Directory" to DMOZ [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/127627 [13:49:17] (03CR) 10Gerrit Patch Uploader: "This commit was uploaded using the Gerrit Patch Uploader [1]." 
[operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/127627 (owner: 10Gerrit Patch Uploader) [13:49:19] RECOVERY - Apache HTTP on mw1155 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 809 bytes in 5.420 second response time [13:50:09] RECOVERY - Apache HTTP on mw1160 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 809 bytes in 0.046 second response time [13:50:30] RECOVERY - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 70996 bytes in 1.936 second response time [13:51:09] RECOVERY - Apache HTTP on mw1153 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 809 bytes in 4.506 second response time [13:53:19] RECOVERY - Apache HTTP on mw1156 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 809 bytes in 0.067 second response time [13:53:30] RECOVERY - Apache HTTP on mw1158 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 809 bytes in 2.468 second response time [14:00:09] PROBLEM - Apache HTTP on mw1160 is CRITICAL: Connection timed out [14:00:19] PROBLEM - Apache HTTP on mw1154 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:00:19] PROBLEM - Apache HTTP on mw1159 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:00:29] PROBLEM - Apache HTTP on mw1157 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:00:29] PROBLEM - Apache HTTP on mw1156 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:00:39] PROBLEM - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:01:29] PROBLEM - Apache HTTP on mw1155 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:01:39] PROBLEM - Apache HTTP on mw1158 is CRITICAL: Connection timed out [14:02:10] PROBLEM - Apache HTTP on mw1153 is CRITICAL: Connection timed out [14:02:16] <_joe_> will this ever stop? [14:03:19] RECOVERY - Apache HTTP on mw1155 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 809 bytes in 0.140 second response time [14:03:39] RECOVERY - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 70996 bytes in 5.205 second response time [14:03:51] _joe_: I'm not sure what is causing it. Nor am I sure who is looking into it [14:05:13] these are image scalars running out of memory as they have no swap [14:05:59] <_joe_> hoo: exactly [14:06:20] PROBLEM - Apache HTTP on mw1155 is CRITICAL: Connection timed out [14:06:30] <_joe_> hoo: swap will only kill them I think. [14:06:39] PROBLEM - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:06:45] Yeah, probably [14:07:30] I wonder why the job queue isn't preventing so many from launching at once. Might be we've just configured it too aggressively. Or because it is configured for processes that use less ram. it doesn't have a "weight" per job [14:07:38] <_joe_> hoo: also, it's just killing objects in a cgroup [14:07:59] _joe_: Yeah, am watching on mw1155 atm [14:08:21] <^demon|away> Has all of Aaron's wrappers for Tiff/DjVu/etc in PoolCounter gone live? [14:08:30] <^demon|away> If not, let's get that out in today's SWAT. [14:08:32] <^demon|away> That'll help. A lot [14:08:56] We had cases where a single image was able to force boxes out of memory [14:10:56] <_joe_> hoo: so since someone is trying to convert a large number of large images, we constantly run OOM for the converting jobs [14:11:09] <_joe_> I can't see why apache goes down then [14:11:25] <_joe_> it should just choke an error [14:11:36] probably the scaler stuff is child of apache? 
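The PoolCounter wrappers ^demon refers to above are not shown in this log; the sketch below is only the generic MediaWiki PoolCounter pattern for bounding concurrent expensive renders. The pool name 'FileRenderExpensive', the key scheme, and the callback bodies are assumptions, and it presumes a MediaWiki execution context.

<?php
// Generic PoolCounter usage sketch: all renders of the same original funnel
// through one named pool, so only a configured number of workers can be busy
// on it at once and the rest queue up or fail fast.
$work = new PoolCounterWorkViaCallback(
    'FileRenderExpensive',              // must match an entry in $wgPoolCounterConf
    'filerender:' . $file->getSha1(),   // one slot set per original file
    [
        'doWork' => function () use ( $file, $params ) {
            return $file->transform( $params, File::RENDER_NOW );
        },
        'error' => function ( $status ) {
            // Pool full: better to refuse one thumbnail than tie up every Apache child.
            return new MediaTransformError( 'thumbnail_error', 0, 0, 'render pool is full' );
        },
    ]
);
return $work->execute();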
[14:12:00] <_joe_> yes it's launched with a system() or some other mean by php [14:12:22] <_joe_> but still, it should just return with non-zero status to php [14:12:22] │ │ └─bash /usr/local/apache/common-local/php-1.23wmf22/includes/limit.sh... [14:14:29] PROBLEM - ElasticSearch health check on elastic1005 is CRITICAL: CRITICAL - Could not connect to server 10.64.0.112 [14:15:34] _joe_: very weird indeed [14:18:00] <_joe_> hoo: one way to understand what happens would be to strace an apache process which is father of one of the long-running convert jobs and get what happens when the oom-killer kills the child. [14:18:50] <_joe_> hoo: not even sure this is why apache stops responding, it may be due to reaching MaxChild [14:19:33] image scalers are supposed to be capped at an amount of jobs which can't exhaust their memory [14:20:02] either they're running too many jobs or the max area was raised too much [14:20:27] Actually that's true... it *should* not be able to start more convert processes on a box than it has mem. [14:20:29] RECOVERY - Apache HTTP on mw1157 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 809 bytes in 8.782 second response time [14:20:30] unless those are all machines with less memory than they were expected to have [14:20:47] <_joe_> Nemo_bis: I have no evidence of apache reaching MaxChild today [14:20:50] not sure what the limits for a convert are set to [14:21:09] RECOVERY - Apache HTTP on mw1159 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 809 bytes in 0.072 second response time [14:21:09] RECOVERY - Apache HTTP on mw1160 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 809 bytes in 0.072 second response time [14:21:09] RECOVERY - Apache HTTP on mw1153 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 809 bytes in 0.071 second response time [14:21:09] RECOVERY - Apache HTTP on mw1154 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 809 bytes in 7.485 second response time [14:21:19] RECOVERY - Apache HTTP on mw1155 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 809 bytes in 4.389 second response time [14:21:39] RECOVERY - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 70996 bytes in 4.689 second response time [14:22:19] RECOVERY - Apache HTTP on mw1156 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 809 bytes in 0.069 second response time [14:22:39] RECOVERY - Apache HTTP on mw1158 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 809 bytes in 9.472 second response time [14:23:25] (03CR) 10Cmjohnson: [C: 032] turn off rsyncs to/from dataset2, prep for 12th floor move [operations/puppet] - 10https://gerrit.wikimedia.org/r/127235 (owner: 10ArielGlenn) [14:23:32] MW_MEM_LIMIT=409600 [14:23:59] RECOVERY - Kafka Broker Messages In on analytics1021 is OK: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate OKAY: 2344.98545734 [14:24:30] that times 1024 [14:24:40] these boxes have 12gb ram [14:26:23] (03PS2) 10QChris: Add cp301[34] to mobile caches [operations/puppet] - 10https://gerrit.wikimedia.org/r/127456 [14:26:28] _joe_: what abut the ganglia graphs: http://ganglia.wikimedia.org/latest/?c=Image%20scalers%20eqiad&h=mw1153.eqiad.wmnet& look at busy threads [14:26:52] oh, maxchild you said, sorry [14:30:00] <_joe_> apergos: mmmh yes but that is probably measuring the same thing. 
Still, apache did not log reaching maxchild [14:33:28] _joe_: These boxes have MaxClients 18 [14:33:42] These servers are being relocated in about a hour, and will be down for a short period of time: fenari, mexia, dataset2, amaranth, mchenry, dobson, sanger, pdf*, virt0, tridge. [14:34:19] PROBLEM - SSH on mw1084 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:34:29] PROBLEM - RAID on mw1084 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:34:29] RECOVERY - ElasticSearch health check on elastic1005 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1948: active_shards: 5783: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [14:34:56] <_joe_> hoo: I know [14:35:09] RECOVERY - SSH on mw1084 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.3 (protocol 2.0) [14:35:19] RECOVERY - RAID on mw1084 is OK: OK: no RAID installed [14:36:09] PROBLEM - SSH on mw1040 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:36:09] PROBLEM - RAID on mw1040 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:36:19] PROBLEM - check configured eth on mw1040 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:36:24] So the requests fail because maxrequests is exhausted or am I missing something? [14:36:29] PROBLEM - puppet disabled on mw1040 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:36:39] PROBLEM - check if dhclient is running on mw1040 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:36:39] PROBLEM - DPKG on mw1040 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:37:09] PROBLEM - twemproxy port on mw1040 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:37:09] RECOVERY - check configured eth on mw1040 is OK: NRPE: Unable to read output [14:37:19] PROBLEM - Apache HTTP on mw1040 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:37:19] RECOVERY - puppet disabled on mw1040 is OK: OK [14:37:29] RECOVERY - check if dhclient is running on mw1040 is OK: PROCS OK: 0 processes with command name dhclient [14:37:29] RECOVERY - DPKG on mw1040 is OK: All packages OK [14:37:59] RECOVERY - SSH on mw1040 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.3 (protocol 2.0) [14:37:59] RECOVERY - twemproxy port on mw1040 is OK: PROCS OK: 1 process with UID = 65534 (nobody), command name nutcracker [14:37:59] RECOVERY - RAID on mw1040 is OK: OK: no RAID installed [14:38:09] RECOVERY - Apache HTTP on mw1040 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 809 bytes in 0.069 second response time [14:41:58] we ok? 
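A worked check of the numbers quoted above, runnable as plain PHP. MW_MEM_LIMIT is the per-command cap that limit.sh and the cgroup enforce; on the MediaWiki side that figure normally comes from $wgMaxShellMemory (in KB). The 409600, 12 GB and MaxClients 18 figures are from this log; the rest is arithmetic.

<?php
// Per-convert memory cap versus box capacity on the eqiad image scalers.
$maxShellMemoryKb = 409600;                        // MW_MEM_LIMIT from the log, in KB
$perConvertBytes  = $maxShellMemoryKb * 1024;      // "that times 1024" => ~400 MB

$boxRamBytes = 12 * 1024 * 1024 * 1024;            // "these boxes have 12gb ram"
$maxClients  = 18;                                 // Apache slots per scaler

// Worst case: every Apache slot has a convert child sitting at its ceiling.
$worstCaseBytes = $maxClients * $perConvertBytes;  // ~7 GB of 12 GB

printf( "cap per convert: %d MB; worst case: %.1f GB of %.0f GB\n",
    $perConvertBytes / ( 1024 * 1024 ),
    $worstCaseBytes / ( 1024 * 1024 * 1024 ),
    $boxRamBytes / ( 1024 * 1024 * 1024 )
);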
[14:47:00] well the image scalers are still not very thrilled [14:48:57] * greg-g nods [14:50:39] PROBLEM - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:51:08] *cough* [14:51:09] PROBLEM - Apache HTTP on mw1159 is CRITICAL: Connection timed out [14:51:19] PROBLEM - Apache HTTP on mw1160 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:51:19] PROBLEM - Apache HTTP on mw1154 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:51:29] PROBLEM - Apache HTTP on mw1157 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:51:29] RECOVERY - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 71195 bytes in 0.299 second response time [14:51:38] * greg-g is reading the thread now [14:52:29] PROBLEM - Apache HTTP on mw1155 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:52:29] PROBLEM - Apache HTTP on mw1156 is CRITICAL: Connection timed out [14:52:39] PROBLEM - Apache HTTP on mw1158 is CRITICAL: Connection timed out [14:53:12] plenty of free ram on the boxes, but max clients exceeded [14:53:29] <_joe_> hoo: do you see the line in the logs? [14:53:29] RECOVERY - Apache HTTP on mw1156 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 809 bytes in 9.098 second response time [14:53:32] <_joe_> on which box? [14:53:38] mw1155 [14:54:09] RECOVERY - Apache HTTP on mw1160 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 809 bytes in 0.744 second response time [14:54:19] RECOVERY - Apache HTTP on mw1159 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 809 bytes in 9.990 second response time [14:54:19] RECOVERY - Apache HTTP on mw1157 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 809 bytes in 0.072 second response time [14:54:30] RECOVERY - Apache HTTP on mw1158 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 809 bytes in 0.065 second response time [14:54:44] <_joe_> hoo: on 3 other machines it was not reached, though [14:55:19] RECOVERY - Apache HTTP on mw1155 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 809 bytes in 0.670 second response time [14:55:41] it took me about 22s to connect to the apache, that's why the icinga checks are timing out [14:56:10] RECOVERY - Apache HTTP on mw1154 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 809 bytes in 0.074 second response time [15:01:13] waiting for a free slot [15:02:22] apergos: _joe_: What's your opinion about slightly raising maxclients for the scalers? Neither CPU nor mem. really seems exhausted [15:02:32] by a small number indeed [15:03:09] knowing that if we get a batch of requests that all take close to max memory then the oom killer will get them I suppose [15:03:24] because cumulatively it will be too much for 12 gb [15:03:54] apergos: Ok.. so 18 -> 22 maybe? [15:04:04] <_joe_> hoo: I had the same idea, but I'm not confident that is a good one. I know too little of the overall application to be able to assert that it's the right thing to do [15:04:42] <_joe_> I'm 2 weeks into the ops team after all [15:05:05] well, worst case is that more image scalings fail because they hit the timeout because the box is more busy or because the whole box runs out of mem.
[15:05:29] I think it'll be ok, for the short term anyways [15:05:45] if you go to 23 then that's the equiv of 40 more slots [15:05:58] miight help some [15:06:19] PROBLEM - Apache HTTP on mw1159 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:06:19] PROBLEM - Apache HTTP on mw1160 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:06:19] PROBLEM - Apache HTTP on mw1154 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:06:29] PROBLEM - Apache HTTP on mw1157 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:06:57] <_joe_> hoo: still I'd prefer to understand what is causing this strain on those boxes [15:06:58] (03PS1) 10Hoo man: Increase apache MaxClients to 23 in order to have 40 more scaling slots [operations/puppet] - 10https://gerrit.wikimedia.org/r/127632 [15:07:30] PROBLEM - Apache HTTP on mw1156 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:07:39] PROBLEM - Apache HTTP on mw1158 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:07:49] _joe_: I think there's a dedicated log for that on terbium [15:07:53] * fluorine [15:08:12] <_joe_> hoo: yes I'm looking at that now :) [15:09:29] RECOVERY - Apache HTTP on mw1157 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 809 bytes in 8.809 second response time [15:10:09] RECOVERY - Apache HTTP on mw1154 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 809 bytes in 0.549 second response time [15:10:10] RECOVERY - Apache HTTP on mw1159 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 809 bytes in 8.393 second response time [15:10:29] RECOVERY - Apache HTTP on mw1156 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 809 bytes in 6.688 second response time [15:12:09] RECOVERY - Apache HTTP on mw1160 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 809 bytes in 0.069 second response time [15:12:29] RECOVERY - Apache HTTP on mw1158 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 809 bytes in 0.073 second response time [15:13:29] PROBLEM - Apache HTTP on mw1156 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:13:49] hey [15:14:05] I'm out [15:14:19] RECOVERY - Apache HTTP on mw1156 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 809 bytes in 0.067 second response time [15:14:25] are we still having issues? [15:14:53] <_joe_> paravoid_: yes [15:15:10] do you know why? 
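The "40 more scaling slots" in the MaxClients change above (Gerrit 127632) works out as follows; the host count of 8 (mw1153–mw1160, the scalers paging in this log) is inferred from the alerts, not stated explicitly.

<?php
// MaxClients 18 -> 23 across the eqiad image scaler pool.
$hosts         = 8;    // mw1153..mw1160 (inferred)
$oldMaxClients = 18;
$newMaxClients = 23;

$extraSlots = ( $newMaxClients - $oldMaxClients ) * $hosts;  // 5 * 8 = 40

// Against the ~400 MB per-convert cap and 12 GB boxes discussed earlier:
$worstCaseMb = $newMaxClients * 400;                         // 9200 MB, still under ~12288 MB

echo "extra slots: $extraSlots; worst-case convert memory per box: {$worstCaseMb} MB\n";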
[15:15:19] PROBLEM - Apache HTTP on mw1159 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:15:29] PROBLEM - Apache HTTP on mw1157 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:15:49] check the ttl of the memcached key that limits the imagescaling attempts [15:16:04] the changeset is linked on the bug I mailed earlier today [15:16:39] PROBLEM - Apache HTTP on mw1158 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:16:50] also, from a quick glance of ganglia from over here, it looks like swift frontends are saturating their gige [15:17:04] probably from sending these 100-200MB files over to the imagescalers over and over [15:17:38] <_joe_> yes I was seeing quite some issues on retreiving original images on the thumbnail log [15:17:39] this will make the transfers slower, the imagescaler requess slower, which will then pile up until they DoS the cluster [15:17:49] wake up Aaron [15:18:05] or some other MW developer that feels confident to modify thumb.php [15:18:09] RECOVERY - Apache HTTP on mw1159 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 809 bytes in 0.051 second response time [15:18:19] RECOVERY - Apache HTTP on mw1157 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 809 bytes in 0.066 second response time [15:18:28] * Nemo_bis invokes bawolff [15:18:29] RECOVERY - Apache HTTP on mw1158 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 809 bytes in 0.069 second response time [15:18:31] <_joe_> paravoid_: also, we have a lot of OOMs going on on those boxes [15:18:59] to either abort early if the file is > X MB or create some more permanent limit than a memcache keyed by (orig, thumb) with a potentially low TTL [15:19:10] the OOMs are normal, it's OOMing in the cgroup [15:19:22] <^d> Aaron should be around soon, it's already 8am here. [15:19:27] <_joe_> yes yes, I did not know if that was 'normal' [15:19:52] we're spawning scales inside memory-constrained cgroups [15:20:08] <_joe_> yes I've seen the limit.sh wrapper [15:20:09] ^d: how comfortable are you on deploying those fixes you mentioned in your email? [15:20:48] are these fixes for this? [15:20:55] related [15:20:57] <^d> Huh? Fixes in e-mail? [15:21:05] <_joe_> Is there some thread I'm not aware of, I'd say [15:21:19] <^d> paravoid_: Aaron's been tightening up the PoolCounter code around Tiffs and DjVu. [15:21:25] <^d> Making sure all that's live wouldn't be a bad idea. [15:21:25] that's not going to fix this [15:22:06] !log dataset2 going down to be relocated to the 12th floor [15:22:07] https://gerrit.wikimedia.org/r/107419 [15:22:12] ttl 3600 [15:22:12] Logged the message, Master [15:22:17] and keyed by thumbnail size as well [15:22:22] ^d: errr, now I can't find them.... [15:22:36] ^d: but the things you just mentioned, aaron's fixes in poolcounter [15:22:50] this needs to become more permanent and for all thumbnails for a given original [15:22:51] paravoid_: "that's not going to fix this" ah :/ [15:22:57] <^d> paravoid_'s right, won't help. [15:23:02] <^d> Won't hurt, but won't fix it [15:23:08] plus probably not even attempt to scale this if the file is too big [15:23:22] can someone either do that or wake up aaron? [15:23:39] <^d> Do *what*? [15:23:41] <^d> Deploy that? [15:24:20] no, add code to fix this [15:24:21] <^d> Oh, I misread you. [15:24:22] <^d> Got it. 
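A rough feasibility check on the "saturating their gige" observation above, using assumed round numbers rather than measured values.

<?php
// A 1 Gbit/s Swift frontend link moves ~125 MB/s, so barely more than one of the
// ~150 MB originals per second fills it. Each failed thumbnail retry that
// re-fetches the original slows all transfers, which makes scaler requests pile
// up further -- the cascade described above.
$linkBytesPerSec = 1e9 / 8;   // ~125 MB/s
$originalBytes   = 150e6;     // one large TIFF original

printf( "originals per second at line rate: %.2f\n", $linkBytesPerSec / $originalBytes ); // ~0.83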
[15:25:29] <_joe_> paravoid_: I think having a limit that depends also on image size would be good [15:25:47] <_joe_> so that larger images can fail fewer times and be "banned" longer [15:26:42] that could be done rather easily, you could just make it fail after one attempt if the file is bigger than n [15:26:49] would be trivial in the current implementation [15:27:01] * greg-g preps for a call in 4 minutes [15:27:14] * ^d has thumb.php open now [15:27:23] * ^d is hacking [15:27:35] thanks ^d [15:27:43] <_joe_> thanks! [15:27:44] I'm ooo, sorry :/ [15:27:46] ^d: Just make it set the value of the memcached key to 5 [15:27:53] if the img size is higher than n [15:28:04] paravoid_: thanks for the help [15:28:23] ok, I'm going to quit from here [15:28:31] I'll read the backlog (from my regular handle) when I get back [15:28:40] <_joe_> yeah quite the unfortunate day for us europeans - it's bank holiday all across the continent [15:29:19] is it not already enforcing a fail limit of 5 on all thumbs? [15:29:33] apergos: Yep [15:29:38] yes, but it saves that in memcache with a ttl of 3600 [15:29:45] that's why I suggest only doing it once per hour for large images [15:30:00] also keyed by the thumb size, so 50px has a different count than 150px [15:30:25] both are easy fixes [15:30:39] PROBLEM - Host dataset2 is DOWN: PING CRITICAL - Packet loss = 100% [15:31:03] then just add a stanza before that, if the size is > 50MB, abort -- no need for failed attempts or anything [15:31:23] we can render images with more than 50MiB just fine [15:31:51] I'd fix the outage for the general case now, deal with the false negatives later [15:31:54] but ymmv [15:31:55] <_joe_> hoo: if not 50, some other - configurable threshold is needed [15:32:03] gotta go [15:32:07] see ya [15:32:30] Yep... I'd go for like 150 or so? Or more? Don't know how large the biggest images that succeed are... [15:32:52] <_joe_> hoo: shoot for the 75 MB range [15:32:58] it varies, it's not just the absolute size of the image that's the issue, but we should try to make a good guess at a cutoff [15:33:38] <_joe_> but let ^d work on it :) [15:33:43] True, there are multiple factors... [15:35:11] <_joe_> hoo: I don't believe raising the maxchild setting would help us now significantly if we don't raise it 2x/3x [15:35:15] I see that currently we're ok... (says ganglia) [15:35:32] <_joe_> apergos: we will be down again in ~ 30 mins [15:35:37] heh [15:35:41] when the hour is up [15:35:45] <_joe_> when the memcache key expires again :) [15:36:17] I would vote against anything like a 2-3fold increase [15:37:59] <_joe_> apergos: that was my point :) [15:44:33] <^d> Almost done.
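A sketch of the throttle being discussed above: a failure counter keyed on the original rather than per thumb size, with large originals "banned" for the whole window after a single failure. This is not the change ^d is writing in Gerrit 127637, just the shape of the idea; it assumes a MediaWiki context with a memcached-backed $wgMemc, and the 75 MB threshold and the key name are placeholders.

<?php
$attemptLimit  = 5;                  // existing failure cap per key
$attemptWindow = 3600;               // existing TTL: counters reset after an hour
$hugeFileBytes = 75 * 1024 * 1024;   // threshold floated above; configurable in practice

// Key on the original only, so 50px and 150px failures share one counter
// (the existing key was also per thumbnail size, which multiplies the retries).
$key = wfMemcKey( 'thumb-failures', md5( $file->getName() ) );

$failures = (int)$wgMemc->get( $key );
if ( $failures >= $attemptLimit ) {
    return new MediaTransformError( 'thumbnail_error', 0, 0, 'too many failed render attempts' );
}

// ... attempt the scale here; then, on failure:
if ( $file->getSize() > $hugeFileBytes ) {
    // One failed attempt on a huge original blocks it for the whole window.
    $wgMemc->set( $key, $attemptLimit, $attemptWindow );
} else {
    // Not atomic, but good enough for a coarse throttle.
    $wgMemc->set( $key, $failures + 1, $attemptWindow );
}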
[15:59:56] (03PS1) 10Aaron Schulz: Removed redundant config [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/127635 [16:04:31] (03PS1) 10Ottomata: elastic1014 master eligible [operations/puppet] - 10https://gerrit.wikimedia.org/r/127636 [16:04:44] (03CR) 10Ottomata: [C: 032 V: 032] elastic1014 master eligible [operations/puppet] - 10https://gerrit.wikimedia.org/r/127636 (owner: 10Ottomata) [16:04:52] !log reinstalling elastic1014 [16:04:59] Logged the message, Master [16:06:39] PROBLEM - Host elastic1014 is DOWN: PING CRITICAL - Packet loss = 100% [16:07:56] <^d> hoo|away, apergos, paravoid, AaronSchulz, _joe_: Change up for review for image failure limits https://gerrit.wikimedia.org/r/#/c/127637/ [16:09:01] <_joe_> ^d: I hope someone else can take a look, I'm off now (it's a bank holiday after all) [16:09:18] <^d> Hence the ping for AaronSchulz ;-) [16:09:21] <_joe_> I'll be back in ~ 2 hours or so, if I'm still needed [16:09:34] <^d> Just letting everyone know it was up. [16:10:13] AaronSchulz: online? if not, I'mma gonna call you :) [16:10:44] sec [16:10:50] * AaronSchulz has been the office since 8:40 [16:11:19] <^d> Yeah, I figured you were around now. [16:11:49] RECOVERY - Host elastic1014 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [16:11:58] ^d: oh, you can also look at https://gerrit.wikimedia.org/r/#/c/126052/ now that people are interested [16:12:31] AaronSchulz: you you mostly caught up on the state of the image scalers having issues? [16:13:43] yes [16:13:49] PROBLEM - SSH on elastic1014 is CRITICAL: Connection refused [16:13:49] PROBLEM - check configured eth on elastic1014 is CRITICAL: Connection refused by host [16:13:59] PROBLEM - RAID on elastic1014 is CRITICAL: Connection refused by host [16:13:59] PROBLEM - check if dhclient is running on elastic1014 is CRITICAL: Connection refused by host [16:14:01] the ba thing about that is that in three hours we'll get nailed again the next time someone pulls up the user uploads or their gallery or scrolls back far enough in special:newfiles [16:14:09] PROBLEM - puppet disabled on elastic1014 is CRITICAL: Connection refused by host [16:14:09] PROBLEM - ElasticSearch health check on elastic1014 is CRITICAL: CRITICAL - Could not connect to server 10.64.48.11 [16:14:29] PROBLEM - DPKG on elastic1014 is CRITICAL: Connection refused by host [16:14:29] PROBLEM - Disk space on elastic1014 is CRITICAL: Connection refused by host [16:18:19] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000 [16:18:40] <^d> apergos: Well that was just a random shot in the dark. It's configurable. [16:23:40] <^d> apergos: I don't think Aaron liked my approach though, which is fair enough. [16:23:41] <^d> :) [16:24:44] ah the range request thing, I never remember that piece of the puzzle [16:26:20] PROBLEM - NTP on elastic1014 is CRITICAL: NTP CRITICAL: No response from NTP server [16:27:49] RECOVERY - SSH on elastic1014 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.3 (protocol 2.0) [16:31:04] (03PS1) 10Aaron Schulz: Configured "renderfile-nonstandard" throttle [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/127643 [16:32:09] RECOVERY - ElasticSearch health check on elastic1014 is OK: OK - elasticsearch (production-search-eqiad) is running. 
status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1948: active_shards: 5783: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [16:34:25] !log reinstalling elastic1013 (elastic1014 is still coming back online, but I don't want there to an extra eligible master for long) [16:34:26] ^ manybubbles [16:34:31] Logged the message, Master [16:34:42] ottomata: find by me [16:34:49] we blasted both of them for tht reason [16:35:07] but elastic1013 needs to be back soon [16:35:20] 1014 is online [16:35:24] its running puppet still [16:35:27] because without it come close to missing the quorum [16:35:36] still need to fix salt, run puppet again, and then bounce [16:35:40] so long as it'll be online in 2 minutes or so, we can blast the next one [16:35:58] sounds ok then [16:36:04] i mean, 1014 online, es is running, i just have a few more things to do to it before i am done with it [16:36:49] RECOVERY - check configured eth on elastic1014 is OK: NRPE: Unable to read output [16:36:59] RECOVERY - RAID on elastic1014 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [16:36:59] RECOVERY - check if dhclient is running on elastic1014 is OK: PROCS OK: 0 processes with command name dhclient [16:37:09] RECOVERY - puppet disabled on elastic1014 is OK: OK [16:37:09] PROBLEM - Host elastic1013 is DOWN: PING CRITICAL - Packet loss = 100% [16:37:29] RECOVERY - DPKG on elastic1014 is OK: All packages OK [16:37:29] RECOVERY - Disk space on elastic1014 is OK: DISK OK [16:38:27] heya, has anyone has luck using Putty + Plink to ssh proxy through a bastion? [16:38:47] ezachte is having trouble getting into stat1003 without forwarding his ssh key :/ [16:38:53] and I'm not much help so far [16:39:25] aww [16:39:41] (03CR) 10Aaron Schulz: [C: 032] Removed redundant config [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/127635 (owner: 10Aaron Schulz) [16:39:48] (03Merged) 10jenkins-bot: Removed redundant config [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/127635 (owner: 10Aaron Schulz) [16:41:43] (03PS1) 10Aaron Schulz: Revert "Removed redundant config" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/127645 [16:41:50] (03CR) 10Aaron Schulz: [C: 032] Revert "Removed redundant config" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/127645 (owner: 10Aaron Schulz) [16:41:58] (03Merged) 10jenkins-bot: Revert "Removed redundant config" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/127645 (owner: 10Aaron Schulz) [16:42:19] RECOVERY - Host elastic1013 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [16:42:22] (03PS1) 10Aaron Schulz: Removed redundant pool counter config [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/127646 [16:42:26] manybubbles: elastic1014 is good to go, i'm moving shards back there [16:42:35] ottomata: sweet [16:42:45] (03CR) 10Aaron Schulz: [C: 04-1] "Everything needs to be on 1.24" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/127646 (owner: 10Aaron Schulz) [16:44:19] PROBLEM - check if dhclient is running on elastic1013 is CRITICAL: Connection refused by host [16:44:20] PROBLEM - Disk space on elastic1013 is CRITICAL: Connection refused by host [16:44:20] PROBLEM - check configured eth on elastic1013 is CRITICAL: Connection refused by host [16:44:29] PROBLEM - DPKG on elastic1013 is CRITICAL: Connection refused by host [16:44:29] PROBLEM - SSH on elastic1013 is CRITICAL: Connection refused [16:44:29] PROBLEM - 
ElasticSearch health check on elastic1013 is CRITICAL: CRITICAL - Could not connect to server 10.64.48.10 [16:45:09] PROBLEM - puppet disabled on elastic1013 is CRITICAL: Connection refused by host [16:45:09] PROBLEM - RAID on elastic1013 is CRITICAL: Connection refused by host [16:50:19] RECOVERY - NTP on elastic1014 is OK: NTP OK: Offset -0.0155210495 secs [16:56:30] PROBLEM - NTP on elastic1013 is CRITICAL: NTP CRITICAL: No response from NTP server [16:58:29] RECOVERY - SSH on elastic1013 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.3 (protocol 2.0) [17:02:29] RECOVERY - ElasticSearch health check on elastic1013 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1948: active_shards: 5783: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [17:02:41] !log aaron synchronized php-1.23wmf22/thumb.php '95913654cbf0a2c35ddbef74d135eeea71600d54' [17:02:48] Logged the message, Master [17:03:11] !log aaron synchronized php-1.24wmf1/thumb.php '44c46589ef02a30e60177ea268bc8de27a740434' [17:03:15] Logged the message, Master [17:03:48] :) [17:07:19] RECOVERY - Disk space on elastic1013 is OK: DISK OK [17:07:20] RECOVERY - check configured eth on elastic1013 is OK: NRPE: Unable to read output [17:07:29] RECOVERY - DPKG on elastic1013 is OK: All packages OK [17:08:09] RECOVERY - puppet disabled on elastic1013 is OK: OK [17:08:09] RECOVERY - RAID on elastic1013 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [17:08:19] RECOVERY - check if dhclient is running on elastic1013 is OK: PROCS OK: 0 processes with command name dhclient [17:11:18] (03CR) 10Chad: [C: 032] Configured "renderfile-nonstandard" throttle [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/127643 (owner: 10Aaron Schulz) [17:11:26] (03Merged) 10jenkins-bot: Configured "renderfile-nonstandard" throttle [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/127643 (owner: 10Aaron Schulz) [17:11:44] manybubbles: 1013 good! [17:11:45] ^d: you'd have to backport the core change for that to do anything [17:11:46] moving shards there too [17:12:13] <^d> AaronSchulz: Indeed, but I was on a roll and it's harmless to go out now [17:12:29] sure [17:13:24] !log demon synchronized wmf-config/InitialiseSettings.php 'renderfile-nonstandard throttle config' [17:13:29] Logged the message, Master [17:14:20] PROBLEM - Host es10 is DOWN: PING CRITICAL - Packet loss = 100% [17:14:49] PROBLEM - Host es9 is DOWN: PING CRITICAL - Packet loss = 100% [17:15:50] (03PS1) 10Aaron Schulz: Adjust large file download pool counter config [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/127654 [17:16:19] (03PS2) 10Aaron Schulz: Adjust large file download pool counter config to tie up less workers [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/127654 [17:16:28] manybubbles, ^d, awesome! [17:16:30] cmjohnson1, mark - around? 
re: pdf migration, I'd say we'll want to disable the Collection extension first per Max's suggestion, and perhaps send a brief note about downtime [17:16:34] so everything is done except for elastic1008 [17:16:43] (03CR) 10Aaron Schulz: [C: 032] Adjust large file download pool counter config to tie up less workers [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/127654 (owner: 10Aaron Schulz) [17:16:50] just need to get that mgmt console problem fixed on that [17:16:51] and then I can do that [17:16:56] not sure when someone will be at eqiad to do that [17:16:57] we'll see [17:17:06] but! we should be good to go aside from that [17:18:08] !log aaron synchronized php-1.23wmf22/includes/filerepo/file/LocalFile.php '01ce2888d0d9011514a3fc16c1606b5b42f1ef37' [17:18:09] RECOVERY - Host es10 is UP: PING OK - Packet loss = 0%, RTA = 35.42 ms [17:18:14] Logged the message, Master [17:18:36] eloquence: I am thinking of pushing that move until tomorrow morning [17:18:44] (03PS2) 10BryanDavis: [WIP] Move beta scap source directory off of NFS [operations/puppet] - 10https://gerrit.wikimedia.org/r/127399 [17:19:01] (03CR) 10jenkins-bot: [V: 04-1] [WIP] Move beta scap source directory off of NFS [operations/puppet] - 10https://gerrit.wikimedia.org/r/127399 (owner: 10BryanDavis) [17:19:21] (03PS3) 10BryanDavis: [WIP] Move beta scap source directory off of NFS [operations/puppet] - 10https://gerrit.wikimedia.org/r/127399 [17:19:28] lunchtime, back in a bit [17:19:54] cmjohnson1, ok - we should probably have a deploy window to deactivate Collection til it's done. greg-g can you help with this? [17:20:07] jenkins takes way too long for just mediawiki-config ;) [17:20:08] do we really have to make this complicated? [17:20:45] a deploy window to deactivate an extension? [17:21:11] haha [17:21:19] RECOVERY - NTP on elastic1013 is OK: NTP OK: Offset -0.01619303226 secs [17:21:20] at first I read that as "activate" [17:21:32] (03Merged) 10jenkins-bot: Adjust large file download pool counter config to tie up less workers [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/127654 (owner: 10Aaron Schulz) [17:21:51] mark, the download PDF link is in the sidebar on every page and gets a fair bit of usage. it'd be nice if we could coordinate the timing so users don't just get errors. [17:22:00] !log aaron synchronized wmf-config/PoolCounterSettings-eqiad.php 'Adjust large file download pool counter config to tie up less workers' [17:22:06] Logged the message, Master [17:24:53] chris says tomorrow 14:00-15:00 UTC should work [17:25:00] !log aaron synchronized php-1.24wmf1/includes/filerepo/file/LocalFile.php '2026e4aee7ffe00e0192914c0bd4a9ce04681c36' [17:25:06] Logged the message, Master [17:27:22] (03PS1) 10Chad: Disable Collection for server move [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/127658 [17:27:24] (03PS1) 10Chad: Revert "Disable Collection for server move" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/127659 [17:27:32] <^d> There's 90% of the work. [17:27:47] mark: cmjohnson1 Eloquence tomorrow at 14 UTC is perfect [17:27:54] thanks ^d :) [17:29:51] thanks - appreciated. [17:30:33] thanks greg-g [17:32:03] cmjohnson1: are you able/want to do the config change deploys before/after, or do you want someone else to do them?
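The rolling Elasticsearch reinstalls above (elastic1013, elastic1014, with elastic1008 still pending) hinge on waiting for the cluster to report green again before the next node is blasted. A minimal sketch of that wait loop against the standard _cluster/health API; the node name and polling interval are illustrative assumptions, not taken from the log.

    import json
    import time
    import urllib.request

    # Hypothetical node to query; any cluster member answers _cluster/health.
    HEALTH_URL = "http://elastic1001.eqiad.wmnet:9200/_cluster/health"

    def cluster_health():
        with urllib.request.urlopen(HEALTH_URL, timeout=10) as resp:
            return json.loads(resp.read().decode("utf-8"))

    def wait_for_green(poll_seconds=30):
        # Block until the cluster is green with nothing unassigned, i.e. it is
        # safe to take the next node out without risking quorum or shard loss.
        while True:
            health = cluster_health()
            print(health["status"], "nodes:", health["number_of_nodes"],
                  "relocating:", health["relocating_shards"],
                  "unassigned:", health["unassigned_shards"])
            if health["status"] == "green" and health["unassigned_shards"] == 0:
                return
            time.sleep(poll_seconds)

    if __name__ == "__main__":
        wait_for_green()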
[17:32:21] someone else would probably be good [17:32:23] yeah [17:32:36] 7am pacific is tough for us west coasters :) [17:33:05] 13:32 < greg-g> cmjohnson1: are you able/want to do the config change deploys before/after, or do you want someone else to do them? [17:33:08] 13:32 < mark> someone else would probably be good [17:33:10] 13:32 < greg-g> yeah [17:33:13] 13:32 < greg-g> 7am pacific is tough for us west coasters :) [17:34:56] cmjohnson: asw-b-sdtpa doesn't let me in [17:34:58] any reason for that? [17:35:09] the sandbox vlan (for amaranth) is not setup yet on the 12th floor [17:35:10] the rest is done [17:35:23] manybubbles: hey, can you do a 10am config change for cmjohnson tomorrow? [17:35:36] greg-g: sure [17:35:39] coolio [17:35:50] my 10am or yours? [17:35:53] yours [17:35:56] 14 UTC [17:36:00] I don't know why it will not let you in [17:36:04] ok [17:36:07] added to calendar: https://wikitech.wikimedia.org/wiki/Deployments#Week_of_April_21st [17:36:12] cmjohnson: don't worry about it [17:36:13] i'll figure it out [17:36:34] k..can't do amaranth until tomorrow same time as pdf anyway [17:37:21] !log shutting down mexia to relocate to 12th floor [17:37:28] Logged the message, Master [17:38:37] mark: try asw-b now... [17:39:54] cmjohnson: i think sandbox is done now too [17:39:57] !log shutting down and relocating fenari [17:39:57] but i can't guarantee it, hard to test [17:40:03] Logged the message, Master [17:40:07] okay [17:41:19] PROBLEM - Host ns1.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [17:41:26] eek [17:42:09] RECOVERY - Check rp_filter disabled on lvs3003 is OK: OK: kernel parameters are set to expected value. [17:42:22] ok [17:42:24] greg-g: mexia AKA ns1 [17:42:25] all done [17:42:35] cmjohnson: i'll make sure to be available tomorrow during that move time frame [17:42:47] if you hit any roadblocks / issues today, please call me :) [17:42:59] PROBLEM - Host fenari is DOWN: PING CRITICAL - Packet loss = 100% [17:43:15] if you create a ticket with more network assignments I can get those done for you as well [17:43:31] k [17:45:49] PROBLEM - Host mexia is DOWN: PING CRITICAL - Packet loss = 100% [18:23:49] RECOVERY - Host ns1.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 35.40 ms [18:24:30] RECOVERY - Host mexia is UP: PING OK - Packet loss = 0%, RTA = 35.61 ms [18:25:12] Duh, fenari is still down [18:25:29] RECOVERY - Host fenari is UP: PING OK - Packet loss = 0%, RTA = 35.38 ms [18:27:29] PROBLEM - twemproxy port on fenari is CRITICAL: PROCS CRITICAL: 0 processes with UID = 65534 (nobody), command name nutcracker [18:27:30] PROBLEM - twemproxy process on fenari is CRITICAL: PROCS CRITICAL: 0 processes with UID = 65534 (nobody), command name nutcracker [18:35:46] (03PS1) 10Ottomata: Fixing documentation on varnishkafka_ganglia.py [operations/puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/127674 [19:09:27] do you have two star certificates for *.wikipedia.org signed by DigiCert High Assurance CA-3 [19:09:34] but with different fingerprint? 
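The certificate question just above can be checked locally by fetching the leaf certificate a server presents and hashing its DER encoding, which is what the browser-displayed fingerprint is. A stdlib-only sketch; SHA-256 is shown (older UIs display SHA-1) and the host is just an example:

    import hashlib
    import ssl

    def cert_fingerprint(host, port=443):
        # Fetch the certificate actually served and fingerprint its DER form.
        pem = ssl.get_server_certificate((host, port))
        der = ssl.PEM_cert_to_DER_cert(pem)
        digest = hashlib.sha256(der).hexdigest().upper()
        return ":".join(digest[i:i + 2] for i in range(0, len(digest), 2))

    if __name__ == "__main__":
        # Two certs from the same CA with identical validity dates will still
        # differ here if the key/cert was rotated, e.g. after Heartbleed.
        print(cert_fingerprint("www.wikipedia.org"))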
[19:09:51] (dates are the same) [19:11:25] Platonides: they were replaced when heartbleed happened, and yes, the "not-valid-before" date is the same on the new one [19:11:37] Platonides: others can do more exact answers, but yes, due to Heartbleed, we replaced our certs, and due to mis-naming between the Cert Issuers and Browsers, there is confusion [19:12:36] (03PS6) 10BBlack: Replace Linux RPS setting with a smarter script [operations/puppet] - 10https://gerrit.wikimedia.org/r/95963 (owner: 10Faidon Liambotis) [19:13:10] see the one FAQ here: https://blog.wikimedia.org/2014/04/10/wikimedias-response-to-the-heartbleed-security-vulnerability/ [19:13:21] (03CR) 10BBlack: "PS6 is just a manual rebase of the previous work" [operations/puppet] - 10https://gerrit.wikimedia.org/r/95963 (owner: 10Faidon Liambotis) [19:13:23] what scared me was that I thought I had recently seen the opposite transition [19:13:33] but seems it was the same one [19:14:11] must have been for a different domain [19:14:12] well, I *think* we also started transitioning away from $crappy_cert_issuer to $slight_less_crappy_cert_issuer [19:14:18] ie: even after heartbleed [19:15:01] those *.wikipedia.org are issued by the same CA [19:15:07] !log shutting down and relocating linne [19:15:12] !log shutting down and relocating dobson [19:15:13] Logged the message, Master [19:15:15] just with a different certificate chain [19:15:19] Logged the message, Master [19:22:31] (03PS7) 10BBlack: Replace Linux RPS setting with a smarter script [operations/puppet] - 10https://gerrit.wikimedia.org/r/95963 (owner: 10Faidon Liambotis) [19:26:30] (03CR) 10BBlack: "PS7 improves on the case where there are more queues than CPUs. In PS6, the leftover queues were all given (effectively) 0xff mask. In P" [operations/puppet] - 10https://gerrit.wikimedia.org/r/95963 (owner: 10Faidon Liambotis) [19:37:49] RECOVERY - Host dataset2 is UP: PING OK - Packet loss = 0%, RTA = 35.41 ms [19:38:39] PROBLEM - Puppet freshness on dataset2 is CRITICAL: Last successful Puppet run was Mon 21 Apr 2014 07:37:43 PM UTC [19:39:39] PROBLEM - Puppet freshness on dataset2 is CRITICAL: Last successful Puppet run was Mon 21 Apr 2014 07:37:43 PM UTC [19:40:19] PROBLEM - Host linne is DOWN: PING CRITICAL - Packet loss = 100% [19:40:39] PROBLEM - Puppet freshness on dataset2 is CRITICAL: Last successful Puppet run was Mon 21 Apr 2014 07:37:43 PM UTC [19:40:59] PROBLEM - Recursive DNS on 208.80.152.131 is CRITICAL: CRITICAL - Plugin timed out while executing system call [19:41:19] PROBLEM - Host dobson is DOWN: PING CRITICAL - Packet loss = 100% [19:41:39] PROBLEM - Puppet freshness on dataset2 is CRITICAL: Last successful Puppet run was Mon 21 Apr 2014 07:37:43 PM UTC [19:42:39] PROBLEM - Puppet freshness on dataset2 is CRITICAL: Last successful Puppet run was Mon 21 Apr 2014 07:37:43 PM UTC [19:42:39] PROBLEM - Host 208.80.152.131 is DOWN: PING CRITICAL - Packet loss = 100% [19:43:39] PROBLEM - Puppet freshness on dataset2 is CRITICAL: Last successful Puppet run was Mon 21 Apr 2014 07:37:43 PM UTC [19:44:39] PROBLEM - Puppet freshness on dataset2 is CRITICAL: Last successful Puppet run was Mon 21 Apr 2014 07:37:43 PM UTC [19:45:22] !log rolling restart on more of the elasticsearch servers to pick up plugins (06, 07, 09) [19:45:28] Logged the message, Master [19:45:39] PROBLEM - Puppet freshness on dataset2 is CRITICAL: Last successful Puppet run was Mon 21 Apr 2014 07:37:43 PM UTC [19:46:39] PROBLEM - Puppet freshness on dataset2 is CRITICAL: Last successful Puppet run was 
Mon 21 Apr 2014 07:37:43 PM UTC [19:47:39] PROBLEM - Puppet freshness on dataset2 is CRITICAL: Last successful Puppet run was Mon 21 Apr 2014 07:37:43 PM UTC [19:48:39] PROBLEM - Puppet freshness on dataset2 is CRITICAL: Last successful Puppet run was Mon 21 Apr 2014 07:37:43 PM UTC [19:48:49] RECOVERY - Puppet freshness on dataset2 is OK: puppet ran at Mon Apr 21 19:48:42 UTC 2014 [19:49:39] PROBLEM - Puppet freshness on dataset2 is CRITICAL: Last successful Puppet run was Mon 21 Apr 2014 07:48:42 PM UTC [19:50:39] PROBLEM - Puppet freshness on dataset2 is CRITICAL: Last successful Puppet run was Mon 21 Apr 2014 07:48:42 PM UTC [19:51:39] PROBLEM - Puppet freshness on dataset2 is CRITICAL: Last successful Puppet run was Mon 21 Apr 2014 07:48:42 PM UTC [19:52:39] PROBLEM - Puppet freshness on dataset2 is CRITICAL: Last successful Puppet run was Mon 21 Apr 2014 07:48:42 PM UTC [19:52:39] PROBLEM - NTP on dataset2 is CRITICAL: NTP CRITICAL: Offset unknown [19:53:39] PROBLEM - Puppet freshness on dataset2 is CRITICAL: Last successful Puppet run was Mon 21 Apr 2014 07:48:42 PM UTC [19:53:39] RECOVERY - Host linne is UP: PING OK - Packet loss = 0%, RTA = 35.39 ms [19:54:39] PROBLEM - Puppet freshness on dataset2 is CRITICAL: Last successful Puppet run was Mon 21 Apr 2014 07:48:42 PM UTC [19:55:19] RECOVERY - Host dobson is UP: PING OK - Packet loss = 0%, RTA = 36.50 ms [19:55:39] PROBLEM - Puppet freshness on dataset2 is CRITICAL: Last successful Puppet run was Mon 21 Apr 2014 07:48:42 PM UTC [19:55:49] RECOVERY - Recursive DNS on 208.80.152.131 is OK: DNS OK: 0.077 seconds response time. www.wikipedia.org returns 208.80.154.224 [19:55:59] RECOVERY - Host 208.80.152.131 is UP: PING OK - Packet loss = 0%, RTA = 35.49 ms [19:56:09] PROBLEM - NTP peers on linne is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown [19:56:39] PROBLEM - Puppet freshness on dataset2 is CRITICAL: Last successful Puppet run was Mon 21 Apr 2014 07:48:42 PM UTC [19:57:09] RECOVERY - NTP peers on linne is OK: NTP OK: Offset 0.005569 secs [19:57:39] PROBLEM - Puppet freshness on dataset2 is CRITICAL: Last successful Puppet run was Mon 21 Apr 2014 07:48:42 PM UTC [19:58:39] PROBLEM - Puppet freshness on dataset2 is CRITICAL: Last successful Puppet run was Mon 21 Apr 2014 07:48:42 PM UTC [19:59:39] PROBLEM - Puppet freshness on dataset2 is CRITICAL: Last successful Puppet run was Mon 21 Apr 2014 07:48:42 PM UTC [20:00:39] PROBLEM - Puppet freshness on dataset2 is CRITICAL: Last successful Puppet run was Mon 21 Apr 2014 07:48:42 PM UTC [20:01:39] PROBLEM - Puppet freshness on dataset2 is CRITICAL: Last successful Puppet run was Mon 21 Apr 2014 07:48:42 PM UTC [20:02:39] PROBLEM - Puppet freshness on dataset2 is CRITICAL: Last successful Puppet run was Mon 21 Apr 2014 07:48:42 PM UTC [20:03:39] PROBLEM - Puppet freshness on dataset2 is CRITICAL: Last successful Puppet run was Mon 21 Apr 2014 07:48:42 PM UTC [20:04:39] PROBLEM - Puppet freshness on dataset2 is CRITICAL: Last successful Puppet run was Mon 21 Apr 2014 07:48:42 PM UTC [20:05:39] PROBLEM - Puppet freshness on dataset2 is CRITICAL: Last successful Puppet run was Mon 21 Apr 2014 07:48:42 PM UTC [20:06:39] PROBLEM - Puppet freshness on dataset2 is CRITICAL: Last successful Puppet run was Mon 21 Apr 2014 07:48:42 PM UTC [20:07:39] PROBLEM - Puppet freshness on dataset2 is CRITICAL: Last successful Puppet run was Mon 21 Apr 2014 07:48:42 PM UTC [20:08:39] PROBLEM - Puppet freshness on dataset2 is CRITICAL: Last successful Puppet run was Mon 21 Apr 2014 
07:48:42 PM UTC [20:09:39] PROBLEM - Puppet freshness on dataset2 is CRITICAL: Last successful Puppet run was Mon 21 Apr 2014 07:48:42 PM UTC [20:10:39] PROBLEM - Puppet freshness on dataset2 is CRITICAL: Last successful Puppet run was Mon 21 Apr 2014 07:48:42 PM UTC [20:11:39] PROBLEM - Puppet freshness on dataset2 is CRITICAL: Last successful Puppet run was Mon 21 Apr 2014 07:48:42 PM UTC [20:12:39] PROBLEM - Puppet freshness on dataset2 is CRITICAL: Last successful Puppet run was Mon 21 Apr 2014 07:48:42 PM UTC [20:13:39] PROBLEM - Puppet freshness on dataset2 is CRITICAL: Last successful Puppet run was Mon 21 Apr 2014 07:48:42 PM UTC [20:14:39] PROBLEM - Puppet freshness on dataset2 is CRITICAL: Last successful Puppet run was Mon 21 Apr 2014 07:48:42 PM UTC [20:15:39] PROBLEM - Puppet freshness on dataset2 is CRITICAL: Last successful Puppet run was Mon 21 Apr 2014 07:48:42 PM UTC [20:16:39] PROBLEM - Puppet freshness on dataset2 is CRITICAL: Last successful Puppet run was Mon 21 Apr 2014 07:48:42 PM UTC [20:17:39] PROBLEM - Puppet freshness on dataset2 is CRITICAL: Last successful Puppet run was Mon 21 Apr 2014 07:48:42 PM UTC [20:18:39] PROBLEM - Puppet freshness on dataset2 is CRITICAL: Last successful Puppet run was Mon 21 Apr 2014 07:48:42 PM UTC [20:19:09] RECOVERY - Puppet freshness on dataset2 is OK: puppet ran at Mon Apr 21 20:19:06 UTC 2014 [20:19:39] PROBLEM - Puppet freshness on dataset2 is CRITICAL: Last successful Puppet run was Mon 21 Apr 2014 08:19:06 PM UTC [20:20:39] PROBLEM - Puppet freshness on dataset2 is CRITICAL: Last successful Puppet run was Mon 21 Apr 2014 08:19:06 PM UTC [20:21:39] PROBLEM - Puppet freshness on dataset2 is CRITICAL: Last successful Puppet run was Mon 21 Apr 2014 08:19:06 PM UTC [20:22:39] PROBLEM - Puppet freshness on dataset2 is CRITICAL: Last successful Puppet run was Mon 21 Apr 2014 08:19:06 PM UTC [20:23:39] PROBLEM - Puppet freshness on dataset2 is CRITICAL: Last successful Puppet run was Mon 21 Apr 2014 08:19:06 PM UTC [20:24:18] ottomata: elastic1013 is reporting down in ganglia, but I believe it is up [20:24:39] PROBLEM - Puppet freshness on dataset2 is CRITICAL: Last successful Puppet run was Mon 21 Apr 2014 08:19:06 PM UTC [20:25:29] PROBLEM - Host es8 is DOWN: PING CRITICAL - Packet loss = 100% [20:25:33] <^d> manybubbles: Sometime almost 4h ago [20:25:39] PROBLEM - Puppet freshness on dataset2 is CRITICAL: Last successful Puppet run was Mon 21 Apr 2014 08:19:06 PM UTC [20:25:49] PROBLEM - Host db68 is DOWN: PING CRITICAL - Packet loss = 100% [20:25:49] yeah, but its really actually up [20:25:55] its just ganglia being a bitch [20:25:58] (03PS4) 10BryanDavis: [WIP] Move beta scap source directory off of NFS [operations/puppet] - 10https://gerrit.wikimedia.org/r/127399 [20:26:39] PROBLEM - Puppet freshness on dataset2 is CRITICAL: Last successful Puppet run was Mon 21 Apr 2014 08:19:06 PM UTC [20:26:39] PROBLEM - Host db77 is DOWN: PING CRITICAL - Packet loss = 100% [20:27:29] RECOVERY - Puppet freshness on dataset2 is OK: puppet ran at Mon Apr 21 20:27:21 UTC 2014 [20:27:39] PROBLEM - Puppet freshness on dataset2 is CRITICAL: Last successful Puppet run was Mon 21 Apr 2014 08:27:21 PM UTC [20:28:39] PROBLEM - Puppet freshness on dataset2 is CRITICAL: Last successful Puppet run was Mon 21 Apr 2014 08:27:21 PM UTC [20:29:39] PROBLEM - Puppet freshness on dataset2 is CRITICAL: Last successful Puppet run was Mon 21 Apr 2014 08:27:21 PM UTC [20:29:43] Whose project is dataset2? 
Looks like it's a lucid box trying to apply a puppet class that requires nfs4. [20:29:49] RECOVERY - Puppet freshness on dataset2 is OK: puppet ran at Mon Apr 21 20:29:39 UTC 2014 [20:30:39] PROBLEM - Puppet freshness on dataset2 is CRITICAL: Last successful Puppet run was Mon 21 Apr 2014 08:29:39 PM UTC [20:30:49] RECOVERY - Puppet freshness on dataset2 is OK: puppet ran at Mon Apr 21 20:30:40 UTC 2014 [20:31:27] !log rolling restart on remaining Elasticsearch servers to get the plugin (1010, 1011, 1012, 1015) [20:31:34] Logged the message, Master [20:31:39] PROBLEM - Puppet freshness on dataset2 is CRITICAL: Last successful Puppet run was Mon 21 Apr 2014 08:30:40 PM UTC [20:32:39] PROBLEM - Puppet freshness on dataset2 is CRITICAL: Last successful Puppet run was Mon 21 Apr 2014 08:30:40 PM UTC [20:33:39] PROBLEM - Puppet freshness on dataset2 is CRITICAL: Last successful Puppet run was Mon 21 Apr 2014 08:30:40 PM UTC [20:34:39] PROBLEM - Puppet freshness on dataset2 is CRITICAL: Last successful Puppet run was Mon 21 Apr 2014 08:30:40 PM UTC [20:35:39] PROBLEM - Puppet freshness on dataset2 is CRITICAL: Last successful Puppet run was Mon 21 Apr 2014 08:30:40 PM UTC [20:36:39] PROBLEM - Puppet freshness on dataset2 is CRITICAL: Last successful Puppet run was Mon 21 Apr 2014 08:30:40 PM UTC [20:37:39] PROBLEM - Puppet freshness on dataset2 is CRITICAL: Last successful Puppet run was Mon 21 Apr 2014 08:30:40 PM UTC [20:37:42] andrewbogott, RoanKattouw On restart of parsoid cluster after deploy, i got the following error for wtp1018 [20:37:43] ssh: connect to host wtp1018 port 22: Connection timed out [20:37:55] gwicke, ^ [20:38:17] what is the best way to fix this? can that node be restarted? [20:38:39] PROBLEM - Puppet freshness on dataset2 is CRITICAL: Last successful Puppet run was Mon 21 Apr 2014 08:30:40 PM UTC [20:38:39] subbu: you can reboot it from the labs interface, can't you? [20:38:43] If that's what you want... [20:38:51] Oh, wait, sorry, this is a production box? [20:38:52] andrewbogott, this is the prod cluster. [20:38:55] yes. [20:38:55] andrewbogott, that's a prod machine [20:38:58] * andrewbogott looks [20:39:19] I have no luck logging in on wtp1018 either [20:39:29] PROBLEM - ElasticSearch health check on elastic1011 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.143 [20:39:39] PROBLEM - Puppet freshness on dataset2 is CRITICAL: Last successful Puppet run was Mon 21 Apr 2014 08:30:40 PM UTC [20:39:44] subbu: you say 'on restart of parsoid cluster…' does that mean that you just now rebooted that box? [20:39:59] PROBLEM - Host es7 is DOWN: PING CRITICAL - Packet loss = 100% [20:40:04] andrewbogott, just restarted parsoid. [20:40:08] parsoid service, not reboot [20:40:25] sorry, i communicated incorrectly there. [20:40:39] PROBLEM - Puppet freshness on dataset2 is CRITICAL: Last successful Puppet run was Mon 21 Apr 2014 08:30:40 PM UTC [20:40:45] Ah, ok. I can certainly reboot that box if you think that will be non-dangerous [20:40:56] gwicke, ^ ? [20:41:39] PROBLEM - Puppet freshness on dataset2 is CRITICAL: Last successful Puppet run was Mon 21 Apr 2014 08:30:40 PM UTC [20:41:39] RECOVERY - Host es7 is UP: PING OK - Packet loss = 0%, RTA = 35.45 ms [20:41:45] a reboot would be great [20:41:55] ACKNOWLEDGEMENT - NTP on dataset2 is CRITICAL: NTP CRITICAL: Offset unknown andrew bogott I dont see how puppet can ever have worked on this -- trying NFS4 on a lucid box. 
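The ACKNOWLEDGEMENT entries here, and the confusion a little later about why dataset2 kept complaining after being acked, come down to whether the host or each individual service was acknowledged. In Icinga 1.x an acknowledgement is just an external command written to the command pipe; a sketch of doing it per service, where the pipe path is an assumed default and acking the host alone would not quiet a separate service check:

    import time

    # Assumed default command pipe; the path varies between installs.
    CMD_PIPE = "/var/lib/icinga/rw/icinga.cmd"

    def ack_service(host, service, author, comment):
        # Format: ACKNOWLEDGE_SVC_PROBLEM;host;service;sticky;notify;persistent;author;comment
        line = "[{ts}] ACKNOWLEDGE_SVC_PROBLEM;{h};{s};2;1;1;{a};{c}\n".format(
            ts=int(time.time()), h=host, s=service, a=author, c=comment)
        with open(CMD_PIPE, "w") as pipe:
            pipe.write(line)

    if __name__ == "__main__":
        # Each alerting service needs its own acknowledgement.
        ack_service("dataset2", "Puppet freshness", "andrew bogott",
                    "Puppet will pretty much never run properly on this box.")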
[20:41:55] ACKNOWLEDGEMENT - Puppet freshness on dataset2 is CRITICAL: Last successful Puppet run was Mon 21 Apr 2014 08:30:40 PM UTC andrew bogott I dont see how puppet can ever have worked on this -- trying NFS4 on a lucid box. [20:42:18] andrewbogott, so yes. [20:45:07] !log deployed Parsoid ec51e5d1 (deploy SHA 0dd607fc) [20:45:11] !log rebooted wtp1018 [20:45:13] Logged the message, Master [20:45:19] Logged the message, Master [20:47:39] RECOVERY - Host wtp1018 is UP: PING OK - Packet loss = 0%, RTA = 0.41 ms [20:48:09] PROBLEM - ElasticSearch health check on elastic1012 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.144 [20:48:11] subbu: looks like the box came up ok… can you log in? [20:48:15] gwicke, https://ganglia.wikimedia.org/latest/?r=week&cs=&ce=&c=Parsoid+eqiad&h=wtp1018.eqiad.wmnet&tab=m&vn=&hide-hf=false&mc=2&z=medium&metric_group=ALLGROUPS .. looks like 1018 was down for a few days. [20:48:35] andrewbogott, one sec. let me check. [20:48:52] yes, i can. [20:48:55] thanks for the reboot. [20:48:59] RECOVERY - Puppet freshness on dataset2 is OK: puppet ran at Mon Apr 21 20:48:51 UTC 2014 [20:49:39] PROBLEM - Puppet freshness on dataset2 is CRITICAL: Last successful Puppet run was Mon 21 Apr 2014 08:48:51 PM UTC [20:50:16] subbu: can you do some logfile digging and see why it locked up? Possibly oom? [20:50:36] gwicke, wtp1018 has the old deployed version of parsoid. [20:50:39] PROBLEM - Puppet freshness on dataset2 is CRITICAL: Last successful Puppet run was Mon 21 Apr 2014 08:48:51 PM UTC [20:51:01] andrewbogott, will do. i'll wait for gwicke to be back from his meeting and we'll investigate. [20:51:18] subbu: thanks [20:51:39] PROBLEM - Puppet freshness on dataset2 is CRITICAL: Last successful Puppet run was Mon 21 Apr 2014 08:48:51 PM UTC [20:52:39] PROBLEM - Puppet freshness on dataset2 is CRITICAL: Last successful Puppet run was Mon 21 Apr 2014 08:48:51 PM UTC [20:53:07] (03Abandoned) 10Jforrester: wikibugs: Set up #mediawiki-visualeditor [operations/puppet] - 10https://gerrit.wikimedia.org/r/37570 (owner: 10Jforrester) [20:53:39] PROBLEM - Puppet freshness on dataset2 is CRITICAL: Last successful Puppet run was Mon 21 Apr 2014 08:48:51 PM UTC [20:53:54] dammit I acknowledged, why is dataset2 still complaining? [20:54:39] PROBLEM - Puppet freshness on dataset2 is CRITICAL: Last successful Puppet run was Mon 21 Apr 2014 08:48:51 PM UTC [20:55:39] PROBLEM - Puppet freshness on dataset2 is CRITICAL: Last successful Puppet run was Mon 21 Apr 2014 08:48:51 PM UTC [20:56:39] PROBLEM - Puppet freshness on dataset2 is CRITICAL: Last successful Puppet run was Mon 21 Apr 2014 08:48:51 PM UTC [20:57:09] RECOVERY - ElasticSearch health check on elastic1012 is OK: OK - elasticsearch (production-search-eqiad) is running. 
status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1948: active_shards: 5783: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [20:57:29] PROBLEM - ElasticSearch health check on elastic1015 is CRITICAL: CRITICAL - Could not connect to server 10.64.48.12 [20:57:39] PROBLEM - Puppet freshness on dataset2 is CRITICAL: Last successful Puppet run was Mon 21 Apr 2014 08:48:51 PM UTC [20:58:39] PROBLEM - Puppet freshness on dataset2 is CRITICAL: Last successful Puppet run was Mon 21 Apr 2014 08:48:51 PM UTC [20:59:39] PROBLEM - Puppet freshness on dataset2 is CRITICAL: Last successful Puppet run was Mon 21 Apr 2014 08:48:51 PM UTC [21:00:39] PROBLEM - Puppet freshness on dataset2 is CRITICAL: Last successful Puppet run was Mon 21 Apr 2014 08:48:51 PM UTC [21:01:11] andrewbogott: did you ack the service or the host? [21:01:30] service [21:01:39] PROBLEM - Puppet freshness on dataset2 is CRITICAL: Last successful Puppet run was Mon 21 Apr 2014 08:48:51 PM UTC [21:02:02] andrewbogott: do you see that in the icinhga log on neon [21:02:19] PROBLEM - NTP on wtp1018 is CRITICAL: NTP CRITICAL: Offset unknown [21:02:39] PROBLEM - Puppet freshness on dataset2 is CRITICAL: Last successful Puppet run was Mon 21 Apr 2014 08:48:51 PM UTC [21:03:08] Hm, I think I just checked two boxes and the gui ignored the second checkbox [21:03:26] ACKNOWLEDGEMENT - Puppet freshness on dataset2 is CRITICAL: Last successful Puppet run was Mon 21 Apr 2014 08:48:51 PM UTC andrew bogott Puppet will pretty much never run properly on this box. [21:06:31] woo, a moment's peace at last! [21:07:19] RECOVERY - NTP on wtp1018 is OK: NTP OK: Offset -0.0009940862656 secs [21:07:29] RECOVERY - ElasticSearch health check on elastic1015 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1948: active_shards: 5783: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [21:07:30] RECOVERY - ElasticSearch health check on elastic1011 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1948: active_shards: 5783: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [21:16:00] (03CR) 10Manybubbles: "I'll push this out during the SWAT tomorrow morning." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/127363 (owner: 10Chad) [21:18:47] (03PS1) 10Manybubbles: Add more redundency for enwiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/127793 [21:19:29] RECOVERY - Puppet freshness on dataset2 is OK: puppet ran at Mon Apr 21 21:19:24 UTC 2014 [21:19:31] (03CR) 10Manybubbles: "I was thinking I could push this during a SWAT soon, after we're further building the indexes for the new wikis." 
[operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/127793 (owner: 10Manybubbles) [21:19:39] PROBLEM - Puppet freshness on dataset2 is CRITICAL: Last successful Puppet run was Mon 21 Apr 2014 09:19:24 PM UTC [21:20:39] PROBLEM - Puppet freshness on dataset2 is CRITICAL: Last successful Puppet run was Mon 21 Apr 2014 09:19:24 PM UTC [21:21:39] PROBLEM - Puppet freshness on dataset2 is CRITICAL: Last successful Puppet run was Mon 21 Apr 2014 09:19:24 PM UTC [21:22:39] PROBLEM - Puppet freshness on dataset2 is CRITICAL: Last successful Puppet run was Mon 21 Apr 2014 09:19:24 PM UTC [21:23:39] PROBLEM - Puppet freshness on dataset2 is CRITICAL: Last successful Puppet run was Mon 21 Apr 2014 09:19:24 PM UTC [21:24:39] PROBLEM - Puppet freshness on dataset2 is CRITICAL: Last successful Puppet run was Mon 21 Apr 2014 09:19:24 PM UTC [21:25:39] PROBLEM - Puppet freshness on dataset2 is CRITICAL: Last successful Puppet run was Mon 21 Apr 2014 09:19:24 PM UTC [21:26:39] PROBLEM - Puppet freshness on dataset2 is CRITICAL: Last successful Puppet run was Mon 21 Apr 2014 09:19:24 PM UTC [21:27:39] PROBLEM - Puppet freshness on dataset2 is CRITICAL: Last successful Puppet run was Mon 21 Apr 2014 09:19:24 PM UTC [21:31:59] PROBLEM - Host tarin is DOWN: PING CRITICAL - Packet loss = 100% [21:34:19] RECOVERY - Host tarin is UP: PING OK - Packet loss = 0%, RTA = 37.10 ms [21:38:39] PROBLEM - Host ps1-b2-sdtpa is DOWN: PING CRITICAL - Packet loss = 100% [21:40:19] PROBLEM - Host ps1-b3-sdtpa is DOWN: PING CRITICAL - Packet loss = 100% [21:45:14] (03PS5) 10Jforrester: Use the BetaFeatures whitelist for production to avoid accidental deploys [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/121892 [21:45:25] (03CR) 10jenkins-bot: [V: 04-1] Use the BetaFeatures whitelist for production to avoid accidental deploys [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/121892 (owner: 10Jforrester) [21:46:02] (03CR) 10Jforrester: "PS5 adds the forgotten CirrusSearch (whoops!) and removes the now-deployed/-removed Typography Refresh." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/121892 (owner: 10Jforrester) [21:48:49] PROBLEM - Host tarin is DOWN: PING CRITICAL - Packet loss = 100% [21:48:49] PROBLEM - Host nfs1 is DOWN: PING CRITICAL - Packet loss = 100% [21:48:59] PROBLEM - Host stafford is DOWN: PING CRITICAL - Packet loss = 100% [21:48:59] PROBLEM - Host ekrem is DOWN: PING CRITICAL - Packet loss = 100% [21:49:07] (03PS6) 10Jforrester: Use the BetaFeatures whitelist for production to avoid accidental deploys [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/121892 [21:49:19] PROBLEM - Host pdf3 is DOWN: PING CRITICAL - Packet loss = 100% [21:49:49] PROBLEM - Host ps1-a2-sdtpa is DOWN: PING CRITICAL - Packet loss = 100% [21:50:25] greg-g: I think 121892 (BetaFeatures whitelist) is good to go. OK to put in the SWAT if RoanKattouw/etc. agree? [21:50:36] James_F: yeppers [21:50:39] Yay. [21:50:53] shit! [21:51:07] greg-g: Thanks. 
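When several hosts drop at once, as with tarin, nfs1, stafford, ekrem and pdf3 above, the first step is confirming which of them are really unreachable rather than just flapping. A throwaway fan-out ping sweep as a sketch; the host list is lifted from the alerts, everything else (ping flags, timeouts) is illustrative:

    import subprocess

    HOSTS = ["tarin", "nfs1", "stafford", "ekrem", "pdf3"]

    def is_up(host, count=1, timeout=2):
        # True if the host answers ICMP echo; relies on the system ping binary
        # (Linux iputils flags assumed).
        result = subprocess.run(
            ["ping", "-c", str(count), "-W", str(timeout), host],
            stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        return result.returncode == 0

    if __name__ == "__main__":
        for host in HOSTS:
            print("{:10} {}".format(host, "up" if is_up(host) else "DOWN"))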
[21:51:10] that wasn't supposed to happen....guess I will move them not [21:51:15] no [21:52:52] so much for scheduling [21:52:57] (03PS5) 10BryanDavis: [WIP] Move beta scap source directory off of NFS [operations/puppet] - 10https://gerrit.wikimedia.org/r/127399 [21:53:43] greg-g...yeah..not planned...i was working on a different rack ....guess it was tied to that rack [21:53:45] sorry [21:57:06] cmjohnson1: no worries, you're doing a lot today, stuff happens [22:00:26] andrewbogott, re wtp1018: we can't read the interesting log files like /var/log/messages; could you check if there is anything suspicious in there? [22:00:35] gwicke: ok [22:00:53] andrewbogott, thanks! [22:01:06] ganglia graphs all look like it just stopped accepting requests [22:01:22] combined with the ssh login failure I suspect a kernel / HW issue [22:01:49] (03PS1) 10CSteipp: Check assumptions, fix heap overflow [operations/software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/127804 [22:03:25] gwicke: could that box have been down for days? There's a hole in the syslog from 4/19 to the reboot [22:04:33] andrewbogott, that would fit with ganglia showing no data for that period [22:04:44] could also have been a hard kernel hang [22:04:54] anything in /var/log/messages ? [22:05:43] especially re kernel panics or the like [22:06:28] nothing in the SAL about wtp1018 [22:06:56] There is no /var/log/messages, I'm looking at the syslog [22:07:26] k, I guess that's a debianism [22:08:15] It looks like the box was just unplugged on the 19th. But… why no icinga alarms? [22:08:28] yeah, /var/log/messages was removed in Ubuntu [22:08:57] I'm not sure if there is any monitoring on the parsoid hosts [22:09:00] unplugged? [22:09:11] I just mean that it stops dead with no sign of distress [22:09:33] There's definitely minimal monitoring on every host [22:09:35] it could still be a kernel panic that did not make it to disk [22:09:41] unless it doesn't include base [22:10:04] there are graphs in ganglia, but not sure about alerts [22:10:36] serial console? drac? (might have caught a fatal kernel panic..) [22:11:53] do these boxes have serial console and/or drac by default? [22:12:09] yes, that's how I rebooted it [22:15:39] RECOVERY - Host ekrem is UP: PING WARNING - Packet loss = 93%, RTA = 35.33 ms [22:18:02] k [22:20:39] RECOVERY - Host tarin is UP: PING OK - Packet loss = 0%, RTA = 35.41 ms [22:21:09] RECOVERY - Host pdf3 is UP: PING OK - Packet loss = 0%, RTA = 35.50 ms [22:22:29] RECOVERY - Host stafford is UP: PING OK - Packet loss = 0%, RTA = 36.66 ms [22:23:55] andrewbogott, I'll keep an eye on it and will try to get somebody to check console output from drac in case it happens again [22:24:20] gwicke: thanks. I'm mostly concerned about the lack of icinga notice. It shouldn't take two days to notice that a box is down [22:25:00] agreed. are icinga warnings part of the base config, or are they set up separately? [22:30:29] PROBLEM - NTP on ekrem is CRITICAL: NTP CRITICAL: Offset unknown [22:30:45] gwicke: they should be part of base, as of a week ago [22:31:51] hmm, ok [22:32:29] RECOVERY - NTP on ekrem is OK: NTP OK: Offset 0.008579611778 secs [22:32:38] I did not try a ping before the reboot [22:33:35] not sure what the icinga warning is based on [22:33:42] s/warning/alert/ [22:33:55] ping is the one I'd expect to have failed [22:35:46] I've seen quite a few kernel panics that didn't stop ping replies [22:50:58] !log Restarted stuck Jenkins [22:51:04] Logged the message, Mr.
Obvious [23:07:58] SWAT time :) [23:08:33] greg-g: FYI I've struck two changes from https://wikitech.wikimedia.org/wiki/Deployments#Week_of_April_21st , one is a DNS change (which is ops territory) and one is a questionable config change [23:09:09] The former will not be done during a SWAT window and needs to be done by ops. The latter could be moved to a later SWAT window if community consensus appears or I can be convinced the change is sane [23:10:31] (03CR) 10Catrope: [C: 032] Use the BetaFeatures whitelist for production to avoid accidental deploys [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/121892 (owner: 10Jforrester) [23:10:38] (03Merged) 10jenkins-bot: Use the BetaFeatures whitelist for production to avoid accidental deploys [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/121892 (owner: 10Jforrester) [23:11:04] fsockopen(): unable to connect to irc.wikimedia.org:6667 (Connection refused) < What went bang? [23:11:18] !log catrope synchronized wmf-config/InitialiseSettings.php 'Beta Features whitelist' [23:11:24] Logged the message, Master [23:11:40] !log catrope synchronized wmf-config/CommonSettings.php 'BetaFeatures whitelist' [23:11:46] Logged the message, Master [23:12:00] Damianz: Tampa migration maybe? [23:13:05] Damianz: Yup, irc.wm.o resolves to ekrem which is on the list of things that are getting moved in Tampa. Reported on the ops list [23:13:09] That would explain it... though didn't see an email (though I've got about 2 weeks worth to read) [23:13:20] RobH: ----^^ re ekrem [23:13:33] Ah :) I'll tell the bots to stop spamming me with the world is on fire and go to bed then. [23:14:45] greg-g: FYI, whitelist is now live. [23:14:54] greg-g: (And appears to be working fine.) [23:15:01] what's broken?
;) [23:15:17] greg-g: Strike one of those strikes, I'm including the betawikiversity change after all after talking to the nominator [23:15:20] But not the DNS change [23:15:35] (03CR) 10Catrope: [C: 032] New namespace aliases for lt test project on BetaWV [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/127278 (owner: 10Gerrit Patch Uploader) [23:15:43] (03Merged) 10jenkins-bot: New namespace aliases for lt test project on BetaWV [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/127278 (owner: 10Gerrit Patch Uploader) [23:16:16] greg-g: :-P [23:16:56] !log catrope synchronized wmf-config/InitialiseSettings.php 'Lithuanian namespace aliases for betawikiversity' [23:17:02] Logged the message, Master [23:22:20] (03PS1) 10BBlack: Set bnx2x num_queues to match physical cores on lvs nodes [operations/puppet] - 10https://gerrit.wikimedia.org/r/127827 [23:23:34] !log catrope synchronized php-1.24wmf1/extensions/VisualEditor 'SWAT deploy cherry-picks' [23:23:39] Logged the message, Master [23:23:47] !log catrope synchronized php-1.24wmf1/extensions/MultimediaViewer 'SWAT deploy cherry-picks' [23:23:53] Logged the message, Master [23:24:01] !log catrope synchronized php-1.24wmf1/extensions/MobileFrontend 'SWAT deploy cherry-picks' [23:24:03] (03PS2) 10BBlack: Set bnx2x num_queues to match physical cores on lvs nodes [operations/puppet] - 10https://gerrit.wikimedia.org/r/127827 [23:24:07] Logged the message, Master [23:24:10] RoanKattouw: thanks for handling swat [23:24:18] ori: Welcome [23:24:18] !log catrope synchronized php-1.23wmf22/extensions/MultimediaViewer 'SWAT deploy cherry-picks' [23:24:24] Logged the message, Master [23:24:24] Today's was a bit complicated with VE stuff anyway [23:24:30] ori: We were the cause of most of it after all. :-) [23:30:41] (03PS2) 10Jforrester: Create a dblist for non-Beta Features wikis [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/120171 [23:30:44] (03CR) 10Jforrester: "PS2 is a rebase." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/120171 (owner: 10Jforrester) [23:31:29] (03PS3) 10BBlack: Set bnx2x num_queues to match physical cores on lvs nodes [operations/puppet] - 10https://gerrit.wikimedia.org/r/127827 [23:31:37] greg-g: Not sure if you have an opinion on ^^^^ if we should treat loginwiki and votewiki differently by moving from all.dblist to all-OK-to-do-things-on.dblist. [23:31:47] hmmm [23:32:45] greg-g: Not urgent, just would be good practice. [23:32:54] sure? I mean, yeah, the fewer ways we can ef up and accidentally put new crappy extension on loginwiki the better, I gues [23:33:04] s [23:33:08] * James_F nods. [23:33:22] (03PS6) 10BryanDavis: Move beta scap source directory off of NFS [operations/puppet] - 10https://gerrit.wikimedia.org/r/127399 [23:54:09] rc irc is not running? [23:54:59] ori, re https://gerrit.wikimedia.org/r/#/c/112647/: the timing info is readily (and more reliably) available in the VE extension which is calling Parsoid [23:55:03] Danny_B: It seems to have gone down due to ekrem being moved in Tampa [23:56:00] any eta when it will be back up? [23:56:33] Eloquence: btw, the pdf servers were accidentally powered down a while ago, chris moved them, and the pdf creation is working again [23:57:47] !log Started ircd on ekrem, startup doesn't seem to be puppetized [23:57:53] Logged the message, Mr. Obvious [23:58:06] Danny_B: ekrem is alive, it just didn't have ircd running. I just started it, could you check if it's running now [23:58:07] ? 
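The RPS rework and the bnx2x num_queues change reviewed above both come down to how NIC receive queues are spread across CPUs. A much-simplified sketch of the underlying sysfs mechanism, assigning one CPU per rx queue round-robin; this illustrates the kernel interface only (and needs root), it is not the script under review in gerrit 95963:

    import glob
    import os

    def set_rps(dev, num_cpus):
        # Write a single-CPU mask to each rx queue's rps_cpus file, round-robin.
        # The real script is smarter about NUMA and about more queues than CPUs.
        queues = sorted(glob.glob("/sys/class/net/{}/queues/rx-*/rps_cpus".format(dev)))
        for i, path in enumerate(queues):
            mask = 1 << (i % num_cpus)
            with open(path, "w") as f:
                f.write("{:x}\n".format(mask))
            print(path, "->", format(mask, "x"))

    if __name__ == "__main__":
        set_rps("eth0", os.cpu_count() or 1)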
[23:58:55] RoanKattouw: yup, running [23:59:05] no channels though yet [23:59:07] Are edits coming through? [23:59:16] (i would be in favour of their persistence)
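To answer the closing question about whether edits are coming through, one can connect to the recent-changes feed on irc.wikimedia.org and watch for PRIVMSG traffic directly. A throwaway stdlib sketch; the nick and the #en.wikipedia channel name are assumptions for illustration:

    import socket

    def watch_rc(server="irc.wikimedia.org", port=6667, channel="#en.wikipedia", edits=5):
        # The RC feed emits one PRIVMSG per edit, so any PRIVMSG traffic
        # answers "are edits coming through?".
        sock = socket.create_connection((server, port), timeout=60)
        sock.sendall(b"NICK rc-probe\r\nUSER rc-probe 0 * :rc probe\r\n")
        buf, seen, joined = b"", 0, False
        while seen < edits:
            data = sock.recv(4096)
            if not data:
                raise ConnectionError("server closed the connection")
            buf += data
            while b"\r\n" in buf:
                line, buf = buf.split(b"\r\n", 1)
                if line.startswith(b"PING"):
                    sock.sendall(b"PONG" + line[4:] + b"\r\n")
                elif not joined and b" 001 " in line:
                    # Registration complete; join the per-wiki channel.
                    sock.sendall("JOIN {}\r\n".format(channel).encode())
                    joined = True
                elif b"PRIVMSG" in line:
                    print(line.decode("utf-8", "replace"))
                    seen += 1
        sock.close()

    if __name__ == "__main__":
        watch_rc()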