[00:18:46] (03CR) 10AndyRussG: "This wouldn't be needed if this code were working:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/182078 (owner: 10AndyRussG) [01:18:08] I'm getting a 503 alert popup whenever I try to "Review your changes" on VE. [01:18:50] varnish cache server, XID 2251730856 [01:22:20] nevermind, I think this was because I had the 'cite by url' gadget enabled. [01:25:52] (03PS1) 10GWicke: Increase thrift_framed_transport_size_in_mb from 15 to 256m [puppet/cassandra] - 10https://gerrit.wikimedia.org/r/182129 [02:00:00] (03PS1) 10AndyRussG: Remove unused mobile CentralNotice URLs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/182132 [02:01:44] ragesoss: "all your cite by url are belong to us" -VE [04:32:51] (03PS1) 10AndyRussG: Add comment about not redirecting ugly URLs [puppet] - 10https://gerrit.wikimedia.org/r/182141 [04:43:13] (03PS2) 10AndyRussG: Add comment about not redirecting ugly URLs [puppet] - 10https://gerrit.wikimedia.org/r/182141 [05:32:44] PROBLEM - puppet last run on labcontrol2001 is CRITICAL: CRITICAL: puppet fail [05:49:46] RECOVERY - puppet last run on labcontrol2001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [05:59:58] PROBLEM - Disk space on stat1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:02:35] PROBLEM - dhclient process on stat1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:02:46] PROBLEM - salt-minion processes on stat1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:09:04] (03CR) 10Springle: [C: 04-1] "May require some change to db-eqiad.php to map "wikishared" to a master server. Eg, sectionsByDB." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/181546 (owner: 10KartikMistry) [06:14:59] (03PS3) 10KartikMistry: WIP: Content Translation configuration for Production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/181546 [06:22:38] springle: thanks! [06:26:00] springle: 'wikishared' => 'x1' - is right? [06:26:12] or is it x1-master? [06:28:54] ok. it is 'x1' :) [06:30:30] stat1002's disk check is failing because fuse oopsed the kernel [06:30:35] commands such as df hang [06:31:38] however i see loadavg is high and users are running jobs so i'm hesitant to reboot it [06:31:59] their jobs do not use fuse, so i'll wait to discuss with ottomata in the morning [06:32:42] (03PS1) 10KartikMistry: Map wikishared DB to x1 master server [mediawiki-config] - 10https://gerrit.wikimedia.org/r/182148 [06:32:51] (03CR) 10jenkins-bot: [V: 04-1] Map wikishared DB to x1 master server [mediawiki-config] - 10https://gerrit.wikimedia.org/r/182148 (owner: 10KartikMistry) [06:35:02] ACKNOWLEDGEMENT - Disk space on stat1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. Jeff Gage FUSE driver crashed. reboot likely needed. [06:35:02] ACKNOWLEDGEMENT - dhclient process on stat1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. Jeff Gage FUSE driver crashed. reboot likely needed. [06:35:02] ACKNOWLEDGEMENT - salt-minion processes on stat1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. Jeff Gage FUSE driver crashed. reboot likely needed. 
[06:35:05] PROBLEM - puppet last run on amssq35 is CRITICAL: CRITICAL: Puppet has 1 failures [06:35:06] PROBLEM - puppet last run on cp3016 is CRITICAL: CRITICAL: Puppet has 1 failures [06:36:09] PROBLEM - puppet last run on mw1217 is CRITICAL: CRITICAL: Puppet has 1 failures [06:36:10] PROBLEM - puppet last run on mw1123 is CRITICAL: CRITICAL: Puppet has 2 failures [06:37:15] PROBLEM - puppet last run on db1021 is CRITICAL: CRITICAL: Puppet has 1 failures [06:37:16] PROBLEM - puppet last run on mw1061 is CRITICAL: CRITICAL: Puppet has 1 failures [06:37:26] PROBLEM - puppet last run on lvs2004 is CRITICAL: CRITICAL: Puppet has 2 failures [06:37:44] springle: can you add sectionLoad in db-eqiad? [06:38:27] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures [06:38:48] PROBLEM - puppet last run on cp4008 is CRITICAL: CRITICAL: Puppet has 1 failures [06:39:31] (for x1) [06:45:46] RECOVERY - puppet last run on amssq35 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [06:45:46] RECOVERY - puppet last run on cp3016 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [06:46:36] RECOVERY - puppet last run on mw1217 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:46:39] RECOVERY - puppet last run on mw1123 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [06:47:46] RECOVERY - puppet last run on db1021 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [06:47:47] RECOVERY - puppet last run on mw1061 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:47:56] RECOVERY - puppet last run on lvs2004 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [06:48:47] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [06:48:58] RECOVERY - puppet last run on cp4008 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [06:50:59] kart_: see ['externalLoads']['extension1'] instead of ['sectionLoads'] [06:51:16] PROBLEM - puppet last run on carbon is CRITICAL: CRITICAL: Puppet has 3 failures [06:52:47] kart_: then ask for a core dev review from someone who know the intricacies of stuff outside db-eqiad.php better than an opsen like me :) maybe TimStarling or AaronSchultz [06:52:57] or Reedy [06:59:28] springle: sure. [06:59:34] Thanks! [07:00:26] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 0 below the confidence bounds [07:30:28] andre__: what is the file size limit for files in phab? 
my upload fails with storage max limit exceeded [07:39:40] PROBLEM - Host analytics1027 is DOWN: CRITICAL - Plugin timed out after 15 seconds [07:39:40] PROBLEM - Host search1018 is DOWN: CRITICAL - Plugin timed out after 15 seconds [07:39:40] PROBLEM - Host 208.80.154.157 is DOWN: CRITICAL - Plugin timed out after 15 seconds [07:39:40] PROBLEM - Host mc1008 is DOWN: CRITICAL - Plugin timed out after 15 seconds [07:39:40] PROBLEM - Host amslvs4 is DOWN: CRITICAL - Plugin timed out after 15 seconds [07:39:41] PROBLEM - Host amslvs3 is DOWN: CRITICAL - Plugin timed out after 15 seconds [07:39:41] PROBLEM - Host mobile-lb.ulsfo.wikimedia.org_ipv6 is DOWN: CRITICAL - Plugin timed out after 15 seconds [07:40:22] PROBLEM - Host db2023 is DOWN: CRITICAL - Plugin timed out after 15 seconds [07:40:22] PROBLEM - Host db2017 is DOWN: CRITICAL - Plugin timed out after 15 seconds [07:40:22] PROBLEM - Host db1031 is DOWN: CRITICAL - Plugin timed out after 15 seconds [07:40:22] PROBLEM - Host ms-be3002 is DOWN: CRITICAL - Plugin timed out after 15 seconds [07:40:22] PROBLEM - Host ps1-a4-eqiad is DOWN: CRITICAL - Plugin timed out after 15 seconds [07:40:22] PROBLEM - Host mw1099 is DOWN: CRITICAL - Plugin timed out after 15 seconds [07:40:23] PROBLEM - Host mw1081 is DOWN: CRITICAL - Plugin timed out after 15 seconds [07:40:23] PROBLEM - Host analytics1026 is DOWN: CRITICAL - Plugin timed out after 15 seconds [07:40:23] PROBLEM - Host mw1193 is DOWN: CRITICAL - Plugin timed out after 15 seconds [07:41:37] RECOVERY - Host search1004 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [07:41:37] RECOVERY - Host db2034 is UP: PING OK - Packet loss = 0%, RTA = 42.94 ms [07:41:38] RECOVERY - Host ms-be2013 is UP: PING OK - Packet loss = 0%, RTA = 42.89 ms [07:41:38] RECOVERY - Host db2037 is UP: PING OK - Packet loss = 0%, RTA = 42.92 ms [07:41:39] RECOVERY - Host db2036 is UP: PING OK - Packet loss = 0%, RTA = 42.92 ms [07:41:39] RECOVERY - Host capella is UP: PING OK - Packet loss = 0%, RTA = 42.94 ms [07:41:40] RECOVERY - Host ms-be2006 is UP: PING OK - Packet loss = 0%, RTA = 42.90 ms [07:41:40] RECOVERY - Host mc1002 is UP: PING OK - Packet loss = 0%, RTA = 0.98 ms [07:41:41] RECOVERY - Host analytics1026 is UP: PING OK - Packet loss = 0%, RTA = 1.98 ms [07:41:41] RECOVERY - Host eeden is UP: PING OK - Packet loss = 0%, RTA = 96.08 ms [07:41:42] RECOVERY - Host amslvs4 is UP: PING OK - Packet loss = 0%, RTA = 95.18 ms [07:41:50] RECOVERY - Host mw1096 is UP: PING OK - Packet loss = 0%, RTA = 2.23 ms [07:41:50] RECOVERY - Host search1008 is UP: PING OK - Packet loss = 0%, RTA = 0.57 ms [07:42:09] RECOVERY - Host ps1-a1-eqiad is UP: PING OK - Packet loss = 0%, RTA = 5.37 ms [07:42:14] RECOVERY - Host elastic1002 is UP: PING OK - Packet loss = 0%, RTA = 2.52 ms [07:42:14] RECOVERY - Host misc-web-lb.eqiad.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 1.83 ms [07:42:38] RECOVERY - Host mw1164 is UP: PING OK - Packet loss = 0%, RTA = 3.09 ms [07:43:19] RECOVERY - Host ps1-b6-eqiad is UP: PING OK - Packet loss = 0%, RTA = 2.97 ms [07:43:22] RECOVERY - Host db2035 is UP: PING OK - Packet loss = 0%, RTA = 43.13 ms [07:44:29] RECOVERY - Host 208.80.154.50 is UP: PING OK - Packet loss = 0%, RTA = 2.40 ms [07:44:29] RECOVERY - Host 208.80.154.157 is UP: PING OK - Packet loss = 0%, RTA = 3.03 ms [07:44:29] RECOVERY - Host mobile-lb.ulsfo.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 69.83 ms [07:45:57] RECOVERY - Host bits-lb.ulsfo.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 70.27 
ms [07:48:28] hmmm [07:52:48] nothing out of the ordinary ... [07:55:13] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [500.0] [07:57:22] neon problems me thinks [08:18:56] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [08:31:08] PROBLEM - OCG health on ocg1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:46:50] (03CR) 10Nemo bis: "Ping Hoo :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/181791 (owner: 10Nemo bis) [09:30:32] greetings [09:30:50] akosiaris: yeah likely [09:41:05] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [500.0] [09:51:43] PROBLEM - puppet last run on stat1002 is CRITICAL: CRITICAL: Puppet last ran 4 hours ago [09:58:32] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [11:11:25] RECOVERY - HTTP error ratio anomaly detection on tungsten is OK: OK: No anomaly detected [11:17:49] (03PS1) 10Yuvipanda: Refactor to not be a big ball of mud [software/labsdb-auditor] - 10https://gerrit.wikimedia.org/r/182164 [11:19:17] (03PS2) 10Yuvipanda: Refactor to not be a big ball of mud [software/labsdb-auditor] - 10https://gerrit.wikimedia.org/r/182164 [11:20:08] ori: great blog post, enjoyed much [11:21:10] anyone exept chasemp can tell me the phab attach file size ? [11:23:54] matanya: which one? [11:24:11] YuviPanda|zzz: the upper limit [11:24:18] link? [11:24:28] matanya: oh, I meant the blog post :) [11:25:12] https://blog.wikimedia.org/2014/12/29/how-we-made-editing-wikipedia-twice-as-fast/ [11:36:13] PROBLEM - Host db2010 is DOWN: CRITICAL - Plugin timed out after 15 seconds [11:36:22] PROBLEM - Host analytics1019 is DOWN: CRITICAL - Plugin timed out after 15 seconds [11:36:23] PROBLEM - Host analytics1022 is DOWN: CRITICAL - Plugin timed out after 15 seconds [11:36:23] PROBLEM - Host mw1040 is DOWN: CRITICAL - Plugin timed out after 15 seconds [11:36:23] PROBLEM - Host analytics1032 is DOWN: CRITICAL - Plugin timed out after 15 seconds [11:36:23] PROBLEM - Host 208.80.153.42 is DOWN: CRITICAL - Plugin timed out after 15 seconds [11:36:23] PROBLEM - Host gallium is DOWN: CRITICAL - Plugin timed out after 15 seconds [11:36:23] PROBLEM - Host mw1046 is DOWN: CRITICAL - Plugin timed out after 15 seconds [11:36:24] PROBLEM - Host upload-lb.ulsfo.wikimedia.org_ipv6 is DOWN: CRITICAL - Plugin timed out after 15 seconds [11:36:34] uhm [11:36:47] it’s all ok [11:36:54] (mostly) [11:36:56] (I guess) [11:37:03] You playing with the network? 
:P [11:37:08] PROBLEM - Host mw1029 is DOWN: CRITICAL - Plugin timed out after 15 seconds [11:37:08] PROBLEM - Host ps1-b2-eqiad is DOWN: CRITICAL - Plugin timed out after 15 seconds [11:37:08] PROBLEM - Host wtp1020 is DOWN: CRITICAL - Plugin timed out after 15 seconds [11:37:08] RECOVERY - Host db2010 is UP: PING OK - Packet loss = 0%, RTA = 42.97 ms [11:37:08] RECOVERY - Host analytics1019 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [11:37:08] RECOVERY - Host analytics1022 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [11:37:08] RECOVERY - Host analytics1032 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [11:37:11] Or icinga [11:37:16] hoo: no, neon the icinga host being overloaded [11:37:17] RECOVERY - Host mw1046 is UP: PING OK - Packet loss = 0%, RTA = 0.59 ms [11:37:22] memory and CPU pressure, mostly [11:37:22] RECOVERY - Host gallium is UP: PING OK - Packet loss = 0%, RTA = 5.59 ms [11:37:22] RECOVERY - Host mw1029 is UP: PING OK - Packet loss = 0%, RTA = 1.64 ms [11:37:22] RECOVERY - Host wtp1020 is UP: PING OK - Packet loss = 0%, RTA = 1.87 ms [11:37:23] still? [11:37:24] meh [11:37:29] But good to know [11:37:31] RECOVERY - Host mw1040 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [11:37:53] hoo: yeah [11:38:00] RECOVERY - Host ps1-b2-eqiad is UP: PING OK - Packet loss = 0%, RTA = 2.34 ms [11:39:34] YuviPanda|zzz: Are you firm with logrotate? [11:39:38] RECOVERY - Host upload-lb.ulsfo.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 71.71 ms [11:39:49] Not sure it's the cause... but I have incomplete logs for the last WD json dump [11:39:53] but the dumps look ok [11:39:58] hoo: not really. I mostly get by copy pasting another config and tweaking till I’m happy with what I have [11:39:59] oh [11:40:07] hoo: incomplete in what sense? [11:40:20] RECOVERY - Host 208.80.153.42 is UP: PING OK - Packet loss = 0%, RTA = 43.95 ms [11:40:43] Log output ends after 800k entities (per shard... so 3.2M overall) [11:40:51] but we have 16-17M entities [11:41:02] logs from the week before are complete [11:41:04] do you know if it’s the first 800k or the last? [11:41:09] First [11:41:20] it says Dumped entities up to 80000 [11:41:23] or something like that [11:41:53] Ah... seems like logrotate rotated them midway through [11:42:01] -rw-rw-r-- 1 datasets datasets 101910 Dec 29 06:25 dumpwikidatajson-3.log.1.gz [11:42:01] -rw-rw-r-- 1 datasets datasets 538877 Dec 22 13:41 dumpwikidatajson-3.log.2.gz [11:42:07] spot the difference :P [11:42:27] grrr [11:43:51] hoo: are the log files being written out from php or is it just a stdout redirection? [11:44:03] stderr it is, yes [11:44:45] hmm, ideally for something as long running as that it should write log files itself and ‘reload’ on sighup or something [11:45:37] hoo: with ^ at least it won’t be cut off in the middle :) [11:45:41] $this->addOption( 'log', "Log file (default is stderr). Will be appended.", false, true ); [11:46:33] But even in that case I'd need to rotate stuffs [11:48:23] right [11:48:32] wait [11:48:38] I don’t think I actuall understood the problem now [11:48:38] I could manually invoke logrotate from the script that does the dumps [11:48:55] did old and new logs somehow get conflated? [11:49:02] did you lose log data? or is it just confusingly done? 
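
The "write the log file itself and reload on SIGHUP" approach YuviPanda mentions above is the standard way to stop logrotate from truncating a long-running job mid-way. A minimal Python sketch of the pattern (the real dump job is a PHP maintenance script driven from a shell wrapper, so this only illustrates the idea; the log path is hypothetical, and it assumes a logrotate postrotate stanza that sends SIGHUP to the process):

    import signal

    LOG_PATH = "/var/log/wikidata/dumpwikidatajson.log"   # hypothetical path

    log = open(LOG_PATH, "a", buffering=1)                # line-buffered

    def reopen_log(signum, frame):
        # After logrotate renames the old file (and postrotate sends SIGHUP),
        # close the renamed inode and start appending to a fresh file.
        global log
        log.close()
        log = open(LOG_PATH, "a", buffering=1)

    signal.signal(signal.SIGHUP, reopen_log)

    def log_line(msg):
        log.write(msg + "\n")

    # ... the long-running dump loop keeps calling log_line() and nothing is
    # lost when rotation happens part-way through a run.
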
[11:49:13] Lost log data [11:49:23] not a problem here, but could be if stuff actually fails [11:50:47] (03PS3) 10Yuvipanda: Refactor to not be a big ball of mud [software/labsdb-auditor] - 10https://gerrit.wikimedia.org/r/182164 [11:54:33] YuviPanda: Any idea? [11:54:40] hoo: oh, because logrotate deleted your ‘old’ file? [11:54:49] I’m trying to understand what exactly happened :) [11:58:26] (03PS4) 10Yuvipanda: Refactor to not be a big ball of mud [software/labsdb-auditor] - 10https://gerrit.wikimedia.org/r/182164 [12:30:26] YuviPanda: leaving in a bit [12:30:33] would be nice if you could keep me updated [12:30:48] hoo: oh, on the logrotate? I’m not actually investigating atm :| [12:30:54] hoo: file a bug? I’ll take a look later. [12:32:07] Maybe I'll just hack it up like this: Have the logrotate not run automatically, but just invoke it pre-run of the script [12:32:23] so taht we rotate once before running the the dump creation [12:32:44] Not super nice, but easy to do [12:33:32] YuviPanda: Worth looking at that or a no-go? [12:34:18] hoo: hmm, maybe tack date of run to end of log file, and clean up everything else? [12:34:56] mh? [12:35:21] hoo: so it shall output to dumpwikidatajson-20141229021245.log [12:35:32] Oh [12:35:40] and then let the script do the deletions like it does for old dumps [12:35:57] like keep 5, kill the rest [12:36:09] yeah [12:36:23] Not that convenient, but probably nicer [12:36:39] hoo: or put logs into logstash :P [12:37:12] :P [12:38:07] Off for now [12:38:17] Thanks for the advice... I count on you to merge it, then :D [12:46:37] (03PS5) 10Yuvipanda: Refactor to not be a big ball of mud [software/labsdb-auditor] - 10https://gerrit.wikimedia.org/r/182164 [12:48:38] (03PS6) 10Yuvipanda: Refactor to not be a big ball of mud [software/labsdb-auditor] - 10https://gerrit.wikimedia.org/r/182164 [12:50:32] (03PS7) 10Yuvipanda: Refactor to not be a big ball of mud [software/labsdb-auditor] - 10https://gerrit.wikimedia.org/r/182164 [12:53:11] (03PS8) 10Yuvipanda: Refactor to not be a big ball of mud [software/labsdb-auditor] - 10https://gerrit.wikimedia.org/r/182164 [13:34:52] springle: I don't get how to map wikishared with x1 :( [13:44:20] x1? [14:02:56] RECOVERY - puppet last run on carbon is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:14:03] this category says there are 301 pages in it, while there actually are 315 in main namespace: https://nl.wikipedia.org/wiki/Categorie:Wikipedia:Etalage-artikelen [14:14:26] I am pointed to here as ops an do a full recount [14:17:39] (03PS1) 10Hoo man: Don't use logrotate for the wikidata dump logs [puppet] - 10https://gerrit.wikimedia.org/r/182173 [14:17:53] YuviPanda: ^ [14:17:55] untested [14:19:17] hoo: out for food. Will Check when back. Add me as reviewer? [14:19:58] Oh, sure [14:20:47] Reedy: https://gerrit.wikimedia.org/r/#/c/182148/ and https://phabricator.wikimedia.org/T84969 [14:21:03] Reedy: any suggestion? [14:21:45] Reedy: springle: kart_: see ['externalLoads']['extension1'] instead of ['sectionLoads'] [14:21:52] but I don't get it :) [14:23:49] Romaine: Not sure how to do that for just a specific category [14:24:03] and I don't want to run such a script for all of nlwiki [14:24:30] who kinows more about this? [14:24:58] Reedy might [14:25:14] But I dobut this is even possible w/o manually tempering [14:25:30] everything returns to Reedy [14:25:32] :) [14:25:56] :P [14:27:51] hoo: I created a bug for it: https://phabricator.wikimedia.org/T85527 [14:29:05] Who can clone Reedy? 
;) [14:29:56] good point [14:30:36] PROBLEM - puppet last run on carbon is CRITICAL: CRITICAL: Puppet has 4 failures [14:30:49] Romaine: thx [14:31:14] you are better in placing it in the right projects etc [14:31:48] (03PS2) 10Hoo man: Permanently enable unregistered users editing on it.m.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/181791 (owner: 10Nemo bis) [14:35:04] (03CR) 10Hoo man: [C: 032] "Doing this actually means a way smaller change than not doing it... also it's way less likely to break than just letting this run out of i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/181791 (owner: 10Nemo bis) [14:35:09] (03Merged) 10jenkins-bot: Permanently enable unregistered users editing on it.m.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/181791 (owner: 10Nemo bis) [14:36:16] !log hoo Synchronized wmf-config/CommonSettings.php: Enable unregistered users editing on it.m.wikipedia.org after Dec 31 (duration: 00m 06s) [14:36:22] Nemo_bis: ^ [14:37:01] morebots: ping [14:37:01] I am a logbot running on tools-exec-14. [14:37:01] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log. [14:37:01] To log a message, type !log . [14:39:07] PROBLEM - puppet last run on amssq38 is CRITICAL: CRITICAL: puppet fail [14:44:46] thanks goo [14:44:49] hoo [14:45:01] You're welcome [14:45:29] log successful https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:35] Yeah, saw that [14:45:43] just wondered because it didn't tell me it worked [14:45:48] So morebots is just being rude today [14:45:54] Log that :D [14:46:06] !log morebots is being rude today [14:46:12] Logged the message, Master [14:46:20] :'-( [14:46:22] See, education works [14:51:44] (03PS1) 10Florianschmidtwelzow: Hygiene: Move wgMFAnonymousEditing to InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/182175 [14:52:58] RECOVERY - puppet last run on amssq38 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [14:59:20] (03PS1) 10Florianschmidtwelzow: Hygiene: Change wgMFAnonymousEditing to wgMFEditorOptions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/182177 [15:10:05] PROBLEM - MySQL Processlist on db1056 is CRITICAL: CRIT 73 unauthenticated, 0 locked, 0 copy to table, 1 statistics [15:13:36] RECOVERY - MySQL Processlist on db1056 is OK: OK 1 unauthenticated, 0 locked, 0 copy to table, 1 statistics [15:21:49] <^demon|zzz> paravoid: "Once any bot/tool/etc authors using the old search are contacted (and helped, if possible) lsearchd will go away there [enwiki, ruwiki, nlwiki] too." [15:22:02] <^demon|zzz> That was before Xmas, I hadn't checked the situation or tried to contact anyway yet. 
[15:22:40] <^demon|zzz> *anyone [15:32:51] (03PS1) 10Ottomata: Run logster job for varnishkafka every minute to get smooth derivative in grafana graphs [puppet] - 10https://gerrit.wikimedia.org/r/182184 [15:34:13] (03CR) 10Ottomata: [C: 032] Run logster job for varnishkafka every minute to get smooth derivative in grafana graphs [puppet] - 10https://gerrit.wikimedia.org/r/182184 (owner: 10Ottomata) [15:36:42] (03CR) 10Glaisher: [C: 031] "Thanks" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/182175 (owner: 10Florianschmidtwelzow) [15:39:37] (03PS1) 10RobH: setting dbstore1002/1002 production dns entries [dns] - 10https://gerrit.wikimedia.org/r/182185 [15:40:03] RECOVERY - puppet last run on carbon is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:49:40] (03CR) 10RobH: [C: 032] setting dbstore1002/1002 production dns entries [dns] - 10https://gerrit.wikimedia.org/r/182185 (owner: 10RobH) [15:59:16] (03PS1) 10RobH: setting dbstore2001/2002 install parameters [puppet] - 10https://gerrit.wikimedia.org/r/182188 [16:04:58] (03CR) 10RobH: [C: 032] setting dbstore2001/2002 install parameters [puppet] - 10https://gerrit.wikimedia.org/r/182188 (owner: 10RobH) [16:13:23] gwicke: ping? [16:14:06] (03PS1) 10Ottomata: Install local txstatsd and logster job for varnishkafka on all cache servers [puppet] - 10https://gerrit.wikimedia.org/r/182190 [16:15:00] bblack: things are looking good, i'm going to go ahead and do that unless you have objections ^ [16:15:01] paravoid: grr, thought I was connected [16:15:03] but evidently weechat was still in limbo mode [16:15:09] so, what's the issue with the tests? [16:15:20] I guess you can't see your backlog? [16:15:35] kill them first, I'll explain :) [16:16:15] alright, stopped them [16:16:21] something about DNS? [16:16:32] so you're hammering restbase with a lot of RPS [16:16:38] for some reason I haven't yet understood [16:16:48] you're using carbon.wikimedia.org aka webproxy.eqiad.wmnet [16:16:54] a squid running there [16:16:56] ottomata: yeah sounds good [16:17:03] and this is killing the box [16:17:22] oh [16:17:26] damn [16:17:28] (netfilter conntrack limits being hit etc.) [16:17:38] I must have forgotten to unset the env vars in one of the shells [16:18:03] I could bump those limits but I don't think traffic should pass through there anyway :) [16:18:24] also note [16:18:24] ./node_modules/heapdump/build/config.gypi: "https_proxy": "http://webproxy.eqiad.wmnet:8080/", [16:18:53] I don't think you're doing https [16:19:02] nope, this is http [16:19:07] but I'm not really sure if this config variable being used in a different context [16:19:08] I did unset the vars in most of the shells [16:19:14] (03CR) 10Ottomata: [C: 032] Install local txstatsd and logster job for varnishkafka on all cache servers [puppet] - 10https://gerrit.wikimedia.org/r/182190 (owner: 10Ottomata) [16:19:21] so you probably only saw the requests from one of four dumpers [16:19:36] this was happening earlier [16:19:38] I killed a screen of yours [16:19:51] yes, I was talking about those [16:19:59] and it was happening again now [16:20:06] with your new shells [16:20:27] do you see any traffic now? 
[16:20:40] restarted one client [16:20:52] no [16:20:58] okay [16:21:04] great [16:21:26] if this is testbox->testbox I don't really mind [16:21:38] but if you intend to use prod stuff for those tests you should probably !log those [16:21:42] sorry for DOSing the cache [16:21:57] the whole machine was DoSed, not just the cache :) [16:22:12] the testing has been ongoing for weeks [16:22:29] that's what we need the boxes for [16:22:42] sure, I don't mind :) [16:23:07] if you intend to (intentionally) use prod resources for those tests, just ping and/or !log [16:24:17] robh: things should be working again [16:24:32] cool, thanks for letting me know (im doing installs, well, trying ;) [16:24:44] paravoid: those boxes are all prod [16:25:22] gwicke: you really don't want me to start treating those as prod :) [16:25:51] ok, thought you meant those boxes [16:25:52] I've found at least one of those boxes (I think it was ruthenium) with arbitrary sources.list fetching hhvm & node from the internet [16:26:15] yeah, ruthenium has been a test server for a while [16:26:59] so for test boxes it's (sort of) okay, but certainly not okay for anything that can be labeled "production" [16:27:00] the cassandra / restbase boxes are all puppet-driven [16:27:06] RECOVERY - dhclient process on stat1002 is OK: PROCS OK: 0 processes with command name dhclient [16:27:08] RECOVERY - puppet last run on stat1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:27:38] RECOVERY - salt-minion processes on stat1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [16:27:41] anyway, go on with your tests [16:27:47] we'll figure it out as this graduates in prod [16:28:05] RECOVERY - Disk space on stat1002 is OK: DISK OK [16:29:51] paravoid: do you see any traffic on the cache? [16:30:10] I restarted the other clients, just making sure [16:30:11] no [16:30:21] ok, thx [16:30:24] thank you :) [16:30:41] well, sorry for creating busywork ;) [16:30:52] np [16:33:14] paravoid, btw, once graphite has recovered this will be the graph corresponding to the testing: http://bit.ly/173yM3h [16:33:40] oh? is graphite down? [16:33:50] damn [16:34:00] godog: are you working on it? [16:34:37] oh, no I wasn't aware of it [16:34:39] looking now [16:35:04] well I meant if it was down because of something you were working on [16:35:12] but that's even better :P [16:35:19] haha indeed [16:35:25] it worked a few minutes ago [16:36:43] yeah memory usage is through the roof [16:37:48] uwsgi, killing it [16:38:00] !log killing uwsgi on tunsten, blew memory [16:38:08] Logged the message, Master [16:40:22] should be recovering shortly [16:47:05] I suspect the huge flow of new metrics didn't make graphite very amused [16:47:16] I'd like to add mobrovac to the operations mailing list. Should I just have him make a subscription request via mailman, or should I file a request? [16:47:53] robla: +100 [16:47:56] ottomata: obvious in hindsight now but I forgot that the varnishkafka metrics would go in all at the same time anyways in this case, see above [16:48:28] uh oh [16:48:46] godog: ja I thought we talked about that, but we decided to do local txstatsd anyway because of pickle or something [16:49:02] godog: what should we do? want me to revert? 
[16:49:32] ottomata: yeah, for creations pickle doesn't change that they get created all at the same time [16:49:42] anyways no it is fine as it is, no revert [16:50:35] I'll file a short incident report just so we don't lose track, creating new metrics is supposed to be rate limited [16:51:01] ah, its just the creation of the metrics that is causing the problem? [16:51:06] now that it is created it might be ok? [16:51:14] yeah I think so [17:00:58] unrelated, anyone else seeing this behaviour? https://phabricator.wikimedia.org/T85532 [17:01:06] in phab itself that is [17:02:01] ottomata: how many machines will be pushing metrics btw? [17:08:51] i'm on the ops list now, thnx robla [17:09:25] i just added you [17:09:39] i imagine due to robla's putting in the request in mailman [17:09:55] pretty sure I automatically just let anyone with @wikimedia.org on that list =D [17:10:20] (non wikimedia.org addresses have to confirm NDA status before adding) [17:12:21] robh: make staff wait a week and when they say why go 'this is how it feels for volunteers' :p [17:12:57] godog, all caches, pretty much [17:13:22] JohnLewis: i just totally typed up a funny reply to back that up, then realized i shouldn't mock how horrible we are at that =P [17:13:37] I'm hoping our migration into transparent ticketing helps that a lot [17:13:44] ottomata: ack, I was trying to gauge how many metrics it'll be total [17:13:52] robh: query me it :D [17:14:38] its forced me to deal with old bz tickets that i forgot existed [17:14:53] since once ops started using RT, i fully admit i really stopped checking BZ with any regularity [17:15:05] godog, uh a lot. [17:15:27] one sec... [17:16:29] godog, how many cache servers are there? about 100? [17:17:12] godog, if so, somewhere around 35K-40K new metrics total [17:17:19] (03PS1) 10Glaisher: Update noc's index.html [mediawiki-config] - 10https://gerrit.wikimedia.org/r/182193 [17:17:31] i'm looking at one vk, and its sending 376 [17:17:49] ottomata: ack, there are about 12K created now [17:27:05] (03CR) 10Alexandros Kosiaris: [C: 032] "LGTM" [puppet/cassandra] - 10https://gerrit.wikimedia.org/r/182129 (owner: 10GWicke) [17:27:57] (03CR) 10Alexandros Kosiaris: "recheck" [puppet/cassandra] - 10https://gerrit.wikimedia.org/r/182129 (owner: 10GWicke) [17:29:14] (03CR) 10Alexandros Kosiaris: [V: 032] "recheck wouldn't work, a different submodule. +2 Verified manually" [puppet/cassandra] - 10https://gerrit.wikimedia.org/r/182129 (owner: 10GWicke) [17:32:26] (03CR) 10Alexandros Kosiaris: "Btw, the puppet module also needs a submodule update for this to become live." 
[puppet/cassandra] - 10https://gerrit.wikimedia.org/r/182129 (owner: 10GWicke) [17:34:44] (03PS1) 10Filippo Giunchedi: install-server: include lsiutil and megaraid-status [puppet] - 10https://gerrit.wikimedia.org/r/182194 [17:35:38] (03PS1) 10RobH: taking analytics1001/1002 out of incorrect partman setup [puppet] - 10https://gerrit.wikimedia.org/r/182195 [17:35:40] (03PS2) 10Filippo Giunchedi: install-server: include lsiutil and megaraid-status [puppet] - 10https://gerrit.wikimedia.org/r/182194 [17:35:46] paravoid: ^ [17:36:10] (03CR) 10RobH: [C: 032] taking analytics1001/1002 out of incorrect partman setup [puppet] - 10https://gerrit.wikimedia.org/r/182195 (owner: 10RobH) [17:48:26] ottomata: yeah about 100 hosts I was looking at this on tungsten for i in /var/lib/carbon/whisper/varnishkafka/* ; do echo -n "$i :" ; find $i -type f | wc -l ; done [17:48:56] aye [17:49:50] well without the wc -l is more interesting [17:50:11] no, nevermind the last comment [18:04:46] (03CR) 10Faidon Liambotis: [C: 031] "+1 for lsiutil, not entirely happy with megaraid-status but that'd be okay if you silence it/make it not run a daemon." [puppet] - 10https://gerrit.wikimedia.org/r/182194 (owner: 10Filippo Giunchedi) [18:04:54] (03PS3) 10Filippo Giunchedi: install-server: include lsiutil in reprepro [puppet] - 10https://gerrit.wikimedia.org/r/182194 [18:10:39] (03CR) 10Filippo Giunchedi: "amended to skip megacli-status for now, I'll take a look at check-raid.py too!" [puppet] - 10https://gerrit.wikimedia.org/r/182194 (owner: 10Filippo Giunchedi) [18:11:35] (03PS4) 10Filippo Giunchedi: install-server: include lsiutil in reprepro [puppet] - 10https://gerrit.wikimedia.org/r/182194 [18:22:13] (03PS1) 10Jackmcbarn: Change tboverride to tboverride-account for enwiki accountcreators [mediawiki-config] - 10https://gerrit.wikimedia.org/r/182197 [18:26:00] (03PS2) 10Jackmcbarn: Change tboverride to tboverride-account for enwiki accountcreators [mediawiki-config] - 10https://gerrit.wikimedia.org/r/182197 [18:41:01] what the hell [18:41:21] why is neon writing ganglia RRDs to /var/lib/ganglia/rrds [18:42:13] kafka metrics [18:42:14] argh [18:43:11] paravoid: ? [18:43:24] is that check_ganglia? [18:43:28] I don't know yet [18:43:39] but what the hell, why would it do this [18:43:44] gmetad is writing those rrds [18:43:53] must be check ganglia, maybe it is caching them? [18:44:18] why is there a gmetad running on neon? [18:44:37] # Ganglia Meta Daemon for Wikimedia [18:44:42] # This file is managed by Puppet! [18:44:50] oh, gmetad, this is sounding familiar... [18:45:31] Keep the configuration that allows gmetad to work from neon, [18:45:31] since check_ganglia uses that. [18:45:43] is it though? I only see check_ganglia connecting to uranium [18:45:58] paravoid: [18:46:04] no, i don't think it is check ganglia though [18:46:09] see line 235 in ganglia.pp [18:46:14] # neon needs gmetad config [18:46:14] /^neon$/: { [18:46:14] $data_sources = { [18:46:14] ... [18:46:24] yes [18:46:24] this was [18:46:28] - # neon needs gmetad config for ganglios [18:46:33] see commit db1c635b8ed3b5329443c61a442f3dfb9470dfd9 [18:46:36] sounds familiar [18:46:49] but monitoring::ganglia seems to connect to uranium [18:47:10] YuviPanda: here? 
[18:47:58] (03PS1) 10Gage: hadoop-logstash: emit filtered stack traces [puppet/cdh] - 10https://gerrit.wikimedia.org/r/182208 [18:49:10] I'm pretty sure it's not being used [18:52:33] http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&h=neon.wikimedia.org&m=cpu_report&s=descending&mc=2&g=load_report&c=Miscellaneous+eqiad [18:52:40] (03CR) 10Gage: [C: 032] hadoop-logstash: emit filtered stack traces [puppet/cdh] - 10https://gerrit.wikimedia.org/r/182208 (owner: 10Gage) [18:52:42] magic [18:55:14] bblack: ^ is why "sync" took such a long time [18:56:15] (03PS1) 10Gage: hadoop-logstash: emit filtered stack traces [puppet] - 10https://gerrit.wikimedia.org/r/182210 [18:56:38] (03CR) 10Gage: [C: 032] hadoop-logstash: emit filtered stack traces [puppet] - 10https://gerrit.wikimedia.org/r/182210 (owner: 10Gage) [18:58:07] (03CR) 10MaxSem: [C: 031] Remove unused mobile CentralNotice URLs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/182132 (owner: 10AndyRussG) [18:58:26] (03PS1) 10Faidon Liambotis: ganglia: remove unused reference to neon [puppet] - 10https://gerrit.wikimedia.org/r/182211 [18:58:49] ottomata: ^ [18:59:57] ooook [19:00:24] i have a feeling something might break paravoid, but I can't think of what [19:00:35] (03CR) 10Faidon Liambotis: [C: 032] ganglia: remove unused reference to neon [puppet] - 10https://gerrit.wikimedia.org/r/182211 (owner: 10Faidon Liambotis) [19:05:03] ottomata: it was entirely unused [19:05:12] and even if it was supposed to be used, it was running unpuppetized [19:05:15] aye [19:05:17] also: http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&h=neon.wikimedia.org&m=cpu_report&s=descending&mc=2&g=load_report&c=Miscellaneous+eqiad [19:05:23] well, aye, true, the puppet change certainly won't break anything [19:05:38] no I also purged gmetad from the system manually [19:05:42] hokay [19:05:58] !log manually stopping acct on neon and setting /etc/default/acct ACCT_ENABLE to 0 [19:06:02] Logged the message, Master [19:06:16] (03CR) 10MaxSem: Hygiene: Move wgMFAnonymousEditing to InitialiseSettings (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/182175 (owner: 10Florianschmidtwelzow) [19:07:42] check_ganglia is horrible [19:11:47] omfg so horrible [19:14:56] :) [19:15:33] it's almost like someone was trying to win a contest for the most inefficient way to monitor those variables [19:17:19] bblack: so, we have acct running on those systems [19:17:41] well, on all systems [19:17:47] when I turned off acct the other day, load halfed [19:18:00] but this was when neon was starved of I/O because of gmetad running there for no good reason [19:18:05] it doesn't seem to have made a difference now [19:19:32] so we call check_ganglia 119 times [19:19:42] each of those instances fetches a 34MB XML from the aggregator [19:19:46] every five minutes [19:19:47] lolol [19:20:04] (and each fetch contains all the data the other fetches need!) [19:20:18] right [19:20:19] but [19:20:26] check_ganglia actually has code for this to /not/ happen [19:20:42] and this is a misconfiguration on our part [19:20:48] oh? 
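
The scoped query paravoid describes here (gmetad's interactive interface takes a "/$cluster/$host" path) can be reproduced by hand, as he goes on to show with netcat below. A rough Python equivalent; the host, port and cluster/host names are simply the ones quoted in this conversation:

    import socket

    def gmetad_query(path, host="uranium.wikimedia.org", port=8654):
        """Fetch ganglia XML for a gmetad query path like '/<cluster>/<host>'."""
        chunks = []
        with socket.create_connection((host, port), timeout=10) as sock:
            sock.sendall(path.encode() + b"\n")
            while True:
                data = sock.recv(65536)
                if not data:
                    break
                chunks.append(data)
        return b"".join(chunks)

    # '//<host>' walks every cluster (tens of MB of XML per call); scoping the
    # query to one cluster returns only that host's metrics.
    xml = gmetad_query("/Text caches esams/amssq31.esams.wmnet")
    print(len(xml))
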
[19:20:48] we are not passing cluster (-C) [19:20:59] so [19:21:05] gmetad has a query interface [19:21:11] check_ganglia can use this, with -q [19:21:18] the query interface is basically "/$cluster/$host" [19:21:38] # echo '//amssq31.esams.wmnet' | nc uranium.wikimedia.org 8654 |wc -c [19:21:41] 34955906 [19:21:42] # echo '/Text caches esams/amssq31.esams.wmnet' | nc uranium.wikimedia.org 8654 |wc -c [19:21:45] 172675 [19:23:07] command_line $USER1$/check_ganglia -q -g $ARG1$ -p $ARG2$ -H $ARG3$ -m '$ARG4$' -w '$ARG5$' -c '$ARG6$' -C '$ARG7$' [19:23:22] check_command => "check_ganglia!${gmetad_host}!${gmetad_query_port}!${metric_host}!${metric}!${warning}!${critical}!${::ganglia::cname}", [19:23:44] and that doesn't work [19:24:11] akosiaris: still around? [19:25:06] ha [19:25:22] it works where ganglia is included [19:25:25] but not where ganglia_new is [19:25:49] which is all of esams, i.e. 52 out of 119 [19:25:52] subtle isn't it! [19:26:21] does this explain why otrs monitoring broke randomly? [19:26:52] no [19:26:54] ok [19:27:55] ottomata: ping [19:28:08] so many things going wrong [19:29:26] (03PS1) 10Jhobs: Enable $wgAllowSiteCSSOnRestrictedPages for zero [mediawiki-config] - 10https://gerrit.wikimedia.org/r/182227 [19:30:16] gwicke: hiya, what's up? (i'm about to quit working for the day, working a half day and hanging with fam) [19:30:18] (03PS1) 10Cmjohnson: Removing production dns entries for dickson ipv4 and ipv6 adding reverse for mgmt [dns] - 10https://gerrit.wikimedia.org/r/182228 [19:31:08] ottomata: oh, okay - just a quick question: do you know if the varnishkafka stuff is doing anything complicated beyond enqueuing the basic request log (as a squid log line?) [19:31:37] I was wondering about feeding into the same kafka cluster from node [19:32:02] gwicke: it speaks the kafka api [19:32:09] / protocol [19:32:21] so, yeah, it kinda is, but is uses librdkafka to do so [19:32:22] yeah, I'm mainly wondering about the messages [19:32:29] you want to produce to kafka? [19:32:44] (03CR) 10Faidon Liambotis: [C: 04-1] "Why?" [dns] - 10https://gerrit.wikimedia.org/r/182228 (owner: 10Cmjohnson) [19:32:46] if the messages are easy to replicate, then yes that could be an option [19:33:01] you would just need a producer in whatever lang you are using [19:33:10] that's not a problem [19:33:13] https://cwiki.apache.org/confluence/display/KAFKA/Clients#Clients-Node.js [19:33:37] it's more if the message format is a moving target or requires info that we'd have trouble to come by [19:34:14] (03CR) 10Cmjohnson: "Are we not getting rid of Dickson as an IRC server?" [dns] - 10https://gerrit.wikimedia.org/r/182228 (owner: 10Cmjohnson) [19:36:36] ottomata: looking at https://github.com/wikimedia/varnishkafka [19:36:55] are we using json output? [19:37:02] quick question on wikibugs -- should all ops-* project tasks be reported here? [19:37:10] yes [19:37:14] gwicke, ohoh, i saw your email [19:37:22] this is for http request logging to restbase? [19:37:32] from restbase, potentially [19:37:44] if you want to log from restbase, i doubt you would use varnishkafka [19:37:57] i you want to log http requests to restbase, varnishkafka is probably a good idea [19:38:00] I'm not saying that we should do this, but the complexity of nginx + varnish left me wondering if we could just use plain node [19:38:06] a [19:38:06] ah [19:38:10] (03CR) 10Faidon Liambotis: "Not as far as I know. Is there a ticket for this?" 
[dns] - 10https://gerrit.wikimedia.org/r/182228 (owner: 10Cmjohnson) [19:38:14] sure, don't see why not, it is just json [19:38:19] so you could code something to format the messages in json [19:38:20] but [19:38:21] hm [19:38:39] the trouble is, if you want the restbase logs to go in with the regular webrequest logs [19:38:43] and you use node to do that [19:38:49] we'd have to maintain two systems [19:38:57] varnishkafka is very flexible with how the json format is done [19:39:11] yeah, that is my main worry [19:39:30] are we changing that a lot? [19:39:41] (03PS1) 10Faidon Liambotis: ganglia: export $cname when ganglia_new is used [puppet] - 10https://gerrit.wikimedia.org/r/182233 [19:39:43] bblack: ^ [19:39:44] not often [19:39:47] bblack: I wonder if this will work... [19:39:51] but i'd like to reserve the ability to do so [19:40:04] paravoid: are you around next week? if so do you have a couple hours to work on frack/codfw firewall config? [19:40:04] *nod* [19:40:14] ottomata: a json schema could check for compatibility [19:40:23] we could hook that up in CI [19:40:47] Jeff_Green: I am around next week, couple of hours is probably fine but more I don't know... [19:40:55] gwicke: aye, sounds possible [19:41:02] paravoid: ok. [19:41:12] gwicke: coudln't you just put restbase behind misc web lb varnish? [19:41:19] ottomata: is there an example for current json output anywhere? [19:41:36] hm [19:41:41] (03CR) 10Faidon Liambotis: [C: 032] "Let's see if this will work." [puppet] - 10https://gerrit.wikimedia.org/r/182233 (owner: 10Faidon Liambotis) [19:41:50] gwicke: i can get you one [19:42:11] https://wikitech.wikimedia.org/wiki/Analytics/Cluster/ETL#varnishkafka.conf [19:42:30] proposed ops channel filter: https://gerrit.wikimedia.org/r/#/c/182235/3/channels.yaml [19:42:49] (03Abandoned) 10Cmjohnson: Removing production dns entries for dickson ipv4 and ipv6 adding reverse for mgmt [dns] - 10https://gerrit.wikimedia.org/r/182228 (owner: 10Cmjohnson) [19:42:57] gwicke: https://gist.github.com/ottomata/fd146bc030e4e64c8d57 [19:43:10] ottomata: merci beaucoup! [19:43:27] k, gonna sign off, laters all! [19:43:34] ottomata: btw [19:43:38] before you go [19:43:39] yes? [19:43:43] see the commit message of the above commit [19:43:49] (03PS2) 10Florianschmidtwelzow: Hygiene: Move wgMFAnonymousEditing to InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/182175 [19:43:50] this probably has something to do with all those esams (null)s [19:44:32] aahhhhHHH! [19:44:34] good find. [19:44:48] cool, welp thanks. hopefully i'll decomission check_ganglia in the next week or so anyway [19:46:19] paravoid, bblack: do we have the ability to spin up new prod machines with jessie already? [19:46:27] probably, almost [19:46:54] there's a couple of minor things left in my TODO before I announce it [19:47:04] - check_command check_ganglia!uranium.wikimedia.org!8654!amssq31.esams.wmnet!kafka.varnishkafka.kafka_drerr.per_second!0.1:29.9!30.0! [19:47:07] + check_command check_ganglia!uranium.wikimedia.org!8654!amssq31.esams.wmnet!kafka.varnishkafka.kafka_drerr.per_second!0.1:29.9!30.0!Text caches esams [19:47:11] awesome [19:47:53] paravoid: cool! Do you think it'll be realistic to spin one up ~2 weeks from now? 
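
Since the varnishkafka output discussed above is plain JSON, producing compatible records from restbase (or anything else) only needs a Kafka client plus the same field names, and the schema check gwicke suggests wiring into CI is a few lines. A hedged Python sketch of both ideas; every field name, hostname and the topic here are illustrative only -- the authoritative field list is whatever %{...} tags the varnishkafka.conf format string (linked above) defines:

    import json
    from jsonschema import validate          # pip install jsonschema
    from kafka import KafkaProducer          # pip install kafka-python

    # Illustrative subset of a webrequest-style record.
    schema = {
        "type": "object",
        "required": ["hostname", "dt", "http_status", "uri_host", "uri_path"],
        "properties": {
            "hostname":    {"type": "string"},
            "dt":          {"type": "string"},
            "http_status": {"type": "string"},
            "uri_host":    {"type": "string"},
            "uri_path":    {"type": "string"},
        },
    }

    record = {
        "hostname": "restbase-test.example",                 # hypothetical host
        "dt": "2014-12-30T19:40:00",
        "http_status": "200",
        "uri_host": "rest.wikimedia.org",
        "uri_path": "/en.wikipedia.org/v1/page/html/Foo",    # example path
    }

    validate(record, schema)   # the compatibility check that CI could run

    producer = KafkaProducer(bootstrap_servers="kafka1001.example:9092")   # hypothetical broker
    producer.send("webrequest_misc", json.dumps(record).encode("utf-8"))   # hypothetical topic
    producer.flush()

The same record layout could just as easily be produced from a node client; the point is only that the format is a moving target unless a schema like the one above is versioned alongside varnishkafka.conf.
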
[19:47:59] yes [19:48:12] although note that jessie has not been released yet [19:48:20] and there's still bugs to be fixed (and found) [19:48:25] yeah, but it works on my laptop™ [19:48:34] so, it depends really on what kind of service do you want to run on it [19:48:46] that would be nginx [19:48:58] for what? [19:49:12] tls termination [19:49:16] for where? [19:49:26] and spdy [19:49:31] we need to make a custom nginx for jessie anyways I think [19:49:51] the outstanding question is whether we're required to port the udplog stuff or not [19:50:05] paravoid: we could perhaps use restbase as a low-traffic test [19:51:02] we certainly don't need udplog there [19:52:24] you need varnish there too don't you? [19:52:58] so even if we go ahead with your plan (which I don't think we want to), we need to either build varnish3 (for jessie) or nginx 1.6 (for trusty) [19:53:04] it's not clear if we *need* it for perf [19:53:17] but in any case, we could install varnish on another box [19:53:37] if we can't make it work with jessie for now [19:53:40] it's nice if we keep our endpoints standard [19:53:59] http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&h=neon.wikimedia.org&m=cpu_report&s=descending&mc=2&g=network_report&c=Miscellaneous+eqiad [19:54:13] (nginx for tls[/spdy], varnish for http and as local nginx destination) [19:54:23] I'm actually curious how restbase with direct spdy support compares with nginx + varnish + restbase [19:54:44] especially with varnish being two varnishes really [19:54:54] that's a lot of hops [19:55:00] it's just more attack surface for new protocol implementations if we also put restbase's spdy out there for a direct public hit [19:55:26] yeah, agreed on that [19:55:30] you mean terminating SSL/SPDY within node?? [19:55:39] yup [19:55:55] I haven't benchmarked it yet, but might be worth doing [19:56:24] the author of https://github.com/indutny/node-spdy knows what he's doing [19:56:37] even if perf is acceptable, doesn't sound great from a security perspective [19:56:58] it's also not very maintainable, esp. since those things seem to moving very quickly these days [19:57:02] spdy versions, ciphers etc. [19:57:17] but nginx->node (sans varnish) might be a good tradeoff [19:57:44] *nod* [19:58:00] or a single-layer varnish only [19:59:06] the parsoid varnishes don't perform too well right now, but that might also have to do with the large caches they maintain [19:59:10] bblack: see the ganglia graph above [19:59:30] nice [19:59:41] they're still not done updating to the new commandlines completely, but it's already better [19:59:52] I salt'ed it [19:59:57] ah I see [20:01:18] paravoid, bblack: what's your take on spinning up a jessie node once that's ready & giving nginx 1.6 a spin on that as the potential services front-end? [20:02:06] I'd be open to it but don't quote me on this just yet :) [20:02:20] well I've been kinda watching/waiting on what happens with our jessie stuff, but that's kinda why I stopped converting varnishes to trusty [20:02:36] bblack: I'll send a mail next week probably [20:02:47] I think if jessie's ready soon, the currenty trusty prod test box will just become a jessie one and we'll go straight to jessie from there, imho [20:02:57] there's two small things left, I'd be done if it wasn't for this whole neon craziness [20:03:06] but also http://nthykier.wordpress.com/2014/12/30/status-on-jessie-december-2014/ [20:03:13] paravoid, bblack: that sounds great! 
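
The "nginx for TLS/SPDY, local backend for HTTP" split being weighed above boils down to a small nginx 1.6 server block; a minimal sketch, with placeholder certificate paths, server name and backend port (and assuming an nginx build with the SPDY module):

    server {
        listen 443 ssl spdy;                  # SPDY needs nginx built with the spdy module
        server_name rest.wikimedia.org;

        ssl_certificate     /etc/ssl/localcerts/unified.chained.pem;   # placeholder
        ssl_certificate_key /etc/ssl/private/unified.key;              # placeholder

        location / {
            # restbase (or a local varnish in front of it); port is a placeholder
            proxy_pass http://127.0.0.1:7231;
            proxy_set_header Host $host;
            proxy_set_header X-Forwarded-For $remote_addr;
        }
    }

Dropping the varnish layer just means pointing proxy_pass at restbase directly, which is the "nginx->node (sans varnish)" trade-off mentioned above.
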
[20:03:26] apt under puppet was broken until a week ago [20:03:30] https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=772641 [20:04:25] Jan 5th is the next freeze milestone, I think we can slowly start deploying in prod, but carefully [20:04:47] paravoid: FYI I think I'm gonna push out two new gdnsd releases today. a 2.2.0 with the current master features stuff (e.g. geoip2), and a 2.1.1 with just the important+simple bugfixes backported (on a side branch like 1.x was at the end) [20:04:49] I was thinking of starting with mostly-internal boxes [20:05:56] restbase is still experimental, and our main users will be all internal [20:06:39] http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&h=neon.wikimedia.org&m=cpu_report&s=descending&mc=2&g=network_report&c=Miscellaneous+eqiad [20:06:42] \o/ [20:06:50] no memory spikes either [20:07:03] http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&h=neon.wikimedia.org&m=cpu_report&s=descending&mc=2&g=mem_report&c=Miscellaneous+eqiad [20:07:16] (and then 2.3.0 is going to be mostly the re-re-re-factor of all the daemonization/control stuff to do everything automagically smoothly even under systemd, + API fixups for Dyn) [20:07:29] paravoid: that's awesome :) [20:09:09] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [500.0] [20:10:29] PROBLEM - puppet last run on cp3020 is CRITICAL: CRITICAL: Puppet has 1 failures [20:11:08] PROBLEM - puppet last run on amssq49 is CRITICAL: CRITICAL: Puppet has 1 failures [20:11:39] esams packet loss [20:11:41] grumble [20:11:46] http://smokeping.wikimedia.org/?displaymode=n;start=2014-12-30%2017:11;end=now;target=ESAMS.Core.cr1-esams [20:11:47] PROBLEM - puppet last run on cp3008 is CRITICAL: CRITICAL: Puppet has 1 failures [20:13:05] be ncie when we get all those new redundant links going [20:13:07] *nice [20:13:19] we still haven't approached redundant links for esams though :( [20:13:28] we're not very close to that [20:13:29] oh I thought that was in the new plan too? [20:13:31] 6-12 months I'd say [20:13:34] well, it is, sure [20:13:52] the plan is to get an esams-eqord link at some point [20:13:59] but we don't have eqord just yet, so... :) [20:14:07] RECOVERY - puppet last run on amssq49 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [20:14:25] also replacing the esams-eqiad link would also be good, this hasn't worked very well [20:14:56] we have some quotes for an unprotected wave which wouldn't have packet loss issues/non-guranteed latency too [20:14:59] but it's too expensive :/ [20:21:19] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [20:25:19] RECOVERY - puppet last run on cp3008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:25:40] RECOVERY - puppet last run on cp3020 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:25:49] greg-g_, any objections to a minor config zero-portal-only depl? 
https://gerrit.wikimedia.org/r/#/c/182227/ [20:26:21] MaxSem, ^ [20:29:37] (03PS1) 10Legoktm: Undeploy OpenSearchXml [mediawiki-config] - 10https://gerrit.wikimedia.org/r/182288 [20:41:43] yurikR: probably are objections :/ [20:41:52] until after new years [20:42:22] aude, its a very minor and zeroportal specific, and we are trynig to launch it asap, this is one of the few blockers ( [20:43:43] * aude nods [20:43:43] (03CR) 10Yurik: [C: 032] Enable $wgAllowSiteCSSOnRestrictedPages for zero [mediawiki-config] - 10https://gerrit.wikimedia.org/r/182227 (owner: 10Jhobs) [20:43:47] (03Merged) 10jenkins-bot: Enable $wgAllowSiteCSSOnRestrictedPages for zero [mediawiki-config] - 10https://gerrit.wikimedia.org/r/182227 (owner: 10Jhobs) [20:46:01] !log yurik Synchronized wmf-config/CommonSettings.php: ZeroPortal 182227 (duration: 00m 06s) [20:46:10] Logged the message, Master [20:47:05] I'm going to restart icinga [20:55:44] PROBLEM - puppet last run on amslvs3 is CRITICAL: CRITICAL: puppet fail [20:55:44] PROBLEM - puppet last run on amssq41 is CRITICAL: CRITICAL: Puppet has 1 failures [20:55:54] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [500.0] [20:56:04] PROBLEM - puppet last run on amssq62 is CRITICAL: CRITICAL: Puppet has 2 failures [21:09:52] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [21:12:32] RECOVERY - puppet last run on amslvs3 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [21:12:42] RECOVERY - puppet last run on amssq62 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:13:23] RECOVERY - puppet last run on amssq41 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [21:44:32] PROBLEM - ElasticSearch health check for shards on logstash1001 is CRITICAL: CRITICAL - elasticsearch inactive shards 67 threshold =0.1% breach: {ustatus: uyellow, unumber_of_nodes: 2, uunassigned_shards: 65, utimed_out: False, uactive_primary_shards: 40, ucluster_name: uproduction-logstash-eqiad, urelocating_shards: 0, uactive_shards: 53, uinitializing_shards: 2, unumber_of_data_nodes: 2} [21:44:52] * bd808 grumbles [21:46:03] PROBLEM - ElasticSearch health check for shards on logstash1003 is CRITICAL: CRITICAL - elasticsearch inactive shards 67 threshold =0.1% breach: {ustatus: uyellow, unumber_of_nodes: 2, uunassigned_shards: 65, utimed_out: False, uactive_primary_shards: 40, ucluster_name: uproduction-logstash-eqiad, urelocating_shards: 0, uactive_shards: 53, uinitializing_shards: 2, unumber_of_data_nodes: 2} [21:46:21] PROBLEM - ElasticSearch health check for shards on logstash1002 is CRITICAL: CRITICAL - elasticsearch http://10.64.32.137:9200/_cluster/health error while fetching: Request timed out. 
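
For reference, the "inactive shards 67" figure in the alerts above is just unassigned + initializing shards (65 + 2) from the cluster health API, compared against the total as a percentage. A quick Python sketch of the same arithmetic (the FQDN is assumed):

    import requests   # pip install requests

    def inactive_shards(host):
        """Roughly what the icinga shard check above reports."""
        health = requests.get("http://%s:9200/_cluster/health" % host, timeout=10).json()
        inactive = health["unassigned_shards"] + health["initializing_shards"]
        total = inactive + health["active_shards"] + health["relocating_shards"]
        return health["status"], inactive, 100.0 * inactive / total

    status, inactive, pct = inactive_shards("logstash1001.eqiad.wmnet")  # hostname assumed
    print(status, inactive, pct)   # e.g. yellow, 67, ~55.8% -- 65 unassigned + 2 initializing
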
[21:49:49] logstash1001 has logged that it can't see logstash1002 any more [21:52:43] !log restarted elasticsearch on logstash1002; it had dropped from the cluster [21:52:49] Logged the message, Master [23:05:33] PROBLEM - graphite.wikimedia.org on tungsten is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 525 bytes in 0.005 second response time [23:08:33] PROBLEM - RAID on ms-be2003 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) [23:11:02] RECOVERY - graphite.wikimedia.org on tungsten is OK: HTTP OK: HTTP/1.1 200 OK - 1607 bytes in 0.203 second response time [23:13:33] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 13.33% of data above the critical threshold [500.0] [23:14:41] PROBLEM - puppet last run on ms-be2003 is CRITICAL: CRITICAL: Puppet has 1 failures [23:15:26] bblack: 22s to run against esams :( [23:15:32] oh wait, wrong window [23:16:51] PROBLEM - puppet last run on ms-fe3002 is CRITICAL: CRITICAL: Puppet has 1 failures [23:16:51] PROBLEM - puppet last run on amssq55 is CRITICAL: CRITICAL: Puppet has 1 failures [23:24:52] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [23:28:02] RECOVERY - puppet last run on amssq55 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [23:30:51] RECOVERY - puppet last run on ms-fe3002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [23:33:26] (03PS1) 10Faidon Liambotis: nagios: restructure check_ssl & misc fixes [puppet] - 10https://gerrit.wikimedia.org/r/182303 [23:33:28] (03PS1) 10Faidon Liambotis: nagios: add --no-sni to check_ssl to disable SNI [puppet] - 10https://gerrit.wikimedia.org/r/182304 [23:33:30] (03PS1) 10Faidon Liambotis: nagios: introduce new check_sslxNN check [puppet] - 10https://gerrit.wikimedia.org/r/182305 [23:33:32] (03PS1) 10Faidon Liambotis: nagios: add no-SNI mode to check_sslxNN [puppet] - 10https://gerrit.wikimedia.org/r/182306 [23:33:59] perl review anyone [23:42:21] (03PS2) 10Faidon Liambotis: nagios: restructure check_ssl & misc fixes [puppet] - 10https://gerrit.wikimedia.org/r/182303 [23:42:23] (03PS2) 10Faidon Liambotis: nagios: add --no-sni to check_ssl to disable SNI [puppet] - 10https://gerrit.wikimedia.org/r/182304 [23:42:25] (03PS2) 10Faidon Liambotis: nagios: introduce a new check_sslxNN check [puppet] - 10https://gerrit.wikimedia.org/r/182305 [23:42:27] (03PS2) 10Faidon Liambotis: nagios: add no-SNI mode to check_sslxNN [puppet] - 10https://gerrit.wikimedia.org/r/182306 [23:42:33] (03CR) 10BBlack: [C: 031] nagios: restructure check_ssl & misc fixes [puppet] - 10https://gerrit.wikimedia.org/r/182303 (owner: 10Faidon Liambotis) [23:42:39] ooh thanks [23:42:52] (03CR) 10Faidon Liambotis: [C: 032] "Tested." [puppet] - 10https://gerrit.wikimedia.org/r/182303 (owner: 10Faidon Liambotis) [23:44:01] (03CR) 10BBlack: [C: 031] nagios: add --no-sni to check_ssl to disable SNI [puppet] - 10https://gerrit.wikimedia.org/r/182304 (owner: 10Faidon Liambotis) [23:44:13] (03CR) 10Faidon Liambotis: [C: 032] nagios: add --no-sni to check_ssl to disable SNI [puppet] - 10https://gerrit.wikimedia.org/r/182304 (owner: 10Faidon Liambotis) [23:46:28] paravoid: so with the my $e = Local::CheckSSL->run() and one check re-using the module of the other... does icinga pre-load all epn perl modules before running any of them or something? 
[23:46:43] I guess I never thought about it, and figured it loaded them as it encountered them to run them [23:46:53] there's a require above [23:47:01] and I'm not using the ePN yet [23:47:08] ah right, ok [23:47:17] well, I tried it [23:47:24] with check_ssl (not xNN) [23:47:33] 1-min load went to 25-30 [23:47:49] but everytime I wrote to the file, the check started failing [23:48:09] it tried to load it again and it clashed with what was already loaded [23:48:23] so I'm not sure I should really rely on ePN [23:51:12] (03CR) 10BBlack: [C: 031] nagios: introduce a new check_sslxNN check [puppet] - 10https://gerrit.wikimedia.org/r/182305 (owner: 10Faidon Liambotis) [23:51:26] I think half of the issue is esams [23:51:41] all the roundtrips to esams are killing us [23:52:53] (03CR) 10BBlack: [C: 031] nagios: add no-SNI mode to check_sslxNN [puppet] - 10https://gerrit.wikimedia.org/r/182306 (owner: 10Faidon Liambotis) [23:53:07] note I didn't run any of the code, I just looked it over manually :) [23:53:51] so long as you're non-epn, I bet you could fork children for the 72 checks, do them in batches of N in parallel, to cut down on all the serial RTT [23:54:09] (fork within perl I mean, and gather results over a pipe or whatever) [23:55:04] or alternatively, just split it up a little more (e.g. run xNN separately at the icinga level for sni and non-sni and cut the time in half) [23:55:23] hmm [23:55:27] the 22s could be problematic if it grows into a failing timeout on check execution under adverse conditions [23:57:09] well, the other way to see it is that this may need yet another rewrite :) [23:57:27] fetching the unified since should be enough [23:57:55] you can check validity once, then check all CNs against the SANs once [23:57:58] without refetching it [23:58:10] that leaves a window of a strange misconfiguration, though [23:58:26] where nginx doesn't serve you the unified for some domain for some reason [23:59:50] (03PS1) 10Faidon Liambotis: cache: use check_sslxNN instead of NN x check_ssl [puppet] - 10https://gerrit.wikimedia.org/r/182326
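
The rewrite paravoid sketches at the end -- fetch the unified certificate once, then check every expected name against its SANs locally instead of making 72 round trips to esams -- would look roughly like this (Python rather than the Perl the plugin is written in; hostnames are only examples, and wildcard-aware matching is left out for brevity):

    import socket, ssl

    def fetch_sans(host, port=443):
        """Fetch the server's cert once and return its subjectAltName DNS entries."""
        ctx = ssl.create_default_context()
        with socket.create_connection((host, port), timeout=10) as sock:
            with ctx.wrap_socket(sock, server_hostname=host) as tls:
                # getpeercert() also verifies the chain and expiry against system CAs
                cert = tls.getpeercert()
        return {name for typ, name in cert.get("subjectAltName", ()) if typ == "DNS"}

    # One TLS handshake, then every CN is checked without another round trip.
    sans = fetch_sans("text-lb.esams.wikimedia.org")
    for cn in ("en.wikipedia.org", "*.wikipedia.org", "commons.wikimedia.org"):
        print(cn, "OK" if cn in sans else "MISSING")

As noted above this leaves a small window where nginx might serve a different certificate for some SNI name, so a per-name spot check (or bblack's batched parallel fetches) would still be worth keeping for the no-SNI case.
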