[00:18:46] (03CR) 10AndyRussG: "This wouldn't be needed if this code were working:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/182078 (owner: 10AndyRussG) [01:18:08] I'm getting a 503 alert popup whenever I try to "Review your changes" on VE. [01:18:50] varnish cache server, XID 2251730856 [01:22:20] nevermind, I think this was because I had the 'cite by url' gadget enabled. [01:25:52] (03PS1) 10GWicke: Increase thrift_framed_transport_size_in_mb from 15 to 256m [puppet/cassandra] - 10https://gerrit.wikimedia.org/r/182129 [02:00:00] (03PS1) 10AndyRussG: Remove unused mobile CentralNotice URLs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/182132 [02:01:44] ragesoss: "all your cite by url are belong to us" -VE [04:32:51] (03PS1) 10AndyRussG: Add comment about not redirecting ugly URLs [puppet] - 10https://gerrit.wikimedia.org/r/182141 [04:43:13] (03PS2) 10AndyRussG: Add comment about not redirecting ugly URLs [puppet] - 10https://gerrit.wikimedia.org/r/182141 [05:32:44] PROBLEM - puppet last run on labcontrol2001 is CRITICAL: CRITICAL: puppet fail [05:49:46] RECOVERY - puppet last run on labcontrol2001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [05:59:58] PROBLEM - Disk space on stat1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:02:35] PROBLEM - dhclient process on stat1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:02:46] PROBLEM - salt-minion processes on stat1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:09:04] (03CR) 10Springle: [C: 04-1] "May require some change to db-eqiad.php to map "wikishared" to a master server. Eg, sectionsByDB." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/181546 (owner: 10KartikMistry) [06:14:59] (03PS3) 10KartikMistry: WIP: Content Translation configuration for Production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/181546 [06:22:38] springle: thanks! [06:26:00] springle: 'wikishared' => 'x1' - is right? [06:26:12] or is it x1-master? [06:28:54] ok. it is 'x1' :) [06:30:30] stat1002's disk check is failing because fuse oopsed the kernel [06:30:35] commands such as df hang [06:31:38] however i see loadavg is high and users are running jobs so i'm hesitant to reboot it [06:31:59] their jobs do not use fuse, so i'll wait to discuss with ottomata in the morning [06:32:42] (03PS1) 10KartikMistry: Map wikishared DB to x1 master server [mediawiki-config] - 10https://gerrit.wikimedia.org/r/182148 [06:32:51] (03CR) 10jenkins-bot: [V: 04-1] Map wikishared DB to x1 master server [mediawiki-config] - 10https://gerrit.wikimedia.org/r/182148 (owner: 10KartikMistry) [06:35:02] ACKNOWLEDGEMENT - Disk space on stat1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. Jeff Gage FUSE driver crashed. reboot likely needed. [06:35:02] ACKNOWLEDGEMENT - dhclient process on stat1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. Jeff Gage FUSE driver crashed. reboot likely needed. [06:35:02] ACKNOWLEDGEMENT - salt-minion processes on stat1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. Jeff Gage FUSE driver crashed. reboot likely needed. 
[06:35:05] PROBLEM - puppet last run on amssq35 is CRITICAL: CRITICAL: Puppet has 1 failures [06:35:06] PROBLEM - puppet last run on cp3016 is CRITICAL: CRITICAL: Puppet has 1 failures [06:36:09] PROBLEM - puppet last run on mw1217 is CRITICAL: CRITICAL: Puppet has 1 failures [06:36:10] PROBLEM - puppet last run on mw1123 is CRITICAL: CRITICAL: Puppet has 2 failures [06:37:15] PROBLEM - puppet last run on db1021 is CRITICAL: CRITICAL: Puppet has 1 failures [06:37:16] PROBLEM - puppet last run on mw1061 is CRITICAL: CRITICAL: Puppet has 1 failures [06:37:26] PROBLEM - puppet last run on lvs2004 is CRITICAL: CRITICAL: Puppet has 2 failures [06:37:44] springle: can you add sectionLoad in db-eqiad? [06:38:27] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures [06:38:48] PROBLEM - puppet last run on cp4008 is CRITICAL: CRITICAL: Puppet has 1 failures [06:39:31] (for x1) [06:45:46] RECOVERY - puppet last run on amssq35 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [06:45:46] RECOVERY - puppet last run on cp3016 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [06:46:36] RECOVERY - puppet last run on mw1217 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:46:39] RECOVERY - puppet last run on mw1123 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [06:47:46] RECOVERY - puppet last run on db1021 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [06:47:47] RECOVERY - puppet last run on mw1061 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:47:56] RECOVERY - puppet last run on lvs2004 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [06:48:47] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [06:48:58] RECOVERY - puppet last run on cp4008 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [06:50:59] kart_: see ['externalLoads']['extension1'] instead of ['sectionLoads'] [06:51:16] PROBLEM - puppet last run on carbon is CRITICAL: CRITICAL: Puppet has 3 failures [06:52:47] kart_: then ask for a core dev review from someone who know the intricacies of stuff outside db-eqiad.php better than an opsen like me :) maybe TimStarling or AaronSchultz [06:52:57] or Reedy [06:59:28] springle: sure. [06:59:34] Thanks! [07:00:26] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 0 below the confidence bounds [07:30:28] andre__: what is the file size limit for files in phab? 
my upload fails with storage max limit exceeded [07:39:40] PROBLEM - Host analytics1027 is DOWN: CRITICAL - Plugin timed out after 15 seconds [07:39:40] PROBLEM - Host search1018 is DOWN: CRITICAL - Plugin timed out after 15 seconds [07:39:40] PROBLEM - Host 208.80.154.157 is DOWN: CRITICAL - Plugin timed out after 15 seconds [07:39:40] PROBLEM - Host mc1008 is DOWN: CRITICAL - Plugin timed out after 15 seconds [07:39:40] PROBLEM - Host amslvs4 is DOWN: CRITICAL - Plugin timed out after 15 seconds [07:39:41] PROBLEM - Host amslvs3 is DOWN: CRITICAL - Plugin timed out after 15 seconds [07:39:41] PROBLEM - Host mobile-lb.ulsfo.wikimedia.org_ipv6 is DOWN: CRITICAL - Plugin timed out after 15 seconds [07:40:22] PROBLEM - Host db2023 is DOWN: CRITICAL - Plugin timed out after 15 seconds [07:40:22] PROBLEM - Host db2017 is DOWN: CRITICAL - Plugin timed out after 15 seconds [07:40:22] PROBLEM - Host db1031 is DOWN: CRITICAL - Plugin timed out after 15 seconds [07:40:22] PROBLEM - Host ms-be3002 is DOWN: CRITICAL - Plugin timed out after 15 seconds [07:40:22] PROBLEM - Host ps1-a4-eqiad is DOWN: CRITICAL - Plugin timed out after 15 seconds [07:40:22] PROBLEM - Host mw1099 is DOWN: CRITICAL - Plugin timed out after 15 seconds [07:40:23] PROBLEM - Host mw1081 is DOWN: CRITICAL - Plugin timed out after 15 seconds [07:40:23] PROBLEM - Host analytics1026 is DOWN: CRITICAL - Plugin timed out after 15 seconds [07:40:23] PROBLEM - Host mw1193 is DOWN: CRITICAL - Plugin timed out after 15 seconds [07:41:37] RECOVERY - Host search1004 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [07:41:37] RECOVERY - Host db2034 is UP: PING OK - Packet loss = 0%, RTA = 42.94 ms [07:41:38] RECOVERY - Host ms-be2013 is UP: PING OK - Packet loss = 0%, RTA = 42.89 ms [07:41:38] RECOVERY - Host db2037 is UP: PING OK - Packet loss = 0%, RTA = 42.92 ms [07:41:39] RECOVERY - Host db2036 is UP: PING OK - Packet loss = 0%, RTA = 42.92 ms [07:41:39] RECOVERY - Host capella is UP: PING OK - Packet loss = 0%, RTA = 42.94 ms [07:41:40] RECOVERY - Host ms-be2006 is UP: PING OK - Packet loss = 0%, RTA = 42.90 ms [07:41:40] RECOVERY - Host mc1002 is UP: PING OK - Packet loss = 0%, RTA = 0.98 ms [07:41:41] RECOVERY - Host analytics1026 is UP: PING OK - Packet loss = 0%, RTA = 1.98 ms [07:41:41] RECOVERY - Host eeden is UP: PING OK - Packet loss = 0%, RTA = 96.08 ms [07:41:42] RECOVERY - Host amslvs4 is UP: PING OK - Packet loss = 0%, RTA = 95.18 ms [07:41:50] RECOVERY - Host mw1096 is UP: PING OK - Packet loss = 0%, RTA = 2.23 ms [07:41:50] RECOVERY - Host search1008 is UP: PING OK - Packet loss = 0%, RTA = 0.57 ms [07:42:09] RECOVERY - Host ps1-a1-eqiad is UP: PING OK - Packet loss = 0%, RTA = 5.37 ms [07:42:14] RECOVERY - Host elastic1002 is UP: PING OK - Packet loss = 0%, RTA = 2.52 ms [07:42:14] RECOVERY - Host misc-web-lb.eqiad.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 1.83 ms [07:42:38] RECOVERY - Host mw1164 is UP: PING OK - Packet loss = 0%, RTA = 3.09 ms [07:43:19] RECOVERY - Host ps1-b6-eqiad is UP: PING OK - Packet loss = 0%, RTA = 2.97 ms [07:43:22] RECOVERY - Host db2035 is UP: PING OK - Packet loss = 0%, RTA = 43.13 ms [07:44:29] RECOVERY - Host 208.80.154.50 is UP: PING OK - Packet loss = 0%, RTA = 2.40 ms [07:44:29] RECOVERY - Host 208.80.154.157 is UP: PING OK - Packet loss = 0%, RTA = 3.03 ms [07:44:29] RECOVERY - Host mobile-lb.ulsfo.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 69.83 ms [07:45:57] RECOVERY - Host bits-lb.ulsfo.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 70.27 
ms [07:48:28] hmmm [07:52:48] nothing out of the ordinary ... [07:55:13] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [500.0] [07:57:22] neon problems me thinks [08:18:56] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [08:31:08] PROBLEM - OCG health on ocg1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:46:50] (03CR) 10Nemo bis: "Ping Hoo :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/181791 (owner: 10Nemo bis) [09:30:32] greetings [09:30:50] akosiaris: yeah likely [09:41:05] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [500.0] [09:51:43] PROBLEM - puppet last run on stat1002 is CRITICAL: CRITICAL: Puppet last ran 4 hours ago [09:58:32] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [11:11:25] RECOVERY - HTTP error ratio anomaly detection on tungsten is OK: OK: No anomaly detected [11:17:49] (03PS1) 10Yuvipanda: Refactor to not be a big ball of mud [software/labsdb-auditor] - 10https://gerrit.wikimedia.org/r/182164 [11:19:17] (03PS2) 10Yuvipanda: Refactor to not be a big ball of mud [software/labsdb-auditor] - 10https://gerrit.wikimedia.org/r/182164 [11:20:08] ori: great blog post, enjoyed much [11:21:10] anyone exept chasemp can tell me the phab attach file size ? [11:23:54] matanya: which one? [11:24:11] YuviPanda|zzz: the upper limit [11:24:18] link? [11:24:28] matanya: oh, I meant the blog post :) [11:25:12] https://blog.wikimedia.org/2014/12/29/how-we-made-editing-wikipedia-twice-as-fast/ [11:36:13] PROBLEM - Host db2010 is DOWN: CRITICAL - Plugin timed out after 15 seconds [11:36:22] PROBLEM - Host analytics1019 is DOWN: CRITICAL - Plugin timed out after 15 seconds [11:36:23] PROBLEM - Host analytics1022 is DOWN: CRITICAL - Plugin timed out after 15 seconds [11:36:23] PROBLEM - Host mw1040 is DOWN: CRITICAL - Plugin timed out after 15 seconds [11:36:23] PROBLEM - Host analytics1032 is DOWN: CRITICAL - Plugin timed out after 15 seconds [11:36:23] PROBLEM - Host 208.80.153.42 is DOWN: CRITICAL - Plugin timed out after 15 seconds [11:36:23] PROBLEM - Host gallium is DOWN: CRITICAL - Plugin timed out after 15 seconds [11:36:23] PROBLEM - Host mw1046 is DOWN: CRITICAL - Plugin timed out after 15 seconds [11:36:24] PROBLEM - Host upload-lb.ulsfo.wikimedia.org_ipv6 is DOWN: CRITICAL - Plugin timed out after 15 seconds [11:36:34] uhm [11:36:47] it’s all ok [11:36:54] (mostly) [11:36:56] (I guess) [11:37:03] You playing with the network? 
:P [11:37:08] PROBLEM - Host mw1029 is DOWN: CRITICAL - Plugin timed out after 15 seconds [11:37:08] PROBLEM - Host ps1-b2-eqiad is DOWN: CRITICAL - Plugin timed out after 15 seconds [11:37:08] PROBLEM - Host wtp1020 is DOWN: CRITICAL - Plugin timed out after 15 seconds [11:37:08] RECOVERY - Host db2010 is UP: PING OK - Packet loss = 0%, RTA = 42.97 ms [11:37:08] RECOVERY - Host analytics1019 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [11:37:08] RECOVERY - Host analytics1022 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [11:37:08] RECOVERY - Host analytics1032 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [11:37:11] Or icinga [11:37:16] hoo: no, neon the icinga host being overloaded [11:37:17] RECOVERY - Host mw1046 is UP: PING OK - Packet loss = 0%, RTA = 0.59 ms [11:37:22] memory and CPU pressure, mostly [11:37:22] RECOVERY - Host gallium is UP: PING OK - Packet loss = 0%, RTA = 5.59 ms [11:37:22] RECOVERY - Host mw1029 is UP: PING OK - Packet loss = 0%, RTA = 1.64 ms [11:37:22] RECOVERY - Host wtp1020 is UP: PING OK - Packet loss = 0%, RTA = 1.87 ms [11:37:23] still? [11:37:24] meh [11:37:29] But good to know [11:37:31] RECOVERY - Host mw1040 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [11:37:53] hoo: yeah [11:38:00] RECOVERY - Host ps1-b2-eqiad is UP: PING OK - Packet loss = 0%, RTA = 2.34 ms [11:39:34] YuviPanda|zzz: Are you firm with logrotate? [11:39:38] RECOVERY - Host upload-lb.ulsfo.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 71.71 ms [11:39:49] Not sure it's the cause... but I have incomplete logs for the last WD json dump [11:39:53] but the dumps look ok [11:39:58] hoo: not really. I mostly get by copy pasting another config and tweaking till I’m happy with what I have [11:39:59] oh [11:40:07] hoo: incomplete in what sense? [11:40:20] RECOVERY - Host 208.80.153.42 is UP: PING OK - Packet loss = 0%, RTA = 43.95 ms [11:40:43] Log output ends after 800k entities (per shard... so 3.2M overall) [11:40:51] but we have 16-17M entities [11:41:02] logs from the week before are complete [11:41:04] do you know if it’s the first 800k or the last? [11:41:09] First [11:41:20] it says Dumped entities up to 80000 [11:41:23] or something like that [11:41:53] Ah... seems like logrotate rotated them midway through [11:42:01] -rw-rw-r-- 1 datasets datasets 101910 Dec 29 06:25 dumpwikidatajson-3.log.1.gz [11:42:01] -rw-rw-r-- 1 datasets datasets 538877 Dec 22 13:41 dumpwikidatajson-3.log.2.gz [11:42:07] spot the difference :P [11:42:27] grrr [11:43:51] hoo: are the log files being written out from php or is it just a stdout redirection? [11:44:03] stderr it is, yes [11:44:45] hmm, ideally for something as long running as that it should write log files itself and ‘reload’ on sighup or something [11:45:37] hoo: with ^ at least it won’t be cut off in the middle :) [11:45:41] $this->addOption( 'log', "Log file (default is stderr). Will be appended.", false, true ); [11:46:33] But even in that case I'd need to rotate stuffs [11:48:23] right [11:48:32] wait [11:48:38] I don’t think I actuall understood the problem now [11:48:38] I could manually invoke logrotate from the script that does the dumps [11:48:55] did old and new logs somehow get conflated? [11:49:02] did you lose log data? or is it just confusingly done? 
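
The "write the log file itself and reload on SIGHUP" approach YuviPanda mentions above is the standard way to stop logrotate from truncating a long-running job mid-way. A minimal Python sketch of the pattern (the real dump job is a PHP maintenance script driven from a shell wrapper, so this only illustrates the idea; the log path is hypothetical, and it assumes a logrotate postrotate stanza that sends SIGHUP to the process):

    import signal

    LOG_PATH = "/var/log/wikidata/dumpwikidatajson.log"   # hypothetical path

    log = open(LOG_PATH, "a", buffering=1)                # line-buffered

    def reopen_log(signum, frame):
        # After logrotate renames the old file (and postrotate sends SIGHUP),
        # close the renamed inode and start appending to a fresh file.
        global log
        log.close()
        log = open(LOG_PATH, "a", buffering=1)

    signal.signal(signal.SIGHUP, reopen_log)

    def log_line(msg):
        log.write(msg + "\n")

    # ... the long-running dump loop keeps calling log_line() and nothing is
    # lost when rotation happens part-way through a run.
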
[11:49:13] Lost log data [11:49:23] not a problem here, but could be if stuff actually fails [11:50:47] (03PS3) 10Yuvipanda: Refactor to not be a big ball of mud [software/labsdb-auditor] - 10https://gerrit.wikimedia.org/r/182164 [11:54:33] YuviPanda: Any idea? [11:54:40] hoo: oh, because logrotate deleted your ‘old’ file? [11:54:49] I’m trying to understand what exactly happened :) [11:58:26] (03PS4) 10Yuvipanda: Refactor to not be a big ball of mud [software/labsdb-auditor] - 10https://gerrit.wikimedia.org/r/182164 [12:30:26] YuviPanda: leaving in a bit [12:30:33] would be nice if you could keep me updated [12:30:48] hoo: oh, on the logrotate? I’m not actually investigating atm :| [12:30:54] hoo: file a bug? I’ll take a look later. [12:32:07] Maybe I'll just hack it up like this: Have the logrotate not run automatically, but just invoke it pre-run of the script [12:32:23] so taht we rotate once before running the the dump creation [12:32:44] Not super nice, but easy to do [12:33:32] YuviPanda: Worth looking at that or a no-go? [12:34:18] hoo: hmm, maybe tack date of run to end of log file, and clean up everything else? [12:34:56] mh? [12:35:21] hoo: so it shall output to dumpwikidatajson-20141229021245.log [12:35:32] Oh [12:35:40] and then let the script do the deletions like it does for old dumps [12:35:57] like keep 5, kill the rest [12:36:09] yeah [12:36:23] Not that convenient, but probably nicer [12:36:39] hoo: or put logs into logstash :P [12:37:12] :P [12:38:07] Off for now [12:38:17] Thanks for the advice... I count on you to merge it, then :D [12:46:37] (03PS5) 10Yuvipanda: Refactor to not be a big ball of mud [software/labsdb-auditor] - 10https://gerrit.wikimedia.org/r/182164 [12:48:38] (03PS6) 10Yuvipanda: Refactor to not be a big ball of mud [software/labsdb-auditor] - 10https://gerrit.wikimedia.org/r/182164 [12:50:32] (03PS7) 10Yuvipanda: Refactor to not be a big ball of mud [software/labsdb-auditor] - 10https://gerrit.wikimedia.org/r/182164 [12:53:11] (03PS8) 10Yuvipanda: Refactor to not be a big ball of mud [software/labsdb-auditor] - 10https://gerrit.wikimedia.org/r/182164 [13:34:52] springle: I don't get how to map wikishared with x1 :( [13:44:20] x1? [14:02:56] RECOVERY - puppet last run on carbon is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:14:03] this category says there are 301 pages in it, while there actually are 315 in main namespace: https://nl.wikipedia.org/wiki/Categorie:Wikipedia:Etalage-artikelen [14:14:26] I am pointed to here as ops an do a full recount [14:17:39] (03PS1) 10Hoo man: Don't use logrotate for the wikidata dump logs [puppet] - 10https://gerrit.wikimedia.org/r/182173 [14:17:53] YuviPanda: ^ [14:17:55] untested [14:19:17] hoo: out for food. Will Check when back. Add me as reviewer? [14:19:58] Oh, sure [14:20:47] Reedy: https://gerrit.wikimedia.org/r/#/c/182148/ and https://phabricator.wikimedia.org/T84969 [14:21:03] Reedy: any suggestion? [14:21:45] Reedy: springle: kart_: see ['externalLoads']['extension1'] instead of ['sectionLoads'] [14:21:52] but I don't get it :) [14:23:49] Romaine: Not sure how to do that for just a specific category [14:24:03] and I don't want to run such a script for all of nlwiki [14:24:30] who kinows more about this? [14:24:58] Reedy might [14:25:14] But I dobut this is even possible w/o manually tempering [14:25:30] everything returns to Reedy [14:25:32] :) [14:25:56] :P [14:27:51] hoo: I created a bug for it: https://phabricator.wikimedia.org/T85527 [14:29:05] Who can clone Reedy? 
;) [14:29:56] good point [14:30:36] PROBLEM - puppet last run on carbon is CRITICAL: CRITICAL: Puppet has 4 failures [14:30:49] Romaine: thx [14:31:14] you are better in placing it in the right projects etc [14:31:48] (03PS2) 10Hoo man: Permanently enable unregistered users editing on it.m.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/181791 (owner: 10Nemo bis) [14:35:04] (03CR) 10Hoo man: [C: 032] "Doing this actually means a way smaller change than not doing it... also it's way less likely to break than just letting this run out of i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/181791 (owner: 10Nemo bis) [14:35:09] (03Merged) 10jenkins-bot: Permanently enable unregistered users editing on it.m.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/181791 (owner: 10Nemo bis) [14:36:16] !log hoo Synchronized wmf-config/CommonSettings.php: Enable unregistered users editing on it.m.wikipedia.org after Dec 31 (duration: 00m 06s) [14:36:22] Nemo_bis: ^ [14:37:01] morebots: ping [14:37:01] I am a logbot running on tools-exec-14. [14:37:01] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log. [14:37:01] To log a message, type !log . [14:39:07] PROBLEM - puppet last run on amssq38 is CRITICAL: CRITICAL: puppet fail [14:44:46] thanks goo [14:44:49] hoo [14:45:01] You're welcome [14:45:29] log successful https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:35] Yeah, saw that [14:45:43] just wondered because it didn't tell me it worked [14:45:48] So morebots is just being rude today [14:45:54] Log that :D [14:46:06] !log morebots is being rude today [14:46:12] Logged the message, Master [14:46:20] :'-( [14:46:22] See, education works [14:51:44] (03PS1) 10Florianschmidtwelzow: Hygiene: Move wgMFAnonymousEditing to InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/182175 [14:52:58] RECOVERY - puppet last run on amssq38 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [14:59:20] (03PS1) 10Florianschmidtwelzow: Hygiene: Change wgMFAnonymousEditing to wgMFEditorOptions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/182177 [15:10:05] PROBLEM - MySQL Processlist on db1056 is CRITICAL: CRIT 73 unauthenticated, 0 locked, 0 copy to table, 1 statistics [15:13:36] RECOVERY - MySQL Processlist on db1056 is OK: OK 1 unauthenticated, 0 locked, 0 copy to table, 1 statistics [15:21:49] <^demon|zzz> paravoid: "Once any bot/tool/etc authors using the old search are contacted (and helped, if possible) lsearchd will go away there [enwiki, ruwiki, nlwiki] too." [15:22:02] <^demon|zzz> That was before Xmas, I hadn't checked the situation or tried to contact anyway yet. 
[15:22:40] <^demon|zzz> *anyone [15:32:51] (03PS1) 10Ottomata: Run logster job for varnishkafka every minute to get smooth derivative in grafana graphs [puppet] - 10https://gerrit.wikimedia.org/r/182184 [15:34:13] (03CR) 10Ottomata: [C: 032] Run logster job for varnishkafka every minute to get smooth derivative in grafana graphs [puppet] - 10https://gerrit.wikimedia.org/r/182184 (owner: 10Ottomata) [15:36:42] (03CR) 10Glaisher: [C: 031] "Thanks" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/182175 (owner: 10Florianschmidtwelzow) [15:39:37] (03PS1) 10RobH: setting dbstore1002/1002 production dns entries [dns] - 10https://gerrit.wikimedia.org/r/182185 [15:40:03] RECOVERY - puppet last run on carbon is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:49:40] (03CR) 10RobH: [C: 032] setting dbstore1002/1002 production dns entries [dns] - 10https://gerrit.wikimedia.org/r/182185 (owner: 10RobH) [15:59:16] (03PS1) 10RobH: setting dbstore2001/2002 install parameters [puppet] - 10https://gerrit.wikimedia.org/r/182188 [16:04:58] (03CR) 10RobH: [C: 032] setting dbstore2001/2002 install parameters [puppet] - 10https://gerrit.wikimedia.org/r/182188 (owner: 10RobH) [16:13:23] gwicke: ping? [16:14:06] (03PS1) 10Ottomata: Install local txstatsd and logster job for varnishkafka on all cache servers [puppet] - 10https://gerrit.wikimedia.org/r/182190 [16:15:00] bblack: things are looking good, i'm going to go ahead and do that unless you have objections ^ [16:15:01] paravoid: grr, thought I was connected [16:15:03] but evidently weechat was still in limbo mode [16:15:09] so, what's the issue with the tests? [16:15:20] I guess you can't see your backlog? [16:15:35] kill them first, I'll explain :) [16:16:15] alright, stopped them [16:16:21] something about DNS? [16:16:32] so you're hammering restbase with a lot of RPS [16:16:38] for some reason I haven't yet understood [16:16:48] you're using carbon.wikimedia.org aka webproxy.eqiad.wmnet [16:16:54] a squid running there [16:16:56] ottomata: yeah sounds good [16:17:03] and this is killing the box [16:17:22] oh [16:17:26] damn [16:17:28] (netfilter conntrack limits being hit etc.) [16:17:38] I must have forgotten to unset the env vars in one of the shells [16:18:03] I could bump those limits but I don't think traffic should pass through there anyway :) [16:18:24] also note [16:18:24] ./node_modules/heapdump/build/config.gypi: "https_proxy": "http://webproxy.eqiad.wmnet:8080/", [16:18:53] I don't think you're doing https [16:19:02] nope, this is http [16:19:07] but I'm not really sure if this config variable being used in a different context [16:19:08] I did unset the vars in most of the shells [16:19:14] (03CR) 10Ottomata: [C: 032] Install local txstatsd and logster job for varnishkafka on all cache servers [puppet] - 10https://gerrit.wikimedia.org/r/182190 (owner: 10Ottomata) [16:19:21] so you probably only saw the requests from one of four dumpers [16:19:36] this was happening earlier [16:19:38] I killed a screen of yours [16:19:51] yes, I was talking about those [16:19:59] and it was happening again now [16:20:06] with your new shells [16:20:27] do you see any traffic now? 
[16:20:40] restarted one client [16:20:52] no [16:20:58] okay [16:21:04] great [16:21:26] if this is testbox->testbox I don't really mind [16:21:38] but if you intend to use prod stuff for those tests you should probably !log those [16:21:42] sorry for DOSing the cache [16:21:57] the whole machine was DoSed, not just the cache :) [16:22:12] the testing has been ongoing for weeks [16:22:29] that's what we need the boxes for [16:22:42] sure, I don't mind :) [16:23:07] if you intend to (intentionally) use prod resources for those tests, just ping and/or !log [16:24:17] robh: things should be working again [16:24:32] cool, thanks for letting me know (im doing installs, well, trying ;) [16:24:44] paravoid: those boxes are all prod [16:25:22] gwicke: you really don't want me to start treating those as prod :) [16:25:51] ok, thought you meant those boxes [16:25:52] I've found at least one of those boxes (I think it was ruthenium) with arbitrary sources.list fetching hhvm & node from the internet [16:26:15] yeah, ruthenium has been a test server for a while [16:26:59] so for test boxes it's (sort of) okay, but certainly not okay for anything that can be labeled "production" [16:27:00] the cassandra / restbase boxes are all puppet-driven [16:27:06] RECOVERY - dhclient process on stat1002 is OK: PROCS OK: 0 processes with command name dhclient [16:27:08] RECOVERY - puppet last run on stat1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:27:38] RECOVERY - salt-minion processes on stat1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [16:27:41] anyway, go on with your tests [16:27:47] we'll figure it out as this graduates in prod [16:28:05] RECOVERY - Disk space on stat1002 is OK: DISK OK [16:29:51] paravoid: do you see any traffic on the cache? [16:30:10] I restarted the other clients, just making sure [16:30:11] no [16:30:21] ok, thx [16:30:24] thank you :) [16:30:41] well, sorry for creating busywork ;) [16:30:52] np [16:33:14] paravoid, btw, once graphite has recovered this will be the graph corresponding to the testing: http://bit.ly/173yM3h [16:33:40] oh? is graphite down? [16:33:50] damn [16:34:00] godog: are you working on it? [16:34:37] oh, no I wasn't aware of it [16:34:39] looking now [16:35:04] well I meant if it was down because of something you were working on [16:35:12] but that's even better :P [16:35:19] haha indeed [16:35:25] it worked a few minutes ago [16:36:43] yeah memory usage is through the roof [16:37:48] uwsgi, killing it [16:38:00] !log killing uwsgi on tunsten, blew memory [16:38:08] Logged the message, Master [16:40:22] should be recovering shortly [16:47:05] I suspect the huge flow of new metrics didn't make graphite very amused [16:47:16] I'd like to add mobrovac to the operations mailing list. Should I just have him make a subscription request via mailman, or should I file a request? [16:47:53] robla: +100 [16:47:56] ottomata: obvious in hindsight now but I forgot that the varnishkafka metrics would go in all at the same time anyways in this case, see above [16:48:28] uh oh [16:48:46] godog: ja I thought we talked about that, but we decided to do local txstatsd anyway because of pickle or something [16:49:02] godog: what should we do? want me to revert? 
[16:49:32] ottomata: yeah, for creations pickle doesn't change that they get created all at the same time [16:49:42] anyways no it is fine as it is, no revert [16:50:35] I'll file a short incident report just so we don't lose track, creating new metrics is supposed to be rate limited [16:51:01] ah, its just the creation of the metrics that is causing the problem? [16:51:06] now that it is created it might be ok? [16:51:14] yeah I think so [17:00:58] unrelated, anyone else seeing this behaviour? https://phabricator.wikimedia.org/T85532 [17:01:06] in phab itself that is [17:02:01] ottomata: how many machines will be pushing metrics btw? [17:08:51] i'm on the ops list now, thnx robla [17:09:25] i just added you [17:09:39] i imagine due to robla's putting in the request in mailman [17:09:55] pretty sure I automatically just let anyone with @wikimedia.org on that list =D [17:10:20] (non wikimedia.org addresses have to confirm NDA status before adding) [17:12:21] robh: make staff wait a week and when they say why go 'this is how it feels for volunteers' :p [17:12:57] godog, all caches, pretty much [17:13:22] JohnLewis: i just totally typed up a funny reply to back that up, then realized i shouldn't mock how horrible we are at that =P [17:13:37] I'm hoping our migration into transparent ticketing helps that a lot [17:13:44] ottomata: ack, I was trying to gauge how many metrics it'll be total [17:13:52] robh: query me it :D [17:14:38] its forced me to deal with old bz tickets that i forgot existed [17:14:53] since once ops started using RT, i fully admit i really stopped checking BZ with any regularity [17:15:05] godog, uh a lot. [17:15:27] one sec... [17:16:29] godog, how many cache servers are there? about 100? [17:17:12] godog, if so, somewhere around 35K-40K new metrics total [17:17:19] (03PS1) 10Glaisher: Update noc's index.html [mediawiki-config] - 10https://gerrit.wikimedia.org/r/182193 [17:17:31] i'm looking at one vk, and its sending 376 [17:17:49] ottomata: ack, there are about 12K created now [17:27:05] (03CR) 10Alexandros Kosiaris: [C: 032] "LGTM" [puppet/cassandra] - 10https://gerrit.wikimedia.org/r/182129 (owner: 10GWicke) [17:27:57] (03CR) 10Alexandros Kosiaris: "recheck" [puppet/cassandra] - 10https://gerrit.wikimedia.org/r/182129 (owner: 10GWicke) [17:29:14] (03CR) 10Alexandros Kosiaris: [V: 032] "recheck wouldn't work, a different submodule. +2 Verified manually" [puppet/cassandra] - 10https://gerrit.wikimedia.org/r/182129 (owner: 10GWicke) [17:32:26] (03CR) 10Alexandros Kosiaris: "Btw, the puppet module also needs a submodule update for this to become live." 
[puppet/cassandra] - 10https://gerrit.wikimedia.org/r/182129 (owner: 10GWicke) [17:34:44] (03PS1) 10Filippo Giunchedi: install-server: include lsiutil and megaraid-status [puppet] - 10https://gerrit.wikimedia.org/r/182194 [17:35:38] (03PS1) 10RobH: taking analytics1001/1002 out of incorrect partman setup [puppet] - 10https://gerrit.wikimedia.org/r/182195 [17:35:40] (03PS2) 10Filippo Giunchedi: install-server: include lsiutil and megaraid-status [puppet] - 10https://gerrit.wikimedia.org/r/182194 [17:35:46] paravoid: ^ [17:36:10] (03CR) 10RobH: [C: 032] taking analytics1001/1002 out of incorrect partman setup [puppet] - 10https://gerrit.wikimedia.org/r/182195 (owner: 10RobH) [17:48:26] ottomata: yeah about 100 hosts I was looking at this on tungsten for i in /var/lib/carbon/whisper/varnishkafka/* ; do echo -n "$i :" ; find $i -type f | wc -l ; done [17:48:56] aye [17:49:50] well without the wc -l is more interesting [17:50:11] no, nevermind the last comment [18:04:46] (03CR) 10Faidon Liambotis: [C: 031] "+1 for lsiutil, not entirely happy with megaraid-status but that'd be okay if you silence it/make it not run a daemon." [puppet] - 10https://gerrit.wikimedia.org/r/182194 (owner: 10Filippo Giunchedi) [18:04:54] (03PS3) 10Filippo Giunchedi: install-server: include lsiutil in reprepro [puppet] - 10https://gerrit.wikimedia.org/r/182194 [18:10:39] (03CR) 10Filippo Giunchedi: "amended to skip megacli-status for now, I'll take a look at check-raid.py too!" [puppet] - 10https://gerrit.wikimedia.org/r/182194 (owner: 10Filippo Giunchedi) [18:11:35] (03PS4) 10Filippo Giunchedi: install-server: include lsiutil in reprepro [puppet] - 10https://gerrit.wikimedia.org/r/182194 [18:22:13] (03PS1) 10Jackmcbarn: Change tboverride to tboverride-account for enwiki accountcreators [mediawiki-config] - 10https://gerrit.wikimedia.org/r/182197 [18:26:00] (03PS2) 10Jackmcbarn: Change tboverride to tboverride-account for enwiki accountcreators [mediawiki-config] - 10https://gerrit.wikimedia.org/r/182197 [18:41:01] what the hell [18:41:21] why is neon writing ganglia RRDs to /var/lib/ganglia/rrds [18:42:13] kafka metrics [18:42:14] argh [18:43:11] paravoid: ? [18:43:24] is that check_ganglia? [18:43:28] I don't know yet [18:43:39] but what the hell, why would it do this [18:43:44] gmetad is writing those rrds [18:43:53] must be check ganglia, maybe it is caching them? [18:44:18] why is there a gmetad running on neon? [18:44:37] # Ganglia Meta Daemon for Wikimedia [18:44:42] # This file is managed by Puppet! [18:44:50] oh, gmetad, this is sounding familiar... [18:45:31] Keep the configuration that allows gmetad to work from neon, [18:45:31] since check_ganglia uses that. [18:45:43] is it though? I only see check_ganglia connecting to uranium [18:45:58] paravoid: [18:46:04] no, i don't think it is check ganglia though [18:46:09] see line 235 in ganglia.pp [18:46:14] # neon needs gmetad config [18:46:14] /^neon$/: { [18:46:14] $data_sources = { [18:46:14] ... [18:46:24] yes [18:46:24] this was [18:46:28] - # neon needs gmetad config for ganglios [18:46:33] see commit db1c635b8ed3b5329443c61a442f3dfb9470dfd9 [18:46:36] sounds familiar [18:46:49] but monitoring::ganglia seems to connect to uranium [18:47:10] YuviPanda: here? 
[18:47:58] (03PS1) 10Gage: hadoop-logstash: emit filtered stack traces [puppet/cdh] - 10https://gerrit.wikimedia.org/r/182208 [18:49:10] I'm pretty sure it's not being used [18:52:33] http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&h=neon.wikimedia.org&m=cpu_report&s=descending&mc=2&g=load_report&c=Miscellaneous+eqiad [18:52:40] (03CR) 10Gage: [C: 032] hadoop-logstash: emit filtered stack traces [puppet/cdh] - 10https://gerrit.wikimedia.org/r/182208 (owner: 10Gage) [18:52:42] magic [18:55:14] bblack: ^ is why "sync" took such a long time [18:56:15] (03PS1) 10Gage: hadoop-logstash: emit filtered stack traces [puppet] - 10https://gerrit.wikimedia.org/r/182210 [18:56:38] (03CR) 10Gage: [C: 032] hadoop-logstash: emit filtered stack traces [puppet] - 10https://gerrit.wikimedia.org/r/182210 (owner: 10Gage) [18:58:07] (03CR) 10MaxSem: [C: 031] Remove unused mobile CentralNotice URLs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/182132 (owner: 10AndyRussG) [18:58:26] (03PS1) 10Faidon Liambotis: ganglia: remove unused reference to neon [puppet] - 10https://gerrit.wikimedia.org/r/182211 [18:58:49] ottomata: ^ [18:59:57] ooook [19:00:24] i have a feeling something might break paravoid, but I can't think of what [19:00:35] (03CR) 10Faidon Liambotis: [C: 032] ganglia: remove unused reference to neon [puppet] - 10https://gerrit.wikimedia.org/r/182211 (owner: 10Faidon Liambotis) [19:05:03] ottomata: it was entirely unused [19:05:12] and even if it was supposed to be used, it was running unpuppetized [19:05:15] aye [19:05:17] also: http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&h=neon.wikimedia.org&m=cpu_report&s=descending&mc=2&g=load_report&c=Miscellaneous+eqiad [19:05:23] well, aye, true, the puppet change certainly won't break anything [19:05:38] no I also purged gmetad from the system manually [19:05:42] hokay [19:05:58] !log manually stopping acct on neon and setting /etc/default/acct ACCT_ENABLE to 0 [19:06:02] Logged the message, Master [19:06:16] (03CR) 10MaxSem: Hygiene: Move wgMFAnonymousEditing to InitialiseSettings (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/182175 (owner: 10Florianschmidtwelzow) [19:07:42] check_ganglia is horrible [19:11:47] omfg so horrible [19:14:56] :) [19:15:33] it's almost like someone was trying to win a contest for the most inefficient way to monitor those variables [19:17:19] bblack: so, we have acct running on those systems [19:17:41] well, on all systems [19:17:47] when I turned off acct the other day, load halfed [19:18:00] but this was when neon was starved of I/O because of gmetad running there for no good reason [19:18:05] it doesn't seem to have made a difference now [19:19:32] so we call check_ganglia 119 times [19:19:42] each of those instances fetches a 34MB XML from the aggregator [19:19:46] every five minutes [19:19:47] lolol [19:20:04] (and each fetch contains all the data the other fetches need!) [19:20:18] right [19:20:19] but [19:20:26] check_ganglia actually has code for this to /not/ happen [19:20:42] and this is a misconfiguration on our part [19:20:48] oh? 
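
The scoped query paravoid describes here (gmetad's interactive interface takes a "/$cluster/$host" path) can be reproduced by hand, as he goes on to show with netcat below. A rough Python equivalent; the host, port and cluster/host names are simply the ones quoted in this conversation:

    import socket

    def gmetad_query(path, host="uranium.wikimedia.org", port=8654):
        """Fetch ganglia XML for a gmetad query path like '/<cluster>/<host>'."""
        chunks = []
        with socket.create_connection((host, port), timeout=10) as sock:
            sock.sendall(path.encode() + b"\n")
            while True:
                data = sock.recv(65536)
                if not data:
                    break
                chunks.append(data)
        return b"".join(chunks)

    # '//<host>' walks every cluster (tens of MB of XML per call); scoping the
    # query to one cluster returns only that host's metrics.
    xml = gmetad_query("/Text caches esams/amssq31.esams.wmnet")
    print(len(xml))
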
[19:20:48] we are not passing cluster (-C) [19:20:59] so [19:21:05] gmetad has a query interface [19:21:11] check_ganglia can use this, with -q [19:21:18] the query interface is basically "/$cluster/$host" [19:21:38] # echo '//amssq31.esams.wmnet' | nc uranium.wikimedia.org 8654 |wc -c [19:21:41] 34955906 [19:21:42] # echo '/Text caches esams/amssq31.esams.wmnet' | nc uranium.wikimedia.org 8654 |wc -c [19:21:45] 172675 [19:23:07] command_line $USER1$/check_ganglia -q -g $ARG1$ -p $ARG2$ -H $ARG3$ -m '$ARG4$' -w '$ARG5$' -c '$ARG6$' -C '$ARG7$' [19:23:22] check_command => "check_ganglia!${gmetad_host}!${gmetad_query_port}!${metric_host}!${metric}!${warning}!${critical}!${::ganglia::cname}", [19:23:44] and that doesn't work [19:24:11] akosiaris: still around? [19:25:06] ha [19:25:22] it works where ganglia is included [19:25:25] but not where ganglia_new is [19:25:49] which is all of esams, i.e. 52 out of 119 [19:25:52] subtle isn't it! [19:26:21] does this explain why otrs monitoring broke randomly? [19:26:52] no [19:26:54] ok [19:27:55] ottomata: ping [19:28:08] so many things going wrong [19:29:26] (03PS1) 10Jhobs: Enable $wgAllowSiteCSSOnRestrictedPages for zero [mediawiki-config] - 10https://gerrit.wikimedia.org/r/182227 [19:30:16] gwicke: hiya, what's up? (i'm about to quit working for the day, working a half day and hanging with fam) [19:30:18] (03PS1) 10Cmjohnson: Removing production dns entries for dickson ipv4 and ipv6 adding reverse for mgmt [dns] - 10https://gerrit.wikimedia.org/r/182228 [19:31:08] ottomata: oh, okay - just a quick question: do you know if the varnishkafka stuff is doing anything complicated beyond enqueuing the basic request log (as a squid log line?) [19:31:37] I was wondering about feeding into the same kafka cluster from node [19:32:02] gwicke: it speaks the kafka api [19:32:09] / protocol [19:32:21] so, yeah, it kinda is, but is uses librdkafka to do so [19:32:22] yeah, I'm mainly wondering about the messages [19:32:29] you want to produce to kafka? [19:32:44] (03CR) 10Faidon Liambotis: [C: 04-1] "Why?" [dns] - 10https://gerrit.wikimedia.org/r/182228 (owner: 10Cmjohnson) [19:32:46] if the messages are easy to replicate, then yes that could be an option [19:33:01] you would just need a producer in whatever lang you are using [19:33:10] that's not a problem [19:33:13] https://cwiki.apache.org/confluence/display/KAFKA/Clients#Clients-Node.js [19:33:37] it's more if the message format is a moving target or requires info that we'd have trouble to come by [19:34:14] (03CR) 10Cmjohnson: "Are we not getting rid of Dickson as an IRC server?" [dns] - 10https://gerrit.wikimedia.org/r/182228 (owner: 10Cmjohnson) [19:36:36] ottomata: looking at https://github.com/wikimedia/varnishkafka [19:36:55] are we using json output? [19:37:02] quick question on wikibugs -- should all ops-* project tasks be reported here? [19:37:10] yes [19:37:14] gwicke, ohoh, i saw your email [19:37:22] this is for http request logging to restbase? [19:37:32] from restbase, potentially [19:37:44] if you want to log from restbase, i doubt you would use varnishkafka [19:37:57] i you want to log http requests to restbase, varnishkafka is probably a good idea [19:38:00] I'm not saying that we should do this, but the complexity of nginx + varnish left me wondering if we could just use plain node [19:38:06] a [19:38:06] ah [19:38:10] (03CR) 10Faidon Liambotis: "Not as far as I know. Is there a ticket for this?" 
[dns] - 10https://gerrit.wikimedia.org/r/182228 (owner: 10Cmjohnson) [19:38:14] sure, don't see why not, it is just json [19:38:19] so you could code something to format the messages in json [19:38:20] but [19:38:21] hm [19:38:39] the trouble is, if you want the restbase logs to go in with the regular webrequest logs [19:38:43] and you use node to do that [19:38:49] we'd have to maintain two systems [19:38:57] varnishkafka is very flexible with how the json format is done [19:39:11] yeah, that is my main worry [19:39:30] are we changing that a lot? [19:39:41] (03PS1) 10Faidon Liambotis: ganglia: export $cname when ganglia_new is used [puppet] - 10https://gerrit.wikimedia.org/r/182233 [19:39:43] bblack: ^ [19:39:44] not often [19:39:47] bblack: I wonder if this will work... [19:39:51] but i'd like to reserve the ability to do so [19:40:04] paravoid: are you around next week? if so do you have a couple hours to work on frack/codfw firewall config? [19:40:04] *nod* [19:40:14] ottomata: a json schema could check for compatibility [19:40:23] we could hook that up in CI [19:40:47] Jeff_Green: I am around next week, couple of hours is probably fine but more I don't know... [19:40:55] gwicke: aye, sounds possible [19:41:02] paravoid: ok. [19:41:12] gwicke: coudln't you just put restbase behind misc web lb varnish? [19:41:19] ottomata: is there an example for current json output anywhere? [19:41:36] hm [19:41:41] (03CR) 10Faidon Liambotis: [C: 032] "Let's see if this will work." [puppet] - 10https://gerrit.wikimedia.org/r/182233 (owner: 10Faidon Liambotis) [19:41:50] gwicke: i can get you one [19:42:11] https://wikitech.wikimedia.org/wiki/Analytics/Cluster/ETL#varnishkafka.conf [19:42:30] proposed ops channel filter: https://gerrit.wikimedia.org/r/#/c/182235/3/channels.yaml [19:42:49] (03Abandoned) 10Cmjohnson: Removing production dns entries for dickson ipv4 and ipv6 adding reverse for mgmt [dns] - 10https://gerrit.wikimedia.org/r/182228 (owner: 10Cmjohnson) [19:42:57] gwicke: https://gist.github.com/ottomata/fd146bc030e4e64c8d57 [19:43:10] ottomata: merci beaucoup! [19:43:27] k, gonna sign off, laters all! [19:43:34] ottomata: btw [19:43:38] before you go [19:43:39] yes? [19:43:43] see the commit message of the above commit [19:43:49] (03PS2) 10Florianschmidtwelzow: Hygiene: Move wgMFAnonymousEditing to InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/182175 [19:43:50] this probably has something to do with all those esams (null)s [19:44:32] aahhhhHHH! [19:44:34] good find. [19:44:48] cool, welp thanks. hopefully i'll decomission check_ganglia in the next week or so anyway [19:46:19] paravoid, bblack: do we have the ability to spin up new prod machines with jessie already? [19:46:27] probably, almost [19:46:54] there's a couple of minor things left in my TODO before I announce it [19:47:04] - check_command check_ganglia!uranium.wikimedia.org!8654!amssq31.esams.wmnet!kafka.varnishkafka.kafka_drerr.per_second!0.1:29.9!30.0! [19:47:07] + check_command check_ganglia!uranium.wikimedia.org!8654!amssq31.esams.wmnet!kafka.varnishkafka.kafka_drerr.per_second!0.1:29.9!30.0!Text caches esams [19:47:11] awesome [19:47:53] paravoid: cool! Do you think it'll be realistic to spin one up ~2 weeks from now? 
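
Since the varnishkafka output discussed above is plain JSON, producing compatible records from restbase (or anything else) only needs a Kafka client plus the same field names, and the schema check gwicke suggests wiring into CI is a few lines. A hedged Python sketch of both ideas; every field name, hostname and the topic here are illustrative only -- the authoritative field list is whatever %{...} tags the varnishkafka.conf format string (linked above) defines:

    import json
    from jsonschema import validate          # pip install jsonschema
    from kafka import KafkaProducer          # pip install kafka-python

    # Illustrative subset of a webrequest-style record.
    schema = {
        "type": "object",
        "required": ["hostname", "dt", "http_status", "uri_host", "uri_path"],
        "properties": {
            "hostname":    {"type": "string"},
            "dt":          {"type": "string"},
            "http_status": {"type": "string"},
            "uri_host":    {"type": "string"},
            "uri_path":    {"type": "string"},
        },
    }

    record = {
        "hostname": "restbase-test.example",                 # hypothetical host
        "dt": "2014-12-30T19:40:00",
        "http_status": "200",
        "uri_host": "rest.wikimedia.org",
        "uri_path": "/en.wikipedia.org/v1/page/html/Foo",    # example path
    }

    validate(record, schema)   # the compatibility check that CI could run

    producer = KafkaProducer(bootstrap_servers="kafka1001.example:9092")   # hypothetical broker
    producer.send("webrequest_misc", json.dumps(record).encode("utf-8"))   # hypothetical topic
    producer.flush()

The same record layout could just as easily be produced from a node client; the point is only that the format is a moving target unless a schema like the one above is versioned alongside varnishkafka.conf.
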
[19:47:59] yes [19:48:12] although note that jessie has not been released yet [19:48:20] and there's still bugs to be fixed (and found) [19:48:25] yeah, but it works on my laptop™ [19:48:34] so, it depends really on what kind of service do you want to run on it [19:48:46] that would be nginx [19:48:58] for what? [19:49:12] tls termination [19:49:16] for where? [19:49:26] and spdy [19:49:31] we need to make a custom nginx for jessie anyways I think [19:49:51] the outstanding question is whether we're required to port the udplog stuff or not [19:50:05] paravoid: we could perhaps use restbase as a low-traffic test [19:51:02] we certainly don't need udplog there [19:52:24] you need varnish there too don't you? [19:52:58] so even if we go ahead with your plan (which I don't think we want to), we need to either build varnish3 (for jessie) or nginx 1.6 (for trusty) [19:53:04] it's not clear if we *need* it for perf [19:53:17] but in any case, we could install varnish on another box [19:53:37] if we can't make it work with jessie for now [19:53:40] it's nice if we keep our endpoints standard [19:53:59] http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&h=neon.wikimedia.org&m=cpu_report&s=descending&mc=2&g=network_report&c=Miscellaneous+eqiad [19:54:13] (nginx for tls[/spdy], varnish for http and as local nginx destination) [19:54:23] I'm actually curious how restbase with direct spdy support compares with nginx + varnish + restbase [19:54:44] especially with varnish being two varnishes really [19:54:54] that's a lot of hops [19:55:00] it's just more attack surface for new protocol implementations if we also put restbase's spdy out there for a direct public hit [19:55:26] yeah, agreed on that [19:55:30] you mean terminating SSL/SPDY within node?? [19:55:39] yup [19:55:55] I haven't benchmarked it yet, but might be worth doing [19:56:24] the author of https://github.com/indutny/node-spdy knows what he's doing [19:56:37] even if perf is acceptable, doesn't sound great from a security perspective [19:56:58] it's also not very maintainable, esp. since those things seem to moving very quickly these days [19:57:02] spdy versions, ciphers etc. [19:57:17] but nginx->node (sans varnish) might be a good tradeoff [19:57:44] *nod* [19:58:00] or a single-layer varnish only [19:59:06] the parsoid varnishes don't perform too well right now, but that might also have to do with the large caches they maintain [19:59:10] bblack: see the ganglia graph above [19:59:30] nice [19:59:41] they're still not done updating to the new commandlines completely, but it's already better [19:59:52] I salt'ed it [19:59:57] ah I see [20:01:18] paravoid, bblack: what's your take on spinning up a jessie node once that's ready & giving nginx 1.6 a spin on that as the potential services front-end? [20:02:06] I'd be open to it but don't quote me on this just yet :) [20:02:20] well I've been kinda watching/waiting on what happens with our jessie stuff, but that's kinda why I stopped converting varnishes to trusty [20:02:36] bblack: I'll send a mail next week probably [20:02:47] I think if jessie's ready soon, the currenty trusty prod test box will just become a jessie one and we'll go straight to jessie from there, imho [20:02:57] there's two small things left, I'd be done if it wasn't for this whole neon craziness [20:03:06] but also http://nthykier.wordpress.com/2014/12/30/status-on-jessie-december-2014/ [20:03:13] paravoid, bblack: that sounds great! 
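
The "nginx for TLS/SPDY, local backend for HTTP" split being weighed above boils down to a small nginx 1.6 server block; a minimal sketch, with placeholder certificate paths, server name and backend port (and assuming an nginx build with the SPDY module):

    server {
        listen 443 ssl spdy;                  # SPDY needs nginx built with the spdy module
        server_name rest.wikimedia.org;

        ssl_certificate     /etc/ssl/localcerts/unified.chained.pem;   # placeholder
        ssl_certificate_key /etc/ssl/private/unified.key;              # placeholder

        location / {
            # restbase (or a local varnish in front of it); port is a placeholder
            proxy_pass http://127.0.0.1:7231;
            proxy_set_header Host $host;
            proxy_set_header X-Forwarded-For $remote_addr;
        }
    }

Dropping the varnish layer just means pointing proxy_pass at restbase directly, which is the "nginx->node (sans varnish)" trade-off mentioned above.
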
[20:03:26] apt under puppet was broken until a week ago [20:03:30] https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=772641 [20:04:25] Jan 5th is the next freeze milestone, I think we can slowly start deploying in prod, but carefully [20:04:47] paravoid: FYI I think I'm gonna push out two new gdnsd releases today. a 2.2.0 with the current master features stuff (e.g. geoip2), and a 2.1.1 with just the important+simple bugfixes backported (on a side branch like 1.x was at the end) [20:04:49] I was thinking of starting with mostly-internal boxes [20:05:56] restbase is still experimental, and our main users will be all internal [20:06:39] http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&h=neon.wikimedia.org&m=cpu_report&s=descending&mc=2&g=network_report&c=Miscellaneous+eqiad [20:06:42] \o/ [20:06:50] no memory spikes either [20:07:03] http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&h=neon.wikimedia.org&m=cpu_report&s=descending&mc=2&g=mem_report&c=Miscellaneous+eqiad [20:07:16] (and then 2.3.0 is going to be mostly the re-re-re-factor of all the daemonization/control stuff to do everything automagically smoothly even under systemd, + API fixups for Dyn) [20:07:29] paravoid: that's awesome :) [20:09:09] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [500.0] [20:10:29] PROBLEM - puppet last run on cp3020 is CRITICAL: CRITICAL: Puppet has 1 failures [20:11:08] PROBLEM - puppet last run on amssq49 is CRITICAL: CRITICAL: Puppet has 1 failures [20:11:39] esams packet loss [20:11:41] grumble [20:11:46] http://smokeping.wikimedia.org/?displaymode=n;start=2014-12-30%2017:11;end=now;target=ESAMS.Core.cr1-esams [20:11:47] PROBLEM - puppet last run on cp3008 is CRITICAL: CRITICAL: Puppet has 1 failures [20:13:05] be ncie when we get all those new redundant links going [20:13:07] *nice [20:13:19] we still haven't approached redundant links for esams though :( [20:13:28] we're not very close to that [20:13:29] oh I thought that was in the new plan too? [20:13:31] 6-12 months I'd say [20:13:34] well, it is, sure [20:13:52] the plan is to get an esams-eqord link at some point [20:13:59] but we don't have eqord just yet, so... :) [20:14:07] RECOVERY - puppet last run on amssq49 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [20:14:25] also replacing the esams-eqiad link would also be good, this hasn't worked very well [20:14:56] we have some quotes for an unprotected wave which wouldn't have packet loss issues/non-guranteed latency too [20:14:59] but it's too expensive :/ [20:21:19] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [20:25:19] RECOVERY - puppet last run on cp3008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:25:40] RECOVERY - puppet last run on cp3020 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:25:49] greg-g_, any objections to a minor config zero-portal-only depl? 
https://gerrit.wikimedia.org/r/#/c/182227/ [20:26:21] MaxSem, ^ [20:29:37] (03PS1) 10Legoktm: Undeploy OpenSearchXml [mediawiki-config] - 10https://gerrit.wikimedia.org/r/182288 [20:41:43] yurikR: probably are objections :/ [20:41:52] until after new years [20:42:22] aude, its a very minor and zeroportal specific, and we are trynig to launch it asap, this is one of the few blockers ( [20:43:43] * aude nods [20:43:43] (03CR) 10Yurik: [C: 032] Enable $wgAllowSiteCSSOnRestrictedPages for zero [mediawiki-config] - 10https://gerrit.wikimedia.org/r/182227 (owner: 10Jhobs) [20:43:47] (03Merged) 10jenkins-bot: Enable $wgAllowSiteCSSOnRestrictedPages for zero [mediawiki-config] - 10https://gerrit.wikimedia.org/r/182227 (owner: 10Jhobs) [20:46:01] !log yurik Synchronized wmf-config/CommonSettings.php: ZeroPortal 182227 (duration: 00m 06s) [20:46:10] Logged the message, Master [20:47:05] I'm going to restart icinga [20:55:44] PROBLEM - puppet last run on amslvs3 is CRITICAL: CRITICAL: puppet fail [20:55:44] PROBLEM - puppet last run on amssq41 is CRITICAL: CRITICAL: Puppet has 1 failures [20:55:54] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [500.0] [20:56:04] PROBLEM - puppet last run on amssq62 is CRITICAL: CRITICAL: Puppet has 2 failures [21:09:52] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [21:12:32] RECOVERY - puppet last run on amslvs3 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [21:12:42] RECOVERY - puppet last run on amssq62 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:13:23] RECOVERY - puppet last run on amssq41 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [21:44:32] PROBLEM - ElasticSearch health check for shards on logstash1001 is CRITICAL: CRITICAL - elasticsearch inactive shards 67 threshold =0.1% breach: {ustatus: uyellow, unumber_of_nodes: 2, uunassigned_shards: 65, utimed_out: False, uactive_primary_shards: 40, ucluster_name: uproduction-logstash-eqiad, urelocating_shards: 0, uactive_shards: 53, uinitializing_shards: 2, unumber_of_data_nodes: 2} [21:44:52] * bd808 grumbles [21:46:03] PROBLEM - ElasticSearch health check for shards on logstash1003 is CRITICAL: CRITICAL - elasticsearch inactive shards 67 threshold =0.1% breach: {ustatus: uyellow, unumber_of_nodes: 2, uunassigned_shards: 65, utimed_out: False, uactive_primary_shards: 40, ucluster_name: uproduction-logstash-eqiad, urelocating_shards: 0, uactive_shards: 53, uinitializing_shards: 2, unumber_of_data_nodes: 2} [21:46:21] PROBLEM - ElasticSearch health check for shards on logstash1002 is CRITICAL: CRITICAL - elasticsearch http://10.64.32.137:9200/_cluster/health error while fetching: Request timed out. 
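
For reference, the "inactive shards 67" figure in the alerts above is just unassigned + initializing shards (65 + 2) from the cluster health API, compared against the total as a percentage. A quick Python sketch of the same arithmetic (the FQDN is assumed):

    import requests   # pip install requests

    def inactive_shards(host):
        """Roughly what the icinga shard check above reports."""
        health = requests.get("http://%s:9200/_cluster/health" % host, timeout=10).json()
        inactive = health["unassigned_shards"] + health["initializing_shards"]
        total = inactive + health["active_shards"] + health["relocating_shards"]
        return health["status"], inactive, 100.0 * inactive / total

    status, inactive, pct = inactive_shards("logstash1001.eqiad.wmnet")  # hostname assumed
    print(status, inactive, pct)   # e.g. yellow, 67, ~55.8% -- 65 unassigned + 2 initializing
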
[21:49:49] logstash1001 has logged that it can't see logstash1002 any more [21:52:43] !log restarted elasticsearch on logstash1002; it had dropped from the cluster [21:52:49] Logged the message, Master [23:05:33] PROBLEM - graphite.wikimedia.org on tungsten is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 525 bytes in 0.005 second response time [23:08:33] PROBLEM - RAID on ms-be2003 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) [23:11:02] RECOVERY - graphite.wikimedia.org on tungsten is OK: HTTP OK: HTTP/1.1 200 OK - 1607 bytes in 0.203 second response time [23:13:33] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 13.33% of data above the critical threshold [500.0] [23:14:41] PROBLEM - puppet last run on ms-be2003 is CRITICAL: CRITICAL: Puppet has 1 failures [23:15:26] bblack: 22s to run against esams :( [23:15:32] oh wait, wrong window [23:16:51] PROBLEM - puppet last run on ms-fe3002 is CRITICAL: CRITICAL: Puppet has 1 failures [23:16:51] PROBLEM - puppet last run on amssq55 is CRITICAL: CRITICAL: Puppet has 1 failures [23:24:52] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [23:28:02] RECOVERY - puppet last run on amssq55 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [23:30:51] RECOVERY - puppet last run on ms-fe3002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [23:33:26] (03PS1) 10Faidon Liambotis: nagios: restructure check_ssl & misc fixes [puppet] - 10https://gerrit.wikimedia.org/r/182303 [23:33:28] (03PS1) 10Faidon Liambotis: nagios: add --no-sni to check_ssl to disable SNI [puppet] - 10https://gerrit.wikimedia.org/r/182304 [23:33:30] (03PS1) 10Faidon Liambotis: nagios: introduce new check_sslxNN check [puppet] - 10https://gerrit.wikimedia.org/r/182305 [23:33:32] (03PS1) 10Faidon Liambotis: nagios: add no-SNI mode to check_sslxNN [puppet] - 10https://gerrit.wikimedia.org/r/182306 [23:33:59] perl review anyone [23:42:21] (03PS2) 10Faidon Liambotis: nagios: restructure check_ssl & misc fixes [puppet] - 10https://gerrit.wikimedia.org/r/182303 [23:42:23] (03PS2) 10Faidon Liambotis: nagios: add --no-sni to check_ssl to disable SNI [puppet] - 10https://gerrit.wikimedia.org/r/182304 [23:42:25] (03PS2) 10Faidon Liambotis: nagios: introduce a new check_sslxNN check [puppet] - 10https://gerrit.wikimedia.org/r/182305 [23:42:27] (03PS2) 10Faidon Liambotis: nagios: add no-SNI mode to check_sslxNN [puppet] - 10https://gerrit.wikimedia.org/r/182306 [23:42:33] (03CR) 10BBlack: [C: 031] nagios: restructure check_ssl & misc fixes [puppet] - 10https://gerrit.wikimedia.org/r/182303 (owner: 10Faidon Liambotis) [23:42:39] ooh thanks [23:42:52] (03CR) 10Faidon Liambotis: [C: 032] "Tested." [puppet] - 10https://gerrit.wikimedia.org/r/182303 (owner: 10Faidon Liambotis) [23:44:01] (03CR) 10BBlack: [C: 031] nagios: add --no-sni to check_ssl to disable SNI [puppet] - 10https://gerrit.wikimedia.org/r/182304 (owner: 10Faidon Liambotis) [23:44:13] (03CR) 10Faidon Liambotis: [C: 032] nagios: add --no-sni to check_ssl to disable SNI [puppet] - 10https://gerrit.wikimedia.org/r/182304 (owner: 10Faidon Liambotis) [23:46:28] paravoid: so with the my $e = Local::CheckSSL->run() and one check re-using the module of the other... does icinga pre-load all epn perl modules before running any of them or something? 
[23:46:43] I guess I never thought about it, and figured it loaded them as it encountered them to run them [23:46:53] there's a require above [23:47:01] and I'm not using the ePN yet [23:47:08] ah right, ok [23:47:17] well, I tried it [23:47:24] with check_ssl (not xNN) [23:47:33] 1-min load went to 25-30 [23:47:49] but everytime I wrote to the file, the check started failing [23:48:09] it tried to load it again and it clashed with what was already loaded [23:48:23] so I'm not sure I should really rely on ePN [23:51:12] (03CR) 10BBlack: [C: 031] nagios: introduce a new check_sslxNN check [puppet] - 10https://gerrit.wikimedia.org/r/182305 (owner: 10Faidon Liambotis) [23:51:26] I think half of the issue is esams [23:51:41] all the roundtrips to esams are killing us [23:52:53] (03CR) 10BBlack: [C: 031] nagios: add no-SNI mode to check_sslxNN [puppet] - 10https://gerrit.wikimedia.org/r/182306 (owner: 10Faidon Liambotis) [23:53:07] note I didn't run any of the code, I just looked it over manually :) [23:53:51] so long as you're non-epn, I bet you could fork children for the 72 checks, do them in batches of N in parallel, to cut down on all the serial RTT [23:54:09] (fork within perl I mean, and gather results over a pipe or whatever) [23:55:04] or alternatively, just split it up a little more (e.g. run xNN separately at the icinga level for sni and non-sni and cut the time in half) [23:55:23] hmm [23:55:27] the 22s could be problematic if it grows into a failing timeout on check execution under adverse conditions [23:57:09] well, the other way to see it is that this may need yet another rewrite :) [23:57:27] fetching the unified since should be enough [23:57:55] you can check validity once, then check all CNs against the SANs once [23:57:58] without refetching it [23:58:10] that leaves a window of a strange misconfiguration, though [23:58:26] where nginx doesn't serve you the unified for some domain for some reason [23:59:50] (03PS1) 10Faidon Liambotis: cache: use check_sslxNN instead of NN x check_ssl [puppet] - 10https://gerrit.wikimedia.org/r/182326
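
The rewrite paravoid sketches at the end -- fetch the unified certificate once, then check every expected name against its SANs locally instead of making 72 round trips to esams -- would look roughly like this (Python rather than the Perl the plugin is written in; hostnames are only examples, and wildcard-aware matching is left out for brevity):

    import socket, ssl

    def fetch_sans(host, port=443):
        """Fetch the server's cert once and return its subjectAltName DNS entries."""
        ctx = ssl.create_default_context()
        with socket.create_connection((host, port), timeout=10) as sock:
            with ctx.wrap_socket(sock, server_hostname=host) as tls:
                # getpeercert() also verifies the chain and expiry against system CAs
                cert = tls.getpeercert()
        return {name for typ, name in cert.get("subjectAltName", ()) if typ == "DNS"}

    # One TLS handshake, then every CN is checked without another round trip.
    sans = fetch_sans("text-lb.esams.wikimedia.org")
    for cn in ("en.wikipedia.org", "*.wikipedia.org", "commons.wikimedia.org"):
        print(cn, "OK" if cn in sans else "MISSING")

As noted above this leaves a small window where nginx might serve a different certificate for some SNI name, so a per-name spot check (or bblack's batched parallel fetches) would still be worth keeping for the no-SNI case.
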