[00:00:53] RECOVERY - mailman on sodium is OK: PROCS OK: 10 processes with args mailman [00:00:57] there we go [00:01:00] welcome back [00:01:02] switched now matanya [00:01:14] that's the new thing,thx [00:04:05] !log ori rebuilt wikiversions.cdb and synchronized wikiversions files: [00:04:07] !log ori finished scap: (no message) (duration: 29m 26s) [00:04:11] Logged the message, Master [00:04:19] Logged the message, Master [00:04:51] done for now [00:07:35] i'm going to sleep. night folks [00:08:55] matanya: night. cya [00:20:07] (03PS2) 10Dzahn: emery: remove last log before decom [operations/puppet] - 10https://gerrit.wikimedia.org/r/110394 (owner: 10Matanya) [00:24:22] why do our mysql parser cache nodes not use innodb_file_per_table? [00:24:47] (03CR) 10Dzahn: [C: 031] "ottomata: "as well as being left on emery"? matanya said it the box is ready for shutdown, true or do you need to copy anything? also re: " [operations/puppet] - 10https://gerrit.wikimedia.org/r/110394 (owner: 10Matanya) [00:30:39] (03CR) 10Dzahn: "well, nice that it's merged and all but now there are still just some FIXMEs in there, and we need to know stuff like if "check_stomp" is " [operations/puppet] - 10https://gerrit.wikimedia.org/r/109655 (owner: 10Dzahn) [00:33:02] Jeff_Green: "check_stomp.pl" or is it dead? it was a check on erzurumi, what is the fix: replace hostname or remove check entirely [00:33:16] was added back in RT 703 :p [00:33:39] (03PS2) 10TTO: Add variant rewrites for zhwikivoyage [operations/apache-config] - 10https://gerrit.wikimedia.org/r/110155 [00:33:42] #703: Mysterious Nagios error for erzurumi ,hehe [00:34:36] http://search.cpan.org/~lbrocard/Net-Stomp-0.32/lib/Net/Stomp.pm [00:35:04] (03CR) 10TTO: "zh-mo and zh-my now are added." [operations/apache-config] - 10https://gerrit.wikimedia.org/r/110155 (owner: 10TTO) [00:35:14] mutante: stomp is some archaic monitoring setup for polling activemq when it ran on erzurumi [00:35:21] mutante: you can remove it entirely. 
we monitor that from inside frack now--there's a job that runs on silicon which reports by nsca instead of nrpe [00:35:24] i had that conversation with Jeff a few days ago [00:35:32] and there he is [00:35:34] cmjohnson1: yes, ActiveMQ, nod [00:35:38] * Jeff_Green runs away again [00:35:40] Jeff_Green: ok, thanks [00:36:07] i'll do that and then kill erzurumi [00:37:30] !log ebernhardson synchronized php-1.23wmf12/extensions/Echo/ 'Update echo for Special:Notifications fix' [00:37:37] Logged the message, Master [00:38:08] !log ebernhardson synchronized php-1.23wmf12/extensions/Flow/ 'Update flow for Special:Notifications fix' [00:38:16] Logged the message, Master [00:43:41] !log ebernhardson synchronized php-1.23wmf11/extensions/Echo/ 'Update echo for Special:Notifications fix' [00:43:48] Logged the message, Master [00:44:21] !log ebernhardson synchronized php-1.23wmf11/extensions/Flow/ 'Update flow for Special:Notifications fix' [00:44:29] Logged the message, Master [00:44:36] !log finished deploying Special:Notifications fix to Echo and Flow [00:44:44] Logged the message, Master [00:51:29] (03PS1) 10Dzahn: remove old erzurumi ActiveMQ monitoring [operations/puppet] - 10https://gerrit.wikimedia.org/r/111677 [00:52:58] (03PS2) 10Dzahn: remove old erzurumi ActiveMQ monitoring [operations/puppet] - 10https://gerrit.wikimedia.org/r/111677 [00:54:03] going to rerun scap [00:58:19] !log ori started scap: no-diff scap to test script changes [00:58:26] Logged the message, Master [00:59:14] (03CR) 10Dzahn: [C: 032] "< Jeff_Green> mutante: you can remove it entirely. 
we monitor that from inside frack now--there's a job that runs on silicon which reports" [operations/puppet] - 10https://gerrit.wikimedia.org/r/111677 (owner: 10Dzahn) [01:00:53] !log ori finished scap: no-diff scap to test script changes (duration: 02m 34s) [01:01:01] Logged the message, Master [01:01:11] woot :) [01:02:11] * bd808 wonders how a scap finished in 2.5m [01:02:38] --versions=1.23wmf10 [01:02:41] and no changes to push out [01:02:56] and more rsync servers than just tin [01:06:35] !log apt-get remove libnet-stomp-perl on neon, i just removed that from puppet but didn't think it should stay in as an "absent" package forever [01:06:44] Logged the message, Master [01:08:16] (03CR) 10Dzahn: "package removed from neon manually, we'll likely never use this again and just a single host" [operations/puppet] - 10https://gerrit.wikimedia.org/r/111677 (owner: 10Dzahn) [01:27:46] (03CR) 10Dzahn: [C: 031] "so, donate-lb doesn't need 2 MX's anymore? could fundraising people actually make/merge fundraising changes? 
(one of the 2 is enough), thx" [operations/dns] - 10https://gerrit.wikimedia.org/r/111621 (owner: 10Dzahn) [01:45:43] (03Abandoned) 10Dzahn: remove erzurumi from DNS [operations/dns] - 10https://gerrit.wikimedia.org/r/109656 (owner: 10Dzahn) [01:47:21] (03CR) 10Dzahn: "sigh, duplicates that are also linked on the tickets btw" [operations/dns] - 10https://gerrit.wikimedia.org/r/109656 (owner: 10Dzahn) [01:48:29] (03CR) 10Dzahn: "i don't think anymore cares if people want to replace dsh completely" [operations/puppet] - 10https://gerrit.wikimedia.org/r/96413 (owner: 10Dzahn) [01:49:24] (03Abandoned) 10Dzahn: move dsh to module [operations/puppet] - 10https://gerrit.wikimedia.org/r/96413 (owner: 10Dzahn) [01:59:33] (03PS1) 10Springle: s2 repool db1034, warm up [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/111699 [02:00:06] (03CR) 10Springle: [C: 032] s2 repool db1034, warm up [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/111699 (owner: 10Springle) [02:00:12] (03Merged) 10jenkins-bot: s2 repool db1034, warm up [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/111699 (owner: 10Springle) [02:01:34] !log springle synchronized wmf-config/db-eqiad.php 's2 repool db1034 warm up' [02:01:42] Logged the message, Master [02:14:32] (03PS4) 10Dzahn: linting openstack, quoting, arrows [operations/puppet] - 10https://gerrit.wikimedia.org/r/109295 [02:15:53] (03Abandoned) 10Dzahn: linting openstack, quoting, arrows [operations/puppet] - 10https://gerrit.wikimedia.org/r/109295 (owner: 10Dzahn) [02:21:38] (03PS1) 10Springle: s2 depool db1009 schema changes [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/111705 [02:21:47] !log LocalisationUpdate completed (1.23wmf12) at 2014-02-06 02:21:47+00:00 [02:21:57] Logged the message, Master [02:22:08] (03CR) 10Springle: [C: 032] s2 depool db1009 schema changes [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/111705 (owner: 10Springle) [02:22:14] (03Merged) 
10jenkins-bot: s2 depool db1009 schema changes [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/111705 (owner: 10Springle) [02:22:32] (03CR) 10Dzahn: [C: 031] sudoers: remove two files, seems not to be used anywhere [operations/puppet] - 10https://gerrit.wikimedia.org/r/111444 (owner: 10Matanya) [02:23:08] !log springle synchronized wmf-config/db-eqiad.php 's2 depool db1009 schema changes' [02:23:16] Logged the message, Master [02:23:36] (03CR) 10Dzahn: "nagios and rainman, cough:) pretty sure this can go, just can't submit in this state and nrpe_fundraising must be from Jeff" [operations/puppet] - 10https://gerrit.wikimedia.org/r/111444 (owner: 10Matanya) [02:24:08] (03PS2) 10Matanya: sudoers: remove two files, seems not to be used anywhere [operations/puppet] - 10https://gerrit.wikimedia.org/r/111444 [02:24:27] (03CR) 10Dzahn: "rebase on a change that just does "D" on files? :p" [operations/puppet] - 10https://gerrit.wikimedia.org/r/111444 (owner: 10Matanya) [02:42:19] (03PS1) 10Springle: clean up s2 and s4 [operations/puppet] - 10https://gerrit.wikimedia.org/r/111709 [02:44:10] (03CR) 10Springle: [C: 032] clean up s2 and s4 [operations/puppet] - 10https://gerrit.wikimedia.org/r/111709 (owner: 10Springle) [02:44:10] !log LocalisationUpdate completed (1.23wmf11) at 2014-02-06 02:44:10+00:00 [02:44:19] Logged the message, Master [02:45:34] !log xtrabackup clone db1034 to db1009 [02:45:41] Logged the message, Master [02:57:45] (03CR) 10Byfserag: "hmm, where is jenkins-bot?" [operations/apache-config] - 10https://gerrit.wikimedia.org/r/110155 (owner: 10TTO) [03:05:51] (03PS1) 10Springle: Sideline db1034 for hardware checks RT 6783. Assign db1024 to s2 as replacement. [operations/puppet] - 10https://gerrit.wikimedia.org/r/111713 [03:07:27] (03CR) 10Springle: [C: 032] Sideline db1034 for hardware checks RT 6783. Assign db1024 to s2 as replacement. 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/111713 (owner: 10Springle) [03:10:11] (03PS1) 10BBlack: Move KH,MY,PH,SG,TW to ulsfo [operations/dns] - 10https://gerrit.wikimedia.org/r/111714 [03:10:28] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [03:10:44] !log xtrabackup clone db1018 to db1024 [03:10:47] (03CR) 10BBlack: [C: 032 V: 032] Move KH,MY,PH,SG,TW to ulsfo [operations/dns] - 10https://gerrit.wikimedia.org/r/111714 (owner: 10BBlack) [03:10:52] Logged the message, Master [03:22:26] !log LocalisationUpdate ResourceLoader cache refresh completed at 2014-02-06 03:22:26+00:00 [03:22:34] Logged the message, Master [03:27:33] Can I see the kafka broker config somewhere? [03:36:13] PROBLEM - mysqld processes on db1024 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [03:40:35] Snaps: sure, I can retrieve it. Just a second. [03:41:17] ori: I really just want the value of the "message.max.bytes" property :) [03:41:53] PROBLEM - Puppet freshness on db1024 is CRITICAL: Last successful Puppet run was Thu 06 Feb 2014 03:18:56 AM UTC [03:43:28] Snaps: on the bits caches at least it is not explicitly set [03:44:27] ori: okay, thanks :) [03:44:30] Snaps: the file is generated by puppet from a template; the output on bits is this config file: https://dpaste.de/3Dbb/raw [03:44:51] ah, thats for varnishkafka, I was talking about the kafka broker [03:45:02] ohhh, I misread. Sorry. Hang on [03:45:30] it runs on analytics1022.eqiad I think [03:47:37] it's not set [03:48:18] okay, using defaults. All I need to know, thank you ori :) [03:48:25] no problem [03:59:11] springle: Have you seen https://bugzilla.wikimedia.org/60907 (apparently replica s7, "ERROR 1548 (HY000) at line 1: Cannot load from mysql.proc. The table is probably corrupted")? [03:59:37] scfc_de: yes, i've seen it [03:59:51] (i'm assigned to it) [04:01:49] I did that, but that doesn't guarantee that it is noticed :-). 
[04:02:01] :) [04:08:23] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000 [04:17:00] going to run another no-op scap [04:17:44] noöp [04:17:50] :) [04:24:06] !log ori started scap: no new code. testing scap changes. [04:24:13] Logged the message, Master [04:24:31] Production is the best test environment. [04:24:35] Gloria: Hush. [04:28:42] !log ori finished scap: no new code. testing scap changes. (duration: 04m 35s) [04:28:50] Logged the message, Master [04:29:07] that was a successful run of all branches [04:29:49] bd808, pack your bags, we're going to vegas [04:30:23] the tampa app servers don't have a usable rsync server, they're all defaulting to tin [04:30:35] o_O [04:30:56] it was like that for all servers until a few hours ago [04:31:05] That was the perl bug? [04:31:27] yeah [04:31:33] Nice find [04:32:11] Why didn't anyone notice the network load on tin due to that? [04:32:19] they were gathering requirements [04:33:25] * bd808 dodges the jab [04:33:35] it was a joke :D [04:33:59] probably because no one was looking; i wasn't [04:34:26] what would you expect the outbound bandwidth to look like if things were working properly? it's not hard to estimate, but it takes some deliberate care. [04:34:51] Yeah. Or graphs with scap start/end lines on them too [04:35:22] well, yeah, you'd see it go up during scap, because that's tin's primary function [04:35:54] But it should go up only during fanout and then drop back down long before the end [04:35:59] I think [04:36:37] it's realllllly hard to read a real world graph and detect this sort of thing when you're not looking for it [04:37:27] we're also swallowing the output somewhere [04:37:36] console output, i mean [04:38:41] well, i suspect we are; i'm not certain. [04:38:46] The output from dsh? [04:40:36] Can scappy be tested in beta? At least the fundamental bits? That seems like a faster place to iterate on the little changes. [04:40:37] oh, hah. 
i don't think we were affected by that bug. [04:40:57] god damn it. that's hubris for you. [04:41:44] the python script reads the host files using re.findall(r'^\w+', hosts_file.read(), re.MULTILINE) [04:41:51] which doesn't match '.' [04:41:58] so it was truncating the domain name [04:42:54] but scap-proxies actually specifies nodes via the fully-qualified name, so they are pingable across DCs [04:44:14] it probably still bit us here and there because whenever pinging a host failed for whatever reason find-nearest-rsync would pick it as the proxy [04:44:36] so, back to your original question: [04:44:41] Why didn't anyone notice the network load on tin due to that? [04:44:47] probably there wasn't any [04:44:58] i'll go eat a hat now [04:45:23] Don't forget to have it toasted first [04:45:33] salted :) [04:48:11] * bd808 had to explain why "salted" was funny to his wife [05:08:49] !log redeploying parsoid/deploy on wtp* [05:08:57] Logged the message, Master [05:09:50] !log scratch that, will redploy parsoid/deploy in about an hour [05:09:57] Logged the message, Master [05:12:24] (03PS1) 10Springle: Reduce labsdb1003 mariadb global memory usage (buffer pool size) to allow for high per-thread usage. Kernel OOM killer hit instance on port 3306. [operations/puppet] - 10https://gerrit.wikimedia.org/r/111725 [05:14:10] (03CR) 10Springle: [C: 032] Reduce labsdb1003 mariadb global memory usage (buffer pool size) to allow for high per-thread usage. 
Kernel OOM killer hit instance on port [operations/puppet] - 10https://gerrit.wikimedia.org/r/111725 (owner: 10Springle) [05:15:52] !log restart labsdb1003 mariadb instances [05:16:00] Logged the message, Master [05:18:33] PROBLEM - mysqld processes on labsdb1003 is CRITICAL: PROCS CRITICAL: 2 processes with command name mysqld [05:23:33] RECOVERY - mysqld processes on labsdb1003 is OK: PROCS OK: 3 processes with command name mysqld [05:29:33] PROBLEM - Varnish HTCP daemon on cp1054 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:29:53] PROBLEM - Varnish traffic logger on cp1054 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:29:53] PROBLEM - Varnish HTTP text-backend on cp1054 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:35:39] null-scapping again [05:39:56] !log ori started scap: no new code. testing scap changes. (again.) [05:40:05] Logged the message, Master [05:40:38] what happened to cp1054? [05:44:09] (03PS1) 10Wpmirrordev: Extend maximum allowed mediawiki version to 1.23 [operations/dumps] (ariel) - 10https://gerrit.wikimedia.org/r/111728 [05:44:52] (03CR) 10Liangent: [C: 031] Add variant rewrites for zhwikivoyage [operations/apache-config] - 10https://gerrit.wikimedia.org/r/110155 (owner: 10TTO) [05:44:57] !log ori finished scap: no new code. testing scap changes. (again.) (duration: 05m 00s) [05:45:05] Logged the message, Master [05:45:33] wait time spiked on that varnish [05:48:35] !log varnish on cp1054: CPU wait spiked at 05:27. dmesg|tail: XFS: possible memory allocation deadlock in kmem_alloc. not investigating further. 
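Back on the host-file parsing quoted at 04:41: the truncation is easy to reproduce in isolation. This is a minimal sketch, assuming a hosts file with one name per line. The host names are made up for illustration, and the `[\w.-]+` pattern is only one possible alternative, not necessarily what the actual scap change used.

```python
import re

# Hypothetical hosts file contents; the names are illustrative only.
hosts_text = "mw1001.eqiad.example\nmw1002.eqiad.example\nlocalhost\n"

# Pattern quoted in the log: \w does not match '.', so each match
# stops at the first dot and the domain part is lost.
truncated = re.findall(r'^\w+', hosts_text, re.MULTILINE)

# One possible alternative (an assumption, not the actual fix):
# also accept '.' and '-' so fully-qualified names survive.
full = re.findall(r'^[\w.-]+', hosts_text, re.MULTILINE)

print(truncated)
print(full)
```

Short names such as `localhost` come out the same under both patterns, which may be part of why this kind of truncation goes unnoticed for a while.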
[05:48:42] Logged the message, Master [05:49:38] it needs to be rebooted, but there are plenty of other text varnishes that are doing fine, so i'm leaving it for someone in ops [05:56:27] (03PS1) 10Ori.livneh: replace scap bash script with equivalent python code [operations/puppet] - 10https://gerrit.wikimedia.org/r/111730 [06:05:30] (03CR) 10Ori.livneh: [C: 032] replace scap bash script with equivalent python code [operations/puppet] - 10https://gerrit.wikimedia.org/r/111730 (owner: 10Ori.livneh) [06:24:53] (03CR) 10MZMcBride: "Related: bug 27294" [operations/puppet] - 10https://gerrit.wikimedia.org/r/110904 (owner: 10Ori.livneh) [06:46:30] (03PS1) 10Springle: s2 pool db1024 warm up [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/111733 [06:46:54] (03CR) 10Springle: [C: 032] s2 pool db1024 warm up [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/111733 (owner: 10Springle) [06:47:00] (03Merged) 10jenkins-bot: s2 pool db1024 warm up [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/111733 (owner: 10Springle) [06:47:56] !log springle synchronized wmf-config/db-eqiad.php 's2 pool db1024 warm up' [06:48:03] Logged the message, Master [07:01:24] !log xtrabackup clone db1018 to db1009 (take #2) [07:01:32] Logged the message, Master [07:28:03] PROBLEM - Varnishkafka Delivery Errors on cp3020 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 206.433334 [07:29:12] !log redeploying parsoid/deploy on wtp* [07:29:13] PROBLEM - Varnishkafka Delivery Errors on cp3019 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 290.866669 [07:29:20] Logged the message, Master [07:29:23] PROBLEM - Varnishkafka Delivery Errors on cp3022 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 248.933334 [07:33:03] RECOVERY - Varnishkafka Delivery Errors on cp3020 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [07:34:23] RECOVERY - Varnishkafka Delivery Errors on cp3022 is OK: 
kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [07:35:13] RECOVERY - Varnishkafka Delivery Errors on cp3019 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [07:44:23] PROBLEM - Varnishkafka Delivery Errors on cp3022 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 2590.699951 [07:46:13] PROBLEM - Varnishkafka Delivery Errors on cp3019 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 954.766663 [07:48:03] PROBLEM - Varnishkafka Delivery Errors on cp3020 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 295.833344 [07:57:03] RECOVERY - Varnishkafka Delivery Errors on cp3020 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [08:00:03] PROBLEM - Varnishkafka Delivery Errors on cp3020 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 77.066666 [08:06:10] (03PS24) 10Matanya: site: lint [operations/puppet] - 10https://gerrit.wikimedia.org/r/109507 [08:09:16] (03CR) 10Byfserag: [C: 031] Add variant rewrites for zhwikivoyage [operations/apache-config] - 10https://gerrit.wikimedia.org/r/110155 (owner: 10TTO) [08:16:03] PROBLEM - Puppet freshness on labsdb1003 is CRITICAL: Last successful Puppet run was Thu 06 Feb 2014 05:15:15 AM UTC [08:16:13] PROBLEM - Parsoid on wtp1011 is CRITICAL: Connection refused [08:19:13] RECOVERY - Parsoid on wtp1011 is OK: HTTP OK: HTTP/1.1 200 OK - 970 bytes in 0.005 second response time [08:19:13] RECOVERY - Varnishkafka Delivery Errors on cp3019 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [08:23:03] RECOVERY - Varnishkafka Delivery Errors on cp3020 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [08:24:23] RECOVERY - Varnishkafka Delivery Errors on cp3022 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [08:31:08] moin [08:33:05] hello paravoid [08:33:48] paravoid: howdy [08:39:51] (03CR) 10Mattflaschen: [C: 04-1] "Looks good, except for GuidedTour and living people. I'd like to flesh out the living people categories." 
(033 comments) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/111460 (owner: 10Phuedx) [09:00:33] PROBLEM - Host mw31 is DOWN: PING CRITICAL - Packet loss = 100% [09:02:03] RECOVERY - Host mw31 is UP: PING OK - Packet loss = 0%, RTA = 35.44 ms [09:06:53] (03PS1) 10ArielGlenn: add tantalum misc server [operations/dns] - 10https://gerrit.wikimedia.org/r/111740 [09:09:36] (03CR) 10ArielGlenn: [C: 032] add tantalum misc server [operations/dns] - 10https://gerrit.wikimedia.org/r/111740 (owner: 10ArielGlenn) [09:11:49] cp1054 is still in a kmem_alloc funk [09:12:30] fixing [09:12:43] RECOVERY - Varnish traffic logger on cp1054 is OK: PROCS OK: 2 processes with command name varnishncsa [09:12:43] RECOVERY - Varnish HTTP text-backend on cp1054 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 0.002 second response time [09:12:44] I'll add a cronjob for compact memory [09:12:47] thanks, also: morning [09:12:49] thanks [09:13:04] sorry if i was too aggressive yesterday or on the lists [09:13:12] ori: is this syntax corrct related to puppet3: <% if @lsbdistrelease >= "12.04" %> [09:13:23] RECOVERY - Varnish HTCP daemon on cp1054 is OK: PROCS OK: 1 process with UID = 111 (vhtcpd), args vhtcpd [09:13:27] (03CR) 10Zhuyifei1999: [C: 031] "Adding Reedy to reviewer" [operations/apache-config] - 10https://gerrit.wikimedia.org/r/110155 (owner: 10TTO) [09:14:06] it's erb so it's just ruby, so i guess it's doing a vanilla lexical comparison [09:14:22] instead of regrets, I think you should start contributing to that pad, I think you can be really really helpful if you want :) [09:14:46] * matanya is cirous about that pad paravoid mentioned [09:14:48] good morning! 
[09:15:17] ori: the current syntax is <% if scope.function_versioncmp([lsbdistrelease, "12.04"]) >= 0 %> which is not puppet3 friendly [09:15:36] i wonder if my suggestion will in fact fix it [09:15:55] hi hashar [09:15:57] no, probably not, because versioncmp is probably aware of version string semantics [09:16:16] so what would you suggest to do? [09:16:43] PROBLEM - NTP on mw31 is CRITICAL: NTP CRITICAL: Offset unknown [09:16:54] I see this on the docs: http://docs.puppetlabs.com/references/stable/function.html#versioncmp [09:17:22] well, http://docs.puppetlabs.com/references/latest/function.html#versioncmp [09:17:22] ori: I would like to add a symlink bin/scap.py to bin/scap so we can get pyflakes/pep8 jobs. What do you think ? [09:17:23] yeah [09:17:32] hashar: there is no scap.py [09:17:36] only scap [09:17:46] ori: yeah and pyflakes/pep8 can't find bin/scap :/ [09:17:54] ori: or I should use tox :-] [09:18:47] `file` detects scap as 'scap: a python script text executable' [09:18:57] presumably based on the shebang [09:19:19] could you configure jenkins to match linters to files on that basis? [09:19:45] oh, BTW, today I celebrate passing 100 merged patches by me to WMF code base YAY! :) [09:20:02] matanya: congrats! [09:20:37] nice! [09:20:43] RECOVERY - NTP on mw31 is OK: NTP OK: Offset 0.001310110092 secs [09:21:12] matanya: well done! thank you for all the patches! [09:21:21] http://docs.puppetlabs.com/guides/templating.html#using-functions-within-templates [09:21:27] doesn't seem like anything changed between puppet 2 and 3 [09:21:28] ori: yeah that could be done using file. [09:21:54] ori: I will just get tox instead, this way we can let people tweak their pep8/pyflakes / unit tests however they want [09:22:04] ori: Dynamic lookup of $lsbdistrelease at /etc/puppet/modules/ganglia_new/templates/gmond.conf.erb:60 is deprecated. Support will be removed in Puppet 2.8. Use a fully-qualified variable name (e.g., $classname::variable) or parameterized classes. 
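The point made around 09:14 to 09:15, that a plain string comparison is not aware of version semantics the way versioncmp is, can be illustrated outside Puppet. This is a minimal Python sketch. `version_key` is a hypothetical helper that only handles dotted numeric versions; it is not a port of Puppet's versioncmp.

```python
def version_key(version):
    """Split a dotted numeric version such as '12.04' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))

# Lexical comparison goes character by character, so '9' sorts after '1'
# and "9.10" is wrongly treated as newer than "12.04".
lexical = "9.10" >= "12.04"

# Comparing the numeric components gives the intended ordering.
numeric = version_key("9.10") >= version_key("12.04")

print(lexical, numeric)
```

The two comparisons disagree for this pair of releases, so replacing a version-aware comparison with a bare `>=` on strings would change behaviour on older releases.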
[09:22:15] ori: I can get it installed via pip https://gerrit.wikimedia.org/r/#/c/111536/1/modules/contint/manifests/packages/labs.pp,unified :D [09:22:46] I first need to get Faidon to scream at making puppet to use pip as a package provider [09:22:53] aaaaaaaaaa [09:22:56] :P [09:23:03] and thank you all for encourge and help [09:23:21] matanya: right, so ::lsbdistrelease [09:23:33] yeah, ori, it is in erb [09:24:28] scope.lookupvar('::lsbdistrelease') [09:24:45] or just compute it in the manifest and assign it to a variable [09:24:56] better way i guess [09:24:59] mark, I'm running a command that needs the 'EXTERNAL_INTERFACE_GATEWAY' and 'EXTERNAL_INTERFACE_CIDR'. The external interface is set to 10.64.22.11, so I'm guessing that the CIDR is 10.64.22.11/24? But I don't know what to use as the gateway. [09:25:09] it should be available as @lsbdistrelease because it's a fact iirc [09:25:21] the gateway is 10.64.22.1 [09:25:26] yeah, that is what i thought a first too [09:25:28] so this should work: <% if scope.function_versioncmp([@lsbdistrelease, "12.04"]) >= 0 %> [09:25:39] and that CIDR sounds correct yes [09:26:02] yeah, so my code above was close to correct :) [09:26:20] well, you were solving the wrong problem [09:26:30] means? [09:26:43] eliminating the scope.function_versioncmp when it wasn't the issue [09:26:51] oh, yeah [09:27:26] thanks for the directions [09:28:02] thanks for the patches [09:28:40] (03PS1) 10Matanya: ganglia: puppet 3 compatibility fix: fully qualify variable [operations/puppet] - 10https://gerrit.wikimedia.org/r/111743 [09:29:21] mark, it would also be handy if you could allocate a smallish pool of public IPs for me to use in eqiad. (Eventually I'll recapture those from tampa but right now there isn't a good continuous range for me to swipe.) 
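The addressing question from 09:24 to 09:25 can be sanity-checked with Python's standard `ipaddress` module. This is a sketch using only the values quoted above. Note that `10.64.22.11/24` is an interface address with a prefix length; the network it belongs to is what a tool would usually expect as the CIDR, which is an assumption about the command being run.

```python
import ipaddress

# Values quoted in the log: the interface address with a /24 prefix,
# and the gateway that was given in reply.
interface = ipaddress.ip_interface("10.64.22.11/24")
gateway = ipaddress.ip_address("10.64.22.1")

network = interface.network               # the enclosing subnet
gateway_in_network = gateway in network   # the gateway must sit inside it
broadcast = network.broadcast_address     # last address of the block
hosts = list(network.hosts())             # usable addresses only

print(network, gateway_in_network, broadcast, hosts[0], hosts[-1])
```

`hosts()` leaves out the network and broadcast addresses, so the last usable address in a /24 is the one ending in .254.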
[09:29:31] (and if the two ranges overlap then hilarity will ensue) [09:29:38] there is one [09:29:50] you have used one ip for it already for NAT [09:29:52] let me look it up [09:30:11] ; 208.80.155.128/25 Eqiad Labs virtualization subnet [09:30:12] all yours [09:30:32] (03PS1) 10ArielGlenn: add tantalum to dhcp, netboot [operations/puppet] - 10https://gerrit.wikimedia.org/r/111744 [09:30:47] sweet [09:32:05] (03CR) 10ArielGlenn: [C: 032] add tantalum to dhcp, netboot [operations/puppet] - 10https://gerrit.wikimedia.org/r/111744 (owner: 10ArielGlenn) [09:33:36] grr, this must mean something new when it says 'floating ip' because it's complainign that my ip range isn't on 10.64.22.0/24 which is obviously not useful for floating ips... [09:34:04] way to constantly redefine your terminology, openstack! [09:34:21] (03PS1) 10Matanya: ldap: puppet 3 compatibility fix: fully qualify variable [operations/puppet] - 10https://gerrit.wikimedia.org/r/111745 [09:36:46] nice [09:38:37] Um… am I miscounting my bits? {u'message': u"The allocation pool {u'start': u'10.64.22.14', u'end': u'10.64.22.255'} spans beyond the subnet cidr 10.64.22.0/24." [09:38:57] (03PS1) 10Ryan Lane: Temporarily disable multi-master salt [operations/puppet] - 10https://gerrit.wikimedia.org/r/111746 [09:39:01] mark, paravoid: ^^ [09:39:02] no that's correct [09:39:04] mind a review? [09:39:17] specifically the template [09:39:18] hopefully paravoid can review? i'm finally gonna get breakfast now ;) [09:39:21] or I can review later [09:39:26] mark, the error is correct? Or my range is correct? [09:39:33] andrewbogott: the range looks correct [09:39:41] 10.64.22.0/24 is 10.64.22.0 to 10.64.22.255 [09:39:52] So neutron can't count then. [09:39:56] * andrewbogott curses [09:39:59] heh [09:40:10] ok, bbl :) [09:40:29] oh, it things 254 is fine. [09:40:33] Off-by-one ftw [09:40:36] I guess that change can wait till tomorrow. 
I need to sleep :) [09:40:52] revieing [09:40:56] reviewing [09:42:10] mostly I wanted to make sure adding the -'s did what I intended [09:43:22] (03CR) 10Faidon Liambotis: [C: 032] Temporarily disable multi-master salt [operations/puppet] - 10https://gerrit.wikimedia.org/r/111746 (owner: 10Ryan Lane) [09:43:53] thanks [09:44:02] are you deploying? [09:45:12] looks like you did :) [09:45:27] yeah :) [09:45:32] I'm testing on tin [09:45:56] have I mentioned how much I really, really love hiera? [09:46:24] it's not too often you'll hear me praise a puppet feature :) [09:48:56] cool, that change worked [09:55:18] (03PS1) 10Ryan Lane: Add an eventual consistency call for deploy.deployment_server_init [operations/puppet] - 10https://gerrit.wikimedia.org/r/111749 [09:56:43] eventual consistency? [09:57:15] yes. when the puppet master updates the pillars, it also calls deploy.deployment_server_init on tin [09:57:29] (03CR) 10Ori.livneh: "no unless / onlyif?" (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/111749 (owner: 10Ryan Lane) [09:57:32] if tin doesn't receive that call for some reason, puppet should ensure it occurs [09:57:41] no unless or onlyif [09:57:43] there's no need [09:57:48] it should run every single time [09:57:58] the function itself ensures a state [09:58:10] as mentioned in the commit message, it's safe to run every time [09:58:41] it also returns in roughly 1-2 seconds if no changes need to be made (like a clone) [09:59:27] unless / onlyif has very little to do with safety [09:59:32] (03CR) 10Ryan Lane: "As mentioned in the commit message, this should be run on every puppet run. It's safe to do so and the command returns in 1-2 seconds if n" [operations/puppet] - 10https://gerrit.wikimedia.org/r/111749 (owner: 10Ryan Lane) [09:59:45] what would be the purpose here? 
[10:00:10] the point of this is to ensure consistency and the function call is what does it [10:00:37] if it isn't called every time, then it can't ensure consistency [10:02:37] puppet is declarative, not procedural. it checks the state of the system and modifies it as needed to match a declared state. [10:02:51] when you call it every time, you are essentially conceding that there is a bit of state that is a black box to puppet [10:03:06] that it must attempt to modify every time [10:03:26] this is state that's managed outside of puppet [10:03:43] right, and that's bad [10:03:55] why? [10:04:23] first of all, because you're managing it from puppet [10:04:30] absolutely not [10:04:35] puppet is the eventual consistency system [10:04:36] what is that patch then? [10:04:46] salt is the immediate consistency system [10:05:01] you back up an immediate consistency system with an eventual consistency system [10:05:37] that's a fancy way of saying that there's a bug somewhere you don't care to chase down and will fix by just having a script run until it sticks [10:05:37] salt is ensuring an immediate state for deployment, and puppet calls it locally in case it fails to be called from the master [10:05:51] why is it failing to be called from the master? [10:05:53] no. you *must* assume that an immediate consistency system will fail [10:06:11] and you *must* back that up with something that causes the state to eventually become consistent [10:06:19] ori: I already chased down the bug and fixed it [10:06:32] ori: https://gerrit.wikimedia.org/r/#/c/111746/ [10:06:35] i see, so when i rm ~/oldfiles, i should also add a cron job that rms it just in case? [10:07:06] if you run salt '*' cmd.run 'rm ~/oldfiles' you should definitely have a state that also ensures it's gone [10:07:35] a remote execution system is there to speed up the process [10:07:36] hashar_: I suppose https://gerrit.wikimedia.org/r/#/c/111536/ is not the first pip installed package on CI right ? 
[10:07:40] it's not meant to be failsafe [10:07:56] i......ok. [10:07:59] if you ever rely on it being so, you've made a mistake [10:08:25] akosiaris: it is [10:09:02] if it's in scope for puppet to manage it [10:09:06] it's in scope for puppet to know its state [10:09:15] and thus in scope to run it as needed, based on local state [10:09:17] that kind of sucks Ryan [10:09:17] akosiaris: I could use tox 1.6+ but I could not manage to backport the debian package from another ubuntu version. There is a bunch of dependencies that have changed such as python/ some new virtual packages and a bunch of other packages that are not in Precise [10:09:22] paravoid: why? [10:09:24] and not generate log churn that makes you think puppet had to modify the system over and over [10:09:36] does this mean that we have to have puppet code that clones mediawiki in case salt failed to do it? :) [10:09:43] akosiaris: so I though the pragmatic approach would be to use pip since tox would only be used on labs (I have added a fail() call whenever $::realm is 'production' [10:09:48] no. you'd have puppet call the salt module [10:09:57] though/thought [10:09:58] just like we have things that call scap [10:10:00] in 0-30 minutes? :) [10:10:06] paravoid: yes [10:10:07] akosiaris: in production I have packaged all the python modules I needed (for Zuul) [10:10:19] that sucks [10:10:28] you can never assume that a remote execution system is going to be 100% reliable [10:10:33] hashar_: ok.. that makes me feel a lot better [10:10:45] I can assume it will be 100% reliable if it doesn't throw any errors [10:10:48] you should take the output of the system, check the failures and base an action on it [10:10:56] but you're not checking the failure [10:10:56] I suppose the idea is that at some point we move to trusty and this is no longer needed right ? 
[10:11:00] you're running this unconditionally [10:11:12] then you should have a mechanism to ensure it's eventually consistent [10:11:16] cause trusty *might* have the version you *now* want [10:11:18] :P [10:11:20] akosiaris: we can probably backport python-tox to Precise though. Would need to heavily hack the dependencies mentioned in the later Ubuntu version. But that is beyond my capabilities :-( [10:11:22] there's a difference between calling aunt sally to check if she got your christmas card [10:11:30] vs. sending her the same christmas card every week, just in case [10:11:31] if I want salt to revoke a user account, it better revoke that account immediately or tell me where it failed [10:11:44] hashar_: I think I already tried that and kind of gave up [10:11:47] paravoid: it will tell you [10:12:06] okay :) [10:12:18] let's step back a second and let me explain why this is necessary [10:12:24] yeah, I lack context [10:12:37] I just jumped in by reading a very small quote from you that sounded wrong :) [10:12:43] hashar_: I will add a comment with a TODO: after moving to trusty reevaluate this and merge. OK? [10:13:06] deploy.deployment_server_init is called *automatically* by the puppet master when it sees that a new repository is added [10:13:53] we can have puppet fail on the master if that call fails [10:13:56] if you'd prefer that [10:14:06] but then it'll be another 30 minutes before it tries again [10:14:29] but why bother when the system that's being updated can bring itself into consistency? [10:14:33] akosiaris: sounds good thanks. That change also depends on three other tiny changes which only impact jenkins slaves in labs [10:14:41] the point of the master calling the function is to make that faster [10:15:04] akosiaris: basically I have to use git::clone() to deploy integration/jenkins.git automatically (can't use git-deploy from tin to deploy on labs instances, so git::clone() fills the gap there).
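Editorial aside on the exchange above: the "check on aunt sally" versus "send the card every week" distinction can be sketched in a few lines of Python. This is only an illustration of the check-then-act idea argued for in the chat, not how Puppet, Salt, or Trebuchet actually implement it; `converge`, `is_consistent`, `repair`, and the `state` dictionary are hypothetical names invented for this sketch.

```python
from typing import Callable


def converge(is_consistent: Callable[[], bool], repair: Callable[[], None]) -> bool:
    """Run repair() only when the check reports drift.

    Returns True if a repair was performed, False if nothing had to change.
    """
    if is_consistent():
        return False
    repair()
    return True


# Hypothetical local state standing in for "the deployment repo is initialised".
state = {"initialised": False}

# First run: the state has drifted, so the repair action is executed.
changed_first = converge(lambda: state["initialised"],
                         lambda: state.update(initialised=True))

# Second run: the check passes, so nothing is re-run and nothing is logged as changed.
changed_second = converge(lambda: state["initialised"],
                          lambda: state.update(initialised=True))

print(changed_first, changed_second)
```

The design point is that the second run reports no change, which is what avoids the log churn mentioned earlier; the opposing position in the chat is that the repair call itself may already be idempotent, in which case calling it every run is the simpler fallback.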
[10:15:09] it honestly doesn't matter if it happens immediately if the minion can bring itself into a consistent state [10:15:33] git status [10:15:33] in the case of an actual deployment, a deployment occurs and errors are returned immediately to a deployer [10:15:41] meh.. wrong window [10:15:43] that deployer should take an action based on that [10:16:03] if for some reason a system is broken and can't take a deployment, it should be depooled [10:16:04] however [10:16:15] the system should be able to eventually make itself consistent [10:16:28] and should report to the deployment system that it's now consistent [10:16:49] at which point it should be repooled [10:17:02] ideally this would happen automatically via orchestration (with a depool threshold) [10:17:26] so that ops doesn't always need to be involved for deployment failures to broken hosts [10:17:53] am I still somehow crazy for thinking this way? [10:21:45] and btw, I do believe that we should have deploy.sync_all calls on all deployment targets through puppet. 
the state of a deployment is known by the deployment system, so it's actually declarative (but it's declared through trebuchet rather than puppet) [10:23:03] anyway, it seems I'm talking to the air here, so I'm going to go to sleep [10:26:29] ls [10:26:35] gah [10:28:35] (03PS1) 10Matanya: base: puppet 3 compatibility fix: fully qualify variable [operations/puppet] - 10https://gerrit.wikimedia.org/r/111754 [10:29:57] (03PS1) 10Andrew Bogott: Fix up ownership for /var/lib/nova/instances [operations/puppet] - 10https://gerrit.wikimedia.org/r/111756 [10:32:49] (03PS2) 10Andrew Bogott: Fix up ownership for /var/lib/nova/instances [operations/puppet] - 10https://gerrit.wikimedia.org/r/111756 [10:32:57] (03PS2) 10Alexandros Kosiaris: contint: on slave labs, install tox from pip [operations/puppet] - 10https://gerrit.wikimedia.org/r/111536 (owner: 10Hashar) [10:34:30] (03CR) 10Andrew Bogott: [C: 032] Fix up ownership for /var/lib/nova/instances [operations/puppet] - 10https://gerrit.wikimedia.org/r/111756 (owner: 10Andrew Bogott) [10:35:04] (03PS1) 10Matanya: puppetmaster: puppet 3 compatibility fix: fully qualify variable [operations/puppet] - 10https://gerrit.wikimedia.org/r/111758 [10:35:26] (03Abandoned) 10Matanya: puppetmaster: puppet 3 compatibility fix: fully qualify variable [operations/puppet] - 10https://gerrit.wikimedia.org/r/111758 (owner: 10Matanya) [10:37:54] (03CR) 10Alexandros Kosiaris: [C: 032] contint: on slave labs, install tox from pip [operations/puppet] - 10https://gerrit.wikimedia.org/r/111536 (owner: 10Hashar) [10:39:53] (03PS1) 10Matanya: puppetmaster: puppet 3 compatibility fix: fully qualify variable [operations/puppet] - 10https://gerrit.wikimedia.org/r/111759 [10:40:50] 10:40:27 oh, it things 254 is fine. [10:40:50] 10:40:31 Off-by-one ftw [10:41:12] andrewbogott: in a normal subnet that makes sense [10:41:22] oh? Why not 255? 
[10:41:24] because then .255 is the broadcast address [10:41:28] and you can't use that for hosts [10:41:32] but the error message is definitely confusing here [10:41:43] also, the floating ip public range is not a normal subnet [10:41:46] so there we can use .255 [10:46:38] (03PS1) 10Matanya: varnish: puppet 3 compatibility fix: fully qualify variable [operations/puppet] - 10https://gerrit.wikimedia.org/r/111761 [10:54:38] akosiaris: thank you very much for tox! [10:54:52] akosiaris: the Wikimedia python cabal is honoring you right now. [10:55:11] there's a wikimedia python cabal? and i wasn't invited? sniff sniff. [10:55:27] you can only be part of one cabal at the same time :D [10:57:38] am I part of some cabal ? [10:57:55] i wanna join the python one, where do I apply ? [10:58:55] let me fill a rt to get a pythonists@wikimedia.org mailing list [10:59:47] lame debian question: is there a way to have "apt-get upgrade" to skip downgrading packages ? [11:00:06] on an instance I have some nodejs/npm packages that have been manually installed and that prevents me from upgrading the other packages [11:00:12] (upgrade would downgrade nodejs/npm) [11:00:31] (03PS1) 10Springle: s2 increase db1024 load after warm up [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/111764 [11:00:49] (03PS1) 10Yuvipanda: Deploy Extension:Popups on betalabs [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/111765 [11:00:59] (03CR) 10Springle: [C: 032] s2 increase db1024 load after warm up [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/111764 (owner: 10Springle) [11:01:07] (03Merged) 10jenkins-bot: s2 increase db1024 load after warm up [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/111764 (owner: 10Springle) [11:02:51] hey hashar! [11:03:00] hashar: another extension to betalabs! this one for the design team. 
https://gerrit.wikimedia.org/r/#/c/111765/ [11:03:02] !log springle synchronized wmf-config/db-eqiad.php 's2 increase db1024 load after warm up' [11:03:09] Logged the message, Master [11:03:24] yuvipanda: is it registered in mediawiki/extensions.git? [11:03:47] hashar: checking that now, it should be. [11:03:50] moment [11:04:46] hashar: hmm, it isn't. isn't that supposed to be automatic? [11:05:06] no [11:05:09] hmm [11:05:13] there's a script to update it though [11:05:24] Nemo_bis: do you have that script handy? :) [11:05:35] it's in the repo IIRC [11:05:40] never used [11:05:42] oh [11:05:44] i found it [11:05:47] hashar: you can tell apt to ignore packages [11:06:24] hashar: i.e. sudo apt-mark hold [11:06:50] my hero [11:06:51] http://devopsreactions.tumblr.com/post/75576729444/explaining-how-git-gerrit-and-jenkins-work-together [11:07:00] hehe [11:07:31] one point hashar it doesn't work before 12.04 [11:07:35] greg sent me that link a couple days ago. Even managed to find the original video: http://www.youtube.com/watch?v=r-qhj3sJ5qs [11:08:31] matanya: worked for me. Thank you very much [11:08:50] matanya: I guess if a newer version appears in apt.wikimedia.org the package will not be updated, will it? [11:09:04] no it won't [11:09:13] the apt pinning will block it [11:09:25] and apt-get upgrade would never complain, right?
[11:09:56] no, this package is mark on hold for him, so he should be fine with it [11:10:05] but paravoid can surely confirm [11:10:28] (03CR) 10Hashar: [C: 04-1] "Please make sure Popups is registered in mediawiki/extensions.git or it will not be deployed on beta cluster :-) Then you can get that ch" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/111765 (owner: 10Yuvipanda) [11:10:50] (03PS1) 10TTO: Get rid of echowikis.dblist [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/111766 [11:10:55] mind you apt-mark hold is undocumented [11:10:55] :D [11:11:35] hashar: registered at https://gerrit.wikimedia.org/r/#/c/111767/, can you +2? [11:11:51] I could too, but self merging ugh. [11:12:11] matanya: thank you very much [11:12:30] yuvipanda: I self merge on mediawiki/extensions.git all the time :-D will review [11:13:44] hashar: thanks! :) [11:14:09] hashar: np, at your service :) http://manpages.ubuntu.com/manpages/precise/en/man8/apt-mark.8.html [11:14:50] !log jenkins: added label hasTox on integration-slave01.pmtpa.wmflabs. Will let us run tox based Jenkins jobs there. [11:14:58] Logged the message, Master [11:15:17] (03CR) 10TTO: "https://www.mediawiki.org/wiki/Extension:Popups" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/111765 (owner: 10Yuvipanda) [11:15:23] (03CR) 10TTO: [C: 04-1] Deploy Extension:Popups on betalabs [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/111765 (owner: 10Yuvipanda) [11:17:03] PROBLEM - Puppet freshness on labsdb1003 is CRITICAL: Last successful Puppet run was Thu 06 Feb 2014 05:15:15 AM UTC [11:18:53] RECOVERY - Puppet freshness on labsdb1003 is OK: puppet ran at Thu Feb 6 11:18:46 UTC 2014 [11:28:13] yuvipanda: my net is slow sorry :( [11:28:51] hashar: :) [11:32:27] hashar: want me to self-merge mw/extensions? 
[11:54:53] PROBLEM - MySQL Slave Delay on db1018 is CRITICAL: CRIT replication delay 306 seconds [11:55:43] PROBLEM - MySQL Replication Heartbeat on db1018 is CRITICAL: CRIT replication delay 357 seconds [12:01:43] RECOVERY - MySQL Replication Heartbeat on db1018 is OK: OK replication delay 0 seconds [12:01:53] RECOVERY - MySQL Slave Delay on db1018 is OK: OK replication delay 0 seconds [12:07:32] (03PS1) 10Faidon Liambotis: Switch BD, ID, MN to ulsfo [operations/dns] - 10https://gerrit.wikimedia.org/r/111775 [12:07:51] (03CR) 10Faidon Liambotis: [C: 032] Switch BD, ID, MN to ulsfo [operations/dns] - 10https://gerrit.wikimedia.org/r/111775 (owner: 10Faidon Liambotis) [12:11:54] lunch time bbl [12:17:44] (03PS1) 10Matanya: protoproxy: puppet 3 compatibility fix: fully qualify variable [operations/puppet] - 10https://gerrit.wikimedia.org/r/111776 [12:20:15] paravoid: isn't that worth a !log [12:20:32] yes, will do, I'm not done :) [12:20:43] ok sorry [12:20:57] no need to apologise, you're right [12:21:06] we haven't logged all the other changes the last days either [12:21:19] I thought MN was Minnesota and I wondered what ID and BD were ^^ [12:22:12] (03PS2) 10Matanya: protoproxy: puppet 3 compatibility fix: fully qualify variable [operations/puppet] - 10https://gerrit.wikimedia.org/r/111776 [12:30:42] (03PS1) 10Faidon Liambotis: Switch VN to ulsfo [operations/dns] - 10https://gerrit.wikimedia.org/r/111779 [12:31:01] (03CR) 10Faidon Liambotis: [C: 032] Switch VN to ulsfo [operations/dns] - 10https://gerrit.wikimedia.org/r/111779 (owner: 10Faidon Liambotis) [12:35:14] (03PS1) 10Andrew Bogott: Added auth_uri to a few more places, as requested by havana. [operations/puppet] - 10https://gerrit.wikimedia.org/r/111780 [12:36:30] (03CR) 10Andrew Bogott: [C: 032] Added auth_uri to a few more places, as requested by havana. 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/111780 (owner: 10Andrew Bogott) [12:39:36] (03PS1) 10Matanya: toollabs: puppet 3 compatibility fix: fully qualify variables [operations/puppet] - 10https://gerrit.wikimedia.org/r/111781 [12:40:33] (03PS1) 10Faidon Liambotis: Switch TH & MM to ulsfo [operations/dns] - 10https://gerrit.wikimedia.org/r/111782 [12:40:50] (03CR) 10Faidon Liambotis: [C: 032] Switch TH & MM to ulsfo [operations/dns] - 10https://gerrit.wikimedia.org/r/111782 (owner: 10Faidon Liambotis) [12:42:47] um, my review queue is around 30, anyone here has some time to review a few, mostly easy fixes [12:43:10] !log pointing the rest of East Asia (except CN) to ulsfo [12:43:17] Logged the message, Master [12:43:22] Nemo_bis: there :) [13:00:25] (03PS1) 10Faidon Liambotis: Point US & Canada west coastal states to ulsfo [operations/dns] - 10https://gerrit.wikimedia.org/r/111784 [13:02:49] (03PS1) 10Andrew Bogott: switch nova.conf to use neutron. [operations/puppet] - 10https://gerrit.wikimedia.org/r/111785 [13:03:07] (03CR) 10Faidon Liambotis: [C: 032] Point US & Canada west coastal states to ulsfo [operations/dns] - 10https://gerrit.wikimedia.org/r/111784 (owner: 10Faidon Liambotis) [13:04:46] (03CR) 10Andrew Bogott: [C: 032] switch nova.conf to use neutron. [operations/puppet] - 10https://gerrit.wikimedia.org/r/111785 (owner: 10Andrew Bogott) [13:06:10] thanks :) [13:09:09] * paravoid waits for US pacific to wake up [13:10:09] matanya: Do you know if and how the puppet 3 changes affect templates à la modules/toollabs/templates/mail-relay.erb: "route_list = * <%= fqdn %>"? 
[13:10:16] yep, traitors, all sleeping now, can't clearly see effect on ganglia [13:10:31] it's by design [13:10:34] yes scfc_de you need to use @fqdn [13:10:38] sure, just kidding ^^ [13:10:42] I want it to gradually ramp up [13:11:01] rather than just switch a large portion of traffic at peak hours [13:12:58] matanya: Ah: http://docs.puppetlabs.com/guides/templating.html#referencing-variables [13:13:02] (03PS1) 10Matanya: cache: puppet 3 compatibility fix: fully qualify variable [operations/puppet] - 10https://gerrit.wikimedia.org/r/111787 [13:13:08] !log US/Canada pacific states being served by ulsfo [13:13:14] yeah exactly scfc_de [13:13:15] Logged the message, Master [13:13:32] (03PS1) 10Faidon Liambotis: Switch the rest of the NA western states to ulsfo [operations/dns] - 10https://gerrit.wikimedia.org/r/111788 [13:13:34] (03CR) 10jenkins-bot: [V: 04-1] Switch the rest of the NA western states to ulsfo [operations/dns] - 10https://gerrit.wikimedia.org/r/111788 (owner: 10Faidon Liambotis) [13:14:05] (03CR) 10Faidon Liambotis: "recheck" [operations/dns] - 10https://gerrit.wikimedia.org/r/111788 (owner: 10Faidon Liambotis) [13:21:40] (03PS2) 10Tim Landscheidt: Tools: Fully qualify variables for Puppet 3 compatibility [operations/puppet] - 10https://gerrit.wikimedia.org/r/111781 (owner: 10Matanya) [13:23:55] (03CR) 10Tim Landscheidt: [C: 031] Tools: Fully qualify variables for Puppet 3 compatibility [operations/puppet] - 10https://gerrit.wikimedia.org/r/111781 (owner: 10Matanya) [13:24:29] (03PS1) 10Andrew Bogott: Remove a weird redefinition of controller_hostname. [operations/puppet] - 10https://gerrit.wikimedia.org/r/111789 [13:24:41] scfc_de: no, maildomain won't work this way [13:25:41] Why not? [13:26:10] becuase it is not a fact [13:26:16] it is defined by us [13:26:23] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [13:26:33] (03PS2) 10Andrew Bogott: Remove a weird redefinition of controller_hostname. 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/111789 [13:26:35] eep [13:26:59] matanya: But http://docs.puppetlabs.com/guides/templating.html#referencing-variables speaks unconditionally of all variables in scope? I'll test the change on Toolsbeta to be sure. [13:27:34] scfc_de: yes, but in this scope [13:27:42] ok, unrelated [13:27:46] maildomain isn't defined in this scope [13:28:18] But how would it then be evaluated now? [13:28:19] (03CR) 10Andrew Bogott: [C: 032] Remove a weird redefinition of controller_hostname. [operations/puppet] - 10https://gerrit.wikimedia.org/r/111789 (owner: 10Andrew Bogott) [13:28:51] wait, this inherits confuses me [13:29:29] oh, ok. it is called in the same scope [13:29:37] just in a weird way :) [13:30:14] sorry scfc_de [13:31:13] I'll still test it to be sure :-). [13:39:39] thank you [13:51:55] matanya: Works as expected, so we're good to go. [13:52:11] great, thanks for checking out [13:53:42] (03PS2) 10Faidon Liambotis: Switch the rest of the NA western states to ulsfo [operations/dns] - 10https://gerrit.wikimedia.org/r/111788 [13:56:31] (03CR) 10Tim Landscheidt: "Tested on Toolsbeta, especially the @maildomain bit in the templates, and works fine." [operations/puppet] - 10https://gerrit.wikimedia.org/r/111781 (owner: 10Matanya) [14:25:23] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000 [14:29:18] (03PS1) 10Andrew Bogott: Try using a proper username/pass for neutron auth [operations/puppet] - 10https://gerrit.wikimedia.org/r/111795 [14:31:52] (03CR) 10Andrew Bogott: [C: 032] Try using a proper username/pass for neutron auth [operations/puppet] - 10https://gerrit.wikimedia.org/r/111795 (owner: 10Andrew Bogott) [14:54:36] (03CR) 10Ottomata: "Can we wait a bit before we decom? I want to make sure everything is sane on erbium before we turn this off." 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/110394 (owner: 10Matanya) [14:55:52] (03PS1) 10Andrew Bogott: Further attempt to actually add neutron_ldap_user_pass to nova.conf [operations/puppet] - 10https://gerrit.wikimedia.org/r/111797 [14:58:03] (03CR) 10Andrew Bogott: [C: 032] Further attempt to actually add neutron_ldap_user_pass to nova.conf [operations/puppet] - 10https://gerrit.wikimedia.org/r/111797 (owner: 10Andrew Bogott) [15:08:51] (03CR) 10Faidon Liambotis: [C: 032] Switch the rest of the NA western states to ulsfo [operations/dns] - 10https://gerrit.wikimedia.org/r/111788 (owner: 10Faidon Liambotis) [15:11:46] (03PS2) 10Phuedx: Enable the GettingStarted extension on non-enwiki wikis. [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/111460 [15:13:18] (03CR) 10Phuedx: Enable the GettingStarted extension on non-enwiki wikis. (033 comments) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/111460 (owner: 10Phuedx) [15:28:01] !log reedy updated /a/common to {{Gerrit|I1e5ea52ae}}: s2 increase db1024 load after warm up [15:28:08] Logged the message, Master [15:35:22] (03PS1) 10Reedy: Add symlinks [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/111800 [15:35:24] (03PS1) 10Reedy: Wikipedias to 1.23wmf12 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/111801 [15:35:26] (03PS1) 10Reedy: Point php to 1.23wmf12 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/111802 [15:35:28] (03PS1) 10Reedy: Update group0 wikis to 1.23wmf13 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/111803 [15:35:38] (03CR) 10jenkins-bot: [V: 04-1] Update group0 wikis to 1.23wmf13 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/111803 (owner: 10Reedy) [15:35:40] (03CR) 10Reedy: [C: 032] Add symlinks [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/111800 (owner: 10Reedy) [15:55:17] (03PS4) 10Diederik: [WIP] MySQL scripts to generate Kanban chart for RT 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/111152 [15:57:20] !log restarted jenkins by mistake :-( [15:57:28] Logged the message, Master [16:00:06] !log reedy started scap: testwiki to 1.23wmf13 and build l10n cache [16:00:14] Logged the message, Master [16:01:33] (03PS3) 10Matanya: protoproxy: puppet 3 compatibility fix: fully qualify variable [operations/puppet] - 10https://gerrit.wikimedia.org/r/111776 [16:04:31] !log reedy started scap: testwiki to 1.23wmf13 and build l10n cache [16:04:39] Logged the message, Master [16:07:53] !log reedy started scap: testwiki to 1.23wmf13 and build l10n cache [16:08:00] Logged the message, Master [16:11:05] !re-enabling puppet on analytics1021 [16:25:22] (03PS1) 10Ottomata: Parameterizing num.replica.fetchers and replica.fetch.max.bytes [operations/puppet/kafka] - 10https://gerrit.wikimedia.org/r/111812 [16:27:04] (03CR) 10Ottomata: [C: 032 V: 032] Parameterizing num.replica.fetchers and replica.fetch.max.bytes [operations/puppet/kafka] - 10https://gerrit.wikimedia.org/r/111812 (owner: 10Ottomata) [16:29:03] PROBLEM - Apache HTTP on mw1073 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:29:20] (03PS1) 10Ottomata: Increasing num_replica_fetchers to 2 on kafka brokers [operations/puppet] - 10https://gerrit.wikimedia.org/r/111816 [16:29:30] (03PS2) 10Ottomata: Increasing num_replica_fetchers to 2 on kafka brokers [operations/puppet] - 10https://gerrit.wikimedia.org/r/111816 [16:30:03] RECOVERY - Apache HTTP on mw1073 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 8.919 second response time [16:32:53] PROBLEM - Apache HTTP on mw1070 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:33:53] PROBLEM - Apache HTTP on mw1111 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:33:53] PROBLEM - Apache HTTP on mw1110 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:33:53] PROBLEM - Apache HTTP on mw1091 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:33:54] 
PROBLEM - Apache HTTP on mw1209 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:33:59] (03CR) 10Ottomata: [C: 032 V: 032] Increasing num_replica_fetchers to 2 on kafka brokers [operations/puppet] - 10https://gerrit.wikimedia.org/r/111816 (owner: 10Ottomata) [16:34:03] PROBLEM - Apache HTTP on mw1211 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:34:14] huh [16:34:23] PROBLEM - Apache HTTP on mw1019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:34:23] PROBLEM - Apache HTTP on mw1028 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:34:33] PROBLEM - Apache HTTP on mw1088 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:34:33] PROBLEM - Apache HTTP on mw1052 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:34:33] PROBLEM - Apache HTTP on mw1030 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:34:43] PROBLEM - Apache HTTP on mw1075 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:34:53] PROBLEM - Apache HTTP on mw1071 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:34:53] PROBLEM - Apache HTTP on mw1166 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:34:53] PROBLEM - Apache HTTP on mw1165 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:34:53] PROBLEM - Apache HTTP on mw1184 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:34:53] PROBLEM - Apache HTTP on mw1043 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:34:54] * hashar ducks [16:34:54] PROBLEM - Apache HTTP on mw1067 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:34:54] PROBLEM - Apache HTTP on mw1022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:34:55] PROBLEM - Apache HTTP on mw1061 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:34:59] someone tripped on the wrong cable. [16:35:03] shit? 
[16:35:23] PROBLEM - Apache HTTP on mw1083 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:35:33] PROBLEM - Apache HTTP on mw1082 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:35:45] what's going on? [16:35:50] that you, Reedy? [16:35:53] PROBLEM - Apache HTTP on mw1169 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:35:53] PROBLEM - Apache HTTP on mw1170 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:35:53] PROBLEM - Apache HTTP on mw1181 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:35:56] don't ask me, I'm management [16:35:57] Possibly [16:36:03] PROBLEM - Apache HTTP on mw1185 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:36:03] PROBLEM - Apache HTTP on mw1054 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:36:06] scap is running deploying code and l10n cache for testwiki [16:36:17] considering testwiki actually only runs on one apache... [16:36:18] http://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Application%20servers%20eqiad&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2&st=1391704546&g=load_report&z=large [16:36:23] PROBLEM - Apache HTTP on mw1062 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:36:33] PROBLEM - Apache HTTP on mw1213 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:36:33] PROBLEM - Apache HTTP on mw1168 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:36:45] shit [16:36:53] PROBLEM - Apache HTTP on mw1051 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:36:53] PROBLEM - Apache HTTP on mw1047 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:36:53] PROBLEM - Apache HTTP on mw1106 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:36:53] PROBLEM - Apache HTTP on mw1038 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:36:53] PROBLEM - Apache HTTP on mw1100 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:36:54] enwiki looks to be still up [16:37:15] maybe down 1/10 
[16:37:17] 697 PHP Warning: require() [function.require]: Unable to allocate memory for pool. in /usr/local/apache/common-local/ [16:37:17] php-1.23wmf12/includes/AutoLoader.php on line 1222 [16:37:38] someting ate all the memory? [16:37:53] PROBLEM - Apache HTTP on mw1077 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:38:06] We usually get that as we changeover versions [16:38:23] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [16:38:25] not this many [16:38:29] oh, that error [16:38:31] APC errors we do [16:38:42] But the apaches are still only running wmf11 and wmf12, mw1017 is the only one running wmf13 code [16:38:53] PROBLEM - Apache HTTP on mw1078 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:39:03] PROBLEM - Apache HTTP on mw1073 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:39:03] PROBLEM - Apache HTTP on mw1053 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:39:13] RECOVERY - Kafka Broker Messages In on analytics1021 is OK: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate OKAY: 1520.69473334 [16:39:25] wooo [16:39:27] yeah, wtf [16:39:49] 2014-02-06 16:39:41 mw1198 mediawikiwiki: [187f25a0] /w/api.php?action=query&format=json&list=threads&thid=38882%7C38451%7C38239%7C38190%7C36920&thprop=id%7Csubject%7Cparent%7Cmodified Exception from line 468 of /usr/local/apache/common-local/php-1.23wmf13/includes/cache/LocalisationCache.php: No localisation cache found for English. Please run maintenance/rebuildLocalisationCache.php. 
[16:39:53] PROBLEM - Apache HTTP on mw1039 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:40:29] [06-Feb-2014 16:39:15] Fatal error: Unsupported operand types at /usr/local/apache/common-local/php-1.23wmf13/languages/Language.php on line 516 [16:40:30] :| [16:40:31] 2014-02-06 16:39:41 mw1198 mediawikiwiki: [187f25a0] /w/api.php?action=query&format=json&list=threads&thid=38882%7C38451%7C38239%7C38190%7C36920&thprop=id%7Csubject%7Cparent%7Cmodified Exception from line 468 of /usr/local/apache/common-local/php-1.23wmf13/includes/cache/LocalisationCache.php: No localisation cache found for English. Please run maintenance/rebuildLocalisationCache.php. [16:40:31] #0 /usr/local/apache/common-local/php-1.23wmf13/includes/cache/LocalisationCache.php(326): LocalisationCache->initLanguage('en') [16:40:32] wmf13 [16:40:38] Why is mediawiki running wmf13!? [16:40:54] reedy@tin:/a/common$ grep wmf13 wikiversions.dat [16:40:54] testwiki php-1.23wmf13 * [16:41:04] * andre__ gets the first bug reports [16:41:12] scap I assume [16:41:15] the new scap? [16:41:23] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: [16:41:31] Logged the message, Master [16:41:33] what's with all the texvc stuff? [16:41:35] That's running, but I'm not sure it's at fault [16:41:53] PROBLEM - Apache HTTP on mw1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:41:55] root@mw1051:~# ps aux|grep -c texvc [16:41:55] 67 [16:42:05] ori, ^^^^^^ [16:42:25] texvc is math [16:42:42] presumably means it's rendering things [16:42:43] RECOVERY - Apache HTTP on mw1078 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.906 second response time [16:43:09] I see a shitload of new "Unable to allocate memory for pool" [16:43:21] [16:40:37] Why is mediawiki running wmf13!? 
[16:43:33] RECOVERY - Apache HTTP on mw1213 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.524 second response time [16:43:43] RECOVERY - Apache HTTP on mw1023 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 1.235 second response time [16:43:43] RECOVERY - Apache HTTP on mw1071 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 2.699 second response time [16:43:43] RECOVERY - Apache HTTP on mw1051 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 5.005 second response time [16:43:44] RECOVERY - Apache HTTP on mw1165 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.395 second response time [16:43:53] RECOVERY - Apache HTTP on mw1185 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.056 second response time [16:43:56] Yay [16:44:05] yay, just 7 seconds [16:44:15] that's probably pybal depooling them, them recovering [16:44:23] RECOVERY - Apache HTTP on mw1168 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 2.726 second response time [16:44:33] RECOVERY - Apache HTTP on mw1088 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 5.921 second response time [16:44:34] is it possible it's an APC cache stampede from being rolled out immediately to enwp? 
[16:44:43] RECOVERY - Apache HTTP on mw1170 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.046 second response time [16:44:43] RECOVERY - Apache HTTP on mw1070 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 1.961 second response time [16:44:44] RECOVERY - Apache HTTP on mw1106 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 2.459 second response time [16:44:44] RECOVERY - Apache HTTP on mw1091 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 3.516 second response time [16:44:44] RECOVERY - Apache HTTP on mw1061 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.079 second response time [16:44:47] Nothing was being rolled out to enwiki [16:44:48] he didn't do the enwp deploy yet though? [16:44:53] RECOVERY - Apache HTTP on mw1209 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 2.105 second response time [16:44:53] RECOVERY - Apache HTTP on mw1022 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.423 second response time [16:44:53] RECOVERY - Apache HTTP on mw1073 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.083 second response time [16:44:56] legoktm: right [16:45:03] RECOVERY - Apache HTTP on mw1054 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.235 second response time [16:45:06] It should've only been testwiki to 1.23wmf13 [16:45:13] RECOVERY - Apache HTTP on mw1019 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.067 second response time [16:45:13] RECOVERY - Apache HTTP on mw1062 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 9.690 second response time [16:45:14] legoktm: at least, that he thought, there's new scap in place [16:45:19] enwiki was switched to wmf12 earlier [16:45:21] For some reason 1.23wmf13 was going to mediawikiwiki too [16:45:24] RECOVERY - Apache HTTP on mw1052 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.084 second response time [16:45:33] RECOVERY - Apache 
HTTP on mw1082 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 6.215 second response time [16:45:33] RECOVERY - Apache HTTP on mw1030 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 3.948 second response time [16:45:35] MaxSem: No it wasn't [16:45:43] RECOVERY - Apache HTTP on mw1184 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.055 second response time [16:45:43] RECOVERY - Apache HTTP on mw1043 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.076 second response time [16:45:43] RECOVERY - Apache HTTP on mw1039 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.085 second response time [16:45:43] RECOVERY - Apache HTTP on mw1047 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.094 second response time [16:45:43] RECOVERY - Apache HTTP on mw1111 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.127 second response time [16:45:44] RECOVERY - Apache HTTP on mw1038 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.059 second response time [16:45:44] RECOVERY - Apache HTTP on mw1181 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.052 second response time [16:45:45] RECOVERY - Apache HTTP on mw1067 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.649 second response time [16:45:45] RECOVERY - Apache HTTP on mw1100 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.071 second response time [16:45:53] RECOVERY - Apache HTTP on mw1166 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 9.788 second response time [16:45:53] RECOVERY - Apache HTTP on mw1211 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.046 second response time [16:45:53] RECOVERY - Apache HTTP on mw1053 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.942 second response time [16:46:13] RECOVERY - Apache HTTP on mw1083 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.096 second response time 
[16:46:13] RECOVERY - Apache HTTP on mw1028 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.068 second response time [16:46:33] RECOVERY - Apache HTTP on mw1075 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.098 second response time [16:46:43] RECOVERY - Apache HTTP on mw1169 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.052 second response time [16:46:43] RECOVERY - Apache HTTP on mw1077 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.071 second response time [16:46:43] RECOVERY - Apache HTTP on mw1110 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.076 second response time [16:47:34] :D [16:48:25] 2014-02-06 16:48:10 mw1174 mediawikiwiki: [627387b9] /wiki/Special:CentralAutoLogin/createSession?token=e091e533ecb821612b11518a7d4bfbfb&type=1x1&from=enwiki&proto=https Exception from line 468 of /usr/local/apache/common-local/php-1.23wmf13/includes/cache/LocalisationCache.php: No localisation cache found for English. Please run maintenance/rebuildLocalisationCache.php. [16:48:26] Again [16:48:28] Reedy: can you explain what was supposed to happen and what happened? [16:48:39] wtf is putting mediawikiwiki on wmf13 [16:49:10] are you aware scap was replaced (I think) yesterday? [16:49:41] paravoid: 17.34 < Reedy> There's nothing changed at all for mediawkiwiki [16:49:42] reedy@tin:/usr/local/apache/common$ grep mediawikiwiki wikiversions.dat [16:49:42] mediawikiwiki php-1.23wmf12 * [16:50:03] scap doesn't change mediawikiwiki versions though [16:50:10] well, it doesn't modify wikiversions.dat [16:50:49] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: [16:50:55] new scap may misinterpret though? [16:50:57] Logged the message, Master [16:51:01] that's a lot of math generation btw [16:51:03] misinterpret what? 
[16:51:19] the dat [16:51:22] No [16:51:24] * greg-g shrugs [16:51:24] 1.23wmf12 doesn't magically become 1.23wmf13 [16:51:52] the scap source on tin doesn't have mediawikiwiki set to 1.23wmf13 either [16:52:03] greg-g, it's MWMultiversion that decides which wiki runs which code [16:52:21] * greg-g nods [16:53:22] testwiki is broken due to it not having built the l10n cache for whatever reason [16:56:27] scap has been running for 50 minutes now [16:57:30] is it actually running the new python version? [16:57:41] I'm guessing so [16:57:46] The output at the start was different [16:57:47] greg-g: Yes. Ori switched it last night [16:58:01] The old version is available as scap-old [16:58:31] I couldn't tell from the operations/puppet repo [16:59:14] it's not puppet anymore [16:59:21] as of two days ago [16:59:34] https://gerrit.wikimedia.org/r/#/q/project:mediawiki/tools/scap,n,z [16:59:34] right, but there's no indication that those files in puppet (in the scap dir) aren't used/put in place [16:59:39] yeah [16:59:39] I just found that out yesterday [16:59:48] accidentally too :) [16:59:50] https://git.wikimedia.org/tree/operations%2Fpuppet.git/d103c5719996b42fd3b5335d861cbcdc585d86a1/files%2Fscap [17:00:09] testwiki is now fixed (in the meantime)... [17:00:58] Here's a logstash dashboard of the "Unable to allocate memory for pool" mess -- https://logstash.wikimedia.org/index.html#/dashboard/elasticsearch/memory_pool [17:01:06] scap is still apparently pushing to EQIAD hosts [17:01:08] that wasn't the issue though [17:01:12] paravoid: how would one tell what puts scap on tin? I mean, I see bashscap in that puppet repo, how would I know to look somewhere else? (/me is basically asking how to interpret puppet here, I guess) [17:01:29] greg-g: I have no idea what has changed in which ways [17:01:34] * greg-g nods [17:01:35] k [17:01:59] bd808: What caused it is fairly obvious (trying to run 3 versions of MW simultaneously). 
Why that was happening (mediawikiwiki running 1.23wmf13) is a differnt matter [17:01:59] The scripts are in /srv/scap [17:02:25] We usually get that as a transient error as the wikipedias come onto the newer version, followed by group0 coming onto the newly deployed version [17:02:28] shit, wikidata call [17:02:38] puppet symlinks the files in /srv/scap/bin to /usr/local/bin [17:02:39] cbd building [17:02:55] Sorry, laptop rebooted. What did I miss? Are we recovering? [17:02:59] Reedy: you're excused from it, btw ;) [17:03:13] guillom: It should be fine now... [17:03:18] Okay; thanks [17:04:02] (PS1) ArielGlenn: giving back tantalum, taking an already named spare [operations/dns] - https://gerrit.wikimedia.org/r/111818 [17:05:33] (CR) ArielGlenn: [C: 2] giving back tantalum, taking an already named spare [operations/dns] - https://gerrit.wikimedia.org/r/111818 (owner: ArielGlenn) [17:12:46] 1 hour 5 minutes; still running [17:17:44] !log reedy finished scap: testwiki to 1.23wmf13 and build l10n cache (duration: 69m 51s) [17:17:51] Logged the message, Master [17:17:58] ouch [17:18:26] wow [17:18:27] WIN [17:18:36] python is soooo fast ;) [17:18:46] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: testwiki back to 1.23wmf12 till window [17:18:54] let's try Node.js? [17:18:54] Logged the message, Master [17:19:20] MaxSem: :P [17:20:13] Reedy: did you figure out what happened with mediawikiwiki? [17:22:38] paravoid: Nope [17:23:10] The diff in the /a/common working copy was only changing testwiki to 1.23wmf13 [17:23:43] RECOVERY - RAID on ms-be11 is OK: OK: optimal, 13 logical, 13 physical [17:27:12] (PS1) ArielGlenn: add caesium, internal ip [operations/dns] - https://gerrit.wikimedia.org/r/111819 [17:28:43] (CR) ArielGlenn: [C: 2] add caesium, internal ip [operations/dns] - https://gerrit.wikimedia.org/r/111819 (owner: ArielGlenn) [17:29:39] Reedy: can I ask you to write a "what happened.... 
as far as we know" email to engineering@? [17:29:57] s/can I ask you to/please/ # :) [17:37:14] (PS14) Alexandros Kosiaris: etherpad: convert into a module [operations/puppet] - https://gerrit.wikimedia.org/r/107567 (owner: Matanya) [17:41:34] It would seem there are numerous apache segfaults going on.. [17:41:54] Duh, all 10.64.32.35 [17:42:16] mw1165 [17:42:25] Can someone have a look? And/or depool it? [17:42:58] !log mw1165 is segfaulting a lot [17:43:05] Logged the message, Master [17:43:05] looking [17:44:23] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000 [17:44:59] !log depooling mw1165 [17:45:08] Logged the message, Master [17:45:31] I don't see segfaults, though [17:46:37] They look to have stopped [17:46:39] Feb 6 17:45:22 10.64.32.35 apache2[1821]: [notice] child pid 6830 exit signal Segmentation fault (11) [17:46:39] Feb 6 17:45:23 10.64.32.35 apache2[1821]: [notice] child pid 6419 exit signal Segmentation fault (11) [17:46:39] Feb 6 17:45:25 10.64.32.35 apache2[1821]: [notice] child pid 4743 exit signal Segmentation fault (11) [17:46:53] reedy@fenari:~$ tail -n 1000 /home/wikipedia/syslog/apache.log | grep -i Segmentation [17:47:08] oh right [17:56:23] hmmm [17:56:28] the 'wlnote' message at pl.wp disappeared [17:56:39] Reedy: halp [17:56:50] i can see it on en.wp for example [17:57:03] https://en.wikipedia.org/wiki/MediaWiki:Wlnote vs https://pl.wikipedia.org/wiki/MediaWiki:Wlnote [17:57:50] greg-g: ^ [17:58:00] impatient much? [17:58:23] weird [17:58:25] :) [17:59:13] Did it disappear sometime in the last few hours? 
[17:59:37] I'm tempted not to do anything till plwiki is on 1.23wmf12 (about an hour from now) [17:59:48] Reedy: yes, last minutes even i'd say [18:00:06] this by itself is not much of a problem [18:00:16] i'm just hoping it's the only message that magically disappeared :) [18:00:49] Someone should add scap to the blame wheel [18:02:20] (PS1) Odder: Add National Library of Wales to wgCopyUploadsDomains [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/111827 [18:02:57] (PS2) Odder: Add National Library of Wales to wgCopyUploadsDomains [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/111827 [18:07:52] (PS1) ArielGlenn: remove tantalum, add caesium to dhcp and netboot for install [operations/puppet] - https://gerrit.wikimedia.org/r/111828 [18:10:41] (CR) Alexandros Kosiaris: [C: -1] "I am pretty sure this wont work. The script needs to be executed on the LVS and not on the icinga host (which is what this is doing). This" (3 comments) [operations/puppet] - https://gerrit.wikimedia.org/r/111163 (owner: Ori.livneh) [18:11:43] (CR) ArielGlenn: [C: 2] remove tantalum, add caesium to dhcp and netboot for install [operations/puppet] - https://gerrit.wikimedia.org/r/111828 (owner: ArielGlenn) [18:13:48] (PS1) Chad: Experimentally enable interwiki searches for beta enwiki [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/111829 [18:15:29] (CR) Manybubbles: [C: 1] "Fine with me. Merge when you feel it is safe." 
[operations/mediawiki-config] - https://gerrit.wikimedia.org/r/111829 (owner: Chad) [18:25:24] (PS1) Nuria: Fixing domain name for erbium host [operations/puppet] - https://gerrit.wikimedia.org/r/111830 [18:27:54] (CR) Chad: [C: 2] Experimentally enable interwiki searches for beta enwiki [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/111829 (owner: Chad) [18:28:01] (Merged) jenkins-bot: Experimentally enable interwiki searches for beta enwiki [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/111829 (owner: Chad) [18:30:45] (PS2) Nuria: Fixing domain name for erbium host [operations/puppet] - https://gerrit.wikimedia.org/r/111830 [18:30:49] (CR) Ottomata: [C: 2 V: 2] Fixing domain name for erbium host [operations/puppet] - https://gerrit.wikimedia.org/r/111830 (owner: Nuria) [18:36:46] !log demon synchronized wmf-config/CirrusSearch-labs.php 'Enabled interwiki searches for beta -- no-op in prod' [18:36:54] Logged the message, Master [18:56:33] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:02:26] bblack, around? [19:02:45] bblack, did you update varnish ver in prod? [19:02:48] yurik: only for a few more minutes, then gone till monday [19:02:59] to what? [19:03:16] bblack, hi! to the latest ? or to 3.0.4/5? [19:03:19] not 3.0.5 if that's what you're asking [19:03:28] just to 3.0.3 with latest fixups [19:03:30] i just saw that zero was updated [19:03:35] zero.erb [19:03:48] I pushed that update because we're running 3.0.5 in ulsfo on the same config [19:03:59] so yeah, I guess 3.0.5 is in prod on ulsfo, as ulsfo is just coming into prod again [19:04:08] cool, thx, could you update the beta cluster? 
[19:04:23] no, but you can if you want :) [19:04:42] I really have to run out the door in a few minutes, I have a car loaded on a trailer and the truck engine is running [19:04:52] no rush :) [19:04:57] apt-get install varnish varnish-dbg libvarnishapi1 [19:04:58] will do it on monday [19:05:02] it's in our repo [19:05:21] do i need to specify a full path of any sort to the repo? [19:05:24] no [19:05:31] just the apt-get command above [19:05:31] just what you wrote? cool, thx [19:05:37] excellent [19:05:41] have a good vacation! [19:05:44] weekend [19:05:48] :) [19:05:51] there's always a chance the upgrade will fail when it tries to restart varnish, due to mmap address error [19:06:02] if that happens, do "apt-get -f install" repeatedly until it randomly succeeds [19:06:07] bleh [19:06:09] thx [19:06:10] :) [19:06:15] maybe i should wait for you :))) [19:07:33] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 488605 bytes in 8.453 second response time [19:08:05] (PS9) Yurik: Handle HTTPS for Zero traffic [operations/puppet] - https://gerrit.wikimedia.org/r/102316 [19:10:07] (PS15) Alexandros Kosiaris: etherpad: convert into a module [operations/puppet] - https://gerrit.wikimedia.org/r/107567 (owner: Matanya) [19:11:16] (PS10) Yurik: Handle HTTPS for Zero traffic [operations/puppet] - https://gerrit.wikimedia.org/r/102316 [19:11:18] (CR) MaxSem: [C: -2] "This extension's repo in VCS is only 9 hours old and contains only boilerplate - should be deployed only after there's actual code and it " [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/111765 (owner: Yuvipanda) [19:11:38] (PS2) Reedy: Wikipedias to 1.23wmf12 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/111801 [19:11:42] (CR) Reedy: [C: 2] Wikipedias to 1.23wmf12 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/111801 (owner: Reedy) [19:11:52] MaxSem: well, that's the point of betalabs, no? 
to get them out earlier and iterate? It has a featureflag that's off by default as well [19:11:56] (Merged) jenkins-bot: Wikipedias to 1.23wmf12 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/111801 (owner: Reedy) [19:12:11] Reedy: did you figure out what happened earlier? [19:12:11] (PS2) Reedy: Point php to 1.23wmf12 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/111802 [19:12:14] ori: Nope [19:12:22] (CR) Reedy: [C: 2] Point php to 1.23wmf12 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/111802 (owner: Reedy) [19:12:24] MaxSem: the actual code has been undergoing review for about 10 days now, as patches to VectorBeta. [19:12:28] (Merged) jenkins-bot: Point php to 1.23wmf12 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/111802 (owner: Reedy) [19:13:19] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Wikipedias to 1.23wmf12 [19:13:27] Logged the message, Master [19:13:46] Reedy: don't run scap-old! run scap! [19:13:53] I did [19:14:01] cool [19:14:03] And I didn't even know scap-old existed [19:14:18] yuvipanda, betalabs is preparation for production. there should be some kind of plans to deploy on WMF. and security review [19:14:50] MaxSem: this was initially supposed to go on with VectorBeta, which is alredy deployed. It was split out to keep concerns separate, and will be deployed as well. [19:15:10] (PS2) Reedy: Update group0 wikis to 1.23wmf13 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/111803 [19:15:43] see https://gerrit.wikimedia.org/r/#/c/109878/ and dependent commits [19:18:30] (PS16) Alexandros Kosiaris: etherpad: convert into a module [operations/puppet] - https://gerrit.wikimedia.org/r/107567 (owner: Matanya) [19:20:17] OK, so the confusion w/versions earlier had nothing to do with the new scap script [19:20:22] just putting that out there [19:20:25] (yay) [19:20:42] what happened? 
[19:22:02] (CR) Reedy: [C: 2] Update group0 wikis to 1.23wmf13 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/111803 (owner: Reedy) [19:22:08] (Merged) jenkins-bot: Update group0 wikis to 1.23wmf13 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/111803 (owner: Reedy) [19:22:40] MatmaRex: Try plwiki now? [19:23:08] Reedy: fixed at a glance, thanks [19:25:04] so, what was the issue with versions? [19:25:34] ori: ^ [19:25:46] (CR) Alexandros Kosiaris: [C: -1] "This is finally ready but it will cause puppet to fail on zirconium due to" [operations/puppet] - https://gerrit.wikimedia.org/r/107567 (owner: Matanya) [19:25:46] not sure yet [19:26:19] you did say it had nothing to do with new scap though? [19:26:52] It seems unlikely [19:26:55] what's with new scap? [19:27:09] the new script can't arbitariy change the version of MW a wiki is running itself [19:27:10] it's a new script? (and we now have scap-old?) [19:27:20] ok [19:27:24] aude: scap rewritten in python [19:27:53] because the deployments were becoming boring ~ [19:27:59] oooh [19:28:01] Noting only scap [19:28:11] Not scap-1, scap-2 and other scary scripts [19:28:13] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: group0 wikis to 1.23wmf13 [19:28:21] Logged the message, Master [19:28:33] paravoid: 10.64.32.35 is segfaulting again [19:28:42] must be new scap [19:28:48] :) [19:29:15] ha-ha [19:30:15] is the new scap responsible for scap taking 69m or do we have to look somewhere else for that? 
[19:30:49] it's a new version, that's how long it takes [19:30:54] so somewhere else as well [19:31:39] !log reedy started scap: run 2, should be a noop [19:31:46] Logged the message, Master [19:32:19] Reedy said "ouch" and MaxSem said "WIN" at the time, this is the only data I have on the subject [19:32:23] PROBLEM - Apache HTTP on mw1090 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:32:49] http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&me=Wikimedia&m=cpu_report&s=by+name&mc=2&g=load_report [19:32:53] doesn't look very noop [19:32:53] PROBLEM - Apache HTTP on mw1021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:33:23] PROBLEM - Apache HTTP on mw1083 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:33:29] ... [19:33:33] The script reports no l10n updates [19:33:36] PROBLEM - Apache HTTP on mw1107 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:33:36] PROBLEM - Apache HTTP on mw1088 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:33:36] PROBLEM - Apache HTTP on mw1052 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:33:36] PROBLEM - Apache HTTP on mw1076 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:33:36] PROBLEM - Apache HTTP on mw1030 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:33:36] PROBLEM - Apache HTTP on mw1055 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:33:37] but yeah, irony will fix everything [19:33:44] ctrl + c'd [19:33:47] PROBLEM - Apache HTTP on mw1215 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:33:47] PROBLEM - Apache HTTP on mw1024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:33:47] PROBLEM - Apache HTTP on mw1071 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:33:47] PROBLEM - Apache HTTP on mw1066 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:33:47] PROBLEM - Apache HTTP on mw1086 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:33:47] PROBLEM - 
Apache HTTP on mw1098 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:33:47] PROBLEM - Apache HTTP on mw1038 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:33:48] PROBLEM - Apache HTTP on mw1079 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:33:48] PROBLEM - Apache HTTP on mw1087 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:33:49] PROBLEM - Apache HTTP on mw1089 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:33:56] PROBLEM - Apache HTTP on mw1061 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:33:56] PROBLEM - Apache HTTP on mw1100 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:33:57] PROBLEM - Apache HTTP on mw1219 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:34:06] Of course there are going to be process count increases, rysnc still has to be run on the apaches [19:34:06] PROBLEM - Apache HTTP on mw1188 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:34:16] PROBLEM - Apache HTTP on mw1062 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:34:26] PROBLEM - Apache HTTP on mw1063 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:34:26] PROBLEM - Apache HTTP on mw1019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:34:26] PROBLEM - Apache HTTP on mw1036 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:34:36] PROBLEM - Apache HTTP on mw1213 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:34:36] PROBLEM - LVS HTTP IPv4 on appservers.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:34:39] PROBLEM - Apache HTTP on mw1026 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:34:45] yay pages [19:34:46] PROBLEM - Apache HTTP on mw1031 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:34:46] PROBLEM - Apache HTTP on mw1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:34:46] PROBLEM - Apache HTTP on mw1025 is CRITICAL: CRITICAL - Socket timeout 
after 10 seconds [19:34:46] PROBLEM - Apache HTTP on mw1049 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:34:46] PROBLEM - Apache HTTP on mw1060 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:34:47] PROBLEM - Apache HTTP on mw1039 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:34:47] PROBLEM - Apache HTTP on mw1047 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:34:48] PROBLEM - Apache HTTP on mw1029 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:34:48] PROBLEM - Apache HTTP on mw1050 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:34:49] PROBLEM - Apache HTTP on mw1092 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:34:49] PROBLEM - Apache HTTP on mw1106 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:34:56] PROBLEM - Apache HTTP on mw1104 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:34:56] PROBLEM - Apache HTTP on mw1020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:34:56] PROBLEM - Apache HTTP on mw1068 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:34:56] PROBLEM - Apache HTTP on mw1093 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:34:56] PROBLEM - Apache HTTP on mw1053 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:34:56] PROBLEM - Apache HTTP on mw1211 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:34:57] wth? [19:35:08] texvccheck [19:35:14] /bin/bash /usr/local/apache/common-local/php-1.23wmf12/includes/limit.sh /usr/local/apache/uncommon/bin/texvccheck 'W^{k,p}(\Omega)' MW_INCLUDE_STDERR=;MW_CPU_LIMIT=50; MW_CGROUP='/sys/fs/cgroup/memory/mediawiki/job'; MW_MEM_LIMIT=307200; MW_FILE_SIZE_LIMIT=102400; MW_WALL_CLOCK_LIMIT=180; MW_USE_LOG_PIPE=yes [19:35:18] http://p.defau.lt/?ttkQySf5OkEgGNzsSNV45g [19:35:36] RECOVERY - Apache HTTP on mw1025 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.378 second response time [19:35:38] wtf! 
[19:35:46] PROBLEM - Apache HTTP on mw1042 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:35:46] PROBLEM - Apache HTTP on mw1051 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:35:46] PROBLEM - Apache HTTP on mw1164 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:35:46] PROBLEM - Apache HTTP on mw1057 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:35:46] PROBLEM - Apache HTTP on mw1077 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:35:47] PROBLEM - Apache HTTP on mw1095 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:35:56] PROBLEM - Apache HTTP on mw1180 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:35:56] PROBLEM - Apache HTTP on mw1073 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:36:02] Those appeared last time [19:36:08] getting errors on en (unsurprising given the criticals) [19:36:10] But scap doesn't run them [19:36:24] hi [19:36:26] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [19:36:36] PROBLEM - Apache HTTP on mw1058 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:36:46] PROBLEM - Apache HTTP on mw1046 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:36:46] PROBLEM - Apache HTTP on mw1085 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:36:46] PROBLEM - Apache HTTP on mw1078 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:36:46] PROBLEM - Apache HTTP on mw1173 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:36:46] PROBLEM - Apache HTTP on mw1172 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:36:47] PROBLEM - Apache HTTP on mw1059 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:36:47] PROBLEM - Apache HTTP on mw1097 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:36:54] I am getting 503s on enwiki [19:36:57] What'd you do Reedy [19:37:01] me too [19:37:26] I did: [19:35:18] http://p.defau.lt/?ttkQySf5OkEgGNzsSNV45g 
[19:37:45] wm1165 is segfaulting quite often [19:37:45] RECOVERY - Apache HTTP on mw1077 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 5.808 second response time [19:37:46] PROBLEM - Apache HTTP on mw1074 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:37:46] PROBLEM - Apache HTTP on mw1111 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:37:46] PROBLEM - Apache HTTP on mw1182 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:37:51] bd808: it's depooled [19:37:55] PROBLEM - Apache HTTP on mw1218 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:38:05] PROBLEM - Apache HTTP on mw1168 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:38:05] Oh. [19:38:15] PROBLEM - Apache HTTP on mw1037 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:38:31] me too [19:38:40] paravoid: But still being talked to some how? 51 segfaults in last 15 minutes [19:38:41] matanya: you're depooled? [19:38:45] :) [19:38:45] PROBLEM - Apache HTTP on mw1084 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:38:46] PROBLEM - Apache HTTP on mw1186 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:38:57] bd808: icinga runs things against apache on each box [19:39:05] PROBLEM - Apache HTTP on mw1185 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:39:05] PROBLEM - Apache HTTP on mw1065 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:39:06] what Reedy ? 
[19:39:09] :/ [19:39:15] matanya: a context joke [19:39:15] PROBLEM - Apache HTTP on mw1109 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:39:24] wasn't following [19:39:34] sorry [19:39:35] RECOVERY - Apache HTTP on mw1074 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.067 second response time [19:39:41] +1 reedy [19:39:45] RECOVERY - Apache HTTP on mw1087 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.066 second response time [19:39:45] PROBLEM - Apache HTTP on mw1174 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:39:46] PROBLEM - Apache HTTP on mw1179 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:39:46] PROBLEM - Apache HTTP on mw1217 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:39:46] PROBLEM - Apache HTTP on mw1178 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:39:55] PROBLEM - Apache HTTP on mw1091 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:39:55] PROBLEM - Apache HTTP on mw1105 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:39:55] PROBLEM - Apache HTTP on mw1048 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:40:05] RECOVERY - Apache HTTP on mw1168 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 6.007 second response time [19:40:20] Just got a 504 (rather than 503) [19:40:22] I wonder if this is replicable by just running scap again [19:40:25] PROBLEM - Apache HTTP on mw1187 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:40:37] Reedy: let's try a little debugging before we try again ;) [19:40:38] nothing obvious in logs (or in error message), like a fatal error [19:40:41] I only get 503 [19:40:45] or exception [19:40:45] PROBLEM - Apache HTTP on mw1184 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:40:46] PROBLEM - Apache HTTP on mw1166 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:40:47] greg-g: I was going to wait for the apaches to recover first :P [19:40:54] 
I'm getting both 503 and 504s [19:40:55] PROBLEM - Apache HTTP on mw1018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:40:56] Not a fair test otherwise [19:40:57] "oh they're back? KILL THEM!" [19:40:58] it's coming from somwehere else probably [19:41:06] RECOVERY - Apache HTTP on mw1109 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.090 second response time [19:41:15] PROBLEM - Apache HTTP on mw1113 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:41:35] RECOVERY - Apache HTTP on mw1024 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.106 second response time [19:41:45] RECOVERY - Apache HTTP on mw1106 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.046 second response time [19:41:45] RECOVERY - Apache HTTP on mw1111 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 1.636 second response time [19:41:46] PROBLEM - Apache HTTP on mw1096 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:41:55] PROBLEM - Apache HTTP on mw1056 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:42:25] PROBLEM - Apache HTTP on mw1162 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:42:35] RECOVERY - Apache HTTP on mw1060 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.067 second response time [19:42:45] PROBLEM - Apache HTTP on mw1169 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:42:46] https://gdash.wikimedia.org/dashboards/reqerror/ [19:42:51] https://bugzilla.wikimedia.org/show_bug.cgi?id=60970 [19:42:55] PROBLEM - Apache HTTP on mw1087 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:42:56] PROBLEM - Apache HTTP on mw1209 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:42:56] PiRSquared: Not helpful [19:43:02] Reedy: [19:43:04] ? 
[19:43:05] RECOVERY - LVS HTTP IPv4 on appservers.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 65281 bytes in 0.322 second response time [19:43:22] We knew it was a problem 10 minutes ago [19:43:25] PROBLEM - Apache HTTP on mw1210 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:43:33] PiRSquared: he means he know about the problem [19:43:48] So should I just make it RESOLVED FIXED again? [19:43:54] No, because it's not fixed [19:43:55] PROBLEM - Apache HTTP on mw1214 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:43:57] I didn't say it was [19:44:00] which changes went out? [19:44:02] they need to be reverted [19:44:05] PROBLEM - Apache HTTP on mw1094 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:44:08] ori: in that scap? [19:44:11] anything remotely to do with jobs or texvc [19:44:11] yes [19:44:17] I didn't deploy any code (knowingly) [19:44:19] did any wikis get bumped to a new version? [19:44:25] PROBLEM - Apache HTTP on mw1028 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:44:26] not in the last 20 minutes [19:44:26] at all, today [19:44:29] at all [19:44:33] yeah, this morning [19:44:36] roll back [19:44:40] things looked fine on wmf12 enwiki [19:44:41] Reedy: ^ on it? 
[19:44:45] PROBLEM - Apache HTTP on mw1043 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:44:48] 19:28 logmsgbot: reedy rebuilt wikiversions.cdb and synchronized wikiversions files: group0 wikis to 1.23wmf13 [19:44:48] 19:13 logmsgbot: reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Wikipedias to 1.23wmf12 [19:44:50] before it exploded [19:44:55] PROBLEM - Apache HTTP on mw1027 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:45:08] it looked fine but now it doesn't; roll back [19:45:10] * aude checked stuff that uses wikidata heavily and all was good [19:45:32] this is texvc related, could be a specific page or formula [19:45:45] hmmm [19:45:46] RECOVERY - Apache HTTP on mw1086 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 9.065 second response time [19:45:46] PROBLEM - Apache HTTP on mw1032 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:45:46] PROBLEM - Apache HTTP on mw1167 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:45:46] PROBLEM - Apache HTTP on mw1074 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:45:46] RECOVERY - Apache HTTP on mw1093 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 2.097 second response time [19:45:48] But why did it become prevelant during running scap? 
[19:45:55] RECOVERY - Apache HTTP on mw1061 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.034 second response time [19:45:56] IIRC paravoid saw something simlar earlier [19:45:58] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Revert earlier version changes [19:46:05] Logged the message, Master [19:46:12] some wikis are back [19:46:15] PROBLEM - LVS HTTP IPv4 on appservers.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:46:27] matanya: I have been getting through to wikis about 50% of the time no problem [19:46:34] hi folks, it looks like this might be a case for https://wikitech.wikimedia.org/wiki/Incident_response#Communicating_with_the_public ? [19:46:45] RECOVERY - Apache HTTP on mw1074 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 9.571 second response time [19:46:45] PROBLEM - Apache HTTP on mw1064 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:46:47] would this work for a tweet from @wikimedia and @wikipedia: [19:46:50] coming back [19:46:51] We are currently experiencing issues across Wikimedia sites. Our operations team is addressing them. 
[19:47:03] HaeB: sure [19:47:05] RECOVERY - Apache HTTP on mw1076 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 9.448 second response time [19:47:05] RECOVERY - Apache HTTP on mw1037 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 5.930 second response time [19:47:15] PROBLEM - Apache HTTP on mw1080 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:47:15] RECOVERY - Apache HTTP on mw1028 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.061 second response time [19:47:35] RECOVERY - Apache HTTP on mw1046 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.055 second response time [19:47:35] RECOVERY - Apache HTTP on mw1064 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.066 second response time [19:47:35] RECOVERY - Apache HTTP on mw1184 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.059 second response time [19:47:35] RECOVERY - Apache HTTP on mw1178 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.049 second response time [19:47:35] RECOVERY - Apache HTTP on mw1182 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.057 second response time [19:47:35] RECOVERY - Apache HTTP on mw1049 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.103 second response time [19:47:45] RECOVERY - Apache HTTP on mw1089 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.048 second response time [19:47:45] RECOVERY - Apache HTTP on mw1050 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.066 second response time [19:47:45] RECOVERY - Apache HTTP on mw1104 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 2.057 second response time [19:47:46] RECOVERY - Apache HTTP on mw1209 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 2.430 second response time [19:47:53] everything appears to be jumping back to wmf11 now :) [19:47:55] RECOVERY - Apache HTTP on mw1214 is OK: HTTP OK: HTTP/1.1 301 Moved 
Permanently - 808 bytes in 3.325 second response time [19:47:55] RECOVERY - Apache HTTP on mw1219 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 3.919 second response time [19:47:55] RECOVERY - Apache HTTP on mw1094 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 1.921 second response time [19:47:56] RECOVERY - Apache HTTP on mw1026 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.077 second response time [19:47:56] RECOVERY - Apache HTTP on mw1213 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 9.177 second response time [19:48:05] RECOVERY - Apache HTTP on mw1019 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.061 second response time [19:48:05] RECOVERY - Apache HTTP on mw1080 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.061 second response time [19:48:05] PROBLEM - Apache HTTP on mw1054 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:48:05] RECOVERY - LVS HTTP IPv4 on appservers.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 65321 bytes in 0.320 second response time [19:48:10] addshore: These things don't happen magically [19:48:15] RECOVERY - Apache HTTP on mw1083 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.084 second response time [19:48:15] RECOVERY - Apache HTTP on mw1090 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 5.295 second response time [19:48:25] RECOVERY - Apache HTTP on mw1210 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 8.904 second response time [19:48:34] I always assumed it was the hamsters ;p [19:48:35] RECOVERY - Apache HTTP on mw1215 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.042 second response time [19:48:35] RECOVERY - Apache HTTP on mw1166 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.053 second response time [19:48:35] RECOVERY - Apache HTTP on mw1173 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.051 second response time 
[19:48:35] RECOVERY - Apache HTTP on mw1167 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.055 second response time [19:48:35] RECOVERY - Apache HTTP on mw1179 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.053 second response time [19:48:35] RECOVERY - Apache HTTP on mw1217 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.054 second response time [19:48:35] RECOVERY - Apache HTTP on mw1031 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.067 second response time [19:48:36] RECOVERY - Apache HTTP on mw1096 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.071 second response time [19:48:36] RECOVERY - Apache HTTP on mw1039 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.084 second response time [19:48:37] RECOVERY - Apache HTTP on mw1066 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.075 second response time [19:48:45] RECOVERY - Apache HTTP on mw1018 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.046 second response time [19:48:45] RECOVERY - Apache HTTP on mw1091 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.061 second response time [19:48:45] RECOVERY - Apache HTTP on mw1057 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.078 second response time [19:48:45] RECOVERY - Apache HTTP on mw1051 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 2.845 second response time [19:48:45] RECOVERY - Apache HTTP on mw1043 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.652 second response time [19:48:45] RECOVERY - Apache HTTP on mw1078 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 5.856 second response time [19:48:45] RECOVERY - Apache HTTP on mw1169 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 5.951 second response time [19:48:46] RECOVERY - Apache HTTP on mw1036 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.065 second response time 
[19:48:46] RECOVERY - Apache HTTP on mw1021 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 8.010 second response time [19:48:47] RECOVERY - Apache HTTP on mw1180 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.068 second response time [19:48:47] RECOVERY - Apache HTTP on mw1020 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.120 second response time [19:48:48] RECOVERY - Apache HTTP on mw1087 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.170 second response time [19:48:48] RECOVERY - Apache HTTP on mw1085 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 8.490 second response time [19:48:50] weeeee [19:48:51] addshore: i think it is better no to bother ops now [19:48:55] RECOVERY - Apache HTTP on mw1185 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.047 second response time [19:48:55] RECOVERY - Apache HTTP on mw1211 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.068 second response time [19:48:55] RECOVERY - Apache HTTP on mw1188 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.053 second response time [19:48:55] RECOVERY - Apache HTTP on mw1065 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.079 second response time [19:48:55] RECOVERY - Apache HTTP on mw1053 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 5.560 second response time [19:48:56] *not [19:49:02] nah, ops didnt cause or fix this outage ;] [19:49:05] RECOVERY - Apache HTTP on mw1073 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 9.183 second response time [19:49:05] RECOVERY - Apache HTTP on mw1052 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 1.224 second response time [19:49:15] RECOVERY - Apache HTTP on mw1162 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.965 second response time [19:49:15] RECOVERY - Apache HTTP on mw1062 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 8.846 second 
response time [19:49:17] yeah, i mean the person in charge [19:49:25] devs are fixin =] [19:49:27] feel free to retweet from your personal accounts: https://twitter.com/Wikimedia/status/431514566975959040 [19:49:35] RECOVERY - Apache HTTP on mw1084 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.064 second response time [19:49:35] RECOVERY - Apache HTTP on mw1023 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.070 second response time [19:49:35] RECOVERY - Apache HTTP on mw1032 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.072 second response time [19:49:39] Reedy: Working now? [19:49:40] i assume it was mediawiki related push that is now being tweaked but i just got back to keyboard, so dunno [19:49:44] rdwrer: I have no idea [19:49:45] RECOVERY - Apache HTTP on mw1038 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.054 second response time [19:49:45] RECOVERY - Apache HTTP on mw1105 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.069 second response time [19:49:45] RECOVERY - Apache HTTP on mw1056 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 2.484 second response time [19:49:45] RECOVERY - Apache HTTP on mw1029 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 2.508 second response time [19:49:45] RECOVERY - Apache HTTP on mw1186 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.243 second response time [19:49:45] RECOVERY - Apache HTTP on mw1027 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 3.985 second response time [19:49:45] RECOVERY - Apache HTTP on mw1097 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.458 second response time [19:49:46] RECOVERY - Apache HTTP on mw1095 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 6.260 second response time [19:49:46] RECOVERY - Apache HTTP on mw1088 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.092 second response time [19:49:47] 
RECOVERY - Apache HTTP on mw1030 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 2.280 second response time [19:49:49] Heh [19:49:55] RECOVERY - Apache HTTP on mw1079 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 9.987 second response time [19:49:55] RECOVERY - Apache HTTP on mw1054 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.065 second response time [19:49:56] RECOVERY - Apache HTTP on mw1055 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.370 second response time [19:49:56] RECOVERY - Apache HTTP on mw1107 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 3.094 second response time [19:50:05] RECOVERY - Apache HTTP on mw1113 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.491 second response time [19:50:07] i know how outages can be stressful, when everybody is behind your back waiting for you to rescue the world [19:50:08] rdwrer: things should be getting better [19:50:15] RECOVERY - Apache HTTP on mw1187 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 1.504 second response time [19:50:19] I note the apaches recovered earlier when scap was still running [19:50:25] RECOVERY - Apache HTTP on mw1058 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.061 second response time [19:50:32] Yay, APC spam [19:50:35] RECOVERY - Apache HTTP on mw1071 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.062 second response time [19:50:35] RECOVERY - Apache HTTP on mw1047 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.065 second response time [19:50:35] RECOVERY - Apache HTTP on mw1174 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.045 second response time [19:50:35] RECOVERY - Apache HTTP on mw1098 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.062 second response time [19:50:35] RECOVERY - Apache HTTP on mw1042 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.064 second response 
time [19:50:35] RECOVERY - Apache HTTP on mw1164 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.066 second response time [19:50:35] RECOVERY - Apache HTTP on mw1172 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.073 second response time [19:50:45] RECOVERY - Apache HTTP on mw1063 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.062 second response time [19:50:46] RECOVERY - Apache HTTP on mw1218 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.068 second response time [19:50:46] RECOVERY - Apache HTTP on mw1048 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.068 second response time [19:50:46] RECOVERY - Apache HTTP on mw1100 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.078 second response time [19:51:32] ok, let's start noting relevant details here: https://etherpad.wikimedia.org/p/Feb6Outage devs and ops only please. [19:51:35] RECOVERY - Apache HTTP on mw1059 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.068 second response time [19:51:45] RECOVERY - Apache HTTP on mw1092 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.083 second response time [19:51:46] RECOVERY - Apache HTTP on mw1068 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.062 second response time [19:53:45] job queue is still increasing [19:54:37] ok i saw it go down a little, confirmed jobs are being processed [19:55:29] the runners aren't doing so hot: https://ganglia.wikimedia.org/latest/?c=Jobrunners%20eqiad&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2 [19:55:49] probably just a repurcusion of the apaches dieing [19:58:17] <^d> greg-g +1, that's probably it. [19:58:24] <^d> If they don't start rebounding, then we can worry. 
[19:59:41] <^d> https://gdash.wikimedia.org/dashboards/jobq/ has add/removals roughly in line [20:03:34] Here's a logstash search showing texvc going nuts both times: https://logstash.wikimedia.org/index.html#dashboard/temp/Q2lRL7sHR_y5yvxiYLQGSA [20:06:08] bd808: mind uploading to commons or such? [20:06:20] a logstash? [20:06:20] https://www.mediawiki.org/w/index.php?title=MediaWiki_1.23/wmf12/Changelog#Math [20:06:31] twkozlowski: what is logstash? [20:06:33] matanya: It's not in a format that's useable [20:06:49] PiRSquared: http://lmgtfy.com/?q=logstash&l=1 [20:07:28] https://gerrit.wikimedia.org/r/#/c/104991/ [20:07:32] i meant a print screen, but whatever [20:07:46] it seems that we are in the clear now so that we could send a followup tweet from @wikimedia and @wikipedia ? How about this: [20:07:49] "All our sites are back up after brief period of issues https://gdash.wikimedia.org/dashboards/reqerror/ Thanks for your patience!" [20:07:59] yeah, a screencap could be fine [20:08:02] greg-g: Are we meant to be back on wmf11 for phase2 wikis? [20:08:03] matanya: It's really not much use without being able to access it [20:08:09] James_F: Yes [20:08:12] OK. [20:08:18] * James_F stops trying to debug then. [20:08:26] We're not exactly sure if it is/was related/at fault [20:08:30] * James_F nods. [20:08:42] James_F: stuff be crazy [20:08:43] texvccheck seems to have only got upset when scap started (both times around) [20:08:46] http://imgur.com/aABo2M6 [20:08:49] matanya: ^ [20:08:59] no private data there [20:09:07] thanks greg-g [20:10:23] oooh, texvccheck, deja vu. 
we caught its absence on beta labs but not the performance problem [20:10:26] seems chrome only tells me the last visit time and i have no way to find when i visited enwiki before when it was good [20:10:36] greg-g: Thanks that was the same shot I had on my desktop [20:10:38] * aude poking at sqlite history db [20:10:53] can't give a timestamp [20:12:29] bd808: shutter rocks [20:12:40] (gtk screenshot app) [20:12:47] i agree greg-g [20:12:53] it is so great [20:15:19] * gwicke hopes that we can move math off the cluster soon [20:15:36] when is scap-recompile supposed to run? [20:16:01] "manually" [20:16:08] But the answer is never now [20:16:15] I moved it to a version independent directory [20:17:25] At one point, it was run on every scap run [20:17:38] ok, sending out the followup tweet now unless there are objections [20:18:44] ok, wmf12 was fine at 19:21:35 UTC on enwiki [20:19:06] if that helps at all... [20:19:34] That's about 8 minutes later [20:19:41] The problems *seem* to have started when I ran scap again [20:19:52] yes and before phase0 wikis on wmf13 [20:19:52] 19:31 logmsgbot: reedy started scap: run 2, should be a noop [20:19:52] 19:28 logmsgbot: reedy rebuilt wikiversions.cdb and synchronized wikiversions files: group0 wikis to 1.23wmf13 [20:19:52] 19:13 logmsgbot: reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Wikipedias to 1.23wmf12 [20:20:01] scap was after [20:20:08] oh ok [20:20:50] AFK for 10-15 [20:20:57] * aude nods [20:21:49] ottomata: regarding $fundraising_log_directory at templates/udp2log/filters.erbium.erb how would you like yo puppetize3 it? [20:22:10] We could run scap again and see if the problem occurs within a few minutes ;) [20:22:20] Or move the wikis over again and wait half an hour or so without doing anything else [20:22:35] the shouting method [20:23:04] no, let's not do that yet [20:23:06] still no idea of cause? 
[20:23:08] oh, hm, matanya we should use the template_variables parameter [20:23:10] some fun things to fix meanwhile [20:23:11] [06-Feb-2014 19:57:55] Fatal error: Access to undeclared static property: AbuseFilter::$editboxName at /usr/local/apache/common-local/php-1.23wmf12/extensions/AbuseFilter/AbuseFilter.hooks.php on line 609 [20:23:21] hoo: legoktm ^ [20:23:25] [06-Feb-2014 19:57:56] Fatal error: Using $this when not in object context at /usr/local/apache/common-local/php-1.23wmf12/includes/deferred/SquidUpdate.php on line 1782 [20:23:29] looking [20:23:30] Reedy: It looks to me like the texvc log message rate started to climb at 19:13 [20:23:38] [06-Feb-2014 19:57:13] Fatal error: Call to a member function translate() on a non-object at /usr/local/apache/common-local/php-1.23wmf12/includes/deferred/SquidUpdate.php on line 1416 [20:23:45] … [20:23:45] template_variables => { 'fundraising_log_directory' => balblablabal' }, [20:23:45] … [20:23:47] and in template [20:23:51] bd808: Ah. Which matches with wikipedia version change [20:23:51] [06-Feb-2014 19:57:08] Catchable fatal error: Argument 2 passed to ^B::^B() must be an instance of H<89>\$H<89>l$H<89>L<89>d$L<89>l$ [20:23:51] H<83>HH<8B>/H<8B>G8D<8B>e0I^AA^OT$^T<80>^CtD<80>^Gv'E1<80>^C^O<87>, boolean given, called in /usr/local/apache/common-local/php-1.23wmf12/extensions/Math/MathTexvc.php on line 314 and defined at on line 39 [20:23:58] Reedy: Yes [20:24:00] <%= @template_variables['fundraising_log_directory'] %> [20:24:01] bd808: not during the previous scap? [20:24:02] that one is a math thing [20:24:25] greg-g: I'll zoom out and look again [20:24:32] bd808: Did you look earlier? 
1610 onwards [20:25:03] [06-Feb-2014 21:02:33] Catchable fatal error: Argument 1 passed to varValue::() must be an instance of _context, array given, called in /usr/local/apache/common-local/php-1.23wmf12/extensions/Math/MathTexvc.php on line 314 and defined at on line 39 [20:25:04] There was another bump starting at ~16:17 [20:25:21] hello, fatal, file and line number [20:25:27] Ending at ~16:56 [20:25:52] As for the squid update ones, the file is a lot smaller in master [20:27:16] So, I'm late to the party. Did this happen during a deployment, or just sort of randomly? [20:27:20] bd808: that makes sense [20:27:27] zellfaze: deployment [20:27:39] reedy@ubuntu64-web-esxi:~/git/mediawiki/core$ wc -l includes/deferred/SquidUpdate.php [20:27:39] 311 includes/deferred/SquidUpdate.php [20:28:07] reedy@tin:/a/common$ wc -l php-1.23wmf12/includes/deferred/SquidUpdate.php [20:28:07] 311 php-1.23wmf12/includes/deferred/SquidUpdate.php [20:28:25] How is there errors on lines 1416 and 1782? [20:31:55] * ori heads to the office [20:33:19] getting fatals on abuse filters [20:33:39] legoktm: ^ [20:33:45] uhoh [20:33:48] i filed the bug [20:33:51] let me actually look at it now [20:34:09] it saves the change, but gives a blank page [20:34:13] on it [20:34:16] thanks [20:34:34] heya bd808, i'd like to keep momentum on the jvm deployment project [20:34:59] should we sync about what should be done? is that part of your overall deployment scheming stuff? [20:35:10] matanya: after you save a filter? [20:35:11] are you on the ops@ mailling list? [20:35:15] yes legoktm [20:35:19] k [20:35:34] ottomata: I am on ops list and it is related [20:35:35] mostly after modifing [20:35:51] matanya: is it every time? 
[20:35:58] I'm not able to reproduce locally...hrm [20:36:14] aye, so [20:36:36] no one replied so much about the git-annex issues [20:36:37] (PS1) QChris: Reflect emery -> erbium move in documentation [operations/puppet] - https://gerrit.wikimedia.org/r/111841 [20:36:40] i dont' think I want to use it [20:36:44] which means I need to do something else [20:36:54] i could just go ahead and start working on it [20:37:12] build in support to git deploy to wget urls and checksum them [20:37:17] and then put them somewhere in local repo [20:37:18] or something [20:37:42] ottomata: What about good old rsync? [20:37:48] yes legoktm every time [20:38:25] thanks qchris [20:38:40] (CR) Matanya: [C: 1] Reflect emery -> erbium move in documentation [operations/puppet] - https://gerrit.wikimedia.org/r/111841 (owner: QChris) [20:38:58] (CR) Ottomata: [C: -1] lucene: puppet 3 compatibility fix: fully qualify variable (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/111520 (owner: Matanya) [20:38:59] thanks matanya :-) [20:39:08] well bd808 [20:39:10] rsync how? [20:39:16] right now git-deploy uses git, right? [20:39:29] git fetch and merge on minions [20:39:40] ottomata: just curious (as an avid fan of both git-annex and joey hess) why not git-annex? [20:39:52] greg-g are you on ops@ list? [20:39:55] that code path hasn't changed since 2011... [20:40:05] * greg-g looks [20:40:26] subject? [20:40:29] greg-g: Subject "JVM App Deployment" I think [20:40:31] uhh [20:40:31] yeah [20:40:33] that's it [20:40:38] legoktm: want to like on a live server when i change? 
[20:40:49] I don't have shell access :/ [20:40:50] *look [20:40:55] I just can look at the exception logs [20:41:07] * matanya looks around to find ops :) [20:41:49] and no hoo :( [20:41:59] (CR) Ottomata: [C: 2 V: 2] Reflect emery -> erbium move in documentation [operations/puppet] - https://gerrit.wikimedia.org/r/111841 (owner: QChris) [20:42:00] ottomata: so you mean copy all the puppet code until the "class service" part? [20:42:14] server [20:42:18] ottomata: I think the assertion that git-annex isn't good for 3 and 4 is wrong [20:42:39] matanya: everythign that was in config class shoudl go in server class, right? [20:42:41] you can have both annex'd files and normal ol' git tracked files in the same repo [20:42:47] yes ottomata [20:42:50] k [20:43:01] though i did get lost in the ,multi-nested crap [20:43:14] (PS4) JanZerebecki: turn planet into a module [operations/puppet] - https://gerrit.wikimedia.org/r/108674 (owner: Dzahn) [20:43:18] greg-g: how so? [20:43:22] eg, in my Podcasts folder I have "feeds" which is a text file, tracked by git, containing podcast urls, then git-annex downloads/manages the podcast episodes for me [20:43:24] ah yes [20:43:26] that is not the problem [20:43:27] ottomata: lemme find the documentation [20:43:29] oh [20:43:45] the problem is reviewing the changes [20:43:57] and the pushing of local laptop info to the repo [20:44:02] oh, that issue [20:44:04] the metadata would say stuff like [20:44:14] This includes our personal laptops. The output of [20:44:14] git-annex whereis would quickly fill up with a mess of names of personal [20:44:17] computers and repository locations. [20:44:19] that one [20:44:24] this file is on: [20:44:24] andrew's laptop [20:44:24] graig's laptop [20:44:24] archivia [20:44:25] minion1 [20:44:25] minion2 [20:44:25] etc. 
[20:44:28] yeah [20:44:29] it would get long and messy really fast [20:44:32] yeah [20:44:33] and [20:44:33] also [20:44:38] we want to be able to review the changes in gerrit [20:44:39] there's a work around for that [20:44:41] that's kinda the point of this [20:44:46] you still can do that, why not? [20:44:50] matanya: actually, I don't see anything in the logs, the one I found was someone editing a random page. what wiki are you doing it on? [20:44:51] this is what changes look like [20:44:51] https://github.com/ottomata/git-annex-test/commit/f26341ae35c76766ee55f6d4326fa1d74f6cd20d [20:45:04] he wiki legoktm [20:45:37] ottomata: that's the git-annex branch, you can ignore that change [20:45:46] what would we review then? [20:45:51] i want to be able to review: [20:45:56] this: https://github.com/ottomata/git-annex-test/commit/d85a8a22a2ef065bf3f5e721375389f65ef78473 [20:46:27] the git-annex branch tracks all the associated metadata, whereas 'master' has the files [20:46:34] (symlinks to files, that is) [20:46:44] right [20:46:56] its still hard to review though [20:46:57] i'd like to see [20:47:17] (PS3) Matanya: lucene: puppet 3 compatibility fix: fully qualify variable [operations/puppet] - https://gerrit.wikimedia.org/r/111520 [20:47:21] sure, it ain't going to be a simple clean diff, as git-annex tracks a lot of extra info for you (which is part of it's uber error checking) [20:47:54] matanya: can you pm me the export of the abusefilter rule you're editing? 
[20:47:56] -libs/artifact-0.1.0.jar http://archiva.wmf.org/snapshots/artifact-0.1.0.jar BLBLABL_CHECKSUMA [20:47:56] +libs/artifact-0.1.1.jar http://archiva.wmf.org/snapshots/artifact-0.1.1.jar BLBLABL_CHECKSUMB [20:48:04] sure [20:48:12] ain't going to happen if you want more error checking :) [20:48:25] well, deploy system could do error checking [20:48:26] but, the way we get around the multiple personal laptop issue, for eg: https://github.com/RichiH/conference_proceedings [20:48:29] like, not deploy if checksum doesn't match [20:48:46] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000 [20:49:13] I have a clone of that that I use to add new videos to it (eg: the new FOSDEM 2014 videos) so that I can push up additions to the repo publicly, but I have a separate clone for my own personal use/viewing [20:49:31] hm [20:49:52] ie: clone 1 is "clean" and won't add to whereis, whereas clone 2 is 'messy'/personal/not shared [20:50:22] hmm, aye, and as long as personal never runs git annex sync [20:50:23] or pushes [20:50:23] hm [20:50:28] (also, there's #git-annex on oftc, if you want tips/suggestions, joey is pretty helpful) [20:50:39] right [20:50:45] there's a feature request to make that easier [20:50:48] * greg-g looks [20:51:39] (and joey is cheap ;) ) [20:53:09] ottomata: Why not build something that can do what you imagined (simple manifest of location, url, checksum) and mange the complex state on tin somehow? [20:54:11] I haven't got into understanding the current capabilities of trebuchet-* yet but it seems like the sort of task that a build system should be able to do [20:54:25] Hi. [20:54:37] Is Special:Notifications being broken on en.wiki a known issue? [20:54:49] Marybelle: do you have a progression of different sets of names to choose from? [20:54:49] It's fatalling. [20:54:55] just curious [20:55:02] yuvipanda: This is my alt when I'm not at my main computer. 
:-) [20:55:08] hmm, I see [20:55:17] ottomata: found it: http://git-annex.branchable.com/todo/untracked_remotes/ [20:55:27] Anyone on the fatal? Looks fairly serious. [20:55:30] "set remote..annex-readonly=true to prevent git-annex from pushing changes to the remote, or modifying the contents of the remote in any way." [20:56:11] but, yeah, I can see how this is probably overkill for the JVM use case [20:56:24] unless we were a git-annex for file management generally kind of place [20:56:50] paravoid, Reedy: segfault problem has moved over to mw1215 now I think [20:57:08] lol [20:57:11] Marybelle: That may be your issue or it may be something else entirely [20:57:39] https://bugzilla.wikimedia.org/show_bug.cgi?id=60985 filed. [20:57:42] I copied Reedy and Greg. [20:57:48] Not sure who's in charge of Echo these days. [20:58:13] Not me [20:58:18] Marybelle: core features team is in charge of echo, #wikimedia-corefeatures [20:58:25] Fail [20:58:37] Marybelle: I'll attach a stack trace [20:59:02] ebernhardson: This has become an operations issue, but yeah... :-) [20:59:23] I don't think the core features team has a shared Bugzilla account (like fr-tech or whatever). [20:59:31] I used to copy Ryan K. on these things, but he's moved to mobile. [20:59:34] How many of those other errors ori paste did anyone file? [21:00:08] <^d> Marybelle is back [21:00:11] Just AbuseFilter [21:00:13] ty guys [21:00:19] <^d> I liked Gloria better ;-) [21:00:29] ^d: Gloria is also around, so we now have two of them [21:00:33] ^d: Gloria lives! I was just checking in because I hit a fatal at work. [21:00:34] Just assign it to the right component. [21:00:37] Marybelle != Gloria [21:01:36] Special:Notifications is part of Echo. [21:01:41] Reedy: https://gerrit.wikimedia.org/r/111891 should be enough for Echo to protect itself in this instance [21:01:43] Looks like Erik and Lego are on it. 
[21:01:48] hey, sorry guys, in meetings [21:01:49] Reedy: I think they do :) [21:01:50] convert the error from fatal into catchable [21:01:56] greg-g: will respond shortly [21:02:00] * Marybelle waves. [21:02:09] wait, that's a flow issue? [21:02:13] its a flow issue [21:02:17] <^d> greg-g: I tried the annex assistant on my phone. "Detected a crippled filesystem" [21:02:18] huh [21:02:19] well [21:02:20] kinda [21:02:20] greg-g: its both, echo should not be able to fatal [21:02:22] <^d> :P [21:02:33] ^d: :) yeah, ie: no symlink support or somesuch [21:02:47] ebernhardson: legoktm I'll let you two handle it [21:02:50] :) [21:03:09] <^d> I can make them! [21:03:10] <^d> :p [21:03:13] * greg-g was just trying to change component, and after 3 mid-air collisions saw it was set to flow and got confused [21:03:35] Lol [21:03:37] mw1187 [21:03:45] [06-Feb-2014 19:57:56] Fatal error: Using $this when not in object context at /usr/local/apache/common-local/php-1.23wmf12/includes/deferred/SquidUpdate.php on line 1782 [21:03:50] is actually a math related error [21:04:05] greg-g, Reedy, is the transition of the Wikipedias to wmf12 still in progress? [21:04:08] No [21:04:11] It's been reverted [21:04:18] Should the issue be closed then? [21:04:26] Bug 60970 [21:04:36] <^d> Reedy: Easy way to avoid that error: don't use static functions so much :) [21:04:43] no, the cause wasn't found [21:04:50] Reedy, okay, are the wmf11 i18n messages being put back? https://en.wikipedia.org/wiki/Special:GettingStarted has the wmf12 messages and wmf11 code, causing a bug. [21:05:08] There's nothing "to put back" [21:05:26] Localisation caches are seperate [21:05:29] Reedy, okay, but the messages are missing (https://bugzilla.wikimedia.org/show_bug.cgi?id=60971), and I need to figure out why. [21:05:49] superm401: since the cache clears slower [21:06:17] Should bug 60970 list the deployment that caused it? [21:06:21] Thanks, matanya. Do you know what the normal expiration time is? 
[21:06:26] Yes, zellfaze, I'll explain in a comment. [21:06:27] Jesus [21:06:29] Math is a mess [21:06:35] Reedy: echo has backports ready [21:06:40] superm401: iirc up to 24h [21:06:41] on 11, 12 and 13 [21:06:52] superm401, Thank you. Right now that bug is pretty useless to anyone who doesn't already know what was going on. [21:06:58] we should merge phyzkerwelts patches [21:07:25] (PS2) Ebe123: Add transwiki import options for zh.wikivoyage [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/110876 [21:08:33] zellfaze, how's https://bugzilla.wikimedia.org/show_bug.cgi?id=60971#c3 ? [21:08:46] matanya, anything we can do, or should I just wait it out? [21:09:15] We have a deployment scheduled today, so that's one of the things I'm trying to figure out. [21:09:27] greg-g: Which ones? [21:09:44] superm401, I was actually referring to Bug 60970, but 60971 looks good to me. [21:09:49] superm401: not sure [21:10:00] zellfaze, sorry, my bad. [21:10:05] No worries. :) [21:10:19] Reedy: all of them to math, I guess? [21:10:43] superm401: technically you can force cache clear, but i don't know if you can/should do it [21:10:57] * zellfaze goes back to troubleshooting firewalls on his own network. [21:11:17] Thank you all for getting everything back online, even if you didn't find the root cause. [21:11:18] And hope one of them fixes it? [21:11:20] https://gerrit.wikimedia.org/r/#/q/status:open+project:mediawiki/extensions/Math,n,z [21:13:27] ebernhardson: are you going to deploy the echo fixes or are we waiting on other things to stabilize? 
[21:14:12] Reedy: it was more of a long term response to how math is mostly unmaintained :) [21:14:28] self merges ftw [21:14:30] legoktm: i was going to wait, but i can [21:15:31] (PS1) Springle: s2 repool db1009 warm up [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/111895 [21:16:09] (CR) Springle: [C: 2] s2 repool db1009 warm up [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/111895 (owner: Springle) [21:16:17] (Merged) jenkins-bot: s2 repool db1009 warm up [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/111895 (owner: Springle) [21:16:44] Reedy: :P [21:17:32] !log springle synchronized wmf-config/db-eqiad.php 's2 repool db1009 warm up' [21:17:40] Logged the message, Master [21:20:19] Reedy: so... anything new re texvc? [21:24:52] Reedy: https://bugzilla.wikimedia.org/show_bug.cgi?id=60988 and https://bugzilla.wikimedia.org/show_bug.cgi?id=60992 are the relevant bugs? [21:26:16] [06-Feb-2014 21:05:01] Fatal error: Access to undeclared static property: FRInclusionManager::$instance at /usr/local/apache/common-local/php-1.23wmf12/extensions/FlaggedRevs/backend/FRInclusionManager.php on line 16 [21:26:30] [06-Feb-2014 21:04:43] Fatal error: Call to undefined function () at /usr/local/apache/common-local/php-1.23wmf12/includes/deferred/SquidUpdate.php on line 46 [21:26:49] [06-Feb-2014 21:02:27] Catchable fatal error: Argument 1 passed to e/common-local/php-1.23wmf12/includes/parser/Tidy.php::() must be an instance of , array given, called in /usr/local/apache/common-local/php-1.23wmf12/extensions/Math/MathTexvc.php on line 314 and defined at on line 39 [21:30:33] This is getting stupud [21:30:37] *stupid [21:31:49] first was right [21:32:24] Reedy: what do you need/want me to do, if anything? [21:33:38] I've no idea [21:33:45] <^d> I'm looking at the FlaggedRevs one. 
[21:33:48] I guess we should revert wmf12 math to wmf11 [21:33:52] <^d> Not immediately clear what's wrong there, looks fine in master. [21:34:21] The math code seems to be spawning a lot of really strange looking bugs [21:34:40] <^d> Sounds good to me, let's roll that back to wmf11. [21:34:46] * greg-g nods [21:34:48] do it [21:35:32] ^d: What's the easiest way to move the wmf/1.23wmf12 to point at the same revision as wmf/1.23wmf11 in the Math extension? [21:36:05] <^d> checkout the sha1 of wmf/1.23wmf11's version. [21:36:14] <^d> I can do it though quickly enough [21:39:12] !log demon synchronized php-1.23wmf12/extensions/Math/ 'Revert to known-good 2b85347 from wmf11' [21:39:20] Logged the message, Master [21:40:21] !log demon updated /a/common/php-1.23wmf12 to {{Gerrit|I91e982773}}: Update Echo and Flow [21:40:28] Logged the message, Master [21:40:52] I was more thinking for the remote branches.. [21:41:40] <^d> I'm updating wmf12 branch of core now to have the right submodule. [21:41:41] I wonder if that newer version of math is causing some sort of math cache invalidation (even more likely with the code errors that touches squidupdate) causing all the extra renders with texvccheck [21:41:45] Thanks [21:43:26] mmm https://gerrit.wikimedia.org/r/#/c/104991/ [21:43:34] titled "caching" [21:44:20] <^d> Oh well, yeah that'd do it. [21:44:40] so, obvious question: any similar errors on Beta Cluster? [21:44:57] greg-g, beta is not appropriate for load testing [21:45:13] it had missed a ton of perf regressions already [21:45:23] <^d> You wouldn't have seen it on any of the smaller wikis. [21:45:28] <^d> Not enough to matter. [21:45:37] mmm [21:45:50] MaxSem: wasn't thinking it was load-related [21:45:59] <^d> greg-g: But yes, technically you could see it on beta. [21:46:04] I just logged a maths sucks bug at https://bugzilla.wikimedia.org/show_bug.cgi?id=60997 [21:46:14] <^d> greg-g: If someone was looking at cache invalidations when it was merged. 
[21:46:24] I note numerous of those stack traces were related to de.wikipedia [21:46:25] the lack of our attention to it is what sucks [21:46:28] <^d> But again, most wouldn't notice it. Not enough math to matter. [21:46:41] *cough*selfmerge*cough* [21:47:02] <^d> So yeah, that's caching cache invalidations of all sorts on paths with [21:47:09] <^d> That explains...well...most everything [21:47:20] <^d> s/paths/pages/ [21:48:34] Sooo. Should we risk going back to the 12 and 13? [21:49:14] <^d> wmf13 needs Math rolled back too. [21:49:18] <^d> I only did wmf12. [21:50:43] (CR) Ebe123: "Why is /docroot/bits/WikipediaMobileFirefoxOS different without I even opening it?" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/110876 (owner: Ebe123) [21:50:55] !log demon updated /a/common/php-1.23wmf13 to {{Gerrit|I233eaf25c}}: Revert "Add sequence support for externallinks table" [21:51:03] Logged the message, Master [21:51:17] <^d> wmf13 done now too [21:51:24] Thanks [21:52:11] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Wikipedia back to 1.23wmf12, group0 to 1.23wmf13, Math reverted to 1.23wmf11 state in both branches [21:52:18] Logged the message, Master [21:52:37] so should we revert https://gerrit.wikimedia.org/r/#/c/104991/ ? [21:52:56] He's in -dev now [21:53:22] * Reedy watches apache process count [21:53:35] here we goooooo [21:54:16] !log demon synchronized php-1.23wmf13/extensions/Math 'Revert to known-good 2b85347 from wmf11' [21:54:24] <^d> Ok, 13 done now too [21:54:24] Logged the message, Master [21:54:44] <^d> And all done in gerrit too so people won't get confused. 
[21:55:42] thanks [21:55:48] https://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&c=Application+servers+eqiad&m=cpu_report&s=by+name&mc=2&g=load_report [21:58:03] (CR) Dzahn: [C: -2] "thanks for the detailed explanation, there is no rush more than the usual "we want to get out of Tampa" rush, partly it's just that everyb" [operations/puppet] - https://gerrit.wikimedia.org/r/110394 (owner: Matanya) [21:59:27] (CR) Dzahn: "also done in https://gerrit.wikimedia.org/r/#/c/111841/ and https://gerrit.wikimedia.org/r/#/c/111830/ and" [operations/puppet] - https://gerrit.wikimedia.org/r/110394 (owner: Matanya) [22:00:40] <^d> greg-g, MaxSem, Reedy: Talking about the change in -dev. [22:02:10] Reedy, ^d ming adding a one-liner about what happend and how long it took to Tech News [22:02:15] mind* [22:02:17] !technext [22:02:22] something broke [22:02:23] it took a while [22:02:44] we broke math, it was all 2+2=5 around here [22:06:25] so it was ca. 19:30 to 21:54 (both UTC)? [22:07:06] 21:54 logmsgbot: demon synchronized php-1.23wmf13/extensions/Math 'Revert to known-good 2b85347 from wmf11' is the last log in SAL [22:07:18] and backscroll says first reports ca. 19:30 UTC [22:07:49] twkozlowski: much shorter [22:07:54] Well, no [22:07:58] It was sort of longer [22:08:13] 19:45 logmsgbot: reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Revert earlier version changes [22:08:22] it started around 16:30 [22:08:30] well, yes [22:08:51] sorry, I'll be more precise next time [22:08:55] soemthing happened then, and then something happened 19:30-19:45 [22:09:00] first reports of 503s were 19:36 UTC [22:09:08] * aude not around for the earlier incident [22:09:21] I'm gonna feature that in Tech News [22:09:31] FYI I just got a report of an error page a few minutes ago [22:09:34] and I hope we'll have the second week without any incident this year [22:09:38] I hoped :-( [22:09:39] Not sure how accurate it is. [22:09:43] bad luck, I guess. 
[22:11:41] (Abandoned) Matanya: emery: remove last log before decom [operations/puppet] - https://gerrit.wikimedia.org/r/110394 (owner: Matanya) [22:11:43] twkozlowski: https://gdash.wikimedia.org/dashboards/reqerror/ (might take a while to load) [22:12:38] Reedy: so just to confirm I read the backscroll correctly, it was testwiki that was influenced at 16:30 UTC? [22:12:53] Do you guys know about PHP fatal error in /usr/local/apache/common-local/php-1.23wmf12/extensions/Math/Math.hooks.php line 50: Call to undefined method [22:13:03] aude: thanks, very helpful [22:13:05] eg. on https://en.wikipedia.org/wiki/Control_theory [22:13:45] And most other pages with equations? [22:14:45] Yeah, Math is causing trouble [22:14:51] -tech agrees [22:15:11] <^d> Blehhh, reverting to new math again. [22:16:22] I expect you know this already, but: PHP fatal error in /usr/local/apache/common-local/php-1.23wmf12/extensions/Math/Math.hooks.php line 50: [22:16:22] Call to undefined method ParserOptions::getMath() [22:16:27] <^d> Yes yes. [22:16:30] Ross_Hill: thank you, we do :) [22:16:46] * twkozlowski notes that Wikimedia users are very helpful [22:16:50] !log demon synchronized php-1.23wmf12/extensions/Math 'New math causes cache stampedes but without fatals' [22:16:58] Logged the message, Master [22:17:16] no problem. Good luck :) [22:17:21] <^d> Reedy, greg-g: So, wmf12 has new math again. [22:18:06] https://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&c=Application+servers+eqiad&m=cpu_report&s=by+name&mc=2&g=load_report [22:18:08] IT's growing... [22:19:13] <^d> What's the other option? [22:19:22] Not sure [22:19:31] But past experience (today) suggests we're heading for an outage [22:19:49] <^d> Could move everyone back to wmf11? [22:19:56] <^d> Or big wikis? [22:20:12] big wikis might make sense [22:20:14] ick [22:20:16] "Brace for impact!" 
[22:20:32] not wikidata [22:20:52] * aude hopes people don't use math on wikidata [22:21:02] * twkozlowski runs to insert some code somewhere. [22:21:05] a talk page! [22:21:08] no! [22:21:30] Yes, Ma'am. [22:21:36] aude: :-) [22:21:39] ;) [22:21:40] PoolCounter for Math? [22:22:01] <^d> Ooh, we could do that pretty quick. [22:22:09] https://doc.wikimedia.org/puppet/ [22:22:14] mutante: doc.wikimedia.org has been unreliable for over a week. It's stuck on a redirect loop for half the time, it kind of random. Something to do with dns or one of the apaches not working properly. This happaned when ops changed it to run from misc-web instead of gallium [22:22:15] The page isn't redirecting properly [22:22:35] Usually your local dns, or changing the url to be slightly more unique fixes it [22:22:43] e.g. flush dns cache locally [22:22:43] Krinkle: nod, ack, it's a different computer i'm using too [22:22:53] thanks [22:22:58] (there's a bug fpor that btw) [22:23:28] <^d> Reedy: Got stacktraces for math? Trying to figure the best place to drop PoolCounters in. [22:23:28] putting everyone in swimming pools? [22:23:40] MatmaRex: for doc.wm.o redirect loop dns issue? [22:23:46] CAn't find an open bug for it, maybe a closed one? [22:23:49] * MatmaRex notes that it seems to have stopped growing [22:24:59] indeed.. [22:25:09] Krinkle: https://bugzilla.wikimedia.org/show_bug.cgi?id=60822 [22:25:26] seems like the thing, at a glance [22:28:23] !log demon updated /a/common/php-1.23wmf13 to {{Gerrit|I5ba48ca51}}: Reverting Math to known-good 2b8534793fad9db18fcdb9621dc8d79ff36fdeb1 [22:28:24] !log demon synchronized php-1.23wmf13/extensions/Math 'Unfataling wmf13 Math too' [22:28:31] Logged the message, Master [22:28:38] Logged the message, Master [22:31:05] <^d> job queue looks fine. [22:31:09] <^d> load looks fine on apaches. [22:33:55] texvc log messages are back up to ~3000/minute [22:34:23] eh, what was /home/wikipedia/sbin/apache-kill-it ? 
[22:34:30] i mean, i can guess:) [22:35:01] can this be deleted? https://gerrit.wikimedia.org/r/#/c/111444/2/files/sudo/sudoers.search [22:35:13] sudoers for search, it _looks_ like it's not being used [22:36:11] "unscap-2" :P [22:38:36] <^d> It doesn't install anywhere? [22:39:53] <^d> bd808: Are they failing? [22:39:57] <^d> Or just running fine? [22:40:15] (PS1) Springle: db1055 resume groupLoadsByDB [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/111913 [22:40:16] ^d: Running fine. Just a lot of them [22:40:24] <^d> Yeah, to be expected. [22:40:38] (CR) Springle: [C: 2] db1055 resume groupLoadsByDB [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/111913 (owner: Springle) [22:40:45] <^d> I think as long as we're managing the load fine, we should just get through the cache invalidation. [22:40:45] (Merged) jenkins-bot: db1055 resume groupLoadsByDB [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/111913 (owner: Springle) [22:41:51] !log springle synchronized wmf-config/db-eqiad.php 's1 db1055 resume groupLoadsByDB' [22:41:58] Logged the message, Master [22:42:06] So far the peak rate is about half of what we hit in the second attempt today [22:42:42] hey, just discovered an XFF chain on mobile that contains 10.2.4.26, an LVS IP [22:42:47] Well, probably closer to 2/3 [22:42:56] robh: Should I re-open https://rt.wikimedia.org/Ticket/Display.html?id=6594 or file a new ticket for the breakage it resulted in? doc.wm.o has been oscillating ever since from being available to being down with a redirect loop to itself, and https is enforced so there's no work around it seems. [22:43:19] ( https://bugzilla.wikimedia.org/show_bug.cgi?id=60822 ) [22:43:45] reopen old ticket [22:43:49] new ticket just means linking to old [22:43:57] and im not sure its redirection and not shitty dns [22:44:00] but i can check later. 
[22:44:16] <^d> bd808: So, generally speaking, let's talk about getting this behind PoolCounter. [22:44:20] i'd like to know what dns points to for those having the issue [22:44:22] <^d> It's a great candidate for it. [22:44:35] Krinkle: hashar asked for info on it [22:44:43] and no one has answered him [22:44:53] so i dont expect to do much until someone having the issue can give more info, cuz im not having it ;] [22:45:18] oh, he wasnt askin, nm [22:45:27] yea, reopen the rt and link to bz and i'll look into it later [22:45:31] Thx [22:45:33] (it wont be today, im taking a sick day ;) [22:45:40] still working, its how i roll. [22:45:42] Convenient :) [22:45:46] no worries. [22:46:07] Krinkle: We found a fix for that on the logstash.w.o vhost. It's a pretty easy apache config change I bet [22:46:24] Ah, right, it wasn't Faidon who debugged that last week, it was you. [22:46:42] well, may have been him as well, but i looked into it a few times when folks reported in -tech [22:46:53] bd808: oh? [22:46:57] I'll find the apache config in puppet and submit a patch [22:47:06] cool, put me as reviewer please? [22:47:09] i wanna know the solution =] [22:47:40] robh: Yeah Ori figured out a trick that gets the Vary header to propagate back to varnish [22:47:59] bd808: cool, i appreciate the fix on it [22:48:06] cuz i keep looking and seeing nuthin =P [22:49:43] bd808: robh: It also seems that integration.wikimedia.org no longer enforces https, doc.wikimedia.org still does however. [22:50:02] I don't recall that being done intentionally though, was it? [22:50:33] hrmm, no, it shouldnt is my understanding [22:50:35] but i see it does [22:50:56] i'll reopen that one. 
[22:52:37] 10.64.16.170 and 10.64.32.35 are segfaulting a lot [22:52:48] 170 more so [22:54:47] (PS1) BryanDavis: Send Vary header on http to http redirect [operations/puppet] - https://gerrit.wikimedia.org/r/111917 [22:55:24] robh, Krinkle: ^ [22:55:40] im too sick and out of it to merge things to production today [22:55:48] but i'll review and merge tomorrow =] [22:56:21] if something breaks im in no state to work on it [22:57:57] (it shouldnt break from what i can see, but those are what folks say before things really break right?) [23:00:03] robh: I totally understand that attitude :) [23:00:22] thx for submitting fix though =] [23:02:17] doc.wm.o seems to be the only apache config other than logstash.w.o in the ops/puppet repo that is using that X-Forwarded-Proto rewrite rule. [23:03:04] operations/apache-config is chock full of the pattern though [23:03:08] !log running sync-common on mw1142 [23:03:17] Logged the message, Master [23:13:14] <^d> greg-g, others: I've updated the etherpad at the top with the working theory for what happened today. [23:13:27] <^d> Please feel free to adjust the summary as appropriate. [23:16:46] (PS1) BryanDavis: Send Vary header on http to http redirect [operations/apache-config] - https://gerrit.wikimedia.org/r/111925 [23:16:54] (CR) jenkins-bot: [V: -1] Send Vary header on http to http redirect [operations/apache-config] - https://gerrit.wikimedia.org/r/111925 (owner: BryanDavis) [23:17:20] Reedy: ^^ [23:18:18] ^d: re last bit, what *should* we do re Math next week? [23:18:51] <^d> I'd pin the version in wmf12/13 for 14 so it doesn't sneak in further. [23:18:55] <^d> wmf12 is the big thing. [23:19:22] * greg-g nods [23:20:00] <^d> Is nothing on wmf11 anymore? [23:20:07] <^d> I think we're fine then, if enwiki and so forth are on wmf12. [23:20:10] <^d> That was the worst. 
[23:20:45] <^d> It wasn't wmf13 that caused the problem, it was wmf12 going to big wikis :) [23:20:55] <^d> But if they've got it, we'll just move forward. [23:22:23] Yep, all Wikipedias are on wmf12 now. [23:22:43] <^d> Ok, then I guess we'll be fine. Just keep a close eye on things for a bit. [23:22:53] <^d> Getting a rush on that PoolCounter patch would let me sleep a little better. [23:25:05] (PS1) MaxSem: Mobile ulsfo LVS appears in XFF chains, whitelist [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/111927 [23:25:56] greg-g, I need to deploy ^^ [23:26:22] MaxSem: wanna do it now? [23:26:38] yes, if possible [23:26:52] to double check, wmf11 is not running anywhere? Just making sure what patches i need to prep for LD [23:26:54] (PS2) BryanDavis: Send Vary header on http to http redirect [operations/apache-config] - https://gerrit.wikimedia.org/r/111925 [23:27:51] ebernhardson: /a/common/multiversion/activeMWVersions [23:28:01] greg-g: Notifications still broken? [23:28:06] Hell [23:28:13] It's even on https://noc.wikimedia.org/conf/ :) [23:28:18] I figured someone would've pushed the fix... [23:28:21] Reedy: awsome, thanks! [23:29:37] might want to add a comma inside that implode, Reedy [23:30:02] * Reedy shrugs [23:31:02] Gloria: i think ebernhardson is fixing it [23:31:06] ebernhardson: were you going to deploy the Echo/Flow fix? [23:31:10] greg-g: yes [23:31:11] he [23:31:13] h [23:31:19] greg-g: prepping wmf12 and wmf13 cherry picks [23:31:24] awesome [23:32:57] Thanks, Erik. [23:35:14] texvc is down from ~3000/minute to ~1800/minute. Still about 10x normal volume but getting better. [23:35:38] * bd808 <3's logstash [23:35:47] yet another reason to move math to a separate service [23:36:08] Yeah. And throttle it [23:36:13] having it on other appservers allows us to render a shitload of formulas though [23:36:17] PoolCounter :) [23:36:19] What was the cause of the Math issues? 
[23:36:31] superm401, changes chae key schema [23:36:40] *cache key [23:37:06] greg-g, so what about my change? [23:37:11] "Remove call to deprecated ParserOptions::getMath"? [23:37:26] yup [23:37:32] greg-g, yeah, we're also wondering, but no pressure to make an immediate call. [23:43:55] StevenW, I thought our excluded cat was broken for a few minutes. [23:44:03] But it's just wrong in the proposed config patch. [23:44:07] oh [23:44:19] I need to go through and remove the namespace from all of them. [23:45:05] you mean the permutation of "Category:" [23:46:44] Yeah [23:46:52] k [23:47:26] (CR) Mattflaschen: [C: -1] "I should have caught this before, but the category prefix is not supposed to be here for either variable." [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/111460 (owner: Phuedx) [23:54:27] MaxSem: any word on getting https://gerrit.wikimedia.org/r/#/c/111927/1 deployed? [23:54:46] waiting for greg-g [23:54:53] kk [23:56:52] (PS3) Mattflaschen: Enable the GettingStarted extension on non-enwiki wikis. [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/111460 (owner: Phuedx)