[00:05:35] (03CR) 10Ori.livneh: [C: 031] "There certainly are resources on this domain that are expensive to generate (LTR-flipped CSS, compiled LESS, dynamically-generated JavaScr" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/154234 (owner: 10Tim Starling) [00:10:47] PROBLEM - Puppet freshness on mw1053 is CRITICAL: Last successful Puppet run was Sun 31 Aug 2014 01:50:18 UTC [00:28:27] PROBLEM - Disk space on stat1002 is CRITICAL: DISK CRITICAL - free space: /tmp 0 MB (0% inode=99%): [00:31:27] PROBLEM - puppet last run on stat1002 is CRITICAL: CRITICAL: Epic puppet fail [00:36:27] RECOVERY - Disk space on stat1002 is OK: DISK OK [00:41:47] PROBLEM - Puppet freshness on elastic1007 is CRITICAL: Last successful Puppet run was Sun 31 Aug 2014 06:23:11 UTC [00:50:27] RECOVERY - puppet last run on stat1002 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [00:54:27] PROBLEM - Disk space on stat1002 is CRITICAL: DISK CRITICAL - free space: /tmp 530 MB (0% inode=99%): [00:56:27] RECOVERY - Disk space on stat1002 is OK: DISK OK [01:41:33] ori: Re User:Caothu9669, what type of spam? 'cause he created another account on Friday. [01:42:36] scfc_de: it was a weird gerrit change, let me find the link [01:42:52] scfc_de: https://gerrit.wikimedia.org/r/157663 [01:43:55] legoktm: k, thanks. 
[02:11:47] PROBLEM - Puppet freshness on mw1053 is CRITICAL: Last successful Puppet run was Sun 31 Aug 2014 01:50:18 UTC [02:12:07] PROBLEM - Disk space on virt0 is CRITICAL: DISK CRITICAL - free space: /a 3616 MB (3% inode=99%): [02:15:33] !log LocalisationUpdate completed (1.24wmf18) at 2014-09-02 02:14:30+00:00 [02:15:48] Logged the message, Master [02:26:39] !log LocalisationUpdate completed (1.24wmf19) at 2014-09-02 02:25:36+00:00 [02:26:47] Logged the message, Master [02:42:47] PROBLEM - Puppet freshness on elastic1007 is CRITICAL: Last successful Puppet run was Sun 31 Aug 2014 06:23:11 UTC [02:53:37] !log restarted dbstore1002 mysqld for upgrade [02:53:44] Logged the message, Master [03:01:07] RECOVERY - Disk space on virt0 is OK: DISK OK [03:16:28] !log LocalisationUpdate ResourceLoader cache refresh completed at Tue Sep 2 03:15:22 UTC 2014 (duration 15m 21s) [03:16:34] Logged the message, Master [03:33:52] (03PS1) 10Springle: depool db1035 for upgrade, move s3 vslow/dump to db1019 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157779 [03:34:31] (03CR) 10Springle: [C: 032] depool db1035 for upgrade, move s3 vslow/dump to db1019 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157779 (owner: 10Springle) [03:34:35] (03Merged) 10jenkins-bot: depool db1035 for upgrade, move s3 vslow/dump to db1019 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157779 (owner: 10Springle) [03:35:34] !log springle Synchronized wmf-config/db-eqiad.php: depool db1035 (duration: 00m 07s) [03:35:40] Logged the message, Master [03:41:47] PROBLEM - mysqld processes on db1035 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [03:41:59] (03PS1) 10Springle: prepare db1035 for upgrade [puppet] - 10https://gerrit.wikimedia.org/r/157780 [03:42:56] (03CR) 10Springle: [C: 032] prepare db1035 for upgrade [puppet] - 10https://gerrit.wikimedia.org/r/157780 (owner: 10Springle) [04:12:16] PROBLEM - Puppet freshness on mw1053 is CRITICAL: Last successful Puppet run 
was Sun 31 Aug 2014 01:50:18 UTC [04:43:16] PROBLEM - Puppet freshness on elastic1007 is CRITICAL: Last successful Puppet run was Sun 31 Aug 2014 06:23:11 UTC [04:54:03] (03PS1) 10Springle: prepare db1067 for upgrade [puppet] - 10https://gerrit.wikimedia.org/r/157783 [04:55:17] (03CR) 10Springle: [C: 032] prepare db1067 for upgrade [puppet] - 10https://gerrit.wikimedia.org/r/157783 (owner: 10Springle) [04:59:14] !log springle Synchronized wmf-config/db-eqiad.php: depool db1067 (duration: 00m 07s) [04:59:25] Logged the message, Master [05:09:22] (03CR) 10Chmarkine: [C: 031] puppetmaster - use ssl_ciphersuite [puppet] - 10https://gerrit.wikimedia.org/r/153986 (owner: 10Dzahn) [05:37:26] !log springle Synchronized wmf-config/db-eqiad.php: repool db1067, warm up (duration: 00m 08s) [05:37:32] Logged the message, Master [05:39:52] (03PS1) 10KartikMistry: WIP: Update cxserver (beta) config [puppet] - 10https://gerrit.wikimedia.org/r/157787 [05:42:07] (03CR) 10KartikMistry: [C: 04-1] "Dictionaries packages need to update." 
[puppet] - 10https://gerrit.wikimedia.org/r/157787 (owner: 10KartikMistry) [05:55:13] (03PS1) 10Chmarkine: gerrit: Enable StrictTransportSecurity max-age=7days [puppet] - 10https://gerrit.wikimedia.org/r/157789 (https://bugzilla.wikimedia.org/38516) [05:58:35] !log dump s3 db1035 to db1069:3313 [05:58:41] Logged the message, Master [06:11:06] PROBLEM - Host mw1192 is DOWN: PING CRITICAL - Packet loss = 100% [06:13:06] PROBLEM - Puppet freshness on mw1053 is CRITICAL: Last successful Puppet run was Sun 31 Aug 2014 01:50:18 UTC [06:17:56] RECOVERY - Disk space on db1047 is OK: DISK OK [06:21:25] PROBLEM - Puppet freshness on cp4013 is CRITICAL: Last successful Puppet run was Tue 02 Sep 2014 06:18:34 UTC [06:22:52] !log removed txt files filling up db1047 /tmp, looked like analytics SELECT INTO OUTFILE, dated mid-August [06:22:57] Logged the message, Master [06:23:25] PROBLEM - Puppet freshness on cp4013 is CRITICAL: Last successful Puppet run was Tue 02 Sep 2014 06:18:34 UTC [06:25:25] PROBLEM - Puppet freshness on cp4013 is CRITICAL: Last successful Puppet run was Tue 02 Sep 2014 06:18:34 UTC [06:26:58] (03PS2) 10KartikMistry: WIP: Update cxserver (beta) config [puppet] - 10https://gerrit.wikimedia.org/r/157787 [06:27:25] PROBLEM - Puppet freshness on cp4013 is CRITICAL: Last successful Puppet run was Tue 02 Sep 2014 06:18:34 UTC [06:29:25] PROBLEM - Puppet freshness on cp4013 is CRITICAL: Last successful Puppet run was Tue 02 Sep 2014 06:18:34 UTC [06:29:35] PROBLEM - puppet last run on iron is CRITICAL: CRITICAL: Puppet has 3 failures [06:29:45] PROBLEM - puppet last run on mw1065 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:45] PROBLEM - puppet last run on db1015 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:46] PROBLEM - puppet last run on mw1052 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:55] PROBLEM - puppet last run on search1018 is CRITICAL: CRITICAL: Puppet has 2 failures [06:30:26] PROBLEM - Disk space on elastic1014 is CRITICAL: DISK 
CRITICAL - free space: / 0 MB (0% inode=96%): [06:30:26] PROBLEM - Disk space on elastic1004 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=96%): [06:31:25] PROBLEM - Puppet freshness on cp4013 is CRITICAL: Last successful Puppet run was Tue 02 Sep 2014 06:18:34 UTC [06:32:16] hrmmmmmmmmmm [06:32:22] ^d, ping? [06:33:25] PROBLEM - Puppet freshness on cp4013 is CRITICAL: Last successful Puppet run was Tue 02 Sep 2014 06:18:34 UTC [06:35:25] PROBLEM - Puppet freshness on cp4013 is CRITICAL: Last successful Puppet run was Tue 02 Sep 2014 06:18:34 UTC [06:35:46] PROBLEM - puppet last run on db1011 is CRITICAL: CRITICAL: Puppet has 2 failures [06:37:25] PROBLEM - Puppet freshness on cp4013 is CRITICAL: Last successful Puppet run was Tue 02 Sep 2014 06:18:34 UTC [06:38:35] RECOVERY - Puppet freshness on cp4013 is OK: puppet ran at Tue Sep 2 06:38:26 UTC 2014 [06:43:59] PROBLEM - Puppet freshness on elastic1007 is CRITICAL: Last successful Puppet run was Sun 31 Aug 2014 06:23:11 UTC [06:45:39] RECOVERY - puppet last run on iron is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [06:45:49] RECOVERY - puppet last run on db1015 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [06:45:50] RECOVERY - puppet last run on mw1052 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [06:46:39] RECOVERY - puppet last run on search1018 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [06:46:40] RECOVERY - puppet last run on mw1065 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [06:53:49] RECOVERY - puppet last run on db1011 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [07:01:29] RECOVERY - Disk space on elastic1004 is OK: DISK OK [07:02:12] !log removed all-but-latest large slow logs on elastic1004 and elastic1014 [07:02:18] Logged the message, Master [07:02:29] RECOVERY - Disk space on 
elastic1014 is OK: DISK OK [07:05:28] <_joe_> someone already on mw1192? [07:11:03] nope [07:11:25] springle: ouch, are they filling their disks regularly with log files? [07:20:42] <_joe_> !log powercycling mw1192, blank console, unresponsive [07:20:48] Logged the message, Master [07:23:19] RECOVERY - Host mw1192 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [07:24:06] (03PS3) 10KartikMistry: WIP: Update cxserver (beta) config [puppet] - 10https://gerrit.wikimedia.org/r/157787 [07:24:42] so, backend fail, huh [07:24:47] unrelated to the xfs lockup? [07:24:57] (and load avg) [07:25:22] jeremyb: no I don't think so [07:25:44] and much less widespread than what we've seen previously it seems [07:26:18] well last time we had logs mentioning memcache? i wonder about this time [07:26:38] (I also wonder if my memory is accurate :P) [07:26:54] (03CR) 10Nikerabbit: "Is apertium going to be puppetized in separate patch?" [puppet] - 10https://gerrit.wikimedia.org/r/157787 (owner: 10KartikMistry) [07:28:17] (03PS4) 10KartikMistry: WIP: Update cxserver (beta) config [puppet] - 10https://gerrit.wikimedia.org/r/157787 [07:29:22] (03CR) 10KartikMistry: "No need of apertium on beta (as of now) as we've http://apertium.wmflabs.org" [puppet] - 10https://gerrit.wikimedia.org/r/157787 (owner: 10KartikMistry) [07:30:15] jeremyb: yep that was the case originally [07:30:56] <_joe_> mw1192 has a load average of 64 [07:31:35] <_joe_> I second godog's idea of reducing the weight of those api servers to 15 [07:33:35] _joe_, post reboot? [07:34:14] <_joe_> jeremyb: yes, it's at full CPU [07:34:30] <_joe_> (now gone down to 27, still....) 
[07:34:30] huh [07:34:43] _joe_: yep we could try, memory usage doesn't seem so bad on the rest [07:34:44] <_joe_> well post-reboot you need to rebuild the apc cache [07:34:54] <_joe_> godog: that was exactly my thought [07:35:48] <_joe_> well, lemme repackage hhvm with another patch we need to backport [07:36:19] <_joe_> godog: if no-one answers, I think we can move on later in the day [07:36:35] <_joe_> (uniforming the weight of the appservers a bit) [07:36:45] _joe_: sounds good, worst case rollback is easy [07:36:52] <_joe_> yep [07:37:06] <_joe_> worst case we cause our first outage :) [07:37:26] <_joe_> you know people have placed bets on the date for that? [07:38:08] hahah no I wasn't aware, we should be allowed to bet *g* [07:38:47] (03CR) 10Nikerabbit: "All dependencies should sooner or later be puppetized and run on beta labs. Preferable sooner, as that allows ops to highlight possible bl" [puppet] - 10https://gerrit.wikimedia.org/r/157787 (owner: 10KartikMistry) [07:40:36] what do you get if you have an outage before you get shell/root? [07:41:24] a bigger prize possibly, that's slight harder to do [07:46:31] (03CR) 10Giuseppe Lavagetto: puppet: hiera backend for the WMF (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/151869 (owner: 10Giuseppe Lavagetto) [07:47:14] https://gerrit.wikimedia.org/r/106109 "broke the world", https://gerrit.wikimedia.org/r/133205 was the fix. 
:-) [07:47:42] (03CR) 10Giuseppe Lavagetto: puppet: hiera backend for the WMF (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/151869 (owner: 10Giuseppe Lavagetto) [07:48:53] <_joe_> jeremyb: eheh apache is quite scary [07:49:13] heh [07:49:22] <_joe_> it's very very easy to fuck things up [07:49:38] <_joe_> I'd like to find the time to write unit tests for our apache config [08:06:44] (03PS1) 10Giuseppe Lavagetto: Add pcre overflow patch that should prevent beta from crashing regularly [debs/hhvm] - 10https://gerrit.wikimedia.org/r/157790 [08:13:59] PROBLEM - Puppet freshness on mw1053 is CRITICAL: Last successful Puppet run was Sun 31 Aug 2014 01:50:18 UTC [08:15:55] godog: looks like. ottomata logged the same apparent full disk problem for elastic* yesterday [08:17:54] <_joe_> jeremyb: btw, thanks a lot for the help on RT last week [08:18:11] which? [08:18:39] * jeremyb also gets confused with all the new italian/greek names :P [08:19:43] (i have a funny interpretation of "new") [08:19:55] indeed [08:20:26] <_joe_> jeremyb: Giuseppe [08:20:37] right, /whois told me that :) [08:20:41] unless new >= 1.2 years (at least for the greek guys) [08:20:48] <_joe_> Giuseppe == Joseph [08:21:48] <_joe_> I was kinda convinced that was widely known and thus my nick was not confusing anyone [08:22:12] <_joe_> that turned out to be false both for Italian and English native speakers [08:24:19] _joe_, oh, was last week your week? [08:24:30] calendar on https://wikitech.wikimedia.org/wiki/RT_Triage_Duty has lapsed [08:25:36] <_joe_> uhh too many places to put that in [08:25:44] <_joe_> will amend that sorry [08:26:02] <_joe_> as an excuse, this was my second time on duty only [08:26:07] we need to update that too ? [08:26:16] here's something I have never done when on RT [08:26:24] idk. it should say that it's not updated if it's not updated [08:26:34] what other places are there? 
[08:27:02] (originally that was the page where people would sign up for slots weeks ahead) [08:27:07] this channel's topic [08:27:13] right [08:27:33] that's it? [08:27:34] and the techops meeting's minutes/notes (which are not public...) [08:27:45] <_joe_> uh no, I was sure I put that in that wikitech page [08:28:00] <_joe_> I may have not saved it [08:28:04] <_joe_> :/ [08:28:23] <_joe_> I thought I edited a page on another wiki [08:29:24] there's officewiki... [08:35:09] PROBLEM - check_mysql on lutetium is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 774 [08:40:09] RECOVERY - check_mysql on lutetium is OK: Uptime: 1171608 Threads: 2 Questions: 3401470 Slow queries: 2028 Opens: 5187 Flush tables: 2 Open tables: 64 Queries per second avg: 2.903 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [08:41:09] PROBLEM - puppet last run on neon is CRITICAL: CRITICAL: Epic puppet fail [08:41:40] springle: gah, okay thanks [08:44:59] PROBLEM - Puppet freshness on elastic1007 is CRITICAL: Last successful Puppet run was Sun 31 Aug 2014 06:23:11 UTC [09:01:09] RECOVERY - puppet last run on neon is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [09:11:22] (03CR) 10Alexandros Kosiaris: [C: 04-1] WIP: Update cxserver (beta) config (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/157787 (owner: 10KartikMistry) [09:15:49] RECOVERY - OCG health on ocg1003 is OK: OK: /mnt/tmpfs 0B: /srv/deployment/ocg/output 4279847341B: /srv/deployment/ocg/postmortem 1402390B: ocg_job_status 9746 msg: ocg_render_job_queue 0 msg [09:23:51] (03PS1) 10Filippo Giunchedi: releases: do not include hostname in sudoers.d [puppet] - 10https://gerrit.wikimedia.org/r/157797 [09:25:22] (03PS2) 10Filippo Giunchedi: releases: do not include hostname in sudoers.d [puppet] - 10https://gerrit.wikimedia.org/r/157797 [09:40:24] (03PS1) 10Giuseppe Lavagetto: Fix startup of hhvm [debs/hhvm] - 10https://gerrit.wikimedia.org/r/157798 
[09:45:59] (03CR) 10Giuseppe Lavagetto: "This is in no way important for our installation, but is part of the needed packaging work." [debs/hhvm] - 10https://gerrit.wikimedia.org/r/157798 (owner: 10Giuseppe Lavagetto) [09:46:30] PROBLEM - ElasticSearch health check for shards on logstash1001 is CRITICAL: CRITICAL - elasticsearch http://10.64.32.138:9200/_cluster/health error while fetching: Request timed out. [09:47:29] RECOVERY - ElasticSearch health check for shards on logstash1001 is OK: OK - elasticsearch status production-logstash-eqiad: {ustatus: uyellow, unumber_of_nodes: 3, uunassigned_shards: 1, utimed_out: False, uactive_primary_shards: 36, ucluster_name: uproduction-logstash-eqiad, urelocating_shards: 0, uactive_shards: 102, uinitializing_shards: 0, unumber_of_data_nodes: 3} [10:00:27] <_joe_> !log depooling mw1192, high CPU temperatures; we may need to check fan status [10:00:32] Logged the message, Master [10:01:15] (03PS5) 10KartikMistry: WIP: Update cxserver (beta) config [puppet] - 10https://gerrit.wikimedia.org/r/157787 [10:03:39] PROBLEM - Disk space on stat1002 is CRITICAL: DISK CRITICAL - free space: /tmp 3744 MB (3% inode=99%): [10:04:50] (03CR) 10KartikMistry: WIP: Update cxserver (beta) config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/157787 (owner: 10KartikMistry) [10:06:39] RECOVERY - Disk space on stat1002 is OK: DISK OK [10:12:19] (03CR) 10Alexandros Kosiaris: [C: 032] pdns: qualify vars [puppet] - 10https://gerrit.wikimedia.org/r/157681 (owner: 10Matanya) [10:14:59] PROBLEM - Puppet freshness on mw1053 is CRITICAL: Last successful Puppet run was Sun 31 Aug 2014 01:50:18 UTC [10:21:23] PROBLEM - Puppet freshness on mw1083 is CRITICAL: Last successful Puppet run was Tue 02 Sep 2014 10:18:12 UTC [10:23:23] PROBLEM - Puppet freshness on mw1083 is CRITICAL: Last successful Puppet run was Tue 02 Sep 2014 10:18:12 UTC [10:25:23] PROBLEM - Puppet freshness on mw1083 is CRITICAL: Last successful Puppet run was Tue 02 Sep 2014 
10:18:12 UTC [10:27:23] PROBLEM - Puppet freshness on mw1083 is CRITICAL: Last successful Puppet run was Tue 02 Sep 2014 10:18:12 UTC [10:29:23] PROBLEM - Puppet freshness on mw1083 is CRITICAL: Last successful Puppet run was Tue 02 Sep 2014 10:18:12 UTC [10:29:39] godog: question: isn't jenkins uploading to caesium ? [10:31:23] PROBLEM - Puppet freshness on mw1083 is CRITICAL: Last successful Puppet run was Tue 02 Sep 2014 10:18:12 UTC [10:31:39] matanya: for debian packages? no, not sure about the rest tho [10:31:43] PROBLEM - puppet last run on cp4018 is CRITICAL: CRITICAL: Puppet has 1 failures [10:32:11] godog: it might be an answer to your question in #8251 [10:32:24] if it is desired, not sure about the setup [10:33:23] PROBLEM - Puppet freshness on mw1083 is CRITICAL: Last successful Puppet run was Tue 02 Sep 2014 10:18:12 UTC [10:33:26] matanya: indeed! [10:35:23] PROBLEM - Puppet freshness on mw1083 is CRITICAL: Last successful Puppet run was Tue 02 Sep 2014 10:18:12 UTC [10:37:23] PROBLEM - Puppet freshness on mw1083 is CRITICAL: Last successful Puppet run was Tue 02 Sep 2014 10:18:12 UTC [10:38:24] RECOVERY - Puppet freshness on mw1083 is OK: puppet ran at Tue Sep 2 10:38:13 UTC 2014 [10:46:00] PROBLEM - Puppet freshness on elastic1007 is CRITICAL: Last successful Puppet run was Sun 31 Aug 2014 06:23:11 UTC [10:49:55] RECOVERY - puppet last run on cp4018 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [10:50:10] PROBLEM - puppet last run on mc1014 is CRITICAL: CRITICAL: Puppet has 1 failures [10:57:44] (03PS1) 10Filippo Giunchedi: elasticsearch: handle request timeout and increase timeout [puppet] - 10https://gerrit.wikimedia.org/r/157805 [10:58:24] (03CR) 10jenkins-bot: [V: 04-1] elasticsearch: handle request timeout and increase timeout [puppet] - 10https://gerrit.wikimedia.org/r/157805 (owner: 10Filippo Giunchedi) [10:59:53] (03PS2) 10Filippo Giunchedi: elasticsearch: handle request timeout and increase timeout 
[puppet] - 10https://gerrit.wikimedia.org/r/157805 [11:02:42] looks like we have consensus on weighting api servers to 15, I can do that in an hour or so if nobody wants to do it earlier [11:09:10] RECOVERY - puppet last run on mc1014 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [11:24:20] PROBLEM - Kafka Broker Messages In on analytics1021 is CRITICAL: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 994.007393736 [12:01:50] ^d: andre__: hi, can you share an estimate of the number of subscribers of wikitech-l with me? (it says you're the list admins) [12:02:40] (or is it super duper private data?) [12:03:04] I wonder if this is displayed anywhere in Mailman [12:16:00] PROBLEM - Puppet freshness on mw1053 is CRITICAL: Last successful Puppet run was Sun 31 Aug 2014 01:50:18 UTC [12:36:27] !log increase weight to 15 for mw1132 -> mw1148 [12:36:33] Logged the message, Master [12:40:58] <_joe_> godog: looking at the load average [12:41:07] <_joe_> it doesn't seem to explode [12:42:43] (03PS3) 10Filippo Giunchedi: elasticsearch: handle request timeout and increase timeout [puppet] - 10https://gerrit.wikimedia.org/r/157805 [12:43:02] yep seems "fine", in which they are fairly loaded anyways [12:43:30] PROBLEM - puppet last run on pc1001 is CRITICAL: CRITICAL: Puppet has 1 failures [12:45:47] if it seems okay we can bring everything to 15 in the next one/two hours [12:47:00] PROBLEM - Puppet freshness on elastic1007 is CRITICAL: Last successful Puppet run was Sun 31 Aug 2014 06:23:11 UTC [12:48:36] <_joe_> godog: +1 [12:51:20] PROBLEM - puppet last run on ms-fe3002 is CRITICAL: CRITICAL: Puppet has 1 failures [13:01:30] RECOVERY - puppet last run on pc1001 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [13:09:20] RECOVERY - puppet last run on ms-fe3002 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [13:12:50] PROBLEM - puppet last run on 
cp4018 is CRITICAL: CRITICAL: Puppet has 1 failures [13:19:12] (03PS5) 10Giuseppe Lavagetto: varnish: add a php engine token [puppet] - 10https://gerrit.wikimedia.org/r/156793 [13:29:31] (03CR) 10Giuseppe Lavagetto: "cherry-picked on beta, seems to work fine." [puppet] - 10https://gerrit.wikimedia.org/r/156793 (owner: 10Giuseppe Lavagetto) [13:29:50] RECOVERY - puppet last run on cp4018 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [13:37:15] (03CR) 10Alexandros Kosiaris: [C: 031] elasticsearch: handle request timeout and increase timeout (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/157805 (owner: 10Filippo Giunchedi) [13:38:40] PROBLEM - Disk space on elastic1009 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 20129 MB (3% inode=99%): [13:44:15] (03PS6) 10Giuseppe Lavagetto: varnish: add a php engine token [puppet] - 10https://gerrit.wikimedia.org/r/156793 [13:52:47] (03CR) 10Giuseppe Lavagetto: [C: 032] varnish: add a php engine token [puppet] - 10https://gerrit.wikimedia.org/r/156793 (owner: 10Giuseppe Lavagetto) [13:57:58] (03PS1) 10coren: Labs: provide saner nscd defaults [puppet] - 10https://gerrit.wikimedia.org/r/157816 (https://bugzilla.wikimedia.org/70076) [13:59:42] _joe_: ^^ look at this? First attempt to lighten the load on dnsmasq by caching instead. [14:00:30] PROBLEM - Host nickel is DOWN: PING CRITICAL - Packet loss = 100% [14:00:41] <_joe_> nickel down? [14:01:16] _joe_: Looks like. [14:02:31] <_joe_> it is in fact [14:02:40] * Coren takes alook. [14:02:45] <_joe_> Coren: thanks [14:03:01] <_joe_> godog: your change on the balancing of the api servers seem to work [14:03:13] Eeew. Looks like the kernel panicked. 
[14:03:40] PROBLEM - Disk space on elastic1009 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 20176 MB (3% inode=99%): [14:04:05] _joe_: yeah but I was expecting more of a drop in the ones at weight 20 :( will balance the rest to 15 now [14:04:25] <_joe_> godog: but there was [14:10:19] (03PS5) 10Giuseppe Lavagetto: mediawiki: funnel all php requests through HHVM [puppet] - 10https://gerrit.wikimedia.org/r/157691 [14:10:21] <_joe_> Coren: I guess you may powercycle nickel [14:10:40] PROBLEM - Disk space on elastic1009 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 20073 MB (3% inode=99%): [14:10:50] RECOVERY - Host nickel is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [14:12:22] _joe_: Dying disk(s) [14:13:09] * Coren files RT [14:13:31] <_joe_> oh shit [14:13:32] Do we normally shut a box with dying disks down to avoid making things worse or do we keep it up? [14:13:52] <_joe_> it's the ganglia frontend [14:13:57] Poop. [14:14:16] sda is dying. lots of media errors and out of remappable cylinders. [14:14:48] <_joe_> so... we need to replace it somehow [14:14:56] isn't it raid? [14:15:10] mark: It is. Right now, it's doing the right thing. [14:15:14] Sep 2 14:14:38 nickel kernel: [ 309.649807] raid1: sda1: redirecting sector 51478016 to another mirror [14:15:30] doesn't sound like the right thing to me [14:15:32] mark: But it just returned from a kernel panic, so I expect that sbd is also ill. [14:15:35] remove sda from the mirror [14:16:14] (03CR) 10Andrew Bogott: Labs: provide saner nscd defaults (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/157816 (https://bugzilla.wikimedia.org/70076) (owner: 10coren) [14:16:14] it's trying to resync at the moment [14:16:39] mark: we crossed wires; I've set it to failed. 
[14:16:40] PROBLEM - Disk space on elastic1009 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 20134 MB (3% inode=99%): [14:16:45] ok [14:17:00] PROBLEM - Puppet freshness on mw1053 is CRITICAL: Last successful Puppet run was Sun 31 Aug 2014 01:50:18 UTC [14:17:26] (03PS6) 10Giuseppe Lavagetto: mediawiki: funnel all php requests through HHVM [puppet] - 10https://gerrit.wikimedia.org/r/157691 [14:17:31] mark: So I open RT for disk swap. We keep it up on one disk until then? [14:17:41] yes [14:17:45] kk [14:17:50] no point trying to keep using a disk that's known faulty [14:17:50] PROBLEM - Disk space on nickel is CRITICAL: DISK CRITICAL - free space: /mnt/ganglia_tmp 0 MB (0% inode=80%): [14:17:59] <_joe_> mh [14:18:12] <_joe_> that ^^ probably needs fixing [14:18:25] Hm. Disk full is unrelated, but probably while the last few bits of bad disk were used. [14:18:33] * Coren fixes that too. [14:18:51] PROBLEM - RAID on nickel is CRITICAL: CRITICAL: Active: 1, Working: 1, Failed: 1, Spare: 0 [14:19:40] PROBLEM - Disk space on elastic1009 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 20109 MB (3% inode=99%): [14:20:24] (03CR) 10Andrew Bogott: [C: 031] Labs: provide saner nscd defaults (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/157816 (https://bugzilla.wikimedia.org/70076) (owner: 10coren) [14:20:50] PROBLEM - puppet last run on nickel is CRITICAL: CRITICAL: Puppet has 1 failures [14:21:01] /mnt/ganglia_tmp is... tmpfs. That's _bad_. [14:21:15] Well, it's not bad that it's tmpfs, it's bad that it's full. [14:21:46] (03PS7) 10Giuseppe Lavagetto: mediawiki: funnel all php requests through HHVM [puppet] - 10https://gerrit.wikimedia.org/r/157691 [14:28:43] Reedy: if you look here, you can see JohnLewis implying that those require_once statements are unneeded because of the the extension list. Which is it? 
https://gerrit.wikimedia.org/r/#/c/155789/16/wmf-config/wikitech.php [14:29:22] Coren: might be due to codfw being added [14:29:53] andrewbogottL while https://gerrit.wikimedia.org/r/#/c/155789/20/wmf-config/extension-list-wikitech exists; they are useless. That's the base of it. [14:30:12] mark: I think it is; we'll need to tweak the RRAs to reduce the resolution. [14:30:57] JohnLewis: I don't understand, what do you mean by 'they are useless'? [14:31:12] mark: The kernel won't let tmpfs use over half of physical mem IIRC, and we're at that limit. [14:31:14] JohnLewis: andrewbogott extension-list is only used for localisation update [14:31:19] what is wikitech.php used for? [14:31:37] ok. So, JohnLewis, you now agree with Reedy that we need those include statements? [14:31:42] eh? well I lost the whole point of extension lists there. [14:31:55] no, let's not tweak the RRDs [14:31:59] let's solve the problem [14:32:01] andrewbogott: from what aude said; yeah. [14:32:05] JohnLewis: for example, localisation cache might be run on aawikibooks [14:32:06] aude: right now wikitech is maintained by hand, I'm trying to get it on the deployment train. [14:32:15] extension list installs all of wikibase on wikibooks [14:32:27] for purpose of building i18n cache [14:32:34] (and all the things) [14:32:36] <_joe_> Coren: it's strange it's full now, and it wasn't before [14:32:47] <_joe_> maybe some hiccup with cleanup in that dir? [14:33:31] codfw boxes? [14:33:34] those create new RRDs [14:33:52] _joe_: At first glance, I see nothing in there that's insane. There's a lot of pmtpa stuff that might no longer be relevant. [14:34:21] mark: yeah, as new boxen come alive in codfw, the usage will increase (obviously) and we might have just hit the limit. [14:34:31] it's not the first time ;) [14:35:04] marktraceur, ^d: So which of us wants to SWAT this morning? [14:35:10] <_joe_> my advice would be; see which files there are not opened by a program, and maybe delete those? 
[14:35:30] anomie: I wouldn't mind taking my first swing at it, if it's a smallish one [14:35:45] James_F|Away (also marktraceur and ^d): 149597 doesn't seem appropriate for SWAT; the guidelines say "No new features/extensions" [14:36:01] And I see no exceptions for "But it's only beta labs" [14:36:26] Shouldn't beta labs get automatically synced anyway? [14:36:32] marktraceur: Except for the one I just mentioned, it seems small at first glance. [14:36:41] We'd just need to merge the patch and sync the no-op change. [14:36:52] anomie: it's not new, it was renamed/split [14:37:20] MatmaRex: Still, it's adding a new extension to the deployment. [14:37:32] right. just saying. [14:38:49] isn't there an extension-list-labs ? [14:38:50] RECOVERY - puppet last run on nickel is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [14:40:04] (03CR) 10Aude: SpecialCite is now CiteThisPage (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/149597 (owner: 10Jforrester) [14:40:20] Anyway, yeah, I'll do it [14:40:28] _joe_: Not helpful; basically none of it is opened at any one time (ganglia clearly only opens as needed). Ima check for files untouched in a long time. [14:40:44] <_joe_> on a tmpfs? [14:40:45] <_joe_> :) [14:40:59] ctime should be okay. [14:41:01] the tmpfs is mirrored to on-disk [14:41:12] and from on-disk [14:41:12] so YMMV [14:41:19] depending on what rsync options used etc :) [14:41:36] (03CR) 10Aude: SpecialCite is now CiteThisPage (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/149597 (owner: 10Jforrester) [14:42:24] (03CR) 10Andrew Bogott: [C: 031] "ok, I'm now convinced that this is the right way to handle extensions." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/155789 (owner: 10Andrew Bogott) [14:42:40] PROBLEM - Disk space on elastic1009 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 20138 MB (3% inode=99%): [14:42:51] mark: think it's okay if I blow away pmtpa data untouched in 2 years? [14:43:32] (03CR) 10Andrew Bogott: [C: 031] Add temporary virt1000.wikimedia.org.erb for wikitech migration to multiversion [puppet] - 10https://gerrit.wikimedia.org/r/157704 (owner: 10Reedy) [14:43:39] (03PS3) 10Andrew Bogott: Add temporary virt1000.wikimedia.org.erb for wikitech migration to multiversion [puppet] - 10https://gerrit.wikimedia.org/r/157704 (owner: 10Reedy) [14:44:33] Coren: yes [14:44:34] mark: That should recover some 200M of space. Not a /lot/, but should help in the short term. [14:46:45] anomie: Protocol question, should I make a point of doing a separate sync for each change or is it acceptable to scap all four? [14:47:31] (03CR) 10Andrew Bogott: [C: 031] Add temporary virt1000.wikimedia.org.erb for wikitech migration to multiversion [puppet] - 10https://gerrit.wikimedia.org/r/157704 (owner: 10Reedy) [14:47:57] " depending on what rsync options used etc :)" That may actually be why the space blew up. There may be more on-disk that fits in ram and while someone else might have cleaned up the tmpfs before rebooting copied it all back. [14:48:00] PROBLEM - Puppet freshness on elastic1007 is CRITICAL: Last successful Puppet run was Sun 31 Aug 2014 06:23:11 UTC [14:48:14] yes [14:48:15] be careful [14:48:19] it's a sucky setup really [14:48:26] perhaps we can do without tmpfs on ssd [14:48:26] marktraceur: I always scap (actually sync-file or sync-dir) each individual change, and then make the responsible person test to make sure it didn't break anything. In some cases where one person has a bunch of related patches doing all-at-once could be ok, but that's the exception rather than the rule. [14:48:34] KK. 
[14:48:35] although rrd on ssd is still not great [14:49:01] mark: I should go clean up the on-disk copy then. [14:49:07] anomie: So like, MatmaRex's patches that are a revert on both branches could be one scap. [14:49:11] PROBLEM - puppet last run on cp4008 is CRITICAL: CRITICAL: Puppet has 1 failures [14:49:23] But James_F's patches for different config changes, not so much [14:49:50] Er, I guess I could just sync-file the two MatmaRex patches, but in theory. [14:50:08] marktraceur: Could be, yes, but especially on Monday and Tuesday I prefer to do the one for phase0 wikis first to avoid breaking the sites end users care about, Just In Case™. [14:50:33] Sure [14:51:54] I don't know that I've ever had one where someone said "oh crap, this didn't actually work, revert!" when testing it after the patch was merged, but I don't want to get complacent. ;) [14:52:24] (03CR) 10Alexandros Kosiaris: [C: 04-1] mediawiki: funnel all php requests through HHVM (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/157691 (owner: 10Giuseppe Lavagetto) [14:52:26] Indeed [14:52:29] my swat only affects wikisource, by the way [14:52:33] mark: Other possible watermarks (mtime) for cleanup: unwritten to in 120 days cleans ~1.3G; 200 days cleans ~900M, 300 days cleans ~600M [14:52:36] so one of the branches might not be necessary [14:52:42] but i never know which wikis are running what when [14:53:03] mark: those are all files whose mtime are more than the specified number of days away so in theory no new data came in in that interval. [14:53:08] MatmaRex: https://www.mediawiki.org/wiki/MediaWiki_1.24/Roadmap. Wikisources will be on wmf19 until this afternoon. [14:53:20] And then 20, so both are necessary [14:53:24] <^d> wikiversions.json is the only way to know for sure :p [14:53:32] Oh, it's an 18 patch. Are we running 18 anywhere? [14:53:48] ... I was wrong. 
s/wmf19/wmf18/ [14:53:51] Coren: yeah decommissioned boxes and such [14:54:00] (and then 19 this afternoon, not 20) [14:54:05] marktraceur: ^ [14:54:05] yeah, somone needs to make a timeline/calendar [14:54:13] Oh, yeah, so both need to happen [14:54:35] MatmaRex: You can also just hit Special:Version on whatever wiki you're wondering about, of course [14:54:36] MatmaRex: That would be [[Deployments]] :) [14:54:44] mark: that's what I assumed. 300 days seems like a nice point to cut off. 365 days only cleans up 100M or so - I expect much of pmtpa was decom between 300 and 365 days ago? [14:55:11] (That sounds about right no me) [14:55:29] marktraceur: as i proved yesterday, [[deployments]] is also confusing [14:55:39] * marktraceur fetches coffee before swatting [14:57:12] Coren: yeah, for instance the squid cleanups I did when I came back from vacation in december [14:57:29] or perhaps that falls just outside of it [14:57:35] mark: Allright, blowing up data older than 300 days. [14:57:41] k [14:58:14] MatmaRex: I'll do yours first, James_F|Away is still |Away, that tardy sumbitch [14:58:45] Step 1. Sign into gerrit [14:58:50] RECOVERY - Disk space on nickel is OK: DISK OK [15:00:02] ^^ is okay now; but we'll run into the same problem eventually as codfw ramps up. An easy (if ugly) fix would be to add moar ramz to nickel. [15:00:05] manybubbles, anomie, ^d, marktraceur: Respected human, time to deploy SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140902T1500). Please do the needful. [15:00:17] Coren: akosiaris has a request for a new box in [15:00:21] OK, I'm obviously taking it, MatmaRex is first. Waiting for Jenkins to merge. [15:00:25] so that should be handled [15:00:30] * Coren nods. [15:00:38] Indeed. 
[15:01:09] Coren: https://rt.wikimedia.org/Ticket/Display.html?id=7921 for reference [15:05:21] Syncing the wmf18 change [15:05:22] !log marktraceur Synchronized php-1.24wmf18/includes/EditPage.php: [SCAP] Revert "Toolbar: Only show on WikiText pages" (duration: 00m 08s) [15:05:28] Logged the message, Master [15:05:29] MatmaRex: Testy test? :) [15:05:44] James_F|Away: You're up next, better get to the office [15:06:03] marktraceur: testy testy, verified fixed [15:06:04] thanks [15:06:09] Sweet, syncing 19 [15:06:11] RECOVERY - puppet last run on cp4008 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [15:06:18] !log marktraceur Synchronized php-1.24wmf19/includes/EditPage.php: [SCAP] Revert "Toolbar: Only show on WikiText pages" (duration: 00m 07s) [15:06:24] Logged the message, Master [15:07:21] I'd ask you to test again, but I don't think it would help [15:07:30] * marktraceur waits for James_F [15:08:58] (03CR) 10coren: Labs: provide saner nscd defaults (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/157816 (https://bugzilla.wikimedia.org/70076) (owner: 10coren) [15:09:03] (03CR) 10BryanDavis: "Tried to answer question about extension-list-wikitech inline." (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/155789 (owner: 10Andrew Bogott) [15:09:29] (03CR) 10coren: [C: 032] Labs: provide saner nscd defaults [puppet] - 10https://gerrit.wikimedia.org/r/157816 (https://bugzilla.wikimedia.org/70076) (owner: 10coren) [15:10:04] Oh, is James gracing us with his presence now? :D [15:10:20] marktraceur: hey, sorry. [15:10:34] No problem, we still have plenty of time [15:10:38] * marktraceur starts merging wantonly [15:11:23] (03CR) 10Andrew Bogott: "Yep, Aude explained on IRC as well. Thanks! Is this patch now deserving of a +1?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/155789 (owner: 10Andrew Bogott) [15:11:25] (03CR) 10MarkTraceur: [C: 032] Enable TemplateData GUI on Norwegian Bokmål (nowiki) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157476 (https://bugzilla.wikimedia.org/70216) (owner: 10Jforrester) [15:11:41] (03Merged) 10jenkins-bot: Enable TemplateData GUI on Norwegian Bokmål (nowiki) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157476 (https://bugzilla.wikimedia.org/70216) (owner: 10Jforrester) [15:11:59] (03CR) 10John F. Lewis: [C: 031] "Restore my looks good as my stupidity was correctly :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/155789 (owner: 10Andrew Bogott) [15:12:26] <_joe_> akosiaris: thanks for the review, but I was about to abort that patch alltoghether. I realized things are more simple and straight forward if I just define two puppet functions there [15:14:01] _joe_: cool [15:14:34] !log marktraceur updated /a/common to {{Gerrit|Ia1758b21e}}: depool db1035 for upgrade, move s3 vslow/dump to db1019 [15:14:42] Logged the message, Master [15:14:56] Hm [15:14:59] <_joe_> btw, yesterday I made some pretty neat benchmarking of various apache proxy methods; I should post those somewhere [15:15:00] That created a merge commit. [15:15:09] Oops. [15:15:17] Not ff only? [15:15:28] I guess it was [15:15:36] I should have rebased first [15:15:39] No big deal [15:15:56] marktraceur: Forgot to rebase! ;) [15:16:41] I have one more with which to redeem myself [15:16:54] Hah [15:16:59] :-) [15:16:59] !log marktraceur Synchronized wmf-config/InitialiseSettings.php: [SCAP] Enable the TemplateData GUI editor on Norwegian Wikipedia (duration: 00m 07s) [15:17:03] Encoding error because my message included non-ASCiI [15:17:06] Logged the message, Master [15:17:07] bd808: ^^ [15:17:21] James_F: Test? [15:17:30] marktraceur: Works great. [15:17:38] Sweet. [15:17:45] Yay for PHP config which is instant. 
[15:17:53] marktraceur: https://wikitech.wikimedia.org/w/index.php?title=Module:Deployment_schedule&diff=125469&oldid=118682 [15:18:03] No waiting five minutes for bits. [15:18:10] <^d> James_F: We could've gone with hphpc :p [15:18:15] marktraceur: That weird logging error is weird. I've never been able to reproduce on a test scap setup. [15:18:16] <^d> Then we would've had to wait an hour or more! [15:18:42] MatmaRex: You're my favourite [15:18:49] ^d: :-) [15:19:03] that should make it harder for me to miss which day the deployment is scheduled on [15:19:22] (03CR) 10BryanDavis: "One small change needed to move $wgExtensionEntryPointListFiles addition into the body of CommonSettings.php. Other than that I think it l" (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/155789 (owner: 10Andrew Bogott) [15:19:29] (03CR) 10MarkTraceur: [C: 032] SpecialCite is now CiteThisPage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/149597 (owner: 10Jforrester) [15:19:35] (03Merged) 10jenkins-bot: SpecialCite is now CiteThisPage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/149597 (owner: 10Jforrester) [15:19:45] Argh. [15:20:06] ? [15:20:09] I didn't rebase [15:20:17] Tsk. [15:20:38] Again. [15:21:04] I'm terrible [15:21:18] Yes. [15:21:20] PROBLEM - Kafka Broker Messages In on analytics1021 is CRITICAL: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 0.00013662708443 [15:21:23] marktraceur: If you have any new data about the ascii codec error in scap logging, please add to https://bugzilla.wikimedia.org/show_bug.cgi?id=68780 [15:21:54] Or at least give a "me too" and if you have it the commit and command that were running [15:21:56] !log marktraceur Synchronized wmf-config/: [SCAP] SpecialCite is now CiteThisPage (duration: 00m 07s) [15:22:04] Logged the message, Master [15:22:55] <^d> marktraceur: Why is rebasing a concern? [15:23:06] <^d> (aka: why is a merge commit bad?) 
[15:23:12] * bd808 would like some time to polish scap logging code [15:23:13] (03CR) 10coren: [C: 032] "The "ghost" diffs look to me like UTF-8 canonicalization (e.g.: Ç -> C+¸). I don't know if that fixes the underlying issue, but at worse " [puppet] - 10https://gerrit.wikimedia.org/r/157708 (owner: 10John F. Lewis) [15:23:25] (03CR) 10coren: [V: 032] "The "ghost" diffs look to me like UTF-8 canonicalization (e.g.: Ç -> C+¸). I don't know if that fixes the underlying issue, but at worse " [puppet] - 10https://gerrit.wikimedia.org/r/157708 (owner: 10John F. Lewis) [15:23:31] ^d: It's not, but there's a step that says "if there's a merge commit it might be something wrong" [15:23:38] James_F: That sync-dir was for you, check it? [15:23:49] doing. [15:23:57] Sweet [15:24:04] bd808: Done [15:24:06] <^d> marktraceur: Merge commit locally that's not in gerrit? Yeah. [15:24:12] marktraceur: ty [15:24:22] <^d> If merge commits are bad in gerrit we could just change that setting [15:24:34] ^d: I mean, I'd argue it's unnecessary noise fo rus [15:24:35] for us. [15:24:57] It's automatically mergeable anyway, why not make it a rebase [15:25:54] marktraceur: Looks good to me. [15:26:57] Sweet [15:27:02] <^d> marktraceur: We tried ff-only on puppet once. People found it annoying. [15:27:22] <^d> I've never tried rebase if necessary. Not sure what will happen. Docs suck [15:27:26] I declare this SWAT closed [15:27:32] <^d> REBASE_IF_NECESSARY: rebase the commit when required. [15:27:36] <^d> Thx gerrit. [15:27:36] Yay [15:27:39] <^d> useful doc. [15:28:06] MatmaRex: [[Deployments]] headers are still not editable...oh because they're in the bloody lua module [15:28:09] I hate MediaWiki sometimes [15:28:32] <^d> It's vague on what happens. Does it merge if it can't rebase? [15:28:36] <^d> Or does it fail? [15:29:20] <^d> If the change being submitted is a strict superset of the destination branch, then the branch is fast-forwarded to the change. 
If not, then the change is automatically rebased and then the branch is fast-forwarded to the change. [15:30:00] <^d> Eh, let's try it. See what happens. [15:31:47] What's the worst that can happen? [15:31:48] :P [15:32:14] (03PS21) 10Andrew Bogott: Add wikitech config. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/155789 [15:33:48] (03PS22) 10Andrew Bogott: Add wikitech config. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/155789 [15:37:44] (03PS1) 10BBlack: Remove most amslvs[1-4] references [puppet] - 10https://gerrit.wikimedia.org/r/157821 [15:40:28] Reedy: are you now happy with the state of the wikitech changes? [15:40:31] PROBLEM - exim incoming message rate on iodine is CRITICAL: exim_messages_in CRITICAL: 0.0 [15:44:02] !log bring mw1114 -> mw1131 to weight 15 [15:44:07] Logged the message, Master [15:46:54] (03CR) 10Filippo Giunchedi: Fix startup of hhvm (034 comments) [debs/hhvm] - 10https://gerrit.wikimedia.org/r/157798 (owner: 10Giuseppe Lavagetto) [15:47:59] (03CR) 10BryanDavis: [C: 04-1] "l10n update won't pick up wikitech with the labswiki guard condition in place. This will make wikitech very unhappy for l10n messages as t" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/155789 (owner: 10Andrew Bogott) [15:48:47] (03CR) 10Filippo Giunchedi: elasticsearch: handle request timeout and increase timeout (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/157805 (owner: 10Filippo Giunchedi) [15:52:06] (03Abandoned) 10Giuseppe Lavagetto: mediawiki: funnel all php requests through HHVM [puppet] - 10https://gerrit.wikimedia.org/r/157691 (owner: 10Giuseppe Lavagetto) [15:52:11] (03PS4) 10Filippo Giunchedi: elasticsearch: handle request timeout and increase timeout [puppet] - 10https://gerrit.wikimedia.org/r/157805 [15:54:01] (03PS23) 10Andrew Bogott: Add wikitech config. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/155789 [15:54:11] thanks for your patience, bd808 [15:54:42] andrewbogott: No worries. 
That code is a tangle of under documented features [15:55:00] Yeah, except in this case you told me just what to do and I did the opposite :) [15:57:21] (03CR) 10BryanDavis: [C: 031] Add wikitech config. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/155789 (owner: 10Andrew Bogott) [15:57:53] (03PS1) 10Chad: Commons getting Cirrus as primary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157824 [16:00:05] manybubbles, ^d: Dear anthropoid, the time has come. Please deploy Search (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140902T1600). [16:00:05] andrewbogott, bd808: Respected human, time to deploy Wikitech (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140902T1600). Please do the needful. [16:00:36] <^d> I'm not a respected human :( [16:00:46] andrewbogott, Reedy: Are we ready? [16:00:59] I think we are, except for Reedy not being here [16:01:03] <^d> bd808: Can I do my patch first since it's one file? [16:01:03] ^d: Blame yuvipanda. He made it random [16:01:14] ^d: Yeah all yours [16:01:43] (03CR) 10Chad: [C: 032] Commons getting Cirrus as primary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157824 (owner: 10Chad) [16:01:47] (03Merged) 10jenkins-bot: Commons getting Cirrus as primary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157824 (owner: 10Chad) [16:02:11] (03CR) 10Aude: "suggestion" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/155789 (owner: 10Andrew Bogott) [16:03:00] !log demon Synchronized wmf-config/InitialiseSettings.php: Commons gets Cirrus as primary (duration: 00m 04s) [16:03:05] Logged the message, Master [16:03:37] (03CR) 10Andrew Bogott: Add wikitech config. (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/155789 (owner: 10Andrew Bogott) [16:04:12] andrewbogott: is composer at all used for the smw stuff? [16:04:24] <^d> bd808: Ok, I'm out of your way. [16:04:31] aude: not at the moment. 
We're pinned to an old pre-composer version of smw [16:04:36] ok [16:04:38] That needs work, but not today [16:04:42] baby steps [16:04:44] :) [16:04:48] when/if it does, then sometimes composer does magic with autoloading [16:05:02] we've worked around it where needed for wikidata [16:05:07] bd808: ok, I'm about to sync the private settings. [16:05:11] anyway, no worries for wikitech yet [16:05:40] andrewbogott: Cool. You can drive and I'll be here for moral support [16:05:55] (03CR) 10Aude: [C: 031] Add wikitech config. (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/155789 (owner: 10Andrew Bogott) [16:06:00] PROBLEM - puppet last run on mw1153 is CRITICAL: CRITICAL: Epic puppet fail [16:07:46] !log andrew Synchronized /a/common/private/PrivateSettings.php: (no message) (duration: 00m 03s) [16:07:53] Logged the message, Master [16:08:18] !log andrew Synchronized /a/common/private/WikitechPrivateSettings.php: (no message) (duration: 00m 04s) [16:08:24] Logged the message, Master [16:08:39] bd808: I assume now is when a tiny syntax error in PrivateSettings.php breaks everything, right? [16:09:07] andrewbogott: It should have been linted as part of the sync, but yeah :) [16:09:22] (03CR) 10Andrew Bogott: [C: 032] Add wikitech config. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/155789 (owner: 10Andrew Bogott) [16:09:52] hm I just got a gerrit message I've never seen before, 'Change is new' [16:10:35] Maybe related to the config change ^d was talking about for merge commits? [16:10:54] <^d> andrewbogott: In what context? [16:10:56] yeah, it immediately rebased and rechecked [16:11:00] It was when I clicked 'submit' [16:11:20] on the wikitech config patch [16:11:24] <^d> We don't do submit generally on mw-config. [16:11:32] oh? What, then? 
[16:11:36] <^d> We +2 and let jenkins submit :) [16:11:54] <^d> (vague button name sucks) [16:12:03] andrewbogott: You will want to do a full scap to send out those changes so the l10n cache is rebuilt [16:12:23] <^d> andrewbogott: nvm. I see it now. [16:12:26] ok. So, first: Jenkins submits on a 15-minute cron, is that right? So I just wait? [16:12:40] And, second -- what's the command for a 'full scap'? [16:12:58] andrewbogott: `scap` [16:13:07] well, that's easy enough [16:13:12] Rather than sync-file or sync-dir [16:13:53] So something like `scap 'Deploying config for wikitech'` [16:14:23] <^d> andrewbogott: And it's ~5mins for beta. [16:14:24] <^d> Not 15. [16:14:35] ok, but… beta? [16:14:51] <^d> But otherwise yeah, jenkins will pick up the change for beta. [16:15:00] Why are we talking about beta? [16:15:24] <^d> Oh, you asked on a 15 minute cron automatically :p [16:15:28] <^d> So I thought you were asking about beta. [16:15:35] So, let me restate: How should I submit this patch? [16:15:54] <^d> Lemme change the config back and submit it. [16:15:57] <^d> I think it confused jenkins. [16:16:02] ok, thanks [16:16:22] oh, there it went! Or, at least, vanished from my gerrit view [16:16:26] <^d> I merged it. [16:16:30] <^d> I dunno why it didn't report to irc. [16:16:36] <^d> MERGED: https://gerrit.wikimedia.org/r/#/c/155789/ [16:16:37] Does 'scap' do a rebase before syncing? [16:16:59] <^d> You pull the changes into /a/common on tin, then scap. Scap only pushes out what's on tin. [16:17:05] <^d> Doesn't do the pull for you. [16:17:21] ok, so I need to do a pull before I scap, correct? [16:17:25] !log installing newer version of webstatscollector on oxygen and gadolinium, restarting filter process on oxygen [16:17:31] Logged the message, Master [16:17:59] And, am I pulling 'origin' or a particular branch? 
[16:18:00] PROBLEM - Puppet freshness on mw1053 is CRITICAL: Last successful Puppet run was Sun 31 Aug 2014 01:50:18 UTC [16:19:02] (03CR) 10JanZerebecki: [C: 04-1] "This should be folded into its not yet merged dependency because these are just a few additional changes or the commit message should actu" [puppet] - 10https://gerrit.wikimedia.org/r/139193 (owner: 10Christopher Johnson (WMDE)) [16:19:09] <^d> andrewbogott: origin is the remote. the branch is master. [16:19:35] git fetch; git rebase origin/master [16:19:44] 'git fetch origin && git reset --hard origin' [16:19:48] OK, great, same thing [16:19:59] * andrewbogott fetches [16:20:14] <^d> bd808: turned that silly setting back off for mw-config in gerrit. it confused jenkins bad. [16:20:29] * andrewbogott scaps [16:20:38] !log andrew Started scap: Deploying wikitech config [16:20:44] Logged the message, Master [16:20:45] * andrewbogott watches en nervously [16:21:56] !log andrew scap failed: CalledProcessError Command '/usr/local/bin/mwscript mergeMessageFileList.php --wiki="cawikibooks" --list-file="/a/common/wmf-config/extension-list" --output="/tmp/tmp.SCRILhxGxO" ' returned non-zero exit status 1 (duration: 01m 17s) [16:22:02] Logged the message, Master [16:22:25] so! [16:22:29] https://dpaste.de/zg8d [16:22:42] * bd808 looks [16:23:46] andrewbogott: That message is telling us that l10n update is sad for some reason. But it's not really telling us why specifically. [16:23:56] logging fail [16:24:12] you can run it manually [16:24:16] and see what it says [16:24:20] * bd808 tries to look on fluroine to see if there is more detail [16:25:00] RECOVERY - puppet last run on mw1153 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [16:25:28] ooh [16:25:55] (03PS1) 10Giuseppe Lavagetto: hhvm: do not autoupdate packages [puppet] - 10https://gerrit.wikimedia.org/r/157826 [16:26:21] Semantic MediaWiki must be installed for Semantic Forms to run [16:26:50] order of entry points? 
[16:27:16] (03PS2) 10BBlack: Remove most amslvs[1-4] references [puppet] - 10https://gerrit.wikimedia.org/r/157821 [16:27:18] (03PS1) 10BBlack: Puppet LVS cleanup [puppet] - 10https://gerrit.wikimedia.org/r/157827 [16:27:18] The order looks correct to me [16:27:22] (03CR) 10Giuseppe Lavagetto: [C: 032] hhvm: do not autoupdate packages [puppet] - 10https://gerrit.wikimedia.org/r/157826 (owner: 10Giuseppe Lavagetto) [16:27:24] no [16:27:29] "Extension /a/common/php-1.24wmf18/extensions/CiteThisPage/CiteThisPage.php doesn't exist" [16:27:36] "Some files are missing (see above). Giving up." [16:27:43] https://gerrit.wikimedia.org/r/#/c/155789/24/wmf-config/extension-list-wikitech [16:27:48] what? [16:27:51] Found that on fluorine as the cause [16:28:04] THat was what made l10nupdate die [16:28:21] I didn't add CiteThisPage, did I? [16:28:21] that was in swat [16:28:33] oh, they're in order in the config but out of order in the extensionlist? [16:28:35] * andrewbogott looks [16:28:42] when i run mergemessages, i get the smw error [16:28:52] i would put smw first then smw forms [16:28:59] it seems to care about order [16:29:45] I think SMW do [16:29:47] (03PS1) 10Andrew Bogott: Reorder extensions in an attempt to avoid dependency errors. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157828 [16:29:49] there's dependancies [16:29:54] /a/common/php-1.24wmf18/extensions does not have CiteThisPage. Is it wmf19 only? [16:29:55] citethispage shouold probably be in extension-list-labs ? [16:30:05] labs only [16:30:05] ^ [16:30:09] afaik [16:30:14] It's still on the deploy list [16:30:37] * bd808 nods [16:30:46] Not in wmf19 yet either [16:30:47] Hmm [16:30:55] It's in the main CommonSettings.php [16:31:05] Who merged that? :/ [16:31:10] So picked up in swat but not really configured right [16:31:23] (03CR) 10Aude: [C: 031] Reorder extensions in an attempt to avoid dependency errors. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/157828 (owner: 10Andrew Bogott) [16:31:29] marktraceur I'd guess if it was swat today [16:31:44] https://gerrit.wikimedia.org/r/#/c/149597/ [16:31:47] Looks like it [16:32:10] Hi there, what's up? [16:32:25] CiteThisPage being in extension-list [16:32:29] Ah. [16:32:30] meantime, can someone +2 https://gerrit.wikimedia.org/r/#/c/157828/ ? [16:32:35] It's not branched so it gets upset [16:32:43] andrewbogott: sure [16:32:46] (03PS3) 10BBlack: Remove most amslvs[1-4] references [puppet] - 10https://gerrit.wikimedia.org/r/157821 [16:32:47] [17:29:55] citethispage shouold probably be in extension-list-labs ? [16:32:56] (03CR) 10Aude: [C: 032] "looks correct" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157828 (owner: 10Andrew Bogott) [16:33:16] (03Merged) 10jenkins-bot: Reorder extensions in an attempt to avoid dependency errors. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157828 (owner: 10Andrew Bogott) [16:33:16] (03CR) 10BBlack: [C: 032 V: 032] Remove most amslvs[1-4] references [puppet] - 10https://gerrit.wikimedia.org/r/157821 (owner: 10BBlack) [16:33:16] thx [16:33:25] Reedy: i'm not totally sure where it goesas some config is in CommonSettings [16:33:32] but then looks like the extension is on labs only [16:33:56] The bits in common settings are guarded but yeah l10n should not be in extension-list yet [16:34:01] I think it might be ok still being in CommonSettings [16:36:00] suppose the extension has to be added as a submodule [16:36:03] to wmf 18 / 19 [16:36:03] So can we revert https://gerrit.wikimedia.org/r/#/c/149597 and then get it fixed and re-deployed later today? 
[16:36:08] in order to be in extension-list [16:36:14] if i understand correct [16:36:17] exactly [16:36:25] k [16:36:26] we either revert 149597 temporarily [16:36:30] Or just move CiteThisPage [16:36:33] I was just going to do the latter [16:36:42] your call Reedy [16:36:50] move seems good [16:37:22] (03PS2) 10BBlack: Puppet LVS cleanup [puppet] - 10https://gerrit.wikimedia.org/r/157827 [16:37:37] (03PS1) 10Reedy: Move CiteThisPage to extension-list-labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157830 [16:38:36] (03PS1) 10Reedy: Alphasort extension-list-labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157831 [16:38:58] Reedy: alphasort was just what aude and I undid for wikitech. [16:39:06] Don't you think there are dependency issues there as well? [16:39:19] for labs, probably ok [16:39:26] * aude looks [16:39:29] Ditto [16:39:50] yes, ok [16:39:58] it's just wikibase and smw that tend to care [16:39:58] (03Draft3) 10Giuseppe Lavagetto: beta: use HHVM for all requests [puppet] - 10https://gerrit.wikimedia.org/r/157823 [16:39:58] FundraisingTranslateWorkflow depends on Translate presumably, but that's in a different list etc [16:40:02] shall I +2 the CiteThisPage change so I can scap again? 
[16:40:11] yeah [16:40:13] * aude learn the hard way :p [16:40:31] RECOVERY - exim incoming message rate on iodine is OK: exim_messages_in OKAY: 6.0 [16:40:35] (03CR) 10Andrew Bogott: [C: 032] Move CiteThisPage to extension-list-labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157830 (owner: 10Reedy) [16:40:39] (03Merged) 10jenkins-bot: Move CiteThisPage to extension-list-labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157830 (owner: 10Reedy) [16:40:59] andrewbogott: When you scap again, do `scap --verbose 'message'` and you'll get more info if it breaks again [16:41:02] (03PS22) 1001tonythomas: Added the bouncehandler router to catch in all bounce emails [puppet] - 10https://gerrit.wikimedia.org/r/155753 [16:41:12] !log andrew Started scap: Deploying wikitech config [16:43:44] l10n update is still failing in beta when Jenkins runs it. I'll look deeper [16:44:37] 85dbd05..4f172bb wmf/1.24wmf15 -> gerrit/wmf/1.24wmf15 [16:44:40] wikitech? [16:45:40] Nope... [16:46:28] * Reedy kicks MaxSem [16:48:48] In beta "Exception from line 31 of /mnt/srv/scap-stage-dir/php-master/extensions/Validator/Validator.php: Validator depends on the ParamProcessor library" [16:48:54] lmfao [16:49:00] PROBLEM - Puppet freshness on elastic1007 is CRITICAL: Last successful Puppet run was Sun 31 Aug 2014 06:23:11 UTC [16:49:03] That is the current l10n fail there [16:49:04] Does Wikitech load that? We don't branch it... [16:49:32] Yes, wikitech needs Validator. [16:49:42] ParamProcessor, not validator :) [16:49:59] OH [16:50:00] I know [16:50:05] bd808: master Validator [16:50:07] vs what we branch [16:50:07] :( [16:50:24] bah [16:50:28] "if ( !class_exists( 'ParamProcessor\Processor' ) ) { throw new Exception( 'Validator depends on the ParamProcessor library.' );" [16:50:29] I don't see any evidence that ParamProcessor is used or loaded on the current wikitech [16:50:41] andrewbogott: We're using pinned/tested versions defined by Ryan [16:50:47] Oh. 
right [16:50:54] And beta pulls masters [16:50:56] Clusterfuck [16:50:58] The latest patch of Validator on wikitech is Bump to 0.5.2 beta [16:51:02] blerg [16:51:10] (03PS1) 10Plucas: Fix variable access deprecation warning [puppet/jmxtrans] - 10https://gerrit.wikimedia.org/r/157833 [16:51:11] We could do with updating SMW etc [16:51:19] But I guess now really isn't the time [16:51:25] :) probably not [16:51:36] you would need to do what we do with wikibase [16:51:38] but not now [16:51:50] I look forward to the day when ever WMF employee has said that so that I get a break from hearing it :) [16:51:52] (make a build of smw and commit that) [16:52:13] So how do we tell beta not to care about l10n for wikitech? Add a !labs guard in commonsettings? [16:52:13] (03CR) 10JanZerebecki: [C: 04-1] "This is not instructing puppet to do anything with these files. Which is something that needs to be fixed before merging this." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/136095 (owner: 10Christopher Johnson (WMDE)) [16:52:34] scap on tin seems pretty happy so far. [16:52:42] sweet [16:53:40] RECOVERY - Disk space on elastic1009 is OK: DISK OK [16:55:51] (03PS1) 10BryanDavis: Do not load SMW for l10n in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157835 [16:56:17] Reedy, aude: ^ seem reasonable? [16:56:55] bd808: Hmm. Should we just wrap it if dbname == labswiki? [16:57:03] They don't need to be in the main l10n cache files [16:57:12] Oh, that won't work, will it [16:57:23] nope, it won't [16:57:24] looking [16:57:26] do we have a different realm for wikitech? [16:57:45] No. We decided against a realm for various reasons [16:57:54] Mostly because realm comes from puppet [16:58:01] hmmmm [16:58:14] cause adding all the wikitech stuff to all the l10n cache is going to increase them a fair bit [16:58:28] not so much of a big deal I guess [16:58:30] bd808: is there any reason for me to wait for scap on tin to finish before I do a sync on virt1000? 
[16:58:42] we're generating them locally, not transferring the generated [16:59:03] (03CR) 10Reedy: [C: 031] Do not load SMW for l10n in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157835 (owner: 10BryanDavis) [16:59:06] andrewbogott: If the l10n build is done on tin you can sync from virt1000 [16:59:11] cool [16:59:39] (03CR) 10BryanDavis: [C: 032] Do not load SMW for l10n in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157835 (owner: 10BryanDavis) [16:59:44] (03Merged) 10jenkins-bot: Do not load SMW for l10n in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157835 (owner: 10BryanDavis) [17:00:30] I'm AFK for dinner. Back in an hour or so :) [17:01:00] Nikerabbit: heya, can you take a quick look at https://bugzilla.wikimedia.org/show_bug.cgi?id=68781, it probably needs your expertise. [17:01:02] !log Fetched f711ea7 to /a/common on tin; not syncing because of in-process scap. [17:01:09] Logged the message, Master [17:01:12] (03PS4) 10Andrew Bogott: Add temporary virt1000.wikimedia.org.erb for wikitech migration to multiversion [puppet] - 10https://gerrit.wikimedia.org/r/157704 (owner: 10Reedy) [17:02:02] l10n update is building in beta now. \o/ [17:02:15] (03CR) 10Andrew Bogott: [C: 032] Add temporary virt1000.wikimedia.org.erb for wikitech migration to multiversion [puppet] - 10https://gerrit.wikimedia.org/r/157704 (owner: 10Reedy) [17:03:19] (03PS3) 10BBlack: Puppet LVS cleanup [puppet] - 10https://gerrit.wikimedia.org/r/157827 [17:03:32] hm, I wish Reedy had not left... 
[17:03:39] (03CR) 10BBlack: [C: 032 V: 032] Puppet LVS cleanup [puppet] - 10https://gerrit.wikimedia.org/r/157827 (owner: 10BBlack) [17:03:45] SERVER: Failed to parse template apache/sites/virt1000.wikimedia.org.erb [17:04:04] "Error connecting to 208.80.154.18: :real_connect(): (HY000/2003): Can't connect to MySQL server on '208.80.154.18'" showing up in lgostash [17:04:33] * bd808 looks at erb patch [17:04:53] bd808: that's labs-related [17:05:09] bd808: I'm looking too… it's because of an undefined var getting passed to erb. https://dpaste.de/3w88 [17:05:38] PROBLEM - puppet last run on virt1000 is CRITICAL: CRITICAL: Epic puppet fail [17:05:52] andrewbogott: line 45 @ssl_settings [17:05:56] yep [17:06:09] PROBLEM - puppet last run on mw1187 is CRITICAL: CRITICAL: Puppet has 1 failures [17:06:19] copy pasta [17:06:56] do you see how to fix? It's not obvious to me why that's not defined [17:07:06] bblack: The mysql errors? They are coming from mw* servers [17:08:09] The ip is the new virt1000 in wmf-config/db-eqiad.php [17:08:24] So we did soemthing not right there I imagine [17:09:26] gwicke: i just wanted to point out that your input on https://rt.wikimedia.org/Ticket/Display.html?id=8236 is awesome =] [17:09:47] that kind of ratio info is invaluable for me in finding hardware to suit needs, thanks! [17:10:04] Anybody understand db-eqiad.php and know if adding virt1000 to 'sectionloads' was a bad idea? 
[17:10:06] (03PS1) 10Andrew Bogott: Stab in the dark: Move the virt1000 vhost to openstack.pp [puppet] - 10https://gerrit.wikimedia.org/r/157838 [17:10:12] robh: you are welcome ;) [17:10:14] bd808: ^ [17:10:49] it also makes any request a lot more legitimate than 'just give us a lot of cores man' heh [17:11:12] (03PS1) 10BBlack: put standard class back on amslvs[1-4] [puppet] - 10https://gerrit.wikimedia.org/r/157839 [17:11:17] currently my laptop beats any production server at node benchmarks [17:11:27] mostly because its single-thread performance is pretty good [17:11:29] (03CR) 10BBlack: [C: 032 V: 032] put standard class back on amslvs[1-4] [puppet] - 10https://gerrit.wikimedia.org/r/157839 (owner: 10BBlack) [17:12:08] (03CR) 10Andrew Bogott: [C: 032] "It won't be any /more/ broken" [puppet] - 10https://gerrit.wikimedia.org/r/157838 (owner: 10Andrew Bogott) [17:12:13] * bblack renames gerrit.wikimedia.org to grrrrrrrrrrrrrrrrrrit.wikimedia.org [17:12:14] (03PS2) 10Andrew Bogott: Stab in the dark: Move the virt1000 vhost to openstack.pp [puppet] - 10https://gerrit.wikimedia.org/r/157838 [17:12:17] robh: am happy to supply some numbers, for example for 'openssl speed' [17:13:36] (03CR) 10Andrew Bogott: [C: 032] Stab in the dark: Move the virt1000 vhost to openstack.pp [puppet] - 10https://gerrit.wikimedia.org/r/157838 (owner: 10Andrew Bogott) [17:13:48] andrewbogott: heh. 
I was trying to see where that var comes from, but you're right it won't be worse :) [17:14:13] !log andrew Finished scap: Deploying wikitech config (duration: 33m 03s) [17:14:18] Logged the message, Master [17:15:28] (03PS1) 10BBlack: remove commented-out ref to dead pmtpa addr [dns] - 10https://gerrit.wikimedia.org/r/157840 [17:15:43] (03CR) 10BBlack: [C: 032 V: 032] remove commented-out ref to dead pmtpa addr [dns] - 10https://gerrit.wikimedia.org/r/157840 (owner: 10BBlack) [17:15:59] bd808: my dumb patch fixed the puppet run [17:16:16] andrewbogott: awesome [17:16:20] bd808: but now… observe https://virt1000.wikimedia.org/ [17:16:34] * andrewbogott curses Reedy's appetite a second time [17:16:38] RECOVERY - puppet last run on virt1000 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [17:16:51] Oh, and wikitech is broken too now! [17:16:53] woot [17:18:12] andrewbogott: Back out the vhost and I'll look in a minute. Trying to understand db errors from mw* servers trying to hit virt1000 at the moment [17:18:20] yeah, backed out. [17:19:04] I don't suppose springle is around at this hour? [17:19:46] I think that may be not the right config [17:20:17] "all servers which are down or do not replicate should be removed" [17:21:28] maybe it's not needed in section loads? [17:21:50] don't trust me though [17:22:03] aude: Yeah I'm thinking that is the case [17:22:32] not everything goes in a section [17:22:32] like whatever db flow is in etc. [17:22:42] (03PS1) 10BryanDavis: Remove virt1000 from wgLBFactoryConf['sectionLoads'] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157842 [17:22:48] (03CR) 10jenkins-bot: [V: 04-1] Remove virt1000 from wgLBFactoryConf['sectionLoads'] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157842 (owner: 10BryanDavis) [17:22:52] logpager, etc [17:23:16] there's stuff in groupLoadsBySection [17:23:33] Grrr. 
REmoving cause test failures [17:23:54] * aude slightly confused [17:24:09] RECOVERY - puppet last run on mw1187 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [17:24:53] <^d> gerrit? where'd you go? [17:24:55] (03CR) 10Ori.livneh: [C: 031] Add pcre overflow patch that should prevent beta from crashing regularly [debs/hhvm] - 10https://gerrit.wikimedia.org/r/157790 (owner: 10Giuseppe Lavagetto) [17:25:03] Test "dbconfigTests::testDbAssignedToAnExistingCluster" fails [17:25:05] <^d> super slow and i can't ssh in. [17:25:09] <^d> Oh there we go [17:25:11] <^d> just slow [17:26:50] bd808: that makes me think the test is wrong. [17:27:21] (03PS1) 10Andrew Bogott: Fix a typo -- s/virt1001/virt1000 [puppet] - 10https://gerrit.wikimedia.org/r/157844 [17:27:29] bd808: ^ is part but not all of the problem. wikitech still redirects to virt1000, and virt1000 can't find the install [17:27:56] (03CR) 10Andrew Bogott: [C: 032] Fix a typo -- s/virt1001/virt1000 [puppet] - 10https://gerrit.wikimedia.org/r/157844 (owner: 10Andrew Bogott) [17:28:29] andrewbogott: So /usr/local/apache/common/docroot/wikimedia.org doesn't exist? Or something else? [17:28:38] Oh. duh needs a static mapping [17:29:12] # ls /usr/local/apache/common/docroot/wikimedia.org [17:29:14] 503.html images w [17:29:36] Hmmm Reedy made static mapping in https://gerrit.wikimedia.org/r/#/c/157757/1/multiversion/MWMultiVersion.php,unified [17:29:39] (03CR) 10Chad: [C: 031] elasticsearch: handle request timeout and increase timeout [puppet] - 10https://gerrit.wikimedia.org/r/157805 (owner: 10Filippo Giunchedi) [17:31:38] The dblist thing has me more concerned. Seeing an error that jawiki is trying to talk to virt1000 doesn't seem right at all. [17:32:15] bd808: that config is divided into db clusters, right? So can we just declare wikitech to be a new one-host cluster? 
[17:32:23] With a new arbitrary # [17:34:38] andrewbogott: Maybe all the labswiki stuff should just go in the conditional guard you added at the end? [17:34:49] * bd808 is grasping for straws on this [17:35:05] I'll be back on my laptop in a few [17:35:08] Will catch up [17:36:49] Reedy: 2 issues right now. 1) wikis are trying to connect to virt1000 as a db server. 2) redirect loop in virt1000 apache vhost. [17:37:17] PROBLEM - puppet last run on db60 is CRITICAL: CRITICAL: Epic puppet fail [17:37:21] Reedy: thanks, sorry to interrupt [17:37:21] Actually, 2) is no longer a loop, it's just that wikitech redirects to virt1000, and virt1000 doesn't work. [17:37:27] It's alright [17:37:51] 157757 isn't merged yet ;) [17:38:20] (03PS2) 10Andrew Bogott: Add virt1000.wikimedia.org static mapping for wikitech migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157757 (owner: 10Reedy) [17:38:25] oh, I didn't know there were two! Ok, will merge that [17:38:44] (03CR) 10Andrew Bogott: [C: 032] Add virt1000.wikimedia.org static mapping for wikitech migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157757 (owner: 10Reedy) [17:39:16] so, that will probably fix the 'virt1000 is broken' but not the 'wikitech -> virt1000' issue [17:39:51] !log andrew Synchronized multiversion/MWMultiVersion.php: (no message) (duration: 00m 04s) [17:39:56] Logged the message, Master [17:41:09] (03PS2) 10BryanDavis: Place all dbconfig for wikitech in a guard [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157842 [17:41:19] (03PS1) 10Jforrester: Clean up and re-organise VisualEditor configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157845 [17:42:03] Oh [17:42:05] Reedy: if you hit wikitech right now you'll see what I mean [17:42:05] Reedy, andrewbogott, aude: What about trying https://gerrit.wikimedia.org/r/#/c/157842/ ? 
[17:42:53] I was wondering for a moment if I'd put a redirect from virt1000 -> wikitech on HTTPS Everywhere [17:42:53] … but I'd like to switch it back to working properly, so if you're not equipped to look just now let me know [17:43:30] bd808: looks better [17:43:38] I'm on my laptop now [17:43:52] bd808: I have 0 idea how the db config is supposed to work :( [17:43:58] wait, now wikitech is down entirely... [17:44:04] all I did was remove the virt1000 profile [17:44:06] * andrewbogott scowls [17:44:06] but not totally sure about db stuff [17:44:49] Why's the stuff in == testwiki? [17:44:52] andrewbogott: Same for me actually. Looks like mostly s.pringle messes with it. [17:45:05] * aude was concerned about "I have no idea if this is right " and should have asked more questions [17:45:25] (03CR) 10Ottomata: [C: 032 V: 032] Fix variable access deprecation warning [puppet/jmxtrans] - 10https://gerrit.wikimedia.org/r/157833 (owner: 10Plucas) [17:45:38] PROBLEM - puppetmaster https on virt1000 is CRITICAL: Connection refused [17:45:48] PROBLEM - HTTP on virt1000 is CRITICAL: Connection refused [17:45:54] Using that guard may breal wikitech when we try to use it, but at least it should keep other wikis from having problems. [17:46:11] *break [17:46:11] mmm [17:46:11] * bd808 is bold [17:46:15] maybe aaron knows? [17:46:19] Reedy: Better idea? [17:46:21] wikitech just went down? [17:46:26] (03PS3) 10BryanDavis: Place all dbconfig for wikitech in a guard [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157842 [17:46:28] getRealmSpecificFilename() [17:46:38] RECOVERY - puppetmaster https on virt1000 is OK: HTTP OK: Status line output matched 400 - 335 bytes in 0.057 second response time [17:46:48] RECOVERY - HTTP on virt1000 is OK: HTTP OK: HTTP/1.1 302 Found - 457 bytes in 0.001 second response time [17:47:00] bd808: wikitech is up for you? 
[17:47:19] Really, we don't need to be loading the production db config (among others) on wikitech [17:47:25] my connection is not private [17:47:27] it says [17:47:27] chasemp: I think andrewbogott is working on that [17:47:30] (03CR) 10Dzahn: "hey Jan, since i hear you are now officially hired by WMDE (congrats!), let's get this done together:)" [puppet] - 10https://gerrit.wikimedia.org/r/136095 (owner: 10Christopher Johnson (WMDE)) [17:47:44] no wiki found [17:47:52] ok I see now thanks, missed it [17:48:02] a touch of InitialiseSettings.php might fix that [17:48:02] Looks like maybe varnish cached something we didn't want it to? [17:48:35] does wikitech have varnish infront of it? [17:48:42] Reedy: no [17:48:48] Reedy: can you help me get wikitech (the old one) working again [17:49:15] ? [17:49:26] I mean, other than just rm'ing sites-enabled/50-virt1000-wikimedia-org.conf which I just did [17:49:50] !log reedy Synchronized wmf-config/InitialiseSettings.php: touch (duration: 00m 14s) [17:49:50] Logged the message, Master [17:49:50] wikitech WFM now at least [17:50:33] Reedy: yes, because "just rm'ing sites-enabled/50-virt1000-wikimedia-org.conf which I just did" [17:50:37] yeah, I'm just confirming it had the desired effect [17:50:43] oh, yeah, thanks. [17:50:53] A minute ago that didn't help, I reran puppet and now it works… ??? [17:50:54] Anyway [17:51:05] restarting apache? [17:51:26] I've been gracefulling [17:51:35] maybe puppet does something else [17:51:40] (03CR) 10BryanDavis: [C: 032] "May not make wikitech work but should protect the rest of the wikis from talking to virt1000" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157842 (owner: 10BryanDavis) [17:51:44] bd808: shall I sync that change? 
[17:51:48] (03Merged) 10jenkins-bot: Place all dbconfig for wikitech in a guard [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157842 (owner: 10BryanDavis) [17:52:02] andrewbogott: If you'd like yes [17:52:05] I can do it too [17:52:09] whichever [17:53:05] !log andrew Synchronized wmf-config/db-eqiad.php: (no message) (duration: 00m 06s) [17:54:37] I think that at least made the wikis stop trying to hit virt1000 as a db server [17:54:40] which is good [17:54:55] RECOVERY - puppet last run on db60 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [17:56:35] so, I guess both vhosts are just trying to redirect everything to wikitech/virt1000, and virt1000 just happens to win out? [17:57:12] Guys, complete crazy: http://ganglia.wikimedia.org/latest/graph_all_periods.php?me=Wikimedia&m=cpu_report&r=2hr&s=by%20name&hc=4&mc=2&g=network_report&z=large [17:57:41] Apparently we just have a bust of network activity in the order of tens of petabps. [17:58:43] Max 461.2P actually. Heh. [17:58:43] <^d> Everything's been slow for me for a bit. That'd explain it. [17:59:23] andrewbogott: What number is assigned as a prefix for the wikitech config? [17:59:47] they are both 50 [17:59:49] root@virt1000:/etc/apache2/sites-available# ls [17:59:50] 00-dummy.conf 50-virt1000-wikimedia-org.conf [17:59:51] i'm guessing that has some weighting [17:59:51] 50-puppetmaster-wikimedia-org.conf 50-wikitech-wikimedia-org.conf [17:59:54] ^d: Apparently, we just routed the entire Internet for a bit. :-) [18:00:05] Reedy, greg-g: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140902T1800). Please do the needful. [18:00:13] v before w so that gets done first [18:00:38] <^d> Coren: That's a lot of internet. [18:01:07] andrewbogott: What do the SSL settings add? [18:01:14] ie <%= @ssl_settings.join("\n") %> [18:01:20] Reedy: no idea [18:01:35] Have a look in the conf? ;) [18:02:21] <^d> Coren: Labs? 
[18:02:24] Reedy: is a larger number priority, or a lower number? [18:02:35] <^d> Coren: http://ganglia.wikimedia.org/latest/?c=Virtualization%20cluster%20eqiad&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2 [18:02:52] andrewbogott: Alphabetical sort [18:03:13] like `ls` does [18:03:34] so 100-virt1000 is higher priority than 50-wikitech [18:03:36] but 99-virt1000 is lower [18:03:38] ? [18:04:03] ^d: Clearly. Those numbers looks suspciously like some small negative integer underflowed in 32 bit unsigned. :-) [18:04:11] andrewbogott: Lexical order, dude. [18:04:45] andrewbogott: That extra digit is your gotcha. :-) [18:05:17] (03PS1) 10Andrew Bogott: Lower the priority for virt1000 vhost [puppet] - 10https://gerrit.wikimedia.org/r/157848 [18:05:31] I understand how lexical order works, I'm just…surprised that that is how this works [18:06:21] Anyway, I'm pretty sure this will still redirect wikitech to virt1000 :( [18:06:28] * andrewbogott tries [18:06:43] I've no idea why it would though [18:06:46] (03CR) 10Andrew Bogott: [C: 032] Lower the priority for virt1000 vhost [puppet] - 10https://gerrit.wikimedia.org/r/157848 (owner: 10Andrew Bogott) [18:06:59] If anything (ie MediaWiki was getting involved), I'd expect the other way round [18:07:07] andrewbogott: Them 00-file things work because (for x in *) sort lexically. :-) [18:07:35] Yeay shell globbing! [18:07:38] PROBLEM - puppet last run on virt1000 is CRITICAL: CRITICAL: Epic puppet fail [18:07:47] So it's always limited to two digits. OK [18:08:38] RECOVERY - puppet last run on virt1000 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [18:09:01] Reedy: yeah, I can't make both work. 
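Editor's sketch of the ordering point made above: a minimal illustration, assuming the vhost files are picked up in plain string (lexical) order as the participants describe. The function name `load_order` and the 100-/99- file names are hypothetical, taken only from the numbers floated in the discussion; the 00- and 50- names are the ones listed from sites-available.

```python
def load_order(filenames):
    """Return file names in plain string order, character by character.

    This is an assumption-level model of the "alphabetical sort, like ls"
    behaviour described in the log, not a statement about Apache internals.
    """
    return sorted(filenames)


# The three 50- files share a prefix, so the letter after it decides:
# 'p' < 'v' < 'w', which is why virt1000 came before wikitech.
same_prefix = [
    "50-wikitech-wikimedia-org.conf",
    "50-virt1000-wikimedia-org.conf",
    "50-puppetmaster-wikimedia-org.conf",
]
print(load_order(same_prefix))

# The "extra digit" gotcha: strings are compared left to right, so
# "100-..." sorts before "50-..." because '1' < '5', while "99-..." sorts last.
mixed_width = [
    "50-wikitech-wikimedia-org.conf",
    "100-virt1000-wikimedia-org.conf",
    "99-virt1000-wikimedia-org.conf",
    "00-dummy.conf",
]
print(load_order(mixed_width))
```

Under that assumption, keeping every prefix at the same width (two digits here) is what makes the numeric and the string orderings agree.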
[18:09:11] So now with virt1000 a lower priority, everything redirects to wikitech [18:09:20] Funny, I thought this having-two-sites thing would be the easy part [18:10:22] ServerName <%= @webserver_hostname %> [18:10:22] ServerAlias <%= @webserver_hostname_aliases %> [18:10:32] What do they resolve to on the wikitech config? [18:11:12] ServerName wikitech.wikimedia.org [18:11:14] ServerAlias wmflabs.org www.wmflabs.org [18:11:58] reedy@tin:/a/common$ curl -I virt1000.wikimedia.org [18:11:58] Location: https://virt1000.wikimedia.org/ [18:12:50] reedy@tin:/a/common$ curl -I https://virt1000.wikimedia.org --insecure [18:12:50] Location: https://wikitech.wikimedia.org/ [18:13:43] It's 302-ing [18:13:43] I wonder if that's MediaWiki [18:14:31] Except it's not hitting mediawiki on virt1000 is it? [18:14:33] (re my $wgServer and $wgCanonicalServer comments earlier) [18:15:15] Oh, hm... [18:16:12] I guess I can change the local mw config by hand to expect virt1000 [18:16:44] Reedy: any idea why I'm seeing this in the logs now? "Connection error: Error selecting database labswiki on server 10.64.16.27 from client host mw1204" [18:18:58] PROBLEM - Puppet freshness on mw1053 is CRITICAL: Last successful Puppet run was Sun 31 Aug 2014 01:50:18 UTC [18:19:24] bd808: s3 master [18:19:43] (03PS1) 10Andrew Bogott: Point wikitech wiki to virt1000 for testing purposes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157851 [18:19:44] presumably it's something looking through all.dblist [18:19:44] oh.. blerg [18:20:02] Reedy, greg-g, are you deploying now? Or can this wikitech nonsense bleed into your window? [18:20:29] I've not started that deploy yet, no real rush :) [18:20:51] Are there jobs that use all.dblist? [18:20:51] /foreachwiki yeah [18:21:02] Reedy: How do you feel about ^ ? 
[18:21:02] In order to get virt1000 as a proper 'other' wiki [18:21:02] perhaps that will fix the redir [18:21:16] but if it's not in all.dblist multiversion will get upset [18:21:16] mw1214, mw1209, mw1207, mw1205 ... [18:21:37] Lots of hosts reporting that message [18:22:16] Reedy: alternative is to just rip off the band-aid and skip the whole virt1000 attempt [18:22:16] (03CR) 10Reedy: [C: 031] "Shouldn't hurt for the time being" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157851 (owner: 10Andrew Bogott) [18:22:33] (03CR) 10Andrew Bogott: [C: 032] Point wikitech wiki to virt1000 for testing purposes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157851 (owner: 10Andrew Bogott) [18:22:51] (03PS1) 10Matanya: planet: qualify var [puppet] - 10https://gerrit.wikimedia.org/r/157852 [18:22:52] mw1214 is an appserver. Not sure why it would run an all.dblist command [18:23:11] CA? [18:24:02] is 'labswiki' in $wgLocalDatabases? [18:25:02] "$wgLocalDatabases =& $wgConf->getLocalDatabases();" [18:25:09] So probably yes [18:25:13] legoktm: via all.dblist, probably [18:25:16] !log andrew Synchronized wmf-config/InitialiseSettings.php: (no message) (duration: 00m 08s) [18:25:22] Logged the message, Master [18:27:44] Yup. 
It shows up in $wgLocalDatabases when I eval.php as mediawikiwiki [18:28:20] > var_dump(in_array('labswiki', CentralAuthUser::getWikiList())); [18:28:20] bool(true) [18:28:28] (03PS1) 10Reedy: Merge virt1000 apache config back into wikitech apache config [puppet] - 10https://gerrit.wikimedia.org/r/157853 [18:28:37] (03PS4) 10Dzahn: ganglia - make codfw use ganglia_new::monitor [puppet] - 10https://gerrit.wikimedia.org/r/157325 [18:29:38] so that's probably the issue [18:29:52] * Reedy coughs [18:29:54] hack [18:29:55] * Reedy coughs [18:30:09] I see there is a hook to tweak that list [18:30:13] CentralAuthWikiList [18:31:07] Reedy: no change [18:31:13] (03CR) 10Dzahn: [C: 032] ganglia - make codfw use ganglia_new::monitor [puppet] - 10https://gerrit.wikimedia.org/r/157325 (owner: 10Dzahn) [18:31:17] ugh but the hook as to build the whole list [18:31:23] *has to [18:31:46] :/ [18:32:54] So hacktastic hack would be to subtract labswiki from $wgLocalDatabases in CommonSettings.php [18:33:32] because it's not really a local database, that kinda makes sense [18:33:58] 'Local' must not mean 'local' in this context? [18:34:15] Local means "part of this wiki farm" [18:35:20] Which confused the heck out of me for a while [18:39:10] $wgConf->wikis is the contents of all.dblist and $wgConf->getLocalDatabases() returns $wgConf->wikis. [18:40:11] heya, anyone else have problems with ssh connections (and other general wmf connectivity)? [18:40:11] 3 separate analytics folks are complaining about it, all in different locations [18:40:12] ssh connections blocking for 10ish seconds, downloads sleeping, etc. [18:40:12] ottomata: I've had a lot of slow connections both https and ssh this morning [18:41:19] Ooooh we already hook $wgHooks['CentralAuthWikiList'] [18:41:51] We just need to add labswiki to wgSiteMatrixPrivateSites [18:42:39] Is nickname 'rc-pmtpa' on irc.wikimedia.org back-compat hardcoded, or is it running in pmtpa? 
[18:42:47] Krinkle: back-compat hardcoded [18:43:03] When was it moved to eqiad? [18:43:36] June [18:43:36] https://gerrit.wikimedia.org/r/#/c/136755/ [18:43:43] Reedy: btw, with wikitech using InitialiseSettings/CommonSettings, does that mean wikitech will be getting ircfeed/rcstream? [18:43:45] I created a revert in https://gerrit.wikimedia.org/r/#/c/136965/ [18:43:54] Reedy: Would labswiki be better classified in private.dblist or fishbowl.dblist? Either will fix the CA issue. [18:44:03] (03PS1) 10MaxSem: Disable mobile site notices, too intrusive [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157855 [18:44:06] bd808: Fishbowl possibly [18:44:14] Ah, you're still working on it. I thought I was looking at an older commit. [18:44:14] Considering it technically is [18:44:25] Krinkle: Might not work initially, but should be doable [18:44:31] carry on hacking wikitech :) [18:44:32] (03PS1) 10Matanya: protoproxy: qualify vars [puppet] - 10https://gerrit.wikimedia.org/r/157856 [18:44:45] bd808: private.dblist would make it not viewable. fishbowl is exactly what it is (needs an account to edit etc) [18:44:56] That seems like a saner fix :) [18:45:44] "Restricted write access, full read access" [18:46:23] PROBLEM - puppet last run on cp1039 is CRITICAL: CRITICAL: Puppet has 1 failures [18:46:27] It probably means we can remove some more config overrides [18:46:43] (03CR) 10Krinkle: [C: 04-1] "Yeah. Thanks for restoring rc-pmtpa. That saved a potential firehouse of problems in old bots still working on migrating away from irc. 
I'" [puppet] - 10https://gerrit.wikimedia.org/r/136965 (owner: 10Reedy) [18:49:23] PROBLEM - Puppet freshness on elastic1007 is CRITICAL: Last successful Puppet run was Sun 31 Aug 2014 06:23:11 UTC [18:50:24] (03PS1) 10BryanDavis: Add labswiki to fishbowl.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157857 [18:50:44] Reedy: ^ [18:52:23] RECOVERY - Puppet freshness on elastic1007 is OK: puppet ran at Tue Sep 2 18:52:22 UTC 2014 [18:52:54] RECOVERY - puppet last run on elastic1007 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [18:53:25] (03CR) 10Andrew Bogott: [C: 031] Add labswiki to fishbowl.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157857 (owner: 10BryanDavis) [18:54:05] Should I merge and sync that? ^ [18:54:28] (03CR) 10Reedy: [C: 031] Add labswiki to fishbowl.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157857 (owner: 10BryanDavis) [18:54:31] Yeah, LGTM [18:54:35] quietens production down [18:54:38] * bd808 does then [18:55:02] (03CR) 10BryanDavis: [C: 032] Add labswiki to fishbowl.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157857 (owner: 10BryanDavis) [18:55:06] (03Merged) 10jenkins-bot: Add labswiki to fishbowl.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157857 (owner: 10BryanDavis) [18:55:45] So as y'all wrap up that side-quest… any ideas how to make virt1000.wikimedia.org work? [18:55:56] (and, want me to sync that last change?) [18:56:01] !log bd808 Synchronized fishbowl.dblist: Add labswiki (wikitech) (duration: 00m 05s) [18:56:07] heh, nm [18:56:07] Logged the message, Master [18:57:14] andrewbogott: So have we figured out if the 302 is coming from Apache or MW? [18:57:32] I suspect apache since I changed the site name in mw [18:57:35] But I don't know how to tell, really. 
[19:01:58] andrewbogott: I hate to abandon you, but if I don't eat I'm going to get stuck in meetings all afternoon with no food in my belly :( [19:02:14] bd808: That's reasonable :) [19:02:24] Go eat, I'll follow up later if I have specific questions. [19:02:30] Nothing's especially broken at the moment. [19:02:46] thanks. It feels like we are close [19:03:23] RECOVERY - puppet last run on cp1039 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [19:03:51] It's going to be something silly or obvious [19:03:57] yeah [19:04:12] * Reedy wonders if jzerebecki is about [19:04:29] "I’m happy to announce the 1.0 release of Wikibase DataModel. Wikibase DataModel is the canonical PHP implementation of the Data Model at the heart of the Wikibase software. [19:04:41] https://github.com/wmde/WikibaseDataModel/blob/master/RELEASE-NOTES.md [19:04:57] Coren: are you available to lend your eyes? [19:05:14] andrewbogott: Sure. What be up? [19:05:20] We're trying to set up two wikis on virt1000, one at wikitech. and one at virt1000. [19:05:28] But everything gets redirected to wikitech [19:05:40] I'm trying to be eyes and ears for Reedy but not actually proving much help. [19:05:53] Want to look at the config and apache logs there and see if there's something obvious we're missing? [19:06:31] curl -I https://virt1000.wikimedia.org/wiki/Main_Page --insecure [19:06:38] gives a 302 and Location: https://wikitech.wikimedia.org/wiki/Main_Page [19:07:52] (03PS1) 10Reedy: Non Wikipedias to 1.24wmf19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157860 [19:09:03] * Reedy gets this over and done with [19:09:04] (03CR) 10Manybubbles: [C: 031] "Lets try it!" [puppet] - 10https://gerrit.wikimedia.org/r/157805 (owner: 10Filippo Giunchedi) [19:09:22] andrewbogott: uh [19:09:34] labswiki wasn't in wikiversions.json? [19:09:50] https://gerrit.wikimedia.org/r/#/c/157860/1/wikiversions.json,unified [19:09:52] See line 432 [19:10:34] you just now added it? 
[19:10:44] oooo [19:10:49] Not personally, but running updateWikiversions on all.dblist did [19:11:15] (03CR) 10Reedy: [C: 032] "Also adds labswiki..." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157860 (owner: 10Reedy) [19:11:19] (03Merged) 10jenkins-bot: Non Wikipedias to 1.24wmf19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157860 (owner: 10Reedy) [19:11:21] Certainly can't be helping matters [19:11:52] ok, I only understand about a third of that [19:11:56] (03PS1) 10Manybubbles: Cirrus: Switch group1 wikis to all fields [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157861 [19:11:56] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Non wikipedias to 1.24wmf19, added labswiki too [19:11:57] but, sounds like it's resolved now? [19:12:00] (03PS1) 10Aude: Update cache epoch for Wikidata for html dom changes in Wikibase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157862 [19:12:01] Logged the message, Master [19:12:01] Reedy: ^ [19:12:07] heh [19:12:59] (03PS2) 10Reedy: Update cache epoch for Wikidata for html dom changes in Wikibase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157862 (owner: 10Aude) [19:13:03] thanks [19:13:06] gah, Internal error in ApiFormatXml::recXmlPrint [19:13:06] (03CR) 10Reedy: [C: 032] Update cache epoch for Wikidata for html dom changes in Wikibase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157862 (owner: 10Aude) [19:13:10] (03Merged) 10jenkins-bot: Update cache epoch for Wikidata for html dom changes in Wikibase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157862 (owner: 10Aude) [19:13:16] not critical but noted [19:13:17] aude: stuff really broken? 
[19:13:21] ah [19:13:21] nope [19:13:53] "good" [19:14:02] !log reedy Synchronized wmf-config/Wikibase.php: Bump epoch (duration: 00m 14s) [19:14:09] Logged the message, Master [19:15:03] PROBLEM - puppet last run on cp3009 is CRITICAL: CRITICAL: Puppet has 1 failures [19:17:32] (03Abandoned) 10Reedy: Revert "keep rc-pmtpa name for now" [puppet] - 10https://gerrit.wikimedia.org/r/136965 (owner: 10Reedy) [19:20:06] Coren: Any luck? / Notice any fails on our part? [19:21:22] Reedy: Oh duh! Yeah, I think I found the boo boo. [19:21:41] I should have started with that. You're using two VirtualHost *:80 [19:22:19] That won't work. NameVirtualHost would work for HTTP, but not HTTPS (at least not without cert errors or unless you are using a wildcard) [19:22:52] One is not redirected to the other; the second one is just being ignored. :-) [19:23:16] mark: time to take a look at https://gerrit.wikimedia.org/r/#/c/155753 ? :) [19:23:23] And here I was looking at detailed bits of config when it's the first line that was the issue. :-) [19:23:26] Reedy: i am [19:23:55] I note we just use * (no port) for all the cluster wikis... [19:24:03] Reedy: Easy solution: assign a second IP and listen to it on the respective VirtualHosts. [19:24:19] Coren: We're planning on having cert errors for the time being… once virt1000 looks right it'll be moved to wikitech [19:24:28] The thought did cross my mind [19:25:05] andrewbogott: Where in puppet is this? I'll fix it. [19:25:12] templates/apache/site/ [19:25:32] templates/apache/sites/virt1000.wikimedia.org.erb templates/apache/sites/wikitech.wikimedia.org.erb [19:25:35] um… sites [19:26:07] (03PS1) 10JanZerebecki: Add icinga check for wikidata dispatch lag [puppet] - 10https://gerrit.wikimedia.org/r/157863 [19:26:32] Coren: thanks [19:29:17] andrewbogott: Where are those ultimately included from, I mean? 
[19:29:27] manifests/openstack.pp [19:29:44] manifests/openstack.pp [19:29:49] apache::site entries [19:31:03] * Coren fails to find where config-available/50-wikitech-ports.conf originates. [19:32:04] RECOVERY - puppet last run on cp3009 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [19:32:11] (03CR) 10Catrope: [C: 031] Clean up and re-organise VisualEditor configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157845 (owner: 10Jforrester) [19:32:36] Coren: I don't know. It's possible those configs are unpuppetized, or some magic side-effect of the new apache module [19:32:50] andrewbogott: Look like. I think I found where. [19:33:24] none of those configs are in conf-enabled anyway [19:33:27] so maybe they're obsolete? [19:33:45] Ah, no, I see why; we use conf.d instead. [19:34:26] ... wait. That makes no sense. [19:34:33] * Coren boggles at the config for a bit. [19:34:48] you can use apache::conf to replace those files [19:35:45] Wait, there already /is/ a NameVirtualHost directive. [19:35:48] * Coren boggles a bit more. [19:36:00] Aaaah. But only on 80! [19:36:07] yes, from the distro package [19:36:14] there were several ways to add the 443 [19:36:17] (Because on 443 that normally causses issues) [19:37:17] if you need a config snippet, use apache::conf with content => template() now [19:37:57] andrewbogott, Reedy: just for testing that was the only issue, I've puppet agent --disabled and added the snippet directly in ports.conf. Try it now? [19:38:13] https://virt1000.wikimedia.org/wiki/ [19:38:22] "Sorry, we were not able to work out what wiki you were trying to view. Please specify a valid Host header." [19:38:31] (03CR) 10JanZerebecki: "Thx, sure. I made a new change that implements this check in a more basic way at Iac540d1b47df398e0fc92f2b843a3b618d1844d5." 
[puppet] - 10https://gerrit.wikimedia.org/r/136095 (owner: 10Christopher Johnson (WMDE)) [19:38:32] Which sounds like multiversion complaining [19:38:34] Coren: Yes, that gets us to the next problem at least. [19:38:42] sweet, thanks Coren [19:39:09] andrewbogott: I did it the *wrong* way though. [19:39:27] andrewbogott: The static mapping addition is in place on the server, right? [19:39:33] Coren: That's maybe ok… it's unclear if the virt1000 hack will last for hours or for days... [19:39:37] if the latter then we'd want it puppetized [19:39:49] Reedy: can you tell me more specifically how to answer that question? [19:40:17] andrewbogott: That's on you then. FYI, I added the NameVirtualHost in ports.conf; this would probably be erased by a puppet run as it is. [19:40:38] reedy@tin:/a/common$ grep virt1000 /usr/local/apache/common/multiversion/MWMultiVersion.php [19:40:38] 'virt1000.wikimedia.org' => 'labswiki' // Temporary for wikitech migration to multiversion [19:40:59] same on virt1000 [19:41:43] oh, I know [19:41:48] wikiversions again [19:42:04] virt1000 isn't in mediawiki-installation dsh group yet is it? [19:42:11] can you run sync-common on virt1000 [19:42:50] Reedy: yes, I just did [19:43:04] and again :) [19:44:14] (03PS1) 10Dzahn: dsh- add virt1000 to mw-installation,apaches [puppet] - 10https://gerrit.wikimedia.org/r/157867 [19:44:26] * Reedy grins at mutante [19:44:32] touch wmf-config/InitialiseSettings.php [19:44:43] (03PS2) 10Dzahn: dsh - add virt1000 to mw-installation,apaches [puppet] - 10https://gerrit.wikimedia.org/r/157867 [19:45:14] It's usually some daft cache thing... we get this adding new wikis sometimes [19:45:15] mutante: I think andrewbogott was concerned about letting mortals access virt1000 [19:45:38] (03CR) 10Dzahn: [C: 031] "deployers will deploy" [puppet] - 10https://gerrit.wikimedia.org/r/157867 (owner: 10Dzahn) [19:45:50] bd808: sync-common doesn't rebuild wikiversions, does it? 
[19:45:53] (03CR) 10Andrew Bogott: [C: 04-2] "We're going to keep wikitech Ops-only and sync using a cron or something." [puppet] - 10https://gerrit.wikimedia.org/r/157867 (owner: 10Dzahn) [19:46:02] Reedy: No, it doesn't [19:46:19] (03CR) 10Dzahn: [C: 04-2] "tried" [puppet] - 10https://gerrit.wikimedia.org/r/157867 (owner: 10Dzahn) [19:46:24] (03Abandoned) 10Dzahn: dsh - add virt1000 to mw-installation,apaches [puppet] - 10https://gerrit.wikimedia.org/r/157867 (owner: 10Dzahn) [19:46:36] andrewbogott: sync-wikiversions [19:46:59] uh, compile-wikiversions even [19:47:06] but... sync-common should pick up the wikiversions from tin shouldn [19:47:12] shouldn't it? [19:47:34] presumably... [19:47:46] https://dpaste.de/E4vr [19:47:47] i thought the whole idea was that it becomes part of the cluster.. shrug [19:48:10] mutante: This gets us 90% of the way there. We might revisit later on but I don't want to make that kind of a security decision unilaterally. [19:48:22] Yeah sync-wikiversions is a push command from tin, not a pull from the apache hosts side [19:48:23] ok, gotcha [19:48:28] Oh, compile-wikiversions works [19:48:29] andrewbogott: try compile-wikiversions [19:48:42] I mean 'works' in the sense of runs without error [19:48:44] doesn't fix it [19:48:49] blaaargh [19:48:49] not in the sense of 'fixes the problem' [19:49:41] the problem is multiversion thinks the wiki is "missing" [19:49:45] so pointing us at hte 404 page [19:51:36] greg-g: he already said in the other bug that someone else has to dig the reasons for https://bugzilla.wikimedia.org/show_bug.cgi?id=68781 , AFAICS [19:51:49] !log Running cleanupPageProps.php on mw.org and meta [19:51:55] Logged the message, Master [19:52:15] Reedy, can I turn on a debug channel? [19:53:23] I don't think there is any more debugging with it.. [19:53:30] Usually I end up var_dump() && die or similar [19:58:49] andrewbogott: multiversion/MWVersion.php. 
Line 40, before the include, add something like: var_dump( $multiversion, $file, $wiki, $scriptName, $serverName ); die(); [19:59:09] Reedy: is this in common or common-local, or are they the same? [19:59:17] ah, the same [19:59:17] one should symlink to the other [19:59:19] yeah [19:59:48] ok, I've added the var_dump, where will that end up? [19:59:56] NULL string(9) "index.php" NULL string(5) "/wiki" string(22) "virt1000.wikimedia.org" [20:00:02] if you visit https://virt1000.wikimedia.org/wiki/Main_Page [20:00:05] so I see [20:00:17] oh, fail [20:00:23] can you fix $multiversion to $multiVersion ? [20:01:09] fixed. I thought vars were insensitive in php though? [20:01:21] seems not [20:01:28] ["db":"MWMultiVersion":private]=> string(12) "labswikiwiki" [20:01:33] we've got an extra wiki appearing [20:01:36] my fault I think [20:01:48] yeah [20:01:49] duh [20:02:16] (03PS1) 10Reedy: Remove wiki suffix from labswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157871 [20:02:38] (03CR) 10Reedy: [C: 032] Remove wiki suffix from labswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157871 (owner: 10Reedy) [20:02:42] (03Merged) 10jenkins-bot: Remove wiki suffix from labswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157871 (owner: 10Reedy) [20:03:12] andrewbogott: sync-common again please.. then if the dbname is correct... 
(I think sync-common might overwrite it anyway) [20:03:17] (03PS2) 10JanZerebecki: Add icinga check for wikidata dispatch lag [puppet] - 10https://gerrit.wikimedia.org/r/157863 [20:03:51] um… seems to crash now [20:04:10] (03CR) 10JanZerebecki: "$ ./check_http -H www.wikidata.org -I 91.198.174.192 -S -u /w/api.php?action=query\&meta=siteinfo\&format=json\&siprop=statistics --linesp" [puppet] - 10https://gerrit.wikimedia.org/r/157863 (owner: 10JanZerebecki) [20:04:34] PHP Fatal error: Class 'LdapAuthenticationPlugin' not found in /usr/local/apache/common-local/wmf-config/wikitech.php on line 25 [20:04:35] This is progress! [20:04:36] (03PS1) 10Ottomata: Bring in recent changes for production webstatscollector to test kafkatee webstats collector on analytics1003 [puppet] - 10https://gerrit.wikimedia.org/r/157873 [20:04:39] woop [20:05:06] HAHAHA [20:05:07] $wgAuth = new LdapAuthenticationPlugin(); [20:05:07] require_once( "$IP/extensions/LdapAuthentication/LdapAuthentication.php" ); [20:05:19] include ::apache::mod::authnz_ldap ? 
[20:05:43] (03PS1) 10Reedy: Require before instatiate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157874 [20:05:48] !log Running cleanupPageProps.php everywhere [20:05:55] Logged the message, Master [20:06:04] (03CR) 10Andrew Bogott: [C: 032] "I hope they're all this easy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157874 (owner: 10Reedy) [20:06:04] oh, ignore me [20:06:10] (03Merged) 10jenkins-bot: Require before instatiate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157874 (owner: 10Reedy) [20:06:21] andrewbogott: sync-common again ofc [20:06:30] we're on the home stretch [20:06:54] (03CR) 10Ottomata: [C: 032 V: 032] Bring in recent changes for production webstatscollector to test kafkatee webstats collector on analytics1003 [puppet] - 10https://gerrit.wikimedia.org/r/157873 (owner: 10Ottomata) [20:07:39] (03PS1) 10Reedy: Add debugging to wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157877 [20:07:52] Reedy: wikitech is currently 1.24wmf15 [20:07:56] Might make the next bits a bit quicker [20:07:58] PROBLEM - puppet last run on analytics1003 is CRITICAL: CRITICAL: Puppet last ran 948835 seconds ago, expected 14400 [20:07:59] (03PS1) 10QChris: For webstatscollector, use latest version [puppet] - 10https://gerrit.wikimedia.org/r/157878 (https://bugzilla.wikimedia.org/70295) [20:08:06] (Cannot contact the database server: Access denied for user 'wikiuser'@'208.80.154.18' (using password: YES) (10.64.16.33)) [20:08:07] RECOVERY - Puppet freshness on analytics1003 is OK: puppet ran at Tue Sep 2 20:08:03 UTC 2014 [20:08:07] can we have virt1000 be the same version so the db version works both places? 
[20:08:16] (03CR) 10jenkins-bot: [V: 04-1] For webstatscollector, use latest version [puppet] - 10https://gerrit.wikimedia.org/r/157878 (https://bugzilla.wikimedia.org/70295) (owner: 10QChris) [20:08:27] yeah [20:08:33] will need to build l10n cache for wikitech again [20:08:57] RECOVERY - puppet last run on analytics1003 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [20:09:03] Just copy from the local on virt1000? [20:09:20] (03PS1) 10Reedy: labswiki to 1.24wmf15 to match wikitech as currently is [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157879 [20:09:23] or that :) [20:09:25] _joe_: puppet-compiler is out of disk again, can we delete stuff from "./external"? or use /dev/mapper/vd-second--local--disk to put stuff? [20:10:36] bd808: guess we're going to have to fix the db config next [20:11:00] (03PS1) 10Ottomata: Uncomment include of role::analytics::kafkatee::webrequest::webstatscollector on analytics1003 [puppet] - 10https://gerrit.wikimedia.org/r/157889 [20:11:03] I'm gonna have to escape for 20-30 minutes to go home (getting kicked out of my parents house), before the mw core meeting too [20:11:18] Is it not picking up the private changes that andrewbogott made? [20:11:28] (Cannot contact the database server: Access denied for user 'wikiuser'@'208.80.154.18' (using password: YES) (10.64.16.33)) [20:11:34] * bd808 looks at seekrets on tin [20:11:45] it's trying to use s3 [20:11:47] not localhost [20:11:50] Remind me how to rebuild localisation? 
[20:11:53] (do we override it anywhere) [20:12:12] (03CR) 10Reedy: [C: 032] labswiki to 1.24wmf15 to match wikitech as currently is [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157879 (owner: 10Reedy) [20:12:17] (03Merged) 10jenkins-bot: labswiki to 1.24wmf15 to match wikitech as currently is [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157879 (owner: 10Reedy) [20:12:19] (03CR) 10Ottomata: [C: 032 V: 032] Uncomment include of role::analytics::kafkatee::webrequest::webstatscollector on analytics1003 [puppet] - 10https://gerrit.wikimedia.org/r/157889 (owner: 10Ottomata) [20:12:35] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: (no message) [20:12:42] Logged the message, Master [20:13:00] andrewbogott: sync-common then scap-rebuild-cdbs should do it [20:13:05] Reedy: This db stuff? [20:13:28] scap-rebuild-cdbs on virt1000? [20:13:36] (03PS2) 10QChris: For regular webstatscollector installs, use latest version [puppet] - 10https://gerrit.wikimedia.org/r/157878 (https://bugzilla.wikimedia.org/70295) [20:13:45] sync-common does that bit now [20:14:01] will that work without the json files existing? [20:14:05] ie do we need to run scap on tin first? [20:14:28] (I'm tempted to use bd808s idea of copying the l10n cache from the current wikitech for ease) [20:15:03] Right, going to escape. Back in 20-30 mins :) [20:15:08] * bd808 would do that rather than wait 30m for scap [20:15:40] I'm happy to copy if you tell me which files [20:15:53] (03CR) 10Dzahn: Add icinga check for wikidata dispatch lag (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/157863 (owner: 10JanZerebecki) [20:16:07] andrewbogott: There should be l10n cache files for the existing wikitech that you can copy. They would be *somewhere*, possible cache/l10n [20:16:20] /srv/org/wikimedia/controller/wikis/w/cache or cache/l10n [20:16:21] yeah [20:17:00] I see the 'from' dir but not the 'to' dir. 
[20:17:08] No 'cache' in /usr/local/apache/common/w [20:17:35] Or you can copy the wmf19 cache/l10n directory to wmf15 locally on virt1000 [20:17:51] That may be easiest actually [20:18:36] sync-common won't mess with the cdbs as long as you don't copy the .json files too [20:18:42] ok, done [20:18:50] /usr/local/apache/common/php-1.24wmf19# cp -r cache ../php-1.24wmf15/ [20:19:00] (03CR) 10QChris: For regular webstatscollector installs, use latest version (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/157878 (https://bugzilla.wikimedia.org/70295) (owner: 10QChris) [20:19:17] PROBLEM - Puppet freshness on mw1053 is CRITICAL: Last successful Puppet run was Sun 31 Aug 2014 01:50:18 UTC [20:19:21] Now we need to figure out why the db connection is failing [20:19:32] The host looks right... [20:20:03] (03CR) 10QChris: For regular webstatscollector installs, use latest version (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/157878 (https://bugzilla.wikimedia.org/70295) (owner: 10QChris) [20:20:10] Oh... I bet the relative include doesn't work in PrivateSettings.php [20:20:27] Shouldn't it throw an error if it can't find the file to include? [20:20:55] It should log a warning, but not an error with include vs require [20:21:29] ok, fixing... [20:21:47] Oops, you're already fixing :) [20:21:55] andrewbogott: I think you need "include( __DIR__ . '/WikitechPrivateSettings.php' ); [20:22:13] Someone has PrivateSettings.php open already... [20:22:25] maybe not. Could be a vi orphan [20:22:37] andrewbogott: No I had it open [20:22:40] (03PS3) 10Dzahn: Add icinga check for wikidata dispatch lag [puppet] - 10https://gerrit.wikimedia.org/r/157863 (owner: 10JanZerebecki) [20:23:00] bd808: ok, you're going to commit that change? [20:23:37] andrewbogott: Yeah. You can pull it now with sync-common on virt1000 to see if it works [20:23:48] I'll commit and sync to cluster if it does [20:25:18] andrewbogott: Did you pull the change to virt1000 yet? 
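Editor's note: the l10n shortcut discussed above (copy the wmf19 cache directory into wmf15, but leave the .json files behind so sync-common does not touch the cdbs) can be sketched roughly as follows. This is only an illustration, not what was run on virt1000 (that was a plain `cp -r cache ../php-1.24wmf15/`). The function name `copy_cache_without_json` is hypothetical, and the sketch assumes that "the .json files" can be recognised by their file extension alone.

```python
import shutil
from pathlib import Path


def copy_cache_without_json(src_cache, dst_cache):
    """Copy every file under src_cache into dst_cache, skipping *.json files.

    Returns the relative paths (POSIX style) of the files that were copied.
    Assumption: the files sync-common cares about are exactly those whose
    final extension is ".json"; everything else is copied as-is.
    """
    src_cache = Path(src_cache)
    dst_cache = Path(dst_cache)
    copied = []
    for src in sorted(src_cache.rglob("*")):
        if not src.is_file() or src.suffix == ".json":
            continue
        rel = src.relative_to(src_cache)
        dst = dst_cache / rel
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)  # copy2 also carries over timestamps and mode bits
        copied.append(rel.as_posix())
    return copied
```

Called as `copy_cache_without_json("php-1.24wmf19/cache", "php-1.24wmf15/cache")`, it would mirror the directory layout mentioned in the channel. The file ownership that results depends on who runs it, which the sketch does not address.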
[20:25:20] sync-common is unhappy about my l10n cache [20:25:25] oh boo [20:25:27] It's still working [20:25:38] ok, done [20:25:43] It complained about file permissions? [20:25:48] yeah [20:26:04] hmm... doesn't seem fixed [20:26:13] But it didn't error out, so should've copied the private files anyway [20:26:52] andrewbogott: Does /usr/local/apache/common-local/private/PrivateSettings.php look correct now? [20:27:08] If so are there any new errors in the apache error log? [20:27:16] 'require_once( __DIR__ . '/WikitechPrivateSettings.php' );' [20:27:34] ok. so that bit synced, but didn't fix the problem [20:28:00] (03CR) 10JanZerebecki: [C: 031] Add icinga check for wikidata dispatch lag [puppet] - 10https://gerrit.wikimedia.org/r/157863 (owner: 10JanZerebecki) [20:28:10] * bd808 guesses that __DIR__ isn't private/ [20:28:25] no complaints in the apache log [20:28:39] bd808: perhaps an absolute path? [20:29:08] hmm... oh cache. Touch wmf-config/CommonSettings.php [20:29:24] But then I expect an error in the apache log [20:30:42] (03Abandoned) 10BBlack: Kill $project-lb(.$site)?.wikimedia.org IPs [dns] - 10https://gerrit.wikimedia.org/r/140136 (owner: 10Faidon Liambotis) [20:31:17] (03CR) 10Dzahn: [C: 032] Add icinga check for wikidata dispatch lag [puppet] - 10https://gerrit.wikimedia.org/r/157863 (owner: 10JanZerebecki) [20:31:57] !log bd808 Synchronized private/PrivateSettings.php: Absolute path for WikitechPrivateSettings.php (duration: 00m 05s) [20:32:05] Logged the message, Master [20:32:12] andrewbogott: sync-common again plz [20:32:45] bd808: A repairman is about to visit, I'll be on-again-off-again for a bit. [20:32:45] but, syncing... [20:33:00] done [20:33:06] andrewbogott: Ok. 
We are sooo close [20:33:22] and that wasn't the problem apparently :( [20:33:26] every once in a while the apache error says File does not exist: /usr/local/apache/common/docroot/wikimedia.org/favicon.ico [20:33:30] which seems kind of promising actually [20:33:37] Or all of the problem anyway [20:35:15] in prod the favicon.ico has been replaced by favicon.php or so [20:35:20] !log bd808 Synchronized wmf-config/wikitech.php: eebc99a Require before instatiate (duration: 00m 04s) [20:35:26] Logged the message, Master [20:36:22] Debugging is hindered by the l10n cache issue. I'll need to run scap to be able to poke at things with mwscript [20:36:31] Ok, fine with me. [20:36:37] Since I'm about to wander off for a bit anyway [20:36:40] jouncebot: next [20:36:40] In 0 hour(s) and 23 minute(s): Flow (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140902T2100) [20:37:10] Maybe can just squeak in before the Flow window [20:37:19] yeah, barely. [20:37:34] Anything else I can do quickly before I go? [20:38:26] andrewbogott: proably not much. I'll scap and then we can dump config stuff to see what's wrong [20:38:40] ok, back soon [20:39:01] !log bd808 Started scap: no-op scap to build l10n for wikitech [20:39:07] Logged the message, Master [20:39:20] (03CR) 10Dzahn: "wow wow, i wonder if the entire templates/misc/passwordScripts can be deleted. that stuff is really old, i see edits by Roan from 2011. pa" [puppet] - 10https://gerrit.wikimedia.org/r/157685 (owner: 10Matanya) [20:39:26] Reedy: ^ :p [20:39:34] bd808|deploy: it's not a no-op if it's building l10n for wikitech :P [20:39:39] templates/misc/passwordScripts/mysql_root_pass.erb [20:39:40] no-diff maybe [20:39:55] ori: True. No-code-udpate [20:40:00] echo -n "<%= @mysql_root_pass %>" [20:40:01] really? 
[20:40:09] i pick up stuff from the trash [20:40:14] matanya: lol:) [20:40:24] (03PS1) 10Yuvipanda: labmon: Send metrics from all projects to labmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/157969 [20:40:25] Coren: ^ +2? [20:41:09] (03CR) 10jenkins-bot: [V: 04-1] labmon: Send metrics from all projects to labmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/157969 (owner: 10Yuvipanda) [20:41:24] (03CR) 10Dzahn: [C: 031] "+1 for the actual change though :)" [puppet] - 10https://gerrit.wikimedia.org/r/157685 (owner: 10Matanya) [20:41:45] (03PS2) 10Yuvipanda: labmon: Send metrics from all projects to labmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/157969 [20:43:05] rfarrand: congratz! [20:43:06] andrewbogott: assuming you're back, ^ [20:43:16] matanya: thank you :) [20:43:21] (03CR) 10Dzahn: [C: 032] planet: qualify var [puppet] - 10https://gerrit.wikimedia.org/r/157852 (owner: 10Matanya) [20:44:00] Back again [20:44:43] Reedy: I'm running scap to build l10n so I can use eval.php to see what password is being picked up [20:44:53] ok [20:45:38] bd808|deploy are we moving the database to s3? [20:45:43] The 208.80.154.18 db hosts looks right, but apparently the password (or user?) is not right [20:46:16] Reedy: dunno [20:46:21] Noting the dbhost is 10.64.16.16, the source is 208.80.154.18 [20:46:34] (Cannot contact the database server: Access denied for user 'wikiuser'@'208.80.154.18' (using password: YES) (10.64.16.16)) [20:46:47] oh. hrm [20:46:56] yuvipanda: I'm back but distracted by wikitech still [20:47:00] https://github.com/wikimedia/operations-mediawiki-config/commit/11c65ed379e15a5f79015bb380cebfef59b0062a [20:47:04] I was seeing the client ip as the db ip in that [20:47:14] andrewbogott: cool, added you as reviewer, do +2 when you've the undistracted time :) [20:47:15] I think we need to revert that (fully?) 
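Editor's note: the point made above, that the address after the `@` in the error is the connecting client (virt1000) while the address in the trailing parentheses is the database host being contacted, is easy to misread. A minimal sketch of pulling the two apart, assuming the message always has the shape quoted in the channel; the names `AccessDenied` and `parse_access_denied` are hypothetical and not part of MediaWiki.

```python
import re
from typing import NamedTuple, Optional


class AccessDenied(NamedTuple):
    user: str
    client_ip: str  # where the connection came from (the wiki host)
    server_ip: str  # the database host that refused it


# Assumed shape, taken from the message quoted in the log:
#   Access denied for user 'USER'@'CLIENT' (using password: YES) (SERVER)
_PATTERN = re.compile(
    r"Access denied for user '([^']+)'@'([^']+)'"
    r".*?\((\d{1,3}(?:\.\d{1,3}){3})\)"
)


def parse_access_denied(message: str) -> Optional[AccessDenied]:
    """Return the user, client IP and server IP, or None if the shape differs."""
    match = _PATTERN.search(message)
    if match is None:
        return None
    return AccessDenied(*match.groups())
```

Comparing `server_ip` against the host the wiki is supposed to use is then a one-line check, which is how the wrong-target diagnosis in the channel proceeds.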
[20:47:35] the fishbowl should've stopped the random wikis accessing it [20:47:42] (03PS1) 10Reedy: Revert "Place all dbconfig for wikitech in a guard" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157970 [20:47:51] Because the fishbowl change should take care of that hack right? [20:48:03] yup [20:48:26] Works for me. We should probably wait for the current scap to finish [20:48:30] sure [20:48:48] as that change gives it a specific db "cluster" to use [20:48:56] as is, it's doing the generic s3 fallback [20:49:02] And try to get wikitech on wmf18 soon instead of it's own lagged branch [20:49:14] that shouldn't be too much of an issue [20:49:21] I think there's a couple of db updates to be done [20:49:30] but update.php works for wikitech presumably? [20:49:35] andrewbogott: ^ ? [20:49:42] I bet it does [20:49:46] sure it works, I just want to keep the two in sync. [20:50:06] And… trying not to break wikitech.wikimedia.org while we're unbreaking virt1000.wikimedia.org :) [20:50:07] yeah. make it work then make it better [20:50:22] yeah [20:50:34] there's some other config cleanup and stuff to be done too, but that's a minor issue [20:51:27] l10n build is almost done then we will just have to wait for the sync [20:51:39] Finished mw-update-l10n (duration: 12m 14s) [20:53:28] mutante: do you know where the 'alert groups' for alerting about icinga are configured? [20:56:35] yuvipanda: puppet/files/icinga/contactgroups.cfg [20:56:41] ah ty [20:56:51] yuvipanda: the actual members of the groups are in private repo.. before you start looking [20:56:56] because they have phone numbers [20:57:01] for paging [20:57:09] so depends what you really need [20:58:26] to spam greg-g [20:59:21] mutante: yeah, ready to spam greg-g and chrismcmahon and others for alerts about betalabs [20:59:48] http://blog.cloudflare.com/the-relative-cost-of-bandwidth-around-the-world [21:00:01] yuvipanda: is the collector really big enough to handle every instance now? 
[21:00:01] yuvipanda: good luck with the "critical" part and adding another group besides ops [21:00:05] spagewmf: Respected human, time to deploy Flow (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140902T2100). Please do the needful. [21:00:11] I guess it's a production box now, huh? [21:00:11] sync-apaches is slooooow for new (old really) branch l10n [21:00:15] andrewbogott: yeah, it's a prod box [21:00:35] spagewmf: I'm in the middle of running scap :/ [21:00:38] TerrifiedPanda: by "spam", do you mean email? [21:00:48] (03CR) 10Andrew Bogott: [C: 032] "Hope the collector doesn't collapse under the load!" [puppet] - 10https://gerrit.wikimedia.org/r/157969 (owner: 10Yuvipanda) [21:01:03] bd808|deploy: you know, it's just typical, I actually only just purged the l10n cache for wmf15 yesterday [21:01:42] heh [21:01:57] I saw that and was happy that you did it [21:02:22] bd808|deploy: no worries, Flow team has no plans for our window. [21:02:48] mutante: yeah :) [21:03:02] mutante: how did the 'modularize icinga' work go? [21:03:04] andrewbogott: should we ping s_pringle to fix the db grant? [21:03:04] kewl. The neverending wikitech deploy will be glad to continue in its place [21:03:31] TerrifiedPanda: https://gerrit.wikimedia.org/r/#/c/145472/ [21:03:58] TerrifiedPanda: how do i test it and work around the naggen problem? [21:04:05] is the current question [21:04:06] mutante: I don't know. 
[21:04:32] mutante: If you mean on virt1000, I think that's still a config issue [21:04:42] yes, i mean this: Access denied for user 'wikiuser'@'208.80.154.18' [21:04:49] where the IP is virt1000 [21:05:15] mutante: yeah, not a db issue -- the db works just fine for wikitech.wikimedia.org [21:05:16] mutante: We have bad config [21:05:23] The problem is we aren't getting the password to virt1000 somehow [21:05:40] andrewbogott: no, we're using the wrong db server target :) [21:05:42] oh, better yet [21:05:47] I reverted bd808|deploys change from earlier [21:05:53] just waiting for scap to finish so we can merge etc [21:07:59] sync-common: 56% (ok: 129; fail: 0; left: 99) [21:10:30] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet). [21:17:04] bd808|deploy: what are we upto now? [21:17:13] sync-common: 77% (ok: 177; fail: 0; left: 51) [21:17:38] (03CR) 10Aaron Schulz: "How big was the capacity increase? I can't find any email with the full changes?" [puppet] - 10https://gerrit.wikimedia.org/r/157678 (owner: 10Filippo Giunchedi) [21:17:45] Some are slooooow -- mw1078 INFO - Finished rsync common (duration: 11m 54s) [21:18:22] I wonder why [21:18:34] something to dig into when not doing other stuff [21:18:47] * greg-g looks in and wonders what is going on [21:19:26] Probably one rsync slave that is sick. I'll check to see if they slow hosts are all pulling from the same box [21:19:40] (03PS1) 10Yuvipanda: icinga: Consistent indentation in contactgroups.cfg [puppet] - 10https://gerrit.wikimedia.org/r/157974 [21:19:45] mutante: ^ I see group member names there. [21:20:24] TerrifiedPanda: yea, but you don't see the definition of the members [21:20:31] Reedy: mw1070 looks to be involved in several slow syncs as the origin [21:20:38] mutante: ah, you mean email addresses, etc? 
[21:20:52] TerrifiedPanda: which would associate user name to email, phone, notification period..timezone [21:21:00] yes [21:21:04] bd808|deploy: load average: 105.11, 111.52, 91.1 [21:21:07] that probably doesn't help :) [21:21:12] yikes [21:21:19] mutante: ah, ok [21:21:51] Reedy: ugh -- https://ganglia.wikimedia.org/latest/graph_all_periods.php?h=mw1070.eqiad.wmnet&m=cpu_report&r=hour&s=descending&hc=4&mc=2&st=1409692875&g=cpu_report&z=large&c=Application%20servers%20eqiad [21:21:59] (03CR) 10Dzahn: [C: 032] icinga: Consistent indentation in contactgroups.cfg [puppet] - 10https://gerrit.wikimedia.org/r/157974 (owner: 10Yuvipanda) [21:22:26] I guess we might have just chosen a "bad" apache to sync from [21:22:53] I'll have a look in racktables and see if I can choose a better option [21:23:28] Finished sync-apaches (duration: 28m 17s) [21:23:30] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [21:25:55] Reedy: It got really unfairly loaded: 122 mw1070.eqiad.wmnet; 77 mw1010.eqiad.wmnet; 44 mw1161.eqiad.wmnet; 6 mw1201.eqiad.wmnet [21:26:08] ah [21:26:11] That does suck [21:26:30] I wonder if we should try adding a couple more initial deploy targets [21:26:35] Reedy, bd808|deploy: https://dpaste.de/nUtj [21:27:02] andrewbogott: Presumably due to you copying the cache dir in... [21:27:07] Try killing it and run sync-common again? [21:27:10] yeah, lemme clear that and try again [21:27:14] ie rm -rf php-1.24wmf15/cache [21:29:08] (03PS1) 10Dzahn: virtualize monitor host for wikidata monitoring [puppet] - 10https://gerrit.wikimedia.org/r/157976 [21:29:43] (03PS2) 10Dzahn: virtualize monitor host for wikidata monitoring [puppet] - 10https://gerrit.wikimedia.org/r/157976 [21:31:48] so close -- scap-rebuild-cdbs: 99% (ok: 227; fail: 0; left: 1) [21:31:59] fenari? [21:32:09] (03CR) 10Dzahn: [C: 032] "that IP is just text-lb.esams, btw, but you need something to put it on .." 
[puppet] - 10https://gerrit.wikimedia.org/r/157976 (owner: 10Dzahn) [21:32:32] Nope virt1000? [21:32:40] PROBLEM - Puppet freshness on virt1000 is CRITICAL: Last successful Puppet run was Tue 02 Sep 2014 19:32:00 UTC [21:32:47] oh that was andrew running himself [21:32:50] yeah [21:33:02] Someone might want to ack that puppet freshness on virt1000 for now [21:33:07] still wiating to see what the last one is but fenari is done [21:34:34] apache roulette [21:34:49] !log bd808 Finished scap: no-op scap to build l10n for wikitech (duration: 55m 48s) [21:34:54] wooo [21:34:57] Logged the message, Master [21:35:00] andrewbogott: How's it going this time/ [21:35:07] heh. it was fenari. [21:35:17] Reedy: sync is working, wiki still not [21:35:18] (03PS2) 10Reedy: Revert "Place all dbconfig for wikitech in a guard" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157970 [21:35:44] andrewbogott: so when sync is done, ^ I need to merge/deploy that, and then sync-common again [21:35:53] ok [21:35:54] (03PS1) 10BBlack: Remove LVS/SSL defs for unused project-lb IPs [puppet] - 10https://gerrit.wikimedia.org/r/157978 [21:35:56] (03PS1) 10BBlack: Remove unused foo-lb.(eqiad|esams).wm.o A/AAAA recs [dns] - 10https://gerrit.wikimedia.org/r/157979 [21:35:58] (03PS1) 10BBlack: Remove revdns for unused project-lb.site hostnames [dns] - 10https://gerrit.wikimedia.org/r/157980 [21:36:00] (03PS1) 10BBlack: Remove references to deprecated $project-lb.wm.o names [dns] - 10https://gerrit.wikimedia.org/r/157981 [21:36:02] (03PS1) 10BBlack: Remove actual $project-lb.wm.o domainnames [dns] - 10https://gerrit.wikimedia.org/r/157982 [21:36:11] Reedy: sync on virt1000 is done, if that's what you meant [21:36:17] aha [21:36:29] (03CR) 10Reedy: [C: 032] Revert "Place all dbconfig for wikitech in a guard" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157970 (owner: 10Reedy) [21:36:35] (03Merged) 10jenkins-bot: Revert "Place all dbconfig for wikitech in a guard" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/157970 (owner: 10Reedy) [21:36:37] (03PS1) 10Yurik: Zero: unified many carriers, reorged the rest [puppet] - 10https://gerrit.wikimedia.org/r/157984 [21:36:38] ACKNOWLEDGEMENT - Puppet freshness on virt1000 is CRITICAL: Last successful Puppet run was Tue 02 Sep 2014 19:32:00 UTC andrew bogott Disabled while I fuss with deployment [21:36:39] bblack, ^ [21:37:09] andrewbogott: sync-common again and see where we are [21:37:23] !log reedy Synchronized wmf-config/db-eqiad.php: Wikitech db (duration: 00m 22s) [21:37:27] (03CR) 10Dzahn: "this made it actually define a host in icinga now, so the service check from Change-Id: Iac540d1b47df39 could be created" [puppet] - 10https://gerrit.wikimedia.org/r/157976 (owner: 10Dzahn) [21:37:29] Logged the message, Master [21:37:32] Reedy: same [21:38:02] yurikR: have you tried it on beta or anything for syntax errors? [21:38:43] andrewbogott: touch wmf-config/InitialiseSettings.php [21:38:59] jzerebecki: good: the check has now been created after that achange above, bad: 'The command defined for service check if wikidata.org dispatch lag is higher than 2 minutes does not exist' [21:39:04] looking :) [21:39:04] done [21:39:20] sigh, so it's trying to still use s3 [21:41:27] (03CR) 10Dzahn: "follow-up was needed in Change-Id: I6a0fd28b6c1974ab (make host a virtual resource or Icinga won't create a host for this check)" [puppet] - 10https://gerrit.wikimedia.org/r/157863 (owner: 10JanZerebecki) [21:42:16] andrewbogott: ah, it's actually a different error [21:42:17] I see it [21:42:22] IT's flow related [21:42:25] I know the fix [21:42:31] (Cannot contact the database server: Access denied for user 'wikiuser'@'208.80.154.18' (using password: YES) (10.64.16.18)) [21:42:37] I'm getting the right db on terbium [21:42:42] '10.64.16.18' => 10, # db1029 [21:42:53] bd808: It's trying to use the extension1 shard [21:42:55] which it shouldn't [21:43:34] 'wmgEchoCluster' [21:43:35] let me fix [21:44:45] (03PS1) 
10Reedy: Don't use extension1 cluster for labswiki for Echo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157987 [21:45:01] (03CR) 10Reedy: [C: 032] Don't use extension1 cluster for labswiki for Echo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157987 (owner: 10Reedy) [21:45:04] * andrewbogott supervises [21:45:05] andrewbogott: sync-common :) [21:45:08] (03Restored) 10Ori.livneh: wmflib: add ensure_service() [puppet] - 10https://gerrit.wikimedia.org/r/149778 (owner: 10Ori.livneh) [21:45:10] (03Merged) 10jenkins-bot: Don't use extension1 cluster for labswiki for Echo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157987 (owner: 10Reedy) [21:45:28] (03PS4) 10Ori.livneh: wmflib: add ensure_service() [puppet] - 10https://gerrit.wikimedia.org/r/149778 [21:45:45] !log reedy Synchronized wmf-config/InitialiseSettings.php: (no message) (duration: 00m 23s) [21:45:51] Logged the message, Master [21:46:13] Hey, it loads [21:46:23] fuck yeah [21:46:23] no css [21:46:29] Looks like no static content. But a great stride forward! [21:46:32] lol [21:46:34] it's trying to use bits [21:46:37] that's easily fixed too [21:46:42] I'm not sure if that should come from bits or locally… either way I guess [21:46:45] Should it not use bits? [21:47:09] should it? :) [21:47:11] PROBLEM - check if wikidata.org dispatch lag is higher than 2 minutes on wikidata is CRITICAL: The command defined for service check if wikidata.org dispatch lag is higher than 2 minutes does not exist [21:47:15] I guess I'd prefer to keep it local, only because wikitech is where you look for troubleshooting info when bits goes down [21:47:35] yeah [21:47:38] let me tidy this up [21:47:46] means it's like testwiki in that regard [21:49:42] Sorry all, was victim to my ISP's complete inability to understand their own routing. 
[21:50:05] (03PS1) 10Reedy: Don't load JS/CSS from bits for labswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157993 [21:50:43] (03CR) 10Reedy: [C: 032] Don't load JS/CSS from bits for labswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157993 (owner: 10Reedy) [21:50:46] andrewbogott: and again! :D [21:50:47] (03Merged) 10jenkins-bot: Don't load JS/CSS from bits for labswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157993 (owner: 10Reedy) [21:51:11] hm, still no sidebar [21:51:43] smw seems to work though, that's good! [21:51:48] !log reedy Synchronized wmf-config/CommonSettings.php: (no message) (duration: 00m 25s) [21:51:55] Logged the message, Master [21:52:32] oh, missed one [21:52:44] $wgLoadScript = "//{$wmfHostnames['bits']}/{$_SERVER['SERVER_NAME']}/load.php"; [21:54:03] (03PS1) 10Reedy: Don't use load.php from bits for labswiki either [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157995 [21:54:41] andrewbogott: and again [21:55:19] (03CR) 10Reedy: [C: 032] Don't use load.php from bits for labswiki either [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157995 (owner: 10Reedy) [21:55:23] (03Merged) 10jenkins-bot: Don't use load.php from bits for labswiki either [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157995 (owner: 10Reedy) [21:55:32] much better! [21:55:35] Let's see if I can log in [21:55:44] nope! [21:55:53] boom [21:55:56] !log reedy Synchronized wmf-config/CommonSettings.php: (no message) (duration: 00m 24s) [21:56:01] Logged the message, Master [21:56:13] oh damn it [21:56:17] did i virt1001? 
[21:56:18] This actually looks like the CA already logged in bug [21:56:27] hm, it worked briefly [21:56:28] yup [21:56:29] ffs [21:56:55] (03PS1) 10Reedy: virt1001 -> virt1000 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157996 [21:56:59] andrewbogott: and again :P [21:57:19] (03CR) 10Reedy: [C: 032] virt1001 -> virt1000 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157996 (owner: 10Reedy) [21:57:23] (03Merged) 10jenkins-bot: virt1001 -> virt1000 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157996 (owner: 10Reedy) [21:57:56] !log reedy Synchronized wmf-config/CommonSettings.php: (no message) (duration: 00m 21s) [21:58:02] Login page shows an exception for me [21:58:03] Logged the message, Master [21:58:05] [7706616f] 2014-09-02 21:57:50: Fatal exception of type MWException [21:58:25] and on Special:Version [21:58:46] Wonder if I should merge https://gerrit.wikimedia.org/r/#/c/157877/ ? [21:59:03] uh, it all looks broken now [21:59:43] yeah, all broken for me too [21:59:58] (03PS1) 10BBlack: comment-out $project-lb.$site revdns for unused names [dns] - 10https://gerrit.wikimedia.org/r/157997 [22:00:00] (03PS1) 10BBlack: s/foundation-lb/text-lb/ in revdns for esams old text lb [dns] - 10https://gerrit.wikimedia.org/r/157998 [22:01:38] (03CR) 10BBlack: [C: 032] comment-out $project-lb.$site revdns for unused names [dns] - 10https://gerrit.wikimedia.org/r/157997 (owner: 10BBlack) [22:01:39] bblack, making another minor correction, if you haven't +2ed yet, pls wait a bit [22:01:45] (03PS2) 10Reedy: Add debugging to wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157877 [22:01:46] k [22:01:58] (03PS3) 10Reedy: Add debugging to wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157877 [22:02:01] (03CR) 10Reedy: [C: 032] Add debugging to wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157877 (owner: 10Reedy) [22:02:05] (03Merged) 10jenkins-bot: Add debugging to wikitech [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/157877 (owner: 10Reedy) [22:02:09] andrewbogott: ^ [22:02:11] (03CR) 10BBlack: [C: 032] s/foundation-lb/text-lb/ in revdns for esams old text lb [dns] - 10https://gerrit.wikimedia.org/r/157998 (owner: 10BBlack) [22:02:14] might aswell make this debugging a bit simpler [22:02:20] done [22:02:34] heh, it's trying to use redis [22:02:52] for session store [22:05:35] That looks to me more complicated than just lowering a flag :( [22:05:37] (03PS2) 10BBlack: Remove unused foo-lb.(eqiad|esams).wm.o A/AAAA recs [dns] - 10https://gerrit.wikimedia.org/r/157979 [22:06:29] (03CR) 10Dzahn: "what's going on, none of the virtual monitor_hosts end up in Icinga (anymore)?" [puppet] - 10https://gerrit.wikimedia.org/r/157976 (owner: 10Dzahn) [22:06:30] I'm guessing it's trying to use the main memcached pool too [22:07:00] andrewbogott: did all the memcached config get transferred? [22:07:15] I think so. I'll make sure [22:07:23] what memcached config? [22:07:27] for wikitech [22:07:33] it should all be in puppet now [22:07:43] in the mediawiki config? [22:08:13] is there a patch? it'd be easier to explain this with reference to a particular patch [22:08:44] (03PS1) 10Andrew Bogott: Wikitech uses wikitech for memcached [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158001 [22:09:27] Reedy: you think ^ is sufficient? [22:09:39] (03PS1) 10Reedy: Add wikitech.php to noc [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158002 [22:09:56] andrewbogott: I'm not sure, it might conflict with the config there [22:09:56] andrewbogott: can you make the local memcached server listen on port 11212? [22:10:17] (03CR) 10Reedy: [C: 032] Add wikitech.php to noc [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158002 (owner: 10Reedy) [22:10:21] (03Merged) 10jenkins-bot: Add wikitech.php to noc [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158002 (owner: 10Reedy) [22:10:24] ori: Maybe? I'm trying to reproduce the old config as much as possible... 
[22:10:29] andrewbogott: if you do that, no config change is necessary, because the standard app servers connect to twemproxy on that port [22:10:36] on localhost? [22:10:37] which is a memcached proxy that pretends to be a memcached servers [22:10:38] yep [22:10:44] ok, looking... [22:10:44] *a memcached server [22:10:53] !log reedy Synchronized docroot and w: (no message) (duration: 00m 14s) [22:10:59] Logged the message, Master [22:11:05] um… by the way, we don't actually have evidence that memcache is even broken do we? [22:11:11] ori: Oh, yeah, duh [22:11:19] andrewbogott: https://github.com/wikimedia/operations-mediawiki-config/blob/master/wmf-config/mc.php#L12 [22:11:28] andrewbogott: I don't think memcached is, just how MW is trying to talk to it [22:11:32] andrewbogott: it's easy to test [22:11:32] * Reedy hugs ori [22:11:39] :) [22:11:56] Reedy, but our errors are redis, aren't they? Or are there also mc errors? [22:12:15] andrewbogott: the redis errors are blocking progress [22:12:22] but memcached won't actually work due to the wrong config [22:12:31] need to stop it using redis for sessions [22:12:33] just doing that now [22:12:42] can mc listen on two ports at once? [22:12:57] andrewbogott: no [22:13:12] oh... [22:13:12] but why not change the port? is it used by anything else? [22:13:27] It's used by wikitech righ tnow [22:13:34] (03PS1) 10Reedy: Update Wikitech sessions config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158003 [22:13:39] so far we've managed to not break wikitech (much) while fussing with the new setup [22:13:41] i'm all in favor of sticking to the old config as much as possible, but this is an easy opportunity to make wikitech transparently compatible with the rest of the cluster [22:13:48] can we just install twemproxy/whatever on wikitech? 
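A rough sketch of the point above: because twemproxy (nutcracker) "pretends to be a memcached server", a client can speak the plain memcached text protocol to it on localhost:11212 with nothing proxy-specific. The Python below builds and parses the raw `set`/`get` exchange. The helper names (`build_set`, `build_get`, `parse_get_response`, `roundtrip`) are invented for illustration and are not part of MediaWiki or twemproxy; the port is simply the one mentioned in the discussion.

```python
import socket


def build_set(key: str, value: bytes, flags: int = 0, exptime: int = 0) -> bytes:
    """Encode a memcached text-protocol 'set' command for one key."""
    header = f"set {key} {flags} {exptime} {len(value)}\r\n".encode("ascii")
    return header + value + b"\r\n"


def build_get(key: str) -> bytes:
    """Encode a memcached text-protocol 'get' command for one key."""
    return f"get {key}\r\n".encode("ascii")


def parse_get_response(raw: bytes):
    """Return the stored bytes from a 'get' reply, or None on a cache miss."""
    if raw.startswith(b"END"):
        return None
    header, _, rest = raw.partition(b"\r\n")
    parts = header.split()
    if len(parts) < 4 or parts[0] != b"VALUE":
        raise ValueError(f"unexpected reply: {raw!r}")
    size = int(parts[3])
    return rest[:size]


def roundtrip(key: str, value: bytes, host: str = "127.0.0.1", port: int = 11212):
    """Set then get one key through whatever listens on host:port.

    Assumes each reply fits in a single recv(), which is fine for a
    quick manual check but not for production code.
    """
    with socket.create_connection((host, port), timeout=2.0) as sock:
        sock.sendall(build_set(key, value))
        stored = sock.recv(4096).startswith(b"STORED")
        sock.sendall(build_get(key))
        fetched = parse_get_response(sock.recv(4096))
    return stored, fetched
```

Calling `roundtrip("sketch:probe", b"123")` on a host where the proxy and its backing memcached are both healthy should give back the value that was stored; a miss or a closed connection suggests the proxy is not reaching a backend.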
[22:13:57] actually, we could [22:14:16] s/whatever/nutcracker/ [22:14:17] (03PS2) 10Yurik: Zero: unified many carriers, reorged the rest, all disabled are uni [puppet] - 10https://gerrit.wikimedia.org/r/157984 [22:14:21] bblack, ok, fixed ^ [22:14:44] tested? [22:14:45] * domas looks [22:14:45] https://noc.wikimedia.org/conf/highlight.php?file=wikitech.php [22:15:40] yurikR: have you actually run it for varnish syntax on beta? [22:15:50] bblack, no [22:15:52] so… install twemproxy? Is there a standard puppet class for that? [22:16:10] wikitech the wiki needs memcached nowadays? [22:16:17] oh my. [22:16:18] domas: has for a while [22:16:20] SMW and shit ;) [22:16:23] hah [22:16:33] andrewbogott: https://github.com/wikimedia/operations-puppet/blob/production/manifests/role/mediawiki.pp#L13 [22:16:43] well, also OpenStackManager needs a cache so it doesn't have to query nova 1000 times/minute [22:16:43] ori: Can I just copy/steal... [22:16:47] just what I was looking at [22:17:02] APC! [22:17:09] RECOVERY - check if wikidata.org dispatch lag is higher than 2 minutes on wikidata is OK: HTTP OK: HTTP/1.1 200 OK - 431062 bytes in 2.438 second response time [22:17:09] Presumably that whole block, just with one server? [22:17:34] did that check actually read 0.4MB? [22:17:44] Reedy: yes, though note the easy-to-miss weight param (server:host:weight) [22:17:50] server:port, even [22:18:11] I saw the :1 [22:18:18] Shall I do it? to nova.pp I guess? [22:18:37] (03CR) 10Dzahn: "lolwhat. how is that "check_https_url_for_regexp!www.wikidata.org!/w/api.php!foo" actually OK now? "foo" was supposed to break it :)" [puppet] - 10https://gerrit.wikimedia.org/r/157976 (owner: 10Dzahn) [22:18:37] yes, sure. 
then it's nice and uniform [22:18:50] Reedy: yeah, in the controller role [22:19:00] domas: it's worse than you think :P [22:19:34] PROBLEM - Puppet freshness on mw1053 is CRITICAL: Last successful Puppet run was Sun 31 Aug 2014 01:50:18 UTC [22:20:50] (03PS1) 10Reedy: Add nutcracker to nova controller role [puppet] - 10https://gerrit.wikimedia.org/r/158005 [22:21:04] * Reedy waits for jenkins [22:21:41] (03CR) 10Andrew Bogott: [C: 032] Add nutcracker to nova controller role [puppet] - 10https://gerrit.wikimedia.org/r/158005 (owner: 10Reedy) [22:22:14] (03CR) 10Reedy: [C: 032] Update Wikitech sessions config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158003 (owner: 10Reedy) [22:22:16] um… I can't run puppet on virt1000 without breaking apache [22:22:20] (03Merged) 10jenkins-bot: Update Wikitech sessions config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158003 (owner: 10Reedy) [22:22:21] because Coren live-hacked apache [22:22:24] haha [22:23:04] Coren, can you tell me specifically which files you hacked? [22:23:18] well, in the current state we're only going to break virt1000.wikimedia.org... [22:23:34] True but I would enjoy knowing how to turn it back on [22:23:42] mutante: Squid>Varnish for frontend and backend text pageview caches, is that 100% complete? [22:23:42] so you could take a copy of the apache config, run puppet, replace files, kick/graceful/restart apache [22:24:06] PROBLEM - puppet last run on palladium is CRITICAL: CRITICAL: Epic puppet fail [22:24:27] I don't know what 'the apache config' consists of. Might include ports.conf or apache2.conf or sites-available or env-available... [22:24:40] I was just thinking the whole /etc/apache2 or something [22:24:47] heh, that'll work [22:24:51] ofc if Coren can confirm, that'll be easier [22:24:51] mutante: RoanKattouw: Last I looked at the architecture we had incoming front-end 'hot' squids (indescriminate, caching the most popular hits), and then shared/hash-based backend squids. 
[22:24:55] if his internet isn't still sucking [22:25:02] Did we do something similar for Varnish now? [22:26:02] Error: /Stage[main]/Nutcracker/Service[nutcracker]: Could not evaluate: Could not find init script or upstart conf file for 'nutcracker' [22:26:05] !log reedy Synchronized wmf-config/wikitech.php: (no message) (duration: 00m 13s) [22:26:06] RECOVERY - Puppet freshness on virt1000 is OK: puppet ran at Tue Sep 2 22:25:56 UTC 2014 [22:26:06] RECOVERY - puppet last run on palladium is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [22:26:11] Logged the message, Master [22:26:12] (03PS2) 10BBlack: Remove revdns for unused project-lb.site hostnames [dns] - 10https://gerrit.wikimedia.org/r/157980 [22:26:14] (03PS3) 10BBlack: Remove unused foo-lb.(eqiad|esams).wm.o A/AAAA recs [dns] - 10https://gerrit.wikimedia.org/r/157979 [22:26:26] PROBLEM - puppet last run on virt1000 is CRITICAL: Connection refused by host [22:26:29] Krinkle: Yes, we have frontend and backend Varnishes [22:26:35] oh, that might be a race, lemme try again [22:26:37] (03PS2) 10BBlack: Remove references to deprecated $project-lb.wm.o names [dns] - 10https://gerrit.wikimedia.org/r/157981 [22:26:43] (03PS2) 10BBlack: Remove actual $project-lb.wm.o domainnames [dns] - 10https://gerrit.wikimedia.org/r/157982 [22:26:55] With the frontend Varnishes being glorified hash-based routers with an in-memory cache [22:27:31] Krinkle: well, the squid stuff has been deleted. 
yep https://gerrit.wikimedia.org/r/#/c/101864/ [22:28:13] http://noc.wikimedia.org/conf/highlight.php?file=squid.php [22:28:19] https://github.com/search?q=squid+%40wikimedia&type=Code [22:28:26] Still a bunch of 2-eyed creatures left [22:28:59] Krinkle: it's mostly because the variables and files haven't been renamed [22:28:59] Reedy: so, that's all done, but the errors are still all about redis [22:29:02] even more 2-eyed creatures visiting our sites, but that aside [22:29:11] andrewbogott: did you run sync-common after I merged 158003? [22:29:25] * andrewbogott does, again! [22:29:41] Ooh, a new error! [22:29:51] And it loads, I guess that's debug mode working [22:29:56] hahaha [22:29:58] (03PS1) 10Ori.livneh: mediawiki::monitoring::errors: report to statsd [puppet] - 10https://gerrit.wikimedia.org/r/158008 [22:30:02] We don't need TMH on wikitech :) [22:30:12] or SMW, for that matter [22:30:14] * ori ducks [22:30:25] Reedy: Hm.. so that means Varnish exposes a Squid-compatible protocol (or a third party standard) for purging then, because MediaWiki hasn't changed much. [22:30:41] Krinkle: IIRC, yeah. bblack should be able to tell you more [22:30:47] https://www.varnish-cache.org/docs/3.0/tutorial/purging.html [22:31:06] the searches using SMW were broken anyways i think [22:31:11] https://virt1000.wikimedia.org/wiki/Special:Version [22:32:10] https://wikitech.wikimedia.org/wiki/Varnish#One-off_purges [22:32:20] enjoy the "don't do this" part :) [22:32:38] I can't log in, unclear why [22:32:52] what's up? [22:33:11] I see traffic about purging, but I haven't found the original issue in the backscroll yet :) [22:33:53] Krinkle, Reedy there's a service called varnishhtcpd [22:34:05] listens for squid proto. generates varnish purge [22:34:06] AIUI [22:34:10] Reedy: can you log in? [22:34:21] bblack: No issue, just updating my knowledge about squid/varnish caching for page views [22:34:26] andrewbogott: Not tried. I'm just disabling a load of extensions. 
will check in a minute or 2 [22:34:29] oh ok [22:34:47] bblack: the old page view stats are still called "squid stats" [22:34:58] and squid all over URL paths [22:35:06] we have outdated information? someone call the president! :) [22:35:10] Reedy: diffing that version page with https://wikitech.wikimedia.org/wiki/Special:Version ? [22:35:12] lol [22:35:18] bblack, bug #1 [22:35:38] so the really really basic rundown of the current state of purging is: [22:35:46] 1) We don't use squid anywhere, just varnish [22:35:49] (03PS1) 10Reedy: Disable numerous extensions on labswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158009 [22:35:52] bblack: Reedy: bits servers got folded back in the main pool, right? https://wikitech.wikimedia.org/wiki/Bits_varnish_testing [22:36:05] * Krinkle listens :) [22:36:23] 2) When you purge through mediawiki, it sends multicasts everywhere, which get picked up by varnishhtcpd on the varnish hosts and forwarded to varnish as PURGE requests [22:36:25] andrewbogott: Not quite. Mostly looking what's disabled on loginwiki. There's a few that are probalby useful etc [22:36:29] bblack, even on brewster? [22:36:40] brewster is not a varnish? [22:36:49] i thought it was squid [22:36:55] (03CR) 10Reedy: [C: 032] Disable numerous extensions on labswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158009 (owner: 10Reedy) [22:36:59] (03Merged) 10jenkins-bot: Disable numerous extensions on labswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158009 (owner: 10Reedy) [22:36:59] brewster is a nothing as far as production traffic goes, AFAIK [22:37:01] andrewbogott: ^^ sync-common [22:37:12] right. 
it's the forward proxy for apt [22:37:13] ssh: Could not resolve hostname brewster: Name or service not known [22:37:24] !log reedy Synchronized wmf-config/InitialiseSettings.php: (no message) (duration: 00m 13s) [22:37:29] Logged the message, Master [22:37:40] brewster is decom'ed [22:38:16] yeah, well i can't remember the new names [22:38:21] 3) vhtcpd -> varnish purge queues are reliable, but multicast is not. it is possible for a purge to miss one or more caches somewhere due to packet loss [22:38:41] (in which case you just purge again, or it gets fixed on the next edit, etc) [22:38:58] bblack, are you testing it or should i? [22:39:00] andrewbogott: on login do you get a cookie error? [22:39:06] chasemp: I also added you to https://gerrit.wikimedia.org/r/#/c/157969/ [22:39:07] yurikR: either way [22:39:10] Reedy: yes [22:39:20] bblack, ok, testing, one sec [22:39:26] jeremyb: nothing I'm talking about has anything to do with apt, just prod traffic [22:39:41] ok [22:39:51] (03CR) 10Dzahn: "https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=wikidata&service=check+if+wikidata.org+dispatch+lag+is+higher+than+2+m" [puppet] - 10https://gerrit.wikimedia.org/r/157863 (owner: 10JanZerebecki) [22:39:55] what about Krinkle's question about bits folding into main cluster? [22:40:05] that was done IIRC [22:40:15] app cluster? 
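As a rough illustration of the last hop in the purge rundown above (points 1 to 3), where vhtcpd forwards a purge to Varnish as an HTTP PURGE request, the Python sketch below shows what such a request looks like and how one might send it by hand. The function names are hypothetical, the cache host is whatever Varnish instance you point it at, and whether a given Varnish honours PURGE at all depends on its VCL and access rules. This does not reproduce the multicast HTCP step.

```python
import http.client


def build_purge_request(site_host: str, path: str) -> bytes:
    """Return the bytes of a minimal HTTP/1.1 PURGE request, for illustration."""
    return f"PURGE {path} HTTP/1.1\r\nHost: {site_host}\r\n\r\n".encode("ascii")


def send_purge(cache_host: str, site_host: str, path: str,
               port: int = 80, timeout: float = 2.0) -> int:
    """Send one PURGE to a single cache and return the HTTP status code.

    cache_host is the Varnish machine; site_host is the Host header that
    identifies which site's object should be dropped.
    """
    conn = http.client.HTTPConnection(cache_host, port, timeout=timeout)
    try:
        conn.request("PURGE", path, headers={"Host": site_host})
        return conn.getresponse().status
    finally:
        conn.close()
```

Since the multicast leg can drop packets, a purge that missed a cache is simply repeated, as noted in point 3.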
[22:40:18] varnish [22:40:26] PROBLEM - puppet last run on virt0 is CRITICAL: CRITICAL: Puppet has 1 failures [22:40:28] at the varnish level we still have a separate pool of bits caches [22:42:00] https://wikitech.wikimedia.org/wiki/Bits_varnish_testing is really really ancient [22:42:23] it would probably be better to just delete that document, or mark that it's only of archaeological interest [22:42:47] andrewbogott: I guess the session config isn't quite right [22:43:46] (03CR) 10Krinkle: [C: 031] Alphasort extension-list-labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157831 (owner: 10Reedy) [22:43:48] I'm pretty sure that oathauth throws bogus cookie errors sometimes too. So it could be one of several things. [22:43:52] (03CR) 10Krinkle: [C: 031] Add sudo -u apache call to foreachwikiindblist [puppet] - 10https://gerrit.wikimedia.org/r/157013 (owner: 10Reedy) [22:43:57] I wonder if anyone is lurking around who doesn't uses 2fa on wikitech [22:44:32] i don't... [22:44:40] oh [22:44:44] bblack, updated files and restarted, seems like its ok [22:44:46] ok then, 2fa is blameless [22:44:50] heh [22:45:02] so far [22:45:11] * andrewbogott stares daggers at oathauth [22:45:41] (03PS2) 10Krinkle: Add sudo -u apache call to foreachwikiindblist [puppet] - 10https://gerrit.wikimedia.org/r/157013 (owner: 10Reedy) [22:46:11] (03CR) 10Krinkle: "Hm... to be fair, mwscript uses "if groups | grep -Ewq 'sudo|wikidev|root'". Is that relevant here?" [puppet] - 10https://gerrit.wikimedia.org/r/157013 (owner: 10Reedy) [22:46:46] andrewbogott: I wonder if comment out $wgCacheDirectory = "$IP/cache"; might help [22:47:03] letting it use /tmp/mw-cache-... 
[22:47:11] I'll try [22:47:29] Not sure why it would (but we should probably do this anyway so we're not writing stuff into /cache for sync-common to wipe out etc) [22:47:56] (03PS1) 10Reedy: Use wikimedia default cache dir [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158012 [22:48:46] (03CR) 10Reedy: [C: 032] Use wikimedia default cache dir [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158012 (owner: 10Reedy) [22:48:47] Seems not to matter [22:48:50] (03Merged) 10jenkins-bot: Use wikimedia default cache dir [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158012 (owner: 10Reedy) [22:49:16] !log reedy Synchronized wmf-config/wikitech.php: (no message) (duration: 00m 14s) [22:49:23] Logged the message, Master [22:49:54] Reedy: I get that same cookie error if I give a bad password [22:50:04] (03PS8) 10BBlack: Add codfw subnets (now: remove pmtpa IPv6 networks) [puppet] - 10https://gerrit.wikimedia.org/r/156090 (owner: 10Mark Bergsma) [22:50:07] Or a bogus username [22:50:15] yeah, mediawiki error handling here sucks IIRC [22:51:50] (03CR) 10BBlack: [C: 032] Add codfw subnets (now: remove pmtpa IPv6 networks) [puppet] - 10https://gerrit.wikimedia.org/r/156090 (owner: 10Mark Bergsma) [22:52:05] $wgDisableCookieCheck [22:52:07] * Reedy wonders [22:52:47] andrewbogott: might be worth running maintenance/mctest.php to check memcached config is actually "working" [22:53:34] root@virt1000:/usr/local/apache/common# php ./php-1.24wmf15/maintenance/mctest.php [22:53:35] PHP Notice: Undefined variable: wmfRealm in /usr/local/apache/common-local/wmf-config/StartProfiler.php on line 28 [22:53:36] No MWMultiVersion instance initialized! MWScript.php wrapper not used? 
[22:53:54] heh [22:54:01] mwscript mctest.php --wiki=testwiki [22:54:05] ok [22:54:16] Not sure if all the wrapper scripts are there though [22:54:39] (03PS3) 10BBlack: Zero: unified many carriers, reorged the rest, all disabled are uni [puppet] - 10https://gerrit.wikimedia.org/r/157984 (owner: 10Yurik) [22:54:47] bblack, thx [22:54:59] (03CR) 10BBlack: [C: 032 V: 032] Zero: unified many carriers, reorged the rest, all disabled are uni [puppet] - 10https://gerrit.wikimedia.org/r/157984 (owner: 10Yurik) [22:55:13] you mean…php multiversion/MWScript.php mctest.php --wiki=testwiki [22:55:22] I don't have an mwscript in my path [22:55:33] yeah, that looks right [22:55:41] well, not =testwiki [22:55:44] =labswiki [22:55:52] my fault, sorry [22:56:34] root@virt1000:/usr/local/apache/common# sudo -u apache php multiversion/MWScript.php mctest.php --wiki=labswiki [22:56:35] 127.0.0.1:11211 set: 0 incr: 0 get: 0 time: 0.002032995223999 [22:56:36] seems fine [22:57:06] PROBLEM - check if wikidata.org dispatch lag is higher than 2 minutes on wikidata is CRITICAL: The command defined for service check if wikidata.org dispatch lag is higher than 2 minutes does not exist [22:57:58] i think those numbers are supposed to be 0 [22:58:00] *100 [22:58:04] reedy@ubuntu64-web-esxi:/var/www/wiki/mediawiki/core$ php maintenance/mctest.php [22:58:05] 127.0.0.1:11211 set: 100 incr: 100 get: 100 time: 0.0417640209198 [22:58:09] ^ yea, we just added that, it's issue with monitoring itself, not wikidata [22:58:26] RECOVERY - puppet last run on virt0 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [22:59:12] I think I know... [22:59:49] memcached-pecl [23:00:05] RoanKattouw, ^d, marktraceur, MaxSem: Dear anthropoid, the time has come. Please deploy SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140902T2300). 
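For comparing the two mctest.php outputs quoted above, a small parser can make the difference explicit. This is only a sketch in Python: it assumes, based on the remark that the numbers "are supposed to be 100", that each counter should equal the iteration count (100 here), and the function names are hypothetical rather than anything shipped with MediaWiki.

```python
import re

# Matches lines such as:
#   127.0.0.1:11211 set: 100 incr: 100 get: 100 time: 0.0417640209198
LINE_RE = re.compile(
    r"^(?P<server>\S+)\s+set:\s*(?P<set>\d+)\s+incr:\s*(?P<incr>\d+)"
    r"\s+get:\s*(?P<get>\d+)\s+time:\s*(?P<time>[\d.]+)"
)


def parse_mctest_line(line: str) -> dict:
    """Turn one mctest.php result line into a dict of typed fields."""
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not an mctest result line: {line!r}")
    return {
        "server": match["server"],
        "set": int(match["set"]),
        "incr": int(match["incr"]),
        "get": int(match["get"]),
        "time": float(match["time"]),
    }


def server_healthy(result: dict, iterations: int = 100) -> bool:
    """True if every operation succeeded on every iteration (assumed 100)."""
    return all(result[op] == iterations for op in ("set", "incr", "get"))
```

Feeding it the virt1000 line gives all-zero counters and an unhealthy verdict, while the line from the local development install passes.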
[23:00:33] * RoanKattouw raises hand [23:00:55] (03PS1) 10Reedy: CACHE_MEMCACHED -> 'memcached-pecl' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158015 [23:01:06] andrewbogott: when is memcached not memcached? [23:01:18] (03CR) 10Reedy: [C: 032] CACHE_MEMCACHED -> 'memcached-pecl' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158015 (owner: 10Reedy) [23:01:23] (03Merged) 10jenkins-bot: CACHE_MEMCACHED -> 'memcached-pecl' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158015 (owner: 10Reedy) [23:02:07] andrewbogott: sync-common! :D [23:02:11] # sudo -u apache php multiversion/MWScript.php mctest.php --wiki=labswiki [23:02:12] PHP Fatal error: Class 'Memcached' not found in /usr/local/apache/common-local/php-1.24wmf15/includes/objectcache/MemcachedPeclBagOStuff.php on line 60 [23:03:04] hmmm [23:03:21] andrewbogott: Is the memcached pecl module installed? [23:03:40] ... [23:03:50] I don't know what that is. Installed where? [23:04:09] dpkg -l | grep php5-memcached [23:04:11] on virt1000 [23:04:27] not installed. [23:04:53] But nutcracker is, I thought we were using that? Is php5-memcached a binding for nutcracker? [23:04:59] yup [23:05:11] so mediawiki had some php classes inbuilt to use memcached [23:05:20] we moved away from that for various reasons in production [23:05:49] I wonder if novacontroller should just install mediawiki::packages::php5 [23:07:44] I guess it could just be installed manually at this point.. 
and possibly file an RT ticket or something to remember to add it to puppet [23:08:40] apache would need graceful/restart after [23:09:56] (03PS1) 10Dduvall: Labs: Varnish backend/director for isolated security audits [puppet] - 10https://gerrit.wikimedia.org/r/158016 (https://bugzilla.wikimedia.org/70181) [23:11:48] Reedy: I'd prefer it be puppetized; I'll do it [23:12:13] ok, thanks [23:12:24] like I say, including mediawiki::packages::php5 would work [23:14:19] !log Running extensions/GlobalCssJs/removeOldManualUserPages.php per [[m:GlobalCssJs]] [23:14:26] Logged the message, Master [23:15:15] (03PS1) 10Andrew Bogott: Include mediawiki::packages::php5 in nova controller role, for mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/158018 [23:15:39] (03CR) 10Reedy: [C: 031] Include mediawiki::packages::php5 in nova controller role, for mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/158018 (owner: 10Andrew Bogott) [23:16:03] (03CR) 10Andrew Bogott: [C: 032] Include mediawiki::packages::php5 in nova controller role, for mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/158018 (owner: 10Andrew Bogott) [23:17:33] (03CR) 10CSteipp: [C: 031] Labs: Varnish backend/director for isolated security audits [puppet] - 10https://gerrit.wikimedia.org/r/158016 (https://bugzilla.wikimedia.org/70181) (owner: 10Dduvall) [23:18:27] PROBLEM - puppet last run on virt1000 is CRITICAL: CRITICAL: Epic puppet fail [23:19:15] Reedy: would i be getting in your way if i pushed a small extension update to wmf19? [23:19:47] ori: Not mine. I'm only touching mediawiki-config atm. Might wanna check with RoanKattouw as he's swatting [23:20:13] ah, thanks. 
roan, let me know if i can squeeze in a patch after you're done [23:20:36] I think jenkins merged, but haven't seen a deploy [23:21:16] (03PS1) 10Andrew Bogott: Remove packages duplicated in ::mediawiki::packages::php5 [puppet] - 10https://gerrit.wikimedia.org/r/158021 [23:21:59] I'm about to deploy [23:22:01] (03CR) 10Reedy: [C: 031] "aha, nice" [puppet] - 10https://gerrit.wikimedia.org/r/158021 (owner: 10Andrew Bogott) [23:22:07] Sorry got distracted because someone in the office said "ResourceLoader" [23:22:42] !log catrope Synchronized php-1.24wmf19/includes/OutputPage.php: 5094c0d9c (duration: 00m 05s) [23:22:47] Logged the message, Master [23:22:49] (03CR) 10Andrew Bogott: [C: 032] Remove packages duplicated in ::mediawiki::packages::php5 [puppet] - 10https://gerrit.wikimedia.org/r/158021 (owner: 10Andrew Bogott) [23:23:08] ori: OK it's all yours, go bonkers [23:23:11] (03Abandoned) 10Andrew Bogott: Wikitech uses wikitech for memcached [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158001 (owner: 10Andrew Bogott) [23:23:18] RoanKattouw: sweet, thanks [23:23:57] Reedy: warning, btw, I have to go in about half an hour. Probably bedtime for you anyway :) [23:24:08] heh [23:24:18] I'm happy to finish whenever you go for the day [23:24:22] it's not too late here yet [23:24:36] we're nearly there I think too, which is great [23:26:25] (03PS2) 10Dduvall: Labs: Varnish backend/director for isolated security audits [puppet] - 10https://gerrit.wikimedia.org/r/158016 (https://bugzilla.wikimedia.org/70181) [23:27:29] (03PS1) 10Reedy: Remove libmemcached10 [puppet] - 10https://gerrit.wikimedia.org/r/158023 [23:31:12] (03CR) 10Reedy: "http://packages.ubuntu.com/trusty/php5-memcached" [puppet] - 10https://gerrit.wikimedia.org/r/158023 (owner: 10Reedy) [23:31:19] Reedy: seems memcached is still not working. 
[23:31:21] sudo -u apache php multiversion/MWScript.php mctest.php --wiki=labswiki [23:31:23] PHP Deprecated: Comments starting with '#' are deprecated in /etc/php5/cli/conf.d/fss.ini on line 1 in Unknown on line 0 [23:31:23] 127.0.0.1:11212 set: 0 incr: 0 get: 0 time: 0.019572973251343 [23:31:51] lol [23:32:56] !log reedy Synchronized wmf-config/wikitech.php: (no message) (duration: 00m 13s) [23:34:31] Reedy: any more ideas, or shall we knock off for the night? [23:34:36] (03PS1) 10Parent5446: Set password default to PBKDF2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158024 (https://bugzilla.wikimedia.org/68766) [23:34:58] (03PS1) 10RobH: assign new mgmt ips to codfw cisco systems [dns] - 10https://gerrit.wikimedia.org/r/158025 [23:35:41] ori: do we have any outside mediawiki to test nutcracker? [23:36:57] amusingly, icinga says virt1000 has redis installede [23:37:16] (03CR) 10MaxSem: [C: 04-1] Set password default to PBKDF2 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158024 (https://bugzilla.wikimedia.org/68766) (owner: 10Parent5446) [23:38:04] Reedy: Redis is installed, but it's for keystone [23:38:12] ah [23:38:13] (03CR) 10RobH: [C: 031] "This looks right to me, and can go live as soon as someone else reviews and confirms its a sane addition. 
(Since it is DNS and can break " [dns] - 10https://gerrit.wikimedia.org/r/158025 (owner: 10RobH) [23:38:15] (03PS2) 10Parent5446: Set password default to PBKDF2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158024 (https://bugzilla.wikimedia.org/68766) [23:38:45] i wonder why icinga isn't showing the nutcracker tests on virt1000 [23:38:46] (03CR) 10Parent5446: Set password default to PBKDF2 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158024 (https://bugzilla.wikimedia.org/68766) (owner: 10Parent5446) [23:40:37] Reedy: `echo stats | telnet localhost 11212` might show you if the proxy is running and talking to the backing memcached [23:40:59] <^d> 2014-09-02 23:40:18 labswiki (new): has not yet been dumped [23:41:01] thanks bd808. andrewbogott ^^ [23:41:05] heh [23:41:32] root@virt1000:/usr/local/apache/common# `echo stats | telnet localhost 11212` [23:41:33] Connection closed by foreign host. [23:41:34] Trying: command not found [23:41:46] hmmm or not. doesn't seem to work on deployment-mediawiki01 [23:41:52] oh, wait, I'm quoting it wrong [23:42:09] # echo stats | telnet localhost 11212 [23:42:10] Trying ::1... [23:42:12] Trying 127.0.0.1... [23:42:13] ^d: that of course will break :P [23:42:13] Connected to localhost. [23:42:14] Escape character is '^]'. [23:42:15] Connection closed by foreign host. [23:42:17] Seems that someone is listening there at least [23:42:26] <^d> Reedy: Yeah, just happened across it when I was looking for another dump :p [23:42:34] andrewbogott: Yeah. That's what I get in beta too. [23:43:05] We don't seem to have memcached-tool installed which would be another way [23:44:09] I note mctest doesn't work from tin/terbium either [23:45:39] andrewbogott: Oooh. 
Try `telnet localhost 22222` to get a stats report from nutcracker [23:46:04] https://dpaste.de/FP8f [23:46:44] 0 client connections [23:46:50] That looks to me like nutcracker isn't talking to the local memcached as expected [23:46:55] I don't know if that means ever or current [23:47:01] ori mentioned something about weighting [23:47:16] i wasn't sure if he meant to make sure it's there, or to remove it (I left it as is) [23:47:29] it doesn't necessarily keep an open connection unless one has been initiated [23:47:32] run mwscript eval.php [23:47:42] and do $wgMemc->set('test:foo', 123); [23:47:48] var_dump( $wgMemc->get('test:foo') ); [23:47:52] if that works, it works. [23:48:29] bool(false) [23:48:31] https://github.com/wikimedia/operations-puppet/commit/0662f27dd53a9483b460952945d55ea7a653089c andrewbogott authored 4 days ago [23:48:33] not working [23:48:45] "response_bytes":0 seems not working to me [23:48:57] Aug 30, 2014 9:37 AM [23:49:01] beta looks something like this -- https://dpaste.de/iDmm [23:49:04] !log ori Synchronized php-1.24wmf19/extensions/WikimediaEvents: Update WikimediaEvents for cherry-picks (duration: 00m 03s) [23:49:11] Logged the message, Master [23:49:20] is this virt1001? [23:49:26] virt1000 [23:49:27] https://gerrit.wikimedia.org/r/#/c/158005/1/manifests/role/nova.pp,unified [23:49:38] '127.0.0.1:11211:1' [23:51:38] (03CR) 10BryanDavis: "Can be tested by cherry-picking and applying in beta. See bd808: I think he already has based on the bug comments :) [23:52:47] bd808: yup. i should have commented in gerrit as well [23:53:22] * bd808 seldom looks at the bug [23:53:35] I need to go. Ori, feel free to tinker with virt1000 but be warned that puppet is currently disabled to preserve the apache config. [23:53:51] i only saw due to the email notification [23:53:54] We can revisit this tomorrow :/ [23:55:52] andrewbogott: Don't be sad. This is going swimmingly. 
:) [23:56:04] andrewbogott: nod [23:56:15] Yeah, we'll never know which problem is the last one until it's fixed. Maybe only one more! [23:56:34] Reedy: this one is both a) ugly if true and b) probably something you could log spelunk, hopefully. [23:56:39] Reedy: https://bugzilla.wikimedia.org/show_bug.cgi?id=70145 [23:56:59] greg-g: Are you buying me a MacBook? [23:56:59] :D [23:58:33] Reedy: let's go with no for now ;) [23:58:47] Ugh [23:58:53] Reedy: but was hoping you could see something obvious in the logs :/ [23:58:54] I can only get old Safari on Windows apparently :/ [23:58:57] ahh [23:59:38] No iOS device either... [23:59:46] So can only do stuff server side I guess