[00:00:45] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
[00:01:38] (03PS3) 10Dzahn: contint: disable unattended upgrade [puppet] - 10https://gerrit.wikimedia.org/r/210391 (https://phabricator.wikimedia.org/T98876) (owner: 10Hashar)
[00:03:03] (03CR) 10Dzahn: [C: 032] contint: disable unattended upgrade [puppet] - 10https://gerrit.wikimedia.org/r/210391 (https://phabricator.wikimedia.org/T98876) (owner: 10Hashar)
[00:04:13] !log ori Synchronized php-1.26wmf6/includes: 9bf0236c20, 2d3c9233ed (duration: 00m 17s)
[00:04:21] Logged the message, Master
[00:13:28] 6operations, 10ops-eqiad: ssh connection to some management servers fails, a hard reset may be needed - https://phabricator.wikimedia.org/T99805#1304212 (10Dzahn) p:5Triage>3High
[00:13:51] 6operations, 10Deployment-Systems: errors reported by "eventual_consistency_deployment_server_init" on new deploy server - https://phabricator.wikimedia.org/T99928#1304213 (10Dzahn) p:5Triage>3Normal
[00:14:43] oh, the favicon of phab changed, right
[00:14:58] i guess it was the restart from the other config change
[00:17:37] 6operations: [7a44ef6d] 2015-05-22 11:26:53: Fatal exception of type MWException - https://phabricator.wikimedia.org/T100012#1304215 (10Dzahn) p:5Triage>3High
[00:22:54] trying to get permission to +2 something in gerrit. wondering if anyone here can point me in the right direction?
[00:24:10] cwdent: for which repository?
[00:24:24] also, see http://www.mediawiki.org/wiki/Gerrit/%2B2
[00:25:38] i am in fundraising tech, working with mediawiki-vagrant now
[00:26:43] and yes i will be careful, i don't even want to +2, but team said i should :)
[00:30:08] cwdent: did you just start working for WMF?
[00:30:25] yep, 2 weeks ago
[00:30:28] cwdent: would you know of some kind of onboarding ticket?
[00:30:38] you are probably just lacking the WMF LDAP group
[00:31:01] that problem happens with disturbing regularity
[00:31:07] yes, like every time :)
[00:31:13] belated welcome, cwdent :)
[00:31:15] hrm, i'm not sure. i have LDAP access other places...
[00:31:17] yea, welcome
[00:31:26] and sorry about that but we didnt get a notification
[00:31:27] thanks! i'm honored to be here
[00:32:15] at least, i think i do...wikitech is the ldap account?
[00:32:28] yeah
[00:33:06] yeah, those creds work various places.
[00:33:14] * cwdent still figuring out logins
[00:33:37] !log adding cwdent to WMF LDAP group per https://www.mediawiki.org/wiki/User:CDentinger_%28WMF%29
[00:33:43] Logged the message, Master
[00:34:48] cwdent: try on a relevant gerrit link and see if the +2 is not greyed out anymore, without hitting submit
[00:36:59] mutante: i actually don't see a +2 at all...i'm looking at the radio buttons after i click review, right?
[00:37:30] cwdent: yes, try logging out and back in
[00:39:33] mutante: ah ok, i see the other buttons on a ticket for the fundraising-dash repo (which i've pushed to if that's relevant) but not mediawiki-vagrant
[00:40:01] cwdent: mutante's gotta run, i'll take a look
[00:40:34] ok, sorry to be bugging you late, this can totally wait till next week
[00:40:35] so the WMF group is like the default and gives you a bunch of repos but not everything
[00:40:47] thanks, handing over to jgage
[00:41:15] thanks mutante, have a good weekend
[00:42:30] hm there's an ldap group called vagrant, maybe that's what we need
[00:42:33] * jgage pokes around
[00:43:06] cwdent: what is the url to the gerrit change you're trying to +2?
[00:44:59] we had another coloradoan in the office recently, but i didn't catch his name
[00:45:05] he was from my hometown, co springs
[00:45:19] https://gerrit.wikimedia.org/r/#/c/212820/1,publish
[00:45:27] k
[00:45:42] nice yeah there's a couple of us. i live 3 blocks from thcipriani|afk
[00:45:50] cool
[00:46:17] i flew through denver recently. those views of the rockies always make me want to stay.
[00:47:02] yeah i love the front range. i lived in the springs for about a year
[00:50:53] hmm i wonder how to determine the mapping between projects and ldap groups
[00:51:35] oh rad you worked for sparkfun
[00:51:49] i wish i was a competent hardware hacker
[00:51:58] i know just enough to enjoy browsing their catalog
[00:52:27] ha yeah i was there for 7 years
[00:52:57] though i'm only borderline competent as a hardware hacker
[01:17:08] cwdent: if you're still around, try again?
[01:17:24] (reloading the gerrit url should be sufficient)
[01:18:02] that did it! thanks a ton jgage
[01:18:30] yay!
[01:18:35] that took some sleuthing :)
[01:18:37] * jgage takes notes
[02:20:00] !log l10nupdate Synchronized php-1.26wmf6/cache/l10n: (no message) (duration: 06m 02s)
[02:20:22] Logged the message, Master
[02:24:40] !log LocalisationUpdate completed (1.26wmf6) at 2015-05-23 02:23:36+00:00
[02:24:47] Logged the message, Master
[02:24:54] PROBLEM - are wikitech and wt-static in sync on silver is CRITICAL: wikitech-static CRIT - wikitech and wikitech-static out of sync (94787s 90000s)
[02:41:25] !log l10nupdate Synchronized php-1.26wmf7/cache/l10n: (no message) (duration: 05m 56s)
[02:41:38] Logged the message, Master
[02:45:51] !log LocalisationUpdate completed (1.26wmf7) at 2015-05-23 02:44:48+00:00
[02:45:57] Logged the message, Master
[04:13:14] PROBLEM - puppet last run on mw2042 is CRITICAL puppet fail
[04:27:15] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 21.43% of data above the critical threshold [500.0]
[04:31:45] RECOVERY - puppet last run on mw2042 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:33:53] 6operations, 7Wikimedia-log-errors: [7a44ef6d] 2015-05-22 11:26:53: Fatal exception of type MWException - https://phabricator.wikimedia.org/T100012#1304248 (10Glaisher)
[04:42:34] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
[04:54:23] 6operations: Google Webmaster Tools - 1000 domain limit - https://phabricator.wikimedia.org/T99132#1304250 (10dr0ptp4kt) Update: I've received some guidance. May be a bit until I can take action on it. In a nutshell, though: * Not possible to increase the sites limit * We could consolidate the multitude of dist...
[05:06:46] RECOVERY - are wikitech and wt-static in sync on silver is OK: wikitech-static OK - wikitech and wikitech-static in sync (7934 90000s)
[05:13:47] !log LocalisationUpdate ResourceLoader cache refresh completed at Sat May 23 05:12:44 UTC 2015 (duration 12m 43s)
[05:13:53] Logged the message, Master
[05:19:44] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 6.67% of data above the critical threshold [500.0]
[05:29:45] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
[06:32:55] PROBLEM - puppet last run on db1040 is CRITICAL Puppet has 1 failures
[06:33:04] PROBLEM - puppet last run on subra is CRITICAL Puppet has 1 failures
[06:33:24] PROBLEM - puppet last run on cp4003 is CRITICAL Puppet has 1 failures
[06:33:44] PROBLEM - puppet last run on mw2016 is CRITICAL Puppet has 1 failures
[06:34:16] PROBLEM - puppet last run on mw2083 is CRITICAL Puppet has 1 failures
[06:34:16] PROBLEM - puppet last run on mw2123 is CRITICAL Puppet has 1 failures
[06:34:24] PROBLEM - puppet last run on mw1092 is CRITICAL Puppet has 1 failures
[06:34:44] PROBLEM - puppet last run on mw2079 is CRITICAL Puppet has 2 failures
[06:34:45] PROBLEM - puppet last run on mw2022 is CRITICAL Puppet has 1 failures
[06:35:05] PROBLEM - puppet last run on mw1052 is CRITICAL Puppet has 1 failures
[06:46:14] RECOVERY - puppet last run on mw2083 is OK Puppet is currently enabled, last run 48 seconds ago with 0 failures
[06:46:15] RECOVERY - puppet last run on mw1092 is OK Puppet is currently enabled, last run 22 seconds ago with 0 failures
[06:46:25] RECOVERY - puppet last run on db1040 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:46:36] RECOVERY - puppet last run on subra is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:46:55] RECOVERY - puppet last run on cp4003 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:47:04] RECOVERY - puppet last run on mw1052 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:47:15] RECOVERY - puppet last run on mw2016 is OK Puppet is currently enabled, last run 49 seconds ago with 0 failures
[06:47:54] RECOVERY - puppet last run on mw2123 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:48:15] RECOVERY - puppet last run on mw2079 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:48:16] RECOVERY - puppet last run on mw2022 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:22:06] (03CR) 10Filippo Giunchedi: "LGTM, the diamond changes should go in a separate code review though" [puppet] - 10https://gerrit.wikimedia.org/r/212041 (https://phabricator.wikimedia.org/T83580) (owner: 10Ottomata)
[07:25:32] (03CR) 10Filippo Giunchedi: initial debian packaging (031 comment) [debs/python-etcd] - 10https://gerrit.wikimedia.org/r/212528 (https://phabricator.wikimedia.org/T99771) (owner: 10Filippo Giunchedi)
[07:34:25] PROBLEM - Persistent high iowait on labstore1001 is CRITICAL 87.50% of data above the critical threshold [35.0]
[07:41:25] RECOVERY - Persistent high iowait on labstore1001 is OK Less than 50.00% above the threshold [25.0]
[07:49:45] PROBLEM - Persistent high iowait on labstore1001 is CRITICAL 55.56% of data above the critical threshold [35.0]
[08:01:36] RECOVERY - Persistent high iowait on labstore1001 is OK Less than 50.00% above the threshold [25.0]
[08:06:44] PROBLEM - Persistent high iowait on labstore1001 is CRITICAL 50.00% of data above the critical threshold [35.0]
[08:11:44] PROBLEM - High load average on labstore1001 is CRITICAL 50.00% of data above the critical threshold [24.0]
[08:13:24] RECOVERY - Persistent high iowait on labstore1001 is OK Less than 50.00% above the threshold [25.0]
[08:18:52] hmm
[08:20:54] PROBLEM - puppet last run on sca1001 is CRITICAL Puppet has 15 failures
[08:25:15] RECOVERY - High load average on labstore1001 is OK Less than 50.00% above the threshold [16.0]
[08:29:54] legoktm: we should get UrlShortener deployed somewhere
[08:31:04] yuvipanda: yesssssss
[08:31:54] PROBLEM - Persistent high iowait on labstore1001 is CRITICAL 55.56% of data above the critical threshold [35.0]
[08:33:35] PROBLEM - High load average on labstore1001 is CRITICAL 50.00% of data above the critical threshold [24.0]
[08:35:35] legoktm: should we rewrite it to be a nodejs service first, tho?
[08:35:44] * legoktm slaps yuvipanda
[08:35:54] RECOVERY - puppet last run on sca1001 is OK Puppet is currently enabled, last run 20 seconds ago with 0 failures
[08:36:55] RECOVERY - High load average on labstore1001 is OK Less than 50.00% above the threshold [16.0]
[08:43:54] PROBLEM - High load average on labstore1001 is CRITICAL 60.00% of data above the critical threshold [24.0]
[08:45:14] _joe_: https://get.docker.com/
[08:46:26] https://get.tools.wmflabs.org/
[08:46:37] * yuvipanda swats legoktm
[08:55:35] RECOVERY - Persistent high iowait on labstore1001 is OK Less than 50.00% above the threshold [25.0]
[08:55:35] RECOVERY - High load average on labstore1001 is OK Less than 50.00% above the threshold [16.0]
[09:00:44] PROBLEM - Persistent high iowait on labstore1001 is CRITICAL 62.50% of data above the critical threshold [35.0]
[09:15:46] RECOVERY - Persistent high iowait on labstore1001 is OK Less than 50.00% above the threshold [25.0]
[09:16:10] 6operations: Upgrade sodium to jessie - https://phabricator.wikimedia.org/T82698#1304472 (10fgiunchedi) looks like ~200G used ATM ``` /dev/mapper/sodium-mailman 280G 102G 179G 37% /var/lib/mailman ``` so from the spares list, this "Dell PowerEdge R420, single Intel Xeon E5-2450 v2 2.50...
[09:21:34] (03PS1) 10Yuvipanda: mesos: Install docker on all slaves [puppet] - 10https://gerrit.wikimedia.org/r/212861
[09:22:18] (03PS2) 10Yuvipanda: mesos: Install docker on all slaves [puppet] - 10https://gerrit.wikimedia.org/r/212861 (https://phabricator.wikimedia.org/T99923)
[09:22:39] (03CR) 10Yuvipanda: [C: 032 V: 032] mesos: Install docker on all slaves [puppet] - 10https://gerrit.wikimedia.org/r/212861 (https://phabricator.wikimedia.org/T99923) (owner: 10Yuvipanda)
[09:24:03] 6operations: Upgrade sodium to jessie - https://phabricator.wikimedia.org/T82698#1304486 (10faidon) a:5faidon>3None Why not a (ganeti) VM? In any case, this ticket lacks an owner/assignee. Finding a machine for that is the easy part :)
[09:28:38] 6operations, 10ops-eqiad: analytics1036 can't talk cross row? - https://phabricator.wikimedia.org/T99845#1304490 (10faidon) I've rebooted multiple times, also checked BIOS settings, also rebooted the iDRAC a few times just in case it was something related to NIC sharing. I've upgraded all of the firmware on th...
[09:29:18] 6operations: Upgrade sodium to jessie - https://phabricator.wikimedia.org/T82698#1304491 (10fgiunchedi) does it make a difference that it needs a public ip? if it doesn't a VM would be a good fit indeed. very true re: owner, cc @mark
[09:38:48] (03PS1) 10Yuvipanda: mesos: Use require_package to get docker [puppet] - 10https://gerrit.wikimedia.org/r/212865
[09:38:50] (03PS1) 10Yuvipanda: mesos: Enable docker as containerization mechanism for mesos [puppet] - 10https://gerrit.wikimedia.org/r/212866 (https://phabricator.wikimedia.org/T99923)
[09:39:14] (03CR) 10Yuvipanda: [C: 032 V: 032] mesos: Use require_package to get docker [puppet] - 10https://gerrit.wikimedia.org/r/212865 (owner: 10Yuvipanda)
[09:39:43] (03CR) 10Yuvipanda: [C: 032 V: 032] mesos: Enable docker as containerization mechanism for mesos [puppet] - 10https://gerrit.wikimedia.org/r/212866 (https://phabricator.wikimedia.org/T99923) (owner: 10Yuvipanda)
[09:44:54] (03PS2) 10Ori.livneh: Re-enable xhprof profiling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/212808 (https://phabricator.wikimedia.org/T66301)
[09:44:56] (03PS1) 10Ori.livneh: Change StatsD port to another value temporarily [mediawiki-config] - 10https://gerrit.wikimedia.org/r/212868
[09:44:58] (03PS1) 10Ori.livneh: Revert "Change StatsD port to another value temporarily" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/212869
[09:45:04] (03PS1) 10Yuvipanda: mesos: Increase registry execution timeout to support docker [puppet] - 10https://gerrit.wikimedia.org/r/212870
[09:45:27] ^ godog
[09:45:36] (03CR) 10Yuvipanda: [C: 032 V: 032] mesos: Increase registry execution timeout to support docker [puppet] - 10https://gerrit.wikimedia.org/r/212870 (owner: 10Yuvipanda)
[09:45:53] (03CR) 10Filippo Giunchedi: [C: 031] Change StatsD port to another value temporarily [mediawiki-config] - 10https://gerrit.wikimedia.org/r/212868 (owner: 10Ori.livneh)
[09:49:54] (03CR) 10Ori.livneh: [C: 032] Change StatsD port to another value temporarily [mediawiki-config] - 10https://gerrit.wikimedia.org/r/212868 (owner: 10Ori.livneh)
[09:50:00] (03Merged) 10jenkins-bot: Change StatsD port to another value temporarily [mediawiki-config] - 10https://gerrit.wikimedia.org/r/212868 (owner: 10Ori.livneh)
[09:51:15] (03PS1) 10Yuvipanda: mesos: Puppetize docker config file [puppet] - 10https://gerrit.wikimedia.org/r/212871
[09:52:07] !log ori Synchronized wmf-config/CommonSettings.php: I311c989e9: Change StatsD port to another value temporarily (duration: 00m 14s)
[09:52:16] Logged the message, Master
[09:54:38] (03CR) 10Filippo Giunchedi: [C: 031] Re-enable xhprof profiling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/212808 (https://phabricator.wikimedia.org/T66301) (owner: 10Ori.livneh)
[09:56:19] (03CR) 10Ori.livneh: [C: 032] Re-enable xhprof profiling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/212808 (https://phabricator.wikimedia.org/T66301) (owner: 10Ori.livneh)
[09:56:25] (03Merged) 10jenkins-bot: Re-enable xhprof profiling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/212808 (https://phabricator.wikimedia.org/T66301) (owner: 10Ori.livneh)
[09:56:35] PROBLEM - Translation cache space on mw1203 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 92%
[09:57:50] !log ori Synchronized wmf-config/StartProfiler.php: Ia7549d45: Re-enable xhprof profiling (duration: 00m 14s)
[09:57:56] Logged the message, Master
[09:58:04] 6operations: Upgrade sodium to jessie - https://phabricator.wikimedia.org/T82698#1304560 (10JohnLewis) >>! In T82698#1304491, @fgiunchedi wrote: > does it make a difference that it needs a public ip? if it doesn't a VM would be a good fit indeed. very true re: owner, cc @mark Mailman handles its own mail proc...
[09:59:36] (03PS2) 10Yuvipanda: mesos: Puppetize docker config file [puppet] - 10https://gerrit.wikimedia.org/r/212871
[10:01:16] JohnFLewis: are you at the hackathon?
[10:01:24] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet).
[10:01:52] (03PS1) 10Ori.livneh: *Correctly* set port of $wgStatsdServer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/212873
[10:01:56] godog: he isn't
[10:02:30] (03CR) 10Ori.livneh: [C: 032] *Correctly* set port of $wgStatsdServer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/212873 (owner: 10Ori.livneh)
[10:02:36] (03Merged) 10jenkins-bot: *Correctly* set port of $wgStatsdServer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/212873 (owner: 10Ori.livneh)
[10:03:11] !log ori Synchronized wmf-config/CommonSettings.php: (no message) (duration: 00m 13s)
[10:03:18] Logged the message, Master
[10:03:34] PROBLEM - Translation cache space on mw1244 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 92%
[10:03:35] PROBLEM - Translation cache space on mw1248 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 92%
[10:03:44] PROBLEM - Translation cache space on mw1211 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 92%
[10:03:45] PROBLEM - Translation cache space on mw1166 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 92%
[10:03:45] PROBLEM - Translation cache space on mw1171 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 91%
[10:03:45] PROBLEM - Translation cache space on mw1243 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 95%
[10:03:46] PROBLEM - Translation cache space on mw1245 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 96%
[10:03:55] PROBLEM - Translation cache space on mw1231 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 99%
[10:03:55] PROBLEM - Translation cache space on mw1163 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 92%
[10:03:55] PROBLEM - Translation cache space on mw1082 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 91%
[10:03:55] PROBLEM - Translation cache space on mw1072 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 91%
[10:03:56] PROBLEM - Translation cache space on mw1179 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 93%
[10:03:56] PROBLEM - Translation cache space on mw1065 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 91%
[10:03:56] PROBLEM - Translation cache space on mw1097 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 92%
[10:03:56] PROBLEM - Translation cache space on mw1089 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 91%
[10:04:03] ek
[10:04:04] PROBLEM - Translation cache space on mw1172 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 94%
[10:04:04] PROBLEM - Translation cache space on mw1110 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 91%
[10:04:04] PROBLEM - Translation cache space on mw1230 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 97%
[10:04:04] PROBLEM - Translation cache space on mw1191 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 96%
[10:04:05] PROBLEM - Translation cache space on mw1036 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 92%
[10:04:05] PROBLEM - Translation cache space on mw1045 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 93%
[10:04:05] PROBLEM - Translation cache space on mw1195 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 95%
[10:04:05] what
[10:04:05] PROBLEM - Translation cache space on mw1232 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 98%
[10:04:06] PROBLEM - Translation cache space on mw1204 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 95%
[10:04:07] PROBLEM - Translation cache space on mw1147 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 91%
[10:04:07] PROBLEM - Translation cache space on mw1133 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 91%
[10:04:07] PROBLEM - Translation cache space on mw1069 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 93%
[10:04:09] needs a rolling restart
[10:04:10] i think
[10:04:18] PROBLEM - Translation cache space on mw1068 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 95%
[10:04:19] PROBLEM - Translation cache space on mw1126 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 94%
[10:04:19] PROBLEM - Translation cache space on mw1019 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 94%
[10:04:24] PROBLEM - Translation cache space on mw1125 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 94%
[10:04:24] PROBLEM - Translation cache space on mw1049 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 95%
[10:04:24] PROBLEM - Translation cache space on mw1134 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 94%
[10:04:24] PROBLEM - Translation cache space on mw1116 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 94%
[10:04:24] PROBLEM - Translation cache space on mw1122 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 95%
[10:04:24] PROBLEM - Translation cache space on mw1088 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 96%
[10:04:25] PROBLEM - Translation cache space on mw1189 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 97%
[10:04:25] PROBLEM - Translation cache space on mw1073 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 94%
[10:04:26] PROBLEM - Translation cache space on mw1067 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 94%
[10:04:26] PROBLEM - Translation cache space on mw1094 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 95%
[10:04:27] PROBLEM - Translation cache space on mw1135 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 94%
[10:04:27] PROBLEM - Translation cache space on mw1033 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 95%
[10:04:40] _joe_: ^
[10:04:44] PROBLEM - Translation cache space on mw1058 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 98%
[10:04:45] PROBLEM - Translation cache space on mw1249 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:04:45] PROBLEM - Translation cache space on mw1075 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 99%
[10:04:45] PROBLEM - Translation cache space on mw1041 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 99%
[10:04:45] PROBLEM - Translation cache space on mw1053 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 98%
[10:04:45] PROBLEM - Translation cache space on mw1253 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:04:45] PROBLEM - Translation cache space on mw1106 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 97%
[10:04:46] PROBLEM - Translation cache space on mw1029 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 99%
[10:04:54] PROBLEM - Translation cache space on mw1127 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 98%
[10:04:55] PROBLEM - Translation cache space on mw1084 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 99%
[10:04:55] PROBLEM - Translation cache space on mw1085 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 99%
[10:04:55] PROBLEM - Translation cache space on mw1043 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 99%
[10:04:55] PROBLEM - Translation cache space on mw1061 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:04:55] PROBLEM - Translation cache space on mw1022 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 99%
[10:04:59] greg-g: it's OK
[10:05:04] PROBLEM - Translation cache space on mw1144 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 99%
[10:05:04] PROBLEM - Translation cache space on mw1115 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 99%
[10:05:04] PROBLEM - Translation cache space on mw1095 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 99%
[10:05:04] PROBLEM - Translation cache space on mw1164 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:05:04] PROBLEM - Translation cache space on mw1174 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:05:04] PROBLEM - Translation cache space on mw1042 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:05:05] PROBLEM - Translation cache space on mw1039 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:05:05] PROBLEM - Translation cache space on mw1112 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:05:06] RECOVERY - Translation cache space on mw1203 is OK: HHVM_TC_SPACE OK TC sizes are OK
[10:05:14] PROBLEM - Translation cache space on mw1079 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:05:14] PROBLEM - Translation cache space on mw1109 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:05:15] PROBLEM - Translation cache space on mw1077 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:05:15] PROBLEM - Translation cache space on mw1070 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:05:15] PROBLEM - Translation cache space on mw1052 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:05:15] PROBLEM - Translation cache space on mw1025 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:05:15] RECOVERY - Translation cache space on mw1244 is OK: HHVM_TC_SPACE OK TC sizes are OK [10:05:15] PROBLEM - Translation cache space on mw1120 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:05:16] PROBLEM - Translation cache space on mw1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:05:16] PROBLEM - Translation cache space on mw1055 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:05:17] PROBLEM - Translation cache space on mw1119 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:05:17] PROBLEM - Translation cache space on mw1128 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:05:18] ori: :) [10:05:34] RECOVERY - Translation cache space on mw1245 is OK: HHVM_TC_SPACE OK TC sizes are OK [10:05:34] PROBLEM - Translation cache space on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:05:34] RECOVERY - Translation cache space on mw1231 is OK: HHVM_TC_SPACE OK TC sizes are OK [10:05:35] RECOVERY - Translation cache space on mw1163 is OK: HHVM_TC_SPACE OK TC sizes are OK [10:05:35] PROBLEM - Translation cache space on mw1028 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:05:35] PROBLEM - Translation cache space on mw1107 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:05:35] PROBLEM - Translation cache space on mw1104 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[10:05:35] RECOVERY - Translation cache space on mw1082 is OK: HHVM_TC_SPACE OK TC sizes are OK [10:05:36] RECOVERY - Translation cache space on mw1179 is OK: HHVM_TC_SPACE OK TC sizes are OK [10:05:36] RECOVERY - Translation cache space on mw1065 is OK: HHVM_TC_SPACE OK TC sizes are OK [10:05:37] RECOVERY - Translation cache space on mw1097 is OK: HHVM_TC_SPACE OK TC sizes are OK [10:05:44] RECOVERY - Translation cache space on mw1089 is OK: HHVM_TC_SPACE OK TC sizes are OK [10:05:45] RECOVERY - Translation cache space on mw1172 is OK: HHVM_TC_SPACE OK TC sizes are OK [10:05:45] RECOVERY - Translation cache space on mw1230 is OK: HHVM_TC_SPACE OK TC sizes are OK [10:05:45] RECOVERY - Translation cache space on mw1110 is OK: HHVM_TC_SPACE OK TC sizes are OK [10:05:45] RECOVERY - Translation cache space on mw1191 is OK: HHVM_TC_SPACE OK TC sizes are OK [10:05:45] RECOVERY - Translation cache space on mw1195 is OK: HHVM_TC_SPACE OK TC sizes are OK [10:05:45] RECOVERY - Translation cache space on mw1045 is OK: HHVM_TC_SPACE OK TC sizes are OK [10:05:46] RECOVERY - Translation cache space on mw1232 is OK: HHVM_TC_SPACE OK TC sizes are OK [10:05:46] RECOVERY - Translation cache space on mw1036 is OK: HHVM_TC_SPACE OK TC sizes are OK [10:05:47] RECOVERY - Translation cache space on mw1204 is OK: HHVM_TC_SPACE OK TC sizes are OK [10:05:47] RECOVERY - Translation cache space on mw1147 is OK: HHVM_TC_SPACE OK TC sizes are OK [10:05:58] RECOVERY - Translation cache space on mw1071 is OK: HHVM_TC_SPACE OK TC sizes are OK [10:05:59] RECOVERY - Translation cache space on mw1194 is OK: HHVM_TC_SPACE OK TC sizes are OK [10:05:59] RECOVERY - Translation cache space on mw1068 is OK: HHVM_TC_SPACE OK TC sizes are OK [10:06:00] RECOVERY - Translation cache space on mw1126 is OK: HHVM_TC_SPACE OK TC sizes are OK [10:06:00] RECOVERY - Translation cache space on mw1019 is OK: HHVM_TC_SPACE OK TC sizes are OK [10:06:04] RECOVERY - Translation cache space on mw1233 is OK: HHVM_TC_SPACE OK 
TC sizes are OK [10:06:04] what caused that? [10:06:04] RECOVERY - Translation cache space on mw1125 is OK: HHVM_TC_SPACE OK TC sizes are OK [10:06:04] RECOVERY - Translation cache space on mw1049 is OK: HHVM_TC_SPACE OK TC sizes are OK [10:06:04] RECOVERY - Translation cache space on mw1134 is OK: HHVM_TC_SPACE OK TC sizes are OK [10:06:04] RECOVERY - Translation cache space on mw1122 is OK: HHVM_TC_SPACE OK TC sizes are OK [10:06:04] RECOVERY - Translation cache space on mw1116 is OK: HHVM_TC_SPACE OK TC sizes are OK [10:06:04] RECOVERY - Translation cache space on mw1088 is OK: HHVM_TC_SPACE OK TC sizes are OK [10:06:05] RECOVERY - Translation cache space on mw1189 is OK: HHVM_TC_SPACE OK TC sizes are OK [10:06:06] RECOVERY - Translation cache space on mw1161 is OK: HHVM_TC_SPACE OK TC sizes are OK [10:06:06] RECOVERY - Translation cache space on mw1073 is OK: HHVM_TC_SPACE OK TC sizes are OK [10:06:07] RECOVERY - Translation cache space on mw1135 is OK: HHVM_TC_SPACE OK TC sizes are OK [10:06:08] RECOVERY - Translation cache space on mw1033 is OK: HHVM_TC_SPACE OK TC sizes are OK [10:06:25] RECOVERY - Translation cache space on mw1058 is OK: HHVM_TC_SPACE OK TC sizes are OK [10:06:25] RECOVERY - Translation cache space on mw1075 is OK: HHVM_TC_SPACE OK TC sizes are OK [10:06:25] RECOVERY - Translation cache space on mw1041 is OK: HHVM_TC_SPACE OK TC sizes are OK [10:06:25] RECOVERY - Translation cache space on mw1053 is OK: HHVM_TC_SPACE OK TC sizes are OK [10:06:34] RECOVERY - Translation cache space on mw1061 is OK: HHVM_TC_SPACE OK TC sizes are OK [10:06:34] RECOVERY - Translation cache space on mw1106 is OK: HHVM_TC_SPACE OK TC sizes are OK [10:06:34] jgage: any sync of PHP code is liable to cause the TC cache size to be exceeded [10:06:34] RECOVERY - Translation cache space on mw1029 is OK: HHVM_TC_SPACE OK TC sizes are OK [10:06:34] RECOVERY - Translation cache space on mw1164 is OK: HHVM_TC_SPACE OK TC sizes are OK [10:06:35] RECOVERY - Translation cache 
space on mw1174 is OK: HHVM_TC_SPACE OK TC sizes are OK
[10:06:35] RECOVERY - Translation cache space on mw1127 is OK: HHVM_TC_SPACE OK TC sizes are OK
[10:06:35] RECOVERY - Translation cache space on mw1042 is OK: HHVM_TC_SPACE OK TC sizes are OK
[10:06:35] RECOVERY - Translation cache space on mw1039 is OK: HHVM_TC_SPACE OK TC sizes are OK
[10:06:36] RECOVERY - Translation cache space on mw1084 is OK: HHVM_TC_SPACE OK TC sizes are OK
[10:06:36] RECOVERY - Translation cache space on mw1085 is OK: HHVM_TC_SPACE OK TC sizes are OK
[10:06:44] RECOVERY - Translation cache space on mw1043 is OK: HHVM_TC_SPACE OK TC sizes are OK
[10:06:44] RECOVERY - Translation cache space on mw1112 is OK: HHVM_TC_SPACE OK TC sizes are OK
[10:06:44] RECOVERY - Translation cache space on mw1022 is OK: HHVM_TC_SPACE OK TC sizes are OK
[10:06:44] RECOVERY - Translation cache space on mw1144 is OK: HHVM_TC_SPACE OK TC sizes are OK
[10:06:44] RECOVERY - Translation cache space on mw1079 is OK: HHVM_TC_SPACE OK TC sizes are OK
[10:06:45] RECOVERY - Translation cache space on mw1115 is OK: HHVM_TC_SPACE OK TC sizes are OK
[10:06:45] RECOVERY - Translation cache space on mw1095 is OK: HHVM_TC_SPACE OK TC sizes are OK
[10:06:45] RECOVERY - Translation cache space on mw1109 is OK: HHVM_TC_SPACE OK TC sizes are OK
[10:06:46] RECOVERY - Translation cache space on mw1077 is OK: HHVM_TC_SPACE OK TC sizes are OK
[10:06:48] RECOVERY - Translation cache space on mw1025 is OK: HHVM_TC_SPACE OK TC sizes are OK
[10:06:48] RECOVERY - Translation cache space on mw1070 is OK: HHVM_TC_SPACE OK TC sizes are OK
[10:06:48] RECOVERY - Translation cache space on mw1052 is OK: HHVM_TC_SPACE OK TC sizes are OK
[10:06:49] _joe_: we should kill the alert, it just frightens people
[10:06:54] and it's not actionable
[10:06:57] ori: gotcha
[10:07:04] RECOVERY - Translation cache space on mw1098 is OK: HHVM_TC_SPACE OK TC sizes are OK
[10:07:04] RECOVERY - Translation cache space on mw1024 is OK: HHVM_TC_SPACE OK TC sizes are OK
[10:07:05] RECOVERY - Translation cache space on mw1032 is OK: HHVM_TC_SPACE OK TC sizes are OK
[10:07:05] RECOVERY - Translation cache space on mw1114 is OK: HHVM_TC_SPACE OK TC sizes are OK
[10:07:05] RECOVERY - Translation cache space on mw1028 is OK: HHVM_TC_SPACE OK TC sizes are OK
[10:07:05] RECOVERY - Translation cache space on mw1104 is OK: HHVM_TC_SPACE OK TC sizes are OK
[10:07:05] RECOVERY - Translation cache space on mw1107 is OK: HHVM_TC_SPACE OK TC sizes are OK
[10:07:15] PROBLEM - Translation cache space on mw1012 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 94%
[10:07:15] PROBLEM - Translation cache space on mw1003 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 94%
[10:07:16] PROBLEM - Translation cache space on mw1014 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 94%
[10:07:16] PROBLEM - Translation cache space on mw1008 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 94%
[10:07:24] RECOVERY - Translation cache space on mw1072 is OK: HHVM_TC_SPACE OK TC sizes are OK
[10:07:25] PROBLEM - Translation cache space on mw1005 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 94%
[10:07:34] PROBLEM - Translation cache space on mw1016 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 94%
[10:07:34] PROBLEM - Translation cache space on mw1009 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 94%
[10:07:34] RECOVERY - Translation cache space on mw1133 is OK: HHVM_TC_SPACE OK TC sizes are OK
[10:07:35] PROBLEM - Translation cache space on mw1010 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 93%
[10:07:35] RECOVERY - Translation cache space on mw1130 is OK: HHVM_TC_SPACE OK TC sizes are OK
[10:07:45] PROBLEM - Translation cache space on mw1006 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 93%
[10:07:54] PROBLEM - Translation cache space on mw1001 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 94%
[10:07:55] how do we recognize real TC space problems vs harmless?
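The HHVM_TC_SPACE messages above come from a check that maps per-section translation-cache usage to Icinga states. A minimal sketch of that kind of threshold logic; the warn/crit cutoffs and the output wording are assumptions for illustration, not the production check:

```python
# Hypothetical sketch of an HHVM_TC_SPACE-style Icinga check.
# The warn/crit thresholds are illustrative assumptions, not the
# values used in production.
OK, WARNING, CRITICAL = 0, 1, 2

def check_tc_space(sections, warn=85, crit=90):
    """Map per-section TC usage percentages to a Nagios-style state.

    `sections` is a dict like {"code.main": 94}.
    Returns (state, message)."""
    worst = OK
    messages = []
    for name, pct in sorted(sections.items()):
        if pct >= crit:
            worst = max(worst, CRITICAL)
            messages.append("%s: %d%%" % (name, pct))
        elif pct >= warn:
            worst = max(worst, WARNING)
            messages.append("%s: %d%%" % (name, pct))
    if worst == OK:
        return OK, "HHVM_TC_SPACE OK TC sizes are OK"
    prefix = "CRITICAL" if worst == CRITICAL else "WARNING"
    return worst, "HHVM_TC_SPACE %s %s" % (prefix, " ".join(messages))
```

Under this reading, the 94% PROBLEM / "TC sizes are OK" RECOVERY pairs in the scroll above are just usage crossing a fixed threshold and then dropping back after the process restarts.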
[10:07:55] PROBLEM - Translation cache space on mw1004 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 94%
[10:07:55] PROBLEM - Translation cache space on mw1011 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 94%
[10:08:25] PROBLEM - Translation cache space on mw1002 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 94%
[10:08:25] PROBLEM - Translation cache space on mw1013 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 93%
[10:08:30] jgage: they're all equally real / equally harmless. real, because they shouldn't happen. harmless (relatively) because the process restarts when they happen and that clears the TC cache
[10:08:34] PROBLEM - Translation cache space on mw1007 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 94%
[10:08:44] PROBLEM - Translation cache space on mw1015 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 94%
[10:08:52] ori: ok, cool
[10:08:55] ori: the check was added after an outage, though, right?
[10:09:48] so perhaps a better alert might be "TC space exhausted, HHVM restarted"
[10:10:06] well, if it could force the restart
[10:10:17] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 23.08% of data above the critical threshold [500.0]
[10:16:17] (03PS1) 10Ori.livneh: Exclude xhprof.run_init from being reported [mediawiki-config] - 10https://gerrit.wikimedia.org/r/212874
[10:20:14] (03PS2) 10Ori.livneh: Revert "Change StatsD port to another value temporarily" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/212869
[10:20:16] (03PS2) 10Ori.livneh: Exclude xhprof.run_init from being reported [mediawiki-config] - 10https://gerrit.wikimedia.org/r/212874
[10:20:37] (03CR) 10Ori.livneh: [C: 032] Exclude xhprof.run_init from being reported [mediawiki-config] - 10https://gerrit.wikimedia.org/r/212874 (owner: 10Ori.livneh)
[10:20:43] (03Merged) 10jenkins-bot: Exclude xhprof.run_init from being reported [mediawiki-config] - 10https://gerrit.wikimedia.org/r/212874 (owner: 10Ori.livneh)
[10:21:28] !log ori Synchronized wmf-config/StartProfiler.php: Exclude xhprof.run_init from being reported (duration: 00m 13s)
[10:21:33] Logged the message, Master
[10:22:15] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
[10:22:26] !log Metrics from MediaWiki to graphite are temporarily suspended while xhprof profiling work is ongoing.
[10:22:31] Logged the message, Master
[10:22:45] PROBLEM - Translation cache space on mw1017 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 91%
[10:24:24] RECOVERY - Translation cache space on mw1017 is OK: HHVM_TC_SPACE OK TC sizes are OK
[10:25:45] RECOVERY - Translation cache space on mw1012 is OK: HHVM_TC_SPACE OK TC sizes are OK
[10:25:45] RECOVERY - Translation cache space on mw1003 is OK: HHVM_TC_SPACE OK TC sizes are OK
[10:25:54] RECOVERY - Translation cache space on mw1014 is OK: HHVM_TC_SPACE OK TC sizes are OK
[10:25:54] RECOVERY - Translation cache space on mw1008 is OK: HHVM_TC_SPACE OK TC sizes are OK
[10:25:55] RECOVERY - Translation cache space on mw1005 is OK: HHVM_TC_SPACE OK TC sizes are OK
[10:26:04] RECOVERY - Translation cache space on mw1016 is OK: HHVM_TC_SPACE OK TC sizes are OK
[10:26:04] RECOVERY - Translation cache space on mw1009 is OK: HHVM_TC_SPACE OK TC sizes are OK
[10:26:04] RECOVERY - Translation cache space on mw1038 is OK: HHVM_TC_SPACE OK TC sizes are OK
[10:26:05] RECOVERY - Translation cache space on mw1010 is OK: HHVM_TC_SPACE OK TC sizes are OK
[10:26:15] RECOVERY - Translation cache space on mw1006 is OK: HHVM_TC_SPACE OK TC sizes are OK
[10:26:25] RECOVERY - Translation cache space on mw1001 is OK: HHVM_TC_SPACE OK TC sizes are OK
[10:26:26] RECOVERY - Translation cache space on mw1004 is OK: HHVM_TC_SPACE OK TC sizes are OK
[10:26:26] RECOVERY - Translation cache space on mw1011 is OK: HHVM_TC_SPACE OK TC sizes are OK
[10:26:55] RECOVERY - Translation cache space on mw1002 is OK: HHVM_TC_SPACE OK TC sizes are OK
[10:26:55] RECOVERY - Translation cache space on mw1013 is OK: HHVM_TC_SPACE OK TC sizes are OK
[10:27:05] RECOVERY - Translation cache space on mw1007 is OK: HHVM_TC_SPACE OK TC sizes are OK
[10:27:14] RECOVERY - Translation cache space on mw1015 is OK: HHVM_TC_SPACE OK TC sizes are OK
[10:32:15] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 14.29% of data above the critical threshold [500.0]
[10:33:20] (03PS1) 10Ori.livneh: Fix-up for I388671b [mediawiki-config] - 10https://gerrit.wikimedia.org/r/212876
[10:33:31] (03CR) 10Ori.livneh: [C: 032] Fix-up for I388671b [mediawiki-config] - 10https://gerrit.wikimedia.org/r/212876 (owner: 10Ori.livneh)
[10:33:37] (03Merged) 10jenkins-bot: Fix-up for I388671b [mediawiki-config] - 10https://gerrit.wikimedia.org/r/212876 (owner: 10Ori.livneh)
[10:49:35] RECOVERY - Router interfaces on cr1-eqiad is OK host 208.80.154.196, interfaces up: 230, down: 0, dormant: 0, excluded: 0, unused: 0
[10:50:54] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
[10:59:14] PROBLEM - puppet last run on sca1001 is CRITICAL puppet fail
[11:05:06] (03CR) 10Ori.livneh: [C: 032] Revert "Change StatsD port to another value temporarily" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/212869 (owner: 10Ori.livneh)
[11:05:12] (03Merged) 10jenkins-bot: Revert "Change StatsD port to another value temporarily" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/212869 (owner: 10Ori.livneh)
[11:16:04] RECOVERY - puppet last run on sca1001 is OK Puppet is currently enabled, last run 16 seconds ago with 0 failures
[11:16:46] !log ori Synchronized wmf-config/CommonSettings.php: Ic258d01a7: Revert "Change StatsD port to another value temporarily" (duration: 00m 13s)
[11:16:53] Logged the message, Master
[11:25:25] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL 1.69% of data above the critical threshold [1000.0]
[11:26:02] expected ^
[11:48:52] (03CR) 10Ori.livneh: [C: 032 V: 032] Initial venv [software/sentry] - 10https://gerrit.wikimedia.org/r/201006 (https://phabricator.wikimedia.org/T84956) (owner: 10Gilles)
[11:50:28] 6operations, 6Multimedia: Add monitoring of upload rate on commons to icingia alerts - https://phabricator.wikimedia.org/T92322#1304701 (10ori)
[11:53:12] 6operations, 7Graphite, 5Patch-For-Review: enable statsd extended counters - https://phabricator.wikimedia.org/T95703#1304706 (10ori) 5Open>3Resolved
[11:53:14] 6operations, 7Graphite, 5Patch-For-Review: replace txstatsd - https://phabricator.wikimedia.org/T90111#1304707 (10ori)
[11:54:44] _joe_: I'm going to help this guy get set up on toollabs for a bit. I'll be back soon
[11:54:52] _joe_: but yeah, private registry works
[11:54:59] https://mesosphere.github.io/marathon/docs/native-docker.html I need to do
[12:21:17] (03PS1) 10Ori.livneh: (ori) dotfiles update [puppet] - 10https://gerrit.wikimedia.org/r/212892
[12:21:19] (03PS1) 10Ori.livneh: graphite: set a coarser aggregation policy to relieve storage pressure [puppet] - 10https://gerrit.wikimedia.org/r/212893
[12:21:31] (03PS2) 10Ori.livneh: (ori) dotfiles update [puppet] - 10https://gerrit.wikimedia.org/r/212892
[12:21:38] (03CR) 10Ori.livneh: [C: 032 V: 032] (ori) dotfiles update [puppet] - 10https://gerrit.wikimedia.org/r/212892 (owner: 10Ori.livneh)
[12:21:47] (03PS2) 10Ori.livneh: graphite: set a coarser aggregation policy to relieve storage pressure [puppet] - 10https://gerrit.wikimedia.org/r/212893
[12:22:04] godog: all yours
[12:22:55] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
[12:34:04] ori: thanks!
[12:34:19] yuvipanda: np!
[12:34:24] 6operations, 7Graphite: audit graphite retention schemas - https://phabricator.wikimedia.org/T96662#1304809 (10fgiunchedi) so, another proposal after talking with @ori, rationale being that we're most interested in recent data for investigation purposes while older data we should retain less. Difference betwee...
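The retention-schema discussion above (T96662, and the carbon bounce to pick up the new schema) comes down to whisper file sizes: every metric is a fixed-size file determined entirely by its retention schema, so coarsening precision or shortening retention directly relieves disk pressure. A rough sketch, assuming whisper's on-disk layout (16-byte file header, a 12-byte descriptor per archive, 12 bytes per datapoint); the example schemas are made up for comparison, not the ones actually deployed in the change:

```python
# Rough size of a whisper file for a given retention schema.
# Layout assumed: 16-byte metadata header, 12 bytes per archive
# descriptor, 12 bytes per point (4-byte timestamp + 8-byte double).

UNITS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "y": 31536000}

def parse(spec):
    """Parse a spec like '1m' or '7d' into seconds."""
    return int(spec[:-1]) * UNITS[spec[-1]]

def whisper_size(retentions):
    """retentions: list of (precision, duration) pairs, e.g. [('1m', '7d')]."""
    points = [parse(dur) // parse(prec) for prec, dur in retentions]
    return 16 + 12 * len(retentions) + 12 * sum(points)

# Made-up schemas, only to show the effect of coarsening:
fine = whisper_size([("1m", "30d"), ("5m", "1y")])
coarse = whisper_size([("1m", "7d"), ("15m", "1y")])
```

Multiplied across hundreds of thousands of metrics (the MediaWiki.xhprof tree being removed above is one such burst of new metric files), the per-file difference is what the "storage pressure" in the commit message refers to.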
[12:40:04] PROBLEM - puppet last run on sca1001 is CRITICAL Puppet has 10 failures
[12:47:55] (03PS3) 10Filippo Giunchedi: graphite: set a coarser aggregation policy to relieve storage pressure [puppet] - 10https://gerrit.wikimedia.org/r/212893 (https://phabricator.wikimedia.org/T96662) (owner: 10Ori.livneh)
[12:49:36] (03CR) 10Ori.livneh: [C: 031] graphite: set a coarser aggregation policy to relieve storage pressure [puppet] - 10https://gerrit.wikimedia.org/r/212893 (https://phabricator.wikimedia.org/T96662) (owner: 10Ori.livneh)
[12:50:57] (03PS4) 10Filippo Giunchedi: graphite: set a coarser aggregation policy to relieve storage pressure [puppet] - 10https://gerrit.wikimedia.org/r/212893 (https://phabricator.wikimedia.org/T96662) (owner: 10Ori.livneh)
[12:51:21] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] graphite: set a coarser aggregation policy to relieve storage pressure [puppet] - 10https://gerrit.wikimedia.org/r/212893 (https://phabricator.wikimedia.org/T96662) (owner: 10Ori.livneh)
[12:52:58] !log bounce carbon on graphite1001 to pick up new retention schema
[12:53:04] Logged the message, Master
[12:53:57] !log remove MediaWiki.xhprof to pick up new retention schema
[12:54:03] Logged the message, Master
[13:12:25] (03PS1) 10Yuvipanda: ssh: Allow temporary opt out from more secure ssh [puppet] - 10https://gerrit.wikimedia.org/r/212909
[13:13:29] (03CR) 10Merlijn van Deen: "Shouldn't the default be 'true'?" [puppet] - 10https://gerrit.wikimedia.org/r/212909 (owner: 10Yuvipanda)
[13:14:08] (03PS2) 10Yuvipanda: ssh: Allow temporary opt out from more secure ssh [puppet] - 10https://gerrit.wikimedia.org/r/212909
[13:14:09] valhallasw: yup
[13:14:32] (03PS3) 10Yuvipanda: mesos: Puppetize docker config file [puppet] - 10https://gerrit.wikimedia.org/r/212871
[13:15:15] (03CR) 10Yuvipanda: [C: 032] mesos: Puppetize docker config file [puppet] - 10https://gerrit.wikimedia.org/r/212871 (owner: 10Yuvipanda)
[13:16:33] yuvipanda: also <%- vs <%?
[13:16:46] valhallasw: <%- trims newlines
[13:16:54] ah.
[13:16:54] RECOVERY - puppet last run on sca1001 is OK Puppet is currently enabled, last run 49 seconds ago with 0 failures
[13:16:55] hmm, maybe I shouldn't have <%- but only -%>
[13:17:07] valhallasw: I don't want puppet to diff all the hosts :)
[13:17:14] (03CR) 10Merlijn van Deen: [C: 031] ssh: Allow temporary opt out from more secure ssh [puppet] - 10https://gerrit.wikimedia.org/r/212909 (owner: 10Yuvipanda)
[13:17:18] ?
[13:17:32] valhallasw: as in, I don't want it to insert an empty line on all prod hosts
[13:17:38] ahhh
[13:28:37] (03PS1) 10Andrew Bogott: Resurect the old ceph module [puppet] - 10https://gerrit.wikimedia.org/r/212914
[13:31:48] (03CR) 10Giuseppe Lavagetto: [C: 031] "I don't like this at all, but as long as it's temporary, it's ok-ish." [puppet] - 10https://gerrit.wikimedia.org/r/212909 (owner: 10Yuvipanda)
[13:32:06] (03PS3) 10Yuvipanda: ssh: Allow temporary opt out from more secure ssh [puppet] - 10https://gerrit.wikimedia.org/r/212909
[13:33:19] 6operations, 10Wikimedia-DNS, 10Wikimedia-Video: Please set up a CNAME for videoserver.wikimedia.org to Video Editing Server - https://phabricator.wikimedia.org/T99216#1304942 (10MarkTraceur) This sounds like a good plan to me - it's the least we can do to support our video creators on Commons... Are there...
[13:33:28] (03CR) 10Yuvipanda: [C: 032] ssh: Allow temporary opt out from more secure ssh [puppet] - 10https://gerrit.wikimedia.org/r/212909 (owner: 10Yuvipanda)
[14:15:52] (03PS2) 10Andrew Bogott: Resurect the old ceph module [puppet] - 10https://gerrit.wikimedia.org/r/212914
[14:20:43] 10Ops-Access-Requests, 6operations: Shell and research access for Moushira Elamrawy - https://phabricator.wikimedia.org/T100091#1305120 (10ori) 3NEW
[14:21:34] 10Ops-Access-Requests, 6operations: Shell and research access for Moushira Elamrawy - https://phabricator.wikimedia.org/T100091#1305137 (10ori)
[14:23:45] 10Ops-Access-Requests, 6operations: Shell and research access for Moushira Elamrawy - https://phabricator.wikimedia.org/T100091#1305144 (10ori)
[14:38:51] (03PS3) 10Ori.livneh: Resurrect the old ceph module [puppet] - 10https://gerrit.wikimedia.org/r/212914 (owner: 10Andrew Bogott)
[14:45:36] (03PS4) 10Andrew Bogott: Resurect the old ceph module [puppet] - 10https://gerrit.wikimedia.org/r/212914
[14:45:38] (03PS1) 10Andrew Bogott: Revert "Remove role::ceph::*, unused now" [puppet] - 10https://gerrit.wikimedia.org/r/212938
[14:50:06] (03PS5) 10Ori.livneh: Resurrect the old ceph module [puppet] - 10https://gerrit.wikimedia.org/r/212914 (owner: 10Andrew Bogott)
[14:55:35] PROBLEM - puppet last run on ms-fe3001 is CRITICAL puppet fail
[14:57:35] PROBLEM - puppet last run on labvirt1008 is CRITICAL Puppet has 1 failures
[14:59:09] _joe_: https://github.com/gogits/gogs over the long term maybe :)
[15:12:44] RECOVERY - puppet last run on labvirt1008 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[15:14:05] RECOVERY - puppet last run on ms-fe3001 is OK Puppet is currently enabled, last run 54 seconds ago with 0 failures
[15:14:15] valhallasw: where are you?
[15:17:42] (03CR) 10Alex Monk: "It won't merge because we've updated the file being removed here since the parent commit was merged into master. I don't know if we even n" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/188388 (https://phabricator.wikimedia.org/T75905) (owner: 10Reedy)
[15:20:06] (03PS2) 10Alex Monk: Don't commit interwiki.cdb anymore [mediawiki-config] - 10https://gerrit.wikimedia.org/r/188388 (https://phabricator.wikimedia.org/T75905) (owner: 10Reedy)
[15:25:20] (03PS1) 10Ori.livneh: Add moushira to bastion-only and researchers. [puppet] - 10https://gerrit.wikimedia.org/r/212946 (https://phabricator.wikimedia.org/T100091)
[15:38:00] (03CR) 10Moushira: [C: 031] "Yes, thats my key!" [puppet] - 10https://gerrit.wikimedia.org/r/212946 (https://phabricator.wikimedia.org/T100091) (owner: 10Ori.livneh)
[15:55:34] PROBLEM - puppet last run on cp3049 is CRITICAL puppet fail
[15:59:55] PROBLEM - puppet last run on sca1001 is CRITICAL Puppet has 81 failures
[16:00:07] _joe_: boom
[16:00:21] _joe_: do dig -t srv @marathon-master-01.eqiad.wmflabs 1 _tool-hello._tcp.marathon.mesos
[16:00:24] :D
[16:07:19] _joe_: and it does multiples well as well - two instances running returns two SRV records
[16:07:23] * yuvipanda likes this
[16:11:25] (03PS1) 10Gergő Tisza: Remove .pyc files [software/sentry] - 10https://gerrit.wikimedia.org/r/212958
[16:11:46] (03CR) 10Ori.livneh: [C: 032 V: 032] Remove .pyc files [software/sentry] - 10https://gerrit.wikimedia.org/r/212958 (owner: 10Gergő Tisza)
[16:14:05] RECOVERY - puppet last run on cp3049 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[16:16:44] RECOVERY - puppet last run on sca1001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[16:26:15] PROBLEM - puppet last run on db2005 is CRITICAL puppet fail
[16:45:04] RECOVERY - puppet last run on db2005 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[16:57:12] _joe_: marathon has built in support for rolling deploys, and you can tweak it to see if you want it to be 2 stage or fully 'rolling'
[16:57:59] _joe_: so we can restart by basically doing a PUT with a new docker image, and it takes care of everything else by itself :)
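The Marathon rolling-deploy flow described at [16:57] above boils down to PUTting an updated app definition to the Marathon master. A hedged sketch of the JSON body one might send to `PUT /v2/apps/<id>`; the app id, image name, and `upgradeStrategy` value here are invented for illustration, not taken from the actual labs setup:

```python
import json

# Sketch of a Marathon rolling-deploy payload (PUT /v2/apps/<id>).
# App id, image, and health-capacity value are made-up examples.
def rolling_update(app_id, image, instances=2, min_healthy=0.5):
    """Build (path, JSON body) for a Marathon app update.

    minimumHealthCapacity=0.5 keeps at least half the instances
    running while the rest are replaced; 1.0 would force Marathon to
    start the new instances before stopping any old ones (two-stage)."""
    app = {
        "id": app_id,
        "instances": instances,
        "container": {
            "type": "DOCKER",
            "docker": {"image": image, "network": "BRIDGE"},
        },
        "upgradeStrategy": {"minimumHealthCapacity": min_healthy},
    }
    return "/v2/apps" + app_id, json.dumps(app)

path, body = rolling_update("/tool-hello", "registry.example/tool-hello:v2")
```

The "2 stage or fully 'rolling'" knob mentioned in the log corresponds to the `upgradeStrategy` field: Marathon itself schedules the instance replacements once the new definition is accepted.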
[20:05:31] and ugh, "the Wikipedia developers"
[20:06:14] and then of course there is the obligatory no-js person
[20:06:24] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.69% of data above the critical threshold [500.0]
[20:07:55] meh
[20:08:09] legoktm: thanks
[20:11:52] (03PS1) 10Yuvipanda: dynamicproxy: Add redundanturl dynamicproxy [puppet] - 10https://gerrit.wikimedia.org/r/212997
[20:13:24] (03PS2) 10Yuvipanda: dynamicproxy: Add redundanturl dynamicproxy [puppet] - 10https://gerrit.wikimedia.org/r/212997
[20:16:25] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
[20:36:45] RECOVERY - puppet last run on sca1001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[20:40:55] (03PS3) 10Yuvipanda: dynamicproxy: Add redundanturl dynamicproxy [puppet] - 10https://gerrit.wikimedia.org/r/212997
[21:08:05] PROBLEM - carbon-cache write error on graphite1001 is CRITICAL 22.22% of data above the critical threshold [8.0]
[21:10:34] 6operations, 10Wikimedia-Git-or-Gerrit: git.wikimedia.org replication from gerrit stopped or lags - https://phabricator.wikimedia.org/T99990#1305760 (10Paladox)
[21:10:53] 6operations, 10Wikimedia-Git-or-Gerrit: git.wikimedia.org replication from gerrit stopped or lags - https://phabricator.wikimedia.org/T99990#1303220 (10Paladox)
[21:11:11] 6operations, 10Wikimedia-Git-or-Gerrit: git.wikimedia.org replication from gerrit stopped or lags - https://phabricator.wikimedia.org/T99990#1305766 (10Paladox) 5duplicate>3Open
[21:11:57] 6operations, 10Wikimedia-Git-or-Gerrit: git.wikimedia.org replication from gerrit stopped or lags - https://phabricator.wikimedia.org/T99990#1303220 (10Paladox) Hi I had this patch https://gerrit.wikimedia.org/r/#/c/212813/ review and +2 for code reviewed and it said it was successfully merged but looking on g...
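On the gzip-over-TLS point raised in the wikitech-l page-weight thread above ([19:51]): whether responses are compressed is easy to verify from the Content-Encoding header, and the size difference on repetitive markup is large. A toy local illustration of that difference, using synthetic HTML rather than any real page:

```python
import gzip

# Toy illustration of what gzip saves on repetitive markup.
# The sample HTML below is synthetic, not a real Wikipedia page.
html = ('<li class="interwiki"><a href="#">link</a></li>\n' * 500).encode()
compressed = gzip.compress(html)
ratio = len(compressed) / len(html)
```

Real pages compress less dramatically than this repeated fragment, but the point stands: a transport path that silently drops Content-Encoding multiplies page weight severalfold.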
[21:13:55] 6operations, 10Wikimedia-Git-or-Gerrit: git.wikimedia.org replication from gerrit stopped or lags - https://phabricator.wikimedia.org/T99990#1305769 (10Paladox) p:5Triage>3Unbreak!
[21:14:19] 6operations, 10Wikimedia-Git-or-Gerrit: git.wikimedia.org replication from gerrit stopped or lags - https://phabricator.wikimedia.org/T99990#1303220 (10Paladox) Since gerrit has stoped replicating into gitblit status should be unbreak now.
[21:29:57] (03PS1) 10Aaron Schulz: Fixed totally broken runner JSON response code [mediawiki-config] - 10https://gerrit.wikimedia.org/r/213010
[21:30:05] ori ^
[21:30:20] next swat would be nice
[21:41:37] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL 1.69% of data above the critical threshold [1000.0]
[21:43:21] expected ^ xhprof
[21:44:20] any idea why I can't properly create/upload protect https://commons.wikimedia.org/w/index.php?title=File:Ssss.jpg ?
[21:44:56] it protects as normal, but when reloading the page after leaving the page protection dialog, the protection vanishes
[21:47:47] https://commons.wikimedia.org/wiki/File:Ssss.jpg?action=edit gives 'This title has been protected from creation by Nick. The reason given is "Protection against re-creation (non-descriptive file name)".' to me
[21:53:01] OK, will try and upload with a non admin account.
[23:30:30] !log ori Synchronized php-1.26wmf7/extensions/Gadgets: b592efa5fe: Update Gadgets for I6da3eede0: Conversion to using WAN cache (duration: 00m 13s)
[23:30:37] Logged the message, Master
[23:37:14] PROBLEM - puppet last run on mw2007 is CRITICAL puppet fail
[23:52:12] (03PS1) 10Yuvipanda: mesos: Setup marathon properly [puppet] - 10https://gerrit.wikimedia.org/r/213189
[23:55:44] RECOVERY - puppet last run on mw2007 is OK Puppet is currently enabled, last run 16 seconds ago with 0 failures
[23:56:45] PROBLEM - Kafka Broker Messages In Per Second on graphite1001 is CRITICAL Anomaly detected: 0 data above and 45 below the confidence bounds