[00:00:04] twentyafterfour: Respected human, time to deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150806T0000). Please do the needful.
[00:02:41] PROBLEM - Kafka Broker Server on analytics1020 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args kafka.Kafka /etc/kafka/server.properties
[00:02:49] (CR) JanZerebecki: [C: -1] "needs another security related change before this goes live: see T107602#1512318" [puppet] - https://gerrit.wikimedia.org/r/229392 (https://phabricator.wikimedia.org/T107602) (owner: Giuseppe Lavagetto)
[00:03:24] that is a page i pay attention to!
[00:03:26] whaaa?
[00:03:29] heh
[00:03:32] PROBLEM - Kafka Broker Server on analytics1014 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args kafka.Kafka /etc/kafka/server.properties
[00:03:33] was about to ask you about it
[00:03:51] PROBLEM - Kafka Broker Server on analytics1013 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args kafka.Kafka /etc/kafka/server.properties
[00:03:54] uhhhhh
[00:04:01] !log restarted restbase old-render clean-up scripts on wikipedia html and data-parsoid
[00:04:03] lol
[00:04:04] uhhhhhh
[00:04:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:04:08] so our pagerduty trial expired
[00:04:10] they renewed it
[00:04:13] i'll disable it now.
[00:04:38] well... maybe i should leave it until we use the alternative starting tomorrow... i dunno
[00:04:42] what the crap
[00:04:47] did puppe tjust auto upgrade kafka?!?!!
[00:04:51] operations, Traffic, Wikimedia-DNS: DNS request for wikimedia.org (let 3rd party send mail as wikimedia.org) - https://phabricator.wikimedia.org/T107940#1513292 (BBlack) >>! In T107940#1512972, @CCogdill_WMF wrote: >> The fact that T107977 (local creation/routing of anna@benefactorevents.wikimedia.org...
[00:04:54] you have have it set to latest?
[00:04:57] what the crap crap
[00:04:57] no!
[00:05:12] that sucks
[00:05:28] I assume I can ignore the analytics pages since otto's here typing
[00:05:40] you assume correctly but what the crap
[00:06:14] OH, phew
[00:06:16] phewwwwwww
[00:06:51] ok phew.
[00:06:56] this is ok.
[00:06:57] hahah
[00:07:10] those alerts are for the new brokers I added but then removed.
[00:07:19] i think the downtime I scheudled just ran out.
[00:07:32] and puppet hasn't run on them, so icinga still thinks they are kafka brokers
[00:07:36] * Jamesofur assumes G
[00:07:43] (sorry)
[00:07:51] Anyone understand the ssl config proxies?
[00:08:27] Mine seems to have been working just fine yesterday (I had to re set it up after a HD failure) but now I can only get on to the bastion (Hooft) and am getting errored out when I try to get to another server
[00:10:02] RECOVERY - puppet last run on analytics1014 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[00:10:12] Jamesofur, SSH config?
[00:10:15] Jamesofur: we have disabled ssh agent forwarding on the bastions. You need to setup your ssh to proxy now instead -- https://wikitech.wikimedia.org/wiki/Help:Access#Accessing_instances_with_ProxyCommand_ssh_option_.28recommended.29
[00:10:26] (PS1) RobH: disable pagerduty sms [puppet] - https://gerrit.wikimedia.org/r/229604
[00:10:28] bd808: I'm using the proxy
[00:10:30] robh: if we get a new paging service, can we get one that pages from the same number?
[00:10:32] hehe
[00:10:38] ottomata: pagerduty isnt?
[00:10:45] i dunno, whatever pages me now
[00:10:47] looks like
[00:10:52] Krenair: yes sorry
[00:10:57] my issue with pagerduty sms was the formatting is horrible
[00:10:58] fork failed: Resource temporarily unavailable
[00:10:58] ssh_exchange_identification: Connection closed by remote host (x 50 or so of these)
[00:11:00] 954-9973-1
[00:11:02] and we dont want an ack system
[00:11:03] 973-0
[00:11:11] so yes, we dont want rotating number
[00:11:15] yeah, k
[00:11:15] cool
[00:11:20] the one i signed up for today to test will read wikimedia ;D
[00:11:28] cool
[00:11:30] i like :)
[00:11:32] much like smsglobal used to years ago
[00:11:36] before they fucked their own system
[00:11:48] Jamesofur: that sounds like you've got an ssh forwarding/proxying setup that loops into itself until it dies from resource exhaustion
[00:12:01] Jamesofur, paste your ssh config somewhere?
[00:12:09] try adding an exclusion to the Host line for whatever you're forwarding through
[00:12:10] (CR) RobH: [C: 2] disable pagerduty sms [puppet] - https://gerrit.wikimedia.org/r/229604 (owner: RobH)
[00:12:12] yup, sec
[00:12:19] cool, ok, fixed the pages from those nodes. I'm probably going to reinstall them again on monday anyway.
[00:12:23] as in:
[00:12:34] ok byYYeyeyey
[00:12:36] Host *.esams.wmnet !hooft.esams.wikimedia.org
[00:12:40] ProxyCommand ssh -4 -W %h:%p hooft.esams.wikimedia.org
[00:12:58] (don't use those lines, but I'm just saying, there's an example of excluding hooft when proxying through hooft)
[00:13:22] ottomata: you may have to use that "puppetstoredconfig" script on the puppetmaster to make Icinga forget about (old) hosts
[00:13:24] what's the point of the !hooft when you don't include *.esams.wikimedia.org anyway?
[00:13:37] Krenair: historical cruft in my ssh config :)
[00:13:40] :)
[00:13:49] that line used to do *.esams.wikimedia.org too
[00:14:05] operations, Traffic, Wikimedia-DNS: DNS request for wikimedia.org (let 3rd party send mail as wikimedia.org) - https://phabricator.wikimedia.org/T107940#1513342 (CCogdill_WMF) Thanks again for explaining, @BBlack. I am by no means an expert in understanding these complicated DNS rules so this really h...
[00:14:24] hooft is the bastion with ops+deployment+restricted-only access
[00:14:35] Jamesofur is restricted and probably wants to use it to access eqiad hosts
[00:14:51] * Jamesofur nods
[00:14:52] https://phabricator.wikimedia.org/P1840
[00:15:08] bblack: I misread the above phab comment and thought you said you weren't an expert in DNS haha
[00:15:47] mutante: aye, ja i know, thanks, they are still alive at the moment, i had just not let puppet run to have it remove the puppetized kafka alerts
[00:15:48] well, Krenair it worked, I assume *.wikimedia.org is sucking up hooft
[00:15:53] yep
[00:15:57] they are installed, just not doing anything right now
[00:16:00] operations, Traffic, Wikimedia-DNS: DNS request for wikimedia.org (let 3rd party send mail as wikimedia.org) - https://phabricator.wikimedia.org/T107940#1513353 (BBlack) Ok, thanks. Sorry for all the poor grammar and general vitriol, it's been a long day :)
[00:16:05] here's a patch to fix the hooft issue: https://gerrit.wikimedia.org/r/#/c/222522/
[00:16:21] it'll match hooft.esams.wikimedia.org
[00:16:31] Now if we were actually talking about SSL, it wouldn't match :)
[00:16:52] * Jamesofur adds that to the example on https://wikitech.wikimedia.org/wiki/SSH_access#Production which doesn't exclude the bastion in the 2nd command
[00:17:19] ok I'm gonna take phabricator down for upgrade
[00:17:46] .... what the heck am i gonna do then?
[00:17:46] heh
[00:17:54] oh wait its past 5 i guess i can stop working.....
[00:18:03] watch TV!
[00:18:11] via the internet!
[00:18:45] I just discovered (I'm a cable-cord-cutter, but I don't mind paying for alternate IP-based commercial services to get what I need!) that GoT S5 is available through Sling.com on Roku's
[00:19:13] you basically have to sign up for their normal $20/mo service, and then another +$15/mo for HBO, and then you can legally stream what used to only be available via HBO Go
[00:19:14] my mom has comcast, and i have her login.
[00:19:24] Jamesofur, the first block should handle disabling ProxyCommand for bast1001, I think?
[00:19:29] before that i used brion's but then he went and moved!
[00:19:33] I don't remember what ControlMaster meant
[00:19:44] (my xbox 360 used to work on his comcast account from when we were roomies)
[00:19:44] it sounds important
[00:20:16] so basically, signing up for a month of Sling is about the same cost to me as buying GoT S5 on Amazon Video or similar, but I don't have to wait until Aug 31 to stream (legally!)
[00:20:35] Krenair: I thought it would too... but not exactly sure what to add to ensure that happens
[00:20:46] ControlMaster ensures that all of the TCP connections go down one path
[00:20:49] faster/less authentication
[00:21:23] * Jamesofur knows this only because he read a couple articles last night when asking himself the same question
[00:22:53] Krenair: oh you know what, I'll bet you it's because i use a host alias for hooft
[00:23:04] bblack: nice. HBO launched that no-cable streaming service the week GoT debuted this season
[00:23:05] and it matches on host not hostName to exclude
[00:23:23] that's probably why it worked earlier, I don't think I set it up with the alias originally
[00:24:00] I didn't end up using it because my cable carrier dropped the monthly price of HBO below the streaming service cost
[00:27:06] Jamesofur, yep, that'd do it
[00:27:38] clearly that was originally intended to exclude proxycommand for hooft and then got changed to the host alias which broke everything
[00:29:49] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 10 data above and 0 below the confidence bounds
[00:29:53] (PS1) Alex Monk: beta: Delete ee_prototypewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/229608 (https://phabricator.wikimedia.org/T107397)
[00:30:00] ew, what?
[00:30:03] git showed that as an HTTP 500
[00:30:17] Phabricator is also down
[00:30:59] Oh:
[00:31:00] [17:17] twentyafterfour ok I'm gonna take phabricator down for upgrade
[00:31:07] But it's not on the deployments calendar :S
[00:31:23] nor on SAL
[00:31:48] RoanKattouw, it is on the calendar
[00:31:52] !log ok I'm gonna take phabricator down for upgrade
[00:31:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Mr. Obvious
[00:32:00] Ooooh
[00:32:02] Under Thursday
[00:32:04] Of course
[00:32:20] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet).
[00:32:43] Select all 289,423 conversations in "Monitor/Cron"
[00:33:13] google wants storage back:)
[00:33:19] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet).
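The ssh troubleshooting thread above (the self-proxying loop, the `!hooft` exclusion, the host-alias gotcha, and ControlMaster) condenses into a `~/.ssh/config` sketch like the following. This is illustrative only, not the exact config anyone in the channel was running: the hostnames come from the discussion, while the `Host *.wikimedia.org` pattern and the ControlPath/ControlPersist values are assumptions for the example.

```
# Sketch of the bastion setup discussed above (illustrative, not anyone's
# actual config). Key pitfall from the thread: ssh matches Host patterns
# against the name you typed on the command line -- including an alias, if
# you use one -- NOT against the resolved HostName, so connecting via an
# alias for hooft silently bypasses the !hooft exclusion below.

# Proxy hosts behind the bastion through the bastion, but exclude the
# bastion itself. Without the exclusion, ssh tries to reach hooft via
# hooft via hooft... until it dies of resource exhaustion ("fork failed:
# Resource temporarily unavailable", as in the pasted errors).
Host *.wikimedia.org !hooft.esams.wikimedia.org
    ProxyCommand ssh -4 -W %h:%p hooft.esams.wikimedia.org

# ControlMaster multiplexes later sessions over one authenticated TCP
# connection: faster connects, and you only authenticate once.
Host *
    ControlMaster auto
    ControlPath ~/.ssh/control-%r@%h:%p
    ControlPersist 5m
```

`ProxyCommand ssh -W` tunnels the connection through the bastion without ssh agent forwarding, which is why it replaced the agent forwarding that was disabled on the bastions.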
[00:33:31] !log es1.7.1 upgrade on elastic1017
[00:33:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:33:53] robh: ^ that's the pagerduty change
[00:34:04] should it be disabled?
[00:34:26] !log phabricator upgrade complete
[00:34:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:34:38] sorry
[00:34:41] i had it mid change
[00:34:43] its mered
[00:34:46] Where is the phabricator upgrade listed as thursday?
[00:35:20] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge.
[00:35:24] twentyafterfour: it's Thursday now :)
[00:35:37] (PS2) Alex Monk: beta: delete ee_prototypewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/229608 (https://phabricator.wikimedia.org/T107397)
[00:35:44] Actually, I wonder what deleted.dblist really does
[00:35:57] twentyafterfour, deployment calendar, because it is thursday
[00:36:29] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
[00:36:58] Why did the gerrit bot miss https://phabricator.wikimedia.org/T107397 ?
[00:37:20] sometimes it just misses random ones
[00:38:39] Krenair: thursday, UTC
[00:38:41] ?
[00:38:43] yes
[00:39:39] that's now
[00:39:46] midnight is at like 5pm
[00:40:30] (PS1) Ori.livneh: Use external diff, now that lightprocess is enabled [mediawiki-config] - https://gerrit.wikimedia.org/r/229609
[00:40:45] (PS2) Ori.livneh: Use external diff, now that lightprocess is enabled [mediawiki-config] - https://gerrit.wikimedia.org/r/229609
[00:40:50] (CR) Ori.livneh: [C: 2] Use external diff, now that lightprocess is enabled [mediawiki-config] - https://gerrit.wikimedia.org/r/229609 (owner: Ori.livneh)
[00:40:56] (Merged) jenkins-bot: Use external diff, now that lightprocess is enabled [mediawiki-config] - https://gerrit.wikimedia.org/r/229609 (owner: Ori.livneh)
[00:42:09] (CR) MZMcBride: "This is related to . I came here to update the commit message, but I was too late. Oh well." [puppet] - https://gerrit.wikimedia.org/r/229219 (owner: GWicke)
[00:42:28] !log ori Synchronized wmf-config/CommonSettings.php: (no message) (duration: 00m 12s)
[00:42:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:42:58] jouncebot had the time right, as does https://phabricator.wikimedia.org/calendar/
[00:44:21] operations, RESTBase, Services, Traffic, Patch-For-Review: Provide an API listing at /api/ - https://phabricator.wikimedia.org/T107086#1513401 (MZMcBride) The initial deployment only worked on the www domains. With now merged, /api/ should start working in...
[00:49:00] !reset password for User:Tonval after identify verification
[00:49:05] well done
[00:49:12] !log reset password for User:Tonval after identify verification
[00:49:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:50:40] expected morebots to quit after that :)
[00:51:14] lol
[00:51:32] Jamesofur
[00:51:34] !
[00:51:40] :)
[00:55:25] (CR) JanZerebecki: ""Note that the ::labs suffix shouldn't really be necessary now that we have hiera - just define a parameter for the role class for each fu" [puppet] - https://gerrit.wikimedia.org/r/227466 (https://phabricator.wikimedia.org/T101235) (owner: WMDE-leszek)
[01:01:07] RECOVERY - RAID on db1059 is OK optimal, 1 logical, 2 physical
[01:05:40] (CR) Dzahn: "+1 to Jan, if we can avoid the "labs" part and just use hiera that's better" [puppet] - https://gerrit.wikimedia.org/r/227466 (https://phabricator.wikimedia.org/T101235) (owner: WMDE-leszek)
[01:07:36] operations, RESTBase, Services, Traffic, Patch-For-Review: Provide an API listing at /api/ - https://phabricator.wikimedia.org/T107086#1513433 (MZMcBride) and friends are now working properly. I think this task can be marked resolved/fixed. Regarding Beta Labs su...
[01:26:06] PROBLEM - puppet last run on mw2058 is CRITICAL puppet fail
[01:26:50] operations, RESTBase, Services, Traffic, Patch-For-Review: Provide an API listing at /api/ - https://phabricator.wikimedia.org/T107086#1513445 (GWicke) Open>Resolved @mzmcbride, for now the intention is just to be provide a friendly HTML entry point, which isn't critical and shouldn't need...
[01:33:58] !log es1.7.1 upgrade on elastic1018
[01:34:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:52:27] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK No anomaly detected
[01:54:46] RECOVERY - puppet last run on mw2058 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[02:07:43] (PS1) Gergő Tisza: Enable authmetrics logging on group0 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/229618 (https://phabricator.wikimedia.org/T91701)
[02:16:06] !log es1.7.1 upgrade on elastic1019
[02:16:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:38:13] !log l10nupdate Synchronized php-1.26wmf16/cache/l10n: l10nupdate for 1.26wmf16 (duration: 10m 42s)
[02:38:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:44:40] !log l10nupdate@tin LocalisationUpdate completed (1.26wmf16) at 2015-08-06 02:44:39+00:00
[02:44:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:01:40] operations, HHVM, Tracking: Complete the use of HHVM over Zend PHP on the Wikimedia cluster (tracking) - https://phabricator.wikimedia.org/T86081#1513499 (JeroenDeDauw) Are these production machines that run MediaWiki? I'd like to know if they are relevant for extension development, or just for certain...
[03:05:58] PROBLEM - Restbase endpoints health on xenon is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=127.0.0.1, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused)))
[03:06:28] PROBLEM - Restbase root url on xenon is CRITICAL: Connection refused
[03:12:06] !log l10nupdate Synchronized php-1.26wmf17/cache/l10n: l10nupdate for 1.26wmf17 (duration: 10m 32s)
[03:12:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:18:28] !log l10nupdate@tin LocalisationUpdate completed (1.26wmf17) at 2015-08-06 03:18:27+00:00
[03:18:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:21:16] PROBLEM - puppet last run on ruthenium is CRITICAL puppet fail
[03:25:30] !log es1.7.1 upgrade on elastic1020
[03:25:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:47:36] RECOVERY - puppet last run on ruthenium is OK Puppet is currently enabled, last run 45 seconds ago with 0 failures
[03:57:29] !log krinkle Synchronized php-1.26wmf17/resources/src/mediawiki/mediawiki.js: T108124 (duration: 00m 12s)
[03:57:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:58:00] !log krinkle Synchronized php-1.26wmf17/includes: T108124 (duration: 00m 17s)
[03:58:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[04:04:30] !log krinkle Synchronized php-1.26wmf17/resources/src/mediawiki.legacy/wikibits.js: T108139 (duration: 00m 12s)
[04:04:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[04:05:27] PROBLEM - Disk space on hafnium is CRITICAL: DISK CRITICAL - free space: / 355 MB (3% inode=70%)
[04:07:27] RECOVERY - Disk space on hafnium is OK: DISK OK
[04:10:18] PROBLEM - are wikitech and wt-static in sync on silver is CRITICAL: wikitech-static CRIT - wikitech and wikitech-static out of sync (100299s 100000s)
[04:28:27] RECOVERY - are wikitech and wt-static in sync on silver is OK: wikitech-static OK - wikitech and wikitech-static in sync (11717 100000s)
[04:42:34] !log krinkle Synchronized php-1.26wmf17/resources/src/mediawiki/mediawiki.js: I885c36398 (duration: 00m 12s)
[04:42:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[04:45:37] PROBLEM - puppet last run on eventlog1001 is CRITICAL puppet fail
[04:48:15] !log es1.7.1 upgrade on elastic1021
[04:48:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[05:02:11] !log krinkle Synchronized php-1.26wmf17/includes/OutputPage.php: I885c36398 (duration: 00m 12s)
[05:02:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[05:13:47] RECOVERY - puppet last run on eventlog1001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[05:32:12] !log krinkle Synchronized php-1.26wmf17/resources/src/mediawiki/mediawiki.js: touch (duration: 00m 13s)
[05:32:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[05:39:36] !log krinkle Synchronized php-1.26wmf17/resources/src/mediawiki/mediawiki.js: touch (duration: 00m 12s)
[05:39:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[05:55:17] !log krinkle Synchronized php-1.26wmf17/includes/resourceloader/ResourceLoaderModule.php: Ib4371255fe (duration: 00m 13s)
[05:55:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[05:57:27] !log krinkle Synchronized php-1.26wmf16/includes/resourceloader/ResourceLoaderModule.php: Ib4371255fe (duration: 00m 12s)
[05:57:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[06:16:48] !log es1.7.1: restart elastic1022
[06:16:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[06:19:57] PROBLEM - puppet last run on mw2196 is CRITICAL puppet fail
[06:31:08] PROBLEM - puppet last run on mc2015 is CRITICAL Puppet has 1 failures
[06:31:37] PROBLEM - puppet last run on cp1068 is CRITICAL Puppet has 1 failures
[06:31:57] PROBLEM - puppet last run on mw2045 is CRITICAL Puppet has 1 failures
[06:32:06] PROBLEM - puppet last run on holmium is CRITICAL Puppet has 1 failures
[06:32:16] PROBLEM - puppet last run on cp3008 is CRITICAL Puppet has 1 failures
[06:33:36] PROBLEM - puppet last run on mw1110 is CRITICAL Puppet has 4 failures
[06:34:28] PROBLEM - puppet last run on mw2129 is CRITICAL Puppet has 3 failures
[06:52:16] !log restart HHVM on canary API servers (mw1114-mw1119) for libtiny/PCRE security updates
[06:52:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[06:55:36] RECOVERY - puppet last run on cp1068 is OK Puppet is currently enabled, last run 55 seconds ago with 0 failures
[06:56:06] RECOVERY - puppet last run on holmium is OK Puppet is currently enabled, last run 43 seconds ago with 0 failures
[06:57:08] RECOVERY - puppet last run on mc2015 is OK Puppet is currently enabled, last run 41 seconds ago with 0 failures
[06:57:36] RECOVERY - puppet last run on mw1110 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:57] RECOVERY - puppet last run on mw2045 is OK Puppet is currently enabled, last run 24 seconds ago with 0 failures
[06:58:17] RECOVERY - puppet last run on cp3008 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:37] RECOVERY - puppet last run on mw2129 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:02:42] (PS1) Jcrespo: Repool db1056, Depool db1068 [mediawiki-config] - https://gerrit.wikimedia.org/r/229649
[07:05:01] (CR) Jcrespo: [C: 2] Repool db1056, Depool db1068 [mediawiki-config] - https://gerrit.wikimedia.org/r/229649 (owner: Jcrespo)
[07:06:59] !log jynus Synchronized wmf-config/db-eqiad.php: Repool db1056, Depool db1068 (duration: 00m 12s)
[07:07:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[07:12:35] lag on db1068: it is expected and the server is depooled
[07:18:27] RECOVERY - puppet last run on mw2196 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:34:35] !log es1.7.1: restart elastic1023
[07:34:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[07:40:06] (PS1) Yuvipanda: k8s: Open up flannel UDP port with ferm [puppet] - https://gerrit.wikimedia.org/r/229655
[07:40:16] (PS2) Yuvipanda: k8s: Open up flannel UDP port with ferm [puppet] - https://gerrit.wikimedia.org/r/229655
[07:40:24] (CR) Yuvipanda: [C: 2 V: 2] k8s: Open up flannel UDP port with ferm [puppet] - https://gerrit.wikimedia.org/r/229655 (owner: Yuvipanda)
[07:59:36] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet).
[08:20:24] !log l10nupdate@tin ResourceLoader cache refresh completed at Thu Aug 6 08:20:24 UTC 2015 (duration 20m 23s)
[08:20:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:31:07] ACKNOWLEDGEMENT - puppet last run on ms-be2009 is CRITICAL Puppet has 1 failures Filippo Giunchedi T107877
[08:32:16] RECOVERY - Restbase root url on xenon is OK: HTTP OK: HTTP/1.1 200 - 15149 bytes in 0.010 second response time
[08:32:57] PROBLEM - NTP on etcd1001 is CRITICAL: NTP CRITICAL: Offset 20.66204441 secs
[08:34:17] RECOVERY - Restbase endpoints health on xenon is OK: All endpoints are healthy
[08:35:17] PROBLEM - NTP on krypton is CRITICAL: NTP CRITICAL: Offset 8.822870731 secs
[08:38:11] (PS6) Giuseppe Lavagetto: wdqs: multiple fixes - Create rules.log as a symlink to a file in /var - drop cookies when proxying request [puppet] - https://gerrit.wikimedia.org/r/229150 (owner: Smalyshev)
[08:38:25] (PS7) Giuseppe Lavagetto: wdqs: multiple fixes - Create rules.log as a symlink to a file in /var - drop cookies when proxying request [puppet] - https://gerrit.wikimedia.org/r/229150 (owner: Smalyshev)
[08:38:54] (CR) Giuseppe Lavagetto: [C: 2] wdqs: multiple fixes - Create rules.log as a symlink to a file in /var - drop cookies when proxying request [puppet] - https://gerrit.wikimedia.org/r/229150 (owner: Smalyshev)
[08:39:51] (Abandoned) Giuseppe Lavagetto: drop cookies when proxying request [puppet] - https://gerrit.wikimedia.org/r/229194 (owner: Smalyshev)
[08:39:57] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 10 data above and 2 below the confidence bounds
[08:40:23] (PS2) Giuseppe Lavagetto: T103907: restrict further passed URLs [puppet] - https://gerrit.wikimedia.org/r/229584 (owner: Smalyshev)
[08:40:47] (CR) Giuseppe Lavagetto: [C: 2] T103907: restrict further passed URLs [puppet] - https://gerrit.wikimedia.org/r/229584 (owner: Smalyshev)
[08:50:16] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
[08:54:59] (PS1) Alexandros Kosiaris: ganeti/ferm: open the migration TCP port [puppet] - https://gerrit.wikimedia.org/r/229665
[08:58:07] (PS2) Alexandros Kosiaris: ganeti/ferm: open the migration TCP port [puppet] - https://gerrit.wikimedia.org/r/229665
[08:59:14] (PS3) Alexandros Kosiaris: ganeti/ferm: open the migration TCP port [puppet] - https://gerrit.wikimedia.org/r/229665
[08:59:37] (CR) Alexandros Kosiaris: [C: 2] ganeti/ferm: open the migration TCP port [puppet] - https://gerrit.wikimedia.org/r/229665 (owner: Alexandros Kosiaris)
[08:59:47] (PS4) Alexandros Kosiaris: ganeti/ferm: open the migration TCP port [puppet] - https://gerrit.wikimedia.org/r/229665
[09:00:05] (CR) Alexandros Kosiaris: [V: 2] ganeti/ferm: open the migration TCP port [puppet] - https://gerrit.wikimedia.org/r/229665 (owner: Alexandros Kosiaris)
[09:09:34] (PS1) Yuvipanda: k8s: Add class for running kubelets [puppet] - https://gerrit.wikimedia.org/r/229666
[09:09:45] (PS2) Yuvipanda: k8s: Add class for running kubelets [puppet] - https://gerrit.wikimedia.org/r/229666
[09:10:48] (CR) Yuvipanda: [C: 2] k8s: Add class for running kubelets [puppet] - https://gerrit.wikimedia.org/r/229666 (owner: Yuvipanda)
[09:12:49] (CR) Alexandros Kosiaris: [C: -2] "postgis is not a daemon. it does not listen on any TCP port. it is an addon for postgres. Plus this is a module class. Please put ferm rul" [puppet] - https://gerrit.wikimedia.org/r/226075 (owner: Muehlenhoff)
[09:16:00] (PS1) Jcrespo: Repool db1068 [mediawiki-config] - https://gerrit.wikimedia.org/r/229669
[09:16:30] (CR) Jcrespo: [C: 2] Repool db1068 [mediawiki-config] - https://gerrit.wikimedia.org/r/229669 (owner: Jcrespo)
[09:17:00] (PS1) Yuvipanda: k8s: Explicitly pass nameserver to ipresolve [puppet] - https://gerrit.wikimedia.org/r/229672
[09:18:10] !log jynus Synchronized wmf-config/db-eqiad.php: Repool db1068 (duration: 00m 13s)
[09:18:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:18:39] (CR) Yuvipanda: [C: 2] k8s: Explicitly pass nameserver to ipresolve [puppet] - https://gerrit.wikimedia.org/r/229672 (owner: Yuvipanda)
[09:20:26] (PS1) Yuvipanda: k8s: Fix typo [puppet] - https://gerrit.wikimedia.org/r/229673
[09:20:38] (CR) Yuvipanda: [C: 2 V: 2] k8s: Fix typo [puppet] - https://gerrit.wikimedia.org/r/229673 (owner: Yuvipanda)
[09:21:24] (PS2) Muehlenhoff: add ferm rules for postgres [puppet] - https://gerrit.wikimedia.org/r/226075
[09:23:59] (PS1) Yuvipanda: k8s: Make sure parent directories are created [puppet] - https://gerrit.wikimedia.org/r/229674
[09:25:09] (CR) Yuvipanda: [C: 2] k8s: Make sure parent directories are created [puppet] - https://gerrit.wikimedia.org/r/229674 (owner: Yuvipanda)
[09:28:37] (CR) DCausse: [C: -1] "We should wait for the elasticsearch 1.7.1 to complete before merging." [puppet] - https://gerrit.wikimedia.org/r/224651 (owner: Manybubbles)
[09:30:48] !log es1.7.1: restart elastic1024
[09:30:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:32:39] (PS2) Filippo Giunchedi: update cassandra-metrics-collector to latest [puppet] - https://gerrit.wikimedia.org/r/229401 (https://phabricator.wikimedia.org/T97024) (owner: Eevans)
[09:32:52] (CR) Filippo Giunchedi: [C: 2 V: 2] update cassandra-metrics-collector to latest [puppet] - https://gerrit.wikimedia.org/r/229401 (https://phabricator.wikimedia.org/T97024) (owner: Eevans)
[09:35:07] operations, vm-requests: eqiad: 1 VM request for mailman - https://phabricator.wikimedia.org/T108065#1513988 (akosiaris) Hello, well, I am fine running the process, but I did create the documentation so anyone can do it. For whatever reasons, I might be available for longer than wanted periods of time, so...
[09:35:53] (CR) Alexandros Kosiaris: [C: 1] "seems fine to me" [puppet] - https://gerrit.wikimedia.org/r/226075 (owner: Muehlenhoff)
[09:39:42] !log restarted HHVM on API apaches in eqiad for libtiny/PCRE security updates
[09:39:47] !log Applying schema change to Commons db master
[09:39:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:39:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:44:29] (CR) Alexandros Kosiaris: "replied inline" (5 comments) [debs/kafka] (debian) - https://gerrit.wikimedia.org/r/229193 (https://phabricator.wikimedia.org/T106581) (owner: Ottomata)
[09:46:18] operations, HHVM, Tracking: Complete the use of HHVM over Zend PHP on the Wikimedia cluster (tracking) - https://phabricator.wikimedia.org/T86081#1514006 (Krenair) >>! In T86081#1513499, @JeroenDeDauw wrote: > Are these production machines that run MediaWiki? I'd like to know if they are relevant for ex...
[09:47:40] 6operations, 6Commons, 10MediaWiki-File-management, 10MediaWiki-Tarball-Backports, and 7 others: InstantCommons broken by switch to HTTPS - https://phabricator.wikimedia.org/T102566#1514009 (10Tau) Mediawiki was installed by extracting tar file, following these instructions found here: [[ https://www.howto... [09:50:16] !log uploaded openjdk-8_8u66-b01-1~bpo8+1 to jessie-wikimedia and jessie-backports/debian.org [09:50:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:52:23] 6operations, 10RESTBase-Cassandra: Update JDK 8 package in backports repo - https://phabricator.wikimedia.org/T104887#1514030 (10MoritzMuehlenhoff) 5Open>3Resolved The latest release has been backported as openjdk-8_8u66-b01-1~bpo8+1 and was uploaded to jessie-wikimedia (also available in jessie-backports... [09:52:51] Commons ok, any slowdowns? [09:53:15] 6operations: Java 8 for Jessie - https://phabricator.wikimedia.org/T97406#1514032 (10MoritzMuehlenhoff) [09:53:18] 6operations, 10RESTBase-Cassandra: Update JDK 8 package in backports repo - https://phabricator.wikimedia.org/T104887#1514034 (10MoritzMuehlenhoff) [09:54:31] I see it ok [09:54:57] but ping me if you see something strange in commons in the next 4 hours [09:55:11] (03CR) 10Filippo Giunchedi: labs: new role::logstash::stashbot class (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/227175 (owner: 10BryanDavis) [09:56:57] PROBLEM - NTP on bromine is CRITICAL: NTP CRITICAL: Offset 24.60673773 secs [09:58:15] (03CR) 10Faidon Liambotis: [C: 031] Remove cache::bits role from bits-cluster hosts [puppet] - 10https://gerrit.wikimedia.org/r/228033 (https://phabricator.wikimedia.org/T95448) (owner: 10BBlack) [09:59:20] (03CR) 10Faidon Liambotis: [C: 031] Decom bits cluster varnish/lvs configuration [puppet] - 10https://gerrit.wikimedia.org/r/228034 (https://phabricator.wikimedia.org/T95448) (owner: 10BBlack) [10:01:49] (03CR) 10Faidon Liambotis: [C: 031] 
network::constants::all_networks(_lo)? via flatten() [puppet] - 10https://gerrit.wikimedia.org/r/228586 (owner: 10BBlack) [10:02:39] (03CR) 10Alexandros Kosiaris: [C: 031] network::constants::all_networks(_lo)? via flatten() [puppet] - 10https://gerrit.wikimedia.org/r/228586 (owner: 10BBlack) [10:04:36] (03PS1) 10Filippo Giunchedi: swift: suppress all output from swift-dispersion-stats [puppet] - 10https://gerrit.wikimedia.org/r/229677 (https://phabricator.wikimedia.org/T78762) [10:04:46] godog: heh that was fast :)) [10:05:10] (03CR) 10Faidon Liambotis: [C: 032] swift: suppress all output from swift-dispersion-stats [puppet] - 10https://gerrit.wikimedia.org/r/229677 (https://phabricator.wikimedia.org/T78762) (owner: 10Filippo Giunchedi) [10:05:29] (03PS2) 10Filippo Giunchedi: swift: suppress all output from swift-dispersion-stats [puppet] - 10https://gerrit.wikimedia.org/r/229677 (https://phabricator.wikimedia.org/T78762) [10:05:31] (03CR) 10Faidon Liambotis: "The package still ships a sysvinit script though, right? Is there a race here, e.g. before our own systemd unit gets shipped by puppet?" 
[puppet] - 10https://gerrit.wikimedia.org/r/228591 (owner: 10BBlack) [10:05:36] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] swift: suppress all output from swift-dispersion-stats [puppet] - 10https://gerrit.wikimedia.org/r/229677 (https://phabricator.wikimedia.org/T78762) (owner: 10Filippo Giunchedi) [10:05:57] paravoid: hehe yeah the related ticket isn't, but this is better than the cronspam [10:12:18] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK No anomaly detected [10:17:09] (03CR) 10Faidon Liambotis: [C: 031] VCL: use network::constants::all_networks_lo (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/229122 (owner: 10BBlack) [10:18:19] (03CR) 10Faidon Liambotis: [C: 031] restrict_access: move to common code for all backends [puppet] - 10https://gerrit.wikimedia.org/r/229121 (owner: 10BBlack) [10:22:39] 6operations, 6Discovery, 10Maps, 3Discovery-Maps-Sprint: Assign varnish memory-only role to maps servers - https://phabricator.wikimedia.org/T105076#1514080 (10Yurik) [10:27:55] 7Blocked-on-Operations, 6operations, 6Discovery, 10Maps, 3Discovery-Maps-Sprint: Add Redis to maps cluster - https://phabricator.wikimedia.org/T107813#1514086 (10Yurik) [10:29:18] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL 1.69% of data above the critical threshold [1000.0] [10:29:19] (03CR) 10Faidon Liambotis: [C: 04-1] rsyslog: add rsyslog::receiver to deprecate syslog-ng (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/229162 (https://phabricator.wikimedia.org/T107611) (owner: 10Filippo Giunchedi) [10:32:15] godog, moritzm: should https://gerrit.wikimedia.org/r/#/c/222999/ be -2ed/abandoned? 
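The swift-dispersion-stats change discussed above is the standard cron-hygiene pattern: cron mails anything a job prints, so a job whose failures are already covered by graphite alarms gets both streams redirected to /dev/null. A minimal sketch, where the chatty job is a hypothetical stand-in rather than the real script:

```shell
# Hypothetical stand-in for a chatty periodic job; the real change
# redirects the output of swift-dispersion-stats in its cron entry.
chatty_job() {
    echo "dispersion report: 99.2% objects found"
    echo "ERROR: disk sdb unreachable" >&2
}

# With both stdout and stderr sent to /dev/null, cron has nothing to
# mail; breakage is expected to surface via graphite alarms instead.
result=$(chatty_job >/dev/null 2>&1; echo "silenced")
echo "$result"   # prints: silenced
```

In a crontab the same shape is e.g. `swift-dispersion-stats >/dev/null 2>&1` (schedule illustrative).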
[10:33:09] !log es1.7.1: restart elastic1025 [10:33:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:36:09] 6operations, 10vm-requests, 7Pybal: codfw: 3 VM request for PyBal - https://phabricator.wikimedia.org/T107901#1514097 (10akosiaris) [10:36:22] (03CR) 10Muehlenhoff: [C: 04-2] "We'll backport ffmpeg 2.7 to jessie." [puppet] - 10https://gerrit.wikimedia.org/r/222999 (https://phabricator.wikimedia.org/T95002) (owner: 10Hashar) [10:36:32] paravoid: yep, I just followed up [10:36:53] danke :) [10:38:43] 6operations, 10vm-requests, 7Pybal: codfw: 3 VM request for PyBal - https://phabricator.wikimedia.org/T107901#1514102 (10akosiaris) OK, obviously a righteous goal. codfw sounds fine as well. Quick question though. Do we indeed need public IPs ? Given that we have many internal LVS services, I think we could... [10:42:27] RECOVERY - NTP on etcd1001 is OK: NTP OK: Offset -0.01091551781 secs [10:44:47] RECOVERY - NTP on krypton is OK: NTP OK: Offset -0.002485752106 secs [10:46:09] 6operations, 10vm-requests, 7Pybal: codfw: 3 VM request for PyBal - https://phabricator.wikimedia.org/T107901#1514118 (10Joe) I don't think we need public ips if not for ease of testing. 
[10:50:14] (03PS3) 10Muehlenhoff: add ferm rules for postgres [puppet] - 10https://gerrit.wikimedia.org/r/226075 [10:50:23] (03CR) 10Muehlenhoff: [C: 032 V: 032] add ferm rules for postgres [puppet] - 10https://gerrit.wikimedia.org/r/226075 (owner: 10Muehlenhoff) [10:54:17] (03PS1) 10Muehlenhoff: Enable base::firewall for labsdb1004 [puppet] - 10https://gerrit.wikimedia.org/r/229688 [10:57:12] (03CR) 10Giuseppe Lavagetto: [C: 031] "small comment but the python code LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/228329 (owner: 10coren) [11:01:35] (03PS1) 10Muehlenhoff: Add ferm rules for osm [puppet] - 10https://gerrit.wikimedia.org/r/229690 [11:12:19] (03PS3) 10Filippo Giunchedi: rsyslog: add rsyslog::receiver to deprecate syslog-ng [puppet] - 10https://gerrit.wikimedia.org/r/229162 (https://phabricator.wikimedia.org/T107611) [11:12:34] (03CR) 10Filippo Giunchedi: rsyslog: add rsyslog::receiver to deprecate syslog-ng (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/229162 (https://phabricator.wikimedia.org/T107611) (owner: 10Filippo Giunchedi) [11:16:37] PROBLEM - puppet last run on ms-be1018 is CRITICAL Puppet has 1 failures [11:21:49] (03CR) 10Filippo Giunchedi: "I'm assuming "others" means this comment, https://issues.apache.org/jira/browse/CASSANDRA-7486?focusedCommentId=14550721&page=com.atlassia" [puppet] - 10https://gerrit.wikimedia.org/r/227335 (https://phabricator.wikimedia.org/T106619) (owner: 10GWicke) [11:26:47] (03CR) 10Faidon Liambotis: [C: 032] "Nice work!" 
[puppet] - 10https://gerrit.wikimedia.org/r/229162 (https://phabricator.wikimedia.org/T107611) (owner: 10Filippo Giunchedi) [11:35:59] (03CR) 10Alexandros Kosiaris: [C: 031] Add ferm rules for osm [puppet] - 10https://gerrit.wikimedia.org/r/229690 (owner: 10Muehlenhoff) [11:38:37] RECOVERY - carbon-cache too many creates on graphite1001 is OK Less than 1.00% above the threshold [500.0] [11:43:37] RECOVERY - puppet last run on ms-be1018 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [11:46:04] 6operations, 10vm-requests, 7Pybal: codfw: 3 VM request for PyBal - https://phabricator.wikimedia.org/T107901#1514244 (10akosiaris) OK then, moving on with private IPs for the VMs. [11:52:09] (03PS1) 10Alexandros Kosiaris: Introduce pybal-test200{1,2,3}.codfw.wmnet [dns] - 10https://gerrit.wikimedia.org/r/229696 (https://phabricator.wikimedia.org/T107901) [11:52:24] (03CR) 10Muehlenhoff: [C: 032 V: 032] Add ferm rules for osm [puppet] - 10https://gerrit.wikimedia.org/r/229690 (owner: 10Muehlenhoff) [11:55:27] !log es1.7.1: restart elastic1026 [11:55:30] (03PS1) 10Muehlenhoff: Enable base::firewall for labsdb1005 [puppet] - 10https://gerrit.wikimedia.org/r/229697 [11:55:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:58:24] 6operations, 10vm-requests: eqiad: 1 VM request for mailman - https://phabricator.wikimedia.org/T108065#1514274 (10JohnLewis) a:5akosiaris>3RobH Thanks @akosiaris. Presumably since we'll be testing an import here we'd want to view mailman physically so when you do this and install it Rob, perhaps add it t... 
[11:59:12] (03PS1) 10Muehlenhoff: Enable base::firewall for labsdb1007 [puppet] - 10https://gerrit.wikimedia.org/r/229698 [12:03:23] (03PS2) 10Muehlenhoff: Enable base::firewall for labsdb1006 [puppet] - 10https://gerrit.wikimedia.org/r/229697 [12:03:33] (03PS1) 10Alexandros Kosiaris: Introduce scb100{1,2}.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/229700 (https://phabricator.wikimedia.org/T107287) [12:06:00] (03PS2) 10Alexandros Kosiaris: Introduce pybal-test200{1,2,3}.codfw.wmnet [dns] - 10https://gerrit.wikimedia.org/r/229696 (https://phabricator.wikimedia.org/T107901) [12:11:14] 6operations, 10vm-requests: eqiad: 1 VM request for mailman - https://phabricator.wikimedia.org/T108065#1514305 (10akosiaris) Not sure I understand what "view mailman physically" exactly means but I am gonna assume it means have the web interface publicly available in which case, a temporary `misc-web` rule is... [12:11:56] RECOVERY - NTP on bromine is OK: NTP OK: Offset 0.001547694206 secs [12:15:29] 6operations, 10vm-requests: eqiad: 1 VM request for mailman - https://phabricator.wikimedia.org/T108065#1514315 (10JohnLewis) @akosiaris yes, misc-web to view the interface is what I meant, perhaps not the best wording in my part :) The service is fully puppetised and has been for years to my knowledge. It re... [12:15:46] 6operations, 10vm-requests: eqiad: 1 VM request for mailman - https://phabricator.wikimedia.org/T108065#1514316 (10JohnLewis) p:5Triage>3Normal [12:32:08] 6operations, 10Deployment-Systems, 10RESTBase, 6Services, 5Patch-For-Review: [Discussion] Move restbase config to Ansible? 
- https://phabricator.wikimedia.org/T107532#1514323 (10mobrovac) [12:33:36] (03PS1) 10Muehlenhoff: Add ferm rules for jmxtrans/impala [puppet] - 10https://gerrit.wikimedia.org/r/229704 (https://phabricator.wikimedia.org/T83597) [12:35:47] (03PS1) 10Muehlenhoff: Enable base::firewall on analytics1026 [puppet] - 10https://gerrit.wikimedia.org/r/229705 [12:45:40] (03PS1) 10Muehlenhoff: Add ferm rules for jmxtrans/hadoop workers [puppet] - 10https://gerrit.wikimedia.org/r/229706 (https://phabricator.wikimedia.org/T83597) [12:50:14] (03Abandoned) 10Muehlenhoff: Add ferm rules for jmxtrans/hadoop workers [puppet] - 10https://gerrit.wikimedia.org/r/229706 (https://phabricator.wikimedia.org/T83597) (owner: 10Muehlenhoff) [12:51:06] PROBLEM - OCG health on ocg1003 is CRITICAL ocg_job_status 447996 msg: ocg_render_job_queue 3060 msg (=3000 critical) [12:51:26] PROBLEM - OCG health on ocg1001 is CRITICAL ocg_job_status 448693 msg: ocg_render_job_queue 3479 msg (=3000 critical) [12:53:56] (03PS4) 10Filippo Giunchedi: rsyslog: add rsyslog::receiver to deprecate syslog-ng [puppet] - 10https://gerrit.wikimedia.org/r/229162 (https://phabricator.wikimedia.org/T107611) [12:53:57] PROBLEM - OCG health on ocg1002 is CRITICAL ocg_job_status 454002 msg: ocg_render_job_queue 6882 msg (=3000 critical) [12:54:10] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] rsyslog: add rsyslog::receiver to deprecate syslog-ng [puppet] - 10https://gerrit.wikimedia.org/r/229162 (https://phabricator.wikimedia.org/T107611) (owner: 10Filippo Giunchedi) [12:55:02] !log stop syslog-ng on lithium before switching to rsyslog [12:55:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:59:56] !log es1.7.1: restart elastic1027 [13:00:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:11:13] (03PS1) 10Muehlenhoff: Add ferm rules for Hadoop worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/229707 
(https://phabricator.wikimedia.org/T83597) [13:19:08] (03PS1) 10Alexandros Kosiaris: Introduce scb100{1,2}.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/229710 (https://phabricator.wikimedia.org/T107287) [13:24:56] akosiaris: i don't mind putting in /usr/share/kafka, but one more suggestion [13:24:57] 6operations, 10ops-eqiad: wmf5842's DRAC is unresponsive - https://phabricator.wikimedia.org/T108184#1514405 (10akosiaris) 3NEW [13:25:01] how about usr/share/kafka/lib [13:25:01] ? [13:25:21] this is done for some other javay packages we use...although maybe that is not very debianish either [13:25:31] 6operations, 10ops-eqiad: wmf5842's DRAC is unresponsive - https://phabricator.wikimedia.org/T108184#1514414 (10akosiaris) [13:25:38] 6operations, 6Services, 10hardware-requests, 5Patch-For-Review: Assign WMF5842, WMF5843 for service cluster expansion as scb1001, scb1002 - https://phabricator.wikimedia.org/T107287#1514413 (10akosiaris) [13:27:01] !log es1.7.1: restart elastic1028 [13:27:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:27:36] (03PS4) 10BBlack: restrict_access: move to common code for all backends [puppet] - 10https://gerrit.wikimedia.org/r/229121 [13:27:43] (03CR) 10BBlack: [C: 032 V: 032] restrict_access: move to common code for all backends [puppet] - 10https://gerrit.wikimedia.org/r/229121 (owner: 10BBlack) [13:27:58] (03PS5) 10BBlack: network::constants::all_networks(_lo)? via flatten() [puppet] - 10https://gerrit.wikimedia.org/r/228586 [13:28:03] ottomata: which packages ? [13:28:09] (03CR) 10BBlack: [C: 032 V: 032] network::constants::all_networks(_lo)? via flatten() [puppet] - 10https://gerrit.wikimedia.org/r/228586 (owner: 10BBlack) [13:28:41] 6operations, 5Patch-For-Review: syslog-ng and rsyslog jousting on lithium - https://phabricator.wikimedia.org/T107611#1514415 (10fgiunchedi) 5Open>3Resolved thanks @faidon for the review, this is merged.
we're off syslog-ng in production now and no more syslog jousting {icon thumbs-o-up} [13:28:47] (03PS4) 10BBlack: VCL: use network::constants::all_networks_lo [puppet] - 10https://gerrit.wikimedia.org/r/229122 [13:31:19] (03PS5) 10BBlack: VCL: use network::constants::all_networks_lo [puppet] - 10https://gerrit.wikimedia.org/r/229122 [13:31:45] (03CR) 10BBlack: [C: 032 V: 032] VCL: use network::constants::all_networks_lo [puppet] - 10https://gerrit.wikimedia.org/r/229122 (owner: 10BBlack) [13:32:12] akosiaris: kafka [13:32:22] https://gerrit.wikimedia.org/r/#/c/229193/3/debian/changelog [13:32:25] oh sorry [13:32:27] the cdh ones [13:32:29] 6operations, 5Patch-For-Review, 7Swift: swift-dispersion-stats cronspam when disks are broken - https://phabricator.wikimedia.org/T78762#1514420 (10fgiunchedi) 5Open>3Resolved output goes to `/dev/null` now, if something else is broken and stats are not being pushed the respective graphite alarms will go... [13:32:30] in particular [13:32:36] (03PS6) 10BBlack: Remove cache::bits role from bits-cluster hosts [puppet] - 10https://gerrit.wikimedia.org/r/228033 (https://phabricator.wikimedia.org/T95448) [13:32:43] (03CR) 10BBlack: [C: 032 V: 032] Remove cache::bits role from bits-cluster hosts [puppet] - 10https://gerrit.wikimedia.org/r/228033 (https://phabricator.wikimedia.org/T95448) (owner: 10BBlack) [13:33:26] well, they do /usr/lib//package*.jars and /usr/lib//lib/.jars [13:33:50] e.g. /usr/lib/hadoop on analytics1011 if you want to take a look [13:35:50] also, akosiaris, how about this instead of the if / else: [13:35:53] # Remove any leading ':' [13:35:53] CLASSPATH=$(echo $CLASSPATH:$SYSTEM_JARS:$KAFKA_EXT_JARS:$KAFKA_JARS | sed 's/^:*//') [13:36:07] (I don't really care much on either of these issues, btw :) ) [13:36:19] argh [13:36:25] the if is better than that :P [13:36:51] as far as lib goes [13:37:04] I am ok with that. 
I see groovy, gradle and other using this [13:37:18] the hadoop packages I don't really consider them the best example given they are from cloudera [13:37:34] and are actually unbuildable by anyone else than them [13:37:57] ottomata: so feel free with the /lib/ thing [13:38:00] ok cool, will drop the sed, and use lib/ [13:38:01] ok! [13:38:39] and I hate java btw [13:38:42] akosiaris: one more [13:38:51] not clear on what you want me to do about slf4j dep. [13:39:01] leave it in for now, but when I build for jessie manually remove it? [13:39:08] and put it back into control? [13:39:22] (03PS1) 10BBlack: Cap all cache objects to max 30 days [puppet] - 10https://gerrit.wikimedia.org/r/229714 (https://phabricator.wikimedia.org/T102991) [13:39:30] and put a comment in the repo explaining? [13:39:31] 6operations, 10vm-requests, 5Patch-For-Review, 7Pybal: codfw: 3 VM request for PyBal - https://phabricator.wikimedia.org/T107901#1514437 (10akosiaris) p:5Triage>3Low [13:39:36] 6operations, 10vm-requests, 5Patch-For-Review, 7Pybal: codfw: 3 VM request for PyBal - https://phabricator.wikimedia.org/T107901#1514438 (10akosiaris) a:3akosiaris [13:40:16] 6operations, 10Deployment-Systems, 10RESTBase, 6Services, 5Patch-For-Review: [Discussion] Move restbase config to Ansible? - https://phabricator.wikimedia.org/T107532#1514439 (10akosiaris) p:5Triage>3Normal [13:40:38] 7Blocked-on-Operations, 6operations, 6Services: Migrate SCA cluster to Jessie - https://phabricator.wikimedia.org/T96017#1514444 (10mobrovac) [13:40:48] akosiaris: what do you think about moving /ext into debian/lib, so as not to pollute the root repo dir? 
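For reference, the leading-colon problem behind the `sed 's/^:*//'` one-liner quoted above: when the first variables happen to be empty, plain joining leaves CLASSPATH starting with ':', which the JVM reads as the current directory. The variable values here are illustrative, not the real packaging defaults:

```shell
# Illustrative jar paths; in the discussion these come from
# /etc/default/kafka and /usr/share/kafka/lib.
SYSTEM_JARS=""                                 # empty: plain joining would leave a leading ':'
KAFKA_EXT_JARS="/usr/share/kafka/lib/ext.jar"
KAFKA_JARS="/usr/share/kafka/lib/kafka.jar"

# sed variant: join everything, then strip any leading colons.
CLASSPATH=$(echo "$SYSTEM_JARS:$KAFKA_EXT_JARS:$KAFKA_JARS" | sed 's/^:*//')
echo "$CLASSPATH"   # prints: /usr/share/kafka/lib/ext.jar:/usr/share/kafka/lib/kafka.jar
```

As the log notes, the sed variant was ultimately dropped in favour of the if/else form.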
[13:40:52] 6operations, 5Patch-For-Review: Configure librenms to use LDAP for authentication - https://phabricator.wikimedia.org/T107702#1514446 (10akosiaris) p:5Triage>3Low [13:41:14] (03CR) 10BBlack: [C: 032 V: 032] Cap all cache objects to max 30 days [puppet] - 10https://gerrit.wikimedia.org/r/229714 (https://phabricator.wikimedia.org/T102991) (owner: 10BBlack) [13:42:29] (03PS1) 10BBlack: followup bugfix for ba63625e [puppet] - 10https://gerrit.wikimedia.org/r/229715 [13:42:47] (03CR) 10BBlack: [C: 032 V: 032] followup bugfix for ba63625e [puppet] - 10https://gerrit.wikimedia.org/r/229715 (owner: 10BBlack) [13:43:24] 6operations, 10Beta-Cluster, 5Patch-For-Review: Unify ::production / ::beta roles for *oid - https://phabricator.wikimedia.org/T86633#1514460 (10akosiaris) Parsoid is left to be done from what I see. We could couple this with moving parsoid to service::node in the long run (@Gwicke, @mobrovac - can you think... [13:43:37] 6operations, 10Beta-Cluster, 5Patch-For-Review: Unify ::production / ::beta roles for *oid - https://phabricator.wikimedia.org/T86633#1514461 (10akosiaris) p:5Normal>3Lowest [13:43:57] (03PS1) 10Muehlenhoff: Add ferm rules for spark/jmxtrans [puppet] - 10https://gerrit.wikimedia.org/r/229716 [13:46:47] 7Puppet, 6operations, 5Patch-For-Review, 7network: Migrate as much as possible from network::constants from network.pp to hiera - https://phabricator.wikimedia.org/T87519#1514474 (10akosiaris) p:5Normal>3Low [13:47:29] 6operations, 6Discovery, 10Maps, 6Services, and 2 others: Puppetize Kartotherian & Tilerator for deployment - https://phabricator.wikimedia.org/T105074#1514479 (10akosiaris) p:5Triage>3Normal [13:53:26] akosiaris: btw, THANK YOU FOR COPPER [13:53:35] everything is so much easier when buliding debs these days [13:54:54] 6operations, 6Services, 10hardware-requests, 5Patch-For-Review: Assign WMF5842, WMF5843 for service cluster expansion as scb1001, scb1002 - https://phabricator.wikimedia.org/T107287#1514484 
(10mobrovac) >>! In T107287#1510377, @akosiaris wrote: > Indeed. But we have no estimates though about either in term... [13:55:15] ottomata: yw [14:01:12] 6operations, 6Services, 10hardware-requests, 5Patch-For-Review: Assign WMF5842, WMF5843 for service cluster expansion as scb1001, scb1002 - https://phabricator.wikimedia.org/T107287#1514515 (10akosiaris) > In the midst of this, I still do not understand your point about MobileApps. Are you saying that, eff... [14:01:19] akosiaris: oh another thing. I am considering moving the kafka script into usr/bin [14:01:22] instead of usr/sbin [14:01:34] otto praising how easy it is to build debs?! [14:01:34] especially since in jessie i've noticed that usr/sbin is not in regular user paths [14:01:38] mark, hahaha [14:01:39] what has happened over the years... [14:01:48] RELATIVELY easy [14:02:29] btw notice that the deb I am building has quite a few include-binaries...which makes things much easier [14:02:48] mark: wait until you run lintian :-p [14:03:32] akosiaris: and it is nice to be able to run things like kafka topic --list and kafka console-consumer not as root [14:04:01] 6operations, 10Beta-Cluster, 5Patch-For-Review: Unify ::production / ::beta roles for *oid - https://phabricator.wikimedia.org/T86633#1514518 (10mobrovac) >>! In T86633#1514460, @akosiaris wrote: > We could couple this with moving parsoid to service::node in the long run Yes, definitely. There's already a... [14:04:23] 6operations, 10Beta-Cluster, 5Patch-For-Review: Unify ::production / ::beta roles for *oid - https://phabricator.wikimedia.org/T86633#1514520 (10mobrovac) [14:05:41] 6operations, 10Beta-Cluster, 5Patch-For-Review: Unify ::production / ::beta roles for *oid - https://phabricator.wikimedia.org/T86633#1514522 (10GWicke) @akosiaris, a precondition for parsoid moving is {T90668}. 
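A quick way to see the /usr/bin-vs-/usr/sbin point above: on jessie an unprivileged user's PATH typically omits /usr/sbin, so user-facing commands like `kafka topic --list` are easier to reach from /usr/bin. The PATH value below is an illustrative assumption, simulated so the sketch is self-contained:

```shell
# A typical non-root PATH on jessie (illustrative, not read from the
# environment so the result is deterministic).
user_path="/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games"

on_path() {
    case ":$user_path:" in
        *":$1:"*) echo "yes" ;;
        *)        echo "no" ;;
    esac
}

echo "/usr/bin on PATH:  $(on_path /usr/bin)"    # yes
echo "/usr/sbin on PATH: $(on_path /usr/sbin)"   # no
```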
[14:05:48] 6operations, 10ContentTranslation-Deployments, 10ContentTranslation-cxserver, 6Services, and 3 others: Standardise CXServer deployment - https://phabricator.wikimedia.org/T101272#1514524 (10mobrovac) [14:05:50] 6operations, 10Beta-Cluster, 5Patch-For-Review: Unify ::production / ::beta roles for *oid - https://phabricator.wikimedia.org/T86633#1514523 (10mobrovac) [14:06:44] (03PS4) 10Ottomata: Updates and fixes for 0.8.2.1-2 release [debs/kafka] (debian) - 10https://gerrit.wikimedia.org/r/229193 (https://phabricator.wikimedia.org/T106581) [14:06:52] !log es1.7.1: restart elastic1029 [14:06:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:09:10] 6operations, 6Services, 10hardware-requests, 5Patch-For-Review: Assign WMF5842, WMF5843 for service cluster expansion as scb1001, scb1002 - https://phabricator.wikimedia.org/T107287#1514526 (10mobrovac) >>! In T107287#1514515, @akosiaris wrote: > T108184 is already a first blocker. Maybe it will be solved... 
[14:11:25] 6operations, 10RESTBase, 7Monitoring, 5Patch-For-Review: Detailed cassandra monitoring: metrics and dashboards done, need to set up alerts - https://phabricator.wikimedia.org/T78514#1514533 (10fgiunchedi) [14:13:36] (03CR) 10Alexandros Kosiaris: [C: 031] Updates and fixes for 0.8.2.1-2 release [debs/kafka] (debian) - 10https://gerrit.wikimedia.org/r/229193 (https://phabricator.wikimedia.org/T106581) (owner: 10Ottomata) [14:14:27] !log restarted HHVM on appservers in codfw for libtidy/PCRE security updates [14:14:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:19:46] !log beginning nodetool cleanup on restbase1001 [14:19:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:20:02] \o/ online alter finished with no errors [14:22:24] thanks akosiaris, hm, one more thing i want to mention, maybe you have an idea [14:22:40] so, we have KAFKA_START=no in /etc/default/kafka on install of kafka-server package [14:22:59] this was no biggy on ubuntu without systemd. it tried to start kafka on install, but failed. [14:23:17] with systemd, it tries to start kafka, fails, but probably because of some exit code somewhere, systemd thinks that kafka actually started, [14:23:32] which means the first time you try to start it after setting KAFKA_START=yes [14:23:34] it will not [14:23:43] because systemd is all like "no prob, kafka is already running" [14:23:45] so you have to do [14:23:49] service kafka stop [14:23:51] service kafka start [14:23:52] just the first time [14:24:26] maybe init.d/kafka should exit 1 if KAFKA_START=no [14:24:26] ? [14:24:30] instead of 0? [14:24:43] ottomata: may I suggest to just provide a systemd unit ? [14:24:43] 6operations, 10Deployment-Systems, 10RESTBase, 6Services, 5Patch-For-Review: [Discussion] Move restbase config to Ansible?
- https://phabricator.wikimedia.org/T107532#1514561 (10GWicke) [14:25:20] PROBLEM - puppet last run on analytics1021 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:25:34] akosiaris: yarghGHGH [14:25:56] (03PS1) 10KartikMistry: cxserver: Add 'nb' in Apertium MT [puppet] - 10https://gerrit.wikimedia.org/r/229720 [14:26:13] you may suggest that and I thought about it, but then I was liek, nawwww, then we have to support multiple init systems in the package, and this one works fine [14:26:33] i just experimented with exit 1, i think that works [14:26:38] Aug 06 14:25:17 kafka202 kafka[7412]: KAFKA_START not set to 'yes' in /etc/default/kafka, not starting ... failed! [14:26:39] Aug 06 14:25:17 kafka202 systemd[1]: kafka.service: control process exited, code=exited status=1 [14:26:39] Aug 06 14:25:17 kafka202 systemd[1]: Failed to start LSB: Start kafka. [14:26:39] Aug 06 14:25:17 kafka202 systemd[1]: Unit kafka.service entered failed state. [14:26:41] (03PS2) 10KartikMistry: cxserver: Add 'nb' in Apertium MT [puppet] - 10https://gerrit.wikimedia.org/r/229720 (https://phabricator.wikimedia.org/T97938) [14:26:43] well, you don't have to delete the old one [14:26:49] we can ship all three [14:26:55] systemd/upstart/initscript [14:26:57] HMmMM you know...ok. THREE!? [14:26:59] haha, not doing upstart. [14:27:00] :p [14:27:06] fine by me ;-) [14:27:21] but, ya ok i'll do it, i was hoping to maybe use this soon, but it is thurs and I would be wise to wait til monday to try my upgrade again [14:27:31] so, ja lemme at that there systemd thang [14:27:51] akosiaris: do you know of an example package we have that installs systemd?
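The journal excerpt above relies on the LSB convention being discussed: an init script that declines to start should exit non-zero so systemd's sysvinit compatibility layer records the unit as failed rather than active. A minimal sketch of the guard, written as a function so it runs standalone; in the real init script this is `exit 1` and KAFKA_START is sourced from /etc/default/kafka:

```shell
# Sketch of the exit-1 guard discussed above; set inline here so the
# sketch is self-contained.
KAFKA_START="no"

start_kafka() {
    if [ "$KAFKA_START" != "yes" ]; then
        echo "KAFKA_START not set to 'yes' in /etc/default/kafka, not starting"
        return 1   # 'exit 1' in a real init script
    fi
    echo "starting kafka"
}

# Exiting 0 here would leave systemd believing the service is running.
if start_kafka; then echo "rc=0"; else echo "rc=1"; fi   # prints rc=1
```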
i don't know the proper debhelper stuff for it [14:28:05] sorry, installs a systemd unit [14:28:33] ah, it's very very difficult :P [14:28:35] https://git.wikimedia.org/blob/operations%2Fdebs%2Fetherpad-lite/4fdc1f72b18d88177957ccbeba4241f0781ef705/debian%2Fetherpad-lite.service [14:28:45] if you got debhelper new enough, that's just enough [14:28:49] aka building for jessie [14:28:56] just add the file [14:29:25] (03CR) 10Alexandros Kosiaris: [C: 032] cxserver: Add 'nb' in Apertium MT [puppet] - 10https://gerrit.wikimedia.org/r/229720 (https://phabricator.wikimedia.org/T97938) (owner: 10KartikMistry) [14:29:52] ah .service, just found that in docs too, ok cool [14:29:53] thank you! [14:43:23] hmm, akosiaris, systemd issue. currently we allow configuration of ulimit -n by setting $KAFKA_NOFILES_ULIMIT in default file [14:43:41] i see i can set LimitNOFILE in systemd unit [14:43:48] but, i can't do variable substitution there [14:44:10] should I just hardcode it? [14:44:13] yes [14:44:32] if anyone wants to override that, they can ship their own unit file in /etc/systemd/system [14:44:45] the packages one will go anyway into /lib/systemd/system/ [14:44:45] hmm, ooook [14:45:12] plus there is an override thingy for systemd unit files but I 've never used it.. 
just read about it [14:45:19] either that or I dreamt it [14:50:31] !log es1.7.1: restart elastic1030 [14:50:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:50:38] ottomata: you can use "EnvironmentFile" (see man systemd.exec), it should read almost all existing options from /etc/default/kafka (the only exception being the uncommented KAFKA_JMX_OPTS and KAFKA_LOG4J_OPTS) [14:50:57] since these expand on shell variables which EnvironmentFile doesn't support [14:51:02] OH [14:51:13] i was going to use that, didn't realize the env file wouldn't just be executed like a shell script [14:51:15] hm [14:51:54] ok, well, since those are just commented out examples in that file [14:51:59] i'll just change them so they use the hardcoded paths [14:52:22] 6operations, 10Deployment-Systems, 10RESTBase, 6Services, 5Patch-For-Review: [Discussion] Move restbase config to Ansible? - https://phabricator.wikimedia.org/T107532#1514597 (10thcipriani) >>! In T107532#1514041, @Joe wrote: > So is releng settled on using ansible for deploys as @gwicke's comments in ht... [14:53:05] 7Blocked-on-Operations, 6operations, 6Services: Migrate SCA cluster to Jessie - https://phabricator.wikimedia.org/T96017#1514601 (10mobrovac) [14:54:47] (03PS1) 10Yurik: Added tilerator service, and granted kartotherian OSM DB read access [puppet] - 10https://gerrit.wikimedia.org/r/229727 (https://phabricator.wikimedia.org/T105074) [14:56:17] akosiaris, ^ [14:56:46] (03PS2) 10Yurik: Added tilerator service, and granted kartotherian OSM DB read access [puppet] - 10https://gerrit.wikimedia.org/r/229727 (https://phabricator.wikimedia.org/T105074) [14:58:16] sorry, forgot a few files [14:58:18] (03PS3) 10Yurik: Added tilerator service, and granted kartotherian OSM DB read access [puppet] - 10https://gerrit.wikimedia.org/r/229727 (https://phabricator.wikimedia.org/T105074) [15:00:04] anomie ostriches thcipriani marktraceur Krenair: Dear anthropoid, the time has come.
Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150806T1500). [15:00:04] ebernhardson: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [15:00:16] looks lke just me? i'll just deploy it [15:00:58] moritzm: is debhelper smart enough to know if it has a .service file to not install the .init file? [15:01:10] or do I have to somehow specify? [15:04:33] (03PS1) 10Yurik: GitIgnore JetBrain's IDEA setting files [puppet] - 10https://gerrit.wikimedia.org/r/229730 [15:06:18] debhelper installs both by default (to support those who opt out of systemd), but that doesn't matter in practice; if a systemd unit is present the sysvinit script isn't used [15:08:35] RECOVERY - OCG health on ocg1003 is OK ocg_job_status 542829 msg: ocg_render_job_queue 311 msg [15:08:50] !log ebernhardson Synchronized php-1.26wmf17/extensions/CirrusSearch/: bump cirrussearch in 1.26wmf17 for swat (duration: 00m 14s) [15:08:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:09:05] RECOVERY - OCG health on ocg1002 is OK ocg_job_status 542877 msg: ocg_render_job_queue 36 msg [15:09:56] ah ok, that's cool, thanks moritzm [15:09:58] guess that's fine [15:10:06] RECOVERY - OCG health on ocg1001 is OK ocg_job_status 542978 msg: ocg_render_job_queue 0 msg [15:10:18] although i would find it confusing as a systemd newb. i would look in /etc/init.d/ first and see the script, and then try to debug problems with that rather than some other thing [15:14:38] (03PS1) 10BBlack: kill dead code from upload VCL (ceph era) [puppet] - 10https://gerrit.wikimedia.org/r/229732 [15:24:25] moritzm: another q. 
i'm trying to keep kafka server from being started on install [15:24:36] i just tried override_dh_systemd_start: [15:24:36] dh_systemd_start --no-start [15:24:40] but that didn't seem to work [15:24:54] 6operations, 10Deployment-Systems, 10RESTBase, 6Services, 5Patch-For-Review: [Discussion] Move restbase config to Ansible? - https://phabricator.wikimedia.org/T107532#1514677 (10greg) [15:25:26] man page says [15:25:27] Do not start the unit file after upgrades and after initial installation (the latter is only relevant for services without a corresponding init script). [15:25:33] and, i guess this one has a corresponding init script? [15:30:24] ottomata: yes, I think the corresponding, available init script is the problem. what you could do is to run "systemctl disable kafka.service" in postinst [15:31:18] 6operations, 10Deployment-Systems, 10RESTBase, 6Services, 5Patch-For-Review: [Discussion] Move restbase config to Ansible? - https://phabricator.wikimedia.org/T107532#1514694 (10GWicke) > So is releng settled on using ansible for deploys as @GWicke's comments in https://gerrit.wikimedia.org/r/#/c/229306/... [15:31:19] then the unit file won't be started automatically (but you would still need to enable the sysv init script as well) [15:31:20] hm, will disable keep it from starting on boot? [15:32:51] for the systemd unit it should prevent that,yes [15:33:43] 6operations, 10Deployment-Systems, 10RESTBase, 6Services, 5Patch-For-Review: [Discussion] Move restbase config to Ansible? - https://phabricator.wikimedia.org/T107532#1514704 (10greg) >>! In T107532#1514694, @GWicke wrote: >> So is releng settled on using ansible for deploys as @GWicke's comments in http... [15:46:29] moritzm: hm, does this look ok? 
[15:46:33] Loaded: loaded (/etc/systemd/system/kafka.service; disabled) [15:46:33] i did [15:46:41] dh_installinit --name=kafka --no-start [15:46:52] i think --no-start might be new for jessie stuff...i don't remember it before, and i looked [15:47:00] on dh_installinit [15:47:03] but maybe I just missed it... [15:51:17] I don't remember the exact status message, but if it shows "disabled" that should be alright [15:55:05] 6operations, 10vm-requests: eqiad: 1 VM request for mailman - https://phabricator.wikimedia.org/T108065#1514770 (10Dzahn) Or we can use `ssh -D` port forwarding when connecting to a bastion host and put localhost in SOCKS proxy settings of the browser. Then we will also be able to see the web interface while r... [16:05:09] (03PS1) 10Dzahn: grafana: add role to krypton (VM) [puppet] - 10https://gerrit.wikimedia.org/r/229737 [16:06:41] 6operations, 10Traffic, 10Wikimedia-DNS: DNS request for wikimedia.org (let 3rd party send mail as wikimedia.org) - https://phabricator.wikimedia.org/T107940#1514801 (10BBlack) So, I talked this over a bit with @JGreen (and ran it by @Faidon as well), and this is the best alternate proposal I've come up with... [16:07:10] (03PS5) 10Ottomata: Updates and fixes for 0.8.2.1-2 release [debs/kafka] (debian) - 10https://gerrit.wikimedia.org/r/229193 (https://phabricator.wikimedia.org/T106581) [16:11:16] akosiaris: that one is looking pretty good ^ [16:15:34] (03CR) 10BBlack: [C: 032] kill dead code from upload VCL (ceph era) [puppet] - 10https://gerrit.wikimedia.org/r/229732 (owner: 10BBlack) [16:17:33] (03PS1) 10ArielGlenn: dumps: for cutoff arg, check all wikis before giving up [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/229739 [16:19:29] 6operations, 10ops-eqiad: wmf5842's DRAC is unresponsive - https://phabricator.wikimedia.org/T108184#1514840 (10Cmjohnson) wmf5842 is accessible. The server was not plugged in. I unplugged in while I was working on labnet1002. 
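Pulling the day's kafka.service threads together (EnvironmentFile, a hardcoded LimitNOFILE, and the unit being left "disabled" on install), a unit along these lines would fit the discussion. Every path, option and value here is an assumption for illustration, not the actual packaged file:

```ini
# Hypothetical /lib/systemd/system/kafka.service sketch; illustrative,
# not the shipped unit.
[Unit]
Description=Apache Kafka broker
After=network.target

[Service]
# EnvironmentFile reads plain KEY=value lines only; the commented-out
# KAFKA_JMX_OPTS/KAFKA_LOG4J_OPTS examples that expand other shell
# variables are not supported here, hence the hardcoded paths.
EnvironmentFile=-/etc/default/kafka
ExecStart=/usr/bin/kafka server-start /etc/kafka/server.properties
# Hardcoded per the discussion; a site needing a different limit can
# override via a drop-in in /etc/systemd/system/kafka.service.d/.
LimitNOFILE=65536
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

With a new enough debhelper (i.e. building for jessie), dropping this in as debian/kafka.service is enough for dh to install it; the `--no-start`/`systemctl disable` handling discussed above then keeps it from starting on install, matching the "disabled" state shown.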
All yours...racktables updated with new names [16:19:57] (03CR) 10ArielGlenn: [C: 032] dumps: for cutoff arg, check all wikis before giving up [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/229739 (owner: 10ArielGlenn) [16:20:51] (03PS2) 10Dzahn: admin: add user for John F. Lewis [puppet] - 10https://gerrit.wikimedia.org/r/229587 (https://phabricator.wikimedia.org/T108082) [16:21:30] mutante: heh, I was just about to reorder those two patches, are you doing it already? [16:22:26] andrewbogott: yes, the dependency should have been gone already [16:22:46] Oh, I was going to /add/ a dep so that tox would work [16:22:51] but, either way, I’ll stand back. [16:23:07] oh, it should work after this one is merged and then rebase.. let's see [16:23:34] 6operations, 6Services, 10hardware-requests, 5Patch-For-Review: Assign WMF5842, WMF5843 for service cluster expansion as scb1001, scb1002 - https://phabricator.wikimedia.org/T107287#1514854 (10Cmjohnson) scb1001 => wmf5843 (row A) scb1002 => wmf5842 (row B) [16:23:52] andrewbogott: but if you want to confirm the key and uid looks correct, that would still be nice [16:23:59] 6operations, 6Services, 10hardware-requests, 5Patch-For-Review: Assign WMF5842, WMF5843 for service cluster expansion as scb1001, scb1002 - https://phabricator.wikimedia.org/T107287#1514858 (10Cmjohnson) [16:24:00] !log upgrading elastic1031 to 1.7.1 [16:24:01] 6operations, 10ops-eqiad: wmf5842's DRAC is unresponsive - https://phabricator.wikimedia.org/T108184#1514856 (10Cmjohnson) 5Open>3Resolved a:3Cmjohnson [16:24:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:24:24] mutante: sure. Is this something you think can be merged right away, or does it need 3 days and/or Ops meeting approval?
[16:25:25] andrewbogott: yea, it can, i just asked Faidon about that [16:25:46] it was agreed upon in a meeting with mutante/robh/myself/mark [16:25:52] so I don't think there's much point in waiting [16:25:54] so +1 :) [16:26:01] andrewbogott: besides, the first part doesnt add to any groups, and the second part is not on a VM yet [16:26:11] (03CR) 10Andrew Bogott: [C: 031] "confirmed that this is the right UID." [puppet] - 10https://gerrit.wikimedia.org/r/229587 (https://phabricator.wikimedia.org/T108082) (owner: 10Dzahn) [16:26:14] hey I was there too :P [16:26:16] meh [16:26:36] paravoid: i'd disagree and say ops meeting [16:26:41] why? [16:26:43] just cuz we say its ok doesnt mean the rest of the team is ok [16:26:59] they may have an objection and we're just deciding we dont let them have the chance to voice it [16:27:02] by skipping the meeting. [16:27:07] Yeah, unless it’s urgent +1 for following our policy. [16:27:15] its not an emergency, i dont see why we need to rush and avoid the policy. [16:27:17] Just because — it’s easier to follow policy :) [16:27:44] I don't suspect anyone will object, but those kinds of assumptions are exactly why we shouldn't break procedure imo. [16:28:11] not technically an emergency but; is a blocker for most things and waiting for Monday further extends wait because I'm away all next week and getting any balls rolling would be essential after that meeting [16:28:17] I think it's taking the "process" to its extreme [16:28:17] but my say is just of a volunteer :) [16:28:32] Like I said, I think that discounts letting the rest of the team have input [16:28:33] like it would be by saying that we should wait for people to get back from their vacation, for instance [16:28:36] and its a bad idea to me. 
[16:28:46] 10Ops-Access-Requests, 6operations: Replace jdouglas's production ssh key - it matched labs key - https://phabricator.wikimedia.org/T108111#1514870 (10Dzahn) 09:21 < mutante> is "jdouglas" around, maybe using a different nick? 09:21 < guillom> mutante: I think he left the WMF a couple of weeks ago. 09:24 < gu... [16:28:49] thats not at all the same [16:28:53] that is a gross oversimplification [16:29:13] it is an oversimplification and taking it to the extreme, yeah [16:29:17] We have the policy if someone has sudo, we want to responsibly have a team review of it in an ops meeting. [16:29:29] If you guys want to discount allowing the rest of the team input, I think its a bad idea. [16:29:43] but, im not in charge and if paravoid says to do it, then itll happen ;] [16:29:53] 10Ops-Access-Requests, 6operations: Replace jdouglas's production ssh key - it matched labs key - https://phabricator.wikimedia.org/T108111#1514872 (10Dzahn) a:5Jdouglas>3None so we don't have to re-enable any access it looks. re-assigning "for grabs" for him then. [16:30:46] paravoid: everyone has an excuse of why their request is an exception, i dont see why this is different. [16:31:07] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 10.00% of data above the critical threshold [500.0] [16:31:16] (03CR) 10Dzahn: [C: 032] "this does not give any access yet because the user is not in any groups" [puppet] - 10https://gerrit.wikimedia.org/r/229587 (https://phabricator.wikimedia.org/T108082) (owner: 10Dzahn) [16:31:32] paravoid can correct me (since it was his meeting at the time) but wasn't this discussed previously (a month or so ago) where the outcome was 'approved but not for sodium'? [16:31:42] or is said discussion invalid now? :) [16:31:44] Nope [16:31:46] thats not a request [16:31:51] thats open ended as hell ;D [16:32:19] Sudo should inherently be very in scope and the request should be very specific. [16:32:27] JohnFLewis: its not personal, i +1 you having it.
[16:32:28] (03PS2) 10Dzahn: admin: add mailman root group and add john [puppet] - 10https://gerrit.wikimedia.org/r/229585 (https://phabricator.wikimedia.org/T108082) [16:33:02] I just think we shouldn't circumvent the policy because we feel like it. It exists as a check in the system, specifically called for on ops meetings. [16:33:04] robh: I know it's not, more so since you're also on the project so you know how much this means in terms of helping the goal :) [16:33:20] it's not like that policy is written in stone or doesn't allow exception [16:33:24] but anyway, how about this [16:33:35] but again, if someone in ops mgmt puts on task to break it, then it'll happen, i just think its a bad idea. [16:33:37] let's give john the regular user access now [16:33:55] and if you guys do it anyhow later today, i wont continue to argue then [16:34:00] im just sharing my view now ;] [16:34:01] on the grounds of both yesterday's approval and the previous task that was discussed in a meeting too (and everyone was in favor with the sodium exception btw) [16:34:09] and then do the root part on monday [16:34:10] 6operations, 10ops-eqiad: db1035 died - network or power problem - https://phabricator.wikimedia.org/T107746#1514882 (10Cmjohnson) 5Open>3Resolved a:3Cmjohnson db1035 idrac6 failed to initialize. Drained the flea power and rebooted. All is well again! [16:34:20] would you be able to do some work this week without root, JohnFLewis? [16:34:56] paravoid: I believe so though that depends on how much progress is made today and tomorrow :) [16:35:19] maybe if we did "sudo as mailman user" but not sudo all [16:35:24] but i'd still be sudo [16:35:51] 6operations, 10ops-eqiad: db1059 raid degraded - https://phabricator.wikimedia.org/T107024#1514895 (10Cmjohnson) 5Open>3Resolved a:3Cmjohnson The disk is spun up and icinga is no longer complaining. 
Resolving [16:36:35] JohnFLewis: with regular user at least you could proxy via bastion to see the web interface [16:36:46] mutante: indeed [16:36:54] i agree, depends if we get to that point though, let's see [16:37:04] i take it as an ACK for regular user but will wait with more [16:37:39] and if we do get to that point; that is all I can foresee needing *at the moment*. Following Monday, more will be useful/needed and that follows due course then at least. [16:37:49] 6operations, 10ops-eqiad, 5Patch-For-Review: mw1061 has a faulty disk, filesystem is read-only - https://phabricator.wikimedia.org/T107849#1514908 (10Cmjohnson) a:3Cmjohnson Taking this to replace the disk [16:39:06] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [16:39:33] (03CR) 10Dzahn: [C: 04-1] "the root/sudo part should wait until Monday, but we can start with a regular shell user now" [puppet] - 10https://gerrit.wikimedia.org/r/229585 (https://phabricator.wikimedia.org/T108082) (owner: 10Dzahn) [16:44:18] 10Ops-Access-Requests, 6operations: Replace jdouglas's production ssh key - it matched labs key - https://phabricator.wikimedia.org/T108111#1514964 (10Andrew) a:3Andrew [16:44:37] (03CR) 10Ottomata: Add ferm rules for spark/jmxtrans (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/229716 (owner: 10Muehlenhoff) [16:45:21] 6operations, 10Traffic, 10Wikimedia-DNS: DNS request for wikimedia.org (let 3rd party send mail as wikimedia.org) - https://phabricator.wikimedia.org/T107940#1514968 (10CCogdill_WMF) Thank you for thinking of this solution, @BBlack, et al. I really appreciate your flexibility, and I've confirmed Trilogy can...
[16:45:21] (03CR) 10Ottomata: Add ferm rules for spark/jmxtrans (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/229716 (owner: 10Muehlenhoff) [16:46:07] 10Ops-Access-Requests, 6operations: Replace jdouglas's production ssh key - it matched labs key - https://phabricator.wikimedia.org/T108111#1514970 (10Dzahn) 09:43 < Krenair> guillom, mutante: They still have wiki accounts open and are in the wmf ldap group 09:44 < Krenair> if they have been offboarded it was... [16:47:19] (03CR) 10Ottomata: Add ferm rules for Hadoop worker nodes (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/229707 (https://phabricator.wikimedia.org/T83597) (owner: 10Muehlenhoff) [16:56:05] !log krinkle Synchronized php-1.26wmf17/resources/Resources.php: T108191 unbreak mobile js (duration: 00m 11s) [16:56:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:57:32] (03PS24) 10BryanDavis: labs: new role::logstash::stashbot class [puppet] - 10https://gerrit.wikimedia.org/r/227175 [16:57:39] (03CR) 10jenkins-bot: [V: 04-1] labs: new role::logstash::stashbot class [puppet] - 10https://gerrit.wikimedia.org/r/227175 (owner: 10BryanDavis) [16:57:51] (03CR) 10BryanDavis: labs: new role::logstash::stashbot class (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/227175 (owner: 10BryanDavis) [16:59:09] !log krinkle Synchronized php-1.26wmf17/resources/src/startup.js: touch (duration: 00m 11s) [16:59:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:59:54] (03PS25) 10BryanDavis: labs: new role::logstash::stashbot class [puppet] - 10https://gerrit.wikimedia.org/r/227175 [16:59:56] (03PS15) 10BryanDavis: Update configuration for logstash 1.5.3 [puppet] - 10https://gerrit.wikimedia.org/r/226991 (https://phabricator.wikimedia.org/T99735) [17:01:51] (03CR) 10Glaisher: "would be nice to have i18n messages in WikimediaMessages as well" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/227879 
(https://phabricator.wikimedia.org/T106724) (owner: 10John F. Lewis) [17:06:35] 6operations, 7HTTPS, 7LDAP: ldap-codfw.wikimedia.org & ldap-eqiad.wikimedia.org expire in September 2015 - https://phabricator.wikimedia.org/T106604#1515077 (10Andrew) a:5Andrew>3RobH Ah, sorry I didn't respond. Yes, please renew those certs! Both of those hosts are still active. Thanks. [17:08:52] (03CR) 10DCausse: [C: 031] Disable dynamic scripting in Elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/224651 (owner: 10Manybubbles) [17:16:49] 10Ops-Access-Requests, 6operations: Ops offboarding for jdouglas - https://phabricator.wikimedia.org/T108203#1515149 (10Andrew) 3NEW a:3Andrew [17:19:09] (03PS1) 10Andrew Bogott: Mark jdouglas as absent, remove from all groups. [puppet] - 10https://gerrit.wikimedia.org/r/229755 (https://phabricator.wikimedia.org/T108111) [17:19:11] (03PS1) 10Andrew Bogott: Add gage to 'absent' group, as he is. [puppet] - 10https://gerrit.wikimedia.org/r/229756 [17:19:41] (03PS1) 10Merlijn van Deen: Make adminbot also work if no headers are present [debs/adminbot] - 10https://gerrit.wikimedia.org/r/229757 [17:19:43] (03PS1) 10Merlijn van Deen: Add IRC notice on exceptions [debs/adminbot] - 10https://gerrit.wikimedia.org/r/229758 [17:24:42] (03CR) 10jenkins-bot: [V: 04-1] Add IRC notice on exceptions [debs/adminbot] - 10https://gerrit.wikimedia.org/r/229758 (owner: 10Merlijn van Deen) [17:26:15] 10Ops-Access-Requests, 6operations: Ops offboarding for jdouglas - https://phabricator.wikimedia.org/T108203#1515211 (10Andrew) I have removed jdouglas from the wmf group and removed his admin role in the deployment-prep project. Attached patch will revoke his prod access. [17:33:58] 6operations, 10Deployment-Systems, 10RESTBase, 6Services, 5Patch-For-Review: [Discussion] Move restbase config to Ansible? 
- https://phabricator.wikimedia.org/T107532#1515244 (10GWicke) @greg, I think the amount of confusion about what the cabal is doing or not illustrates that there is maybe not enough... [17:40:39] (03CR) 10Alexandros Kosiaris: "Minor comment inline, otherwise LGTM" (031 comment) [debs/kafka] (debian) - 10https://gerrit.wikimedia.org/r/229193 (https://phabricator.wikimedia.org/T106581) (owner: 10Ottomata) [17:40:43] 6operations, 10ops-eqiad, 10Traffic, 5Patch-For-Review: eqiad: investigate thermal issues with some cp10xx machines - https://phabricator.wikimedia.org/T103226#1515263 (10Cmjohnson) Still looking good cmjohnson@palladium:~$ sudo salt --out=raw --verbose -t 30 'cp10*' cmd.run 'cat /sys/class/thermal/therm... [17:50:40] 6operations, 10Deployment-Systems, 10RESTBase, 6Services, 5Patch-For-Review: [Discussion] Move restbase config to Ansible? - https://phabricator.wikimedia.org/T107532#1515318 (10greg) >>! In T107532#1515244, @GWicke wrote: > @greg, I think the amount of confusion about what the cabal is doing or not illu... [17:54:09] 6operations, 6Collaboration-Team Backlog, 10Flow: Setup separate logical External Store for Flow - https://phabricator.wikimedia.org/T107610#1515330 (10Mattflaschen) [18:00:04] twentyafterfour greg-g: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150806T1800). [18:03:33] (03CR) 10Ottomata: Updates and fixes for 0.8.2.1-2 release (031 comment) [debs/kafka] (debian) - 10https://gerrit.wikimedia.org/r/229193 (https://phabricator.wikimedia.org/T106581) (owner: 10Ottomata) [18:08:11] hey all! n00b here. what's the preferred way to test puppet changes locally? would i just use vagrant or something else? [18:09:02] niedzielski: for mw-vagrant: yes. 
in other cases, it's often 'pray and hope it works' :( [18:09:43] niedzielski: puppet apply or puppetmaster::self [18:09:44] niedzielski: by mw-vagrant, you mean https://gerrit.wikimedia.org/r/mediawiki/vagrant? [18:10:19] mutante: thanks, i'll look into those [18:10:38] niedzielski: yes; I'm not sure what the context for your changes is [18:10:46] niedzielski: there's also the "puppet compiler" but it's currently being fixed [18:11:37] 6operations, 10vm-requests: eqiad: 1 VM request for mailman - https://phabricator.wikimedia.org/T108065#1515370 (10Dzahn) Let's go ahead then and create the instance please. [18:11:48] mutante valhallasw`cloud: we're seeing some weird issues trying to get our android ci machine setup. i wanted to repro the setup locally which i assume could be done as a virtual instance in the same manner it's done in prod [18:12:32] niedzielski: you can start a labs instance and apply the same puppet role to it (via wikitech ui, it's called "puppet groups" there) that is used in production [18:13:23] niedzielski: then there's a thing where you can let it install a local puppetmaster (puppet:self) on that instance, so you can hack puppet locally without having to go through gerrit [18:14:12] mutante: hm, well, i definitely want to avoid having to post changes to gerrit to test them if at all possible so maybe i will try the second option [18:14:20] (03CR) 10Alexandros Kosiaris: Updates and fixes for 0.8.2.1-2 release (031 comment) [debs/kafka] (debian) - 10https://gerrit.wikimedia.org/r/229193 (https://phabricator.wikimedia.org/T106581) (owner: 10Ottomata) [18:14:56] niedzielski: https://wikitech.wikimedia.org/wiki/Help:Self-hosted_puppetmaster [18:15:31] you'll just need the "single instance" part [18:16:01] (03CR) 10Ottomata: Updates and fixes for 0.8.2.1-2 release (031 comment) [debs/kafka] (debian) - 10https://gerrit.wikimedia.org/r/229193 (https://phabricator.wikimedia.org/T106581) (owner: 10Ottomata) [18:16:10] mutante: thanks for the link!
i'll dig in [18:16:10] then you can just hack away in /var/lib/git/operations/puppet on the VM, run puppet and see changes immediately [18:16:16] (03CR) 10Ottomata: [C: 032] Updates and fixes for 0.8.2.1-2 release [debs/kafka] (debian) - 10https://gerrit.wikimedia.org/r/229193 (https://phabricator.wikimedia.org/T106581) (owner: 10Ottomata) [18:17:42] mutante: oof, i don't suppose there's a way to run this in a user directory without contaminating my system structure? so, for example, ~/dev/wmf/foo-test/var/lib/git/operations/puppet [18:18:24] mutante: oops, i missed the part where you said that was on the vm [18:18:42] niedzielski: yea, just start a fresh VM and then throw it away again [18:18:56] mutante: sounds good, thanks! [18:34:05] PROBLEM - Host mw1061 is DOWN: PING CRITICAL - Packet loss = 100% [18:37:21] akosiaris: [18:37:37] should I build packages with distro name in version? not sure how reprepro works here [18:37:44] e.g. [18:37:54] kafka-server 0.8.2.1-2~jessie1 [18:37:55] ? [18:38:22] then i'd build separate packages named like that for each precise, trusty and jessie? [18:38:28] (03PS1) 1020after4: all wikis to 1.26wmf17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/229851 [18:40:46] (03PS2) 10Andrew Bogott: Mark jdouglas as absent, remove from all groups. [puppet] - 10https://gerrit.wikimedia.org/r/229755 (https://phabricator.wikimedia.org/T108111) [18:43:11] (03CR) 10Andrew Bogott: [C: 032] Mark jdouglas as absent, remove from all groups. [puppet] - 10https://gerrit.wikimedia.org/r/229755 (https://phabricator.wikimedia.org/T108111) (owner: 10Andrew Bogott) [18:43:56] (03CR) 1020after4: [C: 032] all wikis to 1.26wmf17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/229851 (owner: 1020after4) [18:44:03] (03Merged) 10jenkins-bot: all wikis to 1.26wmf17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/229851 (owner: 1020after4) [18:45:24] (03PS2) 10Andrew Bogott: Add gage to 'absent' group, as he is.
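The local-hacking loop mutante describes above can be sketched as a command sequence on a labs VM that has the self-hosted puppetmaster ("single instance" setup from the linked wikitech page) applied; the edited module path here is only an illustrative example, not a verified file:

```shell
# On a labs instance running a self-hosted puppetmaster:
cd /var/lib/git/operations/puppet          # local checkout served by the VM's own puppetmaster
sudo vim modules/kafka/manifests/init.pp   # hack away locally; no Gerrit round-trip needed
sudo puppet agent --test                   # apply the local tree and see the changes immediately
# Done experimenting? Throw the VM away and start fresh.
```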
[puppet] - 10https://gerrit.wikimedia.org/r/229756 [18:45:50] 6operations, 10Deployment-Systems, 10RESTBase, 6Services, 5Patch-For-Review: [Discussion] Move restbase config to Ansible? - https://phabricator.wikimedia.org/T107532#1515421 (10GWicke) > If, instead, others would ask the deployment working group direct questions (as @Joe did above) then there wouldn't b... [18:46:36] RECOVERY - Host mw1061 is UP: PING OK - Packet loss = 0%, RTA = 1.75 ms [18:46:58] !log twentyafterfour rebuilt wikiversions.cdb and synchronized wikiversions files: all wikis to 1.26wmf17 [18:47:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:47:52] ottomata: sorry I can't really answer your question about packaging but that looks reasonable to me at least [18:48:13] something is about to...go down... [18:48:24] Josve05a: ? [18:48:30] My userscripts and gadgets on enwp aren't loading [18:48:42] something other than mw1061? [18:48:44] (03CR) 10Andrew Bogott: [C: 032] Add gage to 'absent' group, as he is. [puppet] - 10https://gerrit.wikimedia.org/r/229756 (owner: 10Andrew Bogott) [18:49:09] Josve05a: I just pushed 1.26wmf17 [18:49:16] to enwiki [18:49:31] oh, nvm they popped back [18:49:33] weird [18:49:55] I purged a few times and on the fifth time they "came to life" again [18:50:25] twentyafterfour: thanks, i'm just not sure what the right thing to do there is, i guess reprepro won't let us have the same version for multiple distros? [18:50:48] ok that's not good: Notice: Undefined index: laplace in /srv/mediawiki/php-1.26wmf17/extensions/CirrusSearch/includes/Hooks.php on line 359 [18:50:55] that error is happening a LOT [18:51:07] I think I better roll back [18:51:32] back in a few...
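The `~jessie1` scheme ottomata floats above works because of how dpkg orders versions; here is a quick check, using the package versions from the discussion:

```shell
# dpkg sorts "~" before everything, including end-of-string, so a
# "~<distro>N" suffix always ranks *below* the unsuffixed revision:
#   0.8.2.1-2~jessie1 < 0.8.2.1-2 < 0.8.2.1-3
# That lets per-distro rebuilds coexist without outranking a plain
# upload of the same Debian revision.
for v in '0.8.2.1-2~precise1' '0.8.2.1-2~trusty1' '0.8.2.1-2~jessie1'; do
  dpkg --compare-versions "$v" lt '0.8.2.1-2' && echo "$v < 0.8.2.1-2"
done
```

The other half of the answer to the reprepro question: reprepro pools package files by filename, so shipping the same version string with different contents for different distros does conflict, which is exactly what the per-distro suffix avoids.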
(03CR) 10Faidon Liambotis: [C: 04-1] "My original question seems to be answered by:" [puppet] - 10https://gerrit.wikimedia.org/r/219253 (owner: 10GWicke) [18:54:03] (03PS1) 1020after4: Revert "all wikis to 1.26wmf17" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/229855 [18:54:18] (03CR) 1020after4: [C: 032] Revert "all wikis to 1.26wmf17" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/229855 (owner: 1020after4) [18:54:34] (03Merged) 10jenkins-bot: Revert "all wikis to 1.26wmf17" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/229855 (owner: 1020after4) [18:55:08] !log twentyafterfour rebuilt wikiversions.cdb and synchronized wikiversions files: revert 1.26wmf17 [18:55:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:57:08] andrewbogott: jitsi announced they are starting to support firefox from version 40 and on, i would like to have another round of testing that in labs, can you please arrange a project for that ? [18:57:47] sure [18:58:06] thanks [18:59:48] matanya: created project ‘jitsi’ with you as admin [19:00:28] brion: Is there anything I can do to help debug https://phabricator.wikimedia.org/T107968 ? Like anything in the console I can find and copy or test or...? [19:05:12] 10Ops-Access-Requests, 6operations: Ops offboarding for jdouglas - https://phabricator.wikimedia.org/T108203#1515490 (10Andrew) revoked prod access, removed gerrit group memberships. Phab... I'm not so sure about.
[19:05:48] (03PS18) 10BBlack: beta: varnish backend/director for isolated security audits [puppet] - 10https://gerrit.wikimedia.org/r/158016 (https://phabricator.wikimedia.org/T72181) (owner: 10Dduvall) [19:06:21] (03PS1) 10Yuvipanda: k8s: Create worker role and include that in master too [puppet] - 10https://gerrit.wikimedia.org/r/229858 [19:06:36] (03PS2) 10Yuvipanda: k8s: Create worker role and include that in master too [puppet] - 10https://gerrit.wikimedia.org/r/229858 [19:07:31] (03CR) 10Yuvipanda: [C: 032] k8s: Create worker role and include that in master too [puppet] - 10https://gerrit.wikimedia.org/r/229858 (owner: 10Yuvipanda) [19:10:28] (03PS1) 10Yuvipanda: k8s: Include docker in worker nodes as well [puppet] - 10https://gerrit.wikimedia.org/r/229860 [19:10:33] (03CR) 10jenkins-bot: [V: 04-1] k8s: Include docker in worker nodes as well [puppet] - 10https://gerrit.wikimedia.org/r/229860 (owner: 10Yuvipanda) [19:10:39] (03PS2) 10Yuvipanda: k8s: Include docker in worker nodes as well [puppet] - 10https://gerrit.wikimedia.org/r/229860 [19:10:48] (03CR) 10Yuvipanda: [C: 032 V: 032] k8s: Include docker in worker nodes as well [puppet] - 10https://gerrit.wikimedia.org/r/229860 (owner: 10Yuvipanda) [19:11:28] robh: yt? need a server naming opinion, and I know you have them :) [19:11:46] (03PS1) 10RobH: mailman ganeti vm fermium internal ip allocation [dns] - 10https://gerrit.wikimedia.org/r/229861 [19:11:46] yep [19:12:05] so, i'll be attempting the kafka upgrade again next week [19:12:13] which means i have the opportunity to rename these boxes if we want to. [19:12:20] right now they are just various analytics10xx boxes [19:12:32] (03CR) 10RobH: [C: 032] mailman ganeti vm fermium internal ip allocation [dns] - 10https://gerrit.wikimedia.org/r/229861 (owner: 10RobH) [19:12:35] maybe i should name them kafka10xx, or something like that? 
i know we don't like to name boxes after the tech [19:12:50] but rather the service they provide [19:12:56] but i'm not sure what else i would call these [19:13:02] OR, i could just keep them as analytics nodes. [19:13:08] analytics has the most unflattering shortening of the name ever. [19:13:12] an! [19:13:13] haha [19:13:13] or [19:13:17] i know what you mean : [19:13:18] :) [19:13:20] an-broker1001 [19:13:25] anal-broker! [19:13:26] heh [19:13:31] we don't need to keep an in the name i think [19:13:31] that was my first thought [19:13:33] but yes [19:13:38] heh [19:13:42] (03PS19) 10BBlack: beta: varnish backend/director for isolated security audits [puppet] - 10https://gerrit.wikimedia.org/r/158016 (https://phabricator.wikimedia.org/T72181) (owner: 10Dduvall) [19:13:52] broker10xx? hm. [19:14:06] i mean, why not kafka10xx in that case. [19:14:15] well, it is typically not wise to name them after the software but service, but if these are ONLY ever running kafka. [19:14:18] and they are in a cluster [19:14:19] also, i want to be careful, in case we decide to have multiple kafka clusters in eqiad in the future. i don't have a plan for that, but we might. [19:14:23] maybe an analytics and a prod cluster, who knows [19:14:25] i dont think kafka1001 is horrible. [19:14:48] (03CR) 10BBlack: "I've put cache bypassing back into the cache. I also figured out what was confusing me about X-Wikimedia-Debug's behavior and applied tha" [puppet] - 10https://gerrit.wikimedia.org/r/158016 (https://phabricator.wikimedia.org/T72181) (owner: 10Dduvall) [19:14:52] well, dbs are in 1001+ range and span different service groups [19:14:54] seems like the same issue [19:14:59] hm, true. [19:15:04] but yea, the kafka machines in analytics being different than the others sounds nice [19:15:09] if they are differing hardware, even more so. [19:15:23] akosiaris, bd808__ , mutante * hallo [19:15:25] lemme check something [19:15:29] did the train run today?
i think technically there will be no need to have multiple clusters in eqiad, but we might want to do so for organizational purposes [19:15:46] in case we want to have a higher SLA on a prod cluster or something [19:15:46] ottomata: I would suggest you guys map out your service host types like labs [19:15:51] https://wikitech.wikimedia.org/wiki/Infrastructure_naming_conventions [19:16:01] notice how labs has distinct groups, analytics seems much the same to me. [19:16:03] but meh [19:16:06] (03CR) 10BBlack: [C: 032] beta: varnish backend/director for isolated security audits [puppet] - 10https://gerrit.wikimedia.org/r/158016 (https://phabricator.wikimedia.org/T72181) (owner: 10Dduvall) [19:16:17] so having some kind of analytics tag to hostname is useful in that regard [19:16:26] particularly if you start having multiple kinds of analytics service groups [19:16:42] but since i dunno how many services you guys run now, its hard to plan [19:17:12] yeah, that is hard to plan, and the line between analytics/prod might not be so fine. especially with kafka [19:17:32] in those cases lets just call those kafka and not analytics anything indeed [19:17:53] k, yeah i think that makes sense, if we have another cluster we can just deal and look at the config to figure out which is which [19:17:54] and deal with the possible multiple kafka service groups when that day arrives [19:17:57] yep [19:18:06] pls update the https://wikitech.wikimedia.org/wiki/Infrastructure_naming_conventions =] [19:18:33] ah ok, cool [19:19:48] aharoni: yes but I think twentyafterfour reverted himself [19:20:16] o [19:20:19] ow [19:20:24] is it expected to run later?
[19:21:51] (03PS1) 10BBlack: Add empty eqiad security_audit backends [puppet] - 10https://gerrit.wikimedia.org/r/229863 [19:21:52] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Replace jdouglas's production ssh key - it matched labs key - https://phabricator.wikimedia.org/T108111#1515567 (10Andrew) [19:21:53] 10Ops-Access-Requests, 6operations: Ops offboarding for jdouglas - https://phabricator.wikimedia.org/T108203#1515565 (10Andrew) 5Open>3Resolved ok, removed from phab wmf_nda as well. [19:22:31] (03CR) 10BBlack: [C: 032 V: 032] Add empty eqiad security_audit backends [puppet] - 10https://gerrit.wikimedia.org/r/229863 (owner: 10BBlack) [19:22:55] aharoni: I'm fixing a small bug on wmf17 [19:23:00] twentyafterfour: thanks [19:23:00] and I'll push it out again [19:23:22] by small bug I mean an error that was severely flooding the logs [19:23:28] :) [19:23:40] we have a lot of important changes in ContentTranslation today and I'd love to test them in production as soon as they are deployed [19:24:06] PROBLEM - puppet last run on cp1065 is CRITICAL Puppet has 1 failures [19:25:04] 6operations, 6Commons, 10MediaWiki-File-management, 10MediaWiki-Tarball-Backports, and 7 others: InstantCommons broken by switch to HTTPS - https://phabricator.wikimedia.org/T102566#1515581 (10Tgr) Apparently git.wikimedia.org patch pages are HTML, not plaintext. How fun. So here's a command that works: `... [19:26:03] 6operations, 10vm-requests: eqiad: 1 VM request for mailman - https://phabricator.wikimedia.org/T108065#1515583 (10RobH) So slow... robh@ganeti1003:~$ sudo gnt-instance add \ > -t drbd \ > -I hail \ > --net 0:link=br0 \ > --hypervisor-parameters=kvm:boot_order=network \ > -o debootstrap+default \ >... [19:28:10] 6operations, 10vm-requests: eqiad: 1 VM request for mailman - https://phabricator.wikimedia.org/T108065#1515596 (10RobH) I'll note the wikitech directions have: -B vcpus=,memory=g --disk 0:size=g \ which is missing the , after the memory line. 
I don't think it matters though, as -- seems to be... [19:30:16] aharoni: I'm going to deploy right now [19:30:25] coool [19:32:00] (03PS1) 10BBlack: varnish: do not define empty static directors [puppet] - 10https://gerrit.wikimedia.org/r/229867 [19:32:13] (03CR) 10BBlack: [C: 032 V: 032] varnish: do not define empty static directors [puppet] - 10https://gerrit.wikimedia.org/r/229867 (owner: 10BBlack) [19:32:41] (03CR) 10Dzahn: "i was about to do it and had last-moment concerns about the NAS mounts there being covered" [puppet] - 10https://gerrit.wikimedia.org/r/229054 (https://phabricator.wikimedia.org/T104996) (owner: 10Dzahn) [19:34:57] !log twentyafterfour Synchronized php-1.26wmf17: sync hotfixes before deploying 1.26wmf17 to group2 (duration: 02m 18s) [19:35:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:35:52] (03PS1) 1020after4: all wikis to 1.26wmf17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/229877 [19:36:19] (03PS1) 10BBlack: another fixup for the security_audit empty backends thing... [puppet] - 10https://gerrit.wikimedia.org/r/229883 [19:36:21] (03CR) 1020after4: [C: 032] all wikis to 1.26wmf17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/229877 (owner: 1020after4) [19:36:28] (03Merged) 10jenkins-bot: all wikis to 1.26wmf17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/229877 (owner: 1020after4) [19:36:32] (03CR) 10BBlack: [C: 032 V: 032] another fixup for the security_audit empty backends thing... [puppet] - 10https://gerrit.wikimedia.org/r/229883 (owner: 10BBlack) [19:36:55] !log twentyafterfour rebuilt wikiversions.cdb and synchronized wikiversions files: all wikis to 1.26wmf17 [19:36:57] aharoni: done. [19:37:08] twentyafterfour: thanks, checking [19:38:06] (03PS1) 10BBlack: another fixup for the security_audit empty backends thing... [puppet] - 10https://gerrit.wikimedia.org/r/229910 [19:38:06] ... 
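Spelled out, the corrected form of the wikitech snippet robh flags above (adding the missing comma in `-B` and filling in placeholder values) would look something like the following; the sizes are hypothetical and the fermium hostname is taken from the DNS patch earlier in the log, not from a verified run:

```shell
# Corrected: -B needs the comma after vcpus and concrete values.
# Sizes below are placeholders, not what was actually provisioned.
sudo gnt-instance add \
    -t drbd \
    -I hail \
    --net 0:link=br0 \
    --hypervisor-parameters=kvm:boot_order=network \
    -o debootstrap+default \
    -B vcpus=1,memory=2g \
    --disk 0:size=40g \
    fermium.eqiad.wmnet
```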
[19:38:13] !log issuing nodetool cleanup on restbase1003 [19:38:17] endless debugging via live machines, such fun! [19:38:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:38:36] (03CR) 10BBlack: [C: 032 V: 032] another fixup for the security_audit empty backends thing... [puppet] - 10https://gerrit.wikimedia.org/r/229910 (owner: 10BBlack) [19:40:26] RECOVERY - puppet last run on cp1065 is OK Puppet is currently enabled, last run 50 seconds ago with 0 failures [19:41:04] 7Blocked-on-Operations, 7Puppet, 6operations, 10Beta-Cluster, and 3 others: Setup a dedicated mediawiki host in Beta Cluster that we can use for security scanning - https://phabricator.wikimedia.org/T72181#1515654 (10BBlack) Varnish part is merged (along with a bunch of minor followup fixes to figure out h... [19:41:17] (03CR) 10BryanDavis: [C: 031] Enable authmetrics logging on group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/229618 (https://phabricator.wikimedia.org/T91701) (owner: 10Gergő Tisza) [19:41:36] PROBLEM - puppet last run on cp3006 is CRITICAL puppet fail [19:42:31] 7Blocked-on-Operations, 7Puppet, 6operations, 10Beta-Cluster, and 3 others: Setup a dedicated mediawiki host in Beta Cluster that we can use for security scanning - https://phabricator.wikimedia.org/T72181#1515663 (10csteipp) Thank you! 
[19:42:47] !log twentyafterfour Synchronized php-1.26wmf17: actually deploy the hotfix this time (duration: 01m 33s)
[19:42:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:42:55] PROBLEM - puppet last run on cp2010 is CRITICAL: puppet fail
[19:44:46] PROBLEM - puppet last run on cp3030 is CRITICAL: puppet fail
[19:47:45] RECOVERY - puppet last run on cp3006 is OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures
[19:48:56] RECOVERY - puppet last run on cp3030 is OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[19:49:06] RECOVERY - puppet last run on cp2010 is OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[19:49:25] (03PS4) 10BBlack: Decom bits cluster varnish/lvs configuration [puppet] - 10https://gerrit.wikimedia.org/r/228034 (https://phabricator.wikimedia.org/T95448)
[19:50:43] (03CR) 10BBlack: [C: 032] Decom bits cluster varnish/lvs configuration [puppet] - 10https://gerrit.wikimedia.org/r/228034 (https://phabricator.wikimedia.org/T95448) (owner: 10BBlack)
[19:52:56] 10Ops-Access-Requests, 6operations: Access to stat1002 for csteipp - https://phabricator.wikimedia.org/T108227#1515691 (10csteipp)
[19:56:10] PROBLEM - Host bits-lb.esams.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100%
[19:56:16] PROBLEM - Host bits-lb.esams.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100%
[19:56:39] what's up?
[19:56:43] ignore the bits alerts, still trying to track them down
[19:56:45] thank you andrewbogott
[19:56:56] it's just syncup issues between decom and neon config
[19:57:40] (03PS2) 10Merlijn van Deen: Add IRC notice on exceptions [debs/adminbot] - 10https://gerrit.wikimedia.org/r/229758
[19:57:47] twentyafterfour: akosiaris * is there a way to know which version of cxserver is running in production?
[19:59:22] PROBLEM - Host bits-lb.ulsfo.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:863:ed1a::1:a
[19:59:46] where the hell are those defined as hosts?
[20:00:05] maybe it just needs more puppet runs to gather state from remote facts or something...
[20:00:24] PROBLEM - Host bits-lb.eqiad.wikimedia.org is DOWN: CRITICAL - Network Unreachable (208.80.154.234)
[20:00:30] PROBLEM - Host bits-lb.ulsfo.wikimedia.org is DOWN: CRITICAL - Network Unreachable (198.35.26.106)
[20:00:36] yeah that was it
[20:03:39] 6operations, 10Traffic, 5Patch-For-Review, 7Varnish: Move bits traffic to text/mobile clusters - https://phabricator.wikimedia.org/T95448#1515742 (10BBlack)
[20:03:56] 6operations, 10Traffic, 5Patch-For-Review, 7Varnish: Move bits traffic to text/mobile clusters - https://phabricator.wikimedia.org/T95448#1515743 (10BBlack) 5Open>3Resolved
[20:03:58] 6operations, 10Traffic: Expand misc cluster into cache PoPs - https://phabricator.wikimedia.org/T101339#1515745 (10BBlack)
[20:04:00] 6operations, 10Traffic: Upgrade eqiad-misc varnish cluster from 2 to 4 systems. - https://phabricator.wikimedia.org/T86718#1515748 (10BBlack)
[20:04:02] 6operations, 6Performance-Team, 10Traffic, 7Performance: Optimize prod's resource domains for SPDY/HTTP2 - https://phabricator.wikimedia.org/T94896#1515747 (10BBlack)
[20:06:47] !log ori Synchronized php-1.26wmf17/extensions/Flow: I2089b21fc: Updated mediawiki/core Project: mediawiki/extensions/Flow (duration: 00m 15s)
[20:06:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:07:52] (03PS1) 10BBlack: decom more bits-cluster stuff (hieradata) [puppet] - 10https://gerrit.wikimedia.org/r/229943 (https://phabricator.wikimedia.org/T86718)
[20:08:22] (03CR) 10BBlack: [C: 032 V: 032] decom more bits-cluster stuff (hieradata) [puppet] - 10https://gerrit.wikimedia.org/r/229943 (https://phabricator.wikimedia.org/T86718) (owner: 10BBlack)
[20:13:10] bblack: there is that "puppetstoredconfigs" script to make icinga forget hosts
[20:14:25] I just did
[20:14:34] (for the actual hosts)
[20:14:47] puppetstoredconfigclean.rb on palladium and then another puppet run on neon
[20:16:20] (03PS1) 10BBlack: old eqiad bits hosts -> cache::misc role [puppet] - 10https://gerrit.wikimedia.org/r/229944 (https://phabricator.wikimedia.org/T86718)
[20:16:30] PROBLEM - Cassandra database on xenon is CRITICAL: PROCS CRITICAL: 0 processes with UID = 111 (cassandra), command name java, args CassandraDaemon
[20:18:14] 7Blocked-on-Operations, 7Puppet, 6operations, 10Beta-Cluster, and 3 others: Setup a dedicated mediawiki host in Beta Cluster that we can use for security scanning - https://phabricator.wikimedia.org/T72181#1515863 (10dduvall) 5Open>3Resolved Thanks a ton, @BBlack @csteipp, Beta Cluster should be good...
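The decommissioned bits-lb service IPs kept alerting because the icinga host (neon) builds its config from puppet stored configs; the fix bblack describes above was to purge them. A hedged sketch of those two steps (hostname below is from the alerts; the script invocation is an assumption for this puppet-era setup):

```shell
# Step 1 (on the puppetmaster, palladium): purge the stale host from
# puppet's stored configs so it stops being exported to icinga
sudo puppetstoredconfigclean.rb bits-lb.esams.wikimedia.org

# Step 2 (on the icinga host, neon): re-run the agent so the icinga
# configuration is regenerated without the purged host
sudo puppet agent --test
```

As the log notes, until both steps run, icinga keeps checking hosts and service IPs that no longer exist.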
[20:19:51] (03PS2) 10Dzahn: bacula: enable firewall on helium [puppet] - 10https://gerrit.wikimedia.org/r/229054 (https://phabricator.wikimedia.org/T104996)
[20:20:09] PROBLEM - Cassandra database on cerium is CRITICAL: PROCS CRITICAL: 0 processes with UID = 111 (cassandra), command name java, args CassandraDaemon
[20:20:17] 6operations, 10Traffic, 7HTTPS, 5Patch-For-Review: Insecure POST traffic - https://phabricator.wikimedia.org/T105794#1515901 (10GWicke)
[20:20:19] 6operations, 10RESTBase, 10Traffic: Restbase insecure POST requests to MW api.php - https://phabricator.wikimedia.org/T107030#1515899 (10GWicke) 5Open>3Resolved Since yesterday's deploy RESTBase is now directly using http://api.svc.eqiad.wmnet/, without going through Varnish. The config for the backend r...
[20:21:18] (03CR) 10BBlack: [C: 032] old eqiad bits hosts -> cache::misc role [puppet] - 10https://gerrit.wikimedia.org/r/229944 (https://phabricator.wikimedia.org/T86718) (owner: 10BBlack)
[20:21:29] PROBLEM - Cassandra database on praseodymium is CRITICAL: PROCS CRITICAL: 0 processes with UID = 111 (cassandra), command name java, args CassandraDaemon
[20:22:35] (03PS3) 10Dzahn: bacula: enable firewall on helium [puppet] - 10https://gerrit.wikimedia.org/r/229054 (https://phabricator.wikimedia.org/T104996)
[20:24:03] (03CR) 10Dzahn: [C: 032] bacula: enable firewall on helium [puppet] - 10https://gerrit.wikimedia.org/r/229054 (https://phabricator.wikimedia.org/T104996) (owner: 10Dzahn)
[20:24:47] bblack: should i merge both on palladium?
[20:26:11] 6operations, 5Patch-For-Review: Ferm rules for backup roles - https://phabricator.wikimedia.org/T104996#1515938 (10Dzahn) p:5Triage>3Normal
[20:26:19] PROBLEM - puppet last run on netmon1001 is CRITICAL: puppet fail
[20:26:29] !log es-tool restart-fast on elastic1031 to test alerting issues
[20:26:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:27:12] netmon1001 = Failed to parse template torrus/varnish.xml.erb:
[20:28:02] Could not find data item cache::bits::nodes in any Hiera ...
[20:28:12] mutante: err, yes, did
[20:28:21] I'll fix torrus in a bit
[20:28:39] bblack: ok, thanks
[20:29:42] watches firewall being applied on a poolcounter / the bacula server
[20:30:59] PROBLEM - Apache HTTP on mw1116 is CRITICAL - Socket timeout after 10 seconds
[20:30:59] PROBLEM - HHVM rendering on mw1121 is CRITICAL - Socket timeout after 10 seconds
[20:31:00] PROBLEM - HHVM rendering on mw1129 is CRITICAL - Socket timeout after 10 seconds
[20:31:22] and removed it again because ferm failed to start
[20:32:29] PROBLEM - HHVM rendering on mw1114 is CRITICAL - Socket timeout after 10 seconds
[20:32:29] PROBLEM - HHVM rendering on mw1143 is CRITICAL - Socket timeout after 10 seconds
[20:32:29] RECOVERY - Apache HTTP on mw1116 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 2.756 second response time
[20:32:38] RECOVERY - HHVM rendering on mw1129 is OK: HTTP OK: HTTP/1.1 200 OK - 66157 bytes in 7.627 second response time
[20:32:38] PROBLEM - Apache HTTP on mw1120 is CRITICAL - Socket timeout after 10 seconds
[20:32:38] PROBLEM - Apache HTTP on mw1189 is CRITICAL - Socket timeout after 10 seconds
[20:33:38] PROBLEM - HHVM rendering on mw1134 is CRITICAL - Socket timeout after 10 seconds
[20:33:48] PROBLEM - Apache HTTP on mw1145 is CRITICAL - Socket timeout after 10 seconds
[20:33:48] PROBLEM - HHVM rendering on mw1142 is CRITICAL - Socket timeout after 10 seconds
[20:33:49] PROBLEM - Apache HTTP on mw1192 is CRITICAL - Socket timeout after 10 seconds
[20:33:49] PROBLEM - HHVM rendering on mw1234 is CRITICAL - Socket timeout after 10 seconds
[20:33:49] PROBLEM - Apache HTTP on mw1229 is CRITICAL - Socket timeout after 10 seconds
[20:33:49] PROBLEM - Apache HTTP on mw1226 is CRITICAL - Socket timeout after 10 seconds
[20:33:49] PROBLEM - HHVM rendering on mw1232 is CRITICAL - Socket timeout after 10 seconds
[20:33:50] PROBLEM - HHVM rendering on mw1197 is CRITICAL - Socket timeout after 10 seconds
[20:33:50] PROBLEM - Apache HTTP on mw1233 is CRITICAL - Socket timeout after 10 seconds
[20:33:51] PROBLEM - Apache HTTP on mw1224 is CRITICAL - Socket timeout after 10 seconds
[20:33:51] PROBLEM - HHVM rendering on mw1230 is CRITICAL - Socket timeout after 10 seconds
[20:33:52] PROBLEM - Apache HTTP on mw1128 is CRITICAL - Socket timeout after 10 seconds
[20:33:52] PROBLEM - Apache HTTP on mw1133 is CRITICAL - Socket timeout after 10 seconds
[20:33:53] PROBLEM - Apache HTTP on mw1139 is CRITICAL - Socket timeout after 10 seconds
[20:34:08] PROBLEM - Apache HTTP on mw1204 is CRITICAL - Socket timeout after 10 seconds
[20:34:08] PROBLEM - HHVM rendering on mw1131 is CRITICAL - Socket timeout after 10 seconds
[20:34:08] PROBLEM - Apache HTTP on mw1199 is CRITICAL - Socket timeout after 10 seconds
[20:34:08] PROBLEM - Apache HTTP on mw1230 is CRITICAL - Socket timeout after 10 seconds
[20:34:08] PROBLEM - HHVM rendering on mw1127 is CRITICAL - Socket timeout after 10 seconds
[20:34:09] PROBLEM - Apache HTTP on mw1124 is CRITICAL - Socket timeout after 10 seconds
[20:34:09] PROBLEM - Apache HTTP on mw1205 is CRITICAL - Socket timeout after 10 seconds
[20:34:10] PROBLEM - Apache HTTP on mw1143 is CRITICAL - Socket timeout after 10 seconds
[20:34:10] PROBLEM - Apache HTTP on mw1132 is CRITICAL - Socket timeout after 10 seconds
[20:34:11] PROBLEM - Apache HTTP on mw1194 is CRITICAL - Socket timeout after 10 seconds
[20:34:11] PROBLEM - HHVM rendering on mw1140 is CRITICAL - Socket timeout after 10 seconds
[20:34:18] PROBLEM - Apache HTTP on mw1136 is CRITICAL - Socket timeout after 10 seconds
[20:34:18] PROBLEM - Apache HTTP on mw1148 is CRITICAL - Socket timeout after 10 seconds
[20:34:18] PROBLEM - Apache HTTP on mw1140 is CRITICAL - Socket timeout after 10 seconds
[20:34:28] RECOVERY - Apache HTTP on mw1120 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.226 second response time
[20:34:29] PROBLEM - Apache HTTP on mw1127 is CRITICAL - Socket timeout after 10 seconds
[20:34:29] PROBLEM - Apache HTTP on mw1131 is CRITICAL - Socket timeout after 10 seconds
[20:34:29] RECOVERY - Apache HTTP on mw1189 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 1.109 second response time
[20:35:04] ^ is this from the poolcounter fw?
[20:35:19] RECOVERY - HHVM rendering on mw1134 is OK: HTTP OK: HTTP/1.1 200 OK - 66157 bytes in 0.166 second response time
[20:35:39] PROBLEM - Restbase endpoints health on restbase1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:35:39] PROBLEM - Restbase endpoints health on restbase1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:35:39] PROBLEM - Restbase endpoints health on restbase1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:35:46] uhhhh
[20:35:58] mutante: poolcounter fw?
[20:35:59] RECOVERY - Apache HTTP on mw1204 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 9.611 second response time
[20:35:59] PROBLEM - Restbase endpoints health on cerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:35:59] RECOVERY - Apache HTTP on mw1124 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 6.086 second response time
[20:35:59] RECOVERY - Apache HTTP on mw1148 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 1.148 second response time
[20:36:00] RECOVERY - Apache HTTP on mw1205 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 6.702 second response time
[20:36:01] bblack: i wouldnt think so, there are no rules and packets are accepted
[20:36:09] PROBLEM - Restbase endpoints health on restbase1002 is CRITICAL: /page/graph/png/{title}/{revision}/{graph_id} is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200)
[20:36:09] PROBLEM - HHVM rendering on mw1225 is CRITICAL - Socket timeout after 10 seconds
[20:36:09] RECOVERY - Apache HTTP on mw1136 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 6.208 second response time
[20:36:10] RECOVERY - Apache HTTP on mw1140 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 9.486 second response time
[20:36:18] RECOVERY - Apache HTTP on mw1127 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.408 second response time
[20:36:18] PROBLEM - HHVM rendering on mw1224 is CRITICAL - Socket timeout after 10 seconds
[20:36:24] wtf
[20:36:28] RECOVERY - HHVM rendering on mw1143 is OK: HTTP OK: HTTP/1.1 200 OK - 66157 bytes in 5.791 second response time
[20:36:30] PROBLEM - Restbase endpoints health on restbase1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:36:30] PROBLEM - Restbase endpoints health on praseodymium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:36:39] PROBLEM - Apache HTTP on mw1116 is CRITICAL - Socket timeout after 10 seconds
[20:36:39] PROBLEM - HHVM rendering on mw1129 is CRITICAL - Socket timeout after 10 seconds
[20:36:39] PROBLEM - HHVM rendering on mw1200 is CRITICAL - Socket timeout after 10 seconds
[20:36:40] PROBLEM - Restbase endpoints health on restbase1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:36:45] the RB failures are caused by the API
[20:36:48] PROBLEM - HHVM rendering on mw1229 is CRITICAL - Socket timeout after 10 seconds
[20:36:49] PROBLEM - Apache HTTP on mw1197 is CRITICAL - Socket timeout after 10 seconds
[20:36:58] PROBLEM - HHVM busy threads on mw1227 is CRITICAL: 60.00% of data above the critical threshold [115.2]
[20:37:05] heh?
[20:37:07] there were RB failures before the rendering spam
[20:37:08] 20:16 < icinga-wm> PROBLEM - Cassandra database on xenon is CRITICAL: PROCS CRITICAL: 0 processes with UID = 111 (cassandra), command name java, args CassandraDaemon
[20:37:13] (etc for the other hosts)
[20:37:17] that's the staging cluster
[20:37:21] ok
[20:37:25] I changed some configs there for testing
[20:37:27] so unrelated
[20:37:28] PROBLEM - HHVM busy threads on mw1122 is CRITICAL: 55.56% of data above the critical threshold [86.4]
[20:37:28] PROBLEM - HHVM busy threads on mw1226 is CRITICAL: 75.00% of data above the critical threshold [115.2]
[20:37:39] PROBLEM - HHVM busy threads on mw1207 is CRITICAL: 62.50% of data above the critical threshold [115.2]
[20:37:39] PROBLEM - Restbase endpoints health on restbase1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
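The "Cassandra database ... PROCS CRITICAL: 0 processes with UID = 111 (cassandra), command name java, args CassandraDaemon" alerts quoted above come from the standard Nagios `check_procs` plugin. A hedged reconstruction of the check definition (the plugin path and exact thresholds are assumptions; the match criteria are taken from the alert text itself):

```shell
# Assumed reconstruction of the icinga process check behind the
# "Cassandra database" alerts:
#   -c 1:  -> CRITICAL when fewer than 1 matching process exists
#   -u     -> only processes owned by the cassandra user (UID 111 here)
#   -C     -> command name must be "java"
#   -a     -> argument list must contain "CassandraDaemon"
/usr/lib/nagios/plugins/check_procs -c 1: -u cassandra -C java -a CassandraDaemon
```

This is why the check fires on hosts where Cassandra was stopped deliberately (the staging cluster here): the plugin only counts processes, it knows nothing about intent.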
[20:37:39] RECOVERY - Apache HTTP on mw1226 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.033 second response time
[20:37:39] RECOVERY - Apache HTTP on mw1233 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.062 second response time
[20:37:39] RECOVERY - Apache HTTP on mw1128 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.050 second response time
[20:37:40] RECOVERY - Apache HTTP on mw1139 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.059 second response time
[20:37:40] PROBLEM - SSH on helium is CRITICAL: Connection timed out
[20:37:43] prod os restbase100X
[20:37:45] *is
[20:37:46] i definitely disabled any ferm changes on helium already up there
[20:37:48] PROBLEM - puppet last run on helium is CRITICAL: Timeout while attempting connection
[20:37:48] RECOVERY - Apache HTTP on mw1206 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.043 second response time
[20:37:50] RECOVERY - HHVM rendering on mw1189 is OK: HTTP OK: HTTP/1.1 200 OK - 66157 bytes in 0.146 second response time
[20:37:55] recovery?
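The SSH timeout on helium right after the ferm rollout is the classic remote-firewall lockout. ferm itself ships two safety nets that would have contained this; a hedged sketch (the flags are ferm's documented options, the config path is an assumption):

```shell
# Dry run: parse the ruleset and print the iptables commands it would
# issue, without touching the live firewall
ferm --noexec --lines /etc/ferm/ferm.conf

# Interactive apply: ferm installs the rules, then rolls them back
# automatically unless the operator confirms within the timeout, so an
# accidental SSH lockout heals itself
ferm --interactive --timeout 30 /etc/ferm/ferm.conf
```

With `--interactive`, the 20:37-20:43 window where helium's SSH, puppet, salt-minion, and bacula-dir checks all timed out would have been bounded by the confirmation timeout instead of requiring console/revert access.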
[20:37:58] RECOVERY - HHVM rendering on mw1131 is OK: HTTP OK: HTTP/1.1 200 OK - 66157 bytes in 0.136 second response time
[20:37:58] RECOVERY - HHVM rendering on mw1127 is OK: HTTP OK: HTTP/1.1 200 OK - 66157 bytes in 0.134 second response time
[20:37:59] RECOVERY - HHVM rendering on mw1146 is OK: HTTP OK: HTTP/1.1 200 OK - 66157 bytes in 7.992 second response time
[20:37:59] PROBLEM - HHVM busy threads on mw1124 is CRITICAL: 50.00% of data above the critical threshold [86.4]
[20:37:59] RECOVERY - Apache HTTP on mw1143 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.045 second response time
[20:37:59] RECOVERY - Apache HTTP on mw1132 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.041 second response time
[20:38:00] * greg-g crosses fingers
[20:38:08] RECOVERY - Apache HTTP on mw1199 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 7.205 second response time
[20:38:09] limited recovery so far
[20:38:10] RECOVERY - HHVM rendering on mw1224 is OK: HTTP OK: HTTP/1.1 200 OK - 66157 bytes in 3.811 second response time
[20:38:10] https://en.wikipedia.org/w/api.php still errors for me
[20:38:12] but slowly I guess
[20:38:13] mutante: what commit was the poolcounter thing?
[20:38:18] PROBLEM - HHVM rendering on mw1116 is CRITICAL - Socket timeout after 10 seconds
[20:38:18] RECOVERY - Apache HTTP on mw1131 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.046 second response time
[20:38:19] (03PS1) 10Dzahn: Revert "bacula: enable firewall on helium" [puppet] - 10https://gerrit.wikimedia.org/r/229946
[20:38:22] PROBLEM - HHVM busy threads on mw1130 is CRITICAL: 57.14% of data above the critical threshold [86.4]
[20:38:23] PROBLEM - HHVM busy threads on mw1126 is CRITICAL: 85.71% of data above the critical threshold [86.4]
[20:38:28] PROBLEM - Apache HTTP on mw1114 is CRITICAL - Socket timeout after 10 seconds
[20:38:28] PROBLEM - HHVM rendering on mw1190 is CRITICAL - Socket timeout after 10 seconds
[20:38:28] we are probably now in a dance of hhvm overload
[20:38:29] PROBLEM - Apache HTTP on mw1227 is CRITICAL - Socket timeout after 10 seconds
[20:38:36] Pool error for 2607:FB90:C28:3312:0:30:A8F2:6001 on key CirrusSearch-Prefix:_elasticsearch_enwiki during prefix search for 'Sierra': pool-queuefull [Called from Closure$CirrusSearch\Searcher::search#3 in /srv/medi
[20:38:37] bblack: that ^
[20:38:38] awiki/php-1.26wmf17/extensions/CirrusSearch/includes/Searcher.php at line 1183]
[20:38:38] PROBLEM - HHVM busy threads on mw1232 is CRITICAL: 100.00% of data above the critical threshold [115.2]
[20:38:38] RECOVERY - HHVM rendering on mw1121 is OK: HTTP OK: HTTP/1.1 200 OK - 66157 bytes in 1.365 second response time
[20:38:39] PROBLEM - HHVM busy threads on mw1147 is CRITICAL: 60.00% of data above the critical threshold [86.4]
[20:38:49] PROBLEM - bacula director process on helium is CRITICAL: Timeout while attempting connection
[20:38:49] PROBLEM - HHVM busy threads on mw1148 is CRITICAL: 62.50% of data above the critical threshold [86.4]
[20:38:49] PROBLEM - HHVM busy threads on mw1119 is CRITICAL: 77.78% of data above the critical threshold [86.4]
[20:38:50] oh helium
[20:38:52] ok
[20:38:59] PROBLEM - HHVM queue size on mw1232 is CRITICAL: 33.33% of data above the critical threshold [80.0]
[20:38:59] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: 50.00% of data above the critical threshold [500.0]
[20:39:09] PROBLEM - HHVM busy threads on mw1202 is CRITICAL: 88.89% of data above the critical threshold [115.2]
[20:39:19] PROBLEM - HHVM busy threads on mw1143 is CRITICAL: 85.71% of data above the critical threshold [86.4]
[20:39:20] PROBLEM - HHVM busy threads on mw1136 is CRITICAL: 83.33% of data above the critical threshold [86.4]
[20:39:29] PROBLEM - HHVM queue size on mw1190 is CRITICAL: 55.56% of data above the critical threshold [80.0]
[20:39:38] PROBLEM - HHVM busy threads on mw1123 is CRITICAL: 87.50% of data above the critical threshold [86.4]
[20:39:38] PROBLEM - HHVM busy threads on mw1201 is CRITICAL: 62.50% of data above the critical threshold [115.2]
[20:39:39] PROBLEM - HHVM rendering on mw1144 is CRITICAL - Socket timeout after 10 seconds
[20:39:39] PROBLEM - HHVM busy threads on mw1137 is CRITICAL: 66.67% of data above the critical threshold [86.4]
[20:39:39] PROBLEM - HHVM busy threads on mw1127 is CRITICAL: 87.50% of data above the critical threshold [86.4]
[20:39:40] PROBLEM - HHVM busy threads on mw1116 is CRITICAL: 87.50% of data above the critical threshold [86.4]
[20:39:48] RECOVERY - Apache HTTP on mw1192 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.050 second response time
[20:39:49] PROBLEM - HHVM busy threads on mw1192 is CRITICAL: 85.71% of data above the critical threshold [115.2]
[20:39:49] RECOVERY - HHVM rendering on mw1234 is OK: HTTP OK: HTTP/1.1 200 OK - 66156 bytes in 2.934 second response time
[20:39:52] going to restart apache && hhvm on a couple to see if that recovers them
[20:39:54] FWIW, cached stuff is still fine. this is all just API/uncached-ish things
[20:39:58] PROBLEM - HHVM busy threads on mw1129 is CRITICAL: 88.89% of data above the critical threshold [86.4]
[20:39:59] RECOVERY - Apache HTTP on mw1122 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 7.153 second response time
[20:39:59] PROBLEM - HHVM busy threads on mw1145 is CRITICAL: 50.00% of data above the critical threshold [86.4]
[20:40:05] bblack: i'll do the revert but i already flushed the rules a minute ago anyways
[20:40:08] RECOVERY - HHVM rendering on mw1225 is OK: HTTP OK: HTTP/1.1 200 OK - 66156 bytes in 0.630 second response time
[20:40:09] PROBLEM - HHVM busy threads on mw1117 is CRITICAL: 100.00% of data above the critical threshold [86.4]
[20:40:09] PROBLEM - HHVM busy threads on mw1115 is CRITICAL: 80.00% of data above the critical threshold [86.4]
[20:40:17] I can't even log into helium
[20:40:18] RECOVERY - Restbase endpoints health on restbase1002 is OK: All endpoints are healthy
[20:40:19] RECOVERY - HHVM rendering on mw1116 is OK: HTTP OK: HTTP/1.1 200 OK - 66157 bytes in 9.014 second response time
[20:40:19] PROBLEM - Apache HTTP on mw1140 is CRITICAL - Socket timeout after 10 seconds
[20:40:20] PROBLEM - HHVM busy threads on mw1140 is CRITICAL: 100.00% of data above the critical threshold [86.4]
[20:40:24] (03CR) 10Dzahn: [C: 032] Revert "bacula: enable firewall on helium" [puppet] - 10https://gerrit.wikimedia.org/r/229946 (owner: 10Dzahn)
[20:40:28] PROBLEM - Restbase endpoints health on restbase1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:40:28] PROBLEM - HHVM busy threads on mw1203 is CRITICAL: 71.43% of data above the critical threshold [115.2]
[20:40:28] PROBLEM - HHVM busy threads on mw1205 is CRITICAL: 100.00% of data above the critical threshold [115.2]
[20:40:29] RECOVERY - Apache HTTP on mw1114 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 6.957 second response time
[20:40:29] RECOVERY - HHVM rendering on mw1190 is OK: HTTP OK: HTTP/1.1 200 OK - 66157 bytes in 3.178 second response time
[20:40:29] RECOVERY - HHVM rendering on mw1114 is OK: HTTP OK: HTTP/1.1 200 OK - 66157 bytes in 3.314 second response time
[20:40:30] PROBLEM - Apache HTTP on mw1119 is CRITICAL - Socket timeout after 10 seconds
[20:40:30] PROBLEM - Apache HTTP on mw1144 is CRITICAL - Socket timeout after 10 seconds
[20:40:30] RECOVERY - HHVM rendering on mw1129 is OK: HTTP OK: HTTP/1.1 200 OK - 66157 bytes in 0.137 second response time
[20:40:39] PROBLEM - HHVM busy threads on mw1125 is CRITICAL: 100.00% of data above the critical threshold [86.4]
[20:40:48] RECOVERY - bacula director process on helium is OK: PROCS OK: 1 process with UID = 110 (bacula), command name bacula-dir
[20:40:48] RECOVERY - HHVM rendering on mw1229 is OK: HTTP OK: HTTP/1.1 200 OK - 66156 bytes in 4.588 second response time
[20:40:49] PROBLEM - HHVM queue size on mw1140 is CRITICAL: 50.00% of data above the critical threshold [80.0]
[20:40:49] PROBLEM - HHVM busy threads on mw1142 is CRITICAL: 50.00% of data above the critical threshold [86.4]
[20:40:49] PROBLEM - HHVM busy threads on mw1230 is CRITICAL: 75.00% of data above the critical threshold [115.2]
[20:40:49] PROBLEM - HHVM busy threads on mw1197 is CRITICAL: 88.89% of data above the critical threshold [115.2]
[20:40:58] PROBLEM - HHVM busy threads on mw1131 is CRITICAL: 80.00% of data above the critical threshold [86.4]
[20:41:09] PROBLEM - HHVM busy threads on mw1114 is CRITICAL: 80.00% of data above the critical threshold [86.4]
[20:41:09] PROBLEM - HHVM queue size on mw1191 is CRITICAL: 57.14% of data above the critical threshold [80.0]
[20:41:10] PROBLEM - HHVM busy threads on mw1234 is CRITICAL: 71.43% of data above the critical threshold [115.2]
[20:41:10] PROBLEM - HHVM queue size on mw1134 is CRITICAL: 33.33% of data above the critical threshold [80.0]
[20:41:10] PROBLEM - HHVM busy threads on mw1189 is CRITICAL: 44.44% of data above the critical threshold [115.2]
[20:41:19] PROBLEM - HHVM busy threads on mw1235 is CRITICAL: 80.00% of data above the critical threshold [115.2]
[20:41:19] PROBLEM - HHVM busy threads on mw1128 is CRITICAL: 100.00% of data above the critical threshold [86.4]
[20:41:20] PROBLEM - HHVM busy threads on mw1194 is CRITICAL: 100.00% of data above the critical threshold [115.2]
[20:41:20] PROBLEM - HHVM queue size on mw1198 is CRITICAL: 33.33% of data above the critical threshold [80.0]
[20:41:20] PROBLEM - HHVM busy threads on mw1139 is CRITICAL: 83.33% of data above the critical threshold [86.4]
[20:41:28] PROBLEM - HHVM busy threads on mw1233 is CRITICAL: 85.71% of data above the critical threshold [115.2]
[20:41:29] PROBLEM - HHVM busy threads on mw1228 is CRITICAL: 100.00% of data above the critical threshold [115.2]
[20:41:29] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: Anomaly detected: 11 data above and 0 below the confidence bounds
[20:41:29] PROBLEM - HHVM queue size on mw1230 is CRITICAL: 55.56% of data above the critical threshold [80.0]
[20:41:29] PROBLEM - HHVM busy threads on mw1146 is CRITICAL: 87.50% of data above the critical threshold [86.4]
[20:41:29] PROBLEM - HHVM busy threads on mw1135 is CRITICAL: 85.71% of data above the critical threshold [86.4]
[20:41:29] PROBLEM - HHVM busy threads on mw1225 is CRITICAL: 100.00% of data above the critical threshold [115.2]
[20:41:30] PROBLEM - HHVM queue size on mw1114 is CRITICAL: 62.50% of data above the critical threshold [80.0]
[20:41:30] PROBLEM - HHVM busy threads on mw1208 is CRITICAL: 100.00% of data above the critical threshold [115.2]
[20:41:31] PROBLEM - HHVM busy threads on mw1221 is CRITICAL: 50.00% of data above the critical threshold [115.2]
[20:41:38] PROBLEM - HHVM rendering on mw1207 is CRITICAL - Socket timeout after 10 seconds
[20:41:39] PROBLEM - HHVM busy threads on mw1222 is CRITICAL: 85.71% of data above the critical threshold [115.2]
[20:41:39] PROBLEM - HHVM rendering on mw1139 is CRITICAL - Socket timeout after 10 seconds
[20:41:39] PROBLEM - Apache HTTP on mw1138 is CRITICAL - Socket timeout after 10 seconds
[20:41:39] PROBLEM - HHVM rendering on mw1130 is CRITICAL - Socket timeout after 10 seconds
[20:41:39] PROBLEM - HHVM rendering on mw1123 is CRITICAL - Socket timeout after 10 seconds
[20:41:39] PROBLEM - Apache HTTP on mw1234 is CRITICAL - Socket timeout after 10 seconds
[20:41:48] PROBLEM - HHVM busy threads on mw1138 is CRITICAL: 100.00% of data above the critical threshold [86.4]
[20:41:48] PROBLEM - HHVM busy threads on mw1121 is CRITICAL: 62.50% of data above the critical threshold [86.4]
[20:41:48] PROBLEM - HHVM busy threads on mw1134 is CRITICAL: 50.00% of data above the critical threshold [86.4]
[20:41:48] PROBLEM - HHVM queue size on mw1119 is CRITICAL: 44.44% of data above the critical threshold [80.0]
[20:41:49] RECOVERY - HHVM rendering on mw1232 is OK: HTTP OK: HTTP/1.1 200 OK - 66157 bytes in 0.135 second response time
[20:41:49] RECOVERY - HHVM rendering on mw1197 is OK: HTTP OK: HTTP/1.1 200 OK - 66157 bytes in 0.149 second response time
[20:41:49] PROBLEM - HHVM queue size on mw1205 is CRITICAL: 44.44% of data above the critical threshold [80.0]
[20:41:50] RECOVERY - Apache HTTP on mw1229 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 2.465 second response time
[20:41:59] PROBLEM - Apache HTTP on mw1233 is CRITICAL - Socket timeout after 10 seconds
[20:41:59] PROBLEM - Apache HTTP on mw1128 is CRITICAL - Socket timeout after 10 seconds
[20:41:59] PROBLEM - Apache HTTP on mw1139 is CRITICAL - Socket timeout after 10 seconds
[20:41:59] PROBLEM - HHVM queue size on mw1225 is CRITICAL: 75.00% of data above the critical threshold [80.0]
[20:41:59] PROBLEM - HHVM queue size on mw1125 is CRITICAL: 33.33% of data above the critical threshold [80.0]
[20:42:00] PROBLEM - HHVM busy threads on mw1229 is CRITICAL: 85.71% of data above the critical threshold [115.2]
[20:42:00] PROBLEM - HHVM busy threads on mw1206 is CRITICAL: 85.71% of data above the critical threshold [115.2]
[20:42:01] PROBLEM - HHVM busy threads on mw1133 is CRITICAL: 88.89% of data above the critical threshold [86.4]
[20:42:01] PROBLEM - Apache HTTP on mw1206 is CRITICAL - Socket timeout after 10 seconds
[20:42:07] since this aligned with a change to poolcounter, https://phabricator.wikimedia.org/T105378 comes to mind.
[20:42:08] PROBLEM - HHVM queue size on mw1116 is CRITICAL: 75.00% of data above the critical threshold [80.0]
[20:42:08] PROBLEM - HHVM queue size on mw1233 is CRITICAL: 33.33% of data above the critical threshold [80.0]
[20:42:09] PROBLEM - HHVM queue size on mw1135 is CRITICAL: 33.33% of data above the critical threshold [80.0]
[20:42:09] PROBLEM - HHVM busy threads on mw1198 is CRITICAL: 66.67% of data above the critical threshold [115.2]
[20:42:10] PROBLEM - HHVM busy threads on mw1120 is CRITICAL: 50.00% of data above the critical threshold [86.4]
[20:42:10] PROBLEM - HHVM busy threads on mw1132 is CRITICAL: 100.00% of data above the critical threshold [86.4]
[20:42:10] PROBLEM - HHVM queue size on mw1192 is CRITICAL: 37.50% of data above the critical threshold [80.0]
[20:42:10] PROBLEM - HHVM busy threads on mw1231 is CRITICAL: 75.00% of data above the critical threshold [115.2]
[20:42:10] bblack: i'm on the shell, i'm running puppet, there are all kinds of puppet errors now
[20:42:18] PROBLEM - HHVM busy threads on mw1193 is CRITICAL: 83.33% of data above the critical threshold [115.2]
[20:42:18] PROBLEM - Apache HTTP on mw1143 is CRITICAL - Socket timeout after 10 seconds
[20:42:19] RECOVERY - HHVM rendering on mw1140 is OK: HTTP OK: HTTP/1.1 200 OK - 66157 bytes in 9.823 second response time
[20:42:19] PROBLEM - HHVM rendering on mw1191 is CRITICAL - Socket timeout after 10 seconds
[20:42:19] RECOVERY - Apache HTTP on mw1140 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 5.571 second response time
[20:42:28] PROBLEM - HHVM queue size on mw1128 is CRITICAL: 33.33% of data above the critical threshold [80.0]
[20:42:28] PROBLEM - HHVM queue size on mw1145 is CRITICAL: 33.33% of data above the critical threshold [80.0]
[20:42:28] PROBLEM - HHVM busy threads on mw1223 is CRITICAL: 57.14% of data above the critical threshold [115.2]
[20:42:29] PROBLEM - HHVM queue size on mw1200 is CRITICAL: 37.50% of data above the critical threshold [80.0]
[20:42:38] PROBLEM - Apache HTTP on mw1131 is CRITICAL - Socket timeout after 10 seconds
[20:42:38] PROBLEM - HHVM busy threads on mw1199 is CRITICAL: 44.44% of data above the critical threshold [115.2]
[20:42:38] PROBLEM - HHVM queue size on mw1132 is CRITICAL: 42.86% of data above the critical threshold [80.0]
[20:42:38] PROBLEM - HHVM queue size on mw1228 is CRITICAL: 77.78% of data above the critical threshold [80.0]
[20:42:39] PROBLEM - HHVM queue size on mw1115 is CRITICAL: 42.86% of data above the critical threshold [80.0]
[20:42:39] PROBLEM - HHVM busy threads on mw1191 is CRITICAL: 87.50% of data above the critical threshold [115.2]
[20:42:39] PROBLEM - HHVM rendering on mw1143 is CRITICAL - Socket timeout after 10 seconds
[20:42:40] PROBLEM - HHVM busy threads on mw1190 is CRITICAL: 100.00% of data above the critical threshold [115.2]
[20:42:40] RECOVERY - Restbase endpoints health on restbase1001 is OK: All endpoints are healthy
[20:42:48] PROBLEM - HHVM queue size on mw1207 is CRITICAL: 33.33% of data above the critical threshold [80.0]
[20:42:48] PROBLEM - HHVM busy threads on mw1224 is CRITICAL: 50.00% of data above the critical threshold [115.2]
[20:42:48] PROBLEM - HHVM rendering on mw1128 is CRITICAL - Socket timeout after 10 seconds
[20:42:49] RECOVERY - Apache HTTP on mw1197 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 2.552 second response time [20:42:49] PROBLEM - salt-minion processes on helium is CRITICAL: Timeout while attempting connection [20:42:49] PROBLEM - HHVM busy threads on mw1195 is CRITICAL 83.33% of data above the critical threshold [115.2] [20:42:49] PROBLEM - HHVM busy threads on mw1204 is CRITICAL 50.00% of data above the critical threshold [115.2] [20:42:50] PROBLEM - HHVM queue size on mw1208 is CRITICAL 50.00% of data above the critical threshold [80.0] [20:42:50] PROBLEM - HHVM queue size on mw1138 is CRITICAL 33.33% of data above the critical threshold [80.0] [20:43:08] PROBLEM - HHVM rendering on mw1154 is CRITICAL - Socket timeout after 10 seconds [20:43:09] PROBLEM - HHVM busy threads on mw1144 is CRITICAL 100.00% of data above the critical threshold [86.4] [20:43:09] PROBLEM - HHVM queue size on mw1227 is CRITICAL 57.14% of data above the critical threshold [80.0] [20:43:09] RECOVERY - HHVM queue size on mw1134 is OK Less than 30.00% above the threshold [10.0] [20:43:28] PROBLEM - HHVM queue size on mw1117 is CRITICAL 44.44% of data above the critical threshold [80.0] [20:43:28] RECOVERY - HHVM queue size on mw1198 is OK Less than 30.00% above the threshold [10.0] [20:43:29] RECOVERY - HHVM rendering on mw1130 is OK: HTTP OK: HTTP/1.1 200 OK - 66157 bytes in 0.642 second response time [20:43:29] RECOVERY - HHVM rendering on mw1123 is OK: HTTP OK: HTTP/1.1 200 OK - 66157 bytes in 0.785 second response time [20:43:29] PROBLEM - HHVM queue size on mw1197 is CRITICAL 33.33% of data above the critical threshold [80.0] [20:43:29] RECOVERY - HHVM rendering on mw1207 is OK: HTTP OK: HTTP/1.1 200 OK - 66156 bytes in 4.940 second response time [20:43:30] RECOVERY - HHVM rendering on mw1144 is OK: HTTP OK: HTTP/1.1 200 OK - 66165 bytes in 1.272 second response time [20:43:30] RECOVERY - Apache HTTP on mw1234 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 
440 bytes in 0.050 second response time [20:43:30] RECOVERY - Apache HTTP on mw1138 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 1.756 second response time [20:43:31] RECOVERY - HHVM rendering on mw1139 is OK: HTTP OK: HTTP/1.1 200 OK - 66157 bytes in 1.932 second response time [20:43:40] RECOVERY - Restbase endpoints health on restbase1007 is OK: All endpoints are healthy [20:43:40] RECOVERY - Restbase endpoints health on restbase1006 is OK: All endpoints are healthy [20:43:40] RECOVERY - Restbase endpoints health on restbase1009 is OK: All endpoints are healthy [20:43:40] RECOVERY - SSH on helium is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2wmfprecise2 (protocol 2.0) [20:43:48] RECOVERY - Apache HTTP on mw1145 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.050 second response time [20:43:48] RECOVERY - Restbase endpoints health on restbase1003 is OK: All endpoints are healthy [20:43:48] PROBLEM - HHVM queue size on mw1143 is CRITICAL 37.50% of data above the critical threshold [80.0] [20:43:48] RECOVERY - HHVM rendering on mw1142 is OK: HTTP OK: HTTP/1.1 200 OK - 66157 bytes in 0.152 second response time [20:43:49] RECOVERY - Apache HTTP on mw1233 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.028 second response time [20:43:49] RECOVERY - Apache HTTP on mw1224 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.027 second response time [20:43:49] RECOVERY - Apache HTTP on mw1128 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.053 second response time [20:43:50] ok helium is fixed now [20:43:50] RECOVERY - Apache HTTP on mw1133 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.050 second response time [20:43:50] RECOVERY - Apache HTTP on mw1139 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.051 second response time [20:43:51] RECOVERY - HHVM rendering on mw1230 is OK: HTTP OK: HTTP/1.1 200 OK - 66156 bytes in 0.114 second response time [20:43:51] 
PROBLEM - HHVM busy threads on mw1200 is CRITICAL 85.71% of data above the critical threshold [115.2] [20:43:58] RECOVERY - Apache HTTP on mw1206 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.036 second response time [20:43:59] RECOVERY - HHVM rendering on mw1119 is OK: HTTP OK: HTTP/1.1 200 OK - 66157 bytes in 0.147 second response time [20:43:59] RECOVERY - Apache HTTP on mw1230 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.029 second response time [20:44:04] well that was interesting [20:44:08] RECOVERY - Restbase endpoints health on cerium is OK: All endpoints are healthy [20:44:09] RECOVERY - Apache HTTP on mw1143 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.058 second response time [20:44:09] RECOVERY - Apache HTTP on mw1194 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.035 second response time [20:44:09] RECOVERY - HHVM rendering on mw1191 is OK: HTTP OK: HTTP/1.1 200 OK - 66156 bytes in 0.111 second response time [20:44:09] PROBLEM - HHVM queue size on mw1234 is CRITICAL 33.33% of data above the critical threshold [80.0] [20:44:09] PROBLEM - HHVM queue size on mw1229 is CRITICAL 60.00% of data above the critical threshold [80.0] [20:44:09] PROBLEM - HHVM queue size on mw1144 is CRITICAL 50.00% of data above the critical threshold [80.0] [20:44:19] RECOVERY - Restbase endpoints health on restbase1008 is OK: All endpoints are healthy [20:44:28] RECOVERY - Apache HTTP on mw1131 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.053 second response time [20:44:29] RECOVERY - Apache HTTP on mw1119 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.054 second response time [20:44:29] RECOVERY - Apache HTTP on mw1144 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.058 second response time [20:44:29] PROBLEM - HHVM queue size on mw1194 is CRITICAL 66.67% of data above the critical threshold [80.0] [20:44:29] RECOVERY - HHVM rendering on mw1143 is OK: 
HTTP OK: HTTP/1.1 200 OK - 66156 bytes in 0.104 second response time [20:44:29] RECOVERY - Apache HTTP on mw1227 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.031 second response time [20:44:38] RECOVERY - Restbase endpoints health on praseodymium is OK: All endpoints are healthy [20:44:38] RECOVERY - Apache HTTP on mw1116 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.057 second response time [20:44:39] RECOVERY - HHVM rendering on mw1128 is OK: HTTP OK: HTTP/1.1 200 OK - 66157 bytes in 0.181 second response time [20:44:39] RECOVERY - Restbase endpoints health on restbase1005 is OK: All endpoints are healthy [20:44:39] RECOVERY - HHVM rendering on mw1200 is OK: HTTP OK: HTTP/1.1 200 OK - 66156 bytes in 0.152 second response time [20:44:47] it finished a puppet run, ack [20:44:48] RECOVERY - salt-minion processes on helium is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [20:44:59] RECOVERY - HHVM rendering on mw1154 is OK: HTTP OK: HTTP/1.1 200 OK - 66156 bytes in 0.095 second response time [20:45:49] RECOVERY - puppet last run on helium is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [20:45:49] PROBLEM - HHVM queue size on mw1189 is CRITICAL 33.33% of data above the critical threshold [80.0] [20:46:09] RECOVERY - HHVM queue size on mw1233 is OK Less than 30.00% above the threshold [10.0] [20:46:18] RECOVERY - HHVM queue size on mw1234 is OK Less than 30.00% above the threshold [10.0] [20:46:18] RECOVERY - HHVM queue size on mw1192 is OK Less than 30.00% above the threshold [10.0] [20:46:29] PROBLEM - HHVM queue size on mw1231 is CRITICAL 33.33% of data above the critical threshold [80.0] [20:46:39] RECOVERY - HHVM queue size on mw1115 is OK Less than 30.00% above the threshold [10.0] [20:46:59] RECOVERY - HHVM queue size on mw1208 is OK Less than 30.00% above the threshold [10.0] [20:47:08] RECOVERY - HHVM queue size on mw1232 is OK Less than 30.00% above the threshold 
[10.0] [20:47:50] RECOVERY - HHVM queue size on mw1189 is OK Less than 30.00% above the threshold [10.0] [20:48:08] RECOVERY - HHVM queue size on mw1225 is OK Less than 30.00% above the threshold [10.0] [20:48:09] PROBLEM - HHVM queue size on mw1201 is CRITICAL 33.33% of data above the critical threshold [80.0] [20:48:09] PROBLEM - HHVM queue size on mw1193 is CRITICAL 50.00% of data above the critical threshold [80.0] [20:48:09] RECOVERY - HHVM queue size on mw1135 is OK Less than 30.00% above the threshold [10.0] [20:48:18] RECOVERY - HHVM queue size on mw1229 is OK Less than 30.00% above the threshold [10.0] [20:48:38] RECOVERY - HHVM queue size on mw1145 is OK Less than 30.00% above the threshold [10.0] [20:48:38] RECOVERY - HHVM queue size on mw1231 is OK Less than 30.00% above the threshold [10.0] [20:48:48] RECOVERY - HHVM queue size on mw1132 is OK Less than 30.00% above the threshold [10.0] [20:48:58] RECOVERY - HHVM queue size on mw1207 is OK Less than 30.00% above the threshold [10.0] [20:48:59] RECOVERY - HHVM queue size on mw1140 is OK Less than 30.00% above the threshold [10.0] [20:48:59] RECOVERY - HHVM busy threads on mw1148 is OK Less than 30.00% above the threshold [57.6] [20:49:00] RECOVERY - HHVM busy threads on mw1204 is OK Less than 30.00% above the threshold [76.8] [20:49:28] RECOVERY - HHVM busy threads on mw1202 is OK Less than 30.00% above the threshold [76.8] [20:49:38] RECOVERY - HHVM queue size on mw1117 is OK Less than 30.00% above the threshold [10.0] [20:49:40] RECOVERY - HHVM queue size on mw1114 is OK Less than 30.00% above the threshold [10.0] [20:49:40] RECOVERY - HHVM queue size on mw1197 is OK Less than 30.00% above the threshold [10.0] [20:49:40] RECOVERY - HHVM busy threads on mw1122 is OK Less than 30.00% above the threshold [57.6] [20:49:48] RECOVERY - HHVM busy threads on mw1221 is OK Less than 30.00% above the threshold [76.8] [20:49:58] RECOVERY - HHVM busy threads on mw1121 is OK Less than 30.00% above the threshold 
[57.6] [20:49:58] RECOVERY - HHVM busy threads on mw1137 is OK Less than 30.00% above the threshold [57.6] [20:50:09] RECOVERY - HHVM queue size on mw1125 is OK Less than 30.00% above the threshold [10.0] [20:50:09] RECOVERY - HHVM busy threads on mw1133 is OK Less than 30.00% above the threshold [57.6] [20:50:18] RECOVERY - HHVM queue size on mw1201 is OK Less than 30.00% above the threshold [10.0] [20:50:19] RECOVERY - HHVM queue size on mw1193 is OK Less than 30.00% above the threshold [10.0] [20:50:19] RECOVERY - HHVM busy threads on mw1124 is OK Less than 30.00% above the threshold [57.6] [20:50:19] RECOVERY - HHVM queue size on mw1144 is OK Less than 30.00% above the threshold [10.0] [20:50:19] RECOVERY - HHVM busy threads on mw1120 is OK Less than 30.00% above the threshold [57.6] [20:50:20] RECOVERY - HHVM busy threads on mw1132 is OK Less than 30.00% above the threshold [57.6] [20:50:38] RECOVERY - HHVM queue size on mw1194 is OK Less than 30.00% above the threshold [10.0] [20:50:38] RECOVERY - HHVM busy threads on mw1130 is OK Less than 30.00% above the threshold [57.6] [20:50:39] RECOVERY - HHVM busy threads on mw1126 is OK Less than 30.00% above the threshold [57.6] [20:50:39] RECOVERY - HHVM queue size on mw1200 is OK Less than 30.00% above the threshold [10.0] [20:50:48] RECOVERY - HHVM busy threads on mw1232 is OK Less than 30.00% above the threshold [76.8] [20:50:48] RECOVERY - HHVM queue size on mw1228 is OK Less than 30.00% above the threshold [10.0] [20:50:49] RECOVERY - HHVM busy threads on mw1125 is OK Less than 30.00% above the threshold [57.6] [20:50:59] RECOVERY - HHVM busy threads on mw1119 is OK Less than 30.00% above the threshold [57.6] [20:51:09] RECOVERY - HHVM busy threads on mw1227 is OK Less than 30.00% above the threshold [76.8] [20:51:19] RECOVERY - HHVM busy threads on mw1234 is OK Less than 30.00% above the threshold [76.8] [20:51:29] RECOVERY - HHVM busy threads on mw1235 is OK Less than 30.00% above the threshold [76.8] 
[20:51:29] RECOVERY - HHVM busy threads on mw1143 is OK Less than 30.00% above the threshold [57.6] [20:51:38] RECOVERY - HHVM busy threads on mw1233 is OK Less than 30.00% above the threshold [76.8] [20:51:38] RECOVERY - HHVM busy threads on mw1228 is OK Less than 30.00% above the threshold [76.8] [20:51:39] RECOVERY - HHVM busy threads on mw1146 is OK Less than 30.00% above the threshold [57.6] [20:51:39] RECOVERY - HHVM queue size on mw1230 is OK Less than 30.00% above the threshold [10.0] [20:51:39] RECOVERY - HHVM busy threads on mw1208 is OK Less than 30.00% above the threshold [76.8] [20:51:48] RECOVERY - HHVM busy threads on mw1123 is OK Less than 30.00% above the threshold [57.6] [20:51:58] RECOVERY - HHVM queue size on mw1119 is OK Less than 30.00% above the threshold [10.0] [20:51:58] RECOVERY - HHVM busy threads on mw1116 is OK Less than 30.00% above the threshold [57.6] [20:51:58] RECOVERY - HHVM busy threads on mw1207 is OK Less than 30.00% above the threshold [76.8] [20:51:59] RECOVERY - HHVM busy threads on mw1192 is OK Less than 30.00% above the threshold [76.8] [20:51:59] RECOVERY - HHVM busy threads on mw1200 is OK Less than 30.00% above the threshold [76.8] [20:52:08] RECOVERY - HHVM busy threads on mw1229 is OK Less than 30.00% above the threshold [76.8] [20:52:19] RECOVERY - HHVM busy threads on mw1193 is OK Less than 30.00% above the threshold [76.8] [20:52:19] RECOVERY - HHVM busy threads on mw1231 is OK Less than 30.00% above the threshold [76.8] [20:52:30] RECOVERY - HHVM busy threads on mw1140 is OK Less than 30.00% above the threshold [57.6] [20:52:30] RECOVERY - HHVM queue size on mw1128 is OK Less than 30.00% above the threshold [10.0] [20:52:38] RECOVERY - HHVM busy threads on mw1205 is OK Less than 30.00% above the threshold [76.8] [20:52:39] RECOVERY - HHVM busy threads on mw1223 is OK Less than 30.00% above the threshold [76.8] [20:52:39] RECOVERY - HHVM busy threads on mw1203 is OK Less than 30.00% above the threshold [76.8] 
[20:52:48] RECOVERY - HHVM busy threads on mw1199 is OK Less than 30.00% above the threshold [76.8] [20:52:49] RECOVERY - HHVM busy threads on mw1147 is OK Less than 30.00% above the threshold [57.6] [20:52:49] RECOVERY - HHVM busy threads on mw1190 is OK Less than 30.00% above the threshold [76.8] [20:52:58] RECOVERY - HHVM busy threads on mw1224 is OK Less than 30.00% above the threshold [76.8] [20:52:59] RECOVERY - HHVM busy threads on mw1195 is OK Less than 30.00% above the threshold [76.8] [20:52:59] RECOVERY - HHVM busy threads on mw1142 is OK Less than 30.00% above the threshold [57.6] [20:52:59] RECOVERY - HHVM queue size on mw1138 is OK Less than 30.00% above the threshold [10.0] [20:52:59] RECOVERY - HHVM busy threads on mw1230 is OK Less than 30.00% above the threshold [76.8] [20:52:59] RECOVERY - HHVM busy threads on mw1197 is OK Less than 30.00% above the threshold [76.8] [20:53:10] RECOVERY - HHVM busy threads on mw1131 is OK Less than 30.00% above the threshold [57.6] [20:53:19] RECOVERY - HHVM busy threads on mw1114 is OK Less than 30.00% above the threshold [57.6] [20:53:19] RECOVERY - HHVM queue size on mw1191 is OK Less than 30.00% above the threshold [10.0] [20:53:19] RECOVERY - HHVM busy threads on mw1144 is OK Less than 30.00% above the threshold [57.6] [20:53:19] RECOVERY - HHVM queue size on mw1227 is OK Less than 30.00% above the threshold [10.0] [20:53:28] RECOVERY - HHVM busy threads on mw1189 is OK Less than 30.00% above the threshold [76.8] [20:53:39] RECOVERY - HHVM busy threads on mw1128 is OK Less than 30.00% above the threshold [57.6] [20:53:39] RECOVERY - HHVM busy threads on mw1136 is OK Less than 30.00% above the threshold [57.6] [20:53:39] RECOVERY - HHVM busy threads on mw1194 is OK Less than 30.00% above the threshold [76.8] [20:53:39] RECOVERY - HHVM busy threads on mw1139 is OK Less than 30.00% above the threshold [57.6] [20:53:40] RECOVERY - HHVM busy threads on mw1225 is OK Less than 30.00% above the threshold [76.8] 
[20:53:40] RECOVERY - HHVM busy threads on mw1135 is OK Less than 30.00% above the threshold [57.6] [20:53:40] RECOVERY - HHVM queue size on mw1190 is OK Less than 30.00% above the threshold [10.0] [20:53:40] RECOVERY - HHVM busy threads on mw1226 is OK Less than 30.00% above the threshold [76.8] [20:53:49] RECOVERY - HHVM busy threads on mw1222 is OK Less than 30.00% above the threshold [76.8] [20:53:49] RECOVERY - HHVM busy threads on mw1201 is OK Less than 30.00% above the threshold [76.8] [20:53:59] RECOVERY - HHVM busy threads on mw1127 is OK Less than 30.00% above the threshold [57.6] [20:53:59] RECOVERY - HHVM queue size on mw1143 is OK Less than 30.00% above the threshold [10.0] [20:53:59] RECOVERY - HHVM busy threads on mw1134 is OK Less than 30.00% above the threshold [57.6] [20:53:59] RECOVERY - HHVM busy threads on mw1138 is OK Less than 30.00% above the threshold [57.6] [20:54:00] RECOVERY - HHVM queue size on mw1205 is OK Less than 30.00% above the threshold [10.0] [20:54:09] RECOVERY - HHVM busy threads on mw1206 is OK Less than 30.00% above the threshold [76.8] [20:54:09] RECOVERY - HHVM busy threads on mw1129 is OK Less than 30.00% above the threshold [57.6] [20:54:18] RECOVERY - HHVM queue size on mw1116 is OK Less than 30.00% above the threshold [10.0] [20:54:19] RECOVERY - HHVM busy threads on mw1145 is OK Less than 30.00% above the threshold [57.6] [20:54:19] RECOVERY - HHVM busy threads on mw1198 is OK Less than 30.00% above the threshold [76.8] [20:54:19] RECOVERY - HHVM busy threads on mw1117 is OK Less than 30.00% above the threshold [57.6] [20:54:23] -_- [20:54:28] RECOVERY - HHVM busy threads on mw1115 is OK Less than 30.00% above the threshold [57.6] [20:54:49] RECOVERY - HHVM busy threads on mw1191 is OK Less than 30.00% above the threshold [76.8] [20:57:34] 6operations, 5Patch-For-Review: Ferm rules for backup roles - https://phabricator.wikimedia.org/T104996#1516015 (10Dzahn) root@helium:~# /etc/init.d/ferm reload * Reloading Firewall 
configuration... iptables-restore: line 27 failed... [20:58:02] I don't understand why there are still so many instances of this in the log: 462343 Notice: Undefined index: laplace in /srv/mediawiki/php-1.26wmf17/extensions/CirrusSearch/includes/Hooks.php on line 359 [20:58:23] the fix went in quite a long time ago [21:01:19] PROBLEM - Restbase endpoints health on praseodymium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:01:31] twentyafterfour: is hhvm.log not growing quickly? [21:02:00] the fatalmonitor shell script sort of assumed that there were tons of errors all the time [21:02:01] bd808: well the count of that error is still growing [21:02:06] *nod* [21:02:20] but I think the logs get delayed sometimes [21:02:23] somewhere [21:02:33] rsyslog buffering [21:03:28] RECOVERY - Restbase endpoints health on praseodymium is OK: All endpoints are healthy [21:04:00] so yeah that log doesn't seem to be filling up very quickly [21:04:13] tail -f on the log is barely moving [21:04:34] we broke monitoring by cleaning up crappy errors ;) [21:05:08] have we really done that much to eliminate all the errors or does this log only represent a sampling of all errors across the cluster? [21:05:35] e.g. does fluorine get all hhvm log messages? (or nearly all of them?) 
[21:05:56] yeah, all of them (that don't get lost somewhere along the way) [21:06:28] hhvm -> local rsyslog -> fluorine syslog [21:07:12] hhvm.log is all of the HHVM fatal errors and php warnings/trigger_error() messages [21:07:26] !log ebernhardson Synchronized php-1.26wmf17/extensions/CirrusSearch/includes/Hooks.php: Repush file spewing notices into hhvm.log (duration: 00m 12s) [21:07:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:07:48] which is the equivalent of the old apache2 error.log stream from php5 [21:08:23] It used to be filled with undefined index warnings from all over the codebase [21:08:37] but we have really cut that down over the last 3 months or so [21:11:23] bd808: yeah we have made really good progress [21:12:07] !log switched cassandra staging cluster (xenon, cerium, praseodymium) to CMS & started a load test on that [21:12:08] it's annoying when writing the code to have to check isset on every inconsequential variable but it's necessary I suppose. [21:12:09] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [21:12:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:12:42] phabricator has a nice handy function for querying arrays without triggering array index errors [21:13:07] $local_var = idx($array, 'index', 'default-value') [21:13:41] wfSafeGetArrayElementByIndex() [21:13:55] is much nicer than if (isset($array['index'])) { $local_var = $array['index']; } else { $local_var = 'default-value'; } [21:14:35] MaxSem: wfSafeGetArrayElementByIndex ... makes it almost as long as the verbose form [21:14:56] (03PS1) 10Ottomata: Rename analytics1013,1014,1020 to kafka1013,1014,1020 [dns] - 10https://gerrit.wikimedia.org/r/229956 [21:15:41] nice to know though, I might use that [21:15:44] err, right. 
so it should really be \MediaWiki\Core\ArrayUtils::safeGetArrayElementByIndex() :P [21:16:00] nah it should really be idx [21:16:15] or \MediaWiki\idx [21:16:15] SARCASM DETECTION FAIL [21:16:19] ;) [21:16:40] heh sarcasm doesn't exist on the internet [21:18:09] recursive sarcasm well played [21:20:19] someone tell me again: [21:20:22] ensure_packages or require_packages [21:20:24] which is better [21:20:25] ? [21:20:55] iirc ensure_packages is newer [21:20:59] apt-get install [21:21:59] ottomata: chasemp ensure_packages is from stdlib, require_package is from ori [21:22:01] IIRC [21:22:11] oh, ori's was better? [21:22:14] the latter is 'better' in that you don't need to do require => Package['foo'] in your module [21:22:15] if i remember? [21:22:17] it's autorequired [21:22:17] yes [21:22:23] k [21:23:40] (03PS2) 10EBernhardson: Start CirrusSearch AB test on suggestion confidence [mediawiki-config] - 10https://gerrit.wikimedia.org/r/229462 (https://bugzilla.wikimedia.org/108103) [21:24:21] (03PS1) 10Ottomata: Use require_package for analytics and kafka nodes instead of ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/229958 [21:25:05] !log krinkle Synchronized php-1.26wmf17/extensions/EducationProgram/EducationProgram.hooks.php: T107980) (duration: 00m 12s) [21:25:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:25:33] (03CR) 10Ottomata: [C: 032] Use require_package for analytics and kafka nodes instead of ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/229958 (owner: 10Ottomata) [21:25:48] (03PS3) 10EBernhardson: Start CirrusSearch AB test on suggestion confidence [mediawiki-config] - 10https://gerrit.wikimedia.org/r/229462 (https://phabricator.wikimedia.org/T108103) [21:26:29] (03PS4) 10EBernhardson: Start CirrusSearch AB test on suggestion confidence [mediawiki-config] - 10https://gerrit.wikimedia.org/r/229462 (https://phabricator.wikimedia.org/T108103) [21:27:22] 6operations, 5Patch-For-Review:
Ferm rules for backup roles - https://phabricator.wikimedia.org/T104996#1516149 (10Dzahn) 20150806-poolcounter == Summary == A firewall change was merged on server helium which serves as Bacula director and also as a poolcounter. There was a failure to start the ferm service. A... [21:27:48] RECOVERY - salt-minion processes on analytics1013 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [21:28:57] (03PS1) 10Ottomata: Rename analytics1013,1014,1020 to kafka1013,1014,1020 [puppet] - 10https://gerrit.wikimedia.org/r/229961 (https://phabricator.wikimedia.org/T106581) [21:29:41] 6operations, 10Wikimedia-Mailing-lists: Rename usergroups@ to usergroup-applications@ - https://phabricator.wikimedia.org/T108099#1516160 (10JohnLewis) p:5Triage>3Normal [21:30:37] (03PS2) 10Ottomata: Rename analytics1013,1014,1020 to kafka1013,1014,1020 [puppet] - 10https://gerrit.wikimedia.org/r/229961 (https://phabricator.wikimedia.org/T106581) [21:32:01] 6operations, 10Wikimedia-Mailing-lists: Upgrade Mailman to version 3 - https://phabricator.wikimedia.org/T52864#1516175 (10JohnLewis) p:5Normal>3Lowest [21:34:49] 6operations, 10Deployment-Systems, 10RESTBase, 6Services, 5Patch-For-Review: [Discussion] Move restbase config to Ansible? - https://phabricator.wikimedia.org/T107532#1516185 (10thcipriani) >>! In T107532#1515421, @GWicke wrote: > Thanks for giving a link to the spreadsheet. I see that about half the ser... 
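[Editor's note: the idx() helper discussed above, which avoids verbose isset() checks when reading array keys, maps directly onto Python's dict.get(). A minimal sketch of the same safe-lookup pattern follows; the function name `idx` is borrowed from the Phabricator convention mentioned in the log and is not an existing MediaWiki API.]

```python
def idx(array, key, default=None):
    """Return array[key] if the key exists, else default.

    Same idea as Phabricator's idx(): a one-call replacement for the
    verbose "if key is set, use it, else use a default" dance.
    """
    if isinstance(array, dict):
        return array.get(key, default)
    return default

config = {"index": "value"}
print(idx(config, "index"))                     # key present
print(idx(config, "missing", "default-value"))  # falls back to default
```

The long form this replaces is the one quoted in the log: `if (isset($array['index'])) { $local_var = $array['index']; } else { $local_var = 'default-value'; }`.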
[21:34:57] PROBLEM - salt-minion processes on analytics1013 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [21:36:28] (03CR) 10Ottomata: [C: 032] Rename analytics1013,1014,1020 to kafka1013,1014,1020 [dns] - 10https://gerrit.wikimedia.org/r/229956 (owner: 10Ottomata) [21:37:21] (03PS3) 10Ottomata: Rename analytics1013,1014,1020 to kafka1013,1014,1020 [puppet] - 10https://gerrit.wikimedia.org/r/229961 (https://phabricator.wikimedia.org/T106581) [21:37:38] 6operations, 10ops-eqiad, 5Patch-For-Review: mw1061 has a faulty disk, filesystem is read-only - https://phabricator.wikimedia.org/T107849#1516189 (10Cmjohnson) The faulty disk has been replaced and a freshly installed with puppet certs and salt. I did not revert the dsh group changes. I will leave that to @Joe [21:37:55] 6operations, 10ops-eqiad, 5Patch-For-Review: mw1061 has a faulty disk, filesystem is read-only - https://phabricator.wikimedia.org/T107849#1516190 (10Cmjohnson) a:5Cmjohnson>3Joe [21:38:28] (03CR) 10Ottomata: [C: 032] Rename analytics1013,1014,1020 to kafka1013,1014,1020 [puppet] - 10https://gerrit.wikimedia.org/r/229961 (https://phabricator.wikimedia.org/T106581) (owner: 10Ottomata) [21:40:20] 6operations, 10Wikimedia-Mailing-lists: request: use spamassassin to filter as well - https://phabricator.wikimedia.org/T83030#1516199 (10Dzahn) I think it's kind of resolved. We _do_ use spamassassin on list mail, and we _do_ score it and list admins _can_ use that score in filter rules in the admin web ui.... 
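[Editor's note: the ensure_packages vs require_package exchange earlier can be sketched roughly as below. This is an illustrative fragment only; the package and file names are made up, and the exact semantics are as described in the log (require_package auto-requires the package for the enclosing scope), not verified against the puppet tree.]

```puppet
# ensure_packages (from puppetlabs-stdlib): declares the package, but
# dependent resources still need an explicit require on it.
ensure_packages(['kafkacat'])
file { '/etc/kafkacat.conf':
    ensure  => present,
    require => Package['kafkacat'],
}

# require_package (WMF puppet helper): declares the package and
# auto-requires it, so no explicit require => Package[...] is needed.
require_package('kafkacat')
file { '/etc/kafkacat.conf':
    ensure => present,
}
```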
[21:41:18] RECOVERY - puppet last run on analytics1021 is OK Puppet is currently enabled, last run 44 seconds ago with 0 failures [21:41:48] (03CR) 10BryanDavis: [C: 031] Make adminbot also work if no headers are present [debs/adminbot] - 10https://gerrit.wikimedia.org/r/229757 (owner: 10Merlijn van Deen) [21:42:08] 6operations, 10Wikimedia-Mailing-lists: request: use spamassassin to filter as well - https://phabricator.wikimedia.org/T83030#1516203 (10Dzahn) arr, all that said, sorry, the filters may not apply to the -owner addresses. that is the actual issue then [21:42:13] 6operations, 10Wikimedia-Mailing-lists: request: use spamassassin to filter as well - https://phabricator.wikimedia.org/T83030#1516207 (10JohnLewis) I'm also going to cautiously dupe this ticket of T105093 and extend that ticket's scope. [21:42:17] PROBLEM - puppet last run on analytics1022 is CRITICAL Puppet last ran 2 days ago [21:42:18] PROBLEM - puppet last run on analytics1012 is CRITICAL Puppet last ran 2 days ago [21:42:23] 6operations, 10Wikimedia-Mailing-lists: request: use spamassassin to filter as well - https://phabricator.wikimedia.org/T83030#1516209 (10JohnLewis) [21:42:35] (03CR) 10BryanDavis: [C: 031] Add IRC notice on exceptions [debs/adminbot] - 10https://gerrit.wikimedia.org/r/229758 (owner: 10Merlijn van Deen) [21:43:51] (03PS1) 10Merlijn van Deen: Updated changelog [debs/adminbot] - 10https://gerrit.wikimedia.org/r/229965 [21:44:17] RECOVERY - puppet last run on analytics1022 is OK Puppet is currently enabled, last run 59 seconds ago with 0 failures [21:44:18] RECOVERY - puppet last run on analytics1012 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [21:44:48] 6operations, 10Wikimedia-Mailing-lists: Let public archives be indexed and archived - https://phabricator.wikimedia.org/T90407#1516223 (10JohnLewis) @faidon / @mark: input valued here. 
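[Editor's note: the fatalmonitor-style triage discussed earlier, counting which notices dominate hhvm.log on fluorine, amounts to a sort/uniq pipeline. A self-contained sketch follows; the sample file path and log format are illustrative, not the actual fluorine layout.]

```shell
# Build a tiny sample in an hhvm.log-like format, then count the most
# frequent messages -- the essence of a fatalmonitor-style check.
cat > /tmp/hhvm.sample.log <<'EOF'
Notice: Undefined index: laplace in Hooks.php on line 359
Notice: Undefined index: laplace in Hooks.php on line 359
Warning: something else in Other.php on line 12
EOF
sort /tmp/hhvm.sample.log | uniq -c | sort -rn | head -5
```

On the real cluster the input would be the aggregated stream described above (hhvm -> local rsyslog -> fluorine syslog) rather than a local file.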
[21:45:57] (03CR) 10Yuvipanda: [C: 032] Make adminbot also work if no headers are present [debs/adminbot] - 10https://gerrit.wikimedia.org/r/229757 (owner: 10Merlijn van Deen) [21:45:59] (03Merged) 10jenkins-bot: Make adminbot also work if no headers are present [debs/adminbot] - 10https://gerrit.wikimedia.org/r/229757 (owner: 10Merlijn van Deen) [21:46:55] (03CR) 10Yuvipanda: [C: 032] Add IRC notice on exceptions [debs/adminbot] - 10https://gerrit.wikimedia.org/r/229758 (owner: 10Merlijn van Deen) [21:46:57] (03Merged) 10jenkins-bot: Add IRC notice on exceptions [debs/adminbot] - 10https://gerrit.wikimedia.org/r/229758 (owner: 10Merlijn van Deen) [21:47:55] (03CR) 10Dzahn: [C: 032] Updated changelog [debs/adminbot] - 10https://gerrit.wikimedia.org/r/229965 (owner: 10Merlijn van Deen) [21:48:20] YuviPanda: want me to build? [21:48:30] (03CR) 10Yuvipanda: Updated changelog (031 comment) [debs/adminbot] - 10https://gerrit.wikimedia.org/r/229965 (owner: 10Merlijn van Deen) [21:48:35] or i have docs [21:49:00] eh, the name thing is true [21:51:35] (03PS1) 10Merlijn van Deen: Fix signature in changelog [debs/adminbot] - 10https://gerrit.wikimedia.org/r/229970 [21:52:54] (03CR) 10Dzahn: [C: 032] Fix signature in changelog [debs/adminbot] - 10https://gerrit.wikimedia.org/r/229970 (owner: 10Merlijn van Deen) [21:54:34] (03CR) 10Ori.livneh: [C: 031] "Thanks!" 
[dns] - 10https://gerrit.wikimedia.org/r/229696 (https://phabricator.wikimedia.org/T107901) (owner: 10Alexandros Kosiaris) [21:55:48] !log Running "updateSpecialPages.php --wiki wikidatawiki --only DoubleRedirects" on terbium [21:55:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:56:15] !log ori Synchronized php-1.26wmf17/extensions/Graph: I2089b21fc: Updated mediawiki/core Project: mediawiki/extensions/Graph (duration: 00m 12s) [21:56:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:00:40] !log built adminbot 1.7.12 and copied to carbon to incoming - but not imported [22:00:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:02:48] !log if up for watching the (auto)-upgrade and restarting: @carbon:/srv/wikimedia/incoming# reprepro -C main include adminbot_1.7.12_amd64.changes [22:02:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:06:37] (03PS1) 10John F. Lewis: mailman: add apache and exim aliases for education-coop rename [puppet] - 10https://gerrit.wikimedia.org/r/229978 (https://phabricator.wikimedia.org/T107445) [22:06:51] (03PS2) 10John F. Lewis: mailman: add apache and exim aliases for education-coop rename [puppet] - 10https://gerrit.wikimedia.org/r/229978 (https://phabricator.wikimedia.org/T107445) [22:10:53] 6operations, 10Wikimedia-Mailing-lists: install jessie on new VM for mailman - https://phabricator.wikimedia.org/T108070#1516333 (10Dzahn) a:3Dzahn @robh created the VM MAC is aa:00:00:a4:ac:f1 [22:11:46] (03PS1) 10Dzahn: fermium: add to DHCP and netboot [puppet] - 10https://gerrit.wikimedia.org/r/229983 (https://phabricator.wikimedia.org/T108070) [22:14:44] haha, are we gonna add ferm to fermium? 
[22:14:46] (03PS2) 10Dzahn: fermium: add to DHCP and netboot [puppet] - 10https://gerrit.wikimedia.org/r/229983 (https://phabricator.wikimedia.org/T108070) [22:16:17] (03CR) 10Dzahn: [C: 031] ".. after the new list has been created" [puppet] - 10https://gerrit.wikimedia.org/r/229978 (https://phabricator.wikimedia.org/T107445) (owner: 10John F. Lewis) [22:16:30] YuviPanda: :) yes [22:16:54] YuviPanda: honestly it's the only reason it exists right now :P [22:17:33] they say its for 'staging' but that's just a cover story! [22:21:01] (03PS3) 10Dzahn: fermium: add to DHCP and netboot [puppet] - 10https://gerrit.wikimedia.org/r/229983 (https://phabricator.wikimedia.org/T108070) [22:21:04] !log ori Synchronized php-1.26wmf17/extensions/Flow: 94703bc291: Updated mediawiki/core Project: mediawiki/extensions/Flow (duration: 00m 15s) [22:21:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:22:54] !log ori Synchronized php-1.26wmf17/includes/resourceloader/ResourceLoader.php: (no message) (duration: 00m 11s) [22:23:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:23:19] !log ori Synchronized php-1.26wmf17/resources/src/startup.js: (no message) (duration: 00m 12s) [22:23:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:23:35] (03PS1) 10Dzahn: fermium: add to site.pp, add firewalling [puppet] - 10https://gerrit.wikimedia.org/r/229985 (https://bugzilla.wikimedia.org/108070) [22:28:22] (03PS4) 10Dzahn: fermium: add to DHCP and netboot [puppet] - 10https://gerrit.wikimedia.org/r/229983 (https://phabricator.wikimedia.org/T108070) [22:28:58] (03CR) 10John F. Lewis: [C: 031] "The patch is good." [puppet] - 10https://gerrit.wikimedia.org/r/229983 (https://phabricator.wikimedia.org/T108070) (owner: 10Dzahn) [22:29:44] (03CR) 10John F. 
Lewis: [C: 031] "Correction; sodium has it disabled about 3 years ago in https://github.com/wikimedia/operations-puppet/commit/f4a968187f93c03ed9e9a20104bd" [puppet] - 10https://gerrit.wikimedia.org/r/229983 (https://phabricator.wikimedia.org/T108070) (owner: 10Dzahn) [22:29:46] (03CR) 10Dzahn: [C: 032] fermium: add to DHCP and netboot [puppet] - 10https://gerrit.wikimedia.org/r/229983 (https://phabricator.wikimedia.org/T108070) (owner: 10Dzahn) [22:30:07] hah I didn't forget, you rebased it :) [22:31:25] !log ori Synchronized php-1.26wmf17/extensions/VisualEditor/modules/ve-mw/init/targets/ve.init.mw.DesktopArticleTarget.init.js: (no message) (duration: 00m 12s) [22:31:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:31:57] !log Previous two syncs were of I2089b21fc and I3f46fee7c [22:32:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:32:44] !log starting new instance fermium on ganeti [22:32:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:40:45] (03PS1) 10Dzahn: admin: add non-root user group for mailman staging [puppet] - 10https://gerrit.wikimedia.org/r/229995 (https://phabricator.wikimedia.org/T108082) [22:42:16] sudo gnt-instance modify --hypervisor-parameters=boot_order=disk fermium.eqiad.wmnet [22:42:41] JohnFLewis: ^ that is needed or endless reboot cycle [22:43:23] okay [22:43:30] eh, robh too, the only things different from actual metal server, how to get console and that [22:44:10] !log ori Synchronized php-1.26wmf17/tests/phpunit/includes/OutputPageTest.php: (no message) (duration: 00m 13s) [22:44:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:44:45] (03CR) 10Dzahn: [C: 032] admin: add non-root user group for mailman staging [puppet] - 10https://gerrit.wikimedia.org/r/229995 (https://phabricator.wikimedia.org/T108082) (owner: 10Dzahn) [22:49:54] (03PS1) 10Dzahn: admin: non-sudo shell 
for johnflewis on fermium [puppet] - 10https://gerrit.wikimedia.org/r/230000 (https://phabricator.wikimedia.org/T108082) [22:50:30] mutante: look at the patchset number :P [22:50:52] heheh, nice one [22:50:58] 6operations, 10Deployment-Systems, 10RESTBase, 6Services, 5Patch-For-Review: [Discussion] Move restbase config to Ansible? - https://phabricator.wikimedia.org/T107532#1516526 (10GWicke) >>! In T107532#1516185, @thcipriani wrote: >>>! In T107532#1515421, @GWicke wrote: >> Thanks for giving a link to the s... [22:51:45] (03PS2) 10Dzahn: fermium: add to site.pp, add firewalling [puppet] - 10https://gerrit.wikimedia.org/r/229985 (https://phabricator.wikimedia.org/T108070) [22:51:58] (03PS3) 10Dzahn: fermium: add to site.pp, add firewalling [puppet] - 10https://gerrit.wikimedia.org/r/229985 (https://phabricator.wikimedia.org/T108070) [22:52:31] (03CR) 10Dzahn: [C: 032] fermium: add to site.pp, add firewalling [puppet] - 10https://gerrit.wikimedia.org/r/229985 (https://phabricator.wikimedia.org/T108070) (owner: 10Dzahn) [22:54:17] (03CR) 10Dzahn: "reverted because https://phabricator.wikimedia.org/T104996#1516149" [puppet] - 10https://gerrit.wikimedia.org/r/229054 (https://phabricator.wikimedia.org/T104996) (owner: 10Dzahn) [22:54:35] (03PS2) 10Dzahn: admin: non-sudo shell for johnflewis on fermium [puppet] - 10https://gerrit.wikimedia.org/r/230000 (https://phabricator.wikimedia.org/T108082) [22:55:33] 6operations, 10Wikimedia-Mailing-lists: Mailman Upgrade (Jessie & Mailman 2.x) and migration to a VM - https://phabricator.wikimedia.org/T105756#1516560 (10Dzahn) [22:55:35] 6operations, 10vm-requests: eqiad: 1 VM request for mailman - https://phabricator.wikimedia.org/T108065#1516558 (10Dzahn) 5Open>3Resolved [22:55:45] 6operations, 10vm-requests: eqiad: 1 VM request for mailman - https://phabricator.wikimedia.org/T108065#1511472 (10Dzahn) resolved by @Robh [22:56:03] 6operations, 10Deployment-Systems, 10RESTBase, 6Services, 5Patch-For-Review: [Discussion] Move 
restbase config to Ansible? - https://phabricator.wikimedia.org/T107532#1516562 (10GWicke) > Ideally, there would be a more general wrapper for ansible that allows folks to take the repo_config out of the salts... [22:57:26] (03PS1) 10Ori.livneh: Revert "Use external diff, now that lightprocess is enabled" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/230002 [22:57:36] (03PS2) 10Ori.livneh: Revert "Use external diff, now that lightprocess is enabled" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/230002 [22:57:48] (03CR) 10Ori.livneh: [C: 032] Revert "Use external diff, now that lightprocess is enabled" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/230002 (owner: 10Ori.livneh) [22:57:54] (03Merged) 10jenkins-bot: Revert "Use external diff, now that lightprocess is enabled" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/230002 (owner: 10Ori.livneh) [23:00:04] RoanKattouw ostriches rmoen Krenair: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150806T2300). Please do the needful. [23:00:04] ebernhardson: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:27] 6operations, 10Wikimedia-Mailing-lists, 5Patch-For-Review: install jessie on new VM for mailman - https://phabricator.wikimedia.org/T108070#1516580 (10Dzahn) installer finished normal, i did change boot order to disk, i tried to connect to console: ``` [ganeti1003:~] $ sudo gnt-instance console fermium.eq... [23:00:42] just me? 
I'll ship it i guess [23:02:54] (03PS1) 10Dzahn: fermium: comment from site.pp, just defaults for now [puppet] - 10https://gerrit.wikimedia.org/r/230003 (https://phabricator.wikimedia.org/T108070) [23:03:33] (03CR) 10Dzahn: [C: 032] fermium: comment from site.pp, just defaults for now [puppet] - 10https://gerrit.wikimedia.org/r/230003 (https://phabricator.wikimedia.org/T108070) (owner: 10Dzahn) [23:05:39] !log ebernhardson Synchronized php-1.26wmf17/extensions/CirrusSearch: Bump cirrusearch in 1.26wmf17 for SWAT (duration: 00m 11s) [23:05:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:06:10] * ebernhardson isn't a huge fan of undefined class warnings that spit out while deploying. Every run of sync-* kicks off a few failed requests :( [23:06:36] 6operations, 10Wikimedia-Mailing-lists, 5Patch-For-Review: install jessie on new VM for mailman - https://phabricator.wikimedia.org/T108070#1516607 (10Dzahn) do not apply base::firewall before initial run. also got console now, just needed more patience: ``` Debian GNU/Linux 8 fermium ttyS0 fermium login...
[23:07:33] !log puppet/salt-master: signing certs and adding keys for fermium [23:07:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:08:23] (03CR) 10EBernhardson: [C: 032] Start CirrusSearch AB test on suggestion confidence [mediawiki-config] - 10https://gerrit.wikimedia.org/r/229462 (https://phabricator.wikimedia.org/T108103) (owner: 10EBernhardson) [23:08:54] (03Merged) 10jenkins-bot: Start CirrusSearch AB test on suggestion confidence [mediawiki-config] - 10https://gerrit.wikimedia.org/r/229462 (https://phabricator.wikimedia.org/T108103) (owner: 10EBernhardson) [23:09:43] !log ebernhardson Synchronized wmf-config/: Start cirrussearch suggester confidence AB test (duration: 00m 13s) [23:09:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:11:45] (03PS1) 10Dzahn: fermium: enable in site.pp, include admin [puppet] - 10https://gerrit.wikimedia.org/r/230007 (https://phabricator.wikimedia.org/T108070) [23:12:19] (03CR) 10Dzahn: [C: 032] fermium: enable in site.pp, include admin [puppet] - 10https://gerrit.wikimedia.org/r/230007 (https://phabricator.wikimedia.org/T108070) (owner: 10Dzahn) [23:13:11] tgr: Do you want that group0 auth logging out today? [23:13:53] (03PS3) 10Dzahn: admin: non-sudo shell for johnflewis on fermium [puppet] - 10https://gerrit.wikimedia.org/r/230000 (https://phabricator.wikimedia.org/T108082) [23:14:47] ebernhardson: still swatting? [23:15:02] bd808: Krinkle and I are about to push out a change [23:15:05] give me a minute? 
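The fermium bootstrap steps logged at [23:07:33] (signing the puppet cert and adding the salt key) correspond roughly to the commands below, run on the respective masters. The hostname comes from the log; the exact invocations are assumptions based on standard puppet 3 / salt CLI usage of the era, not commands quoted from the log.

```shell
# On the puppet master: sign the new host's pending certificate request
puppet cert sign fermium.eqiad.wmnet

# On the salt master: accept the new minion's key
salt-key -a fermium.eqiad.wmnet
```

Once both keys are in place, the initial puppet run mentioned later in the log ("initial puppet run, added standard and admin classes") can proceed.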
[23:15:08] np [23:15:22] ori: we just have this tiny one -- https://gerrit.wikimedia.org/r/#/c/229618/ [23:15:32] !log ebernhardson Synchronized wmf-config/: Redeploy cirrussearch ab test start (duration: 00m 14s) [23:15:36] (03PS2) 10Ori.livneh: Enable authmetrics logging on group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/229618 (https://phabricator.wikimedia.org/T91701) (owner: 10Gergő Tisza) [23:15:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:15:40] (03CR) 10Ori.livneh: [C: 032] Enable authmetrics logging on group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/229618 (https://phabricator.wikimedia.org/T91701) (owner: 10Gergő Tisza) [23:15:46] (03Merged) 10jenkins-bot: Enable authmetrics logging on group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/229618 (https://phabricator.wikimedia.org/T91701) (owner: 10Gergő Tisza) [23:15:58] (03CR) 10Dzahn: [C: 032] "acked by Faidon/Mark on meeting and IRC, no sudo access for now" [puppet] - 10https://gerrit.wikimedia.org/r/230000 (https://phabricator.wikimedia.org/T108082) (owner: 10Dzahn) [23:16:21] !log ori Synchronized wmf-config/InitialiseSettings.php: I053a6e9: Enable authmetrics logging on group0 wikis (duration: 00m 12s) [23:16:25] bd808, tgr ^ [23:16:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:16:34] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 42.86% of data above the critical threshold [500.0] [23:17:02] ori: thx I'll go look at graphite [23:17:11] bd808: that is fallout from ebernhardson's sync [23:17:23] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 54 data above and 6 below the confidence bounds [23:17:28] um [23:17:50] 6operations, 10Deployment-Systems, 10RESTBase, 6Services, 5Patch-For-Review: [Discussion] Move restbase config to Ansible (or $deploy_system in general)? 
- https://phabricator.wikimedia.org/T107532#1516652 (10GWicke) [23:17:56] https://graphite.wikimedia.org/render/?title=HTTP%205xx%20Responses%20-2hours&from=-2hours&width=1024&height=500&until=now&areaMode=none&hideLegend=false&lineWidth=2&lineMode=connected&target=color(cactiStyle(alias(reqstats.500,%22500%20resp/min%22)),%22red%22)&target=color(cactiStyle(alias(reqstats.5xx,%225xx%20resp/min%22)),%22blue%22) [23:17:57] our patch sends some new metrics to graphite.... [23:18:10] it spiked at :10 [23:18:11] wtf [23:18:19] lines up with !log ebernhardson Synchronized wmf-config/: Start cirrussearch suggester confidence AB test (duration: 00m 13s) [23:18:37] 6operations, 10Wikimedia-Mailing-lists, 5Patch-For-Review: install jessie on new VM for mailman - https://phabricator.wikimedia.org/T108070#1516657 (10Dzahn) signed puppet cert on master, added salt key on master, initial puppet run, added standard and admin classes, done [23:18:59] (03PS1) 10EBernhardson: Add missing prefix_length cirrus suggest property [mediawiki-config] - 10https://gerrit.wikimedia.org/r/230010 [23:19:05] 10Ops-Access-Requests, 6operations, 10Wikimedia-Mailing-lists, 5Patch-For-Review: give John Lewis shell access on the mailman staging VM - https://phabricator.wikimedia.org/T108082#1516659 (10Dzahn) [23:19:07] 6operations, 10Wikimedia-Mailing-lists: export config and archive data from sodium - https://phabricator.wikimedia.org/T108071#1516660 (10Dzahn) [23:19:08] 6operations, 10Wikimedia-Mailing-lists, 5Patch-For-Review: install jessie on new VM for mailman - https://phabricator.wikimedia.org/T108070#1516658 (10Dzahn) 5Open>3Resolved [23:19:10] 6operations, 10Wikimedia-Mailing-lists: Mailman Upgrade (Jessie & Mailman 2.x) and migration to a VM - https://phabricator.wikimedia.org/T105756#1516661 (10Dzahn) [23:19:12] (03CR) 10EBernhardson: [C: 032] Add missing prefix_length cirrus suggest property [mediawiki-config] - 10https://gerrit.wikimedia.org/r/230010 (owner: 10EBernhardson) [23:19:16] 
10Ops-Access-Requests, 6operations, 10Wikimedia-Mailing-lists, 5Patch-For-Review: give John Lewis shell access on the mailman staging VM - https://phabricator.wikimedia.org/T108082#1516662 (10Dzahn) a:3Dzahn [23:19:50] (03Merged) 10jenkins-bot: Add missing prefix_length cirrus suggest property [mediawiki-config] - 10https://gerrit.wikimedia.org/r/230010 (owner: 10EBernhardson) [23:20:24] !log ebernhardson Synchronized wmf-config/InitialiseSettings.php: (no message) (duration: 00m 12s) [23:20:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:20:45] looks better [23:20:51] looked better before you synced it, in fact [23:21:16] 6operations, 10Wikimedia-Mailing-lists: Mailman Upgrade (Jessie & Mailman 2.x) and migration to a VM - https://phabricator.wikimedia.org/T105756#1516681 (10Dzahn) [23:21:18] 10Ops-Access-Requests, 6operations, 10Wikimedia-Mailing-lists, 5Patch-For-Review: give John Lewis shell access on the mailman staging VM - https://phabricator.wikimedia.org/T108082#1516679 (10Dzahn) 5Open>3Resolved fermium: Notice: /Stage[main]/Admin/Admin::Hashuser[johnflewis]/Admin::User[johnflewis]... [23:24:03] !log ori Synchronized php-1.26wmf17: c5c52ec1d8: resourceloader: Async all the way (duration: 01m 41s) [23:24:08] LOOK OVER THERE A THREE-HEADED MONKEY [23:24:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:24:23] it went so fast I didn't see it [23:25:16] that's the performance team for you [23:25:50] WMF perf team: fast as a three-headed monkey [23:26:00] :) [23:27:14] !bash WMF perf team: fast as a three-headed monkey [23:27:35] https://tools.wmflabs.org/bash/quip/AU8FVwpD6snAnmqnLHK5 [23:28:34] Pull back!!! [23:28:47] https://phabricator.wikimedia.org/T107968 happening on svwp now! [23:29:05] has anyone other than you been able to reproduce this? [23:29:27] Yo, commons in maintenance mode? [23:29:50] not sure...Everything just went to "loading" mode...and then it happened...
GITM? Works now [23:31:02] (03PS2) 10Jforrester: VisualEditor: Switch config from …Namespaces to …AvailableNamespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/228040 [23:31:06] Josve05a: we think we know what it is; just a sec [23:31:57] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [23:31:58] * Krinkle is fixing the bug we see on sv.wikipedia.org where gadgets that don't use ResourceLoader load without jQuery [23:32:51] (03PS3) 10Jforrester: Enable VisualEditor on NS_PROJECT for meta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/228041 (https://phabricator.wikimedia.org/T107003) [23:35:39] Krinkle: https://www.irccloud.com/pastebin/FoCZOG5d/ [23:43:24] Josve05a: Ah, that's another deprecation popping up. svwiki has old gadgets that don't use ResourceLoader yet [23:45:08] 6operations, 6Collaboration-Team Backlog, 10Flow: Setup separate logical External Store for Flow - https://phabricator.wikimedia.org/T107610#1516765 (10Mattflaschen) >>! In T107610#1506876, @jcrespo wrote: > My suggestion is to separate it logically, but not physically for 2 reasons: I think this should be... [23:45:15] ah ok. Though they might be related, since I saw both errors at the same time [23:47:06] But I'm still seeing the big spinning ball of death... :/ [23:47:13] we have a bash for WMF? [23:47:16] Josve05a: Yeah, unrelated. But I'll see what we can do about that [23:47:17] neat! [23:47:25] Josve05a: I can't reproduce the spinning thing you keep mentioning though [23:47:32] :/ [23:47:34] Josve05a: Can you try an incognito window? [23:47:39] Do you have any Chrome extensions installed? [23:47:40] sure.... [23:47:44] umm...yes [23:48:10] hmm [23:48:47] it keeps popping up while logged in (not incognito), intermittently, on https://sv.wikipedia.org/wiki/Wikipediadiskussion:Flow [23:48:56] might be one of my add-ons but...
[23:49:07] I haven't installed anything new [23:53:19] :/ [23:56:05] Keep popping up intermittent...