[00:16:01] (03CR) 10BBlack: [C: 031] varnishrls4: use VSL query and proper tags [puppet] - 10https://gerrit.wikimedia.org/r/316966 (https://phabricator.wikimedia.org/T131353) (owner: 10Ema) [00:16:40] Anyone here have full access on tools? I have a dir that won't remove [00:18:44] Zppix|mobile: there is a specific channel for Wikimedia Labs stuff: #wikimedia-labs [00:19:08] You could perhaps have better luck there. [00:20:00] by the way, tools folders could be owned by your user or by the tool user, check it's not the other one you need to use [00:20:37] I used "take dirname" on every dir [00:20:40] I tried [00:21:50] do a ls -lh on the parent directory, you'll know who owns it, and what the chmod is [00:22:44] It's owned by the tool acct (my bot) because that's who created it [00:26:20] If what you want is to delete it, from tools.acct try a rm -rf foldername, could work if tools.acct owns the parent directory too [00:38:28] 06Operations, 10ops-eqiad: investigate lead hardware issue - https://phabricator.wikimedia.org/T147905#2733582 (10Dzahn) a:03Dzahn Yes, i'll talk to Chad about it next week when they are back from their offsite. [01:08:28] (03PS1) 10Dzahn: repeat hostname for AAAA,bast3/4,sodium,dataset,ms1001 [dns] - 10https://gerrit.wikimedia.org/r/317093 [01:20:50] PROBLEM - cassandra-a CQL 10.64.0.117:9042 on restbase1011 is CRITICAL: Connection refused [01:21:02] PROBLEM - cassandra-a service on restbase1011 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [01:21:49] PROBLEM - cassandra-a SSL 10.64.0.117:7001 on restbase1011 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:IO::Socket::SSL: connect: Connection refused [01:23:31] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: /{domain}/v1/page/mobile-summary/{title} (retrieve page preview of Dog page) is CRITICAL: Test retrieve page preview of Dog page returned the unexpected status 500 (expecting: 200) [01:24:37] that looks kinda bad [01:25:06] maybe it's not user-facing since it's codfw [01:26:08] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy [01:30:08] * mdholloway goes to run a few quick checks for peace of mind... [01:36:27] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/mobile-summary/{title} (retrieve page preview of Dog page) is CRITICAL: Test retrieve page preview of Dog page returned the unexpected status 500 (expecting: 200) [01:37:05] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: /{domain}/v1/page/mobile-summary/{title} (retrieve page preview of Dog page) is CRITICAL: Test retrieve page preview of Dog page returned the unexpected status 500 (expecting: 200) [01:39:00] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy [01:39:37] ori: not user-facing, because it's codfw, and because it's not actually an endpoint that's currently in use. but it's still troubling.
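For reference, the flapping check above can be reproduced by hand, which is presumably the kind of quick check mdholloway went to run. A minimal sketch, assuming direct access to the internal service endpoint; the port (8888) is an assumption, and the path simply instantiates the check's /{domain}/v1/page/mobile-summary/{title} template with the Dog page:

```bash
# Spot-check the mobileapps mobile-summary endpoint from inside the cluster.
# mobileapps.svc.codfw.wmnet:8888 is an assumed host:port for the service.
curl -s -o /dev/null -w '%{http_code}\n' \
  'http://mobileapps.svc.codfw.wmnet:8888/en.wikipedia.org/v1/page/mobile-summary/Dog'
# 500 reproduces the PROBLEM state above; 200 matches the RECOVERY lines.
```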
[01:39:45] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy [01:42:35] RECOVERY - cassandra-a service on restbase1011 is OK: OK - cassandra-a is active [01:43:24] RECOVERY - cassandra-a SSL 10.64.0.117:7001 on restbase1011 is OK: SSL OK - Certificate restbase1011-a valid until 2017-09-12 15:34:03 +0000 (expires in 326 days) [01:44:58] RECOVERY - cassandra-a CQL 10.64.0.117:9042 on restbase1011 is OK: TCP OK - 0.005 second response time on port 9042 [02:24:10] PROBLEM - cassandra-c SSL 10.64.32.204:7001 on restbase1012 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:IO::Socket::SSL: connect: Connection refused [02:26:51] RECOVERY - cassandra-c SSL 10.64.32.204:7001 on restbase1012 is OK: SSL OK - Certificate restbase1012-c valid until 2017-09-12 15:34:16 +0000 (expires in 326 days) [02:29:01] 06Operations, 07Puppet, 13Patch-For-Review, 07RfC: RFC: New puppet code organization paradigm/coding standards - https://phabricator.wikimedia.org/T147718#2733664 (10yuvipanda) I *think* I'm ok with the proposal as stated with the amendments @Joe mentioned :) [02:36:45] PROBLEM - cassandra-c CQL 10.64.32.204:9042 on restbase1012 is CRITICAL: Connection refused [02:37:53] PROBLEM - cassandra-c SSL 10.64.32.204:7001 on restbase1012 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:IO::Socket::SSL: connect: Connection refused [02:39:38] PROBLEM - cassandra-c service on restbase1012 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed [02:47:23] RECOVERY - cassandra-c CQL 10.64.32.204:9042 on restbase1012 is OK: TCP OK - 0.016 second response time on port 9042 [02:47:25] RECOVERY - cassandra-c service on restbase1012 is OK: OK - cassandra-c is active [02:48:29] RECOVERY - cassandra-c SSL 10.64.32.204:7001 on restbase1012 is OK: SSL OK - Certificate restbase1012-c valid until 2017-09-12 15:34:16 +0000 (expires in 326 days) [05:38:21] !log @@@@ J@IN #wikimedia-ayuda @@@@ [05:38:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [05:39:47] !log @@@@ J@IN #wikimedia-ayuda @@@@ [05:40:01] !log @@@@ J@IN #wikipedia-es @@@@ [05:40:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [05:40:21] !log @@@@ J@IN #wikipedia-es-biblios @@@@ [05:40:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [05:40:41] @@@@ J@IN #wikimedia-ops-internal @@@@ [05:41:01] !log @@@@ J@IN #wikimedia-ops-internal @@@@ [05:41:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [05:41:12] !log @@@@ J@IN #wikimedia-ops @@@@ [05:41:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [05:42:03] guess that bot still isn't fixed [05:42:21] * AlexZ cleans up SAL [05:42:38] <_joe_> AlexZ: I can take care of twitter now [05:42:58] _joe_: thanks [05:43:01] <_joe_> at least this time he is just being annoying [05:43:03] <_joe_> :P [05:43:10] yeah [05:43:16] oh, Edit conflict [05:45:36] <_joe_> I was tempted to log "SebastianPinera is a sad douche" [05:45:45] XDDDD [05:45:45] <_joe_> but that'll make him happy [05:47:14] <_joe_> uhm for some reason pwstore is failing on me [05:48:42] Works for me, I will clean twitter [05:49:24] done [05:49:45] <_joe_> marostegui: heh, at the same time [05:52:25] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: es2015 crashed with no os logs (kernel logs or other software ones) - it suddenly went down - https://phabricator.wikimedia.org/T147769#2733733
(10Marostegui) Maybe this server still needs a reboot, as it has been having the icinga warning about not be... [05:55:24] !log @@!!@@ J@IN #wikimedia-ayuda @@!!@@ [05:55:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [05:55:35] !log @@!!@@ J@IN #wikimedia-ops-internal @@!!@@ [05:55:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [05:55:48] @@!!@@ J@IN #wikipedia-es @@!!@@ [05:56:37] @@!!@@ J@IN #wikimpdia-ayuda @@!!@@ [05:56:50] !log @@!!@@ J@IN #wikipedia-es-biblios @@!!@@ [05:56:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:02:54] _joe_: morning, is there a ticket to whitelist the bot users? [06:05:32] <_joe_> matanya: nope [06:05:52] _joe_: you think we need it? [06:06:53] <_joe_> matanya: absolutely and surely not. I just want to unlink the damn twitter feed [06:07:43] https://phabricator.wikimedia.org/T148119 [06:10:13] and T148120 unless they got merged [06:11:48] 06Operations, 10ops-eqiad, 10Prod-Kubernetes, 05Kubernetes-production-experiment, and 2 others: Rack/Setup Kubernetes Servers - https://phabricator.wikimedia.org/T147933#2708958 (10Joe) [06:20:26] 06Operations, 10ops-eqiad, 10Prod-Kubernetes, 05Kubernetes-production-experiment, and 2 others: Rack/Setup Kubernetes Servers - https://phabricator.wikimedia.org/T147933#2708958 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by oblivian on neodymium.eqiad.wmnet for hosts: ``` ['kubernetes1003.... [06:25:15] 06Operations, 10ops-eqiad, 10Prod-Kubernetes, 05Kubernetes-production-experiment, and 2 others: Rack/Setup Kubernetes Servers - https://phabricator.wikimedia.org/T147933#2733746 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by oblivian on neodymium.eqiad.wmnet for hosts: ``` ['kubernetes1002.... [06:26:01] RECOVERY - Disk space on cp4014 is OK: DISK OK [06:26:08] !log restarting stat1001 for kernel upgrades (will cause a brief outage for some analytics websites like analytics.w.o and pivot.w.o) [06:26:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:32:51] did we have another API outage today? around 01:22, 21 October 2016 (UTC)? https://commons.wikimedia.org/wiki/Commons:Help_desk#Upload_problem [06:54:53] 06Operations, 10ops-eqiad, 10Prod-Kubernetes, 05Kubernetes-production-experiment, and 2 others: Rack/Setup Kubernetes Servers - https://phabricator.wikimedia.org/T147933#2733749 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['kubernetes1003.eqiad.wmnet'] ``` Those hosts were successful: ``` [... [06:58:06] <_joe_> volans: ^^ bug! [06:59:29] (03PS1) 10Muehlenhoff: Update to 4.4.26 [debs/linux44] - 10https://gerrit.wikimedia.org/r/317116 [07:11:18] _joe_: failed? [07:11:25] <_joe_> nope [07:11:32] <_joe_> go look at the ticket [07:12:19] <_joe_> the reimage was successful [07:14:23] 06Operations, 10ops-eqiad, 10Prod-Kubernetes, 05Kubernetes-production-experiment, and 2 others: Rack/Setup Kubernetes Servers - https://phabricator.wikimedia.org/T147933#2733750 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by oblivian on neodymium.eqiad.wmnet for hosts: ``` ['kubernetes1004....
[07:15:19] _joe_: not completely true, salt is not running, certificate rejected, it's a bug in wmf-reimage that sometimes doesn't clean it properly ;) [07:15:37] <_joe_> sigh [07:15:44] so all the steps after the wmf-reimage call were not completed [07:16:11] <_joe_> but the message in phab is then confusing [07:16:21] Those hosts were successful: [] ??? [07:16:22] happens in about 10-15% of reimage runs [07:16:33] it's clearly not successful [07:16:36] <_joe_> volans: and just before [07:17:01] if you reimage N you get the whole list and the successful list [07:17:07] you could get the failed one if you want :D [07:17:09] <_joe_> volans: meaning there is no error report [07:17:20] it's in the log [07:17:31] that is reported in the task /var/log/wmf-auto-reimage/201610210620_oblivian_7697.log [07:17:34] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure: Create maintain-views user for labsdb1001 and labsdb1003 - https://phabricator.wikimedia.org/T148560#2733752 (10Marostegui) I had a chat with Jaime yesterday about the past issues with the wildcard-based privileges and it is certainly worrying. Probably... [07:17:41] <_joe_> yeah, ok, you should add some message in phabricator, IMHO [07:18:01] <_joe_> or whoever looks at the ticket has no idea what is happening [07:18:14] sure, we can improve the messaging [07:18:17] <_joe_> (yes, a list of failed hosts would be enough, maybe) [07:18:26] and it makes sense to add the failed ones instead of the successful ones [07:20:16] !log rebooting stat100[234] for kernel upgrades [07:20:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:20:33] 06Operations, 10ops-eqiad, 10Prod-Kubernetes, 05Kubernetes-production-experiment, and 2 others: Rack/Setup Kubernetes Servers - https://phabricator.wikimedia.org/T147933#2733753 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['kubernetes1002.eqiad.wmnet'] ``` Those hosts were successful: ``` [... [07:21:47] _joe_: man, you're unlucky or we have a bug in wmf-reimage for new hosts? [07:22:12] actually being a new host...
is not the old cert for sure, has to be the signing of the new [07:22:46] a quick trick is, if you go to neodymium while it's waiting for salt and sign it, it will continue because of the retries [07:23:12] PROBLEM - Elasticsearch HTTPS on elastic1027 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:IO::Socket::SSL: connect: Connection refused [07:23:13] PROBLEM - puppet last run on elastic1027 is CRITICAL: Connection refused by host [07:23:22] PROBLEM - configured eth on elastic1027 is CRITICAL: Connection refused by host [07:23:47] <_joe_> yeah that uhm [07:23:58] <_joe_> 1027, that happened yesterday as well [07:24:50] <_joe_> uhm someone is reimaging it I guess [07:25:51] RECOVERY - Elasticsearch HTTPS on elastic1027 is OK: SSL OK - Certificate elastic1027.eqiad.wmnet valid until 2021-03-15 20:29:03 +0000 (expires in 1606 days) [07:26:01] RECOVERY - puppet last run on elastic1027 is OK: OK: Puppet is currently enabled, last run 11 minutes ago with 0 failures [07:26:13] probably gehel rebooting it for the kernel update [07:26:15] RECOVERY - configured eth on elastic1027 is OK: OK - interfaces up [07:26:33] elastic/codfw is done, so eqiad is likely WIP now [07:28:11] <_joe_> moritzm: not just rebooting, I could not log in [07:30:50] no, gehel is rebooting codfw afaik [07:31:31] <_joe_> ok I know what's happening, and I guess it's my fault [07:31:47] dcausse: elastic seems completed [07:31:52] dcausse: elastic/codfw seems completed [07:32:10] really? nive it was fast this time [07:32:15] s/nive/nice/ [07:32:27] <_joe_> sorry, my bad [07:32:37] ? [07:34:10] <_joe_> I got an A record mixed in the DNS [07:34:18] <_joe_> and I guess in dhcp too [07:34:32] <_joe_> as a consequence [07:35:18] (03PS1) 10Giuseppe Lavagetto: Fix kubernetes1004 A record [dns] - 10https://gerrit.wikimedia.org/r/317117 [07:35:50] (03CR) 10Giuseppe Lavagetto: [C: 032] Fix kubernetes1004 A record [dns] - 10https://gerrit.wikimedia.org/r/317117 (owner: 10Giuseppe Lavagetto) [07:42:31] dcausse: yep, codfw is just finished. [07:43:21] _joe_: we've had what seemed like short connection issues yesterday on elastic1027 (T148736) [07:43:21] T148736: Write errors on elasticsearch eqiad - https://phabricator.wikimedia.org/T148736 [07:45:05] _joe_: could the k8 DNS entry explain that? [07:46:13] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure: Create maintain-views user for labsdb1001 and labsdb1003 - https://phabricator.wikimedia.org/T148560#2733815 (10Marostegui) So the tricky part is this one: ``` DEFINER=viewmaster ``` As per MySQL documentation: ``` If you specify the DEFINER clause, y... [07:47:42] <_joe_> gehel: It's possible, I turned on that machine yesterday for a few minutes [07:47:53] around 8am UTC? [07:48:16] <_joe_> I don't remember, tbh [07:48:22] <_joe_> let me see some timestamps [07:48:38] * gehel Ok, I' [07:48:55] The intervals where we had issues on elastic1027: [08:04:28-08:06:05] [08:09:08-08:09:54] [08:14:38-08:15:28] [08:18:38-08:19:54] [07:50:36] <_joe_> gehel: that's possible, yes [07:51:11] _joe_: Thanks! I really had no idea what to look for! Quite happy to have an explanation!
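An aside on volans' neodymium trick above: the manual workaround amounts to accepting the new minion's key on the salt master yourself while wmf-auto-reimage is still retrying. A minimal sketch, assuming root on the master; the hostname is illustrative:

```bash
# On the salt master (neodymium), list minion keys waiting to be signed.
salt-key --list unaccepted
# Accept the stuck host's key (-y skips the confirmation prompt).
salt-key -y --accept 'kubernetes1003.eqiad.wmnet'
# Verify the minion now answers; the retrying reimage run should proceed.
salt 'kubernetes1003.eqiad.wmnet' test.ping
```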
[07:51:13] (03CR) 10Muehlenhoff: [C: 032] Update to 4.4.26 [debs/linux44] - 10https://gerrit.wikimedia.org/r/317116 (owner: 10Muehlenhoff) [07:51:20] <_joe_> it went on a PXE loop for a bit [07:51:24] <_joe_> sigh [07:51:25] (03PS3) 10Urbanecm: Show changes from last 14 days in watchlist in cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/316295 (https://phabricator.wikimedia.org/T148327) [07:51:25] <_joe_> sorry [07:52:14] !log rebooting bohrium (hosting piwik) for kernel update [07:52:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:52:30] _joe_: no problem, there is luckily enough resilience built into Cirrus for that to not be a major issue [07:53:02] <_joe_> something is wrong again though [07:54:18] (03CR) 10Urbanecm: Show changes from last 14 days in watchlist in cswiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/316295 (https://phabricator.wikimedia.org/T148327) (owner: 10Urbanecm) [07:54:29] _joe_: I connected to elastic1027 a few minutes ago, but now I get a mismatched SSH fingerprint... [07:54:54] <_joe_> gehel: it's dhcp [07:54:59] <_joe_> it still has the old lease [07:55:04] <_joe_> sigh [07:55:49] _joe_: I suspect you know how to fix that better than I do... [07:57:05] <_joe_> gehel: yup I should be able to handle this [07:58:25] _joe_: thanks [08:00:07] <_joe_> fixed [08:01:57] _joe_: thanks, looks good from my side as well [08:05:56] !log rolling reboot of swift backend servers in codfw [08:06:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:07:08] (03PS1) 10Jcrespo: mariadb: pool db1053 as the new rc special slave after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/317118 [08:07:29] \o/ [08:11:04] (03PS1) 10Volans: wmf-auto-reimage: improve messaging [puppet] - 10https://gerrit.wikimedia.org/r/317119 (https://phabricator.wikimedia.org/T148815) [08:11:39] _joe_: ^^^ for you :) [08:11:48] <_joe_> volans: <3 [08:12:22] !log rebooting bast1001 for kernel update [08:12:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:14:57] (03PS1) 10Giuseppe Lavagetto: kubernetes: install the worker class on the kubernetes1001-4 [puppet] - 10https://gerrit.wikimedia.org/r/317120 [08:20:07] 06Operations, 10media-storage: ms-be1027 borked - https://phabricator.wikimedia.org/T148807#2733840 (10elukey) p:05Triage>03High [08:23:45] !log applying events_coredb_slave.sql to db1070 [08:23:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:23:52] 06Operations, 10media-storage: ms-be1027 borked - https://phabricator.wikimedia.org/T148807#2733479 (10MoritzMuehlenhoff) I think that's already handled via https://phabricator.wikimedia.org/T136631 [08:25:35] !log rebooting alsafi (url_downloader for codfw) for kernel update [08:25:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:26:51] (03PS2) 10Giuseppe Lavagetto: kubernetes: install the worker class on the kubernetes1001-4 [puppet] - 10https://gerrit.wikimedia.org/r/317120 [08:35:06] !log rebooting aluminium (url_downloader for eqiad) for kernel update [08:35:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:35:17] (03PS1) 10Elukey: Add ppchelko to the deployment group [puppet] - 10https://gerrit.wikimedia.org/r/317121 (https://phabricator.wikimedia.org/T148475) [08:37:44] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review, 06Services (blocked): 
Access to fluorine for Petr - https://phabricator.wikimedia.org/T148475#2733853 (10elukey) So the next step is to wait for the Ops meeting [08:38:28] 06Operations, 13Patch-For-Review, 06Services (blocked): Access to fluorine for Petr - https://phabricator.wikimedia.org/T148475#2733855 (10elukey) [08:38:53] !log rebooting radium (tor relay) for kernel update [08:38:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:39:20] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review, 06Services (blocked): Access to fluorine for Petr - https://phabricator.wikimedia.org/T148475#2724106 (10elukey) [08:40:11] PROBLEM - puppet last run on kraz is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:40:32] !log Deploying schema change S1 enwiki.ores_model in codfw - T147734 [08:40:33] T147734: Review and deploy 309825 - https://phabricator.wikimedia.org/T147734 [08:40:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:41:33] 06Operations, 10Ops-Access-Requests, 06Release-Engineering-Team, 13Patch-For-Review: Add Tyler Cipriani to releasers-mediawiki - https://phabricator.wikimedia.org/T148681#2729869 (10elukey) As far as I can see so sudo is granted for this group, so we'd only need for the usual three days period before proce... [08:45:13] (03PS3) 10Giuseppe Lavagetto: kubernetes: install the worker class on the kubernetes1001-4 [puppet] - 10https://gerrit.wikimedia.org/r/317120 [08:48:06] (03PS4) 10Giuseppe Lavagetto: kubernetes: install the worker class on the kubernetes1001-4 [puppet] - 10https://gerrit.wikimedia.org/r/317120 [08:49:55] !log rebooting ununpentium (RT) for kernel update [08:50:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:50:33] (03PS1) 10Jcrespo: dbtools: binarly log and drop fixes on the events [software] - 10https://gerrit.wikimedia.org/r/317122 (https://phabricator.wikimedia.org/T148790) [08:51:14] (03PS2) 10Jcrespo: dbtools: binary log and drop fixes on the events [software] - 10https://gerrit.wikimedia.org/r/317122 (https://phabricator.wikimedia.org/T148790) [08:51:28] (03PS2) 10Volans: wmf-auto-reimage: improve messaging [puppet] - 10https://gerrit.wikimedia.org/r/317119 (https://phabricator.wikimedia.org/T148815) [08:53:04] (03CR) 10Jcrespo: [C: 032] dbtools: binary log and drop fixes on the events [software] - 10https://gerrit.wikimedia.org/r/317122 (https://phabricator.wikimedia.org/T148790) (owner: 10Jcrespo) [08:54:49] (03PS5) 10Giuseppe Lavagetto: kubernetes: install the worker class on the kubernetes1001-4 [puppet] - 10https://gerrit.wikimedia.org/r/317120 [09:00:04] 06Operations, 10hardware-requests: Decommission db1019 - https://phabricator.wikimedia.org/T147309#2733880 (10Marostegui) @Cmjohnson just to make sure you've not missed this one :) [09:00:51] (03PS6) 10Giuseppe Lavagetto: kubernetes: install the worker class on the kubernetes1001-4 [puppet] - 10https://gerrit.wikimedia.org/r/317120 [09:03:21] (03PS1) 10Jcrespo: dbtools: disable binary log for all events (not only on creation) [software] - 10https://gerrit.wikimedia.org/r/317123 (https://phabricator.wikimedia.org/T148790) [09:04:21] RECOVERY - puppet last run on kraz is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:06:31] !log rebooting serpens (labs LDAP server) for kernel update [09:06:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:07:53] (03PS1) 10Jcrespo: 
dbtools: explicitly set definer as root@localhost [software] - 10https://gerrit.wikimedia.org/r/317124 (https://phabricator.wikimedia.org/T148790) [09:08:23] (03CR) 10Jcrespo: [C: 032] dbtools: disable binary log for all events (not only on creation) [software] - 10https://gerrit.wikimedia.org/r/317123 (https://phabricator.wikimedia.org/T148790) (owner: 10Jcrespo) [09:09:17] (03CR) 10Jcrespo: [C: 032] dbtools: explicitly set definer as root@localhost [software] - 10https://gerrit.wikimedia.org/r/317124 (https://phabricator.wikimedia.org/T148790) (owner: 10Jcrespo) [09:12:59] (03PS3) 10Ema: varnishrls4: use VSL query and proper tags [puppet] - 10https://gerrit.wikimedia.org/r/316966 (https://phabricator.wikimedia.org/T131353) [09:13:20] (03CR) 10Ema: [C: 032 V: 032] varnishrls4: use VSL query and proper tags [puppet] - 10https://gerrit.wikimedia.org/r/316966 (https://phabricator.wikimedia.org/T131353) (owner: 10Ema) [09:13:31] !log reviewing and applying new watchdog events to all core dbs T148790 [09:13:32] T148790: apply events killing long running queries to db1070; any other production server - https://phabricator.wikimedia.org/T148790 [09:13:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:15:32] !log rebooting meitnerium (archiva.wikimedia.org) for kernel update [09:15:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:17:30] (03PS1) 10Filippo Giunchedi: prometheus-node-exporter: allow access from ipv6 too [puppet] - 10https://gerrit.wikimedia.org/r/317125 [09:18:56] !log start rolling reboot of ms-be machines in eqiad for kernel update [09:19:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:26:46] !log rebooting krypton for kernel update [09:26:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:31:32] (03PS7) 10Giuseppe Lavagetto: kubernetes: install the worker class on the kubernetes1001-4 [puppet] - 10https://gerrit.wikimedia.org/r/317120 [09:32:21] !log rebooting kafka100[12] for kernel upgrades (EventBus hosts) [09:32:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:35:40] !log Deploying schema change S1 enwiki.ores_model in eqiad - T147734 [09:35:41] T147734: Review and deploy 309825 - https://phabricator.wikimedia.org/T147734 [09:35:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:38:04] (03CR) 10Giuseppe Lavagetto: [C: 032] kubernetes: install the worker class on the kubernetes1001-4 [puppet] - 10https://gerrit.wikimedia.org/r/317120 (owner: 10Giuseppe Lavagetto) [09:38:33] 06Operations, 10ops-eqiad, 10media-storage: diagnose failed disks on ms-be1027 - https://phabricator.wikimedia.org/T140374#2733945 (10fgiunchedi) [09:38:35] 06Operations, 10media-storage: ms-be1027 borked - https://phabricator.wikimedia.org/T148807#2733947 (10fgiunchedi) [09:38:40] 06Operations, 10media-storage: ms-be1027 borked - https://phabricator.wikimedia.org/T148807#2733479 (10fgiunchedi) Indeed, more specifically T140374, merging [09:38:55] (03PS3) 10Gehel: decommission deployment-elastic08 [puppet] - 10https://gerrit.wikimedia.org/r/315940 (https://phabricator.wikimedia.org/T147777) [09:39:06] 06Operations, 10Traffic, 13Patch-For-Review: Port remaining scripts depending on varnishlog.py to new VSL API - https://phabricator.wikimedia.org/T131353#2733965 (10ema) 05Open>03Resolved [09:39:08] 06Operations, 10Traffic, 13Patch-For-Review: Upgrade all cache 
clusters to Varnish 4 - https://phabricator.wikimedia.org/T131499#2733966 (10ema) [09:40:20] (03CR) 10Gehel: [C: 032] decommission deployment-elastic08 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/315940 (https://phabricator.wikimedia.org/T147777) (owner: 10Gehel) [09:43:51] <_joe_> expect failures from nrpe on the kubernetes hosts [09:44:02] PROBLEM - Kafka MirrorMaker main-codfw_to_main-eqiad on kafka1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-codfw_to_main-eqiad/producer\.properties [09:44:14] 1002? [09:44:21] weird [09:44:27] <_joe_> crashed? [09:45:09] I think that it is not super tolerant to produce failures [09:45:30] if it detects some issues it shuts down [09:45:41] probably we need to configure it a bit better [09:45:44] will open a task [09:46:53] RECOVERY - Kafka MirrorMaker main-codfw_to_main-eqiad on kafka1002 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-codfw_to_main-eqiad/producer\.properties [09:46:59] there you go [09:48:56] !log rebooting neon (icinga host) for kernel update [09:49:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:49:26] (03PS1) 10Alexandros Kosiaris: apt: Add proxy for security-cdn.debian.org [puppet] - 10https://gerrit.wikimedia.org/r/317128 [09:51:02] !log Deploying schema change S5 wikidatawiki.ores_model - T147734 [09:51:03] T147734: Review and deploy 309825 - https://phabricator.wikimedia.org/T147734 [09:51:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:54:17] (03CR) 10Muehlenhoff: [C: 031] "Ack, the demand for the recent Linux updated exhausted the bandwidth on the official security mirrors..." [puppet] - 10https://gerrit.wikimedia.org/r/317128 (owner: 10Alexandros Kosiaris) [09:55:51] ACKNOWLEDGEMENT - DPKG on kubernetes1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. Giuseppe Lavagetto nrpe fails if docker is installed. [09:55:51] ACKNOWLEDGEMENT - Disk space on kubernetes1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. Giuseppe Lavagetto nrpe fails if docker is installed. [09:55:51] ACKNOWLEDGEMENT - MD RAID on kubernetes1002 is CRITICAL: Connection refused by host Giuseppe Lavagetto nrpe fails if docker is installed. [09:55:51] ACKNOWLEDGEMENT - configured eth on kubernetes1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. Giuseppe Lavagetto nrpe fails if docker is installed. [09:55:51] ACKNOWLEDGEMENT - dhclient process on kubernetes1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. Giuseppe Lavagetto nrpe fails if docker is installed. [09:55:51] ACKNOWLEDGEMENT - puppet last run on kubernetes1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. Giuseppe Lavagetto nrpe fails if docker is installed. [09:55:52] ACKNOWLEDGEMENT - salt-minion processes on kubernetes1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. Giuseppe Lavagetto nrpe fails if docker is installed. 
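A hedged illustration of the docker/nrpe interaction being acknowledged above: once docker-engine is running, facter's interface guessing can pick the docker0 bridge for the plain ipaddress fact, so monitoring keyed on that fact ends up probing an address the icinga host cannot reach. The printed values are illustrative:

```bash
# On an affected kubernetes host, compare facter's pick with reality:
facter ipaddress              # may now print 172.17.0.1 (the docker0 bridge)
ip -4 addr show dev docker0   # the bridge address facter may have picked up
ip -4 addr show dev eth0      # the host's actual routable address
```

Hence the patch just below, which moves the nrpe monitoring off the raw ipaddress fact.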
[10:03:03] (03PS1) 10Giuseppe Lavagetto: nrpe: use main_ipaddress, not the ipaddress fact [puppet] - 10https://gerrit.wikimedia.org/r/317130 (https://phabricator.wikimedia.org/T147181) [10:03:26] <_joe_> akosiaris, paravoid I need a careful check of ^^ [10:04:23] !log rebooting seaborgium (labs LDAP server) for kernel update [10:04:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:09:28] !log Deploying schema change S7 fawiki.ores_model - T147734 [10:09:29] T147734: Review and deploy 309825 - https://phabricator.wikimedia.org/T147734 [10:09:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:10:19] PROBLEM - puppet last run on kubernetes1004 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 5 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[docker-engine],File[/etc/lvm/profile/docker-thinpool.profile],Service[docker],Physical_volume[/dev/md2] [10:16:32] (03CR) 10Alexandros Kosiaris: "Premise looks good, we probably need a pretty cluster wide PCC to make sure we don't break stuff" [puppet] - 10https://gerrit.wikimedia.org/r/317130 (https://phabricator.wikimedia.org/T147181) (owner: 10Giuseppe Lavagetto) [10:21:00] <_joe_> akosiaris: already running :P [10:21:58] <_joe_> akosiaris: we have a serious issue though [10:22:00] 06Operations, 07Puppet, 13Patch-For-Review, 07RfC: RFC: New puppet code organization paradigm/coding standards - https://phabricator.wikimedia.org/T147718#2734038 (10BBlack) >>! In T147718#2731712, @bd808 wrote: >>>! In T147718#2730658, @Joe wrote: >> - General feature flags like `has_ganglia` or `has_lvs`... [10:22:01] <_joe_> with the compiler [10:22:18] PROBLEM - puppet last run on kubernetes1003 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 12 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[docker-engine],File[/etc/lvm/profile/docker-thinpool.profile],Service[docker],Physical_volume[/dev/md2] [10:22:23] <_joe_> how do we get a complete list of our yaml nodes? [10:22:25] !log rolling reboot of restbase in codfw for kernel update [10:22:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:22:45] <_joe_> we have too many backends now :P [10:23:54] hmm [10:24:14] <_joe_> akosiaris: https://puppet-compiler.wmflabs.org/4456/ [10:24:15] we also need to clear the damn "trusted" fact that gets into the yaml cache for some reason occasionally [10:24:34] <_joe_> akosiaris: there is no "trusted" fact there IIRC [10:24:40] <_joe_> but lemme check [10:24:57] _joe_: https://puppet-compiler.wmflabs.org/4456/db1018.eqiad.wmnet/prod.db1018.eqiad.wmnet.err [10:25:03] that's what I mean [10:25:42] PROBLEM - Host kubernetes1002 is DOWN: PING CRITICAL - Packet loss = 100% [10:26:12] <_joe_> it's not... [10:26:27] <_joe_> did we export the docker ip for the host monitoring too? 
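On _joe_'s question above about getting a complete list of yaml nodes: a sketch, assuming each compiler backend keeps one cached fact file per certname under the puppet 3 default yamldir (the exact path on the compiler hosts is an assumption), to be run on, or aggregated across, every backend:

```bash
# One YAML fact file per known certname; strip the extension for the node list.
ls /var/lib/puppet/yaml/facts/ | sed 's/\.yaml$//' | sort -u
```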
[10:27:13] ACKNOWLEDGEMENT - Host wtp2019 is DOWN: PING CRITICAL - Packet loss = 100% alexandros kosiaris https://phabricator.wikimedia.org/T148710 [10:28:17] !log rebooting radon (ns0) [10:28:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:31:57] 06Operations, 10Traffic: Standardize varnish applayer backend definitions - https://phabricator.wikimedia.org/T147844#2734045 (10BBlack) [10:31:59] 06Operations, 10Traffic, 10Wikimedia-Logstash, 13Patch-For-Review: Move logstash.wikimedia.org (kibana) to an LVS service - https://phabricator.wikimedia.org/T132458#2734043 (10BBlack) 05Open>03Resolved a:03Gehel [10:32:51] (03PS1) 10Giuseppe Lavagetto: profile::docker::engine: fixups to lvm configuration [puppet] - 10https://gerrit.wikimedia.org/r/317131 (https://phabricator.wikimedia.org/T147181) [10:33:49] (03CR) 10jenkins-bot: [V: 04-1] profile::docker::engine: fixups to lvm configuration [puppet] - 10https://gerrit.wikimedia.org/r/317131 (https://phabricator.wikimedia.org/T147181) (owner: 10Giuseppe Lavagetto) [10:35:29] PROBLEM - Host ns0-v4 is DOWN: PING CRITICAL - Packet loss = 100% [10:35:38] (03CR) 10Ema: [C: 031] site: add varnish_exporter to ulsfo/codfw maps/misc [puppet] - 10https://gerrit.wikimedia.org/r/316742 (owner: 10Filippo Giunchedi) [10:35:38] PROBLEM - Host radon is DOWN: PING CRITICAL - Packet loss = 100% [10:35:59] (03PS2) 10Giuseppe Lavagetto: profile::docker::engine: fixups to lvm configuration [puppet] - 10https://gerrit.wikimedia.org/r/317131 (https://phabricator.wikimedia.org/T147181) [10:36:24] !log Deploying schema change S2 several wikis for table ores_model - T147734 [10:36:25] T147734: Review and deploy 309825 - https://phabricator.wikimedia.org/T147734 [10:36:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:37:41] RECOVERY - Host radon is UP: PING OK - Packet loss = 0%, RTA = 2.00 ms [10:38:41] RECOVERY - Host ns0-v4 is UP: PING OK - Packet loss = 0%, RTA = 0.98 ms [10:39:11] (03CR) 10Alexandros Kosiaris: [C: 032] apt: Add proxy for security-cdn.debian.org [puppet] - 10https://gerrit.wikimedia.org/r/317128 (owner: 10Alexandros Kosiaris) [10:39:33] (03CR) 10Alexandros Kosiaris: [C: 032] "gonna merge this, feel free to revert" [puppet] - 10https://gerrit.wikimedia.org/r/316906 (owner: 10Alexandros Kosiaris) [10:39:37] (03PS3) 10Alexandros Kosiaris: check_ssl: Do not verify server cert chain on connect [puppet] - 10https://gerrit.wikimedia.org/r/316906 [10:39:40] (03CR) 10Alexandros Kosiaris: [V: 032] check_ssl: Do not verify server cert chain on connect [puppet] - 10https://gerrit.wikimedia.org/r/316906 (owner: 10Alexandros Kosiaris) [10:41:30] (03PS1) 10BBlack: rcstream: single-backend with manual failover [puppet] - 10https://gerrit.wikimedia.org/r/317132 (https://phabricator.wikimedia.org/T147845) [10:44:17] (03CR) 10Giuseppe Lavagetto: [C: 032] profile::docker::engine: fixups to lvm configuration [puppet] - 10https://gerrit.wikimedia.org/r/317131 (https://phabricator.wikimedia.org/T147181) (owner: 10Giuseppe Lavagetto) [10:44:34] (03PS3) 10Giuseppe Lavagetto: profile::docker::engine: fixups to lvm configuration [puppet] - 10https://gerrit.wikimedia.org/r/317131 (https://phabricator.wikimedia.org/T147181) [10:45:06] (03CR) 10BBlack: [C: 031] site: add varnish_exporter to ulsfo/codfw maps/misc [puppet] - 10https://gerrit.wikimedia.org/r/316742 (owner: 10Filippo Giunchedi) [10:45:16] PROBLEM - Check size of conntrack table on kubernetes1003 is CRITICAL: CHECK_NRPE: Socket timeout 
after 10 seconds. [10:45:25] PROBLEM - salt-minion processes on kubernetes1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:45:32] <_joe_> sigh, the other nrpe issue [10:45:37] (03CR) 10Giuseppe Lavagetto: [V: 032] profile::docker::engine: fixups to lvm configuration [puppet] - 10https://gerrit.wikimedia.org/r/317131 (https://phabricator.wikimedia.org/T147181) (owner: 10Giuseppe Lavagetto) [10:45:38] PROBLEM - Disk space on kubernetes1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:46:14] (03CR) 10Jcrespo: [C: 032] mariadb: pool db1053 as the new rc special slave after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/317118 (owner: 10Jcrespo) [10:46:35] PROBLEM - puppet last run on radon is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:48:48] PROBLEM - Host kubernetes1003 is DOWN: PING CRITICAL - Packet loss = 100% [10:48:55] RECOVERY - puppet last run on radon is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [10:49:06] !log jynus@mira Synchronized wmf-config/db-eqiad.php: mariadb: pool db1053 as the new rc special slave after maintenance (duration: 01m 00s) [10:49:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:56:18] commons seems happy with the change [10:56:41] :-) [10:56:56] I am sure the other server appreciate the help of db1053 [10:57:25] there is a lot of noise on the dbquery channel, though [10:58:27] Mmm no for me at 10:55 [10:58:37] I think things like "LoadBalancer::{closure}: found writes/callbacks pending." should be warning or info, not error [10:58:58] same for reconnections [10:59:32] reconnecting if the connection goes down should not be an error [11:01:23] (03CR) 10Jcrespo: "@Andrew, please give this a thought when you have the time, thank you." [puppet] - 10https://gerrit.wikimedia.org/r/316598 (owner: 10Jcrespo) [11:04:50] (03CR) 10Jcrespo: [C: 032] "Otto, this is very low priority, but let's schedule at some point before the end of the year a package upgrade + mysql reboot to fix bugs " [puppet] - 10https://gerrit.wikimedia.org/r/316595 (owner: 10Jcrespo) [11:04:56] (03PS2) 10Jcrespo: Use mariadb::service; prevent puppet from managing mysql symlinks [puppet] - 10https://gerrit.wikimedia.org/r/316595 [11:04:58] !log rebooting hafnium for kernel update [11:05:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:08:42] (03PS1) 10Giuseppe Lavagetto: profile::docker::engine: add directory creation [puppet] - 10https://gerrit.wikimedia.org/r/317133 [11:09:04] (03CR) 10Alexandros Kosiaris: "this looks fine but we will need to coordinate it a bit careful. 
We need to also open the puppetmaster ports in IPv6 otherwise the fronten" [dns] - 10https://gerrit.wikimedia.org/r/316032 (owner: 10Dzahn) [11:10:40] (03PS2) 10Jcrespo: Add mariadb::services to labs::db [puppet] - 10https://gerrit.wikimedia.org/r/316590 [11:11:32] PROBLEM - MD RAID on kubernetes1004 is CRITICAL: Connection refused by host [11:12:03] PROBLEM - configured eth on kubernetes1004 is CRITICAL: Connection refused by host [11:12:05] PROBLEM - salt-minion processes on kubernetes1004 is CRITICAL: Connection refused by host [11:12:13] PROBLEM - dhclient process on kubernetes1004 is CRITICAL: Connection refused by host [11:12:23] PROBLEM - Check size of conntrack table on kubernetes1004 is CRITICAL: Connection refused by host [11:12:35] PROBLEM - DPKG on kubernetes1004 is CRITICAL: Connection refused by host [11:12:58] PROBLEM - Disk space on kubernetes1004 is CRITICAL: Connection refused by host [11:13:12] (03CR) 10Jcrespo: [C: 032] "I got no -1, and probably I am the owner of the service, so I hope you trust my judgement and allow me to merge this :-)" [puppet] - 10https://gerrit.wikimedia.org/r/316590 (owner: 10Jcrespo) [11:15:03] (03PS2) 10Jcrespo: labs dns: Add mariadb::service and changes for new package [puppet] - 10https://gerrit.wikimedia.org/r/316598 [11:15:03] <_joe_> grrr [11:15:22] (03CR) 10Giuseppe Lavagetto: [C: 032] profile::docker::engine: add directory creation [puppet] - 10https://gerrit.wikimedia.org/r/317133 (owner: 10Giuseppe Lavagetto) [11:15:36] (03PS2) 10Giuseppe Lavagetto: profile::docker::engine: add directory creation [puppet] - 10https://gerrit.wikimedia.org/r/317133 [11:15:53] (03CR) 10Giuseppe Lavagetto: [V: 032] profile::docker::engine: add directory creation [puppet] - 10https://gerrit.wikimedia.org/r/317133 (owner: 10Giuseppe Lavagetto) [11:21:52] 06Operations, 10ops-eqiad, 10media-storage: diagnose failed disks on ms-be1027 - https://phabricator.wikimedia.org/T140374#2734127 (10Cmjohnson) Quick update, I created a ticket with HP, supplied with logs, I was contacted once for more information and provided but did not hear back in a few days. A phone c... 
[11:24:45] (03PS1) 10Niharika29: Set $wgCategoryCollation to 'uca-hr' for Croatian wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/317139 (https://phabricator.wikimedia.org/T148749) [11:27:41] PROBLEM - NTP on radon is CRITICAL: NTP CRITICAL: Offset unknown [11:28:06] ^ fixing [11:28:20] !log starting rolling restart of elasticsearch eqiad cluster [11:28:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:31:00] (03PS1) 10Jcrespo: proxysql: install mysql-client alongside the proxy for admin [puppet] - 10https://gerrit.wikimedia.org/r/317140 (https://phabricator.wikimedia.org/T148500) [11:32:05] (03CR) 10jenkins-bot: [V: 04-1] proxysql: install mysql-client alongside the proxy for admin [puppet] - 10https://gerrit.wikimedia.org/r/317140 (https://phabricator.wikimedia.org/T148500) (owner: 10Jcrespo) [11:33:42] (03PS2) 10Jcrespo: proxysql: install mysql-client alongside the proxy for admin [puppet] - 10https://gerrit.wikimedia.org/r/317140 (https://phabricator.wikimedia.org/T148500) [11:33:55] PROBLEM - Host kubernetes1004 is DOWN: PING CRITICAL - Packet loss = 100% [11:35:53] (03CR) 10Jcrespo: [C: 032] proxysql: install mysql-client alongside the proxy for admin [puppet] - 10https://gerrit.wikimedia.org/r/317140 (https://phabricator.wikimedia.org/T148500) (owner: 10Jcrespo) [11:35:59] (03PS3) 10Jcrespo: proxysql: install mysql-client alongside the proxy for admin [puppet] - 10https://gerrit.wikimedia.org/r/317140 (https://phabricator.wikimedia.org/T148500) [11:39:46] (03PS1) 10Elukey: Add user pmiazga and its related ssh key [puppet] - 10https://gerrit.wikimedia.org/r/317142 (https://phabricator.wikimedia.org/T148477) [11:43:37] (03CR) 10Jcrespo: "This is ok, it just need an appropriate time to be deployed, as it will temporarily break gerrit, bacula, otrs, etc." [puppet] - 10https://gerrit.wikimedia.org/r/316341 (owner: 10Muehlenhoff) [11:44:57] 06Operations, 10Ops-Access-Requests: Access to stat1002, stat1003, stat1004) for user pmiazga - https://phabricator.wikimedia.org/T148472#2734155 (10elukey) Yes stat1003 is the preferred one for data crunching, but stat1004 was also created to help spread the load. There is no clear boundary as far as I know,... [11:45:13] 06Operations, 10Ops-Access-Requests: Access to stat1002, stat1003, stat1004) for user pmiazga - https://phabricator.wikimedia.org/T148472#2734161 (10elukey) [11:45:15] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting access to "Production shell" for pmiazga - https://phabricator.wikimedia.org/T148477#2724142 (10elukey) [11:47:40] (03PS1) 10Giuseppe Lavagetto: puppet-facts-export: remove "trusted" fact [puppet] - 10https://gerrit.wikimedia.org/r/317143 [11:48:55] RECOVERY - NTP on radon is OK: NTP OK: Offset -0.002999663353 secs [11:48:57] <_joe_> akosiaris: ^^ [11:50:09] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting access to "Production shell" for pmiazga - https://phabricator.wikimedia.org/T148477#2734168 (10elukey) So next steps are: 1) Review https://gerrit.wikimedia.org/r/317142 and possibly merge, even if it will only create the user with bastion... 
[11:50:55] (03CR) 10Alexandros Kosiaris: [C: 031] puppet-facts-export: remove "trusted" fact [puppet] - 10https://gerrit.wikimedia.org/r/317143 (owner: 10Giuseppe Lavagetto) [11:51:25] (03CR) 10Giuseppe Lavagetto: [C: 032] puppet-facts-export: remove "trusted" fact [puppet] - 10https://gerrit.wikimedia.org/r/317143 (owner: 10Giuseppe Lavagetto) [11:53:51] 06Operations, 10EventBus, 10hardware-requests: eqiad/codfw: 1+1 Kafka broker in main clusters in eqiad and codfw - https://phabricator.wikimedia.org/T145082#2734169 (10mark) >>! In T145082#2620384, @RobH wrote: > First off, that is easily one of the best damned requests ever (in terms of populated system inf... [12:02:50] !log rebooting bromine for kernel update [12:02:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:18:07] PROBLEM - Disk space on elastic1030 is CRITICAL: DISK CRITICAL - free space: / 751 MB (2% inode=92%) [12:20:04] (03PS1) 10Jcrespo: labsdb proxy: add fake passwords for sqlproxy [labs/private] - 10https://gerrit.wikimedia.org/r/317144 [12:20:27] ^ checking disk space on elastic1030... [12:20:56] (03CR) 10Jcrespo: [C: 032] labsdb proxy: add fake passwords for sqlproxy [labs/private] - 10https://gerrit.wikimedia.org/r/317144 (owner: 10Jcrespo) [12:21:22] (03CR) 10Jcrespo: [V: 032] labsdb proxy: add fake passwords for sqlproxy [labs/private] - 10https://gerrit.wikimedia.org/r/317144 (owner: 10Jcrespo) [12:22:04] (03PS4) 10Jcrespo: Create labs::db::proxy role to load balance and failover replicas [puppet] - 10https://gerrit.wikimedia.org/r/316558 (https://phabricator.wikimedia.org/T148500) [12:23:55] (03CR) 10Jcrespo: [C: 032] Create labs::db::proxy role to load balance and failover replicas [puppet] - 10https://gerrit.wikimedia.org/r/316558 (https://phabricator.wikimedia.org/T148500) (owner: 10Jcrespo) [12:24:46] !log rebooting ruthenium for kernel update [12:24:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:25:28] (03CR) 10QChris: [C: 04-1] "> @QChris hi, how can I support T1#1 for example please?" [puppet] - 10https://gerrit.wikimedia.org/r/308753 (https://phabricator.wikimedia.org/T85002) (owner: 10Paladox) [12:27:01] (03PS12) 10Paladox: Add support for searching gerrit using bug:T1 [puppet] - 10https://gerrit.wikimedia.org/r/308753 (https://phabricator.wikimedia.org/T85002) [12:27:06] (03PS13) 10Paladox: Add support for searching gerrit using bug:T1 [puppet] - 10https://gerrit.wikimedia.org/r/308753 (https://phabricator.wikimedia.org/T85002) [12:27:22] (03CR) 10Paladox: "@Qchris would this work? I removed the #." [puppet] - 10https://gerrit.wikimedia.org/r/308753 (https://phabricator.wikimedia.org/T85002) (owner: 10Paladox) [12:34:08] PROBLEM - puppet last run on cp3003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:38:16] RECOVERY - Disk space on elastic1030 is OK: DISK OK [12:40:33] 06Operations, 10Traffic: ERR_CONTENT_DECODING_FAILED on certain png images from varnish-fe - https://phabricator.wikimedia.org/T148830#2734202 (10ema) [12:43:11] 06Operations, 10Traffic: ERR_CONTENT_DECODING_FAILED on certain png images from varnish-fe - https://phabricator.wikimedia.org/T148830#2734215 (10ema) [12:59:42] RECOVERY - puppet last run on cp3003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:07:01] gehel: ping! 
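An aside on the elastic1030 disk-space blip above: the usual first-pass triage when / fills up is to confirm the shortage and then walk down to the offending directory tree. A minimal sketch, nothing host-specific assumed beyond the root filesystem:

```bash
# Confirm how full the root filesystem really is.
df -h /
# Walk one directory level at a time, staying on this filesystem (-x),
# to see where the space went; repeat into the biggest subtree.
du -xh --max-depth=1 / 2>/dev/null | sort -h | tail -n 10
```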
[13:10:20] 06Operations, 10Traffic: ERR_CONTENT_DECODING_FAILED on certain png images from varnish-fe - https://phabricator.wikimedia.org/T148830#2734266 (10ema) p:05Triage>03Normal [13:12:58] urandom: pong [13:13:47] gehel: Q: how beefy of a workstation do you have, how much RAM? [13:14:12] urandom: 32Go [13:14:16] gehel: would you be in a position to analyze a 12G JVM heap dump? [13:14:27] * urandom whistles [13:14:27] urandom: most probably, yes [13:14:34] sweet! [13:14:35] !log Deploying schema change S6 ruwiki for table ores_model - T147734 [13:14:36] T147734: Review and deploy 309825 - https://phabricator.wikimedia.org/T147734 [13:14:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:14:57] urandom: where is that dump? It is going to take some time to copy over... [13:15:41] * gehel has more free RAM than free HDD space [13:15:50] gehel: OK, so for background: https://phabricator.wikimedia.org/T148516 [13:16:10] gehel: there is a section there labeled "heap dumps" in the description [13:16:16] with machines and paths [13:16:40] we want one of the 12G ones, so 1007, 1010, or 1013 [13:17:00] and... there is some chance that any one or all of them won't be of help [13:17:39] Cassandra's OOM error handling sometimes interferes with the on-OOM heap dump [13:18:20] it's obvious that this has happened when the dump is less than our max heap size (12G), but it can be true otherwise too [13:18:33] urandom: ok, I'll have a look [13:19:09] gehel: thanks! [13:19:26] marostegui, be aware that when stashbot replies on tasks, any text matching phab app keys is highlighted and linked inappropriately [13:19:30] gehel: i wonder if there is somewhere we can copy these [13:20:09] in this case "S6" trying to link to a Space [13:20:23] gehel: somewhere under /srv on bast1001 maybe? [13:20:27] temporarily of course [13:20:43] "s6" lowercase wouldn't link [13:20:44] arseny92: yep, I see that. Is there any way to avoid that from my side? [13:26:17] so if an item exists (in this case "S1"), it links to the Space S1 Public. If it doesn't exist, phab doesn't link, but it will link if the item gets created anytime in the future. The way to avoid that is to write things like this lowercased, especially since s6 in that specific case is referred to as s6 lowercased anyway [13:26:40] arseny92: got it. So lowercase it will be from now on! Thanks for the advice [13:26:47] see in the comment/reply box and the post preview [13:27:41] if you write T123456 you get a link, if you type T1234567 you don't get a link because the task doesn't yet exist [13:28:53] 06Operations, 10ops-eqiad, 13Patch-For-Review: Add new disks to syslog server in eqiad (lithium) - https://phabricator.wikimedia.org/T143307#2563672 (10akosiaris) I had a look at this. This is not network related. carbon answers as it should, the routers relay the DHCP packets as they should. AFAICT it's the...
[13:29:03] but if you type t123456 it isn't linked [13:30:14] I see I see [13:30:36] true for any phab app links, such as Spaces (S), Files (F), Maniphest (T), Legalpad (L), Pastes (P) etc [13:30:52] (03PS7) 10Alexandros Kosiaris: icinga: Kill /etc/icinga/puppet_hostextinfo.cfg [puppet] - 10https://gerrit.wikimedia.org/r/315242 [13:30:56] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] icinga: Kill /etc/icinga/puppet_hostextinfo.cfg [puppet] - 10https://gerrit.wikimedia.org/r/315242 (owner: 10Alexandros Kosiaris) [13:31:45] 06Operations, 10Cassandra, 06Services (doing): some cassandra instances restarted with java.lang.OutOfMemoryError: Java heap space - https://phabricator.wikimedia.org/T148516#2734295 (10Eevans) Some additional data points: restbase1011-a and restbase1012-c OOM'd earlier today as well (at 2016-10-21T01:15:04... [13:32:25] (03PS1) 10Giuseppe Lavagetto: profile::docker::engine: fix docker config [puppet] - 10https://gerrit.wikimedia.org/r/317147 [13:33:35] ...Differential (D) and Diffusion (rOMWC will link to the callsign of a repo or to the change accordingly, but romwc, rOmwc, rOMWc will not) [13:33:44] (03CR) 10Giuseppe Lavagetto: [C: 032] profile::docker::engine: fix docker config [puppet] - 10https://gerrit.wikimedia.org/r/317147 (owner: 10Giuseppe Lavagetto) [13:33:51] (03PS2) 10Giuseppe Lavagetto: profile::docker::engine: fix docker config [puppet] - 10https://gerrit.wikimedia.org/r/317147 [13:33:57] (03CR) 10Giuseppe Lavagetto: [V: 032] profile::docker::engine: fix docker config [puppet] - 10https://gerrit.wikimedia.org/r/317147 (owner: 10Giuseppe Lavagetto) [13:36:00] 06Operations, 10Cassandra, 06Services (doing): some cassandra instances restarted with java.lang.OutOfMemoryError: Java heap space - https://phabricator.wikimedia.org/T148516#2734304 (10Eevans) Summarizing a conversation in `#wikimedia-operations`: @Gehel has access to a machine capable of performing analysi... [13:36:42] (03PS10) 10Alexandros Kosiaris: icinga: normal_check_interval => check_interval [puppet] - 10https://gerrit.wikimedia.org/r/315086 [13:36:46] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] icinga: normal_check_interval => check_interval [puppet] - 10https://gerrit.wikimedia.org/r/315086 (owner: 10Alexandros Kosiaris) [13:37:11] (03PS10) 10Alexandros Kosiaris: icinga: retry_check_interval => retry_interval [puppet] - 10https://gerrit.wikimedia.org/r/315087 [13:37:14] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] icinga: retry_check_interval => retry_interval [puppet] - 10https://gerrit.wikimedia.org/r/315087 (owner: 10Alexandros Kosiaris) [13:38:51] <_joe_> akosiaris: revert [13:39:01] <_joe_> akosiaris: Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Invalid parameter normal_check_interval on Monitoring::Service[raid_md] at /etc/puppet/modules/nrpe/manifests/monitor_service.pp:56 on node kubernetes1002.eqiad.wmnet [13:39:14] <_joe_> ?? [13:39:29] no, it's a one-line fix [13:39:33] I'll fix [13:39:51] it's the new check for the RAID [13:39:58] was done after my patch and I did not amend [13:40:13] !log completed rolling reboot of restbase in codfw [13:40:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:40:38] <_joe_> eheh ok [13:43:01] PROBLEM - puppet last run on cp3020 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:43:02] PROBLEM - puppet last run on aqs1006 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues [13:45:28] (03PS1) 10Alexandros Kosiaris: monitoring: Add the missed *interval calls [puppet] - 10https://gerrit.wikimedia.org/r/317149 [13:46:50] (03CR) 10Alexandros Kosiaris: [C: 032] monitoring: Add the missed *interval calls [puppet] - 10https://gerrit.wikimedia.org/r/317149 (owner: 10Alexandros Kosiaris) [13:47:10] (03CR) 10Giuseppe Lavagetto: "surprise surprise the only place where things change is labstest:" [puppet] - 10https://gerrit.wikimedia.org/r/317130 (https://phabricator.wikimedia.org/T147181) (owner: 10Giuseppe Lavagetto) [13:50:11] 06Operations, 10Traffic: ERR_CONTENT_DECODING_FAILED on certain png images from varnish-fe - https://phabricator.wikimedia.org/T148830#2734326 (10ema) After further investigation we've noticed that the problem is not reproducible forcing a cache miss by adding some random query parameters. Further, we've tried... [13:52:30] (03Abandoned) 10Alexandros Kosiaris: icinga: Remove the last vestiges of hostextinfo [puppet] - 10https://gerrit.wikimedia.org/r/315245 (owner: 10Alexandros Kosiaris) [13:52:39] PROBLEM - puppet last run on cp4019 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:52:40] PROBLEM - puppet last run on graphite2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:52:40] PROBLEM - puppet last run on mw2143 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:52:41] PROBLEM - puppet last run on mw2092 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:52:44] PROBLEM - puppet last run on cp2011 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:52:44] PROBLEM - puppet last run on mw2079 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:52:45] PROBLEM - puppet last run on db1040 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:52:45] PROBLEM - puppet last run on mw2127 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:52:51] PROBLEM - puppet last run on dbproxy1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:52:51] PROBLEM - puppet last run on elastic2004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:52:51] PROBLEM - puppet last run on mw1253 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:52:52] PROBLEM - puppet last run on mw2090 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:53:02] PROBLEM - puppet last run on mw1166 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:53:02] PROBLEM - puppet last run on db2029 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:53:04] PROBLEM - puppet last run on mw2113 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:53:04] PROBLEM - puppet last run on mw2111 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:53:04] PROBLEM - puppet last run on mw1298 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [13:53:05] PROBLEM - puppet last run on mw1208 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:53:11] PROBLEM - puppet last run on db1079 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:53:11] PROBLEM - puppet last run on labsdb1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:53:12] PROBLEM - puppet last run on mw2203 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:53:23] PROBLEM - puppet last run on db2038 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:53:26] PROBLEM - puppet last run on mw2249 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:53:27] PROBLEM - puppet last run on mw2101 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:53:29] PROBLEM - puppet last run on mw1194 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:53:29] PROBLEM - puppet last run on mw1211 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:53:46] PROBLEM - puppet last run on cp4015 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:53:46] PROBLEM - puppet last run on db2047 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:53:46] PROBLEM - puppet last run on mw2168 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:55:01] !log rolling reboot of thumbor* for kernel update [13:55:05] intentional to not flood? [13:55:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:55:08] _joe_ and AlexZ: !logs should be made only where people with wmf-related cloaks (wikipedia/wikimedia/mediawiki/etc) can execute it successfully, to prevent trolls from misusing it [13:55:31] !log restart isc-dhcp-server on carbon [13:55:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:00:47] PROBLEM - All Flannel etcd nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/etcd/flannel - 185 bytes in 1.449 second response time [14:01:41] RECOVERY - puppet last run on logstash1003 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [14:02:13] PROBLEM - puppet last run on cp4016 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues [14:02:27] Ok can we silence or ack ^ [14:03:03] (03PS2) 10Alexandros Kosiaris: Add tendril role to tegmen [puppet] - 10https://gerrit.wikimedia.org/r/316794 [14:03:05] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Add tendril role to tegmen [puppet] - 10https://gerrit.wikimedia.org/r/316794 (owner: 10Alexandros Kosiaris) [14:03:16] (03PS3) 10Alexandros Kosiaris: Add tendril role to tegmen [puppet] - 10https://gerrit.wikimedia.org/r/316794 [14:03:18] (03CR) 10Alexandros Kosiaris: [V: 032] Add tendril role to tegmen [puppet] - 10https://gerrit.wikimedia.org/r/316794 (owner: 10Alexandros Kosiaris) [14:04:51] 06Operations, 10Traffic: cache_upload: uncompressed images with Content-Encoding: gzip cause content decoding issues - https://phabricator.wikimedia.org/T148830#2734359 (10ema) [14:06:02] 06Operations, 10Traffic: cache_upload: uncompressed images with Content-Encoding: gzip cause content decoding issues - https://phabricator.wikimedia.org/T148830#2734202 (10ema) [14:07:51] RECOVERY - puppet last run on restbase1014 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [14:18:25] (03PS1) 10Giuseppe Lavagetto: profile::docker::engine: enable memory cgroup [puppet] - 10https://gerrit.wikimedia.org/r/317154 [14:19:12] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] profile::docker::engine: enable memory cgroup [puppet] - 10https://gerrit.wikimedia.org/r/317154 (owner: 10Giuseppe Lavagetto) [14:22:08] (03PS1) 10Gehel: elasticsearch - drop nginx access logs [puppet] - 10https://gerrit.wikimedia.org/r/317156 [14:22:47] 06Operations, 10Traffic: cache_upload: uncompressed images with Content-Encoding: gzip cause content decoding issues - https://phabricator.wikimedia.org/T148830#2734393 (10BBlack) The specific repro URL for the Serbia map has been PURGEd now to clear up the issue for users, since we're not getting much debug v... [14:22:57] 06Operations, 10ops-eqiad: Rack/Setup new memcache servers mc1019-36 - https://phabricator.wikimedia.org/T137345#2365691 (10elukey) I added `rootdelay=60` and mc1020 booted correctly, so we can keep going with the installation.. I am going to dedicated some time next week on them! [14:23:17] 06Operations, 10ops-eqiad: Rack/Setup new memcache servers mc1019-36 - https://phabricator.wikimedia.org/T137345#2734397 (10elukey) a:05Cmjohnson>03elukey [14:29:58] !log Stopping replication on db2055 to use it to clone another host - T146261 [14:29:58] T146261: Reimage dbstore2001 as jessie - https://phabricator.wikimedia.org/T146261 [14:30:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:31:15] !log rolling reboot of thumbor* for kernel update [14:31:21] <_joe_> !log rebooting all kubernetes worker nodes in production [14:31:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:32:17] 06Operations, 10Traffic: cache_upload: uncompressed images with Content-Encoding: gzip cause content decoding issues - https://phabricator.wikimedia.org/T148830#2734403 (10BBlack) To be clearer about what was debugged on IRC: this wasn't a case of actual bad gzip encoding. The object contents in all affected... 
[14:38:22] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/317156 (owner: 10Gehel) [14:38:27] 06Operations, 10ops-eqiad: Rack/Setup new memcache servers mc1019-36 - https://phabricator.wikimedia.org/T137345#2734404 (10elukey) p:05Triage>03Normal [14:43:00] (03PS2) 10Gehel: elasticsearch - drop nginx access logs [puppet] - 10https://gerrit.wikimedia.org/r/317156 [14:45:13] (03CR) 10Gehel: [C: 032] elasticsearch - drop nginx access logs [puppet] - 10https://gerrit.wikimedia.org/r/317156 (owner: 10Gehel) [14:50:37] !log reimaging mc1020 with wmf-auto-reimage (T137345) [14:50:57] T137345: Rack/Setup new memcache servers mc1019-36 - https://phabricator.wikimedia.org/T137345 [14:51:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:53:58] (03PS1) 10DCausse: [cirrus] Activate BM25 on top 10 wikis: Step 2 (take 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/317159 [14:54:21] (03PS2) 10DCausse: [cirrus] Activate BM25 on top 10 wikis: Step 2 (take 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/317159 [14:55:52] moritzm: any idea how I can do a non-interactive upgrade as per https://phabricator.wikimedia.org/T148767 ? [14:56:44] (03PS3) 10DCausse: [cirrus] Activate BM25 on top 10 wikis: Step 2 (take 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/317159 (https://phabricator.wikimedia.org/T147508) [14:58:05] andrewbogott: what's the exact dialogue, does it prompt for a modified conffile? [14:58:48] That's right — it says that there are local changes to the grub conf (which as far as I can tell from googling is a bug) and defaults to the current config, which leaves us on the old kernel. [15:00:06] moritzm: ^ [15:01:14] RECOVERY - Check size of conntrack table on kubernetes1004 is OK: OK: nf_conntrack is 0 % full [15:01:28] RECOVERY - Host kubernetes1004 is UP: PING OK - Packet loss = 0%, RTA = 17.95 ms [15:01:39] RECOVERY - configured eth on kubernetes1004 is OK: OK - interfaces up [15:01:39] RECOVERY - Host kubernetes1002 is UP: PING OK - Packet loss = 0%, RTA = 1.35 ms [15:01:40] RECOVERY - dhclient process on kubernetes1004 is OK: PROCS OK: 0 processes with command name dhclient [15:01:41] RECOVERY - DPKG on kubernetes1004 is OK: All packages OK [15:01:51] RECOVERY - salt-minion processes on kubernetes1004 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [15:01:58] RECOVERY - Check size of conntrack table on kubernetes1003 is OK: OK: nf_conntrack is 0 % full [15:02:00] RECOVERY - Host kubernetes1003 is UP: PING OK - Packet loss = 0%, RTA = 2.14 ms [15:02:21] RECOVERY - Disk space on kubernetes1003 is OK: DISK OK [15:02:21] RECOVERY - Disk space on kubernetes1004 is OK: DISK OK [15:02:55] 06Operations, 10Prod-Kubernetes, 05Kubernetes-production-experiment, 13Patch-For-Review, 15User-Joe: Docker installation for production kubernetes - https://phabricator.wikimedia.org/T147181#2734490 (10Joe) The production installation of docker seemed to work well, until I rebooted the servers for a fina... [15:03:17] andrewbogott: I'm creating a trusty instance in labs to have a look [15:03:29] moritzm: ok — I have several waiting here if you want [15:03:41] ssh trusty-kernel-1.testlabs.eqiad.wmflabs through ssh trusty-kernel-6.testlabs.eqiad.wmflabs :) [15:04:00] RECOVERY - salt-minion processes on kubernetes1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [15:04:19] moritzm: what is your wikitech username? 
[15:04:24] oh, nm, found [15:04:31] RECOVERY - MD RAID on kubernetes1004 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [15:05:03] looking at it on trusty-kernel-1.testlabs.eqiad.wmflabs [15:06:47] for some reason it tries to drop console=ttyS0 from the currently installed kernels [15:12:31] PROBLEM - puppet last run on lithium is CRITICAL: CRITICAL: Puppet has 7 failures. Last run 5 minutes ago with 7 failures. Failed resources (up to 3 shown): Package[apparmor],Package[prometheus-node-exporter],Package[python-apt],Package[linux-tools-generic] [15:19:59] mc1020 is up and running :) [15:20:26] moritzm: I'm looking at the image-building code now… it does [15:20:29] sed -i '/^kernel/s/$/ console=ttyS0/' /boot/grub/menu.lst [15:20:29] sed -i 's/console=hvc0/xencons=hvc0 console=hvc0/' /boot/grub/menu.lst [15:20:32] andrewbogott: mmh, no idea. it's something specific in the way instances are created in OpenStack, that doesn't happen with plain trusty/precise installations [15:20:47] so that's probably the diff it's worrying about [15:21:38] but still it's showing in the diff that it also stumbles over the lines added for the newly installed kernel [15:21:38] Of course I can re-apply that after I do the kernel upgrade, BUT, I need it to actually switch to the new kernel first :( [15:22:07] I don't understand why '-o Dpkg::Options::="--force-confnew"' [15:22:15] doesn't do anything when just hitting up-arrow and 'enter' works fine [15:22:23] me neither [15:24:50] RECOVERY - puppet last run on lithium is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [15:27:32] :( [15:28:31] !log reimaging mc1019 with wmf-auto-reimage (T137345) [15:28:32] T137345: Rack/Setup new memcache servers mc1019-36 - https://phabricator.wikimedia.org/T137345 [15:28:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:45:30] (03PS3) 10Volans: wmf-auto-reimage: improve messaging [puppet] - 10https://gerrit.wikimedia.org/r/317119 (https://phabricator.wikimedia.org/T148815) [15:45:46] (03CR) 10Dereckson: [C: 031] "Looks good to me." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/317139 (https://phabricator.wikimedia.org/T148749) (owner: 10Niharika29) [15:47:09] (03CR) 10Volans: [C: 032] wmf-auto-reimage: improve messaging [puppet] - 10https://gerrit.wikimedia.org/r/317119 (https://phabricator.wikimedia.org/T148815) (owner: 10Volans) [15:50:54] moritzm: for some reason # DEBIAN_FRONTEND=noninteractive apt-get -y upgrade works just fine [15:51:03] I can't decide if that's dangerous or not... [15:52:43] oh, nm, I'm wrong, that doesn't work either [16:01:16] 06Operations, 06Discovery, 06Maps, 03Interactive-Sprint, 13Patch-For-Review: reimage maps-test* servers - https://phabricator.wikimedia.org/T147194#2734628 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['maps-test2001.codfw.wmnet'] ``` The lo... [16:02:06] 06Operations, 06Discovery, 06Maps, 03Interactive-Sprint, 13Patch-For-Review: reimage maps-test* servers - https://phabricator.wikimedia.org/T147194#2734630 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['maps-test2001.codfw.wmnet'] ``` The lo...
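For context on the non-interactive upgrade thread above: the widely documented recipe for suppressing dpkg conffile prompts combines the noninteractive frontend with explicit Dpkg options. A minimal sketch, assuming root; whether it would have helped with this particular menu.lst prompt is exactly what was in question here:
```
# Sketch of the usual unattended-upgrade pattern (not the exact command debugged above).
#   --force-confdef  take the package maintainer's default action at conffile prompts
#   --force-confold  where there is no default, keep the locally modified conffile
export DEBIAN_FRONTEND=noninteractive
apt-get -y -o Dpkg::Options::="--force-confdef" -o Dpkg::Options::="--force-confold" dist-upgrade
```
That even --force-confnew had no effect may suggest the prompt was not coming from dpkg's own conffile handling but from the kernel postinst regenerating menu.lst, which would fit the hand-edited menu.lst produced by the image-building sed lines quoted above; that is speculation, not something the log confirms.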
[16:05:19] !log reimaging mc1021 with wmf-auto-reimage (T137345) [16:05:20] T137345: Rack/Setup new memcache servers mc1019-36 - https://phabricator.wikimedia.org/T137345 [16:05:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:18:06] PROBLEM - IPsec on mc1019 is CRITICAL: Strongswan CRITICAL - No connections configured: check ipsec.conf [16:22:28] PROBLEM - puppet last run on mw1179 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:24:12] hello mc1019 [16:24:30] ipsec? [16:30:23] PROBLEM - salt-minion processes on phab2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [16:30:52] !log rebooting planet2001 [16:31:30] ahh mc1019 is already configured in puppet, but the host is new [16:31:31] weird [16:31:52] elukey: does it just need a second puppet run? [16:32:02] almost sounds like it when new but "no config" [16:32:44] RECOVERY - salt-minion processes on phab2001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [16:33:03] 06Operations, 10Cassandra, 06Services (doing): some cassandra instances restarted with java.lang.OutOfMemoryError: Java heap space - https://phabricator.wikimedia.org/T148516#2734685 (10Eevans) [16:33:25] 06Operations, 06Discovery, 06Maps, 03Interactive-Sprint, 13Patch-For-Review: reimage maps-test* servers - https://phabricator.wikimedia.org/T147194#2734686 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['maps-test2001.codfw.wmnet'] ``` Of which those **FAILED**: ``` set(['maps-test2001.cod... [16:33:25] !log rebooting planet1001 - *.planet.wm.org will be right back [16:33:29] mutante: I tried but I think it is something a bit more network level, afaics mc1019 is configured in puppet but we are only using mc1001->mc1018 [16:33:43] so there might be some ACLs for prod hosts? [16:36:23] hmm, sounds good, would just expect "connection failed" or something then instead of "No connections configured" [16:36:33] 06Operations, 10Cassandra, 06Services (doing): some cassandra instances restarted with java.lang.OutOfMemoryError: Java heap space - https://phabricator.wikimedia.org/T148516#2734691 (10Eevans) restbase1007:/srv/cassandra-b/java_pid101952.hprof has been removed, though I can provide a copy of it upon request... [16:39:56] PROBLEM - salt-minion processes on phab2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [16:41:47] mutante: I've also double checked that 10.64.0.80 is not configured in puppet, we use up to mc1018 [16:42:16] I think that the safest choice is to remove mc1019 from site.pp [16:42:21] for the moment [16:42:32] 06Operations, 06Discovery, 06Maps, 03Interactive-Sprint, 13Patch-For-Review: reimage maps-test* servers - https://phabricator.wikimedia.org/T147194#2734707 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['maps-test2002.codfw.wmnet'] ``` The lo... [16:42:48] 06Operations, 06Discovery, 06Maps, 03Interactive-Sprint, 13Patch-For-Review: reimage maps-test* servers - https://phabricator.wikimedia.org/T147194#2734708 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['maps-test2003.codfw.wmnet'] ``` The lo...
[16:43:00] 06Operations, 06Discovery, 06Maps, 03Interactive-Sprint, 13Patch-For-Review: reimage maps-test* servers - https://phabricator.wikimedia.org/T147194#2734709 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['maps-test2004.codfw.wmnet'] ``` The lo... [16:43:13] elukey: that sounds reasonable, maybe just comment out [16:44:28] _joe_ you there for a quick consult? [16:46:06] (03PS1) 10Elukey: Remove mc1019 from role memcached (new host) [puppet] - 10https://gerrit.wikimedia.org/r/317171 (https://phabricator.wikimedia.org/T137345) [16:46:14] this one is wrong --^ [16:47:53] (03CR) 10jenkins-bot: [V: 04-1] Remove mc1019 from role memcached (new host) [puppet] - 10https://gerrit.wikimedia.org/r/317171 (https://phabricator.wikimedia.org/T137345) (owner: 10Elukey) [16:48:01] 06Operations, 06Discovery, 06Maps, 03Interactive-Sprint, 13Patch-For-Review: reimage maps-test* servers - https://phabricator.wikimedia.org/T147194#2734736 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['maps-test2002.codfw.wmnet'] ``` Of which those **FAILED**: ``` set(['maps-test2002.cod... [16:48:16] 06Operations, 06Discovery, 06Maps, 03Interactive-Sprint, 13Patch-For-Review: reimage maps-test* servers - https://phabricator.wikimedia.org/T147194#2734737 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['maps-test2003.codfw.wmnet'] ``` Of which those **FAILED**: ``` set(['maps-test2003.cod... [16:48:29] 06Operations, 06Discovery, 06Maps, 03Interactive-Sprint, 13Patch-For-Review: reimage maps-test* servers - https://phabricator.wikimedia.org/T147194#2734739 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['maps-test2004.codfw.wmnet'] ``` Of which those **FAILED**: ``` set(['maps-test2004.cod... [16:48:30] (03PS2) 10Elukey: Remove mc1019 from role memcached (new host) [puppet] - 10https://gerrit.wikimedia.org/r/317171 (https://phabricator.wikimedia.org/T137345) [16:48:55] RECOVERY - puppet last run on mw1179 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:50:20] (03CR) 10jenkins-bot: [V: 04-1] Remove mc1019 from role memcached (new host) [puppet] - 10https://gerrit.wikimedia.org/r/317171 (https://phabricator.wikimedia.org/T137345) (owner: 10Elukey) [16:51:49] (03PS1) 10Jcrespo: proxysql: Setup dbproxy1011 as a test host for labs::db::proxy [puppet] - 10https://gerrit.wikimedia.org/r/317173 (https://phabricator.wikimedia.org/T148500) [16:52:04] jenkins doesn't like me: Failed to establish a new connection: [Errno -2] Name or service not known' [16:52:45] oh, that looks like a real jenkins issue [16:53:40] (03PS3) 10Elukey: Remove mc1019 from role memcached (new host) [puppet] - 10https://gerrit.wikimedia.org/r/317171 (https://phabricator.wikimedia.org/T137345) [16:53:52] (03CR) 10jenkins-bot: [V: 04-1] proxysql: Setup dbproxy1011 as a test host for labs::db::proxy [puppet] - 10https://gerrit.wikimedia.org/r/317173 (https://phabricator.wikimedia.org/T148500) (owner: 10Jcrespo) [16:54:55] elukey do a recheck please? 
[16:55:24] paladox: not sure how to do it :( [16:55:36] elukey just type recheck in the comment box on gerrit [16:55:42] ah okok [16:55:44] Let me show you [16:55:52] (03CR) 10Elukey: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/317171 (https://phabricator.wikimedia.org/T137345) (owner: 10Elukey) [16:55:54] (03CR) 10Paladox: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/317171 (https://phabricator.wikimedia.org/T137345) (owner: 10Elukey) [16:56:03] :) [16:56:15] 06Operations, 10EventBus: setup/install/deploy kafka1003 (WMF4723) - https://phabricator.wikimedia.org/T148849#2734746 (10RobH) [16:56:21] wow nice [16:56:26] didn't know this! [16:56:46] 06Operations, 10EventBus, 10hardware-requests: eqiad/codfw: 1+1 Kafka broker in main clusters in eqiad and codfw - https://phabricator.wikimedia.org/T145082#2619347 (10RobH) 05Open>03stalled a:05mark>03RobH stealing this back for sub-task implementations (both orders and system allocations) [16:56:54] Yep :) [16:57:00] you can also do check experimental [16:57:03] (03CR) 10Jcrespo: "I do not intend to deploy this today, but give it a look. We will probably move a couple of production hosts to be proxies on labs-support" [puppet] - 10https://gerrit.wikimedia.org/r/317173 (https://phabricator.wikimedia.org/T148500) (owner: 10Jcrespo) [16:58:12] elukey it works [16:58:25] so sometimes it will fail, and doing a recheck sometimes works [16:58:36] :) [16:58:39] (03CR) 10Elukey: "mc1001->mc1018 show no differences:" [puppet] - 10https://gerrit.wikimedia.org/r/317171 (https://phabricator.wikimedia.org/T137345) (owner: 10Elukey) [16:58:51] 06Operations, 10EventBus: setup/install/deploy kafka1003 (WMF4723) - https://phabricator.wikimedia.org/T148849#2734781 (10RobH) So right now this system is in the same rack as kafka1002. Ideally clusters are distributed into different racks, if not different rows. I'll create a sub-task for this system to be... [16:58:58] paladox: thanks! [16:59:02] phew ok, i was already wondering if we had broken nodepool or something [16:59:05] You're welcome :) [16:59:08] thanks paladox [16:59:11] mutante: https://gerrit.wikimedia.org/r/#/c/317171 - wdyt? [16:59:21] mutante unlikely, nodepool is a built snapshot [16:59:35] we have to manually update the image [17:00:04] and upload it to nodepool, and then it is distributed whenever a test uses nodepool [17:00:27] <_joe_> elukey: what's up? [17:00:33] 06Operations, 10ops-eqiad, 10EventBus: move and label kafka1003 (WMF4723) - https://phabricator.wikimedia.org/T148851#2734783 (10RobH) [17:00:58] _joe_ I just reimaged mc1019 and now it complains about IPsec, https://gerrit.wikimedia.org/r/#/c/317171 [17:01:18] (03CR) 10Dzahn: [C: 031] "yea, it shows with IPSEC errors in Icinga and as the ticket says we only installed up to 1018 so far, mc1019-36 should probably be added t" [puppet] - 10https://gerrit.wikimedia.org/r/317171 (https://phabricator.wikimedia.org/T137345) (owner: 10Elukey) [17:01:28] so I am asking for sanity checks :) [17:02:01] <_joe_> elukey: why remove it? [17:02:19] <_joe_> and, why ipsec errors?
[17:02:22] <_joe_> oh ofc [17:02:26] RECOVERY - salt-minion processes on phab2001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [17:02:31] yeah I think it is the codfw replication [17:03:01] I think it is just confusing to have it listed in site.pp [17:03:03] it says "no connections configured" not like when it fails [17:03:04] but I can leave it [17:03:32] <_joe_> mutante: yeah, exactly [17:03:46] <_joe_> well, I'm fine with both solutions [17:04:00] <_joe_> honestly mc1019 should replace mc1001 [17:04:03] <_joe_> and so on [17:04:08] ah yes [17:04:15] <_joe_> like yesterday :P [17:04:19] ahhahaha [17:04:33] <_joe_> but I guess there was an issue of some kind. [17:05:46] elukey: btw that jenkins error, it tries to connect to python.org apparently, and that may be affected by the DNS issues [17:05:55] well i cant open it right now from home [17:05:55] :/ [17:06:06] or now i can.. hrm [17:06:21] Should we be worried or on high alert because of this? http://motherboard.vice.com/read/twitter-reddit-spotify-were-collateral-damage-in-major-internet-attack [17:06:47] mutante: yes, python.org is on NS3.P11.DYNECT.NET and similar [17:06:49] Amir1 that wont affect wikimedia [17:06:58] _joe_ my idea is to reimage mc1019->mc1036 and then think about how to replace mc1001->mc1018 [17:07:01] sounds good? [17:07:03] Amir1: well side-effects like that jenkins thing maybe [17:07:05] But they could try to target wikimedia [17:07:18] so i don't know [17:07:45] (03PS1) 10Dereckson: Edit-a-thon BDA (Poitiers) throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/317174 (https://phabricator.wikimedia.org/T148852) [17:08:38] if you recheck on gerrit on tests that fail it will most likely pass again [17:08:40] Thanks, I hope that just doesn't happen. We are the best thing about the internet: https://www.washingtonpost.com/news/in-theory/wp/2016/10/19/science-shows-wikipedia-is-the-best-part-of-the-internet/ [17:09:45] PROBLEM - salt-minion processes on phab2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [17:10:05] Amir1: probably not for our own sites, we dont use "cloud dns"/dyn [17:10:06] (03CR) 10Elukey: [C: 032] Remove mc1019 from role memcached (new host) [puppet] - 10https://gerrit.wikimedia.org/r/317171 (https://phabricator.wikimedia.org/T137345) (owner: 10Elukey) [17:10:24] Amir1: just when we rely on others like in that example where jenkins wants to load stuff from python.org [17:10:31] but if we have CI jobs that build directly from pypi without a cache they might be affected [17:10:36] Yep [17:11:00] okay [17:11:09] thanks [17:11:30] 06Operations, 06Discovery, 06Maps, 03Interactive-Sprint, 13Patch-For-Review: reimage maps-test* servers - https://phabricator.wikimedia.org/T147194#2734858 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['maps-test2002.codfw.wmnet'] ``` The lo... [17:11:38] we might want to disable that one check for now.. [17:12:13] s/disable/make it non-voting [17:12:19] ^^ we can make it non-voting [17:12:39] legoktm: i hear you can deploy zuul changes?
How do i make it so when i clone something i dont get any folder created, i just want the files [17:14:53] 06Operations, 06Discovery, 06Maps, 03Interactive-Sprint, 13Patch-For-Review: reimage maps-test* servers - https://phabricator.wikimedia.org/T147194#2734863 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['maps-test2003.codfw.wmnet'] ``` The lo... [17:15:05] PROBLEM - check_listener_ipn on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - 214 bytes in 0.007 second response time [17:15:05] PROBLEM - check_listener_ipn on saiph is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - 214 bytes in 0.012 second response time [17:15:17] Ffs whats with all the errors [17:16:30] ^^^ the thulium/saiph ones are expected, sorry, apparently my attempt to silence them preemptively didn't work [17:17:00] ha maybe it would have helped if I finished clicking through the commit stage... [17:18:05] <_joe_> Zppix: "all the errors" are usually minor glitches we have to know about anyways [17:18:50] <_joe_> (and some are, sadly, false positives) [17:19:01] Zppix: try appending a "." to the end [17:19:11] as in "current directory" [17:19:26] But i get a dupe folder [17:19:38] Sorry i mean a folder i just want the contents [17:19:44] Of said folder [17:21:05] mutante: ran puppet on mc1* hosts, all good [17:21:09] thanks for the +1 [17:21:25] Zppix: git clone https://gerrit.wikimedia.org/r/p/operations/dns.git . [17:21:34] Zppix: see the . at the end [17:21:44] Yes [17:21:46] elukey: good :) yw [17:22:00] 06Operations, 06Discovery, 06Maps, 03Interactive-Sprint, 13Patch-For-Review: reimage maps-test* servers - https://phabricator.wikimedia.org/T147194#2734897 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['maps-test2004.codfw.wmnet'] ``` The lo... [17:22:08] Zppix: it doesnt create a new directory for me like that, just files [17:22:12] But i just want the files i dont want any folders that are in repo [17:22:25] X/Y problem [17:22:29] I tried, it gave me a folder in curr dir [17:23:50] mutante https://gerrit.wikimedia.org/r/#/c/317176/ [17:26:55] mutante: yes I can, what's up? [17:27:53] legoktm: we would like to change "operations-puppet-tox-jessie" to non-voting temporarily [17:28:23] legoktm: because it relies on python.org and that is affected by the current DNS issues (http://arstechnica.com/security/2016/10/dos-attack-on-major-dns-provider-brings-internet-to-morning-crawl/) it looks [17:28:38] uh, is it possible to ignore it failing for now?
if I make it non-voting, then I have to make it voting again once the internet comes back [17:28:41] but it's intermittent, a "recheck" worked earlier [17:28:53] legoktm we could try cache [17:28:57] so we can also ignore it if it causes more problems [17:29:10] apparently pip 6.0 supports caching by default [17:29:13] I think it's easier if you just ignore it [17:29:26] yes it does support caching except we use disposable VMs [17:29:31] I did that, I checked it was tox, and ignored it [17:29:42] and I don't think hashar finished implementing the caching stuff for python [17:29:50] Oh [17:30:08] ok, then let's ignore it [17:30:26] thanks legoktm [17:30:39] mhm [17:31:48] thanks lego (today is travel day and Antoine and Tyler took the morning to go to the Air and Space museum) [17:32:16] ooh, that place is awesome [17:32:28] RECOVERY - salt-minion processes on phab2001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [17:33:08] I'm sitting in a park looking at the white house from about 200 feet from that black fence [17:35:05] greg-g: how long are you going to be there? I might come join you if it will be more than an hour ? [17:36:27] maybe you get to see [[w:Bo_(dog)]] [17:37:01] 06Operations, 06Discovery, 06Maps, 03Interactive-Sprint, 13Patch-For-Review: reimage maps-test* servers - https://phabricator.wikimedia.org/T147194#2734919 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['maps-test2003.codfw.wmnet'] ``` Of which those **FAILED**: ``` set(['maps-test2003.cod... [17:37:19] rfarrand: flight is at 5, so, a bit. I'll probably start heading towards airport if it starts raining or 3pm, whichever is first [17:37:37] rfarrand: (or coffee shop if it starts raining and you're here and it's not yet near 3) ;) [17:38:09] the cold front is nice [17:39:47] PROBLEM - salt-minion processes on phab2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [17:42:35] 06Operations, 10netops, 05Goal, 13Patch-For-Review: Decomission palladium - https://phabricator.wikimedia.org/T147320#2734929 (10Dzahn) yep, "last heads up" sent to ops@ now [17:45:40] (03CR) 10Dzahn: [C: 031] "lgtm, the wikitech/LDAP user exists with this UID, has matching @wikimedia.org email and key is not the labs key" [puppet] - 10https://gerrit.wikimedia.org/r/317142 (https://phabricator.wikimedia.org/T148477) (owner: 10Elukey) [17:47:31] 06Operations, 06Discovery, 06Maps, 03Interactive-Sprint, 13Patch-For-Review: reimage maps-test* servers - https://phabricator.wikimedia.org/T147194#2734932 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['maps-test2002.codfw.wmnet'] ``` and were **ALL** successful. [17:51:14] 06Operations, 06Discovery, 06Maps, 03Interactive-Sprint, 13Patch-For-Review: reimage maps-test* servers - https://phabricator.wikimedia.org/T147194#2734941 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['maps-test2004.codfw.wmnet'] ``` Of which those **FAILED**: ``` set(['maps-test2004.cod... [18:02:08] Reedy, around?
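On the git clone question earlier: cloning into "." only works if the current directory is empty, and the working copy will always include the repository's full tree plus the .git metadata directory. A minimal sketch (repository URL taken from the example given above; the rest is generic git, nothing Wikimedia-specific):
```
# Clone directly into the current directory; git requires it to be empty:
git clone https://gerrit.wikimedia.org/r/p/operations/dns.git .

# If the directory already contains files, initialize and fetch instead:
git init
git remote add origin https://gerrit.wikimedia.org/r/p/operations/dns.git
git fetch origin
git checkout -b master origin/master
```
Either way the checkout still carries .git; a tree with no metadata at all would need something like git archive, which not every server supports.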
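And on the pip caching point above: pip 6 and later do cache downloads and built wheels by default (under ~/.cache/pip), but that only helps when the cache directory survives between runs, which disposable CI VMs defeat. A sketch of pointing the cache at persistent storage; the path is hypothetical, not what WMF CI actually used:
```
# Redirect pip's cache to a mount that outlives the throwaway VM
# (illustrative path):
export PIP_CACHE_DIR=/srv/shared/pip-cache
pip install -r requirements.txt

# Per-invocation equivalent:
pip install --cache-dir=/srv/shared/pip-cache -r requirements.txt
```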
[18:04:14] (03PS2) 10Dzahn: repeat hostname for AAAA,bast3/4,sodium,dataset,ms1001 [dns] - 10https://gerrit.wikimedia.org/r/317093 [18:06:28] (03CR) 10Dzahn: [C: 032] repeat hostname for AAAA,bast3/4,sodium,dataset,ms1001 [dns] - 10https://gerrit.wikimedia.org/r/317093 (owner: 10Dzahn) [18:09:32] (03PS2) 10Niharika29: Set $wgCategoryCollation to 'uca-hr' for Croatian wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/317139 (https://phabricator.wikimedia.org/T148749) [18:10:55] (03PS3) 10Niharika29: Set $wgCategoryCollation to 'uca-hr' for Croatian wikipedia Add numeric sorting for bs, hr and uk wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/317139 (https://phabricator.wikimedia.org/T148749) [18:39:59] (03CR) 10Dereckson: [C: 04-1] "Split in two changes, that's more easy to deploy and test them one idea at the time. That also allows to only revert one part if there is " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/317139 (https://phabricator.wikimedia.org/T148749) (owner: 10Niharika29) [18:44:04] http://downforeveryoneorjustme.com/downdetector.com/ [18:45:54] mutante is that down for you? [18:49:30] paladox: DNS-wise no, i can open it but i cant get the live maps, probably just overloaded, too many people using it now [18:49:47] Oh, it works for me [18:49:49] issues for me intermittent, but no reddit, no github, no python, no twitter... [18:50:05] reddit github and python and twitter are working for me [18:50:35] yea, i mean the map shows how it's all US, first east coast, then west coast [18:50:46] I can get to reddit and twitter, haven't tried github [18:51:35] http://gizmodo.com/this-is-probably-why-half-the-internet-shut-down-today-1788062835?rev=1477054209946 [18:53:55] "new wave of attacks seems to be affecting the West Coast of the United States and Europe" [18:54:37] PROBLEM - kartotherian endpoints health on maps2003 is CRITICAL: /{src}/{z}/{x}/{y}.{format} (get a tile in the middle of the ocean, with overzoom) is CRITICAL: Test get a tile in the middle of the ocean, with overzoom returned the unexpected status 400 (expecting: 200): /img/{src},{z},{lat},{lon},{w}x{h}@{scale}x.{format} (Small scaled map) is CRITICAL: Test Small scaled map returned the unexpected status 400 (expecting: 200): /{src} [18:54:39] PROBLEM - cassandra CQL 10.192.48.57:9042 on maps2004 is CRITICAL: Connection refused [18:54:47] PROBLEM - kartotherian endpoints health on maps2001 is CRITICAL: /{src}/{z}/{x}/{y}.{format} (get a tile in the middle of the ocean, with overzoom) is CRITICAL: Test get a tile in the middle of the ocean, with overzoom returned the unexpected status 400 (expecting: 200): /img/{src},{z},{lat},{lon},{w}x{h}@{scale}x.{format} (Small scaled map) is CRITICAL: Test Small scaled map returned the unexpected status 400 (expecting: 200): /{src} [18:54:48] PROBLEM - kartotherian endpoints health on maps2002 is CRITICAL: /{src}/{z}/{x}/{y}.{format} (get a tile in the middle of the ocean, with overzoom) is CRITICAL: Test get a tile in the middle of the ocean, with overzoom returned the unexpected status 400 (expecting: 200): /img/{src},{z},{lat},{lon},{w}x{h}@{scale}x.{format} (Small scaled map) is CRITICAL: Test Small scaled map returned the unexpected status 400 (expecting: 200): /{src} [18:54:50] PROBLEM - kartotherian endpoints health on maps2004 is CRITICAL: /{src}/{z}/{x}/{y}.{format} (get a tile in the middle of the ocean, with overzoom) is CRITICAL: Test get a tile in the middle of the ocean, with overzoom returned the unexpected status 400 
(expecting: 200): /img/{src},{z},{lat},{lon},{w}x{h}@{scale}x.{format} (Small scaled map) is CRITICAL: Test Small scaled map returned the unexpected status 400 (expecting: 200): /{src} [18:55:07] eh? [18:55:13] the biggest impact for people seems to be that "the Starbucks app is down", when people dont get their coffee order they get grumpy [18:55:43] I mean icinga :) [18:55:58] PROBLEM - Kartotherian LVS codfw on kartotherian.svc.codfw.wmnet is CRITICAL: /{src}/{z}/{x}/{y}.{format} (get a tile in the middle of the ocean, with overzoom) is CRITICAL: Test get a tile in the middle of the ocean, with overzoom returned the unexpected status 400 (expecting: 200): /img/{src},{z},{lat},{lon},{w}x{h}@{scale}x.{format} (Small scaled map) is CRITICAL: Test Small scaled map returned the unexpected status 400 (expecting: [18:56:02] checking maps... [18:56:29] never saw that particular sms before, heh [18:58:36] ACKNOWLEDGEMENT - kartotherian endpoints health on maps2001 is CRITICAL: /{src}/{z}/{x}/{y}.{format} (get a tile in the middle of the ocean, with overzoom) is CRITICAL: Test get a tile in the middle of the ocean, with overzoom returned the unexpected status 400 (expecting: 200): /img/{src},{z},{lat},{lon},{w}x{h}@{scale}x.{format} (Small scaled map) is CRITICAL: Test Small scaled map returned the unexpected status 400 (expecting: 200) [18:58:37] ACKNOWLEDGEMENT - kartotherian endpoints health on maps2002 is CRITICAL: /{src}/{z}/{x}/{y}.{format} (get a tile in the middle of the ocean, with overzoom) is CRITICAL: Test get a tile in the middle of the ocean, with overzoom returned the unexpected status 400 (expecting: 200): /img/{src},{z},{lat},{lon},{w}x{h}@{scale}x.{format} (Small scaled map) is CRITICAL: Test Small scaled map returned the unexpected status 400 (expecting: 200) [18:58:38] ACKNOWLEDGEMENT - kartotherian endpoints health on maps2003 is CRITICAL: /{src}/{z}/{x}/{y}.{format} (get a tile in the middle of the ocean, with overzoom) is CRITICAL: Test get a tile in the middle of the ocean, with overzoom returned the unexpected status 400 (expecting: 200): /img/{src},{z},{lat},{lon},{w}x{h}@{scale}x.{format} (Small scaled map) is CRITICAL: Test Small scaled map returned the unexpected status 400 (expecting: 200) [18:58:39] ACKNOWLEDGEMENT - cassandra CQL 10.192.48.57:9042 on maps2004 is CRITICAL: Connection refused Gehel gehel checking it [18:58:39] ACKNOWLEDGEMENT - kartotherian endpoints health on maps2004 is CRITICAL: /{src}/{z}/{x}/{y}.{format} (get a tile in the middle of the ocean, with overzoom) is CRITICAL: Test get a tile in the middle of the ocean, with overzoom returned the unexpected status 400 (expecting: 200): /img/{src},{z},{lat},{lon},{w}x{h}@{scale}x.{format} (Small scaled map) is CRITICAL: Test Small scaled map returned the unexpected status 400 (expecting: 200) [18:59:03] mutante apparently a user on wikipedia is reverting edits based on users not using the edit box, how do we go about resolving this. [18:59:05] https://en.wikipedia.org/wiki/User_talk:RunnyAmiga#Windows_10_edit_revert [19:00:29] cassandra issue on maps2*, I hope its not me breaking things with reimage of maps-test* [19:00:56] paladox: please ask in #wikipedia or so, i'm not an admin on en.wp [19:01:02] ok [19:02:25] anyone knows cassandra?
[19:04:00] PROBLEM - cassandra service on maps2004 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed [19:06:11] !log shutting down cassandra on maps2004, seems to have lost data [19:06:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:08:06] gehel: urandom_ perhaps? [19:08:27] bblack: We're having trouble with maps2* (codfw), maps1*(eqiad) are ready to take over, but I'm not exactly up to date with cache::maps config [19:08:37] bblack: help welcomed! [19:10:15] gehel: feel free to start calling people, btw [19:10:26] * gehel is calling people! [19:10:37] ok! :) [19:10:49] yuvipanda: you mean phone? [19:11:01] gehel: yes [19:11:39] is there a criteria on what gerrit repos get watched by grrrit-wm in -dev?? [19:12:15] yuvipanda: If you could help me get a hold of bblack or ema that would be great (or someone who understands varnish) [19:12:27] gehel: ok, let me call bblack [19:13:00] (03PS1) 10Gehel: maps switching maps traffic to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/317182 [19:13:02] (03PS1) 10Gehel: maps - swithing traffic to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/317183 [19:13:39] gehel: bblack is on his way [19:13:51] ^ Those 2 changes *should* route maps traffic to eqiad, but I need a review! [19:13:57] yuvipanda: thanks a lot! [19:14:01] hi [19:14:06] np [19:14:13] bblack: thanks! [19:14:18] sorry to interrupt whatever [19:14:45] maps codfw is not going well. We were ready to switch traffic to eqiad anyway, but I'm not entirely sure how to do that [19:15:09] https://gerrit.wikimedia.org/r/317182 and https://gerrit.wikimedia.org/r/317183 is my understanding of how that's done, but I'm not sure [19:15:10] 06Operations, 07Puppet, 07Documentation, 03Google-Code-In-2016, and 2 others: document all puppet classes / defined types!? - https://phabricator.wikimedia.org/T127797#2735054 (10Dzahn) How to do this and confirm it's done. 1.) git clone the operations/puppet repo (git clone https://gerrit.wikimedia.org/r... [19:15:36] kartotherian.svc.eqiad.wmnet exists and works, right?
[19:15:45] bblack: yes it does [19:18:00] (03PS1) 10BBlack: kartotherian: define eqiad backend for caches [puppet] - 10https://gerrit.wikimedia.org/r/317185 [19:18:25] (03CR) 10BBlack: [C: 032 V: 032] kartotherian: define eqiad backend for caches [puppet] - 10https://gerrit.wikimedia.org/r/317185 (owner: 10BBlack) [19:18:35] hmmm my pull sucked I guess [19:18:39] (03PS2) 10BBlack: kartotherian: define eqiad backend for caches [puppet] - 10https://gerrit.wikimedia.org/r/317185 [19:18:42] (03CR) 10BBlack: [V: 032] kartotherian: define eqiad backend for caches [puppet] - 10https://gerrit.wikimedia.org/r/317185 (owner: 10BBlack) [19:19:29] (03PS1) 10BBlack: maps: switch backend to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/317187 [19:19:40] (03CR) 10BBlack: [C: 032 V: 032] maps: switch backend to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/317187 (owner: 10BBlack) [19:19:56] (am going afk for a bit now) [19:21:21] gehel: I'm salting agent for the 2x commits above, which will change the varnish->app part to use eqiad [19:21:37] ACKNOWLEDGEMENT - Kartotherian LVS codfw on kartotherian.svc.codfw.wmnet is CRITICAL: /{src}/{z}/{x}/{y}.{format} (get a tile in the middle of the ocean, with overzoom) is CRITICAL: Test get a tile in the middle of the ocean, with overzoom returned the unexpected status 400 (expecting: 200): /img/{src},{z},{lat},{lon},{w}x{h}@{scale}x.{format} (Small scaled map) is CRITICAL: Test Small scaled map returned the unexpected status 400 (ex [19:21:49] bblack: thank you sooo much! [19:22:03] the other two commits you've prepped are correct and come afterwards, to fix up cache routing so we're not leaking a bunch of PII by sending x-dc unencrypted reqs from codfw->eqiad to do so [19:22:15] * apergos peeks in (pages) [19:22:31] and require careful assurance that the first of the two is fully deployed and operational before merging the second [19:22:32] So I missed only 2/3 of what needed to be done... :P [19:23:12] the eqiad bit should be switched now... [19:23:31] seem ok? [19:23:39] s/eqiad bit/applayer bit/ [19:23:55] looks much better! [19:24:17] so now step through those other 2x commits. first just making eqiad direct. 
[19:24:38] (03PS1) 10RobH: setting dns records for kafka1003 [dns] - 10https://gerrit.wikimedia.org/r/317188 [19:24:39] salt the agent run for that and be sure it has really applied on every host, and maybe wait another 5 minutes after that just to be sure [19:24:44] then do the other [19:25:11] ok, I'll try that [19:25:38] (03PS2) 10Gehel: maps switching maps traffic to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/317182 [19:25:39] if the first fails to be fully applied (in cache_maps@eqiad) before the second becomes active anywhere, we get loops [19:25:49] (03CR) 10RobH: [C: 032] setting dns records for kafka1003 [dns] - 10https://gerrit.wikimedia.org/r/317188 (owner: 10RobH) [19:25:52] (03CR) 10Gehel: [C: 032 V: 032] maps switching maps traffic to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/317182 (owner: 10Gehel) [19:25:54] (03PS2) 10RobH: setting dns records for kafka1003 [dns] - 10https://gerrit.wikimedia.org/r/317188 [19:26:19] (03PS2) 10Dzahn: maps::server: move base::firewall to role [puppet] - 10https://gerrit.wikimedia.org/r/315889 [19:27:09] we have some VCL protection against the dangers of those loops, which should 503-fail any looping requests so they don't storm, but it's never really been tested before :) [19:27:11] bblack: sudo salt -v -t 10 -b 10 -E '^cp.*' cmd.run "puppet agent -t" [19:27:21] no [19:27:45] salt -v -t 10 -b 100 -C 'G@cluster:cache_maps and G@site;eqiad' cmd.run 'puppet agent -t' [19:27:58] well I typoed [19:27:59] ok, only eqiad, makes sense [19:28:04] salt -v -t 10 -b 100 -C 'G@cluster:cache_maps and G@site:eqiad' cmd.run 'puppet agent -t' [19:28:12] only eqiad is faster and less spam to read through to verify [19:28:28] and -b 100 to avoid inane salt output about every other host in the network :P [19:29:17] !log running puppet on eqiad cache nodes to activate maps traffic redirection [19:29:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:29:32] (03PS1) 10RobH: setting kafka1003 install params [puppet] - 10https://gerrit.wikimedia.org/r/317191 [19:29:54] 06Operations, 10ops-eqiad, 10Prod-Kubernetes, 05Kubernetes-production-experiment, and 2 others: Rack/Setup Kubernetes Servers - https://phabricator.wikimedia.org/T147933#2735096 (10Cmjohnson) @joe kubernetes1001 cable was fine, it was connected to the wrong switch port. [19:29:55] ok, checking the output... [19:30:12] 06Operations, 10EventBus: setup/install/deploy kafka1003 (WMF4723) - https://phabricator.wikimedia.org/T148849#2735097 (10RobH) [19:30:15] if the VCL changed it should be fine, mostly we're worried about agent failure due to racing cron or whatever [19:30:31] (03CR) 10RobH: [C: 032] setting kafka1003 install params [puppet] - 10https://gerrit.wikimedia.org/r/317191 (owner: 10RobH) [19:30:56] RECOVERY - cassandra service on maps2004 is OK: OK - cassandra is active [19:30:57] (03CR) 10Dzahn: "watroles says "role::maps::server" is not used in labs https://tools.wmflabs.org/watroles/role/role::maps::server" [puppet] - 10https://gerrit.wikimedia.org/r/315889 (owner: 10Dzahn) [19:31:00] 4 nodes look to have the same VCL changes, no errors [19:31:26] so give it a few minutes, say 5, just to be sure of edge cases with draining connections and VCL reloads and whatever (it's just being super paranoid) [19:31:40] then merge and apply the other change and we're done with this [19:32:07] and force puppet agent on codfw ? [19:32:20] Or just let it deploy with standard puppet?
[19:32:22] I would, just to see it happen and breathe easier [19:32:32] yeah, me too... [19:32:42] RECOVERY - salt-minion processes on phab2001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [19:32:42] 06Operations, 06Discovery, 06Maps, 06WMF-Legal, 03Interactive-Sprint: Define tile usage policy - https://phabricator.wikimedia.org/T141815#2735104 (10debt) [19:32:51] 06Operations, 10ops-eqiad, 06DC-Ops: Broken disk on kafka1018 - https://phabricator.wikimedia.org/T147707#2735106 (10Cmjohnson) @Ottomata can you look at this please [19:32:57] the upshot is, now maps isn't the only service running primary in codfw anymore, counter to all routing :) [19:33:08] all other routing, I mean [19:33:20] yeah, I would have preferred to do that switch with a bit more preparation... [19:33:26] 06Operations, 10ops-eqiad, 10fundraising-tech-ops, 13Patch-For-Review: pay-lvs1003/pay-lvs1004 hardware swap for pay-lvs1001/pay-lvs1002 - https://phabricator.wikimedia.org/T147932#2735108 (10Cmjohnson) 05Open>03Resolved Resolving this...work has been completed. [19:33:27] :) [19:33:48] And I think I did some really stupid mix-up to break all that. [19:33:52] 06Operations, 10EventBus: setup/install/deploy kafka1003 (WMF4723) - https://phabricator.wikimedia.org/T148849#2735115 (10Cmjohnson) [19:34:11] bblack: I owe you a beer, or half a kilo of chocolate, your choice... [19:34:37] tough call! [19:34:57] 06Operations, 10ops-eqiad, 10netops: Rack and setup new eqiad row D switch stack (EX4300/QFX5100) - https://phabricator.wikimedia.org/T148506#2735117 (10Cmjohnson) p:05Normal>03Unbreak! Moving this to higher priority on my workboard. [19:35:42] (03CR) 10Muehlenhoff: [C: 031] "Looks good, then" [puppet] - 10https://gerrit.wikimedia.org/r/315889 (owner: 10Dzahn) [19:35:48] gehel: i have a question related to maps but not related to that problem at all, is role::maps::server used in labs? [19:36:27] mutante: I would need to check, we have a test server we play with, but not sure if we used the role itself [19:36:34] MaxSem should know ^ [19:36:53] bblack: ok, 5' have passed, let's deploy this other change [19:37:13] (03PS2) 10Gehel: maps - swithing traffic to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/317183 [19:37:41] mutante: I'll check in a few minutes [19:38:01] PROBLEM - cassandra service on maps2004 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed [19:38:05] gehel: please dont worry now, didnt want to distract from that [19:38:07] (03CR) 10Gehel: [C: 032 V: 032] maps - swithing traffic to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/317183 (owner: 10Gehel) [19:38:08] thanks [19:38:41] mutante, gehel - I experiment with it on maps-scratch* [19:39:43] MaxSem: do you actually put role::maps::server on an instance?
there is this https://tools.wmflabs.org/watroles/role/role::maps::server that should show it if it was [19:39:51] PROBLEM - salt-minion processes on phab2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [19:40:21] !log routing traffic for cache-maps in codfw -> eqiad [19:40:23] mutante, it's a self-hosted puppetmaster, for debugging this particular role [19:40:23] !log labvirt1005 swapping disk 0 [19:40:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:40:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:41:46] bblack: puppet run looks good on the 4 cache servers in codfw [19:41:55] case closed then :) [19:42:55] bblack: not yet :) still need to understand how that happened. And get a good flogging if I'm as stupid as I think [19:43:26] ok well public-facing case closed anyways [19:45:40] 06Operations, 10ops-eqiad, 06Labs, 10Labs-Infrastructure: labvirt1005 - HP RAID controller issue (battery?) - https://phabricator.wikimedia.org/T148255#2735128 (10Cmjohnson) Swapped disk in slot 0 with new disk the old disk is being sent back via UPS 1ZA7327E90828184 10 [19:45:43] that one yes [19:49:41] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [19:50:34] !log dataset1001 array 1 swap failed disk slot 4 [19:50:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:52:17] 06Operations, 10ops-eqiad, 10Dumps-Generation: Degraded RAID on dataset1001 - https://phabricator.wikimedia.org/T148715#2735146 (10Cmjohnson) Swapped out the failed disk in slot 4. [19:57:50] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [19:58:01] (03PS1) 10Dzahn: add mapped IPv6 address for eventlog1001 [puppet] - 10https://gerrit.wikimedia.org/r/317192 [19:59:02] (03CR) 10Dzahn: "then also this first https://gerrit.wikimedia.org/r/#/c/317192/" [puppet] - 10https://gerrit.wikimedia.org/r/316497 (owner: 10Dzahn) [20:02:00] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] [20:03:10] PROBLEM - Disk space on cp4006 is CRITICAL: DISK CRITICAL - free space: / 345 MB (3% inode=86%) [20:04:15] (03PS5) 10Dzahn: wikimedia.org: repeat hostname on each line for multi records [dns] - 10https://gerrit.wikimedia.org/r/304155 [20:04:27] (03CR) 10jenkins-bot: [V: 04-1] wikimedia.org: repeat hostname on each line for multi records [dns] - 10https://gerrit.wikimedia.org/r/304155 (owner: 10Dzahn) [20:06:17] (03PS6) 10Dzahn: wikimedia.org: repeat hostname on each line for multi records [dns] - 10https://gerrit.wikimedia.org/r/304155 [20:10:11] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [20:11:08] 06Operations, 10Cassandra, 06Services (doing): some cassandra instances restarted with java.lang.OutOfMemoryError: Java heap space - https://phabricator.wikimedia.org/T148516#2735191 (10Eevans) Thanks to @Gehel, we have [[ https://people.wikimedia.org/~eevans/java_pid101952/ | confirmation that this is aberr... [20:13:00] is there a machine on our network (something i might have access to), with lots of available RAM (say 16G+), something I could use to analyze a very large heap dump? [20:13:22] large labs instance?
<|L> xlarge has 16 GB RAM [20:13:38] <|L> no bigger instance available [20:13:45] oh, can i make one that big? [20:13:49] that might work [20:14:05] urandom: the old cp spares in esams have 96 GB RAM (e.g. cp3012) [20:14:20] oooo [20:17:19] OK, so I would need shell access, the non-headless version of openjdk installed, and firewall rules that allowed me to forward X, how much work would this be for one of these spares? [20:17:58] i can try labs first, I guess [20:19:15] (03PS1) 10Yuvipanda: tcl: Fix /var/run/lighttpd permissions [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/317197 [20:19:41] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [20:19:52] (03CR) 10Yuvipanda: [C: 032] tcl: Fix /var/run/lighttpd permissions [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/317197 (owner: 10Yuvipanda) [20:26:00] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [20:33:01] RECOVERY - salt-minion processes on phab2001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [20:37:17] I guess "Failed to create instance." in labs means that the quota is exceeded? [20:37:18] (03PS1) 10Yuvipanda: Add some more useful packages to the base image [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/317259 [20:37:35] that's not an awesome error message [20:37:36] bd808: ^ [20:37:44] urandom: is that horizon or wikitech? [20:37:51] wikitech [20:38:08] yuvipanda: what is this horizon of which you speak? [20:38:14] urandom: yeah.. OSM's error messages are horrible [20:38:16] urandom: yeah, horizon.wikimedia.org is a less shitty version [20:38:27] i heard mark speak of it in the quarterly review [20:38:35] urandom: hopefully wikitech will no longer be needed in a couple months, horizon can do pretty much all the things wikitech can do [20:38:59] (03CR) 10BryanDavis: [C: 032] Add some more useful packages to the base image [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/317259 (owner: 10Yuvipanda) [20:39:04] yuvipanda: I never used wikitech for labs, only horizon, but I was wondering if we plan some improvement in speed ;) [20:39:08] yuvipanda: so... i can use horizon? or i one day will be able to? [20:39:15] urandom: you can right now [20:39:20] urandom: you need to enable 2fa in wikitech tho [20:39:21] (03Merged) 10jenkins-bot: Add some more useful packages to the base image [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/317259 (owner: 10Yuvipanda) [20:39:21] wikitech creds? [20:39:26] urandom: yup [20:39:27] yeah, done that [20:39:35] oh! how do I add another app? [20:39:51] since that is something i recently failed to discover [20:39:53] it uses the same token as wikitech [20:40:00] volans: yeah, I want lots of perf improvements tho :( not sure when andrewbogott will have time for it [20:40:11] PROBLEM - salt-minion processes on phab2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [20:40:22] bd808: yeah, i tried to setup my tablet with the authenticator app, and didn't know how to get the code [20:41:10] oh... yeah :/ So what you will have to do is disable 2fa, re-enable it and when you first enable copy down the seed value [20:41:21] oh.
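On the heap-dump question above: if a box with enough RAM is easier to get than one with X forwarding, the JDK's bundled jhat can serve the analysis over HTTP instead. A sketch; the host and dump filename are the ones named in T148516 earlier (and per that same task the dump had already been deleted there, so this is purely illustrative), and the heap flag just needs to exceed the dump size:
```
# jhat ships with JDK 7/8; it parses an .hprof dump and serves a browsable
# report on port 7000. Give the parsing JVM more memory than the dump.
jhat -J-Xmx48g /srv/cassandra-b/java_pid101952.hprof

# From a workstation, tunnel the report port instead of forwarding X:
ssh -L 7000:localhost:7000 restbase1007.eqiad.wmnet
# ...then browse http://localhost:7000/
```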
awesome, phab2001 salt minion is failing because of pypi [20:42:04] our oath extension could use some UX love to make that not so horrible. [20:42:33] It should just prompt you for a token and then show you the seed again [20:43:39] now that authmanager is in core features like that should be possible, and the security team is starting to work on some enhancements [20:43:43] Is python offline in the us? [20:43:56] does anybody know what the status of Java 8 in production is? supported/allowed ? [20:44:13] SMalyshev: I think we can run it on jessie in prod [20:44:24] volans ^^ is python offline for you? [20:44:25] paladox: python.org and pypi are using dynect (dyndns) so are affected by the DNS DDOS [20:44:30] Oh [20:44:38] volans python working for me in the uk [20:44:47] the website is not offline, but the DNS resolution might fail; it's working for me too right now [20:44:48] bd808: cool, thanks! [20:44:56] ok [20:45:01] (03PS4) 10KartikMistry: Configurable mode_path for apertium [puppet] - 10https://gerrit.wikimedia.org/r/297350 (https://phabricator.wikimedia.org/T139330) [20:45:10] SMalyshev: you might check with moritzm to find out the details [20:46:03] bd808: do you know about CI? [20:46:15] (03PS1) 10Yuvipanda: Add more things to base package [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/317274 [20:46:25] yeah, horizon is nice, and yeah it is quota [20:46:26] bd808: ^ [20:46:29] I don't think ci supports java 8, but i don't really know [20:46:59] a 16G instance would max out the quota of services, or services-test, all by its lonesome [20:47:13] urandom: yeah, we rejigged default quotas a few months ago [20:47:29] urandom: we ran into a quota crisis at wikimania, and so have become a bit more conservative since [20:47:36] and i'm guessing it would be rude to create it in deployment-prep [20:48:48] (03CR) 10BryanDavis: [C: 032] Add more things to base package [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/317274 (owner: 10Yuvipanda) [20:49:13] bd808: thanks! I think I'll rebuild now [20:49:31] SMalyshev: Not sure about CI. I would guess that there are no JDK8-supporting images yet. [20:50:35] SMalyshev: CI mostly depends on the status of OS packages, if jdk8 is available in jessie (or trusty) then we can use it pretty easily. Otherwise we'll need to figure something else out. [20:51:32] (03Merged) 10jenkins-bot: Add more things to base package [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/317274 (owner: 10Yuvipanda) [20:52:00] bd808: doing a rebuild [20:52:31] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [20:53:34] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0] [20:54:40] ^ that's something real that's maybe been flying under the radar since ~18:00 (nearly 3h) [20:55:31] hmmm it's on cache_upload [20:56:18] starts in eqiad, and then comes through at the others [20:56:19] .... [20:56:54] legoktm: I think it's available in backports, not sure if in mainstream os [20:56:57] (03PS1) 10Yuvipanda: tcl: Add fcgi headers [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/317275 [20:57:05] ACKNOWLEDGEMENT - cassandra service on maps2004 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed Gehel checking cassandra issue - gehel [20:57:24] swift?
SMalyshev: jessie-backports is also fine [20:58:45] legoktm: looks like it's there: https://packages.debian.org/jessie-backports/openjdk-8-jdk [20:59:18] (03CR) 10Yuvipanda: [C: 032] tcl: Add fcgi headers [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/317275 (owner: 10Yuvipanda) [20:59:35] But how do we install java 8 without it uninstalling java 7? [20:59:35] SMalyshev: ok, can you file a ticket in the #ci-infrastructure project asking for it to be installed? and what job you want to use it for [20:59:35] (03Merged) 10jenkins-bot: tcl: Add fcgi headers [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/317275 (owner: 10Yuvipanda) [20:59:49] urandom: https://phabricator.wikimedia.org/T140904 for quota bumps [20:59:57] We will most likely need to create a java 8 test. [21:00:31] error :-( [21:00:36] legoktm: ok, thanks. It's not ready yet but eventually will be... [21:00:39] yuvipanda: ah, is that documented somewhere? [21:01:18] (03PS3) 10Dzahn: maps::server: move base::firewall to role [puppet] - 10https://gerrit.wikimedia.org/r/315889 [21:02:23] urandom: I think 'setting up a labs node' is so generic I don't know. Krenair or someone in releng might know how to setup a particular thing on deployment-prep [21:02:26] RECOVERY - salt-minion processes on phab2001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [21:04:01] ok, the 5xx alerts are from cache_upload, storage stuff, looking into it [21:04:03] (am going afk for a bit, sorry!) [21:04:12] cheers bblack [21:05:07] yuvipanda, urandom, what's the question? [21:05:33] Krenair: do you know of https://wikitech.wikimedia.org/wiki/Labs_node_setup [21:06:05] !log restarting varnish backends (depooled, etc) for eqiad cache_upload: cp1049, cp1072, cp1074 [21:06:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:06:26] no [21:06:35] Im wondering could someone add this repo https://phabricator.wikimedia.org/diffusion/EPFM/repository/master/ to mediawiki-extension submodules please? [21:07:08] SemanticForms is being renamed. [21:07:58] 06Operations, 10ops-eqiad, 06Labs, 10Labs-Infrastructure: labvirt1005 - HP RAID controller issue (battery?) - https://phabricator.wikimedia.org/T148255#2735310 (10Cmjohnson) 05Open>03Resolved a:03Cmjohnson [21:09:54] PROBLEM - salt-minion processes on phab2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [21:14:07] RECOVERY - kartotherian endpoints health on maps2003 is OK: All endpoints are healthy [21:14:30] RECOVERY - kartotherian endpoints health on maps2002 is OK: All endpoints are healthy [21:14:31] RECOVERY - kartotherian endpoints health on maps2001 is OK: All endpoints are healthy [21:14:33] RECOVERY - kartotherian endpoints health on maps2004 is OK: All endpoints are healthy [21:15:01] yay gehel [21:16:08] RECOVERY - Kartotherian LVS codfw on kartotherian.svc.codfw.wmnet is OK: All endpoints are healthy [21:16:31] honestly, I'm not really sure why this worked... still trying to understand [21:16:48] that paged [21:16:54] he, what's up ? [21:17:09] kartotherian decided to get us paged ? [21:17:12] guess so [21:17:17] it was a recovery page though [21:17:49] oh I got the 2 CRITICALs as well, but just got home [21:17:58] plus, it was codfw.. so no live traffic, right ? [21:18:01] I got them much earlier, ge hel was working on it with b black's help [21:18:04] gehel: what happened ?
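As an aside on the jessie-backports exchange above: openjdk-8 and openjdk-7 install side by side on jessie, and the default "java" is then selected via update-alternatives. A sketch of the standard Debian procedure (package name from the packages.debian.org link quoted above; this is not a record of what was actually done on the CI hosts):
```
# Enable jessie-backports and install openjdk-8 next to openjdk-7:
echo 'deb http://httpredir.debian.org/debian jessie-backports main' > /etc/apt/sources.list.d/backports.list
apt-get update
apt-get -y -t jessie-backports install openjdk-8-jdk

# Both JDKs coexist under /usr/lib/jvm; pick which one "java" points at:
update-alternatives --config java
```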
[21:18:27] it was after the reimage I think
[21:18:34] ah, there's a backlog?
[21:18:37] uh huh
[21:19:25] akosiaris: I mixed up maps2004 and maps-test2004 during reimage. Really a bad case of stupid.
[21:19:51] maps traffic is now going to eqiad, which we had to do anyway...
[21:20:16] why did we only now get the recovery page, I wonder
[21:20:49] apergos: ? I got 2 CRITICALs and 1 recovery. Should I have received more?
[21:21:07] apergos: because codfw is just recovering now. It took me time to understand what went wrong...
[21:21:22] ah, that would be it then
[21:21:32] akosiaris: no, that's right, 2 crits from much earlier, a recovery now
[21:21:51] but as ge hel says, it's for codfw which is just now all cleaned up
[21:21:54] makes sense
[21:22:55] I never got the recovery from the eqiad one though, gehel
[21:23:10] none of us did, it was probably silenced or something
[21:23:14] there wasn't a recovery from eqiad (that I know of)
[21:23:27] eqiad was always working, but no traffic was sent there.
[21:23:57] ugh
[21:24:11] I broke codfw (which was serving traffic), Brandon helped me move the traffic to eqiad, so that I have time to understand what went wrong.
[21:24:47] maps was an aberration where traffic was served from codfw; this is now corrected (a nice side effect of my mistakes)
[21:25:58] :-)
[21:26:19] all's well that ends well etc
[21:27:07] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[21:28:50] I'm going to wander off then, it's past midnight o'clock. see ya
[21:30:10] apergos: good night!
[21:31:26] same here, not yet midnight, but late enough
[21:36:27] cool. good night!
[21:37:12] !log phab2001 - ip addr del 10.64.32.186/21 dev eth0
[21:37:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[21:38:11] RECOVERY - salt-minion processes on phab2001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[21:40:03] !log phab2001: that IP was also on iridium/phab1001; it should not be hardcoded in puppet, causing issues in T143363
[21:40:05] T143363: networking: allow ssh between iridium and phab2001 - https://phabricator.wikimedia.org/T143363
[21:40:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[21:44:18] (03PS5) 10Ppchelko: service::node - support sampled logging [puppet] - 10https://gerrit.wikimedia.org/r/302309 (https://phabricator.wikimedia.org/T139674)
[21:47:36] (03CR) 10Ppchelko: "I've rebased this, puppet compiler says it's a noop as expected: https://puppet-compiler.wmflabs.org/4461/" [puppet] - 10https://gerrit.wikimedia.org/r/302309 (https://phabricator.wikimedia.org/T139674) (owner: 10Ppchelko)
[21:49:34] 06Operations, 10Phabricator: networking: allow ssh between iridium and phab2001 - https://phabricator.wikimedia.org/T143363#2735412 (10Dzahn) 10.64.32.186 is hardcoded in puppet in several places ``` hieradata/role/eqiad/phabricator/main.yaml: - "10.64.32.186" hieradata/role/eqiad/phabricator/main.yaml: -...
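(A sketch of the check behind the `ip addr del` logged above; the device and address are taken from the log, and this needs root:)

    # confirm the duplicate service address is really bound to eth0 on this host
    ip -4 addr show dev eth0
    # remove the address that is also configured on iridium/phab1001
    ip addr del 10.64.32.186/21 dev eth0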
[21:54:59] (03CR) 10Dzahn: [C: 032] "also talked to Max, there is one instance for testing this role, it's a self-hosted puppetmaster and this should be fine" [puppet] - 10https://gerrit.wikimedia.org/r/315889 (owner: 10Dzahn)
[21:57:21] (03PS1) 10Smalyshev: Add configs for LDF server [puppet] - 10https://gerrit.wikimedia.org/r/317282
[21:57:27] (03CR) 10Dzahn: "puppet fails on maps1001 are totally unrelated and already acked in icinga -> T147780" [puppet] - 10https://gerrit.wikimedia.org/r/315889 (owner: 10Dzahn)
[21:59:51] 06Operations, 06Discovery, 06Maps, 03Interactive-Sprint, 13Patch-For-Review: Unmet dependencies around postgis apt packages on maps* servers - https://phabricator.wikimedia.org/T147780#2735416 (10Dzahn) 05Resolved>03Open [maps1001:~] $ puppet agent -tv .. Info: Caching catalog for maps1001.eqiad.wmne...
[21:59:53] (03CR) 10Dzahn: "no-op on maps1002, maps1003 .." [puppet] - 10https://gerrit.wikimedia.org/r/315889 (owner: 10Dzahn)
[22:00:34] PROBLEM - puppet last run on cp3016 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[22:02:05] 06Operations, 06Discovery, 06Maps, 03Interactive-Sprint, 13Patch-For-Review: Unmet dependencies around postgis apt packages on maps* servers - https://phabricator.wikimedia.org/T147780#2735424 (10Dzahn) Checking an unrelated change i noticed puppet fails on maps1001. puppet run-check in Icinga linked to...
[22:02:23] 06Operations, 06Discovery, 06Maps, 03Interactive-Sprint, 13Patch-For-Review: Unmet dependencies around postgis apt packages on maps* servers - https://phabricator.wikimedia.org/T147780#2735425 (10Dzahn) puppet works fine on maps1002,1003 though ....
[22:04:45] ACKNOWLEDGEMENT - puppet last run on maps1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[postgresql-9.4-postgis-2.1] daniel_zahn https://phabricator.wikimedia.org/T147780
[22:08:21] !log cp4006, cp4014 were running out of disk, apt-get clean
[22:08:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[22:16:13] RECOVERY - Disk space on cp4014 is OK: DISK OK
[22:17:30] !log cp4006,cp4014 gzipped some logs in home for disk space
[22:17:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[22:18:54] RECOVERY - Disk space on cp4006 is OK: DISK OK
[22:25:44] (03PS1) 10RobH: move kafka1003 to use raid10 with lvm [puppet] - 10https://gerrit.wikimedia.org/r/317284
[22:26:48] RECOVERY - puppet last run on cp3016 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[22:28:03] PROBLEM - puppet last run on sca1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[22:32:07] (03CR) 10RobH: [C: 032] move kafka1003 to use raid10 with lvm [puppet] - 10https://gerrit.wikimedia.org/r/317284 (owner: 10RobH)
[22:32:12] (03PS2) 10RobH: move kafka1003 to use raid10 with lvm [puppet] - 10https://gerrit.wikimedia.org/r/317284
[22:33:04] Why can't my tools.account connect to gerrit when I run git review?
[22:35:37] Zppix: probably because labs and production are not supposed to talk to each other per networking ACLs
[22:35:49] Damn
[22:36:17] But it's to labs/tools/zppixbot, not an ops repo
[22:36:22] Zppix: can you run it from your computer at home?
[22:36:37] If I had access to my PC
[22:37:16] how do you connect to tools now?
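(A sketch of the disk-space cleanup logged above for cp4006/cp4014; the log path is illustrative and the commands run as root:)

    df -h                      # confirm which filesystem is filling up
    apt-get clean              # drop cached .deb archives from /var/cache/apt
    du -sh /home/* | sort -h   # find the large leftovers under /home
    gzip /home/someuser/*.log  # compress big stray logs (hypothetical path)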
[22:37:30] Mobile
[22:37:43] ah
[22:37:56] Zppix: there is the gerrit patch uploader
[22:37:58] But it pulls fine
[22:38:20] https://tools.wmflabs.org/gerrit-patch-uploader/
[22:38:50] Zppix, mutante: you can also use gerrit's inline edit
[22:38:59] Zppix: you might want to confirm that in -labs
[22:44:50] there is a git client for android, fwiw https://play.google.com/store/apps/details?id=com.aor.pocketgit&hl=en
[22:45:31] iOS, mate
[22:45:35] I'm using a client
[22:47:13] the reason you can pull but not push is probably that you are pulling via https but pushing via ssh to the high port on gerrit
[22:47:22] you shouldn't upload your private key
[22:47:31] but you can push over http, after setting a password in gerrit config
[22:47:34] https
[22:48:13] it's called "HTTP password" in the Gerrit user settings
[22:48:21] I haven't tried it myself
[22:51:27] https://gerrit-review.googlesource.com/Documentation/user-upload.html#http
[22:52:48] Zppix ^^
[22:54:23] RECOVERY - puppet last run on sca1003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[22:55:06] How do I set the password?
[22:55:49] in the gerrit web UI, click your user name, go to settings
[22:56:22] Zppix: https://gerrit.wikimedia.org/r/#/settings/http-password
[22:56:34] it will be an automatically generated password for you.
[22:56:38] Click generate password
[22:56:49] It's so long though
[22:57:11] Yep
[22:57:14] Just copy and paste
[22:57:37] Where?
[22:58:47] edit your .git/config file and change the "remote gerrit" to use https instead of ssh
[22:58:52] run git-review
[22:59:06] when it asks you for a password, try the one you generated
[22:59:07] afaict
[22:59:10] Zppix: when you generate your password by going to https://gerrit.wikimedia.org/r/#/settings/http-password you can then copy what you see in the password field
[22:59:13] per mutante :)
[23:00:01] You mean .gitreview?
[23:00:12] Because .git/config doesn't have a gerrit remote
[23:12:13] 06Operations, 10EventBus: setup/install/deploy kafka1003 (WMF4723) - https://phabricator.wikimedia.org/T148849#2735538 (10RobH)
[23:12:18] 06Operations, 06Discovery, 06Maps, 10Traffic, 03Interactive-Sprint: Maps - move traffic to eqiad instead of codfw - https://phabricator.wikimedia.org/T145758#2639951 (10jeremyb) completed today?
[23:19:27] (03PS1) 10Dzahn: rename iridium-vcs to phab1001-vcs [dns] - 10https://gerrit.wikimedia.org/r/317290 (https://phabricator.wikimedia.org/T143363)
[23:20:51] (03CR) 10Paladox: [C: 031] rename iridium-vcs to phab1001-vcs [dns] - 10https://gerrit.wikimedia.org/r/317290 (https://phabricator.wikimedia.org/T143363) (owner: 10Dzahn)
[23:32:05] 06Operations, 06Discovery, 06Maps, 10Traffic, 03Interactive-Sprint: Maps - move traffic to eqiad instead of codfw - https://phabricator.wikimedia.org/T145758#2735559 (10BBlack) >>! In T145758#2735539, @jeremyb wrote: > completed today? Yes, there was an unplanned incident with the codfw kartotherian se...
[23:34:12] (03PS1) 10Dzahn: add phab2001-vcs.codfw.wmnet [dns] - 10https://gerrit.wikimedia.org/r/317291 (https://phabricator.wikimedia.org/T143363)
[23:34:14] bummer. maps1002 is running older software. can an ops person take down maps1002.eqiad for me please
[23:34:21] i don't want to do a scap3 now
[23:34:45] gehel is deep asleep by now
[23:35:47] !log maps1002.eqiad is running older/incorrect/misbehaving software for some reason, restart didn't help. Need to depool
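(A sketch of the HTTPS push flow described above; the remote name, repo path, and username are illustrative, and the password is the one generated on the Gerrit HTTP-password settings page:)

    # switch the gerrit remote from SSH (the high port, 29418) to HTTPS;
    # if no "gerrit" remote exists yet, git-review creates one from the
    # .gitreview file on its first run
    git remote set-url gerrit https://Zppix@gerrit.wikimedia.org/r/labs/tools/zppixbot
    # push for review; expect a prompt for the generated HTTP password
    git review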
[23:35:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:36:20] elukey, ping
[23:37:19] yurik: elukey is afk and also probably asleep
[23:38:04] are there other available opsies?
[23:38:37] I think so, the US ones :)
[23:39:03] lovely, IRC is having a field day: massive DDoS, maps codfw meltdown, one of the maps hosts in eqiad is running out-of-date software and polluting the cache...
[23:39:26] greg-g, do you know who might be around?
[23:39:40] yurik: greg-g is on an airplane i think
[23:39:44] lol
[23:40:00] ok, i will take down the service itself, and there will be plenty of pings.
[23:40:05] also, the DDoS attack only affected Europe a bit, not as much as it did the US
[23:40:08] on 1002 only
[23:41:39] * paladox restarts his pc for a windows 10 update :)
[23:44:39] mayday. Need someone with ops to kill a service.
[23:44:52] yurik, you should check with bblack (probably around, was involved earlier with this issue)
[23:45:00] apergos, thanks!
[23:45:01] yurik: i depooled it
[23:45:09] awesome, thanks mutante !!!
[23:45:26] right, almost 3 am and still not in bed
[23:45:27] !log depooling maps1002 (by running "depool" on the server itself)
[23:45:28] * apergos goes
[23:45:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:48:13] 06Operations, 10EventBus: setup/install/deploy kafka1003 (WMF4723) - https://phabricator.wikimedia.org/T148849#2735593 (10RobH)
[23:48:55] 06Operations, 10EventBus: setup/install/deploy kafka1003 (WMF4723) - https://phabricator.wikimedia.org/T148849#2734746 (10RobH) a:05RobH>03Ottomata @Ottomata I'm assigning this task to you, as the setup/deployment of the basic server/OS is complete. You can feel free to resolve this task once you are awar...
[23:50:52] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[23:58:38] (03PS1) 10Alex Monk: shinkengen: Ensure consistent ordering of hostgroups [puppet] - 10https://gerrit.wikimedia.org/r/317294
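(A sketch of the depool step logged above; it assumes the host-local depool/pool wrapper scripts available on WMF production servers and sudo rights, neither of which is shown in the log itself:)

    # on the affected host, take it out of the load-balancer rotation
    sudo depool    # stop the load balancer from sending it traffic
    # ... investigate or redeploy the misbehaving software ...
    sudo pool      # repool once the host serves the correct version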