[00:00:04] RoanKattouw ostriches Krenair: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151217T0000). Please do the needful. [00:00:05] csteipp: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [00:00:18] o/ [00:02:35] (03CR) 10Catrope: [C: 032] Set password policy for global sysadmin group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259436 (https://phabricator.wikimedia.org/T104370) (owner: 10CSteipp) [00:03:34] (03Merged) 10jenkins-bot: Set password policy for global sysadmin group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259436 (https://phabricator.wikimedia.org/T104370) (owner: 10CSteipp) [00:05:11] !log catrope@tin Synchronized wmf-config/CommonSettings.php: Password policy for sysadmin group (duration: 00m 29s) [00:05:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:06:24] Thanks! [00:08:02] PROBLEM - Router interfaces on cr2-knams is CRITICAL: CRITICAL: host 91.198.174.246, interfaces up: 59, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-0/0/2: down - Transit: Tele2 (AMS13-CORE-1:4/2) {#13443} [10Gbps]BR [00:11:30] 6operations, 10MobileFrontend: Stale copy of Wikipedia:Featured picture candidates/Peacock butterfly - https://phabricator.wikimedia.org/T121594#1886459 (10Legoktm) In the future, if you can get the cache headers, what mw backend served the request, and specific timestamps, that can be helpful when trying to d... 
[00:14:01] RECOVERY - Router interfaces on cr2-knams is OK: OK: host 91.198.174.246, interfaces up: 61, down: 0, dormant: 0, excluded: 0, unused: 0 [00:28:54] !log ori@tin Synchronized php-1.27.0-wmf.8/includes/api/ApiStashEdit.php: I552cf6b0420: Upgrade some ApiStashEdit logging calls to info() (duration: 00m 30s) [00:29:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:44:29] (03PS3) 10Aaron Schulz: [WIP] Configure $wgCdnReboundPurgeDelay [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258365 (https://phabricator.wikimedia.org/T113192) [00:44:48] (03PS4) 10Aaron Schulz: Configure $wgCdnReboundPurgeDelay [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258365 (https://phabricator.wikimedia.org/T113192) [00:58:31] PROBLEM - puppet last run on cp3039 is CRITICAL: CRITICAL: puppet fail [01:00:04] twentyafterfour: Dear anthropoid, the time has come. Please deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151217T0100). [01:03:53] PROBLEM - puppet last run on suhail is CRITICAL: CRITICAL: puppet fail [01:09:29] (03PS3) 10Yuvipanda: ores: Stop using aof for redis persistance [puppet] - 10https://gerrit.wikimedia.org/r/259593 (https://phabricator.wikimedia.org/T121658) [01:09:40] (03CR) 10Yuvipanda: [C: 032 V: 032] ores: Stop using aof for redis persistance [puppet] - 10https://gerrit.wikimedia.org/r/259593 (https://phabricator.wikimedia.org/T121658) (owner: 10Yuvipanda) [01:09:53] (03PS3) 10Yuvipanda: labs: Remove nfs_mounts params [puppet] - 10https://gerrit.wikimedia.org/r/259602 [01:10:22] (03CR) 10Yuvipanda: [C: 032 V: 032] labs: Remove nfs_mounts params [puppet] - 10https://gerrit.wikimedia.org/r/259602 (owner: 10Yuvipanda) [01:15:33] 6operations, 10MobileFrontend: Stale copy of Wikipedia:Featured picture candidates/Peacock butterfly - https://phabricator.wikimedia.org/T121594#1886659 (10dr0ptp4kt) Right. I had copied from a curl but failed to paste. 
It's of about no consequence now, but it was as I recall a miss (0), hit (3), hit (11) afte... [01:24:21] RECOVERY - puppet last run on cp3039 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [01:24:33] PROBLEM - mailman_queue_size on fermium is CRITICAL: CRITICAL: 1 mailman queue(s) above limits (thresholds: bounces: 25 in: 25 virgin: 25) [01:28:32] RECOVERY - mailman_queue_size on fermium is OK: OK: mailman queues are below the limits. [01:29:41] RECOVERY - puppet last run on suhail is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [01:30:52] PROBLEM - puppet last run on cp3017 is CRITICAL: CRITICAL: puppet fail [01:31:00] (03PS15) 10MaxSem: OSM replication for maps [puppet] - 10https://gerrit.wikimedia.org/r/254490 (https://phabricator.wikimedia.org/T110262) [01:33:30] 6operations, 6Analytics-Backlog, 10Wikipedia-iOS-App-Product-Backlog, 6Zero, and 3 others: Request one server to suport piwik analytics - https://phabricator.wikimedia.org/T116312#1886690 (10Dzahn) Yes, it would be in wikimedia.org in DNS. @ori but how will that system know anything about traffic on shop... [01:38:42] 6operations, 6Labs, 10Labs-Infrastructure, 7Icinga, 5Patch-For-Review: labtestcontrol2001 should not make Icinga page us - https://phabricator.wikimedia.org/T120047#1886694 (10Dzahn) a:3Dzahn [01:44:44] 6operations, 10RESTBase, 7Graphite, 7service-runner: restbase should send metrics in batches - https://phabricator.wikimedia.org/T121231#1886701 (10Pchelolo) I've added this capability to #service-runner experimentally, it can be enabled by providing `batch: true` to the metrics config. Next step is to try... 
[01:56:32] RECOVERY - puppet last run on cp3017 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [02:27:25] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.8) (duration: 10m 32s) [02:27:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:45:59] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.9) (duration: 08m 22s) [02:46:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:53:11] !log l10nupdate@tin ResourceLoader cache refresh completed at Thu Dec 17 02:53:11 UTC 2015 (duration 7m 12s) [02:53:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:01:22] RECOVERY - Last backup of the others filesystem on labstore1001 is OK: OK - Last run for unit replicate-others was successful [03:59:16] 6operations, 10RESTBase, 7Graphite, 7service-runner: restbase should send metrics in batches - https://phabricator.wikimedia.org/T121231#1886785 (10ori) Addition is commutative. So why not coalesce counter updates within a batch? If a counter is incremented twice in a single update interval, send `foo:2|c`... [04:11:33] 6operations, 10RESTBase, 7Graphite, 7service-runner: restbase should send metrics in batches - https://phabricator.wikimedia.org/T121231#1886792 (10Pchelolo) @ori Great idea, however I'm not quite sure how would that work in production. The batch contains maximum 20 messages (the limit is set by an appro... 
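The coalescing idea from T121231 above (two increments of one counter in a single interval become `foo:2|c`) can be sketched roughly as follows. This is a minimal illustration, not service-runner's implementation: the function names are hypothetical, the 20-line packet limit is taken from the comment above, and newline-separated metrics in one datagram is the usual statsd convention. Sampled counters such as `foo:1|c|@0.1` are deliberately left unmerged here.

```python
from collections import defaultdict


def coalesce_counters(lines):
    """Sum plain counter increments per metric name.

    Timers, gauges and sampled counters are passed through unchanged,
    since only unsampled counters can be safely added together.
    (Hypothetical helper, for illustration only.)
    """
    totals = defaultdict(int)
    passthrough = []
    for line in lines:
        name, rest = line.split(":", 1)
        value, metric_type = rest.split("|", 1)
        if metric_type == "c":
            totals[name] += int(value)
        else:
            passthrough.append(line)
    counters = [f"{name}:{total}|c" for name, total in totals.items()]
    return counters + passthrough


def pack_batches(lines, max_per_packet=20):
    """Join metric lines with newlines, at most max_per_packet per datagram."""
    return [
        "\n".join(lines[i:i + max_per_packet])
        for i in range(0, len(lines), max_per_packet)
    ]


if __name__ == "__main__":
    pending = ["foo:1|c", "req_time:320|ms", "foo:1|c", "bar:1|c|@0.1"]
    for packet in pack_batches(coalesce_counters(pending)):
        print(packet)
```

As GWicke notes later in the task, most of the metrics in question are timers, so coalescing alone would probably save little; packing several lines per datagram is what reduces the packet count.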
[05:22:41] PROBLEM - puppet last run on cp4005 is CRITICAL: CRITICAL: puppet fail [05:48:33] RECOVERY - puppet last run on cp4005 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [06:24:22] PROBLEM - Disk space on restbase1002 is CRITICAL: DISK CRITICAL - free space: /var 105248 MB (3% inode=99%) [06:31:13] PROBLEM - puppet last run on lvs2002 is CRITICAL: CRITICAL: puppet fail [06:58:53] RECOVERY - puppet last run on lvs2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:13:48] 6operations, 6Performance-Team, 10Thumbor, 5Patch-For-Review: Use cgroups to limit thumbor & subprocesses resource usage - https://phabricator.wikimedia.org/T120940#1886922 (10Gilles) [07:33:21] 7Blocked-on-Operations, 6operations, 6Discovery, 3Discovery-Cirrus-Sprint: Make elasticsearch cluster accessible from analytics hadoop workers - https://phabricator.wikimedia.org/T120281#1886955 (10Joe) So, apart from the question of "where should we consume the data from", which is important, I think we s... [07:34:04] <_joe_> how is that ticket "blocked on operations", [07:34:26] <_joe_> it's "blocked on XY", I'd say [07:47:24] 7Blocked-on-Operations, 6operations, 6Discovery, 3Discovery-Cirrus-Sprint: Make elasticsearch cluster accessible from analytics hadoop workers - https://phabricator.wikimedia.org/T120281#1886968 (10yuvipanda) Tangential, but just wanted to point out that this won't be the last time we need to provide aggre... 
[07:53:53] (03PS1) 10Legoktm: extdist: Split skindist log into a separate file [puppet] - 10https://gerrit.wikimedia.org/r/259639 [08:02:36] (03PS4) 10Giuseppe Lavagetto: Version bump [software/conftool] - 10https://gerrit.wikimedia.org/r/258981 [08:02:39] (03PS6) 10Giuseppe Lavagetto: Add confctl the ability to find all instances of an entity [software/conftool] - 10https://gerrit.wikimedia.org/r/258428 [08:02:41] (03PS2) 10Giuseppe Lavagetto: Made locking optional as it might slow down syncing significantly [software/conftool] - 10https://gerrit.wikimedia.org/r/259492 [08:08:31] PROBLEM - Incoming network saturation on labstore1003 is CRITICAL: CRITICAL: 12.00% of data above the critical threshold [100000000.0] [08:09:34] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [08:10:42] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [08:14:29] 6operations, 10RESTBase, 7Graphite, 7service-runner: restbase should send metrics in batches - https://phabricator.wikimedia.org/T121231#1886996 (10GWicke) Counters are also rather rare. Most of our metrics are timers. 
[08:16:52] PROBLEM - Mobile HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [08:18:41] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [08:18:52] RECOVERY - Mobile HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [08:19:31] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [08:36:22] RECOVERY - Incoming network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0] [09:46:51] PROBLEM - puppet last run on mw1004 is CRITICAL: CRITICAL: Puppet has 58 failures [10:18:52] PROBLEM - puppet last run on wtp2020 is CRITICAL: CRITICAL: puppet fail [10:30:23] !log nodetool stop -- CLEANUP restbase1004 [10:30:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:35:11] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 700 [10:35:42] ^FR [10:36:02] (03PS3) 10Giuseppe Lavagetto: mediawiki: add conftool-specifc credentials and scripts [puppet] - 10https://gerrit.wikimedia.org/r/258979 [10:40:13] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 999 [10:43:42] RECOVERY - puppet last run on wtp2020 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [10:45:11] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 1263 [10:47:37] !log performing schema change on s7-master metawiki.oauth_registered_consumer [10:47:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:48:52] 6operations, 10MediaWiki-General-or-Unknown, 7Graphite, 5MW-1.27-release-notes, 5Patch-For-Review: mediawiki should send statsd metrics in batches - 
https://phabricator.wikimedia.org/T116031#1887181 (10Addshore) For reference he reduction in packets https://commons.wikimedia.org/wiki/File:Graphite_Mediaw... [10:50:11] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 1564 [10:51:32] (03PS4) 10Giuseppe Lavagetto: mediawiki: add conftool-specifc credentials and scripts [puppet] - 10https://gerrit.wikimedia.org/r/258979 [10:52:02] PROBLEM - mathoid endpoints health on sca1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:55:11] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 1863 [10:56:02] RECOVERY - mathoid endpoints health on sca1002 is OK: All endpoints are healthy [10:56:47] (03PS5) 10Giuseppe Lavagetto: mediawiki: add conftool-specifc credentials and scripts [puppet] - 10https://gerrit.wikimedia.org/r/258979 [10:57:32] PROBLEM - puppet last run on db2067 is CRITICAL: CRITICAL: puppet fail [11:00:11] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 1109 [11:05:11] RECOVERY - check_mysql on db1008 is OK: Uptime: 2911242 Threads: 172 Questions: 135601707 Slow queries: 27337 Opens: 85776 Flush tables: 2 Open tables: 64 Queries per second avg: 46.578 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 3 [11:12:02] PROBLEM - salt-minion processes on cygnus is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [11:15:25] addshore: would you have any bandwidth to look at jobrunner too? [11:15:36] possibly :P [11:17:20] haha thanks! but the bulk is gone now, so thanks again! [11:17:33] hey good morning godog and addshore :-) [11:17:37] No worries! It was only a 2 line change after all ;) [11:17:41] Morning hashar ! 
[11:17:44] nice to see several statsd metrics can be embedded in a single packet [11:18:26] yeh, it makes me wonder if some of the old things people tried to do with statsd might work now (such as tracking 1:1000 api calls (or hooke calls) etc [11:19:00] ciao hashar [11:19:03] the thing with jobrunner, is that it has its own low level implementation [11:19:05] and just socket_sendto() [11:19:33] Well, I just cloned the thing so I can take a look :) [11:19:53] PROBLEM - mathoid endpoints health on sca1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:20:19] !log performing schema change on x1-master wikishared.cx_translations [11:20:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:20:52] Godog, how feasible do you think some sort of admin group would be to graphite (to move and delete metrics?) [11:21:12] addshore: in theory we could have jobrunner depends on "liuggio/statsd-php-client" , librarize mw/core SamplingStatsdClient or upstream it [11:21:42] RECOVERY - mathoid endpoints health on sca1001 is OK: All endpoints are healthy [11:21:53] RECOVERY - puppet last run on db2067 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:23:11] hashar, well, is the jobrunner stuff even sampled? [11:23:33] maybe not :( [11:23:33] forget me [11:23:35] :-) [11:24:11] heh, and of course there are no tests for this thing ;) [11:26:52] (03CR) 10Alexandros Kosiaris: [C: 031] "ready to be deployed. Puppet compiler says OK as well http://puppet-compiler.wmflabs.org/1504/sca1002.eqiad.wmnet/." 
[puppet] - 10https://gerrit.wikimedia.org/r/250910 (https://phabricator.wikimedia.org/T117657) (owner: 10KartikMistry) [11:27:41] its possible we may just be able to steal the reducing stuff from https://github.com/liuggio/statsd-php-client/blob/master/src/Liuggio/StatsdClient/StatsdClient.php [11:27:56] if we really dont want to just use the library ;) [11:27:56] (03PS6) 10Giuseppe Lavagetto: mediawiki: add conftool-specifc credentials and scripts [puppet] - 10https://gerrit.wikimedia.org/r/258979 [11:28:43] PROBLEM - mathoid endpoints health on sca1002 is CRITICAL: /{format}/ is CRITICAL: Could not fetch url http://10.64.48.29:10042/complete/: Timeout on connection while downloading http://10.64.48.29:10042/complete/ [11:28:50] I'll see if I can throw a patch up :) [11:29:30] (doing it a different way though) [11:30:28] addshore: re: admin group it'd mean having write access to all metrics, unless it gets out of control I'd wait on it, unless it is creating problems? [11:31:24] well, really I would like to move some metrics around that wikidata have in there, but its bearable for now (as of course when they move in graphite I have to also change the things reporting them and the things consuming them [11:31:34] !log performing schema change on x1-master flowdb [11:31:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:32:30] I'll leave it for now :) [11:32:52] RECOVERY - mathoid endpoints health on sca1002 is OK: All endpoints are healthy [11:33:06] !log performing schema change on officewiki [11:33:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:36:22] RECOVERY - salt-minion processes on cygnus is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [11:39:03] PROBLEM - mathoid endpoints health on sca1002 is CRITICAL: /{format}/ is CRITICAL: Could not fetch url http://10.64.48.29:10042/complete/: Timeout on connection while downloading 
http://10.64.48.29:10042/complete/ [11:41:02] RECOVERY - mathoid endpoints health on sca1002 is OK: All endpoints are healthy [11:44:21] PROBLEM - mathoid endpoints health on sca1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:46:11] RECOVERY - mathoid endpoints health on sca1001 is OK: All endpoints are healthy [11:47:12] PROBLEM - mathoid endpoints health on sca1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:47:22] (03CR) 10Alexandros Kosiaris: [C: 032] Add ferm to role::puppet::self [puppet] - 10https://gerrit.wikimedia.org/r/259608 (owner: 10BryanDavis) [11:49:56] 6operations, 7Swift: swift upgrade plans - https://phabricator.wikimedia.org/T117972#1887291 (10fgiunchedi) another option of course is to (officially) backport 2.5 from stretch to jessie-backports. in terms of upstream support, swift 2.2.0 isn't in Kilo (EOL: 2016-05-02) http://docs.openstack.org/releases/re... [11:53:12] RECOVERY - mathoid endpoints health on sca1002 is OK: All endpoints are healthy [11:53:45] (03PS2) 10Jcrespo: Repool db1041 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259532 [11:54:31] (03CR) 10Jcrespo: [C: 032] Repool db1041 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259532 (owner: 10Jcrespo) [11:55:47] (03PS4) 10Alexandros Kosiaris: Add ferm to role::puppet::self [puppet] - 10https://gerrit.wikimedia.org/r/259608 (owner: 10BryanDavis) [11:55:51] (03CR) 10Alexandros Kosiaris: [V: 032] Add ferm to role::puppet::self [puppet] - 10https://gerrit.wikimedia.org/r/259608 (owner: 10BryanDavis) [11:56:14] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1041 with low weight (duration: 00m 37s) [11:56:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:56:21] PROBLEM - mathoid endpoints health on sca1001 is CRITICAL: / is CRITICAL: Could not fetch url http://10.64.32.153:10042/: Timeout on connection while downloading http://10.64.32.153:10042/ 
[12:00:05] kart_ akosiaris mobrovac: Respected human, time to deploy Content Translation server service-runner migration (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151217T1200). Please do the needful. [12:00:05] kart_: A patch you scheduled for Content Translation server service-runner migration is about to be deployed. Please be available during the process. [12:00:22] RECOVERY - mathoid endpoints health on sca1001 is OK: All endpoints are healthy [12:00:36] akosiaris: it is time. [12:05:31] godog: https://gerrit.wikimedia.org/r/#/c/259660/ but I think it may bet better to instead buffer them >.> as there are not actually mutliple things sending stats in a row really [12:05:41] PROBLEM - mathoid endpoints health on sca1002 is CRITICAL: / is CRITICAL: Could not fetch url http://10.64.48.29:10042/: Timeout on connection while downloading http://10.64.48.29:10042/ [12:08:36] !log depool sca1001 from cxserver service. [12:08:43] addshore: nit: you forgot to refer the task :-D [12:08:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:08:50] !log disable puppet, stop salt-minion on sca1002 [12:08:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:09:32] akosiaris: is it 1001 or 1002? [12:10:14] kart_: we will be deploying to 1001, so it's depooled. I 've disabled puppet and salt on 1002 so that it remains untouched by our changes until we are sure they work fine [12:10:43] <_joe_> akosiaris: that's less fun though [12:10:58] hashar: oh yeh ;) [12:11:07] _joe_: true. lemme issue a random restart on wdqs so you have something to do :P [12:11:34] <_joe_> akosiaris: a random restart of pybal would be more fun [12:11:45] <_joe_> it would call all of our friends to the party with us :P [12:12:36] kart_: so, ready ? [12:12:42] mobrovac: hola [12:12:48] akosiaris: yes. 
[12:12:50] feel free to deploy the code, I 'll merge the puppet change [12:12:54] ciao ciao kart_ akosiaris [12:12:58] sorry for being late [12:13:06] mobrovac: no worries [12:13:22] PROBLEM - salt-minion processes on sca1002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [12:13:30] mobrovac: I'm merging cxserver/deploy patch. [12:13:32] (03PS1) 10Jcrespo: Depool db1031 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259663 [12:13:41] RECOVERY - mathoid endpoints health on sca1002 is OK: All endpoints are healthy [12:13:43] kk kart_ [12:13:49] (03CR) 10Alexandros Kosiaris: [C: 032] service-runner migration for cxserver [puppet] - 10https://gerrit.wikimedia.org/r/250910 (https://phabricator.wikimedia.org/T117657) (owner: 10KartikMistry) [12:13:55] (03PS25) 10Alexandros Kosiaris: service-runner migration for cxserver [puppet] - 10https://gerrit.wikimedia.org/r/250910 (https://phabricator.wikimedia.org/T117657) (owner: 10KartikMistry) [12:14:37] (03PS1) 10ArielGlenn: jessie 2014.7.5 patch for batch cli returns with broken dict [debs/salt] (jessie) - 10https://gerrit.wikimedia.org/r/259664 [12:15:01] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [12:15:21] (03CR) 10Alexandros Kosiaris: [V: 032] service-runner migration for cxserver [puppet] - 10https://gerrit.wikimedia.org/r/250910 (https://phabricator.wikimedia.org/T117657) (owner: 10KartikMistry) [12:16:54] kart_: akosiaris: puppet needs to be run after the code has been deployed to sca [12:17:01] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. 
[12:17:23] mobrovac: well, even if it runs before, server is depooled, no harm will come to the service [12:17:41] ah kk [12:18:00] (03PS1) 10Bartosz Dziewoński: Enable cross-wiki upload A/B test in additional languages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259665 (https://phabricator.wikimedia.org/T120867) [12:19:15] (03PS1) 10Addshore: Grafana increase homepage dash list limits [puppet] - 10https://gerrit.wikimedia.org/r/259666 [12:20:01] PROBLEM - mathoid endpoints health on sca1002 is CRITICAL: /{format}/ is CRITICAL: Could not fetch url http://10.64.48.29:10042/complete/: Timeout on connection while downloading http://10.64.48.29:10042/complete/ [12:22:57] addshore: sweet! yeah I don't really know how that part works, buffering might make sense too like in T121231 [12:23:44] akosiaris: done [12:24:05] !log Updated cxserver on sca1002 [12:24:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:24:13] godog: also, for what I was talking about earlier for moving metrics, I guess the best idea would be puppetize a script that does it and then give the group sudo for the script [12:24:18] kart_: 1001 :P [12:24:25] blah [12:25:20] kart_: looks like it's running fine [12:26:07] nice [12:26:28] and with firejail... [12:26:40] ah, no [12:26:42] addshore: yeah that'd make it safer indeed [12:26:46] so nagios complains HTTP WARNING: HTTP/1.1 404 Not Found - 276 bytes in 0.010 second response time [12:26:58] lemme see if neon's config needs update first [12:27:54] 7Blocked-on-Operations, 6operations, 6Discovery, 3Discovery-Cirrus-Sprint: Make elasticsearch cluster accessible from analytics hadoop workers - https://phabricator.wikimedia.org/T120281#1849958 (10Joe) Actually, I think the component @mobrovac is implementing for services could be an optimal candidate to... [12:29:44] <_joe_> akosiaris: that check is that? 
[12:30:21] RECOVERY - mathoid endpoints health on sca1002 is OK: All endpoints are healthy [12:30:22] godog: infact https://github.com/graphite-project/whisper/blob/master/bin/whisper-merge.py Would allow me to start sending data to a new location, run this to merge the old data into it, then file a phab ticket for deletion of the old one [12:30:29] <_joe_> uhm 404, that doesn't sound good [12:30:40] _joe_: neon wanted an update [12:30:45] it wasn't check for /_info [12:30:51] checking* [12:32:08] kart_: _joe_ and the check is ok now [12:32:13] great [12:32:20] so, seems like everything is working fine [12:32:28] I 'll repool sca1001 then [12:32:54] cool [12:33:07] !repooling sca1001 for cxserver [12:33:13] !log repooling sca1001 for cxserver [12:33:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:33:21] !log depooling sca1002 for cxserver [12:33:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:33:53] akosiaris: Let me know when I can deploy again? [12:34:10] kart_: in about 30 secs [12:34:49] kart_: ok you can deploy now [12:35:27] ok. [12:36:12] RECOVERY - salt-minion processes on sca1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [12:37:12] http://cxserver.wikimedia.org/ seems 503. [12:37:54] akosiaris: we need to restart service once deployment is done. [12:38:04] (03PS1) 10ArielGlenn: jessie 2014.7.5 continue reading events even after getting one with wrong tag [debs/salt] (jessie) - 10https://gerrit.wikimedia.org/r/259668 [12:38:57] PROBLEM - LVS HTTP IPv4 on cxserver.svc.eqiad.wmnet is CRITICAL: Connection refused [12:39:01] PROBLEM - mathoid endpoints health on sca1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:39:04] hmm [12:39:09] that should not have happened [12:39:11] looking [12:39:29] akosiaris: I'm done with deployment. 
[12:40:58] ah, I think it's the pybal check [12:41:43] 7Blocked-on-Operations, 6operations, 6Discovery, 3Discovery-Cirrus-Sprint: Make elasticsearch cluster accessible from analytics hadoop workers - https://phabricator.wikimedia.org/T120281#1887352 (10mobrovac) Yup, @Joe, indeed it sounds like this could benefit from the #EventBus system we are working on tog... [12:43:02] RECOVERY - mathoid endpoints health on sca1002 is OK: All endpoints are healthy [12:43:07] akosiaris: should probably be moved to use /_info ? [12:43:26] for the check [12:43:39] <_joe_> akosiaris: need help with this? [12:43:56] already uploading the pybal change, thanks [12:44:40] (03PS1) 10Alexandros Kosiaris: cxserver: Move pybal configuration to monitor /_info [puppet] - 10https://gerrit.wikimedia.org/r/259669 [12:44:57] <_joe_> akosiaris: btw pybal now can support redirects, but not 404s ofc [12:45:07] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] cxserver: Move pybal configuration to monitor /_info [puppet] - 10https://gerrit.wikimedia.org/r/259669 (owner: 10Alexandros Kosiaris) [12:45:10] curl localhost:8080/_info works on sca1001, but not on sca1002 [12:45:22] akosiaris: kart_ ^^ [12:45:49] <_joe_> 1 is enough for now :P [12:46:02] ah, it hasn't been restarted yet on sca1001 it seems [12:46:07] still running the old code [12:46:09] <_joe_> uhm [12:46:18] <_joe_> maybe it's better to leave it like that? [12:46:30] (03PS2) 10Muehlenhoff: Bump connection limit to 8192 [puppet] - 10https://gerrit.wikimedia.org/r/259534 [12:46:39] _joe_: ? [12:46:46] is that a friday joke? [12:46:47] :P [12:46:48] <_joe_> running the old code :P [12:47:02] _joe_: Good Friday. 
[12:47:05] <_joe_> since it works where the old code is still deployed [12:47:06] (03CR) 10Muehlenhoff: [C: 032 V: 032] Bump connection limit to 8192 [puppet] - 10https://gerrit.wikimedia.org/r/259534 (owner: 10Muehlenhoff) [12:47:48] <_joe_> yeah on sca1002 both / and /_info return 404 [12:47:59] please don't touch anything [12:48:26] ok pybal fixed, page about cxserver.svc.eqiad.wmnet should be coming soon [12:48:28] 1001 seems OK [12:50:20] kk, sca1002 responds now to /_info [12:50:34] <_joe_> wow a lot of errors in the logs [12:50:52] https://cxserver.wikimedia.org/v1/ seems fine now. [12:51:32] !log restarted cxserver on sca1002 [12:51:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:51:54] !log repooled sca1002 for cxserver [12:52:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:53:22] ok, fixing the page issue right now and I think we can declared this successful [12:53:24] (03PS1) 10ArielGlenn: make ping_on_rotate work without minion data cache [debs/salt] (jessie) - 10https://gerrit.wikimedia.org/r/259671 [12:53:25] does not seem fine to me [12:53:32] dictionaries/mt not working [12:53:49] (03PS1) 10Alexandros Kosiaris: cxserver: also fix the icinga LVS check [puppet] - 10https://gerrit.wikimedia.org/r/259672 [12:54:10] Nikerabbit: is the only thing not working or more are problematic ? 
[12:54:26] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] cxserver: also fix the icinga LVS check [puppet] - 10https://gerrit.wikimedia.org/r/259672 (owner: 10Alexandros Kosiaris) [12:54:32] (03PS2) 10Alexandros Kosiaris: cxserver: also fix the icinga LVS check [puppet] - 10https://gerrit.wikimedia.org/r/259672 [12:54:35] (03CR) 10Alexandros Kosiaris: [V: 032] cxserver: also fix the icinga LVS check [puppet] - 10https://gerrit.wikimedia.org/r/259672 (owner: 10Alexandros Kosiaris) [12:55:47] <_joe_> Nikerabbit: kart_ https://phabricator.wikimedia.org/P2434 [12:55:50] kart_: got a big stacktrace in logs [12:56:09] that one ^ [12:56:22] _joe_: yes that's what's breaking the things I tested, lemme see [12:59:05] my first guess is that something wrong with config [12:59:17] kart_: might it be because of https://gerrit.wikimedia.org/r/#/c/258122/ ? [12:59:23] PROBLEM - mathoid endpoints health on sca1001 is CRITICAL: /{format}/ is CRITICAL: Could not fetch url http://10.64.32.153:10042/complete/: Timeout on connection while downloading http://10.64.32.153:10042/complete/ [12:59:41] kart_: in prod, the cert is set, while your patch says it needs to be null (sca is on ubuntu) [13:00:59] mobrovac: umm. Only required while using Node 0.10.x on Ubuntu. [13:01:03] Nikerabbit: ^ [13:01:27] mobrovac: apart from that, cx can't load the page too. [13:01:32] RECOVERY - mathoid endpoints health on sca1001 is OK: All endpoints are healthy [13:01:54] kart_: sca is on ubuntu and using node 0.10 [13:01:55] !log moved old cruft /srv/deployment/cxserver/deploy/src/config.js out of the way [13:02:00] mobrovac: that wouldn't break so many things, but yes ca should be after we get everything else fixed [13:02:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:02:32] mobrovac: We probably need, https://gerrit.wikimedia.org/r/#/c/259209/ [13:03:03] kart_: service-runner loads the config.yaml only [13:03:18] mobrovac: okay. so safer on that side. 
[13:03:19] unless you use that one explicitly somewhere in your code [13:05:47] kart_: should look at what the code in lines 7,8,9 from https://phabricator.wikimedia.org/P2434 does and why it explodes [13:06:58] so it looks like app.conf.registry is undefined in https://gerrit.wikimedia.org/r/#/c/244145/28/registry/index.js [13:07:49] <_joe_> can we roll back in the meanwhile? akosiaris: how feasible would that be? [13:08:44] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1001 is CRITICAL: CRITICAL - Expecting active but unit create-dbusers is failed [13:11:37] _joe_: rather feasible I think [13:11:47] !log starting to restart hhvm on application servers (to effect security updates for libxml2, openssl and others) [13:11:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:12:31] akosiaris: can you look into registry? in config.yaml.erb OK? [13:13:19] kart_: looks mostly fine [13:14:14] PROBLEM - SSH on mw1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:15:33] (03PS1) 10ArielGlenn: 2014.7.5 jessie, backport patches for singleton SAuth class [debs/salt] (jessie) - 10https://gerrit.wikimedia.org/r/259674 [13:16:02] kart_: so, /etc/cxserver/cxserver.yaml which has all the basic settings looks ok [13:16:13] RECOVERY - SSH on mw1004 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [13:17:20] kart_: and I assume you get registry from /srv/deployment/cxserver/deploy/src/config.yaml ? [13:17:29] which btw is a symlink to /srv/deployment/cxserver/deploy/src/config.dev.yaml [13:17:42] which is semantically incorrent since we are in production, not dev [13:17:47] I don't see anything logged to logstash [13:17:56] and no log files either [13:18:06] Nikerabbit: /srv/log/cxserver/main.log [13:18:20] akosiaris: aha. [13:18:23] yes, bad place, will fix it at some point [13:18:31] akosiaris: That need to fix. 
[13:18:54] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1001 is OK: OK - create-dbusers is active [13:19:08] also, that config.yaml files has a lot of duplicate settings. as in duplicated from /etc/cxserver/cxserver.yaml [13:19:30] er, /etc/cxserver/config.yaml [13:20:13] akosiaris: if it is using /etc/cxserver/config.yaml that would explain the stacktraces we see, there is no "registry" in there [13:20:23] but I am completely puzzled which file it is actually using [13:20:46] or which files, if multiple [13:20:52] Nikerabbit: what do you mean "if" ? it's passed as the argument to the server.js file [13:21:09] as in /usr/bin/nodejs src/server.js -c /etc/cxserver/config.yaml [13:21:23] akosiaris: then I suspect that is the cause [13:21:41] 6operations: Update firejail to 0.36 - https://phabricator.wikimedia.org/T121756#1887385 (10MoritzMuehlenhoff) 3NEW a:3MoritzMuehlenhoff [13:22:09] Nikerabbit: could be. depends on how many config files cxserver loads upon start up [13:22:35] only /etc/cxserver/config.yaml is read guys [13:22:45] can I help somehow? [13:22:47] config.yaml from the repo is not [13:23:01] jynus: no, go back to the shadow from which you came :P [13:23:05] and i don't see any mt pairs there [13:23:12] akosiaris: we're getting registry correctly? [13:23:20] not I guess. [13:23:40] so that's the cause. symlink to prod/dev doesn't matter. [13:24:08] that yaml comes from puppet, am I right? [13:24:13] yes [13:24:15] and we no longer have merging with defaults? [13:24:23] PROBLEM - dhclient process on mw1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:24:24] PROBLEM - configured eth on mw1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[13:24:30] either we need that merging or specify everything in puppet [13:24:34] PROBLEM - SSH on mw1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:24:36] I was about to point that out [13:24:45] PROBLEM - nutcracker process on mw1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:24:48] we had moved to a scheme where registry WAS not in puppet [13:24:54] PROBLEM - Disk space on mw1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:24:55] I will go with 2nd option as of now. [13:25:00] akosiaris: ^ [13:25:05] PROBLEM - salt-minion processes on mw1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:25:08] which is what ? [13:25:24] akosiaris: kart_: the pairs need to be added to the config in ops/puppet [13:25:24] oh, specify it in pupppet ? [13:25:46] so the old functionality of merging it with defaults did not make it to service-runner ? [13:25:53] PROBLEM - DPKG on mw1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:26:03] PROBLEM - nutcracker port on mw1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:26:05] PROBLEM - RAID on mw1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:26:09] cause cxserver used to have that since wikimania [13:26:22] akosiaris: service::node has defaults only for service-agnostic stuff [13:26:27] not sure what you mean [13:26:41] mobrovac: nothing to do with service::node [13:26:46] k [13:26:54] akosiaris: wondering how it is not working here :/ [13:26:54] PROBLEM - mathoid endpoints health on sca1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:26:55] cxserver would really the puppet provided config and merge it with defaults [13:27:34] somehow that is not working now. 
either it did not make it to this release, or something's wrong with the loading/merging of configuration [13:27:43] imho, its whole config should be in config.yaml, plain and simple [13:27:50] oh it was [13:27:54] it turned out to be a mess [13:28:13] too much friction between ops and LE to coordinate for releases and all that [13:28:28] mobrovac: we change it often and need Ops for simple change. [13:28:30] it's the restbase config in deployment repo thing [13:28:49] graphoid had it too IIRC [13:28:54] RECOVERY - mathoid endpoints health on sca1002 is OK: All endpoints are healthy [13:28:56] it also does that merge trick [13:29:26] or maybe that's maps [13:29:27] kart_: a simple trick might be to put a config stanza that tells cxcserver which file to load for the defaults, and then cxserver does it on start-up [13:29:33] some yuri software in any case [13:29:52] haha [13:30:09] akosiaris: probably the maps services [13:31:51] RECOVERY - LVS HTTP IPv4 on cxserver.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 898 bytes in 0.005 second response time [13:33:25] kart_: Nikerabbit, so what's the verdict ? how do you want to proceed with this ? [13:34:10] 6operations, 10Salt: salt minions need 'wake up' test.ping after idle period before they respond properly to commands - https://phabricator.wikimedia.org/T120831#1887421 (10ArielGlenn) >>! In T120831#1880676, @mark wrote: >>>! In T120831#1862403, @ArielGlenn wrote: >> I'll be applying a patch for that of 3 who... [13:34:12] akosiaris: Should I try with registry in puppet, we can fix it later. [13:34:15] PROBLEM - mathoid endpoints health on sca1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
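Note on the defaults discussion above: the exchange describes cxserver reading the puppet-provided config and merging it with defaults, so that keys absent from /etc/cxserver/config.yaml (such as "registry") would still be present. A minimal sketch of that idea follows. It is not cxserver's or service-runner's actual loader; the function name `merge_config` and both sample dictionaries are hypothetical.

```python
import copy


def merge_config(defaults, overrides):
    """Return a new dict: `defaults`, recursively overridden by `overrides`."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are mappings: merge them key by key.
            merged[key] = merge_config(merged[key], value)
        else:
            # Scalars, lists, or type mismatches: the override wins outright.
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical defaults shipped with the service (think: config.yaml in the repo).
DEFAULTS = {
    "port": 8080,
    "registry": {"source": ["en", "es"], "mt": {"Apertium": {"es": ["ca"]}}},
}

# Hypothetical site config (think: the file puppet writes under /etc/cxserver/).
SITE = {"port": 8081, "restbase_url": "http://restbase.example/@lang/@title"}

effective = merge_config(DEFAULTS, SITE)
print(effective["port"])        # site value overrides the default
print("registry" in effective)  # default survives although SITE omits it
```

Without such a merge step, the alternative chosen in the log applies: every key the service needs has to be spelled out in the puppet-managed file.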
[13:34:48] kart_: ok [13:34:54] submit a patch, I 'll merge [13:36:15] PROBLEM - puppet last run on mw1014 is CRITICAL: CRITICAL: Puppet has 60 failures [13:41:40] 6operations, 10Salt: salt minions need 'wake up' test.ping after idle period before they respond properly to commands - https://phabricator.wikimedia.org/T120831#1887429 (10ArielGlenn) >>! In T120831#1880681, @mark wrote: >>>! In T120831#1866444, @ArielGlenn wrote: >> The next category of problem looks like th... [13:42:34] akosiaris: should we use hieradata? [13:42:45] or inside config.yaml.erb? [13:42:56] RECOVERY - dhclient process on mw1004 is OK: PROCS OK: 0 processes with command name dhclient [13:43:45] PROBLEM - mathoid endpoints health on sca1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:43:56] 6operations, 10Salt: salt minions need 'wake up' test.ping after idle period before they respond properly to commands - https://phabricator.wikimedia.org/T120831#1887430 (10ArielGlenn) So the master and minion on neodymium have been behaving well for the past few days, and pings across all minions are reasonab... [13:44:00] kart_: hieradata [13:44:06] RECOVERY - mathoid endpoints health on sca1002 is OK: All endpoints are healthy [13:44:07] okay [13:44:24] mobrovac: what's up with mathoid today ? [13:45:34] akosiaris: need to investigate [13:47:31] I am going to try to set x1 servers into maintenance mode, with temporary lower redundancy [13:47:35] RECOVERY - mathoid endpoints health on sca1001 is OK: All endpoints are healthy [13:48:00] (03CR) 10Jcrespo: [C: 032] Depool db1031 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259663 (owner: 10Jcrespo) [13:48:21] (03PS1) 10KartikMistry: CX: Use registry from hieradata [puppet] - 10https://gerrit.wikimedia.org/r/259680 [13:48:35] akosiaris, mobrovac: thanks for the debugging help [13:48:59] akosiaris: registry patch up. 
[13:50:24] (03CR) 10Alexandros Kosiaris: [C: 04-1] CX: Use registry from hieradata (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/259680 (owner: 10KartikMistry) [13:50:39] kart_: only supply the data in hieradata, no need to touch the puppet code [13:50:57] akosiaris: OK [13:51:00] (03CR) 10Mobrovac: [C: 04-1] CX: Use registry from hieradata (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/259680 (owner: 10KartikMistry) [13:52:14] PROBLEM - mathoid endpoints health on sca1002 is CRITICAL: / is CRITICAL: Could not fetch url http://10.64.48.29:10042/: Timeout on connection while downloading http://10.64.48.29:10042/ [13:55:05] PROBLEM - dhclient process on mw1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:55:52] things seem ok, not a single error on the logs [13:56:33] load is ok, in some ways, better [13:56:36] (03PS2) 10KartikMistry: CX: Use registry from hieradata [puppet] - 10https://gerrit.wikimedia.org/r/259680 [13:56:52] (03PS1) 10Andrew Bogott: Role::db:: was renamed to role::redisdb. [puppet] - 10https://gerrit.wikimedia.org/r/259682 [13:57:35] (03PS2) 10Andrew Bogott: Role::db:: was renamed to role::redisdb. [puppet] - 10https://gerrit.wikimedia.org/r/259682 [13:57:56] mobrovac: thanks. PS3 coming. [13:58:00] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] CX: Use registry from hieradata [puppet] - 10https://gerrit.wikimedia.org/r/259680 (owner: 10KartikMistry) [13:58:11] akosiaris: oh. [13:58:13] :) [13:58:19] do not know if I am getting older, or more paranoid with time, probably both [13:59:14] kart_: yeah I noticed you did not address mobrovac's comment but let's get cxserver running and then do that in another commit [13:59:44] (03PS3) 10Andrew Bogott: Role::db:: was renamed to role::redisdb. 
[puppet] - 10https://gerrit.wikimedia.org/r/259682 [13:59:46] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool db1031 (x1-slave) for maintenance (duration: 07m 30s) [13:59:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:00:06] akosiaris: okay. [14:00:08] argh [14:00:14] 13:59:46 1 apaches had sync errors [14:00:18] akosiaris: did that change deployed? [14:00:19] Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Error from DataBinding 'hiera' while looking up 'cxserver::restbase_url': syntax error on line 645, col 7: ` no:' on node sca1002.eqiad.wmnet [14:00:49] no needs to be quoted ? [14:00:54] fixing [14:01:15] mw1004.eqiad.wmnet had problems syncronizing, checking [14:01:55] akosiaris: yup [14:02:02] (03CR) 10Andrew Bogott: [C: 032] Role::db:: was renamed to role::redisdb. [puppet] - 10https://gerrit.wikimedia.org/r/259682 (owner: 10Andrew Bogott) [14:02:12] some versions of the ruby yaml parser treat it as false [14:02:30] that is for labs? [14:02:45] oh. 'no' [14:03:22] ParserError: while parsing a block mapping [14:03:22] in "hieradata/common/cxserver.yaml", line 567, column 7 [14:03:22] expected , but found '' [14:03:23] in "hieradata/common/cxserver.yaml", line 646, column 8 [14:03:23] grrrr [14:03:24] PROBLEM - puppet last run on sca1001 is CRITICAL: CRITICAL: puppet fail [14:04:35] RECOVERY - mathoid endpoints health on sca1002 is OK: All endpoints are healthy [14:06:11] !log nodetool stop -- CLEANUP on restbase1002 [14:06:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:08:14] RECOVERY - Disk space on restbase1002 is OK: DISK OK [14:08:14] akosiaris: fixing. [14:08:23] (03PS1) 10Andrew Bogott: Correct a mistake in Ibfd446e2d78ca61854312f71a9697b9221728496.
[puppet] - 10https://gerrit.wikimedia.org/r/259686 [14:08:37] (03PS1) 10KartikMistry: CX: Fix cxserver.yaml config [puppet] - 10https://gerrit.wikimedia.org/r/259687 [14:08:48] (03PS2) 10Andrew Bogott: Correct a mistake in Ibfd446e2d78ca61854312f71a9697b9221728496. [puppet] - 10https://gerrit.wikimedia.org/r/259686 [14:09:04] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [14:09:46] PROBLEM - puppet last run on rdb2001 is CRITICAL: CRITICAL: puppet fail [14:09:55] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [14:10:51] (03PS2) 10KartikMistry: CX: Fix cxserver.yaml config [puppet] - 10https://gerrit.wikimedia.org/r/259687 [14:10:55] akosiaris: fixing more things. [14:11:05] akosiaris: PS2 is Okay. [14:11:40] !log soft-rebooting mw1004, responsive to ping, but not to salt, ssh [14:11:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:12:16] PROBLEM - mathoid endpoints health on sca1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:12:54] <_joe_> mobrovac: did you look into what's happening to mathoid? [14:13:12] 6operations, 10RESTBase, 10RESTBase-Cassandra: Perform cleanups to reclaim space from recent topology changes - https://phabricator.wikimedia.org/T121535#1887497 (10fgiunchedi) I had cancelled cleanup on 1002 ``` restbase1002:~$ nodetool compactionstats -H pending tasks: 2 compaction type... [14:13:16] <_joe_> I sense it could be overloaded, but had no time to look into it actually [14:14:05] PROBLEM - nutcracker process on mw1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:14:14] PROBLEM - dhclient process on mw1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:14:24] PROBLEM - DPKG on mw1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
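Note on the ` no:' syntax error above: under the YAML 1.1 boolean type, an unquoted plain scalar such as `no` can resolve to a boolean instead of a string, which is a problem when it is meant as the language code for Norwegian. The sketch below flags such scalars so they can be quoted before they reach hieradata. The pattern follows the YAML 1.1 bool type definition; which spellings a given parser actually honours varies (the log only says "some versions of the ruby yaml parser"), and the helper names `needs_quoting` and `quote_key` are hypothetical.

```python
import re

# Plain scalars that the YAML 1.1 "bool" type resolves to true/false.
YAML11_BOOL = re.compile(
    r"y|Y|yes|Yes|YES|n|N|no|No|NO"
    r"|true|True|TRUE|false|False|FALSE"
    r"|on|On|ON|off|Off|OFF"
)


def needs_quoting(scalar):
    """True if an unquoted `scalar` may be read as a boolean by a YAML 1.1 parser."""
    return YAML11_BOOL.fullmatch(scalar) is not None


def quote_key(scalar):
    """Wrap boolean-looking scalars in single quotes; leave the rest untouched."""
    return f"'{scalar}'" if needs_quoting(scalar) else scalar


codes = ["en", "no", "nb", "es"]
print([quote_key(code) for code in codes])
```

Only `no` gets quoted in this list; `nb` and the other codes are not boolean spellings and pass through unchanged.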
[14:14:25] PROBLEM - salt-minion processes on mw1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:14:29] <_joe_> uhm something is exhausting memory on the jobrunners? [14:14:36] PROBLEM - SSH on mw1015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:14:36] RECOVERY - RAID on mw1004 is OK: OK: no RAID installed [14:14:36] RECOVERY - nutcracker process on mw1004 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [14:14:40] <_joe_> moritzm: is this you ^^? [14:14:45] PROBLEM - puppet last run on mw1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:14:49] not yet, _joe_, will do it soon [14:14:55] RECOVERY - nutcracker port on mw1004 is OK: TCP OK - 0.000 second response time on port 11212 [14:15:05] PROBLEM - Disk space on mw1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:15:15] PROBLEM - nutcracker port on mw1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:15:17] (03PS1) 10Alexandros Kosiaris: cxserver: Fix the cxserver registry structure [puppet] - 10https://gerrit.wikimedia.org/r/259689 [14:15:21] _joe_: no, mw1015 hasn't been restarted so far [14:15:26] RECOVERY - dhclient process on mw1004 is OK: PROCS OK: 0 processes with command name dhclient [14:15:30] (03PS2) 10Alexandros Kosiaris: cxserver: Fix the cxserver registry structure [puppet] - 10https://gerrit.wikimedia.org/r/259689 [14:15:35] PROBLEM - puppet last run on sca1002 is CRITICAL: CRITICAL: puppet fail [14:15:35] <_joe_> heh, ok, we need to restart hhvm on those machines [14:15:36] and IIRC mw1004 was flapping earlier already [14:15:37] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] cxserver: Fix the cxserver registry structure [puppet] - 10https://gerrit.wikimedia.org/r/259689 (owner: 10Alexandros Kosiaris) [14:15:44] RECOVERY - configured eth on mw1004 is OK: OK - interfaces up [14:15:44] RECOVERY - DPKG on mw1004 is OK: All packages OK [14:15:45] RECOVERY - Disk space on mw1004 is OK: DISK OK 
[14:15:46] RECOVERY - salt-minion processes on mw1004 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [14:15:46] PROBLEM - RAID on mw1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:15:55] PROBLEM - configured eth on mw1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:16:05] RECOVERY - SSH on mw1004 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [14:16:39] so if I run sync-common from mw1004 it will get up to date, right? [14:17:57] yeah [14:18:23] <_joe_> moritzm: did you reboot any of these servers? [14:18:43] but I put mw1004 up and now 15 is down, either a rolling restart is in process or there is something going on [14:18:55] PROBLEM - mathoid endpoints health on sca1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:18:58] no, only restarted hhvm on about 30-40 hosts so far [14:19:10] <_joe_> jynus: I'm handling this now [14:19:25] RECOVERY - puppet last run on mw1004 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [14:19:29] _joe_, all yourse [14:19:41] Finished rsync common (duration: 00m 51s) [14:19:59] I am going back to x1 maintenance [14:20:15] RECOVERY - mathoid endpoints health on sca1001 is OK: All endpoints are healthy [14:20:44] PROBLEM - puppet last run on mw1007 is CRITICAL: CRITICAL: Puppet has 6 failures [14:21:00] akosiaris: cxserver need restart now? 
[14:21:38] kart_: one more fix to send upstream [14:22:02] eer, to gerrit [14:22:16] oh [14:22:24] RECOVERY - dhclient process on mw1015 is OK: PROCS OK: 0 processes with command name dhclient [14:22:45] RECOVERY - mathoid endpoints health on sca1002 is OK: All endpoints are healthy [14:23:48] <_joe_> !log restarting HHVM on the first jobrunners [14:23:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:24:26] (03PS1) 10Alexandros Kosiaris: cxserver: quote no strings in registry yaml [puppet] - 10https://gerrit.wikimedia.org/r/259691 [14:25:07] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] cxserver: quote no strings in registry yaml [puppet] - 10https://gerrit.wikimedia.org/r/259691 (owner: 10Alexandros Kosiaris) [14:26:22] (03Abandoned) 10Zfilipin: RuboCop: fixed Lint/UnusedMethodArgument offense [puppet] - 10https://gerrit.wikimedia.org/r/254838 (https://phabricator.wikimedia.org/T112651) (owner: 10Zfilipin) [14:26:24] (03Abandoned) 10Zfilipin: RuboCop: Fixed Style/BlockDelimiters offense [puppet] - 10https://gerrit.wikimedia.org/r/254855 (https://phabricator.wikimedia.org/T112651) (owner: 10Zfilipin) [14:26:25] PROBLEM - mathoid endpoints health on sca1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:27:08] kart_: done, is cxserver working fine now ? [14:27:22] got restarted an all on both boxes [14:27:45] RECOVERY - puppet last run on sca1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:27:45] RECOVERY - puppet last run on sca1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:28:23] Did I see something in scrollback about mathoid issues? [14:28:26] PROBLEM - dhclient process on mw1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:28:31] Someone just reported https://phabricator.wikimedia.org/T121763 [14:28:33] <_joe_> Reedy: yes [14:28:38] cool, cheers [14:28:40] will not as such [14:28:41] akosiaris: no. 
it seems page isn't loading. [14:29:02] (03Abandoned) 10Andrew Bogott: Correct a mistake in Ibfd446e2d78ca61854312f71a9697b9221728496. [puppet] - 10https://gerrit.wikimedia.org/r/259686 (owner: 10Andrew Bogott) [14:29:05] PROBLEM - mathoid endpoints health on sca1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:29:20] <_joe_> I'm looking at the jobrunners, can someone else investigate? mobrovac either you or someone else from services? [14:29:23] kart_: https://cxserver.wikimedia.org/v1/ ? [14:29:24] (03PS1) 10Andrew Bogott: Revert "Role::db:: was renamed to role::redisdb." [puppet] - 10https://gerrit.wikimedia.org/r/259694 [14:29:30] kart_: loads fine for me [14:29:34] (03PS2) 10Andrew Bogott: Revert "Role::db:: was renamed to role::redisdb." [puppet] - 10https://gerrit.wikimedia.org/r/259694 [14:29:43] any specific errors ? [14:30:06] (03PS2) 10Muehlenhoff: Stop opendj on the former labs LDAP servers [puppet] - 10https://gerrit.wikimedia.org/r/259226 [14:30:35] RECOVERY - mathoid endpoints health on sca1001 is OK: All endpoints are healthy [14:31:05] RECOVERY - mathoid endpoints health on sca1002 is OK: All endpoints are healthy [14:31:18] akosiaris: 1. https://cxserver.wikimedia.org/v1/languagepairs doesn't have Apertium/Yandex/dictionary yet :/ [14:31:20] (03CR) 10Andrew Bogott: [C: 031] Stop opendj on the former labs LDAP servers [puppet] - 10https://gerrit.wikimedia.org/r/259226 (owner: 10Muehlenhoff) [14:31:42] akosiaris: 2. and Page en:Louisiana State University could not be found. HTTPError: 403 [14:31:49] (03CR) 10Andrew Bogott: [C: 032] Revert "Role::db:: was renamed to role::redisdb." [puppet] - 10https://gerrit.wikimedia.org/r/259694 (owner: 10Andrew Bogott) [14:31:51] akosiaris: something with restbase_url? [14:32:04] morebots: can you check restbase_url value in cxserver? [14:32:04] I am a logbot running on tools-exec-1201. [14:32:04] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log. 
[14:32:04] To log a message, type !log . [14:32:10] oh [14:32:27] mobrovac: can you check restbase_url value in cxserver? [14:32:52] Reedy: _joe_: there's already a patch in ext/Math master to fix that, will coordinate with the swatters to get it on all groups [14:33:13] kart_: I see apertium and yandex fine in the registry in config.yaml [14:33:15] RECOVERY - puppet last run on mw1014 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [14:33:19] it's already oun on 1.27-wmf9 [14:33:30] kart_: looking [14:33:55] <_joe_> mobrovac: ok sorry :) [14:34:36] restbase_url: "http://restbase.svc.eqiad.wmnet:7231/@lang.wikipedia.org/v1/page/html/@title" is on the same level as registry [14:34:46] I don't see something wrong yet [14:36:06] akosiaris: more fixes. [14:36:22] kart_: the logs still show the same errors as posted earlier by _joe_ [14:36:24] (03PS1) 10KartikMistry: CX: Fix dictionary YAML config [puppet] - 10https://gerrit.wikimedia.org/r/259695 [14:36:25] RECOVERY - puppet last run on rdb2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:36:26] (03PS1) 10BBlack: text VCL: do not create hfp objects on 5xx [puppet] - 10https://gerrit.wikimedia.org/r/259696 [14:36:28] akosiaris: ^^ [14:36:53] (03PS1) 10BBlack: VCL: grace-mode only in frontend caches [puppet] - 10https://gerrit.wikimedia.org/r/259697 [14:37:28] (03Abandoned) 10BBlack: temporarily block diffusion browse [puppet] - 10https://gerrit.wikimedia.org/r/259545 (owner: 10BBlack) [14:38:23] mobrovac: ok. Probably my last fix will fix it. 
[14:38:25] (03PS2) 10Alexandros Kosiaris: CX: Fix dictionary YAML config [puppet] - 10https://gerrit.wikimedia.org/r/259695 (owner: 10KartikMistry) [14:38:25] RECOVERY - configured eth on mw1015 is OK: OK - interfaces up [14:38:35] RECOVERY - nutcracker process on mw1015 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [14:38:44] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] CX: Fix dictionary YAML config [puppet] - 10https://gerrit.wikimedia.org/r/259695 (owner: 10KartikMistry) [14:38:45] RECOVERY - dhclient process on mw1015 is OK: PROCS OK: 0 processes with command name dhclient [14:38:55] RECOVERY - DPKG on mw1015 is OK: All packages OK [14:38:55] RECOVERY - salt-minion processes on mw1015 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [14:39:14] RECOVERY - SSH on mw1015 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [14:40:05] PROBLEM - check_puppetrun on heka is CRITICAL: CRITICAL: Puppet has 1 failures [14:40:51] akosiaris: kart_: we have yet another problem ... the proxy [14:41:03] * mobrovac looking into it [14:41:24] (03CR) 10Rush: [C: 031] "as agreed with a final notice :)" [puppet] - 10https://gerrit.wikimedia.org/r/259226 (owner: 10Muehlenhoff) [14:41:25] PROBLEM - mathoid endpoints health on sca1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:41:30] <_joe_> moritzm: I just rebooted mw1015 [14:41:35] RECOVERY - Disk space on mw1015 is OK: DISK OK [14:41:38] <_joe_> FYI [14:41:45] RECOVERY - nutcracker port on mw1015 is OK: TCP OK - 0.000 second response time on port 11212 [14:41:54] kart_: merge and deployed [14:41:57] merged* [14:42:16] RECOVERY - RAID on mw1015 is OK: OK: no RAID installed [14:42:26] I don't see any change however on https://cxserver.wikimedia.org/v1/#!/Languages/get_v1_languagepairs [14:42:44] _joe_, I could have done that, too :-) [14:42:59] akosiaris: that's fine. [14:43:03] We have list. 
[14:43:05] oh [14:43:10] ok [14:43:15] ok then, I thought yandex and apertium should be there [14:43:24] RECOVERY - puppet last run on mw1015 is OK: OK: Puppet is currently enabled, last run 1 hour ago with 0 failures [14:43:35] kart_: so, what's left then ? [14:43:54] akosiaris: now CX can't load page in source pane (earlier too) :/ [14:44:02] <_joe_> jynus: I was trying to understand what got all jobrunners consume so much more memory all of a sudden [14:44:33] it is what mov* said? [14:44:51] I have my own problems with jobrunners too, on mysqls [14:45:05] RECOVERY - check_puppetrun on heka is OK: OK: Puppet is currently enabled, last run 166 seconds ago with 0 failures [14:45:14] RECOVERY - puppet last run on mw1007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:46:26] kart_: akosiaris: that's probably because of the proxy, lemme do a patch [14:48:22] mobrovac: I see. [14:48:43] mobrovac: we should have proxy block only for Yandex? [14:48:58] yup [14:49:02] * mobrovac doing it [14:49:24] blah. [14:49:25] RECOVERY - mathoid endpoints health on sca1002 is OK: All endpoints are healthy [14:50:38] (03PS1) 10Jcrespo: Upgrading and reconfiguring mysql on db1031 and x1 codfw [puppet] - 10https://gerrit.wikimedia.org/r/259698 [14:50:56] PROBLEM - mathoid endpoints health on sca1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:53:57] (03PS1) 10Zfilipin: RuboCop: fixed Style/CaseIndentation offense [puppet] - 10https://gerrit.wikimedia.org/r/259699 (https://phabricator.wikimedia.org/T112651) [14:55:49] mobrovac: done? 
:) [14:55:58] kart_: in 2 mins [14:57:34] PROBLEM - mathoid endpoints health on sca1002 is CRITICAL: / is CRITICAL: Could not fetch url http://10.64.48.29:10042/: Timeout on connection while downloading http://10.64.48.29:10042/ [14:57:55] (03PS1) 10Mobrovac: CXServer: Do not use the proxy for RESTBase and Apertium [puppet] - 10https://gerrit.wikimedia.org/r/259700 [14:57:59] (03PS1) 10DCausse: Enable completion suggester beta on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259701 (https://phabricator.wikimedia.org/T119989) [14:58:04] kart_: akosiaris: ^ [14:59:15] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] CXServer: Do not use the proxy for RESTBase and Apertium [puppet] - 10https://gerrit.wikimedia.org/r/259700 (owner: 10Mobrovac) [14:59:39] mobrovac: cool. Thanks akosiaris [15:00:14] kart_: akosiaris: uuups, wrong config stanza name [15:00:15] damn [15:00:22] wait, correcting it [15:00:57] (03PS1) 10Zfilipin: RuboCop: fixed Style/ColonMethodCall offence [puppet] - 10https://gerrit.wikimedia.org/r/259702 (https://phabricator.wikimedia.org/T112651) [15:01:03] cxserver.yaml. just spotted. [15:01:05] :/ [15:01:22] (03PS1) 10Mobrovac: CXServer: s/no_proxy/no_proxy_list/ in config [puppet] - 10https://gerrit.wikimedia.org/r/259703 [15:01:24] (03PS2) 10BBlack: text VCL: do not create hfp objects on 5xx [puppet] - 10https://gerrit.wikimedia.org/r/259696 [15:01:28] akosiaris: ^^ [15:01:42] (03CR) 10BBlack: [C: 032 V: 032] text VCL: do not create hfp objects on 5xx [puppet] - 10https://gerrit.wikimedia.org/r/259696 (owner: 10BBlack) [15:01:44] kart_: what did you spot? [15:01:51] in cxserver.yaml [15:01:51] ? 
[15:01:57] (03PS1) 10Giuseppe Lavagetto: pybal: introduce role for testing machines [puppet] - 10https://gerrit.wikimedia.org/r/259704 [15:01:59] (03PS2) 10Alexandros Kosiaris: CXServer: s/no_proxy/no_proxy_list/ in config [puppet] - 10https://gerrit.wikimedia.org/r/259703 (owner: 10Mobrovac) [15:02:04] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] CXServer: s/no_proxy/no_proxy_list/ in config [puppet] - 10https://gerrit.wikimedia.org/r/259703 (owner: 10Mobrovac) [15:02:59] some nice reversed races in puppet-merge heh [15:03:10] (03CR) 10DCausse: [C: 04-1] "We should maybe wait for 1.27.0-wmf.9 before deploying this config change." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259701 (https://phabricator.wikimedia.org/T119989) (owner: 10DCausse) [15:03:15] I only said "yes" to mine, but it picked up alex's changes on strontium :P [15:03:28] (03PS1) 10Faidon Liambotis: labs: widen access.conf exception to everything LOCAL [puppet] - 10https://gerrit.wikimedia.org/r/259705 (https://phabricator.wikimedia.org/T121765) [15:03:53] !log installing git security updates [15:04:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:04:15] (03PS1) 10Zfilipin: RuboCop: fixed Style/CommandLiteral offense [puppet] - 10https://gerrit.wikimedia.org/r/259706 (https://phabricator.wikimedia.org/T112651) [15:04:24] mobrovac: shouldn't we use /labs/deployment-prep/common.yaml only for labs? [15:04:26] mobrovac: kart_ merged and deployed [15:04:30] akosiaris: ^ [15:04:38] kart_: we do [15:04:42] and common/cxserver.yaml for Production. [15:04:45] i don't see the problem [15:04:46] yes [15:04:51] okay. 
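Note on the proxy patches above: the intent stated in the log is that requests to internal services (RESTBase, Apertium) bypass the HTTP proxy, while Yandex stays proxied because it sits outside the production network. A minimal sketch of that decision follows, assuming `no_proxy_list` is a list of hostnames matched exactly; cxserver's real matching rules are not shown in the log. The proxy URL, the Apertium hostname, and the external MT hostname are hypothetical placeholders.

```python
from urllib.parse import urlsplit


def proxy_for(url, proxy, no_proxy_list):
    """Return the proxy to use for `url`, or None if its host is exempt."""
    host = urlsplit(url).hostname
    return None if host in no_proxy_list else proxy


PROXY = "http://proxy.example:8080"  # hypothetical proxy address
NO_PROXY_LIST = [
    "restbase.svc.eqiad.wmnet",  # hostname taken from restbase_url in the log
    "apertium.example",          # hypothetical internal Apertium host
]

internal = "http://restbase.svc.eqiad.wmnet:7231/en.wikipedia.org/v1/page/html/Ukai_Dam"
external = "https://mt.external.example/translate"  # hypothetical external MT API

print(proxy_for(internal, PROXY, NO_PROXY_LIST))  # exempt host, no proxy
print(proxy_for(external, PROXY, NO_PROXY_LIST))  # everything else is proxied
```

Note that a misnamed config key (as with `no_proxy` versus `no_proxy_list` in the log) leaves the exemption list empty, so every request, including internal ones, would go through the proxy.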
[15:05:02] kart_: test now please cxserver in prod [15:05:15] RECOVERY - mathoid endpoints health on sca1001 is OK: All endpoints are healthy [15:06:08] mobrovac: same error :/ [15:06:23] (03CR) 10Andrew Bogott: [C: 031] "If this doesn't scare Faidon then it doesn't scare me :)" [puppet] - 10https://gerrit.wikimedia.org/r/259705 (https://phabricator.wikimedia.org/T121765) (owner: 10Faidon Liambotis) [15:07:08] mobrovac: should we use block for Yandex? [15:07:22] (03PS2) 10Faidon Liambotis: labs: widen access.conf exception to everything LOCAL [puppet] - 10https://gerrit.wikimedia.org/r/259705 (https://phabricator.wikimedia.org/T121765) [15:07:29] (03CR) 10Faidon Liambotis: [C: 032 V: 032] labs: widen access.conf exception to everything LOCAL [puppet] - 10https://gerrit.wikimedia.org/r/259705 (https://phabricator.wikimedia.org/T121765) (owner: 10Faidon Liambotis) [15:07:31] (03PS1) 10Zfilipin: RuboCop: Fixed Style/DefWithParentheses offence [puppet] - 10https://gerrit.wikimedia.org/r/259708 (https://phabricator.wikimedia.org/T112651) [15:07:39] hey [15:07:50] kart_: no, yandex must be proxied because it's outside of the prod network [15:07:50] Nikerabbit: we're still stuck. [15:07:56] (03PS2) 10BBlack: VCL: grace-mode only in frontend caches [puppet] - 10https://gerrit.wikimedia.org/r/259697 [15:08:00] services: - conf: [15:08:00] yes. Only for Yandex. [15:08:11] (03CR) 10BBlack: [C: 032 V: 032] VCL: grace-mode only in frontend caches [puppet] - 10https://gerrit.wikimedia.org/r/259697 (owner: 10BBlack) [15:08:21] ah nevermind [15:08:22] 7Blocked-on-Operations, 6operations, 6Discovery, 3Discovery-Cirrus-Sprint: Make elasticsearch cluster accessible from analytics hadoop workers - https://phabricator.wikimedia.org/T120281#1887619 (10Ottomata) Hmm, interesting. Using Kafka for batch/aggregate data is a little funky, but I think ok. This wo... 
[15:09:53] (03PS1) 10Rush: phabricator: start using x-client-ip [puppet] - 10https://gerrit.wikimedia.org/r/259709 (https://phabricator.wikimedia.org/T114014) [15:09:55] RECOVERY - mathoid endpoints health on sca1002 is OK: All endpoints are healthy [15:10:03] (03PS1) 10Zfilipin: RuboCop: fixed Style/EmptyLiteral offense [puppet] - 10https://gerrit.wikimedia.org/r/259710 (https://phabricator.wikimedia.org/T112651) [15:10:09] kart_: which cxserver route uses restbase? [15:10:59] (03PS2) 10Rush: phabricator: start using x-client-ip [puppet] - 10https://gerrit.wikimedia.org/r/259709 (https://phabricator.wikimedia.org/T114014) [15:13:23] mobrovac: for eg https://cxserver.wikimedia.org/v1/page/en/Ukai%20Dam is 404 [15:13:24] PROBLEM - puppet last run on cp1069 is CRITICAL: CRITICAL: Puppet has 1 failures [15:13:35] PROBLEM - puppet last run on cp1063 is CRITICAL: CRITICAL: Puppet has 1 failures [15:13:56] (03CR) 10Chad: [C: 031] phabricator: start using x-client-ip [puppet] - 10https://gerrit.wikimedia.org/r/259709 (https://phabricator.wikimedia.org/T114014) (owner: 10Rush) [15:14:25] (03PS1) 10Zfilipin: RuboCop: fixed Style/DotPosition offense [puppet] - 10https://gerrit.wikimedia.org/r/259712 (https://phabricator.wikimedia.org/T112651) [15:14:27] kk kart_ [15:14:28] thnx [15:14:38] (03CR) 10MarkTraceur: [C: 031] "I'm in favor, see no issues" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259665 (https://phabricator.wikimedia.org/T120867) (owner: 10Bartosz Dziewoński) [15:14:45] PROBLEM - puppet last run on cp1051 is CRITICAL: CRITICAL: Puppet has 1 failures [15:14:45] PROBLEM - puppet last run on cp4007 is CRITICAL: CRITICAL: Puppet has 1 failures [15:14:46] PROBLEM - puppet last run on cp1074 is CRITICAL: CRITICAL: Puppet has 1 failures [15:14:48] MT processing error: HTTPError: 403 [15:14:54] not very informative :/ [15:15:16] PROBLEM - puppet last run on cp3035 is CRITICAL: CRITICAL: Puppet has 1 failures [15:15:25] PROBLEM - puppet last run on cp2007 is
CRITICAL: CRITICAL: Puppet has 1 failures [15:15:25] blarg [15:15:44] PROBLEM - puppet last run on cp1065 is CRITICAL: CRITICAL: Puppet has 1 failures [15:15:55] PROBLEM - puppet last run on cp1052 is CRITICAL: CRITICAL: Puppet has 1 failures [15:15:56] PROBLEM - puppet last run on cp2001 is CRITICAL: CRITICAL: Puppet has 1 failures [15:16:12] so proxy not set? [15:16:25] PROBLEM - puppet last run on cp3010 is CRITICAL: CRITICAL: Puppet has 1 failures [15:16:25] PROBLEM - puppet last run on cp1055 is CRITICAL: CRITICAL: Puppet has 1 failures [15:16:36] (03PS1) 10BBlack: post-merge syntax bugfix for be768ad7c6 [puppet] - 10https://gerrit.wikimedia.org/r/259715 [15:16:39] stupid semicolons [15:16:45] PROBLEM - puppet last run on cp1049 is CRITICAL: CRITICAL: Puppet has 1 failures [15:16:45] PROBLEM - puppet last run on cp4009 is CRITICAL: CRITICAL: Puppet has 1 failures [15:16:46] PROBLEM - puppet last run on cp1067 is CRITICAL: CRITICAL: Puppet has 1 failures [15:16:53] kart_: Nikerabbit: no log stack trace is produced when trying locally on sca [15:16:54] hm [15:16:54] PROBLEM - puppet last run on cp3012 is CRITICAL: CRITICAL: Puppet has 1 failures [15:16:55] PROBLEM - puppet last run on cp1054 is CRITICAL: CRITICAL: Puppet has 1 failures [15:16:57] 100 errors on an enwiki API server, not a problem by itself, but it could repeat [15:17:05] PROBLEM - puppet last run on cp3013 is CRITICAL: CRITICAL: Puppet has 1 failures [15:17:05] PROBLEM - puppet last run on cp3005 is CRITICAL: CRITICAL: Puppet has 1 failures [15:17:14] (03PS2) 10BBlack: post-merge syntax bugfix for be768ad7c6 [puppet] - 10https://gerrit.wikimedia.org/r/259715 [15:17:15] PROBLEM - puppet last run on cp2004 is CRITICAL: CRITICAL: Puppet has 1 failures [15:17:15] PROBLEM - puppet last run on cp3030 is CRITICAL: CRITICAL: Puppet has 1 failures [15:17:15] PROBLEM - puppet last run on cp3034 is CRITICAL: CRITICAL: Puppet has 1 failures [15:17:19] (03CR) 10BBlack: [C: 032 V: 032] post-merge syntax bugfix 
for be768ad7c6 [puppet] - 10https://gerrit.wikimedia.org/r/259715 (owner: 10BBlack) [15:17:24] Nikerabbit: no, the reverse actually - it seems it's still using the proxy for RB when it shouldn't [15:17:25] PROBLEM - puppet last run on cp2018 is CRITICAL: CRITICAL: Puppet has 1 failures [15:17:25] PROBLEM - puppet last run on cp2021 is CRITICAL: CRITICAL: Puppet has 1 failures [15:17:27] (03CR) 10Rush: [C: 032] phabricator: start using x-client-ip [puppet] - 10https://gerrit.wikimedia.org/r/259709 (https://phabricator.wikimedia.org/T114014) (owner: 10Rush) [15:17:35] PROBLEM - puppet last run on cp3008 is CRITICAL: CRITICAL: Puppet has 1 failures [15:17:35] PROBLEM - puppet last run on cp3041 is CRITICAL: CRITICAL: Puppet has 1 failures [15:17:40] mobrovac: apparently logging for /page/ is failing and doesn't print anything [15:17:45] PROBLEM - puppet last run on cp1072 is CRITICAL: CRITICAL: Puppet has 1 failures [15:17:45] (03PS3) 10Rush: phabricator: start using x-client-ip [puppet] - 10https://gerrit.wikimedia.org/r/259709 (https://phabricator.wikimedia.org/T114014) [15:17:55] PROBLEM - puppet last run on cp2013 is CRITICAL: CRITICAL: Puppet has 1 failures [15:17:55] PROBLEM - puppet last run on cp4016 is CRITICAL: CRITICAL: Puppet has 1 failures [15:17:55] (03CR) 10Rush: [V: 032] phabricator: start using x-client-ip [puppet] - 10https://gerrit.wikimedia.org/r/259709 (https://phabricator.wikimedia.org/T114014) (owner: 10Rush) [15:17:58] for mt it at least prints an error [15:18:05] PROBLEM - puppet last run on cp2025 is CRITICAL: CRITICAL: Puppet has 1 failures [15:18:05] PROBLEM - puppet last run on cp3006 is CRITICAL: CRITICAL: Puppet has 1 failures [15:18:05] PROBLEM - puppet last run on cp3003 is CRITICAL: CRITICAL: Puppet has 1 failures [15:18:16] PROBLEM - puppet last run on cp3004 is CRITICAL: CRITICAL: Puppet has 1 failures [15:18:16] PROBLEM - puppet last run on cp2023 is CRITICAL: CRITICAL: Puppet has 1 failures [15:18:25] PROBLEM - puppet last run on 
cp4008 is CRITICAL: CRITICAL: Puppet has 1 failures [15:18:34] s/100/1000/, at 14:55 [15:18:35] PROBLEM - puppet last run on cp4017 is CRITICAL: CRITICAL: Puppet has 1 failures [15:18:37] mobrovac: I would use proxy block as in old config. [15:18:44] mobrovac: if that helps. [15:18:46] thank you icinga-wm for letting me know 100 times in a row that I made a syntax error :P [15:18:49] (03PS1) 10Zfilipin: RuboCop: fixed Style/IfUnlessModifier offense [puppet] - 10https://gerrit.wikimedia.org/r/259716 (https://phabricator.wikimedia.org/T112651) [15:18:54] kart_: proxy block? [15:18:55] PROBLEM - puppet last run on cp4010 is CRITICAL: CRITICAL: Puppet has 1 failures [15:18:55] PROBLEM - puppet last run on cp1068 is CRITICAL: CRITICAL: Puppet has 1 failures [15:18:56] PROBLEM - puppet last run on cp3007 is CRITICAL: CRITICAL: Puppet has 1 failures [15:18:56] PROBLEM - puppet last run on cp3040 is CRITICAL: CRITICAL: Puppet has 1 failures [15:18:56] PROBLEM - puppet last run on cp1053 is CRITICAL: CRITICAL: Puppet has 1 failures [15:19:05] PROBLEM - puppet last run on cp2016 is CRITICAL: CRITICAL: Puppet has 1 failures [15:19:06] PROBLEM - puppet last run on cp2019 is CRITICAL: CRITICAL: Puppet has 1 failures [15:19:06] PROBLEM - puppet last run on cp2026 is CRITICAL: CRITICAL: Puppet has 1 failures [15:19:15] bblack: technology to the rescue :) [15:19:16] PROBLEM - puppet last run on cp4018 is CRITICAL: CRITICAL: Puppet has 1 failures [15:19:16] PROBLEM - puppet last run on cp3031 is CRITICAL: CRITICAL: Puppet has 1 failures [15:19:31] 110 times :-P [15:19:44] PROBLEM - puppet last run on cp3018 is CRITICAL: CRITICAL: Puppet has 1 failures [15:19:44] PROBLEM - puppet last run on cp3009 is CRITICAL: CRITICAL: Puppet has 1 failures [15:19:45] PROBLEM - puppet last run on cp1066 is CRITICAL: CRITICAL: Puppet has 1 failures [15:19:54] mobrovac: <%- if @proxy -%> proxy: '<%= @proxy %>', <%- end -%> [15:19:55] PROBLEM - puppet last run on cp2011 is CRITICAL: CRITICAL: Puppet 
has 1 failures [15:20:06] PROBLEM - puppet last run on cp3014 is CRITICAL: CRITICAL: Puppet has 1 failures [15:20:06] PROBLEM - puppet last run on cp2010 is CRITICAL: CRITICAL: Puppet has 1 failures [15:20:06] RECOVERY - puppet last run on cp2001 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [15:20:06] RECOVERY - puppet last run on cp3006 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [15:20:14] PROBLEM - mathoid endpoints health on sca1002 is CRITICAL: /{format}/ is CRITICAL: Could not fetch url http://10.64.48.29:10042/complete/: Timeout on connection while downloading http://10.64.48.29:10042/complete/ [15:20:24] PROBLEM - puppet last run on cp3021 is CRITICAL: CRITICAL: Puppet has 1 failures [15:20:34] PROBLEM - puppet last run on cp4002 is CRITICAL: CRITICAL: Puppet has 1 failures [15:20:34] RECOVERY - puppet last run on cp3010 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [15:20:34] RECOVERY - puppet last run on cp1055 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [15:20:35] RECOVERY - puppet last run on cp4017 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:20:39] kart_: that would have no effect at all because the proxy is set and is needed to get to yandex [15:20:44] PROBLEM - puppet last run on cp2003 is CRITICAL: CRITICAL: Puppet has 1 failures [15:20:55] RECOVERY - puppet last run on cp1067 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [15:20:55] RECOVERY - puppet last run on cp4009 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [15:20:55] PROBLEM - puppet last run on cp3047 is CRITICAL: CRITICAL: Puppet has 1 failures [15:20:55] RECOVERY - puppet last run on cp3012 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [15:20:56] RECOVERY - puppet last run on cp1054 is OK: OK: Puppet is currently 
enabled, last run 55 seconds ago with 0 failures [15:20:58] ak [15:21:05] RECOVERY - puppet last run on cp2019 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [15:21:15] RECOVERY - puppet last run on cp3005 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [15:21:15] RECOVERY - puppet last run on cp3013 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [15:21:24] RECOVERY - puppet last run on cp2004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:21:24] PROBLEM - puppet last run on cp2014 is CRITICAL: CRITICAL: Puppet has 1 failures [15:21:25] RECOVERY - puppet last run on cp3030 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:21:25] RECOVERY - puppet last run on cp3031 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [15:21:27] I haven't been following the main conversation here, are we saying our cxserver sends outbound requests to yandex? 
[15:21:35] RECOVERY - puppet last run on cp2007 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [15:21:45] PROBLEM - puppet last run on cp3042 is CRITICAL: CRITICAL: Puppet has 1 failures [15:21:45] RECOVERY - puppet last run on cp3041 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:21:45] RECOVERY - puppet last run on cp3008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:21:45] RECOVERY - puppet last run on cp3009 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:21:45] RECOVERY - puppet last run on cp1066 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [15:21:54] RECOVERY - puppet last run on cp1065 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:22:05] RECOVERY - puppet last run on cp2013 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [15:22:05] RECOVERY - puppet last run on cp1052 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:22:05] RECOVERY - puppet last run on cp4016 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:22:07] PROBLEM - puppet last run on cp4014 is CRITICAL: CRITICAL: Puppet has 1 failures [15:22:07] RECOVERY - puppet last run on cp3003 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [15:22:14] PROBLEM - puppet last run on cp1099 is CRITICAL: CRITICAL: Puppet has 1 failures [15:22:24] PROBLEM - puppet last run on cp4019 is CRITICAL: CRITICAL: Puppet has 1 failures [15:22:25] RECOVERY - puppet last run on cp3004 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [15:22:25] RECOVERY - puppet last run on cp2023 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [15:22:35] RECOVERY - puppet last run on cp4008 is OK: OK: Puppet is currently enabled, last run 41 seconds 
ago with 0 failures [15:22:35] PROBLEM - puppet last run on cp4015 is CRITICAL: CRITICAL: Puppet has 1 failures [15:22:54] RECOVERY - puppet last run on cp1049 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [15:22:54] PROBLEM - puppet last run on cp1058 is CRITICAL: CRITICAL: Puppet has 1 failures [15:22:55] RECOVERY - puppet last run on cp1051 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:22:55] RECOVERY - puppet last run on cp4010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:22:55] RECOVERY - puppet last run on cp1068 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:22:56] RECOVERY - puppet last run on cp3040 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:22:56] RECOVERY - puppet last run on cp3007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:22:58] RECOVERY - puppet last run on cp1053 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:23:06] RECOVERY - puppet last run on cp2016 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [15:23:24] RECOVERY - puppet last run on cp2014 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:23:24] RECOVERY - puppet last run on cp4018 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:23:34] RECOVERY - puppet last run on cp3035 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:23:35] RECOVERY - puppet last run on cp2021 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [15:23:35] RECOVERY - puppet last run on cp1069 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [15:23:45] RECOVERY - puppet last run on cp3042 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [15:23:47] (03PS2) 
10Giuseppe Lavagetto: pybal: introduce role for testing machines [puppet] - 10https://gerrit.wikimedia.org/r/259704 [15:23:54] RECOVERY - puppet last run on cp1063 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [15:23:54] RECOVERY - puppet last run on cp1072 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [15:24:02] !log reinstall, reboot and reconfigure mysql at db1031 [15:24:06] (03PS1) 10Zfilipin: RuboCop: fixed Style/LeadingCommentSpace offense [puppet] - 10https://gerrit.wikimedia.org/r/259717 (https://phabricator.wikimedia.org/T112651) [15:24:07] RECOVERY - puppet last run on cp2010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:24:07] RECOVERY - puppet last run on cp4014 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [15:24:07] RECOVERY - puppet last run on cp3014 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:24:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:24:25] RECOVERY - puppet last run on cp3021 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:24:31] (03PS2) 10Jcrespo: Upgrading and reconfiguring mysql on db1031 and x1 codfw [puppet] - 10https://gerrit.wikimedia.org/r/259698 [15:24:51] akosiaris: mobrovac anything more we can do. 
I'm out of more thoughts :/ [15:24:55] RECOVERY - puppet last run on cp1058 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:25:15] RECOVERY - puppet last run on cp2026 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [15:25:28] bblack: since a few weeks ago [15:25:35] RECOVERY - puppet last run on cp2018 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [15:25:45] RECOVERY - puppet last run on cp3018 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:25:47] bblack: there were mails, wikitech-l emails etc [15:25:51] fascinating [15:25:56] yeah I miss a lot of things :) [15:26:06] RECOVERY - puppet last run on cp2025 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [15:26:18] kart_: i'll do some testing a bit later, need to do something else now, back on cxserver in a bit [15:26:29] bblack: https://www.mediawiki.org/wiki/Content_translation/Machine_Translation/Yandex [15:26:35] RECOVERY - puppet last run on cp4015 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:26:56] RECOVERY - puppet last run on cp1074 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [15:26:56] RECOVERY - puppet last run on cp3047 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [15:27:25] RECOVERY - puppet last run on cp3034 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:27:31] (03PS1) 10Zfilipin: RuboCop: fixed Style/MethodCallParentheses offense [puppet] - 10https://gerrit.wikimedia.org/r/259718 (https://phabricator.wikimedia.org/T112651) [15:28:05] RECOVERY - puppet last run on cp2011 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:28:15] RECOVERY - puppet last run on cp1099 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [15:28:26] RECOVERY - 
puppet last run on cp4019 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:28:35] RECOVERY - puppet last run on cp4002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:28:46] RECOVERY - puppet last run on cp2003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:28:53] (03CR) 10Jcrespo: [C: 032] Upgrading and reconfiguring mysql on db1031 and x1 codfw [puppet] - 10https://gerrit.wikimedia.org/r/259698 (owner: 10Jcrespo) [15:29:04] RECOVERY - puppet last run on cp4007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:29:28] (03PS1) 10Zfilipin: RuboCop: fixed Style/MultilineIfThen offense [puppet] - 10https://gerrit.wikimedia.org/r/259719 (https://phabricator.wikimedia.org/T112651) [15:29:37] akosiaris: We are sure that config.yaml.erb is using default values from config.yaml? [15:29:45] (other than what is defined) [15:29:53] or does it matter? [15:30:26] kart_: I don't follow [15:31:07] akosiaris: we load values which are not present in config.yaml.erb from config.yaml for service-runner. [15:31:17] so, anything missing in it? [15:31:18] like ? [15:31:34] kart_: proxy, logstash host etc are loaded from service::node [15:32:59] mobrovac: ok. that one. [15:33:40] (03PS1) 10Rush: phabricator: log format to account for x-client-ip [puppet] - 10https://gerrit.wikimedia.org/r/259720 (https://phabricator.wikimedia.org/T114014) [15:34:15] (03PS1) 10Jcrespo: Fix typo on I8f72fda4983 [puppet] - 10https://gerrit.wikimedia.org/r/259721 [15:34:22] (03PS1) 10Zfilipin: RuboCop: fixed Style/NegatedIf offense [puppet] - 10https://gerrit.wikimedia.org/r/259722 (https://phabricator.wikimedia.org/T112651) [15:34:34] RECOVERY - mathoid endpoints health on sca1002 is OK: All endpoints are healthy [15:34:42] akosiaris: last option. how difficult is to revert? 
[15:34:58] kart_: https://github.com/wikimedia/operations-puppet/blob/production/modules/service/templates/node/config.yaml.erb [15:35:10] these are the defaults loaded by service::node [15:35:18] kart_: not impossible for sure, it is possible to revert though [15:35:26] lol [15:35:28] but it is gonna take some time [15:35:35] lol [15:35:38] yeah tired... [15:35:41] (03CR) 10Jcrespo: "Typo corrected on gerrit:259698" [puppet] - 10https://gerrit.wikimedia.org/r/259698 (owner: 10Jcrespo) [15:35:46] (03CR) 10Rush: [C: 032] phabricator: log format to account for x-client-ip [puppet] - 10https://gerrit.wikimedia.org/r/259720 (https://phabricator.wikimedia.org/T114014) (owner: 10Rush) [15:35:50] (03PS68) 10Ottomata: [WIP] Puppetize eventlogging-service with systemd in role::eventbus [puppet] - 10https://gerrit.wikimedia.org/r/253465 (https://phabricator.wikimedia.org/T118780) [15:35:51] so... not impossible, but difficult [15:36:01] so, it can [15:36:10] so, it can't be we can not figure out what's wrong [15:36:14] (03PS2) 10Jcrespo: Fix typo on I8f72fda4983 [puppet] - 10https://gerrit.wikimedia.org/r/259721 [15:36:19] can't we increase logging or something [15:38:30] (03CR) 10jenkins-bot: [V: 04-1] RuboCop: fixed Style/Not offense [puppet] - 10https://gerrit.wikimedia.org/r/259724 (https://phabricator.wikimedia.org/T112651) (owner: 10Zfilipin) [15:38:38] (03CR) 10Jcrespo: [C: 032] Fix typo on I8f72fda4983 [puppet] - 10https://gerrit.wikimedia.org/r/259721 (owner: 10Jcrespo) [15:39:06] (03PS1) 10Zfilipin: RuboCop: fixed Style/NumericLiterals offense [puppet] - 10https://gerrit.wikimedia.org/r/259725 (https://phabricator.wikimedia.org/T112651) [15:40:37] (03PS1) 10Zfilipin: RuboCop: fixed Style/ParallelAssignment offense [puppet] - 10https://gerrit.wikimedia.org/r/259726 (https://phabricator.wikimedia.org/T112651) [15:40:45] PROBLEM - mathoid endpoints health on sca1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[15:41:50] (03PS69) 10Ottomata: [WIP] Puppetize eventlogging-service with systemd in role::eventbus [puppet] - 10https://gerrit.wikimedia.org/r/253465 (https://phabricator.wikimedia.org/T118780) [15:43:10] (03CR) 10jenkins-bot: [V: 04-1] [WIP] Puppetize eventlogging-service with systemd in role::eventbus [puppet] - 10https://gerrit.wikimedia.org/r/253465 (https://phabricator.wikimedia.org/T118780) (owner: 10Ottomata) [15:44:35] (03PS2) 10Alexandros Kosiaris: Grafana increase homepage dash list limits [puppet] - 10https://gerrit.wikimedia.org/r/259666 (owner: 10Addshore) [15:44:41] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Grafana increase homepage dash list limits [puppet] - 10https://gerrit.wikimedia.org/r/259666 (owner: 10Addshore) [15:45:13] kart_: so, which request fails ? [15:45:44] and can we replay it via curl ? or is that jwt thing blocking us ? [15:46:06] PROBLEM - mathoid endpoints health on sca1001 is CRITICAL: /{format}/ is CRITICAL: Could not fetch url http://10.64.32.153:10042/complete/: Timeout on connection while downloading http://10.64.32.153:10042/complete/ [15:47:16] akosiaris: curl -X GET --header "Accept: application/json" "https://cxserver.wikimedia.org/v1/page/en/Hello" [15:47:33] 403. [15:48:04] RECOVERY - mathoid endpoints health on sca1001 is OK: All endpoints are healthy [15:49:30] akosiaris: basically, something to do with restbase_url [15:50:27] in labs, it seems fine. 
[15:50:35] (03PS70) 10Ottomata: [WIP] Puppetize eventlogging-service with systemd in role::eventbus [puppet] - 10https://gerrit.wikimedia.org/r/253465 (https://phabricator.wikimedia.org/T118780) [15:50:36] RECOVERY - mathoid endpoints health on sca1002 is OK: All endpoints are healthy [15:51:32] (03PS2) 10Zfilipin: RuboCop: fixed Style/Not offense [puppet] - 10https://gerrit.wikimedia.org/r/259724 (https://phabricator.wikimedia.org/T112651) [15:51:59] (03PS71) 10Ottomata: [WIP] Puppetize eventlogging-service with systemd in role::eventbus [puppet] - 10https://gerrit.wikimedia.org/r/253465 (https://phabricator.wikimedia.org/T118780) [15:53:42] akosiaris: mobrovac should we use, https://@lang.wikipedia.org/api/rest_v1/page/html/@title for restbase_url? [15:53:45] santhosh: ^^ [15:54:12] (03CR) 10Zfilipin: RuboCop: fixed Style/Not offense (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/259724 (https://phabricator.wikimedia.org/T112651) (owner: 10Zfilipin) [15:54:30] kart_: that still leaves apertium ... 
[15:55:24] yes that would work [15:55:32] but only for restbase [15:55:48] kart_: btw, all these problems mean we need better monitoring for all this [15:56:05] we should have been having 2-3 reds all this time in icinga [15:56:10] (03PS3) 10Muehlenhoff: Stop opendj on the former labs LDAP servers [puppet] - 10https://gerrit.wikimedia.org/r/259226 [15:56:10] but we have exactly 0 [15:56:20] (03CR) 10Muehlenhoff: [C: 032 V: 032] Stop opendj on the former labs LDAP servers [puppet] - 10https://gerrit.wikimedia.org/r/259226 (owner: 10Muehlenhoff) [15:56:22] akosiaris: ack [15:56:28] !log depool sca1001, playing with cxserver config [15:56:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:57:06] !log stopping opendj LDAP servers on nembus/neptunium (read-only since about days now due to migration to openldap) [15:57:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:57:17] (03PS3) 10Giuseppe Lavagetto: pybal: introduce role for testing machines [puppet] - 10https://gerrit.wikimedia.org/r/259704 [15:58:55] PROBLEM - mathoid endpoints health on sca1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:00:04] anomie ostriches thcipriani marktraceur: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151217T1600). [16:00:04] MatmaRex dcausse kart_: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [16:00:39] thcipriani: i'll add my patch to the deployment page in 10 mins [16:00:55] RECOVERY - mathoid endpoints health on sca1002 is OK: All endpoints are healthy [16:01:00] hi [16:01:09] I can SWAT. MatmaRex ping [16:01:22] dcausse: howdy mobrovac: sounds good. 
[16:02:54] ottomata: btw some kafka broker alerts are UNKNOWN in icinga since a couple of days [16:03:04] PROBLEM - mathoid endpoints health on sca1001 is CRITICAL: /{format}/ is CRITICAL: Could not fetch url http://10.64.32.153:10042/mml/: Timeout on connection while downloading http://10.64.32.153:10042/mml/ [16:03:17] dcausse: Oh I just saw your fix on the list. Thx so much for the quick fix on that. [16:03:41] yw :) [16:03:56] akosiaris: one thing. https://gerrit.wikimedia.org/r/#/c/259703/2 doesn't seems reflecting on sca1001's /etc/cxserver/config.yaml [16:04:30] thcipriani: i'm here [16:04:53] akosiaris: only shows no_proxy instead of no_proxy_list [16:05:03] kart_: reload the file, I am playing with it [16:05:27] MatmaRex: kk, getting dcausse 's patch out, then I'll circle back. [16:06:13] PROBLEM - LDAP on neptunium is CRITICAL: Connection refused [16:06:31] ^already silencing that [16:06:53] PROBLEM - Certificate expiration on neptunium is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:IO::Socket::SSL: connect: Connection refused [16:07:00] MatmaRex: quick check: translations are live for the A/B test in wmf.8 ? [16:07:24] kart_: cxserver is practically not logging anything... no matter the level [16:07:32] unless it's a stacktrace or it's starting up [16:08:08] akosiaris: sorry I haven't been following... did we try to run without any proxy at all to see if that works? [16:08:18] only thing that should need proxy is Yandex [16:08:18] thcipriani: 
should be? and if not, wmf.9 goes out soon [16:08:33] RECOVERY - mathoid endpoints health on sca1001 is OK: All endpoints are healthy [16:08:36] Nikerabbit: we need the proxy [16:08:39] Nikerabbit: well, if we remove proxy, yandex support will be dead [16:08:44] thcipriani: oh wait, that doesn't matter [16:08:48] if you are fine with that, I am fine as well [16:08:51] thcipriani: since wmf.8 doesn't have the test code [16:08:58] so it's a no-op for wmf.8 wikis [16:08:58] but something tells me that's not the case [16:09:03] MatmaRex: ah, ok, cool :) [16:10:07] akosiaris: as of now, it is fine. [16:10:11] mobrovac, akosiaris: yes, but did we verify that everything but Yandex works without proxy? [16:11:18] !log thcipriani@tin Synchronized php-1.27.0-wmf.9/extensions/CirrusSearch/includes/CirrusSearch.php: SWAT: Fix array-to-string conversion [[gerrit:259633]] (duration: 00m 30s) [16:11:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:11:27] ^ dcausse check please [16:11:46] (still guessing) if the case is that servery-runner is using the proxy we have specified for everything, it seems easier for us to rename the proxy key we use for yandex only [16:12:19] dcausse: haven't seen any new ones in the error log, so that's good :) [16:13:15] Nikerabbit: better I guess. [16:13:27] thcipriani: sounds good. will double check on logstash when wmf9 is deployed on group2 [16:13:28] Nikerabbit: yandex uses mt.yandex.proxy. Not a highlevel proxy variable [16:13:29] thanks! [16:13:36] dcausse: thank you! [16:13:41] Nikerabbit: yes of course it works if proxy is not defined in the config [16:13:46] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259665 (https://phabricator.wikimedia.org/T120867) (owner: 10Bartosz Dziewoński) [16:13:54] but if that happens, then no yandex [16:14:08] santhosh: is that true ? 
then we might have a way out [16:14:13] (03Merged) 10jenkins-bot: Enable cross-wiki upload A/B test in additional languages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259665 (https://phabricator.wikimedia.org/T120867) (owner: 10Bartosz Dziewoński) [16:14:36] santhosh: are you sure https://github.com/wikimedia/mediawiki-services-cxserver/blob/master/mt/Yandex.js#L42 ? [16:14:57] it does not say this.conf.mt.yandex.proxy [16:15:01] proxy: this.conf.proxy, - nope. [16:15:13] I would change that in cxserver and deploy that [16:15:25] (03PS1) 10BBlack: text VCL: exempt all variants of Special:Banner.* [puppet] - 10https://gerrit.wikimedia.org/r/259728 [16:15:43] PROBLEM - puppet last run on mw1015 is CRITICAL: CRITICAL: Puppet has 32 failures [16:15:45] (my ext cherry-pick is https://gerrit.wikimedia.org/r/#/c/259729/) [16:16:24] (03CR) 10BBlack: [C: 032 V: 032] text VCL: exempt all variants of Special:Banner.* [puppet] - 10https://gerrit.wikimedia.org/r/259728 (owner: 10BBlack) [16:16:30] akosiaris: did you set no_proxy_list for apertium/restbase? [16:16:42] MatmaRex: should the ext one go out first? probably. [16:16:44] PROBLEM - mathoid endpoints health on sca1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:16:53] thcipriani: either, doesn't matter [16:16:57] kk [16:16:58] kart_: on sca1002 ? yes. on sca1001 ? it's in a state of flux, I am experimenting [16:17:04] (03PS4) 10Mdann52: Namespace config change on de.wikivoyage.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255361 (https://phabricator.wikimedia.org/T119420) [16:17:13] PROBLEM - mathoid endpoints health on sca1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[16:17:19] thcipriani: that one only affects Commons, which doesn't have the A/B test [16:17:40] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Enable cross-wiki upload A/B test in additional languages [[gerrit:259665]] (duration: 00m 30s) [16:17:46] ^ MatmaRex check please [16:17:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:18:56] (03PS2) 10Ottomata: Move role::scap::target to scap::ferm, add scap::target define [puppet] - 10https://gerrit.wikimedia.org/r/259542 [16:19:22] thcipriani: works :) [16:19:32] MatmaRex: nice. Thanks for checking. [16:19:43] (03PS3) 10Ottomata: Move role::scap::target to scap::ferm, add scap::target define [puppet] - 10https://gerrit.wikimedia.org/r/259542 [16:21:04] (03PS4) 10Ottomata: Move role::scap::target to scap::ferm, add scap::target define [puppet] - 10https://gerrit.wikimedia.org/r/259542 [16:21:25] (03PS5) 10Ottomata: Move role::scap::target to scap::ferm, add scap::target define [puppet] - 10https://gerrit.wikimedia.org/r/259542 [16:22:52] Is anyone poking math? Fair number of errors showing up in logstash. [16:23:19] !log thcipriani@tin Synchronized php-1.27.0-wmf.9/extensions/WikimediaEvents/WikimediaEventsHooks.php: SWAT: Actually define tags for cross-wiki upload A/B test [[gerrit:259729]] (duration: 00m 31s) [16:23:25] ^ MatmaRex check please [16:23:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:23:28] "Cannot get mml. Server problem." [16:24:25] ostriches: mathoid (assuming we talk about the same thing) has been misbehaving for a long time now. unfortunately nobody has managed to find time to poke it yey [16:24:27] yet* [16:24:32] thcipriani: yup, it appears correctly on https://commons.wikimedia.org/wiki/Special:Tags now. thanks! [16:24:43] MatmaRex: thank you! 
[16:24:47] 6operations, 10RESTBase, 10RESTBase-Cassandra: Perform cleanups to reclaim space from recent topology changes - https://phabricator.wikimedia.org/T121535#1887708 (10Eevans) [16:24:52] akosiaris: Is there already a task for it? [16:25:13] ostriches: not that I know of [16:25:22] * ostriches will file [16:25:33] hanks [16:25:36] thanks* [16:25:58] akosiaris: we are paching cxserver to read proxy config only under Yandex [16:26:48] Nikerabbit: er,, gimme a sec [16:26:52] I might be on to something [16:27:07] kart_: ^ [16:27:23] RECOVERY - mathoid endpoints health on sca1002 is OK: All endpoints are healthy [16:27:41] akosiaris: ok. will hold till akosiaris asys OK [16:30:55] (03PS6) 10Ottomata: Move role::scap::target to scap::ferm, add scap::target define [puppet] - 10https://gerrit.wikimedia.org/r/259542 [16:30:58] akosiaris: ping me back, when done. [16:34:18] thcipriani: added the patch for sway @ https://wikitech.wikimedia.org/wiki/Deployments#Thursday.2C.C2.A0December.C2.A017 [16:34:45] akosiaris: ostriches: it seems there's a bug in mathoid killing worker [16:34:50] *s [16:34:52] mobrovac: kk [16:36:26] (03PS72) 10Ottomata: [WIP] Puppetize eventlogging-service with systemd in role::eventbus [puppet] - 10https://gerrit.wikimedia.org/r/253465 (https://phabricator.wikimedia.org/T118780) [16:37:24] mobrovac: Is there a task already for that? I filed T121770 if not and you want to add more info. [16:37:27] kart_: Nikerabbit got it working [16:37:33] lemme upload a patch [16:37:39] akosiaris: cooler. [16:37:44] 7Blocked-on-Operations, 6operations, 6Discovery, 3Discovery-Cirrus-Sprint: Make elasticsearch cluster accessible from analytics hadoop workers - https://phabricator.wikimedia.org/T120281#1887746 (10Smalyshev) > My general point is: I am strongly against poking holes in firewalls for specially-crafted data... [16:38:17] ostriches: yup, there are 2 tasks, added a comment on your ticket [16:38:26] akosiaris: what was the trick??? 
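The patch mentioned at 16:25:58, reading the proxy only from the Yandex section of the config, targets the mismatch spotted at 16:14: Yandex.js takes a top-level value (`proxy: this.conf.proxy`) rather than one scoped under mt.yandex. The Python sketch below contrasts the two lookups on a plain dictionary. It is not the cxserver code, which is JavaScript; the dictionary layout only mirrors the key names quoted in the channel, and the helper names and URL are made up.

```python
from typing import Any, Dict, Optional

Conf = Dict[str, Any]


def global_proxy(conf: Conf) -> Optional[str]:
    """Top-level lookup: any client that reads this key shares one proxy."""
    return conf.get("proxy")


def yandex_proxy(conf: Conf) -> Optional[str]:
    """Scoped lookup: only the Yandex client sees a proxy."""
    return conf.get("mt", {}).get("yandex", {}).get("proxy")


# Hypothetical config with the proxy moved under mt.yandex.
scoped_conf: Conf = {"mt": {"yandex": {"proxy": "http://proxy.example:8080"}}}

print(global_proxy(scoped_conf))   # nothing set at the top level
print(yandex_proxy(scoped_conf))   # the Yandex-only proxy
```

Under this layout a request library that falls back to the top-level key finds nothing there, so RESTBase and Apertium calls go direct, while the Yandex client still gets its proxy.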
[16:38:36] mobrovac: thanks! [16:39:07] RECOVERY - puppet last run on mw1015 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:39:08] !log thcipriani@tin Synchronized php-1.27.0-wmf.9/extensions/ContentTranslation/includes/Translation.php: SWAT: Fix Undefined index: targetRevisionId in ContentTranslation [[gerrit:259649]] (duration: 00m 29s) [16:39:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:39:32] 7Blocked-on-Operations, 6operations, 6Discovery, 3Discovery-Cirrus-Sprint: Make elasticsearch cluster accessible from analytics hadoop workers - https://phabricator.wikimedia.org/T120281#1887753 (10Ottomata) Sounds like we need a meeting...:) [16:39:41] (03PS1) 10BBlack: VCL: differentiate hit-vs-hfp in X-Cache [puppet] - 10https://gerrit.wikimedia.org/r/259736 [16:39:56] ^ kart_ Nikerabbit thank you for the quick fix, will keep an eye on it :) [16:40:04] thcipriani: :) [16:40:12] if we get cxserver working :/ [16:40:17] thcipriani: the backend is currently down, so can't test yet [16:40:42] Nikerabbit: ack, thanks. [16:41:23] !log problems with corruption on x1-slave for cebwiki. Fixed them. Will leave db1031 depooled for a while to check they are gone. [16:41:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:41:53] (03PS7) 10Ottomata: Move role::scap::target to scap::ferm, add scap::target define [puppet] - 10https://gerrit.wikimedia.org/r/259542 [16:42:42] does anyone know how I can write to cebwiki.echo_event with mediawiki to double confirm everything is ok? 
[16:42:44] (03PS73) 10Ottomata: [WIP] Puppetize eventlogging-service with systemd in role::eventbus [puppet] - 10https://gerrit.wikimedia.org/r/253465 (https://phabricator.wikimedia.org/T118780) [16:42:46] (03CR) 10BBlack: [C: 032] VCL: differentiate hit-vs-hfp in X-Cache [puppet] - 10https://gerrit.wikimedia.org/r/259736 (owner: 10BBlack) [16:44:57] mobrovac: looking through the changes seems like it _should be_ ok to just sync-dir the whole Math extension, does that seem correct to you? Any file in particular that has to be there before the others? [16:45:29] thcipriani: yup, that sounds right, no added or removed files, jsut changes [16:45:49] RECOVERY - mathoid endpoints health on sca1001 is OK: All endpoints are healthy [16:46:18] (03PS1) 10BryanDavis: stashbot: Add missing logstash::cluster_hosts hiera data [puppet] - 10https://gerrit.wikimedia.org/r/259737 [16:47:58] akosiaris: anything? :) [16:48:18] jynus: RoanKattouw_away or legoktm might be able to tell you how to trigger echo on cebwiki. I suppose making an edit and getting a thanks would do it [16:48:18] yes, uploading changes now [16:48:24] kart_: 3 in number [16:48:25] (03PS74) 10Ottomata: [WIP] Puppetize eventlogging-service with systemd in role::eventbus [puppet] - 10https://gerrit.wikimedia.org/r/253465 (https://phabricator.wikimedia.org/T118780) [16:48:32] 2 reverts and the actual fix [16:48:40] okay :) [16:49:28] !log thcipriani@tin Synchronized php-1.27.0-wmf.8/extensions/Math: SWAT: Make math usable without RESTbase [[gerrit:259734]] (duration: 00m 30s) [16:49:35] ^ mobrovac check please [16:49:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:50:13] bd808, yes, I assumed that, I was just asking for someone that knew that codebase more, and you just helped me with that! Thank you [16:51:29] jynus: I made an edit at https://ceb.wikipedia.org/w/index.php?title=Gumagamit:BDavis_%28WMF%29/test&action=history ; maybe you can add a thanks to that? 
[16:51:38] (can't thank myself) [16:51:41] (03PS1) 10Alexandros Kosiaris: Revert "CXServer: s/no_proxy/no_proxy_list/ in config" [puppet] - 10https://gerrit.wikimedia.org/r/259739 [16:51:43] (03PS1) 10Alexandros Kosiaris: Revert "CXServer: Do not use the proxy for RESTBase and Apertium" [puppet] - 10https://gerrit.wikimedia.org/r/259740 [16:51:45] (03PS1) 10Alexandros Kosiaris: cxserver: Populate no_proxy_list correctly [puppet] - 10https://gerrit.wikimedia.org/r/259741 [16:52:58] seems to work, I see the table growing, too [16:53:07] "JCrespo (WMF) thanked you for your edit on Gumagamit:BDavis (WMF)/test." [16:53:36] (03PS2) 10Alexandros Kosiaris: Revert "CXServer: s/no_proxy/no_proxy_list/ in config" [puppet] - 10https://gerrit.wikimedia.org/r/259739 [16:53:38] (03PS2) 10Alexandros Kosiaris: Revert "CXServer: Do not use the proxy for RESTBase and Apertium" [puppet] - 10https://gerrit.wikimedia.org/r/259740 [16:53:40] (03PS2) 10Alexandros Kosiaris: cxserver: Populate no_proxy_list correctly [puppet] - 10https://gerrit.wikimedia.org/r/259741 [16:53:41] binary corruption is the worse of the problems one can find [16:53:53] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Revert "CXServer: s/no_proxy/no_proxy_list/ in config" [puppet] - 10https://gerrit.wikimedia.org/r/259739 (owner: 10Alexandros Kosiaris) [16:53:57] luckly I was able to correct it [16:54:04] and I checked all other tables too [16:54:07] PROBLEM - mathoid endpoints health on sca1002 is CRITICAL: /{format}/ is CRITICAL: Could not fetch url http://10.64.48.29:10042/complete/: Timeout on connection while downloading http://10.64.48.29:10042/complete/ [16:54:11] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Revert "CXServer: Do not use the proxy for RESTBase and Apertium" [puppet] - 10https://gerrit.wikimedia.org/r/259740 (owner: 10Alexandros Kosiaris) [16:54:32] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] cxserver: Populate no_proxy_list correctly [puppet] - 10https://gerrit.wikimedia.org/r/259741 
(owner: 10Alexandros Kosiaris) [16:57:23] (03PS1) 10BryanDavis: beta: Fix logstash::cluster_hosts [puppet] - 10https://gerrit.wikimedia.org/r/259742 [16:57:29] thcipriani: it seems it works! [16:57:31] thnx! [16:57:37] thank you again, bd808 [16:57:38] mobrovac: thanks for checking. [16:58:00] (03PS2) 10Andrew Bogott: stashbot: Add missing logstash::cluster_hosts hiera data [puppet] - 10https://gerrit.wikimedia.org/r/259737 (owner: 10BryanDavis) [16:58:38] akosiaris: ah so the trick was not to set the port for the no_proxy_list stanza? [16:58:57] mobrovac: http:// AND port [16:59:06] ah! [16:59:10] akosiaris: nice catch! [16:59:10] akosiaris: deployed? [16:59:18] kart_: got a minor bug, fixing [16:59:23] okay [16:59:24] sorry for creating the confusion [16:59:30] (03PS1) 10Alexandros Kosiaris: cxserver: no_proxy_entry is not an instance variable [puppet] - 10https://gerrit.wikimedia.org/r/259743 [16:59:30] damned ruby instance variables [16:59:43] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] cxserver: no_proxy_entry is not an instance variable [puppet] - 10https://gerrit.wikimedia.org/r/259743 (owner: 10Alexandros Kosiaris) [16:59:56] (03CR) 10Chad: [C: 031] beta: Fix logstash::cluster_hosts [puppet] - 10https://gerrit.wikimedia.org/r/259742 (owner: 10BryanDavis) [17:00:00] (03PS1) 10BBlack: Revert "VCL: differentiate hit-vs-hfp in X-Cache" [puppet] - 10https://gerrit.wikimedia.org/r/259744 [17:00:03] akosiaris: that's ERB, not ruby :P [17:00:04] moritzm mutante: Dear anthropoid, the time has come. Please deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151217T1700). 
[17:00:11] I still think it would have been easier to use our cxserver fix [17:00:19] (03PS2) 10BBlack: Revert "VCL: differentiate hit-vs-hfp in X-Cache" [puppet] - 10https://gerrit.wikimedia.org/r/259744 [17:00:25] mobrovac: erb is ruby [17:00:26] (03CR) 10BBlack: [C: 032 V: 032] Revert "VCL: differentiate hit-vs-hfp in X-Cache" [puppet] - 10https://gerrit.wikimedia.org/r/259744 (owner: 10BBlack) [17:00:35] and that @ notation is straight from ruby [17:01:43] akosiaris: yup, sure, i meant that the templating erb engine controls what becomes an instance var, but in your case, that's true, that's only ruby :) [17:04:14] what's the status? Yandex not working atm [17:05:02] !log restarting and reconfiguring mysql at db2009 [17:05:07] akosiaris: pageload/apertium seems okay. [17:05:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:05:32] kart_: Nikerabbit seems to work fine now [17:06:09] akosiaris: Yandex doesn't [17:06:19] RECOVERY - mathoid endpoints health on sca1002 is OK: All endpoints are healthy [17:06:19] PROBLEM - mathoid endpoints health on sca1001 is CRITICAL: /{format}/ is CRITICAL: Could not fetch url http://10.64.32.153:10042/svg/: Timeout on connection while downloading http://10.64.32.153:10042/svg/ [17:06:19] detail: "Property 'conf' of object function (req, res, next) {↔ app.handle(req, res, next);↔ } is not a function" [17:07:10] (03PS3) 10Andrew Bogott: stashbot: Add missing logstash::cluster_hosts hiera data [puppet] - 10https://gerrit.wikimedia.org/r/259737 (owner: 10BryanDavis) [17:07:41] Nikerabbit: that sounds like something you can debug finally [17:07:53] and not like the other unhelpful error message [17:09:55] akosiaris: well, I find it quite unhelpful and backtrace useless, but santhosh has found a thing he is fixing ;) [17:10:01] (03CR) 10Andrew Bogott: [C: 032] stashbot: Add missing logstash::cluster_hosts hiera data [puppet] - 10https://gerrit.wikimedia.org/r/259737 (owner: 10BryanDavis) 
[17:10:08] akosiaris: healthcheck_url like other services can be added for cxserver? [17:10:09] RECOVERY - mathoid endpoints health on sca1001 is OK: All endpoints are healthy [17:10:48] Like ^ [17:10:58] kart_: yes [17:11:10] kart_: in fact, yes please [17:11:22] but you probably want to read the service_checker spec a bit first [17:11:32] as in add the things that should be monitoring [17:11:35] Yes. [17:11:40] kart_: that is what I was referring to earlier [17:11:48] akosiaris: we should have Task. [17:11:51] (03PS4) 10Giuseppe Lavagetto: pybal: introduce role for testing machines [puppet] - 10https://gerrit.wikimedia.org/r/259704 [17:11:53] it should be easy to add checks for most endpoints [17:11:53] (03CR) 10Andrew Bogott: [C: 04-1] "> thePuppet setup needs some value for the hiera variable" [puppet] - 10https://gerrit.wikimedia.org/r/259742 (owner: 10BryanDavis) [17:12:27] PROBLEM - mathoid endpoints health on sca1002 is CRITICAL: /{format}/ is CRITICAL: Could not fetch url http://10.64.48.29:10042/mml/: Timeout on connection while downloading http://10.64.48.29:10042/mml/ [17:12:58] Nikerabbit: well, if the backtrace is useless it's a bad backtrace. but you got a backtrace. on the other case you only got "Not found. HTTPError: 403" [17:13:20] YuviPanda: can i still get a patch into todays puppet swat? :) just got here. has +1 from ops already: https://gerrit.wikimedia.org/r/#/c/259443/ [17:13:46] oh i guess yuvi isn't the deployer anymore, moritzm mutante ^ ? [17:14:05] akosiaris: https://phabricator.wikimedia.org/T121776 [17:14:18] RECOVERY - mathoid endpoints health on sca1002 is OK: All endpoints are healthy [17:14:56] (03CR) 10BryanDavis: "It would be possible but arguably not better. A single node ELK cluster is a degenerate testing case. 
The firewall holes for intra-cluster" [puppet] - 10https://gerrit.wikimedia.org/r/259742 (owner: 10BryanDavis) [17:15:07] kart_, akosiaris: should be fixed by https://gerrit.wikimedia.org/r/#/c/259746/, but some of the previous cxserver commits should be reverted before deployment as they are not compatible with the current fix, I believe [17:15:34] (03PS2) 10Andrew Bogott: beta: Fix logstash::cluster_hosts [puppet] - 10https://gerrit.wikimedia.org/r/259742 (owner: 10BryanDavis) [17:16:28] PROBLEM - mathoid endpoints health on sca1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:18:33] (03PS2) 10Dzahn: extdist: Split skindist log into a separate file [puppet] - 10https://gerrit.wikimedia.org/r/259639 (owner: 10Legoktm) [17:18:35] (03CR) 10Andrew Bogott: [C: 032] "ok!" [puppet] - 10https://gerrit.wikimedia.org/r/259742 (owner: 10BryanDavis) [17:18:37] !log setting mysql db1031 as db2009's master [17:18:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:18:44] (03CR) 10Dzahn: [C: 032] extdist: Split skindist log into a separate file [puppet] - 10https://gerrit.wikimedia.org/r/259639 (owner: 10Legoktm) [17:20:19] RECOVERY - mathoid endpoints health on sca1001 is OK: All endpoints are healthy [17:20:29] PROBLEM - mathoid endpoints health on sca1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:22:14] (03PS1) 10EBernhardson: [elastic] Record count of searchs rejected due to thread pool exhaustion [puppet] - 10https://gerrit.wikimedia.org/r/259750 [17:22:24] Nikerabbit: a great... weird though that it is not compatible with previous changes. it's a 1 line fix [17:22:46] (03PS2) 10EBernhardson: [elastic] Record count of searchs rejected due to thread pool exhaustion [puppet] - 10https://gerrit.wikimedia.org/r/259750 [17:24:02] akosiaris: all OK with sca1001/sca1002 to deploy? 
[17:24:26] kart_: yeah yeah, go ahead [17:24:27] RECOVERY - mathoid endpoints health on sca1002 is OK: All endpoints are healthy [17:25:52] (03PS1) 10Jcrespo: Repool x1-slave (db1031), increase db1041 load to 100% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259756 [17:26:37] akosiaris: and mobrovac thanks a lot. lots of to learn! [17:27:04] akosiaris: such complicated config has some more provision to test in production. [17:27:41] kart_: don't mention it. We probably need more monitoring as in cases like these, they are more or less your QA [17:27:53] kart_: may I assume we are ok and yandex works fine ? [17:28:16] I'm checking as well [17:28:40] still getting HTTP 500 [17:29:10] new code not yet deployed? [17:29:50] Deploying.. [17:29:57] ok [17:30:48] PROBLEM - mathoid endpoints health on sca1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:30:48] PROBLEM - mathoid endpoints health on sca1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:32:23] what's up with mathoid? [17:32:31] mobrovac, gwicke? [17:32:43] it's been warning for a while, what's going on? [17:33:02] jynus: still need help with echo? [17:33:23] paravoid: workers dying, known, will come up with a fix soon, possibly tonight [17:33:39] site impact? [17:33:52] we'll also need an incident report, btw [17:34:11] paravoid: There's already phab tasks tracking this. [17:34:27] (not an incident report, but the general "flapping and somebody needs to look/fix") [17:34:55] (03PS3) 10Dzahn: extdist: Split skindist log into a separate file [puppet] - 10https://gerrit.wikimedia.org/r/259639 (owner: 10Legoktm) [17:34:59] legoktm, no, thank you, I think everithink is all right, but ping me if yous ee something problematic in the short future [17:36:24] is anyone going to be able to take care of puppet swat today? [17:36:58] RECOVERY - mathoid endpoints health on sca1001 is OK: All endpoints are healthy [17:37:00] moritzm: mutante ^ ? 
[17:37:01] (03CR) 10Dzahn: "T120047 - it doesn't break things but the groups we expected to see removed are also still here" [puppet] - 10https://gerrit.wikimedia.org/r/259319 (owner: 10Dzahn) [17:37:23] ebernhardson: 'm here [17:37:31] +1 to puppetswat if someone's doing them, I've got a pile of trivial things [17:37:46] mutante: awsome, there are 3, all pretty easy. two adjust the elasticsearch monitoriing [17:38:41] mutante: the third is a mediawiki cron job, runs once a week [17:38:58] RECOVERY - mathoid endpoints health on sca1002 is OK: All endpoints are healthy [17:39:13] yep, on it now, just a sec was rebasing Legos patch [17:39:33] paravoid: it's affecting only people that have mathml mode enabled [17:40:29] (03CR) 10DCausse: [elastic] Record count of searchs rejected due to thread pool exhaustion (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/259750 (owner: 10EBernhardson) [17:42:07] mutante: I added a few too [17:42:40] (03PS7) 10Dzahn: Cron job to rebuild completion indices [puppet] - 10https://gerrit.wikimedia.org/r/258068 (https://phabricator.wikimedia.org/T112028) (owner: 10EBernhardson) [17:42:44] kart_: how did the deploy go ? everything ok ? [17:43:17] PROBLEM - mathoid endpoints health on sca1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:43:21] (03PS3) 10EBernhardson: [elastic] Record count of searchs rejected due to thread pool exhaustion [puppet] - 10https://gerrit.wikimedia.org/r/259750 [17:44:06] akosiaris: CERT_UNTRUSTED error is back [17:44:10] (03CR) 10DCausse: [C: 031] [elastic] Record count of searchs rejected due to thread pool exhaustion [puppet] - 10https://gerrit.wikimedia.org/r/259750 (owner: 10EBernhardson) [17:44:28] (03CR) 10Dzahn: [C: 032] Cron job to rebuild completion indices [puppet] - 10https://gerrit.wikimedia.org/r/258068 (https://phabricator.wikimedia.org/T112028) (owner: 10EBernhardson) [17:44:34] akosiaris: still we've issue. 
[17:44:42] time to update node :) [17:46:21] kart_: Nikerabbit: set certificate: null in the config for that? [17:46:50] (03PS4) 10Dzahn: [elastic] Record count of searchs rejected due to thread pool exhaustion [puppet] - 10https://gerrit.wikimedia.org/r/259750 (owner: 10EBernhardson) [17:48:03] (03CR) 10Dzahn: "on terbium:" [puppet] - 10https://gerrit.wikimedia.org/r/258068 (https://phabricator.wikimedia.org/T112028) (owner: 10EBernhardson) [17:48:49] (03CR) 10Dzahn: [C: 032] [elastic] Record count of searchs rejected due to thread pool exhaustion [puppet] - 10https://gerrit.wikimedia.org/r/259750 (owner: 10EBernhardson) [17:49:18] PROBLEM - mathoid endpoints health on sca1002 is CRITICAL: /{format}/ is CRITICAL: Could not fetch url http://10.64.48.29:10042/complete/: Timeout on connection while downloading http://10.64.48.29:10042/complete/ [17:49:19] the cronjob has been created on terbium [17:49:40] mutante: thanks. I'll make note to check the logs and see that it all works next tuesday [17:49:49] mutante: thank you :D [17:50:21] :) [17:50:32] kart_: I thought that was fixed with that certificate: change [17:50:34] (03PS2) 10Dzahn: [elasticsearch] Collect cluster health stats about shard movement [puppet] - 10https://gerrit.wikimedia.org/r/259443 (https://phabricator.wikimedia.org/T117284) (owner: 10EBernhardson) [17:50:50] kart_: probably something does not look up correctly that setting ? [17:51:14] akosiaris: Nikerabbit is checking. [17:52:22] ostriches: YuviPanda, yes the -2 on https://gerrit.wikimedia.org/r/#/c/207377/ looks bad for a swat change [17:52:42] It was from an old patch set. 
I want him to remove that -2 :p [17:52:47] (03CR) 10Yuvipanda: Elastic: move merge_threads to hiera [puppet] - 10https://gerrit.wikimedia.org/r/207377 (owner: 10Chad) [17:52:49] (03CR) 10Dzahn: [C: 032] [elasticsearch] Collect cluster health stats about shard movement [puppet] - 10https://gerrit.wikimedia.org/r/259443 (https://phabricator.wikimedia.org/T117284) (owner: 10EBernhardson) [17:52:55] 6operations: move calcium to a VM - https://phabricator.wikimedia.org/T105553#1888004 (10RobH) 5Open>3declined a:3RobH Nope, killing the system and it has no data needed. Closing this task. [17:53:17] mutante: that was when it was still WIP, have removed it [17:53:18] RECOVERY - mathoid endpoints health on sca1001 is OK: All endpoints are healthy [17:53:45] It was WIP because I was also possibly tuning the feature. I decided not to do that so it's just a cleanup/move bit which I already puppetcompilered [17:53:48] (03PS2) 10Dzahn: More fixmes for scap/manifests/scripts.pp [puppet] - 10https://gerrit.wikimedia.org/r/259060 (owner: 10Chad) [17:53:49] :D [17:53:52] YuviPanda: thanks [17:55:18] !log running `nodetool cleanup` on restbase1002 [17:55:18] RECOVERY - mathoid endpoints health on sca1002 is OK: All endpoints are healthy [17:55:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:56:24] (03PS1) 10RobH: reclaim calcium to spares [dns] - 10https://gerrit.wikimedia.org/r/259763 [17:58:39] (03PS1) 10RobH: reclaiming calcium to spares [puppet] - 10https://gerrit.wikimedia.org/r/259764 [17:59:12] is puppet swat still on? [17:59:38] PROBLEM - mathoid endpoints health on sca1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:00:43] YuviPanda: mutante: can i still get into puppet swat? 
[18:01:40] mobrovac: if it's small ?:) [18:01:49] one-liner :) [18:01:50] coming up [18:01:52] ok [18:01:58] (03CR) 10Dzahn: [C: 032] More fixmes for scap/manifests/scripts.pp [puppet] - 10https://gerrit.wikimedia.org/r/259060 (owner: 10Chad) [18:02:57] 7Puppet, 6Analytics-Kanban, 10Analytics-Wikimetrics: Cleanup Wikimetrics puppet module so it can run puppet continuously without own puppetmaster {dove} [? pts] - https://phabricator.wikimedia.org/T101763#1888068 (10Milimetric) [18:03:06] (03PS2) 10Dzahn: gmond_memcached.py: fix all kinds of pep8 warnings [puppet] - 10https://gerrit.wikimedia.org/r/256438 (owner: 10Chad) [18:03:31] (03CR) 10Dzahn: [C: 032] gmond_memcached.py: fix all kinds of pep8 warnings [puppet] - 10https://gerrit.wikimedia.org/r/256438 (owner: 10Chad) [18:03:47] RECOVERY - mathoid endpoints health on sca1001 is OK: All endpoints are healthy [18:03:57] PROBLEM - mathoid endpoints health on sca1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:04:08] ostriches: hmm, not super happy with that big toollabs change without any +1 [18:04:09] PROBLEM - Host calcium is DOWN: PING CRITICAL - Packet loss = 100% [18:04:14] (03PS1) 10Mobrovac: Mathoid: Increase the number of workers temporarily [puppet] - 10https://gerrit.wikimedia.org/r/259765 (https://phabricator.wikimedia.org/T121762) [18:04:19] mutante: ^ [18:04:21] mobrovac: link? [18:04:22] doing the other one first [18:04:24] i put it in maint mode [18:04:24] err [18:04:27] mutante: link? (tool labs change) [18:04:29] why is it echoing that i killed it, annoying... 
[18:04:38] YuviPanda: https://gerrit.wikimedia.org/r/#/c/257002/ [18:04:44] !log calcium is supposed to be down, reclaiming to spares, ignore any irc alerts (it's in maint mode in icinga) [18:04:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:05:03] touches proxylistener etc, but with a +1 from Yuvi, i'll do it [18:05:14] YuviPanda: mutante: https://gerrit.wikimedia.org/r/#/c/259765/ [18:05:31] lemme put in wikitech/deployments as well [18:05:45] mobrovac: yes please, ok [18:06:21] !log running `nodetool cleanup` on restbase1001 and restbase1005 [18:06:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:06:35] (03PS1) 10BBlack: VCL: Make X-Cache more accurate/informative, hopefully [puppet] - 10https://gerrit.wikimedia.org/r/259766 [18:06:55] (03CR) 10Yuvipanda: [C: 031] toollabs: pep8 fixes for pretty code :) [puppet] - 10https://gerrit.wikimedia.org/r/257002 (owner: 10Chad) [18:07:06] ostriches: +1'd [18:07:12] ty! 
[18:07:14] * YuviPanda goes away for breakfast for real [18:07:14] (03PS2) 10Dzahn: Mathoid: Increase the number of workers temporarily [puppet] - 10https://gerrit.wikimedia.org/r/259765 (https://phabricator.wikimedia.org/T121762) (owner: 10Mobrovac) [18:07:23] (03PS1) 10DCausse: Collect suggest stats and specific search groups [puppet] - 10https://gerrit.wikimedia.org/r/259767 [18:07:37] (03CR) 10Dzahn: [C: 032] Mathoid: Increase the number of workers temporarily [puppet] - 10https://gerrit.wikimedia.org/r/259765 (https://phabricator.wikimedia.org/T121762) (owner: 10Mobrovac) [18:08:26] (03PS8) 10Ottomata: Move role::scap::target to scap::ferm, add scap::target define [puppet] - 10https://gerrit.wikimedia.org/r/259542 [18:08:45] (03PS2) 10Dzahn: Gerrit: move static assets to *.cache.* filenames [puppet] - 10https://gerrit.wikimedia.org/r/257676 (owner: 10Chad) [18:09:17] (03PS2) 10BBlack: VCL: Make X-Cache more accurate/informative, hopefully [puppet] - 10https://gerrit.wikimedia.org/r/259766 [18:09:29] (03CR) 10Jcrespo: [C: 032] Repool x1-slave (db1031), increase db1041 load to 100% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259756 (owner: 10Jcrespo) [18:09:42] mobrovac: does that need manual restart or something? 
no of workers [18:09:55] 6operations: reclaim calcium to spares - https://phabricator.wikimedia.org/T116790#1888120 (10RobH) a:5RobH>3Cmjohnson [18:09:57] RECOVERY - mathoid endpoints health on sca1002 is OK: All endpoints are healthy [18:10:01] ah :) [18:10:06] mutante: just forcing a puppet run on sca100x should do it [18:10:06] paravoid: ^ [18:10:14] 6operations, 10ops-eqiad: reclaim calcium to spares - https://phabricator.wikimedia.org/T116790#1758512 (10RobH) [18:10:15] ok [18:10:18] doing that [18:10:29] (03CR) 10BBlack: [C: 032 V: 032] VCL: Make X-Cache more accurate/informative, hopefully [puppet] - 10https://gerrit.wikimedia.org/r/259766 (owner: 10BBlack) [18:10:42] mutante: mathoid's been flapping for some time now, this patch will just make it flap less frequently :P [18:10:42] (03CR) 10RobH: [C: 032] reclaim calcium to spares [dns] - 10https://gerrit.wikimedia.org/r/259763 (owner: 10RobH) [18:11:02] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool x1-slave (db1031), increase db1041 load to 100% (duration: 00m 30s) [18:11:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:11:10] (03PS2) 10RobH: reclaiming calcium to spares [puppet] - 10https://gerrit.wikimedia.org/r/259764 [18:11:23] ehm.. i cant login on sca1001 [18:11:35] euh? [18:11:35] publickey [18:11:40] i'm there now [18:11:51] is that running puppet currently? 
[18:12:00] maybe it did not get my new yubikey key yet [18:12:17] (03CR) 10RobH: [C: 032] reclaiming calcium to spares [puppet] - 10https://gerrit.wikimedia.org/r/259764 (owner: 10RobH) [18:12:24] tries again after messing with ssh-agent [18:13:24] yea, it works with my old key [18:13:30] (03PS75) 10Ottomata: [WIP] Puppetize eventlogging-service with systemd in role::eventbus [puppet] - 10https://gerrit.wikimedia.org/r/253465 (https://phabricator.wikimedia.org/T118780) [18:14:36] (03CR) 10Dzahn: [C: 032] Gerrit: move static assets to *.cache.* filenames [puppet] - 10https://gerrit.wikimedia.org/r/257676 (owner: 10Chad) [18:14:44] (03PS1) 10BBlack: Followup missing bit from 70f4366dc [puppet] - 10https://gerrit.wikimedia.org/r/259768 [18:15:03] (03CR) 10BBlack: [C: 032 V: 032] Followup missing bit from 70f4366dc [puppet] - 10https://gerrit.wikimedia.org/r/259768 (owner: 10BBlack) [18:15:05] (03PS3) 10Dzahn: Gerrit: move static assets to *.cache.* filenames [puppet] - 10https://gerrit.wikimedia.org/r/257676 (owner: 10Chad) [18:15:31] mobrovac: looks all done [18:15:56] and the recovery right before, so cool [18:15:57] it is indeed mutante! thnx for your help! [18:15:59] (03PS76) 10Ottomata: [WIP] Puppetize eventlogging-service with systemd in role::eventbus [puppet] - 10https://gerrit.wikimedia.org/r/253465 (https://phabricator.wikimedia.org/T118780) [18:16:02] welcome [18:16:04] 6operations, 10ops-eqiad, 10hardware-requests: reclaim calcium to spares - https://phabricator.wikimedia.org/T116790#1888159 (10RobH) [18:16:08] PROBLEM - mathoid endpoints health on sca1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:16:37] (03PS4) 10Dzahn: Gerrit: move static assets to *.cache.* filenames [puppet] - 10https://gerrit.wikimedia.org/r/257676 (owner: 10Chad) [18:16:53] ugh, said that too early? [18:17:09] hopes it's just the restart [18:17:27] mutante: have you run puppet on sca1002 as well? 
[18:17:36] the number of processes there is still 32 [18:17:49] i am right now [18:18:14] k cool [18:18:37] Info: /Stage[main]/Mathoid/Service::Node[mathoid]/File[/etc/mathoid/config.yaml]: Scheduling refresh of Service[mathoid] [18:18:49] finished [18:19:34] (03PS1) 10Jcrespo: New generated key for jcrespo (jynus) [puppet] - 10https://gerrit.wikimedia.org/r/259770 [18:19:44] cool thnx mutante! [18:19:59] yw! come on icinga-wm [18:19:59] RECOVERY - mathoid endpoints health on sca1002 is OK: All endpoints are healthy [18:20:03] :) [18:20:59] ostriches: could you get the grrrit-wm restarted after that gerrit config change [18:21:08] YuviPanda? [18:21:18] 6operations, 10hardware-requests: migrate spares into google sheet tracking & determine which eqiad spares to decommission - https://phabricator.wikimedia.org/T120679#1888199 (10Cmjohnson) [18:21:19] 6operations, 10hardware-requests, 5Patch-For-Review: Decommission and remove from racks out of warranty spares - https://phabricator.wikimedia.org/T121007#1888197 (10Cmjohnson) 5Open>3Resolved These were previously wiped...removed from racks updated racktables [18:21:22] ah, wait, that is just moving css and jpg etc [18:21:34] won't restart it? [18:22:00] (03Abandoned) 10Jcrespo: [WIP] New generated key for jcrespo (jynus) [puppet] - 10https://gerrit.wikimedia.org/r/253905 (owner: 10Jcrespo) [18:23:00] ostriches: the static asset change.. running puppet on ytterbium [18:23:19] Yeah it shouldn't restart. [18:23:23] It's not a config change for gerrit [18:23:32] Notice: /Stage[main]/Gerrit::Jetty/Git::Clone[operations/gerrit/plugins]/Exec[git_pull_operations/gerrit/plugins]/returns: executed successfully [18:23:38] jetty updated [18:23:57] done [18:25:41] for some reason grrrit-wm is still kind of silent [18:26:00] ah, no, path conflict [18:26:17] ostriches: can you amend https://gerrit.wikimedia.org/r/#/c/207377/ , rebase button says path conflict [18:26:19] (03CR) 10GWicke: "Ping!" 
[puppet] - 10https://gerrit.wikimedia.org/r/252863 (https://phabricator.wikimedia.org/T118519) (owner: 10GWicke) [18:26:24] (03PS2) 10GWicke: Add /api/ listing to www.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/252863 (https://phabricator.wikimedia.org/T118519) [18:27:05] ottomata, if you have a moment- could you have a look at https://gerrit.wikimedia.org/r/#/c/252863/ ? [18:27:08] (03PS2) 10Dzahn: toollabs: pep8 fixes for pretty code :) [puppet] - 10https://gerrit.wikimedia.org/r/257002 (owner: 10Chad) [18:27:10] 6operations, 10hardware-requests: migrate spares into google sheet tracking & determine which eqiad spares to decommission - https://phabricator.wikimedia.org/T120679#1888235 (10RobH) a:5RobH>3mark Mark, I'd like to get your blanket approval for the decommission of a large number of out of warranty EQIAD... [18:27:32] (03CR) 10Dzahn: [C: 032] toollabs: pep8 fixes for pretty code :) [puppet] - 10https://gerrit.wikimedia.org/r/257002 (owner: 10Chad) [18:27:45] (03PS2) 10Jcrespo: New generated key for jcrespo (jynus) [puppet] - 10https://gerrit.wikimedia.org/r/259770 [18:28:59] (03Abandoned) 10KartikMistry: CX: Fix cxserver.yaml config [puppet] - 10https://gerrit.wikimedia.org/r/259687 (owner: 10KartikMistry) [18:29:22] 6operations, 10DBA: Revision 186704908 on en.wikipedia.org, Fatal exception: unknown "cluster16" - https://phabricator.wikimedia.org/T26675#1888239 (10Krenair) @Betacommand: Any updates for us? [18:29:25] (03PS3) 10Jcrespo: New generated key for jcrespo (jynus) [puppet] - 10https://gerrit.wikimedia.org/r/259770 [18:29:45] (03PS5) 10Chad: Elastic: move merge_threads to hiera [puppet] - 10https://gerrit.wikimedia.org/r/207377 [18:29:47] gwicke: about to eat lucnh, but ja, i told dan I wasn't really sure what that affected [18:29:57] or where [18:30:03] so wanted someone around to merge w me [18:30:07] and make sure nothing broke [18:30:11] (Back after lunch.>..) [18:30:17] ottomata: okay, thanks! 
[18:30:25] (03CR) 10Jcrespo: [C: 032] New generated key for jcrespo (jynus) [puppet] - 10https://gerrit.wikimedia.org/r/259770 (owner: 10Jcrespo) [18:30:30] we basically have the same for most other domains, just missed this one [18:30:52] ok, puppet swat over now at 10.30. 9 merged [18:31:22] eh, 8. i'll do that last one later if it's mergeable [18:33:09] (03PS6) 10Dzahn: Elastic: move merge_threads to hiera [puppet] - 10https://gerrit.wikimedia.org/r/207377 (owner: 10Chad) [18:33:10] mutante: I rebased, it should merge now [18:33:15] (03CR) 10Dzahn: [C: 032] Elastic: move merge_threads to hiera [puppet] - 10https://gerrit.wikimedia.org/r/207377 (owner: 10Chad) [18:33:36] (03PS1) 10Jcrespo: Remove extra space on data.yaml [puppet] - 10https://gerrit.wikimedia.org/r/259773 [18:33:56] jynus: merged your new key on master [18:33:56] (03CR) 10Jcrespo: [C: 032] Remove extra space on data.yaml [puppet] - 10https://gerrit.wikimedia.org/r/259773 (owner: 10Jcrespo) [18:34:07] I was waiting for ^ [18:35:00] (03PS2) 10Jcrespo: Remove extra space on data.yaml [puppet] - 10https://gerrit.wikimedia.org/r/259773 [18:35:14] (03CR) 10Jcrespo: [V: 032] Remove extra space on data.yaml [puppet] - 10https://gerrit.wikimedia.org/r/259773 (owner: 10Jcrespo) [18:35:28] *nod* [18:36:42] thanks! [18:36:52] it works, by the way [18:37:13] (03CR) 10Thcipriani: [C: 04-1] "one thought about the -cache directory, other than that looks pretty awesome." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/259542 (owner: 10Ottomata) [18:38:14] jynus: :) i just saw it change on an elastics server too, looking at my change [18:38:22] ostriches: confirmed no-op on elastic1001 [18:40:45] mutante: Thanks! [18:40:51] yw [18:43:22] akosiaris, mobrovac: we got Yandex fixed, thakns again [18:43:39] Nikerabbit: gr8 news! [18:43:49] Nikerabbit: so now cxserver is fully functional? [18:44:25] Nikerabbit: a great!!! nice [18:45:48] mobrovac: yes. [18:45:50] :) [18:45:50] YuviPanda: around? 
[18:46:00] Good night! [18:46:20] kart_: awesome! good night [18:46:48] 6operations, 10RESTBase, 10RESTBase-Cassandra: Perform cleanups to reclaim space from recent topology changes - https://phabricator.wikimedia.org/T121535#1888262 (10Eevans) [18:52:44] 7Blocked-on-Operations, 6operations, 6Discovery, 3Discovery-Cirrus-Sprint: Make elasticsearch cluster accessible from analytics hadoop workers - https://phabricator.wikimedia.org/T120281#1888275 (10Smalyshev) Looking at https://www.mediawiki.org/wiki/Extension:EventLogging/Guide#Creating_a_schema our schem... [18:53:35] 6operations, 10DBA, 5Patch-For-Review: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654#1888276 (10jcrespo) TLS rolled on x1, too, that makes all mediawiki production links. [18:55:13] (03CR) 1020after4: [C: 031] "I think it'd be better to have puppet create the deployment-cache directory, rather than trying to manage permissions on the parent direct" [puppet] - 10https://gerrit.wikimedia.org/r/259542 (owner: 10Ottomata) [18:55:40] (03PS1) 10BBlack: Revert "post-merge syntax bugfix for be768ad7c6" [puppet] - 10https://gerrit.wikimedia.org/r/259775 [18:55:42] (03PS1) 10BBlack: Revert "VCL: grace-mode only in frontend caches" [puppet] - 10https://gerrit.wikimedia.org/r/259776 [18:55:44] (03PS1) 10BBlack: Text VCL: exclude lower-layer cache hits from hfp object creation [puppet] - 10https://gerrit.wikimedia.org/r/259777 [18:56:48] (03CR) 10BBlack: [C: 032 V: 032] Revert "post-merge syntax bugfix for be768ad7c6" [puppet] - 10https://gerrit.wikimedia.org/r/259775 (owner: 10BBlack) [18:57:00] (03CR) 10BBlack: [C: 032 V: 032] Revert "VCL: grace-mode only in frontend caches" [puppet] - 10https://gerrit.wikimedia.org/r/259776 (owner: 10BBlack) [18:57:49] ^ don't merge those yet please :) [18:58:01] YuviPanda: nm, emailed [18:58:33] (03PS2) 10BBlack: Text VCL: exclude lower-layer cache hits from hfp object creation [puppet] - 10https://gerrit.wikimedia.org/r/259777 [18:59:54] 
(03CR) 10BBlack: [C: 032] Text VCL: exclude lower-layer cache hits from hfp object creation [puppet] - 10https://gerrit.wikimedia.org/r/259777 (owner: 10BBlack) [19:00:05] thcipriani: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151217T1900). [19:00:37] akosiaris: mobrovac: Thanks a lot for the help with cx tonight. [19:00:57] !log starting update of all wikis to 1.27.0-wmf.9 [19:01:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:01:37] arrbee: np, i'm happy we managed to fix everything! [19:01:48] thcipriani: duh, you're moving now everything to wmf9? [19:02:31] mobrovac: yessir: https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151217T1900 [19:02:42] heh! [19:02:54] shouldn't have bothered with backporting to wmf8 then :D [19:03:09] but at least i know the patches are 100% working this way [19:03:10] :P [19:03:16] mobrovac: heh, sorry, I should have mentioned that :P [19:04:00] (03PS1) 10EBernhardson: [elastic] Fix bad method call in diamond collection [puppet] - 10https://gerrit.wikimedia.org/r/259778 [19:04:57] (03PS1) 10Thcipriani: all wikis to 1.27.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259779 [19:05:17] arrbee: don't mention it. thanks for making this happen [19:05:26] could i poke anyone about a 1 line fix? 
i mucked it up in the patch deployed during puppet swat this morning: https://gerrit.wikimedia.org/r/259778 [19:05:35] akosiaris: :) [19:05:48] !log disable puppet on graphite2001, brief testing cluster aggregations [19:05:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:06:39] (03CR) 10Thcipriani: [C: 032] "Train" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259779 (owner: 10Thcipriani) [19:07:03] (03Merged) 10jenkins-bot: all wikis to 1.27.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259779 (owner: 10Thcipriani) [19:07:33] !log thcipriani@tin rebuilt wikiversions.php and synchronized wikiversions files: all wikis to 1.27.0-wmf.9 [19:07:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:10:20] All wikis are updated to 1.27.0-wmf.9. Train is complete. [19:13:14] 6operations, 10RESTBase, 7Graphite, 7service-runner: restbase should send metrics in batches - https://phabricator.wikimedia.org/T121231#1888291 (10Pchelolo) I've run some benchmarks with and without batching, and it actually makes a surprisingly huge difference: on my laptop, for #restbase a `/title/{titl... [19:24:46] gwicke: back. [19:25:18] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/). [19:38:10] (03CR) 10Thcipriani: [C: 031] "-cache directory conversations in IRC have set me straight. 
Puppet compiler looks good, too: https://puppet-compiler.wmflabs.org/1505/ tha" [puppet] - 10https://gerrit.wikimedia.org/r/259542 (owner: 10Ottomata) [19:42:12] !log mathoid deploying a2187a6 [19:42:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:49:35] 6operations, 10RESTBase-Cassandra: Grafana bugginess; Graph scales sometimes off by an order of magnitude - https://phabricator.wikimedia.org/T121789#1888407 (10Eevans) 3NEW [19:58:27] (03PS1) 10Andrew Bogott: WIP: Set up special dhcp behavior for bare-metal boxes [puppet] - 10https://gerrit.wikimedia.org/r/259787 [19:58:29] (03PS1) 10Andrew Bogott: WIP: nova-network: have dnsmasq advertise the network host as a tftp server [puppet] - 10https://gerrit.wikimedia.org/r/259788 [19:58:35] (03PS9) 10Ottomata: Move role::scap::target to scap::ferm, add scap::target define [puppet] - 10https://gerrit.wikimedia.org/r/259542 [20:01:07] (03CR) 10Ottomata: [C: 032] Move role::scap::target to scap::ferm, add scap::target define [puppet] - 10https://gerrit.wikimedia.org/r/259542 (owner: 10Ottomata) [20:06:58] PROBLEM - puppet last run on mw2074 is CRITICAL: CRITICAL: puppet fail [20:07:01] (03PS77) 10Ottomata: [WIP] Puppetize eventlogging-service with systemd in role::eventbus [puppet] - 10https://gerrit.wikimedia.org/r/253465 (https://phabricator.wikimedia.org/T118780) [20:10:39] (03PS78) 10Ottomata: Puppetize eventlogging-service with systemd in role::eventbus [puppet] - 10https://gerrit.wikimedia.org/r/253465 (https://phabricator.wikimedia.org/T118780) [20:13:03] (03CR) 10DCausse: Enable completion suggester beta on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/259701 (https://phabricator.wikimedia.org/T119989) (owner: 10DCausse) [20:17:29] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [20:21:08] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet 
(dir /var/lib/git/operations/puppet). [20:21:08] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [20:21:31] (03CR) 10Luke081515: [C: 04-1] "Normaly the group is named "rollbacker", the right is "rollback", so please change this to default, otherwise translations will not match," [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255810 (https://phabricator.wikimedia.org/T119787) (owner: 10Mdann52) [20:33:09] 7Puppet, 6operations, 10Continuous-Integration-Config: puppet-lint ignores --no-80chars-check option - https://phabricator.wikimedia.org/T121796#1888526 (10Dzahn) 3NEW [20:33:09] RECOVERY - puppet last run on mw2074 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [20:33:46] 7Puppet, 6operations, 10Continuous-Integration-Config: puppet-lint ignores --no-80chars-check option - https://phabricator.wikimedia.org/T121796#1888544 (10Dzahn) [20:33:49] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [20:34:36] 7Puppet, 6operations, 10Continuous-Integration-Config: puppet-lint ignores --no-80chars-check option - https://phabricator.wikimedia.org/T121796#1888526 (10Dzahn) [20:36:16] 7Puppet, 6operations, 10Continuous-Integration-Config: puppet-lint ignores --no-80chars-check option - https://phabricator.wikimedia.org/T121796#1888560 (10Dzahn) p:5Triage>3Low [20:39:58] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [20:41:48] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [20:42:03] (03CR) 10Ottomata: [C: 032] Puppetize eventlogging-service with systemd in role::eventbus [puppet] - 10https://gerrit.wikimedia.org/r/253465 (https://phabricator.wikimedia.org/T118780) (owner: 10Ottomata) [20:43:28] RECOVERY - Unmerged changes on repository puppet on 
strontium is OK: No changes to merge. [20:43:28] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [20:48:08] (03PS1) 10Dzahn: mediawiki: fix puppet-lint warnings [puppet] - 10https://gerrit.wikimedia.org/r/259791 [20:49:33] (03PS2) 10Dzahn: mediawiki: fix puppet-lint warnings [puppet] - 10https://gerrit.wikimedia.org/r/259791 [20:50:30] (03CR) 10Dzahn: [C: 032] mediawiki: fix puppet-lint warnings [puppet] - 10https://gerrit.wikimedia.org/r/259791 (owner: 10Dzahn) [20:52:27] PROBLEM - puppet last run on mira is CRITICAL: CRITICAL: puppet fail [20:53:00] no, that's not that [20:53:09] looks at mira [20:53:56] Could not determine keyholder key_fingerprint for scap [20:54:06] when setting up eventlogging deployment source for eventlogging [20:54:14] hmm [20:55:04] ah [20:55:10] modules/eventlogging/manifests/deployment/source.pp:22 [20:55:11] i just merged a change [20:55:15] ah? [20:55:22] i am fixing some weirdness on eventlog1001 because of it [20:55:24] hadn't gotten to run puppet there [20:55:28] ahhhh [20:55:34] that's the other deployment server in codfw [20:55:34] yes, i haven't done that for production, hmmmMm [20:55:36] that is like tin [20:55:40] yeah [20:55:41] it [20:55:43] ll happen on tin too [20:55:54] umm, i can't get to it at the moment, but will merge a temp fix to keep puppet happey [20:55:56] one sec [20:56:00] ok, i was wondering if tin is ok or not [20:56:02] sure, thanks [20:57:37] (03PS1) 10Ottomata: Don't include eventlogging::deployment::source in production yet [puppet] - 10https://gerrit.wikimedia.org/r/259792 [20:59:06] (03CR) 10Ottomata: [C: 032] Don't include eventlogging::deployment::source in production yet [puppet] - 10https://gerrit.wikimedia.org/r/259792 (owner: 10Ottomata) [20:59:36] got it [21:02:38] hey folks :-) [21:02:47] RECOVERY - puppet last run on mira is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [21:02:53] hashar: hello [21:03:07] would someone be 
available to publish Zuul .deb on apt.wm.o ? [21:03:13] hashar: yes [21:03:30] I got them deployed on labs and published on people.wm.o https://phabricator.wikimedia.org/T119714#1838293 ;) [21:03:30] is it like jenkins? [21:03:42] not unfortunately :( [21:04:02] custom packages for Zuul that I build with the help of godog a few months ago. [21:04:31] I am merely bumping upstream version and cherry pick a few patches. The labs instances have been updated already [21:04:49] you need it for all 3 distros? [21:04:57] yes :/ [21:05:25] CI is scary [21:05:36] (03PS2) 10Krinkle: Remove unused $wgObjectCaches['resourceloader'] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258165 [21:05:55] ok, hold on for a moment [21:06:23] (meanwhile for your entertainment: https://phabricator.wikimedia.org/T121796) [21:06:27] (03PS1) 10Ottomata: Make eventlogging files consumer role manage output directory [puppet] - 10https://gerrit.wikimedia.org/r/259793 [21:07:12] yeah I have seen that puppet-lint craziness :/ [21:07:22] guess we will have to dig in the source code :D [21:09:23] (03CR) 10Ottomata: [C: 032] Make eventlogging files consumer role manage output directory [puppet] - 10https://gerrit.wikimedia.org/r/259793 (owner: 10Ottomata) [21:09:48] 7Puppet, 6operations, 10Continuous-Integration-Config: puppet-lint ignores --no-80chars-check option - https://phabricator.wikimedia.org/T121796#1888624 (10hashar) + my favorites rubyist It can be either: - a weird bug in puppet-lint or a strange oddity on the CI slaves - or instances have a dirty workspac... [21:10:44] (03CR) 10Ottomata: "I should note that this change adds myself and madhuvishy to the eventlogging-admins group. However, we are both already in the eventlogg" [puppet] - 10https://gerrit.wikimedia.org/r/253465 (https://phabricator.wikimedia.org/T118780) (owner: 10Ottomata) [21:12:06] Unable to find pool/thirdparty/z/zuul/zuul_2.1.0-60-g1cc37f7.orig.tar.gz needed by zuul_2.1.0-60-g1cc37f7-wmf4precise1.dsc! 
[21:12:09] Perhaps you forgot to give dpkg-buildpackage the -sa option, [21:12:12] hashar: ^ [21:12:33] reprepro -C thirdparty include precise-wikimedia zuul_2.1.0-60-g1cc37f7-wmf4precise1_amd64.changes [21:12:46] zuul_2.1.0-60-g1cc37f7-wmf4precise1.debian.tar.gz [21:12:53] 7Puppet, 6operations, 10Continuous-Integration-Config: puppet-lint ignores --no-80chars-check option - https://phabricator.wikimedia.org/T121796#1888637 (10hashar) Looked at the slaves with `git status` and the workspaces are clean. Command on integration-saltmaster: `salt '*slave-trusty*' cmd.run 'git -C /... [21:13:26] mutante: did I forgot to upload the orig.tar.gz so ? :( [21:13:48] hashar: i have -wmf4precise1.debian.tar.gz [21:14:03] https://people.wikimedia.org/~hashar/debs/zuul_2.1.0-60-g1cc37f7-wmf4jessie1/zuul_2.1.0-60-g1cc37f7.orig.tar.gz [21:14:07] (03CR) 10Aaron Schulz: [C: 031] Remove unused $wgObjectCaches['resourceloader'] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258165 (owner: 10Krinkle) [21:14:09] no, the other way around [21:14:31] ehmm. [21:14:41] ohhhh [21:14:56] so git build package used the same tarball for all distro [21:15:04] but somehow rename it when crafting the dsc [21:15:06] grr [21:15:09] is it right about that -sa option? [21:15:25] no clue :( [21:16:40] maybe you have to pass the tarball name explicitly ? [21:17:15] i tried renaming it, but no [21:18:00] or you could try --ignore=missingfile to guess possible files to use. [21:18:55] maybe you can copy the orig tarball where it expect it i.e. 
pool/thirdparty/z/zuul/zuul_2.1.0-60-g1cc37f7.orig.tar.gz [21:19:10] yeah I am looking at the same paragraph in the manpage [21:20:13] hashar: yes, that works [21:20:19] the one for precise has been added [21:21:10] sorry for the trouble :( [21:21:27] http://apt.wikimedia.org/wikimedia/pool/thirdparty/z/zuul/ [21:21:50] np [21:22:50] !log add zuul_2.1.0-60-g1cc37f7-wmf4precise1 to precise-wikimedia APT [21:22:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:24:02] (03PS1) 10Ottomata: Use new kafka role for eventlogging service eventbus configuration [puppet] - 10https://gerrit.wikimedia.org/r/259796 [21:25:35] hashar: the one for trusty is a bit different [21:25:44] not the .tar.gz but: [21:25:55] annot find file './zuul_2.1.0-60-g1cc37f7-wmf4trusty1.dsc' needed by 'zuul_2.1.0-60-g1cc37f7-wmf4trusty1_amd64.changes'! [21:26:00] ... [21:26:16] (03CR) 10Ottomata: [C: 032] Use new kafka role for eventlogging service eventbus configuration [puppet] - 10https://gerrit.wikimedia.org/r/259796 (owner: 10Ottomata) [21:26:45] hashar: no, my bad, it's ok [21:27:04] just failed to download it [21:27:09] ohhhh [21:27:17] actually that was imported without any issues [21:27:21] I should have build some tarballs containing everything :D [21:27:23] no copying needed [21:27:38] yeah the orig tarballs are in the pool now [21:28:01] !log add zuul_2.1.0-60-g1cc37f7-wmf4trusty1 to trusty-wikimedia repo [21:28:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:28:25] all fine :) [21:28:29] last one would be Jessie [21:29:45] !log add zuul_2.1.0-60-g1cc37f7-wmf4jessie1 to jessie-wikimedia repo [21:29:48] done [21:29:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:31:17] mutante: awesome. 
Danke very much (or something like that) [21:31:31] 7Blocked-on-Operations, 6operations: Upload new Zuul packages zuul_2.1.0-60-g1cc37f7-wmf4 on apt.wikimedia.org for Precise / Trusty / Jessie - https://phabricator.wikimedia.org/T119714#1888677 (10Dzahn) a:5hashar>3Dzahn [21:31:53] hashar: de rien ("vielen dank") [21:32:05] I definitely need to get some german lessons [21:32:33] lit. "many thank" [21:33:12] if you do half yoda and half doge and then translate literally from English it often works out :p [21:33:25] !log gallium: upgrading Zuul from 2.1.0-60-g1cc37f7-wmf2precise1 .. 2.1.0-60-g1cc37f7-wmf4precise1 . Should be noop, only change zuul-cloner which is not used there [21:33:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:34:33] 7Blocked-on-Operations, 6operations: Upload new Zuul packages zuul_2.1.0-60-g1cc37f7-wmf4 on apt.wikimedia.org for Precise / Trusty / Jessie - https://phabricator.wikimedia.org/T119714#1888679 (10Dzahn) 13:26 < mutante> !log add zuul_2.1.0-60-g1cc37f7-wmf4precise1 to precise-wikimedia APT 13:31 < mutante> !log... 
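Note on the reprepro error above: the three .dsc files (precise, trusty, jessie) differ only in their Debian revision, so they all point at the same upstream orig tarball, and reprepro expects that tarball at the pool path quoted in its error message. The following is a minimal sketch of the naming rule, assuming the standard Debian conventions (the upstream version is everything before the last hyphen, and the pool directory is keyed by the first letter of the source package, or "lib" plus one letter for lib* packages). The function names are hypothetical and not part of reprepro or dpkg.

```python
def upstream_version(debian_version: str) -> str:
    """Return the upstream part of a Debian version string.

    An optional epoch ("1:") is dropped, and the Debian revision is
    whatever follows the last hyphen.
    """
    version = debian_version.split(":", 1)[-1]
    upstream, sep, _revision = version.rpartition("-")
    return upstream if sep else version


def orig_tarball(source: str, debian_version: str, compression: str = "gz") -> str:
    """Name of the orig tarball that a .dsc for this version refers to."""
    return f"{source}_{upstream_version(debian_version)}.orig.tar.{compression}"


def pool_path(component: str, source: str, filename: str) -> str:
    """Location of a source file inside a Debian-style pool/ tree."""
    prefix = source[:4] if source.startswith("lib") else source[:1]
    return f"pool/{component}/{prefix}/{source}/{filename}"


if __name__ == "__main__":
    for suffix in ("precise1", "trusty1", "jessie1"):
        name = orig_tarball("zuul", f"2.1.0-60-g1cc37f7-wmf4{suffix}")
        print(pool_path("thirdparty", "zuul", name))
```

All three versions map to the same file name, which is consistent with the log: once the orig tarball was in the pool for precise, the trusty and jessie imports needed no copying.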
[21:34:33] (03PS1) 10BBlack: Text VCL: raise hfp TTL to 601s [puppet] - 10https://gerrit.wikimedia.org/r/259799 [21:34:34] (03PS1) 10BBlack: Revert "Revert "cache_text/mobile: send randomized pass traffic directly to t1 backends"" [puppet] - 10https://gerrit.wikimedia.org/r/259800 (https://phabricator.wikimedia.org/T121564) [21:34:38] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [21:35:03] (03CR) 10BBlack: [C: 032 V: 032] Text VCL: raise hfp TTL to 601s [puppet] - 10https://gerrit.wikimedia.org/r/259799 (owner: 10BBlack) [21:35:10] 7Blocked-on-Operations, 6operations: Upload new Zuul packages zuul_2.1.0-60-g1cc37f7-wmf4 on apt.wikimedia.org for Precise / Trusty / Jessie - https://phabricator.wikimedia.org/T119714#1888682 (10Dzahn) 5Open>3Resolved http://apt.wikimedia.org/wikimedia/pool/thirdparty/z/zuul/ [21:35:46] 7Blocked-on-Operations, 6operations: Upload new Zuul packages zuul_2.1.0-60-g1cc37f7-wmf4 on apt.wikimedia.org for Precise / Trusty / Jessie - https://phabricator.wikimedia.org/T119714#1888685 (10hashar) Thank you very much @Dzahn. One less item in the backlog! [21:36:38] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [21:37:27] hashar: oh, re: the puppet-lint bug, i did all that on my laptop, that would exclude CI-slave issues [21:38:01] it's really just the distro package running on my local clone [21:38:31] 6operations, 7Availability: Set $wmfSwiftCodfwConfig in PrivateSettings - https://phabricator.wikimedia.org/T119651#1888687 (10fgiunchedi) a:5aaron>3fgiunchedi [21:38:52] 7Puppet, 6operations, 10Continuous-Integration-Config: puppet-lint ignores --no-80chars-check option - https://phabricator.wikimedia.org/T121796#1888689 (10Dzahn) i did these things on my laptop. so CI slaves would be unrelated. it was just the Debian package when i run it locally on my clone of the ops/pup... 
[21:39:14] (03PS1) 10Ottomata: Ensure target path exists, but don't ensure directory [puppet] - 10https://gerrit.wikimedia.org/r/259802 [21:39:36] (03PS2) 10Ottomata: Ensure target path exists, but don't ensure directory [puppet] - 10https://gerrit.wikimedia.org/r/259802 [21:39:57] ottomata: I merged your last +2 [21:40:19] oh thanks [21:40:19] ja [21:40:30] i forget to do that when i'm also merging on deployment-puppetmaster in beta [21:40:48] (03CR) 10Ottomata: [C: 032] Ensure target path exists, but don't ensure directory [puppet] - 10https://gerrit.wikimedia.org/r/259802 (owner: 10Ottomata) [21:40:57] 7Puppet, 6operations, 10Continuous-Integration-Config: puppet-lint ignores --no-80chars-check option - https://phabricator.wikimedia.org/T121796#1888690 (10hashar) Do you have a way to reliably reproduce the issue? Or at least some Jenkins builds showing it? That would help. [21:40:57] (03CR) 10BBlack: [C: 032] Revert "Revert "cache_text/mobile: send randomized pass traffic directly to t1 backends"" [puppet] - 10https://gerrit.wikimedia.org/r/259800 (https://phabricator.wikimedia.org/T121564) (owner: 10BBlack) [21:41:22] (03PS2) 10BBlack: Revert "Revert "cache_text/mobile: send randomized pass traffic directly to t1 backends"" [puppet] - 10https://gerrit.wikimedia.org/r/259800 (https://phabricator.wikimedia.org/T121564) [21:41:24] (03PS1) 10Dzahn: backup: ignore arrow alignment warnings [puppet] - 10https://gerrit.wikimedia.org/r/259815 [21:41:37] (03CR) 10BBlack: [V: 032] Revert "Revert "cache_text/mobile: send randomized pass traffic directly to t1 backends"" [puppet] - 10https://gerrit.wikimedia.org/r/259800 (https://phabricator.wikimedia.org/T121564) (owner: 10BBlack) [21:42:39] (03PS3) 10Ori.livneh: Add /api/ listing to www.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/252863 (https://phabricator.wikimedia.org/T118519) (owner: 10GWicke) [21:42:47] (03CR) 10Ori.livneh: [C: 032 V: 032] Add /api/ listing to www.wikimedia.org [puppet] - 
10https://gerrit.wikimedia.org/r/252863 (https://phabricator.wikimedia.org/T118519) (owner: 10GWicke) [21:42:55] (03PS2) 10Dzahn: backup: ignore arrow alignment warnings [puppet] - 10https://gerrit.wikimedia.org/r/259815 [21:43:09] (03PS3) 10Dzahn: backup: ignore arrow alignment warnings [puppet] - 10https://gerrit.wikimedia.org/r/259815 [21:43:26] (03PS4) 10Dzahn: backup: ignore arrow alignment warnings [puppet] - 10https://gerrit.wikimedia.org/r/259815 [21:45:45] (03CR) 10Dzahn: [C: 032] backup: ignore arrow alignment warnings [puppet] - 10https://gerrit.wikimedia.org/r/259815 (owner: 10Dzahn) [21:46:31] (03PS1) 10Dzahn: l10nupdate, geowiki: minimal lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/259857 [21:51:51] 7Puppet, 6operations, 10Continuous-Integration-Config: puppet-lint ignores --no-80chars-check option - https://phabricator.wikimedia.org/T121796#1888734 (10Dzahn) hmm.. i guess that would be: install Debian strech apt-get install puppet-lint git clone https://gerrit.wikimedia.org/r/p/operations.git grep "80... 
[21:54:30] (03PS2) 10Dzahn: l10nupdate, geowiki: minimal lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/259857 [21:54:37] (03CR) 10Dzahn: [C: 032] l10nupdate, geowiki: minimal lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/259857 (owner: 10Dzahn) [21:58:36] (03PS1) 10Dzahn: salt-master: ignore arrow alignement warnings [puppet] - 10https://gerrit.wikimedia.org/r/259863 [21:58:55] (03PS2) 10Dzahn: salt-master: ignore arrow alignment warnings [puppet] - 10https://gerrit.wikimedia.org/r/259863 [21:59:18] (03PS3) 10Dzahn: salt-master: ignore arrow alignment warnings [puppet] - 10https://gerrit.wikimedia.org/r/259863 [21:59:27] (03CR) 10Dzahn: [C: 032] salt-master: ignore arrow alignment warnings [puppet] - 10https://gerrit.wikimedia.org/r/259863 (owner: 10Dzahn) [22:02:15] (03PS1) 10Ottomata: Configure keyholder keys for eventlogging scap deployment on tin [puppet] - 10https://gerrit.wikimedia.org/r/259864 [22:03:15] 7Puppet, 6operations, 10Continuous-Integration-Config: puppet-lint ignores --no-80chars-check option - https://phabricator.wikimedia.org/T121796#1888776 (10hashar) Well with puppet.git at f387ac8a857bd8a0e6ad0a3daffa7a1bd5ea3958 ``` $ cd manifests/role $ puppet-lint *.pp|wc -l 927 $ puppet-lint *.pp|g... [22:03:44] (03PS1) 10Dzahn: contint/hhvm: ignore alignment lint warnings [puppet] - 10https://gerrit.wikimedia.org/r/259865 [22:05:07] 7Puppet, 6operations, 10Continuous-Integration-Config: puppet-lint ignores --no-80chars-check option - https://phabricator.wikimedia.org/T121796#1888796 (10Dzahn) @hashar oh, but how come this doesn't seem to be the case for options other than the 80char check? 
[22:05:28] (03CR) 10Ottomata: [C: 032] Configure keyholder keys for eventlogging scap deployment on tin [puppet] - 10https://gerrit.wikimedia.org/r/259864 (owner: 10Ottomata) [22:05:36] (03PS2) 10Ottomata: Configure keyholder keys for eventlogging scap deployment on tin [puppet] - 10https://gerrit.wikimedia.org/r/259864 [22:05:55] mutante: if i cd manifests/role puppet-lint complains with a lot of different errors ( 927 ) [22:06:01] (03PS2) 10Dzahn: contint/hhvm: ignore alignment lint warnings [puppet] - 10https://gerrit.wikimedia.org/r/259865 [22:06:05] mutante: and 246 of those 927 errors are 'than 80' [22:06:28] mutante: i.e. .puppet-lint.rc is entirely ignored because puppet-lint doesn't look up in parent directories [22:06:29] (03CR) 10Ottomata: [V: 032] Configure keyholder keys for eventlogging scap deployment on tin [puppet] - 10https://gerrit.wikimedia.org/r/259864 (owner: 10Ottomata) [22:07:26] hashar: i see about the .rc in the parent dir, but i used the same commands and other checks seemed to change when i edited the config [22:07:32] hmm [22:08:25] hashar: so always stay in the root of the repo and use full pathes? [22:08:35] yup :/ [22:09:14] i'm still a bit confused how that seemed different for just this one, but my bad [22:09:18] thanks [22:09:28] 7Puppet, 6operations, 10Continuous-Integration-Config: puppet-lint ignores --no-80chars-check option - https://phabricator.wikimedia.org/T121796#1888836 (10hashar) if i cd manifests/role puppet-lint complains with a lot of different errors ( 927 ) and 246 of those 927 errors are 'than 80'. i.e. .puppet-... 
[22:09:51] (03PS3) 10Dzahn: contint/hhvm: ignore alignment lint warnings [puppet] - 10https://gerrit.wikimedia.org/r/259865 [22:10:07] (03CR) 10Dzahn: [C: 032] contint/hhvm: ignore alignment lint warnings [puppet] - 10https://gerrit.wikimedia.org/r/259865 (owner: 10Dzahn) [22:10:15] mutante: I am not sure why it would ignore other errors though [22:11:22] (03PS1) 10Ottomata: Include eventbus role on kafka1001 and kafka1002 [puppet] - 10https://gerrit.wikimedia.org/r/259867 [22:11:44] (03PS2) 10Ottomata: Include eventbus role on kafka1001 and kafka1002 [puppet] - 10https://gerrit.wikimedia.org/r/259867 [22:13:27] (03CR) 10Ottomata: [C: 032] Include eventbus role on kafka1001 and kafka1002 [puppet] - 10https://gerrit.wikimedia.org/r/259867 (owner: 10Ottomata) [22:13:30] hashar: yea, this one is fine: for manifest in $(find . -name *.pp); do echo $manifest; puppet-lint $manifest; done [22:15:02] find . -name * -z .|xargs -z puppet-lint [22:15:03] :D [22:15:08] 7Puppet, 6operations, 10Continuous-Integration-Config: puppet-lint ignores --no-80chars-check option - https://phabricator.wikimedia.org/T121796#1888867 (10Dzahn) if i do it like this, and avoid changing directory, it is working indeed: `for manifest in $(find . -name *.pp); do echo $manifest; puppet-lint $... [22:15:55] mutante: but yeah that is the idea. Were you running puppet-lint to only check for 80-chars errors ? 
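Note on the puppet-lint confusion above: per hashar's finding in the log, .puppet-lint.rc is only picked up from the directory puppet-lint is started in, so running it after `cd manifests/role` ignores the repository's settings. A minimal sketch of the workaround (stay in the repo root, pass full paths) might look like the following. The helper names find_rc_root and lint_command are hypothetical, and the code only builds the command line; it does not call puppet-lint itself.

```python
from pathlib import Path
from typing import List, Optional

RC_NAME = ".puppet-lint.rc"


def find_rc_root(start: Path, rc_name: str = RC_NAME) -> Optional[Path]:
    """Walk upward from `start` and return the first directory holding rc_name."""
    start = start.resolve()
    for directory in (start, *start.parents):
        if (directory / rc_name).is_file():
            return directory
    return None


def lint_command(start: Path, linter: str = "puppet-lint") -> List[str]:
    """Build a single linter invocation for every manifest under `start`.

    Paths are expressed relative to the directory that holds the rc file,
    so the caller should run the command with cwd set to that directory.
    """
    start = start.resolve()
    root = find_rc_root(start)
    if root is None:
        raise FileNotFoundError(f"no {RC_NAME} found at or above {start}")
    manifests = sorted(p.relative_to(root).as_posix() for p in start.rglob("*.pp"))
    return [linter, *manifests]
```

A caller could then run something like `subprocess.run(lint_command(Path("manifests/role")), cwd=find_rc_root(Path("manifests/role")))`. Passing all files to one invocation also avoids the per-file startup cost of a shell loop.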
[22:17:27] hashar: no, i pick one or two checks at a time and enable them, then run something like above and |tee into a file [22:17:37] (03PS1) 10Rush: phabricator: scaffolding to deal with unruly clients [puppet] - 10https://gerrit.wikimedia.org/r/259869 [22:17:44] and then use that to make fixes and eliminate one type of warning globally [22:17:56] and finally re-enable it in the .rc [22:18:13] we can close this ticket as invalid though [22:18:51] 7Puppet, 6operations, 10Continuous-Integration-Config: puppet-lint ignores --no-80chars-check option - https://phabricator.wikimedia.org/T121796#1888904 (10Dzahn) 5Open>3Invalid a:3Dzahn [22:21:27] (03PS1) 10Dzahn: tendril: fix alignment lint warnings [puppet] - 10https://gerrit.wikimedia.org/r/259871 [22:21:51] (03PS2) 10Dzahn: tendril: fix alignment lint warnings [puppet] - 10https://gerrit.wikimedia.org/r/259871 [22:22:06] (03CR) 10Dzahn: [C: 032] tendril: fix alignment lint warnings [puppet] - 10https://gerrit.wikimedia.org/r/259871 (owner: 10Dzahn) [22:23:03] (03PS1) 10Dzahn: jenkins: fix alignment lint warnings [puppet] - 10https://gerrit.wikimedia.org/r/259872 [22:23:51] (03PS1) 10Dzahn: security: fix alignment lint warnings [puppet] - 10https://gerrit.wikimedia.org/r/259873 [22:24:16] (03PS2) 10Dzahn: jenkins: fix alignment lint warnings [puppet] - 10https://gerrit.wikimedia.org/r/259872 [22:25:09] 10Ops-Access-Requests, 6operations: Add James Alexander to Security@ - https://phabricator.wikimedia.org/T121807#1888952 (10Jalexander) 3NEW [22:25:48] mutante: in Jenkins I have been using: find . -type f -name '*.pp' -print0 | barges -t -0 somecommand [22:25:54] grrr [22:26:04] mutante: in Jenkins I have been using: find . -type f -name '*.pp' -print0 | xargs -t -0 somecommand [22:26:19] mutante: so that invokes somecommand file.pp file2.pp file3.pp [22:26:25] and saves the startup overhead of the command [22:27:06] hashar: that works (too). 
thanks [22:27:07] (03CR) 10Hashar: [C: 031] jenkins: fix alignment lint warnings [puppet] - 10https://gerrit.wikimedia.org/r/259872 (owner: 10Dzahn) [22:27:12] 6operations, 6Phabricator, 5Patch-For-Review, 7audits-data-retention: Enable mod_remoteip on Phabricator and ensure logs follow retention guidelines - https://phabricator.wikimedia.org/T114014#1888986 (10chasemp) 5Open>3Resolved a:3chasemp this all seems done [22:27:30] (03CR) 10Dzahn: [C: 032] jenkins: fix alignment lint warnings [puppet] - 10https://gerrit.wikimedia.org/r/259872 (owner: 10Dzahn) [22:27:33] (03CR) 10Hashar: [C: 031] security: fix alignment lint warnings [puppet] - 10https://gerrit.wikimedia.org/r/259873 (owner: 10Dzahn) [22:27:41] :-) [22:27:53] ty. there aren't _that_ many left :) [22:28:10] mutante: well I am gone. Thank you a lot for the Zuul.deb Turns out it unlock some other tasks [22:28:17] and it's not going on forever since once it's "locked in" they won't come back that much [22:28:31] hashar: :) welcome. have a good night then [22:28:49] is there a problem with tin? [22:29:03] nop [22:29:29] http://downforeveryoneorjustme.com/tin.com It's just you. http://tin.com is up. [22:29:35] lol [22:29:38] Krenair: more seriously, what kind of issue are you facing ? [22:29:40] I can't ssh into tin [22:29:47] It says Permission denied (publickey). [22:29:52] ditto Permission denied (publickey). [22:29:59] [tin:~] $ [22:30:05] uhm [22:30:29] yea, failed publickey in log [22:30:39] ottomata: mortals can't reach out tin.eqiad.wmnet could it be related to some of the scap / Eventlogging permissions you have been working on yesterday and potentially today ? 
[22:30:50] But it works against a bunch of other servers [22:30:54] I can log into mira [22:31:02] it is the deployment server [22:31:06] so is mira though [22:31:29] root@mira:~# file /etc/ssh/userkeys/hashar [22:31:30] /etc/ssh/userkeys/hashar: OpenSSH RSA public key [22:31:42] root@tin:/# file /etc/ssh/userkeys/hashar [22:31:42] /etc/ssh/userkeys/hashar: ERROR: cannot open `/etc/ssh/userkeys/hashar' (No such file or directory) [22:31:45] the f... [22:32:13] (03PS2) 10Rush: phabricator: scaffolding to deal with unruly clients [puppet] - 10https://gerrit.wikimedia.org/r/259869 [22:32:17] the users exist, but the keys are gone [22:32:21] hashar: its possible, i just reamred keyholder there [22:32:28] if anything is broken it is probably my fault [22:32:39] ottomata: the key files are just gone :p [22:32:44] see paste above [22:32:49] !!!! [22:32:51] in userkeys? [22:32:53] yea [22:33:07] uhhhh [22:33:09] but the user itself is there [22:33:22] dunno what I could have done that did that... [22:33:27] hmmMmMM [22:33:33] time to dig in puppet.log :-D [22:33:42] 6operations, 10hardware-requests: spare swift disks order - https://phabricator.wikimedia.org/T119698#1889020 (10RobH) [22:33:53] 4310 Dec 17 21:01:09 tin puppet-agent[21250]: (/Stage[main]/Ssh::Server/File[/etc/ssh/userkeys/hashar]/ensure) removed [22:33:56] :p [22:34:02] (03CR) 10Rush: [C: 032] phabricator: scaffolding to deal with unruly clients [puppet] - 10https://gerrit.wikimedia.org/r/259869 (owner: 10Rush) [22:34:06] it removed like all users [22:34:08] except ops [22:34:18] !log Ssh User keys are gone on deployment servers ( tin / mira ) [22:34:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:34:46] HuHhh? 
ok here's what I did: [22:34:48] not on mira hashar [22:34:56] i created a new ssh key for eventlogging [22:35:00] added private key to private repo [22:35:05] added public key to puppet [22:35:07] Krenair: oops I misread [22:35:10] krenair@mira:~$ ls -al /etc/ssh/userkeys | wc -l [22:35:10] 86 [22:35:29] deployed public key using keyholder::agent in puppet [22:35:32] hashar: not on mira, which makes it even weirder [22:35:35] becaues the roles are the same [22:35:35] TZ=C git log --date=local [22:35:36] ^^^ tip of the day [22:35:40] i restarted keyholder [22:35:44] and reamred keyholder [22:35:50] hashar: i did not restart keyholder on mira [22:36:14] I don't think messing with keyholder will cause puppet to remove all users? [22:36:15] not sure how that would matter [22:36:15] right [22:36:20] but that's the only different thing i did [22:36:27] wwwaaathe [22:36:43] ohh [22:36:48] ok, let's check the timestamp [22:36:48] f5685c6ce0c048b06c84c7decd44ea99bd31edc9 [22:36:54] Dec 17 21:01 [22:37:01] now it's 22:36 [22:37:03] +admin::groups: [22:37:04] + # Any group that uses scap and keyholder to deploy [22:37:04] + # will need to be included here. [22:37:05] + - eventlogging-admins [22:37:09] that is in hieradata/hosts/tin.yaml [22:37:18] looks like to me it drop everyone but eventloggings-admins [22:37:38] Yeah. [22:37:38] oh! [22:37:43] i did do that..>..but uhhhh [22:37:43] That will kick out everyone else. [22:37:46] ooook [22:37:52] Why did you add it to tin.yaml instead of the role file? [22:37:58] Isn't eventlogging deployed from mira too? 
[22:38:02] though some spec of YAML let you do merging [22:38:02] do not know whyyYyy [22:38:07] perhaps I should add it to the role [22:38:11] yes, please use the role instead of host name [22:38:14] k fixing [22:38:17] ottomata: I think you overrode the general site admin::groups def [22:38:25] PROBLEM - Check that eventlogging-service-eventbus is running on kafka1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name /srv/deployment/eventlogging/eventbus/bin/eventlogging-service [22:38:35] I believe this is a bug in the hiera handler where it doesn't merge dicts down the tree [22:38:39] Ugh [22:38:41] which role? [22:38:44] should I add to? [22:38:54] hang on let me see what's up [22:39:02] ottomata: role/common/deployment/serer [22:39:04] ah ha! [22:39:04] server [22:39:07] ok mamkes much more sense [22:39:07] thank you [22:39:14] hieradata/role/common/deployment/server.yaml: - deployment [22:39:19] there you go [22:39:24] !log Only tin lost SSH user keys apparently due to https://gerrit.wikimedia.org/r/#/c/253465/ overriding the admin::groups to simply "eventlogging-admins" [22:39:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:39:44] (03PS1) 10Ottomata: Fix admin::groups on tin [puppet] - 10https://gerrit.wikimedia.org/r/259876 [22:39:46] PROBLEM - Check that eventlogging-service-eventbus is running on kafka1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name /srv/deployment/eventlogging/eventbus/bin/eventlogging-service [22:39:55] hey YAML 1.1 let you merge http://yaml.org/type/merge.html :D [22:39:55] After all that merging of tin and mira stuff, someone still managed to commit tin-specific things and got it past review :( [22:39:56] Uh oh ha [22:40:28] (03CR) 10Ottomata: [C: 032 V: 032] Fix admin::groups on tin [puppet] - 10https://gerrit.wikimedia.org/r/259876 (owner: 10Ottomata) [22:40:36] hashar: I believe our handler says it does it but it doesn't seem to work [22:40:42] ottomata: fwiw you 
can do [22:40:51] ./util/hiera_lookup --fqdn=tin.eqiad.wmnet --roles=role::deployment::server admin::groups -v [22:40:56] and see where something comes from [22:40:58] o [22:41:00] oh! that's cool [22:41:06] yaml merge keys and how hiera works are very different. Also the host specific stuff needs to be able to replace more general config and not just augment it [22:41:19] well it's based on type actually [22:41:25] OH COMES NOBODY TOLD ME WE HAD A WAY TO LOOKUP HIERADATA FROM CLI ??? [22:41:26] arrays are meant to merge and hashes [22:41:35] but strings clearly can't [22:41:51] but its' also first value wins so foo[bar] at host level would win over foo[bar] at role level [22:41:58] our handler seems to find the first instance of the foo dict and stop [22:42:01] actually [22:42:04] if you use the native hiera_hash [22:42:07] this would work [22:42:09] where is that ./util chasemp? [22:42:13] puppet/util [22:42:17] chasemp: thank you for the ./utils/hiera_lookup trick. I have never noticed that utility [22:42:28] chasemp: TIL, cool [22:43:03] we need a reddit [22:43:20] https://www.reddit.com/r/Wikimedia hah [22:43:24] /r/wmf-ops-tipn-tricks [22:43:30] first time i open that [22:43:43] hashar: i want to add them to an infobot [22:43:56] i'll puppetize "flooterbuck" *g" [22:44:08] http://flooterbuck.sourceforge.net/ [22:44:25] !hiera [22:44:38] !hiera is ./util/hiera_lookup --fqdn=tin.eqiad.wmnet --roles=role::deployment::server admin::groups -v [22:44:39] Key was added [22:44:48] :) [22:44:49] !hiera | mutante [22:44:49] mutante: ./util/hiera_lookup --fqdn=tin.eqiad.wmnet --roles=role::deployment::server admin::groups -v [22:45:35] it is too late for me, but one would want to add the tip to https://wikitech.wikimedia.org/wiki/Puppet_Hiera [22:45:35] !thx is Thank you very much, $user. [22:45:36] Key was added [22:45:41] !thx | hashar [22:45:42] hashar: Thank you very much, $user. 
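A rough sketch of the lookup behaviour discussed above ("first value wins" versus merging arrays across levels). It is a simplified Python model, not the actual hiera backend or the handler used in the puppet repo. The hierarchy names and the two-level layout are assumptions for illustration. The group names are taken from the conversation (eventlogging-admins in the tin host file, deployment in the role file).

```python
# Hypothetical two-level hierarchy, most specific level first.
# Level names and layout are illustrative assumptions; only the group
# names come from the conversation.
HIERARCHY = [
    ("hosts/tin", {"admin::groups": ["eventlogging-admins"]}),
    ("role/deployment/server", {"admin::groups": ["deployment"]}),
]


def priority_lookup(key, hierarchy):
    """Return the value from the first level that defines the key."""
    for _level, data in hierarchy:
        if key in data:
            return data[key]
    return None


def array_merge_lookup(key, hierarchy):
    """Collect list values from every level, keeping order and dropping duplicates."""
    merged = []
    for _level, data in hierarchy:
        for item in data.get(key, []):
            if item not in merged:
                merged.append(item)
    return merged


if __name__ == "__main__":
    # First match wins: the host-level list hides the role-level one.
    print(priority_lookup("admin::groups", HIERARCHY))
    # Merge lookup: both levels contribute.
    print(array_merge_lookup("admin::groups", HIERARCHY))
```

Under the first-match model the host entry replaces the role entry entirely, which matches what the log shows happening on tin. Moving the group into the role file avoids relying on any merge behaviour at all.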
[22:45:49] hehe [22:46:06] user accounts should be back on tin no [22:46:24] root@tin# [22:46:25] yeah [22:46:27] confirmed, your key is back [22:46:32] nice [22:46:58] ottomata: well done! thanks for showing up :-) [22:47:25] !log ssh to tin is back https://gerrit.wikimedia.org/r/#/c/259876/ [22:47:26] yeah, sorry about that [22:47:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:47:31] ok, i have to run for the eve [22:47:54] * hashar waves [22:47:56] Results (Found 1): hiera, [22:47:56] @regsearch hiera [22:47:57] have a good evening folks [22:49:12] @seenrx hash [22:49:13] mutante: Last time I saw hashar they were talking in the channel, they are still in the channel. It was in #wikimedia-releng at 12/17/2015 10:48:55 PM (17s ago) (multiple results were found: mahmoudhashemi, hashar_, hasharAW, bhashkar, olegshashin and 69 more results) [22:50:34] !thx unalias [22:50:52] !thx del [22:50:52] Successfully removed thx [22:52:56] !thx is $infobot_nick appreciates your help, $1 [22:52:56] Key was added [22:53:00] !thx | hashar [22:53:00] hashar: mutante appreciates your help, [22:53:09] mutante: well done! 
[22:53:37] Danke and guten Abend [22:53:44] bis bald [22:55:41] (03CR) 1020after4: "+1" [puppet] - 10https://gerrit.wikimedia.org/r/259869 (owner: 10Rush) [23:04:31] (03PS1) 10BBlack: Text VCL: Fix up logged-in users caching [puppet] - 10https://gerrit.wikimedia.org/r/259882 [23:05:53] (03PS2) 10Dzahn: security: fix alignment lint warnings [puppet] - 10https://gerrit.wikimedia.org/r/259873 [23:06:20] (03CR) 10Dzahn: [C: 032] security: fix alignment lint warnings [puppet] - 10https://gerrit.wikimedia.org/r/259873 (owner: 10Dzahn) [23:07:25] PROBLEM - puppet last run on elastic1005 is CRITICAL: CRITICAL: Puppet has 1 failures [23:07:34] PROBLEM - puppet last run on mw2071 is CRITICAL: CRITICAL: puppet fail [23:09:41] (03PS1) 10Dzahn: jmxtrans: ignore indentation warnings [puppet/jmxtrans] - 10https://gerrit.wikimedia.org/r/259884 [23:12:33] (03PS1) 10Dzahn: mediawiki,statistics: fix indentation warnings [puppet] - 10https://gerrit.wikimedia.org/r/259885 [23:13:07] (03PS2) 10Dzahn: vagrant,statistics: fix indentation warnings [puppet] - 10https://gerrit.wikimedia.org/r/259885 [23:13:17] (03PS3) 10Dzahn: vagrant,statistics: fix indentation warnings [puppet] - 10https://gerrit.wikimedia.org/r/259885 [23:13:33] (03CR) 10Dzahn: [C: 032] vagrant,statistics: fix indentation warnings [puppet] - 10https://gerrit.wikimedia.org/r/259885 (owner: 10Dzahn) [23:14:43] (03PS1) 10Dzahn: rsyslog: fix indentation warnings [puppet] - 10https://gerrit.wikimedia.org/r/259886 [23:15:31] (03PS2) 10Dzahn: rsyslog: fix indentation warnings [puppet] - 10https://gerrit.wikimedia.org/r/259886 [23:15:48] (03CR) 10Dzahn: [C: 032] rsyslog: fix indentation warnings [puppet] - 10https://gerrit.wikimedia.org/r/259886 (owner: 10Dzahn) [23:17:55] (03PS1) 10Dzahn: zuul/merge: fix lint warning [puppet] - 10https://gerrit.wikimedia.org/r/259888 [23:18:34] (03PS2) 10Dzahn: zuul/merge: fix lint warning [puppet] - 10https://gerrit.wikimedia.org/r/259888 [23:18:55] (03CR) 10Dzahn: [C: 032] zuul/merge: fix lint 
warning [puppet] - 10https://gerrit.wikimedia.org/r/259888 (owner: 10Dzahn) [23:20:29] (03PS1) 10Dzahn: puppetmaster/labs: fix indentation warnings [puppet] - 10https://gerrit.wikimedia.org/r/259889 [23:20:52] (03PS2) 10Dzahn: puppetmaster/labs: fix indentation warnings [puppet] - 10https://gerrit.wikimedia.org/r/259889 [23:21:20] (03CR) 10Dzahn: [C: 032] puppetmaster/labs: fix indentation warnings [puppet] - 10https://gerrit.wikimedia.org/r/259889 (owner: 10Dzahn) [23:25:19] !hiera | mutante [23:25:19] mutante: ./util/hiera_lookup --fqdn=tin.eqiad.wmnet --roles=role::deployment::server admin::groups -v [23:28:44] heh [23:31:51] bblack: there is even a regex based search for info on the bot:) [23:31:55] RECOVERY - puppet last run on elastic1005 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [23:32:30] uses it more to keep little snippets [23:35:00] except usr/lib/ruby/vendor_ruby/hiera/config.rb:31:in `load': Config file /home/dzahn/puppet/modules/puppetmaster/files/production.hiera.yaml not found (RuntimeError) [23:35:04] hrmmm [23:36:04] RECOVERY - puppet last run on mw2071 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:36:57] !hiera [23:36:57] ./util/hiera_lookup --fqdn=tin.eqiad.wmnet --roles=role::deployment::server admin::groups -v [23:37:07] !confctl [23:37:28] !confctl is ... [23:37:42] hmmm maybe a better one would be [23:37:44] !depool [23:37:49] No results were found, remember, the bot is searching through content of keys and their names [23:37:49] @regsearch pool [23:38:04] !depool is for s in nginx varnish-fe varnish-be varnish-be-rand; do confctl --tags dc=eqiad,cluster=cache_text,service=$s --action set/pooled=no cp1053.eqiad.wmnet; done [23:38:04] Key was added [23:38:07] :) [23:38:17] !help [23:38:17] want docs? ask for "!wm-bot". all keywords? 
try "@regsearch .*" [23:38:24] https://meta.wikimedia.org/wiki/Wm-bot [23:39:13] Results (Found 1): depool, [23:39:13] @regsearch nginx [23:40:09] Results (Found 8): git, help, instancelist, instance-json, credentials, logging, bots, depool, [23:40:09] @regsearch for [23:40:18] !instancelist [23:40:18] https://labsconsole.wikimedia.org/w/index.php?title=Special:Ask&offset=0&limit=100&q=[[Resource+Type%3A%3Ainstance]]&p=format%3Dbroadtable&po=%3FInstance+Name%0A%3FInstance+Type%0A%3FProject%0A%3FImage+Id%0A%3FFQDN%0A%3FLaunch+Time%0A%3FPuppet+Class%0A%3FModification+date%0A%3FInstance+Host%0A%3FNumber+of+CPUs%0A%3FRAM+Size%0A%3FAmount+of+Storage%0A [23:40:26] !credentials [23:40:26] when you see No Nova credentials found for your account just relog to wiki and should be ok [23:40:32] !bots [23:40:32] bot down? http://wikitech.wikimedia.org/view/Category:Bots | proposal for new bot infra: http://www.mediawiki.org/wiki/Wikimedia_Labs/Create_a_bot_running_infrastructure [23:40:39] !git [23:40:39] for more information about git on labs see https://labsconsole.wikimedia.org/wiki/Git [23:40:48] Results (Found 69): puppet, instance, morebots, git, bang, nagios, bot, labs-home-wm, labs-nagios-wm, labs-morebots, gerrit-wm, wiki, labs, bastion, extension, wm-bot, projects, putty, gerrit, wikitech, revision, monitor, alert, password, unicorn, help, bz, os-change, instancelist, instance-json, leslie's-reset, damianz's-reset, amend, credentials, queue, socks-proxy, info, security, logging, ask, sudo, access, $realm, keys, $site, bug, pageant, blueprint-dns, stucked, pxe, ghsh, group, pathconflict, terminology, rt, erb, regsubst, bots, wt, gerrit-search, change, dn, opshelp, testwiki, sal, task, hiera, thx, depool, [23:40:48] @regsearch . [23:40:51] :) [23:41:06] !leslie's-reset [23:41:06] git reset --hard origin/test [23:41:24] !$realm [23:41:24] $realm is a variable used in puppet to determine which cluster a system is in. See also $site. 
[23:41:36] "cluster" is such an overloaded word now [23:41:50] @rss+ debiansec https://www.debian.org/security/dsa [23:41:51] Item was inserted to feed [23:41:56] @rss-on [23:41:56] Permission denied [23:41:56] !cluster is such an overloaded word now [23:41:56] Key was added [23:41:59] :p [23:42:00] but I guess it would sound silly to say "$realm is a variable used in puppet to determine which realm a system is in" [23:42:04] http://bots.wmflabs.org/~wm-bot/db/%23wikimedia-operations.htm [23:42:17] !log mathoid deploying 8d2295 [23:42:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:44:43] plenty of labsconsole, rt, svn, bugzilla, etc. in there [23:45:39] [[test]] [23:45:40] @link [23:45:40] http://enwp.org/test [23:45:46] ah :) [23:47:37] The quick [[brown]] [[fox]] jumps over the [[lazy]] [[de:dog]]. [23:47:41] @link [23:47:41] http://enwp.org/brown http://enwp.org/fox http://enwp.org/lazy https://de.wikipedia.org/wiki/dog [23:48:09] gotta use w.wiki [23:50:29] 6operations, 6Discovery, 7Elasticsearch, 7Epic: EPIC: Cultivating the Elasticsearch garden (operational lessons from 1.7.1 upgrade) - https://phabricator.wikimedia.org/T109089#1889328 (10Deskana)