[00:00:04] twentyafterfour: Respected human, time to deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160407T0000). Please do the needful. [00:08:48] (03PS10) 10Yuvipanda: docker: Add nginx frontend for registry [puppet] - 10https://gerrit.wikimedia.org/r/281998 [00:09:50] (03CR) 10Yuvipanda: [C: 032 V: 032] docker: Add nginx frontend for registry [puppet] - 10https://gerrit.wikimedia.org/r/281998 (owner: 10Yuvipanda) [00:23:28] matt_flaschen: verified the deployment worked [00:23:33] thanks again [00:23:40] No problem [01:16:17] PROBLEM - puppet last run on mw1124 is CRITICAL: CRITICAL: Puppet has 1 failures [01:41:26] RECOVERY - puppet last run on mw1124 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [01:56:06] PROBLEM - DPKG on db2047 is CRITICAL: Timeout while attempting connection [01:56:15] PROBLEM - dhclient process on db2047 is CRITICAL: Timeout while attempting connection [01:56:15] PROBLEM - MariaDB Slave IO: s7 on db2047 is CRITICAL: Timeout while attempting connection [01:56:16] PROBLEM - configured eth on db2047 is CRITICAL: Timeout while attempting connection [01:56:16] PROBLEM - puppet last run on db2047 is CRITICAL: Timeout while attempting connection [01:56:36] PROBLEM - Disk space on db2047 is CRITICAL: Timeout while attempting connection [01:56:45] PROBLEM - MariaDB Slave Lag: s7 on db2047 is CRITICAL: Timeout while attempting connection [01:56:46] PROBLEM - salt-minion processes on db2047 is CRITICAL: Timeout while attempting connection [01:56:55] PROBLEM - RAID on db2047 is CRITICAL: Timeout while attempting connection [01:56:55] PROBLEM - MariaDB Slave SQL: s7 on db2047 is CRITICAL: Timeout while attempting connection [01:57:00] PROBLEM - MariaDB disk space on db2047 is CRITICAL: Timeout while attempting connection [01:57:26] PROBLEM - Check size of conntrack table on db2047 is CRITICAL: Timeout while attempting connection [01:57:50] PROBLEM - mysqld processes on 
db2047 is CRITICAL: Timeout while attempting connection [02:03:02] ^trying it db2047 anyone know what's up? [02:06:35] unresponsive and console just dumps out 'BUG: soft lockup - CPU#26 stuck for 22s! [migration/26:267]' [02:06:39] for ...seemingly all cpu's [02:06:42] so there is that [02:08:29] chasemp: do you know if it's a slave or not? [02:08:31] * YuviPanda checks [02:08:53] oh, it's codfw [02:08:56] yes [02:09:03] I'm rebooting as no other choice I think [02:09:15] yeah [02:09:20] !log reboot db2047 (it's totally stuck w/ BUG: soft lockup - CPU#26 stuck for 22s! [migration/26:267] across) [02:09:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:09:57] tendril shows no spikes or whatever for it [02:10:01] https://tendril.wikimedia.org/host/view/db2047.codfw.wmnet/3306 [02:10:05] probably random hardware/kernel bug [02:10:37] RECOVERY - DPKG on db2047 is OK: All packages OK [02:10:37] RECOVERY - dhclient process on db2047 is OK: PROCS OK: 0 processes with command name dhclient [02:10:46] RECOVERY - configured eth on db2047 is OK: OK - interfaces up [02:10:46] RECOVERY - puppet last run on db2047 is OK: OK: Puppet is currently enabled, last run 24 minutes ago with 0 failures [02:11:06] RECOVERY - Disk space on db2047 is OK: DISK OK [02:11:15] RECOVERY - salt-minion processes on db2047 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [02:11:25] RECOVERY - RAID on db2047 is OK: OK: no RAID installed [02:11:30] RECOVERY - MariaDB disk space on db2047 is OK: DISK OK [02:11:56] RECOVERY - Check size of conntrack table on db2047 is OK: OK: nf_conntrack is 0 % full [02:12:10] this kind of thing is pretty reassuring :) [02:12:11] mcelog: Warning: cpu 39 offline?, imc_log not set#012: No such file or directory [02:13:07] heh [02:13:14] at least it's only a slave in the nonactive dc [02:13:34] I'm going to drop a ticket w/ a few notes for volan.s [02:13:46] considering codfw etc and call it for the
moment [02:13:52] assuming I don't find anything more specific [02:16:54] chasemp: +1 sounds good [02:17:33] 6Operations, 10DBA: db2047 froze up and had to be hard rebooted (possible hardware error) - https://phabricator.wikimedia.org/T132011#2185858 (10chasemp) [02:18:04] ok it grows late here adios till the morning YuviPanda [02:21:53] chasemp: cya! [02:34:36] PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 51.72% of data above the critical threshold [5000000.0] [02:38:31] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.19) (duration: 17m 51s) [02:38:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:45:05] RECOVERY - Kafka Broker Replica Max Lag on kafka1018 is OK: OK: Less than 50.00% above the threshold [1000000.0] [02:57:26] PROBLEM - Kafka Broker Replica Max Lag on kafka1014 is CRITICAL: CRITICAL: 65.52% of data above the critical threshold [5000000.0] [02:58:37] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.20) (duration: 10m 00s) [02:58:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:59:46] PROBLEM - Kafka Broker Replica Max Lag on kafka1013 is CRITICAL: CRITICAL: 73.33% of data above the critical threshold [5000000.0] [03:08:28] !log l10nupdate@tin ResourceLoader cache refresh completed at Thu Apr 7 03:08:28 UTC 2016 (duration 9m 52s) [03:08:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:08:38] !log backfilled missing SAL entries from 2016-04-04T17:30Z to 2016-04-05T20:36Z to https://tools.wmflabs.org/sal/production [03:08:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:16:33] 6Operations, 10Continuous-Integration-Infrastructure, 10Phabricator, 10netops, and 4 others: Make sure phab can talk to gearman and nodepool instances can talk to phabricator - https://phabricator.wikimedia.org/T131375#2185883 (10mmodell) 5Open>3Resolved a:5mmodell>3chasemp Thanks
@chasemp! It work... [03:16:48] 6Operations, 10Continuous-Integration-Infrastructure, 10Phabricator, 10netops, and 4 others: Make sure phab can talk to gearman and nodepool instances can talk to phabricator - https://phabricator.wikimedia.org/T131375#2185887 (10mmodell) [03:25:36] PROBLEM - puppet last run on db2003 is CRITICAL: CRITICAL: puppet fail [03:45:47] RECOVERY - Kafka Broker Replica Max Lag on kafka1013 is OK: OK: Less than 50.00% above the threshold [1000000.0] [03:50:35] RECOVERY - Kafka Broker Replica Max Lag on kafka1014 is OK: OK: Less than 50.00% above the threshold [1000000.0] [03:52:15] RECOVERY - puppet last run on db2003 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [04:59:16] PROBLEM - puppet last run on elastic2007 is CRITICAL: CRITICAL: puppet fail [05:27:27] RECOVERY - puppet last run on elastic2007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [05:57:49] PROBLEM - mysqld processes on db2047 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [05:59:05] <_joe_> volans: ^^ [05:59:09] <_joe_> should I worry? [05:59:21] <_joe_> (I doubt volans is even awake, but still..) [06:10:40] (03CR) 10KartikMistry: "maps/kartotherian uses 'sources:' to call different config file(s)/values from Puppet, but this is specific to kartotherian." [puppet] - 10https://gerrit.wikimedia.org/r/277463 (https://phabricator.wikimedia.org/T129849) (owner: 10KartikMistry) [06:22:55] (03PS1) 10Yuvipanda: devpi: Add module + role [puppet] - 10https://gerrit.wikimedia.org/r/282102 [06:24:05] _joe_: we dealt with it a little earlier, possibly hardware issue? 
https://phabricator.wikimedia.org/T132011#2185858 [06:24:43] (03PS2) 10Yuvipanda: devpi: Add module + role [puppet] - 10https://gerrit.wikimedia.org/r/282102 [06:27:45] (03PS3) 10Yuvipanda: devpi: Add module + role [puppet] - 10https://gerrit.wikimedia.org/r/282102 [06:30:56] PROBLEM - puppet last run on nobelium is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:56] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:06] PROBLEM - puppet last run on elastic2007 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:25] PROBLEM - puppet last run on eventlog2001 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:26] PROBLEM - puppet last run on mw2018 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:55] PROBLEM - puppet last run on restbase1007 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:56] PROBLEM - puppet last run on mw2207 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:30] (03PS4) 10Yuvipanda: devpi: Add module + role [puppet] - 10https://gerrit.wikimedia.org/r/282102 [06:34:26] _joe_: thanks [06:37:22] (03PS5) 10Yuvipanda: devpi: Add module + role [puppet] - 10https://gerrit.wikimedia.org/r/282102 [06:47:09] (03PS1) 10Volans: Depool crashed db2047, needs to be reimaged [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282103 (https://phabricator.wikimedia.org/T132011) [06:49:25] (03CR) 10Nikerabbit: "I would like to note that we failed one of our quarterly goals because we have not been able to deploy new language pairs. 
There were othe" [puppet] - 10https://gerrit.wikimedia.org/r/277463 (https://phabricator.wikimedia.org/T129849) (owner: 10KartikMistry) [06:50:48] (03PS2) 10Giuseppe Lavagetto: mediawiki::packages: drop precise compatibility [puppet] - 10https://gerrit.wikimedia.org/r/281409 (https://phabricator.wikimedia.org/T126310) [06:51:07] (03CR) 10Giuseppe Lavagetto: mediawiki::packages: drop precise compatibility (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/281409 (https://phabricator.wikimedia.org/T126310) (owner: 10Giuseppe Lavagetto) [06:51:48] ACKNOWLEDGEMENT - MariaDB Slave IO: s7 on db2047 is CRITICAL: CRITICAL slave_io_state could not connect Volans Crashed, needs reimage: https://phabricator.wikimedia.org/T132011 [06:51:48] ACKNOWLEDGEMENT - MariaDB Slave Lag: s7 on db2047 is CRITICAL: CRITICAL slave_sql_lag could not connect Volans Crashed, needs reimage: https://phabricator.wikimedia.org/T132011 [06:51:48] ACKNOWLEDGEMENT - MariaDB Slave SQL: s7 on db2047 is CRITICAL: CRITICAL slave_sql_state could not connect Volans Crashed, needs reimage: https://phabricator.wikimedia.org/T132011 [06:51:51] ACKNOWLEDGEMENT - mysqld processes on db2047 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld Volans Crashed, needs reimage: https://phabricator.wikimedia.org/T132011 [06:52:29] (03PS3) 10Giuseppe Lavagetto: mediawiki::packages: drop precise compatibility [puppet] - 10https://gerrit.wikimedia.org/r/281409 (https://phabricator.wikimedia.org/T126310) [06:55:49] (03CR) 10Muehlenhoff: [C: 031] mediawiki::packages: drop precise compatibility [puppet] - 10https://gerrit.wikimedia.org/r/281409 (https://phabricator.wikimedia.org/T126310) (owner: 10Giuseppe Lavagetto) [06:56:16] RECOVERY - puppet last run on nobelium is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [06:56:16] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [06:56:46] RECOVERY - puppet last 
run on mw2018 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [06:57:06] RECOVERY - puppet last run on restbase1007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:16] RECOVERY - puppet last run on mw2207 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [06:58:25] RECOVERY - puppet last run on elastic2007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:36] RECOVERY - puppet last run on eventlog2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:01:47] (03CR) 10Volans: [C: 032] Depool crashed db2047, needs to be reimaged [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282103 (https://phabricator.wikimedia.org/T132011) (owner: 10Volans) [07:02:12] (03Merged) 10jenkins-bot: Depool crashed db2047, needs to be reimaged [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282103 (https://phabricator.wikimedia.org/T132011) (owner: 10Volans) [07:04:45] !log volans@tin Synchronized wmf-config/db-codfw.php: Depool crashed db2047, needs to be reimaged T132011 (duration: 00m 38s) [07:04:47] T132011: db2047 froze up and had to be hard rebooted (possible hardware error) - https://phabricator.wikimedia.org/T132011 [07:04:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:05:55] (03PS6) 10Yuvipanda: devpi: Add module + role [puppet] - 10https://gerrit.wikimedia.org/r/282102 [07:06:20] 6Operations, 10DBA, 13Patch-For-Review: db2047 froze up and had to be hard rebooted (possible hardware error) - https://phabricator.wikimedia.org/T132011#2186165 (10Volans) Depooled from mediawiki-config, in any case the DB needs to be reimported and given is on Trusty better to reimage it too. I'll check fo... [07:06:37] (03CR) 10Elukey: "Very good point, I haven't thought about it.. 
I checked two things:" [puppet] - 10https://gerrit.wikimedia.org/r/278861 (https://phabricator.wikimedia.org/T128587) (owner: 10ArielGlenn) [07:19:28] (03PS1) 10Alexandros Kosiaris: Add a comment about the lvs_grain's information purposes [puppet] - 10https://gerrit.wikimedia.org/r/282104 [07:22:03] (03CR) 10Alexandros Kosiaris: [C: 032] Add a comment about the lvs_grain's information purposes [puppet] - 10https://gerrit.wikimedia.org/r/282104 (owner: 10Alexandros Kosiaris) [07:24:28] 6Operations: Default gateway unreachable on baham.wikimedia.org after reboot - https://phabricator.wikimedia.org/T131966#2186170 (10MoritzMuehlenhoff) We had the same problem before with eeden on the 2nd of February, see the log (starting at 13:58:38 with my reboot): http://wm-bot.wmflabs.org/browser/index.php?s... [08:02:32] akosiaris: thank you for merging my patch on wikilabels :) [08:09:29] (03CR) 10Alexandros Kosiaris: "@Kartik, service-runner does not provide that feature, hence the specific solution for kartotherian/tilerator/tileratorui. While it would " [puppet] - 10https://gerrit.wikimedia.org/r/277463 (https://phabricator.wikimedia.org/T129849) (owner: 10KartikMistry) [08:15:45] akosiaris: thanks! [08:16:07] Amir1: got more coming ;-) [08:16:15] \o/ [08:16:22] awesome [08:16:23] thanks [08:17:04] !log renamed duplicate cxserver-deploy and cxserver repos in phabricator to citoid-deploy and citoid respectively [08:17:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:17:33] same name given to multiple repos ... that was confusing [08:20:21] akosiaris: yep. Thanks. 
I was going to ask that :) [08:22:15] !log stop cassandra bootstrap of restbase2004-b, not enough disk [08:22:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:25:36] PROBLEM - cassandra-b service on restbase2004 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed [08:27:23] ACKNOWLEDGEMENT - cassandra-b service on restbase2004 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed Filippo Giunchedi bootstrap failed [08:30:08] (03CR) 10Filippo Giunchedi: [C: 031] "Dereckson, to answer your question on IRC: yes if mediawiki logs a timeout expired whenever it happens we'll be able to find it, I haven't" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281823 (https://phabricator.wikimedia.org/T118887) (owner: 10Dereckson) [08:31:25] 6Operations, 5Continuous-Integration-Scaling, 7HHVM: Provide a HHVM package for jessie-wikimedia matching version of trusty-wikimedia - https://phabricator.wikimedia.org/T125821#2186242 (10Joe) So in T131755 I built the HHVM packages for jessie, BUT: they are using a newer version of libicu (which is availab... [08:35:56] PROBLEM - puppet last run on restbase2004 is CRITICAL: CRITICAL: Puppet has 1 failures [08:41:23] (03CR) 10Filippo Giunchedi: [C: 04-1] "commit message says it is waiting on https://gerrit.wikimedia.org/r/#/c/281571/ and a comment from gwicke says the code is deployed, not s" [puppet] - 10https://gerrit.wikimedia.org/r/281740 (https://phabricator.wikimedia.org/T126571) (owner: 10Ppchelko) [08:41:56] ACKNOWLEDGEMENT - puppet last run on restbase2004 is CRITICAL: CRITICAL: Puppet has 1 failures Filippo Giunchedi restbase2004-b bootstrap failed [08:43:41] 6Operations, 5Continuous-Integration-Scaling, 7HHVM: Provide a HHVM package for jessie-wikimedia matching version of trusty-wikimedia - https://phabricator.wikimedia.org/T125821#2186279 (10akosiaris) OK, then. Seems like we got this done way faster than I anticipated. 
I suppose it should be enough for CI nee... [08:53:08] (03PS1) 10Muehlenhoff: Enable base::firewall on netmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/282113 [08:53:14] (03PS3) 10KartikMistry: WIP: cxserver: Read config from cxserver/deploy [puppet] - 10https://gerrit.wikimedia.org/r/278235 (https://phabricator.wikimedia.org/T122498) [08:58:57] (03PS1) 10Giuseppe Lavagetto: hhvm: make the perf maps cron work under systemd [puppet] - 10https://gerrit.wikimedia.org/r/282114 [08:59:17] !log removed aqs1001.eqiad.wmnet from LVS pool via confd for nodejs upgrade [08:59:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:59:38] <_joe_> elukey: it's confctl, confd is another (externally written) software [09:01:44] _joe_ yes sorry I realized it only after logging and I knew that your parser would have spotted it :P [09:02:28] * elukey proceeds with the upgrade hiding from Joe [09:02:45] 6Operations, 7HHVM: Switch HAT appservers to trusty's ICU (or newer) - https://phabricator.wikimedia.org/T86096#2186298 (10Joe) Since we are at the point where there are no precise machines left running php, we should really build HHVM with libicu52 and make the conversion now. The process will be: # Removin... 
[09:02:55] 6Operations, 7HHVM: Switch HAT appservers to trusty's ICU (or newer) - https://phabricator.wikimedia.org/T86096#2186299 (10Joe) p:5Normal>3High [09:08:45] PROBLEM - AQS root url on aqs1001 is CRITICAL: Connection refused [09:08:56] PROBLEM - aqs endpoints health on aqs1001 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.0.123, port=7232): Max retries exceeded with url: /analytics.wikimedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [09:10:09] ---^ and this is me, silencing it [09:12:25] RECOVERY - AQS root url on aqs1001 is OK: HTTP OK: HTTP/1.1 200 - 727 bytes in 0.016 second response time [09:12:36] RECOVERY - aqs endpoints health on aqs1001 is OK: All endpoints are healthy [09:19:33] !log re-added aqs1001.eqiad.wmnet to LVS pool via confctl [09:19:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:27:05] (03PS1) 10Aude: Replace my ssh key [puppet] - 10https://gerrit.wikimedia.org/r/282119 [09:28:53] !log de-pooling/re-pooling aqs100[23].eqiad.wmnet for nodejs upgrade [09:28:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:30:06] (03CR) 10Hoo man: [C: 031] "Katie is sitting next to me." [puppet] - 10https://gerrit.wikimedia.org/r/282119 (owner: 10Aude) [09:32:43] 6Operations, 7HHVM, 7user-notice: Switch HAT appservers to trusty's ICU (or newer) - https://phabricator.wikimedia.org/T86096#2186327 (10matmarex) >>! In T86096#2186298, @Joe wrote: > 3. Installing it fleet-wide (during this phase, what problems can we expect?) > >I have no idea what the user impact could be... 
[09:35:42] !log deploying HHVM HTTP pool sizing on all MW nodes (T131839 / https://gerrit.wikimedia.org/r/#/c/281881), not used yet, no impact expected [09:35:43] T131839: Activate SSL + connection pooling for CirrusSearch on PROD - https://phabricator.wikimedia.org/T131839 [09:35:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:36:01] (03PS7) 10Gehel: Activate SSL + connection pooling for CirrusSearch on PROD [puppet] - 10https://gerrit.wikimedia.org/r/281881 (https://phabricator.wikimedia.org/T131839) [09:37:29] 6Operations, 10MediaWiki-General-or-Unknown, 7HHVM: Convert the hhvm puppet module to be compatible with Debian jessie - https://phabricator.wikimedia.org/T131756#2186354 (10Joe) a:3Joe [09:38:27] (03CR) 10Gehel: [C: 032] Activate SSL + connection pooling for CirrusSearch on PROD [puppet] - 10https://gerrit.wikimedia.org/r/281881 (https://phabricator.wikimedia.org/T131839) (owner: 10Gehel) [09:38:27] 6Operations, 10MediaWiki-General-or-Unknown, 7HHVM: Convert the hhvm puppet module to be compatible with Debian jessie - https://phabricator.wikimedia.org/T131756#2177370 (10Joe) the hhvm module has been converted to base::system_unit already, and has now an (hopefully working, but still untested) support fo... [09:39:47] <_joe_> gehel: did you deactivated puppet on the mw* hosts? [09:40:00] <_joe_> *deactivate [09:40:08] _joe_: yes, following https://wikitech.wikimedia.org/wiki/Application_servers#Deploying_config [09:40:38] <_joe_> yeah, you don't need to force the puppet run though [09:40:49] <_joe_> this is not an externally visible change [09:40:51] just wait 15' ? [09:40:58] <_joe_> 30 [09:41:00] <_joe_> overall [09:41:05] <_joe_> but I think that's fine [09:41:24] <_joe_> I mean, as long as you check on a couple of servers immediately [09:41:29] I'll still apply on the test nodes first to test .. 
[09:41:35] <_joe_> yes [09:41:39] <_joe_> that's the spirit :P [09:41:56] Ok, then I should be good (you had me worried for a second...) [09:42:19] <_joe_> no I was trying to help, but you got it covered already [09:42:26] looks good on mw1017... [09:42:27] <_joe_> go on and sorry for the interrupt [09:43:35] thanks for the help! It is welcomed! I'm still a bit tense when touching that cluster... At former job, only money was involved when I screwed up. I feel the stakes are higher here... [09:44:13] <_joe_> yeah but we're a friendlier environment [09:44:41] <_joe_> :) [09:45:44] Job^1 was an extremely friendly environment too. I had a coworker reset passwords for > 3M accounts, it just cost him a few beers... [09:46:23] <_joe_> heh [09:50:56] RECOVERY - HHVM rendering on mw1132 is OK: HTTP OK: HTTP/1.1 200 OK - 67045 bytes in 3.261 second response time [09:51:22] Ok, all looks good (as expected). I'll be around in case and ready for the real change at 12:00 UTC [09:51:25] RECOVERY - Apache HTTP on mw1132 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.096 second response time [09:53:26] RECOVERY - HHVM rendering on mw1141 is OK: HTTP OK: HTTP/1.1 200 OK - 67051 bytes in 0.566 second response time [09:53:26] RECOVERY - Apache HTTP on mw1200 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.073 second response time [09:53:31] (03PS1) 10Muehlenhoff: Use bastionhost::2fa role on iron [puppet] - 10https://gerrit.wikimedia.org/r/282122 [09:54:05] RECOVERY - Apache HTTP on mw1141 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.063 second response time [09:54:17] RECOVERY - HHVM rendering on mw1200 is OK: HTTP OK: HTTP/1.1 200 OK - 67043 bytes in 0.128 second response time [09:55:10] I should have checked before deploying, but I now see a number of mw* hosts in error in Icinga. Some for > 10h. 
[09:55:24] 6Operations, 6Analytics-Kanban: Upgrade aqs* to nodejs 4.3 - https://phabricator.wikimedia.org/T123629#2186361 (10elukey) >>! In T123629#2184014, @Eevans wrote: > Do we need the Cassandra restart here? Nope! Just upgraded node only stopping restbase: ``` aqs1003.eqiad.wmnet: ii nodejs... [09:55:24] (03PS1) 10ArielGlenn: set up scap config and target host list for dumps scap3 deployment [dumps/scap] - 10https://gerrit.wikimedia.org/r/282123 [09:55:27] (03PS2) 10Giuseppe Lavagetto: hhvm: make the perf maps cron work under systemd [puppet] - 10https://gerrit.wikimedia.org/r/282114 [09:55:33] <_joe_> gehel: yeah I was looking at those [09:57:13] _joe_: I saw a few in error for ~15', that have now recovered. Timing seems a bit correlated with my change... [09:57:25] <_joe_> that is due to hhvm restarts [09:58:19] <_joe_> interesting [09:58:23] (03CR) 10ArielGlenn: [C: 032 V: 032] set up scap config and target host list for dumps scap3 deployment [dumps/scap] - 10https://gerrit.wikimedia.org/r/282123 (owner: 10ArielGlenn) [09:58:31] so expected, does it have significant end user impact? Or do we have a mechanism to automatically depool on HHVM restart? [09:59:26] <_joe_> ok, interestingly, on mw1114 we have a defunct children process [09:59:47] also looking at it... 
[09:59:57] <_joe_> I'm restarting hhvm there [10:01:55] RECOVERY - Apache HTTP on mw1114 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.041 second response time [10:02:55] RECOVERY - HHVM rendering on mw1114 is OK: HTTP OK: HTTP/1.1 200 OK - 67049 bytes in 0.115 second response time [10:03:44] <_joe_> !log restarted hhvm on mw1114 (one of the child processes was defunct) and on mw1173 (deadlock in HPHP::Treadmill::getAgeOldestRequest) [10:03:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:05:55] RECOVERY - Apache HTTP on mw1173 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.051 second response time [10:05:56] RECOVERY - HHVM rendering on mw1173 is OK: HTTP OK: HTTP/1.1 200 OK - 67050 bytes in 0.124 second response time [10:07:06] (03PS1) 10ArielGlenn: actually use dumps scap repo for scap3 deployment config [puppet] - 10https://gerrit.wikimedia.org/r/282129 [10:12:20] _joe_ where did you find the deadlock info? (curious) [10:12:45] <_joe_> elukey: hhvm-dump-debug [10:13:39] 6Operations, 6Analytics-Kanban, 10Traffic: varnishkafka logrotate cronspam - https://phabricator.wikimedia.org/T129344#2186374 (10elukey) @faidon do you prefer to upgrade the vk package anyway or the above solution would suffice for the moment? [10:13:46] RECOVERY - Apache HTTP on mw1140 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 615 bytes in 0.161 second response time [10:14:07] <_joe_> rotfl [10:14:36] RECOVERY - HHVM rendering on mw1140 is OK: HTTP OK: HTTP/1.1 200 OK - 67050 bytes in 0.398 second response time [10:14:40] <_joe_> that ^^ is hhvm crashing while I tried to get a core dump [10:22:13] 6Operations, 6Analytics-Kanban, 10Traffic: varnishkafka logrotate cronspam - https://phabricator.wikimedia.org/T129344#2186378 (10faidon) The Varnishkafka package shipping an rsyslog config is a bug in itself, so we should queue a fix for it. As you've already handled the issue at hand via puppet, it's okay... 
[10:22:55] 6Operations, 6Analytics-Kanban: Upgrade aqs* to nodejs 4.3 - https://phabricator.wikimedia.org/T123629#2186379 (10elukey) I probably needed some coffee, I upgraded nodejs-legacy too (thanks @MoritzMuehlenhoff for the hint): ``` aqs1002.eqiad.wmnet: ii nodejs 4.3.0~dfsg-2~wmf1... [10:30:42] * elukey checks https://wikitech.wikimedia.org/wiki/HHVM/Troubleshooting [10:31:22] (03CR) 10ArielGlenn: [C: 032] actually use dumps scap repo for scap3 deployment config [puppet] - 10https://gerrit.wikimedia.org/r/282129 (owner: 10ArielGlenn) [10:41:04] (03PS1) 10ArielGlenn: remove now outdated scap dir setup and keyholder from snapshot manifests [puppet] - 10https://gerrit.wikimedia.org/r/282132 [10:45:08] (03PS2) 10Muehlenhoff: Use bastionhost::2fa role on iron [puppet] - 10https://gerrit.wikimedia.org/r/282122 [10:45:15] (03CR) 10Muehlenhoff: [C: 032 V: 032] Use bastionhost::2fa role on iron [puppet] - 10https://gerrit.wikimedia.org/r/282122 (owner: 10Muehlenhoff) [10:46:06] (03PS3) 10Gehel: Activate SSL and connection pooling for CirrusSearch. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/281882 (https://phabricator.wikimedia.org/T131839) [10:46:57] 6Operations, 10MediaWiki-General-or-Unknown, 7HHVM: Backport nutcracker 0.4.1 to jessie - https://phabricator.wikimedia.org/T132032#2186385 (10Joe) [10:47:07] 6Operations, 10MediaWiki-General-or-Unknown, 7HHVM: Backport nutcracker 0.4.1 to jessie - https://phabricator.wikimedia.org/T132032#2186385 (10Joe) a:5Joe>3None [10:49:32] (03PS2) 10ArielGlenn: remove now outdated scap dir setup and keyholder from snapshot manifests [puppet] - 10https://gerrit.wikimedia.org/r/282132 [10:51:33] (03CR) 10ArielGlenn: [C: 032] remove now outdated scap dir setup and keyholder from snapshot manifests [puppet] - 10https://gerrit.wikimedia.org/r/282132 (owner: 10ArielGlenn) [10:52:54] 6Operations, 10MediaWiki-General-or-Unknown, 7HHVM: Backport libvpx2 to jessie - https://phabricator.wikimedia.org/T132033#2186400 (10Joe) [10:53:07] (03CR) 10TheDJ: [C: 031] "https://commons.wikimedia.org/wiki/Commons:Village_pump/Proposals/Archive/2015/06#Allow_WebP_upload" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281973 (https://phabricator.wikimedia.org/T27397) (owner: 10Matanya) [10:54:47] PROBLEM - puppet last run on mira is CRITICAL: CRITICAL: Puppet has 1 failures [11:20:06] RECOVERY - puppet last run on mira is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [11:31:22] (03PS1) 10Elukey: Remove logrotate/syslog configurations. [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/282135 (https://phabricator.wikimedia.org/T129344) [11:44:01] (03PS1) 10Ema: Misc cluster VTC tests [puppet] - 10https://gerrit.wikimedia.org/r/282137 (https://phabricator.wikimedia.org/T128188) [11:46:20] (03CR) 10Alexandros Kosiaris: "no, not really. our admin module has this limitation and one more, i.e. 
adding users to non defined groups (use case is for example ops i" [puppet] - 10https://gerrit.wikimedia.org/r/281918 (owner: 10Ema) [11:50:57] (03PS2) 10Ema: Misc cluster VTC tests [puppet] - 10https://gerrit.wikimedia.org/r/282137 (https://phabricator.wikimedia.org/T128188) [11:57:11] !log bump raid rebuild to 20MB/s on restbase2003 [11:57:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:03:01] (03PS1) 10BBlack: misc VCL: disable yarn.wm.o more-completely [puppet] - 10https://gerrit.wikimedia.org/r/282140 (https://phabricator.wikimedia.org/T131501) [12:06:06] PROBLEM - puppet last run on rdb2004 is CRITICAL: CRITICAL: puppet fail [12:06:11] !log remove restbase2004-b cassandra data directory [12:06:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:08:47] !log enabling HTTPS + connection pooling for CirrusSearch on all mediawiki nodes (T131839) [12:08:48] T131839: Activate SSL + connection pooling for CirrusSearch on PROD - https://phabricator.wikimedia.org/T131839 [12:08:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:11:45] (03CR) 10Gehel: [C: 032] Activate SSL and connection pooling for CirrusSearch.
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/281882 (https://phabricator.wikimedia.org/T131839) (owner: 10Gehel) [12:17:46] (03CR) 10Bmansurov: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/281031 (https://phabricator.wikimedia.org/T127021) (owner: 10Bmansurov) [12:22:49] !log gehel@tin Synchronized wmf-config: (no message) (duration: 00m 35s) [12:22:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:31:55] RECOVERY - puppet last run on rdb2004 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [12:34:18] (03PS1) 10Giuseppe Lavagetto: Don't use HTTPS+pooling for labswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282145 [12:34:26] <_joe_> dcausse, gehel ^^ [12:34:35] <_joe_> either this is ok, or we should revert IMHO [12:35:23] (03CR) 10Gehel: [C: 031] Don't use HTTPS+pooling for labswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282145 (owner: 10Giuseppe Lavagetto) [12:35:27] (03CR) 10DCausse: [C: 031] Don't use HTTPS+pooling for labswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282145 (owner: 10Giuseppe Lavagetto) [12:35:35] _joe_: thanks for saving my ass... 
[12:36:44] (03CR) 10Gehel: [C: 032] Don't use HTTPS+pooling for labswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282145 (owner: 10Giuseppe Lavagetto) [12:37:54] !log gehel@tin Synchronized wmf-config: Don't use HTTPS+pooling for labswiki (duration: 00m 27s) [12:37:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:41:26] PROBLEM - Apache HTTP on mw1213 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:41:26] PROBLEM - HHVM rendering on mw1213 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:43:17] (03PS1) 10Giuseppe Lavagetto: Point the codfw label back to the codfw cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282147 [12:43:23] <_joe_> gehel, dcausse ^^ [12:43:27] <_joe_> that's what I mean [12:44:02] ouch, good catch _joe_ [12:45:16] Damn... [12:45:26] (03CR) 10DCausse: [C: 031] Point the codfw label back to the codfw cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282147 (owner: 10Giuseppe Lavagetto) [12:45:34] (03CR) 10Gehel: [C: 032] Point the codfw label back to the codfw cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282147 (owner: 10Giuseppe Lavagetto) [12:45:37] gehel: can you deploy ^ we're writing twice to the same cluster [12:45:48] <_joe_> oh geez [12:45:53] <_joe_> yeah let's deploy it [12:46:44] dcausse: so we're losing updates on codfw? not good... [12:46:45] !log gehel@tin Synchronized wmf-config: Point the codfw label back to the codfw cluster (duration: 00m 27s) [12:46:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:47:14] gehel: yes [12:47:36] (03PS1) 10Hoo man: Bump $wgCacheEpoch on Wikidata after Property conversions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282148 [12:47:55] ok, deployed [12:48:13] gehel: Are you done deploying?
[12:48:17] yes [12:48:44] hoo: we are done (unless we find another issue) [12:48:52] (03CR) 10Hoo man: [C: 032] Bump $wgCacheEpoch on Wikidata after Property conversions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282148 (owner: 10Hoo man) [12:49:07] (03CR) 10Bmansurov: [C: 04-1] "OK, got it. Thanks." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272724 (https://phabricator.wikimedia.org/T127212) (owner: 10Bmansurov) [12:49:09] I should be done very fast [12:49:17] (03Merged) 10jenkins-bot: Bump $wgCacheEpoch on Wikidata after Property conversions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282148 (owner: 10Hoo man) [12:49:26] <_joe_> !log restarting hhvm on mw1213, deadlock in HPHP::Treadmill::getAgeOldestRequest [12:49:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:49:36] hoo: could give us a few more minutes, it has been bumpy so far... [12:49:52] sync is running by now [12:49:57] once it's done, you can take over [12:50:04] should be a few more seconds [12:50:06] !log hoo@tin Synchronized wmf-config/Wikibase.php: Bump $wgCacheEpoch on Wikidata after Property conversions (duration: 00m 26s) [12:50:09] here you go [12:50:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:50:13] hoo: ok, thanks! [12:50:26] RECOVERY - Apache HTTP on mw1213 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.041 second response time [12:50:26] RECOVERY - HHVM rendering on mw1213 is OK: HTTP OK: HTTP/1.1 200 OK - 67027 bytes in 0.139 second response time [13:02:20] 6Operations, 10DNS, 10Mail, 10Phabricator, and 2 others: phabricator.wikimedia.org has no SPF record - https://phabricator.wikimedia.org/T116806#2186521 (10chasemp) @bblack -- I hope this is what you are asking for, but our server using our mx's is the only valid source of a phabricator.wikimedia.org email... 
[13:08:50] (03PS1) 10BBlack: cache_misc: removed unused backends [puppet] - 10https://gerrit.wikimedia.org/r/282149 [13:08:52] (03PS1) 10BBlack: cache_misc: whitespace/comment/format cleanup [puppet] - 10https://gerrit.wikimedia.org/r/282150 [13:08:54] (03PS1) 10BBlack: cache_misc: declarative req.http.host=>backend map [puppet] - 10https://gerrit.wikimedia.org/r/282151 (https://phabricator.wikimedia.org/T131501) [13:10:33] 6Operations, 10DNS, 10Mail, 10Phabricator, and 2 others: phabricator.wikimedia.org has no SPF record - https://phabricator.wikimedia.org/T116806#2186525 (10BBlack) ok, so we should amend the patch to use `-all` and merge that [13:13:07] (03PS1) 10DCausse: Fix TTMServer elastic config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282153 [13:13:09] (03PS1) 10Giuseppe Lavagetto: Make ElasticaTTMServer config work with the new CirrusSearch structure [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282154 [13:13:15] <_joe_> heh [13:13:19] :) [13:13:20] 6Operations, 6Labs, 10Tool-Labs, 7Icinga: tool labs instance distribution monitoring is broken - https://phabricator.wikimedia.org/T119929#2186527 (10Andrew) a:3Andrew [13:13:21] <_joe_> synchronicity [13:13:57] <_joe_> dcausse: my version wouldn't work? [13:14:11] <_joe_> probably not, yours is more "secure" [13:14:14] <_joe_> let's go with that [13:14:20] I think it will work but if we switch to codfw we'll break translate [13:14:31] because the indices are not replicated [13:14:41] <_joe_> oh I see [13:14:44] translate is not codfw aware :/ [13:15:15] (03PS2) 10BBlack: cache_misc: declarative req.http.host=>backend map [puppet] - 10https://gerrit.wikimedia.org/r/282151 (https://phabricator.wikimedia.org/T131501) [13:15:37] (03CR) 10Ottomata: [C: 031] misc VCL: disable yarn.wm.o more-completely [puppet] - 10https://gerrit.wikimedia.org/r/282140 (https://phabricator.wikimedia.org/T131501) (owner: 10BBlack) [13:17:47] (03CR) 10Ottomata: "Petr, ok. I don't have strong feelings here. 
Let's leave it as is and if it becomes too confusing later, we can revisit." [puppet] - 10https://gerrit.wikimedia.org/r/281740 (https://phabricator.wikimedia.org/T126571) (owner: 10Ppchelko) [13:18:30] (03CR) 10Ottomata: "Filippo did you mean to -1 or +1?" [puppet] - 10https://gerrit.wikimedia.org/r/281740 (https://phabricator.wikimedia.org/T126571) (owner: 10Ppchelko) [13:20:53] (03CR) 10Gehel: [C: 032] Fix TTMServer elastic config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282153 (owner: 10DCausse) [13:20:58] (03PS3) 10BBlack: cache_misc: declarative req.http.host=>backend map [puppet] - 10https://gerrit.wikimedia.org/r/282151 (https://phabricator.wikimedia.org/T131501) [13:21:54] <_joe_> bblack: I admire your will in transforming what our varnish puppet code was into something clean :) [13:22:16] one baby step at a time! [13:22:24] !log gehel@tin Synchronized wmf-config: Fix TTMServer elastic config (duration: 00m 32s) [13:22:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:22:44] <_joe_> bblack: we should look at what it was circa 1 year ago [13:23:10] don't, you might go blind! [13:24:57] (03PS4) 10BBlack: cache_misc: declarative req.http.host=>backend map [puppet] - 10https://gerrit.wikimedia.org/r/282151 (https://phabricator.wikimedia.org/T131501) [13:26:07] for that matter, circa one year ago we had 6 clusters: text, upload, misc, mobile, bits, parsoid. now we have 4: text, upload, misc, maps. [13:26:55] and before text, upload, and mobile were 2layer+multi-tier, bits was 1layer+multi-tier, misc was 1layer+1tier, and parsoid was 2layer+1tier. 
[13:27:06] now all remaining ones are all 2layer+multi-tier [13:27:23] <_joe_> it was moar fun [13:28:47] <_joe_> !log restarted hhvm on mw1015 [13:28:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:34:51] 6Operations, 6Labs, 10Tool-Labs, 7Icinga: tool labs instance distribution monitoring is broken - https://phabricator.wikimedia.org/T119929#2186543 (10Andrew) This script uses the nova-tools-bot credentials, and they are currently invalid. [13:35:25] 6Operations, 10Ops-Access-Requests: Grant reedy access to librenms - https://phabricator.wikimedia.org/T131252#2161064 (10elukey) I tried to add myself to the user's list too but I can't login (even if I verified that my username is saved on the db). Maybe this phab task could be a good incentive to prioriti... [13:38:40] (03CR) 10Filippo Giunchedi: "I meant to -1 since I was planning to merge but wasn't sure from the comments/commit message whether this would do anything in production " [puppet] - 10https://gerrit.wikimedia.org/r/281740 (https://phabricator.wikimedia.org/T126571) (owner: 10Ppchelko) [13:40:55] 6Operations, 6Labs, 10Tool-Labs, 7Icinga: tool labs instance distribution monitoring is broken - https://phabricator.wikimedia.org/T119929#2186548 (10Andrew) Best I can tell, that account has a new password and private hiera has fallen behind. I'm not sure how we would've gotten here though. 
[13:41:15] (03PS5) 10BBlack: cache_misc: declarative req.http.host=>backend map [puppet] - 10https://gerrit.wikimedia.org/r/282151 (https://phabricator.wikimedia.org/T131501) [13:41:17] (03PS1) 10BBlack: misc VCL: sort backends by backend [puppet] - 10https://gerrit.wikimedia.org/r/282156 [13:41:19] (03PS1) 10BBlack: misc VCL: merge duplicate backend blocks [puppet] - 10https://gerrit.wikimedia.org/r/282157 [13:41:52] (03PS1) 10Gehel: Switching CirrusSearch to codfw Elasticsearch cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282158 [13:42:05] dcausse: ^ can you just have a quick look ? [13:42:13] (03CR) 10BBlack: [C: 032] misc VCL: disable yarn.wm.o more-completely [puppet] - 10https://gerrit.wikimedia.org/r/282140 (https://phabricator.wikimedia.org/T131501) (owner: 10BBlack) [13:42:35] (03CR) 10BBlack: [C: 032] cache_misc: removed unused backends [puppet] - 10https://gerrit.wikimedia.org/r/282149 (owner: 10BBlack) [13:42:41] (03CR) 10jenkins-bot: [V: 04-1] Switching CirrusSearch to codfw Elasticsearch cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282158 (owner: 10Gehel) [13:42:57] (03PS1) 10Muehlenhoff: Manage /etc/pam.d/sshd in role::bastionhost::2fa on via puppet [puppet] - 10https://gerrit.wikimedia.org/r/282159 [13:42:59] (03PS1) 10Muehlenhoff: Enable two-factor authentication in sshd [puppet] - 10https://gerrit.wikimedia.org/r/282160 [13:43:09] (03CR) 10BBlack: [C: 032] cache_misc: whitespace/comment/format cleanup [puppet] - 10https://gerrit.wikimedia.org/r/282150 (owner: 10BBlack) [13:43:34] gehel: looking [13:44:09] unittest failing (I did not update them to reflect the change, lemme do that) [13:44:18] (03PS2) 10Elukey: Remove logrotate/syslog configurations. 
[software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/282135 (https://phabricator.wikimedia.org/T129344) [13:44:23] <_joe_> gehel: heh [13:44:45] (03CR) 10jenkins-bot: [V: 04-1] Enable two-factor authentication in sshd [puppet] - 10https://gerrit.wikimedia.org/r/282160 (owner: 10Muehlenhoff) [13:44:47] (03PS3) 10Ema: Misc cluster VTC tests [puppet] - 10https://gerrit.wikimedia.org/r/282137 (https://phabricator.wikimedia.org/T128188) [13:45:29] yes the unittest make sure that you use the wmfDatacenter as the cluster :/ [13:45:33] (03Abandoned) 10Elukey: Remove logrotate/syslog configurations. [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/282135 (https://phabricator.wikimedia.org/T129344) (owner: 10Elukey) [13:45:51] (03PS2) 10Gehel: Switching CirrusSearch to codfw Elasticsearch cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282158 [13:47:11] (03CR) 10DCausse: [C: 031] Switching CirrusSearch to codfw Elasticsearch cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282158 (owner: 10Gehel) [13:48:51] !log switching CirrusSearch to use Elasticsearch cluster in codfw [13:48:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:49:07] (03CR) 10Gehel: [C: 032] Switching CirrusSearch to codfw Elasticsearch cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282158 (owner: 10Gehel) [13:50:19] (03PS3) 10Rush: nslcd specifying shell override [puppet] - 10https://gerrit.wikimedia.org/r/282060 (https://phabricator.wikimedia.org/T131541) [13:50:21] (03PS3) 10Rush: toollabs bastions install cgroup-bin [puppet] - 10https://gerrit.wikimedia.org/r/282072 (https://phabricator.wikimedia.org/T131541) [13:50:35] !log gehel@tin Synchronized wmf-config: switching CirrusSearch to use Elasticsearch cluster in codfw (duration: 00m 31s) [13:51:16] (03PS1) 10Elukey: Remove logrotate/syslog configurations. 
[software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/282161 (https://phabricator.wikimedia.org/T129344) [13:51:28] Scheduled 'shutdown' today? [13:51:36] 6Operations, 10DNS, 10Mail, 10Phabricator, and 2 others: phabricator.wikimedia.org has no SPF record - https://phabricator.wikimedia.org/T116806#2186555 (10faidon) `-all`, like other antispam features (DMARC etc.) works poorly with mailing lists/reforwarders. If we're sure that there aren't any Phabricator... [13:51:39] my bad (probably) [13:51:42] rolling back [13:52:00] (Sigh) [13:52:01] I'm getting 500 on wikitech? [13:52:05] dewp is down [13:52:08] PHP fatal error: [13:52:09] Invalid operand type was used: cannot perform this operation with arrays [13:52:16] PROBLEM - Apache HTTP on mw1143 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50483 bytes in 0.016 second response time [13:52:16] PROBLEM - HHVM rendering on mw1204 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50482 bytes in 0.014 second response time [13:52:16] PROBLEM - HHVM rendering on mw1202 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50482 bytes in 0.019 second response time [13:52:16] PROBLEM - Apache HTTP on mw1247 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50483 bytes in 0.018 second response time [13:52:16] If you had a dual rollout strategy [13:52:17] PROBLEM - HHVM rendering on mw1205 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50483 bytes in 0.034 second response time [13:52:17] PROBLEM - Apache HTTP on mw1232 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50483 bytes in 0.017 second response time [13:52:25] PROBLEM - Apache HTTP on mw1106 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50483 bytes in 0.012 second response time [13:52:25] PROBLEM - Apache HTTP on mw1141 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50483 bytes in 0.016 second response time 
[13:52:25] PROBLEM - Apache HTTP on mw1192 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50482 bytes in 0.011 second response time [13:52:25] PROBLEM - Apache HTTP on mw1202 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50482 bytes in 0.009 second response time [13:52:25] PROBLEM - Apache HTTP on mw1218 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50483 bytes in 0.025 second response time [13:52:25] PROBLEM - Apache HTTP on mw1189 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50482 bytes in 0.025 second response time [13:52:26] PROBLEM - Apache HTTP on mw1147 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50483 bytes in 0.019 second response time [13:52:26] PROBLEM - Apache HTTP on mw1203 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50483 bytes in 0.017 second response time [13:52:26] PROBLEM - Apache HTTP on mw1205 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50483 bytes in 0.020 second response time [13:52:27] <_joe_> gehel: revert immediately! [13:52:35] <_joe_> manually on tin [13:52:38] <_joe_> should I do it? [13:52:40] oh crap [13:52:52] time to panic [13:53:06] Nah [13:53:10] Not yet [13:53:12] rollback in progress [13:53:15] AHHH I JUST JOINED CAN I HELP WITH THE PANICKING? 
[13:53:17] <_joe_> cool [13:53:23] i'm in a cafe, i'd make a big scene [13:53:35] !log gehel@tin Synchronized wmf-config: switching CirrusSearch to use Elasticsearch cluster in codfw (duration: 00m 35s) [13:53:36] * gehel is on it [13:53:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:53:47] PAnic is for when they can't rollback ;) [13:54:06] RECOVERY - Apache HTTP on mw1143 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.044 second response time [13:54:06] RECOVERY - HHVM rendering on mw1204 is OK: HTTP OK: HTTP/1.1 200 OK - 67013 bytes in 0.105 second response time [13:54:06] RECOVERY - HHVM rendering on mw1202 is OK: HTTP OK: HTTP/1.1 200 OK - 67013 bytes in 0.092 second response time [13:54:06] RECOVERY - Apache HTTP on mw1247 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.048 second response time [13:54:06] RECOVERY - HHVM rendering on mw1205 is OK: HTTP OK: HTTP/1.1 200 OK - 67013 bytes in 0.102 second response time [13:54:06] RECOVERY - Apache HTTP on mw1232 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.032 second response time [13:54:07] RECOVERY - Apache HTTP on mw1106 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.045 second response time [13:54:07] RECOVERY - Apache HTTP on mw1141 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.050 second response time [13:54:15] RECOVERY - Apache HTTP on mw1192 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.036 second response time [13:54:15] RECOVERY - Apache HTTP on mw1202 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.032 second response time [13:54:15] RECOVERY - Apache HTTP on mw1218 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.037 second response time [13:54:15] RECOVERY - Apache HTTP on mw1189 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.028 second response time [13:54:15] RECOVERY - Apache HTTP on mw1147 is OK: HTTP 
OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.045 second response time [13:54:15] RECOVERY - Apache HTTP on mw1203 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.048 second response time [13:54:15] RECOVERY - Apache HTTP on mw1205 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.049 second response time [13:54:16] RECOVERY - Apache HTTP on mw1209 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.033 second response time [13:54:16] RECOVERY - Apache HTTP on mw1179 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.033 second response time [13:54:29] lol [13:55:01] (03CR) 10Rush: [C: 032] nslcd specifying shell override [puppet] - 10https://gerrit.wikimedia.org/r/282060 (https://phabricator.wikimedia.org/T131541) (owner: 10Rush) [13:55:26] 6Operations, 10ops-eqiad, 6DC-Ops: Broken disk on aqs1001.eqiad.wmnet - https://phabricator.wikimedia.org/T130816#2186570 (10faidon) p:5Triage>3High [13:56:06] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [1000.0] [13:56:06] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [1000.0] [13:56:26] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1001 is CRITICAL: CRITICAL - Expecting active but unit nfs-exports is failed [13:56:46] PROBLEM - HHVM rendering on mw1257 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50483 bytes in 0.029 second response time [13:57:05] PROBLEM - Apache HTTP on mw1257 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50483 bytes in 0.042 second response time [13:57:05] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0] [13:57:06] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [1000.0] [13:57:27] 
PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0] [13:58:17] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1001 is OK: OK - nfs-exports is active [13:58:48] !log labstore restart nfs-export that crashed [13:58:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:59:01] (03PS4) 10Ema: Misc cluster VTC tests [puppet] - 10https://gerrit.wikimedia.org/r/282137 (https://phabricator.wikimedia.org/T128188) [13:59:03] 6Operations: Upgrade fluorine to trusty/jessie - https://phabricator.wikimedia.org/T123728#1936565 (10fgiunchedi) should be fairly straightforward, though we'd need big disks (I'd suggest raid1 2x4TB or raid10 4x4TB as it is mostly cold data anyway) ``` fluorine:~$ df -h Filesystem Size Used Avail Us... [13:59:15] (03CR) 10Ema: [C: 032 V: 032] Misc cluster VTC tests [puppet] - 10https://gerrit.wikimedia.org/r/282137 (https://phabricator.wikimedia.org/T128188) (owner: 10Ema) [14:00:19] ottomata: still in time to say "YES PLEASE MAKE A SCENE" ? [14:01:52] there is a $5 minimum here, all i've had so far is a coffee. If I start some destruction, that would help put me over my $5 minimum [14:04:23] 6Operations, 10ops-eqiad, 6DC-Ops: Broken disk on aqs1001.eqiad.wmnet - https://phabricator.wikimedia.org/T130816#2186653 (10Ottomata) Chris reached out to me, he will be in the datacenter on Tuesday when I am online (he’s off today and tomorrow, I’m off Monday). 
[14:04:26] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [14:04:27] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [14:04:46] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [14:04:52] (03PS1) 10DCausse: Fix Completion pool counter settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282163 [14:05:16] (03PS1) 10Gehel: Revert "Switching CirrusSearch to codfw Elasticsearch cluster" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282164 [14:06:06] <_joe_> gehel: don't revert [14:06:22] <_joe_> gehel: we can just merge the patch from dcausse as well [14:06:24] _joe_: ok, waiting... [14:06:31] <_joe_> and distribute both together [14:06:40] and test just on mw1017 first this time :-) [14:06:43] <_joe_> yes [14:06:52] <_joe_> I am taking a break guys [14:07:06] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [14:07:13] <_joe_> I need a coffee and I've been here since 7:30 AM :P [14:07:15] <_joe_> bbiab [14:10:16] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There are 3 unmerged changes in mediawiki_config (dir /srv/mediawiki-staging/). [14:10:46] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There are 3 unmerged changes in mediawiki_config (dir /srv/mediawiki-staging/). 
[14:11:07] (03PS1) 10Rush: nscld changes need to restart nscd as well [puppet] - 10https://gerrit.wikimedia.org/r/282165 (https://phabricator.wikimedia.org/T131541) [14:13:08] (03PS1) 10Ottomata: Create new eventlogging::analytics role in modules/role, use scap for deployment [puppet] - 10https://gerrit.wikimedia.org/r/282166 (https://phabricator.wikimedia.org/T118772) [14:14:58] (03PS1) 10Ema: Fix ownership of VTC tests directory [puppet] - 10https://gerrit.wikimedia.org/r/282167 (https://phabricator.wikimedia.org/T128188) [14:19:00] (03CR) 10Ema: [C: 032 V: 032] Fix ownership of VTC tests directory [puppet] - 10https://gerrit.wikimedia.org/r/282167 (https://phabricator.wikimedia.org/T128188) (owner: 10Ema) [14:24:53] ottomata: thank you for your support in (not) panicking... :P [14:24:59] hehehhe [14:25:56] PROBLEM - puppet last run on mw2106 is CRITICAL: CRITICAL: Puppet has 1 failures [14:25:59] (03CR) 10Ema: [C: 031] misc VCL: sort backends by backend [puppet] - 10https://gerrit.wikimedia.org/r/282156 (owner: 10BBlack) [14:26:39] (03PS2) 10Rush: nscld changes need to restart nscd as well [puppet] - 10https://gerrit.wikimedia.org/r/282165 (https://phabricator.wikimedia.org/T131541) [14:26:53] (03PS2) 10BBlack: misc VCL: sort backends by backend [puppet] - 10https://gerrit.wikimedia.org/r/282156 [14:26:53] <_joe_> gehel: https://www.youtube.com/watch?v=i0GW0Vnr9Yc [14:27:04] (03CR) 10BBlack: [C: 032 V: 032] misc VCL: sort backends by backend [puppet] - 10https://gerrit.wikimedia.org/r/282156 (owner: 10BBlack) [14:28:19] (03PS3) 10Rush: nscld changes need to restart nscd as well [puppet] - 10https://gerrit.wikimedia.org/r/282165 (https://phabricator.wikimedia.org/T131541) [14:28:25] _joe_: thanks to IRC it is harder to get physical... 
[14:28:28] (03CR) 10Rush: [C: 032 V: 032] nscld changes need to restart nscd as well [puppet] - 10https://gerrit.wikimedia.org/r/282165 (https://phabricator.wikimedia.org/T131541) (owner: 10Rush) [14:29:50] (03PS4) 10Rush: toollabs bastions install cgroup-bin [puppet] - 10https://gerrit.wikimedia.org/r/282072 (https://phabricator.wikimedia.org/T131541) [14:29:57] (03CR) 10Rush: [C: 032 V: 032] toollabs bastions install cgroup-bin [puppet] - 10https://gerrit.wikimedia.org/r/282072 (https://phabricator.wikimedia.org/T131541) (owner: 10Rush) [14:35:42] !log switching CirrusSearch to use Elasticsearch cluster in codfw (again), testing on mw1017 first [14:35:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:36:14] (03PS2) 10BBlack: misc VCL: merge duplicate backend blocks [puppet] - 10https://gerrit.wikimedia.org/r/282157 [14:36:25] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [14:36:26] (03CR) 10Gehel: [C: 032] Fix Completion pool counter settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282163 (owner: 10DCausse) [14:36:48] (03CR) 10BBlack: [C: 032 V: 032] misc VCL: merge duplicate backend blocks [puppet] - 10https://gerrit.wikimedia.org/r/282157 (owner: 10BBlack) [14:37:39] (03PS6) 10BBlack: cache_misc: declarative req.http.host=>backend map [puppet] - 10https://gerrit.wikimedia.org/r/282151 (https://phabricator.wikimedia.org/T131501) [14:37:56] (03PS2) 10Ottomata: Create new eventlogging::analytics role in modules/role, use scap for deployment [puppet] - 10https://gerrit.wikimedia.org/r/282166 (https://phabricator.wikimedia.org/T118772) [14:40:16] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge. 
[14:40:25] (03CR) 10Ottomata: [C: 032] Create new eventlogging::analytics role in modules/role, use scap for deployment [puppet] - 10https://gerrit.wikimedia.org/r/282166 (https://phabricator.wikimedia.org/T118772) (owner: 10Ottomata) [14:44:23] (03CR) 10Ema: [C: 031] cache_misc: declarative req.http.host=>backend map [puppet] - 10https://gerrit.wikimedia.org/r/282151 (https://phabricator.wikimedia.org/T131501) (owner: 10BBlack) [14:44:39] (03PS1) 10Ottomata: Remove unused eventlogging::common source, this will be done via eventlogging module with a git clone [puppet] - 10https://gerrit.wikimedia.org/r/282169 [14:44:54] (03PS2) 10Ottomata: Remove unused eventlogging::common source, this will be done via eventlogging module with a git clone [puppet] - 10https://gerrit.wikimedia.org/r/282169 [14:45:09] (03CR) 10Ottomata: [C: 032 V: 032] Remove unused eventlogging::common source, this will be done via eventlogging module with a git clone [puppet] - 10https://gerrit.wikimedia.org/r/282169 (owner: 10Ottomata) [14:46:17] PROBLEM - puppet last run on eventlog1001 is CRITICAL: CRITICAL: Puppet has 1 failures [14:47:07] !log syncing config to activate switch of CirrusSearch to codfw [14:47:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:47:50] (03PS7) 10BBlack: cache_misc: declarative req.http.host=>backend map [puppet] - 10https://gerrit.wikimedia.org/r/282151 (https://phabricator.wikimedia.org/T131501) [14:47:52] (03PS1) 10BBlack: misc VCL: whitespace and Host capitalization fixups [puppet] - 10https://gerrit.wikimedia.org/r/282170 [14:47:53] !log gehel@tin Synchronized wmf-config: switching CirrusSearch to use Elasticsearch cluster in codfw (duration: 00m 30s) [14:47:54] (03PS1) 10BBlack: misc VCL: improved planet host regex [puppet] - 10https://gerrit.wikimedia.org/r/282171 [14:48:06] RECOVERY - Apache HTTP on mw1257 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.028 second response time [14:48:36] RECOVERY - 
Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. [14:49:19] (03PS2) 10BBlack: misc VCL: whitespace and Host capitalization fixups [puppet] - 10https://gerrit.wikimedia.org/r/282170 [14:49:37] RECOVERY - HHVM rendering on mw1257 is OK: HTTP OK: HTTP/1.1 200 OK - 67093 bytes in 0.078 second response time [14:49:38] (03CR) 10BBlack: [C: 032 V: 032] misc VCL: whitespace and Host capitalization fixups [puppet] - 10https://gerrit.wikimedia.org/r/282170 (owner: 10BBlack) [14:49:51] (03PS2) 10BBlack: misc VCL: improved planet host regex [puppet] - 10https://gerrit.wikimedia.org/r/282171 [14:50:28] (03CR) 10BBlack: [C: 032 V: 032] misc VCL: improved planet host regex [puppet] - 10https://gerrit.wikimedia.org/r/282171 (owner: 10BBlack) [14:50:57] (03PS8) 10BBlack: cache_misc: declarative req.http.host=>backend map [puppet] - 10https://gerrit.wikimedia.org/r/282151 (https://phabricator.wikimedia.org/T131501) [14:51:46] RECOVERY - puppet last run on mw2106 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [14:53:51] (03CR) 10BBlack: [C: 032] "compiler says no-op on the VCL output" [puppet] - 10https://gerrit.wikimedia.org/r/282151 (https://phabricator.wikimedia.org/T131501) (owner: 10BBlack) [14:58:53] (03PS1) 10BBlack: misc VCL: backend-setting for varnish4 [puppet] - 10https://gerrit.wikimedia.org/r/282173 (https://phabricator.wikimedia.org/T131501) [15:00:04] anomie ostriches thcipriani marktraceur aude: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160407T1500). [15:00:04] James_F _joe_ matanya: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [15:01:17] I can SWAT. Who's around to be SWATted? [15:02:46] PROBLEM - puppet last run on eventlog2001 is CRITICAL: CRITICAL: Puppet has 1 failures [15:02:52] * James_F is here. 
[15:03:12] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/280828 (owner: 10Jforrester) [15:03:36] Whee. [15:04:03] (03Merged) 10jenkins-bot: Enable VisualEditor Beta Feature on Wikisources, Wiktionaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/280828 (owner: 10Jforrester) [15:04:32] (03PS1) 10Eevans: remove query parms from urls [puppet] - 10https://gerrit.wikimedia.org/r/282174 [15:06:33] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Enable VisualEditor Beta Feature on Wikisources, Wiktionaries [[gerrit:280828]] (duration: 00m 30s) [15:06:36] ^ James_F check please [15:06:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:08:12] thcipriani: Looks OK. Checking further. [15:08:18] kk [15:08:26] (03PS2) 10BBlack: misc VCL: backend setting for varnish4 [puppet] - 10https://gerrit.wikimedia.org/r/282173 (https://phabricator.wikimedia.org/T131501) [15:08:54] 6Operations: Migrate argon to jessie - https://phabricator.wikimedia.org/T123729#1936574 (10fgiunchedi) a VM seems like a good use case for this, we have VMs with public ip addresses already (e.g. lists) so it could be permanent also note the udp echo bot doesn't seem to have been restarted in a while, still pos... [15:09:12] hi James_F [15:10:24] thcipriani: LGTM. May be some follow-up but should be good for now. [15:10:37] James_F: okie doke. Thanks for checking. [15:10:48] _joe_: matanya ping me when you're around for SWAT [15:11:16] (03PS2) 10Muehlenhoff: Enable two-factor authentication in sshd [puppet] - 10https://gerrit.wikimedia.org/r/282160 [15:11:26] apergos: HIIIII [15:11:35] <_joe_> thcipriani: ping [15:11:37] ottomata: [15:11:48] you see they created my repo so I did the rest [15:11:52] thanks for all the setup [15:11:56] those patches are merged! But I had to comment out the scap::source for dumps/dumps [15:12:02] _joe_: pong :) [15:12:02] oh? 
[15:12:16] RECOVERY - puppet last run on eventlog1001 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [15:12:17] oh I have already the repo for that and removed the comments and it runs [15:12:22] it's exactly as you set it up [15:12:24] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/279350 (https://phabricator.wikimedia.org/T114273) (owner: 10Giuseppe Lavagetto) [15:12:26] (03CR) 10Ema: [C: 031] "LGTM, noop on varnish 3: https://puppet-compiler.wmflabs.org/2350/" [puppet] - 10https://gerrit.wikimedia.org/r/282173 (https://phabricator.wikimedia.org/T131501) (owner: 10BBlack) [15:12:26] oh great [15:12:29] so thank you [15:12:40] AHH great that is what i was goign to get you to do! [15:12:41] awesome! [15:12:50] <_joe_> thcipriani: let me try it on one server [15:12:53] I removed the manifests I had, gone now, all good! all puppetized! yay! [15:12:58] yeehaw! [15:13:02] _joe_: sure. [15:13:10] (03Merged) 10jenkins-bot: Use ProductionServices for the jobqueue configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/279350 (https://phabricator.wikimedia.org/T114273) (owner: 10Giuseppe Lavagetto) [15:13:28] yeah I just had to wait for someone to get to the repo creation request [15:13:29] <_joe_> thcipriani: so merge it on tin and ack me when you're done :) [15:13:36] but they did it in a couple days, not bad [15:14:00] _joe_: done [15:14:35] <_joe_> thcipriani: testing on one jobrunner (mw1001) [15:14:43] ack. [15:16:11] 6Operations: Migrate argon to jessie - https://phabricator.wikimedia.org/T123729#2186978 (10Krenair) it still posts as that because changing it might be a breaking change for some bots [15:16:19] <_joe_> thcipriani: seems ok for now [15:17:10] _joe_: kk. Thinking about sync order for remaining servers: looks like ProductionServices then JobQueue then wmf-config dir, sound sane? [15:17:59] (03PS3) 10Ppchelko: Emit resource_change events from RESTBase. 
[puppet] - 10https://gerrit.wikimedia.org/r/281740 (https://phabricator.wikimedia.org/T126571) [15:18:08] trying not to cause any temporary error log spike. [15:18:13] <_joe_> thcipriani: seems to make sense, yes [15:18:22] okie doke, doing. [15:18:29] <_joe_> thcipriani: actually, productionservices then the dir [15:18:46] (03PS1) 10Ottomata: Comment fix in hieradata/common/scap/server.yaml [puppet] - 10https://gerrit.wikimedia.org/r/282179 [15:18:51] <_joe_> no you are right [15:18:59] <_joe_> prodservices, then jobqueue, then sync-dir [15:19:03] <_joe_> sorry [15:19:07] no problem :) [15:19:08] <_joe_> I am a bit tired by now [15:19:10] (03CR) 10Ppchelko: "Filippo: I've amended the commit message to be more clear. The code was deposed yesterday, so merging this will enable the event in produc" [puppet] - 10https://gerrit.wikimedia.org/r/281740 (https://phabricator.wikimedia.org/T126571) (owner: 10Ppchelko) [15:19:25] (03CR) 10Ottomata: [C: 032 V: 032] Comment fix in hieradata/common/scap/server.yaml [puppet] - 10https://gerrit.wikimedia.org/r/282179 (owner: 10Ottomata) [15:19:41] 6Operations: Migrate argon to jessie - https://phabricator.wikimedia.org/T123729#2186979 (10fgiunchedi) ah, thanks @Krenair , also I was mistaken, the version in puppet is the correct one despite having two on argon ``` /etc/init/udpmxircecho:exec /usr/local/bin/udpmxircecho.py rc-eqiad argon.wikimedia.org /etc... 
[15:21:28] !log thcipriani@tin Synchronized wmf-config/ProductionServices.php: SWAT: Use ProductionServices for the jobqueue configuration 1/3 [[gerrit:279350]] (duration: 00m 30s) [15:21:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:22:12] !log thcipriani@tin Synchronized wmf-config/jobqueue.php: SWAT: Use ProductionServices for the jobqueue configuration 2/3 [[gerrit:279350]] (duration: 00m 27s) [15:22:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:22:55] !log thcipriani@tin Synchronized wmf-config: SWAT: Use ProductionServices for the jobqueue configuration 3/3 [[gerrit:279350]] (duration: 00m 31s) [15:22:56] ^ _joe_ check please [15:22:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:23:08] <_joe_> thcipriani: the only check is looking at the logs [15:23:12] <_joe_> and at the jobqueue [15:23:15] kk [15:23:18] <_joe_> which I am looking at [15:23:23] <_joe_> can you check logstash? [15:24:03] I'm watching logstash [15:24:19] <_joe_> queue seems ok-ish but let's wait a few more mins [15:24:53] kk [15:25:52] nothing in fatalmonitor or mediawiki-errors seems abnormal/changed [15:26:07] <_joe_> let's call it a success then? [15:27:20] yup wfm :) [15:27:43] ready for the next one? 
[15:28:00] <_joe_> yes [15:28:09] <_joe_> this is *way* less scary [15:28:17] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/279355 (https://phabricator.wikimedia.org/T114273) (owner: 10Giuseppe Lavagetto) [15:28:22] :D [15:28:41] <_joe_> this should be a noop [15:28:43] (03Merged) 10jenkins-bot: Use local resources in codfw for parsoid, url-downloader and mathoid [mediawiki-config] - 10https://gerrit.wikimedia.org/r/279355 (https://phabricator.wikimedia.org/T114273) (owner: 10Giuseppe Lavagetto) [15:28:55] <_joe_> again, only metric to test: errors in logstash [15:29:11] * _joe_ loves how his changes are good if they do nothing at all [15:30:31] !log thcipriani@tin Synchronized wmf-config/ProductionServices.php: SWAT: Use local resources in codfw for parsoid, url-downloader and mathoid [[gerrit:279355]] (duration: 00m 25s) [15:30:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:30:39] ^ _joe_ sync'd [15:31:13] no log explosions as yet :) [15:31:21] <_joe_> nope :) [15:31:48] neat. /me declares victory [15:31:53] thanks! [15:31:58] <_joe_> cool [15:32:01] <_joe_> meeting now! [15:33:02] matanya: lemme know if you're around for SWAT. [15:33:52] 6Operations, 10DNS, 10Mail, 10Phabricator, and 2 others: phabricator.wikimedia.org has no SPF record - https://phabricator.wikimedia.org/T116806#2186995 (10greg) Unless some team has a team list specified for their team project contact (I highly highly doubt it), the only thing I know of is/was wikibugs-l,... 
[15:35:13] 6Operations, 10Mathoid, 6Services: Travis PNG looks different from vagrant png - https://phabricator.wikimedia.org/T94379#2186997 (10Physikerwelt) [15:43:22] 6Operations, 10Analytics-Cluster: Complete installation of analytics1017.eqiad.wmnet - https://phabricator.wikimedia.org/T125055#2187062 (10Ottomata) 5Open>3declined Yuvi is taking over this server [15:46:25] Is there a maximum execution time for jobs before they are killed? [15:47:46] 6Operations, 10Analytics-Cluster: setup stat1004/WMF4721 for hadoop client usage - https://phabricator.wikimedia.org/T131877#2187108 (10Ottomata) [15:48:16] 6Operations, 10Analytics, 10Analytics-Cluster, 10Traffic: Upgrade analytics-eqiad Kafka cluster to Kafka 0.9 (or 0.10?) - https://phabricator.wikimedia.org/T121562#2187115 (10Ottomata) [15:49:15] AaronSchulz: ^ ? [15:49:18] (03PS1) 1020after4: add a group for phabricator deployment on deployment servers [puppet] - 10https://gerrit.wikimedia.org/r/282186 [15:49:35] 3600 sec, 1 day for TMH ones [15:49:52] well, that's the recycle time, not the kill time [15:50:15] I guess the hhvm server settings on the runners would dictate any max execution time [15:51:24] paravoid: do you want all logrotate confs removed from vk .deb package? [15:51:37] /jobrunner.yaml: max_execution_time: 1200 [15:51:38] the default vanrishkafka config that comes with the package writes json stats to a file in /var/cache/varnishkafka [15:52:52] and there is a logrotate that rotates it [15:52:54] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] "Since this is temporary, I am fine with it. But let's revert this after we 're done migrating services. It's ugly." 
[puppet] - 10https://gerrit.wikimedia.org/r/279415 (owner: 10Mobrovac) [15:53:03] (03PS6) 10Alexandros Kosiaris: Scap3: chown the target root dir if owned by root [puppet] - 10https://gerrit.wikimedia.org/r/279415 (owner: 10Mobrovac) [15:53:19] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Scap3: chown the target root dir if owned by root [puppet] - 10https://gerrit.wikimedia.org/r/279415 (owner: 10Mobrovac) [15:54:04] I guess thats cpu time, since we don't have TimeoutsUseWallTime set [15:54:36] PROBLEM - puppet last run on mw2085 is CRITICAL: CRITICAL: Puppet has 1 failures [15:55:51] (03Abandoned) 10Gehel: Revert "Switching CirrusSearch to codfw Elasticsearch cluster" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282164 (owner: 10Gehel) [15:56:17] (03PS2) 1020after4: add a group for phabricator deployment on deployment servers [puppet] - 10https://gerrit.wikimedia.org/r/282186 [15:58:04] (03PS1) 10Elukey: Remove logrotate/syslog configurations. [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/282188 (https://phabricator.wikimedia.org/T129344) [15:58:45] ottomata: https://gerrit.wikimedia.org/r/#/c/282188/1/debian/varnishkafka.conf [15:59:04] sorry wrong channel ) [15:59:06] :) [16:00:04] godog moritzm: Respected human, time to deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160407T1600). Please do the needful. [16:01:20] bring it on puppet SWAT! [16:02:40] (03CR) 10Ottomata: "Get rid of all mentions of /var/cache/varnishkafka, see inline comment." 
(031 comment) [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/282188 (https://phabricator.wikimedia.org/T129344) (owner: 10Elukey) [16:04:46] (03PS2) 10Filippo Giunchedi: remove query parms from urls [puppet] - 10https://gerrit.wikimedia.org/r/282174 (owner: 10Eevans) [16:04:53] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] remove query parms from urls [puppet] - 10https://gerrit.wikimedia.org/r/282174 (owner: 10Eevans) [16:06:21] 6Operations, 13Patch-For-Review: Randomly failing puppetmaster sync to strontium - https://phabricator.wikimedia.org/T128895#2187221 (10fgiunchedi) and again, actually not while synching to strontium but as soon as puppet-merge is ran ``` palladium:~$ sudo puppet-merge Fetching new commits from https://gerrit... [16:07:46] (03PS2) 10Elukey: Remove logrotate/syslog configurations. [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/282188 (https://phabricator.wikimedia.org/T129344) [16:09:15] (03Abandoned) 10Elukey: Remove logrotate/syslog configurations. [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/282188 (https://phabricator.wikimedia.org/T129344) (owner: 10Elukey) [16:10:48] (03PS2) 10Elukey: Remove logrotate/syslog configurations. [software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/282161 (https://phabricator.wikimedia.org/T129344) [16:10:59] Elukey: Please don't create so much patches, every single change let my client ping :P [16:13:45] (03CR) 10Ema: [C: 031] "Please add yourself to the Uploaders field in debian/control. Other than that, it LGTM! 
The package builds fine on copper and the various " [software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/282161 (https://phabricator.wikimedia.org/T129344) (owner: 10Elukey) [16:14:57] (03PS1) 10CSteipp: Revert "Revert "Enable Ex:OATHAuth in beta, disabled for all users"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282193 [16:15:16] (03CR) 10jenkins-bot: [V: 04-1] Revert "Revert "Enable Ex:OATHAuth in beta, disabled for all users"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282193 (owner: 10CSteipp) [16:19:07] RECOVERY - puppet last run on mw2085 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [16:19:38] Csteipp: revert the revert of the revert of the revert? Sound like an edit-war :P [16:20:27] (03PS3) 10Elukey: Remove logrotate/syslog configurations. [software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/282161 (https://phabricator.wikimedia.org/T129344) [16:22:56] (03CR) 10Filippo Giunchedi: [C: 031] "thanks Petr, if all pieces are in place LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/281740 (https://phabricator.wikimedia.org/T126571) (owner: 10Ppchelko) [16:26:21] ottomata: hey. remember our conversations about something archiva-like for python? I'm like, 90% of the way there now :) https://gerrit.wikimedia.org/r/#/c/282102/ [16:27:32] ottomata: also if you look at the patch itself, and the gerrit repo operations/wheels/devpi, you'll see that that's a reasonably ok way of deploying python services by itself [16:27:34] (03CR) 10EBernhardson: "looks good to go from here" [puppet] - 10https://gerrit.wikimedia.org/r/268215 (owner: 10EBernhardson) [16:28:25] COOL YuviPanda [16:29:25] YuviPanda: .wheels files directories? [16:29:26] (03CR) 10Ppchelko: "Filippo.. I thought you'd +2 it :)" [puppet] - 10https://gerrit.wikimedia.org/r/281740 (https://phabricator.wikimedia.org/T126571) (owner: 10Ppchelko) [16:29:34] or some kind of packaged file binary thing? 
[16:29:39] ottomata: packaged file binary thing. [16:29:42] cool [16:29:58] then it should be easy to do the same git fat repo thing we do for archiva artifacts [16:30:02] ottomata: ya [16:30:14] ottomata: and generating wheels is easy enough [16:30:27] might want to abstract out the git fat stuff from archiva puppet [16:30:38] and let it take multiple directories to look for files that should be in the git fat store [16:30:52] then the same git fat store ccould be used for any types of files [16:31:02] * YuviPanda nods [16:31:08] but jaaaa one thing at a time, super cool! [16:31:11] (03Abandoned) 10CSteipp: Revert "Revert "Enable Ex:OATHAuth in beta, disabled for all users"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282193 (owner: 10CSteipp) [16:31:27] if you are looking for a place to host in production, i betcha we could colocate with archiva [16:31:34] probably should to do the git fat store combo [16:31:45] on titanium [16:31:57] although, i think moritzm is trying to get a ganeti instance for that [16:32:01] so we should keep that in mind [16:32:44] ottomata: ya [16:32:50] ottomata: I don't need this in prod for now, just in labs. 
[16:32:56] err, just in tools even [16:33:06] k cool [16:34:15] ottomata: but if you want it for eventlogging, I'll be happy to help etc [16:34:25] well, i'm not in a hurry [16:34:28] we have all the deps as debs [16:34:40] so, it isin't important now [16:35:14] 7Blocked-on-Operations, 6Operations, 10RESTBase-Cassandra: expand raid0 in restbase200[1-6] - https://phabricator.wikimedia.org/T127951#2187332 (10fgiunchedi) restbase2003 just finished expanding its raid0, moving to 2002 [16:35:35] (03PS2) 10Yuvipanda: Use die-on-term on ores uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/281161 (https://phabricator.wikimedia.org/T131572) (owner: 10Ladsgroup) [16:35:37] (03PS1) 10CSteipp: Enable Ex:OATHAuth in beta, disabled for all users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282198 [16:35:46] Amir1: ^ am going to merge this, are you here? :) [16:35:49] ottomata: kkk [16:36:02] ottomata: well, then I'll poke you if I end up wanting it in prod and you can help me etc [16:36:24] k danke [16:36:31] sounds good [16:37:04] !log repool restbase2003, raid expansion finished, depool restbase2002 [16:37:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:44:11] (03CR) 10Alexandros Kosiaris: "It will NOT shutdown the server when it receives the TERM signal. It will brutally kill it. Which means requests still being served will b" [puppet] - 10https://gerrit.wikimedia.org/r/281161 (https://phabricator.wikimedia.org/T131572) (owner: 10Ladsgroup) [16:44:50] YuviPanda: you ok with the die-on-term setting ? [16:45:03] akosiaris: hahaha, so uwsgi is bullshit and by default reloads config on term [16:45:12] akosiaris: this one just moves it back to normal behavior I think [16:45:30] this bit me when I was setting up uwsgi for labs, but totally forgot [16:45:40] Til uWSGI 2.1, by default, sending the SIGTERM signal to uWSGI means “brutally reload the stack”  [16:46:12] akosiaris: til or till? 
:D [16:46:34] (03CR) 10Alexandros Kosiaris: "Er, I meant brutally reload btw. The rest holds true" [puppet] - 10https://gerrit.wikimedia.org/r/281161 (https://phabricator.wikimedia.org/T131572) (owner: 10Ladsgroup) [16:46:50] YuviPanda: c/ped from https://uwsgi-docs.readthedocs.org/en/latest/ThingsToKnow.html [16:47:05] akosiaris: right. so that's what we don't want. [16:47:09] we want it to stop [16:47:26] since it basically fucks all process managers [16:47:43] ah, I just read it once more for the 3rd time [16:47:48] ok it makes sense [16:48:05] ffs the entire documentation is bad there [16:48:08] akosiaris: :) yeah. this patch only makes sense because uwsgi does not make sense. [16:48:10] die != shutdown [16:48:14] grr [16:48:17] ok +2ing then [16:48:20] :D thanks! [16:48:31] akosiaris: btw, we should probably move that to default in the uwsgi module [16:48:39] !log start raid expansion on restbase2002 T127951 [16:48:40] T127951: expand raid0 in restbase200[1-6] - https://phabricator.wikimedia.org/T127951 [16:48:40] akosiaris: even graphite has super slow restart times due to this btw [16:48:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:49:18] YuviPanda: gunicorn :P [16:49:21] ftw! [16:49:22] :P [16:49:28] not sure if joking, etc :) [16:49:46] uwsgi has otherwise been rock solid though. and actually does have docs, even if they're a bit.. scattered [16:49:49] (03PS1) 10Urbanecm: Add flood group to ladwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282201 (https://phabricator.wikimedia.org/T131527) [16:50:36] (03CR) 10Alexandros Kosiaris: [C: 032] "OK, I 've reread once more https://uwsgi-docs.readthedocs.org/en/latest/ThingsToKnow.html. 
die-on-term is badly chosen as die != shutdown," [puppet] - 10https://gerrit.wikimedia.org/r/281161 (https://phabricator.wikimedia.org/T131572) (owner: 10Ladsgroup) [16:51:22] akosiaris: thanks :) [16:52:08] (03CR) 10Alexandros Kosiaris: [C: 032] ores: do git clone in staging [puppet] - 10https://gerrit.wikimedia.org/r/281228 (owner: 10Ladsgroup) [16:52:14] (03PS3) 10Alexandros Kosiaris: ores: do git clone in staging [puppet] - 10https://gerrit.wikimedia.org/r/281228 (owner: 10Ladsgroup) [16:52:25] (03CR) 10Alexandros Kosiaris: [V: 032] ores: do git clone in staging [puppet] - 10https://gerrit.wikimedia.org/r/281228 (owner: 10Ladsgroup) [16:53:27] YuviPanda: I'm around now [16:53:33] akosiaris: awesome [16:53:34] thanks [16:53:38] Amir1: patch got merged :) \o/ [16:53:49] yes, and another one [16:53:51] 6Operations, 10DNS, 10Mail, 10Phabricator, and 2 others: phabricator.wikimedia.org has no SPF record - https://phabricator.wikimedia.org/T116806#2187405 (10Nemo_bis) http://markmail.org/search/?q=from%3Aphabricator.wikimedia.org is empty, unlike http://markmail.org/search/?q=from%3Agerrit.wikimedia.org , s... [16:54:05] the only thing left is the scap3 config [16:55:31] \o/ [16:58:08] akosiaris: if you want to test it in beta cluster, tell me and I help you with that [16:58:17] (we have that already in beta cluster) [17:00:04] yurik gwicke cscott arlolra subbu: Dear anthropoid, the time has come. Please deploy Services – Graphoid / Parsoid / OCG / Citoid (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160407T1700). 
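A minimal sketch of the setting from the uwsgi discussion above. The only part taken from the conversation is the `die-on-term` option; per the uWSGI docs quoted there, without it SIGTERM reloads the stack instead of stopping it. The helper `render_uwsgi_ini` and the other keys and values are hypothetical and are not from the puppet uwsgi module.

```python
import configparser
import io


def render_uwsgi_ini(settings):
    """Render a [uwsgi] ini section from a dict of option -> value."""
    parser = configparser.ConfigParser()
    parser["uwsgi"] = {key: str(value) for key, value in settings.items()}
    buffer = io.StringIO()
    parser.write(buffer)
    return buffer.getvalue()


# Hypothetical settings; only die-on-term comes from the discussion.
EXAMPLE_SETTINGS = {
    "module": "app:application",
    "processes": 4,
    "die-on-term": "true",  # make SIGTERM stop uWSGI rather than reload it
}

if __name__ == "__main__":
    print(render_uwsgi_ini(EXAMPLE_SETTINGS))
```

With the option present, a process manager that sends SIGTERM should see the service stop, which is the behaviour the patch is after.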
[17:00:47] nothing for parsoid [17:00:53] (03CR) 10Alexandros Kosiaris: [C: 04-1] ores: Scap3 deployment configurations (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/280403 (https://phabricator.wikimedia.org/T130404) (owner: 10Ladsgroup) [17:01:23] Amir1: no, first I want to bring it up to par with service::node [17:03:21] kk [17:03:29] I'll fix the base.pp in the mean time [17:13:25] (03CR) 10Filippo Giunchedi: "I would, has it been tested in staging already?" [puppet] - 10https://gerrit.wikimedia.org/r/281740 (https://phabricator.wikimedia.org/T126571) (owner: 10Ppchelko) [17:24:22] (03PS1) 10Faidon Liambotis: squid: replace the last remaining netmask with CIDR [puppet] - 10https://gerrit.wikimedia.org/r/282206 [17:24:57] (03CR) 10Faidon Liambotis: [C: 032 V: 032] squid: replace the last remaining netmask with CIDR [puppet] - 10https://gerrit.wikimedia.org/r/282206 (owner: 10Faidon Liambotis) [17:25:18] (03PS4) 10Filippo Giunchedi: Emit resource_change events from RESTBase. [puppet] - 10https://gerrit.wikimedia.org/r/281740 (https://phabricator.wikimedia.org/T126571) (owner: 10Ppchelko) [17:25:26] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] Emit resource_change events from RESTBase. [puppet] - 10https://gerrit.wikimedia.org/r/281740 (https://phabricator.wikimedia.org/T126571) (owner: 10Ppchelko) [17:47:34] thcipriani: sorry for missing the window, stuff came up at work [17:48:06] (03PS45) 10Ladsgroup: ores: Scap3 deployment configurations [puppet] - 10https://gerrit.wikimedia.org/r/280403 [17:48:29] akosiaris: ^ [17:48:41] do you want me to do the split? [17:48:44] matanya: np :) evening SWAT looks fairly open if you'll be around. 
[17:48:59] it is problematic for me :) morning is when i am at work and evening is when i sleep [17:49:28] thcipriani: i'll try to move it to evening, hope i'll be around [17:51:14] (03PS1) 10Muehlenhoff: Fix traceback in error message if update is deployed twice [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/282209 [17:52:19] joal: am around if you still wanna chat [17:52:22] matanya: I've added the time one too [17:52:26] as you wiwhed. [17:52:28] wished [17:52:34] thanks [17:52:53] if you can do them one after the other, it would be great [17:53:22] god og has suggested we double the time to 180 seconds instead to go to 300. So we'll be at 4 Gb / 180 seconds. [17:54:18] That should work for 1 Gb files, and perhaps 2 Gb from good links [17:56:49] matanya: you've a test file by the way to test a > 2 Gb upload? [17:57:21] yes Dereckson [17:57:24] sec [17:58:29] Dereckson: https://commons.wikimedia.org/wiki/File:Aurat,_1940.webm [17:58:32] (03CR) 10Muehlenhoff: [C: 032 V: 032] Fix traceback in error message if update is deployed twice [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/282209 (owner: 10Muehlenhoff) [18:00:20] Okay, and a second one at 2.1 Gb / 2.2 Gb perhaps as fallback if it timeouts? [18:01:41] Dereckson: https://commons.wikimedia.org/wiki/File:Dilawar,_1931.webm [18:02:23] 2.03 is perfect. [18:02:37] Dereckson: note you will get a dup warning [18:02:42] so modify it abit [18:03:05] (03PS1) 10Ottomata: On second thought, remove new eventlogging::analytics role [puppet] - 10https://gerrit.wikimedia.org/r/282212 [18:03:19] As upload.wikimedia.org isn't in the whitelist, could you upload them to video2commons service perhaps? [18:03:19] 6Operations, 10Wikimedia-Apache-configuration: Redirect for Wikimedia v NSA - https://phabricator.wikimedia.org/T97341#2187697 (10Varnent) 5Resolved>3Open We would like to change the redirect for an announcement on Monday (if possible) to this URL: https://policy.wikimedia.org/stopsurveillance So http://w... 
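One way to "modify it a bit" for the dup warning mentioned above, as a rough sketch. It assumes the duplicate check keys on a hash of the file content (SHA-1 is assumed here). The helper names are hypothetical, and whether a video container tolerates a trailing byte is not checked.

```python
import hashlib
import os
import tempfile


def sha1_of_file(path, chunk_size=1024 * 1024):
    """Return the hex SHA-1 of a file, read in chunks so large files are not loaded at once."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def append_newline(path):
    """Append a single newline byte, which changes the file's content hash."""
    with open(path, "ab") as handle:
        handle.write(b"\n")


def demo():
    """Hash a throwaway file before and after the append and return both digests."""
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"stand-in for video bytes")
        path = tmp.name
    try:
        before = sha1_of_file(path)
        append_newline(path)
        after = sha1_of_file(path)
        return before, after
    finally:
        os.remove(path)
```

Calling `demo()` returns two different digests, so a hash-based duplicate check would treat the modified copy as a new file.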
[18:03:37] 6Operations, 10Wikimedia-Apache-configuration: Redirect for Wikimedia v NSA - https://phabricator.wikimedia.org/T97341#2187700 (10Varnent) p:5Normal>3High [18:04:09] Dereckson: as in host it in labs ? or do the entire process ? [18:04:26] first is enough I think [18:05:12] ok, will curl it to there [18:05:33] To avoid the duplicate hash, perhaps you could add a \n to both files (cat >> file.ext, enter, ctrl + D)? [18:06:04] (03CR) 10Ottomata: [C: 032] On second thought, remove new eventlogging::analytics role [puppet] - 10https://gerrit.wikimedia.org/r/282212 (owner: 10Ottomata) [18:07:49] (03PS1) 10Ottomata: Unpuppetize eventlogging on eventlog2001 [puppet] - 10https://gerrit.wikimedia.org/r/282216 [18:08:54] will do Dereckson [18:09:48] Perfect, so all will be ready for the SWAT test. [18:11:19] (03CR) 10Ottomata: [C: 032] Unpuppetize eventlogging on eventlog2001 [puppet] - 10https://gerrit.wikimedia.org/r/282216 (owner: 10Ottomata) [18:13:42] RECOVERY - puppet last run on eventlog2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:15:28] (03PS1) 10Ottomata: Fix cut/paste, didn't mean to remove standard and firewall [puppet] - 10https://gerrit.wikimedia.org/r/282219 [18:15:44] (03CR) 10Ottomata: [C: 032 V: 032] Fix cut/paste, didn't mean to remove standard and firewall [puppet] - 10https://gerrit.wikimedia.org/r/282219 (owner: 10Ottomata) [18:26:48] (03PS1) 10Ottomata: eventlogging::service::* classes now depend but don't include eventlogging::server [puppet] - 10https://gerrit.wikimedia.org/r/282220 (https://phabricator.wikimedia.org/T118772) [18:30:43] (03PS2) 10Ottomata: eventlogging::service::* classes now depend but don't include eventlogging::server [puppet] - 10https://gerrit.wikimedia.org/r/282220 (https://phabricator.wikimedia.org/T118772) [18:32:14] (03CR) 10jenkins-bot: [V: 04-1] eventlogging::service::* classes now depend but don't include eventlogging::server [puppet] - 
10https://gerrit.wikimedia.org/r/282220 (https://phabricator.wikimedia.org/T118772) (owner: 10Ottomata) [18:33:29] (03PS3) 10Ottomata: eventlogging::service::* classes now depend but don't include eventlogging::server [puppet] - 10https://gerrit.wikimedia.org/r/282220 (https://phabricator.wikimedia.org/T118772) [18:34:37] (03CR) 10jenkins-bot: [V: 04-1] eventlogging::service::* classes now depend but don't include eventlogging::server [puppet] - 10https://gerrit.wikimedia.org/r/282220 (https://phabricator.wikimedia.org/T118772) (owner: 10Ottomata) [18:37:18] (03PS4) 10Ottomata: eventlogging::service::* classes now depend but don't include eventlogging::server [puppet] - 10https://gerrit.wikimedia.org/r/282220 (https://phabricator.wikimedia.org/T118772) [18:42:36] (03PS3) 1020after4: add a group for phabricator deployment on deployment servers [puppet] - 10https://gerrit.wikimedia.org/r/282186 [18:43:11] (03PS5) 10Ottomata: eventlogging::service::* classes now depend but don't include eventlogging::server [puppet] - 10https://gerrit.wikimedia.org/r/282220 (https://phabricator.wikimedia.org/T118772) [18:46:27] (03CR) 10Ottomata: [C: 032] eventlogging::service::* classes now depend but don't include eventlogging::server [puppet] - 10https://gerrit.wikimedia.org/r/282220 (https://phabricator.wikimedia.org/T118772) (owner: 10Ottomata) [18:51:24] YuviPanda: BTW, one day, maybe maybe, instead of updating all of the upstart/puppet stuff for eventlogging just so we can use systemd with multiple dependent services...it would be really cool to deploy and distribute the various eventlogging services with wheels and kubernetes :) [19:00:04] marxarelli: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160407T1900). Please do the needful. 
[19:01:44] (03PS2) 10BBlack: secure GeoIP cookie T119576 [puppet] - 10https://gerrit.wikimedia.org/r/281980 [19:02:26] (03CR) 10BBlack: [C: 032 V: 032] secure GeoIP cookie T119576 [puppet] - 10https://gerrit.wikimedia.org/r/281980 (owner: 10BBlack) [19:02:32] (03PS3) 10BBlack: secure WMF-Last-Access cookie T119576 [puppet] - 10https://gerrit.wikimedia.org/r/281979 [19:02:45] (03CR) 10BBlack: [C: 032 V: 032] secure WMF-Last-Access cookie T119576 [puppet] - 10https://gerrit.wikimedia.org/r/281979 (owner: 10BBlack) [19:03:38] (03PS1) 10Ottomata: Remove notify from eventlogging::service:* classes to eventlogging/init Service. [puppet] - 10https://gerrit.wikimedia.org/r/282224 (https://phabricator.wikimedia.org/T118772) [19:05:17] 6Operations, 10MediaWiki-General-or-Unknown, 10Traffic, 10Wikimedia-General-or-Unknown, 7HTTPS: securecookies - https://phabricator.wikimedia.org/T119570#2187877 (10BBlack) [19:05:19] 6Operations, 10Traffic, 7HTTPS, 13Patch-For-Review, 7Varnish: Mark cookies from varnish as secure - https://phabricator.wikimedia.org/T119576#2187875 (10BBlack) 5Open>3Resolved a:3BBlack [19:05:37] (03CR) 10Ottomata: [C: 032] Remove notify from eventlogging::service:* classes to eventlogging/init Service. 
[puppet] - 10https://gerrit.wikimedia.org/r/282224 (https://phabricator.wikimedia.org/T118772) (owner: 10Ottomata) [19:07:26] (03PS1) 10Ottomata: Run eventlogging analytics daemons out of the scap deployment [puppet] - 10https://gerrit.wikimedia.org/r/282225 (https://phabricator.wikimedia.org/T118772) [19:10:29] (03PS2) 10BBlack: varnish+statsd: refactor classes, move rls to text-only [puppet] - 10https://gerrit.wikimedia.org/r/281439 (https://phabricator.wikimedia.org/T131353) [19:10:42] (03CR) 10BBlack: [C: 032 V: 032] varnish+statsd: refactor classes, move rls to text-only [puppet] - 10https://gerrit.wikimedia.org/r/281439 (https://phabricator.wikimedia.org/T131353) (owner: 10BBlack) [19:11:53] (03PS1) 10Dduvall: all wikis to 1.27.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282226 [19:14:48] (03CR) 10Dduvall: [C: 032] all wikis to 1.27.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282226 (owner: 10Dduvall) [19:15:15] (03Merged) 10jenkins-bot: all wikis to 1.27.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282226 (owner: 10Dduvall) [19:15:34] !log dduvall@tin rebuilt wikiversions.php and synchronized wikiversions files: all wikis to 1.27.0-wmf.20 [19:15:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:26:41] (03PS2) 10Ottomata: Run eventlogging analytics daemons out of the scap deployment [puppet] - 10https://gerrit.wikimedia.org/r/282225 (https://phabricator.wikimedia.org/T118772) [19:28:31] (03CR) 10Ottomata: [C: 032] Run eventlogging analytics daemons out of the scap deployment [puppet] - 10https://gerrit.wikimedia.org/r/282225 (https://phabricator.wikimedia.org/T118772) (owner: 10Ottomata) [19:32:40] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/). 
[19:34:54] !log restarting eventlogging so it runs out of the scap deploy in eventlogging/analytics [19:34:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:37:40] 6Operations, 10MediaWiki-Cache, 10MediaWiki-JobQueue, 10MediaWiki-JobRunner, and 3 others: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan - https://phabricator.wikimedia.org/T124418#2187915 (10BBlack) [19:38:50] (03PS1) 10Mattflaschen: Disable Echo survey on all wikis except test. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282230 [19:46:48] Krinkle: yt? [19:48:27] 6Operations, 10MediaWiki-Cache, 10MediaWiki-JobQueue, 10MediaWiki-JobRunner, and 3 others: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan - https://phabricator.wikimedia.org/T124418#2187948 (10BBlack) {F3845100} [19:55:14] ottomata: I am [19:55:37] so, i'd like to clean up some misc eventlogging uses that the analytics team doesn' tmanage much [19:55:40] i'm looking at the webperf ones [19:55:46] Yep, assumed as much [19:55:48] and, the only one that actually uses the eventlogging pythong lib [19:55:50] is ve.py [19:55:50] !log stop->disable varnishrls service on non-text clusters (upload, maps, misc) - ( https://gerrit.wikimedia.org/r/#/c/281439/ ) [19:55:53] the others connect to zmq [19:55:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:56:00] ottomata: So which one is the naughty one? [19:56:05] zmq or el? [19:56:15] I can see both being up for migration in some way [19:56:18] well, we'd like to stop using zmq, we maintain a special forwarder so you guys can run that code [19:56:23] but [19:56:33] that doesn't block me on this deployment refactoring i'm doing [19:56:35] Yeah, I saw that. 
[19:56:38] OK [19:56:42] so ve.py [19:56:46] could consume directly from kafka [19:56:50] So this is about changing the EL one to not use the global lib [19:56:51] since the data you need is all in a single topic [19:56:55] Yeah [19:57:00] Or change to zmq I suppose [19:57:02] i mean, either way, but you aren't really using much of EL in that code [19:57:07] naww, then you'd have to filter! [19:57:08] but I'm fine with using EL. [19:57:16] you could just import a python kafka client [19:57:19] and consume from kafk [19:57:31] I'd prefer not to use that directly though [19:57:33] am happy to make some code changes if you will review [19:57:34] oh? [19:57:38] We can use the newer EL library, right? [19:57:46] Which presumably has a method that does exactly that? [19:57:52] ja for sure [19:57:55] either way [19:58:01] How old is the global install on hafnium? [19:58:03] Does it use kafka? [19:58:10] but, if you don't use EL...then we don't have to do special deployments for yall :) [19:58:17] hm, not sure how old it is [19:58:21] we don't log in there regularly during deployments [19:58:28] i believe it has kafka support [19:58:47] so, i was thinking [19:58:51] of changing this deployment of kafka [19:58:58] ones that don't have el module managed daemons [19:59:02] I'm mostly worried about not hardcoding broker names or protocol stuff. So a client library makes sense. [19:59:03] to just use git::clone [19:59:23] scap and trebuchet have deployment targets, right? [19:59:32] ja, we are migrating away from trebuchet [19:59:38] we could do a 'common' or 'library' scap deployment [19:59:40] for your thing [19:59:44] Right [20:00:01] isn't that how it is now? [20:00:14] yeah, except it was deployed along with the analytics eventlogging target [20:00:20] i want to change that [20:00:28] i just finished a bunch of work to make this possible [20:00:29] Hm.. 
ok [20:00:31] so we can do it either way [20:00:40] Because it makes assumptions about the server being an EL server rather than client? [20:00:41] i'm thinking a git::clone into /usr/local/src/eventlogging for this kind of use might be better [20:00:49] ja, but i've decoupled all that [20:00:54] There is also an EL client on stats and terbium, right? [20:01:01] (or should be at least) [20:01:02] terbium ja, need to check in on that one [20:01:08] we also deploy the code to stat1002 so we can use it from the CLI [20:01:16] Yeah, exactly [20:01:21] but would be fine with a git::clone there [20:01:32] I test hafnium webperf scripts on terbium usually [20:01:38] since I don't have access to hafnium [20:01:47] And for one-off scripts I use terbium or stats1002 [20:02:16] which is several times a week usually [20:03:09] aye, would you prefer if deployments were done manually with scap, [20:03:15] or automated git pulls via git::clone in puppet [20:03:16] ? [20:03:35] I don't mind either way as long as it works and uses a version that isn't outdated. [20:03:48] if we did a git::clone, we were thinking of making a stable branch [20:03:51] I assume scap would be preferred. [20:03:51] that we only pushed to occasionally [20:03:56] Since you'd use that already [20:04:02] so that deployments happen more explicitly [20:04:13] for terbium I wouldn't mind a random version [20:04:14] yeah [20:04:18] but for hafnium stability is important [20:04:20] we use scap for eventlogging analytics, and for eventbus [20:04:32] not random in the middle of the night when someone happens to merge some code [20:04:33] could easily add one more [20:04:43] yeah [20:04:49] How is the scap for analytics different? [20:05:02] It's the same library that could be scapped to stat2, terbium and hafnium as well, right?
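A rough sketch of the two consumption styles raised earlier in this conversation: filtering everything off a firehose (the ZMQ approach) versus subscribing to a single Kafka topic. It assumes each event is a JSON object with a `schema` field. The function names, topic, broker list and group id are placeholders, and `consume_topic` relies on the third-party kafka-python package rather than on anything in the EventLogging library.

```python
import json


def filter_by_schema(raw_events, wanted_schemas):
    """Firehose-style consumption: see every event, keep only the wanted schemas."""
    for raw in raw_events:
        event = json.loads(raw)
        if event.get("schema") in wanted_schemas:
            yield event


def consume_topic(topic, brokers, group_id):
    """Kafka-style consumption: subscribe to one topic, no client-side filtering.

    Needs the kafka-python package; all arguments are placeholders.
    """
    from kafka import KafkaConsumer  # lazy import so the filter demo runs without it

    # A consumer group id lets the client resume from committed offsets.
    consumer = KafkaConsumer(topic, bootstrap_servers=brokers, group_id=group_id)
    for message in consumer:
        yield json.loads(message.value)
```

The first function does work proportional to the whole stream; the second only ever receives the topic it asked for, which is the saving described above.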
[20:05:19] yes, for eventbus, we wanted to separate out deployments [20:05:24] they were different services [20:05:40] scaps' deploy checks and service restarts are used for eventbus [20:05:43] so the scap configs are different there [20:05:55] we don't use scap checks or service restarts to manage the analytics daemons on eventlog1001 [20:05:57] Hm.. eventlogging-for-analytics as well? [20:05:58] at least not yet [20:06:02] (03PS3) 10BBlack: misc VCL: backend setting for varnish4 [puppet] - 10https://gerrit.wikimedia.org/r/282173 (https://phabricator.wikimedia.org/T131501) [20:06:12] i'm not sure if we will, we might prefer to keep doing things manually there [20:06:13] (03CR) 10BBlack: [C: 032 V: 032] misc VCL: backend setting for varnish4 [puppet] - 10https://gerrit.wikimedia.org/r/282173 (https://phabricator.wikimedia.org/T131501) (owner: 10BBlack) [20:06:44] but, Krinkle, just looking at this code, i'm not sure you need eventlogging. eventlogging could be cool, but you are only consuming and then doing everything in your own code [20:06:57] ottomata: I guess there could be an eventlogging-client and eventlogging-for-analytics scap on tin that both sync the same code, but are triggered separately. [20:07:05] E.g. you'd git pull and scap in both of them [20:07:08] if ve.py was a @reads handler plugin in eventlogging [20:07:15] then you could use the eventlogging-consumer and some configs [20:07:35] yes [20:07:36] brb [20:07:40] (~15min) [20:07:47] Krenair: that's how we do eventlogging-analytics and eventlogging-eventbus separately now [20:07:50] oops [20:07:55] sorry for ping, meant to use Krinkle [20:08:05] (brb ACKed) [20:08:28] i'm not really sure what eventlogging on terbium is used for [20:12:26] Krinkle: if all these different webperf (coal, navtiming, deprecate, etc.) things actually used the eventlogging code, then it might make sense to deploy eventlogging to hafnium [20:12:43] but, afact, only ve.py uses it, and then only barely. 
it still connects to the zmq endpoint [20:13:06] as the rest do [20:13:21] which means they consume all eventlogging events just to filter it down to a few schemas [20:13:59] if they used a kafka client directly, or even the eventlogging kafka reader, you could pick the topics you wanted and wouldn't have to spend the cycles filtering [20:14:09] AND you'd be able to restart and not miss messages :) [20:14:29] hm, although, maybe the ZMQ endpoint does that too, not sure [20:17:05] ATTN: I'm rebooting one of the frack puppetmasters, might be an alert or two from other hosts [20:17:19] PROBLEM - Kafka Broker Replica Max Lag on kafka1013 is CRITICAL: CRITICAL: 68.97% of data above the critical threshold [5000000.0] [20:22:57] (03PS1) 10Ema: misc: Use random backend director for logstash [puppet] - 10https://gerrit.wikimedia.org/r/282234 [20:25:09] PROBLEM - check_puppetrun on heka is CRITICAL: CRITICAL: Puppet has 1 failures [20:25:09] PROBLEM - check_puppetrun on mintaka is CRITICAL: CRITICAL: Puppet has 28 failures [20:25:10] PROBLEM - check_puppetrun on payments2003 is CRITICAL: CRITICAL: Puppet has 35 failures [20:27:42] (03CR) 10BryanDavis: [C: 031] misc: Use random backend director for logstash [puppet] - 10https://gerrit.wikimedia.org/r/282234 (owner: 10Ema) [20:30:09] PROBLEM - check_puppetrun on heka is CRITICAL: CRITICAL: Puppet has 1 failures [20:30:09] PROBLEM - check_puppetrun on mintaka is CRITICAL: CRITICAL: Puppet has 28 failures [20:30:10] RECOVERY - check_puppetrun on payments2003 is OK: OK: Puppet is currently enabled, last run 251 seconds ago with 0 failures [20:30:22] (03PS1) 10Ema: Misc VCL forward-port to Varnish 4 [puppet] - 10https://gerrit.wikimedia.org/r/282235 (https://phabricator.wikimedia.org/T131501) [20:34:58] (03PS1) 10Ottomata: Lower eventlogging-service access log level to WARNING [puppet] - 10https://gerrit.wikimedia.org/r/282237 [20:35:09] RECOVERY - check_puppetrun on heka is OK: OK: Puppet is currently enabled, last run 17 seconds 
ago with 0 failures [20:35:09] PROBLEM - check_puppetrun on mintaka is CRITICAL: CRITICAL: Puppet has 28 failures [20:39:09] (03CR) 10Ema: [C: 032 V: 032] misc: Use random backend director for logstash [puppet] - 10https://gerrit.wikimedia.org/r/282234 (owner: 10Ema) [20:40:09] RECOVERY - check_puppetrun on mintaka is OK: OK: Puppet is currently enabled, last run 219 seconds ago with 0 failures [20:40:30] (03PS2) 10Ottomata: Lower eventlogging-service access log level to WARNING [puppet] - 10https://gerrit.wikimedia.org/r/282237 [20:40:36] (03CR) 10Ottomata: [C: 032 V: 032] Lower eventlogging-service access log level to WARNING [puppet] - 10https://gerrit.wikimedia.org/r/282237 (owner: 10Ottomata) [20:47:32] (03CR) 10Dereckson: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282201 (https://phabricator.wikimedia.org/T131527) (owner: 10Urbanecm) [20:48:11] ottomata: Yeah, I'm all for using kafka. [20:48:23] ottomata: afaik the statsd one does that already [20:48:29] well, it has to since its not EL [20:48:32] scrap that [20:49:05] ja [20:49:08] well actually [20:49:12] it could use EL [20:49:13] ottomata: Though preferably via the EL library [20:49:20] the statsv one [20:49:21] except for statsd [20:49:26] it could! [20:49:34] all you'd be using EL for is to consume from kafka [20:49:41] using its URI address schemes [20:49:46] rather than configuring the kafka client directly [20:49:48] does it take a raw topic as ID? [20:49:59] interesting [20:50:03] anyway [20:50:29] an eventlogging kafka client is configured using a uri [20:50:30] Converting webperf subscribers to use EL lib via kafka makes sense [20:50:34] that wil be passed to this handler [20:50:34] https://github.com/wikimedia/eventlogging/blob/master/eventlogging/handlers.py#L438 [20:50:44] returning a generator of messages to iterate on [20:50:59] Hm.. but that requires knowing the brokers [20:51:10] Krinkle: , but ja, you'd pass them in as args, no? 
[20:51:16] and a lot of other details (ports, topic names used by EL internally) [20:51:18] you 'have to know' the ZMQ address [20:51:26] Yeah, and I don't like it :) [20:51:39] ja, i mean, you are going to have to parameterize your connection info somewhere [20:51:42] no matter the client or service [20:52:18] Krinkle: mobrovac has been working on a patch to make this easier in puppet with kafka [20:52:38] ottomata: Hm.. I figured something was puppetized somewhere so that EL knows which brokers to use [20:52:43] so you could configure your ve.pp class to use a function to pull out kafka configs, and then construct the full input uri [20:52:46] how do you configure the EL server (the producer) [20:52:47] ja it is [20:52:52] right now, via config classes [20:52:54] like this [20:53:17] https://github.com/wikimedia/operations-puppet/blob/production/manifests/role/eventlogging.pp#L29 [20:53:23] include role::kafka::analytics::config [20:53:29] $kafka_base_uri = inline_template('kafka:///<%= @kafka_brokers_array.join(":9092,") + ":9092" %>') [20:53:31] which is kinda gross [20:53:48] but, it is 100% in puppet :) [20:54:00] marko is changing all that ::config class stuff [20:54:04] kafka_schema_uri [20:54:05] to a function that pulls data out of hiera [20:54:05] https://gerrit.wikimedia.org/r/#/ [20:54:08] oops [20:54:11] https://gerrit.wikimedia.org/r/#/c/279280/ [20:54:22] https://gerrit.wikimedia.org/r/#/c/279280/15/modules/role/lib/puppet/parser/functions/kafka_config.rb [20:54:36] so for eventlogging, we'll be able to do [20:55:15] ideally eventlogging client lib can read 'kafka_schema_uri' (or something like that) and take only a schema name from the method signature and get a stream going. [20:55:28] $kafka_config = kafka_config('analytics')[ [20:55:28] $kafka_base_uri = "$kafka_config['brokers']['string']?zookeeper_connect=$kafka_config['zookeeper']['url']" [20:55:29] or something [20:56:06] hmmm, maybe.
schemas don't necessarily map 1to1 with topics [20:56:12] Oh? [20:56:16] well, heh [20:56:22] there are many revisions of a schema [20:56:24] and [20:56:41] there's no reason why a schema *couldn't* be used for multiple topics. in the eventlogging analytics case, this doesn't happen [20:56:45] soooo maybe a moot point [20:56:57] but, i wouldn't want to make the code enforce that [20:57:08] our subscribers are revision agnostic usually. always forward-compat. [20:57:11] ja [20:57:13] OK [20:57:16] So alternatively [20:57:18] but ja [20:57:19] you can do [20:57:24] a kafka uri with eventlogging like [20:57:38] kafka:///brokers.../?zookeeper_connect=...&topic=eventlogging_Edit [20:57:39] we could discourage using EL + schema as an API and instead have these be kafka subscribers directly. [20:57:51] As long as most of it can be abstracted via puppet config [20:57:57] either way is fine with me [20:58:02] we'll need to refactor the script to take a cli parameter that puppet can pass on startup [20:58:16] which contains everything except topic name [20:58:31] ja, you'll have less work to do in your code if you use EL, because EL abstracts the kafka setup for ya. AND when we upgrade kafka versions, you'll know the client with eventlogging will work, cause we have to manage that [20:58:42] OK [20:58:46] So let's stick with using EL then [20:58:46] but, it means you might be subject to more bugs, because eventlogging code moves more than kafka clients do [20:59:24] ok Krinkle, i'm happy to work on the change for ve.py if you don't think you'll have time in the short term [20:59:28] if you can review [20:59:43] the other stuff should change too, but it isn't a blocker for my current task, since they use ZMQ directly [20:59:45] so within parameters of not using python-kafka directly and not hardcoding broker names in the .py script, is there something we can do now in puppet:files:webperf/*.py ? [21:00:10] So you'd change it to use the local install instead of global?
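(Annotation, not part of the channel log.) The kafka URI shape quoted above, with the broker list in the path and the remaining connection details in the query string, could be assembled from parameters so that nothing is hardcoded in the script. A minimal Python sketch, assuming a made-up helper name `build_kafka_uri` and example hostnames; this is not EventLogging's own code, only an illustration of the string being discussed:

```python
from urllib.parse import urlencode


def build_kafka_uri(brokers, port=9092, **params):
    """Build 'kafka:///host:port,host:port?key=value' (hypothetical helper)."""
    # Same idea as the puppet inline_template: append the port to each broker
    # and join the results with commas.
    broker_list = ",".join("{}:{}".format(broker, port) for broker in brokers)
    # Keep '/', ':' and ',' unescaped so the zookeeper path stays readable.
    query = urlencode(params, safe="/:,")
    return "kafka:///{}{}".format(broker_list, "?" + query if query else "")


# Example values only; real broker and zookeeper names would come from puppet.
uri = build_kafka_uri(
    ["kafka1012.eqiad.wmnet"],
    zookeeper_connect="conf1001.eqiad.wmnet:2181/kafka/eqiad",
    topic="eventlogging_Edit",
)
print(uri)
```

With this shape, the script only needs the finished URI as a CLI argument, and the topic is the one part that varies per subscriber.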
[21:00:16] oh [21:00:35] yeah, hm [21:00:51] i guess you'd set the PYTHONPATH env var in the init scripts [21:00:52] ? [21:00:58] to /srv/deployment/eventlogging/eventlogging? [21:01:06] if you do, you should try to make that configurable in puppet [21:01:25] since we are going to move that deployment to something like eventlogging/common, or maybe even in /usr/local/src if we do git::clone [21:01:28] https://github.com/wikimedia/operations-puppet/blob/production/modules/webperf/manifests/ve.pp#L9 [21:01:35] I see what you mean now [21:01:38] ah for URIs? [21:01:40] it uses the zmq url [21:01:50] I was wondering why I didn't see any zmq code in ve.pp [21:02:06] but it's abstracted behind eventlogging.connect() [21:02:13] eventlogging uses the same tcp:// scheme for zmq [21:02:13] which supports a ZMQ url I guess [21:02:13] jaaa [21:02:15] yup [21:02:27] so for that you'd just change endpoint to the kafka URI [21:02:30] I figured it just took a hostname and did the rest internally [21:02:32] actually [21:02:37] i think that would just work [21:02:43] events.filter('Edit') would be redundant then [21:03:44] ottomata: Are there eventlogging consumers in prod that use python and are not bundled inside the main server runtime? [21:03:48] (besides webperf) [21:06:00] Krinkle: not really that I know of, the only other one I know of is the eventlogging include on terbium, but you seem to know what that is about [21:06:16] cool, Krinkle, just tested on stat1002 [21:06:18] yeah, there are no persistent subscribers there afaik [21:06:47] I'm asking to potentially create a secondary use case to help make consistent. I'm open to anything. Just unclear on what the "right" pattern is.
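(Annotation, not part of the channel log.) The point about the schema filter becoming redundant can be pictured with a plain-Python sketch: on a mixed stream every event is inspected and most are discarded, whereas a single-topic stream would already contain only the wanted schema. The function name and the sample events below are invented for illustration and do not use the eventlogging library:

```python
def filter_by_schema(events, wanted):
    """Yield only the events whose 'schema' field is in `wanted`."""
    wanted = set(wanted)
    for event in events:
        if event.get("schema") in wanted:
            yield event


# Made-up sample of a mixed stream; the inner field names are illustrative.
firehose = [
    {"schema": "Edit", "event": {"action": "saveSuccess"}},
    {"schema": "NavigationTiming", "event": {"responseStart": 120}},
    {"schema": "Edit", "event": {"action": "init"}},
]

edits = list(filter_by_schema(firehose, ["Edit"]))
print(len(edits))
```

When the input is already restricted to one topic, the filtering loop does no useful work, which is the saving in cycles mentioned in the discussion.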
[21:06:51] export PYTHONPATH=/srv/deployment/eventlogging/eventlogging && python /home/otto/ve-otto.py 'kafka:///kafka1012.eqiad.wmnet:9092?zookeeper_connect=conf1001.eqiad.wmnet:2181/kafka/eqiad&topic=eventlogging_Edit' [21:07:31] exact code as ve.py, just print() instead of statsd.sendto [21:08:15] ja, so, I think right pattern is either scap deploy for eventlogging/common (name TBD), OR a git::clone of eventlogging code into /usr/local/src/eventlogging [21:08:18] in either case [21:08:33] PYTHONPATH needs to be set [21:08:36] in order for it to work [21:08:37] there exists a local install on hafnium? [21:08:39] (now) [21:08:48] in /srv/deployment/eventlogging/eventlogging? [21:08:51] e.g. is setting pythonpath actionable now? [21:08:53] yes [21:08:56] But it's unused? [21:09:03] right, because it is globally installed from that path [21:09:07] we aren't doing that anymore [21:09:12] I see [21:09:14] and, that path is deployed by trebuchet, not scap [21:09:24] the global install used that same directory to install the global bindings [21:09:25] so, if we did the work to make it possible to run webperf stuff with configured PYTHONPATH [21:09:32] then changing to the scap/git::clone one later is easy [21:09:38] yes [21:09:43] previously deploy process was [21:09:45] ssh tin [21:09:50] cd ...; git pull, git deploy [21:10:08] for each EL_host; do ssh $EL_host; cd ...; sudo python setup.py install; done [21:10:16] Right [21:10:20] which was no good [21:10:21] :) [21:10:52] so ja, an intermediate step would be to get ve.py to run out of /srv/deployment via PYTHONPATH in the init script [21:10:54] It'd be nice to not need pythonpath though. It makes cli usage harder than needed. For one case it seems fine, but I'm wondering if this is a good pattern to spread further. Is there a different way to make it bind in the default path, or to extend the default path through puppet or scap?
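(Annotation, not part of the channel log.) Besides exporting PYTHONPATH before launching the script, the same effect can be had inside the script by extending `sys.path` from a configurable setting. A minimal sketch, assuming a made-up environment variable name `EVENTLOGGING_PATH` and using the checkout path from the log only as a default; it is not how ve.py actually does it:

```python
import os
import sys


def add_library_path(path, search_path=None):
    """Prepend `path` to the module search path unless it is already there."""
    search_path = sys.path if search_path is None else search_path
    if path and path not in search_path:
        search_path.insert(0, path)
    return search_path


# EVENTLOGGING_PATH is a hypothetical variable name chosen for this sketch.
el_path = os.environ.get(
    "EVENTLOGGING_PATH", "/srv/deployment/eventlogging/eventlogging"
)
add_library_path(el_path)
# An 'import eventlogging' placed after this point would be resolved from
# el_path first, provided a checkout actually exists at that location.
```

Keeping the location in one setting means a later move of the checkout (for example to a different deployment directory) would not require touching the script itself.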
[21:10:59] but, the PYTHONPATH should be configurable [21:11:02] so we can change it [21:11:18] oh sure [21:11:23] Would also like ori's thoughts on what seems a good code pattern for python as I'm not experienced with that myself [21:11:24] ori: [21:11:24] i mean, it could be put in shell profiles by default [21:11:42] if we deployed with a virtualenv (which we might one day), one could just source the venv [21:11:51] oo boy i have to leave in 8 mins [21:11:52] I mean, should we change ve.pp to export that variable? [21:12:17] it uses systemd as well [21:12:30] RECOVERY - Kafka Broker Replica Max Lag on kafka1013 is OK: OK: Less than 50.00% above the threshold [1000000.0] [21:12:34] if webperf::ve had an $eventlogging_path parameter [21:12:46] then we could configure it [21:12:51] and the init script would just set the env var [21:13:06] Environment="PYTHONPATH=<%= @eventlogging_path %>" [21:14:05] but ja, indeed for casual shell users it's not nice to have to set that all the time [21:15:14] ottomata: is there like an equiv to php.ini in which we could puppetize a file in /etc/php/conf.d/eventlogging that appends something to the default runtime include path or some such [21:15:20] or a symlink to puppetize [21:15:28] it seems like sudo python install doesn't do very much [21:15:45] ah, that would be nice if so [21:15:47] i am not aware of one [21:15:48] 6Operations, 6Labs, 10Tool-Labs, 7Icinga: tool labs instance distribution monitoring is broken - https://phabricator.wikimedia.org/T119929#2188266 (10Andrew) The username/password issue is resolved on palladium; a fix for another bug is attached. After all that, the test will still fail because the bastio... [21:15:53] i've only done it via PYTHONPATH [21:15:55] or, in code [21:15:57] editing [21:16:05] sys.path (or whatever) [21:16:28] (03PS1) 10Andrew Bogott: Tools: Fix spreadcheck.py [puppet] - 10https://gerrit.wikimedia.org/r/282287 (https://phabricator.wikimedia.org/T119929) [21:16:54] Hm.
yeah [21:17:14] ottomata: Can you summarise later on https://phabricator.wikimedia.org/T131977 what we should do? I think it'd be nice to do this one ourselves to get a little more familiar. [21:17:24] ok [21:17:51] And save you some time. I'll see if we can do the same to the others soon after. Should be easy enough that way to phase out zmw [21:17:53] zmq [21:18:04] Krinkle: this is the other one [21:18:04] https://phabricator.wikimedia.org/T110903 [21:18:06] for that [21:18:13] assuming there are no other downsides to be uncovered in terms of resource consumption or something like that [21:18:13] (03PS2) 10Andrew Bogott: Tools: Fix spreadcheck.py [puppet] - 10https://gerrit.wikimedia.org/r/282287 (https://phabricator.wikimedia.org/T119929) [21:18:13] cool danke! [21:18:27] naw shouldn't be, there is the issue of if you want to store offsets [21:18:27] less filtering sounds good :) [21:18:32] so if you stop your process [21:18:36] it starts where it left off [21:18:41] yeah, we definitely want to keep offsets I imagine [21:18:49] ja [21:18:50] might as well [21:18:53] Though there is the issue of graphite/statsd not supporting timestamps [21:19:04] so it tends to skew things when it catches up [21:19:11] (03CR) 10Andrew Bogott: [C: 032] Tools: Fix spreadcheck.py [puppet] - 10https://gerrit.wikimedia.org/r/282287 (https://phabricator.wikimedia.org/T119929) (owner: 10Andrew Bogott) [21:19:13] but for quick restarts it's nice [21:19:21] (03CR) 10Rush: [C: 031] "thanks andrew" [puppet] - 10https://gerrit.wikimedia.org/r/282287 (https://phabricator.wikimedia.org/T119929) (owner: 10Andrew Bogott) [21:19:23] ja [21:19:39] OOOK time to go [21:19:42] thanks Krinkle ttyl [21:19:47] cya [21:21:47] (03PS2) 10Krinkle: Remove inaccessible symlinks at /w/extensions and /w/skins [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281379 (https://phabricator.wikimedia.org/T99096) [21:21:57] (03CR) 10Krinkle: [C: 032] Remove inaccessible symlinks at /w/extensions and /w/skins 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/281379 (https://phabricator.wikimedia.org/T99096) (owner: 10Krinkle) [21:22:23] (03Merged) 10jenkins-bot: Remove inaccessible symlinks at /w/extensions and /w/skins [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281379 (https://phabricator.wikimedia.org/T99096) (owner: 10Krinkle) [21:25:49] (03PS1) 10Mattflaschen: Set 'upload' service to false on Labs to avoid 'Undefined index' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282290 [21:28:10] (03CR) 10Mattflaschen: [C: 032] "Minor Beta Cluster-only change" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282290 (owner: 10Mattflaschen) [21:28:26] 6Operations, 10Wikimedia-Apache-configuration: Redirect for Wikimedia v NSA - https://phabricator.wikimedia.org/T97341#2188311 (10Krenair) Why was this reopened instead of a new task being created? This task was resolved months ago... [21:28:52] (03Merged) 10jenkins-bot: Set 'upload' service to false on Labs to avoid 'Undefined index' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282290 (owner: 10Mattflaschen) [21:30:20] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. [21:30:53] !log krinkle@tin Synchronized w/: I69653efe0f1968: rm old symlinks (duration: 00m 46s) [21:30:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:32:33] !log mattflaschen@tin Synchronized wmf-config/LabsServices.php: Beta Cluster change (duration: 00m 30s) [21:32:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:37:27] 6Operations, 10Traffic, 7HTTPS, 5MW-1.27-release-notes, 13Patch-For-Review: Insecure POST traffic - https://phabricator.wikimedia.org/T105794#1451751 (10bd808) The kibana dashboard at (NDA required) shows the long tail of b... 
[21:43:09] 6Operations, 10Wikimedia-Apache-configuration: Redirect for Wikimedia v NSA - https://phabricator.wikimedia.org/T97341#2188377 (10Varnent) I was following the same process as the last time the redirect was changed - which is what Dzahn suggested we do (see earlier comments). [21:45:29] hello channel [21:46:17] I'm working on getting our puppet module that deploys mediawiki on precise to deploy mediawiki on trusty [21:46:44] the image scaler functionality uses ImageMagick which has a lot of requirements [21:46:56] I got the font libraries renamed for trusty [21:47:36] unfortunately I'm not having much luck with finding an ffmpeg or a libvips15 [21:47:38] https://review.openstack.org/#/c/302456/1/manifests/image_scaler.pp [21:48:15] do you have a puppet module that deploys mediawiki on trusty and if yes, might you be willing to share a link to where I might find it? [21:48:18] my thanks [21:52:30] anteaya: I think someone mentioned using libav precisely because of that [21:52:42] I will look at that, thank you [21:52:51] see https://phabricator.wikimedia.org/T103335#2184918 [21:54:19] * anteaya reads [21:55:36] that was kind of hashar to mention my questions, sorry that I missed his direction to this url but thank you Platonides for ensuring I saw it [21:59:09] 6Operations, 10Analytics, 10hardware-requests, 13Patch-For-Review: eqiad: (3) AQS replacement nodes - https://phabricator.wikimedia.org/T124947#1970805 (10RobH) [21:59:38] 6Operations, 10hardware-requests: eqiad: (3) nodes for Druid / analytics - https://phabricator.wikimedia.org/T128807#2188435 (10RobH) [21:59:44] you are welcome, anteaya [21:59:48] :) [22:00:22] !log start update RESTBase to 7f69f86ee9 [22:00:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:00:36] 6Operations, 10hardware-requests: eqiad: (3) nodes for Druid / analytics - https://phabricator.wikimedia.org/T128807#2086645 (10RobH) [22:01:32] 6Operations, 10hardware-requests: eqiad: (3)
nodes for Druid / analytics - https://phabricator.wikimedia.org/T128807#2086645 (10RobH) [22:01:34] Hey, quick question, if a email comes in without a valid destination it is dropped permanently--correct? [22:01:43] 6Operations, 10hardware-requests: eqiad: (3) nodes for Druid / analytics - https://phabricator.wikimedia.org/T128807#2086645 (10RobH) [22:02:21] 6Operations, 10hardware-requests: eqiad: (3) nodes for Druid / analytics - https://phabricator.wikimedia.org/T128807#2086645 (10RobH) [22:05:55] !log finished update RESTBase to 7f69f86ee9 [22:05:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:10:45] (03CR) 10BBlack: [C: 031] "+1 in that I think this works for a literal conversion of the existing VCL. There's still the fetch_large_objects mess I've rambled about" [puppet] - 10https://gerrit.wikimedia.org/r/282235 (https://phabricator.wikimedia.org/T131501) (owner: 10Ema) [22:15:05] (03CR) 10BBlack: [C: 031] "Is this ready to merge if the code works, or is there more decision process left?" [puppet] - 10https://gerrit.wikimedia.org/r/281031 (https://phabricator.wikimedia.org/T127021) (owner: 10Bmansurov) [22:22:01] 6Operations, 10Traffic, 7HTTPS, 5MW-1.27-release-notes, 13Patch-For-Review: Insecure POST traffic - https://phabricator.wikimedia.org/T105794#2188513 (10BBlack) Thinking ahead a little (because I'm sure the long tail will still be long after we go through the announcement -> cutoff date phase): it would... [22:31:47] 6Operations, 10Analytics-Cluster: setup stat1004/WMF4721 for hadoop client usage - https://phabricator.wikimedia.org/T131877#2188546 (10RobH) [22:37:42] (03PS1) 10Halfak: Moves wikilabels wsgi to labels_web.py [puppet] - 10https://gerrit.wikimedia.org/r/282296 [22:40:47] greg-g, I'm going to go out on a limb and assume you're still the deploy person? 
[22:41:02] or even if not; I'm cleaning up some stuff and noted that I still have admin on wikitech [22:41:11] which is not something I've needed for a long time [22:41:12] :'( [22:42:32] 6Operations, 10Traffic, 10Wikimedia-Fundraising, 7Blocked-on-Fundraising-Tech, 7HTTPS: links.email.donate.wikimedia.org should offer HTTPS - https://phabricator.wikimedia.org/T74514#2188594 (10CCogdill_WMF) Closing the loop on this, finally! We no longer need to support Silverpop's domains (mtk41 or mkt4... [22:43:18] an mwalker !!! [22:43:28] I know! [22:43:33] how's the flying car business?! [22:43:36] flying! [22:43:43] better than the alternative! [22:43:48] most definitely [22:43:55] the earth has enough holes in it already [22:43:59] :) [22:44:09] so, you want me to remove your admin or what's up? [22:44:18] yep; I want you to remove my admin [22:44:31] from deployment-prep or something else? [22:44:40] from wikitech in general actually [22:44:44] oh [22:44:56] I think I needed it for the deployment calendar work I was doing [22:44:59] 6Operations, 10Analytics-Cluster: setup stat1004/WMF4721 for hadoop client usage - https://phabricator.wikimedia.org/T131877#2188623 (10RobH) a:5RobH>3Ottomata [22:45:03] gotcha [22:45:07] which hopefully you've found something better than my hacky code [22:45:13] lest someone curse me for eternity [22:45:34] https://wikitech.wikimedia.org/w/index.php?title=Special:ListUsers&group=sysop [22:45:37] I see [22:45:47] uhhh, jouncebot is still alive and kicking :) [22:45:48] 6Operations, 10Analytics-Cluster: setup stat1004/WMF4721 for hadoop client usage - https://phabricator.wikimedia.org/T131877#2181451 (10RobH) puppet/salt keys signed and ready for service implementation. I've assigned to @Ottomata since he filed the initial #hardware-requests . 
[22:45:51] jouncebot: next [22:45:51] In 0 hour(s) and 14 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160407T2300) [22:46:38] hrmm, I can't remove admin from you... [22:46:45] (03CR) 10Yuvipanda: [C: 032 V: 032] Moves wikilabels wsgi to labels_web.py [puppet] - 10https://gerrit.wikimedia.org/r/282296 (owner: 10Halfak) [22:47:01] mwalker: Gosh. Hello. [22:47:20] Hello. [22:47:26] I know... the imposter returns and all [22:47:28] greg-g: so 280831 and 281823 are swattable? [22:47:54] Dereckson: yeppers [22:48:05] yay [22:48:09] :) [22:48:16] i am even around, and have test cases ready [22:48:29] James_F: do you happen to know how to revoke mwalker's sysop from wikitech? [22:48:39] I can do that godog [22:48:41] greg-g: [22:48:45] :) [22:48:45] greg-g: You'll need a steward account. [22:48:53] greg-g: Which of course don't exist. [22:49:04] what is the user name ? [22:49:06] https://wikitech.wikimedia.org/w/index.php?title=Special:ListUsers&group=steward right [22:49:12] matanya, Mwalker [22:49:12] matanya: Mwalker [22:49:16] jinx [22:49:36] matanya: You don't have steward rights on that wiki, I imagine? Non-SUL wiki and https://wikitech.wikimedia.org/w/index.php?title=Special%3AListUsers&username=&group=steward&limit=50 is empty. [22:50:00] greg-g: Get a shell user to do it. [22:50:02] if i am not mistaken i have crat rights, and that is rnough there [22:50:12] (change visibility) 22:49, 7 April 2016 Reedy (talk | contribs | block) changed group membership for Mwalker from shell and shellmanagers to shell [22:50:12] (change visibility) 22:49, 7 April 2016 Reedy (talk | contribs | block) changed group membership for Mwalker from shell, shellmanagers and administrator to shell and shellmanagers [22:50:36] :) thanks Reedy [22:50:53] mwalker: you are now powerless [22:50:57] * James_F grins. [22:51:07] Ah, I have +crat rights too. [22:51:11] Ah well. 
[22:51:30] thanks much; one security hole fewer :) [22:51:40] mwalker: You could've just enabled 2FA ;) [22:51:40] mwalker: Do you still have shell? ;-) [22:52:19] Reedy: give me steward there please :) [22:52:22] Reedy, I did have 2FA enabled; but I need to re-install my phone and it seemed better to not have rights I didn't need than transfer the account [22:52:29] I don't have steward [22:52:32] James_F, not that I'm aware of [22:52:56] Reedy: i thought it is a shop you request for random rights ;) [22:53:48] mwalker: well, if you're ever around the area again, come on by, preferably on a Mon or Thurs when I'm in the office (or come on up to Petaluma :) ) [22:53:52] Chaos monkey that assigns right [22:54:24] is there even a process for rights on wikitech ? [22:54:35] andrewbogott should probably know [22:54:38] or bd808 [22:54:41] Process is probably a strong word [22:54:51] I really should! but it seems that the hour into the city is not a commute I ever make during the work week; only during the weekends; so I've missed all your respective companies [22:55:06] :( [22:55:22] matanya, why do you want steward rights there? [22:55:32] to help people [22:55:37] mwalker: where do you spend your time nowadays ? [22:56:09] do what exactly? do you guys usually hold that the local group outside of meta when you're not using it? [22:56:20] no, not at all [22:56:24] in Mountain View; at a company with decidedly less flexible working locations [22:56:32] but wikitech os outside the main cluster [22:56:49] (03PS2) 10Dereckson: Logo update for lad.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254471 (https://phabricator.wikimedia.org/T118491) [22:57:05] well I would ask the operations team if they are okay with you having that right there [22:57:06] I hope you enjoy mwalker and have the great time in general. 
[22:57:40] coming to think if it Krenair it would make more sense to give crats the ability to add and remove rights as acting steward [22:57:46] *of it [22:58:11] add and remove admin rights, maybe [22:58:28] for example [22:58:43] but we should be careful about who gets to grant cloudadmin [22:58:53] (03PS3) 10Dereckson: Logo update for lad.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254471 (https://phabricator.wikimedia.org/T118491) [22:59:31] (03CR) 10Dereckson: "PS2: new logo by Perhelion, optipng -o7" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254471 (https://phabricator.wikimedia.org/T118491) (owner: 10Dereckson) [23:00:04] RoanKattouw ostriches Krenair MaxSem awight Dereckson: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160407T2300). [23:00:04] Dereckson matanya matt_flaschen: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:28] yes, of course Krenair [23:00:52] I can SWAT. matt_flaschen [23:01:00] Would you be there? [23:03:33] Okay, so we'll start with the two upload one. [23:06:26] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/280831 (https://phabricator.wikimedia.org/T131895) (owner: 10Matanya) [23:07:11] (03Merged) 10jenkins-bot: upload limit: raise to 4 GB [mediawiki-config] - 10https://gerrit.wikimedia.org/r/280831 (https://phabricator.wikimedia.org/T131895) (owner: 10Matanya) [23:08:31] Dereckson, sorry, I'm here. I just didn't notice the ping. [23:09:26] Dereckson: poke when i should test the change [23:09:32] matt_flaschen: k [23:10:19] Krenair: could you check in Tin if the git log looks good yo you? 
[23:11:31] (it's on master and starts by the exact merge commit we have as the master repo, ddc208ec, so yes, that looks good) [23:13:36] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Raise upload-by-URL request timeout (T118887) (duration: 00m 41s) [23:13:36] T118887: Upload by URL doesn't work well for large files: HTTP request timed out. - https://phabricator.wikimedia.org/T118887 [23:13:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:13:41] matanya: you can test [23:13:47] doing [23:19:03] matanya: how it works? [23:19:14] still running [23:19:24] hope it on't time out [23:19:31] won't [23:20:44] Dereckson, looks fine [23:20:46] (sorry I was slow to respond, wifi connection broke and IRC didn't reconnect for a while) [23:20:53] Krenair: thanks for looking [23:20:58] Okay, I offer to swat the matt_flaschen change pending the longer upload tests. [23:21:34] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282230 (owner: 10Mattflaschen) [23:22:23] Dereckson, fine by me [23:22:40] (03Merged) 10jenkins-bot: Disable Echo survey on all wikis except test. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282230 (owner: 10Mattflaschen) [23:26:29] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Disable Echo survey on all wikis except test ([[gerrit:282230]]) (duration: 00m 25s) [23:26:31] matt_flaschen: you can test [23:26:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:27:40] Thanks, trying now. [23:29:35] Dereckson, didn't work. I forgot, I think we need to wait 5 minutes for Varnish. [23:29:45] It's still showing old value in startup module. [23:30:03] I've this issue when I change WikiLove stuff. [23:30:54] Yeah, I've seen it before, it's just been a little while so I forgot. [23:32:52] matt_flaschen: by the way, I've used sync-file. [23:33:32] Dereckson, that should be okay, I think. 
It still hasn't updated, though, which is weird.
[23:34:49] greg-g, is it still 5 minutes for Varnish caching the startup module?
[23:35:53] It finally updated in startup module, I'll retest.
[23:37:28] Dereckson, works, thank you.
[23:37:34] You're welcome.
[23:38:27] (CR) Dereckson: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/281823 (https://phabricator.wikimedia.org/T118887) (owner: Dereckson)
[23:39:05] (Merged) jenkins-bot: Raise upload-by-URL request timeout [mediawiki-config] - https://gerrit.wikimedia.org/r/281823 (https://phabricator.wikimedia.org/T118887) (owner: Dereckson)
[23:42:07] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Raise upload-by-URL request timeout (task T118887) (duration: 00m 32s)
[23:42:08] T118887: Upload by URL doesn't work well for large files: HTTP request timed out. - https://phabricator.wikimedia.org/T118887
[23:42:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:42:29] matanya: retry with the increased timeout setting?
[23:42:41] mwalker: after it will fail
[23:42:46] Dereckson: ^
[23:43:31] ReadTimeout: HTTPSConnectionPool(host='commons.wikimedia.org', port=443): Read timed out. (read timeout=30)
[23:43:31] Dereckson
[23:43:48] trying again
[23:43:52] 30?
[23:44:06] that's not the wgCopyUploadTimeout value this message prints.
[23:44:17] (was 90, now 180)
[23:45:21] it is from pything, maybe it is lower there
[23:46:02] python
[23:46:42] !log Synchronized wmf-config/InitialiseSettings.php: Raise upload limit to 4 GB ([[Gerrit:280831]]) — erratum for 23:13
[23:46:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:47:02] matanya: could you try with https://commons.wikimedia.org/wiki/Special:Upload ?
[23:47:52] (which correctly print the 4 Gb value o/)
[23:48:05] Dereckson: i'd need to upload it to a white listed domain
[23:48:19] have a good way of doing that ?
[23:48:26] video2commons is whitelisted I think
[23:48:39] or archive.org
[23:48:56] already running in video2commons
[23:49:09] it is definitely working
[23:49:26] as i don't get WARNING: API warning (upload): filesize may not be over 2146435072 (set to 3065780905) for users
[23:49:30] anymore
[23:49:39] but the upload takes ages
[23:50:58] Okay. I offer we ask experienced Commons users with server-side upload to test more these settings, and evaluate them next week? Then see if we're happy / need more work / need to revert?
[23:51:25] cool with me Dereckson
[23:51:34] thanks much!
[23:51:46] You're welcome.
[23:53:42] I'm next.
[23:54:01] (CR) Rxy: [C: 1] "Thank you" [mediawiki-config] - https://gerrit.wikimedia.org/r/282055 (https://phabricator.wikimedia.org/T131751) (owner: Dereckson)
[23:54:26] (CR) Dereckson: [C: 2] Logo update for lad.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/254471 (https://phabricator.wikimedia.org/T118491) (owner: Dereckson)
[23:55:07] (Merged) jenkins-bot: Logo update for lad.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/254471 (https://phabricator.wikimedia.org/T118491) (owner: Dereckson)
[23:56:49] !log dereckson@tin Synchronized static/images/project-logos/ladwiki.png: Logo update for lad.wikipedia (Task T118491) (duration: 00m 27s)
[23:56:50] T118491: Updated logo for Ladino Wikipedia - https://phabricator.wikimedia.org/T118491
[23:56:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:57:19] Okay, logo synced. Let's purge.
[23:58:05] Krenair: you log the mwscript purge operation or only the sync of the new static file?
[23:58:44] I don't log the mwscript purgeList
[23:58:52] (PS3) Dereckson: Set wgSemiprotectedRestrictionLevels for en.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/281967 (https://phabricator.wikimedia.org/T131976)
[23:59:05] Okay thanks.
[23:59:24] And the last patch.