[00:01:07] (CR) Dereckson: [C: 2] Set wgSemiprotectedRestrictionLevels for en.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/281967 (https://phabricator.wikimedia.org/T131976) (owner: Dereckson)
[00:01:35] (Merged) jenkins-bot: Set wgSemiprotectedRestrictionLevels for en.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/281967 (https://phabricator.wikimedia.org/T131976) (owner: Dereckson)
[00:04:19] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Set wgSemiprotectedRestrictionLevels for en.wikipedia (Task T131976) (duration: 00m 26s)
[00:04:20] T131976: Set wgSemiprotectedRestrictionLevels for enwiki - https://phabricator.wikimedia.org/T131976
[00:04:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:06:10] Works, https://en.wikipedia.org/w/index.php?title=Gamergate_controversy&action=edit isn't red anymore.
[00:06:19] That ends the SWAT.
[00:32:05] Operations, ops-eqiad, DBA: Decomission db1010 - https://phabricator.wikimedia.org/T129395#2188932 (Peachey88)
[00:32:51] Operations, ops-codfw, DBA: Decommission es2001-es2010 - https://phabricator.wikimedia.org/T129452#2188933 (Peachey88)
[01:58:21] (PS1) Dereckson: HD logo for lad.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/282305 (https://phabricator.wikimedia.org/T132120)
[02:03:21] (CR) Dereckson: [C: 1] Add flood group to ladwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/282201 (https://phabricator.wikimedia.org/T131527) (owner: Urbanecm)
[02:06:31] (CR) Dereckson: [C: 1] "I concur, consensus is on Commons."
[mediawiki-config] - https://gerrit.wikimedia.org/r/281973 (https://phabricator.wikimedia.org/T27397) (owner: Matanya)
[02:11:38] (CR) Dereckson: [C: 1] Add new namespaces and new aliases for newikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/281443 (https://phabricator.wikimedia.org/T131754) (owner: Urbanecm)
[02:24:37] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.20) (duration: 10m 04s)
[02:24:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:33:07] !log l10nupdate@tin ResourceLoader cache refresh completed at Fri Apr 8 02:33:06 UTC 2016 (duration 8m 29s)
[02:33:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:45:45] (PS2) KartikMistry: lttoolbox: New upstream version [debs/contenttranslation/lttoolbox] - https://gerrit.wikimedia.org/r/269115 (https://phabricator.wikimedia.org/T124137)
[04:00:17] PROBLEM - Disk space on ms-be1001 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdi1 is not accessible: Input/output error
[04:01:16] PROBLEM - RAID on ms-be1001 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline)
[04:01:47] !log Update cxserver to bd4739b
[04:01:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[04:02:16] RECOVERY - Disk space on ms-be1001 is OK: DISK OK
[04:05:07] PROBLEM - puppet last run on ms-be1001 is CRITICAL: CRITICAL: Puppet has 1 failures
[04:32:47] PROBLEM - Kafka Broker Replica Max Lag on kafka1013 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [5000000.0]
[04:33:07] PROBLEM - Kafka Broker Replica Max Lag on kafka1014 is CRITICAL: CRITICAL: 68.97% of data above the critical threshold [5000000.0]
[05:20:17] RECOVERY - Kafka Broker Replica Max Lag on kafka1013 is OK: OK: Less than 50.00% above the threshold [1000000.0]
[05:28:06] RECOVERY - Kafka Broker Replica Max Lag on kafka1014 is OK: OK: Less than 50.00% above the threshold [1000000.0]
[05:49:27] (Abandoned)
Giuseppe Lavagetto: Make ElasticaTTMServer config work with the new CirrusSearch structure [mediawiki-config] - https://gerrit.wikimedia.org/r/282154 (owner: Giuseppe Lavagetto)
[05:50:29] (CR) Giuseppe Lavagetto: mediawiki: make base class trusty and forward only (1 comment) [puppet] - https://gerrit.wikimedia.org/r/281407 (https://phabricator.wikimedia.org/T126310) (owner: Giuseppe Lavagetto)
[05:51:43] (PS2) Giuseppe Lavagetto: mediawiki: make base class trusty and forward only [puppet] - https://gerrit.wikimedia.org/r/281407 (https://phabricator.wikimedia.org/T126310)
[05:53:46] (PS3) Giuseppe Lavagetto: mediawiki: make base class trusty and forward only [puppet] - https://gerrit.wikimedia.org/r/281407 (https://phabricator.wikimedia.org/T126310)
[06:05:57] PROBLEM - puppet last run on nescio is CRITICAL: CRITICAL: puppet fail
[06:08:14] (CR) Giuseppe Lavagetto: [C: 2] mediawiki: make base class trusty and forward only [puppet] - https://gerrit.wikimedia.org/r/281407 (https://phabricator.wikimedia.org/T126310) (owner: Giuseppe Lavagetto)
[06:10:47] PROBLEM - puppet last run on ms-be3004 is CRITICAL: CRITICAL: puppet fail
[06:12:37] (PS1) Giuseppe Lavagetto: mediawiki: salt::grain needs a value even when absented [puppet] - https://gerrit.wikimedia.org/r/282309
[06:13:11] (CR) Giuseppe Lavagetto: [C: 2 V: 2] mediawiki: salt::grain needs a value even when absented [puppet] - https://gerrit.wikimedia.org/r/282309 (owner: Giuseppe Lavagetto)
[06:13:38] PROBLEM - puppet last run on mw2001 is CRITICAL: CRITICAL: puppet fail
[06:17:18] RECOVERY - puppet last run on mw2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:20:57] (PS1) Giuseppe Lavagetto: mediawiki: do remove the php grain [puppet] - https://gerrit.wikimedia.org/r/282310
[06:21:47] (CR) Giuseppe Lavagetto: [C: 2 V: 2] mediawiki: do remove the php grain [puppet] - https://gerrit.wikimedia.org/r/282310
(owner: Giuseppe Lavagetto)
[06:28:25] (PS2) Giuseppe Lavagetto: mediawiki::php: drop precise compatibility [puppet] - https://gerrit.wikimedia.org/r/281408 (https://phabricator.wikimedia.org/T126310)
[06:28:53] <_joe_> apergos: we're almost there re jessie/mediawiki
[06:30:27] PROBLEM - puppet last run on pc1006 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:37] PROBLEM - puppet last run on mc2015 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:17] PROBLEM - puppet last run on mw1158 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:28] PROBLEM - puppet last run on wtp2017 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:36] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:37] RECOVERY - puppet last run on nescio is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures
[06:31:46] PROBLEM - puppet last run on db1059 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:47] PROBLEM - puppet last run on cp1053 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:56] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:17] PROBLEM - puppet last run on mw2018 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:34:07] PROBLEM - puppet last run on mw2207 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:40:07] RECOVERY - puppet last run on ms-be3004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:55:36] RECOVERY - puppet last run on db1059 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures
[06:55:37] RECOVERY - puppet last run on cp1053 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures
[06:56:08] RECOVERY - puppet last run on pc1006 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
[06:56:26] RECOVERY - puppet last run on mc2015 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures
[06:56:57] RECOVERY - puppet last
run on mw1158 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[06:57:17] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:17] RECOVERY - puppet last run on wtp2017 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:38] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:59] RECOVERY - puppet last run on mw2207 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:06] RECOVERY - puppet last run on mw2018 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:09:42] (PS2) Muehlenhoff: Enable base::firewall on netmon1001 [puppet] - https://gerrit.wikimedia.org/r/282113
[07:11:46] PROBLEM - torrus.wikimedia.org UI on netmon1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string Torrus Top: Wikimedia not found on https://torrus.wikimedia.org:443/torrus - 682 bytes in 0.030 second response time
[07:12:50] ^ not me (so far the patch for base::firewall has only been rebased, but not merged)
[07:13:36] RECOVERY - torrus.wikimedia.org UI on netmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 2504 bytes in 0.142 second response time
[07:21:39] _joe_: awesome! what's left?
[07:25:27] <_joe_> apergos: a few details, besides merging my pending patches
[07:25:32] <_joe_> at least part of those :)
[07:26:05] I still have the misc crons to move off the one host so that it can be used for testing
[07:26:19] so might be ready about the same time :-)
[07:26:56] <_joe_> nah you'll be done earlier
[07:27:06] <_joe_> I have quite a few things to fix
[07:35:52] (CR) Muehlenhoff: [C: 2 V: 2] Enable base::firewall on netmon1001 [puppet] - https://gerrit.wikimedia.org/r/282113 (owner: Muehlenhoff)
[07:42:32] (PS1) ArielGlenn: Override hhvm.server.light_process_count on snapshots [puppet] - https://gerrit.wikimedia.org/r/282314
[07:43:30] (CR) jenkins-bot: [V: -1] Override hhvm.server.light_process_count on snapshots [puppet] - https://gerrit.wikimedia.org/r/282314 (owner: ArielGlenn)
[07:43:50] (CR) Giuseppe Lavagetto: [C: -1] "Setting lightprocesses to 0 is a bad idea" [puppet] - https://gerrit.wikimedia.org/r/282314 (owner: ArielGlenn)
[07:49:14] (PS3) Giuseppe Lavagetto: mediawiki::php: drop precise compatibility [puppet] - https://gerrit.wikimedia.org/r/281408 (https://phabricator.wikimedia.org/T126310)
[07:50:06] !log enable base::firewall on netmon1001
[07:50:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[07:56:04] (CR) Giuseppe Lavagetto: [C: 2] mediawiki::php: drop precise compatibility [puppet] - https://gerrit.wikimedia.org/r/281408 (https://phabricator.wikimedia.org/T126310) (owner: Giuseppe Lavagetto)
[08:00:34] (PS4) Giuseppe Lavagetto: mediawiki::packages: drop precise compatibility [puppet] - https://gerrit.wikimedia.org/r/281409 (https://phabricator.wikimedia.org/T126310)
[08:00:54] (CR) ArielGlenn: "From python Popen and from php maintenance scripts that spawn other processes."
[puppet] - https://gerrit.wikimedia.org/r/282314 (owner: ArielGlenn)
[08:09:19] (CR) Giuseppe Lavagetto: [C: 2] mediawiki::packages: drop precise compatibility [puppet] - https://gerrit.wikimedia.org/r/281409 (https://phabricator.wikimedia.org/T126310) (owner: Giuseppe Lavagetto)
[08:10:40] Operations, Analytics-Kanban: Upgrade aqs* to nodejs 4.3 - https://phabricator.wikimedia.org/T123629#2189365 (elukey) Open>Resolved
[08:18:46] (PS2) Giuseppe Lavagetto: mediawiki::web: drop precise support [puppet] - https://gerrit.wikimedia.org/r/281410 (https://phabricator.wikimedia.org/T126310)
[08:23:05] (PS1) Muehlenhoff: ve: Add ferm rules for xvfb [puppet] - https://gerrit.wikimedia.org/r/282318
[08:23:58] (CR) jenkins-bot: [V: -1] ve: Add ferm rules for xvfb [puppet] - https://gerrit.wikimedia.org/r/282318 (owner: Muehlenhoff)
[08:24:56] (PS2) Muehlenhoff: ve: Add ferm rules for xvfb [puppet] - https://gerrit.wikimedia.org/r/282318
[08:26:34] moritzm: regarding role::ve and the ferm rule, maybe you want to have the ferm rule on the role::jsbench class instead
[08:26:43] (CR) Giuseppe Lavagetto: [C: 2] mediawiki::web: drop precise support [puppet] - https://gerrit.wikimedia.org/r/281410 (https://phabricator.wikimedia.org/T126310) (owner: Giuseppe Lavagetto)
[08:26:45] moritzm: since the xvfb service is installed by the role::jsbench :D
[08:27:45] (PS2) Giuseppe Lavagetto: mediawiki::web: drop compatibility with precise, mod_php [puppet] - https://gerrit.wikimedia.org/r/281411 (https://phabricator.wikimedia.org/T126310)
[08:28:58] sure, will update the patch
[08:30:14] (PS3) Muehlenhoff: jsbench: Add ferm rules for xvfb [puppet] - https://gerrit.wikimedia.org/r/282318
[08:31:08] PROBLEM - puppet last run on gallium is CRITICAL: CRITICAL: puppet fail
[08:35:13] (PS2) Muehlenhoff: Enable ChallengeResponseAuthentication [puppet] - https://gerrit.wikimedia.org/r/281629
[08:39:41] Amir1: no, I am doing
the split already
[08:39:51] Amir1: and good morning btw :-)
[08:39:51] hey akosiaris
[08:39:58] awesome :)
[08:40:16] tell me if I can help at anything
[08:40:18] (CR) Giuseppe Lavagetto: [C: 2] mediawiki::web: drop compatibility with precise, mod_php [puppet] - https://gerrit.wikimedia.org/r/281411 (https://phabricator.wikimedia.org/T126310) (owner: Giuseppe Lavagetto)
[08:40:55] (PS3) Hashar: cache: vary statsd_server with hiera [puppet] - https://gerrit.wikimedia.org/r/249490 (https://phabricator.wikimedia.org/T116898)
[08:41:46] (PS4) Hashar: cache: vary statsd_server with hiera [puppet] - https://gerrit.wikimedia.org/r/249490 (https://phabricator.wikimedia.org/T116898)
[08:41:50] (CR) jenkins-bot: [V: -1] cache: vary statsd_server with hiera [puppet] - https://gerrit.wikimedia.org/r/249490 (https://phabricator.wikimedia.org/T116898) (owner: Hashar)
[08:42:31] Amir1: will do
[08:43:01] thanks
[08:43:05] (CR) jenkins-bot: [V: -1] cache: vary statsd_server with hiera [puppet] - https://gerrit.wikimedia.org/r/249490 (https://phabricator.wikimedia.org/T116898) (owner: Hashar)
[08:44:07] (CR) Hashar: "I have rebased this change and fixed the conflict (some role class got dropped)." [puppet] - https://gerrit.wikimedia.org/r/249490 (https://phabricator.wikimedia.org/T116898) (owner: Hashar)
[08:44:44] (CR) Giuseppe Lavagetto: "So we have php spawn php as well?"
[puppet] - https://gerrit.wikimedia.org/r/282314 (owner: ArielGlenn)
[09:01:47] (PS2) Giuseppe Lavagetto: mediawiki::jobrunner: drop precise compatibility [puppet] - https://gerrit.wikimedia.org/r/281412 (https://phabricator.wikimedia.org/T126310)
[09:10:08] PROBLEM - check_mysql on lutetium is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 605
[09:10:39] Operations, CirrusSearch, Discovery, Discovery-Search-Backlog, and 5 others: Activate SSL + connection pooling for CirrusSearch on PROD - https://phabricator.wikimedia.org/T131839#2189452 (Gehel) HTTPS + Connection pooling is now active on production. Traffic was switched to codfw at the same time.
[09:11:07] Operations, CirrusSearch, Discovery, Discovery-Search-Backlog, and 5 others: Look into encrypting Elasticsearch traffic - https://phabricator.wikimedia.org/T124444#1964465 (Gehel) Traffic is now encrypted in production.
[09:15:08] PROBLEM - check_mysql on lutetium is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 905
[09:17:30] _joe_: that "drop compat" patch of yours broke gallium
[09:17:56] wasn't the gallium upgrade to jessie a quarterly goal for releng this past quarter?
[09:18:07] (CR) Giuseppe Lavagetto: [C: 2] "noop according to the puppet-compiler" [puppet] - https://gerrit.wikimedia.org/r/281412 (https://phabricator.wikimedia.org/T126310) (owner: Giuseppe Lavagetto)
[09:18:33] (PS2) Muehlenhoff: Manage /etc/pam.d/sshd in role::bastionhost::2fa via puppet [puppet] - https://gerrit.wikimedia.org/r/282159
[09:20:54] <_joe_> paravoid: oh sigh
[09:21:22] hashar: ^?
[09:22:41] <_joe_> hashar: do we really need the mediawiki php packages on gallium?
[09:23:00] <_joe_> since we don't use precise anymore anywhere, one might think...
[09:23:12] (PS5) Hashar: cache: vary statsd_server with hiera [puppet] - https://gerrit.wikimedia.org/r/249490 (https://phabricator.wikimedia.org/T116898)
[09:23:22] iirc it was to run tests against php 5.3 which we still support for third-party users?
[09:24:57] (CR) Hashar: "I have fixed the puppet manifest syntax error. Cherry picked PS5 on beta cluster puppetmaster." [puppet] - https://gerrit.wikimedia.org/r/249490 (https://phabricator.wikimedia.org/T116898) (owner: Hashar)
[09:25:18] RECOVERY - check_mysql on lutetium is OK: Uptime: 1883330 Threads: 1 Questions: 18579766 Slow queries: 10869 Opens: 114338 Flush tables: 2 Open tables: 64 Queries per second avg: 9.865 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0
[09:25:33] <_joe_> paravoid: then I can just include a different class maybe, sigh
[09:27:27] _joe_: we still use Precise to run MediaWiki tests with Zend 5.3 :-(
[09:27:41] though it is probably not needed on gallium anymore since that node barely runs any tests
[09:28:08] (PS4) ArielGlenn: https redirect for dumps.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/278861 (https://phabricator.wikimedia.org/T128587)
[09:28:08] it is definitely needed on the CI Precise slaves. They should get the MediaWiki package dependencies using the puppet class mediawiki::packages
[09:28:09] <_joe_> still, it's needed for precise
[09:28:11] <_joe_> ok
[09:28:20] <_joe_> they're all broken right now, sigh
[09:28:32] <_joe_> I'll find the best possible fix
[09:28:34] (CR) jenkins-bot: [V: -1] https redirect for dumps.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/278861 (https://phabricator.wikimedia.org/T128587) (owner: ArielGlenn)
[09:28:34] gallium has Error: Could not retrieve catalog from remote server: Error 400 on SERVER: OS ubuntu >= trusty || Debian >= jessie required.
at /etc/puppet/modules/mediawiki/manifests/packages/php5.pp:9 on node gallium.wikimedia.org
[09:29:53] hashar: what's the status of the gallium upgrade?
[09:29:59] I thought it'd be done last quarter
[09:30:07] we discussed it ~4 months ago if you recall
[09:30:12] <_joe_> hashar: I'm fixing it
[09:31:29] _joe_: I am going to clear out what is installed on gallium
[09:31:36] <_joe_> hashar: no, wait
[09:31:42] <_joe_> I am fixing the problem
[09:31:55] would still need some changes to have the puppet classes to apply properly on the Precise instances in labs :D
[09:32:52] <_joe_> hashar: I am doing it
[09:33:17] <_joe_> hashar: to be clear: it's role::contint::website, right?
[09:38:04] _joe_: there are a bunch of classes
[09:38:21] role::contint::website is indeed on gallium that is for integration.wm.o and doc.wm.o iirc
[09:38:45] <_joe_> hashar: which other classes?
[09:38:48] (PS5) ArielGlenn: https redirect for dumps.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/278861 (https://phabricator.wikimedia.org/T128587)
[09:38:51] <_joe_> so that I can fix those
[09:40:22] (CR) Elukey: [C: 1] https redirect for dumps.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/278861 (https://phabricator.wikimedia.org/T128587) (owner: ArielGlenn)
[09:42:31] (PS6) ArielGlenn: https redirect for dumps.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/278861 (https://phabricator.wikimedia.org/T128587)
[09:44:20] (CR) ArielGlenn: [C: 2] https redirect for dumps.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/278861 (https://phabricator.wikimedia.org/T128587) (owner: ArielGlenn)
[09:44:49] (PS1) Hashar: contint: clean up role::ci::slave [puppet] - https://gerrit.wikimedia.org/r/282322
[09:44:51] (PS1) Hashar: contint: move npmtravis out of prod slave [puppet] - https://gerrit.wikimedia.org/r/282323 (https://phabricator.wikimedia.org/T114421)
[09:45:45] <_joe_> hashar: found it, I'm fixing all of it
[09:46:11]
<_joe_> hashar: when can we get rid of those php 5.3 tests?
[09:46:13] _joe_: and the couple patches above remove a few random classes from gallium :D
[09:46:34] 5.3 tests would be needed for as long as we still support a MediaWiki release that claims to support 5.3 :(
[09:46:37] so quite a while
[09:46:46] but the .plan is to have Zend 5.3 on the Jessie Nodepool instances
[09:47:06] <_joe_> when?
[09:48:17] hopefully this quarter
[09:48:41] I am on vacation, will look at provisioning Zend 5.3 and 5.5 on Jessie instances when I am back
[09:48:51] we have a few potential solutions to investigate :D
[09:49:04] <_joe_> hashar: just one more question then
[09:49:21] <_joe_> hashar: in contint::packages::labs I see # Fonts needed for browser tests screenshots
[09:49:25] !log rebooting krypton for kernel upgrade
[09:49:26] <_joe_> (T71535)
[09:49:27] T71535: Fonts for Japanese and Chinese must be installed for VisualEditor localized screenshots - https://phabricator.wikimedia.org/T71535
[09:49:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:49:30] in short the idea would be to compile different php, have them under /opt/php53 /opt/php55 etc and then we can have the Jenkins job to use the proper bin
[09:49:35] <_joe_> hashar: are those going to run on precise?
[09:49:38] Operations, hardware-requests: reclaim restbase1001-1006 to spares - https://phabricator.wikimedia.org/T130752#2145327 (fgiunchedi) update: this is waiting on decommissioning restbase1004-restbase1005, for which we'll need to cassandra bootstrap new hardware, which in turn was waiting for codfw cassandra...
[09:51:08] _joe_: the browser tests no, they are on Trusty
[09:51:13] <_joe_> ok cool
[09:51:19] <_joe_> that simplifies things a lot
[09:51:26] the only stuff left on Precise slaves are the Zend 5.3 jobs and maybe a few others
[09:51:39] if "few others" start failing, we can move them to Trusty fairly easily
[09:51:49] there should be none
[09:52:09] and in short the goals are roughly:
[09:52:20] * get rid of Precise labs slave by having Zend 5.3 on Jessie instances
[09:52:31] * get rid of Trusty labs slave by having Zend 5.5 and HHVM on jessie instances
[09:52:50] * phase out gallium (still on precise) which has a fairly large set of dependencies I have yet to work on :(
[09:53:02] then all of CI jobs would be running solely on Jessie :-}
[09:53:28] !log dumps.wikimedia.org is now accepting only https:// (redirecting http:// to https://)
[09:53:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:53:33] the python/ruby/javascript based jobs have (almost) all been migrated to jessie already
[09:53:46] Operations, Analytics, Datasets-General-or-Unknown, Traffic, Patch-For-Review: http://dumps.wikimedia.org should redirect to https:// - https://phabricator.wikimedia.org/T128587#2189490 (ArielGlenn) Open>Resolved with collaboration with elukey this is done.
[09:53:48] Operations, Patch-For-Review: Ferm rules for netmon1001 - https://phabricator.wikimedia.org/T105410#2189493 (MoritzMuehlenhoff) Open>Resolved base::firewall has been enabled on netmon1001
[09:55:09] (PS1) Giuseppe Lavagetto: mediawiki::packages: don't break precise CI installs [puppet] - https://gerrit.wikimedia.org/r/282330
[09:55:12] <_joe_> hashar: ^^
[09:55:43] oh
[09:56:12] <_joe_> hashar: btw, can you give me the name of one of the precise slaves used for php 5.3?
[09:57:04] _joe_: https://integration.wikimedia.org/ci/label/UbuntuPrecise/
[09:57:33] <_joe_> hashar: ok thanks
[09:57:42] <_joe_> hashar: so, this change fixes gallium
[09:57:46] <_joe_> I am going to merge it
[09:58:48] (CR) Hashar: [C: 1] "sounds good and will fix gallium :-)" [puppet] - https://gerrit.wikimedia.org/r/282330 (owner: Giuseppe Lavagetto)
[09:58:49] yeah
[09:58:51] +1 -:)
[09:59:16] (PS2) Giuseppe Lavagetto: mediawiki::packages: don't break precise CI installs [puppet] - https://gerrit.wikimedia.org/r/282330
[10:00:07] (CR) Giuseppe Lavagetto: [C: 2 V: 2] mediawiki::packages: don't break precise CI installs [puppet] - https://gerrit.wikimedia.org/r/282330 (owner: Giuseppe Lavagetto)
[10:00:39] that lame operations-puppet-puppetlint-strict is way too slow :D
[10:02:21] (PS2) Hashar: contint: clean up role::ci::slave [puppet] - https://gerrit.wikimedia.org/r/282322
[10:02:36] <_joe_> uhm I guess I missed something
[10:02:46] (PS2) Hashar: contint: move npmtravis out of prod slave [puppet] - https://gerrit.wikimedia.org/r/282323 (https://phabricator.wikimedia.org/T114421)
[10:03:56] (PS1) Giuseppe Lavagetto: mw::packages::legacy: php-apc, not php5-apc [puppet] - https://gerrit.wikimedia.org/r/282333
[10:05:17] (CR) Giuseppe Lavagetto: [C: 2 V: 2] mw::packages::legacy: php-apc, not php5-apc [puppet] - https://gerrit.wikimedia.org/r/282333 (owner: Giuseppe Lavagetto)
[10:07:46] <_joe_> hashar: gallium is now ok
[10:07:55] <_joe_> I'll look at the precise ci slaves
[10:07:58] RECOVERY - puppet last run on gallium is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures
[10:10:35] (CR) Hashar: [C: -1] "Puppet compile https://puppet-compiler.wmflabs.org/2372/gallium.wikimedia.org/ will have to do the clean up manually."
[puppet] - https://gerrit.wikimedia.org/r/282322 (owner: Hashar)
[10:11:56] (PS1) Giuseppe Lavagetto: contint::packages::labs: typo fix [puppet] - https://gerrit.wikimedia.org/r/282335
[10:12:16] (CR) Giuseppe Lavagetto: [C: 2 V: 2] contint::packages::labs: typo fix [puppet] - https://gerrit.wikimedia.org/r/282335 (owner: Giuseppe Lavagetto)
[10:12:21] (PS3) Hashar: contint: clean up role::ci::slave [puppet] - https://gerrit.wikimedia.org/r/282322
[10:12:44] * hashar plays with puppet compiler
[10:15:04] (CR) Hashar: "Puppet compile for gallium https://puppet-compiler.wmflabs.org/2374/gallium.wikimedia.org/" [puppet] - https://gerrit.wikimedia.org/r/282322 (owner: Hashar)
[10:15:20] _joe_: and if you get time, https://gerrit.wikimedia.org/r/#/c/282322/ get rid of a few classes from gallium
[10:15:55] !log rebooting radium (tor node) for kernel upgrade to 4.4
[10:15:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:19:57] <_joe_> hashar: fixed!
[10:22:02] (CR) Hashar: [C: -1] "Need https://gerrit.wikimedia.org/r/#/c/282322/ to be merged, then I can run the puppet compiler to assert it is a noop on gallium" [puppet] - https://gerrit.wikimedia.org/r/282323 (https://phabricator.wikimedia.org/T114421) (owner: Hashar)
[10:22:48] _joe_: great !
and some more removal of no more used cruft for gallium: https://gerrit.wikimedia.org/r/#/c/282322/ && https://gerrit.wikimedia.org/r/#/c/282323/ :)
[10:22:56] ran puppet compiler on the first one
[10:23:04] https://puppet-compiler.wmflabs.org/2374/gallium.wikimedia.org/
[10:23:08] which looks all fine to me
[10:23:18] none of those puppet resources are needed anymore on gallium
[10:26:56] (PS2) Ema: Misc VCL forward-port to Varnish 4 [puppet] - https://gerrit.wikimedia.org/r/282235 (https://phabricator.wikimedia.org/T131501)
[10:27:14] (CR) Ema: [C: 2 V: 2] Misc VCL forward-port to Varnish 4 [puppet] - https://gerrit.wikimedia.org/r/282235 (https://phabricator.wikimedia.org/T131501) (owner: Ema)
[10:36:40] PROBLEM - DPKG on tungsten is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[10:40:28] RECOVERY - DPKG on tungsten is OK: All packages OK
[10:51:00] Operations, DNS, Mail, Phabricator, and 2 others: phabricator.wikimedia.org has no SPF record - https://phabricator.wikimedia.org/T116806#2189557 (Aklapper) >>! In T116806#2186995, @greg wrote: > Unless some team has a team list specified for their team project contact (I highly highly doubt it),...
[10:51:49] (PS3) Giuseppe Lavagetto: hhvm: make the perf maps cron work under systemd [puppet] - https://gerrit.wikimedia.org/r/282114
[10:53:18] (PS1) Muehlenhoff: Add ferm rule for labmon/http [puppet] - https://gerrit.wikimedia.org/r/282336
[10:54:30] (CR) jenkins-bot: [V: -1] Add ferm rule for labmon/http [puppet] - https://gerrit.wikimedia.org/r/282336 (owner: Muehlenhoff)
[10:55:14] (CR) Giuseppe Lavagetto: [C: 2] "Noop according to the compiler" [puppet] - https://gerrit.wikimedia.org/r/282114 (owner: Giuseppe Lavagetto)
[10:56:07] Operations, DBA: upgrade db servers to jessie - https://phabricator.wikimedia.org/T125028#1972522 (mark) >>! In T125028#1981051, @jcrespo wrote: > As per Ops Meeting. Could you elaborate please?
I don't know what this means. :)
[10:57:27] (PS2) Muehlenhoff: Add ferm rule for labmon/http [puppet] - https://gerrit.wikimedia.org/r/282336
[11:01:01] Operations, Commons: page with interwiki prefixes in title needs to be deleted - https://phabricator.wikimedia.org/T132139#2189606 (FDMS)
[11:02:45] (Abandoned) Muehlenhoff: Add ferm rule for labmon/http [puppet] - https://gerrit.wikimedia.org/r/282336 (owner: Muehlenhoff)
[11:03:27] (CR) ArielGlenn: "That's how the MW php maintenance script handles retrieving revision text: by a separate process which can be restarted when something goe" [puppet] - https://gerrit.wikimedia.org/r/282314 (owner: ArielGlenn)
[11:03:33] Operations, Commons, MediaWiki-General-or-Unknown: page with interwiki prefixes in title needs to be deleted - https://phabricator.wikimedia.org/T132139#2189622 (Steinsplitter)
[11:17:16] (PS1) Ema: apt: ignore errors updating apt-show-versions package cache [puppet] - https://gerrit.wikimedia.org/r/282337
[11:22:21] (Abandoned) Hashar: beta: nutcracker::verbosity: "4" [puppet] - https://gerrit.wikimedia.org/r/276950 (owner: Hashar)
[11:26:00] (Abandoned) Hashar: nodepool: raise min pool from 10 to 14 [puppet] - https://gerrit.wikimedia.org/r/273459 (owner: Hashar)
[11:26:39] (PS6) Hashar: contint: rsync server to hold jobs caches [puppet] - https://gerrit.wikimedia.org/r/253322 (https://phabricator.wikimedia.org/T116017)
[11:26:47] (PS3) Hashar: contint: set pbuilder basepath to actual directory [puppet] - https://gerrit.wikimedia.org/r/269103 (https://phabricator.wikimedia.org/T125999)
[11:27:11] (CR) Hashar: [C: 1] contint: set pbuilder basepath to actual directory [puppet] - https://gerrit.wikimedia.org/r/269103 (https://phabricator.wikimedia.org/T125999) (owner: Hashar)
[11:27:51] (CR) jenkins-bot: [V: -1] contint: rsync server to hold jobs caches [puppet] - https://gerrit.wikimedia.org/r/253322
(https://phabricator.wikimedia.org/T116017) (owner: Hashar)
[11:27:56] (CR) jenkins-bot: [V: -1] contint: set pbuilder basepath to actual directory [puppet] - https://gerrit.wikimedia.org/r/269103 (https://phabricator.wikimedia.org/T125999) (owner: Hashar)
[11:28:17] Operations, hardware-requests: additional graphite machines request, 1x per DC - https://phabricator.wikimedia.org/T126253#2189634 (fgiunchedi)
[11:28:19] Operations, ops-codfw, Patch-For-Review: rack/setup new host graphite2002 - https://phabricator.wikimedia.org/T130938#2189636 (fgiunchedi)
[11:29:54] (PS7) Hashar: contint: rsync server to hold jobs caches [puppet] - https://gerrit.wikimedia.org/r/253322 (https://phabricator.wikimedia.org/T116017)
[11:30:57] (PS4) Hashar: contint: set pbuilder basepath to actual directory [puppet] - https://gerrit.wikimedia.org/r/269103 (https://phabricator.wikimedia.org/T125999)
[11:34:32] (CR) Hashar: [C: 1] "Rebased, fixed puppet lint arrows alignment. Cherry picked on integration puppetmaster." [puppet] - https://gerrit.wikimedia.org/r/253322 (https://phabricator.wikimedia.org/T116017) (owner: Hashar)
[11:34:37] (CR) Hashar: [C: 1] "Rebased, fixed puppet lint arrows alignment. Cherry picked on integration puppetmaster."
[puppet] - https://gerrit.wikimedia.org/r/269103 (https://phabricator.wikimedia.org/T125999) (owner: Hashar)
[11:35:33] paravoid akosiaris I forgot to mention https://gerrit.wikimedia.org/r/281631 and https://gerrit.wikimedia.org/r/277490 to split graphite cluster in two
[11:35:53] (Abandoned) Hashar: tox entry point to run pep8==1.4.6 [puppet] - https://gerrit.wikimedia.org/r/244148 (https://phabricator.wikimedia.org/T114887) (owner: Hashar)
[11:37:13] Operations, Continuous-Integration-Infrastructure, Patch-For-Review, WorkType-NewFunctionality: Phase out operations-puppet-pep8 Jenkins job and tools/puppet_pep8.py - https://phabricator.wikimedia.org/T114887#2189664 (hashar) a:hashar>None
[11:40:39] Operations, ops-eqiad: ms-be1001.eqiad.wmnet: slot=8 dev=sdi failed - https://phabricator.wikimedia.org/T132142#2189672 (fgiunchedi) NEW
[11:40:58] godog: ok
[11:41:48] RECOVERY - RAID on ms-be1001 is OK: OK: optimal, 13 logical, 13 physical
[11:42:03] (PS1) Muehlenhoff: Add ferm rules for statsite [puppet] - https://gerrit.wikimedia.org/r/282340
[11:42:05] (PS1) Muehlenhoff: Enable base::firewall on labmon1001 [puppet] - https://gerrit.wikimedia.org/r/282341
[11:43:25] (CR) jenkins-bot: [V: -1] Add ferm rules for statsite [puppet] - https://gerrit.wikimedia.org/r/282340 (owner: Muehlenhoff)
[11:43:38] (CR) jenkins-bot: [V: -1] Enable base::firewall on labmon1001 [puppet] - https://gerrit.wikimedia.org/r/282341 (owner: Muehlenhoff)
[11:46:15] (CR) Filippo Giunchedi: [C: 1] apt: ignore errors updating apt-show-versions package cache [puppet] - https://gerrit.wikimedia.org/r/282337 (owner: Ema)
[11:47:03] ACKNOWLEDGEMENT - puppet last run on ms-be1001 is CRITICAL: CRITICAL: Puppet has 1 failures Filippo Giunchedi sdi failed T132142
[11:48:16] Operations, Beta-Cluster-Infrastructure, WorkType-NewFunctionality: etcd/confd is not started on deployment-cache-mobile04 -
https://phabricator.wikimedia.org/T116224#2189683 (10hashar) 5declined>3Open Reopening. conftool refuses to start on any of the beta cluster Varnish caches. Ex: deployment-cach... [11:48:38] 6Operations, 10Beta-Cluster-Infrastructure, 7WorkType-NewFunctionality: etcd/confd is not started on beta cluster Varnish caches - https://phabricator.wikimedia.org/T116224#2189686 (10hashar) [11:50:11] 6Operations, 10Traffic, 13Patch-For-Review: Varnish 4 panic log registered on cp1044 - https://phabricator.wikimedia.org/T131830#2189688 (10ema) 5Open>3Resolved [11:50:15] (03PS1) 10Muehlenhoff: Add ferm rules for statsite [puppet] - 10https://gerrit.wikimedia.org/r/282343 [11:50:17] (03PS1) 10Muehlenhoff: Enable base::firewall on labmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/282344 [12:17:23] 6Operations: Integrate jessie 8.4 point release - https://phabricator.wikimedia.org/T131746#2189705 (10MoritzMuehlenhoff) The following updates from jessie 8.4 have been deployed: php5 ruby-defaults nss-pam-ldapd libsndfile cairo [12:19:47] (03CR) 10BBlack: [C: 031] update the DNS record for benefactors.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/280637 (https://phabricator.wikimedia.org/T130937) (owner: 10Mschon) [12:21:38] PROBLEM - puppet last run on db2065 is CRITICAL: CRITICAL: Puppet has 1 failures [12:29:39] (03CR) 10Elukey: [C: 031] apt: ignore errors updating apt-show-versions package cache (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/282337 (owner: 10Ema) [12:33:05] 7Blocked-on-Operations, 6Operations: Jessie mirror - https://phabricator.wikimedia.org/T132146#2189768 (10hashar) [12:33:21] 7Blocked-on-Operations, 6Operations: Jessie mirror - https://phabricator.wikimedia.org/T132146#2189780 (10hashar) 5Open>3declined [12:34:15] (03PS4) 10Elukey: Remove logrotate/syslog configurations. 
[software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/282161 (https://phabricator.wikimedia.org/T129344) [12:37:39] 7Blocked-on-Operations, 6Operations, 10Continuous-Integration-Infrastructure: Wikimedia mirror for Debian Jessie is incomplete - https://phabricator.wikimedia.org/T132147#2189784 (10hashar) [12:41:54] 7Blocked-on-Operations, 6Operations, 10Continuous-Integration-Infrastructure: Wikimedia mirror for Debian Jessie is incomplete - https://phabricator.wikimedia.org/T132147#2189784 (10MoritzMuehlenhoff) Did you run "apt-get update"? All these files have been updated in the recent jessie point update and were s... [12:42:34] 7Blocked-on-Operations, 6Operations, 10Continuous-Integration-Infrastructure, 13Patch-For-Review: Wikimedia mirror for Debian Jessie is incomplete - https://phabricator.wikimedia.org/T132147#2189802 (10hashar) Bah my bad it does not run `apt-get update`.. I need to figure out how to have the puppet ::apt c... [12:48:00] RECOVERY - puppet last run on db2065 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:49:27] (03CR) 10Elukey: "Thanks to Ema for the packaging introduction :)" [software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/282161 (https://phabricator.wikimedia.org/T129344) (owner: 10Elukey) [12:51:31] moritzm: sorry I have been too fast in filling the apt outdated class. apt-get update definitely solve the issue. Thank you! [12:52:43] hashar: whenever you have time it would be great to brainstorm how to add complete submodule support to the puppet-compiler [12:52:51] (if you want to be involved of course :) [12:53:06] elukey: it still does not submodule update properly ? 
:( [12:53:40] hashar: nono I mean taking the feature to the next step, namely using pcc with a code review filed against the submodule repo [12:53:50] OHHH [12:54:16] the Jenkins job pass the change ID to the puppet compiler software which then retrieve the change [12:54:27] when retrieving the change, the soft could look at which repository it is aimed for [12:54:46] if that is for some operations/puppet/foobar , then clone puppet.git as usual and apply the patch to the proper path [12:54:54] ie modules/foobar ? [12:54:54] exactly [12:55:17] this is the idea, it would be so much helpful in my opinion [12:55:44] yeah this way one can puppet compile in the submodule before that is merged and proposed as a submodule bump [12:55:55] * elukey nods [12:56:13] time for some python hacking I guess? :-} [12:56:30] I am on vacation starting this evening, so no real cycle to look at it :-( [12:56:49] sure! not that urgent :) [12:57:22] I am currently trying to figure out how to link the gerrit repo name to the correct puppet module path [12:58:08] or maybe name.. to use something like git submodule foreach to identify the correct submodule and merge/fetch the change [12:58:27] then submodules init, etc.. [12:58:43] anyhow, let's chat again when you'll be back :) [12:59:23] 6Operations, 10Continuous-Integration-Infrastructure, 13Patch-For-Review, 7WorkType-NewFunctionality: Phase out operations-puppet-pep8 Jenkins job and tools/puppet_pep8.py - https://phabricator.wikimedia.org/T114887#2189919 (10jayvdb) You may want to use https://pypi.python.org/pypi/flake8-putty for per fi... [13:01:18] elukey: guess you could try hacking something and sync up with _joe_ who wrote the puppet compiler? 
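A possible take on the 12:57:22 question (linking the Gerrit repo name to the puppet module path): a minimal sketch, assuming the mapping can be read from `git config --get-regexp` style output of the form `submodule.<path>.url <url>`, and that each submodule's name equals its path (the git default, not guaranteed). The function name and the `base_url` handling are illustrative only and are not part of the puppet compiler.

```python
def submodule_paths_by_project(config_output, base_url):
    """Map Gerrit project name -> submodule path.

    `config_output` is text in the `key value` form printed by
    `git config --get-regexp`; only `submodule.<path>.url` keys are used.
    Assumption: the submodule name in the key is also its path.
    """
    mapping = {}
    for line in config_output.splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue
        key, url = parts[0], parts[1].strip()
        if not (key.startswith("submodule.") and key.endswith(".url")):
            continue
        path = key[len("submodule."):-len(".url")]
        if not url.startswith(base_url):
            continue
        project = url[len(base_url):].strip("/")
        if project.endswith(".git"):
            project = project[:-len(".git")]
        mapping[project] = path
    return mapping


# Hypothetical sample input in the assumed format.
sample = (
    "submodule.modules/cdh.url https://gerrit.wikimedia.org/r/operations/puppet/cdh\n"
    "submodule.modules/jmxtrans.url https://gerrit.wikimedia.org/r/operations/puppet/jmxtrans\n"
)
paths = submodule_paths_by_project(sample, "https://gerrit.wikimedia.org/r/")
```

With such a mapping, a change aimed at `operations/puppet/cdh` could be applied under `modules/cdh` of a regular puppet.git clone before compiling.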
[13:01:39] 6Operations, 10Continuous-Integration-Infrastructure, 10puppet-compiler: puppet compiler wrongly indicates errors when dealing with subrepositories - https://phabricator.wikimedia.org/T118406#2189923 (10elukey) 5Open>3Resolved a:3elukey Duplicate of https://phabricator.wikimedia.org/T130703, closing. [13:03:02] $ git config --get-regexp submodule [13:03:02] submodule.modules/cdh.url https://gerrit.wikimedia.org/r/operations/puppet/cdh [13:03:02] submodule.modules/jmxtrans.url https://gerrit.wikimedia.org/r/operations/puppet/jmxtrans [13:03:06] elukey: ^^^magic :-} [13:03:18] might let you easily map the Gerrit project name with the subdir [13:05:40] hashar: WOA! [13:05:44] thanks! [13:05:48] * elukey opens a phab task [13:06:20] you are welcome. Donations welcome https://donate.wikimedia.org/ :D [13:06:57] (03PS2) 10Ema: apt: ignore errors updating apt-show-versions package cache [puppet] - 10https://gerrit.wikimedia.org/r/282337 [13:07:13] elukey: addressed your comment ^ [13:10:15] (03CR) 10Reedy: [C: 031] delete *.email.donate.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/278353 (https://phabricator.wikimedia.org/T130414) (owner: 10Dzahn) [13:10:38] ema: LGTM [13:12:12] (03PS3) 10Ema: apt: ignore errors updating apt-show-versions package cache [puppet] - 10https://gerrit.wikimedia.org/r/282337 [13:14:04] (03CR) 10Ema: [C: 032 V: 032] apt: ignore errors updating apt-show-versions package cache [puppet] - 10https://gerrit.wikimedia.org/r/282337 (owner: 10Ema) [13:18:21] (03PS1) 10Faidon Liambotis: Rename "statsdlb" role to "statsd" [puppet] - 10https://gerrit.wikimedia.org/r/282355 [13:18:23] (03PS1) 10Faidon Liambotis: Add a new statsd_proxy module and replace statsdlb [puppet] - 10https://gerrit.wikimedia.org/r/282356 (https://phabricator.wikimedia.org/T126447) [13:18:26] (03PS1) 10Faidon Liambotis: Remove statsdlb, unreferenced now [puppet] - 10https://gerrit.wikimedia.org/r/282357 [13:18:28] godog: ^^^ [13:18:36] (03PS1) 10Giuseppe 
Lavagetto: hhvm: hhvm-dump-debug compatibility with jessie [puppet] - 10https://gerrit.wikimedia.org/r/282358 [13:19:27] (03CR) 10Ori.livneh: "Seems fine to me, but have you verified that it is, indeed, faster?" [puppet] - 10https://gerrit.wikimedia.org/r/282356 (https://phabricator.wikimedia.org/T126447) (owner: 10Faidon Liambotis) [13:20:06] (03CR) 10jenkins-bot: [V: 04-1] Add a new statsd_proxy module and replace statsdlb [puppet] - 10https://gerrit.wikimedia.org/r/282356 (https://phabricator.wikimedia.org/T126447) (owner: 10Faidon Liambotis) [13:20:14] ori: no, I haven't :) [13:20:25] it is multithreaded though [13:21:00] paravoid: nice! never came across that before [13:21:03] it's probably going to be slower with thread=1, since it does ketama which is md5 etc. [13:21:22] ori: I also forgot you're in this TZ heh :) [13:21:25] threads have overhead [13:21:31] it may well be faster, not saying it won't be [13:21:42] but it's not so obviously true that it doesn't need verifying [13:21:55] it uses SO_REUSEPORT, so separate socket per thread [13:22:07] so there is no worker pool (or lock contention) [13:23:22] IIRC I used https://github.com/octo/statsd-tg when trying out statsite [13:23:44] even that isn't rocket science heh but it is already there [13:25:20] (03PS2) 10Faidon Liambotis: Add a new statsd_proxy module and replace statsdlb [puppet] - 10https://gerrit.wikimedia.org/r/282356 (https://phabricator.wikimedia.org/T126447) [13:25:22] (03PS2) 10Faidon Liambotis: Remove statsdlb, unreferenced now [puppet] - 10https://gerrit.wikimedia.org/r/282357 [13:25:53] ori: statsdlb's for(;;) { recv() } is too trivial to perform really [13:26:18] should be easy to demonstrate, then [13:26:26] all blocking, no (e)poll() no recvmmsg(), no threads [13:26:50] see commit message for https://github.com/wikimedia/analytics-udplog/commit/32cced279d82bf0afc6d4b98dd2f0bf584a5d6bc for example [13:27:24] Tim found that blocking, single-threaded reads were the most efficient in that 
particular case, IIRC [13:28:14] how did you guys measure before that statsdlb was getting starved? [13:29:25] investigation started from https://phabricator.wikimedia.org/T101141 [13:29:52] (03PS2) 10Giuseppe Lavagetto: hhvm: hhvm-dump-debug compatibility with jessie [puppet] - 10https://gerrit.wikimedia.org/r/282358 [13:30:04] 22128: 00000000:1FBD 00000000:0000 07 00000000:00032140 00:00000000 00000000 113 0 17514 2 ffff881001a38000 -951643664 [13:30:10] heh, wrapped around [13:30:26] (03PS1) 10Gehel: CirrusSearch on Labs uses the full Elasticsearch cluster again [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282359 (https://phabricator.wikimedia.org/T130219) [13:30:41] I also did similar work on mwprof. In that case I ended up making it multithreaded, but it was not clear-cut there either ( https://github.com/wikimedia/operations-software-mwprof/commit/2efc01013bd3a5b0ec7fdd4c151a3f7717794a04 ) [13:32:05] ok, brb [13:32:24] ori: http://netdata.firehol.org/ <-- already played with it? [13:33:29] I don't think I've seen it before, elukey. Looks interesting, though. Bookmarked. Thanks! [13:37:48] this thread is fascinating too https://lists.wikimedia.org/pipermail/wikitech-l/2010-August/048806.html [13:38:27] (03PS1) 10Filippo Giunchedi: cassandra: add restbase1014-a [puppet] - 10https://gerrit.wikimedia.org/r/282361 (https://phabricator.wikimedia.org/T128107) [13:44:36] (03CR) 10Ottomata: "/var/cache/varnishkafka only existed to contains the varnishkafka.stats.json file. 
If we are removing that from the default config, then " [software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/282161 (https://phabricator.wikimedia.org/T129344) (owner: 10Elukey) [13:44:45] (03CR) 10Ottomata: "to contain*" [software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/282161 (https://phabricator.wikimedia.org/T129344) (owner: 10Elukey) [13:45:33] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] cassandra: add restbase1014-a [puppet] - 10https://gerrit.wikimedia.org/r/282361 (https://phabricator.wikimedia.org/T128107) (owner: 10Filippo Giunchedi) [13:45:53] (03CR) 10Ottomata: "Oh, which I guess also means you can remove the varnishkafka.dirs file altogether, since there is nothing else in it." [software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/282161 (https://phabricator.wikimedia.org/T129344) (owner: 10Elukey) [13:46:46] (03PS1) 10Muehlenhoff: Port from optparse to argparse [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/282362 [13:47:48] (03CR) 10Muehlenhoff: [C: 032 V: 032] Port from optparse to argparse [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/282362 (owner: 10Muehlenhoff) [13:48:07] netdata is making the rounds [13:48:25] but I've heard if you look more closely it's horrible [13:48:29] as in, spawns shell scripts [13:50:51] it doesn't seem to offer much over ganglia [13:51:37] (03PS2) 10Faidon Liambotis: Rename "statsdlb" role to "statsd" [puppet] - 10https://gerrit.wikimedia.org/r/282355 [13:51:52] (03CR) 10Faidon Liambotis: [C: 032] Rename "statsdlb" role to "statsd" [puppet] - 10https://gerrit.wikimedia.org/r/282355 (owner: 10Faidon Liambotis) [13:52:10] that part is fairly uncontentious, I hope :) [13:52:52] so the migration to statsd-proxy is going to mess with aggregation for a while, as the hashing algorithm is going to be different [13:55:25] paravoid: do we have the deleteCounters (or equivalent for the version we use) set on the statsd hosts behind the proxies? 
[13:55:39] no clue [13:55:59] hehe yeah the rename makes sense, thanks paravoid [13:56:23] volans: no clue :) [13:56:33] because the original statsd (etsy) by default continue to send values if it doesn't receive data [13:56:50] ah yeah, we used to have that with txstatsd, no longer with statsite [13:57:03] 0 for counters and timers and sets, the value for gauges IIRC [13:57:19] PROBLEM - HHVM rendering on mw1209 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50393 bytes in 0.784 second response time [13:57:34] yeah I see the reason for that but it was very confusing [13:58:26] godog: we used to have the config or the issue? :) [13:59:07] the sending of old counters [13:59:09] RECOVERY - HHVM rendering on mw1209 is OK: HTTP OK: HTTP/1.1 200 OK - 71612 bytes in 0.097 second response time [14:00:01] yeah you cannot send them if you have more than one statsd and have a proxy with consistent hashing in front, you might end up with double/wrong data [14:01:43] wtf [14:02:01] tcp6 0 0 :::8125 :::* LISTEN 1738/statsite [14:02:04] udp6 0 0 :::8125 :::* 1738/statsite [14:02:09] !log start cassandra bootstrap of restbase2004-a [14:02:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:02:39] awesome [14:02:47] we have two sockets bound on 8125 [14:02:51] one is statsdlb, the other one is statsite [14:03:00] I was just checking because if we replace the proxies and the hashing algorithm change and the statsd keep sending the old data the mess is not at the moment of the release but continues until you don't restart all statsd [14:03:11] !log correction of the above, restbase1014-a [14:03:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:05:14] paravoid: sigh, I'm guessing the default/system statsite instance? 
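The concern raised at 14:00:01 and 14:03:00 (a proxy whose hashing algorithm changes in front of several statsd instances) can be illustrated with a small sketch. It uses plain hash-modulo placement as a stand-in, not the real algorithms of statsdlb or statsd-proxy, and the backend names and metric keys are made up. The only point is that a different hash function sends many keys to a different backend, so a backend that keeps emitting values for keys it no longer receives would report alongside the new owner.

```python
import hashlib


def backend_for(key, backends, algo):
    """Pick a backend for `key` by hashing it with `algo` (stand-in placement)."""
    digest = hashlib.new(algo, key.encode("utf-8")).digest()
    return backends[int.from_bytes(digest[:8], "big") % len(backends)]


def moved_keys(keys, backends, old_algo, new_algo):
    """Keys whose owning backend differs between the two hash algorithms."""
    return [
        k for k in keys
        if backend_for(k, backends, old_algo) != backend_for(k, backends, new_algo)
    ]


# Hypothetical backends and metric names.
backends = ["statsd-a", "statsd-b", "statsd-c"]
keys = ["app.requests.%d" % i for i in range(1000)]
moved = moved_keys(keys, backends, "md5", "sha1")
print("%d of %d keys change backend" % (len(moved), len(keys)))
```

Each key in `moved` would, until the old backend is restarted or stops flushing idle keys, be reported by two backends at once.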
[14:05:34] PROBLEM - restbase endpoints health on restbase1014 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.48.133, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [14:06:40] yes. [14:07:24] PROBLEM - Restbase root url on restbase1014 is CRITICAL: Connection refused [14:08:04] PROBLEM - cassandra-a CQL 10.64.48.135:9042 on restbase1014 is CRITICAL: Connection refused [14:08:43] ACKNOWLEDGEMENT - Restbase root url on restbase1014 is CRITICAL: Connection refused Filippo Giunchedi cassandra bootstrapping [14:08:43] ACKNOWLEDGEMENT - cassandra-a CQL 10.64.48.135:9042 on restbase1014 is CRITICAL: Connection refused Filippo Giunchedi cassandra bootstrapping [14:08:44] ACKNOWLEDGEMENT - restbase endpoints health on restbase1014 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.48.133, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) Filippo Giunchedi cassandra bootstrapping [14:10:24] PROBLEM - statsdlb process on graphite1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name statsdlb [14:10:36] yeah, statsd-proxy is still dropping packets [14:10:46] much less than statsdlb [14:11:36] how many proxy we have? [14:12:05] proxy? [14:12:13] statsd-proxy [14:12:38] ATM one, statsd.eqiad.wmnet [14:12:58] I've configured it with 8 threads, on a single host [14:13:49] ok [14:14:05] nice, how many fewer is it dropping btw? [14:14:26] fixing up my english is left as an exercise for the reader [14:15:53] "how many fewer is it dropping" is perfectly correct [14:16:21] oh ok, thanks! 
it didn't sound correct in my head [14:16:36] 6Operations, 10Dumps-Generation, 7HHVM, 13Patch-For-Review: Convert snapshot hosts to use HHVM and trusty - https://phabricator.wikimedia.org/T94277#2190070 (10ArielGlenn) During the full run some jobs produce "Lost parent" from hhvm; so far I have only seen this for articles dumps, i.e. those that spawn... [14:20:22] (03CR) 10Bmansurov: "Yes, the patch is ready to merge. Although I am not sure why jenkins is failing." [puppet] - 10https://gerrit.wikimedia.org/r/281031 (https://phabricator.wikimedia.org/T127021) (owner: 10Bmansurov) [14:21:11] 6Operations, 10Dumps-Generation, 7HHVM, 13Patch-For-Review: Convert snapshot hosts to use HHVM and trusty - https://phabricator.wikimedia.org/T94277#1158594 (10Joe) the LightprocessCount: Lost parent message is just signaling the exit of a child php process. So it's most definitely not an error. You migh... [14:21:44] (03PS5) 10Elukey: Remove logrotate/syslog configurations. [software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/282161 (https://phabricator.wikimedia.org/T129344) [14:22:51] godog: I finally figured out sometime in college that "what sounds correct" is not always correct in English :) [14:25:03] RECOVERY - statsdlb process on graphite1001 is OK: PROCS OK: 1 process with command name statsdlb [14:25:51] heheh and vice versa apparently [14:27:05] 6Operations: Something in WMF infrastructure corrupts responses with certain lengths - https://phabricator.wikimedia.org/T132159#2190078 (10Anomie) [14:27:26] 6Operations: Something in WMF infrastructure corrupts responses with certain lengths - https://phabricator.wikimedia.org/T132159#2190094 (10Anomie) [14:27:38] ok, I restored statsdlb as it was, but killed statsite [14:27:47] the default 8125 instance that is [14:28:07] let's see first how things are in this stable situation as to not compare apples and oranges [14:28:43] and seriously, let's offload statsd a bit [14:29:08] our statsd metrics are so 
much filled with gaps that are unusable [14:29:17] between that and less but usable metrics, I'd pick the latter [14:29:25] but let's start with diamond and see how we fare after that :) [14:29:36] oh and the new box, with a more recent hardware, distro and kernel [14:29:38] (03PS6) 10Elukey: Remove logrotate/syslog configurations. [software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/282161 (https://phabricator.wikimedia.org/T129344) [14:30:13] agreed! if diamond doesn't help that much, the next step would be to move off clients that don't actually need aggregation [14:33:24] 6Operations, 10Analytics-Cluster: setup stat1004/WMF4721 for hadoop client usage - https://phabricator.wikimedia.org/T131877#2190122 (10Ottomata) Thanks! [14:35:55] I wonder if carbon will handle the load from diamond [14:36:15] or if we'll just have the same problem [14:37:53] I've worked on a clustered version of this stack, in case we want to scale it a bit :) [14:38:50] (in another place, in case was not clear) [14:42:30] paravoid: heh, I don't think so but don't have any hard data to back it up until next week! 
[14:43:22] volans: I'd love to hear about it [14:43:39] (03PS1) 10Ema: Use LDFLAGS in Makefile [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/282375 [14:44:08] paravoid: of course depends on how many resources we can dedicate to it, but a quick summary is: [14:44:54] - N statsd-proxy with consistent hashing load balanced (optionally with multiple instance on each host if the proxy doesn't scale properly) [14:45:38] - N statsd that don't send old data if no data is received for a key [14:46:52] - N carbon relay load balanced, with consistent hashing and replica factor of 2 that sends data with pickle to [14:47:35] - N carbon cache instances with a local carbon relay to relay to multiple carbon-cache instance to use all CPUs as long as they don't fill up IOPS [14:47:50] each cluster can scale independently of the others [14:48:09] 6Operations, 10OCG-General, 6Services: OCG should not be contacted directly from the appservers but only via LVS - https://phabricator.wikimedia.org/T120077#2190190 (10Jdforrester-WMF) [14:48:12] 6Operations, 10OCG-General, 6Scrum-of-Scrums, 6Services, 7Technical-Debt: The OCG cleanup cache script doesn't work properly - https://phabricator.wikimedia.org/T120079#2190189 (10Jdforrester-WMF) [14:48:32] 6Operations, 10OCG-General: ocg alarm ocg_job_status_queue 'flapping' - https://phabricator.wikimedia.org/T97524#2190200 (10Jdforrester-WMF) [14:49:00] paravoid: EOF [14:49:02] 6Operations, 10Analytics-Cluster, 6Analytics-Kanban: setup stat1004/WMF4721 for hadoop client usage - https://phabricator.wikimedia.org/T131877#2190216 (10Ottomata) a:5Ottomata>3elukey [14:49:26] 6Operations, 10Analytics-Cluster, 6Analytics-Kanban: setup stat1004/WMF4721 for hadoop client usage - https://phabricator.wikimedia.org/T131877#2181451 (10Ottomata) Ok! @elukey this is online as stat1004.eqiad.wmnet. It has base puppet. It needs the following two roles: - analytics_cluster::client - an... 
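The "consistent hashing and replica factor of 2" item in the summary above can be sketched as a hash ring that returns two distinct destinations per metric. This is an illustration under assumptions, not carbon-relay's actual implementation: the `Ring` class, the number of points per node, and the host names are all hypothetical.

```python
import bisect
import hashlib


class Ring:
    """Tiny consistent-hash ring returning several distinct nodes per key."""

    def __init__(self, nodes, points_per_node=100):
        # Each node is placed at many pseudo-random positions on the ring.
        self._ring = sorted(
            (self._hash("%s:%d" % (node, i)), node)
            for node in nodes
            for i in range(points_per_node)
        )
        self._positions = [pos for pos, _ in self._ring]
        self._node_count = len(set(nodes))

    @staticmethod
    def _hash(value):
        return int(hashlib.md5(value.encode("utf-8")).hexdigest()[:8], 16)

    def nodes_for(self, metric, replicas=2):
        """Walk clockwise from the metric's position, collecting distinct nodes."""
        replicas = min(replicas, self._node_count)
        start = bisect.bisect(self._positions, self._hash(metric))
        chosen = []
        for offset in range(len(self._ring)):
            node = self._ring[(start + offset) % len(self._ring)][1]
            if node not in chosen:
                chosen.append(node)
                if len(chosen) == replicas:
                    break
        return chosen


# Hypothetical carbon-cache hosts and metric name.
ring = Ring(["cache-a", "cache-b", "cache-c"])
targets = ring.nodes_for("app.requests.count")
```

Because the walk keeps the relative order of the remaining nodes, dropping one host from the ring leaves the surviving replica of every metric in place, which is what makes a replication factor of 2 useful here.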
[14:49:29] 6Operations, 10Analytics-Cluster, 6Analytics-Kanban: setup stat1004/WMF4721 for hadoop client usage - https://phabricator.wikimedia.org/T131877#2190229 (10Ottomata) [14:49:50] although there are some limitations to scale the carbon-cache hosts [14:50:46] traffic to statsd-proxy and statsd over UDP and the rest over TCP [14:51:10] 6Operations, 7Puppet, 10Beta-Cluster-Infrastructure, 10OCG-PDF-renderer, and 2 others: Error: Sysctl::Parameters[wikimedia base]: Could not evaluate: can't dup Symbol on deployment-pdf01 - https://phabricator.wikimedia.org/T87197#2190274 (10Jdforrester-WMF) [14:53:13] volans: nice! each cluster of how many machines btw? also how did it work to sync up a failed machine with the others? [14:54:07] the closest I've seen for sth like that is https://github.com/graphite-project/carbonate [14:55:02] 6Operations, 7Puppet, 10Beta-Cluster-Infrastructure, 10OCG-PDFRenderer, 7WorkType-Maintenance: Error: Sysctl::Parameters[wikimedia base]: Could not evaluate: can't dup Symbol on deployment-pdf01 - https://phabricator.wikimedia.org/T87197#2190419 (10Jdforrester-WMF) [14:55:07] so we started with 2 also for HA, this cluster will handle metrics from the application's code so they will have tons of metrics [14:56:10] and given that started recently, they are still adding metrics so it's not yet in a position to need more machines or give some good reference about numbers to compare [14:57:57] (03CR) 10Ema: [C: 031] cache: vary statsd_server with hiera [puppet] - 10https://gerrit.wikimedia.org/r/249490 (https://phabricator.wikimedia.org/T116898) (owner: 10Hashar) [14:58:05] all the clusters but the carbon-cache one can scale up/down without any issue apart affecting a single aggregation unit (10s) in this case [14:59:01] the carbon-cache one is the problematic one and yeah carbonate is there for this, but given the replica factor of 2 and that graphite is smart to retrieve data [14:59:37] if one host is down has no user impact [14:59:45] (03CR) 
10Elukey: [C: 031] Use LDFLAGS in Makefile [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/282375 (owner: 10Ema) [15:01:16] although if you add carbon-cache hosts yes you have to rebalance the cluster and it's quite resource consuming AFAIK [15:02:55] PROBLEM - puppet last run on stat1003 is CRITICAL: CRITICAL: Puppet has 1 failures [15:03:32] volans: indeed, but yeah perhaps with two-machine clusters would be ok already to start with [15:04:03] also there's https://wikitech.wikimedia.org/wiki/Graphite/Scaling if you are interested, likely needs updating by now [15:04:51] then there are 2 frontend with graphite that connects to the graphite on the carbon-cache hosts load balanced and also to an elasticsearch where the annotations are saved [15:05:06] thanks, I'll read it godog [15:06:35] np! feel free to update it as you wish [15:09:02] (03CR) 10Ema: [C: 032 V: 032] Use LDFLAGS in Makefile [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/282375 (owner: 10Ema) [15:15:29] hey grafana fans [15:15:43] trying to figure out why i can see a metric in graphite UI, but not in grafana UI [15:15:46] it shows up via autocomplete [15:15:50] but doesn't show anyting on graph [15:18:25] 6Operations, 10Wikimedia-Shop, 7Security-Other: approval for shop.wikimedia.org with shopify/digicert - https://phabricator.wikimedia.org/T132172#2190627 (10RobH) [15:18:57] thats not really private, but its 'other security issue' so odd it echos... [15:18:59] meh [15:21:52] ottomata: do you have some template variable set that maybe doesn't match? 
[15:22:25] nope no template [15:22:29] its a single stat [15:23:36] and you see data in graphite and no data in grafana for the same timeframe [15:25:26] yup [15:25:34] this graphite [15:25:35] https://graphite.wikimedia.org/render/?width=588&height=313&target=eventbus.timers.eventlogging.service.EventHandler.POST.201.median [15:25:46] robh, nope, it doesn't have a non-public visibility policy [15:25:59] heh, oh well, i assumed due to it being a security drop down at all [15:26:02] =P [15:26:10] (good thing it isnt really private at all!) [15:26:17] yeah it used to set the policy when you did that [15:26:35] Ok, good, I wasn't losing my mind. It has simply changed as we shake things out. [15:26:57] so why did you mark it as confidential when it wasn't really private at all? [15:27:09] its a security issue [15:27:24] but its also more of a 'im 99.999999% this is ok but im doing this to make sure' [15:27:26] heh [15:28:06] i can remove the confidential part and just rely on 'security-other' project i suppose [15:28:07] public security issues don't get marked private [15:28:20] 6Operations, 10Wikimedia-Shop, 7Security-Other: approval for shop.wikimedia.org with shopify/digicert - https://phabricator.wikimedia.org/T132172#2190679 (10RobH) [15:28:28] fixed [15:28:47] Krenair: yep, i should have defaulted to open, all good now [15:29:20] we should have a task to fix up the security dropdown option for other confidential [15:29:29] assuming there isn't one already [15:29:34] ottomata: I can see the data [15:30:07] did you set the data source in the bottom-right to graphite? [15:34:01] 6Operations, 10Wikimedia-Shop, 7Security-Other: approval for shop.wikimedia.org with shopify/digicert - https://phabricator.wikimedia.org/T132172#2190627 (10Platonides) I think you mean store.wikimedia.org, which is the one hosted by shopify and the dns name in their SAN. shop.wikimedia.org is hosted by WMF... 
[15:35:50] robh or someone else, could you please merge https://gerrit.wikimedia.org/r/#/c/282119/ [15:36:04] hoo is sitting next to me and can verify again [15:36:46] (03PS2) 10Aude: Replace my ssh key [puppet] - 10https://gerrit.wikimedia.org/r/282119 [15:37:05] aude: typically just entering your own patch is good enough, hoo also is bonus, cool to me [15:37:16] +1 :) [15:37:19] (03PS3) 10Aude: Replace my ssh key [puppet] - 10https://gerrit.wikimedia.org/r/282119 [15:37:27] robh: ok [15:37:44] im just watching zuul run tests [15:37:57] though it seems to be stuck in rebase hell [15:38:04] * aude didn't trust israeli security, especially since i came from jordan [15:38:17] i can try to rebase [15:38:38] (03PS4) 10RobH: Replace my ssh key [puppet] - 10https://gerrit.wikimedia.org/r/282119 (owner: 10Aude) [15:39:08] (03CR) 10RobH: [C: 032] Replace my ssh key [puppet] - 10https://gerrit.wikimedia.org/r/282119 (owner: 10Aude) [15:39:12] just waiting on zuul [15:39:13] thanks [15:40:03] quite welcome, thx for being paranoid. (thats not sarcasm, i applaud caution in regards to ssh keys!) [15:40:23] i did the same when my laptop was taken out of my direct line of sight on travel a few years ago. [15:40:54] security was pretty bad, but survived (and laptop in my sight at all times) [15:41:40] as i typed the line of sight thing i realized it doesnt apply someone could totally clone within line of sight. i should be more paranoid. [15:42:07] all they did was xray and swab all my electronics for explosives [15:42:20] they could have looked at the pictures on my camera but didn't [15:42:24] on the wikimania trip they made me empty all my bags to check the lining of them [15:42:27] and power on every device [15:42:30] good times =] [15:42:31] yep, they did that [15:42:37] Krenair: robh: the security dropdown is totally useless now, we make custom forms for things that need actual security policies [15:42:45] not power on [15:42:47] twentyafterfour: cool, noted! 
[15:42:53] so if we need a 'submit confidential stuff' form, I can make one of those [15:42:59] but found all my electronic things and swabbed them [15:42:59] aude: new key merged, it'll take a bit to push out to all the systems [15:43:00] twentyafterfour, aren't we missing one for the WMF-NDA area then? [15:43:04] ok, thanks [15:43:10] welcome [15:43:23] PROBLEM - Kafka Broker Replica Max Lag on kafka1013 is CRITICAL: CRITICAL: 51.72% of data above the critical threshold [5000000.0] [15:43:33] what they didn't find is my receipt from my hotel in ramallah :) [15:43:50] Krenair: possibly, I never knew what 'other confidential' was used for or exactly how it should behave - the old behavior was ill-defined anyway [15:43:58] nor ask me about that, (distracted enough with asking me about jordan and other muslim countries) [15:44:03] that receipt could have been awkward heh [15:44:07] yeah [15:44:22] twentyafterfour, private but not security-sensitive [15:44:23] anyway.... [15:50:06] aude: glad you made it out. I was lucky. I just got about 10 minutes of poorly worded questions from a trainee. 
[15:50:54] jealous [15:52:03] * aude thinks it would have been better to go across the bridge again, back to jordan [15:54:15] RECOVERY - Kafka Broker Replica Max Lag on kafka1013 is OK: OK: Less than 50.00% above the threshold [1000000.0] [15:57:18] (03PS1) 10Volans: Allow a third value to use the Puppet certs [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/282380 (https://phabricator.wikimedia.org/T111654) [16:00:54] 6Operations, 10Beta-Cluster-Infrastructure, 3Scap3, 7WorkType-NewFunctionality: etcd/confd is not started on beta cluster Varnish caches - https://phabricator.wikimedia.org/T116224#2190866 (10mmodell) [16:05:35] PROBLEM - Kafka Broker Replica Max Lag on kafka1012 is CRITICAL: CRITICAL: 72.41% of data above the critical threshold [5000000.0] [16:09:40] (03CR) 10Volans: [C: 032] Allow a third value to use the Puppet certs [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/282380 (https://phabricator.wikimedia.org/T111654) (owner: 10Volans) [16:14:07] (03PS1) 10Greg Grossmeier: Revert "350K articles celebration logo on cs.wikipedia" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282381 (https://phabricator.wikimedia.org/T131605) [16:19:36] (03PS1) 10Giuseppe Lavagetto: [WiP] redis: add monitoring define [puppet] - 10https://gerrit.wikimedia.org/r/282383 [16:33:42] (03CR) 10Phuedx: "Has the patch been tested? By whom?" [puppet] - 10https://gerrit.wikimedia.org/r/281031 (https://phabricator.wikimedia.org/T127021) (owner: 10Bmansurov) [16:49:32] (03PS1) 10Volans: MariaDB: allow multiple MySQL TLS configurations [puppet] - 10https://gerrit.wikimedia.org/r/282385 (https://phabricator.wikimedia.org/T111654) [16:49:35] RECOVERY - Kafka Broker Replica Max Lag on kafka1012 is OK: OK: Less than 50.00% above the threshold [1000000.0] [16:56:01] volans: sorry meetings! 
[16:56:01] um,yes [16:56:11] i actually duplicated an existing panel and just editedit [16:56:35] I've tried creating a new one from scratch and it showed the data [16:56:42] ottomata: nw, me too ;) [16:57:13] 6Operations: Commons serving wrong image - https://phabricator.wikimedia.org/T132193#2191090 (10eranroz) [16:57:19] oh really [16:57:20] hm [16:59:00] bah! me too now volans [16:59:03] new panel works fine [16:59:05] dunno! ok! [16:59:29] lol, probably something in the old panel that made it not show the data at this point [17:02:50] 6Operations, 10Traffic, 10Wikimedia-Blog, 7HTTPS: Switch blog to HTTPS-only - https://phabricator.wikimedia.org/T105905#1454386 (10jrbs) [Wordpress is now using HTTPS as default for blogs making use of their backend.](http://thenextweb.com/insider/2016/04/08/wordpress-rolls-encrypted-https-standard/) I'm n... [17:09:02] 6Operations: Commons serving wrong image - https://phabricator.wikimedia.org/T132193#2191090 (10jayvdb) I see the same image for the thumbnail as fullsize on commons. [17:11:09] (03PS1) 10Alexandros Kosiaris: ganeti: add a system::role [puppet] - 10https://gerrit.wikimedia.org/r/282390 [17:13:23] (03CR) 10Alexandros Kosiaris: [C: 04-1] "One inline comment, otherwise looks good" (031 comment) [debs/contenttranslation/lttoolbox] - 10https://gerrit.wikimedia.org/r/269115 (https://phabricator.wikimedia.org/T124137) (owner: 10KartikMistry) [17:16:33] (03CR) 10Ottomata: [C: 032] Remove logrotate/syslog configurations. [software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/282161 (https://phabricator.wikimedia.org/T129344) (owner: 10Elukey) [17:16:39] (03CR) 10Alexandros Kosiaris: [C: 032] ganeti: add a system::role [puppet] - 10https://gerrit.wikimedia.org/r/282390 (owner: 10Alexandros Kosiaris) [17:17:42] 6Operations: Something in WMF infrastructure corrupts responses with certain lengths - https://phabricator.wikimedia.org/T132159#2191147 (10Yurivict) This problem is a recent regression. 
Didn't happen a month ago. [17:43:00] (03PS1) 10Dereckson: New logo for vec.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282396 (https://phabricator.wikimedia.org/T132185) [18:04:45] (03CR) 10Luke081515: [C: 031] New logo for vec.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282396 (https://phabricator.wikimedia.org/T132185) (owner: 10Dereckson) [18:06:28] (03CR) 10Luke081515: [C: 031] Revert "350K articles celebration logo on cs.wikipedia" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282381 (https://phabricator.wikimedia.org/T131605) (owner: 10Greg Grossmeier) [18:07:49] (03CR) 10Luke081515: [C: 031] Enable Ex:OATHAuth in beta, disabled for all users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282198 (owner: 10CSteipp) [18:08:29] (03CR) 10Luke081515: [C: 031] Use extension registration for ProofreadPage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281976 (https://phabricator.wikimedia.org/T119117) (owner: 10Dereckson) [18:09:01] (03CR) 10Luke081515: [C: 031] Enable VisualEditor on the Project ('Wikipedya') of htwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281263 (https://phabricator.wikimedia.org/T130177) (owner: 10Jforrester) [18:09:40] (03CR) 10Luke081515: [C: 031] Use extension registration for TitleBlacklist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281240 (https://phabricator.wikimedia.org/T119117) (owner: 10Dereckson) [18:11:03] (03CR) 10Luke081515: [C: 031] Use extension registration for SpamBlacklist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281239 (https://phabricator.wikimedia.org/T119117) (owner: 10Dereckson) [18:11:23] (03CR) 10Luke081515: [C: 031] Use extension registration for LabeledSectionTransclusion [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281237 (https://phabricator.wikimedia.org/T119117) (owner: 10Dereckson) [18:12:53] mafk: Hi. Could you check the procedure I suggest at https://phabricator.wikimedia.org/T66122#2136528? 
[18:13:20] (03PS1) 10BryanDavis: Add .mailmap to cleanup duplicate authors [puppet] - 10https://gerrit.wikimedia.org/r/282405 [18:20:05] (03PS2) 10BryanDavis: Add .mailmap to cleanup duplicate authors [puppet] - 10https://gerrit.wikimedia.org/r/282405 [18:21:34] bd808: nice aside :) [18:23:13] greg-g: I haven't looked to see how many are current/former WMF employees, but I think that is actually the majority [18:23:38] technically the WMF could pick any OSI license for those contributions I believe [18:23:43] (03PS1) 10Halfak: Adds russian myspell package to ores base. [puppet] - 10https://gerrit.wikimedia.org/r/282408 [18:24:51] bd808: yep, that's the saving grace for it, it probably brings us down to <20 (rough/wild guess) [18:25:07] 6Operations: Commons serving wrong image - https://phabricator.wikimedia.org/T132193#2191341 (10eranroz) I see two different images: *[[File:Erel_Margalit.jpg|250px]] - this shows one image *[[File:Erel_Margalit.jpg|251px]] - this show a totally different one For example: https://en.wikipedia.org/wiki/User:%D7... [18:28:01] 6Operations, 10DBA, 10hardware-requests: Decomission db1010 - https://phabricator.wikimedia.org/T129395#2191344 (10RobH) This needs more info before it can be acted upon. Is this machine no longer in use and no longer wanted by the DB team? (I would assume so since the task is generated, but as it doesn't... [18:28:14] (03PS3) 10BryanDavis: Add .mailmap to cleanup duplicate authors [puppet] - 10https://gerrit.wikimedia.org/r/282405 [18:28:37] 6Operations, 10DBA, 10hardware-requests: Decomission db1010 - https://phabricator.wikimedia.org/T129395#2191347 (10RobH) a:3Volans assigned to @volans for input since @jcrespo is out. If this waits for his return, that is likely ok. [18:34:14] (03PS4) 10BryanDavis: Add .mailmap to cleanup duplicate authors [puppet] - 10https://gerrit.wikimedia.org/r/282405 [18:54:42] Dereckson: hi - sorry but I'm not confident with that. 
If I can't fully see if the code I wrote is working as expected I won't take care of the patch. [18:55:00] maybe varnent can grant me temp. access to chapcom wiki in order to test [18:55:05] and then remove my account [19:07:25] PROBLEM - Apache HTTP on mw1201 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.016 second response time [19:07:43] PROBLEM - HHVM rendering on mw1201 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.004 second response time [19:12:16] PROBLEM - Apache HTTP on mw1143 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.007 second response time [19:12:34] PROBLEM - HHVM rendering on mw1143 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.016 second response time [19:20:22] (03CR) 10Jforrester: [C: 031] "Scheduled for Monday's SWAT." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/280869 (owner: 10Jforrester) [19:20:31] (03CR) 10Jforrester: [C: 031] "Scheduled for Monday's SWAT." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/280870 (owner: 10Jforrester) [19:20:43] (03CR) 10Jforrester: [C: 031] "Scheduled for Monday's SWAT." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281263 (https://phabricator.wikimedia.org/T130177) (owner: 10Jforrester) [19:21:31] (03CR) 10Jforrester: [C: 031] "Scheduled for Tuesday's SWAT." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/274131 (https://phabricator.wikimedia.org/T128478) (owner: 10Jforrester) [19:29:56] (03PS1) 10Mattflaschen: Enable Echo survey on French-language wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282414 (https://phabricator.wikimedia.org/T131893) [19:32:06] (03CR) 10Luke081515: [C: 031] Enable Echo survey on French-language wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282414 (https://phabricator.wikimedia.org/T131893) (owner: 10Mattflaschen) [19:33:23] (03PS1) 10EBernhardson: Maintain a current symlink for cirrussearch dumps [puppet] - 10https://gerrit.wikimedia.org/r/282415 [19:37:26] 6Operations, 10ops-codfw: codfw: return one intel ssd to dasher for warranty replacement - https://phabricator.wikimedia.org/T132210#2191544 (10RobH) [19:39:16] 6Operations, 10Wikimedia-Shop, 7Security-Other: approval for shop.wikimedia.org with shopify/digicert - https://phabricator.wikimedia.org/T132172#2190627 (10csteipp) @RobH, if they're adding our domain in their SAN already, is there a need to also give them a shop.wikimedia.org cert for SNI clients? If we do... [19:39:58] 6Operations, 10ops-codfw: codfw: return one intel ssd to dasher for warranty replacement - https://phabricator.wikimedia.org/T132210#2191561 (10RobH) [19:46:22] (03Abandoned) 10Yuvipanda: eventlogging: Emit \n at beginning of log message, not end [puppet] - 10https://gerrit.wikimedia.org/r/263816 (owner: 10Yuvipanda) [19:49:32] 6Operations, 10ops-codfw: rack/setup/deploy conf200[123] - https://phabricator.wikimedia.org/T131959#2191579 (10Southparkfan) [20:12:09] (03PS3) 10Yuvipanda: tools: Remove ldapspportlib use from toolschecker [puppet] - 10https://gerrit.wikimedia.org/r/242044 (https://phabricator.wikimedia.org/T114063) [20:15:17] 6Operations, 10Wikimedia-Shop, 7Security-Other: approval for shop.wikimedia.org with shopify/digicert - https://phabricator.wikimedia.org/T132172#2191641 (10RobH) >>! 
In T132172#2191559, @csteipp wrote: > @RobH, if they're adding our domain in their SAN already, is there a need to also give them a shop.wikim... [20:15:25] (03CR) 10Gehel: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/282415 (owner: 10EBernhardson) [20:19:07] (03CR) 10Yuvipanda: [C: 032 V: 032] tools: Remove ldapspportlib use from toolschecker [puppet] - 10https://gerrit.wikimedia.org/r/242044 (https://phabricator.wikimedia.org/T114063) (owner: 10Yuvipanda) [20:28:25] (03PS1) 10Yuvipanda: tools: Fixups for toolschecker [puppet] - 10https://gerrit.wikimedia.org/r/282427 [20:28:44] (03CR) 10Yuvipanda: [C: 032 V: 032] tools: Fixups for toolschecker [puppet] - 10https://gerrit.wikimedia.org/r/282427 (owner: 10Yuvipanda) [20:29:19] 6Operations, 10Wikimedia-Shop, 7Security-Other: approval for shop.wikimedia.org with shopify/digicert - https://phabricator.wikimedia.org/T132172#2191681 (10csteipp) > The verification is simple, Digicert has been spamming us with the request to verify to our dns-admin alias, which is what alerted me to this... [20:34:10] (03CR) 10EBernhardson: [C: 031] CirrusSearch on Labs uses the full Elasticsearch cluster again [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282359 (https://phabricator.wikimedia.org/T130219) (owner: 10Gehel) [20:39:47] (03PS1) 10Gehel: Adding SSH key for gehel / glederrey@wikimedia.org [labs/private] - 10https://gerrit.wikimedia.org/r/282428 [20:40:53] (03CR) 10Yuvipanda: [C: 032 V: 032] "Welcome!" 
[labs/private] - 10https://gerrit.wikimedia.org/r/282428 (owner: 10Gehel) [20:42:31] (03Abandoned) 10Yuvipanda: scap: Fix mwgrep pep8 warnings [puppet] - 10https://gerrit.wikimedia.org/r/255279 (owner: 10Yuvipanda) [20:42:40] (03Abandoned) 10Yuvipanda: scap: Allow customizing search host in mwgrep [puppet] - 10https://gerrit.wikimedia.org/r/255282 (owner: 10Yuvipanda) [20:43:13] (03Abandoned) 10Yuvipanda: WIP: Make yaml+ldap ENC into a http service [puppet] - 10https://gerrit.wikimedia.org/r/197712 (https://phabricator.wikimedia.org/T85279) (owner: 10Yuvipanda) [20:52:15] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] [20:52:23] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] [20:59:34] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [20:59:34] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [21:01:23] 6Operations, 10CirrusSearch, 6Discovery, 6Discovery-Search-Backlog, and 5 others: Look into encrypting Elasticsearch traffic - https://phabricator.wikimedia.org/T124444#2191753 (10Deskana) [21:01:25] 6Operations, 10CirrusSearch, 6Discovery, 6Discovery-Search-Backlog, and 5 others: Activate SSL + connection pooling for CirrusSearch on PROD - https://phabricator.wikimedia.org/T131839#2191752 (10Deskana) 5Open>3Resolved [21:04:11] 6Operations: Encrypt all the things - https://phabricator.wikimedia.org/T111653#2191757 (10Deskana) [21:04:13] 6Operations, 10CirrusSearch, 6Discovery, 6Discovery-Search-Backlog, and 5 others: Look into encrypting Elasticsearch traffic - https://phabricator.wikimedia.org/T124444#2191755 (10Deskana) 5Open>3Resolved Yay! Thanks for also adding this to this week's email update. 
:-) [21:17:55] (03CR) 10Volans: "No-op apart the require => File[/etc/ssl/certs/Puppet_Internal_CA.pem]" [puppet] - 10https://gerrit.wikimedia.org/r/282385 (https://phabricator.wikimedia.org/T111654) (owner: 10Volans) [21:41:09] (03PS3) 10Mattflaschen: Expand computed dblist; leave flow_computed for easy regeneration: [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272929 [21:44:03] (03CR) 10Mattflaschen: Expand computed dblist; leave flow_computed for easy regeneration: (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272929 (owner: 10Mattflaschen) [21:45:51] (03PS4) 10Mattflaschen: Expand computed dblist; leave flow_computed for easy regeneration: [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272929 [22:11:41] (03PS1) 10Andrew Bogott: 'makedomain' needs to import the os lib. [puppet] - 10https://gerrit.wikimedia.org/r/282435 [22:13:20] (03CR) 10Andrew Bogott: [C: 032] 'makedomain' needs to import the os lib. [puppet] - 10https://gerrit.wikimedia.org/r/282435 (owner: 10Andrew Bogott) [22:24:45] hi! Anyone know how I can get a list of cookies names we're seeing from clients? [23:15:30] (03PS1) 10Mattflaschen: Beta Cluster: Use ExternalStore on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282440 (https://phabricator.wikimedia.org/T95871) [23:18:51] ^ Anyone able to review that (Beta Cluster change to use External Store, initially on testwiki)? ^ greg-g, andrewbogott, ori [23:21:09] matt_flaschen: +1 in spirit, not in actual code review [23:21:14] matt_flaschen: It looks sane [23:21:16] o.ri is on a plane, I believe [23:23:55] Okay, good to know. [23:24:03] * AndyRussG waves [23:24:23] Hi! Anyone know who I could ask to get a list of cookies in the wild, or somehow investigate that? [23:24:38] Hey, AndyRussG. I saw your question, but I don't know if that is logged/where. [23:25:12] matt_flaschen: hey :) K thanks... Any thoughts on whom to ping? [23:26:22] AndyRussG: What do you want? 
A list of cookies MW/the wmf stack may or may not send etc? [23:26:30] Or ones we're being sent back? [23:26:36] Reedy: hi! the latter [23:26:42] It's for this ticket: https://phabricator.wikimedia.org/T108849 [23:27:03] AndyRussG, I would file a task tagged with both operations and Analytics, then ping milimetric, bblack (I know he is one of the ops people interested in cookies) [23:27:10] AndyRussG, why do you need the full list for that task, though? [23:27:15] Yeah, I'd suggest bblack too [23:27:17] I'd hope to scoop up cookies left over from old banners, etc [23:27:19] Couldn't you grep through the fundraising codebase for cookie usage? [23:27:46] matt_flaschen: sort of [23:27:51] I can go through old banner code [23:27:52] That doesn't answer about latent cookies people's browsers are still sending back etc? [23:28:03] with silly long expiry or other stuff [23:28:05] AndyRussG, oh, is this because of onwiki-JavaScript? [23:28:06] 'cause that's also a source of old CentralNotice cookies [23:28:29] Yeah! Or, mostly, yeah, that's how a large part of banners work [23:29:05] 6Operations: Commons serving wrong image - https://phabricator.wikimedia.org/T132193#2192084 (10jayvdb) A screenshot of what you see would be helpful. [23:29:10] So it might be easy to miss old stuff, so I was looking for an additional source to make a list of old cookies to scoop up [23:29:34] I'd check in with Brandon [23:30:26] Also, ori has a comment on that task with a list of some cookies, so I thought he might have gotten that from somewhere... standard/checkable [23:30:27] https://phabricator.wikimedia.org/T108849#2050638 [23:30:49] He said "I have" [23:30:57] so presumably his local browser/cookie stash [23:31:25] Reedy: hmm yeah good point ;p [23:31:28] * AndyRussG adjusts glasses [23:31:41] sorry, I meant, spectacles [23:33:00] Reedy: matt_flaschen: thx much!!
[23:48:08] (03PS1) 10Mattflaschen: Fix Beta update.php error handling [puppet] - 10https://gerrit.wikimedia.org/r/282441 (https://phabricator.wikimedia.org/T110407) [23:52:11] (03PS1) 10Yuvipanda: tools: s/tools-checker.wmflabs.org/checker.tools.wmflabs.org/ [puppet] - 10https://gerrit.wikimedia.org/r/282442 (https://phabricator.wikimedia.org/T131796) [23:52:23] (03PS2) 10Yuvipanda: tools: s/tools-checker.wmflabs.org/checker.tools.wmflabs.org/ [puppet] - 10https://gerrit.wikimedia.org/r/282442 (https://phabricator.wikimedia.org/T131796) [23:54:10] (03CR) 10Yuvipanda: [C: 032 V: 032] tools: s/tools-checker.wmflabs.org/checker.tools.wmflabs.org/ [puppet] - 10https://gerrit.wikimedia.org/r/282442 (https://phabricator.wikimedia.org/T131796) (owner: 10Yuvipanda) [23:56:46] YuviPanda: Hm.. this applies to non-web hostnames only? I assume we don't want to deal with ssl certs for *.tools.wmflabs.org etc. [23:56:54] I thought we were moving en masse away from subsubdomains [23:57:04] Krinkle: we already have SSL certs for *.tools.wmflabs.org [23:57:25] Krinkle: but yeah, tools.wmflabs.org will still exist. I might set up static.tools.wmflabs.org though [23:57:37] Krinkle: and eventually hopefully provide $toolname.tools.wmflabs.org [23:57:52] Hm.. [23:57:53] interesting [23:58:39] what about non-tools? [23:58:42] e.g. cvn.wmflabs.org [23:58:51] we'll continue to have a generic web proxy that just works as plain subdomain? [23:58:53] Krinkle: yeah, so those are all proxies, which will remain under *.wmflabs.org [23:59:09] Krinkle: it's only floating IPs that are being modified. [23:59:19] right [23:59:29] YuviPanda: why not make tools-static behind the proxy as well though? [23:59:46] in what way is it / should it be special