[00:02:10] 06Operations, 10ArticlePlaceholder, 10Traffic, 10Wikidata: Performance and caching considerations for article placeholders accesses - https://phabricator.wikimedia.org/T142944#2566366 (10hoo) >>! In T142944#2566353, @DaBPunkt wrote: >>>! In T142944#2560827, @hoo wrote: >> For this, we also desire to get th... [00:04:20] 07Puppet, 10Beta-Cluster-Infrastructure, 10Salt: puppet on deployment-changeprop taking forever because of systemctl start salt-minion - https://phabricator.wikimedia.org/T143371#2566370 (10AlexMonk-WMF) [00:05:36] (03PS1) 10Dzahn: phabricator: only run dumps on active server [puppet] - 10https://gerrit.wikimedia.org/r/305600 [00:06:28] (03CR) 10Faidon Liambotis: [C: 04-1] "That will leave relforge unmonitored though, won't it?" [puppet] - 10https://gerrit.wikimedia.org/r/305519 (https://phabricator.wikimedia.org/T133844) (owner: 10Gehel) [00:06:46] (03CR) 10jenkins-bot: [V: 04-1] phabricator: only run dumps on active server [puppet] - 10https://gerrit.wikimedia.org/r/305600 (owner: 10Dzahn) [00:09:21] (03CR) 10Faidon Liambotis: [C: 031] Disable unprivileged user namespaces on trusty systems (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/304474 (https://phabricator.wikimedia.org/T142567) (owner: 10Muehlenhoff) [00:13:23] (03PS2) 10Dzahn: phabricator: only run dumps on active server [puppet] - 10https://gerrit.wikimedia.org/r/305600 [00:14:22] (03CR) 10Faidon Liambotis: [C: 031] "Looks good, with the limited time I spent on it. 
There's some trailing whitespace though :)" [puppet] - 10https://gerrit.wikimedia.org/r/253619 (https://phabricator.wikimedia.org/T99226) (owner: 10Faidon Liambotis) [00:14:32] (03Abandoned) 10Dzahn: DHCP: make configurable in Hiera which is the running server [puppet] - 10https://gerrit.wikimedia.org/r/305429 (https://phabricator.wikimedia.org/T132757) (owner: 10Dzahn) [00:16:19] (03CR) 10Dzahn: "this just splits the DHCP role out of the general "installserver" role and is a no-op on carbon" [puppet] - 10https://gerrit.wikimedia.org/r/305163 (https://phabricator.wikimedia.org/T132757) (owner: 10Dzahn) [00:19:13] (03CR) 10Faidon Liambotis: "Another, different comment:" [puppet] - 10https://gerrit.wikimedia.org/r/305419 (https://phabricator.wikimedia.org/T99226) (owner: 10BBlack) [00:21:46] (03CR) 10BryanDavis: [C: 031] deployment-prep: Move udp2log to deployment-fluorine02 [puppet] - 10https://gerrit.wikimedia.org/r/305587 (owner: 10Alex Monk) [00:23:08] (03CR) 10Dzahn: "noop on iridium, disables dumps on phab2001 -> http://puppet-compiler.wmflabs.org/3774/" [puppet] - 10https://gerrit.wikimedia.org/r/305600 (owner: 10Dzahn) [00:42:06] PROBLEM - Host mr1-eqiad.oob is DOWN: PING CRITICAL - Packet loss = 100% [00:43:28] PROBLEM - Host mr1-eqiad.oob IPv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2607:f6f0:205::153 [00:48:16] RECOVERY - Host mr1-eqiad.oob is UP: PING OK - Packet loss = 0%, RTA = 0.88 ms [00:49:36] RECOVERY - Host mr1-eqiad.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 39.22 ms [01:01:28] (03CR) 10Dzahn: [C: 032] phabricator: only run dumps on active server [puppet] - 10https://gerrit.wikimedia.org/r/305600 (owner: 10Dzahn) [01:09:17] (03CR) 10Dzahn: "cronspam should stop now. 
Notice: /Stage[main]/Phabricator::Tools/Cron[/srv/phab/tools/public_task_dump.py]/ensure: removed" [puppet] - 10https://gerrit.wikimedia.org/r/305600 (owner: 10Dzahn) [01:11:03] (03PS8) 10Dzahn: installserver: split DHCP part out into own role [puppet] - 10https://gerrit.wikimedia.org/r/305163 (https://phabricator.wikimedia.org/T132757) [01:12:57] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [01:14:48] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [01:24:41] (03PS1) 10Mattflaschen: Temporarily make NS_PROJECT_TALK wikitext again on Beta cawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/305604 (https://phabricator.wikimedia.org/T140588) [01:29:40] (03CR) 10Mattflaschen: [C: 032] Temporarily make NS_PROJECT_TALK wikitext again on Beta cawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/305604 (https://phabricator.wikimedia.org/T140588) (owner: 10Mattflaschen) [01:30:08] (03Merged) 10jenkins-bot: Temporarily make NS_PROJECT_TALK wikitext again on Beta cawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/305604 (https://phabricator.wikimedia.org/T140588) (owner: 10Mattflaschen) [01:31:42] !log mattflaschen@tin Synchronized wmf-config/InitialiseSettings-labs.php: Beta-only change (duration: 00m 51s) [01:31:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:42:22] (03PS2) 10EBernhardson: logging: Require acknowledgment of kafka logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292514 (https://phabricator.wikimedia.org/T135159) [01:43:37] (03CR) 10EBernhardson: "will schedule this for deploy next week" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292514 (https://phabricator.wikimedia.org/T135159) (owner: 10EBernhardson) [01:48:07] PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: CRITICAL - Rep Delay 
is: 1815.494609 Seconds [01:49:58] RECOVERY - Postgres Replication Lag on maps2002 is OK: OK - Rep Delay is: 72.796458 Seconds [01:50:26] PROBLEM - Postgres Replication Lag on maps-test2002 is CRITICAL: CRITICAL - Rep Delay is: 1817.495185 Seconds [01:52:18] RECOVERY - Postgres Replication Lag on maps-test2002 is OK: OK - Rep Delay is: 103.904506 Seconds [02:05:23] (03PS1) 10Mattflaschen: Revert "Temporarily make NS_PROJECT_TALK wikitext again on Beta cawiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/305608 [02:05:54] (03PS2) 10Mattflaschen: Revert "Temporarily make NS_PROJECT_TALK wikitext again on Beta cawiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/305608 (https://phabricator.wikimedia.org/T140588) [02:06:01] (03CR) 10Mattflaschen: [C: 032] Revert "Temporarily make NS_PROJECT_TALK wikitext again on Beta cawiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/305608 (https://phabricator.wikimedia.org/T140588) (owner: 10Mattflaschen) [02:06:30] (03Merged) 10jenkins-bot: Revert "Temporarily make NS_PROJECT_TALK wikitext again on Beta cawiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/305608 (https://phabricator.wikimedia.org/T140588) (owner: 10Mattflaschen) [02:08:24] !log mattflaschen@tin Synchronized wmf-config/InitialiseSettings-labs.php: Beta-only change (duration: 00m 54s) [02:08:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:20:36] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.15) (duration: 08m 59s) [02:20:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:26:42] !log l10nupdate@tin ResourceLoader cache refresh completed at Fri Aug 19 02:26:42 UTC 2016 (duration 6m 6s) [02:26:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:29:18] (03CR) 10Krinkle: Send xhprof profiles from mw1017 to xhgui (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271946 (owner: 
10Ori.livneh) [03:22:10] (03PS1) 10Ladsgroup: ores: Add memory report to uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/305611 (https://phabricator.wikimedia.org/T143081) [03:49:17] PROBLEM - MariaDB Slave Lag: s2 on db1047 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 637.60 seconds [04:05:37] PROBLEM - MariaDB Slave Lag: s2 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 636.99 seconds [04:07:46] (03CR) 10BBlack: "Yeah I had the same thought today while re-doing that patch. I think it's probably acceptable as a short-term workaround, assuming the ra" [puppet] - 10https://gerrit.wikimedia.org/r/305419 (https://phabricator.wikimedia.org/T99226) (owner: 10BBlack) [04:10:28] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [04:12:27] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [04:41:38] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [04:42:13] uhh [04:42:46] 241 error: Timeout reached waiting for an available pooled curl connection! 
[04:42:46] in /srv/mediawiki/php-1.28.0-wmf.15/extensions/CirrusSearch/includes/Elastica/Po [04:42:46] oledHttp.php on line 66 [04:43:38] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [05:11:06] PROBLEM - MariaDB Slave Lag: s2 on db2049 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 308.60 seconds [05:26:36] PROBLEM - MariaDB Slave Lag: s2 on db2049 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 305.06 seconds [05:51:57] RECOVERY - MariaDB Slave Lag: s2 on db2049 is OK: OK slave_sql_lag Replication lag: 0.60 seconds [06:10:37] (03PS1) 10Yuvipanda: labs: Set timeout for ldap3 using scripts [puppet] - 10https://gerrit.wikimedia.org/r/305616 (https://phabricator.wikimedia.org/T143375) [06:21:25] * Krinkle holds the deployment lock [06:21:37] Backporting https://gerrit.wikimedia.org/r/#/c/305618/1 to unbreak a regression from this week [06:21:54] legoktm: interesting error [06:22:57] https://github.com/facebook/hhvm/search?utf8=%E2%9C%93&q=%22Timeout+reached+waiting+for+an+available%22&type=Code [06:31:15] Verified on mw1099 with testwiki [06:32:21] !log krinkle@tin Synchronized php-1.28.0-wmf.15/includes/OutputPage.php: T143357 (duration: 00m 55s) [06:32:22] T143357: "ext.globalCssJs.user.styles" loaded before site styles marker instead of after - https://phabricator.wikimedia.org/T143357 [06:32:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:42:43] 06Operations, 06MediaWiki-Stakeholders-Group, 10Traffic, 07Developer-notice, and 2 others: Get rid of geoiplookup service - https://phabricator.wikimedia.org/T100902#1323201 (10Nikerabbit) >>! In T100902#2566120, @Platonides wrote: > I think a new ULS version not relying on that should be released before t... 
[06:51:02] 06Operations, 10MediaWiki-extensions-UniversalLanguageSelector, 10Traffic, 13Patch-For-Review: ULS GeoIP should only use the Cookie - https://phabricator.wikimedia.org/T143270#2566808 (10Nemo_bis) [06:58:37] RECOVERY - MariaDB Slave Lag: s2 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 0.26 seconds [07:16:24] 06Operations, 06Commons, 10Wikimedia-SVG-rendering, 07User-notice: SVG files larger than 10 MB cannot be thumbnailed - https://phabricator.wikimedia.org/T111815#2566811 (10MoritzMuehlenhoff) 05Open>03Resolved Seems all fine, previously on average 250 SVG thumbailings failed daily due to size limitation... [07:16:43] (03PS1) 10ArielGlenn: reduce cronspam, eliminate dataset rsync whines about vanished files [puppet] - 10https://gerrit.wikimedia.org/r/305621 [07:18:29] (03CR) 10ArielGlenn: [C: 032] reduce cronspam, eliminate dataset rsync whines about vanished files [puppet] - 10https://gerrit.wikimedia.org/r/305621 (owner: 10ArielGlenn) [07:30:06] !log installing gnupg security updates [07:30:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:31:16] Nice [07:31:36] moritzm: I heard about that security flaw [07:31:51] http://lists.gnu.org/archive/html/info-gnu/2016-08/msg00008.html [07:31:54] 06Operations, 06MediaWiki-Stakeholders-Group, 10Traffic, 07Developer-notice, and 2 others: Get rid of geoiplookup service - https://phabricator.wikimedia.org/T100902#2566816 (10Nemo_bis) If all else fails, could the WMF consider serving something like https://freegeoip.net/ (which uses the free downloads h... [07:32:27] I also updated my debian VM :P [07:32:58] ACKNOWLEDGEMENT - MariaDB Slave Lag: s2 on db1047 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 7315.91 seconds Jcrespo This is the slowest and busiest s2 slave (analytics) - this is warning-level only. 
[07:37:51] Bsadowski1: if you use gnupg 2, make sure to also upgrade libgcrypt, it handles most of the crypto in gpg2 [07:38:41] haha, I don't really use my Debian VM for much. It is just in case I guess. [07:40:06] I did update that though, moritzm. [07:40:18] A package manager did it for me [07:50:59] !log depooling mw2215 for some tests with the hhvm systemd unit [07:51:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:51:12] (03PS3) 10Giuseppe Lavagetto: scap: add conftool class [puppet] - 10https://gerrit.wikimedia.org/r/305278 [07:53:07] PROBLEM - puppet last run on palladium is CRITICAL: CRITICAL: Puppet last ran 9 hours ago [07:53:16] that's me ^ [07:54:15] (03PS2) 10Jcrespo: Add public logic for grants to m5 db for striker application [puppet] - 10https://gerrit.wikimedia.org/r/305506 (https://phabricator.wikimedia.org/T142545) [07:54:23] (03PS1) 10Muehlenhoff: Amend CVE ID to changelog which was only assigned today, but already fixed in 4.4.7 [debs/linux44] - 10https://gerrit.wikimedia.org/r/305623 [07:54:47] PROBLEM - puppet last run on mw2094 is CRITICAL: CRITICAL: Puppet has 1 failures [07:55:07] RECOVERY - puppet last run on palladium is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [07:55:16] man the week is too short [07:55:43] (03CR) 10Jcrespo: [C: 032] Add public logic for grants to m5 db for striker application [puppet] - 10https://gerrit.wikimedia.org/r/305506 (https://phabricator.wikimedia.org/T142545) (owner: 10Jcrespo) [07:56:13] (03PS2) 10Alexandros Kosiaris: ores: Add memory report to uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/305611 (https://phabricator.wikimedia.org/T143081) (owner: 10Ladsgroup) [07:56:17] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] ores: Add memory report to uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/305611 (https://phabricator.wikimedia.org/T143081) (owner: 10Ladsgroup) [07:56:53] do I deploy ores or wait, akosiaris ? 
[07:56:58] jynus: merged the striker one as well [07:57:03] hehe [07:57:06] good [07:57:07] 06Operations, 06MediaWiki-Stakeholders-Group, 10Traffic, 07Developer-notice, and 2 others: Get rid of geoiplookup service - https://phabricator.wikimedia.org/T100902#2566841 (10faidon) We're not, cannot and will not be a general purpose GeoIP provider, for various and diverse reasons. You can use freegeoi... [07:57:30] (03CR) 10Muehlenhoff: [C: 032] Amend CVE ID to changelog which was only assigned today, but already fixed in 4.4.7 [debs/linux44] - 10https://gerrit.wikimedia.org/r/305623 (owner: 10Muehlenhoff) [08:05:01] (03CR) 10Faidon Liambotis: "If we do change the format slightly, perhaps dropping the :v4/:v6 suffic would make sense. It was only used for falling back to geoiplooku" [puppet] - 10https://gerrit.wikimedia.org/r/305419 (https://phabricator.wikimedia.org/T99226) (owner: 10BBlack) [08:22:17] RECOVERY - puppet last run on mw2094 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:23:35] (03PS1) 10Jcrespo: Set innodb_large_prefix/barracuda as default config for m5 [puppet] - 10https://gerrit.wikimedia.org/r/305624 (https://phabricator.wikimedia.org/T142545) [08:46:41] (03CR) 10Ema: [C: 032] varnishlog4: allow methods to be used as callbacks [puppet] - 10https://gerrit.wikimedia.org/r/305517 (https://phabricator.wikimedia.org/T131353) (owner: 10Ema) [08:46:47] (03PS4) 10Ema: varnishlog4: allow methods to be used as callbacks [puppet] - 10https://gerrit.wikimedia.org/r/305517 (https://phabricator.wikimedia.org/T131353) [08:46:50] (03CR) 10Ema: [V: 032] varnishlog4: allow methods to be used as callbacks [puppet] - 10https://gerrit.wikimedia.org/r/305517 (https://phabricator.wikimedia.org/T131353) (owner: 10Ema) [08:56:31] !log deploying schema change on s2 hosts T139090 [08:56:33] T139090: Deploy I2b042685 to all databases - https://phabricator.wikimedia.org/T139090 [08:56:36] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:58:58] RECOVERY - MariaDB Slave Lag: s2 on db1047 is OK: OK slave_sql_lag Replication lag: 0.65 seconds [09:02:44] (03CR) 10Jcrespo: [C: 031] "https://puppet-compiler.wmflabs.org/3776/db1009.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/305624 (https://phabricator.wikimedia.org/T142545) (owner: 10Jcrespo) [09:08:57] (03PS3) 10Ema: Port varnishprocessor to new VSL API [puppet] - 10https://gerrit.wikimedia.org/r/305525 (https://phabricator.wikimedia.org/T131353) [09:09:22] (03CR) 10Ema: [C: 032 V: 032] Port varnishprocessor to new VSL API [puppet] - 10https://gerrit.wikimedia.org/r/305525 (https://phabricator.wikimedia.org/T131353) (owner: 10Ema) [09:09:40] (03PS2) 10Ema: Port varnishmedia to new VSL API [puppet] - 10https://gerrit.wikimedia.org/r/305527 (https://phabricator.wikimedia.org/T131353) [09:09:47] (03CR) 10Ema: [C: 032 V: 032] Port varnishmedia to new VSL API [puppet] - 10https://gerrit.wikimedia.org/r/305527 (https://phabricator.wikimedia.org/T131353) (owner: 10Ema) [09:13:53] 06Operations, 10ops-eqiad, 10media-storage: diagnose failed disks on ms-be1027 - https://phabricator.wikimedia.org/T140374#2566982 (10fgiunchedi) I've reenabled the two ssd and reinstalled ms-be1027, afaict no errors reported now, @Cmjohnson no disks blinking either now? if so we can call this done! [09:21:27] (03CR) 10Filippo Giunchedi: [C: 031] Disable unprivileged user namespaces on trusty systems [puppet] - 10https://gerrit.wikimedia.org/r/304474 (https://phabricator.wikimedia.org/T142567) (owner: 10Muehlenhoff) [09:25:48] (03CR) 10Filippo Giunchedi: [C: 031] statsd proxy: Limit to production networks [puppet] - 10https://gerrit.wikimedia.org/r/303532 (owner: 10Muehlenhoff) [09:32:21] (03CR) 10Filippo Giunchedi: "what about api appservers, needed there too?" 
[puppet] - 10https://gerrit.wikimedia.org/r/231284 (https://phabricator.wikimedia.org/T84777) (owner: 10Dzahn) [09:34:37] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Include it in role::mediawiki::webserver instead." [puppet] - 10https://gerrit.wikimedia.org/r/231284 (https://phabricator.wikimedia.org/T84777) (owner: 10Dzahn) [09:35:30] 06Operations, 10ops-eqiad, 10media-storage: diagnose failed(?) sda on ms-be1022 - https://phabricator.wikimedia.org/T140597#2566993 (10fgiunchedi) next step proposed by @cmjohnson is to swap ssds and see if the error follows the disk, at this point it is unlikely it is the disk itself. More likely cable/cont... [09:39:26] (03PS2) 10Alexandros Kosiaris: check_puppetrun: Display the admin set message [puppet] - 10https://gerrit.wikimedia.org/r/305482 [09:39:28] (03PS2) 10Alexandros Kosiaris: check_puppetrun: Move the failure checks at the top [puppet] - 10https://gerrit.wikimedia.org/r/305483 [09:39:30] (03PS2) 10Alexandros Kosiaris: check_puppetrun: Remove statefile usage [puppet] - 10https://gerrit.wikimedia.org/r/305484 [09:39:32] (03PS2) 10Alexandros Kosiaris: check_puppetrun: Remove unused lastrun_failed var [puppet] - 10https://gerrit.wikimedia.org/r/305485 [09:39:34] (03PS2) 10Alexandros Kosiaris: check_puppetrun: Remove old failure handling code [puppet] - 10https://gerrit.wikimedia.org/r/305486 [09:39:36] (03PS3) 10Alexandros Kosiaris: check_puppetrun: Add reportfile handling [puppet] - 10https://gerrit.wikimedia.org/r/305487 [09:39:38] (03PS2) 10Alexandros Kosiaris: check_puppetrun: Improve full fail error message [puppet] - 10https://gerrit.wikimedia.org/r/305504 [09:39:40] (03PS2) 10Alexandros Kosiaris: check_puppetrun: Add failed resource warning/critical levels [puppet] - 10https://gerrit.wikimedia.org/r/305505 [09:39:42] (03PS1) 10Alexandros Kosiaris: check_puppetrun: Add more info in failed resources message [puppet] - 10https://gerrit.wikimedia.org/r/305629 [09:39:44] (03PS1) 10Alexandros Kosiaris: check_puppetrun: Give 
on extra timespan to recently enabled hosts [puppet] - 10https://gerrit.wikimedia.org/r/305630 (https://phabricator.wikimedia.org/T143099) [09:43:00] 06Operations: Handling of customised systemd units via puppet in base::service_unit - https://phabricator.wikimedia.org/T143210#2567001 (10MoritzMuehlenhoff) > That would actually be an option as well, at least for the hhvm use case: It has the same unit dependencies in the shipped HHVM unit and in our puppetis... [09:48:35] 06Operations, 06Community-Tech, 10wikidiff2, 13Patch-For-Review: Deploy new version of wikidiff2 package - https://phabricator.wikimedia.org/T140443#2567005 (10akosiaris) OK, scheduled for monday 11:00 UTC [09:49:38] (03CR) 10Giuseppe Lavagetto: [C: 031] check_puppetrun: Display the admin set message [puppet] - 10https://gerrit.wikimedia.org/r/305482 (owner: 10Alexandros Kosiaris) [09:51:10] (03PS3) 10Alexandros Kosiaris: check_puppetrun: Display the admin set message [puppet] - 10https://gerrit.wikimedia.org/r/305482 [09:51:14] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] check_puppetrun: Display the admin set message [puppet] - 10https://gerrit.wikimedia.org/r/305482 (owner: 10Alexandros Kosiaris) [09:51:21] _joe_: thanks [09:51:39] _joe_: have I told you how much I hate ruby 1.8 right now ? [09:52:06] <_joe_> today? 
maybe not [09:52:15] hehehe [09:52:19] (03CR) 10Giuseppe Lavagetto: [C: 031] check_puppetrun: Move the failure checks at the top (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/305483 (owner: 10Alexandros Kosiaris) [09:53:23] (03CR) 10Giuseppe Lavagetto: [C: 031] check_puppetrun: Remove statefile usage [puppet] - 10https://gerrit.wikimedia.org/r/305484 (owner: 10Alexandros Kosiaris) [09:54:15] (03CR) 10jenkins-bot: [V: 04-1] check_puppetrun: Add more info in failed resources message [puppet] - 10https://gerrit.wikimedia.org/r/305629 (owner: 10Alexandros Kosiaris) [09:54:42] (03CR) 10Giuseppe Lavagetto: [C: 031] check_puppetrun: Remove unused lastrun_failed var [puppet] - 10https://gerrit.wikimedia.org/r/305485 (owner: 10Alexandros Kosiaris) [09:55:31] (03CR) 10Alexandros Kosiaris: check_puppetrun: Move the failure checks at the top (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/305483 (owner: 10Alexandros Kosiaris) [09:57:47] (03CR) 10Giuseppe Lavagetto: [C: 031] check_puppetrun: Remove old failure handling code [puppet] - 10https://gerrit.wikimedia.org/r/305486 (owner: 10Alexandros Kosiaris) [09:58:22] (03CR) 10jenkins-bot: [V: 04-1] check_puppetrun: Give on extra timespan to recently enabled hosts [puppet] - 10https://gerrit.wikimedia.org/r/305630 (https://phabricator.wikimedia.org/T143099) (owner: 10Alexandros Kosiaris) [10:04:29] (03CR) 10Giuseppe Lavagetto: [C: 04-1] check_puppetrun: Add reportfile handling (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/305487 (owner: 10Alexandros Kosiaris) [10:07:30] (03CR) 10Giuseppe Lavagetto: [C: 031] check_puppetrun: Improve full fail error message (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/305504 (owner: 10Alexandros Kosiaris) [10:10:02] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "I am ok with such errors being critical and needing the eye test; you can't really say if 1 resource failing is the one you just added or " [puppet] - 10https://gerrit.wikimedia.org/r/305505 (owner: 
10Alexandros Kosiaris) [10:11:15] (03CR) 10Giuseppe Lavagetto: [C: 031] check_puppetrun: Add more info in failed resources message [puppet] - 10https://gerrit.wikimedia.org/r/305629 (owner: 10Alexandros Kosiaris) [10:11:35] (03PS1) 10Filippo Giunchedi: cassandra: add instance ssl monitoring [puppet] - 10https://gerrit.wikimedia.org/r/305633 (https://phabricator.wikimedia.org/T120662) [10:13:39] (03CR) 10Giuseppe Lavagetto: [C: 031] check_puppetrun: Give on extra timespan to recently enabled hosts [puppet] - 10https://gerrit.wikimedia.org/r/305630 (https://phabricator.wikimedia.org/T143099) (owner: 10Alexandros Kosiaris) [10:28:07] (03PS2) 10Filippo Giunchedi: cassandra: add instance ssl monitoring [puppet] - 10https://gerrit.wikimedia.org/r/305633 (https://phabricator.wikimedia.org/T120662) [10:33:21] (03CR) 10Muehlenhoff: "Thanks for the reviews, will deploy this next week" [puppet] - 10https://gerrit.wikimedia.org/r/304474 (https://phabricator.wikimedia.org/T142567) (owner: 10Muehlenhoff) [10:37:13] (03CR) 10Filippo Giunchedi: "results from PCC:" [puppet] - 10https://gerrit.wikimedia.org/r/305633 (https://phabricator.wikimedia.org/T120662) (owner: 10Filippo Giunchedi) [10:48:42] (03PS1) 10Muehlenhoff: Provide override file for base::service_unit (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/305635 [10:49:52] (03CR) 10jenkins-bot: [V: 04-1] Provide override file for base::service_unit (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/305635 (owner: 10Muehlenhoff) [10:51:15] (03PS2) 10Muehlenhoff: Provide override file for base::service_unit (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/305635 [11:02:41] (03CR) 10Alex Monk: "Have you tested this somehow?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/304689 (https://phabricator.wikimedia.org/T141208) (owner: 10TTO) [11:06:41] (03CR) 10TTO: "Yes, I ran the script with php -s (with the define at the top uncommented), and it worked as far as I could tell. 
Could do with a double-c" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/304689 (https://phabricator.wikimedia.org/T141208) (owner: 10TTO) [11:13:43] (03CR) 10Jcrespo: "Bryan, I have applied this already to production; you do not have to wait for me to deploy it, anyone can." [puppet] - 10https://gerrit.wikimedia.org/r/305624 (https://phabricator.wikimedia.org/T142545) (owner: 10Jcrespo) [11:19:14] (03CR) 10Alex Monk: "How exactly though? I don't think anyone is going to merge this in it's current state." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/304689 (https://phabricator.wikimedia.org/T141208) (owner: 10TTO) [11:20:10] Krenair: "merge this in it's current state." It's just a one line patch; I don't see how its "current state" could be any better [11:20:30] You don't explain how this fixes the problem [11:20:46] I thought I did on the Phab task? Let me revisit it [11:21:09] Hmm, yeah I'll expand the commit message a bit [11:21:35] You just said that the protocol needs to be removed [11:21:54] Which doesn't appear to match what the task is actually about [11:22:08] is there a way to reproduce this issue on beta.wmflabs.org? [11:23:31] (03PS3) 10TTO: Don't prepend protocol in missing.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/304689 (https://phabricator.wikimedia.org/T141208) [11:23:36] Krenair: Does the new commit message help? 
[11:25:56] Krenair: It can be reproduced on beta, but fixing beta's missing.php won't change anything, because beta seems to use production interwikis [11:26:06] "prepending the protocol causes some redirections to fail, as the double protocol causes the redirects to be misinterpreted" is where this all falls apart [11:26:14] But [11:26:28] alex@alex-laptop:~/Development/Wikimedia/Operations-MediaWiki-Config/wmf-config (deployment-fluorine)$ curl -I https://wuu.wikibooks.org/wiki/zh: 2>/dev/null | grep Location: [11:26:28] Location: https:https://zh.wikibooks.org/wiki/ [11:26:55] Clearly this commit fixes that [11:27:20] I just haven't figured out how incubator gets involved yet [11:27:22] Mmm, I have to admit I don't really know why :) [11:27:50] Or more precisely, why the problem was manifesting itself in the exact way it was [11:29:16] Okay here's what's going on [11:29:30] https://wuu.wikipedia.org/wiki/s:zh: -> location:https://wuu.wikisource.org/wiki/zh: [11:29:38] Chrome gets https://wuu.wikisource.org/wiki/zh: [11:29:51] https://wuu.wikisource.org/wiki/zh: -> location:https:https://zh.wikisource.org/wiki/ [11:30:02] Chrome gets https://wuu.wikisource.org/wiki/https://zh.wikisource.org/wiki/ (!) [11:30:16] https://wuu.wikisource.org/wiki/https://zh.wikisource.org/wiki/ -> location:https://wikisource.org/wiki/https://zh.wikisource.org/wiki/ [11:30:34] Yes, the way browsers handle the malformed Location: header is odd, maybe some historical quirk [11:30:36] Chrome gets https://wikisource.org/wiki/https://zh.wikisource.org/wiki/ [11:30:48] https://wikisource.org/wiki/https://zh.wikisource.org/wiki/ -> location:https://wikisource.org/wiki/Https://zh.wikisource.org/wiki/ [11:31:29] Chrome gets https://wikisource.org/wiki/Https://zh.wikisource.org/wiki/ [11:31:49] HTTP 404, redirects stop here [11:33:18] Krenair: Convinced now? 
:) [11:33:33] You've certainly convinced me [11:37:55] tto, the way chrome is handling this seems correct [11:38:00] It's not a valid absolute URL [11:38:36] Oh no wait, but then it strips the extra https: [11:38:37] hm [11:38:38] weird [11:38:43] well whatever, clearly this will do the job [11:40:14] (03CR) 10Alex Monk: [C: 031] "Per my comment on the task, this should fix it, even though it's not immediately obvious how until you look into how browsers interpret a " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/304689 (https://phabricator.wikimedia.org/T141208) (owner: 10TTO) [11:41:01] Thanks. [11:41:09] tto, it's a friday so I can't deploy this now, and I'm on holiday next week [11:41:14] I suggest putting it up for monday morning swat? [11:41:27] Sure, no rush. I can't really do swats, as they're at awful hours of the day [11:41:31] ah [11:42:31] I usually just leave the patches in the queue and hope someone decides to merge them eventually :) I don't really have any other choice [11:42:50] that's not a very effective way of getting things merged in mediawiki-config.git [11:44:13] well, you could work with greg to get it scheduled somehow, or find someone else to attend swat [11:44:54] or send me an email to deal with it on monday 29th [11:45:13] It would be great if you could do it then [11:45:58] I'll try to remember to send you an email closer to the time [11:48:02] thanks [11:49:04] Hi. We've a special page with localisation broken. Nobody really cares about this special page, it's Special:VipsScaler. How would you prioritize that, and is that worth to backport right now (well once merged to master) the fix to wmf.15? 
[11:49:35] Sorry, Special:VipsTest [11:50:55] 06Operations, 10Parsoid, 06Services, 10service-runner, 10service-template-node: Create a standard service template / init / logging / package setup - https://phabricator.wikimedia.org/T88585#1015491 (10Jcook) Hello, I wondered if you ever considered using something like yeoman for creating a generic se... [11:56:30] 06Operations, 07Puppet, 10ORES, 06Revision-Scoring-As-A-Service: Clean up puppet & configs for ORES - https://phabricator.wikimedia.org/T142002#2567190 (10akosiaris) >>! In T142002#2565031, @Halfak wrote: > Hi @akosiaris. It seems that you've commented on the naming scheme. Er no. I was commenting on t... [12:05:05] (03PS3) 10Alexandros Kosiaris: check_puppetrun: Move the failure checks at the top [puppet] - 10https://gerrit.wikimedia.org/r/305483 [12:05:11] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] check_puppetrun: Move the failure checks at the top [puppet] - 10https://gerrit.wikimedia.org/r/305483 (owner: 10Alexandros Kosiaris) [12:05:51] (03PS3) 10Alexandros Kosiaris: check_puppetrun: Remove statefile usage [puppet] - 10https://gerrit.wikimedia.org/r/305484 [12:05:55] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] check_puppetrun: Remove statefile usage [puppet] - 10https://gerrit.wikimedia.org/r/305484 (owner: 10Alexandros Kosiaris) [12:06:09] (03PS3) 10Alexandros Kosiaris: check_puppetrun: Remove unused lastrun_failed var [puppet] - 10https://gerrit.wikimedia.org/r/305485 [12:06:14] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] check_puppetrun: Remove unused lastrun_failed var [puppet] - 10https://gerrit.wikimedia.org/r/305485 (owner: 10Alexandros Kosiaris) [12:06:58] (03PS3) 10Alexandros Kosiaris: check_puppetrun: Remove old failure handling code [puppet] - 10https://gerrit.wikimedia.org/r/305486 [12:07:02] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] check_puppetrun: Remove old failure handling code [puppet] - 10https://gerrit.wikimedia.org/r/305486 (owner: 10Alexandros Kosiaris) [12:10:10] 
(03CR) 10Alexandros Kosiaris: check_puppetrun: Add reportfile handling (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/305487 (owner: 10Alexandros Kosiaris) [12:16:10] 06Operations, 10Graphite, 06Labs: lots of graphite metrics under "instances" created - https://phabricator.wikimedia.org/T143405#2567227 (10fgiunchedi) [12:20:58] (03CR) 10DCausse: [C: 031] logging: Require acknowledgment of kafka logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292514 (https://phabricator.wikimedia.org/T135159) (owner: 10EBernhardson) [12:28:48] (03CR) 10Alexandros Kosiaris: "My reasons for adding this were multiple. One is to avoid spamming for obviously transient issues like a random apt fetch failing." [puppet] - 10https://gerrit.wikimedia.org/r/305505 (owner: 10Alexandros Kosiaris) [12:31:07] PROBLEM - puppet last run on mw2095 is CRITICAL: CRITICAL: puppet fail [12:35:07] RECOVERY - puppet last run on mw2095 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [13:11:21] 06Operations, 10MediaWiki-extensions-UniversalLanguageSelector, 10Traffic, 13Patch-For-Review: ULS GeoIP should only use the Cookie - https://phabricator.wikimedia.org/T143270#2567345 (10BBlack) Ok, I guess I missed the cookie-parsing code in ULS.... or is it relying on the fact that CN has already parsed... [13:27:41] 06Operations, 10Graphite, 06Labs: lots of graphite metrics under "instances" created - https://phabricator.wikimedia.org/T143405#2567365 (10chasemp) Ah, yeah sorry about this. @yuvipanda enabled this a few days ago as we have tracked down the primarily symptom of {T141673} to io going stale (freezing) which... 
[13:34:13] (03PS6) 10BBlack: Remove geoiplookup service IPs from LVS [puppet] - 10https://gerrit.wikimedia.org/r/305420 (https://phabricator.wikimedia.org/T100902) [13:34:15] (03PS6) 10BBlack: GeoIP VCL: remove JSON output support [puppet] - 10https://gerrit.wikimedia.org/r/305421 (https://phabricator.wikimedia.org/T100902) [13:34:17] (03PS6) 10BBlack: GeoIP VCL: re-set old IPv6 no-data cookies [puppet] - 10https://gerrit.wikimedia.org/r/305419 (https://phabricator.wikimedia.org/T99226) [13:34:19] (03PS13) 10BBlack: varnish: switch from libGeoIP to libmaxminddb [puppet] - 10https://gerrit.wikimedia.org/r/253619 (https://phabricator.wikimedia.org/T99226) (owner: 10Faidon Liambotis) [13:34:21] (03PS1) 10BBlack: Prep varnishd for libmaxminddb-based VCL [puppet] - 10https://gerrit.wikimedia.org/r/305647 (https://phabricator.wikimedia.org/T99226) [13:34:23] (03PS1) 10BBlack: varnish: remove libgeoip from text VCL compilation [puppet] - 10https://gerrit.wikimedia.org/r/305648 (https://phabricator.wikimedia.org/T99226) [13:35:15] (03CR) 10BBlack: "Fixed the whitespace issue. Also, normalized the v[46] fields to always be v4 to make things easier on the next patch." [puppet] - 10https://gerrit.wikimedia.org/r/253619 (https://phabricator.wikimedia.org/T99226) (owner: 10Faidon Liambotis) [13:35:27] 06Operations, 10Graphite, 06Labs: lots of graphite metrics under "instances" created - https://phabricator.wikimedia.org/T143405#2567399 (10fgiunchedi) indeed it might be hard to track via the uuids, alternatively we could purge instance directories not updated for some period of time, e.g. 4/5 weeks [13:37:53] 06Operations, 10Cassandra, 06Services: Renew RESTBase self-signed root certificate authority - https://phabricator.wikimedia.org/T143044#2554903 (10fgiunchedi) the procedure to rollover / extend expiration is outlined at https://wikitech.wikimedia.org/wiki/Cassandra#Installing_and_generating_certificates we... 
[13:39:21] 06Operations, 10MediaWiki-extensions-UniversalLanguageSelector, 10Traffic, 13Patch-For-Review: ULS GeoIP should not use meta.wm.o/geoiplookup - https://phabricator.wikimedia.org/T143270#2567423 (10BBlack) [13:41:42] (03CR) 10BBlack: [C: 032] Prep varnishd for libmaxminddb-based VCL [puppet] - 10https://gerrit.wikimedia.org/r/305647 (https://phabricator.wikimedia.org/T99226) (owner: 10BBlack) [13:42:28] 06Operations, 07Puppet, 10ORES, 06Revision-Scoring-As-A-Service: Clean up puppet & configs for ORES - https://phabricator.wikimedia.org/T142002#2567443 (10Halfak) @akosiaris, I see. The current state is on the left side of the arrows --> and the proposed state is on the right side of the arrows. I made... [13:45:49] !log cache_misc: varnish-frontend global rolling restart (~3 mins to completion) [13:45:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:50:54] 06Operations, 10MediaWiki-extensions-UniversalLanguageSelector, 10Traffic, 13Patch-For-Review: ULS GeoIP should not use meta.wm.o/geoiplookup - https://phabricator.wikimedia.org/T143270#2567456 (10Nikerabbit) >>! In T143270#2567345, @BBlack wrote: > Ok, I guess I missed the cookie-parsing code in ULS.... o... [13:51:23] (03PS1) 10Filippo Giunchedi: graphite: fix hostname lookup for systemd carbon-c-relay units [puppet] - 10https://gerrit.wikimedia.org/r/305650 [13:53:47] (03CR) 10Filippo Giunchedi: [C: 032] "PCC: https://puppet-compiler.wmflabs.org/3780" [puppet] - 10https://gerrit.wikimedia.org/r/305650 (owner: 10Filippo Giunchedi) [13:53:50] _joe_: *waves*. Have you had any time to look at the puppet compiler? :-) [13:53:55] (03PS2) 10Filippo Giunchedi: graphite: fix hostname lookup for systemd carbon-c-relay units [puppet] - 10https://gerrit.wikimedia.org/r/305650 [13:56:34] <_joe_> valhallasw`cloud: nope, I'll try before next week [13:56:43] ok! 
[13:56:46] thanks :-) [14:02:26] bblack: I'm looking again at T101141 what was the dashboard that showed skewed metrics ? [14:02:27] T101141: udp rcvbuferrors and inerrors on graphite1001 - https://phabricator.wikimedia.org/T101141 [14:06:19] 06Operations, 10Graphite, 06Labs: lots of graphite metrics under "instances" created - https://phabricator.wikimedia.org/T143405#2567498 (10chasemp) I would like to do better than having to look up UUID's every time honestly but it looks like KVM does not support the domhostname argument for virsh > error:... [14:06:23] 06Operations, 10MediaWiki-extensions-UniversalLanguageSelector, 10Traffic, 13Patch-For-Review: ULS GeoIP should not use meta.wm.o/geoiplookup - https://phabricator.wikimedia.org/T143270#2567499 (10BBlack) So, the obvious options are: http://freegeoip.net/json/8.8.8.8 http://geoip.nekudo.com/api/8.8.8.8 B... [14:07:18] godog: TLS ciphers was showing it on very short timescales. But I think it had gotten better recently. [14:08:20] godog: https://grafana.wikimedia.org/dashboard/db/tls-ciphers -> switch to 7d time range, and 1m in the averaging window dropdown. there's a couple of events in the past few days. [14:08:36] best seen in the "All Ciphers - Log Scale" at the bottom [14:14:21] 06Operations, 07Puppet, 10ORES, 06Revision-Scoring-As-A-Service: Clean up puppet & configs for ORES - https://phabricator.wikimedia.org/T142002#2567511 (10akosiaris) >>! In T142002#2567443, @Halfak wrote: > @akosiaris, I see. The current state is on the left side of the arrows --> and the proposed state i... [14:16:23] bblack: thanks, I see the peaks/drops on the 18th for example in the bottom graph, is that what you're seeing too? 
[14:18:46] (03PS4) 10Giuseppe Lavagetto: scap: add conftool class [puppet] - 10https://gerrit.wikimedia.org/r/305278 [14:19:00] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] scap: add conftool class [puppet] - 10https://gerrit.wikimedia.org/r/305278 (owner: 10Giuseppe Lavagetto) [14:24:04] (03CR) 10Giuseppe Lavagetto: service::node: add scap::conftool when relevant (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/305290 (owner: 10Giuseppe Lavagetto) [14:24:17] (03PS2) 10Giuseppe Lavagetto: service::node: add scap::conftool when relevant [puppet] - 10https://gerrit.wikimedia.org/r/305290 [14:26:02] 06Operations, 10Graphite, 06Labs: lots of graphite metrics under "instances" created - https://phabricator.wikimedia.org/T143405#2567532 (10fgiunchedi) yeah I agree looking up by uuid isn't great, I'm fine with 4w staleness. It looks like about ~6GB per day on average so 30d is 200G which is fine. I won't be... [14:28:17] (03PS4) 10Alexandros Kosiaris: check_puppetrun: Add reportfile handling [puppet] - 10https://gerrit.wikimedia.org/r/305487 [14:28:18] godog: yeah, the very very thin spikes, a pair of them [14:28:19] (03PS3) 10Alexandros Kosiaris: check_puppetrun: Improve full fail error message [puppet] - 10https://gerrit.wikimedia.org/r/305504 [14:28:21] (03PS3) 10Alexandros Kosiaris: check_puppetrun: Add failed resource warning/critical levels [puppet] - 10https://gerrit.wikimedia.org/r/305505 [14:28:23] (03PS2) 10Alexandros Kosiaris: check_puppetrun: Add more info in failed resources message [puppet] - 10https://gerrit.wikimedia.org/r/305629 [14:28:25] (03PS2) 10Alexandros Kosiaris: check_puppetrun: extra time to recently enabled hosts [puppet] - 10https://gerrit.wikimedia.org/r/305630 (https://phabricator.wikimedia.org/T143099) [14:31:31] (03CR) 10jenkins-bot: [V: 04-1] check_puppetrun: Add more info in failed resources message [puppet] - 10https://gerrit.wikimedia.org/r/305629 (owner: 10Alexandros Kosiaris) [14:31:33] (03CR) 10Giuseppe Lavagetto: [C: 032] 
service::node: add scap::conftool when relevant [puppet] - 10https://gerrit.wikimedia.org/r/305290 (owner: 10Giuseppe Lavagetto) [14:33:27] (03CR) 10jenkins-bot: [V: 04-1] check_puppetrun: extra time to recently enabled hosts [puppet] - 10https://gerrit.wikimedia.org/r/305630 (https://phabricator.wikimedia.org/T143099) (owner: 10Alexandros Kosiaris) [14:33:53] bblack: ack, thanks I'm taking a look [14:39:04] (03PS1) 10Alex Monk: udplog: stop requiring udp-filter [puppet] - 10https://gerrit.wikimedia.org/r/305652 [14:39:54] (03CR) 10Alex Monk: " yeah so it doesn't use udp-filter" [puppet] - 10https://gerrit.wikimedia.org/r/305652 (owner: 10Alex Monk) [14:40:08] (03PS3) 10Alexandros Kosiaris: check_puppetrun: Add more info in failed resources message [puppet] - 10https://gerrit.wikimedia.org/r/305629 [14:40:10] (03PS3) 10Alexandros Kosiaris: check_puppetrun: extra time to recently enabled hosts [puppet] - 10https://gerrit.wikimedia.org/r/305630 (https://phabricator.wikimedia.org/T143099) [14:49:55] 06Operations, 10MediaWiki-extensions-UniversalLanguageSelector, 10Traffic, 13Patch-For-Review: ULS GeoIP should not use meta.wm.o/geoiplookup - https://phabricator.wikimedia.org/T143270#2567611 (10Nikerabbit) Simplest thing would be to switch the default freegeoip.net which is already supported and give so... [14:52:47] 06Operations, 10Graphite, 05MW-1.27-release-notes, 13Patch-For-Review: udp rcvbuferrors and inerrors on graphite1001 - https://phabricator.wikimedia.org/T101141#2567647 (10fgiunchedi) @BBlack we've stopped using statsdlb in favour of statsd-proxy, but yeah I didn't see statsdlb restarting quickly even when... [14:54:48] 06Operations, 10Graphite, 05MW-1.27-release-notes, 13Patch-For-Review: udp rcvbuferrors and inerrors on graphite1001 - https://phabricator.wikimedia.org/T101141#2567656 (10fgiunchedi) I've also noticed the udp inerrors come back and on a steady level after the 16th: {F4376920} and according to SAL a varn... 
[14:56:07] 06Operations, 06MediaWiki-Stakeholders-Group, 10Traffic, 07Developer-notice, and 2 others: Get rid of geoiplookup service - https://phabricator.wikimedia.org/T100902#2567659 (10Nikerabbit) >>! In T100902#2566841, @faidon wrote: > You can use freegeoip, or another free or paid-for service — or set up your o... [14:57:51] 06Operations, 10MediaWiki-extensions-UniversalLanguageSelector, 10Traffic, 13Patch-For-Review: ULS GeoIP should not use meta.wm.o/geoiplookup - https://phabricator.wikimedia.org/T143270#2567661 (10BBlack) ULS already supports the freegeoip format? [14:58:28] 06Operations, 06MediaWiki-Stakeholders-Group, 10Traffic, 07Developer-notice, and 2 others: Get rid of geoiplookup service - https://phabricator.wikimedia.org/T100902#2567664 (10BBlack) >>! In T100902#2567659, @Nikerabbit wrote: >Non-WMF MediaWiki installs? That [14:59:11] 06Operations, 06MediaWiki-Stakeholders-Group, 10Traffic, 07Developer-notice, and 2 others: Get rid of geoiplookup service - https://phabricator.wikimedia.org/T100902#2567671 (10faidon) Sorry, my bad. I meant that non-WMF MediaWiki installs could use any of these. Someone (the ULS maintainers/contributors)... [14:59:25] (03PS1) 10Addshore: Enable RevisionSlider BetaFeature on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/305653 (https://phabricator.wikimedia.org/T143421) [14:59:52] (03CR) 10Addshore: [C: 04-1] "Too be scheduled" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/305653 (https://phabricator.wikimedia.org/T143421) (owner: 10Addshore) [15:08:10] 06Operations, 06MediaWiki-Stakeholders-Group, 10Traffic, 07Developer-notice, and 2 others: Get rid of geoiplookup service - https://phabricator.wikimedia.org/T100902#2567686 (10Nikerabbit) Location based language suggestions is a feature in ULS that is maintained by the Language team. It would have been ni... 
[15:21:46] PROBLEM - mailman_queue_size on fermium is CRITICAL: CRITICAL: 1 mailman queue(s) above limits (thresholds: bounces: 25 in: 25 virgin: 25) [15:25:36] 06Operations, 06MediaWiki-Stakeholders-Group, 10Traffic, 07Developer-notice, and 2 others: Get rid of geoiplookup service - https://phabricator.wikimedia.org/T100902#2567717 (10BBlack) We'll be talking to the relevant teams next week about making new releases ahead of completely disabling any service, that... [15:25:37] RECOVERY - mailman_queue_size on fermium is OK: OK: mailman queues are below the limits. [15:26:01] 06Operations, 10Cassandra, 06Services: Renew RESTBase self-signed root certificate authority - https://phabricator.wikimedia.org/T143044#2567721 (10Eevans) The proposed solution to monitoring certificate expiration (https://gerrit.wikimedia.org/r/#/c/305633), acts remotely using the encrypted inter-node mess... [15:32:04] (03CR) 10Eevans: "This LGTM, but it only monitors the server cert. For the root CA, we'll either need to set the expiration high enough that we never need " [puppet] - 10https://gerrit.wikimedia.org/r/305633 (https://phabricator.wikimedia.org/T120662) (owner: 10Filippo Giunchedi) [15:33:21] (03CR) 10BryanDavis: [C: 031] Set innodb_large_prefix/barracuda as default config for m5 [puppet] - 10https://gerrit.wikimedia.org/r/305624 (https://phabricator.wikimedia.org/T142545) (owner: 10Jcrespo) [15:35:20] (03PS1) 10BBlack: give pinkunicorn IPv6 [dns] - 10https://gerrit.wikimedia.org/r/305656 [15:38:46] (03PS1) 10Rush: tools: mount scratch on labstore1003 as well [puppet] - 10https://gerrit.wikimedia.org/r/305657 (https://phabricator.wikimedia.org/T134896) [15:38:48] (03CR) 10BBlack: [C: 032] give pinkunicorn IPv6 [dns] - 10https://gerrit.wikimedia.org/r/305656 (owner: 10BBlack) [15:39:46] PROBLEM - mailman_queue_size on fermium is CRITICAL: CRITICAL: 1 mailman queue(s) above limits (thresholds: bounces: 25 in: 25 virgin: 25) [15:44:11] (03PS7) 10BBlack: Remove geoiplookup service 
IPs from LVS [puppet] - 10https://gerrit.wikimedia.org/r/305420 (https://phabricator.wikimedia.org/T100902) [15:44:13] (03PS7) 10BBlack: GeoIP VCL: remove JSON output support [puppet] - 10https://gerrit.wikimedia.org/r/305421 (https://phabricator.wikimedia.org/T100902) [15:44:15] (03PS2) 10BBlack: varnish: remove libgeoip from text VCL compilation [puppet] - 10https://gerrit.wikimedia.org/r/305648 (https://phabricator.wikimedia.org/T99226) [15:44:17] (03PS7) 10BBlack: GeoIP VCL: re-set old IPv6 no-data cookies [puppet] - 10https://gerrit.wikimedia.org/r/305419 (https://phabricator.wikimedia.org/T99226) [15:44:19] (03PS14) 10BBlack: varnish: switch from libGeoIP to libmaxminddb [puppet] - 10https://gerrit.wikimedia.org/r/253619 (https://phabricator.wikimedia.org/T99226) (owner: 10Faidon Liambotis) [15:45:32] (03PS2) 10Yuvipanda: labs: Set timeout for ldap3 using scripts [puppet] - 10https://gerrit.wikimedia.org/r/305616 (https://phabricator.wikimedia.org/T142394) [15:46:07] (03CR) 10BryanDavis: [C: 031] labs: Set timeout for ldap3 using scripts [puppet] - 10https://gerrit.wikimedia.org/r/305616 (https://phabricator.wikimedia.org/T142394) (owner: 10Yuvipanda) [15:47:38] RECOVERY - mailman_queue_size on fermium is OK: OK: mailman queues are below the limits. 
[15:48:06] PROBLEM - Host labstore1005 is DOWN: PING CRITICAL - Packet loss = 100% [15:48:27] RECOVERY - Host labstore1005 is UP: PING OK - Packet loss = 0%, RTA = 0.67 ms [15:58:16] (03PS2) 10Jcrespo: Set innodb_large_prefix/barracuda as default config for m5 [puppet] - 10https://gerrit.wikimedia.org/r/305624 (https://phabricator.wikimedia.org/T142545) [15:59:53] (03CR) 10Jcrespo: [C: 032] Set innodb_large_prefix/barracuda as default config for m5 [puppet] - 10https://gerrit.wikimedia.org/r/305624 (https://phabricator.wikimedia.org/T142545) (owner: 10Jcrespo) [16:03:20] (03PS1) 10Rush: labs: nfs manager add clean (all) and brief help [puppet] - 10https://gerrit.wikimedia.org/r/305659 [16:03:21] 06Operations: Upgrade fluorine to trusty/jessie - https://phabricator.wikimedia.org/T123728#1936565 (10AlexMonk-WMF) fluorine doesn't need to run the same OS as mediawiki app servers, it can run jessie. I just moved deployment-fluorine (precise) stuff to deployment-fluorine02 with some packaging help from ottoma... [16:06:23] (03CR) 10Alex Monk: "cherry-picked on deployment-puppetmaster" [puppet] - 10https://gerrit.wikimedia.org/r/305652 (owner: 10Alex Monk) [16:11:19] !log cache_maps: rolling frontend cache restarts, ~5 minute window [16:11:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:13:13] 06Operations, 13Patch-For-Review: Create backup/restore scripts for etcd - https://phabricator.wikimedia.org/T135129#2567917 (10Joe) I have created an "etcd recovery script generator", that can be run and it proved to work with the labs cluster. It can be found at P3855 It can be used to generate disaster rec... [16:13:33] 06Operations, 13Patch-For-Review: Create backup/restore scripts for etcd - https://phabricator.wikimedia.org/T135129#2567918 (10Joe) As soon as I've updated the etcd docs, I'll close this ticket. 
[16:19:09] !log T143226: Perform major compaction on local_group_wikipedia_T_parsoid_html.data, restbase2007-a.codfw.wmnet [16:19:10] T143226: Cluster-wide major compactions: parsoid-html - https://phabricator.wikimedia.org/T143226 [16:19:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:20:18] !log T143226: Perform major compaction on local_group_wikipedia_T_parsoid_html.data, restbase2004-a.codfw.wmnet [16:20:19] T143226: Cluster-wide major compactions: parsoid-html - https://phabricator.wikimedia.org/T143226 [16:20:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:20:56] 06Operations, 10Traffic, 13Patch-For-Review: Convert upload cluster to Varnish 4 - https://phabricator.wikimedia.org/T131502#2567932 (10BBlack) [16:20:58] 06Operations, 10Traffic, 13Patch-For-Review: Solve large-object/stream/pass/chunked in upload cluster better - https://phabricator.wikimedia.org/T131761#2567929 (10BBlack) 05Open>03Resolved a:03BBlack So, we've reviewed the VCL and the Swift output, and the bottom line is this is a non-issue for cache_... [16:23:47] (03PS1) 10Alex Monk: logging: remove reference to deployment-fluoride [puppet] - 10https://gerrit.wikimedia.org/r/305660 [16:25:13] 06Operations, 10Traffic, 13Patch-For-Review: Convert upload cluster to Varnish 4 - https://phabricator.wikimedia.org/T131502#2567938 (10BBlack) [16:25:15] 06Operations, 10Traffic, 13Patch-For-Review: Varnish 4 stalls with two consecutive Range requests using HTTP persistent connections - https://phabricator.wikimedia.org/T142233#2567936 (10BBlack) 05Open>03Resolved The commit merged above gives us two behaviors (on all cache layers) for Range requests that... [16:27:45] 06Operations, 06MediaWiki-Stakeholders-Group, 10Traffic, 07Developer-notice, and 2 others: Get rid of geoiplookup service - https://phabricator.wikimedia.org/T100902#2567939 (10faidon) >>! 
In T100902#2567686, @Nikerabbit wrote: > Location based language suggestions is a feature in ULS that is maintained by... [16:28:20] 06Operations, 10fundraising-tech-ops, 07Security-General: use granularity (g=) restrictions for wikimedia.org fundraising DKIM records - https://phabricator.wikimedia.org/T142205#2567940 (10CCogdill_WMF) I've been reviewing this request with IBM, and haven't gotten good news as to their ability to integrate... [16:32:36] 06Operations, 10Traffic, 13Patch-For-Review: Convert upload cluster to Varnish 4 - https://phabricator.wikimedia.org/T131502#2567949 (10BBlack) The remaining blockers aren't full blockers and aren't linked into here, but basically they're these: 1. T131353 - Just the cache_upload -related scripts here (not... [16:32:56] PROBLEM - Host mr1-eqiad.oob is DOWN: PING CRITICAL - Packet loss = 100% [16:33:40] (03PS1) 10Yuvipanda: Fix generic webservices on gridengine [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/305661 (https://phabricator.wikimedia.org/T143403) [16:33:49] 06Operations, 10Traffic: Stop using persistent storage in our backend varnish layers. - https://phabricator.wikimedia.org/T142848#2548194 (10BBlack) Note we'd like to make a call on this before converting cache_upload to v4 in T131502 (and presumably switching to `file` as part of the process). If anyone has... [16:34:58] 06Operations, 10ArticlePlaceholder, 10Traffic, 10Wikidata: Performance and caching considerations for article placeholders accesses - https://phabricator.wikimedia.org/T142944#2567964 (10hoo) [16:39:07] RECOVERY - Host mr1-eqiad.oob is UP: PING OK - Packet loss = 0%, RTA = 1.73 ms [16:49:18] PROBLEM - puppet last run on mw2109 is CRITICAL: CRITICAL: Puppet has 1 failures [16:53:56] PROBLEM - Host mr1-eqiad.oob is DOWN: PING CRITICAL - Packet loss = 100% [17:00:17] RECOVERY - Host mr1-eqiad.oob is UP: PING OK - Packet loss = 0%, RTA = 1.02 ms [17:04:23] Hello. 
Backport of the fix for wmf/1.28.0-wmf.15 https://gerrit.wikimedia.org/r/#/c/305662/ [17:04:30] (03PS2) 10Ottomata: udplog: stop requiring udp-filter [puppet] - 10https://gerrit.wikimedia.org/r/305652 (owner: 10Alex Monk) [17:04:47] (03CR) 10Ottomata: [C: 032 V: 032] udplog: stop requiring udp-filter [puppet] - 10https://gerrit.wikimedia.org/r/305652 (owner: 10Alex Monk) [17:04:55] That fixes the l10n keys missing for https://en.wikipedia.org/wiki/Special:VipsTest. [17:05:19] Dereckson: are you asking for it to be done now? [17:05:55] That's a good question. We've a special page with localisation broken in every wikis of the cluster. But to be honest, I'm not sure anyone goes to this page. [17:07:06] * greg-g failed to open second link [17:07:28] Remove the final dot. [17:07:51] :) [17:07:54] I just didn't click [17:08:05] (gnome-terminal interpreted correctly) [17:08:32] I've noticed the issue because of a key more visible on the special pages index: ⧼vipstest⧽ at https://en.wikipedia.org/wiki/Special:SpecialPages#Media_reports_and_uploads [17:08:50] minor enough, I'll allow it :) [17:09:09] Dereckson: you wanna? [17:09:57] I can deploy it, yes. [17:10:25] thanks [17:10:35] scap incoming on a Friday, I approved [17:10:35] (03PS1) 10Rush: labstore: drbd framework for software based HA [puppet] - 10https://gerrit.wikimedia.org/r/305663 (https://phabricator.wikimedia.org/T126083) [17:15:08] RECOVERY - puppet last run on mw2109 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [17:20:35] * Dereckson deploys it. [17:24:29] Extension still loads on mw1099. No fatal error. Let's scap (as it needs to rebuild l10n cache). [17:25:08] +1 [17:26:03] !log dereckson@tin Started scap: (no message) [17:26:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:26:19] Hi! Is there someone to help me in setting up my connection to https://people.wikimedia.org ? 
[17:26:38] !log Current scap is for [[Gerrit:305662]] (T143402) [17:26:39] T143402: Missing translation keys for vips - https://phabricator.wikimedia.org/T143402 [17:26:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:27:42] Hi Volker_E. Did you follow https://wikitech.wikimedia.org/wiki/Production_shell_access#SSH_configuration ? [17:28:18] If so you can `rutherfordium.eqiad.wmnet` [17:28:24] ssh rutherfordium.eqiad.wmnet [17:28:58] 1https://wikitech.wikimedia.org/wiki/People.wikimedia.org [17:29:02] -1 [17:29:03] https://wikitech.wikimedia.org/wiki/People.wikimedia.org [17:29:31] Ok, I might ask some very amateurish question, sorry upfront [17:30:24] if I use bastion.wmflabs.org it's rerouting automatically to the nearest cluster, by putting a special one there, it's a tad faster, correct? [17:31:33] wmflabs is not production :) [17:32:21] Volker_E: see the map here https://wikitech.wikimedia.org/wiki/Bastion [17:32:23] (I just reordered the Production shell access page, just fyi if you reload and it looks different, just put "requesting access" above "how to configure" because it makes more sense that way, to me) [17:33:26] 06Operations, 10Ops-Access-Requests, 10Analytics: Add analytics team members to group aqs-admins to be able to deploy pageview APi - https://phabricator.wikimedia.org/T142101#2568165 (10RobH) I won't be in the operations meeting next Monday for this to be reviewed, but I've emailed the operations team for it... [17:33:41] greg-g: therefore wmflabs is also not in the standard config example, correct? 
[17:33:56] Volker_E: for wmflabs there is just the "eqiad" location currently [17:34:03] Volker_E: well, therefore you can't get to the people.wm.o server via wmflabs [17:34:36] https://wikitech.wikimedia.org/wiki/Production_shell_access#Standard_config is what you want [17:35:20] greg-g: there's also in config by other people *.eqiad.wmnet and *.eqiad.wmflabs – what is what and could I join them to one Host config ruleset? [17:35:48] don't join wmnet and wmflabs [17:36:55] Volker_E: wmnet = production wmflabs = labs , and treat them as separate things with separate keys [17:37:21] thanks! [17:37:47] https://phabricator.wikimedia.org/P433 is a good one, if I do say so myself ;) [17:38:13] notice I have two ssh keys and two known_hosts files [17:38:20] Volker_E: but to answer the original question, yea, if you have production access there are 4 different bastions in different places, and yea, using a closer one makes it a tad faster [17:39:04] greg-g: mutante: ok, great [17:39:09] oh right, that, forgot to answer, sorry! [17:39:29] Is the IdentitiesOnly rule always a good thing to include? [17:39:47] when defining an IdentityFile? [17:40:31] yeah. that will keep your ssh client from offering up other keys on failure [17:40:35] I do, because sometimes ssh-agent could know of more [17:40:38] Specifies that ssh(1) should only use the authentication identity files configured in the ssh_config files, even if ssh-agent(1) offers more identities. 
[17:40:44] http://linux.die.net/man/5/ssh_config [17:40:59] (sorry to link to the man page, I dont' mean to say RTFM, just citing source) [17:41:26] greg-g: it's just once in your P433 [17:41:26] P433 greg's ssh config - https://phabricator.wikimedia.org/P433 [17:41:57] Volker_E: it applies to all of .wmnet [17:42:16] I should also do it for wmflabs as well, noted [17:42:23] greg-g: but I thought you'd put [17:42:27] greg-g: yeah, exactly [17:42:27] PROBLEM - Apache HTTP on mw1278 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:42:29] :) [17:42:30] greg-g: :) [17:42:46] Reported https://phabricator.wikimedia.org/T143438 (unrelated with currenty deployment): Undefined index: wsCollection in /srv/mediawiki/php-1.28.0-wmf.15/extensions/Collection/Collection.php on line 348 [17:44:20] You can also do "Host *.wikimedia.org *.wmnet !gerrit.wikimedia.org !bast1001.wikimedia.org .. ProxyCommand .." to exclude just some hosts like gerrit and the bastions itself from the wildcard [17:45:06] mutante: oh, but that's different from https://wikitech.wikimedia.org/wiki/Production_shell_access#SSH_configuration example [17:45:16] mutante: excluding bast... [17:45:33] (03CR) 10Madhuvishy: [C: 031 V: 031] tools: mount scratch on labstore1003 as well [puppet] - 10https://gerrit.wikimedia.org/r/305657 (https://phabricator.wikimedia.org/T134896) (owner: 10Rush) [17:47:24] Volker_E: true. well, that is to avoid a loop. usually you would connect to hosts behind the bastions but if you want to do something on the bastion itself it would ensure that it's not trying to proxy via itself [17:47:35] Volker_E: what the configuration achieve is (1) to connect to a bastion (2) from this bastion, to connect to the target server. 
So if you tell a block to "use bastion", you have to exclude the bastion itself, otherwise you would have an infinite loop [17:48:26] 06Operations, 10fundraising-tech-ops, 07Security-General: use granularity (g=) restrictions for wikimedia.org fundraising DKIM records - https://phabricator.wikimedia.org/T142205#2568249 (10faidon) Could you explain a little bit what happens when you send using e.g. caitlin@wikimedia.org? Do you get random p... [17:49:06] maybe it's not in the config because we don't want to encourage using the bastions as work hosts, heh [17:50:53] arrgh, doesn't take my password [17:51:17] PROBLEM - HHVM jobrunner on mw1162 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:51:37] your ssh key is what is identifying you to the hosts, so your ssh key passphrase (nitpick, in case you google for help) is what you're typing [17:52:47] !log dereckson@tin Finished scap: (no message) (duration: 26m 44s) [17:52:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:53:17] 06Operations, 06MediaWiki-Stakeholders-Group, 10Traffic, 07Developer-notice, and 2 others: Get rid of geoiplookup service - https://phabricator.wikimedia.org/T100902#1323201 (10jeblad) Talk to the language team. Please? [17:53:22] Works fine. [17:53:27] thanks Dereckson [17:56:36] 06Operations, 06MediaWiki-Stakeholders-Group, 10Traffic, 07Developer-notice, and 2 others: Get rid of geoiplookup service - https://phabricator.wikimedia.org/T100902#1323201 (10greg) >>! In T100902#2568267, @jeblad wrote: > Talk to the language team. That would be what the conversation with @Nikerabbit is... 
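Note on the SSH discussion above (17:44 to 17:47): the rule "proxy everything through a bastion, except the bastion itself" can be sketched in Python. This is only an approximation of how OpenSSH evaluates the patterns on a `Host` line, assuming shell-style wildcards via `fnmatch`; the names `host_block_applies` and `PROXIED` are hypothetical and made up for this illustration, and the patterns are the ones quoted in the chat.

```python
from fnmatch import fnmatchcase


def host_block_applies(hostname, patterns):
    """Approximate whether an ssh_config Host block applies to hostname.

    Assumption: a block applies when the hostname matches at least one
    positive pattern and no negated ("!"-prefixed) pattern.
    """
    matched = False
    for pattern in patterns:
        if pattern.startswith("!"):
            if fnmatchcase(hostname, pattern[1:]):
                return False  # a negated match always excludes the host
        elif fnmatchcase(hostname, pattern):
            matched = True
    return matched


# Patterns as quoted in the chat; the block they belong to sets a ProxyCommand.
PROXIED = [
    "*.wikimedia.org",
    "*.wmnet",
    "!gerrit.wikimedia.org",
    "!bast1001.wikimedia.org",
]

for host in (
    "rutherfordium.eqiad.wmnet",
    "bast1001.wikimedia.org",
    "gerrit.wikimedia.org",
):
    print(host, host_block_applies(host, PROXIED))
```

Under these assumptions, hosts behind the bastion are proxied, while the bastion and gerrit are reached directly, which is what avoids the proxy loop described in the chat.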
[18:00:27] !log T143226: Perform major compaction on local_group_wikipedia_T_parsoid_html.data, restbase2006-a.codfw.wmnet [18:00:28] T143226: Cluster-wide major compactions: parsoid-html - https://phabricator.wikimedia.org/T143226 [18:00:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:04:38] (03PS4) 10Dzahn: mediawiki: include font packages on all appservers [puppet] - 10https://gerrit.wikimedia.org/r/231284 (https://phabricator.wikimedia.org/T84777) [18:05:47] (03PS5) 10Dzahn: mediawiki: include fonts in role::mediawiki::webserver [puppet] - 10https://gerrit.wikimedia.org/r/231284 (https://phabricator.wikimedia.org/T84777) [18:06:46] (03CR) 10Dzahn: "alright, now including in role::mediawiki::webserver" [puppet] - 10https://gerrit.wikimedia.org/r/231284 (https://phabricator.wikimedia.org/T84777) (owner: 10Dzahn) [18:10:38] (03CR) 10BBlack: [C: 031] "Thanks for working on this! Every rebase is going to be tricky to re-review as more changes pile in every day. Let's rebase+review+merge" [dns] - 10https://gerrit.wikimedia.org/r/304155 (owner: 10Dzahn) [18:12:11] (03CR) 10BBlack: [C: 031] Add toolsadmin.wikimedia.org to misc varnish [puppet] - 10https://gerrit.wikimedia.org/r/305142 (https://phabricator.wikimedia.org/T136256) (owner: 10BryanDavis) [18:13:56] bblack: when that gets applied, does the dns need to be active first or does it matter? [18:14:37] bd808: it can be done either way, but it's better if the varnish change lands before the DNS. Otherwise if the hostname is first someone will hit it and cache a 4xx before there's content available at that name. [18:14:55] ah. right [18:15:23] from a driveby report in #wikimedia-traffic: [18:15:24] 18:09 -!- Framawiki [~farmawiki@wikipedia/framawiki] has joined #wikimedia-traffic [18:15:27] 18:10 < Framawiki> hi, 8 differents users report http 400 error on edits. 
https://fr.wikipedia.org/wiki/Sujet:T9wtwe84pz3n18fy [18:15:30] 18:10 < Framawiki> thanks [18:15:43] I don't think we've done in anything in VCL-land to cause a 400 lately. [18:16:13] in fatalmonitor: 12 Cannot modify header information - headers already sent in /srv/mediawiki/php-1.28.0-wmf.15/includes/WebResponse.php on line 42 [18:16:14] it's remotely possible it's related to the Chrome/41.0.2272 bug, but that would be a 401 (maybe misreported by the UA or the user)? [18:16:31] but only 12 [18:18:04] (if it does turn out the frwiki "400" issue is related to Chrome/41 and a 401 response, the answer is probably "upgrade your browser") [18:19:20] or downgrade, or switch. either way, Chrome/41.0.2272.76 on Windows hosts is a borked browser as far as we know, and the 401 is a mitigation [18:20:19] (03PS1) 10Dduvall: beta: Create and mount LVM volumes for maridb [puppet] - 10https://gerrit.wikimedia.org/r/305668 (https://phabricator.wikimedia.org/T138778) [18:21:53] (03PS1) 10Mattflaschen: Set Flow as default for User talk on kabwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/305670 (https://phabricator.wikimedia.org/T140588) [18:22:10] (03CR) 10Mattflaschen: [C: 04-2] Set Flow as default for User talk on kabwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/305670 (https://phabricator.wikimedia.org/T140588) (owner: 10Mattflaschen) [18:24:31] (03PS2) 10Dduvall: beta: Create and mount LVM volumes for maridb [puppet] - 10https://gerrit.wikimedia.org/r/305668 (https://phabricator.wikimedia.org/T138778) [18:25:35] (03CR) 10jenkins-bot: [V: 04-1] beta: Create and mount LVM volumes for maridb [puppet] - 10https://gerrit.wikimedia.org/r/305668 (https://phabricator.wikimedia.org/T138778) (owner: 10Dduvall) [18:25:56] (03PS3) 10Dduvall: beta: Create and mount LVM volumes for mariadb [puppet] - 10https://gerrit.wikimedia.org/r/305668 (https://phabricator.wikimedia.org/T138778) [18:27:08] (03CR) 10jenkins-bot: [V: 04-1] beta: Create and mount LVM volumes for 
mariadb [puppet] - 10https://gerrit.wikimedia.org/r/305668 (https://phabricator.wikimedia.org/T138778) (owner: 10Dduvall) [18:27:26] (03PS4) 10Dduvall: beta: Create and mount LVM volumes for mariadb [puppet] - 10https://gerrit.wikimedia.org/r/305668 (https://phabricator.wikimedia.org/T138778) [18:33:05] Hi all :) I've a big problem :/ I accidently wiped out my phone, and unfortunately doesn't have a backup. However, that's my problem. Now, I had enabled the two factor authentication to my account at wikitech, which means, that (if I want to login again) I would need to enter the code of my phone, which, as you can imaginge, would be generated on my phone I wiped :( Now, I don't have the possibility to access wikitech [18:33:05] anymore :( Can someone reset the token for my user? (I could provide my public key stored in the OpenStack key list as a verification, for example, or do something in phabricator to verify that I own my account). Would be great to have some help (and I swear, that I'll properly backup my emergency codes next time :() [18:37:36] FlorianSW: it is possible, assuming that someone verifies that you are you ;) https://wikitech.wikimedia.org/wiki/Password_reset [18:38:30] MatmaRex: yes, I found that already :P That's why I request that here :D [18:38:46] hi, i don't know what channel is for this, but visual editor looks entirely down on frwiki [18:38:57] https://phabricator.wikimedia.org/T141226 [18:39:26] I have got the same error when try open VE on every page (firefox user) [18:39:44] Framawiki: there is #mediawiki-visualeditor. here is fine for urgent problems though [18:39:49] https://fr.wikipedia.org/wiki/Michel_Juffé?veaction=edit loads the editor for me [18:39:56] Framawiki: if operations staff can help, this is the correct channel, for product issues you should ask in phabricator, the appropriate channel for the product or in #wikimedia-dev [18:40:08] damn, too slow [18:40:11] me no now ! 
[18:41:52] MatmaRex: Is it right to assume, that you can't reset the 2fa token? :D I could create a request in my home directory of Tools-bastion :) [18:42:23] FlorianSW: no, i don't have the access [18:42:44] ok, thanks nonetheless! :) [18:46:00] 07Puppet, 10Beta-Cluster-Infrastructure, 10Salt: puppet on deployment-changeprop taking forever because of systemctl start salt-minion - https://phabricator.wikimedia.org/T143371#2568418 (10thcipriani) So when I logged on yesterday, there were several hundred: systemctl start salt-minion jobs systemd on thi... [18:50:18] (03PS1) 10MaxSem: WIP: discovery stats module [puppet] - 10https://gerrit.wikimedia.org/r/305673 (https://phabricator.wikimedia.org/T143048) [18:50:22] (03CR) 10Dzahn: "that sounds great, yep, let's do that. thanks" [dns] - 10https://gerrit.wikimedia.org/r/304155 (owner: 10Dzahn) [18:50:42] (03CR) 10Madhuvishy: [C: 031 V: 031] labs: nfs manager add clean (all) and brief help [puppet] - 10https://gerrit.wikimedia.org/r/305659 (owner: 10Rush) [18:51:41] (03CR) 10jenkins-bot: [V: 04-1] WIP: discovery stats module [puppet] - 10https://gerrit.wikimedia.org/r/305673 (https://phabricator.wikimedia.org/T143048) (owner: 10MaxSem) [18:52:39] 07Puppet, 10Beta-Cluster-Infrastructure, 07Tracking: Deployment-prep hosts with puppet errors (tracking) - https://phabricator.wikimedia.org/T132259#2568441 (10thcipriani) [18:52:41] 07Puppet, 10Beta-Cluster-Infrastructure, 10Salt: puppet on deployment-changeprop taking forever because of systemctl start salt-minion - https://phabricator.wikimedia.org/T143371#2568438 (10thcipriani) 05Open>03Resolved a:03thcipriani Seems to be fixed: ``` thcipriani@deployment-changeprop:~$ sudo sys... [18:55:26] https://phabricator.wikimedia.org/T141226#2568344 [18:55:39] ^ more on that 400, seems to be a possible VE issue that might be growing? 
[18:55:56] oh I see it's already being discused above too :) [18:58:36] (03PS3) 10Yuvipanda: labs: Set timeout for ldap3 using scripts [puppet] - 10https://gerrit.wikimedia.org/r/305616 (https://phabricator.wikimedia.org/T142394) [18:58:38] (03CR) 10Yuvipanda: [C: 032 V: 032] labs: Set timeout for ldap3 using scripts [puppet] - 10https://gerrit.wikimedia.org/r/305616 (https://phabricator.wikimedia.org/T142394) (owner: 10Yuvipanda) [18:59:06] bblack: and in #mediawiki-visualeditor ;) [19:05:02] (03PS2) 10Rush: labs: nfs manager add clean (all) and brief help [puppet] - 10https://gerrit.wikimedia.org/r/305659 [19:05:06] (03PS2) 10Rush: tools: mount scratch on labstore1003 as well [puppet] - 10https://gerrit.wikimedia.org/r/305657 (https://phabricator.wikimedia.org/T134896) [19:05:09] (03PS2) 10Rush: labstore: drbd framework for software based HA [puppet] - 10https://gerrit.wikimedia.org/r/305663 (https://phabricator.wikimedia.org/T126083) [19:05:29] MatmaRex: bblack: a contributor attached a screenshot of the error to the Flow topic: https://commons.wikimedia.org/wiki/File:Probl%C3%A8me_%C3%A9dition_d%27article_wikip%C3%A9dia.png [19:05:31] 06Operations, 10Packaging, 10Phabricator: upload php-mailparse and python-phabricator to jessie - https://phabricator.wikimedia.org/T138689#2568470 (10Dzahn) [19:05:39] the http 400 is reported by JS [19:05:46] 06Operations, 10Packaging, 10Phabricator: upload php-mailparse and python-phabricator to jessie - https://phabricator.wikimedia.org/T138689#2407363 (10Dzahn) p:05Triage>03Normal [19:07:38] (oh was already on the task too) [19:08:33] 06Operations, 10Packaging, 10Phabricator: upload php-mailparse and python-phabricator to jessie - https://phabricator.wikimedia.org/T138689#2407363 (10Dzahn) Backport the one from stretch? 
[19:11:26] (03CR) 10Madhuvishy: "Just one comment, otherwise +1" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/305663 (https://phabricator.wikimedia.org/T126083) (owner: 10Rush) [19:13:25] (03PS3) 10Rush: labstore: drbd framework for software based HA [puppet] - 10https://gerrit.wikimedia.org/r/305663 (https://phabricator.wikimedia.org/T126083) [19:15:07] (03CR) 10RobH: [C: 031] "removing potential inconsistencies for user access seems like a great idea." [puppet] - 10https://gerrit.wikimedia.org/r/301149 (https://phabricator.wikimedia.org/T114161) (owner: 10Alex Monk) [19:15:46] (03Abandoned) 10Madhuvishy: [WIP] labstore: Configure drbd for a HA labstore setup [puppet] - 10https://gerrit.wikimedia.org/r/302488 (owner: 10Madhuvishy) [19:15:52] 06Operations, 10Traffic, 10Wikimedia-Blog, 07HTTPS: make blog links from wmfwiki front page use HTTPS links - https://phabricator.wikimedia.org/T104728#2568496 (10EdErhart-WMF) [19:16:04] 06Operations, 10Traffic, 10Wikimedia-Blog, 07HTTPS: make blog links from wmfwiki front page use HTTPS links - https://phabricator.wikimedia.org/T104728#1425614 (10EdErhart-WMF) 05Open>03Resolved [19:17:27] 06Operations, 06Commons, 10Wikimedia-SVG-rendering, 07User-notice: SVG files larger than 10 MB cannot be thumbnailed - https://phabricator.wikimedia.org/T111815#2568500 (10Johan) OK, thanks. It'll go out in the newsletter that's distributed on Monday. [19:18:29] 06Operations, 10Cassandra, 06Services: Renew RESTBase self-signed root certificate authority - https://phabricator.wikimedia.org/T143044#2554903 (10Dzahn) We can check that with existing "check_ssl_certfile" or a slight variation of it. "via NRPE. It runs "openssl x509 -checkend 324000 -noout -in $1 on t... [19:30:28] (03CR) 10Dzahn: [C: 031] "looks good. 
for the root CA we'd want a different kind of check (or none), so this should be added either way" [puppet] - 10https://gerrit.wikimedia.org/r/305633 (https://phabricator.wikimedia.org/T120662) (owner: 10Filippo Giunchedi) [19:31:17] (03CR) 10Dzahn: [C: 032] "ok, let me just do that, i'll check it on neon" [puppet] - 10https://gerrit.wikimedia.org/r/305633 (https://phabricator.wikimedia.org/T120662) (owner: 10Filippo Giunchedi) [19:31:23] (03PS3) 10Dzahn: cassandra: add instance ssl monitoring [puppet] - 10https://gerrit.wikimedia.org/r/305633 (https://phabricator.wikimedia.org/T120662) (owner: 10Filippo Giunchedi) [19:38:37] (03PS1) 10Dduvall: beta: Configure storage cluster for migrated databases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/305675 (https://phabricator.wikimedia.org/T138778) [19:41:11] 06Operations, 10Traffic, 10Wikimedia-Blog, 07HTTPS: make blog links from wmfwiki front page use HTTPS links - https://phabricator.wikimedia.org/T104728#2568535 (10jeremyb) [19:46:09] (03CR) 10Dzahn: "new check command has been defined on neon" [puppet] - 10https://gerrit.wikimedia.org/r/305633 (https://phabricator.wikimedia.org/T120662) (owner: 10Filippo Giunchedi) [19:47:25] (repost from earlier, maybe now is someone online, who can help me :)) Hi all :) I've a big problem :/ I accidently wiped out my phone, and unfortunately doesn't have a backup. However, that's my problem. Now, I had enabled the two factor authentication to my account at wikitech, which means, that (if I want to login again) I would need to enter the code of my phone, which, as you can imaginge, would be generated on my [19:47:25] phone I wiped :( Now, I don't have the possibility to access wikitech anymore :( Can someone reset the token for my user? (I could provide my public key stored in the OpenStack key list as a verification, for example, or do something in phabricator to verify that I own my account). 
Would be great to have some help (and I swear, that I'll properly backup my emergency codes next time :() [19:48:53] FlorianSW: i think there is a process for that, let me look it up [19:49:06] mutante: https://wikitech.wikimedia.org/wiki/Password_reset [19:49:08] :D [19:49:25] * Nemo_bis was about to link [19:49:37] the first question is: Is it enough for you, if I write the request to my home directory on tools-bastion? [19:50:18] * FlorianSW thanks Nemo_bis nonetheless! :) [19:52:24] FlorianSW: let's do the home directory thing and let me PM you [19:52:32] ok :) [19:52:50] PROBLEM - cassandra SSL 10.192.16.179:7001 on maps2002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:IO::Socket::SSL: connect: Connection refused [20:03:27] !log removed 2fa for user Florianschmidtwelzow on labswiki, confirmed via file on tools-login and IRC identity [20:03:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:03:46] FlorianSW: ^ please try and add it back [20:04:28] PROBLEM - cassandra SSL 10.192.0.144:7001 on maps2001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:IO::Socket::SSL: connect: Connection refused [20:05:09] PROBLEM - cassandra SSL 10.192.0.128:7001 on maps-test2001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:IO::Socket::SSL: connect: Connection refused [20:05:28] PROBLEM - cassandra SSL 10.64.32.175:7001 on aqs1002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:IO::Socket::SSL: connect: Connection refused [20:05:38] PROBLEM - cassandra SSL 10.64.48.154:7001 on maps1004 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:IO::Socket::SSL: connect: Connection refused [20:05:50] PROBLEM - cassandra-a SSL 10.64.32.189:7001 on aqs1005 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:IO::Socket::SSL: connect: Connection refused [20:06:00] those are the new checks [20:06:01] got it [20:06:29] PROBLEM - cassandra-b SSL 10.64.32.190:7001 on 
aqs1005 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:IO::Socket::SSL: connect: Connection refused [20:06:33] i'll make that stop [20:06:34] mutante: Validated two-factor credentials. Two-factor authentication will now be enforced. :) [20:06:41] (03PS5) 10Dduvall: beta: Create and mount LVM volumes for mariadb [puppet] - 10https://gerrit.wikimedia.org/r/305668 (https://phabricator.wikimedia.org/T138778) [20:06:46] FlorianSW: great ! [20:06:48] PROBLEM - cassandra SSL 10.64.0.123:7001 on aqs1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:IO::Socket::SSL: connect: Connection refused [20:07:06] thanks for your prompt and competent help! [20:07:08] PROBLEM - cassandra-a SSL 10.64.0.126:7001 on aqs1004 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:IO::Socket::SSL: connect: Connection refused [20:07:12] PROBLEM - cassandra SSL 10.192.32.146:7001 on maps2003 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:IO::Socket::SSL: connect: Connection refused [20:07:30] PROBLEM - cassandra-b SSL 10.64.0.127:7001 on aqs1004 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:IO::Socket::SSL: connect: Connection refused [20:07:49] PROBLEM - cassandra SSL 10.192.16.34:7001 on maps-test2003 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:IO::Socket::SSL: connect: Connection refused [20:07:49] PROBLEM - cassandra SSL 10.64.16.42:7001 on maps1002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:IO::Socket::SSL: connect: Connection refused [20:08:19] FlorianSW: welcome [20:12:37] ^ so , the new SSL cassandra checks there [20:12:43] it actually works fine for many [20:13:05] but the cassandra role is also on aqs, maps, maps-test ... 
[20:13:50] and maybe the check should be on all of them or maybe not [20:14:33] it works fine for restbase* and these have monitoring now expiring certs, which they did not before [20:15:33] ah, and/or it's just a different port, 7001 vs 9042 [20:17:48] mutante: so those alarms are real or you are just testing monitoring of certs? [20:20:25] 06Operations, 10Cassandra, 10RESTBase-Cassandra, 13Patch-For-Review: Track/alert cassandra certs expiration - https://phabricator.wikimedia.org/T120662#2568609 (10Dzahn) It has been added to Icinga and it works fine for restbase* hosts, but the cassandra role is also applied on aqs* and maps* and it does n... [20:22:33] nuria_: some are real that warn about expiry but just on 'test' hosts, some are real and just working OK but just got added, a new thing that did not exist before, and some are applied on hosts that they should not try to be checking. none of it is something to worry about [20:22:51] mutante: k, not worrying then cc ottomata [20:23:24] i merged a change that adds "new" cert monitoring for cassandra and it mostly works [20:23:42] aye [20:23:44] just tries to do it on aqs* and maps* too w [20:23:51] which it probably should not [20:23:59] because the check is added via cassandra role [20:24:15] i was just about to ask... 
[20:25:30] ACKNOWLEDGEMENT - cassandra SSL 10.64.0.123:7001 on aqs1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:IO::Socket::SSL: connect: Connection refused daniel_zahn new check from cassandra role that is just for restbase* [20:25:33] ACKNOWLEDGEMENT - cassandra SSL 10.64.32.175:7001 on aqs1002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:IO::Socket::SSL: connect: Connection refused daniel_zahn new check from cassandra role that is just for restbase* [20:25:36] ACKNOWLEDGEMENT - cassandra-b SSL 10.64.0.127:7001 on aqs1004 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:IO::Socket::SSL: connect: Connection refused daniel_zahn new check from cassandra role that is just for restbase* [20:26:28] disabling notifications for the wrong ones [20:26:48] then i'll see how to limit it to restbase hosts [20:27:32] there are some real warnings that confirm the check works otherwise, like that it expires at some point on restbase-test [20:27:54] mutante: yeah, saw those... 
that looks right [20:28:12] a nice test of the test, too [20:28:46] ok, yes [20:34:11] PROBLEM - cassandra SSL 10.192.0.129:7001 on maps-test2002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:IO::Socket::SSL: connect: Connection refused [20:35:40] PROBLEM - cassandra SSL 10.64.32.117:7001 on maps1003 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:IO::Socket::SSL: connect: Connection refused [20:36:00] PROBLEM - cassandra SSL 10.64.48.117:7001 on aqs1003 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:IO::Socket::SSL: connect: Connection refused [20:36:29] PROBLEM - cassandra SSL 10.192.48.57:7001 on maps2004 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:IO::Socket::SSL: connect: Connection refused [20:36:29] PROBLEM - cassandra SSL 10.64.0.79:7001 on maps1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:IO::Socket::SSL: connect: Connection refused [20:36:40] PROBLEM - cassandra-a SSL 10.64.48.148:7001 on aqs1006 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:IO::Socket::SSL: connect: Connection refused [20:37:00] PROBLEM - cassandra SSL 10.192.16.35:7001 on maps-test2004 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:IO::Socket::SSL: connect: Connection refused [20:37:09] aww, that was supposed to be off already [20:37:10] PROBLEM - cassandra-b SSL 10.64.48.149:7001 on aqs1006 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:IO::Socket::SSL: connect: Connection refused [20:37:15] is on the fix [20:37:42] (03PS1) 10Dzahn: cassandra: limit SSL cert monitoring to restbase hosts [puppet] - 10https://gerrit.wikimedia.org/r/305711 (https://phabricator.wikimedia.org/T120662) [20:39:09] (03CR) 10Dzahn: [C: 032] "quick-fix for now..." 
[puppet] - 10https://gerrit.wikimedia.org/r/305711 (https://phabricator.wikimedia.org/T120662) (owner: 10Dzahn) [20:45:08] (03PS1) 10RobH: robh on vacation next week, remove from paging [puppet] - 10https://gerrit.wikimedia.org/r/305734 [20:45:11] (03PS1) 10Rush: sge: add more stats for grid collector [puppet] - 10https://gerrit.wikimedia.org/r/305735 [20:45:35] (03PS2) 10Rush: sge: add more stats for grid collector [puppet] - 10https://gerrit.wikimedia.org/r/305735 [20:46:26] (03PS3) 10Rush: labs: nfs manager add clean (all) and brief help [puppet] - 10https://gerrit.wikimedia.org/r/305659 [20:46:44] (03CR) 10Rush: [C: 032 V: 032] labs: nfs manager add clean (all) and brief help [puppet] - 10https://gerrit.wikimedia.org/r/305659 (owner: 10Rush) [20:47:21] (03PS3) 10Rush: sge: add more stats for grid collector [puppet] - 10https://gerrit.wikimedia.org/r/305735 [20:47:27] (03PS4) 10Rush: labstore: drbd framework for software based HA [puppet] - 10https://gerrit.wikimedia.org/r/305663 (https://phabricator.wikimedia.org/T126083) [20:47:46] (03CR) 10Rush: [C: 032 V: 032] labstore: drbd framework for software based HA [puppet] - 10https://gerrit.wikimedia.org/r/305663 (https://phabricator.wikimedia.org/T126083) (owner: 10Rush) [20:49:03] (03PS4) 10Rush: sge: add more stats for grid collector [puppet] - 10https://gerrit.wikimedia.org/r/305735 [20:50:33] (03CR) 10Rush: [C: 032] sge: add more stats for grid collector [puppet] - 10https://gerrit.wikimedia.org/r/305735 (owner: 10Rush) [20:57:18] (03PS1) 10Dduvall: labs: Allow LVM mount permissions to remain unmanaged [puppet] - 10https://gerrit.wikimedia.org/r/305737 [20:59:22] (03PS6) 10Dduvall: beta: Create and mount LVM volumes for mariadb [puppet] - 10https://gerrit.wikimedia.org/r/305668 (https://phabricator.wikimedia.org/T138778) [20:59:40] (03CR) 10Merlijn van Deen: [C: 032] Fix generic webservices on gridengine [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/305661 
(https://phabricator.wikimedia.org/T143403) (owner: 10Yuvipanda) [21:01:17] (03PS7) 10Dduvall: beta: Create and mount LVM volumes for mariadb [puppet] - 10https://gerrit.wikimedia.org/r/305668 (https://phabricator.wikimedia.org/T138778) [21:05:56] (03PS8) 10Dduvall: beta: Create and mount LVM volumes for mariadb [puppet] - 10https://gerrit.wikimedia.org/r/305668 (https://phabricator.wikimedia.org/T138778) [21:06:12] (03Merged) 10jenkins-bot: Fix generic webservices on gridengine [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/305661 (https://phabricator.wikimedia.org/T143403) (owner: 10Yuvipanda) [21:10:26] (03PS1) 10Rush: labstore100[4|5]: define initial labstore::drbd::resource [puppet] - 10https://gerrit.wikimedia.org/r/305738 (https://phabricator.wikimedia.org/T126083) [21:11:18] (03PS2) 10Rush: labstore100[4|5]: define initial labstore::drbd::resource [puppet] - 10https://gerrit.wikimedia.org/r/305738 (https://phabricator.wikimedia.org/T126083) [21:14:21] 06Operations, 10Graphite, 06Labs: lots of graphite metrics under "instances" created - https://phabricator.wikimedia.org/T143405#2568698 (10yuvipanda) I'm thinking of just running this in a cron: ``` find . -type f \! -mtime 672 -delete ``` 672 is 28 days, 4 weeks. That sound ok to everyone? [21:24:30] 06Operations, 10fundraising-tech-ops, 07Security-General: use granularity (g=) restrictions for wikimedia.org fundraising DKIM records - https://phabricator.wikimedia.org/T142205#2568708 (10CCogdill_WMF) > Could you explain a little bit what happens when you send using e.g. caitlin@wikimedia.org? Do you get... 
[21:25:23] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.69% of data above the critical threshold [1000.0] [21:33:53] (03CR) 10Madhuvishy: labstore100[4|5]: define initial labstore::drbd::resource (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/305738 (https://phabricator.wikimedia.org/T126083) (owner: 10Rush) [21:41:24] (03CR) 10Rush: [C: 032] labstore100[4|5]: define initial labstore::drbd::resource (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/305738 (https://phabricator.wikimedia.org/T126083) (owner: 10Rush) [21:42:19] (03PS3) 10Rush: labstore100[4|5]: define initial labstore::drbd::resource [puppet] - 10https://gerrit.wikimedia.org/r/305738 (https://phabricator.wikimedia.org/T126083) [21:42:27] (03PS4) 10Rush: labstore100[4|5]: define initial labstore::drbd::resource [puppet] - 10https://gerrit.wikimedia.org/r/305738 (https://phabricator.wikimedia.org/T126083) [21:44:00] (03CR) 10Madhuvishy: [C: 032 V: 032] labstore100[4|5]: define initial labstore::drbd::resource [puppet] - 10https://gerrit.wikimedia.org/r/305738 (https://phabricator.wikimedia.org/T126083) (owner: 10Rush) [21:54:24] RECOVERY - carbon-cache too many creates on graphite1001 is OK: OK: Less than 1.00% above the threshold [500.0] [21:55:32] (03PS1) 10Madhuvishy: nfs: Fix disk path for drbd resource test [puppet] - 10https://gerrit.wikimedia.org/r/305743 [21:56:05] (03CR) 10Rush: [C: 031] nfs: Fix disk path for drbd resource test [puppet] - 10https://gerrit.wikimedia.org/r/305743 (owner: 10Madhuvishy) [21:56:23] (03CR) 10Madhuvishy: [C: 032 V: 032] nfs: Fix disk path for drbd resource test [puppet] - 10https://gerrit.wikimedia.org/r/305743 (owner: 10Madhuvishy) [22:00:44] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.69% of data above the critical threshold [1000.0] [22:14:53] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.69% of data above the critical threshold [1000.0] 
[22:18:53] RECOVERY - carbon-cache too many creates on graphite1001 is OK: OK: Less than 1.00% above the threshold [500.0] [22:24:14] (03PS2) 10Jforrester: Follow-up I049fa67: Remind people not to enable wgKartographerWikivoyageMode elsewhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284483 [22:24:27] (03CR) 10jenkins-bot: [V: 04-1] Follow-up I049fa67: Remind people not to enable wgKartographerWikivoyageMode elsewhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284483 (owner: 10Jforrester) [22:25:30] (03CR) 10MaxSem: "It's not deprecated. We are just not using it for the sake of VE." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284483 (owner: 10Jforrester) [22:32:28] (03PS3) 10Jforrester: Follow-up I049fa67: Remind people not to enable wgKartographerWikivoyageMode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284483 [22:32:30] (03PS1) 10Jforrester: Remind people not to enable wmgUseKartographer elsewhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/305748 [22:40:13] (03PS2) 10Dzahn: Replace manually-maintained bastiononly group with the new 'all-users' [puppet] - 10https://gerrit.wikimedia.org/r/301149 (https://phabricator.wikimedia.org/T114161) (owner: 10Alex Monk) [22:40:44] (03CR) 10Dzahn: "manual rebase because a couple people got added meanwhile" [puppet] - 10https://gerrit.wikimedia.org/r/301149 (https://phabricator.wikimedia.org/T114161) (owner: 10Alex Monk) [22:43:49] (03CR) 10Esanders: [C: 031] "Seems sensible to have an inline comment to remind people not to enable on other wikis." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284483 (owner: 10Jforrester) [22:51:07] (03CR) 10Dzahn: [C: 032] "with 4 +1's i'm now merging it. 
compiler confirms only that one user "dkg" is changed as the commit mesage says" [puppet] - 10https://gerrit.wikimedia.org/r/301149 (https://phabricator.wikimedia.org/T114161) (owner: 10Alex Monk) [22:51:14] (03PS3) 10Dzahn: Replace manually-maintained bastiononly group with the new 'all-users' [puppet] - 10https://gerrit.wikimedia.org/r/301149 (https://phabricator.wikimedia.org/T114161) (owner: 10Alex Monk) [22:54:02] 06Operations, 10Cassandra, 10RESTBase-Cassandra, 13Patch-For-Review: Track/alert cassandra certs expiration - https://phabricator.wikimedia.org/T120662#2568976 (10Dzahn) https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=cassandra [22:55:48] !log temp. disabled puppet on bastion hosts, confirming change 301149 works as expected [22:55:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:56:51] (03CR) 10Dzahn: "[bast3001:~] $ puppet agent -tv" [puppet] - 10https://gerrit.wikimedia.org/r/301149 (https://phabricator.wikimedia.org/T114161) (owner: 10Alex Monk) [22:58:34] !log works fine, no more "bastiononly" group - all users get automatically added on bastions [22:58:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:58:41] Krenair: ^ thanks [23:00:47] mutante: Yeah, we use it in doc.wikimedia.org already [23:01:05] Krinkle: i'd merge but i have no perms on that repo, not even +1 [23:01:41] oh,, the bot merged it meanwhile.. that's why i'm confused [23:01:50] ok [23:05:16] 06Operations, 13Patch-For-Review: Do not require people to be explicitly added to the bastiononly group - https://phabricator.wikimedia.org/T114161#2568996 (10Dzahn) Merged that one after it had a couple +1, checked on bastions. It really just created the "dkg" user as expected. No other change. And now there... [23:06:26] mutante: How can I easily purge a url on a varnish that is not in the text/maps/upload cluster? 
[23:06:35] (I'd use mwscript purgeList in that case) [23:07:12] Url 'https://integration.wikimedia.org/cover' is stuck in Varnish cache as 301 redirect to doc.wm.o/cover instead of doc.wm.o/cover/ [23:07:23] Age: 2934 [23:07:24] X-Cache: cp1061 miss, cp2012 miss, cp4002 miss, cp4002 hit/35 [23:07:58] Krinkle: i just know this one https://wikitech.wikimedia.org/wiki/Varnish#How_to_execute_a_ban_.28on_one_machine.29 hmm [23:08:10] Hm.. a ban, right [23:08:32] I saw that but for a specific url I prefer an actual purge [23:08:46] Not sure how to craft those directly from bash [23:09:08] it used be called "one-off purges" [23:09:13] but "One-off purges (bans)" [23:10:33] i could run the salt commands but they are bans [23:11:39] htcp purges are not bans, though? [23:12:05] Shoudl be possible to simply send a real purge instead of a ban, which tends to stay around longer [23:14:16] bblack: ^ is that possible? [23:36:39] (03PS9) 10Dzahn: installserver: split DHCP part out into own role [puppet] - 10https://gerrit.wikimedia.org/r/305163 (https://phabricator.wikimedia.org/T132757) [23:36:54] (03CR) 10Dzahn: [C: 032] "no-op http://puppet-compiler.wmflabs.org/3783/" [puppet] - 10https://gerrit.wikimedia.org/r/305163 (https://phabricator.wikimedia.org/T132757) (owner: 10Dzahn) [23:39:51] (03PS2) 10Dzahn: put installserver::dhcp on install1001, install2001 [puppet] - 10https://gerrit.wikimedia.org/r/305431 (https://phabricator.wikimedia.org/T132757) [23:58:02] (03CR) 10Dzahn: [C: 032] put installserver::dhcp on install1001, install2001 [puppet] - 10https://gerrit.wikimedia.org/r/305431 (https://phabricator.wikimedia.org/T132757) (owner: 10Dzahn) [23:59:40] (03PS3) 10Dzahn: put installserver::dhcp on install1001, install2001 [puppet] - 10https://gerrit.wikimedia.org/r/305431 (https://phabricator.wikimedia.org/T132757)