[00:09:56] RECOVERY - SSH on serpens is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[00:12:11] (CR) Yurik: Set wgKartographerWikivoyageMode (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/283339 (owner: Dereckson)
[00:15:29] (CR) Yurik: [C: -2] "We need to be able to document map usage for Wikivoyage, and therefor show how it works. MediaWiki.org is used explicitly for help pages, " [mediawiki-config] - https://gerrit.wikimedia.org/r/284483 (owner: Jforrester)
[00:19:02] (CR) Dereckson: Set wgKartographerWikivoyageMode (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/283339 (owner: Dereckson)
[00:19:47] RECOVERY - puppet last run on seaborgium is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures
[00:21:56] PROBLEM - SSH on serpens is CRITICAL: Server answer
[00:24:38] (CR) Yurik: Set wgKartographerWikivoyageMode (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/283339 (owner: Dereckson)
[00:25:57] RECOVERY - SSH on serpens is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[00:35:22] (CR) Dereckson: Set wgKartographerWikivoyageMode (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/283339 (owner: Dereckson)
[00:43:34] PROBLEM - SSH on serpens is CRITICAL: Server answer
[00:47:25] RECOVERY - SSH on serpens is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[00:49:31] Operations, ops-eqiad: eqiad: Failed DIMM db1065 - https://phabricator.wikimedia.org/T133250#2226121 (Cmjohnson)
[00:51:01] Blocked-on-Operations, Operations, ops-eqiad: check ganeti1001-1006 for lff to sff adapters - https://phabricator.wikimedia.org/T133224#2226134 (Cmjohnson) yes this is true and we have the adapters from the restbase's
[00:53:26] PROBLEM - SSH on serpens is CRITICAL: Server answer
[00:53:45] PROBLEM - puppet last run on seaborgium is CRITICAL: CRITICAL: Puppet has 4 failures
[00:57:46] mutante, ^
[00:57:56] both serpens and seaborgium having issues?
[00:59:54] hmm.. seaborgium i could ssh to now
[01:00:07] looks fairly normal, ldap running
[01:00:35] serpens though..
[01:01:02] lets see if the puppet fails are a lie
[01:03:45] RECOVERY - SSH on serpens is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[01:03:55] RECOVERY - Disk space on serpens is OK: DISK OK
[01:03:55] RECOVERY - Labs LDAP on serpens is OK: LDAP OK - 3.655 seconds response time
[01:04:02] aha
[01:04:04] RECOVERY - RAID on serpens is OK: OK: no RAID installed
[01:04:14] RECOVERY - dhclient process on serpens is OK: PROCS OK: 0 processes with command name dhclient
[01:04:14] RECOVERY - puppet last run on serpens is OK: OK: Puppet is currently enabled, last run 2 hours ago with 0 failures
[01:04:15] RECOVERY - configured eth on serpens is OK: OK - interfaces up
[01:04:45] RECOVERY - salt-minion processes on serpens is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[01:05:15] RECOVERY - Check size of conntrack table on serpens is OK: OK: nf_conntrack is 0 % full
[01:05:34] RECOVERY - DPKG on serpens is OK: All packages OK
[01:11:46] PROBLEM - Auth DNS on labs-ns1-former-placeholder.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call
[01:12:35] PROBLEM - Auth DNS on labs-ns0-former-placeholder.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call
[01:13:36] RECOVERY - Auth DNS on labs-ns1-former-placeholder.wikimedia.org is OK: DNS OK: 0.044 seconds response time. nagiostest.beta.wmflabs.org returns 208.80.155.135
[01:14:34] RECOVERY - Auth DNS on labs-ns0-former-placeholder.wikimedia.org is OK: DNS OK: 0.028 seconds response time.
nagiostest.beta.wmflabs.org returns 208.80.155.135
[01:20:04] RECOVERY - puppet last run on seaborgium is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[01:23:15] (PS2) Dzahn: Typo fix at the warning for the inactive deployment server [puppet] - https://gerrit.wikimedia.org/r/284605 (owner: Luke081515)
[01:25:05] (CR) Dzahn: [C: 2] Typo fix at the warning for the inactive deployment server [puppet] - https://gerrit.wikimedia.org/r/284605 (owner: Luke081515)
[01:29:54] !log installed python-progressbar on terbium for warmup script, will be puppetized later
[01:30:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:44:36] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet).
[01:45:49] mutante: ^ strontium failed puppet-merge
[01:47:57] Operations, Labs, Labs-Infrastructure: investigate slapd memory leak - https://phabricator.wikimedia.org/T130593#2226160 (Andrew) Seaborgium went OOM briefly just now and threw some errors.
[01:48:32] !log belated log: restarted slapd on seaborgium
[01:48:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:49:27] !log git pull on strontium, ops/puppet
[01:49:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:50:44] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
[01:51:08] bblack: ^
[01:54:11] Operations, Labs, Labs-Infrastructure: investigate slapd memory leak - https://phabricator.wikimedia.org/T130593#2140296 (Dzahn) seems like we had one again today. serpens and seaborgium were reported by Icinga as having various issues. then serpens recoverd itself. seaborgium showed puppet errors....
[02:04:18] Operations, Documentation: put lint:ignore documention on wikitech - https://phabricator.wikimedia.org/T133222#2226168 (Peachey88)
[02:05:06] PROBLEM - configured eth on install1001 is CRITICAL: Connection refused by host
[02:05:36] PROBLEM - dhclient process on install1001 is CRITICAL: Connection refused by host
[02:05:56] PROBLEM - puppet last run on install1001 is CRITICAL: Connection refused by host
[02:06:16] PROBLEM - salt-minion processes on install1001 is CRITICAL: Connection refused by host
[02:07:04] ^ ACKed
[02:10:13] Operations, Patch-For-Review: move RT off of magnesium - https://phabricator.wikimedia.org/T119112#2226169 (Dzahn) p:Low>Normal
[02:22:27] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.21) (duration: 09m 48s)
[02:22:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:31:04] !log l10nupdate@tin ResourceLoader cache refresh completed at Thu Apr 21 02:31:04 UTC 2016 (duration 8m 37s)
[02:31:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:20:42] (PS2) 20after4: Automate the generation deployment keys (keyholder-managed ssh keys) [puppet] - https://gerrit.wikimedia.org/r/284418 (https://phabricator.wikimedia.org/T133211)
[03:21:15] (CR) 20after4: Automate the generation deployment keys (keyholder-managed ssh keys) (14 comments) [puppet] - https://gerrit.wikimedia.org/r/284418 (https://phabricator.wikimedia.org/T133211) (owner: 20after4)
[03:21:55] (CR) jenkins-bot: [V: -1] Automate the generation deployment keys (keyholder-managed ssh keys) [puppet] - https://gerrit.wikimedia.org/r/284418 (https://phabricator.wikimedia.org/T133211) (owner: 20after4)
[03:38:36] (PS3) 20after4: Automate the generation deployment keys (keyholder-managed ssh keys) [puppet] - https://gerrit.wikimedia.org/r/284418 (https://phabricator.wikimedia.org/T133211)
[03:39:39] (CR) jenkins-bot: [V: -1] Automate the generation
deployment keys (keyholder-managed ssh keys) [puppet] - https://gerrit.wikimedia.org/r/284418 (https://phabricator.wikimedia.org/T133211) (owner: 20after4)
[04:05:16] (PS4) 20after4: Automate the generation deployment keys (keyholder-managed ssh keys) [puppet] - https://gerrit.wikimedia.org/r/284418 (https://phabricator.wikimedia.org/T133211)
[06:21:13] PROBLEM - puppet last run on mw2070 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:34:22] PROBLEM - puppet last run on subra is CRITICAL: CRITICAL: Puppet has 2 failures
[06:34:43] PROBLEM - puppet last run on mw2050 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:35:32] PROBLEM - puppet last run on mw1135 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:35:32] PROBLEM - puppet last run on mw2126 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:35:42] PROBLEM - puppet last run on mw2018 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:35:51] PROBLEM - puppet last run on mw2129 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:35:51] PROBLEM - puppet last run on mw2158 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:37:01] PROBLEM - puppet last run on mw2207 is CRITICAL: CRITICAL: Puppet has 6 failures
[06:37:31] PROBLEM - puppet last run on mw1158 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:48:01] RECOVERY - puppet last run on mw2070 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures
[06:50:08] (PS2) Giuseppe Lavagetto: Fix appserver font package name for Indian fonts [puppet] - https://gerrit.wikimedia.org/r/284463 (https://phabricator.wikimedia.org/T131749) (owner: Muehlenhoff)
[06:55:36] (CR) Giuseppe Lavagetto: [C: 2] Fix appserver font package name for Indian fonts [puppet] - https://gerrit.wikimedia.org/r/284463 (https://phabricator.wikimedia.org/T131749) (owner: Muehlenhoff)
[06:57:22] RECOVERY - puppet last run on mw1135 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures
[06:57:23] RECOVERY - puppet last run on
mw2126 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[06:57:23] RECOVERY - puppet last run on subra is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:32] RECOVERY - puppet last run on mw1158 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:42] RECOVERY - puppet last run on mw2018 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures
[06:57:51] RECOVERY - puppet last run on mw2158 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:51] RECOVERY - puppet last run on mw2129 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:06] Operations, Flow, MediaWiki-Redirects, Collab-Team-2016-Q4, and 2 others: Flow notification links on mobile point to desktop - https://phabricator.wikimedia.org/T107108#2226439 (Catrope)
[06:58:13] RECOVERY - puppet last run on mw2050 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:59:03] RECOVERY - puppet last run on mw2207 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:03:38] (CR) Muehlenhoff: "I think this is now superceded by the recent jessie/wikimedia work (which deprecated precise support amongst other changes(" [puppet] - https://gerrit.wikimedia.org/r/218640 (https://phabricator.wikimedia.org/T102623) (owner: Dzahn)
[07:24:28] (PS1) Elukey: Add delaycompress to kafkatee's logrotate config to avoid cronspam. [puppet] - https://gerrit.wikimedia.org/r/284635 (https://phabricator.wikimedia.org/T132322)
[07:30:45] (CR) Elukey: [C: 2] Add delaycompress to kafkatee's logrotate config to avoid cronspam.
[puppet] - https://gerrit.wikimedia.org/r/284635 (https://phabricator.wikimedia.org/T132322) (owner: Elukey)
[07:48:30] Operations, Traffic: Fix apache-2.4 + DHE ciphersuites issue - https://phabricator.wikimedia.org/T133217#2226458 (MoritzMuehlenhoff) I'm in favour of rebuilding apache2. The overhead isn't that big (jessie has been released a year ago and saw one update in a DSA and two in point releases) and it's a tran...
[07:51:59] (PS1) Elukey: Add mod_headers to the httpd config to allow the Header directive. [puppet] - https://gerrit.wikimedia.org/r/284638 (https://phabricator.wikimedia.org/T132324)
[07:53:36] (PS2) Elukey: Add mod_headers to the httpd config to allow the Header directive. [puppet] - https://gerrit.wikimedia.org/r/284638 (https://phabricator.wikimedia.org/T132324)
[08:03:24] (PS1) Giuseppe Lavagetto: admin: disable on mw1081 [puppet] - https://gerrit.wikimedia.org/r/284640
[08:04:15] <_joe_> moritzm: ^^ is ok with you? I'll revert as soon as I gathered enough data
[08:04:26] sure
[08:06:37] (CR) Alexandros Kosiaris: [C: 2] Add mod_headers to the httpd config to allow the Header directive.
[puppet] - https://gerrit.wikimedia.org/r/284638 (https://phabricator.wikimedia.org/T132324) (owner: Elukey)
[08:07:07] Operations, Continuous-Integration-Infrastructure: Investigate usage of ttf-ubuntu-font-family which is not available on Jessie - https://phabricator.wikimedia.org/T103325#2226489 (MoritzMuehlenhoff) a:MoritzMuehlenhoff
[08:07:19] Operations, MediaWiki-General-or-Unknown, HHVM: Make all role::mediawiki::* classes compatible with debian jessie - https://phabricator.wikimedia.org/T131749#2226492 (MoritzMuehlenhoff)
[08:07:21] Operations, Patch-For-Review: Mediawiki font packages: switch to Jessie - https://phabricator.wikimedia.org/T102623#2226493 (MoritzMuehlenhoff)
[08:07:23] Operations, Continuous-Integration-Infrastructure: Investigate usage of ttf-ubuntu-font-family which is not available on Jessie - https://phabricator.wikimedia.org/T103325#1387187 (MoritzMuehlenhoff) Open>Resolved I built the trusty jessie of src:ubuntu-font-family-sources for jessie-wikimedia an...
[08:08:16] (PS2) Giuseppe Lavagetto: admin: disable on mw1081 [puppet] - https://gerrit.wikimedia.org/r/284640
[08:08:51] (CR) Giuseppe Lavagetto: [C: 2 V: 2] admin: disable on mw1081 [puppet] - https://gerrit.wikimedia.org/r/284640 (owner: Giuseppe Lavagetto)
[08:21:13] (PS1) Giuseppe Lavagetto: Revert "admin: disable on mw1081" [puppet] - https://gerrit.wikimedia.org/r/284644
[08:21:27] (CR) Giuseppe Lavagetto: [C: 2 V: 2] Revert "admin: disable on mw1081" [puppet] - https://gerrit.wikimedia.org/r/284644 (owner: Giuseppe Lavagetto)
[08:21:31] (PS1) Jcrespo: Add puppet-cert TLS certs to x1 database hosts [puppet] - https://gerrit.wikimedia.org/r/284645
[08:24:24] (CR) Volans: [C: 1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/284645 (owner: Jcrespo)
[08:25:30] (PS2) Jcrespo: Add puppet-cert TLS certs to x1 database hosts [puppet] - https://gerrit.wikimedia.org/r/284645
[08:32:19] Operations, Patch-For-Review: Tracking and Reducing cron-spam from root@ - https://phabricator.wikimedia.org/T132324#2226525 (elukey)
[08:32:21] Operations, Analytics, Patch-For-Review: kafkatee cronspam from oxygen - https://phabricator.wikimedia.org/T132322#2226524 (elukey) Open>Resolved
[08:32:55] (CR) Jcrespo: [C: 2] Add puppet-cert TLS certs to x1 database hosts [puppet] - https://gerrit.wikimedia.org/r/284645 (owner: Jcrespo)
[08:36:19] !log restarting db1031 to apply new mysql config
[08:36:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:38:56] (PS1) Filippo Giunchedi: Revert "varnish: switch upload eqiad from 'direct' to 'codfw'" [puppet] - https://gerrit.wikimedia.org/r/284648
[08:38:58] (PS1) Filippo Giunchedi: Revert "varnish: switch esams from 'eqiad' to 'codfw'" [puppet] - https://gerrit.wikimedia.org/r/284649
[08:39:00] (PS1) Filippo Giunchedi: Revert "varnish: switch upload codfw from 'eqiad' to 'direct'" [puppet] -
https://gerrit.wikimedia.org/r/284650
[08:39:02] (PS1) Filippo Giunchedi: Revert "varnish: route upload backends to codfw" [puppet] - https://gerrit.wikimedia.org/r/284651
[08:39:04] (PS1) Filippo Giunchedi: Revert "Set synchronous swift writes for eqiad/codfw" [mediawiki-config] - https://gerrit.wikimedia.org/r/284652
[08:49:58] (CR) Mobrovac: [C: -1] Automate the generation deployment keys (keyholder-managed ssh keys) (6 comments) [puppet] - https://gerrit.wikimedia.org/r/284418 (https://phabricator.wikimedia.org/T133211) (owner: 20after4)
[08:51:58] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet).
[08:57:58] (PS1) Muehlenhoff: Update font package dependency in toollabs::exec_environ [puppet] - https://gerrit.wikimedia.org/r/284653
[08:58:47] <_joe_> jynus: I merged your change on strontium right now, it failed
[08:58:57] <_joe_> when you puppet-merged
[08:59:16] <_joe_> so now it should be stable
[09:00:28] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
[09:01:22] (PS1) KartikMistry: Read config from registry.yaml from cxserver [puppet] - https://gerrit.wikimedia.org/r/284654 (https://phabricator.wikimedia.org/T122498)
[09:06:00] sorry about that, I didn't see that
[09:06:14] I am in the middle of an important failure
[09:11:20] where can I find ssh fingerprint of gerrit?
[09:11:28] (PS1) Muehlenhoff: Add additional Gujarati fonts (Rekha) (fonts-gujr-extra) [puppet] - https://gerrit.wikimedia.org/r/284655 (https://phabricator.wikimedia.org/T129500)
[09:12:04] (PS2) KartikMistry: WIP: Read config from registry.yaml from cxserver [puppet] - https://gerrit.wikimedia.org/r/284654 (https://phabricator.wikimedia.org/T122498)
[09:13:38] PROBLEM - puppet last run on lvs4003 is CRITICAL: CRITICAL: Puppet has 1 failures
[09:20:42] (Abandoned) KartikMistry: WIP: cxserver: Read config from cxserver/deploy [puppet] - https://gerrit.wikimedia.org/r/278235 (https://phabricator.wikimedia.org/T122498) (owner: KartikMistry)
[09:21:30] (PS3) KartikMistry: WIP: Read config from cxserver [puppet] - https://gerrit.wikimedia.org/r/284654 (https://phabricator.wikimedia.org/T122498)
[09:33:48] Operations, Performance-Team: Update memcached package and configuration options - https://phabricator.wikimedia.org/T129963#2226570 (elukey) The new version seems to be in Sid: https://packages.debian.org/sid/memcached What about: 1) test the package quickly in labs to make sure that a manual install...
[09:38:47] RECOVERY - puppet last run on lvs4003 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures
[09:45:51] (CR) KartikMistry: [C: 1] Add additional Gujarati fonts (Rekha) (fonts-gujr-extra) [puppet] - https://gerrit.wikimedia.org/r/284655 (https://phabricator.wikimedia.org/T129500) (owner: Muehlenhoff)
[09:47:22] (CR) Jcrespo: [C: 1] switchover: switch (s1-s7, x1) master role to eqiad [puppet] - https://gerrit.wikimedia.org/r/284514 (https://phabricator.wikimedia.org/T133205) (owner: Volans)
[09:48:51] Operations, Performance-Team: Update memcached package and configuration options - https://phabricator.wikimedia.org/T129963#2226591 (elukey) a:elukey
[09:53:17] (PS1) Volans: Add parameter to specify larger threadpools [puppet/mariadb] - https://gerrit.wikimedia.org/r/284662 (https://phabricator.wikimedia.org/T133265)
[09:54:17] (CR) Volans: "This just add an optional parameter, the other changes are in the main puppet repo." [puppet/mariadb] - https://gerrit.wikimedia.org/r/284662 (https://phabricator.wikimedia.org/T133265) (owner: Volans)
[09:55:04] (CR) Jcrespo: [C: -1] "Parameters do not scale. While eventually we need a solution for better parameter handling, editing a different my.cf should be preferred." [puppet/mariadb] - https://gerrit.wikimedia.org/r/284662 (https://phabricator.wikimedia.org/T133265) (owner: Volans)
[09:57:23] !log removed apache2 logrotate config manually from argon as temp patch to remove cronspam from root@ (T132896)
[09:57:24] T132896: cronspam from argon - apache2 logrotate - https://phabricator.wikimedia.org/T132896
[09:57:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:04:18] (CR) Volans: "I'll do it, but this means that all changes to production.my.cnf will need to be duplicated to the separated my.cnf."
[puppet/mariadb] - https://gerrit.wikimedia.org/r/284662 (https://phabricator.wikimedia.org/T133265) (owner: Volans)
[10:04:49] (PS1) Muehlenhoff: Update to 4.4.8 [debs/linux44] - https://gerrit.wikimedia.org/r/284665
[10:06:24] (CR) Jcrespo: "Either we need only 1 config and the changes actually apply to all core servers, or they have fundamentally different roles and should be " [puppet/mariadb] - https://gerrit.wikimedia.org/r/284662 (https://phabricator.wikimedia.org/T133265) (owner: Volans)
[10:07:29] I would add on core an if for es* shards and apply a different config
[10:07:47] akosiaris: cxserver can now read from registry.yaml, beta works OK. I'll test more and let you know :)
[10:07:49] is what I'm doing now
[10:07:59] eventually, we may(?) want to apply those changes to all servers, if they actually work better
[10:08:15] or make them hardware-dependent
[10:09:04] but I think es vs s* are worth having their own config
[10:09:25] es are HDs, and es will soon be all SSDs, for example
[10:10:18] a different story is config management - I agree that needs fixing
[10:10:33] maybe a multi-evaluated parameter
[10:10:53] (PS1) Volans: MariaDB: separate external storage my.cnf [puppet] - https://gerrit.wikimedia.org/r/284666 (https://phabricator.wikimedia.org/T133265)
[10:10:59] current parameters are almost all "transitioning", and should be eliminated
[10:11:04] (Abandoned) Volans: Add parameter to specify larger threadpools [puppet/mariadb] - https://gerrit.wikimedia.org/r/284662 (https://phabricator.wikimedia.org/T133265) (owner: Volans)
[10:11:21] SSL,p_s,ROW should be default eventually
[10:11:46] (PS2) Volans: MariaDB: separate external storage my.cnf [puppet] - https://gerrit.wikimedia.org/r/284666 (https://phabricator.wikimedia.org/T133265)
[10:13:08] (CR) Volans: "@JCrespo: unfortunately due to the new file you need to do a manual diff with the current one."
[puppet] - https://gerrit.wikimedia.org/r/284666 (https://phabricator.wikimedia.org/T133265) (owner: Volans)
[10:17:17] (PS2) Muehlenhoff: Update to 4.4.8 [debs/linux44] - https://gerrit.wikimedia.org/r/284665
[10:21:39] (Draft1) Filippo Giunchedi: graphite: port to jessie/systemd [puppet] - https://gerrit.wikimedia.org/r/211685 (https://phabricator.wikimedia.org/T132717)
[10:21:41] (PS5) Volans: MariaDB: separate external storage my.cnf [puppet] - https://gerrit.wikimedia.org/r/284666 (https://phabricator.wikimedia.org/T133265)
[10:22:51] let's put 40 explicitly
[10:24:40] why? at each new server it might be wrong
[10:24:55] we only have 1 type of server
[10:25:26] and we have to switch it from the default some months ago
[10:25:39] because it was causing bottleneck issues
[10:25:57] *had
[10:26:54] (CR) Muehlenhoff: [C: 2 V: 2] Update to 4.4.8 [debs/linux44] - https://gerrit.wikimedia.org/r/284665 (owner: Muehlenhoff)
[10:27:54] (PS6) Volans: MariaDB: separate external storage my.cnf [puppet] - https://gerrit.wikimedia.org/r/284666 (https://phabricator.wikimedia.org/T133265)
[10:27:54] https://phabricator.wikimedia.org/rOPUPb4113805e638bfaa0794d93543d0a2b89a6bf24c
[10:29:45] ok, then if this applies to es too we could increase it even more...
but agree with leaving the number of cores for now
[10:30:06] probably it may need to be longer, but I agree
[10:30:24] it was more for not doing the same to the other servers
[10:30:31] plus it is more legible
[10:30:55] puppet compiler running, to check it changes only es* and not db*
[10:30:56] if we buy new servers (will not happen in 3-5 years) we will need to tune things no matter what
[10:36:53] (PS1) Jcrespo: Update dns records for new eqiad masters [dns] - https://gerrit.wikimedia.org/r/284667
[10:38:29] (PS1) Volans: Skip SSL options when using socket connection [puppet/mariadb] - https://gerrit.wikimedia.org/r/284668 (https://phabricator.wikimedia.org/T111654)
[10:39:50] ^that title and the change do not match
[10:41:12] kart_: that sounds great!
[10:41:15] that will apply to all mysql clients, no matter if using socket or not
[10:42:10] (CR) Volans: "compiler results for es*:" [puppet] - https://gerrit.wikimedia.org/r/284666 (https://phabricator.wikimedia.org/T133265) (owner: Volans)
[10:43:24] jynus: true if one specifies the -h option to connect to a host, the socket is defined the line above
[10:43:28] akosiaris: beta seems fine, what is the best way to test Puppet patch in Production?
[10:44:01] akosiaris: ie https://gerrit.wikimedia.org/r/#/c/284654/
[10:44:59] you should do as I do, create an alias for quickly connecting to the local mysql
[10:45:14] but leave mysql and all its clients forcing SSL
[10:45:20] by default
[10:45:35] otherwise, all the work we have done will not be worth it
[10:46:04] maybe educate ops on mysql + TLS
[10:46:42] kart__: puppet compiler, lemme run the change for you
[10:46:43] or let them decide, but "disable encryption by default" seems like a wrong policy
[10:46:57] and if we remove the socket and set the host to the FQDN of itself by default? with SSL of course
[10:47:08] mmm
[10:47:10] not good for maintenance in case there are issues though
[10:47:12] could be
[10:47:12] akosiaris: ok. thanks!
[10:47:17] yes
[10:47:22] speed also
[10:47:31] lets talk in private
[11:05:45] (Abandoned) Volans: Skip SSL options when using socket connection [puppet/mariadb] - https://gerrit.wikimedia.org/r/284668 (https://phabricator.wikimedia.org/T111654) (owner: Volans)
[11:05:49] (PS1) Muehlenhoff: Move from python-irclib to python-irc [puppet] - https://gerrit.wikimedia.org/r/284672
[11:07:03] (PS4) KartikMistry: WIP: Read config from cxserver [puppet] - https://gerrit.wikimedia.org/r/284654 (https://phabricator.wikimedia.org/T122498)
[11:07:54] akosiaris: also, patch depends on deploying of cxserver, so tricky. but, we can still test in puppet-compiler.
[11:08:24] kart__: the only thing we can test in puppet compiler is the correctness of the puppet config
[11:08:36] so beta is essential here
[11:08:44] yep.
[11:09:33] akosiaris: registry is fine, as it works in beta. Last option is to just use single registry for beta/production.
[11:10:19] akosiaris: once puppet compiler is done, I'll cherry-pick in beta to test.
[11:10:34] then we can deploy in cxserver and merge the patch.
[11:11:59] (PS2) Muehlenhoff: Move from python-irclib to python-irc [puppet] - https://gerrit.wikimedia.org/r/284672 (https://phabricator.wikimedia.org/T133101)
[11:15:08] (CR) Faidon Liambotis: Move from python-irclib to python-irc (1 comment) [puppet] - https://gerrit.wikimedia.org/r/284672 (https://phabricator.wikimedia.org/T133101) (owner: Muehlenhoff)
[11:17:06] PROBLEM - Host mr1-eqiad.oob is DOWN: PING CRITICAL - Packet loss = 100%
[11:17:18] PROBLEM - Host mr1-eqiad.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[11:17:47] ugh
[11:17:57] (CR) Alexandros Kosiaris: [C: 1] "Puppet compiler says exactly what we expected it will happen, so +1ing this" [puppet] - https://gerrit.wikimedia.org/r/284654 (https://phabricator.wikimedia.org/T122498) (owner: KartikMistry)
[11:23:27] RECOVERY - Host mr1-eqiad.oob is UP: PING OK - Packet loss = 0%, RTA = 2.30 ms
[11:23:36] RECOVERY - Host mr1-eqiad.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 12.34 ms
[11:24:13] akosiaris: thanks. cherry-picking in beta for more testing.
[11:33:36] (CR) Volans: "NOOP results for db*" [puppet] - https://gerrit.wikimedia.org/r/284666 (https://phabricator.wikimedia.org/T133265) (owner: Volans)
[11:39:50] oh wait, is the switch back happening today?
[11:40:02] addshore: yes
[11:40:18] AFAIK
[11:40:29] (PS6) Filippo Giunchedi: graphite: port to jessie/systemd [puppet] - https://gerrit.wikimedia.org/r/211685 (https://phabricator.wikimedia.org/T132717)
[11:40:29] would it be worth adding something about this to the sequence of events (after everything) https://phabricator.wikimedia.org/T133048
[11:40:42] (CR) Muehlenhoff: Move from python-irclib to python-irc (1 comment) [puppet] - https://gerrit.wikimedia.org/r/284672 (https://phabricator.wikimedia.org/T133101) (owner: Muehlenhoff)
[11:40:42] I expect it will be needed again
[11:41:16] _joe_, paravoid ^^^
[11:41:47] addshore: we discussed it yesterday, it seems it won't be needed (much) today
[11:42:16] oohh, okay, how so?
[11:42:36] <_joe_> addshore: actually since we don't expect to lose RC events, I got from you it should not happen again
[11:42:44] <_joe_> did I misunderstand that?
[11:43:03] ahh, this is caused by a different issue (nothing relating to RC)
[11:43:26] <_joe_> oh, ok, anyways I can run the script again post - migration
[11:43:36] yeh, It will probably need it
[11:44:27] it must be something to do with the initial db write for page creation happening before read only but then the next write adding it to wikibase tables is after the readonly and thus doesn't get written. something could probably be fixed there but no doubt it would take a while to track down
[11:44:42] <_joe_> for almost 500 objects?
[11:44:44] <_joe_> uhm
[11:45:03] yeh, that's what I thought ;)
[11:45:07] <_joe_> maybe some automated script didn't die properly
[11:45:26] hmm, automated script?
[11:45:50] <_joe_> we have several crons running for wikidata
[11:46:08] ahh yes, this would also be nothing to do with those :)
[11:46:11] <_joe_> but i see they are for synchronization
[11:46:13] <_joe_> yep
[11:48:29] <_joe_> addshore: mystery solved, most of those edits are by a bot
[11:49:19] but why would that cause different behaviour to a user?
(or just because there are more of them the chances and number of occurrences is higher)?
[11:49:31] <_joe_> the latter
[11:49:48] ahh yes, that will always be the case for wikidata though ;)
[11:49:54] <_joe_> and yes, all those edits are in a 2-minute timespan around the readonly set
[11:50:20] <_joe_> anyways, we will run the script after the dust is settled
[11:50:26] <_joe_> better, I will
[11:50:47] I was thinking it would be nice if mediawiki had 2 levels of readonly, the first only for user / main db writes and the second for ALL writes, but meh complicated stuff
[11:51:04] just some odd transactions or something happening....
[11:55:49] (PS5) KartikMistry: WIP: Read config from cxserver [puppet] - https://gerrit.wikimedia.org/r/284654 (https://phabricator.wikimedia.org/T122498)
[12:00:51] Operations, Wikimedia-Language-setup, Wikimedia-Site-requests: Rename Võro Wikipedia, fiu-vro -> vro - https://phabricator.wikimedia.org/T31186#2226832 (Kaarel_Vaidla) Just pinging. be-tarask seems to be ok, but were there any significant problems? Would it be possible to have fiu-vro changed to vro...
[12:04:46] addshore, we are actually doing that this time
[12:04:51] (manually)
[12:05:04] parsercache will be only read only for a few seconds
[12:05:08] ahh awesome! _joe_ in that case it will be interesting to see if the script has to add anything
[12:05:39] the problem is some traffic may still be going to the wrong datacenter at that time
[12:05:57] so it will reduce the number of errors, but not cancel them completely
[12:06:58] mediawiki deploys, and in general, connections shifting are not instant
[12:08:01] the actual fix, if I understood it rightly, is to make jobs more idempotent/transactional
[12:09:14] (CR) Jcrespo: [C: 1] "Let's do this/apply in a hot way."
[puppet] - https://gerrit.wikimedia.org/r/284666 (https://phabricator.wikimedia.org/T133265) (owner: Volans)
[12:09:21] <_joe_> addshore: we did first put mediawiki in read-only, and then the databases in read-only
[12:10:19] I assume the errors come from killing/failing jobs
[12:10:34] from mediawiki tables to wikidata tables
[12:10:38] ann jynus this one had nothing to do with jobs
[12:10:44] oh
[12:10:55] so how did that happen?
[12:11:13] 12:44 PM it must be something to do with the initial db write for page creation happening before read only but then the next write adding it to wikibase tables is after the readonly and thus doesn't get written. something could probably be fixed there but no doubt it would take a while to track down
[12:11:34] mediawiki-user traffic should not write anything, and if it does, revert transactionally
[12:11:50] (in an ideal world, obviously)
[12:11:54] yeh, I'm guessing something is askew with the transactions
[12:12:17] maybe doing those in a job may help, async, so if it fails it can be retried, then?
[12:12:41] I am talking without knowing anything, so ignore me
[12:13:01] hehe, well it's on page creation, so if it were a job you wouldn't be able to use the page until the job had run
[12:13:33] but I assume it is a wikidata thing, so not essential for mere page viewing
[12:13:38] just for usage
[12:13:44] <_joe_> does wikidata shell out during page creation?
[12:13:56] <_joe_> as in executing a php script from outside the web context?
[12:14:05] we do things like red links in jobs, and that seems almost immediate
[12:14:06] well, the issue was with page rendering, as mediawiki said the page existed as it was in the page table but then wikibase went, oh no it isn't! ;)
[12:14:09] <_joe_> I can't see any reason for this happening unless that is the case
[12:14:12] _joe_: nop
[12:14:30] I'm just taking another look now to see if there is anything evidently wrong
[12:15:54] does mere rendering depend on wikidata?
I always assumed mediawiki-core was used as a mere interface to the actual wikidata storage [12:16:25] only templates would depend on wikidata tables [12:16:37] (external references) [12:16:47] well, for example, wikibase doesn't know what entityID the page has if it is not in the wikibase tables [12:17:00] ah, so rendering depends on it [12:17:05] yes [12:17:13] difficult to fix then [12:17:24] but maybe something can be done [12:17:29] yeh, although it could at least give a better error in the future not just BadMethodCall ;) [12:17:37] even if it is the "fail faster" philosophy [12:17:46] well, if everything is in a single transaction it shouldn't happen... [12:17:46] exactly [12:18:25] brandon expressed that recently, we need to be able to fail more reliably too [12:18:52] (03PS2) 10Muehlenhoff: Configure an rsync server which is used to synchronise the AEAD key files between the auth servers [puppet] - 10https://gerrit.wikimedia.org/r/283663 [12:28:27] (03PS3) 10Volans: Set codfw databases to read-only mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284402 (owner: 10Giuseppe Lavagetto) [12:29:01] (03CR) 10Volans: [C: 032] MariaDB: separate external storage my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/284666 (https://phabricator.wikimedia.org/T133265) (owner: 10Volans) [12:48:45] (03PS3) 10Muehlenhoff: Configure an rsync server which is used to synchronise the AEAD key files between the auth servers [puppet] - 10https://gerrit.wikimedia.org/r/283663 [12:50:24] (03PS1) 10BBlack: Revert "codfw switch: eqiad text caches -> codfw" [puppet] - 10https://gerrit.wikimedia.org/r/284687 [12:50:26] (03PS1) 10BBlack: Revert "codfw switch: esams text caches -> codfw" [puppet] - 10https://gerrit.wikimedia.org/r/284688 [12:50:28] (03PS1) 10BBlack: Revert "codfw switch: codfw text caches -> direct" [puppet] - 10https://gerrit.wikimedia.org/r/284689 [12:52:18] I am going to deploy one last mediawiki change [12:53:18] (03PS1) 10BBlack: Revert "codfw switch: 
geodns depool text services from eqiad" [dns] - 10https://gerrit.wikimedia.org/r/284692 [12:55:13] (03CR) 10Muehlenhoff: [C: 032 V: 032] Configure an rsync server which is used to synchronise the AEAD key files between the auth servers [puppet] - 10https://gerrit.wikimedia.org/r/283663 (owner: 10Muehlenhoff) [12:59:57] (03PS1) 10Filippo Giunchedi: Revert "depool upload/eqiad for codfw switchover" [dns] - 10https://gerrit.wikimedia.org/r/284694 [13:00:29] akosiaris: more testing done, we should be ready tomorrow to deploy. [13:02:16] 06Operations, 06WMF-Legal, 07Privacy: Consider moving policy.wikimedia.org away from WordPress.com - https://phabricator.wikimedia.org/T132104#2227051 (10Slaporte) >>! In T132104#2215041, @Dzahn wrote: > @Slaporte I am able to help with setting up a simple static site on our servers. Thanks! I can share a g... [13:07:01] (03PS1) 10Jcrespo: Temporarely increase es1* master weight to add connection capacity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284695 [13:10:05] (03CR) 10Volans: [C: 031] Temporarely increase es1* master weight to add connection capacity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284695 (owner: 10Jcrespo) [13:10:24] (03PS2) 10BBlack: Revert "codfw switch: esams text caches -> codfw" [puppet] - 10https://gerrit.wikimedia.org/r/284688 [13:10:26] (03PS2) 10BBlack: Revert "codfw switch: codfw text caches -> direct" [puppet] - 10https://gerrit.wikimedia.org/r/284689 [13:10:28] (03PS2) 10BBlack: Revert "codfw switch: eqiad text caches -> codfw" [puppet] - 10https://gerrit.wikimedia.org/r/284687 [13:10:54] (03CR) 10Jcrespo: [C: 032] Temporarely increase es1* master weight to add connection capacity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284695 (owner: 10Jcrespo) [13:12:40] 06Operations, 10Traffic: Fix apache-2.4 + DHE ciphersuites issue - https://phabricator.wikimedia.org/T133217#2227101 (10BBlack) @MoritzMuehlenhoff - if you think it's not much overhead and want to take on packaging 
jessie's apache-2.4 built against our openssl-1.0.2, that would be awesome :) [13:12:44] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Temporarely increase es1* master weight to add connection capacity (duration: 00m 37s) [13:12:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:13:49] note: in ~7 minutes, we'll be switching traffic-layer stuff back to it's normal eqiad-enabled state. this should be non-user-impacting other than latency effects and such. [13:14:13] (this is ahead of actual services, which still start at ~14:00) [13:15:55] 06Operations, 10Traffic: Fix apache-2.4 + DHE ciphersuites issue - https://phabricator.wikimedia.org/T133217#2227146 (10MoritzMuehlenhoff) Sure, I can do that next week. [13:18:24] (03CR) 10Filippo Giunchedi: "puppet compiler change: https://puppet-compiler.wmflabs.org/2528/" [puppet] - 10https://gerrit.wikimedia.org/r/211685 (https://phabricator.wikimedia.org/T132717) (owner: 10Filippo Giunchedi) [13:20:07] (03CR) 10BBlack: [C: 032] Revert "codfw switch: eqiad text caches -> codfw" [puppet] - 10https://gerrit.wikimedia.org/r/284687 (owner: 10BBlack) [13:20:12] [traffic codfw switch revert #1] - merge -> start salted puppet [13:21:41] !log ori@tin Synchronized php-1.27.0-wmf.21/includes: Ie9799f5ea: Make MessageCache handle lock timeouts better (duration: 01m 18s) [13:21:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:22:54] [traffic codfw switch revert #1] - done & confirmed [13:23:14] !log [traffic codfw switch revert #1] - merge -> start salted puppet (@13:20, late log) [13:23:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:23:22] up-arrowing broken messages ftw [13:23:27] !log [traffic codfw switch revert #1] - done & confirmed [13:23:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:23:51] (03CR) 10BBlack: [C: 032] Revert "codfw switch: esams text caches 
-> codfw" [puppet] - 10https://gerrit.wikimedia.org/r/284688 (owner: 10BBlack) [13:24:06] !log [traffic codfw switch revert #2] - merge -> start salted puppet [13:24:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:24:44] (03PS1) 10Giuseppe Lavagetto: apache-fast-test: typo fix [puppet] - 10https://gerrit.wikimedia.org/r/284697 [13:24:57] PROBLEM - HHVM jobrunner on mw2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:25:36] (03CR) 10Volans: [C: 031] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/284667 (owner: 10Jcrespo) [13:25:38] (03CR) 10BBlack: [C: 032] Revert "codfw switch: geodns depool text services from eqiad" [dns] - 10https://gerrit.wikimedia.org/r/284692 (owner: 10BBlack) [13:25:51] !log [traffic codfw switch revert #3] - merge -> authdns-update [13:25:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:27:14] !log [traffic codfw switch revert #2] - done & confirmed [13:27:17] <_joe_> I see traffic flowing through eqiad [13:27:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:28:13] <_joe_> and decreasing on codfw [13:28:24] (03CR) 10BBlack: [C: 032] Revert "codfw switch: codfw text caches -> direct" [puppet] - 10https://gerrit.wikimedia.org/r/284689 (owner: 10BBlack) [13:28:34] !log [traffic codfw switch revert #4] - merge -> start salted puppet [13:28:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:29:43] _joe_: it is quite impressive isn't it ? 
[13:29:45] will take until around :36 or so to see bulk of traffic move from #2 (10-minute DNS TTL) [13:30:17] sorry that's s/#2/#3/ above [13:30:22] <_joe_> bblack: esams traffic is flowing through eqiad already, right [13:31:06] !log [traffic codfw switch revert #4] - done & confirmed [13:31:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:31:18] yeah esams + codfw both backend to eqiad now [13:31:24] <_joe_> ok [13:31:30] eqiad does all the hits to mw (.codfw) right now [13:31:49] users are still moving on the front edge from codfw -> eqiad on the DNS TTL (not all, just the normal split, which is most) [13:32:53] <_joe_> yup I confirmed with httpry [13:32:57] <_joe_> on one appserver [13:32:58] http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&c=Text+caches+codfw&m=cpu_report&s=by+name&mc=2&g=network_report [13:33:02] http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&c=Text+caches+eqiad&m=cpu_report&s=by+name&mc=2&g=network_report [13:33:10] ^ traffic shifting between text caches codfw/eqiad [13:33:17] (03PS1) 10Ori.livneh: Enable MessageCacheError log channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284698 [13:35:53] (03CR) 10Ori.livneh: [C: 032] Enable MessageCacheError log channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284698 (owner: 10Ori.livneh) [13:37:13] !log [traffic codfw switch revert #3] - DNS TTL done, bulk of end-user traffic rebalanced, graphs starting to level off at new normals, as done as it gets from our end [13:37:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:37:21] that's it for the traffic pre-steps [13:37:26] <_joe_> cool [13:38:34] (03Merged) 10jenkins-bot: Enable MessageCacheError log channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284698 (owner: 10Ori.livneh) [13:39:39] !log ori@tin Synchronized wmf-config/InitialiseSettings.php: I2171f6b1: Enable MessageCacheError log channel (duration: 
00m 25s) [13:39:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:40:23] I am getting the following network error when trying to move tasks on the wikidata board on phabricator: [13:40:24] Maximum execution time of 10 seconds exceeded [13:40:24] /srv/deployment/phabricator/deployment-cache/revs/7dd45143c333b8fb854b8f40bd96c46ea56a0970/libphutil/src/xsprintf/xsprintf.php:45 [13:40:36] (03PS3) 10Giuseppe Lavagetto: switchover: make jobrunners in codfw stop [puppet] - 10https://gerrit.wikimedia.org/r/284403 [13:40:43] T-20 minutes [13:40:55] if you haven't looked at it today, please do familiarize yourself with https://wikitech.wikimedia.org/wiki/Switch_Datacenter [13:41:09] Lydia_WMDE: Hi! Could you file a task? We're all distracted with the scheduled switchover at the moment, so it's unlikely anyone would be available to take a look right now. [13:41:13] it's split into phases now [13:41:29] ori: ok :) [13:41:53] the phases will be done serially, but anything within a phase should be done in parallel [13:42:00] I'll be pinging people accordingly [13:42:21] thanks, paravoid [13:42:40] <_joe_> should we start the preparation at least for databases? [13:42:44] as always, please use !log whenever you do something [13:43:10] phases and steps within them are all numbered with arabic numerals at the moment, which may cause extra confusion [13:43:36] <_joe_> I'll use #phase.step as a notation when logging [13:43:38] let's use the #1/#1 notation to disambiguate things [13:43:39] <_joe_> so eg [13:43:43] <_joe_> oh ok [13:43:51] heh [13:43:54] ok [13:45:35] i've got a brief announcement of "we're starting" written [13:46:31] "We're starting. 
--mark" [13:46:41] haha [13:46:43] <_joe_> paravoid: I see preparation for databases is a lot of steps [13:46:48] (03PS3) 10Volans: switchover: switch (s1-s7, x1) master role to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/284514 (https://phabricator.wikimedia.org/T133205) [13:47:02] we'll start with that at :50 [13:47:05] if volans is ok with it [13:47:06] <_joe_> ok [13:47:13] 10 minutes should be plenty of time :) [13:47:23] paravoid: I'm ready whenever we are [13:47:29] <_joe_> and stop jobqueues / maintenance at :55 maybe? [13:47:56] yep [13:48:44] i'll do the job queues like last time [13:48:50] <_joe_> yes [13:49:29] jynus/volans: anything needed for 1/1, warm up databases? [13:49:41] no [13:49:44] awesome [13:49:50] actually, that phase is now Krinkle's [13:49:54] so to be deleted [13:50:07] not now :) [13:50:12] !log commencing codfw->eqiad datacenter switchover [13:50:13] agree [13:50:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:50:18] what what ? [13:50:31] Krinkle: your warmup script [13:50:35] volans: go [13:50:39] paravoid: memcached cleared? [13:50:44] <_joe_> Krinkle: not now [13:50:44] Krinkle: not yet [13:50:47] <_joe_> later :P [13:50:50] ack [13:50:54] ok, then now my warmup script ? 
[13:50:56] !log [switchover #1/#4] Disable puppet on all eqiad and codfw databases masters [13:50:56] not* [13:50:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:51:38] _joe_, ori: prepare for your phase 1 parts [13:52:06] <_joe_> I'll merge puppet [13:52:24] !log [switchover #1/#5] Set final $master status for databases in advance [13:52:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:52:37] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] switchover: make jobrunners in codfw stop [puppet] - 10https://gerrit.wikimedia.org/r/284403 (owner: 10Giuseppe Lavagetto) [13:52:49] (03PS3) 10Giuseppe Lavagetto: switchover: disable maintenance scripts in codfw [puppet] - 10https://gerrit.wikimedia.org/r/284404 [13:52:50] i'll send out my mail [13:52:52] mark: we've started, readonly mode expected at ~:00 [13:52:52] _joe_: you beated me... [13:52:54] heh [13:53:07] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] switchover: disable maintenance scripts in codfw [puppet] - 10https://gerrit.wikimedia.org/r/284404 (owner: 10Giuseppe Lavagetto) [13:53:13] (03PS4) 10Volans: switchover: switch (s1-s7, x1) master role to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/284514 (https://phabricator.wikimedia.org/T133205) [13:53:17] <_joe_> volans: my changes will take effect [13:53:19] <_joe_> your not [13:53:22] [Lag: 17.25] [13:53:24] for the love of god [13:53:25] <_joe_> so let me puppet-merge everything [13:53:29] <_joe_> paravoid: right on time [13:53:31] <_joe_> ori: ready? [13:53:35] yes [13:53:36] <_joe_> I'll merge puppet now [13:53:37] joe add mine to [13:53:43] everyone make sure to be signed in in hangouts [13:53:56] what mark said [13:54:07] <_joe_> ori: merged [13:54:23] "Unable to sign in. Please try again." 
(hangouts) [13:54:30] (03CR) 10Volans: [C: 032] switchover: switch (s1-s7, x1) master role to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/284514 (https://phabricator.wikimedia.org/T133205) (owner: 10Volans) [13:54:35] <_joe_> !log [switchover #1/3] stopping crons on wasat [13:54:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:54:52] !log [switchover #1/2] stopping jobrunners in codfw [13:54:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:55:11] puppet merge done [13:55:13] volans, everything ok? [13:55:16] so far yes [13:55:18] great [13:55:23] !log [switchover #1/#6] Switch pt-heartbeat from active site (codfw) to new site (eqiad) masters [13:55:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:55:28] kinda jumped the gun there [13:56:00] (monitoring icinga to avoid spam) [13:56:14] I have icinga, grafana and fluorine open, but the more eyes the better [13:56:50] everything in phase #1 done, correct? [13:56:50] * ori was abled to sign on to hangouts in safari [13:57:08] wait for volans confirmation (that may take some time) [13:57:10] yes [13:57:17] ori: will you do phase 2? [13:57:30] starting pt-heartbeat is not returning from salt checking [13:57:41] yep [13:57:46] (not yet, obviously) [13:57:50] nod [13:58:00] I said not to trust salt :-) [13:58:07] very slow to start, someone is starting [13:58:40] should I downtime all lag alerts? [13:59:04] volans: awaiting for your ack [13:59:11] <_joe_> I am a go with crons [13:59:11] * Krinkle monitors logstash [13:59:13] T-1 [13:59:14] so slow... 
[13:59:18] I am disabling alerts [13:59:18] jynus: better if you can [13:59:25] half started [13:59:40] * ori waits for go-ahead from paravoid [13:59:55] (03PS4) 10Ori.livneh: Set codfw databases to read-only mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284402 (owner: 10Giuseppe Lavagetto) [14:00:09] volans, jynus: can we proceed with phase 2? [14:00:23] !log disabled all db lag alerts [14:00:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:00:30] you can continue, paravoid [14:00:33] ori: go [14:00:35] jynus: Yeah, I see the logstash warnigns now too "SqlBagOStuff::setMulti" fails for parser cache setting on all mainpage tests [14:00:46] (03CR) 10Ori.livneh: [C: 032 V: 032] Set codfw databases to read-only mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284402 (owner: 10Giuseppe Lavagetto) [14:00:58] I'm doing them manually [14:00:59] volans I can take over if you want to continue with that [14:01:07] ok? [14:01:25] if you want ok, thanks [14:01:45] get the commands from cat /home/volans/eqiad-start-pt-heartbeat.sh jynus [14:01:49] volans, _joe_, Krinkle: prepare for phase 3 (pending my go-ahead) [14:01:55] paravoid: ok [14:01:59] !log ori@tin Synchronized wmf-config/db-codfw.php: [switchover #2/#1] Id8b2e7a05: Set codfw databases to read-only mode (duration: 00m 24s) [14:02:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:02:11] !log wikis now in planned read-only mode, cf. http://blog.wikimedia.org/2016/04/18/wikimedia-server-switch/ [14:02:13] confirmed anonymous edit on enwiki -> shows readonly warning box [14:02:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:02:24] jynus: don't use salt, is not returning, dunno why, ssh and do it manually, it's quicker [14:02:28] On English Wikipedia, I just tried to thank someone and got a readonly warning. [14:02:43] Because the wikis are readonly. 
:) [14:02:44] recent-changes has stopped [14:02:45] I see that yall are looking at this now. [14:02:47] :) [14:02:48] let's proceed with phase 3 [14:02:55] volans: go [14:02:57] jynus: I verified the parsercache exception does not bubble to the user. [14:03:09] <_joe_> !log [swichover 3/2 wipe memcacheds] [14:03:10] _joe_: proceed with memcache wipe [14:03:10] no error page. [14:03:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:03:15] !log [switchover #3/#1] Set active site's databases (masters) in read-only mode except parsercache ones. [14:03:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:03:31] Krinkle: proceed with warming up eqiad's caches [14:03:42] <_joe_> paravoid: we need eqiad parsercaches to be r-w [14:03:54] <_joe_> I just realized [14:04:02] <_joe_> jynus, volans ^^ [14:04:13] they will be enabled after traffic switchover [14:04:16] !log krinkle@tin: bin/apache-fast-test wiki-urls-warmup1000.txt eqiad [14:04:18] <_joe_> Krinkle: hold on [14:04:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:04:21] <_joe_> sigh [14:04:22] ok [14:04:24] not started yet [14:04:34] _joe_: pending [14:04:37] <_joe_> jynus: Krinkle script can only run after parsercaches are rw [14:04:39] RO now DB done [14:04:40] cannot be done yet, they are rw on codfw still [14:04:46] we'll do it later then? [14:04:49] <_joe_> yes [14:04:49] yes [14:04:54] <_joe_> let's do the script later [14:04:55] let's proceed [14:05:04] _joe_, ori: proceed with phase 4 [14:05:11] puppet/mw-config respectively [14:05:16] <_joe_> I do puppet [14:05:19] is 3.2 done ? 
[14:05:26] yes it is [14:05:28] ok [14:05:30] (03PS2) 10Ori.livneh: Switch wmfMasterDatacenter to eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284398 (owner: 10Giuseppe Lavagetto) [14:05:41] (03PS2) 10Giuseppe Lavagetto: switchover: set mediawiki master datacenter to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/284397 [14:05:51] everyone: prepare for phase 5 [14:05:54] (03CR) 10Ori.livneh: [C: 032 V: 032] Switch wmfMasterDatacenter to eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284398 (owner: 10Giuseppe Lavagetto) [14:06:00] never saw a log line for #3/#2 which is why I asked [14:06:02] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] switchover: set mediawiki master datacenter to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/284397 (owner: 10Giuseppe Lavagetto) [14:06:03] Krinkle, wait until 5.5 [14:06:12] yup [14:06:18] akosiaris: there was one above, no # :) [14:06:19] paravoid jynus ping me when 5.5 needs to be done if moved [14:06:29] jynus: yeah, assumed as much. right before app server traffic [14:06:46] bblack: hold for Krinkle for 5.6 [14:06:50] <_joe_> !log [switchover #4/1] puppet merged [14:06:51] paravoid: yup, my bad eyes.. I 'll go the the doctor next week [14:06:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:06:55] bblack: i.e. 
wait for him to ack [14:06:59] paravoid: ack [14:07:15] !log ori@tin Synchronized wmf-config/CommonSettings.php: [switchover #4/#2] I0e85c3d20: Switch wmfMasterDatacenter to eqiad (duration: 00m 26s) [14:07:18] (03PS2) 10BBlack: switchover: switch api/appservers/rendering varnish routing from codfw to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/284400 (owner: 10Giuseppe Lavagetto) [14:07:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:07:20] <_joe_> we can go with phase 5 [14:07:21] just rebasing [14:07:21] ok, proceeding with phase 5 [14:07:28] _joe_: redis [14:07:30] <_joe_> !log [switchover #5/1] switching redis replication manually [14:07:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:07:37] mobrovac: restbase, akosiaris: sc* [14:07:41] subbu: parsoid [14:07:48] volans: parsercache r/w [14:07:48] ok. going. [14:07:51] kk [14:07:57] !log [switchover #5/#5] Switch parsercache RO/RW [14:08:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:08:04] !log syncing parsoid code [14:08:07] Krinkle: wait for volans to ack, and proceed with warmup [14:08:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:08:31] _joe_: there is a typo swicth -> switch in the redis instructions, be careful [14:08:32] Krinkle: ack [14:08:32] k, waiting for volans [14:08:38] and done waiting :) [14:08:39] parsercache on eqiad RW [14:08:42] !log krinkle@tin: bin/apache-fast-test wiki-urls-warmup1000.txt eqiad [14:08:48] paravoid: I'll do phase 7 as well, when the time comes [14:08:49] <_joe_> paravoid: yeah taking into account [14:08:52] godog: proceed with swift [14:09:03] ack [14:09:10] bblack: prepare for 5/6 [14:09:15] (except now pc errors on codfw) [14:09:15] ack [14:09:17] (03PS4) 10Filippo Giunchedi: switchover: switch swift to eqiad imagescalers [puppet] - 10https://gerrit.wikimedia.org/r/284401 (owner: 10Giuseppe 
Lavagetto) [14:09:23] until 5.6 is finished [14:09:23] volans: if you're all done, prepare for phase 6 [14:09:24] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] switchover: switch swift to eqiad imagescalers [puppet] - 10https://gerrit.wikimedia.org/r/284401 (owner: 10Giuseppe Lavagetto) [14:09:27] !log [switchover #5/2] restbase puppet agent -tv && systemctl restart restbase [14:09:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:09:31] *expect [14:09:31] paravoid: already ready [14:09:33] General question: What do you guess, read only this time till 14:46 too? [14:09:34] !log [switchover #5/#3] Misc services cluster (for the action API endpoint) [14:09:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:09:40] (03PS3) 10BBlack: switchover: switch api/appservers/rendering varnish routing from codfw to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/284400 (owner: 10Giuseppe Lavagetto) [14:09:44] !log restarting parsoid on all nodes [14:09:45] Luke081515: probably not, unknown at this point, we're busy :) [14:09:48] Luke081515: hopefully not [14:09:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:10:00] Krinkle: waiting for you know [14:10:00] PROBLEM - wikidata.org dispatch lag is higher than 300s on wikidata is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1682 bytes in 0.200 second response time [14:10:11] !log [switchover #5/#7] roll-restart swift-proxy in eqiad and codfw [14:10:14] Yep, script is running in 100 threads. 
[14:10:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:10:35] <_joe_> I am still working on redis FYI [14:10:38] ETA 4-5 minutes [14:10:45] no issues on ES for now [14:10:49] _joe_: ack, thanks [14:10:56] but traffic is growing [14:11:02] <_joe_> eta 4-5 minutes for me too [14:11:35] * volans ready for RW on database (#6/#1), looking at pt-heartbeat in the meanwhile [14:11:40] !log parsoid deploy and restarts done [14:11:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:11:49] ^ wikidata lag expected. nothing to worry... [14:11:53] nod [14:12:23] 500s are still low [14:12:30] RECOVERY - MySQL Replication Heartbeat on db1001 is OK: OK replication delay 0 seconds [14:12:41] paravoid: _joe_: still seeing parsercache ro errors in logstash [14:12:48] ah in codfw [14:12:49] belay that [14:12:53] yes [14:12:56] that is normal [14:12:57] <_joe_> redis is a go, I still have things to do but it's a go [14:13:02] yeah, all fine [14:13:06] Krinkle: are you done? [14:13:06] <_joe_> paravoid: ^^ [14:13:15] still running.. [14:13:15] is it worth waiting for the warmup to finish? [14:13:31] if it ends up preventing the need for sidebar ban, yes [14:13:38] I cannot say in advance, sadly [14:13:41] well we had another fix for the sidebar, though [14:13:53] what is the ETA? [14:13:53] <_joe_> akosiaris: is restbase/sc* done? [14:13:57] so I think it'd be ok to proceed -- but let's wait another minute [14:14:01] _joe_: yup, a long time now [14:14:02] agreed [14:14:10] agreed what? 
[14:14:12] <_joe_> heh, we made it too easy there [14:14:22] !log [switchover #5/2] restbase {{done}} [14:14:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:14:43] confirming that everyone but Krinkle done with phase 5 [14:14:45] agreed w/paravoid: so I think it'd be ok to proceed -- but let's wait another minute [14:15:13] minute's over [14:15:13] still waiting for swift roll restart, ETA 2min [14:15:14] (and me for #5/#6, pending Krinkle) [14:15:20] bblack: proceed [14:15:21] ACKNOWLEDGEMENT - wikidata.org dispatch lag is higher than 300s on wikidata is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1682 bytes in 0.200 second response time Gehel No updates during read only (due to switchover), should recover after r/w is enabled again. [14:15:23] yes, sorry :) [14:15:29] (03CR) 10BBlack: [C: 032] switchover: switch api/appservers/rendering varnish routing from codfw to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/284400 (owner: 10Giuseppe Lavagetto) [14:15:34] !log [switchover #5/#6] Switch Varnish MW backends to eqiad - starting [14:15:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:15:51] Im not convinced apache-fast-test was able to handle this load. The progress bar doesn't seem to continue. Though CPU usage does still fluctuate suggesting life. [14:16:13] bblack: let us know when done [14:16:14] <_joe_> Krinkle: or the responses are very slow [14:16:14] (03PS3) 10Ori.livneh: Put eqiad in read-write mode for datacenter switchover to eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284396 (owner: 10Giuseppe Lavagetto) [14:16:17] ack [14:16:22] _joe_: yeah [14:16:27] Good point [14:16:42] oh hey, progressbar is running again [14:16:44] :) [14:17:03] (being slwo would be consistent with ES issue last time) [14:17:05] ok. it's been 4 minutes. 
It should be almost done now [14:17:21] Krinkle: we've proceeded with moving real traffic regardless, see above [14:17:27] Yeah [14:17:42] no better warmup than users :) [14:18:13] salt/puppet is rather slow today, even compared to yesterday :/ [14:18:26] !log [switchover #5/#7] finished swift-proxy roll restart [14:18:30] * subbu sees parsoid requests on wtp1001.eqiad [14:18:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:18:40] thanks, subbu [14:18:47] <_joe_> cool [14:18:49] warmup finished [14:18:55] good to know [14:18:56] !log apache-fast-test warmup finished [14:18:56] connections keep stable on ES servers [14:18:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:19:02] and I had 1x "already running", now confirming that one... [14:19:03] probably the warmup script was helped by the warmup of real users [14:19:06] ;) [14:19:20] bblack: yup, palladium at 100% CPU constantly. strontium looks like it could handle some more load, but when we were starting, it did not look like that [14:19:32] so I am guessing palladium tries to terminate all that HTTPS [14:19:47] !log [switchover #5/#6] Switch Varnish MW backends to eqiad - DONE (confirmed) [14:19:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:19:54] 4k RSA ftw? :) [14:19:57] <_joe_> godog: status? [14:20:01] _joe_: he's done [14:20:01] paravoid: exactly [14:20:01] godog is done [14:20:06] * _joe_ stabs paravoid [14:20:14] I'm done :P [14:20:14] <_joe_> so we can move on? 
[14:20:14] volans: go [14:20:17] !log [switchover #6/#1] Set database masters RW in eqiad for s1-7, es2-3 and x1 [14:20:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:20:27] done [14:20:30] paravoid: [14:20:34] ori: go with phase 7 [14:20:36] DB are RW [14:20:37] going [14:20:40] no replication issues, not many db errors, connections still stable [14:20:49] (03CR) 10Ori.livneh: [C: 032 V: 032] Put eqiad in read-write mode for datacenter switchover to eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284396 (owner: 10Giuseppe Lavagetto) [14:20:53] (03PS2) 10Giuseppe Lavagetto: "switchover: start jobrunners in eqiad" [puppet] - 10https://gerrit.wikimedia.org/r/284394 [14:21:34] * volans checking pt-heartbeat finished to run them manually now (jynus FYI) [14:21:39] !log ori@tin Synchronized wmf-config/db-eqiad.php: [switchover #7/#1] Iac92c8bc6b: Put eqiad in read-write mode for datacenter switchover to eqiad (duration: 00m 39s) [14:21:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:21:48] !log wikis are read-write again [14:21:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:21:54] <_joe_> someone checking edits? [14:21:55] let's confirm Special:RC works [14:22:11] I see an edit just now [14:22:12] \o/ [14:22:14] RC works at dewp [14:22:18] confirmed [14:22:20] does [14:22:24] great! [14:22:25] alright, awesome [14:22:30] <_joe_> does [14:22:31] <_joe_> :) [14:22:31] great [14:22:41] so in general the switch back is easier? :D [14:22:42] :> [14:22:42] let's proceed with p8 then [14:22:58] Luke081515: we did some post-mortem on the first switch and made some improvements [14:22:59] _joe_, ori: go for jobqueue/maint [14:23:07] Luke081515, I suspect switching either way will be easier now [14:23:08] <_joe_> doing puppet merges [14:23:08] _joe_: will you puppet-merge? 
[14:23:11] cool [14:23:15] ok [14:23:17] compared to the first time it was done [14:23:17] volans: go for db puppet & dns [14:23:24] (03CR) 10Giuseppe Lavagetto: [C: 032] "switchover: start jobrunners in eqiad" [puppet] - 10https://gerrit.wikimedia.org/r/284394 (owner: 10Giuseppe Lavagetto) [14:23:27] 14:01->14:21 [14:23:30] (03PS3) 10Giuseppe Lavagetto: switchover: maintenance scripts back to running in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/284395 [14:23:32] jynus: DNS is yours as agreed ;) [14:23:34] yeah not bad [14:23:36] great [14:23:41] do we still need the wikidata script? [14:23:42] acoording to enwiki RCs [14:23:44] !log [switchover #8/#3] Re-enable puppet on all eqiad and codfw databases masters [14:23:46] mark: apparently so [14:23:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:23:52] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] switchover: maintenance scripts back to running in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/284395 (owner: 10Giuseppe Lavagetto) [14:23:58] mark: addshore said so a couple of hours ago [14:24:02] ok [14:24:09] verified VE test. works [14:24:09] yep, as the issue wasnt caused by job stuff [14:24:13] addshore: will you do it? [14:24:13] <_joe_> ori: puppet merged [14:24:26] DB puppet re-enable done [14:24:28] <_joe_> I can do it [14:24:30] I will apply the dns changes (wait for the log) [14:24:36] ok thx [14:24:36] <_joe_> paravoid: run the script [14:24:38] paravoid: i dont have access ;) _joe_ will though! 
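The "14:01->14:21" read-only window quoted above can be cross-checked against the two `!log` sync entries in this log (14:01:59 for the read-only config, 14:21:39 for the read-write config). A minimal sketch, assuming those log timestamps approximate when each config sync finished; the helper name `window` is illustrative only and not part of any Wikimedia tooling:

```python
from datetime import datetime, timedelta


def window(start: str, end: str, fmt: str = "%H:%M:%S") -> timedelta:
    """Elapsed time between two same-day clock timestamps."""
    return datetime.strptime(end, fmt) - datetime.strptime(start, fmt)


# Timestamps copied from the !log lines in this log (assumed to mark the
# moment each sync was reported, not the exact moment writes stopped).
readonly_logged = "14:01:59"
readwrite_logged = "14:21:39"

delta = window(readonly_logged, readwrite_logged)
minutes, seconds = divmod(int(delta.total_seconds()), 60)
print(f"read-only for roughly {minutes}m {seconds}s")
```

The result is in line with the "20-min window without edits" reported from the enwiki VE recent-changes stream later in the log.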
[14:24:40] _joe_: ok, do it [14:24:41] !log [switchover #8/#1] Starting the jobqueue in eqiad [14:24:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:24:52] (03PS2) 10Jcrespo: Update dns records for new eqiad masters [dns] - 10https://gerrit.wikimedia.org/r/284667 [14:25:04] (03CR) 10Jcrespo: [C: 032 V: 032] Update dns records for new eqiad masters [dns] - 10https://gerrit.wikimedia.org/r/284667 (owner: 10Jcrespo) [14:25:09] <_joe_> !log [switchover #8/#2] Enabling crons in eqiad [14:25:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:25:43] VE works, hence RB works [14:25:49] gehel: will you do ElasticSearch after we're done with this? [14:25:54] godog: I'm still reviewing the swift switchback [14:25:57] <_joe_> !log [switchover #8/#3] Running rebuildEntityPerPage.php [14:26:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:26:07] !log deployed dns updated records for new eqiad masters [14:26:10] _joe_: lets see how many it has to re add [14:26:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:26:17] bblack: rb still seems to receive external traffic in codfw [14:26:24] bblack: ok thanks, it can wait if need be [14:26:26] <_joe_> mobrovac: that's expected [14:26:32] paravoid: I was planning to let things cool down first, but I can go as soon as you want... [14:26:34] <_joe_> mobrovac: we need to switch it back [14:26:42] <_joe_> mobrovac: once the dust is settled here I guess [14:26:55] our plan says ElasticSearch, Swift, Services, in that order -- but they are discreet steps and we can reorder [14:27:05] godog: the main thing confusing me is we've switched all the steps to revert commits, but they're still listed in the initial order. 
So we need to go 6 5 4 3 2 1 [14:27:10] RECOVERY - MySQL Replication Heartbeat on db1029 is OK: OK replication delay 0 seconds [14:27:12] and yes, let's wait 10-20 minutes or so to make sure we don't have any fires anywhere [14:27:26] <_joe_> exactly :) [14:27:30] I'm just giving advance warning [14:27:31] parsercache read only errors where reduced from 14:08-14:17 [14:27:31] RECOVERY - MySQL Replication Heartbeat on db1052 is OK: OK replication delay 0 seconds [14:27:40] so, all done with p8? [14:27:45] it takes at least that much for traffic to shift [14:27:55] enwiki RC stream for VE edits looks good to me .. only a 20-min window without edits in that VE-edit stream. [14:27:58] fyi: edit on cs wikipedia has been notified on cs wiktionary channel [14:27:58] (including warmup) [14:28:09] subbu: link? [14:28:16] https://en.wikipedia.org/w/index.php?namespace=&tagfilter=visualeditor&title=Special%3ARecentChanges [14:28:23] godog: other than step-reversal, it all looks good, and the correct salt command is there with the correct commit inside each step [14:28:29] Danny_B: thanks! [14:28:43] !log traffic/mediawiki codfw->eqiad switchover is done [14:28:45] <_joe_> ori: how are the jobqueues doing? [14:28:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:28:47] high errors on labswiki [14:29:02] _joe_: nearly done [14:29:09] resourceloader minification cache only dropped to 98% from 99.9% (when switching the other way it dropped to 88%) [14:29:21] <_joe_> Krinkle: nice [14:29:21] Krinkle: nice! [14:29:24] :-) [14:29:26] it would seem side bars didnt mess up this time either? [14:29:26] :-D [14:29:33] addshore: hopefully not! 
[14:29:40] bblack: yeah I left the original ordering reflecting 'ahead of the switchover' but will add a note later about the reversal :| [14:29:40] PROBLEM - Redis status tcp_6381 on rdb2004 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.16.123 on port 6381 [14:29:41] <_joe_> addshore: the whole "warmup" was intended to avoid that :P [14:29:49] <_joe_> uhm checking [14:29:50] PROBLEM - Redis status tcp_6380 on rdb2004 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.16.123 on port 6380 [14:30:02] that 1% is probably people browsing wikis in non-default language or non-default skins. (RL has infinite variations almost, like our parser cache) [14:30:03] are these normal? [14:30:08] Apr 21 14:29:42 mw1248: #012Notice: Undefined index: recentChangesFlagsRaw in /srv/mediawiki/php-1.27.0-wmf.21/includes/changes/EnhancedChangesList.php on line 268 [14:30:08] Apr 21 14:29:42 mw1248: #012Warning: Invalid argument supplied for foreach() in /srv/mediawiki/php-1.27.0-wmf.21/includes/changes/EnhancedChangesList.php on line 285 [14:30:21] RECOVERY - MySQL Replication Heartbeat on db1033 is OK: OK replication delay 0 seconds [14:30:23] Krenair: afaik those are pre-existing, but worrying nontheless [14:30:27] ok [14:30:29] parsoid: eqiad cluster is picking up (ganglia) + wt2html rate is beginning to crawl back up up (grafana). [14:30:45] Krenair: Can you check if log-error tasks exist? [14:30:55] looking [14:30:58] the difference in ES connections was brutal [14:31:09] (03PS1) 10Gehel: Revert "Switching CirrusSearch to codfw Elasticsearch cluster" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284701 [14:31:11] spike of 160 concurrent connections per server [14:31:12] another one. 
let me know if you want me to notify you again [14:31:15] jobs are getting processed [14:31:20] when it was 60K before [14:31:22] ori: awesome [14:31:26] Krinkle, doesn't look like it [14:31:51] RECOVERY - MySQL Replication Heartbeat on db1058 is OK: OK replication delay 0 seconds [14:32:01] crons still not up yet? [14:32:10] <_joe_> addshore: they are up now AFAICS [14:32:25] RECOVERY - Redis status tcp_6381 on rdb2004 is OK: OK: REDIS on 10.192.16.123:6381 has 1 databases (db0) with 9940759 keys - replication_delay is 0 [14:32:25] RECOVERY - Redis status tcp_6380 on rdb2004 is OK: OK: REDIS on 10.192.16.123:6380 has 1 databases (db0) with 9942102 keys - replication_delay is 0 [14:32:26] spot checking for sidebar errors as anon and not seeing any yet [14:32:36] <_joe_> /bin/bash /usr/local/bin/mwscript extensions/Wikidata/extensions/Wikibase/repo/maintenance/dispatchChanges.php --wiki wikidatawiki --max-time 540 --batch-size 275 --dispatch-interval 25 --lock-grace-interval 200 [14:32:42] ahh yes _joe_ looks like the wikidata dispatch lag just dived back down [14:32:52] <_joe_> addshore: :) [14:33:00] <_joe_> the rebuild script is still running though [14:33:04] (parsoid) there are still some (very few) straggler requests coming into wtp2001.codfw. [14:33:11] bd808: yeah, also no log messages in the new MessageCacheError log channel [14:33:13] <_joe_> subbu: from where? 
[14:33:19] <_joe_> subbu: I guess restbase in codfw [14:33:22] <_joe_> that's expected [14:33:28] {"name":"parsoid","hostname":"wtp2001","pid":4988,"level":30,"logType":"info","wiki":"itwiki","title":"Main_Page","oldId":null,"reqId":"fb97d7d1-07cd-11e6-b34b-b753ae4e829e","userAgent":"RESTBase/WMF","msg":"completed parsing in 197 ms","longMsg":"completed parsing in 197 ms","time":"2016-04-21T14:33:15.410Z","v":0} [14:33:29] <_joe_> it's external traffic [14:33:34] <_joe_> we haven't switched it back [14:33:42] ok [14:33:45] _joe_: it took 5 mins last time, lets see how long this time :) [14:33:54] <_joe_> addshore: it seems way longer [14:34:04] RECOVERY - wikidata.org dispatch lag is higher than 300s on wikidata is OK: HTTP OK: HTTP/1.1 200 OK - 1680 bytes in 0.159 second response time [14:34:10] or not ... [14:34:22] <_joe_> akosiaris: or not what? [14:34:24] edit.success almost fully recovered and increasing [14:34:28] volans, I am enabling lag alerts again [14:34:34] jynus: wait a second [14:34:36] number of enqueued jobs dropped from 600k -> 73k, indicating job queue is healthy [14:34:38] I'm checking old masters [14:34:40] 5.5 [14:34:45] RECOVERY - MySQL Replication Heartbeat on db1038 is OK: OK replication delay 0 seconds [14:34:51] _joe_: I was commenting on wikidata recovering exactly the moment you said it is taking longer than 5 mins [14:35:10] _joe_: mc* puppet disabled? [14:35:14] and rdb [14:35:15] <_joe_> akosiaris: it's unrelated [14:35:18] ocg requests were always the first set of requests to switch on the parsoid cluster (both eqiad -> codfw; and codfw -> eqiad) .. i because that service was never disrupted / switched. [14:35:32] there is peak db loads now, but nothing too worrying [14:35:39] <_joe_> paravoid: yes I am going to reenable those progressively [14:35:47] <_joe_> the switch is done, so now I can [14:35:52] jynus: go ahead [14:36:08] ok to reenable lag alerts, you mean, volans ? 
[14:36:21] all lag checks are ok, the heartbeat one on old master is not but is already acked [14:36:25] yes jynus re-enable [14:36:39] ori: as before, non-zero edits in https://grafana.wikimedia.org/dashboard/db/edit-count while read-only (note: unlike edit.failures, edit.success is based on saveBackedTiming.rate) - maybe that's being triggered for non-edits somehow. [14:36:44] e.g. null edits or something [14:36:54] RECOVERY - MySQL Replication Heartbeat on db1023 is OK: OK replication delay 0 seconds [14:37:18] Krinkle: could be very lost UDP packets? ;) [14:37:24] <_joe_> addshore: zero article fixed [14:37:38] <_joe_> addshore: do you see any that is damaged? [14:37:42] _joe_: awesome! [14:37:47] _joe_: no reports of anything [14:37:57] (_joe_: puppet finished on all the jobrunners now, but jobqueue was already healthy some time ago) [14:37:59] <_joe_> \o/ [14:38:05] <_joe_> awesome [14:38:06] whatever was done differently this time was a good idea :D [14:38:17] Just got to see if https://phabricator.wikimedia.org/T133144 has occoured again this time... [14:39:00] let's proceed with elasticsearch at :45? [14:39:06] <_joe_> addshore: it shouldn't [14:39:21] <_joe_> mobrovac: are you around for services switchback later? [14:39:26] !log enabling replication lag alerts for all dbs [14:39:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:39:38] yeh, that one was most deifntly caused by jobs not being able to write [14:39:44] sorry i missed it -- did anyone do: Make sure email works (exim4 -bp on mx1001/mx2001, test an email) [14:39:46] _joe_: define "later" [14:39:59] <_joe_> mobrovac: after ES and Swift are done [14:40:15] ori: I checked the queues yes [14:40:20] cool [14:40:40] _joe_: yes [14:40:41] mails seem to be flowing [14:41:10] _joe_: rdb1001/2 crit? [14:41:48] <_joe_> paravoid: uh? 
[14:41:58] <_joe_> paravoid: it's transient I guess [14:42:01] CRITICAL ERROR - Redis Library - can not ping '10.64.32.77' on port 6380 [14:42:03] <_joe_> it's puppet restarting redises [14:42:51] <_joe_> it's a 2-second window and that checks manages to detect it [14:43:17] talking about being at the wrong time in the wrong place ... [14:43:39] mobrovac: cassandra test cluster, I suppose? :) [14:43:57] did a restart there already [14:44:01] awesome [14:45:12] !log reverting elasticsearch traffic back to eqiad [14:45:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:45:25] thanks, gehel [14:45:26] (03CR) 10Gehel: [C: 032] Revert "Switching CirrusSearch to codfw Elasticsearch cluster" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284701 (owner: 10Gehel) [14:45:29] godog: you're up next [14:45:37] <_joe_> mobrovac: I'm preparing the rollback of services [14:45:41] k [14:45:44] thnx [14:45:54] paravoid: ack, let me know when good to go [14:46:16] note: testing on mw1017 first ... [14:47:53] looks good applying elasticsearch back on all servers... [14:47:54] https://phabricator.wikimedia.org/T121741 (search cluster in codfw) can be closed, right? [14:48:12] and https://phabricator.wikimedia.org/T124671 (app servers) [14:48:13] <1,sleep(3),0) oR ('' is not registered on this wiki.>> [14:48:21] odd exceptions... [14:48:48] hah. nice sql injection attempt. [14:48:50] !log gehel@tin Synchronized wmf-config/InitialiseSettings.php: (no message) (duration: 00m 29s) [14:48:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:50:16] gehel: let us know when done [14:50:23] (03PS1) 10Giuseppe Lavagetto: switchover: services back to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/284704 [14:50:28] done, checking right now... [14:50:33] search seems slow, but working [14:50:50] _joe_: you'll go after godog, yes? 
[14:51:20] jynus: whenever you have time there is tendril to update for a better tree [14:51:22] <_joe_> paravoid: yes [14:52:27] true, volans, the last piece of the cake [14:53:12] I see traffic on eqiad elasticsearch servers, looks good... [14:53:50] gehel: so, done? [14:54:17] paravoid: yes, done. [14:54:57] (03CR) 10Mobrovac: [C: 031] switchover: services back to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/284704 (owner: 10Giuseppe Lavagetto) [14:54:59] godog: go :) [14:55:03] ack [14:55:20] I think there is some real issue [14:55:24] (03PS2) 10Filippo Giunchedi: Revert "varnish: switch upload eqiad from 'direct' to 'codfw'" [puppet] - 10https://gerrit.wikimedia.org/r/284648 [14:55:39] I am getting errors due to parsercache ro on dallas [14:55:44] jynus: enough to hold? havent puppet merged yet [14:55:50] why is traffic hittin there? [14:55:58] jynus: that's the expected status though [14:55:59] unrelated to other deployments [14:56:09] ok proceeding [14:56:11] volans, not "ongoing" [14:56:25] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] Revert "varnish: switch upload eqiad from 'direct' to 'codfw'" [puppet] - 10https://gerrit.wikimedia.org/r/284648 (owner: 10Filippo Giunchedi) [14:56:26] <_joe_> jynus: what kind of traffic, have any idea? [14:56:41] <_joe_> and from what hosts? [14:56:46] <_joe_> I mean mw hosts [14:56:49] api.php [14:56:54] (I am checking) [14:57:00] !log [switchover swift #6] upload eqiad to 'direct' [14:57:02] <_joe_> so someone didn't restart all hist clusters [14:57:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:57:19] <_joe_> jynus: I just need an hostname [14:57:21] from restbase [14:57:22] <_joe_> anmd I can check [14:57:25] i still see requests to parsoid on wtp2001.codfw [14:57:28] <_joe_> mobrovac: ^^ [14:57:36] <_joe_> subbu: that's expected [14:57:36] is restbase reading from codfw right now? [14:57:45] <_joe_> jynus: which restbase host? 
[14:57:52] k [14:57:56] restbase1014.eqiad.wmnet [14:58:08] <_joe_> jynus: wtf? checking [14:58:11] only that [14:58:14] no others [14:58:49] sorry it took me some time to notice, but it growed gradually [14:58:51] _joe_: subbu: rb is calling parsoid in codfw for live traffic, but parsoid *should* be calling the MW API in eqiad [14:58:58] <_joe_> mobrovac: yes [14:59:02] since the end of read only [14:59:18] <_joe_> jynus: still ongoing? [14:59:22] mobrovac, _joe_ oh, i assumed external traffic had already switched over, but i guess not. [14:59:26] and I thought at first it was residual traffic [14:59:31] _joe_, cheking again [14:59:40] <_joe_> because I don't see any connection to codfw there [15:00:10] gr, my bad my bad [15:00:18] jynus: _joe_: fixinf rb1014 [15:00:29] <_joe_> mobrovac: yeah seeing now [15:00:31] (03PS2) 10Filippo Giunchedi: Revert "varnish: switch esams from 'eqiad' to 'codfw'" [puppet] - 10https://gerrit.wikimedia.org/r/284649 [15:00:47] https://logstash.wikimedia.org/#dashboard/temp/AVQ5Vh5HO3D718AOpIHV [15:00:53] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] Revert "varnish: switch esams from 'eqiad' to 'codfw'" [puppet] - 10https://gerrit.wikimedia.org/r/284649 (owner: 10Filippo Giunchedi) [15:00:55] 06Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests: Rename Võro Wikipedia, fiu-vro -> vro - https://phabricator.wikimedia.org/T31186#2227401 (10Krenair) Yes: {T111822} {T111853} {T111876} {T111895} {T112285} [15:00:57] ^check yourself [15:01:10] still ongoing [15:01:12] <_joe_> jynus: yeah I saw connections right when I said [15:01:15] <_joe_> "seeing now" [15:01:16] sorry [15:01:27] jynus: _joe_: done [15:01:28] !log [switchover swift #5] upload esams to eqiad [15:01:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:01:33] jynus: thnx for noticing [15:01:37] what was it? [15:01:44] pure mistake? 
[15:01:55] If you don't see any reason not to, I'm going to start the reindex of lost updates in elasticsearch... [15:01:56] rb1014 wasn't pooled 2 days ago, but it is today ... [15:01:56] or something failed [15:01:59] and i used the same script [15:02:04] <_joe_> mobrovac: ha! [15:02:10] * mobrovac hides [15:02:20] <_joe_> mobrovac: no problem, really [15:02:21] mobrovac, no problem [15:02:26] data corruption? [15:02:29] <_joe_> it's just jobqueue traffic atm [15:02:36] <_joe_> so not user-facing [15:02:39] will it affect cassandra? [15:02:40] yup [15:02:46] <_joe_> jynus: nope [15:02:51] great [15:03:03] <_joe_> jynus: I expect exactly zero impact from this [15:03:07] can confirm 0 errors now [15:03:22] no pb for cassandra because of this [15:03:24] all good there [15:03:56] in theory, pc should be rw all the time [15:04:02] (03PS2) 10Filippo Giunchedi: Revert "depool upload/eqiad for codfw switchover" [dns] - 10https://gerrit.wikimedia.org/r/284694 [15:04:07] but this time it helped us spot an issue [15:04:17] <_joe_> yes :P [15:04:36] !log starting reindex of lost elasticsearch updates during activation of SSL (T132762) [15:04:37] T132762: Reindex all pages edited since Apr 7 2016 - 14h00 UTC - https://phabricator.wikimedia.org/T132762 [15:04:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:04:46] pc handling (and other cache) is another thingto rearch/rethink [15:05:03] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] Revert "depool upload/eqiad for codfw switchover" [dns] - 10https://gerrit.wikimedia.org/r/284694 (owner: 10Filippo Giunchedi) [15:05:27] !log [switchover swift #4] repool upload eqiad in dns [15:05:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:05:45] godog: May I close https://phabricator.wikimedia.org/T129089? (codfw upload switch) [15:06:25] Krinkle: yeah, thanks! 
[15:06:31] Krinkle: thanks for cleaning up :) [15:06:43] Operations, Patch-For-Review, codfw-rollout, codfw-rollout-Jan-Mar-2016: switch upload varnish backends to codfw ahead of full switch - https://phabricator.wikimedia.org/T129089#2094664 (Krinkle) Open>Resolved [15:06:48] yw [15:07:11] Operations, Patch-For-Review, codfw-rollout, codfw-rollout-Jan-Mar-2016: Figure out and document the datacenter switchover process - https://phabricator.wikimedia.org/T124670#2227420 (Krinkle) [15:07:13] Operations, Wikimedia-General-or-Unknown, Patch-For-Review, codfw-rollout, codfw-rollout-Jan-Mar-2016: Switchover of the application servers to codfw - https://phabricator.wikimedia.org/T124671#1962078 (Krinkle) Open>Resolved a:Krinkle [15:07:41] _joe_: https://phabricator.wikimedia.org/T126242 staying open for now, right? (reduce app servers) [15:07:54] or is there a separate decom task [15:08:39] (PS2) Filippo Giunchedi: Revert "varnish: switch upload codfw from 'eqiad' to 'direct'" [puppet] - https://gerrit.wikimedia.org/r/284650 [15:08:45] (CR) Filippo Giunchedi: [C: 2 V: 2] Revert "varnish: switch upload codfw from 'eqiad' to 'direct'" [puppet] - https://gerrit.wikimedia.org/r/284650 (owner: Filippo Giunchedi) [15:09:20] !log [switchover swift #3] upload codfw to eqiad [15:09:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:09:25] <_joe_> Krinkle: it will be closed as soon as the dc switchover is done, basically [15:09:42] <_joe_> Krinkle: I waited for complete decom to see if we were sized ok for the switchback [15:13:05] (Cannot access the database: Can't connect to MySQL server on '10.64.16.30' (4) (10.64.16.30)) [15:13:36] (PS2) Filippo Giunchedi: Revert "varnish: route upload backends to codfw" [puppet] - https://gerrit.wikimedia.org/r/284651 [15:13:52] <_joe_> jynus, volans ^^ [15:13:57] db1041
[15:14:03] (CR) Filippo Giunchedi: [C: 2 V: 2] Revert "varnish: route upload backends to codfw" [puppet] - https://gerrit.wikimedia.org/r/284651 (owner: Filippo Giunchedi) [15:14:37] yes I was looking [15:14:48] few errors but all towards s7 master [15:15:13] proceeding anyways as it doesn't seem to be impacting to what I am doing, let me know otherwise [15:15:16] <_joe_> we had a jobs spike [15:15:29] <_joe_> godog: yeah go on [15:15:38] !log [switchover swift #2] upload backends from codfw to eqiad [15:15:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:15:48] they are api errors [15:16:10] on metawiki [15:16:25] I see them on multiple wiki [15:16:31] also with -rpc [15:16:41] not many but all towards db1041 [15:17:03] 62 & 39 a bit overloaded [15:17:33] we may want to repool db1033 [15:17:44] * volans preparing patch [15:17:57] if you do that, I will try to debug more [15:18:10] if it is a spike or there is something worse [15:18:35] Operations, Performance-Team, codfw-rollout, codfw-rollout-Jan-Mar-2016: Figure out how to migrate the jobqueues - https://phabricator.wikimedia.org/T124673#2227446 (Krinkle) Open>Resolved a:Krinkle [15:18:37] Operations, Patch-For-Review, codfw-rollout, codfw-rollout-Jan-Mar-2016: Figure out and document the datacenter switchover process - https://phabricator.wikimedia.org/T124670#2227448 (Krinkle) [15:18:42] Operations, Discovery, Kartotherian, Maps, and 2 others: codfw/eqiad/esams/ulsfo: (4) servers for maps caching cluster - https://phabricator.wikimedia.org/T131880#2227449 (mark) Let's move forward with repurposing the existing (ex mobile) Varnish servers for maps. :) [15:19:19] I can see it, there is long running queries on centralauth [15:19:20] weight 150 ok? [15:19:22] Hm..
whatever happened to https://phabricator.wikimedia.org/T126632 (change scap to restart job queues) [15:19:24] bottlenecking a bit [15:19:28] volans, ok [15:19:29] paravoid _joe_ ETA is ~5-10m, left is the last puppet run and sync-file [15:19:30] do we manually restart them when we changed the config? [15:19:40] godog: cool, thanks [15:19:41] at least less connections would help a bit [15:19:45] <_joe_> godog: cool, thx [15:19:56] lol _joe_ [15:20:25] (PS1) Volans: Repool db1033 to reduce load on the others [mediawiki-config] - https://gerrit.wikimedia.org/r/284706 (https://phabricator.wikimedia.org/T133205) [15:20:29] _joe_, paravoid: i think we can now move ahead with switching services' traffic to eqiad as it's not dependent on swift [15:20:35] <_joe_> paravoid: I was late because I'm sipping coffee [15:20:52] <_joe_> mobrovac: let's wait the 5 minutes godog need [15:20:54] <_joe_> *s [15:20:54] none of the three (elastic, swift, services) are dependent on one another [15:20:59] so [15:21:03] job queue execution is partially responsible [15:21:06] we're just doing them serially so that if something goes bad, our attention isn't diverted [15:21:06] (PS2) Filippo Giunchedi: Revert "Set synchronous swift writes for eqiad/codfw" [mediawiki-config] - https://gerrit.wikimedia.org/r/284652 [15:21:13] (CR) Filippo Giunchedi: [C: 2 V: 2] Revert "Set synchronous swift writes for eqiad/codfw" [mediawiki-config] - https://gerrit.wikimedia.org/r/284652 (owner: Filippo Giunchedi) [15:21:25] Operations, Patch-For-Review, codfw-rollout, codfw-rollout-Jan-Mar-2016: Figure out and document the datacenter switchover process - https://phabricator.wikimedia.org/T124670#2227451 (Krinkle) >>! In T114398, @aaron wrote:> > See also T114271. > > We need scripts and processes to do a planned s...
[15:21:30] it is high on central auth, which affects performance, which makes more likely errors [15:21:32] no reason to parallelize if they're not on any critical (user-impacting) path, I think [15:21:36] Operations, Patch-For-Review, codfw-rollout, codfw-rollout-Jan-Mar-2016: Figure out and document the datacenter switchover process - https://phabricator.wikimedia.org/T124670#2227456 (Krinkle) [15:21:42] Operations, User-mobrovac, codfw-rollout, codfw-rollout-Jan-Mar-2016: Create a service location / discovery system for locating local/master resources easily across all WMF applications - https://phabricator.wikimedia.org/T125069#2227459 (Krinkle) [15:21:44] Operations, Patch-For-Review, codfw-rollout, codfw-rollout-Jan-Mar-2016: Figure out and document the datacenter switchover process - https://phabricator.wikimedia.org/T124670#1962060 (Krinkle) [15:22:47] (PS6) KartikMistry: WIP: Read config from cxserver [puppet] - https://gerrit.wikimedia.org/r/284654 (https://phabricator.wikimedia.org/T122498) [15:23:14] godog paravoid can I merge a mediawiki change to adjust weights?
[15:23:15] !log filippo@tin Synchronized wmf-config/filebackend-production.php: [switchover swift #1] async writes to codfw (duration: 00m 28s) [15:23:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:23:27] volans: yup, just finished sync-file [15:23:29] (PS2) Volans: Repool db1033 to reduce load on the others [mediawiki-config] - https://gerrit.wikimedia.org/r/284706 (https://phabricator.wikimedia.org/T133205) [15:23:31] ok thanks [15:23:36] !log ytterbium: puppet re-enabled for gerrit host [15:23:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:23:54] ostriches: :) [15:24:16] waiting another 5m for swift/varnish to settle [15:24:31] (CR) Volans: [C: 2] Repool db1033 to reduce load on the others [mediawiki-config] - https://gerrit.wikimedia.org/r/284706 (https://phabricator.wikimedia.org/T133205) (owner: Volans) [15:24:40] wait [15:24:47] Operations, Patch-For-Review, developer-notice, notice: Migrate argon (irc.wikimedia.org) to Jessie - https://phabricator.wikimedia.org/T123729#2227467 (Krinkle) [15:24:53] jynus: ok, but already meged [15:24:55] *merged [15:24:58] (Merged) jenkins-bot: Repool db1033 to reduce load on the others [mediawiki-config] - https://gerrit.wikimedia.org/r/284706 (https://phabricator.wikimedia.org/T133205) (owner: Volans) [15:24:58] no problem [15:25:02] Operations, ops-eqiad: eqiad: Failed DIMM db1065 - https://phabricator.wikimedia.org/T133250#2227471 (Cmjohnson) Received DIMM scheduled with @jcrespo on Tuesday at 1500UTC [15:25:08] I wanted to do extra tweaks [15:25:13] can you hold the deploy?
[15:25:13] not yet fetch/pull/sync [15:25:15] so tell me [15:25:19] one sec [15:25:19] we can do it at once [15:25:23] yes [15:25:35] I think load on some other servers on s7 is too low [15:25:42] we can optimize that a bit [15:26:14] godog: once you're done I'll send an email out ;) [15:26:17] <_joe_> /win 17 [15:26:29] <_joe_> mark: wait on services? [15:26:34] ah yeah [15:26:35] <_joe_> it's going to be 2 minutes more [15:27:27] (PS1) Jcrespo: Increase weight on db1034, db1041 [mediawiki-config] - https://gerrit.wikimedia.org/r/284707 [15:27:33] ^volans check [15:27:46] (PS2) Giuseppe Lavagetto: switchover: services back to eqiad [puppet] - https://gerrit.wikimedia.org/r/284704 [15:27:55] 62 is not ready to handle 23K QPS [15:27:57] <_joe_> grrrit-wm: still waiting on your go [15:27:58] yup swift looks good to me paravoid _joe_ [15:28:01] awesome! [15:28:01] <_joe_> ok [15:28:03] thanks :) [15:28:03] <_joe_> let's go [15:28:13] <_joe_> mobrovac: I'm merging and running puppet [15:28:19] (CR) Mobrovac: [C: -1] "I have my doubts about the path set in the config file. Can I have the pertinent changes from the CXServer source and deploy repo?" [puppet] - https://gerrit.wikimedia.org/r/284654 (https://phabricator.wikimedia.org/T122498) (owner: KartikMistry) [15:28:21] (CR) Volans: [C: 1] Increase weight on db1034, db1041 [mediawiki-config] - https://gerrit.wikimedia.org/r/284707 (owner: Jcrespo) [15:28:23] yw!
[15:28:27] kk _joe_ [15:28:37] (CR) Jcrespo: [C: 2] Increase weight on db1034, db1041 [mediawiki-config] - https://gerrit.wikimedia.org/r/284707 (owner: Jcrespo) [15:28:38] <_joe_> !log [switchback services] moving traffic for restbase/citoid/cxserver back to eqiad [15:28:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:28:46] volans, it is the lack of an api role [15:28:53] that would not solve the issue [15:28:59] but it would isolate it [15:28:59] (CR) Giuseppe Lavagetto: [C: 2] switchover: services back to eqiad [puppet] - https://gerrit.wikimedia.org/r/284704 (owner: Giuseppe Lavagetto) [15:29:07] I'll take care of the pull and sync jynus [15:29:13] ok, was going to ask [15:29:21] no issue, as it locks [15:30:04] next shard to get new hw: s7 :-) [15:30:24] syncing now [15:30:30] meta and centralauth are very important wikis [15:30:31] <_joe_> puppet is indeed slow [15:30:48] !log volans@tin Synchronized wmf-config/db-eqiad.php: Adust weights for shard s7 - T133205 (duration: 00m 32s) [15:30:49] T133205: Switchback to eqiad - https://phabricator.wikimedia.org/T133205 [15:30:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:32:10] the title was broken, the patch was ok BTW [15:33:18] RECOVERY - HHVM jobrunner on mw2001 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.089 second response time [15:33:41] yeah, my bad [15:33:55] no, my fault [15:34:12] <_joe_> fight over who's wrong!
[15:34:36] <_joe_> !log services switchover done [15:34:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:34:49] <_joe_> mark: switchback is officially over I guess [15:34:49] yeah were 28 and 34 not 41 :D, I wrote Adust instead of adjust [15:34:59] \o/ [15:34:59] yep, writing email [15:35:01] 1h30 [15:35:11] volans, no change, errors still high on db1041 [15:35:11] <_joe_> paravoid: we could've gone much faster [15:35:49] why you were thinking that the load on the slaves could create errors on the master? [15:36:10] there were other unrelated errors [15:37:45] jynus, volans: TL;DR? :) [15:38:14] few errors connect to DB only towards db1041, s7 master, investigating [15:38:32] https://logstash.wikimedia.org/#/dashboard/elasticsearch/wfLogDBError [15:38:37] paravoid: ^^ [15:38:38] errors are ongoing, paravoid [15:39:01] grant differences? [15:39:05] not for api [15:39:41] do we have some additional log I can look at on a mw host? [15:39:51] fluorine [15:40:01] it's the same thing though [15:40:07] 2016-04-21 15:39:26 [Vxj0KwpAIDMAAElfXrwAAAAN] mw1181 frwiki 1.27.0-wmf.21 wfLogDBError ERROR: Error connecting to 10.64.16.30: Can't connect to MySQL server on '10.64.16.30' (4) {"db_server":"10.64.16.30","db_name":"centralauth","db_user":"wikiuser","method":"DatabaseMysqlBase::open","error":"Can't connect to MySQL server on '10.64.16.30' (4)"} [15:40:12] 2016-04-21 15:39:26 [Vxj0KwpAIDMAAElfXrwAAAAN] mw1181 frwiki 1.27.0-wmf.21 wfLogDBError ERROR: Connection error: Unknown error (10.64.16.30) {"method":"LoadBalancer::reportConnectionError","last_error":"Unknown error","db_server":"10.64.16.30"} [15:41:32] jynus: could be the firewall rules? 
but I would expect more all or nothing from them [15:42:38] that or drift on grants [15:42:43] (PS1) BBlack: apt|mirrors|ubuntu: split config so HSTS is only over HTTPS [puppet] - https://gerrit.wikimedia.org/r/284711 [15:43:50] I am executing a grant on db1041 [15:44:01] ok [15:44:06] Operations, MediaWiki-Cache, Wikimedia-General-or-Unknown, Patch-For-Review, codfw-rollout: Wrong sidebar cached on sites - https://phabricator.wikimedia.org/T133069#2227504 (aaron) Open>Resolved a:aaron [15:44:07] _joe_, not critical and I had flagged this couple week back as well ... but i think something is broken with ganglia monitoring of network traffic to the parsoid codfw cluster ... look at https://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&c=Parsoid+codfw&m=cpu_report&s=by+name&mc=2&g=network_report ... I don't see how that much network traffic can exist on codfw without any requests. This doesn't change if you to a monthly view. [15:44:17] https://ganglia.wikimedia.org/latest/graph.php?r=month&z=xlarge&c=Parsoid+codfw&m=cpu_report&s=by+name&mc=2&g=network_report [15:44:29] <_joe_> subbu: probably yes [15:44:47] <_joe_> something wrong there i already noticed [15:45:10] ok. should i file a phab task? or easy / simple to fix? [15:45:11] jynus, volans: nothing dropped on 1041 [15:45:16] (in iptables) [15:45:25] thanks moritzm was my next check :) [15:45:33] <_joe_> conntrack? [15:45:39] hmm [15:45:51] <_joe_> MatmaRex: what's up? [15:45:51] if i wanted to deploy a one-line bugfix today, would that be impossible? [15:45:58] entirely unrelated to switchover [15:46:04] no, conntrack is fine as well [15:46:09] https://gerrit.wikimedia.org/r/#/c/284710/ [15:46:25] (CR) Chad: [C: 2] "Moritz: Any chance you could get this built and in apt.wm.o for me?" [debs/gerrit] - https://gerrit.wikimedia.org/r/263631 (owner: Chad) [15:46:28] I think is that [15:46:36] MatmaRex: not impossible, no [15:46:54] jynus: that what?
[15:46:56] (03CR) 10Muehlenhoff: "Sure, I can do that tomorrow." [debs/gerrit] - 10https://gerrit.wikimedia.org/r/263631 (owner: 10Chad) [15:47:04] MatmaRex: you can go ahead from my PoV; I suppose ostriches has the releng stamp [15:47:25] * ostriches looks [15:48:00] * ostriches puts a big stamp right on MatmaRex's forehead [15:48:04] :) [15:48:07] volans, what was the old s7-master? [15:48:09] i don't have deployment ccess myself, but i could probably make MarkTraceur do it ;) [15:48:19] 74577 packets transmitted, 74576 received, 0% packet loss, time 12941ms [15:48:20] jynus: db1033 [15:48:22] MarkTraceur: feel like cherry-picking and deploying https://gerrit.wikimedia.org/r/#/c/284710/ ? [15:48:26] * MarkTraceur glares at MatmaRex [15:48:28] Sure [15:48:29] (03PS1) 10Urbanecm: Add *.asc-test.nl to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284712 (https://phabricator.wikimedia.org/T133286) [15:48:31] for mw1181->db1041 [15:48:38] so it's not packet loss or anything like that [15:48:51] any way we can figure out what "unknown error" means? [15:49:27] * MarkTraceur dusts off his prod SSH key [15:50:23] <_joe_> MatmaRex: wait until we've cleared off the ongoing (small) fires? [15:50:35] sorry, no, it is not grants [15:50:41] I was comparing to the wrong host [15:50:52] _joe_: cf. paravoid's go-ahead? [15:51:31] 445 10.64.16.30 [15:51:31] 27 10.64.48.25 [15:51:31] 13 10.64.16.24 [15:51:34] 9 10.64.48.26 [15:51:41] then <= 5 [15:51:44] for the last 1000, that is [15:51:51] 06Operations, 10Traffic, 10Continuous-Integration-Infrastructure (phase-out-gallium): Move gallium to an internal host? 
- https://phabricator.wikimedia.org/T133150#2227541 (10hashar) [15:52:05] 06Operations, 10netops, 10Continuous-Integration-Infrastructure (phase-out-gallium): install/setup/deploy cobalt as replacement for gallium - https://phabricator.wikimedia.org/T95959#2227542 (10hashar) [15:52:13] MatmaRex: Do you know off the top of your head which branches have the broken code? [15:52:16] all different shards [15:52:23] 06Operations, 10hardware-requests, 10Continuous-Integration-Infrastructure (phase-out-gallium), 05Continuous-Integration-Scaling: eqiad: (1) allocate server to migrate Zuul server to - https://phabricator.wikimedia.org/T95760#2227546 (10hashar) [15:52:29] MarkTraceur: wmf.21 [15:52:30] 10.64.48.25 and 10.64.48.26 were having errors before too [15:52:36] due to overheating [15:52:39] Easy enough, thanks MatmaRex [15:52:42] maybe we didn't solve the issue [15:53:14] cmjohnson1: did restbase1015 eventually got setup yesterday btw? saw no update on task [15:53:42] we didn't [15:53:44] MatmaRex: I'm waiting for a thumbs up from _joe_ but otherwise I'm set [15:53:56] right. thanks guys [15:54:09] it is some king of central access for other wikis- centralauth or meta [15:54:14] *kind [15:54:37] <_joe_> jynus: maybe some sessions were invalidated during the switchback? [15:54:59] _joe_: and they give unable to connect to DB? [15:55:08] sounds strange to me [15:55:11] <_joe_> volans: they would yield more connections [15:55:20] 06Operations, 06Discovery, 10Kartotherian, 10Maps, and 2 others: codfw/eqiad/esams/ulsfo: (4) servers for maps caching cluster - https://phabricator.wikimedia.org/T131880#2227556 (10BBlack) With post-switchover work, a weekend coming, and other misc constraints, @Gehel and I planning to actually do the wor... [15:55:26] the masteris not overloaded [15:55:39] (03PS1) 10Andrew Bogott: Use codfw.labtest domain for labtest instances. 
[puppet] - 10https://gerrit.wikimedia.org/r/284716 [15:56:01] (03CR) 10BBlack: [C: 032] apt|mirrors|ubuntu: split config so HSTS is only over HTTPS [puppet] - 10https://gerrit.wikimedia.org/r/284711 (owner: 10BBlack) [15:56:17] caught on on tcpdump [15:56:20] can you try log out and log in [15:56:50] <_joe_> MarkTraceur: what is the change? [15:56:51] Server greeting immediately followed by RST [15:57:00] hm, that's an RST comming from mw* [15:57:05] _joe_: it's fine [15:57:13] <_joe_> MarkTraceur: ok go on then :) [15:57:17] MarkTraceur: it's fine, go ahead [15:57:18] bd [15:57:20] Thanks guys [15:57:45] <_joe_> paravoid: the rst is coming from the mw hosts? wtf? [15:57:54] just after the 3-way? [15:57:56] could be just a terminated request [15:58:21] (03PS2) 10Andrew Bogott: Use codfw.labtest domain for labtest instances. [puppet] - 10https://gerrit.wikimedia.org/r/284716 [15:58:35] subbu: that traffic is purges ... [15:58:43] so what did change there since the switchover? [15:59:00] since before the switchover, that is [15:59:15] the master changed... was failovered from db1033 to db1041 [15:59:17] godog: sorry ..yes it's setup [15:59:37] same mariadb version though? [15:59:38] also ferm was enabled on all servers [15:59:42] no [15:59:46] from 5.5 to 10 [15:59:55] uhm what? [16:00:04] ah right [16:00:16] the greeting says 5.5.5-10.0.22-MariaDB-log [16:00:19] but I guess that is 10 :) [16:00:23] 10.0.22 [16:00:33] yup, the 5.5.5 at the beginning fooled me [16:00:34] the 5.5.5 could be the client? [16:00:40] (03CR) 10Andrew Bogott: [C: 032] Use codfw.labtest domain for labtest instances. [puppet] - 10https://gerrit.wikimedia.org/r/284716 (owner: 10Andrew Bogott) [16:00:44] no, it's probably mariadb advertising itself as mysql 5.5 [16:00:52] reminds me of user-agent strings [16:00:56] <_joe_> yes [16:00:57] eheheh [16:00:57] yes, forget about that [16:01:07] grants are the same before and after [16:01:41] akosiaris, purges of ... ? 
it is at the 6M/sec and higher since ganglia monitoring has been turned on for parsoid codfw. [16:01:48] https://ganglia.wikimedia.org/latest/graph.php?r=year&z=xlarge&c=Parsoid+codfw&m=cpu_report&s=by+name&mc=2&g=network_report [16:02:07] jynus: let's check connection timeout values [16:02:11] subbu: I mean HTCP purges [16:02:20] those boxes should not have been receiving them [16:02:38] something networky .. looking [16:02:42] <_joe_> volans: we can raise those, yes [16:02:57] yes, I was about to check all variables [16:03:00] jynus: connect_timeout is 3 [16:03:02] was 5 [16:03:05] 06Operations, 10Traffic, 10netops: Set up LVS for current AuthDNS - https://phabricator.wikimedia.org/T101525#2227600 (10BBlack) [16:03:12] look here [16:03:15] https://github.com/wikimedia/mediawiki/blob/fe5d88563b10eb5ce8fec367f1405d658663ee2a/includes/db/DatabaseMysqlBase.php#L85 [16:03:16] * subbu isn't familiar with htcp .. googles [16:03:16] 06Operations, 10Traffic, 10netops: Anycast (Auth)DNS - https://phabricator.wikimedia.org/T98006#2227601 (10BBlack) [16:03:22] could be that the connection is already closed by mysql [16:03:50] <_joe_> oblivian@mw1181:~$ sudo cat /etc/hhvm/server.ini | grep mysql [16:03:50] <_joe_> hhvm.mysql.connect_timeout = 3000 [16:04:00] and wait_timeout is 3600 [16:04:03] <_joe_> just fyi [16:04:05] was 28800 [16:04:21] <_joe_> volans: when did that change, though? [16:04:29] I don't think we send HTCP to parsoid nodes [16:04:32] It might have been too long since I deployed code, but does jenkins not merge things into deployment branches anymore? [16:04:38] _joe_: we changed master [16:04:45] could be that runtime values were different [16:04:46] bblack: we sent HTCP to anyone that subscribes to the IGMP group :) [16:04:53] volans, no unexpected changes [16:04:58] on global variables [16:05:01] well ok fair enough, but there's no reason for them to be subscribed [16:05:04] right? 
[16:05:12] also they comes from different puppet classes [16:05:13] right, I think that's what akosiaris was investigating [16:05:24] not sure if uses different my.cnf , let me check [16:05:39] network-wise, the failure mode of broken multicast is often broadcast [16:05:48] someday we should document our global assignment of multicast addresses, btw [16:05:50] so it could be that [16:05:57] God, finally [16:05:59] someday we'll get rid of multicast! :) [16:05:59] I think there's some in use ops doesn't even really track [16:06:06] for all things, not just HTCP [16:06:12] 06Operations, 10Traffic, 10Continuous-Integration-Infrastructure (phase-out-gallium): Move gallium to an internal host? - https://phabricator.wikimedia.org/T133150#2227615 (10hashar) We have created a sub project in Phabricator https://phabricator.wikimedia.org/project/view/1966/ First step is for #releng t... [16:06:13] paravoid: asw-b-codfw (where wtp2001 is) does not have igmp-snooping btw [16:06:19] 06Operations, 10netops, 10Continuous-Integration-Infrastructure (phase-out-gallium): install/setup/deploy cobalt as replacement for gallium - https://phabricator.wikimedia.org/T95959#2227618 (10hashar) We have created a sub project in Phabricator https://phabricator.wikimedia.org/project/view/1966/ First st... [16:06:24] awesome [16:06:27] then that's the culprit [16:06:29] MarkTraceur: tests are just slow. :) [16:06:36] I am enabling it for wtp2001 to test [16:06:43] nice [16:06:43] thx [16:07:07] there's something I never realized before, that I realized several months ago but never noted in any documentation yet I don't think, about multicast [16:07:35] MatmaRex: Do you want to try this on testwiki, or... [16:07:38] jynus: WTF on db1062 there is a third value for connect_timeout (10) and 3600 for wait_timeout [16:07:45] I'm pretty confident it will be OK [16:07:52] which is that, at the level of subscriptions on ethernet networks, there's a couple of bits that don't matter. 
so multicast addresses "collide" at layer two (you get traffic to your port for things you're not actually subscribed to) if you use 2x distinct multicast IPs that only different in those two bits [16:08:01] <_joe_> volans: you're talking server-side timeouts then? [16:08:11] yes, from global variables in mysql [16:08:18] login times seem to be high, but unconclusive [16:08:29] which is why we should be documenting all multicast assignments and ensuring not only no direct collisions, but no L2 collisions in the face of those two wildcard bits [16:08:43] MarkTraceur: meh [16:08:44] Never mind, testwiki isn't needed [16:08:47] Yeah [16:08:52] it's simple enough. verified locally [16:09:52] I don't see anything weird wrt db1041 [16:10:00] !log disabling event scheduler on db1041 [16:10:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:10:05] network is fine, both iptables-wise and in general [16:10:09] and bandwidth tec. [16:10:18] the number of TCP connections is pretty low too [16:10:33] MatmaRex: Syncing [16:10:35] paravoid: agree from what I've checked too [16:10:44] good news is that user-facing issues are minimal [16:10:51] it is mostly affecting rpc [16:10:55] !log marktraceur@tin Synchronized php-1.27.0-wmf.21/resources/src/mediawiki/api/upload.js: Unbreak finishing stash uploads in upload dialog (duration: 00m 27s) [16:10:58] (job queue) [16:10:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:11:14] also weird this doesn't happen in other mariadb10 db boxes, right? [16:11:18] MatmaRex: Test? [16:11:20] oh wait, I described that wrong: there's only one L2-wildcard bit on multicast IPv4 [16:11:20] could it be related to traffic/network on mediawiki side? 
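A minimal sketch of the layer-2 collision described above, assuming the standard IPv4-multicast-to-Ethernet mapping (prefix 01:00:5e plus the low 23 bits of the group address). The function names and the example group addresses are hypothetical and not taken from any production assignment. Within a fixed first octet such as 239.x.x.x, the only dropped bit is the top bit of the second octet, which matches the "one wildcard bit" correction; across different first octets more bits are dropped.

```python
import ipaddress


def multicast_mac(group: str) -> str:
    """Return the Ethernet MAC an IPv4 multicast group is mapped to."""
    addr = ipaddress.IPv4Address(group)
    if not addr.is_multicast:
        raise ValueError(f"{group} is not an IPv4 multicast address")
    # Only the low 23 bits of the group address survive the mapping.
    low23 = int(addr) & 0x7FFFFF
    mac = (0x01005E << 24) | low23
    # Format the 48-bit value as six colon-separated hex octets.
    return ":".join(f"{(mac >> shift) & 0xFF:02x}" for shift in range(40, -8, -8))


def l2_collide(group_a: str, group_b: str) -> bool:
    """True if two distinct groups share one MAC, so a switch cannot tell them apart."""
    return multicast_mac(group_a) == multicast_mac(group_b)
```

For example, l2_collide("239.1.2.3", "239.129.2.3") is true because the two addresses differ only in the dropped bit, so a check like this could be run over a list of documented assignments to spot pairs that overlap at layer two.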
[16:11:33] I don't think so [16:11:44] I've tested connectivity from a number of mw* boxes to db1041, nothing suspicious [16:11:45] 07Blocked-on-Operations, 06Operations, 10ops-eqiad: check ganeti1001-1006 for lff to sff adapters - https://phabricator.wikimedia.org/T133224#2227625 (10RobH) The ganeti hosts are Dell, will the restbase HP adapters work? [16:11:54] yeah, beacaus its the only host affected [16:12:01] also db1041 is an outlier on wfDBErrorLog [16:12:06] (well, within 239) [16:12:07] MarkTraceur: eh, i don't want to upload files in prod. [16:12:23] 07Blocked-on-Operations, 06Operations, 10ops-eqiad: check ganeti1001-1006 for lff to sff adapters - https://phabricator.wikimedia.org/T133224#2227628 (10RobH) We'll need up to 4 per system, 4 systems total, for 12 of the adapters. [16:12:26] MatmaRex: Testwiki? *shrug* [16:12:27] queries are happening, at least some [16:12:35] MarkTraceur: that also goes to Commons, i think [16:12:42] jynus: show status like 'Aborted_%'; [16:12:48] Oh, right. [16:12:51] Bloody cross-wiki [16:13:00] yeah. [16:13:00] the counters are increasing any time I refresh it, not many, just a few [16:13:06] we should have it set up in some smarter way [16:13:10] how do I mysql again? :) [16:13:12] like uploads from test2 go to test [16:13:12] So I need to go and take a picture of something [16:13:12] mysql --disable-ssl? [16:13:20] volans, yes, I am looking at the same at https://tendril.wikimedia.org/host/view/db1041.eqiad.wmnet/3306 [16:13:25] (connection problems) [16:13:26] paravoid: mysql --defaults-file=~/.my.cnf [16:13:26] yup, seems to have worked [16:13:36] nah that doesn't work [16:13:44] --skip-ssl [16:13:53] --disable-ssl worked.. [16:13:57] ok [16:14:01] MarkTraceur: easy test is just to wait for someone to successfully upload an image and use it in an article. 
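A rough sketch of how a per-server tally like the "445 10.64.16.30" counts earlier in the log could be reproduced, assuming log lines shaped like the wfLogDBError samples pasted above (free text followed by a JSON context object at the end of the line). The function name is hypothetical, and how the original counts were actually produced is not stated in the channel.

```python
import json
from collections import Counter


def tally_db_servers(lines):
    """Count wfLogDBError lines per db_server found in the trailing JSON context."""
    counts = Counter()
    for line in lines:
        if "wfLogDBError" not in line:
            continue
        start = line.find("{")
        if start == -1:
            continue
        try:
            context = json.loads(line[start:])
        except json.JSONDecodeError:
            # Skip lines whose context is truncated or not valid JSON.
            continue
        server = context.get("db_server")
        if server:
            counts[server] += 1
    return counts
```

Calling tally_db_servers(open(path)).most_common(5) on a log file (path being wherever the log lives, which is an assumption here) would list the hosts with the most connection errors, which is one way to confirm that a single master is the outlier.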
[16:14:09] performance is very low [16:14:16] compared to normal traffic [16:14:34] performance as in "throughput" [16:15:13] MatmaRex: I thought the upload went through OK? [16:15:20] Oh, but you couldn't use it in the article. Right. [16:16:20] MatmaRex: OK, I'm going to go take a picture of the broadcasting company across the street [16:16:28] Hopefully it doesn't start to rain in the next ten minutes [16:18:05] MarkTraceur: :o [16:20:27] there is a couple of disks with some media errors, but nothing fatal [16:22:04] there is also a transaction from 437 seconds but I cannot see how could affect connections [16:22:06] errors are going down [16:23:02] I am going to bet on some hardware/capacity issue + cold buffers, but the solution would be worse than not doing nothing, so I would wait now [16:23:22] and observe the tendency going to 0 [16:23:25] <_joe_> jynus: seems sensible [16:23:48] it it was a hard failure, I wouls failover [16:23:54] but that would create way more issues [16:24:21] current rate of errors are 1/s [16:24:57] 1-10 every 10 seconds [16:25:07] <_joe_> according to logstash it's around 100/min until a few minutes ago [16:25:18] I do not account jobs [16:25:23] <_joe_> oh ok [16:25:27] because they can be retried, etc [16:25:28] 06Operations, 10Beta-Cluster-Infrastructure, 10Deployment-Systems, 13Patch-For-Review, 03Scap3: Automate the generation deployment keys (keyholder-managed ssh keys) - https://phabricator.wikimedia.org/T133211#2227703 (10csteipp) I don't really like the direction, and would rather see us move towards a mo... 
[16:25:48] <_joe_> yeah most errors are from jobs indeed [16:25:51] <_joe_> uhm [16:26:01] <_joe_> let me check those machines [16:26:13] I am looking at https://logstash.wikimedia.org/#dashboard/temp/AVQ5pHapjK4nptUtP8li [16:26:19] for non-rpc traffic [16:29:05] 06Operations, 10MediaWiki-Parser, 06Parsing-Team, 10Traffic, and 5 others: Banners fail to show up occassionally on Russian Wikivoyage - https://phabricator.wikimedia.org/T121135#2227723 (10Jdlrobson) Can we safely call this closed from the community perspective @Atsirlin and @Wrh2 ? Any new reports? [16:30:45] (03CR) 10Chad: "Thanks so much!" [debs/gerrit] - 10https://gerrit.wikimedia.org/r/263631 (owner: 10Chad) [16:32:21] MatmaRex: That was one of the silliest tests ever, but it worked. [16:32:55] MatmaRex: https://en.wikipedia.org/w/index.php?title=Hubbard_Broadcasting&diff=716411941&oldid=710864663 [16:33:16] Oh, wait, enwiki's totally not on wmf21 is it [16:33:29] 06Operations, 10Beta-Cluster-Infrastructure, 10Deployment-Systems, 13Patch-For-Review, 03Scap3: Automate the generation deployment keys (keyholder-managed ssh keys) - https://phabricator.wikimedia.org/T133211#2227746 (10csteipp) I should add, I think we can prevent having the agent's memory dumped (so I... [16:33:31] Oh, whew, it is. [16:34:33] 06Operations, 10MediaWiki-Parser, 06Parsing-Team, 10Traffic, and 5 others: Banners fail to show up occassionally on Russian Wikivoyage - https://phabricator.wikimedia.org/T121135#2227751 (10Wrh2) While I haven't noticed the issue in the past week, if there's no harm in doing so I would leave this open for... [16:34:45] MarkTraceur: haha :D [16:34:59] thanks [16:35:23] 06Operations, 10MediaWiki-Parser, 06Parsing-Team, 10Traffic, and 5 others: Banners fail to show up occassionally on Russian Wikivoyage - https://phabricator.wikimedia.org/T121135#2227780 (10Jdlrobson) No harm, just wanted to check in whether things are looking good. Sounds like they are :) [16:35:32] Of course! 
[16:35:49] MatmaRex: Now if I want to test anything else I need to drive somewhere. Shoot. [16:36:40] 06Operations, 10Wikimedia-General-or-Unknown: api.php gives a 302 redirect to URL with '&*' appended, breaking CORS requests - https://phabricator.wikimedia.org/T132612#2227786 (10matmarex) 05Open>03Resolved a:03ori @ori fixed the Wikimedia configuration long ago. My patch above should prevent reoccurren... [16:38:14] 07Blocked-on-Operations, 06Operations, 10ops-eqiad: check ganeti1001-1006 for lff to sff adapters - https://phabricator.wikimedia.org/T133224#2227794 (10Cmjohnson) yes, the will work and I have exactly 12 [16:55:19] 06Operations, 06Discovery, 06Discovery-Search-Backlog, 03Discovery-Search-Sprint, and 2 others: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236#2227811 (10Gehel) [[ https://wikitech.wikimedia.org/wiki/Multicast_IP_Addresses | Multicast documentation ]] sh... [16:56:42] (03CR) 10Paladox: "@Chad and @QChris even though this repo has jenkins tests they are not use for gate and submit meaning you have to either update the tests" [debs/gerrit] - 10https://gerrit.wikimedia.org/r/263631 (owner: 10Chad) [17:21:15] 07Blocked-on-Operations, 06Operations, 10hardware-requests: Evaluate replacing SATA disks on ganeti100X.eqiad.wmnet with SSDs - https://phabricator.wikimedia.org/T132679#2227898 (10RobH) [17:23:05] (03CR) 10Dzahn: [C: 031] Add additional Gujarati fonts (Rekha) (fonts-gujr-extra) [puppet] - 10https://gerrit.wikimedia.org/r/284655 (https://phabricator.wikimedia.org/T129500) (owner: 10Muehlenhoff) [17:24:50] (03CR) 10Dzahn: [C: 031] "YES, i specifically remember the problem with trying to add fonts-gujr-extra etc. 
thanks" [puppet] - 10https://gerrit.wikimedia.org/r/284653 (owner: 10Muehlenhoff) [17:25:33] (03CR) 10Dzahn: "i remember trying to add this thinking it was trivial, then running into the problem you are fixing with https://gerrit.wikimedia.org/r/#/" [puppet] - 10https://gerrit.wikimedia.org/r/284655 (https://phabricator.wikimedia.org/T129500) (owner: 10Muehlenhoff) [17:26:37] (03PS1) 10Alex Monk: servermon: require_package python-ldap instead [puppet] - 10https://gerrit.wikimedia.org/r/284731 [17:29:20] 06Operations, 07Puppet, 06Commons, 10Wikimedia-SVG-rendering, and 2 others: Add Gujarati fonts to Wikimedia servers - https://phabricator.wikimedia.org/T129500#2227933 (10Dzahn) I tried to do this ^ and ran into issues. but you are fixing those with https://gerrit.wikimedia.org/r/#/c/284653/ first , then... [17:34:04] 07Blocked-on-Operations, 06Operations, 10hardware-requests: Evaluate replacing SATA disks on ganeti100X.eqiad.wmnet with SSDs - https://phabricator.wikimedia.org/T132679#2206587 (10RobH) I've checked with @cmjohnson via the now resolved sub-task, and he has 12 lff to sff drive adapters on site he can use for... [17:34:27] 07Blocked-on-Operations, 06Operations, 10hardware-requests: Evaluate replacing SATA disks on ganeti100X.eqiad.wmnet with SSDs - https://phabricator.wikimedia.org/T132679#2227953 (10RobH) [17:35:08] 07Blocked-on-Operations, 06Operations, 10hardware-requests: Evaluate replacing SATA disks on ganeti100X.eqiad.wmnet with SSDs - https://phabricator.wikimedia.org/T132679#2206587 (10RobH) 05Open>03stalled I'm setting this to stalled, as the pricing details will need to be worked out on blocking task T1333... [17:38:02] (03CR) 1020after4: "@mobrovac: thanks for all the suggestions, as you can see my ruby-fu is not strong. 
:)" (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/284418 (https://phabricator.wikimedia.org/T133211) (owner: 1020after4) [17:38:57] (03PS5) 1020after4: Automate the generation deployment keys (keyholder-managed ssh keys) [puppet] - 10https://gerrit.wikimedia.org/r/284418 (https://phabricator.wikimedia.org/T133211) [17:40:32] (03CR) 10jenkins-bot: [V: 04-1] Automate the generation deployment keys (keyholder-managed ssh keys) [puppet] - 10https://gerrit.wikimedia.org/r/284418 (https://phabricator.wikimedia.org/T133211) (owner: 1020after4) [17:44:44] !log planet1001 - apt-get dist-upgrade (libc6, apache, ..) [17:44:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:46:54] 06Operations, 10hardware-requests, 05codfw-rollout, 03codfw-rollout-Jan-Mar-2016: Log host for codfw (fluorine's equivalent) - https://phabricator.wikimedia.org/T126988#2227995 (10RobH) 05Open>03Resolved T128796 is the setup task and is assigned to @ori Since the hardware request is now fulfilled, I'... [17:47:44] PROBLEM - DPKG on planet1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:47:49] (03PS6) 1020after4: Automate the generation deployment keys (keyholder-managed ssh keys) [puppet] - 10https://gerrit.wikimedia.org/r/284418 (https://phabricator.wikimedia.org/T133211) [17:49:15] ACKNOWLEDGEMENT - DPKG on planet1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages daniel_zahn upgrade ongoing [17:51:17] (03CR) 1020after4: [C: 04-1] "still needs testing and discussion before going to production but the code should be pretty solid at this point." [puppet] - 10https://gerrit.wikimedia.org/r/284418 (https://phabricator.wikimedia.org/T133211) (owner: 1020after4) [17:54:04] RECOVERY - DPKG on planet1001 is OK: All packages OK [17:55:22] 06Operations, 10Monitoring: High levels of PoolCounter errors should trigger alerts - https://phabricator.wikimedia.org/T133318#2228045 (10MaxSem) [17:57:00] moritzm: re: T131928. 
is installing "linux-image-4.4" correct? matches multiple packages then. why do i not get 4.4 from dist-upgrade? [17:57:01] T131928: Upgrade jessie systems from Linux 3.19 to 4.4 - https://phabricator.wikimedia.org/T131928 [17:58:08] moritzm: or just linux-image-4.4.0-0.bpo.1-amd64 [17:59:32] 07Puppet, 10Beta-Cluster-Infrastructure, 06Labs: /etc/puppet/puppet.conf keeps getting double content - first for labs-wide puppetmaster, then for the correct puppetmaster - https://phabricator.wikimedia.org/T132689#2228059 (10Krenair) this just happened twice more :/ [17:59:51] (03CR) 10Chad: "I know. I don't want this repo to auto-merge anyway." [debs/gerrit] - 10https://gerrit.wikimedia.org/r/263631 (owner: 10Chad) [18:00:53] PROBLEM - MariaDB disk space on labsdb1001 is CRITICAL: DISK CRITICAL - free space: /srv 179614 MB (5% inode=99%) [18:02:22] jynus ^ volans: labsdb1001 space something for you guys or labs? [18:02:36] (or you still dealing with post changeover and it can wait?) [18:03:21] actually it is for labs [18:03:49] people, running out of space, they should tell me what to delete (there is a ticket for that) [18:04:04] 07Puppet, 10Beta-Cluster-Infrastructure, 06Labs: /etc/puppet/puppet.conf keeps getting double content - first for labs-wide puppetmaster, then for the correct puppetmaster - https://phabricator.wikimedia.org/T132689#2228070 (10hashar) Le 21/04/2016 19:59, Krenair a écrit : > this just happened twice more :/... [18:04:17] jynus: ok, cool, i just didnt want to ignore a page! [18:04:29] lemme find task and append in that its paging alerts [18:04:56] https://phabricator.wikimedia.org/T132431 [18:05:02] yep found it, thx =] [18:05:14] (03CR) 10Bmansurov: [C: 031] "Looks similar to lazy loading images." 
[puppet] - 10https://gerrit.wikimedia.org/r/284576 (owner: 10Jdlrobson) [18:05:54] jynus: 100GB used in ~10 minutes [18:06:30] prince or user-created [18:06:44] PROBLEM - DPKG on planet1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [18:07:19] (03PS3) 10Bmansurov: Split mobile text cache for lazy loaded references testing [puppet] - 10https://gerrit.wikimedia.org/r/284576 (https://phabricator.wikimedia.org/T129693) (owner: 10Jdlrobson) [18:07:23] PROBLEM - RAID on planet1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:07:42] ^ me. hrmm [18:08:28] jynus: it will finish space soon, I'm looking for a culprit [18:08:44] RECOVERY - DPKG on planet1001 is OK: All packages OK [18:08:52] (03CR) 10BBlack: [C: 04-1] "This is some kind of duplicate commit from rebase woes or something. It's identical to the parent commit (already merged)." [puppet] - 10https://gerrit.wikimedia.org/r/284576 (https://phabricator.wikimedia.org/T129693) (owner: 10Jdlrobson) [18:09:15] RECOVERY - RAID on planet1001 is OK: OK: no RAID installed [18:09:32] i'm glad the RAID that doesnt exist is OK [18:09:38] (03CR) 10BBlack: "oh wait, I missed the subtle change, was that new in PS3?" [puppet] - 10https://gerrit.wikimedia.org/r/284576 (https://phabricator.wikimedia.org/T129693) (owner: 10Jdlrobson) [18:10:13] 06Operations, 10Beta-Cluster-Infrastructure, 10Deployment-Systems, 13Patch-For-Review, 03Scap3: Automate the generation deployment keys (keyholder-managed ssh keys) - https://phabricator.wikimedia.org/T133211#2228112 (10mmodell) @csteipp: Thanks for checking my assumptions. >>! In T133211#2227703, @cste... 
[18:10:27] !log planet1001 - reboot for upgrade [18:10:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:10:35] there are 2 ongoing imports, s51362__erwin85 [18:11:03] and s51127__dewiki_lists [18:11:16] yes [18:12:47] !log planet1001 - on 4.4.0-1-amd64 now (T131928) [18:12:47] T131928: Upgrade jessie systems from Linux 3.19 to 4.4 - https://phabricator.wikimedia.org/T131928 [18:12:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:13:11] I was looking at the db sizes on disk to see which one is growing more [18:14:42] jynus: did you kill any query? 50GB just freed up :) [18:15:03] I tried to extend the volume [18:15:08] no more free space, though [18:15:33] !log bromine - apt-get dist-upgrade [18:15:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:15:41] <_joe_> so we need either more disk or a good cleanup or both [18:18:15] made some space, not much though, maybe enough for the alarm to get back to warning [18:19:44] PROBLEM - DPKG on bromine is CRITICAL: DPKG CRITICAL dpkg reports broken packages [18:22:01] if xtools is still using 300G for a temp table, I think we should probably ask them if they can reduce that? [18:23:05] valhallasw`cloud: are you referring to 282G s51187__xtools_tmp? [18:23:10] yes [18:23:16] that would help, yes! 
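A small sketch of the "which database is growing fastest" check mentioned above, assuming each database is a subdirectory of the MariaDB data directory and that two snapshots are taken a few minutes apart. The function names are hypothetical, and the actual commands used on labsdb1001 are not shown in the log.

```python
import os


def dir_sizes(datadir):
    """Return {subdirectory name: total size in bytes of the files beneath it}."""
    sizes = {}
    with os.scandir(datadir) as entries:
        for entry in entries:
            if not entry.is_dir(follow_symlinks=False):
                continue
            total = 0
            for root, _dirs, files in os.walk(entry.path):
                for name in files:
                    try:
                        total += os.lstat(os.path.join(root, name)).st_size
                    except OSError:
                        # Files can vanish between listing and stat on a live server.
                        pass
            sizes[entry.name] = total
    return sizes


def rank_growth(before, after):
    """Sort databases by bytes gained between two dir_sizes() snapshots, largest first."""
    growth = {name: size - before.get(name, 0) for name, size in after.items()}
    return sorted(growth.items(), key=lambda item: item[1], reverse=True)
```

Taking one snapshot, waiting, taking a second and passing both to rank_growth puts the schema behind an ongoing import at the top of the list. Note that file sizes reflect apparent size, not necessarily blocks freed or allocated on disk.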
[18:23:38] is a whole DB, not a table [18:23:46] *nod* [18:24:00] with ~10k tables [18:24:10] I'll create tasks for the large tables listed, and then either we have a cleanup, or we have documentation why the size is necessary [18:24:25] we can move them [18:24:49] there is space in other places, but it would be unaccesible [18:24:53] no table is particularly big, there are just 10k of them [18:25:27] the other is p50380g50816__pop_stats [18:25:56] !log bromine (annualreport,bz-static,transparency,releases) reboot for kernel upgrade [18:26:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:26:13] u3532__ also, 64G [18:26:35] yes [18:26:49] some of these seem read only, which could be compressed [18:29:14] !log bromine - on 4.4.0-1-amd64 now (T131928) [18:29:15] T131928: Upgrade jessie systems from Linux 3.19 to 4.4 - https://phabricator.wikimedia.org/T131928 [18:29:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:37:59] who is using the "VE performance testing rig"? [18:38:08] jynus: is there an easy way to see what accounts have grants on p50380g50816__pop_stats? 
[18:38:08] osmium [18:38:12] 07Puppet, 10Beta-Cluster-Infrastructure, 06Labs: /etc/puppet/puppet.conf keeps getting double content - first for labs-wide puppetmaster, then for the correct puppetmaster - https://phabricator.wikimedia.org/T132689#2228226 (10Krenair) cache-text04 again, I'll look into it [18:39:56] valhallasw`cloud, I think not in mysql- the mysql user name will be that, not sure how it links to LDAP [18:40:19] jynus: it's an old format user name that doesn't link to ldap :( [18:40:43] jynus: so I'm mostly wondering whether maybe some (new format) user also has access to the table [18:41:28] jynus: I clarified (I hope) in https://phabricator.wikimedia.org/T133326#2228235 [18:41:59] s51401 [18:42:02] I think [18:42:12] not 100% sure [18:42:57] at least that user has access to that db [18:43:06] aah, wait, I'm confusing two p....g.... style tables. Sorry, I meant p50380g50943__cache [18:43:12] !log rutherfordium (people.wm.org) - installing package upgrades [18:43:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:43:40] s51401 is tools.popularpages, which fits with the p50380g50816__pop_stats database [18:43:59] for p50380g50943, only p50380g50943 has access [18:44:17] ok, thanks. [18:47:25] uuh. [18:48:01] ok, I'm making a chaos. It is popularpages, the 50943 is a red herring [18:48:55] PROBLEM - DPKG on rutherfordium is CRITICAL: DPKG CRITICAL dpkg reports broken packages [18:50:35] lx [18:51:16] ACKNOWLEDGEMENT - DPKG on rutherfordium is CRITICAL: DPKG CRITICAL dpkg reports broken packages daniel_zahn upgrade [18:52:08] do we still run anything in pmtpa? [18:52:12] Danny_B: nope [18:52:35] and we don't use pmtpa in any generic domain anymore as well? [18:52:36] the name of the bot on our read-only IRC has "pmtpa" but that's a lie [18:52:55] yeah, not only bot actually [18:52:57] (03CR) 10Paladox: "Ok." 
[debs/gerrit] - 10https://gerrit.wikimedia.org/r/263631 (owner: 10Chad) [18:52:57] Danny_B: no, doesnt appear in DNS [18:53:05] RECOVERY - DPKG on rutherfordium is OK: All packages OK [18:56:38] !log people.wm.org / rutherfordium , very short downtime, reboot [18:56:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:57:53] !log rutherfordium - on 4.4.0-1-amd64 now (T131928) [18:57:54] T131928: Upgrade jessie systems from Linux 3.19 to 4.4 - https://phabricator.wikimedia.org/T131928 [18:57:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:58:57] 06Operations, 10Wikimedia-IRC-RC-Server: IRC RC server still mentions pmtpa on various places - https://phabricator.wikimedia.org/T133328#2228294 (10Danny_B) [18:59:10] mutante: ^^ [18:59:19] 06Operations, 10VisualEditor experimentation: reinstall osmium with jessie - https://phabricator.wikimedia.org/T132530#2201664 (10Dzahn) @catrope if we want to reinstall this, who should we give a heads up, and do we need to save data? [18:59:42] Danny_B: yea, but renaming the bot breaks it for some reason [18:59:51] thats why that is the last thing left or so [19:00:00] it's not only bot [19:00:06] MOTD ie. [19:02:43] Danny_B: https://phabricator.wikimedia.org/rOPUP29586c7772ad9eb20e1cf9b249353e14cd6ca38e [19:02:50] mutante, it's considered a breaking change because other people might be relying on that name to parse things [19:03:52] yep, i have no intention to change it [19:04:20] i hope one day stream.wm can actually replace mw-irc [19:04:30] as it was expected for some time [19:06:51] mutante: i don't think irc will die. 
i heard that somebody is creating stream->irc transponder [19:08:17] hahaa [19:08:28] 06Operations, 10hardware-requests: eqiad: (3) nodes for Druid / analytics - https://phabricator.wikimedia.org/T128807#2228327 (10RobH) 05Open>03Resolved This hardware request has been granted with the purchase of new druid nodes via procurement task T132068. [19:08:28] well, on wikitech it says that stream is replacing it [19:08:44] one time i was almost about to decom it [19:08:56] then people said ''nooooo.. wait a moment, not replaced at all' [19:09:00] maybe on their own network [19:09:03] and that's been a while [19:09:06] but the wikimedia one will be replaced [19:09:24] not soon as i heard [19:09:51] but you know what,. i actually added systemd unit files for the ircd.. and i provided a jessie VM [19:09:54] btw: iirc rc-pmtpa is also not the original name of the bot, it used to be simply "rc" [19:10:00] 06Operations, 06Discovery, 10hardware-requests, 03Discovery-Search-Sprint: eqiad: (2) Relevance forge servers - https://phabricator.wikimedia.org/T131184#2228333 (10RobH) 05Open>03stalled This is presently pending mgmt purchase/allocations approvals on procurement task T131871. I'm setting this to sta... [19:10:07] next we need the python irclib package on jessie. i think Moritz is taking it [19:10:42] 06Operations, 10hardware-requests: additional graphite machines request, 1x per DC - https://phabricator.wikimedia.org/T126253#2228336 (10RobH) 05stalled>03Resolved resolving, as all new graphite hosts have setup tasks and have been allocated. 
[19:10:44] 07Blocked-on-Operations, 06Operations, 06Services, 06WMDE-Analytics-Engineering, and 2 others: scale graphite deployment (tracking) - https://phabricator.wikimedia.org/T85451#2228338 (10RobH) [19:12:05] RECOVERY - MariaDB disk space on labsdb1001 is OK: DISK OK [19:27:58] (03PS1) 10Ori.livneh: Promote 'experimental' sanity check to be the default [puppet] - 10https://gerrit.wikimedia.org/r/284743 (https://phabricator.wikimedia.org/T126217) [19:30:36] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 669 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5172978 keys - replication_delay is 669 [19:30:54] (03CR) 10Ori.livneh: [C: 032] Promote 'experimental' sanity check to be the default [puppet] - 10https://gerrit.wikimedia.org/r/284743 (https://phabricator.wikimedia.org/T126217) (owner: 10Ori.livneh) [19:31:25] (03PS1) 10Andrew Bogott: labs_bootstrapvz: A few tweaks to improve image behavior in testlabs [puppet] - 10https://gerrit.wikimedia.org/r/284745 [19:33:53] (03CR) 10Andrew Bogott: [C: 032] labs_bootstrapvz: A few tweaks to improve image behavior in testlabs [puppet] - 10https://gerrit.wikimedia.org/r/284745 (owner: 10Andrew Bogott) [19:35:18] 06Operations: create a mailing list for maint-announce mail - https://phabricator.wikimedia.org/T132968#2228385 (10Dzahn) even though we have: ^.*[@.]teliasonera\.com$ ^.*[@.]equinix\.com$ mail from no-reply@equinix.com got moderated and mail from ncm@teliasonera.com passed through without manual interaction... 
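For reference, the two sender patterns quoted in the T132968 comment above can be checked against the two addresses it mentions. This is only a minimal sketch: it assumes the patterns are evaluated as Python-style regular expressions with case-insensitive matching, which the log does not confirm, and the helper name `matching_patterns` is hypothetical.

```python
import re

# Patterns copied verbatim from the T132968 comment quoted in the log.
SENDER_PATTERNS = [
    r"^.*[@.]teliasonera\.com$",
    r"^.*[@.]equinix\.com$",
]


def matching_patterns(address, patterns=SENDER_PATTERNS):
    """Return the patterns that match the given sender address.

    Assumption: the list filter behaves like re.match with IGNORECASE.
    """
    return [p for p in patterns if re.match(p, address, re.IGNORECASE)]


if __name__ == "__main__":
    # The two sender addresses mentioned in the comment.
    for sender in ("no-reply@equinix.com", "ncm@teliasonera.com"):
        print(sender, "->", matching_patterns(sender))
```

Under that assumption each address matches its own pattern, so the patterns by themselves would not explain why one message was moderated and the other was not.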
[19:35:42] 06Operations: create a mailing list for maint-announce mail - https://phabricator.wikimedia.org/T132968#2228386 (10Dzahn) 05Open>03Resolved [19:35:44] 06Operations, 13Patch-For-Review: move RT off of magnesium - https://phabricator.wikimedia.org/T119112#2228387 (10Dzahn) [19:38:27] 06Operations, 10Beta-Cluster-Infrastructure, 10Deployment-Systems, 13Patch-For-Review, 03Scap3: Automate the generation deployment keys (keyholder-managed ssh keys) - https://phabricator.wikimedia.org/T133211#2225168 (10hashar) I am not a fan of keyholder and multiplying deploy users for each software we... [19:43:17] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5152368 keys - replication_delay is 0 [19:59:44] mutante: linux-meta pulls in the new 4.4 kernel package on jessie systems [20:00:16] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 644 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5153174 keys - replication_delay is 644 [20:00:36] moritzm: ah. so "linux-meta was held back" but i did apt-get install linux-image-4.4.0-1-amd64 and it went fine [20:01:01] 06Operations, 06Performance-Team: Update memcached package and configuration options - https://phabricator.wikimedia.org/T129963#2228436 (10ori) >>! In T129963#2226570, @elukey wrote: > The new version seems to be in Sid: https://packages.debian.org/sid/memcached > > What about: > > 1) test the package quick... 
[20:04:29] (03PS1) 10BBlack: rcstream: use correct "chained" cert file [puppet] - 10https://gerrit.wikimedia.org/r/284747 [20:05:14] (03CR) 10BBlack: [C: 032 V: 032] rcstream: use correct "chained" cert file [puppet] - 10https://gerrit.wikimedia.org/r/284747 (owner: 10BBlack) [20:07:46] !log reindex elasticsearch updates for the duration of the switch back from codfw, just in case [20:07:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:18:06] (03PS1) 10Dzahn: add install_server::tftp role on install1001 [puppet] - 10https://gerrit.wikimedia.org/r/284752 (https://phabricator.wikimedia.org/T132757) [20:18:34] 06Operations, 10Traffic, 07HTTPS, 13Patch-For-Review: Enforce HTTPS+HSTS on remaining one-off sites in wikimedia.org that don't use standard cache cluster termination - https://phabricator.wikimedia.org/T132521#2228498 (10BBlack) I've missed some meta-tracking (putting bug refs on patches, etc), but status... [20:23:01] (03PS2) 10Dzahn: add install_server::tftp role on install1001 [puppet] - 10https://gerrit.wikimedia.org/r/284752 (https://phabricator.wikimedia.org/T132757) [20:23:41] (03CR) 10Dzahn: [C: 032] "also pulls in base::firewall and ferm rule (but not all of the other things that are on carbon)" [puppet] - 10https://gerrit.wikimedia.org/r/284752 (https://phabricator.wikimedia.org/T132757) (owner: 10Dzahn) [20:32:41] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5154046 keys - replication_delay is 0 [20:34:21] !log lowering elasticsearch high watermark to rebalance disk space across cluster [20:34:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:34:38] 06Operations, 10Traffic, 07HTTPS, 13Patch-For-Review: Enforce HTTPS+HSTS on remaining one-off sites in wikimedia.org that don't use standard cache cluster termination - https://phabricator.wikimedia.org/T132521#2228590 (10BBlack) One more thing I didn't note 
above: stream.wm.o lacks HSTS on fetch of `/`,... [20:36:01] 06Operations, 10ops-eqiad, 10DBA: db1070, db1071 and db1065 overheating problems - https://phabricator.wikimedia.org/T132515#2228592 (10Volans) The DIMM is handled on T133250. Leaving this open to monitor all 3 servers on dmesg and on wfLogDBError in the next days to see if the problem disappeared. [20:47:55] (03PS1) 10BBlack: stream.wm.o: rewrite / => /rcstream_status [puppet] - 10https://gerrit.wikimedia.org/r/284760 [20:49:08] (03PS2) 10BBlack: stream.wm.o: rewrite / => /rcstream_status [puppet] - 10https://gerrit.wikimedia.org/r/284760 (https://phabricator.wikimedia.org/T132521) [20:50:37] 06Operations, 10MediaWiki-General-or-Unknown, 10Monitoring: edit.success in graphite never reached zero during codfw switchover - https://phabricator.wikimedia.org/T133177#2228686 (10Krinkle) Also note that, contrary to the `edit.failures` metric, `edit.success` (`MediaWiki.timing.editResponseTime`) is actua... [20:54:18] (03PS1) 10Dzahn: install_server: split out reprepro role (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/284763 (https://phabricator.wikimedia.org/T132757) [20:54:34] (03PS1) 10Ori.livneh: memcached: on mc2010, enable 'maxconns_fast' & 'hash_algorithm=murmur3' [puppet] - 10https://gerrit.wikimedia.org/r/284765 (https://phabricator.wikimedia.org/T129963) [20:55:05] bblack: hi! gentle ping :) https://phabricator.wikimedia.org/T132374 [20:55:41] bblack: I can take care of that one and take it off your plate [20:55:46] re: AndyRussG [20:56:09] ori: coolbeans, thx! [20:56:35] (03CR) 10jenkins-bot: [V: 04-1] memcached: on mc2010, enable 'maxconns_fast' & 'hash_algorithm=murmur3' [puppet] - 10https://gerrit.wikimedia.org/r/284765 (https://phabricator.wikimedia.org/T129963) (owner: 10Ori.livneh) [20:56:53] valhallasw`cloud: still around? 
[20:57:20] (coolbeans, like those in the chilled hummus that just squished between the keys of my keyboard....rrrgrgrr) [20:58:25] (03PS1) 10Dzahn: install_server: move tftp role to autoloader layout [puppet] - 10https://gerrit.wikimedia.org/r/284793 (https://phabricator.wikimedia.org/T132757) [20:59:16] AndyRussG: what string pattern matches all centralnotice cookies? centralnotice_*? what about bannercount_* cookies? [20:59:39] ori: that's be awesome, thanks :) [20:59:48] ori: hmmm lemme see [20:59:56] volans: on mobile and going to sleep soon but quick q is ok [21:00:07] (03PS1) 10BBlack: ganglia web: HTTP->HTTPS redir [puppet] - 10https://gerrit.wikimedia.org/r/284803 (https://phabricator.wikimedia.org/T132521) [21:01:33] valhallasw`cloud: just that for users of the type p50380g50816 Yuvi.Panda and bd.808 knows the secret magic how to find them (no dots in the nicks) ;) [21:01:39] (03CR) 10EBernhardson: "With the new plan to scp .debs to the appropriate machines and upgrade manually, only updating puppet once all done, it seems this patch c" [puppet] - 10https://gerrit.wikimedia.org/r/282743 (https://phabricator.wikimedia.org/T132376) (owner: 10EBernhardson) [21:02:25] Volans: ah, will keep that in mind for next time. Thanks! [21:02:29] and thanks a lot for taking care of that task creating all the subtasks and getting attention to them [21:02:30] (03Abandoned) 10EBernhardson: Send the api request log to kafka [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240617 (https://phabricator.wikimedia.org/T108618) (owner: 10EBernhardson) [21:02:34] really appreciated! [21:02:36] have a good night [21:02:41] You're welcome! [21:04:18] ori: sadly there is no one string to find them all [21:04:30] (03PS1) 10Dzahn: install_server: move mirrors stuff to own role [puppet] - 10https://gerrit.wikimedia.org/r/284809 (https://phabricator.wikimedia.org/T132757) [21:04:35] AndyRussG: what are the various patterns? 
[21:05:16] I presume we're not worried about the master 'CN' cookie since there's only one of those [21:05:43] yeah [21:05:54] ori: here are a bunch of ones I queried from the mixin database. Probably many of these were never actually cookies, though, but were just used in LocalStorage (just noticed the comment doesn't reflect that) https://phabricator.wikimedia.org/T131319#2197176 [21:06:21] There is also the likely case of cookies that were set by ad-hoc in-banner JS [21:06:48] i'll just grab a sample of all cookie names [21:06:55] ori: yeah that'd be fantastic [21:07:11] what's up? [21:07:54] (03PS1) 10BBlack: gerrit web: use 301 for https redir [puppet] - 10https://gerrit.wikimedia.org/r/284811 (https://phabricator.wikimedia.org/T132521) [21:08:21] (03PS1) 10BBlack: icinga web: use 301 for https redir [puppet] - 10https://gerrit.wikimedia.org/r/284812 (https://phabricator.wikimedia.org/T132521) [21:08:38] (03PS1) 10BBlack: wikitech: use 301 for https redir [puppet] - 10https://gerrit.wikimedia.org/r/284813 (https://phabricator.wikimedia.org/T132521) [21:08:58] (03PS1) 10BBlack: tendril: use 301 for https redir [puppet] - 10https://gerrit.wikimedia.org/r/284814 (https://phabricator.wikimedia.org/T132521) [21:10:26] ori: see further down on that task for more details about which types of cookies we'll be going after to start. The main info I was thinking of trying to get from prod is, yeah, ones that are harder to identify from the code or CN logs... 
to complement what I may find from searches in banner JS on meta [21:11:15] if u ever see 2 cookies with the same beginning and "-wait" appended to one of them, those are from CN [21:11:19] paravoid: most things, I hope [21:11:25] haha [21:11:34] touche [21:12:05] For impression diet: https://github.com/wikimedia/mediawiki-extensions-CentralNotice/blob/4a01c9712944991d4e2c6ee8d9a4bb5430656bc0/resources/subscribing/ext.centralNotice.impressionDiet.js#L255-L260 [21:12:13] (03PS3) 10Dzahn: Move from python-irclib to python-irc [puppet] - 10https://gerrit.wikimedia.org/r/284672 (https://phabricator.wikimedia.org/T133101) (owner: 10Muehlenhoff) [21:12:56] For large banner limiting: https://github.com/wikimedia/mediawiki-extensions-CentralNotice/blob/4a01c9712944991d4e2c6ee8d9a4bb5430656bc0/resources/subscribing/ext.centralNotice.largeBannerLimit.js#L95-L99 [21:12:59] ori ^ [21:13:45] (03CR) 10Dzahn: [C: 032] "thanks! yea, what Moritz said inline. we can remove this again once argon is down" [puppet] - 10https://gerrit.wikimedia.org/r/284672 (https://phabricator.wikimedia.org/T133101) (owner: 10Muehlenhoff) [21:17:20] 06Operations: build python-irclib for jessie - https://phabricator.wikimedia.org/T133101#2228940 (10Dzahn) "python-irclib is an older version of the same Python module now packaged as python-irc in jessie and above (Debian #718309)" from https://gerrit.wikimedia.org/r/#/c/284672/ and merged [21:19:09] 06Operations: build python-irclib for jessie - https://phabricator.wikimedia.org/T133101#2228950 (10Dzahn) python-irc got installed on kraz. 
thanks, we can close this it looks (on to the next issue but the package is there :) [21:19:27] 06Operations, 13Patch-For-Review, 07developer-notice, 07notice: Migrate argon (irc.wikimedia.org) to Jessie - https://phabricator.wikimedia.org/T123729#2228952 (10Dzahn) [21:19:29] 06Operations: build python-irclib for jessie - https://phabricator.wikimedia.org/T133101#2228951 (10Dzahn) 05Open>03Resolved [21:21:30] 06Operations, 13Patch-For-Review, 07developer-notice, 07notice: Migrate argon (irc.wikimedia.org) to Jessie - https://phabricator.wikimedia.org/T123729#2228971 (10Dzahn) now: [kraz:~] $ dpkg -l | grep python-irc ii python-irc 8.5.3+dfsg-2 all Internet Relay Ch... [21:22:47] 06Operations, 10Wikimedia-IRC-RC-Server: Migrate argon (irc.wikimedia.org) to Jessie - https://phabricator.wikimedia.org/T123729#2228990 (10Dzahn) [21:23:34] (03CR) 10Dzahn: [C: 031] gerrit web: use 301 for https redir [puppet] - 10https://gerrit.wikimedia.org/r/284811 (https://phabricator.wikimedia.org/T132521) (owner: 10BBlack) [21:23:49] (03CR) 10Dzahn: [C: 031] icinga web: use 301 for https redir [puppet] - 10https://gerrit.wikimedia.org/r/284812 (https://phabricator.wikimedia.org/T132521) (owner: 10BBlack) [21:24:06] (03CR) 10Dzahn: [C: 031] wikitech: use 301 for https redir [puppet] - 10https://gerrit.wikimedia.org/r/284813 (https://phabricator.wikimedia.org/T132521) (owner: 10BBlack) [21:27:34] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/284814 (https://phabricator.wikimedia.org/T132521) (owner: 10BBlack) [21:30:25] 0 Matching Host Entries Displayed [21:30:30] 0 Matching Service Entries Displayed [21:31:52] 06Operations, 10Traffic, 07HTTPS: HTTPS Plans (tracking / high-level info) - https://phabricator.wikimedia.org/T104681#2229022 (10BBlack) [21:32:20] (03PS2) 10BBlack: gerrit web: use 301 for https redir [puppet] - 10https://gerrit.wikimedia.org/r/284811 (https://phabricator.wikimedia.org/T132521) [21:32:28] (03CR) 10BBlack: [C: 032 V: 032] gerrit 
web: use 301 for https redir [puppet] - 10https://gerrit.wikimedia.org/r/284811 (https://phabricator.wikimedia.org/T132521) (owner: 10BBlack) [21:32:44] (03PS2) 10BBlack: icinga web: use 301 for https redir [puppet] - 10https://gerrit.wikimedia.org/r/284812 (https://phabricator.wikimedia.org/T132521) [21:32:51] (03CR) 10BBlack: [C: 032 V: 032] icinga web: use 301 for https redir [puppet] - 10https://gerrit.wikimedia.org/r/284812 (https://phabricator.wikimedia.org/T132521) (owner: 10BBlack) [21:32:59] (03PS2) 10BBlack: wikitech: use 301 for https redir [puppet] - 10https://gerrit.wikimedia.org/r/284813 (https://phabricator.wikimedia.org/T132521) [21:33:05] (03CR) 10BBlack: [C: 032 V: 032] wikitech: use 301 for https redir [puppet] - 10https://gerrit.wikimedia.org/r/284813 (https://phabricator.wikimedia.org/T132521) (owner: 10BBlack) [21:33:13] (03PS2) 10BBlack: tendril: use 301 for https redir [puppet] - 10https://gerrit.wikimedia.org/r/284814 (https://phabricator.wikimedia.org/T132521) [21:33:23] (03CR) 10BBlack: [C: 032 V: 032] tendril: use 301 for https redir [puppet] - 10https://gerrit.wikimedia.org/r/284814 (https://phabricator.wikimedia.org/T132521) (owner: 10BBlack) [21:38:36] Hi im wondering if wikimedia will upgrade to use Ubuntu 16.04 lts instead of debian for the reason it now includes support for snap packages [21:38:45] Which allow quicker creation of packages like windows and it is secure and allows you to update on the fly [21:38:50] Without needing to update os from 16.04 to another one to get new features [21:39:14] probably not, would be my guess [21:39:56] Oh [21:40:22] snap packages are a mixed bag in a well-controlled environment. it "solves" the problem by duplicating dependencies when bundling them in snaps. e.g. when a certain shared library has a sec bug, now that shared library might need upgrading not only on the base system, but inside 10 different snaps, too. and who's checking that within and maintaining all those app-specific snaps? 
[21:40:50] at least, that's my understanding and take on it, from spending all of like 10 minutes looking at snaps the other day :) [21:41:41] 06Operations, 10Analytics, 10MediaWiki-extensions-CentralNotice, 10Traffic: Generate a list of junk CN cookies being sent by clients - https://phabricator.wikimedia.org/T132374#2229057 (10ori) 05Open>03Resolved a:03ori I captured about 20 minutes' worth of cookie names by running varnishlog on cp1066... [21:42:13] but also, in general we're just more debian-aligned than ubuntu-aligned in general on a number of fronts. as we finish up moving systems off of precise/trusty to debian jessie, I imagine our next target from there will be the next debian stable [21:43:23] Ok [21:43:57] Maybe debian could add snap package support. [21:44:04] :) [21:44:09] 06Operations, 10Beta-Cluster-Infrastructure, 10Deployment-Systems, 13Patch-For-Review, 03Scap3: Automate the generation deployment keys (keyholder-managed ssh keys) - https://phabricator.wikimedia.org/T133211#2229100 (10mobrovac) I agree with @csteipp that a general solution is needed, but I also think t... [21:46:44] bblack: yup, they just pack everything up [21:46:50] snappy pkgs, that is [21:47:10] 06Operations, 10Analytics, 10MediaWiki-extensions-CentralNotice, 10Traffic: Generate a list of junk CN cookies being sent by clients - https://phabricator.wikimedia.org/T132374#2229103 (10AndyRussG) Nice!! Thx much!! 
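Regarding the junk CN cookie list (T132374): the hint given earlier in the log was that two cookies with the same beginning, one of them with "-wait" appended, come from CentralNotice. A rough sketch of applying that hint to a captured list of cookie names follows; the function name `find_wait_pairs` and the sample names are made up for illustration and do not come from the task.

```python
def find_wait_pairs(cookie_names, suffix="-wait"):
    """Return sorted base names that occur both bare and with the suffix.

    Heuristic from the chat: a cookie "X" seen together with "X-wait"
    is likely a CentralNotice cookie.
    """
    names = set(cookie_names)
    pairs = []
    for name in names:
        if name.endswith(suffix) and len(name) > len(suffix):
            base = name[: -len(suffix)]
            if base in names:
                pairs.append(base)
    return sorted(pairs)


if __name__ == "__main__":
    # Hypothetical cookie names, not taken from production data.
    sample = ["exampleA", "exampleA-wait", "exampleB", "exampleC-wait"]
    print(find_wait_pairs(sample))
```

Only names where both halves of the pair are present are reported, so a lone "-wait" cookie or a lone base name is ignored.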
[21:47:27] i'm just not sure if they pack them all or look at the available debs to deduce what needs packaging [21:48:41] yeah I have no idea either, it just sounds like potential for more CVE pain with local snaps, etc (like every container solution I guess) [21:49:04] (03PS1) 10BBlack: librenms: use chain cert correctly [puppet] - 10https://gerrit.wikimedia.org/r/284817 [21:49:24] (03PS2) 10BBlack: librenms: use chain cert correctly [puppet] - 10https://gerrit.wikimedia.org/r/284817 (https://phabricator.wikimedia.org/T132521) [21:53:06] 06Operations, 10Analytics, 10MediaWiki-extensions-CentralNotice, 10Traffic: Generate a list of junk CN cookies being sent by clients - https://phabricator.wikimedia.org/T132374#2229110 (10BBlack) Just noting here for posterity: since it sounds like we're potentially getting rid of cookies for future CN cam... [21:55:24] (03PS2) 10Ori.livneh: memcached: on mc2010, enable 'maxconns_fast' & 'hash_algorithm=murmur3' [puppet] - 10https://gerrit.wikimedia.org/r/284765 (https://phabricator.wikimedia.org/T129963) [21:59:46] (03PS3) 10Ori.livneh: memcached: on mc2010, set 'maxconns_fast', 'hash_algorithm=murmur3', 'lru_crawler' [puppet] - 10https://gerrit.wikimedia.org/r/284765 (https://phabricator.wikimedia.org/T129963) [22:08:41] (03PS2) 10Andrew Bogott: Mark off a block of public IPs for labtest [dns] - 10https://gerrit.wikimedia.org/r/284491 [22:08:43] (03PS1) 10Andrew Bogott: Allocate LVS service IPs for labs auth and recursor dns [dns] - 10https://gerrit.wikimedia.org/r/284824 [22:13:33] 06Operations, 10Traffic, 07HTTPS: Fix wikitech-static TLS config - https://phabricator.wikimedia.org/T133360#2229282 (10BBlack) [22:14:05] 06Operations, 10Traffic, 07HTTPS: Fix wikitech-static TLS config - https://phabricator.wikimedia.org/T133360#2229301 (10BBlack) [22:14:20] 06Operations, 10Traffic, 07HTTPS: Fix wikitech-static TLS config - https://phabricator.wikimedia.org/T133360#2229282 (10BBlack) [22:22:00] 06Operations, 10Traffic, 07HTTPS: 
Fix wikitech-static TLS config - https://phabricator.wikimedia.org/T133360#2229282 (10Krenair) Fixed #2. [22:27:30] 06Operations, 10Traffic, 07HTTPS: Fix wikitech-static TLS config - https://phabricator.wikimedia.org/T133360#2229345 (10BBlack) If I had to blindly guess on #1, it's that the config has `SSLCertificateFile` and `SSLCertificateKeyFile`, but lacks `SSLCertificateChainFile`, which should point at a copy of the... [22:37:34] PROBLEM - puppet last run on mw1097 is CRITICAL: CRITICAL: Puppet has 1 failures [22:38:12] 07Blocked-on-Operations, 06Operations, 06Increasing-content-coverage, 06Research-and-Data-Backlog: Backport python3-sklearn and python3-sklearn-lib from sid - https://phabricator.wikimedia.org/T133362#2229360 (10ori) [22:38:31] 07Blocked-on-Operations, 06Operations, 06Increasing-content-coverage, 06Research-and-Data-Backlog: Backport python3-sklearn and python3-sklearn-lib from sid - https://phabricator.wikimedia.org/T133362#2229360 (10ori) [22:40:42] (03PS1) 10Andrew Bogott: Add an lvs service ip (labs-ns.wikimedia.org) for labs auth dns [puppet] - 10https://gerrit.wikimedia.org/r/284829 (https://phabricator.wikimedia.org/T119660) [22:49:57] 06Operations, 10Traffic, 07HTTPS: Fix wikitech-static TLS config - https://phabricator.wikimedia.org/T133360#2229407 (10Krenair) a:03Krenair I think I've fixed #1 as well. I did find `SSLCertificateChainFile` in the docs but it's obsolete since apache 2.4.8. Please check. (I also removed the `SSLCACertifi... [22:53:04] 06Operations, 10Analytics, 10DNS, 10Traffic: Create analytics.wikimedia.org - https://phabricator.wikimedia.org/T132407#2229410 (10Nuria) >It will have to be new apache setup for prod ja, but since they will be hosted on a single domain, the puppetization doesn't need any knowledge of the subdirectories of... 
[22:53:55] 06Operations, 10Analytics-Cluster, 06Analytics-Kanban, 10netops, 13Patch-For-Review: setup/deploy server analytics1003/WMF4541 - https://phabricator.wikimedia.org/T130840#2229427 (10Nuria) 05Open>03Resolved [22:53:58] 06Operations, 10Analytics-Cluster, 10hardware-requests: eqiad: New Hive / Oozie server node in eqiad Analytics VLAN - https://phabricator.wikimedia.org/T124945#2229428 (10Nuria) [22:56:51] (03PS1) 10Mattflaschen: Disable Echo survey on French wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284831 [22:57:24] (03PS2) 10Mattflaschen: Disable Echo survey on French wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284831 (https://phabricator.wikimedia.org/T131893) [23:02:54] RECOVERY - puppet last run on mw1097 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:13:51] 06Operations, 10Traffic, 07HTTPS: Fix wikitech-static TLS config - https://phabricator.wikimedia.org/T133360#2229474 (10BBlack) Fix for #2 works, thanks! The deprecation thing is accurate, but the ChainFile method still works. We've just been configuring all of our in-house apaches the deprecated way becau... [23:14:22] 06Operations, 10Traffic, 07HTTPS: Fix wikitech-static TLS config - https://phabricator.wikimedia.org/T133360#2229475 (10BBlack) (edited above - #1 + #2 are fixed) [23:28:34] 06Operations, 10Traffic, 07HTTPS: Fix wikitech-static TLS config - https://phabricator.wikimedia.org/T133360#2229500 (10Krenair) >>! In T133360#2229474, @BBlack wrote: > The deprecation thing is accurate, but the ChainFile method still works. We've just been configuring all of our in-house apaches the depre... [23:39:20] 06Operations, 10Traffic, 07HTTPS: Fix wikitech-static TLS config - https://phabricator.wikimedia.org/T133360#2229522 (10BBlack) For all I know the STS header may have been previously-set in mediawiki config somehow, too, no idea on that. But it's outputting the right header now, so #3 fixed as well. ssllab... 
[23:43:41] 06Operations, 10Traffic, 07HTTPS, 13Patch-For-Review: Enforce HTTPS+HSTS on remaining one-off sites in wikimedia.org that don't use standard cache cluster termination - https://phabricator.wikimedia.org/T132521#2229538 (10Krenair) [23:43:43] 06Operations, 10Traffic, 07HTTPS: Fix wikitech-static TLS config - https://phabricator.wikimedia.org/T133360#2229537 (10Krenair) 05Open>03Resolved [23:45:37] 06Operations, 10Analytics-EventLogging, 06Performance-Team, 13Patch-For-Review: "Throughput of EventLogging NavigationTiming events" UNKNOWN - https://phabricator.wikimedia.org/T132770#2229550 (10Krinkle) a:03Krinkle