[00:05:14] (PS2) Eevans: [WIP]: Cassandra 2.2.5 config [puppet] - https://gerrit.wikimedia.org/r/284078 (https://phabricator.wikimedia.org/T126629)
[00:59:17] Operations, WMF-Legal, Privacy: Consider moving policy.wikimedia.org away from WordPress.com - https://phabricator.wikimedia.org/T132104#2188533 (jayvdb) This task mentions T132103, which isn't public. Has it been resolved?
[01:01:04] PROBLEM - puppet last run on mw2208 is CRITICAL: CRITICAL: Puppet has 1 failures
[01:06:13] PROBLEM - MariaDB Slave Lag: m3 on db1048 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 363.00 seconds
[01:16:31] Operations, WMF-Legal, Privacy: Consider moving policy.wikimedia.org away from WordPress.com - https://phabricator.wikimedia.org/T132104#2221573 (Dzahn) @Jayvdb No, it's still open. If @csteipp is ok with it we can add you to it.
[01:23:56] (PS2) Dzahn: fix puppet-lint alignment [puppet/kafka] - https://gerrit.wikimedia.org/r/283857 (owner: Mschon)
[01:24:51] (CR) Dzahn: [C: 2] fix puppet-lint alignment [puppet/kafka] - https://gerrit.wikimedia.org/r/283857 (owner: Mschon)
[01:25:07] (PS3) Dzahn: fixed puppet-lint alignment [puppet/kafka] - https://gerrit.wikimedia.org/r/283856 (owner: Mschon)
[01:26:09] (CR) Dzahn: [C: 2] "thanks Mschon" [puppet/kafka] - https://gerrit.wikimedia.org/r/283856 (owner: Mschon)
[01:27:27] RECOVERY - puppet last run on mw2208 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[01:28:35] (PS1) Dzahn: kafka: update submodule for lint fixes [puppet] - https://gerrit.wikimedia.org/r/284384
[01:30:55] (PS2) Dzahn: kafka: update submodule for lint fixes [puppet] - https://gerrit.wikimedia.org/r/284384
[01:31:40] (CR) Dzahn: [C: 2] kafka: update submodule for lint fixes [puppet] - https://gerrit.wikimedia.org/r/284384 (owner: Dzahn)
[01:42:25] (PS2) Dzahn: apache: ignore lint issues in mod.pp [puppet] - https://gerrit.wikimedia.org/r/284085
[01:42:55] (PS3) Dzahn: apache: ignore lint issues in mod.pp [puppet] - https://gerrit.wikimedia.org/r/284085
[01:45:02] (CR) Dzahn: [C: 2] "comments-only" [puppet] - https://gerrit.wikimedia.org/r/284085 (owner: Dzahn)
[01:46:56] (PS3) Dzahn: interface: move rps::modparams to own file [puppet] - https://gerrit.wikimedia.org/r/284083
[01:48:20] (PS2) Dzahn: interface: move aggregate_member to own file [puppet] - https://gerrit.wikimedia.org/r/284084
[01:48:34] (PS3) Dzahn: interface: move aggregate_member to own file [puppet] - https://gerrit.wikimedia.org/r/284084
[01:48:48] (CR) jenkins-bot: [V: -1] interface: move aggregate_member to own file [puppet] - https://gerrit.wikimedia.org/r/284084 (owner: Dzahn)
[01:49:36] (CR) Dzahn: "@Akosiaris we still want to use the icinga role on a jessie host, right?" [puppet] - https://gerrit.wikimedia.org/r/284277 (https://phabricator.wikimedia.org/T125023) (owner: Dzahn)
[01:50:07] (PS4) Dzahn: interface: move rps::modparams to own file [puppet] - https://gerrit.wikimedia.org/r/284083
[01:50:25] (PS4) Dzahn: interface: move aggregate_member to own file [puppet] - https://gerrit.wikimedia.org/r/284084
[01:50:32] RECOVERY - MariaDB Slave Lag: m3 on db1048 is OK: OK slave_sql_lag Replication lag: 0.18 seconds
[01:50:57] PROBLEM - MariaDB Slave Lag: x1 on db1031 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 338.52 seconds
[01:53:01] (CR) Dzahn: "eh, but why would i cause " Failed to parse template " when this change doesn't even touch a template?" [puppet] - https://gerrit.wikimedia.org/r/279682 (owner: Dzahn)
[01:54:57] RECOVERY - MariaDB Slave Lag: x1 on db1031 is OK: OK slave_sql_lag Replication lag: 0.26 seconds
[02:12:08] (CR) Dzahn: [C: -2] "after looking at module/role/manifests/labs/dnsrecursor.pp and seeing " require dnsrecursor::metalresolver", "interface::ip { 'role::lab::" [puppet] - https://gerrit.wikimedia.org/r/271735 (owner: Dzahn)
[02:22:35] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.21) (duration: 09m 37s)
[02:22:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:31:20] !log l10nupdate@tin ResourceLoader cache refresh completed at Wed Apr 20 02:31:20 UTC 2016 (duration 8m 46s)
[02:31:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:37:42] (PS1) Dzahn: install_server: switch install2001 to jessie [puppet] - https://gerrit.wikimedia.org/r/284387 (https://phabricator.wikimedia.org/T132757)
[03:03:34] (PS1) Dzahn: add install1001.wikimedia.org using 208.80.154.6 [dns] - https://gerrit.wikimedia.org/r/284388 (https://phabricator.wikimedia.org/T132757)
[03:07:08] (PS2) Dzahn: add install1001.wikimedia.org using 208.80.154.6 [dns] - https://gerrit.wikimedia.org/r/284388 (https://phabricator.wikimedia.org/T132757)
[03:07:44] (PS3) Dzahn: add install1001.wikimedia.org using 208.80.154.6 [dns] - https://gerrit.wikimedia.org/r/284388 (https://phabricator.wikimedia.org/T132757)
[04:11:47] PROBLEM - puppet last run on ms-be3004 is CRITICAL: CRITICAL: puppet fail
[04:13:09] Operations: decom magnesium (was: Reinstall magnesium with jessie) - https://phabricator.wikimedia.org/T123713#2221649 (Dzahn) a:Dzahn
[04:14:02] Operations, Patch-For-Review: move RT off of magnesium - https://phabricator.wikimedia.org/T119112#2221650 (Dzahn) a:Dzahn
[04:14:42] Operations: create a mailing list for maint-announce mail - https://phabricator.wikimedia.org/T132968#2221651 (Dzahn) a:Dzahn
[04:16:37] Operations, Patch-For-Review: move RT off of magnesium - https://phabricator.wikimedia.org/T119112#2221654 (Dzahn)
[04:16:39] Operations, WMF-NDA: Migrate RT to Phabricator - https://phabricator.wikimedia.org/T38#2221655 (Dzahn)
[04:16:45] Operations, Phabricator: migrate RT maint-announce into phabricator - https://phabricator.wikimedia.org/T118176#2221652 (Dzahn) Open>declined I'm declining this since moving into phabricator seems off the table. T132968 is for the mailing list. Thanks for mentioning the OTRS option but i think we...
[04:19:40] Operations: create a mailing list for maint-announce mail - https://phabricator.wikimedia.org/T132968#2221656 (Dzahn) We can do whitelisting but on the other hand we did not do that in RT either afair and did not have a (big) spam problem. Just saying, it has advantages but also means we have to maintain the...
[04:26:14] Operations: create a mailing list for maint-announce mail - https://phabricator.wikimedia.org/T132968#2221657 (Dzahn) We need to fill out the "**List of non-member addresses whose postings should be automatically accepted."** deep link: [[ https://lists.wikimedia.org/mailman/admin/maint-announce/?VARHELP=pr...
[04:39:27] RECOVERY - puppet last run on ms-be3004 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
[04:44:28] Operations: create a mailing list for maint-announce mail - https://phabricator.wikimedia.org/T132968#2221659 (Dzahn) I started out with the list from [[ https://gerrit.wikimedia.org/r/#/c/276923/4/modules/role/manifests/phabricator/main.pp | here ]] that i once tried to use for this in phab (was reverted b...
[04:50:34] Operations: create a mailing list for maint-announce mail - https://phabricator.wikimedia.org/T132968#2221663 (Dzahn) Additionally, added maint-announce@ (without the .lists.) to [[ https://lists.wikimedia.org/mailman/admin/maint-announce/?VARHELP=privacy/recipient/acceptable_aliases | acceptable_aliases ]]....
[04:55:15] Operations: create a mailing list for maint-announce mail - https://phabricator.wikimedia.org/T132968#2221664 (Dzahn) Finally, [[ https://lists.wikimedia.org/mailman/admin/maint-announce/?VARHELP=privacy/subscribing/subscribe_policy | subscribe_policy ]] to "requires approval" so that the list admin (noc@) h...
[05:18:31] Operations, OfflineContentGenerator: Unable to download PDF files of articles - https://phabricator.wikimedia.org/T133136#2221668 (Dzahn)
[05:20:01] Operations, OfflineContentGenerator: Unable to download PDF files of articles - https://phabricator.wikimedia.org/T133136#2221680 (Dzahn)
[05:26:23] (PS1) Giuseppe Lavagetto: Add "fake" codfw entries for ocg to hotfix an issue [dns] - https://gerrit.wikimedia.org/r/284392 (https://phabricator.wikimedia.org/T133136)
[05:29:33] (CR) Dzahn: [C: 1] Add "fake" codfw entries for ocg to hotfix an issue [dns] - https://gerrit.wikimedia.org/r/284392 (https://phabricator.wikimedia.org/T133136) (owner: Giuseppe Lavagetto)
[05:30:50] Operations, OfflineContentGenerator, Patch-For-Review, codfw-rollout: Unable to download PDF files of articles - https://phabricator.wikimedia.org/T133136#2221699 (Aklapper) p:Triage>High Brought up numerous times: https://en.wikipedia.org/wiki/Wikipedia:Village_pump_%28technical%29#Unabl...
[05:30:59] Operations, OfflineContentGenerator, Patch-For-Review, codfw-rollout: Unable to download PDF files of articles - https://phabricator.wikimedia.org/T133136#2221668 (Joe) Btw just to let others know my findings: # pdf generation works fine # we have a flawed mechanism with which OCG signals where...
[05:32:18] (CR) Giuseppe Lavagetto: [C: 2] Add "fake" codfw entries for ocg to hotfix an issue [dns] - https://gerrit.wikimedia.org/r/284392 (https://phabricator.wikimedia.org/T133136) (owner: Giuseppe Lavagetto)
[05:44:10] Operations, OfflineContentGenerator, Patch-For-Review, codfw-rollout: Unable to download PDF files of articles - https://phabricator.wikimedia.org/T133136#2221722 (Joe) a:Joe
[05:44:46] Operations, OfflineContentGenerator, Patch-For-Review, codfw-rollout: Unable to download PDF files of articles - https://phabricator.wikimedia.org/T133136#2221668 (Joe) So, after wiping the recursor caches for the negative record, I can download PDFs just fine. I am not resolving this issue as w...
[05:47:40] Operations, OfflineContentGenerator, Patch-For-Review, codfw-rollout: Unable to download PDF files of articles - https://phabricator.wikimedia.org/T133136#2221727 (Joe) For reference, here is the list of bugs I wrote months ago and that saw no activity since: T120077 OCG should not be contacted...
[05:48:24] Operations, OfflineContentGenerator, Patch-For-Review, codfw-rollout: Unable to download PDF files of articles - https://phabricator.wikimedia.org/T133136#2221733 (Joe) Open>Resolved
[05:53:38] Operations, OCG-General, Services: OCG should not be contacted directly from the appservers but only via LVS - https://phabricator.wikimedia.org/T120077#2221740 (Joe) Adding to the pile of embarrassment that our handling of ocg issues is, OCG did not work properly across datacenters because of this, s...
[05:53:49] Operations, OCG-General, Services: OCG should not be contacted directly from the appservers but only via LVS - https://phabricator.wikimedia.org/T120077#2221742 (Joe) p:High>Unbreak!
[06:20:53] Operations, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, MediaWiki-extensions-Translate: Publishing translations for central notice banners fails - https://phabricator.wikimedia.org/T104774#1426665 (Nemo_bis) I doubt this would be helped by T116235. This is typically a performance iss...
[06:26:26] Operations, Labs, Monitoring, wikitech.wikimedia.org: Bacula recovery of sql files from silver/wikitech fails - https://phabricator.wikimedia.org/T131195#2221790 (jcrespo) alex- I think I was able to recover to a different host, but not from that file. But I may be wrong. In any case, the root pr...
[06:30:56] PROBLEM - puppet last run on ms-fe1004 is CRITICAL: CRITICAL: Puppet has 3 failures
[06:31:07] PROBLEM - puppet last run on cp2001 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:27] PROBLEM - puppet last run on cp3017 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:27] PROBLEM - puppet last run on ms-be1010 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:47] PROBLEM - puppet last run on cp4010 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:48] PROBLEM - puppet last run on subra is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:06] PROBLEM - puppet last run on analytics1047 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:26] PROBLEM - puppet last run on mw1135 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:37] PROBLEM - puppet last run on mw2073 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:37] PROBLEM - puppet last run on mw2129 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:47] PROBLEM - puppet last run on mw2045 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:35:08] PROBLEM - puppet last run on mw2077 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:37:57] (Abandoned) Giuseppe Lavagetto: switchover: stop jobrunners in eqiad [puppet] (switchover) - https://gerrit.wikimedia.org/r/282880 (owner: Giuseppe Lavagetto)
[06:38:12] (Abandoned) Giuseppe Lavagetto: switchover: make jobrunners in codfw start up [puppet] (switchover) - https://gerrit.wikimedia.org/r/282881 (owner: Giuseppe Lavagetto)
[06:38:28] (Abandoned) Giuseppe Lavagetto: switchover: set mediawiki master datacenter to codfw [puppet] (switchover) - https://gerrit.wikimedia.org/r/282898 (owner: Giuseppe Lavagetto)
[06:38:44] (Abandoned) Giuseppe Lavagetto: switchover: block maintenance scripts from running in eqiad [puppet] (switchover) - https://gerrit.wikimedia.org/r/283952 (owner: Giuseppe Lavagetto)
[06:39:03] (Abandoned) Giuseppe Lavagetto: switchover: enable maintenance scripts in codfw [puppet] (switchover) - https://gerrit.wikimedia.org/r/283954 (owner: Giuseppe Lavagetto)
[06:40:11] (PS1) Giuseppe Lavagetto: "switchover: start jobrunners in eqiad" [puppet] - https://gerrit.wikimedia.org/r/284394
[06:41:10] (PS1) Giuseppe Lavagetto: Revert "switchover: maintenance scripts back to running in eqiad" [puppet] - https://gerrit.wikimedia.org/r/284395
[06:41:25] (PS2) Giuseppe Lavagetto: switchover: maintenance scripts back to running in eqiad [puppet] - https://gerrit.wikimedia.org/r/284395
[06:42:07] (PS1) Giuseppe Lavagetto: Put eqiad in read-write mode for datacenter switchover to eqiad [mediawiki-config] - https://gerrit.wikimedia.org/r/284396
[06:42:47] PROBLEM - puppet last run on elastic2016 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:43:12] (PS1) Giuseppe Lavagetto: switchover: set mediawiki master datacenter to eqiad [puppet] - https://gerrit.wikimedia.org/r/284397
[06:43:47] (PS1) Giuseppe Lavagetto: Switch wmfMasterDatacenter to eqiad [mediawiki-config] - https://gerrit.wikimedia.org/r/284398
[06:45:21] (PS1) Giuseppe Lavagetto: switchover: switch api/appservers/rendering varnish routing from codfw to eqiad [puppet] - https://gerrit.wikimedia.org/r/284400
[06:46:19] (PS1) Giuseppe Lavagetto: Revert "swift: switch to eqiad imagescalers" [puppet] - https://gerrit.wikimedia.org/r/284401
[06:46:43] (PS2) Giuseppe Lavagetto: switchover: switch swift to eqiad imagescalers" [puppet] - https://gerrit.wikimedia.org/r/284401
[06:47:41] (PS1) Giuseppe Lavagetto: Set codfw databases to read-only mode [mediawiki-config] - https://gerrit.wikimedia.org/r/284402
[06:48:36] (PS1) Giuseppe Lavagetto: switchover: make jobrunners in codfw stop [puppet] - https://gerrit.wikimedia.org/r/284403
[06:49:29] (PS1) Giuseppe Lavagetto: switchover: disable maintenance scripts in codfw [puppet] - https://gerrit.wikimedia.org/r/284404
[06:55:56] RECOVERY - puppet last run on analytics1047 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures
[06:56:47] RECOVERY - puppet last run on ms-fe1004 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[06:57:06] RECOVERY - puppet last run on cp2001 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures
[06:57:26] RECOVERY - puppet last run on ms-be1010 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures
[06:57:27] RECOVERY - puppet last run on cp3017 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:46] RECOVERY - puppet last run on cp4010 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures
[06:57:46] RECOVERY - puppet last run on mw2045 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:47] RECOVERY - puppet last run on subra is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:16] RECOVERY - puppet last run on mw1135 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:37] RECOVERY - puppet last run on mw2073 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:37] RECOVERY - puppet last run on mw2129 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:59:08] RECOVERY - puppet last run on mw2077 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:07:14] PROBLEM - puppet last run on maps-test2004 is CRITICAL: CRITICAL: puppet fail
[07:07:44] RECOVERY - puppet last run on elastic2016 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures
[07:17:44] PROBLEM - Disk space on restbase1014 is CRITICAL: DISK CRITICAL - free space: /srv 186215 MB (3% inode=99%)
[07:20:54] Operations: build python-irclib for jessie - https://phabricator.wikimedia.org/T133101#2221821 (MoritzMuehlenhoff) a:MoritzMuehlenhoff
[07:28:44] (PS2) Jcrespo: Enable base::firewall for mariadb es1 shard in eqiad [puppet] - https://gerrit.wikimedia.org/r/283622 (owner: Muehlenhoff)
[07:29:12] (CR) Jcrespo: [C: 1] Enable base::firewall for mariadb es1 shard in eqiad [puppet] - https://gerrit.wikimedia.org/r/283622 (owner: Muehlenhoff)
[07:29:40] moritzm, let's break eqiad
[07:30:52] (CR) Jcrespo: [C: 1] Enable base::firewall for mariadb s1 shard in eqiad [puppet] - https://gerrit.wikimedia.org/r/283611 (owner: Muehlenhoff)
[07:31:14] RECOVERY - puppet last run on maps-test2004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:31:28] (CR) Jcrespo: [C: 1] Enable base::firewall for mariadb s4 shard in eqiad [puppet] - https://gerrit.wikimedia.org/r/283614 (owner: Muehlenhoff)
[07:32:12] Operations, RESTBase, Services, Traffic: Decom legacy ex-parsoidcache cxserver, citoid, and restbase service hostnames - https://phabricator.wikimedia.org/T133001#2216638 (mobrovac) `citoid.wm.org` and `cxserver.wm.org` are still being actively and officially used, as we haven't put them behind R...
[07:32:26] (CR) Jcrespo: [C: 1] Enable base::firewall for mariadb s5 shard in eqiad [puppet] - https://gerrit.wikimedia.org/r/283615 (owner: Muehlenhoff)
[07:32:35] Operations, Citoid, ContentTranslation-cxserver, RESTBase, and 3 others: Decom legacy ex-parsoidcache cxserver, citoid, and restbase service hostnames - https://phabricator.wikimedia.org/T133001#2221836 (mobrovac)
[07:32:56] (CR) Jcrespo: [C: 1] Enable base::firewall for mariadb s6 shard in eqiad [puppet] - https://gerrit.wikimedia.org/r/283616 (owner: Muehlenhoff)
[07:33:29] (CR) Jcrespo: [C: 1] Enable base::firewall for mariadb s7 shard in eqiad [puppet] - https://gerrit.wikimedia.org/r/283617 (owner: Muehlenhoff)
[07:34:31] I think, however, that gerrit:283771 may do all of these in one step, so we may have to abandon those
[07:36:03] let's apply them, we will merge manually if needed
[07:36:22] also fine with merging 283771, up to you
[07:36:35] mmm
[07:36:56] can you give a quick look to the ferm parts of 283771 ?
[07:37:06] I already checked the rest, seems ok
[07:37:14] sure, give me a few minutes
[07:37:34] basically, it is a lot of legacy finally cleaned up
[07:38:29] jynus: I didn't touch anything ferm related in gerrit:283771
[07:38:59] volans, yes, but you grouped them on already-have-ferm classes
[07:39:10] Operations, Ops-Access-Requests: Requesting access to run mw maintenance scripts - https://phabricator.wikimedia.org/T133066#2221843 (Addshore) >>! In T133066#2218998, @Krenair wrote: > So restricted access, basically? If that's what you're telling me I need then yes! :)
[07:39:49] not really, every host should have kept its state of having firewall / not having firewall
[07:39:54] true
[07:39:58] if there are any, it's a bug, not my intention
[07:40:12] let's apply the original patches, moritz
[07:40:18] and we will rebase that one
[07:40:25] how the diff is shown is very confusing
[07:40:48] yeah, confirmed with the exception of 1057 the firewall state remains the same, I'll start merging them now, then
[07:40:49] (CR) Jcrespo: [C: 2] Enable base::firewall for mariadb es1 shard in eqiad [puppet] - https://gerrit.wikimedia.org/r/283622 (owner: Muehlenhoff)
[07:41:03] or you do it :-)
[07:41:16] moritzm: 1057 too was already having firewall
[07:41:16] we do one by one?
[07:42:01] volans: you're right; I mixed that up with db1067
[07:42:30] jynus: I think so, I'll doublecheck the puppet run on es1012 just in case
[07:43:11] let's start with https://gerrit.wikimedia.org/r/#/c/283611/ (s1) then, ok?
[07:43:50] I've already merged es1
[07:45:32] puppet run and resulting config on es1012 is fine, also added some iptables logging just in case
[07:46:16] (PS2) Jcrespo: Enable base::firewall for mariadb s1 shard in eqiad [puppet] - https://gerrit.wikimedia.org/r/283611 (owner: Muehlenhoff)
[07:47:43] (CR) Jcrespo: [C: 2] Enable base::firewall for mariadb s1 shard in eqiad [puppet] - https://gerrit.wikimedia.org/r/283611 (owner: Muehlenhoff)
[07:48:10] Operations, RESTBase-Cassandra, Services: Cleanup Graphite Cassandra metrics - https://phabricator.wikimedia.org/T132771#2221855 (mobrovac) >>! In T132771#2210659, @Eevans wrote: > I'm going to play Devil's Advocate here and ask: In the absence of column family metrics for `meta`, if there were (for...
[07:48:15] so I will have to re-fix all the conflicts after your merges I guess :-P
[07:48:33] why you?
[07:48:37] on my change [07:48:40] I can do that too [07:49:09] s1 merged [07:49:14] PROBLEM - puppet last run on mw2031 is CRITICAL: CRITICAL: Puppet has 1 failures [07:52:32] I see no changes on db1051 [07:52:47] but I see a firewall there [07:52:50] I do not know how [07:53:02] was already there? [07:53:21] how big are the counters on iptables? [07:53:26] if big should be there from a while [07:54:10] there's an icinga warning for mysqld on dd1055. a mysqld process is present, though [07:54:13] db1055 [07:54:25] 38450 the largest [07:54:32] transient, now gone [07:54:59] and that is why I did not want to apply that in production [07:54:59] ah, caused by the puppet run/enabling ferm [07:55:08] :-) [07:55:20] rate of growth? could it be done in few minutes? I didn't check the rule, so I cannot guess how often is matcheed [07:55:25] 15 seconds is like 200.000 failed requests [07:55:57] per host, maybe more [07:56:14] and I do not trust mediawiki's retrial [07:58:52] (03Abandoned) 10Elukey: This is a test for the puppet compiler, not meant to be committed. [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/282670 (owner: 10Elukey) [08:02:36] (03CR) 10Volans: "Puppet compiler results checked, all looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/283771 (https://phabricator.wikimedia.org/T111654) (owner: 10Volans) [08:05:35] (03PS2) 10Elukey: Add the possibility to set an external database for Hue. 
[puppet/cdh] - 10https://gerrit.wikimedia.org/r/284204 (https://phabricator.wikimedia.org/T127990) [08:08:45] (03PS2) 10Jcrespo: Enable base::firewall for mariadb s4 shard in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/283614 (owner: 10Muehlenhoff) [08:09:03] puppet ran on all s1 masters, all look fine [08:10:19] (03CR) 10Jcrespo: [C: 032] Enable base::firewall for mariadb s4 shard in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/283614 (owner: 10Muehlenhoff) [08:14:05] (03PS2) 10Jcrespo: Enable base::firewall for mariadb s5 shard in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/283615 (owner: 10Muehlenhoff) [08:14:07] (03PS3) 10Alexandros Kosiaris: ores: Add logformat [puppet] - 10https://gerrit.wikimedia.org/r/284373 (https://phabricator.wikimedia.org/T113754) (owner: 10Ladsgroup) [08:14:13] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] ores: Add logformat [puppet] - 10https://gerrit.wikimedia.org/r/284373 (https://phabricator.wikimedia.org/T113754) (owner: 10Ladsgroup) [08:14:34] (03PS3) 10Jcrespo: Enable base::firewall for mariadb s5 shard in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/283615 (owner: 10Muehlenhoff) [08:14:54] RECOVERY - puppet last run on mw2031 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:15:28] (03PS1) 1020after4: Add keyholder_key and keyholder_pubkey functions [puppet] - 10https://gerrit.wikimedia.org/r/284418 (https://phabricator.wikimedia.org/T132747) [08:16:32] (03PS3) 10Elukey: Add the possibility to set an external database for Hue. 
[puppet/cdh] - 10https://gerrit.wikimedia.org/r/284204 (https://phabricator.wikimedia.org/T127990) [08:17:18] (03CR) 10Jcrespo: [C: 032] Enable base::firewall for mariadb s5 shard in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/283615 (owner: 10Muehlenhoff) [08:17:35] (03CR) 10jenkins-bot: [V: 04-1] Add keyholder_key and keyholder_pubkey functions [puppet] - 10https://gerrit.wikimedia.org/r/284418 (https://phabricator.wikimedia.org/T132747) (owner: 1020after4) [08:18:03] PROBLEM - MariaDB Slave Lag: s4 on db1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 300.58 seconds [08:18:46] already recovered, was puppet+firewall? [08:18:58] akosiaris, need your ok for ores deployment [08:19:30] volans, probably, I will check the logs [08:19:51] puppet just ran there, so probably yes. having a look [08:20:03] RECOVERY - MariaDB Slave Lag: s4 on db1019 is OK: OK slave_sql_lag Replication lag: 0.39 seconds [08:20:07] lag is expected [08:21:38] (03CR) 10Elukey: "Addressed ottomata's comments, moved sqlite to sqlite3 and checked with the puppet compiler:" [puppet/cdh] - 10https://gerrit.wikimedia.org/r/284204 (https://phabricator.wikimedia.org/T127990) (owner: 10Elukey) [08:24:26] jynus: merged [08:24:33] jynus: both yours and mine [08:24:34] akosiaris, thanks [08:25:56] puppet ran on all s4 masters, all look fine [08:26:05] (is the channel topic up to date?) [08:26:51] (03PS1) 10Elukey: Sort the hosts with diffs/errors in the Jinja templates. [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/284419 [08:28:18] 06Operations, 10OTRS: OTRS has been eating 100% of mendelevium's CPU for the last fortnight - https://phabricator.wikimedia.org/T132822#2221888 (10akosiaris) >>! In T132822#2219923, @Krenair wrote: > https://ganglia.wikimedia.org/latest/graph.php?r=month&z=xlarge&h=mendelevium.eqiad.wmnet&m=cpu_report&s=descen... [08:29:08] sjoerddebruin: are you referring to the date? 
there'll be another read-only period to switch back to the eqiad data center [08:29:14] db1045 Connection refused by host [08:29:49] within expected [08:29:53] moritzm: oh right [08:30:09] jynus: works now, I think you just hit the second where ferm was applied and blocking new ssh connections [08:30:18] (03CR) 10Volans: [C: 031] "LGTM, thanks Elukey" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/284419 (owner: 10Elukey) [08:31:16] (03PS2) 10Jcrespo: Enable base::firewall for mariadb s6 shard in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/283616 (owner: 10Muehlenhoff) [08:33:34] jynus: db1065 on s1 was skipped for a reason? [08:33:46] mmm [08:34:21] we need to merge db1065 with the others [08:34:24] on the class [08:34:26] I'll aggregate them in my CR given that some of them were splitted just because of the firewall, and saw it [08:35:02] I think I tested as one of the ones where I applied p_s [08:35:15] probably because an unscheduled reboot [08:35:42] but there should only be 2 "sections" now per shard and datacenter- the master and the slaves [08:36:09] and only one if we move the maseter flag to hiera [08:36:30] puppet ran on all s5 masters, all look fine [08:36:37] so all should have p_s to ON? [08:36:47] yeah...ish [08:36:50] :) [08:36:59] it has been tested on all servers [08:37:06] there's a separate patch for 1065: https://gerrit.wikimedia.org/r/#/c/283612/ [08:37:17] I have not tested in on the largest ones (72 and 73) [08:37:56] so those would be the largest incognitos [08:38:13] ok, do db1065 too, I'm consolidating them in the following CR, don't worry to consolidate now [08:39:22] (03CR) 10Jcrespo: [C: 032] Enable base::firewall for mariadb s6 shard in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/283616 (owner: 10Muehlenhoff) [08:40:11] (03PS2) 10Elukey: Sort the hosts with diffs/errors in the Jinja templates. 
[software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/284419 [08:42:01] jynus: it's actually on on db1074/76 on s2 [08:43:14] (03PS2) 10Jcrespo: Enable base::firewall for mariadb s7 shard in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/283617 (owner: 10Muehlenhoff) [08:44:46] volans, yes, that was the intention [08:45:08] but those had low weight still [08:45:24] so not properly tested with >20 QPS [08:45:36] >20K [08:45:39] (03CR) 10Elukey: [C: 032] Sort the hosts with diffs/errors in the Jinja templates. [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/284419 (owner: 10Elukey) [08:45:44] ok, you want me to leave it off for those? [08:45:56] and on for all the others? [08:46:21] let's enable it, we can bring them off if there are issues [08:46:30] in fact, there are no issues [08:46:36] ok [08:46:41] only potential performance impact [08:47:00] and those on with slower speed will be better than depooling them later [08:47:08] with 0 throughput [08:47:16] I take full reponsability [08:47:28] PROBLEM - MariaDB Slave Lag: s6 on db1030 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 312.54 seconds [08:49:09] puppet ran on all s6 masters, all look fine [08:49:42] it was more than 15 seconds of interruption! [08:50:07] RECOVERY - MariaDB Slave Lag: s6 on db1030 is OK: OK slave_sql_lag Replication lag: 0.39 seconds [08:51:05] (03CR) 10Jcrespo: [C: 032] Enable base::firewall for mariadb s7 shard in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/283617 (owner: 10Muehlenhoff) [08:51:28] PROBLEM - MariaDB Slave Lag: s6 on db1037 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 306.64 seconds [08:52:49] BTW, the alert automatic config works-no paging on the secondary datacenter [08:53:09] 06Operations, 07Icinga, 13Patch-For-Review: upgrade neon (icinga) to jessie - https://phabricator.wikimedia.org/T125023#2221897 (10akosiaris) >>! In T125023#2219847, @Dzahn wrote: >>>! In T125023#2198454, @akosiaris wrote: >> @dzahn. 
We 've already got replacement boxes. >> But don't just reuse einsteinium... [08:53:28] RECOVERY - MariaDB Slave Lag: s6 on db1037 is OK: OK slave_sql_lag Replication lag: 0.09 seconds [08:53:28] yeah, I was actually wondering about it this morning [08:53:40] (03PS2) 10Jcrespo: Enable base::firewall for db1065 [puppet] - 10https://gerrit.wikimedia.org/r/283612 (owner: 10Muehlenhoff) [08:53:48] let's see what's missing [08:53:49] (03CR) 10Alexandros Kosiaris: [C: 04-2] "No, not really. See https://phabricator.wikimedia.org/T125023#2221897" [puppet] - 10https://gerrit.wikimedia.org/r/284277 (https://phabricator.wikimedia.org/T125023) (owner: 10Dzahn) [08:54:20] !log deployed the new puppet compiler - version 0.1.4 (hosts sorted in the HTML output, minor change) [08:54:22] (03CR) 10Jcrespo: [C: 031] Enable base::firewall for db1065 [puppet] - 10https://gerrit.wikimedia.org/r/283612 (owner: 10Muehlenhoff) [08:54:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:55:34] (03PS2) 10Giuseppe Lavagetto: switchover: make jobrunners in codfw stop [puppet] - 10https://gerrit.wikimedia.org/r/284403 [08:55:46] (03CR) 10Jcrespo: [C: 031] "HA proxy will complain" [puppet] - 10https://gerrit.wikimedia.org/r/240056 (owner: 10Muehlenhoff) [08:56:06] 06Operations, 10Traffic, 06WMF-Legal, 10domains: Register nlwikipedia.org to prevent squatting - https://phabricator.wikimedia.org/T128968#2221900 (10Multichill) Ok. Can someone please update the list of squatted domain list and close this task? We failed. [08:57:03] (03CR) 10Jcrespo: [C: 04-1] "Not without master faileover. We can do it on the slave, however." 
[puppet] - 10https://gerrit.wikimedia.org/r/240057 (owner: 10Muehlenhoff) [08:57:27] puppet ran on all s7 masters, all look fine [08:57:43] (03CR) 10Jcrespo: [C: 032] Enable base::firewall for db1065 [puppet] - 10https://gerrit.wikimedia.org/r/283612 (owner: 10Muehlenhoff) [08:57:45] 06Operations, 10Analytics-EventLogging, 10RESTBase, 06Services, and 2 others: RESTBase should handle the X-Analytics header - https://phabricator.wikimedia.org/T133139#2221908 (10mobrovac) [08:58:08] so the misc servers are active, we cannot deploy to its masters [08:58:20] we can do it for (some) slaves [08:58:40] 06Operations, 10Deployment-Systems, 06Release-Engineering-Team, 06Services: Streamline our service development and deployment process - https://phabricator.wikimedia.org/T93428#2221923 (10akosiaris) Unsure, why this is #blocked-on-operations. I 'll remove the tag for now, feel free to readd. [08:58:42] jynus: sure, these are really, old obsolete patches (from Sep) when the ferm work started, I'll abandon them [08:58:53] although that may require more testing, they have very custom clients [08:59:06] my main concern with the ones applied [08:59:10] let's focus on the core/eqiad systems for now [08:59:13] would be analytics [08:59:16] (03Abandoned) 10Muehlenhoff: Enable ferm on db1020 [puppet] - 10https://gerrit.wikimedia.org/r/240057 (owner: 10Muehlenhoff) [08:59:24] (03PS2) 10Giuseppe Lavagetto: switchover: disable maintenance scripts in codfw [puppet] - 10https://gerrit.wikimedia.org/r/284404 [08:59:26] (03Abandoned) 10Muehlenhoff: Enable ferm on db1016 [puppet] - 10https://gerrit.wikimedia.org/r/240056 (owner: 10Muehlenhoff) [08:59:27] I think we checked already [08:59:30] (03PS1) 10Filippo Giunchedi: restbase: add restbase101[45] to conftool [puppet] - 10https://gerrit.wikimedia.org/r/284425 [08:59:37] if stats* hosts are 10.x [08:59:42] if they are, no issue [09:00:16] yeah, stat* hosts are in the 10.x network as well [09:02:13] (03PS2) 10Giuseppe Lavagetto: Set codfw 
databases to read-only mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284402 [09:02:20] the other potential issue is if firewall + performance schema could reduce the throughput on datacenter failback [09:02:31] for large hosts [09:03:03] can I ask you 1 last favour, moritzm ? [09:03:29] (03PS5) 10Volans: MariaDB: complete TLS and master configuration [puppet] - 10https://gerrit.wikimedia.org/r/283771 (https://phabricator.wikimedia.org/T111654) [09:03:50] -doing some http requests on eqiad mediawikis locally to verify everything is still correct [09:04:37] 07Blocked-on-Operations, 06Operations, 10Continuous-Integration-Infrastructure, 07HHVM: /usr/lib/x86_64-linux-gnu/hhvm/extensions/current/luasandbox.so no such file or directory - https://phabricator.wikimedia.org/T126658#2020104 (10akosiaris) I see rOPUPf6c222149 has been merged since which might have fix... [09:05:41] jynus: sure! [09:06:53] (03CR) 10Jcrespo: [C: 04-1] "I think you cherry picked and this deletes essential config." [puppet] - 10https://gerrit.wikimedia.org/r/283771 (https://phabricator.wikimedia.org/T111654) (owner: 10Volans) [09:07:27] jynus: yes I was checking the diff now... didn't show locally [09:07:57] if you can work on that, I will focus on API servers fix [09:08:53] 06Operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10MediaWiki-extensions-Translate: Publishing translations for central notice banners fails - https://phabricator.wikimedia.org/T104774#2221998 (10Nikerabbit) >>! In T104774#2221783, @Nemo_bis wrote: > I doubt this would be helped by... 
[09:09:53] (03PS6) 10Volans: MariaDB: complete TLS and master configuration [puppet] - 10https://gerrit.wikimedia.org/r/283771 (https://phabricator.wikimedia.org/T111654) [09:10:01] sure jynus [09:11:22] thank you both [09:11:32] change already fixed [09:12:13] running compiler again also to test elukey change on the compiler :) [09:12:18] !log stop compactions on restbase1014-[ab] [09:12:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:13:14] !log backfilling recentchanges on enwiki API servers [09:13:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:13:45] I will do those with the binary log on [09:14:17] RECOVERY - Disk space on restbase1014 is OK: DISK OK [09:18:40] (03PS3) 10Giuseppe Lavagetto: switchover: switch swift to eqiad imagescalers [puppet] - 10https://gerrit.wikimedia.org/r/284401 [09:19:53] I think it applied, it was surprisingly fast [09:19:59] due to low load [09:23:01] good, for gerrit::283771 looks ok to me but a second pair of eyes is welcome [09:25:05] jynus: ---^^ puppet compiler is running I'll update with the results asap [09:54:07] 06Operations, 10MediaWiki-Cache, 10Wikimedia-General-or-Unknown, 05codfw-rollout: Wrong sidebar cached? 
on sites - https://phabricator.wikimedia.org/T133069#2222233 (10Addshore) [09:54:18] 06Operations, 10MediaWiki-Cache, 10Wikimedia-General-or-Unknown, 05codfw-rollout: Wrong sidebar cached on sites - https://phabricator.wikimedia.org/T133069#2218817 (10Addshore) [09:54:28] 06Operations, 10Traffic, 10netops: Consider per-route DCTCP for dc-local traffic on jessie hosts - https://phabricator.wikimedia.org/T128377#2222236 (10akosiaris) p:05Triage>03Low [09:55:02] 06Operations, 10MediaWiki-Uploading, 06Multimedia, 10Traffic, 10Wikimedia-Video: Uploading 1.2GB ogv results in 503 - https://phabricator.wikimedia.org/T128358#2222240 (10akosiaris) p:05Triage>03Normal [10:00:29] 06Operations: Preserve SSH host key when re-imaging hosts - https://phabricator.wikimedia.org/T129180#2097333 (10akosiaris) >>! In T129180#2211445, @Andrew wrote: > What if ssh keys were managed by puppet? Adding new hosts to puppet would be more complicated (you'd have to explicitly create a key in the private... [10:00:40] 06Operations: Preserve SSH host key when re-imaging hosts - https://phabricator.wikimedia.org/T129180#2222346 (10akosiaris) p:05Triage>03Low [10:00:52] (03CR) 10Jcrespo: [C: 031] MariaDB: complete TLS and master configuration [puppet] - 10https://gerrit.wikimedia.org/r/283771 (https://phabricator.wikimedia.org/T111654) (owner: 10Volans) [10:00:55] 06Operations, 10Deployment-Systems, 10Monitoring, 10scap, 10Scap3 (Scap3-Adoption-Phase1): Deploy servermon with scap3 - https://phabricator.wikimedia.org/T129152#2222351 (10akosiaris) p:05Triage>03Normal [10:01:18] 06Operations, 10Monitoring, 10netops, 10Scap3 (Scap3-Adoption-Phase1): Deploy libreNMS with scap3 - https://phabricator.wikimedia.org/T129136#2222353 (10akosiaris) p:05Triage>03Normal [10:01:37] (03CR) 10Jcrespo: [C: 032] MariaDB: complete TLS and master configuration [puppet] - 10https://gerrit.wikimedia.org/r/283771 (https://phabricator.wikimedia.org/T111654) (owner: 10Volans) [10:02:22] (03CR) 10Volans: "LAst pupper 
compiler run for reference https://puppet-compiler.wmflabs.org/2511/" [puppet] - 10https://gerrit.wikimedia.org/r/283771 (https://phabricator.wikimedia.org/T111654) (owner: 10Volans) [10:04:05] go ahead and merge jynus :) [10:06:08] 06Operations, 10Traffic: Images not showing up at Commons - https://phabricator.wikimedia.org/T128961#2091362 (10akosiaris) Anything else left to do here ? Or can we close this one ? [10:08:16] 06Operations, 10ContentTranslation-cxserver, 10MediaWiki-extensions-ContentTranslation, 06Services, and 3 others: Package and test apertium for Jessie - https://phabricator.wikimedia.org/T107306#2222395 (10Amire80) [10:08:28] 06Operations, 10ContentTranslation-cxserver, 10MediaWiki-extensions-ContentTranslation, 06Services, 07WorkType-NewFunctionality: Package and test apertium for Jessie - https://phabricator.wikimedia.org/T107306#2222400 (10Amire80) [10:23:26] 06Operations, 10ops-eqiad, 13Patch-For-Review: mw1080 - readonly fs - depooled - please check sda - https://phabricator.wikimedia.org/T132529#2222522 (10akosiaris) p:05Triage>03Normal [10:31:32] !log [switchover-maintenance] Upgrading TLS for shard s7 on eqiad databases [10:31:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:33:02] !log [switchover-maintenance] Restarting db1018 [10:33:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:33:11] (s2 shard) [10:33:22] check if puppet has already run there [10:34:44] yep, yep [10:35:02] I have like my own rhythm after upgrading ~100 servers [10:35:21] stop, stop, upgrade, restart, puppet,start,upgrade,start [10:36:49] the restart is on kernel, just in case, I think jessie has some tasty 4.4 [10:37:02] it does yeah [10:38:02] It actually installed an upgraded 3.19 [10:38:19] and a 3.16 [10:38:30] ah, no, also the 4.4 [10:38:34] so you're doing full upgrade? 
[10:38:54] I thought you wanted only to upgrade mariadb, from the etherpad comment :) [10:38:58] just to do the same here [10:38:59] PROBLEM - MariaDB Slave IO: s2 on db2017 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1018.eqiad.wmnet:3306 - retry-time: 60 retries: 86400 message: Cant connect to MySQL server on db1018.eqiad.wmnet (111 Connection refused) [10:39:45] jynus: I 've noticed fulltext indices on innodb were added in mariadb 10.0.5 and you know which application came to mind ? [10:39:50] ETHERPAD!!!!! [10:39:54] yeah, that is me [10:40:00] ^ [10:40:03] no worries [10:40:04] ok [10:42:11] !log [switchover-maintenance] Restarting db1028 (s7) [10:42:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:44:30] if something was not depooled and keeps pointing to mediawiki on eqiad, we will know, also [10:44:54] jynus: about this, some slaves on s7 have ~70 QPS.. I'll check them before restarting [10:49:30] (03PS2) 10Filippo Giunchedi: restbase: add restbase101[45] to conftool [puppet] - 10https://gerrit.wikimedia.org/r/284425 [10:49:37] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] restbase: add restbase101[45] to conftool [puppet] - 10https://gerrit.wikimedia.org/r/284425 (owner: 10Filippo Giunchedi) [10:51:42] !log pool restbase1014 [10:51:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:52:27] (03PS1) 10Alexandros Kosiaris: Add atgomez to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/284439 (https://phabricator.wikimedia.org/T133102) [10:55:01] (03PS1) 10Alexandros Kosiaris: Add addshore to restricted group [puppet] - 10https://gerrit.wikimedia.org/r/284441 (https://phabricator.wikimedia.org/T133066) [11:04:57] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting access to run mw maintenance scripts - https://phabricator.wikimedia.org/T133066#2218699 (10akosiaris) I am a 
bit unclear on if we need some approval on this. addshore already has access to bastiononly, researchers, analytics-privatedata-use... [11:06:27] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting access to run mw maintenance scripts - https://phabricator.wikimedia.org/T133066#2222672 (10Addshore) https://wikitech.wikimedia.org/wiki/Requesting_shell_access#Escalating_Existing_Shell_Access > 5. A three business day waiting period must... [11:10:19] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting access to run mw maintenance scripts - https://phabricator.wikimedia.org/T133066#2222688 (10akosiaris) >>! In T133066#2222672, @Addshore wrote: > https://wikitech.wikimedia.org/wiki/Requesting_shell_access#Escalating_Existing_Shell_Access >... [11:10:42] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting access to hive for AGomez (WMF) - https://phabricator.wikimedia.org/T133102#2222691 (10akosiaris) p:05Triage>03Normal [11:11:42] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting access to hive for AGomez (WMF) - https://phabricator.wikimedia.org/T133102#2220183 (10akosiaris) 3 day waiting period as usual per https://wikitech.wikimedia.org/wiki/Requesting_shell_access#Escalating_Existing_Shell_Access. [11:11:50] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting access to run mw maintenance scripts - https://phabricator.wikimedia.org/T133066#2222695 (10akosiaris) p:05Triage>03Normal [11:13:30] 06Operations, 13Patch-For-Review, 07developer-notice, 07notice: Migrate argon (irc.wikimedia.org) to Jessie - https://phabricator.wikimedia.org/T123729#2222702 (10akosiaris) p:05Triage>03Normal [11:13:42] 06Operations, 13Patch-For-Review: Migrate carbon to jessie - https://phabricator.wikimedia.org/T123733#2222703 (10akosiaris) p:05Triage>03Normal [11:13:56] 06Operations, 10Traffic, 10domains, 13Patch-For-Review: wikiknihy.cz - transfer to Wikimedia Czech Republic? 
- https://phabricator.wikimedia.org/T127573#2222704 (10akosiaris) p:05Triage>03Normal [11:15:45] 06Operations, 10Beta-Cluster-Infrastructure, 13Patch-For-Review: /mnt/upload7 does not exist anywhere, yet it is referenced in multiple places in wmf-config - https://phabricator.wikimedia.org/T129586#2222707 (10akosiaris) [11:15:46] I think the recovery will page now [11:16:00] Just wondering, is this the current NDA? https://phabricator.wikimedia.org/L4 [11:16:01] RECOVERY - MariaDB Slave IO: s2 on db2017 is OK: OK slave_io_state Slave_IO_Running: Yes [11:16:02] 06Operations, 10Beta-Cluster-Infrastructure, 13Patch-For-Review: /mnt/upload7 does not exist anywhere, yet it is referenced in multiple places in wmf-config - https://phabricator.wikimedia.org/T129586#2110079 (10akosiaris) 05Open>03stalled p:05Triage>03Normal [11:17:00] 06Operations, 10Traffic, 10media-storage, 13Patch-For-Review: don't serve upload.wikimedia.org 'root' from wmfrewrite/swift - https://phabricator.wikimedia.org/T130709#2222712 (10akosiaris) p:05Triage>03Normal [11:17:18] 06Operations, 10Traffic, 07HTTPS, 13Patch-For-Review: enable https for (ubuntu|apt|mirrors).wikimedia.org - https://phabricator.wikimedia.org/T132450#2222713 (10akosiaris) p:05Triage>03Normal [11:17:54] 06Operations, 10RESTBase-Cassandra, 06Services, 13Patch-For-Review: Highest SSTables / read thresholds - https://phabricator.wikimedia.org/T133091#2219667 (10akosiaris) I think this is can be resolved now ? [11:18:29] 06Operations, 10MediaWiki-Cache, 10Wikimedia-General-or-Unknown, 05codfw-rollout: Wrong sidebar cached on sites - https://phabricator.wikimedia.org/T133069#2222718 (10Joe) >>! In T133069#2220066, @BBlack wrote: >>>! In T133069#2219464, @Joe wrote: >> So, during the switchover we first wiped the codfw memca... 
[11:19:15] 06Operations, 10Analytics-EventLogging, 06Performance-Team, 13Patch-For-Review: "Throughput of EventLogging NavigationTiming events" UNKNOWN - https://phabricator.wikimedia.org/T132770#2222719 (10akosiaris) p:05Triage>03Normal [11:19:27] 06Operations, 10Traffic, 07HTTPS, 13Patch-For-Review: Sort out letsencrypt puppetization for simple public hosts - https://phabricator.wikimedia.org/T132812#2222720 (10akosiaris) p:05Triage>03Normal [11:19:44] 06Operations, 10Traffic, 13Patch-For-Review: Host rewrite for /static/ not applied to purges - https://phabricator.wikimedia.org/T130904#2222721 (10akosiaris) p:05Triage>03Normal [11:26:22] 06Operations, 06Labs, 13Patch-For-Review, 15User-bd808: Setting up bulk proxies pointing to a multiwiki mediawiki-vagrant setup running on a labs vm - https://phabricator.wikimedia.org/T132216#2222729 (10akosiaris) p:05Triage>03Normal >>! In T132216#2193397, @Krenair wrote: > While that may be a workar... [11:26:33] 06Operations, 10Traffic, 13Patch-For-Review, 07Varnish: Solve large-object/stream/pass/chunked in our shared VCL - https://phabricator.wikimedia.org/T131761#2222731 (10akosiaris) p:05Triage>03Normal [11:30:46] 06Operations, 13Patch-For-Review: Add monitoring metric for connection tracking table usage - https://phabricator.wikimedia.org/T131150#2157431 (10akosiaris) So this now requires enabling, right ? ``` diamond::collector { 'NfConntrackCount': } ``` [11:30:55] 06Operations, 13Patch-For-Review: Add monitoring metric for connection tracking table usage - https://phabricator.wikimedia.org/T131150#2222738 (10akosiaris) p:05Triage>03Normal [11:31:21] (03CR) 10Mobrovac: [C: 04-1] "I like the approach, but I have one big concern: is having untracked files on the puppetmaster acceptable? 
Would it be better to just plai" (0316 comments) [puppet] - 10https://gerrit.wikimedia.org/r/284418 (https://phabricator.wikimedia.org/T132747) (owner: 1020after4) [11:35:00] 06Operations, 10RESTBase-Cassandra, 06Services: Highest SSTables / read thresholds - https://phabricator.wikimedia.org/T133091#2222758 (10mobrovac) >>! In T133091#2222714, @akosiaris wrote: > I think this is can be resolved now ? Nope. The ticket is about trying to find the best way to still have alerts, bu... [11:35:11] 06Operations, 10RESTBase-Cassandra, 06Services: Highest SSTables / read thresholds - https://phabricator.wikimedia.org/T133091#2222761 (10mobrovac) p:05Triage>03Normal [11:36:06] 06Operations, 10Traffic: Something in WMF infrastructure corrupts responses with certain lengths - https://phabricator.wikimedia.org/T132159#2222764 (10akosiaris) p:05Triage>03High Triaging as high in case this is more widespread. Adding traffic and subscribers as well [11:41:12] 06Operations, 06Labs, 10Monitoring, 10wikitech.wikimedia.org: Bacula recovery of sql files from silver/wikitech fails - https://phabricator.wikimedia.org/T131195#2222780 (10akosiaris) 05Open>03Resolved a:03akosiaris >>! In T131195#2221790, @jcrespo wrote: > alex- I think I was able to recover to a di... [11:43:33] PROBLEM - puppet last run on cp1051 is CRITICAL: CRITICAL: Puppet has 1 failures [11:57:03] 06Operations, 10Continuous-Integration-Infrastructure: Investigate usage of ttf-ubuntu-font-family which is not available on Jessie - https://phabricator.wikimedia.org/T103325#1387187 (10MoritzMuehlenhoff) According to T32288 it was added for SVG conversion: Various SVG diagrams use fonts from the 'Ubuntu' typ... 
[11:58:33] (03PS1) 10Muehlenhoff: Add reference to Phab tickets as used for other fonts [puppet] - 10https://gerrit.wikimedia.org/r/284446 [12:05:54] (03CR) 10Muehlenhoff: [C: 032 V: 032] Add reference to Phab tickets as used for other fonts [puppet] - 10https://gerrit.wikimedia.org/r/284446 (owner: 10Muehlenhoff) [12:10:13] RECOVERY - puppet last run on cp1051 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:17:30] !log [switchover-maintenance] Changing DB slave topology for shard s7 on eqiad T111654 [12:17:31] T111654: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654 [12:17:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:34:54] 06Operations, 13Patch-For-Review: Add monitoring metric for connection tracking table usage - https://phabricator.wikimedia.org/T131150#2222944 (10elukey) 05Open>03Resolved a:03elukey Already added, the only remaining step was to decide if the collector needs to be deployed everywhere or not, but it is n... [12:37:28] !log [switchover-maintenance] Changing DB slave topology for shard s6 on eqiad T111654 [12:37:29] T111654: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654 [12:37:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:40:05] 06Operations, 06Commons, 10media-storage: Update rsvg on the image scalers - https://phabricator.wikimedia.org/T112421#2222993 (10Joe) FYI, we build 2.40.5 for jessie, should we move to 2.40.11 there too? [12:46:06] 06Operations, 06Analytics-Kanban, 13Patch-For-Review: Upgrade stat1001 to Debian Jessie - https://phabricator.wikimedia.org/T76348#2223054 (10elukey) started the rsync for /srv with (thanks @Dzahn!): ``` rsync -avp /srv rsync://stat1004.eqiad.wmnet:/srv ``` that is still ongoing. After that I'll also backup... [12:48:43] (03PS1) 10Elukey: Add Debian Jessie PXE boot option to stat1001 as preparation step for the reimage. 
[puppet] - 10https://gerrit.wikimedia.org/r/284451 (https://phabricator.wikimedia.org/T76348) [12:48:47] !log [switchover-maintenance] Changing DB slave topology for shard s3 on eqiad T111654 [12:48:48] T111654: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654 [12:48:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:49:59] (03CR) 10Elukey: [C: 032] Add Debian Jessie PXE boot option to stat1001 as preparation step for the reimage. [puppet] - 10https://gerrit.wikimedia.org/r/284451 (https://phabricator.wikimedia.org/T76348) (owner: 10Elukey) [12:54:23] !log [switchover-maintenance] Changing DB slave topology for shard s5 on eqiad T111654 [12:54:24] T111654: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654 [12:54:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:59:57] !log [switchover-maintenance] Changing DB slave topology for shard s4 on eqiad T111654 [12:59:58] T111654: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654 [13:00:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:01:05] 06Operations, 10Traffic: Images not showing up at Commons - https://phabricator.wikimedia.org/T128961#2223095 (10BBlack) 05Open>03Resolved a:03BBlack Close-able AFAIK, we did fix this in the general case for all cache clusters (un-proxying the request, normalizing the cast of the request hostname, stripp... 
[13:07:08] 06Operations, 10Traffic, 10media-storage, 13Patch-For-Review: don't serve upload.wikimedia.org 'root' from wmfrewrite/swift - https://phabricator.wikimedia.org/T130709#2223164 (10BBlack) 05Open>03Resolved a:03BBlack [13:07:12] 06Operations, 06Performance-Team, 10Traffic, 13Patch-For-Review: Support HTTP/2 - https://phabricator.wikimedia.org/T96848#2223163 (10elukey) Nginx 1.9.15 Changes: ``` *) Bugfix: "recv() failed" errors might occur when using HHVM as a FastCGI server. *) Bugfix: when using HTTP/2 and the "... [13:09:20] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [13:10:02] moritzm: could be your last merge? --^ did you get the error from strontium? [13:10:22] actually seems last one was elukey :) [13:10:34] yep I was about to say that, but didn't notice any failure [13:10:36] according to my shell history that sync went fine [13:11:13] is not in red elukey but says connection to strontium closed and some kind of error above [13:11:46] see T128895 :) [13:11:46] T128895: Randomly failing puppetmaster sync to strontium - https://phabricator.wikimedia.org/T128895 [13:12:45] elukey: to fix it just ssh into strontium [13:12:52] cd /var/lib/git/operations/puppet [13:12:58] sudo git pull origin [13:13:26] yeah I was checking, let me see also in palladium [13:13:55] if it fails on palladium from the start you notice it... doesn't do the pupper merge at all [13:14:46] should be in sync now, I need to pay more attention next time, really sorry about it. [13:15:30] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. 
[13:15:44] thanks volans [13:15:45] :) [13:16:42] yw :) [13:17:09] !log [switchover-maintenance] Changing DB slave topology for shard s1 on eqiad T111654 [13:17:10] T111654: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654 [13:17:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:18:10] 07Puppet, 10Beta-Cluster-Infrastructure: Puppet failure on deployment-cache-parsoid05 due to removal of role::cache::parsoid - https://phabricator.wikimedia.org/T132260#2223293 (10BBlack) There is no parsoid varnish cache in production anymore, and therefore there shouldn't be one in beta anymore either. Use... [13:19:07] 06Operations, 10ops-eqiad, 10DBA: db1070, db1071 and db1065 overheating problems - https://phabricator.wikimedia.org/T132515#2223301 (10Volans) @Cmjohnson if you are available ping me on IRC, it will be perfect if we can do it today. Thanks! [13:19:13] 07Puppet, 10Beta-Cluster-Infrastructure: Puppet failure on deployment-cache-parsoid05 due to removal of role::cache::parsoid - https://phabricator.wikimedia.org/T132260#2223302 (10BBlack) p:05Triage>03Low [13:19:20] 07Puppet, 10Beta-Cluster-Infrastructure: Puppet failure on deployment-cache-parsoid05 due to removal of role::cache::parsoid - https://phabricator.wikimedia.org/T132260#2192879 (10BBlack) a:05BBlack>03None [13:21:58] !log rebooting rdb1002,rdb1003,rdb1004,rdb1006,rdb1007,rdb1008 for upgrade to Linux 4.4 [13:22:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:24:42] 06Operations, 10DBA, 13Patch-For-Review: implement performance_schema for mysql monitoring - https://phabricator.wikimedia.org/T99485#1292599 (10Volans) Performance schema is set in the configuration for all coredb (s1-s7), will be active at next restart or before if manually activated. 
[13:30:54] 07Blocked-on-Operations, 06Operations, 10Continuous-Integration-Infrastructure, 07HHVM: /usr/lib/x86_64-linux-gnu/hhvm/extensions/current/luasandbox.so no such file or directory - https://phabricator.wikimedia.org/T126658#2223343 (10hashar) 05Open>03Resolved a:03ori Indeed @ori fixed it. On a CI slav... [13:34:30] 06Operations, 10ArchCom-RfC, 10Architecture, 10Incident-20150423-Commons, and 6 others: RFC: Re-evaluate varnish-level request-restart behavior on 5xx - https://phabricator.wikimedia.org/T97206#2223365 (10BBlack) 05Open>03Resolved We haven't seen any movement on this ticket in months, and AFAIK varnish... [13:34:33] 06Operations, 10ArchCom-RfC, 10Architecture, 10Incident-20150423-Commons, and 4 others: RFC: Request timeouts and retries - https://phabricator.wikimedia.org/T97204#2223367 (10BBlack) [13:37:45] 06Operations, 10Traffic, 13Patch-For-Review: Host rewrite for /static/ not applied to purges - https://phabricator.wikimedia.org/T130904#2223402 (10BBlack) Note, there is a script to work around this on deployment of /static/ stuff, on the deployment hosts. But we should still circle back around and fix thi... [13:40:58] 06Operations, 10Traffic, 13Patch-For-Review, 07Varnish: Solve large-object/stream/pass/chunked in our shared VCL - https://phabricator.wikimedia.org/T131761#2223427 (10BBlack) Note: the work in T128813 looks pretty good, and is probably the general pattern we should follow, especially as clusters move to V... [13:42:26] 06Operations, 10Traffic, 13Patch-For-Review, 07Varnish: Solve large-object/stream/pass/chunked in upload cluster better - https://phabricator.wikimedia.org/T131761#2223434 (10BBlack) [13:42:33] 06Operations, 06Analytics-Kanban, 13Patch-For-Review: Upgrade stat1001 to Debian Jessie - https://phabricator.wikimedia.org/T76348#2223436 (10elukey) All data backupped in /srv/stat1001 on stat1004, we should be ready to proceed. 
[13:42:38] 06Operations, 10Traffic, 13Patch-For-Review, 07Varnish: Solve large-object/stream/pass/chunked in upload cluster better - https://phabricator.wikimedia.org/T131761#2177610 (10BBlack) [13:42:41] 06Operations, 10Traffic, 07Varnish: Convert upload cluster to Varnish 4 - https://phabricator.wikimedia.org/T131502#2223437 (10BBlack) [13:43:36] 06Operations, 06Analytics-Kanban, 13Patch-For-Review: Upgrade stat1001 to Debian Jessie - https://phabricator.wikimedia.org/T76348#2223448 (10elukey) [13:43:38] 06Operations, 10ops-eqiad, 10hardware-requests: connect an external harddisk with >2TB space to stat1001 - https://phabricator.wikimedia.org/T132476#2223446 (10elukey) 05Open>03Resolved Closing the task since we decided to use stat1004 and @Dzahn set up the necessary rsync modules. [13:46:00] 06Operations, 06Performance-Team, 10Traffic, 13Patch-For-Review: Support HTTP/2 - https://phabricator.wikimedia.org/T96848#2223470 (10BBlack) I backported some of that to our tentative 1.9.14-1+wmf1, but yeah we'll want the rest of the HTTP/2 fixes that have landed since in 1.9.15. Note debian now has the... [13:46:38] 06Operations, 10DBA: Implement mariadb 10.0 masters - https://phabricator.wikimedia.org/T105135#2223479 (10jcrespo) All s* servers now have mariadb10 masters. We keep the old 5.5 masters in case a rollback is needed. There are still some core servers in 5.5, mainly x1. [13:47:28] PROBLEM - OCG health on ocg1003 is CRITICAL: CRITICAL: http status 500 [13:47:44] PROBLEM - LVS HTTP IPv4 on ocg.svc.eqiad.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 {channel:frontend.error,request:{id:1461160052232-33044},error:{message:Status check failed (redis failure?)}} - 232 bytes in 0.015 second response time [13:47:44] PROBLEM - PyBal backends health check on lvs1002 is CRITICAL: PYBAL CRITICAL - ocg_8000 - Could not depool server ocg1001.eqiad.wmnet because of too many down! [13:47:59] <_joe_> what the fuck? [13:48:04] <_joe_> oh I know [13:48:07] ocg? 
[13:48:15] <_joe_> moritzm: did you restart rdb1007, right? [13:48:36] (03CR) 10BBlack: [C: 04-1] "I don't think this belongs in a "role" class. Ultimately there will be a define in a module, such that other roles/classes can invoke (pe" [puppet] - 10https://gerrit.wikimedia.org/r/283763 (https://phabricator.wikimedia.org/T132812) (owner: 10Dzahn) [13:48:41] yeah, a few minutes ago [13:48:46] (03PS1) 10Volans: Change eqiad masters for s1,s3-s7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284455 (https://phabricator.wikimedia.org/T105135) [13:48:46] <_joe_> ok that's it [13:49:00] <_joe_> let's restart all ocg servers [13:49:17] PROBLEM - PyBal backends health check on lvs1005 is CRITICAL: PYBAL CRITICAL - ocg_8000 - Could not depool server ocg1003.eqiad.wmnet because of too many down! [13:49:22] did that cause the ocg failure? [13:49:26] <_joe_> yes [13:49:32] <_joe_> not your fault of course [13:49:34] <_joe_> ocg's fault [13:49:38] RECOVERY - OCG health on ocg1003 is OK: OK: ocg_job_status 584111 msg: ocg_render_job_queue 0 msg [13:49:41] (03PS2) 10Hashar: hhvm: allow passing service parameters [puppet] - 10https://gerrit.wikimedia.org/r/269946 (https://phabricator.wikimedia.org/T126594) [13:49:53] RECOVERY - LVS HTTP IPv4 on ocg.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 456 bytes in 0.013 second response time [13:49:54] RECOVERY - PyBal backends health check on lvs1002 is OK: PYBAL OK - All pools are healthy [13:50:14] <_joe_> !log rolling restart of ocg servers [13:50:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:51:19] RECOVERY - PyBal backends health check on lvs1005 is OK: PYBAL OK - All pools are healthy [13:51:25] hmm, I see the Hiera data for ocg has "redis_host: rdb1007". Is there a way to handle a rdb1007 restart gracefully, otherwise I'll add a note to the "Service restarts" wikitech page [13:51:44] (in terms of ocg I mean) [13:52:31] (03CR) 10Hashar: "Rebased. 
The hhvm class is now using base::service_unit which make the patch quite trivial." [puppet] - 10https://gerrit.wikimedia.org/r/269946 (https://phabricator.wikimedia.org/T126594) (owner: 10Hashar) [13:54:48] (03PS2) 10Hashar: contint: disable HHVM background service [puppet] - 10https://gerrit.wikimedia.org/r/269947 (https://phabricator.wikimedia.org/T126594) [13:54:54] !log puppet disabled on analytics1027 to stop Camus [13:54:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:56:09] (03CR) 10Hashar: [C: 031] "Cherry picked on integration puppetmaster" [puppet] - 10https://gerrit.wikimedia.org/r/269946 (https://phabricator.wikimedia.org/T126594) (owner: 10Hashar) [13:56:19] (03CR) 10Hashar: [C: 031] "Cherry picked on integration puppetmaster" [puppet] - 10https://gerrit.wikimedia.org/r/269947 (https://phabricator.wikimedia.org/T126594) (owner: 10Hashar) [13:59:52] 06Operations, 10Traffic: Move californium to an internal host? - https://phabricator.wikimedia.org/T133149#2223542 (10BBlack) [13:59:58] 06Operations, 10Continuous-Integration-Infrastructure, 10Traffic: Move gallium to an internal host? - https://phabricator.wikimedia.org/T133150#2223557 (10BBlack) [14:05:12] 06Operations, 10MediaWiki-General-or-Unknown, 07HHVM: Backport libvpx2 to jessie - https://phabricator.wikimedia.org/T132033#2223576 (10Joe) 05Open>03Resolved [14:05:14] 06Operations, 10MediaWiki-General-or-Unknown, 07HHVM: Make all role::mediawiki::* classes compatible with debian jessie - https://phabricator.wikimedia.org/T131749#2223577 (10Joe) [14:11:08] PROBLEM - puppet last run on mw1117 is CRITICAL: CRITICAL: Puppet has 1 failures [14:11:22] (03CR) 10Ottomata: "looks good, 2 more nits!" 
(032 comments) [puppet/cdh] - 10https://gerrit.wikimedia.org/r/284204 (https://phabricator.wikimedia.org/T127990) (owner: 10Elukey) [14:15:57] 06Operations, 10Continuous-Integration-Infrastructure: Investigate usage of ttf-ubuntu-font-family which is not available on Jessie - https://phabricator.wikimedia.org/T103325#2223600 (10MoritzMuehlenhoff) [14:15:59] 06Operations, 10MediaWiki-General-or-Unknown, 07HHVM: Make all role::mediawiki::* classes compatible with debian jessie - https://phabricator.wikimedia.org/T131749#2223599 (10MoritzMuehlenhoff) [14:16:20] cmjohnson1: hey Chris, do you think you'll have time today for T128107 ? [14:16:20] T128107: install restbase1010-restbase1015 - https://phabricator.wikimedia.org/T128107 [14:16:37] godog: yes I will get it today [14:16:56] this afternoon....i have a dentist appt shortly and will do right after [14:17:10] (03PS2) 10Volans: Change eqiad masters for s1,s3-s7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284455 (https://phabricator.wikimedia.org/T105135) [14:17:35] cmjohnson1: np, thanks! [14:19:43] (03CR) 10Alex Monk: "I added the checks on shinken-01 (the commit is cherry-picked on the beta puppetmaster) and it all appears to be working" [puppet] - 10https://gerrit.wikimedia.org/r/283227 (https://phabricator.wikimedia.org/T111064) (owner: 10Alex Monk) [14:20:08] (03PS4) 10Elukey: Add the possibility to set an external database for Hue. [puppet/cdh] - 10https://gerrit.wikimedia.org/r/284204 (https://phabricator.wikimedia.org/T127990) [14:20:55] cmjohnson1: seems you have received our elasticsearch servers (T129381). Let me know if you need me to plan their racking... [14:21:50] gehel: I have recieved alot of servers...i am assuming you want these spread out over the 4 racks? [14:22:25] cmjohnson1: I'd like to have at least 1 master per rack. 
Ideally yes, they should be spread [14:23:19] cmjohnson1: I don't know how much free space you got there, but if you need to be able to shutdown more ES servers during the racking, we can probably move the load around the existing ones... [14:23:26] (03CR) 10Ottomata: [C: 032 V: 032] Add the possibility to set an external database for Hue. [puppet/cdh] - 10https://gerrit.wikimedia.org/r/284204 (https://phabricator.wikimedia.org/T127990) (owner: 10Elukey) [14:23:47] (03CR) 10Andrew Bogott: "Sorry for the delay with this! I would like to give this a test with the labtest cluster; however, labtestnet2001 is currently dismantled" [puppet] - 10https://gerrit.wikimedia.org/r/268921 (owner: 10Alex Monk) [14:23:58] gehel: i should be okay but will let you know if I will need to make room [14:25:31] cmjohnson1: Thanks! That's the first time I get servers, so I'm not really sure of what you need from me. Just let me know... [14:26:00] (03CR) 10Jcrespo: [C: 031] Change eqiad masters for s1,s3-s7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284455 (https://phabricator.wikimedia.org/T105135) (owner: 10Volans) [14:26:38] gehel: no worries --- once I actually pull the out of the box I will create a rack and set up task and we'll manage it from there [14:26:50] (03PS1) 10Elukey: Update the cdh module with then new version. [puppet] - 10https://gerrit.wikimedia.org/r/284460 (https://phabricator.wikimedia.org/T127990) [14:27:55] 06Operations, 10MediaWiki-General-or-Unknown, 07HHVM: Backport libvpx2 to jessie - https://phabricator.wikimedia.org/T132033#2223642 (10Joe) I have backported the libvpx 1.5.0 package from stretch [14:28:03] (03PS2) 10Elukey: Update the cdh module with the new version. 
[puppet] - 10https://gerrit.wikimedia.org/r/284460 (https://phabricator.wikimedia.org/T127990) [14:29:03] 06Operations, 10MediaWiki-General-or-Unknown, 07HHVM: Backport libvpx2 to jessie - https://phabricator.wikimedia.org/T132033#2223643 (10Joe) a:03Joe [14:30:07] (03CR) 10Elukey: [C: 032] Update the cdh module with the new version. [puppet] - 10https://gerrit.wikimedia.org/r/284460 (https://phabricator.wikimedia.org/T127990) (owner: 10Elukey) [14:31:52] (03PS1) 10Jcrespo: Repool es2019 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284462 [14:35:58] RECOVERY - puppet last run on mw1117 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [14:37:01] !log stopping puppet on analytics1015 and analytics1003 in prep for migration [14:37:01] (03PS1) 10Muehlenhoff: Fix appserver font package name for Indian fonts [puppet] - 10https://gerrit.wikimedia.org/r/284463 (https://phabricator.wikimedia.org/T131749) [14:37:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:45:40] (03CR) 10Volans: [C: 031] Repool es2019 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284462 (owner: 10Jcrespo) [14:45:42] ostriches: ytterbium puppet disabled? [14:46:15] paravoid: Yeah, was testing some stuff with replication & ssh and puppet was gonna get in the way. I'll re-enable today for sure. 
[14:46:20] k [14:46:23] thanks [14:46:26] np [14:49:19] (03CR) 10Jcrespo: [C: 031] Repool es2019 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284462 (owner: 10Jcrespo) [14:49:26] (03CR) 10Jcrespo: [C: 032] Repool es2019 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284462 (owner: 10Jcrespo) [14:50:58] (03PS3) 10Volans: Change eqiad masters for s1,s3-s7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284455 (https://phabricator.wikimedia.org/T105135) [14:52:22] !log jynus@tin Synchronized wmf-config/db-codfw.php: Repool es2019 (duration: 00m 38s) [14:52:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:54:07] (03PS2) 10Ottomata: analytics1015 -> analytics1003 migration [puppet] - 10https://gerrit.wikimedia.org/r/284276 (https://phabricator.wikimedia.org/T130840) [14:54:27] (03CR) 10Ottomata: [C: 032 V: 032] analytics1015 -> analytics1003 migration [puppet] - 10https://gerrit.wikimedia.org/r/284276 (https://phabricator.wikimedia.org/T130840) (owner: 10Ottomata) [14:54:41] (03CR) 10Volans: [C: 032] Change eqiad masters for s1,s3-s7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284455 (https://phabricator.wikimedia.org/T105135) (owner: 10Volans) [14:54:55] db2009 seems a bit loaded- I am going to enable the slave [14:54:59] 06Operations, 10OTRS: OTRS has been eating 100% of mendelevium's CPU for the last fortnight - https://phabricator.wikimedia.org/T132822#2223706 (10akosiaris) 05Open>03Resolved a:03akosiaris Bug has been reported upstream https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=821848 and also got a CVE number... 
[14:55:06] ok [14:55:09] (03Merged) 10jenkins-bot: Change eqiad masters for s1,s3-s7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284455 (https://phabricator.wikimedia.org/T105135) (owner: 10Volans) [14:55:20] I am going to do some more small tweaks for codfw weights [14:55:27] !log started puppet on analytics1003 [14:55:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:56:36] !log volans@tin Synchronized wmf-config/db-eqiad.php: Change eqiad masters for s1,s3-s7 - T105135 (duration: 00m 28s) [14:56:37] T105135: Implement mariadb 10.0 masters - https://phabricator.wikimedia.org/T105135 [14:56:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:57:06] _joe_: you might need to rebsae your changes for tomorrow given the above ---^^^ [14:57:57] RECOVERY - puppet last run on analytics1003 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [15:03:24] 06Operations, 10OfflineContentGenerator, 13Patch-For-Review, 05codfw-rollout: Unable to download PDF files of articles - https://phabricator.wikimedia.org/T133136#2221668 (10cscott) @Joe -- it seems like we should open another task for "document migration procedure from codfw to eqiad", right? Isn't that... [15:05:24] (03PS2) 10Filippo Giunchedi: graphite: port to jessie/systemd [puppet] - 10https://gerrit.wikimedia.org/r/211685 (https://phabricator.wikimedia.org/T132717) [15:06:08] (03PS1) 10Jcrespo: Tweak weights for better latency, avoiding peaks on QPS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284465 [15:06:20] (03CR) 10Alex Monk: "It's dnsrecursor which runs on lab(test)?services, not lab(test)?net." 
[puppet] - 10https://gerrit.wikimedia.org/r/268921 (owner: 10Alex Monk) [15:06:27] 06Operations, 10Analytics, 10DNS, 10Traffic: Create analytics.wikimedia.org - https://phabricator.wikimedia.org/T132407#2223733 (10Nuria) @BBlack: let us know if you think we can proceed with this and whether fab is an acceptable way to deploy [15:06:31] (03CR) 10jenkins-bot: [V: 04-1] graphite: port to jessie/systemd [puppet] - 10https://gerrit.wikimedia.org/r/211685 (https://phabricator.wikimedia.org/T132717) (owner: 10Filippo Giunchedi) [15:06:38] 06Operations, 10Continuous-Integration-Infrastructure, 10Traffic: Move gallium to an internal host? - https://phabricator.wikimedia.org/T133150#2223557 (10Dzahn) it's also running Jenkins. added Hashar to answer if it need the public IP. also see T95757 [15:06:52] 06Operations, 10OCG-General, 06Services: OCG should not be contacted directly from the appservers but only via LVS - https://phabricator.wikimedia.org/T120077#2223738 (10cscott) Well, it would be nice if it was really "the org". But in reality it's just me. There's no other staffing for OCG, despite reques... [15:07:07] (03CR) 10Alex Monk: "although maybe to put useful data in it we need labtestnet?" [puppet] - 10https://gerrit.wikimedia.org/r/268921 (owner: 10Alex Monk) [15:07:31] 06Operations, 06Labs, 10Labs-Infrastructure, 10Traffic: Move californium to an internal host? - https://phabricator.wikimedia.org/T133149#2223741 (10Dzahn) [15:07:54] 06Operations, 06Labs, 10Labs-Infrastructure, 10Traffic: Move californium to an internal host? - https://phabricator.wikimedia.org/T133149#2223744 (10Dzahn) @Andrew Does the horizon host need the public IP ? 
[15:09:15] (03Abandoned) 10Dzahn: icinga: put role on einsteinium for testing [puppet] - 10https://gerrit.wikimedia.org/r/284277 (https://phabricator.wikimedia.org/T125023) (owner: 10Dzahn) [15:10:08] 06Operations, 07Icinga, 13Patch-For-Review: upgrade neon (icinga) to jessie - https://phabricator.wikimedia.org/T125023#2223761 (10Dzahn) Oh, ok. but we know what will replace Icinga? [15:10:19] (03PS3) 10Filippo Giunchedi: graphite: port to jessie/systemd [puppet] - 10https://gerrit.wikimedia.org/r/211685 (https://phabricator.wikimedia.org/T132717) [15:10:48] PROBLEM - puppet last run on db2017 is CRITICAL: CRITICAL: puppet fail [15:11:14] 06Operations, 10Traffic, 06WMF-Legal, 10domains: Register nlwikipedia.org to prevent squatting - https://phabricator.wikimedia.org/T128968#2223762 (10Dzahn) Which list of squatted domains are you referring to? [15:11:38] (03CR) 10jenkins-bot: [V: 04-1] graphite: port to jessie/systemd [puppet] - 10https://gerrit.wikimedia.org/r/211685 (https://phabricator.wikimedia.org/T132717) (owner: 10Filippo Giunchedi) [15:16:33] there is an abnormally high number of updates on s7 [15:16:49] <_joe_> jynus: as like 2x normal? [15:16:54] more [15:17:02] high since 9am [15:17:18] 06Operations, 10Traffic, 10media-storage, 13Patch-For-Review: don't serve upload.wikimedia.org 'root' from wmfrewrite/swift - https://phabricator.wikimedia.org/T130709#2223769 (10fgiunchedi) 05Resolved>03Open a:05BBlack>03fgiunchedi reopening since we've redirected the homepage which fixed the orig... 
[15:17:19] (but within normal) [15:17:20] !log re-imaging labtestvirt2001 and labtestneutron2001 [15:17:35] very high since 12:15pm [15:17:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:18:02] 3x more or less than the peak time [15:18:04] <_joe_> jynus: no idea then, we have a high rate of updates on the jobqueue [15:18:12] 06Operations, 10Analytics, 10DNS, 10Traffic: Create analytics.wikimedia.org - https://phabricator.wikimedia.org/T132407#2223774 (10BBlack) I really have no idea about the fab deployment method (whether it's ok, how we automate it and grant access, where it's fetching data from, etc), or how/when we're goin... [15:18:15] so like a backlog? [15:18:37] !log enabling puppet on analytics1015 [15:18:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:19:02] yes, I can see it [15:19:08] let me confirm if connected [15:20:37] most traffic seems to come from frwiktionary [15:21:09] <_joe_> uhm [15:21:46] 06Operations, 10OCG-General, 06Services: OCG should not be contacted directly from the appservers but only via LVS - https://phabricator.wikimedia.org/T120077#1844369 (10ssastry) >>! In T120077#2223738, @cscott wrote: > Well, it would be nice if it was really "the org". But in reality it's just me. There's... [15:22:09] 06Operations, 10Traffic, 10media-storage, 13Patch-For-Review: don't serve upload.wikimedia.org 'root' from wmfrewrite/swift - https://phabricator.wikimedia.org/T130709#2223811 (10BBlack) I guess you mean `crossdomain.xml` and `robots.txt`? Is there a plan for handling them outside of swift in general? I... [15:22:10] internal traffic (cron or jobs) for "LinksUpdate::updateLinksTimestamp" [15:22:16] <_joe_> sigh [15:22:19] <_joe_> what I feared [15:22:37] <_joe_> so the queue is thin because the codfw cluster is more beefy [15:22:40] purges or bug? 
[15:22:47] or none [15:22:53] <_joe_> but the rate of updates suggests it's the usual loop bug [15:22:57] mmm [15:23:01] <_joe_> jynus: bug probably [15:23:10] let me file it [15:23:16] with my info [15:23:32] or did the other one not close, let me search? [15:23:47] (03CR) 10Filippo Giunchedi: [C: 031] Fix appserver font package name for Indian fonts [puppet] - 10https://gerrit.wikimedia.org/r/284463 (https://phabricator.wikimedia.org/T131749) (owner: 10Muehlenhoff) [15:24:40] searching refreshlinks job + open tasks is like not searching anything [15:25:06] or more exactly, searching all tickets [15:26:13] 06Operations, 10Analytics-EventLogging, 10RESTBase, 06Services, and 2 others: RESTBase should handle the X-Analytics header - https://phabricator.wikimedia.org/T133139#2221908 (10Nuria) The premise of this ticket is .. ahem.. pretty incorrect, let's catch up on IRC. [15:26:28] 06Operations, 10Traffic, 10media-storage, 13Patch-For-Review: don't serve upload.wikimedia.org 'root' from wmfrewrite/swift - https://phabricator.wikimedia.org/T130709#2143868 (10faidon) I'm not sure I understand this task — when/how is rewrite.py is going to go away? We haven't really talked about that.... [15:29:01] 06Operations, 10Traffic, 06WMF-Legal, 10domains: Register nlwikipedia.org to prevent squatting - https://phabricator.wikimedia.org/T128968#2223839 (10jayvdb) http://internal.wikimedia.org/wiki/Squatted_Wikimedia_domains ? (I have no idea if it is still there, but that is where it used to be) [15:30:24] 06Operations, 10ops-codfw: rack/setup/deploy restbase200[7-9] - https://phabricator.wikimedia.org/T132976#2223854 (10fgiunchedi) excellent @papaul, let me know when done or if you need anything from me [15:31:28] 07Puppet, 10Beta-Cluster-Infrastructure: Puppet failure on deployment-cache-parsoid05 due to removal of role::cache::parsoid - https://phabricator.wikimedia.org/T132260#2223861 (10Krenair) thanks @bblack. 
I haven't been able to find anything referencing this host (MW is set up to go to parsoid itself) or any n... [15:32:31] 06Operations, 10DBA, 10MediaWiki-JobQueue: Jobqueue increase activity on refreshlinks for frwiktionary - https://phabricator.wikimedia.org/T133160#2223864 (10jcrespo) [15:32:44] (03PS2) 10Giuseppe Lavagetto: Put eqiad in read-write mode for datacenter switchover to eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284396 [15:33:58] 06Operations, 10Analytics, 10DNS, 10Traffic: Create analytics.wikimedia.org - https://phabricator.wikimedia.org/T132407#2223885 (10akosiaris) >>! In T132407#2211054, @Nuria wrote: > > @BBlack > This would not be a full-fledged service. What we would be deploying either via puppet of fab is just html/js... [15:34:07] 06Operations, 10ops-codfw: rack/setup/deploy restbase200[7-9] - https://phabricator.wikimedia.org/T132976#2223886 (10Papaul) Nothing for now, since I am waiting on the other 2 servers . ETA says 4-22. thanks [15:34:11] 06Operations, 10DBA, 10MediaWiki-JobQueue: Jobqueue increase activity on refreshlinks for frwiktionary - https://phabricator.wikimedia.org/T133160#2223887 (10jcrespo) [15:35:04] 07Puppet, 10Beta-Cluster-Infrastructure: Puppet failure on deployment-cache-parsoid05 due to removal of role::cache::parsoid - https://phabricator.wikimedia.org/T132260#2192879 (10hashar) Parsoid settings for beta are in mediawiki/services/parsoid/deploy and @bblack removed the bit referencing the cache-parsoi... 
[15:35:45] (03CR) 10Jcrespo: [C: 032] Tweak weights for better latency, avoiding peaks on QPS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284465 (owner: 10Jcrespo) [15:36:01] 07Puppet, 10Beta-Cluster-Infrastructure: Puppet failure on deployment-cache-parsoid05 due to removal of role::cache::parsoid - https://phabricator.wikimedia.org/T132260#2223892 (10Krenair) a:03Krenair It's shut off now, I will delete it on wednesday 27th [15:36:10] (03PS4) 10Filippo Giunchedi: graphite: port to jessie/systemd [puppet] - 10https://gerrit.wikimedia.org/r/211685 (https://phabricator.wikimedia.org/T132717) [15:37:36] !log jynus@tin Synchronized wmf-config/db-codfw.php: Tweak DB weights for better latency, avoiding peaks on QPS (duration: 00m 32s) [15:37:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:38:38] RECOVERY - puppet last run on db2017 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:39:53] 06Operations, 10ops-codfw: rack/setup/deploy restbase200[7-9] - https://phabricator.wikimedia.org/T132976#2223898 (10Papaul) [15:42:24] !log delete apifeatureusage-2016-01-(02,09,10) from eqiad elasticsearch cluster. We only keep 30 days of apifeatureusage logs [15:42:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:42:29] (03CR) 10Giuseppe Lavagetto: "A couple of small tweaks but looks generally ok to me" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/211685 (https://phabricator.wikimedia.org/T132717) (owner: 10Filippo Giunchedi) [15:43:49] !log delete apifeatureusage-2016.01.20 from codfw elasticsearch cluster. Index should never have existed in this cluster (and is beyond retention). 
[15:43:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:45:36] 06Operations, 10Traffic, 10media-storage, 13Patch-For-Review: authoritative copy of 'root' files for upload.wikimedia.org is only in swift - https://phabricator.wikimedia.org/T130709#2223905 (10fgiunchedi) [15:46:37] (03CR) 1020after4: "That location already has other keys and it's outside of the normal puppet repo. I don't know how opsen manage the files, I guess that the" [puppet] - 10https://gerrit.wikimedia.org/r/284418 (https://phabricator.wikimedia.org/T132747) (owner: 1020after4) [15:48:03] 06Operations, 10Traffic, 10media-storage, 13Patch-For-Review: authoritative copy of 'root' files for upload.wikimedia.org is only in swift - https://phabricator.wikimedia.org/T130709#2223908 (10fgiunchedi) @bblack @faidon I've retitled/clarified a bit the scope of this task since it wasn't clear, originall... [15:58:34] 06Operations: create a mailing list for maint-announce mail - https://phabricator.wikimedia.org/T132968#2223962 (10RobH) All of that seems right to me! I don't think maintaining a small list of allowed domains will be difficult, as its only our carriers and vendors. Looks good. [15:58:44] 06Operations, 10RESTBase-Cassandra, 06Services: Cleanup Graphite Cassandra metrics - https://phabricator.wikimedia.org/T132771#2223963 (10Eevans) >>! In T132771#2221855, @mobrovac wrote: >>>! In T132771#2210659, @Eevans wrote: >> I'm going to play Devil's Advocate here and ask: In the absence of column famil... [16:02:43] (03CR) 10Krinkle: "What is the context on this? Is this addressing a recent regression? (how recent?) - can you mention the cause?" [puppet] - 10https://gerrit.wikimedia.org/r/284161 (owner: 10Muehlenhoff) [16:03:08] 06Operations, 10Continuous-Integration-Infrastructure, 10Traffic: Move gallium to an internal host? - https://phabricator.wikimedia.org/T133150#2223981 (10hashar) gallium has been setup in 2011 and is still on Precise. 
It received a public IP to serves the Jenkins web interface. With time, all the http entry... [16:04:01] 06Operations, 10Continuous-Integration-Infrastructure, 10Traffic: Move gallium to an internal host? - https://phabricator.wikimedia.org/T133150#2223557 (10hashar) [16:06:08] (03CR) 10Jdlrobson: [C: 031] Split mobile text cache for lazyloadimages testing [puppet] - 10https://gerrit.wikimedia.org/r/284080 (https://phabricator.wikimedia.org/T127883) (owner: 10BBlack) [16:06:54] (03CR) 10Jdlrobson: "Looks good to me but haven't tested (don't know how to test puppet changes)" [puppet] - 10https://gerrit.wikimedia.org/r/284080 (https://phabricator.wikimedia.org/T127883) (owner: 10BBlack) [16:08:13] 06Operations, 10Analytics-EventLogging, 10RESTBase, 06Services, and 2 others: RESTBase should handle the X-Analytics header - https://phabricator.wikimedia.org/T133139#2224008 (10mobrovac) 05Open>03Invalid Nope, that's not what we should do. [16:12:12] 06Operations, 06Labs, 10Labs-Infrastructure, 10Traffic: Move californium to an internal host? - https://phabricator.wikimedia.org/T133149#2224036 (10Andrew) @Dzahn as far as I know it does not, moving it to an internal IP would be fine. [16:12:45] 06Operations, 10OCG-General, 06Services: OCG should not be contacted directly from the appservers but only via LVS - https://phabricator.wikimedia.org/T120077#2224038 (10cscott) Created new task: {T133164}. I could use some help from ops -- I don't know where this documentation canonically lives. [16:16:46] 06Operations, 10OCG-General, 06Services: OCG should not be contacted directly from the appservers but only via LVS - https://phabricator.wikimedia.org/T120077#2224044 (10cscott) Refocusing this bug: I described above adding a per-machine flag to tell specific hosts not to check the redis queue: https://phabr... 
[16:18:44] 06Operations, 10OCG-General, 06Services: Implement flag to tell an OCG machine not to take new tasks from the redis task queue - https://phabricator.wikimedia.org/T120077#2224047 (10cscott) [16:21:55] 06Operations, 10Analytics, 10DNS, 10Traffic: Create analytics.wikimedia.org - https://phabricator.wikimedia.org/T132407#2224049 (10Dzahn) Everything Alex already said :) i setup bromine and most of those microsites and yea, it's meant for small static sites. In addition to one of those small puppet roles... [16:22:50] 06Operations, 10OCG-General, 06Services: Implement flag to tell an OCG machine not to take new tasks from the redis task queue - https://phabricator.wikimedia.org/T120077#2224054 (10cscott) So... as to implementation. Would ops prefer to decommission a host via puppet or redis? The puppet option would be t... [16:24:16] 06Operations, 10Analytics, 10DNS, 10Traffic: Create analytics.wikimedia.org - https://phabricator.wikimedia.org/T132407#2224056 (10BBlack) Well, the impedance mismatch here on the standard static bromine setup and what analytics is asking for then may be all about the static-ness and deployment process. I... [16:29:10] 06Operations, 10Analytics, 10DNS, 10Traffic: Create analytics.wikimedia.org - https://phabricator.wikimedia.org/T132407#2224067 (10Dzahn) It would mostly just be about who has +2 on the gerrit repo that holds the actual site content. If the puppet role on our site git clones with "ensure latest" then there... 
[16:31:02] (03PS1) 10Andrew Bogott: Specify the wikistatus host for labtest [puppet] - 10https://gerrit.wikimedia.org/r/284479 [16:33:22] (03CR) 10Andrew Bogott: [C: 032] Specify the wikistatus host for labtest [puppet] - 10https://gerrit.wikimedia.org/r/284479 (owner: 10Andrew Bogott) [16:34:22] (03CR) 10Andrew Bogott: "@giuseppe: this patch replaces a file that you removed; I'm hoping you'll have a suggestion as to how to allow this without adding auth.co" [puppet] - 10https://gerrit.wikimedia.org/r/284103 (owner: 10Andrew Bogott) [16:39:53] 06Operations, 10Traffic, 07HTTPS, 13Patch-For-Review: Sort out letsencrypt puppetization for simple public hosts - https://phabricator.wikimedia.org/T132812#2224125 (10BBlack) [16:42:59] 06Operations, 10MediaWiki-Cache, 10Wikimedia-General-or-Unknown, 05codfw-rollout: Wrong sidebar cached on sites - https://phabricator.wikimedia.org/T133069#2224137 (10Krinkle) Sidebar is built by Skin::buildSidebar: * Skin::buildSidebar() * cached in WANObjectCache. * computed by Skin::addToSidebar() usi... [16:46:12] legoktm: Thoughts on https://phabricator.wikimedia.org/T133069#2224135 ? Seems like I'm missing something. [16:48:02] 06Operations, 10Analytics, 10DNS, 10Traffic: Create analytics.wikimedia.org - https://phabricator.wikimedia.org/T132407#2224174 (10Nuria) >It would mostly just be about who has +2 on the gerrit repo that holds the actual site content. If the puppet role on our site git clones with "ensure latest" >then th... 
[16:49:59] (03CR) 10Jforrester: Set wgKartographerWikivoyageMode (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283339 (owner: 10Dereckson) [16:50:03] (03CR) 10Southparkfan: [C: 04-1] acme-setup script + acme::init (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/283988 (https://phabricator.wikimedia.org/T132812) (owner: 10BBlack) [16:51:10] (03PS1) 10Jforrester: Follow-up I049fa67: Do not enable wgKartographerWikivoyageMode on MW.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284483 [16:51:15] (03CR) 10Dereckson: Set wgKartographerWikivoyageMode (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283339 (owner: 10Dereckson) [16:52:50] (03CR) 10Jforrester: Set wgKartographerWikivoyageMode (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283339 (owner: 10Dereckson) [16:53:14] (03CR) 10Jforrester: "Follow-up: Ida2f43cfb" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283339 (owner: 10Dereckson) [16:54:38] !log replaced "#" with ";" manually in uranium's /etc/php5/cli/conf.d/20-xhprof.ini and /etc/php5/apache2/php.ini to avoid cronspam (didn't find puppet/package trails) [16:54:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:55:46] (03CR) 10KartikMistry: [C: 031] Fix appserver font package name for Indian fonts [puppet] - 10https://gerrit.wikimedia.org/r/284463 (https://phabricator.wikimedia.org/T131749) (owner: 10Muehlenhoff) [16:56:52] elukey: xhprof in uranium ? 
it probably makes sense to get rid of the extension altogether [16:57:09] (03PS1) 10Giuseppe Lavagetto: bastionhost::general: include dsh configuration [puppet] - 10https://gerrit.wikimedia.org/r/284487 [16:57:14] <_joe_> paravoid: ^^ [16:57:17] although now that I think about it, puppet must be installing it [16:57:19] <_joe_> subbu: you too :) [16:58:00] duuuhh [16:58:25] (03CR) 10Faidon Liambotis: [C: 032] bastionhost::general: include dsh configuration [puppet] - 10https://gerrit.wikimedia.org/r/284487 (owner: 10Giuseppe Lavagetto) [16:58:40] akosiaris: ah yes for sure, but I didn't find anything and after running puppet my changes were still there, so I thought to log it anyway in the SAL [16:58:41] <_joe_> I told you it was simple :P [16:58:44] thx [16:59:40] elukey: oh, logging it was the way to go, but we should also figure out why xhprof ended up being in ganglia web .. it's not like it's useful [17:00:09] akosiaris: I looked but I can't find where puppet would install xhprof on uranium [17:02:19] <_joe_> subbu: /etc/dsh/group/parsoid is now on bast1001 :) [17:02:51] verified. thanks. [17:03:35] SPF|Cloud: indeed. It does not. It probably was cruft leftover (either debugging or some puppet refactoring). I 've purged the php module along with config. [17:03:39] elukey: ^ [17:03:52] !log aptitude purge php5-xhprof on uranium [17:03:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:04:02] yeah, that should do the trick I guess [17:06:57] _joe_, paravoid can one of you add the switchback time to the etherpad? was it 5am PT or 7amPT? I didn't catch the time. [17:08:02] Tuesday, 19 April at 0700 PDT/1400 UTC (and same time on Thursday) [17:08:03] <_joe_> 7 am [17:08:30] k [17:09:09] (03CR) 10Dereckson: "Initial situation before I049fa67: Kartographer has a KartographerWikivoyageMode setting, with a default value at true." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/284483 (owner: 10Jforrester) [17:10:56] (03CR) 10Dereckson: Set wgKartographerWikivoyageMode (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283339 (owner: 10Dereckson) [17:11:36] 06Operations, 10Analytics, 10DNS, 10Traffic: Create analytics.wikimedia.org - https://phabricator.wikimedia.org/T132407#2224265 (10BBlack) When you say "build our code" do you mean building client-side javascript code that's ultimately static content from the server's perspective, or do you mean building s... [17:12:03] (03PS1) 10Giuseppe Lavagetto: redis::monitoring::instance: partially disable replication checks [puppet] - 10https://gerrit.wikimedia.org/r/284489 [17:13:36] 06Operations, 10Analytics, 10DNS, 10Traffic: Create analytics.wikimedia.org - https://phabricator.wikimedia.org/T132407#2224292 (10Dzahn) "build code" and "static site" are confusing me a bit. the kind of static site we host on bromine means HTML and CSS and some images. 
[17:14:59] (03PS3) 10BBlack: Split mobile text cache for lazyloadimages testing [puppet] - 10https://gerrit.wikimedia.org/r/284080 (https://phabricator.wikimedia.org/T127883) [17:15:08] (03CR) 10BBlack: [C: 032 V: 032] Split mobile text cache for lazyloadimages testing [puppet] - 10https://gerrit.wikimedia.org/r/284080 (https://phabricator.wikimedia.org/T127883) (owner: 10BBlack) [17:15:17] 06Operations, 10MediaWiki-General-or-Unknown, 10Monitoring: edit.success in graphite never reached zero during codfw switchover - https://phabricator.wikimedia.org/T133177#2224294 (10fgiunchedi) [17:15:23] !log Upgrading db1071 and fixing overheathing problems T132515 [17:15:24] T132515: db1070, db1071 and db1065 overheating problems - https://phabricator.wikimedia.org/T132515 [17:15:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:15:28] 06Operations, 10Analytics, 10DNS, 10Traffic: Create analytics.wikimedia.org - https://phabricator.wikimedia.org/T132407#2197243 (10ori) >>! In T132407#2224265, @BBlack wrote: > When you say "build our code" do you mean building client-side javascript code that's ultimately static content from the server's... 
[17:17:14] 06Operations, 10MediaWiki-JobQueue, 10Monitoring: Redis monitoring needs to be improved - https://phabricator.wikimedia.org/T133179#2224324 (10Joe) [17:17:23] 06Operations, 10MediaWiki-JobQueue, 10Monitoring: Redis monitoring needs to be improved - https://phabricator.wikimedia.org/T133179#2224336 (10Joe) p:05Triage>03Normal a:03Joe [17:18:03] (03PS2) 10Giuseppe Lavagetto: redis::monitoring::instance: partially disable replication checks [puppet] - 10https://gerrit.wikimedia.org/r/284489 (https://phabricator.wikimedia.org/T133179) [17:19:08] (03PS3) 10Giuseppe Lavagetto: redis::monitoring::instance: partially disable replication checks [puppet] - 10https://gerrit.wikimedia.org/r/284489 (https://phabricator.wikimedia.org/T133179) [17:19:24] !log aaron@tin Synchronized php-1.27.0-wmf.21/includes/jobqueue/JobQueueRedis.php: 86d185a4bbf52d (duration: 00m 39s) [17:19:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:21:07] 06Operations, 10Traffic, 06WMF-Legal, 10domains: Register nlwikipedia.org to prevent squatting - https://phabricator.wikimedia.org/T128968#2224341 (10Dzahn) Oh, internal.wikimedia.org , i don't think i have access to that. But i asked on the staff channel if somebody can check it. [17:23:37] (03CR) 10Giuseppe Lavagetto: [C: 032] redis::monitoring::instance: partially disable replication checks [puppet] - 10https://gerrit.wikimedia.org/r/284489 (https://phabricator.wikimedia.org/T133179) (owner: 10Giuseppe Lavagetto) [17:25:18] 06Operations, 10MediaWiki-General-or-Unknown, 10Monitoring: edit.success in graphite never reached zero during codfw switchover - https://phabricator.wikimedia.org/T133177#2224346 (10fgiunchedi) edit dashboard: https://grafana.wikimedia.org/dashboard/db/edit-count?from=1461074592048&to=1461079319274 and val... 
[17:27:56] (03PS5) 10Filippo Giunchedi: graphite: port to jessie/systemd [puppet] - 10https://gerrit.wikimedia.org/r/211685 (https://phabricator.wikimedia.org/T132717) [17:27:58] (03CR) 10Filippo Giunchedi: graphite: port to jessie/systemd (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/211685 (https://phabricator.wikimedia.org/T132717) (owner: 10Filippo Giunchedi) [17:30:11] !log Upgrading db1070 and fixing overheating problems T132515 [17:30:13] T132515: db1070, db1071 and db1065 overheating problems - https://phabricator.wikimedia.org/T132515 [17:30:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:30:34] akosiaris: sorry was in a meeting, reading [17:31:08] ahhh nice thanks! [17:33:26] 06Operations: Migrate hydrogen/chromium to jessie - https://phabricator.wikimedia.org/T123727#2224351 (10MoritzMuehlenhoff) For the dnsrec service the server should be depooled via confctl. For NTP all our servers are configured to use multiple NTP servers, so as long as only one system is being reimaged at a ti... [17:33:45] (03PS1) 10Andrew Bogott: Mark off a block of public IPs for labtest [dns] - 10https://gerrit.wikimedia.org/r/284491 [17:34:47] chasemp: I think that ^ indicates a currently unused range [17:40:41] 06Operations, 10Analytics, 10DNS, 10Traffic: Create analytics.wikimedia.org - https://phabricator.wikimedia.org/T132407#2224373 (10Ottomata) Hm, I had assumed we would just host analytics.wikimedia.org on stat1001. I think we'd like analytics.wikimedia.org to eventually supersede stats.wikimedia.org, an... 
[17:46:35] (03PS3) 10BBlack: Common VCL: remove wikimedia.org subdomain HTTPS redirect exception [puppet] - 10https://gerrit.wikimedia.org/r/284106 (https://phabricator.wikimedia.org/T102826) [17:47:25] (03CR) 10BBlack: [C: 032 V: 032] Common VCL: remove wikimedia.org subdomain HTTPS redirect exception [puppet] - 10https://gerrit.wikimedia.org/r/284106 (https://phabricator.wikimedia.org/T102826) (owner: 10BBlack) [17:56:48] 06Operations, 10Analytics, 10DNS, 10Traffic: Create analytics.wikimedia.org - https://phabricator.wikimedia.org/T132407#2224433 (10Nuria) >In general, analytics.wikimedia.org will host static files (html, js, tsvs, etc.), but not just for one service (dashiki / reportcard). Correct, some dashiki plots will... [17:58:02] ottomata: I saw your an1003 patches -- why aren't you setting it with jessie? [17:58:40] !log Upgrading db1065 and fixing overheating problems T132515 [17:58:41] T132515: db1070, db1071 and db1065 overheating problems - https://phabricator.wikimedia.org/T132515 [17:58:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:59:15] paravoid: it uses cdh packages [17:59:18] which are trusty [17:59:47] !log changing database topology to set db1031 as the master of x1 on eqiad [18:00:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:00:53] ugh :/ [18:02:48] ja the whole hadoop cluster is trusty [18:03:23] well that much I knew :) [18:03:36] I was just wondering about slowly transitioning it, but it seems CDH has no jessie packages :( [18:05:09] ja [18:05:12] indeed [18:05:15] 06Operations: Preserve SSH host key when re-imaging hosts - https://phabricator.wikimedia.org/T129180#2224470 (10csteipp) Rather than keep all of our ssh keys in one place, could we have a repo that tracks the fingerprint of all servers, so when the warnings do come up, you just git pull, and (assuming nothing m... 
[18:05:41] 06Operations, 10Traffic, 06WMF-Legal, 10domains: Register nlwikipedia.org to prevent squatting - https://phabricator.wikimedia.org/T128968#2224483 (10Dzahn) Correction, i do have access, just had to find it. The page is still there, but the last edit was in May 2013. So it's not really used (by legal ?)... [18:08:00] 06Operations: Preserve SSH host key when re-imaging hosts - https://phabricator.wikimedia.org/T129180#2097333 (10faidon) Well, that's /etc/ssh/ssh_known_hosts on all bastions, see my mail as quoted by @akosiaris above :) Longer-term we might want to explore OpenSSH's relatively new certificate support and havin... [18:08:32] (03PS1) 10Jcrespo: Promote db1031 as the new x1 eqiad local master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284497 [18:10:03] (03CR) 10Volans: [C: 031] Promote db1031 as the new x1 eqiad local master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284497 (owner: 10Jcrespo) [18:10:45] 06Operations, 10Traffic, 06WMF-Legal, 10domains: Register nlwikipedia.org to prevent squatting - https://phabricator.wikimedia.org/T128968#2224497 (10Dzahn) Added nevertheless : https://internal.wikimedia.org/w/index.php?title=Domain_names%2FSquatted_Wikimedia_domains&type=revision&diff=17196&oldid=17106... 
[18:10:59] 06Operations, 10media-storage: Unable to delete, restore/undelete, move or upload new versions of files on several wikis ("inconsistent state within the internal storage backends") - https://phabricator.wikimedia.org/T128096#2224501 (10aaron) [18:11:01] 06Operations, 07Availability, 05MW-1.27-release-notes, 13Patch-For-Review, and 2 others: Implement a replication strategy for Swift - https://phabricator.wikimedia.org/T91869#2224500 (10aaron) [18:11:25] 06Operations, 07Availability, 05MW-1.27-release-notes, 13Patch-For-Review, and 2 others: Implement a replication strategy for Swift - https://phabricator.wikimedia.org/T91869#1097878 (10aaron) 05Open>03Resolved This has been reasonably stable in prod for some time. [18:15:53] (03CR) 10Jcrespo: [C: 032] Promote db1031 as the new x1 eqiad local master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284497 (owner: 10Jcrespo) [18:16:10] !log ori@tin Synchronized php-1.27.0-wmf.21/includes/jobqueue/JobQueueGroup.php: Ie9799f5ea: Catch errors in pushLazyJobs() and log them (duration: 00m 36s) [18:16:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:16:42] !log ori@tin Synchronized php-1.27.0-wmf.21/extensions/Translate/messagegroups/WikiPageMessageGroup.php: I331bd93b: Avoid more master queries on page views (duration: 00m 31s) [18:16:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:17:04] Krinkle: what happens if it can't load the revision out of external store? 
[18:17:18] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Promote db1031 as the new x1 eqiad local master (duration: 00m 28s) [18:17:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:18:48] legoktm: then Revision::getContent -> getContentInternal -> loadText -> db select or getRevisionText() will fail [18:18:57] the latter using external store [18:19:29] getRevisionText will return false in that case [18:19:48] which means loadText() returns false [18:19:54] which means getContent() returns null [18:19:58] see phab ticket for rest [18:21:25] (03PS1) 10Ottomata: Include process and disk alerts for hive, oozie and mysql on analytics1003 [puppet] - 10https://gerrit.wikimedia.org/r/284500 (https://phabricator.wikimedia.org/T133182) [18:21:32] Krinkle: didn't a blank message mean use default or something? I wonder if the null is getting cast into an empty string somehow [18:21:44] 06Operations, 10Traffic, 07HTTPS: Enforce HTTPS+HSTS on remaining one-off sites in wikimedia.org that don't use standard cache cluster termination - https://phabricator.wikimedia.org/T132521#2224563 (10BBlack) I've started refactoring https://wikitech.wikimedia.org/wiki/HTTPS/domains - it now has the merged... 
[18:21:46] (03Abandoned) 10Dzahn: add role for hosts with LE certs, add on carbon [puppet] - 10https://gerrit.wikimedia.org/r/283763 (https://phabricator.wikimedia.org/T132812) (owner: 10Dzahn) [18:23:05] (03CR) 10jenkins-bot: [V: 04-1] Include process and disk alerts for hive, oozie and mysql on analytics1003 [puppet] - 10https://gerrit.wikimedia.org/r/284500 (https://phabricator.wikimedia.org/T133182) (owner: 10Ottomata) [18:24:51] (03PS2) 10Ottomata: Include process and disk alerts for hive, oozie and mysql on analytics1003 [puppet] - 10https://gerrit.wikimedia.org/r/284500 (https://phabricator.wikimedia.org/T133182) [18:24:59] (03PS1) 10BBlack: ssl_ciphersuite: use "always" for nginx STS header [puppet] - 10https://gerrit.wikimedia.org/r/284501 [18:25:01] (03PS1) 10BBlack: Add HSTS=1y to several one-off public services [puppet] - 10https://gerrit.wikimedia.org/r/284502 (https://phabricator.wikimedia.org/T132521) [18:25:06] (03CR) 10Dzahn: [C: 031] "lgtm, needs manual rebase though, hitting the button in gerrit fails" [puppet] - 10https://gerrit.wikimedia.org/r/283663 (owner: 10Muehlenhoff) [18:25:47] (03CR) 10jenkins-bot: [V: 04-1] Include process and disk alerts for hive, oozie and mysql on analytics1003 [puppet] - 10https://gerrit.wikimedia.org/r/284500 (https://phabricator.wikimedia.org/T133182) (owner: 10Ottomata) [18:26:41] (03PS3) 10Ottomata: Include process and disk alerts for hive, oozie and mysql on analytics1003 [puppet] - 10https://gerrit.wikimedia.org/r/284500 (https://phabricator.wikimedia.org/T133182) [18:26:57] (03CR) 10Dzahn: [C: 031] "i was working on a similar thing but never got to finish it, so yea, thanks +1" [puppet] - 10https://gerrit.wikimedia.org/r/284463 (https://phabricator.wikimedia.org/T131749) (owner: 10Muehlenhoff) [18:28:57] (03CR) 10Dzahn: "fyi/see also:" [puppet] - 10https://gerrit.wikimedia.org/r/284463 (https://phabricator.wikimedia.org/T131749) (owner: 10Muehlenhoff) [18:29:57] 06Operations, 10Traffic, 06WMF-Legal, 
10domains: Register nlwikipedia.org to prevent squatting - https://phabricator.wikimedia.org/T128968#2224619 (10Dzahn) 05Open>03Invalid Not sure which status is appropriate here. It's certainly not resolved, it's not open anymore, it's not stalled, it wasn't decli... [18:31:35] 06Operations, 06Discovery, 10Maps, 10hardware-requests: Maps back end hardware - https://phabricator.wikimedia.org/T131180#2224630 (10RobH) [18:31:57] (03PS4) 10Ottomata: Include process and disk alerts for hive, oozie and mysql on analytics1003 [puppet] - 10https://gerrit.wikimedia.org/r/284500 (https://phabricator.wikimedia.org/T133182) [18:32:11] 06Operations, 06Discovery, 10Maps, 10hardware-requests: Maps back end hardware - https://phabricator.wikimedia.org/T131180#2158431 (10RobH) [18:33:24] (03CR) 10BBlack: [C: 04-1] "of course, the version of nginx on precise hosts doesn't have the "always" option :P" [puppet] - 10https://gerrit.wikimedia.org/r/284501 (owner: 10BBlack) [18:35:12] 06Operations, 10Traffic, 06WMF-Legal, 10domains: Register nlwikipedia.org to prevent squatting - https://phabricator.wikimedia.org/T128968#2224641 (10CRoslof) Thanks for flagging this domain. We have internal tracking documents on the legal team, and I'll make sure nlwikipedia.org is on them. Since it seem... [18:35:32] legoktm: That logic is much further down. Once an actual message has been determined (after all fallback languages and db msg -> disk msg), then a message can be "disabled" indeed. [18:35:38] but I don't think that plays a role in sidebar [18:35:41] sidebar can't be disabled. 
[18:36:45] (03CR) 10Ottomata: [C: 032] Include process and disk alerts for hive, oozie and mysql on analytics1003 [puppet] - 10https://gerrit.wikimedia.org/r/284500 (https://phabricator.wikimedia.org/T133182) (owner: 10Ottomata) [18:39:21] 06Operations, 10Analytics, 10DNS, 10Traffic: Create analytics.wikimedia.org - https://phabricator.wikimedia.org/T132407#2197243 (10madhuvishy) I want to explain the current setup in labs a little bit and point out that it won't work as is in prod - and needs some re-working. The idea is we have a single in... [18:45:49] (03PS2) 10BBlack: Add HSTS=1y to several one-off public services [puppet] - 10https://gerrit.wikimedia.org/r/284502 (https://phabricator.wikimedia.org/T132521) [18:45:51] (03PS1) 10BBlack: apt|ubuntu|mirrors: template standard HSTS=1y settings [puppet] - 10https://gerrit.wikimedia.org/r/284504 [18:46:59] 06Operations: create a mailing list for maint-announce mail - https://phabricator.wikimedia.org/T132968#2224733 (10Dzahn) Tested. Sent a mail from external non-wmf address, it got moderated, not bounced or accepted (good). The list owner got the "awaiting approval" message. Since the owner is noc@ we got it at... 
[18:51:51] 06Operations, 10ops-eqiad: db1067 degraded RAID - https://phabricator.wikimedia.org/T130517#2224802 (10Cmjohnson) return shipment information USPS 9202 3946 5301 2431 4675 24 [18:54:04] (03PS2) 10Dzahn: install_server: switch install2001 to jessie [puppet] - 10https://gerrit.wikimedia.org/r/284387 (https://phabricator.wikimedia.org/T132757) [18:54:53] (03Abandoned) 10BBlack: ssl_ciphersuite: use "always" for nginx STS header [puppet] - 10https://gerrit.wikimedia.org/r/284501 (owner: 10BBlack) [18:55:10] (03PS1) 10Ottomata: Make hive-metastore service depend on libmysql-jar in classpath [puppet/cdh] - 10https://gerrit.wikimedia.org/r/284506 (https://phabricator.wikimedia.org/T133198) [18:55:43] (03PS1) 10BBlack: ssl_ciphersuite: detect os_version, use always for nginx+jessie [puppet] - 10https://gerrit.wikimedia.org/r/284507 [18:58:09] (03CR) 10Dzahn: [C: 032] "just preparing this. not actually installing before DCs have been switched back" [puppet] - 10https://gerrit.wikimedia.org/r/284387 (https://phabricator.wikimedia.org/T132757) (owner: 10Dzahn) [19:00:28] (03CR) 10BBlack: [C: 032] "compiler-checked!" [puppet] - 10https://gerrit.wikimedia.org/r/284504 (owner: 10BBlack) [19:00:38] (03PS2) 10BBlack: apt|ubuntu|mirrors: template standard HSTS=1y settings [puppet] - 10https://gerrit.wikimedia.org/r/284504 [19:01:32] (03CR) 10BBlack: [V: 032] apt|ubuntu|mirrors: template standard HSTS=1y settings [puppet] - 10https://gerrit.wikimedia.org/r/284504 (owner: 10BBlack) [19:02:59] 06Operations, 06Discovery, 10Maps, 10hardware-requests: Maps back end hardware - https://phabricator.wikimedia.org/T131180#2224900 (10RobH) 05Open>03Resolved a:03RobH Orders for 4 maps backend hosts for codfw and 4 maps backend hosts for eqiad have been ordered. Their restricted view #procurement ta... 
[19:03:47] 06Operations, 10Analytics, 10DNS, 10Traffic: Create analytics.wikimedia.org - https://phabricator.wikimedia.org/T132407#2224904 (10Ottomata) > these dashboards would have to be served at analytics.wikimedia.org/browser-reports, analytics.wikimedia.org/edit-reports etc. And the apache setup will have to cha... [19:03:56] !log Cleared out 'enqueue' job queues to see if corruption comes back [19:04:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:04:16] (03PS3) 10BBlack: Add HSTS=1y to several one-off public services [puppet] - 10https://gerrit.wikimedia.org/r/284502 (https://phabricator.wikimedia.org/T132521) [19:04:19] 06Operations, 10Analytics, 10DNS, 10Traffic: Create analytics.wikimedia.org - https://phabricator.wikimedia.org/T132407#2224906 (10Ottomata) > and use scap3 Actually, I don't think we can use scap3 if they aren't in git, since scap3 deploys via git. [19:04:44] (03CR) 10BBlack: [C: 032 V: 032] Add HSTS=1y to several one-off public services [puppet] - 10https://gerrit.wikimedia.org/r/284502 (https://phabricator.wikimedia.org/T132521) (owner: 10BBlack) [19:27:42] 06Operations, 10ops-eqiad, 06DC-Ops: eqiad: Rack/Setup 6 new pool servers - https://phabricator.wikimedia.org/T132684#2224957 (10Cmjohnson) Racked in the following locations a4 U27 wmf4746 4LF9FB2 4/0/26 B4 U15 WMF4747 4LF7FB2 4/0/34 C4 U15 WMF4748 4LF8FB2 4/0/31 C4 U16 WMF4749 4LFDFB2 0/32 D3 U32 WMF... 
[19:29:07] (03PS4) 10Dzahn: add install1001.wikimedia.org using 208.80.154.83 [dns] - 10https://gerrit.wikimedia.org/r/284388 (https://phabricator.wikimedia.org/T132757) [19:30:11] 06Operations, 10ops-eqiad: db1046.eqiad.wmnet: slot=3 failed - https://phabricator.wikimedia.org/T132917#2224961 (10Cmjohnson) Online and spun up...resolving task [19:31:27] (03PS5) 10Dzahn: add install1001.wikimedia.org using 208.80.154.83 [dns] - 10https://gerrit.wikimedia.org/r/284388 (https://phabricator.wikimedia.org/T132757) [19:31:51] (03PS6) 10Dzahn: add install1001.wikimedia.org using 208.80.154.83 [dns] - 10https://gerrit.wikimedia.org/r/284388 (https://phabricator.wikimedia.org/T132757) [19:34:59] 06Operations, 05Continuous-Integration-Scaling, 07HHVM: Provide a HHVM package for jessie-wikimedia matching version of trusty-wikimedia - https://phabricator.wikimedia.org/T125821#2224967 (10hashar) 05Open>03stalled Whoa thank you @Joe ! I am not sure what ICU involves when it comes to testing but it se... [19:35:04] 06Operations, 05Continuous-Integration-Scaling, 07HHVM: Provide a HHVM package for jessie-wikimedia matching version of trusty-wikimedia - https://phabricator.wikimedia.org/T125821#2224970 (10hashar) [19:35:06] 06Operations, 07HHVM, 07user-notice: Switch HAT appservers to trusty's ICU (or newer) - https://phabricator.wikimedia.org/T86096#2224971 (10hashar) [19:37:42] (03CR) 10RobH: [C: 031] add install1001.wikimedia.org using 208.80.154.83 [dns] - 10https://gerrit.wikimedia.org/r/284388 (https://phabricator.wikimedia.org/T132757) (owner: 10Dzahn) [19:42:06] (03PS2) 10BBlack: ssl_ciphersuite: detect os_version, use always for nginx+jessie [puppet] - 10https://gerrit.wikimedia.org/r/284507 [19:43:10] 06Operations, 10DBA, 13Patch-For-Review: es2019 crashed with no logs - https://phabricator.wikimedia.org/T130702#2225023 (10Volans) 05Open>03Resolved es2019 has been repooled after all executed checks passed. 
[19:44:10] (03PS3) 10BBlack: ssl_ciphersuite: detect os_version, use always for nginx+jessie [puppet] - 10https://gerrit.wikimedia.org/r/284507 [19:44:35] (03CR) 10BBlack: [C: 032 V: 032] "surprisingly, this technique actually works (in the compiler, anyways)" [puppet] - 10https://gerrit.wikimedia.org/r/284507 (owner: 10BBlack) [19:46:07] (03PS1) 10Volans: switchover: switch (s1-s7, x1) master role to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/284514 (https://phabricator.wikimedia.org/T133205) [19:47:15] (03CR) 10Dzahn: [C: 032] add install1001.wikimedia.org using 208.80.154.83 [dns] - 10https://gerrit.wikimedia.org/r/284388 (https://phabricator.wikimedia.org/T132757) (owner: 10Dzahn) [19:49:39] (03PS1) 10Mattflaschen: Use $wgNotifyTypeAvailabilityByNotificationType due to Echo change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284515 (https://phabricator.wikimedia.org/T132820) [19:51:51] 06Operations, 10ops-codfw, 06DC-Ops: db2018 failed disk (degraded RAID) - https://phabricator.wikimedia.org/T128057#2225089 (10RobH) a:05RobH>03None I don't think this should be assigned to me, since I already ordered disks... 
[19:51:51] PROBLEM - puppet last run on db2065 is CRITICAL: CRITICAL: puppet fail [19:52:48] 06Operations, 10ops-codfw, 06DC-Ops: db2018 failed disk (degraded RAID) - https://phabricator.wikimedia.org/T128057#2225094 (10Volans) p:05Unbreak!>03Normal [19:57:07] (03PS1) 10Jcrespo: Set db1031 as the local eqiad master and set it to ROW binlog [puppet] - 10https://gerrit.wikimedia.org/r/284516 (https://phabricator.wikimedia.org/T120122) [19:58:40] (03CR) 10Volans: [C: 031] Set db1031 as the local eqiad master and set it to ROW binlog [puppet] - 10https://gerrit.wikimedia.org/r/284516 (https://phabricator.wikimedia.org/T120122) (owner: 10Jcrespo) [20:00:05] (03CR) 10Jcrespo: [C: 032] Set db1031 as the local eqiad master and set it to ROW binlog [puppet] - 10https://gerrit.wikimedia.org/r/284516 (https://phabricator.wikimedia.org/T120122) (owner: 10Jcrespo) [20:00:40] 06Operations: create new RT queue for managers with access to all except procurement - https://phabricator.wikimedia.org/T81855#2225151 (10Dzahn) [20:04:26] (03PS1) 10BBlack: ssl_ciphersuite: autodetect apache, too [puppet] - 10https://gerrit.wikimedia.org/r/284518 [20:05:34] 06Operations: RT - (re-)enable password reset feature - https://phabricator.wikimedia.org/T80320#2225161 (10Dzahn) [20:06:02] (03PS2) 10BBlack: ssl_ciphersuite: autodetect apache, too [puppet] - 10https://gerrit.wikimedia.org/r/284518 [20:07:29] (03PS2) 10Volans: switchover: switch (s1-s7, x1) master role to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/284514 (https://phabricator.wikimedia.org/T133205) [20:09:22] 06Operations, 10Beta-Cluster-Infrastructure, 10Deployment-Systems, 03Scap3: Automate the generation deployment keys (keyholder-managed ssh keys) - https://phabricator.wikimedia.org/T133211#2225168 (10mmodell) [20:11:22] (03PS1) 10Andrew Bogott: Correct the vlan used for nova-network on labtest. 
[puppet] - 10https://gerrit.wikimedia.org/r/284520 [20:11:34] chasemp: ^ [20:12:09] (03CR) 10Andrew Bogott: [C: 032] Correct the vlan used for nova-network on labtest. [puppet] - 10https://gerrit.wikimedia.org/r/284520 (owner: 10Andrew Bogott) [20:12:23] (03CR) 10Rush: [C: 031] "I think so yes :)" [puppet] - 10https://gerrit.wikimedia.org/r/284520 (owner: 10Andrew Bogott) [20:17:37] RECOVERY - puppet last run on db2065 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:33:08] PROBLEM - aqs endpoints health on aqs1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:35:08] RECOVERY - aqs endpoints health on aqs1001 is OK: All endpoints are healthy [20:37:34] 06Operations, 05WMF-NDA: Migrate RT to Phabricator - https://phabricator.wikimedia.org/T38#2225253 (10Dzahn) I think this ticket is technically not done. Reason: we can't shutdown RT because procurement tickets have not been imported and Robh needs them. [20:38:51] 06Operations, 13Patch-For-Review: move RT off of magnesium - https://phabricator.wikimedia.org/T119112#2225259 (10RobH) Please do not take RT offline entirely. I need to access the RT interface for the procurement project in RT. The searches are often using advanced query strings in RT to find specific vendo... [20:42:06] 06Operations, 13Patch-For-Review: move RT off of magnesium - https://phabricator.wikimedia.org/T119112#2225278 (10RobH) When I do my searches for historic info, my criteria tend to be the following: creation date, resolution date, file attachment present, file attachment filenames, email addresses involved, su... [20:42:28] 06Operations, 05WMF-NDA: Migrate RT to Phabricator - https://phabricator.wikimedia.org/T38#2225279 (10Dzahn) 05Resolved>03Open Now that Spaces exist, can we import those tickets? 
[20:45:35] 06Operations, 05WMF-NDA: Migrate RT to Phabricator - https://phabricator.wikimedia.org/T38#552 (10RobH) So as @dzahn pointed out, I need to regularly search and read the old procurement queue tickets in RT. These tickets also have file attachments that we need to parse and review for historical purchasing rec... [20:46:12] (03PS3) 10BBlack: ssl_ciphersuite: autodetect apache, too [puppet] - 10https://gerrit.wikimedia.org/r/284518 [20:55:17] PROBLEM - check_puppetrun on payments1003 is CRITICAL: CRITICAL: Puppet has 2 failures [20:58:36] ^^^ I'm aware of the puppetrun issue on payments* -- it's stalling on a package install, looks like the repo is just being super slow to respond [20:59:15] (03PS1) 10Rush: lshell scaffolding for restricting Labs users [puppet] - 10https://gerrit.wikimedia.org/r/284530 (https://phabricator.wikimedia.org/T102395) [20:59:54] 06Operations, 10Traffic: Fix apache-2.4 + DHE ciphersuites issue - https://phabricator.wikimedia.org/T133217#2225358 (10BBlack) [21:00:17] PROBLEM - check_puppetrun on payments1002 is CRITICAL: CRITICAL: Puppet has 2 failures [21:00:17] PROBLEM - check_puppetrun on payments1003 is CRITICAL: CRITICAL: puppet fail [21:00:17] PROBLEM - check_puppetrun on payments1004 is CRITICAL: CRITICAL: Puppet has 2 failures [21:00:18] PROBLEM - check_puppetrun on payments1001 is CRITICAL: CRITICAL: Puppet has 2 failures [21:00:26] (03CR) 10jenkins-bot: [V: 04-1] lshell scaffolding for restricting Labs users [puppet] - 10https://gerrit.wikimedia.org/r/284530 (https://phabricator.wikimedia.org/T102395) (owner: 10Rush) [21:01:49] 06Operations, 10Traffic: Fix apache-2.4 + DHE ciphersuites issue - https://phabricator.wikimedia.org/T133217#2225388 (10BBlack) p:05Triage>03Low [21:05:16] RECOVERY - check_puppetrun on payments1002 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [21:05:16] RECOVERY - check_puppetrun on payments1001 is OK: OK: Puppet is currently enabled, last run 18 seconds ago 
with 0 failures [21:05:16] RECOVERY - check_puppetrun on payments1003 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [21:05:17] RECOVERY - check_puppetrun on payments1004 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [21:05:24] (03PS2) 10Rush: lshell scaffolding for restricting Labs users [puppet] - 10https://gerrit.wikimedia.org/r/284530 (https://phabricator.wikimedia.org/T102395) [21:06:38] (03CR) 10jenkins-bot: [V: 04-1] lshell scaffolding for restricting Labs users [puppet] - 10https://gerrit.wikimedia.org/r/284530 (https://phabricator.wikimedia.org/T102395) (owner: 10Rush) [21:10:11] 06Operations, 10MediaWiki-Cache, 10Wikimedia-General-or-Unknown, 13Patch-For-Review, 05codfw-rollout: Wrong sidebar cached on sites - https://phabricator.wikimedia.org/T133069#2225437 (10BBlack) >>! In T133069#2224135, @Krinkle wrote: > [...]* wfMessage -> Message::fetchMessage -> MessageCache::get -> Me... [21:12:03] (03PS3) 10Rush: lshell scaffolding for restricting Labs users [puppet] - 10https://gerrit.wikimedia.org/r/284530 (https://phabricator.wikimedia.org/T102395) [21:13:35] (03PS1) 10Jdlrobson: Split mobile text cache for lazy loaded references testing [puppet] - 10https://gerrit.wikimedia.org/r/284576 [21:13:37] (03PS1) 10Jdlrobson: Remove NetSpeedB cache splitting [puppet] - 10https://gerrit.wikimedia.org/r/284577 [21:17:10] (03CR) 10jenkins-bot: [V: 04-1] lshell scaffolding for restricting Labs users [puppet] - 10https://gerrit.wikimedia.org/r/284530 (https://phabricator.wikimedia.org/T102395) (owner: 10Rush) [21:18:15] (03PS2) 10Jdlrobson: Split mobile text cache for lazy loaded references testing [puppet] - 10https://gerrit.wikimedia.org/r/284576 [21:21:46] 06Operations, 13Patch-For-Review: rack and set up graphite1003 - https://phabricator.wikimedia.org/T132717#2225469 (10RobH) [21:26:54] (03PS4) 10Rush: lshell scaffolding for restricting Labs users [puppet] - 10https://gerrit.wikimedia.org/r/284530 
(https://phabricator.wikimedia.org/T102395) [21:29:05] (03CR) 10jenkins-bot: [V: 04-1] lshell scaffolding for restricting Labs users [puppet] - 10https://gerrit.wikimedia.org/r/284530 (https://phabricator.wikimedia.org/T102395) (owner: 10Rush) [21:29:58] 06Operations, 10ops-eqiad, 10DBA: db1070, db1071 and db1065 overheating problems - https://phabricator.wikimedia.org/T132515#2225492 (10Volans) All three hosts were shutdown and thanks to @Cmjohnson got the thermal paste applied. We'll monitor after the switchback. Regarding the failing DIMM will you create... [21:30:46] (03PS5) 10Rush: lshell scaffolding for restricting Labs users [puppet] - 10https://gerrit.wikimedia.org/r/284530 (https://phabricator.wikimedia.org/T102395) [21:32:21] (03PS2) 10Mattflaschen: Use notify-type-availability due to Echo change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284515 (https://phabricator.wikimedia.org/T132820) [21:45:28] 06Operations: put lint:ignore documention on wikitech - https://phabricator.wikimedia.org/T133222#2225527 (10Dzahn) [21:45:38] 06Operations: put lint:ignore documention on wikitech - https://phabricator.wikimedia.org/T133222#2225540 (10Dzahn) p:05Triage>03Low [21:46:38] (03CR) 10Rush: [C: 032] lshell scaffolding for restricting Labs users [puppet] - 10https://gerrit.wikimedia.org/r/284530 (https://phabricator.wikimedia.org/T102395) (owner: 10Rush) [21:57:11] Are there still entries missing from the recent changes? [21:57:31] https://test2.wikipedia.org/w/index.php?title=Special:RecentChanges&days=30&from= <-- should have more entries than those from 20 April [22:03:58] PROBLEM - puppet last run on mw1246 is CRITICAL: CRITICAL: Puppet has 1 failures [22:04:00] (03PS1) 10Dzahn: add install1001 to site.pp and DHCP [puppet] - 10https://gerrit.wikimedia.org/r/284600 (https://phabricator.wikimedia.org/T132757) [22:12:57] Luke081515: shouldn't be, unless we forgot to run the script on test2wiki [22:13:15] what entries are missing? 
[22:13:54] MatmaRex: All before 20 April [22:16:50] 07Blocked-on-Operations, 06Operations, 10ops-eqiad: check ganeti1001-1006 for lff to sff adapters - https://phabricator.wikimedia.org/T133224#2225623 (10RobH) [22:17:17] 07Blocked-on-Operations, 06Operations, 10ops-eqiad: check ganeti1001-1006 for lff to sff adapters - https://phabricator.wikimedia.org/T133224#2225623 (10RobH) Please note these systems are online, and cannot simply be offlined to check. This is more for Chris to confirm what I suspect about adding SSDs. [22:17:47] huh. interesting. [22:18:00] oldest timestamp is 20160419150324 [22:18:17] !log creating ganeti VM install1001 on eqiad cluster [22:18:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:19:47] Luke081515: file a bug with #operations and #codfw-rollout. maybe it's normal and somebody was experimenting with the wiki. but it's definitely very weird. [22:19:55] ok [22:20:07] let me quickly check the db [22:23:10] jynus: Create now, or let you first finish your check? :) [22:23:54] create anyway [22:24:58] what tz does test2wiki have? [22:25:29] forget it, I am logged in [22:25:36] it has no rcs [22:26:29] same state 24 hours ago [22:27:50] jynus: What do you mean by "it has no rcs"? Are they missing from the DB? 
[22:27:57] 06Operations, 05codfw-rollout: test2wiki has no recent changes before the 20 april - https://phabricator.wikimedia.org/T133225#2225647 (10Luke081515) [22:28:29] yeah, not more than the ones shown [22:30:27] RECOVERY - puppet last run on mw1246 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [22:31:56] (03PS2) 10Dzahn: add install1001 to site.pp and DHCP [puppet] - 10https://gerrit.wikimedia.org/r/284600 (https://phabricator.wikimedia.org/T132757) [22:33:44] (03CR) 10Dzahn: [C: 032] add install1001 to site.pp and DHCP [puppet] - 10https://gerrit.wikimedia.org/r/284600 (https://phabricator.wikimedia.org/T132757) (owner: 10Dzahn) [22:34:25] but there were 1 week ago [22:37:27] hm [22:38:05] 06Operations, 10Analytics-Cluster, 06Analytics-Kanban, 10netops, 13Patch-For-Review: setup/deploy server analytics1003/WMF4541 - https://phabricator.wikimedia.org/T130840#2225679 (10RobH) [22:38:20] (03PS1) 10Dzahn: netboot: add install1001 to have RAID1 like install2001 [puppet] - 10https://gerrit.wikimedia.org/r/284603 (https://phabricator.wikimedia.org/T132757) [22:39:58] (03PS2) 10Dzahn: netboot: add install1001, use partman for VMs [puppet] - 10https://gerrit.wikimedia.org/r/284603 (https://phabricator.wikimedia.org/T132757) [22:41:00] (03CR) 10Dzahn: [C: 032] netboot: add install1001, use partman for VMs [puppet] - 10https://gerrit.wikimedia.org/r/284603 (https://phabricator.wikimedia.org/T132757) (owner: 10Dzahn) [22:51:28] PROBLEM - Disk space on serpens is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:51:28] PROBLEM - Labs LDAP on serpens is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:51:36] PROBLEM - puppet last run on serpens is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:51:37] PROBLEM - dhclient process on serpens is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[22:52:06] PROBLEM - Check size of conntrack table on serpens is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:52:26] PROBLEM - DPKG on serpens is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:52:46] PROBLEM - RAID on serpens is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:52:47] PROBLEM - salt-minion processes on serpens is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:53:16] PROBLEM - configured eth on serpens is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:57:06] (03PS1) 10Luke081515: Typo fix at the warning for the inactive deployment server [puppet] - 10https://gerrit.wikimedia.org/r/284605 [22:57:31] can someone review this patch? It's just a minor fix [22:59:40] (03CR) 10BryanDavis: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/284605 (owner: 10Luke081515) [23:15:17] PROBLEM - puppet last run on mw1081 is CRITICAL: CRITICAL: Puppet has 1 failures [23:27:06] PROBLEM - SSH on serpens is CRITICAL: Server answer [23:41:36] RECOVERY - puppet last run on mw1081 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [23:53:26] PROBLEM - puppet last run on seaborgium is CRITICAL: CRITICAL: Puppet has 4 failures