[00:00:05] RoanKattouw ostriches Krenair: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160302T0000).
[00:00:05] tgr James_F: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process.
[00:00:36] hey
[00:01:02] * James_F waves.
[00:01:12] (CR) EBernhardson: [C: 1] "good for swat." [mediawiki-config] - https://gerrit.wikimedia.org/r/272963 (https://phabricator.wikimedia.org/T127943) (owner: DCausse)
[00:01:18] (CR) Alex Monk: [C: 2] Removing Gather from enwiki and miscellaneous cosmetic changes [mediawiki-config] - https://gerrit.wikimedia.org/r/271932 (https://phabricator.wikimedia.org/T127509) (owner: MarcoAurelio)
[00:01:23] o/
[00:01:57] (Merged) jenkins-bot: Removing Gather from enwiki and miscellaneous cosmetic changes [mediawiki-config] - https://gerrit.wikimedia.org/r/271932 (https://phabricator.wikimedia.org/T127509) (owner: MarcoAurelio)
[00:02:14] (CR) Tim Landscheidt: "Currently, the proxy logs requests to /var/log/nginx/access.log with the combined format as the nginx default.
If you use one access_log " [puppet] - https://gerrit.wikimedia.org/r/274161 (https://phabricator.wikimedia.org/T128409) (owner: Dzahn)
[00:04:42] !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/271932/ - disable Gather on enwiki (duration: 01m 26s)
[00:04:45] tgr, ^
[00:04:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:05:38] Krenair: it's gone alright
[00:06:35] (PS3) Alex Monk: Enable VisualEditor Single Edit Tab on group0 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/274129 (owner: Jforrester)
[00:06:41] (CR) Alex Monk: [C: 2] Enable VisualEditor Single Edit Tab on group0 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/274129 (owner: Jforrester)
[00:07:28] (Merged) jenkins-bot: Enable VisualEditor Single Edit Tab on group0 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/274129 (owner: Jforrester)
[00:08:07] Krenair: InitSettings first, then the dblist.
[00:08:15] ok
[00:08:26] good idea
[00:09:37] :-)
[00:10:10] !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/274129/ - VE SET on mediawikiwiki/testwiki (duration: 01m 21s)
[00:10:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:10:17] Krenair: I’m here when you look for me o/
[00:10:42] James_F
[00:11:37] (CR) Dzahn: [C: 1] "gotcha, thanks. i was under the impression we are not logging and i would question if we should keep logging the remote addresses, but tha" [puppet] - https://gerrit.wikimedia.org/r/274161 (https://phabricator.wikimedia.org/T128409) (owner: Dzahn)
[00:12:17] Krenair: LGTM.
[00:12:19] !log krenair@tin Synchronized dblists/visualeditor-default.dblist: https://gerrit.wikimedia.org/r/#/c/274129/ - +testwiki (duration: 01m 20s)
[00:12:22] James_F, ^
[00:12:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:13:19] Krenair: LGTM.
[00:14:02] (PS5) Dzahn: dynamicproxy: custom log schema (http/https) for tools [puppet] - https://gerrit.wikimedia.org/r/274161 (https://phabricator.wikimedia.org/T128409)
[00:14:58] (CR) Alex Monk: [C: 2] Enable VisualEditor state transitioning for accounts on the German Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/272926 (https://phabricator.wikimedia.org/T127881) (owner: Jforrester)
[00:15:04] Whee.
[00:16:02] (PS2) Alex Monk: Bump portals to master. [mediawiki-config] - https://gerrit.wikimedia.org/r/274316 (https://phabricator.wikimedia.org/T128522) (owner: JGirault)
[00:17:51] Operations, Labs, Labs-Infrastructure: Estimate hardware requirements for relevance lab elasticsearch servers - https://phabricator.wikimedia.org/T128433#2078294 (EBernhardson) Sounds reasonable. I will put the ask for 2x nobelium level hardware in the strategic goals portion of discovery budget with a...
[00:20:51] (Abandoned) Dzahn: osm/maps/postgres: move tuning.conf out of /files/ [puppet] - https://gerrit.wikimedia.org/r/249056 (owner: Dzahn)
[00:21:42] What's up with jenkins this time?
[00:21:48] everything queued?
[00:22:05] Krinkle, any idea?
[00:22:36] jenkins-brb
[00:22:41] * Krinkle doens't know
[00:22:49] There is probably docs on mediawiki.org on wikitech
[00:23:03] Unless it's stil broken tomorrow morning, I'm not fixing it this time
[00:23:11] it's not just mediawiki-config broken this time
[00:23:34] (PS2) Alex Monk: Enable VisualEditor state transitioning for accounts on the German Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/272926 (https://phabricator.wikimedia.org/T127881) (owner: Jforrester)
[00:23:45] (CR) Alex Monk: [V: 2] Enable VisualEditor state transitioning for accounts on the German Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/272926 (https://phabricator.wikimedia.org/T127881) (owner: Jforrester)
[00:25:15] (PS2) Dzahn: logstash: fix top-scope var w/o namespace [puppet] - https://gerrit.wikimedia.org/r/272675
[00:25:38] "$ /usr/local/bin/zuul-gearman.py status" says Gearman is working fine
[00:25:40] (CR) Dzahn: "thank you Gehel and Bryan for your comments, amending!" [puppet] - https://gerrit.wikimedia.org/r/272675 (owner: Dzahn)
[00:25:50] So it's probably Jenkins or Zuul at fault.
[00:26:00] Zuul is working fine too
[00:26:00] it worked fine in ops/puppet
[00:26:10] maybe just on mw-config?
[00:26:21] no
[00:26:26] !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/272926/ - prepare for VE default switch on dewiki (duration: 01m 17s)
[00:26:27] https://integration.wikimedia.org/zuul/ shows everything is queued
[00:26:29] James_F
[00:26:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:26:31] It's not specific to any repo
[00:26:33] Krenair: Looking.
[00:27:37] Krenair: LGTM.
[00:27:55] jgirault, hey
[00:28:05] Krenair: o/
[00:28:16] (PS3) Alex Monk: Bump portals to master.
[mediawiki-config] - https://gerrit.wikimedia.org/r/274316 (https://phabricator.wikimedia.org/T128522) (owner: JGirault)
[00:28:21] (CR) Alex Monk: [C: 2 V: 2] Bump portals to master. [mediawiki-config] - https://gerrit.wikimedia.org/r/274316 (https://phabricator.wikimedia.org/T128522) (owner: JGirault)
[00:30:23] !log krenair@tin Synchronized portals/prod/wikipedia.org/assets: https://gerrit.wikimedia.org/r/#/c/274316/ (duration: 01m 18s)
[00:30:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:31:06] jgirault, ^
[00:31:42] !log krenair@tin Synchronized portals: https://gerrit.wikimedia.org/r/#/c/274316/ (duration: 01m 18s)
[00:31:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:31:48] jgirault, ^
[00:31:55] (done now)
[00:31:56] I do hope we don't deploy without testing (other than submodule updates that are standalone such as portals)
[00:33:31] That would depend upon what sort of testing you're talking about.
[00:33:59] Krenair: when should I try? see no update so far
[00:34:34] what page are you looking at that's not updated?
[00:35:05] https://www.wikipedia.org/
[00:35:12] https://www.wiktionary.org/
[00:35:22] and adding ?whatever doesn’t seem to help
[00:35:24] wiktionary isn't on the list for some reason
[00:35:40] wikipedia is
[00:36:10] I see the wikipedia update
[00:36:34] with and without extra query string
[00:36:53] Krenair: do you? espanol should be #2
[00:37:07] Krenair: this is still the old one
[00:37:53] #2?
[00:38:10] Yes the 2 links on the top
[00:38:15] I see you moved eswiki from #5 to #3
[00:38:26] and it's reflected in the actual page I see on my machine
[00:39:15] Krenair: nono the top 2 links (on the first line) should be : English and Espanol
[00:39:21] Krenair: I still see English and Japanese
[00:39:31] i certainly get english and japanese from the office as well
[00:39:46] Krenair: see https://github.com/wikimedia/wikimedia-portals/blob/2e57bfbd83acce979a25bf213dfd9b44310e3ec2/prod/wikipedia.org/index.html#L26-L43
[00:40:06] (with an age: 0 due to using a query string to break caching)
[00:40:17] Krenair: and 2e57bfb corresponds to the right revision ( https://github.com/wikimedia/operations-mediawiki-config )
[00:40:34] Krenair: so I’m assuming something is not deployed correctly
[00:40:49] what do you see from `curl -H "Host: www.wikipedia.org" https://text-lb.esams.wikimedia.org/ -k | md5sum`?
[00:40:56] Krenair: didn’t we have similar issue last time? your process was slightly different since it was a submodule I think
[00:41:03] bah
[00:41:16] Submodule path 'portals': checked out '2e57bfbd83acce979a25bf213dfd9b44310e3ec2'
[00:42:12] but wait, I see "" in the source
[00:42:24] (PS3) Dzahn: logstash: fix top-scope var w/o namespace [puppet] - https://gerrit.wikimedia.org/r/272675
[00:42:37] (CR) Dzahn: [C: 2] "noop, tested in compiler http://puppet-compiler.wmflabs.org/1902/" [puppet] - https://gerrit.wikimedia.org/r/272675 (owner: Dzahn)
[00:42:39] right, that was the pre-submodule update change
[00:42:39] ok
[00:42:43] that's why I was confused
[00:43:10] Krenair: yes that line #3 should be:
[00:44:14] !log krenair@tin Synchronized portals/prod/wikipedia.org/assets: https://gerrit.wikimedia.org/r/#/c/274316/ - try #2, this time with the submodule update (duration: 01m 16s)
[00:44:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:45:32] !log krenair@tin Synchronized portals:
https://gerrit.wikimedia.org/r/#/c/274316/ - try #2, this time with the submodule update (duration: 01m 17s)
[00:45:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:45:53] jgirault, ^ not seeing a change though
[00:45:53] (CR) Dzahn: [V: 2] logstash: fix top-scope var w/o namespace [puppet] - https://gerrit.wikimedia.org/r/272675 (owner: Dzahn)
[00:46:02] Krenair: seeing it now
[00:46:17] Krenair: https://www.wikipedia.org/?Espanol!!!!
[00:46:37] aha
[00:46:48] so, how to get it on https://www.wikipedia.org/...
[00:46:59] reminds me, i once had a tool to create that portal HTML automatically
[00:46:59] Krenair: and https://www.wiktionary.org/?4520 (look at English number of entries)
[00:47:05] for copy/paste to wiki
[00:48:12] It seemed to work as soon as I added a / to the purge URL
[00:49:04] is zuul stuck?
[00:49:08] Krenair: verified all portals, we’re all good =]
[00:49:17] so that might be the change we need to make next time jgirault
[00:49:35] 1 - make sure Alex doesn't forget to run the submodule update command :)
[00:49:37] 2 - purge with a / ?
[00:49:56] Krenair: did you run “sync-portals” script?
[00:50:00] yes
[00:50:04] ok
[00:50:08] (CR) Dzahn: [C: -1] "you'll also have to use that variable in the actual config template somewhere" [puppet] - https://gerrit.wikimedia.org/r/274170 (https://phabricator.wikimedia.org/T128497) (owner: RobH)
[00:50:09] which does it without the /
[00:50:15] Krenair: so I’ll make the update to https://github.com/wikimedia/wikimedia-portals/blob/master/urls-to-purge.txt
[00:50:27] I don't have proof that fixes the problem
[00:50:29] to add the /
[00:50:35] The timing may have been a coincidence
[00:51:03] Krenair: we had issues with MaxSem in the past, not seeing that url purged instantly
[00:51:28] Krenair: I think it might be the reason
[00:51:49] Krenair: I will make the update and we will know next time :)
[00:52:24] ok
[00:52:56] Operations, Patch-For-Review: Configure librenms to use LDAP for authentication - https://phabricator.wikimedia.org/T107702#2078383 (Dzahn) Faidon said once on the patch (https://gerrit.wikimedia.org/r/#/c/229299/) that it didn't fit our schema but it _might_ work after the migration from opendj to openld...
[00:53:44] mutante: what config template? it seems that the config options listed there are generated into the config by puppet?
[00:54:30] and the template file in the module doesnt have entries for the other items listed there
[00:55:21] oh wait, found it
[00:55:35] robh: only options that are used in the template. for example $nicelevel is @nicelevel in the template
[00:55:49] yea seems local.cf is the template references?
[00:56:03] as it then has setting
[00:56:04] i thought spamassassin.default.erb
[00:56:08] but not sure
[00:56:16] nah, that doesnt have them thats what confused me
[00:56:31] it has $nicelevel
[00:56:40] but the other file local.cf has entries like: use_bayes <%= @use_bayes %>
[00:56:53] which lines up to the entires in the init.pp file
[00:57:42] yea, for some reason there are 2 config files
[00:58:26] looks like one is for default values and the other to "customize"
[00:58:46] hrmm
[00:58:51] so if its in local.cf i figure whitelist_from <%= @whitelist_from %>
[00:59:09] yea
[00:59:34] i'm going to append in that info with a comment and update the patch for review, thanks for looking it over!
[01:00:42] (PS6) Dzahn: dynamicproxy: custom log schema (http/https) for tools [puppet] - https://gerrit.wikimedia.org/r/274161 (https://phabricator.wikimedia.org/T128409)
[01:01:25] (PS2) RobH: whitelisting equinix domain for spam assassin [puppet] - https://gerrit.wikimedia.org/r/274170 (https://phabricator.wikimedia.org/T128497)
[01:01:30] (CR) Dzahn: [C: 2 V: 2] "will check tools-proxy-01/02" [puppet] - https://gerrit.wikimedia.org/r/274161 (https://phabricator.wikimedia.org/T128409) (owner: Dzahn)
[01:05:05] (CR) Dzahn: "merged and no puppet change on tools-proxy-01 and 02? how, why, sigh" [puppet] - https://gerrit.wikimedia.org/r/274161 (https://phabricator.wikimedia.org/T128409) (owner: Dzahn)
[01:05:54] yuvipanda: when we edit modules/dynamicproxy/templates/urlproxy.conf and merge and run puppet on tools-proxy-01 .. should we not expect a change ?>
[01:06:21] (CR) RobH: "I think it is now listed in the proper file. It is a bit confusing, since the .erb template file doesn't seem to list/reference the init."
[puppet] - https://gerrit.wikimedia.org/r/274170 (https://phabricator.wikimedia.org/T128497) (owner: RobH)
[01:17:58] Operations, Mail, Patch-For-Review: our spam assassin service inserts PROBABLE SPAM into known good emails - https://phabricator.wikimedia.org/T128497#2078484 (RobH) I've tested the output in the puppet compiler and received the following: http://puppet-compiler.wmflabs.org/1903/mx1001.wikimedia.org/...
[01:31:52] (CR) Dzahn: "nevermind. works! :)" [puppet] - https://gerrit.wikimedia.org/r/274161 (https://phabricator.wikimedia.org/T128409) (owner: Dzahn)
[01:35:27] Operations, Labs, Tool-Labs, Traffic, and 2 others: Detect tools.wmflabs.org tools which are HTTP-only - https://phabricator.wikimedia.org/T128409#2078533 (Dzahn) works now:) root@tools-proxy-01:~# tail -f /var/log/nginx/access-scheme.log shows first results
[01:40:22] Operations, Labs, Tool-Labs, Traffic, and 2 others: Detect tools.wmflabs.org tools which are HTTP-only - https://phabricator.wikimedia.org/T128409#2078538 (Dzahn) here's a first list of tools using http, status 200 ``` add-information admin anagrimes anomiebot anomiebot HTTP ~apper apple-touch-ic...
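[editor's note] The T128409 thread above ends with a list of tools that were seen over plain HTTP with status 200, taken from the new /var/log/nginx/access-scheme.log. A minimal sketch of how such a list could be derived follows. The log layout is an assumption: the log never shows the actual format, so the sketch pretends each line is "<scheme> <path> <status>", and the function name `http_only_tools` is hypothetical.

```python
from collections import defaultdict


def http_only_tools(lines):
    """Return tools seen with status 200 over http but never over https.

    ASSUMPTION: every log line is "<scheme> <path> <status>"; the real
    access-scheme.log format is not shown in the channel log.
    """
    schemes_by_tool = defaultdict(set)
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip lines that do not match the assumed layout
        scheme, path, status = parts
        if status != "200":
            continue  # the list in the ticket only counted status 200
        # Treat the first path segment as the tool name.
        tool = path.lstrip("/").split("/", 1)[0]
        if tool:
            schemes_by_tool[tool].add(scheme)
    return sorted(t for t, seen in schemes_by_tool.items() if seen == {"http"})


sample = [
    "http /admin/ 200",
    "https /admin/ 200",
    "http /anagrimes/index.php 200",
    "http /anomiebot/ 404",
    "garbage",
]
print(http_only_tools(sample))
```

A tool that also shows up over https is dropped from the result, which is stricter than the raw "tools using http" list posted in the ticket.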
[01:44:32] PROBLEM - MariaDB Slave Lag: m3 on db1048 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1197.00 seconds
[01:46:22] RECOVERY - MariaDB Slave Lag: m3 on db1048 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[02:29:30] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.14) (duration: 12m 32s)
[02:29:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:55:25] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.15) (duration: 09m 31s)
[02:55:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:04:14] !log l10nupdate@tin ResourceLoader cache refresh completed at Wed Mar 2 03:04:14 UTC 2016 (duration 8m 49s)
[03:04:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[06:16:06] PROBLEM - RAID on serpens is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:16:27] PROBLEM - configured eth on serpens is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:16:46] PROBLEM - dhclient process on serpens is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:16:56] PROBLEM - puppet last run on serpens is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:16:57] PROBLEM - Check size of conntrack table on serpens is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:17:07] PROBLEM - Disk space on serpens is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:17:07] PROBLEM - salt-minion processes on serpens is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:17:07] PROBLEM - Labs LDAP on serpens is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:17:07] PROBLEM - DPKG on serpens is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:24:26] PROBLEM - Auth DNS on labs-ns0.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call
[06:26:06] RECOVERY - Auth DNS on labs-ns0.wikimedia.org is OK: DNS OK: 0.062 seconds response time.
nagiostest.beta.wmflabs.org returns 208.80.155.135
[06:30:16] PROBLEM - puppet last run on mw2207 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:16] PROBLEM - puppet last run on cp2013 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:30:37] PROBLEM - puppet last run on mw2158 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:16] PROBLEM - puppet last run on mw1158 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:37] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:31:57] PROBLEM - puppet last run on cp3048 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:27] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:57] PROBLEM - puppet last run on mw2018 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:27] PROBLEM - Auth DNS on labs-ns0.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call
[06:36:47] RECOVERY - Auth DNS on labs-ns0.wikimedia.org is OK: DNS OK: 0.098 seconds response time. nagiostest.beta.wmflabs.org returns 208.80.155.135
[06:39:18] Operations, Puppet, Patch-For-Review: Import vs autoload: the puppet parser is a bad joke that stopped being funny years ago. - https://phabricator.wikimedia.org/T119042#2078861 (Dzahn) mediawiki roles moved https://gerrit.wikimedia.org/r/#/c/256574/ ci roles moved https://gerrit.wikimedia.org/r/#/c/2...
[06:41:28] <_joe_> I'm looking at serpens
[06:42:16] PROBLEM - Auth DNS on labs-ns0.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call
[06:43:10] Operations, Ops-Access-Requests: Requesting analytics-privatedata-users access for catrope - https://phabricator.wikimedia.org/T128557#2078862 (Catrope)
[06:43:56] RECOVERY - Auth DNS on labs-ns0.wikimedia.org is OK: DNS OK: 0.081 seconds response time. nagiostest.beta.wmflabs.org returns 208.80.155.135
[06:44:48] _joe_: it's that issue that only happens with ganeti VMs sometimes
[06:44:59] everything times out..
then you ssh to it.. and bam.. recovery
[06:45:07] basically by looking at it
[06:45:11] schroedingers vm
[06:45:21] <_joe_> !log rebooting serpens
[06:45:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[06:45:42] <_joe_> mutante: tried, didn't work
[06:45:49] ugh, ok
[06:46:36] RECOVERY - configured eth on serpens is OK: OK - interfaces up
[06:46:48] RECOVERY - dhclient process on serpens is OK: PROCS OK: 0 processes with command name dhclient
[06:46:58] RECOVERY - puppet last run on serpens is OK: OK: Puppet is currently enabled, last run 58 minutes ago with 0 failures
[06:47:07] RECOVERY - Check size of conntrack table on serpens is OK: OK: nf_conntrack is 0 % full
[06:47:16] RECOVERY - Disk space on serpens is OK: DISK OK
[06:47:16] RECOVERY - salt-minion processes on serpens is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[06:47:17] RECOVERY - DPKG on serpens is OK: All packages OK
[06:47:17] RECOVERY - Labs LDAP on serpens is OK: LDAP OK - 0.111 seconds response time
[06:47:57] RECOVERY - RAID on serpens is OK: OK: no RAID installed
[06:48:28] i tried a random one of those appservers, mw1119, ran puppet.
the failures were not real as expected
[06:49:16] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[06:55:18] <_joe_> mutante: yeah it's the usual logrotate puppet bug
[06:55:26] <_joe_> you're not usually around at this time :P
[06:56:47] RECOVERY - puppet last run on mw2207 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures
[06:56:47] RECOVERY - puppet last run on cp2013 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[06:56:48] eh, yea, i will change that :) see you later
[06:56:57] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures
[06:57:07] RECOVERY - puppet last run on mw2158 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:37] RECOVERY - puppet last run on mw2018 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures
[06:57:38] RECOVERY - puppet last run on mw1158 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:07] PROBLEM - Auth DNS on labs-ns0.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call
[06:58:27] RECOVERY - puppet last run on cp3048 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:59:46] RECOVERY - Auth DNS on labs-ns0.wikimedia.org is OK: DNS OK: 0.029 seconds response time. nagiostest.beta.wmflabs.org returns 208.80.155.135
[07:00:06] <_joe_> I have no idea what's going on with the labs dns
[07:00:29] <_joe_> I am not going to work on it though, I'm out for a few
[07:00:37] me neither, but i do know that labs people have been working on it all the time
[07:10:06] anyone with access to a wikimedia fishbowl wiki around?
[07:10:35] <_joe_> jayvdb: I didn't even know there was one
[07:11:13] e.g. a staff wiki ..?
[07:11:19] <_joe_> jayvdb: I have [07:33:37] (03PS1) 10Legoktm: Disable $wgReferrerPolicy on private wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/274344 [07:33:47] (03CR) 10Legoktm: [C: 032] Disable $wgReferrerPolicy on private wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/274344 (owner: 10Legoktm) [07:34:20] (03Merged) 10jenkins-bot: Disable $wgReferrerPolicy on private wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/274344 (owner: 10Legoktm) [07:35:58] !log legoktm@tin Synchronized wmf-config/InitialiseSettings.php: Disable $wgReferrerPolicy on private wikis (duration: 01m 01s) [07:36:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:36:39] jayvdb: ^ [07:43:25] (03PS1) 10Catrope: Disable useless Echo eventlogging schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/274345 [07:48:18] 6Operations: Reinstall redis servers (Job queues) with Jessie (NOTE: rdb1002 is special and is excluded!) - https://phabricator.wikimedia.org/T123675#2078909 (10elukey) Slave summary: rdb1008 slaveof rdb1007 rdb1004 slaveof rdb1003 rdb1006 slaveof rdb1005 rdb1002 slaveof rdb1001 (but 1002 will not be touched i... [07:52:18] 6Operations, 10Traffic, 10Wikipedia-Store, 7HTTPS: shop.wikimedia.org should be HTTPS only - https://phabricator.wikimedia.org/T39790#417984 (10Chmarkine) [07:52:20] 6Operations, 10Traffic, 10Wikipedia-Store, 7HTTPS: https://store.wikimedia.org doesn't set HSTS header - https://phabricator.wikimedia.org/T128559#2078930 (10Chmarkine) [07:55:32] 6Operations: Reinstall redis servers (Job queues) with Jessie (NOTE: rdb1002 is special and is excluded!) - https://phabricator.wikimedia.org/T123675#2078933 (10elukey) MediaWiki Job queue config: https://github.com/wikimedia/operations-mediawiki-config/blob/master/wmf-config/jobqueue-eqiad.php All the slaves... 
[08:06:23] Operations, Mail: [URGENT] New email address receiving bounceback - https://phabricator.wikimedia.org/T128485#2078963 (MoritzMuehlenhoff) a:MoritzMuehlenhoff
[08:07:10] Operations: Reinstall redis servers (Job queues) with Jessie (NOTE: rdb1002 is special and is excluded!) - https://phabricator.wikimedia.org/T123675#1935461 (Joe) @elukey let's start with the following: # rdb1004 (precise) # rdb1006 (trusty) # rdb1003 # rdb1005 after we've confirmed the replication to 1006...
[08:08:29] Operations, Mail: [URGENT] New email address receiving bounceback - https://phabricator.wikimedia.org/T128485#2078985 (MoritzMuehlenhoff) Open>Resolved This seems have been a transient thing. I sent test mails to that address with my private mail server and my wikimedia account and both worked fine,...
[08:10:13] Operations, Ops-Access-Requests: Requesting analytics-privatedata-users access for catrope - https://phabricator.wikimedia.org/T128557#2079002 (MoritzMuehlenhoff) a:MoritzMuehlenhoff
[08:14:12] (PS8) Giuseppe Lavagetto: role::memcached: add cross-dc IPsec for the various shards [puppet] - https://gerrit.wikimedia.org/r/271260 (https://phabricator.wikimedia.org/T126470)
[08:14:37] (CR) Giuseppe Lavagetto: "All mc* hosts have been converted to jessie" [puppet] - https://gerrit.wikimedia.org/r/271260 (https://phabricator.wikimedia.org/T126470) (owner: Giuseppe Lavagetto)
[08:15:35] Operations: ntp restart sometimes unrealiable - https://phabricator.wikimedia.org/T126733#2079006 (MoritzMuehlenhoff) a:MoritzMuehlenhoff
[08:17:34] <_joe_> !log disabling puppet on all memcached hosts in preparation for enabling ipsec
[08:17:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:19:34] (CR) Giuseppe Lavagetto: [C: 2] role::memcached: add cross-dc IPsec for the various shards [puppet] - https://gerrit.wikimedia.org/r/271260 (https://phabricator.wikimedia.org/T126470) (owner: Giuseppe
Lavagetto)
[08:25:30] (CR) Muehlenhoff: [C: 2 V: 2] Imported Upstream version 1.0.2g [debs/openssl] - https://gerrit.wikimedia.org/r/274127 (owner: Muehlenhoff)
[08:25:52] (CR) Muehlenhoff: [C: 2 V: 2] * Update to 1.0.2g * Drop handle-ssl-shutdown-while-in-init-more-appropriately-v2.patch (part of new upstream release) [debs/openssl] - https://gerrit.wikimedia.org/r/274128 (owner: Muehlenhoff)
[08:31:25] mutante: still around?
[08:33:32] !log installing Django security updates
[08:33:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:36:58] Operations, Patch-For-Review, codfw-rollout, codfw-rollout-Jan-Mar-2016: Enable Redis cross-dc replication - https://phabricator.wikimedia.org/T126470#2079043 (Joe) p:Normal>High
[08:39:30] Operations, Patch-For-Review, codfw-rollout, codfw-rollout-Jan-Mar-2016: Enable Redis cross-dc replication - https://phabricator.wikimedia.org/T126470#2079058 (Joe) IPsec is now enabled and working. I am going to enable replication now.
[08:48:30] !log elastic1003.eqiad.wmnet: upgrading to 1.7.5, shipping logs to logstash (T122697, T109101)
[08:48:31] T109101: Currently elasticsearch logs do not leave nodes. We use logstash for this across the cluster generally. - https://phabricator.wikimedia.org/T109101
[08:48:32] T122697: Upgrade ElasticSearch to 1.7.5 - https://phabricator.wikimedia.org/T122697
[08:48:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:54:19] (PS1) Mobrovac: MobileApps: Put in place proper request templates for RB and MW [puppet] - https://gerrit.wikimedia.org/r/274348 (https://phabricator.wikimedia.org/T113542)
[09:09:34] (PS1) Elukey: Add Debian Jessie PXE support for rdb servers (MW Job Queues).
[puppet] - https://gerrit.wikimedia.org/r/274350 (https://phabricator.wikimedia.org/T123675)
[09:09:52] (PS1) Volans: Depooled codfw external storage for migration [mediawiki-config] - https://gerrit.wikimedia.org/r/274351 (https://phabricator.wikimedia.org/T127330)
[09:10:43] (PS2) Elukey: Add Debian Jessie PXE support for rdb servers (MW Job Queues). [puppet] - https://gerrit.wikimedia.org/r/274350 (https://phabricator.wikimedia.org/T123675)
[09:10:49] (CR) Jcrespo: [C: 1] Depooled codfw external storage for migration [mediawiki-config] - https://gerrit.wikimedia.org/r/274351 (https://phabricator.wikimedia.org/T127330) (owner: Volans)
[09:12:58] (CR) Volans: [C: 2] Depooled codfw external storage for migration [mediawiki-config] - https://gerrit.wikimedia.org/r/274351 (https://phabricator.wikimedia.org/T127330) (owner: Volans)
[09:13:02] !log Zuul went crazy / caught in a loop of doom. Same has Saturday. It went back magically at 08:32 UTC T128569
[09:13:03] T128569: Zuul get caught in an error loop preventing it from processing changes - https://phabricator.wikimedia.org/T128569
[09:13:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:13:35] (Merged) jenkins-bot: Depooled codfw external storage for migration [mediawiki-config] - https://gerrit.wikimedia.org/r/274351 (https://phabricator.wikimedia.org/T127330) (owner: Volans)
[09:13:59] (CR) Elukey: [C: 2] Add Debian Jessie PXE support for rdb servers (MW Job Queues). [puppet] - https://gerrit.wikimedia.org/r/274350 (https://phabricator.wikimedia.org/T123675) (owner: Elukey)
[09:15:08] PROBLEM - DPKG on labmon1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[09:15:41] hashar, thanks for logging!
[09:16:25] ^ dpkg alert on labmon1001 is due to T127957 and the ongoing django update
[09:16:50] !log volans@tin Synchronized wmf-config/db-codfw.php: Depooling external storage DBs in codfw for migration: T127330 (duration: 01m 24s)
[09:16:51] T127330: Migration from es2001-es2010 to es2011-es2019 - https://phabricator.wikimedia.org/T127330
[09:16:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:17:07] jynus: I have no idea what is happening but it should be running just fine. Apparently solved all by itself ~45 minutes ago
[09:17:58] it was not passive-agresive thanks, I was genuinly thanking for logging it!
[09:18:36] yup ack :)
[09:18:51] if it happens again, create a ticket so I can help
[09:19:43] I did https://phabricator.wikimedia.org/T128569 ;)
[09:19:48] !log redis multi-instance stopped on rdb1004 (jobqueue slave) as pre-step for Debian re-image
[09:19:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:20:05] it is a bug in Zuul , somehow it receives a change that does not have any project attached to it ( it is None somehow)
[09:20:38] PROBLEM - puppet last run on install2001 is CRITICAL: CRITICAL: puppet fail
[09:30:09] !log installing nodejs updates on restbase*
[09:30:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:33:18] RECOVERY - DPKG on labmon1001 is OK: All packages OK
[09:43:38] !log Cloning es2005->es2014, es2007->es2016, es2009->es2018, see T127330
[09:43:39] T127330: Migration from es2001-es2010 to es2011-es2019 - https://phabricator.wikimedia.org/T127330
[09:43:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:49:38] RECOVERY - puppet last run on install2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:58:19] Operations, Traffic: Install XKey vmod - https://phabricator.wikimedia.org/T122881#2079205 (ema) p:Triage>Normal a:ema
[10:16:00] !log elastic1004.eqiad.wmnet: upgrading to 1.7.5, shipping logs to logstash (T122697, T109101) [10:16:01] T109101: Currently elasticsearch logs do not leave nodes. We use logstash for this across the cluster generally. - https://phabricator.wikimedia.org/T109101 [10:16:02] T122697: Upgrade ElasticSearch to 1.7.5 - https://phabricator.wikimedia.org/T122697 [10:16:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:24:05] corner case bugs, or how to waste a morning ;D [10:36:00] !log stopped Redis multi-instance on rdb1006 (Job Queue slave) as pre-step for Debian re-image [10:36:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:42:39] !log restarting graphite-web on graphite1001 (for django security update) [10:42:39] !log Zuul should no more be caught in death loop due to Depends-On on an event-schemas change. Hole filled with https://gerrit.wikimedia.org/r/#/c/274356/ T128569 [10:42:40] T128569: Zuul get caught in an error loop preventing it from processing changes - https://phabricator.wikimedia.org/T128569 [10:42:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:42:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:42:51] so Zuul is fixed until it dies again [10:43:43] root cause is marking a Depends-On: on a repository not known to Zuul which causes it to death loop. That is really a corner case. [10:51:52] PROBLEM - puppet last run on mw2084 is CRITICAL: CRITICAL: Puppet has 1 failures [10:53:47] 6Operations, 10MobileFrontend, 10Traffic, 5MW-1.27-release, and 6 others: Incorrect TOC and section edit links rendering in Vector due to ParserCache corruption via ParserOutput::setText( ParserOutput::getText() ) - https://phabricator.wikimedia.org/T124356#2079398 (10Sjoerddebruin) >>! In T124356#2077122,... 
[10:58:12] (03PS1) 10Muehlenhoff: Puppetise yubikey key storage module [puppet] - 10https://gerrit.wikimedia.org/r/274358 [10:58:14] (03PS1) 10Muehlenhoff: Include yhsm-yubikey-ksm in yubiauth role [puppet] - 10https://gerrit.wikimedia.org/r/274359 [11:00:07] 6Operations, 10Mail, 13Patch-For-Review: our spam assassin service inserts PROBABLE SPAM into known good emails - https://phabricator.wikimedia.org/T128497#2079413 (10faidon) 5Open>3Invalid Our mail setup doesn't modify the Subject. We didn't add this "PROBABLE SPAM" — I'm unsure which system did, possib... [11:00:30] (03CR) 10Faidon Liambotis: [C: 04-2] "See the task." [puppet] - 10https://gerrit.wikimedia.org/r/274170 (https://phabricator.wikimedia.org/T128497) (owner: 10RobH) [11:09:03] !log profiling db1023 and db1061 for 24 hours- 1/20th of the queries slightly slower [11:09:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:10:11] 6Operations, 6Services, 10procurement: Hardware request for SCA and SCB in codfw - https://phabricator.wikimedia.org/T128475#2079445 (10faidon) p:5Triage>3High a:3RobH [11:11:30] PROBLEM - puppet last run on mw2078 is CRITICAL: CRITICAL: Puppet has 1 failures [11:13:39] --^ both hosts (mw2078 and mw2084) are failing for a 502 in /Stage[main]/Mediawiki::Hhvm::Housekeeping/File[/usr/local/sbin/hhvm_cleanup_cache] [11:15:45] is it sorry mw2084 with /Stage[main]/Mediawiki::Cgroup/File[/usr/local/bin/cgroup-mediawiki-clean, but 502 [11:16:39] mhh temporary or you can reproduce? [11:17:39] I didn't touch them because I wanted to ask first, but since they are job runners in codfw I guess that I can work on them right? 
[11:18:33] yup [11:18:52] (03PS2) 10Muehlenhoff: Remove DNS entries for berkelium/curium [dns] - 10https://gerrit.wikimedia.org/r/274095 (https://phabricator.wikimedia.org/T125962) [11:19:21] RECOVERY - puppet last run on mw2084 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:21:41] PROBLEM - Kafka Broker Replica Max Lag on kafka1020 is CRITICAL: CRITICAL: 64.00% of data above the critical threshold [5000000.0] [11:27:06] 6Operations, 10hardware-requests, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: MediaWiki maintenance host for codfw (terbium's equivalent) - https://phabricator.wikimedia.org/T126987#2079468 (10faidon) 64GB would be ideal; terbium doesn't use much memory most of the time, but at certain circumstances we so... [11:30:38] 6Operations, 6Services, 10procurement: Hardware request for SCA and SCB in codfw - https://phabricator.wikimedia.org/T128475#2079474 (10mobrovac) [11:30:40] 6Operations, 6Services, 3Mobile-Content-Service, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Prepare mobileapps for the codfw switchover - https://phabricator.wikimedia.org/T125061#2079473 (10mobrovac) [11:30:50] 6Operations, 6Services, 10procurement: Hardware request for SCA and SCB in codfw - https://phabricator.wikimedia.org/T128475#2075945 (10mobrovac) [11:30:52] 6Operations, 10Graphoid, 6Services, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Prepare graphoid for the codfw switchover - https://phabricator.wikimedia.org/T125060#2079475 (10mobrovac) [11:30:59] 6Operations, 10Mathoid, 6Services, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Prepare mathoid for the codfw switchover - https://phabricator.wikimedia.org/T125058#2079477 (10mobrovac) [11:31:02] 6Operations, 6Services, 10procurement: Hardware request for SCA and SCB in codfw - https://phabricator.wikimedia.org/T128475#2075945 (10mobrovac) [11:31:55] (03PS4) 10Giuseppe Lavagetto: role::memcached: create cross-dc replication for sessions [puppet] - 
10https://gerrit.wikimedia.org/r/271261 (https://phabricator.wikimedia.org/T126470) [11:31:55] RECOVERY - Kafka Broker Replica Max Lag on kafka1020 is OK: OK: Less than 50.00% above the threshold [1000000.0] [11:33:06] (03CR) 10Muehlenhoff: [C: 032 V: 032] Remove DNS entries for berkelium/curium [dns] - 10https://gerrit.wikimedia.org/r/274095 (https://phabricator.wikimedia.org/T125962) (owner: 10Muehlenhoff) [11:34:56] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "redis_get_instances doesn't return the correct values." [puppet] - 10https://gerrit.wikimedia.org/r/271261 (https://phabricator.wikimedia.org/T126470) (owner: 10Giuseppe Lavagetto) [11:36:22] !log mobileapps deploying d384f1ba [11:36:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:38:23] RECOVERY - puppet last run on mw2078 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:39:14] godog: didn't do anything, self recovered (I was busy with another task) [11:39:31] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Please see the comments; at least the api url should be changed." 
(032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/274348 (https://phabricator.wikimedia.org/T113542) (owner: 10Mobrovac) [11:39:46] (03CR) 10Filippo Giunchedi: [C: 04-1] Parameterize the git_server variable in global scap.cfg (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/272947 (https://phabricator.wikimedia.org/T126259) (owner: 1020after4) [11:39:59] (03PS1) 10Muehlenhoff: Remove ferm exceptions for iron being a DBA maintenance host [puppet] - 10https://gerrit.wikimedia.org/r/274366 [11:40:17] elukey: yeah, if it was permanent all other hosts would have failed by now to [11:40:21] s/to/too/ [11:45:17] (03PS1) 10Muehlenhoff: Add Roan to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/274367 (https://phabricator.wikimedia.org/T128557) [11:45:50] (03PS2) 10Mobrovac: MobileApps: Put in place proper request templates for RB and MW [puppet] - 10https://gerrit.wikimedia.org/r/274348 (https://phabricator.wikimedia.org/T113542) [11:47:04] <_joe_> w/in 34 [11:47:10] (03CR) 10Mobrovac: MobileApps: Put in place proper request templates for RB and MW (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/274348 (https://phabricator.wikimedia.org/T113542) (owner: 10Mobrovac) [11:47:58] (03CR) 10Mobrovac: "The puppet compiler is looking good: https://puppet-compiler.wmflabs.org/1909/scb1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/274348 (https://phabricator.wikimedia.org/T113542) (owner: 10Mobrovac) [11:49:50] (03PS3) 10Giuseppe Lavagetto: MobileApps: Put in place proper request templates for RB and MW [puppet] - 10https://gerrit.wikimedia.org/r/274348 (https://phabricator.wikimedia.org/T113542) (owner: 10Mobrovac) [11:50:29] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/274348 (https://phabricator.wikimedia.org/T113542) (owner: 10Mobrovac) [11:52:02] 6Operations, 10RESTBase, 10hardware-requests: normalize eqiad restbase cluster - replace restbase1001-1006 - 
https://phabricator.wikimedia.org/T125842#2079514 (10mark) [11:54:27] (03CR) 10Jcrespo: [C: 031] "I am ok with this, but will this apply in a hot way to all servers where ferm is deployed?" [puppet] - 10https://gerrit.wikimedia.org/r/274366 (owner: 10Muehlenhoff) [11:55:29] (03PS1) 10BBlack: logrotate: s/syslog/root/ for create [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/274368 [11:56:17] (03CR) 10BBlack: [C: 032 V: 032] logrotate: s/syslog/root/ for create [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/274368 (owner: 10BBlack) [11:56:23] <_joe_> !log stopped puppet on scb1002, depooled scb1001 from mobileapps [11:56:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:57:25] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/page/media/{title} (retrieve images and videos of en.wp Cat page via media route) is CRITICAL: Test retrieve images and videos of en.wp Cat page via media route returned the unexpected status 404 (expecting: 200): /{domain}/v1/page/mobile-summary/{title} (retrieve page preview of Dog page) is CRITICAL: Test retrieve page preview of Dog page returned the unexpec [11:58:32] (03PS1) 10BBlack: Bump varnishkafka submodule to 0b3d78ea for logrotate fixup [puppet] - 10https://gerrit.wikimedia.org/r/274370 [11:58:32] <_joe_> known ^^ [11:58:49] (03CR) 10BBlack: [C: 032 V: 032] Bump varnishkafka submodule to 0b3d78ea for logrotate fixup [puppet] - 10https://gerrit.wikimedia.org/r/274370 (owner: 10BBlack) [12:00:25] !log mobileapps stopping the service on scb1001 for debug purposes, T113542 [12:00:26] T113542: Getting 404s for BetaCluster domains - https://phabricator.wikimedia.org/T113542 [12:00:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:07:41] 6Operations, 10DBA, 7Upstream: TokuDB crashes frequently -consider upgrade it or search for alternative engines with similar features - 
https://phabricator.wikimedia.org/T109069#2079531 (10jcrespo) [12:08:03] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp San Francisco page via mobile-sections-lead) is CRITICAL: Test retrieve lead section of en.wp San Francisco page via mobile-sections-lead returned the unexpected status 504 (expecting: 200) [12:10:26] (03PS1) 10Ema: Use std.fileread to generate WMF error page [puppet] - 10https://gerrit.wikimedia.org/r/274371 [12:13:51] <_joe_> uh wtf is going on with mobileapps ^^ [12:13:54] <_joe_> mobrovac: ? [12:14:29] euh [12:14:47] lemme submit the patch and we'll apply it, should bring stuff back to normal [12:14:50] <_joe_> yeah [12:15:25] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [12:15:32] (03PS1) 10Mobrovac: MobileApps: Specify the content type in the MW API call [puppet] - 10https://gerrit.wikimedia.org/r/274373 (https://phabricator.wikimedia.org/T113542) [12:15:42] _joe_: ^ [12:16:26] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] MobileApps: Specify the content type in the MW API call [puppet] - 10https://gerrit.wikimedia.org/r/274373 (https://phabricator.wikimedia.org/T113542) (owner: 10Mobrovac) [12:16:55] <_joe_> mobrovac: running on scb1002 [12:17:49] (03PS1) 10Muehlenhoff: Empty the patch series for rt flavour [debs/linux44] - 10https://gerrit.wikimedia.org/r/274374 [12:18:05] wth??? 
[12:18:44] <_joe_> mobrovac: 1 sec and we should be ok [12:19:03] kk [12:19:51] <_joe_> uhm actually no [12:19:58] <_joe_> that error is still there it seems [12:20:10] <_joe_> and had nothing to do with the code deploy [12:20:17] <_joe_> (the one on scb1002) [12:20:17] not the same [12:20:24] _joe_: from the checker: Test retrieve images and videos of en.wp Cat page via media route responds with malformed body: 'NoneType' object has no attribute '__getitem__' [12:20:37] * mobrovac looking [12:20:54] <_joe_> mobrovac: means it received no body [12:21:46] <_joe_> mobrovac: what about we revert? [12:21:50] _joe_: curl receives an empty json blob [12:21:51] <_joe_> mobileapps is down atm [12:22:05] <_joe_> do you get why? [12:22:19] <_joe_> I seriously think we should rollback [12:22:24] no, will need to investigate [12:22:43] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: /{domain}/v1/page/media/{title} (retrieve images and videos of en.wp Cat page via media route) is CRITICAL: Test retrieve images and videos of en.wp Cat page via media route responds with malformed body: NoneType object has no attribute __getitem__: /{domain}/v1/page/mobile-summary/{title} (retrieve page preview of Dog page) is CRITICAL: Test retrieve page preview of Dog pa [12:22:46] _joe_: i suggest disabling puppet on scb1001 so i can investigate and we roll back scb1002? [12:22:55] <_joe_> so let's roll back the software on scb1002, I'll stop puppet and fix the config [12:23:01] <_joe_> but please roll back the code [12:23:11] k, doing it [12:23:20] <_joe_> is that !logged? [12:23:29] what? 
[12:23:38] <_joe_> the deployments [12:23:46] yes [12:23:54] PROBLEM - logstash process on logstash1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 998 (logstash), command name java, args logstash [12:24:24] fun ^ [12:24:34] PROBLEM - logstash process on logstash1002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 998 (logstash), command name java, args logstash [12:25:02] <_joe_> mobrovac: have you rolled back? [12:25:21] <_joe_> I have fixed the config [12:25:24] !log mobileapps rolling back to 68e38ec7, problems found in the latest deploy for T113542 [12:25:25] T113542: Getting 404s for BetaCluster domains - https://phabricator.wikimedia.org/T113542 [12:25:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:25:30] _joe_: there ^ [12:25:50] <_joe_> ok [12:25:53] _joe_: {{done}} [12:26:14] looking at logstash... [12:26:18] ok _joe_, i'll play with it on scb1001 [12:26:23] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [12:26:27] just leave it depooled for the time being [12:26:28] <_joe_> mobrovac: actually, if we _added_ the new configs along with the old ones [12:26:38] <_joe_> it would've been better, probably [12:26:43] <_joe_> bblack: thanks [12:26:43] indeed [12:27:39] <_joe_> !log puppet disabled on both scb1001/2, depooled scb1001 for mobrovac to test and config manually patched on scb1002 so that it runs with the old code correctly [12:27:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:27:46] <_joe_> I will amend this after lunch btw [12:28:23] moritzm: you're looking too I guess, but not causing? 
[12:28:31] I think the oom message in logstash.err is a red herring, it's also logged on logstash1003, which still has the process running [12:28:39] bblack: no just checking at the moment [12:29:09] !log restarted logstash on logstash1001 [12:29:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:29:24] RECOVERY - logstash process on logstash1001 is OK: PROCS OK: 1 process with UID = 998 (logstash), command name java, args logstash [12:29:53] bd808: restarted a crashed logstash on 2016-02-25 already [12:30:37] and filed https://phabricator.wikimedia.org/T127677, I'll give that a higher priority and finish the patch today [12:30:46] Error: Your application used more memory than the safety cap of 500M. [12:30:49] Specify -J-Xmx####m to increase it (#### = cap size in MB). [12:30:51] Specify -w for full OutOfMemoryError stack trace [12:31:09] bblack: that one is also logged on logstash1003, which is still running fine [12:31:43] could be related, but might also be leading into a wrong direction :-) [12:31:46] are you sure the error there isn't older than the current proc? 
[12:32:33] (03PS1) 10Jcrespo: [WIP]Test the new heartbeat functionality on m5-master [puppet] - 10https://gerrit.wikimedia.org/r/274377 (https://phabricator.wikimedia.org/T114752) [12:32:42] !log mobileapps stopping (again) the service on scb1001 for debugging, T113542 [12:32:43] T113542: Getting 404s for BetaCluster domains - https://phabricator.wikimedia.org/T113542 [12:32:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:32:52] actually I don't see that error on logstash1003 [12:33:36] !log restarted logstash on logstash1002 [12:33:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:33:54] RECOVERY - logstash process on logstash1002 is OK: PROCS OK: 1 process with UID = 998 (logstash), command name java, args logstash [12:34:45] 6Operations, 10Wikimedia-Logstash: Auto generated Logstash unit file has "Restart=no" - https://phabricator.wikimedia.org/T127677#2050308 (10BBlack) 2x logstashes (1001 + 1002) pretty much simultaneously crashed today with errors like: ``` Error: Your application used more memory than the safety cap of 500M. S... [12:35:09] it's /var/log/logstash/logstash.err on 1003, the same error message as on 1001, it seems to have been logged at daemon startup, the logstash process on 1003 is from the 24th (same as the date of the logfile) [12:35:42] root@logstash1003:/var/log/logstash# cat /var/log/logstash/logstash.err [12:35:45] WARNING: Default JAVA_OPTS will be overridden by the JAVA_OPTS defined in the environment. 
Environment JAVA_OPTS are -Xms256m -Xmx256m -Djava.io.tmpdir=/var/lib/logstash [12:35:49] '[DEPRECATED] use `require 'concurrent'` instead of `require 'concurrent_ruby'` [12:35:51] 6Operations, 10media-storage: Error deleting files on Arabic Wikipedia: inconsistent state within the internal storage background - https://phabricator.wikimedia.org/T128570#2079605 (10Aklapper) p:5Triage>3High [12:35:52] [2016-02-24 20:43:11.625] WARN -- Concurrent: [DEPRECATED] Java 7 is deprecated, please use Java 8. [12:35:55] Java 7 support is only best effort, it may not work. It will be removed in next release (1.0). [12:35:56] sorry, we're talking about different log messages [12:35:58] [12:36:01] you're right [12:36:23] 6Operations, 10media-storage: Error deleting files on Arabic Wikipedia: inconsistent state within the internal storage background - https://phabricator.wikimedia.org/T128570#2079154 (10Aklapper) [12:36:25] 6Operations, 10media-storage: Unable to delete, move or upload new versions of files on several wikis ("inconsistent state within the internal storage backends") - https://phabricator.wikimedia.org/T128096#2079609 (10Aklapper) [12:36:33] 6Operations, 10media-storage: Error deleting files on Arabic Wikipedia: inconsistent state within the internal storage background - https://phabricator.wikimedia.org/T128570#2079154 (10Aklapper) Hi @Ibrahim.ID, thanks for taking the time to report this! This particular problem has already been reported into ou... 
[12:38:41] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/274377 (https://phabricator.wikimedia.org/T114752) (owner: 10Jcrespo) [12:39:39] (03CR) 10BBlack: [C: 031] Use std.fileread to generate WMF error page [puppet] - 10https://gerrit.wikimedia.org/r/274371 (owner: 10Ema) [12:41:14] PROBLEM - HTTPS-toolserver on www.toolserver.org is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:IO::Socket::SSL: connect: Connection timed out [12:42:12] 6Operations, 10Wikimedia-Logstash: Auto generated Logstash unit file has "Restart=no" - https://phabricator.wikimedia.org/T127677#2079640 (10MoritzMuehlenhoff) The limit is configurable via LS_HEAP_SIZE: https://discuss.elastic.co/t/is-it-possible-to-give-ls-heap-size-xms-and-xmx-while-starting-logstash-1-4-2-... [12:43:14] RECOVERY - HTTPS-toolserver on www.toolserver.org is OK: SSL OK - Certificate toolserver.org valid until 2016-06-30 17:56:02 +0000 (expires in 120 days) [12:54:31] (03CR) 10Muehlenhoff: "The change would be applied to all systems with base::firewall enabled with the usual half-hourly puppet runs. 
If you prefer we can split " [puppet] - 10https://gerrit.wikimedia.org/r/274366 (owner: 10Muehlenhoff) [12:59:00] (03PS1) 10Mobrovac: MobileApps: Fix the MW API URI, revert c848ca4fd and resurect restbase_uri [puppet] - 10https://gerrit.wikimedia.org/r/274381 (https://phabricator.wikimedia.org/T113542) [12:59:50] _joe_: ^ [13:00:30] (03PS1) 10Gehel: Factorized code exposing Puppet SSL certs [puppet] - 10https://gerrit.wikimedia.org/r/274382 (https://phabricator.wikimedia.org/T124444) [13:04:22] (03PS4) 10Gehel: Expose elasticsearch through HTTP [puppet] - 10https://gerrit.wikimedia.org/r/273254 (https://phabricator.wikimedia.org/T124444) [13:06:18] (03CR) 10Giuseppe Lavagetto: [C: 032] MobileApps: Fix the MW API URI, revert c848ca4fd and resurect restbase_uri [puppet] - 10https://gerrit.wikimedia.org/r/274381 (https://phabricator.wikimedia.org/T113542) (owner: 10Mobrovac) [13:06:28] <_joe_> mobrovac: I'll run puppet on scb1001 now [13:06:31] cool [13:08:04] _joe_: aaand we're back! All endpoints are healthy [13:08:30] _joe_: euh actually, i need to re-deploy the new version of the code [13:08:53] let me do that before you repool it [13:09:01] <_joe_> mobrovac: ok [13:09:19] <_joe_> at least we know a puppet-active state that works :P [13:09:33] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [13:10:01] (03PS1) 10ArielGlenn: default to jessie for dataset1001 installs [puppet] - 10https://gerrit.wikimedia.org/r/274383 [13:10:02] !log mobileapps re-deploying d384f1ba for T113542 [13:10:03] T113542: Getting 404s for BetaCluster domains - https://phabricator.wikimedia.org/T113542 [13:10:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:10:34] <_joe_> mobrovac: should I restart the service on scb1001? 
[13:10:51] _joe_: yes, do it now [13:10:55] deployment done [13:11:23] <_joe_> mobrovac: it works :) [13:11:30] \o/ [13:11:31] (03CR) 10ArielGlenn: [C: 032] default to jessie for dataset1001 installs [puppet] - 10https://gerrit.wikimedia.org/r/274383 (owner: 10ArielGlenn) [13:11:50] k _joe_, you can run puppet on scb1002 too and repool scb1001 [13:12:57] 6Operations: Upgrade aqs* to nodejs 4.3 - https://phabricator.wikimedia.org/T123629#2079743 (10MoritzMuehlenhoff) [13:13:32] <_joe_> !log re-enabled puppet on scb1002, repooled scb1001 for mobileapps [13:13:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:21:26] (03PS5) 10Giuseppe Lavagetto: role::memcached: create cross-dc replication for sessions [puppet] - 10https://gerrit.wikimedia.org/r/271261 (https://phabricator.wikimedia.org/T126470) [13:21:33] 6Operations, 10Analytics, 10Traffic: http://dumps.wikimedia.org should redirect to https:// - https://phabricator.wikimedia.org/T128587#2079753 (10elukey) [13:23:14] PROBLEM - puppet last run on ms-fe3002 is CRITICAL: CRITICAL: Puppet has 1 failures [13:23:33] !log elastic1005.eqiad.wmnet: upgrading to 1.7.5, shipping logs to logstash (T122697, T109101) [13:23:35] T109101: Currently elasticsearch logs do not leave nodes. We use logstash for this across the cluster generally. 
- https://phabricator.wikimedia.org/T109101 [13:23:35] T122697: Upgrade ElasticSearch to 1.7.5 - https://phabricator.wikimedia.org/T122697 [13:23:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:24:24] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [13:26:14] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [13:28:15] RECOVERY - cassandra-b CQL 10.64.48.130:9042 on restbase1009 is OK: TCP OK - 0.002 second response time on port 9042 [13:29:41] 6Operations: 4.4 Linux kernel - https://phabricator.wikimedia.org/T126320#2079784 (10MoritzMuehlenhoff) A kernel based on Debian's 4.4.2-3 kernel has been imported to git (along with the usual backports for wheezy, config tweaks). I have also integrated the 4.4.3 update from from kernel.org (4.4.4 is also in rev... [13:31:02] (03PS6) 10Giuseppe Lavagetto: role::memcached: create cross-dc replication for sessions [puppet] - 10https://gerrit.wikimedia.org/r/271261 (https://phabricator.wikimedia.org/T126470) [13:32:01] !log nfs service for dataset1001 disabled (impacts users of stat100{2,3} in prep for jessie upgrade [13:32:03] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [13:32:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:33:53] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [13:37:02] (03CR) 10Faidon Liambotis: [C: 04-1] "I haven't looked at this in depth yet, but quick comment: this doesn't really belong in the sslcert module (which is a fairly generic modu" [puppet] - 10https://gerrit.wikimedia.org/r/274382 (https://phabricator.wikimedia.org/T124444) (owner: 10Gehel) [13:37:44] (03PS1) 10Muehlenhoff: Refresh block_diginotar.patch for 1.0.2g [debs/openssl] - 
10https://gerrit.wikimedia.org/r/274385 [13:42:50] (03CR) 10Muehlenhoff: [C: 032 V: 032] Refresh block_diginotar.patch for 1.0.2g [debs/openssl] - 10https://gerrit.wikimedia.org/r/274385 (owner: 10Muehlenhoff) [13:43:12] (03CR) 10Muehlenhoff: [C: 032 V: 032] Empty the patch series for rt flavour [debs/linux44] - 10https://gerrit.wikimedia.org/r/274374 (owner: 10Muehlenhoff) [13:44:51] (03PS2) 10Ema: Use std.fileread to generate WMF error page [puppet] - 10https://gerrit.wikimedia.org/r/274371 [13:45:08] (03CR) 10Ema: [C: 032 V: 032] Use std.fileread to generate WMF error page [puppet] - 10https://gerrit.wikimedia.org/r/274371 (owner: 10Ema) [13:45:16] 6Operations, 6Services, 10procurement: Hardware request for SCA and SCB in codfw - https://phabricator.wikimedia.org/T128475#2079800 (10mobrovac) [13:45:19] 6Operations, 6Language-Engineering, 6Services, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Prepare cxserver/zotero for the codfw switchover - https://phabricator.wikimedia.org/T125065#2079799 (10mobrovac) [13:47:45] 6Operations, 6Services, 3Mobile-Content-Service, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Prepare mobileapps for the codfw switchover - https://phabricator.wikimedia.org/T125061#2079801 (10mobrovac) > 2. Prepare the configuration so that we'll be able to switch between the two programmatically. This... [13:49:14] RECOVERY - puppet last run on ms-fe3002 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [13:49:48] (03PS2) 10Jcrespo: [WIP]Test the new heartbeat functionality on m5-master [puppet] - 10https://gerrit.wikimedia.org/r/274377 (https://phabricator.wikimedia.org/T114752) [13:53:18] (03CR) 10Gehel: "Yeah, I was not really sure where to put that. I will move it to base..." 
[puppet] - 10https://gerrit.wikimedia.org/r/274382 (https://phabricator.wikimedia.org/T124444) (owner: 10Gehel) [13:53:31] (03PS7) 10Giuseppe Lavagetto: role::memcached: create cross-dc replication for sessions [puppet] - 10https://gerrit.wikimedia.org/r/271261 (https://phabricator.wikimedia.org/T126470) [13:54:49] (03CR) 10jenkins-bot: [V: 04-1] role::memcached: create cross-dc replication for sessions [puppet] - 10https://gerrit.wikimedia.org/r/271261 (https://phabricator.wikimedia.org/T126470) (owner: 10Giuseppe Lavagetto) [13:55:36] (03PS8) 10Giuseppe Lavagetto: role::memcached: create cross-dc replication for sessions [puppet] - 10https://gerrit.wikimedia.org/r/271261 (https://phabricator.wikimedia.org/T126470) [13:56:11] (03PS1) 10Filippo Giunchedi: site.pp: add restbase1010 [puppet] - 10https://gerrit.wikimedia.org/r/274387 [13:56:56] (03CR) 10jenkins-bot: [V: 04-1] role::memcached: create cross-dc replication for sessions [puppet] - 10https://gerrit.wikimedia.org/r/271261 (https://phabricator.wikimedia.org/T126470) (owner: 10Giuseppe Lavagetto) [13:57:21] <_joe_> grr rubocop [13:57:44] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] site.pp: add restbase1010 [puppet] - 10https://gerrit.wikimedia.org/r/274387 (owner: 10Filippo Giunchedi) [13:59:41] (03PS2) 10Filippo Giunchedi: cassandra: add restbase101[0-5] instances [puppet] - 10https://gerrit.wikimedia.org/r/274133 (https://phabricator.wikimedia.org/T128107) [13:59:47] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] cassandra: add restbase101[0-5] instances [puppet] - 10https://gerrit.wikimedia.org/r/274133 (https://phabricator.wikimedia.org/T128107) (owner: 10Filippo Giunchedi) [14:03:04] PROBLEM - puppet last run on restbase1010 is CRITICAL: CRITICAL: puppet fail [14:03:28] !log web service for dumps.wikimedia.org and download.wikimedia.org is now unavailable (upgrade of server to jessie) [14:03:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:04:54] RECOVERY - puppet 
last run on restbase1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:05:58] (03CR) 10Jcrespo: "Mortiz, I do not care if it is atomic. The only thing that worries me is the "connections failing for a few seconds" when applying ferm fo" [puppet] - 10https://gerrit.wikimedia.org/r/274366 (owner: 10Muehlenhoff) [14:06:27] !log bootstrap restbase1010-a T128107 [14:06:29] T128107: install restbase1010-restbase1015 - https://phabricator.wikimedia.org/T128107 [14:06:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:09:03] (03PS1) 10Muehlenhoff: Define eventbus ferm service in the role [puppet] - 10https://gerrit.wikimedia.org/r/274389 [14:09:42] 6Operations, 6Language-Engineering, 6Services, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Prepare cxserver/zotero for the codfw switchover - https://phabricator.wikimedia.org/T125065#1973574 (10KartikMistry) Any action from Language Engineering needed? [14:13:04] (03PS9) 10Giuseppe Lavagetto: role::memcached: create cross-dc replication for sessions [puppet] - 10https://gerrit.wikimedia.org/r/271261 (https://phabricator.wikimedia.org/T126470) [14:15:18] akosiaris: hola. [14:16:17] (03PS1) 10Filippo Giunchedi: cassandra: add restbase101[0-5] main host IPs [puppet] - 10https://gerrit.wikimedia.org/r/274390 [14:16:54] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] cassandra: add restbase101[0-5] main host IPs [puppet] - 10https://gerrit.wikimedia.org/r/274390 (owner: 10Filippo Giunchedi) [14:18:12] (03PS1) 10Muehlenhoff: Include rsyncd ferm service in the statistics role [puppet] - 10https://gerrit.wikimedia.org/r/274391 [14:19:52] 6Operations, 6Language-Engineering, 6Services, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Prepare cxserver/zotero for the codfw switchover - https://phabricator.wikimedia.org/T125065#1973574 (10mobrovac) >>! In T125065#2079833, @KartikMistry wrote: > Any action from Language Engineering needed? Not at... 
[14:23:38] 6Operations, 10RESTBase-Cassandra: cassandra uses default ip address for outbound packets while bootstrapping - https://phabricator.wikimedia.org/T128590#2079854 (10fgiunchedi) [14:24:34] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [1000.0] [14:24:45] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [1000.0] [14:25:08] (03PS10) 10Giuseppe Lavagetto: role::memcached: create cross-dc replication for sessions [puppet] - 10https://gerrit.wikimedia.org/r/271261 (https://phabricator.wikimedia.org/T126470) [14:26:07] (03CR) 10jenkins-bot: [V: 04-1] role::memcached: create cross-dc replication for sessions [puppet] - 10https://gerrit.wikimedia.org/r/271261 (https://phabricator.wikimedia.org/T126470) (owner: 10Giuseppe Lavagetto) [14:27:29] gehel: re: https://phabricator.wikimedia.org/T126472 "*medium term:* having redundancy on both our IRC and RCStream change logs seems to make sense." [14:27:35] gehel: do we have another task for that? [14:27:52] we can tag it with #codfw-rollout but not #codfw-rollout-jan-mar-2016 [14:27:54] paravoid: not that I know. Should I create it? [14:27:59] yes please :) [14:28:03] wilco [14:28:17] merci! [14:28:31] <_joe_> oh ffs that function works well locally and doesn't when fed by the compiler [14:28:31] paravoid: mais c'est tout naturel mon brave... [14:28:43] <_joe_> what am I doing wrong [14:29:32] kart_: hello [14:30:24] akosiaris: bunch of reviews when you're back fully :) [14:31:07] I have been pushing updated packages to Debian heavily with apertium upstream. 
We will be good in coming years :) [14:31:15] _joe_: using puppet [14:32:24] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [14:32:25] !log elastic1006.eqiad.wmnet: upgrading to 1.7.5, shipping logs to logstash (T122697, T109101) [14:32:27] T109101: Currently elasticsearch logs do not leave nodes. We use logstash for this across the cluster generally. - https://phabricator.wikimedia.org/T109101 [14:32:27] T122697: Upgrade ElasticSearch to 1.7.5 - https://phabricator.wikimedia.org/T122697 [14:32:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:34:58] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [14:35:06] kart_: that sounds great! thanks! [14:35:37] akosiaris: problem is I will end up maintaining 100+ packages :) [14:36:01] issues again with restbase? [14:36:18] oh, ignore me [14:36:22] it is downtimed [14:36:59] jynus: yeah that's me [14:37:29] kart_: ;-) [14:38:49] I only asked because there were issues with a deploy before, trying to help [14:43:32] 6Operations, 10Traffic, 10Wikipedia-Store, 7HTTPS: https://store.wikimedia.org doesn't set HSTS header - https://phabricator.wikimedia.org/T128559#2079914 (10Ppena) Hi @Chmarkine. We don't have anyone with a tech/ops background working for the store at the moment < waiting volunteers> :)!! Can you please... [14:45:25] 6Operations, 7Availability, 13Patch-For-Review, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Implement a replication strategy for Swift - https://phabricator.wikimedia.org/T91869#2079916 (10fgiunchedi) tentative schedule to enable async replication on the remaining wikis (modulo resolution/writeup of htt...
[14:45:37] 6Operations, 10media-storage: Unable to delete, move or upload new versions of files on several wikis ("inconsistent state within the internal storage backends") - https://phabricator.wikimedia.org/T128096#2079918 (10fgiunchedi) [14:45:39] 6Operations, 7Availability, 13Patch-For-Review, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Implement a replication strategy for Swift - https://phabricator.wikimedia.org/T91869#2079917 (10fgiunchedi) [14:48:35] 6Operations, 10Analytics, 10Traffic: http://dumps.wikimedia.org should redirect to https:// - https://phabricator.wikimedia.org/T128587#2079921 (10Dzahn) Has once been declared "won't fix" on https://wikitech.wikimedia.org/wiki/Httpsless_domains in the past. Adding @ArielGlenn. Remember that discussion? [14:50:14] (03PS3) 10Filippo Giunchedi: swift: switch to codfw imagescalers [puppet] - 10https://gerrit.wikimedia.org/r/268080 (https://phabricator.wikimedia.org/T91869) [14:50:36] (03CR) 10Ottomata: [C: 031] Include rsyncd ferm service in the statistics role [puppet] - 10https://gerrit.wikimedia.org/r/274391 (owner: 10Muehlenhoff) [14:50:36] kart_: i still think it would have been much easier to have one extra pkg containing lang pairs [14:52:35] <_joe_> bblack: and indeed, you're right [14:54:08] <_joe_> it has to do with "6" and 6 being two different things in ruby/erb [14:54:18] <_joe_> and being the same for puppet [14:54:27] * _joe_ headdesks repeatedly [14:54:47] ftr, that's a puppet fail, not a ruby fail [14:54:48] :P [14:58:54] !log restbase deploy start of 5def2f8 on restbase1001 [14:58:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:59:17] 6Operations, 10Dumps-Generation, 6Labs, 10wikitech.wikimedia.org: Provide dumps of wikitech.wikimedia.org - https://phabricator.wikimedia.org/T54170#2079927 (10Dzahn) a:3Dzahn [15:00:05] (03PS1) 10BBlack: stdlib: add function hash_select_re [puppet] - 10https://gerrit.wikimedia.org/r/274400 
(https://phabricator.wikimedia.org/T127481) [15:00:13] _joe_: I didn't implement map, but I did implement one new tiny map-like bit of sub-functionality ^ [15:02:21] !log restbase reverting to fa1207e95, problems spotted in logstash [15:02:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:02:33] mobrovac: I was exploring idea with akosiaris to keep data in git repo. [15:02:38] PROBLEM - puppet last run on stat1003 is CRITICAL: CRITICAL: Puppet has 1 failures [15:02:39] (03PS2) 10Andrew Bogott: Move production Horizon to Liberty [puppet] - 10https://gerrit.wikimedia.org/r/274311 (https://phabricator.wikimedia.org/T105690) [15:02:49] kart_: that'd be even better [15:02:52] (03PS11) 10Giuseppe Lavagetto: role::memcached: create cross-dc replication for sessions [puppet] - 10https://gerrit.wikimedia.org/r/271261 (https://phabricator.wikimedia.org/T126470) [15:03:58] mobrovac: I have to test it locally though. In theory it should work :) [15:04:10] (03CR) 10Andrew Bogott: [C: 032] Move production Horizon to Liberty [puppet] - 10https://gerrit.wikimedia.org/r/274311 (https://phabricator.wikimedia.org/T105690) (owner: 10Andrew Bogott) [15:04:29] PROBLEM - puppet last run on stat1002 is CRITICAL: CRITICAL: Puppet last ran 6 days ago [15:04:47] (03CR) 10jenkins-bot: [V: 04-1] role::memcached: create cross-dc replication for sessions [puppet] - 10https://gerrit.wikimedia.org/r/271261 (https://phabricator.wikimedia.org/T126470) (owner: 10Giuseppe Lavagetto) [15:05:11] kart_: famous last words :)))) [15:05:31] <_joe_> die puppetlint-strict; die in a fire [15:06:10] _joe_: i'm happy to see somebody else in the same mood as mine, it really helps :D [15:06:13] (03CR) 10Ottomata: [C: 031] Define eventbus ferm service in the role [puppet] - 10https://gerrit.wikimedia.org/r/274389 (owner: 10Muehlenhoff) [15:06:18] !log running apt-get dist upgrade to upgrade californium packages to openstack Liberty [15:06:21] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:06:41] \O/ [15:06:55] (03PS2) 10Gehel: Factorized code exposing Puppet SSL certs [puppet] - 10https://gerrit.wikimedia.org/r/274382 (https://phabricator.wikimedia.org/T124444) [15:07:38] andrewbogott: hello. So you are doing Juno -> Kilo -> Liberty over a short time period? It is good to see us catching up [15:07:54] hashar: I’m only moving Horizon to liberty for now... [15:08:03] the other still probably won’t get upgraded for a month or two [15:08:09] (03PS3) 10Gehel: Factorized code exposing Puppet SSL certs [puppet] - 10https://gerrit.wikimedia.org/r/274382 (https://phabricator.wikimedia.org/T124444) [15:08:13] * hashar looks for new features ( https://wiki.openstack.org/wiki/ReleaseNotes/Liberty#OpenStack_Dashboard_.28Horizon.29 ) [15:08:36] but, I asked around and a fair number of people are running horizon with git Head against kilo — it’s meant to be backwards-compatible [15:08:45] (03PS1) 10Dzahn: openstack: do not have duplicate wikitech Apache config [puppet] - 10https://gerrit.wikimedia.org/r/274402 [15:08:51] And if I upgrade it now that’s one fewer time that I have to implement 2fa [15:09:29] ACKNOWLEDGEMENT - Restbase root url on restbase1010 is CRITICAL: Connection refused Filippo Giunchedi bootstrapping [15:09:29] ACKNOWLEDGEMENT - cassandra-a CQL 10.64.0.114:9042 on restbase1010 is CRITICAL: Connection refused Filippo Giunchedi bootstrapping [15:09:29] ACKNOWLEDGEMENT - restbase endpoints health on restbase1010 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.0.112, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) Filippo Giunchedi bootstrapping [15:10:53] 6Operations, 10Phabricator, 6Project-Admins, 6Triagers: Requests for addition to the #acl*Project-Admins group (in comments) - https://phabricator.wikimedia.org/T706#2079937 (10Qgil) a:5Qgil>3None 
[15:11:03] hashar: I’m also going to reboot californium shortly just to make sure it can come up. Will that bother you? [15:11:22] andrewbogott: no clue what californium is used for [15:11:28] if that is horizon, nodepool does not depend on it [15:11:51] yes, sorry, californium == horizon [15:11:51] I myself use Horizon solely to look at instances because the layout is a bit easier to read. Though I abuse the openstack CLI command as well } [15:11:56] so yeah all good [15:12:03] pretty sure you’re my only Horizon user atm [15:12:09] can test it out a bit once you have upgraded if it can help [15:12:10] hehe [15:12:23] I really like it [15:12:43] In that case I bet you’ll love this one: https://gerrit.wikimedia.org/r/#/c/274309/ [15:12:55] (03PS3) 10Andrew Bogott: Increase Horizon session length by a lot [puppet] - 10https://gerrit.wikimedia.org/r/274309 [15:13:12] is 2FA on Horizon a blocker to bring back privileged commands such as deleting/rebooting instances? [15:13:16] 6Operations, 10Huggle, 10Traffic, 7HTTPS: Huggle 2 fails on HTTP used when HTTPS expected - https://phabricator.wikimedia.org/T126357#2079942 (10DVdm) I have left a new version 2.1.27.6 in our shared dropbox, ready to be picked up, tested, honed and published by Petrb. Caught bug reading resources file War... [15:13:46] hashar: it is, but I might have 2fa running later today — it’s all working in labtest [15:14:06] I can test that out as well since I got 2FA on wikitech [15:14:22] cool, I’ll let you know when things are settled [15:14:22] (03PS12) 10Giuseppe Lavagetto: role::memcached: create cross-dc replication for sessions [puppet] - 10https://gerrit.wikimedia.org/r/271261 (https://phabricator.wikimedia.org/T126470) [15:15:34] !log elastic1007.eqiad.wmnet: upgrading to 1.7.5, shipping logs to logstash (T122697, T109101) [15:15:36] T109101: Currently elasticsearch logs do not leave nodes. We use logstash for this across the cluster generally.
- https://phabricator.wikimedia.org/T109101 [15:15:36] T122697: Upgrade ElasticSearch to 1.7.5 - https://phabricator.wikimedia.org/T122697 [15:15:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:16:36] !log rebooting californium just to make sure dist-upgrade didn’t mess up grub [15:16:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:16:53] (03PS1) 10Muehlenhoff: Refresh version-script.patch for 1.0.2g [debs/openssl] - 10https://gerrit.wikimedia.org/r/274404 [15:17:03] (03CR) 10Andrew Bogott: [C: 032] Increase Horizon session length by a lot [puppet] - 10https://gerrit.wikimedia.org/r/274309 (owner: 10Andrew Bogott) [15:17:32] (03PS1) 10Dzahn: wikitech: open up dumps directory to public [puppet] - 10https://gerrit.wikimedia.org/r/274405 (https://phabricator.wikimedia.org/T54170) [15:18:20] (03PS3) 10Jcrespo: [WIP]Test the new heartbeat functionality on m5-master [puppet] - 10https://gerrit.wikimedia.org/r/274377 (https://phabricator.wikimedia.org/T114752) [15:18:31] (03PS4) 10Jcrespo: Test the new heartbeat functionality on m5-master [puppet] - 10https://gerrit.wikimedia.org/r/274377 (https://phabricator.wikimedia.org/T114752) [15:19:01] 6Operations, 10Dumps-Generation, 6Labs, 10wikitech.wikimedia.org, 13Patch-For-Review: Provide dumps of wikitech.wikimedia.org - https://phabricator.wikimedia.org/T54170#2079954 (10Dzahn) p:5Lowest>3Low [15:19:15] (03CR) 10Muehlenhoff: [C: 032 V: 032] Refresh version-script.patch for 1.0.2g [debs/openssl] - 10https://gerrit.wikimedia.org/r/274404 (owner: 10Muehlenhoff) [15:21:49] (03CR) 10Jcrespo: [C: 032] Test the new heartbeat functionality on m5-master [puppet] - 10https://gerrit.wikimedia.org/r/274377 (https://phabricator.wikimedia.org/T114752) (owner: 10Jcrespo) [15:23:28] RECOVERY - puppet last run on stat1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:23:28] hashar: horizon is back up and running Liberty. 
I’m going to have some breakfast and then will set up 2fa. When that’s turned on it’ll be the exact same login creds as wikitech (and Horizon will /only/ work for people with 2fa enabled.) [15:23:58] yay [15:24:12] 2fa for the win! [15:27:14] "what do you want for breakfast?" "I will have an OpenStack component upgrade please." [15:28:08] serves a 'kilo' of 'liberty' fries [15:30:19] (03CR) 10Dzahn: [C: 031] (WIP) Kill misc::limn & limn [puppet] - 10https://gerrit.wikimedia.org/r/231144 (owner: 10Faidon Liambotis) [15:32:31] thcipriani|afk: sigh, just saw the ping re: scap3, lmk when you are online [15:33:59] (03PS1) 10Muehlenhoff: Revert "Refresh version-script.patch for 1.0.2g" [debs/openssl] - 10https://gerrit.wikimedia.org/r/274407 [15:34:36] (03CR) 10Muehlenhoff: [C: 032 V: 032] Revert "Refresh version-script.patch for 1.0.2g" [debs/openssl] - 10https://gerrit.wikimedia.org/r/274407 (owner: 10Muehlenhoff) [15:38:23] 6Operations, 13Patch-For-Review: Reinstall redis servers (Job queues) with Jessie (NOTE: rdb1002 is special and is excluded!) - https://phabricator.wikimedia.org/T123675#2080000 (10elukey) Done: rdb1004 (precise) rdb1006 (trusty) Remaining: rdb1003 rdb1005 after we've confirmed the replication to 1006 work... [15:41:49] (03PS11) 10Andrew Bogott: Support totp auth in keystone [puppet] - 10https://gerrit.wikimedia.org/r/274167 (https://phabricator.wikimedia.org/T105690) [15:42:19] 6Operations, 10Dumps-Generation, 13Patch-For-Review: Migrate dataset1001 and ms1001 to jessie - https://phabricator.wikimedia.org/T123724#2080015 (10ArielGlenn) https://phabricator.wikimedia.org/P2698 Result of trying to PXE boot. Have tried: 1) disable disk boot (now it just loops through failed PXE boot... [15:44:40] !log may extend the maintenance window for dataset1001 upgrade if headway can be made on PXE boot issues... 
15 minutes left to decide [15:44:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:44:53] 6Operations, 10Dumps-Generation, 13Patch-For-Review: Migrate dataset1001 and ms1001 to jessie - https://phabricator.wikimedia.org/T123724#2080020 (10MoritzMuehlenhoff) Is that the same ixgbe hardware as in https://phabricator.wikimedia.org/T128068 ? [15:47:47] (03CR) 10Andrew Bogott: [C: 032] Support totp auth in keystone [puppet] - 10https://gerrit.wikimedia.org/r/274167 (https://phabricator.wikimedia.org/T105690) (owner: 10Andrew Bogott) [15:48:04] 6Operations, 10Dumps-Generation, 13Patch-For-Review: Migrate dataset1001 and ms1001 to jessie - https://phabricator.wikimedia.org/T123724#2080026 (10ArielGlenn) we never get to the debian installer. see the paste above. [15:49:09] 6Operations, 10MobileFrontend, 10Traffic, 5MW-1.27-release, and 6 others: Incorrect TOC and section edit links rendering in Vector due to ParserCache corruption via ParserOutput::setText( ParserOutput::getText() ) - https://phabricator.wikimedia.org/T124356#2080027 (10Jdlrobson) Another example https://nl... [15:49:30] (03PS1) 10Elukey: Remove rdb1003 from the job queue pool for maintenance. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/274411 (https://phabricator.wikimedia.org/T123675) [15:51:55] 6Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting analytics-privatedata-users access for catrope - https://phabricator.wikimedia.org/T128557#2080063 (10Nuria) @catrope: Eventloggimg data in hive is not so easily quary-able as it is on mysql , just an FYI. https://wikitech.wikimedia.org/wik... 
[15:52:09] ---^ ori, _joe_: the idea would be to start tomorrow morning EU time and wait for the complete drain of the job queues [15:52:25] (one code review up) [15:53:11] 6Operations, 10Dumps-Generation, 13Patch-For-Review: Migrate dataset1001 and ms1001 to jessie - https://phabricator.wikimedia.org/T123724#1936489 (10Dzahn) tried to get it to PXE boot as well.. enabled/disabled additional NICs, tried UEFI mode instead of BIOS: ``` PXE boot - Embedded NIC 1: Broadcom NetXtr... [15:53:27] !log restbase deploy start of fb66dbf on restbase1001 [15:53:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:53:42] !log extending maintenance window for dataset1001 by one hour to 5 pm UTC [15:53:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:53:49] godog: I'm online now [15:55:21] thcipriani: hey, I've pushed a couple of trivial changes to scap and retagged, going to upload 3.0.3 now [15:55:37] godog: awesome! Thank you! [15:56:07] (03PS15) 10Andrew Bogott: Support totp auth for horizon [puppet] - 10https://gerrit.wikimedia.org/r/274173 (https://phabricator.wikimedia.org/T105690) [15:56:09] (03PS1) 10Andrew Bogott: Fixed typo in config for the totp keystone plugin [puppet] - 10https://gerrit.wikimedia.org/r/274412 [15:56:59] thcipriani: also not sure about wikimediawiki.org in your email address in debian/changelog ?! :D [15:57:47] haha, ugh. I'll get that fixed. 
[15:57:59] * thcipriani shakes fist at dch (and self) [15:58:13] (03CR) 10Andrew Bogott: [C: 032] Fixed typo in config for the totp keystone plugin [puppet] - 10https://gerrit.wikimedia.org/r/274412 (owner: 10Andrew Bogott) [15:59:35] (03PS1) 10Muehlenhoff: Backport a change from 1.0.2g-1 to add the new exported symbols SRP_VBASE_get1_by_user and SRP_user_pwd_free [debs/openssl] - 10https://gerrit.wikimedia.org/r/274413 [15:59:37] (03PS1) 10CSteipp: Don't send Referer from private wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/274414 [16:00:04] anomie ostriches thcipriani marktraceur Krenair: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160302T1600). Please do the needful. [16:00:04] kart_: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [16:00:16] (03PS1) 10Jcrespo: Fixes for pt-heartbeat daemon init script (fails automatic runs) [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/274415 (https://phabricator.wikimedia.org/T114752) [16:01:15] (03CR) 10Jcrespo: [C: 032 V: 032] Fixes for pt-heartbeat daemon init script (fails automatic runs) [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/274415 (https://phabricator.wikimedia.org/T114752) (owner: 10Jcrespo) [16:01:23] here [16:01:45] kart_: I can SWAT [16:01:49] cool [16:03:00] (03PS1) 10Jcrespo: Refresh mariadb repo [puppet] - 10https://gerrit.wikimedia.org/r/274417 [16:03:22] (03CR) 10Jforrester: [C: 031] "Do this in the SWAT now?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/274414 (owner: 10CSteipp) [16:03:30] (03CR) 10Tim Landscheidt: "IIRC @Andrew uses separate directories for different versions (kilo, liberty, etc.) so that he can easily switch servers from one to the o" [puppet] - 10https://gerrit.wikimedia.org/r/274402 (owner: 10Dzahn) [16:03:58] thcipriani: Might have https://gerrit.wikimedia.org/r/274414 too. 
[16:04:00] James_F: ^ I'm on a train, but if someone wants to swat that, I'd appreciate it. [16:04:14] Or I can do it this afternoon [16:04:16] csteipp: I'll take care. [16:04:24] Thanks! [16:04:25] * greg-g +1's [16:04:29] James_F: will do [16:04:47] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/274414 (owner: 10CSteipp) [16:04:53] (03CR) 10Andrew Bogott: [C: 032] Support totp auth for horizon [puppet] - 10https://gerrit.wikimedia.org/r/274173 (https://phabricator.wikimedia.org/T105690) (owner: 10Andrew Bogott) [16:05:05] Now in the deployment calendar. :-) [16:05:12] (03Merged) 10jenkins-bot: Don't send Referer from private wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/274414 (owner: 10CSteipp) [16:06:44] (03PS2) 10Jcrespo: Refresh mariadb repo [puppet] - 10https://gerrit.wikimedia.org/r/274417 [16:07:15] James_F: <3 [16:07:18] 7Blocked-on-Operations, 6Operations, 10RESTBase, 10RESTBase-Cassandra, 13Patch-For-Review: Finish conversion to multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253#2080114 (10Eevans) On bootstrap timings: The numbers presented in [[https://phabricator.wikimedia.org/... [16:07:46] (03PS2) 10Filippo Giunchedi: statsdlb: add three statsite instances [puppet] - 10https://gerrit.wikimedia.org/r/273927 (https://phabricator.wikimedia.org/T105679) [16:07:53] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Do not send Referer from private wikis [[gerrit:274414]] (duration: 01m 18s) [16:07:54] ^ James_F sync'd [16:07:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:07:57] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] statsdlb: add three statsite instances [puppet] - 10https://gerrit.wikimedia.org/r/273927 (https://phabricator.wikimedia.org/T105679) (owner: 10Filippo Giunchedi) [16:08:07] * James_F tests. 
[16:08:41] Yup, working as expected, no referer [sic] request header on private wiki outbound traffic. [16:08:56] (03PS3) 10Jcrespo: Refresh mariadb repo [puppet] - 10https://gerrit.wikimedia.org/r/274417 [16:09:14] James_F: awesome. Thanks for checking. [16:10:11] 6Operations, 10Traffic, 10Wikipedia-Store, 7HTTPS: https://store.wikimedia.org doesn't set HSTS header - https://phabricator.wikimedia.org/T128559#2078914 (10Krenair) >>! In T128559#2079914, @Ppena wrote: > We don't have anyone with a tech/ops background working for the store at the moment < waiting volunt... [16:12:11] (03CR) 10Jcrespo: [C: 032] Refresh mariadb repo [puppet] - 10https://gerrit.wikimedia.org/r/274417 (owner: 10Jcrespo) [16:15:48] thcipriani: how are we going? [16:16:07] kart_: still waiting on jenkins for now https://integration.wikimedia.org/zuul/ [16:17:35] oh. It was showing 2 minutes at 5 minutes back. [16:17:58] yeah, seems like php55 is running slower than is typical [16:18:10] for testextension [16:18:30] that might be the scribunto change which made it faster for hhvm but slower for php55 [16:19:50] well it seems like the 2nd patch I +2 for the same extension is going to finish before the first. Might just be the box on which it's running? 
[16:24:04] thcipriani: never underestimate Zuul :) [16:24:40] :D [16:25:34] !log thcipriani@tin Synchronized php-1.27.0-wmf.15/extensions/ContentTranslation/modules/widgets/translator/ext.cx.translator.js: SWAT: Translator widget: Fix js error if translator does not have recent contributions [[gerrit:274340]] (duration: 01m 05s) [16:25:36] ^ kart_ check please [16:25:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:26:26] okay [16:27:44] 6Operations: upgrade 15+4 swift servers from precise to trusty - https://phabricator.wikimedia.org/T125024#2080224 (10fgiunchedi) `ms-be1001` to `ms-be1003` upgraded, `ms-fe` pending merge of https://gerrit.wikimedia.org/r/#/c/273431/ tomorrow [16:28:14] !log starting post-bootstrap (1009-b) cleanup on restbase100{5,6,9-a}.eqiad.wmnet : T95253 [16:28:15] T95253: Finish conversion to multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253 [16:28:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:28:53] thcipriani: look OK [16:28:58] thanks! [16:28:58] (03PS1) 10Jcrespo: Previous heartbeat fix was not enough (introduced extra errors) [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/274423 (https://phabricator.wikimedia.org/T114752) [16:29:01] kart_: cool, thanks! [16:29:15] next one is going out now. 
[16:29:55] (03PS1) 10Andrew Bogott: Replace single-quotes around a variable ref [puppet] - 10https://gerrit.wikimedia.org/r/274425 [16:29:57] (03CR) 10Jcrespo: [C: 032 V: 032] Previous heartbeat fix was not enough (introduced extra errors) [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/274423 (https://phabricator.wikimedia.org/T114752) (owner: 10Jcrespo) [16:29:59] 6Operations, 13Patch-For-Review, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Switchover of the application servers to codfw - https://phabricator.wikimedia.org/T124671#1962078 (10Elitre) (Possibly silly question alert: in case things went wrong, would this be reflected on http://status.wikimedia.org/ , or... [16:29:59] thcipriani: I wonder if https://gerrit.wikimedia.org/r/#/c/273936/ could be put for SWAT too. I forgot to log it in deployments, but it's pretty simple. [16:30:06] !log thcipriani@tin Synchronized php-1.27.0-wmf.15/extensions/ContentTranslation/includes/TranslationStorageManager.php: SWAT: Use correct timestamp for updates [[gerrit:274363]] (duration: 00m 59s) [16:30:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:30:12] ^ kart_ check please [16:30:25] needs creating the shorturl tables [16:30:51] (03PS1) 10Jcrespo: Update mariadb subrepo [puppet] - 10https://gerrit.wikimedia.org/r/274426 [16:31:01] !log restbase deploy continue of fb66dbf for the rest of the nodes [16:31:02] (03PS2) 10Jcrespo: Update mariadb subrepo [puppet] - 10https://gerrit.wikimedia.org/r/274426 [16:31:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:31:12] (03CR) 10jenkins-bot: [V: 04-1] Update mariadb subrepo [puppet] - 10https://gerrit.wikimedia.org/r/274426 (owner: 10Jcrespo) [16:31:14] (03CR) 10Jcrespo: [C: 032 V: 032] Update mariadb subrepo [puppet] - 10https://gerrit.wikimedia.org/r/274426 (owner: 10Jcrespo) [16:31:21] (03CR) 10jenkins-bot: [V: 04-1] Replace single-quotes around a variable ref [puppet] - 
10https://gerrit.wikimedia.org/r/274425 (owner: 10Andrew Bogott) [16:31:46] mafk: sure, put it on the deployments page. [16:31:52] will do [16:32:11] (03PS2) 10Andrew Bogott: Replace single-quotes around a variable ref [puppet] - 10https://gerrit.wikimedia.org/r/274425 [16:33:17] thcipriani: checking. [16:33:40] (03PS3) 10Andrew Bogott: Replace single-quotes around a variable ref [puppet] - 10https://gerrit.wikimedia.org/r/274425 [16:33:58] done [16:34:08] !log elastic1008.eqiad.wmnet: upgrading to 1.7.5, shipping logs to logstash (T122697, T109101) [16:34:10] T109101: Currently elasticsearch logs do not leave nodes. We use logstash for this across the cluster generally. - https://phabricator.wikimedia.org/T109101 [16:34:10] T122697: Upgrade ElasticSearch to 1.7.5 - https://phabricator.wikimedia.org/T122697 [16:34:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:35:03] (03CR) 10Andrew Bogott: [C: 032] Replace single-quotes around a variable ref [puppet] - 10https://gerrit.wikimedia.org/r/274425 (owner: 10Andrew Bogott) [16:35:26] (03PS2) 10MarcoAurelio: Enabling ShortURL for bnwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/273936 (https://phabricator.wikimedia.org/T127968) [16:35:45] mafk: this needs the populateShortUrlTable.php script is that right? How long does that normally take? 
[16:36:00] thcipriani: for such a small wiki, only few minutes [16:36:19] but needs to be done from tin, not mira IIRC [16:36:39] (03PS1) 10Jcrespo: Revert "Previous heartbeat fix was not enough (introduced extra errors)" [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/274427 [16:36:50] (03CR) 10Jcrespo: [C: 032 V: 032] Revert "Previous heartbeat fix was not enough (introduced extra errors)" [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/274427 (owner: 10Jcrespo) [16:38:56] (03PS1) 10Jcrespo: Fix parameter s/m5/$1/ [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/274428 [16:39:27] (03CR) 10Jcrespo: [C: 032 V: 032] Fix parameter s/m5/$1/ [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/274428 (owner: 10Jcrespo) [16:39:36] (03PS1) 10Andrew Bogott: Define keystone_host before referring to it. [puppet] - 10https://gerrit.wikimedia.org/r/274429 [16:39:43] !log restbase deploy end of fb66dbf [16:39:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:40:18] (03PS1) 10Jcrespo: Update mariadb repo [puppet] - 10https://gerrit.wikimedia.org/r/274430 [16:40:34] thcipriani: was bit difficult to test on testwiki, but seems OK. Thanks! [16:40:39] Sorry for taking long time. [16:40:41] kart_: cool, thanks! [16:41:01] (03CR) 10Jcrespo: [C: 032 V: 032] Update mariadb repo [puppet] - 10https://gerrit.wikimedia.org/r/274430 (owner: 10Jcrespo) [16:41:02] 6Operations, 15User-mobrovac, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Create a service location / discovery system for locating local/master resources easily across all WMF applications - https://phabricator.wikimedia.org/T125069#2080283 (10GWicke) @Joe, I think we can move ahead with DNS, especially... [16:42:06] (03CR) 10Andrew Bogott: [C: 032] Define keystone_host before referring to it. [puppet] - 10https://gerrit.wikimedia.org/r/274429 (owner: 10Andrew Bogott) [16:42:22] (03PS2) 10Andrew Bogott: Define keystone_host before referring to it. 
[puppet] - 10https://gerrit.wikimedia.org/r/274429 [16:44:40] thcipriani: I'm about to enable https://gerrit.wikimedia.org/r/273927 which will make statsd bounce, can you let me know when swat is finished and I'll go ahead? [16:44:50] godog: will do [16:46:49] 6Operations, 10Deployment-Systems, 6Performance-Team, 10Traffic, and 2 others: Make Varnish cache for /static/$wmfbranch/ expire when resources change within branch lifetime - https://phabricator.wikimedia.org/T99096#2080308 (10Krinkle) [16:47:17] bblack, moritzm: thanks for getting the logstash processes back up and running. Those OOM crashes only started a few weeks ago, but are already highly annoying. [16:50:00] (03PS1) 10Alexandros Kosiaris: Bring templates up to date with 5.0.7 [software/otrs] - 10https://gerrit.wikimedia.org/r/274436 (https://phabricator.wikimedia.org/T108834) [16:50:44] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/273936 (https://phabricator.wikimedia.org/T127968) (owner: 10MarcoAurelio) [16:50:49] 6Operations, 15User-mobrovac, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Create a service location / discovery system for locating local/master resources easily across all WMF applications - https://phabricator.wikimedia.org/T125069#2080321 (10faidon) I'd really prefer it if we would avoid a DNS-based so... [16:50:57] hashar: will you give horizon a try now? [16:51:15] andrewbogott: finish up a task and going to have a meeting in 8 minutes sorry ;-. 
[16:51:16] :( [16:51:24] hashar: ok, well, when you have time :) [16:52:05] (03Merged) 10jenkins-bot: Enabling ShortURL for bnwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/273936 (https://phabricator.wikimedia.org/T127968) (owner: 10MarcoAurelio) [16:53:39] 6Operations, 10DBA, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: prepare for mariadb 10.0 masters - https://phabricator.wikimedia.org/T105135#2080343 (10jcrespo) [16:54:18] 6Operations, 10DBA, 10MediaWiki-Configuration, 6Release-Engineering-Team, and 3 others: codfw is in read only according to mediawiki - https://phabricator.wikimedia.org/T124795#2080347 (10jcrespo) [16:54:45] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Enabling ShortURL for bnwikisource [[gerrit:273936]] (duration: 01m 04s) [16:54:48] ^ mafk check please [16:54:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:54:58] yes sir [16:55:47] thcipriani: works [16:56:00] mafk: awesome. Thanks for checking! [16:56:04] (03PS2) 10Alexandros Kosiaris: Bring templates up to date with 5.0.7 [software/otrs] - 10https://gerrit.wikimedia.org/r/274436 (https://phabricator.wikimedia.org/T108834) [16:56:08] godog: SWAT's complete [16:57:03] (03PS1) 10ArielGlenn: use 10gb nic mac addy for dataset1001 in dhcp [puppet] - 10https://gerrit.wikimedia.org/r/274439 [16:57:20] 6Operations, 6Performance-Team, 13Patch-For-Review, 7Performance: Update HHVM package to recent release - https://phabricator.wikimedia.org/T119637#2080356 (10Joe) While tidy has been fixed with the latest patchset I created, wikidiff2 needs some amending (still not sure how much). Will report any progres... [16:57:39] thcipriani: sweet, thanks! 
[16:57:50] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Bring templates up to date with 5.0.7 [software/otrs] - 10https://gerrit.wikimedia.org/r/274436 (https://phabricator.wikimedia.org/T108834) (owner: 10Alexandros Kosiaris) [16:58:00] (03PS2) 10ArielGlenn: use 10gb nic mac addy for dataset1001 in dhcp [puppet] - 10https://gerrit.wikimedia.org/r/274439 [16:59:40] (03CR) 10ArielGlenn: [C: 032] use 10gb nic mac addy for dataset1001 in dhcp [puppet] - 10https://gerrit.wikimedia.org/r/274439 (owner: 10ArielGlenn) [16:59:50] PROBLEM - puppet last run on mw2050 is CRITICAL: CRITICAL: puppet fail [17:06:25] (03PS1) 10Andrew Bogott: Re-enable instance manipulation in horizon. [puppet] - 10https://gerrit.wikimedia.org/r/274440 [17:07:14] (03PS4) 1020after4: Parameterize the git_server variable in global scap.cfg [puppet] - 10https://gerrit.wikimedia.org/r/272947 (https://phabricator.wikimedia.org/T126259) [17:08:22] (03PS5) 1020after4: Parameterize the git_server variable in global scap.cfg [puppet] - 10https://gerrit.wikimedia.org/r/272947 (https://phabricator.wikimedia.org/T126259) [17:10:26] 6Operations, 15User-mobrovac, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Create a service location / discovery system for locating local/master resources easily across all WMF applications - https://phabricator.wikimedia.org/T125069#2080421 (10GWicke) Local reconfiguration requires service restarts, so i... [17:12:34] http://dumps.wikimedia.org/ down? [17:13:03] apergos: ^^ [17:13:11] yes, still down [17:13:23] I sent an email to the list (a second email) that the maintenance window is still ongoing [17:13:30] and we're over it a bit but still not done [17:13:50] I hope to get done in the next 45 minutes [17:16:05] (03PS1) 10Alexandros Kosiaris: Update sopm description [software/otrs] - 10https://gerrit.wikimedia.org/r/274442 [17:16:38] ah, ok, thx [17:17:04] sorry for the inconvenience [17:18:06] RoanKattouw, gwicke: what can you tell me about the labs instance ‘parsoid-spof’?
I’m wondering if it can be deleted or, failing that, slimmed down and migrated to another virt host. [17:19:07] 6Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting analytics-privatedata-users access for catrope - https://phabricator.wikimedia.org/T128557#2080459 (10Catrope) >>! In T128557#2080063, @Nuria wrote: > @catrope: Eventloggimg data in hive is not so easily quary-able as it is on mysql , just an... [17:20:58] !log elastic1009.eqiad.wmnet: upgrading to 1.7.5, shipping logs to logstash (T122697, T109101) [17:21:00] T109101: Currently elasticsearch logs do not leave nodes. We use logstash for this across the cluster generally. - https://phabricator.wikimedia.org/T109101 [17:21:00] T122697: Upgrade ElasticSearch to 1.7.5 - https://phabricator.wikimedia.org/T122697 [17:21:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:24:45] andrewbogott: I think it had to do with their test infra? But I had no hand in that one [17:25:30] RoanKattouw: who then? [17:25:34] any guesses? [17:26:12] subbu may know [17:26:44] :-) [17:26:53] will respond after. [17:28:38] RoanKattouw: does your team have a task for updating Flow to talk to Parsoid via RESTBase (instead of directly)? [17:28:49] (I looked but couldn't find one) [17:28:49] RECOVERY - puppet last run on mw2050 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:29:04] Yeah there is one somewhere, lemme find it [17:29:10] 6Operations, 6Labs, 10Tool-Labs: Get rid of Tool Labs home page check from shinken - https://phabricator.wikimedia.org/T128615#2080500 (10yuvipanda) [17:29:34] Spoiler: it's blocked on fixing data-parsoid loss which currently causes Parsoid to 500 when round-tripping wikitext like {{{foo}}} [17:30:14] yikes [17:30:20] Oh the title was changed, that's why: changed the title from "Migrate Flow to talk with RESTBase instead of Parsoid" to "Update Flow for Parsoid changes re data-mw". 
[17:30:26] https://phabricator.wikimedia.org/T124837 and subtasks [17:30:51] RoanKattouw: excellent; thanks [17:31:02] 6Operations, 10CirrusSearch, 6Discovery, 3Discovery-Search-Sprint, and 4 others: Look into encrypting Elasticsearch traffic - https://phabricator.wikimedia.org/T124444#2080513 (10EBernhardson) For the php end of things, whatever SSL certs are put together need to be provided to php, we use curl to talk to... [17:31:16] That whole task tree is different from how I remember it because our team had a meeting with the Parsing and Services teams while I was in Australia [17:31:44] andrewbogott: I am off sorry can't test Horizon today ! [17:31:54] andrewbogott: but maybe some releng / labs folks would be happy to help [17:31:56] ori: Aha, https://phabricator.wikimedia.org/T94574 is what I was thinking of [17:33:24] ori: The 500 bug (which isn't linked apparently) is https://phabricator.wikimedia.org/T113044 [17:33:31] (03PS1) 10Volans: Repool es2005, es2007, es2009 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/274447 (https://phabricator.wikimedia.org/T127330) [17:33:36] RoanKattouw: that's very helpful -- thanks [17:35:01] !log disabling puppet on db1009 (m5-master) to test heartbeat changes [17:35:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:35:17] ori: BTW, while I have you -- do you know if there are any multi-DC-related things my team needs to address before the Dallaspocalypse at the end of this month? [17:35:34] Like, perhaps little things like https://gerrit.wikimedia.org/r/250761 still not being merged, or large ones like Flow's memc usage [17:36:08] OTOH maybe I misunderstand the plan and all we're doing is cutting over completely to codfw without going to a dual DC setup? [17:36:18] cutting over completely [17:36:21] 6Operations, 10Dumps-Generation, 13Patch-For-Review: Migrate dataset1001 and ms1001 to jessie - https://phabricator.wikimedia.org/T123724#1936489 (10RobH) >>! 
In T123724#2080020, @MoritzMuehlenhoff wrote: > Is that the same ixgbe hardware as in https://phabricator.wikimedia.org/T128068 ? Nope, those use the... [17:36:22] RoanKattouw: can Flow handle being pointed to an empty memcached clusteR? [17:36:33] RoanKattouw, reg. the flow / parsoid / restbase discussion .. i think this is orthogonal to eqiad -> codfw switching. [17:36:47] subbu: It is, I just went there because I had ori's attention [17:36:52] ok. [17:36:57] ori: I think we already don't use memc currently, but I'll check [17:37:07] you definitely do, i saw it yesterday [17:37:22] (03CR) 10Krinkle: Add public-wiki-rewrites to wikitech (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/273410 (https://phabricator.wikimedia.org/T99096) (owner: 10Alex Monk) [17:37:33] There's something about how we don't update data in memc any more but I forget [17:37:43] mlitn: You around? Do you know the answers to the above? ---^^ [17:40:29] 6Operations, 10CirrusSearch, 6Discovery, 3Discovery-Search-Sprint, and 4 others: Look into encrypting Elasticsearch traffic - https://phabricator.wikimedia.org/T124444#2080533 (10EBernhardson) Also, mostly for reference, switching user search traffic from eqiad to codfw is changing this line and deploying... [17:44:06] RoanKattouw: sorry for the sluggish response. We were in a meeting about the codfw failover, and there was an item for Flow, but we determined it was not relevant [17:44:19] !log bounce statsdlb on graphite1001 to add 3x statsite instances T105679 [17:44:20] T105679: add more statsite processes to get more balanced load - https://phabricator.wikimedia.org/T105679 [17:44:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:44:26] Do you remember what the item was? 
[17:45:16] RoanKattouw: it only came up as subbu mentioned that all other accesses are through RB these days [17:46:18] we'll just need to remember to update the Parsoid URL used by Flow when we switch over [17:46:19] Ah OK [17:46:19] Yeah we need to be able to access Parsoid directly, but that shouldn't be a problem hopefully [17:46:19] We'd have to not forget the relevant MW config but I think that's it [17:46:36] right. I had added it in case there was anything to be done for Flow. [17:46:41] I updated https://phabricator.wikimedia.org/T127974#2080565 [17:46:46] which should cover that now. [17:47:43] on wikitech, there is a /a/backup/public with dumps, but even though it's called public, it actually is "Forbidden" (https://wikitech.wikimedia.org/dumps) because the Apache config says "Require host wikitech-static.wikimedia.org". Is there any reason to do that? Can we simply make it public? [17:47:48] RoanKattouw gwicke ori ^^ [17:47:59] thanks subbu [17:48:14] bblack also fyi there in the context of your remark about potentially other smaller parsoid services. [17:51:56] (03CR) 10Volans: [C: 032] Repool es2005, es2007, es2009 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/274447 (https://phabricator.wikimedia.org/T127330) (owner: 10Volans) [17:52:10] 6Operations, 10Dumps-Generation, 6Labs, 10wikitech.wikimedia.org, 13Patch-For-Review: Provide dumps of wikitech.wikimedia.org - https://phabricator.wikimedia.org/T54170#2080589 (10Dzahn) >>! In T54170#1410096, @ArielGlenn wrote: > should we host the lastest dump on dumps.wm.org? We can open up the dumps... [17:52:24] (03Merged) 10jenkins-bot: Repool es2005, es2007, es2009 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/274447 (https://phabricator.wikimedia.org/T127330) (owner: 10Volans) [17:55:06] 6Operations, 10CirrusSearch, 6Discovery, 3Discovery-Search-Sprint, and 4 others: Look into encrypting Elasticsearch traffic - https://phabricator.wikimedia.org/T124444#2080612 (10demon) >>! 
In T124444#2080513, @EBernhardson wrote: > For the php end of things, whatever SSL certs are put together need to be... [17:55:11] !log volans@tin Synchronized wmf-config/db-codfw.php: Repooling external storage DBs in codfw after data was copied: T127330 (duration: 01m 06s) [17:55:12] T127330: Migration from es2001-es2010 to es2011-es2019 - https://phabricator.wikimedia.org/T127330 [17:55:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:55:46] 6Operations, 10Dumps-Generation, 6Labs, 10wikitech.wikimedia.org, 13Patch-For-Review: Provide dumps of wikitech.wikimedia.org - https://phabricator.wikimedia.org/T54170#2080618 (10Dzahn) @Andrew @yuvipanda any thoughts on the restriction on that /a/backup/public ? please see https://gerrit.wikimedia.org... [17:57:28] 6Operations, 10Traffic, 10Wikipedia-Store, 7HTTPS: https://store.wikimedia.org doesn't set HSTS header - https://phabricator.wikimedia.org/T128559#2080625 (10Dzahn) Yea, i was gonna say, that would be possible if store was running on our own infrastructure, but it's all external on shopify.com. [18:00:19] 6Operations, 10RESTBase, 13Patch-For-Review: install restbase1010-restbase1015 - https://phabricator.wikimedia.org/T128107#2080634 (10fgiunchedi) >>! In T128107#2072930, @fgiunchedi wrote: > thanks @Cmjohnson ! I could successfully install restbase1010 after fixing partman in https://gerrit.wikimedia.org/r/2... [18:00:19] (03PS1) 10Jcrespo: Adding full paths to perl for pt-heartbeat execution [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/274454 [18:01:29] (03PS2) 10Jcrespo: Adding full paths to perl for pt-heartbeat execution [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/274454 [18:05:12] 6Operations, 10Traffic, 10Wikipedia-Store, 7HTTPS: https://store.wikimedia.org doesn't set HSTS header - https://phabricator.wikimedia.org/T128559#2080643 (10Dzahn) >>! 
In T128559#2079914, @Ppena wrote: > Can you please explain a little what setting the HSTS header means and what is the urgency on this? Th... [18:05:45] (03PS2) 10Elukey: Preiliminary port to new VSL API [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/274135 (https://phabricator.wikimedia.org/T124278) (owner: 10Ema) [18:06:01] gwicke: sorry if I missed this in the backscroll… what can you tell me about parsoid-spof? [18:06:06] (03PS3) 10Elukey: Preiliminary port to new VSL API [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/274135 (https://phabricator.wikimedia.org/T124278) (owner: 10Ema) [18:06:27] !log still slugging away at pxe book with these broadcom netxtreme II nics (dataset1001) [18:06:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:06:51] andrewbogott: just commented in -labs [18:07:14] gwicke: got it, thanks [18:07:50] (03CR) 10Elukey: [C: 04-1] "Still not compiling, early stage of dev." [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/274135 (https://phabricator.wikimedia.org/T124278) (owner: 10Ema) [18:08:54] Hi ori! [18:09:28] just wanted to make sure that https://gerrit.wikimedia.org/r/#/c/274411/1 is good [18:10:01] I'll start tomorrow, let me know if I need to do more to de-pool rdb1003 or if it is ok in your opinion.. I am going offline but I'll read later on :) [18:13:02] 6Operations, 10Traffic, 7HTTPS, 7Security-Other: Server certificate is classified as invalid on government computers - https://phabricator.wikimedia.org/T128182#2080687 (10Florian) [18:13:27] 6Operations: OCG needs to migrate away from rdb1002 and get its own Redis instance - https://phabricator.wikimedia.org/T128491#2080693 (10Dzahn) >>! In T128491#2076729, @Danny_B wrote: > Please assure you removed the inappropriate project tags from the subtask next time. Thank you. fwiw: I think "Patch-For-Rev... 
[18:15:27] 6Operations, 10CirrusSearch, 6Discovery, 3Discovery-Search-Sprint, and 4 others: Look into encrypting Elasticsearch traffic - https://phabricator.wikimedia.org/T124444#2080695 (10EBernhardson) It looks like elastica library has a 'curl' config option on the Connection object that holds the array of k=>v pa... [18:15:36] 6Operations, 10Traffic, 7HTTPS, 7Security-Other: Server certificate is classified as invalid on government computers - https://phabricator.wikimedia.org/T128182#2080697 (10Dzahn) 5Open>3stalled stalled - needs DoD TechOps person on this ticket :p [18:16:30] mutante: DoD TechOps? [18:17:10] SPF|Cloud: the ops of the US Army and stuff [18:17:29] DoD = Department of Defense [18:17:35] okay :) [18:19:13] so they can read Wikipedia again on the base [18:19:41] there are OTRS tickets [18:20:58] (somebody on their side still needs to fix it) [18:30:01] (03PS2) 10Muehlenhoff: Backport a change from 1.0.2g-1 to add the new exported symbols SRP_VBASE_get1_by_user and SRP_user_pwd_free [debs/openssl] - 10https://gerrit.wikimedia.org/r/274413 [18:30:26] (03PS2) 10Ottomata: eventlogging: Allow processor format strings to be configurable via hiera [puppet] - 10https://gerrit.wikimedia.org/r/274286 (owner: 10Madhuvishy) [18:30:53] (03CR) 10Dzahn: Puppetise yubikey key storage module (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/274358 (owner: 10Muehlenhoff) [18:32:18] RECOVERY - puppet last run on stat1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:33:34] (03CR) 10Muehlenhoff: Puppetise yubikey key storage module (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/274358 (owner: 10Muehlenhoff) [18:34:00] mutante: you, for any chance, don't know any tech-ops person in the DoD? 
(https://phabricator.wikimedia.org/T128182) :P [18:35:03] FlorianSW: no, i don't , and i'm not a US citizen [18:35:48] PROBLEM - puppet last run on snapshot1003 is CRITICAL: CRITICAL: Puppet has 2 failures [18:36:11] maybe if the blog would write about it ?:) [18:36:48] (03PS1) 10GWicke: Temporarily reduce cache timeout until purging works [puppet] - 10https://gerrit.wikimedia.org/r/274456 [18:37:22] FlorianSW: can we ask communications to find one ? [18:37:42] (03PS2) 10GWicke: Temporarily reduce cache timeout until purging works [puppet] - 10https://gerrit.wikimedia.org/r/274456 (https://phabricator.wikimedia.org/T127387) [18:38:27] mutante: I'm not sure, maybe, I already asked the people reporting this to us to ask their tech support, but I think they mostly doesn't have the time to go this way :( (I'm not an US citizien, too, so I probably don't see any possible way for us). [18:38:40] Do you have a contact in communications we could ask (or a phab tag)? [18:39:05] (03CR) 1020after4: "Fixed" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/272947 (https://phabricator.wikimedia.org/T126259) (owner: 1020after4) [18:41:23] FlorianSW: hmm. best i have is https://wikimediafoundation.org/wiki/Staff_and_contractors#Communications [18:41:53] FlorianSW: we'd have to check which of these people are on phab [18:42:02] mutante: ok, I simply will write an e-mail to one of them asking if they can help :) [18:42:04] or that [18:42:10] ok, cool [18:42:28] 6Operations, 10Dumps-Generation, 13Patch-For-Review: Migrate dataset1001 and ms1001 to jessie - https://phabricator.wikimedia.org/T123724#2080763 (10ArielGlenn) We are well over our original slotted maintenance time so I have rolled back everything and all services are again operational. The nfs mount shoul... 
[18:42:38] FlorianSW: oh, right, i know Heather is on phab [18:43:22] mutante: ah, great, I'll add her and ask if she can help :) [18:43:31] thanks [18:44:05] !log rolled back all changes for dataset1001, running with same old precise OS, grrrrr [18:44:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:44:22] what a waste of several hours [18:45:57] (03PS3) 10Madhuvishy: eventlogging: Change client side processor format string to ignore ClientIP [puppet] - 10https://gerrit.wikimedia.org/r/274286 [18:46:53] 6Operations, 10Traffic, 7HTTPS, 7Security-Other: Server certificate is classified as invalid on government computers - https://phabricator.wikimedia.org/T128182#2080772 (10Florian) @Heather: As far as I can see (and with the big help @Dzahn, who pointed me to this approach :)) it seems you're in the Commun... [18:47:18] 6Operations, 10Traffic, 6WMF-Communications, 7HTTPS, 7Security-Other: Server certificate is classified as invalid on government computers - https://phabricator.wikimedia.org/T128182#2080775 (10Florian) [18:47:28] mutante: I found https://phabricator.wikimedia.org/tag/wmf-communications/ :P :) [18:48:02] FlorianSW: ah :) good to know. 
nice [18:48:51] (03PS1) 10Andrew Bogott: Change the wikitech favicon and logo to the actual wikitech logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/274457 [18:50:31] bd808: ^ [18:50:52] I’m not loving how that .png rendered, alternatives welcome [18:51:25] (03PS3) 10Jcrespo: Adding full paths to perl for pt-heartbeat execution [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/274454 [18:51:55] (03CR) 10Jcrespo: [C: 032 V: 032] Adding full paths to perl for pt-heartbeat execution [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/274454 (owner: 10Jcrespo) [18:52:26] 6Operations, 6Labs, 10Labs-Infrastructure: Estimate hardware requirements for relevance lab elasticsearch servers - https://phabricator.wikimedia.org/T128433#2080779 (10EBernhardson) [18:52:37] lol, andrewbogott perfectionist [18:52:37] (03PS2) 10Dzahn: Puppetise yubikey key storage module [puppet] - 10https://gerrit.wikimedia.org/r/274358 (owner: 10Muehlenhoff) [18:52:44] 6Operations, 6Labs, 10Labs-Infrastructure: Estimate hardware requirements for relevance lab elasticsearch servers - https://phabricator.wikimedia.org/T128433#2074588 (10EBernhardson) [18:52:45] (03CR) 10Dzahn: [C: 032] Puppetise yubikey key storage module [puppet] - 10https://gerrit.wikimedia.org/r/274358 (owner: 10Muehlenhoff) [18:53:11] jynus: In this case, I’m only a perfectionist if someone else is doing it :) [18:53:37] for me it is the other way round [18:54:10] andrewbogott: hmm.. 
those fine line do seem to make strange artifacts [18:54:26] mw's thumbnailer doens't do any better [18:54:47] maybe I can anti-alias [18:54:48] * andrewbogott tries it [18:54:59] (03Abandoned) 10JGirault: Bump portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268713 (owner: 10JGirault) [18:56:17] (03CR) 10Mobrovac: [C: 031] "Given the situation, seems like the most pragmatic solution at this point" [puppet] - 10https://gerrit.wikimedia.org/r/274456 (https://phabricator.wikimedia.org/T127387) (owner: 10GWicke) [18:56:59] (03PS1) 10GWicke: WIP: Add a cluster_be_recv_pre_purge handler & normalize paths [puppet] - 10https://gerrit.wikimedia.org/r/274458 (https://phabricator.wikimedia.org/T127387) [18:57:29] (03PS1) 10Jcrespo: Update mariadb repo [puppet] - 10https://gerrit.wikimedia.org/r/274459 [18:58:00] (03PS2) 10Jcrespo: Update mariadb repo [puppet] - 10https://gerrit.wikimedia.org/r/274459 [18:58:13] (03CR) 10Jcrespo: [C: 032 V: 032] Update mariadb repo [puppet] - 10https://gerrit.wikimedia.org/r/274459 (owner: 10Jcrespo) [18:58:19] hm, nope, looks like that’s what we get without calling in someone from design. [18:58:46] !log elastic1010.eqiad.wmnet: upgrading to 1.7.5, shipping logs to logstash (T122697, T109101) [18:58:47] T109101: Currently elasticsearch logs do not leave nodes. We use logstash for this across the cluster generally. - https://phabricator.wikimedia.org/T109101 [18:58:47] T122697: Upgrade ElasticSearch to 1.7.5 - https://phabricator.wikimedia.org/T122697 [18:58:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:59:39] andrewbogott: the ugliness will bug Krenair and he will fix it ;) [18:59:49] perfect [19:00:43] what, graphics design? [19:01:32] Krenair: we’re talking about https://gerrit.wikimedia.org/r/#/c/274457/ [19:02:12] (03CR) 10GWicke: "I think this is ready for review. I have not actually tested this patch." 
[puppet] - 10https://gerrit.wikimedia.org/r/274458 (https://phabricator.wikimedia.org/T127387) (owner: 10GWicke) [19:05:30] RECOVERY - puppet last run on snapshot1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:05:52] * Krenair shrugs [19:06:03] doesn't seem terrible to me at the moment [19:07:09] !log Data transfer completed, started MySQL and replica on es2014,es2016,es2018 [ T127330 ] [19:07:10] T127330: Migration from es2001-es2010 to es2011-es2019 - https://phabricator.wikimedia.org/T127330 [19:07:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:08:30] (03PS2) 10GWicke: Add a cluster_be_recv_pre_purge handler & normalize paths [puppet] - 10https://gerrit.wikimedia.org/r/274458 (https://phabricator.wikimedia.org/T127387) [19:11:57] (03CR) 10BBlack: "In a quick check, it looks good. There's definitely one tiny thing missing though, which is adding 'normalize_path' to the set of extra_v" [puppet] - 10https://gerrit.wikimedia.org/r/274458 (https://phabricator.wikimedia.org/T127387) (owner: 10GWicke) [19:12:18] (03PS3) 10BBlack: Temporarily reduce cache timeout until purging works [puppet] - 10https://gerrit.wikimedia.org/r/274456 (https://phabricator.wikimedia.org/T127387) (owner: 10GWicke) [19:13:41] (03CR) 10BBlack: [C: 032] Temporarily reduce cache timeout until purging works [puppet] - 10https://gerrit.wikimedia.org/r/274456 (https://phabricator.wikimedia.org/T127387) (owner: 10GWicke) [19:16:41] 6Operations, 10Traffic, 10Wikipedia-Store, 7HTTPS: https://store.wikimedia.org doesn't set HSTS header - https://phabricator.wikimedia.org/T128559#2080900 (10Ppena) @Dzahn got it, thanks for explaining in plain english, Dan ;) I will email Shopify and ask them about it, but unfortunately I don't think we... 
[19:18:07] (03PS6) 1020after4: Parameterize the git_server variable in global scap.cfg [puppet] - 10https://gerrit.wikimedia.org/r/272947 (https://phabricator.wikimedia.org/T126259) [19:19:39] in #-tech: I get 80% packet lost to wikipedia starting at ae2.cr1-esams.wikimedia.org -- known? [19:19:46] (just now) [19:21:32] !log restbase rolling restart for https://gerrit.wikimedia.org/r/274456 T127387 [19:21:32] T127387: Split slash decoding from general percent normalization in Varnish VCL - https://phabricator.wikimedia.org/T127387 [19:21:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:22:55] 6Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting analytics-privatedata-users access for catrope - https://phabricator.wikimedia.org/T128557#2080940 (10Nuria) >OK, so what do you recommend I use for querying EL data when MySQL is too slow? Patience..... (kidiing but not really) or a ping to... [19:29:59] andrewbogott: let's have just one wikitech apache config instead of one per openstack release? https://gerrit.wikimedia.org/r/#/c/274402/ [19:30:43] (03PS1) 10Jcrespo: Changing the service to a command execution [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/274467 [19:31:11] (03CR) 10Jcrespo: [C: 032 V: 032] Changing the service to a command execution [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/274467 (owner: 10Jcrespo) [19:31:21] mutante: Although there has been (so far) no change between kilo and liberty, I’m not positive that there won’t be changes between future versions... [19:31:23] * andrewbogott reads the config [19:31:45] 6Operations, 13Patch-For-Review, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Switchover of the application servers to codfw - https://phabricator.wikimedia.org/T124671#1962078 (10tomasz) >>! In T124671#2080234, @Elitre wrote: > (Possibly silly question alert: in case things went wrong, would this be refle... 
[19:32:03] hm, ok, nevermind, I’m convinced [19:32:14] andrewbogott: thanks. the thing i actually care about is https://gerrit.wikimedia.org/r/#/c/274405/ and just when i wanted to add that i noticed there are 2 configs, and that made me create this other patch [19:32:30] (03PS1) 10Jcrespo: Update mariadb repo [puppet] - 10https://gerrit.wikimedia.org/r/274469 [19:32:38] (03PS2) 10Jcrespo: Update mariadb repo [puppet] - 10https://gerrit.wikimedia.org/r/274469 [19:32:45] (03CR) 10Andrew Bogott: [C: 031] "I think merging these is fine... I don't really anticipate differences in apache that will correspond with openstack version changes." [puppet] - 10https://gerrit.wikimedia.org/r/274402 (owner: 10Dzahn) [19:32:52] :) [19:33:05] (03CR) 10Jcrespo: [C: 032 V: 032] Update mariadb repo [puppet] - 10https://gerrit.wikimedia.org/r/274469 (owner: 10Jcrespo) [19:33:59] 6Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting analytics-privatedata-users access for catrope - https://phabricator.wikimedia.org/T128557#2080990 (10Catrope) 5Open>3Invalid >>! In T128557#2080940, @Nuria wrote: >>OK, so what do you recommend I use for querying EL data when MySQL is to... [19:34:49] (03CR) 10Andrew Bogott: "This is fine with me in principle but I'd want to check the code to make sure there isn't any user info landing in those files." 
[puppet] - 10https://gerrit.wikimedia.org/r/274405 (https://phabricator.wikimedia.org/T54170) (owner: 10Dzahn) [19:34:54] (03PS1) 10Jdlrobson: Enable reference storage on Japanese Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/274470 (https://phabricator.wikimedia.org/T126802) [19:36:24] greg-g, my understanding of our talk that we can go to testwiki immediately and technews is needed when we go to content wikis [19:36:39] (03CR) 10BryanDavis: [C: 031] Change the wikitech favicon and logo to the actual wikitech logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/274457 (owner: 10Andrew Bogott) [19:40:40] (03PS1) 10Jcrespo: Add --defaults-file to force mysql credentials [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/274471 [19:41:28] (03CR) 10Jcrespo: [C: 032 V: 032] Add --defaults-file to force mysql credentials [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/274471 (owner: 10Jcrespo) [19:43:48] (03PS1) 10Jcrespo: Update mariadb repo [puppet] - 10https://gerrit.wikimedia.org/r/274474 [19:44:02] (03CR) 10Jcrespo: [C: 032 V: 032] Update mariadb repo [puppet] - 10https://gerrit.wikimedia.org/r/274474 (owner: 10Jcrespo) [19:46:07] (03PS1) 10Yurik: Allow maps for test and test2 [puppet] - 10https://gerrit.wikimedia.org/r/274475 [19:46:09] Attempt to connect to phuser@m3-master.eqiad.wmnet failed with error #2003: Can't connect to MySQL server on 'm3-master.eqiad.wmnet' (99). [19:47:59] works now [19:48:16] again 27x the level of writes, I think this is worth a ticket now [19:48:26] Yeah [19:48:46] If only so we can send that link to people who complain about it :) [19:48:46] Saves me from editing the /topic every day around lunchtime [19:49:31] Single freaking repo. [19:49:38] I just can't with how phab does this. [19:49:48] question, I love phab [19:49:59] but are we 100% with the diffusion part? [19:50:22] it seems the only one causing issues [19:50:30] Yes. 
This is a one time migration cost as we're copying all the ~1000 or so repos from Gerrit. [19:50:44] I won't ask which one it is this time [19:50:58] ok, then it is time to send an email "there is inestability, but bear with us" [19:51:08] Will do that. [19:51:17] otherwise we will have reports once and again [19:51:47] PROBLEM - MariaDB Slave Lag: m3 on db1048 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 322.00 seconds [19:51:58] (saw the message too fwiw) [19:52:03] Lag will go away now, queue is done. [19:52:11] I will ack that now, knowing that it is preview [19:52:19] apergos: Hint: 9,056 Commits [19:52:48] 6Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting analytics-privatedata-users access for catrope - https://phabricator.wikimedia.org/T128557#2081125 (10Nuria) >Yeah, that table is pretty large, 21.3 million rows. https://gerrit.wikimedia.org/r/274342 > fixes a bug that was causing the number... [19:53:08] It is ok, knowing that, for example, I can disable the replication lag check for a few days (whatever you tell me) [19:54:18] 6Operations, 6Labs, 10Labs-Infrastructure, 10Monitoring, 10Tool-Labs: Ensure mysql credential creation for tools users is running - https://phabricator.wikimedia.org/T125874#2081132 (10chasemp) 5Open>3Resolved It's been ok for awhile now, we can reopen if it starts breaking again :) [19:54:47] jynus: I plan to be done by the end of tomorrow with this mass importing. [19:54:53] e-mail sent to ops/engineering. [19:55:37] thanks, ostriches! [19:57:22] ACKNOWLEDGEMENT - MariaDB Slave Lag: m3 on db1048 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 427.00 seconds Jcrespo ongoing phabricator maintenance [19:57:46] RECOVERY - MariaDB Slave Lag: m3 on db1048 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [19:58:03] jynus: It's just a metric crapton of data for Phab to ingest when you've got thousands of commits that all have to be parsed and indexed, even from a single repo. 
Luckily in the final stretch, only about 40 repositories to go. [19:58:20] not wikitech-l? [19:58:31] as I said, it is ok, if this was know nobody will complain [19:58:47] legoktm: Feel free to forward, I just typed the first two lists that came to mind [19:59:48] ostriches: can it be rate limited out of the box? [20:00:04] ostriches: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160302T2000). [20:00:12] fwd'd [20:00:59] godog: I dunno if we can tune phd to ratelimit like that. [20:01:15] It'd be a good idea if we could, just slow them during batch work so they don't overwhelm m3. [20:01:23] choo choo, train time! [20:02:28] (03PS1) 10Chad: Updating group1 to wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/274477 [20:02:59] !log demon@tin Started scap: group1 to wmf.15 [20:03:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:05:27] <_joe_> ostriches: https://upload.wikimedia.org/wikipedia/commons/1/19/Train_wreck_at_Montparnasse_1895.jpg seems appropriate? [20:07:45] _joe_: Nahhh, https://commons.wikimedia.org/wiki/File:Orchard_TX_Freight_Train.JPG [20:07:59] Beautiful day, on time [20:08:34] https://upload.wikimedia.org/wikipedia/commons/e/e0/Nagasakibomb.jpg [20:09:01] https://upload.wikimedia.org/wikipedia/commons/2/2a/AtomicEffects-p4.jpg [20:09:12] Oh ye of little faith! 
[20:09:33] (03CR) 10Chad: [C: 032 V: 032] Updating group1 to wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/274477 (owner: 10Chad) [20:09:46] I updated https://wikitech.wikimedia.org/wiki/Heterogeneous_deployment/Train_deploys ;) [20:10:00] lol [20:10:39] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [20:11:20] so heartbeat is prepared, now I have to see how do I syncronize puppet runs with schema changes [20:11:41] !log demon@tin Finished scap: group1 to wmf.15 (duration: 08m 41s) [20:11:47] I think the schema change is optional, which would help a lot [20:11:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:12:59] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [20:16:39] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [20:17:12] (03PS2) 10Dzahn: openstack: do not have duplicate wikitech Apache config [puppet] - 10https://gerrit.wikimedia.org/r/274402 [20:18:02] (03CR) 10Dzahn: [C: 032] openstack: do not have duplicate wikitech Apache config [puppet] - 10https://gerrit.wikimedia.org/r/274402 (owner: 10Dzahn) [20:18:08] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [20:19:46] (03PS3) 10GWicke: Add a cluster_be_recv_pre_purge handler & normalize paths [puppet] - 10https://gerrit.wikimedia.org/r/274458 (https://phabricator.wikimedia.org/T127387) [20:20:36] (03CR) 10GWicke: "@bblack: Added 'normalize_path' in the manifest, per your comment." 
[puppet] - 10https://gerrit.wikimedia.org/r/274458 (https://phabricator.wikimedia.org/T127387) (owner: 10GWicke) [20:22:32] (03PS2) 10Dzahn: wikitech: open up dumps directory to public [puppet] - 10https://gerrit.wikimedia.org/r/274405 (https://phabricator.wikimedia.org/T54170) [20:23:42] (03CR) 10ArielGlenn: "What's used to create the dumps? And what format are they in?" [puppet] - 10https://gerrit.wikimedia.org/r/274405 (https://phabricator.wikimedia.org/T54170) (owner: 10Dzahn) [20:27:49] PROBLEM - Host cp1048 is DOWN: PING CRITICAL - Packet loss = 100% [20:27:58] (03CR) 10Dzahn: "@Andrew @Ariel these are good questions, so far i know that Ryan has made those public in the past, from his comments on the linked ticket" [puppet] - 10https://gerrit.wikimedia.org/r/274405 (https://phabricator.wikimedia.org/T54170) (owner: 10Dzahn) [20:28:52] 6Operations, 10Ops-Access-Requests, 13Patch-For-Review: Access Request for mobrovac as ci-admin to mess with CI infrastructure - https://phabricator.wikimedia.org/T128175#2081283 (10hashar) 5Open>3declined I filled this task since @mobrovac asked to get access on an instance for debugging purposes. I hav... [20:28:54] (03CR) 10Dzahn: "see the "There's daily dumps" part on T54170 (it sucks that you cant directly link to the specific comment from gerrit)" [puppet] - 10https://gerrit.wikimedia.org/r/274405 (https://phabricator.wikimedia.org/T54170) (owner: 10Dzahn) [20:29:11] (03Abandoned) 10Hashar: admin: mobrovac as ci-admins [puppet] - 10https://gerrit.wikimedia.org/r/273434 (https://phabricator.wikimedia.org/T128175) (owner: 10Hashar) [20:31:57] 6Operations, 10Dumps-Generation, 6Labs, 10wikitech.wikimedia.org, 13Patch-For-Review: Provide dumps of wikitech.wikimedia.org - https://phabricator.wikimedia.org/T54170#2081298 (10Dzahn) >>! In T54170#1362915, @Krenair wrote: > and also, wikitech's /dumps directory is protected by `Require host wikitech-... 
[20:32:02] 6Operations, 10Traffic, 6WMF-Communications, 7HTTPS, 7Security-Other: Server certificate is classified as invalid on government computers - https://phabricator.wikimedia.org/T128182#2081299 (10Jalexander) I'm looking into some options for contacts. [20:32:52] 6Operations: Remove rbraceysherman@ from fr-all list - https://phabricator.wikimedia.org/T128639#2081302 (10Krenair) Am guessing this is something for ops? Adding operations [20:33:05] (03PS1) 1020after4: Move /srv/phab/repos to /srv/repos refs T125853 [puppet] - 10https://gerrit.wikimedia.org/r/274484 (https://phabricator.wikimedia.org/T125853) [20:35:35] (03Abandoned) 1020after4: Put the phabricator repo lock file on persistent storage [puppet] - 10https://gerrit.wikimedia.org/r/268340 (owner: 1020after4) [20:35:46] !log elastic1011.eqiad.wmnet: upgrading to 1.7.5, shipping logs to logstash (T122697, T109101) [20:35:48] T109101: Currently elasticsearch logs do not leave nodes. We use logstash for this across the cluster generally. - https://phabricator.wikimedia.org/T109101 [20:35:48] T122697: Upgrade ElasticSearch to 1.7.5 - https://phabricator.wikimedia.org/T122697 [20:35:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:35:58] (03PS2) 1020after4: Move /srv/phab/repos to /srv/repos refs T125853 [puppet] - 10https://gerrit.wikimedia.org/r/274484 (https://phabricator.wikimedia.org/T125853) [20:36:19] PROBLEM - IPsec on cp4006 is CRITICAL: Strongswan CRITICAL - ok: 32 connecting: cp1048_v4, cp1048_v6 [20:36:38] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). 
[20:37:29] PROBLEM - IPsec on cp2024 is CRITICAL: Strongswan CRITICAL - ok: 32 connecting: cp1048_v4, cp1048_v6 [20:37:39] PROBLEM - IPsec on cp2011 is CRITICAL: Strongswan CRITICAL - ok: 32 connecting: cp1048_v4, cp1048_v6 [20:37:40] PROBLEM - IPsec on cp3033 is CRITICAL: Strongswan CRITICAL - ok: 32 connecting: cp1048_v4, cp1048_v6 [20:37:40] PROBLEM - IPsec on cp3046 is CRITICAL: Strongswan CRITICAL - ok: 32 connecting: cp1048_v4, cp1048_v6 [20:37:42] 6Operations, 10Dumps-Generation, 6Labs, 10wikitech.wikimedia.org, 13Patch-For-Review: Provide dumps of wikitech.wikimedia.org - https://phabricator.wikimedia.org/T54170#2081323 (10Krenair) Was added in https://gerrit.wikimedia.org/r/#/c/189889/ Should've been caught in review [20:37:48] PROBLEM - IPsec on cp3047 is CRITICAL: Strongswan CRITICAL - ok: 32 connecting: cp1048_v4, cp1048_v6 [20:37:48] PROBLEM - IPsec on cp3034 is CRITICAL: Strongswan CRITICAL - ok: 32 connecting: cp1048_v4, cp1048_v6 [20:37:49] PROBLEM - IPsec on cp3039 is CRITICAL: Strongswan CRITICAL - ok: 32 connecting: cp1048_v4, cp1048_v6 [20:37:49] PROBLEM - IPsec on cp3045 is CRITICAL: Strongswan CRITICAL - ok: 32 connecting: cp1048_v4, cp1048_v6 [20:37:49] PROBLEM - IPsec on cp3049 is CRITICAL: Strongswan CRITICAL - ok: 32 connecting: cp1048_v4, cp1048_v6 [20:38:09] PROBLEM - IPsec on cp4015 is CRITICAL: Strongswan CRITICAL - ok: 32 connecting: cp1048_v4, cp1048_v6 [20:38:09] PROBLEM - IPsec on cp4007 is CRITICAL: Strongswan CRITICAL - ok: 32 connecting: cp1048_v4, cp1048_v6 [20:38:10] PROBLEM - IPsec on cp4005 is CRITICAL: Strongswan CRITICAL - ok: 32 connecting: cp1048_v4, cp1048_v6 [20:38:19] PROBLEM - IPsec on cp2014 is CRITICAL: Strongswan CRITICAL - ok: 32 connecting: cp1048_v4, cp1048_v6 [20:38:19] PROBLEM - IPsec on cp3035 is CRITICAL: Strongswan CRITICAL - ok: 32 connecting: cp1048_v4, cp1048_v6 [20:38:19] PROBLEM - IPsec on cp3036 is CRITICAL: Strongswan CRITICAL - ok: 32 connecting: cp1048_v4, cp1048_v6 [20:38:19] PROBLEM - IPsec on 
cp2026 is CRITICAL: Strongswan CRITICAL - ok: 32 connecting: cp1048_v4, cp1048_v6 [20:38:19] PROBLEM - IPsec on cp3043 is CRITICAL: Strongswan CRITICAL - ok: 32 connecting: cp1048_v4, cp1048_v6 [20:38:28] PROBLEM - IPsec on cp4013 is CRITICAL: Strongswan CRITICAL - ok: 32 connecting: cp1048_v4, cp1048_v6 [20:38:30] PROBLEM - IPsec on cp2002 is CRITICAL: Strongswan CRITICAL - ok: 32 connecting: cp1048_v4, cp1048_v6 [20:38:30] PROBLEM - IPsec on cp2020 is CRITICAL: Strongswan CRITICAL - ok: 32 connecting: cp1048_v4, cp1048_v6 [20:38:39] PROBLEM - IPsec on cp3038 is CRITICAL: Strongswan CRITICAL - ok: 32 connecting: cp1048_v4, cp1048_v6 [20:38:39] PROBLEM - IPsec on cp3037 is CRITICAL: Strongswan CRITICAL - ok: 32 connecting: cp1048_v4, cp1048_v6 [20:38:39] PROBLEM - IPsec on cp3044 is CRITICAL: Strongswan CRITICAL - ok: 32 connecting: cp1048_v4, cp1048_v6 [20:38:48] PROBLEM - IPsec on cp2022 is CRITICAL: Strongswan CRITICAL - ok: 32 connecting: cp1048_v4, cp1048_v6 [20:38:49] PROBLEM - IPsec on cp2017 is CRITICAL: Strongswan CRITICAL - ok: 32 connecting: cp1048_v4, cp1048_v6 [20:38:49] PROBLEM - IPsec on cp2008 is CRITICAL: Strongswan CRITICAL - ok: 32 connecting: cp1048_v4, cp1048_v6 [20:39:00] PROBLEM - IPsec on cp4014 is CRITICAL: Strongswan CRITICAL - ok: 32 connecting: cp1048_v4, cp1048_v6 [20:39:00] PROBLEM - IPsec on cp3042 is CRITICAL: Strongswan CRITICAL - ok: 32 connecting: cp1048_v4, cp1048_v6 [20:39:00] PROBLEM - IPsec on cp3032 is CRITICAL: Strongswan CRITICAL - ok: 32 connecting: cp1048_v4, cp1048_v6 [20:39:28] PROBLEM - IPsec on cp2005 is CRITICAL: Strongswan CRITICAL - ok: 32 connecting: cp1048_v4, cp1048_v6 [20:39:39] PROBLEM - IPsec on cp3048 is CRITICAL: Strongswan CRITICAL - ok: 32 connecting: cp1048_v4, cp1048_v6 [20:39:39] (03CR) 1020after4: "bblack: I removed the unrelated changes. 
Trying to push the new patch but gerrit is currently rejecting my changes" [puppet] - 10https://gerrit.wikimedia.org/r/269561 (https://phabricator.wikimedia.org/T125851) (owner: 1020after4) [20:39:51] huh? [20:40:14] 6Operations: Remove rbraceysherman@ from fr-all list - https://phabricator.wikimedia.org/T128639#2081269 (10Dzahn) Hi, fr-all@ is a combination of fr-development and fr-online. The config looks like this: ``` fr-all: fr-development, fr-online ``` Then you have people in these sub-groups: ``` fr-development:... [20:40:52] was that huh? for me? [20:41:03] no [20:41:06] ok [20:41:07] well kinda, but not really [20:41:09] * twentyafterfour didn't really mean to ping you on irc [20:41:32] * twentyafterfour didn't touch cp* :) [20:41:42] I just happened to flip here in time to see cp10xx alert spam + a patch line mentioning my name saying "trying to push a new patch". my brain went into "oh, someone just merged something awful and broke caches" :) [20:43:47] heh no I was trying to push to gerrit and it doesn't like me. 
(gerrit must have heard how much I badmouth it) [20:44:24] (03PS3) 10Alex Monk: Add public-wiki-rewrites to wikitech [puppet] - 10https://gerrit.wikimedia.org/r/273410 (https://phabricator.wikimedia.org/T99096) [20:44:44] (03CR) 10jenkins-bot: [V: 04-1] Add public-wiki-rewrites to wikitech [puppet] - 10https://gerrit.wikimedia.org/r/273410 (https://phabricator.wikimedia.org/T99096) (owner: 10Alex Monk) [20:45:11] !log cp1048: unresponsive console, powercycled [20:45:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:45:59] !log cp1048: depooled in confd, too [20:46:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:47:28] RECOVERY - IPsec on cp4015 is OK: Strongswan OK - 34 ESP OK [20:47:28] RECOVERY - IPsec on cp4007 is OK: Strongswan OK - 34 ESP OK [20:47:29] RECOVERY - IPsec on cp4006 is OK: Strongswan OK - 34 ESP OK [20:47:29] RECOVERY - IPsec on cp4005 is OK: Strongswan OK - 34 ESP OK [20:47:30] RECOVERY - IPsec on cp2014 is OK: Strongswan OK - 34 ESP OK [20:47:30] RECOVERY - IPsec on cp2026 is OK: Strongswan OK - 34 ESP OK [20:47:30] RECOVERY - IPsec on cp3035 is OK: Strongswan OK - 34 ESP OK [20:47:37] (03PS4) 10Alex Monk: Add public-wiki-rewrites to wikitech [puppet] - 10https://gerrit.wikimedia.org/r/273410 (https://phabricator.wikimedia.org/T99096) [20:47:39] RECOVERY - IPsec on cp3036 is OK: Strongswan OK - 34 ESP OK [20:47:39] RECOVERY - IPsec on cp3043 is OK: Strongswan OK - 34 ESP OK [20:47:40] RECOVERY - Host cp1048 is UP: PING OK - Packet loss = 0%, RTA = 0.60 ms [20:47:41] RECOVERY - IPsec on cp4013 is OK: Strongswan OK - 34 ESP OK [20:47:49] RECOVERY - IPsec on cp2020 is OK: Strongswan OK - 34 ESP OK [20:47:49] RECOVERY - IPsec on cp2002 is OK: Strongswan OK - 34 ESP OK [20:47:59] RECOVERY - IPsec on cp3037 is OK: Strongswan OK - 34 ESP OK [20:47:59] RECOVERY - IPsec on cp3044 is OK: Strongswan OK - 34 ESP OK [20:47:59] RECOVERY - IPsec on cp3038 is OK: Strongswan OK - 
34 ESP OK [20:47:59] RECOVERY - IPsec on cp2022 is OK: Strongswan OK - 34 ESP OK [20:48:09] RECOVERY - IPsec on cp2017 is OK: Strongswan OK - 34 ESP OK [20:48:09] RECOVERY - IPsec on cp2008 is OK: Strongswan OK - 34 ESP OK [20:48:19] RECOVERY - IPsec on cp4014 is OK: Strongswan OK - 34 ESP OK [20:48:20] RECOVERY - IPsec on cp3032 is OK: Strongswan OK - 34 ESP OK [20:48:20] RECOVERY - IPsec on cp3042 is OK: Strongswan OK - 34 ESP OK [20:48:39] RECOVERY - IPsec on cp2005 is OK: Strongswan OK - 34 ESP OK [20:48:39] RECOVERY - IPsec on cp2024 is OK: Strongswan OK - 34 ESP OK [20:48:49] RECOVERY - IPsec on cp2011 is OK: Strongswan OK - 34 ESP OK [20:48:50] RECOVERY - IPsec on cp3033 is OK: Strongswan OK - 34 ESP OK [20:48:50] RECOVERY - IPsec on cp3046 is OK: Strongswan OK - 34 ESP OK [20:48:58] RECOVERY - IPsec on cp3034 is OK: Strongswan OK - 34 ESP OK [20:48:59] RECOVERY - IPsec on cp3047 is OK: Strongswan OK - 34 ESP OK [20:48:59] RECOVERY - IPsec on cp3048 is OK: Strongswan OK - 34 ESP OK [20:48:59] RECOVERY - IPsec on cp3039 is OK: Strongswan OK - 34 ESP OK [20:49:00] RECOVERY - IPsec on cp3045 is OK: Strongswan OK - 34 ESP OK [20:49:00] RECOVERY - IPsec on cp3049 is OK: Strongswan OK - 34 ESP OK [20:52:10] 6Operations, 10Traffic: 3x cache_upload crashed in a short time window - https://phabricator.wikimedia.org/T125401#2081439 (10BBlack) cp1048 (another upload cache) crashed today with: ``` Mar 2 20:25:29 cp1048 kernel: [1915351.432154] ------------[ cut here ]------------ Mar 2 20:25:29 cp1048 kernel: [19153... 
[20:53:07] !log repooling cp1048, seems unlikely to recrash (rare kernel bug) [20:53:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:55:08] 6Operations: Remove rbraceysherman@ from fr-all list - https://phabricator.wikimedia.org/T128639#2081464 (10Dzahn) @JGulingan looks like somebody else has already done this a little while ago [20:56:35] 6Operations, 10Mail: move fundraising group aliases to OIT - https://phabricator.wikimedia.org/T128647#2081487 (10Dzahn) [20:58:17] 6Operations: Remove rbraceysherman@ from fr-all list - https://phabricator.wikimedia.org/T128639#2081522 (10Dzahn) 5Open>3Resolved @JGulingan This specific ticket is done and i'm closing it as resolved. Separately i created T128647 . Maybe we can move this over to your team like we did with many other aliases? [20:58:19] 6Operations, 10Traffic, 6WMF-Communications, 7HTTPS, 7Security-Other: Server certificate is classified as invalid on government computers - https://phabricator.wikimedia.org/T128182#2081526 (10Jalexander) Who would be best for me to connect on an email about this, @Dzahn ? Still looking but good chance t... [20:59:04] (03Abandoned) 10Muehlenhoff: Add Roan to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/274367 (https://phabricator.wikimedia.org/T128557) (owner: 10Muehlenhoff) [21:00:04] gwicke cscott arlolra subbu bearND mdholloway: Respected human, time to deploy Services – Parsoid / OCG / Citoid / Mobileapps / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160302T2100). Please do the needful. [21:01:56] no mobileapps deployment today [21:02:55] no parsoid deploy today. 
[21:03:35] (03CR) 10Dzahn: "example from today, this is XML and images, _not_ SQL" [puppet] - 10https://gerrit.wikimedia.org/r/274405 (https://phabricator.wikimedia.org/T54170) (owner: 10Dzahn) [21:04:29] 6Operations, 10Traffic, 6WMF-Communications, 7HTTPS, 7Security-Other: Server certificate is classified as invalid on government computers - https://phabricator.wikimedia.org/T128182#2081543 (10BBlack) Probably me [21:05:59] (03CR) 10ArielGlenn: [C: 031] "Xml: we don't dump anything into xml that's not public. So those are good to go. Images should be fine too, so open up the hatches, boys " [puppet] - 10https://gerrit.wikimedia.org/r/274405 (https://phabricator.wikimedia.org/T54170) (owner: 10Dzahn) [21:06:12] apergos: thank you [21:06:21] sure [21:08:51] 6Operations, 10Traffic: 3x cache_upload crashed in a short time window - https://phabricator.wikimedia.org/T125401#2081582 (10MoritzMuehlenhoff) > @MoritzMuehlenhoff - It was already running `Linux cp1048 3.19.0-2-amd64 #1 SMP Debian 3.19.3-9 (2016-01-04) x86_64`, does that include the paulmck rcu fix already?... [21:10:54] (03PS11) 1020after4: Clean up phabricator roles in puppet to remove tag pinning. [puppet] - 10https://gerrit.wikimedia.org/r/269561 (https://phabricator.wikimedia.org/T125851) [21:11:39] (03CR) 1020after4: [C: 031] "ok I finally got this to rebase cleanly" [puppet] - 10https://gerrit.wikimedia.org/r/269561 (https://phabricator.wikimedia.org/T125851) (owner: 1020after4) [21:12:53] !log forcing a major compaction on {local_group_wikipedia_T_parsoid_dataW4ULtxs1oMqJ,local_group_wikipedia_T_parsoid_html}.data, xenon.eqiad.wmnet : T125906 [21:12:54] T125906: Evaluate Brotli compression for Cassandra - https://phabricator.wikimedia.org/T125906 [21:12:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:15:10] (03PS1) 1020after4: Phabricator: support systemd as well as upstart. 
[puppet] - 10https://gerrit.wikimedia.org/r/274488 [21:20:05] (03CR) 1020after4: "creation of "phab-ssh.service systemd service unit file" has been moved to a separate changeset ( https://gerrit.wikimedia.org/r/#/c/27448" [puppet] - 10https://gerrit.wikimedia.org/r/269561 (https://phabricator.wikimedia.org/T125851) (owner: 1020after4) [21:23:23] (03CR) 10Dzahn: "thanks, i'm willing to merge this so it can happen during the maintenance period today. just one nitpick, rename the deployment keys to so" [puppet] - 10https://gerrit.wikimedia.org/r/269561 (https://phabricator.wikimedia.org/T125851) (owner: 1020after4) [21:25:35] (03PS12) 1020after4: Clean up phabricator roles in puppet to remove tag pinning. [puppet] - 10https://gerrit.wikimedia.org/r/269561 (https://phabricator.wikimedia.org/T125851) [21:26:24] (03CR) 1020after4: [C: 031] "Done" [puppet] - 10https://gerrit.wikimedia.org/r/269561 (https://phabricator.wikimedia.org/T125851) (owner: 1020after4) [21:26:51] 6Operations, 6Labs, 10Labs-Infrastructure: labservices1001 ran out of disk space - https://phabricator.wikimedia.org/T126572#2081667 (10chasemp) 5Open>3Resolved I spoke with andrew and this was always rotated, the problem came from an increased debug level of logging for designate to troubleshoot the rec... [21:27:46] (03CR) 1020after4: "someone with root will need to place the private key in ops/private to enable scap3 deployments" [puppet] - 10https://gerrit.wikimedia.org/r/269561 (https://phabricator.wikimedia.org/T125851) (owner: 1020after4) [21:31:18] (03CR) 10Dzahn: [C: 031] Clean up phabricator roles in puppet to remove tag pinning. [puppet] - 10https://gerrit.wikimedia.org/r/269561 (https://phabricator.wikimedia.org/T125851) (owner: 1020after4) [21:31:54] 6Operations: Remove rbraceysherman@ from fr-all list - https://phabricator.wikimedia.org/T128639#2081680 (10JGulingan) I wouldn't moving these over to being google groups, since most of the time people request this through to I.T. 
[21:39:01] (03CR) 10Madhuvishy: [C: 04-1] "Don't merge this yet. We will merge this after the QuickSurveys work when we decide to drop ClientIPs from EL." [puppet] - 10https://gerrit.wikimedia.org/r/274286 (owner: 10Madhuvishy) [21:43:10] (03CR) 10CSteipp: "Luke081515, the last time any of these groups were renamed was in Aug 2014, as part of an effort to standardize the group names. If a grou" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272660 (https://phabricator.wikimedia.org/T119100) (owner: 10CSteipp) [21:48:45] twentyafterfour: is there a way for the redirect script to redirect to the master branch if the commit does not exist, please? [21:49:27] (03CR) 10Luke081515: [C: 031] "Yeah, I think the chance is low too, but in theory there is a way... but in every case a lot of users would see that, so ok from my side." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272660 (https://phabricator.wikimedia.org/T119100) (owner: 10CSteipp) [21:49:46] paladox: the redirect script doesn't know if the commit exists until it's too late [21:49:58] but that would be nice [21:55:56] (03CR) 10Dzahn: "i added the private key in the private repo on the puppetmaster, as modules/secret/secrets/phabricator/phab_deploy_private_key" [puppet] - 10https://gerrit.wikimedia.org/r/269561 (https://phabricator.wikimedia.org/T125851) (owner: 1020after4) [21:57:14] (03PS1) 1020after4: Add a deployment source for phabricator deployment from tin/mira [puppet] - 10https://gerrit.wikimedia.org/r/274502 (https://phabricator.wikimedia.org/T114363) [22:04:40] (03CR) 10Dzahn: "in production the new method is the "secrets" module." 
[puppet] - 10https://gerrit.wikimedia.org/r/274502 (https://phabricator.wikimedia.org/T114363) (owner: 1020after4) [22:07:37] (03PS2) 1020after4: Add a deployment source for phabricator deployment from tin/mira [puppet] - 10https://gerrit.wikimedia.org/r/274502 (https://phabricator.wikimedia.org/T114363) [22:09:06] (03PS1) 10Hoo man: Set $wgWikimediaBadgesCommonsCategoryProperty to null on commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/274506 (https://phabricator.wikimedia.org/T128661) [22:12:58] (03PS8) 10Ottomata: Replace limn::data::generate by reportupdater [puppet] - 10https://gerrit.wikimedia.org/r/273487 (https://phabricator.wikimedia.org/T127327) (owner: 10Mforns) [22:14:57] (03CR) 10jenkins-bot: [V: 04-1] Replace limn::data::generate by reportupdater [puppet] - 10https://gerrit.wikimedia.org/r/273487 (https://phabricator.wikimedia.org/T127327) (owner: 10Mforns) [22:15:26] (03CR) 10Daniel Kinzler: [C: 031] "This code *looks* like it should do what we want. I have no idea what it actually does..." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/274506 (https://phabricator.wikimedia.org/T128661) (owner: 10Hoo man) [22:21:46] twentyafterfour: Ok. 
[22:22:00] (03CR) 1020after4: [C: 031] "The backwards-compat change only uses the new secret method of phabricator::deployment::source but I will look into getting that working o" [puppet] - 10https://gerrit.wikimedia.org/r/274502 (https://phabricator.wikimedia.org/T114363) (owner: 1020after4) [22:23:50] (03PS9) 10Ottomata: Replace limn::data::generate by reportupdater [puppet] - 10https://gerrit.wikimedia.org/r/273487 (https://phabricator.wikimedia.org/T127327) (owner: 10Mforns) [22:25:20] (03CR) 10Dzahn: [C: 031] Add a deployment source for phabricator deployment from tin/mira [puppet] - 10https://gerrit.wikimedia.org/r/274502 (https://phabricator.wikimedia.org/T114363) (owner: 1020after4) [22:28:01] !log enabling brotli compression on local_group_wikipedia_T_parsoid_html.data in staging, and forcing rewrite of corresponding tables on xenon : T125906 [22:28:02] T125906: Evaluate Brotli compression for Cassandra - https://phabricator.wikimedia.org/T125906 [22:28:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:35:06] 6Operations, 10Traffic: 3x cache_upload crashed in a short time window - https://phabricator.wikimedia.org/T125401#2082003 (10BBlack) IMHO, 4.4.x is getting close anyways, we may as well see if this problem just goes away after the switch to it. [22:40:21] (03CR) 10Bene: "Should work, didn't test. 
One nitpick" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/274506 (https://phabricator.wikimedia.org/T128661) (owner: 10Hoo man) [22:42:14] (03CR) 10Hoo man: "re" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/274506 (https://phabricator.wikimedia.org/T128661) (owner: 10Hoo man) [22:45:04] (03CR) 10Bene: [C: 031] "Should work" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/274506 (https://phabricator.wikimedia.org/T128661) (owner: 10Hoo man) [22:49:22] we need a compilation of reassuring code review comments :) [22:49:42] !log krinkle@tin Synchronized php-1.27.0-wmf.15/includes/api/ApiMain.php: Fix PHP Notice (duration: 01m 17s) [22:49:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:51:17] 6Operations, 10Traffic: 3 Varnish cache_upload servers crashed in a short time window - https://phabricator.wikimedia.org/T125401#2082092 (10Krinkle) [22:52:38] 6Operations: OCG needs to migrate away from rdb1002 and get its own Redis instance - https://phabricator.wikimedia.org/T128491#2082098 (10Danny_B) >>! In T128491#2080693, @Dzahn wrote: >>>! In T128491#2076729, @Danny_B wrote: >> Please assure you removed the inappropriate project tags from the subtask next time.... 
[22:52:48] (03PS1) 10Dduvall: labs: Expand paths for nuyaml hiera lookup under common [puppet] - 10https://gerrit.wikimedia.org/r/274566 [22:56:15] (03PS1) 10Andrew Bogott: Provide Horizon with a keystone policy [puppet] - 10https://gerrit.wikimedia.org/r/274567 [22:58:31] (03PS2) 10Andrew Bogott: Provide Horizon with a keystone policy [puppet] - 10https://gerrit.wikimedia.org/r/274567 [23:01:09] 6Operations: OCG needs to migrate away from rdb1002 and get its own Redis instance - https://phabricator.wikimedia.org/T128491#2082155 (10Dzahn) Fair, i think what i actually want is to keep using tracking bugs and subtasks but the technical issue is that phabricator always copies the tags to a subtask and while... [23:04:09] PROBLEM - puppet last run on mw2132 is CRITICAL: CRITICAL: Puppet has 1 failures [23:05:15] (03PS1) 10Dduvall: Fix programdashboard hieradata [puppet] - 10https://gerrit.wikimedia.org/r/274572 [23:06:55] 6Operations, 6Discovery, 6Labs, 10Labs-Infrastructure, and 3 others: labstore monitoring - "Last run result for unit .. was exit-code" - https://phabricator.wikimedia.org/T128526#2082170 (10chasemp) [23:06:57] 6Operations, 6Labs: revise replicate backup jobs - https://phabricator.wikimedia.org/T127567#2082171 (10chasemp) [23:10:13] (03PS2) 10Dduvall: Fix programdashboard hieradata [puppet] - 10https://gerrit.wikimedia.org/r/274572 [23:10:43] (03PS3) 10Andrew Bogott: Provide Horizon with a keystone policy [puppet] - 10https://gerrit.wikimedia.org/r/274567 [23:13:08] (03CR) 10Andrew Bogott: [C: 032] Provide Horizon with a keystone policy [puppet] - 10https://gerrit.wikimedia.org/r/274567 (owner: 10Andrew Bogott) [23:13:38] 6Operations, 10DBA, 13Patch-For-Review: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654#2082189 (10Volans) To have MySQL recognize certificates valid from both CAs (current one and new one) we can use on those options: - Create a certificate file with both CAs pem's `cat ca1.p... 
[23:14:02] 6Operations, 6Services, 10hardware-requests: Hardware request for SCA and SCB in codfw - https://phabricator.wikimedia.org/T128475#2082191 (10RobH) [23:14:12] mutante, is this you? [23:14:13] Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Could not find template 'openstack/common/labtestwikitech.wikimedia.org.erb' at /etc/puppet/modules/openstack/manifests/openstack_manager.pp:57 on node labtestweb2001.wikimedia.org [23:14:18] 6Operations, 10Ops-Access-Requests: Access to deployment group for user madhuvishy - https://phabricator.wikimedia.org/T128666#2082195 (10madhuvishy) [23:15:19] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [23:16:04] 6Operations, 6Labs: revise replicate backup jobs - https://phabricator.wikimedia.org/T127567#2082209 (10chasemp) Backups have been failing again and I had a few moments to look into things (I am also merging in a task daniel made -- thanks daniel -- as however we address this needs to be systemic). This is fa... [23:17:23] bblack: ping? [23:18:10] SMalyshev: what's up? [23:18:58] 6Operations, 6Labs: revise replicate backup jobs - https://phabricator.wikimedia.org/T127567#2082213 (10chasemp) [23:19:09] bblack: hi! have a varnish question. If we have some url in cache with various query strings - i.e. a?x=1, a?x=2 etc. - is there an operation that allows to purge all query string variants for the same base url? [23:20:49] SMalyshev: nothing that would be usable in production [23:21:14] gwicke: you are hinting at varnish 4, right? :) [23:21:41] 6Operations, 6Labs: revise replicate backup jobs - https://phabricator.wikimedia.org/T127567#2082234 (10chasemp) [23:21:49] that too, yeah - but there are also bans [23:22:06] https://phabricator.wikimedia.org/T122881 [23:22:06] gwicke: but that's not how we do purges normally? 
[23:22:17] 6Operations, 6Labs, 10Labs-Infrastructure, 10Tool-Labs: labstore - replication to codfw broken or not working yet - https://phabricator.wikimedia.org/T125749#2082235 (10chasemp) [23:22:37] SMalyshev: no, that mechanism does not scale [23:22:53] https://phabricator.wikimedia.org/T122867 [23:22:57] 6Operations, 6Labs: revise replicate backup jobs - https://phabricator.wikimedia.org/T127567#2046998 (10chasemp) [23:23:17] gwicke: ok, so for normal function right now if I need to purge several URL variants I'd need to name each one of them explicitly? [23:23:33] SMalyshev: correct [23:23:40] ok, thanks [23:23:51] 6Operations, 6Labs: revise/fix labstore replicate backup jobs - https://phabricator.wikimedia.org/T127567#2082241 (10chasemp) [23:31:09] gwicke: thanks! :) [23:31:39] RECOVERY - puppet last run on mw2132 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:32:28] PROBLEM - Kafka Broker Replica Max Lag on kafka1020 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [5000000.0] [23:32:47] 6Operations, 6Labs: overhaul labstore setup [tracking] - https://phabricator.wikimedia.org/T126083#2082282 (10chasemp) [23:35:21] bblack: can I convert this little bit of political capital into a review of https://gerrit.wikimedia.org/r/#/c/274458/ ? ;) [23:35:25] 6Operations, 6Services, 10hardware-requests: Hardware request for SCA and SCB in codfw - https://phabricator.wikimedia.org/T128475#2082296 (10RobH) So #procurement is for actual pricing. For the public discussion of the specifications (and to link other things to a public blocker), it belongs in #hardware-r... 
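Note on the purge exchange above (SMalyshev and gwicke, 23:19 to 23:23): the answer given is that each query-string variant of a URL has to be named explicitly when purging. A minimal sketch of what "naming each one" could look like on the caller's side follows. The function name `expand_purge_urls`, the host `example.org`, and the parameter values are hypothetical and chosen only for illustration; the sketch builds the list of URLs and does not send any purge request or use any Wikimedia or Varnish API.

```python
from urllib.parse import urlencode, urlsplit, urlunsplit


def expand_purge_urls(base_url, variants):
    """Return one explicit URL per query-string variant of base_url.

    base_url: URL without a query string (hypothetical example input).
    variants: iterable of dicts, each describing one set of query parameters.
    """
    parts = urlsplit(base_url)
    urls = []
    for params in variants:
        # Encode this variant's parameters, e.g. {"x": 1} becomes "x=1".
        query = urlencode(params)
        urls.append(
            urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))
        )
    return urls


# Hypothetical usage: the a?x=1, a?x=2 case from the conversation.
urls = expand_purge_urls("https://example.org/a", [{"x": 1}, {"x": 2}])
for url in urls:
    print(url)
```

The resulting list would then be handed, one entry at a time, to whatever purge mechanism is in use; that step is deliberately left out because the log does not describe it.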
[23:35:37] (03PS1) 10Dduvall: labs: Deployer access for programdashboard [puppet] - 10https://gerrit.wikimedia.org/r/274579 (https://phabricator.wikimedia.org/T105967) [23:35:48] 6Operations, 6Labs: overhaul labstore setup [tracking] - https://phabricator.wikimedia.org/T126083#2082302 (10chasemp) [23:37:30] 6Operations: Remove rbraceysherman@ from fr-all list - https://phabricator.wikimedia.org/T128639#2082312 (10JGulingan) whoops, I meant I wouldn't mind moving these into being into google groups. [23:37:40] 6Operations, 6Labs: overhaul labstore setup [tracking] - https://phabricator.wikimedia.org/T126083#2004220 (10chasemp) [23:37:42] 6Operations, 6Labs: One instance hammering on NFS should not make it unavailable to everyone else - https://phabricator.wikimedia.org/T95766#2082313 (10chasemp) 5Open>3Resolved I believe with the rollout of https://gerrit.wikimedia.org/r/#/c/272900/ this has improved greatly. It is not necessarily difficu... [23:38:27] 6Operations, 6Labs: overhaul labstore setup [tracking] - https://phabricator.wikimedia.org/T126083#2082320 (10chasemp) [23:39:40] RECOVERY - Kafka Broker Replica Max Lag on kafka1020 is OK: OK: Less than 50.00% above the threshold [1000000.0] [23:39:55] 6Operations, 6Labs: overhaul labstore setup [tracking] - https://phabricator.wikimedia.org/T126083#2004220 (10chasemp) [23:40:12] 6Operations, 10Traffic: Office network using monkeybrains.net instead of connection to SFO pop site - https://phabricator.wikimedia.org/T128669#2082348 (10bbogaert) [23:40:19] PROBLEM - ElasticSearch health check for shards on elastic1012 is CRITICAL: CRITICAL - elasticsearch http://10.64.32.144:9200/_cluster/health error while fetching: HTTPConnectionPool(host=10.64.32.144, port=9200): Max retries exceeded with url: /_cluster/health (Caused by class socket.error: [Errno 111] Connection refused) [23:41:43] ^ ebernhardson elastic is not running here? 
[23:42:00] (declarative as in: I see it's not but that seems odd) [23:42:26] (03PS1) 10ArielGlenn: re-enable dump cron job on snapshot hosts [puppet] - 10https://gerrit.wikimedia.org/r/274581 [23:42:42] well I was going to sleep but one last thing [23:43:42] (03CR) 10ArielGlenn: [C: 032] re-enable dump cron job on snapshot hosts [puppet] - 10https://gerrit.wikimedia.org/r/274581 (owner: 10ArielGlenn) [23:46:23] 6Operations, 6Labs, 10Labs-Infrastructure: Unable to connect both redundant labstores to the shelves in parallel - https://phabricator.wikimedia.org/T117453#2082414 (10chasemp) [23:46:25] 6Operations, 6Labs: overhaul labstore setup [tracking] - https://phabricator.wikimedia.org/T126083#2082413 (10chasemp) [23:48:26] 6Operations, 6Services, 10hardware-requests: Hardware request for SCA and SCB in codfw - https://phabricator.wikimedia.org/T128475#2082438 (10RobH) I should have also noted that we have no spare machines that meet this requirement (our high performance misc systems have overkill on memory and disk). I've cr... 
[23:48:39] 6Operations, 6Services, 10hardware-requests: Hardware request for SCA and SCB in codfw - https://phabricator.wikimedia.org/T128475#2082443 (10RobH) [23:49:28] RECOVERY - ElasticSearch health check for shards on elastic1012 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 308, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 8865, initializing_shards: 0, number_of_data_nodes: 31, delayed_unassigned [23:50:30] 6Operations, 10Incident-Labs-NFS-20151216, 6Labs: Investigate better way of deferring activation of Labs LVM volumes (and corresponding snapshots) until after system boot - https://phabricator.wikimedia.org/T121629#2082449 (10chasemp) [23:50:32] 6Operations, 6Labs: overhaul labstore setup [tracking] - https://phabricator.wikimedia.org/T126083#2082448 (10chasemp) [23:51:25] !log ran puppet on elastic1012 manually which started a mystery stopped (crashed?) 
elastic search [23:51:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:57:30] PROBLEM - Auth DNS for labs pdns on labs-ns2.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [23:57:45] 6Operations, 10media-storage: Unable to delete, move or upload new versions of files on several wikis ("inconsistent state within the internal storage backends") - https://phabricator.wikimedia.org/T128096#2082539 (10Krenair) [23:57:54] PROBLEM - tools-home on tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 323 bytes in 0.231 second response time [23:57:55] 6Operations, 10media-storage: Unable to delete, restore/undelete, move or upload new versions of files on several wikis ("inconsistent state within the internal storage backends") - https://phabricator.wikimedia.org/T128096#2063784 (10Krenair) [23:58:59] ok well I guess dns is for sure still messed up then [23:59:36] (03PS2) 10BBlack: wmflib: add hash_select_re and hash_deselect_re [puppet] - 10https://gerrit.wikimedia.org/r/274400 (https://phabricator.wikimedia.org/T127481) [23:59:38] (03PS1) 10BBlack: 2layer: remove outdated pybal weight junk [puppet] - 10https://gerrit.wikimedia.org/r/274584 (https://phabricator.wikimedia.org/T127481) [23:59:40] (03PS1) 10BBlack: 2layer: move $mma out of storage block [puppet] - 10https://gerrit.wikimedia.org/r/274585 (https://phabricator.wikimedia.org/T127481) [23:59:42] (03PS1) 10BBlack: geoip.inc.vcl.erb: move to text extra_vcl [puppet] - 10https://gerrit.wikimedia.org/r/274586 (https://phabricator.wikimedia.org/T127481) [23:59:44] (03PS1) 10BBlack: v::c::directors: remove dead code/comments [puppet] - 10https://gerrit.wikimedia.org/r/274587 (https://phabricator.wikimedia.org/T127481) [23:59:46] (03PS1) 10BBlack: misc-backend: clean up elsif whitespace [puppet] - 10https://gerrit.wikimedia.org/r/274588 (https://phabricator.wikimedia.org/T127481) [23:59:48] (03PS1) 10BBlack: text-backend: clean 
up applayer backend logic [puppet] - 10https://gerrit.wikimedia.org/r/274589 (https://phabricator.wikimedia.org/T127481) [23:59:50] (03PS1) 10BBlack: VCL: explicit applayer backend selection [puppet] - 10https://gerrit.wikimedia.org/r/274590 (https://phabricator.wikimedia.org/T127481) [23:59:52] (03PS1) 10BBlack: VCL: rename remaining "backend" cache backends [puppet] - 10https://gerrit.wikimedia.org/r/274591 (https://phabricator.wikimedia.org/T127481) [23:59:54] (03PS1) 10BBlack: v::c::directors: remove defaulting of service/dc [puppet] - 10https://gerrit.wikimedia.org/r/274592 (https://phabricator.wikimedia.org/T127481) [23:59:56] (03PS1) 10BBlack: role::cache: undo fe_t[12]_opts complexity [puppet] - 10https://gerrit.wikimedia.org/r/274593 (https://phabricator.wikimedia.org/T127481) [23:59:58] (03PS1) 10BBlack: VCL: move layer from vcl_config to instance param [puppet] - 10https://gerrit.wikimedia.org/r/274594 (https://phabricator.wikimedia.org/T127481)