[00:00:04] twentyafterfour: Respected human, time to deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160630T0000). Please do the needful. [00:01:28] [Thu Jun 30 00:01:24 2016] [hphp] [12154:7f7694804100:0:000002] [] [00:01:31] Notice: Undefined variable: wgDisableUnmergedEdits [00:02:38] I checked the CentralAuth code, this setting isn't in the source code anymore [00:03:26] https://github.com/wikimedia/mediawiki-extensions-CentralAuth/blob/89bb28a50167e9f25111e6d90bbddbd87d5ac7a2/CentralAuth.php#L204 [00:03:34] wgDisableUnmergedEditing [00:03:46] the wmg and wg didn't match :/ [00:05:17] actually... why don't we have that set to true? [00:05:22] oh, we set $wgCentralAuthStrict = true; [00:05:53] Operations, Commons, MediaWiki-Page-deletion, media-storage, and 3 others: Unable to delete file pages on commons: MWException/LocalFileLockError: "Could not acquire lock" - https://phabricator.wikimedia.org/T132921#2416632 (aaron) The error message should be less terse now. How often does this...
[00:07:52] !log dereckson@tin Started scap: wmf-config/CommonSettings.php Clean-up for IS/CS ([[Gerrit:292615]] to [[Gerrit:292618]], no op, 1/2) [00:07:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:08:12] !log dereckson@tin scap aborted: wmf-config/CommonSettings.php Clean-up for IS/CS ([[Gerrit:292615]] to [[Gerrit:292618]], no op, 1/2) (duration: 00m 20s) [00:08:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:08:27] (PS1) Aaron Schulz: Increase redisLockManager read timeout from 1 to 2 seconds [mediawiki-config] - https://gerrit.wikimedia.org/r/296676 (https://phabricator.wikimedia.org/T132921) [00:08:51] !log maxsem@tin Synchronized php-1.28.0-wmf.8/extensions/TemplateSandbox: https://gerrit.wikimedia.org/r/#/c/296675/ (duration: 00m 30s) [00:08:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:09:30] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. [00:09:51] !log dereckson@tin Synchronized wmf-config/CommonSettings.php: Clean-up for IS/CS ([[Gerrit:292615]] to [[Gerrit:292618]], no op, 1/2) (duration: 00m 28s) [00:09:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:10:41] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Clean-up for IS/CS ([[Gerrit:292615]] to [[Gerrit:292618]], no op, 2/2) (duration: 00m 29s) [00:10:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:11:40] Okay, done for those. As we're out of time, I'm deploying the odder ady. logo change, and we're done for SWAT. [00:11:47] Dereckson, Undefined variable: wmgUseXFFBlocks in /srv/mediawiki/wmf-config/CommonSettings.php on line 3232 [00:12:38] MaxSem: introduced by b4d232a8362283b58b09d08e3d767242aa5002d7 [00:13:02] I'm preparing a fix. [00:13:05] so why wasn't it reverted?
[00:13:13] (PS1) MaxSem: Revert "Cleanup: Move never-altered GlobalBlockingBlockXFF into CommonSettings" [mediawiki-config] - https://gerrit.wikimedia.org/r/296677 [00:13:25] (PS2) MaxSem: Revert "Cleanup: Move never-altered GlobalBlockingBlockXFF into CommonSettings" [mediawiki-config] - https://gerrit.wikimedia.org/r/296677 [00:13:31] (CR) MaxSem: [C: 2] Revert "Cleanup: Move never-altered GlobalBlockingBlockXFF into CommonSettings" [mediawiki-config] - https://gerrit.wikimedia.org/r/296677 (owner: MaxSem) [00:14:19] (Merged) jenkins-bot: Revert "Cleanup: Move never-altered GlobalBlockingBlockXFF into CommonSettings" [mediawiki-config] - https://gerrit.wikimedia.org/r/296677 (owner: MaxSem) [00:15:21] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Revert "Cleanup: Move never-altered GlobalBlockingBlockXFF into CommonSettings" (duration: 00m 25s) [00:15:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:16:58] !log dereckson@tin Synchronized wmf-config/CommonSettings.php: Revert "Cleanup: Move never-altered GlobalBlockingBlockXFF into CommonSettings" (no-op) (duration: 00m 26s) [00:17:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:18:43] (CR) Dereckson: "It appears $wmgUseXFFBlocks is also used at another place in CS:" [mediawiki-config] - https://gerrit.wikimedia.org/r/292615 (owner: Jforrester) [00:23:13] we're still getting this spam in logs message repeated 5760 times: [ #012Notice: Undefined variable: wmgUseXFFBlocks in /srv/mediawiki/wmf-config/CommonSettings.php on line 3232] [00:24:09] (PS1) Dereckson: Don't use always true wmgUseXFFBlocks anymore [mediawiki-config] - https://gerrit.wikimedia.org/r/296679 [00:24:54] MaxSem: I suggest we merge this to remove the setting and use true instead ^ [00:25:06] !log maxsem@tin Synchronized wmf-config/: Try again?
(duration: 00m 29s) [00:25:11] I suggest we don't try to break prod anymore [00:25:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:25:20] preparing to update phabricator [00:26:04] bd808: https://gerrit.wikimedia.org/r/#/c/296679 looks good to you? [00:27:23] (CR) BryanDavis: [C: 1] Don't use always true wmgUseXFFBlocks anymore [mediawiki-config] - https://gerrit.wikimedia.org/r/296679 (owner: Dereckson) [00:27:40] !log Taking phabricator offline momentarily for scheduled update. Expect less than 5 minutes of downtime. [00:27:41] (CR) Dereckson: [C: 2] Don't use always true wmgUseXFFBlocks anymore [mediawiki-config] - https://gerrit.wikimedia.org/r/296679 (owner: Dereckson) [00:27:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:28:29] (Merged) jenkins-bot: Don't use always true wmgUseXFFBlocks anymore [mediawiki-config] - https://gerrit.wikimedia.org/r/296679 (owner: Dereckson) [00:29:01] mw1017 : [00:29:01] echo $wgApplyIpBlocksToXff [00:29:01] 1 [00:29:23] phabricator throwing 503 503, Backend fetch failed [00:29:28] Expected I assume [00:29:29] I get 500 errors from phab too often [00:29:30] Krinkle: known issue [00:29:35] Not very descriptive though :) [00:29:53] Krinkle: planned maintenance by twentyafterfour, ETA back in < 5 minutes [00:30:24] !log dereckson@tin Synchronized wmf-config/CommonSettings.php: Don't use always true wmgUseXFFBlocks anymore (1/2) (duration: 00m 27s) [00:30:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:30:31] it would be nice to have a maintenance shingle for that [00:32:09] !log dereckson@tin Synchronized wmf-config/CommonSettings.php: Don't use always true wmgUseXFFBlocks anymore (1/2) (duration: 00m 25s) [00:32:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:37:29] Krinkle: still seeing 500s?
[00:38:19] nope [00:38:35] !log Phabricator upgrade complete, service appears to be stable. [00:38:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:39:32] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Don't use always true wmgUseXFFBlocks anymore (2/2) (duration: 00m 25s) [00:39:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:40:37] so we can now twentyafterfour edit parent tasks and not only children tasks :) [00:45:35] Dereckson: yep [00:46:14] and more things... [00:46:45] (my phabricator is mostly at master of phacility, I already knew new things ;)) [00:48:29] (CR) Dereckson: [C: -2] "Blocked by T122771, per last comment." [mediawiki-config] - https://gerrit.wikimedia.org/r/296520 (https://phabricator.wikimedia.org/T67064) (owner: Mdann52) [00:48:31] Quite a few bugs fixed in the past two weeks. A few of them were sponsored by wmf thanks to Qgil [00:48:48] nice [00:54:01] PROBLEM - MariaDB Slave Lag: m3 on db1048 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 459.23 seconds [00:57:46] (CR) Dereckson: [C: 1] "Technically correct and PNG optimized." [mediawiki-config] - https://gerrit.wikimedia.org/r/296403 (https://phabricator.wikimedia.org/T138801) (owner: Urbanecm) [01:01:24] Puppet, Labs, Labs-Infrastructure, Phabricator: puppet function ipresolve unable to look up instance on labs-puppetmaster - https://phabricator.wikimedia.org/T139011#2416717 (mmodell) [01:08:21] Attempt to connect to phuser@m3-master.eqiad.wmnet failed with error #2003: Can't connect to MySQL server on 'm3-master.eqiad.wmnet' (99).
[01:08:31] twentyafterfour: ^ [01:09:31] PROBLEM - puppet last run on cp3014 is CRITICAL: CRITICAL: puppet fail [01:10:50] twentyafterfour: i am here for the reboot of the server [01:12:02] mutante: ok [01:12:12] Dereckson: I saw that too but it resolved itself [01:12:21] mutante: I'm ready when you are [01:13:25] twentyafterfour: i am ready, i have the mgmt console open and see output [01:13:42] want me to type the command? [01:14:52] twentyafterfour, we actually got sponsoring working? [01:14:54] nice [01:15:15] mutante: go for it [01:15:20] (I am also here because Phab went down :)) [01:15:32] !log rebooting iridium [01:15:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:16:25] Krenair: yeah quim sponsored a few bugfixes and the first batch of them just got pushed to our phabricator [01:16:43] do we have a list somewhere? [01:16:58] Krenair: once phabricator is back online I'll look it up ;) [01:16:58] So I am waiting for https://phabricator.wikimedia.org/T138460 [01:17:01] haha okay [01:17:12] jynus, which task is that? [01:17:26] the m3-slave depooling [01:18:42] it's back [01:18:44] !log iridium back up, on 3.13.0-91 [01:18:45] Krenair: https://phabricator.wikimedia.org/T136213 [01:18:49] twentyafterfour: done [01:18:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:19:18] Gerrit gives HTTP 500 ISE while Phab is down? [01:19:22] jynus: so all that needs to happen for depooling is to stop the cronjob that hits the slave? [01:19:23] when uploading a change [01:19:36] Krenair: weird [01:19:38] To ssh://gerrit.wikimedia.org/mediawiki/extensions/VisualEditor [01:19:38] ! [remote rejected] HEAD -> refs/publish/master/T137424 (internal server error) [01:19:45] twentyafterfour, anything that hits the slave must be either stopped or prepared to fail [01:19:54] but added my commit anyway [01:19:56] I will bring it down [01:20:10] jynus: well as far as I am concerned, go for it.
let the cron jobs fail for a while it won't hurt too much [01:20:28] upgrade it and (next week?) we will failover to be the main db [01:20:31] I don't think we want those expensive queries hitting the masters during the interim, it would cause too much instability [01:20:47] jynus: I'll give my official blessing on the task [01:21:06] so maybe we can temporarily disable the crons? [01:21:19] can you point me to where they are? [01:22:47] Operations, DBA, Phabricator: Upgrade m3 (phabricator) db servers - https://phabricator.wikimedia.org/T138460#2416754 (mmodell) @jcrespo: I say go ahead with the m3 slave depooling. AFAIK the slaves are only used (currently) for some expensive analytics-type queries (the public task dump) This will... [01:23:04] jynus: they are root crons on iridium [01:23:15] in root's crontab on iridium, i see /srv/phab/tools/public_task_dump.py [01:23:29] is that puppetized? [01:23:44] defined in role::phabricator::main [01:23:50] thank you [01:24:02] yeah that should be the only thing that hits slaves, afaik [01:24:05] I will create a change, ask for your +1 [01:24:19] and a general check in case I miss something [01:24:33] and I will do it during the week [01:24:43] jynus: cool, +1 :) [01:24:48] once everything seems ok [01:25:19] we can schedule a failover for next week, unless this gets delayed/problems are detected [01:26:05] do you know which teams are affected by those crons, analytics? [01:26:31] modules/phabricator/manifests/tools.pp line 41 [01:26:33] it is to send an email indicating the maintenance [01:27:14] jynus: getting a lot of db errors on phabricator [01:27:33] I saw that, I assumed you were doing maintenance [01:27:36] jynus: I pinged JAufrecht, that's the one person I know will be specifically affected [01:27:47] there is a huge amount of inserts going on [01:27:50] jynus: no maintenance on my end, I did that [01:28:00] (already finished maint.)
[01:28:01] (I'm not doing any maint either, fwiw) [01:28:06] hmm inserts? .... [01:28:17] see: https://tendril.wikimedia.org/host/view/db1043.eqiad.wmnet/3306 [01:28:47] 300K inserts in 5 minutes [01:29:06] when normally we get 2K [01:29:10] hmm phabricator daemons look very busy. I wonder what they are up to [01:29:12] geez [01:29:42] also 100 simultaneous connections [01:29:45] Somebody import a repo? [01:29:48] Tons of indexing shit [01:30:52] !log stopped phd on iridium to investigate large spike in sql insert volume [01:30:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:31:56] https://phabricator.wikimedia.org/daemon/ - specifically [01:32:23] "You do not have privileges to access the requested page." [01:32:42] Isn't Phabricator supposed to show actual policies whenever there's a 403? [01:33:04] !log starting phd with only 4 taskmasters to help lighten the load [01:33:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:33:12] Krenair: yes [01:33:14] Operations, DBA, Phabricator: Upgrade m3 (phabricator) db servers - https://phabricator.wikimedia.org/T138460#2416755 (mmodell) @JAufrecht: adding you just to give you a bit of fore-warning that the public task dump is going to break sometime this week, not sure how long before we can restore it but... [01:33:44] Krenair: Interesting, "Can Use Application: Public (No Login Required)" [01:34:01] Can configure, administrators [01:34:08] don't let me distract you with that for now [01:34:13] 🤔 [01:34:21] (CR) Dzahn: [C: 1] Allow wdqs admins to control wdqs-updater service [puppet] - https://gerrit.wikimedia.org/r/295968 (https://phabricator.wikimedia.org/T138627) (owner: Smalyshev) [01:34:40] XChat can't render that character [01:35:20] PROBLEM - MariaDB Slave Lag: m3 on db1048 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 400.82 seconds [01:35:33] Krenair: Find a newer client?
😂 [01:35:40] I can see that one [01:35:54] The other one was the https://www.google.com/?ion=1&espv=2#q=thinking%20emoji [01:35:56] :) [01:36:28] RECOVERY - puppet last run on cp3014 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [01:36:35] yeah I could still google it [01:36:42] Sigh... Every time I look into the Phabricator source. Every single time. [01:36:55] How do they use so many classes for these things? [01:37:07] Krenair: :-/ [01:37:12] (PS1) Jcrespo: Disable cron script on the phab slave due to maintenance [puppet] - https://gerrit.wikimedia.org/r/296681 (https://phabricator.wikimedia.org/T138460) [01:38:59] src/applications/daemon/controller/PhabricatorDaemonController.php: public function shouldRequireAdmin() { return true; [01:39:31] PROBLEM - MariaDB Slave Lag: m3 on db1048 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 325.15 seconds [01:40:25] well, for that matter it doesn't render in Chrome either [01:40:43] is there anything you need from me for the phab issues? [01:41:01] Krenair: You can make the argument that "each class does one thing and one thing only".... but they've gone so far that "each class does one thing that by itself isn't very useful so you need 17 other classes to do something to it" lolol [01:42:23] most connections seem to come from "phabricator_repository" [01:42:51] sorry, that is the db, not the user [01:43:14] Yeah, sounds like a git repo import. Which I'm curious what "new" repo got so big. [01:43:19] Krenair: cf https://tools.wmflabs.org/bash/quip/AU7VT86J6snAnmqnK_qj [01:44:24] I am saying it because if it is something that there is not much to do about, I would wait offline for your review and start with the maintenance tomorrow [01:44:53] while we are in maintenance of phab and all here.. might as well merge a change to gerrit module [01:45:10] mutante: My no-op cleanup?
[01:45:11] :D [01:45:16] yea [01:46:18] (CR) Dzahn: [C: 2] "also checked with compiler" [puppet] - https://gerrit.wikimedia.org/r/296622 (owner: Chad) [01:47:29] Lemme know when it's on puppetmaster and I'll kick off a puppet run on ytterbium [01:47:30] (CR) 20after4: [C: 1] Disable cron script on the phab slave due to maintenance [puppet] - https://gerrit.wikimedia.org/r/296681 (https://phabricator.wikimedia.org/T138460) (owner: Jcrespo) [01:47:41] ostriches: it's running right now [01:47:49] k! [01:47:54] ostriches: already over [01:47:59] Yay no changes? [01:48:05] ostriches: it's probably due to changes in phabricator, it's importing more refs than before [01:48:14] Ugh yah [01:48:16] Prolly [01:48:34] ostriches: only this https://phabricator.wikimedia.org/P3321 [01:48:41] I slowed it down as much as I can, should chug through 11k tasks without too much disruption [01:49:12] thank you, I will merge just before I start the maintenance so I have the least amount of time dow [01:49:14] n [01:49:30] next weeks will be a bit more tricky [01:49:39] (CR) Dzahn: "this was the only change https://phabricator.wikimedia.org/P3321" [puppet] - https://gerrit.wikimedia.org/r/296622 (owner: Chad) [01:49:49] because it may need a service restart or killing all connections [01:49:56] good night! [01:49:58] jynus: thanks! [01:50:59] RECOVERY - MariaDB Slave Lag: m3 on db1048 is OK: OK slave_sql_lag Replication lag: 0.14 seconds [01:58:43] mutante: All expected :) [01:59:18] :) okay, i thought so. i am also logging off then [02:00:33] Puppet, Labs, Labs-Infrastructure, Phabricator: puppet function ipresolve unable to look up instance on labs-puppetmaster - https://phabricator.wikimedia.org/T139011#2416717 (AlexMonk-WMF) @chasemp, can you check other hosts that we know to work like `bastion-01.bastion.eqiad.wmflabs`?
[02:04:06] (PS1) Chad: Gerrit: don't pass SMTP server info around either, it's in hiera [puppet] - https://gerrit.wikimedia.org/r/296682 [02:08:25] (CR) Chad: [C: 1] "https://puppet-compiler.wmflabs.org/3234/ - no non-manifest changes, just some newlines trimmed from gerrit.config" [puppet] - https://gerrit.wikimedia.org/r/296682 (owner: Chad) [02:10:56] (PS1) Chad: Gerrit: Stop customizing ssh port. It's not like we're changing it [puppet] - https://gerrit.wikimedia.org/r/296683 [02:18:11] (PS1) Chad: Gerrit: don't bother ensuring hooks directory, we don't use it anymore [puppet] - https://gerrit.wikimedia.org/r/296684 [02:24:10] (PS1) Chad: Gerrit: remove replicationdest, unused [puppet] - https://gerrit.wikimedia.org/r/296685 [02:39:04] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.7) (duration: 17m 23s) [02:39:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:53:18] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.8) (duration: 07m 08s) [02:53:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:00:28] !log l10nupdate@tin ResourceLoader cache refresh completed at Thu Jun 30 03:00:28 UTC 2016 (duration 7m 10s) [03:00:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:21:23] (PS1) Ladsgroup: Rename ores deploy repo [puppet] - https://gerrit.wikimedia.org/r/296687 (https://phabricator.wikimedia.org/T139008) [03:22:58] (CR) Ladsgroup: "We need to test this, I'm not sure the new repo would work fine."
[puppet] - https://gerrit.wikimedia.org/r/296687 (https://phabricator.wikimedia.org/T139008) (owner: Ladsgroup) [03:47:16] PROBLEM - Hadoop DataNode on analytics1039 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode [04:00:46] RECOVERY - Hadoop DataNode on analytics1039 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode [04:10:59] ori: Are you around? [04:17:09] I attempted to create an account with the username mintudazel. [04:17:09] First attempt: An account creation for this user name is already in progress. Please wait. [04:17:09] Second attempt: Username entered already in use. Please choose a different name. [04:17:09] Special:CentralAuth gives Fatal exception of type "Exception" since the first attempt. [04:17:24] I take it that it is https://phabricator.wikimedia.org/T119736? Can someone provide a stack trace? [04:21:13] 2016-06-30 04:09:15 [V3SbawpAIDkAAGJ6vqAAAABP] mw1187 metawiki 1.28.0-wmf.8 exception ERROR: [V3SbawpAIDkAAGJ6vqAAAABP] /wiki/Special:CentralAuth/mintudazel Exception from line 2340 of /srv/mediawiki/php-1.28.0-wmf.8/extensions/CentralAuth/includes/CentralAuthUser.php: Could not find local user data for Mintudazel@enwiki {"exception_id":"V3SbawpAIDkAAGJ6vqAAAABP"} [04:21:33] Thanks [04:44:54] PROBLEM - MegaRAID on db1054 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) [04:45:47] db1054 is an s2 slave [05:05:14] PROBLEM - check_disk on lutetium is CRITICAL: DISK CRITICAL - free space: / 3714 MB (10% inode=89%): /dev 32200 MB (99% inode=99%): /run 6403 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 32209 MB (100% inode=99%): /srv 410039 MB (28% inode=99%) [05:10:14] PROBLEM - check_disk on lutetium is CRITICAL: DISK CRITICAL - free space: / 3690 MB (10% inode=89%): /dev 32200 MB (99% inode=99%): /run 6403 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 32209 MB (100% inode=99%): /srv 409883
MB (28% inode=99%) [05:15:14] PROBLEM - check_disk on lutetium is CRITICAL: DISK CRITICAL - free space: / 3689 MB (10% inode=89%): /dev 32200 MB (99% inode=99%): /run 6403 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 32209 MB (100% inode=99%): /srv 409883 MB (28% inode=99%) [05:20:14] PROBLEM - check_disk on lutetium is CRITICAL: DISK CRITICAL - free space: / 3689 MB (10% inode=89%): /dev 32200 MB (99% inode=99%): /run 6403 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 32209 MB (100% inode=99%): /srv 409883 MB (28% inode=99%) [05:25:14] PROBLEM - check_disk on lutetium is CRITICAL: DISK CRITICAL - free space: / 3689 MB (10% inode=89%): /dev 32200 MB (99% inode=99%): /run 6403 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 32209 MB (100% inode=99%): /srv 409883 MB (28% inode=99%) [05:30:14] PROBLEM - check_disk on lutetium is CRITICAL: DISK CRITICAL - free space: / 3689 MB (10% inode=89%): /dev 32200 MB (99% inode=99%): /run 6403 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 32209 MB (100% inode=99%): /srv 409883 MB (28% inode=99%) [05:35:13] PROBLEM - check_disk on lutetium is CRITICAL: DISK CRITICAL - free space: / 3688 MB (10% inode=89%): /dev 32200 MB (99% inode=99%): /run 6403 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 32209 MB (100% inode=99%): /srv 408535 MB (28% inode=99%) [05:40:13] PROBLEM - check_disk on lutetium is CRITICAL: DISK CRITICAL - free space: / 3688 MB (10% inode=89%): /dev 32200 MB (99% inode=99%): /run 6403 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 32209 MB (100% inode=99%): /srv 409883 MB (28% inode=99%) [05:45:13] PROBLEM - check_disk on lutetium is CRITICAL: DISK CRITICAL - free space: / 3687 MB (10% inode=89%): /dev 32200 MB (99% inode=99%): /run 6403 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 32209 MB (100% inode=99%): /srv 409437 MB (28% inode=99%) [05:50:13] PROBLEM - check_disk on lutetium is CRITICAL: DISK 
CRITICAL - free space: / 3687 MB (10% inode=89%): /dev 32200 MB (99% inode=99%): /run 6403 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 32209 MB (100% inode=99%): /srv 407599 MB (28% inode=99%) [05:55:13] PROBLEM - check_disk on lutetium is CRITICAL: DISK CRITICAL - free space: / 3686 MB (10% inode=89%): /dev 32200 MB (99% inode=99%): /run 6403 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 32209 MB (100% inode=99%): /srv 406146 MB (28% inode=99%) [06:00:13] PROBLEM - check_disk on lutetium is CRITICAL: DISK CRITICAL - free space: / 3686 MB (10% inode=89%): /dev 32200 MB (99% inode=99%): /run 6403 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 32209 MB (100% inode=99%): /srv 409735 MB (28% inode=99%) [06:05:13] PROBLEM - check_disk on lutetium is CRITICAL: DISK CRITICAL - free space: / 3686 MB (10% inode=89%): /dev 32200 MB (99% inode=99%): /run 6403 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 32209 MB (100% inode=99%): /srv 407751 MB (28% inode=99%) [06:10:13] PROBLEM - check_disk on lutetium is CRITICAL: DISK CRITICAL - free space: / 3685 MB (10% inode=89%): /dev 32200 MB (99% inode=99%): /run 6403 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 32209 MB (100% inode=99%): /srv 405779 MB (27% inode=99%) [06:15:13] PROBLEM - check_disk on lutetium is CRITICAL: DISK CRITICAL - free space: / 3685 MB (10% inode=89%): /dev 32200 MB (99% inode=99%): /run 6403 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 32209 MB (100% inode=99%): /srv 403903 MB (27% inode=99%) [06:18:28] !log rolling restart of mw1001-mw1016 for kernel security update [06:18:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:20:13] PROBLEM - check_disk on lutetium is CRITICAL: DISK CRITICAL - free space: / 3685 MB (10% inode=89%): /dev 32200 MB (99% inode=99%): /run 6403 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 32209 MB (100% inode=99%): /srv 409855
MB (28% inode=99%) [06:25:13] PROBLEM - check_disk on lutetium is CRITICAL: DISK CRITICAL - free space: / 3685 MB (10% inode=89%): /dev 32200 MB (99% inode=99%): /run 6403 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 32209 MB (100% inode=99%): /srv 409414 MB (28% inode=99%) [06:26:31] !log resuming rolling restarts of elasticsearch cluster in eqiad and codfw [06:26:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:30:13] PROBLEM - check_disk on lutetium is CRITICAL: DISK CRITICAL - free space: / 3719 MB (10% inode=89%): /dev 32200 MB (99% inode=99%): /run 6403 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 32209 MB (100% inode=99%): /srv 409854 MB (28% inode=99%) [06:32:13] PROBLEM - puppet last run on restbase2006 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:14] PROBLEM - puppet last run on cp2001 is CRITICAL: CRITICAL: Puppet has 4 failures [06:32:24] PROBLEM - puppet last run on cp3008 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:34] PROBLEM - puppet last run on mw2081 is CRITICAL: CRITICAL: Puppet has 2 failures [06:32:43] PROBLEM - Check size of conntrack table on mw1167 is CRITICAL: CRITICAL: nf_conntrack is 91 % full [06:32:54] PROBLEM - puppet last run on mc2007 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:54] PROBLEM - Check size of conntrack table on mw1163 is CRITICAL: CRITICAL: nf_conntrack is 91 % full [06:34:34] PROBLEM - Check size of conntrack table on mw1166 is CRITICAL: CRITICAL: nf_conntrack is 92 % full [06:34:44] PROBLEM - Check size of conntrack table on mw1161 is CRITICAL: CRITICAL: nf_conntrack is 90 % full [06:34:54] PROBLEM - puppet last run on mw2126 is CRITICAL: CRITICAL: Puppet has 1 failures [06:35:13] PROBLEM - check_disk on lutetium is CRITICAL: DISK CRITICAL - free space: / 3718 MB (10% inode=89%): /dev 32200 MB (99% inode=99%): /run 6403 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 32209 MB (100% inode=99%): /srv 408698 MB 
(28% inode=99%) [06:35:14] PROBLEM - puppet last run on mw2129 is CRITICAL: CRITICAL: Puppet has 2 failures [06:35:33] PROBLEM - puppet last run on mw2158 is CRITICAL: CRITICAL: Puppet has 2 failures [06:36:24] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 1 failures [06:37:24] PROBLEM - puppet last run on mw1135 is CRITICAL: CRITICAL: Puppet has 1 failures [06:37:56] !log powercycling elastic1014, stuck after reboot [06:38:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:40:13] PROBLEM - check_disk on lutetium is CRITICAL: DISK CRITICAL - free space: / 3718 MB (10% inode=89%): /dev 32200 MB (99% inode=99%): /run 6403 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 32209 MB (100% inode=99%): /srv 407114 MB (28% inode=99%) [06:40:35] PROBLEM - puppet last run on elastic2015 is CRITICAL: CRITICAL: Puppet has 2 failures [06:40:55] RECOVERY - Check size of conntrack table on mw1163 is OK: OK: nf_conntrack is 79 % full [06:42:34] PROBLEM - Check size of conntrack table on mw1162 is CRITICAL: CRITICAL: nf_conntrack is 90 % full [06:44:45] PROBLEM - Check size of conntrack table on mw1165 is CRITICAL: CRITICAL: nf_conntrack is 92 % full [06:45:13] PROBLEM - check_disk on lutetium is CRITICAL: DISK CRITICAL - free space: / 3718 MB (10% inode=89%): /dev 32200 MB (99% inode=99%): /run 6403 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 32209 MB (100% inode=99%): /srv 405702 MB (27% inode=99%) [06:45:15] PROBLEM - puppet last run on elastic1036 is CRITICAL: CRITICAL: Puppet has 1 failures [06:46:04] PROBLEM - Check size of conntrack table on mw1166 is CRITICAL: CRITICAL: nf_conntrack is 92 % full [06:46:23] PROBLEM - Check size of conntrack table on mw1161 is CRITICAL: CRITICAL: nf_conntrack is 90 % full [06:47:04] RECOVERY - Check size of conntrack table on mw1165 is OK: OK: nf_conntrack is 76 % full [06:47:13] RECOVERY - Check size of conntrack table on mw1162 is OK: OK: 
nf_conntrack is 72 % full [06:48:24] RECOVERY - Check size of conntrack table on mw1166 is OK: OK: nf_conntrack is 41 % full [06:48:34] RECOVERY - Check size of conntrack table on mw1161 is OK: OK: nf_conntrack is 46 % full [06:48:43] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0] [06:48:54] RECOVERY - Check size of conntrack table on mw1167 is OK: OK: nf_conntrack is 49 % full [06:49:13] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0] [06:50:13] PROBLEM - check_disk on lutetium is CRITICAL: DISK CRITICAL - free space: / 3717 MB (10% inode=89%): /dev 32200 MB (99% inode=99%): /run 6403 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 32209 MB (100% inode=99%): /srv 404354 MB (27% inode=99%) [06:50:53] mmhh, the surge in jobrunner activity is unrelated to any reboots of mw* systems, while I had already logged those, I got sidetracked into further broken elastic mgmt reboots, so no reboots of eqiad jobrunners yet [06:55:13] PROBLEM - check_disk on lutetium is CRITICAL: DISK CRITICAL - free space: / 3202 MB (8% inode=89%): /dev 32200 MB (99% inode=99%): /run 6403 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 32209 MB (100% inode=99%): /srv 403094 MB (27% inode=99%) [06:55:33] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [06:55:43] RECOVERY - puppet last run on mw2081 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [06:56:03] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [06:56:54] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:34] RECOVERY - puppet last run on restbase2006 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [06:57:40] 
!log powercycling elastic1015, stuck after reboot [06:57:43] RECOVERY - puppet last run on cp2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:57:45] RECOVERY - puppet last run on cp3008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:54] RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:03] RECOVERY - puppet last run on mw1135 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [06:58:13] RECOVERY - puppet last run on mw2129 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [06:58:23] RECOVERY - puppet last run on mc2007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:24] RECOVERY - puppet last run on mw2158 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:00:13] PROBLEM - check_disk on lutetium is CRITICAL: DISK CRITICAL - free space: / 1936 MB (5% inode=89%): /dev 32200 MB (99% inode=99%): /run 6403 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 32209 MB (100% inode=99%): /srv 401930 MB (27% inode=99%) [07:04:08] RECOVERY - puppet last run on elastic2015 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [07:05:17] RECOVERY - check_disk on lutetium is OK: DISK OK - free space: / 17685 MB (49% inode=89%): /dev 32200 MB (99% inode=99%): /run 6403 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 32209 MB (100% inode=99%): /srv 409709 MB (28% inode=99%) [07:06:37] PROBLEM - Check size of conntrack table on mw1166 is CRITICAL: CRITICAL: nf_conntrack is 90 % full [07:07:58] PROBLEM - Check size of conntrack table on mw1168 is CRITICAL: CRITICAL: nf_conntrack is 90 % full [07:07:59] PROBLEM - Check size of conntrack table on mw1169 is CRITICAL: CRITICAL: 
nf_conntrack is 93 % full [07:11:08] RECOVERY - puppet last run on elastic1036 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:11:27] PROBLEM - Check size of conntrack table on mw1167 is CRITICAL: CRITICAL: nf_conntrack is 92 % full [07:12:18] PROBLEM - Check size of conntrack table on mw1162 is CRITICAL: CRITICAL: nf_conntrack is 92 % full [07:12:31] https://grafana.wikimedia.org/dashboard/db/job-queue-rate?panelId=5&fullscreen shows a bit of variations in the jobs for the past hour [07:13:18] RECOVERY - Check size of conntrack table on mw1166 is OK: OK: nf_conntrack is 79 % full [07:13:28] I am seeing tons of TIME_WAIT on mw1167 (conntrack -L) [07:13:45] I found the problem, that's fallout of https://phabricator.wikimedia.org/T136094 [07:14:12] fixing it on mw1161-mw1169 as we speak [07:14:28] ahhh that one again! [07:14:37] RECOVERY - Check size of conntrack table on mw1162 is OK: OK: nf_conntrack is 63 % full [07:14:53] good :) [07:15:14] I'll look into fixing it properly once the current reboots are done [07:15:57] PROBLEM - Check size of conntrack table on mw1167 is CRITICAL: CRITICAL: nf_conntrack is 94 % full [07:16:07] PROBLEM - Check size of conntrack table on mw1161 is CRITICAL: CRITICAL: nf_conntrack is 90 % full [07:16:18] PROBLEM - Check size of conntrack table on mw1163 is CRITICAL: CRITICAL: nf_conntrack is 93 % full [07:16:35] should all recover now [07:16:52] that makes sense, reboot and conntrack_max race condition [07:16:58] RECOVERY - Check size of conntrack table on mw1168 is OK: OK: nf_conntrack is 77 % full [07:18:09] RECOVERY - Check size of conntrack table on mw1167 is OK: OK: nf_conntrack is 40 % full [07:18:18] RECOVERY - Check size of conntrack table on mw1161 is OK: OK: nf_conntrack is 36 % full [07:18:37] RECOVERY - Check size of conntrack table on mw1163 is OK: OK: nf_conntrack is 42 % full [07:19:27] RECOVERY - Check size of conntrack table on mw1169 is OK: OK: nf_conntrack is 64 % full [07:19:39] 
(03PS5) 10Elukey: Allow communications between AQS and Analytics Hadoop on port 7000. [puppet] - 10https://gerrit.wikimedia.org/r/296389 [07:27:59] moritzm: if you are not super busy, is --^ ok to merge for you? (we have the new network ACL in place for port 7000) [07:30:48] akosiaris: Thank you for the port update :) [07:34:51] elukey: I'll have a look in a bit [07:37:45] sure thanks! [07:42:55] 06Operations, 10Ops-Access-Requests, 06Discovery, 10Wikidata, and 2 others: Enable WDQS admins to enable/disable mask/unmask updater service - https://phabricator.wikimedia.org/T138627#2416979 (10Gehel) [07:43:25] 06Operations, 10Ops-Access-Requests, 06Discovery, 10Wikidata, and 2 others: Enable WDQS admins to enable/disable mask/unmask updater service - https://phabricator.wikimedia.org/T138627#2405780 (10Gehel) Title updated to reflect additional discussion that happened in comments. [08:06:16] (03CR) 10Muehlenhoff: [C: 031] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/296389 (owner: 10Elukey) [08:13:53] (03CR) 10Elukey: [C: 032] "Puppet compiler looks good: https://puppet-compiler.wmflabs.org/3239/" [puppet] - 10https://gerrit.wikimedia.org/r/296389 (owner: 10Elukey) [08:17:03] disabling puppet on hadoop nodes just to be super safe, will re-enable in a bit [08:39:47] puppet is also disabled on aqs100[123] atm since I hit a syntax error in ferm, atm only on aqs100[456] that are not serving live traffic [08:44:10] (03PS1) 10Elukey: Add extra parentheses to an AQS ferm rule to solve a syntax error. [puppet] - 10https://gerrit.wikimedia.org/r/296693 [08:47:42] (03CR) 10Muehlenhoff: [C: 031] Add extra parentheses to an AQS ferm rule to solve a syntax error. 
[puppet] - 10https://gerrit.wikimedia.org/r/296693 (owner: 10Elukey) [08:48:34] (03CR) 10Elukey: [C: 032] "https://puppet-compiler.wmflabs.org/3240/aqs1004.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/296693 (owner: 10Elukey) [08:50:38] 06Operations: Audit/fix hosts with no RAID configured - https://phabricator.wikimedia.org/T136562#2417062 (10faidon) a:03Dzahn [08:51:37] PROBLEM - Disk space on elastic1045 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=94%) [08:51:56] PROBLEM - Disk space on elastic1036 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=94%) [08:54:52] (03PS1) 10Elukey: Fix typo in Hadoop node list. [puppet] - 10https://gerrit.wikimedia.org/r/296694 [08:56:08] (03CR) 10Elukey: [C: 032] Fix typo in Hadoop node list. [puppet] - 10https://gerrit.wikimedia.org/r/296694 (owner: 10Elukey) [09:01:28] (03PS1) 10Elukey: Remove AQS Thrift ferm rules for port 9160 since not used anymore. [puppet] - 10https://gerrit.wikimedia.org/r/296695 [09:01:53] (03CR) 10Muehlenhoff: [C: 031] Remove AQS Thrift ferm rules for port 9160 since not used anymore. [puppet] - 10https://gerrit.wikimedia.org/r/296695 (owner: 10Elukey) [09:04:33] (03CR) 10Elukey: [C: 032] Remove AQS Thrift ferm rules for port 9160 since not used anymore. 
[puppet] - 10https://gerrit.wikimedia.org/r/296695 (owner: 10Elukey) [09:10:29] PROBLEM - Host mw1011 is DOWN: PING CRITICAL - Packet loss = 100% [09:13:43] !log deleting old logs on elastic1045 and elastic1036 [09:13:46] !log reboot graphite2001 / graphite1001 to apply trusty kernel update [09:13:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:13:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:14:53] moritzm: might be safer to wait a bit before continuing the elastic rolling restart on eqiad, something's wrong there [09:15:00] dcausse: ok [09:19:31] !log zotero deployed translators cde2f7531a4 [09:19:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:21:39] (03PS3) 10Gehel: Maps: Limit query exec time for kartotherian user [puppet] - 10https://gerrit.wikimedia.org/r/295548 (https://phabricator.wikimedia.org/T138422) (owner: 10Yurik) [09:21:49] !log truncating current logs on elastic1045 and elastic1036 [09:21:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:23:19] (03CR) 10Gehel: [C: 032] Maps: Limit query exec time for kartotherian user [puppet] - 10https://gerrit.wikimedia.org/r/295548 (https://phabricator.wikimedia.org/T138422) (owner: 10Yurik) [09:23:59] RECOVERY - Disk space on elastic1045 is OK: DISK OK [09:24:20] RECOVERY - Disk space on elastic1036 is OK: DISK OK [09:25:25] dcausse: just connecting... what's the issue on eqiad? 
[09:26:07] gehel: something weird, exception loop that I don't understand generating gigs of logs [09:26:42] a nested remote exception between elastic1045 and elastic1036 (master) [09:29:07] looks like a bug, where the exception is going back and forth between elastic1045 and elastic1036 adding more and more root causes [09:30:03] the root cause is related to an itwiki_general_1415230945 shard not being ready for writes, but this shard is ready [09:31:15] dcausse: should we relocate that index of elastic1045? [09:31:34] 3 more shards to initialize then I'll restart elastic1045 hoping that it'll stop this dead loop [09:33:02] gehel: no itwiki_general_1415230945 shards are on elastic1045 :/ [09:33:25] dcausse: even more weird... [09:34:11] I've captured some lines from these gigantic logs then I truncated them, and now we already have 17gig [09:37:30] !log powercycling mw1011, stuck after reboot [09:37:32] dcausse: data/write/bulk does not look like usual indexing traffic, does it? [09:37:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:37:48] gehel: all writes are bulk for cirrus [09:37:56] (03CR) 10Mdann52: "T122771 means this does not work all the time - however, for the purpose this is intended, it's better than nothing. That task covers maki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296520 (https://phabricator.wikimedia.org/T67064) (owner: 10Mdann52) [09:38:30] * gehel is still learning ... [09:38:40] dcausse, gehel: just to double-check, codfw is fine to proceed, right? [09:39:50] RECOVERY - Host mw1011 is UP: PING OK - Packet loss = 0%, RTA = 3.28 ms [09:40:40] PROBLEM - Disk space on elastic1045 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=94%) [09:41:01] PROBLEM - Disk space on elastic1036 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=94%) [09:42:49] moritzm: I don't see any logs growing on codfw the way they grow on eqiad [09:44:16] moritzm: codfw looks good to me, but I just arrived. 
dcausse did you have any other indication of issue except log size? [09:44:17] !log truncating current logs again on elastic1045 and elastic1036 [09:44:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:45:00] gehel: no, gc looks fine, so I prefer to wait until cluster is green before restarting any other nodes [09:45:20] RECOVERY - Disk space on elastic1045 is OK: DISK OK [09:45:42] dcausse: you'd also prefer to wait before restarting nodes on codfw? [09:45:49] RECOVERY - Disk space on elastic1036 is OK: DISK OK [09:46:01] gehel: no that's fine I think [09:48:40] moritzm: so please go on for codfw [09:51:22] ok [10:00:19] 06Operations, 10ops-eqiad: db1054 degraded RAID (failed disk) - https://phabricator.wikimedia.org/T139026#2417215 (10jcrespo) [10:04:02] (03PS11) 10Jcrespo: Remove otrs backups from dbstore1001, create them on es2001 instead [puppet] - 10https://gerrit.wikimedia.org/r/296538 (https://phabricator.wikimedia.org/T131705) [10:06:49] (03PS1) 10Ladsgroup: ores: Add Czech language dictionaries [puppet] - 10https://gerrit.wikimedia.org/r/296705 [10:06:54] moritzm: not sure to understand yet but the problem disappeared (culprit node removed and readded itself to the cluster) [10:07:09] I'll dig into the logs and hopefully we resume rolling restarts this afternoon [10:07:15] sure, thanks [10:11:04] (03CR) 10Jcrespo: [C: 032] Remove otrs backups from dbstore1001, create them on es2001 instead [puppet] - 10https://gerrit.wikimedia.org/r/296538 (https://phabricator.wikimedia.org/T131705) (owner: 10Jcrespo) [10:18:25] (03CR) 10Alexandros Kosiaris: [C: 04-1] apache: logrotate augeas rule needs apache2 package (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/291024 (https://phabricator.wikimedia.org/T136301) (owner: 10Hashar) [10:19:11] PROBLEM - puppet last run on es2001 is CRITICAL: CRITICAL: puppet fail [10:19:30] (03CR) 10Alexandros Kosiaris: [C: 032] ores: Add Czech language dictionaries [puppet] - 
10https://gerrit.wikimedia.org/r/296705 (owner: 10Ladsgroup) [10:19:35] (03PS2) 10Alexandros Kosiaris: ores: Add Czech language dictionaries [puppet] - 10https://gerrit.wikimedia.org/r/296705 (owner: 10Ladsgroup) [10:19:39] (03CR) 10Alexandros Kosiaris: [V: 032] ores: Add Czech language dictionaries [puppet] - 10https://gerrit.wikimedia.org/r/296705 (owner: 10Ladsgroup) [10:20:48] !log powercycling mw1016, stuck after reboot [10:20:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:23:24] 06Operations, 05Gitblit-Deprecate: gitblit blobs not redirecting to the correct moved resource - https://phabricator.wikimedia.org/T139027#2417275 (10Danny_B) p:05Triage>03Normal The issue is, that the link is not 100% correct. It's missing `.git` after the repo name. https://git.wikimedia.org/blob/mediawi... [10:25:23] 06Operations, 05Gitblit-Deprecate: gitblit blobs not redirecting to the correct moved resource - https://phabricator.wikimedia.org/T139027#2417285 (10Danny_B) [10:26:33] !log rebooting stat100[234] and analytics1003 for kernel upgrades [10:26:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:28:00] 07Blocked-on-Operations, 10Continuous-Integration-Infrastructure, 10Packaging, 05Gerrit-Migration, and 2 others: Package xhpast (libphutil) - https://phabricator.wikimedia.org/T137770#2417291 (10hashar) 05Resolved>03Open We would need the package to build for Trusty which we used for Zend 5.5 jobs. It... [10:28:10] (03CR) 10Alexandros Kosiaris: [C: 04-1] Gerrit: don't pass SMTP server info around either, it's in hiera (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/296682 (owner: 10Chad) [10:30:12] (03CR) 10Alexandros Kosiaris: [C: 031] "Indeed. And it's now hardcoded pretty much anywhere anyway. 
A lot of people have it in more than one repos on their PCs, it will probably " [puppet] - 10https://gerrit.wikimedia.org/r/296683 (owner: 10Chad) [10:30:55] (03CR) 10Alexandros Kosiaris: [C: 031] Gerrit: don't bother ensuring hooks directory, we don't use it anymore [puppet] - 10https://gerrit.wikimedia.org/r/296684 (owner: 10Chad) [10:31:06] (03CR) 10Alexandros Kosiaris: [C: 031] Gerrit: remove replicationdest, unused [puppet] - 10https://gerrit.wikimedia.org/r/296685 (owner: 10Chad) [10:31:48] 06Operations, 05Gitblit-Deprecate: gitblit blobs not redirecting to the correct moved resource - https://phabricator.wikimedia.org/T139027#2417322 (10Paladox) Ive uploaded this pull https://github.com/wikimedia/texvcjs/pull/17 to change the link. [10:32:51] (03PS1) 10Jcrespo: Correct invalid cron definition; add gtid to backups [puppet] - 10https://gerrit.wikimedia.org/r/296706 (https://phabricator.wikimedia.org/T131705) [10:33:10] (03CR) 10Alexandros Kosiaris: [C: 031] admin: add wdqs-admins to deploy-service group [puppet] - 10https://gerrit.wikimedia.org/r/296658 (https://phabricator.wikimedia.org/T138628) (owner: 10Dzahn) [10:33:39] 07Puppet, 10Beta-Cluster-Infrastructure, 07Tracking: Deployment-prep hosts with puppet errors (tracking) - https://phabricator.wikimedia.org/T132259#2417331 (10hashar) p:05Triage>03Normal [10:35:14] (03CR) 10Jcrespo: [C: 032] "With all the changes and space issues, let's both keep an eye on next week's backups to check they are being done correctly." 
[puppet] - 10https://gerrit.wikimedia.org/r/296706 (https://phabricator.wikimedia.org/T131705) (owner: 10Jcrespo) [10:35:29] 06Operations, 07Puppet, 10Beta-Cluster-Infrastructure, 06Labs: Implement role based hiera lookups for labs - https://phabricator.wikimedia.org/T120165#1847021 (10hashar) [10:43:27] RECOVERY - puppet last run on es2001 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [10:43:52] 06Operations, 05Gitblit-Deprecate: gitblit blobs not redirecting to the correct moved resource - https://phabricator.wikimedia.org/T139027#2417380 (10Danny_B) Thanks. However, that solves only this particular case in this moment. For future we need some more versatile solution like those I've suggested earlier. [10:46:51] 06Operations, 06Labs, 10Labs-Infrastructure, 10Shinken, 07Graphite: Clean up labs graphite datapoints - https://phabricator.wikimedia.org/T111540#2417388 (10hashar) [10:48:52] (03PS1) 10Jcrespo: Avoid log spam as 99% of these would be non-errors [puppet] - 10https://gerrit.wikimedia.org/r/296707 (https://phabricator.wikimedia.org/T132324) [10:53:03] 06Operations, 10Beta-Cluster-Infrastructure, 10Traffic, 13Patch-For-Review: varnish text on beta is unreachable / stuck - https://phabricator.wikimedia.org/T134346#2417400 (10hashar) 05Open>03Resolved a:03hashar The frontend varnish on deployment-cache-text04 has ~ 500 threads but at least there are... 
[10:58:57] (03CR) 10Jcrespo: [C: 032] Avoid log spam as 99% of these would be non-errors [puppet] - 10https://gerrit.wikimedia.org/r/296707 (https://phabricator.wikimedia.org/T132324) (owner: 10Jcrespo) [11:00:30] (03PS2) 10Jcrespo: Disable cron script on the phab slave due to maintenance [puppet] - 10https://gerrit.wikimedia.org/r/296681 (https://phabricator.wikimedia.org/T138460) [11:00:57] 06Operations, 05Gitblit-Deprecate: gitblit blobs not redirecting to the correct moved resource unless .git is part of repo in url - https://phabricator.wikimedia.org/T139027#2417418 (10jayvdb) [11:04:35] (03CR) 10Jcrespo: [C: 032] Disable cron script on the phab slave due to maintenance [puppet] - 10https://gerrit.wikimedia.org/r/296681 (https://phabricator.wikimedia.org/T138460) (owner: 10Jcrespo) [11:09:28] (03CR) 10Jforrester: "Meh." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296677 (owner: 10MaxSem) [11:14:58] (03PS1) 10Jcrespo: Revert "Disable cron script on the phab slave due to maintenance" [puppet] - 10https://gerrit.wikimedia.org/r/296709 [11:15:14] (03CR) 10Jcrespo: [C: 032] Revert "Disable cron script on the phab slave due to maintenance" [puppet] - 10https://gerrit.wikimedia.org/r/296709 (owner: 10Jcrespo) [11:17:49] (03CR) 10Jforrester: [C: 031] Change wmgVisualEditorAvailableNamespaces keys to canonical names instead of indexes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296673 (https://phabricator.wikimedia.org/T138999) (owner: 10Alex Monk) [11:28:20] 06Operations, 10DBA, 10Phabricator, 13Patch-For-Review: Upgrade m3 (phabricator) db servers - https://phabricator.wikimedia.org/T138460#2417470 (10jcrespo) This affects to more things than just that cron, I had to revert: https://gerrit.wikimedia.org/r/296709 . I will create an alternative proposal to depo... 
[11:28:24] 06Operations, 06Release-Engineering-Team, 07Developer-notice, 05Gitblit-Deprecate, and 2 others: Redirect Gitblit urls (git.wikimedia.org) -> Diffusion urls (phabricator.wikimedia.org/diffusion) - https://phabricator.wikimedia.org/T137224#2417471 (10Qgil) [11:29:02] (03CR) 10Esanders: [C: 031] Change wmgVisualEditorAvailableNamespaces keys to canonical names instead of indexes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296673 (https://phabricator.wikimedia.org/T138999) (owner: 10Alex Monk) [11:32:14] (03PS1) 10Hashar: beta: send MariaDB errors to syslog [puppet] - 10https://gerrit.wikimedia.org/r/296713 (https://phabricator.wikimedia.org/T119370) [11:33:35] (03CR) 10Hashar: "Jaime, I have no idea what I am doing really :( That should change the config of both Beta cluster databases: deployment-db1 and deploy" [puppet] - 10https://gerrit.wikimedia.org/r/296713 (https://phabricator.wikimedia.org/T119370) (owner: 10Hashar) [11:40:30] (03PS1) 10Muehlenhoff: Add three more jessie-based image scalers [puppet] - 10https://gerrit.wikimedia.org/r/296714 [11:51:56] (03PS2) 10Muehlenhoff: Add three more jessie-based image scalers [puppet] - 10https://gerrit.wikimedia.org/r/296714 [11:54:58] (03CR) 10Muehlenhoff: [C: 032 V: 032] Add three more jessie-based image scalers [puppet] - 10https://gerrit.wikimedia.org/r/296714 (owner: 10Muehlenhoff) [11:59:35] !log pooling three additional jessie-based image scalers [11:59:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:12:36] PROBLEM - puppet last run on mw2236 is CRITICAL: CRITICAL: puppet fail [12:16:15] jynus: do you have a bit of time to help me with some strange issue I have on analytics1003? [12:16:24] the DB seems in readonly mode [12:16:32] not sure why, I just rebooted the host [12:16:39] and I can see only Incorrect definition of table mysql.event: expected column 'sql_mode' at.. 
[12:16:45] in mysql.err [12:18:56] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:21:17] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [12:24:16] moritzm: I think it's ok to resume rolling restarts on elastic@eqiad [12:26:43] dcausse: ok! could you find a cause of error? [12:27:09] I presume it's unrelated to the rolling reboot, but if I should change anything, please tell [12:27:16] moritzm: certainly a bug, I'm creating a ticket on the elastic github repo [12:27:50] it's most likely a race condition that caused the logger to go crazy [12:28:42] fortunately the issue is just causing floods to the logs, I haven't noticed any incidence on query latency nor mediawiki logs [12:29:04] ok, thanjs [12:29:30] (03CR) 10Hashar: "Cherry picked on beta cluster puppetmaster and I have reloaded mysql." [puppet] - 10https://gerrit.wikimedia.org/r/296713 (https://phabricator.wikimedia.org/T119370) (owner: 10Hashar) [12:39:37] RECOVERY - puppet last run on mw2236 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:42:32] (03PS6) 10Hashar: hiera_lookup: recognize labs project and site [puppet] - 10https://gerrit.wikimedia.org/r/276346 (https://phabricator.wikimedia.org/T129092) [12:43:11] (03CR) 10Hashar: "Can not do this for now since labs infra is missing capacity/overloaded" [puppet] - 10https://gerrit.wikimedia.org/r/285957 (https://phabricator.wikimedia.org/T133911) (owner: 10Hashar) [12:45:55] (03PS2) 10Hashar: contint: tidy Nodepool slaves config history [puppet] - 10https://gerrit.wikimedia.org/r/295641 (https://phabricator.wikimedia.org/T126552) [12:48:54] (03CR) 10Hashar: "Puppet compile https://puppet-compiler.wmflabs.org/3241/gallium.wikimedia.org/ not so useful, it just shows up:" [puppet] - 10https://gerrit.wikimedia.org/r/295641 (https://phabricator.wikimedia.org/T126552) (owner: 10Hashar) [12:51:31] (03PS6) 10Hashar: cache: vary 
statsd_server with hiera [puppet] - 10https://gerrit.wikimedia.org/r/249490 (https://phabricator.wikimedia.org/T116898) [12:51:58] (03CR) 10Hashar: "Rebased." [puppet] - 10https://gerrit.wikimedia.org/r/249490 (https://phabricator.wikimedia.org/T116898) (owner: 10Hashar) [12:52:25] (03CR) 10Paladox: contint: tidy Nodepool slaves config history (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/295641 (https://phabricator.wikimedia.org/T126552) (owner: 10Hashar) [12:53:23] 06Operations, 06Release-Engineering-Team, 07Developer-notice, 05Gitblit-Deprecate, and 2 others: Redirect Gitblit urls (git.wikimedia.org) -> Diffusion urls (phabricator.wikimedia.org/diffusion) - https://phabricator.wikimedia.org/T137224#2389149 (10Johan) It has (though referring to [[ https://lists.wikim... [12:53:49] (03PS1) 10Faidon Liambotis: base: remove ioscheduler setting from non-augeas codepath [puppet] - 10https://gerrit.wikimedia.org/r/296727 [12:53:51] (03PS1) 10Faidon Liambotis: base: reenable augeas codepath on trustys [puppet] - 10https://gerrit.wikimedia.org/r/296728 [12:53:53] (03PS1) 10Faidon Liambotis: Create a new grub module [puppet] - 10https://gerrit.wikimedia.org/r/296729 [12:53:55] (03PS1) 10Faidon Liambotis: cache: un-hieraize tcpmhash_entries boot setting [puppet] - 10https://gerrit.wikimedia.org/r/296730 [12:53:57] (03PS1) 10Faidon Liambotis: labstore: un-hieraize elevator/ioscheduler boot-setting [puppet] - 10https://gerrit.wikimedia.org/r/296731 [12:53:59] (03PS1) 10Faidon Liambotis: mediawiki: un-hieraize cgroup_enable boot-settings [puppet] - 10https://gerrit.wikimedia.org/r/296732 [12:54:05] _joe_: could you review the above? 
(topic:grub for your convenience) [12:56:33] (03CR) 10jenkins-bot: [V: 04-1] Create a new grub module [puppet] - 10https://gerrit.wikimedia.org/r/296729 (owner: 10Faidon Liambotis) [12:57:12] (03CR) 10jenkins-bot: [V: 04-1] cache: un-hieraize tcpmhash_entries boot setting [puppet] - 10https://gerrit.wikimedia.org/r/296730 (owner: 10Faidon Liambotis) [12:57:41] (03CR) 10jenkins-bot: [V: 04-1] labstore: un-hieraize elevator/ioscheduler boot-setting [puppet] - 10https://gerrit.wikimedia.org/r/296731 (owner: 10Faidon Liambotis) [12:58:31] (03CR) 10jenkins-bot: [V: 04-1] mediawiki: un-hieraize cgroup_enable boot-settings [puppet] - 10https://gerrit.wikimedia.org/r/296732 (owner: 10Faidon Liambotis) [13:00:03] (03PS1) 10Urbanecm: [cleanup] Delete old throttle rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296733 [13:04:50] (03CR) 10Hashar: apache: logrotate augeas rule needs apache2 package (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/291024 (https://phabricator.wikimedia.org/T136301) (owner: 10Hashar) [13:05:22] (03PS3) 10Hashar: apache: logrotate augeas rule needs apache2 package [puppet] - 10https://gerrit.wikimedia.org/r/291024 (https://phabricator.wikimedia.org/T136301) [13:05:56] (03CR) 10Urbanecm: [C: 031] Namespace configuration for sk.wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296270 (https://phabricator.wikimedia.org/T138779) (owner: 10Dereckson) [13:07:54] (03PS2) 10Faidon Liambotis: labstore: un-hieraize elevator/ioscheduler boot-setting [puppet] - 10https://gerrit.wikimedia.org/r/296731 [13:07:56] (03PS2) 10Faidon Liambotis: cache: un-hieraize tcpmhash_entries boot setting [puppet] - 10https://gerrit.wikimedia.org/r/296730 [13:07:58] (03PS2) 10Faidon Liambotis: Create a new grub module [puppet] - 10https://gerrit.wikimedia.org/r/296729 [13:08:00] (03PS2) 10Faidon Liambotis: mediawiki: un-hieraize cgroup_enable boot-settings [puppet] - 10https://gerrit.wikimedia.org/r/296732 [13:10:47] !log upgrading and 
restarting analytics1003 mysql tables [13:10:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:13:23] PROBLEM - Oozie Server on analytics1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.catalina.startup.Bootstrap [13:14:13] this one is me --^ [13:14:28] (03CR) 10Paladox: [C: 031] contint: tidy Nodepool slaves config history [puppet] - 10https://gerrit.wikimedia.org/r/295641 (https://phabricator.wikimedia.org/T126552) (owner: 10Hashar) [13:18:14] RECOVERY - Oozie Server on analytics1003 is OK: PROCS OK: 1 process with command name java, args org.apache.catalina.startup.Bootstrap [13:19:06] 07Blocked-on-Operations, 06Operations, 07Graphite: "unexpected error" on graphite-web - https://phabricator.wikimedia.org/T138541#2417773 (10yuvipanda) 05Open>03Resolved This is actually resolved since this doesn't cause issues anymore. I'll open a separate task for moving to mod_proxy. [13:23:26] (03PS1) 10Faidon Liambotis: network: remove EXTERNAL_NETWORKS from ferm [puppet] - 10https://gerrit.wikimedia.org/r/296736 [13:23:28] (03PS1) 10Faidon Liambotis: network: use $all_networks in exim4 [puppet] - 10https://gerrit.wikimedia.org/r/296737 [13:23:30] (03PS1) 10Faidon Liambotis: librenms: remove nets setting [puppet] - 10https://gerrit.wikimedia.org/r/296738 [13:23:32] (03PS1) 10Faidon Liambotis: network: move external_networks to hiera as well [puppet] - 10https://gerrit.wikimedia.org/r/296739 [13:23:53] akosiaris: ^ :) [13:29:13] faidon on a roll [13:31:58] (03PS2) 10Faidon Liambotis: network: move external_networks to hiera as well [puppet] - 10https://gerrit.wikimedia.org/r/296739 [13:34:04] (03CR) 10Jcrespo: "reload will not work, it will require a full restart, probably." 
[puppet] - 10https://gerrit.wikimedia.org/r/296713 (https://phabricator.wikimedia.org/T119370) (owner: 10Hashar) [13:34:21] (03CR) 10Muehlenhoff: "already exists as https://gerrit.wikimedia.org/r/#/c/296379/" [puppet] - 10https://gerrit.wikimedia.org/r/296736 (owner: 10Faidon Liambotis) [13:35:01] (03CR) 10Jcrespo: "BTW, we can merge this without issue, worst case scenario, this only updates a file, it does not cause mysql to restart automatically anyw" [puppet] - 10https://gerrit.wikimedia.org/r/296713 (https://phabricator.wikimedia.org/T119370) (owner: 10Hashar) [13:38:38] (03CR) 10Alexandros Kosiaris: [C: 031] "great! I was wondering about this" [puppet] - 10https://gerrit.wikimedia.org/r/296738 (owner: 10Faidon Liambotis) [13:39:20] (03CR) 10Alexandros Kosiaris: [C: 031] network: move external_networks to hiera as well [puppet] - 10https://gerrit.wikimedia.org/r/296739 (owner: 10Faidon Liambotis) [13:40:15] (03CR) 10Alexandros Kosiaris: "are we sure we want that ? don't we want realm dependent networks ? Or if we need both, how about labs_networks + production_networks ?" [puppet] - 10https://gerrit.wikimedia.org/r/296737 (owner: 10Faidon Liambotis) [13:41:31] (03PS1) 10Elukey: Turn off Mariadb read_only mode for the Analytics Meta instance. [puppet] - 10https://gerrit.wikimedia.org/r/296740 [13:42:28] !log swift codfw-prod: ms-be202[234] weight 3000 [13:42:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:42:36] (03CR) 10jenkins-bot: [V: 04-1] Turn off Mariadb read_only mode for the Analytics Meta instance. [puppet] - 10https://gerrit.wikimedia.org/r/296740 (owner: 10Elukey) [13:42:41] (03CR) 10Faidon Liambotis: "The longer-term fix is separate mail relays for Labs. Until then, we're using the production relays for both production & Labs traffic (an" [puppet] - 10https://gerrit.wikimedia.org/r/296737 (owner: 10Faidon Liambotis) [13:44:03] (03CR) 10Alexandros Kosiaris: "Ok that explains it. 
Yeah the explicit definitions will definitely make it more futureproof and more readable. I 'd prefer it." [puppet] - 10https://gerrit.wikimedia.org/r/296737 (owner: 10Faidon Liambotis) [13:44:21] (03Abandoned) 10Faidon Liambotis: network: remove EXTERNAL_NETWORKS from ferm [puppet] - 10https://gerrit.wikimedia.org/r/296736 (owner: 10Faidon Liambotis) [13:44:47] (03PS2) 10Faidon Liambotis: network: use $all_networks in exim4 [puppet] - 10https://gerrit.wikimedia.org/r/296737 [13:44:49] (03PS2) 10Faidon Liambotis: librenms: remove nets setting [puppet] - 10https://gerrit.wikimedia.org/r/296738 [13:44:51] (03PS3) 10Faidon Liambotis: network: move external_networks to hiera as well [puppet] - 10https://gerrit.wikimedia.org/r/296739 [13:45:01] (03CR) 10Alexandros Kosiaris: [C: 032] "I think this is probably fine. pcc is also happy in https://puppet-compiler.wmflabs.org/3242/" [puppet] - 10https://gerrit.wikimedia.org/r/291024 (https://phabricator.wikimedia.org/T136301) (owner: 10Hashar) [13:45:13] (03PS4) 10Alexandros Kosiaris: apache: logrotate augeas rule needs apache2 package [puppet] - 10https://gerrit.wikimedia.org/r/291024 (https://phabricator.wikimedia.org/T136301) (owner: 10Hashar) [13:45:23] (03CR) 10Alexandros Kosiaris: [V: 032] apache: logrotate augeas rule needs apache2 package [puppet] - 10https://gerrit.wikimedia.org/r/291024 (https://phabricator.wikimedia.org/T136301) (owner: 10Hashar) [13:46:42] hashar, akosiaris: why not add a require on the augeas stanza instead? [13:47:39] paravoid: it would create an out of class dependency [13:47:49] not the best to deal with when debugging [13:48:01] especially since it's easily solveable this way [13:48:13] why? [13:48:19] why is it not best to deal with? [13:49:13] oh, it's a dependency on a package that is being installed from a class that is not directly installing it. 
So in the first refactor of the including class that for some reason no longer defines that package, you get a broken class [13:49:22] which had nothing to do with the refactoring [13:50:19] and you got a class that is broken, but nothing has been changed in it [13:50:30] ok, I suppose so [13:50:43] the flip side is that you create a class dependency instead of a more granular resource dependency [13:51:00] which might bite you in the long run [13:51:07] that's true, but that is a very small class [13:54:24] yeah, fair enoguh [13:54:30] enough even :) [13:57:53] (03PS1) 10ArielGlenn: do xml stubs dump pieces based on revs per page range [dumps] - 10https://gerrit.wikimedia.org/r/296742 (https://phabricator.wikimedia.org/T137887) [13:59:53] PROBLEM - puppet last run on mw2208 is CRITICAL: CRITICAL: puppet fail [14:07:48] paravoid: akosiaris: the dependencies are always a hot topic :/ Kind of hard to figure out the proper chain ;( [14:08:04] (03PS1) 10Aude: Put wikidatawiki back on 1.28.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296744 [14:14:41] got gallium spurting out: [14:14:42] Notice: [14:14:42] Notice: /Stage[main]/Network::Constants/Notify[dummy]/message: defined 'message' as '' [14:14:42] :( [14:15:45] akosiaris: ^ [14:15:50] yeah let's kill this thing [14:15:56] I don't care about rspec that much :) [14:16:04] I don't mind it, but not at the expense of this [14:16:18] hashar: yeah, I am more ok with removing it now that the change is merged [14:16:28] oh [14:16:34] it turned out to be immensely useful while developing the function [14:16:40] akosiaris: that is so some resource is realized just to have rspec load hiera? 
:( [14:16:48] hashar: yes [14:17:02] (03CR) 10Dereckson: [C: 031] [cleanup] Delete old throttle rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296733 (owner: 10Urbanecm) [14:17:07] (03PS1) 10Yuvipanda: [WIP] k8s: Don't provision abac/tokenauth directly [puppet] - 10https://gerrit.wikimedia.org/r/296747 [14:17:16] (03CR) 10Rush: [C: 031] "way more straightforward seems good to me" [puppet] - 10https://gerrit.wikimedia.org/r/296731 (owner: 10Faidon Liambotis) [14:17:28] (03PS2) 10Dereckson: Short array syntax [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296419 [14:17:53] (03PS2) 10Rush: base: remove ioscheduler setting from non-augeas codepath [puppet] - 10https://gerrit.wikimedia.org/r/296727 (owner: 10Faidon Liambotis) [14:18:00] (03CR) 10Rush: [C: 031] "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/296727 (owner: 10Faidon Liambotis) [14:19:25] (03PS2) 10Yuvipanda: [WIP] k8s: Don't provision abac/tokenauth directly [puppet] - 10https://gerrit.wikimedia.org/r/296747 [14:20:31] (03PS1) 10Jdrewniak: Updating www portals stats [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296748 (https://phabricator.wikimedia.org/T128546) [14:20:57] akosiaris: Could not find class network::constants for aeriale.local on node aeriale.local [14:20:57] :D [14:21:11] akosiaris: none pass on my local machine :( [14:21:31] (03PS3) 10Yuvipanda: [WIP] k8s: Don't provision abac/tokenauth directly [puppet] - 10https://gerrit.wikimedia.org/r/296747 [14:22:12] hashar: hmm they pass on mine. [14:22:20] well the jessie one that is [14:22:37] stretch has puppet4 and ... grrr [14:22:45] hashar: how do you call it ? [14:22:59] rake -t spec_standalone ? 
[14:23:02] bundle exec rake realspec [14:23:06] hey wrong target [14:23:13] I have no idea what that does [14:23:32] 06Operations, 10Ops-Access-Requests, 10Deployment-Systems, 06Discovery, and 5 others: Add wdqs-admins to deploy-services group - https://phabricator.wikimedia.org/T138628#2405800 (10ArielGlenn) I think there's going to be a lot more of these as more services are converted to use scap. I presume members of... [14:23:51] * akosiaris has avoided bundler [14:25:04] basically setup a set of gems at given versions [14:25:10] so you can pin the appropriate versions [14:26:01] RECOVERY - puppet last run on mw2208 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [14:26:11] no big deal akosiaris :-}  But one day we will have to look t running those tests in Jenkins eventually [14:27:30] 06Operations, 10Traffic, 06Community-Liaisons (Jul-Sep-2016): Help contact bot owners about the end of HTTP access to the API - https://phabricator.wikimedia.org/T136674#2418094 (10BBlack) New lists from just the past 24H (shortly before this post): New usernames: ``` Poudou99 ``` Previously notified, stil... [14:30:40] mutante: where would be the proper place to put the hiera data? once it's there, can it be removed from wikitech (how can we not have duplicated data)? 
[14:31:51] (03CR) 10BBlack: [C: 031] "Looks good, haven't actually tested/verified anything :)" [puppet] - 10https://gerrit.wikimedia.org/r/296729 (owner: 10Faidon Liambotis) [14:31:55] (03CR) 10BBlack: [C: 031] cache: un-hieraize tcpmhash_entries boot setting [puppet] - 10https://gerrit.wikimedia.org/r/296730 (owner: 10Faidon Liambotis) [14:32:27] bblack: the new define could use a comment [14:32:53] perf.pp is so well-documented, I felt kind of bad for not putting a comment on top of it [14:33:00] I was hoping you'd have a good suggestion :) [14:33:24] going to get a quick mobileapps deploy in before morning SWAT if there are no objections [14:33:39] (03CR) 10Chad: Gerrit: don't pass SMTP server info around either, it's in hiera (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/296682 (owner: 10Chad) [14:33:45] schana: production has only ./hieradata/ to store it. labs has 2 different places, also in files but a different repo AND on wiki pages. afaik these 2 places in labs get merged into one. production will not be influence by whatever is on wikitech or not [14:33:50] (re T139029) [14:33:51] T139029: REST API with endpoint /page/mobile-text/{title} always returning 404 - https://phabricator.wikimedia.org/T139029 [14:36:15] schana: so the right place in production would be.. if we have a role::ores then ./hieradata/role/common/ores.yaml [14:36:31] Thanks mutante [14:36:49] schana, if you are around, I'm happy to look a it with you. 
[14:37:06] sure halfak [14:37:08] if we don't have a role like that, then we can also do it per hostname or regex [14:37:31] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 200, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-5/2/3: down - Core: cr2-codfw:xe-5/0/1 (Zayo, OGYX/120003//ZYO) 36ms {#2909} [10Gbps wave]BR [14:37:46] mutante, was thinking that we might need to specify the web hosts differently in eqiad as opposed to codfw [14:37:48] ./hieradata/regex.yaml or ./hierdata/hosts/foo.yaml [14:37:48] mutante: there's already hieradata/role/common/ores/redis.yaml [14:38:05] (03PS1) 10Muehlenhoff: Add four more jessie-based image scalers [puppet] - 10https://gerrit.wikimedia.org/r/296749 [14:38:53] Are we using lvs to do load balancing? [14:39:01] If so, then I suppose we want to make lvs.yaml? [14:39:27] since we have this in labs: [14:39:29] $realservers = hiera('role::labs::ores::lb::realservers') [14:39:30] Yeah. Looks like we do the same here: https://github.com/wikimedia/operations-puppet/blob/production/hieradata/role/eqiad/ores/redis.yaml [14:39:40] ideally it is the same in production just without "labs" [14:39:41] 06Operations, 10Traffic, 13Patch-For-Review: Decrease max object TTL in varnishes - https://phabricator.wikimedia.org/T124954#2418150 (10BBlack) @Krinkle - the varnish TTL cap is *per layer*, and it's still 7 days in the backend layers (it's only 1 day in the frontend layers). If the test2wiki change is int... [14:40:05] Seems like this would be the right dir for eqiad: https://github.com/wikimedia/operations-puppet/tree/production/hieradata/role/eqiad/ores [14:40:13] mutante, how would you name the yaml file? [14:41:01] (03CR) 10Muehlenhoff: [C: 032 V: 032] Add four more jessie-based image scalers [puppet] - 10https://gerrit.wikimedia.org/r/296749 (owner: 10Muehlenhoff) [14:41:18] quick q: is all this in order to monitor from production a worker failing in labs ? 
[14:41:22] 06Operations, 10Ops-Access-Requests: root access on security-tools instances for Darian Patrick - https://phabricator.wikimedia.org/T138873#2412705 (10ArielGlenn) It seems like this ought to be a foregone conclusion, ganeti instance for use by security person should allow root access by security person. But i... [14:41:37] not a worker, akosiaris, a web node [14:41:39] halfak: hmm. maybe realservers.yaml [14:41:45] Thanks mutante [14:41:46] schana: yeah, sorry web node [14:42:03] paravoid: something like "This is the size of the hashtable for saved TCP metrics. The default on large memory machines is 16384, and we expect peak hashtable entries on the order of ~100K+, so a 64K hashtable will have far fewer collisions" [14:42:13] it would be nice if there was a better name than 'realservers' [14:42:20] akosiaris, seems like lvs should read this config too. [14:42:28] Or rather be configured based on it. [14:42:32] what config ? [14:42:36] I am confused [14:42:44] the planned realservers.yaml in hieradata [14:42:57] That would list out the hostnames of ores web nodes. [14:42:57] why would LVS do that ? [14:43:06] Oh. Sorry. What is doing our load balancing? [14:43:13] akosiaris: I responded on the smtp_host param stuff for gerrit :) [14:43:32] ostriches: ok thanks [14:44:05] akosiaris, I was under the impression that lvs was splitting requests to the web nodes. Sorry if that's not right. [14:44:18] But whatever is doing our load balancing should be configured based on hieradata [14:44:20] halfak: LVS is doing our load balancing indeed, but I fail to see the connection yet [14:44:26] no it should not [14:44:30] it is based on etcd data [14:44:32] akosiaris, how does lvs know where to route requests [14:44:40] which is under conftool-data in the puppet repo [14:45:03] (03CR) 10BBlack: "If all 3 header values are the same, just set it explicitly for the first one and set the rest from that one. 
e.g.:" [puppet] - 10https://gerrit.wikimedia.org/r/296634 (https://phabricator.wikimedia.org/T117618) (owner: 10Brian Wolff) [14:45:10] !log starting mobileapps deployment [14:45:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:45:23] akosiaris, can we read that when configuring the health check? [14:45:47] the health check for what ? labs ? [14:45:53] (Also, I'm lost why we have "conftool-data" and "hiera" [14:45:57] akosiaris, prod [14:46:05] We already use hiera with the load balancer in labs [14:46:07] prod is already monitored [14:46:12] akosiaris: in labs there is $realservers = hiera('role::labs::ores::lb::realservers') [14:46:17] in production that doesnt exist [14:46:17] Not the routes we need akosiaris [14:46:20] 06Operations, 10Ops-Access-Requests: root access on security-tools instances for Darian Patrick - https://phabricator.wikimedia.org/T138873#2412705 (10faidon) Neither this nor the subtask explain what exactly this VM will host and what Darian needs to run what and when — and why would this require full root.... [14:46:29] halfak: advertise them in the swagger spec ? [14:46:29] !log pooling four additional jessie-based image scalers (mw1295-mw1298) [14:46:31] (03CR) 10Yuvipanda: "Ah, apologies - I missed the nginx config change! This is fairly clever and I like it. My suggested change would be to change the URL form" [puppet] - 10https://gerrit.wikimedia.org/r/296535 (https://phabricator.wikimedia.org/T134782) (owner: 10Nschaaf) [14:46:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:46:52] akosiaris, we need something more specific. [14:47:30] more specific than all the endpoints advertised to all the clients ? [14:47:34] We need to have a request hit the "testwiki" model with a unique "rev_id" in order to make sure it actually goes out to a worker and does something. 
[14:47:37] akosiaris, yes [14:47:37] 06Operations, 10Ops-Access-Requests: root access on security-tools instances for Darian Patrick - https://phabricator.wikimedia.org/T138873#2418188 (10faidon) [14:47:39] 06Operations, 10vm-requests, 13Patch-For-Review, 05Security: provide ganeti VM for security team sectools - https://phabricator.wikimedia.org/T138650#2418186 (10faidon) 05Resolved>03Open The role is still stub as you said, not sure why this task was resolved. [14:47:41] something is missing from the swagger spec then [14:47:45] akosiaris, no it isn't [14:47:56] The swagger spec doesn't tell how to get around the cache. [14:48:10] and thank god for that [14:48:10] akosiaris, if you'd like to file a bug for the swagger spec, I welcome you. [14:48:19] 06Operations, 10vm-requests, 13Patch-For-Review, 05Security: provide ganeti VM for security team sectools - https://phabricator.wikimedia.org/T138650#2418189 (10Dzahn) Because it was about creating a VM for it. [14:48:24] In the meantime, we want to get monitoring in place so we don't have a massive downtime event *again* [14:48:29] It would be great to have your help [14:48:55] which is what I am trying to do [14:49:11] I am trying to understand what is it that we are trying to monitor [14:49:13] !log mobileapps finished deploying 43538aa [14:49:16] ok, so here's the confusion from my perspective [14:49:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:49:26] ores in prod is configured pretty differently from ores in 'labs', which I'm going to start callign ores-experimental [14:49:35] ores-labs [14:49:41] It's called that everywhere [14:49:45] ores-experimental (which is ores.wmflabs.org) uses nginx for load balancing [14:49:49] Labs_labs_labs :) [14:49:49] (everywhere I've gotten to so far) [14:50:25] I'm going to call it ores-experimental because I thought we agreed on that, and I'm not going to call it ores-labs because of Labs_labs_labs. 
but that's a tangent let's not get there now [14:50:39] so it uses nginx for load balancing, and there's no way for us to know when any of the web nodes behind it die [14:50:59] because we don't have a way for us to monitor things per node in labs [14:51:14] so this change is 100% about ores-experimental, and has nothing to do with prod at all [14:51:33] I called the hiera setting realservers back when I set it up to match LVS terminology, and I now see why that's confusing [14:51:36] yuvipanda, we had a massive downtime event in prod because of this lack of monitoring [14:52:01] well, if we want this in *prod* that's much easier to do, since in prod you can just setup an nrpe check to hit localhost [14:52:06] Massive as in it was ~6 hours because we realized it. [14:52:23] is there docs on this somewhere? [14:52:27] Yes [14:52:29] err, incident report [14:52:31] Yes [14:52:54] link? so I can read and produce informed opinion. [14:53:27] btw, monitoring in production is not yet where we want it to be. it's very very basic as it is not still using service_checker [14:53:33] I think until we rename ores.wmflabs.org to ores-experimental.wmflabs.org and follow through on https://etherpad.wikimedia.org/p/environments-ores this confusion is going to exist, since I assumed thsi was purely about ores.wmflabs.org [14:53:34] https://wikitech.wikimedia.org/wiki/Incident_documentation/20160610-ORES [14:53:45] yuvipanda, that's a different conversation [14:53:48] Also I disagree [14:53:52] Let's talk abotu that later [14:54:10] so if this is true: < akosiaris> btw, monitoring in production is not yet where we want it to be. [14:54:19] then what is the issue with adding it [14:54:32] aaargh, I thought we already agreed, but sure. 
[14:54:43] mutante: oh, it's just ores advertising a swagger spec under /?spec [14:54:47] there's a task for that [14:54:56] I am waiting for it in order to enable service_checker [14:54:57] akosiaris, it's been done [14:55:05] But that won't help this [14:55:09] wat ? how did I miss that ? [14:55:14] so is there anything bad about https://gerrit.wikimedia.org/r/#/c/296535/ ? [14:55:40] mutante it doesn't touch prod at all. [14:55:44] ah, closed yesterday https://phabricator.wikimedia.org/T137804 [14:55:58] akosiaris, that spec will not allow for the check we need. [14:56:14] akosiaris, yes. Goes into the done column, gets closed at the end of the week. [14:56:18] ok, now we are getting somewhere [14:56:26] so, what is the check we need ? [14:56:36] mutante wrote that as an icinga check [14:56:36] let me find it [14:56:59] https://gerrit.wikimedia.org/r/#/c/296054/2/modules/nagios_common/files/check_commands/check_ores_workers [14:57:00] akosiaris look at ./modules/nagios_common/files/check_commands/check_ores_workers [14:57:06] that's what we need [14:57:12] that _is_ an icinga check [14:57:13] ah that timestamp thing ? [14:57:17] yes [14:57:18] akosiaris, we looked at this together [14:57:18] i dont know what is going anymore, sorry [14:57:19] Yes [14:57:20] steps back [14:57:25] so I have a big question on this one [14:57:30] mutante we're reticulating splines [14:57:35] is that check effectively polluting the cache ? [14:57:44] akosiaris, yes :) [14:57:44] 06Operations, 10vm-requests, 13Patch-For-Review, 05Security: provide ganeti VM for security team sectools - https://phabricator.wikimedia.org/T138650#2418199 (10faidon) Well OK, that's fine :) There is no description on the request on what will be included there and no other task to describe this setup. An... [14:57:49] But we don't care [14:57:54] Because it's a drop in the bucket [14:58:29] It's nice to know that things don't blow up when storing in the cache. [14:58:38] It exercises a real scoring job. 
[14:58:47] It's a nice check and it catches deep problems [14:58:47] we don't have to pollute the cache to actually see that [14:58:57] and it's an end-to-end check [14:59:12] not a very fine grained one [14:59:27] 06Operations, 10Ops-Access-Requests: root access on security-tools instances for Darian Patrick - https://phabricator.wikimedia.org/T138873#2418200 (10ArielGlenn) @faindon OK, will ask for more info. @dpatrick what tools and scripts would this be running? What would you be using root for on the instance? [14:59:38] so, when I was saying there is something missing from the swagger spec, I had a point [14:59:43] akosiaris, OK. Fine. File a task for making it more fine-grained and we can prioritize that for later. [15:00:03] * yuvipanda awards point to akosiaris [15:00:05] anomie, ostriches, thcipriani, hashar, twentyafterfour, and aude: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160630T1500). [15:00:05] Urbanecm, kart_, aude, Dereckson, and jan_drewniak: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [15:00:08] other services have healthcheck urls [15:00:19] akosiaris, file a task? [15:00:25] Around [15:00:25] the idea is a url that has the service doing a very basic health reporting [15:00:28] oh I will [15:00:32] Thanks [15:01:01] PROBLEM - Juniper alarms on mr1-ulsfo is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 198.35.26.194 [15:01:06] So, now are we going to block health monitoring on a production service in order to make it fit these requirements first? [15:01:22] I can SWAT today. Looks like quite a few patches... 
[15:01:51] thcipriani: ok [15:02:00] * kart_ around [15:02:12] halfak: not sure I understand [15:02:39] (03PS3) 10Thcipriani: Change project logo for enwikt [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296403 (https://phabricator.wikimedia.org/T138801) (owner: 10Urbanecm) [15:02:51] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296403 (https://phabricator.wikimedia.org/T138801) (owner: 10Urbanecm) [15:02:51] Can we get the check we have that works in place now or should we implement whatever it is you want in the swagger spec first? [15:03:26] (03Merged) 10jenkins-bot: Change project logo for enwikt [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296403 (https://phabricator.wikimedia.org/T138801) (owner: 10Urbanecm) [15:03:35] get that check per node as is ? we probably can't do it in any kind of clean way [15:03:47] get it working on the service as a whole ? I think we got that already [15:03:51] PROBLEM - Router interfaces on mr1-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.194 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [15:04:08] !log rolling reboot of wtp2 for kernel security update [15:04:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:04:13] even with this solution, we're missing granularity on the worker nodes [15:04:35] is check_ores_workers running against prod right now? [15:05:02] yes [15:05:29] Oh good. [15:05:34] I misunderstood that. [15:05:49] OK. So now I don't see how a swagger spec will help the work that schana is trying to do. 
[15:05:51] RECOVERY - Juniper alarms on mr1-ulsfo is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms [15:06:03] * akosiaris is starting to think we are arguing too much because of misunderstandings [15:06:05] Or have we concluded that we dont' need to check individual workers in prod [15:06:20] RECOVERY - Router interfaces on mr1-ulsfo is OK: OK: host 198.35.26.194, interfaces up: 38, down: 0, dormant: 0, excluded: 0, unused: 0 [15:06:23] akosiaris, too many cooks IMO [15:06:25] (03PS1) 10Addshore: Deploy RevisionSlider to test test2 and testikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296753 (https://phabricator.wikimedia.org/T138943) [15:06:31] halfak: :-) [15:06:48] (03CR) 10Addshore: [C: 04-1] Deploy RevisionSlider to test test2 and testikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296753 (https://phabricator.wikimedia.org/T138943) (owner: 10Addshore) [15:06:51] so, we definitely want to check as much as possible of the functionality of every worker in prod [15:07:14] !log thcipriani@tin Synchronized static/images/project-logos/enwiktionary.png: SWAT: [[gerrit:296403|Change project logo for enwikt (T138801)]] (duration: 00m 25s) [15:07:15] T138801: English Wiktionary—new logo - https://phabricator.wikimedia.org/T138801 [15:07:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:07:30] and the interface we 've created with the services team to get that done easily is as I said already the service_checker which uses the swagger spec [15:07:34] 7~/win 29 [15:08:05] akosiaris, OK. I'm not convinced that this is a good option for us and the needs we have on this project. [15:08:26] No one has helped me understand how we can do an end-to-end test that won't hit the cache with it. [15:08:46] halfak: ah, that's my failure then. [15:09:02] 1. I'd like to see a proposal for how that could work [15:09:08] Urbanecm: wiktionary logo updated and purged. [15:09:10] 2. 
I'd like to get our basic health check in place in the meantime. [15:09:41] We'll likely need to iterate for a while on how the spec is generated before we can have it specify endpoints in the way you want. [15:09:42] ok for 1 lemme study the swagger spec a bit and I 'll come back with a proposal [15:09:57] Right now, the path is /scores//// [15:10:00] as you 've suggested I 'll file a task [15:10:15] (03CR) 10Aude: Deploy RevisionSlider to test test2 and testikidata (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296753 (https://phabricator.wikimedia.org/T138943) (owner: 10Addshore) [15:10:17] Swagger doesn't tell you what contexts, models or revids are available [15:10:47] thcipriani: Ehm... I think it should be bigger. [15:11:02] Urbanecm: yeah, I'm going to revert for the time being. [15:11:34] ahh aude its SWAT now! but it looks over filled already :( [15:12:23] !log thcipriani@tin Synchronized static/images/project-logos/enwiktionary.png: SWAT: Revert [[gerrit:296403|Change project logo for enwikt (T138801)]] (duration: 00m 25s) [15:12:24] T138801: English Wiktionary—new logo - https://phabricator.wikimedia.org/T138801 [15:12:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:12:54] (03PS2) 10Thcipriani: [cleanup] Delete old throttle rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296733 (owner: 10Urbanecm) [15:13:48] (03PS1) 10Thcipriani: Revert "Change project logo for enwikt" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296754 [15:13:58] (03CR) 10Alexandros Kosiaris: Gerrit: don't pass SMTP server info around either, it's in hiera (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/296682 (owner: 10Chad) [15:14:05] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296754 (owner: 10Thcipriani) [15:14:41] 06Operations, 10Ops-Access-Requests: root access on security-tools instances for Darian Patrick - 
https://phabricator.wikimedia.org/T138873#2418308 (10Dzahn) @Faidon This is the outcome of meeting Darian at Wikimania. He has explained to me that there are some scripts that historically have been running on Chr... [15:14:48] (03Merged) 10jenkins-bot: Revert "Change project logo for enwikt" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296754 (owner: 10Thcipriani) [15:15:13] (03PS3) 10Thcipriani: [cleanup] Delete old throttle rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296733 (owner: 10Urbanecm) [15:15:51] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296733 (owner: 10Urbanecm) [15:16:35] (03Merged) 10jenkins-bot: [cleanup] Delete old throttle rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296733 (owner: 10Urbanecm) [15:16:38] swat! [15:16:46] (03PS2) 10Filippo Giunchedi: syslog: limit source range to $PRODUCTION_NETWORKS [puppet] - 10https://gerrit.wikimedia.org/r/295368 [15:17:13] (03PS2) 10Tobias Gritschacher: Deploy RevisionSlider to test test2 and testwikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296753 (https://phabricator.wikimedia.org/T138943) (owner: 10Addshore) [15:18:02] !log thcipriani@tin Synchronized wmf-config/throttle.php: SWAT: [[gerrit:296733|Delete old throttle rules]] (duration: 00m 25s) [15:18:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:18:16] ^ Urbanecm sync'd throttle rule cleanup, thanks for the patch [15:18:47] Thanks for deploying thcipriani [15:19:11] (03PS5) 10Thcipriani: Deploy Compact Language Links as default (Stage 3.5) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296501 (https://phabricator.wikimedia.org/T136677) (owner: 10KartikMistry) [15:19:34] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296501 (https://phabricator.wikimedia.org/T136677) (owner: 10KartikMistry) [15:19:34] BTW why the logo on enwikt isn't changed back thcipriani? 
[15:19:52] thcipriani: test deploy as usual first :) [15:20:04] Urbanecm: hmmm, it changed back for me...sync'd and purged. [15:20:10] 06Operations, 10Ops-Access-Requests: root access on security-tools instances for Darian Patrick - https://phabricator.wikimedia.org/T138873#2418364 (10faidon) OK. FTR, I would have prefered it if your meeting at Wikimania was transcribed into Phabricator on a separate task and if this work (puppet commits, fix... [15:20:32] It has changed a few seconds ago for me too. [15:20:44] (03Merged) 10jenkins-bot: Deploy Compact Language Links as default (Stage 3.5) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296501 (https://phabricator.wikimedia.org/T136677) (owner: 10KartikMistry) [15:20:45] Urbanecm: phew, ok, good :) [15:21:33] kart_: mw1017 has the change. [15:22:07] ok. testing. [15:24:01] PROBLEM - puppet last run on db2037 is CRITICAL: CRITICAL: puppet fail [15:25:34] thcipriani: looks good. [15:25:38] thcipriani: go ahead. [15:25:43] kart_: kk, doing [15:26:18] (03PS2) 10ArielGlenn: do xml stubs dump pieces based on revs per page range [dumps] - 10https://gerrit.wikimedia.org/r/296742 (https://phabricator.wikimedia.org/T137887) [15:27:13] !log thcipriani@tin Synchronized dblists/clldefault.dblist: SWAT: [[gerrit:296501|Deploy Compact Language Links as default (Stage 3.5) (T136677)]] (duration: 00m 26s) [15:27:14] T136677: Deployment of Compact Language Links - https://phabricator.wikimedia.org/T136677 [15:27:16] ^ kart_ sync'd! [15:27:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:28:28] aude: how do you want to roll out the wikidata change? Fine to roll wikidata update to both wmf.7 and wmf.8 then do the config change? Or wmf.8, config change, then wmf.7? 
[15:28:30] PROBLEM - Hadoop DataNode on analytics1034 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode [15:28:50] (03PS2) 10Chad: Gerrit: don't pass SMTP server info around either, realm.pp provides it [puppet] - 10https://gerrit.wikimedia.org/r/296682 [15:28:53] (03PS4) 10Yuvipanda: [WIP] tools: Provision accounts for all tools [puppet] - 10https://gerrit.wikimedia.org/r/296747 [15:28:56] (03PS2) 10Thcipriani: Put wikidatawiki back on 1.28.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296744 (owner: 10Aude) [15:29:00] (03PS1) 10BryanDavis: sitelist: update link to sitelist documentation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296756 [15:29:26] thcipriani: i can check on test.wikidata [15:29:35] thcipriani: checking. [15:29:36] 06Operations, 10Ops-Access-Requests: root access on security-tools instances for Darian Patrick - https://phabricator.wikimedia.org/T138873#2418403 (10Dzahn) Ok, sorry for not leaving more detail. It was meant to be T138873#2412723 though and planned to explain it at meeting along with the access request. I ju... [15:29:49] akosiaris: Ok, commit summary tweaked. If that string of changes lands that should wrap up my cleanup of gerrit manifests. They're actually mostly useful now! [15:29:50] :D [15:29:51] the change is only essential for wmf.8 but won't do harm on wmf.7 [15:30:08] thcipriani: good so far! :) [15:30:23] (03PS5) 10Yuvipanda: [WIP] tools: Provision accounts for all tools [puppet] - 10https://gerrit.wikimedia.org/r/296747 [15:30:35] (03PS1) 10Dereckson: Logo update for en.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296757 (https://phabricator.wikimedia.org/T138801) [15:30:38] Urbanecm: thcipriani: ^ this one seems at the good size [15:30:43] aude: ack, ok, I'll roll to wmf.8 and wmf.7 and let you test on test.wikidata, then roll forward wit hthe config. 
[15:30:58] thanks Dereckson [15:31:00] (03CR) 10Alexandros Kosiaris: [C: 032] Gerrit: don't pass SMTP server info around either, realm.pp provides it [puppet] - 10https://gerrit.wikimedia.org/r/296682 (owner: 10Chad) [15:31:01] RECOVERY - Hadoop DataNode on analytics1034 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode [15:31:07] (03PS3) 10Alexandros Kosiaris: Gerrit: don't pass SMTP server info around either, realm.pp provides it [puppet] - 10https://gerrit.wikimedia.org/r/296682 (owner: 10Chad) [15:31:15] (03CR) 10Alexandros Kosiaris: [V: 032] Gerrit: don't pass SMTP server info around either, realm.pp provides it [puppet] - 10https://gerrit.wikimedia.org/r/296682 (owner: 10Chad) [15:31:25] (03CR) 10Alexandros Kosiaris: [C: 032] Gerrit: Stop customizing ssh port. It's not like we're changing it [puppet] - 10https://gerrit.wikimedia.org/r/296683 (owner: 10Chad) [15:31:30] (03PS2) 10Alexandros Kosiaris: Gerrit: Stop customizing ssh port. It's not like we're changing it [puppet] - 10https://gerrit.wikimedia.org/r/296683 (owner: 10Chad) [15:31:34] (03CR) 10Alexandros Kosiaris: [V: 032] Gerrit: Stop customizing ssh port. 
It's not like we're changing it [puppet] - 10https://gerrit.wikimedia.org/r/296683 (owner: 10Chad) [15:31:34] thcipriani: ok [15:31:42] (03CR) 10Alexandros Kosiaris: [C: 032] Gerrit: don't bother ensuring hooks directory, we don't use it anymore [puppet] - 10https://gerrit.wikimedia.org/r/296684 (owner: 10Chad) [15:31:48] (03PS2) 10Alexandros Kosiaris: Gerrit: don't bother ensuring hooks directory, we don't use it anymore [puppet] - 10https://gerrit.wikimedia.org/r/296684 (owner: 10Chad) [15:31:52] (03CR) 10jenkins-bot: [V: 04-1] [WIP] tools: Provision accounts for all tools [puppet] - 10https://gerrit.wikimedia.org/r/296747 (owner: 10Yuvipanda) [15:31:54] (03CR) 10Alexandros Kosiaris: [V: 032] Gerrit: don't bother ensuring hooks directory, we don't use it anymore [puppet] - 10https://gerrit.wikimedia.org/r/296684 (owner: 10Chad) [15:32:01] (03CR) 10Alexandros Kosiaris: [C: 032] Gerrit: remove replicationdest, unused [puppet] - 10https://gerrit.wikimedia.org/r/296685 (owner: 10Chad) [15:32:10] (03PS2) 10Alexandros Kosiaris: Gerrit: remove replicationdest, unused [puppet] - 10https://gerrit.wikimedia.org/r/296685 (owner: 10Chad) [15:32:14] (03CR) 10Alexandros Kosiaris: [V: 032] Gerrit: remove replicationdest, unused [puppet] - 10https://gerrit.wikimedia.org/r/296685 (owner: 10Chad) [15:32:30] (03PS1) 10Hashar: network: fix spec not loading libs properly [puppet] - 10https://gerrit.wikimedia.org/r/296758 [15:33:05] (03PS6) 10Yuvipanda: [WIP] tools: Provision accounts for all tools [puppet] - 10https://gerrit.wikimedia.org/r/296747 [15:34:19] !log thcipriani@tin Synchronized php-1.28.0-wmf.8/extensions/Wikidata/extensions/Wikibase/repo/includes/WikibaseRepo.php: SWAT: [[gerrit:296701|Update Wikidata - Fix broken editing of statements (T138974)]] (duration: 00m 31s) [15:34:20] T138974: SyntaxError's when adding statements - https://phabricator.wikimedia.org/T138974 [15:34:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:34:26] 
^ aude check please [15:34:27] ostriches: done. 4/4 merged [15:34:31] checking [15:34:32] gerrit being restarted [15:34:50] looks good [15:35:07] !log restarted (actually puppet did) gerrit after merging 4 related changes [15:35:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:35:17] aude: kk, rolling to wmf.7 [15:35:21] thanks [15:36:07] !log thcipriani@tin Synchronized php-1.28.0-wmf.7/extensions/Wikidata/extensions/Wikibase/repo/includes/WikibaseRepo.php: SWAT: [[gerrit:296701|Update Wikidata - Fix broken editing of statements (T138974)]] (duration: 00m 25s) [15:36:08] T138974: SyntaxError's when adding statements - https://phabricator.wikimedia.org/T138974 [15:36:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:36:20] ^ aude FYI [15:38:24] k [15:38:43] should be safe now to put wikidata on wmf.8 [15:39:01] !log thcipriani@tin rebuilt wikiversions.php and synchronized wikiversions files: SWAT: [[gerrit:296744|Put wikidatawiki back on 1.28.0-wmf.8]] [15:39:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:39:07] ^ aude check please [15:40:04] checking [15:40:32] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:41:07] looks good [15:41:14] cool, thanks for checking [15:42:52] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [15:43:30] akosiaris: Thank youuuuu!!!! 
[15:44:04] hmm gerrit-wm is asleep on the job [15:44:20] hmm [15:44:21] yeah [15:45:00] I just kicked it, thcipriani [15:45:19] thanks yuvipanda [15:45:22] yw [15:45:41] I suspect it's because gerrit got restarted with ostriches' patch, and grrrit-wm needs a restart after such [15:45:59] oh, it's that all right [15:46:06] I forgot to restart it, sorry [15:46:21] Dereckson: phew this is a big patch :) [15:46:35] I suppose grrrit-wm sets up a persistent connection and never tries to reestablish it ? [15:46:49] yeah [15:46:50] ssh [15:47:38] 06Operations, 10MediaWiki-extensions-UniversalLanguageSelector, 10Wikimedia-SVG-rendering, 07I18n: MB Lateefi Fonts for Sindhi Wikipedia. - https://phabricator.wikimedia.org/T138136#2418497 (10mehtab.ahmed) If I can get any solution to this matter please. [15:48:10] !log thcipriani@tin Synchronized wmf-config: SWAT: [[gerrit:296419|Short array syntax]] (duration: 00m 30s) [15:48:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:48:22] ^ Dereckson sync'd [15:48:49] logs all seem fine [15:49:23] for some value of "fine" :) [15:49:38] "not worse" [15:49:58] (03PS2) 10Thcipriani: Use extension registration for LabeledSectionTransclusion [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281237 (https://phabricator.wikimedia.org/T119117) (owner: 10Dereckson) [15:50:22] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281237 (https://phabricator.wikimedia.org/T119117) (owner: 10Dereckson) [15:50:41] exactly [15:50:58] Ah, yeah it probably freaked out by a few newline (otherwise basically no-op) changes to the config file [15:50:59] (03Merged) 10jenkins-bot: Use extension registration for LabeledSectionTransclusion [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281237 (https://phabricator.wikimedia.org/T119117) (owner: 10Dereckson) [15:51:03] RECOVERY - puppet last run on db2037 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 
failures [15:51:08] So puppet was smart and restarted it :) [15:51:20] yuvipanda, thcipriani ^ [15:51:38] thcipriani: yes, config is full of arrays [15:53:33] !log thcipriani@tin Synchronized wmf-config: SWAT: [[gerrit:281237|Use extension registration for LabeledSectionTransclusion (T119117)]] (duration: 00m 27s) [15:53:34] T119117: Get rid of $wg = $wmg hack for extensions that have been converted to using extension.json - https://phabricator.wikimedia.org/T119117 [15:53:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:53:46] ^ Dereckson check please [15:54:04] (03CR) 10Alexandros Kosiaris: [C: 032] "thank you! This is great!" [puppet] - 10https://gerrit.wikimedia.org/r/296758 (owner: 10Hashar) [15:54:12] (03PS3) 10Alexandros Kosiaris: network: fix spec not loading libs properly [puppet] - 10https://gerrit.wikimedia.org/r/296758 (owner: 10Hashar) [15:54:14] (03CR) 10Alexandros Kosiaris: [V: 032] network: fix spec not loading libs properly [puppet] - 10https://gerrit.wikimedia.org/r/296758 (owner: 10Hashar) [15:54:32] akosiaris: I presume it worked on your local machine wasn't it ? 
[15:55:02] (03PS1) 10Chad: WIP: Gerrit: move replication config to hiera [puppet] - 10https://gerrit.wikimedia.org/r/296762 [15:55:04] (03PS2) 10Muehlenhoff: install_server::tftp_server: Use PRODUCTION_NETWORKS [puppet] - 10https://gerrit.wikimedia.org/r/296218 [15:55:06] Dereckson: hmm, trying to think of how to sync https://gerrit.wikimedia.org/r/#/c/281240/1 [15:55:06] hashar: e, almost [15:55:22] Puppet::ParseError: [15:55:22] hiera() has been converted to 4x API [15:55:36] but that is actually due to my machine having puppet 4, so it is safely ignored [15:55:50] thcipriani: tricky [15:57:00] normally such changes should be CS, IS, but here that means we lost wgTitleBlacklistUsernameSources info [15:57:52] (03CR) 10Muehlenhoff: [C: 032 V: 032] install_server::tftp_server: Use PRODUCTION_NETWORKS [puppet] - 10https://gerrit.wikimedia.org/r/296218 (owner: 10Muehlenhoff) [15:58:14] akosiaris: that is where bundler comes into play. The Gemfile has gem 'puppet', '~> 3.4.3' [15:58:24] Dereckson: could you maybe make a transitional patch? One where you just copy the array in IS instead of renaming it? [15:58:32] akosiaris: so if you: bundle install ; bundle exec rake ; that should use puppet 3.4.3 [15:59:18] then, after that's synced, wmg can be removed [15:59:29] ok [16:00:05] godog, moritzm, and _joe_: Respected human, time to deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160630T1600). Please do the needful. [16:00:38] nothing for puppet swat [16:01:20] (03CR) 10Chad: "Worked on the first try? Go me! https://puppet-compiler.wmflabs.org/3243/ytterbium.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/296762 (owner: 10Chad) [16:01:35] (03PS2) 10Chad: Gerrit: move replication config to hiera [puppet] - 10https://gerrit.wikimedia.org/r/296762 [16:01:44] moritzm: ! 
[16:02:53] (03PS1) 10Krinkle: Set $wgSquidMaxage to 14 days on test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296765 (https://phabricator.wikimedia.org/T124954) [16:03:04] (03PS2) 10Krinkle: Set $wgSquidMaxage to 14 days on test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296765 (https://phabricator.wikimedia.org/T124954) [16:04:10] If you guys want something, ^^^ will be the last gerrit cleanup patch for me [16:04:18] 296762 [16:04:53] (03PS2) 10Dereckson: Use extension registration for TitleBlacklist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281240 (https://phabricator.wikimedia.org/T119117) [16:05:35] (03CR) 10jenkins-bot: [V: 04-1] Use extension registration for TitleBlacklist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281240 (https://phabricator.wikimedia.org/T119117) (owner: 10Dereckson) [16:06:31] (03PS3) 10Dereckson: Use extension registration for TitleBlacklist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281240 (https://phabricator.wikimedia.org/T119117) [16:07:58] ostriches: for puppetswat? yeah I can take a look [16:08:47] godog: It's the last in a longgggg string of changes I started last night that have all landed so far :) [16:08:56] Should be a no-op according to puppet compiler test [16:09:11] (03PS9) 10Yuvipanda: [WIP] tools: Provision accounts for all tools [puppet] - 10https://gerrit.wikimedia.org/r/296747 [16:10:06] (03CR) 10jenkins-bot: [V: 04-1] [WIP] tools: Provision accounts for all tools [puppet] - 10https://gerrit.wikimedia.org/r/296747 (owner: 10Yuvipanda) [16:10:16] (03PS1) 10Dereckson: Get rid of $wmgTitleBlacklistUsernameSources [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296767 [16:10:39] thcipriani: so sync order would be 281240 - IS, CS, then 296747 - IS [16:11:00] Dereckson: ack, ok. 
[16:11:27] (03CR) 10Filippo Giunchedi: "one nit over comments, LGTM otherwise" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/296762 (owner: 10Chad) [16:11:30] ostriches: neat [16:11:31] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281240 (https://phabricator.wikimedia.org/T119117) (owner: 10Dereckson) [16:11:45] (03PS10) 10Yuvipanda: tools: Provision accounts for all tools [puppet] - 10https://gerrit.wikimedia.org/r/296747 [16:13:01] (03Merged) 10jenkins-bot: Use extension registration for TitleBlacklist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281240 (https://phabricator.wikimedia.org/T119117) (owner: 10Dereckson) [16:13:12] (03CR) 10Chad: Gerrit: move replication config to hiera (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/296762 (owner: 10Chad) [16:13:40] (03PS3) 10Filippo Giunchedi: Gerrit: move replication config to hiera [puppet] - 10https://gerrit.wikimedia.org/r/296762 (owner: 10Chad) [16:13:45] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] Gerrit: move replication config to hiera [puppet] - 10https://gerrit.wikimedia.org/r/296762 (owner: 10Chad) [16:14:11] ostriches: {{done}} {{rubberstamp}} [16:14:53] godog: Yay thanks! 
[16:15:02] PROBLEM - Disk space on elastic1021 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=90%) [16:15:08] Gerrit puppet makes me feel all warm and fuzzy now [16:15:13] Instead of sick to my stomach [16:15:27] damn problem with elastic again [16:15:43] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:281240|Use extension registration for TitleBlacklist (T119117)]] PART I (duration: 00m 39s) [16:15:44] T119117: Get rid of $wg = $wmg hack for extensions that have been converted to using extension.json - https://phabricator.wikimedia.org/T119117 [16:15:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:16:05] I'm happy puppet doesn't make you sick anymore, I'm not quite there yet I think [16:16:12] PROBLEM - Disk space on elastic1036 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=94%) [16:16:17] (03CR) 10Dereckson: "Follow-up: Ie973544a696eb449a71c6300359a40ae4ec9b373" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281240 (https://phabricator.wikimedia.org/T119117) (owner: 10Dereckson) [16:16:51] !log thcipriani@tin Synchronized wmf-config: SWAT: [[gerrit:281240|Use extension registration for TitleBlacklist (T119117)]] PART II (duration: 00m 36s) [16:16:52] T119117: Get rid of $wg = $wmg hack for extensions that have been converted to using extension.json - https://phabricator.wikimedia.org/T119117 [16:16:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:17:34] !log truncating elastic logs on elastic1036 and elastic1021 [16:17:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:18:15] what is happenening: assumed 'TBLSRC_URL' in /srv/mediawiki/wmf-config/CommonSettings.php on line 788 [16:18:32] RECOVERY - Disk space on elastic1036 is OK: DISK OK [16:19:41] RECOVERY - Disk space on elastic1021 is OK: DISK OK [16:19:59] 06Operations, 10ops-eqiad, 06DC-Ops: dbstore1002.mgmt.eqiad.wmnet: 
"No more sessions are available for this type of connection!" - https://phabricator.wikimedia.org/T119488#1827597 (10elukey) @jcrespo: I believe that we could simply decide when to perform maintenance and then communicate it a couple of days... [16:20:24] This is a define provided by the extension, https://www.mediawiki.org/wiki/Extension:TitleBlacklist [16:20:49] hmm, I'm rolling back. [16:22:39] !log thcipriani@tin Synchronized wmf-config/CommonSettings.php: SWAT:Revert "Use extension registration for TitleBlacklist" (duration: 00m 27s) [16:22:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:23:32] (03PS1) 10ArielGlenn: disable dump cron jobs on snapshots 1001,2,4 [puppet] - 10https://gerrit.wikimedia.org/r/296768 [16:23:32] !log thcipriani@tin Synchronized wmf-config: SWAT:Revert "Use extension registration for TitleBlacklist" (duration: 00m 32s) [16:23:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:24:42] (03PS1) 10Thcipriani: Revert "Use extension registration for TitleBlacklist" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296769 [16:25:00] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296769 (owner: 10Thcipriani) [16:25:42] (03Merged) 10jenkins-bot: Revert "Use extension registration for TitleBlacklist" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296769 (owner: 10Thcipriani) [16:25:54] (03CR) 10ArielGlenn: [C: 032] disable dump cron jobs on snapshots 1001,2,4 [puppet] - 10https://gerrit.wikimedia.org/r/296768 (owner: 10ArielGlenn) [16:27:39] Possibly stupid question. Is there a way to dynamically obtain these IPs in the role class? 
https://phabricator.wikimedia.org/diffusion/OPUP/browse/production/modules/role/manifests/gerrit/production.pp;HEAD$18,23 [16:28:35] (03PS1) 10Dereckson: Use extension registration for TitleBlacklist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296770 [16:29:43] (03PS11) 10Yuvipanda: tools: Provision accounts for all tools [puppet] - 10https://gerrit.wikimedia.org/r/296747 [16:30:55] There is a lot of AbuseFilter mw exceptions, is that a known issue? [16:31:21] PROBLEM - Disk space on elastic1021 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=90%) [16:31:38] (03PS1) 10Mxn: Bumped portals [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296772 [16:32:31] (03PS12) 10Yuvipanda: tools: Provision accounts for all tools [puppet] - 10https://gerrit.wikimedia.org/r/296747 [16:32:34] jynus: Looking [16:32:44] well, not "a lot" [16:32:54] thcipriani: I've opened https://phabricator.wikimedia.org/T139075 to report the issue to extension code, https://phabricator.wikimedia.org/T139075 substs the configuration define by the string. Perhaps wait extensions maintainers (if any) feedback to see if they want to add back the define() in a registration callback method. [16:33:14] The "Duplicate get" ones? [16:33:22] Dereckson: ack, thanks for that. [16:33:41] RECOVERY - Disk space on elastic1021 is OK: DISK OK [16:34:37] but 200 errors/5 minutes [16:35:45] we have these two tasks for AbuseFilter: https://phabricator.wikimedia.org/T138529 https://phabricator.wikimedia.org/T138528 [16:35:51] it is a terrible amount of logspam [16:36:59] There is https://gerrit.wikimedia.org/r/#/c/296491/ to fix that. AaronSchulz Krinkle > you have plans to backport it? [16:38:20] I'm not sure. Should be fine, but also major code path. Do we have abuse filters in labs that trigger this? [16:38:30] Or on test2?wiki? 
[16:38:42] I'd be happy to backport it if someone can help me test/verify it [16:39:13] thcipriani: oh by the way, as a follow-up for en.wiktionary logo, I prepared https://gerrit.wikimedia.org/r/#/c/296757/ one hour ago, I don't know if you want to deploy it now or if I schedule it for a next SWAT window. [16:39:47] Dereckson: I'm in a meeting now, it'd be better to schedule for next SWAT. Thank you for that. [16:40:02] k [16:42:01] 06Operations, 10ops-eqiad, 06DC-Ops: dbstore1002.mgmt.eqiad.wmnet: "No more sessions are available for this type of connection!" - https://phabricator.wikimedia.org/T119488#2418691 (10jcrespo) @elukey Let's schedule one for 12 July, but only if @Cmjohnson can. [16:42:42] PROBLEM - Disk space on elastic1021 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=90%) [16:43:39] ostriches, if it is know and little real impact, it is ok to me, I just worried in case it had not been noticed [16:43:45] *known [16:44:19] I guess it's known if there's tasks for it :) [16:44:31] The error itself is mostly harmless, just noisy [16:44:47] sorry about that, I saw "index" and my first thought was about the db [16:44:50] !log restarting elastic1036 (master in eqiad) [16:44:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:44:56] then I saw it was not related [16:45:02] thcipriani: did the portals get deployed during this morning swat? [16:45:02] RECOVERY - Disk space on elastic1021 is OK: DISK OK [16:45:48] jan_drewniak: the patch that was on the deployment page had already been deployed on Monday it looked like [16:45:51] * thcipriani checks again [16:46:06] 06Operations, 10Traffic, 13Patch-For-Review: Decrease max object TTL in varnishes - https://phabricator.wikimedia.org/T124954#2418705 (10Krinkle) >>! In T124954#2418150, @BBlack wrote: > Re: detecting parser output changes, couldn't we just do a hash over the output to generate an ETag? That's a paradox. If... 
[16:47:29] jan_drewniak: yeah the patch linked on the deployments page was the one that went out Tuesday. I made the assumption that it was a typo. Also, we ran out of time for morning SWAT :( [16:48:21] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/, ref HEAD..readonly/master). [16:49:06] thcipriani: yeah, busy day [16:59:27] (03PS1) 10Yuvipanda: Ensure that we don't have two backends running together [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/296779 [17:00:04] yurik, gwicke, cscott, arlolra, and subbu: Dear anthropoid, the time has come. Please deploy Services – Graphoid / Parsoid / OCG / Citoid (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160630T1700). [17:05:07] hhmmm, ok [17:07:08] deploying graphoid [17:07:58] !log stopping slave on db1073 to test InnoDB compression T139055 [17:07:59] T139055: Test InnoDB compression - https://phabricator.wikimedia.org/T139055 [17:08:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:10:10] !log deployed Graphoid https://gerrit.wikimedia.org/r/#/c/296780/ [17:10:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:16:52] (03PS2) 10Yuvipanda: Ensure that we don't have two backends running together [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/296779 [17:18:25] (03CR) 10BryanDavis: [C: 032] Ensure that we don't have two backends running together [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/296779 (owner: 10Yuvipanda) [17:19:14] (03Merged) 10jenkins-bot: Ensure that we don't have two backends running together [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/296779 (owner: 10Yuvipanda) [17:19:16] (03CR) 10Merlijn van Deen: Ensure that we don't have two backends running together (032 comments) [software/tools-webservice] - 
10https://gerrit.wikimedia.org/r/296779 (owner: 10Yuvipanda) [17:20:44] (03PS1) 10Yuvipanda: Slightly clearer error message for backend conflict [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/296782 [17:30:30] @seen ottomata [17:30:30] mutante: Last time I saw ottomata they were quitting the network with reason: Quit: Leaving. N/A at 6/15/2016 3:37:19 PM (15d1h53m11s ago) [17:31:49] can i delete this class ? Class misc::monitoring::view::hadoop [17:32:16] theoretically it creates a view in ganglia. at https://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=cpu_report&s=by+name&c=&tab=v&vn=&hide-hf=false [17:32:23] but it doesn't appear to work [17:33:06] and it would be nice to kill that.. manifests/misc/monitoring.pp [17:34:08] the view for "hadoop" doesn't have data, the one for kafkatee does though [17:35:36] PROBLEM - Disk space on elastic1017 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 80523 MB (15% inode=99%) [17:37:21] and/or ... just move that whole thing into ganglia module? [17:37:32] since they are views for ganglia [17:39:43] akosiaris: would there be something bad about misc::monitoring::views::* --> ganglia module? [17:40:24] like all the classes in misc/monitoring.pp are for setting up ganglia views [17:40:35] and the only file left in manifests/misc/ [17:41:48] then we could drop an entire import 'misc/*.pp' [17:44:42] (03CR) 10Mxn: "Sorry, I didn’t see 296748 before pushing this." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/296772 (owner: 10Mxn) [17:44:55] (03Abandoned) 10Mxn: Bumped portals [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296772 (owner: 10Mxn) [17:45:16] (03CR) 10Mxn: [C: 031] Updating www portals stats [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296748 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [17:46:59] deploying kartotherian [17:48:05] (03PS1) 10Paladox: Update git.wikimedia.org references replace them with diffusion [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296788 (https://phabricator.wikimedia.org/T137353) [17:48:38] !log deployed Kartotherian https://gerrit.wikimedia.org/r/#/c/296787/ [17:48:39] ostriches, Dereckson could you review https://gerrit.wikimedia.org/r/296788 please. [17:48:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:51:43] paladox: These sorts of things should go on swat deploys. [17:51:58] ostriches oh, ok sorry i didnt know. 
[17:53:14] (03CR) 10Merlijn van Deen: [C: 032] Slightly clearer error message for backend conflict [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/296782 (owner: 10Yuvipanda) [17:54:44] (03CR) 10Dzahn: [C: 031] "yes please, old link is not redirected to that now" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296788 (https://phabricator.wikimedia.org/T137353) (owner: 10Paladox) [17:54:58] mutante ^^ thanks [18:00:15] (03PS1) 10Yuvipanda: Set CPU & Memory limits for kubernetes backend [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/296791 [18:00:56] PROBLEM - Disk space on elastic1019 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 80392 MB (15% inode=99%) [18:08:20] (03PS2) 10Yuvipanda: Set CPU & Memory limits for kubernetes backend [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/296791 [18:08:51] (03CR) 10jenkins-bot: [V: 04-1] Set CPU & Memory limits for kubernetes backend [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/296791 (owner: 10Yuvipanda) [18:09:20] (03PS3) 10Yuvipanda: Set CPU & Memory limits for kubernetes backend [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/296791 [18:10:56] (03Merged) 10jenkins-bot: Slightly clearer error message for backend conflict [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/296782 (owner: 10Yuvipanda) [18:14:05] (03CR) 10Dereckson: [C: 031] "Yes, content matches." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296788 (https://phabricator.wikimedia.org/T137353) (owner: 10Paladox) [18:14:36] PROBLEM - Disk space on elastic1019 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 80138 MB (15% inode=99%) [18:15:35] (03CR) 10Paladox: "Hi, yep ive added there. I may not be available so mutante (dzahn) said he will be available if I'm not." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/296788 (https://phabricator.wikimedia.org/T137353) (owner: 10Paladox) [18:18:22] (03PS3) 10ArielGlenn: do xml stubs dump pieces based on revs per page range [dumps] - 10https://gerrit.wikimedia.org/r/296742 (https://phabricator.wikimedia.org/T137887) [18:22:03] 07Puppet, 06Labs, 10Labs-Infrastructure, 10Phabricator: puppet function ipresolve unable to look up instance on labs-puppetmaster - https://phabricator.wikimedia.org/T139011#2419186 (10Paladox) p:05Triage>03High Changing to high per @Dzahn and everybody should up the priority of that ticket [18:24:49] !log restarted coal on graphite1001 and navtiming on hafnium due to inexplicably stopped metrics; nothing useful in logs. [18:24:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:25:47] mutante: I would welcome moving them somewhere in the modules/ hierarchy [18:26:45] ori: I rebooted graphite1001 this morning, possibly related [18:26:48] akosiaris: alright, i'll make a patch for the details :) [18:28:01] coal accumulates metrics into a 5 minute sliding window; anything outside of that gets discarded. And EventLogging has been lagged by about ~30m for several hours. so that seems to be the cause. [18:28:04] godog: ^ [18:28:39] godog: yeah, not related to your reboot. [18:28:50] thanks for the ping anyway [18:29:22] ori: np! glad you found the cause ! [18:36:33] PROBLEM - Disk space on elastic1018 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 80299 MB (15% inode=99%) [18:38:49] dcausse: ^ the cure is truncating logs? [18:39:34] "thursday i don't care about you" [18:39:35] 07Puppet, 06Labs, 10Labs-Infrastructure, 10Phabricator: puppet function ipresolve unable to look up instance on labs-puppetmaster - https://phabricator.wikimedia.org/T139011#2419238 (10chasemp) >>! In T139011#2416792, @AlexMonk-WMF wrote: > @chasemp, can you check other hosts that we know work like `bastio... 
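A rough sketch of the sliding-window behaviour described at 18:28:01, to show why events lagged by about 30 minutes never reach the graphs. This is not coal's actual code: the class name, method names and the in-memory list are hypothetical, and the only detail taken from the log is the 5-minute window.

```python
WINDOW_SECONDS = 5 * 60  # 5-minute window, as stated in the 18:28:01 message


class SlidingWindow:
    """Hypothetical accumulator illustrating the idea; not coal's real code."""

    def __init__(self, span=WINDOW_SECONDS):
        self.span = span
        self.samples = []  # list of (timestamp, value) pairs

    def add(self, timestamp, value, now):
        # A sample that arrives already older than the window is discarded.
        if timestamp < now - self.span:
            return False
        self.samples.append((timestamp, value))
        return True

    def values(self, now):
        # Drop samples that have aged out, then return what is left.
        cutoff = now - self.span
        self.samples = [(t, v) for (t, v) in self.samples if t >= cutoff]
        return [v for (_, v) in self.samples]


window = SlidingWindow()
now = 10_000
print(window.add(now - 60, 1.0, now))       # one minute old: accepted
print(window.add(now - 30 * 60, 2.0, now))  # thirty minutes late: rejected
```

Under these assumptions, a producer that is consistently about 30 minutes behind has every sample rejected on arrival, so the window stays empty even though the consumer itself is running.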
[18:39:39] godog: /var/lib/elasticsearch is separate [18:40:10] ebernhardson: doh, indeed [18:40:13] godog: the 15% against /var/lib/elasticsearch are still around because gehel and faidon can't agree on the right way to fix the alert. When disk gets to 15% elasticsearch shuffles shards around the cluster and fixes itself. [18:40:40] will figure it out sooner or later i imagine :) [18:41:38] 07Puppet, 06Labs, 10Labs-Infrastructure, 10Phabricator: puppet function ipresolve unable to look up instance on labs-puppetmaster - https://phabricator.wikimedia.org/T139011#2419246 (10AlexMonk-WMF) >>! In T139011#2419238, @chasemp wrote: >>>! In T139011#2416792, @AlexMonk-WMF wrote: >> @chasemp, can you c... [18:43:14] ebernhardson: hehe I see, fired 18 times last month according to my logs [18:44:05] oh but the new hardware is coming online too so it might get better (?) [18:45:18] godog: new hardware is online now, but elasticsearch, for purposes of cluster actions like balance, treats the cluster as homogenous [18:49:24] PROBLEM - puppet last run on cp3010 is CRITICAL: CRITICAL: puppet fail [18:49:46] ebernhardson: I see, so the new hardware can't get more disk utilization until the old is decommissioned [18:50:03] PROBLEM - Disk space on elastic1019 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 80792 MB (15% inode=99%) [19:00:04] twentyafterfour: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160630T1900). Please do the needful. 
[19:06:22] (03PS13) 10Yuvipanda: tools: Provision accounts for all tools [puppet] - 10https://gerrit.wikimedia.org/r/296747 (https://phabricator.wikimedia.org/T133999) [19:10:55] jouncebot: doing the needful [19:13:01] PROBLEM - Disk space on elastic1018 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 80722 MB (15% inode=99%) [19:13:41] PROBLEM - Disk space on elastic1019 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 79624 MB (15% inode=99%) [19:13:59] aude: should I still be holding wikidata back to wmf.7? [19:15:26] (03CR) 10Merlijn van Deen: [C: 032] Set CPU & Memory limits for kubernetes backend [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/296791 (owner: 10Yuvipanda) [19:15:34] nevermind looks like wikidata is already on wmf.8 [19:15:38] (03PS4) 10Yuvipanda: Set CPU & Memory limits for kubernetes backend [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/296791 [19:16:09] (03CR) 10jenkins-bot: [V: 04-1] Set CPU & Memory limits for kubernetes backend [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/296791 (owner: 10Yuvipanda) [19:16:11] (03PS1) 1020after4: all wikis to 1.28.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296801 [19:16:40] (03CR) 1020after4: [C: 032] all wikis to 1.28.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296801 (owner: 1020after4) [19:17:00] RECOVERY - puppet last run on cp3010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:17:00] (03PS5) 10Yuvipanda: Set CPU & Memory limits for kubernetes backend [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/296791 [19:17:16] (03Merged) 10jenkins-bot: all wikis to 1.28.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296801 (owner: 1020after4) [19:17:31] PROBLEM - Disk space on elastic1018 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 80443 MB (15% inode=99%) [19:18:39] !log aaron@tin Synchronized 
php-1.28.0-wmf.8/includes: adc4c90202d6c44aa58756e3c6bc35918afc5f75 (duration: 01m 19s) [19:18:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:19:35] !log Deploying 1.28.0-wmf.8 to all wikimedia wikis. [19:19:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:20:31] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. [19:20:40] !log twentyafterfour@tin rebuilt wikiversions.php and synchronized wikiversions files: (no message) [19:20:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:22:40] PROBLEM - Disk space on elastic1019 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 80599 MB (15% inode=99%) [19:23:40] PROBLEM - HHVM rendering on mw2123 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50421 bytes in 0.136 second response time [19:23:51] PROBLEM - HHVM rendering on mw2134 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50421 bytes in 0.141 second response time [19:24:32] PROBLEM - Apache HTTP on mw2134 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50421 bytes in 0.140 second response time [19:24:50] PROBLEM - Apache HTTP on mw2098 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50421 bytes in 0.141 second response time [19:25:01] PROBLEM - Apache HTTP on mw2123 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50421 bytes in 0.149 second response time [19:25:01] PROBLEM - HHVM rendering on mw2098 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50421 bytes in 0.151 second response time [19:26:48] (03PS4) 10ArielGlenn: do xml stubs dump pieces based on revs per page range [dumps] - 10https://gerrit.wikimedia.org/r/296742 (https://phabricator.wikimedia.org/T137887) [19:29:02] is that 500 error related to the train? 
error rates are up slightly after sync-wikiversions [19:29:41] Hi does anyone know what the web page is for https://github.com/wikimedia/wikimedia-discovery-dashboard/blob/c4b62e88d01591ceb1d8745850ef7be828b2ef5d/shiny-server/index.html [19:30:02] Since some repo's are not in use any more and im trying to find ones that are worth updating. [19:30:04] please [19:31:19] uhm, what the...???? File not found: /srv/mediawiki/php-1.28.0-wmf.8/api.php in /srv/mediawiki/w/api.php on line 3 [19:31:44] (03PS1) 10Aaron Schulz: Set the SaveParse log [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296808 [19:32:30] (03PS6) 10Yuvipanda: Set CPU & Memory limits for kubernetes backend [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/296791 [19:32:48] twentyafterfour: All hosts? [19:32:49] Just one? [19:33:17] (03CR) 10Aaron Schulz: [C: 032] Set the SaveParse log [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296808 (owner: 10Aaron Schulz) [19:33:37] ostriches: looks like mw2123vmostly mw2134 [19:33:48] Might've failed to sync, ssh in and sync-common? 
[19:33:57] Seems most likely if it's not everywhere [19:35:19] (03PS1) 10Chad: Beta: remove old staging node defs and do it for beta instead [puppet] - 10https://gerrit.wikimedia.org/r/296809 [19:37:10] ostriches: access denied to log in to mw2134 [19:37:44] (03PS1) 10Yuvipanda: Remove xdebug from base php container [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/296811 [19:37:44] wfm [19:38:03] !log mw2134: running sync-common, seems out of...sync :) [19:38:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:38:14] (03CR) 10jenkins-bot: [V: 04-1] Remove xdebug from base php container [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/296811 (owner: 10Yuvipanda) [19:38:36] (03PS2) 10Yuvipanda: Remove xdebug from base php container [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/296811 [19:39:08] RECOVERY - HHVM rendering on mw2134 is OK: HTTP OK: HTTP/1.1 200 OK - 70749 bytes in 2.942 second response time [19:39:25] hmm logged in, that file is not missing [19:39:39] oh you beat me to it [19:40:09] RECOVERY - Apache HTTP on mw2134 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.642 second response time [19:40:26] (03PS1) 10Paladox: Replace git.wikimedia.org with diffusion [software] - 10https://gerrit.wikimedia.org/r/296814 (https://phabricator.wikimedia.org/T139089) [19:40:28] Try 2123 too? [19:40:42] ostriches: a bunch of them are having errors [19:40:46] should I do a full scap? [19:40:54] what about mw2123 ? [19:41:31] is it possible/a good idea to do that without the whole i18n-cache-rebuild thing? [19:41:47] I'm not sure [19:41:51] (03CR) 10Chad: "Inline question for Ariel mostly." (031 comment) [software] - 10https://gerrit.wikimedia.org/r/296814 (https://phabricator.wikimedia.org/T139089) (owner: 10Paladox) [19:42:07] Could run sync-common on all apaches. 
[19:42:09] like sync-common across all machines [19:42:10] yeah [19:42:15] !log activating statement timeout limitations for kartotherian on maps cluster codfw (T138422) [19:42:16] T138422: Limit postgres query execution time for Kartotherian role - https://phabricator.wikimedia.org/T138422 [19:42:16] see ~krenair/foreachapache [19:42:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:42:22] !log ran scap pull on mw2098 [19:42:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:42:34] (03PS2) 10Aaron Schulz: Set the SaveParse log [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296808 [19:42:44] (03CR) 10Paladox: "Ok, I will let ArielGlenn answer that." [software] - 10https://gerrit.wikimedia.org/r/296814 (https://phabricator.wikimedia.org/T139089) (owner: 10Paladox) [19:43:07] RECOVERY - Apache HTTP on mw2098 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.643 second response time [19:43:48] RECOVERY - HHVM rendering on mw2098 is OK: HTTP OK: HTTP/1.1 200 OK - 70749 bytes in 2.399 second response time [19:43:50] !log ran scap pull on mw2123 [19:43:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:44:04] ostriches im wondering what this file https://github.com/wikimedia/labs-toollabs/blob/715be778a37df279f92eba025390964c2c96f59e/debian/control does and is it worth updating [19:44:05] please [19:44:19] references git.wikimedia.org [19:45:05] I think that's all of them, at least that's the ones icinga noticed [19:45:18] RECOVERY - HHVM rendering on mw2123 is OK: HTTP OK: HTTP/1.1 200 OK - 70749 bytes in 3.555 second response time [19:45:29] api.php is a weird file to go missing [19:45:29] (03CR) 10Aaron Schulz: [C: 032] Set the SaveParse log [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296808 (owner: 10Aaron Schulz) [19:45:32] silly rsync [19:45:53] (03CR) 10ArielGlenn: Replace git.wikimedia.org with diffusion (031 
comment) [software] - 10https://gerrit.wikimedia.org/r/296814 (https://phabricator.wikimedia.org/T139089) (owner: 10Paladox) [19:45:58] RECOVERY - Apache HTTP on mw2123 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.227 second response time [19:46:07] (03Merged) 10jenkins-bot: Set the SaveParse log [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296808 (owner: 10Aaron Schulz) [19:47:15] paladox: I said it's not worth updating these things for the most part... I'm not going to answer questions about each and every one you find. Perhaps you should ask people who maintain the relevant bit of code instead? [19:47:31] Ok [19:47:42] !log aaron@tin Synchronized wmf-config/InitialiseSettings.php: Set the SaveParse log (duration: 00m 26s) [19:47:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:48:12] (03PS7) 10BryanDavis: Set CPU & Memory limits for kubernetes backend [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/296791 (owner: 10Yuvipanda) [19:48:46] (03CR) 10BryanDavis: [C: 032] Set CPU & Memory limits for kubernetes backend [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/296791 (owner: 10Yuvipanda) [19:49:15] (03CR) 10Yuvipanda: [C: 04-2] Set CPU & Memory limits for kubernetes backend [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/296791 (owner: 10Yuvipanda) [19:49:17] (03CR) 10BryanDavis: [C: 04-2] Set CPU & Memory limits for kubernetes backend [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/296791 (owner: 10Yuvipanda) [19:49:22] (03PS8) 10Nschaaf: Check all ores web nodes [puppet] - 10https://gerrit.wikimedia.org/r/296535 (https://phabricator.wikimedia.org/T134782) [19:50:38] (03PS8) 10BryanDavis: Set CPU & Memory limits for kubernetes backend [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/296791 (owner: 10Yuvipanda) [19:50:55] (03PS3) 10Yuvipanda: Add a static-web [docker-images/toollabs-images] - 
10https://gerrit.wikimedia.org/r/294889 [19:50:57] (03PS3) 10Yuvipanda: Remove xdebug from base php container [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/296811 [19:51:29] (03CR) 10Yuvipanda: Set CPU & Memory limits for kubernetes backend [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/296791 (owner: 10Yuvipanda) [19:52:00] (03CR) 10BryanDavis: [C: 032] "Fixed version number that was the cause of the -2s" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/296791 (owner: 10Yuvipanda) [19:53:21] (03Merged) 10jenkins-bot: Set CPU & Memory limits for kubernetes backend [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/296791 (owner: 10Yuvipanda) [19:54:17] (03CR) 10BryanDavis: [C: 032] Remove xdebug from base php container [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/296811 (owner: 10Yuvipanda) [19:54:45] (03Merged) 10jenkins-bot: Remove xdebug from base php container [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/296811 (owner: 10Yuvipanda) [19:55:05] (03PS1) 10Paladox: Allow git.wikimedia.org/git/passport-mediawiki/git to be redirected properly [puppet] - 10https://gerrit.wikimedia.org/r/296816 (https://phabricator.wikimedia.org/T137353) [19:55:20] twentyafterfour ostriches ^^ [19:55:42] (03PS2) 10Paladox: Allow git.wikimedia.org/git/passport-mediawiki/git to be redirected properly [puppet] - 10https://gerrit.wikimedia.org/r/296816 (https://phabricator.wikimedia.org/T137353) [19:57:43] (03PS4) 10BryanDavis: Add a static-web [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/294889 (owner: 10Yuvipanda) [19:58:27] (03CR) 10BryanDavis: [C: 032] Add a static-web [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/294889 (owner: 10Yuvipanda) [19:58:55] (03Merged) 10jenkins-bot: Add a static-web [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/294889 (owner: 10Yuvipanda) [19:58:59] PROBLEM - Disk space on elastic1018 is 
CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 79684 MB (15% inode=99%) [19:59:09] (03PS1) 10Dzahn: (WIP) move ganglia views out of misc/monitoring.pp [puppet] - 10https://gerrit.wikimedia.org/r/296817 [20:00:30] (03PS3) 10Dzahn: Allow git.wikimedia.org/git/passport-mediawiki/git to be redirected properly [puppet] - 10https://gerrit.wikimedia.org/r/296816 (https://phabricator.wikimedia.org/T137353) (owner: 10Paladox) [20:00:42] (03CR) 10Paladox: "Wait please." [puppet] - 10https://gerrit.wikimedia.org/r/296816 (https://phabricator.wikimedia.org/T137353) (owner: 10Paladox) [20:00:59] (03CR) 10Paladox: "I found one more to do." [puppet] - 10https://gerrit.wikimedia.org/r/296816 (https://phabricator.wikimedia.org/T137353) (owner: 10Paladox) [20:02:44] (03PS4) 10Paladox: Allow git.wikimedia.org/git/passport-mediawiki/git to be redirected properly [puppet] - 10https://gerrit.wikimedia.org/r/296816 (https://phabricator.wikimedia.org/T137353) [20:03:10] (03CR) 10Paladox: [C: 031] "@dzahn thanks for waiting. This can now be merged please." [puppet] - 10https://gerrit.wikimedia.org/r/296816 (https://phabricator.wikimedia.org/T137353) (owner: 10Paladox) [20:04:07] PROBLEM - puppet last run on wtp1014 is CRITICAL: CRITICAL: Puppet has 1 failures [20:06:07] PROBLEM - Disk space on elastic1019 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 80683 MB (15% inode=99%) [20:07:16] (03CR) 10Danny B.: [C: 04-1] "Won't work. It's not that simple." [puppet] - 10https://gerrit.wikimedia.org/r/296816 (https://phabricator.wikimedia.org/T137353) (owner: 10Paladox) [20:07:35] (03CR) 10Paladox: Allow git.wikimedia.org/git/passport-mediawiki/git to be redirected properly [puppet] - 10https://gerrit.wikimedia.org/r/296816 (https://phabricator.wikimedia.org/T137353) (owner: 10Paladox) [20:07:42] dcausse: You were poking the disk space issues on elastic* right? 
[20:13:05] (03PS5) 10ArielGlenn: do xml stubs dump pieces based on revs per page range [dumps] - 10https://gerrit.wikimedia.org/r/296742 (https://phabricator.wikimedia.org/T137887) [20:14:09] (03PS1) 10Kaldari: Moving PageAssessments from extension-list-labs to extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296819 (https://phabricator.wikimedia.org/T137918) [20:16:35] (03PS1) 10ArielGlenn: delay full dumps this month by one day [puppet] - 10https://gerrit.wikimedia.org/r/296820 [20:18:02] (03CR) 10ArielGlenn: [C: 032] delay full dumps this month by one day [puppet] - 10https://gerrit.wikimedia.org/r/296820 (owner: 10ArielGlenn) [20:23:27] (03CR) 10BryanDavis: "some python formatting/spelling nits inline" (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/296747 (https://phabricator.wikimedia.org/T133999) (owner: 10Yuvipanda) [20:25:36] (03PS1) 10Hashar: network: simplify tests with puppetlabs_spec_helper [puppet] - 10https://gerrit.wikimedia.org/r/296821 [20:28:36] RECOVERY - puppet last run on wtp1014 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [20:28:39] 06Operations, 10Continuous-Integration-Infrastructure, 13Patch-For-Review: Create a basic RSpec unit test for operations/puppet - https://phabricator.wikimedia.org/T78342#2419679 (10hashar) @akosiaris recently introduced the `network` puppet module and we have something nice going on with rspec-puppet and pu... 
[20:34:39] (03PS14) 10Yuvipanda: tools: Provision accounts for all tools [puppet] - 10https://gerrit.wikimedia.org/r/296747 (https://phabricator.wikimedia.org/T133999) [20:34:41] (03PS1) 10Yuvipanda: graphite: Add DocumentRoot explicitly [puppet] - 10https://gerrit.wikimedia.org/r/296823 (https://phabricator.wikimedia.org/T137924) [20:34:56] (03CR) 10Hashar: "For proper crediting, that is inspired by the Kitchen Hackathon and Nicko patch https://gerrit.wikimedia.org/r/#/c/282484/ :)" [puppet] - 10https://gerrit.wikimedia.org/r/296821 (https://phabricator.wikimedia.org/T78342) (owner: 10Hashar) [20:35:31] (03CR) 10Yuvipanda: [C: 032 V: 032] graphite: Add DocumentRoot explicitly [puppet] - 10https://gerrit.wikimedia.org/r/296823 (https://phabricator.wikimedia.org/T137924) (owner: 10Yuvipanda) [20:38:44] 06Operations, 10hardware-requests: eqiad: 1 hardware access request for labs graphite - https://phabricator.wikimedia.org/T137724#2376734 (10yuvipanda) [20:49:16] PROBLEM - Disk space on elastic1019 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 80728 MB (15% inode=99%) [20:51:07] PROBLEM - Disk space on elastic1020 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 80659 MB (15% inode=99%) [20:52:19] !log aaron@tin Synchronized php-1.28.0-wmf.8/includes/filerepo/file/LocalFile.php: 51d7fb48f2af31d69db115a8b3ed790cdaaf0d2e (duration: 00m 35s) [20:52:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:56:00] (03PS1) 10Hashar: postgresql: simplify tests with puppetlabs_spec_helper [puppet] - 10https://gerrit.wikimedia.org/r/296829 [20:56:06] PROBLEM - Disk space on elastic1018 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 80766 MB (15% inode=99%) [20:56:52] (03CR) 10Hashar: "Same as what is proposed for the network module https://gerrit.wikimedia.org/r/#/c/296821/" [puppet] - 10https://gerrit.wikimedia.org/r/296829 (owner: 10Hashar) [20:56:57] q/me sighs [20:57:01] ostriches: yes we 
have an ongoing issue (https://github.com/elastic/elasticsearch/issues/19187) but this is related to a disk space problem on /, not /var/lib/elasticsearch [20:57:15] (03PS1) 10Yuvipanda: labspuppetbackend: Use sqlalchemy for connection pooling [puppet] - 10https://gerrit.wikimedia.org/r/296830 (https://phabricator.wikimedia.org/T133412) [20:57:32] dcausse: Yeah, I was noticing that /var/lib/elasticsearch was hitting 85% in places. [20:57:40] I don't know why we have so many /var/lib/elasticsearch space right now :/ [20:57:49] *so many issues [20:58:11] elasticsearch is bad at balancing things :P [20:58:16] :) [20:59:13] I note that we didn't retry to freeze and test fast rolling restart since we migrated to 2.3 [20:59:31] I'll try that tomorrow [20:59:50] i just changed the disk high watermark to 80% to force it to balance earlier...will at least make things quieter [21:00:02] !log change cluster.routing.allocation.disk.watermark.high on eqiad elasticsearch cluster to 80% [21:00:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:00:34] ebernhardson: Yeah, that makes sense. [21:00:37] PROBLEM - Disk space on elastic1019 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 80713 MB (15% inode=99%) [21:00:47] Put Elastic's watermark below the icinga alert [21:00:51] So hopefully we avoid the alert :) [21:03:43] PROBLEM - Disk space on elastic1020 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 79327 MB (15% inode=99%) [21:05:43] 06Operations, 10Traffic, 06Community-Liaisons (Jul-Sep-2016): Help contact bot owners about the end of HTTP access to the API - https://phabricator.wikimedia.org/T136674#2419927 (10Whatamidoing-WMF) I've left a message for the new user today. Also, the list finally seems to be getting smaller. @Fae ran i...
[21:17:44] RECOVERY - Disk space on elastic1018 is OK: DISK OK [21:21:05] RECOVERY - Disk space on elastic1020 is OK: DISK OK [21:22:55] RECOVERY - Disk space on elastic1019 is OK: DISK OK [21:30:46] (03PS2) 10Dzahn: (WIP) move ganglia views out of misc/monitoring.pp [puppet] - 10https://gerrit.wikimedia.org/r/296817 [21:31:56] (03CR) 10jenkins-bot: [V: 04-1] (WIP) move ganglia views out of misc/monitoring.pp [puppet] - 10https://gerrit.wikimedia.org/r/296817 (owner: 10Dzahn) [21:31:58] (03PS3) 10Dzahn: (WIP) move ganglia views out of misc/monitoring.pp [puppet] - 10https://gerrit.wikimedia.org/r/296817 [21:33:00] (03CR) 10Dzahn: "haha, nice. so this jenkins-fail is for "Error: Could not parse for environment production: No file(s) found for import of 'misc/*.pp' " t" [puppet] - 10https://gerrit.wikimedia.org/r/296817 (owner: 10Dzahn) [21:33:08] (03CR) 10jenkins-bot: [V: 04-1] (WIP) move ganglia views out of misc/monitoring.pp [puppet] - 10https://gerrit.wikimedia.org/r/296817 (owner: 10Dzahn) [21:33:47] (03PS4) 10Dzahn: (WIP) move ganglia views out of misc/monitoring.pp [puppet] - 10https://gerrit.wikimedia.org/r/296817 [21:46:05] (03PS2) 10Brian Wolff: Add Content-Security-Policy to images from test[2]wiki [puppet] - 10https://gerrit.wikimedia.org/r/296634 (https://phabricator.wikimedia.org/T117618) [21:47:32] (03PS3) 10Krinkle: Set $wgSquidMaxage to 14 days on test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296765 (https://phabricator.wikimedia.org/T124954) [21:47:39] (03CR) 10Krinkle: [C: 032] Set $wgSquidMaxage to 14 days on test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296765 (https://phabricator.wikimedia.org/T124954) (owner: 10Krinkle) [21:48:20] (03Merged) 10jenkins-bot: Set $wgSquidMaxage to 14 days on test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296765 (https://phabricator.wikimedia.org/T124954) (owner: 10Krinkle) [22:01:14] RECOVERY - Disk space on elastic1017 is OK: DISK OK [22:04:57] !log 
krinkle@tin Synchronized wmf-config/InitialiseSettings.php: test2wiki wgSquidMaxage (duration: 00m 28s) [22:05:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:11:24] PROBLEM - Router interfaces on cr2-knams is CRITICAL: CRITICAL: host 91.198.174.246, interfaces up: 52, down: 1, dormant: 0, excluded: 0, unused: 0; xe-1/0/0: down - Core: asw-esams:xe-3/0/42 (GBLX leg 2) {#14007} [10Gbps DF CWDM C49] [22:18:36] (03PS5) 10Dzahn: (WIP) move ganglia views out of misc/monitoring.pp [puppet] - 10https://gerrit.wikimedia.org/r/296817 [22:33:16] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 202, down: 0, dormant: 0, excluded: 0, unused: 0 [22:58:12] (03PS1) 10Alex Monk: admin: Remove my old SSH key [puppet] - 10https://gerrit.wikimedia.org/r/296844 [23:00:04] RoanKattouw, ostriches, Krenair, MaxSem, awight, and Dereckson: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160630T2300). Please do the needful. [23:00:04] Dereckson, paladox, mutante, and kaldari: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:11] here [23:00:13] ^ might have a patch to add [23:03:11] here [23:03:13] (03PS2) 10Jdlrobson: Enable lazy loaded references on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296257 [23:03:41] 06Operations, 06Labs, 06Project-Admins: Archive old Incident-* projects - https://phabricator.wikimedia.org/T134624#2420338 (10Danny_B) Can #operations and #labs folks please check unclosed tasks in projects listed above? Thanks. [23:03:52] (03PS3) 10Jdlrobson: Enable lazy loaded references on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296257 [23:04:55] i am here for swat [23:05:27] who's deploying?
[23:06:42] (03CR) 10Dzahn: [C: 032] admin: Remove my old SSH key [puppet] - 10https://gerrit.wikimedia.org/r/296844 (owner: 10Alex Monk) [23:07:13] thanks mutante [23:07:38] yw, i have to do that too some time :p [23:07:54] OK, guess it's me :P [23:08:55] No irc-nickname in the channel, not deploying their patch [23:09:19] \o [23:10:12] (03PS2) 10MaxSem: Logo update for en.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296757 (https://phabricator.wikimedia.org/T138801) (owner: 10Dereckson) [23:10:20] (03CR) 10MaxSem: [C: 032] Logo update for en.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296757 (https://phabricator.wikimedia.org/T138801) (owner: 10Dereckson) [23:10:57] (03Merged) 10jenkins-bot: Logo update for en.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296757 (https://phabricator.wikimedia.org/T138801) (owner: 10Dereckson) [23:11:23] uh-oh 270 fatal error: Argument 2 passed to Monolog\Logger::debug() must be an instance of array, string given in /srv/mediawiki/php-1.28.0-wmf.8/vendor/monolog/monolog/src/Monolog/Logger.php on line 532 [23:12:36] (03CR) 10Dzahn: [C: 032] "per "Your fix is just fine." in inline comments" [software] - 10https://gerrit.wikimedia.org/r/296814 (https://phabricator.wikimedia.org/T139089) (owner: 10Paladox) [23:12:52] mutante ^^ thanks [23:12:55] !log maxsem@tin Synchronized static/images/project-logos/enwiktionary.png: https://gerrit.wikimedia.org/r/#/c/296757/ (duration: 00m 30s) [23:12:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:14:15] (03CR) 10Dzahn: "has been scheduled in evening swat .
present" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296788 (https://phabricator.wikimedia.org/T137353) (owner: 10Paladox) [23:16:56] (03PS6) 10Dzahn: move ganglia views out of misc/monitoring.pp [puppet] - 10https://gerrit.wikimedia.org/r/296817 [23:17:34] ugh, can't get purges to work [23:18:37] (03PS7) 10Dzahn: remove misc/monitoring.pp, remove import 'misc/*.pp' [puppet] - 10https://gerrit.wikimedia.org/r/296817 [23:19:47] has to be banned in varnish [23:20:03] ori, :O [23:20:15] small price to pay for far future-expires [23:20:46] ori, supposedly, purgeList should still work? [23:20:59] what is purgeList? [23:21:01] do you need commands for a one-off varnish purge? [23:21:25] https://wikitech.wikimedia.org/wiki/Varnish#One-off_purges_.28bans.29 [23:21:31] i can do it if you like [23:21:43] yes please [23:21:47] Hi. [23:21:57] hi Dereckson [23:22:47] would be a bummer to wait till Fri, 30 Jun 2017 15:12:38 GMT [23:23:16] doing [23:24:03] but why don't HTCP purges work? [23:24:40] (03PS4) 10MaxSem: Update logo settings for Adyghe Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296460 (https://phabricator.wikimedia.org/T139005) (owner: 10Odder) [23:24:48] (03CR) 10MaxSem: [C: 032] Update logo settings for Adyghe Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296460 (https://phabricator.wikimedia.org/T139005) (owner: 10Odder) [23:25:28] hmmmm [23:25:32] (03Merged) 10jenkins-bot: Update logo settings for Adyghe Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296460 (https://phabricator.wikimedia.org/T139005) (owner: 10Odder) [23:25:34] they should [23:25:44] HD version would be nice too for en.wikt [23:25:50] but they don't [23:26:02] MaxSem: are you doing the SWAT deploy?
[23:26:09] yup [23:26:13] cool [23:26:25] lemme know when you get to mine :) [23:26:34] it's just a config change [23:27:30] (03PS1) 10Dereckson: HD logo for en.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296851 (https://phabricator.wikimedia.org/T138801) [23:27:53] !log maxsem@tin Synchronized static/images/project-logos/: (no message) (duration: 00m 27s) [23:27:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:28:48] Dereckson, I don't even know how to tell the new adywiki logo from the old one [23:29:17] (03CR) 10Dereckson: [C: 04-1] "typo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296851 (https://phabricator.wikimedia.org/T138801) (owner: 10Dereckson) [23:29:37] Let me check. [23:30:24] (03PS1) 10Jdlrobson: Enable lazy loaded references and images on Tagalog Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296852 (https://phabricator.wikimedia.org/T137822) [23:31:03] MaxSem: currently, ady. prints the old logo [23:33:09] age:34947 [23:33:32] purges don't work there either [23:33:53] ori, something must be broken :P [23:33:57] but https://ady.wikipedia.org/w/static/images/project-logos/adywiki.png is the new logo [23:34:18] (without any need of ?debug=true) [23:34:44] these images aren't loaded via RL [23:35:28] (03PS2) 10MaxSem: Update git.wikimedia.org references replace them with diffusion [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296788 (https://phabricator.wikimedia.org/T137353) (owner: 10Paladox) [23:35:35] (03CR) 10MaxSem: [C: 032] Update git.wikimedia.org references replace them with diffusion [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296788 (https://phabricator.wikimedia.org/T137353) (owner: 10Paladox) [23:36:21] (03Merged) 10jenkins-bot: Update git.wikimedia.org references replace them with diffusion [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296788 (https://phabricator.wikimedia.org/T137353) (owner: 10Paladox) [23:37:05] MaxSem:
what URL you tried to purge by the way, http://en.wikipedia.org/static/images/project-logos/adywiki.png ? [23:37:09] !log maxsem@tin Synchronized docroot/mediawiki/xml/sitelist-1.0/index.html: https://gerrit.wikimedia.org/r/#/c/296788/ (duration: 00m 24s) [23:37:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:38:00] Dereckson, the one from the network tab, https://ady.wikipedia.org/static/images/project-logos/adywiki.png [23:38:07] ah [23:38:16] mutante, paladox ^^^ [23:38:22] Static resources are now served from en.wikipedia.org. [23:38:24] thanks! ack! [23:38:31] Thanks [23:38:37] Dereckson, except they're not [23:38:38] enwiktionary is updated [23:38:43] i'll do adywiki next [23:39:36] much appreciated, ori :) [23:40:33] For some months, it has been echo 'https://en.wikipedia.org/static/images/project-logos/adywiki.png' | mwscript purgeList.php for /static, as *Varnish* caches them from en. [23:40:35] should be good? [23:40:38] kaldari, your change should go last cuz scap, I guess [23:41:11] (03PS2) 10MaxSem: Updating www portals stats [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296748 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [23:41:11] !log banned /static/images/project-logos/enwiktionary.png and /static/images/project-logos/adywiki.png [23:41:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:41:18] (03CR) 10MaxSem: [C: 032] Updating www portals stats [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296748 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [23:41:58] (03Merged) 10jenkins-bot: Updating www portals stats [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296748 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [23:43:13] did i get dropped from the swat window? 
[23:43:28] !log maxsem@tin Synchronized portals/prod/wikipedia.org/assets: (no message) (duration: 00m 28s) [23:43:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:43:35] MaxSem: looks like I had a wikitext edit fail [23:43:44] you're not in it, just some irc-nickname :P [23:43:56] !log maxsem@tin Synchronized portals: (no message) (duration: 00m 27s) [23:43:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:45:36] (03PS4) 10MaxSem: Enable lazy loaded references on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296257 (owner: 10Jdlrobson) [23:45:43] (03CR) 10MaxSem: [C: 032] Enable lazy loaded references on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296257 (owner: 10Jdlrobson) [23:46:23] (03Merged) 10jenkins-bot: Enable lazy loaded references on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296257 (owner: 10Jdlrobson) [23:46:29] (03CR) 10Mholloway: [C: 031] "My patch[1] removing ZeroOpts and all protocol-dependent logic from the ZeroBanner extension is now live on all Wikipedias and everything " [puppet] - 10https://gerrit.wikimedia.org/r/294052 (owner: 10BBlack) [23:47:20] !log maxsem@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/296257/ (duration: 00m 26s) [23:47:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:47:38] jdlrobson, please test [23:48:07] MaxSem: on it [23:49:01] works. Thanks MaxSem ! [23:49:53] (03PS2) 10MaxSem: Moving PageAssessments from extension-list-labs to extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296819 (https://phabricator.wikimedia.org/T137918) (owner: 10Kaldari) [23:50:02] kaldari, yt? 
[23:50:06] yes [23:50:16] (03CR) 10MaxSem: [C: 032] Moving PageAssessments from extension-list-labs to extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296819 (https://phabricator.wikimedia.org/T137918) (owner: 10Kaldari) [23:50:53] (03Merged) 10jenkins-bot: Moving PageAssessments from extension-list-labs to extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296819 (https://phabricator.wikimedia.org/T137918) (owner: 10Kaldari) [23:51:15] MaxSem: any idea how long scap takes these days? [23:51:30] none [23:52:20] hmm [23:52:31] (03PS2) 10Dereckson: HD logo for en.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296851 (https://phabricator.wikimedia.org/T138801) [23:53:11] !log maxsem@tin Started scap: https://gerrit.wikimedia.org/r/#/c/296819/ - noop in prod [23:53:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master