[00:00:04] RoanKattouw ostriches Krenair: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160121T0000).
[00:00:04] urandom RoanKattouw: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process.
[00:00:29] * urandom is available!
[00:01:02] Reedy: down with log noise!
[00:01:20] greg-g: It's possibly only wikitech... And depends how long we're staying on .10
[00:01:22] and mine patch too! :P
[00:02:44] (03PS1) 10Dduvall: Revert "Rollback labswiki and labtestwiki to 1.27.0-wmf.9" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265429
[00:03:09] Reedy: https://gerrit.wikimedia.org/r/#/c/265425/
[00:03:29] AaronSchulz: we have far too much shit in our repos
[00:04:05] OK I'll do the SWAT
[00:04:13] (03CR) 10Dduvall: [C: 032] Revert "Rollback labswiki and labtestwiki to 1.27.0-wmf.9" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265429 (owner: 10Dduvall)
[00:04:24] Or... not
[00:04:32] RoanKattouw: uno momento
[00:04:44] but we did roll back
[00:04:51] I have a meeting
[00:04:59] I can do the SWAT but only at :30
[00:05:02] * greg-g nods
[00:05:06] (03Merged) 10jenkins-bot: Revert "Rollback labswiki and labtestwiki to 1.27.0-wmf.9" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265429 (owner: 10Dduvall)
[00:05:10] there probably won't be much to do due to the rollback
[00:05:11] Sounds like that might work out
[00:05:21] Wait, wmf NIE?
[00:05:26] *NINE
[00:05:35] Oh, labswiki
[00:05:37] for labswiki
[00:05:41] RoanKattouw_away: Revert
[00:05:52] yes, and that
[00:06:03] 21:32 thcipriani: reverted group1 wikis to 1.27.0-wmf.10 due to session errors.
[00:06:03] meh, I see
[00:06:06] I guess the eventbus one can go out, since it's only going to test wikis anywho
[00:06:09] We still have group0 on wmf11 though right?
[00:06:15] right
[00:06:23] OK
[00:06:25] (I want a freaking dashboard for this)
[00:06:27] Good enough for me
[00:06:38] greg-g: +5
[00:06:41] https://noc.wikimedia.org/conf/
[00:06:45] Currently active MediaWiki versions: 1.27.0-wmf.10, 1.27.0-wmf.11
[00:06:58] .....
[00:07:02] when was that added?
[00:07:05] Ages ago
[00:07:10] Literally, AGES ago
[00:07:22] well then
[00:07:23] greg-g: I got sick of people asking
[00:07:23] when we used svn
[00:07:28] RECOVERY - MariaDB Slave Lag: s5 on db1026 is OK: OK slave_sql_lag Seconds_Behind_Master: 0
[00:07:48] mutante, wrong - CVS ;)
[00:08:03] https://github.com/wikimedia/operations-mediawiki-config/commit/fd94140cf1ad681856d4023380562d14487dc047
[00:08:07] greg-g: Nearly 2 years ago
[00:08:14] Oh, no
[00:08:24] Yes
[00:08:24] https://github.com/wikimedia/operations-mediawiki-config/commit/09ddd03a8a6f27d012f5fc8f8f316014d9903d4f
[00:08:33] 2 days away from being 2 years ago
[00:08:45] 6operations, 6Phabricator, 6Release-Engineering-Team, 10Traffic, 5Patch-For-Review: Phabricator needs to expose ssh - https://phabricator.wikimedia.org/T100519#1950585 (10mmodell) @dzahn: awesome, thanks!
[00:08:54] oh, it's actually newer than i thought
[00:09:13] heh
[00:09:18] 2 years is hardly new though :)
[00:10:10] it does help, but I'd love a "which wikis are on which version" answer dashboard, but that's for another rant
[00:11:19] greg-g: thats what you have Reedy for >.>
[00:11:43] reedy@tin:/srv/mediawiki-staging/multiversion$ ./activeMWVersions --withdb
[00:11:43] 1.27.0-wmf.10=piwiktionary 1.27.0-wmf.11=mediawikiwiki
[00:13:41] greg-g: It'd be relatively easy
[00:13:47] Where would it live?
[00:16:05] https://www.mediawiki.org/wiki/Special:SiteMatrix ?
[00:17:26] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There are 2 unmerged changes in mediawiki_config (dir /srv/mediawiki-staging/).
[00:18:02] Is the scap waiting for Roan?
[00:18:07] s/scap/swat/
[00:18:41] greg-g, Reedy: we could make a tool in tools labs pretty easily I think. The data file we need is fetchable from either git or noc.
[00:19:34] I guess you could do it purely in javascript or PHP
[00:19:54] yeah, it's just a json blob these days
[00:20:26] Reedy: dunno re: where it'd live, some cool single use webpage? ;)
[00:26:15] (03CR) 10Gehel: [C: 031] "Looks good to me (but what do I know)." [puppet] - 10https://gerrit.wikimedia.org/r/265427 (https://phabricator.wikimedia.org/T120843) (owner: 10EBernhardson)
[00:28:45] greg-g, Reedy: quick and dirty POC -- https://tools.wmflabs.org/bd808-test/versions.php
[00:29:08] https://noc.wikimedia.org/conf/highlight.php?file=wikiversions.json
[00:29:09] Ctrl + F
[00:29:10] :P
[00:29:35] I wonder which way it's wanted... Like that?
[00:29:39] Or for a version, list the dbnames
[00:29:40] Or both
[00:31:02] Reedy: insert line break, then it's grep-able already
[00:32:49] bd808: yeah, I don't know how, but I'd love a simpler version
[00:33:13] from a whiteboard of mine way back in the day: https://commons.wikimedia.org/wiki/File:PersonalDashboard_v1.jpg
[00:33:13] * Reedy slaps greg-g
[00:33:33] mutante: once tin is running HHVM we can put line breaks back in!
[00:33:37] where those blue boxes around the wiki names are expanding on-click to show which wikis
[00:33:46] bd808: lolol
[00:33:51] Progress is being made again, at least
[00:34:08] nearly there I think
[00:34:31] Well, soon as tin goes offline for reinstall, we can start changing stuff
[00:34:49] If only we'd decided what version of PHP we should bump to...
[00:34:54] greg-g: ah. yeah that but ... yesterday's config blows up the idea of 3 nice buckets a bit
[00:35:03] yeah....
[00:35:20] MaxSem: yes, your understanding is correct about what I was trying to do :)
[00:35:22] solution: get rid of the buckets
[00:35:23] what are the trend lines showing?
[00:35:34] warning/errors per group
[00:35:43] *nod*
[00:35:49] bblack, will take a look today
[00:36:10] MaxSem: well, mostly. We really want every purge of $1.wikipedia.org/X to purge $1.m.wikipedia.org/X
[00:36:10] eg
[00:36:21] https://logstash.wikimedia.org/#/dashboard/elasticsearch/fatalmonitor-group1https://logstash.wikimedia.org/#/dashboard/elasticsearch/fatalmonitor-group1
[00:36:23] it's not really specific to the /wiki/Foo URLs
[00:36:25] bah https://logstash.wikimedia.org/#/dashboard/elasticsearch/fatalmonitor-group1
[00:36:36] MaxSem: thanks!
[00:36:51] greg-g: or iterate over `git log --since 1w --follow wikiversions.json` and render changes as a timeline?
[00:37:00] bleh can't type, but you get the idea I'm sure
[00:37:18] marxarelli: interesting
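A minimal sketch of the dashboard idea kicked around above (bd808's versions.php POC, plus Reedy's "for a version, list the dbnames"). Hypothetical Python rather than the POC's PHP; it assumes wikiversions.json is a flat JSON object mapping each dbname to a version string like "php-1.27.0-wmf.10", which may not be the exact schema noc serves:

```python
# Sketch only: group wikis by MediaWiki version from noc's wikiversions.json.
# Assumes a flat {dbname: "php-1.27.0-wmf.NN"} map; the real schema may differ.
import json
from collections import defaultdict
from urllib.request import urlopen

URL = 'https://noc.wikimedia.org/conf/wikiversions.json'

with urlopen(URL) as resp:
    wikiversions = json.load(resp)

by_version = defaultdict(list)
for dbname, version in wikiversions.items():
    by_version[version].append(dbname)

for version, dbnames in sorted(by_version.items()):
    print('%s (%d wikis): %s' % (version, len(dbnames), ' '.join(sorted(dbnames))))
```

Rendering `git log --follow wikiversions.json` as a timeline, per marxarelli's suggestion, would layer the deployment history on top of the same data file.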
[00:37:45] but... what the heck, this doesn't look good still: https://logstash.wikimedia.org/#/dashboard/elasticsearch/fatalmonitor-group1
[00:42:16] ^ ? plus there is a hhvm warning about dropped sessions
[00:42:30] marxarelli: are you sure that's group1?
[00:42:41] "Recursion detected in RequestContext::getLanguage in /srv/mediawiki/php-1.27.0-wmf.11/includes/context/RequestContext.php on line 351"
[00:42:50] as in, current group1?
[00:42:53] it shouldn't be
[00:42:53] query : wiki:*wiktionary OR wiki:*wikinews OR wiki:*wikibooks OR wiki:*wikiquote OR wiki:*wikisource OR wiki:*wikinews OR wiki:*wikiversity OR _missing_: wiki
[00:43:15] * greg-g scratches his head
[00:43:37] greg-g, since RoanKattouw_away is away, I can do the swat?
[00:44:15] yessir, the eventbus one looks easy enough
[00:44:19] the recursion stuff at least says wmf11
[00:44:22] marxarelli: "Recursion detected" is sadly normal
[00:44:44] and "Failed to write session data (user). Please verify that the current setting of session.save_path is correct () in /srv/mediawiki/php-1.27.0-wmf.11/includes/session/SessionManager.php on line 588"
[00:44:44] the other is wmf11 as well
[00:44:57] see https://phabricator.wikimedia.org/T124126#1950584 about that
[00:44:58] right. doesn't make sense
[00:45:24] MaxSem: and I guess the echo one, too
[00:45:39] hhvm doesn't report which wiki it happened on, so it gets caught by _missing_
[00:45:41] and the geodata one? :P
[00:45:59] * greg-g reloads page
[00:46:15] ah, i see
[00:46:49] MaxSem: sure
[00:46:57] tgr: ahhhh
[00:47:01] (re _missing_)
[00:47:04] that's an unfortunate bit of missing log data
[00:47:31] I'm going to restart the other 2 logstash services because I'm still seeing some January 2015 data trickle in
[00:47:57] marxarelli: those are hhvm pooping itself. no place to capture the wiki id from
[00:48:12] (03CR) 10MaxSem: [C: 032] Enable EventBus extension on test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265142 (https://phabricator.wikimedia.org/T116786) (owner: 10Eevans)
[00:48:33] brace yourself, urandom
[00:49:00] (03Merged) 10jenkins-bot: Enable EventBus extension on test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265142 (https://phabricator.wikimedia.org/T116786) (owner: 10Eevans)
[00:49:05] urandom: about?
[00:49:27] greg-g: aye
[00:49:34] good :)
[00:49:49] greg-g: how come?
[00:49:53] the swat?
[00:50:14] !log maxsem@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/265142/ (duration: 00m 32s)
[00:50:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:50:22] urandom: yeah, eventbus
[00:50:24] urandom, please test:)
[00:51:24] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge.
[00:51:26] is testwiki and test2wiki among those that have already gotten wmf.11?
[00:52:04] urandom, yes: https://noc.wikimedia.org/conf/highlight.php?file=wikiversions.json
[00:52:08] yep
[00:54:16] ebernhardson, I'm observing 13 155 of /srv/mediawiki/php-1.27.0-wmf.9/vendor/ruflin/elastica/lib/Elastica/Transport/Http.php: Operation timed out
[00:54:30] "13" means not very severe :)
[00:55:38] .9?
[00:56:51] !log maxsem@tin Synchronized php-1.27.0-wmf.11/extensions/GeoData/: https://gerrit.wikimedia.org/r/#/c/265409/ (duration: 00m 33s)
[00:56:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:58:32] ebernhardson, GD logging is live on wmf11 ^^^
[00:58:41] urandom, how are we looking?
[01:00:04] twentyafterfour: Dear anthropoid, the time has come. Please deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160121T0100).
[01:01:02] MaxSem: still looking
[01:01:17] but nothing horrible yet
[01:01:49] (03CR) 10MaxSem: [C: 032] Enable Echo cross-wiki tracking table on all wikis with CentralAuth [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265395 (https://phabricator.wikimedia.org/T124232) (owner: 10Catrope)
[01:02:40] (03Merged) 10jenkins-bot: Enable Echo cross-wiki tracking table on all wikis with CentralAuth [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265395 (https://phabricator.wikimedia.org/T124232) (owner: 10Catrope)
[01:02:46] such confidence
[01:03:07] * RoanKattouw apologizes for flaking out
[01:03:09] My meeting went long
[01:03:53] alright, I need to head out
[01:04:06] !log maxsem@tin Synchronized wmf-config/: https://gerrit.wikimedia.org/r/#/c/265395/ (duration: 00m 32s)
[01:04:08] now that Roan's here he can deal with the echo thing, so I think we're all covered
[01:04:10] RoanKattouw, ^^^
[01:04:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:04:29] MaxSem: be sure to test your own swat :P
[01:05:15] greg-g, I have ebernhardson for that! :p
[01:05:39] oh whew, checks and balances, important
[01:05:57] * MaxSem looks outside and sees no nuclear fireballs on east
[01:06:03] LGTM!
[01:06:57] MaxSem: cool
[01:07:14] EventBus seems OK
[01:09:08] 6operations, 5Patch-For-Review: reinstall bast4001 with jessie - https://phabricator.wikimedia.org/T123674#1950843 (10Dzahn)
[01:09:10] 6operations: Switch ganglia aggregator init stuff to systemd on jessie - https://phabricator.wikimedia.org/T96842#1950844 (10Dzahn)
[01:10:07] 6operations: Switch ganglia aggregator init stuff to systemd on jessie - https://phabricator.wikimedia.org/T96842#1227390 (10Dzahn) duplicate of T124197 ?
[01:10:22] 6operations: Port Ganglia aggregator setup to systemd - https://phabricator.wikimedia.org/T124197#1948822 (10Dzahn) duplicate of T96842 ?
[01:10:25] MaxSem: those timeouts are a bit odd, and all on labtestwiki
[01:10:48] 6operations: Port Ganglia aggregator setup to systemd - https://phabricator.wikimedia.org/T124197#1950855 (10Dzahn)
[01:10:50] 6operations: Make services manageable by systemd (tracking) - https://phabricator.wikimedia.org/T97402#1950854 (10Dzahn)
[01:11:19] 6operations: Port Ganglia aggregator setup to systemd - https://phabricator.wikimedia.org/T124197#1948822 (10Dzahn)
[01:11:21] 6operations, 5Patch-For-Review: reinstall bast4001 with jessie - https://phabricator.wikimedia.org/T123674#1950856 (10Dzahn)
[01:11:48] 6operations: Reinstall magnesium with jessie - https://phabricator.wikimedia.org/T123713#1950858 (10Dzahn) a:3Dzahn
[01:11:50] doesn't look related to cluster, failures for both eqiad and codfw
[01:12:35] ebernhardson, isn't there some special network exception to allow silver to connect to cirrussearch?
[01:12:51] maybe andrewbogott forgot to copy that for labtestweb2001
[01:13:02] Krenair: oh that would make sense.
[01:13:13] is labtestwiki new as well? never heard of it before :)
[01:13:17] yes
[01:13:31] it lives on a silver-like machine called labtestweb2001
[01:13:40] modules/role/manifests/elasticsearch/server.pp: srange => '(($INTERNAL @resolve(silver.wikimedia.org)))',
[01:13:42] yeah, that'll be it
[01:13:51] andrewbogott: ^
[01:14:36] !log Restarted logstash on logstash1002
[01:14:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:14:56] 6operations, 10Continuous-Integration-Infrastructure, 5Continuous-Integration-Scaling: install/setup/deploy cobalt as replacement for gallium - https://phabricator.wikimedia.org/T95959#1950862 (10Dzahn) maybe we could go ahead with the same setup we had on gallium, or let's make an actual blocker for the net...
[01:15:19] 6operations, 10Continuous-Integration-Infrastructure, 10netops, 5Continuous-Integration-Scaling: install/setup/deploy cobalt as replacement for gallium - https://phabricator.wikimedia.org/T95959#1950863 (10Dzahn)
[01:15:40] !log Restarted logstash on logstash1003
[01:15:43] ebernhardson: Krenair want me to patch?
[01:15:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:15:47] 6operations, 10Continuous-Integration-Infrastructure, 10netops, 5Continuous-Integration-Scaling: install/setup/deploy cobalt as replacement for gallium - https://phabricator.wikimedia.org/T95959#1204862 (10Dzahn) added @netops please specify which VLAN to use for cobalt
[01:16:01] YuviPanda: yes would be appreciated, basically just needs another ferm line for wherever labtestwiki is
[01:16:53] kk
[01:18:10] 6operations: Reinstall caesium with jessie - https://phabricator.wikimedia.org/T123714#1950882 (10Dzahn) a:3Dzahn
[01:18:38] 6operations: Reinstall caesium with jessie (and convert to VM) - https://phabricator.wikimedia.org/T123714#1950883 (10Dzahn)
[01:18:44] (03PS1) 10Yuvipanda: elasticsearch: Add ferm rule for labtestweb2001 [puppet] - 10https://gerrit.wikimedia.org/r/265436
[01:19:07] 6operations: Reinstall caesium with jessie (and convert to VM) - https://phabricator.wikimedia.org/T123714#1936383 (10Dzahn) This is releases.wikimedia.org . Totally agree it's a candidate for a ganeti VM. I'm going to request a machine for it.
[01:22:28] (03CR) 10EBernhardson: [C: 031] elasticsearch: Add ferm rule for labtestweb2001 [puppet] - 10https://gerrit.wikimedia.org/r/265436 (owner: 10Yuvipanda)
[01:23:05] (03CR) 10Yuvipanda: [C: 032] elasticsearch: Add ferm rule for labtestweb2001 [puppet] - 10https://gerrit.wikimedia.org/r/265436 (owner: 10Yuvipanda)
[01:23:11] ebernhardson: ok, I’ll add that
[01:23:47] 6operations: request VM for releases.wm.org - https://phabricator.wikimedia.org/T124261#1950905 (10Dzahn) 3NEW a:3Dzahn
[01:24:10] andrewbogott: labtestweb? I already merged patch
[01:24:20] for cirrus? great.
[01:25:09] 6operations, 10vm-requests: request VM for releases.wm.org - https://phabricator.wikimedia.org/T124261#1950905 (10Dzahn)
[01:27:28] 6operations, 10vm-requests: request VM for releases.wm.org - https://phabricator.wikimedia.org/T124261#1950922 (10Dzahn)
[01:27:58] 6operations, 10vm-requests: request VM for releases.wm.org - https://phabricator.wikimedia.org/T124261#1950923 (10Dzahn) a:5Dzahn>3None
[01:28:42] ebernhardson, apocalypse: 1393 Avro failed to serialize record for CirrusSearchRequestSet : {"payload":{"tookMs":"Expected string, but recieved integer"}} in /srv/mediawiki/php-1.27.0-wmf.10/includes/debug/logger/monolog/AvroFormatter.php on line 97
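The serialization failure above is a schema/type mismatch: the CirrusSearchRequestSet Avro schema evidently declares tookMs as a string while the log record carried an integer. A toy, hand-rolled illustration of that kind of strict check — not the real Avro library or the AvroFormatter code, and the one-field schema is assumed from the error text:

```python
# Toy strict type validation in the spirit of Avro; the schema below is
# inferred from the error message, not taken from CirrusSearchRequestSet.
SCHEMA = {'tookMs': str}

def serialize(record, schema=SCHEMA):
    for field, expected in schema.items():
        value = record.get(field)
        if not isinstance(value, expected):
            raise TypeError('%s: Expected %s, but received %s' % (
                field, expected.__name__, type(value).__name__))
    return record  # a real serializer would binary-encode here

serialize({'tookMs': '123'})  # fine: the schema wants a string
serialize({'tookMs': 123})    # raises, like the logged failure above
```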
[01:28:56] 6operations, 10vm-requests: request VM for releases.wm.org - https://phabricator.wikimedia.org/T124261#1950905 (10Dzahn) @Mark should we have a second one of this in codfw as well? so we can switch over DCs and still have releases.wm.org up ?
[01:31:58] MaxSem: those should be fixed in the latest code, we pushed out a fix yesterday :S
[01:32:05] looking..
[01:32:09] pushed?
[01:32:14] swatt'd
[01:32:21] to wmf10 too?
[01:32:31] lemme poke on tin and see..
[01:33:58] MaxSem: yes wmf.10 and wmf.11 have fix
[01:34:41] 6operations: Migrate nitrogen to jessie - https://phabricator.wikimedia.org/T123732#1950963 (10Dzahn) this is role::ipv6relay / miredo What kind of requirements does the IPv6 relay have? I installed "nload" to see how much it's used network-wise.
[01:35:24] MaxSem: where do you see them? on fluorine `grep AvroForm /a/mw-log/hhvm.log` only reports log lines from several days ago
[01:35:37] (which seems odd since they rotate...)
[01:37:09] Jan 20 09:25:53 mw1237: #012Notice: Undefined index: 1 in /srv/mediawiki/php-1.27.0-wmf.11/extensions/VisualEditor/VisualEditor.hooks.php on line 83
[01:37:10] Jan 19 15:50:59 mw1217: message repeated 217 times: [ #012Notice: Avro failed to serialize record for CirrusSearchRequestSet : {"payload":{"tookMs":"Expected string, but recieved integer"}} in /srv/mediawiki/php-1.27.0-wmf.10/includes/debug/logger/monolog/AvroFormatter.php on line 97]
[01:37:10] Jan 20 09:08:46 mw1064: message repeated 2 times: [ #012Fatal error: Stack overflow in /srv/mediawiki/php-1.27.0-wmf.11/includes/libs/objectcache/MemcachedBagOStuff.php on line 177]
[01:37:29] we have some bogus crap coming late, in other words
[01:37:36] ok :)
[01:41:00] 6operations, 10Wikimedia-General-or-Unknown: Job queue is growing and growing - https://phabricator.wikimedia.org/T124194#1950987 (10Peachey88)
[01:44:03] 6operations, 6Performance-Team, 10Wikimedia-General-or-Unknown, 5Patch-For-Review, and 2 others: jobrunner memory leaks - https://phabricator.wikimedia.org/T122069#1950990 (10ori)
[01:44:40] 6operations: Migrate nitrogen to jessie - https://phabricator.wikimedia.org/T123732#1950998 (10Dzahn) incoming and outgoing about 3MBit/s on average over a couple minutes
[01:47:34] !log nitrogen - install package upgrades
[01:47:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:50:39] MaxSem: I'd love to know where those syslog events get buffered for days before showing up on fluorine and in logstash
[01:50:49] it's an ongoing mystery
[01:54:47] bd808, some algorithm that produces "message repeated 217 times"?
[01:55:31] that bit is built into rsyslog but there should be a max accumulation time as well which is something like 5 minutes max
[01:55:57] * bd808 should look that up again
[02:05:24] i wonder how many of the 4 redis boxes could be down at a time
[02:05:40] to reinstall them that is
[02:06:07] (03PS1) 10Ori.livneh: Job runners: Add a dedicated htmlCacheUpdate runner [puppet] - 10https://gerrit.wikimedia.org/r/265438 (https://phabricator.wikimedia.org/T123815)
[02:08:15] (03PS2) 10Ori.livneh: Job runners: Add a dedicated htmlCacheUpdate runner [puppet] - 10https://gerrit.wikimedia.org/r/265438 (https://phabricator.wikimedia.org/T123815)
[02:10:00] (03CR) 10Aaron Schulz: [C: 031] Job runners: Add a dedicated htmlCacheUpdate runner [puppet] - 10https://gerrit.wikimedia.org/r/265438 (https://phabricator.wikimedia.org/T123815) (owner: 10Ori.livneh)
[02:10:56] (03PS3) 10Ori.livneh: Job runners: Add a dedicated htmlCacheUpdate runner [puppet] - 10https://gerrit.wikimedia.org/r/265438 (https://phabricator.wikimedia.org/T123815)
[02:11:06] (03CR) 10Ori.livneh: [C: 032 V: 032] Job runners: Add a dedicated htmlCacheUpdate runner [puppet] - 10https://gerrit.wikimedia.org/r/265438 (https://phabricator.wikimedia.org/T123815) (owner: 10Ori.livneh)
[02:14:04] bd808: I’m right now messing with broken puppet on stashbot-deploy… I don’t suppose you’d like to fix it so I don't have to?
[02:14:33] andrewbogott: actually we can just shoot those instances in the head
[02:14:47] bd808: delete the project too?
[02:14:57] I've migrated to other servers in the tools project and just not shut things down yet
[02:15:10] andrewbogott: yeah kill it all with fire
[02:15:15] great, will do. Thanks
[02:16:14] PROBLEM - puppet last run on ms-be2012 is CRITICAL: CRITICAL: Puppet has 1 failures
[02:16:45] !log Restarting jobrunner service on job runners to ensure I180856917 gets picked up
[02:16:46] done
[02:16:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:17:33] andrewbogott: cool. Hopefully I won't remember tomorrow why I hadn't cleaned it all up yet. ;)
[02:24:04] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:24:22] !log citoid deploying 3a1b6c8648
[02:24:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:24:37] wth mobileapps?
[02:26:32] bd808: andrewbogott yay, less self hosted puppetmasters!
[02:26:34] checked mobileapps, all looks good to me
[02:26:53] the checker script concurs
[02:27:03] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.10) (duration: 09m 33s)
[02:27:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:28:23] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy
[02:30:18] 6operations, 10Wikimedia-General-or-Unknown, 5Patch-For-Review: Job queue is growing and growing - https://phabricator.wikimedia.org/T124194#1951071 (10ori) The bulk of the jobs in the queue are htmlCacheUpdate jobs, and @aaron suspects that the runners are slow to clear the backlog because the [[ https://gi...
[02:36:51] 6operations, 10Gerrit, 10GitHub-Mirrors, 10ValueView, and 2 others: [Bug] ValueView GitHub mirror not updated any more - https://phabricator.wikimedia.org/T123521#1951094 (10JanZerebecki) I just checked with `git ls-remote` and the github mirror still does not have `refs/meta/config`, which means syncing i...
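On bd808's "message repeated 217 times" question above: that collapsing is rsyslog's repeated-message reduction. A toy Python rendition of the idea — not rsyslog's actual implementation, and it ignores the time-based flush interval bd808 mentions:

```python
def reduce_repeats(lines):
    """Collapse consecutive duplicate log lines, roughly what rsyslog's
    repeated-message reduction does (toy model, no time-based flush)."""
    prev, repeats = None, 0
    for line in lines:
        if line == prev:
            repeats += 1
            continue
        if repeats:
            yield 'message repeated %d times: [ %s ]' % (repeats, prev)
        yield line
        prev, repeats = line, 0
    if repeats:
        yield 'message repeated %d times: [ %s ]' % (repeats, prev)

print(list(reduce_repeats(['a', 'b', 'b', 'b', 'c'])))
# -> ['a', 'b', 'message repeated 2 times: [ b ]', 'c']
```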
[02:41:53] RECOVERY - puppet last run on ms-be2012 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[02:45:34] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:47:34] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy
[02:49:35] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.11) (duration: 09m 39s)
[02:49:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:51:50] (03PS1) 10Aude: Remove unused/no longer existing item-create oauth grant [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265447
[02:56:45] !log l10nupdate@tin ResourceLoader cache refresh completed at Thu Jan 21 02:56:44 UTC 2016 (duration 7m 9s)
[02:56:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:57:04] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:59:13] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy
[03:01:43] (03CR) 10Alex Monk: [C: 031] [cirrus maint] redirect stderr to log and use full mwscript path [puppet] - 10https://gerrit.wikimedia.org/r/265427 (https://phabricator.wikimedia.org/T120843) (owner: 10EBernhardson)
[03:20:56] (03PS1) 10Andrew Bogott: Actively remove use of webproxy.eqiad.wmnet on labs [puppet] - 10https://gerrit.wikimedia.org/r/265451
[03:38:33] (03Abandoned) 10Andrew Bogott: Actively remove use of webproxy.eqiad.wmnet on labs [puppet] - 10https://gerrit.wikimedia.org/r/265451 (owner: 10Andrew Bogott)
[03:40:30] (03PS9) 10Subramanya Sastry: Add the visualdiff module + instantiate visualdiffing services [puppet] - 10https://gerrit.wikimedia.org/r/264032 (https://phabricator.wikimedia.org/T118778)
[03:58:53] PROBLEM - RAID on mw1162 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:59:04] PROBLEM - nutcracker port on mw1162 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:00:24] PROBLEM - puppet last run on mw1162 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:01:14] PROBLEM - SSH on mw1162 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:05:35] RECOVERY - nutcracker port on mw1162 is OK: TCP OK - 0.002 second response time on port 11212
[04:08:23] PROBLEM - nutcracker process on mw1162 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:14:24] PROBLEM - configured eth on mw1162 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:15:14] PROBLEM - salt-minion processes on mw1162 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:15:54] RECOVERY - SSH on mw1162 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.4 (protocol 2.0)
[04:15:54] RECOVERY - RAID on mw1162 is OK: OK: no RAID installed
[04:16:23] RECOVERY - configured eth on mw1162 is OK: OK - interfaces up
[04:16:44] RECOVERY - nutcracker process on mw1162 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[04:17:13] RECOVERY - salt-minion processes on mw1162 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[04:17:15] RECOVERY - puppet last run on mw1162 is OK: OK: Puppet is currently enabled, last run 38 minutes ago with 0 failures
[04:58:19] (03CR) 10Tim Starling: [WIP] Implement /w/static.php (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/263566 (https://phabricator.wikimedia.org/T99096) (owner: 10Krinkle)
[05:15:54] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 226, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-4/2/0: down - Core: cr1-codfw:xe-5/2/1 (Telia, IC-307235, 34ms) {#2648} [10Gbps wave]BR
[05:18:40] (03CR) 10Tim Starling: [WIP] Implement /w/static.php (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/263566 (https://phabricator.wikimedia.org/T99096) (owner: 10Krinkle)
[05:35:34] (03PS1) 10KartikMistry: Beta: Set ContentTranslationCorpora to true [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265458 (https://phabricator.wikimedia.org/T119617)
[05:36:43] PROBLEM - puppet last run on mw1133 is CRITICAL: CRITICAL: Puppet has 58 failures
[05:41:26] (03PS2) 10KartikMistry: Beta: Set ContentTranslationCorpora to true [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265458 (https://phabricator.wikimedia.org/T119617)
[05:43:24] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 228, down: 0, dormant: 0, excluded: 0, unused: 0
[05:46:40] (03PS1) 10KartikMistry: Enable ContentTranslationCorpora [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265459 (https://phabricator.wikimedia.org/T119617)
[05:55:45] PROBLEM - puppet last run on mw2106 is CRITICAL: CRITICAL: puppet fail
[05:56:04] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 226, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-4/2/0: down - Core: cr1-codfw:xe-5/2/1 (Telia, IC-307235, 34ms) {#2648} [10Gbps wave]BR
[06:02:43] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 228, down: 0, dormant: 0, excluded: 0, unused: 0
[06:04:14] RECOVERY - puppet last run on mw1133 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:23:24] RECOVERY - puppet last run on mw2106 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:28:03] PROBLEM - puppet last run on mw1138 is CRITICAL: CRITICAL: Puppet has 14 failures
[06:30:14] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 226, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-4/2/0: down - Core: cr1-codfw:xe-5/2/1 (Telia, IC-307235, 34ms) {#2648} [10Gbps wave]BR
[06:31:13] PROBLEM - puppet last run on subra is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:43] PROBLEM - puppet last run on mc1017 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:43] PROBLEM - puppet last run on mw1026 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:53] PROBLEM - puppet last run on db1056 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:53] PROBLEM - puppet last run on mw1215 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:53] PROBLEM - puppet last run on kafka1002 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:04] PROBLEM - puppet last run on mw2023 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:04] PROBLEM - puppet last run on mw2020 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:14] PROBLEM - puppet last run on mw1139 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:23] PROBLEM - puppet last run on mw2036 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:24] PROBLEM - puppet last run on ms-fe1004 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:32:53] PROBLEM - puppet last run on mw2207 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:53] PROBLEM - puppet last run on mw2141 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:54] PROBLEM - puppet last run on mw2081 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:33:03] PROBLEM - puppet last run on mw1203 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:13] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:33:24] PROBLEM - puppet last run on mw2145 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:38:14] PROBLEM - puppet last run on mw1133 is CRITICAL: CRITICAL: Puppet has 63 failures
[06:38:39] (03CR) 10Luke081515: [C: 031] "...but looks good" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265413 (https://phabricator.wikimedia.org/T124234) (owner: 10Catrope)
[06:41:04] PROBLEM - puppet last run on cp3037 is CRITICAL: CRITICAL: puppet fail
[06:41:13] PROBLEM - RAID on mw1161 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:41:14] 6operations, 10Wikimedia-General-or-Unknown, 5Patch-For-Review: Job queue is growing and growing - https://phabricator.wikimedia.org/T124194#1951238 (10Luke081515) Looks good now
[06:41:23] PROBLEM - SSH on mw1161 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:41:44] PROBLEM - puppet last run on mw1161 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:42:04] PROBLEM - configured eth on mw1161 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:42:35] PROBLEM - Disk space on mw1161 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:43:05] PROBLEM - DPKG on mw1161 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:45:35] PROBLEM - salt-minion processes on mw1161 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:46:45] PROBLEM - dhclient process on mw1161 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:46:45] PROBLEM - nutcracker port on mw1161 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:47:23] PROBLEM - nutcracker process on mw1161 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:47:43] RECOVERY - salt-minion processes on mw1161 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[06:48:53] RECOVERY - dhclient process on mw1161 is OK: PROCS OK: 0 processes with command name dhclient
[06:48:54] RECOVERY - nutcracker port on mw1161 is OK: TCP OK - 0.000 second response time on port 11212
[06:49:24] RECOVERY - nutcracker process on mw1161 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[06:49:24] RECOVERY - DPKG on mw1161 is OK: All packages OK
[06:53:13] RECOVERY - Disk space on mw1161 is OK: DISK OK
[06:54:04] PROBLEM - salt-minion processes on mw1161 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:54:45] RECOVERY - puppet last run on mw1026 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures
[06:54:54] RECOVERY - puppet last run on mw1215 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures
[06:55:13] PROBLEM - nutcracker port on mw1161 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:55:13] PROBLEM - dhclient process on mw1161 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:55:15] RECOVERY - puppet last run on mw2020 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures
[06:55:23] RECOVERY - puppet last run on mw1139 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures
[06:55:34] RECOVERY - puppet last run on ms-fe1004 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures
[06:55:44] PROBLEM - nutcracker process on mw1161 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:55:45] PROBLEM - DPKG on mw1161 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:56:03] RECOVERY - puppet last run on mw2141 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:56:04] RECOVERY - puppet last run on mw2081 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
[06:56:13] RECOVERY - puppet last run on mw1203 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[06:56:33] RECOVERY - puppet last run on subra is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures
[06:56:34] RECOVERY - puppet last run on mw2145 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:56:53] RECOVERY - puppet last run on mc1017 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:04] RECOVERY - puppet last run on db1056 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:04] RECOVERY - puppet last run on kafka1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:24] RECOVERY - puppet last run on mw2023 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:35] RECOVERY - puppet last run on mw2036 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:04] RECOVERY - puppet last run on mw2207 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:25] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:01:44] RECOVERY - nutcracker port on mw1161 is OK: TCP OK - 0.000 second response time on port 11212
[07:01:44] RECOVERY - dhclient process on mw1161 is OK: PROCS OK: 0 processes with command name dhclient
[07:04:03] PROBLEM - Disk space on mw1161 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:06:34] RECOVERY - puppet last run on cp3037 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures
[07:08:04] PROBLEM - nutcracker port on mw1161 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:08:05] PROBLEM - dhclient process on mw1161 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:10:14] RECOVERY - dhclient process on mw1161 is OK: PROCS OK: 0 processes with command name dhclient
[07:10:50] Wikimedia: a tale of salvation and woe
[07:19:04] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 228, down: 0, dormant: 0, excluded: 0, unused: 0
[07:20:53] RECOVERY - Disk space on mw1161 is OK: DISK OK
[07:21:03] RECOVERY - puppet last run on mw1138 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[07:23:53] PROBLEM - Apache HTTP on mw1116 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:25:03] PROBLEM - dhclient process on mw1161 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:25:04] PROBLEM - HHVM rendering on mw1116 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:26:05] PROBLEM - Check size of conntrack table on mw1116 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:26:13] PROBLEM - puppet last run on mw1116 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:26:14] PROBLEM - RAID on mw1116 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:27:13] PROBLEM - Disk space on mw1161 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:27:54] RECOVERY - Apache HTTP on mw1116 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 1.425 second response time
[07:28:04] RECOVERY - Check size of conntrack table on mw1116 is OK: OK: nf_conntrack is 4 % full
[07:28:13] RECOVERY - puppet last run on mw1116 is OK: OK: Puppet is currently enabled, last run 34 minutes ago with 0 failures
[07:28:14] RECOVERY - RAID on mw1116 is OK: OK: no RAID installed
[07:29:14] RECOVERY - HHVM rendering on mw1116 is OK: HTTP OK: HTTP/1.1 200 OK - 69958 bytes in 0.905 second response time
[07:31:34] RECOVERY - dhclient process on mw1161 is OK: PROCS OK: 0 processes with command name dhclient
[07:33:53] RECOVERY - Disk space on mw1161 is OK: DISK OK
[07:34:43] PROBLEM - puppet last run on mw1116 is CRITICAL: CRITICAL: Puppet has 2 failures
[07:37:54] PROBLEM - dhclient process on mw1161 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:40:23] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 226, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-4/2/0: down - Core: cr1-codfw:xe-5/2/1 (Telia, IC-307235, 34ms) {#2648} [10Gbps wave]BR
[07:45:04] RECOVERY - salt-minion processes on mw1161 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[07:46:25] PROBLEM - Disk space on mw1161 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:48:25] RECOVERY - Disk space on mw1161 is OK: DISK OK
[07:49:03] RECOVERY - nutcracker process on mw1161 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[07:51:25] PROBLEM - salt-minion processes on mw1161 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
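The "Check size of conntrack table" results above ("nf_conntrack is 4 % full") compare the kernel's connection-tracking count to its configured maximum. A sketch of that kind of check — hypothetical Python, not the NRPE plugin actually deployed, though the /proc files are the standard kernel interface:

```python
def conntrack_percent():
    # Standard kernel counters for the connection-tracking table.
    with open('/proc/sys/net/netfilter/nf_conntrack_count') as f:
        count = int(f.read())
    with open('/proc/sys/net/netfilter/nf_conntrack_max') as f:
        limit = int(f.read())
    return 100.0 * count / limit

pct = conntrack_percent()
status = 'OK' if pct < 80 else 'CRITICAL'  # the 80% threshold is illustrative
print('%s: nf_conntrack is %d %% full' % (status, pct))
```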
[07:53:24] RECOVERY - salt-minion processes on mw1161 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[07:53:24] RECOVERY - SSH on mw1161 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.4 (protocol 2.0)
[07:53:34] RECOVERY - puppet last run on mw1116 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:54:04] RECOVERY - configured eth on mw1161 is OK: OK - interfaces up
[07:54:34] RECOVERY - nutcracker port on mw1161 is OK: TCP OK - 0.000 second response time on port 11212
[07:54:35] RECOVERY - dhclient process on mw1161 is OK: PROCS OK: 0 processes with command name dhclient
[07:55:13] RECOVERY - DPKG on mw1161 is OK: All packages OK
[07:55:15] RECOVERY - RAID on mw1161 is OK: OK: no RAID installed
[07:58:04] RECOVERY - puppet last run on mw1161 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:59:33] 6operations, 10Traffic, 5Patch-For-Review: HTTP/1.1 keepalive for local nginx->varnish conns - https://phabricator.wikimedia.org/T107749#1951270 (10elukey) @BBlack: I didn't see the "Additionally, the following parameters are available as part of our commercial subscription:" before the directive, in the oth...
[08:01:34] <_joe_> !log upgrading all codfw appserver layer's kernel to linux-image-3.13.0-76-generic
[08:01:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:04:41] (03CR) 10Giuseppe Lavagetto: [C: 032] instrumentation: fixup for Ib0b3c139a [debs/pybal] - 10https://gerrit.wikimedia.org/r/264088 (owner: 10Giuseppe Lavagetto)
[08:05:53] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 228, down: 0, dormant: 0, excluded: 0, unused: 0
[08:06:59] (03CR) 10Hoo man: [C: 031] "Fine to deploy at any time, right is unused in Wikibase." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265447 (owner: 10Aude)
[08:08:31] (03Merged) 10jenkins-bot: instrumentation: fixup for Ib0b3c139a [debs/pybal] - 10https://gerrit.wikimedia.org/r/264088 (owner: 10Giuseppe Lavagetto)
[08:09:13] PROBLEM - puppet last run on mw1015 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago
[08:11:23] PROBLEM - Host mw2027 is DOWN: PING CRITICAL - Packet loss = 100%
[08:12:03] PROBLEM - Host mw2022 is DOWN: PING CRITICAL - Packet loss = 100%
[08:12:14] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 226, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-4/2/0: down - Core: cr1-codfw:xe-5/2/1 (Telia, IC-307235, 34ms) {#2648} [10Gbps wave]BR
[08:12:20] <_joe_> that's me rebooting them, but this is not good
[08:12:33] PROBLEM - Host mw2049 is DOWN: PING CRITICAL - Packet loss = 100%
[08:12:43] PROBLEM - Host mw2015 is DOWN: PING CRITICAL - Packet loss = 100%
[08:13:05] <_joe_> ok, let's stop here
[08:13:43] PROBLEM - Host mw2155 is DOWN: PING CRITICAL - Packet loss = 100%
[08:13:53] PROBLEM - Host mw2014 is DOWN: PING CRITICAL - Packet loss = 100%
[08:13:53] PROBLEM - Host mw2114 is DOWN: PING CRITICAL - Packet loss = 100%
[08:14:13] PROBLEM - Host mw2100 is DOWN: PING CRITICAL - Packet loss = 100%
[08:14:14] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 228, down: 0, dormant: 0, excluded: 0, unused: 0
[08:14:33] PROBLEM - Host mw2041 is DOWN: PING CRITICAL - Packet loss = 100%
[08:14:44] PROBLEM - Host mw2141 is DOWN: PING CRITICAL - Packet loss = 100%
[08:15:33] PROBLEM - Host mw2144 is DOWN: PING CRITICAL - Packet loss = 100%
[08:15:43] PROBLEM - Host mw2177 is DOWN: PING CRITICAL - Packet loss = 100%
[08:16:53] PROBLEM - Host mw2159 is DOWN: PING CRITICAL - Packet loss = 100%
[08:16:53] PROBLEM - Host mw2043 is DOWN: PING CRITICAL - Packet loss = 100%
[08:17:49] ACKNOWLEDGEMENT - Host mw2014 is DOWN: PING CRITICAL - Packet loss = 100% Giuseppe Lavagetto Issues with the latest ubuntu kernel
[08:17:49] ACKNOWLEDGEMENT - Host mw2015 is DOWN: PING CRITICAL - Packet loss = 100% Giuseppe Lavagetto Issues with the latest ubuntu kernel
[08:17:49] ACKNOWLEDGEMENT - Host mw2022 is DOWN: PING CRITICAL - Packet loss = 100% Giuseppe Lavagetto Issues with the latest ubuntu kernel
[08:17:49] ACKNOWLEDGEMENT - Host mw2027 is DOWN: PING CRITICAL - Packet loss = 100% Giuseppe Lavagetto Issues with the latest ubuntu kernel
[08:17:49] ACKNOWLEDGEMENT - Host mw2041 is DOWN: PING CRITICAL - Packet loss = 100% Giuseppe Lavagetto Issues with the latest ubuntu kernel
[08:17:49] ACKNOWLEDGEMENT - Host mw2043 is DOWN: PING CRITICAL - Packet loss = 100% Giuseppe Lavagetto Issues with the latest ubuntu kernel
[08:17:49] ACKNOWLEDGEMENT - Host mw2049 is DOWN: PING CRITICAL - Packet loss = 100% Giuseppe Lavagetto Issues with the latest ubuntu kernel
[08:17:50] ACKNOWLEDGEMENT - Host mw2100 is DOWN: PING CRITICAL - Packet loss = 100% Giuseppe Lavagetto Issues with the latest ubuntu kernel
[08:17:50] ACKNOWLEDGEMENT - Host mw2114 is DOWN: PING CRITICAL - Packet loss = 100% Giuseppe Lavagetto Issues with the latest ubuntu kernel
[08:17:51] ACKNOWLEDGEMENT - Host mw2141 is DOWN: PING CRITICAL - Packet loss = 100% Giuseppe Lavagetto Issues with the latest ubuntu kernel
[08:17:51] ACKNOWLEDGEMENT - Host mw2144 is DOWN: PING CRITICAL - Packet loss = 100% Giuseppe Lavagetto Issues with the latest ubuntu kernel
[08:17:52] ACKNOWLEDGEMENT - Host mw2155 is DOWN: PING CRITICAL - Packet loss = 100% Giuseppe Lavagetto Issues with the latest ubuntu kernel
[08:19:33] PROBLEM - PyBal backends health check on lvs2003 is CRITICAL: PYBAL CRITICAL - apaches_80 - Could not depool server mw2159.codfw.wmnet because of too many down!
[08:19:46] <_joe_> wow, it works :)
[08:19:59] <_joe_> (the pybal check, I mean)
[08:20:24] PROBLEM - PyBal backends health check on lvs2006 is CRITICAL: PYBAL CRITICAL - apaches_80 - Could not depool server mw2159.codfw.wmnet because of too many down!
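The "Could not depool server ... because of too many down!" alerts reflect PyBal's depool-threshold safeguard: a failing backend is left pooled rather than shrinking the pool below a configured fraction. A simplified model of that behaviour — hypothetical Python, not PyBal's actual internals, and the threshold semantics and pool contents are assumptions:

```python
def try_depool(server, pooled, pool_size, depool_threshold=0.5):
    """Depool `server` only if at least depool_threshold of the full pool
    would stay in rotation; the threshold semantics here are illustrative."""
    if (len(pooled) - 1) < pool_size * depool_threshold:
        print('Could not depool server %s because of too many down!' % server)
        return pooled
    return [s for s in pooled if s != server]

# With half of a hypothetical 4-server pool already down, depooling one
# more is refused (mw2160 is a made-up pool member for the example):
pool = ['mw2159.codfw.wmnet', 'mw2160.codfw.wmnet']
print(try_depool('mw2159.codfw.wmnet', pool, pool_size=4))
```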
[08:22:03] RECOVERY - Host mw2027 is UP: PING OK - Packet loss = 0%, RTA = 36.93 ms
[08:23:34] 6operations, 10Continuous-Integration-Infrastructure, 10netops, 5Continuous-Integration-Scaling: install/setup/deploy cobalt as replacement for gallium - https://phabricator.wikimedia.org/T95959#1951302 (10akosiaris) So, gallium is in `public1-b-eqiad` (208.80.154.128/26). The story behind a public IP is a...
[08:24:35] RECOVERY - PyBal backends health check on lvs2006 is OK: PYBAL OK - All pools are healthy
[08:25:53] RECOVERY - PyBal backends health check on lvs2003 is OK: PYBAL OK - All pools are healthy
[08:26:15] (03CR) 10Alexandros Kosiaris: [C: 032] redis_monitoring.py: easy pep8 fixes [puppet] - 10https://gerrit.wikimedia.org/r/265374 (owner: 10Chad)
[08:26:19] (03PS2) 10Alexandros Kosiaris: redis_monitoring.py: easy pep8 fixes [puppet] - 10https://gerrit.wikimedia.org/r/265374 (owner: 10Chad)
[08:26:25] (03CR) 10Alexandros Kosiaris: [V: 032] redis_monitoring.py: easy pep8 fixes [puppet] - 10https://gerrit.wikimedia.org/r/265374 (owner: 10Chad)
[08:27:37] (03PS2) 10Alexandros Kosiaris: toolschecker.py: 1 minor pep8 fix [puppet] - 10https://gerrit.wikimedia.org/r/265373 (owner: 10Chad)
[08:27:43] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] toolschecker.py: 1 minor pep8 fix [puppet] - 10https://gerrit.wikimedia.org/r/265373 (owner: 10Chad)
[08:29:13] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 226, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-4/2/0: down - Core: cr1-codfw:xe-5/2/1 (Telia, IC-307235, 34ms) {#2648} [10Gbps wave]BR
[08:29:53] RECOVERY - Host mw2022 is UP: PING OK - Packet loss = 0%, RTA = 36.11 ms
[08:31:05] (03PS2) 10Alexandros Kosiaris: PacketLossLogtailer.py: pep8 fixes, mostly line too long [puppet] - 10https://gerrit.wikimedia.org/r/265315 (owner: 10Chad)
[08:31:11] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] PacketLossLogtailer.py: pep8 fixes, mostly line too long [puppet] - 10https://gerrit.wikimedia.org/r/265315 (owner: 10Chad)
[08:32:03] RECOVERY - Host mw2049 is UP: PING OK - Packet loss = 0%, RTA = 37.25 ms
[08:33:24] RECOVERY - Host mw2015 is UP: PING OK - Packet loss = 0%, RTA = 36.99 ms
[08:35:44] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 228, down: 0, dormant: 0, excluded: 0, unused: 0
[08:38:24] RECOVERY - Host mw2155 is UP: PING OK - Packet loss = 0%, RTA = 36.73 ms
[08:38:49] (03PS1) 10Hoo man: Restore s5 DB configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265466
[08:40:14] RECOVERY - Host mw2014 is UP: PING OK - Packet loss = 0%, RTA = 36.67 ms
[08:41:23] RECOVERY - Host mw2114 is UP: PING OK - Packet loss = 0%, RTA = 37.21 ms
[08:42:44] RECOVERY - Host mw2100 is UP: PING OK - Packet loss = 0%, RTA = 37.48 ms
[08:43:14] RECOVERY - Host mw2141 is UP: PING OK - Packet loss = 0%, RTA = 36.85 ms
[08:44:43] RECOVERY - Host mw2041 is UP: PING OK - Packet loss = 0%, RTA = 36.44 ms
[08:46:04] RECOVERY - Host mw2144 is UP: PING OK - Packet loss = 0%, RTA = 36.29 ms
[08:46:14] RECOVERY - Host mw2177 is UP: PING OK - Packet loss = 0%, RTA = 37.30 ms
[08:47:14] RECOVERY - Host mw2159 is UP: PING OK - Packet loss = 0%, RTA = 36.28 ms
[08:49:03] PROBLEM - Disk space on stat1002 is CRITICAL: DISK CRITICAL - free space: /a 325818 MB (3% inode=99%)
[08:49:13] RECOVERY - Host mw2043 is UP: PING OK - Packet loss = 0%, RTA = 36.62 ms
[08:50:33] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet).
[08:50:43] PROBLEM - Host mw2022 is DOWN: PING CRITICAL - Packet loss = 100%
[08:50:43] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet).
[08:56:33] PROBLEM - nutcracker process on mw1162 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:57:14] PROBLEM - puppet last run on mw1162 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:57:34] PROBLEM - DPKG on mw1162 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:57:44] PROBLEM - SSH on mw1162 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:57:53] PROBLEM - RAID on mw1162 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:58:43] RECOVERY - nutcracker process on mw1162 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[08:59:43] RECOVERY - DPKG on mw1162 is OK: All packages OK
[09:01:24] PROBLEM - salt-minion processes on mw1162 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:01:45] RECOVERY - Disk space on stat1002 is OK: DISK OK
[09:01:46] 6operations, 10Traffic, 5Patch-For-Review: HTTP/1.1 keepalive for local nginx->varnish conns - https://phabricator.wikimedia.org/T107749#1951327 (10elukey) Also I believe I got the wrong meaning of the max_fails directive, that does not mean "retry for" but jus "consider this backend unavailable if x request...
[09:01:54] RECOVERY - Host mw2022 is UP: PING OK - Packet loss = 0%, RTA = 36.60 ms
[09:01:58] 6operations, 10Traffic: HTTP/1.1 keepalive for local nginx->varnish conns - https://phabricator.wikimedia.org/T107749#1951328 (10elukey)
[09:03:53] PROBLEM - Disk space on mw1162 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:06:13] PROBLEM - DPKG on mw1162 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:06:53] PROBLEM - puppet last run on mw2022 is CRITICAL: CRITICAL: puppet fail
[09:07:43] RECOVERY - salt-minion processes on mw1162 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[09:09:24] PROBLEM - nutcracker process on mw1162 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:10:13] RECOVERY - Disk space on mw1162 is OK: DISK OK
[09:10:53] PROBLEM - nutcracker port on mw1162 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:11:04] RECOVERY - puppet last run on mw2022 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:12:33] RECOVERY - DPKG on mw1162 is OK: All packages OK
[09:15:23] PROBLEM - configured eth on mw1162 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:15:44] RECOVERY - nutcracker process on mw1162 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[09:17:04] PROBLEM - RAID on mw1162 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:18:13] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 226, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-4/2/0: down - Core: cr1-codfw:xe-5/2/1 (Telia, IC-307235, 34ms) {#2648} [10Gbps wave]BR
[09:18:14] PROBLEM - salt-minion processes on mw1162 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:18:54] PROBLEM - DPKG on mw1162 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:19:33] RECOVERY - configured eth on mw1162 is OK: OK - interfaces up
[09:20:44] PROBLEM - Disk space on mw1162 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:21:03] RECOVERY - DPKG on mw1162 is OK: All packages OK
[09:21:23] RECOVERY - nutcracker port on mw1162 is OK: TCP OK - 0.018 second response time on port 11212
[09:22:04] PROBLEM - nutcracker process on mw1162 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:22:44] RECOVERY - Disk space on mw1162 is OK: DISK OK
[09:24:24] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 228, down: 0, dormant: 0, excluded: 0, unused: 0
[09:25:54] PROBLEM - configured eth on mw1162 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:26:43] RECOVERY - salt-minion processes on mw1162 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[09:27:11] <_joe_> !log powercycled mw1162, memory exhaustion
[09:27:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:27:23] PROBLEM - DPKG on mw1162 is CRITICAL: Timeout while attempting connection
[09:27:31] heh I was logging in
[09:28:29] <_joe_> oh sorry
[09:28:39] not at all
[09:28:40] <_joe_> I tried twice, no luck, so I decided to powercycle
[09:29:05] <_joe_> I have bricked 10 codfw appservers this morning, so spent a good part of the morning in the consoles :P
[09:29:15] oops
[09:29:24] RECOVERY - SSH on mw1162 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.4 (protocol 2.0)
[09:29:43] RECOVERY - RAID on mw1162 is OK: OK: no RAID installed
[09:29:49] 6operations, 10Analytics, 10Analytics-Cluster, 10Traffic: Upgrade analytics-eqiad Kafka cluster to Kafka 0.9 - https://phabricator.wikimedia.org/T121562#1951353 (10elukey)
[09:30:04] RECOVERY - configured eth on mw1162 is OK: OK - interfaces up
[09:30:34] RECOVERY - nutcracker process on mw1162 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[09:31:13] 6operations: Switch ganglia aggregator init stuff to systemd on jessie - https://phabricator.wikimedia.org/T96842#1951356 (10faidon)
[09:31:15] 6operations: Port Ganglia aggregator setup to systemd - https://phabricator.wikimedia.org/T124197#1951357 (10faidon)
[09:31:23] RECOVERY - puppet last run on mw1162 is OK: OK: Puppet is currently enabled, last run 52 minutes ago with 0 failures
[09:31:35] RECOVERY - DPKG on mw1162 is OK: All packages OK
[09:34:30] 6operations, 6Phabricator, 6Release-Engineering-Team, 10Traffic, 5Patch-For-Review: Phabricator needs to expose ssh - https://phabricator.wikimedia.org/T100519#1951371 (10faidon) @mmodell can you please fix IPv6 instead or explain why it is difficult to do so? FWIW, IPv6 penetration is > 10% globally and...
[09:35:20] 6operations, 10Beta-Cluster-Infrastructure, 6Labs, 10Labs-Infrastructure: beta: Get SSL certificates for *.{projects}.beta.wmflabs.org - https://phabricator.wikimedia.org/T50501#1951375 (10hashar)
[09:40:14] PROBLEM - HHVM rendering on mw2054 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:41:34] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 226, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-4/2/0: down - Core: cr1-codfw:xe-5/2/1 (Telia, IC-307235, 34ms) {#2648} [10Gbps wave]BR
[09:42:14] RECOVERY - HHVM rendering on mw2054 is OK: HTTP OK: HTTP/1.1 200 OK - 69936 bytes in 3.644 second response time
[09:43:48] 6operations, 10Beta-Cluster-Infrastructure, 6Labs, 10Labs-Infrastructure: beta: Get SSL certificates for *.{projects}.beta.wmflabs.org - https://phabricator.wikimedia.org/T50501#1951388 (10faidon) >>! In T50501#527689, @Krinkle wrote: > Would it be an option to flatten our subdomains? > > We'd only need b...
[09:50:03] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 228, down: 0, dormant: 0, excluded: 0, unused: 0
[09:53:39] <_joe_> !log rolling reboot of the codfw appserver layer
[09:53:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:54:01] (03PS2) 10Hoo man: Restore s5 DB configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265466
[09:54:23] PROBLEM - Host mw2049 is DOWN: PING CRITICAL - Packet loss = 100%
[09:54:43] PROBLEM - Host mw2015 is DOWN: PING CRITICAL - Packet loss = 100%
[09:55:33] RECOVERY - Host mw2015 is UP: PING OK - Packet loss = 0%, RTA = 36.28 ms
[09:55:43] RECOVERY - Host mw2049 is UP: PING OK - Packet loss = 0%, RTA = 36.15 ms
[09:57:55] PROBLEM - Apache HTTP on mw2114 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:00:03] RECOVERY - Apache HTTP on mw2114 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.181 second response time
[10:02:53] PROBLEM - Host mw2100 is DOWN: PING CRITICAL - Packet loss = 100%
[10:03:20] (03CR) 10Jcrespo: [C: 031] Restore s5 DB configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265466 (owner: 10Hoo man)
[10:03:54] RECOVERY - Host mw2100 is UP: PING OK - Packet loss = 0%, RTA = 36.64 ms
[10:05:13] PROBLEM - Host mw2043 is DOWN: PING CRITICAL - Packet loss = 100%
[10:05:44] RECOVERY - Host mw2043 is UP: PING OK - Packet loss = 0%, RTA = 36.47 ms
[10:06:01] (03PS1) 10Faidon Liambotis: Drain codfw for network maintenance [dns] - 10https://gerrit.wikimedia.org/r/265469
[10:06:04] PROBLEM - Apache HTTP on mw2037 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:06:30] (03CR) 10Jcrespo: [C: 032] Restore s5 DB configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265466 (owner: 10Hoo man)
[10:06:49] (03CR) 10Faidon Liambotis: [C: 032] Drain codfw for network maintenance [dns] - 10https://gerrit.wikimedia.org/r/265469 (owner: 10Faidon Liambotis)
[10:07:03] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge.
[10:07:14] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
[10:08:03] RECOVERY - Apache HTTP on mw2037 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.601 second response time [10:09:54] PROBLEM - HHVM rendering on mw2157 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:10:05] PROBLEM - HHVM rendering on mw2195 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:10:25] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Restore s5 DB configuration (duration: 01m 57s) [10:10:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:10:56] !log mw2098.codfw.wmnet failed to sync [10:11:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:11:48] <_joe_> jynus: ouch, I'm rebooting servers, should I stop? [10:11:53] no problem [10:11:54] RECOVERY - HHVM rendering on mw2157 is OK: HTTP OK: HTTP/1.1 200 OK - 69959 bytes in 2.014 second response time [10:12:04] RECOVERY - HHVM rendering on mw2195 is OK: HTTP OK: HTTP/1.1 200 OK - 69958 bytes in 0.346 second response time [10:12:09] let me know when you are done with this so I can sync it manually [10:12:12] <_joe_> that's why you'd have 2-4 servers down at a time in codfw [10:12:23] PROBLEM - Host mw2098 is DOWN: PING CRITICAL - Packet loss = 100% [10:12:25] with this specific server, I mean [10:12:42] <_joe_> I guess mw2098 will come back in a couple of minutes [10:12:44] (I suppose I can see it myself :-)) [10:13:16] <_joe_> seems like it's having troubles rebooting [10:14:05] PROBLEM - HHVM rendering on mw2030 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:14:20] <_joe_> yes, actually I seem unable to reach the console too :/ [10:14:43] PROBLEM - Host mw2073 is DOWN: PING CRITICAL - Packet loss = 100% [10:14:43] PROBLEM - Host mw2048 is DOWN: PING CRITICAL - Packet loss = 100% [10:15:14] 10Ops-Access-Requests, 6operations: Create new puppet group `discovery-analytics-deploy` - https://phabricator.wikimedia.org/T122620#1951433 (10faidon) Nitpick: is it possible to name this perhaps something else than "discovery-analytics-deploy"? What if others outside of the Discovery department want access t... [10:16:03] PROBLEM - Apache HTTP on mw2164 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:16:04] RECOVERY - Host mw2073 is UP: PING OK - Packet loss = 0%, RTA = 37.01 ms [10:16:05] RECOVERY - HHVM rendering on mw2030 is OK: HTTP OK: HTTP/1.1 200 OK - 69959 bytes in 2.565 second response time [10:16:14] RECOVERY - Host mw2048 is UP: PING OK - Packet loss = 0%, RTA = 36.67 ms [10:16:58] 6operations, 10Continuous-Integration-Infrastructure, 10netops, 5Continuous-Integration-Scaling: install/setup/deploy cobalt as replacement for gallium - https://phabricator.wikimedia.org/T95959#1951435 (10hashar) `gallium.wikimedia.org` has a bunch of services which are exposed publicly via the misc-web v...
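The `jynus@tin Synchronized wmf-config/db-eqiad.php` entry above is the log line scap's sync-file emits; a sketch of the corresponding invocation on the deployment host, assuming the staging checkout lives at /srv/mediawiki-staging:

    cd /srv/mediawiki-staging
    sync-file wmf-config/db-eqiad.php 'Restore s5 DB configuration'

Hosts that are mid-reboot, like mw2098 here, simply time out and are reported as failed to sync, which is why jynus plans to sync that one manually once the reboots finish.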
[10:17:54] RECOVERY - Apache HTTP on mw2164 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.362 second response time [10:19:21] <_joe_> !log mw2098 doesn't reboot, console unreachable [10:19:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:19:29] <_joe_> jynus: ^^ [10:19:35] PROBLEM - HHVM rendering on mw2140 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:19:35] I'm reopening T85286 [10:19:41] _joe_, [10:20:21] <_joe_> yeah, makes sense [10:21:34] RECOVERY - HHVM rendering on mw2140 is OK: HTTP OK: HTTP/1.1 200 OK - 69959 bytes in 2.708 second response time [10:22:00] did you know if it was actually rebooted by you or was simply dead in the first place? [10:22:08] <_joe_> I rebooted it, yes [10:22:12] thanks [10:22:14] <_joe_> well, technically salt did [10:22:18] :-) [10:23:08] 6operations, 5codfw-rollout, 3codfw-rollout-Apr-Jun-2015: install/deploy codfw appservers - https://phabricator.wikimedia.org/T85227#1951450 (10jcrespo) [10:23:10] 6operations, 10ops-codfw: mw2098 non-responsive to mgmt - https://phabricator.wikimedia.org/T85286#1951448 (10jcrespo) 5Resolved>3Open This happened again, I am reopening this because I believe this could be related. Maybe consider, if it is under guarantee to report a faulty DRAC. To be more clear, the c... [10:29:54] PROBLEM - Apache HTTP on mw2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:30:13] PROBLEM - HHVM rendering on mw2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:32:05] RECOVERY - Apache HTTP on mw2010 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.590 second response time [10:32:23] RECOVERY - HHVM rendering on mw2010 is OK: HTTP OK: HTTP/1.1 200 OK - 69959 bytes in 3.006 second response time [10:33:13] 6operations, 6Phabricator, 6Release-Engineering-Team, 10Traffic, 5Patch-For-Review: Phabricator needs to expose ssh - https://phabricator.wikimedia.org/T100519#1951457 (10Reedy) >>! In T100519#1951371, @faidon wrote: > @mmodell can you please fix IPv6 instead or explain why it is difficult to do so? FWIW... [10:33:23] PROBLEM - Apache HTTP on mw1133 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:33:23] PROBLEM - HHVM rendering on mw1133 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:33:24] PROBLEM - configured eth on mw1133 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:33:34] PROBLEM - SSH on mw1133 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:33:34] PROBLEM - dhclient process on mw1133 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:33:34] PROBLEM - nutcracker port on mw1133 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:33:35] PROBLEM - RAID on mw1133 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:34:14] PROBLEM - nutcracker process on mw1133 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:34:14] PROBLEM - DPKG on mw1133 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:34:14] PROBLEM - HHVM processes on mw1133 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:34:14] PROBLEM - Check size of conntrack table on mw1133 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:34:53] PROBLEM - salt-minion processes on mw1133 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:35:04] PROBLEM - Disk space on mw1133 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
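As _joe_ says, the reboots are driven through Salt rather than by hand. A minimal sketch of a batched rolling reboot, assuming minion IDs match an mw2*.codfw.wmnet glob; the batch size is what keeps only a few servers down at any moment, as seen above:

    # reboot the codfw appservers at most 4 at a time; -b limits concurrency
    salt -b 4 'mw2*.codfw.wmnet' system.reboot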
[10:36:15] RECOVERY - HHVM processes on mw1133 is OK: PROCS OK: 6 processes with command name hhvm [10:36:15] RECOVERY - nutcracker process on mw1133 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [10:36:23] RECOVERY - DPKG on mw1133 is OK: All packages OK [10:37:03] PROBLEM - Host mw2169 is DOWN: PING CRITICAL - Packet loss = 100% [10:37:43] RECOVERY - configured eth on mw1133 is OK: OK - interfaces up [10:37:43] RECOVERY - SSH on mw1133 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.4 (protocol 2.0) [10:37:44] RECOVERY - Host mw2169 is UP: PING OK - Packet loss = 0%, RTA = 36.11 ms [10:37:44] RECOVERY - nutcracker port on mw1133 is OK: TCP OK - 0.000 second response time on port 11212 [10:37:44] RECOVERY - dhclient process on mw1133 is OK: PROCS OK: 0 processes with command name dhclient [10:38:23] RECOVERY - Check size of conntrack table on mw1133 is OK: OK: nf_conntrack is 0 % full [10:38:54] RECOVERY - salt-minion processes on mw1133 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [10:39:13] RECOVERY - Disk space on mw1133 is OK: DISK OK [10:39:25] PROBLEM - puppet last run on mw2137 is CRITICAL: CRITICAL: puppet fail [10:39:53] RECOVERY - RAID on mw1133 is OK: OK: no RAID installed [10:41:13] PROBLEM - Host mw2127 is DOWN: PING CRITICAL - Packet loss = 100% [10:41:34] RECOVERY - puppet last run on mw2137 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [10:41:44] RECOVERY - Host mw2127 is UP: PING OK - Packet loss = 0%, RTA = 37.58 ms [10:45:24] PROBLEM - Apache HTTP on mw2059 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.286 second response time [10:46:08] all that are you, _joe_? [10:47:33] RECOVERY - Apache HTTP on mw2059 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.803 second response time [10:48:23] (03PS4) 10Giuseppe Lavagetto: lvs: use etcd for pybal config for ulsfo backups [puppet] - 10https://gerrit.wikimedia.org/r/263847 [10:48:44] PROBLEM - Host mw2023 is DOWN: PING CRITICAL - Packet loss = 100% [10:49:10] 6operations, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: url-downloader should be set up more redundantly - https://phabricator.wikimedia.org/T122134#1951465 (10faidon) I'm not sure how good would LVS with two VMs would do — all the VM hosts are in the same availability zone (row), after all. We either need... 
[10:50:03] RECOVERY - Host mw2023 is UP: PING OK - Packet loss = 0%, RTA = 36.82 ms [10:50:05] PROBLEM - HHVM rendering on mw2040 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:50:14] PROBLEM - HHVM rendering on mw2023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:50:36] 6operations, 10Traffic: Evaluate and Test Limited Deployment of Varnish 4 - https://phabricator.wikimedia.org/T122880#1951466 (10ema) a:3ema [10:52:05] RECOVERY - HHVM rendering on mw2040 is OK: HTTP OK: HTTP/1.1 200 OK - 69910 bytes in 2.896 second response time [10:52:13] RECOVERY - HHVM rendering on mw2023 is OK: HTTP OK: HTTP/1.1 200 OK - 69910 bytes in 2.035 second response time [10:55:42] 6operations, 10Traffic: Forward-port Varnish 3 patches to Varnish 4 - https://phabricator.wikimedia.org/T124277#1951469 (10ema) 3NEW a:3ema [10:56:37] 6operations, 10Traffic: varnishkafka integration with Varnish 4 for analytics - https://phabricator.wikimedia.org/T124278#1951476 (10ema) 3NEW [10:57:44] PROBLEM - HHVM rendering on mw2072 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:59:45] RECOVERY - HHVM rendering on mw2072 is OK: HTTP OK: HTTP/1.1 200 OK - 69910 bytes in 3.664 second response time [10:59:57] 6operations, 10Traffic: Forward-port VCL to Varnish 4 - https://phabricator.wikimedia.org/T124279#1951482 (10ema) 3NEW [11:00:13] PROBLEM - Host mw2147 is DOWN: PING CRITICAL - Packet loss = 100% [11:01:24] RECOVERY - Host mw2147 is UP: PING OK - Packet loss = 0%, RTA = 36.41 ms [11:01:24] RECOVERY - puppet last run on mw1133 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:02:04] PROBLEM - Apache HTTP on mw2176 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:04:03] RECOVERY - Apache HTTP on mw2176 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.463 second response time [11:04:43] PROBLEM - Host mw2166 is DOWN: PING CRITICAL - Packet loss = 100% [11:05:44] RECOVERY - Host mw2166 is UP: PING OK - Packet loss = 0%, RTA = 37.02 ms [11:05:45] PROBLEM - HHVM rendering on mw2166 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:06:01] (03PS3) 10Alex Monk: Move portals into generic sites.pp [puppet] - 10https://gerrit.wikimedia.org/r/264978 (https://phabricator.wikimedia.org/T86644) [11:07:44] RECOVERY - HHVM rendering on mw2166 is OK: HTTP OK: HTTP/1.1 200 OK - 69906 bytes in 1.693 second response time [11:08:44] (03CR) 1020after4: [C: 031] "This is awesome." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/263566 (https://phabricator.wikimedia.org/T99096) (owner: 10Krinkle) [11:08:53] PROBLEM - Host mw2097 is DOWN: PING CRITICAL - Packet loss = 100% [11:09:08] !log adding new version of mariadb to carbon for jessie (10.0.23-1) [11:09:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:09:34] PROBLEM - HHVM rendering on mw2207 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:10:14] RECOVERY - Host mw2097 is UP: PING OK - Packet loss = 0%, RTA = 36.26 ms [11:11:04] 6operations, 6Phabricator, 6Release-Engineering-Team, 10Traffic, 5Patch-For-Review: Phabricator needs to expose ssh - https://phabricator.wikimedia.org/T100519#1951490 (10mmodell) @faidon: I don't have any idea how to fix ipv6. I have zero experience with the systems involved and I don't even have ipv6... 
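carbon is the install/apt server, so the "adding new version of mariadb to carbon" log entry above means importing packages into its reprepro-managed repository. A sketch, with the distribution name and .deb filename assumed rather than taken from the log:

    # import the new build into the jessie distribution of the repo (filename hypothetical)
    reprepro -C main includedeb jessie-wikimedia mariadb-server-10.0_10.0.23-1_amd64.deb
    # verify what is now published
    reprepro list jessie-wikimedia | grep -i mariadb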
[11:11:34] RECOVERY - HHVM rendering on mw2207 is OK: HTTP OK: HTTP/1.1 200 OK - 69906 bytes in 2.271 second response time [11:12:08] (03PS1) 10DCausse: Recycle completion suggester indices for small wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265472 [11:12:29] (03CR) 10jenkins-bot: [V: 04-1] Recycle completion suggester indices for small wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265472 (owner: 10DCausse) [11:12:53] PROBLEM - Host mw2118 is DOWN: PING CRITICAL - Packet loss = 100% [11:14:13] PROBLEM - HHVM rendering on mw2191 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:14:15] (03PS2) 10DCausse: Recycle completion suggester indices for small wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265472 [11:14:23] RECOVERY - Host mw2118 is UP: PING OK - Packet loss = 0%, RTA = 36.70 ms [11:14:43] (03PS4) 10Alex Monk: Move portals into generic sites.pp [puppet] - 10https://gerrit.wikimedia.org/r/264978 (https://phabricator.wikimedia.org/T86644) [11:16:13] RECOVERY - HHVM rendering on mw2191 is OK: HTTP OK: HTTP/1.1 200 OK - 69906 bytes in 2.324 second response time [11:16:48] 6operations: Migrate pool counters to trusty/jessie - https://phabricator.wikimedia.org/T123734#1951498 (10akosiaris) for one of the pool counters, it probably makes sense to move it to a ganeti VM. requirements are minimal for a pool counter anyway, it's a perfect candidate for virtualization. I am saying for o... [11:17:11] <_joe_> akosiaris: no :) [11:17:13] PROBLEM - Host mw2034 is DOWN: PING CRITICAL - Packet loss = 100% [11:17:44] RECOVERY - Host mw2034 is UP: PING OK - Packet loss = 0%, RTA = 36.31 ms [11:19:23] 6operations: Migrate carbon to jessie - https://phabricator.wikimedia.org/T123733#1951500 (10akosiaris) Note the 1.5TB current usage for carbon (and another 6 TB free). carbon has a software RAID5 specifically to have enough disk space for its role. The plan seems fine, just take into consideration the storage... [11:19:40] 6operations: Migrate pool counters to trusty/jessie - https://phabricator.wikimedia.org/T123734#1951503 (10Joe) @akosiaris I think that's a bad idea - a single poolcounter server dying still causes unavailability (notwithstanding the mitigations we tried to create with T105378). I'd say until T105378 is resolved... [11:20:47] (03CR) 10Giuseppe Lavagetto: [C: 032] lvs: use etcd for pybal config for ulsfo backups [puppet] - 10https://gerrit.wikimedia.org/r/263847 (owner: 10Giuseppe Lavagetto) [11:20:52] 6operations, 6Phabricator, 6Release-Engineering-Team, 10Traffic, 5Patch-For-Review: Phabricator needs to expose ssh - https://phabricator.wikimedia.org/T100519#1951505 (10faidon) >>! In T100519#1951490, @mmodell wrote: > @faidon: I don't have any idea how to fix ipv6. I have zero experience with the sys... [11:20:56] <_joe_> let's go! [11:21:12] _joe_: ? no to what ? [11:21:25] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 62.50% of data above the critical threshold [24.0] [11:22:03] PROBLEM - HHVM rendering on mw2067 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:22:11] <_joe_> akosiaris: poolcounter on a VM [11:22:18] why not ?
[11:22:27] <_joe_> see my comment on the ticket [11:22:46] <_joe_> sorry, trying to activate etcd on a prod pybal now [11:24:03] RECOVERY - HHVM rendering on mw2067 is OK: HTTP OK: HTTP/1.1 200 OK - 69906 bytes in 3.428 second response time [11:24:53] (03PS2) 10Jcrespo: Adding new parsercache machines (pc[12]00[4-6]) [puppet] - 10https://gerrit.wikimedia.org/r/265473 (https://phabricator.wikimedia.org/T121879) [11:25:07] <_joe_> !log restarting pybal on lvs4004, switching to etcd [11:25:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:26:37] (03CR) 10Alexandros Kosiaris: [C: 032] url_downloader: Add port as a hierable parameter [puppet] - 10https://gerrit.wikimedia.org/r/191379 (owner: 10Alexandros Kosiaris) [11:26:43] (03PS5) 10Alexandros Kosiaris: url_downloader: Add port as a hierable parameter [puppet] - 10https://gerrit.wikimedia.org/r/191379 [11:26:46] (03CR) 10Alexandros Kosiaris: [V: 032] url_downloader: Add port as a hierable parameter [puppet] - 10https://gerrit.wikimedia.org/r/191379 (owner: 10Alexandros Kosiaris) [11:27:27] <_joe_> !log restarting pybal on lvs4003, switching to etcd [11:27:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:27:45] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0] [11:28:04] 6operations, 10Traffic: Create separate packages for required vmods - https://phabricator.wikimedia.org/T124281#1951515 (10ema) 3NEW a:3ema [11:29:12] grep: modules/admin/files/home/akosiaris/.my.cnf: No such file or directory [11:29:15] PROBLEM - Apache HTTP on mw2142 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.247 second response time [11:29:23] every time I grep for anything in the puppet repo :( [11:29:44] PROBLEM - Apache HTTP on mw2046 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.272 second response time [11:30:04] PROBLEM - Host mw2039 is DOWN: PING CRITICAL - Packet loss = 100% [11:31:00] Krenair: git grep [11:31:33] with the extra bonus of being faster [11:31:34] RECOVERY - Apache HTTP on mw2142 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.242 second response time [11:32:03] RECOVERY - Apache HTTP on mw2046 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.253 second response time [11:33:35] PROBLEM - HHVM rendering on mw2060 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.152 second response time [11:34:14] (03PS3) 10Jcrespo: Adding new parsercache machines (pc[12]00[4-6]) [puppet] - 10https://gerrit.wikimedia.org/r/265473 (https://phabricator.wikimedia.org/T121879) [11:34:41] and I was looking for that :) [11:35:08] 6operations, 10ops-codfw, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: rack/setup pc2004-2006 - https://phabricator.wikimedia.org/T121879#1951530 (10faidon) [11:35:14] 6operations, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: url-downloader should be set up more redundantly - https://phabricator.wikimedia.org/T122134#1951532 (10akosiaris) Yes the single row right now comment is obviously true. Still, there is the rack level availability zone issue as well. And we 've had c... [11:35:16] <_joe_> paravoid: into what? 
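Krenair's error above appears because plain grep walks the working tree and trips over entries that exist only as broken symlinks (the .my.cnf in question presumably points at private material that is not in the public repo); `git grep`, as bblack suggests, searches tracked content only and is faster besides. A sketch with an arbitrary search pattern:

    # plain grep descends into everything, including dangling symlinks:
    grep -r 'poolcounter' .      # -> grep: modules/admin/.../.my.cnf: No such file or directory
    # git grep searches only what git tracks:
    git grep 'poolcounter' -- modules/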
[11:35:31] the parsercache task [11:35:39] there are 2 [11:35:46] <_joe_> oh for, not into, I thought you somehow were looking at the codfw appservers [11:35:47] see gerrit [11:35:53] RECOVERY - HHVM rendering on mw2060 is OK: HTTP OK: HTTP/1.1 200 OK - 69906 bytes in 8.082 second response time [11:36:28] it took me also some time to find those tasks, looking in my "assigned" is no longer useful [11:36:47] (03CR) 10Jcrespo: [C: 032] Adding new parsercache machines (pc[12]00[4-6]) [puppet] - 10https://gerrit.wikimedia.org/r/265473 (https://phabricator.wikimedia.org/T121879) (owner: 10Jcrespo) [11:37:24] PROBLEM - HHVM rendering on mw2135 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.155 second response time [11:37:44] PROBLEM - HHVM rendering on mw2139 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:39:34] RECOVERY - HHVM rendering on mw2135 is OK: HTTP OK: HTTP/1.1 200 OK - 69906 bytes in 4.090 second response time [11:39:44] RECOVERY - HHVM rendering on mw2139 is OK: HTTP OK: HTTP/1.1 200 OK - 69906 bytes in 2.706 second response time [11:40:24] PROBLEM - puppet last run on cp4008 is CRITICAL: CRITICAL: puppet fail [11:41:31] 6operations, 6Phabricator, 6Release-Engineering-Team, 10Traffic, 5Patch-For-Review: Phabricator needs to expose ssh - https://phabricator.wikimedia.org/T100519#1951540 (10mmodell) @faidon: I was only summarizing the discussion we (myself, @reedy, @dzahn and @chasemp) had in IRC. Please don't shoot the me... [11:41:54] PROBLEM - HHVM rendering on mw2193 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:42:04] PROBLEM - HHVM rendering on mw2123 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:42:34] twentyafterfour: not my intention to shoot anyone! :) [11:44:03] RECOVERY - HHVM rendering on mw2193 is OK: HTTP OK: HTTP/1.1 200 OK - 69906 bytes in 2.422 second response time [11:44:13] RECOVERY - HHVM rendering on mw2123 is OK: HTTP OK: HTTP/1.1 200 OK - 69906 bytes in 2.534 second response time [11:45:04] PROBLEM - Host mw2122 is DOWN: PING CRITICAL - Packet loss = 100% [11:45:44] PROBLEM - puppet last run on pc1004 is CRITICAL: CRITICAL: Puppet has 2 failures [11:45:44] PROBLEM - HHVM rendering on mw2179 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.284 second response time [11:46:03] RECOVERY - Host mw2122 is UP: PING OK - Packet loss = 0%, RTA = 37.04 ms [11:46:20] (03PS5) 10Alex Monk: Move portals into generic sites.pp [puppet] - 10https://gerrit.wikimedia.org/r/264978 (https://phabricator.wikimedia.org/T86644) [11:46:24] PROBLEM - puppet last run on pc2004 is CRITICAL: CRITICAL: Puppet has 2 failures [11:48:04] RECOVERY - HHVM rendering on mw2179 is OK: HTTP OK: HTTP/1.1 200 OK - 69906 bytes in 8.716 second response time [11:48:44] PROBLEM - Host mw2117 is DOWN: PING CRITICAL - Packet loss = 100% [11:48:44] PROBLEM - Host mw2204 is DOWN: PING CRITICAL - Packet loss = 100% [11:49:24] RECOVERY - Host mw2204 is UP: PING OK - Packet loss = 0%, RTA = 36.45 ms [11:49:53] PROBLEM - Host mw2092 is DOWN: PING CRITICAL - Packet loss = 100% [11:50:15] 6operations, 10ops-codfw: mw2039 fails to reboot, mgmt interface unreachable - https://phabricator.wikimedia.org/T124282#1951543 (10Joe) 3NEW [11:50:24] RECOVERY - Host mw2117 is UP: PING OK - Packet loss = 0%, RTA = 37.21 ms [11:50:34] RECOVERY - Host mw2092 is UP: PING OK - Packet loss = 0%, RTA = 38.03 ms [11:51:19] ACKNOWLEDGEMENT - Host mw2039 is DOWN: PING CRITICAL - Packet loss = 100% Giuseppe Lavagetto T124282 [11:52:23] 
ACKNOWLEDGEMENT - Host mw2098 is DOWN: PING CRITICAL - Packet loss = 100% Giuseppe Lavagetto T85286 [11:54:34] PROBLEM - HHVM rendering on mw2101 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:56:24] PROBLEM - puppet last run on pc2005 is CRITICAL: CRITICAL: Puppet has 2 failures [11:56:43] RECOVERY - HHVM rendering on mw2101 is OK: HTTP OK: HTTP/1.1 200 OK - 69906 bytes in 2.094 second response time [11:56:53] PROBLEM - Host mw2131 is DOWN: PING CRITICAL - Packet loss = 100% [11:57:24] PROBLEM - Host mw2110 is DOWN: PING CRITICAL - Packet loss = 100% [11:57:35] PROBLEM - puppet last run on pc2006 is CRITICAL: CRITICAL: Puppet has 2 failures [11:58:13] RECOVERY - Host mw2131 is UP: PING OK - Packet loss = 0%, RTA = 36.76 ms [11:58:44] RECOVERY - Host mw2110 is UP: PING OK - Packet loss = 0%, RTA = 36.48 ms [12:02:00] PROBLEM - Apache HTTP on mw2202 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:02:22] PROBLEM - NTP on mw2069 is CRITICAL: NTP CRITICAL: Offset unknown [12:02:51] PROBLEM - NTP on mw2108 is CRITICAL: NTP CRITICAL: Offset unknown [12:03:00] RECOVERY - Apache HTTP on mw2202 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.479 second response time [12:03:01] PROBLEM - puppet last run on pc1006 is CRITICAL: CRITICAL: Puppet has 2 failures [12:03:41] (03PS1) 10Jcrespo: Fixing db path for newer machines by adding a condition on the role [puppet] - 10https://gerrit.wikimedia.org/r/265479 [12:05:51] RECOVERY - NTP on mw2069 is OK: NTP OK: Offset -0.001298308372 secs [12:06:40] RECOVERY - NTP on mw2108 is OK: NTP OK: Offset 0.0001020431519 secs [12:06:44] 6operations, 6Phabricator, 6Release-Engineering-Team, 10Traffic, 5Patch-For-Review: Phabricator needs to expose ssh - https://phabricator.wikimedia.org/T100519#1951552 (10Reedy) >>! In T100519#1951505, @faidon wrote: > In any case, please approach "X is broken and I don't know how to fix it" with "can so... [12:07:20] RECOVERY - puppet last run on cp4008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:07:40] 6operations, 6Phabricator, 6Release-Engineering-Team, 10Traffic, 5Patch-For-Review: Phabricator needs to expose ssh - https://phabricator.wikimedia.org/T100519#1951553 (10hashar) The DNS IPv6 entry has been dropped yesterday because there is no ssh service listening there to serve the git repositories.... 
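The "Offset unknown" NTP alerts above are typical of freshly rebooted hosts that have not resynchronized yet, and they clear on their own a few minutes later, as the recoveries show. To inspect an offset by hand, one could use ntpq locally or the stock monitoring plugin the Icinga check presumably wraps (the plugin path and thresholds here are assumptions):

    # peer list with offsets, run on the host itself
    ntpq -pn
    # remote check: warn at 0.5s offset, go critical at 1s
    /usr/lib/nagios/plugins/check_ntp_time -H mw2069.codfw.wmnet -w 0.5 -c 1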
[12:09:21] PROBLEM - puppet last run on pc1005 is CRITICAL: CRITICAL: Puppet has 2 failures [12:09:31] PROBLEM - Apache HTTP on mw2061 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.280 second response time [12:09:31] PROBLEM - Host mw2111 is DOWN: PING CRITICAL - Packet loss = 100% [12:09:31] PROBLEM - Host mw2119 is DOWN: PING CRITICAL - Packet loss = 100% [12:09:41] PROBLEM - HHVM rendering on mw2061 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:10:11] RECOVERY - Host mw2111 is UP: PING OK - Packet loss = 0%, RTA = 37.38 ms [12:10:21] RECOVERY - Host mw2119 is UP: PING OK - Packet loss = 0%, RTA = 37.07 ms [12:10:31] PROBLEM - Apache HTTP on mw2111 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:10:45] !log upgrading cr1-codfw to JunOS 13.3R8.7 [12:10:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:11:41] RECOVERY - Apache HTTP on mw2061 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.860 second response time [12:11:51] RECOVERY - HHVM rendering on mw2061 is OK: HTTP OK: HTTP/1.1 200 OK - 33593 bytes in 3.203 second response time [12:12:31] RECOVERY - Apache HTTP on mw2111 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.487 second response time [12:17:51] PROBLEM - Apache HTTP on mw2143 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:18:11] PROBLEM - Apache HTTP on mw2198 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:18:20] PROBLEM - Apache HTTP on mw2019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:20:00] RECOVERY - Apache HTTP on mw2143 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.554 second response time [12:20:20] RECOVERY - Apache HTTP on mw2198 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.443 second response time [12:20:21] RECOVERY - Apache HTTP on mw2019 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.146 second response time [12:26:20] PROBLEM - HHVM rendering on mw2017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:28:21] RECOVERY - HHVM rendering on mw2017 is OK: HTTP OK: HTTP/1.1 200 OK - 33593 bytes in 2.705 second response time [12:31:38] PROBLEM - HHVM rendering on mw2165 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:31:48] PROBLEM - NTP on mw2165 is CRITICAL: NTP CRITICAL: Offset unknown [12:31:48] PROBLEM - NTP on mw2205 is CRITICAL: NTP CRITICAL: Offset unknown [12:32:28] PROBLEM - Host mw2201 is DOWN: PING CRITICAL - Packet loss = 100% [12:32:48] RECOVERY - HHVM rendering on mw2165 is OK: HTTP OK: HTTP/1.1 200 OK - 33592 bytes in 0.280 second response time [12:33:18] RECOVERY - NTP on mw2165 is OK: NTP OK: Offset 0.00022149086 secs [12:33:18] RECOVERY - NTP on mw2205 is OK: NTP OK: Offset -0.001058459282 secs [12:33:19] RECOVERY - Host mw2201 is UP: PING OK - Packet loss = 0%, RTA = 38.39 ms [12:33:42] 6operations, 10Wikimedia-General-or-Unknown, 5Patch-For-Review: Job queue is growing and growing - https://phabricator.wikimedia.org/T124194#1951586 (10Luke081515) 5Open>3Resolved a:3Luke081515 Seems like it is fixed now, (the queue needs still some time to make the backlog smaller) but categorysation... 
[12:33:59] 6operations, 10Wikimedia-General-or-Unknown, 5Patch-For-Review: Job queue is growing and growing - https://phabricator.wikimedia.org/T124194#1951589 (10Luke081515) a:5Luke081515>3ori [12:34:11] 6operations, 10Wikimedia-General-or-Unknown: Job queue is growing and growing - https://phabricator.wikimedia.org/T124194#1948627 (10Luke081515) [12:37:08] PROBLEM - Host mw2196 is DOWN: PING CRITICAL - Packet loss = 100% [12:37:48] PROBLEM - HHVM rendering on mw2208 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:37:48] RECOVERY - Host mw2196 is UP: PING OK - Packet loss = 0%, RTA = 36.56 ms [12:38:08] PROBLEM - Apache HTTP on mw2016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:39:49] RECOVERY - HHVM rendering on mw2208 is OK: HTTP OK: HTTP/1.1 200 OK - 70157 bytes in 2.653 second response time [12:40:09] RECOVERY - Apache HTTP on mw2016 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.627 second response time [12:40:28] PROBLEM - Host mw2057 is DOWN: PING CRITICAL - Packet loss = 100% [12:40:28] PROBLEM - Host mw2068 is DOWN: PING CRITICAL - Packet loss = 100% [12:41:48] RECOVERY - Host mw2068 is UP: PING OK - Packet loss = 0%, RTA = 36.28 ms [12:41:49] RECOVERY - Host mw2057 is UP: PING OK - Packet loss = 0%, RTA = 36.21 ms [12:41:59] PROBLEM - HHVM rendering on mw2181 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:43:04] 6operations, 10netops, 7Monitoring: Icinga monitoring for (Juniper MX480) routing engine status - https://phabricator.wikimedia.org/T124285#1951614 (10mark) 3NEW [12:43:59] RECOVERY - HHVM rendering on mw2181 is OK: HTTP OK: HTTP/1.1 200 OK - 70157 bytes in 2.354 second response time [12:44:03] 6operations, 10netops, 7Monitoring: Icinga monitoring for (Juniper MX480) routing engine status - https://phabricator.wikimedia.org/T124285#1951622 (10mark) p:5Triage>3Normal [12:44:48] 6operations, 10netops, 7Monitoring: Icinga monitoring for (Juniper MX480) routing engine status - https://phabricator.wikimedia.org/T124285#1951625 (10faidon) Note that Juniper raises a system (or chassis?) alarm when the RE is down, so a check for "show chassis alarms" and "show system alarms" (as described al... [12:44:58] PROBLEM - Host mw2182 is DOWN: PING CRITICAL - Packet loss = 100% [12:45:08] PROBLEM - Host mw2042 is DOWN: PING CRITICAL - Packet loss = 100% [12:45:48] RECOVERY - Host mw2182 is UP: PING OK - Packet loss = 0%, RTA = 36.29 ms [12:45:59] PROBLEM - HHVM rendering on mw2062 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:46:08] RECOVERY - Host mw2042 is UP: PING OK - Packet loss = 0%, RTA = 36.33 ms [12:47:59] RECOVERY - HHVM rendering on mw2062 is OK: HTTP OK: HTTP/1.1 200 OK - 70157 bytes in 3.746 second response time [12:49:08] 6operations, 10Analytics-Cluster, 10EventBus, 6Services: Investigate proper set up for using Kafka MirrorMaker with new main Kafka clusters. - https://phabricator.wikimedia.org/T123954#1951634 (10Joe) I do think that we DEFINITELY want to relay events to active listeners in both datacenters. What we //don't...
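Faidon's comment on T124285 above points at "show chassis alarms" and "show system alarms" as the conditions to watch. One way such an Icinga check could be wired up is by running those commands over ssh and matching the clean output; everything below (the automation user, the exact OK string) is an assumption, not the eventual implementation:

    # Junos prints "No alarms currently active" when the chassis is clean
    if ssh -o BatchMode=yes monitor@cr1-eqiad.wikimedia.org 'show chassis alarms' \
        | grep -q 'No alarms currently active'; then
        echo 'OK: no chassis alarms'
    else
        echo 'CRITICAL: chassis alarm present'; exit 2
    fi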
[12:51:19] RECOVERY - puppet last run on mw1015 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:53:08] PROBLEM - Host mw2106 is DOWN: PING CRITICAL - Packet loss = 100% [12:54:19] PROBLEM - Apache HTTP on mw2163 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:54:39] RECOVERY - Host mw2106 is UP: PING OK - Packet loss = 0%, RTA = 36.25 ms [12:56:19] RECOVERY - Apache HTTP on mw2163 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.437 second response time [12:57:40] PROBLEM - Host mw2045 is DOWN: PING CRITICAL - Packet loss = 100% [12:57:59] RECOVERY - Host mw2045 is UP: PING OK - Packet loss = 0%, RTA = 36.09 ms [12:57:59] 6operations, 10netops, 7Monitoring: Icinga monitoring for (Juniper MX480) routing engine status - https://phabricator.wikimedia.org/T124285#1951637 (10mark) [12:58:01] 6operations, 10netops, 7Monitoring: Juniper monitoring - https://phabricator.wikimedia.org/T83992#1951638 (10mark) [12:59:10] PROBLEM - Host cr1-codfw is DOWN: PING CRITICAL - Packet loss = 100% [12:59:36] 6operations, 10Traffic: Forward-port VCL to Varnish 4 - https://phabricator.wikimedia.org/T124279#1951640 (10faidon) https://github.com/fgsch/varnish3to4 is pretty good. I spent a small amount of time (less than a half hour) at some point running this + manual changes against the upload VCL and I was successfu... [13:00:58] RECOVERY - Host cr1-codfw is UP: PING OK - Packet loss = 0%, RTA = 37.31 ms [13:02:28] PROBLEM - Apache HTTP on mw2183 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:04:28] RECOVERY - Apache HTTP on mw2183 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.475 second response time [13:04:50] 6operations, 10Traffic: Forward-port VCL to Varnish 4 - https://phabricator.wikimedia.org/T124279#1951642 (10mark) And instead of inline C, we can consider using vmods too. [13:05:38] PROBLEM - Host mw2047 is DOWN: PING CRITICAL - Packet loss = 100% [13:05:59] RECOVERY - Host mw2047 is UP: PING OK - Packet loss = 0%, RTA = 37.74 ms [13:06:28] PROBLEM - Apache HTTP on mw2044 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:08:19] RECOVERY - Apache HTTP on mw2044 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.192 second response time [13:12:38] PROBLEM - Host mw2184 is DOWN: PING CRITICAL - Packet loss = 100% [13:13:28] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
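The varnish3to4 script faidon mentions above mechanizes most of the VCL 3-to-4 renames (vcl_fetch -> vcl_backend_response, req.request -> req.method, error -> return (synth(...)), plus the mandatory leading "vcl 4.0;" marker), leaving the rest for hand review. A sketch of running it; the input filename is invented and the exact invocation should be checked against the project's README:

    git clone https://github.com/fgsch/varnish3to4
    ./varnish3to4/varnish3to4 upload-backend.inc.vcl > upload-backend.v4.vcl
    diff -u upload-backend.inc.vcl upload-backend.v4.vcl   # review what changed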
[13:13:29] RECOVERY - Host mw2184 is UP: PING OK - Packet loss = 0%, RTA = 36.61 ms [13:16:38] PROBLEM - Host mw2154 is DOWN: PING CRITICAL - Packet loss = 100% [13:17:29] RECOVERY - Host mw2154 is UP: PING OK - Packet loss = 0%, RTA = 36.83 ms [13:17:38] PROBLEM - HHVM rendering on mw2154 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:17:49] PROBLEM - HHVM rendering on mw2171 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:19:39] RECOVERY - HHVM rendering on mw2154 is OK: HTTP OK: HTTP/1.1 200 OK - 70165 bytes in 2.371 second response time [13:19:50] RECOVERY - HHVM rendering on mw2171 is OK: HTTP OK: HTTP/1.1 200 OK - 70165 bytes in 2.371 second response time [13:20:21] <_joe_> !log rolling reboot of imagescalers, jobrunners in codfw [13:20:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:20:32] _joe_: <3 [13:21:06] <_joe_> paravoid: I'm just doing codfw, eqiad is a bit more challenging, but I guess I'll be done by tomorrow evening [13:21:49] PROBLEM - HHVM rendering on mw2174 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:23:49] RECOVERY - HHVM rendering on mw2174 is OK: HTTP OK: HTTP/1.1 200 OK - 70165 bytes in 2.427 second response time [13:23:58] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [13:24:48] PROBLEM - Host mw2107 is DOWN: PING CRITICAL - Packet loss = 100% [13:25:59] PROBLEM - Apache HTTP on mw2153 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:26:19] RECOVERY - Host mw2107 is UP: PING OK - Packet loss = 0%, RTA = 36.89 ms [13:27:59] RECOVERY - Apache HTTP on mw2153 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.179 second response time [13:30:18] PROBLEM - HHVM rendering on mw2095 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:30:28] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[13:30:59] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 118, down: 2, dormant: 0, excluded: 0, unused: 0; xe-5/2/0: down - Core: cr1-codfw:xe-5/2/0 {#10695} [10Gbps DF]; ae0: down - Core: cr1-codfw:ae0 [13:31:08] PROBLEM - Host cr1-codfw is DOWN: PING CRITICAL - Packet loss = 100% [13:32:30] RECOVERY - HHVM rendering on mw2095 is OK: HTTP OK: HTTP/1.1 200 OK - 70165 bytes in 2.178 second response time [13:32:38] RECOVERY - Host cr1-codfw is UP: PING OK - Packet loss = 0%, RTA = 38.12 ms [13:33:08] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0 [13:36:49] PROBLEM - Apache HTTP on mw2020 is CRITICAL: Connection refused [13:37:59] PROBLEM - HHVM rendering on mw2020 is CRITICAL: Connection refused [13:38:19] PROBLEM - RAID on mw2020 is CRITICAL: Connection refused by host [13:38:29] PROBLEM - configured eth on mw2020 is CRITICAL: Connection refused by host [13:38:49] PROBLEM - Check size of conntrack table on mw2020 is CRITICAL: Connection refused by host [13:38:59] PROBLEM - dhclient process on mw2020 is CRITICAL: Connection refused by host [13:39:19] PROBLEM - nutcracker port on mw2020 is CRITICAL: Connection refused by host [13:39:19] PROBLEM - DPKG on mw2020 is CRITICAL: Connection refused by host [13:39:38] PROBLEM - nutcracker process on mw2020 is CRITICAL: Connection refused by host [13:39:39] PROBLEM - Disk space on mw2020 is CRITICAL: Connection refused by host [13:39:49] PROBLEM - salt-minion processes on mw2020 is CRITICAL: Connection refused by host [13:40:19] PROBLEM - HHVM processes on mw2020 is CRITICAL: Connection refused by host [13:40:20] PROBLEM - puppet last run on mw2020 is CRITICAL: Connection refused by host [13:41:08] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [13:41:58] PROBLEM - Host mw2005 is DOWN: PING CRITICAL - Packet loss = 100% [13:42:58] RECOVERY - Host mw2005 is UP: PING OK - Packet loss = 0%, RTA = 36.06 ms [13:45:33] (03PS1) 10Alex Monk: annualreport: Ensure latest checkout of git repo [puppet] - 10https://gerrit.wikimedia.org/r/265485 [13:49:10] (03PS2) 10Jcrespo: Fixing db path for newer machines by adding a condition on the role [puppet] - 10https://gerrit.wikimedia.org/r/265479 [13:49:48] PROBLEM - Host mw2083 is DOWN: PING CRITICAL - Packet loss = 100% [13:50:06] (03CR) 10Alexandros Kosiaris: [C: 032] annualreport: Ensure latest checkout of git repo [puppet] - 10https://gerrit.wikimedia.org/r/265485 (owner: 10Alex Monk) [13:51:49] RECOVERY - Host mw2083 is UP: PING OK - Packet loss = 0%, RTA = 36.84 ms [13:51:51] (03PS3) 10Jcrespo: Fixing db path for newer machines by adding a condition on the role [puppet] - 10https://gerrit.wikimedia.org/r/265479 [13:52:10] PROBLEM - HHVM rendering on mw2151 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:54:18] RECOVERY - HHVM rendering on mw2151 is OK: HTTP OK: HTTP/1.1 200 OK - 69961 bytes in 5.876 second response time [13:55:39] PROBLEM - puppet last run on bromine is CRITICAL: CRITICAL: Puppet has 1 failures [13:56:19] PROBLEM - Apache HTTP on mw2088 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.404 second response time [13:58:28] RECOVERY - Apache HTTP on mw2088 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.178 second response time [13:58:49] (03PS1) 10Alexandros Kosiaris: simplify annualreport module [puppet] - 10https://gerrit.wikimedia.org/r/265487 [13:59:55]
(03CR) 10Alexandros Kosiaris: [C: 032] simplify annualreport module [puppet] - 10https://gerrit.wikimedia.org/r/265487 (owner: 10Alexandros Kosiaris) [14:00:45] (03PS4) 10Jcrespo: Fixing db path for newer machines by adding a condition on the role [puppet] - 10https://gerrit.wikimedia.org/r/265479 [14:02:10] RECOVERY - puppet last run on bromine is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:02:49] PROBLEM - Host mw2081 is DOWN: PING CRITICAL - Packet loss = 100% [14:03:16] (03CR) 10Jcrespo: [C: 032] "Verified with puppet compiler. The fix is ugly, but it is only temporal, until pc100[123] are decommissioned." [puppet] - 10https://gerrit.wikimedia.org/r/265479 (owner: 10Jcrespo) [14:03:19] RECOVERY - Host mw2081 is UP: PING OK - Packet loss = 0%, RTA = 35.04 ms [14:05:58] PROBLEM - Host cr2-codfw is DOWN: PING CRITICAL - Packet loss = 100% [14:06:59] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [14:07:16] hrm [14:07:19] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [14:07:22] I wonder if ^^^ is codfw related [14:07:26] oh, never mind then [14:09:38] RECOVERY - Host cr2-codfw is UP: PING OK - Packet loss = 0%, RTA = 37.60 ms [14:09:58] could be if for some reason esams traffic is going via-codfw to eqiad :) [14:10:09] that's impossible :) [14:10:18] never say never! [14:11:08] RECOVERY - puppet last run on pc2004 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [14:13:28] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [14:13:48] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [14:15:28] PROBLEM - Host mw2087 is DOWN: PING CRITICAL - Packet loss = 100% [14:21:08] RECOVERY - puppet last run on pc2005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:23:09] RECOVERY - puppet last run on pc1004 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [14:24:45] RECOVERY - puppet last run on pc2006 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [14:26:59] RECOVERY - puppet last run on pc1006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:28:08] (03PS1) 10Jcrespo: Enabling ssl and ferm on parsercaches, disabling performance_schema [puppet] - 10https://gerrit.wikimedia.org/r/265488 [14:28:58] (03CR) 10Jcrespo: [C: 032] Enabling ssl and ferm on parsercaches, disabling performance_schema [puppet] - 10https://gerrit.wikimedia.org/r/265488 (owner: 10Jcrespo) [14:30:00] 6operations, 10OTRS, 7user-notice: Upgrade OTRS to a more recent stable release - https://phabricator.wikimedia.org/T74109#1951715 (10Johan) Asking dumb question to make sure the answer is as obvious as I hope it is: OTRS will (probably) be down between 0800 UTC and probably somewhere between 1400 or 1600 U... [14:36:19] PROBLEM - Host cr2-codfw is DOWN: PING CRITICAL - Packet loss = 100% [14:37:01] !log upgraded cr2-codfw to JunOS 13.3R8.7 [14:37:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:37:58] RECOVERY - Host cr2-codfw is UP: PING OK - Packet loss = 0%, RTA = 36.75 ms [14:40:01] 6operations, 10OTRS, 7user-notice: Upgrade OTRS to a more recent stable release - https://phabricator.wikimedia.org/T74109#1951756 (10akosiaris) >>! 
In T74109#1951715, @Johan wrote: > Asking dumb question to make sure the answer is as obvious as I hope it is: OTRS will (probably) be down between 0800 UTC an... [14:41:57] (03PS1) 10Faidon Liambotis: Revert "Drain codfw for network maintenance" [dns] - 10https://gerrit.wikimedia.org/r/265489 [14:42:35] (03CR) 10Faidon Liambotis: [C: 032] Revert "Drain codfw for network maintenance" [dns] - 10https://gerrit.wikimedia.org/r/265489 (owner: 10Faidon Liambotis) [14:47:43] 6operations, 10netops: Upgrade JunOS on cr1/cr2-codfw - https://phabricator.wikimedia.org/T113640#1951766 (10faidon) 5Open>3Resolved a:3faidon All done! [14:47:49] (03PS1) 10Jcrespo: Modifying parsercache including latest optimizations and options [puppet] - 10https://gerrit.wikimedia.org/r/265490 [14:50:36] 6operations, 6Phabricator, 6Release-Engineering-Team, 10Traffic, 5Patch-For-Review: Phabricator needs to expose ssh - https://phabricator.wikimedia.org/T100519#1951773 (10BBlack) I'm putting together 3x commits for review that I think will resolve this, they should show up below... [14:51:01] (03CR) 10Jcrespo: [C: 032] Modifying parsercache including latest optimizations and options [puppet] - 10https://gerrit.wikimedia.org/r/265490 (owner: 10Jcrespo) [14:53:19] PROBLEM - puppet last run on ms-be2010 is CRITICAL: CRITICAL: puppet fail [14:55:40] (03PS1) 10BBlack: Add IPv6 for iridium-vcs.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/265492 (https://phabricator.wikimedia.org/T100519) [14:56:05] (03PS1) 10BBlack: Add iridium-vcs.eqiad.wmnet ipv6 to phab puppetization [puppet] - 10https://gerrit.wikimedia.org/r/265493 (https://phabricator.wikimedia.org/T100519) [14:56:07] (03PS1) 10BBlack: Add public IPv6 to git-ssh LVS IPs [puppet] - 10https://gerrit.wikimedia.org/r/265494 (https://phabricator.wikimedia.org/T100519) [14:56:24] 6operations, 10OTRS, 7user-notice: Upgrade OTRS to a more recent stable release - https://phabricator.wikimedia.org/T74109#1951980 (10Johan) Thanks. [14:56:53] 6operations, 6Phabricator, 6Release-Engineering-Team, 10Traffic, 5Patch-For-Review: Phabricator needs to expose ssh - https://phabricator.wikimedia.org/T100519#1951981 (10BBlack) I think those 3 and then uncommenting the public after it's deployed and tested should do the trick. Needs review! [15:03:27] PROBLEM - NTP on mw2020 is CRITICAL: NTP CRITICAL: No response from NTP server [15:19:27] RECOVERY - puppet last run on ms-be2010 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [15:36:30] (03PS3) 10DCausse: Recycle completion suggester indices for small wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265472 [15:39:32] 7Blocked-on-Operations, 6operations, 10Wikipedia-iOS-App-Product-Backlog: Provide access to iOS team for piwik production server - https://phabricator.wikimedia.org/T124218#1952109 (10akosiaris) 5Open>3Resolved a:3akosiaris Hello, Users: @KHammerstein @Fjalapeno @JMinor @Bgerstle-WMF @Nirzar have b... [15:42:37] 6operations, 10OCG-General-or-Unknown, 6Scrum-of-Scrums, 6Services: The OCG cleanup cache script doesn't work properly - https://phabricator.wikimedia.org/T120079#1952143 (10akosiaris) @cscott, any news on this? [15:42:55] 6operations, 5Patch-For-Review, 7Pybal: Pybal IdleConnectionMonitor with TCP KeepAlive shows random fails if more than 100 servers are involved. 
- https://phabricator.wikimedia.org/T119372#1952151 (10akosiaris) p:5High>3Normal [15:43:03] (03PS1) 10Andrew Bogott: Bump kernel version in jessie base image [puppet] - 10https://gerrit.wikimedia.org/r/265500 [15:51:13] cmjohnson: we're doing T123546 now ja? [15:51:30] now or in 9 mins...whichever you prefer...i am ready [15:52:02] well 9 mins, but ja [15:52:08] ok, i'm going to prep the el things [15:52:55] jouncebot: next [15:52:55] In 0 hour(s) and 7 minute(s): Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160121T1600) [15:55:52] (03PS1) 10Jcrespo: New parsercache servers for codfw datacenter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265501 (https://phabricator.wikimedia.org/T121879) [15:55:59] hashar: around for next one hour? [15:58:25] kart_: nope, leaving soonish [15:58:28] kart_: what is happening? [15:59:17] 7Blocked-on-Operations, 6operations, 10Wikipedia-iOS-App-Product-Backlog: Provide access to iOS team for piwik production server - https://phabricator.wikimedia.org/T124218#1952246 (10BGerstle-WMF) confirmed, now have access! [15:59:41] !log stopping eventlogging mysql consumers for https://phabricator.wikimedia.org/T123546 [15:59:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:00:04] anomie ostriches thcipriani marktraceur Krenair: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160121T1600). Please do the needful. [16:00:04] Addshore mdholloway kart_ bblack: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [16:00:07] *waves* [16:00:09] ottomata power down for me [16:00:21] yo [16:00:28] cmjohnson: 1 min [16:00:42] yep...ping me when ok [16:00:48] I can SWAT today. Going to try to get config changes out first then backports. addshore you're up first. [16:00:56] awesome :) [16:01:25] cmjohnson: good to go [16:01:28] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264732 (owner: 10Addshore) [16:01:34] * kart_ here [16:01:34] power down dbproxy1004 go! [16:01:35] cool [16:02:03] thcipriani: we've table creation too :) [16:02:20] <- here [16:02:35] kart_: oh good :) [16:02:49] (03Merged) 10jenkins-bot: wgRCWatchCategoryMembership true on dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264732 (owner: 10Addshore) [16:03:24] did you stop eventlogging? I want to restart db1046 too [16:04:25] ottomata: powering on [16:05:03] mw2020.codfw.wmnet REMOTE HOST IDENTIFICATION HAS CHANGED! ← expected? [16:05:18] PROBLEM - Host dbproxy1004 is DOWN: PING CRITICAL - Packet loss = 100% [16:05:21] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: wgRCWatchCategoryMembership true on dewiki [[gerrit:264732]] (duration: 01m 28s) [16:05:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:05:27] thcipriani: depends how long it's been since you connected there [16:05:33] ^ addshore check please [16:05:40] (03PS3) 10KartikMistry: Beta: Set ContentTranslationCorpora to true [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265458 (https://phabricator.wikimedia.org/T119617) [16:05:45] thcipriani: looks like its deployed, going to manually test now! [16:05:49] bblack: probably yesterday morning. 
[16:06:21] it looks like it's mid-reinstall [16:06:31] or something like that, it's not in a normal state [16:06:34] mw2098.codfw mw2039.codfw mw2087.codfw also timed-out [16:07:02] _joe_: ^ ? [16:07:04] there are several mws that failed to restart after a rolling restart this morning [16:07:08] RECOVERY - Host dbproxy1004 is UP: PING OK - Packet loss = 0%, RTA = 2.57 ms [16:07:11] there are tickets about that [16:07:14] ok [16:07:20] (03PS2) 10KartikMistry: Enable ContentTranslationCorpora [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265459 (https://phabricator.wikimedia.org/T119617) [16:07:22] that may slow down scap or something :/ [16:07:41] yeah, they'll timeout every time :\ [16:08:04] jynus you have the ticket ref? phab search sucks [16:08:07] hey, I didn't restart them, don't blame me :-P [16:08:16] didn't create the tickets either [16:08:44] <_joe_> thcipriani: they are dead, yes, I should remove them [16:08:46] only this https://phabricator.wikimedia.org/T85286, after it timed out gor me [16:08:49] <_joe_> still didn't have time [16:08:59] thcipriani: generally all looking fine [16:09:07] PROBLEM - citoid endpoints health on sca1002 is CRITICAL: /api (bad PMCID) is CRITICAL: Could not fetch url http://10.64.48.29:1970/api: Timeout on connection while downloading http://10.64.48.29:1970/api [16:09:15] addshore: cool, thanks for checking. [16:09:29] _joe_: kk, I'll continue with SWAT then. [16:09:59] mdholloway: I'll come back to yours since tests take a little bit to merge. [16:10:16] thcipriani: sounds good [16:10:44] kart_: about this table... [16:10:55] yes. [16:11:06] thcipriani: you can go with Beta config change first. [16:11:13] (tables are created there) [16:11:28] then tables in Production. And, enable config change. [16:12:43] this is other: https://phabricator.wikimedia.org/T124282 [16:13:07] PROBLEM - Last backup of the tools filesystem on labstore1001 is CRITICAL: CRITICAL - Last run result for unit replicate-tools was exit-code [16:13:10] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265458 (https://phabricator.wikimedia.org/T119617) (owner: 10KartikMistry) [16:13:50] (03CR) 1020after4: [C: 031] "+1 because I can't +2 on operations/puppet :(" [puppet] - 10https://gerrit.wikimedia.org/r/265493 (https://phabricator.wikimedia.org/T100519) (owner: 10BBlack) [16:13:52] (03Merged) 10jenkins-bot: Beta: Set ContentTranslationCorpora to true [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265458 (https://phabricator.wikimedia.org/T119617) (owner: 10KartikMistry) [16:13:54] ebernhardson et all: es20xx have been warning about disk space for more than a day now [16:14:06] not a day [16:14:10] months [16:14:14] kart_: kk, while we wait for that change to go out to beta, I'm going to circle back and get mdholloway 's stuff out the door. [16:14:27] thcipriani: OK! [16:14:28] er [16:14:30] dammit [16:14:37] paravoid, I've been complaining about those for almost a year [16:14:39] (03CR) 1020after4: [C: 031] "because I can't +2 in operations/puppet" [puppet] - 10https://gerrit.wikimedia.org/r/265494 (https://phabricator.wikimedia.org/T100519) (owner: 10BBlack) [16:14:39] ebernhardson: ignore that! :) [16:14:40] brainfart [16:14:42] 6operations, 10ops-eqiad, 10Analytics: Possible bad mem chip or slot on dbproxy1004 - https://phabricator.wikimedia.org/T123546#1952293 (10Cmjohnson) 5Open>3Resolved Replaced the bad DIMM at slot A3 [16:14:56] thcipriani: I will have to update Beta patch once Production in too. 
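The "REMOTE HOST IDENTIFICATION HAS CHANGED" warning above is the expected consequence of a reinstall: the host generated fresh ssh host keys, so the old ones cached in known_hosts no longer match. Once the reinstall is confirmed legitimate, the stale entry can be dropped:

    # forget the pre-reinstall host key, then reconnect to record the new one
    ssh-keygen -R mw2020.codfw.wmnet
    ssh mw2020.codfw.wmnet true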
[16:15:07] paravoid, vote yes to six servers now! [16:16:17] (03PS4) 10Hashar: contint: rsync server to hold jobs caches [puppet] - 10https://gerrit.wikimedia.org/r/253322 (https://phabricator.wikimedia.org/T116017) [16:16:17] cmjohnson: all good? [16:16:40] looks good [16:17:49] RECOVERY - citoid endpoints health on sca1002 is OK: All endpoints are healthy [16:18:17] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 37, down: 0, dormant: 0, excluded: 3, unused: 0 [16:18:40] (03CR) 10Hashar: "Fixed root dir ownership so it belongs to jenkins-deploy" [puppet] - 10https://gerrit.wikimedia.org/r/253322 (https://phabricator.wikimedia.org/T116017) (owner: 10Hashar) [16:18:46] (03PS1) 10Jcrespo: Depool pc1001 for maintenance (clone to pc1004) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265504 [16:19:14] ottomata: yes I've been finished for awhile...pinged you about it [16:19:28] !log deactivating GTT BGP peering on cr2-eqiad [16:19:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:19:55] cmjohnson: cool, saw the message about powering on, wasn't sure that was all done [16:20:24] oh..yeah sorry I should've been more clear [16:20:38] !log started eventlogging mysql consumers [16:20:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:20:48] which servers have to be depooled? [16:21:02] (03CR) 10Mobrovac: [C: 04-1] Add the visualdiff module + instantiate visualdiffing services (037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/264032 (https://phabricator.wikimedia.org/T118778) (owner: 10Subramanya Sastry) [16:21:53] 6operations, 10ops-codfw: mw2087 fails to reboot, mgmt interface unreachable - https://phabricator.wikimedia.org/T124299#1952308 (10Joe) 3NEW [16:22:08] thanks cmjohnson! [16:22:22] jynus: when do you want to start the el tokudb stuff? [16:22:34] <_joe_> papaul: we do have 3 appservers that failed to reboot; would you be able to take a look today or tomorrow? [16:22:49] ACKNOWLEDGEMENT - Host mw2087 is DOWN: PING CRITICAL - Packet loss = 100% Giuseppe Lavagetto T124299 [16:22:54] !log thcipriani@tin Synchronized php-1.27.0-wmf.11/extensions/MobileApp/config/config.json: SWAT: Roll out RESTBase usage to Android Beta app: 100% [[gerrit:265118]] (duration: 01m 28s) [16:22:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:23:20] ottomata, let's do it [16:23:49] can I just do it? [16:23:52] ^ mdholloway sync'd out on test wikis [16:23:56] oook, gimme just a few, i'm going to puppetize this consumer stop [16:24:08] waiting [16:24:22] I will prepare some updates for that mysql, too [16:24:41] thcipriani: sweet! thanks thcipriani [16:24:49] doing .10 now [16:24:55] k [16:25:37] (03PS1) 10Giuseppe Lavagetto: lvs: switch all of ulsfo to use etcd for pybal config [puppet] - 10https://gerrit.wikimedia.org/r/265505 [16:26:12] <_joe_> bblack, ema ^^ I'm merging this after my meeting. Or, if you feel bold enough, you can go on ofc :P [16:27:07] ok :) [16:27:32] (03CR) 10DCausse: "Unless my math is wrong we have 28 shards for enwiki_content."
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/265372 (https://phabricator.wikimedia.org/T124215) (owner: 10EBernhardson) [16:28:12] (03PS1) 10Ottomata: Temporarily disable eventlogging mysql consumers and burrow monitoring for them [puppet] - 10https://gerrit.wikimedia.org/r/265506 (https://phabricator.wikimedia.org/T120187) [16:28:27] !log thcipriani@tin Synchronized php-1.27.0-wmf.10/extensions/MobileApp/config/config.json: SWAT: Roll out RESTBase usage to Android Beta app: 100% [[gerrit:265117]] (duration: 01m 27s) [16:28:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:28:58] mdholloway: ^ sync'd and purged the url [16:29:10] (03CR) 10jenkins-bot: [V: 04-1] Temporarily disable eventlogging mysql consumers and burrow monitoring for them [puppet] - 10https://gerrit.wikimedia.org/r/265506 (https://phabricator.wikimedia.org/T120187) (owner: 10Ottomata) [16:29:27] thcipriani: great! yup, looks good. thanks again. [16:29:32] (03PS2) 10Ottomata: Temporarily disable eventlogging mysql consumers and burrow monitoring for them [puppet] - 10https://gerrit.wikimedia.org/r/265506 (https://phabricator.wikimedia.org/T120187) [16:29:37] mdholloway: cool, thanks for checking! [16:30:01] kart_: looks like beta-scap-eqiad finished, beta look ok? [16:30:27] thcipriani: a minute. [16:30:47] kart_: kk, gonna get bblack 's change done then [16:31:04] \o/ [16:31:07] (03CR) 10Ottomata: [C: 032] Temporarily disable eventlogging mysql consumers and burrow monitoring for them [puppet] - 10https://gerrit.wikimedia.org/r/265506 (https://phabricator.wikimedia.org/T120187) (owner: 10Ottomata) [16:31:18] PROBLEM - citoid endpoints health on sca1002 is CRITICAL: /api (bad PMCID) is CRITICAL: Could not fetch url http://10.64.48.29:1970/api: Timeout on connection while downloading http://10.64.48.29:1970/api [16:31:37] 6operations, 6Phabricator, 6Release-Engineering-Team, 10Traffic, 5Patch-For-Review: Phabricator needs to expose ssh - https://phabricator.wikimedia.org/T100519#1952355 (10greg) Thanks @bblack [16:32:16] thcipriani: OK. Lets go for table creation on Production. [16:33:17] kart_: lemme get this change out the door (in zuul now), then I'll circle back to you change. [16:33:34] *your change [16:33:46] (03PS3) 10KartikMistry: Enable ContentTranslationCorpora [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265459 (https://phabricator.wikimedia.org/T119617) [16:33:55] (03CR) 10Luke081515: [C: 031] Add public IPv6 to git-ssh LVS IPs [puppet] - 10https://gerrit.wikimedia.org/r/265494 (https://phabricator.wikimedia.org/T100519) (owner: 10BBlack) [16:34:16] thcipriani: sure. [16:34:35] jynus: good to go [16:34:59] !log stopped eventlogging mysql consumers for long downtime: https://phabricator.wikimedia.org/T120187 [16:35:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:35:29] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There are 2 unmerged changes in mediawiki_config (dir /srv/mediawiki-staging/). [16:35:37] RECOVERY - citoid endpoints health on sca1002 is OK: All endpoints are healthy [16:35:48] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There are 2 unmerged changes in mediawiki_config (dir /srv/mediawiki-staging/). [16:36:53] ^ I'll get these in a sec, beta-only changes [16:37:06] yeah. 
Thought so :) [16:37:07] (03PS4) 10KartikMistry: Enable ContentTranslationCorpora [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265459 (https://phabricator.wikimedia.org/T119617) [16:37:23] joe: i will [16:40:54] !log thcipriani@tin Synchronized php-1.27.0-wmf.11/extensions/MobileFrontend/includes/MobileFrontend.hooks.php: SWAT: Use TitleSquidURLs hook to purge mobile URLs directly Part I [[gerrit:265486]] (duration: 01m 28s) [16:40:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:41:36] ^ bblack wanted to ensure the function was there before adding it to hooks, sync-ing part II (MobileFrontend.php) now. [16:41:47] thcipriani: good thinking, thanks :) [16:41:52] 10Ops-Access-Requests, 6operations: Create new puppet group `discovery-analytics-deploy` - https://phabricator.wikimedia.org/T122620#1952407 (10EBernhardson) [16:42:15] <_joe_> papaul: thanks :)) [16:42:18] ok, its running [16:42:31] !log thcipriani@tin Synchronized php-1.27.0-wmf.11/extensions/MobileFrontend/MobileFrontend.php: SWAT: Use TitleSquidURLs hook to purge mobile URLs directly Part II [[gerrit:265486]] (duration: 01m 28s) [16:42:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:42:42] !log batch-converting m4-master (log) tables from innodb to tokudb [16:42:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:43:08] ^ bblack should be sync'd out to mediawiki.org and testwikis now (should roll to more before eod) [16:43:17] ottomata, I've left it on a screen on db1046 [16:43:25] 10Ops-Access-Requests, 6operations: Create new puppet group `discovery-analytics-deploy` - https://phabricator.wikimedia.org/T122620#1909028 (10EBernhardson) based on @faidon and @ottomata suggestions i've changed the named of the puppet group to analytics-search-users in the request, i also adjusted the usern... [16:43:27] in case you want to check its progress [16:43:58] you can also see it with mysql's SHOW PROCESSLIST and compare it to the list of conversions on /srv/tmp [16:44:03] thcipriani: it should affect group1 now, or not? [16:44:12] 10Ops-Access-Requests, 6operations: Create new puppet group `discovery-analytics-deploy` - https://phabricator.wikimedia.org/T122620#1952426 (10EBernhardson) [16:44:25] bblack: no, we rolled back group1 late yesterday [16:44:29] oh! [16:44:47] ok, I'm not sure if I can really verify on group0, but will see what I can figure out :) [16:44:52] :) [16:45:03] hmm there's an m.wikimedia.org, will see there [16:45:10] I mean m.mediawiki.org [16:46:08] thcipriani: confirmed, works right on group0 :) [16:46:31] 63 RxURL c /wiki/Manual:System_administration [16:46:31] 63 RxHeader c Host: www.mediawiki.org [16:46:31] 63 RxURL c /w/index.php?title=Manual:System_administration&action=history [16:46:32] bblack: cool, thanks for checking. Should be rolled out to group1 sometime today (if logs are looking better) [16:46:34] 63 RxHeader c Host: www.mediawiki.org [16:46:37] 63 RxURL c /wiki/Manual:System_administration [16:46:39] 63 RxHeader c Host: m.mediawiki.org [16:46:42] 63 RxURL c /w/index.php?title=Manual:System_administration&action=history [16:46:45] 63 RxHeader c Host: m.mediawiki.org [16:46:47] ok, thanks [16:47:14] kart_: never made a new table as part of SWAT, is there a .sql file for this? [16:49:05] kart_: the sql/parallel-corpora.sql one, guessing? [16:49:17] thcipriani: yes. ContentTranslation/sql/parallel-corpora.sql [16:49:19] yes. 
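(On bblack's varnishlog check pasted above: for anyone wanting to repeat that verification on a cache host, a minimal sketch, assuming the varnish 3 CLI in use at the time — the title filter is illustrative, not the exact invocation used:)

```bash
# Watch purge traffic on a cache host and confirm that both the desktop and
# the m. Host headers get purged for the same title (varnish 3 syntax).
varnishlog -c -m 'RxRequest:PURGE' | grep -A2 'System_administration'
```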
[16:49:23] Thanks :) [16:49:40] thcipriani: note that it will go to Wikishared DB [16:50:22] wikishared. [16:52:52] thcipriani: and there is, https://phabricator.wikimedia.org/T120815#1948202 [16:52:59] Please check it. [16:53:41] 7Blocked-on-Operations, 6operations, 10Wikipedia-iOS-App-Product-Backlog: Provide access to iOS team for piwik production server - https://phabricator.wikimedia.org/T124218#1952441 (10Fjalapeno) Thanks! [16:53:53] kart_: does this command look right? mwscript sql.php --wiki=aawiki --wikidb=wikishared /srv/mediawiki-staging/php-1.27.0-wmf.11/extensions/ContentTranslation/sql/parallel-corpora.sql [16:55:09] --cluster extension1 too. [16:55:09] (03CR) 10Alexandros Kosiaris: [C: 04-1] Move portals into generic sites.pp (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/264978 (https://phabricator.wikimedia.org/T86644) (owner: 10Alex Monk) [16:55:20] (03CR) 10Alexandros Kosiaris: "Change looks good, comment inline" [puppet] - 10https://gerrit.wikimedia.org/r/264978 (https://phabricator.wikimedia.org/T86644) (owner: 10Alex Monk) [16:56:15] thcipriani: possible to use sql.php --cluster extension1 and go to wikishared and source the sql there? [16:56:16] kart_: is extension1 the name of wikishared for beta? That's the impression that I get from that ticket you posted... [16:56:42] thcipriani: extension1 is the cluster, wikishared is the DB. [16:57:18] thcipriani: see: https://phabricator.wikimedia.org/T120815#1948202 [16:58:39] jynus: ^^ [16:59:21] kart_: I'm just going to source the script from: `sql wikishared` seems like the right thing [16:59:42] you are asking the wrong man, I know almost nothing about mediawiki [16:59:55] jynus: hello, friend [17:00:03] hi, nuria [17:00:04] akosiaris mutante: Dear anthropoid, the time has come. Please deploy Puppet SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160121T1700). [17:00:04] Krenair: A patch you scheduled for Puppet SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [17:00:16] jynus: there is a bit of SQL involved :) [17:00:22] jynus: could you please let me know where i can run "bash check_eventlogging_lag.sh dbstore1002"? [17:00:44] thcipriani: yes. make sure we're in wikishared, that's it. [17:00:53] kk, doing. [17:01:02] Krenair: there's a -1 from me on https://gerrit.wikimedia.org/r/#/c/264978 for a small issue, otherwise the change looks good [17:01:05] mobrovac: around ? [17:01:14] nuria, it's a horrible 5-minute script that I can share if you do not judge me very badly [17:01:16] it's graphoid time [17:01:20] kart_: blerg: The MariaDB server is running with the --read-only option so it cannot execute this statement [17:01:25] jynus: no i would love it, REALLY [17:01:29] akosiaris: yup, in the meeting you're not in :D [17:01:47] ok [17:01:51] I am not sure you have access to the master, but let me at least share the code, we can arrange how to run it later [17:01:54] thcipriani: are you in the extension1 cluster and wikishared DB? [17:02:12] the idea would be to put that in an alert, nuria, when properly done [17:02:37] but let me at least commit it to operations/software:dbtools [17:02:44] !log rebooting labvirt1008 [17:02:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:03:42] thcipriani: still the same? [17:04:27] kart_: yeah, still looking at it.
I'll try the command you posted with wikidb=wikishared [17:05:30] jynus: excellent, that way we have a "synchronized" way to check lag [17:06:09] thcipriani: sql.php --cluster extension1 and use wikishared; and try source parallel-corpora.sql [17:06:36] jynus: let me know when it is committed, i do not receive alerts from that repo [17:06:39] if this still has a permission issue, we need Ops/DB. [17:06:41] jynus: thank you [17:08:31] (03CR) 10Giuseppe Lavagetto: [C: 032] lvs: switch all of ulsfo to use etcd for pybal config [puppet] - 10https://gerrit.wikimedia.org/r/265505 (owner: 10Giuseppe Lavagetto) [17:08:38] (03PS2) 10Giuseppe Lavagetto: lvs: switch all of ulsfo to use etcd for pybal config [puppet] - 10https://gerrit.wikimedia.org/r/265505 [17:08:44] (03CR) 10Giuseppe Lavagetto: "https://puppet-compiler.wmflabs.org/1642/lvs4001.ulsfo.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/265505 (owner: 10Giuseppe Lavagetto) [17:09:18] (03PS1) 10Jcrespo: [WIP] Quick and dirty script to check lag @ eventlogging schema [software] - 10https://gerrit.wikimedia.org/r/265509 [17:09:28] kart_: still no luck. I'm going to try running the command I posted above with the --cluster extension1 option. [17:09:33] (03CR) 10Giuseppe Lavagetto: [V: 032] lvs: switch all of ulsfo to use etcd for pybal config [puppet] - 10https://gerrit.wikimedia.org/r/265505 (owner: 10Giuseppe Lavagetto) [17:10:02] nuria, https://gerrit.wikimedia.org/r/#/c/265509/1 [17:10:20] thcipriani: is it a permission issue? :/ [17:11:36] kart_: only when using the sql option and trying to source the file. mwscript hasn't worked with any incantations I've tried. [17:11:38] are you running it on the master? [17:12:06] I remember a recent patch that by default points to a slave, and all of ours are read-only [17:12:47] <_joe_> !log restarting pybal on the main balancers in ulsfo to consume from etcd [17:12:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:12:54] what are you guys trying to do with the sql scripts exactly? [17:12:56] jynus: I ran `sql wikishared` on tin. Tried to source a .sql file and it gave me "The MariaDB server is running with the --read-only option so it cannot execute this statement", so I'd guess not :) [17:13:03] ottomata: hey, any reason to not install kafkacat on bastions? [17:13:16] thcipriani, then that is a slave, not a master [17:13:18] Krenair: just trying to run /srv/mediawiki-staging/php-1.27.0-wmf.11/extensions/ContentTranslation/sql/parallel-corpora.sql for the wikishared db [17:13:20] thcipriani, `sql --write wikishared` [17:13:52] (03PS6) 10Alex Monk: Move portals into generic sites.pp [puppet] - 10https://gerrit.wikimedia.org/r/264978 (https://phabricator.wikimedia.org/T86644) [17:13:53] Krenair is here. Good! [17:14:21] or `mwscript sql.php testwiki --cluster extension1 --wikidb wikishared extensions/ContentTranslation/sql/parallel-corpora.sql` or something [17:14:45] Krenair: thank you! kart_ got it with sql --write wikishared and sourcing the script. [17:14:56] good that I have them as read-only, so you would not try to drift our slaves :-) [17:15:32] jynus: indeed. [17:15:45] jynus: --bow-to-jynus is the hidden option too. [17:15:47] :D [17:16:21] ha [17:16:23] jynus: we can use your script and create an alarm if you want, does that sound good? [17:16:30] cc ottomata [17:16:50] nuria, if you can polish that, I would personally make sure to deploy it [17:16:56] :-) [17:17:00] thcipriani: jynus Krenair thanks.
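(For reference, the sequence that finally worked for the wikishared table creation, recapping the commands quoted above as run from /srv/mediawiki-staging on tin; the one-shot mwscript form is the variant Krenair suggested, with his own "or something" hedge intact:)

```bash
# open a writable connection to the shared extension1 DB (the slaves are read-only)
sql --write wikishared
# then, at the mysql prompt:
#   source /srv/mediawiki-staging/php-1.27.0-wmf.11/extensions/ContentTranslation/sql/parallel-corpora.sql

# roughly equivalent one-shot form:
mwscript sql.php testwiki --cluster extension1 --wikidb wikishared \
    extensions/ContentTranslation/sql/parallel-corpora.sql
```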
[17:17:03] jynus: sounds great, send it our way [17:17:28] kart_: ok, ready to deploy your final patch, then? [17:17:30] jynus: cc ottomata, elukey and myself on your commit, thank you! [17:17:41] thcipriani: yes. Table looks OK. [17:17:45] kk [17:17:49] (03CR) 10Jcrespo: [C: 032] [WIP] Quick and dirty script to check lag @ eventlogging schema [software] - 10https://gerrit.wikimedia.org/r/265509 (owner: 10Jcrespo) [17:17:56] (03CR) 10Jcrespo: [V: 032] [WIP] Quick and dirty script to check lag @ eventlogging schema [software] - 10https://gerrit.wikimedia.org/r/265509 (owner: 10Jcrespo) [17:18:25] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265459 (https://phabricator.wikimedia.org/T119617) (owner: 10KartikMistry) [17:18:49] (03Merged) 10jenkins-bot: Enable ContentTranslationCorpora [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265459 (https://phabricator.wikimedia.org/T119617) (owner: 10KartikMistry) [17:19:18] akosiaris: i'll be ready to go in 10 mins [17:19:49] ^I've added you but merged, because for the alarm it should be improved and committed to operations/puppet, not there [17:20:18] (03CR) 10Jcrespo: "Were you notified of the commit?" [software] - 10https://gerrit.wikimedia.org/r/265509 (owner: 10Jcrespo) [17:20:25] jynus: ticket created: https://phabricator.wikimedia.org/T124306 [17:21:11] 6operations, 10ops-eqiad, 6Labs, 10Labs-Infrastructure: Locate and assign some MD1200 shelves for proper testing of labstore1002 - https://phabricator.wikimedia.org/T101741#1952507 (10coren) a:5coren>3None [17:21:57] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge. [17:22:18] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. [17:22:32] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Enable ContentTranslationCorpora Part I [[gerrit:265459]] (duration: 01m 28s) [17:22:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:24:12] !log thcipriani@tin Synchronized wmf-config/CommonSettings.php: SWAT: Enable ContentTranslationCorpora Part II [[gerrit:265459]] (duration: 01m 28s) [17:24:15] ^ kart_ check please [17:24:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:24:22] OK. [17:24:33] The table should start getting some data in a while. [17:24:37] 6operations, 10MediaWiki-General-or-Unknown, 10MobileFrontend-Feature-requests, 10Traffic, and 3 others: Fix mobile purging - https://phabricator.wikimedia.org/T124165#1952512 (10BBlack) ^ So the fix is in 1.27.0-wmf.11, which is on group0 so far. When it reaches group1 and group2 as well, we can resolve...
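(The lag-check script just merged in gerrit 265509 isn't pasted in the log, so purely as an illustration of the idea discussed above — comparing the newest EventLogging row timestamps between the m4 master and a replica. Every name below is an assumption for the sketch, not Jcrespo's actual code:)

```bash
#!/bin/bash
# Hypothetical sketch of an eventlogging lag check -- NOT the contents of
# check_eventlogging_lag.sh. Assumes EL tables carry a 14-digit
# MediaWiki-style `timestamp` column.
table='log.Edit_13457736'          # illustrative schema table name
for host in m4-master dbstore1002; do
  ts=$(mysql -h "$host" -BN -e "SELECT MAX(timestamp) FROM ${table};")
  echo "${host}: latest row at ${ts}"
done
# lag ~ the difference between the master's and the replica's MAX(timestamp)
```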
[17:25:18] PROBLEM - puppet last run on db2061 is CRITICAL: CRITICAL: puppet fail [17:29:00] (03PS1) 10Alexandros Kosiaris: graphoid: apply the role on scb [puppet] - 10https://gerrit.wikimedia.org/r/265511 [17:29:02] (03PS1) 10Alexandros Kosiaris: graphoid: update LVS/conftool configuration [puppet] - 10https://gerrit.wikimedia.org/r/265512 [17:30:26] !log disabled puppet and salt-minion on sca1001, sca1002 for graphoid upgrade [17:30:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:31:07] RECOVERY - Last backup of the tools filesystem on labstore1001 is OK: OK - Last run for unit replicate-tools was successful [17:31:41] ^this actually did fail but that's a bogus recovery so hooray icinga :) [17:32:09] 6operations, 10Analytics, 10DBA: Improve eventlogging replication procedure - https://phabricator.wikimedia.org/T124307#1952524 (10jcrespo) 3NEW [17:32:55] (03CR) 10Alexandros Kosiaris: [C: 032] Move portals into generic sites.pp [puppet] - 10https://gerrit.wikimedia.org/r/264978 (https://phabricator.wikimedia.org/T86644) (owner: 10Alex Monk) [17:32:59] (03CR) 10Alexandros Kosiaris: [C: 032] graphoid: apply the role on scb [puppet] - 10https://gerrit.wikimedia.org/r/265511 (owner: 10Alexandros Kosiaris) [17:33:09] (03PS7) 10Alexandros Kosiaris: Move portals into generic sites.pp [puppet] - 10https://gerrit.wikimedia.org/r/264978 (https://phabricator.wikimedia.org/T86644) (owner: 10Alex Monk) [17:33:12] (03CR) 10Alexandros Kosiaris: [V: 032] Move portals into generic sites.pp [puppet] - 10https://gerrit.wikimedia.org/r/264978 (https://phabricator.wikimedia.org/T86644) (owner: 10Alex Monk) [17:33:34] 6operations, 10Analytics, 10DBA: Improve eventlogging replication procedure - https://phabricator.wikimedia.org/T124307#1952534 (10jcrespo) [17:36:48] PROBLEM - salt-minion processes on sca1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [17:36:58] PROBLEM - salt-minion processes on sca1002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [17:38:05] known ^ [17:38:17] ACKNOWLEDGEMENT - salt-minion processes on sca1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion alexandros kosiaris graphoid migration to scb ongoing [17:38:17] ACKNOWLEDGEMENT - salt-minion processes on sca1002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion alexandros kosiaris graphoid migration to scb ongoing [17:39:39] PROBLEM - puppet last run on mw2154 is CRITICAL: CRITICAL: Puppet has 1 failures [17:39:57] PROBLEM - puppet last run on mw1187 is CRITICAL: CRITICAL: Puppet has 1 failures [17:39:58] PROBLEM - puppet last run on mw1224 is CRITICAL: CRITICAL: Puppet has 1 failures [17:40:47] PROBLEM - puppet last run on mw2180 is CRITICAL: CRITICAL: Puppet has 1 failures [17:40:48] PROBLEM - puppet last run on mw2148 is CRITICAL: CRITICAL: Puppet has 1 failures [17:41:17] PROBLEM - puppet last run on mw2185 is CRITICAL: CRITICAL: Puppet has 1 failures [17:41:27] PROBLEM - puppet last run on mw1223 is CRITICAL: CRITICAL: Puppet has 1 failures [17:41:27] PROBLEM - puppet last run on mw1071 is CRITICAL: CRITICAL: Puppet has 1 failures [17:41:27] PROBLEM - puppet last run on mw1209 is CRITICAL: CRITICAL: Puppet has 1 failures [17:41:48] PROBLEM - puppet last run on mw1099 is CRITICAL: CRITICAL: Puppet has 1 failures [17:41:57] PROBLEM - puppet last run on mw1176 is CRITICAL: CRITICAL: Puppet has 1 
failures [17:43:17] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:43:53] (03PS10) 10Subramanya Sastry: Add the visualdiff module + instantiate visualdiffing services [puppet] - 10https://gerrit.wikimedia.org/r/264032 (https://phabricator.wikimedia.org/T118778) [17:44:35] thcipriani: still around? [17:44:41] kart_: yup [17:44:46] (03PS1) 10KartikMistry: Really enable ContentTranslationCorpora [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265514 [17:44:54] thcipriani: I made a mistake in the config patch :/ Set it to false instead of true. [17:45:00] (03CR) 10jenkins-bot: [V: 04-1] Add the visualdiff module + instantiate visualdiffing services [puppet] - 10https://gerrit.wikimedia.org/r/264032 (https://phabricator.wikimedia.org/T118778) (owner: 10Subramanya Sastry) [17:45:11] So, can you deploy 265514 please? [17:46:11] kart_: sure, lemme get this in before the train. [17:46:45] Thanks! [17:46:47] (03CR) 10Thcipriani: [C: 032] Really enable ContentTranslationCorpora [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265514 (owner: 10KartikMistry) [17:46:59] (03PS1) 10Giuseppe Lavagetto: role::deployment: add a warning on the inactive server [puppet] - 10https://gerrit.wikimedia.org/r/265515 [17:47:12] (03Merged) 10jenkins-bot: Really enable ContentTranslationCorpora [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265514 (owner: 10KartikMistry) [17:48:24] (03PS11) 10Subramanya Sastry: Add the visualdiff module + instantiate visualdiffing services [puppet] - 10https://gerrit.wikimedia.org/r/264032 (https://phabricator.wikimedia.org/T118778) [17:48:34] !log add scb1001, scb1002 in pybal graphoid config [17:48:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:49:06] (03PS2) 10Alexandros Kosiaris: graphoid: update LVS/conftool configuration [puppet] - 10https://gerrit.wikimedia.org/r/265512 [17:49:13] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] graphoid: update LVS/conftool configuration [puppet] - 10https://gerrit.wikimedia.org/r/265512 (owner: 10Alexandros Kosiaris) [17:49:37] (03CR) 10jenkins-bot: [V: 04-1] Add the visualdiff module + instantiate visualdiffing services [puppet] - 10https://gerrit.wikimedia.org/r/264032 (https://phabricator.wikimedia.org/T118778) (owner: 10Subramanya Sastry) [17:49:53] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Really enable ContentTranslationCorpora [[gerrit:265514]] (duration: 01m 29s) [17:49:56] ^ kart_ [17:49:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:50:10] thcipriani: thanks. checking. [17:51:18] RECOVERY - puppet last run on db2061 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:52:44] (03PS12) 10Subramanya Sastry: Add the visualdiff module + instantiate visualdiffing services [puppet] - 10https://gerrit.wikimedia.org/r/264032 (https://phabricator.wikimedia.org/T118778) [17:56:04] akosiaris, was the portal change applied?
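(A flipped boolean like the one above is cheap to double-check from tin after the sync. A sketch only — the variable name below is a guess for illustration, not taken from the actual patch:)

```bash
# confirm the config flag took effect on one target wiki after the sync
# ($wgContentTranslationCorpora is an assumed name)
echo 'var_dump( $wgContentTranslationCorpora );' | mwscript eval.php --wiki=cswiki
```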
[17:56:12] Krenair: yes [17:56:30] well, on some hosts I checked [17:56:35] it was a one line change btw [17:56:47] +++ /tmp/puppet-file20160121-9518-v5ylck 2016-01-21 17:41:41.922700070 +0000 [17:56:48] @@ -1,4 +1,5 @@ [17:56:48] [17:56:48] + [17:56:49] ServerName wikipedia.org [17:57:01] not sure where that came from yet, will look into it later [17:57:31] !log depool sca1001,sca1002 for graphoid pybal config [17:57:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:57:56] thcipriani: thanks. We've data!! [17:58:17] kart_: nice! thanks for checking. [17:59:50] akosiaris, so prod is fine? there's a problem in beta [18:00:05] Krenair: yes, prod seems fine [18:00:25] mobrovac: mobileapps complaining on scb1001 [18:00:51] looking [18:00:56] Krenair: what kind of problem ? [18:01:01] puppet errors [18:01:19] actually, I got a migration going on, will look into it in a few [18:01:41] Ohh. [18:01:52] I had an earlier version of the patch on the beta puppetmaster [18:02:04] It got into a conflict trying to merge [18:04:13] for the record in here, context is "what should we do with the train after yesterday's rollback?": [18:04:16] 18:01 < greg-g> I think we should move ahead with the train (ie: roll out to group1, wait a bit, then roll out to wikipedis, as scheduled) [18:04:25] Krenair: a ok, good to know [18:04:31] (03PS1) 10Alexandros Kosiaris: citoid: Apply the role on scb [puppet] - 10https://gerrit.wikimedia.org/r/265517 [18:04:33] (03PS1) 10Alexandros Kosiaris: citoid: update LVS/conftool configuration [puppet] - 10https://gerrit.wikimedia.org/r/265518 [18:04:59] RECOVERY - puppet last run on mw2154 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [18:05:09] RECOVERY - puppet last run on mw1223 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:05:28] RECOVERY - puppet last run on mw2185 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:05:29] RECOVERY - puppet last run on mw1187 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:05:29] RECOVERY - puppet last run on mw1224 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:05:49] RECOVERY - puppet last run on mw1209 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:05:49] RECOVERY - puppet last run on mw2148 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:06:16] (03CR) 10Alexandros Kosiaris: [C: 032] citoid: Apply the role on scb [puppet] - 10https://gerrit.wikimedia.org/r/265517 (owner: 10Alexandros Kosiaris) [18:06:29] RECOVERY - puppet last run on mw1071 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [18:06:38] RECOVERY - puppet last run on mw2180 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:06:39] RECOVERY - puppet last run on mw1099 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:06:58] RECOVERY - puppet last run on mw1176 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:07:28] !log rebooting labvirt1001 [18:07:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:09:29] (03PS1) 10Rush: diamond: nfsd grab /proc/fs/nfsd/pool_stats as well [puppet] - 10https://gerrit.wikimedia.org/r/265519 [18:10:38] (03CR) 10jenkins-bot: [V: 04-1] diamond: nfsd grab /proc/fs/nfsd/pool_stats as well [puppet] - 
10https://gerrit.wikimedia.org/r/265519 (owner: 10Rush) [18:11:26] greg-g, thcipriani: is swat still happening or can we start the train early? [18:11:41] marxarelli: SWAT is complete. [18:11:59] PROBLEM - Host ores.wmflabs.org is DOWN: PING CRITICAL - Packet loss = 100% [18:12:03] might be good to have extra time to assess group1 before promoting to al [18:12:04] all [18:12:08] word [18:13:05] I wanted to do some depools for some data migration, but I think I will wait [18:13:16] PROBLEM - tools-home on tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 20 seconds [18:13:33] 7Blocked-on-Operations, 6operations, 10RESTBase, 10hardware-requests: Expand SSD space in Cassandra cluster - https://phabricator.wikimedia.org/T121575#1952659 (10RobH) [18:14:15] (03CR) 1020after4: "Regarding "Update MediaWiki core to output hashes in static urls."" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/263566 (https://phabricator.wikimedia.org/T99096) (owner: 10Krinkle) [18:16:10] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [18:18:00] (03CR) 10Mobrovac: [C: 04-1] "One detail in the systemd service file left to deal with." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/264032 (https://phabricator.wikimedia.org/T118778) (owner: 10Subramanya Sastry) [18:18:00] RECOVERY - Host ores.wmflabs.org is UP: PING OK - Packet loss = 0%, RTA = 1.91 ms [18:20:30] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: /api (Zotero alive) is CRITICAL: Test Zotero alive returned the unexpected status 404 (expecting: 200) [18:20:54] anomie: is "Failed to write session data (user)." for wmf.11 something to do with your backported fix yesterday? [18:21:22] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: /api (Zotero alive) is CRITICAL: Test Zotero alive returned the unexpected status 404 (expecting: 200) [18:21:38] RECOVERY - tools-home on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 982305 bytes in 3.210 second response time [18:21:54] anomie: seeing that in fatalmonitor [18:22:33] marxarelli: https://gerrit.wikimedia.org/r/#/c/265480/ [18:22:52] should fix it [18:23:42] anomie: alright. i'll cherrypick that [18:25:23] (03PS1) 10Andrew Bogott: Send broken-puppet nags to admins of all projects! 
[puppet] - 10https://gerrit.wikimedia.org/r/265522 [18:25:44] 6operations, 10Analytics, 10DBA: Improve eventlogging replication procedure - https://phabricator.wikimedia.org/T124307#1952723 (10Milimetric) p:5Triage>3Normal [18:28:48] (03PS1) 10Alex Monk: Delete config.beta.wmflabs.org [puppet] - 10https://gerrit.wikimedia.org/r/265525 [18:28:50] (03PS1) 10Alex Monk: Fix upload.beta.wmflabs.org docroot path [puppet] - 10https://gerrit.wikimedia.org/r/265526 [18:29:00] (03PS1) 10Alexandros Kosiaris: scb: Update the realserver_ips [puppet] - 10https://gerrit.wikimedia.org/r/265527 [18:29:06] (03CR) 10Subramanya Sastry: Add the visualdiff module + instantiate visualdiffing services (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/264032 (https://phabricator.wikimedia.org/T118778) (owner: 10Subramanya Sastry) [18:29:38] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] scb: Update the realserver_ips [puppet] - 10https://gerrit.wikimedia.org/r/265527 (owner: 10Alexandros Kosiaris) [18:31:05] (03PS13) 10Subramanya Sastry: Add the visualdiff module + instantiate visualdiffing services [puppet] - 10https://gerrit.wikimedia.org/r/264032 (https://phabricator.wikimedia.org/T118778) [18:31:50] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [18:32:41] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [18:34:00] !log pool scb1001, scb1002 for citoid [18:34:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:39:47] !log depool sca1001, sca1002 for citoid [18:39:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:41:26] !log enable puppet and salt-minion on sca100{1,2}.eqiad.wmnet [18:41:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:43:29] !log restbase disabling puppet in prod for testing firejail in staging [18:43:29] just got an ssh host verification failure for mw2020.codfw.wmnet [18:43:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:43:43] was that host re-imaged or something? [18:43:44] !log dduvall@tin Synchronized php-1.27.0-wmf.11/includes/session/PHPSessionHandler.php: deploy follow-up warning fix for T124126 (duration: 01m 28s) [18:43:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:44:45] 6operations, 10Traffic: Forward-port Varnish 3 patches to Varnish 4 - https://phabricator.wikimedia.org/T124277#1952810 (10ema) Patches marked as forward-ported are available on Varnish 4 WMF repo on Gerrit: https://gerrit.wikimedia.org/r/#/admin/projects/operations/debs/varnish4 - [X] 0010-varnishd-cache_dir... 
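(Re the mw2020 host key verification failure above: if the reimage theory holds, the standard triage from a client that keeps its own known_hosts looks roughly like the following; in production the host keys are centrally managed, so this is only a sketch:)

```bash
ssh-keygen -F mw2020.codfw.wmnet   # inspect the cached host key entry
ssh-keygen -R mw2020.codfw.wmnet   # drop it if the host really was reimaged
ssh mw2020.codfw.wmnet true        # reconnect and verify the new key
```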
[18:44:52] 6operations, 6Services: Migrate SCA cluster to SCB (Jessie and Node 4.2) - https://phabricator.wikimedia.org/T96017#1952813 (10mobrovac) [18:45:23] 6operations, 6Services: Migrate SCA cluster to SCB (Jessie and Node 4.2) - https://phabricator.wikimedia.org/T96017#1206310 (10mobrovac) [18:46:38] (03PS2) 10Alexandros Kosiaris: citoid: update LVS/conftool configuration [puppet] - 10https://gerrit.wikimedia.org/r/265518 [18:46:53] !log rebooting labvirt1002 [18:46:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:47:06] (03PS1) 10Mobrovac: RESTBase: Enable firejail [puppet] - 10https://gerrit.wikimedia.org/r/265531 [18:47:14] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] citoid: update LVS/conftool configuration [puppet] - 10https://gerrit.wikimedia.org/r/265518 (owner: 10Alexandros Kosiaris) [18:47:20] !log 4 apache sync failures during sync-file, appear to be known issues [18:47:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:48:48] (03PS1) 10Alex Monk: Move apache includes into generic sites.pp [puppet] - 10https://gerrit.wikimedia.org/r/265532 (https://phabricator.wikimedia.org/T86644) [18:49:08] (03PS2) 10Alexandros Kosiaris: RESTBase: Enable firejail [puppet] - 10https://gerrit.wikimedia.org/r/265531 (owner: 10Mobrovac) [18:49:16] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] RESTBase: Enable firejail [puppet] - 10https://gerrit.wikimedia.org/r/265531 (owner: 10Mobrovac) [18:49:51] !log sync to mw2020 failed due to failed host key verification, mw2087/mw2039/mw2098 due to connection failed [18:51:20] !log starting train promotion of group1 to 1.27.0-wmf.11 [18:52:48] er, morebots? [18:52:53] !log sync to mw2020 failed due to failed host key verification, mw2087/mw2039/mw2098 due to connection failed [18:52:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:53:06] there ya are, buddy [18:53:08] !log starting train promotion of group1 to 1.27.0-wmf.11 [18:53:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:54:23] (03PS1) 10Dduvall: group1 wikis to 1.27.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265536 [18:55:16] (03CR) 10Dduvall: [C: 032] group1 wikis to 1.27.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265536 (owner: 10Dduvall) [18:55:48] (03Merged) 10jenkins-bot: group1 wikis to 1.27.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265536 (owner: 10Dduvall) [18:56:05] krenair@mw2020.codfw.wmnet's password: [18:56:06] Ummmm. [18:56:33] Krenair: that host just failed for me during sync-file as well [18:56:40] yeah, that's why I went to take a look [18:56:43] host key verification fail [18:56:48] Maybe someone in ops is reinstalling it? [18:57:10] I don't see a note in SAL about it until yours [18:57:11] !log dduvall@tin rebuilt wikiversions.php and synchronized wikiversions files: group1 wikis to 1.27.0-wmf.11 [18:57:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:57:41] Krenair: ran into it during SWAT, bblack mentioned that it was in a bad state somehow. [18:58:26] anomie, tgr: heads-up, group1 has been promoted [18:58:52] so in this train window, we're doing group1+group2 -> wmf.11? [18:59:02] group1 first [18:59:27] then wait/verify, then all [18:59:32] bblack: group1 first, wait/watch.... yeah [18:59:33] is the plan anyway [18:59:49] ok thanks [19:00:04] marxarelli: Dear anthropoid, the time has come.
Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160121T1900). [19:00:17] silly robot [19:04:04] * anomie tests logins to wikidata.org via web and API, and succeeds [19:06:27] (03PS1) 10Alexandros Kosiaris: WIP: DONT MERGE. cleanup SCA from *oid services [puppet] - 10https://gerrit.wikimedia.org/r/265541 [19:08:40] seeing some new fatal errors [19:08:45] CentralAuth/session related [19:08:48] "Fatal error: Cannot call abstract method MediaWiki\Session\SessionProvider::provideSessionInfo() in /srv/mediawiki/php-1.27.0-wmf.11/extensions/CentralAuth/includes/session/CentralAuthTokenSessionProvider.php on line 107" [19:09:04] anomie, tgr ^ [19:09:08] csteipp, tgr ^ [19:09:13] * anomie looks [19:10:18] (03PS2) 10Chad: ganglia diskstat.py: pep8 fixes all over the place [puppet] - 10https://gerrit.wikimedia.org/r/264997 [19:10:51] marxarelli: https://gerrit.wikimedia.org/r/265542 should fix that [19:11:09] anomie: kk [19:13:58] !log rebooting labvirt1003 [19:14:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:17:41] PROBLEM - Host www.toolserver.org is DOWN: PING CRITICAL - Packet loss = 100% [19:17:43] (03PS2) 10BBlack: Add iridium-vcs.eqiad.wmnet ipv6 to phab puppetization [puppet] - 10https://gerrit.wikimedia.org/r/265493 (https://phabricator.wikimedia.org/T100519) [19:18:14] https://grafana.wikimedia.org/dashboard/db/authentication-metrics?panelId=13&fullscreen [19:18:21] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/). [19:18:58] tgr: need_token errors again? [19:19:10] yeah [19:19:34] what would be a good place to ask bot operators about it? #commons? [19:19:40] still only on group1, right? [19:19:44] right [19:20:14] (03CR) 10BBlack: [C: 032] Add iridium-vcs.eqiad.wmnet ipv6 to phab puppetization [puppet] - 10https://gerrit.wikimedia.org/r/265493 (https://phabricator.wikimedia.org/T100519) (owner: 10BBlack) [19:20:38] commons would be one good place, yeah [19:21:14] (03CR) 10BBlack: [C: 032] Add IPv6 for iridium-vcs.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/265492 (https://phabricator.wikimedia.org/T100519) (owner: 10BBlack) [19:22:17] !log restbase re-enabling puppet in prod [19:22:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:24:56] !log restbase rolling-restart after firejail inclusion [19:25:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:25:36] (03PS1) 10Alex Monk: beta: Remove deployment.wmflabs.org VHost that doesn't actually resolve [puppet] - 10https://gerrit.wikimedia.org/r/265548 [19:25:51] RECOVERY - Host www.toolserver.org is UP: PING OK - Packet loss = 0%, RTA = 1.80 ms [19:31:54] !log dduvall@tin Synchronized php-1.27.0-wmf.11/extensions/CentralAuth/includes/session/CentralAuthTokenSessionProvider.php: deploy https://gerrit.wikimedia.org/r/#/c/265545/ for 1.27.0-wmf.11 (duration: 01m 28s) [19:31:56] !log rebooting labvirt1004 [19:31:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:32:00] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. 
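(After a promotion like the one just logged, the dbname-to-version mapping can be spot-checked straight from the synced wikiversions data; the staging path on tin is assumed here:)

```bash
# which version is a given wiki pinned to now?
python -m json.tool /srv/mediawiki-staging/wikiversions.json | grep cswiktionary
# expected after the group1 promotion: "cswiktionary": "php-1.27.0-wmf.11"
```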
[19:32:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:37:28] marxarelli: anomie tgr errors went down https://grafana.wikimedia.org/dashboard/db/authentication-metrics?panelId=13&fullscreen [19:37:37] greg-g: yeah, saw that [19:37:54] how's other graphs doing? [19:38:00] trend seems to follow the last sync [19:38:51] PROBLEM - Host mw2020 is DOWN: PING CRITICAL - Packet loss = 100% [19:39:08] fatalmonitor looks good to my n00b eyes [19:39:39] greg-g: that's probably just grafana being stupid [19:39:50] last few pixels are not really reliable [19:41:08] anybody rolled out some new version of mw to wiktionaries? [19:41:18] in last couple mins [19:41:27] tgr: seems to be consistent over the past 10 min [19:41:33] is it that dumb? [19:41:42] did someone just puppet-merge on palladium and grab my change? [19:41:47] Danny_B: yes [19:41:52] Danny_B: 18:53 UTC, I believe. [19:42:01] RECOVERY - Host mw2020 is UP: PING OK - Packet loss = 0%, RTA = 37.08 ms [19:42:06] Danny_B: wiktionaries have a new version since 10:57 [19:42:21] err, 18:57 UTC. Yeah. [19:42:23] the behaviour changed just a few mins ago on cs wikt [19:42:36] and it does not work, please roll back [19:42:49] not possible to edit at all, since edit links don't work [19:42:51] marxarelli: I'll trust it when I start seeing the low-level noise on the right side [19:42:56] edit links don't work? [19:43:47] hitting the edit link by the headline links to #/editor/1 [19:43:51] nothing happens [19:43:59] besides edit links now look different [19:44:06] i can edit en wiktionary [19:44:11] mobile? [19:44:19] sounds like a gadget issue [19:44:20] I can edit on en.wiktionary [19:44:21] they are of the size of the header [19:44:47] not as small as they used to be, neither enclosed in brackets [19:44:49] is said cs wiktionary [19:44:56] s/is/i [19:45:29] mobile edit interface does look weird, no idea if that's new though [19:45:38] not mobile, desktop [19:45:44] cs also works; http://imgur.com/msfOPSm [19:45:54] can edit there also [19:46:04] aude: you started the page [19:46:11] you did not try to edit the section [19:46:19] Danny_B: i edited the section also [19:46:20] tgr: you were right. just spiked again [19:46:41] "wgHostname":"mw1103" if that helps [19:46:45] Danny_B: i believe you though [19:47:31] interesting though i was logged in on en.wiktionary [19:47:41] then apparently got logged out / not logged in on cs.wiktionary [19:47:56] marxarelli: we have some kind of statsd buffering proxy I think [19:47:58] marxarelli: can you rollback cs wiktionary to see if that fixes Danny_B's issue? [19:48:09] greg-g: sure thing [19:48:29] (03CR) 10jenkins-bot: [V: 04-1] Bump kernel version in jessie base image [puppet] - 10https://gerrit.wikimedia.org/r/265500 (owner: 10Andrew Bogott) [19:49:03] * anomie is trying to debug why logging in isn't working on cswikt, but rolling back will probably screw that up [19:49:06] weird that i am not logged in there, but am when i go to wikivoyage, commons, etc [19:50:17] also afrikaans wiktionary, danish, ... [19:50:18] i'm creating the sshot for ya guys [19:50:24] but logged in on spanish wiktionary [19:50:29] (03PS3) 10Andrew Bogott: Bump kernel version in Jessie base image [puppet] - 10https://gerrit.wikimedia.org/r/265500 [19:50:45] greg-g: anomie is debugging. rollback, er... ? [19:51:33] aude: any luck debugging?
[19:51:35] aude: re logging: if wikitech was also changed the mw version, then i submitted some bug which may be relevant, it's marked as secure though [19:51:42] so idk if you can see it [19:51:44] greg-g: no [19:51:51] on some wiktionaries i'm not, on some i am [19:51:53] Danny_B: which bug? [19:52:03] marxarelli: rollback [19:52:07] kk [19:52:08] don't know if they are on different db servers / groups [19:52:09] just cswiki for now [19:52:37] arabic logged in (and maybe already had account there) [19:52:41] same with spanish [19:53:04] just a guess [19:53:43] marxarelli: Danny_B: cs.wikt edit interface looks the same as en.wiki [19:53:48] japanese, polish works [19:54:06] Did you guys create bot_passwords tables on all wikis? [19:54:06] http://imgur.com/EEnhSoc i clicked the edit by "výslovnost" and see the url in browser location bar [19:54:16] not vietnamese or portuguese [19:54:17] maybe Danny_B unexpectedly ended up on the mobile interface [19:54:17] Krenair: no they did not [19:54:29] Danny_B, clearly. but let them answer. [19:54:34] Krenair: i reported the bug [19:54:42] !log rebooting labvirt1005 [19:54:46] !log dduvall@tin rebuilt wikiversions.php and synchronized wikiversions files: rollback cswiktionary to 1.27.0-wmf.10 [19:54:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:54:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:55:06] greg-g: https://phabricator.wikimedia.org/T124335 [19:55:21] Danny_B: test again, please, we just rolled back cswiktionary [19:55:23] the ones i am logged in on already had an account [19:55:27] Krenair: should be using the central DB on meta [19:55:28] https://es.wiktionary.org/wiki/Especial:Informaci%C3%B3n_de_la_cuenta_global/Aude [19:55:32] greg-g: ok, one moment pls [19:56:26] greg-g: works as before now [19:56:42] how can i help you with debugging? what would you like me to examine? [19:57:00] wtf [19:57:15] ar.wikibooks ( no account, no login) [19:57:19] why it's only affecting cswiki I have no idea [19:58:02] greg-g: So it looks like the reason I can't log in on cswikt is because CentralAuth is suddenly creating unattached accounts. [19:58:04] greg-g: we have two outstanding issues now, mystery cswikt UI muck and need_token spike [19:58:15] if UseBotPasswords is not true, should the table be expected to exist? [19:58:17] !log mobileapps deploying 68c09e [19:58:18] No idea if that's related to any of the other problems [19:58:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:58:41] Krenair: If UseBotPasswords is false, it shouldn't need the table to exist. [19:58:55] marxarelli, anomie: https://phabricator.wikimedia.org/T74791#1953065 [19:59:10] marxarelli: tgr the need_token thing isn't resolving? :( [19:59:25] probably the central id provider issue? [19:59:41] maybe the bot password problem is the same? [20:00:15] greg-g: still no clue what that issue even is, just see a few bots trying to log in like crazy [20:00:28] I mean EnableBotPasswords [20:00:36] :/ [20:00:53] ermmm [20:00:57] unattached accounts?? [20:01:16] Krenair: are there DB errors or why do you ask? [20:01:26] tgr, https://phabricator.wikimedia.org/T124335 [20:01:48] Krenair: can you CC me on that?
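(On the bot_passwords question above, a direct existence check would settle where the table actually lives. A sketch only, assuming the central table is expected on the CentralAuth database and that the sql wrapper accepts that dbname:)

```bash
# does bot_passwords exist where BotPassword will look for it?
# ('centralauth' as the dbname is an assumption for this sketch)
echo "SHOW TABLES LIKE 'bot_passwords';" | sql centralauth
```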
[20:02:01] i can't unfortunately follow here nonstop atm, please ping me when you need further assistance or testing or when you think it's fixed, thank you [20:02:29] Danny_B: sadly you're the only one able to repro right now [20:02:29] tgr: added [20:02:43] Danny_B: I'll ping when we need more testing [20:03:21] ah, labswiki [20:03:27] greg-g: no prob, i just need to switch to a different window and work, so i may have some delays in replies, but i'll do my best [20:03:39] kk [20:05:49] why is labswiki not in the nonglobal list? [20:06:07] it is in the nonglobal list [20:06:20] therefore UseBotPasswords is set to false [20:06:26] bah, EnableBotPasswords* [20:06:54] tgr, Krenair: https://gerrit.wikimedia.org/r/265561 [20:07:13] But User::setPasswordInternal -> BotPassword::invalidateAllPasswordsForUser -> BotPassword::invalidateAllPasswordsForCentralId -> boom [20:08:50] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0; xe-5/0/2: down - Core: cr2-ulsfo:xe-1/3/0 (Zayo, OGYX/124337//ZYO) {#11541} [10Gbps DWDM] [20:09:33] going to go and have dinner, cherry-pick: https://gerrit.wikimedia.org/r/#/c/265563/ [20:10:01] will deploy it when I'm back if nobody else does [20:13:57] Krenair, anomie, tgr: i'll deploy it once the cherry-pick merges [20:13:57] I'll deploy [20:14:04] that works too [20:16:33] !log rebooting labvirt1006 [20:16:34] (03PS2) 10EBernhardson: Adjust cirrus titlesuggest index shard counts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/261287 (https://phabricator.wikimedia.org/T124332) [20:16:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:16:57] (03CR) 10EBernhardson: [C: 04-1] "These values were chosen by hand, need to update with the values calculated by the script in T124332" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/261287 (https://phabricator.wikimedia.org/T124332) (owner: 10EBernhardson) [22:35:37] PROBLEM - puppet last run on ms-be1020 is CRITICAL: CRITICAL: Puppet has 1 failures [22:36:49] https://phabricator.wikimedia.org/T124356 sounds like MF is leaking into vector? [22:41:16] Danny_B: I think your mobile/desktop choice sticks with you between wikis? [22:41:26] so you should try testing with a different user [22:41:46] * greg-g nods [22:43:41] RECOVERY - puppet last run on ganeti2001 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [22:44:00] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/). [22:44:23] (03PS1) 10Alex Monk: mediawiki: Move www.wikimedia.org portal into wwwportals [puppet] - 10https://gerrit.wikimedia.org/r/265642 [22:44:44] tgr: i never tried mobile [22:45:11] yes but the error is clearly related to that somehow [22:45:16] but ok, i'll test as anonymous [22:45:20] you are getting mobile edit links [22:46:07] it's very time consuming, as it takes a while to hit it, some automated testing would be handy [22:46:17] !log started running migratePass0.php (CentralAuth) on group1 wikis [22:46:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:47:16] why do we have remnant.conf and main.conf in apache config?
[22:49:58] tgr: i've just hit it as anonymous in a browser i have not used for months [22:50:20] thanks for checking [22:51:51] welcome [22:52:42] if it was on me i'd roll back wmf11 and do some investigation in its sources... but i am neither an op nor mgmt... ;-) [22:55:23] (03CR) 10EBernhardson: [C: 032] Recycle completion suggester indices for small wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265472 (owner: 10DCausse) [22:56:07] (03Merged) 10jenkins-bot: Recycle completion suggester indices for small wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265472 (owner: 10DCausse) [22:56:13] Danny_B: are you using a standard browser? [22:56:45] tgr: #define( "standard browser" ) [22:57:09] something mainstream like current Chrome, Firefox... [22:57:51] RECOVERY - Host mw2087 is UP: PING OK - Packet loss = 0%, RTA = 36.38 ms [22:59:08] firefox, chrome, iron, k-meleon, ie, opera, vivaldi, amaya, lynx and a bunch of others [23:01:11] RECOVERY - puppet last run on ms-be1020 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [23:03:01] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [23:05:31] PROBLEM - very high load average likely xfs on ms-be1002 is CRITICAL: CRITICAL - load average: 199.11, 142.18, 71.37 [23:06:24] (03CR) 10Mobrovac: [C: 04-1] "Minor comments in-lined. Also, the upstart conf file should probably be removed." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/265628 (owner: 10Subramanya Sastry) [23:09:30] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [23:16:07] Danny_B: erm, can you give me an example url you hit and saw it with? [23:16:31] (03PS3) 10Subramanya Sastry: Migrate parsoid::role::testing service from upstart to systemd [puppet] - 10https://gerrit.wikimedia.org/r/265628 [23:17:41] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There are 2 unmerged changes in mediawiki_config (dir /srv/mediawiki-staging/). [23:18:12] ebernhardson: ^? [23:19:29] legoktm: https://cs.wiktionary.org/wiki/leetspeak https://cs.wiktionary.org/wiki/Polsko https://sk.wiktionary.org/wiki/kostern%C3%AD https://cs.wiktionary.org/wiki/d%C5%AFle%C5%BEit%C3%BD https://cs.wikiquote.org/wiki/Jean-Marie_Adiaffi [23:20:14] thank you [23:20:45] i kept these open, but there are several i've closed and don't remember [23:21:27] Danny_B: can you send me the html for one of those pages?
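(Since the leak is intermittent and, as noted above, tedious to hit by hand, a crude anonymous probe could loop on one of the affected URLs. A sketch only — the 'mw-mf' marker for MobileFrontend markup is an assumption:)

```bash
# hammer the desktop URL anonymously; flag any response carrying mobile markup
for i in $(seq 1 50); do
  curl -s 'https://cs.wiktionary.org/wiki/Polsko' | grep -q 'mw-mf' &&
    echo "attempt $i: mobile markup served on the desktop URL"
done
```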
[23:21:41] ooh [23:21:43] I got a repro [23:22:03] cool [23:22:11] now i can go sleep ;-) [23:24:27] (03CR) 10Mobrovac: Migrate parsoid::role::testing service from upstart to systemd (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/265628 (owner: 10Subramanya Sastry) [23:25:32] (03PS2) 10Ottomata: 0.8.2.1-4 release - kafka-mirror package enhancements [debs/kafka] (debian) - 10https://gerrit.wikimedia.org/r/265288 (https://phabricator.wikimedia.org/T124077) [23:25:34] (03PS2) 10Nuria: Removing code generates pageviews using legacy definition [puppet] - 10https://gerrit.wikimedia.org/r/265634 (https://phabricator.wikimedia.org/T124244) [23:25:47] legoktm: wow, thanks [23:26:03] (03CR) 10jenkins-bot: [V: 04-1] Removing code generates pageviews using legacy definition [puppet] - 10https://gerrit.wikimedia.org/r/265634 (https://phabricator.wikimedia.org/T124244) (owner: 10Nuria) [23:26:25] (03Abandoned) 10Nuria: Removing code generates pageviews using legacy definition [puppet] - 10https://gerrit.wikimedia.org/r/265634 (https://phabricator.wikimedia.org/T124244) (owner: 10Nuria) [23:26:59] yeh im seeing it on https://cs.wiktionary.org/wiki/Polsko [23:27:00] PROBLEM - Disk space on iridium is CRITICAL: DISK CRITICAL - free space: / 338 MB (3% inode=82%) [23:27:55] twentyafterfour: ^ [23:28:02] svn import causing space issues? [23:28:23] YuviPanda: already fixed [23:28:41] I think something is corrupting the parser cache. [23:28:58] this sounds very similar to the wikivoyage issue.. [23:29:09] thanks twentyafterfour [23:29:11] RECOVERY - Disk space on iridium is OK: DISK OK [23:29:26] $skin = $wgOut->getSkin(); [23:29:26] return call_user_func_array( [23:29:26] array( $skin, 'doEditSectionLink' ), [23:29:29] eww [23:29:36] sorry about that, I didn't realize ~ was on a tiny partition [23:30:31] still, it shouldn't be corrupting the parser cache [23:30:50] right, edit section links run afterwards [23:31:35] so somehow $wgOut->getSkin() is returning SkinMinerva?? [23:31:54] (03PS1) 10Nuria: Removing code generates pageviews using legacy definition [puppet] - 10https://gerrit.wikimedia.org/r/265656 (https://phabricator.wikimedia.org/T124244) [23:32:27] (03PS4) 10Subramanya Sastry: Migrate parsoid::role::testing service from upstart to systemd [puppet] - 10https://gerrit.wikimedia.org/r/265628 [23:32:49] (03CR) 10jenkins-bot: [V: 04-1] Removing code generates pageviews using legacy definition [puppet] - 10https://gerrit.wikimedia.org/r/265656 (https://phabricator.wikimedia.org/T124244) (owner: 10Nuria) [23:32:50] legoktm: it's running the MobileFormatter. Is it possible we do not have separate parser caches for mobile/desktop on those projects? [23:33:00] legoktm: you'll notice it's not just the edit links but it's doing section wrapping [23:33:30] why doesn't it contain the parsercache hash? [23:33:36] 6operations, 10ops-codfw: mw2098 non-responsive to mgmt - https://phabricator.wikimedia.org/T85286#1954018 (10Papaul) Fixes & Enhancements Enhancements: N/A Fixes: - Fix for issues that cause iDRAC7 sluggish responsiveness after a prolonged period of time (approx. 45-100 days, depending on the usage). In some... 
[23:33:43] (03CR) 10jenkins-bot: [V: 04-1] Migrate parsoid::role::testing service from upstart to systemd [puppet] - 10https://gerrit.wikimedia.org/r/265628 (owner: 10Subramanya Sastry) [23:34:34] (03PS2) 10Nuria: Removing code generates pageviews using legacy definition [puppet] - 10https://gerrit.wikimedia.org/r/265656 (https://phabricator.wikimedia.org/T124244) [23:34:36] (03CR) 10Subramanya Sastry: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/265628 (owner: 10Subramanya Sastry) [23:35:00] Platonides: by default ParserOptions::$mEnableLimitReport is false... [23:35:06] 6operations, 10ops-codfw: mw2087 fails to reboot, mgmt interface unreachable - https://phabricator.wikimedia.org/T124299#1954026 (10Papaul) Update IDRAC firmware from 1.30 to 2.21 @joe please see note on T85286 [23:35:29] (03CR) 10jenkins-bot: [V: 04-1] Removing code generates pageviews using legacy definition [puppet] - 10https://gerrit.wikimedia.org/r/265656 (https://phabricator.wikimedia.org/T124244) (owner: 10Nuria) [23:35:51] legoktm: where do you see array( $skin, 'doEditSectionLink' ), ? I'm not finding that in MobileFrontend [23:35:55] smells like old code [23:36:07] interestingly ParserOptions::matches() ignores mEnableLimitReport [23:36:11] legoktm: / jdlrobson: assessment on whether we should rollback or not due to this? [23:36:17] it makes sense [23:36:23] since it doesn't affect the output [23:36:29] jdlrobson: ParserOutput::getText() in core [23:37:01] greg-g: normally I'd say rollback, but I think we're going to have more issues rolling back the session stuff, and apparently this is very hard to repro... [23:37:12] * greg-g nods [23:37:23] I suspect that $po->setText( ExtMobileFrontend::DOMParse( $outputPage, $po->getText(), $isBeta ) ); is the culprit but I'm not sure why it's running on the desktop ParserOutput [23:37:51] it should also be disabling TOC [23:37:55] which it doesn't seem to be doing though [23:38:05] oh that runs on OutputPage [23:38:09] ergg so confusing [23:39:01] I'm going to live hack mw1017 a bit [23:39:03] idea: if the last edit of the page was coming through mobile, then the page stays in mobile output ? [23:39:16] yeh only the example https://cs.wiktionary.org/wiki/Polsko has had no recent edits [23:39:20] https://cs.wiktionary.org/w/index.php?title=Polsko&action=history doesn't say mobile? [23:39:40] legoktm: are you able to flush the parser output cache for a particular page? [23:39:50] anyway, i'm going to take a nap now (almost 1am here), feel free to pm me with additional needs [23:40:04] jdlrobson: probably, but I don't want to do that unless we have another repro [23:40:09] if so there's a few experiments we could run e.g. visit via mobile first and see whether that impacts desktop cache [23:40:25] legoktm: did you try those other links i gave you? [23:41:07] so this is sounding like something that's going to need a lot of varnish purging once fixed, right? [23:41:19] Danny_B: oops, never clicked further. 
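(On flushing the parser output cache for a single page, per the question just above: a null edit does it, and so does action=purge — e.g., assuming anonymous POST purges are permitted here:)

```bash
# purge one page's cached rendering without editing it
curl -s -X POST 'https://cs.wikiquote.org/w/api.php' \
     --data 'action=purge&titles=Jean-Marie_Adiaffi&format=json'
```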
[23:41:23] (03CR) 10Ottomata: [C: 032] 0.8.2.1-4 release - kafka-mirror package enhancements [debs/kafka] (debian) - 10https://gerrit.wikimedia.org/r/265288 (https://phabricator.wikimedia.org/T124077) (owner: 10Ottomata) [23:41:29] maybe we can purge on the obj.http.Date range for when it was generating bad stuff [23:41:30] okay, I can also repro on https://cs.wiktionary.org/wiki/d%C5%AFle%C5%BEit%C3%BD and https://cs.wikiquote.org/wiki/Jean-Marie_Adiaffi [23:41:30] PROBLEM - Host ms-be1002 is DOWN: PING CRITICAL - Packet loss = 100% [23:42:52] legoktm: i purged leetspeak before because of testing [23:42:56] ok [23:43:23] * Danny_B -> nap [23:43:33] null edit fixed https://cs.wikiquote.org/wiki/Jean-Marie_Adiaffi [23:44:36] 6operations, 10ops-codfw: mw2098 non-responsive to mgmt - https://phabricator.wikimedia.org/T85286#1954075 (10Papaul) Note version 1.56 down is for IDRAC 7 and not IDRAC 6. I am still working on the IDRAC 6 for the PowerEdge R410 series like mw2039 which had the same problem as well. [23:44:59] eh [23:45:15] notices in languages I don't speak keep popping up and disappear before I can copy the text [23:47:07] even doing an actual edit on mobile doesn't cause it to come back: https://cs.wikiquote.org/w/index.php?title=Jean-Marie_Adiaffi&type=revision&diff=76155&oldid=70975 [23:48:56] bblack: maybe? we still don't know what's causing it yet :/ [23:50:43] legoktm: i tried editing via desktop site on a mobile device but that didn't cause any issues, so i'm out of ideas :/ [23:51:07] not having the limit report is bugging me... [23:52:44] (03PS1) 10Alex Monk: beta: Move login and bits apache configs into wikimedia.conf, like prod [puppet] - 10https://gerrit.wikimedia.org/r/265659 [23:53:26] legoktm: i just saw your message from earlier, i don't think mediawiki-config is me (haven't merged anything today) [23:53:49] (03CR) 10jenkins-bot: [V: 04-1] beta: Move login and bits apache configs into wikimedia.conf, like prod [puppet] - 10https://gerrit.wikimedia.org/r/265659 (owner: 10Alex Monk) [23:53:51] ebernhardson: [14:55:23] (CR) EBernhardson: [C: 2] Recycle completion suggester indices for small wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/265472 (owner: DCausse) [23:54:00] Krenair: Wouldn't it be easier to just template out the production configs (for servername, aliases etc) and update the beta config to use it? [23:54:33] (03CR) 10Subramanya Sastry: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/265628 (owner: 10Subramanya Sastry) [23:54:47] Reedy, the plan is to identify the differences and merge them [23:55:03] In theory, it should be minimal [23:55:16] legoktm: oh doh... i meant to merge a different patch :S [23:55:20] (03CR) 10Alex Monk: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/265659 (owner: 10Alex Monk) [23:55:35] well... might as well put that up to swat now since it's pre-config for a patch going out next week [23:55:41] related, https://github.com/wikimedia/mediawiki is like 8 days behind?? [23:55:54] Reedy, I could upload one change that just deletes half of the mediawiki sites config and adds a load more [23:56:13] lol [23:56:15] Reedy, however, I would like to stand a realistic chance of getting stuff done, and therefore I need to be able to convince ops to approve changes [23:58:24] https://gerrit.wikimedia.org/r/#/c/263606/ looks relatively safe [23:58:50] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge. [23:58:51] Danny_B: do we have any evidence this started with wmf.11 yesterday?
[23:59:41] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge.