[00:00:04] RoanKattouw ostriches Krenair: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160224T0000).
[00:00:04] RoanKattouw Krenair AaronSchulz aude: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process.
[00:00:40] (03CR) 10Aaron Schulz: [C: 031] Configure redis LockManager in both DCs, use the master everywhere. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266514 (owner: 10Giuseppe Lavagetto)
[00:00:42] perf window is running over, wikidata took 2 hours longer
[00:00:54] !log krinkle@tin Synchronized php-1.27.0-wmf.13/includes: Backport MW_NO_SESSION changes (duration: 02m 13s)
[00:00:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:01:08] Krinkle_: OK, ping me when done
[00:01:14] Same arrangement as yesterday :)
[00:05:39] (03PS4) 10Cmjohnson: Fixing partman recipe that wmf4727-test uses. Needed gpt [puppet] - 10https://gerrit.wikimedia.org/r/272892
[00:25:26] RoanKattouw: patches staged now on mw1017, 5-10 min. max
[00:35:30] !log krinkle@tin Synchronized php-1.27.0-wmf.14/includes: SessionManager backports (duration: 02m 17s)
[00:35:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:37:00] PROBLEM - Kafka Broker Replica Max Lag on kafka1022 is CRITICAL: CRITICAL: 65.22% of data above the critical threshold [5000000.0]
[00:37:44] !log krinkle@tin Synchronized php-1.27.0-wmf.14/extensions/WikimediaMessages: SessionManager backports (duration: 01m 37s)
[00:37:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:38:35] RoanKattouw: Done
[00:38:42] (last sync is finishing any second now)
[00:40:13] !log krinkle@tin Synchronized php-1.27.0-wmf.14/extensions/CentralAuth: SessionManager backports (duration: 01m 39s)
[00:40:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:41:46] 6Operations, 10RESTBase, 6Services, 10Traffic, 3Mobile-Content-Service: Split slash decoding from general percent normalization in Varnish VCL - https://phabricator.wikimedia.org/T127387#2058357 (10GWicke) > It still seems to me like the only generic decoding we can do for all traffic is the unreserved s...
[00:51:50] RECOVERY - Kafka Broker Replica Max Lag on kafka1022 is OK: OK: Less than 50.00% above the threshold [1000000.0]
[00:52:39] PROBLEM - puppet last run on analytics1031 is CRITICAL: CRITICAL: Puppet has 1 failures
[01:03:51] RoanKattouw?
[01:10:05] meh, I'll just do my thing
[01:10:27] (03CR) 10Aaron Schulz: [C: 032] Enable async secondary swift writes for non-"big" wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272611 (https://phabricator.wikimedia.org/T91869) (owner: 10Aaron Schulz)
[01:10:38] RoanKattouw will be back in a few
[01:11:07] (03Merged) 10jenkins-bot: Enable async secondary swift writes for non-"big" wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272611 (https://phabricator.wikimedia.org/T91869) (owner: 10Aaron Schulz)
[01:13:38] !log aaron@tin Synchronized wmf-config/filebackend-production.php: Enable async secondary swift writes for non-"big" wikis (duration: 01m 31s)
[01:13:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:15:06] AaronSchulz, please let me know when you're happy with that commit so I can do my patches
[01:16:10] (03PS2) 10Jforrester: Enable VisualEditor for new accounts on the German Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271712 (https://phabricator.wikimedia.org/T127881)
[01:16:51] Krenair: I'm done
[01:17:04] ty
[01:17:54] (03CR) 10Jforrester: [C: 04-2] "Subject to community consent (vote is on-going), and needs the maintenance script run first so that existing registered users do not get t" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271712 (https://phabricator.wikimedia.org/T127881) (owner: 10Jforrester)
[01:17:56] (03CR) 10Alex Monk: [C: 032] Remove old WEF dashboard IP from enwiki ratelimit exception [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272837 (https://phabricator.wikimedia.org/T126541) (owner: 10Alex Monk)
[01:18:21] RECOVERY - puppet last run on analytics1031 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[01:18:27] (03Draft3) 10Jforrester: Enable VisualEditor for IP users on the German Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271713 (https://phabricator.wikimedia.org/T127881)
[01:18:43] (03Merged) 10jenkins-bot: Remove old WEF dashboard IP from enwiki ratelimit exception [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272837 (https://phabricator.wikimedia.org/T126541) (owner: 10Alex Monk)
[01:18:57] (03CR) 10Jforrester: [C: 04-2] "Not until community has agreed, please. :-)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271713 (https://phabricator.wikimedia.org/T127881) (owner: 10Jforrester)
[01:21:35] !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/272837/ (duration: 01m 35s)
[01:21:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:21:44] Ohai, I'm here now
[01:21:46] (03CR) 10Alex Monk: [C: 032] VisualEditor: Enable in two extra namespaces on French Wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272889 (https://phabricator.wikimedia.org/T127819) (owner: 10Jforrester)
[01:21:48] But I see Krenair is SWATting already
[01:22:01] well
[01:22:05] I was doing my list of patches
[01:22:10] Sorry, I was here at the set time, but then Timo needed more time and I got distracted
[01:22:21] OK, I can do the others
[01:22:21] Apologies again for showing up so late
[01:22:27] (03Merged) 10jenkins-bot: VisualEditor: Enable in two extra namespaces on French Wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272889 (https://phabricator.wikimedia.org/T127819) (owner: 10Jforrester)
[01:23:41] RoanKattouw, basically the remaining ones are yours and aude's
[01:23:52] OK
[01:24:02] If you're done and ready for me to take over, I'll do the rest now
[01:24:20] (03PS1) 10Aaron Schulz: Enable async swift writes for remaining backends [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272922
[01:24:31] !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/272889/ (duration: 01m 35s)
[01:24:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:25:13] Krenair: aude?!? I don't see any patches by her
[01:25:26] James_F, ^ lgtm
[01:25:39] aude
[01:25:40] [config] 271336 Don't yet (for interactive graphs) allow wikidatasparql graph urls
[01:26:06] Krenair: Confirmed.
[01:26:12] RoanKattouw, oh, it's been moved
[01:26:22] RoanKattouw, https://wikitech.wikimedia.org/w/index.php?title=Deployments&diff=325240&oldid=325045
[01:26:28] RoanKattouw, all yours
[01:26:40] Oh OK
[01:29:25] (03CR) 10Catrope: [C: 032] Add Echo site icons for all of the remaining families. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270460 (https://phabricator.wikimedia.org/T49662) (owner: 10Mattflaschen)
[01:30:10] (03Merged) 10jenkins-bot: Add Echo site icons for all of the remaining families. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270460 (https://phabricator.wikimedia.org/T49662) (owner: 10Mattflaschen)
[01:32:03] (03CR) 10Catrope: [C: 032] Exclude fishbowl and add computed dblist for Flow [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250460 (owner: 10Mattflaschen)
[01:32:39] !log catrope@tin Synchronized w/static/images: Add project logos for Echo (duration: 01m 33s)
[01:32:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:32:47] (03Merged) 10jenkins-bot: Exclude fishbowl and add computed dblist for Flow [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250460 (owner: 10Mattflaschen)
[01:34:57] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Use project logos for welcome notifications (duration: 01m 34s)
[01:35:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:38:53] !log catrope@tin Synchronized dblists/: Add new dblists for Flow (duration: 01m 32s)
[01:38:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:39:29] 6Operations, 10Ops-Access-Requests: Requesting access to stat1002 (Hue / Hive) for bmansurov - https://phabricator.wikimedia.org/T113069#2058556 (10Dzahn) I can confirm the user bmansurov is still in the "wmf" LDAP group but that is all i know about hue access. Re-assigning over to ottomata for advice.
[01:39:41] 6Operations, 10Ops-Access-Requests: Requesting access to stat1002 (Hue / Hive) for bmansurov - https://phabricator.wikimedia.org/T113069#2058557 (10Dzahn) a:5coren>3Ottomata
[01:42:14] PROBLEM - MariaDB Slave Lag: m3 on db1048 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1900.00 seconds
[01:45:13] 6Operations, 10Ops-Access-Requests: Requesting access to stat1002 (Hue / Hive) for bmansurov - https://phabricator.wikimedia.org/T113069#2058565 (10Dzahn) 17:44 < ori> mutante: it requires membership in the analytics-privatedata-users group in admin.yaml 17:44 < ori> or admin/data/data.yaml rather 17:46 < muta...
[01:45:54] RECOVERY - MariaDB Slave Lag: m3 on db1048 is OK: OK slave_sql_lag Replication lag: 1.00 seconds
[01:47:05] !log catrope@tin Synchronized wmf-config/CommonSettings.php: Recognize new Flow dblist (duration: 01m 35s)
[01:47:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:50:57] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Use Flow dblists for deciding which wikis have Flow (duration: 01m 38s)
[01:51:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:51:16] 6Operations, 10Parsoid, 10Traffic, 10VisualEditor, and 2 others: Decom parsoid-lb.eqiad.wikimedia.org entrypoint - https://phabricator.wikimedia.org/T110474#1578558 (10Jackmcbarn) FYI, I didn't realize that this was taking parsoid-prod.wmflabs.org down until it happened, so a lot of the requests from it we...
[01:54:26] 6Operations, 10Parsoid, 10Traffic, 10VisualEditor, and 2 others: Decom parsoid-lb.eqiad.wikimedia.org entrypoint - https://phabricator.wikimedia.org/T110474#2058577 (10ssastry) >>! In T110474#2058575, @Jackmcbarn wrote: > FYI, I didn't realize that this was taking parsoid-prod.wmflabs.org down until it hap...
[01:55:12] 6Operations, 10Parsoid, 10Traffic, 10VisualEditor, and 2 others: Decom parsoid-lb.eqiad.wikimedia.org entrypoint - https://phabricator.wikimedia.org/T110474#2058578 (10Jackmcbarn) No problem, I already have it taken care of.
[02:12:50] PROBLEM - nutcracker process on mw1099 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 108 (nutcracker), command name nutcracker
[02:14:19] PROBLEM - nutcracker port on mw1099 is CRITICAL: Connection refused
[02:15:09] RoanKattouw: matt_flaschen : No 'default' key for UseFlow?
[02:15:56] Note this seems to be the first computed dblist we use in production. We normall use them on cli only, or subst it in dblist/ with script to generate it if used in prod. Might be a minor perf hit.
[02:16:05] Though the compiled config cache might mitigate it
[02:16:42] Hmm yeah that should have a 'default' key, good catch
[02:17:04] Also I should deploy the Echo fixes
[02:18:57] (03PS1) 10Catrope: Follow-up c021272153e: set a default value (false) for wmgUseFlow [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272925
[02:19:12] 18669 Undefined variable: wmgUseFlow in /srv/mediawiki/wmf-config/CommonSettings.php on line 2516
[02:19:18] Thanks for noticing Krinkle :)
[02:21:01] (03CR) 10Catrope: [C: 032] Follow-up c021272153e: set a default value (false) for wmgUseFlow [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272925 (owner: 10Catrope)
[02:21:59] ...is someone else deploying something?
[02:22:07] (03Merged) 10jenkins-bot: Follow-up c021272153e: set a default value (false) for wmgUseFlow [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272925 (owner: 10Catrope)
[02:22:24] Oh #$@#$#@ l10nupdate is
[02:23:34] Krinkle, looking.
[02:23:50] Yep, binfo
[02:23:52] Notice: Undefined variable: wmgUseFlow in /srv/mediawiki/wmf-config/CommonSettings.php on line 2516
[02:24:10] couple thousands times and counting :)
[02:24:28] ah, roan got it already
[02:24:45] Yeah you reported its cause :)
[02:24:56] Sorry about that.
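The notices above come from a per-wiki settings map that lacked a 'default' key. The real mechanism is MediaWiki's PHP SiteConfiguration ($wgConf); the following is only an illustrative Python sketch (names like `wmg_use_flow` and the tag `"flow"` are hypothetical) of why the fallback key matters:

```python
def resolve_setting(setting_map, wiki, tags=()):
    """Resolve a per-wiki value from a wgConf-style map.

    Lookup order: exact wiki name, then any matching tag
    (e.g. a dblist name), then the 'default' key.
    """
    if wiki in setting_map:
        return setting_map[wiki]
    for tag in tags:
        if tag in setting_map:
            return setting_map[tag]
    # Without a 'default' key this lookup fails -- the analogue of the
    # "Undefined variable: wmgUseFlow" notices seen in the log above.
    return setting_map["default"]

# Hypothetical map mirroring the fixed config: tagged wikis get True,
# everything else falls through to the newly added default of False.
wmg_use_flow = {"flow": True, "default": False}

print(resolve_setting(wmg_use_flow, "enwiki", tags=("flow",)))  # True
print(resolve_setting(wmg_use_flow, "fishbowlwiki"))            # False
```

Without the `"default"` entry, the second call would raise a KeyError, which is roughly what flooded the logs until the follow-up patch was synced.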
[02:25:00] And the fix is merged, but I can't deploy it because l10nupdate holds the scap lock
[02:25:17] Which is mostly because I got distracted again and didn't remember to finish my SWAT until after 6pm
[02:25:24] Apparently 6pm is when l10nupdate runs?
[02:25:45] (03PS3) 10Jforrester: Enable VisualEditor for new accounts on the German Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271712 (https://phabricator.wikimedia.org/T127881)
[02:25:47] (03PS4) 10Jforrester: Enable VisualEditor for IP users on the German Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271713 (https://phabricator.wikimedia.org/T127881)
[02:25:49] (03PS1) 10Jforrester: Enable VisualEditor state transitioning for accounts on the German Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272926 (https://phabricator.wikimedia.org/T127881)
[02:26:04] Krinkle, isn't it also used by arbitraryaccess?
[02:26:24] matt_flaschen: what is that?
[02:26:39] what is used by what?
[02:26:41] Krinkle, it's a Wikibase feature, seems to use computed db list similarly, IIUC.
[02:27:17] matt_flaschen: arbitraryaccess is substituted
[02:27:17] (03CR) 10Jforrester: [C: 04-1] "If they're still in favour, we'll do this as soon as dewiki community vote closes." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272926 (https://phabricator.wikimedia.org/T127881) (owner: 10Jforrester)
[02:27:26] possibly auto-generated, but not computed at run time
[02:27:44] the all - x - y + z, has never been used in prod.
[02:28:00] Which may be fine, it's just something seem to have avoided thusfar
[02:28:09] My mistake.
[02:28:28] it was introduced as a way to neatly make your own sets in the command line, and it even leaked into wmf-config as a way to "save" that custom preset, but was never used in actual run time
[02:30:05] Krinkle, it is mainly for maintenance scripts here too (that's the intended benefit), but I figured it might as well also use it at runtime for consistency. I didn't know about the substing, how is that done?
[02:32:54] matt_flaschen: Compile by hand and add a unit test that ensures it is up to date.
[02:33:04] Some are also updated via refresh-dblists
[02:33:12] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.13) (duration: 13m 47s)
[02:33:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:33:32] Which we could build upon to auto-subst expressions (e.g. moving the expression into a # comment on top)
[02:33:47] right now it only builds .dblist variants
[02:33:50] anyway
[02:34:17] I'd avoid expressions for now as that wasn't previously used in prod and might have negative impact until we know further.
[02:34:40] we should probably move expression presets elsewhere
[02:35:00] or make it a cli-only feature with everything else substituted by a script
[02:35:20] !log catrope@tin Synchronized php-1.27.0-wmf.14/extensions/Echo: SWAT (duration: 01m 42s)
[02:35:25] Krinkle, okay, if it seems like a perf problem I have no objection to reverting. I'll file a bug about doig it in refresh-dblists. That makes sense.
[02:35:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:35:50] matt_flaschen: yeah. again, I think it might be fine, but trying to maintain status quo for the time being as it wasn't intentional.
[02:36:11] using a dblist is still fine of course, we can just subst it and leave the rest as-is
[02:36:35] Krinkle, well, I intended it to apply at runtime, but I didn't realize it was the only one.
[02:37:07] there is one other one, but it's not used at runtime. only as arg to cli. which isn't obvious.
[02:37:19] e.g. it's not in the array of files we read in wmf-config
[02:37:26] as wfConf tags
[02:37:39] !log catrope@tin Synchronized php-1.27.0-wmf.13/extensions/Echo: SWAT (duration: 01m 40s)
[02:37:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:38:42] Krinkle, I'll put a followup Gerrit to subst it.
[02:38:47] thx
[02:38:49] And task is https://phabricator.wikimedia.org/T127926
[02:38:51] signing off now :)
[02:38:59] 'll check it out tomorrow
[02:38:59] o/
[02:40:20] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Add default to fix notices about wmgUseFlow (duration: 01m 36s)
[02:40:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:57:54] (03PS1) 10Mattflaschen: Expand computed dblist; leave flow_computed for easy regeneration: [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272929
[03:01:25] (03PS2) 10Mattflaschen: Expand computed dblist; leave flow_computed for easy regeneration: [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272929
[03:10:55] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.14) (duration: 18m 05s)
[03:11:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:14:29] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:17:50] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 28168 bytes in 0.175 second response time
[03:19:41] !log l10nupdate@tin ResourceLoader cache refresh completed at Wed Feb 24 03:19:41 UTC 2016 (duration 8m 46s)
[03:19:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:46:55] 6Operations, 6Performance-Team, 7Performance: Update HHVM package to recent release - https://phabricator.wikimedia.org/T119637#2058670 (10Ricordisamoa) > 3.6 becomes EOL at the end of January 2016. It looks like it has been postponed until March: https://github.com/hhvm/user-documentation/issues/267
[04:48:32] 6Operations, 10Traffic, 10Wikipedia-Store, 7HTTPS: shop.wikimedia.org should be HTTPS only - https://phabricator.wikimedia.org/T39790#2058723 (10Dzahn) shop.wikimedia.org and shop.wikipedia.org point to our Apache cluster: ``` templates/wikipedia.org:shop 600 IN DYNA geoip!text-addrs templ... [04:52:13] 6Operations, 10Traffic, 10Wikipedia-Store, 7HTTPS: shop switches HTTPS -> HTTP when showing login prompt (on clicking checkout) - https://phabricator.wikimedia.org/T63528#2058724 (10Dzahn) the shop does not run on WMF infra but external on shopify.com and isn't operated by the operations team. so i'm afrai...
[04:54:46] 6Operations, 10DNS, 10Traffic, 10domains: Transfer of domain names to WMF servers - https://phabricator.wikimedia.org/T114922#2058726 (10Dzahn) 5stalled>3Open
[04:58:16] 6Operations, 10DNS, 10Traffic, 10domains: Transfer of domain names to WMF servers - https://phabricator.wikimedia.org/T114922#2058727 (10Dzahn) 5Open>3Resolved a:5VBaranetsky>3None We have received no updates on this since November, but it has been fixed at some point. whois wilkipedia.org | grep...
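The "computed dblist" expressions Krinkle and matt_flaschen debate earlier in the log have the shape `all - x - y + z`: start from a named list of wikis and add or subtract other named lists. A rough, purely illustrative Python sketch of evaluating such an expression (all list names and wiki names below are made up):

```python
def eval_dblist_expr(expr, lists):
    """Evaluate a dblist set expression like "all - fishbowl - closed".

    `lists` maps list names to sets of wiki db names. Terms are applied
    left to right, with "+" as union and "-" as difference.
    """
    tokens = expr.split()
    result = set(lists[tokens[0]])
    for op, name in zip(tokens[1::2], tokens[2::2]):
        if op == "+":
            result |= lists[name]
        elif op == "-":
            result -= lists[name]
        else:
            raise ValueError("unknown operator: %r" % op)
    return result

# Hypothetical named lists
lists = {
    "all": {"enwiki", "dewiki", "testwiki", "officewiki"},
    "fishbowl": {"officewiki"},
    "closed": {"testwiki"},
}
print(sorted(eval_dblist_expr("all - fishbowl - closed", lists)))  # ['dewiki', 'enwiki']
```

"Substing" such an expression, as discussed above, just means evaluating it once offline and committing the resulting flat list, so production requests never pay the evaluation cost at runtime.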
[04:58:20] (03PS1) 10Ori.livneh: Never profile PyBal health-checks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272932
[04:58:39] (03CR) 10jenkins-bot: [V: 04-1] Never profile PyBal health-checks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272932 (owner: 10Ori.livneh)
[04:58:50] (03PS2) 10Ori.livneh: Never profile PyBal health-checks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272932
[04:59:08] (03CR) 10Ori.livneh: [C: 032] Never profile PyBal health-checks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272932 (owner: 10Ori.livneh)
[04:59:34] (03Merged) 10jenkins-bot: Never profile PyBal health-checks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272932 (owner: 10Ori.livneh)
[05:00:37] 6Operations, 10Traffic, 10Wikipedia-Store, 7HTTPS: shop switches HTTPS -> HTTP when showing login prompt (on clicking checkout) - https://phabricator.wikimedia.org/T63528#2058731 (10HuiZSF) Seems shopify only use https for checkout transaction. Similar report earlier in last year: https://phabricator.wiki...
[05:01:11] (03PS1) 10BryanDavis: wikitech: Allow imports from meta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272933
[05:01:53] (03CR) 10Ori.livneh: [C: 031] "Sounds like a good idea." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272933 (owner: 10BryanDavis)
[05:02:47] argh, the sync-master mtime-equality fix has not been deployed yet?
[05:03:15] there is nothing quite like running a command with a progress indicator that just sits at 0% for 2 minutes straight.
[05:04:00] !log ori@mira Synchronized wmf-config/StartProfiler.php: I0e7be0b5: Never profile PyBal health-checks (duration: 03m 12s)
[05:04:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[05:04:52] ori: thc.ipriani said that they were working on getting a new deb built
[05:05:05] ah it's deployed as a debian package these days?
[05:05:28] yes. finally fixing the "any dev can edit" loophole
[05:05:59] but they haven't gotten the deploy process tuned up yet
[05:06:16] I don't think they have had a release since the initial transition
[05:07:12] well, it's these opsen, they're like san francisco hipsters
[05:08:00] if a package is not lovingly hand-crafted by a skilled artisan who apprenticed with a master, it might as well be garbage
[05:08:08] lol
[05:08:51] they can't tell you how they're better than automatically-generated packages, but man, they can _feel_ it
[05:08:53] I saw a tweet about "small batch artisan data" today
[05:11:31] next step towards Terminator, Boston Dynamics again https://www.youtube.com/watch?v=rVlhMGQgDkY
[05:11:47] !log Restarting HHVM on codfw app servers to make sure they pick a file-scope change to stop profiling PyBal health-checks
[05:11:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[05:11:52] can you see it racking servers already ? :p
[05:12:22] on the plus side, we have really thorough profiling data for Special:BlankPage
[05:12:51] ori: I bet it says that our startup code is slow
[05:15:52] let's cut it, who needs startup code
[05:15:58] just get to it already
[05:16:23] with a router and a DI system ...
[05:16:30] if $wmgUseThis, if $wmfUseThat, give me my wiki page damn it
[05:17:00] but content pages are always going to be the slow spot because that's where all the cool stuff happens
[05:25:22] the XHGui database is at 18G after less than two days, and I configured it to retain 30 days' worth
[05:25:53] what sample rate are you using?
[05:26:15] 9G/day seems like a lot
[05:26:37] 1:10,000
[05:27:55] I wonder if MongoDB is compressing data
[05:28:22] * bd808 inserts web scale joke
[05:28:37] " MongoDB 3.0 introduces compression with the WiredTiger storage engine. "
[05:28:38] !log applied https://secure.phabricator.com/rP03d6e7f1b699d89c829e92ba0da2178b41ad1d6a on iridium to fix visibility on pastes
[05:28:40] we're at 2.2
[05:28:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[05:29:43] it doesn't have compression. wow.
[05:30:56] maybe it's the 1:1 request profiling on mw1017 that makes up the bulk of profiling data
[05:31:00] rather than the 1:10,000
[05:31:09] could be I suppose
[05:31:37] how big is the mw1017.log file on fluorine?
[05:32:32] 65M
[05:32:44] the log files in archive are tiny. I doubt it is getting a lot of usage
[05:34:01] 1992 requests in today's log
[05:34:15] that's not your problem :)
[05:34:19] (03CR) 10Chad: [C: 032] wikitech: Allow imports from meta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272933 (owner: 10BryanDavis)
[05:34:47] (03Merged) 10jenkins-bot: wikitech: Allow imports from meta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272933 (owner: 10BryanDavis)
[05:35:04] ostriches: \o/ I can keep cleaning things up
[05:35:22] * bd808 needs to learn to use pywikibot to script some stuff
[05:35:43] Quiet night, figured I'd look at config backlog :)
[05:36:20] I put that one up for swat in the morning
[05:37:44] !log demon@tin Synchronized wmf-config/InitialiseSettings.php: import meta to wikitechwiki (duration: 01m 45s)
[05:37:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[05:40:09] ostriches: works. I just imported Template:Self
[05:40:49] whee :)
[05:41:54] grr meta doesn't have Template:Cc-by-sa-all
[05:41:58] (03CR) 10Chad: [C: 032] Remove skel-1.5 symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272649 (owner: 10Chad)
[05:42:39] (03Merged) 10jenkins-bot: Remove skel-1.5 symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272649 (owner: 10Chad)
[05:44:07] bd808: Heh, https://phabricator.wikimedia.org/F3410393 is so noisy it crowds anything else out of apache logs :p
[05:45:00] those are boring errors
[05:45:17] Yeah they are
[05:45:30] Should either be filtered or squashed.
[05:46:33] !log demon@tin Synchronized docroot/: removing skel-1.5 symlinks (duration: 01m 41s)
[05:46:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[05:47:32] ostriches: we could add logstash rules to drop them pretty easily
[05:48:01] I don't think we are going to fix the underlying apache2 bug unless you can nerd snipe T.im into looking at it
[05:48:37] Heheh nah I won't do that to him
[05:49:02] we already toss some similar ones -- https://github.com/wikimedia/operations-puppet/blob/production/files/logstash/filter-syslog.conf#L84-L89
[05:51:22] Patch incoming....
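The patch being discussed drops the noisy "AH01075: Error dispatching request to :" lines at the logstash layer rather than fixing mod_proxy_fcgi. The real rule is a logstash filter config (linked above); as a purely illustrative sketch, a drop filter of this kind amounts to a predicate over log lines, e.g. in Python (the regex and sample lines are assumptions, not the production rule):

```python
import re

# Pattern for the bogus mod_proxy_fcgi message: the request target after
# "to :" is empty, so the line ends right after the colon.
BOGUS = re.compile(r"AH01075: Error dispatching request to :\s*$")

def keep(line):
    """Return True if the log line should be kept (i.e. not dropped)."""
    return not BOGUS.search(line)

lines = [
    "[proxy_fcgi:error] AH01075: Error dispatching request to :",
    "[core:notice] AH00094: Command line: '/usr/sbin/apache2'",
]
print([l for l in lines if keep(l)])  # only the AH00094 line survives
```

Filtering at ingest keeps the dashboards readable, at the cost of hiding any genuine AH01075 errors until the upstream apache2 fix arrives with a distro upgrade, as discussed below.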
[05:52:43] (03PS1) 10Chad: Drop bogus apache2 error "AH01075: Error dispatching request to :" [puppet] - 10https://gerrit.wikimedia.org/r/272936
[06:03:20] PROBLEM - puppet last run on suhail is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:31] RECOVERY - puppet last run on suhail is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures
[06:29:50] PROBLEM - puppet last run on eventlog2001 is CRITICAL: CRITICAL: puppet fail
[06:30:20] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:59] PROBLEM - puppet last run on mw1061 is CRITICAL: CRITICAL: Puppet has 3 failures
[06:32:19] PROBLEM - puppet last run on mw1158 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:31] PROBLEM - puppet last run on mw1135 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:59] PROBLEM - puppet last run on mw2129 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:34:08] (03CR) 10Giuseppe Lavagetto: [C: 032] "I never realized those logs made it to logstash..." [puppet] - 10https://gerrit.wikimedia.org/r/272936 (owner: 10Chad)
[06:34:19] <_joe_> ostriches: thanks ^^
[06:34:19] PROBLEM - puppet last run on mw2061 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:34:48] _joe_: np, just noticed them myself
[06:35:27] <_joe_> no I noticed them 2 years ago, read the source of mod_proxy_fcgi, found the issue and also that it was already fixed upstream
[06:35:49] Ah so an eventual upgrade will squash them anyway?
[06:36:02] <_joe_> yes
[06:36:13] <_joe_> but it means upgrading to a new distro
[06:36:25] <_joe_> that might happen in a not too distant future anyways
[06:43:39] PROBLEM - puppet last run on mw2099 is CRITICAL: CRITICAL: puppet fail
[06:56:39] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
[06:56:40] RECOVERY - puppet last run on mw1158 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures
[06:56:59] RECOVERY - puppet last run on mw1135 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:29] RECOVERY - puppet last run on mw2129 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
[06:58:00] RECOVERY - puppet last run on eventlog2001 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[06:58:10] RECOVERY - puppet last run on mw1061 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:00:40] RECOVERY - puppet last run on mw2061 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:05:55] (03Abandoned) 10Muehlenhoff: Backport upstream fix 062c189fee20c18fae5ac3716a7379143d64150e which deals with changes in OpenSSL's SSL_shutdown() function during SSL handshakes introduced in 1.0.2f (causing false positive critical errors) Bug: T126616 [software/nginx] - 10https://gerrit.wikimedia.org/r/272685 (https://phabricator.wikimedia.org/T126616) (owner: 10Muehlenhoff)
[07:06:08] (03Abandoned) 10Muehlenhoff: Backport upstream fix 062c189fee20c18fae5ac3716a7379143d64150e which deals with changes in OpenSSL's SSL_shutdown() function during SSL handshakes introduced in 1.0.2f (causing false positive critical errors) Bug: T126616 [software/nginx] - 10https://gerrit.wikimedia.org/r/272695 (https://phabricator.wikimedia.org/T126616) (owner: 10Muehlenhoff)
[07:11:50] RECOVERY - puppet last run on mw2099 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[07:52:41] (03Abandoned) 10Giuseppe Lavagetto: Add native ipvs manager [debs/pybal] - 10https://gerrit.wikimedia.org/r/261375 (owner: 10Giuseppe Lavagetto)
[08:05:55] hmmm
[08:05:58] errors
[08:06:25] (03CR) 10Giuseppe Lavagetto: [C: 04-2] "This is not how this is supposed to be fixed. I'll create a patch soon." [puppet] - 10https://gerrit.wikimedia.org/r/272899 (https://phabricator.wikimedia.org/T127845) (owner: 10Thcipriani)
[08:06:36] Info: https://usercontent.irccloud-cdn.com/file/YLygtcVY/error%20with%20details
[08:07:26] looks like back now...
[08:09:53] Oh, I see that one almost every hour.
[08:10:23] <_joe_> Jamesofur: which url gave that to you?
[08:10:32] _joe_: it was a commons contributions link
[08:10:39] I can get you the exact one if needed
[08:10:43] <_joe_> Jamesofur: no need
[08:11:02] sjoerddebruin: consistently ?......
[08:11:05] or just recently?
[08:11:11] * Jamesofur very much does not get it often
[08:11:30] <_joe_> I don't think I ever got a 503 unless an outage was going on
[08:11:48] Jamesofur: in the last few days.
[08:11:54] huh
[08:12:02] <_joe_> also, given less than 0.001% of requests have errors, either you see more than 10K pages/hour or it's pretty strange
[08:12:11] but just for one second or something, refresh always helps
[08:12:14] <_joe_> sjoerddebruin: any specific urls where that happens?
[08:12:26] <_joe_> because that sounds like a software bug
[08:12:28] Mostly special pages, but also just articles.
[08:12:32] <_joe_> more than some instability
[08:12:55] <_joe_> sjoerddebruin: can I ask you to record such urls for a couple of days and rely those to us?
[08:13:03] Will try.
[08:13:23] <_joe_> actually I see the 5xx count is way up
[08:14:00] Almost doubled yes
[08:14:20] <_joe_> sjoerddebruin: I'll take a look
[08:14:48] <_joe_> it becan at 4 pm utc
[08:14:52] <_joe_> *began
[08:14:58] <_joe_> yesterday
[08:15:11] <_joe_> sjoerddebruin: I'll dig a bit and maybe open a phab task
[08:15:19] <_joe_> Jamesofur: ^^
[08:15:32] thanks much
[08:15:34] _joe_: where do you see that (that it's doubled)?
[08:15:44] if you remember feel free to add me, will be interested in following
[08:15:46] apergos: https://grafana.wikimedia.org/dashboard/db/varnish-http-errors
[08:15:49] https://grafana.wikimedia.org/dashboard/db/varnish-http-errors?from=1455696943504&to=1456301743504
[08:16:18] <_joe_> it all began with yesterday's swat
[08:16:20] <_joe_> it seems
[08:16:28] I see changes after the 18th
[08:16:30] tbh
[08:16:44] <_joe_> and another big bumb on the 18th, yes
[08:16:52] <_joe_> so two successive bumps
[08:17:00] If you zoom out to 30 days, you can clearly see more instability
[08:17:01] <_joe_> the first one was low enough to escape noticing
[08:17:07] <_joe_> sjoerddebruin: yeah you're right
[08:17:17] <_joe_> but this is probably more timeouts/app errors
[08:17:28] <_joe_> but I'll take the morning to look into it
[08:17:51] Great!
[08:18:57] out of eqiad it's nice and low
[08:18:59] out of esams worse
[08:19:40] <_joe_> apergos: what do you mean?
[08:20:00] https://grafana.wikimedia.org/dashboard/db/varnish-aggregate-client-status-codes?from=1456237170805&to=1456247970805&var-site=eqiad&var-cache_type=text&var-status_type=5
[08:20:42] _joe_:
[08:20:55] out of ulsfo worse also
[08:22:51] <_joe_> you mean the rate of 5xx responses from ulsfo got a surge and eqiad didn't?
[08:23:21] PROBLEM - Kafka Broker Replica Max Lag on kafka1014 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [5000000.0] [08:23:28] <_joe_> because it's not what I see [08:29:45] <_joe_> https://graphite.wikimedia.org/render/?width=800&height=600&_salt=1456302551.26&from=00%3A00_20160214&until=23%3A59_20160224&target=sumSeries(varnish.eqiad.*.frontend.request.client.status.5xx.rate) [08:29:46] (03CR) 10Jcrespo: "Make the buffer pool as large as you feel confortable doing it in a non-dedicated environment so that it provides you a large hit/miss rat" [puppet] - 10https://gerrit.wikimedia.org/r/272808 (https://phabricator.wikimedia.org/T119646) (owner: 10Ottomata) [08:30:04] <_joe_> it's pretty clear this is a ubn issue [08:32:09] are you talking about the increased background? [08:34:10] RECOVERY - Kafka Broker Replica Max Lag on kafka1014 is OK: OK: Less than 50.00% above the threshold [1000000.0] [08:37:14] <_joe_> jynus: yes [08:37:25] <_joe_> it's not "background", there is clearly some issue going on [08:40:59] when I look at caches for the last 24 hours, it looks to me like eqiad is pretty quiet now, ulsfo and esams are both increasing now. 
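[editor's note: the graphite render URL _joe_ pasted aggregates every eqiad frontend's per-second 5xx rate with `sumSeries()`. A minimal sketch of assembling and reading back such a query string, using only the target expression and time bounds visible in the log (actually fetching it would require access to graphite.wikimedia.org, so this only builds and parses the URL; `format=json` is a standard Graphite render option):]

```python
from urllib.parse import urlencode, parse_qs, urlparse

base = "https://graphite.wikimedia.org/render/"
params = {
    # target expression as pasted in the channel
    "target": "sumSeries(varnish.eqiad.*.frontend.request.client.status.5xx.rate)",
    "from": "00:00_20160214",
    "until": "23:59_20160224",
    "format": "json",  # render API defaults to PNG; json is easier to script against
}
url = base + "?" + urlencode(params)

# Reading the target back out of a pasted URL:
target = parse_qs(urlparse(url).query)["target"][0]
print(target)
```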
[08:41:20] (I'm responding slowly because I'm trying to make sure I'm reading these graphs correctly) [08:41:39] <_joe_> apergos: still, the background surge can be seen in eqiad too [08:41:45] I looked at text by itself, upload by itself, and "all" to see if it made much difference [08:42:32] !log installing libssh2 security updates across the cluster [08:42:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:43:42] from 16.00 to 23.00 there was a bit more going on (except the one spike) [08:44:54] but after, I don't really see it [08:47:15] 6Operations, 10MediaWiki-General-or-Unknown, 10Traffic: Number of 5xx (mainly 503) errors surge from text caches since February 17th - https://phabricator.wikimedia.org/T127931#2058868 (10Joe) [08:47:23] 6Operations, 10MediaWiki-General-or-Unknown, 10Traffic: Number of 5xx (mainly 503) errors surge from text caches since February 17th - https://phabricator.wikimedia.org/T127931#2058880 (10Joe) p:5Triage>3Unbreak! [08:47:38] <_joe_> sjoerddebruin: ^^ follow this task :) [08:47:46] :) [08:47:55] 6Operations, 10MediaWiki-General-or-Unknown, 10Traffic: Number of 5xx (mainly 503) errors surge from text caches since February 17th - https://phabricator.wikimedia.org/T127931#2058868 (10Joe) a:3Joe [08:52:15] (03CR) 10Tim Landscheidt: "@Dzahn: modules/role/manifests/ci/slave/labs/common.pp and modules/role/manifests/ci/labs/common.pp are identical.
It looks to me as if t" [puppet] - 10https://gerrit.wikimedia.org/r/260939 (owner: 10Dzahn) [08:55:00] (03CR) 10Tim Landscheidt: "@Dzahn: Also, and more importantly :-), there have been changes to manifests/role/ci.pp in the mean time, so the patch needs to be manuall" [puppet] - 10https://gerrit.wikimedia.org/r/260939 (owner: 10Dzahn) [08:56:40] 6Operations, 6Editing-Department, 6Performance-Team, 7Performance, 7user-notice: Severe save latency regression - https://phabricator.wikimedia.org/T126700#2058902 (10elukey) [08:56:42] 6Operations, 13Patch-For-Review, 7user-notice: reinstall eqiad memcache servers with jessie - https://phabricator.wikimedia.org/T123711#2058901 (10elukey) [08:58:22] 6Operations, 10MediaWiki-General-or-Unknown, 10Traffic: Number of 5xx (mainly 503) errors surge from text caches since February 17th - https://phabricator.wikimedia.org/T127931#2058903 (10Joe) So, SAL entries around the time of the two spikes: February 17th: ``` 20:25 logmsgbot: krinkle@tin Synchronized wm... [09:00:59] 6Operations, 13Patch-For-Review, 7user-notice: reinstall eqiad memcache servers with jessie - https://phabricator.wikimedia.org/T123711#2058906 (10elukey) @Johan: I am planning to re-image two hosts today since I am not blocked anymore by the latency regression. The main side effect will be that 2/18 of the... [09:02:07] _joe_ hi! the plan is to do mc1014/mc1015 today, mc1016->mc1018 tomorrow and then lock manager redises (mc1001->mc1003) on Friday/Monday [09:02:23] so we should be set for Monday [09:03:00] I can re-image the hosts more aggressively if you want, but I'd prefer to use a slow start :) [09:03:01] <_joe_> elukey: ok [09:03:12] <_joe_> elukey: no it's ok [09:03:20] PROBLEM - puppet last run on mw1191 is CRITICAL: CRITICAL: Puppet has 1 failures [09:04:48] (03CR) 10Gehel: [C: 031] "Looks good to me (and trivial enough)." 
[puppet] - 10https://gerrit.wikimedia.org/r/272908 (https://phabricator.wikimedia.org/T115476) (owner: 10Smalyshev) [09:05:37] Cc: ori - mc1014/mc1015 today, mc1016->mc1018 tomorrow and then lock manager redises (mc1001->mc1003) on Friday/Monday [09:06:21] OK. Redis ops/sec is back to normal once we backed out the IP-logging feature out of SessionManager [09:06:31] Memcached traffic is still elevated but the fixes are in wmf14 [09:06:36] and so should ride the train later today [09:07:30] this patch, specifically: https://gerrit.wikimedia.org/r/#/c/272642/ [09:07:54] so, actually... [09:08:22] elukey: if you don't mind, it would be good to wait one more day. I'm sorry, this must be frustrating. [09:09:04] I'd like to be able to confirm that memcached traffic is back to pre-wmf12 levels, and it's going to be hard to do that if caches are taken out of circulation and then reintroduced [09:11:25] I know that this contradicts what I said yesterday, sorry. [09:12:20] ori: no big issue, I'll start tomorrow morning and I'll go for 3 hosts tomorrow and 3 on Friday. In return though, I might need some re-assurance about the fact that removing a lock manager from the pool (mc1001->mc1003) won't cause any turmoil in mediawiki :P [09:13:09] I had a chat with Aaron a while ago and it shouldn't be a problem, but I don't really know how to be sure [09:13:44] the lock managers should be part of a quorum, so if I remove one at the time to re-image it should be fine.. [09:14:32] untested HA is not HA, right? [09:14:43] that only tells us that it is possible for it to be fine [09:14:49] not that it actually will be [09:14:58] oh wait, you said assurance [09:14:59] exactly:) [09:15:04] i'm sure it'll be fine! [09:15:34] * elukey feels trolled a bit by ori :D [09:16:27] jokes aside, removing one host at the time from mediawiki-config shoud be enough [09:16:38] no i wasn't trolling you, just making jokes [09:16:51] ahhhhh okok! 
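[editor's note: elukey's reasoning above (three lock-manager redises, re-image one at a time) rests on majority quorum: with n nodes a lock needs acknowledgement from floor(n/2)+1 of them, so a 3-node pool tolerates exactly one node being out. A sketch of that arithmetic; this is illustrative only, not the actual MediaWiki LockManager code:]

```python
def quorum(n):
    # Smallest strict majority of n nodes
    return n // 2 + 1

POOL = ["mc1001", "mc1002", "mc1003"]  # lock manager redises named in the log

needed = quorum(len(POOL))
print(needed)  # 2 of 3 must acknowledge a lock

# Re-imaging one host at a time keeps a majority reachable:
assert len(POOL) - 1 >= needed   # 2 survivors still form a quorum
# Taking two out at once would not:
assert len(POOL) - 2 < needed
```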
[09:17:06] but i can be around to support this if you like [09:17:09] PROBLEM - Disk space on cerium is CRITICAL: DISK CRITICAL - free space: /srv 13672 MB (3% inode=99%) [09:17:49] ori: we can work together on Monday morning PST, just for the first lock manager if you have time [09:20:33] 6Operations, 13Patch-For-Review, 7user-notice: reinstall eqiad memcache servers with jessie - https://phabricator.wikimedia.org/T123711#2058941 (10elukey) @Johan: quick update, I'll start tomorrow EU morning, not today. [09:23:07] elukey: wfm [09:23:15] can you send me a calendar invite? [09:23:24] ori: sure! Thanks! [09:23:28] thank you [09:23:35] cerium is a cassandra test host, should I stop what I am doing and attend that? [09:27:09] jynus: I am completely ignorant about what the host does (same thing for Cassandra), I can help if you want [09:27:54] jynus elukey looking, but yeah test host [09:28:07] I am not asking for help, I am asking who owns that [09:28:07] test post please ignore [09:29:08] jynus: ok, got it sorry [09:29:11] jynus: me and the services team [09:30:21] 6Operations, 10MediaWiki-General-or-Unknown, 10Traffic: Number of 5xx (mainly 503) errors surge from text caches since February 17th - https://phabricator.wikimedia.org/T127931#2058954 (10Joe) I am looking at the number of errors from the text caches on various days, focusing on the affected backend as x_cac... [09:30:21] RECOVERY - puppet last run on mw1191 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [09:30:30] I assume it is not urgent, but I do not want to assume somebody will look at it (without hurry) and nobody does [09:31:30] 6Operations, 10MediaWiki-General-or-Unknown, 10Traffic: Number of 5xx (mainly 503) errors surge from text caches since February 17th - https://phabricator.wikimedia.org/T127931#2058956 (10Joe) [09:31:50] RECOVERY - Disk space on cerium is OK: DISK OK [09:32:27] jynus: yup, that makes sense, thanks! 
[09:33:09] (it recovered by itself btw) [09:33:22] many times I see a mysql error, and i just comment "do not worry", will check it later, so nobody worries [09:33:45] *has to [09:37:20] 6Operations, 10MediaWiki-General-or-Unknown, 10Traffic: Number of 5xx (mainly 503) errors surge from text caches since February 17th - https://phabricator.wikimedia.org/T127931#2058981 (10ArielGlenn) An overly long list of puppet stuff before/after the Feb 17th spike [17:20:41] (CR) Dzahn: [C: 2] partman... [09:43:38] Hi, I’m trying to solicit more traffic for our local internet exchange here (Israel), due to the delicate nature of the local ISP market, I’m looking for stabilizing factors. I figured wikimedia could be one such and congruent with the stated goals of the IIX. Would peering or hosting be the right way to go about it? what would be the expected traffic rates for Israel? (I can provide a set of ASNs for a netflow query as needed) [09:47:11] 6Operations, 10MediaWiki-General-or-Unknown, 10Traffic: Number of 5xx (mainly 503) errors surge from text caches since February 17th - https://phabricator.wikimedia.org/T127931#2058996 (10ema) We did enable nginx keepalives for text at 8PM on the 17th https://phabricator.wikimedia.org/T107749#2036351 Could t... 
[09:50:48] (03CR) 10Luke081515: [C: 031] Enable VisualEditor for new accounts on the German Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271712 (https://phabricator.wikimedia.org/T127881) (owner: 10Jforrester) [09:51:56] (03CR) 10Luke081515: [C: 031] Enable VisualEditor for IP users on the German Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271713 (https://phabricator.wikimedia.org/T127881) (owner: 10Jforrester) [09:53:14] (03PS7) 10Filippo Giunchedi: swift: adjust mount options for debian and ubuntu [puppet] - 10https://gerrit.wikimedia.org/r/263629 (https://phabricator.wikimedia.org/T117972) [09:53:42] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] swift: adjust mount options for debian and ubuntu [puppet] - 10https://gerrit.wikimedia.org/r/263629 (https://phabricator.wikimedia.org/T117972) (owner: 10Filippo Giunchedi) [09:54:39] gilbahat: paravoid or mark should be able to comment on that, but they are currently idling, they'll probably chime in after reading backlog [09:55:17] and I am pretty sure the primary point of contact is peering@wikimedia.org [09:55:21] there are some informations at https://wikimediafoundation.org/wiki/Peering [09:56:02] and peeringdb.com might be of some help as well https://www.peeringdb.com/view.php?asn=14907 (that is afaik kept up to date and list wmf point of presences, prefix, as etc) [09:56:06] gilbahat: ^^^ [09:56:56] yeah, read that and the peeringdb entry too. if we go towards peering I’d have to find transit donation all the way through to AMS-IX. what about hosting, or content caching? not referenced in either if anything works. [09:57:28] gilbahat: and for what it is worth, Israel traffic is most certainly served by the WMF datacenter in Amsterdam, Netherlands. That should pass through transit ISP. 
Certainly unlikely we would setup a direct link from Amsterdam to an IX in Israel, but I might be missing your point [09:58:58] for hosting / content caching I honestly have no idea nor am I in anyway in a position to answer about such enquiries. Your best point of contact would probably be noc@wikimedia.org [09:59:20] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] [10:00:24] 6Operations, 10MediaWiki-General-or-Unknown, 10Traffic: Number of 5xx (mainly 503) errors surge from text caches since February 17th - https://phabricator.wikimedia.org/T127931#2059048 (10Joe) The timing of the first spike correlates pretty well with that change; this still doesn't explain the second step up. [10:00:31] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0] [10:00:56] hashar: to make a long story short about the internet market in israel, it’s characterized by monopolizing ISPs and a very fragile local internet exchange which kind of stands in the way of the big ISPs to monopolize local interconnectivity. any content provider that is ‘present’ in the IIX makes it more useful. it also offsets the costs of the smaller ISPs, allowing them a bit more breathing room to survive. [10:01:03] gilbahat: https://wikitech.wikimedia.org/wiki/Clusters might be a good overview, it describes what each datacenter is meant for (in short two for applications, and two additional for content caching in San Francisco and Amsterdam) [10:01:59] (03PS1) 1020after4: Parameterize the git_server variable in global scap.cfg [puppet] - 10https://gerrit.wikimedia.org/r/272947 (https://phabricator.wikimedia.org/T126259) [10:02:34] gilbahat: then Wikimedia is not present in Israel, so it is unlikely it can peer at an IX there. 
peering@wikimedia.org would definitely give you the official answer [10:02:40] (03PS2) 1020after4: Parameterize the git_server variable in global scap.cfg [puppet] - 10https://gerrit.wikimedia.org/r/272947 (https://phabricator.wikimedia.org/T126259) [10:02:48] 6Operations, 10MediaWiki-General-or-Unknown, 10Traffic: Number of 5xx (mainly 503) errors surge from text caches since February 17th - https://phabricator.wikimedia.org/T127931#2059053 (10Joe) [10:03:41] hashar: of course. but if a find someone who can/will donate the transit, it will be remotely present. or else if I can place a few caching servers in the IIX. but it looks like this is not suitable for how wikimedia runs its online presence. [10:04:57] you seem to work with large cache clusters, not a network of small POPs. [10:04:57] <_joe_> gilbahat: having caches anywhere has economic, staffing and legal implications; I wouldn't consider it as a near-future possibility, but this is just my personal opinion [10:05:32] (03CR) 10Hashar: [C: 031] "The setting change itself is all fine. Do not merge until James approve this which is really all pending on the end of community vote at " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271713 (https://phabricator.wikimedia.org/T127881) (owner: 10Jforrester) [10:05:38] <_joe_> gilbahat: there has been talks of having mini-POPs, but there is nothing practical atm. [10:06:02] _joe_: OTOH it precludes smaller donations like the one ISOC (israeli internet society, the non-profit operating the IIX) is able to muster. 
[10:07:06] <_joe_> gilbahat: I'm not disagreeing, I'm just stating it's not like "we buy 4 servers, we ship them to a rack and it's done" [10:07:56] <_joe_> anyways, mark and paravoid are way more competent than me on any network-related stuff (and non-network related too, but I digress) [10:10:31] 6Operations, 10media-storage, 13Patch-For-Review: swift upgrade plans - https://phabricator.wikimedia.org/T117972#2059065 (10fgiunchedi) uid change isn't a blocker for jessie, adding swift user pre-swift (i.e. pre-puppet) seems like a good solution! most of my worry comes from upgrading to jessie and new sw... [10:10:39] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [10:12:20] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [10:15:20] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [10:19:55] 6Operations, 10MediaWiki-General-or-Unknown, 10Traffic: Number of 5xx (mainly 503) errors surge from text caches since February 17th - https://phabricator.wikimedia.org/T127931#2059068 (10Joe) The second surge is very steep, it happens between 16:03 UTC and 16.:07 UTC on Feb 23rd. This surge speed would see... 
[10:39:17] (03PS1) 10Muehlenhoff: Add CVE IDs to the changelog which was assigned later on [debs/linux] - 10https://gerrit.wikimedia.org/r/272953 [10:39:33] (03CR) 10Muehlenhoff: [C: 032 V: 032] Add CVE IDs to the changelog which was assigned later on [debs/linux] - 10https://gerrit.wikimedia.org/r/272953 (owner: 10Muehlenhoff) [10:39:55] (03PS4) 10Filippo Giunchedi: grafana: add dashboard import tool [puppet] - 10https://gerrit.wikimedia.org/r/268082 (https://phabricator.wikimedia.org/T125644) [10:40:04] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] grafana: add dashboard import tool [puppet] - 10https://gerrit.wikimedia.org/r/268082 (https://phabricator.wikimedia.org/T125644) (owner: 10Filippo Giunchedi) [10:40:10] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 847 [10:42:02] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [10:42:09] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. 
[10:45:09] RECOVERY - check_mysql on db1008 is OK: Uptime: 3092815 Threads: 2 Questions: 24419793 Slow queries: 20630 Opens: 7209 Flush tables: 2 Open tables: 407 Queries per second avg: 7.895 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [10:45:33] (03PS4) 10Filippo Giunchedi: grafana: import varnish-http-errors [puppet] - 10https://gerrit.wikimedia.org/r/268085 (https://phabricator.wikimedia.org/T125644) [10:45:54] 6Operations, 10ops-codfw: es2011-es2019 have default RAID stripe - https://phabricator.wikimedia.org/T127938#2059103 (10Volans) [10:49:01] 6Operations, 10ops-codfw: es2011-es2019 have default RAID stripe - https://phabricator.wikimedia.org/T127938#2059128 (10jcrespo) [10:53:12] !log grow restbase2001 raid0 to include a 5th disk [10:53:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:56:30] PROBLEM - salt-minion processes on restbase2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:57:10] PROBLEM - dhclient process on restbase2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:59:05] 6Operations, 7Wikimedia-log-errors: mw1099 has lost nutcracker - https://phabricator.wikimedia.org/T127939#2059139 (10hashar) [11:01:00] PROBLEM - cassandra-a service on restbase2001 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is deactivating [11:01:38] !log mdadm errors on restbase2001 while growing the raid0, load increasing [11:01:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:06:44] !log reboot restbase2001 [11:06:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:09:55] 6Operations, 10MediaWiki-General-or-Unknown, 10Traffic: Number of 5xx (mainly 503) errors surge from text caches since February 17th - https://phabricator.wikimedia.org/T127931#2059157 (10ema) A few interesting varnish metrics in esams [[ https://ganglia.wikimedia.org/latest/?r=custom&cs=02%2F16%2F2016+00%3A... 
[11:19:09] RECOVERY - salt-minion processes on restbase2001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [11:19:41] RECOVERY - cassandra-a service on restbase2001 is OK: OK - cassandra-a is active [11:19:50] RECOVERY - dhclient process on restbase2001 is OK: PROCS OK: 0 processes with command name dhclient [11:20:04] (03PS1) 10Giuseppe Lavagetto: deployment-prep: fixup for I65dc207e5 [puppet] - 10https://gerrit.wikimedia.org/r/272956 (https://phabricator.wikimedia.org/T127845) [11:20:50] (03PS2) 10Giuseppe Lavagetto: deployment-prep: fixup for I65dc207e5 [puppet] - 10https://gerrit.wikimedia.org/r/272956 (https://phabricator.wikimedia.org/T127845) [11:21:01] (03CR) 10Giuseppe Lavagetto: [C: 032] deployment-prep: fixup for I65dc207e5 [puppet] - 10https://gerrit.wikimedia.org/r/272956 (https://phabricator.wikimedia.org/T127845) (owner: 10Giuseppe Lavagetto) [11:21:10] (03CR) 10Giuseppe Lavagetto: [V: 032] deployment-prep: fixup for I65dc207e5 [puppet] - 10https://gerrit.wikimedia.org/r/272956 (https://phabricator.wikimedia.org/T127845) (owner: 10Giuseppe Lavagetto) [11:24:17] 6Operations, 7Puppet, 10Beta-Cluster-Infrastructure, 6Release-Engineering-Team, 13Patch-For-Review: deployment-tin puppet Error 400 on SERVER: Failed to parse template nutcracker/nutcracker.yml.erb - https://phabricator.wikimedia.org/T127845#2059202 (10Joe) p:5Triage>3Normal a:3Joe [11:29:18] _joe_: Gotcha. 
Got one on https://www.wikidata.org/wiki/Special:NewItem?lang=nl&site=nlwiki&page=Aulocalycinae&label=Aulocalycinae [11:29:30] Request from 10.20.0.113 via cp1067 cp1067 ([10.64.0.104]:3128), Varnish XID 3880457780 [11:29:30] Forwarded for: 94.214.239.5, 10.20.0.176, 10.20.0.176, 10.20.0.113 [11:29:30] Error: 503, Service Unavailable at Wed, 24 Feb 2016 11:28:59 GMT [11:29:56] 6Operations, 7Puppet, 10Beta-Cluster-Infrastructure, 6Release-Engineering-Team, 13Patch-For-Review: deployment-tin puppet Error 400 on SERVER: Failed to parse template nutcracker/nutcracker.yml.erb - https://phabricator.wikimedia.org/T127845#2059209 (10Joe) I removed the cherry pick of https://gerrit.wik... [11:30:04] 6Operations, 7Puppet, 10Beta-Cluster-Infrastructure, 6Release-Engineering-Team, 13Patch-For-Review: deployment-tin puppet Error 400 on SERVER: Failed to parse template nutcracker/nutcracker.yml.erb - https://phabricator.wikimedia.org/T127845#2059211 (10Joe) 5Open>3Resolved [11:31:07] <_joe_> sjoerddebruin: which loaded just fine for me even uncached.... [11:31:16] <_joe_> so yeah, thanks, this helps a bit [11:31:23] Yeah, also for me after refresh. It's weird stuff. [11:31:59] <_joe_> sjoerddebruin: if you're curious the phab ticket has our findings in almost-real-time [11:32:22] Yeah reading almost instant when I get mail ;) [11:32:49] (03Abandoned) 10Giuseppe Lavagetto: Beta: fix nutcracker config changes [puppet] - 10https://gerrit.wikimedia.org/r/272899 (https://phabricator.wikimedia.org/T127845) (owner: 10Thcipriani) [11:34:04] _joe_: Another one just at https://nl.m.wikipedia.org/wiki/Letse_regio%27s_en_provincies-serie [11:34:29] Do you need the Varnish XID or something for them? [11:35:12] <_joe_> sjoerddebruin: I don't think it's particularly useful atm [11:35:36] Ah okay, just url's? [11:36:05] <_joe_> yeah but there is no real need now, we can see them in the logs and I'm pretty sure it's not a mediawiki error [11:36:16] Okay. 
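[editor's note: the Varnish error reports pasted in this channel follow a fixed shape. By X-Forwarded-For convention the leftmost address in the "Forwarded for:" chain is the original client, with each intermediate proxy appended after it; a small helper (hypothetical, just for reading such reports) can pull that apart:]

```python
def parse_forwarded_for(line):
    # "Forwarded for: <client>, <proxy1>, <proxy2>, ..." -> (client, proxies)
    hops = [h.strip() for h in line.split(":", 1)[1].split(",")]
    return hops[0], hops[1:]

# Report pasted by sjoerddebruin above
report = "Forwarded for: 94.214.239.5, 10.20.0.176, 10.20.0.176, 10.20.0.113"
client, proxies = parse_forwarded_for(report)
print(client)   # 94.214.239.5 -- the real client; the rest are internal caches
```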
:) [11:36:17] <_joe_> (which would be repeatable somehow) [11:36:27] <_joe_> thanks for the help, and sorry for the issues [11:36:53] Good luck. [11:37:43] <_joe_> (and btw, you're pretty unlucky, from my calculations one request every 100K is failing at the moment) [11:38:23] Yeah, huge drop here https://grafana.wikimedia.org/dashboard/db/varnish-http-errors?panelId=9&fullscreen [11:39:04] <_joe_> oh that's graphite lag :) [11:39:18] <_joe_> don't trust the last data point(s) ever [11:39:31] Ah ok [11:39:46] Learning from all of this. :D [11:41:54] my issues are a false hearring, I forgot that kibana does not "AND" "-"terms [11:43:59] I mean the original issues exist, but they are not causing user facing errors [11:59:21] PROBLEM - puppet last run on mw1060 is CRITICAL: CRITICAL: Puppet has 1 failures [12:26:30] RECOVERY - puppet last run on mw1060 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [12:33:56] (03PS1) 10DCausse: Enable 'popqual' (quality+pageviews) scoring method for the completion suggester [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272963 (https://phabricator.wikimedia.org/T127943) [12:34:32] (03CR) 10jenkins-bot: [V: 04-1] Enable 'popqual' (quality+pageviews) scoring method for the completion suggester [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272963 (https://phabricator.wikimedia.org/T127943) (owner: 10DCausse) [12:36:27] (03PS2) 10DCausse: Enable 'popqual' (quality+pageviews) scoring method for the completion suggester [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272963 (https://phabricator.wikimedia.org/T127943) [12:37:37] (03PS3) 10DCausse: Enable 'popqual' (quality+pageviews) scoring method for the completion suggester [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272963 (https://phabricator.wikimedia.org/T127943) [12:41:19] PROBLEM - DPKG on labmon1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:46:50] ^ labmon1001 is expected, that's an ongoing package 
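[editor's note: _joe_'s "one request every 100K is failing" figure makes the reload behaviour reported here easy to check. Assuming failures are independent of URL and of each other, as the conversation suggests:]

```python
p = 1 / 100_000   # per-request failure rate quoted in the log

# Expected requests before seeing one 503:
print(round(1 / p))   # 100000

# Chance that reloading a failed page fails again (independent failures):
print(p)              # 1e-05 -- consistent with "refresh always helps"

# Chance of at least one 503 across k page views:
def p_at_least_one(k, p=p):
    return 1 - (1 - p) ** k

print(round(p_at_least_one(10_000), 3))   # ~0.095: roughly 10% after 10K views
```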
update [13:02:51] PROBLEM - restbase endpoints health on restbase2001 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.192.16.152, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [13:03:09] PROBLEM - cassandra-b CQL 10.192.16.163:9042 on restbase2001 is CRITICAL: Connection refused [13:03:41] PROBLEM - cassandra-c CQL 10.192.16.164:9042 on restbase2001 is CRITICAL: Connection refused [13:03:50] PROBLEM - cassandra-a service on restbase2001 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [13:04:20] PROBLEM - cassandra-a CQL 10.192.16.162:9042 on restbase2001 is CRITICAL: Connection refused [13:04:29] PROBLEM - cassandra-c service on restbase2001 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed [13:04:30] PROBLEM - cassandra-b service on restbase2001 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed [13:04:30] PROBLEM - Restbase root url on restbase2001 is CRITICAL: Connection refused [13:04:31] that's me ^ [13:04:39] expired downtime.. 
[13:14:40] RECOVERY - cassandra-a service on restbase2001 is OK: OK - cassandra-a is active [13:15:20] RECOVERY - cassandra-c service on restbase2001 is OK: OK - cassandra-c is active [13:15:20] RECOVERY - cassandra-b service on restbase2001 is OK: OK - cassandra-b is active [13:15:20] RECOVERY - Restbase root url on restbase2001 is OK: HTTP OK: HTTP/1.1 200 - 15253 bytes in 0.121 second response time [13:15:22] !log restart cassandra on restbase2001, throttle raid rebuild speed to 8MB/s [13:15:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:15:40] RECOVERY - restbase endpoints health on restbase2001 is OK: All endpoints are healthy [13:16:21] RECOVERY - cassandra-c CQL 10.192.16.164:9042 on restbase2001 is OK: TCP OK - 0.039 second response time on port 9042 [13:17:00] RECOVERY - cassandra-a CQL 10.192.16.162:9042 on restbase2001 is OK: TCP OK - 0.039 second response time on port 9042 [13:17:30] RECOVERY - cassandra-b CQL 10.192.16.163:9042 on restbase2001 is OK: TCP OK - 0.037 second response time on port 9042 [13:20:20] 6Operations, 10ops-codfw: install SSDs in restbase2001-restbase2006 - https://phabricator.wikimedia.org/T127333#2059455 (10fgiunchedi) 5Open>3Resolved [13:21:38] 7Blocked-on-Operations, 6Operations, 10RESTBase-Cassandra: expand raid0 in restbase200[1-6] - https://phabricator.wikimedia.org/T127951#2059458 (10fgiunchedi) [13:23:13] (03PS1) 10Jcrespo: labsdb1003 is a bit overloaded right now, move commonswiki to 1 [puppet] - 10https://gerrit.wikimedia.org/r/272965 [13:24:08] 6Operations, 10media-storage, 13Patch-For-Review: swift upgrade plans - https://phabricator.wikimedia.org/T117972#2059494 (10faidon) OK, my misunderstanding regarding the uid issue then. I'm not too worried about jessie or packaging but I'll concede that moving from 1.13.1 to 2.2 (or 2.6) -and in a rush- inc... 
[13:29:40] (03CR) 10Jcrespo: [C: 032] labsdb1003 is a bit overloaded right now, move commonswiki to 1 [puppet] - 10https://gerrit.wikimedia.org/r/272965 (owner: 10Jcrespo) [13:35:03] 7Blocked-on-Operations, 6Operations, 10RESTBase-Cassandra: expand raid0 in restbase200[1-6] - https://phabricator.wikimedia.org/T127951#2059540 (10fgiunchedi) today I've expanded raid0 on restbase2001 with ``` sfdisk -d /dev/sda | sfdisk /dev/sde mdadm /dev/md0 --add /dev/sde1 mdadm /dev/md1 --add /dev/sde... [13:46:10] !log bump max reconstruction speed on restbase2001 to 12000 T127951 [13:46:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:58:19] (03CR) 10Krinkle: [C: 032] Set $wgResourceBasePath to "/w" for medium wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271710 (https://phabricator.wikimedia.org/T99096) (owner: 10Krinkle) [13:59:06] (03Merged) 10jenkins-bot: Set $wgResourceBasePath to "/w" for medium wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271710 (https://phabricator.wikimedia.org/T99096) (owner: 10Krinkle) [14:02:04] 6Operations, 10Traffic: Increase request limits for GETs to /api/rest_v1/ - https://phabricator.wikimedia.org/T118365#2059647 (10faidon) a:5faidon>3None [14:08:36] !log krinkle@tin Synchronized wmf-config/InitialiseSettings.php: T99096: Enable wmfstatic for medium wikis (duration: 01m 40s) [14:08:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:18:04] 6Operations, 10MediaWiki-General-or-Unknown, 10Traffic: Number of 5xx (mainly 503) errors surge from text caches since February 17th - https://phabricator.wikimedia.org/T127931#2059699 (10ema) We have also noticed an increase in varnish.fetch_chunked as well as a varnish.fetch_close drop. {F3412235} {F3412... 
[14:24:54] 6Operations, 10Mail: From field should reflect the origin domain - https://phabricator.wikimedia.org/T127961#2059706 (10Danny_B) [14:29:44] (03PS1) 10Muehlenhoff: Assign salt grains for cache::misc in esams/codfw/ulsfo and use in debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/272974 [14:29:46] (03PS1) 10Muehlenhoff: Assign salt grains for yubiauth servers and use in debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/272975 [14:32:49] (03PS1) 10BBlack: disable do_gzip on cache_text (experiment) [puppet] - 10https://gerrit.wikimedia.org/r/272976 (https://phabricator.wikimedia.org/T127931) [14:35:30] RECOVERY - DPKG on labmon1001 is OK: All packages OK [14:39:33] (03CR) 10BBlack: [C: 032] disable do_gzip on cache_text (experiment) [puppet] - 10https://gerrit.wikimedia.org/r/272976 (https://phabricator.wikimedia.org/T127931) (owner: 10BBlack) [14:43:11] !log bump max reconstruction speed on restbase2001 to 20000 T127951 [14:43:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:46:04] !log cache_text: -do_gzip experiment live on all [14:46:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:51:08] <_joe_> bblack: it's still a bit early to say it, but the 503s rate seems to be descending [14:51:40] (03PS2) 10Muehlenhoff: Assign salt grains for cache::misc in esams/codfw/ulsfo and use in debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/272974 [14:52:17] 6Operations, 7Puppet, 10Beta-Cluster-Infrastructure, 6Release-Engineering-Team, 13Patch-For-Review: deployment-tin puppet Error 400 on SERVER: Failed to parse template nutcracker/nutcracker.yml.erb - https://phabricator.wikimedia.org/T127845#2055880 (10hashar) That seems to cause the nutcracker.yam file... 
[14:52:34] (03CR) 10Muehlenhoff: [C: 032 V: 032] Assign salt grains for cache::misc in esams/codfw/ulsfo and use in debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/272974 (owner: 10Muehlenhoff) [14:55:16] !log nodetool-a repair -pr on restbase1008 T108611 [14:55:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:57:02] (03PS2) 10Muehlenhoff: Assign salt grains for yubiauth servers and use in debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/272975 [14:57:35] (03CR) 10Muehlenhoff: [C: 032 V: 032] Assign salt grains for yubiauth servers and use in debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/272975 (owner: 10Muehlenhoff) [15:08:47] PROBLEM - tools-home on tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 323 bytes in 3.941 second response time [15:09:11] ^andrew dns things? [15:09:42] I'm putting it in maint to look at [15:10:51] chasemp: 502 suggests the tools.admin webservice died [15:10:56] nfs seems fine [15:10:58] hm [15:11:26] tools.admin@tools-bastion-05:~$ qstat [15:11:26] error: sge_gethostbyname failed [15:11:27] .... [15:11:29] or not. [15:11:34] (or both) [15:12:19] (03PS1) 10MarcoAurelio: New throttle settings for Edit-a-thon workshop for orwiki (urgent) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272987 (https://phabricator.wikimedia.org/T127599) [15:13:10] and I can't login to tools-grid-master [15:13:17] sorry, no time to figure this out [15:13:40] PROBLEM - Auth DNS for labs pdns on labs-ns2.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [15:13:43] !log krinkle@tin Synchronized php-1.27.0-wmf.14/includes/OutputPage.php: Iad94bb2 (duration: 01m 43s) [15:13:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:15:30] RECOVERY - Auth DNS for labs pdns on labs-ns2.wikimedia.org is OK: DNS OK: 3.022 seconds response time. 
nagiostest.eqiad.wmflabs returns [15:15:36] !log krinkle@tin Synchronized php-1.27.0-wmf.13/includes/OutputPage.php: Iad94bb2 (duration: 01m 50s) [15:15:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:16:26] RECOVERY - tools-home on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 766290 bytes in 5.327 second response time [15:18:44] !log krinkle@tin Synchronized php-1.27.0-wmf.14/extensions/wikihiero: Ia0990f5f (duration: 01m 33s) [15:18:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:20:43] !log krinkle@tin Synchronized php-1.27.0-wmf.13/extensions/wikihiero: Ia0990f5f (duration: 01m 33s) [15:20:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:22:03] 6Operations, 10Ops-Access-Requests: Requesting access to stat1002 (Hue / Hive) for bmansurov - https://phabricator.wikimedia.org/T113069#2059987 (10Ottomata) Wha? That is very strange. I had to recreate some Hue accounts yesterday, and I did it from the list of users in `analytics-privatedata-users`. I didn... [15:23:36] (03PS1) 10Ottomata: Add bmansurov to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/272988 (https://phabricator.wikimedia.org/T113069) [15:23:36] (03PS1) 10Ottomata: Add bmansurov to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/272988 (https://phabricator.wikimedia.org/T113069) [15:24:15] (03CR) 10Ottomata: [C: 032 V: 032] Add bmansurov to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/272988 (https://phabricator.wikimedia.org/T113069) (owner: 10Ottomata) [15:24:16] (03CR) 10Ottomata: [C: 032 V: 032] Add bmansurov to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/272988 (https://phabricator.wikimedia.org/T113069) (owner: 10Ottomata) [15:28:03] I assume a lot of people have already reported this, but a lot of random 503's have started appearing in the past few days. 
[15:28:26] we have had some reports, guestbird11, but if you have details we would love to have them [15:28:42] urls and times [15:28:59] Request from 10.20.0.107 via cp1054 cp1054 ([10.64.32.106]:3128), Varnish XID 3878459127 Forwarded for: 92.241.155.192, 10.20.0.114, 10.20.0.114, 10.20.0.107 Error: 503, Service Unavailable at Wed, 24 Feb 2016 15:25:54 GMT [15:29:04] URL: https://en.wikipedia.org/wiki/Sideman [15:29:19] It happens on various wikis and on various URLs, by the way. [15:29:33] how often do you see it, rough guess? [15:29:40] guestbird11: Does it typically happen twice on the same url when you reload? [15:29:42] several times in a day, I mean? [15:29:55] Krinkle: Doesn't happen twice. [15:30:44] Krinkle: Not normally, anyway. [15:30:44] <_joe_> Krinkle: there is a ticket already, we're investigating the source of such an increase [15:30:55] 6Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting access to stat1002 (Hue / Hive) for bmansurov - https://phabricator.wikimedia.org/T113069#2060012 (10Ottomata) `/user/bmansurov` exists in HDFS, so he definitely had access at some time. Anyway, done. Try now. [15:31:28] <_joe_> Krinkle: https://phabricator.wikimedia.org/T127931 [15:31:38] apergos: I'd say, like... 10ish a day? I haven't really kept count. [15:31:43] ok [15:31:56] gives us an idea, that's helpful. thanks! [15:33:07] bd808: I'm getting ready for elasticsearch upgrade this evening [15:34:01] bd808: I've had a look again at logstash-beta and it looks good. If there was an issue, how would I know? Where would people be screaming? [15:35:16] gehel: if there is data since the version upgrade things are probably fine. 
[15:35:28] so things are probably fine ;-) [15:35:40] sadly Logstash has been broken in beta cluster for days at a time before and nobody noticed :/ [15:35:55] and I have not seen anyone screaming in #wikimedia-labs [15:36:03] gehel: yeah I think you are good to go with the upgrade in prod [15:36:47] (03PS1) 10Filippo Giunchedi: restbase: move test/staging to its own cluster [puppet] - 10https://gerrit.wikimedia.org/r/272989 (https://phabricator.wikimedia.org/T103124) [15:36:47] (03PS1) 10Filippo Giunchedi: restbase: move test/staging to its own cluster [puppet] - 10https://gerrit.wikimedia.org/r/272989 (https://phabricator.wikimedia.org/T103124) [15:36:51] I'm used to much heavier process, and I used to be the cowboy... I need some adjustment [15:36:58] 6Operations, 10Traffic, 7HTTPS: SSL cert needed for benefactorevents.wikimedia.org - https://phabricator.wikimedia.org/T115028#2060021 (10RobH) In the past, all fundraising keys/certs are generated/stored and stored on boron, NOT in our ops private/public puppet repos. Since this is being used on a third pa... [15:37:27] gehel: we do "just enough process" except on the days we do "not quiet enough process" ;) [15:37:41] *quite [15:38:02] the question is not how much, but is it the right process... [15:39:08] thanks for your help, I'll push this in prod this evening ... 
[15:39:45] <_joe_> gehel: hehe everyone does :) [15:43:29] (03PS1) 10MarcoAurelio: Set transwiki import sources for hi.wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272992 (https://phabricator.wikimedia.org/T127593) [15:43:29] (03PS1) 10MarcoAurelio: Set transwiki import sources for hi.wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272992 (https://phabricator.wikimedia.org/T127593) [15:45:33] (03PS2) 10Dzahn: Fix role::programdashboard include and hieradata [puppet] - 10https://gerrit.wikimedia.org/r/272906 (owner: 10Dduvall) [15:45:33] (03PS2) 10Dzahn: Fix role::programdashboard include and hieradata [puppet] - 10https://gerrit.wikimedia.org/r/272906 (owner: 10Dduvall) [15:46:06] grrrit-wm: tell your brother it's ok if he goes home [15:46:29] (03CR) 10Dzahn: [C: 032] Fix role::programdashboard include and hieradata [puppet] - 10https://gerrit.wikimedia.org/r/272906 (owner: 10Dduvall) [15:46:30] (03CR) 10Dzahn: [C: 032] Fix role::programdashboard include and hieradata [puppet] - 10https://gerrit.wikimedia.org/r/272906 (owner: 10Dduvall) [15:46:39] /kick [15:48:30] 6Operations, 10Traffic, 7HTTPS: SSL cert needed for benefactorevents.wikimedia.org - https://phabricator.wikimedia.org/T115028#2060059 (10Jgreen) Generating/storing in the normal repo sounds good to me. [15:49:06] mutante: I think I know how to start/stop grrrit. I'll see if I can fix it [15:50:14] bd808: thanks [15:50:17] hmmm.. k8s thinks there is only one instance running [15:51:24] 7Blocked-on-Operations, 10RESTBase, 13Patch-For-Review: Separate metrics, logs, and monitoring between staging and production - https://phabricator.wikimedia.org/T103124#2060074 (10fgiunchedi) https://gerrit.wikimedia.org/r/272989 is a patch to move things into their own `$cluster` (as puppet calls it), save... 
[15:51:43] 6Operations, 10MediaWiki-Logging, 10Wikimedia-IRC-RC-Server, 10Wikimedia-Stream, and 3 others: Verify that logs, irc, rcstream changes can flow from codfw to eqiad - https://phabricator.wikimedia.org/T126472#2060075 (10Gehel) [15:54:35] 6Operations, 10CirrusSearch, 6Discovery, 3Discovery-Search-Sprint, and 3 others: Look into encrypting Elasticsearch traffic - https://phabricator.wikimedia.org/T124444#2060092 (10Gehel) [15:57:00] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:57:00] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:57:13] <_joe_> uhh this too, now? [15:58:44] mutante: I killed one. The other is an orphan on the k8s grid that we don't know how to manage. :/ [15:58:49] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [15:58:49] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [15:59:25] so if and when grrrit-wm dies somebody will have to start it manually -- https://wikitech.wikimedia.org/wiki/Grrrit-wm -- that needs a labs root at the moment [15:59:33] <_joe_> bd808: what is an orphan? [15:59:54] _joe_: get pods doesn't know it exists [16:00:04] anomie ostriches thcipriani marktraceur Krenair: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160224T1600). Please do the needful. [16:00:05] _joe_ aude bd808 mafk: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. 
[16:00:12] o/ [16:00:14] <_joe_> bd808: I have swat now, sorry [16:00:17] o/ [16:00:23] I can take today's [16:00:34] I am merely lurking [16:00:34] my patch got deployed last night thanks to ostriches [16:00:40] 6Operations, 10CirrusSearch, 6Discovery, 3Discovery-Search-Sprint, and 3 others: Look into encrypting Elasticsearch traffic - https://phabricator.wikimedia.org/T124444#2060119 (10EBernhardson) WRT elasticsearch's internal (non-GPL) product all they do is enable TLS for the inter-node (port 9300) traffic an... [16:00:47] was in a 1/1 with Tyler he will show up soonish i guess [16:01:06] bd808: Hehe, maybe I should do all morning config swats the night before so everyone gets a happy surprise when they wake up :p [16:01:34] _joe_: You're going last, most likely to Break Stuff :) [16:01:58] (03CR) 10Chad: [C: 032] New throttle settings for Edit-a-thon workshop for orwiki (urgent) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272987 (https://phabricator.wikimedia.org/T127599) (owner: 10MarcoAurelio) [16:01:58] <_joe_> ahah ok fair enough [16:02:20] <_joe_> I was about to propose to release to mw1017 first [16:02:24] (03Merged) 10jenkins-bot: New throttle settings for Edit-a-thon workshop for orwiki (urgent) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272987 (https://phabricator.wikimedia.org/T127599) (owner: 10MarcoAurelio) [16:02:44] (03CR) 10Chad: [C: 032] Don't yet allow wikidatasparql graph urls [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271336 (owner: 10Aude) [16:02:55] _joe_: still getting 503's. 
:) [16:03:12] _joe_: Either way, the other ones are absolutely trivial :) [16:03:15] (CC: bblack) [16:03:17] (03PS1) 10Krinkle: Revert "Set $wgResourceBasePath to "/w" for medium wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272994 [16:03:22] (03CR) 10Krinkle: [C: 032] Revert "Set $wgResourceBasePath to "/w" for medium wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272994 (owner: 10Krinkle) [16:03:35] <_joe_> sjoerddebruin: we're trying to pin down the problem [16:03:40] (03Merged) 10jenkins-bot: Don't yet allow wikidatasparql graph urls [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271336 (owner: 10Aude) [16:03:56] Yeah, just a update because we were so optimistic. [16:04:22] (03Merged) 10jenkins-bot: Revert "Set $wgResourceBasePath to "/w" for medium wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272994 (owner: 10Krinkle) [16:04:39] ostriches: per _security, rolling back wmfstatic /w change [16:05:56] swat :) [16:05:58] Krinkle: Ok. I've got some stuff staged to tin, I haven't sync'd it yet. [16:06:07] CommonSettings* and throttle.php [16:06:58] Krinkle: I can push you with swat or did you want to? [16:07:10] ostriches: go ahead [16:07:12] * Krinkle has a meeting [16:08:22] 6Operations, 7Puppet, 7Documentation, 7Need-volunteer: document all puppet classes / defined types!? 
- https://phabricator.wikimedia.org/T127797#2060144 (10Dzahn) [16:08:24] 6Operations, 13Patch-For-Review: align puppet-lint config with coding style - https://phabricator.wikimedia.org/T93645#2060143 (10Dzahn) [16:09:40] !log demon@tin Synchronized wmf-config/InitialiseSettings.php: Revert "Set $wgResourceBasePath to "/w" for medium wikis" (duration: 01m 30s) [16:09:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:10:53] Krinkle: ^^^ [16:11:00] _joe_: bblack ^ [16:11:14] <_joe_> Krinkle: yeah I'm here :) [16:11:48] (03CR) 10Chad: "One (probably minor) inline question, but otherwise good to go in today's swat." (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269955 (owner: 10Giuseppe Lavagetto) [16:11:53] _joe_: ^^ [16:12:02] Before I merge, just wanna make sure I'm clear on one thing [16:12:44] !log demon@tin Synchronized wmf-config/throttle.php: New throttle settings for Edit-a-thon workshop for orwiki (urgent) (duration: 01m 29s) [16:12:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:12:49] <_joe_> ostriches: uh? I guess a rebase issue? [16:12:50] mafk: you're live now ^^^ [16:13:05] ostriches: thanks [16:13:09] <_joe_> ostriches: let me check 1 sec [16:13:12] hope it works [16:13:53] ostriches: did you deploy my patch yet? [16:14:15] It's syncing now [16:14:17] <_joe_> ostriches: I'm checking the patch right now [16:14:20] i'm trying to setup a graph on test.wikipedia just to verify... 
[16:14:21] ok [16:14:34] _joe_: ty [16:14:46] https://test.wikipedia.org/wiki/User:Aude/graph is the same as https://www.mediawiki.org/wiki/User:Aude/Graph [16:14:57] but doesn't work on test.wikipedia and wonder why [16:15:22] oh, i am missing some data [16:16:05] !log demon@tin Synchronized wmf-config/CommonSettings.php: Don't yet allow wikidatasparql graph urls (duration: 01m 37s) [16:16:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:16:16] aude: And you're live ^ [16:16:26] ok [16:16:49] my graph on mediawiki.org looks the same... ok [16:17:04] and will have to check beta in a bit [16:18:04] _joe_: I'll do yours in a few minutes when you're done working on that patch. Puppy woke up and need to take him out before he pees the carpet. [16:18:06] Back in 5-10 [16:18:11] https://test.wikipedia.org/wiki/User:Aude/graph works now [16:18:16] thanks ostriches [16:18:28] <_joe_> ostriches: ok :P [16:18:29] 6Operations: upgrade 15+4 swift servers from precise to trusty - https://phabricator.wikimedia.org/T125024#2060218 (10fgiunchedi) [16:19:41] that's why I prefer cats :) [16:19:45] 6Operations, 10media-storage, 13Patch-For-Review: swift upgrade plans - https://phabricator.wikimedia.org/T117972#2060227 (10fgiunchedi) sounds good @faidon ! I've retitled {T125024} accordingly (i.e. dist-upgrade to trusty). My plan would be to resume jessie work beginning of next quarter, possibly this qua... [16:19:47] hah :) [16:19:49] =^^= [16:21:49] (03PS6) 10Giuseppe Lavagetto: Rationalize services definitions for labs too. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269955 [16:23:41] mafk: Done the cat thing, twice. I prefer my dog :P [16:24:47] _joe_: lgtm now [16:25:42] (03CR) 10Chad: [C: 032] Rationalize services definitions for labs too. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/269955 (owner: 10Giuseppe Lavagetto) [16:25:43] <_joe_> ostriches: I propose we do merge on tin, I sync-common on mw1017 and we check nothing explodes in prod [16:25:47] Yeah let's. [16:25:57] And that'll give it some time to sync to beta too and not blow up hopefully [16:26:13] <_joe_> ostriches: ok [16:26:27] (03Merged) 10jenkins-bot: Rationalize services definitions for labs too. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269955 (owner: 10Giuseppe Lavagetto) [16:26:47] 6Operations, 10media-storage, 13Patch-For-Review: swift upgrade plans: jessie and swift 2.x - https://phabricator.wikimedia.org/T117972#2060246 (10fgiunchedi) [16:27:00] <_joe_> ostriches: tell me when I'm ok to sync on mw1017 [16:27:31] (03PS17) 10Chad: Define service entries for InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266510 (https://phabricator.wikimedia.org/T114273) (owner: 10Giuseppe Lavagetto) [16:27:54] A clean rebase is a happy rebase :D [16:28:40] <_joe_> ostriches: did you merge on tin? [16:28:43] Not yet [16:28:48] <_joe_> I'd go one patch at a time maybe? [16:29:06] <_joe_> but well, it will be obvious what broke what [16:29:08] Yeah, first one is now on tin [16:29:13] <_joe_> ok thanks [16:31:48] <_joe_> ostriches: I didn't find obvious errors navigating on mw1017, but let me check logstash [16:32:31] (03PS1) 10BBlack: Revert "disable do_gzip on cache_text (experiment)" [puppet] - 10https://gerrit.wikimedia.org/r/272995 [16:32:33] (03PS1) 10BBlack: disable keepalives on cache_text T127931 [puppet] - 10https://gerrit.wikimedia.org/r/272996 [16:32:36] <_joe_> the only logs I see is of some shady user GLavagetto (WMF) logging in [16:32:45] (03CR) 10BBlack: [C: 032 V: 032] Revert "disable do_gzip on cache_text (experiment)" [puppet] - 10https://gerrit.wikimedia.org/r/272995 (owner: 10BBlack) [16:33:10] _joe_: fatalmonitor shows nothing for mw1017 for me.
[16:33:14] <_joe_> how can I check beta? [16:33:31] <_joe_> better, how to check that beta picked up the change? [16:33:35] We have logstash there but I always forget the URL [16:33:58] Jenkins commented on the change [16:33:59] https://logstash-beta.wmflabs.org [16:34:02] Post-merge build succeeded. [16:34:03] beta-mediawiki-config-update-eqiad SUCCESS Change has been deployed on the EQIAD beta cluster in 1s [16:34:05] very tricky url [16:34:26] <_joe_> I assume that beta would be completely broken by that change if it went wrong [16:34:44] Yeah prolly :P [16:34:46] _joe_: http://deployment.wikimedia.beta.wmflabs.org/wiki/Main_Page works [16:35:10] Search on en.wiki.beta works, which would've likely broken :p [16:35:51] <_joe_> yeah in prod too [16:35:59] <_joe_> (would've likely been broken) [16:36:10] Ok, let's sync-dir all of this out then, or try patch 2 next [16:36:30] <_joe_> up to you [16:37:09] The former, I'm being paranoid and don't want to stack all 3 changes for prod at once. [16:37:23] <_joe_> I was about to suggest the same :) [16:38:42] !log demon@tin Synchronized wmf-config/: Rationalize services definitions for labs too. (duration: 01m 45s) [16:38:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:39:25] _joe_: Unrelated, but redis/nutcracker on mw1099 is complaining, loudly.
[16:39:25] !log +do_gzip done for all cache_text [16:39:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:39:30] (was just prior to us doing this too) [16:40:01] <_joe_> ostriches: I'll take a look, while you look at the second patch [16:40:46] (03CR) 10Chad: [C: 032] Define service entries for InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266510 (https://phabricator.wikimedia.org/T114273) (owner: 10Giuseppe Lavagetto) [16:41:15] (03Merged) 10jenkins-bot: Define service entries for InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266510 (https://phabricator.wikimedia.org/T114273) (owner: 10Giuseppe Lavagetto) [16:41:19] RECOVERY - nutcracker process on mw1099 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [16:41:27] <_joe_> !log started nutcracker on mw1099 [16:41:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:41:41] <_joe_> uh already merged? you're bold :P [16:42:32] I had pretty much reviewed it already, was waiting for the other stuff to land before I pulled tho [16:42:33] (03CR) 10BBlack: [C: 032] disable keepalives on cache_text T127931 [puppet] - 10https://gerrit.wikimedia.org/r/272996 (owner: 10BBlack) [16:42:38] <_joe_> :) [16:42:39] RECOVERY - nutcracker port on mw1099 is OK: TCP OK - 0.000 second response time on port 11212 [16:42:56] Ok, staged on tin, running sync-common on mw1017 [16:44:21] <_joe_> Notice: Undefined variable: wmfLocalServices in /srv/mediawiki/wmf-config/InitialiseSettings.php on line 13110 [16:44:24] <_joe_> uhm [16:45:03] ostriches: do you know if we are trying wmf14 again today? or this week? 
[16:45:13] <_joe_> ostriches: the patch has a problem [16:45:20] aude: No, not planning to [16:45:24] ok :/ [16:45:24] I haven't updated the docs yet tho [16:46:13] <_joe_> ostriches: I think I forgot to include ProductionServices.php in InitialiseSettings.php :/ [16:46:29] there is new wikidata code there and it should be fine, yet it's nice to have some of us around when it is deployed [16:46:37] Either that, or it needs to be added to the global declaration at the top of InitialiseSettings. [16:47:11] <_joe_> ostriches: yeah I tend to avoid global variables but maybe it's better? [16:47:30] PROBLEM - puppet last run on cp3010 is CRITICAL: CRITICAL: puppet fail [16:47:44] <_joe_> ostriches: preparing the fix [16:47:58] Ah yeah, it's gotta be global. [16:48:06] InitialiseSettings is included in a function context [16:48:23] <_joe_> yeah, I realized it after seeing the error [16:49:51] 6Operations, 7Puppet, 10Beta-Cluster-Infrastructure, 6Release-Engineering-Team, 13Patch-For-Review: deployment-tin puppet Error 400 on SERVER: Failed to parse template nutcracker/nutcracker.yml.erb - https://phabricator.wikimedia.org/T127845#2060353 (10hashar) 5Resolved>3Open From T127966 A puppet r... [16:50:08] 6Operations, 7Puppet, 10Beta-Cluster-Infrastructure, 6Release-Engineering-Team, 13Patch-For-Review: deployment-tin puppet Error 400 on SERVER: Failed to parse template nutcracker/nutcracker.yml.erb - https://phabricator.wikimedia.org/T127845#2060368 (10hashar) [16:51:11] RECOVERY - puppet last run on cp3010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:52:46] 6Operations, 10ops-codfw: es2011-es2019 have default RAID stripe - https://phabricator.wikimedia.org/T127938#2060391 (10Volans) I've spoken with @Papaul and reimaging is necessary (he tested on es2012 whether it was avoidable). es2011 - es2019 all ready to be reimaged from the service point of view, they were never in...
[16:52:58] (03PS1) 10Giuseppe Lavagetto: Add global declarations for InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/273002 [16:53:07] !log https://wmflabs.org/sal/production missing SAL data since 2016-02-21T14:39 due to bot crash; needs to be backfilled from wikitech data [16:53:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:53:18] <_joe_> ostriches: ^^ [16:53:39] <_joe_> ostriches: sadly, I have a meeting in 7 minutes, do you think it's enough? [16:54:03] Eh, I'd move it to the top of InitialiseSettings with the other ones tbh. [16:54:04] I'll amend [16:54:15] <_joe_> oh, sorry :) [16:54:55] (03PS1) 10Andrew Bogott: Glance policy.json: Give glanceadmin a bunch more rights [puppet] - 10https://gerrit.wikimedia.org/r/273003 (https://phabricator.wikimedia.org/T127755) [16:55:39] _joe_: Hm.. not sure it makes sense to include global variables in values within InitialiseSettings.php, [16:55:45] We usually do that in CommonSettings.php [16:56:01] Mainly for clarity (it's not gonna vary by wiki, right?) [16:56:02] (03PS2) 10Chad: Add global declarations for InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/273002 (owner: 10Giuseppe Lavagetto) [16:56:08] and also soft-enforced now because it is cached [16:56:21] If InitialiseSettings does not change, it values remain used from cache [16:56:24] Krinkle: We already do. [16:56:27] <_joe_> Krinkle: nope, one could argue that those declarations should've been in commonsettings and not in InitialiseSettings [16:56:31] So chanvging PrdServies.php wouldn't work, right? [16:56:34] global $wmfUdp2logDest, $wmfDatacenter, $wmfRealm, $wmfConfigDir, $wgConf was already there. 
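The scoping trap behind the fix above — a file include'd inside a PHP function body sees only that function's locals, so file-scope values like $wmfLocalServices are undefined unless pulled in with `global` — can be sketched with a Python analogue. This is illustrative only, not the actual mediawiki-config code: config code executed in a fresh namespace cannot see the caller's variables unless they are injected explicitly.

```python
# Python analogue of PHP's include-inside-a-function scoping: the
# config snippet runs in whatever namespace we hand it, so names
# from the caller's scope are invisible unless injected -- the same
# trap that made $wmfLocalServices undefined when the settings file
# was include'd inside a function without a `global` declaration.
# (All names here are illustrative.)
CONFIG_CODE = "site_name = prefix + 'wiki'"

def load_config_broken():
    ns = {}  # fresh namespace: no 'prefix' in sight
    try:
        exec(CONFIG_CODE, ns)
    except NameError:
        return None  # analogous to PHP's "Undefined variable" notice
    return ns.get("site_name")

def load_config_fixed(prefix):
    ns = {"prefix": prefix}  # dependency injected explicitly
    exec(CONFIG_CODE, ns)
    return ns["site_name"]
```

The merged patch takes the second route: it names the new variables in the `global` declaration so the included file resolves them from file scope.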
[16:56:45] (03PS2) 10Andrew Bogott: Glance policy.json: Give glanceadmin a bunch more rights [puppet] - 10https://gerrit.wikimedia.org/r/273003 (https://phabricator.wikimedia.org/T127755) [16:57:37] !log es200[1-9] disabling /revoking puppet and salt keys for re-image [16:57:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:58:22] papaul: ^^^^those are es2011-es2019 right? [16:58:27] (03CR) 10Andrew Bogott: [C: 032] Glance policy.json: Give glanceadmin a bunch more rights [puppet] - 10https://gerrit.wikimedia.org/r/273003 (https://phabricator.wikimedia.org/T127755) (owner: 10Andrew Bogott) [16:58:31] yep [16:58:40] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/). [16:58:43] the message says 2001-2009 :) [16:58:57] 6Operations, 7Puppet, 10Beta-Cluster-Infrastructure, 6Release-Engineering-Team, 13Patch-For-Review: deployment-tin puppet Error 400 on SERVER: Failed to parse template nutcracker/nutcracker.yml.erb - https://phabricator.wikimedia.org/T127845#2060445 (10hashar) Looking at the nutcracker erb template on ht... [16:59:06] oh shut up icinga-wm I know [16:59:12] !log es201[1-9] disabling /revoking puppet and salt keys for re-image [16:59:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:00:20] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [17:02:29] Krinkle: You're probably right re: prodservices not invalidating InitialiseSettings. But I think this is already a general problem with InitialiseSettings the way it's being used already... [17:02:42] $wmfUdp2logDest already does that, for example [17:02:58] ostriches: most variables used there currently are mostly constants used there because they vary by wiki. 
So it's more likely the use changes than the variable [17:03:10] whereas now, it's much more likely we'll change the variables themselves. [17:03:17] Anytime we pool/depool change services etc. [17:03:35] Although tbh, I don't think the mtime check on InitialiseSettings is all that useful. We should invalidate config anytime wmf-config/* changes [17:03:39] 6Operations, 7Puppet, 10Beta-Cluster-Infrastructure, 6Release-Engineering-Team, 13Patch-For-Review: deployment-tin puppet Error 400 on SERVER: Failed to parse template nutcracker/nutcracker.yml.erb - https://phabricator.wikimedia.org/T127845#2060482 (10hashar) hieradata/labs/deployment-prep/common.yaml d... [17:03:52] Not *just* Initialise [17:04:12] <_joe_> Krinkle: if the variable is a global, won't it be getting the new value? [17:04:37] <_joe_> well, laters, we're both in a meeting [17:05:37] Imma actually change how we handle that mtime generally. [17:06:04] (03CR) 1020after4: [C: 031] Beta: Move deployment server [puppet] - 10https://gerrit.wikimedia.org/r/270343 (https://phabricator.wikimedia.org/T126377) (owner: 10Thcipriani) [17:06:06] (03CR) 10Chad: [C: 032] Add global declarations for InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/273002 (owner: 10Giuseppe Lavagetto) [17:06:35] _joe_: Nope, it caches the variable assigned to the string. $foo = 'a' and $b = 'a' + $foo = $b is the same. They both assign the value. [17:06:41] (03Merged) 10jenkins-bot: Add global declarations for InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/273002 (owner: 10Giuseppe Lavagetto) [17:06:51] There is no object embedded reference or something. 
[17:07:09] And either way, unlike the existing globals used there, these really don't vary by wiki [17:07:13] <_joe_> Krinkle: I wasn't sure how the bytecode cache in hhvm worked [17:07:15] 6Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting access to stat1002 (Hue / Hive) for bmansurov - https://phabricator.wikimedia.org/T113069#2060486 (10bmansurov) Thanks, I'm able to login now. [17:07:20] _joe_: not bytecode cache [17:07:24] you're right that'll work fine [17:07:29] 6Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting access to stat1002 (Hue / Hive) for bmansurov - https://phabricator.wikimedia.org/T113069#2060487 (10bmansurov) 5Open>3Resolved [17:07:39] so we can move them to CommonSettings, there's no benefit to having them in InitialiseSettings [17:08:23] <_joe_> Krinkle: +1 [17:08:34] _joe_: It's expensive to process $wgConf and figure out all the groups and tags (default:x + enwiki:y + wikipedia:z = y, private:a -> 'y' for enwiki, 'a' for officewiki etc.) [17:08:39] So that is cached in a /tmp file [17:08:51] <_joe_> Krinkle: oh I didn't know [17:09:00] (03PS1) 10Chad: Invalidate InitialiseSettings cache anytime config changes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/273004 [17:09:04] It's a few lines down from the include [17:09:20] 6Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting access to stat1002 (Hue / Hive) for bmansurov - https://phabricator.wikimedia.org/T113069#1654028 (10bmansurov) I'm not sure if I should create another task, but now I'm getting the following error: "The database list cannot be loaded." when... 
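The caching Krinkle describes — resolving $wgConf's per-wiki groups and tags is expensive, so the result is written to a /tmp file and reused until the source file's mtime moves forward — follows a common pattern. A minimal sketch in Python; the file names, JSON format, and function signature are assumptions, not the actual wmfLoadInitialiseSettings() implementation:

```python
import json
import os

def load_settings(src_path, cache_path, compute):
    """Serve settings from a cache file unless the source has a
    newer mtime, in which case recompute and rewrite the cache."""
    if (os.path.exists(cache_path)
            and os.path.getmtime(cache_path) >= os.path.getmtime(src_path)):
        with open(cache_path) as f:
            return json.load(f)
    settings = compute()  # the expensive per-wiki resolution step
    with open(cache_path, "w") as f:
        json.dump(settings, f)
    return settings
```

The weakness raised in the channel applies to any such scheme: the check watches only src_path, so edits to other files that feed compute() go unnoticed until src_path itself changes — hence the patch to invalidate whenever anything under wmf-config/ changes.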
[17:09:53] _joe_: In fact, I don't think the wmfLoadInitialiseSettings() is used at all in prod [17:10:06] Not for the main execution context, only for maintenance scripts that invoke getConfig [17:10:23] Look at line 149-160 in CommonSettings [17:10:46] ostriches: :) [17:10:48] <_joe_> Krinkle: oh ok :P [17:11:31] Lmfao [17:11:54] <_joe_> sorry, I was doing things in a hurry [17:12:08] (03PS1) 10Krinkle: Re-apply "Set $wgResourceBasePath to /w for medium wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/273005 [17:12:13] _joe_: I did sync-common your change to mw1017 so the warnings disappeared. [17:12:15] (03CR) 10Krinkle: [C: 032] Re-apply "Set $wgResourceBasePath to /w for medium wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/273005 (owner: 10Krinkle) [17:13:00] <_joe_> ostriches: yup looks like that [17:13:02] ostriches: may I deploy ^ (per bblack in -sec) [17:13:09] (03Merged) 10jenkins-bot: Re-apply "Set $wgResourceBasePath to /w for medium wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/273005 (owner: 10Krinkle) [17:13:54] Bleh Imma roll you back for just a sec from tin [17:14:03] I wanna push the stuff I have half-staged for _joe_ first [17:14:06] Then we can do you [17:14:55] k, no rush [17:15:30] 6Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting access to stat1002 (Hue / Hive) for bmansurov - https://phabricator.wikimedia.org/T113069#2060507 (10Ottomata) Saved queries! @bmansurov I did not know that people actually used this, I am sorry. Yesterday we upgraded Hue (and all of the re... [17:16:57] !log demon@tin Synchronized wmf-config/: service entries for initialisesettings + fix (duration: 01m 45s) [17:17:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:17:24] oh my, ostriches, I somehow failed to realise until this very moment in time that directories in Linux auto propagate from all files/dirs inside. [17:17:28] That's neat [17:17:32] Assuming it works.
[17:17:37] Which Linux manuals says it does [17:17:39] I never realised [17:18:04] You mean re: my mtime patch for wmf-config/.? [17:18:04] :) [17:18:09] Yeah [17:18:13] I was like. no way that works [17:18:20] Google said it would :p [17:18:24] Yep [17:18:37] (03PS8) 10Chad: Reduce poolcounter configuration complexity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266511 (https://phabricator.wikimedia.org/T114273) (owner: 10Giuseppe Lavagetto) [17:19:08] (03PS1) 10Ori.livneh: Fully-qualify EventLoggingBaseUri [mediawiki-config] - 10https://gerrit.wikimedia.org/r/273006 (https://phabricator.wikimedia.org/T127209) [17:19:14] Well, I didn't check for php/hhvm. But for POSIX/Linux in general, yes. [17:19:46] Well filemtime() should just fall back on the C call, nothing particularly special about it in PHP-land afaik. [17:19:57] So it should Just Work The Way We'd Hope :) [17:20:20] (03CR) 10Krinkle: [C: 04-1] Fully-qualify EventLoggingBaseUri (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/273006 (https://phabricator.wikimedia.org/T127209) (owner: 10Ori.livneh) [17:20:49] ostriches: You say that with so much confidence. That's never gone wrong in PHP. [17:20:59] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. [17:21:00] !log demon@tin Synchronized wmf-config/InitialiseSettings.php: Re-apply "Set $wgResourceBasePath to /w for medium wikis" (duration: 01m 42s) [17:21:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:21:08] 😂 [17:21:13] Krinkle: You live ^ [17:22:33] <_joe_> ostriches: I'm in a meeting, if you don't feel like merging the pc change it's ok [17:22:37] Crap. [17:22:44] Breakage in the previous one. 
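One caveat on the directory-mtime trick discussed above: per POSIX, a directory's mtime advances only when entries are added, removed, or renamed in it — editing an existing file in place does not touch the parent directory. Deploy tools that write a temp file and rename it over the target (as rsync does by default) do update the directory, which is why a filemtime() check on wmf-config/ works for synced changes. A quick check, with Python's os.path.getmtime standing in for PHP's filemtime():

```python
import os
import tempfile
import time

d = tempfile.mkdtemp()
t0 = os.path.getmtime(d)
time.sleep(1.1)  # outlast coarse mtime granularity on some filesystems

# 1. Creating a new entry bumps the directory's mtime.
path = os.path.join(d, "CommonSettings.php")
with open(path, "w") as f:
    f.write("<?php\n")
t1 = os.path.getmtime(d)
assert t1 > t0

# 2. Appending to the existing file does NOT touch the directory.
time.sleep(1.1)
with open(path, "a") as f:
    f.write("// edited in place\n")
assert os.path.getmtime(d) == t1

# 3. An atomic replace (write temp, rename over target) does bump it,
#    because the rename rewrites the directory entry.
time.sleep(1.1)
tmp = path + ".tmp"
with open(tmp, "w") as f:
    f.write("<?php // replaced\n")
os.replace(tmp, path)
assert os.path.getmtime(d) > t1
```

So "directories auto propagate from all files/dirs inside" holds for entry changes, not for in-place writes — fine for rename-based syncs, a silent miss for direct edits on the deploy host.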
[17:23:19] 6Operations, 15User-mobrovac, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Create a service location / discovery system for locating local/master resources easily across all WMF applications - https://phabricator.wikimedia.org/T125069#1973660 (10GWicke) Some comments: ## Network level switching While the...
[17:23:27] <_joe_> ostriches: what's wrong?
[17:23:52] Err, maybe it's just 2 nodes?
[17:24:00] "Notice: Undefined variable: wmgLogstashServers in /srv/mediawiki/wmf-config/logging.php on line 234"
[17:24:17] <_joe_> where do you read that?
[17:24:23] fatalmonitor.
[17:24:50] <_joe_> and yes, two hosts....
[17:25:24] Although they died off, wonder if rsync burped temporarily
[17:25:43] <_joe_> maybe the next sync from timo fixed it?
[17:25:48] <_joe_> yes
[17:26:00] <_joe_> 18:21 < logmsgbot> !log demon@tin Synchronized wmf-config/InitialiseSettings.php:
[17:26:08] No, that was unrelated.
[17:26:21] Er, unrelated change but maybe made those 2 catch up
[17:26:50] <_joe_> yeah
[17:26:59] <_joe_> it's a bug with the stat_cache/inotify
[17:27:11] (03PS1) 10Aaron Schulz: Enable async swift writes to all wikis except commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/273007
[17:29:45] _joe_: I'm reviewing the poolcounter one now. Also gonna wait another few mins for that last batch of errors to drop off before I sync so I can better see a spike in anything else.
[17:29:51] 6Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting access to stat1002 (Hue / Hive) for bmansurov - https://phabricator.wikimedia.org/T113069#2060618 (10bmansurov) No problem about the saved queries. They are also saved in Phabricator tasks ;)) I'll ping you after my meetings later today. Thanks.
[17:32:16] (03PS2) 10Ori.livneh: Fully-qualify EventLoggingBaseUri [mediawiki-config] - 10https://gerrit.wikimedia.org/r/273006 (https://phabricator.wikimedia.org/T127209)
[17:34:47] (03CR) 10Chad: [C: 032] Reduce poolcounter configuration complexity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266511 (https://phabricator.wikimedia.org/T114273) (owner: 10Giuseppe Lavagetto)
[17:35:13] (03Merged) 10jenkins-bot: Reduce poolcounter configuration complexity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266511 (https://phabricator.wikimedia.org/T114273) (owner: 10Giuseppe Lavagetto)
[17:37:14] 6Operations, 10CirrusSearch, 6Discovery, 3Discovery-Search-Sprint, and 3 others: Look into encrypting Elasticsearch traffic - https://phabricator.wikimedia.org/T124444#2060645 (10EBernhardson) In my head, the biggest open question here is about the cert's. I'm guessing we would use self signed certs for i...
[17:39:33] _joe_: You're on mw1017 with the poolcounter bits
[17:39:45] <_joe_> ostriches: still in the meeting
[17:39:53] Yeah I skipped it to do your patches :p
[17:39:56] <_joe_> ostriches: I can try to do a null edits
[17:42:13] <_joe_> ostriches: seems ok to me
[17:42:19] Yeah same
[17:43:09] Granted if this last one fails, it might not fail with an error, rather fail when concurrent parses to the same article start knocking over apaches like bowling pins
[17:43:22] It'll be like pre-poolcounter all over again :)
[17:44:24] !log demon@tin Synchronized wmf-config/: poolcounter config simplification (duration: 01m 39s)
[17:44:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[17:45:13] No warnings/errors on this one so far
[17:46:33] (03PS1) 10Dzahn: lint: fix some of the few remaining warnings [puppet] - 10https://gerrit.wikimedia.org/r/273009
[17:47:05] (03CR) 10Dzahn: "getting back on this, need to fix path conflicts" [puppet] - 10https://gerrit.wikimedia.org/r/218640 (https://phabricator.wikimedia.org/T102623) (owner: 10Dzahn)
[17:48:28] _joe_: Ok, that's all 3 of your patches (plus a fixup) all live and happy :)
[17:48:49] <_joe_> ostriches: thanks, a couple of those were really really scary :P
[17:51:00] PROBLEM - Disk space on hafnium is CRITICAL: DISK CRITICAL - free space: / 1650 MB (3% inode=97%)
[17:51:59] * ostriches carefully, quietly walks away from logs and keyboard for a few minutes
[17:52:07] ssshhh, don't tell the apaches i'm gone
[17:54:12] sigh, thcipriani I completely forgot about https://gerrit.wikimedia.org/r/#/c/270343/ can be present for tomorrow's puppet swat though
[17:54:19] /srv/mongod and I don't know what's safe to do about those files
[17:54:45] xhprof.nn and most of them around 2.1GB
[17:54:52] re: hafnium space issues
[17:54:57] 6Operations, 10DBA, 10MediaWiki-Configuration, 6Release-Engineering-Team, and 3 others: codfw is in read only according to mediawiki - https://phabricator.wikimedia.org/T124795#2060685 (10jcrespo)
[17:55:00] 6Operations, 6Performance-Team, 7Availability, 7Epic, and 3 others: Cleanup active-DC based MW config code and make it more robust and easy to change - https://phabricator.wikimedia.org/T114273#2060684 (10jcrespo)
[17:56:31] (03PS2) 10Dzahn: lint: fix some of the few remaining warnings [puppet] - 10https://gerrit.wikimedia.org/r/273009
[17:56:50] godog: yeah, only thing on my list for SoS for our team. If we can get it done during puppet swat, that'd work for me: got un-cherry-picked (or overwritten...something) on deployment-puppetmaster yesterday, caused some issues.
[17:56:56] any mongodb folsk around who would know how to safely limit the size of this thing?
[17:57:38] 6Operations, 10DBA, 10MediaWiki-Configuration, 6Release-Engineering-Team, and 3 others: codfw is in read only according to mediawiki - https://phabricator.wikimedia.org/T124795#2060698 (10jcrespo) I'm ok with leaving the master databases pointing to the original ones but that is 1) more reasons to create a...
[17:58:03] (03CR) 10Dzahn: [C: 032] "all for T93645 but i dont wanna spam too much" [puppet] - 10https://gerrit.wikimedia.org/r/273009 (owner: 10Dzahn)
[17:58:42] ori: ^ that should be xhgui (mongo and disk space on hafnium)
[17:59:34] yep that is
[17:59:42] I just don't know how to solve the space issue
[17:59:59] oh ori owns that, right
[18:00:11] for some reason (brain fart) my mind went to a kosiaris, completely wrong...
[18:00:14] i'll fix that
[18:00:33] thank you (lemme know what you do too)
[18:01:37] thcipriani: ok! yes I can do tomorrow, I've updated puppet swat
[18:01:55] godog: cool, thanks!
[18:02:20] apergos: it's a new tool so there are no expectations yet about retention. I am clearing a lot of profiles using the mongo shell
[18:02:54] !log mongodb on hafnium: ran `db.results.remove( { "meta.SERVER.REQUEST_URI": "/wiki/Special:BlankPage" } ); db.repairDatabase();` to drop profiles of PyBal requests and compact the database.
[18:02:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[18:03:45] /dev/md0 46G 42G 1.7G 97% /
[18:03:54] does it take a while to run then?
[18:03:59] ori:
[18:04:14] "errmsg" : "Cannot repair database xhprof having size: 32147243008 (bytes) because free disk space is: 1730621440 (bytes)"
[18:04:24] I'm just going to nuke it then, it's OK
[18:04:33] more power to ya
[18:05:03] 6Operations, 10Traffic, 13Patch-For-Review: HTTP/1.1 keepalive for local nginx->varnish conns - https://phabricator.wikimedia.org/T107749#2060728 (10BBlack) 5Resolved>3stalled See T127931 - caused elevated (but still small) 503s on cache_text, reverted for now, will revisit after further varnish/nginx so...
[18:05:50] RECOVERY - Disk space on hafnium is OK: DISK OK
[18:06:18] thank you
[18:07:00] !log hafnium did not have enough disk space for mongo to execute db.repairDatabase(), which is necessary for reclaiming disk space. Since existing profile data can be tossed, ran `db.dropDatabase(); db.repairDatabase();`. Need to think this through better, obviously.
[18:07:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[18:10:04] grrrit-wm is dead
[18:10:44] !log disabling nginx keepalives on remaining clusters (upload, misc, maps)
[18:10:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[18:10:48] someone has to restart it manually, needs a labs admin?
[18:10:58] * apergos finds the relevant warning in the backread
[18:11:20] (05:59:25 μμ) bd808: so if and when grrrit-wm dies somebody will have to start it manually -- https://wikitech.wikimedia.org/wiki/Grrrit-wm -- that needs a labs root at the moment
[18:11:46] andrewbogott: is this something you can do?
[18:12:15] I probably /can/ but I have never touched grrrit-wm before
[18:12:26] let me look
[18:12:32] ok
[18:12:48] oh god it’s on k8s
[18:12:58] yeah :-P
[18:13:32] nobody panic
[18:13:34] I restarted it
[18:13:39] someone had deleted the replication controller
[18:13:41] rather than the pod
[18:13:46] yuvipanda: thanks :)
[18:13:49] oh, thank you
[18:13:52] I didn't ask you cause
[18:13:56] I was about to just follow your step-by-step, probably it would’ve worked fine
[18:13:56] in theory you're on vacation
[18:14:00] very theoretically
[18:14:48] we once had a typo in DNS, got us this wrong hostname. "es2015.codfw.wmnt", still shows up in 'puppet cert --list'. but when trying to clean it "Error: Could not find a serial number for es2015.codfw.wmnt
[18:15:02] ugh
[18:15:40] lemme poke at that some if you want
[18:16:13] mutante:
[18:16:27] apergos: :) sure
[18:16:39] they all are going to be reinstalled
[18:16:40] PROBLEM - Unmerged changes on repository puppet on labcontrol1002 is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet).
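[Editor's note on the hafnium episode above: with MongoDB's MMAPv1 storage, repairDatabase() rewrites the data files into fresh copies, so it needs roughly the database's current size in free disk space before it will start. That is exactly the check behind the errmsg quoted above; a minimal sketch using the numbers from the log:]

```python
# Numbers copied verbatim from the errmsg in the log.
db_size_bytes = 32_147_243_008  # "database xhprof having size"
free_bytes = 1_730_621_440      # "free disk space is"

# repairDatabase() must hold a rewritten copy of the data files,
# so it refuses to run without at least db-size free space.
can_repair = free_bytes >= db_size_bytes
assert can_repair is False  # hence dropDatabase() was the only way out
```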
[18:16:46] but it's about getting the "wmnt" out of there
[18:16:51] mutante: you can clean it up manually from /var/lib/puppet/ssl/ca/requested or something
[18:17:07] (03CR) 10BBlack: [C: 032] Revert "cache_misc: disable do_gzip" [puppet] - 10https://gerrit.wikimedia.org/r/273013 (https://phabricator.wikimedia.org/T127294) (owner: 10BBlack)
[18:17:24] * andrewbogott goes out for ramen w/Mom and Dad
[18:17:35] yuvipanda: ok, thanks, that's what i was looking for
[18:17:43] np
[18:18:16] 6Operations, 10MediaWiki-General-or-Unknown, 10Traffic, 13Patch-For-Review: Number of 5xx (mainly 503) errors surge from text caches since February 17th - https://phabricator.wikimedia.org/T127931#2060780 (10BBlack) 5Open>3Resolved
[18:18:36] yuvipanda, so when will grrrit-wm service group members be able to restart it?
[18:19:05] hmm , yea there is /var/lib/puppet/ssl/certs and certificate_requests and all that, but that doesnt have much content
[18:20:04] 6Operations, 10MediaWiki-Authentication-and-authorization, 10Traffic: Logging out of a wiki leaves an XXwikiSession= Cookie behind - https://phabricator.wikimedia.org/T127436#2060781 (10Anomie) > I'm not 100% sure whether this is a recent regression - I think it is It's not. Logging out in MediaWiki clears...
[18:22:04] mutante: there should be a 'ca' dir somewhere
[18:22:11] Krenair: it isn't too hard to setup, hopefully next week
[18:23:44] there's the matter of whether it's revoked or not
[18:27:14] mutante: I found a req for signing ./server/ssl/ca/requests/es2015.codfw.wmnt.pem
[18:27:17] so I tossed that
[18:27:22] it's gone from the list now
[18:27:29] ottomata: hello
[18:27:32] so no revoking needed it seems
[18:27:37] apergos: thank you
[18:27:39] yw
[18:28:21] ottomata: pinging about https://phabricator.wikimedia.org/T113069
[18:28:45] ottomata: I don't see the error anymore, so all is good
[18:29:06] 6Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting access to stat1002 (Hue / Hive) for bmansurov - https://phabricator.wikimedia.org/T113069#2060790 (10bmansurov) The error is gone. All is good now. Thanks for the help.
[18:30:24] apergos: I guess was from papaul that is reimaging those ones
[18:30:48] ok, well easy to fix
[18:31:38] mutante: sorry saw it just now reading the scroll
[18:33:11] I've got a nutcracker instance in beta cluster that won't start with a complaint of "configuration file '/etc/nutcracker/nutcracker.yml' syntax is invalid". Have we changed something that didn't get propagated to beta cluster?
[18:35:30] so apergos mutante sync with papaul, it might need to redo the puppet config for that host
[18:35:59] cert was unsigned so I doubt it
[18:36:20] but I'll keep it in mind
[18:36:54] 6Operations, 7Puppet, 10Beta-Cluster-Infrastructure, 6Release-Engineering-Team, 13Patch-For-Review: deployment-tin puppet Error 400 on SERVER: Failed to parse template nutcracker/nutcracker.yml.erb - https://phabricator.wikimedia.org/T127845#2060807 (10bd808) p:5Normal>3Unbreak! Nobody can login in b...
[18:37:00] apergos: thanks
[18:37:20] de nada
[18:37:40] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:37:54] booooooo
[18:38:10] _joe_: do you know how to fix T127845?
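[Editor's note: the cleanup described above can be sketched as follows. This is an illustration with a temporary directory standing in for the puppetmaster's CA directory, not the exact commands run; the key point is that a CSR which was never signed has no serial number, so `puppet cert clean` errors out, and deleting the pending request file by hand is sufficient — there is no certificate to revoke.]

```shell
# Hypothetical stand-in for the puppetmaster's ./server/ssl/ca directory.
ca=$(mktemp -d)
mkdir -p "$ca/requests"

# The typo'd host's CSR: pending, never signed, hence no serial number.
touch "$ca/requests/es2015.codfw.wmnt.pem"

# Removing the request file is the whole cleanup; no revocation needed.
rm "$ca/requests/es2015.codfw.wmnt.pem"

ls "$ca/requests" | wc -l   # 0 pending requests remain
```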
[18:38:10] T127845: deployment-tin puppet Error 400 on SERVER: Failed to parse template nutcracker/nutcracker.yml.erb - https://phabricator.wikimedia.org/T127845
[18:38:17] volans: we already did, they all have to be reinstalled, but just this one was special
[18:38:41] <_joe_> bd808: 1 sec
[18:38:58] a little special one :)
[18:38:58] is it still gitblit?
[18:39:09] <_joe_> bd808: yeah it's very simple
[18:39:19] sweet
[18:39:46] !log restart gitblit
[18:39:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[18:40:01] thanks
[18:40:08] I was actually gonna go look around and stuff
[18:40:15] <_joe_> mutante: you should be more creative when logging about gitblit
[18:40:18] (killing gitblit isnt just a task, it's an entire workboard )
[18:40:20] that was the better response, just kick it
[18:40:28] <_joe_> I know people follow our twitter feed of logs :)
[18:41:04] _joe_: i wanted to use the new feature that i can link with T12345, but i need a link to the whole project :p
[18:41:05] T12345: Create "annotation" namespace on Hebrew Wikisource - https://phabricator.wikimedia.org/T12345
[18:41:12] 6Operations, 6Discovery, 10procurement: Refresh elastic10{01..16}.eqiad.wmnet servers - https://phabricator.wikimedia.org/T128000#2060818 (10EBernhardson)
[18:42:18] (03PS1) 10Giuseppe Lavagetto: nutcracker: fix beta config [puppet] - 10https://gerrit.wikimedia.org/r/273016 (https://phabricator.wikimedia.org/T127845)
[18:43:34] (03PS2) 10Giuseppe Lavagetto: nutcracker: fix beta config [puppet] - 10https://gerrit.wikimedia.org/r/273016 (https://phabricator.wikimedia.org/T127845)
[18:43:45] <_joe_> bd808: ^^
[18:44:02] (03CR) 10Giuseppe Lavagetto: [C: 032] nutcracker: fix beta config [puppet] - 10https://gerrit.wikimedia.org/r/273016 (https://phabricator.wikimedia.org/T127845) (owner: 10Giuseppe Lavagetto)
[18:44:12] (03CR) 10Giuseppe Lavagetto: [V: 032] nutcracker: fix beta config [puppet] - 10https://gerrit.wikimedia.org/r/273016 (https://phabricator.wikimedia.org/T127845) (owner: 10Giuseppe Lavagetto)
[18:44:31] <_joe_> bd808: this is a hack, I'll do a proper fix at a later time, promised
[18:44:51] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 28169 bytes in 1.799 second response time
[18:44:53] cool. I'll pull it on beta puppetmaster and force some runs
[18:45:18] <_joe_> bd808: yup thanks and sorry
[18:45:32] no worries. you're helping fix it :)
[18:45:37] <_joe_> the change seemed good, and I'm doing too many things at once
[18:45:41] <_joe_> well I broke it too
[18:45:42] <_joe_> :P
[18:45:54] <_joe_> I broke it, then i fixed it (TM)
[18:46:06] I've seen that shirt somewhere ;)
[18:46:08] as it should be :)
[18:46:12] will refer to it as "project 46" https://phabricator.wikimedia.org/project/view/46/
[18:46:47] <_joe_> Platonides: actually I broke it twice; but well at least I didn't break production
[18:47:27] hehe
[18:50:09] 6Operations, 7Puppet, 10Beta-Cluster-Infrastructure, 6Release-Engineering-Team, 13Patch-For-Review: deployment-tin puppet Error 400 on SERVER: Failed to parse template nutcracker/nutcracker.yml.erb - https://phabricator.wikimedia.org/T127845#2060866 (10bd808) p:5Unbreak!>3Normal @Joe put in a quick f...
[18:52:31] !log es201[1-9] -signing puppet certs, salt-key. initial run
[18:52:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[18:53:10] (03PS3) 10Rush: labstore: real time application of tc setup [puppet] - 10https://gerrit.wikimedia.org/r/272900
[18:57:02] bmansurov: oh o
[18:57:03] ok
[18:57:14] if all is well then, many apologies about losing those queries!
[18:57:44] (03CR) 10Legoktm: "Missed the settings in the -labs.php files :(" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266470 (https://phabricator.wikimedia.org/T119117) (owner: 10Dereckson)
[18:59:06] ottomata: np, I have some of them left
[19:00:04] (03PS1) 10Legoktm: Finish "wmg" -> "wg" for BetaFeatures in -labs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/273017
[19:01:44] (03CR) 10Legoktm: [C: 032] "beta-only" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/273017 (owner: 10Legoktm)
[19:02:10] (03Merged) 10jenkins-bot: Finish "wmg" -> "wg" for BetaFeatures in -labs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/273017 (owner: 10Legoktm)
[19:03:05] (03PS4) 10Rush: labstore: real time application of tc setup [puppet] - 10https://gerrit.wikimedia.org/r/272900
[19:03:42] (03CR) 10jenkins-bot: [V: 04-1] labstore: real time application of tc setup [puppet] - 10https://gerrit.wikimedia.org/r/272900 (owner: 10Rush)
[19:04:09] (03PS5) 10Rush: labstore: real time application of tc setup [puppet] - 10https://gerrit.wikimedia.org/r/272900
[19:04:31] !log legoktm@tin Synchronized wmf-config/InitialiseSettings-labs.php: https://gerrit.wikimedia.org/r/273017 (duration: 01m 37s)
[19:04:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:05:41] 6Operations, 13Patch-For-Review: align puppet-lint config with coding style - https://phabricator.wikimedia.org/T93645#2060952 (10Dzahn) We have come a long way here :) When we started we had to exclude pretty much every single lint check and we had more errors and warnings than code lines (i have some numbers...
[19:12:31] (03Abandoned) 10Rush: Create tc class analogous to ferm for traffic control [puppet] - 10https://gerrit.wikimedia.org/r/209558 (owner: 10coren)
[19:14:14] 6Operations, 10ops-codfw: es2011-es2019 have default RAID stripe - https://phabricator.wikimedia.org/T127938#2061008 (10Papaul) @Volans re-image, puppet.salt key sign complete.
(03Restored) 10Giuseppe Lavagetto: testsystem: Move role class to module role [puppet] - 10https://gerrit.wikimedia.org/r/270105 (owner: 10Tim Landscheidt)
[19:20:45] 6Operations, 7Need-volunteer, 13Patch-For-Review: align puppet-lint config with coding style - https://phabricator.wikimedia.org/T93645#2061023 (10Dzahn)
[19:26:41] 6Operations: fix all "top-scope variable being used without an explicit namespace" across the puppet repo - https://phabricator.wikimedia.org/T125042#2061087 (10Dzahn) left now: ./modules/cdh/manifests/spark.pp - WARNING: top-scope variable being used without an explicit namespace on line 73 ./modules/monitorin...
[19:28:35] (03CR) 10Rush: [C: 032] labstore: real time application of tc setup [puppet] - 10https://gerrit.wikimedia.org/r/272900 (owner: 10Rush)
[19:29:06] (03PS1) 10Ottomata: Fix variable reference $standalone_master_host [puppet/cdh] - 10https://gerrit.wikimedia.org/r/273028 (https://phabricator.wikimedia.org/T125042)
[19:29:34] (03CR) 10Ottomata: [C: 032 V: 032] Fix variable reference $standalone_master_host [puppet/cdh] - 10https://gerrit.wikimedia.org/r/273028 (https://phabricator.wikimedia.org/T125042) (owner: 10Ottomata)
[19:30:53] (03PS1) 10Rush: labstore: tcp-setup path fix [puppet] - 10https://gerrit.wikimedia.org/r/273031
[19:31:09] (03PS2) 10Rush: labstore: tcp-setup path fix [puppet] - 10https://gerrit.wikimedia.org/r/273031
[19:32:01] (03PS5) 10Cmjohnson: Fixing partman recipe that wmf4727-test uses. Needed gpt [puppet] - 10https://gerrit.wikimedia.org/r/272892
[19:32:05] (03PS1) 10Legoktm: Update $wgOresModels configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/273032
[19:33:58] (03PS2) 10Giuseppe Lavagetto: testsystem: Move role class to module role [puppet] - 10https://gerrit.wikimedia.org/r/270105 (owner: 10Tim Landscheidt)
[19:34:38] (03CR) 10jenkins-bot: [V: 04-1] labstore: tcp-setup path fix [puppet] - 10https://gerrit.wikimedia.org/r/273031 (owner: 10Rush)
[19:35:19] (03CR) 10Ladsgroup: [C: 031] Update $wgOresModels configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/273032 (owner: 10Legoktm)
[19:36:20] (03PS3) 10Rush: labstore: tcp-setup path fix [puppet] - 10https://gerrit.wikimedia.org/r/273031
[19:37:04] (03CR) 10Cmjohnson: [C: 032] Fixing partman recipe that wmf4727-test uses. Needed gpt [puppet] - 10https://gerrit.wikimedia.org/r/272892 (owner: 10Cmjohnson)
[19:37:22] (03PS4) 10Rush: labstore: tcp-setup path fix [puppet] - 10https://gerrit.wikimedia.org/r/273031
[19:37:32] (03CR) 10Rush: [C: 032 V: 032] labstore: tcp-setup path fix [puppet] - 10https://gerrit.wikimedia.org/r/273031 (owner: 10Rush)
[19:39:48] (03PS1) 10Rush: nfs: deprecation warnings on /etc/modprobe.d [puppet] - 10https://gerrit.wikimedia.org/r/273034
[19:39:59] (03PS2) 10Rush: nfs: deprecation warnings on /etc/modprobe.d [puppet] - 10https://gerrit.wikimedia.org/r/273034
[19:40:08] 6Operations, 15User-mobrovac, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Create a service location / discovery system for locating local/master resources easily across all WMF applications - https://phabricator.wikimedia.org/T125069#2061165 (10GWicke) I looked a bit into DNS as an option. By default, nod...
[19:41:35] !log db1021 replacing disk 8
[19:41:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:42:33] (03PS3) 10Giuseppe Lavagetto: testsystem: Move role class to module role [puppet] - 10https://gerrit.wikimedia.org/r/270105 (owner: 10Tim Landscheidt)
[19:43:44] 6Operations, 10Ops-Access-Requests, 10Analytics, 10ContentTranslation-Analytics: access for nikerabbit to researchers - https://phabricator.wikimedia.org/T127808#2061176 (10Dzahn)
[19:43:57] (03CR) 10Legoktm: [C: 032] "beta-only" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/273032 (owner: 10Legoktm)
[19:44:12] (03CR) 10Rush: [C: 032] nfs: deprecation warnings on /etc/modprobe.d [puppet] - 10https://gerrit.wikimedia.org/r/273034 (owner: 10Rush)
[19:44:21] (03Merged) 10jenkins-bot: Update $wgOresModels configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/273032 (owner: 10Legoktm)
[19:45:58] (03PS1) 10Dzahn: admin: add nikerabbit to researchers [puppet] - 10https://gerrit.wikimedia.org/r/273038 (https://phabricator.wikimedia.org/T127808)
[19:46:52] !log legoktm@tin Synchronized wmf-config/InitialiseSettings-labs.php: https://gerrit.wikimedia.org/r/273032 (duration: 01m 41s)
[19:46:53] !log runonce apply for https://gerrit.wikimedia.org/r/#/c/272891/ for labs vm's (only affects nfs clients)
[19:46:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:46:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:47:19] PROBLEM - RAID on db1021 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded)
[19:52:07] (03PS2) 10Dzahn: admin: add nikerabbit to researchers [puppet] - 10https://gerrit.wikimedia.org/r/273038 (https://phabricator.wikimedia.org/T127808)
[19:54:14] (03CR) 10Dzahn: [C: 031] "has approval on ticket, does not involve sudo or changes shell access, what it does is give database access to private data in the researc" [puppet] - 10https://gerrit.wikimedia.org/r/273038 (https://phabricator.wikimedia.org/T127808) (owner: 10Dzahn)
[19:56:45] (03PS3) 10Dzahn: admin: add nikerabbit to researchers [puppet] - 10https://gerrit.wikimedia.org/r/273038 (https://phabricator.wikimedia.org/T127808)
[19:56:56] (03PS2) 10Rush: labstore1001: persist cfq ioscheduler [puppet] - 10https://gerrit.wikimedia.org/r/270432 (https://phabricator.wikimedia.org/T126090)
[19:57:04] 6Operations, 15User-mobrovac, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Create a service location / discovery system for locating local/master resources easily across all WMF applications - https://phabricator.wikimedia.org/T125069#2061242 (10GWicke) I have now verified that there is no DNS caching at a...
[19:57:46] (03PS1) 10Ottomata: Fixes for spark-env.sh using spark 1.5 from CDH 5.5.2 [puppet/cdh] - 10https://gerrit.wikimedia.org/r/273043
[19:58:37] (03PS4) 10Giuseppe Lavagetto: testsystem: Move role class to module role [puppet] - 10https://gerrit.wikimedia.org/r/270105 (owner: 10Tim Landscheidt)
[19:59:01] (03PS2) 10Eevans: restbase: make statsd metric prefix configurable [puppet] - 10https://gerrit.wikimedia.org/r/238431 (https://phabricator.wikimedia.org/T112644) (owner: 10Filippo Giunchedi)
[19:59:49] (03PS4) 10Dzahn: admin: add nikerabbit to researchers [puppet] - 10https://gerrit.wikimedia.org/r/273038 (https://phabricator.wikimedia.org/T127808)
[19:59:57] (03CR) 10Eevans: [C: 031] "I rebased this." [puppet] - 10https://gerrit.wikimedia.org/r/238431 (https://phabricator.wikimedia.org/T112644) (owner: 10Filippo Giunchedi)
[20:00:04] ostriches: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160224T2000).
[20:00:23] (03CR) 10Rush: [C: 032] labstore1001: persist cfq ioscheduler [puppet] - 10https://gerrit.wikimedia.org/r/270432 (https://phabricator.wikimedia.org/T126090) (owner: 10Rush)
[20:03:21] (03PS1) 10Rush: labstore: persist cfq io scheduler [puppet] - 10https://gerrit.wikimedia.org/r/273049
[20:08:50] (03CR) 10Rush: [C: 032] labstore: persist cfq io scheduler [puppet] - 10https://gerrit.wikimedia.org/r/273049 (owner: 10Rush)
[20:11:18] ACKNOWLEDGEMENT - RAID on db1021 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) Jcrespo It is rebuilding its disks, as requested by Jaime.
[20:13:46] !log reboot logstash1001 for kernel update
[20:13:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:15:26] (03PS2) 10Ottomata: Fixes for spark-env.sh using spark 1.5 from CDH 5.5.2 [puppet/cdh] - 10https://gerrit.wikimedia.org/r/273043
[20:15:49] !log reboot labstore1002 to ensure io scheduler grub options work
[20:15:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:17:16] (03PS1) 10Eevans: restbase: override statsd metric prefix for restbase test cluster [puppet] - 10https://gerrit.wikimedia.org/r/273051 (https://phabricator.wikimedia.org/T103124)
[20:18:33] (03Abandoned) 10Eevans: restbase: override statsd metric prefix for restbase test cluster [puppet] - 10https://gerrit.wikimedia.org/r/273051 (https://phabricator.wikimedia.org/T103124) (owner: 10Eevans)
[20:18:51] legoktm: https://upload.wikimedia.org/wikipedia/commons/1/16/Putting_the_prod_in_production._Production_Drive_Committee_-_NARA_-_534919.jpg is awesome, thank you.
[20:19:16] 6Operations, 10Traffic, 10Wiki-Loves-Monuments-General, 7HTTPS: configure https for www.wikilovesmonuments.org - https://phabricator.wikimedia.org/T118388#2061293 (10Multichill) You can hardly call "Registrar.eu" an individual in Rotterdam or are you looking at different records than I am? * wikilovesmonum...
[20:20:58] (03PS1) 10Eevans: restbase: override statsd metric prefix for restbase test cluster [puppet] - 10https://gerrit.wikimedia.org/r/273052 (https://phabricator.wikimedia.org/T103124)
[20:23:04] (03CR) 10Eevans: [C: 04-1] "I had an awful time trying to rebase this after rebasing https://gerrit.wikimedia.org/r/#/c/238431/, so I submitted https://gerrit.wikimed" [puppet] - 10https://gerrit.wikimedia.org/r/238432 (https://phabricator.wikimedia.org/T112644) (owner: 10Filippo Giunchedi)
[20:24:31] 6Operations, 10MediaWiki-Interface, 10Traffic, 5MW-1.27-release, and 3 others: Incorrect TOC and section edit links rendering in Vector due to ParserCache corruption via ParserOutput::setText( ParserOutput::getText() ) - https://phabricator.wikimedia.org/T124356#2061306 (10Sjoerddebruin) Another page: http...
[20:25:31] 6Operations, 10ops-eqiad: db1021 degraded RAID - https://phabricator.wikimedia.org/T126451#2061310 (10Cmjohnson) Replaced disk 7 and it's back online disk 8 is rebuilding now
[20:25:39] (03CR) 10Eevans: [C: 031] "Insofar as I understand all of the implications here (I may not), this LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/272989 (https://phabricator.wikimedia.org/T103124) (owner: 10Filippo Giunchedi)
[20:26:19] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet).
[20:26:44] (03PS3) 10Ottomata: Fixes for spark-env.sh using spark 1.5 from CDH 5.5.2 [puppet/cdh] - 10https://gerrit.wikimedia.org/r/273043
[20:27:57] (03Abandoned) 10Eevans: make statsd metrics prefix configurable [puppet] - 10https://gerrit.wikimedia.org/r/272536 (https://phabricator.wikimedia.org/T127747) (owner: 10Eevans)
[20:28:34] !log reboot logstash1002 for kernel and elasticsearch update
[20:28:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:29:20] RECOVERY - RAID on db1021 is OK: OK: optimal, 1 logical, 2 physical
[20:30:39] (03CR) 10Ottomata: [C: 032 V: 032] Fixes for spark-env.sh using spark 1.5 from CDH 5.5.2 [puppet/cdh] - 10https://gerrit.wikimedia.org/r/273043 (owner: 10Ottomata)
[20:31:40] (03PS1) 10Ottomata: Update cdh module with spark fixes for CDH 5.5.2 [puppet] - 10https://gerrit.wikimedia.org/r/273054
[20:32:00] (03PS2) 10Ottomata: Update cdh module with spark fixes for CDH 5.5.2 [puppet] - 10https://gerrit.wikimedia.org/r/273054
[20:32:10] (03CR) 10EBernhardson: Enable 'popqual' (quality+pageviews) scoring method for the completion suggester (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272963 (https://phabricator.wikimedia.org/T127943) (owner: 10DCausse)
[20:33:30] ori: Is it still critical that we hold wmf.14 from the wikis?
[20:33:38] (03CR) 10Ottomata: [C: 032 V: 032] Update cdh module with spark fixes for CDH 5.5.2 [puppet] - 10https://gerrit.wikimedia.org/r/273054 (owner: 10Ottomata)
[20:35:17] (03PS1) 10Dzahn: wikistats: let puppet git clone vs. deb [puppet] - 10https://gerrit.wikimedia.org/r/273055
[20:35:40] (03CR) 10Krinkle: [C: 031] Expand computed dblist; leave flow_computed for easy regeneration: (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272929 (owner: 10Mattflaschen)
[20:36:10] PROBLEM - Unmerged changes on repository puppet on labcontrol1002 is CRITICAL: There are 11 unmerged changes in puppet (dir /var/lib/git/operations/puppet).
[20:36:10] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
[20:39:48] !log reboot logstash1003 for kernel and elasticsearch update
[20:39:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:42:58] (03PS1) 10Eevans: restbase: override logging name [puppet] - 10https://gerrit.wikimedia.org/r/273061 (https://phabricator.wikimedia.org/T103124)
[20:47:37] 6Operations, 10ops-eqiad: db1021 degraded RAID - https://phabricator.wikimedia.org/T126451#2061431 (10Cmjohnson) 5Open>3Resolved both disk replaced and back to normal.
[20:49:12] !log reboot logstash1004 for kernel/elasticsearch update
[20:49:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:53:06] 6Operations, 10RESTBase, 10hardware-requests: normalize eqiad restbase cluster - replace restbase1001-1006 - https://phabricator.wikimedia.org/T125842#1998563 (10Cmjohnson) The new restbase servers are on-site. Let's coordinate which 2 servers to start with. 1001/1002 are in row A 1003/1004 are in row C...
[20:56:57] 6Operations, 10puppet-compiler: puppet compiler: NoneType' object is not iterable with node auto-select feature - https://phabricator.wikimedia.org/T117278#2061467 (10hashar)
[20:57:26] 6Operations, 10Continuous-Integration-Infrastructure, 10puppet-compiler: puppet compiler wrongly indicates errors when dealing with subrepositories - https://phabricator.wikimedia.org/T118406#2061468 (10hashar)
[20:58:07] 6Operations, 10puppet-compiler: Puppet Compiler: Support wildcards, regexps, or 'all hosts' - https://phabricator.wikimedia.org/T114305#2061472 (10hashar)
[20:59:12] 6Operations, 10puppet-compiler: puppet compiler - puppet facts need refreshing - https://phabricator.wikimedia.org/T110546#2061476 (10hashar)
[20:59:33] 6Operations, 7Puppet, 10puppet-compiler, 13Patch-For-Review: puppet compiler runs fail when backup::host is included on host - https://phabricator.wikimedia.org/T122909#2061477 (10hashar)
[20:59:47] 6Operations, 7Puppet, 10Continuous-Integration-Infrastructure, 6Labs, 10puppet-compiler: compiler02.puppet3-diffs.eqiad.wmflabs out of disk space - https://phabricator.wikimedia.org/T122346#2061479 (10hashar)
[21:00:04] 6Operations, 10puppet-compiler, 13Patch-For-Review: Puppet catalog compiler is broken - https://phabricator.wikimedia.org/T96802#2061480 (10hashar)
[21:00:04] gwicke cscott arlolra subbu bearND mdholloway: Dear anthropoid, the time has come. Please deploy Services – Parsoid / OCG / Citoid / Mobileapps / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160224T2100).
[21:01:26] !log starting parsoid deploy [21:01:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:03:12] verified edit on beta cluster [21:04:58] 6Operations, 7Puppet, 10Beta-Cluster-Infrastructure, 6Release-Engineering-Team, 13Patch-For-Review: deployment-tin puppet Error 400 on SERVER: Failed to parse template nutcracker/nutcracker.yml.erb - https://phabricator.wikimedia.org/T127845#2061502 (10hashar) Thank you @Joe , I was already fighting vari... [21:08:09] !log synced code; restarted parsoid on wtp1001 as a canary [21:08:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:08:20] ostriches: it's critical to get memcached traffic back to normal. I don't know what is easier -- rolling out wmf14 or backporting these patches to wmf13 [21:08:34] tgr ^ [21:08:46] (03PS1) 10Chad: postgresql.py ganglia: pep8 fixes, mostly line too long [puppet] - 10https://gerrit.wikimedia.org/r/273106 [21:09:07] ori: Either is fine by me. Does wmf.14 already have the fixes needed? 
[21:09:15] yes [21:09:25] In which case moving forward would be best imho [21:09:30] PROBLEM - Kafka Broker Replica Max Lag on kafka1014 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [5000000.0] [21:09:35] ok by me [21:10:05] ori: I thought wmf14 is going out right now [21:10:07] https://phabricator.wikimedia.org/T125597#2052051 [21:10:27] yeah [21:10:32] i think that is what ostriches is confirming [21:10:40] it should already have all the memcache-related sessionmanager improvements [21:10:48] I was under the impression we were holding wmf.14 from monday, didn't know you guys were ok with wmf.14 yet :) [21:11:04] Ok, I'll go ahead and catch up and move it to testwiki2 and mw.org since those would've gone yesterday [21:11:12] If that looks good, we'll do group1 today and get back on track [21:11:17] great [21:11:20] thank you, makes sense [21:11:59] we tested session handling yesterday via mw1017 (since group0 is not really useful for that) and all seemed to work ok [21:12:09] session handling in wmf.14 I mean [21:12:15] 7Blocked-on-Operations, 10RESTBase, 13Patch-For-Review: Separate metrics, logs, and monitoring between staging and production - https://phabricator.wikimedia.org/T103124#2061558 (10Eevans) >>! In T103124#2060074, @fgiunchedi wrote: > https://gerrit.wikimedia.org/r/272989 is a patch to move things into their... [21:13:01] (03PS1) 10Chad: Move rest of group0 to wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/273109 [21:15:15] tgr: what about the cache-miss caching? is that in wmf14? [21:15:33] alright restarting parsoid on all nodes. 
looking good on wtp1001 [21:16:49] (03CR) 10Chad: [C: 032] Move rest of group0 to wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/273109 (owner: 10Chad) [21:16:49] RECOVERY - Kafka Broker Replica Max Lag on kafka1014 is OK: OK: Less than 50.00% above the threshold [1000000.0] [21:18:13] (03Merged) 10jenkins-bot: Move rest of group0 to wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/273109 (owner: 10Chad) [21:19:00] !log finished deploying parsoid version 581a43c75 [21:19:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:19:16] !log demon@tin Started scap: group0 to wmf.14 [21:19:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:20:59] ori, yes [21:21:24] https://gerrit.wikimedia.org/r/#/c/272776/, https://gerrit.wikimedia.org/r/#/c/272653/ and https://gerrit.wikimedia.org/r/#/c/272776/ are the perf-related changes (other than the MW_NO_SESSION stuff) [21:21:42] first two are backported, third is still in CR [21:23:42] (03PS2) 10Dzahn: wikistats: let puppet git clone vs. deb [puppet] - 10https://gerrit.wikimedia.org/r/273055 [21:24:10] (03PS3) 10Dzahn: wikistats: let puppet git clone vs. deb [puppet] - 10https://gerrit.wikimedia.org/r/273055 [21:24:25] (03CR) 10Dzahn: [C: 032] wikistats: let puppet git clone vs. deb [puppet] - 10https://gerrit.wikimedia.org/r/273055 (owner: 10Dzahn) [21:26:31] 6Operations, 7Need-volunteer, 13Patch-For-Review: align puppet-lint config with coding style - https://phabricator.wikimedia.org/T93645#2061615 (10Dzahn) a:3Dzahn [21:26:58] apergos, re Flow dumps, wanted to let you know that Flow does not currently use a computed DB list (it was merged but then I went to a regular DB list). [21:27:11] So if wmgUseFlow works, that's fine, otherwise you could consider using the fixed DB list.
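[Editor's note] The Flow dumps exchange above turns on how a wiki's membership in a fixed DB list is decided. A minimal sketch of that lookup, assuming the plain one-wiki-per-line dblist format used in mediawiki-config (the file name and helper names here are illustrative, not the actual dumps code):

```python
# Sketch of static "dblist" membership checking, as discussed for Flow dumps
# above. Assumes the one-wiki-per-line format of mediawiki-config dblist
# files, with '#' comments; helper names are illustrative.

def parse_dblist(text):
    """Return the set of wiki DB names in a dblist, ignoring comments and blanks."""
    wikis = set()
    for line in text.splitlines():
        line = line.split('#', 1)[0].strip()  # drop trailing comments
        if line:
            wikis.add(line)
    return wikis

def wiki_has_flow(dbname, dblist_text):
    """True if dbname appears in the (hypothetical) flow dblist contents."""
    return dbname in parse_dblist(dblist_text)

example = """\
# wikis with Flow enabled (illustrative contents)
mediawikiwiki
officewiki
testwiki
"""
print(wiki_has_flow("mediawikiwiki", example))  # True
print(wiki_has_flow("enwiki", example))         # False
```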
[21:27:12] 6Operations, 7Puppet: 'role' function doesn't find classess in autoload layout in manifests/role - https://phabricator.wikimedia.org/T119042#2061646 (10Dzahn) [21:28:52] 6Operations, 10ops-eqiad: testing: r430 server / h800 controller / md1200 shelf - https://phabricator.wikimedia.org/T127490#2061650 (10Cmjohnson) a test server has been established and accessible via ssh wmf4727-test.eqiad.wmnet Controller BIOS needs to be added to use the LSI controller. [21:33:21] dammnnnnn rsync is slow. [21:33:41] I didn't deploy any new branches... [21:33:46] Was already live [21:34:21] 6Operations, 10MediaWiki-Authentication-and-authorization, 10Traffic: Logging out of a wiki leaves an XXwikiSession= Cookie behind - https://phabricator.wikimedia.org/T127436#2061676 (10Tgr) [[ https://www.owasp.org/images/6/67/OWASPApplicationSecurityVerificationStandard3.0.pdf | OWASP ASVS ]] 2.7 (a draft... [21:35:36] 6Operations, 10Continuous-Integration-Infrastructure, 10Traffic, 13Patch-For-Review: https://integration.wikimedia.org/ci/api/json is corrupted when required more than one time in a raw - https://phabricator.wikimedia.org/T127294#2061679 (10hashar) Verified on Nodepool, requests to Jenkins are all fine: ``... [21:35:38] I've been on sync-proxies for 3 minutes now and 0 have completed? [21:35:43] * ostriches grabs his scap stabbing tools [21:38:27] * ostriches twiddles thumbs [21:38:32] ok sync-common, what gives? 
[21:39:02] thereeeee we go [21:49:00] PROBLEM - ElasticSearch health check for shards on logstash1005 is CRITICAL: CRITICAL - elasticsearch inactive shards 35 threshold =0.1% breach: status: yellow, number_of_nodes: 6, unassigned_shards: 35, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 35, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 70, initializing_shards: 0, number_of_data_nodes: 3, delay [21:49:21] PROBLEM - ElasticSearch health check for shards on logstash1002 is CRITICAL: CRITICAL - elasticsearch inactive shards 35 threshold =0.1% breach: status: yellow, number_of_nodes: 6, unassigned_shards: 35, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 35, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 70, initializing_shards: 0, number_of_data_nodes: 3, delay [21:49:43] that doesn't look good ^ [21:50:10] PROBLEM - ElasticSearch health check for shards on logstash1003 is CRITICAL: CRITICAL - elasticsearch inactive shards 35 threshold =0.1% breach: status: yellow, number_of_nodes: 6, unassigned_shards: 35, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 35, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 70, initializing_shards: 0, number_of_data_nodes: 3, delay [21:50:42] PROBLEM - ElasticSearch health check for shards on logstash1006 is CRITICAL: CRITICAL - elasticsearch inactive shards 35 threshold =0.1% breach: status: yellow, number_of_nodes: 6, unassigned_shards: 35, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 35, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 70, initializing_shards: 0, number_of_data_nodes: 3, delay [21:50:42] PROBLEM - ElasticSearch health check for shards on logstash1004 is CRITICAL: CRITICAL - elasticsearch 
inactive shards 35 threshold =0.1% breach: status: yellow, number_of_nodes: 6, unassigned_shards: 35, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 35, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 70, initializing_shards: 0, number_of_data_nodes: 3, delay [21:50:42] PROBLEM - ElasticSearch health check for shards on logstash1001 is CRITICAL: CRITICAL - elasticsearch inactive shards 35 threshold =0.1% breach: status: yellow, number_of_nodes: 6, unassigned_shards: 35, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 35, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 70, initializing_shards: 0, number_of_data_nodes: 3, delay [21:50:50] PROBLEM - Kafka Broker Replica Max Lag on kafka1014 is CRITICAL: CRITICAL: 62.50% of data above the critical threshold [5000000.0] [21:50:56] gehel: is the logstash cluster stuff you and moritzm ? [21:51:18] bd808: yes [21:51:37] *nod* thanks [21:51:46] oops, it is taking much more time than we thought, downtime was scheduled for much less [21:52:11] yep, downtime expired, will re-add it [21:52:11] I don't see anything in the recovery monitor [21:52:13] yea elasticsearch never comes back to full health as fast as i would like... [21:54:40] * bd808 sees some shards recovering now [21:54:51] RECOVERY - ElasticSearch health check for shards on logstash1002 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 6, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 35, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 104, initializing_shards: 1, number_of_data_nodes: 3, delayed_unassigned_shards [21:55:04] gehel, moritzm: 1005 is the master right now.
you may want to do it last [21:55:31] RECOVERY - ElasticSearch health check for shards on logstash1003 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 6, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 35, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 104, initializing_shards: 1, number_of_data_nodes: 3, delayed_unassigned_shards [21:56:10] RECOVERY - ElasticSearch health check for shards on logstash1001 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 6, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 35, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 104, initializing_shards: 1, number_of_data_nodes: 3, delayed_unassigned_shards [21:56:10] RECOVERY - ElasticSearch health check for shards on logstash1006 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 6, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 35, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 104, initializing_shards: 1, number_of_data_nodes: 3, delayed_unassigned_shards [21:56:10] RECOVERY - ElasticSearch health check for shards on logstash1004 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 6, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 35, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 104, initializing_shards: 1, number_of_data_nodes: 3, delayed_unassigned_shards [21:56:11] RECOVERY - ElasticSearch health check for shards on logstash1005 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, 
number_of_nodes: 6, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 35, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 104, initializing_shards: 1, number_of_data_nodes: 3, delayed_unassigned_shards [21:57:31] 6Operations, 7Puppet: 'role' function doesn't find classess in autoload layout in manifests/role - https://phabricator.wikimedia.org/T119042#2061719 (10scfc) Short recap of (failed) trials: The issue does not lie with the `role()` function as the error occurs with plain `include role::something` (one level, i.... [21:58:00] bd808: yep, depending on the final recovery time of 1004 we'll proceed with 1006 and possibly continue with 1005 tomorrow, otherwise it's getting too late in CET [21:59:07] I can take over and do 1005 if you run out of time. Just let me know if there are any non-apt packages that need to be applied [22:00:23] 6Operations, 7Puppet: 'role' function doesn't find classess in autoload layout in manifests/role - https://phabricator.wikimedia.org/T119042#2061723 (10Dzahn) @scfc joe looked at it earlier and said "the problem is "import" is implemented awkwardly.. there is no way to fix it," now this: < _joe_> mutante: we... [22:00:59] bd808: ok, thanks, we'll ping you later. elasticsearch 1.7.5 has been added to carbon, so it's available via plain apt [22:03:34] bd808: And moritzm has nice stop / start scripts that do everything for you, except preparing earl grey while waiting for the cluster to start again ... [22:05:30] RECOVERY - Kafka Broker Replica Max Lag on kafka1014 is OK: OK: Less than 50.00% above the threshold [1000000.0] [22:05:48] 7Blocked-on-Operations, 10RESTBase, 13Patch-For-Review: Separate metrics, logs, and monitoring between staging and production - https://phabricator.wikimedia.org/T103124#2061733 (10GWicke) > If anything, the conventions are different (e.g. - vs _ as a separator).
For the staging cluster, using _ as the sep... [22:07:06] !log demon@tin Finished scap: group0 to wmf.14 (duration: 47m 50s) [22:07:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:09:42] !log reboot logstash1006 for kernel and elasticsearch update [22:09:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:11:56] gehel: these are essentially stripped down versions of bd808s es upgrade script :-) [22:12:53] I need to have a look at them! Are they versioned somewhere? [22:13:09] (03PS1) 10Chad: Move group1 to wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/273120 [22:13:51] PROBLEM - ElasticSearch health check for shards on logstash1003 is CRITICAL: CRITICAL - elasticsearch inactive shards 35 threshold =0.1% breach: status: yellow, number_of_nodes: 6, unassigned_shards: 35, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 35, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 70, initializing_shards: 0, number_of_data_nodes: 3, delay [22:14:22] PROBLEM - ElasticSearch health check for shards on logstash1001 is CRITICAL: CRITICAL - elasticsearch inactive shards 33 threshold =0.1% breach: status: yellow, number_of_nodes: 6, unassigned_shards: 29, number_of_pending_tasks: 4, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 35, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 72, initializing_shards: 4, number_of_data_nodes: 3, delay [22:14:22] PROBLEM - ElasticSearch health check for shards on logstash1004 is CRITICAL: CRITICAL - elasticsearch inactive shards 33 threshold =0.1% breach: status: yellow, number_of_nodes: 6, unassigned_shards: 29, number_of_pending_tasks: 4, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 35, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 72, initializing_shards: 4,
number_of_data_nodes: 3, delay [22:14:25] gehel: earl grey best grey [22:14:31] PROBLEM - ElasticSearch health check for shards on logstash1005 is CRITICAL: CRITICAL - elasticsearch inactive shards 32 threshold =0.1% breach: status: yellow, number_of_nodes: 6, unassigned_shards: 28, number_of_pending_tasks: 4, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 35, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 73, initializing_shards: 4, number_of_data_nodes: 3, delay [22:15:01] PROBLEM - ElasticSearch health check for shards on logstash1002 is CRITICAL: CRITICAL - elasticsearch inactive shards 19 threshold =0.1% breach: status: yellow, number_of_nodes: 6, unassigned_shards: 15, number_of_pending_tasks: 5, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 35, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 86, initializing_shards: 4, number_of_data_nodes: 3, delay [22:15:17] gehel: look at ~bd808/upgrade-es.sh on logstash1001 [22:15:40] gehel: shamelessly stolen from a script that manybubbles made in the olden days for the cirrus cluster [22:15:42] RECOVERY - ElasticSearch health check for shards on logstash1003 is OK: OK - elasticsearch status production-logstash-eqiad: status: green, number_of_nodes: 6, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 35, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 105, initializing_shards: 0, number_of_data_nodes: 3, delayed_unassigned_shards: [22:16:02] imitation is the best expression of flattery ... 
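[Editor's note] The icinga alerts above fire when the fraction of inactive shards crosses a 0.1% threshold; e.g. the 21:49 alert on logstash1005 shows 35 unassigned shards against 70 active, i.e. one third of the cluster's shards inactive while a rebooted node rejoins. A sketch of that computation, using the `_cluster/health` field names quoted in the alerts (the actual check plugin may compute it differently):

```python
# Sketch of the "inactive shards" threshold logic behind the icinga alerts
# above. Field names match the _cluster/health JSON quoted in the alerts;
# the real check plugin may differ in detail.

def inactive_shard_fraction(health):
    """Fraction of shards that are not active (unassigned/initializing/relocating)."""
    inactive = (health["unassigned_shards"]
                + health["initializing_shards"]
                + health["relocating_shards"])
    total = health["active_shards"] + inactive
    return inactive / total if total else 0.0

def breaches_threshold(health, threshold=0.001):  # 0.1%, as in the alert text
    return inactive_shard_fraction(health) > threshold

# Numbers from the 21:49 alert: 35 unassigned vs 70 active shards.
alert = {"unassigned_shards": 35, "initializing_shards": 0,
         "relocating_shards": 0, "active_shards": 70}
print(round(inactive_shard_fraction(alert), 4))  # 0.3333 -> CRITICAL
print(breaches_threshold(alert))                 # True
```

This also explains why the checks flap during a rolling restart: each reboot briefly pushes the inactive fraction far past 0.1% until shards reassign, which is why the operators extend icinga downtime around each node.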
[22:16:11] RECOVERY - ElasticSearch health check for shards on logstash1004 is OK: OK - elasticsearch status production-logstash-eqiad: status: green, number_of_nodes: 6, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 35, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 105, initializing_shards: 0, number_of_data_nodes: 3, delayed_unassigned_shards: [22:16:11] RECOVERY - ElasticSearch health check for shards on logstash1001 is OK: OK - elasticsearch status production-logstash-eqiad: status: green, number_of_nodes: 6, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 35, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 105, initializing_shards: 0, number_of_data_nodes: 3, delayed_unassigned_shards: [22:16:21] * gehel wonders if it means anything in English [22:16:21] RECOVERY - ElasticSearch health check for shards on logstash1005 is OK: OK - elasticsearch status production-logstash-eqiad: status: green, number_of_nodes: 6, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 35, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 105, initializing_shards: 0, number_of_data_nodes: 3, delayed_unassigned_shards: [22:16:41] ^ extended icinga downtime again [22:16:51] RECOVERY - ElasticSearch health check for shards on logstash1002 is OK: OK - elasticsearch status production-logstash-eqiad: status: green, number_of_nodes: 6, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 35, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 105, initializing_shards: 0, number_of_data_nodes: 3, delayed_unassigned_shards: [22:19:21] PROBLEM - nutcracker process on mw1099 is CRITICAL: PROCS 
CRITICAL: 0 processes with UID = 108 (nutcracker), command name nutcracker [22:19:51] PROBLEM - nutcracker port on mw1099 is CRITICAL: Connection refused [22:21:39] 6Operations, 10MediaWiki-Authentication-and-authorization, 10Traffic: Logging out of a wiki leaves an XXwikiSession= Cookie behind - https://phabricator.wikimedia.org/T127436#2061751 (10csteipp) Yep, exactly what @Tgr said. Historically, it was considered best practice to delete or change it, in case you for... [22:23:07] Bleh, redis/nutcracker complaining on mw1099 again [22:23:08] Meh [22:26:33] ostriches: that's me, i'll ack [22:26:36] i'm using it for debugging [22:26:37] you can ignore [22:28:07] (03CR) 10Chad: [C: 032] Move group1 to wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/273120 (owner: 10Chad) [22:28:36] (03Merged) 10jenkins-bot: Move group1 to wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/273120 (owner: 10Chad) [22:29:10] !log demon@tin rebuilt wikiversions.php and synchronized wikiversions files: group1 to wmf.14 too [22:29:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:31:00] 6Operations, 10ops-codfw: es2011-es2019 have default RAID stripe - https://phabricator.wikimedia.org/T127938#2061771 (10Volans) 5Open>3Resolved @Papaul thanks for the quick fix. I've checked all the servers and are all good. I've run also `/opt/wmf-mariadb10/install` on all of them. [22:42:51] PROBLEM - puppet last run on mw2147 is CRITICAL: CRITICAL: puppet fail [22:46:42] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/). 
[22:47:04] (03PS1) 10Dzahn: move files from /usr/bin to /usr/local/bin/wikistats/ [debs/wikistats] - 10https://gerrit.wikimedia.org/r/273124 [22:47:35] (03CR) 10Dzahn: [C: 032 V: 032] move files from /usr/bin to /usr/local/bin/wikistats/ [debs/wikistats] - 10https://gerrit.wikimedia.org/r/273124 (owner: 10Dzahn) [22:49:48] !log reboot logstash1005 for kernel and elasticsearch update [22:49:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:55:16] (03PS1) 10Dzahn: add script to deploy/backup/restore [debs/wikistats] - 10https://gerrit.wikimedia.org/r/273127 [22:55:48] 6Operations, 10RESTBase, 10hardware-requests: normalize eqiad restbase cluster - replace restbase1001-1006 - https://phabricator.wikimedia.org/T125842#2061860 (10Eevans) >>! In T125842#2061460, @Cmjohnson wrote: > The new restbase servers are on-site. Let's coordinate which 2 servers to start with. That'... [22:56:18] (03CR) 10Dzahn: [C: 032] add script to deploy/backup/restore [debs/wikistats] - 10https://gerrit.wikimedia.org/r/273127 (owner: 10Dzahn) [23:10:11] RECOVERY - puppet last run on mw2147 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [23:10:25] (03PS1) 10Dzahn: insert private db password after deploying [debs/wikistats] - 10https://gerrit.wikimedia.org/r/273131 [23:11:06] (03CR) 10Dzahn: [C: 032 V: 032] insert private db password after deploying [debs/wikistats] - 10https://gerrit.wikimedia.org/r/273131 (owner: 10Dzahn) [23:16:47] (03PS1) 10Dzahn: Revert "Revert "Major overhaul of Main Page"" [debs/wikistats] - 10https://gerrit.wikimedia.org/r/273132 [23:17:22] (03CR) 10Dzahn: [C: 032 V: 032] Revert "Revert "Major overhaul of Main Page"" [debs/wikistats] - 10https://gerrit.wikimedia.org/r/273132 (owner: 10Dzahn) [23:25:10] (03PS1) 10Dzahn: fix quoting error with dbpass [debs/wikistats] - 10https://gerrit.wikimedia.org/r/273134 [23:25:54] (03PS2) 10Dzahn: fix quoting error with dbpass [debs/wikistats] - 
10https://gerrit.wikimedia.org/r/273134 [23:26:00] (03CR) 10Dzahn: [C: 032 V: 032] fix quoting error with dbpass [debs/wikistats] - 10https://gerrit.wikimedia.org/r/273134 (owner: 10Dzahn) [23:40:32] (03CR) 10GWicke: [C: 031] restbase: move test/staging to its own cluster [puppet] - 10https://gerrit.wikimedia.org/r/272989 (https://phabricator.wikimedia.org/T103124) (owner: 10Filippo Giunchedi) [23:40:50] (03PS1) 10Dzahn: index.php, replace tabs with spaces [debs/wikistats] - 10https://gerrit.wikimedia.org/r/273138 [23:41:01] 6Operations, 13Patch-For-Review: change labstore1001/1002 to cfq io scheduler - https://phabricator.wikimedia.org/T126090#2061950 (10chasemp) 5Open>3Resolved https://gerrit.wikimedia.org/r/#/c/273049/ [23:41:03] 6Operations, 6Labs: overhaul labstore setup [tracking] - https://phabricator.wikimedia.org/T126083#2061952 (10chasemp) [23:41:42] (03CR) 10Dzahn: [C: 032 V: 032] index.php, replace tabs with spaces [debs/wikistats] - 10https://gerrit.wikimedia.org/r/273138 (owner: 10Dzahn) [23:45:09] (03PS1) 10Dzahn: don't wrap lines inside table cells [debs/wikistats] - 10https://gerrit.wikimedia.org/r/273139 [23:47:48] (03PS2) 10Dzahn: don't wrap lines inside table cells [debs/wikistats] - 10https://gerrit.wikimedia.org/r/273139 [23:48:21] (03CR) 10Dzahn: [C: 032 V: 032] don't wrap lines inside table cells [debs/wikistats] - 10https://gerrit.wikimedia.org/r/273139 (owner: 10Dzahn) [23:55:53] (03PS1) 10Ori.livneh: xhgui: Sample fewer requests (1:100k instead of 1:10k) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/273142 [23:56:14] Wow, nothing for SWAT? [23:56:14] People have been slacking. 
;-) [23:57:32] (03PS3) 10Ori.livneh: Fully-qualify EventLoggingBaseUri [mediawiki-config] - 10https://gerrit.wikimedia.org/r/273006 (https://phabricator.wikimedia.org/T127209) [23:57:38] (03CR) 10Ori.livneh: [C: 032] Fully-qualify EventLoggingBaseUri [mediawiki-config] - 10https://gerrit.wikimedia.org/r/273006 (https://phabricator.wikimedia.org/T127209) (owner: 10Ori.livneh) [23:57:56] (03PS2) 10Ori.livneh: xhgui: Sample fewer requests (1:100k instead of 1:10k) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/273142 [23:58:08] (03Merged) 10jenkins-bot: Fully-qualify EventLoggingBaseUri [mediawiki-config] - 10https://gerrit.wikimedia.org/r/273006 (https://phabricator.wikimedia.org/T127209) (owner: 10Ori.livneh) [23:58:17] (03CR) 10Ori.livneh: [C: 032] xhgui: Sample fewer requests (1:100k instead of 1:10k) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/273142 (owner: 10Ori.livneh) [23:58:43] (03Merged) 10jenkins-bot: xhgui: Sample fewer requests (1:100k instead of 1:10k) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/273142 (owner: 10Ori.livneh) [23:59:24] James_F: I dunno... you could give https://phabricator.wikimedia.org/T30984 a shot [23:59:42] :-)
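[Editor's note] The last config change above reduces xhgui profiling from one request in 10,000 to one in 100,000. A sketch of what one-in-N sampling implies for profiled volume (the request figure and function names are illustrative; the real sampling lives in mediawiki-config's profiler setup, not in this code):

```python
import random

# Sketch of one-in-N request sampling, as in the xhgui change above
# (1:10k -> 1:100k). Names and the request volume are illustrative.

def should_profile(rate, rng=random):
    """Profile roughly one request in `rate` (uniform, independent)."""
    return rng.randint(1, rate) == 1

def expected_samples(requests_per_day, rate):
    """Mean number of profiled requests per day at 1:rate sampling."""
    return requests_per_day / rate

# Moving from 1:10k to 1:100k cuts profiled volume tenfold:
per_day = 200_000_000  # hypothetical daily request volume
print(expected_samples(per_day, 10_000))   # 20000.0
print(expected_samples(per_day, 100_000))  # 2000.0
```

The trade-off is the usual one for sampled profiling: a tenth of the storage and collection overhead, at the cost of needing ten times longer to accumulate the same number of profiles for a rare code path.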