[00:00:04] RoanKattouw ostriches Krenair MaxSem awight: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160311T0000). Please do the needful. [00:00:04] RoanKattouw jgirault ebernhardson MaxSem: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [00:00:09] James_F: yeah, this swat is A-OK [00:00:13] Kk. [00:00:19] RECOVERY - puppet last run on mw1168 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [00:00:28] who's that gonna be? [00:00:37] RECOVERY - salt-minion processes on mw1168 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [00:00:38] Me! [00:00:48] RECOVERY - puppet last run on mw1014 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [00:00:56] Because I basically have my quarterly goal in this SWAT :D [00:01:09] RECOVERY - salt-minion processes on mw1014 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [00:01:18] RECOVERY - puppet last run on mw1008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [00:01:27] RECOVERY - puppet last run on mw1013 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [00:01:35] RoanKattouw: hah! 
[00:01:38] RECOVERY - puppet last run on mw1009 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [00:01:48] RECOVERY - salt-minion processes on mw1163 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [00:02:07] RECOVERY - puppet last run on mw1163 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [00:02:19] RECOVERY - salt-minion processes on mw1005 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [00:02:27] RECOVERY - salt-minion processes on mw1012 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [00:02:28] RECOVERY - puppet last run on mw1001 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [00:03:02] OK, let's see [00:03:07] We have ... 8 o.O people today [00:03:09] RECOVERY - puppet last run on mw1012 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [00:03:17] RECOVERY - salt-minion processes on mw1006 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [00:03:18] RoanKattouw: morning swat was cancelled :/ [00:03:21] Is anomie here? [00:03:28] RECOVERY - puppet last run on mw1005 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [00:03:40] Also pinging benestar|cloud jgirault ebernhardson dcausse [00:03:44] RoanKattouw: the patch to wmf15, from morning swat, doesn't really matter anymore. 
I mean might as well ship it since you +2'd but everything is on wmf16 now [00:03:52] RoanKattouw: just needs the config change that i have in evening swat now [00:03:56] OK, will cancel [00:04:01] RoanKattouw: I’m here, but do the others first please [00:04:09] RECOVERY - salt-minion processes on mw1016 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [00:04:09] RECOVERY - puppet last run on mw1167 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [00:04:14] OK, ebernhardson's goes first [00:04:18] RECOVERY - salt-minion processes on mw1167 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [00:04:28] RECOVERY - puppet last run on mw1006 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [00:04:38] (CR) Catrope: [C: 2] Enable completion suggester as default on all but top 12 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/275605 (https://phabricator.wikimedia.org/T128775) (owner: EBernhardson) [00:04:49] RoanKattouw: brad's probably not here [00:05:09] RECOVERY - puppet last run on mw1164 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [00:05:17] RECOVERY - puppet last run on mw1016 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [00:05:27] greg-g: Oh but twentyafterfour merged his commit 2 hours ago [00:05:28] RECOVERY - salt-minion processes on mw1164 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [00:05:55] (Merged) jenkins-bot: Enable completion suggester as default on all but top 12 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/275605 (https://phabricator.wikimedia.org/T128775) (owner: EBernhardson) [00:05:55] oh, maybe it was pushed out? dunno [00:06:04] * greg-g looks at sal [00:06:15] Woah, 12 patches in swat. Is that a record? 
[00:06:25] Yeah it was pushed out [00:06:31] Krinkle: Merger of morning and evening SWAT [00:06:45] So, Brad's patch is already out, and the completion suggester config patch is listed twice [00:06:48] RECOVERY - salt-minion processes on mw1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [00:06:58] (PS2) JGirault: Bump portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/276094 (https://phabricator.wikimedia.org/T125472) [00:06:59] RoanKattouw: yeah, it was https://tools.wmflabs.org/sal/log/AVNiyFJ5_GUtdAQqNXuZ [00:07:02] also some of those are dupes it looks like, because i put dcausse's patch under my name in evening swat, and it seems since then morning swat got copied in [00:07:15] confusion! [00:07:58] RECOVERY - puppet last run on mw1162 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [00:08:27] RECOVERY - puppet last run on mw1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [00:08:31] greg-g: Yeah, I independently found it in git log on tin as well [00:08:39] RECOVERY - salt-minion processes on mw1162 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [00:08:45] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Enable completion suggester on all but top 12 wikis (duration: 00m 32s) [00:08:48] RECOVERY - salt-minion processes on mw1015 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [00:08:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:09:00] RECOVERY - puppet last run on mw1015 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [00:09:25] ebernhardson: That's you ---^^ please verify [00:10:18] RECOVERY - salt-minion processes on mw1004 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [00:10:38] RoanKattouw: nothing appears to have exploded, QPS is climbing 
as expected. will keep watch on it [00:10:58] RECOVERY - puppet last run on mw1161 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [00:11:38] RECOVERY - salt-minion processes on mw1161 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [00:12:01] The Wikibase patch is also deployed already [00:12:27] RECOVERY - salt-minion processes on mw1007 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [00:12:29] RECOVERY - puppet last run on mw1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [00:12:30] (PS1) Papaul: dhcp: adding MAC address entries for rdb200[5-6] Bug:T129178 [puppet] - https://gerrit.wikimedia.org/r/276663 (https://phabricator.wikimedia.org/T129178) [00:12:47] So that leaves jgirault (who is still frantically merging stuff into the portals repo), MaxSem (whose patch is stuck in Jenkins hell) and me [00:12:59] So I'll do mine next [00:13:50] RoanKattouw: takes soooo much time for that gate-and-submit job :P [00:13:59] RECOVERY - puppet last run on mw1007 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [00:14:09] the last patch is to add the trailing slash to the url to purge, actually to avoid manual work :P [00:14:15] jgirault: Yeah for some reason wikimedia/portals is not a separate queue :( so it's waiting for a lot of other patches: https://integration.wikimedia.org/zuul/ [00:14:33] legoktm: How does one make something be a separate queue à la mediawiki-config? 
---^^ [00:14:48] RECOVERY - salt-minion processes on mw1010 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [00:15:06] RoanKattouw: prefixed jobs aka file a bug in #ci-config project [00:15:11] RoanKattouw: FYI there are plans I think to move out of mediawiki-config https://phabricator.wikimedia.org/T129436 [00:15:18] RECOVERY - puppet last run on mw1010 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [00:15:18] RECOVERY - salt-minion processes on mw1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [00:16:08] RECOVERY - puppet last run on mw1003 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [00:16:16] legoktm, jgirault: Filed https://phabricator.wikimedia.org/T129591 [00:16:18] RECOVERY - puppet last run on mw1166 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [00:16:27] RECOVERY - salt-minion processes on mw1166 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [00:16:34] RoanKattouw: thanks [00:16:35] (CR) Catrope: [C: 2] Default to Flow for new talk pages on MediaWiki.org [mediawiki-config] - https://gerrit.wikimedia.org/r/276395 (owner: Jforrester) [00:17:15] (Merged) jenkins-bot: Default to Flow for new talk pages on MediaWiki.org [mediawiki-config] - https://gerrit.wikimedia.org/r/276395 (owner: Jforrester) [00:18:36] (CR) Catrope: [C: 2] Remove vestiges of the old Occupy feature [mediawiki-config] - https://gerrit.wikimedia.org/r/276508 (owner: Mattflaschen) [00:18:47] RECOVERY - salt-minion processes on mw1011 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [00:19:39] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Make Flow the default in talk namespaces on mediawikiwiki (duration: 00m 38s) [00:19:42] (Merged) jenkins-bot: Remove vestiges of the old Occupy feature 
[mediawiki-config] - https://gerrit.wikimedia.org/r/276508 (owner: Mattflaschen) [00:19:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:20:06] (PS3) JGirault: Bump portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/276094 (https://phabricator.wikimedia.org/T125472) [00:20:54] RoanKattouw: Portals is ready for deploy anytime :) [00:21:06] Thanks :) [00:21:37] RECOVERY - puppet last run on mw1165 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [00:21:38] RECOVERY - salt-minion processes on mw1165 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [00:22:04] !log catrope@tin Synchronized wmf-config/CommonSettings.php: Clean up old Flow occupy stuff (duration: 00m 25s) [00:22:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:22:31] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Clean up old Flow occupy stuff (duration: 00m 26s) [00:22:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:23:08] (CR) Catrope: [C: 2] Enable cross-wiki notifications beta feature on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/275930 (https://phabricator.wikimedia.org/T124234) (owner: Catrope) [00:23:46] (Merged) jenkins-bot: Enable cross-wiki notifications beta feature on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/275930 (https://phabricator.wikimedia.org/T124234) (owner: Catrope) [00:24:42] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Enable cross-wiki notifications beta feature on all wikis (duration: 00m 27s) [00:24:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:24:51] quiddity: https://en.wikipedia.org/wiki/Special:Preferences#mw-prefsection-betafeatures [00:25:29] woo! 
[00:25:41] ...and it turns out my volunteer account has a message on the Bavarian Wikipedia [00:25:45] lol [00:25:46] Operations, ops-codfw, codfw-rollout, codfw-rollout-Jan-Mar-2016: rack/setup/deploy rdb200[5-6] - https://phabricator.wikimedia.org/T129178#2110306 (Papaul) [00:27:20] RoanKattouw: fatalmonitor is showing $wmgFlowNamespaces undefined, and invalid argument supplied for foreach (probably using that variable) [00:27:48] i think they may have only been during the sync though ... dunno [00:28:12] Hi. Same here RoanKattouw: Arabic Wikisource, Egyptian Arabic Wikipedia, Asturian Wiktionary, Bavarian Wikipedia, Bihari Wikipedia, Emiliano-Romagnolo Wikipedia. [00:28:24] ebernhardson: That was probably intermittent. I did break wikitext talk pages though, so I'm reverting now [00:28:30] !log catrope@tin Synchronized wmf-config/CommonSettings.php: Revert Flow change on mw.org (duration: 00m 27s) [00:28:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:28:44] (people will learn wikis with NewUserMessage or bot equivalent) [00:28:51] Operations, ops-codfw, codfw-rollout, codfw-rollout-Jan-Mar-2016: rack/setup/deploy rdb200[5-6] - https://phabricator.wikimedia.org/T129178#2110319 (Papaul) rdb2005 mgmt 10.193.2.247 rdb2006 mgmt 10.193.2.248 Port information rdb2005 ge-5/0/5 rack C5 rdb2006 ge-5/0/7 rack D5 [00:28:57] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Revert Flow change on mw.org (duration: 00m 27s) [00:29:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:29:43] Operations, ops-codfw, codfw-rollout, codfw-rollout-Jan-Mar-2016: rack/setup/deploy rdb200[5-6] - https://phabricator.wikimedia.org/T129178#2110334 (Papaul) [00:32:17] !log Running populateContentModel.php on mediawikiwiki so I can un-revert the Flow change [00:32:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Mr. 
Obvious [00:33:55] (PS1) Papaul: adding install params for rdb200[5-6] Bug:T129178 [puppet] - https://gerrit.wikimedia.org/r/276674 (https://phabricator.wikimedia.org/T129178) [00:37:15] Operations, hardware-requests, codfw-rollout, codfw-rollout-Jan-Mar-2016: codfw: (2) servers for redis jobrunners - https://phabricator.wikimedia.org/T126453#2110374 (Papaul) [00:40:17] PROBLEM - check_puppetrun on lutetium is CRITICAL: CRITICAL: puppet fail [00:41:17] !log catrope@tin Synchronized php-1.27.0-wmf.16/includes/diff/DifferenceEngine.php: Convert timing to milliseconds (duration: 00m 26s) [00:41:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:43:05] !log catrope@tin Synchronized php-1.27.0-wmf.16/extensions/Echo/Hooks.php: Try fixing the thank-you-edit bug again (duration: 00m 26s) [00:43:10] jgirault: Sorry for the delay, I got distracted because I'd broken Flow with a previous patch. Going to do your portals thing now [00:43:16] (CR) Catrope: [C: 2] Bump portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/276094 (https://phabricator.wikimedia.org/T125472) (owner: JGirault) [00:43:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:43:25] RoanKattouw: ok! 
[00:43:42] RoanKattouw: you gotta run sync-portals script [00:43:47] Oh, that's new [00:43:53] RoanKattouw: https://github.com/wikimedia/wikimedia-portals/blob/master/sync-portals [00:43:58] (Merged) jenkins-bot: Bump portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/276094 (https://phabricator.wikimedia.org/T125472) (owner: JGirault) [00:44:01] RoanKattouw: supposed to ease the process [00:44:14] Oh, I see [00:45:16] Puppet, Beta-Cluster-Infrastructure, Discovery, Wikimedia-Portals, Patch-For-Review: beta-mediawiki-config-update-eqiad failing with merge conflict in portals - https://phabricator.wikimedia.org/T129427#2110426 (greg) Failed again: https://integration.wikimedia.org/ci/job/beta-mediawiki-config-... [00:45:17] PROBLEM - check_puppetrun on lutetium is CRITICAL: CRITICAL: puppet fail [00:45:17] PROBLEM - check_puppetrun on thulium is CRITICAL: CRITICAL: Puppet has 1 failures [00:45:36] !log catrope@tin Synchronized portals/prod/wikipedia.org/assets: (no message) (duration: 00m 26s) [00:45:41] jgirault: ^ re the beta cluster job failing again because of the failure to rebase portals :/ [00:45:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:45:52] Operations, Services, Patch-For-Review: setup/deploy sc[ab]200[1-2] - https://phabricator.wikimedia.org/T129234#2110429 (Papaul) [00:45:53] ebernhardson: oh noes ^ [00:45:53] Operations, ops-codfw: update physical label on sc[ab]200[1-2] - https://phabricator.wikimedia.org/T129305#2110427 (Papaul) Open>Resolved Complete [00:46:02] !log catrope@tin Synchronized portals: (no message) (duration: 00m 26s) [00:46:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:46:36] jgirault: All done [00:46:42] RoanKattouw: https://www.wikipedia.org/ [00:46:49] new search box =] [00:46:50] Nice! [00:46:51] jgirault: ? [00:47:04] And it looks the same as in the app! Awesome! 
[00:47:16] jgirault: did you pull anything to the portals directory on beta cluster? it shouldn't be used anymore [00:47:32] ebernhardson: haven’t touched the beta cluster today [00:47:44] ebernhardson: but I did this patch on mediawiki-config https://gerrit.wikimedia.org/r/#/c/276094/ [00:48:04] hmm [00:49:10] we might just want to turn off the auto-rebase there and let it do a proper checkout...not sure :S [00:50:07] PROBLEM - check_puppetrun on lutetium is CRITICAL: CRITICAL: puppet fail [00:50:17] PROBLEM - check_puppetrun on thulium is CRITICAL: CRITICAL: Puppet has 1 failures [00:53:28] yeah, each commit to mediawiki-config runs the job in beta [00:55:07] PROBLEM - check_puppetrun on lutetium is CRITICAL: CRITICAL: puppet fail [00:55:07] PROBLEM - check_puppetrun on thulium is CRITICAL: CRITICAL: Puppet has 1 failures [00:55:40] Thanks RoanKattouw =] [01:00:17] PROBLEM - check_puppetrun on thulium is CRITICAL: CRITICAL: Puppet has 1 failures [01:00:17] PROBLEM - check_puppetrun on lutetium is CRITICAL: CRITICAL: puppet fail [01:04:50] !log Made mediawiki/php/wikidiff read-only in Gerrit [01:05:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:05:08] PROBLEM - check_puppetrun on lutetium is CRITICAL: CRITICAL: puppet fail [01:05:17] PROBLEM - check_puppetrun on thulium is CRITICAL: CRITICAL: Puppet has 1 failures [01:05:48] !log catrope@tin Synchronized wmf-config/CommonSettings.php: Un-revert Flow change on mw.org (duration: 00m 29s) [01:06:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:06:15] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Un-revert Flow change on mw.org (duration: 00m 26s) [01:06:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:07:25] (PS1) Krinkle: wmfstatic: Remove redundant file_exists check [mediawiki-config] - https://gerrit.wikimedia.org/r/276677 [01:07:27] (PS1) Krinkle: 
wmfstatic: Make 404 for !stat() the same as for !$fallback [mediawiki-config] - https://gerrit.wikimedia.org/r/276678 [01:10:07] RECOVERY - check_puppetrun on thulium is OK: OK: Puppet is currently enabled, last run 71 seconds ago with 0 failures [01:10:17] PROBLEM - check_puppetrun on lutetium is CRITICAL: CRITICAL: puppet fail [01:15:08] RECOVERY - check_puppetrun on lutetium is OK: OK: Puppet is currently enabled, last run 198 seconds ago with 0 failures [01:18:06] RoanKattouw: All done? [01:18:54] Krinkle: Yes, all clear [01:19:10] (CR) Krinkle: [C: 2] multiversion: Remove logic for branch pointers in /w/static [mediawiki-config] - https://gerrit.wikimedia.org/r/276383 (https://phabricator.wikimedia.org/T99096) (owner: Krinkle) [01:19:22] (CR) Krinkle: [C: 2] wmfstatic: Remove redundant file_exists check [mediawiki-config] - https://gerrit.wikimedia.org/r/276677 (owner: Krinkle) [01:19:30] (CR) Krinkle: [C: 2] wmfstatic: Make 404 for !stat() the same as for !$fallback [mediawiki-config] - https://gerrit.wikimedia.org/r/276678 (owner: Krinkle) [01:19:45] (Merged) jenkins-bot: multiversion: Remove logic for branch pointers in /w/static [mediawiki-config] - https://gerrit.wikimedia.org/r/276383 (https://phabricator.wikimedia.org/T99096) (owner: Krinkle) [01:20:01] (Merged) jenkins-bot: wmfstatic: Remove redundant file_exists check [mediawiki-config] - https://gerrit.wikimedia.org/r/276677 (owner: Krinkle) [01:20:11] (Merged) jenkins-bot: wmfstatic: Make 404 for !stat() the same as for !$fallback [mediawiki-config] - https://gerrit.wikimedia.org/r/276678 (owner: Krinkle) [01:21:57] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [01:22:07] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [01:22:41] jdlrobson: Hm.. 
do you know why pageimage for https://en.wikipedia.org/wiki/Amazon.com is the founder's image (first thumb) rather than the logo or webpage from infobox? [01:22:51] E.g. type "Am" on www.wikipedia.org [01:24:22] Krinkle: blame MaxSem [01:25:06] I guess often things from the infobox may be wrong, but presumably there is a way to override the default guess? [01:25:11] !log krinkle@tin Synchronized w/static.php: 6da604f and 49c07ac (duration: 00m 39s) [01:25:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:25:21] oh, did the non-free image stuff land already? That might cause it [01:25:33] Krinkle, it's too small [01:25:54] the logo is less than 100px high [01:26:04] It could be this too -- https://github.com/wikimedia/mediawiki-extensions-PageImages/commit/7c78ba622c7fe9c5b2fefbea0db21a279957590f [01:26:08] the homepage is indeed unfree [01:26:17] The SVG's nominal size is 972 × 196 [01:26:26] (ignoring the fact that it's an SVG) [01:26:32] its size on page matters [01:26:46] otherwise all pages will have images from maint templates [01:28:30] Hm.. [01:28:35] I see [01:28:48] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [01:28:58] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [01:40:27] Operations, Performance-Team, Performance: HHVM 3.12 has a race-condition when starting up - https://phabricator.wikimedia.org/T129467#2110737 (ori) >>! In T129467#2107062, @Joe wrote: > A wild guess: our hhvm-warmup job might have something to do with this - on mw1122 I found a hanging hhvm process cre... 
[01:45:27] Operations, Performance-Team, Performance: HHVM 3.12 has a race-condition when starting up - https://phabricator.wikimedia.org/T129467#2110740 (ori) The hosts that had /etc/malloc.conf -> prof:true were mw1122.eqiad.wmnet, mw1169.eqiad.wmnet, mw1015.eqiad.wmnet, and mw1107.eqiad.wmnet, which are exact... [01:45:28] RECOVERY - Apache HTTP on mw1107 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.074 second response time [01:46:08] RECOVERY - HHVM rendering on mw1122 is OK: HTTP OK: HTTP/1.1 200 OK - 71447 bytes in 0.378 second response time [01:46:37] RECOVERY - HHVM rendering on mw1107 is OK: HTTP OK: HTTP/1.1 200 OK - 71449 bytes in 3.502 second response time [01:46:47] RECOVERY - Apache HTTP on mw1122 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.052 second response time [01:58:47] (PS2) Dzahn: dhcp: adding MAC address entries for rdb200[5-6] [puppet] - https://gerrit.wikimedia.org/r/276663 (https://phabricator.wikimedia.org/T129178) (owner: Papaul) [02:01:33] (PS3) Dzahn: dhcp: adding MAC address entries for rdb200[5-6] [puppet] - https://gerrit.wikimedia.org/r/276663 (https://phabricator.wikimedia.org/T129178) (owner: Papaul) [02:07:43] (CR) Dzahn: [C: 2] dhcp: adding MAC address entries for rdb200[5-6] [puppet] - https://gerrit.wikimedia.org/r/276663 (https://phabricator.wikimedia.org/T129178) (owner: Papaul) [02:11:12] Operations, ops-codfw, codfw-rollout, codfw-rollout-Jan-Mar-2016: rack/setup/deploy rdb200[5-6] - https://phabricator.wikimedia.org/T129178#2097251 (Dzahn) These have been added to DHCP now. [02:18:28] (PS2) Dzahn: remove cygnus,technetium from hieradata, incl. admin groups [puppet] - https://gerrit.wikimedia.org/r/275877 (https://phabricator.wikimedia.org/T118763) [02:18:36] (CR) Dzahn: [C: 2] remove cygnus,technetium from hieradata, incl. 
admin groups [puppet] - https://gerrit.wikimedia.org/r/275877 (https://phabricator.wikimedia.org/T118763) (owner: Dzahn) [02:21:22] !log technetium - shutdown -h now [02:21:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:22:59] PROBLEM - Host technetium is DOWN: PING CRITICAL - Packet loss = 100% [02:23:36] !log cygnus - poweroff (for variety) [02:23:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:25:48] RECOVERY - Host technetium is UP: PING OK - Packet loss = 0%, RTA = 1.68 ms [02:30:16] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.16) (duration: 12m 33s) [02:30:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:31:00] !log cygnus/technetium: puppet cert clean, salt-key -d (neodymium), puppetstoredconfigclean.rb (rm from Icinga), gnt-instance remove (destroy VMs) [02:31:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:33:15] PROBLEM - Host technetium is DOWN: PING CRITICAL - Packet loss = 100% [02:36:25] ACKNOWLEDGEMENT - Host technetium is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn its dead Jim [02:39:01] !log l10nupdate@tin ResourceLoader cache refresh completed at Fri Mar 11 02:39:01 UTC 2016 (duration 8m 45s) [02:39:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:43:23] Operations, ops-codfw, codfw-rollout, codfw-rollout-Jan-Mar-2016: rack/setup/deploy rdb200[5-6] - https://phabricator.wikimedia.org/T129178#2110862 (Papaul) [02:45:38] (PS2) Dzahn: adding install params for rdb200[5-6] Bug:T129178 [puppet] - https://gerrit.wikimedia.org/r/276674 (https://phabricator.wikimedia.org/T129178) (owner: Papaul) [02:45:54] (CR) Dzahn: [C: 2] adding install params for rdb200[5-6] Bug:T129178 [puppet] - https://gerrit.wikimedia.org/r/276674 (https://phabricator.wikimedia.org/T129178) (owner: Papaul) [02:46:46] (CR) 
Dzahn: [V: 2] adding install params for rdb200[5-6] Bug:T129178 [puppet] - https://gerrit.wikimedia.org/r/276674 (https://phabricator.wikimedia.org/T129178) (owner: Papaul) [02:49:06] Operations, ops-codfw, codfw-rollout, codfw-rollout-Jan-Mar-2016: rack/setup/deploy rdb200[5-6] - https://phabricator.wikimedia.org/T129178#2110900 (RobH) network port description set, enabled, and vlan set. [02:53:46] grmbl, puppet fails on bastions due to the removed user.. looking [02:54:12] !log installing rdb200[5-6] [02:54:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:57:07] !log krenair@tin Synchronized php-1.27.0-wmf.16/extensions/VisualEditor/modules/ve-mw/ui/pages/ve.ui.MWTemplatePage.js: touch - see comment on https://gerrit.wikimedia.org/r/#/c/274120/ (duration: 00m 31s) [02:57:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:58:51] ACKNOWLEDGEMENT - Host bast2001 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn https://phabricator.wikimedia.org/T129316 [02:59:50] PROBLEM - Disk space on fluorine is CRITICAL: DISK CRITICAL - free space: /a 136869 MB (3% inode=99%) [03:01:44] Operations: "internal_api_error_MWException: [dbf916b7] Exception Caught: Could not acquire lock for" for some uploads - https://phabricator.wikimedia.org/T129621#2110930 (zhuyifei1999) [03:03:21] (PS1) Dzahn: admin: remove akumar,mnoushad from bastiononly [puppet] - https://gerrit.wikimedia.org/r/276691 (https://phabricator.wikimedia.org/T126012) [03:03:59] (CR) Dzahn: [C: 2] admin: remove akumar,mnoushad from bastiononly [puppet] - https://gerrit.wikimedia.org/r/276691 (https://phabricator.wikimedia.org/T126012) (owner: Dzahn) [03:04:20] PROBLEM - Last backup of the others filesystem on labstore1001 is CRITICAL: CRITICAL - Last run result for unit replicate-others was exit-code [03:05:00] PROBLEM - puppet last run on mw2074 is CRITICAL: CRITICAL: puppet fail [03:07:20] 
RECOVERY - puppet last run on bast1001 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [03:08:20] RECOVERY - puppet last run on rutherfordium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [03:10:30] RECOVERY - puppet last run on bast4001 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [03:10:42] RECOVERY - puppet last run on hooft is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [03:12:14] ACKNOWLEDGEMENT - DPKG on maps-test2001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages daniel_zahn held nodejs package [03:12:14] ACKNOWLEDGEMENT - puppet last run on maps-test2001 is CRITICAL: CRITICAL: Puppet has 1 failures daniel_zahn held nodejs package [03:12:14] ACKNOWLEDGEMENT - DPKG on maps-test2002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages daniel_zahn held nodejs package [03:12:14] ACKNOWLEDGEMENT - puppet last run on maps-test2002 is CRITICAL: CRITICAL: Puppet has 1 failures daniel_zahn held nodejs package [03:12:14] ACKNOWLEDGEMENT - DPKG on maps-test2003 is CRITICAL: DPKG CRITICAL dpkg reports broken packages daniel_zahn held nodejs package [03:12:14] ACKNOWLEDGEMENT - puppet last run on maps-test2003 is CRITICAL: CRITICAL: Puppet has 1 failures daniel_zahn held nodejs package [03:12:14] ACKNOWLEDGEMENT - DPKG on maps-test2004 is CRITICAL: DPKG CRITICAL dpkg reports broken packages daniel_zahn held nodejs package [03:12:15] ACKNOWLEDGEMENT - puppet last run on maps-test2004 is CRITICAL: CRITICAL: Puppet has 1 failures daniel_zahn held nodejs package [03:13:31] ACKNOWLEDGEMENT - Last backup of the others filesystem on labstore1001 is CRITICAL: CRITICAL - Last run result for unit replicate-others was exit-code daniel_zahn https://phabricator.wikimedia.org/T127567 [03:13:31] ACKNOWLEDGEMENT - Last backup of the tools filesystem on labstore1001 is CRITICAL: CRITICAL - Last run result for unit replicate-tools was exit-code 
daniel_zahn https://phabricator.wikimedia.org/T127567 [03:14:11] ACKNOWLEDGEMENT - DPKG on nobelium is CRITICAL: DPKG CRITICAL dpkg reports broken packages daniel_zahn some kind of hhvm testing [03:23:13] wikibugs: speak up [03:23:14] ACKNOWLEDGEMENT - puppet last run on labnet1002 is CRITICAL: CRITICAL: Puppet has 1 failures daniel_zahn https://phabricator.wikimedia.org/T129623 [03:23:55] does it need poking mutante? [03:24:10] it seems like it, yea [03:24:30] I feel like I am moving from one interrupt to the next here :( [03:25:35] yea, same when looking at icinga [03:26:02] down to 18 again [03:27:09] cool is that so many warnings/unknowns are gone because godog fixed many graphite checks [03:29:26] the puppet fails on bastions are fixed. gotta move afk [03:30:12] Operations, Labs, Labs-Infrastructure: labnet1002 can't talk to webproxy.eqiad.wmnet:8080, puppet fails to install designateclient - https://phabricator.wikimedia.org/T129623#2110962 (Dzahn) [03:33:49] RECOVERY - puppet last run on mw2074 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [03:45:53] !log rdb200[5-6] signing puppet certs, salt-key, initial run [03:45:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [04:04:35] PROBLEM - Last backup of the maps filesystem on labstore1001 is CRITICAL: CRITICAL - Last run result for unit replicate-maps was exit-code [04:15:34] !log rdb200[5-6] installation complete [04:15:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [04:18:04] Operations, ops-codfw, codfw-rollout, codfw-rollout-Jan-Mar-2016: rack/setup/deploy rdb200[5-6] - https://phabricator.wikimedia.org/T129178#2111044 (Papaul) [04:18:22] Operations, ops-codfw, codfw-rollout, codfw-rollout-Jan-Mar-2016: rack/setup/deploy rdb200[5-6] - https://phabricator.wikimedia.org/T129178#2097251 (Papaul) [04:20:30] Operations, ops-codfw, codfw-rollout, codfw-rollout-Jan-Mar-2016: 
rack/setup/deploy rdb200[5-6] - https://phabricator.wikimedia.org/T129178#2111046 (10Papaul) a:5Papaul>3elukey The installation is complete, handing the task to @elukey [04:49:04] RECOVERY - DPKG on nobelium is OK: All packages OK [04:50:19] !log Uninstalled HHVM on nobelium; not puppetized. [04:50:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [05:59:51] akosiaris: possible to merge https://gerrit.wikimedia.org/r/#q,276405,n,z today? [06:29:36] PROBLEM - puppet last run on lvs1003 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:35] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:46] PROBLEM - puppet last run on mw2024 is CRITICAL: CRITICAL: puppet fail [06:31:05] PROBLEM - puppet last run on mw1110 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:14] PROBLEM - puppet last run on mw2073 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:24] PROBLEM - puppet last run on mw2050 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:45] PROBLEM - puppet last run on nobelium is CRITICAL: CRITICAL: Puppet has 3 failures [06:31:45] PROBLEM - puppet last run on cp3048 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:54] PROBLEM - puppet last run on mw2045 is CRITICAL: CRITICAL: Puppet has 1 failures [06:44:06] <_joe_> !log restarted hhvm on mw1166 [06:44:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:45:08] Anyone available for emergency SWAT of, https://gerrit.wikimedia.org/r/#/c/276700/ ? [06:53:45] (03PS1) 10Giuseppe Lavagetto: jobqueue: re-add traditional jobs to mw1161,2 [puppet] - 10https://gerrit.wikimedia.org/r/276701 [06:55:51] ok. I can deploy it then. 
[06:56:00] (UBN, so) [06:56:18] kart_: yeah, go for it [06:56:24] looks sane, javascript-only [06:56:44] PROBLEM - puppet last run on ms-fe3001 is CRITICAL: CRITICAL: Puppet has 1 failures [06:56:54] RECOVERY - puppet last run on mw1110 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [06:57:04] RECOVERY - puppet last run on mw2073 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [06:57:06] RECOVERY - puppet last run on mw2050 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [06:57:14] RECOVERY - puppet last run on lvs1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:25] RECOVERY - puppet last run on nobelium is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [06:57:35] RECOVERY - puppet last run on cp3048 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [06:57:44] RECOVERY - puppet last run on mw2045 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:05] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:23] ori: thanks. [06:58:24] RECOVERY - puppet last run on mw2024 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [06:58:35] ori: we don't use tin anymore for deployment, right? [06:59:07] I have to fix my ssh config. Not deployed anything since Feb. [07:00:31] <_joe_> kart_: we use tin indeed [07:10:26] _joe_: ok. Thanks! [07:15:04] (03CR) 10Giuseppe Lavagetto: [C: 032] jobqueue: re-add traditional jobs to mw1161,2 [puppet] - 10https://gerrit.wikimedia.org/r/276701 (owner: 10Giuseppe Lavagetto) [07:15:48] 6Operations, 10ops-eqiad, 6Labs: disk failure on labsdb1002 - https://phabricator.wikimedia.org/T126946#2111154 (10jcrespo) > @jcrespo can you help determine the way forward here? 
Do we need to pursue figuring out which disk is the faulty one here for replacment? We already have the replacement on rack, it... [07:23:27] 6Operations, 10Wikimedia-General-or-Unknown, 13Patch-For-Review, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Switchover of the application servers to codfw - https://phabricator.wikimedia.org/T124671#2111170 (10jcrespo) #3 (and #4, partially -for mediawiki "simple" user HTTP requests and dbs) is being d... [07:24:34] RECOVERY - puppet last run on ms-fe3001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:31:05] !log kartik@tin Synchronized php-1.27.0-wmf.16/extensions/ContentTranslation: Deploying 276700 for ContentTranslation (duration: 00m 43s) [07:31:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:06:17] _joe_: let's increase the timeouts [08:06:55] RECOVERY - Disk space on fluorine is OK: DISK OK [08:14:44] <_joe_> ori: timeouts for what? [08:14:55] <_joe_> (sorry, I was walking to the conf) [08:16:55] RECOVERY - HHVM rendering on mw1174 is OK: HTTP OK: HTTP/1.1 200 OK - 71824 bytes in 0.443 second response time [08:17:01] !log restarted hhvm on mw1174, mw1186, mw1211 [08:17:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:17:15] RECOVERY - HHVM rendering on mw1186 is OK: HTTP OK: HTTP/1.1 200 OK - 71824 bytes in 0.272 second response time [08:17:35] RECOVERY - Apache HTTP on mw1174 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.040 second response time [08:17:45] RECOVERY - Apache HTTP on mw1186 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.037 second response time [08:18:25] RECOVERY - HHVM rendering on mw1220 is OK: HTTP OK: HTTP/1.1 200 OK - 71824 bytes in 1.001 second response time [08:18:31] !log restarted hhvm on mw1220, mw1244 [08:18:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:18:54] RECOVERY - Apache HTTP on mw1220 
is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.042 second response time [08:18:55] RECOVERY - HHVM rendering on mw1211 is OK: HTTP OK: HTTP/1.1 200 OK - 71822 bytes in 0.097 second response time [08:18:55] RECOVERY - Apache HTTP on mw1244 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.053 second response time [08:19:06] PROBLEM - puppet last run on cp4002 is CRITICAL: CRITICAL: Puppet has 1 failures [08:19:06] RECOVERY - HHVM rendering on mw1244 is OK: HTTP OK: HTTP/1.1 200 OK - 71824 bytes in 0.311 second response time [08:19:06] RECOVERY - Apache HTTP on mw1211 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.034 second response time [08:19:21] <_joe_> uhm more crashes? [08:19:43] <_joe_> or just leftovers [08:20:05] RECOVERY - HHVM rendering on mw1246 is OK: HTTP OK: HTTP/1.1 200 OK - 71823 bytes in 0.121 second response time [08:20:14] !log restarted hhvm on mw1246, mw1258 [08:20:15] RECOVERY - Apache HTTP on mw1246 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.032 second response time [08:20:16] leftovers [08:20:16] RECOVERY - HHVM rendering on mw1258 is OK: HTTP OK: HTTP/1.1 200 OK - 71824 bytes in 0.292 second response time [08:20:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:20:33] just some icinga cleansweep [08:21:13] <_joe_> moritzm: thanks :) [08:21:34] RECOVERY - Apache HTTP on mw1258 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.036 second response time [08:31:25] RECOVERY - puppet last run on cp4002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:49:34] !log cp1067: apt-get removed linux-image-3.16.0-4-amd64 and linux-image-4.4.0-1-amd64-dbg to free up some disk space [08:49:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:54:55] PROBLEM - Disk space on cp1067 is CRITICAL: DISK CRITICAL - /sys/kernel/debug/tracing is not accessible: Permission 
denied [09:03:30] ^ema [09:03:59] jynus: yes I'm looking into it, thanks [09:07:55] !log uploaded linux-meta 1.9 for jessie-wikimedia to carbon (which now defaults linux-meta to installing Linux 4.4) [09:07:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:11:18] oh no [09:11:18] The last Puppet run was at Mon Mar 7 10:26:01 UTC 2016 (5684 minutes ago). Puppet is disabled. reason not specified [09:11:33] that is gallium grblbl [09:12:44] !log Enabling puppet on gallium.wikimedia.org . Been disabled since ~ Mon Mar 7 10:26:01 UTC 2016 [09:12:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:13:14] PROBLEM - puppet last run on gallium is CRITICAL: CRITICAL: Puppet last ran 3 days ago [09:13:37] !log umounted /sys/kernel/debug/tracing on cp1067 [09:13:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:14:35] RECOVERY - Disk space on cp1067 is OK: DISK OK [09:15:04] RECOVERY - puppet last run on gallium is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [09:15:26] puppet on gallium is all happy [09:16:01] ema: gah, was that filesystem causing the disk alert? [09:16:23] godog: so there was an actual disk space warning there [09:17:04] then after removing the kernel -dbg package to free up some space the alert started showing up [09:17:45] sudo -u nagios /usr/lib/nagios/plugins/check_disk -w 6% -c 3% -l -e -A -i "/srv/sd[a-b][1-3]" [09:17:50] this was the command [09:18:06] the plugin tries to enumerate all filesystems, failing on that one [09:18:46] possible solution: call check_disk with --exclude-type=tracefs [09:19:04] ugh, mind opening a task so we don't lose track? [09:19:29] godog: sure, I was about to put in a CR actually [09:20:13] even better! 
thanks [09:24:50] (03CR) 10ArielGlenn: [C: 032] script to run a maintenance command on all wikis with varying output dirs [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/276633 (owner: 10ArielGlenn) [09:32:09] ema: good morning. Poke me whenever you want to try out varnishtest ;-} I got varnish installed on the CI instances yesterday [09:32:35] hashar: awesome [09:37:20] moar tests [09:38:14] if we could get the various puppet.git modules/*/Rakefile to be run that would be even better [09:46:00] !log Rebooting rdb200[56].codfw for kernel upgrade [09:46:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:49:07] is gerrit git-fetch working for everyone else? [09:49:50] jynus: I'm trying to push a change but it's stuck apparently [09:50:07] I can do it from the cluster [09:51:46] oh, but on the cluster I use https, not ssh [09:52:32] sshing to gerrit works though [09:52:36] eg: ssh ema@gerrit.wikimedia.org -p 29418 [09:53:00] but yeah, git operations seem to be broken [09:53:15] jynus a/2556 git-upload-pack./operations/mediawiki-config.git 0ms 532364ms killed [09:53:19] that is from Gerrit [09:53:41] and i get other similar errors :-/ [09:54:18] should we try restarting the gerrit service? [09:57:28] there is at least one upload-pack error due to java.nio.channels.ClosedChannelException [09:57:28] at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:272) [09:57:33] not much more on backend [09:57:42] maybe it is the varnish / ytterbium connections having some kind of issue [09:58:58] hashar: well my git push through ssh is stuck... 
[09:59:08] yeah [09:59:23] on ytterbium (Gerrit box) I have been looking at grep killed /var/lib/gerrit2/review_site/logs/sshd_log [09:59:46] errors started at ~ 9:18 UTC [10:02:43] and Gerrit has a bunch of stalled git-upload-pack commands grr [10:04:56] there's also a long list of failed github syncs, not sure whether that's merely fallout or might have caused it in the first place [10:05:07] ( looking at: ssh -p 29418 hashar@gerrit.wikimedia.org 'gerrit show-queue -w' ) [10:05:16] yeah github syncs are not a problem [10:05:19] moritzm: yeah I remember seeing those for the longest time [10:05:31] whenever the github target repo has objects Gerrit doesn't know about, it gives up doing replication [10:05:40] that is part of the usual error spam that needs to be handled [10:05:47] but output of ssh -p 29418 hashar@gerrit.wikimedia.org 'gerrit show-queue -w' [10:05:53] (PS4) Filippo Giunchedi: prometheus: add node_exporter support [puppet] - https://gerrit.wikimedia.org/r/276243 (https://phabricator.wikimedia.org/T92813) [10:05:55] shows a bunch of stuck git-upload-pack [10:06:09] I have no idea how to kill tasks though [10:06:27] I git-reviewed that ^ a while ago [10:06:51] I don't know about Gerrit internals [10:07:18] but it has 6 stuck entries in git-upload-pack [10:07:25] and maybe that starves the queue somehow [10:07:38] (CR) jenkins-bot: [V: -1] prometheus: add node_exporter support [puppet] - https://gerrit.wikimedia.org/r/276243 (https://phabricator.wikimedia.org/T92813) (owner: Filippo Giunchedi) [10:08:25] will try something [10:08:38] (PS1) Giuseppe Lavagetto: jobrunner: monitor the HHVM server health [puppet] - https://gerrit.wikimedia.org/r/276710 [10:09:40] it is working now [10:10:06] thanks hashar [10:11:28] thank you hashar [10:11:36] PROBLEM - puppet last run on elastic2011 is CRITICAL: CRITICAL: puppet fail [10:11:49] !log Gerrit: killed stuck "git-upload-pack '/mediawiki/core.git'" tasks [10:11:52] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:12:48] thank you, may I ask you to document (one title and 1 line is enough) that on the gerrit wiki ? [10:12:48] at least show-queue looks normal now [10:12:54] yeah going to do that right now [10:13:18] so one can: ssh -p 29418 hashar@gerrit.wikimedia.org ps -w [10:13:23] to show the task list [10:13:46] then if an admin, kill a task by passing its id. Something like: ssh -p 29418 hashar@gerrit.wikimedia.org kill 852c9889 [10:13:51] copy pasting that to wikitech [10:13:56] (PS1) Elukey: Set rdb200[56] as part of codfw's Job Queues (rdb2005 master, rdb2006 slave). [puppet] - https://gerrit.wikimedia.org/r/276711 (https://phabricator.wikimedia.org/T129178) [10:14:00] I have learned a new command today [10:14:23] great! share that knowledge! Wikimedia is about sharing knowledge! [10:14:32] :-) [10:15:47] jynus: whenever you have time https://gerrit.wikimedia.org/r/#/c/276711/1 :) [10:18:09] "whenever you have time" so never? :-) [10:20:00] (PS1) Ema: Disk space icinga check: pass --exclude-type=tracefs [puppet] - https://gerrit.wikimedia.org/r/276712 [10:21:49] jynus: ahhahha point taken [10:23:24] I have other priorities right now: labsdb problems and master-master production setups, maybe later [10:23:32] (CR) Volans: "LGTM and safe given that is only adding a new define." [puppet] - https://gerrit.wikimedia.org/r/274382 (https://phabricator.wikimedia.org/T124444) (owner: Gehel) [10:25:59] akosiaris: next in line if you have a bit of time for https://gerrit.wikimedia.org/r/#/c/276711/1 [10:26:08] (not urgent) [10:26:48] !log Gerrit task management documented on https://wikitech.wikimedia.org/wiki/Gerrit#Tasks_management [10:26:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:27:11] jynus: we have doc for Gerrit tasks listing and killing now https://wikitech.wikimedia.org/wiki/Gerrit#Tasks_management .
Any opsen should have appropriate permissions [10:27:19] hashar: nice! [10:28:27] I suppose the last line is kill , right ? [10:29:21] yes it is, corrected [10:29:54] the magic of the wikis! [10:37:18] (03CR) 10Gehel: [C: 031] "LGTM, simple enough and I see no reason to check tracefs." [puppet] - 10https://gerrit.wikimedia.org/r/276712 (owner: 10Ema) [10:38:26] RECOVERY - puppet last run on elastic2011 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [10:38:46] (03CR) 10Giuseppe Lavagetto: [C: 031] Set rdb200[56] as part of codfw's Job Queues (rdb2005 master, rdb2006 slave). [puppet] - 10https://gerrit.wikimedia.org/r/276711 (https://phabricator.wikimedia.org/T129178) (owner: 10Elukey) [10:39:21] (03CR) 10Filippo Giunchedi: [C: 031] "thanks Ema!" [puppet] - 10https://gerrit.wikimedia.org/r/276712 (owner: 10Ema) [10:43:14] (03CR) 10Volans: [C: 031] Factorized code exposing Puppet SSL certs [puppet] - 10https://gerrit.wikimedia.org/r/274382 (https://phabricator.wikimedia.org/T124444) (owner: 10Gehel) [10:43:19] (03CR) 10Elukey: [C: 032] Set rdb200[56] as part of codfw's Job Queues (rdb2005 master, rdb2006 slave). [puppet] - 10https://gerrit.wikimedia.org/r/276711 (https://phabricator.wikimedia.org/T129178) (owner: 10Elukey) [10:43:34] Hey! I need some help to review https://gerrit.wikimedia.org/r/#/c/274382/ (which is blocking volans). 
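For reference, the Gerrit task listing and killing steps quoted above (show-queue / ps -w, then kill with a task id) can be sketched as a small helper. This is only an illustration: it builds the ssh command lines seen in the log and does not run them. The function names, the example user and host, and the hex check on the task id are assumptions, not something taken from the wikitech page.

```python
import re

GERRIT_SSH_PORT = 29418  # port used in the commands quoted in the log


def gerrit_ssh(user, host, *command):
    """Build (but do not execute) the argv for a Gerrit SSH command."""
    return ["ssh", "-p", str(GERRIT_SSH_PORT), f"{user}@{host}", *command]


def list_tasks(user, host):
    """argv for listing the task queue, as in 'gerrit show-queue -w'."""
    return gerrit_ssh(user, host, "gerrit", "show-queue", "-w")


def kill_task(user, host, task_id):
    """argv for killing one task by id, as in 'kill 852c9889'.

    Assumption: task ids are short hex strings like the one in the log;
    anything else is rejected so that junk never reaches the command line.
    """
    if not re.fullmatch(r"[0-9a-f]+", task_id):
        raise ValueError(f"unexpected task id: {task_id!r}")
    return gerrit_ssh(user, host, "kill", task_id)


if __name__ == "__main__":
    # Hypothetical user and host, shown only to print the resulting commands.
    print(" ".join(list_tasks("alice", "gerrit.example.org")))
    print(" ".join(kill_task("alice", "gerrit.example.org", "852c9889")))
```

The resulting lists could be handed to subprocess.run by someone with the appropriate Gerrit permissions; keeping the build step separate makes it easy to review the command before anything is killed.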
[10:44:19] it looks good to me, if someone with more puppet experience could take a quick look will be really appreciated, should be pretty safe given that is only adding a define [10:45:11] (03PS4) 10Gehel: Expose elasticsearch through HTTP [puppet] - 10https://gerrit.wikimedia.org/r/274711 (https://phabricator.wikimedia.org/T124444) [10:45:36] (03PS1) 10Jcrespo: [WIP]Fix labs-support vlan dhcp-install config on eqiad [puppet] - 10https://gerrit.wikimedia.org/r/276716 (https://phabricator.wikimedia.org/T128753) [10:45:54] (03CR) 10Muehlenhoff: "Ok, I didn't check role::labs::openstack::nova::controller, but only role::salt::masters::labs (it's preferred to have the ferm in their r" [puppet] - 10https://gerrit.wikimedia.org/r/276420 (owner: 10Muehlenhoff) [10:46:09] (03Abandoned) 10Muehlenhoff: Add ferm rules for salt master/labs [puppet] - 10https://gerrit.wikimedia.org/r/276420 (owner: 10Muehlenhoff) [10:48:57] (03PS2) 10Jcrespo: [WIP]Fix labs-support vlan dhcp-install config on eqiad [puppet] - 10https://gerrit.wikimedia.org/r/276716 (https://phabricator.wikimedia.org/T128753) [10:50:23] (03CR) 10Filippo Giunchedi: [C: 031] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/274382 (https://phabricator.wikimedia.org/T124444) (owner: 10Gehel) [10:50:45] yeah looks good to merge, make sure to run the puppet compiler on the review that actually uses it cc volans gehel [10:51:21] First time I'm going to use this puppet compiler [10:51:28] * gehel is going to learn one more trick today [10:54:04] (03PS3) 10Jcrespo: Fix labs-support vlan dhcp-install config on eqiad [puppet] - 10https://gerrit.wikimedia.org/r/276716 (https://phabricator.wikimedia.org/T128753) [10:55:13] (03PS5) 10Alexandros Kosiaris: Enable non-default Machine Translation for some languages [puppet] - 10https://gerrit.wikimedia.org/r/276405 (https://phabricator.wikimedia.org/T129329) (owner: 10KartikMistry) [10:56:24] gehel: hehe check out utils/pcc in puppet.git [10:57:25] 6Operations, 10ops-eqiad, 13Patch-For-Review: Rack and Initial setup db1074-79 - https://phabricator.wikimedia.org/T128753#2111405 (10jcrespo) Thank you, @jcrespo. I have checked the isuess with the help of Mortiz, and I believe it is not a firewall issue, but lack of dhcp offerings to that vlan that: https... [10:57:51] arg, I just killed your name [10:59:57] (03CR) 10Alexandros Kosiaris: [C: 032] Enable non-default Machine Translation for some languages [puppet] - 10https://gerrit.wikimedia.org/r/276405 (https://phabricator.wikimedia.org/T129329) (owner: 10KartikMistry) [11:00:18] np :-) [11:00:28] kart_: done [11:00:35] elukey: looking into it, but in a meeting right now [11:01:21] akosiaris: already merged, Giuseppe reviewed thanks! [11:01:27] akosiaris: thanks! [11:07:52] off to get cat food and doggie pads (for the cat) plus lunchings, back in 45? 
minutes [11:08:05] 6Operations, 6Commons, 10MediaWiki-Page-deletion: Can't delete " - https://phabricator.wikimedia.org/T129637#2111423 (10Steinsplitter) [11:10:40] 6Operations, 6Commons, 10MediaWiki-Page-deletion, 10media-storage: Can't delete " - https://phabricator.wikimedia.org/T129637#2111435 (10Steinsplitter) [11:11:18] 6Operations, 6Commons, 10MediaWiki-Page-deletion, 10media-storage: MWException trying to delete a certain file on Commons - https://phabricator.wikimedia.org/T129637#2111437 (10Aklapper) p:5Triage>3High [11:12:36] (03PS2) 10Ema: Disk space icinga check: pass --exclude-type=tracefs [puppet] - 10https://gerrit.wikimedia.org/r/276712 [11:12:51] (03CR) 10Ema: [C: 032 V: 032] Disk space icinga check: pass --exclude-type=tracefs [puppet] - 10https://gerrit.wikimedia.org/r/276712 (owner: 10Ema) [11:16:46] I think git got stuck again [11:24:03] Can someone help me? Since the update yesterday at wikipedia I get "invalidtoken" every time I tried to save a page [11:24:33] I'm using $token = $answer['query']['tokens']['csrftoken']; and $data = "action=query&format=php&meta=tokens&type=csrf"; [11:24:54] (03PS1) 10Jcrespo: s/labdsdb1008/labsdb1008/ for the non-mgmt ip [dns] - 10https://gerrit.wikimedia.org/r/276718 (https://phabricator.wikimedia.org/T128753) [11:25:51] !log killed some long-running git-upload-pack tasks [11:25:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:26:25] (03CR) 10Jcrespo: [C: 032] s/labdsdb1008/labsdb1008/ for the non-mgmt ip [dns] - 10https://gerrit.wikimedia.org/r/276718 (https://phabricator.wikimedia.org/T128753) (owner: 10Jcrespo) [11:27:02] !log running authdns-update to solve a dns typo [11:27:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:31:59] PROBLEM - Disk space on fluorine is CRITICAL: DISK CRITICAL - free space: /a 136252 MB (3% inode=99%) [11:36:49] 6Operations, 10ops-codfw, 13Patch-For-Review, 5codfw-rollout, 
3codfw-rollout-Jan-Mar-2016: rack/setup/deploy rdb200[5-6] - https://phabricator.wikimedia.org/T129178#2111491 (10elukey) JobQueues up and running, I'll wait for @Joe's confirmation before closing. @Dzahn one thing that might be worth to ment... [11:39:58] I am sorry, but killing git tasks every time I want to update something is not a good policy- there is something wrong there [11:45:48] 6Operations, 10MediaWiki-JobQueue, 13Patch-For-Review: Job queue is growing and growing - https://phabricator.wikimedia.org/T129517#2111495 (10Luke081515) meh, it is still growing... currently 23 mil jobs [11:48:08] !log restarting gerrit on ytterbium [11:48:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:52:19] PROBLEM - Host mw1085 is DOWN: PING CRITICAL - Packet loss = 100% [11:52:47] er.. ^ expected ? [11:54:00] RECOVERY - Host mw1085 is UP: PING OK - Packet loss = 0%, RTA = 1.48 ms [11:54:10] not me [11:54:40] grrrit-wm, are you there? [11:58:29] ah definitely not [11:58:39] after a gerrit restart, grrit-wm needs a restart too [11:58:44] let's see how to do that [12:00:07] grrrit-8wkv6 1/1 Terminating 22 7d [12:00:07] grrrit-evjmb 0/1 Pending 0 3s [12:00:54] and there we are [12:01:10] https://wikitech.wikimedia.org/wiki/Grrrit-wm#Deploying_or_restarting [12:01:13] fyi ^ [12:03:44] (03PS4) 10Alexandros Kosiaris: Sync up eqiad/codfw LVS IP assignments for services [dns] - 10https://gerrit.wikimedia.org/r/276196 (https://phabricator.wikimedia.org/T129234) [12:03:56] hello grrrit-wm :-) [12:04:41] (03CR) 10Alexandros Kosiaris: [C: 032] Sync up eqiad/codfw LVS IP assignments for services [dns] - 10https://gerrit.wikimedia.org/r/276196 (https://phabricator.wikimedia.org/T129234) (owner: 10Alexandros Kosiaris) [12:10:58] !log uploaded backport of linux-tools 4.4-4 for jessie-wikimedia to carbon (provides kbuild and perf amonst others) [12:11:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:13:04] 
6Operations: 4.4 Linux kernel - https://phabricator.wikimedia.org/T126320#2111521 (10MoritzMuehlenhoff) 5Open>3Resolved perf/kbuild are now also available and 4.4 has been made the new default kernel for new installations. Closing this bug, 4.4 is now under the usual maintenance. [12:13:07] akosiaris: \o/ [12:31:58] (03CR) 10Muehlenhoff: "Thanks for the review, will fix up the commit message." [puppet] - 10https://gerrit.wikimedia.org/r/274962 (owner: 10Muehlenhoff) [12:36:34] !log mathoid deploying 7a282a4181a4 [12:36:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:37:00] akosiaris: fyi ^ [12:40:14] akosiaris: hm, scb200x are in the list of mathoid's trebuchet minions, but they're not completing the fetch, known / expected ? [12:40:22] (03PS4) 10Muehlenhoff: Move dynamicproxy ferm rules into role::labs::novaproxy and role::labs::tools::proxy [puppet] - 10https://gerrit.wikimedia.org/r/274962 [12:44:55] 6Operations, 10Mathoid, 6Services, 10Trebuchet: Remove sca100x from the list of Mathoid's minioins - https://phabricator.wikimedia.org/T129645#2111586 (10mobrovac) [12:46:48] mobrovac: not known, I 'll have to look into it [12:46:55] but first..., lunch [12:49:40] PROBLEM - Kafka Broker Replica Max Lag on kafka1014 is CRITICAL: CRITICAL: 56.67% of data above the critical threshold [5000000.0] [12:50:51] PROBLEM - Disk space on logstash1004 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 113478 MB (3% inode=99%) [12:51:36] mobrovac: ok for me on https://gerrit.wikimedia.org/r/#/c/276720/ btw [12:52:07] kk godog [12:56:05] !log restbase deploy start of 3bedb8f5c42 on canary restbase1001 [12:56:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:04:00] RECOVERY - Kafka Broker Replica Max Lag on kafka1014 is OK: OK: Less than 50.00% above the threshold [1000000.0] [13:12:55] (03PS1) 10Mobrovac: RESTBase: Remove restbase100[12] from the lists of seeds [puppet] - 
10https://gerrit.wikimedia.org/r/276728 [13:26:12] !log restbase deploy end of 3bedb8f5c42 [13:26:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:29:06] (03CR) 10Gilles: "Can someone with +2 merge this? It seems uncontroversial and it's been +1ed 3 times already." [puppet] - 10https://gerrit.wikimedia.org/r/251800 (owner: 10Ori.livneh) [13:30:12] (03CR) 10Gilles: "Is this still relevant? It's been sitting for a few months." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247325 (https://phabricator.wikimedia.org/T111575) (owner: 10Aaron Schulz) [13:30:19] mobrovac: I see scb200X being at the exact same mathoid version as scb100X [13:30:31] so I would say the are checking out the code, maybe just not reporting it ? [13:31:41] (03CR) 10Gilles: "This has been waiting for a while, maybe do what Alexandros suggested to get this merged and not forgotten?" [puppet] - 10https://gerrit.wikimedia.org/r/252396 (https://phabricator.wikimedia.org/T118331) (owner: 10Ori.livneh) [13:31:57] akosiaris: hm, strange, when i checked scb2001 earlier it was a commit behind, but now it's good [13:32:48] heh, interesting... [13:33:03] somehow I am thinking latency and distributed systems ... 
[13:33:10] PROBLEM - puppet last run on cp1057 is CRITICAL: CRITICAL: Puppet has 1 failures [13:33:30] PROBLEM - DPKG on labmon1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [13:37:45] (03PS1) 10Mobrovac: Mathoid: enable PNG generation [puppet] - 10https://gerrit.wikimedia.org/r/276734 (https://phabricator.wikimedia.org/T71702) [13:45:50] RECOVERY - DPKG on labmon1001 is OK: All packages OK [13:49:40] RECOVERY - Disk space on fluorine is OK: DISK OK [13:50:16] (03CR) 10WMDE-Fisch: [C: 031] Whitelist feeds included on Wikimedia Germany Engineering page on mediawiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/275815 (https://phabricator.wikimedia.org/T127176) (owner: 10WMDE-leszek) [13:50:43] !log on fluorine truncated archive/redis.log-20160311 and archive/JobQueueFederated.log-20160311 to 5mb each, they were each about 500gb [13:50:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:51:09] (03PS1) 10Dereckson: Procomuns Viquimarató - Barcelona throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/276735 (https://phabricator.wikimedia.org/T129574) [13:54:49] (03PS1) 10Pmlineditor: Set $wgNamespacesWithSubpages to true for NS_TEMPLATE for ru.wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/276737 (https://phabricator.wikimedia.org/T124615) [13:59:49] RECOVERY - puppet last run on cp1057 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:08:01] PROBLEM - HHVM rendering on mw1017 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.007 second response time [14:10:00] RECOVERY - HHVM rendering on mw1017 is OK: HTTP OK: HTTP/1.1 200 OK - 72486 bytes in 0.563 second response time [14:10:22] (03CR) 10JanZerebecki: [C: 031] Whitelist feeds included on Wikimedia Germany Engineering page on mediawiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/275815 (https://phabricator.wikimedia.org/T127176) (owner: 
10WMDE-leszek) [14:12:21] !log installed security updates for openssl, curl, gcrypt, libpng, jasper, expat and libxml2 on mw1017 (other canaries will be upgraded later on if all is well, the rest of mw* on Monday) [14:12:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:13:34] (03CR) 10Dereckson: "@leszek You can include this patch for deployment during one of the SWAT windows: https://wikitech.wikimedia.org/wiki/Deployments#Week_of_" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/275815 (https://phabricator.wikimedia.org/T127176) (owner: 10WMDE-leszek) [14:16:27] (03CR) 10WMDE-leszek: [C: 04-1] "This patch will be included for deployment once I have added one more feed to the whitelist (this feed is not set up yet)." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/275815 (https://phabricator.wikimedia.org/T127176) (owner: 10WMDE-leszek) [14:16:35] 6Operations, 10ops-eqiad, 13Patch-For-Review: Rack and Initial setup db1074-79 - https://phabricator.wikimedia.org/T128753#2111810 (10jcrespo) The issue was, in fact https://gerrit.wikimedia.org/r/276718 labsdb1008 should have been already installed. 
[14:19:17] (03PS1) 10Alexandros Kosiaris: lvs: normalize ProxyFetch URL configuration [puppet] - 10https://gerrit.wikimedia.org/r/276739 [14:20:43] this is fun ^ [14:21:12] (03PS1) 10BBlack: VCL: remove vcl_config.do_gzip conditional [puppet] - 10https://gerrit.wikimedia.org/r/276740 [14:24:22] (03CR) 10Chad: "Causes the following in production when calling updateWikiversions:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/276383 (https://phabricator.wikimedia.org/T99096) (owner: 10Krinkle) [14:24:25] (03CR) 10Ema: [C: 031] VCL: remove vcl_config.do_gzip conditional [puppet] - 10https://gerrit.wikimedia.org/r/276740 (owner: 10BBlack) [14:25:18] !log run swiftrepl thumbs eqiad -> codfw with concurrency 128 [14:25:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:32:43] (03CR) 10Rush: [C: 031] "sure, just looking this part of the change https://gerrit.wikimedia.org/r/#/c/274076/3/modules/admin/files/enforce-users-groups.sh" [puppet] - 10https://gerrit.wikimedia.org/r/274076 (https://phabricator.wikimedia.org/T124962) (owner: 10Jcrespo) [14:33:14] !log run swiftrepl thumbs eqiad -> codfw with concurrency 96 [14:33:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:33:24] (03PS1) 10Dereckson: Create Draft namespace on kn.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/276743 (https://phabricator.wikimedia.org/T129052) [14:34:17] (03CR) 10Ema: [C: 031] common VCL: use more of the hostname for backend naming [puppet] - 10https://gerrit.wikimedia.org/r/276529 (owner: 10BBlack) [14:38:39] (03CR) 10BBlack: [C: 04-1] "This doesn't actually work, because we'd need a matching change in directors.inc.vcl.erb.tpl or whatever it's called, which uses regex tra" [puppet] - 10https://gerrit.wikimedia.org/r/276529 (owner: 10BBlack) [14:41:25] (03PS1) 10Chad: Moving largest wikis back to wmf.15 for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/276744 [14:43:49] 
akosiaris: maybe too much thinking about the future, so feel free to ignore! but I'd think we're moving in the wrong direction with LVS as localhost Host: headers. In the long run, all of these are going to support HTTPS and be tested over HTTPS, at which point things like pybal actually need two hostnames: the service hostname for the TLS connect (the cert will be for foo.svc.eqiad.wmnet), a [14:43:55] nd the Host: header to emit inside the connection (which may be localhost or en.wp or whatever) [14:44:14] but, we could always put that back in a different form when we get to that point [14:44:26] (tls_host, host, url as separate params?) [14:44:45] (03CR) 10Chad: [C: 032] Moving largest wikis back to wmf.15 for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/276744 (owner: 10Chad) [14:45:12] (03Merged) 10jenkins-bot: Moving largest wikis back to wmf.15 for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/276744 (owner: 10Chad) [14:45:46] !log demon@tin rebuilt wikiversions.php and synchronized wikiversions files: moving largest.dblist back to wmf.15 for now [14:45:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:46:13] (03CR) 10MZMcBride: "Why?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/276744 (owner: 10Chad) [14:47:30] PROBLEM - very high load average likely xfs on ms-be2016 is CRITICAL: CRITICAL - load average: 167.05, 114.15, 59.69 [14:47:48] (03CR) 10Chad: "Job queue backlog is 25 million and growing and I suspect wmf.16 is at fault." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/276744 (owner: 10Chad) [14:48:34] bblack: I thought about the Host as well. As I noted it is the controversial one... We don't currently have a usage for it right now, but we will have in the future as you point it. 
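The split bblack suggests above (tls_host, host, url as separate params) could look roughly like the sketch below. It is not pybal code or its configuration format; the class name, field names, defaults and the returned dictionary are all hypothetical, and only illustrate keeping the name used for the TLS connect separate from the Host: header sent inside the connection.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class HealthCheck:
    """Hypothetical health-check description; not pybal's real API."""

    backend_ip: str                  # the check always connects to the backend's IP
    path: str = "/"                  # URL path to fetch
    scheme: str = "http"             # "http" or "https"
    http_host: str = "localhost"     # value of the Host: header
    tls_host: Optional[str] = None   # name expected in the certificate (https only)

    def __post_init__(self):
        if self.scheme not in ("http", "https"):
            raise ValueError(f"unsupported scheme: {self.scheme!r}")
        if self.scheme == "https" and not self.tls_host:
            raise ValueError("https checks need tls_host for the TLS connect")

    def request_plan(self):
        """Describe what a checker would do, without doing any I/O."""
        return {
            "connect_to": self.backend_ip,
            "tls_name": self.tls_host if self.scheme == "https" else None,
            "headers": {"Host": self.http_host},
            "path": self.path,
        }


if __name__ == "__main__":
    # Example values only; the address is from a documentation range.
    check = HealthCheck(
        backend_ip="192.0.2.10",
        path="/_info",
        scheme="https",
        http_host="localhost",
        tls_host="foo.svc.eqiad.wmnet",
    )
    print(check.request_plan())
```

Keeping tls_host optional means plain HTTP checks need no change now, which matches the idea of adding the TLS name later in a different form.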
[14:48:52] ^ godog: ms-be2016 [14:48:53] I am not sure if we should be however relying on a different parameter or on the url [14:50:31] but from the url, the port part is definitely misleading [14:50:36] Leah: I could be wrong, but I want to rule wmf.16 out as being at fault for this. [14:50:48] Mainly just testing my theory and if I'm wrong we'll go back to wmf.16 [14:51:31] akosiaris: I think long term it's a different parameter, since pybal is "different" in that it's going to connect to the service backend host's IP always [14:52:23] akosiaris: you could split it now as http_host (==localhost in the cases in the patch) and http_url, and then later we add tls_host. but all of that's code changes for pybal templating and/or pybal itself down the road [14:52:35] 6Operations, 10MediaWiki-JobQueue, 13Patch-For-Review: Job queue is growing and growing - https://phabricator.wikimedia.org/T129517#2111859 (10MZMcBride) >>! In T129517#2109042, @hashar wrote: > ``` > [20:13:22] ori: do you want to block the train or no? > [20:13:25] no > ``` > > Solved, su... [14:52:48] ostriches: Sure. Just trying to cross-reference the commits and tasks. :-) [14:53:29] 6Operations, 10MediaWiki-JobQueue, 13Patch-For-Review: Job queue is growing and growing - https://phabricator.wikimedia.org/T129517#2111862 (10demon) p:5High>3Unbreak! [14:53:35] And yes, it should be UBN :) [14:53:54] bblack: I think you are right. 
In the long run, we probably want to have scheme, path, http_host and tls_host support [14:55:56] I 'll split the patch a bit, one part to be the "uncontroversial" one, removing the ports and one to discuss a bit more the Host part [14:56:45] moritzm: thanks, I'll give it a kick [15:00:11] (03PS12) 10Gehel: Factorized code exposing Puppet SSL certs [puppet] - 10https://gerrit.wikimedia.org/r/274382 (https://phabricator.wikimedia.org/T124444) [15:01:07] (03PS1) 10Chad: Revert "Moving largest wikis back to wmf.15 for now" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/276745 [15:01:15] _joe_: ^^ [15:01:29] I'm gonna move them back to wmf.16 since wmf.15 made zero difference. [15:01:51] (03CR) 10Chad: [C: 032] Revert "Moving largest wikis back to wmf.15 for now" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/276745 (owner: 10Chad) [15:02:18] <_joe_> ostriches: +1, but I'm gonna be unavailable from now to tomorrow :) [15:02:25] Yeah np [15:02:26] (03CR) 10BBlack: [C: 032] VCL: remove vcl_config.do_gzip conditional [puppet] - 10https://gerrit.wikimedia.org/r/276740 (owner: 10BBlack) [15:02:50] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/). 
[15:03:05] (03Merged) 10jenkins-bot: Revert "Moving largest wikis back to wmf.15 for now" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/276745 (owner: 10Chad) [15:03:12] Oh stfu icinga-wm [15:03:14] We knowssss [15:03:43] !log demon@tin rebuilt wikiversions.php and synchronized wikiversions files: moving large.dblist back to wmf.16, did not help [15:03:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:04:00] RECOVERY - very high load average likely xfs on ms-be2016 is OK: OK - load average: 4.85, 2.37, 1.02 [15:04:21] !log demon@tin Synchronized README: no-op, co-master sync (duration: 00m 30s) [15:04:40] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. [15:06:06] !log installing security updates for openssl, curl, gcrypt, libpng, jasper, expat and libxml2 on mw1018-1025 (and restarts of HHVM) [15:07:19] PROBLEM - puppet last run on dbstore2002 is CRITICAL: CRITICAL: puppet fail [15:08:10] Hello. Would it be possible to deploy a last-minute throttle rule? This is for an event organized Sunday: https://gerrit.wikimedia.org/r/#/c/276735/ [15:08:43] (03CR) 10Chad: [C: 032] Procomuns Viquimarató - Barcelona throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/276735 (https://phabricator.wikimedia.org/T129574) (owner: 10Dereckson) [15:08:52] Dereckson: Doing [15:09:12] Thanks. [15:09:29] (03Merged) 10jenkins-bot: Procomuns Viquimarató - Barcelona throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/276735 (https://phabricator.wikimedia.org/T129574) (owner: 10Dereckson) [15:10:01] Who used root on /srv/mediawiki-staging again? [15:10:05] As in, just now :) [15:10:15] yourself? [15:10:31] :-) [15:10:42] I don't have root on tin ;-) [15:10:56] I was about to say, congrats on getting root, ostriches :) [15:10:58] Transient.... 
[15:11:00] Wtf [15:11:02] * ostriches stabs git [15:11:13] * greg-g peaks in from the bus ride in [15:11:25] no one deployed anything after you, that I can see [15:11:44] jynus: I blame the gremlins deep inside git. [15:11:45] :) [15:11:46] was it you godog? [15:11:57] (03PS1) 10Krinkle: multiversion: Add updateSymlink() back in updateBranchPointers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/276746 [15:12:07] Removed one too many functions :) [15:12:12] Krinkle: thx :) [15:12:13] ostriches: ^ [15:12:17] (03PS1) 10Elukey: Increase the mediawiki::jobrunner::runners_basic concurrency parameter for mw116[3-6] as temporary measure to consume more refreshLinks jobs. [puppet] - 10https://gerrit.wikimedia.org/r/276747 (https://phabricator.wikimedia.org/T129517) [15:12:23] (03CR) 10Krinkle: [C: 032] multiversion: Add updateSymlink() back in updateBranchPointers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/276746 (owner: 10Krinkle) [15:12:23] !log demon@tin Synchronized wmf-config/throttle.php: Procomuns Viquimarató - Barcelona throttle rule (duration: 00m 31s) [15:12:25] Dereckson: ^^^^ [15:13:08] Krinkle: nope [15:13:15] (03CR) 10Chad: [C: 031] Increase the mediawiki::jobrunner::runners_basic concurrency parameter for mw116[3-6] as temporary measure to consume more refreshLinks jobs [puppet] - 10https://gerrit.wikimedia.org/r/276747 (https://phabricator.wikimedia.org/T129517) (owner: 10Elukey) [15:13:15] Nice, thanks. [15:13:17] (03Merged) 10jenkins-bot: multiversion: Add updateSymlink() back in updateBranchPointers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/276746 (owner: 10Krinkle) [15:13:49] Krenair: " Krinkle: nope" [15:13:50] Krinkle: Pulled and sync'ing [15:14:08] !log demon@tin Synchronized multiversion/updateBranchPointers: (no message) (duration: 00m 25s) [15:14:51] godog: I'm using the puppet-compiler to check my change (https://gerrit.wikimedia.org/r/#/c/274382/). 
I fully expect it to change nothing at all as it just adds a new defined type, which is currently not used by any code. [15:15:09] 6Operations, 6Commons, 10MediaWiki-Page-deletion, 10media-storage: MWException trying to delete a certain file on Commons - https://phabricator.wikimedia.org/T129637#2111880 (10fgiunchedi) the original is 404 at https://upload.wikimedia.org/wikipedia/commons/e/e8/J_K_Temple%2C_Kanpur.jpg meaning it isn't... [15:15:21] I am checking the logs, and the only sudos between your deploy and now are nagios and trebuchet [15:15:32] jynus: I'm over it, it went away [15:15:36] Transient :) [15:15:49] godog: Still, good exercise for me, so I ran puppet-compiler against one of the elasticsearch node. Do you think I should run it against another node as well? cc: volans [15:15:50] Krinkle: haha sorry! how often do you get that? [15:16:02] godog: About once every other hour :D [15:17:27] gehel: nah that change is not going to affect anything anyway so good to merge, however make sure you run the puppet compiler on the change that will start using your function [15:17:58] godog: yep, the next patch has a bit more risk associated... [15:18:02] Thanks! [15:18:05] (03CR) 10Krinkle: [C: 04-1] "Need to update the static/master/ symlinks as those are relative." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/276377 (owner: 10Krinkle) [15:21:08] (03PS13) 10Gehel: Factorized code exposing Puppet SSL certs [puppet] - 10https://gerrit.wikimedia.org/r/274382 (https://phabricator.wikimedia.org/T124444) [15:22:22] (03PS1) 10Krinkle: Remove unused static symlinks for beta php-master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/276748 (https://phabricator.wikimedia.org/T99096) [15:23:58] (03CR) 10Gehel: [C: 032] Factorized code exposing Puppet SSL certs [puppet] - 10https://gerrit.wikimedia.org/r/274382 (https://phabricator.wikimedia.org/T124444) (owner: 10Gehel) [15:26:55] !log deploying https://gerrit.wikimedia.org/r/#/c/274382/ - new defined type not used anywhere, should have less than zero impact. [15:26:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:28:00] PROBLEM - puppet last run on mw1122 is CRITICAL: CRITICAL: Puppet last ran 1 day ago [15:28:05] (03PS2) 10Elukey: Increase the mediawiki::jobrunner::runners_basic concurrency parameter for mw116[3-6] as temporary measure to consume more refreshLinks jobs. 
[puppet] - 10https://gerrit.wikimedia.org/r/276747 (https://phabricator.wikimedia.org/T129517) [15:28:56] Also, allow me a suggestion (I do not care so much anywhere, except the title) [15:29:37] (03PS3) 10Jcrespo: Increase mediawiki::jobrunner::runners_basic concurrency for mw116[3-6] [puppet] - 10https://gerrit.wikimedia.org/r/276747 (https://phabricator.wikimedia.org/T129517) (owner: 10Elukey) [15:30:36] yes you are completely right [15:31:41] RECOVERY - puppet last run on mw1122 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:35:50] RECOVERY - puppet last run on dbstore2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:35:52] (03CR) 10Krinkle: [C: 032] Remove unused static symlinks for beta php-master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/276748 (https://phabricator.wikimedia.org/T99096) (owner: 10Krinkle) [15:37:02] (03PS2) 10Krinkle: Remove unused static symlinks for beta php-master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/276748 (https://phabricator.wikimedia.org/T99096) [15:37:37] (03CR) 10Elukey: [C: 032] Increase mediawiki::jobrunner::runners_basic concurrency for mw116[3-6] [puppet] - 10https://gerrit.wikimedia.org/r/276747 (https://phabricator.wikimedia.org/T129517) (owner: 10Elukey) [15:38:30] (03CR) 10Krinkle: "While bits/static-master hasn't been in use for even longer, it is the only url matching pattern /static.master/ on beta apache's that is " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/276748 (https://phabricator.wikimedia.org/T99096) (owner: 10Krinkle) [15:39:22] !log increased mediawiki::jobrunner::runners_basic from 20 to 30 for mw116[3-6] [15:39:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:41:53] !log forced puppet agent -tv on mw1164 [15:41:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:43:23] Hm.. 
There is a nagios check in puppet for 'check_http_bits' that uses a url that is a 404 [15:43:32] I guess that check is disabled somewhere? [15:44:21] !log labtestmetal and labtestvirt I am commondeering for some nfs and storage testing that cannot be done in labs, I will reimage when done [15:44:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:46:14] (03PS5) 10Gehel: Expose elasticsearch through HTTP [puppet] - 10https://gerrit.wikimedia.org/r/274711 (https://phabricator.wikimedia.org/T124444) [15:48:31] (03PS1) 10Krinkle: Update outdated monitoring url for http_bits [puppet] - 10https://gerrit.wikimedia.org/r/276754 [15:49:41] _joe_: ^ [15:51:48] (03PS1) 10Ottomata: Remove ganglia from jmxtrans kafka broker configs [puppet] - 10https://gerrit.wikimedia.org/r/276755 [15:53:12] (03PS2) 10Alexandros Kosiaris: lvs: normalize ProxyFetch URL configuration [puppet] - 10https://gerrit.wikimedia.org/r/276739 [15:53:14] (03PS1) 10Alexandros Kosiaris: lvs: remove port from ProxyFetch URL definitions [puppet] - 10https://gerrit.wikimedia.org/r/276756 [15:53:25] (03CR) 10Ottomata: [C: 032] Remove ganglia from jmxtrans kafka broker configs [puppet] - 10https://gerrit.wikimedia.org/r/276755 (owner: 10Ottomata) [15:54:30] (03PS2) 10Hashar: (WIP) Lame rake / vcl / erb stuff (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/276733 [15:54:57] (03PS3) 10Hashar: (WIP) Lame rake / vcl / erb stuff (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/276733 [15:56:21] (03CR) 10jenkins-bot: [V: 04-1] (WIP) Lame rake / vcl / erb stuff (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/276733 (owner: 10Hashar) [15:57:12] 6Operations, 10CirrusSearch, 6Discovery, 3Discovery-Search-Sprint, and 2 others: Create a PKI that can be used by Puppet and for general purpose certificates - https://phabricator.wikimedia.org/T128077#2111932 (10Gehel) A extremely minimalist PKI is now available in the form of a puppet defined type `base:... 
[16:02:18] (03CR) 10Nuria: [C: 031] eventlogging: Remove server-side udp to kafka forwarder [puppet] - 10https://gerrit.wikimedia.org/r/276615 (https://phabricator.wikimedia.org/T129402) (owner: 10Madhuvishy) [16:03:34] 6Operations, 10CirrusSearch, 6Discovery, 3Discovery-Search-Sprint, and 2 others: Create a PKI that can be used by Puppet and for general purpose certificates - https://phabricator.wikimedia.org/T128077#2111949 (10Volans) Thanks @Gehel for the generalized solution, it works for us to simplify and improve My... [16:03:36] 6Operations, 10CirrusSearch, 6Discovery, 3Discovery-Search-Sprint, and 2 others: Create a PKI that can be used by Puppet and for general purpose certificates - https://phabricator.wikimedia.org/T128077#2111950 (10Gehel) @Deskana As our PO, I'll let you formally close that task... Let me know if you need an... [16:08:03] (03PS1) 10Elukey: Increase mediawiki::jobrunner::runners_basic concurrency for mw116[123789]. [puppet] - 10https://gerrit.wikimedia.org/r/276759 (https://phabricator.wikimedia.org/T129517) [16:08:17] (03CR) 10Eevans: [C: 031] RESTBase: Remove restbase100[12] from the lists of seeds [puppet] - 10https://gerrit.wikimedia.org/r/276728 (owner: 10Mobrovac) [16:10:07] (03PS1) 10Mattflaschen: Note that you have to run populateContentModel.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/276760 [16:10:31] (03PS1) 10ArielGlenn: onallwikis: fix typo in var name for verbose mode message [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/276761 [16:11:12] (03PS4) 10Hashar: (WIP) Lame rake / vcl / erb stuff (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/276733 [16:11:34] (03CR) 10Hashar: "Ignored a rubocop error" [puppet] - 10https://gerrit.wikimedia.org/r/276733 (owner: 10Hashar) [16:12:14] (03PS2) 10ArielGlenn: onallwikis: fix typo in var name for verbose mode message [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/276761 [16:12:57] (03CR) 10Krinkle: [C: 04-1] "Fixed in I354b07a1b3da. Pending that." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/276748 (https://phabricator.wikimedia.org/T99096) (owner: 10Krinkle) [16:13:04] (03CR) 10Mattflaschen: [C: 032] "Comment only" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/276760 (owner: 10Mattflaschen) [16:13:04] 6Operations, 10ops-eqiad, 13Patch-For-Review: Rack and Initial setup db1074-79 - https://phabricator.wikimedia.org/T128753#2111982 (10Cmjohnson) db1074, db1075 and db1076 all had puppet certs signed and salt-keys added labsdb1008 had puppet certs signed and salt-keys added. db1077 and 1078 did not install c... [16:14:12] (03Merged) 10jenkins-bot: Note that you have to run populateContentModel.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/276760 (owner: 10Mattflaschen) [16:14:22] (03CR) 10ArielGlenn: [C: 032] onallwikis: fix typo in var name for verbose mode message [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/276761 (owner: 10ArielGlenn) [16:16:17] (03PS1) 10ArielGlenn: onallwikis: add in ability to run mysql query and stash results [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/276762 [16:17:52] (03CR) 10ArielGlenn: [C: 032 V: 032] onallwikis: add in ability to run mysql query and stash results [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/276762 (owner: 10ArielGlenn) [16:30:50] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/). [16:30:50] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/). [16:32:40] (03PS1) 10Chad: Gerrit: Make git directory location configurable so we can move it [puppet] - 10https://gerrit.wikimedia.org/r/276764 [16:33:42] matt_flaschen: You're going to pull that comment change and sync right? Otherwise icinga will complain :) [16:33:53] Hey, Gerrit doesn't seem to like me. It keeps logging me out intermittently. Any known issues? 
[16:33:59] * polybuildr cries in corner [16:34:56] ostriches, yeah, I was planning to. Thanks for the reminder. [16:35:15] okie dokie :) [16:35:23] polybuildr: No, it shouldn't be doing that. [16:35:31] I tend to stay logged in forever. [16:35:47] ostriches: I tend to too. Not today, though. Today it doesn't seem to like me. [16:35:48] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/2024/mw1162.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/276759 (https://phabricator.wikimedia.org/T129517) (owner: 10Elukey) [16:36:23] I logged in a couple of times in the last 10 minutes or so. Before the last log in, I cleared cookies, maybe that'll resolve the issue. [16:37:59] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge. [16:38:38] !log mattflaschen@tin Synchronized wmf-config/InitialiseSettings.php: Comment-only change (duration: 00m 41s) [16:38:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:39:40] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. [16:40:56] 6Operations, 6Commons, 10MediaWiki-Page-deletion, 10media-storage: MWException trying to delete a certain file on Commons - https://phabricator.wikimedia.org/T129637#2112028 (10fgiunchedi) @Steinsplitter could you try again now? thanks! [16:52:28] 6Operations, 6Commons, 10MediaWiki-Page-deletion, 10media-storage: MWException trying to delete a certain file on Commons - https://phabricator.wikimedia.org/T129637#2112039 (10Steinsplitter) >>! In T129637#2112028, @fgiunchedi wrote: > @Steinsplitter could you try again now? thanks! Works :-), deleted. [17:02:53] Still working on activating SSL for elasticsearch. Looking at the mediawiki config, it seems that elasticsearch LVS is accessed by IP, not by name. Probably to optimize away a DNS call. 
[17:04:12] (03CR) 10Ricordisamoa: Mathoid: enable PNG generation (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/276734 (https://phabricator.wikimedia.org/T71702) (owner: 10Mobrovac) [17:06:15] I'm not too keen on trying to put IP in the cert SAN. [17:06:26] Don't we have local DNS cache? [17:13:21] gehel: many of the other services are using local dns, i don't know for sure but i imagine we must have host level dns caching [17:14:00] ebernhardson: I should actually check that, fairly easy... [17:20:09] no, we do not seem to have local DNS caching ... [17:30:12] :S [17:31:18] There is probably a good reason for that ... [17:33:07] 6Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting access to to analytics-search-user for Mikhail Popov and Oliver Keyes - https://phabricator.wikimedia.org/T129260#2112140 (10RobH) a:5Ironholds>3RobH Since I'm on clinic duty this week, I'm stealing this task back for its listing on the o... [17:33:18] 6Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting access to to analytics-search-user for Mikhail Popov and Oliver Keyes - https://phabricator.wikimedia.org/T129260#2112142 (10RobH) p:5Triage>3High [17:34:06] gehel: local to our network yeah, we hit the dns recursors from the internal machines [17:35:12] godog: I'm getting distracted by micro optimizations here [17:35:27] * gehel slaps himself on both cheeks for getting off track [17:36:38] hehe no worries, easy to be lured by optimizations [17:36:58] (03CR) 10Elukey: [C: 032] Increase mediawiki::jobrunner::runners_basic concurrency for mw116[123789]. 
[puppet] - 10https://gerrit.wikimedia.org/r/276759 (https://phabricator.wikimedia.org/T129517) (owner: 10Elukey) [17:38:06] !log Increased mediawiki::jobrunner::runners_basic to 30 for mw116[123789] [17:38:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:42:59] 6Operations, 10OTRS, 13Patch-For-Review, 7user-notice: Upgrade OTRS to a more recent stable release - https://phabricator.wikimedia.org/T74109#2112164 (10MartinK) We still got the some layout-issues: * when a HTML-E-Mail is displayed the iframe (containing this email) is initially set to a fixed px-width... [17:45:30] (03PS5) 10Ema: (WIP) Lame rake / vcl / erb stuff (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/276733 (owner: 10Hashar) [17:47:30] (03CR) 10Chad: [C: 031] "No actual config changes result for current host: https://puppet-compiler.wmflabs.org/2025/ytterbium.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/276764 (owner: 10Chad) [17:47:36] (03CR) 10jenkins-bot: [V: 04-1] (WIP) Lame rake / vcl / erb stuff (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/276733 (owner: 10Hashar) [17:56:17] 6Operations, 10MediaWiki-JobQueue, 13Patch-For-Review: Job queue is growing and growing - https://phabricator.wikimedia.org/T129517#2112225 (10elukey) All the mw116[1-9] job runners are running with mediawiki::jobrunner::runners_basic set to 30 (was 20 - basic contains refreshLinks). The CPU utilization went... [18:08:53] 6Operations, 10Traffic, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Varnish support for shutting users out of a DC - https://phabricator.wikimedia.org/T129424#2112252 (10BBlack) This basically needs a hieradata switch that defaults to off called something like `cache::traffic_shutdown`, so we can set it o... 
[18:13:00] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [18:13:30] PROBLEM - puppet last run on ms-be3004 is CRITICAL: CRITICAL: Puppet has 1 failures [18:13:42] 6Operations, 10OTRS, 13Patch-For-Review, 7user-notice: Upgrade OTRS to a more recent stable release - https://phabricator.wikimedia.org/T74109#2112277 (10Dzahn) @MartinK I would recommend a separate ticket for that since the ticket was specifically about the upgrade and has been closed. Would you mind just... [18:14:10] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [18:20:00] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [18:20:04] upload? [18:20:10] I arrived to late [18:21:10] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [18:22:32] 6Operations, 10DBA, 13Patch-For-Review, 7Performance, and 2 others: Stress-test mediawiki application servers at codfw (specially to figure out db weights configuration) and basic buffer warming - https://phabricator.wikimedia.org/T124697#2112334 (10Krinkle) Finished audit of the network analysis. Performe... [18:23:35] 6Operations, 10Traffic, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Varnish support for shutting users out of a DC - https://phabricator.wikimedia.org/T129424#2112338 (10ema) a:3ema [18:25:10] 6Operations, 10ops-codfw, 13Patch-For-Review, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: rack/setup/deploy rdb200[5-6] - https://phabricator.wikimedia.org/T129178#2112358 (10Dzahn) @elukey I agree we should have consistency (if possible). Off-hand i don't know why these rdb hosts are using different re... 
[18:28:39] (03PS1) 10Dzahn: install: use raid1-lvm-ext4-srv.cfg on rdb1001 [puppet] - 10https://gerrit.wikimedia.org/r/276785 (https://phabricator.wikimedia.org/T129178) [18:31:06] gerrit >_< [18:31:55] 6Operations, 10ops-codfw, 13Patch-For-Review, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: rack/setup/deploy rdb200[5-6] - https://phabricator.wikimedia.org/T129178#2112402 (10Dzahn) so yea, i don't really know if it's worth actually redoing that. but _if_ you really want to make it consistent then i wou... [18:36:38] (03PS1) 10ArielGlenn: dumps: move all dumps-related dir decls to the dirs class [puppet] - 10https://gerrit.wikimedia.org/r/276786 [18:36:43] (03CR) 10Ottomata: "COOOL! I was able to build this and run it without kafka. I copied the binary I built into mediawiki-vagrant and tried to run there with" (031 comment) [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/276439 (https://phabricator.wikimedia.org/T124278) (owner: 10Elukey) [18:39:40] RECOVERY - puppet last run on ms-be3004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:41:56] (03PS2) 10Dzahn: Add Gujarati fonts to mediawiki::packages::fonts [puppet] - 10https://gerrit.wikimedia.org/r/276501 (https://phabricator.wikimedia.org/T129500) (owner: 10Dereckson) [18:42:36] !log mw1001 install fonts-gujr-extra to confirm gerrit 276501 is fine [18:42:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:42:41] (03CR) 10Dzahn: [C: 032] Add Gujarati fonts to mediawiki::packages::fonts [puppet] - 10https://gerrit.wikimedia.org/r/276501 (https://phabricator.wikimedia.org/T129500) (owner: 10Dereckson) [18:46:36] (03PS2) 10ArielGlenn: dumps: move all dumps-related dir decls to the dirs class [puppet] - 10https://gerrit.wikimedia.org/r/276786 [18:47:35] ah right, only on imagescalers [18:47:48] that's a separate ticket that wants us to change that [18:47:56] and install fonts on all appservers [18:50:58] (03CR) 10Dzahn: 
"mw1153 (imagescaler):" [puppet] - 10https://gerrit.wikimedia.org/r/276501 (https://phabricator.wikimedia.org/T129500) (owner: 10Dereckson) [18:51:12] !log mw1001 - removed font again, mw1153 confirmed puppet installs it (only) on imagescalers [18:51:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:55:28] (03CR) 10Krinkle: First draft for the Varnish 4 porting. (032 comments) [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/276439 (https://phabricator.wikimedia.org/T124278) (owner: 10Elukey) [18:57:29] (03PS3) 10ArielGlenn: dumps: move all dumps-related dir decls to the dirs class [puppet] - 10https://gerrit.wikimedia.org/r/276786 [18:58:25] 7Puppet, 6Commons, 10Wikimedia-SVG-rendering, 7I18n, 13Patch-For-Review: Add Gujarati fonts to Wikimedia servers - https://phabricator.wikimedia.org/T129500#2107530 (10Dzahn) This has been merged and i confirmed the fonts-gujr-extra gets installed now by puppet. This happens on the imagescalers but not o... [18:59:19] (03CR) 10ArielGlenn: [C: 032] dumps: move all dumps-related dir decls to the dirs class [puppet] - 10https://gerrit.wikimedia.org/r/276786 (owner: 10ArielGlenn) [19:00:00] PROBLEM - puppet last run on cp3048 is CRITICAL: CRITICAL: Puppet has 1 failures [19:00:40] PROBLEM - puppet last run on cp3017 is CRITICAL: CRITICAL: Puppet has 2 failures [19:02:00] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [19:03:06] puppet is broken on labs because of E: Unable to locate package fonts-gujr-extra [19:03:19] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [19:04:18] andrewbogott: arg? must be precise [19:04:29] tested on trusty and jessie [19:04:37] mutante: the one I’m looking at certainly is [19:04:39] can you copy that font over? 
[19:05:17] i'll look [19:05:40] it's just one more font in the list, maybe others have been copied before [19:06:03] (03CR) 10Ottomata: First draft for the Varnish 4 porting. (031 comment) [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/276439 (https://phabricator.wikimedia.org/T124278) (owner: 10Elukey) [19:06:17] andrewbogott: can you give me an instance names [19:06:32] tools-exec-1215 [19:06:42] ok, thanks [19:06:49] i also wonder about just reinstalling that [19:06:59] getting rid of precise [19:07:09] I don’t know — I suspect that we have users relying on it [19:07:15] there are 10 or so precise instances in tools [19:07:30] 6Operations, 6Services, 10Traffic, 7Performance: Look into a solution for replaying traffic for load testing - https://phabricator.wikimedia.org/T129682#2112642 (10GWicke) [19:07:34] yea, would be nice if we can change that [19:07:42] get it closer to prod [19:07:46] to avoid these [19:07:47] 6Operations, 6Services, 10Traffic, 7Performance: Look into a solution for replaying traffic for load testing - https://phabricator.wikimedia.org/T129682#2112654 (10GWicke) p:5Triage>3Normal [19:08:02] ok, checking on that package first [19:08:50] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [19:10:10] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [19:11:54] eh, why is tools-exec influenced by a change in the mediawiki module? [19:12:08] just thinking out loud [19:12:27] i would have expected beta but tools? 
[19:15:21] 6Operations, 6Services, 10Traffic, 7Performance: Look into a solution for replaying traffic for load testing - https://phabricator.wikimedia.org/T129682#2112709 (10Pchelolo) [19:15:35] I’m about to break my internet connect, will be on and off for a bit [19:16:56] andrewbogott: i'll fix it in puppet, not in reprepro, but i'll fix it [19:21:57] 7Puppet, 6Commons, 10Wikimedia-SVG-rendering, 7I18n, 13Patch-For-Review: Add Gujarati fonts to Wikimedia servers - https://phabricator.wikimedia.org/T129500#2112727 (10Dzahn) Breaks puppet in on some labs instances that still use precise, because that package doesn't exist in precise. [19:26:31] RECOVERY - puppet last run on cp3017 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [19:27:04] (03PS1) 10Dzahn: mediawiki: do not install fonts-gujr-extra on precise [puppet] - 10https://gerrit.wikimedia.org/r/276792 (https://phabricator.wikimedia.org/T129500) [19:27:41] RECOVERY - puppet last run on cp3048 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [19:28:15] (03CR) 10jenkins-bot: [V: 04-1] mediawiki: do not install fonts-gujr-extra on precise [puppet] - 10https://gerrit.wikimedia.org/r/276792 (https://phabricator.wikimedia.org/T129500) (owner: 10Dzahn) [19:30:40] (03PS2) 10Dzahn: mediawiki: do not install fonts-gujr-extra on precise [puppet] - 10https://gerrit.wikimedia.org/r/276792 (https://phabricator.wikimedia.org/T129500) [19:32:23] (03CR) 10Dzahn: [C: 032] mediawiki: do not install fonts-gujr-extra on precise [puppet] - 10https://gerrit.wikimedia.org/r/276792 (https://phabricator.wikimedia.org/T129500) (owner: 10Dzahn) [19:38:06] Notice: Finished catalog run in 90.83 seconds [19:38:06] root@tools-exec-1215:~# [19:40:27] thanks mutante [19:40:42] !log restarting dbstore1002 [19:40:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:40:49] andrewbogott, yw [19:41:24] (03PS1) 10ArielGlenn: [WIP] stash a script 
here for "collect groups of hosts from labs/beta/prod in order to run something on them, from my laptop because I'm lazy" [software] - 10https://gerrit.wikimedia.org/r/276796 [19:50:08] (03CR) 10ArielGlenn: "gonna merge this because although a WIP etc I already have someone who wants to add a new type of target list, woo" [software] - 10https://gerrit.wikimedia.org/r/276796 (owner: 10ArielGlenn) [19:50:19] (03CR) 10ArielGlenn: [C: 032 V: 032] "gonna merge this because although a WIP etc I already have someone who wants to add a new type of target list, woo" [software] - 10https://gerrit.wikimedia.org/r/276796 (owner: 10ArielGlenn) [19:51:40] 6Operations, 10Dumps-Generation, 7HHVM, 13Patch-For-Review: Convert snapshot hosts to use HHVM and trusty - https://phabricator.wikimedia.org/T94277#2112889 (10ori) [19:51:45] apergos: ^ [19:52:10] how sure are we about hhvm being stable now? [19:52:29] and, is it ok to delay til the end of this run (probably another 12 days)? [19:52:31] ori: [19:52:54] apergos: reasonably sure, but we can be more sure in 12 days [19:52:58] ok [19:53:02] let me make a note [19:53:04] so it makes sense to wait [19:53:11] and I will be watching [19:53:12] both for that reason and to not disturb the current run [19:53:14] cool :) [19:53:21] (I was definitely watching the lockups and the texvc thing) [19:53:35] un be fscking lievable [19:53:40] 6Operations, 10ops-codfw: Humidity Alarms from codfw - https://phabricator.wikimedia.org/T110421#2112893 (10RobH) a:5RobH>3Papaul Did we purchase the blanking panels for codfw or did cyrusone provide them? If we ordered, are they the Rittal 1U panels that have the metal plates and plastic screws/fasteners... 
[19:55:21] Operations, Dumps-Generation, HHVM, Patch-For-Review: Convert snapshot hosts to use HHVM and trusty - https://phabricator.wikimedia.org/T94277#2112911 (ArielGlenn) After chat on IRC with ori: The current dump run will likely get done in about 12 days, and we'll know a bit more about the stability o...
[20:07:31] Puppet, Commons, Wikimedia-SVG-rendering, I18n, Patch-For-Review: Add Gujarati fonts to Wikimedia servers - https://phabricator.wikimedia.org/T129500#2112937 (Dzahn) from site.pp: ``` # mw1153-1160 are imagescalers (trusty) #mw2086-mw2089 are imagescalers #mw2148-mw2151 are imagescalers ```...
[20:13:52] (PS1) Jdlrobson: Strip rather than hide HTML [mediawiki-config] - https://gerrit.wikimedia.org/r/276809 (https://phabricator.wikimedia.org/T110613)
[20:17:29] Puppet, Commons, Wikimedia-SVG-rendering, I18n, Patch-For-Review: Add Gujarati fonts to Wikimedia servers - https://phabricator.wikimedia.org/T129500#2112970 (Dzahn) mw2089 and also mw1153: $ apt-cache show fonts-gujr-extra | grep -e 'Break\|Replace' Replaces: ttf-gujarati-fonts Breaks: ttf-guj...
[20:18:14] Operations, Mathoid, Services, Trebuchet: Remove sca100x from the list of Mathoid's minioins - https://phabricator.wikimedia.org/T129645#2112971 (ArielGlenn) p:Triage>Normal a:ArielGlenn
[20:18:34] Operations, Mathoid, Services, Trebuchet: Remove sca100x from the list of Mathoid's minioins - https://phabricator.wikimedia.org/T129645#2111586 (ArielGlenn) I'll do this, I need to test the new packages anyways.
[20:18:57] Operations, Mathoid, Salt, Services, Trebuchet: Remove sca100x from the list of Mathoid's minioins - https://phabricator.wikimedia.org/T129645#2112975 (ArielGlenn)
[20:21:41] (PS1) Dzahn: mediawiki: ttf-gujarati-fonts replaced by fonts-gujr-extra [puppet] - https://gerrit.wikimedia.org/r/276812 (https://phabricator.wikimedia.org/T129500)
[20:21:55] Operations, Mathoid, Salt, Services, Trebuchet: Remove sca100x from the list of Mathoid's minions - https://phabricator.wikimedia.org/T129645#2112983 (Dzahn)
[20:21:58] Operations, ops-eqiad, Dumps-Generation: Rack and setup snapshot1005-1007 - https://phabricator.wikimedia.org/T129553#2112984 (ArielGlenn)
[20:23:03] (CR) Dzahn: [C: 2] mediawiki: ttf-gujarati-fonts replaced by fonts-gujr-extra [puppet] - https://gerrit.wikimedia.org/r/276812 (https://phabricator.wikimedia.org/T129500) (owner: Dzahn)
[20:23:44] Operations, ops-eqiad, Dumps-Generation: Rack and setup snapshot1005-1007 - https://phabricator.wikimedia.org/T129553#2108980 (ArielGlenn) Oh I see it has dual 480GB drives, right? Then I want them raid 1 and after that the snapshot.cfg recipe should be ok.
[20:24:55] Operations, ops-eqiad, Dumps-Generation: Rack and setup snapshot1005-1007 - https://phabricator.wikimedia.org/T129553#2112995 (ArielGlenn) Unless we want a separate additional partition for /srv which takes the rest of the space, that's also fine... that could get added right into the recipe I suppose.
[20:27:35] Operations, Puppet, Commons, Wikimedia-SVG-rendering, and 2 others: Add Gujarati fonts to Wikimedia servers - https://phabricator.wikimedia.org/T129500#2112998 (Dzahn)
[20:31:19] (PS1) Ori.livneh: Disable OAI extension [mediawiki-config] - https://gerrit.wikimedia.org/r/276817 (https://phabricator.wikimedia.org/T70867)
[20:31:55] (CR) Ori.livneh: [C: 2] Disable OAI extension [mediawiki-config] - https://gerrit.wikimedia.org/r/276817 (https://phabricator.wikimedia.org/T70867) (owner: Ori.livneh)
[20:32:35] (Merged) jenkins-bot: Disable OAI extension [mediawiki-config] - https://gerrit.wikimedia.org/r/276817 (https://phabricator.wikimedia.org/T70867) (owner: Ori.livneh)
[20:33:45] (CR) Aaron Schulz: [C: -1] "For concurrent multi-DC (or before then), but it's blocked on ipsec between MW and redis." [mediawiki-config] - https://gerrit.wikimedia.org/r/247325 (https://phabricator.wikimedia.org/T111575) (owner: Aaron Schulz)
[20:35:51] (PS1) Ori.livneh: Revert "Disable OAI extension"; postponed until Monday [mediawiki-config] - https://gerrit.wikimedia.org/r/276820
[20:35:56] hashar: ^
[20:36:14] Operations, Puppet, Commons, Wikimedia-SVG-rendering, and 2 others: Add Gujarati fonts to Wikimedia servers - https://phabricator.wikimedia.org/T129500#2113015 (Dzahn) after this the next issue on tools-webgrid- :/ ttf-indic-fonts : Depends: ttf-gujarati-fonts (= 1:0.5.14ubuntu1) *SIGH*
[20:36:19] (CR) Ori.livneh: [C: 2] "grumble grumble" [mediawiki-config] - https://gerrit.wikimedia.org/r/276820 (owner: Ori.livneh)
[20:36:37] Operations, ops-codfw: Humidity Alarms from codfw - https://phabricator.wikimedia.org/T110421#2113016 (Papaul) @RobH we did purchase the blanking panels. we are not using any cyrusone blanking panels. We ordered some 1U and 4U panels with metal plates.
[20:37:02] (Merged) jenkins-bot: Revert "Disable OAI extension"; postponed until Monday [mediawiki-config] - https://gerrit.wikimedia.org/r/276820 (owner: Ori.livneh)
[20:38:05] mutante: can you look at tools-webgrid-generic-1404 now? Similar issue, but it’s trusty
[20:38:17] andrewbogott: yes, i already talk about it in -labs
[20:38:44] ah, so I see
[20:39:30] (PS1) ArielGlenn: onallwikis: don't crash on retries is None [dumps] (ariel) - https://gerrit.wikimedia.org/r/276822
[20:41:20] RECOVERY - Disk space on logstash1004 is OK: DISK OK
[20:43:19] (CR) ArielGlenn: [C: 2 V: 2] onallwikis: don't crash on retries is None [dumps] (ariel) - https://gerrit.wikimedia.org/r/276822 (owner: ArielGlenn)
[20:47:13] (PS1) Dzahn: mediawiki: revert adding fonts-gujr-extra [puppet] - https://gerrit.wikimedia.org/r/276825 (https://phabricator.wikimedia.org/T129500)
[20:47:43] (CR) Dzahn: "also: https://gerrit.wikimedia.org/r/#/c/218640/" [puppet] - https://gerrit.wikimedia.org/r/276825 (https://phabricator.wikimedia.org/T129500) (owner: Dzahn)
[20:48:19] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There are 2 unmerged changes in mediawiki_config (dir /srv/mediawiki-staging/).
[20:49:05] (CR) Dzahn: [C: 2] "back to before https://gerrit.wikimedia.org/r/#/c/276501/" [puppet] - https://gerrit.wikimedia.org/r/276825 (https://phabricator.wikimedia.org/T129500) (owner: Dzahn)
[20:49:13] oh gerrit
[20:49:25] * apergos heaves a heavy sigh
[20:51:45] Operations, Puppet, Commons, Wikimedia-SVG-rendering, and 2 others: Add Gujarati fonts to Wikimedia servers - https://phabricator.wikimedia.org/T129500#2113048 (Dzahn) for this to work on all distros, precise, trusty and jessie the entire fonts.pp needs to be restructured. (or we need to get rid of...
[20:52:25] Operations, Puppet, Commons, Wikimedia-SVG-rendering, I18n: Add Gujarati fonts to Wikimedia servers - https://phabricator.wikimedia.org/T129500#2113049 (Dzahn)
[20:53:05] (CR) Dzahn: "had to revert with https://gerrit.wikimedia.org/r/#/c/276825/" [puppet] - https://gerrit.wikimedia.org/r/276501 (https://phabricator.wikimedia.org/T129500) (owner: Dereckson)
[20:54:35] (PS1) ArielGlenn: dumps: move pagetitles production from wikiquery script to onallwikis [puppet] - https://gerrit.wikimedia.org/r/276829
[21:03:10] (CR) ArielGlenn: [C: 2] dumps: move pagetitles production from wikiquery script to onallwikis [puppet] - https://gerrit.wikimedia.org/r/276829 (owner: ArielGlenn)
[21:13:56] (PS1) Dzahn: set $SSH to $(which ssh) vs manual setup [software] - https://gerrit.wikimedia.org/r/276847
[21:15:00] (PS2) Dzahn: set $SSH to $(which ssh) vs manual setup [software] - https://gerrit.wikimedia.org/r/276847
[21:15:26] (PS3) Dzahn: salt-misc: set $SSH to $(which ssh) vs manual setup [software] - https://gerrit.wikimedia.org/r/276847
[21:19:11] PROBLEM - Kafka Broker Replica Max Lag on kafka1014 is CRITICAL: CRITICAL: 55.17% of data above the critical threshold [5000000.0]
[21:20:08] (PS1) Dzahn: salt-misc: make bastion host configurable [software] - https://gerrit.wikimedia.org/r/276882
[21:22:19] hey, I can't access to "research/ores/wheels" project even though it exists, why?
[21:22:36] I have a patch in it but I can't see my own patch :)))
[21:22:45] https://gerrit.wikimedia.org/r/276310
[21:23:50] I can view it
[21:24:02] both logged in as an admin and logged out
[21:24:44] https://usercontent.irccloud-cdn.com/file/2YRfS179/
[21:24:54] Krenair: ^
[21:25:18] alex@alex-laptop:~$ ssh gerrit gerrit ls-user-refs --project research/ores/wheels --user Ladsgroup
[21:25:18] refs/changes/10/276310/3
[21:25:18] refs/changes/10/276310/2
[21:25:18] refs/changes/10/276310/1
[21:25:18] refs/heads/master
[21:25:19] HEAD
[21:26:13] That's strange
[21:26:47] what if you view it while logged out?
[21:28:01] Krenair: you can see my user name at top right
[21:28:10] oh okay
[21:28:12] yes, that means you are logged in
[21:28:12] let me
[21:28:34] (PS1) Dzahn: salt-misc: set bastion host based on realm as $2 [software] - https://gerrit.wikimedia.org/r/276884
[21:29:08] I can see it
[21:29:17] this is bizzare
[21:29:25] *bizarre
[21:29:29] so it's only invisible when you're logged in?
[21:29:37] can you view https://gerrit.wikimedia.org/r/#/admin/projects/research/ores/wheels ?
[21:31:35] Can someone fix the job queu eproblem? We nor have 30 million jobs
[21:31:48] Yeah I can see
[21:32:01] but I'm logged out
[21:32:02] I think ops were discussing it earlier
[21:32:05] I doubt the problem is simple
[21:32:08] let me try again
[21:32:12] (re Luke081515)
[21:32:23] Amir1, can you see that page logged in?
[21:32:49] I tried again, and yeah I can see it logged in
[21:32:55] (PS2) Dzahn: salt-misc: set bastion host based on realm as $2 [software] - https://gerrit.wikimedia.org/r/276884
[21:33:20] any ideas ostriches?
[21:35:10] I just made a new commit and I can see it now
[21:35:14] at least for now
[21:35:19] (PS3) Dzahn: salt-misc: set bastion host based on realm as $2 [software] - https://gerrit.wikimedia.org/r/276884
[21:36:24] Operations, ops-eqiad, Dumps-Generation: Rack and setup snapshot1005-1007 - https://phabricator.wikimedia.org/T129553#2113254 (RobH) Ideally if it uses /srv as the primary storage, we put it into a raid1-lvm-srv type recipe.
[21:36:40] RECOVERY - Kafka Broker Replica Max Lag on kafka1014 is OK: OK: Less than 50.00% above the threshold [1000000.0]
[21:37:05] (PS1) ArielGlenn: pagetitles dumps: use the config file with all wikis [puppet] - https://gerrit.wikimedia.org/r/276887
[21:38:44] (CR) ArielGlenn: [C: 2] pagetitles dumps: use the config file with all wikis [puppet] - https://gerrit.wikimedia.org/r/276887 (owner: ArielGlenn)
[21:51:20] (PS1) Dzahn: salt-misc: add new target_type role (WIP) [software] - https://gerrit.wikimedia.org/r/276890
[21:52:58] (PS2) Dzahn: salt-misc: add new target_type role (WIP) [software] - https://gerrit.wikimedia.org/r/276890
[21:54:56] Operations, Incident-20160126-WikimediaDomainRedirection, Monitoring: add icinga and watchmouse https checks for content on commons. or other wikimedia.org sites - https://phabricator.wikimedia.org/T124812#2113324 (Dzahn) how about checking for "Picture of the day" on the Main_Page of commons ?
[21:58:39] (PS1) ArielGlenn: dumps: start moving media title dumps to onallwikis [puppet] - https://gerrit.wikimedia.org/r/276892
[21:58:41] (PS1) Alex Monk: horizon: Add dynamicproxy IPs to config [puppet] - https://gerrit.wikimedia.org/r/276893 (https://phabricator.wikimedia.org/T129245)
[22:02:29] Krenair: I had ideas but they were wrong
[22:04:58] (CR) ArielGlenn: "I have this in here because my 'ssh' command on laptop is a script 'sshes' and actually I might even add a flag in there. So 'which' won'" [software] - https://gerrit.wikimedia.org/r/276847 (owner: Dzahn)
[22:06:22] (PS1) Dereckson: Women's writes WikiWarriors edit-a-thon throttle rule [mediawiki-config] - https://gerrit.wikimedia.org/r/276895 (https://phabricator.wikimedia.org/T129697)
[22:08:02] Operations, Epic, Performance, Release-Engineering-Epics: [EPIC] Performance testing environment - https://phabricator.wikimedia.org/T67394#2113346 (greg)
[22:08:56] Operations, Release-Engineering-Team, Epic, Performance: [EPIC] Performance testing environment - https://phabricator.wikimedia.org/T67394#2113355 (greg)
[22:10:23] (CR) ArielGlenn: "You might have to quote $2 in the case statement, in case the user doesn't supply it (test). I like it. This supercedes https://gerrit.w" [software] - https://gerrit.wikimedia.org/r/276884 (owner: Dzahn)
[22:11:39] Operations, ops-eqiad, Dumps-Generation: Rack and setup snapshot1005-1007 - https://phabricator.wikimedia.org/T129553#2113386 (ArielGlenn) /srv will have mediawiki branches on it, and a few dump scripts. If it's typical for that to have lvm, make it so :-)
[22:19:02] (PS2) ArielGlenn: dumps: start moving media title dumps to onallwikis [puppet] - https://gerrit.wikimedia.org/r/276892
[22:23:58] (PS1) Dereckson: Taller d'iniciació a la Viquipèdia, Montserrat throttle rule [mediawiki-config] - https://gerrit.wikimedia.org/r/276900 (https://phabricator.wikimedia.org/T129490)
[22:24:35] (CR) ArielGlenn: [C: 2] dumps: start moving media title dumps to onallwikis [puppet] - https://gerrit.wikimedia.org/r/276892 (owner: ArielGlenn)
[22:29:08] Operations, Patch-For-Review: Randomly failing puppetmaster sync to strontium - https://phabricator.wikimedia.org/T128895#2113427 (ArielGlenn) ariel@palladium:~$ sudo -s root@palladium:~# puppet-merge Fetching new commits from https://gerrit.wikimedia.org/r/p/operations/puppet remote: Counting objects: 84...
[22:35:32] Operations, MediaWiki-JobQueue, Patch-For-Review: Job queue is growing and growing - https://phabricator.wikimedia.org/T129517#2113465 (Luke081515) Other ideas to solve this? I guess there only specific types of jobs affected aren't they? Otherwise users at wikis should noticed this, but there I didn'...
[22:36:52] _joe_: are the jobrunners running a different version of hhvm than the app servers?
[22:41:20] <_joe_> AaronSchulz: I don't think so, and you should ask ori
[22:41:27] <_joe_> I'm literally just landed
[22:41:49] heh
[22:42:23] (Abandoned) Jdlrobson: Enable reference storage on Japanese Wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/274470 (https://phabricator.wikimedia.org/T126802) (owner: Jdlrobson)
[22:42:29] (Abandoned) Jdlrobson: Enable reference storage on beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/275058 (owner: Jdlrobson)
[22:43:24] looks like all of eqiad was bumped, so I guess that means "no"
[22:43:43] errors are almost strictly from runners though.
[22:45:41] * AaronSchulz wants to try something
[22:45:43] (PS1) Aaron Schulz: Disable persistent redis connections and bump timeout a bit [mediawiki-config] - https://gerrit.wikimedia.org/r/276904
[22:47:06] AaronSchulz: Need me to deploy or you already on tin?
[22:48:05] (CR) Aaron Schulz: [C: 2] Disable persistent redis connections and bump timeout a bit [mediawiki-config] - https://gerrit.wikimedia.org/r/276904 (owner: Aaron Schulz)
[22:48:28] (Merged) jenkins-bot: Disable persistent redis connections and bump timeout a bit [mediawiki-config] - https://gerrit.wikimedia.org/r/276904 (owner: Aaron Schulz)
[22:48:40] (PS1) ArielGlenn: onallwikis: allow query to include wiki name in string [dumps] (ariel) - https://gerrit.wikimedia.org/r/276905
[22:50:43] !log aaron@tin Synchronized wmf-config/jobqueue-eqiad.php: Disable persistent redis connections and bump timeout a bit (duration: 00m 38s)
[22:50:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[22:51:20] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge.
[22:53:51] ostriches: massive reduction in errors after that
[22:56:10] (PS1) ArielGlenn: [WIP] dumps: convert generation of media titles per project to onallwikis [puppet] - https://gerrit.wikimedia.org/r/276907
[22:56:13] Operations, MediaWiki-JobQueue, Patch-For-Review: Job queue is growing and growing - https://phabricator.wikimedia.org/T129517#2107949 (aaron) I see a massive reduction in errors after https://gerrit.wikimedia.org/r/#/c/276904/ just now.
[22:56:47] AaronSchulz: I'm pretty sure that either fixed it or completely hid the problem :p
[22:57:09] well no other errors either (e.g. all of mediawiki-errors), heh
[22:57:18] that was the first thing I checked for ;)
[23:00:38] AaronSchulz: Still seeing the "Warning: timed out after 0.25 seconds when connecting to rdb1003.eqiad.wmnet [110]: Connection timed out" like we had prior. I'm wondering if we could go a tad higher. We tossed around .5 the other day.
[23:01:43] looks like it's able to find fallback servers again (unlike the exceptions before were it goes throw 5 instances)
[23:02:19] having a higher timeout on the jobrunner side should be fine. It used to be 2s I think
[23:02:20] .5 is an eternity though...I always wondered what was up with that...possibly auth related
[23:02:44] serialization pause on the redis side?
[23:03:51] a note before you folks pack it in on the job queue stuff:
[23:04:00] fluorine logs are getting big again
[23:04:02] -rw-r--r-- 1 udp2log udp2log 332G Mar 11 23:03 redis.log
[23:04:02] -rw-r--r-- 1 udp2log udp2log 288G Mar 11 23:03 JobQueueFederated.log
[23:04:08] remember 1 second was not enough for mysql although I would be unable to compare # of connections
[23:04:24] these are today's logs, if you wouldn't mind truncating them to a reaosnable point once you are done using the backlog
[23:04:25] thanks
[23:10:17] I also wonder what is enqueueing those refreshLinks jobs. From the root timestamp, it's not edits, can't be null edits, and I didn't see anything interesting in the API post log (e.g. forcelinksupdate=1). I wonder if it's an extension.
[23:12:32] <_joe_> AaronSchulz: are those job new?
[23:12:57] <_joe_> I thought of jobs that failed yesterday, but that ship sailed
[23:13:11] (PS1) Krinkle: Set test2.wikipedia.org favicon to black-globe.ico [mediawiki-config] - https://gerrit.wikimedia.org/r/276912
[23:13:13] <_joe_> also note th strange periodicity in job submission surge
[23:13:39] yesterday when I'd run jobs I'd see ones with root timestamps from 2 hours ago for a heavy used module page or template edited 2 or more days ago. Things like that.
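Editorial note on the 23:00 to 23:02 exchange: the messages describe a short connect timeout (0.25 s, with 0.5 s and 2 s floated) combined with falling back to another server when one times out. The following is a minimal Python sketch of that general pattern, not the MediaWiki job queue code. The function name `connect_with_fallback` and its injectable `connect` parameter are hypothetical; only `socket.create_connection` is a real standard-library call.

```python
import socket


def connect_with_fallback(servers, timeout=0.5, connect=socket.create_connection):
    """Try each (host, port) in order and return (server, connection).

    `timeout` is the per-server connect timeout in seconds. A server that
    times out or refuses the connection is skipped and the next one is tried.
    `connect` is injectable so the logic can be exercised without a network
    (an assumption made for this sketch).
    """
    errors = []
    for server in servers:
        try:
            return server, connect(server, timeout=timeout)
        except OSError as exc:  # socket timeouts and refusals are OSError subclasses
            errors.append((server, exc))
    # Every candidate failed: report all of them, as a caller would want to log.
    raise ConnectionError(f"all {len(servers)} servers failed: {errors}")
```

A longer timeout makes each failed attempt cost more before the fallback is tried, which is the trade-off being weighed in the chat between 0.25 s and 0.5 s.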
[23:13:43] I think that was existing, just hugely increased
[23:14:02] (the difference between valleys and peaks)
[23:14:09] <_joe_> look here: https://ganglia.wikimedia.org/latest/graph.php?r=day&z=xlarge&c=Jobrunners+eqiad&m=cpu_report&s=by+name&mc=2&g=network_report
[23:14:35] Total 34333797
[23:14:35] plwiktionary 3542306
[23:14:35] srwiki 3432133
[23:14:36] enwiktionary 3412202
[23:14:49] odd that those ones are still on top...
[23:15:03] frwiktionary is high too
[23:15:06] <_joe_> AaronSchulz: did you rchange did anything good?
[23:15:25] <_joe_> yeah seems like some bug tbh
[23:15:31] well the post-send exception flood is gone and the redis errors are way down
[23:15:36] size isn't any better though
[23:16:16] <_joe_> I'm too tired to really help, sorry
[23:16:34] those are all s2, right?
[23:16:43] no
[23:17:26] s7, and s3 too, lots of shards
[23:17:35] yes
[23:17:46] AaronSchulz: I wonder if it's that cross-wiki notification stuff? Causing a refreshLinks somewhere deep down. That could explain why a bunch of wikis that get so few edits are spamming linksUpdate....
[23:18:00] <_joe_> AaronSchulz: so since your change: 1) enqueued jobs skyrocketed again 2) network out went down, network in went up
[23:18:30] <_joe_> AaronSchulz: what if jobchron is resubmitting duplicate jobs?
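Editorial note on the queue sizes pasted at 23:14:35: the three wikis listed account for roughly 30% of the total. The short Python sketch below does that arithmetic using only the numbers from the paste; the variable and function names are illustrative.

```python
# Counts copied from the 23:14:35 paste in the log.
total_jobs = 34_333_797
top_wikis = {
    "plwiktionary": 3_542_306,
    "srwiki": 3_432_133,
    "enwiktionary": 3_412_202,
}


def share(count, total):
    """Return count as a percentage of total."""
    return 100.0 * count / total


top_sum = sum(top_wikis.values())
for wiki, count in top_wikis.items():
    print(f"{wiki:<14} {count:>10,d}  {share(count, total_jobs):5.1f}%")
print(f"{'top three':<14} {top_sum:>10,d}  {share(top_sum, total_jobs):5.1f}%")
```

Each of the three sits near 10% of the queue, which fits the remark that it is odd for these comparatively low-traffic wikis to be on top.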
[23:18:38] the network out/in flip actually takes it closer to the pre-fuckup state, heh
[23:19:08] (CR) Krinkle: [C: 2] Set test2.wikipedia.org favicon to black-globe.ico [mediawiki-config] - https://gerrit.wikimedia.org/r/276912 (owner: Krinkle)
[23:19:23] (looking at the week view)
[23:19:33] (Merged) jenkins-bot: Set test2.wikipedia.org favicon to black-globe.ico [mediawiki-config] - https://gerrit.wikimedia.org/r/276912 (owner: Krinkle)
[23:19:52] _joe_: the recycle rate is small and nominal from the graphs
[23:21:17] still tailing the API log for forcelinkupdate=1 and it looks harmless
[23:21:24] !log krinkle@tin Synchronized wmf-config/InitialiseSettings.php: test2wiki favicon (duration: 00m 30s)
[23:21:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:21:49] * AaronSchulz leans toward some extension problem rather than users doing evil stuff [shrug]
[23:23:18] AaronSchulz: What I said above ^^?
[23:23:26] In terms of an extension being at fault
[23:24:32] errr, I don't think cross-wiki notifs can trigger refreshlinks jobs
[23:25:21] yeah, that would seem odd
[23:25:42] ostriches: the changeDispatcher thing maybe?
[23:25:56] Could be
[23:26:02] * AaronSchulz greps around
[23:27:15] Operations, MobileFrontend, Traffic, Reading-Web-Sprint-67-If, Then, Else...?, and 3 others: Incorrect TOC and section edit links rendering in Vector due to ParserCache corruption via ParserOutput::setText( ParserOutput::getText() ) - https://phabricator.wikimedia.org/T124356#2113612 (Jdlrobson) >...
[23:33:29] wfDebugLog( __CLASS__, ... )
[23:33:48] * AaronSchulz sees no entries even though that corresponding statsd ones are there
[23:35:24] nothing in +channel:"Wikibase\Client\Changes\WikiPageUpdater"
[23:35:37] bd808: does logstash handle slashes OK?
[23:36:02] * AaronSchulz wonder why it could just use a normal string literal name
[23:36:08] *couldn't
[23:46:21] Side complaint: it would be nice to have a more granular number than 4k/sec or 5k/sec on grafana.
[23:50:47] AaronSchulz: it should handle slashes I think.
[23:52:47] There we go!