[00:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170223T0000).
[00:00:04] eddiegp: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process.
[00:00:36] Here
[00:01:54] (PS3) Andrew Bogott: WIP: Sync ldap project groups with keystone project membership [puppet] - https://gerrit.wikimedia.org/r/338918
[00:02:20] DatGuy: this might actually work :) https://github.com/uber-archive/image-diff
[00:02:43] :o
[00:02:44] EddieGP: can you use a verb to start the commit message?
[00:02:57] Hello :)
[00:03:03] now imagine that kind of diff in the gerrit view, heh
[00:03:19] mutante lol
[00:03:20] Dereckson: Of course I can do that :D
[00:03:38] EddieGP: it's not really a big matter here, but that allows to create an easier log to read: you can find https://chris.beams.io/posts/git-commit/ a good collection of tips
[00:04:12] (PS7) EddieGP: Added new throttle rule, removed expired rules [mediawiki-config] - https://gerrit.wikimedia.org/r/339224 (https://phabricator.wikimedia.org/T158767)
[00:04:16] mutante: or on Phabricator, ie "I won't print it", and all the discussions about "what I can't see the diff"
[00:04:21] imperative form :p
[00:04:33] hah, yes :)
[00:04:37] each commit is seen as an action to apply
[00:05:21] (PS8) EddieGP: Add new throttle rule, remove expired rules [mediawiki-config] - https://gerrit.wikimedia.org/r/339224 (https://phabricator.wikimedia.org/T158767)
[00:05:38] now if there was a gerrit plugin that gave you feedback on your commit msg that would be great :)
[00:07:34] to analyze grammar isn't the easiest linter
[00:08:04] (CR) Dereckson: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/339224 (https://phabricator.wikimedia.org/T158767) (owner: EddieGP)
[00:08:14] Dereckson google must have something that does that.
[00:09:04] Try sometimes automatic translation between English and German, you'll see it's good to produce grammatically correct sentences, less good to parse correctly grammar
[00:09:27] (Merged) jenkins-bot: Add new throttle rule, remove expired rules [mediawiki-config] - https://gerrit.wikimedia.org/r/339224 (https://phabricator.wikimedia.org/T158767) (owner: EddieGP)
[00:09:36] (CR) jenkins-bot: Add new throttle rule, remove expired rules [mediawiki-config] - https://gerrit.wikimedia.org/r/339224 (https://phabricator.wikimedia.org/T158767) (owner: EddieGP)
[00:09:38] oh
[00:09:57] We did a lot of progress in natural language processing, but there is still some work to do.
[00:09:58] Diese Einweisungsnachricht sollte mit dem Namen eines Puppet-Modules beginnen.
[00:10:51] EddieGP: your change is live on mwdebug1002, you already installed the X Wikimedia Debug extension?
[00:11:23] No, I didn't, but I guess I've heard that name before ;)
[00:11:30] Where do I get if (Firefox)?
[00:11:50] https://wikitech.wikimedia.org/wiki/X-Wikimedia-Debug has some links at the top, for Firefox and Chrome
[00:12:33] Once installed, you choose mwdebug1002 in the menu, and set the trigger to ON, then you can visit it.wikiversity and confirms nothing is broken.
[00:12:33] paladox: i can be that plugin
[00:12:39] mutante lol
[00:12:39] (CR) Dzahn: [C: -1] "human plugin to give feedback on commit message: please start with module name" [puppet] - https://gerrit.wikimedia.org/r/326932 (https://phabricator.wikimedia.org/T151544) (owner: Paladox)
[00:12:48] :D
[00:13:11] (PS3) Paladox: Phabricator: Up the size for storage.mysql-engine.max-size to 20mb in bytes [puppet] - https://gerrit.wikimedia.org/r/326932 (https://phabricator.wikimedia.org/T151544)
[00:13:20] (PS4) Paladox: Phabricator: Up the size for storage.mysql-engine.max-size to 20mb in bytes [puppet] - https://gerrit.wikimedia.org/r/326932 (https://phabricator.wikimedia.org/T151544)
[00:13:52] also " Diese Einweisungsnachricht sollte mit dem Namen eines Puppet-Modules beginnen." = something about puppet-modules
[00:13:56] mutante ^^
[00:13:57] lol
[00:14:29] "Einweisungsnachricht" :-D Hahahaha
[00:14:51] qchris__: :))
[00:14:57] * mutante laesst sich einweisen
[00:15:33] Works for me. Is there anything specific I should test different from loading a few sites, logging in, making a test edit?
[00:15:35] qchris Instruction message
[00:15:41] qchris__ ^^
[00:15:42] Next time I'll have to introduce someone to git, I'll totally quote you on Einweisungsnachricht :-)
[00:15:50] lol
[00:16:10] mutante Can be proved
[00:16:11] lol
[00:16:21] EddieGP: for the throttle rule, I don't think we can do more tests
[00:17:02] Okay.
[00:17:08] EddieGP: no, there is generally not a need for a full test procedure, the goal here was only to be sure no error is thrown by the piece of code directly touched by the config changed
[00:17:13] so no need to make edits, etc.
[00:17:31] excepted if you need to test something which require an edit to trigger
[00:17:58] I'm syncing to prod
[00:18:27] I'll keep that in mind for the next time. Or at least I'll try it.
[00:18:40] !log mwscript deleteEqualMessages.php --wiki simplewikibooks (T45917)
[00:18:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:18:47] T45917: Delete all redundant "MediaWiki" pages for system messages - https://phabricator.wikimedia.org/T45917
[00:19:21] !log dereckson@tin Synchronized wmf-config/throttle.php: Throttle rule for it.wikiversity (T158767) (duration: 00m 40s)
[00:19:21] EddieGP: you can put the extension to off by the way, so you can use again the regular servers
[00:19:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:19:27] T158767: Mass account creation to IP 213.26.151.190 on it.wikiversity - https://phabricator.wikimedia.org/T158767
[00:19:39] that also helps to have a mwdebug1002 log with only the test related log
[00:19:58] Dereckson: I would definitely not have forgotten that, using the test server is horribly slow with me ;)
[00:20:15] But thanks anyways for the tip!
[00:20:37] Thanks for the change EddieGP and welcome :)
[00:24:25] * EddieGP goes looking for his other commit messages :)
[00:36:51] PROBLEM - puppet last run on maps1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:40:54] (CR) Dzahn: [C: -1] "why does the linked ticket talk about "restrict maximum file size" but this change raises it from 10 to 20 MB? And it seems to say "As of " [puppet] - https://gerrit.wikimedia.org/r/326932 (https://phabricator.wikimedia.org/T151544) (owner: Paladox)
[00:58:05] (PS1) Ladsgroup: dumps: Redesign progress report page [puppet] - https://gerrit.wikimedia.org/r/339332 (https://phabricator.wikimedia.org/T155697)
[01:00:04] twentyafterfour: Respected human, time to deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170223T0100). Please do the needful.
[01:02:39] (PS2) Ladsgroup: dumps: Redesign progress report page [puppet] - https://gerrit.wikimedia.org/r/339332 (https://phabricator.wikimedia.org/T155697)
[01:05:51] RECOVERY - puppet last run on maps1001 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures
[01:08:12] PROBLEM - puppet last run on cp3038 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:26:17] Operations, Discovery, Wikimedia-Portals, Discovery-Portal-Sprint, Patch-For-Review: https://www.wikipedia.org/ portal doesn't have any text - https://phabricator.wikimedia.org/T158782#3047125 (demon) This can be resolved now, right? Issue fixed, report filed. Have tasks been filed for the ac...
[01:37:11] RECOVERY - puppet last run on cp3038 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures
[01:59:01] PROBLEM - Start a job and verify on Precise on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/precise - 185 bytes in 0.225 second response time
[02:04:01] RECOVERY - Start a job and verify on Precise on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.584 second response time
[02:07:51] Operations, Discovery, Wikimedia-Portals, Discovery-Portal-Sprint, Patch-For-Review: https://www.wikipedia.org/ portal doesn't have any text - https://phabricator.wikimedia.org/T158782#3047125 (Krinkle) I'm not sure why T158810 or T158808 would be needed to have prevented this incident. * Ho...
[02:23:25] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.12) (duration: 08m 40s)
[02:23:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:54:47] volans, i responded to your comment on https://gerrit.wikimedia.org/r/#/c/338950/ ... just an fyi in case you missed it.
[02:55:08] Operations, ArchCom-RfC, Commons, MediaWiki-File-management, and 15 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#3048748 (Tgr) >>! In T66214#3030371, @cscott wrote: > The one exception is that the current thumbnail code arbitrarily quantizes sizes in order to re...
[02:56:23] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.13) (duration: 14m 38s)
[02:56:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:02:10] !log l10nupdate@tin ResourceLoader cache refresh completed at Thu Feb 23 03:02:10 UTC 2017 (duration 5m 47s)
[03:02:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:20:12] (PS1) Yuvipanda: k8s: Install socat on worker nodes [puppet] - https://gerrit.wikimedia.org/r/339343
[03:20:35] (PS2) Yuvipanda: k8s: Install socat on worker nodes [puppet] - https://gerrit.wikimedia.org/r/339343
[03:20:41] (CR) Yuvipanda: [V: 2 C: 2] k8s: Install socat on worker nodes [puppet] - https://gerrit.wikimedia.org/r/339343 (owner: Yuvipanda)
[03:24:41] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 636.03 seconds
[03:28:41] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 195.17 seconds
[04:01:46] (CR) Krinkle: [C: 1] "@bd808: Seems the warning is still there in logstash-beta, despite all entries having it now. I know it's harmless but shouldn't it have m" [mediawiki-config] - https://gerrit.wikimedia.org/r/331808 (owner: Aaron Schulz)
[04:02:23] (CR) Krinkle: "https://phabricator.wikimedia.org/T123728#3048011" [puppet] - https://gerrit.wikimedia.org/r/338805 (owner: Filippo Giunchedi)
[04:04:01] PROBLEM - Start a job and verify on Precise on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/precise - 185 bytes in 1.037 second response time
[04:04:50] (CR) BryanDavis: "> @bd808: Seems the warning is still there in logstash-beta, despite" [mediawiki-config] - https://gerrit.wikimedia.org/r/331808 (owner: Aaron Schulz)
[04:09:01] RECOVERY - Start a job and verify on Precise on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.802 second response time
[05:10:51] PROBLEM - puppet last run on aqs1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:15:51] PROBLEM - High lag on wdqs1002 is CRITICAL: CRITICAL: 52.94% of data above the critical threshold [1800.0]
[05:16:21] PROBLEM - High lag on wdqs1003 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [1800.0]
[05:18:51] PROBLEM - High lag on wdqs1002 is CRITICAL: CRITICAL: 42.86% of data above the critical threshold [1800.0]
[05:19:21] PROBLEM - High lag on wdqs1003 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [1800.0]
[05:29:21] PROBLEM - High lag on wdqs1003 is CRITICAL: CRITICAL: 45.45% of data above the critical threshold [1800.0]
[05:29:51] RECOVERY - High lag on wdqs1002 is OK: OK: Less than 30.00% above the threshold [600.0]
[05:33:21] RECOVERY - High lag on wdqs1003 is OK: OK: Less than 30.00% above the threshold [600.0]
[05:37:51] RECOVERY - puppet last run on aqs1006 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures
[06:36:32] (PS1) Urbanecm: New namespace aliases for itwikiversity [mediawiki-config] - https://gerrit.wikimedia.org/r/339348 (https://phabricator.wikimedia.org/T158775)
[06:39:49] (CR) jerkins-bot: [V: -1] New namespace aliases for itwikiversity [mediawiki-config] - https://gerrit.wikimedia.org/r/339348 (https://phabricator.wikimedia.org/T158775) (owner: Urbanecm)
[06:52:41] PROBLEM - puppet last run on ms-be1017 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:56:49] (Abandoned) Marostegui: Revert "db-eqiad.php: Depool db1060" [mediawiki-config] - https://gerrit.wikimedia.org/r/339207 (owner: Marostegui)
[06:59:12] (PS1) Marostegui: db-eqiad.php: Restore db1060 original load [mediawiki-config] - https://gerrit.wikimedia.org/r/339349 (https://phabricator.wikimedia.org/T158194)
[07:03:34] (CR) Marostegui: [C: 2] db-eqiad.php: Restore db1060 original load [mediawiki-config] - https://gerrit.wikimedia.org/r/339349 (https://phabricator.wikimedia.org/T158194) (owner: Marostegui)
[07:03:51] PROBLEM - puppet last run on mw1184 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:05:20] (Merged) jenkins-bot: db-eqiad.php: Restore db1060 original load [mediawiki-config] - https://gerrit.wikimedia.org/r/339349 (https://phabricator.wikimedia.org/T158194) (owner: Marostegui)
[07:06:29] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Restore db1060 original load - T158194 (duration: 00m 40s)
[07:06:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:06:34] T158194: Replace BBU for db1060 - https://phabricator.wikimedia.org/T158194
[07:07:14] (CR) jenkins-bot: db-eqiad.php: Restore db1060 original load [mediawiki-config] - https://gerrit.wikimedia.org/r/339349 (https://phabricator.wikimedia.org/T158194) (owner: Marostegui)
[07:09:24] (PS1) Marostegui: db-codfw.php: Repool db2062 - depool db2069 [mediawiki-config] - https://gerrit.wikimedia.org/r/339350 (https://phabricator.wikimedia.org/T132416)
[07:12:08] (CR) Marostegui: [C: 2] db-codfw.php: Repool db2062 - depool db2069 [mediawiki-config] - https://gerrit.wikimedia.org/r/339350 (https://phabricator.wikimedia.org/T132416) (owner: Marostegui)
[07:12:10] (PS1) Marostegui: linux-host-entries: Remove trusty from dbstore1001 [puppet] - https://gerrit.wikimedia.org/r/339351 (https://phabricator.wikimedia.org/T153768)
[07:13:07] (Merged) jenkins-bot: db-codfw.php: Repool db2062 - depool db2069 [mediawiki-config] - https://gerrit.wikimedia.org/r/339350 (https://phabricator.wikimedia.org/T132416) (owner: Marostegui)
[07:13:20] (CR) jenkins-bot: db-codfw.php: Repool db2062 - depool db2069 [mediawiki-config] - https://gerrit.wikimedia.org/r/339350 (https://phabricator.wikimedia.org/T132416) (owner: Marostegui)
[07:14:19] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Repool db2062 and depool db2069 - T132416 (duration: 00m 42s)
[07:14:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:14:24] T132416: Rampant differences in indexes on enwiki.revision across the DB cluster - https://phabricator.wikimedia.org/T132416
[07:14:51] PROBLEM - puppet last run on mw1201 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:16:02] !log Deploy alter table enwiki.revision db2069 - T132416
[07:16:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:21:41] RECOVERY - puppet last run on ms-be1017 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures
[07:28:51] RECOVERY - puppet last run on mw1184 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures
[07:42:51] RECOVERY - puppet last run on mw1201 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[07:59:14] !log Run pt-table-checksum on s2 (nlwiki) on logging table - T154485
[07:59:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:59:19] T154485: run pt-table-checksum before decommissioning db1015, db1035,db1044,db1038 - https://phabricator.wikimedia.org/T154485
[08:15:23] (CR) Dereckson: [C: -1] "Did you run `optipng -o7` on the file?" [mediawiki-config] - https://gerrit.wikimedia.org/r/339326 (https://phabricator.wikimedia.org/T158815) (owner: DatGuy)
[08:16:25] (PS2) Giuseppe Lavagetto: Only output "changed" values if actually changed [software/conftool] - https://gerrit.wikimedia.org/r/338985
[08:16:30] (CR) Giuseppe Lavagetto: [V: 2 C: 2] Only output "changed" values if actually changed [software/conftool] - https://gerrit.wikimedia.org/r/338985 (owner: Giuseppe Lavagetto)
[08:17:11] (CR) Giuseppe Lavagetto: "will now test and build the package" [software/conftool] - https://gerrit.wikimedia.org/r/339195 (owner: Giuseppe Lavagetto)
[08:23:39] subbu: RE: no, I didn't
[08:24:08] miss it, just I didn't had any time to look at it
[08:24:59] I might in the next days, unless marko or someone else can look at it before ;)
[08:37:21] PROBLEM - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 0 below the confidence bounds
[08:38:26] (CR) Muehlenhoff: [C: -1] "Sure, quiet silences that specific (redundant) output, but the problem is that is also silences output that we're actually interested in (" [puppet] - https://gerrit.wikimedia.org/r/338980 (https://phabricator.wikimedia.org/T158649) (owner: Hashar)
[08:43:33] (PS5) Filippo Giunchedi: uwsgi: parametrize service settings [puppet] - https://gerrit.wikimedia.org/r/338804
[08:45:38] (CR) Filippo Giunchedi: [V: 2 C: 2] uwsgi: parametrize service settings [puppet] - https://gerrit.wikimedia.org/r/338804 (owner: Filippo Giunchedi)
[08:47:39] (PS5) Filippo Giunchedi: coal: disable uwsgi autoload [puppet] - https://gerrit.wikimedia.org/r/338805
[08:48:41] (CR) Filippo Giunchedi: [V: 2 C: 2] coal: disable uwsgi autoload [puppet] - https://gerrit.wikimedia.org/r/338805 (owner: Filippo Giunchedi)
[08:49:21] RECOVERY - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is OK: OK: No anomaly detected
[08:51:04] !log Run pt-table-checksum on s2 (nlwiki) on revision table - T154485
[08:51:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:51:10] T154485: run pt-table-checksum before decommissioning db1015, db1035,db1044,db1038 - https://phabricator.wikimedia.org/T154485
[08:54:38] !log Stop pt-table-checksum on nlwiki.revision - T154485
[08:54:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:54:56] db1036 and db2035 - me
[08:57:56] (PS1) Marostegui: db-eqiad.php: Depool db1063 [mediawiki-config] - https://gerrit.wikimedia.org/r/339364
[08:58:27] (CR) DCausse: [C: 1] elasticsearch: force the creation of the plugins directory symlink [puppet] - https://gerrit.wikimedia.org/r/338781 (owner: Gehel)
[08:58:43] (PS2) Marostegui: db-eqiad.php: Depool db1036 [mediawiki-config] - https://gerrit.wikimedia.org/r/339364
[09:05:58] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1036 [mediawiki-config] - https://gerrit.wikimedia.org/r/339364 (owner: Marostegui)
[09:06:24] <_joe_> !log uploaded conftool 0.4.0 to jessie-wikimedia experimental
[09:06:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:08:18] (Merged) jenkins-bot: db-eqiad.php: Depool db1036 [mediawiki-config] - https://gerrit.wikimedia.org/r/339364 (owner: Marostegui)
[09:08:21] Operations, Performance-Team: Consolidate performance website and related software - https://phabricator.wikimedia.org/T158837#3049036 (fgiunchedi)
[09:08:26] (CR) jenkins-bot: db-eqiad.php: Depool db1036 [mediawiki-config] - https://gerrit.wikimedia.org/r/339364 (owner: Marostegui)
[09:09:27] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1036 - T154485 (duration: 00m 40s)
[09:09:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:09:32] T154485: run pt-table-checksum before decommissioning db1015, db1035,db1044,db1038 - https://phabricator.wikimedia.org/T154485
[09:10:25] Operations, Patch-For-Review: Upgrade fluorine to trusty/jessie - https://phabricator.wikimedia.org/T123728#3049049 (fgiunchedi) @Krinkle coal is fixed (the web part wasn't working but data collection was) and it broke as part of moving graphite1001 to jessie. No idea about xhgui (on tungsten) and its mo...
[09:14:41] PROBLEM - puppet last run on ms-be1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:15:50] (CR) Hashar: "Thanks. And I guess I will add some more rubocop ignores in other patches :-}" [puppet] - https://gerrit.wikimedia.org/r/339175 (owner: Hashar)
[09:19:34] (PS1) Marostegui: db-eqiad.php: Depool db1036 from logpager [mediawiki-config] - https://gerrit.wikimedia.org/r/339366
[09:21:19] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1036 from logpager [mediawiki-config] - https://gerrit.wikimedia.org/r/339366 (owner: Marostegui)
[09:22:48] (Merged) jenkins-bot: db-eqiad.php: Depool db1036 from logpager [mediawiki-config] - https://gerrit.wikimedia.org/r/339366 (owner: Marostegui)
[09:23:04] (CR) jenkins-bot: db-eqiad.php: Depool db1036 from logpager [mediawiki-config] - https://gerrit.wikimedia.org/r/339366 (owner: Marostegui)
[09:23:36] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1036 - T154485 (duration: 00m 40s)
[09:23:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:23:41] T154485: run pt-table-checksum before decommissioning db1015, db1035,db1044,db1038 - https://phabricator.wikimedia.org/T154485
[09:25:30] (PS1) Filippo Giunchedi: wmnet: add udplog.codfw CNAME [dns] - https://gerrit.wikimedia.org/r/339367 (https://phabricator.wikimedia.org/T123728)
[09:26:55] Operations, Patch-For-Review: Upgrade fluorine to trusty/jessie - https://phabricator.wikimedia.org/T123728#3049064 (fgiunchedi)
[09:30:28] (PS2) Filippo Giunchedi: wmnet: add udplog.codfw CNAME [dns] - https://gerrit.wikimedia.org/r/339367 (https://phabricator.wikimedia.org/T123728)
[09:33:29] (PS3) Filippo Giunchedi: wmnet: add udplog.codfw CNAME [dns] - https://gerrit.wikimedia.org/r/339367 (https://phabricator.wikimedia.org/T123728)
[09:34:48] (CR) Filippo Giunchedi: [C: 2] wmnet: add udplog.codfw CNAME [dns] - https://gerrit.wikimedia.org/r/339367 (https://phabricator.wikimedia.org/T123728) (owner: Filippo Giunchedi)
[09:36:27] dcausse: mwlog1001 should have memcache-keys and apache2/hhvm logs now btw!
[09:36:29] (PS1) Muehlenhoff: Disable unpriviled access to perf subsystem on Ubuntu hosts [puppet] - https://gerrit.wikimedia.org/r/339369
[09:36:41] godog: thanks!
[09:36:51] I noticed yesterday apache/hhvm logs from codfw wouldn't go there, should be fixed in an hour or so
[09:37:06] dcausse: thank you for checking!
[09:37:23] yw!
[09:39:18] !log increase cassandra system_auth replication from 6 to 12 on AQS
[09:39:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:40:49] Operations: Debian repository supporting multiple package versions - https://phabricator.wikimedia.org/T115758#1732058 (hashar) For both Nodepool and Zuul, I need packages versions that are not available in our apt.wikimedia.org. For Nodepool that blocks further upgrades and for Zuul I went with a hack to g...
[09:42:21] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: /{domain}/v1/page/most-read/{yyyy}/{mm}/{dd} (retrieve the most-read articles for January 1, 2016) is CRITICAL: Test retrieve the most-read articles for January 1, 2016 returned the unexpected status 500 (expecting: 200): /{domain}/v1/page/most-read/{yyyy}/{mm}/{dd} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRITICAL: Test retriev
[09:42:21] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: /{domain}/v1/page/most-read/{yyyy}/{mm}/{dd} (retrieve the most-read articles for January 1, 2016) is CRITICAL: Test retrieve the most-read articles for January 1, 2016 returned the unexpected status 500 (expecting: 200): /{domain}/v1/page/most-read/{yyyy}/{mm}/{dd} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRITICAL: Test retriev
[09:42:21] PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: /{domain}/v1/page/most-read/{yyyy}/{mm}/{dd} (retrieve the most-read articles for January 1, 2016) is CRITICAL: Test retrieve the most-read articles for January 1, 2016 returned the unexpected status 500 (expecting: 200): /{domain}/v1/page/most-read/{yyyy}/{mm}/{dd} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRITICAL: Test retriev
[09:42:21] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/most-read/{yyyy}/{mm}/{dd} (retrieve the most-read articles for January 1, 2016) is CRITICAL: Test retrieve the most-read articles for January 1, 2016 returned the unexpected status 500 (expecting: 200): /{domain}/v1/page/most-read/{yyyy}/{mm}/{dd} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRITICAL:
[09:42:21] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/page/most-read/{yyyy}/{mm}/{dd} (retrieve the most-read articles for January 1, 2016) is CRITICAL: Test retrieve the most-read articles for January 1, 2016 returned the unexpected status 500 (expecting: 200): /{domain}/v1/page/most-read/{yyyy}/{mm}/{dd} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRITICAL: Test retriev
[09:42:21] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: /{domain}/v1/page/most-read/{yyyy}/{mm}/{dd} (retrieve the most-read articles for January 1, 2016) is CRITICAL: Test retrieve the most-read articles for January 1, 2016 returned the unexpected status 500 (expecting: 200): /{domain}/v1/page/most-read/{yyyy}/{mm}/{dd} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRITICAL: Test retriev
[09:42:21] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: /{domain}/v1/page/most-read/{yyyy}/{mm}/{dd} (retrieve the most-read articles for January 1, 2016) is CRITICAL: Test retrieve the most-read articles for January 1, 2016 returned the unexpected status 500 (expecting: 200): /{domain}/v1/page/most-read/{yyyy}/{mm}/{dd} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRITICAL: Test retriev
[09:42:22] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: /{domain}/v1/page/most-read/{yyyy}/{mm}/{dd} (retrieve the most-read articles for January 1, 2016) is CRITICAL: Test retrieve the most-read articles for January 1, 2016 returned the unexpected status 500 (expecting: 200): /{domain}/v1/page/most-read/{yyyy}/{mm}/{dd} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRITICAL: Test retriev
[09:42:22] PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: /{domain}/v1/page/most-read/{yyyy}/{mm}/{dd} (retrieve the most-read articles for January 1, 2016) is CRITICAL: Test retrieve the most-read articles for January 1, 2016 returned the unexpected status 500 (expecting: 200): /{domain}/v1/page/most-read/{yyyy}/{mm}/{dd} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRITICAL: Test retriev
[09:42:23] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/most-read/{yyyy}/{mm}/{dd} (retrieve the most-read articles for January 1, 2016) is CRITICAL: Test retrieve the most-read articles for January 1, 2016 returned the unexpected status 500 (expecting: 200): /{domain}/v1/page/most-read/{yyyy}/{mm}/{dd} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRITICAL:
[09:42:41] PROBLEM - aqs endpoints health on aqs1007 is CRITICAL: /pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) is CRITICAL: Test Get per article page views returned the unexpected status 500 (expecting: 200)
[09:42:41] PROBLEM - aqs endpoints health on aqs1008 is CRITICAL: /pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) is CRITICAL: Test Get per article page views returned the unexpected status 500 (expecting: 200)
[09:42:51] PROBLEM - aqs endpoints health on aqs1006 is CRITICAL: /pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) is CRITICAL: Test Get per article page views returned the unexpected status 500 (expecting: 200)
[09:42:52] PROBLEM - aqs endpoints health on aqs1004 is CRITICAL: /pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) is CRITICAL: Test Get per article page views returned the unexpected status 500 (expecting: 200)
[09:42:52] PROBLEM - aqs endpoints health on aqs1009 is CRITICAL: /pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) is CRITICAL: Test Get per article page views returned the unexpected status 500 (expecting: 200)
[09:43:11] PROBLEM - aqs endpoints health on aqs1005 is CRITICAL: /pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) is CRITICAL: Test Get per article page views returned the unexpected status 500 (expecting: 200)
[09:43:17] now this is really weird
[09:43:41] RECOVERY - puppet last run on ms-be1006 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[09:44:11] (PS1) Phuedx: Enable new Minerva header on cawiki and itwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/339370 (https://phabricator.wikimedia.org/T156794)
[09:44:25] Operations: Restructure our internal repositories further - https://phabricator.wikimedia.org/T158583#3041135 (hashar) If we go with subcomponents such as `thirdparty/foo` would it be possible to let non-ops to update packages for such component? Assuming the sub component is solely enabled on machine on wh...
[09:44:51] godog: I might need your help, I just ran a nodetool-a repair system_auth on aqs1004
[09:45:27] and before that, I've increased the replication factor of system_auth to 12
[09:46:16] Operations, ops-codfw: ms-be2002.codfw.wmnet has drac issues - https://phabricator.wikimedia.org/T155689#3049104 (fgiunchedi) a:fgiunchedi>RobH thanks for checking! ms-be hosts can be taken down, one at a time, at any time for brief periods (e.g. one day) via graceful `shutdown` to make sure all...
[09:46:20] joal: -^ :(
[09:46:41] PROBLEM - Text HTTP 5xx reqs/min on graphite2001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[09:46:46] elukey: ok! and cassandra exploded?
[09:47:11] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[09:48:17] godog: yeah, but I can't see what is happening
[09:48:32] the 5xx are pageview api as expected btw
[09:48:42] RECOVERY - Text HTTP 5xx reqs/min on graphite2001 is OK: OK: Less than 1.00% above the threshold [250.0]
[09:49:11] PROBLEM - Esams HTTP 5xx reqs/min on graphite2001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[09:49:25] elukey: ok I'm checking on aqs1004
[09:49:50] going to check the new settings, but I am pretty sure they are good
[09:50:09] command executed: ALTER KEYSPACE "system_auth" WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': '12'};
[09:50:11] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[09:50:34] (PS2) Phuedx: Enable v2 of Minerva's header on cawiki and itwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/339370 (https://phabricator.wikimedia.org/T156794)
[09:50:41] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[09:50:47] elukey: what's the error on aqs side?
[09:51:41] PROBLEM - Text HTTP 5xx reqs/min on graphite2001 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [1000.0] [09:51:56] I can query the pageviews api from tools [09:52:14] so https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/user/Barak_Obama/daily/2017020100/2017020800 doesn't work ( [09:52:44] godog: I only know https://logstash.wikimedia.org/app/kibana#/dashboard/cassandra-aqs [09:53:11] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [1000.0] [09:53:30] I am trying to get the logs *somewhere* but I don't know where :) [09:53:42] I am pretty sure that an outage like this is surely due to auth not working [09:54:17] User aqs has no SELECT permission on or any of its parents [09:54:27] so yeah your guess was right [09:54:45] this is weird [09:55:11] PROBLEM - Esams HTTP 5xx reqs/min on graphite2001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [09:55:21] godog: I only ran nodetool repair on aqs1004-a, maybe I need to complete the work and run it everywhere? [09:55:36] I stopped checking metrics and saw the 500s [09:55:41] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] [09:55:51] I did the same from repl 3 to 6 a while ago and this didn't happen [09:56:46] elukey: yeah I guess might as well finish the repairs everywhere, is it done on aqs1004-a ? [09:56:52] yeah [09:57:19] I don't see 'aqs' from select * from system_auth.roles; [09:57:21] snap [09:57:25] <_joe_> what's up with 503s? [09:57:32] _joe_: aqs [09:57:40] <_joe_> why is it "text" [09:57:41] <_joe_> ? 
[09:57:44] because restbase [09:57:45] <_joe_> grrr [09:57:54] sorry people my bad [09:58:07] <_joe_> sorry, aqs is restbase, tell me it's not behind restbase.wikimedia.org too [09:58:15] <_joe_> or whatever the url would be for ti [09:58:18] <_joe_> *it [09:58:38] _joe_, " so https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/user/Barak_Obama/daily/2017020100/2017020800 doesn't work" [09:58:50] <_joe_> so let me rephrase this [09:59:05] <_joe_> a request for aqs goes through 2 levels of varnish and 2 levels of restbase? [09:59:16] "are we caching twice the same data?" [09:59:26] or 3 or 4? [09:59:27] godog: running the repairs, the select * from system_auth.roles seems inconsistent across nodes [09:59:29] <_joe_> whoever came up with such a brilliant plan, well it's wrong [09:59:48] maybe we can talk about design decisions later [09:59:59] elukey: ok! [10:00:34] jynus: I see restbase has a 3rd level of varnish cache with larger storage area [10:01:05] maybe we could migrate MariaDB behind restbase [10:02:27] hashar, _joe_ comment is that if aqs is text and restbase is restbase, we may be caching the same thing 4 times, 2 in varnish, 2 in cassandra [10:02:49] but again, not the point now [10:04:19] so the problem is that the 'aqs' user, that hyperswitch/restbase on aqs uses on cassandra, is not available. This is of course due to my change, but it is completely unexpected. 
I am completing the nodetool repair commands on each node (sequentially), and then restart from there [10:04:27] the API is completely unavailable [10:05:33] this doesn't make any sense [10:05:39] (the error) [10:06:21] PROBLEM - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 7 below the confidence bounds [10:07:29] there's also a flood in the cassandra logs of repairs [10:08:22] that should be related to nodetool repair [10:09:21] PROBLEM - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 7 below the confidence bounds [10:10:36] godog: super weird that on aqs1004-a/b the aqs user is not there anytmore, even after the repairs [10:10:47] elukey: only on those? [10:11:03] I checked others and it was also missing aqsloader, that hadoop uses [10:11:21] PROBLEM - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 8 below the confidence bounds [10:12:13] elukey: ok, I'm not sure either what happened there, if the repairs are finished I guess we can try adding the user back? [10:12:29] err, the role [10:13:07] yeah, checking also our users settings again to avoid PEBCAKs [10:13:44] I am pretty sure that we added two users, aqs and aqsloader, for restbase and hadoop (to avoid the default cassandra super user0 [10:15:18] godog: ah can see the aqs user on aqs1008-a [10:15:28] weeeeeird [10:15:52] The contint1001 alarm on "Work requests waiting in Zuul Gearman server" is a false positive. Can one please schedule that alarms as under maintenance for a week please? 
[10:15:53] gotta fix it up but I am out of time this week [10:16:22] should be at https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=contint1001 [10:16:34] the repairs are taking ~3mins each, and I need to run other 7 of them [10:16:48] can't really do it in parallel [10:17:10] elukey: to recreate the user or finish the repairs? [10:17:59] finish the repairs [10:18:13] elukey: ok [10:18:21] RECOVERY - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is OK: OK: No anomaly detected [10:18:27] I think that they are messing up the roles somehow, meanwhile they should only replicate [10:19:51] godog: I have a theory though, that might explain this mess [10:20:14] I checked select * from system_auth.roles; on aqs100[789]-a/b and they have all the users [10:20:31] (or at least, spot checking) [10:21:07] what if the previous replication, 6, was somehow only on these nodes and triggering repairs from aqs1004 caused a mess? [10:21:58] If I am right, doing nodetool repair on aqs1007-a should bring all the users back in place [10:22:45] could be, I'm saving the cassandra logs in the meantime in case they get rotated [10:28:39] not really, didn't work [10:28:45] the users are not consistent among nodes [10:30:53] elukey: ok, I guess we can try adding the user back? 
I don't see a lot of other choices [10:34:20] godog: I have the last four repairs to do, so ~10 mins, then I [10:34:24] I'll re-add it [10:39:47] (03CR) 10Filippo Giunchedi: [C: 031] Disable unpriviled access to perf subsystem on Ubuntu hosts [puppet] - 10https://gerrit.wikimedia.org/r/339369 (owner: 10Muehlenhoff) [10:46:09] godog: repairs done, checking the state of roles [10:47:05] (03PS1) 10Addshore: Enable TwoColConflict on hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339382 (https://phabricator.wikimedia.org/T158832) [10:47:24] same state [10:48:43] godog: CREATE USER IF NOT EXISTS aqs WITH PASSWORD 'thepasszorz' NOSUPERUSER; - on aqs1004 and then see how it goes [10:48:46] ? [10:50:13] elukey: yeah, also the grants probably [10:50:36] there was a script now that I think about it [10:51:10] check /etc/cassandra-a/adduser.cql if that's the right password [10:52:01] yeah [10:52:04] executing it [10:53:34] done [10:54:36] ok the user seems to appear everywhere [10:56:18] elukey: maybe try bouncing aqs on aqs1004 see if that fixes it now? I still see the errors [10:56:59] done, but I don't see them on logstash anymore [10:57:12] oh no some trickle is still there [10:57:39] 503s are coming down [10:57:58] (03CR) 10Amire80: [C: 031] Enable TwoColConflict on hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339382 (https://phabricator.wikimedia.org/T158832) (owner: 10Addshore) [10:58:11] elukey: confirmed, nice! 
[10:59:31] PROBLEM - AQS root url on aqs1004 is CRITICAL: connect to address 10.64.0.107 and port 7232: Connection refused [11:00:01] (03PS2) 10Marostegui: linux-host-entries: Remove trusty from dbstore1001 [puppet] - 10https://gerrit.wikimedia.org/r/339351 (https://phabricator.wikimedia.org/T153768) [11:00:21] PROBLEM - AQS root url on aqs1005 is CRITICAL: connect to address 10.64.32.138 and port 7232: Connection refused [11:00:31] PROBLEM - AQS root url on aqs1006 is CRITICAL: connect to address 10.64.48.146 and port 7232: Connection refused [11:01:05] what a lovely morning [11:01:21] RECOVERY - AQS root url on aqs1005 is OK: HTTP OK: HTTP/1.1 200 - 727 bytes in 0.005 second response time [11:01:31] RECOVERY - AQS root url on aqs1004 is OK: HTTP OK: HTTP/1.1 200 - 727 bytes in 0.007 second response time [11:01:31] RECOVERY - AQS root url on aqs1006 is OK: HTTP OK: HTTP/1.1 200 - 727 bytes in 0.013 second response time [11:02:20] https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/user/Barak_Obama/daily/2017020100/2017020800 works now [11:03:21] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy [11:03:21] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy [11:03:21] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [11:03:21] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [11:03:21] RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy [11:03:21] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy [11:03:42] RECOVERY - aqs endpoints health on aqs1007 is OK: All endpoints are healthy [11:03:42] RECOVERY - aqs endpoints health on aqs1008 is OK: All endpoints are healthy [11:03:51] RECOVERY - aqs endpoints health on aqs1006 is OK: All endpoints are healthy [11:03:51] RECOVERY - aqs endpoints health on aqs1004 is OK: 
All endpoints are healthy [11:03:51] RECOVERY - aqs endpoints health on aqs1009 is OK: All endpoints are healthy [11:03:57] yesssss [11:04:01] PROBLEM - Start a job and verify on Precise on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/precise - 185 bytes in 0.261 second response time [11:04:11] RECOVERY - aqs endpoints health on aqs1005 is OK: All endpoints are healthy [11:04:21] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy [11:04:21] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy [11:04:21] RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy [11:04:21] RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy [11:04:28] * elukey dances [11:05:03] heheh confirmed indeed [11:05:26] godog: going to add the aqsloader user too, that it is still missing [11:05:28] on some nodes [11:07:11] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [11:08:49] <_joe_> elukey: why is mobileapps linked to aqs? [11:08:51] <_joe_> is it? [11:11:57] _joe_ I wasn't aware of it, but I guess something related to "retrieve the most-read articles for January 1, 2016)" [11:12:10] <_joe_> elukey: sigh [11:12:17] <_joe_> SOA SOA [11:12:28] <_joe_> *exactly* what I said would happen [11:13:15] _joe_, elukey: Mobile apps use top endpoints yes [11:13:40] <_joe_> and it has no timeout? 
[11:13:51] <_joe_> for the backend request, I mean [11:14:11] RECOVERY - Esams HTTP 5xx reqs/min on graphite2001 is OK: OK: Less than 1.00% above the threshold [250.0] [11:14:11] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [11:14:34] _joe_: not sure I underdstand what you mean [11:14:41] RECOVERY - Text HTTP 5xx reqs/min on graphite2001 is OK: OK: Less than 1.00% above the threshold [250.0] [11:14:41] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [11:14:48] <_joe_> joal: mobileapps calls aqs [11:15:03] <_joe_> and I hope it has a (short) timeout if aqs doesn't respond [11:15:26] _joe_: I have no idea on how mobileapps requests aqs :S [11:15:49] <_joe_> yeah I was asking to everyone, not you specifically [11:16:03] sure _joe_, was still saying I don't know :) [11:16:17] * joal will stop making noise [11:19:02] RECOVERY - Start a job and verify on Precise on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 2.796 second response time [11:28:34] (03CR) 10Aklapper: [C: 04-1] "The values checked for in the oldValue and newValue columns are literally stored with surrounding quotation marks in the DB so you'll need" [puppet] - 10https://gerrit.wikimedia.org/r/317990 (owner: 10Alex Monk) [11:40:46] 06Operations, 06Discovery, 10Wikimedia-Portals, 03Discovery-Portal-Sprint, and 2 others: https://www.wikipedia.org/ portal doesn't have any text - https://phabricator.wikimedia.org/T158782#3049253 (10Aklapper) [11:55:21] PROBLEM - YARN NodeManager Node-State on analytics1039 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[11:56:21] RECOVERY - YARN NodeManager Node-State on analytics1039 is OK: OK: YARN NodeManager analytics1039.eqiad.wmnet:8041 Node-State: RUNNING [12:01:33] (03CR) 10Bmansurov: [C: 031] Enable v2 of Minerva's header on cawiki and itwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339370 (https://phabricator.wikimedia.org/T156794) (owner: 10Phuedx) [12:02:20] (03CR) 10Marostegui: [C: 032] linux-host-entries: Remove trusty from dbstore1001 [puppet] - 10https://gerrit.wikimedia.org/r/339351 (https://phabricator.wikimedia.org/T153768) (owner: 10Marostegui) [12:03:15] (03CR) 10Muehlenhoff: [C: 032] Disable unpriviled access to perf subsystem on Ubuntu hosts [puppet] - 10https://gerrit.wikimedia.org/r/339369 (owner: 10Muehlenhoff) [12:03:22] (03PS2) 10Muehlenhoff: Disable unpriviled access to perf subsystem on Ubuntu hosts [puppet] - 10https://gerrit.wikimedia.org/r/339369 [12:07:31] PROBLEM - YARN NodeManager Node-State on analytics1039 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[12:08:21] RECOVERY - YARN NodeManager Node-State on analytics1039 is OK: OK: YARN NodeManager analytics1039.eqiad.wmnet:8041 Node-State: RUNNING [12:08:38] (03CR) 10Muehlenhoff: [V: 032 C: 032] Disable unpriviled access to perf subsystem on Ubuntu hosts [puppet] - 10https://gerrit.wikimedia.org/r/339369 (owner: 10Muehlenhoff) [12:35:38] !log installing tomcat updates [12:35:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:47] !log installing libssh security updates (jessie already fixed) [12:40:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:56] !log installing libssh security updates on trusty (jessie already fixed) [12:41:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:06] !log cache_maps: upgrading to varnish 4.1.5 [13:03:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:35] (03PS2) 10DatGuy: Update logo for bswiki (Bosnian Wikipedia) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339326 (https://phabricator.wikimedia.org/T158815) [13:06:58] (03CR) 10DatGuy: [C: 031] Enable TwoColConflict on hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339382 (https://phabricator.wikimedia.org/T158832) (owner: 10Addshore) [13:09:20] (03CR) 10DatGuy: "Yes, it has been run. Don't know why it doesn't show the addition of bswiki-1.5x and bswiki-2x in the diff." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339326 (https://phabricator.wikimedia.org/T158815) (owner: 10DatGuy) [13:09:37] can anyone explain ^? [13:09:56] I've added two files but it doesn't show in the diffs [13:17:51] PROBLEM - puppet last run on labcontrol1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:19:42] (03CR) 10Elukey: "Hi! 
First of all thanks a lot for the code review, I am really sorry that nobody answered for this long time but we didn't have notificati" [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/317825 (owner: 10R4q3NWnUx2CEhVyr) [13:20:29] volans: --^ [13:29:37] (03PS6) 10Elukey: Add JMX port 9986 to the MapReduce History process [puppet/cdh] - 10https://gerrit.wikimedia.org/r/334667 (https://phabricator.wikimedia.org/T156272) [13:29:52] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure: labsdb1006/1007 (postgresql) maintenance - https://phabricator.wikimedia.org/T157359#3049509 (10chasemp) I think @akosiaris is the only person with a lot of context for this setup. If I understand correctly the users are https://wikitech.wikimedia.org/wik... [13:41:31] 06Operations, 10ops-eqiad, 06Discovery, 06Discovery-Search, and 2 others: rack/setup/install elastic1048-1052 - https://phabricator.wikimedia.org/T155790#3049535 (10Gehel) 05Open>03Resolved [13:43:59] addshore: want to do swat today? since one of the two commits is yours [13:44:54] 06Operations, 06Discovery, 06Maps, 07Epic, 03Interactive-Sprint: Investigate how Kartotherian metrics are published and what they mean - https://phabricator.wikimedia.org/T149889#3049549 (10Gehel) [13:45:51] RECOVERY - puppet last run on labcontrol1001 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [13:48:00] (03PS1) 10Muehlenhoff: Update to 4.4.50 [debs/linux44] - 10https://gerrit.wikimedia.org/r/339405 [13:52:17] !log restart logstash on relforge1001 to test logging configuration - T158664 [13:52:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:25] T158664: elasticsearch logs are duplicated in journald - https://phabricator.wikimedia.org/T158664 [13:54:12] 06Operations, 10ops-codfw, 10DBA, 10hardware-requests: Decom db2001-db2009 - https://phabricator.wikimedia.org/T125827#3049557 (10mark) Approved. 
[13:56:14] (03PS4) 10Gehel: elasticsearch: don't send logs to the console [puppet] - 10https://gerrit.wikimedia.org/r/338998 (https://phabricator.wikimedia.org/T158664) [13:57:01] 06Operations, 10ops-codfw, 10DBA, 10hardware-requests: Decom db2001-db2009 - https://phabricator.wikimedia.org/T125827#3049564 (10jcrespo) a:05mark>03RobH [13:57:38] (03CR) 10Gehel: [C: 032] elasticsearch: don't send logs to the console [puppet] - 10https://gerrit.wikimedia.org/r/338998 (https://phabricator.wikimedia.org/T158664) (owner: 10Gehel) [13:58:28] 06Operations, 06Discovery, 06Discovery-Search, 10Elasticsearch, 13Patch-For-Review: elasticsearch logs are duplicated in journald - https://phabricator.wikimedia.org/T158664#3049566 (10Gehel) Change is deployed but will only be active after the next cluster restart. [13:58:47] 06Operations, 06Discovery, 10Elasticsearch, 06Discovery-Search (Current work), 13Patch-For-Review: elasticsearch logs are duplicated in journald - https://phabricator.wikimedia.org/T158664#3049567 (10Gehel) [14:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170223T1400). [14:00:04] kart_, addshore, and phuedx: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [14:00:16] * kart_ is here [14:00:57] * phuedx is here too [14:00:57] o/ [14:01:05] o/ zeljkof I can do! 
[14:01:09] 06Operations, 06Discovery, 06Maps, 07Epic, 03Interactive-Sprint: Investigate how Kartotherian metrics are published and what they mean - https://phabricator.wikimedia.org/T149889#3049574 (10Gehel) [14:01:12] 06Operations, 06Discovery, 06Maps, 03Interactive-Sprint: delete unused kartotherian marker metrics - https://phabricator.wikimedia.org/T150353#3049573 (10Gehel) 05Open>03Resolved [14:01:19] addshore: great :) [14:01:20] robh: can the status in the topic be updated cause as far as I can see the "gerrit issue" is resolved [14:01:33] *logs into the places* [14:02:38] kart_: going with https://gerrit.wikimedia.org/r/#/c/339380/ first [14:03:28] addshore: okay! [14:05:04] (03PS2) 10Addshore: Enable TwoColConflict on hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339382 (https://phabricator.wikimedia.org/T158832) [14:09:35] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure: labsdb1006/1007 (postgresql) maintenance - https://phabricator.wikimedia.org/T157359#3049588 (10dschwen) Yes, it is used by me! I'm pulling data from that server for the client-side rendered tiles and 3D buildings in WikiMiniAtlas. [14:09:37] *twiddles thumbs waiting for jenkins* [14:10:41] addshore: mediawiki-config CI takes a bit... 
[14:10:49] its jessie's fault [14:11:06] Zppix: im waiting for mediawiki/extensions/ContentTranslation ;) [14:11:15] oh :P [14:11:26] i'm looking at the patch you put into ps2 [14:11:30] C+2 V+2 SWAT ;) :D [14:11:32] :P [14:11:45] * Zppix slowly backs away [14:12:02] Zppix: I'm just preparing that one for later ;) [14:12:52] (03CR) 10Muehlenhoff: [C: 032] Update to 4.4.50 [debs/linux44] - 10https://gerrit.wikimedia.org/r/339405 (owner: 10Muehlenhoff) [14:13:37] addshore: anything that runs on jessie and sometimes trusty seems to take a bit, i think those instances are running on slower CPUs [14:14:12] 55% of the phpunit tetss [14:14:14] meh [14:14:19] (03CR) 10Addshore: [C: 032] Enable v2 of Minerva's header on cawiki and itwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339370 (https://phabricator.wikimedia.org/T156794) (owner: 10Phuedx) [14:14:22] (03PS3) 10Filippo Giunchedi: lvs: add swift https service [puppet] - 10https://gerrit.wikimedia.org/r/339197 (https://phabricator.wikimedia.org/T127455) [14:14:26] phuedx: ^^ going for yours first... [14:14:49] addshore: 👍 [14:15:00] (03CR) 10Filippo Giunchedi: [V: 032 C: 032] lvs: add swift https service [puppet] - 10https://gerrit.wikimedia.org/r/339197 (https://phabricator.wikimedia.org/T127455) (owner: 10Filippo Giunchedi) [14:15:08] addshore: its succeding so far... its still gotta run the major tests though. :/ weird, ill look into why it tends to be so slow. [14:15:19] at this rate one of the patches might be merged by half past ;) [14:16:24] addshore: i'd start running tests on all the swat patches (atleast using recheck or something) maybe then you want need to wait so long xD [14:16:54] hmmh, still, gate-submit might have different jobs configured, so cant rely on just a recheck [14:17:24] good point didnt think about that. 
[14:18:04] (03PS1) 10DCausse: [cirrus] cleanup old A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339408 [14:18:05] (03PS1) 10DCausse: [cirrus] Add $wgCirrusSearchElasticQuirks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339409 [14:18:14] (03Merged) 10jenkins-bot: Enable v2 of Minerva's header on cawiki and itwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339370 (https://phabricator.wikimedia.org/T156794) (owner: 10Phuedx) [14:18:21] !log upgrading grafana to 4.1 on krypton [14:18:23] (03CR) 10jenkins-bot: Enable v2 of Minerva's header on cawiki and itwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339370 (https://phabricator.wikimedia.org/T156794) (owner: 10Phuedx) [14:18:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:47] addshore: there you go jenkins finnaly woke up on one of the patches [14:18:52] (03PS1) 10Filippo Giunchedi: conftool-data: add nginx service to swift [puppet] - 10https://gerrit.wikimedia.org/r/339410 (https://phabricator.wikimedia.org/T127455) [14:19:06] phuedx: your patch is live on mwdebug1002, please check [14:19:17] sure thing [14:19:26] 06Operations, 10hardware-requests: Replace bast3001 - https://phabricator.wikimedia.org/T156506#3049602 (10faidon) Any news about this? I see @Dzahn you claimed that already :) [14:20:51] elukey: thanks! [14:21:23] kart_: your patch is also live on mwdebug1002! [14:21:33] addshore: cool. checking. [14:22:05] 06Operations: Restructure our internal repositories further - https://phabricator.wikimedia.org/T158583#3049607 (10Eevans) >>! In T158583#3049099, @hashar wrote: > If we go with subcomponents such as `thirdparty/foo` would it be possible to let non-ops to update packages for such component? Assuming the sub com... 
[14:22:50] (03PS2) 10Filippo Giunchedi: conftool-data: add nginx service to swift [puppet] - 10https://gerrit.wikimedia.org/r/339410 (https://phabricator.wikimedia.org/T127455) [14:24:21] PROBLEM - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is CRITICAL: CRITICAL: Anomaly detected: 13 data above and 0 below the confidence bounds [14:24:40] addshore: lgtm [14:24:45] phuedx: ack [14:25:06] as in tcp ack or "ack something's terribly wrong what have you done?" [14:25:11] ;) [14:25:51] PROBLEM - puppet last run on einsteinium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:26:07] phuedx: number 1 ;) [14:26:35] addshore: go ahead. [14:26:36] !log addshore@tin Synchronized wmf-config/InitialiseSettings.php: SWAT T156794 [[gerrit:339370|Enable v2 of Minerva's header on cawiki and itwiki]] (duration: 00m 42s) [14:26:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:40] T156794: Deploy new header to Catalan and Italian wikipedia mobile website - https://phabricator.wikimedia.org/T156794 [14:26:45] phuedx: yours is everywhere [14:26:50] kart_: ack, doing yours now! [14:27:18] addshore: correction: its on cawiki and itwiki [14:27:43] ^ [14:27:49] but i knew what he meant :) [14:27:52] thanks addshore! [14:27:53] Zppix: indeed :P [14:28:09] phuedx: well if its everywhere then something didnt go right :P [14:28:16] hehe [14:28:34] OH NO HE MEANT THE OTHER KIND OF ACK!!1 [14:29:14] PROBLEM - ITWIKI AND CAWIKI are down (Reason: phuedx) [14:29:35] ... [14:29:37] Zppix: are you logging this? ;) [14:29:48] maybe.... 
[14:29:58] !log addshore@tin Synchronized php-1.29.0-wmf.13/extensions/ContentTranslation/ContentTranslation.hooks.php: SWAT T158297 [[gerrit:339380|Really disable europeana2802016 campaign]] (duration: 00m 39s) [14:30:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:03] i like that the reason is me too [14:30:03] T158297: Disable europeana2802016 campaign - https://phabricator.wikimedia.org/T158297 [14:30:04] Zppix: .... poor choice of words for this channel.. [14:30:13] kart_: done! please check! [14:30:13] not "there's a bug" [14:30:21] it's just "things are broken because phuedx" [14:30:23] addshore: thanks [14:30:24] :P [14:30:32] Zppix: no kidding though, i just had a heart attack [14:30:44] addshore: meh this channel too serious operations needs a good living up every now and then eh [14:31:00] srsly, Zppix please don't do that especially when deploys are on [14:31:02] Zppix: you did also just give me a small heart attack too.... [14:31:37] ack [14:31:39] :/ [14:31:43] (03CR) 10Addshore: [C: 032] Enable TwoColConflict on hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339382 (https://phabricator.wikimedia.org/T158832) (owner: 10Addshore) [14:31:49] RECOVERY - puppet last run on einsteinium is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [14:31:56] :( Zppix [14:32:04] it was funny after the heart attack though! :D [14:32:10] :/ [14:32:32] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure: labsdb1006/1007 (postgresql) maintenance - https://phabricator.wikimedia.org/T157359#3049645 (10chasemp) >>! In T157359#3049588, @dschwen wrote: > Yes, it is used by me! I'm pulling data from that server for the client-side rendered tiles and 3D buildings... [14:33:06] phuedx: i try... [14:33:19] kart_: looks like those warnings have gone away! 
:) [14:33:36] :) [14:34:27] i like how we have different versions of disabling disable and "really" disable [14:34:49] PROBLEM - puppet last run on db1065 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:37:52] (03Merged) 10jenkins-bot: Enable TwoColConflict on hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339382 (https://phabricator.wikimedia.org/T158832) (owner: 10Addshore) [14:38:01] (03CR) 10jenkins-bot: Enable TwoColConflict on hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339382 (https://phabricator.wikimedia.org/T158832) (owner: 10Addshore) [14:38:29] PROBLEM - Host mw2256 is DOWN: PING CRITICAL - Packet loss = 100% [14:38:59] RECOVERY - Host mw2256 is UP: PING OK - Packet loss = 0%, RTA = 36.12 ms [14:39:38] !log addshore@tin Synchronized wmf-config/InitialiseSettings.php: SWAT T158832 [[gerrit:339382|nable TwoColConflict on hewiki]] (duration: 00m 40s) [14:39:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:43] T158832: Please enable the TwoColConflict in the Hebrew Wikipedia - https://phabricator.wikimedia.org/T158832 [14:39:49] !log EU SWAT done [14:39:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:19] did mw2256 just restart, it went down and back up really quick? 
if not then is there some sort of maintenance going on [14:42:31] !log addshore@tin scap clean 1.29.0-wmf.6 && scap clean 1.29.0-wmf.7 (to remove warning on scap pull on mwdebug1002, T157030) [14:42:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:36] T157030: cannot delete non-empty directory: php-1.29.0-wmf.3 messages on 'scap sync' on mwdebug1002 - https://phabricator.wikimedia.org/T157030 [14:43:14] (03PS3) 10Tim Landscheidt: Tools: Fully qualify hostnames [puppet] - 10https://gerrit.wikimedia.org/r/328451 (https://phabricator.wikimedia.org/T153608) [14:45:49] Hey - can I provide a new public key for the production servers? [14:46:30] (03PS1) 10Filippo Giunchedi: hieradata: use 'uri' for swift icinga configuration [puppet] - 10https://gerrit.wikimedia.org/r/339413 (https://phabricator.wikimedia.org/T127455) [14:47:15] (03CR) 10Filippo Giunchedi: [C: 032] conftool-data: add nginx service to swift [puppet] - 10https://gerrit.wikimedia.org/r/339410 (https://phabricator.wikimedia.org/T127455) (owner: 10Filippo Giunchedi) [14:50:19] RECOVERY - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is OK: OK: No anomaly detected [14:53:26] (03CR) 10Elukey: "Tested in labs, it work fine!" 
[puppet/cdh] - 10https://gerrit.wikimedia.org/r/334667 (https://phabricator.wikimedia.org/T156272) (owner: 10Elukey) [14:53:51] (03CR) 10Elukey: [C: 032] Add JMX port 9986 to the MapReduce History process [puppet/cdh] - 10https://gerrit.wikimedia.org/r/334667 (https://phabricator.wikimedia.org/T156272) (owner: 10Elukey) [14:55:18] (03CR) 10Filippo Giunchedi: [C: 032] hieradata: use 'uri' for swift icinga configuration [puppet] - 10https://gerrit.wikimedia.org/r/339413 (https://phabricator.wikimedia.org/T127455) (owner: 10Filippo Giunchedi) [14:55:24] (03PS2) 10Filippo Giunchedi: hieradata: use 'uri' for swift icinga configuration [puppet] - 10https://gerrit.wikimedia.org/r/339413 (https://phabricator.wikimedia.org/T127455) [14:56:25] (03CR) 10Filippo Giunchedi: [V: 032 C: 032] hieradata: use 'uri' for swift icinga configuration [puppet] - 10https://gerrit.wikimedia.org/r/339413 (https://phabricator.wikimedia.org/T127455) (owner: 10Filippo Giunchedi) [14:58:58] (03PS1) 10Giuseppe Lavagetto: conftool: add namespace support [puppet] - 10https://gerrit.wikimedia.org/r/339414 [14:59:00] (03PS1) 10Giuseppe Lavagetto: profile::conftool::client: first commit [puppet] - 10https://gerrit.wikimedia.org/r/339415 [14:59:02] (03PS1) 10Giuseppe Lavagetto: role::puppetmaster::frontend: include profile::conftool::master [puppet] - 10https://gerrit.wikimedia.org/r/339416 [14:59:04] (03PS1) 10Giuseppe Lavagetto: role::pybaltest: include profile::conftool::master [puppet] - 10https://gerrit.wikimedia.org/r/339417 [15:00:11] (03PS1) 10Elukey: Update the cdh submodule to the latest sha [puppet] - 10https://gerrit.wikimedia.org/r/339418 (https://phabricator.wikimedia.org/T156272) [15:00:16] (03CR) 10jerkins-bot: [V: 04-1] profile::conftool::client: first commit [puppet] - 10https://gerrit.wikimedia.org/r/339415 (owner: 10Giuseppe Lavagetto) [15:01:01] (03CR) 10jerkins-bot: [V: 04-1] role::puppetmaster::frontend: include profile::conftool::master [puppet] - 
10https://gerrit.wikimedia.org/r/339416 (owner: 10Giuseppe Lavagetto) [15:02:12] (03CR) 10Giuseppe Lavagetto: [C: 032] conftool: add namespace support [puppet] - 10https://gerrit.wikimedia.org/r/339414 (owner: 10Giuseppe Lavagetto) [15:02:49] RECOVERY - puppet last run on db1065 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [15:03:27] (03CR) 10Elukey: [C: 032] "https://puppet-compiler.wmflabs.org/5551/ - LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/339418 (https://phabricator.wikimedia.org/T156272) (owner: 10Elukey) [15:03:35] (03CR) 10Elukey: [V: 032 C: 032] Update the cdh submodule to the latest sha [puppet] - 10https://gerrit.wikimedia.org/r/339418 (https://phabricator.wikimedia.org/T156272) (owner: 10Elukey) [15:03:42] (03PS2) 10Elukey: Update the cdh submodule to the latest sha [puppet] - 10https://gerrit.wikimedia.org/r/339418 (https://phabricator.wikimedia.org/T156272) [15:03:58] (03CR) 10Elukey: [V: 032 C: 032] Update the cdh submodule to the latest sha [puppet] - 10https://gerrit.wikimedia.org/r/339418 (https://phabricator.wikimedia.org/T156272) (owner: 10Elukey) [15:04:19] <_joe_> elukey: please wait before puppet-merging [15:04:41] _joe_ I was about to ping you, I am not in a hurry.. 
I'll leave you do it :) [15:05:34] <_joe_> elukey: please go [15:06:09] (03PS2) 10Gehel: elasticsearch: force the creation of the plugins directory symlink [puppet] - 10https://gerrit.wikimedia.org/r/338781 [15:06:09] all right merging [15:07:35] (03CR) 10Gehel: [C: 032] elasticsearch: force the creation of the plugins directory symlink [puppet] - 10https://gerrit.wikimedia.org/r/338781 (owner: 10Gehel) [15:08:06] !log Power off dbstore1001 to change its disks and reimage - T153768 [15:08:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:11] T153768: Install and reimage dbstore1001 as jessie - https://phabricator.wikimedia.org/T153768 [15:12:42] 06Operations, 07Puppet: Puppet's parser function suffix.rb flapping between two versions - https://phabricator.wikimedia.org/T158860#3049765 (10fgiunchedi) [15:12:58] fun ^ [15:13:31] <_joe_> godog: /o\ [15:13:43] I don't even [15:13:44] <_joe_> let me look into it [15:13:54] <_joe_> in 15/20 mins [15:14:17] take your time, doesn't look like it is having an observable impact [15:15:29] PROBLEM - Host ms-be2002 is DOWN: PING CRITICAL - Packet loss = 100% [15:15:55] <_joe_> godog: 75b7d4d135c2b0cc69c75aeadd45c12b6dc9c2af [15:16:36] <_joe_> that's an older version [15:16:42] <_joe_> and we can also blame mark! [15:17:22] (03PS2) 10Giuseppe Lavagetto: profile::conftool::client: first commit [puppet] - 10https://gerrit.wikimedia.org/r/339415 [15:17:58] _joe_: the one in the ganglia module? I bet [15:18:14] !log roll-restart pybal in codfw to pick up swift https service [15:18:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:23] thdj: https://meta.wikimedia.org/wiki/MediaWiki:Centralnotice-template-Inspire_know what to add, i can do it [15:19:34] (03CR) 10Faidon Liambotis: [C: 031] "LGTM. 
You should also add reverse DNS for neodymium/sarin's IPv6s (in the zonefiles the DNS repo)" [puppet] - 10https://gerrit.wikimedia.org/r/339183 (https://phabricator.wikimedia.org/T158753) (owner: 10Volans) [15:20:25] (03PS2) 10Rush: labstore: 1001 and 1002 are currently idle [puppet] - 10https://gerrit.wikimedia.org/r/338973 [15:20:29] RECOVERY - Host ms-be2002 is UP: PING OK - Packet loss = 0%, RTA = 36.12 ms [15:20:59] PROBLEM - puppet last run on es1016 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:21:16] (03PS3) 10Giuseppe Lavagetto: profile::conftool::client: first commit [puppet] - 10https://gerrit.wikimedia.org/r/339415 [15:21:44] <_joe_> take your time jenkins [15:21:49] <_joe_> I'm not waiting for you [15:21:52] <_joe_> at all [15:23:15] <_joe_> https://cdn.meme.am/cache/instances/folder46/65289046.jpg [15:25:17] <_joe_> but also https://cdn.meme.am/cache/instances/folder607/65458607.jpg [15:25:35] lol [15:25:42] (03CR) 10jerkins-bot: [V: 04-1] profile::conftool::client: first commit [puppet] - 10https://gerrit.wikimedia.org/r/339415 (owner: 10Giuseppe Lavagetto) [15:26:06] <_joe_> ahahahah [15:26:09] <_joe_> of course [15:27:57] (03CR) 10Rush: [C: 032] labstore: 1001 and 1002 are currently idle [puppet] - 10https://gerrit.wikimedia.org/r/338973 (owner: 10Rush) [15:31:46] (03PS4) 10Giuseppe Lavagetto: profile::conftool::client: first commit [puppet] - 10https://gerrit.wikimedia.org/r/339415 [15:33:18] (03CR) 10jerkins-bot: [V: 04-1] profile::conftool::client: first commit [puppet] - 10https://gerrit.wikimedia.org/r/339415 (owner: 10Giuseppe Lavagetto) [15:33:20] (03PS3) 10Volans: Cumin: authorize also cumin masters IPv6 addresses [puppet] - 10https://gerrit.wikimedia.org/r/339183 (https://phabricator.wikimedia.org/T158753) [15:33:49] 06Operations, 10ops-codfw: ms-be2002.codfw.wmnet has drac issues - https://phabricator.wikimedia.org/T155689#3049830 (10Papaul) IPMI was disable it is now enable. 
[15:36:12] (03PS2) 10Rush: tools: allow generic banner for inf protection [puppet] - 10https://gerrit.wikimedia.org/r/339007 [15:38:18] (03PS4) 10Rush: nova: run fullstack test suite on current labnet [puppet] - 10https://gerrit.wikimedia.org/r/339064 [15:38:24] (03PS5) 10Giuseppe Lavagetto: profile::conftool::client: first commit [puppet] - 10https://gerrit.wikimedia.org/r/339415 [15:39:53] (03CR) 10Rush: [C: 032] nova: run fullstack test suite on current labnet [puppet] - 10https://gerrit.wikimedia.org/r/339064 (owner: 10Rush) [15:40:30] (03PS6) 10Giuseppe Lavagetto: profile::conftool::client: first commit [puppet] - 10https://gerrit.wikimedia.org/r/339415 [15:41:19] (03PS1) 10Volans: Add reverse mapped IPv6 for neodymium and sarin [dns] - 10https://gerrit.wikimedia.org/r/339422 (https://phabricator.wikimedia.org/T158753) [15:41:24] <_joe_> chasemp: I just saw https://gerrit.wikimedia.org/r/#/c/339064/ and... well, you might have wanted to make that a profile maybe? [15:42:03] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] profile::conftool::client: first commit [puppet] - 10https://gerrit.wikimedia.org/r/339415 (owner: 10Giuseppe Lavagetto) [15:42:42] (03PS1) 10Gehel: relforge: upgrade elasticsearch to v5.2.1 [puppet] - 10https://gerrit.wikimedia.org/r/339423 (https://phabricator.wikimedia.org/T156150) [15:42:52] _joe_: ok let me read up and put up a conversion for you to peruse [15:43:13] (03PS2) 10Gehel: relforge: upgrade elasticsearch to v5.2.1 [puppet] - 10https://gerrit.wikimedia.org/r/339423 (https://phabricator.wikimedia.org/T156150) [15:43:18] <_joe_> chasemp: just as a general note, we should all put a (very small) effort in migrating in that direction [15:43:19] I'm not entirely clear on if profiles are a thing we are now using for all things or new things or what [15:43:32] <_joe_> well you created a new "role" [15:43:33] (03CR) 10Volans: "Mapping of IPv6 for neodymium and sarin is added in https://gerrit.wikimedia.org/r/#/c/339183" [dns] - 
10https://gerrit.wikimedia.org/r/339422 (https://phabricator.wikimedia.org/T158753) (owner: 10Volans) [15:43:41] <_joe_> which was actually a profile [15:43:43] <_joe_> :P [15:43:49] that hiera novaconfig mess may make it difficult, I'm not sure [15:44:16] (03CR) 10DCausse: [C: 031] relforge: upgrade elasticsearch to v5.2.1 [puppet] - 10https://gerrit.wikimedia.org/r/339423 (https://phabricator.wikimedia.org/T156150) (owner: 10Gehel) [15:45:01] PROBLEM - puppet last run on db1088 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:45:10] !log banning relforge1001 from clsuter to prepare for ES5 upgrade - T156150 [15:45:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:15] T156150: Install ES 5.x to relforge100[12] - https://phabricator.wikimedia.org/T156150 [15:47:51] RECOVERY - puppet last run on es1016 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [15:50:00] (03PS2) 10Tim Landscheidt: ganglia: Remove now-duplicate parser function suffix() [puppet] - 10https://gerrit.wikimedia.org/r/339069 (https://phabricator.wikimedia.org/T158860) [15:50:02] (03PS2) 10Giuseppe Lavagetto: role::puppetmaster::frontend: include profile::conftool::master [puppet] - 10https://gerrit.wikimedia.org/r/339416 [15:50:31] (03PS1) 10Milimetric: Fix output path for stat1002 reports [puppet] - 10https://gerrit.wikimedia.org/r/339424 [15:59:05] (03PS1) 10Rush: nova: fullstack test dupe resource removal [puppet] - 10https://gerrit.wikimedia.org/r/339425 [16:00:59] (03CR) 10Rush: [C: 032] nova: fullstack test dupe resource removal [puppet] - 10https://gerrit.wikimedia.org/r/339425 (owner: 10Rush) [16:01:05] (03PS2) 10Rush: nova: fullstack test dupe resource removal [puppet] - 10https://gerrit.wikimedia.org/r/339425 [16:01:09] (03CR) 10Rush: [V: 032 C: 032] nova: fullstack test dupe resource removal [puppet] - 10https://gerrit.wikimedia.org/r/339425 (owner: 10Rush) [16:04:09] 
(03PS1) 10Rush: nova: remove fullstack circular dependency for upstart [puppet] - 10https://gerrit.wikimedia.org/r/339427 [16:05:23] (03PS2) 10Rush: nova: remove fullstack circular dependency for upstart [puppet] - 10https://gerrit.wikimedia.org/r/339427 [16:10:51] PROBLEM - puppet last run on mc1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:11:39] 06Operations, 10Ops-Access-Requests, 06Discovery, 06Maps: Give Max Semenik deployment rights for Maps - https://phabricator.wikimedia.org/T158820#3049952 (10RobH) a:03EBjune I'm assigning this to @EBjune for their approval. Please review/comment. Please assign task from you back to @robh (me), since I'... [16:12:01] RECOVERY - puppet last run on db1088 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [16:13:04] 06Operations: Restructure our internal repositories further - https://phabricator.wikimedia.org/T158583#3049960 (10MoritzMuehlenhoff) > This would be nice, but I'm guessing reprepo wouldn't make this easy to do, and might be out of scope for this issue. Ack, that's not really possible with reprepro and out of... 
[16:13:34] (03CR) 10jerkins-bot: [V: 04-1] nova: remove fullstack circular dependency for upstart [puppet] - 10https://gerrit.wikimedia.org/r/339427 (owner: 10Rush) [16:15:29] (03PS1) 10Rush: nfs-mount: add chasetestproject for k8s testing [puppet] - 10https://gerrit.wikimedia.org/r/339429 [16:17:29] (03CR) 10Giuseppe Lavagetto: [C: 032] ganglia: Remove now-duplicate parser function suffix() [puppet] - 10https://gerrit.wikimedia.org/r/339069 (https://phabricator.wikimedia.org/T158860) (owner: 10Tim Landscheidt) [16:17:45] (03PS3) 10Giuseppe Lavagetto: ganglia: Remove now-duplicate parser function suffix() [puppet] - 10https://gerrit.wikimedia.org/r/339069 (https://phabricator.wikimedia.org/T158860) (owner: 10Tim Landscheidt) [16:18:58] (03PS3) 10Rush: nova: remove fullstack circular dependency for upstart [puppet] - 10https://gerrit.wikimedia.org/r/339427 [16:19:12] !log unban relforge1001 - T156150 [16:19:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:17] T156150: Install ES 5.x to relforge100[12] - https://phabricator.wikimedia.org/T156150 [16:19:19] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] ganglia: Remove now-duplicate parser function suffix() [puppet] - 10https://gerrit.wikimedia.org/r/339069 (https://phabricator.wikimedia.org/T158860) (owner: 10Tim Landscheidt) [16:19:42] !log starting upgrade relforge cluster to elasticsearch 5.2.1 - expect significant downtime - T156150 [16:19:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:48] (03PS4) 10Rush: nova: remove fullstack circular dependency for upstart [puppet] - 10https://gerrit.wikimedia.org/r/339427 [16:22:32] (03PS3) 10Gehel: relforge: upgrade elasticsearch to v5.2.1 [puppet] - 10https://gerrit.wikimedia.org/r/339423 (https://phabricator.wikimedia.org/T156150) [16:22:40] (03CR) 10Rush: [C: 032] nova: remove fullstack circular dependency for upstart [puppet] - 10https://gerrit.wikimedia.org/r/339427 (owner: 10Rush) [16:22:51] 
PROBLEM - High lag on wdqs1002 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [1800.0] [16:24:42] (03CR) 10Gehel: [C: 032] relforge: upgrade elasticsearch to v5.2.1 [puppet] - 10https://gerrit.wikimedia.org/r/339423 (https://phabricator.wikimedia.org/T156150) (owner: 10Gehel) [16:24:53] (03PS4) 10Gehel: relforge: upgrade elasticsearch to v5.2.1 [puppet] - 10https://gerrit.wikimedia.org/r/339423 (https://phabricator.wikimedia.org/T156150) [16:25:45] (03PS1) 10Filippo Giunchedi: hieradata: use 'localhost' vhost for icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/339430 (https://phabricator.wikimedia.org/T127455) [16:25:51] (03PS2) 10Jcrespo: [WIP]mariadb: Include a new option "socket" for all servers [puppet] - 10https://gerrit.wikimedia.org/r/339004 [16:25:55] (03PS3) 10Giuseppe Lavagetto: role::puppetmaster::frontend: include profile::conftool::master [puppet] - 10https://gerrit.wikimedia.org/r/339416 [16:26:24] (03PS2) 10Filippo Giunchedi: hieradata: use 'localhost' vhost for icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/339430 (https://phabricator.wikimedia.org/T127455) [16:27:30] (03CR) 10jerkins-bot: [V: 04-1] [WIP]mariadb: Include a new option "socket" for all servers [puppet] - 10https://gerrit.wikimedia.org/r/339004 (owner: 10Jcrespo) [16:27:31] PROBLEM - puppet last run on maps1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:27:58] (03PS1) 10Muehlenhoff: Update email address for Ellery Wulczyn [puppet] - 10https://gerrit.wikimedia.org/r/339431 [16:29:44] 06Operations, 10RESTBase, 10service-runner, 06Services (doing): enable restbase syslog/file logging - https://phabricator.wikimedia.org/T112648#3050004 (10GWicke) I flagged the possibility of other write errors (apart from out-of-space) on the task. I really think we should err on the side of caution, and... 
[16:29:51] PROBLEM - High lag on wdqs1002 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [1800.0] [16:33:49] !log cleaning up openstack packages from einstenium & tegment [16:33:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:29] (03PS4) 10Giuseppe Lavagetto: role::puppetmaster::frontend: include profile::conftool::master [puppet] - 10https://gerrit.wikimedia.org/r/339416 [16:37:51] RECOVERY - puppet last run on mc1005 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [16:38:18] 06Operations, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 15 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#3050017 (10GWicke) I think the discussion about restricting thumbnail sizes is orthogonal to this RFC. Nothing in this RFC limits our ability to later... [16:38:19] (03PS1) 10Gehel: elasticsearch: correct iterator in ES5 jvm.options template [puppet] - 10https://gerrit.wikimedia.org/r/339434 (https://phabricator.wikimedia.org/T155578) [16:42:18] (03PS2) 10Gehel: elasticsearch: correct iterator in ES5 jvm.options template [puppet] - 10https://gerrit.wikimedia.org/r/339434 (https://phabricator.wikimedia.org/T155578) [16:43:57] (03PS5) 10Giuseppe Lavagetto: role::puppetmaster::frontend: include profile::conftool::master [puppet] - 10https://gerrit.wikimedia.org/r/339416 [16:45:01] (03CR) 10Gehel: [C: 032] "puppet compiler looks good: https://puppet-compiler.wmflabs.org/5558/relforge1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/339434 (https://phabricator.wikimedia.org/T155578) (owner: 10Gehel) [16:49:02] (03PS1) 10Ema: tlsproxy: Lua support [puppet] - 10https://gerrit.wikimedia.org/r/339438 [16:50:21] (03PS3) 10Filippo Giunchedi: hieradata: use 'localhost' vhost for icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/339430 (https://phabricator.wikimedia.org/T127455) [16:51:09] (03CR) 10Filippo Giunchedi: [V: 032 C: 032] hieradata: 
use 'localhost' vhost for icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/339430 (https://phabricator.wikimedia.org/T127455) (owner: 10Filippo Giunchedi) [16:53:33] (03PS3) 10Gehel: elasticsearch: correct iterator in ES5 jvm.options template [puppet] - 10https://gerrit.wikimedia.org/r/339434 (https://phabricator.wikimedia.org/T155578) [16:54:53] (03PS6) 10Giuseppe Lavagetto: role::puppetmaster::frontend: include profile::conftool::master [puppet] - 10https://gerrit.wikimedia.org/r/339416 [16:55:31] RECOVERY - puppet last run on maps1002 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [16:58:02] (03CR) 10Gehel: [V: 032 C: 032] elasticsearch: correct iterator in ES5 jvm.options template [puppet] - 10https://gerrit.wikimedia.org/r/339434 (https://phabricator.wikimedia.org/T155578) (owner: 10Gehel) [16:58:44] (03CR) 10Filippo Giunchedi: [C: 031] tlsproxy: Lua support [puppet] - 10https://gerrit.wikimedia.org/r/339438 (owner: 10Ema) [16:59:16] (03PS1) 10Rush: nova: nova-fullstack.upstart.erb add \ for line extension [puppet] - 10https://gerrit.wikimedia.org/r/339440 [16:59:54] 06Operations, 07Puppet, 13Patch-For-Review: Puppet's parser function suffix.rb flapping between two versions - https://phabricator.wikimedia.org/T158860#3050122 (10fgiunchedi) 05Open>03Resolved a:03fgiunchedi Thanks @scfc ! This is fixed now ``` # salt -v -t 10 -b 5 'puppetmaster*' cmd.run 'md5sum /va... [17:00:04] godog, moritzm, and _joe_: Dear anthropoid, the time has come. Please deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170223T1700). 
[17:00:30] (03PS2) 10Elukey: Fix output path for stat1002 reports [puppet] - 10https://gerrit.wikimedia.org/r/339424 (owner: 10Milimetric) [17:03:01] (03PS2) 10Rush: nova: nova-fullstack.upstart.erb add \ for line extension [puppet] - 10https://gerrit.wikimedia.org/r/339440 [17:04:21] (03PS2) 10Rush: nfs-mount: add chasetestproject for k8s testing [puppet] - 10https://gerrit.wikimedia.org/r/339429 [17:04:37] I see no puppet swat patches, https://i.redd.it/bkntxf69iufy.gif [17:04:47] (03PS2) 10Ema: tlsproxy: Lua support [puppet] - 10https://gerrit.wikimedia.org/r/339438 [17:05:23] (03PS1) 10Gehel: elasticsearch: remove deprecated options from elasticsaerch config file Bug: T155578 [puppet] - 10https://gerrit.wikimedia.org/r/339444 (https://phabricator.wikimedia.org/T155578) [17:06:16] (03CR) 10Elukey: [V: 032 C: 032] Fix output path for stat1002 reports [puppet] - 10https://gerrit.wikimedia.org/r/339424 (owner: 10Milimetric) [17:07:46] godog, paravoid, _joe_: sorry for my absence at the syncup, I came under attack from Murphy [17:08:13] it ended up just me godog and _joe_ so we cancelled it [17:08:21] hope everything's ok! [17:08:53] paravoid: power went out! [17:09:20] with an enormous boom that setup off every car alarm on the street [17:09:27] paravoid: it was exciting [17:09:48] whaat [17:10:19] we had a bad storm here on Sunday, multiple tornados touched down in the city [17:10:34] apparently this left a broken branch in a position to fall on some high voltage lines [17:10:49] :O [17:10:56] PROBLEM - puppet last run on mw1182 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [17:10:59] so the power company was out trying to clear them when they fell on the lines [17:11:19] (03PS1) 10Aude: Disallow geo-shape data type on wikidata for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339446 (https://phabricator.wikimedia.org/T158849) [17:11:34] and since my phone spontaneously bricked itself yesterday, i was thrown back into the stone age [17:12:32] urandom: ouch! [17:12:59] (03PS2) 10Urbanecm: New namespace aliases for itwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339348 (https://phabricator.wikimedia.org/T158775) [17:22:25] I'm poking swift/https in codfw in icinga, it might page [17:22:51] PROBLEM - LVS HTTPS IPv4 on ms-fe.svc.codfw.wmnet is CRITICAL: connect to address 10.2.2.27 and port 443: Connection refused [17:23:21] QED [17:23:28] lol [17:24:03] (03PS7) 10Giuseppe Lavagetto: role::puppetmaster::frontend: include profile::conftool::master [puppet] - 10https://gerrit.wikimedia.org/r/339416 [17:25:01] sorry about the page [17:25:10] your warning was perfectly timed! 
[17:25:14] not even in backscroll ;] [17:25:21] anyways note how the address there is eqiad, not codfw [17:25:56] I can't currently figure out why icinga still thinks that [17:26:20] i thought it was odd and was looking it up [17:26:26] (03PS2) 10Gehel: elasticsearch: remove deprecated options from elasticsaerch config file Bug: T155578 [puppet] - 10https://gerrit.wikimedia.org/r/339444 (https://phabricator.wikimedia.org/T155578) [17:26:38] i still have a fairly decent mental map of our internal ip addresses it seems =P [17:26:41] (03CR) 10Gehel: [V: 032 C: 032] elasticsearch: remove deprecated options from elasticsaerch config file Bug: T155578 [puppet] - 10https://gerrit.wikimedia.org/r/339444 (https://phabricator.wikimedia.org/T155578) (owner: 10Gehel) [17:27:13] heheh indeed, I didn't want to restart icinga but the config is right, https://icinga.wikimedia.org/cgi-bin/icinga/config.cgi?type=services&item_name=ms-fe.svc.codfw.wmnet^LVS+HTTPS+IPv4 [17:27:47] (03PS3) 10Ema: tlsproxy: Lua support [puppet] - 10https://gerrit.wikimedia.org/r/339438 [17:27:54] (03CR) 10Ema: [V: 032 C: 032] tlsproxy: Lua support [puppet] - 10https://gerrit.wikimedia.org/r/339438 (owner: 10Ema) [17:28:45] ACKNOWLEDGEMENT - High lag on wdqs1002 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [1800.0] Gehel replication was stuck, but is now recovering. 
Investigation on T158874 [17:29:14] (03PS3) 10Rush: nova: nova-fullstack.upstart.erb add \ for line extension [puppet] - 10https://gerrit.wikimedia.org/r/339440 [17:29:21] (03CR) 10Rush: [V: 032 C: 032] nova: nova-fullstack.upstart.erb add \ for line extension [puppet] - 10https://gerrit.wikimedia.org/r/339440 (owner: 10Rush) [17:30:17] (03PS3) 10Rush: nfs-mount: add chasetestproject for k8s testing [puppet] - 10https://gerrit.wikimedia.org/r/339429 [17:30:29] (03CR) 10Rush: [V: 032 C: 032] nfs-mount: add chasetestproject for k8s testing [puppet] - 10https://gerrit.wikimedia.org/r/339429 (owner: 10Rush) [17:32:14] (03CR) 10jerkins-bot: [V: 04-1] New namespace aliases for itwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339348 (https://phabricator.wikimedia.org/T158775) (owner: 10Urbanecm) [17:36:08] (03PS8) 10Giuseppe Lavagetto: role::puppetmaster: use profile::conftool [puppet] - 10https://gerrit.wikimedia.org/r/339416 [17:37:27] !log removing old prod indices from relforge1002 (jawikiprod_content, enprodwiki_content, ruwikiprod_content) - T156150 [17:37:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:37:32] T156150: Install ES 5.x to relforge100[12] - https://phabricator.wikimedia.org/T156150 [17:37:34] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] role::puppetmaster: use profile::conftool [puppet] - 10https://gerrit.wikimedia.org/r/339416 (owner: 10Giuseppe Lavagetto) [17:37:56] RECOVERY - puppet last run on mw1182 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [17:38:47] (03CR) 10MarcoAurelio: New namespace aliases for itwikiversity (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339348 (https://phabricator.wikimedia.org/T158775) (owner: 10Urbanecm) [17:40:34] !log removing old prod indices from relforge1002 - T156150 [17:40:38] (03PS1) 10Rush: nova: fullstack test set respawn limits [puppet] - 10https://gerrit.wikimedia.org/r/339450 [17:40:38] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:52] (03PS2) 10Rush: nova: fullstack test set respawn limits [puppet] - 10https://gerrit.wikimedia.org/r/339450 [17:45:31] (03PS5) 10Elukey: Set maximum JVM heap size for Zookeeper [puppet] - 10https://gerrit.wikimedia.org/r/337797 (https://phabricator.wikimedia.org/T157968) [17:46:51] gehel: --^ [17:47:10] elukey: looking [17:47:25] I checked the jvm7/8 and afaics the default Xmx should be either 1/4 of the total ram available or 1GB (whichever is the smaller) [17:48:04] so 1g should be a safe bet, then we'll tune hiera if needed [17:48:19] elukey: sounds good to me! [17:48:38] my goal is to make it predictable and add alarms [17:49:05] super thanks :) [17:49:15] anybody else that wants to comment please feel free to :) [17:49:36] (03CR) 10Gehel: [C: 031] "Change seems sound and aligns with the JVM default." [puppet] - 10https://gerrit.wikimedia.org/r/337797 (https://phabricator.wikimedia.org/T157968) (owner: 10Elukey) [17:49:59] thanks! [17:50:07] will triple check tomorrow [17:50:21] but the plan is to upgrade zk on druid100[123] [17:50:31] then zk on conf1001, wait a bit, then the rest [17:51:09] maybe worth to add alarms in the same code review [17:51:10] ?? [17:51:14] gehel: --^ [17:51:51] monitoring::graphite_threshold for JVM heap size [17:52:14] alerts on JVM heap size don't make all that much sense... [17:52:31] why not? like 90% usage for x datapoints etc.. [17:52:44] I spotted mem leaks in hadoop with them :( [17:53:11] I mean, it is ok to reach the top of the heap and trigger GC [17:53:15] since you fixed the max heap, you know the upper limit. It make more sense to have alert on time spent in GC, it is a better indicator IMHO and independent of heap size [17:53:16] (03CR) 10MarcoAurelio: [C: 04-1] "For your attention." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/339348 (https://phabricator.wikimedia.org/T158775) (owner: 10Urbanecm) [17:53:49] mmmm it might also be another good indicator [17:53:53] (03PS2) 10MarcoAurelio: Remove Flow from Meta-Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333860 (https://phabricator.wikimedia.org/T63729) [17:54:00] (03CR) 10Rush: [C: 032] nova: fullstack test set respawn limits [puppet] - 10https://gerrit.wikimedia.org/r/339450 (owner: 10Rush) [17:54:10] I'm afraid that heap size will be hard to use without too many false positive... [17:54:16] RECOVERY - High lag on wdqs1002 is OK: OK: Less than 30.00% above the threshold [600.0] [17:54:17] I like the heap size thresholds because usually when a JVM starts trashing it will get caught [17:54:49] 06Operations, 10Ops-Access-Requests, 06Discovery, 06Maps: Give Max Semenik deployment rights for Maps - https://phabricator.wikimedia.org/T158820#3050419 (10EBjune) Max should definitely have deploy rights for Maps, thanks! [17:55:07] I'd put a conservative value, like CRITICAL if 90% of heap size is crossed for more than half of datapoints in say an hour [17:55:22] 06Operations, 10Ops-Access-Requests, 06Discovery, 06Maps: Give Max Semenik deployment rights for Maps - https://phabricator.wikimedia.org/T158820#3050424 (10EBjune) a:05EBjune>03RobH [17:56:09] elukey: can't hurt... [17:56:33] gehel: I'll think about it tonight and ping you tomorrow if you have time :) [17:56:43] sure! [17:56:51] I like the GC Time alerts, we migth thing about them for a broader alarming scheme [17:59:50] (03PS1) 10Gehel: elasticsearch: remove memory lock, we do not use swap anyway Bug: T155578 [puppet] - 10https://gerrit.wikimedia.org/r/339455 (https://phabricator.wikimedia.org/T155578) [18:00:04] gwicke, cscott, arlolra, subbu, halfak, and Amir1: Dear anthropoid, the time has come. 
Please deploy Services – Graphoid / Parsoid / OCG / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170223T1800). [18:02:00] (03CR) 10EBernhardson: "looks like this is partially true, it seems while relforge has no swap the prod clusters all have 1GB of swap each. I don't see any benefi" [puppet] - 10https://gerrit.wikimedia.org/r/339455 (https://phabricator.wikimedia.org/T155578) (owner: 10Gehel) [18:03:08] PROBLEM - puppet last run on rcs1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:07:05] 06Operations, 06Discovery, 10Wikimedia-Portals, 03Discovery-Portal-Sprint, and 2 others: https://www.wikipedia.org/ portal doesn't have any text - https://phabricator.wikimedia.org/T158782#3050481 (10demon) >>! In T158782#3048710, @Krinkle wrote: > I'm not sure why T158810 or T158808 would be needed to hav... [18:07:15] 06Operations, 10Ops-Access-Requests, 06Discovery, 06Maps: Give Max Semenik deployment rights for Maps - https://phabricator.wikimedia.org/T158820#3050482 (10RobH) So, this group seems odd to me, unless its privilege escalation takes place in another file? The modules/admin/data/data.yaml has the following... 
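Note on the 17:55:07 proposal above ("CRITICAL if 90% of heap size is crossed for more than half of datapoints in say an hour"): a minimal Python sketch of that rule follows. This is not how monitoring::graphite_threshold is implemented. The function name, the parameter names, and the assumption that datapoints arrive as used/max heap ratios are hypothetical and chosen only for illustration.

```python
def heap_alert_state(heap_used_ratios, threshold=0.9, min_fraction=0.5):
    """Classify a window of heap-usage datapoints.

    heap_used_ratios: iterable of used/max heap ratios (0.0 to 1.0) covering
    the evaluation window (for example one hour); None marks a missing point.
    Returns "CRITICAL" when more than min_fraction of the available
    datapoints are above threshold, "OK" otherwise, and "UNKNOWN" when the
    window holds no usable datapoints.
    """
    points = [r for r in heap_used_ratios if r is not None]
    if not points:
        return "UNKNOWN"
    over = sum(1 for r in points if r > threshold)
    return "CRITICAL" if over / len(points) > min_fraction else "OK"


if __name__ == "__main__":
    # 60 one-minute datapoints, 40 of them above 90% heap usage.
    window = [0.95] * 40 + [0.50] * 20
    print(heap_alert_state(window))
```

A short spike up to the top of the heap followed by a GC leaves only a few datapoints above the threshold, so it stays "OK"; only sustained pressure trips the alert. That is the false-positive concern gehel raises at 17:54:10.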
[18:07:32] 06Operations, 10Ops-Access-Requests, 06Discovery, 06Maps: Give Max Semenik deployment rights for Maps - https://phabricator.wikimedia.org/T158820#3050483 (10RobH) [18:08:03] (03PS4) 10Tim Landscheidt: Tools: Fully qualify hostnames [puppet] - 10https://gerrit.wikimedia.org/r/328451 (https://phabricator.wikimedia.org/T153608) [18:11:43] (03PS1) 10Giuseppe Lavagetto: role::conftool::master: remove as it is unused [puppet] - 10https://gerrit.wikimedia.org/r/339458 [18:11:45] (03PS1) 10Giuseppe Lavagetto: conftool: remove base class, useless in the refactor [puppet] - 10https://gerrit.wikimedia.org/r/339459 [18:12:29] (03CR) 10Andrew Bogott: [C: 04-1] Tools: Fully qualify hostnames (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/328451 (https://phabricator.wikimedia.org/T153608) (owner: 10Tim Landscheidt) [18:17:23] (03PS1) 10Muehlenhoff: Fix absent check for users which formerly only had LDAP access [puppet] - 10https://gerrit.wikimedia.org/r/339461 [18:18:54] (03PS1) 10Rush: nodepool: bumps to max-servers and ready nodes [puppet] - 10https://gerrit.wikimedia.org/r/339463 [18:20:45] (03CR) 10Tim Landscheidt: Tools: Fully qualify hostnames (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/328451 (https://phabricator.wikimedia.org/T153608) (owner: 10Tim Landscheidt) [18:21:45] (03CR) 10Thcipriani: [C: 031] nodepool: bumps to max-servers and ready nodes [puppet] - 10https://gerrit.wikimedia.org/r/339463 (owner: 10Rush) [18:22:12] (03CR) 10Andrew Bogott: [C: 032] Tools: Fully qualify hostnames (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/328451 (https://phabricator.wikimedia.org/T153608) (owner: 10Tim Landscheidt) [18:22:24] (03PS5) 10Andrew Bogott: Tools: Fully qualify hostnames [puppet] - 10https://gerrit.wikimedia.org/r/328451 (https://phabricator.wikimedia.org/T153608) (owner: 10Tim Landscheidt) [18:22:55] 06Operations, 06Discovery, 06Discovery-Search (Current work): remove swap from elasticsearch servers - 
https://phabricator.wikimedia.org/T158884#3050526 (10Gehel) [18:23:05] (03CR) 10Andrew Bogott: [C: 031] nodepool: bumps to max-servers and ready nodes [puppet] - 10https://gerrit.wikimedia.org/r/339463 (owner: 10Rush) [18:23:23] (03CR) 10Gehel: [C: 032] "I will disable swap from production servers before upgrading elasticsearch" [puppet] - 10https://gerrit.wikimedia.org/r/339455 (https://phabricator.wikimedia.org/T155578) (owner: 10Gehel) [18:24:22] (03PS2) 10Rush: nodepool: bumps to max-servers and ready nodes [puppet] - 10https://gerrit.wikimedia.org/r/339463 [18:24:48] (03CR) 10jerkins-bot: [V: 04-1] Fix absent check for users which formerly only had LDAP access [puppet] - 10https://gerrit.wikimedia.org/r/339461 (owner: 10Muehlenhoff) [18:27:05] (03CR) 10Rush: [V: 032 C: 032] nodepool: bumps to max-servers and ready nodes [puppet] - 10https://gerrit.wikimedia.org/r/339463 (owner: 10Rush) [18:27:40] (03PS2) 10Muehlenhoff: Fix absent check for users which formerly only had LDAP access [puppet] - 10https://gerrit.wikimedia.org/r/339461 [18:27:51] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service: Response times of Wikidata Query Service increasing - https://phabricator.wikimedia.org/T147130#3050543 (10Gehel) 05Open>03Resolved Nothing left to investigate... 
[18:28:27] (03PS2) 10BBlack: LE: allow non-root key ownership/perms [puppet] - 10https://gerrit.wikimedia.org/r/339015 (https://phabricator.wikimedia.org/T154917) [18:28:29] (03PS2) 10BBlack: lists: use LE cert for exim [puppet] - 10https://gerrit.wikimedia.org/r/339016 (https://phabricator.wikimedia.org/T154917) [18:29:06] (03PS1) 10BBlack: varnish: move applayer be_opts defaulting into template [puppet] - 10https://gerrit.wikimedia.org/r/339464 (https://phabricator.wikimedia.org/T134404) [18:29:39] !log labnodepool1001:~# service nodepool restart [18:29:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:31:38] (03PS1) 10Ema: WIP: prometheus: add node tlsproxy [puppet] - 10https://gerrit.wikimedia.org/r/339465 [18:32:07] RECOVERY - puppet last run on rcs1002 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [18:38:11] (03CR) 10BBlack: [V: 032 C: 032] LE: allow non-root key ownership/perms [puppet] - 10https://gerrit.wikimedia.org/r/339015 (https://phabricator.wikimedia.org/T154917) (owner: 10BBlack) [18:39:46] 06Operations, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 15 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#3050554 (10GWicke) [18:40:57] (03CR) 10BBlack: [V: 032 C: 032] lists: use LE cert for exim [puppet] - 10https://gerrit.wikimedia.org/r/339016 (https://phabricator.wikimedia.org/T154917) (owner: 10BBlack) [18:41:56] Dereckson it seems that https://www.wikipedia.org/ has no text again for me. [18:42:18] I did command + sift + r to force a refresh in chrome and still shows no text. [18:42:26] really? [18:42:26] 06Operations, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 15 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#3050575 (10GWicke) >>! In T66214#2981032, @Gilles wrote: > Accept headers and Vary: Accept are missing from the current task description. I added a se... 
[18:42:42] can repro [18:42:57] PROBLEM - puppet last run on fermium is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 19 seconds ago with 1 failures. Failed resources (up to 3 shown): Exec[acme-setup-acme-lists] [18:43:15] 06Operations, 06Discovery, 10Wikimedia-Portals, 03Discovery-Portal-Sprint, and 2 others: https://www.wikipedia.org/ portal doesn't have any text - https://phabricator.wikimedia.org/T158782#3050577 (10Paladox) This problem has happened again. I see no text. I did a force refresh in chrome and still nothing. [18:43:24] 06Operations, 06Discovery, 10Wikimedia-Portals, 03Discovery-Portal-Sprint, and 2 others: https://www.wikipedia.org/ portal doesn't have any text - https://phabricator.wikimedia.org/T158782#3050590 (10Paladox) p:05High>03Unbreak! [18:43:31] paladox: they redeploye [18:43:36] oh [18:44:02] (03PS1) 10BBlack: Bugfix: group name does not have 4 at the end [puppet] - 10https://gerrit.wikimedia.org/r/339466 (https://phabricator.wikimedia.org/T154917) [18:44:05] I've repurged /portal/wikipedia.org/assets/js/index-4398b00936.js without success [18:44:16] (03CR) 10BBlack: [V: 032 C: 032] Bugfix: group name does not have 4 at the end [puppet] - 10https://gerrit.wikimedia.org/r/339466 (https://phabricator.wikimedia.org/T154917) (owner: 10BBlack) [18:44:25] Urbanecm: minor issue on wikiversity namespaces [18:44:29] oh [18:44:43] Dereckson: they did? when? 
[18:45:57] RECOVERY - puppet last run on fermium is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [18:46:26] Let me double check, yesterday it was a d1cc91a7f4 hash and today we have a portal/wikipedia.org/assets/js/index-4398b00936.js [18:46:42] no more gerrit issues I'm aware of since then [18:46:45] nothing in SAL [18:46:51] (03PS1) 10Andrew Bogott: Keystone: Go back to using eventlet for Liberty [puppet] - 10https://gerrit.wikimedia.org/r/339467 (https://phabricator.wikimedia.org/T156337) [18:47:28] 06Operations, 06Discovery, 10Wikimedia-Portals, 03Discovery-Portal-Sprint, and 2 others: https://www.wikipedia.org/ portal doesn't have any text - https://phabricator.wikimedia.org/T158782#3050640 (10greg) @Jdrewniak was there another deploy? When did that happen? I don't see anything in [[ https://tools.w... [18:47:47] 06Operations, 06Discovery, 10Wikimedia-Portals, 03Discovery-Portal-Sprint, and 2 others: https://www.wikipedia.org/ portal doesn't have any text - https://phabricator.wikimedia.org/T158782#3050653 (10Dereckson) p:05Unbreak!>03High Could be a cache issue: I send a purge request and it works again. [18:47:50] Dereckson i only see a change in portals from 12am my time. But in mediawiki-configuation it shows you reverted it yesturday. [18:48:18] greg-g: paladox: no, they didn't deploy, I had an old version of the page on by browser it seems [18:48:22] Dereckson: paladox the portal works for me, fwiw [18:48:24] in germany [18:48:36] could be caching [18:48:37] aude: yes, after a purge of https://www.wikipedia.org/portal/wikipedia.org/assets/js/index-4398b00936.js [18:48:39] It works now. But after Dereckson did the purge. [18:48:51] ok [18:48:51] (why www and not en. good question) [18:48:56] works for me as well [18:49:23] please update the task with actions taken :) [18:51:31] Done. 
[18:51:58] 06Operations, 06Discovery, 10Wikimedia-Portals, 03Discovery-Portal-Sprint, and 2 others: https://www.wikipedia.org/ portal doesn't have any text - https://phabricator.wikimedia.org/T158782#3050674 (10Gehel) For reference, the Apache configuration backing the portals seems to be https://github.com/wikimedia... [18:52:04] (03CR) 10jerkins-bot: [V: 04-1] Fix absent check for users which formerly only had LDAP access [puppet] - 10https://gerrit.wikimedia.org/r/339461 (owner: 10Muehlenhoff) [18:52:06] (03PS1) 10Esanders: Change pubkey for Ed Sanders [puppet] - 10https://gerrit.wikimedia.org/r/339468 [18:52:13] 06Operations, 06Discovery, 10Wikimedia-Portals, 03Discovery-Portal-Sprint, and 2 others: https://www.wikipedia.org/ portal doesn't have any text - https://phabricator.wikimedia.org/T158782#3050675 (10Paladox) Works now after refresh. [18:52:58] 06Operations, 13Patch-For-Review: Remote IPMI doesn't work for ~2% of the fleet - https://phabricator.wikimedia.org/T150160#3050681 (10Marostegui) For the record, when reinstalling dbstore1001 (T153768) which is mentioned here: T150160#2951190 as one of the affected hosts, we ran into this issue and tried to t... [18:53:57] (03PS1) 10BBlack: Revert "Bugfix: group name does not have 4 at the end" [puppet] - 10https://gerrit.wikimedia.org/r/339469 [18:54:06] (03CR) 10BBlack: [V: 032 C: 032] Revert "Bugfix: group name does not have 4 at the end" [puppet] - 10https://gerrit.wikimedia.org/r/339469 (owner: 10BBlack) [18:54:12] 06Operations, 06Discovery, 10Wikimedia-Portals, 03Discovery-Portal-Sprint, and 2 others: https://www.wikipedia.org/ portal doesn't have any text - https://phabricator.wikimedia.org/T158782#3050687 (10Dereckson) >>! In T158782#3050674, @Gehel wrote: > For reference, the Apache configuration backing the port... 
[18:54:17] (03PS1) 10BBlack: Revert "lists: use LE cert for exim" [puppet] - 10https://gerrit.wikimedia.org/r/339471 [18:54:23] (03CR) 10BBlack: [V: 032 C: 032] Revert "lists: use LE cert for exim" [puppet] - 10https://gerrit.wikimedia.org/r/339471 (owner: 10BBlack) [18:54:33] (03PS2) 10BBlack: Revert "lists: use LE cert for exim" [puppet] - 10https://gerrit.wikimedia.org/r/339471 [18:54:50] (03CR) 10jerkins-bot: [V: 04-1] WIP: prometheus: add node tlsproxy [puppet] - 10https://gerrit.wikimedia.org/r/339465 (owner: 10Ema) [18:55:20] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure: labsdb1006/1007 (postgresql) maintenance - https://phabricator.wikimedia.org/T157359#3050688 (10MaxSem) Yes, I'm using it (from tool labs, actually), but feel free to take it offline any time. [18:55:40] (03PS1) 10Catrope: Store goodfaith scores in the ORES tables [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339472 (https://phabricator.wikimedia.org/T137966) [18:55:53] (03PS2) 10RobH: Change pubkey for Ed Sanders [puppet] - 10https://gerrit.wikimedia.org/r/339468 (owner: 10Esanders) [18:57:23] (03CR) 10RobH: [C: 032] "I was chatting with Ed about this via IRC. Since he uploaded his own pub key change, with his gerrit account which is linked (via email) " [puppet] - 10https://gerrit.wikimedia.org/r/339468 (owner: 10Esanders) [18:58:00] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure: labsdb1006/1007 (postgresql) maintenance - https://phabricator.wikimedia.org/T157359#3050709 (10chasemp) Thanks @maxsem AFAIK we can take this down with proper notice (and really we must). My thinking is to send a general 1 week notice to labs-announce a... [19:00:07] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170223T1900). 
[19:00:07] tabbycat: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [19:00:18] * tabbycat meows [19:00:22] i iz here [19:00:48] (03CR) 10jerkins-bot: [V: 04-1] Keystone: Go back to using eventlet for Liberty [puppet] - 10https://gerrit.wikimedia.org/r/339467 (https://phabricator.wikimedia.org/T156337) (owner: 10Andrew Bogott) [19:03:02] 06Operations, 13Patch-For-Review: Remote IPMI doesn't work for ~2% of the fleet - https://phabricator.wikimedia.org/T150160#2775695 (10Cmjohnson) dbstore1001 has the remote issue, opened a ticket with Dell to troubleshoot. Updating F/W to see if that will fix the issue [19:03:28] tabbycat: I can SWAT [19:03:39] (03PS1) 10Phuedx: Make Page Previews use RESTBase on Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339475 (https://phabricator.wikimedia.org/T156800) [19:03:43] (03PS2) 10Andrew Bogott: Keystone: Go back to using eventlet for Liberty [puppet] - 10https://gerrit.wikimedia.org/r/339467 (https://phabricator.wikimedia.org/T156337) [19:03:44] ^_^ [19:03:46] (03PS1) 10BBlack: LE: chmod +x on key dir, for group read to work [puppet] - 10https://gerrit.wikimedia.org/r/339476 [19:03:48] (03PS1) 10BBlack: lists: use LE cert for exim (let's try again!) [puppet] - 10https://gerrit.wikimedia.org/r/339477 (https://phabricator.wikimedia.org/T154917) [19:03:57] (03CR) 10BBlack: [V: 032 C: 032] LE: chmod +x on key dir, for group read to work [puppet] - 10https://gerrit.wikimedia.org/r/339476 (owner: 10BBlack) [19:04:09] (03CR) 10BBlack: [V: 032 C: 032] lists: use LE cert for exim (let's try again!) 
[puppet] - 10https://gerrit.wikimedia.org/r/339477 (https://phabricator.wikimedia.org/T154917) (owner: 10BBlack) [19:05:36] tabbycat: https://tools.wmflabs.org/versions/ wikipedia is still wmf12 [19:05:40] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure: labsdb1006/1007 (postgresql) maintenance - https://phabricator.wikimedia.org/T157359#3050727 (10dschwen) @chasemp yes a few days downtime should be OK. I have a cache layer that should serve most of the requests. [19:06:08] Dereckson: wikitech is on wmf13 [19:06:34] (03PS1) 10Phuedx: Hygiene: Remove Page Previews experiment config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339478 [19:06:43] and the train for group 2 is in about one hour Dereckson [19:06:48] Indeed, the group isn't used elsewhere [19:07:04] 06Operations: dbstore1001 troubleshoot IPMI issue - https://phabricator.wikimedia.org/T158893#3050736 (10Cmjohnson) [19:08:38] (03CR) 10EBernhardson: [C: 031] [cirrus] cleanup old A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339408 (owner: 10DCausse) [19:09:02] (03CR) 10EBernhardson: [C: 031] [cirrus] Add $wgCirrusSearchElasticQuirks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339409 (owner: 10DCausse) [19:09:23] tabbycat: there are still some steps to do for https://phabricator.wikimedia.org/T154816 documented at https://phabricator.wikimedia.org/T63729#3048524 could you care of it before we merge the flow removal from meta one? [19:10:00] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure: labsdb1006/1007 (postgresql) maintenance - https://phabricator.wikimedia.org/T157359#3050755 (10jcrespo) > @jcrespo how does that sound? Good to me. [19:10:11] tabbycat: seems done actually by Matiia [19:10:48] Dereckson: yep, everything has been taken care of I think. 
However https://meta.wikimedia.org/wiki/Special:Contributions/Flow_talk_page_manager is a bit weird though [19:11:03] since the board was ported, I think I can delete those as well [19:11:06] tabbycat: I've updated the task description to note it's done: https://phabricator.wikimedia.org/transactions/detail/PHID-XACT-TASK-ia5dltz3wsfe5ow/ [19:11:09] 06Operations: dbstore1001 ipmi issue - https://phabricator.wikimedia.org/T158894#3050760 (10RobH) [19:11:46] 06Operations, 13Patch-For-Review: Remote IPMI doesn't work for ~2% of the fleet - https://phabricator.wikimedia.org/T150160#2775695 (10RobH) I've created sub-task T158894 for troubleshooting of dbstore1001. (This master tracking task will get quickly overwhelmed in fine detail if we list off individual steps... [19:12:33] (03CR) 10EBernhardson: [C: 031] Enable Translation memories multi-DC support [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335824 (https://phabricator.wikimedia.org/T132076) (owner: 10DCausse) [19:12:54] 06Operations: dbstore1001 troubleshoot IPMI issue - https://phabricator.wikimedia.org/T158893#3050792 (10RobH) [19:12:56] 06Operations: dbstore1001 ipmi issue - https://phabricator.wikimedia.org/T158894#3050760 (10RobH) [19:14:03] matt_flaschen: is https://meta.wikimedia.org/wiki/Special:Contributions/Flow_talk_page_manager expected to disapear? [19:14:29] tabbycat, no. [19:14:53] matt_flaschen: but when the extension is gone, those links will error or what? :) [19:15:01] matt_flaschen: you confirm all is ready to remove Flow from meta? [19:16:05] tabbycat, let me ask Roan how that was handled on enwiki. [19:16:20] matt_flaschen: sure [19:16:23] Ahm, hmm [19:16:27] 06Operations, 06Discovery, 10Wikimedia-Portals, 03Discovery-Portal-Sprint, and 2 others: https://www.wikipedia.org/ portal doesn't have any text - https://phabricator.wikimedia.org/T158782#3050802 (10greg) how often is this portal being updated and what is the process? I can't see it in the SAL and I would... 
[19:16:38] Dereckson: meanwhile we can handle the messages one when jerkins-bot pleases [19:16:41] I think I might have mass-deleted pages in the Topic namespace or something? [19:16:51] it's stuck at the g&s queue [19:17:05] Yeah Jenkins has been slow [19:17:36] (03PS3) 10Dereckson: Remove Flow from Meta-Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333860 (https://phabricator.wikimedia.org/T63729) (owner: 10MarcoAurelio) [19:19:14] RoanKattouw: idk, but if those can be mass-deleted via db afterwards we can continue [19:19:16] (03CR) 10Dereckson: "PS3: fixed whitespace issue in flow.dblist" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333860 (https://phabricator.wikimedia.org/T63729) (owner: 10MarcoAurelio) [19:19:20] RoanKattouw, I thought it doesn't allow you to delete them (except through Flow UI). Did you do it directly in the DB (I wouldn't recommend we do that). [19:19:40] In my opinion, we shouldn't do anything that interferes with normal undeletion. [19:19:47] matt_flaschen: I used the command line script I think [19:20:00] But for meta, there are only a handful, so we can do whatever we need to do through the UI [19:20:00] cli ftw [19:20:24] I've tried to delete those from the interface and it's not possible [19:20:38] It might become possible once Flow is disabled [19:20:44] okay [19:20:52] I'll check afterwards then [19:20:59] I think the CLI script also errored on deleting those until after Flow was disabled, when I did this on enwiki [19:22:00] 06Operations, 10Ops-Access-Requests, 06Discovery, 06Maps: Give Max Semenik deployment rights for Maps - https://phabricator.wikimedia.org/T158820#3050814 (10mobrovac) @RobH the deploy group gains `sudo` rights to restart the service on the target machines via [`scap::target` declarations](https://github.co... [19:22:18] RoanKattouw, which script, a core one, or FlowRemoveOldTopics? RemoveOldTopics is not intended for this at all, just for duplicates. [19:22:23] Or just a core mass-delete. 
[19:22:30] I did a core mass-delet [19:22:45] Using the script that lets you feed it a newline-separated list of pages to delete, I forget what it's called [19:22:51] deleteBatch.php ? [19:23:13] 339456 has been merged by the way [19:23:16] Yeah [19:23:40] This is why I don't recommend it be disabled. It's the same as disabling any other content handler extension like Wikidata. It leaves inconsistent state. [19:23:47] Instead, you can just delete the pages. [19:23:56] But deleteBatch at least is reversible if we wanted to re-enable it. [19:24:14] I repeatedly urged the same [19:24:47] Thank you [19:25:14] But with the only boards being a transwikied one and a test one, I'm not too worried [19:25:38] tabbycat, before deploying, let me try something. [19:25:44] And yeah I suppose if we wanted to re-enable it, this stuff is all recoverable [19:26:04] matt_flaschen: sure, but it's Dereckson who is swat-ing :) [19:26:08] RoanKattouw: matt_flaschen: https://phabricator.wikimedia.org/T63729 was probably the right place to note that once more [19:26:59] Dereckson, let me check something quick before you deploy. [19:27:16] no problem, it's not urgent at all to disable meta from Flow [19:27:23] Flow from meta [19:28:03] And I'd totally agree to reject the change as "would create an unstable state, we can simply stop to use it" [19:28:19] RoanKattouw, Dereckson, deleting the topics (that will require undeleting the board temporarily I believe) will remove them from contributions. [19:28:38] So we could either: [19:28:46] matt_flaschen: Can't they just go through regular page deletion once the extension is disabled? [19:28:58] Also, deleting a topic in Flow doesn't actually delete the Topic: page (!) [19:29:09] 1. Just leave it as is now (topics are in the contributions). [19:29:27] 2. Leave Flow enabled and just delete the topics normally in Flow (if for some reason people insist that contribs list by empty). 
[19:29:32] 1 would allow correct logs and coherent UI [19:29:38] 1 would be also leaving Flow enabled. [19:29:46] which is not an issue [19:30:01] we can live with those topics in the contribs, but flow must go per request from the community, sorry [19:30:09] The second-to-last entry in the contribs list is 01:15, 15 December 2014 (diff | hist) . . (+1) . . N Topic:Rph3qplet97huylg (→Taken over by Flow) which links to a deleted topic [19:30:10] 3. Disable Flow (I don't recommend it). As RoanKattouw said, that will require deleteBatch I think even if the topics are first deleted. [19:30:13] So deleting topics doesn't help [19:30:31] There is no technical advantage for 3. [19:30:43] option 1 wfm [19:31:42] Alright; so then we are done, right? [19:32:10] think so [19:32:12] 06Operations: dbstore1001 troubleshoot IPMI issue - https://phabricator.wikimedia.org/T158893#3050882 (10RobH) [19:32:21] Dereckson: how's the wmf.13 patch going? [19:32:29] is being scap-ed? [19:32:31] tabbycat: can you abandon your config patch with an explanation note? [19:32:45] Dereckson: why? [19:32:47] RoanKattouw, when I tested it on Beta, deleting a topic removes from the contrib, not sure what happened on that entry. [19:32:48] (03PS3) 10RobH: Change pubkey for Ed Sanders [puppet] - 10https://gerrit.wikimedia.org/r/339468 (owner: 10Esanders) [19:32:51] Dereckson: we agreed on option 1 [19:32:58] 1 would be also leaving Flow enabled. [19:33:23] If you disable Flow while not deleting the topic pages you'll get inconsistent state and JSON pages that nothing understands. [19:33:48] RoanKattouw, try checking https://en.wikipedia.beta.wmflabs.org/wiki/Special:Contributions/Mattflaschen and deleting my 'foo' at https://en.wikipedia.beta.wmflabs.org/wiki/Talk:Flow . 
[19:33:54] -2 it if you want, but I'm just fullfiling a request from the community, a vaid one fwiw [19:34:06] Nemo_bis: ^ [19:35:00] matt_flaschen: Yes, but now look at the top of https://en.wikipedia.beta.wmflabs.org/wiki/Special:Contributions/Flow_talk_page_manager [19:35:08] That links to the deleted topic still [19:35:19] And that was the contribs page we were talking about it as I understood it [19:35:44] (03Abandoned) 10Dereckson: Remove Flow from Meta-Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333860 (https://phabricator.wikimedia.org/T63729) (owner: 10MarcoAurelio) [19:35:51] yep, ftpm contribs [19:36:31] (03PS3) 10Andrew Bogott: Keystone: Go back to using eventlet for Liberty [puppet] - 10https://gerrit.wikimedia.org/r/339467 (https://phabricator.wikimedia.org/T156337) [19:36:46] RoanKattouw, oh, you're right. I was looking in the wrong place. It hides the Flow entries, but not the core action done by FTPM. [19:37:27] PROBLEM - puppet last run on cp4003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:37:42] matt_flaschen: RoanKattouw: data state consistency is the important part [19:38:08] (03Restored) 10MarcoAurelio: Remove Flow from Meta-Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/333860 (https://phabricator.wikimedia.org/T63729) (owner: 10MarcoAurelio) [19:38:09] matt_flaschen: Either way, my point was that the Topic: page continues to exist after topic deletion [19:38:25] 06Operations, 06Discovery, 10Wikimedia-Portals, 03Discovery-Portal-Sprint, and 2 others: https://www.wikipedia.org/ portal doesn't have any text - https://phabricator.wikimedia.org/T158782#3050912 (10Jdrewniak) @greg there was no updates to the portal today. This brief error today must have happened becaus... 
[19:38:35] So if you were to try to clean them up in connection with disabling Flow, taking deletion actions from within Flow doesn't help [19:40:09] That's why I used deleteBatch on enwiki (and you have to do that after disabling, because while Flow is installed it will prevent Topic: pages from being deleted) [19:40:53] RoanKattouw: leaving those Topic: in the contrib pages won't cause any harm so we don't need to delete those after flow removal [19:41:14] from what I can understand [19:41:19] I guess the namespace will also go away [19:41:23] So the contribs display will be broken [19:41:23] so we can just go ahead [19:41:28] But that's OK, we can just deleteBatch those pages [19:41:42] the board was ported to mediawiki so nothing will be lost [19:42:04] tabbycat: live on mwdebug1002, you can check the extension.json change didn't break anything [19:42:24] 06Operations, 06Discovery, 10Wikimedia-Portals, 03Discovery-Portal-Sprint, and 2 others: https://www.wikipedia.org/ portal doesn't have any text - https://phabricator.wikimedia.org/T158782#3047125 (10MaxSem) >>! In T158782#3050802, @greg wrote: > how often is this portal being updated and what is the proce... [19:42:30] Dereckson: for the messages? [19:42:33] yes [19:42:36] sure, okay, checking meta [19:42:46] RoanKattouw, yeah, I don't actually know what happens if you disable a non-empty namespace. [19:42:52] Strange things [19:43:12] I had to temporarily set $wgExtraNamespaces[2600] = 'Topic'; to be able to delete those pages [19:43:26] Basically the contribs page will render page names as ":foobar" instead of "Topic:foobar" [19:43:51] Makes sense [19:44:06] Dereckson: any specifics you'd like me to have a look at? 
[19:44:10] no [19:44:12] wiki is displaying normally [19:44:16] ok [19:44:27] although {{int:grouppage-contentadmin}} does not render anything [19:44:32] on mwdebug [19:44:59] jouncebot: next [19:44:59] In 0 hour(s) and 15 minute(s): MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170223T2000) [19:45:21] !log dereckson@tin Synchronized php-1.29.0-wmf.13/extensions/WikimediaMessages/i18n/wikitech/: (no justification provided) (duration: 00m 43s) [19:45:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:45:44] tabbycat: as the part of the train, a full scap will occur, that will deploy these messages [19:46:07] Dereckson: thought about it, thanks for confirming [19:47:07] (03CR) 10Andrew Bogott: [C: 032] Keystone: Go back to using eventlet for Liberty [puppet] - 10https://gerrit.wikimedia.org/r/339467 (https://phabricator.wikimedia.org/T156337) (owner: 10Andrew Bogott) [19:47:19] !log dereckson@tin Synchronized php-1.29.0-wmf.13/extensions/WikimediaMessages/extension.json: Create user group messages for wikitech.wikimedia.org (T158417) (duration: 00m 39s) [19:47:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:25] T158417: Group messages for Wikitech - https://phabricator.wikimedia.org/T158417 [19:47:40] Dereckson: got a call from the PD I have to leave to a scene [19:47:46] ++ [19:48:40] tabbycat: check when you're back / after the train if all is fine on wikitech, if not ping me [19:49:09] maybe tomorrow, it seems I'll be busy [19:49:24] see ya [19:50:53] RainbowSprinkles: SWAT is done, can you ping me after the full scap? [19:51:06] I won't be doing a full scap [19:51:17] PROBLEM - puppet last run on labtestcontrol2001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [19:51:58] (03PS1) 10Andrew Bogott: Keystone: define uwsgi services even if not running [puppet] - 10https://gerrit.wikimedia.org/r/339482 [19:52:02] (03PS1) 10Chad: Group2 to wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339483 [19:52:42] RainbowSprinkles: we'll need one for a l10n change. Best to do it after the train to avoid to delay it. [19:53:03] (or a l10nupdate) [19:53:04] Train shouldn't take long [19:53:05] :) [19:53:08] ok [19:53:35] (03PS4) 10RobH: Change pubkey for Ed Sanders [puppet] - 10https://gerrit.wikimedia.org/r/339468 (owner: 10Esanders) [19:53:39] damn it i keep rebasing and hten someone calls me on the phone [19:53:43] and i miss my merge window =P [19:56:13] I should change puppet to rebase-if-necessary instead of ff-only [19:56:25] So y'all can stop the useless rebases and gerrit can do the useless rebase for you :) [19:57:43] i like the sound of that. [19:57:55] though id check with the rest of ops first ;D [19:58:24] It's basically like ff-only, but it does the rebase for you if it's trivial [19:58:26] (03CR) 10Andrew Bogott: [C: 032] Keystone: define uwsgi services even if not running [puppet] - 10https://gerrit.wikimedia.org/r/339482 (owner: 10Andrew Bogott) [19:58:30] So you still end up with a linear history [19:58:34] But saves you a click [19:58:50] (03PS2) 10BBlack: varnish: move applayer be_opts defaulting into template [puppet] - 10https://gerrit.wikimedia.org/r/339464 (https://phabricator.wikimedia.org/T134404) [19:59:14] i like that [19:59:19] 06Operations, 10Graphite, 07Nodepool, 07Zuul: Improve graphite failover - https://phabricator.wikimedia.org/T88997#3051083 (10hashar) > zuul (contint1001) > nodepool (labnodepool1001) Both use python-statsd. They create a StatsClient which cache socket.gethostbyname() result. Zuul embeds 2.1.2. Zuul cre... [19:59:30] won't it potentially cause an overload though in zuul? 
[19:59:37] since it can cause a retrigger of a bunch of things? [19:59:58] (i just see one patch being submitted, and hten the 80 open patchsets all hitting zuul immediately) [20:00:04] RainbowSprinkles: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170223T2000). [20:00:15] robh: No more than it does right now :) [20:00:36] It does the rebase right at merge time, and doesn't rebase the whole chain [20:00:50] ah [20:01:00] god dman it [20:01:05] i somehow missed my merge window in like 15 seconds [20:01:19] im just going to manually v it now this is getting stupid. [20:01:23] (03CR) 10Chad: [C: 032] Group2 to wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339483 (owner: 10Chad) [20:01:29] (03PS5) 10RobH: Change pubkey for Ed Sanders [puppet] - 10https://gerrit.wikimedia.org/r/339468 (owner: 10Esanders) [20:01:47] PROBLEM - keystone http on labtestcontrol2001 is CRITICAL: connect to address 208.80.153.47 and port 5000: Connection refused [20:03:08] 5 ps just to get rebase and merge.... [20:03:50] (03Merged) 10jenkins-bot: Group2 to wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339483 (owner: 10Chad) [20:03:57] PROBLEM - puppet last run on labtestservices2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[/usr/local/bin/labs-ip-alias-dump.py] [20:04:30] !log demon@tin rebuilt wikiversions.php and synchronized wikiversions files: group2 to wmf.13 [20:04:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:04:40] (03CR) 10jenkins-bot: Group2 to wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339483 (owner: 10Chad) [20:05:27] RECOVERY - puppet last run on cp4003 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [20:08:05] (03CR) 10RobH: [V: 032 C: 032] Change pubkey for Ed Sanders [puppet] - 10https://gerrit.wikimedia.org/r/339468 (owner: 10Esanders) [20:14:47] RECOVERY - keystone http on labtestcontrol2001 is OK: HTTP OK: HTTP/1.1 300 Multiple Choices - 781 bytes in 0.082 second response time [20:15:48] (03PS3) 10BBlack: varnish: move applayer be_opts defaulting into template [puppet] - 10https://gerrit.wikimedia.org/r/339464 (https://phabricator.wikimedia.org/T134404) [20:17:47] PROBLEM - keystone http on labtestcontrol2001 is CRITICAL: connect to address 208.80.153.47 and port 5000: Connection refused [20:20:27] (03CR) 10BBlack: [C: 032] varnish: move applayer be_opts defaulting into template [puppet] - 10https://gerrit.wikimedia.org/r/339464 (https://phabricator.wikimedia.org/T134404) (owner: 10BBlack) [20:20:47] RECOVERY - keystone http on labtestcontrol2001 is OK: HTTP OK: HTTP/1.1 300 Multiple Choices - 781 bytes in 0.080 second response time [20:21:17] RECOVERY - puppet last run on labtestcontrol2001 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [20:24:04] Dereckson: Oh, I'm done btw [20:26:35] (03PS4) 10BBlack: cache: allow specifying applayer backend probes and probe piwik [puppet] - 10https://gerrit.wikimedia.org/r/338953 (https://phabricator.wikimedia.org/T154558) (owner: 10Ema) [20:27:02] 7 Undefined property: stdClass::$el_owner in /srv/mediawiki/php-1.29.0-wmf.13/extensions/SecurePoll/includes/main/Store.php on 
line 179 [20:27:10] (03CR) 10BBlack: [C: 031] cache: allow specifying applayer backend probes and probe piwik [puppet] - 10https://gerrit.wikimedia.org/r/338953 (https://phabricator.wikimedia.org/T154558) (owner: 10Ema) [20:27:37] PROBLEM - puppet last run on cp3048 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:31:03] Dereckson: Yeah, I saw that a week or two ago [20:31:06] I think I filed a bug [20:33:14] Or not [20:33:39] indeed, nothing recent on https://phabricator.wikimedia.org/project/profile/238/ [20:33:46] PROBLEM - keystone http on labtestcontrol2001 is CRITICAL: connect to address 208.80.153.47 and port 5000: Connection refused [20:33:56] * Dereckson fills [20:35:13] Filed https://phabricator.wikimedia.org/T158904 [20:35:23] RainbowSprinkles: I suspect a code update without a database change on WMF cluster [20:35:56] https://phabricator.wikimedia.org/rESPO30e266ca8c6761280d58cd009f064cff81642330 [20:36:01] Oh, dur https://phabricator.wikimedia.org/T152721 [20:36:03] Yes [20:36:17] PROBLEM - puppet last run on labtestcontrol2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[keystone] [20:38:25] Dereckson: 21 wikis need that schema change [20:38:29] ^ andrewbogott labtestcontrol [20:38:40] yep, I'm working on it [20:38:51] (03PS1) 10Andrew Bogott: Keystone: Don't remove the upstart script for liberty. [puppet] - 10https://gerrit.wikimedia.org/r/339486 [20:39:37] Wait, 21 wikis, I can't read [20:40:21] By 21 I meant all of them [20:40:22] (03CR) 10Andrew Bogott: [C: 032] Keystone: Don't remove the upstart script for liberty. 
[puppet] - 10https://gerrit.wikimedia.org/r/339486 (owner: 10Andrew Bogott) [20:41:16] PROBLEM - Redis replication status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 620 600 - REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3985683 keys, up 115 days 12 hours - replication_delay is 620 [20:41:26] PROBLEM - Redis replication status tcp_6479 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 637 600 - REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 3985700 keys, up 115 days 12 hours - replication_delay is 637 [20:41:29] RainbowSprinkles: all in securepollglobal.dblist [20:41:46] RECOVERY - keystone http on labtestcontrol2001 is OK: HTTP OK: HTTP/1.1 300 Multiple Choices - 781 bytes in 0.086 second response time [20:42:16] RECOVERY - puppet last run on labtestcontrol2001 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [20:42:26] RECOVERY - Redis replication status tcp_6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 3963312 keys, up 115 days 12 hours - replication_delay is 0 [20:42:56] RainbowSprinkles: no, actually all -wikitech -login [20:43:03] Yeah [20:43:22] I filed T158906 for getting the schema updated in production [20:43:22] T158906: Apply el_owner patch for SecurePoll on all wikis - https://phabricator.wikimedia.org/T158906 [20:43:56] PROBLEM - puppet last run on mw1281 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [20:46:25] !log dereckson@tin Started scap: Full scap to deploy new l10n keys on wikitech ([[gerrit:339456]]) [20:46:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:47:05] RainbowSprinkles: If the tables are small, it's definitely a jfdi [20:47:40] Um, RainbowSprinkles https://phabricator.wikimedia.org/T152721#2858335 [20:48:05] Hmmm [20:48:08] !log dereckson@tin scap failed: LockFailedError Failed to acquire lock "/var/lock/scap"; owner is "dereckson"; reason is "Full scap to deploy new l10n keys on wikitech ([[gerrit:339456]])" (duration: 00m 00s) [20:48:10] No backfilled data? [20:48:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:48:18] Nope [20:48:30] That could explain it [20:48:54] I guess especially with default null [20:49:00] !log dereckson@tin Started scap: Full scap to deploy new l10n keys on wikitech ([[gerrit:339456]]), take two [20:49:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:49:08] Reedy: That would explain the error actually [20:49:27] Should probably just defensively code against no backfill data in the extension [20:50:32] Should be easy [20:50:38] Not many usages [20:54:59] (03CR) 10Ladsgroup: [C: 031] "Okay for me" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339472 (https://phabricator.wikimedia.org/T137966) (owner: 10Catrope) [20:55:16] PROBLEM - puppet last run on labtestcontrol2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:55:35] RainbowSprinkles: wfSuppressWarnings around the line? Or is that too hacky? 
:P [20:55:40] I guess, we should isset [20:56:02] Looking at it [20:56:14] isset() is probably fine, just looking to see if that moves the bug to somewhere else [20:56:36] RECOVERY - puppet last run on cp3048 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [20:58:34] PROBLEM - Redis replication status tcp_6479 on rdb2005 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.32.133 on port 6479 [20:59:34] RECOVERY - Redis replication status tcp_6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 3963244 keys, up 115 days 12 hours - replication_delay is 44 [20:59:47] Reedy: So, it's basically an unused piece of data right now [20:59:58] So no worries about passing an empty value higher up for display or something [21:00:09] I think [21:00:15] Patch incoming, anyway [21:00:27] 06Operations, 10Traffic, 10fundraising-tech-ops, 07HTTPS: update SSL certificate for benefactorevents.wikimedia.org by 2017-03-02 - https://phabricator.wikimedia.org/T158684#3051267 (10RobH) I've emailed the new public cert and private key file (the key being pgp encrypted) over to @EWilfong_WMF. [21:00:33] Yeah [21:00:49] It seems to be only set on creation [21:00:55] 06Operations: Restructure our internal repositories further - https://phabricator.wikimedia.org/T158583#3051269 (10hashar) Fair, sorry I went slightly out of topic. I am a huge fan of allowing sub components such as thirdparty/randomsoftware. I would be more than happy to pair and migrate the #zuul package to... [21:01:39] Reedy: https://gerrit.wikimedia.org/r/#/c/339488/ [21:02:08] RainbowSprinkles: Is a string the best thing? [21:02:14] it's an int field [21:03:54] RECOVERY - puppet last run on labtestservices2001 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [21:04:14] Oh, is it? 
[21:04:15] Whoops [21:04:24] I'll swap to 0 [21:05:17] Amended [21:06:34] PROBLEM - Redis replication status tcp_6479 on rdb2005 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.32.133 on port 6479 [21:07:34] RECOVERY - Redis replication status tcp_6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 3963484 keys, up 115 days 12 hours - replication_delay is 0 [21:07:37] heh [21:08:14] RECOVERY - Redis replication status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3963321 keys, up 115 days 12 hours - replication_delay is 0 [21:11:54] RECOVERY - puppet last run on mw1281 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [21:11:56] !log dereckson@tin Finished scap: Full scap to deploy new l10n keys on wikitech ([[gerrit:339456]]), take two (duration: 22m 55s) [21:12:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:17:36] 06Operations, 10ops-codfw, 10hardware-requests, 13Patch-For-Review, 15User-Elukey: Reclaim/Decommission old codfw mc2001->mc2016 hosts - https://phabricator.wikimedia.org/T157675#3051315 (10Papaul) a:05Papaul>03elukey [21:22:38] (03PS1) 10Ppchelko: WIP: Enable local logginng for RESTBase [puppet] - 10https://gerrit.wikimedia.org/r/339501 [21:23:44] RECOVERY - puppet last run on labtestcontrol2001 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [21:23:50] jouncebot now [21:23:51] For the next 0 hour(s) and 36 minute(s): MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170223T2000) [21:25:23] jouncebot: i'm done shut up [21:26:49] RainbowSprinkles how rude [21:27:52] * RainbowSprinkles shrugs [21:28:45] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1005 is CRITICAL: CRITICAL - Expecting active but unit maintain-dbusers is failed [21:28:54] PROBLEM - Check systemd state on labstore1005 is 
CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [21:39:44] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1005 is OK: OK - maintain-dbusers is active [21:39:54] RECOVERY - Check systemd state on labstore1005 is OK: OK - running: The system is fully operational [21:39:55] PROBLEM - Disk space on wdqs1001 is CRITICAL: DISK CRITICAL - free space: / 1055 MB (3% inode=96%) [21:40:43] * gehel having a look at wdqs1001... [21:40:54] (03PS2) 10Ppchelko: WIP: Enable local logginng for RESTBase [puppet] - 10https://gerrit.wikimedia.org/r/339501 [21:43:34] PROBLEM - puppet last run on analytics1033 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:45:54] RECOVERY - Disk space on wdqs1001 is OK: DISK OK [21:55:14] (03PS3) 10Ppchelko: WIP: Enable local logginng for RESTBase [puppet] - 10https://gerrit.wikimedia.org/r/339501 [22:03:18] (03PS4) 10Ppchelko: WIP: Enable local logginng for RESTBase [puppet] - 10https://gerrit.wikimedia.org/r/339501 [22:06:29] 06Operations, 10Traffic, 10fundraising-tech-ops, 07HTTPS: update SSL certificate for benefactorevents.wikimedia.org by 2017-03-02 - https://phabricator.wikimedia.org/T158684#3051412 (10EWilfong_WMF) Thanks, @RobH, the new cert is in place on benefactorevents.wikimedia.org. 
[22:07:01] 06Operations, 10Traffic, 10fundraising-tech-ops, 07HTTPS: update SSL certificate for benefactorevents.wikimedia.org by 2017-03-02 - https://phabricator.wikimedia.org/T158684#3051417 (10RobH) 05Open>03Resolved [22:09:54] (03PS5) 10Ppchelko: WIP: Enable local logginng for RESTBase [puppet] - 10https://gerrit.wikimedia.org/r/339501 [22:12:53] (03PS6) 10Ppchelko: Enable local logginng for RESTBase [puppet] - 10https://gerrit.wikimedia.org/r/339501 (https://phabricator.wikimedia.org/T112648) [22:13:34] RECOVERY - puppet last run on analytics1033 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [22:13:54] (03CR) 10Ppchelko: "Puppet compiler: https://puppet-compiler.wmflabs.org/5575/restbase1007.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/339501 (https://phabricator.wikimedia.org/T112648) (owner: 10Ppchelko) [22:17:26] (03CR) 10Bmansurov: [C: 031] Make Page Previews use RESTBase on Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339475 (https://phabricator.wikimedia.org/T156800) (owner: 10Phuedx) [22:19:44] PROBLEM - puppet last run on cp3006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:23:55] 06Operations, 06DC-Ops, 06Labs: Move labstore1002 and labstore1002-array1 and labstore1002-array2 to different rack (currently in C3) - https://phabricator.wikimedia.org/T158913#3051441 (10madhuvishy) [22:25:14] 06Operations, 06DC-Ops, 06Labs: Move labstore1002 and labstore1002-array1 and labstore1002-array2 to different rack (currently in C3) - https://phabricator.wikimedia.org/T158913#3051441 (10madhuvishy) [22:31:44] PROBLEM - puppet last run on eeden is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [22:33:57] 06Operations, 10Gerrit, 06Release-Engineering-Team: Make sure replying to emails in gerrit 2.14 works - https://phabricator.wikimedia.org/T158915#3051483 (10Paladox) [22:34:03] 06Operations, 10Gerrit, 06Release-Engineering-Team: Make sure replying to emails in gerrit 2.14 works - https://phabricator.wikimedia.org/T158915#3051495 (10Paladox) p:05Triage>03Low [22:34:13] (03CR) 10Madhuvishy: labstore: Install package nethogs from jessie-backports (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/334218 (owner: 10Madhuvishy) [22:34:23] 06Operations, 10Gerrit, 06Release-Engineering-Team: Make sure replying to emails in gerrit 2.14 works - https://phabricator.wikimedia.org/T158915#3051483 (10Paladox) [22:34:32] 06Operations, 10Gerrit, 06Release-Engineering-Team: Make sure replying to emails in gerrit 2.14 works - https://phabricator.wikimedia.org/T158915#3051483 (10Paladox) p:05Low>03Lowest [22:35:52] (03PS1) 10Cmjohnson: decom servers strontium [puppet] - 10https://gerrit.wikimedia.org/r/339573 [22:36:59] (03PS2) 10Cmjohnson: Removing mentions of decom servers strontium [puppet] - 10https://gerrit.wikimedia.org/r/339573 [22:41:16] 06Operations, 10Gerrit, 06Release-Engineering-Team: Make sure replying to emails in gerrit 2.14 works - https://phabricator.wikimedia.org/T158915#3051542 (10demon) It won't work without config/setup on our end, no. And I think we should upgrade and //then// look into using it, not work on it beforehand (we d... [22:41:47] 06Operations, 10Gerrit, 06Release-Engineering-Team: Make sure replying to emails in gerrit 2.14 works - https://phabricator.wikimedia.org/T158915#3051558 (10Paladox) ok. 
[22:43:13] (03CR) 10Cmjohnson: [C: 032] Removing mentions of decom servers strontium [puppet] - 10https://gerrit.wikimedia.org/r/339573 (owner: 10Cmjohnson) [22:47:44] RECOVERY - puppet last run on cp3006 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [22:47:47] (03CR) 10Mobrovac: [C: 031] "No changes to AQS and codfw is looking good as well - https://puppet-compiler.wmflabs.org/5576/" [puppet] - 10https://gerrit.wikimedia.org/r/339501 (https://phabricator.wikimedia.org/T112648) (owner: 10Ppchelko) [22:54:39] (03PS1) 10Cmjohnson: Removing dns entries for several decom servers, strontium, antimony, lanthanum, carbon, neon, magnesium, tantalum, analytics1026 [dns] - 10https://gerrit.wikimedia.org/r/339576 [22:56:11] (03CR) 10Cmjohnson: [C: 032] Removing dns entries for several decom servers, strontium, antimony, lanthanum, carbon, neon, magnesium, tantalum, analytics1026 [dns] - 10https://gerrit.wikimedia.org/r/339576 (owner: 10Cmjohnson) [22:56:34] PROBLEM - Redis replication status tcp_6479 on rdb2005 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.32.133 on port 6479 [22:57:34] RECOVERY - Redis replication status tcp_6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 3970336 keys, up 115 days 14 hours - replication_delay is 42 [22:58:05] (03PS1) 10Madhuvishy: labstore: Cleanup old/unused labstore1001 nfs related puppet files [puppet] - 10https://gerrit.wikimedia.org/r/339577 (https://phabricator.wikimedia.org/T158196) [22:59:08] 06Operations, 10Mobile-Content-Service, 10ORES, 10Revision-Scoring-As-A-Service-Backlog, 06Services (watching): Limit resources used by ORES - https://phabricator.wikimedia.org/T146664#2667780 (10Ladsgroup) I don't know about disk quota but regarding RAM. This easily can be added in the puppet using [[ht... 
[23:00:31] 06Operations, 10Revision-Scoring-As-A-Service-Backlog: Set up oresrdb redis node in codfw - https://phabricator.wikimedia.org/T139372#2430068 (10Ladsgroup) Is there any plans on syncing codfw/eqiad redis nodes? It would be needed if the codfw starts to get traffic. [23:01:14] PROBLEM - Redis replication status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 653 600 - REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3977298 keys, up 115 days 14 hours - replication_delay is 653 [23:01:34] RECOVERY - puppet last run on eeden is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [23:06:06] (03PS2) 10Madhuvishy: labstore: Cleanup old/unused labstore1001 nfs related puppet files [puppet] - 10https://gerrit.wikimedia.org/r/339577 (https://phabricator.wikimedia.org/T158196) [23:08:14] PROBLEM - Misc HTTP 5xx reqs/min on graphite2001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [23:08:54] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [23:10:24] (03PS6) 10Andrew Bogott: Tools: Fully qualify hostnames [puppet] - 10https://gerrit.wikimedia.org/r/328451 (https://phabricator.wikimedia.org/T153608) (owner: 10Tim Landscheidt) [23:12:34] PROBLEM - Redis replication status tcp_6479 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 608 600 - REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 3970529 keys, up 115 days 14 hours - replication_delay is 608 [23:13:44] (03CR) 10Mattflaschen: [C: 04-1] "Shouldn't this only be done on the wikis where it's trained?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/339472 (https://phabricator.wikimedia.org/T137966) (owner: 10Catrope) [23:14:34] RECOVERY - Redis replication status tcp_6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 3969910 keys, up 115 days 14 hours - replication_delay is 0 [23:15:14] RECOVERY - Redis replication status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3969582 keys, up 115 days 14 hours - replication_delay is 0 [23:15:24] (03CR) 10Ladsgroup: [C: 031] "@Matt: The wikis here all trained goodfaith models (damaging and goodfaith models get trained together) I can not imagine a scenario where" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/339472 (https://phabricator.wikimedia.org/T137966) (owner: 10Catrope) [23:15:54] RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [23:17:14] RECOVERY - Misc HTTP 5xx reqs/min on graphite2001 is OK: OK: Less than 1.00% above the threshold [250.0] [23:20:24] PROBLEM - puppet last run on restbase-dev1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:21:06] (03CR) 10Volans: "Minor comments inline. [Disclaimer: I'm not familiar with the tool]" (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/337158 (https://phabricator.wikimedia.org/T156760) (owner: 10Nuria) [23:21:59] (03CR) 10Tim Landscheidt: [C: 04-1] "What is "inf protection"? And why only for toollabs::infrastructure and not the other uses of motd::script?" [puppet] - 10https://gerrit.wikimedia.org/r/339007 (owner: 10Rush) [23:24:12] (03CR) 10Mattflaschen: "Great, thanks for clarifying." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/339472 (https://phabricator.wikimedia.org/T137966) (owner: 10Catrope) [23:30:14] PROBLEM - Redis replication status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479 [23:31:14] RECOVERY - Redis replication status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3968899 keys, up 115 days 15 hours - replication_delay is 0 [23:32:03] (03CR) 10Tim Landscheidt: [C: 04-1] labstore: Install package nethogs from jessie-backports (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/334218 (owner: 10Madhuvishy) [23:32:09] (03CR) 10Tim Landscheidt: labstore: Install package nethogs from jessie-backports [puppet] - 10https://gerrit.wikimedia.org/r/334218 (owner: 10Madhuvishy) [23:33:40] (03CR) 10Madhuvishy: labstore: Install package nethogs from jessie-backports (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/334218 (owner: 10Madhuvishy) [23:38:57] (03PS3) 10Madhuvishy: labstore: Install package nethogs from jessie-backports [puppet] - 10https://gerrit.wikimedia.org/r/334218 [23:42:57] 06Operations, 06Labs, 13Patch-For-Review, 07Tracking: overhaul labstore setup [tracking] - https://phabricator.wikimedia.org/T126083#3051741 (10madhuvishy) [23:47:53] (03CR) 10Madhuvishy: [C: 032] labstore: Install package nethogs from jessie-backports [puppet] - 10https://gerrit.wikimedia.org/r/334218 (owner: 10Madhuvishy) [23:48:24] RECOVERY - puppet last run on restbase-dev1002 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures