[00:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170310T0000). Please do the needful. [00:00:04] James_F: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [00:00:31] Hello. [00:00:35] I can SWAT this evening. [00:00:39] Hey. [00:01:06] o/ [00:01:17] I've CR'ed the VE one, let's do the config meanwhile [00:01:49] PROBLEM - puppet last run on rhenium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:03:20] (CR) Dzahn: "http://puppet-compiler.wmflabs.org/5727/" [puppet] - https://gerrit.wikimedia.org/r/342128 (https://phabricator.wikimedia.org/T157785) (owner: Hashar) [00:03:49] (mwlog1001 so now for fatalmonitor) [00:04:47] (CR) Dereckson: [C: 2] (no-op) Remove setting of unused $wgPercentHHVM (no longer exists) [mediawiki-config] - https://gerrit.wikimedia.org/r/342149 (owner: Krinkle) [00:06:54] (CR) Mobrovac: RESTBase: Send the logs locally to stdout/syslog (1 comment) [puppet] - https://gerrit.wikimedia.org/r/342103 (https://phabricator.wikimedia.org/T112648) (owner: Mobrovac) [00:07:45] (CR) Hashar: "It is probably terribly wrong in one way or another. I am going to test it out on labs and polish it :}" [puppet] - https://gerrit.wikimedia.org/r/342128 (https://phabricator.wikimedia.org/T157785) (owner: Hashar) [00:08:35] (CR) Dzahn: "see inline comments. i think issue with variable names in manifest vs template.
compiler part looks good though" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/342128 (https://phabricator.wikimedia.org/T157785) (owner: Hashar) [00:08:36] It seems Zuul didn't pick the 342149 [00:09:06] (CR) Dzahn: "$directory / $base_path / @base_directory" [puppet] - https://gerrit.wikimedia.org/r/342128 (https://phabricator.wikimedia.org/T157785) (owner: Hashar) [00:09:10] (CR) Dereckson: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/342149 (owner: Krinkle) [00:10:56] ah it's because it has https://phabricator.wikimedia.org/rOMWC70772084f5a067f4352c7b691ce328cc6720859d as parent [00:11:10] Krinkle: ^ [00:11:23] this commit is declared as parent in your change, but it's not in master [00:11:33] Dereckson: They're both in swat [00:11:36] other way around I suppose [00:11:49] ok it's https://gerrit.wikimedia.org/r/#/c/342147/3 seen it [00:12:26] (CR) Dereckson: [C: 2] [noop] Move NavigationTiming config to EventLogging section [mediawiki-config] - https://gerrit.wikimedia.org/r/342147 (owner: Krinkle) [00:13:13] James_F: Krinkle: you wish to test them together or 342147 only first? [00:13:32] Dereckson: together [00:13:37] * Dereckson nods [00:13:44] (the two of mine together that is) [00:13:53] Mine doesn't matter.
[00:14:30] zuul is gating the two, we wait operations-mw-config-composer-hhvm-jessie [00:14:48] and for VE, we wait mwext-VisualEditor-npm-node-6-jessie [00:15:02] (Merged) jenkins-bot: [noop] Move NavigationTiming config to EventLogging section [mediawiki-config] - https://gerrit.wikimedia.org/r/342147 (owner: Krinkle) [00:15:06] (Merged) jenkins-bot: (no-op) Remove setting of unused $wgPercentHHVM (no longer exists) [mediawiki-config] - https://gerrit.wikimedia.org/r/342149 (owner: Krinkle) [00:15:24] 342147 and 342149 on mwdebug1002 [00:15:25] (CR) jenkins-bot: [noop] Move NavigationTiming config to EventLogging section [mediawiki-config] - https://gerrit.wikimedia.org/r/342147 (owner: Krinkle) [00:17:22] Dereckson: thx [00:18:25] Dereckson: Works fine [00:18:35] ack'ed [00:19:03] !log maxsem@tin Started deploy [tilerator/deploy@160f314]: https://gerrit.wikimedia.org/r/#/c/342153/ - revert submodule updates due to broken manik->libc dependency [00:19:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:19:20] !log maxsem@tin Finished deploy [tilerator/deploy@160f314]: https://gerrit.wikimedia.org/r/#/c/342153/ - revert submodule updates due to broken manik->libc dependency (duration: 00m 16s) [00:19:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:19:39] RECOVERY - tileratorui on maps-test2001 is OK: HTTP OK: HTTP/1.1 200 OK - 305 bytes in 0.089 second response time [00:19:39] RECOVERY - tilerator on maps-test2001 is OK: HTTP OK: HTTP/1.1 200 OK - 305 bytes in 0.093 second response time [00:20:00] greg-g, ^ [00:22:21] VE merged [00:22:29] !log dereckson@tin Synchronized wmf-config/CommonSettings.php: Move NavigationTiming config to EventLogging section + Remove setting of unused $wgPercentHHVM ([[Gerrit:342147]] and [[Gerrit:342149]], no-op) (duration: 00m 40s) [00:22:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:24:09] Yay, finally.
:-) [00:24:38] !log ppchelko@tin Started deploy [trending-edits/deploy@a5716b9]: Replayed events are purged based on current timestamp T160136 [00:24:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:24:44] T160136: Replayed events are purged based on current timestamp - https://phabricator.wikimedia.org/T160136 [00:24:52] James_F: VE change on mwdebug1002 [00:26:12] Hmm. Doesn't seem to be working. One moment. [00:27:03] according https://tools.wmflabs.org/versions/ it can be tested on every wiki [00:27:25] Certainly, it's not /worse/. [00:28:14] Both show the same git hash (of the branch cut), but IIRC that's not real any more. [00:29:15] * James_F tries debug. [00:29:49] RECOVERY - puppet last run on rhenium is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [00:31:56] !log ppchelko@tin Finished deploy [trending-edits/deploy@a5716b9]: Replayed events are purged based on current timestamp T160136 (duration: 07m 17s) [00:32:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:32:03] T160136: Replayed events are purged based on current timestamp - https://phabricator.wikimedia.org/T160136 [00:34:54] Meh. It doesn't seem to fix the bug. Dereckson, you can deploy it anyway, it doesn't make it worse. [00:35:12] (PS3) Dzahn: maintenance: provision /etc/wgetrc [puppet] - https://gerrit.wikimedia.org/r/341264 (https://phabricator.wikimedia.org/T159661) (owner: Dereckson) [00:35:45] James_F: does the new code leads to an existing, up-to-date API URL? [00:36:19] Dereckson: It's the correct code, it just doesn't fix it in prod the way it fixed it in test. Such is life. [00:36:39] (CR) jerkins-bot: [V: -1] maintenance: provision /etc/wgetrc [puppet] - https://gerrit.wikimedia.org/r/341264 (https://phabricator.wikimedia.org/T159661) (owner: Dereckson) [00:36:41] Okay. If it's the correct URL, yes, we can deploy it, I concur.
[00:36:56] Dereckson: i hope it's ok i'm being bold and just amend to yours [00:37:09] !log ppchelko@tin Started deploy [trending-edits/deploy@a5716b9]: Replayed events are purged based on current timestamp T160136 [00:37:10] not about to merge now [00:37:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:37:16] T160136: Replayed events are purged based on current timestamp - https://phabricator.wikimedia.org/T160136 [00:37:44] mutante: yes, thanks, I was working and on the road all the day and didn't had time to amend [00:38:45] !log dereckson@tin Synchronized php-1.29.0-wmf.15/extensions/VisualEditor/modules/ve-mw/init/ve.init.mw.ArticleTargetLoader.js: ArticleTargetLoader: wikitext switch shouldn't require FullRestbaseURL (T158692) (duration: 00m 41s) [00:38:46] cool, i will leave comments on it [00:38:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:38:51] T158692: Lost work when switching from wikitext to visual modes on wikitech and private wikis (not using RESTbase) - https://phabricator.wikimedia.org/T158692 [00:39:33] !log ppchelko@tin Finished deploy [trending-edits/deploy@a5716b9]: Replayed events are purged based on current timestamp T160136 (duration: 02m 23s) [00:39:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:39:41] (PS4) Dzahn: maintenance: provision /etc/wgetrc [puppet] - https://gerrit.wikimedia.org/r/341264 (https://phabricator.wikimedia.org/T159661) (owner: Dereckson) [00:42:47] (CR) Dzahn: "changes i made:" [puppet] - https://gerrit.wikimedia.org/r/341264 (https://phabricator.wikimedia.org/T159661) (owner: Dereckson) [00:44:02] done . and out for now, bbl [00:44:49] good evening [00:45:28] Dereckson: Thanks again.
[00:46:11] You're welcome [00:47:00] (CR) Dereckson: "I checked on wasat, I confirm codfw proxy works too:" [puppet] - https://gerrit.wikimedia.org/r/341264 (https://phabricator.wikimedia.org/T159661) (owner: Dereckson) [00:48:22] !log ppchelko@tin Started deploy [trending-edits/deploy@1673068]: Replayed events are purged based on current timestamp T160136 [00:48:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:48:28] T160136: Replayed events are purged based on current timestamp - https://phabricator.wikimedia.org/T160136 [00:54:46] !log ppchelko@tin Finished deploy [trending-edits/deploy@1673068]: Replayed events are purged based on current timestamp T160136 (duration: 06m 24s) [00:54:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:54:53] T160136: Replayed events are purged based on current timestamp - https://phabricator.wikimedia.org/T160136 [01:01:49] PROBLEM - puppet last run on osmium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [01:13:29] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack [01:15:46] (PS1) BryanDavis: toolschecker: remove precise checks [puppet] - https://gerrit.wikimedia.org/r/342161 (https://phabricator.wikimedia.org/T94792) [01:29:49] RECOVERY - puppet last run on osmium is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [02:07:39] PROBLEM - puppet last run on db1080 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [02:16:49] PROBLEM - Check systemd state on mw2247 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [02:26:59] PROBLEM - Check systemd state on mw2158 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[02:33:57] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.15) (duration: 12m 17s) [02:33:59] PROBLEM - Check systemd state on mw2155 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [02:34:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:35:40] RECOVERY - puppet last run on db1080 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [02:39:25] !log l10nupdate@tin ResourceLoader cache refresh completed at Fri Mar 10 02:39:25 UTC 2017 (duration 5m 28s) [02:39:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:46:42] Operations, Graphite, MW-1.27-release (WMF-deploy-2016-04-05_(1.27.0-wmf.20)), MW-1.27-release (WMF-deploy-2016-04-12_(1.27.0-wmf.21)), and 3 others: udp rcvbuferrors and inerrors on graphite1001 - https://phabricator.wikimedia.org/T101141#3090174 (aaron) [02:46:45] Operations, MediaWiki-JobRunner, Patch-For-Review, User-Addshore: jobrunner should send statsd in batches - https://phabricator.wikimedia.org/T132327#3090171 (aaron) Open>Resolved a:aaron Should be deployed now. I restarted one server's services manually to check it on ganglia/logs. T... [02:48:49] RECOVERY - Check systemd state on mw2247 is OK: OK - running: The system is fully operational [02:48:59] RECOVERY - Check systemd state on mw2158 is OK: OK - running: The system is fully operational [02:48:59] RECOVERY - Check systemd state on mw2155 is OK: OK - running: The system is fully operational [02:49:52] Operations, MediaWiki-JobRunner, Patch-For-Review, User-Addshore: jobrunner should send statsd in batches - https://phabricator.wikimedia.org/T132327#3090175 (aaron) >>! In T132327#3090171, @aaron wrote: > Should be deployed now. I restarted one server's services manually to check it on ganglia/l...
[02:51:50] !log Restarted job services for 510142425d268df (statsd batching) after monitoring mw1161 [02:51:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:54:59] PROBLEM - Check systemd state on mw2157 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [02:55:49] PROBLEM - Check systemd state on mw2248 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [02:57:59] PROBLEM - Check systemd state on mw2158 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [02:58:49] PROBLEM - Check systemd state on mw2250 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [03:00:59] PROBLEM - Check systemd state on mw2162 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [03:02:59] PROBLEM - Check systemd state on mw2155 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [03:03:13] (PS1) Dzahn: planet: convert to profile/role-structure [puppet] - https://gerrit.wikimedia.org/r/342163 [03:04:53] (CR) jerkins-bot: [V: -1] planet: convert to profile/role-structure [puppet] - https://gerrit.wikimedia.org/r/342163 (owner: Dzahn) [03:04:59] PROBLEM - Check systemd state on mw2154 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [03:04:59] PROBLEM - Check systemd state on mw2159 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [03:06:09] PROBLEM - Check systemd state on mw2156 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [03:13:59] PROBLEM - Check systemd state on mw2243 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[03:13:59] PROBLEM - Check systemd state on mw2160 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [03:14:09] PROBLEM - Check systemd state on mw2161 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [03:14:59] PROBLEM - Check systemd state on mw2153 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [03:16:49] PROBLEM - Check systemd state on mw2247 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [03:18:59] PROBLEM - Check systemd state on mw2249 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [03:44:22] (PS1) Dzahn: microsites: convert to profile/role-structure [puppet] - https://gerrit.wikimedia.org/r/342164 [03:47:06] those Icinga alerts are failed jobrunner service, but all of it is only codfw , and in SAL there are reinstalls [03:47:43] eh, scratch the SAL part, that isn't current [03:49:37] AaronSchulz: ^ [03:53:33] ah, an update failed. puppet ensured package upgrade and then the service failed [03:54:49] !log codfw appservers showing "systemd degraded" alerts are failed jobrunner service unit. after puppet-agent "Mediawiki::Jobrunner/Package[jobrunner]/ensure) ensure changed..." ..then jobrunner.service: main process exited, code=exited, status=143/n/a [03:54:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:56:54] mutante: why would a package upgrade trigger? [03:56:57] !log codfw appserver jobrunner service fail related to https://gerrit.wikimedia.org/r/#/c/259660/ ?
[03:57:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:57:11] AaronSchulz: dunno yet, just saw that in syslog [03:57:16] puppet-agent did it [03:57:52] Mar 10 02:17:30 mw2249 puppet-agent[98949]: (/Stage[main]/Mediawiki::Jobrunner/Package[jobrunner]/ensure) ensure changed 'a0e821661a107b5dbf4616b0f3570fdd93346010' to 'a1eb96c2f30b31cd05f1ef42e61cdfd1421f505a' [03:57:56] Mar 10 02:17:30 mw2249 puppet-agent[98949]: (/Package[jobrunner]) Scheduling refresh of Service[jobrunner] [03:57:59] Mar 10 02:17:30 mw2249 puppet-agent[98949]: (/Stage[main]/Mediawiki::Jobrunner/Base::Service_unit[jobrunner]/Service[jobrunner]) Triggered 'refresh' from 1 events [03:58:02] Mar 10 03:16:38 mw2249 systemd[1]: jobrunner.service: main process exited, code=exited, status=143/n/a [03:58:05] Mar 10 03:16:38 mw2249 systemd[1]: Unit jobrunner.service entered failed state. [03:58:08] Mar 10 03:16:38 mw2249 puppet-agent[140021]: (/Stage[main]/Mediawiki::Jobrunner/Base::Service_unit[jobrunner]/Service[jobrunner]/ensure) ensure changed 'running' to 'stopped' [03:59:24] AaronSchulz: does it make any sense that it would be related to that deploy? [03:59:57] started about an hour ago [04:00:33] which was during the salt restart of the two services, but well after the git deploy [04:01:45] looks like i can simply start it on this one host [04:01:51] want me to just start them? 
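Editor's note on the `status=143/n/a` in the syslog excerpt above: by the common shell convention, an exit code above 128 encodes 128 plus a signal number, so 143 would correspond to signal 15 (SIGTERM). That would fit a service that was told to stop rather than one that crashed on its own, but this reading is an assumption; the log does not say what sent the signal. A minimal Python sketch of the decoding, where `decode_exit_status` is a hypothetical helper name and not part of any tool used in the channel:

```python
import signal


def decode_exit_status(status: int) -> str:
    """Describe an exit status using the 128+N convention for signals.

    Assumption: the process (or its wrapper) follows the usual shell
    convention; a program may also return such a code on purpose.
    """
    if status > 128:
        signum = status - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # No signal with that number is known on this platform.
            name = f"unknown signal {signum}"
        return f"exit status {status}: likely terminated by {name}"
    return f"exit status {status}: plain exit code"


if __name__ == "__main__":
    # The value reported by systemd for jobrunner.service in the log.
    print(decode_exit_status(143))
```

Applied to the value in the log, the sketch points at SIGTERM, which matches the later puppet-agent line that changed the service's ensure state from 'running' to 'stopped'.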
[04:02:34] !log mw2249 systemctl start jobrunner - now Active: active (running) [04:02:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:02:43] sure [04:03:35] the running the program with --verbose itself looks fine on 2250 (as it does in eqiad) [04:05:36] mw2155, was: Active: failed but a simple "start" and it's working [04:06:36] icinga recovery would be nice now [04:07:13] ah, there is "jobchron" service too [04:07:33] and that is still failed, in the output of systemctl , which makes icinga unhappy [04:08:59] RECOVERY - Check systemd state on mw2155 is OK: OK - running: The system is fully operational [04:09:04] !log mw2155 - systemctl start jobchron, systemctl start jobrunner (both were failed but are now active (running) [04:09:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:09:59] RECOVERY - Check systemd state on mw2153 is OK: OK - running: The system is fully operational [04:10:59] RECOVERY - Check systemd state on mw2154 is OK: OK - running: The system is fully operational [04:10:59] RECOVERY - Check systemd state on mw2157 is OK: OK - running: The system is fully operational [04:11:09] RECOVERY - Check systemd state on mw2156 is OK: OK - running: The system is fully operational [04:11:59] RECOVERY - Check systemd state on mw2158 is OK: OK - running: The system is fully operational [04:11:59] RECOVERY - Check systemd state on mw2160 is OK: OK - running: The system is fully operational [04:11:59] RECOVERY - Check systemd state on mw2162 is OK: OK - running: The system is fully operational [04:11:59] RECOVERY - Check systemd state on mw2159 is OK: OK - running: The system is fully operational [04:12:19] !log more mw appservers ... 
- systemctl start jobchron, systemctl start jobrunner (both were failed but are now active (running) [04:12:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:13:59] PROBLEM - Check systemd state on mw2153 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [04:15:13] some don't have the jobchron.service [04:15:22] liek 2147,2148,2149 [04:18:09] RECOVERY - Check systemd state on mw2161 is OK: OK - running: The system is fully operational [04:18:49] ah, no, that was 2247-2249, not 2147-2149 [04:18:50] RECOVERY - Check systemd state on mw2250 is OK: OK - running: The system is fully operational [04:19:49] RECOVERY - Check systemd state on mw2248 is OK: OK - running: The system is fully operational [04:19:49] RECOVERY - Check systemd state on mw2247 is OK: OK - running: The system is fully operational [04:19:59] RECOVERY - Check systemd state on mw2249 is OK: OK - running: The system is fully operational [04:19:59] RECOVERY - Check systemd state on mw2153 is OK: OK - running: The system is fully operational [04:20:59] RECOVERY - Check systemd state on mw2243 is OK: OK - running: The system is fully operational [04:21:08] AaronSchulz: that's all now per Icinga, it's green again [04:22:17] goes afk-ish again [04:23:29] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=335.60 Read Requests/Sec=2380.00 Write Requests/Sec=781.30 KBytes Read/Sec=27552.00 KBytes_Written/Sec=15040.00 [04:25:16] ok, thanks [04:25:49] PROBLEM - Check systemd state on mw2248 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [04:25:59] PROBLEM - Check systemd state on mw2157 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [04:26:22] .. hrmm [04:26:59] PROBLEM - Check systemd state on mw2158 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[04:27:49] PROBLEM - Check systemd state on mw2250 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [04:29:55] !log codfw mw jobrunner: they start but then fail again shortly after: mw2248 jobrunner[67314]: [Fri Mar 10 04:23:07 2017] [hphp] [67314:7f6a34b746c0:0:000024] [] LightProcess::closeShadow failed due to exception: Failed in afdt::sendRaw: Broken pipe [04:30:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:30:01] AaronSchulz: ^ :/ [04:30:59] PROBLEM - Check systemd state on mw2162 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [04:32:59] PROBLEM - Check systemd state on mw2155 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [04:33:59] PROBLEM - Check systemd state on mw2159 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [04:34:29] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=87.70 Read Requests/Sec=200.10 Write Requests/Sec=1.30 KBytes Read/Sec=1878.00 KBytes_Written/Sec=358.80 [04:35:59] PROBLEM - Check systemd state on mw2154 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [04:37:09] PROBLEM - Check systemd state on mw2156 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [04:42:59] PROBLEM - Check systemd state on mw2160 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [04:43:59] PROBLEM - Check systemd state on mw2243 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [04:44:01] PROBLEM - Check systemd state on mw2153 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [04:44:09] PROBLEM - Check systemd state on mw2161 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
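Editor's note: the "they start but then fail again shortly after" pattern can be checked against the Icinga notifications themselves. The sketch below is only an illustration in Python; it assumes the notification wording stays exactly as it appears in this log, and `flapping_hosts` and `latest_states` are hypothetical helper names.

```python
import re

# Matches the Icinga notification shape seen in this log, e.g.
# "[04:08:59] RECOVERY - Check systemd state on mw2155 is OK: ..."
LINE_RE = re.compile(
    r"\[(\d{2}:\d{2}:\d{2})\] (PROBLEM|RECOVERY) - Check systemd state on (\S+) is"
)


def latest_states(log_text: str) -> dict:
    """Return {host: (time, kind)} for the last notification per host."""
    states = {}
    for time, kind, host in LINE_RE.findall(log_text):
        states[host] = (time, kind)  # later matches overwrite earlier ones
    return states


def flapping_hosts(log_text: str) -> list:
    """Hosts that recovered at least once and then reported PROBLEM again."""
    seen_recovery = set()
    flapped = []
    for _, kind, host in LINE_RE.findall(log_text):
        if kind == "RECOVERY":
            seen_recovery.add(host)
        elif host in seen_recovery and host not in flapped:
            flapped.append(host)
    return flapped


if __name__ == "__main__":
    # Four notifications copied from the log above, joined on one line.
    sample = (
        "[04:08:59] RECOVERY - Check systemd state on mw2155 is OK: OK - running: The system is fully operational "
        "[04:19:49] RECOVERY - Check systemd state on mw2248 is OK: OK - running: The system is fully operational "
        "[04:25:49] PROBLEM - Check systemd state on mw2248 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. "
        "[04:32:59] PROBLEM - Check systemd state on mw2155 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed."
    )
    print(flapping_hosts(sample))
    print(latest_states(sample))
```

Ordering by first re-failure keeps the output aligned with the order in which the alerts came back in the channel.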
[04:45:49] PROBLEM - Check systemd state on mw2247 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [04:48:59] PROBLEM - Check systemd state on mw2249 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [04:57:18] Operations, MediaWiki-JobRunner: jobrunner/jobchron services fail in codfw - https://phabricator.wikimedia.org/T160146#3090230 (Dzahn) [04:58:26] ACKNOWLEDGEMENT - Check systemd state on mw2153 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn https://phabricator.wikimedia.org/T160146 [04:58:26] ACKNOWLEDGEMENT - Check systemd state on mw2154 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn https://phabricator.wikimedia.org/T160146 [04:58:26] ACKNOWLEDGEMENT - Check systemd state on mw2155 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn https://phabricator.wikimedia.org/T160146 [04:58:26] ACKNOWLEDGEMENT - Check systemd state on mw2156 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn https://phabricator.wikimedia.org/T160146 [04:58:26] ACKNOWLEDGEMENT - Check systemd state on mw2157 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn https://phabricator.wikimedia.org/T160146 [04:58:27] ACKNOWLEDGEMENT - Check systemd state on mw2158 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn https://phabricator.wikimedia.org/T160146 [04:58:27] ACKNOWLEDGEMENT - Check systemd state on mw2159 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
daniel_zahn https://phabricator.wikimedia.org/T160146 [04:58:28] ACKNOWLEDGEMENT - Check systemd state on mw2160 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn https://phabricator.wikimedia.org/T160146 [04:58:29] ACKNOWLEDGEMENT - Check systemd state on mw2161 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn https://phabricator.wikimedia.org/T160146 [04:58:29] ACKNOWLEDGEMENT - Check systemd state on mw2162 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn https://phabricator.wikimedia.org/T160146 [04:58:29] ACKNOWLEDGEMENT - Check systemd state on mw2243 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn https://phabricator.wikimedia.org/T160146 [04:58:30] ACKNOWLEDGEMENT - Check systemd state on mw2247 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn https://phabricator.wikimedia.org/T160146 [04:58:30] ACKNOWLEDGEMENT - Check systemd state on mw2248 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn https://phabricator.wikimedia.org/T160146 [04:58:31] ACKNOWLEDGEMENT - Check systemd state on mw2249 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
daniel_zahn https://phabricator.wikimedia.org/T160146 [04:58:53] acknowledged [05:00:07] ACKNOWLEDGEMENT - Disk space on prometheus1003 is CRITICAL: DISK CRITICAL - free space: / 115 MB (0% inode=51%): daniel_zahn not in prod yet and known [05:00:07] ACKNOWLEDGEMENT - Disk space on prometheus1004 is CRITICAL: DISK CRITICAL - free space: / 1045 MB (3% inode=51%): daniel_zahn not in prod yet and known [05:22:33] Operations, MediaWiki-JobRunner: jobrunner/jobchron services fail in codfw - https://phabricator.wikimedia.org/T160146#3090249 (aaron) Probably trebuchet/puppet breakage. I wonder if https://phabricator.wikimedia.org/T129148 would handle this. [06:03:39] PROBLEM - puppet last run on ocg1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [06:29:04] Operations, IDS-extension, Wikimedia-Extension-setup, I18n: Deploy IDS rendering engine to production - https://phabricator.wikimedia.org/T148693#3090260 (Shoichi) About cache,after discussion with upstream author , cache put in production server side is better than put in wikis-sites side. No ma... [06:32:39] RECOVERY - puppet last run on ocg1003 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [06:59:50] PROBLEM - puppet last run on dbproxy1009 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues [07:02:59] (PS1) Marostegui: db-eqiad.php: Depool db1022 [mediawiki-config] - https://gerrit.wikimedia.org/r/342169 (https://phabricator.wikimedia.org/T159414) [07:06:30] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1022 [mediawiki-config] - https://gerrit.wikimedia.org/r/342169 (https://phabricator.wikimedia.org/T159414) (owner: Marostegui) [07:08:03] (Merged) jenkins-bot: db-eqiad.php: Depool db1022 [mediawiki-config] - https://gerrit.wikimedia.org/r/342169 (https://phabricator.wikimedia.org/T159414) (owner: Marostegui) [07:08:39] PROBLEM - puppet last run on db1092 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [07:09:34] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1022 - T159414 (duration: 00m 41s) [07:09:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:09:41] T159414: Rampant differences in indexes and PK on s6 (frwiki, jawiki, ruwiki) for revision table - https://phabricator.wikimedia.org/T159414 [07:09:45] (PS1) Marostegui: db-eqiad.php: Weight 1 for db1061 [mediawiki-config] - https://gerrit.wikimedia.org/r/342170 [07:11:25] (CR) Marostegui: [C: 2] db-eqiad.php: Weight 1 for db1061 [mediawiki-config] - https://gerrit.wikimedia.org/r/342170 (owner: Marostegui) [07:12:44] (Merged) jenkins-bot: db-eqiad.php: Weight 1 for db1061 [mediawiki-config] - https://gerrit.wikimedia.org/r/342170 (owner: Marostegui) [07:13:48] !log Deploy alter table s6 revision table on db1022 - T159414 [07:13:50] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Added weight 1 for db1061 - T159414 (duration: 00m 40s) [07:13:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:13:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:21:15] (PS1) Marostegui: dbstore2.my.cnf: Add replication filter for wikis
[puppet] - https://gerrit.wikimedia.org/r/342171 (https://phabricator.wikimedia.org/T159707) [07:22:49] RECOVERY - puppet last run on dbproxy1009 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [07:24:02] (CR) Marostegui: "This looks good: https://puppet-compiler.wmflabs.org/5728/" [puppet] - https://gerrit.wikimedia.org/r/342171 (https://phabricator.wikimedia.org/T159707) (owner: Marostegui) [07:24:09] (CR) Jcrespo: "I am not sure about this- will we remember to replicate other dbs when they are created?" [puppet] - https://gerrit.wikimedia.org/r/342171 (https://phabricator.wikimedia.org/T159707) (owner: Marostegui) [07:26:16] (CR) Jcrespo: "e.g.: https://www.mediawiki.org/wiki/Extension:Cognate which happens to contain wik." [puppet] - https://gerrit.wikimedia.org/r/342171 (https://phabricator.wikimedia.org/T159707) (owner: Marostegui) [07:26:18] (CR) Marostegui: "Maybe we can just remove all replication filters and ignore mysql database only?" [puppet] - https://gerrit.wikimedia.org/r/342171 (https://phabricator.wikimedia.org/T159707) (owner: Marostegui) [07:28:29] !log upgrading libarchive on trusty systems (jessie already fixed) [07:28:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:29:13] (CR) Jcrespo: "mysql, ops, sys and trash?
(I do not know why there is a trash db)" [puppet] - https://gerrit.wikimedia.org/r/342171 (https://phabricator.wikimedia.org/T159707) (owner: Marostegui) [07:30:25] (CR) Marostegui: "yeah I would say: mysql,ops, trash, sys and percona (I have see the percona database somewhere I reckon)" [puppet] - https://gerrit.wikimedia.org/r/342171 (https://phabricator.wikimedia.org/T159707) (owner: Marostegui) [07:30:48] (CR) Jcrespo: "ok" [puppet] - https://gerrit.wikimedia.org/r/342171 (https://phabricator.wikimedia.org/T159707) (owner: Marostegui) [07:33:09] PROBLEM - Check HHVM threads for leakage on mw1169 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers [07:34:36] (PS2) Marostegui: dbstore2.my.cnf: Add replication ignore filters [puppet] - https://gerrit.wikimedia.org/r/342171 (https://phabricator.wikimedia.org/T159707) [07:36:08] (CR) Jcrespo: [C: 1] dbstore2.my.cnf: Add replication ignore filters [puppet] - https://gerrit.wikimedia.org/r/342171 (https://phabricator.wikimedia.org/T159707) (owner: Marostegui) [07:36:35] (CR) Marostegui: [C: 2] dbstore2.my.cnf: Add replication ignore filters [puppet] - https://gerrit.wikimedia.org/r/342171 (https://phabricator.wikimedia.org/T159707) (owner: Marostegui) [07:36:40] RECOVERY - puppet last run on db1092 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [07:45:47] Operations, Gerrit, Labs, Release-Engineering-Team, LDAP: Remove user gerrit2 from ldap - https://phabricator.wikimedia.org/T160122#3090315 (MoritzMuehlenhoff) @demon Can you confirm that the user is no longer needed? [08:03:15] (PS1) Muehlenhoff: Add another NDA account [puppet] - https://gerrit.wikimedia.org/r/342173 [08:09:29] (CR) Muehlenhoff: [C: 2] Add another NDA account [puppet] - https://gerrit.wikimedia.org/r/342173 (owner: Muehlenhoff) [08:14:37] (CR) Muehlenhoff: "Why is that needed?
Services are stopped prior to removal, so if there's still gerrit2 processing lingering around, then a standard servic" [debs/gerrit] - 10https://gerrit.wikimedia.org/r/342082 (owner: 10Paladox) [08:19:19] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [08:21:10] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. [08:36:39] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack [08:39:39] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack [08:40:47] (03PS3) 10Jcrespo: mariadb: Decouple core (mediawiki) role on a separate file [puppet] - 10https://gerrit.wikimedia.org/r/342054 (https://phabricator.wikimedia.org/T150850) [08:49:52] (03CR) 10jenkins-bot: db-eqiad.php: Weight 1 for db1061 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342170 (owner: 10Marostegui) [08:52:34] (03CR) 10Jcrespo: [C: 032] "https://puppet-compiler.wmflabs.org/5729/" [puppet] - 10https://gerrit.wikimedia.org/r/342054 (https://phabricator.wikimedia.org/T150850) (owner: 10Jcrespo) [08:52:58] (03PS2) 10Jcrespo: mariadb: Decouple labsdb mariadb role (deprecated) to a separate file [puppet] - 10https://gerrit.wikimedia.org/r/342060 (https://phabricator.wikimedia.org/T150850) [08:55:49] PROBLEM - puppet last run on db1072 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [08:56:49] RECOVERY - puppet last run on db1072 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [08:59:57] 06Operations, 10Monitoring: Add node_exporter ipvs ipv6 support - https://phabricator.wikimedia.org/T160156#3090494 (10hashar) [09:00:05] (03PS1) 10Filippo Giunchedi: Enable ipvs node_exporter collector on lvs boxes [puppet] - 10https://gerrit.wikimedia.org/r/342175 [09:02:21] 06Operations, 10Monitoring, 10Traffic: Add node_exporter ipvs ipv6 support - https://phabricator.wikimedia.org/T160156#3090513 (10ema) [09:02:45] 06Operations, 10Monitoring, 10Traffic: Add node_exporter ipvs ipv6 support - https://phabricator.wikimedia.org/T160156#3090481 (10ema) p:05Triage>03Normal [09:03:03] (03Abandoned) 10Filippo Giunchedi: graphite: switch graphite alerts to graphite2001 [puppet] - 10https://gerrit.wikimedia.org/r/335765 (https://phabricator.wikimedia.org/T157022) (owner: 10Filippo Giunchedi) [09:04:27] (03CR) 10Jcrespo: [C: 04-1] "https://puppet-compiler.wmflabs.org/5730/ Is that in hiera or where?" [puppet] - 10https://gerrit.wikimedia.org/r/342060 (https://phabricator.wikimedia.org/T150850) (owner: 10Jcrespo) [09:07:39] (03PS3) 10Jcrespo: mariadb: Decouple labsdb mariadb role (deprecated) to a separate file [puppet] - 10https://gerrit.wikimedia.org/r/342060 (https://phabricator.wikimedia.org/T150850) [09:07:47] 06Operations, 06Office-IT, 07LDAP: Make disabled accounts visible in the corp mirror LDAP replica - https://phabricator.wikimedia.org/T160158#3090521 (10MoritzMuehlenhoff) [09:07:59] 06Operations, 06Office-IT, 07LDAP: Make disabled accounts visible in the corp mirror LDAP replica - https://phabricator.wikimedia.org/T160158#3090533 (10MoritzMuehlenhoff) p:05Triage>03Normal [09:10:43] PROBLEM - MariaDB Slave Lag: s3 on db1038 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 309.25 seconds [09:11:27] what is going on? 
[09:11:50] don't know, checking [09:12:29] 621 | Copying to tmp table on disk | SELECT /* MostlinkedPage::reallyDoQuery */ pl_namespace AS `namespace`,pl_title AS `title`,COUNT(* [09:12:44] slow queries causing issues? [09:12:49] not normal [09:12:56] raid looking good [09:13:40] srwiki [09:13:56] (03CR) 10Filippo Giunchedi: "LGTM, though note that this is racy on boot, i.e. it needs a first puppet run to work. The fix for redis jessie machines (i.e. all but rcs" [puppet] - 10https://gerrit.wikimedia.org/r/268598 (owner: 10Ori.livneh) [09:14:02] or is it whatever is happening on kshwiki [09:15:38] kshwiki connection is gone now [09:15:54] ah no, it is there still [09:16:48] I can kill it [09:17:04] (03CR) 10Filippo Giunchedi: [C: 031] Optional Cassandra client encryption; Enabled on RESTBase Staging [puppet] - 10https://gerrit.wikimedia.org/r/342075 (https://phabricator.wikimedia.org/T111113) (owner: 10Eevans) [09:17:17] sure, go ahead [09:17:19] but it would fit more a disk issue [09:17:39] something on hw happens but only shows when heavy io [09:17:39] i was checking bbu/raid [09:17:44] (like a tmp table) [09:17:49] bbu and policy look good [09:17:52] yes, I trust you [09:17:55] I am commenting it [09:18:10] there are several disks with errors [09:18:16] but might be old [09:18:44] I think it is the long running query [09:18:50] only change since then [09:18:55] is the innodb purge lag [09:19:14] io in fact is lower since a few minutes ago [09:19:23] which again would fit hw-caused issues [09:19:46] fsyncs are high, though [09:20:03] much more disk writes [09:20:29] I think the right way to handle this is to make the server non-transactionally safe [09:20:34] to reduce fsyncs [09:20:55] you want me to set it to 2 maybe? [09:20:55] ok with that? [09:21:03] no, never 2 [09:21:09] not with a hw raid [09:21:31] setting it to 0 [09:21:32] now [09:21:35] ok!
[09:21:56] lag decreasing [09:22:02] let's see how it responds [09:22:42] RECOVERY - MariaDB Slave Lag: s3 on db1038 is OK: OK slave_sql_lag Replication lag: 0.33 seconds [09:23:24] (03CR) 10Filippo Giunchedi: [C: 04-1] "This wouldn't work because PDUs in codfw use a different snmp community and facilities::monitor_pdu_service only knows about $pdu_snmp_pas" [puppet] - 10https://gerrit.wikimedia.org/r/341534 (https://phabricator.wikimedia.org/T148541) (owner: 10Filippo Giunchedi) [09:24:41] the disk errors do not increase [09:25:49] maybe we should set that on all slow slaves [09:26:07] and reimage when it crashes [09:26:28] we only have db1038 complaining so far [09:26:45] I would do so if we had more issues like this [09:43:09] PROBLEM - Check HHVM threads for leakage on mw1169 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers [09:43:37] 06Operations, 13Patch-For-Review: Cross-validation of account data - https://phabricator.wikimedia.org/T142836#3090586 (10MoritzMuehlenhoff) [09:44:36] 06Operations: Enhance account handling (meta bug) - https://phabricator.wikimedia.org/T142815#3090591 (10MoritzMuehlenhoff) [09:44:38] 06Operations, 13Patch-For-Review: Cross-validation of account data - https://phabricator.wikimedia.org/T142836#2547672 (10MoritzMuehlenhoff) 05Open>03Resolved The cross-checking is running daily on terbium. It implements the following checks: - Every account in the privileged "wmf" group should be registe... [09:51:27] (03PS4) 10Jcrespo: mariadb: Decouple labsdb mariadb role (deprecated) to a separate file [puppet] - 10https://gerrit.wikimedia.org/r/342060 (https://phabricator.wikimedia.org/T150850) [09:54:49] PROBLEM - puppet last run on db1074 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues [09:57:09] PROBLEM - Check HHVM threads for leakage on mw1169 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers [09:57:27] (03CR) 10Jcrespo: [C: 031] "This should be ok now https://puppet-compiler.wmflabs.org/5732/labsdb1001.eqiad.wmnet/ , but I want the ok from the change of debdeploy." [puppet] - 10https://gerrit.wikimedia.org/r/342060 (https://phabricator.wikimedia.org/T150850) (owner: 10Jcrespo) [09:59:42] 06Operations: Enhance account handling (meta bug) - https://phabricator.wikimedia.org/T142815#3090609 (10MoritzMuehlenhoff) [09:59:45] 06Operations: Require/track Phabricator username - https://phabricator.wikimedia.org/T142830#3090605 (10MoritzMuehlenhoff) 05Open>03declined This isn't needed anymore. NDA management has changed towards a new workflow which doesn't rely on Phabricator any longer. One remaining use case if for the offboardi... [10:01:01] (03PS2) 10Jcrespo: mariadb: Decouple mariadb wikitech role to a separate file [puppet] - 10https://gerrit.wikimedia.org/r/342064 (https://phabricator.wikimedia.org/T150850) [10:01:50] (03CR) 10Gehel: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/336840 (owner: 10Hashar) [10:02:29] (03CR) 10Gehel: [C: 031] "Only make sense with the child change, but LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/340420 (owner: 10Hashar) [10:05:09] 06Operations, 10DBA, 13Patch-For-Review: Install and reimage dbstore1001 as jessie - https://phabricator.wikimedia.org/T153768#3090617 (10Marostegui) The most recent backups for dbstore1001 are still only from Feb, and not March: ``` +--------+-------+----------+-------------------+---------------------+----... 
[10:10:16] (03CR) 10Jcrespo: [C: 032] mariadb: Decouple mariadb wikitech role to a separate file [puppet] - 10https://gerrit.wikimedia.org/r/342064 (https://phabricator.wikimedia.org/T150850) (owner: 10Jcrespo) [10:12:43] (03CR) 10Giuseppe Lavagetto: [C: 031] facilities: add row and site parameters for pdus [puppet] - 10https://gerrit.wikimedia.org/r/341533 (https://phabricator.wikimedia.org/T148541) (owner: 10Filippo Giunchedi) [10:15:49] (03PS2) 10DCausse: [cirrus] Config update for elasticsearch 5.x in beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342040 (owner: 10EBernhardson) [10:17:32] (03CR) 10Hashar: [C: 032] [cirrus] Config update for elasticsearch 5.x in beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342040 (owner: 10EBernhardson) [10:19:07] (03Merged) 10jenkins-bot: [cirrus] Config update for elasticsearch 5.x in beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342040 (owner: 10EBernhardson) [10:22:49] RECOVERY - puppet last run on db1074 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [10:27:35] (03PS1) 10Muehlenhoff: Don't make realname optional in account check script [puppet] - 10https://gerrit.wikimedia.org/r/342187 [10:27:46] (03PS2) 10Muehlenhoff: Don't make realname optional in account check script [puppet] - 10https://gerrit.wikimedia.org/r/342187 [10:28:15] (03CR) 10Hashar: "I keep it simple so that can be merged and used as a foundation to write other tests." 
[puppet] - 10https://gerrit.wikimedia.org/r/340420 (owner: 10Hashar) [10:29:39] (03PS1) 10Jcrespo: mariadb: Decouple maintenance mariadb role to a separate file [puppet] - 10https://gerrit.wikimedia.org/r/342188 (https://phabricator.wikimedia.org/T150850) [10:33:08] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Decouple maintenance mariadb role to a separate file [puppet] - 10https://gerrit.wikimedia.org/r/342188 (https://phabricator.wikimedia.org/T150850) (owner: 10Jcrespo) [10:35:31] (03PS2) 10Jcrespo: mariadb: Decouple maintenance mariadb role to a separate file [puppet] - 10https://gerrit.wikimedia.org/r/342188 (https://phabricator.wikimedia.org/T150850) [10:38:35] jenkins may be getting a bit overloaded [10:38:58] it is starting to 503 me [10:39:53] (03CR) 10Jcrespo: [C: 032] "https://puppet-compiler.wmflabs.org/5735/terbium.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/342188 (https://phabricator.wikimedia.org/T150850) (owner: 10Jcrespo) [10:40:48] jynus: happened to me around 8am too [10:48:09] PROBLEM - Check HHVM threads for leakage on mw1169 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers [10:50:39] (03CR) 10Muehlenhoff: [C: 032] Don't make realname optional in account check script [puppet] - 10https://gerrit.wikimedia.org/r/342187 (owner: 10Muehlenhoff) [10:50:43] (03PS3) 10Muehlenhoff: Don't make realname optional in account check script [puppet] - 10https://gerrit.wikimedia.org/r/342187 [10:51:09] (03CR) 10Giuseppe Lavagetto: [C: 031] Add support for batch processing [software/cumin] - 10https://gerrit.wikimedia.org/r/341310 (https://phabricator.wikimedia.org/T159968) (owner: 10Volans) [10:51:23] (03CR) 10Muehlenhoff: [V: 032 C: 032] Don't make realname optional in account check script [puppet] - 10https://gerrit.wikimedia.org/r/342187 (owner: 10Muehlenhoff) [10:51:26] (03PS5) 10Volans: Add support for batch processing [software/cumin] - 10https://gerrit.wikimedia.org/r/341310 
(https://phabricator.wikimedia.org/T159968) [10:53:08] (03CR) 10Hashar: "Thanks!! :]" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/342128 (https://phabricator.wikimedia.org/T157785) (owner: 10Hashar) [10:53:23] (03PS2) 10Hashar: (WIP) contint: migrate git-daemon to systemd [puppet] - 10https://gerrit.wikimedia.org/r/342128 (https://phabricator.wikimedia.org/T157785) [10:55:32] (03CR) 10Volans: [C: 032] "Improved some comments in the last patch set." [software/cumin] - 10https://gerrit.wikimedia.org/r/341310 (https://phabricator.wikimedia.org/T159968) (owner: 10Volans) [10:58:59] PROBLEM - puppet last run on mw1199 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:04:01] (03Merged) 10jenkins-bot: Add support for batch processing [software/cumin] - 10https://gerrit.wikimedia.org/r/341310 (https://phabricator.wikimedia.org/T159968) (owner: 10Volans) [11:06:09] PROBLEM - Check HHVM threads for leakage on mw1169 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers [11:07:44] (03PS1) 10Elukey: Add new MW appservers and api-appservers in codfw [puppet] - 10https://gerrit.wikimedia.org/r/342194 (https://phabricator.wikimedia.org/T155180) [11:09:12] (03PS2) 10Elukey: Add new MW appservers and api-appservers in codfw [puppet] - 10https://gerrit.wikimedia.org/r/342194 (https://phabricator.wikimedia.org/T155180) [11:12:12] (03CR) 10Hashar: (WIP) contint: migrate git-daemon to systemd (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/342128 (https://phabricator.wikimedia.org/T157785) (owner: 10Hashar) [11:12:28] (03PS3) 10Hashar: (WIP) contint: migrate git-daemon to systemd [puppet] - 10https://gerrit.wikimedia.org/r/342128 (https://phabricator.wikimedia.org/T157785) [11:13:31] (03PS4) 10Hashar: (WIP) contint: migrate git-daemon to systemd [puppet] - 10https://gerrit.wikimedia.org/r/342128 (https://phabricator.wikimedia.org/T157785) [11:15:21] (03PS5) 
10Hashar: (WIP) contint: migrate git-daemon to systemd [puppet] - 10https://gerrit.wikimedia.org/r/342128 (https://phabricator.wikimedia.org/T157785) [11:17:30] (03PS6) 10Hashar: (WIP) contint: migrate git-daemon to systemd [puppet] - 10https://gerrit.wikimedia.org/r/342128 (https://phabricator.wikimedia.org/T157785) [11:18:46] (03PS7) 10Hashar: (WIP) contint: migrate git-daemon to systemd [puppet] - 10https://gerrit.wikimedia.org/r/342128 (https://phabricator.wikimedia.org/T157785) [11:19:08] (yeah I should do that directly on the puppet master) [11:19:11] sorry bout the spam [11:23:09] PROBLEM - Check HHVM threads for leakage on mw1169 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers [11:26:59] RECOVERY - puppet last run on mw1199 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [11:43:55] (03PS8) 10Hashar: contint: migrate git-daemon to systemd [puppet] - 10https://gerrit.wikimedia.org/r/342128 (https://phabricator.wikimedia.org/T157785) [11:52:25] (03CR) 10Muehlenhoff: contint: migrate git-daemon to systemd (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/342128 (https://phabricator.wikimedia.org/T157785) (owner: 10Hashar) [11:58:09] PROBLEM - Check HHVM threads for leakage on mw1169 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers [12:20:03] 06Operations, 10Continuous-Integration-Config: Enforce reference to Phabricator task for all commits to modules/admin/data/data.yaml - https://phabricator.wikimedia.org/T142827#3090778 (10hashar) [12:20:52] 06Operations, 07HHVM, 13Patch-For-Review, 07Upstream: Build / migrate to HHVM 3.18 - https://phabricator.wikimedia.org/T158176#3090782 (10MoritzMuehlenhoff) [12:30:52] (03CR) 10Hashar: contint: migrate git-daemon to systemd (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/342128 (https://phabricator.wikimedia.org/T157785) (owner: 10Hashar) [12:31:04] (03PS9) 
10Hashar: contint: migrate git-daemon to systemd [puppet] - 10https://gerrit.wikimedia.org/r/342128 (https://phabricator.wikimedia.org/T157785) [12:53:22] (03PS3) 10Elukey: Add new MW appservers and api-appservers in codfw [puppet] - 10https://gerrit.wikimedia.org/r/342194 (https://phabricator.wikimedia.org/T155180) [13:08:09] 06Operations, 07HHVM, 13Patch-For-Review, 07Upstream: Build / migrate to HHVM 3.18 - https://phabricator.wikimedia.org/T158176#3090851 (10MoritzMuehlenhoff) >>! In T158176#3074551, @MoritzMuehlenhoff wrote: > The test failure is benign; it tests a new feature introduced into the simple JSON parser uncondit... [13:11:09] PROBLEM - Check HHVM threads for leakage on mw1169 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers [13:18:35] (03PS4) 10Elukey: Add new MW appservers and api-appservers in codfw [puppet] - 10https://gerrit.wikimedia.org/r/342194 (https://phabricator.wikimedia.org/T155180) [13:20:05] 06Operations, 06Analytics-Kanban, 10Traffic, 06Wikipedia-iOS-App-Backlog, and 2 others: Periodic 500s from piwik.wikimedia.org - https://phabricator.wikimedia.org/T154558#3090867 (10elukey) I followed https://piwik.org/docs/optimize-how-to/ and applied `set global innodb_flush_log_at_trx_commit=2;` as root... [13:21:49] PROBLEM - puppet last run on mc1035 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:22:09] PROBLEM - Check HHVM threads for leakage on mw1169 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers [13:27:09] PROBLEM - puppet last run on db1077 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:35:40] !log Restarting Jenkins. Deadlocks in ssh connections. 
T160168 [13:35:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:47] T160168: Zuul postmerge blocked on beta-mediawiki-config-update-eqiad - https://phabricator.wikimedia.org/T160168 [13:40:45] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1022 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342169 (https://phabricator.wikimedia.org/T159414) (owner: 10Marostegui) [13:40:47] (03CR) 10jenkins-bot: (no-op) Remove setting of unused $wgPercentHHVM (no longer exists) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342149 (owner: 10Krinkle) [13:40:51] (03CR) 10jenkins-bot: [cirrus] Config update for elasticsearch 5.x in beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342040 (owner: 10EBernhardson) [13:41:53] (03CR) 10Muehlenhoff: [C: 031] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/342128 (https://phabricator.wikimedia.org/T157785) (owner: 10Hashar) [13:42:24] moritzm: thanks for the review! wanna merge it ? :) I can baby sit it on prod boxes [13:44:46] currently need to finish something else, will ping you in 15-30 mins? [13:44:52] sure! 
[13:48:03] (03PS10) 10Zppix: contint: migrate git-daemon to systemd [puppet] - 10https://gerrit.wikimedia.org/r/342128 (https://phabricator.wikimedia.org/T157785) (owner: 10Hashar) [13:48:11] (03CR) 10Zppix: [C: 031] contint: migrate git-daemon to systemd [puppet] - 10https://gerrit.wikimedia.org/r/342128 (https://phabricator.wikimedia.org/T157785) (owner: 10Hashar) [13:49:16] jouncebot: next [13:49:16] In 65 hour(s) and 10 minute(s): Holiday (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170313T0700) [13:49:51] RECOVERY - puppet last run on mc1035 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [13:49:53] out of curiosity should that be something jouncebot really needs to detect as a deployment [13:52:27] (03CR) 10Elukey: [C: 032] "Looks good from https://puppet-compiler.wmflabs.org/5739/" [puppet] - 10https://gerrit.wikimedia.org/r/342194 (https://phabricator.wikimedia.org/T155180) (owner: 10Elukey) [13:54:09] RECOVERY - puppet last run on db1077 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [13:54:21] waiting a bit before merging, last check in neodym [13:54:33] ping me if you are in a hurry :) [13:55:55] merged [13:58:27] !log added 3 new MW api-appservers (mw2251-53) and 7 new appservers (mw2254-60) to codfw [13:58:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:48] running puppet on a couple of them to see if everything works fine [14:01:19] PROBLEM - DPKG on mw2251 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:03:19] RECOVERY - DPKG on mw2251 is OK: All packages OK [14:04:47] (03CR) 10Muehlenhoff: [C: 032] contint: migrate git-daemon to systemd [puppet] - 10https://gerrit.wikimedia.org/r/342128 (https://phabricator.wikimedia.org/T157785) (owner: 10Hashar) [14:04:49] PROBLEM - puppet last run on labvirt1014 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [14:04:52] (03PS11) 10Muehlenhoff: contint: migrate git-daemon to systemd [puppet] - 10https://gerrit.wikimedia.org/r/342128 (https://phabricator.wikimedia.org/T157785) (owner: 10Hashar) [14:04:59] PROBLEM - DPKG on mw2252 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:05:27] !log contint1001 and contint2001 : Migrating git-daemon to systemd . Would stop zuul merger briefly [14:05:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:18] sorry for the mw22* broken pkg spam [14:06:19] PROBLEM - DPKG on mw2251 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:07:00] RECOVERY - DPKG on mw2252 is OK: All packages OK [14:07:01] I am not liking the general query patterns I am seeing [14:07:59] PROBLEM - zuul_merger_service_running on contint2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/share/python/zuul/bin/python /usr/bin/zuul-merger [14:09:09] PROBLEM - Check systemd state on mw2251 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:09:10] PROBLEM - configured eth on mw2251 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:09:19] PROBLEM - Disk space on mw2251 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:09:19] PROBLEM - salt-minion processes on mw2251 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:09:19] PROBLEM - dhclient process on mw2251 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:09:19] PROBLEM - puppet last run on mw2251 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:09:29] PROBLEM - MD RAID on mw2251 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[14:09:59] PROBLEM - DPKG on mw2256 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:09:59] PROBLEM - DPKG on mw2252 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:10:19] trying to schedule downtime [14:10:59] PROBLEM - puppet last run on mw1182 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:11:05] done [14:12:59] RECOVERY - zuul_merger_service_running on contint2001 is OK: PROCS OK: 1 process with regex args ^/usr/share/python/zuul/bin/python /usr/bin/zuul-merger [14:13:19] RECOVERY - DPKG on mw2256 is OK: All packages OK [14:13:59] PROBLEM - git_daemon_running on contint2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/lib/git-core/git-daemon --user [14:14:09] PROBLEM - Check HHVM threads for leakage on mw1169 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers [14:17:09] PROBLEM - git_daemon_running on contint1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/lib/git-core/git-daemon --user [14:18:59] PROBLEM - zuul_merger_service_running on contint2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/share/python/zuul/bin/python /usr/bin/zuul-merger [14:19:59] RECOVERY - zuul_merger_service_running on contint2001 is OK: PROCS OK: 1 process with regex args ^/usr/share/python/zuul/bin/python /usr/bin/zuul-merger [14:21:56] (03PS1) 10Ema: Revert "depool eqiad front edge traffic" [dns] - 10https://gerrit.wikimedia.org/r/342211 [14:23:55] I am terrible [14:24:27] (03CR) 10Ema: [V: 032 C: 032] Revert "depool eqiad front edge traffic" [dns] - 10https://gerrit.wikimedia.org/r/342211 (owner: 10Ema) [14:26:19] RECOVERY - Disk space on mw2251 is OK: DISK OK [14:26:29] RECOVERY - configured eth on mw2251 is OK: OK - interfaces up [14:26:29] RECOVERY - puppet last run on mw2251 is OK: OK: Puppet is currently enabled, last run 55 minutes ago with 0 failures [14:26:39] RECOVERY - salt-minion processes 
on mw2251 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [14:26:39] RECOVERY - dhclient process on mw2251 is OK: PROCS OK: 0 processes with command name dhclient [14:26:39] RECOVERY - MD RAID on mw2251 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [14:28:16] (03PS1) 10Hashar: zuul: fix git-daemon monitoring probe [puppet] - 10https://gerrit.wikimedia.org/r/342212 [14:28:56] (03CR) 10Hashar: "The extra --user was because previously the process was forking." [puppet] - 10https://gerrit.wikimedia.org/r/342212 (owner: 10Hashar) [14:29:33] RECOVERY - DPKG on mw2251 is OK: All packages OK [14:30:30] ACKNOWLEDGEMENT - git_daemon_running on contint1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/lib/git-core/git-daemon --user amusso I broke it. Fix is https://gerrit.wikimedia.org/r/#/c/342212/ [14:30:40] _joe_, can you take a look at https://gerrit.wikimedia.org/r/#/c/338950/ again? thanks. [14:31:08] ACKNOWLEDGEMENT - git_daemon_running on contint2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/lib/git-core/git-daemon --user amusso https://gerrit.wikimedia.org/r/#/c/342212/ [14:33:03] RECOVERY - puppet last run on labvirt1014 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [14:39:03] RECOVERY - puppet last run on mw1182 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [14:39:35] (03CR) 10Muehlenhoff: [C: 032] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/342212 (owner: 10Hashar) [14:42:04] PROBLEM - Check HHVM threads for leakage on mw1169 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers [14:42:26] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [14:44:03] RECOVERY - Check HHVM threads for leakage on mw1169 is OK: OK [14:44:13] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is 
CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] [14:46:16] ema --^ [14:46:43] it seems going down [14:47:10] https://grafana.wikimedia.org/dashboard/db/varnish-aggregate-client-status-codes?var-site=All&var-cache_type=upload&var-status_type=4&var-status_type=5&from=now-6h&to=now - only upload.. [14:47:18] to be fair, I would have called godog first in this case [14:47:22] :-) [14:47:26] 503s [14:47:42] jynus: ah yes I saw the repool and pinged him without thinking :) [14:49:13] RECOVERY - git_daemon_running on contint1001 is OK: PROCS OK: 1 process with regex args ^/usr/lib/git-core/git-daemon [14:51:53] PROBLEM - Check Varnish expiry mailbox lag on cp1074 is CRITICAL: CRITICAL: expiry mailbox lag is 803484 [14:52:50] it could be still [14:53:03] PROBLEM - Check Varnish expiry mailbox lag on cp1049 is CRITICAL: CRITICAL: expiry mailbox lag is 643909 [14:53:07] (03PS1) 10Volans: PuppetDB: automatically ucfirst resource names [software/cumin] - 10https://gerrit.wikimedia.org/r/342214 (https://phabricator.wikimedia.org/T159970) [15:02:32] RECOVERY - DPKG on mw2252 is OK: All packages OK [15:03:02] RECOVERY - git_daemon_running on contint2001 is OK: PROCS OK: 1 process with regex args ^/usr/lib/git-core/git-daemon [15:03:22] !log Stop slave db2033 for maintenance - T159707 [15:03:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:29] T159707: Import x1 on dbstore2001 - https://phabricator.wikimedia.org/T159707 [15:04:12] PROBLEM - puppet last run on ms-be1026 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:04:30] (03PS1) 10Jcrespo: mariadb: Decouple ferm mariadb common class into a separate file [puppet] - 10https://gerrit.wikimedia.org/r/342217 (https://phabricator.wikimedia.org/T150850) [15:06:12] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [15:08:12] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [15:08:22] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [15:09:25] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1022" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342218 [15:09:33] (03CR) 10jerkins-bot: [V: 04-1] Revert "db-eqiad.php: Depool db1022" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342218 (owner: 10Marostegui) [15:10:19] -1? what? [15:10:46] ah! [15:13:05] (03PS1) 10Marostegui: db-eqiad.php: Repool db1022 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342219 (https://phabricator.wikimedia.org/T159414) [15:15:50] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1022 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342219 (https://phabricator.wikimedia.org/T159414) (owner: 10Marostegui) [15:17:39] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1022 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342219 (https://phabricator.wikimedia.org/T159414) (owner: 10Marostegui) [15:17:48] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1022 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342219 (https://phabricator.wikimedia.org/T159414) (owner: 10Marostegui) [15:18:43] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1022 - T159414 (duration: 00m 45s) [15:18:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:49] T159414: Rampant differences in indexes and PK on s6 (frwiki, jawiki, ruwiki) for revision table - 
https://phabricator.wikimedia.org/T159414 [15:25:17] (03PS1) 10Marostegui: db-eqiad.php: Depool db1030 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342220 (https://phabricator.wikimedia.org/T159414) [15:26:31] <_joe_> subbu: will do [15:27:14] (03CR) 10Eevans: [C: 031] "Ready to be merged." [puppet] - 10https://gerrit.wikimedia.org/r/342075 (https://phabricator.wikimedia.org/T111113) (owner: 10Eevans) [15:28:02] PROBLEM - git_daemon_running on contint2001 is CRITICAL: PROCS CRITICAL: 2 processes with regex args ^/usr/lib/git-core/git-daemon [15:28:09] (03CR) 10Giuseppe Lavagetto: [C: 031] Ruthenium VisualDiff: Test w/ local Parsoid instead of prod Parsoid [puppet] - 10https://gerrit.wikimedia.org/r/338950 (owner: 10Subramanya Sastry) [15:29:02] RECOVERY - git_daemon_running on contint2001 is OK: PROCS OK: 1 process with regex args ^/usr/lib/git-core/git-daemon [15:29:25] (03PS3) 10Eevans: Optional Cassandra client encryption; Enabled on RESTBase Staging [puppet] - 10https://gerrit.wikimedia.org/r/342075 (https://phabricator.wikimedia.org/T111113) [15:32:05] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1030 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342220 (https://phabricator.wikimedia.org/T159414) (owner: 10Marostegui) [15:33:12] RECOVERY - puppet last run on ms-be1026 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [15:33:22] 06Operations, 05DC-Switchover-Prep-Q3-2016-17, 07Epic, 07Wikimedia-Multiple-active-datacenters: Switchover automation - https://phabricator.wikimedia.org/T160178#3091181 (10Joe) [15:33:40] 06Operations, 05DC-Switchover-Prep-Q3-2016-17, 07Epic, 07Wikimedia-Multiple-active-datacenters: MediaWiki Datacenter Switchover automation - https://phabricator.wikimedia.org/T160178#3091181 (10Joe) [15:33:49] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1030 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342220 (https://phabricator.wikimedia.org/T159414) (owner: 10Marostegui) 
[15:34:01] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1030 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342220 (https://phabricator.wikimedia.org/T159414) (owner: 10Marostegui) [15:34:38] 06Operations, 10MediaWiki-Configuration, 06Performance-Team, 06Services (watching), and 5 others: Allow integration of data from etcd into the MediaWiki configuration - https://phabricator.wikimedia.org/T156924#2990133 (10Joe) [15:34:43] 06Operations, 10MediaWiki-Configuration, 06Performance-Team, 13Patch-For-Review, and 6 others: DNS: dynamically generate entries for service discovery - https://phabricator.wikimedia.org/T156100#2964028 (10Joe) [15:34:45] 06Operations, 05DC-Switchover-Prep-Q3-2016-17, 07Epic, 07Wikimedia-Multiple-active-datacenters: MediaWiki Datacenter Switchover automation - https://phabricator.wikimedia.org/T160178#3091181 (10Joe) [15:37:16] (03CR) 10Giuseppe Lavagetto: [C: 04-1] PuppetDB: automatically ucfirst resource names (031 comment) [software/cumin] - 10https://gerrit.wikimedia.org/r/342214 (https://phabricator.wikimedia.org/T159970) (owner: 10Volans) [15:40:15] (03CR) 10Paladox: "> Why is that needed? 
Services are stopped prior to removal, so if" [debs/gerrit] - 10https://gerrit.wikimedia.org/r/342082 (owner: 10Paladox) [15:41:08] (03PS2) 10Volans: PuppetDB: automatically ucfirst resource names [software/cumin] - 10https://gerrit.wikimedia.org/r/342214 (https://phabricator.wikimedia.org/T159970) [15:42:12] (03CR) 10Volans: "see inline" (031 comment) [software/cumin] - 10https://gerrit.wikimedia.org/r/342214 (https://phabricator.wikimedia.org/T159970) (owner: 10Volans) [15:44:35] (03PS1) 10Muehlenhoff: Setup "bot" credentials file for Phabricator support in offboarding script [puppet] - 10https://gerrit.wikimedia.org/r/342222 [15:47:03] PROBLEM - Check HHVM threads for leakage on mw1169 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers [15:49:30] 06Operations, 06Analytics-Kanban, 10Traffic, 06Wikipedia-iOS-App-Backlog, and 2 others: Periodic 500s from piwik.wikimedia.org - https://phabricator.wikimedia.org/T154558#3091217 (10elukey) Added `innodb_buffer_pool_size = 512M` and `innodb_flush_log_at_trx_commit 2` to `/etc/mysql/my.cnf` restarted mysql... [15:50:43] (03CR) 10Giuseppe Lavagetto: [C: 031] PuppetDB: automatically ucfirst resource names [software/cumin] - 10https://gerrit.wikimedia.org/r/342214 (https://phabricator.wikimedia.org/T159970) (owner: 10Volans) [15:50:52] PROBLEM - puppet last run on analytics1002 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:51:17] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1030" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342224 [15:52:10] (03CR) 10Volans: [C: 032] PuppetDB: automatically ucfirst resource names [software/cumin] - 10https://gerrit.wikimedia.org/r/342214 (https://phabricator.wikimedia.org/T159970) (owner: 10Volans) [15:53:49] (03Merged) 10jenkins-bot: PuppetDB: automatically ucfirst resource names [software/cumin] - 10https://gerrit.wikimedia.org/r/342214 (https://phabricator.wikimedia.org/T159970) (owner: 10Volans) [15:53:51] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1030" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342224 (owner: 10Marostegui) [15:54:22] (03CR) 10Jcrespo: [C: 031] "This will probably break pending ferm patches." [puppet] - 10https://gerrit.wikimedia.org/r/342217 (https://phabricator.wikimedia.org/T150850) (owner: 10Jcrespo) [15:54:59] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1030" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342224 (owner: 10Marostegui) [15:55:12] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1030" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342224 (owner: 10Marostegui) [15:58:14] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1030 - T159414 (duration: 02m 42s) [15:58:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:19] T159414: Rampant differences in indexes and PK on s6 (frwiki, jawiki, ruwiki) for revision table - https://phabricator.wikimedia.org/T159414 [15:58:26] (03CR) 10Ema: [C: 031] "Pcc output OK, LGTM. 
https://puppet-compiler.wmflabs.org/5741" [puppet] - 10https://gerrit.wikimedia.org/r/342175 (owner: 10Filippo Giunchedi) [16:00:01] (03CR) 10Muehlenhoff: "The only pending ferm patch is for dbproxy and can easily be adapted, that's not an issue" [puppet] - 10https://gerrit.wikimedia.org/r/342217 (https://phabricator.wikimedia.org/T150850) (owner: 10Jcrespo) [16:03:02] RECOVERY - Check Varnish expiry mailbox lag on cp1049 is OK: OK: expiry mailbox lag is 364 [16:04:18] (03CR) 10Filippo Giunchedi: [C: 031] "To be merged Monday" [puppet] - 10https://gerrit.wikimedia.org/r/342175 (owner: 10Filippo Giunchedi) [16:07:03] (03CR) 10Jcrespo: [C: 032] mariadb: Decouple ferm mariadb common class into a separate file [puppet] - 10https://gerrit.wikimedia.org/r/342217 (https://phabricator.wikimedia.org/T150850) (owner: 10Jcrespo) [16:07:10] (03PS2) 10Jcrespo: mariadb: Decouple ferm mariadb common class into a separate file [puppet] - 10https://gerrit.wikimedia.org/r/342217 (https://phabricator.wikimedia.org/T150850) [16:07:21] (03PS1) 10Marostegui: sanitarium2.my.cnf: Using standard gtid_domain_id [puppet] - 10https://gerrit.wikimedia.org/r/342226 (https://phabricator.wikimedia.org/T149418) [16:08:22] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0] [16:10:41] (03CR) 10Marostegui: "Looks good: https://puppet-compiler.wmflabs.org/5742/" [puppet] - 10https://gerrit.wikimedia.org/r/342226 (https://phabricator.wikimedia.org/T149418) (owner: 10Marostegui) [16:12:51] (03CR) 10Jcrespo: [C: 031] "I am ok with this, is the user change for something in particular? 
(just asking, this can be deployed)" [puppet] - 10https://gerrit.wikimedia.org/r/342226 (https://phabricator.wikimedia.org/T149418) (owner: 10Marostegui) [16:13:44] (03PS1) 10Gehel: logrotate - introduce a generic logrotate template [puppet] - 10https://gerrit.wikimedia.org/r/342228 [16:13:46] (03PS1) 10Gehel: logrotate - use the new logrotate::rule [puppet] - 10https://gerrit.wikimedia.org/r/342229 [16:13:48] (03PS1) 10Gehel: logrotate - use the new logrotate::rule [puppet] - 10https://gerrit.wikimedia.org/r/342230 [16:13:50] (03PS1) 10Gehel: logrotate - use the new logrotate::rule [puppet] - 10https://gerrit.wikimedia.org/r/342231 [16:13:52] (03PS1) 10Gehel: logrotate - use the new logrotate::rule [puppet] - 10https://gerrit.wikimedia.org/r/342232 [16:13:54] (03CR) 10Marostegui: "No particular change, we do not have it under the gtid_domain_id on any of the files, so just a bit of consistency there." [puppet] - 10https://gerrit.wikimedia.org/r/342226 (https://phabricator.wikimedia.org/T149418) (owner: 10Marostegui) [16:14:06] (03CR) 10Marostegui: [C: 032] sanitarium2.my.cnf: Using standard gtid_domain_id [puppet] - 10https://gerrit.wikimedia.org/r/342226 (https://phabricator.wikimedia.org/T149418) (owner: 10Marostegui) [16:14:10] (03PS2) 10Marostegui: sanitarium2.my.cnf: Using standard gtid_domain_id [puppet] - 10https://gerrit.wikimedia.org/r/342226 (https://phabricator.wikimedia.org/T149418) [16:14:56] (03CR) 10Jcrespo: [C: 031] "Cool. This is one of the things that we could separate on subtemplates, so next time we do not have to change 20 files." [puppet] - 10https://gerrit.wikimedia.org/r/342226 (https://phabricator.wikimedia.org/T149418) (owner: 10Marostegui) [16:15:31] (03CR) 10Jcrespo: [C: 031] "Not on this change, I am thinking aloud." 
[puppet] - 10https://gerrit.wikimedia.org/r/342226 (https://phabricator.wikimedia.org/T149418) (owner: 10Marostegui) [16:18:52] RECOVERY - puppet last run on analytics1002 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [16:20:03] (03PS1) 10Muehlenhoff: debdeploy: Support stretch installations in update spec files [puppet] - 10https://gerrit.wikimedia.org/r/342233 [16:25:22] PROBLEM - puppet last run on stat1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:25:34] !log reboot mw22(5[1-9]|60) to enable mw-cgroup mountpoint [16:25:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:51] that's not a mountpoint [16:25:53] anyhow [16:25:55] :D [16:27:32] RECOVERY - Check systemd state on mw2251 is OK: OK - running: The system is fully operational [16:27:35] just do !log did something with some servers and probably rebooting them [16:30:02] PROBLEM - Check HHVM threads for leakage on mw1169 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers [16:30:06] (03PS1) 10Volans: Add integration tests for clustershell transport [software/cumin] - 10https://gerrit.wikimedia.org/r/342235 (https://phabricator.wikimedia.org/T159969) [16:30:32] PROBLEM - puppet last run on mw1302 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [16:31:02] RECOVERY - Check HHVM threads for leakage on mw1169 is OK: OK [16:31:03] (03CR) 10jerkins-bot: [V: 04-1] Add integration tests for clustershell transport [software/cumin] - 10https://gerrit.wikimedia.org/r/342235 (https://phabricator.wikimedia.org/T159969) (owner: 10Volans) [16:38:14] (03CR) 10Ottomata: [C: 031] Disable WikimediaEvents extension on closed wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342148 (https://phabricator.wikimedia.org/T158721) (owner: 10Krinkle) [16:39:12] RECOVERY - Check systemd state on mw2250 is OK: OK - running: The system is fully operational [16:39:12] RECOVERY - Check systemd state on mw2248 is OK: OK - running: The system is fully operational [16:39:12] RECOVERY - Check systemd state on mw2157 is OK: OK - running: The system is fully operational [16:39:12] RECOVERY - Check systemd state on mw2247 is OK: OK - running: The system is fully operational [16:39:12] RECOVERY - Check systemd state on mw2161 is OK: OK - running: The system is fully operational [16:39:12] RECOVERY - Check systemd state on mw2155 is OK: OK - running: The system is fully operational [16:39:12] RECOVERY - Check systemd state on mw2156 is OK: OK - running: The system is fully operational [16:39:13] RECOVERY - Check systemd state on mw2160 is OK: OK - running: The system is fully operational [16:39:13] RECOVERY - Check systemd state on mw2158 is OK: OK - running: The system is fully operational [16:39:14] RECOVERY - Check systemd state on mw2243 is OK: OK - running: The system is fully operational [16:39:22] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] [16:39:22] RECOVERY - Check systemd state on mw2162 is OK: OK - running: The system is fully operational [16:39:22] RECOVERY - Check systemd state on mw2159 is OK: OK - running: The system is fully operational [16:39:23] RECOVERY - Check systemd state on mw2154 
is OK: OK - running: The system is fully operational [16:39:23] RECOVERY - Check systemd state on mw2249 is OK: OK - running: The system is fully operational [16:39:32] RECOVERY - Check systemd state on mw2153 is OK: OK - running: The system is fully operational [16:42:50] 06Operations, 10ops-codfw, 13Patch-For-Review: codfw: oresrdb2002 rack/setup - https://phabricator.wikimedia.org/T160082#3091276 (10Papaul) [16:44:15] !log oresrdb2002 - signing puppet certs, salt-key, initial run [16:44:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:31] (03PS3) 10Gilles: Send thumbor process age to statsd via cron [puppet] - 10https://gerrit.wikimedia.org/r/341791 (https://phabricator.wikimedia.org/T159352) [16:51:15] (03PS1) 10Volans: Clustershell: always require an event handler [software/cumin] - 10https://gerrit.wikimedia.org/r/342238 (https://phabricator.wikimedia.org/T159968) [16:51:17] (03PS1) 10Volans: Add integration tests for clustershell transport [software/cumin] - 10https://gerrit.wikimedia.org/r/342239 (https://phabricator.wikimedia.org/T159969) [16:51:48] (03CR) 10jerkins-bot: [V: 04-1] Add integration tests for clustershell transport [software/cumin] - 10https://gerrit.wikimedia.org/r/342239 (https://phabricator.wikimedia.org/T159969) (owner: 10Volans) [16:51:50] (03CR) 10jerkins-bot: [V: 04-1] Clustershell: always require an event handler [software/cumin] - 10https://gerrit.wikimedia.org/r/342238 (https://phabricator.wikimedia.org/T159968) (owner: 10Volans) [16:51:53] (03Abandoned) 10Volans: Add integration tests for clustershell transport [software/cumin] - 10https://gerrit.wikimedia.org/r/342235 (https://phabricator.wikimedia.org/T159969) (owner: 10Volans) [16:53:43] (03PS2) 10Volans: Clustershell: always require an event handler [software/cumin] - 10https://gerrit.wikimedia.org/r/342238 (https://phabricator.wikimedia.org/T159968) [16:54:20] 06Operations, 10ops-codfw, 13Patch-For-Review: codfw: oresrdb2002 rack/setup - 
https://phabricator.wikimedia.org/T160082#3091293 (10Papaul) [16:54:22] RECOVERY - puppet last run on stat1004 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [16:54:37] (03PS2) 10Volans: Add integration tests for clustershell transport [software/cumin] - 10https://gerrit.wikimedia.org/r/342239 (https://phabricator.wikimedia.org/T159969) [16:55:01] 06Operations, 10ops-codfw, 13Patch-For-Review: codfw: oresrdb2002 rack/setup - https://phabricator.wikimedia.org/T160082#3088068 (10Papaul) a:05Papaul>03akosiaris Installation complete @akosiaris [16:55:17] (03CR) 10jerkins-bot: [V: 04-1] Add integration tests for clustershell transport [software/cumin] - 10https://gerrit.wikimedia.org/r/342239 (https://phabricator.wikimedia.org/T159969) (owner: 10Volans) [16:58:32] RECOVERY - puppet last run on mw1302 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [17:05:02] 06Operations, 10Gerrit, 06Labs, 06Release-Engineering-Team, 07LDAP: Remove user gerrit2 from ldap - https://phabricator.wikimedia.org/T160122#3091314 (10demon) >>! In T160122#3090315, @MoritzMuehlenhoff wrote: > @demon Can you confirm that the user is no longer needed? For a little history, it **used**... [17:07:49] (03PS3) 10Volans: Add integration tests for clustershell transport [software/cumin] - 10https://gerrit.wikimedia.org/r/342239 (https://phabricator.wikimedia.org/T159969) [17:08:51] (03PS2) 10Andrew Bogott: Upstart logrotate: Use copytruncate instead of delaycompress. [puppet] - 10https://gerrit.wikimedia.org/r/341808 (https://phabricator.wikimedia.org/T159141) [17:11:01] (03CR) 10Andrew Bogott: [C: 032] Upstart logrotate: Use copytruncate instead of delaycompress. 
[puppet] - 10https://gerrit.wikimedia.org/r/341808 (https://phabricator.wikimedia.org/T159141) (owner: 10Andrew Bogott) [17:11:48] RECOVERY - Check Varnish expiry mailbox lag on cp1074 is OK: OK: expiry mailbox lag is 84 [17:17:43] (03PS4) 10Gehel: elasticsearch - statsd plugin isn't used anymore [puppet] - 10https://gerrit.wikimedia.org/r/342052 [17:23:11] (03PS4) 10Volans: Add integration tests for clustershell transport [software/cumin] - 10https://gerrit.wikimedia.org/r/342239 (https://phabricator.wikimedia.org/T159969) [17:23:55] (03CR) 10jerkins-bot: [V: 04-1] Add integration tests for clustershell transport [software/cumin] - 10https://gerrit.wikimedia.org/r/342239 (https://phabricator.wikimedia.org/T159969) (owner: 10Volans) [17:24:34] !log installed librdkafka 0.9.4 via dpkg -i on cp1058 (cache text) and restarted varnishkafka in preparation for fleet upgrade next week [17:24:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:41] elukey: FYI ^^ [17:25:16] oops, ha, cp1058 is a cache misc [17:25:30] meant to do 1052 [17:26:28] (03PS5) 10Volans: Add integration tests for clustershell transport [software/cumin] - 10https://gerrit.wikimedia.org/r/342239 (https://phabricator.wikimedia.org/T159969) [17:28:33] !log installed librdkafka 0.9.4 via dpkg -i on cp1052 (cache text) and restarted varnishkafka in preparation for fleet upgrade next week [17:28:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:08] (03CR) 10Volans: "Example output of tox -e integration-clustershell" [software/cumin] - 10https://gerrit.wikimedia.org/r/342239 (https://phabricator.wikimedia.org/T159969) (owner: 10Volans) [17:29:44] 06Operations, 06Analytics-Kanban, 10ChangeProp, 10Reading-Web-Trending-Service, 06Services (watching): Upgrade librdkafka 0.9.4 on SCB and Varnishes - https://phabricator.wikimedia.org/T159379#3091385 (10Ottomata) FYI, I've installed librrdkafka on cp1042, cp1058 (cache misc) and cp1052 (cache 
text) serv... [17:30:59] 06Operations, 10Gerrit, 06Labs, 06Release-Engineering-Team, 07LDAP: Remove user gerrit2 from ldap - https://phabricator.wikimedia.org/T160122#3091387 (10MoritzMuehlenhoff) Ok, I'll make the change on Tuesday when you're around (since Monday is a holiday for US staff) [17:40:05] thanks ottomata :) [17:42:45] 06Operations, 10Gerrit, 06Labs, 06Release-Engineering-Team, 07LDAP: Remove user gerrit2 from ldap - https://phabricator.wikimedia.org/T160122#3091441 (10demon) p:05High>03Normal Sounds good [17:47:02] (03PS1) 10Gehel: elasticsearch - move role::elasticsearch::common to a profile [puppet] - 10https://gerrit.wikimedia.org/r/342248 (https://phabricator.wikimedia.org/T147718) [17:48:01] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch - move role::elasticsearch::common to a profile [puppet] - 10https://gerrit.wikimedia.org/r/342248 (https://phabricator.wikimedia.org/T147718) (owner: 10Gehel) [17:50:19] (03PS1) 10Ottomata: Create new refinery/job directory and move refinery cron job classes there [puppet] - 10https://gerrit.wikimedia.org/r/342250 [17:54:28] PROBLEM - puppet last run on maerlant is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:54:43] 06Operations, 10ops-codfw, 13Patch-For-Review, 15User-Elukey: codfw: mw2251-mw2260 rack/setup - https://phabricator.wikimedia.org/T155180#3091494 (10elukey) Ran puppet, rebooted the nodes for the mw-cgroups, re-ran puppet and scap pull, pooled the nodes via conftool. Still to check: I had to reboot mw2256... 
[17:55:14] (03PS3) 10Ottomata: Create new refinery/job directory and move refinery cron job classes there [puppet] - 10https://gerrit.wikimedia.org/r/342250 [17:55:29] (03CR) 10Ottomata: "No op https://puppet-compiler.wmflabs.org/5743/" [puppet] - 10https://gerrit.wikimedia.org/r/342250 (owner: 10Ottomata) [17:56:02] (03CR) 10Ottomata: [V: 032 C: 032] Create new refinery/job directory and move refinery cron job classes there [puppet] - 10https://gerrit.wikimedia.org/r/342250 (owner: 10Ottomata) [17:59:46] 06Operations, 10MediaWiki-JobRunner: jobrunner/jobchron services fail in codfw - https://phabricator.wikimedia.org/T160146#3091495 (10Dzahn) This looks fixed now in Icinga but there is nothing in SAL or on this ticket that would explain how it got fixed. ? [18:05:18] PROBLEM - puppet last run on es1019 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:17:03] 06Operations, 06Labs, 10Tool-Labs: ssl certificate/key update: *.tools.wmflabs.org (expires on 2017-03-24) - https://phabricator.wikimedia.org/T160187#3091560 (10RobH) [18:19:09] 06Operations, 06Labs, 10Tool-Labs: ssl certificate/key update: *.tools.wmflabs.org (expires on 2017-03-24) - https://phabricator.wikimedia.org/T160187#3091579 (10RobH) I'm willing to assist on this as needed. For other changes, we typically do the checklist as follows: [] - stage new private key in private... 
[18:20:19] 06Operations, 06Labs, 10Tool-Labs: ssl certificate/key update: *.tools.wmflabs.org (expires on 2017-03-24) - https://phabricator.wikimedia.org/T160187#3091582 (10RobH) [18:20:43] 06Operations, 06Labs, 10Tool-Labs: ssl certificate/key update: *.tools.wmflabs.org (expires on 2017-03-24) - https://phabricator.wikimedia.org/T160187#3091560 (10RobH) a:05yuvipanda>03None [18:20:52] 06Operations, 06Labs, 10Tool-Labs: ssl certificate/key update: *.tools.wmflabs.org (expires on 2017-03-24) - https://phabricator.wikimedia.org/T160187#3091560 (10RobH) a:03RobH [18:22:28] RECOVERY - puppet last run on maerlant is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [18:22:53] (03PS1) 10GWicke: Update access log sampling to match new hyperswitch levels [puppet] - 10https://gerrit.wikimedia.org/r/342251 [18:23:53] (03Abandoned) 10Gehel: WIP - logrotate - introduce a generic logrotate template [puppet] - 10https://gerrit.wikimedia.org/r/341782 (owner: 10Gehel) [18:23:59] (03CR) 10GWicke: [C: 04-1] Update access log sampling to match new hyperswitch levels [puppet] - 10https://gerrit.wikimedia.org/r/342251 (owner: 10GWicke) [18:24:30] (03CR) 10GWicke: [C: 04-1] "Added a -1 to signal that this should only be deployed with the corresponding hyperswitch change." [puppet] - 10https://gerrit.wikimedia.org/r/342251 (owner: 10GWicke) [18:27:08] PROBLEM - puppet last run on scb1001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [18:28:22] !log smalyshev@tin Started deploy [wdqs/wdqs@1f2973c]: Deploy new updater on 1003 for potential connection drop fix [18:28:24] !log smalyshev@tin Finished deploy [wdqs/wdqs@1f2973c]: Deploy new updater on 1003 for potential connection drop fix (duration: 00m 03s) [18:28:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:18] RECOVERY - puppet last run on es1019 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [18:41:51] (03PS4) 10Krinkle: webperf: Use trebuchet install of eventlogging for ve.py [puppet] - 10https://gerrit.wikimedia.org/r/341723 (https://phabricator.wikimedia.org/T131977) [18:42:01] (03PS4) 10Krinkle: webperf: Update navtiming.py to use eventlogging instead of zmq [puppet] - 10https://gerrit.wikimedia.org/r/341724 [18:42:25] (03CR) 10Krinkle: "Self-1 was because I haven't tested it and pending questions (see IRC, -analytics)" [puppet] - 10https://gerrit.wikimedia.org/r/341724 (owner: 10Krinkle) [18:56:08] RECOVERY - puppet last run on scb1001 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [18:57:56] (03PS1) 10RobH: new cert for *.tools.wmflabs.org [puppet] - 10https://gerrit.wikimedia.org/r/342254 [18:58:36] 06Operations, 06Labs, 10Tool-Labs, 13Patch-For-Review: ssl certificate/key update: *.tools.wmflabs.org (expires on 2017-03-24) - https://phabricator.wikimedia.org/T160187#3091682 (10RobH) [18:58:45] 06Operations, 06Labs, 10Tool-Labs, 13Patch-For-Review: ssl certificate/key update: *.tools.wmflabs.org (expires on 2017-03-24) - https://phabricator.wikimedia.org/T160187#3091560 (10RobH) a:05RobH>03yuvipanda [19:07:05] !log Unmasked kartotherian on maps-test2004 [19:07:08] RECOVERY - Check systemd state on maps-test2004 is OK: OK - running: The system is fully operational [19:07:09] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:08] PROBLEM - puppet last run on tin is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:13:05] (03PS1) 10Gehel: maps - upgrade maps-test cluster to node js version 6 [puppet] - 10https://gerrit.wikimedia.org/r/342256 (https://phabricator.wikimedia.org/T150354) [19:17:28] RECOVERY - kartotherian endpoints health on maps-test2004 is OK: All endpoints are healthy [19:19:18] !log upgrading kartotherian on maps-test2004 - T150354 [19:19:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:23] T150354: Implement Node6 support for Kartotherian/Tilerator - https://phabricator.wikimedia.org/T150354 [19:21:05] (03CR) 10MaxSem: [C: 031] maps - upgrade maps-test cluster to node js version 6 [puppet] - 10https://gerrit.wikimedia.org/r/342256 (https://phabricator.wikimedia.org/T150354) (owner: 10Gehel) [19:25:38] RECOVERY - puppet last run on maps-test2004 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [19:27:50] !log gehel@tin Started deploy [kartotherian/deploy@76adf21]: (no justification provided) [19:27:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:14] !log gehel@tin Finished deploy [kartotherian/deploy@76adf21]: (no justification provided) (duration: 00m 23s) [19:28:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:22] 06Operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10Traffic, and 2 others: Purge Varnish cache when a banner is saved - https://phabricator.wikimedia.org/T154954#3091832 (10AndyRussG) [19:34:32] !log restart kartotherian on maps-test2004 [19:34:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:53] 06Operations, 07HHVM, 13Patch-For-Review, 07Upstream: Build / migrate to HHVM 3.18 - https://phabricator.wikimedia.org/T158176#3091864 (10MoritzMuehlenhoff) 
\o/ "mofarrell commented an hour ago: A fix is on its way." [19:37:08] RECOVERY - puppet last run on tin is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [19:44:07] !log gehel@tin Started deploy [tilerator/deploy@b501046]: (no justification provided) [19:44:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:45:28] !log gehel@tin Finished deploy [tilerator/deploy@b501046]: (no justification provided) (duration: 01m 20s) [19:45:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:45:50] !log failed tilerator deploy on maps-test2004 [19:45:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:04] !log gehel@tin Started deploy [tilerator/deploy@b501046]: (no justification provided) [19:47:07] !log gehel@tin Finished deploy [tilerator/deploy@b501046]: (no justification provided) (duration: 00m 03s) [19:47:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:47] !log restarting tilerator(ui) on maps-test2004 [19:47:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:59] PROBLEM - puppet last run on cp1066 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [19:57:16] !log gehel@tin Started deploy [tilerator/deploy@b501046]: (no justification provided) [19:57:21] !log gehel@tin Finished deploy [tilerator/deploy@b501046]: (no justification provided) (duration: 00m 04s) [19:57:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:57:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:57:52] !log restarting tilerator(ui) on maps-test2004 [19:57:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:59:39] (03CR) 10Gehel: [C: 032] maps - upgrade maps-test cluster to node js version 6 [puppet] - 10https://gerrit.wikimedia.org/r/342256 (https://phabricator.wikimedia.org/T150354) (owner: 10Gehel) [20:03:20] !log gehel@tin Started deploy [tilerator/deploy@b501046]: (no justification provided) [20:03:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:03:37] !log gehel@tin Finished deploy [tilerator/deploy@b501046]: (no justification provided) (duration: 00m 16s) [20:03:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:04:28] (03PS1) 10Mholloway: Set ANDROID_HOME environment variable (role::ci::slave::android) [puppet] - 10https://gerrit.wikimedia.org/r/342262 (https://phabricator.wikimedia.org/T158456) [20:05:29] !log gehel@tin Started deploy [kartotherian/deploy@76adf21]: (no justification provided) [20:05:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:24] !log gehel@tin Finished deploy [kartotherian/deploy@76adf21]: (no justification provided) (duration: 00m 54s) [20:06:26] !log restart kartotherian / tilerator(ui) on maps-test* [20:06:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:27] (03PS1) 10Andrew Bogott: Nova: Remove our custom-hacked libvirt driver 
[puppet] - 10https://gerrit.wikimedia.org/r/342264 (https://phabricator.wikimedia.org/T131548) [20:12:28] (03CR) 10Andrew Bogott: [C: 032] Nova: Remove our custom-hacked libvirt driver [puppet] - 10https://gerrit.wikimedia.org/r/342264 (https://phabricator.wikimedia.org/T131548) (owner: 10Andrew Bogott) [20:16:58] RECOVERY - puppet last run on cp1066 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [20:37:24] 06Operations, 10Annual-Report, 10Security-Reviews, 13Patch-For-Review: add subdomain for annual report 2016 - https://phabricator.wikimedia.org/T151798#3092081 (10Dzahn) 05Open>03Resolved closing ticket again as it looks done for now. feel free to re-open if more changes are planned. [20:44:02] 06Operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10Traffic, and 2 others: Purge Varnish cache when a banner is saved - https://phabricator.wikimedia.org/T154954#3092101 (10AndyRussG) @BBlack, @ema, hi! Would it be possible to maybe get your input on the [[ https://gerrit.wikimedia... 
[20:44:31] (03CR) 10Dzahn: [C: 04-1] "Error: Could not find template 'mediawiki/maintenance/uploads/wgetrc.erb'" [puppet] - 10https://gerrit.wikimedia.org/r/341264 (https://phabricator.wikimedia.org/T159661) (owner: 10Dereckson) [20:44:47] (03CR) 10Dzahn: [C: 04-1] "i will amend one more time" [puppet] - 10https://gerrit.wikimedia.org/r/341264 (https://phabricator.wikimedia.org/T159661) (owner: 10Dereckson) [21:00:11] (03PS5) 10Dzahn: maintenance: provision /etc/wgetrc [puppet] - 10https://gerrit.wikimedia.org/r/341264 (https://phabricator.wikimedia.org/T159661) (owner: 10Dereckson) [21:00:44] (03PS6) 10Dzahn: mediawiki::maintenance: provision /etc/wgetrc [puppet] - 10https://gerrit.wikimedia.org/r/341264 (https://phabricator.wikimedia.org/T159661) (owner: 10Dereckson) [21:03:17] (03CR) 10Dzahn: [C: 031] "now it works: http://puppet-compiler.wmflabs.org/5745/" [puppet] - 10https://gerrit.wikimedia.org/r/341264 (https://phabricator.wikimedia.org/T159661) (owner: 10Dereckson) [21:03:23] (03CR) 10Dzahn: [C: 032] mediawiki::maintenance: provision /etc/wgetrc [puppet] - 10https://gerrit.wikimedia.org/r/341264 (https://phabricator.wikimedia.org/T159661) (owner: 10Dereckson) [21:04:47] 06Operations, 06Commons, 13Patch-For-Review: Improve Terbium (and wasat) userland to process server side uploads - https://phabricator.wikimedia.org/T159661#3092138 (10Dzahn) [21:07:28] PROBLEM - puppet last run on elastic1043 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [21:08:09] (03CR) 10Dzahn: "eh, almost works, it's a directory instead of a file" [puppet] - 10https://gerrit.wikimedia.org/r/341264 (https://phabricator.wikimedia.org/T159661) (owner: 10Dereckson) [21:08:52] Dereckson: we have an empty dir /etc/wgetrc instead of a file /etc/wgetrc [21:09:34] ensure => ensure_directory($ensure), [21:09:36] it was a file before (which was manually added afaict) [21:09:41] yea, i saw [21:10:38] but the template contents need to go somewhere [21:11:08] did you want /etc/wgetrc and then a file inside it? [21:11:14] or /etc/wgetrc itself as the file [21:12:31] just file, right (checks wget man page) [21:13:32] No, it's a file /etc/wgetrc according to the wget documentation [21:13:37] btw, you can also have per-user dotfiles in the repo [21:13:38] not a .d subdir [21:14:17] ok [21:16:59] (03PS1) 10Dzahn: mediawiki::maintenance: ensure /etc/wgetrc is file, not dir [puppet] - 10https://gerrit.wikimedia.org/r/342274 (https://phabricator.wikimedia.org/T159661) [21:19:10] (03PS2) 10Dzahn: mediawiki::maintenance: ensure /etc/wgetrc is file, not dir [puppet] - 10https://gerrit.wikimedia.org/r/342274 (https://phabricator.wikimedia.org/T159661) [21:20:06] (03CR) 10Dzahn: [C: 032] mediawiki::maintenance: ensure /etc/wgetrc is file, not dir [puppet] - 10https://gerrit.wikimedia.org/r/342274 (https://phabricator.wikimedia.org/T159661) (owner: 10Dzahn) [21:21:28] Ensure set to :present but file type is directory so no content will be synced [21:21:38] deletes it [21:23:42] Dereckson: done now. file exists on both, terbium uses webproxy.eqiad. wasat uses webproxy.codfw [21:26:18] 06Operations, 06Commons, 13Patch-For-Review: Improve Terbium (and wasat) userland to process server side uploads - https://phabricator.wikimedia.org/T159661#3074310 (10Dzahn) edited task title to point out we should always treat it as the pair terbium/wasat for eqiad/codfw. amended and merged the changes... 
[21:27:01] :) thanks
[21:29:07] (Draft1) Paladox: Phabricator: Remove three unneeded configs [puppet] - https://gerrit.wikimedia.org/r/342275
[21:29:10] (PS2) Paladox: Phabricator: Remove three unneeded configs [puppet] - https://gerrit.wikimedia.org/r/342275
[21:32:09] PROBLEM - puppet last run on mw1244 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[21:34:39] (PS2) Dzahn: contint: Zuul no more interact with Jenkins [puppet] - https://gerrit.wikimedia.org/r/340529 (owner: Hashar)
[21:35:28] RECOVERY - puppet last run on elastic1043 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
[21:38:02] Operations, ops-codfw: ms-be2002.codfw.wmnet has drac issues - https://phabricator.wikimedia.org/T155689#3092165 (RobH) a:RobH>Papaul This system still has the ipmi issue when run on the local OS: ``` robh@ms-be2002:~$ sudo ipmi-chassis --get-chassis-status ipmi_cmd_get_chassis_status: internal...
[21:38:36] (CR) Dzahn: "http://puppet-compiler.wmflabs.org/5746/" [puppet] - https://gerrit.wikimedia.org/r/340529 (owner: Hashar)
[21:38:49] (CR) Dzahn: [C: 2] contint: Zuul no more interact with Jenkins [puppet] - https://gerrit.wikimedia.org/r/340529 (owner: Hashar)
[21:39:00] puppet compiler looks good still
[21:40:05] yes
[21:40:21] hands over to you
[21:41:44] Operations, DC-Switchover-Prep-Q3-2016-17, Epic, Wikimedia-Multiple-active-datacenters: MediaWiki Datacenter Switchover automation - https://phabricator.wikimedia.org/T160178#3091181 (Krinkle) > `nodejs /var/lib/mediawiki-cache-warmup/warmup.js /var/lib/mediawiki-cache-warmup/urls-server.txt clon...
[21:42:28] !log restarted Zuul
[21:42:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:46:10] (Draft1) Paladox: Phabricator: Use hiera for deciding when to enable read and write for mysql search [puppet] - https://gerrit.wikimedia.org/r/342276
[21:46:13] (PS2) Paladox: Phabricator: Use hiera for deciding when to enable read and write for mysql search [puppet] - https://gerrit.wikimedia.org/r/342276
[21:47:44] (CR) Dzahn: "private repo: [master 0112f51] (dzahn) remove passwords::misc::contint::jenkins" [puppet] - https://gerrit.wikimedia.org/r/340529 (owner: Hashar)
[21:47:46] (PS1) Hashar: Remove passwords::misc::contint::jenkins [labs/private] - https://gerrit.wikimedia.org/r/342277
[21:52:28] (PS2) Dzahn: microsites: convert to profile/role-structure [puppet] - https://gerrit.wikimedia.org/r/342164
[21:52:28] PROBLEM - puppet last run on mc1025 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[21:53:28] PROBLEM - puppet last run on mw1233 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[21:54:32] Operations, Ops-Access-Requests: Requesting access to researchers group (stat1003 and MySQL) for niharika29 - https://phabricator.wikimedia.org/T159780#3092179 (kaldari) I approve!
[21:55:09] Operations, Ops-Access-Requests: Requesting access to researchers group (stat1003 and MySQL) for niharika29 - https://phabricator.wikimedia.org/T159780#3092181 (RobH)
[21:57:05] (PS1) RobH: add niharika29 to researchers group [puppet] - https://gerrit.wikimedia.org/r/342278
[21:57:50] Operations, Ops-Access-Requests, Patch-For-Review: Requesting access to researchers group (stat1003 and MySQL) for niharika29 - https://phabricator.wikimedia.org/T159780#3092202 (RobH)
[21:58:01] (CR) RobH: [C: 2] add niharika29 to researchers group [puppet] - https://gerrit.wikimedia.org/r/342278 (owner: RobH)
[21:59:45] Operations, Ops-Access-Requests, Patch-For-Review: Requesting access to researchers group (stat1003 and MySQL) for niharika29 - https://phabricator.wikimedia.org/T159780#3092204 (RobH) Open>Resolved There have been no objections noted, and I noticed that this 3 day wait ended today. I chatte...
[22:01:08] RECOVERY - puppet last run on mw1244 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[22:09:27] (CR) Dzahn: [C: -1] "influences admin groups, since they are added in hiera role/common/ http://puppet-compiler.wmflabs.org/5747/bromine.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/342164 (owner: Dzahn)
[22:18:31] Operations, Ops-Access-Requests, Patch-For-Review: Requesting access to researchers group (stat1003 and MySQL) for niharika29 - https://phabricator.wikimedia.org/T159780#3092215 (Ottomata) Thanks @robh!
[22:20:28] RECOVERY - puppet last run on mc1025 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures
[22:21:28] RECOVERY - puppet last run on mw1233 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures
[22:25:36] Operations, IDS-extension, Wikimedia-Extension-setup, I18n: Deploy IDS rendering engine to production - https://phabricator.wikimedia.org/T148693#3092220 (Arthur2e5) >>! In T148693#3090260, @Shoichi wrote: > No matter how many sites connect to the server, they share the same cache. Makes sense a...
[22:38:04] (CR) Hashar: "That is better done directly in the job. For example by adding a build parameter:" [puppet] - https://gerrit.wikimedia.org/r/342262 (https://phabricator.wikimedia.org/T158456) (owner: Mholloway)
[22:50:28] PROBLEM - puppet last run on mw1304 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[22:59:43] (PS3) Dzahn: microsites: convert to profile/role-structure [puppet] - https://gerrit.wikimedia.org/r/342164
[23:02:28] PROBLEM - Host mw2256 is DOWN: PING CRITICAL - Packet loss = 100%
[23:02:58] RECOVERY - Host mw2256 is UP: PING OK - Packet loss = 0%, RTA = 36.14 ms
[23:13:58] Operations, ops-eqiad: decom ytterbium (datacenter) - https://phabricator.wikimedia.org/T141415#3092324 (Dzahn)
[23:14:41] Operations, ops-eqiad: decom ytterbium (datacenter) - https://phabricator.wikimedia.org/T141415#2497714 (Dzahn)
[23:18:05] Operations, ops-eqiad: decom ytterbium (datacenter) - https://phabricator.wikimedia.org/T141415#3092331 (Dzahn)
[23:18:28] RECOVERY - puppet last run on mw1304 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures
[23:20:15] Operations, ops-eqiad: decom ytterbium (datacenter) - https://phabricator.wikimedia.org/T141415#3092348 (Dzahn) a:RobH @Robh this was an older decom task that was still open, i added the newer checklist now and checked the boxes after the fact. assigning to you to check if switch ports was done alre...
[23:21:35] Operations, ops-eqiad: decom ytterbium (datacenter) - https://phabricator.wikimedia.org/T141415#3092353 (Dzahn)
[23:30:05] (PS4) Dzahn: microsites: convert to profile/role-structure [puppet] - https://gerrit.wikimedia.org/r/342164
[23:34:50] (CR) Dzahn: [C: 1] "now the only difference are motd contents / role names. http://puppet-compiler.wmflabs.org/5749/" [puppet] - https://gerrit.wikimedia.org/r/342164 (owner: Dzahn)
[23:36:29] (CR) Dzahn: [C: 1] "@Moritz do i need to change the name of the debdeploy grain here to match the role name?" [puppet] - https://gerrit.wikimedia.org/r/342164 (owner: Dzahn)
[23:40:29] (PS2) Dzahn: planet: convert to profile/role-structure [puppet] - https://gerrit.wikimedia.org/r/342163
[23:41:41] (CR) jerkins-bot: [V: -1] planet: convert to profile/role-structure [puppet] - https://gerrit.wikimedia.org/r/342163 (owner: Dzahn)
[23:52:12] (Abandoned) Mholloway: Set ANDROID_HOME environment variable (role::ci::slave::android) [puppet] - https://gerrit.wikimedia.org/r/342262 (https://phabricator.wikimedia.org/T158456) (owner: Mholloway)
[23:58:48] (PS3) Dzahn: planet: convert to profile/role-structure [puppet] - https://gerrit.wikimedia.org/r/342163
[23:59:08] PROBLEM - MariaDB Slave Lag: s2 on db1047 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 315.54 seconds