[00:00:04] <jouncebot>	 addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170310T0000). Please do the needful.
[00:00:04] <jouncebot>	 James_F: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process.
[00:00:31] <Dereckson>	 Hello.
[00:00:35] <Dereckson>	 I can SWAT this evening.
[00:00:39] <James_F>	 Hey.
[00:01:06] <Krinkle>	 o/
[00:01:17] <Dereckson>	 I've CR'ed the VE one, let's do the config meanwhile
[00:01:49] <icinga-wm>	 PROBLEM - puppet last run on rhenium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:03:20] <wikibugs_>	 (CR) Dzahn: "http://puppet-compiler.wmflabs.org/5727/" [puppet] - https://gerrit.wikimedia.org/r/342128 (https://phabricator.wikimedia.org/T157785) (owner: Hashar)
[00:03:49] <Dereckson>	 (mwlog1001 so now for fatalmonitor)
[00:04:47] <wikibugs_>	 (CR) Dereckson: [C: 2] (no-op) Remove setting of unused $wgPercentHHVM (no longer exists) [mediawiki-config] - https://gerrit.wikimedia.org/r/342149 (owner: Krinkle)
[00:06:54] <wikibugs_>	 (CR) Mobrovac: RESTBase: Send the logs locally to stdout/syslog (1 comment) [puppet] - https://gerrit.wikimedia.org/r/342103 (https://phabricator.wikimedia.org/T112648) (owner: Mobrovac)
[00:07:45] <wikibugs_>	 (CR) Hashar: "It is probably terribly wrong in one way or another.  I am going to test it out on labs and polish it :}" [puppet] - https://gerrit.wikimedia.org/r/342128 (https://phabricator.wikimedia.org/T157785) (owner: Hashar)
[00:08:35] <wikibugs_>	 (CR) Dzahn: "see inline comments. i think issue with variable names in manifest vs template. compiler part looks good though" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/342128 (https://phabricator.wikimedia.org/T157785) (owner: Hashar)
[00:08:36] <Dereckson>	 It seems Zuul didn't pick up 342149
[00:09:06] <wikibugs_>	 (CR) Dzahn: "$directory / $base_path / @base_directory" [puppet] - https://gerrit.wikimedia.org/r/342128 (https://phabricator.wikimedia.org/T157785) (owner: Hashar)
[00:09:10] <wikibugs_>	 (CR) Dereckson: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/342149 (owner: Krinkle)
[00:10:56] <Dereckson>	 ah it's because it has https://phabricator.wikimedia.org/rOMWC70772084f5a067f4352c7b691ce328cc6720859d as parent
[00:11:10] <Dereckson>	 Krinkle: ^
[00:11:23] <Dereckson>	 this commit is declared as parent in your change, but it's not in master
[00:11:33] <Krinkle>	 Dereckson: They're both in swat
[00:11:36] <Krinkle>	 other way around I suppose
[00:11:49] <Dereckson>	 ok it's https://gerrit.wikimedia.org/r/#/c/342147/3 seen it
[00:12:26] <wikibugs_>	 (CR) Dereckson: [C: 2] [noop] Move NavigationTiming config to EventLogging section [mediawiki-config] - https://gerrit.wikimedia.org/r/342147 (owner: Krinkle)
[00:13:13] <Dereckson>	 James_F: Krinkle: you wish to test them together or 342147 only first?
[00:13:32] <Krinkle>	 Dereckson: together
[00:13:37] * Dereckson nods
[00:13:44] <Krinkle>	 (the two of mine together that is)
[00:13:53] <James_F>	 Mine doesn't matter.
[00:14:30] <Dereckson>	 zuul is gating the two, we're waiting for operations-mw-config-composer-hhvm-jessie
[00:14:48] <Dereckson>	 and for VE, we're waiting for mwext-VisualEditor-npm-node-6-jessie
[00:15:02] <wikibugs_>	 (Merged) jenkins-bot: [noop] Move NavigationTiming config to EventLogging section [mediawiki-config] - https://gerrit.wikimedia.org/r/342147 (owner: Krinkle)
[00:15:06] <wikibugs_>	 (Merged) jenkins-bot: (no-op) Remove setting of unused $wgPercentHHVM (no longer exists) [mediawiki-config] - https://gerrit.wikimedia.org/r/342149 (owner: Krinkle)
[00:15:24] <Dereckson>	 342147 and 342149 on mwdebug1002
[00:15:25] <wikibugs_>	 (CR) jenkins-bot: [noop] Move NavigationTiming config to EventLogging section [mediawiki-config] - https://gerrit.wikimedia.org/r/342147 (owner: Krinkle)
[00:17:22] <Krinkle>	 Dereckson: thx
[00:18:25] <Krinkle>	 Dereckson: Works fine
[00:18:35] <Dereckson>	 ack'ed
[00:19:03] <logmsgbot>	 !log maxsem@tin Started deploy [tilerator/deploy@160f314]: https://gerrit.wikimedia.org/r/#/c/342153/ - revert submodule updates due to broken manik->libc dependency
[00:19:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:19:20] <logmsgbot>	 !log maxsem@tin Finished deploy [tilerator/deploy@160f314]: https://gerrit.wikimedia.org/r/#/c/342153/ - revert submodule updates due to broken manik->libc dependency (duration: 00m 16s)
[00:19:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:19:39] <icinga-wm>	 RECOVERY - tileratorui on maps-test2001 is OK: HTTP OK: HTTP/1.1 200 OK - 305 bytes in 0.089 second response time
[00:19:39] <icinga-wm>	 RECOVERY - tilerator on maps-test2001 is OK: HTTP OK: HTTP/1.1 200 OK - 305 bytes in 0.093 second response time
[00:20:00] <MaxSem>	 greg-g, ^
[00:22:21] <Dereckson>	 VE merged
[00:22:29] <logmsgbot>	 !log dereckson@tin Synchronized wmf-config/CommonSettings.php: Move NavigationTiming config to EventLogging section + Remove setting of unused $wgPercentHHVM ([[Gerrit:342147]] and [[Gerrit:342149]], no-op) (duration: 00m 40s)
[00:22:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:24:09] <James_F>	 Yay, finally. :-)
[00:24:38] <logmsgbot>	 !log ppchelko@tin Started deploy [trending-edits/deploy@a5716b9]: Replayed events are purged based on current timestamp T160136
[00:24:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:24:44] <stashbot>	 T160136: Replayed events are purged based on current timestamp - https://phabricator.wikimedia.org/T160136
[00:24:52] <Dereckson>	 James_F: VE change on mwdebug1002
[00:26:12] <James_F>	 Hmm. Doesn't seem to be working. One moment.
[00:27:03] <Dereckson>	 according to https://tools.wmflabs.org/versions/ it can be tested on every wiki
[00:27:25] <James_F>	 Certainly, it's not /worse/.
[00:28:14] <James_F>	 Both show the same git hash (of the branch cut), but IIRC that's not real any more.
[00:29:15] * James_F tries debug.
[00:29:49] <icinga-wm>	 RECOVERY - puppet last run on rhenium is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures
[00:31:56] <logmsgbot>	 !log ppchelko@tin Finished deploy [trending-edits/deploy@a5716b9]: Replayed events are purged based on current timestamp T160136 (duration: 07m 17s)
[00:32:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:32:03] <stashbot>	 T160136: Replayed events are purged based on current timestamp - https://phabricator.wikimedia.org/T160136
[00:34:54] <James_F>	 Meh. It doesn't seem to fix the bug. Dereckson, you can deploy it anyway, it doesn't make it worse.
[00:35:12] <wikibugs_>	 (PS3) Dzahn: maintenance: provision /etc/wgetrc [puppet] - https://gerrit.wikimedia.org/r/341264 (https://phabricator.wikimedia.org/T159661) (owner: Dereckson)
[00:35:45] <Dereckson>	 James_F: does the new code lead to an existing, up-to-date API URL?
[00:36:19] <James_F>	 Dereckson: It's the correct code, it just doesn't fix it in prod the way it fixed it in test. Such is life.
[00:36:39] <wikibugs_>	 (CR) jerkins-bot: [V: -1] maintenance: provision /etc/wgetrc [puppet] - https://gerrit.wikimedia.org/r/341264 (https://phabricator.wikimedia.org/T159661) (owner: Dereckson)
[00:36:41] <Dereckson>	 Okay. If it's the correct URL, yes, we can deploy it, I concur.
[00:36:56] <mutante>	 Dereckson: i hope it's ok that i'm being bold and just amending yours
[00:37:09] <logmsgbot>	 !log ppchelko@tin Started deploy [trending-edits/deploy@a5716b9]: Replayed events are purged based on current timestamp T160136
[00:37:10] <mutante>	 not about to merge now 
[00:37:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:37:16] <stashbot>	 T160136: Replayed events are purged based on current timestamp - https://phabricator.wikimedia.org/T160136
[00:37:44] <Dereckson>	 mutante: yes, thanks, I was working and on the road all day and didn't have time to amend
[00:38:45] <logmsgbot>	 !log dereckson@tin Synchronized php-1.29.0-wmf.15/extensions/VisualEditor/modules/ve-mw/init/ve.init.mw.ArticleTargetLoader.js: ArticleTargetLoader: wikitext switch shouldn't require FullRestbaseURL (T158692) (duration: 00m 41s)
[00:38:46] <mutante>	 cool, i will leave comments on it
[00:38:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:38:51] <stashbot>	 T158692: Lost work when switching from wikitext to visual modes on wikitech and private wikis (not using RESTbase) - https://phabricator.wikimedia.org/T158692
[00:39:33] <logmsgbot>	 !log ppchelko@tin Finished deploy [trending-edits/deploy@a5716b9]: Replayed events are purged based on current timestamp T160136 (duration: 02m 23s)
[00:39:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:39:41] <wikibugs_>	 (PS4) Dzahn: maintenance: provision /etc/wgetrc [puppet] - https://gerrit.wikimedia.org/r/341264 (https://phabricator.wikimedia.org/T159661) (owner: Dereckson)
[00:42:47] <wikibugs_>	 (CR) Dzahn: "changes i made:" [puppet] - https://gerrit.wikimedia.org/r/341264 (https://phabricator.wikimedia.org/T159661) (owner: Dereckson)
[00:44:02] <mutante>	 done . and out for now, bbl
[00:44:49] <Dereckson>	 good evening
[00:45:28] <James_F>	 Dereckson: Thanks again.
[00:46:11] <Dereckson>	 You're welcome
[00:47:00] <wikibugs_>	 (CR) Dereckson: "I checked on wasat, I confirm codfw proxy works too:" [puppet] - https://gerrit.wikimedia.org/r/341264 (https://phabricator.wikimedia.org/T159661) (owner: Dereckson)
[00:48:22] <logmsgbot>	 !log ppchelko@tin Started deploy [trending-edits/deploy@1673068]: Replayed events are purged based on current timestamp T160136
[00:48:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:48:28] <stashbot>	 T160136: Replayed events are purged based on current timestamp - https://phabricator.wikimedia.org/T160136
[00:54:46] <logmsgbot>	 !log ppchelko@tin Finished deploy [trending-edits/deploy@1673068]: Replayed events are purged based on current timestamp T160136 (duration: 06m 24s)
[00:54:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:54:53] <stashbot>	 T160136: Replayed events are purged based on current timestamp - https://phabricator.wikimedia.org/T160136
[01:01:49] <icinga-wm>	 PROBLEM - puppet last run on osmium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:13:29] <icinga-wm>	 PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack
[01:15:46] <wikibugs_>	 (PS1) BryanDavis: toolschecker: remove precise checks [puppet] - https://gerrit.wikimedia.org/r/342161 (https://phabricator.wikimedia.org/T94792)
[01:29:49] <icinga-wm>	 RECOVERY - puppet last run on osmium is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures
[02:07:39] <icinga-wm>	 PROBLEM - puppet last run on db1080 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:16:49] <icinga-wm>	 PROBLEM - Check systemd state on mw2247 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[02:26:59] <icinga-wm>	 PROBLEM - Check systemd state on mw2158 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[02:33:57] <logmsgbot>	 !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.15) (duration: 12m 17s)
[02:33:59] <icinga-wm>	 PROBLEM - Check systemd state on mw2155 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[02:34:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:35:40] <icinga-wm>	 RECOVERY - puppet last run on db1080 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[02:39:25] <logmsgbot>	 !log l10nupdate@tin ResourceLoader cache refresh completed at Fri Mar 10 02:39:25 UTC 2017 (duration 5m 28s)
[02:39:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:46:42] <wikibugs_>	 Operations, Graphite, MW-1.27-release (WMF-deploy-2016-04-05_(1.27.0-wmf.20)), MW-1.27-release (WMF-deploy-2016-04-12_(1.27.0-wmf.21)), and 3 others: udp rcvbuferrors and inerrors on graphite1001 - https://phabricator.wikimedia.org/T101141#3090174 (aaron)
[02:46:45] <wikibugs_>	 Operations, MediaWiki-JobRunner, Patch-For-Review, User-Addshore: jobrunner should send statsd in batches - https://phabricator.wikimedia.org/T132327#3090171 (aaron) Open>Resolved a:aaron Should be deployed now. I restarted one server's services manually to check it on ganglia/logs. T...
[02:48:49] <icinga-wm>	 RECOVERY - Check systemd state on mw2247 is OK: OK - running: The system is fully operational
[02:48:59] <icinga-wm>	 RECOVERY - Check systemd state on mw2158 is OK: OK - running: The system is fully operational
[02:48:59] <icinga-wm>	 RECOVERY - Check systemd state on mw2155 is OK: OK - running: The system is fully operational
[02:49:52] <wikibugs_>	 Operations, MediaWiki-JobRunner, Patch-For-Review, User-Addshore: jobrunner should send statsd in batches - https://phabricator.wikimedia.org/T132327#3090175 (aaron) >>! In T132327#3090171, @aaron wrote: > Should be deployed now. I restarted one server's services manually to check it on ganglia/l...
[02:51:50] <AaronSchulz>	 !log Restarted job services for 510142425d268df (statsd batching) after monitoring mw1161
[02:51:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:54:59] <icinga-wm>	 PROBLEM - Check systemd state on mw2157 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[02:55:49] <icinga-wm>	 PROBLEM - Check systemd state on mw2248 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[02:57:59] <icinga-wm>	 PROBLEM - Check systemd state on mw2158 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[02:58:49] <icinga-wm>	 PROBLEM - Check systemd state on mw2250 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[03:00:59] <icinga-wm>	 PROBLEM - Check systemd state on mw2162 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[03:02:59] <icinga-wm>	 PROBLEM - Check systemd state on mw2155 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[03:03:13] <wikibugs_>	 (PS1) Dzahn: planet: convert to profile/role-structure [puppet] - https://gerrit.wikimedia.org/r/342163
[03:04:53] <wikibugs_>	 (CR) jerkins-bot: [V: -1] planet: convert to profile/role-structure [puppet] - https://gerrit.wikimedia.org/r/342163 (owner: Dzahn)
[03:04:59] <icinga-wm>	 PROBLEM - Check systemd state on mw2154 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[03:04:59] <icinga-wm>	 PROBLEM - Check systemd state on mw2159 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[03:06:09] <icinga-wm>	 PROBLEM - Check systemd state on mw2156 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[03:13:59] <icinga-wm>	 PROBLEM - Check systemd state on mw2243 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[03:13:59] <icinga-wm>	 PROBLEM - Check systemd state on mw2160 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[03:14:09] <icinga-wm>	 PROBLEM - Check systemd state on mw2161 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[03:14:59] <icinga-wm>	 PROBLEM - Check systemd state on mw2153 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[03:16:49] <icinga-wm>	 PROBLEM - Check systemd state on mw2247 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[03:18:59] <icinga-wm>	 PROBLEM - Check systemd state on mw2249 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[03:44:22] <wikibugs_>	 (PS1) Dzahn: microsites: convert to profile/role-structure [puppet] - https://gerrit.wikimedia.org/r/342164
[03:47:06] <mutante>	 those Icinga alerts are failed jobrunner service, but all of it is only codfw , and in SAL there are reinstalls 
[03:47:43] <mutante>	 eh, scratch the SAL part, that isn't current
[03:49:37] <mutante>	 AaronSchulz: ^
[03:53:33] <mutante>	 ah, an update failed. puppet ensured package upgrade and then the service failed
[03:54:49] <mutante>	 !log codfw appservers showing "systemd degraded" alerts are failed jobrunner service unit. after puppet-agent "Mediawiki::Jobrunner/Package[jobrunner]/ensure) ensure changed..." ..then jobrunner.service: main process exited, code=exited, status=143/n/a
[03:54:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:56:54] <AaronSchulz>	 mutante: why would a package upgrade trigger?
[03:56:57] <mutante>	 !log codfw appserver jobrunner service fail related to https://gerrit.wikimedia.org/r/#/c/259660/ ?
[03:57:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:57:11] <mutante>	 AaronSchulz: dunno yet, just saw that in syslog
[03:57:16] <mutante>	 puppet-agent did it
[03:57:52] <mutante>	 Mar 10 02:17:30 mw2249 puppet-agent[98949]: (/Stage[main]/Mediawiki::Jobrunner/Package[jobrunner]/ensure) ensure changed 'a0e821661a107b5dbf4616b0f3570fdd93346010' to 'a1eb96c2f30b31cd05f1ef42e61cdfd1421f505a'
[03:57:56] <mutante>	 Mar 10 02:17:30 mw2249 puppet-agent[98949]: (/Package[jobrunner]) Scheduling refresh of Service[jobrunner]
[03:57:59] <mutante>	 Mar 10 02:17:30 mw2249 puppet-agent[98949]: (/Stage[main]/Mediawiki::Jobrunner/Base::Service_unit[jobrunner]/Service[jobrunner]) Triggered 'refresh' from 1 events
[03:58:02] <mutante>	 Mar 10 03:16:38 mw2249 systemd[1]: jobrunner.service: main process exited, code=exited, status=143/n/a
[03:58:05] <mutante>	 Mar 10 03:16:38 mw2249 systemd[1]: Unit jobrunner.service entered failed state.
[03:58:08] <mutante>	 Mar 10 03:16:38 mw2249 puppet-agent[140021]: (/Stage[main]/Mediawiki::Jobrunner/Base::Service_unit[jobrunner]/Service[jobrunner]/ensure) ensure changed 'running' to 'stopped'
[03:59:24] <mutante>	 AaronSchulz: does it make any sense that it would be related to that deploy?
[03:59:57] <mutante>	 started about an hour ago
[04:00:33] <AaronSchulz>	 which was during the salt restart of the two services, but well after the git deploy
[04:01:45] <mutante>	 looks like i can simply start it on this one host
[04:01:51] <mutante>	 want me to just start them?
[04:02:34] <mutante>	 !log mw2249 systemctl start jobrunner - now Active: active (running) 
[04:02:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:02:43] <AaronSchulz>	 sure
[04:03:35] <AaronSchulz>	 running the program itself with --verbose looks fine on 2250 (as it does in eqiad)
[04:05:36] <mutante>	 mw2155, was: Active: failed     but a simple "start" and it's working
[04:06:36] <mutante>	 icinga recovery would be nice now
[04:07:13] <mutante>	 ah, there is "jobchron" service too
[04:07:33] <mutante>	 and that is still failed, in the output of systemctl , which makes icinga unhappy
[04:08:59] <icinga-wm>	 RECOVERY - Check systemd state on mw2155 is OK: OK - running: The system is fully operational
[04:09:04] <mutante>	 !log mw2155 - systemctl start jobchron, systemctl start jobrunner (both were failed but are now active (running))
[04:09:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:09:59] <icinga-wm>	 RECOVERY - Check systemd state on mw2153 is OK: OK - running: The system is fully operational
[04:10:59] <icinga-wm>	 RECOVERY - Check systemd state on mw2154 is OK: OK - running: The system is fully operational
[04:10:59] <icinga-wm>	 RECOVERY - Check systemd state on mw2157 is OK: OK - running: The system is fully operational
[04:11:09] <icinga-wm>	 RECOVERY - Check systemd state on mw2156 is OK: OK - running: The system is fully operational
[04:11:59] <icinga-wm>	 RECOVERY - Check systemd state on mw2158 is OK: OK - running: The system is fully operational
[04:11:59] <icinga-wm>	 RECOVERY - Check systemd state on mw2160 is OK: OK - running: The system is fully operational
[04:11:59] <icinga-wm>	 RECOVERY - Check systemd state on mw2162 is OK: OK - running: The system is fully operational
[04:11:59] <icinga-wm>	 RECOVERY - Check systemd state on mw2159 is OK: OK - running: The system is fully operational
[04:12:19] <mutante>	 !log more mw appservers ...  - systemctl start jobchron, systemctl start jobrunner (both were failed but are now active (running))
[04:12:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:13:59] <icinga-wm>	 PROBLEM - Check systemd state on mw2153 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[04:15:13] <mutante>	 some don't have the jobchron.service
[04:15:22] <mutante>	 like 2147,2148,2149
[04:18:09] <icinga-wm>	 RECOVERY - Check systemd state on mw2161 is OK: OK - running: The system is fully operational
[04:18:49] <mutante>	 ah, no, that was 2247-2249, not 2147-2149
[04:18:50] <icinga-wm>	 RECOVERY - Check systemd state on mw2250 is OK: OK - running: The system is fully operational
[04:19:49] <icinga-wm>	 RECOVERY - Check systemd state on mw2248 is OK: OK - running: The system is fully operational
[04:19:49] <icinga-wm>	 RECOVERY - Check systemd state on mw2247 is OK: OK - running: The system is fully operational
[04:19:59] <icinga-wm>	 RECOVERY - Check systemd state on mw2249 is OK: OK - running: The system is fully operational
[04:19:59] <icinga-wm>	 RECOVERY - Check systemd state on mw2153 is OK: OK - running: The system is fully operational
[04:20:59] <icinga-wm>	 RECOVERY - Check systemd state on mw2243 is OK: OK - running: The system is fully operational
[04:21:08] <mutante>	 AaronSchulz: that's all now per Icinga, it's green again
[04:22:17] <mutante>	 goes afk-ish again
[04:23:29] <icinga-wm>	 PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=335.60 Read Requests/Sec=2380.00 Write Requests/Sec=781.30 KBytes Read/Sec=27552.00 KBytes_Written/Sec=15040.00
[04:25:16] <AaronSchulz>	 ok, thanks
[04:25:49] <icinga-wm>	 PROBLEM - Check systemd state on mw2248 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[04:25:59] <icinga-wm>	 PROBLEM - Check systemd state on mw2157 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[04:26:22] <mutante>	 .. hrmm
[04:26:59] <icinga-wm>	 PROBLEM - Check systemd state on mw2158 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[04:27:49] <icinga-wm>	 PROBLEM - Check systemd state on mw2250 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[04:29:55] <mutante>	 !log codfw mw jobrunner: they start but then fail again shortly after:   mw2248 jobrunner[67314]: [Fri Mar 10 04:23:07 2017] [hphp] [67314:7f6a34b746c0:0:000024] [] LightProcess::closeShadow failed due to exception: Failed in afdt::sendRaw: Broken pipe
[04:30:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:30:01] <mutante>	 AaronSchulz: ^ :/
[04:30:59] <icinga-wm>	 PROBLEM - Check systemd state on mw2162 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[04:32:59] <icinga-wm>	 PROBLEM - Check systemd state on mw2155 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[04:33:59] <icinga-wm>	 PROBLEM - Check systemd state on mw2159 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[04:34:29] <icinga-wm>	 RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=87.70 Read Requests/Sec=200.10 Write Requests/Sec=1.30 KBytes Read/Sec=1878.00 KBytes_Written/Sec=358.80
[04:35:59] <icinga-wm>	 PROBLEM - Check systemd state on mw2154 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[04:37:09] <icinga-wm>	 PROBLEM - Check systemd state on mw2156 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[04:42:59] <icinga-wm>	 PROBLEM - Check systemd state on mw2160 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[04:43:59] <icinga-wm>	 PROBLEM - Check systemd state on mw2243 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[04:44:01] <icinga-wm>	 PROBLEM - Check systemd state on mw2153 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[04:44:09] <icinga-wm>	 PROBLEM - Check systemd state on mw2161 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[04:45:49] <icinga-wm>	 PROBLEM - Check systemd state on mw2247 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[04:48:59] <icinga-wm>	 PROBLEM - Check systemd state on mw2249 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[04:57:18] <wikibugs_>	 Operations, MediaWiki-JobRunner: jobrunner/jobchron services fail in codfw - https://phabricator.wikimedia.org/T160146#3090230 (Dzahn)
[04:58:26] <icinga-wm>	 ACKNOWLEDGEMENT - Check systemd state on mw2153 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn https://phabricator.wikimedia.org/T160146
[04:58:26] <icinga-wm>	 ACKNOWLEDGEMENT - Check systemd state on mw2154 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn https://phabricator.wikimedia.org/T160146
[04:58:26] <icinga-wm>	 ACKNOWLEDGEMENT - Check systemd state on mw2155 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn https://phabricator.wikimedia.org/T160146
[04:58:26] <icinga-wm>	 ACKNOWLEDGEMENT - Check systemd state on mw2156 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn https://phabricator.wikimedia.org/T160146
[04:58:26] <icinga-wm>	 ACKNOWLEDGEMENT - Check systemd state on mw2157 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn https://phabricator.wikimedia.org/T160146
[04:58:27] <icinga-wm>	 ACKNOWLEDGEMENT - Check systemd state on mw2158 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn https://phabricator.wikimedia.org/T160146
[04:58:27] <icinga-wm>	 ACKNOWLEDGEMENT - Check systemd state on mw2159 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn https://phabricator.wikimedia.org/T160146
[04:58:28] <icinga-wm>	 ACKNOWLEDGEMENT - Check systemd state on mw2160 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn https://phabricator.wikimedia.org/T160146
[04:58:29] <icinga-wm>	 ACKNOWLEDGEMENT - Check systemd state on mw2161 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn https://phabricator.wikimedia.org/T160146
[04:58:29] <icinga-wm>	 ACKNOWLEDGEMENT - Check systemd state on mw2162 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn https://phabricator.wikimedia.org/T160146
[04:58:29] <icinga-wm>	 ACKNOWLEDGEMENT - Check systemd state on mw2243 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn https://phabricator.wikimedia.org/T160146
[04:58:30] <icinga-wm>	 ACKNOWLEDGEMENT - Check systemd state on mw2247 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn https://phabricator.wikimedia.org/T160146
[04:58:30] <icinga-wm>	 ACKNOWLEDGEMENT - Check systemd state on mw2248 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn https://phabricator.wikimedia.org/T160146
[04:58:31] <icinga-wm>	 ACKNOWLEDGEMENT - Check systemd state on mw2249 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn https://phabricator.wikimedia.org/T160146
[04:58:53] <closedmouth>	 acknowledged
[05:00:07] <icinga-wm>	 ACKNOWLEDGEMENT - Disk space on prometheus1003 is CRITICAL: DISK CRITICAL - free space: / 115 MB (0% inode=51%): daniel_zahn not in prod yet and known
[05:00:07] <icinga-wm>	 ACKNOWLEDGEMENT - Disk space on prometheus1004 is CRITICAL: DISK CRITICAL - free space: / 1045 MB (3% inode=51%): daniel_zahn not in prod yet and known
[05:22:33] <wikibugs_>	 Operations, MediaWiki-JobRunner: jobrunner/jobchron services fail in codfw - https://phabricator.wikimedia.org/T160146#3090249 (aaron) Probably trebuchet/puppet breakage. I wonder if https://phabricator.wikimedia.org/T129148 would handle this.
[06:03:39] <icinga-wm>	 PROBLEM - puppet last run on ocg1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:29:04] <wikibugs_>	 Operations, IDS-extension, Wikimedia-Extension-setup, I18n: Deploy IDS rendering engine to production - https://phabricator.wikimedia.org/T148693#3090260 (Shoichi) About cache,after discussion with upstream author , cache put in production server side is better than put in wikis-sites side. No ma...
[06:32:39] <icinga-wm>	 RECOVERY - puppet last run on ocg1003 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures
[06:59:50] <icinga-wm>	 PROBLEM - puppet last run on dbproxy1009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:02:59] <wikibugs_>	 (PS1) Marostegui: db-eqiad.php: Depool db1022 [mediawiki-config] - https://gerrit.wikimedia.org/r/342169 (https://phabricator.wikimedia.org/T159414)
[07:06:30] <wikibugs_>	 (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1022 [mediawiki-config] - https://gerrit.wikimedia.org/r/342169 (https://phabricator.wikimedia.org/T159414) (owner: Marostegui)
[07:08:03] <wikibugs_>	 (Merged) jenkins-bot: db-eqiad.php: Depool db1022 [mediawiki-config] - https://gerrit.wikimedia.org/r/342169 (https://phabricator.wikimedia.org/T159414) (owner: Marostegui)
[07:08:39] <icinga-wm>	 PROBLEM - puppet last run on db1092 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:09:34] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1022 - T159414 (duration: 00m 41s)
[07:09:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:09:41] <stashbot>	 T159414: Rampant differences in indexes and PK on s6 (frwiki, jawiki, ruwiki) for revision table - https://phabricator.wikimedia.org/T159414
[07:09:45] <wikibugs_>	 (PS1) Marostegui: db-eqiad.php: Weight 1 for db1061 [mediawiki-config] - https://gerrit.wikimedia.org/r/342170
[07:11:25] <wikibugs_>	 (CR) Marostegui: [C: 2] db-eqiad.php: Weight 1 for db1061 [mediawiki-config] - https://gerrit.wikimedia.org/r/342170 (owner: Marostegui)
[07:12:44] <wikibugs_>	 (Merged) jenkins-bot: db-eqiad.php: Weight 1 for db1061 [mediawiki-config] - https://gerrit.wikimedia.org/r/342170 (owner: Marostegui)
[07:13:48] <marostegui>	 !log Deploy alter table s6 revision table on db1022 - T159414
[07:13:50] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Added weight 1 for db1061 - T159414 (duration: 00m 40s)
[07:13:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:13:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:21:15] <wikibugs_>	 (03PS1) 10Marostegui: dbstore2.my.cnf: Add replication filter for wikis [puppet] - 10https://gerrit.wikimedia.org/r/342171 (https://phabricator.wikimedia.org/T159707)
[07:22:49] <icinga-wm>	 RECOVERY - puppet last run on dbproxy1009 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[07:24:02] <wikibugs_>	 (03CR) 10Marostegui: "This looks good: https://puppet-compiler.wmflabs.org/5728/" [puppet] - 10https://gerrit.wikimedia.org/r/342171 (https://phabricator.wikimedia.org/T159707) (owner: 10Marostegui)
[07:24:09] <wikibugs_>	 (03CR) 10Jcrespo: "I am not sure about this- will we remember to replicate other dbs when they are created?" [puppet] - 10https://gerrit.wikimedia.org/r/342171 (https://phabricator.wikimedia.org/T159707) (owner: 10Marostegui)
[07:26:16] <wikibugs_>	 (03CR) 10Jcrespo: "e.g.: https://www.mediawiki.org/wiki/Extension:Cognate which happens to contain wik." [puppet] - 10https://gerrit.wikimedia.org/r/342171 (https://phabricator.wikimedia.org/T159707) (owner: 10Marostegui)
[07:26:18] <wikibugs_>	 (03CR) 10Marostegui: "Maybe we can just remove all replication filters and ignore mysql database only?" [puppet] - 10https://gerrit.wikimedia.org/r/342171 (https://phabricator.wikimedia.org/T159707) (owner: 10Marostegui)
[07:28:29] <moritzm>	 !log upgrading libarchive on trusty systems (jessie already fixed)
[07:28:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:29:13] <wikibugs_>	 (03CR) 10Jcrespo: "mysql, ops, sys and trash? (I do not know why there is a trash db)" [puppet] - 10https://gerrit.wikimedia.org/r/342171 (https://phabricator.wikimedia.org/T159707) (owner: 10Marostegui)
[07:30:25] <wikibugs_>	 (03CR) 10Marostegui: "yeah I would say: mysql, ops, trash, sys and percona (I have seen the percona database somewhere I reckon)" [puppet] - 10https://gerrit.wikimedia.org/r/342171 (https://phabricator.wikimedia.org/T159707) (owner: 10Marostegui)
[07:30:48] <wikibugs_>	 (03CR) 10Jcrespo: "ok" [puppet] - 10https://gerrit.wikimedia.org/r/342171 (https://phabricator.wikimedia.org/T159707) (owner: 10Marostegui)
[07:33:09] <icinga-wm>	 PROBLEM - Check HHVM threads for leakage on mw1169 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[07:34:36] <wikibugs_>	 (03PS2) 10Marostegui: dbstore2.my.cnf: Add replication ignore filters [puppet] - 10https://gerrit.wikimedia.org/r/342171 (https://phabricator.wikimedia.org/T159707)
[07:36:08] <wikibugs_>	 (03CR) 10Jcrespo: [C: 031] dbstore2.my.cnf: Add replication ignore filters [puppet] - 10https://gerrit.wikimedia.org/r/342171 (https://phabricator.wikimedia.org/T159707) (owner: 10Marostegui)
[07:36:35] <wikibugs_>	 (03CR) 10Marostegui: [C: 032] dbstore2.my.cnf: Add replication ignore filters [puppet] - 10https://gerrit.wikimedia.org/r/342171 (https://phabricator.wikimedia.org/T159707) (owner: 10Marostegui)
[07:36:40] <icinga-wm>	 RECOVERY - puppet last run on db1092 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures
[07:45:47] <wikibugs_>	 06Operations, 10Gerrit, 06Labs, 06Release-Engineering-Team, 07LDAP: Remove user gerrit2 from ldap - https://phabricator.wikimedia.org/T160122#3090315 (10MoritzMuehlenhoff) @demon Can you confirm that the user is no longer needed?
[08:03:15] <wikibugs_>	 (03PS1) 10Muehlenhoff: Add another NDA account [puppet] - 10https://gerrit.wikimedia.org/r/342173
[08:09:29] <wikibugs_>	 (03CR) 10Muehlenhoff: [C: 032] Add another NDA account [puppet] - 10https://gerrit.wikimedia.org/r/342173 (owner: 10Muehlenhoff)
[08:14:37] <wikibugs_>	 (03CR) 10Muehlenhoff: "Why is that needed? Services are stopped prior to removal, so if there's still gerrit2 processes lingering around, then a standard servic" [debs/gerrit] - 10https://gerrit.wikimedia.org/r/342082 (owner: 10Paladox)
[08:19:19] <icinga-wm>	 PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production).
[08:21:10] <icinga-wm>	 RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge.
[08:36:39] <icinga-wm>	 RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack
[08:39:39] <icinga-wm>	 PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack
[08:40:47] <wikibugs_>	 (03PS3) 10Jcrespo: mariadb: Decouple core (mediawiki) role on a separate file [puppet] - 10https://gerrit.wikimedia.org/r/342054 (https://phabricator.wikimedia.org/T150850)
[08:49:52] <wikibugs_>	 (03CR) 10jenkins-bot: db-eqiad.php: Weight 1 for db1061 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342170 (owner: 10Marostegui)
[08:52:34] <wikibugs_>	 (03CR) 10Jcrespo: [C: 032] "https://puppet-compiler.wmflabs.org/5729/" [puppet] - 10https://gerrit.wikimedia.org/r/342054 (https://phabricator.wikimedia.org/T150850) (owner: 10Jcrespo)
[08:52:58] <wikibugs_>	 (03PS2) 10Jcrespo: mariadb: Decouple labsdb mariadb role (deprecated) to a separate file [puppet] - 10https://gerrit.wikimedia.org/r/342060 (https://phabricator.wikimedia.org/T150850)
[08:55:49] <icinga-wm>	 PROBLEM - puppet last run on db1072 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:56:49] <icinga-wm>	 RECOVERY - puppet last run on db1072 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures
[08:59:57] <wikibugs_>	 06Operations, 10Monitoring: Add node_exporter ipvs ipv6 support - https://phabricator.wikimedia.org/T160156#3090494 (10hashar)
[09:00:05] <wikibugs_>	 (03PS1) 10Filippo Giunchedi: Enable ipvs node_exporter collector on lvs boxes [puppet] - 10https://gerrit.wikimedia.org/r/342175
[09:02:21] <wikibugs_>	 06Operations, 10Monitoring, 10Traffic: Add node_exporter ipvs ipv6 support - https://phabricator.wikimedia.org/T160156#3090513 (10ema)
[09:02:45] <wikibugs_>	 06Operations, 10Monitoring, 10Traffic: Add node_exporter ipvs ipv6 support - https://phabricator.wikimedia.org/T160156#3090481 (10ema) p:05Triage>03Normal
[09:03:03] <wikibugs_>	 (03Abandoned) 10Filippo Giunchedi: graphite: switch graphite alerts to graphite2001 [puppet] - 10https://gerrit.wikimedia.org/r/335765 (https://phabricator.wikimedia.org/T157022) (owner: 10Filippo Giunchedi)
[09:04:27] <wikibugs_>	 (03CR) 10Jcrespo: [C: 04-1] "https://puppet-compiler.wmflabs.org/5730/ Is that in hiera or where?" [puppet] - 10https://gerrit.wikimedia.org/r/342060 (https://phabricator.wikimedia.org/T150850) (owner: 10Jcrespo)
[09:07:39] <wikibugs_>	 (03PS3) 10Jcrespo: mariadb: Decouple labsdb mariadb role (deprecated) to a separate file [puppet] - 10https://gerrit.wikimedia.org/r/342060 (https://phabricator.wikimedia.org/T150850)
[09:07:47] <wikibugs_>	 06Operations, 06Office-IT, 07LDAP: Make disabled accounts visible in the corp mirror LDAP replica - https://phabricator.wikimedia.org/T160158#3090521 (10MoritzMuehlenhoff)
[09:07:59] <wikibugs_>	 06Operations, 06Office-IT, 07LDAP: Make disabled accounts visible in the corp mirror LDAP replica - https://phabricator.wikimedia.org/T160158#3090533 (10MoritzMuehlenhoff) p:05Triage>03Normal
[09:10:43] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s3 on db1038 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 309.25 seconds
[09:11:27] <jynus>	 what is going on?
[09:11:50] <marostegui>	 don't know, checking
[09:12:29] <marostegui>	 621 | Copying to tmp table on disk                   | SELECT /* MostlinkedPage::reallyDoQuery  */  pl_namespace AS `namespace`,pl_title AS `title`,COUNT(*
[09:12:44] <jynus>	 slow queries causing issues?
[09:12:49] <jynus>	 not normal
[09:12:56] <marostegui>	 raid looking good
[09:13:40] <jynus>	 srwiki
[09:13:56] <wikibugs_>	 (03CR) 10Filippo Giunchedi: "LGTM, though note that this is racy on boot, i.e. it needs a first puppet run to work. The fix for redis jessie machines (i.e. all but rcs" [puppet] - 10https://gerrit.wikimedia.org/r/268598 (owner: 10Ori.livneh)
[09:14:02] <jynus>	 or is it whatever is happening on kshwiki
[09:15:38] <marostegui>	 kshwiki connection is gone now
[09:15:54] <marostegui>	 ah no, it is there still
[09:16:48] <jynus>	 I can kill it
[09:17:04] <wikibugs_>	 (03CR) 10Filippo Giunchedi: [C: 031] Optional Cassandra client encryption; Enabled on RESTBase Staging [puppet] - 10https://gerrit.wikimedia.org/r/342075 (https://phabricator.wikimedia.org/T111113) (owner: 10Eevans)
[09:17:17] <marostegui>	 sure, go ahead
[09:17:19] <jynus>	 but it would fit more a disk issue
[09:17:39] <jynus>	 something on hw happens but only shows when heavy io
[09:17:39] <marostegui>	 i was checking bbu/raid
[09:17:44] <jynus>	 (like a tmp table)
[09:17:49] <marostegui>	 bbu and policy looks good
[09:17:52] <jynus>	 yes, I trust you
[09:17:55] <jynus>	 I am commenting it
[09:18:10] <marostegui>	 there are several disks with errors
[09:18:16] <marostegui>	 but might be old
[09:18:44] <jynus>	 I think it is the long running query
[09:18:50] <jynus>	 only change since then
[09:18:55] <jynus>	 is the innodb purge lag
[09:19:14] <jynus>	 io in fact is lower since a few minutes ago
[09:19:23] <jynus>	 which again would fit hw-caused issues
[09:19:46] <jynus>	 fsyncs are high, though
[09:20:03] <jynus>	 much more disk writes
[09:20:29] <jynus>	 I think the right way to handle this is to make the server non-transactionally safe
[09:20:34] <jynus>	 to reduce fsyncs
[09:20:55] <marostegui>	 you want me to set it to 2 maybe?
[09:20:55] <jynus>	 ok with that?
[09:21:03] <jynus>	 no, never 2
[09:21:09] <jynus>	 not with a hw raid
[09:21:31] <jynus>	 setting it to 0
[09:21:32] <jynus>	 now
[09:21:35] <marostegui>	 ok!
[09:21:56] <marostegui>	 lag decreasing
[09:22:02] <jynus>	 let's see how it responds
[09:22:42] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s3 on db1038 is OK: OK slave_sql_lag Replication lag: 0.33 seconds
[09:23:24] <wikibugs_>	 (03CR) 10Filippo Giunchedi: [C: 04-1] "This wouldn't work because PDUs in codfw use a different snmp community and facilities::monitor_pdu_service only knows about $pdu_snmp_pas" [puppet] - 10https://gerrit.wikimedia.org/r/341534 (https://phabricator.wikimedia.org/T148541) (owner: 10Filippo Giunchedi)
[09:24:41] <marostegui>	 the disks errors do not increase
[09:25:49] <jynus>	 maybe we should set that on all slow slaves
[09:26:07] <jynus>	 and reimage when it crashes
[09:26:28] <marostegui>	 we only have db1038 so far complaining
[09:26:45] <marostegui>	 I would do so if we had more issues like this
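The values "2" and "0" discussed above are never tied to a named variable in this exchange. Assuming they refer to InnoDB's `innodb_flush_log_at_trx_commit` (an assumption, although that variable is named later in the log in an unrelated task comment), the trade-off between fewer fsyncs and crash safety can be sketched roughly as below. The helper names `survives` and `set_statement` are hypothetical and exist only for illustration.

```python
# Sketch only: summarizes the documented meanings of InnoDB's
# innodb_flush_log_at_trx_commit values. It ASSUMES this is the setting
# meant by "2" and "0" in the conversation; the log does not say so there.

FLUSH_MODES = {
    # 1: redo log written and fsynced at every commit (default, fully durable)
    1: {"write_on_commit": True, "fsync_on_commit": True},
    # 2: redo log written to the OS at commit, fsynced about once per second
    2: {"write_on_commit": True, "fsync_on_commit": False},
    # 0: redo log written and fsynced about once per second, nothing at commit
    0: {"write_on_commit": False, "fsync_on_commit": False},
}


def survives(mode: int, crash: str) -> bool:
    """Return True if the most recent commits are guaranteed to survive.

    crash is "mysqld" (server process dies, OS keeps running) or
    "host" (OS crash or power loss).
    """
    flags = FLUSH_MODES[mode]
    if crash == "mysqld":
        # Data already handed to the OS outlives the server process.
        return flags["write_on_commit"]
    if crash == "host":
        # Only data that was fsynced is guaranteed to be on stable storage.
        return flags["fsync_on_commit"]
    raise ValueError(f"unknown crash type: {crash!r}")


def set_statement(mode: int) -> str:
    """Build the SQL statement that changes the setting at runtime."""
    if mode not in FLUSH_MODES:
        raise ValueError(f"unsupported mode: {mode!r}")
    return f"SET GLOBAL innodb_flush_log_at_trx_commit = {mode};"
```

Under this reading, mode 0 issues the fewest log writes and fsyncs per commit, at the cost of possibly losing about a second of transactions on any crash, which would be consistent with the "reimage when it crashes" remark.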
[09:43:09] <icinga-wm>	 PROBLEM - Check HHVM threads for leakage on mw1169 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[09:43:37] <wikibugs_>	 06Operations, 13Patch-For-Review: Cross-validation of account data - https://phabricator.wikimedia.org/T142836#3090586 (10MoritzMuehlenhoff)
[09:44:36] <wikibugs_>	 06Operations: Enhance account handling (meta bug) - https://phabricator.wikimedia.org/T142815#3090591 (10MoritzMuehlenhoff)
[09:44:38] <wikibugs_>	 06Operations, 13Patch-For-Review: Cross-validation of account data - https://phabricator.wikimedia.org/T142836#2547672 (10MoritzMuehlenhoff) 05Open>03Resolved The cross-checking is running daily on terbium. It implements the following checks:  - Every account in the privileged "wmf" group should be registe...
[09:51:27] <wikibugs_>	 (03PS4) 10Jcrespo: mariadb: Decouple labsdb mariadb role (deprecated) to a separate file [puppet] - 10https://gerrit.wikimedia.org/r/342060 (https://phabricator.wikimedia.org/T150850)
[09:54:49] <icinga-wm>	 PROBLEM - puppet last run on db1074 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:57:09] <icinga-wm>	 PROBLEM - Check HHVM threads for leakage on mw1169 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[09:57:27] <wikibugs_>	 (03CR) 10Jcrespo: [C: 031] "This should be ok now https://puppet-compiler.wmflabs.org/5732/labsdb1001.eqiad.wmnet/ , but I want the ok from the change of debdeploy." [puppet] - 10https://gerrit.wikimedia.org/r/342060 (https://phabricator.wikimedia.org/T150850) (owner: 10Jcrespo)
[09:59:42] <wikibugs_>	 06Operations: Enhance account handling (meta bug) - https://phabricator.wikimedia.org/T142815#3090609 (10MoritzMuehlenhoff)
[09:59:45] <wikibugs_>	 06Operations: Require/track Phabricator username - https://phabricator.wikimedia.org/T142830#3090605 (10MoritzMuehlenhoff) 05Open>03declined This isn't needed anymore. NDA management has changed towards a new workflow which doesn't rely on Phabricator any longer.   One remaining use case is for the offboardi...
[10:01:01] <wikibugs_>	 (03PS2) 10Jcrespo: mariadb: Decouple mariadb wikitech role to a separate file [puppet] - 10https://gerrit.wikimedia.org/r/342064 (https://phabricator.wikimedia.org/T150850)
[10:01:50] <wikibugs_>	 (03CR) 10Gehel: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/336840 (owner: 10Hashar)
[10:02:29] <wikibugs_>	 (03CR) 10Gehel: [C: 031] "Only make sense with the child change, but LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/340420 (owner: 10Hashar)
[10:05:09] <wikibugs_>	 06Operations, 10DBA, 13Patch-For-Review: Install and reimage dbstore1001 as jessie - https://phabricator.wikimedia.org/T153768#3090617 (10Marostegui) The most recent backups for dbstore1001 are still only from Feb, and not March: ``` +--------+-------+----------+-------------------+---------------------+----...
[10:10:16] <wikibugs_>	 (03CR) 10Jcrespo: [C: 032] mariadb: Decouple mariadb wikitech role to a separate file [puppet] - 10https://gerrit.wikimedia.org/r/342064 (https://phabricator.wikimedia.org/T150850) (owner: 10Jcrespo)
[10:12:43] <wikibugs_>	 (03CR) 10Giuseppe Lavagetto: [C: 031] facilities: add row and site parameters for pdus [puppet] - 10https://gerrit.wikimedia.org/r/341533 (https://phabricator.wikimedia.org/T148541) (owner: 10Filippo Giunchedi)
[10:15:49] <wikibugs_>	 (03PS2) 10DCausse: [cirrus] Config update for elasticsearch 5.x in beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342040 (owner: 10EBernhardson)
[10:17:32] <wikibugs_>	 (03CR) 10Hashar: [C: 032] [cirrus] Config update for elasticsearch 5.x in beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342040 (owner: 10EBernhardson)
[10:19:07] <wikibugs_>	 (03Merged) 10jenkins-bot: [cirrus] Config update for elasticsearch 5.x in beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342040 (owner: 10EBernhardson)
[10:22:49] <icinga-wm>	 RECOVERY - puppet last run on db1074 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures
[10:27:35] <wikibugs_>	 (03PS1) 10Muehlenhoff: Don't make realname optional in account check script [puppet] - 10https://gerrit.wikimedia.org/r/342187
[10:27:46] <wikibugs_>	 (03PS2) 10Muehlenhoff: Don't make realname optional in account check script [puppet] - 10https://gerrit.wikimedia.org/r/342187
[10:28:15] <wikibugs_>	 (03CR) 10Hashar: "I keep it simple so that can be merged and used as a foundation to write other tests." [puppet] - 10https://gerrit.wikimedia.org/r/340420 (owner: 10Hashar)
[10:29:39] <wikibugs_>	 (03PS1) 10Jcrespo: mariadb: Decouple maintenance mariadb role to a separate file [puppet] - 10https://gerrit.wikimedia.org/r/342188 (https://phabricator.wikimedia.org/T150850)
[10:33:08] <wikibugs_>	 (03CR) 10jenkins-bot: [V: 04-1] mariadb: Decouple maintenance mariadb role to a separate file [puppet] - 10https://gerrit.wikimedia.org/r/342188 (https://phabricator.wikimedia.org/T150850) (owner: 10Jcrespo)
[10:35:31] <wikibugs_>	 (03PS2) 10Jcrespo: mariadb: Decouple maintenance mariadb role to a separate file [puppet] - 10https://gerrit.wikimedia.org/r/342188 (https://phabricator.wikimedia.org/T150850)
[10:38:35] <jynus>	 jenkins may be getting a bit overloaded
[10:38:58] <jynus>	 it is starting to 503 me
[10:39:53] <wikibugs_>	 (03CR) 10Jcrespo: [C: 032] "https://puppet-compiler.wmflabs.org/5735/terbium.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/342188 (https://phabricator.wikimedia.org/T150850) (owner: 10Jcrespo)
[10:40:48] <marostegui>	 jynus: happened to me around 8am too
[10:48:09] <icinga-wm>	 PROBLEM - Check HHVM threads for leakage on mw1169 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[10:50:39] <wikibugs_>	 (03CR) 10Muehlenhoff: [C: 032] Don't make realname optional in account check script [puppet] - 10https://gerrit.wikimedia.org/r/342187 (owner: 10Muehlenhoff)
[10:50:43] <wikibugs_>	 (03PS3) 10Muehlenhoff: Don't make realname optional in account check script [puppet] - 10https://gerrit.wikimedia.org/r/342187
[10:51:09] <wikibugs_>	 (03CR) 10Giuseppe Lavagetto: [C: 031] Add support for batch processing [software/cumin] - 10https://gerrit.wikimedia.org/r/341310 (https://phabricator.wikimedia.org/T159968) (owner: 10Volans)
[10:51:23] <wikibugs_>	 (03CR) 10Muehlenhoff: [V: 032 C: 032] Don't make realname optional in account check script [puppet] - 10https://gerrit.wikimedia.org/r/342187 (owner: 10Muehlenhoff)
[10:51:26] <wikibugs_>	 (03PS5) 10Volans: Add support for batch processing [software/cumin] - 10https://gerrit.wikimedia.org/r/341310 (https://phabricator.wikimedia.org/T159968)
[10:53:08] <wikibugs_>	 (03CR) 10Hashar: "Thanks!! :]" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/342128 (https://phabricator.wikimedia.org/T157785) (owner: 10Hashar)
[10:53:23] <wikibugs_>	 (03PS2) 10Hashar: (WIP) contint: migrate git-daemon to systemd [puppet] - 10https://gerrit.wikimedia.org/r/342128 (https://phabricator.wikimedia.org/T157785)
[10:55:32] <wikibugs_>	 (03CR) 10Volans: [C: 032] "Improved some comments in the last patch set." [software/cumin] - 10https://gerrit.wikimedia.org/r/341310 (https://phabricator.wikimedia.org/T159968) (owner: 10Volans)
[10:58:59] <icinga-wm>	 PROBLEM - puppet last run on mw1199 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[11:04:01] <wikibugs_>	 (03Merged) 10jenkins-bot: Add support for batch processing [software/cumin] - 10https://gerrit.wikimedia.org/r/341310 (https://phabricator.wikimedia.org/T159968) (owner: 10Volans)
[11:06:09] <icinga-wm>	 PROBLEM - Check HHVM threads for leakage on mw1169 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[11:07:44] <wikibugs_>	 (03PS1) 10Elukey: Add new MW appservers and api-appservers in codfw [puppet] - 10https://gerrit.wikimedia.org/r/342194 (https://phabricator.wikimedia.org/T155180)
[11:09:12] <wikibugs_>	 (03PS2) 10Elukey: Add new MW appservers and api-appservers in codfw [puppet] - 10https://gerrit.wikimedia.org/r/342194 (https://phabricator.wikimedia.org/T155180)
[11:12:12] <wikibugs_>	 (03CR) 10Hashar: (WIP) contint: migrate git-daemon to systemd (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/342128 (https://phabricator.wikimedia.org/T157785) (owner: 10Hashar)
[11:12:28] <wikibugs_>	 (03PS3) 10Hashar: (WIP) contint: migrate git-daemon to systemd [puppet] - 10https://gerrit.wikimedia.org/r/342128 (https://phabricator.wikimedia.org/T157785)
[11:13:31] <wikibugs_>	 (03PS4) 10Hashar: (WIP) contint: migrate git-daemon to systemd [puppet] - 10https://gerrit.wikimedia.org/r/342128 (https://phabricator.wikimedia.org/T157785)
[11:15:21] <wikibugs_>	 (03PS5) 10Hashar: (WIP) contint: migrate git-daemon to systemd [puppet] - 10https://gerrit.wikimedia.org/r/342128 (https://phabricator.wikimedia.org/T157785)
[11:17:30] <wikibugs_>	 (03PS6) 10Hashar: (WIP) contint: migrate git-daemon to systemd [puppet] - 10https://gerrit.wikimedia.org/r/342128 (https://phabricator.wikimedia.org/T157785)
[11:18:46] <wikibugs_>	 (03PS7) 10Hashar: (WIP) contint: migrate git-daemon to systemd [puppet] - 10https://gerrit.wikimedia.org/r/342128 (https://phabricator.wikimedia.org/T157785)
[11:19:08] <hashar>	 (yeah I should do that directly on the puppet master)
[11:19:11] <hashar>	 sorry bout the spam
[11:23:09] <icinga-wm>	 PROBLEM - Check HHVM threads for leakage on mw1169 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[11:26:59] <icinga-wm>	 RECOVERY - puppet last run on mw1199 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
[11:43:55] <wikibugs_>	 (03PS8) 10Hashar: contint: migrate git-daemon to systemd [puppet] - 10https://gerrit.wikimedia.org/r/342128 (https://phabricator.wikimedia.org/T157785)
[11:52:25] <wikibugs_>	 (03CR) 10Muehlenhoff: contint: migrate git-daemon to systemd (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/342128 (https://phabricator.wikimedia.org/T157785) (owner: 10Hashar)
[11:58:09] <icinga-wm>	 PROBLEM - Check HHVM threads for leakage on mw1169 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[12:20:03] <wikibugs_>	 06Operations, 10Continuous-Integration-Config: Enforce reference to Phabricator task for all commits to modules/admin/data/data.yaml - https://phabricator.wikimedia.org/T142827#3090778 (10hashar)
[12:20:52] <wikibugs_>	 06Operations, 07HHVM, 13Patch-For-Review, 07Upstream: Build / migrate to HHVM 3.18 - https://phabricator.wikimedia.org/T158176#3090782 (10MoritzMuehlenhoff)
[12:30:52] <wikibugs_>	 (03CR) 10Hashar: contint: migrate git-daemon to systemd (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/342128 (https://phabricator.wikimedia.org/T157785) (owner: 10Hashar)
[12:31:04] <wikibugs_>	 (03PS9) 10Hashar: contint: migrate git-daemon to systemd [puppet] - 10https://gerrit.wikimedia.org/r/342128 (https://phabricator.wikimedia.org/T157785)
[12:53:22] <wikibugs_>	 (03PS3) 10Elukey: Add new MW appservers and api-appservers in codfw [puppet] - 10https://gerrit.wikimedia.org/r/342194 (https://phabricator.wikimedia.org/T155180)
[13:08:09] <wikibugs_>	 06Operations, 07HHVM, 13Patch-For-Review, 07Upstream: Build / migrate to HHVM 3.18 - https://phabricator.wikimedia.org/T158176#3090851 (10MoritzMuehlenhoff) >>! In T158176#3074551, @MoritzMuehlenhoff wrote: > The test failure is benign; it tests a new feature introduced into the simple JSON parser uncondit...
[13:11:09] <icinga-wm>	 PROBLEM - Check HHVM threads for leakage on mw1169 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[13:18:35] <wikibugs_>	 (03PS4) 10Elukey: Add new MW appservers and api-appservers in codfw [puppet] - 10https://gerrit.wikimedia.org/r/342194 (https://phabricator.wikimedia.org/T155180)
[13:20:05] <wikibugs_>	 06Operations, 06Analytics-Kanban, 10Traffic, 06Wikipedia-iOS-App-Backlog, and 2 others: Periodic 500s from piwik.wikimedia.org - https://phabricator.wikimedia.org/T154558#3090867 (10elukey) I followed https://piwik.org/docs/optimize-how-to/ and applied `set global innodb_flush_log_at_trx_commit=2;` as root...
[13:21:49] <icinga-wm>	 PROBLEM - puppet last run on mc1035 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:22:09] <icinga-wm>	 PROBLEM - Check HHVM threads for leakage on mw1169 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[13:27:09] <icinga-wm>	 PROBLEM - puppet last run on db1077 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:35:40] <hashar>	 !log Restarting Jenkins. Deadlocks in ssh connections.  T160168
[13:35:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:35:47] <stashbot>	 T160168: Zuul postmerge blocked on beta-mediawiki-config-update-eqiad - https://phabricator.wikimedia.org/T160168
[13:40:45] <wikibugs_>	 (03CR) 10jenkins-bot: db-eqiad.php: Depool db1022 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342169 (https://phabricator.wikimedia.org/T159414) (owner: 10Marostegui)
[13:40:47] <wikibugs_>	 (03CR) 10jenkins-bot: (no-op) Remove setting of unused $wgPercentHHVM (no longer exists) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342149 (owner: 10Krinkle)
[13:40:51] <wikibugs_>	 (03CR) 10jenkins-bot: [cirrus] Config update for elasticsearch 5.x in beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342040 (owner: 10EBernhardson)
[13:41:53] <wikibugs_>	 (03CR) 10Muehlenhoff: [C: 031] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/342128 (https://phabricator.wikimedia.org/T157785) (owner: 10Hashar)
[13:42:24] <hashar>	 moritzm: thanks for the review! wanna merge it ? :)  I can baby sit it on prod boxes
[13:44:46] <moritzm>	 currently need to finish something else, will ping you in 15-30 mins?
[13:44:52] <hashar>	 sure!
[13:48:03] <wikibugs_>	 (03PS10) 10Zppix: contint: migrate git-daemon to systemd [puppet] - 10https://gerrit.wikimedia.org/r/342128 (https://phabricator.wikimedia.org/T157785) (owner: 10Hashar)
[13:48:11] <wikibugs_>	 (03CR) 10Zppix: [C: 031] contint: migrate git-daemon to systemd [puppet] - 10https://gerrit.wikimedia.org/r/342128 (https://phabricator.wikimedia.org/T157785) (owner: 10Hashar)
[13:49:16] <Zppix>	 jouncebot:  next
[13:49:16] <jouncebot>	 In 65 hour(s) and 10 minute(s): Holiday (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170313T0700)
[13:49:51] <icinga-wm>	 RECOVERY - puppet last run on mc1035 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures
[13:49:53] <Zppix>	 out of curiosity should that be something jouncebot really needs to detect as a deployment
[13:52:27] <wikibugs_>	 (03CR) 10Elukey: [C: 032] "Looks good from https://puppet-compiler.wmflabs.org/5739/" [puppet] - 10https://gerrit.wikimedia.org/r/342194 (https://phabricator.wikimedia.org/T155180) (owner: 10Elukey)
[13:54:09] <icinga-wm>	 RECOVERY - puppet last run on db1077 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
[13:54:21] <elukey>	 waiting a bit before merging, last check in neodym
[13:54:33] <elukey>	 ping me if you are in a hurry :)
[13:55:55] <elukey>	 merged
[13:58:27] <elukey>	 !log added 3 new MW api-appservers (mw2251-53) and 7 new appservers (mw2254-60) to codfw
[13:58:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:58:48] <elukey>	 running puppet on a couple of them to see if everything works fine
[14:01:19] <icinga-wm>	 PROBLEM - DPKG on mw2251 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[14:03:19] <icinga-wm>	 RECOVERY - DPKG on mw2251 is OK: All packages OK
[14:04:47] <wikibugs_>	 (03CR) 10Muehlenhoff: [C: 032] contint: migrate git-daemon to systemd [puppet] - 10https://gerrit.wikimedia.org/r/342128 (https://phabricator.wikimedia.org/T157785) (owner: 10Hashar)
[14:04:49] <icinga-wm>	 PROBLEM - puppet last run on labvirt1014 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[14:04:52] <wikibugs_>	 (03PS11) 10Muehlenhoff: contint: migrate git-daemon to systemd [puppet] - 10https://gerrit.wikimedia.org/r/342128 (https://phabricator.wikimedia.org/T157785) (owner: 10Hashar)
[14:04:59] <icinga-wm>	 PROBLEM - DPKG on mw2252 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[14:05:27] <hashar>	 !log contint1001 and contint2001 :  Migrating git-daemon to systemd . Would stop zuul merger briefly
[14:05:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:06:18] <elukey>	 sorry for the mw22* broken pkg spam
[14:06:19] <icinga-wm>	 PROBLEM - DPKG on mw2251 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[14:07:00] <icinga-wm>	 RECOVERY - DPKG on mw2252 is OK: All packages OK
[14:07:01] <jynus>	 I am not liking the general query patterns I am seeing
[14:07:59] <icinga-wm>	 PROBLEM - zuul_merger_service_running on contint2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/share/python/zuul/bin/python /usr/bin/zuul-merger
[14:09:09] <icinga-wm>	 PROBLEM - Check systemd state on mw2251 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:09:10] <icinga-wm>	 PROBLEM - configured eth on mw2251 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:09:19] <icinga-wm>	 PROBLEM - Disk space on mw2251 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:09:19] <icinga-wm>	 PROBLEM - salt-minion processes on mw2251 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:09:19] <icinga-wm>	 PROBLEM - dhclient process on mw2251 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:09:19] <icinga-wm>	 PROBLEM - puppet last run on mw2251 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:09:29] <icinga-wm>	 PROBLEM - MD RAID on mw2251 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:09:59] <icinga-wm>	 PROBLEM - DPKG on mw2256 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[14:09:59] <icinga-wm>	 PROBLEM - DPKG on mw2252 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[14:10:19] <elukey>	 trying to schedule downtime
[14:10:59] <icinga-wm>	 PROBLEM - puppet last run on mw1182 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[14:11:05] <elukey>	 done
[14:12:59] <icinga-wm>	 RECOVERY - zuul_merger_service_running on contint2001 is OK: PROCS OK: 1 process with regex args ^/usr/share/python/zuul/bin/python /usr/bin/zuul-merger
[14:13:19] <icinga-wm>	 RECOVERY - DPKG on mw2256 is OK: All packages OK
[14:13:59] <icinga-wm>	 PROBLEM - git_daemon_running on contint2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/lib/git-core/git-daemon --user
[14:14:09] <icinga-wm>	 PROBLEM - Check HHVM threads for leakage on mw1169 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[14:17:09] <icinga-wm>	 PROBLEM - git_daemon_running on contint1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/lib/git-core/git-daemon --user
[14:18:59] <icinga-wm>	 PROBLEM - zuul_merger_service_running on contint2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/share/python/zuul/bin/python /usr/bin/zuul-merger
[14:19:59] <icinga-wm>	 RECOVERY - zuul_merger_service_running on contint2001 is OK: PROCS OK: 1 process with regex args ^/usr/share/python/zuul/bin/python /usr/bin/zuul-merger
[14:21:56] <wikibugs_>	 (03PS1) 10Ema: Revert "depool eqiad front edge traffic" [dns] - 10https://gerrit.wikimedia.org/r/342211
[14:23:55] <hashar>	 I am terrible
[14:24:27] <wikibugs_>	 (03CR) 10Ema: [V: 032 C: 032] Revert "depool eqiad front edge traffic" [dns] - 10https://gerrit.wikimedia.org/r/342211 (owner: 10Ema)
[14:26:19] <icinga-wm>	 RECOVERY - Disk space on mw2251 is OK: DISK OK
[14:26:29] <icinga-wm>	 RECOVERY - configured eth on mw2251 is OK: OK - interfaces up
[14:26:29] <icinga-wm>	 RECOVERY - puppet last run on mw2251 is OK: OK: Puppet is currently enabled, last run 55 minutes ago with 0 failures
[14:26:39] <icinga-wm>	 RECOVERY - salt-minion processes on mw2251 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[14:26:39] <icinga-wm>	 RECOVERY - dhclient process on mw2251 is OK: PROCS OK: 0 processes with command name dhclient
[14:26:39] <icinga-wm>	 RECOVERY - MD RAID on mw2251 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[14:28:16] <wikibugs_>	 (03PS1) 10Hashar: zuul: fix git-daemon monitoring probe [puppet] - 10https://gerrit.wikimedia.org/r/342212
[14:28:56] <wikibugs_>	 (03CR) 10Hashar: "The extra --user was because previously the process was forking." [puppet] - 10https://gerrit.wikimedia.org/r/342212 (owner: 10Hashar)
[14:29:33] <icinga-wm>	 RECOVERY - DPKG on mw2251 is OK: All packages OK
[14:30:30] <icinga-wm>	 ACKNOWLEDGEMENT - git_daemon_running on contint1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/lib/git-core/git-daemon --user amusso I broke it. Fix is https://gerrit.wikimedia.org/r/#/c/342212/
[14:30:40] <subbu>	 _joe_, can you take a look at https://gerrit.wikimedia.org/r/#/c/338950/ again? thanks.
[14:31:08] <icinga-wm>	 ACKNOWLEDGEMENT - git_daemon_running on contint2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/lib/git-core/git-daemon --user amusso https://gerrit.wikimedia.org/r/#/c/342212/
[14:33:03] <icinga-wm>	 RECOVERY - puppet last run on labvirt1014 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures
[14:39:03] <icinga-wm>	 RECOVERY - puppet last run on mw1182 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[14:39:35] <wikibugs_>	 (03CR) 10Muehlenhoff: [C: 032] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/342212 (owner: 10Hashar)
[14:42:04] <icinga-wm>	 PROBLEM - Check HHVM threads for leakage on mw1169 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[14:42:26] <icinga-wm>	 PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[14:44:03] <icinga-wm>	 RECOVERY - Check HHVM threads for leakage on mw1169 is OK: OK
[14:44:13] <icinga-wm>	 PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0]
[14:46:16] <elukey>	 ema --^
[14:46:43] <elukey>	 it seems to be going down
[14:47:10] <elukey>	 https://grafana.wikimedia.org/dashboard/db/varnish-aggregate-client-status-codes?var-site=All&var-cache_type=upload&var-status_type=4&var-status_type=5&from=now-6h&to=now - only upload..
[14:47:18] <jynus>	 to be fair, I would have called godog first in this case
[14:47:22] <jynus>	 :-)
[14:47:26] <jynus>	 503s
[14:47:42] <elukey>	 jynus: ah yes I saw the repool and pinged him without thinking :)
[14:49:13] <icinga-wm>	 RECOVERY - git_daemon_running on contint1001 is OK: PROCS OK: 1 process with regex args ^/usr/lib/git-core/git-daemon
[14:51:53] <icinga-wm>	 PROBLEM - Check Varnish expiry mailbox lag on cp1074 is CRITICAL: CRITICAL: expiry mailbox lag is 803484
[14:52:50] <jynus>	 it could still be
[14:53:03] <icinga-wm>	 PROBLEM - Check Varnish expiry mailbox lag on cp1049 is CRITICAL: CRITICAL: expiry mailbox lag is 643909
[14:53:07] <wikibugs_>	 (03PS1) 10Volans: PuppetDB: automatically ucfirst resource names [software/cumin] - 10https://gerrit.wikimedia.org/r/342214 (https://phabricator.wikimedia.org/T159970)
[15:02:32] <icinga-wm>	 RECOVERY - DPKG on mw2252 is OK: All packages OK
[15:03:02] <icinga-wm>	 RECOVERY - git_daemon_running on contint2001 is OK: PROCS OK: 1 process with regex args ^/usr/lib/git-core/git-daemon
[15:03:22] <marostegui>	 !log Stop slave db2033 for maintenance - T159707
[15:03:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:03:29] <stashbot>	 T159707: Import x1 on dbstore2001 - https://phabricator.wikimedia.org/T159707
[15:04:12] <icinga-wm>	 PROBLEM - puppet last run on ms-be1026 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[15:04:30] <wikibugs_>	 (03PS1) 10Jcrespo: mariadb: Decouple ferm mariadb common class into a separate file [puppet] - 10https://gerrit.wikimedia.org/r/342217 (https://phabricator.wikimedia.org/T150850)
[15:06:12] <icinga-wm>	 PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[15:08:12] <icinga-wm>	 RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[15:08:22] <icinga-wm>	 RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[15:09:25] <wikibugs_>	 (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1022" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342218
[15:09:33] <wikibugs_>	 (03CR) 10jerkins-bot: [V: 04-1] Revert "db-eqiad.php: Depool db1022" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342218 (owner: 10Marostegui)
[15:10:19] <marostegui>	 -1? what?
[15:10:46] <marostegui>	 ah!
[15:13:05] <wikibugs_>	 (03PS1) 10Marostegui: db-eqiad.php: Repool db1022 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342219 (https://phabricator.wikimedia.org/T159414)
[15:15:50] <wikibugs_>	 (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1022 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342219 (https://phabricator.wikimedia.org/T159414) (owner: 10Marostegui)
[15:17:39] <wikibugs_>	 (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1022 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342219 (https://phabricator.wikimedia.org/T159414) (owner: 10Marostegui)
[15:17:48] <wikibugs_>	 (03CR) 10jenkins-bot: db-eqiad.php: Repool db1022 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342219 (https://phabricator.wikimedia.org/T159414) (owner: 10Marostegui)
[15:18:43] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1022 - T159414 (duration: 00m 45s)
[15:18:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:18:49] <stashbot>	 T159414: Rampant differences in indexes and PK on s6 (frwiki, jawiki, ruwiki) for revision table - https://phabricator.wikimedia.org/T159414
[15:25:17] <wikibugs_>	 (03PS1) 10Marostegui: db-eqiad.php: Depool db1030 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342220 (https://phabricator.wikimedia.org/T159414)
[15:26:31] <_joe_>	 subbu: will do
[15:27:14] <wikibugs_>	 (03CR) 10Eevans: [C: 031] "Ready to be merged." [puppet] - 10https://gerrit.wikimedia.org/r/342075 (https://phabricator.wikimedia.org/T111113) (owner: 10Eevans)
[15:28:02] <icinga-wm>	 PROBLEM - git_daemon_running on contint2001 is CRITICAL: PROCS CRITICAL: 2 processes with regex args ^/usr/lib/git-core/git-daemon
[15:28:09] <wikibugs_>	 (03CR) 10Giuseppe Lavagetto: [C: 031] Ruthenium VisualDiff: Test w/ local Parsoid instead of prod Parsoid [puppet] - 10https://gerrit.wikimedia.org/r/338950 (owner: 10Subramanya Sastry)
[15:29:02] <icinga-wm>	 RECOVERY - git_daemon_running on contint2001 is OK: PROCS OK: 1 process with regex args ^/usr/lib/git-core/git-daemon
[15:29:25] <wikibugs_>	 (03PS3) 10Eevans: Optional Cassandra client encryption; Enabled on RESTBase Staging [puppet] - 10https://gerrit.wikimedia.org/r/342075 (https://phabricator.wikimedia.org/T111113)
[15:32:05] <wikibugs_>	 (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1030 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342220 (https://phabricator.wikimedia.org/T159414) (owner: 10Marostegui)
[15:33:12] <icinga-wm>	 RECOVERY - puppet last run on ms-be1026 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures
[15:33:22] <wikibugs_>	 06Operations, 05DC-Switchover-Prep-Q3-2016-17, 07Epic, 07Wikimedia-Multiple-active-datacenters: Switchover automation - https://phabricator.wikimedia.org/T160178#3091181 (10Joe)
[15:33:40] <wikibugs_>	 06Operations, 05DC-Switchover-Prep-Q3-2016-17, 07Epic, 07Wikimedia-Multiple-active-datacenters: MediaWiki Datacenter Switchover automation - https://phabricator.wikimedia.org/T160178#3091181 (10Joe)
[15:33:49] <wikibugs_>	 (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1030 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342220 (https://phabricator.wikimedia.org/T159414) (owner: 10Marostegui)
[15:34:01] <wikibugs_>	 (03CR) 10jenkins-bot: db-eqiad.php: Depool db1030 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342220 (https://phabricator.wikimedia.org/T159414) (owner: 10Marostegui)
[15:34:38] <wikibugs_>	 06Operations, 10MediaWiki-Configuration, 06Performance-Team, 06Services (watching), and 5 others: Allow integration of data from etcd into the MediaWiki configuration - https://phabricator.wikimedia.org/T156924#2990133 (10Joe)
[15:34:43] <wikibugs_>	 06Operations, 10MediaWiki-Configuration, 06Performance-Team, 13Patch-For-Review, and 6 others: DNS: dynamically generate entries for service discovery - https://phabricator.wikimedia.org/T156100#2964028 (10Joe)
[15:34:45] <wikibugs_>	 06Operations, 05DC-Switchover-Prep-Q3-2016-17, 07Epic, 07Wikimedia-Multiple-active-datacenters: MediaWiki Datacenter Switchover automation - https://phabricator.wikimedia.org/T160178#3091181 (10Joe)
[15:37:16] <wikibugs_>	 (03CR) 10Giuseppe Lavagetto: [C: 04-1] PuppetDB: automatically ucfirst resource names (031 comment) [software/cumin] - 10https://gerrit.wikimedia.org/r/342214 (https://phabricator.wikimedia.org/T159970) (owner: 10Volans)
[15:40:15] <wikibugs_>	 (03CR) 10Paladox: "> Why is that needed? Services are stopped prior to removal, so if" [debs/gerrit] - 10https://gerrit.wikimedia.org/r/342082 (owner: 10Paladox)
[15:41:08] <wikibugs_>	 (03PS2) 10Volans: PuppetDB: automatically ucfirst resource names [software/cumin] - 10https://gerrit.wikimedia.org/r/342214 (https://phabricator.wikimedia.org/T159970)
[15:42:12] <wikibugs_>	 (03CR) 10Volans: "see inline" (031 comment) [software/cumin] - 10https://gerrit.wikimedia.org/r/342214 (https://phabricator.wikimedia.org/T159970) (owner: 10Volans)
[15:44:35] <wikibugs_>	 (03PS1) 10Muehlenhoff: Setup "bot" credentials file for Phabricator support in offboarding script [puppet] - 10https://gerrit.wikimedia.org/r/342222
[15:47:03] <icinga-wm>	 PROBLEM - Check HHVM threads for leakage on mw1169 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[15:49:30] <wikibugs_>	 06Operations, 06Analytics-Kanban, 10Traffic, 06Wikipedia-iOS-App-Backlog, and 2 others: Periodic 500s from piwik.wikimedia.org - https://phabricator.wikimedia.org/T154558#3091217 (10elukey) Added `innodb_buffer_pool_size = 512M` and `innodb_flush_log_at_trx_commit 2` to `/etc/mysql/my.cnf` restarted mysql...
[15:50:43] <wikibugs_>	 (03CR) 10Giuseppe Lavagetto: [C: 031] PuppetDB: automatically ucfirst resource names [software/cumin] - 10https://gerrit.wikimedia.org/r/342214 (https://phabricator.wikimedia.org/T159970) (owner: 10Volans)
[15:50:52] <icinga-wm>	 PROBLEM - puppet last run on analytics1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[15:51:17] <wikibugs_>	 (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1030" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342224
[15:52:10] <wikibugs_>	 (03CR) 10Volans: [C: 032] PuppetDB: automatically ucfirst resource names [software/cumin] - 10https://gerrit.wikimedia.org/r/342214 (https://phabricator.wikimedia.org/T159970) (owner: 10Volans)
[15:53:49] <wikibugs_>	 (03Merged) 10jenkins-bot: PuppetDB: automatically ucfirst resource names [software/cumin] - 10https://gerrit.wikimedia.org/r/342214 (https://phabricator.wikimedia.org/T159970) (owner: 10Volans)
[15:53:51] <wikibugs_>	 (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1030" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342224 (owner: 10Marostegui)
[15:54:22] <wikibugs_>	 (03CR) 10Jcrespo: [C: 031] "This will probably break pending ferm patches." [puppet] - 10https://gerrit.wikimedia.org/r/342217 (https://phabricator.wikimedia.org/T150850) (owner: 10Jcrespo)
[15:54:59] <wikibugs_>	 (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1030" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342224 (owner: 10Marostegui)
[15:55:12] <wikibugs_>	 (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1030" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342224 (owner: 10Marostegui)
[15:58:14] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1030 - T159414 (duration: 02m 42s)
[15:58:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:58:19] <stashbot>	 T159414: Rampant differences in indexes and PK on s6 (frwiki, jawiki, ruwiki) for revision table - https://phabricator.wikimedia.org/T159414
[15:58:26] <wikibugs_>	 (03CR) 10Ema: [C: 031] "Pcc output OK, LGTM. https://puppet-compiler.wmflabs.org/5741" [puppet] - 10https://gerrit.wikimedia.org/r/342175 (owner: 10Filippo Giunchedi)
[16:00:01] <wikibugs_>	 (03CR) 10Muehlenhoff: "The only pending ferm patch is for dbproxy and can easily be adapted, that's not an issue" [puppet] - 10https://gerrit.wikimedia.org/r/342217 (https://phabricator.wikimedia.org/T150850) (owner: 10Jcrespo)
[16:03:02] <icinga-wm>	 RECOVERY - Check Varnish expiry mailbox lag on cp1049 is OK: OK: expiry mailbox lag is 364
[16:04:18] <wikibugs_>	 (03CR) 10Filippo Giunchedi: [C: 031] "To be merged Monday" [puppet] - 10https://gerrit.wikimedia.org/r/342175 (owner: 10Filippo Giunchedi)
[16:07:03] <wikibugs_>	 (03CR) 10Jcrespo: [C: 032] mariadb: Decouple ferm mariadb common class into a separate file [puppet] - 10https://gerrit.wikimedia.org/r/342217 (https://phabricator.wikimedia.org/T150850) (owner: 10Jcrespo)
[16:07:10] <wikibugs_>	 (03PS2) 10Jcrespo: mariadb: Decouple ferm mariadb common class into a separate file [puppet] - 10https://gerrit.wikimedia.org/r/342217 (https://phabricator.wikimedia.org/T150850)
[16:07:21] <wikibugs_>	 (03PS1) 10Marostegui: sanitarium2.my.cnf: Using standard gtid_domain_id [puppet] - 10https://gerrit.wikimedia.org/r/342226 (https://phabricator.wikimedia.org/T149418)
[16:08:22] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0]
[16:10:41] <wikibugs_>	 (03CR) 10Marostegui: "Looks good: https://puppet-compiler.wmflabs.org/5742/" [puppet] - 10https://gerrit.wikimedia.org/r/342226 (https://phabricator.wikimedia.org/T149418) (owner: 10Marostegui)
[16:12:51] <wikibugs_>	 (03CR) 10Jcrespo: [C: 031] "I am ok with this, is the user change for something in particular? (just asking, this can be deployed)" [puppet] - 10https://gerrit.wikimedia.org/r/342226 (https://phabricator.wikimedia.org/T149418) (owner: 10Marostegui)
[16:13:44] <wikibugs_>	 (03PS1) 10Gehel: logrotate - introduce a generic logrotate template [puppet] - 10https://gerrit.wikimedia.org/r/342228
[16:13:46] <wikibugs_>	 (03PS1) 10Gehel: logrotate - use the new logrotate::rule [puppet] - 10https://gerrit.wikimedia.org/r/342229
[16:13:48] <wikibugs_>	 (03PS1) 10Gehel: logrotate - use the new logrotate::rule [puppet] - 10https://gerrit.wikimedia.org/r/342230
[16:13:50] <wikibugs_>	 (03PS1) 10Gehel: logrotate - use the new logrotate::rule [puppet] - 10https://gerrit.wikimedia.org/r/342231
[16:13:52] <wikibugs_>	 (03PS1) 10Gehel: logrotate - use the new logrotate::rule [puppet] - 10https://gerrit.wikimedia.org/r/342232
[16:13:54] <wikibugs_>	 (03CR) 10Marostegui: "No particular change, we do not have it under the gtid_domain_id on any of the files, so just a bit of consistency there." [puppet] - 10https://gerrit.wikimedia.org/r/342226 (https://phabricator.wikimedia.org/T149418) (owner: 10Marostegui)
[16:14:06] <wikibugs_>	 (03CR) 10Marostegui: [C: 032] sanitarium2.my.cnf: Using standard gtid_domain_id [puppet] - 10https://gerrit.wikimedia.org/r/342226 (https://phabricator.wikimedia.org/T149418) (owner: 10Marostegui)
[16:14:10] <wikibugs_>	 (03PS2) 10Marostegui: sanitarium2.my.cnf: Using standard gtid_domain_id [puppet] - 10https://gerrit.wikimedia.org/r/342226 (https://phabricator.wikimedia.org/T149418)
[16:14:56] <wikibugs_>	 (03CR) 10Jcrespo: [C: 031] "Cool. This is one of the things that we could separate on subtemplates, so next time we do not have to change 20 files." [puppet] - 10https://gerrit.wikimedia.org/r/342226 (https://phabricator.wikimedia.org/T149418) (owner: 10Marostegui)
[16:15:31] <wikibugs_>	 (03CR) 10Jcrespo: [C: 031] "Not on this change, I am thinking aloud." [puppet] - 10https://gerrit.wikimedia.org/r/342226 (https://phabricator.wikimedia.org/T149418) (owner: 10Marostegui)
[16:18:52] <icinga-wm>	 RECOVERY - puppet last run on analytics1002 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures
[16:20:03] <wikibugs_>	 (03PS1) 10Muehlenhoff: debdeploy: Support stretch installations in update spec files [puppet] - 10https://gerrit.wikimedia.org/r/342233
[16:25:22] <icinga-wm>	 PROBLEM - puppet last run on stat1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[16:25:34] <elukey>	 !log reboot mw22(5[1-9]|60) to enable mw-cgroup mountpoint
[16:25:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:25:51] <elukey>	 that's not a mountpoint
[16:25:53] <elukey>	 anyhow
[16:25:55] <elukey>	 :D
[16:27:32] <icinga-wm>	 RECOVERY - Check systemd state on mw2251 is OK: OK - running: The system is fully operational
[16:27:35] <Reedy>	 just do !log did something with some servers and probably rebooting them
[16:30:02] <icinga-wm>	 PROBLEM - Check HHVM threads for leakage on mw1169 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[16:30:06] <wikibugs_>	 (03PS1) 10Volans: Add integration tests for clustershell transport [software/cumin] - 10https://gerrit.wikimedia.org/r/342235 (https://phabricator.wikimedia.org/T159969)
[16:30:32] <icinga-wm>	 PROBLEM - puppet last run on mw1302 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[16:31:02] <icinga-wm>	 RECOVERY - Check HHVM threads for leakage on mw1169 is OK: OK
[16:31:03] <wikibugs_>	 (03CR) 10jerkins-bot: [V: 04-1] Add integration tests for clustershell transport [software/cumin] - 10https://gerrit.wikimedia.org/r/342235 (https://phabricator.wikimedia.org/T159969) (owner: 10Volans)
[16:38:14] <wikibugs_>	 (03CR) 10Ottomata: [C: 031] Disable WikimediaEvents extension on closed wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342148 (https://phabricator.wikimedia.org/T158721) (owner: 10Krinkle)
[16:39:12] <icinga-wm>	 RECOVERY - Check systemd state on mw2250 is OK: OK - running: The system is fully operational
[16:39:12] <icinga-wm>	 RECOVERY - Check systemd state on mw2248 is OK: OK - running: The system is fully operational
[16:39:12] <icinga-wm>	 RECOVERY - Check systemd state on mw2157 is OK: OK - running: The system is fully operational
[16:39:12] <icinga-wm>	 RECOVERY - Check systemd state on mw2247 is OK: OK - running: The system is fully operational
[16:39:12] <icinga-wm>	 RECOVERY - Check systemd state on mw2161 is OK: OK - running: The system is fully operational
[16:39:12] <icinga-wm>	 RECOVERY - Check systemd state on mw2155 is OK: OK - running: The system is fully operational
[16:39:12] <icinga-wm>	 RECOVERY - Check systemd state on mw2156 is OK: OK - running: The system is fully operational
[16:39:13] <icinga-wm>	 RECOVERY - Check systemd state on mw2160 is OK: OK - running: The system is fully operational
[16:39:13] <icinga-wm>	 RECOVERY - Check systemd state on mw2158 is OK: OK - running: The system is fully operational
[16:39:14] <icinga-wm>	 RECOVERY - Check systemd state on mw2243 is OK: OK - running: The system is fully operational
[16:39:22] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0]
[16:39:22] <icinga-wm>	 RECOVERY - Check systemd state on mw2162 is OK: OK - running: The system is fully operational
[16:39:22] <icinga-wm>	 RECOVERY - Check systemd state on mw2159 is OK: OK - running: The system is fully operational
[16:39:23] <icinga-wm>	 RECOVERY - Check systemd state on mw2154 is OK: OK - running: The system is fully operational
[16:39:23] <icinga-wm>	 RECOVERY - Check systemd state on mw2249 is OK: OK - running: The system is fully operational
[16:39:32] <icinga-wm>	 RECOVERY - Check systemd state on mw2153 is OK: OK - running: The system is fully operational
[16:42:50] <wikibugs_>	 06Operations, 10ops-codfw, 13Patch-For-Review: codfw: oresrdb2002 rack/setup - https://phabricator.wikimedia.org/T160082#3091276 (10Papaul)
[16:44:15] <papaul>	 !log oresrdb2002 - signing puppet certs, salt-key, initial run
[16:44:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:46:31] <wikibugs_>	 (03PS3) 10Gilles: Send thumbor process age to statsd via cron [puppet] - 10https://gerrit.wikimedia.org/r/341791 (https://phabricator.wikimedia.org/T159352)
[16:51:15] <wikibugs_>	 (03PS1) 10Volans: Clustershell: always require an event handler [software/cumin] - 10https://gerrit.wikimedia.org/r/342238 (https://phabricator.wikimedia.org/T159968)
[16:51:17] <wikibugs_>	 (03PS1) 10Volans: Add integration tests for clustershell transport [software/cumin] - 10https://gerrit.wikimedia.org/r/342239 (https://phabricator.wikimedia.org/T159969)
[16:51:48] <wikibugs_>	 (03CR) 10jerkins-bot: [V: 04-1] Add integration tests for clustershell transport [software/cumin] - 10https://gerrit.wikimedia.org/r/342239 (https://phabricator.wikimedia.org/T159969) (owner: 10Volans)
[16:51:50] <wikibugs_>	 (03CR) 10jerkins-bot: [V: 04-1] Clustershell: always require an event handler [software/cumin] - 10https://gerrit.wikimedia.org/r/342238 (https://phabricator.wikimedia.org/T159968) (owner: 10Volans)
[16:51:53] <wikibugs_>	 (03Abandoned) 10Volans: Add integration tests for clustershell transport [software/cumin] - 10https://gerrit.wikimedia.org/r/342235 (https://phabricator.wikimedia.org/T159969) (owner: 10Volans)
[16:53:43] <wikibugs_>	 (03PS2) 10Volans: Clustershell: always require an event handler [software/cumin] - 10https://gerrit.wikimedia.org/r/342238 (https://phabricator.wikimedia.org/T159968)
[16:54:20] <wikibugs_>	 06Operations, 10ops-codfw, 13Patch-For-Review: codfw: oresrdb2002 rack/setup - https://phabricator.wikimedia.org/T160082#3091293 (10Papaul)
[16:54:22] <icinga-wm>	 RECOVERY - puppet last run on stat1004 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[16:54:37] <wikibugs_>	 (03PS2) 10Volans: Add integration tests for clustershell transport [software/cumin] - 10https://gerrit.wikimedia.org/r/342239 (https://phabricator.wikimedia.org/T159969)
[16:55:01] <wikibugs_>	 06Operations, 10ops-codfw, 13Patch-For-Review: codfw: oresrdb2002 rack/setup - https://phabricator.wikimedia.org/T160082#3088068 (10Papaul) a:05Papaul>03akosiaris Installation complete @akosiaris
[16:55:17] <wikibugs_>	 (03CR) 10jerkins-bot: [V: 04-1] Add integration tests for clustershell transport [software/cumin] - 10https://gerrit.wikimedia.org/r/342239 (https://phabricator.wikimedia.org/T159969) (owner: 10Volans)
[16:58:32] <icinga-wm>	 RECOVERY - puppet last run on mw1302 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
[17:05:02] <wikibugs_>	 06Operations, 10Gerrit, 06Labs, 06Release-Engineering-Team, 07LDAP: Remove user gerrit2 from ldap - https://phabricator.wikimedia.org/T160122#3091314 (10demon) >>! In T160122#3090315, @MoritzMuehlenhoff wrote: > @demon Can you confirm that the user is no longer needed?  For a little history, it **used**...
[17:07:49] <wikibugs_>	 (03PS3) 10Volans: Add integration tests for clustershell transport [software/cumin] - 10https://gerrit.wikimedia.org/r/342239 (https://phabricator.wikimedia.org/T159969)
[17:08:51] <wikibugs_>	 (03PS2) 10Andrew Bogott: Upstart logrotate:  Use copytruncate instead of delaycompress. [puppet] - 10https://gerrit.wikimedia.org/r/341808 (https://phabricator.wikimedia.org/T159141)
[17:11:01] <wikibugs_>	 (03CR) 10Andrew Bogott: [C: 032] Upstart logrotate:  Use copytruncate instead of delaycompress. [puppet] - 10https://gerrit.wikimedia.org/r/341808 (https://phabricator.wikimedia.org/T159141) (owner: 10Andrew Bogott)
[17:11:48] <icinga-wm>	 RECOVERY - Check Varnish expiry mailbox lag on cp1074 is OK: OK: expiry mailbox lag is 84
[17:17:43] <wikibugs_>	 (03PS4) 10Gehel: elasticsearch - statsd plugin isn't used anymore [puppet] - 10https://gerrit.wikimedia.org/r/342052
[17:23:11] <wikibugs_>	 (03PS4) 10Volans: Add integration tests for clustershell transport [software/cumin] - 10https://gerrit.wikimedia.org/r/342239 (https://phabricator.wikimedia.org/T159969)
[17:23:55] <wikibugs_>	 (03CR) 10jerkins-bot: [V: 04-1] Add integration tests for clustershell transport [software/cumin] - 10https://gerrit.wikimedia.org/r/342239 (https://phabricator.wikimedia.org/T159969) (owner: 10Volans)
[17:24:34] <ottomata>	 !log installed librdkafka 0.9.4 via dpkg -i on cp1058 (cache text) and restarted varnishkafka in preparation for fleet upgrade next week
[17:24:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:24:41] <ottomata>	 elukey:  FYI ^^
[17:25:16] <ottomata>	 oops, ha, cp1058 is a cache misc
[17:25:30] <ottomata>	 meant to do 1052
[17:26:28] <wikibugs_>	 (03PS5) 10Volans: Add integration tests for clustershell transport [software/cumin] - 10https://gerrit.wikimedia.org/r/342239 (https://phabricator.wikimedia.org/T159969)
[17:28:33] <ottomata>	 !log installed librdkafka 0.9.4 via dpkg -i on cp1052 (cache text) and restarted varnishkafka in preparation for fleet upgrade next week
[17:28:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:29:08] <wikibugs_>	 (03CR) 10Volans: "Example output of tox -e integration-clustershell" [software/cumin] - 10https://gerrit.wikimedia.org/r/342239 (https://phabricator.wikimedia.org/T159969) (owner: 10Volans)
[17:29:44] <wikibugs_>	 06Operations, 06Analytics-Kanban, 10ChangeProp, 10Reading-Web-Trending-Service, 06Services (watching): Upgrade librdkafka 0.9.4 on SCB and Varnishes - https://phabricator.wikimedia.org/T159379#3091385 (10Ottomata) FYI, I've installed librdkafka on cp1042, cp1058 (cache misc) and cp1052 (cache text) serv...
[17:30:59] <wikibugs_>	 06Operations, 10Gerrit, 06Labs, 06Release-Engineering-Team, 07LDAP: Remove user gerrit2 from ldap - https://phabricator.wikimedia.org/T160122#3091387 (10MoritzMuehlenhoff) Ok, I'll make the change on Tuesday when you're around (since Monday is a holiday for US staff)
[17:40:05] <elukey>	 thanks ottomata :)
[17:42:45] <wikibugs_>	 06Operations, 10Gerrit, 06Labs, 06Release-Engineering-Team, 07LDAP: Remove user gerrit2 from ldap - https://phabricator.wikimedia.org/T160122#3091441 (10demon) p:05High>03Normal Sounds good
[17:47:02] <wikibugs_>	 (03PS1) 10Gehel: elasticsearch - move role::elasticsearch::common to a profile [puppet] - 10https://gerrit.wikimedia.org/r/342248 (https://phabricator.wikimedia.org/T147718)
[17:48:01] <wikibugs_>	 (03CR) 10jerkins-bot: [V: 04-1] elasticsearch - move role::elasticsearch::common to a profile [puppet] - 10https://gerrit.wikimedia.org/r/342248 (https://phabricator.wikimedia.org/T147718) (owner: 10Gehel)
[17:50:19] <wikibugs_>	 (03PS1) 10Ottomata: Create new refinery/job directory and move refinery cron job classes there [puppet] - 10https://gerrit.wikimedia.org/r/342250
[17:54:28] <icinga-wm>	 PROBLEM - puppet last run on maerlant is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[17:54:43] <wikibugs_>	 06Operations, 10ops-codfw, 13Patch-For-Review, 15User-Elukey: codfw: mw2251-mw2260 rack/setup - https://phabricator.wikimedia.org/T155180#3091494 (10elukey) Ran puppet, rebooted the nodes for the mw-cgroups, re-ran puppet and scap pull, pooled the nodes via conftool.  Still to check: I had to reboot mw2256...
[17:55:14] <wikibugs_>	 (03PS3) 10Ottomata: Create new refinery/job directory and move refinery cron job classes there [puppet] - 10https://gerrit.wikimedia.org/r/342250
[17:55:29] <wikibugs_>	 (03CR) 10Ottomata: "No op https://puppet-compiler.wmflabs.org/5743/" [puppet] - 10https://gerrit.wikimedia.org/r/342250 (owner: 10Ottomata)
[17:56:02] <wikibugs_>	 (03CR) 10Ottomata: [V: 032 C: 032] Create new refinery/job directory and move refinery cron job classes there [puppet] - 10https://gerrit.wikimedia.org/r/342250 (owner: 10Ottomata)
[17:59:46] <wikibugs_>	 06Operations, 10MediaWiki-JobRunner: jobrunner/jobchron services fail in codfw - https://phabricator.wikimedia.org/T160146#3091495 (10Dzahn) This looks fixed now in Icinga but there is nothing in SAL or on this ticket that would explain how it got fixed. ?
[18:05:18] <icinga-wm>	 PROBLEM - puppet last run on es1019 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[18:17:03] <wikibugs_>	 06Operations, 06Labs, 10Tool-Labs: ssl certificate/key update: *.tools.wmflabs.org (expires on 2017-03-24) - https://phabricator.wikimedia.org/T160187#3091560 (10RobH)
[18:19:09] <wikibugs_>	 06Operations, 06Labs, 10Tool-Labs: ssl certificate/key update: *.tools.wmflabs.org (expires on 2017-03-24) - https://phabricator.wikimedia.org/T160187#3091579 (10RobH) I'm willing to assist on this as needed.  For other changes, we typically do the checklist as follows:  [] - stage new private key in private...
[18:20:19] <wikibugs_>	 06Operations, 06Labs, 10Tool-Labs: ssl certificate/key update: *.tools.wmflabs.org (expires on 2017-03-24) - https://phabricator.wikimedia.org/T160187#3091582 (10RobH)
[18:20:43] <wikibugs_>	 06Operations, 06Labs, 10Tool-Labs: ssl certificate/key update: *.tools.wmflabs.org (expires on 2017-03-24) - https://phabricator.wikimedia.org/T160187#3091560 (10RobH) a:05yuvipanda>03None
[18:20:52] <wikibugs_>	 06Operations, 06Labs, 10Tool-Labs: ssl certificate/key update: *.tools.wmflabs.org (expires on 2017-03-24) - https://phabricator.wikimedia.org/T160187#3091560 (10RobH) a:03RobH
[18:22:28] <icinga-wm>	 RECOVERY - puppet last run on maerlant is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures
[18:22:53] <wikibugs_>	 (03PS1) 10GWicke: Update access log sampling to match new hyperswitch levels [puppet] - 10https://gerrit.wikimedia.org/r/342251
[18:23:53] <wikibugs_>	 (03Abandoned) 10Gehel: WIP - logrotate - introduce a generic logrotate template [puppet] - 10https://gerrit.wikimedia.org/r/341782 (owner: 10Gehel)
[18:23:59] <wikibugs_>	 (03CR) 10GWicke: [C: 04-1] Update access log sampling to match new hyperswitch levels [puppet] - 10https://gerrit.wikimedia.org/r/342251 (owner: 10GWicke)
[18:24:30] <wikibugs_>	 (03CR) 10GWicke: [C: 04-1] "Added a -1 to signal that this should only be deployed with the corresponding hyperswitch change." [puppet] - 10https://gerrit.wikimedia.org/r/342251 (owner: 10GWicke)
[18:27:08] <icinga-wm>	 PROBLEM - puppet last run on scb1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[18:28:22] <logmsgbot>	 !log smalyshev@tin Started deploy [wdqs/wdqs@1f2973c]: Deploy new updater on 1003 for potential connection  drop fix
[18:28:24] <logmsgbot>	 !log smalyshev@tin Finished deploy [wdqs/wdqs@1f2973c]: Deploy new updater on 1003 for potential connection  drop fix (duration: 00m 03s)
[18:28:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:28:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:33:18] <icinga-wm>	 RECOVERY - puppet last run on es1019 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures
[18:41:51] <wikibugs_>	 (PS4) Krinkle: webperf: Use trebuchet install of eventlogging for ve.py [puppet] - https://gerrit.wikimedia.org/r/341723 (https://phabricator.wikimedia.org/T131977)
[18:42:01] <wikibugs_>	 (PS4) Krinkle: webperf: Update navtiming.py to use eventlogging instead of zmq [puppet] - https://gerrit.wikimedia.org/r/341724
[18:42:25] <wikibugs_>	 (CR) Krinkle: "Self-1 was because I haven't tested it and pending questions (see IRC, -analytics)" [puppet] - https://gerrit.wikimedia.org/r/341724 (owner: Krinkle)
[18:56:08] <icinga-wm>	 RECOVERY - puppet last run on scb1001 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures
[18:57:56] <wikibugs_>	 (PS1) RobH: new cert for *.tools.wmflabs.org [puppet] - https://gerrit.wikimedia.org/r/342254
[18:58:36] <wikibugs_>	 Operations, Labs, Tool-Labs, Patch-For-Review: ssl certificate/key update: *.tools.wmflabs.org (expires on 2017-03-24) - https://phabricator.wikimedia.org/T160187#3091682 (RobH)
[18:58:45] <wikibugs_>	 Operations, Labs, Tool-Labs, Patch-For-Review: ssl certificate/key update: *.tools.wmflabs.org (expires on 2017-03-24) - https://phabricator.wikimedia.org/T160187#3091560 (RobH) a:RobH>yuvipanda
[19:07:05] <MaxSem>	 !log Unmasked kartotherian on maps-test2004
[19:07:08] <icinga-wm>	 RECOVERY - Check systemd state on maps-test2004 is OK: OK - running: The system is fully operational
[19:07:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:08:08] <icinga-wm>	 PROBLEM - puppet last run on tin is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[19:13:05] <wikibugs_>	 (PS1) Gehel: maps - upgrade maps-test cluster to node js version 6 [puppet] - https://gerrit.wikimedia.org/r/342256 (https://phabricator.wikimedia.org/T150354)
[19:17:28] <icinga-wm>	 RECOVERY - kartotherian endpoints health on maps-test2004 is OK: All endpoints are healthy
[19:19:18] <gehel>	 !log upgrading kartotherian on maps-test2004 - T150354
[19:19:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:19:23] <stashbot>	 T150354: Implement Node6 support for Kartotherian/Tilerator - https://phabricator.wikimedia.org/T150354
[19:21:05] <wikibugs_>	 (CR) MaxSem: [C: 1] maps - upgrade maps-test cluster to node js version 6 [puppet] - https://gerrit.wikimedia.org/r/342256 (https://phabricator.wikimedia.org/T150354) (owner: Gehel)
[19:25:38] <icinga-wm>	 RECOVERY - puppet last run on maps-test2004 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures
[19:27:50] <logmsgbot>	 !log gehel@tin Started deploy [kartotherian/deploy@76adf21]: (no justification provided)
[19:27:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:28:14] <logmsgbot>	 !log gehel@tin Finished deploy [kartotherian/deploy@76adf21]: (no justification provided) (duration: 00m 23s)
[19:28:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:28:22] <wikibugs_>	 Operations, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, Traffic, and 2 others: Purge Varnish cache when a banner is saved - https://phabricator.wikimedia.org/T154954#3091832 (AndyRussG)
[19:34:32] <gehel>	 !log restart kartotherian on maps-test2004
[19:34:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:35:53] <wikibugs_>	 Operations, HHVM, Patch-For-Review, Upstream: Build / migrate to HHVM 3.18 - https://phabricator.wikimedia.org/T158176#3091864 (MoritzMuehlenhoff) \o/ "mofarrell commented an hour ago:  A fix is on its way."
[19:37:08] <icinga-wm>	 RECOVERY - puppet last run on tin is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[19:44:07] <logmsgbot>	 !log gehel@tin Started deploy [tilerator/deploy@b501046]: (no justification provided)
[19:44:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:45:28] <logmsgbot>	 !log gehel@tin Finished deploy [tilerator/deploy@b501046]: (no justification provided) (duration: 01m 20s)
[19:45:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:45:50] <gehel>	 !log failed tilerator deploy on maps-test2004
[19:45:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:47:04] <logmsgbot>	 !log gehel@tin Started deploy [tilerator/deploy@b501046]: (no justification provided)
[19:47:07] <logmsgbot>	 !log gehel@tin Finished deploy [tilerator/deploy@b501046]: (no justification provided) (duration: 00m 03s)
[19:47:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:47:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:47:47] <gehel>	 !log restarting tilerator(ui) on maps-test2004
[19:47:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:47:59] <icinga-wm>	 PROBLEM - puppet last run on cp1066 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[19:57:16] <logmsgbot>	 !log gehel@tin Started deploy [tilerator/deploy@b501046]: (no justification provided)
[19:57:21] <logmsgbot>	 !log gehel@tin Finished deploy [tilerator/deploy@b501046]: (no justification provided) (duration: 00m 04s)
[19:57:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:57:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:57:52] <gehel>	 !log restarting tilerator(ui) on maps-test2004
[19:57:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:59:39] <wikibugs_>	 (CR) Gehel: [C: 2] maps - upgrade maps-test cluster to node js version 6 [puppet] - https://gerrit.wikimedia.org/r/342256 (https://phabricator.wikimedia.org/T150354) (owner: Gehel)
[20:03:20] <logmsgbot>	 !log gehel@tin Started deploy [tilerator/deploy@b501046]: (no justification provided)
[20:03:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:03:37] <logmsgbot>	 !log gehel@tin Finished deploy [tilerator/deploy@b501046]: (no justification provided) (duration: 00m 16s)
[20:03:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:04:28] <wikibugs_>	 (PS1) Mholloway: Set ANDROID_HOME environment variable (role::ci::slave::android) [puppet] - https://gerrit.wikimedia.org/r/342262 (https://phabricator.wikimedia.org/T158456)
[20:05:29] <logmsgbot>	 !log gehel@tin Started deploy [kartotherian/deploy@76adf21]: (no justification provided)
[20:05:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:06:24] <logmsgbot>	 !log gehel@tin Finished deploy [kartotherian/deploy@76adf21]: (no justification provided) (duration: 00m 54s)
[20:06:26] <gehel>	 !log restart kartotherian / tilerator(ui) on maps-test*
[20:06:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:06:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:10:27] <wikibugs_>	 (PS1) Andrew Bogott: Nova:  Remove our custom-hacked libvirt driver [puppet] - https://gerrit.wikimedia.org/r/342264 (https://phabricator.wikimedia.org/T131548)
[20:12:28] <wikibugs_>	 (CR) Andrew Bogott: [C: 2] Nova:  Remove our custom-hacked libvirt driver [puppet] - https://gerrit.wikimedia.org/r/342264 (https://phabricator.wikimedia.org/T131548) (owner: Andrew Bogott)
[20:16:58] <icinga-wm>	 RECOVERY - puppet last run on cp1066 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures
[20:37:24] <wikibugs_>	 Operations, Annual-Report, Security-Reviews, Patch-For-Review: add subdomain for annual report 2016 - https://phabricator.wikimedia.org/T151798#3092081 (Dzahn) Open>Resolved closing ticket again as it looks done for now. feel free to re-open if more changes are planned.
[20:44:02] <wikibugs_>	 Operations, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, Traffic, and 2 others: Purge Varnish cache when a banner is saved - https://phabricator.wikimedia.org/T154954#3092101 (AndyRussG) @BBlack, @ema, hi! Would it be possible to maybe get your input on the [[ https://gerrit.wikimedia...
[20:44:31] <wikibugs_>	 (CR) Dzahn: [C: -1] "Error: Could not find template 'mediawiki/maintenance/uploads/wgetrc.erb'" [puppet] - https://gerrit.wikimedia.org/r/341264 (https://phabricator.wikimedia.org/T159661) (owner: Dereckson)
[20:44:47] <wikibugs_>	 (CR) Dzahn: [C: -1] "i will amend one more time" [puppet] - https://gerrit.wikimedia.org/r/341264 (https://phabricator.wikimedia.org/T159661) (owner: Dereckson)
[21:00:11] <wikibugs_>	 (PS5) Dzahn: maintenance: provision /etc/wgetrc [puppet] - https://gerrit.wikimedia.org/r/341264 (https://phabricator.wikimedia.org/T159661) (owner: Dereckson)
[21:00:44] <wikibugs_>	 (PS6) Dzahn: mediawiki::maintenance: provision /etc/wgetrc [puppet] - https://gerrit.wikimedia.org/r/341264 (https://phabricator.wikimedia.org/T159661) (owner: Dereckson)
[21:03:17] <wikibugs_>	 (CR) Dzahn: [C: 1] "now it works: http://puppet-compiler.wmflabs.org/5745/" [puppet] - https://gerrit.wikimedia.org/r/341264 (https://phabricator.wikimedia.org/T159661) (owner: Dereckson)
[21:03:23] <wikibugs_>	 (CR) Dzahn: [C: 2] mediawiki::maintenance: provision /etc/wgetrc [puppet] - https://gerrit.wikimedia.org/r/341264 (https://phabricator.wikimedia.org/T159661) (owner: Dereckson)
[21:04:47] <wikibugs_>	 Operations, Commons, Patch-For-Review: Improve Terbium (and wasat) userland to process server side uploads - https://phabricator.wikimedia.org/T159661#3092138 (Dzahn)
[21:07:28] <icinga-wm>	 PROBLEM - puppet last run on elastic1043 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[21:08:09] <wikibugs_>	 (CR) Dzahn: "eh, almost works, it's a directory instead of a file" [puppet] - https://gerrit.wikimedia.org/r/341264 (https://phabricator.wikimedia.org/T159661) (owner: Dereckson)
[21:08:52] <mutante>	 Dereckson: we have an empty dir /etc/wgetrc instead of a file /etc/wgetrc 
[21:09:34] <Dereckson>	         ensure  => ensure_directory($ensure),
[21:09:36] <mutante>	 it was a file before (which was manually added afaict)
[21:09:41] <mutante>	 yea, i saw
[21:10:38] <mutante>	 but the template contents need to go somewhere
[21:11:08] <mutante>	 did you want /etc/wgetrc and then a file inside it?
[21:11:14] <mutante>	 or /etc/wgetrc itself as the file
[21:12:31] <mutante>	 just file, right (checks wget man page)
[21:13:32] <Dereckson>	 No, it's a file /etc/wgetrc according to the wget documentation
[21:13:37] <mutante>	 btw, you can also have per-user dotfiles in the repo
[21:13:38] <Dereckson>	 not a .d subdir
[21:14:17] <mutante>	 ok
[21:16:59] <wikibugs_>	 (PS1) Dzahn: mediawiki::maintenance: ensure /etc/wgetrc is file, not dir [puppet] - https://gerrit.wikimedia.org/r/342274 (https://phabricator.wikimedia.org/T159661)
[21:19:10] <wikibugs_>	 (PS2) Dzahn: mediawiki::maintenance: ensure /etc/wgetrc is file, not dir [puppet] - https://gerrit.wikimedia.org/r/342274 (https://phabricator.wikimedia.org/T159661)
[21:20:06] <wikibugs_>	 (CR) Dzahn: [C: 2] mediawiki::maintenance: ensure /etc/wgetrc is file, not dir [puppet] - https://gerrit.wikimedia.org/r/342274 (https://phabricator.wikimedia.org/T159661) (owner: Dzahn)
[21:21:28] <mutante>	 Ensure set to :present but file type is directory so no content will be synced
[21:21:38] <mutante>	 deletes it
[21:23:42] <mutante>	 Dereckson: done now. file exists on both, terbium uses webproxy.eqiad. wasat uses webproxy.codfw
[21:26:18] <wikibugs_>	 Operations, Commons, Patch-For-Review: Improve Terbium (and wasat) userland to process server side uploads - https://phabricator.wikimedia.org/T159661#3074310 (Dzahn) edited task title to point out we should always treat it as the pair terbium/wasat for eqiad/codfw.    amended and merged the changes...
[21:27:01] <Dereckson>	 :) thanks
[21:29:07] <wikibugs_>	 (Draft1) Paladox: Phabricator: Remove three unneeded configs [puppet] - https://gerrit.wikimedia.org/r/342275
[21:29:10] <wikibugs_>	 (PS2) Paladox: Phabricator: Remove three unneeded configs [puppet] - https://gerrit.wikimedia.org/r/342275
[21:32:09] <icinga-wm>	 PROBLEM - puppet last run on mw1244 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[21:34:39] <wikibugs_>	 (PS2) Dzahn: contint: Zuul no more interact with Jenkins [puppet] - https://gerrit.wikimedia.org/r/340529 (owner: Hashar)
[21:35:28] <icinga-wm>	 RECOVERY - puppet last run on elastic1043 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
[21:38:02] <wikibugs_>	 Operations, ops-codfw: ms-be2002.codfw.wmnet has drac issues - https://phabricator.wikimedia.org/T155689#3092165 (RobH) a:RobH>Papaul This system still has the ipmi issue when run on the local OS:  ``` robh@ms-be2002:~$ sudo ipmi-chassis --get-chassis-status ipmi_cmd_get_chassis_status: internal...
[21:38:36] <wikibugs_>	 (CR) Dzahn: "http://puppet-compiler.wmflabs.org/5746/" [puppet] - https://gerrit.wikimedia.org/r/340529 (owner: Hashar)
[21:38:49] <wikibugs_>	 (CR) Dzahn: [C: 2] contint: Zuul no more interact with Jenkins [puppet] - https://gerrit.wikimedia.org/r/340529 (owner: Hashar)
[21:39:00] <hashar>	 puppet compiler looks good still
[21:40:05] <mutante>	 yes
[21:40:21] <mutante>	 hands over to you 
[21:41:44] <wikibugs_>	 Operations, DC-Switchover-Prep-Q3-2016-17, Epic, Wikimedia-Multiple-active-datacenters: MediaWiki Datacenter Switchover automation - https://phabricator.wikimedia.org/T160178#3091181 (Krinkle) > `nodejs /var/lib/mediawiki-cache-warmup/warmup.js /var/lib/mediawiki-cache-warmup/urls-server.txt clon...
[21:42:28] <hashar>	 !log restarted Zuul
[21:42:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:46:10] <wikibugs_>	 (Draft1) Paladox: Phabricator: Use hiera for deciding when to enable read and write for mysql search [puppet] - https://gerrit.wikimedia.org/r/342276
[21:46:13] <wikibugs_>	 (PS2) Paladox: Phabricator: Use hiera for deciding when to enable read and write for mysql search [puppet] - https://gerrit.wikimedia.org/r/342276
[21:47:44] <wikibugs_>	 (CR) Dzahn: "private repo: [master 0112f51] (dzahn) remove passwords::misc::contint::jenkins" [puppet] - https://gerrit.wikimedia.org/r/340529 (owner: Hashar)
[21:47:46] <wikibugs_>	 (PS1) Hashar: Remove passwords::misc::contint::jenkins [labs/private] - https://gerrit.wikimedia.org/r/342277
[21:52:28] <wikibugs_>	 (PS2) Dzahn: microsites: convert to profile/role-structure [puppet] - https://gerrit.wikimedia.org/r/342164
[21:52:28] <icinga-wm>	 PROBLEM - puppet last run on mc1025 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[21:53:28] <icinga-wm>	 PROBLEM - puppet last run on mw1233 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[21:54:32] <wikibugs_>	 Operations, Ops-Access-Requests: Requesting access to researchers group (stat1003 and MySQL) for niharika29 - https://phabricator.wikimedia.org/T159780#3092179 (kaldari) I approve!
[21:55:09] <wikibugs_>	 Operations, Ops-Access-Requests: Requesting access to researchers group (stat1003 and MySQL) for niharika29 - https://phabricator.wikimedia.org/T159780#3092181 (RobH)
[21:57:05] <wikibugs_>	 (PS1) RobH: add niharika29 to researchers group [puppet] - https://gerrit.wikimedia.org/r/342278
[21:57:50] <wikibugs_>	 Operations, Ops-Access-Requests, Patch-For-Review: Requesting access to researchers group (stat1003 and MySQL) for niharika29 - https://phabricator.wikimedia.org/T159780#3092202 (RobH)
[21:58:01] <wikibugs_>	 (CR) RobH: [C: 2] add niharika29 to researchers group [puppet] - https://gerrit.wikimedia.org/r/342278 (owner: RobH)
[21:59:45] <wikibugs_>	 Operations, Ops-Access-Requests, Patch-For-Review: Requesting access to researchers group (stat1003 and MySQL) for niharika29 - https://phabricator.wikimedia.org/T159780#3092204 (RobH) Open>Resolved There have been no objections noted, and I noticed that this 3 day wait ended today.  I chatte...
[22:01:08] <icinga-wm>	 RECOVERY - puppet last run on mw1244 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[22:09:27] <wikibugs_>	 (CR) Dzahn: [C: -1] "influences admin groups, since they are added in hiera role/common/ http://puppet-compiler.wmflabs.org/5747/bromine.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/342164 (owner: Dzahn)
[22:18:31] <wikibugs_>	 Operations, Ops-Access-Requests, Patch-For-Review: Requesting access to researchers group (stat1003 and MySQL) for niharika29 - https://phabricator.wikimedia.org/T159780#3092215 (Ottomata) Thanks @robh!
[22:20:28] <icinga-wm>	 RECOVERY - puppet last run on mc1025 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures
[22:21:28] <icinga-wm>	 RECOVERY - puppet last run on mw1233 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures
[22:25:36] <wikibugs_>	 Operations, IDS-extension, Wikimedia-Extension-setup, I18n: Deploy IDS rendering engine to production - https://phabricator.wikimedia.org/T148693#3092220 (Arthur2e5) >>! In T148693#3090260, @Shoichi wrote: > No matter how many sites connect to the server, they share the same cache.  Makes sense a...
[22:38:04] <wikibugs_>	 (CR) Hashar: "That is better done directly in the job. For example by adding a build parameter:" [puppet] - https://gerrit.wikimedia.org/r/342262 (https://phabricator.wikimedia.org/T158456) (owner: Mholloway)
[22:50:28] <icinga-wm>	 PROBLEM - puppet last run on mw1304 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[22:59:43] <wikibugs_>	 (PS3) Dzahn: microsites: convert to profile/role-structure [puppet] - https://gerrit.wikimedia.org/r/342164
[23:02:28] <icinga-wm>	 PROBLEM - Host mw2256 is DOWN: PING CRITICAL - Packet loss = 100%
[23:02:58] <icinga-wm>	 RECOVERY - Host mw2256 is UP: PING OK - Packet loss = 0%, RTA = 36.14 ms
[23:13:58] <wikibugs_>	 Operations, ops-eqiad: decom ytterbium (datacenter) - https://phabricator.wikimedia.org/T141415#3092324 (Dzahn)
[23:14:41] <wikibugs_>	 Operations, ops-eqiad: decom ytterbium (datacenter) - https://phabricator.wikimedia.org/T141415#2497714 (Dzahn)
[23:18:05] <wikibugs_>	 Operations, ops-eqiad: decom ytterbium (datacenter) - https://phabricator.wikimedia.org/T141415#3092331 (Dzahn)
[23:18:28] <icinga-wm>	 RECOVERY - puppet last run on mw1304 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures
[23:20:15] <wikibugs_>	 Operations, ops-eqiad: decom ytterbium (datacenter) - https://phabricator.wikimedia.org/T141415#3092348 (Dzahn) a:RobH @Robh this was an older decom task that was still open, i added the newer checklist now and checked the boxes after the fact.  assigning to you to check if switch ports was done alre...
[23:21:35] <wikibugs_>	 Operations, ops-eqiad: decom ytterbium (datacenter) - https://phabricator.wikimedia.org/T141415#3092353 (Dzahn)
[23:30:05] <wikibugs_>	 (PS4) Dzahn: microsites: convert to profile/role-structure [puppet] - https://gerrit.wikimedia.org/r/342164
[23:34:50] <wikibugs_>	 (CR) Dzahn: [C: 1] "now the only difference are motd contents / role names.  http://puppet-compiler.wmflabs.org/5749/" [puppet] - https://gerrit.wikimedia.org/r/342164 (owner: Dzahn)
[23:36:29] <wikibugs_>	 (CR) Dzahn: [C: 1] "@Moritz do i need to change the name of the debdeploy grain here to match the role name?" [puppet] - https://gerrit.wikimedia.org/r/342164 (owner: Dzahn)
[23:40:29] <wikibugs_>	 (PS2) Dzahn: planet: convert to profile/role-structure [puppet] - https://gerrit.wikimedia.org/r/342163
[23:41:41] <wikibugs_>	 (CR) jerkins-bot: [V: -1] planet: convert to profile/role-structure [puppet] - https://gerrit.wikimedia.org/r/342163 (owner: Dzahn)
[23:52:12] <wikibugs_>	 (Abandoned) Mholloway: Set ANDROID_HOME environment variable (role::ci::slave::android) [puppet] - https://gerrit.wikimedia.org/r/342262 (https://phabricator.wikimedia.org/T158456) (owner: Mholloway)
[23:58:48] <wikibugs_>	 (PS3) Dzahn: planet: convert to profile/role-structure [puppet] - https://gerrit.wikimedia.org/r/342163
[23:59:08] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s2 on db1047 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 315.54 seconds