[00:00:04] twentyafterfour: Dear anthropoid, the time has come. Please deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160609T0000). [00:02:27] dear jouncebot: yes, master [00:03:46] !log Preparing to deploy phabricator update. Tagged release/2016-06-08/1 [00:03:53] (03CR) 10Dzahn: [C: 032] "yep, confirmed, the old ones are gone" [dns] - 10https://gerrit.wikimedia.org/r/293446 (https://phabricator.wikimedia.org/T135466) (owner: 10Papaul) [00:05:38] wtf logmsgbot [00:06:53] !log meta T46791 [00:06:53] T46791: [[wikitech:Server_admin_log]] should not rely on freenode irc for logmsgbot entries - https://phabricator.wikimedia.org/T46791 [00:09:07] stashbot: tell wikibugs that logmsgbot is gone [00:11:44] !log restarted the log bot [00:11:59] what.. [00:12:53] oh, wrong bot , duh [00:14:24] taking phabricator offline for a moment [00:14:37] (scheduled, icinga already silenced) [00:15:19] !log restarted the log bot [00:15:30] gimme a break [00:16:43] joins #morebots-test :p [00:17:16] well that doesnt exist anymore like the docs say [00:19:37] 06Operations, 10ops-codfw: rack/setup/deploy new codfw mw app servers - https://phabricator.wikimedia.org/T135466#2366623 (10Papaul) [00:19:52] all copies of morebots stopped logging [00:20:00] not just the one here in production channel [00:20:13] i think toollabs issue then [00:20:36] more in -labs [00:20:57] 06Operations, 10ops-codfw: rack/setup/deploy mw22[1-5][0-9] switch configuration - https://phabricator.wikimedia.org/T136670#2366624 (10Papaul) rectification not B4 but A4 [00:26:57] 06Operations, 06Performance-Team, 06Services, 07Availability: Create restbase BagOStuff subclass (session storage) - https://phabricator.wikimedia.org/T137272#2366638 (10Smalyshev) I checked the session size on my local vagrant install, it's 780 bytes, so not too big. Of course, productions sessions may be... [00:28:32] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.67% of data above the critical threshold [1000.0] [00:32:20] woah, phabricator is now sending html email? [00:33:16] legoktm yep [00:33:29] was changed to default [00:33:32] is it possible that this broke morebots [00:33:35] [15:44:06] !log tgr@tin Synchronized wmf-config/InitialiseSettings.php: enable AuthManager on group1 for reals T135504 (duration: 00m 25s) [00:33:36] T135504: Enable AuthManager in WMF production - https://phabricator.wikimedia.org/T135504 [00:34:07] html email? heh, glad i changed it all to browser notifications [00:34:09] mutante maybe but we can set it back to plain email for that bot. [00:34:49] legoktm download links too https://phabricator.wikimedia.org/diffusion/MW/ [00:35:32] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:35:33] plain email? [00:35:54] mutante yes you can change it in settings. [00:36:14] Maybe twentyafterfour would know [00:36:40] I don't think that broke the bots because the bots broke shortly before the update, not after [00:36:56] none of the bots AFAIK parse phab email [00:37:04] the log message abotu the AuthManager being enabled [00:37:11] is the last that worked [00:37:27] phab mail is unrelated [00:40:28] does morebots keep a debug log somewhere? 
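The "Tagged release/2016-06-08/1" prep step above is presumably an ordinary git tag pushed to the Phabricator deployment repository so the update can be tracked and rolled back; a minimal sketch, with the repository path and remote name assumed for illustration only:

    # Tag the tree being deployed and publish the tag (the path and remote name
    # here are assumptions, not the actual deployment layout).
    cd /srv/deployment/phabricator/deployment
    git tag -a release/2016-06-08/1 -m "Phabricator update for the 2016-06-09 00:00 UTC window"
    git push origin release/2016-06-08/1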
[00:40:45] the docs said there is a testbot in #morebots-test but the channel is empty [00:41:21] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 0.042 second response time [00:56:19] egoktm hi could you update the site notice on this channel please [00:56:31] legoktm ^^ [00:56:37] Since it still says Status: CI down [00:58:02] Thanks [01:14:54] (03PS3) 10Dzahn: Retry Wikidata dump creation up to three times [puppet] - 10https://gerrit.wikimedia.org/r/293445 (owner: 10Hoo man) [01:15:09] (03CR) 10Dzahn: [C: 032] Retry Wikidata dump creation up to three times [puppet] - 10https://gerrit.wikimedia.org/r/293445 (owner: 10Hoo man) [01:17:52] 06Operations, 07Graphite: "carbon-cache too many creates" on graphite1001 - https://phabricator.wikimedia.org/T137380#2366699 (10Dzahn) [01:18:16] 06Operations, 10Monitoring, 07Graphite: "carbon-cache too many creates" on graphite1001 - https://phabricator.wikimedia.org/T137380#2366712 (10Dzahn) [01:25:22] legoktm: I've been getting HTML e-mail from Phabricator Phabricator for a while now. [01:25:38] And Evan just did better "edited task description" e-mails with diffs! [01:27:20] 06Operations, 10ops-eqiad: mw1063 broken - https://phabricator.wikimedia.org/T137381#2366726 (10Dzahn) [01:27:28] Oooh :D [01:27:48] https://secure.phabricator.com/T7643 has screenshots. [01:30:10] ACKNOWLEDGEMENT - Host mw1063 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn https://phabricator.wikimedia.org/T137381 [01:30:29] RECOVERY - carbon-cache too many creates on graphite1001 is OK: OK: Less than 1.00% above the threshold [500.0] [01:31:08] PROBLEM - Disk space on ms-be2012 is CRITICAL: DISK CRITICAL - free space: / 2099 MB (3% inode=96%): /srv/swift-storage/sdl1 112647 MB (5% inode=91%) [02:26:33] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.4) (duration: 11m 00s) [02:27:39] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:29:29] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 0.019 second response time [02:48:37] (03PS1) 10Microchip08: Redirect phabricator.mediawiki.org [puppet] - 10https://gerrit.wikimedia.org/r/293464 (https://phabricator.wikimedia.org/T137252) [02:49:01] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.5) (duration: 11m 02s) [02:52:10] 06Operations, 10DNS, 10Phabricator, 10Traffic, 13Patch-For-Review: Redirect phabricator.mediawiki.org to phabricator.wikimedia.org - https://phabricator.wikimedia.org/T137252#2366860 (10MC8) It looks like redirects existed for Bugzilla, so I guess you could call this a regression. 
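The "Retry Wikidata dump creation up to three times" change above presumably wraps the dump command in a retry loop; a minimal sketch of that pattern, where the retry count comes from the commit title but the script name and sleep interval are assumptions:

    # Re-run the dump up to three times before giving up.
    ok=0
    for try in 1 2 3; do
        if /usr/local/bin/dumpwikidatajson.sh; then   # script name is an assumption
            ok=1
            break
        fi
        echo "wikidata dump attempt ${try} failed, retrying" >&2
        sleep 60
    done
    [ "$ok" -eq 1 ] || exit 1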
[02:53:56] !log ms-be2012 ran out of disk due to huge syslog, deleted log, restarted rsyslogd [02:55:20] !log l10nupdate@tin ResourceLoader cache refresh completed at Thu Jun 9 02:55:20 UTC 2016 (duration 6m 19s) [03:01:19] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:03:09] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 0.021 second response time [03:17:08] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:18:58] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 0.025 second response time [03:22:26] 06Operations, 10DNS, 10Phabricator, 10Traffic, 13Patch-For-Review: Redirect phabricator.mediawiki.org to phabricator.wikimedia.org - https://phabricator.wikimedia.org/T137252#2366879 (10MZMcBride) >>! In T137252#2362373, @Krenair wrote: > Isn't commons.wikipedia.org just a historical thing? Yeah. >>! I... [03:37:51] 06Operations, 10Deployment-Systems, 13Patch-For-Review, 03Scap3: Warning: rename(): Permission denied in /srv/mediawiki/wmf-config/CommonSettings.php on line 189 - https://phabricator.wikimedia.org/T136258#2366885 (10Dereckson) Do a manual run of l10nupdate on Tin to check if all is now fine perhaps? [04:00:57] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:05:21] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 0.225 second response time [04:06:30] PROBLEM - MD RAID on mw2216 is CRITICAL: Timeout while attempting connection [04:06:30] PROBLEM - MD RAID on mw2215 is CRITICAL: Timeout while attempting connection [04:07:21] PROBLEM - Apache HTTP on mw2215 is CRITICAL: Connection timed out [04:07:30] PROBLEM - configured eth on mw2216 is CRITICAL: Timeout while attempting connection [04:07:30] PROBLEM - configured eth on mw2215 is CRITICAL: Timeout while attempting connection [04:07:40] PROBLEM - dhclient process on mw2216 is CRITICAL: Timeout while attempting connection [04:07:40] PROBLEM - dhclient process on mw2215 is CRITICAL: Timeout while attempting connection [04:07:41] PROBLEM - mediawiki-installation DSH group on mw2215 is CRITICAL: Host mw2215 is not in mediawiki-installation dsh group [04:07:41] PROBLEM - mediawiki-installation DSH group on mw2216 is CRITICAL: Host mw2216 is not in mediawiki-installation dsh group [04:08:10] PROBLEM - nutcracker port on mw2216 is CRITICAL: Timeout while attempting connection [04:08:11] PROBLEM - nutcracker port on mw2215 is CRITICAL: Timeout while attempting connection [04:08:30] PROBLEM - nutcracker process on mw2216 is CRITICAL: Timeout while attempting connection [04:08:30] PROBLEM - nutcracker process on mw2215 is CRITICAL: Timeout while attempting connection [04:08:42] PROBLEM - puppet last run on mw2216 is CRITICAL: Timeout while attempting connection [04:08:50] PROBLEM - puppet last run on mw2215 is CRITICAL: Timeout while attempting connection [04:09:00] PROBLEM - salt-minion processes on mw2215 is CRITICAL: Timeout while attempting connection [04:09:00] PROBLEM - salt-minion processes on mw2216 is CRITICAL: Timeout while attempting connection [04:09:20] PROBLEM - Apache HTTP on mw2216 is CRITICAL: Connection timed out [04:09:31] PROBLEM - Check size of conntrack table on mw2216 is CRITICAL: Timeout while attempting connection [04:09:31] PROBLEM - Check size of conntrack table on mw2215 is CRITICAL: Timeout while 
attempting connection [04:09:50] PROBLEM - DPKG on mw2215 is CRITICAL: Timeout while attempting connection [04:09:50] PROBLEM - DPKG on mw2216 is CRITICAL: Timeout while attempting connection [04:10:10] PROBLEM - Disk space on mw2215 is CRITICAL: Timeout while attempting connection [04:10:10] PROBLEM - Disk space on mw2216 is CRITICAL: Timeout while attempting connection [04:22:39] $ [04:23:18] seems to be the monitoring as every probe fails ^ [04:28:42] RECOVERY - Apache HTTP on mw2215 is OK: HTTP OK: HTTP/1.1 200 OK - 11378 bytes in 0.074 second response time [04:31:01] RECOVERY - configured eth on mw2215 is OK: OK - interfaces up [04:31:02] RECOVERY - Check size of conntrack table on mw2215 is OK: OK: nf_conntrack is 0 % full [04:31:11] RECOVERY - dhclient process on mw2215 is OK: PROCS OK: 0 processes with command name dhclient [04:31:41] RECOVERY - Disk space on mw2215 is OK: DISK OK [04:31:51] RECOVERY - nutcracker port on mw2215 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [04:32:00] RECOVERY - MD RAID on mw2215 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [04:32:10] RECOVERY - nutcracker process on mw2215 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker [04:32:29] RECOVERY - DPKG on mw2215 is OK: All packages OK [04:33:09] RECOVERY - salt-minion processes on mw2215 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [04:33:20] RECOVERY - Apache HTTP on mw2216 is OK: HTTP OK: HTTP/1.1 200 OK - 11378 bytes in 0.074 second response time [04:34:28] PROBLEM - Make sure enwiki dumps are not empty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/dumps - 288 bytes in 0.038 second response time [04:35:28] RECOVERY - DPKG on mw2216 is OK: All packages OK [04:35:58] PROBLEM - Disk space on mw2219 is CRITICAL: Timeout while attempting connection [04:35:58] RECOVERY - Disk space on mw2216 is OK: DISK OK [04:35:59] RECOVERY - configured eth on mw2216 is OK: OK - interfaces up [04:36:18] RECOVERY - nutcracker port on mw2216 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [04:36:19] PROBLEM - MD RAID on mw2219 is CRITICAL: Timeout while attempting connection [04:36:19] RECOVERY - dhclient process on mw2216 is OK: PROCS OK: 0 processes with command name dhclient [04:36:38] RECOVERY - MD RAID on mw2216 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [04:36:39] RECOVERY - nutcracker process on mw2216 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker [04:36:49] RECOVERY - salt-minion processes on mw2216 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [04:36:59] RECOVERY - Check size of conntrack table on mw2216 is OK: OK: nf_conntrack is 0 % full [04:37:09] PROBLEM - configured eth on mw2219 is CRITICAL: Timeout while attempting connection [04:37:18] PROBLEM - Apache HTTP on mw2219 is CRITICAL: Connection timed out [04:37:19] PROBLEM - dhclient process on mw2219 is CRITICAL: Timeout while attempting connection [04:37:39] PROBLEM - mediawiki-installation DSH group on mw2219 is CRITICAL: Host mw2219 is not in mediawiki-installation dsh group [04:38:08] PROBLEM - nutcracker port on mw2219 is CRITICAL: Timeout while attempting connection [04:38:19] PROBLEM - nutcracker process on mw2219 is CRITICAL: Timeout while attempting connection [04:38:38] PROBLEM - puppet last run on mw2219 is CRITICAL: Timeout while attempting connection 
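The mw2215/mw2216/mw2219 probe failures above are expected while the new codfw app servers (T135466) are still being provisioned, and they get acknowledged in Icinga a couple of hours later (see the ACKNOWLEDGEMENT entries further down). Doing that by hand goes through Icinga's external command file; a sketch, with the command-file path an assumption for this installation:

    # Acknowledge a known service problem so it stops re-notifying until recovery.
    # ACKNOWLEDGE_SVC_PROBLEM;<host>;<service>;<sticky>;<notify>;<persistent>;<author>;<comment>
    # is standard Nagios/Icinga external-command syntax; the command file path is assumed.
    now=$(date +%s)
    printf '[%d] ACKNOWLEDGE_SVC_PROBLEM;mw2215;Apache HTTP;1;0;1;daniel_zahn;new https://phabricator.wikimedia.org/T135466\n' \
        "$now" > /var/lib/icinga/rw/icinga.cmd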
[04:38:58] PROBLEM - salt-minion processes on mw2219 is CRITICAL: Timeout while attempting connection [04:39:28] PROBLEM - Check size of conntrack table on mw2219 is CRITICAL: Timeout while attempting connection [04:39:40] PROBLEM - DPKG on mw2219 is CRITICAL: Timeout while attempting connection [04:42:58] PROBLEM - NTP on mw2215 is CRITICAL: NTP CRITICAL: Offset unknown [04:46:49] RECOVERY - NTP on mw2215 is OK: NTP OK: Offset -0.009445309639 secs [04:51:48] PROBLEM - puppet last run on mw2215 is CRITICAL: CRITICAL: Puppet has 2 failures [04:53:59] PROBLEM - Apache HTTP on mw2215 is CRITICAL: Connection refused [04:56:00] PROBLEM - puppet last run on mw2216 is CRITICAL: CRITICAL: Puppet has 2 failures [04:58:19] PROBLEM - Apache HTTP on mw2216 is CRITICAL: Connection refused [05:02:49] RECOVERY - Apache HTTP on mw2219 is OK: HTTP OK: HTTP/1.1 200 OK - 11378 bytes in 0.079 second response time [05:05:49] RECOVERY - MD RAID on mw2219 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [05:05:59] RECOVERY - nutcracker process on mw2219 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker [05:06:29] RECOVERY - salt-minion processes on mw2219 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [05:06:40] RECOVERY - configured eth on mw2219 is OK: OK - interfaces up [05:06:59] RECOVERY - dhclient process on mw2219 is OK: PROCS OK: 0 processes with command name dhclient [05:06:59] RECOVERY - Check size of conntrack table on mw2219 is OK: OK: nf_conntrack is 0 % full [05:07:28] RECOVERY - Disk space on mw2219 is OK: DISK OK [05:07:38] RECOVERY - nutcracker port on mw2219 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [05:08:09] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:09:19] RECOVERY - DPKG on mw2219 is OK: All packages OK [05:10:09] PROBLEM - NTP on mw2219 is CRITICAL: NTP CRITICAL: Offset unknown [05:11:50] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 0.013 second response time [05:19:59] RECOVERY - NTP on mw2219 is OK: NTP OK: Offset 0.00165784359 secs [05:27:49] PROBLEM - puppet last run on mw2219 is CRITICAL: CRITICAL: Puppet has 2 failures [05:28:19] PROBLEM - Apache HTTP on mw2219 is CRITICAL: Connection refused [05:38:30] PROBLEM - Start and verify pages via webservices on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - 187 bytes in 10.817 second response time [05:42:44] yuvipanda: ^ [05:50:28] RECOVERY - Start and verify pages via webservices on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 15.795 second response time [05:53:28] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.006 second response time [05:59:29] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 8.983 second response time [06:01:30] 06Operations: ms-be2012 ran out of disk space - https://phabricator.wikimedia.org/T137397#2367102 (10Dzahn) [06:02:06] 06Operations: ms-be2012 ran out of disk space - https://phabricator.wikimedia.org/T137397#2367114 (10Dzahn) a:03faidon [06:06:41] 06Operations: ms-be2012 ran out of disk space - https://phabricator.wikimedia.org/T137397#2367116 (10Dzahn) meanwhile it's already warning again https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=ms-be2012&service=Disk+space and about 8G .. 
there is a lot of activity there , PUTs and DELETEs... [06:10:49] ACKNOWLEDGEMENT - Apache HTTP on mw2215 is CRITICAL: Connection refused daniel_zahn new https://phabricator.wikimedia.org/T135466 [06:10:49] ACKNOWLEDGEMENT - mediawiki-installation DSH group on mw2215 is CRITICAL: Host mw2215 is not in mediawiki-installation dsh group daniel_zahn new https://phabricator.wikimedia.org/T135466 [06:10:49] ACKNOWLEDGEMENT - puppet last run on mw2215 is CRITICAL: CRITICAL: Puppet has 1 failures daniel_zahn new https://phabricator.wikimedia.org/T135466 [06:10:49] ACKNOWLEDGEMENT - Apache HTTP on mw2216 is CRITICAL: Connection refused daniel_zahn new https://phabricator.wikimedia.org/T135466 [06:10:49] ACKNOWLEDGEMENT - mediawiki-installation DSH group on mw2216 is CRITICAL: Host mw2216 is not in mediawiki-installation dsh group daniel_zahn new https://phabricator.wikimedia.org/T135466 [06:10:49] ACKNOWLEDGEMENT - puppet last run on mw2216 is CRITICAL: CRITICAL: Puppet has 1 failures daniel_zahn new https://phabricator.wikimedia.org/T135466 [06:10:50] ACKNOWLEDGEMENT - Apache HTTP on mw2219 is CRITICAL: Connection refused daniel_zahn new https://phabricator.wikimedia.org/T135466 [06:10:50] ACKNOWLEDGEMENT - mediawiki-installation DSH group on mw2219 is CRITICAL: Host mw2219 is not in mediawiki-installation dsh group daniel_zahn new https://phabricator.wikimedia.org/T135466 [06:10:51] ACKNOWLEDGEMENT - puppet last run on mw2219 is CRITICAL: CRITICAL: Puppet has 1 failures daniel_zahn new https://phabricator.wikimedia.org/T135466 [06:13:10] 06Operations, 10ops-codfw: rack/setup/deploy new codfw mw app servers - https://phabricator.wikimedia.org/T135466#2299896 (10Dzahn) These have been added to DNS now , mgmt and prod IPs, changes by papaul, i reviewed and merged. papaul started installing servers ..mw2215 thru mw2219 already showing up in Icinga... [06:14:45] 06Operations, 10ops-codfw: rack/setup/deploy new codfw mw app servers - https://phabricator.wikimedia.org/T135466#2367125 (10Dzahn) https://gerrit.wikimedia.org/r/#/c/293446/ https://gerrit.wikimedia.org/r/#/c/292307/ https://gerrit.wikimedia.org/r/#/c/293246/ https://gerrit.wikimedia.org/r/#/c/293218/ https:/... 
[06:20:35] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 2 failures [06:26:34] RECOVERY - Disk space on ms-be2012 is OK: DISK OK [06:30:45] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:16] PROBLEM - puppet last run on db1018 is CRITICAL: CRITICAL: Puppet has 2 failures [06:32:49] PROBLEM - puppet last run on mw2081 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:57] PROBLEM - puppet last run on ms-fe1004 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:47] PROBLEM - puppet last run on mw2158 is CRITICAL: CRITICAL: Puppet has 1 failures [06:34:17] PROBLEM - puppet last run on labnet1002 is CRITICAL: CRITICAL: Puppet has 2 failures [06:35:37] PROBLEM - puppet last run on mw1110 is CRITICAL: CRITICAL: Puppet has 1 failures [06:36:18] PROBLEM - Disk space on logstash1001 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=97%) [06:36:48] PROBLEM - puppet last run on mw2129 is CRITICAL: CRITICAL: Puppet has 1 failures [06:44:25] (03PS3) 10Muehlenhoff: Provide the firejail containment for imagemagick's convert(1) on all app servers [puppet] - 10https://gerrit.wikimedia.org/r/293328 (https://phabricator.wikimedia.org/T135111) [06:54:48] PROBLEM - puppet last run on snapshot1002 is CRITICAL: CRITICAL: Puppet has 1 failures [06:55:28] RECOVERY - puppet last run on ms-fe1004 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [06:55:37] PROBLEM - puppet last run on labcontrol1002 is CRITICAL: CRITICAL: Puppet has 1 failures [06:56:37] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [06:56:38] RECOVERY - puppet last run on db1018 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:56:58] RECOVERY - puppet last run on mw1110 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:18] RECOVERY - puppet last run on mw2081 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:38] RECOVERY - puppet last run on labnet1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:27] RECOVERY - puppet last run on mw2129 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:48] RECOVERY - puppet last run on mw2158 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:10:21] PROBLEM - Disk space on lithium is CRITICAL: DISK CRITICAL - free space: /srv/syslog 10976 MB (3% inode=99%) [07:11:18] what is lithium? [07:13:02] log aggregation via rsyslog, little used AFAICT [07:14:11] RECOVERY - Disk space on lithium is OK: DISK OK [07:14:24] that check seems bogus, there's 73 GB free? 
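On the lithium question just above: the disk space checks alert on percent free via check_disk, so a large /srv/syslog volume can dip under a percentage threshold (and recover minutes later, as it did here) while still holding a sizeable absolute amount of space. A sketch of how such a check reads, with the thresholds assumed to match the 3% figure in the alert:

    # check_disk compares percent free, not absolute bytes: with -c 3% the check
    # below goes CRITICAL even though ~11 GB is still free on a large volume.
    /usr/lib/nagios/plugins/check_disk -w 6% -c 3% -p /srv/syslog
    # DISK CRITICAL - free space: /srv/syslog 10976 MB (3% inode=99%)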
[07:14:30] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:16:30] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 8.357 second response time [07:17:10] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:18:52] RECOVERY - puppet last run on snapshot1002 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [07:22:56] !log removed /var/log/logstash/logstash.1 on logstash1001, logspam (similar to the what is described in https://github.com/logstash-plugins/logstash-output-elasticsearch/issues/144) depleted the space on the root partition [07:23:30] RECOVERY - puppet last run on labcontrol1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:24:11] RECOVERY - Disk space on logstash1001 is OK: DISK OK [07:43:40] (03CR) 10Elukey: "Alex point is that configuring specific monitors in a puppet module, rather than in a role, brings us to issues like the one we are discus" [puppet] - 10https://gerrit.wikimedia.org/r/292568 (https://phabricator.wikimedia.org/T135145) (owner: 10Elukey) [07:51:19] (03Draft2) 10Gehel: Script to do the initial data load from OSM for Maps project [puppet] - 10https://gerrit.wikimedia.org/r/293105 [07:51:38] (03CR) 10Gehel: Script to do the initial data load from OSM for Maps project (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/293105 (owner: 10Gehel) [08:04:49] (03PS1) 10Gehel: explicitely set input reader format in osm2pgsql [puppet] - 10https://gerrit.wikimedia.org/r/293475 (https://phabricator.wikimedia.org/T112423) [08:09:55] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:11:45] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 0.008 second response time [08:15:07] 06Operations, 06Discovery, 06Discovery-Search-Backlog, 10Elasticsearch, 10Wikimedia-Logstash: Logstash elasticsearch mapping does not allow err.code to be a string - https://phabricator.wikimedia.org/T137400#2367183 (10Gehel) [08:15:50] 06Operations, 06Discovery, 06Discovery-Search-Backlog, 10Elasticsearch, 10Wikimedia-Logstash: Logstash elasticsearch mapping does not allow err.code to be a string - https://phabricator.wikimedia.org/T137400#2367196 (10Gehel) [08:25:36] PROBLEM - puppet last run on ms-be2017 is CRITICAL: CRITICAL: puppet fail [08:25:58] (03PS2) 10Alexandros Kosiaris: ores: Add redis settings to worker nodes in labs [puppet] - 10https://gerrit.wikimedia.org/r/293429 (owner: 10Ladsgroup) [08:26:08] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] ores: Add redis settings to worker nodes in labs [puppet] - 10https://gerrit.wikimedia.org/r/293429 (owner: 10Ladsgroup) [08:28:14] (03CR) 10Alexandros Kosiaris: [C: 04-1] ores: Add graphite settings (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/293442 (https://phabricator.wikimedia.org/T137367) (owner: 10Ladsgroup) [08:32:09] (03CR) 10Alexandros Kosiaris: [C: 031] ores: move config file to /etc/ores [puppet] - 10https://gerrit.wikimedia.org/r/292516 (owner: 10Ladsgroup) [08:38:16] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:38:24] (03CR) 10Ema: [C: 031] Force varnishkafka (compatible with Varnish 4) to output the Resp timestamp. 
[puppet] - 10https://gerrit.wikimedia.org/r/293327 (https://phabricator.wikimedia.org/T136314) (owner: 10Elukey) [08:38:26] !log rolling restart of app server canaries for libtasn security update [08:40:05] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 0.019 second response time [08:42:05] (03PS4) 10Elukey: Force Varnishkafka to output the Resp timestamp. [puppet] - 10https://gerrit.wikimedia.org/r/293327 (https://phabricator.wikimedia.org/T136314) [08:42:58] (03CR) 10Ema: [C: 031] Force Varnishkafka to output the Resp timestamp. [puppet] - 10https://gerrit.wikimedia.org/r/293327 (https://phabricator.wikimedia.org/T136314) (owner: 10Elukey) [08:43:35] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [08:43:48] (03CR) 10Elukey: [C: 032] "Puppet compiler looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/293327 (https://phabricator.wikimedia.org/T136314) (owner: 10Elukey) [08:44:47] akosiaris: o/ shall I merge your ores commit too? [08:44:54] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [08:47:30] seems only labs stuff, probably safe to merge [08:48:17] merging [08:48:44] from https://gerrit.wikimedia.org/r/#/c/293429/ it seems that we are backporting a config from prod to labs [08:49:26] PROBLEM - puppet last run on fluorine is CRITICAL: CRITICAL: Puppet has 1 failures [08:49:26] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [08:50:12] (03PS4) 10Gehel: Script to do the initial data load from OSM for Maps project [puppet] - 10https://gerrit.wikimedia.org/r/293105 [08:50:24] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [08:50:25] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 1 failures [08:50:35] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [08:50:45] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [08:51:05] (03PS2) 10Ladsgroup: ores: Add graphite settings [puppet] - 10https://gerrit.wikimedia.org/r/293442 (https://phabricator.wikimedia.org/T137367) [08:52:24] RECOVERY - puppet last run on ms-be2017 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [08:54:59] !log installing libtasn security updates [08:56:15] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [08:56:35] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [09:05:17] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.013 second response time [09:05:30] !log restarting uwsgi-ores celery-ores-worker in scb1001 and scb1002 [09:08:57] PROBLEM - puppet last run on db2052 is CRITICAL: CRITICAL: Puppet has 1 failures [09:09:08] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 0.030 second response time [09:10:03] (03PS2) 10Gehel: Change expired file zoom level from 16 to 15. 
[puppet] - 10https://gerrit.wikimedia.org/r/291885 (https://phabricator.wikimedia.org/T136483) [09:12:07] (03CR) 10Gehel: [C: 032] Change expired file zoom level from 16 to 15. [puppet] - 10https://gerrit.wikimedia.org/r/291885 (https://phabricator.wikimedia.org/T136483) (owner: 10Gehel) [09:14:56] RECOVERY - puppet last run on fluorine is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [09:16:47] !log lowering disk high watermark to rebalance disk usage on elasticsearch eqiad cluster [09:21:07] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:23:06] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 0.018 second response time [09:32:35] (03PS4) 10Muehlenhoff: Provide the firejail containment for imagemagick's convert(1) on all app servers [puppet] - 10https://gerrit.wikimedia.org/r/293328 (https://phabricator.wikimedia.org/T135111) [09:36:23] (03PS1) 10Gehel: Configure proxy for HTTPS as well as HTTP in replicate-osm. [puppet] - 10https://gerrit.wikimedia.org/r/293477 [09:37:13] RECOVERY - puppet last run on db2052 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:38:14] (03PS2) 10Gehel: Configure proxy for HTTPS as well as HTTP in replicate-osm. [puppet] - 10https://gerrit.wikimedia.org/r/293477 (https://phabricator.wikimedia.org/T134901) [09:46:52] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [09:46:53] RECOVERY - Make sure enwiki dumps are not empty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.051 second response time [09:47:36] (03CR) 10Muehlenhoff: [C: 032 V: 032] Provide the firejail containment for imagemagick's convert(1) on all app servers [puppet] - 10https://gerrit.wikimedia.org/r/293328 (https://phabricator.wikimedia.org/T135111) (owner: 10Muehlenhoff) [09:53:02] (03PS1) 10Muehlenhoff: Skip installing the imagemagick profile for now [puppet] - 10https://gerrit.wikimedia.org/r/293478 [09:53:38] (03CR) 10Muehlenhoff: [C: 032 V: 032] Skip installing the imagemagick profile for now [puppet] - 10https://gerrit.wikimedia.org/r/293478 (owner: 10Muehlenhoff) [09:54:52] PROBLEM - puppet last run on mw1138 is CRITICAL: CRITICAL: Puppet has 1 failures [09:55:02] PROBLEM - puppet last run on mw2174 is CRITICAL: CRITICAL: Puppet has 1 failures [09:55:03] PROBLEM - puppet last run on mw1094 is CRITICAL: CRITICAL: Puppet has 1 failures [09:55:32] PROBLEM - puppet last run on mw1165 is CRITICAL: CRITICAL: Puppet has 1 failures [09:55:33] PROBLEM - puppet last run on snapshot1005 is CRITICAL: CRITICAL: Puppet has 1 failures [09:55:42] PROBLEM - puppet last run on mw2107 is CRITICAL: CRITICAL: Puppet has 1 failures [09:56:12] PROBLEM - puppet last run on mw2106 is CRITICAL: CRITICAL: Puppet has 1 failures [09:56:13] PROBLEM - puppet last run on mw1245 is CRITICAL: CRITICAL: Puppet has 1 failures [09:56:13] PROBLEM - puppet last run on mw1181 is CRITICAL: CRITICAL: Puppet has 1 failures [09:56:14] PROBLEM - puppet last run on mw1233 is CRITICAL: CRITICAL: Puppet has 1 failures [09:56:14] PROBLEM - puppet last run on mw2094 is CRITICAL: CRITICAL: Puppet has 1 failures [09:56:32] PROBLEM - puppet last run on mw2186 is CRITICAL: CRITICAL: Puppet has 1 failures [09:56:34] PROBLEM - puppet last run on mw1218 is CRITICAL: CRITICAL: Puppet has 1 failures [09:56:42] PROBLEM - puppet last run on mw2133 is CRITICAL: CRITICAL: Puppet has 
1 failures [09:56:43] PROBLEM - puppet last run on mw2202 is CRITICAL: CRITICAL: Puppet has 1 failures [09:56:43] PROBLEM - puppet last run on mw2193 is CRITICAL: CRITICAL: Puppet has 1 failures [09:56:44] PROBLEM - puppet last run on mw1178 is CRITICAL: CRITICAL: Puppet has 1 failures [09:57:02] PROBLEM - puppet last run on mw1132 is CRITICAL: CRITICAL: Puppet has 1 failures [09:57:04] puppet failures should resolve soon [09:57:13] PROBLEM - puppet last run on mw2144 is CRITICAL: CRITICAL: Puppet has 1 failures [09:57:13] PROBLEM - puppet last run on mw1200 is CRITICAL: CRITICAL: Puppet has 1 failures [09:57:14] PROBLEM - puppet last run on mw1240 is CRITICAL: CRITICAL: Puppet has 1 failures [09:57:23] PROBLEM - puppet last run on mw1198 is CRITICAL: CRITICAL: Puppet has 1 failures [09:57:23] PROBLEM - puppet last run on mw1256 is CRITICAL: CRITICAL: Puppet has 1 failures [09:57:33] PROBLEM - puppet last run on mw2167 is CRITICAL: CRITICAL: Puppet has 1 failures [09:57:53] PROBLEM - puppet last run on mw2209 is CRITICAL: CRITICAL: Puppet has 1 failures [09:58:04] PROBLEM - puppet last run on mw1141 is CRITICAL: CRITICAL: Puppet has 1 failures [09:58:04] PROBLEM - puppet last run on mw1116 is CRITICAL: CRITICAL: Puppet has 1 failures [09:58:04] PROBLEM - puppet last run on mw1145 is CRITICAL: CRITICAL: Puppet has 1 failures [09:58:12] PROBLEM - puppet last run on mw1262 is CRITICAL: CRITICAL: Puppet has 1 failures [09:58:13] PROBLEM - puppet last run on mw1001 is CRITICAL: CRITICAL: Puppet has 1 failures [09:58:14] PROBLEM - puppet last run on mw1174 is CRITICAL: CRITICAL: Puppet has 1 failures [09:59:04] PROBLEM - puppet last run on mw1008 is CRITICAL: CRITICAL: Puppet has 1 failures [09:59:13] PROBLEM - puppet last run on mw2157 is CRITICAL: CRITICAL: Puppet has 1 failures [09:59:13] PROBLEM - puppet last run on mw2118 is CRITICAL: CRITICAL: Puppet has 1 failures [09:59:14] PROBLEM - puppet last run on mw2091 is CRITICAL: CRITICAL: Puppet has 1 failures [09:59:22] PROBLEM - puppet last run on mw1219 is CRITICAL: CRITICAL: Puppet has 1 failures [09:59:33] PROBLEM - puppet last run on mw2076 is CRITICAL: CRITICAL: Puppet has 1 failures [09:59:42] RECOVERY - puppet last run on mw2107 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:59:43] PROBLEM - puppet last run on mw1231 is CRITICAL: CRITICAL: Puppet has 1 failures [09:59:44] PROBLEM - puppet last run on mw1022 is CRITICAL: CRITICAL: Puppet has 1 failures [09:59:52] PROBLEM - puppet last run on mw1120 is CRITICAL: CRITICAL: Puppet has 1 failures [09:59:53] PROBLEM - puppet last run on mw1185 is CRITICAL: CRITICAL: Puppet has 1 failures [09:59:53] PROBLEM - puppet last run on mw1139 is CRITICAL: CRITICAL: Puppet has 1 failures [10:00:03] PROBLEM - puppet last run on mw2141 is CRITICAL: CRITICAL: Puppet has 1 failures [10:00:22] PROBLEM - puppet last run on mw1093 is CRITICAL: CRITICAL: Puppet has 1 failures [10:00:23] PROBLEM - puppet last run on mw2081 is CRITICAL: CRITICAL: Puppet has 1 failures [10:00:23] PROBLEM - puppet last run on mw2145 is CRITICAL: CRITICAL: Puppet has 1 failures [10:00:23] PROBLEM - puppet last run on terbium is CRITICAL: CRITICAL: Puppet has 1 failures [10:00:33] PROBLEM - puppet last run on mw1220 is CRITICAL: CRITICAL: Puppet has 1 failures [10:00:43] PROBLEM - puppet last run on mw1215 is CRITICAL: CRITICAL: Puppet has 1 failures [10:00:54] PROBLEM - puppet last run on mw2069 is CRITICAL: CRITICAL: Puppet has 1 failures [10:01:13] PROBLEM - puppet last run on mw1226 is 
CRITICAL: CRITICAL: Puppet has 1 failures [10:06:23] RECOVERY - puppet last run on mw1181 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:20:42] RECOVERY - puppet last run on mw1218 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [10:20:52] RECOVERY - puppet last run on mw1138 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [10:21:03] RECOVERY - puppet last run on mw1094 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:21:03] RECOVERY - puppet last run on mw2174 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:21:23] RECOVERY - puppet last run on mw1198 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [10:21:32] RECOVERY - puppet last run on snapshot1005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:22:03] RECOVERY - puppet last run on mw1116 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [10:22:13] RECOVERY - puppet last run on mw2106 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [10:22:15] RECOVERY - puppet last run on mw1245 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:22:15] RECOVERY - puppet last run on mw1233 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [10:22:23] RECOVERY - puppet last run on mw2094 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:22:33] RECOVERY - puppet last run on mw2186 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [10:22:33] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 3 failures [10:22:43] RECOVERY - puppet last run on mw2133 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:22:43] RECOVERY - puppet last run on mw2202 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:22:52] RECOVERY - puppet last run on mw2193 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [10:22:52] RECOVERY - puppet last run on mw1178 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [10:23:03] RECOVERY - puppet last run on mw1132 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:23:13] RECOVERY - puppet last run on mw1200 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [10:23:22] RECOVERY - puppet last run on mw2144 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [10:23:22] RECOVERY - puppet last run on mw1240 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:23:32] RECOVERY - puppet last run on mw1256 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:23:33] RECOVERY - puppet last run on mw1165 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:23:42] RECOVERY - puppet last run on mw2167 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:23:52] RECOVERY - puppet last run on mw1022 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [10:24:02] RECOVERY - puppet last run on mw2209 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [10:24:12] RECOVERY - puppet last run on mw1145 is OK: OK: Puppet is currently enabled, last run 34 seconds 
ago with 0 failures [10:24:12] RECOVERY - puppet last run on mw1141 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [10:24:13] RECOVERY - puppet last run on mw1262 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [10:24:14] RECOVERY - puppet last run on mw1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:24:22] RECOVERY - puppet last run on mw1093 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [10:24:53] RECOVERY - puppet last run on mw2069 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [10:25:04] RECOVERY - puppet last run on mw1008 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [10:25:14] RECOVERY - puppet last run on mw2157 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [10:25:14] RECOVERY - puppet last run on mw2118 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [10:25:22] RECOVERY - puppet last run on mw2091 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:25:23] RECOVERY - puppet last run on mw1219 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [10:25:33] RECOVERY - puppet last run on mw2076 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [10:25:43] RECOVERY - puppet last run on mw1231 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:25:44] RECOVERY - puppet last run on mw1120 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [10:25:52] RECOVERY - puppet last run on mw1185 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:25:53] RECOVERY - puppet last run on mw1139 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:26:04] RECOVERY - puppet last run on mw2141 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:26:13] RECOVERY - puppet last run on mw1174 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:26:22] RECOVERY - puppet last run on terbium is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [10:26:22] RECOVERY - puppet last run on mw2081 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [10:26:23] RECOVERY - puppet last run on mw2145 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [10:26:32] RECOVERY - puppet last run on mw1220 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:26:33] RECOVERY - puppet last run on mw1215 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:27:03] PROBLEM - restbase endpoints health on praseodymium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:27:03] RECOVERY - puppet last run on mw1226 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:29:03] RECOVERY - restbase endpoints health on praseodymium is OK: All endpoints are healthy [10:35:47] PROBLEM - restbase endpoints health on xenon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:37:07] (03CR) 10Alexandros Kosiaris: [C: 031] Configure proxy for HTTPS as well as HTTP in replicate-osm. 
[puppet] - 10https://gerrit.wikimedia.org/r/293477 (https://phabricator.wikimedia.org/T134901) (owner: 10Gehel) [10:37:37] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:39:27] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 7.616 second response time [10:39:47] RECOVERY - restbase endpoints health on xenon is OK: All endpoints are healthy [10:46:57] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 674 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5572966 keys - replication_delay is 674 [10:48:36] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [10:48:37] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [10:49:28] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:50:56] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5526228 keys - replication_delay is 0 [10:51:14] !log Restarting Cassandra on {cerium,praseodymium}.eqiad.wmnet (RESTBase staging) : T126629 [10:51:15] T126629: Cassandra 2.1.13 and/or 2.2.6 - https://phabricator.wikimedia.org/T126629 [10:51:26] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 8.757 second response time [10:52:13] (03CR) 10Muehlenhoff: [C: 04-1] contint: cleanup gallium / use contint1001 (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/293283 (https://phabricator.wikimedia.org/T137358) (owner: 10Hashar) [10:53:55] 06Operations, 10Ops-Access-Requests, 10LDAP-Access-Requests, 07LDAP: Grant graphite.wikimedia.org rights to grafana-admin LDAP group - https://phabricator.wikimedia.org/T137373#2366461 (10JanZerebecki) Based on your existing production access, you should be given the nda group. [10:54:16] (03CR) 10Alexandros Kosiaris: [C: 04-1] explicitely set input reader format in osm2pgsql (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/293475 (https://phabricator.wikimedia.org/T112423) (owner: 10Gehel) [10:54:52] (03CR) 10JanZerebecki: "I would suggest to go with the only winning option." [puppet] - 10https://gerrit.wikimedia.org/r/292568 (https://phabricator.wikimedia.org/T135145) (owner: 10Elukey) [10:55:37] PROBLEM - restbase endpoints health on xenon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:56:27] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [10:56:27] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [10:57:48] (03CR) 10Alexandros Kosiaris: [C: 032] ores: Add graphite settings [puppet] - 10https://gerrit.wikimedia.org/r/293442 (https://phabricator.wikimedia.org/T137367) (owner: 10Ladsgroup) [10:57:55] (03PS3) 10Alexandros Kosiaris: ores: Add graphite settings [puppet] - 10https://gerrit.wikimedia.org/r/293442 (https://phabricator.wikimedia.org/T137367) (owner: 10Ladsgroup) [10:58:01] (03CR) 10Alexandros Kosiaris: [V: 032] ores: Add graphite settings [puppet] - 10https://gerrit.wikimedia.org/r/293442 (https://phabricator.wikimedia.org/T137367) (owner: 10Ladsgroup) [10:59:13] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Right before merging I noticed that the default should also be changed." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/293442 (https://phabricator.wikimedia.org/T137367) (owner: 10Ladsgroup) [11:02:54] RECOVERY - restbase endpoints health on xenon is OK: All endpoints are healthy [11:03:53] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:05:03] PROBLEM - restbase endpoints health on cerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:05:33] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 0.026 second response time [11:05:33] PROBLEM - restbase endpoints health on praseodymium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:06:22] urandom: ^ [11:06:24] known? [11:08:13] PROBLEM - restbase endpoints health on xenon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:08:35] urandom went for lunch, sigh [11:08:43] 06Operations: decom furud - https://phabricator.wikimedia.org/T137221#2367426 (10akosiaris) >>! In T137221#2361776, @Dzahn wrote: > @akosiaris have you seen the problem above before when deleting VMs? ^ No, but then again we haven't been deleting VMs much yet. I am not sure what that is, but it might be a netw... [11:08:46] that's staging, though, so no worries [11:09:48] (03CR) 10Alexandros Kosiaris: [C: 031] Maps tables should all be owned by osmimporter [puppet] - 10https://gerrit.wikimedia.org/r/293331 (https://phabricator.wikimedia.org/T134901) (owner: 10Gehel) [11:12:14] (03Abandoned) 10KartikMistry: lttoolbox: New upstream version [debs/contenttranslation/lttoolbox] - 10https://gerrit.wikimedia.org/r/269115 (https://phabricator.wikimedia.org/T124137) (owner: 10KartikMistry) [11:12:53] RECOVERY - restbase endpoints health on cerium is OK: All endpoints are healthy [11:13:14] (03PS1) 10KartikMistry: lttoolbox: New upstream version [debs/contenttranslation/lttoolbox] - 10https://gerrit.wikimedia.org/r/293484 (https://phabricator.wikimedia.org/T124137) [11:13:23] RECOVERY - restbase endpoints health on praseodymium is OK: All endpoints are healthy [11:13:53] akosiaris: I'm rebuilding all Apertium packages with +wmf version scheme, so it is easier to track when importing from Debian. [11:14:15] akosiaris: I'll ping once base packages are ready. [11:15:08] kart_: ok [11:16:02] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [11:19:14] PROBLEM - cassandra-a service on xenon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
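On the "+wmf version scheme" mentioned above for the apertium rebuilds: suffixing the upstream Debian version with +wmfN makes the local build sort after the exact version it was imported from, but still before the next upstream release, which is what keeps imports from Debian easy to track. The version numbers below are illustrative:

    # Debian version ordering for a +wmf rebuild: it supersedes the imported
    # upstream version, and a newer upstream version supersedes it in turn.
    dpkg --compare-versions 3.4.0-1 lt 3.4.0-1+wmf1 && echo "wmf rebuild is newer"
    dpkg --compare-versions 3.4.0-1+wmf1 lt 3.4.2-1 && echo "next upstream still wins"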
[11:19:24] (03PS4) 10Ladsgroup: ores: Add graphite settings [puppet] - 10https://gerrit.wikimedia.org/r/293442 (https://phabricator.wikimedia.org/T137367) [11:21:13] RECOVERY - cassandra-a service on xenon is OK: OK - cassandra-a is active [11:22:12] RECOVERY - restbase endpoints health on xenon is OK: All endpoints are healthy [11:23:55] (03PS1) 10KartikMistry: cg3: New upstream release [debs/contenttranslation/cg3] - 10https://gerrit.wikimedia.org/r/293485 (https://phabricator.wikimedia.org/T107306) [11:25:48] (03CR) 10Alexandros Kosiaris: [C: 031] Define backup for contint [puppet] - 10https://gerrit.wikimedia.org/r/293294 (https://phabricator.wikimedia.org/T80385) (owner: 10Muehlenhoff) [11:26:17] (03CR) 10Alexandros Kosiaris: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/293294 (https://phabricator.wikimedia.org/T80385) (owner: 10Muehlenhoff) [11:27:23] (03CR) 10jenkins-bot: [V: 04-1] Define backup for contint [puppet] - 10https://gerrit.wikimedia.org/r/293294 (https://phabricator.wikimedia.org/T80385) (owner: 10Muehlenhoff) [11:27:34] (03CR) 10Alexandros Kosiaris: [C: 032] ores: Add graphite settings [puppet] - 10https://gerrit.wikimedia.org/r/293442 (https://phabricator.wikimedia.org/T137367) (owner: 10Ladsgroup) [11:27:39] (03PS5) 10Alexandros Kosiaris: ores: Add graphite settings [puppet] - 10https://gerrit.wikimedia.org/r/293442 (https://phabricator.wikimedia.org/T137367) (owner: 10Ladsgroup) [11:27:44] (03CR) 10Alexandros Kosiaris: [V: 032] ores: Add graphite settings [puppet] - 10https://gerrit.wikimedia.org/r/293442 (https://phabricator.wikimedia.org/T137367) (owner: 10Ladsgroup) [11:27:53] (03PS2) 10Gehel: Maps tables should all be owned by osmimporter [puppet] - 10https://gerrit.wikimedia.org/r/293331 (https://phabricator.wikimedia.org/T134901) [11:28:04] PROBLEM - restbase endpoints health on xenon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:29:50] (03CR) 10Alexandros Kosiaris: [C: 04-1] Define backup for contint (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/293294 (https://phabricator.wikimedia.org/T80385) (owner: 10Muehlenhoff) [11:30:03] (03PS3) 10Gehel: Maps tables should all be owned by osmimporter [puppet] - 10https://gerrit.wikimedia.org/r/293331 (https://phabricator.wikimedia.org/T134901) [11:30:39] (03PS2) 10Alexandros Kosiaris: Fix whitespace issue [puppet] - 10https://gerrit.wikimedia.org/r/293291 (owner: 10Dereckson) [11:31:18] (03CR) 10Gehel: [C: 032] Maps tables should all be owned by osmimporter [puppet] - 10https://gerrit.wikimedia.org/r/293331 (https://phabricator.wikimedia.org/T134901) (owner: 10Gehel) [11:33:27] !log manually restarting ores-uwsgi and celery-ores-worker in scb100[12] [11:37:30] (03PS3) 10Gehel: WIP - Upgrade osm2pgsql to 0.90.0 [puppet] - 10https://gerrit.wikimedia.org/r/287600 (https://phabricator.wikimedia.org/T112423) [11:37:51] (03CR) 10Alexandros Kosiaris: "@Eevans, the cassandra role is not re-used anywhere but the restbase clusters. And that's a good thing. 
In fact, given that the cassandra " [puppet] - 10https://gerrit.wikimedia.org/r/290860 (owner: 10Eevans) [11:38:18] (03PS2) 10Gehel: explicitely set input reader format in osm2pgsql [puppet] - 10https://gerrit.wikimedia.org/r/293475 (https://phabricator.wikimedia.org/T112423) [11:38:34] (03PS3) 10Alexandros Kosiaris: Fix whitespace issue [puppet] - 10https://gerrit.wikimedia.org/r/293291 (owner: 10Dereckson) [11:38:40] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Fix whitespace issue [puppet] - 10https://gerrit.wikimedia.org/r/293291 (owner: 10Dereckson) [11:41:23] PROBLEM - cassandra-a service on xenon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:43:23] RECOVERY - cassandra-a service on xenon is OK: OK - cassandra-a is active [11:44:56] (03PS2) 10Alexandros Kosiaris: Update openldap module's README [puppet] - 10https://gerrit.wikimedia.org/r/292585 [11:45:02] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Update openldap module's README [puppet] - 10https://gerrit.wikimedia.org/r/292585 (owner: 10Alexandros Kosiaris) [11:48:13] RECOVERY - restbase endpoints health on xenon is OK: All endpoints are healthy [11:48:24] 06Operations, 10ops-eqiad: Rack/Setup new memcache servers mc1019-36 - https://phabricator.wikimedia.org/T137345#2367466 (10Cmjohnson) [11:51:27] mobrovac: yeah, sorry, it might do that... [11:51:41] mobrovac: i'm pushing it hard, trying to break it [11:52:14] no pb, urandom, better for it to be a controlled failure than a wtf :) [11:52:51] mobrovac: well, and it helps that it's not a production syste [11:52:52] m [11:53:03] that too :P [11:53:06] though it would be nice if it weren't monitored as such [11:53:42] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:56:38] (03PS2) 10Alexandros Kosiaris: networks::constants: use slice_network_constants [puppet] - 10https://gerrit.wikimedia.org/r/291819 [11:56:40] (03PS28) 10Alexandros Kosiaris: network: add $production_networks [puppet] - 10https://gerrit.wikimedia.org/r/260926 (https://phabricator.wikimedia.org/T122396) (owner: 10Faidon Liambotis) [11:57:21] (03CR) 10jenkins-bot: [V: 04-1] networks::constants: use slice_network_constants [puppet] - 10https://gerrit.wikimedia.org/r/291819 (owner: 10Alexandros Kosiaris) [11:57:27] (03CR) 10jenkins-bot: [V: 04-1] network: add $production_networks [puppet] - 10https://gerrit.wikimedia.org/r/260926 (https://phabricator.wikimedia.org/T122396) (owner: 10Faidon Liambotis) [11:57:32] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 0.016 second response time [11:59:18] (03CR) 10Alexandros Kosiaris: "@Faidon, mw_appserver_networks and analytics_networks are being done in the followup patch. Cleaning up the ERB defs is something I 'd lik" [puppet] - 10https://gerrit.wikimedia.org/r/260926 (https://phabricator.wikimedia.org/T122396) (owner: 10Faidon Liambotis) [12:00:23] PROBLEM - restbase endpoints health on xenon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:03:14] (03CR) 10JanZerebecki: "In case Alexandros comment wasn't to be understood as this is a social issue. 
I think the preferred thing in operations.git these days is " [puppet] - 10https://gerrit.wikimedia.org/r/292568 (https://phabricator.wikimedia.org/T135145) (owner: 10Elukey) [12:04:57] (03Abandoned) 10Gergő Tisza: Clean up AuthManager configuration (no-op) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293440 (https://phabricator.wikimedia.org/T135504) (owner: 10Gergő Tisza) [12:06:45] !log Temporarily disabling puppet on xenon.eqiad.wmnet to test settings : T126629 [12:06:46] T126629: Cassandra 2.1.13 and/or 2.2.6 - https://phabricator.wikimedia.org/T126629 [12:10:45] 06Operations, 10Ops-Access-Requests, 06Services: sc-admins should be able to join firejail containers - https://phabricator.wikimedia.org/T137412#2367544 (10mobrovac) [12:11:47] !log Restarting Cassandra on xenon.eqiad.wmnet : T126629 [12:11:48] T126629: Cassandra 2.1.13 and/or 2.2.6 - https://phabricator.wikimedia.org/T126629 [12:12:13] RECOVERY - restbase endpoints health on xenon is OK: All endpoints are healthy [12:12:27] 06Operations, 06Discovery, 10Traffic, 10Wikidata, 10Wikidata-Query-Service: Tune WDQS caching headers - https://phabricator.wikimedia.org/T137238#2367569 (10Gehel) It seems that at the moment only .js and .css files which ar in the js / css directories are processed by filerev. Is there any reason to not... [12:14:24] 06Operations, 10Ops-Access-Requests, 06Services: sc-admins should be able to join firejail containers - https://phabricator.wikimedia.org/T137412#2367571 (10MoritzMuehlenhoff) For background: firejail starts the various service in a separate namespace. "firejail --join" allows a user to join that namespace a... [12:14:33] (03PS1) 10Gergő Tisza: Enable AuthManager on group2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293491 (https://phabricator.wikimedia.org/T135504) [12:15:43] PROBLEM - puppet last run on xenon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:16:36] (03PS1) 10Gehel: Don't publish etags for WDQS [puppet] - 10https://gerrit.wikimedia.org/r/293492 (https://phabricator.wikimedia.org/T137238) [12:17:43] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.67% of data above the critical threshold [1000.0] [12:19:21] !log deploying [[gerrit:293459]] to fix morebots (T137377) [12:19:22] T137377: all morebots stopped listening to !log lines - https://phabricator.wikimedia.org/T137377 [12:19:43] PROBLEM - puppet last run on xenon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:20:19] thcipriani|afk: ^ (just in case you are still working on tin) [12:21:17] (03CR) 10Luke081515: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/293464 (https://phabricator.wikimedia.org/T137252) (owner: 10Microchip08) [12:24:24] PROBLEM - restbase endpoints health on xenon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:24:51] 06Operations, 10ContentTranslation-Deployments, 10ContentTranslation-cxserver, 10MediaWiki-extensions-ContentTranslation, and 4 others: Package and test apertium for Jessie - https://phabricator.wikimedia.org/T107306#2367600 (10KartikMistry) [12:26:23] RECOVERY - restbase endpoints health on xenon is OK: All endpoints are healthy [12:27:34] (03PS1) 10KartikMistry: hfst: New upstream release [debs/contenttranslation/hfst] - 10https://gerrit.wikimedia.org/r/293494 (https://phabricator.wikimedia.org/T95653) [12:29:43] PROBLEM - puppet last run on xenon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
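For background on the T137412 request above: a service started under firejail runs inside its own namespaces, and "firejail --join" is what lets an operator enter that sandbox to inspect the running process. Roughly, assuming the service was launched with an explicit sandbox name (the service path and name are illustrative):

    # Launch a service inside a named firejail sandbox.
    firejail --name=apertium-apy /usr/bin/apertium-apy &
    # List running sandboxes, then join one to get a shell inside its namespaces;
    # joining a sandbox owned by another user is what needs the extra privileges
    # being requested for sc-admins.
    firejail --list
    sudo firejail --join=apertium-apy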
[12:33:15] !log tgr@tin Synchronized php-1.28.0-wmf.5/extensions/LdapAuthentication/LdapPrimaryAuthenticationProvider.php: deploy [[gerrit:293459]] to fix wikitech API login / morebots (T137377) (duration: 00m 47s) [12:33:16] T137377: all morebots stopped listening to !log lines - https://phabricator.wikimedia.org/T137377 [12:33:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:33:51] PROBLEM - restbase endpoints health on xenon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:34:45] (03PS1) 10KartikMistry: apertium: New upstream release [debs/contenttranslation/apertium] - 10https://gerrit.wikimedia.org/r/293497 (https://phabricator.wikimedia.org/T107306) [12:37:30] PROBLEM - puppet last run on xenon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:45:31] RECOVERY - restbase endpoints health on xenon is OK: All endpoints are healthy [12:47:12] PROBLEM - puppet last run on xenon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:49:40] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 2 failures [12:50:59] !log Restarting Cassandra on xenon.eqiad.wmnet : T126629 [12:51:00] T126629: Cassandra 2.1.13 and/or 2.2.6 - https://phabricator.wikimedia.org/T126629 [12:51:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:51:11] PROBLEM - puppet last run on xenon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:02:19] 06Operations, 06Discovery, 10Traffic, 10Wikidata, and 2 others: Tune WDQS caching headers - https://phabricator.wikimedia.org/T137238#2367721 (10Jonas) We can only use filerev for files that are referenced from the html file, because the application itself is not aware about filerev and then will not find... [13:03:23] (03CR) 10Hashar: "Few replies to Moritz." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/293283 (https://phabricator.wikimedia.org/T137358) (owner: 10Hashar) [13:06:11] 06Operations, 10Continuous-Integration-Infrastructure (phase-out-gallium): Remove zuul-merger from gallium - https://phabricator.wikimedia.org/T137418#2367728 (10hashar) [13:08:00] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:11:20] PROBLEM - puppet last run on xenon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:11:41] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 0.029 second response time [13:12:49] (03PS1) 10Hashar: contint: remove zuul-merger from gallium [puppet] - 10https://gerrit.wikimedia.org/r/293501 (https://phabricator.wikimedia.org/T137418) [13:12:51] RECOVERY - Host mw1063 is UP: PING OK - Packet loss = 0%, RTA = 2.21 ms [13:13:31] PROBLEM - restbase endpoints health on xenon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:15:20] PROBLEM - puppet last run on xenon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
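(For context, the xenon test loop in the !log entries above -- disable puppet so hand-edited settings survive, restart Cassandra, re-enable later -- is roughly the following; the cassandra-a unit name is inferred from the Icinga checks and may differ:)

    # keep puppet from reverting local test settings; the reason is recorded with the lock
    sudo puppet agent --disable 'testing cassandra settings, T126629'
    # pick up the edited settings
    sudo systemctl restart cassandra-a
    # once done, re-enable puppet and force a run to restore the puppetised config
    sudo puppet agent --enable && sudo puppet agent --test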
[13:15:30] RECOVERY - restbase endpoints health on xenon is OK: All endpoints are healthy [13:15:31] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [13:16:33] (03PS5) 10Gehel: Script to do the initial data load from OSM for Maps project [puppet] - 10https://gerrit.wikimedia.org/r/293105 [13:18:50] (03CR) 10Anomie: [C: 031] Enable AuthManager on group2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293491 (https://phabricator.wikimedia.org/T135504) (owner: 10Gergő Tisza) [13:20:21] (03CR) 10JanZerebecki: [C: 031] "After looking at the change again. Forget what I said, correct solution." [puppet] - 10https://gerrit.wikimedia.org/r/292568 (https://phabricator.wikimedia.org/T135145) (owner: 10Elukey) [13:20:21] RECOVERY - carbon-cache too many creates on graphite1001 is OK: OK: Less than 1.00% above the threshold [500.0] [13:21:11] PROBLEM - puppet last run on xenon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:21:30] PROBLEM - restbase endpoints health on xenon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:22:18] (03CR) 10Alexandros Kosiaris: [C: 031] "@Elukey, on 1) since you 've justified adequately removing these redundant checks and are OK with you and andrew receiving the alarms and " [puppet] - 10https://gerrit.wikimedia.org/r/292568 (https://phabricator.wikimedia.org/T135145) (owner: 10Elukey) [13:25:21] (03PS4) 10Gehel: Upgrade osm2pgsql to 0.90.0 [puppet] - 10https://gerrit.wikimedia.org/r/287600 (https://phabricator.wikimedia.org/T112423) [13:26:39] !log Restarting Cassandra on xenon.eqiad.wmnet to apply 2G file cache : T137419 [13:26:40] T137419: Investigate aberrant disk read throughput in Cassandra 2.2.6 - https://phabricator.wikimedia.org/T137419 [13:26:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:27:12] PROBLEM - puppet last run on xenon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:27:20] RECOVERY - restbase endpoints health on xenon is OK: All endpoints are healthy [13:28:28] (03PS2) 10Hashar: contint: remove zuul-merger from gallium [puppet] - 10https://gerrit.wikimedia.org/r/293501 (https://phabricator.wikimedia.org/T137418) [13:33:51] (03PS3) 10Hashar: contint: remove zuul-merger from gallium [puppet] - 10https://gerrit.wikimedia.org/r/293501 (https://phabricator.wikimedia.org/T137418) [13:34:16] PROBLEM - restbase endpoints health on xenon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:35:37] (03CR) 10Hashar: "I have ran patchset 1 through the puppet compiler and noticed the ferm rule that allow Gearman on gallium would be dropped which would cau" [puppet] - 10https://gerrit.wikimedia.org/r/293501 (https://phabricator.wikimedia.org/T137418) (owner: 10Hashar) [13:36:06] RECOVERY - restbase endpoints health on xenon is OK: All endpoints are healthy [13:38:46] (03CR) 10Alexandros Kosiaris: [C: 04-1] "two inline comments, otherwise looks good to me" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/287600 (https://phabricator.wikimedia.org/T112423) (owner: 10Gehel) [13:39:46] PROBLEM - puppet last run on xenon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:41:53] (03PS1) 10Anomie: Add tags for group1 and group2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293502 [13:41:58] PROBLEM - restbase endpoints health on xenon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:42:22] (03PS2) 10Elukey: Remove old and redundant AQS specific alarms. 
[puppet] - 10https://gerrit.wikimedia.org/r/292568 (https://phabricator.wikimedia.org/T135145) [13:42:44] (03CR) 10Hashar: "I will restore /etc/default/zuul-merger to the default from the deb package and then disable the service on boot with:" [puppet] - 10https://gerrit.wikimedia.org/r/293501 (https://phabricator.wikimedia.org/T137418) (owner: 10Hashar) [13:44:45] (03CR) 10Elukey: [C: 032] "Thanks for all the comments! I am going to merge this code review and then follow up in https://phabricator.wikimedia.org/T137422" [puppet] - 10https://gerrit.wikimedia.org/r/292568 (https://phabricator.wikimedia.org/T135145) (owner: 10Elukey) [13:45:21] !log change-prop deploying 2161403c [13:45:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:46:38] (03CR) 10Hashar: [C: 04-1] contint: limit access to zuul-merger git daemon (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/293449 (https://phabricator.wikimedia.org/T137323) (owner: 10Dzahn) [13:50:16] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 5 failures [13:51:23] (03PS4) 10Hashar: contint: remove zuul-merger from gallium [puppet] - 10https://gerrit.wikimedia.org/r/293501 (https://phabricator.wikimedia.org/T137418) [13:52:30] !log change-prop restarting on scb1002 for update [13:52:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:53:56] (03PS5) 10Gehel: Upgrade osm2pgsql to 0.90.0 [puppet] - 10https://gerrit.wikimedia.org/r/287600 (https://phabricator.wikimedia.org/T112423) [13:54:17] RECOVERY - jenkins_zmq_publisher on contint1001 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 8888 [13:54:27] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:55:47] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 711 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5545535 keys - replication_delay is 711 [13:56:31] (03CR) 10Gehel: [C: 04-1] "tests need to be updated" [puppet] - 10https://gerrit.wikimedia.org/r/287600 (https://phabricator.wikimedia.org/T112423) (owner: 10Gehel) [13:56:47] 06Operations, 06Labs, 10Labs-Infrastructure, 06Release-Engineering-Team, and 2 others: Firewall rules for labs support host to communicate with contint1001.eqiad.wmnet (new gallium) - https://phabricator.wikimedia.org/T137323#2367839 (10hashar) @Dzahn thanks, though all those rules are indeed present on ho... [13:57:07] PROBLEM - cassandra-a service on xenon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:57:35] Icinga check jenkins_zmq_publisher on contint1001 can be ignored. Jenkins is not running there [13:57:37] PROBLEM - puppet last run on xenon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:57:46] been doing a nc -l on the host for tests purposes [13:58:17] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 1.172 second response time [13:59:07] RECOVERY - cassandra-a service on xenon is OK: OK - cassandra-a is active [13:59:21] 06Operations, 06Labs, 10Labs-Infrastructure, 06Release-Engineering-Team, and 3 others: Firewall rules for labs support host to communicate with contint1001.eqiad.wmnet (new gallium) - https://phabricator.wikimedia.org/T137323#2367841 (10Dzahn) [14:00:16] (03CR) 10Hashar: "I have dropped the explicit ferm rule for Zuul server -- gearman since ferm allow all traffic on 127.0.0.1." 
[puppet] - 10https://gerrit.wikimedia.org/r/293501 (https://phabricator.wikimedia.org/T137418) (owner: 10Hashar) [14:00:26] PROBLEM - jenkins_zmq_publisher on contint1001 is CRITICAL: connect to address 127.0.0.1 and port 8888: Connection refused [14:00:40] ^^^ jenkins_zmq_publisher can be ACK / ignored [14:02:57] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5530360 keys - replication_delay is 0 [14:04:17] RECOVERY - restbase endpoints health on xenon is OK: All endpoints are healthy [14:04:45] PROBLEM - puppet last run on xenon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:05:25] PROBLEM - Apache HTTP on mw1263 is CRITICAL: Connection timed out [14:05:27] PROBLEM - Check size of conntrack table on mw1266 is CRITICAL: Timeout while attempting connection [14:05:27] PROBLEM - dhclient process on mw1263 is CRITICAL: Timeout while attempting connection [14:05:45] PROBLEM - mediawiki-installation DSH group on mw1263 is CRITICAL: Host mw1263 is not in mediawiki-installation dsh group [14:05:55] PROBLEM - DPKG on mw1266 is CRITICAL: Timeout while attempting connection [14:06:15] PROBLEM - Disk space on mw1266 is CRITICAL: Timeout while attempting connection [14:06:15] PROBLEM - nutcracker port on mw1263 is CRITICAL: Timeout while attempting connection [14:06:35] PROBLEM - nutcracker process on mw1263 is CRITICAL: Timeout while attempting connection [14:06:35] PROBLEM - MD RAID on mw1266 is CRITICAL: Timeout while attempting connection [14:06:45] PROBLEM - puppet last run on mw1263 is CRITICAL: Timeout while attempting connection [14:06:52] 06Operations, 10ops-eqiad, 10Analytics-Cluster, 06Analytics-Kanban: analytics1049.eqiad.wmnet disk failure - https://phabricator.wikimedia.org/T137273#2367846 (10Ottomata) [14:06:55] PROBLEM - salt-minion processes on mw1263 is CRITICAL: Timeout while attempting connection [14:07:05] 06Operations, 10ops-eqiad, 10Analytics-Cluster, 06Analytics-Kanban: Smartctl disk defects on kafka1012 - https://phabricator.wikimedia.org/T136933#2367848 (10Ottomata) p:05Triage>03Normal [14:07:15] PROBLEM - Apache HTTP on mw1266 is CRITICAL: Connection timed out [14:07:17] PROBLEM - configured eth on mw1266 is CRITICAL: Timeout while attempting connection [14:07:35] PROBLEM - dhclient process on mw1266 is CRITICAL: Timeout while attempting connection [14:07:35] PROBLEM - Check size of conntrack table on mw1263 is CRITICAL: Timeout while attempting connection [14:07:39] !log Re-enabling puppet on xenon.eqiad.wmnet, forcing a run, and restarting Cassandra : T137419 [14:07:40] T137419: Investigate aberrant disk read throughput in Cassandra 2.2.6 - https://phabricator.wikimedia.org/T137419 [14:07:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:07:45] PROBLEM - mediawiki-installation DSH group on mw1266 is CRITICAL: Host mw1266 is not in mediawiki-installation dsh group [14:07:47] PROBLEM - DPKG on mw1263 is CRITICAL: Timeout while attempting connection [14:08:06] PROBLEM - nutcracker port on mw1266 is CRITICAL: Timeout while attempting connection [14:08:06] PROBLEM - Disk space on mw1263 is CRITICAL: Timeout while attempting connection [14:08:25] PROBLEM - MD RAID on mw1263 is CRITICAL: Timeout while attempting connection [14:08:25] PROBLEM - nutcracker process on mw1266 is CRITICAL: Timeout while attempting connection [14:08:26] RECOVERY - puppet last run on xenon is OK: OK: Puppet is currently enabled, last run 2 hours ago with 0 failures [14:08:45] PROBLEM - 
puppet last run on mw1266 is CRITICAL: Timeout while attempting connection [14:08:56] PROBLEM - salt-minion processes on mw1266 is CRITICAL: Timeout while attempting connection [14:09:15] PROBLEM - configured eth on mw1263 is CRITICAL: Timeout while attempting connection [14:12:13] !log change-prop restarting on scb1001 for update [14:12:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:16:26] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:16:46] PROBLEM - changeprop endpoints health on scb1001 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.0.16, port=7272): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [14:17:31] 06Operations, 10Ops-Access-Requests, 06Services: sc-admins should be able to join firejail containers - https://phabricator.wikimedia.org/T137412#2367865 (10JanZerebecki) 05Open>03Resolved a:03JanZerebecki Done. sc-admins share the namespaces of all services. (Except the tmpfs which the task indicates... [14:18:37] PROBLEM - NTP on mw1063 is CRITICAL: NTP CRITICAL: No response from NTP server [14:18:45] RECOVERY - changeprop endpoints health on scb1001 is OK: All endpoints are healthy [14:19:02] (03PS1) 10KartikMistry: apertium-lex-tools: New upstream release [debs/contenttranslation/apertium-lex-tools] - 10https://gerrit.wikimedia.org/r/293507 (https://phabricator.wikimedia.org/T107306) [14:22:44] (03CR) 10KartikMistry: "Can this be issue like, https://unix.stackexchange.com/questions/167533/what-does-gbperror-upstream-1-5-13-is-not-a-valid-treeish-mean ?" [debs/contenttranslation/giella-core] - 10https://gerrit.wikimedia.org/r/270671 (https://phabricator.wikimedia.org/T120087) (owner: 10KartikMistry) [14:26:46] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:28:45] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 0.019 second response time [14:29:09] (03CR) 10Muehlenhoff: [C: 031] "Looks good to me now." [puppet] - 10https://gerrit.wikimedia.org/r/293501 (https://phabricator.wikimedia.org/T137418) (owner: 10Hashar) [14:33:00] !log stopped / disabled zuul-merger on gallium T137418 [14:33:01] T137418: Remove zuul-merger from gallium - https://phabricator.wikimedia.org/T137418 [14:33:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:33:33] (03PS5) 10Muehlenhoff: contint: remove zuul-merger from gallium [puppet] - 10https://gerrit.wikimedia.org/r/293501 (https://phabricator.wikimedia.org/T137418) (owner: 10Hashar) [14:34:45] RECOVERY - Apache HTTP on mw1266 is OK: HTTP OK: HTTP/1.1 200 OK - 11378 bytes in 0.006 second response time [14:35:07] (03CR) 10Muehlenhoff: [C: 032 V: 032] contint: remove zuul-merger from gallium [puppet] - 10https://gerrit.wikimedia.org/r/293501 (https://phabricator.wikimedia.org/T137418) (owner: 10Hashar) [14:35:25] (03CR) 10Alexandros Kosiaris: [C: 031] Add the analytics contact group to hadoop/kafka related nrpe monitors. 
[puppet] - 10https://gerrit.wikimedia.org/r/293114 (https://phabricator.wikimedia.org/T125128) (owner: 10Elukey) [14:36:35] RECOVERY - Apache HTTP on mw1263 is OK: HTTP OK: HTTP/1.1 200 OK - 11378 bytes in 0.007 second response time [14:37:57] !log Removing zuul-merger from gallium [14:38:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:38:35] !log Tested temp setting retention.bytes=2G for Analytics kafka topic webrequest_misc [14:38:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:39:45] PROBLEM - puppet last run on mw1242 is CRITICAL: CRITICAL: Puppet has 1 failures [14:39:45] RECOVERY - Disk space on mw1266 is OK: DISK OK [14:39:54] RECOVERY - salt-minion processes on mw1266 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [14:40:04] RECOVERY - configured eth on mw1266 is OK: OK - interfaces up [14:40:05] RECOVERY - dhclient process on mw1266 is OK: PROCS OK: 0 processes with command name dhclient [14:40:35] RECOVERY - nutcracker port on mw1266 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [14:40:35] RECOVERY - salt-minion processes on mw1263 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [14:40:38] (03PS4) 10Ottomata: Add ssh:userkey for eventlogging scap::targets [puppet] - 10https://gerrit.wikimedia.org/r/293217 (https://phabricator.wikimedia.org/T137192) (owner: 1020after4) [14:40:44] RECOVERY - Check size of conntrack table on mw1263 is OK: OK: nf_conntrack is 0 % full [14:40:44] RECOVERY - MD RAID on mw1263 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [14:40:44] RECOVERY - nutcracker process on mw1266 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker [14:40:45] RECOVERY - Disk space on mw1063 is OK: DISK OK [14:40:52] 06Operations, 10Ops-Access-Requests, 06Services: sc-admins should be able to join firejail containers - https://phabricator.wikimedia.org/T137412#2367957 (10JanZerebecki) 05Resolved>03Open Sorry I forgot the main part of the request, the other namespaces besides file system. Will upload a patch. [14:40:54] RECOVERY - MD RAID on mw1266 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [14:40:55] RECOVERY - nutcracker process on mw1263 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker [14:40:56] RECOVERY - salt-minion processes on mw1063 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [14:40:56] RECOVERY - Check size of conntrack table on mw1266 is OK: OK: nf_conntrack is 0 % full [14:40:56] RECOVERY - DPKG on mw1263 is OK: All packages OK [14:41:05] RECOVERY - configured eth on mw1063 is OK: OK - interfaces up [14:41:08] those are mine, new app servers! 
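(The retention.bytes test logged above is a per-topic override layered on top of the broker-wide default. A sketch of how such an override is usually set with the stock Kafka tooling; the ZooKeeper address is a placeholder, and on a Kafka 0.9 cluster the older kafka-topics.sh --alter --config form works as well:)

    # add a size-based retention override for one topic (2G = 2147483648 bytes)
    kafka-configs.sh --zookeeper zookeeper.example:2181/kafka --alter \
        --entity-type topics --entity-name webrequest_misc \
        --add-config retention.bytes=2147483648
    # confirm the override took effect
    kafka-configs.sh --zookeeper zookeeper.example:2181/kafka --describe \
        --entity-type topics --entity-name webrequest_misc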
[14:41:24] PROBLEM - NTP on mw1263 is CRITICAL: NTP CRITICAL: Offset unknown [14:41:25] RECOVERY - dhclient process on mw1063 is OK: PROCS OK: 0 processes with command name dhclient [14:41:25] RECOVERY - Disk space on mw1263 is OK: DISK OK [14:41:28] not sure why mw1063 still pops up [14:41:45] RECOVERY - configured eth on mw1263 is OK: OK - interfaces up [14:42:04] RECOVERY - dhclient process on mw1263 is OK: PROCS OK: 0 processes with command name dhclient [14:42:05] RECOVERY - nutcracker port on mw1263 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [14:44:45] RECOVERY - DPKG on mw1063 is OK: All packages OK [14:44:49] (03PS6) 10Gehel: Upgrade osm2pgsql to 0.90.0 [puppet] - 10https://gerrit.wikimedia.org/r/287600 (https://phabricator.wikimedia.org/T112423) [14:45:24] RECOVERY - DPKG on mw1266 is OK: All packages OK [14:45:35] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:46:29] 06Operations, 10Continuous-Integration-Infrastructure (phase-out-gallium), 13Patch-For-Review: Migrate CI services from gallium to contint1001 - https://phabricator.wikimedia.org/T137358#2367970 (10hashar) [14:46:31] 06Operations, 10Continuous-Integration-Infrastructure (phase-out-gallium), 13Patch-For-Review: Remove zuul-merger from gallium - https://phabricator.wikimedia.org/T137418#2367967 (10hashar) 05Open>03Resolved Puppet ran just fine, and the iptables rules looks ok. I have manually cleaned up the host:... [14:47:34] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 9.108 second response time [14:47:46] !log change-prop stopped on scb1002 [14:47:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:49:16] (03CR) 10Ottomata: [C: 032] Add ssh:userkey for eventlogging scap::targets [puppet] - 10https://gerrit.wikimedia.org/r/293217 (https://phabricator.wikimedia.org/T137192) (owner: 1020after4) [14:50:08] 06Operations, 10Ops-Access-Requests, 06Services: sc-admins should be able to join firejail containers - https://phabricator.wikimedia.org/T137412#2367977 (10mobrovac) @JanZerebecki I think there must be a misunderstanding here. This is an access request for `sc[ab][12]00[12]` hosts in production. [14:52:05] PROBLEM - changeprop endpoints health on scb1002 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.16.21, port=7272): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [14:52:52] (03CR) 10Alexandros Kosiaris: [C: 031] Upgrade osm2pgsql to 0.90.0 [puppet] - 10https://gerrit.wikimedia.org/r/287600 (https://phabricator.wikimedia.org/T112423) (owner: 10Gehel) [14:53:01] (03CR) 10Gergő Tisza: [C: 031] Add tags for group1 and group2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293502 (owner: 10Anomie) [14:55:05] RECOVERY - NTP on mw1063 is OK: NTP OK: Offset 0.002803564072 secs [14:55:14] RECOVERY - NTP on mw1263 is OK: NTP OK: Offset 0.0001165866852 secs [14:55:26] thcipriani, gehel, want to use morning SWAT for another scap3 try? 
[14:55:40] (03PS1) 10JanZerebecki: Allow firefail --join=* for sc-admin [puppet] - 10https://gerrit.wikimedia.org/r/293510 (https://phabricator.wikimedia.org/T137412) [14:55:50] yurik: still trying to figure out what could have happened with tilerator :\ [14:56:06] * gehel is available [14:56:45] (03CR) 10jenkins-bot: [V: 04-1] Allow firefail --join=* for sc-admin [puppet] - 10https://gerrit.wikimedia.org/r/293510 (https://phabricator.wikimedia.org/T137412) (owner: 10JanZerebecki) [14:56:52] thcipriani: just let me know if there is something I can do to help, I had a quick look and don't see what was wrong... [14:59:15] PROBLEM - puppet last run on mw1266 is CRITICAL: CRITICAL: Puppet has 2 failures [14:59:15] PROBLEM - puppet last run on mw1263 is CRITICAL: CRITICAL: Puppet has 2 failures [14:59:18] thcipriani, poke akosiaris, i'm sure its all his fault :-P [14:59:35] PROBLEM - puppet last run on mw1063 is CRITICAL: CRITICAL: Puppet has 2 failures [14:59:58] akosiaris, context - for some reason, tilerator switch to scap3 fails to chown its dir [15:00:04] anomie, ostriches, thcipriani, marktraceur, and aude: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160609T1500). [15:00:29] gehel: yeah, my suspicion is it has something to do with the fact that there are multiple services that both use scap::target on the same box, but so far I can't see how that is causing a problem. Also it's true that we have multiple scap::targets on other boxes seemingly without issue. [15:00:36] 06Operations, 10Phabricator, 06Project-Admins, 06Triagers: Requests for addition to the #acl*Project-Admins group (in comments) - https://phabricator.wikimedia.org/T706#2368015 (10Sebastian_Berlin-WMSE) >>! In T706#2364293, @Aklapper wrote: > Now that [[ https://www.mediawiki.org/wiki/Phabricator/Project_m... [15:01:46] nothing for swat [15:02:04] I can add a patch in a few minutes. [15:02:09] a tilerator has moved to scap3 ? great [15:03:03] akosiaris: tilerator is *trying* to move to scap3... but failing at this point [15:03:06] (03CR) 10Alexandros Kosiaris: "It's git tags missing from the repo." [debs/contenttranslation/giella-core] - 10https://gerrit.wikimedia.org/r/270671 (https://phabricator.wikimedia.org/T120087) (owner: 10KartikMistry) [15:03:08] well, kartotherian moved, tilerator had some ownership issues that scap::target should take care of on targets that I'm trying to figure out. [15:04:00] PROBLEM - Apache HTTP on mw1263 is CRITICAL: Connection refused [15:04:28] this is still a new appserver [15:04:30] PROBLEM - Apache HTTP on mw1266 is CRITICAL: Connection refused [15:06:20] RECOVERY - puppet last run on mw1242 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:06:57] (03PS1) 10Dereckson: Add *.nara.gov to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293513 (https://phabricator.wikimedia.org/T137423) [15:07:10] (03CR) 10Gehel: [C: 04-1] "This should be merged at the same time as https://gerrit.wikimedia.org/r/#/c/293475/" [puppet] - 10https://gerrit.wikimedia.org/r/287600 (https://phabricator.wikimedia.org/T112423) (owner: 10Gehel) [15:07:16] aude: ^ here you are if you wish to SWAT something, I can add this patch to the Deployments table for this . 
[15:07:19] morning SWAT [15:07:33] 06Operations, 10Traffic, 13Patch-For-Review: Raise cache frontend memory sizes significantly - https://phabricator.wikimedia.org/T135384#2368049 (10Aklapper) All patches seem to be merged. What are the next steps in this task? [15:09:11] RECOVERY - changeprop endpoints health on scb1002 is OK: All endpoints are healthy [15:09:31] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479 [15:11:15] Dereckson: looking [15:11:31] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5533846 keys - replication_delay is 0 [15:12:39] (03CR) 10Aude: Add *.nara.gov to wgCopyUploadsDomains (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293513 (https://phabricator.wikimedia.org/T137423) (owner: 10Dereckson) [15:13:04] (03PS7) 10Yurik: Upgrade osm2pgsql to 0.90.0 [puppet] - 10https://gerrit.wikimedia.org/r/287600 (https://phabricator.wikimedia.org/T112423) (owner: 10Gehel) [15:13:16] (03PS1) 10JanZerebecki: services firejail: make fs blacklist more obvious [puppet] - 10https://gerrit.wikimedia.org/r/293515 [15:13:42] (03CR) 10Dereckson: "Fixing that." (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293513 (https://phabricator.wikimedia.org/T137423) (owner: 10Dereckson) [15:13:56] (03CR) 10jenkins-bot: [V: 04-1] Upgrade osm2pgsql to 0.90.0 [puppet] - 10https://gerrit.wikimedia.org/r/287600 (https://phabricator.wikimedia.org/T112423) (owner: 10Gehel) [15:14:09] Dereckson: can deploy the patch [15:14:23] (03PS2) 10Dereckson: Add *.nara.gov to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293513 (https://phabricator.wikimedia.org/T137423) [15:14:28] ^ with spaces to align [15:14:34] 06Operations, 10Traffic, 13Patch-For-Review: Raise cache frontend memory sizes significantly - https://phabricator.wikimedia.org/T135384#2368064 (10BBlack) There's still long-term investigation ongoing on the effects of jemalloc tuning and the effect on frontend hitrates (and the latter in conjuction with co... [15:14:46] thanks [15:15:13] Does phabricator.wikimedia.org use nginx with apache or apache only. [15:15:17] Just wondering. [15:15:26] please [15:15:29] (03PS2) 10JanZerebecki: Allow firefail --join=* for sc-admin [puppet] - 10https://gerrit.wikimedia.org/r/293510 (https://phabricator.wikimedia.org/T137412) [15:15:54] (03CR) 10Aude: [C: 032] Add *.nara.gov to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293513 (https://phabricator.wikimedia.org/T137423) (owner: 10Dereckson) [15:16:41] paladox: modules/role/manifests/phabricator/main.pp [15:16:53] Dereckson ok thanks [15:16:58] no trace of nginx [15:17:27] (03Merged) 10jenkins-bot: Add *.nara.gov to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293513 (https://phabricator.wikimedia.org/T137423) (owner: 10Dereckson) [15:18:15] paladox: do you need help to set up Phabricator somewhere with nginx? [15:18:28] Dereckson nope, im testing locally. [15:18:42] And found that the websites using nginx are faster then apache. [15:19:15] !log aude@tin Synchronized wmf-config/InitialiseSettings.php: Add *.nara.gov to wgCopyUploadDomains (duration: 00m 40s) [15:19:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:19:22] done [15:19:32] (03PS1) 10Thcipriani: tilerator to scap3 deployment [puppet] - 10https://gerrit.wikimedia.org/r/293518 [15:19:34] Testing. 
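(For the tilerator ownership problem being debugged above, the first thing worth checking on an affected target is whether the deploy tree ended up owned by the wrong service user once two scap::target resources share a host. Paths and user names below are illustrative guesses, not taken from the actual puppet config:)

    # compare ownership of the co-located deploy trees
    ls -ld /srv/deployment/tilerator/deploy /srv/deployment/kartotherian/deploy
    # if tilerator's tree was claimed by the other service's user, hand it back
    sudo chown -R deploy-service:deploy-service /srv/deployment/tilerator/deploy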
[15:19:37] paladox: https://github.com/nasqueron/docker-phabricator/blob/master/files/etc/nginx/sites-available/default [15:19:40] thanks [15:19:56] Dereckson thanks [15:22:07] Error fetching URL: SSL certificate problem: unable to get local issuer certificate [15:22:15] !log Cleaning git-daemon on gallium (was used by zuul-merger) T137418 [15:22:16] T137418: Remove zuul-merger from gallium - https://phabricator.wikimedia.org/T137418 [15:22:16] :( [15:22:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:23:21] aude: works in HTTP [15:23:26] hmmm [15:24:10] mutante: an US archives site have some certificates issues too, not publishing the full certificates chain. That's not reserved to our bloggers. [15:24:30] aude: that's what happen when only the final certificate is published, not the intermediate between the root CA and this one [15:24:45] browsers download intermediate certificates, curl doesn't [15:25:07] if there is something NARA needs to fix, suppose we can poke dominic and maybe he can poke the right people [15:27:57] (03PS3) 10Gehel: explicitely set input reader format in osm2pgsql [puppet] - 10https://gerrit.wikimedia.org/r/293475 (https://phabricator.wikimedia.org/T112423) [15:30:18] 06Operations: Some labvirt systems use qemu from "cloud archive" - https://phabricator.wikimedia.org/T127113#2368115 (10Andrew) 05Open>03Resolved I believe we've now updated everything to use the latest version of qemu, a modern version from the cloud archive. [15:31:32] aude: Yes there is something. I've checked on https://www.ssllabs.com/ssltest/analyze.html?d=clinton4.nara.gov&s=2620%3a0%3a2b0%3a10f1%3a0%3a0%3a0%3a109 this is the same issue than explained at https://phabricator.wikimedia.org/P3001#13606 [15:31:43] !log added topic override retention.bytes=536870912000 to Kafka webrequest_text (T136690) [15:31:44] T136690: Kafka 0.9's partitions rebalance causes data log mtime reset messing up with time based log retention - https://phabricator.wikimedia.org/T136690 [15:31:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:33:03] aude: They need to concatenate Entrust Certification Authority - L1K + their certificate. [15:35:32] I'm reporting that on the task, and warn they should currently use http:// for server side upload. [15:35:39] Thanks for the deploy aude. [15:36:19] (03CR) 10Florianschmidtwelzow: [C: 031] Enable AuthManager on group2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293491 (https://phabricator.wikimedia.org/T135504) (owner: 10Gergő Tisza) [15:38:47] (03PS2) 10Thcipriani: WIP: tilerator to scap3 deployment [puppet] - 10https://gerrit.wikimedia.org/r/293518 [15:42:27] (03PS3) 10Thcipriani: WIP: tilerator to scap3 deployment [puppet] - 10https://gerrit.wikimedia.org/r/293518 [15:46:22] 06Operations, 10Continuous-Integration-Infrastructure (phase-out-gallium), 13Patch-For-Review: Migrate CI services from gallium to contint1001 - https://phabricator.wikimedia.org/T137358#2365971 (10hashar) [15:46:26] 06Operations, 06Labs, 10Labs-Infrastructure, 06Release-Engineering-Team, and 3 others: Firewall rules for labs support host to communicate with contint1001.eqiad.wmnet (new gallium) - https://phabricator.wikimedia.org/T137323#2368164 (10hashar) 05Open>03stalled From a quick chat with @mark we dont want... 
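(A quick way to confirm the incomplete-chain diagnosis above from the command line, and what the server-side fix looks like; file names are placeholders:)

    # show exactly which certificates the server sends; an incomplete chain lists
    # only the leaf, and verification fails with the same "local issuer" error curl gives
    openssl s_client -connect clinton4.nara.gov:443 -servername clinton4.nara.gov -showcerts </dev/null
    # fix on the server: publish the leaf followed by the Entrust L1K intermediate
    cat clinton4.nara.gov.crt entrust-l1k-intermediate.crt > clinton4.nara.gov.chained.crt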
[15:49:55] (03PS4) 10Thcipriani: WIP: tilerator to scap3 deployment [puppet] - 10https://gerrit.wikimedia.org/r/293518 [15:52:49] 06Operations, 10Traffic, 13Patch-For-Review: Scripts depending on varnishlog.py maxing out CPU usage on cache_misc - https://phabricator.wikimedia.org/T137114#2368210 (10ema) It looks like the culprit might be lack of grouping at the varnishlog level. The following experiment is currently ongoing on cp1061:... [15:53:52] 06Operations, 10Continuous-Integration-Infrastructure (phase-out-gallium), 13Patch-For-Review: Migrate CI services from gallium to contint1001 - https://phabricator.wikimedia.org/T137358#2368213 (10hashar) Following {T137323}, @mark stated that there should be no traffic between the private network and labs... [15:57:51] (03Abandoned) 10Ema: varnishapi.py: reset error message [puppet] - 10https://gerrit.wikimedia.org/r/293132 (owner: 10Ema) [15:58:11] RECOVERY - Apache HTTP on mw1263 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.063 second response time [15:58:23] (03PS3) 10Elukey: Limit the maximum Kafka topic partition size to 500GB. [puppet] - 10https://gerrit.wikimedia.org/r/293270 (https://phabricator.wikimedia.org/T136690) [15:59:22] (03CR) 10Ottomata: [C: 031] Limit the maximum Kafka topic partition size to 500GB. [puppet] - 10https://gerrit.wikimedia.org/r/293270 (https://phabricator.wikimedia.org/T136690) (owner: 10Elukey) [16:00:04] godog and moritzm: Dear anthropoid, the time has come. Please deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160609T1600). [16:00:04] mobrovac: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [16:00:07] (03PS2) 10Ema: varnishlog4.py: log errors in execute() [puppet] - 10https://gerrit.wikimedia.org/r/293123 (https://phabricator.wikimedia.org/T137114) [16:00:31] 06Operations, 06Discovery, 10Traffic, 10Wikidata, and 2 others: Tune WDQS caching headers - https://phabricator.wikimedia.org/T137238#2368243 (10Gehel) We shoudl be able to also process at least: * vendor/jquery.uls/css/jquery.uls.css * logo.svg They are unlikely to change frequently but still it would b... [16:03:35] 06Operations, 10ops-eqiad, 10Analytics-Cluster, 06Analytics-Kanban: analytics1049.eqiad.wmnet disk failure - https://phabricator.wikimedia.org/T137273#2368252 (10Milimetric) a:03elukey [16:03:41] 06Operations, 10ops-eqiad, 10Analytics-Cluster, 06Analytics-Kanban: Smartctl disk defects on kafka1012 - https://phabricator.wikimedia.org/T136933#2368253 (10Milimetric) a:03elukey [16:04:31] anybody doing puppetswat? [16:04:39] what do you need? [16:05:11] https://gerrit.wikimedia.org/r/#/c/292573/ ? [16:05:21] a noop - https://gerrit.wikimedia.org/r/#/c/292573/ [16:05:22] yes [16:05:52] (03PS2) 10Faidon Liambotis: Change Prop: Use the URIs for MW and RB from service::configuration [puppet] - 10https://gerrit.wikimedia.org/r/292573 (owner: 10Mobrovac) [16:05:59] (03CR) 10Faidon Liambotis: [C: 032 V: 032] Change Prop: Use the URIs for MW and RB from service::configuration [puppet] - 10https://gerrit.wikimedia.org/r/292573 (owner: 10Mobrovac) [16:07:30] RECOVERY - puppet last run on mw1266 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [16:08:37] thnx paravoid! [16:09:01] RECOVERY - Apache HTTP on mw1266 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.196 second response time [16:17:31] (03PS4) 10Elukey: Limit the maximum Kafka topic partition size to 500GB. 
[puppet] - 10https://gerrit.wikimedia.org/r/293270 (https://phabricator.wikimedia.org/T136690) [16:18:41] PROBLEM - changeprop endpoints health on scb1001 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.0.16, port=7272): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [16:20:01] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 2 failures [16:20:10] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: puppet fail [16:22:11] PROBLEM - changeprop endpoints health on scb1002 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.16.21, port=7272): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [16:23:10] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:23:11] barium ^^^ looking [16:24:11] RECOVERY - changeprop endpoints health on scb1002 is OK: All endpoints are healthy [16:25:10] RECOVERY - check_puppetrun on barium is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [16:25:11] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 9.792 second response time [16:27:40] (03CR) 10Elukey: [C: 032] Limit the maximum Kafka topic partition size to 500GB. [puppet] - 10https://gerrit.wikimedia.org/r/293270 (https://phabricator.wikimedia.org/T136690) (owner: 10Elukey) [16:32:31] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Minor comment inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/293521 (https://phabricator.wikimedia.org/T137433) (owner: 10Ladsgroup) [16:35:38] (03PS2) 10Ladsgroup: ores: Move CORS to uwsgi from nginx [puppet] - 10https://gerrit.wikimedia.org/r/293521 (https://phabricator.wikimedia.org/T137433) [16:36:36] (03CR) 10BBlack: [C: 031] varnishlog4.py: log errors in execute() [puppet] - 10https://gerrit.wikimedia.org/r/293123 (https://phabricator.wikimedia.org/T137114) (owner: 10Ema) [16:37:20] RECOVERY - changeprop endpoints health on scb1001 is OK: All endpoints are healthy [16:37:34] (03PS5) 10Elukey: Add the analytics contact group to hadoop/kafka related nrpe monitors. [puppet] - 10https://gerrit.wikimedia.org/r/293114 (https://phabricator.wikimedia.org/T125128) [16:37:40] 06Operations, 06Discovery, 10Traffic, 10Wikidata, and 2 others: Tune WDQS caching headers - https://phabricator.wikimedia.org/T137238#2368340 (10Jonas) >>! In T137238#2368243, @Gehel wrote: > We shoudl be able to also process at least: > * vendor/jquery.uls/css/jquery.uls.css This is fixed. > * main pag... [16:37:52] (03PS3) 10Alexandros Kosiaris: ores: Move CORS to uwsgi from nginx [puppet] - 10https://gerrit.wikimedia.org/r/293521 (https://phabricator.wikimedia.org/T137433) (owner: 10Ladsgroup) [16:37:58] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] ores: Move CORS to uwsgi from nginx [puppet] - 10https://gerrit.wikimedia.org/r/293521 (https://phabricator.wikimedia.org/T137433) (owner: 10Ladsgroup) [16:39:14] (03PS6) 10Elukey: Add the analytics contact group to hadoop/kafka related nrpe monitors. [puppet] - 10https://gerrit.wikimedia.org/r/293114 (https://phabricator.wikimedia.org/T125128) [16:44:15] 06Operations, 10ops-eqiad, 10media-storage: rack/setup/deploy ms-be102[2-7] - https://phabricator.wikimedia.org/T136631#2368393 (10RobH) [16:44:52] (03CR) 10Elukey: [C: 032] Add the analytics contact group to hadoop/kafka related nrpe monitors. 
[puppet] - 10https://gerrit.wikimedia.org/r/293114 (https://phabricator.wikimedia.org/T125128) (owner: 10Elukey) [16:45:37] (03PS2) 10Hoo man: Enable ArticlePlaceholder extension in guwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/291589 (https://phabricator.wikimedia.org/T136517) (owner: 10KartikMistry) [16:45:54] 06Operations, 10procurement: rack/setup/deploy ms-be202[2-7] - https://phabricator.wikimedia.org/T137439#2368426 (10RobH) [16:46:25] 06Operations, 10ops-codfw, 10media-storage: rack/setup/deploy ms-be202[2-7] - https://phabricator.wikimedia.org/T137439#2368431 (10RobH) [16:46:51] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:47:51] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.003 second response time [16:50:10] (03PS1) 10Hoo man: Enable the ArticlePlaceholder on lvwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293525 (https://phabricator.wikimedia.org/T136100) [16:50:12] (03PS1) 10Hoo man: Enable the ArticlePlaceholder on nnwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293526 (https://phabricator.wikimedia.org/T130997) [16:52:10] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 7.903 second response time [16:52:42] 06Operations, 10ops-codfw, 10media-storage: rack/setup/deploy ms-be202[2-7] - https://phabricator.wikimedia.org/T137439#2368449 (10RobH) a:05RobH>03fgiunchedi So we'll need @fgiunchedi to offer input on this, as it is nearly identical to T136631. With the 6 new swift backends, we need to know if they wi... [16:53:13] 06Operations, 10ops-eqiad, 10media-storage: rack/setup/deploy ms-be102[2-7] - https://phabricator.wikimedia.org/T136631#2341913 (10RobH) a:03fgiunchedi Filippo is back shortly, so assigning to him for input. Please provide feedback and assign back to me for followup, thank you! [16:53:16] (03PS1) 10Elukey: Add new appservers to the related DSH list. [puppet] - 10https://gerrit.wikimedia.org/r/293527 [16:58:31] 06Operations, 06Discovery, 10Traffic, 10Wikidata, and 2 others: Tune WDQS caching headers - https://phabricator.wikimedia.org/T137238#2368472 (10Smalyshev) > * main page (index.html) -> no cache (or very short) I would cache it for the same as below. It doesn't change that much - pretty much once a week n... [16:58:53] (03PS2) 10Catrope: Enable Flow beta feature on frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292198 (https://phabricator.wikimedia.org/T136684) [16:59:10] (03PS2) 10Catrope: Remove HiddenPrefs hack for turning off cross-wiki notifications [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288702 (https://phabricator.wikimedia.org/T135266) [16:59:44] (03PS5) 10Thcipriani: Deploy Tilerator with Scap3 [puppet] - 10https://gerrit.wikimedia.org/r/293518 [17:00:04] yurik, gwicke, cscott, arlolra, and subbu: Respected human, time to deploy Services – Graphoid / Parsoid / OCG / Citoid (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160609T1700). Please do the needful. 
[17:00:13] i'll skip [17:00:58] (03CR) 10Alexandros Kosiaris: [C: 032] lttoolbox: New upstream version [debs/contenttranslation/lttoolbox] - 10https://gerrit.wikimedia.org/r/293484 (https://phabricator.wikimedia.org/T124137) (owner: 10KartikMistry) [17:03:20] (03PS6) 10Thcipriani: Deploy Tilerator with Scap3 [puppet] - 10https://gerrit.wikimedia.org/r/293518 [17:04:40] (03PS1) 10Ema: varnishlog4: default to request grouping [puppet] - 10https://gerrit.wikimedia.org/r/293530 (https://phabricator.wikimedia.org/T137114) [17:04:41] PROBLEM - puppet last run on mw1183 is CRITICAL: CRITICAL: Puppet has 1 failures [17:06:01] (03CR) 10Thcipriani: "Puppet compiler is finally happy and aware that there is a change: https://puppet-compiler.wmflabs.org/3078/" [puppet] - 10https://gerrit.wikimedia.org/r/293518 (owner: 10Thcipriani) [17:10:22] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:12:12] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 0.021 second response time [17:14:41] PROBLEM - puppet last run on kafka1012 is CRITICAL: CRITICAL: Puppet last ran 1 day ago [17:15:11] --^ just re-enabled it [17:16:42] RECOVERY - puppet last run on kafka1012 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [17:16:46] good [17:18:12] (03CR) 10RobH: [C: 031] Add new appservers to the related DSH list. [puppet] - 10https://gerrit.wikimedia.org/r/293527 (owner: 10Elukey) [17:18:32] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:19:38] (03CR) 10Elukey: [C: 032] Add new appservers to the related DSH list. [puppet] - 10https://gerrit.wikimedia.org/r/293527 (owner: 10Elukey) [17:19:50] !log change-prop deploying ecfda93f09d [17:19:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:20:02] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 2 failures [17:20:22] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 0.014 second response time [17:29:35] (03CR) 10Alexandros Kosiaris: [C: 031] Deploy Tilerator with Scap3 [puppet] - 10https://gerrit.wikimedia.org/r/293518 (owner: 10Thcipriani) [17:31:12] RECOVERY - puppet last run on mw1183 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [17:32:20] (03CR) 10Mobrovac: [C: 04-1] Deploy Tilerator with Scap3 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/293518 (owner: 10Thcipriani) [17:39:01] (03CR) 10Alexandros Kosiaris: [C: 032] cg3: New upstream release [debs/contenttranslation/cg3] - 10https://gerrit.wikimedia.org/r/293485 (https://phabricator.wikimedia.org/T107306) (owner: 10KartikMistry) [17:41:31] PROBLEM - Start and verify pages via webservices on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - 187 bytes in 10.920 second response time [17:41:43] (03PS2) 10Muehlenhoff: Define backup for contint [puppet] - 10https://gerrit.wikimedia.org/r/293294 (https://phabricator.wikimedia.org/T80385) [17:41:49] (03PS7) 10Thcipriani: Deploy Tilerator with Scap3 [puppet] - 10https://gerrit.wikimedia.org/r/293518 [17:43:02] !log Restarting Cassandra on xenon.eqiad.wmnet (use exponentially decaying resevoirs for metrics histograms) : T126629 [17:43:02] T126629: Cassandra 2.1.13 and/or 2.2.6 - https://phabricator.wikimedia.org/T126629 [17:43:05] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:43:41] RECOVERY - Start and verify pages via webservices on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 24.910 second response time [17:46:51] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [17:58:52] RECOVERY - mediawiki-installation DSH group on mw1264 is OK: OK [18:00:04] hoo and frimelle: Dear anthropoid, the time has come. Please deploy ArticlePlaceholder (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160609T1800). [18:01:11] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:02:19] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 8.535 second response time [18:08:28] PROBLEM - Disk space on ms-be2012 is CRITICAL: DISK CRITICAL - free space: / 2124 MB (3% inode=96%) [18:09:39] RECOVERY - mediawiki-installation DSH group on mw1263 is OK: OK [18:11:39] RECOVERY - mediawiki-installation DSH group on mw1266 is OK: OK [18:14:58] PROBLEM - Start and verify pages via webservices on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - 187 bytes in 10.762 second response time [18:16:59] RECOVERY - Start and verify pages via webservices on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 7.536 second response time [18:19:08] PROBLEM - test icmp reachability to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 82 probes of 394 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [18:22:29] (03Abandoned) 10EBernhardson: Make three of the newer ES nodes master eligable [puppet] - 10https://gerrit.wikimedia.org/r/251024 (https://phabricator.wikimedia.org/T112556) (owner: 10EBernhardson) [18:22:46] (03Abandoned) 10EBernhardson: Remove old ES nodes from master capable list [puppet] - 10https://gerrit.wikimedia.org/r/251025 (https://phabricator.wikimedia.org/T112556) (owner: 10EBernhardson) [18:23:11] (03PS1) 10MaxSem: Move everything Postgres-related out of role::maps::server [puppet] - 10https://gerrit.wikimedia.org/r/293540 [18:23:17] gehel, ^ [18:23:43] (03CR) 10EBernhardson: "ping for merge?" [puppet] - 10https://gerrit.wikimedia.org/r/283604 (owner: 10EBernhardson) [18:23:48] (03PS4) 10EBernhardson: Make mwrepl a little more user friendly [puppet] - 10https://gerrit.wikimedia.org/r/283604 [18:28:00] (03CR) 10Mobrovac: [C: 031] Deploy Tilerator with Scap3 [puppet] - 10https://gerrit.wikimedia.org/r/293518 (owner: 10Thcipriani) [18:29:18] MaxSem: thanks! Busy right now, I'll have a look as soon as I can... [18:33:18] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:33:57] 06Operations, 10CirrusSearch, 06Discovery, 06Discovery-Search-Backlog, 13Patch-For-Review: Only use newer (elastic10{16..31}) servers as master capable elasticsearch nodes - https://phabricator.wikimedia.org/T112556#2368912 (10EBernhardson) 05Open>03declined These nodes are being removed from the clu... 
[18:34:08] PROBLEM - Start and verify pages via webservices on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - 187 bytes in 10.811 second response time [18:34:28] (03CR) 10Jforrester: "\o/" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293491 (https://phabricator.wikimedia.org/T135504) (owner: 10Gergő Tisza) [18:34:28] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 0.016 second response time [18:36:09] RECOVERY - Start and verify pages via webservices on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 15.324 second response time [18:55:05] I just spent an hour waiting for various Jenkins jobs… fun way to spend time :P [18:56:12] hoo jenkins seems to be taking along time. Since gallium failed yesturday. [18:57:10] Don't think it's slower than usual [18:58:32] ly [18:59:22] * bawolff thinks its kind of a bit slow [18:59:51] hm… maybe [19:00:05] thcipriani: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160609T1900). [19:00:39] … and there goes my deployment slot [19:00:42] holding train for: https://phabricator.wikimedia.org/T137404 [19:01:39] thcipriani: Do you think/ not think that's Wikidata related? [19:01:59] … also, can I continue with my deploy now? [19:02:49] hoo: unsure about the cause. Feel free to continue your deploy, looks like there's some work to be done before the train rolls yet. [19:03:21] Probably going to look into the bug after [19:03:34] quite likely that is wikidata related [19:03:54] (we mess with interwiki links on all levels, including in the Skin) [19:05:48] thanks :) [19:19:11] o_O [19:25:17] RECOVERY - test icmp reachability to codfw on ripe-atlas-codfw is OK: OK - failed 7 probes of 394 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [19:26:49] (03CR) 10Kaldari: [C: 031] Revert "Revert "Set Tamil projects to use uca-ta collation"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290686 (https://phabricator.wikimedia.org/T75453) (owner: 10Dereckson) [19:29:50] (03CR) 10Dereckson: "Dependent change has been merged." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290686 (https://phabricator.wikimedia.org/T75453) (owner: 10Dereckson) [19:34:59] (03Abandoned) 10Dzahn: contint: limit access to zuul-merger git daemon [puppet] - 10https://gerrit.wikimedia.org/r/293449 (https://phabricator.wikimedia.org/T137323) (owner: 10Dzahn) [19:40:57] thcipriani: looks like the mw wiki trains is on hold. Mind if i do a quick mobileapps deploy now? [19:41:37] bearND: it is on hold. mobileapps deploy should be fine. [19:41:52] thcipriani: thanks [19:42:05] !log starting mobileapps deploy [19:42:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:44:18] !log mobileapps deployed 71ff97c [19:44:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:44:38] thcipriani: done. Thanks! //cc: mdholloway mobrovac [19:44:59] bearND: ack. Thanks. 
[19:47:54] !log hoo@tin Synchronized php-1.28.0-wmf.5/extensions/Wikidata: Update ArticlePlaceholder (duration: 02m 04s) [19:47:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:51:25] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 1 failures [19:53:26] https://phabricator.wikimedia.org/T116404 bit us [19:53:31] !log hoo@tin Synchronized php-1.28.0-wmf.5/extensions/Wikidata: revert, possible s5 master overload (duration: 01m 57s) [19:53:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:53:41] Forget the s5 master thing [19:53:42] https://phabricator.wikimedia.org/T116404 [19:53:47] crap :( [19:55:35] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [19:55:36] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [19:58:14] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [19:58:19] Should not be related to my changes (anymore) [19:58:22] reverted them [19:59:15] !log Restarting Cassandra on xenon.eqiad.wmnet (removing patched test build; restoring state) : T137474 [19:59:16] T137474: Investigate lack of recency bias in Cassandra histogram metrics - https://phabricator.wikimedia.org/T137474 [19:59:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:00:24] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [20:02:05] 06Operations, 10DBA, 10Wikidata, 07Performance: EntityUsageTable::getUsedEntityIdStrings query on wbc_entity_usage table is sometimes fast, sometimes slow - https://phabricator.wikimedia.org/T116404#2369349 (10hoo) p:05Low>03High We just hit this hard when we changed the query traffic patterns towards... [20:02:11] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [20:03:10] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [20:05:57] 06Operations, 10Ops-Access-Requests, 10LDAP-Access-Requests, 07LDAP: Grant graphite.wikimedia.org rights to grafana-admin LDAP group - https://phabricator.wikimedia.org/T137373#2369353 (10Dzahn) @JanZerebecki is right. Ladsgroup already has existing production access / deployer and i see he signed L2 in th... [20:06:30] 06Operations, 10Ops-Access-Requests, 10LDAP-Access-Requests, 07LDAP: Grant graphite.wikimedia.org rights to grafana-admin LDAP group - https://phabricator.wikimedia.org/T137373#2369354 (10Dzahn) @Ladsgroup you should now be able to login on graphite (and icinga). 
[20:07:06] 06Operations, 10Ops-Access-Requests, 10LDAP-Access-Requests, 07LDAP: add Ladsgroup to nda LDAP group (was: Grant graphite.wikimedia.org rights to grafana-admin LDAP group) - https://phabricator.wikimedia.org/T137373#2369355 (10Dzahn) [20:07:16] 06Operations, 10Ops-Access-Requests, 10LDAP-Access-Requests, 07LDAP: add Ladsgroup to nda LDAP group (was: Grant graphite.wikimedia.org rights to grafana-admin LDAP group) - https://phabricator.wikimedia.org/T137373#2366461 (10Dzahn) 05Open>03Resolved a:03Dzahn [20:11:15] @seen Ladsgroup [20:11:16] mutante: I have never seen Ladsgroup [20:11:53] 06Operations, 10DBA, 10Wikidata, 07Performance: EntityUsageTable::getUsedEntityIdStrings query on wbc_entity_usage table is sometimes fast, sometimes slow - https://phabricator.wikimedia.org/T116404#1748737 (10hoo) [20:16:21] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [20:16:37] (03PS1) 10Dzahn: add lint:ignore's for remaining files outside modules [puppet] - 10https://gerrit.wikimedia.org/r/293569 [20:16:50] mutante: it's Amir1 in IRC if that's helpful :) [20:17:26] mutante: Amir1 is on a flight / gtting to an airport I think [20:17:30] thcipriani: thanks, yes it is [20:17:50] alright, just wanted to let him know about graphite access [20:20:56] 06Operations, 10DBA, 10Wikidata, 07Performance: EntityUsageTable::getUsedEntityIdStrings query on wbc_entity_usage table is sometimes fast, sometimes slow - https://phabricator.wikimedia.org/T116404#2369385 (10hoo) db1070 vs. db1068 (different database, cold queries, the fact that the result rows match is... [20:25:17] (03CR) 10Gehel: Move everything Postgres-related out of role::maps::server (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/293540 (owner: 10MaxSem) [20:29:12] (03CR) 10Gehel: "This makes a lot of sense and adds clarity! Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/293540 (owner: 10MaxSem) [20:29:41] (03CR) 1020after4: [C: 031] "I suggested some more rewrite rules on T127224 but this is a good start." 
[puppet] - 10https://gerrit.wikimedia.org/r/293221 (https://phabricator.wikimedia.org/T137224) (owner: 10Dzahn) [20:32:57] (03PS5) 10Paladox: git.wikimedia.org -> Diffusion redirects [puppet] - 10https://gerrit.wikimedia.org/r/293221 (https://phabricator.wikimedia.org/T137224) (owner: 10Dzahn) [20:33:03] !log aaron@tin Synchronized php-1.28.0-wmf.5/includes/user/User.php: c3b1f80a701d61dc57ccac0c8b1dc7daf03fa925 (duration: 00m 29s) [20:33:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:34:31] (03PS2) 10MaxSem: Move everything Postgres-related out of role::maps::server [puppet] - 10https://gerrit.wikimedia.org/r/293540 [20:36:04] !log hoo@tin Synchronized php-1.28.0-wmf.5/extensions/Wikidata: Update ArticlePlaceholder (without unrelated T136598 fixes this time) (duration: 01m 51s) [20:36:05] T136598: Wikidata master database connection issue - https://phabricator.wikimedia.org/T136598 [20:36:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:36:41] Looks good [20:36:48] no obvious fallout this time [20:40:20] !log hoo@tin Synchronized php-1.28.0-wmf.4/extensions/Wikidata: Update ArticlePlaceholder (duration: 01m 54s) [20:40:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:42:19] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:44:10] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 5.410 second response time [20:45:48] (03CR) 10Hoo man: [C: 032] Enable ArticlePlaceholder extension in guwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/291589 (https://phabricator.wikimedia.org/T136517) (owner: 10KartikMistry) [20:46:22] (03Merged) 10jenkins-bot: Enable ArticlePlaceholder extension in guwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/291589 (https://phabricator.wikimedia.org/T136517) (owner: 10KartikMistry) [20:48:28] !log hoo@tin Synchronized wmf-config/InitialiseSettings.php: Enable the ArticlePlaceholder on guwiki (T136517) (duration: 00m 24s) [20:48:29] T136517: Enable ArticlePlaceholder extension in guwiki - https://phabricator.wikimedia.org/T136517 [20:48:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:49:25] Looks good on guwiki [20:50:49] (03CR) 10Hoo man: [C: 032] Enable the ArticlePlaceholder on lvwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293525 (https://phabricator.wikimedia.org/T136100) (owner: 10Hoo man) [20:50:59] (03CR) 10Smalyshev: [C: 031] Don't publish etags for WDQS [puppet] - 10https://gerrit.wikimedia.org/r/293492 (https://phabricator.wikimedia.org/T137238) (owner: 10Gehel) [20:51:02] (03CR) 10BBlack: [C: 031] varnishlog4: default to request grouping [puppet] - 10https://gerrit.wikimedia.org/r/293530 (https://phabricator.wikimedia.org/T137114) (owner: 10Ema) [20:51:22] (03Merged) 10jenkins-bot: Enable the ArticlePlaceholder on lvwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293525 (https://phabricator.wikimedia.org/T136100) (owner: 10Hoo man) [20:51:33] (03CR) 10BBlack: [C: 031] Don't publish etags for WDQS [puppet] - 10https://gerrit.wikimedia.org/r/293492 (https://phabricator.wikimedia.org/T137238) (owner: 10Gehel) [20:53:48] !log hoo@tin Synchronized wmf-config/InitialiseSettings.php: Enable the ArticlePlaceholder on lvwiki (T136100) (duration: 00m 26s) [20:53:49] T136100: Enable ArticlePlaceholder on lvwiki - https://phabricator.wikimedia.org/T136100 
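(The guwiki ArticlePlaceholder switch above is the standard config-only deploy: merge the mediawiki-config change in Gerrit, pull it onto the deployment host, and sync the single changed file. A sketch, assuming the usual staging path and scap subcommand of that period:)

    cd /srv/mediawiki-staging          # checkout of operations/mediawiki-config
    git pull
    scap sync-file wmf-config/InitialiseSettings.php 'Enable the ArticlePlaceholder on guwiki (T136517)'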
[20:53:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:55:49] (03CR) 10Hoo man: [C: 032] Enable the ArticlePlaceholder on nnwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293526 (https://phabricator.wikimedia.org/T130997) (owner: 10Hoo man) [20:56:24] (03Merged) 10jenkins-bot: Enable the ArticlePlaceholder on nnwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293526 (https://phabricator.wikimedia.org/T130997) (owner: 10Hoo man) [20:57:11] 06Operations, 10Continuous-Integration-Infrastructure (phase-out-gallium), 13Patch-For-Review: Migrate CI services from gallium to contint1001 - https://phabricator.wikimedia.org/T137358#2365971 (10demon) 05Open>03declined Per what I said in T133300#2369730. [20:57:13] 06Operations, 10Traffic, 10Continuous-Integration-Infrastructure (phase-out-gallium), 13Patch-For-Review: Move gallium to an internal host? - https://phabricator.wikimedia.org/T133150#2369740 (10demon) [20:57:17] !log hoo@tin Synchronized wmf-config/InitialiseSettings.php: Enable the ArticlePlaceholder on nnwiki (T130997) (duration: 00m 24s) [20:57:18] T130997: [Task] Configure ArticlePlaceholder for nnwiki - https://phabricator.wikimedia.org/T130997 [20:57:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:58:09] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.001 second response time [20:59:00] Ok, all verified ok now [20:59:14] Think I'm done with ArticlePlaceholder deploys for today [20:59:20] not even 2 hours late [21:00:10] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 0.029 second response time [21:19:31] (03PS1) 10Andrew Bogott: Makedomain: append a '.' on the requested domain if needed. [puppet] - 10https://gerrit.wikimedia.org/r/293622 [21:20:39] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:21:01] (03CR) 10BryanDavis: [C: 031] Makedomain: append a '.' on the requested domain if needed. [puppet] - 10https://gerrit.wikimedia.org/r/293622 (owner: 10Andrew Bogott) [21:22:45] (03CR) 10Andrew Bogott: [C: 032] Makedomain: append a '.' on the requested domain if needed. [puppet] - 10https://gerrit.wikimedia.org/r/293622 (owner: 10Andrew Bogott) [21:24:39] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 8.149 second response time [21:29:38] 06Operations, 10DBA, 06Labs: disk failure on labsdb1002 - https://phabricator.wikimedia.org/T126946#2027631 (10Blahma) Just noticed this bug just before opening a new bug report on cswiki_p missing virtually all revisions and categorylinks from between 2016-03-08 18:00 and 21:00 UTC, which spoils the results... [21:29:43] 06Operations, 10Continuous-Integration-Infrastructure (phase-out-gallium), 13Patch-For-Review: Migrate CI services from gallium to contint1001 - https://phabricator.wikimedia.org/T137358#2369922 (10hashar) From talk we had, contint1001 was setup in emergency since gallium could have been unrecoverable. Turn... 
[21:40:49] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.336 second response time [21:41:05] !log aaron@tin Synchronized php-1.28.0-wmf.5/includes: 904dd4ae088a8f67942c09b2b28178377955d6a6 (duration: 01m 18s) [21:41:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:41:29] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [21:44:49] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 0.011 second response time [21:45:44] (03Abandoned) 10Hashar: cache_misc: change doc/integration.wm.o backend [puppet] - 10https://gerrit.wikimedia.org/r/293284 (https://phabricator.wikimedia.org/T137358) (owner: 10Hashar) [21:54:38] (03CR) 10Catrope: Enable Flow beta feature on frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292198 (https://phabricator.wikimedia.org/T136684) (owner: 10Catrope) [21:59:54] (03CR) 10Thcipriani: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/292505 (https://phabricator.wikimedia.org/T110068) (owner: 10GWicke) [22:00:04] tgr: Respected human, time to deploy AuthManager (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160609T2200). Please do the needful. [22:01:13] (03CR) 10jenkins-bot: [V: 04-1] Logstash_checker script for canary deploys [puppet] - 10https://gerrit.wikimedia.org/r/292505 (https://phabricator.wikimedia.org/T110068) (owner: 10GWicke) [22:01:53] (03CR) 10Thcipriani: [C: 031] "Seems to work well, should be able to integrate this in deployments without issue." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/292505 (https://phabricator.wikimedia.org/T110068) (owner: 10GWicke) [22:02:15] tgr: train is still blocked, FYI. [22:02:43] thcipriani: any guess when it will end? [22:02:44] thcipriani: https://gerrit.wikimedia.org/r/#/c/293627/ (if hoo is not around) [22:02:58] aude: ooh, hadn't seen that yet. [22:02:59] Looking already [22:03:10] just figured it out and has nothign to do with content translation [22:03:26] i don't know if caches will be broken for an hour [22:03:45] or if we can force expire sites cache [22:05:14] aude: Ok, now it's obvious why it's broken [22:05:44] +2ed [22:06:03] anyone doing SWAT deploys? [22:06:03] :/ [22:06:07] thanks [22:06:18] i was wasting time looking at content translation [22:08:44] kaldari: not at the moment, will likely be deploying https://gerrit.wikimedia.org/r/#/c/293627/ backport and rolling forward wikiversions here in...however long jenkins takes. [22:08:51] 15 minutes? [22:09:30] thcipriani: https://phabricator.wikimedia.org/T133911 [22:09:40] would speed up jenkins ^^ [22:11:27] aude: could you double check me on the backport? https://gerrit.wikimedia.org/r/#/c/293631/ [22:13:31] also need to figure out how to sync it without breaking anything :P [22:14:16] looking [22:14:52] sync-dir should be ok [22:15:23] or sync the DBSiteStore first [22:15:30] errr [22:15:33] otherway [22:15:58] sync the ServiceWiring so it's not calling the method in DBSiteStore [22:16:15] then DBsiteStore [22:18:12] kk [22:18:31] we don't use the file site store in production [22:19:37] gotcha, ok, that seems simple enough, thanks! [22:19:53] if the backport looks good, I'll go ahead and +2, fetch down, and roll forward. [22:20:02] ok [22:20:07] well, fetch down and sync then roll forward. 
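(aude's ordering advice above is the key point: the revert touches both a caller and a callee, so the caller has to go out first to avoid a window in which some appserver runs a ServiceWiring.php that still calls a method no longer present in DBSiteStore.php. A sketch, assuming the usual fetch-then-sync flow on the deployment host; the git steps for pulling the 293631 backport into the branch checkout are glossed over:)

    cd /srv/mediawiki-staging
    # Part I: first stop ServiceWiring.php from referencing the reverted code
    scap sync-file php-1.28.0-wmf.5/includes/ServiceWiring.php 'Revert "Map dummy language codes in sites" Part I'
    # Part II: now it is safe to sync the callee
    scap sync-file php-1.28.0-wmf.5/includes/site/DBSiteStore.php 'Revert "Map dummy language codes in sites" Part II'

This is the two-step sequence thcipriani runs at 22:35 below.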
[22:20:21] we may have a fix for ARticlePlaceholder after [22:20:27] but not a blocker [22:20:56] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 1 failures [22:20:58] k, sounds good. [22:26:04] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:28:04] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 7.749 second response time [22:35:01] !log thcipriani@tin Synchronized php-1.28.0-wmf.5/includes/ServiceWiring.php: [[gerrit:293631|Revert "Map dummy language codes in sites"]] Part I (duration: 00m 23s) [22:35:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:35:38] !log thcipriani@tin Synchronized php-1.28.0-wmf.5/includes/site/DBSiteStore.php: [[gerrit:293631|Revert "Map dummy language codes in sites"]] Part II (duration: 00m 31s) [22:35:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:36:09] 06Operations, 06Performance-Team, 06Services, 07Availability: Consider restbase/cassandra for session storage (with SSL) - https://phabricator.wikimedia.org/T134811#2370126 (10aaron) [22:36:24] ^ aude sync'd! [22:36:33] (03Abandoned) 10Aaron Schulz: Made the session/main stashes write to both DCs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247325 (https://phabricator.wikimedia.org/T111575) (owner: 10Aaron Schulz) [22:37:08] aude: anything to check in group1, or are we ok to roll forward? [22:39:38] thcipriani: OK, let me know if you decide to go ahead with the SWAT stuff later (looks like it's only config changes) [22:41:05] kaldari: I don't normally run evening SWAT, it'll likely happen in 20 minutes (I hope) [22:44:59] (03PS1) 10Thcipriani: all wikis to 1.28.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293634 [22:45:01] (03PS6) 10Dzahn: git.wikimedia.org -> Diffusion redirects [puppet] - 10https://gerrit.wikimedia.org/r/293221 (https://phabricator.wikimedia.org/T137224) [22:45:15] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [22:46:21] 06Operations, 06Performance-Team, 06Services, 07Availability: Consider restbase/cassandra for session storage (with SSL) - https://phabricator.wikimedia.org/T134811#2370144 (10aaron) [22:46:29] (03CR) 10Thcipriani: [C: 032] all wikis to 1.28.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293634 (owner: 10Thcipriani) [22:47:04] (03Merged) 10jenkins-bot: all wikis to 1.28.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293634 (owner: 10Thcipriani) [22:47:16] alright, rolling forward [22:48:12] !log thcipriani@tin rebuilt wikiversions.php and synchronized wikiversions files: all wikis to 1.28.0-wmf.5 [22:48:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:48:53] tgr: sorry I took up most of your window, but we're all on wmf.5 now. [22:49:23] thx thcipriani [22:49:33] thcipriani: thanks [22:50:10] aude: thank you for the patch! 
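(Rolling the train forward as above only touches the version map: merge the mediawiki-config change that points all wikis at the new branch, pull it, and let scap rebuild and distribute wikiversions. A sketch, assuming sync-wikiversions behaves as it did at the time:)

    cd /srv/mediawiki-staging
    git pull                                     # picks up 'all wikis to 1.28.0-wmf.5'
    scap sync-wikiversions 'all wikis to 1.28.0-wmf.5'
    # scap rebuilds wikiversions.php from the committed version map and pushes it everywhere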
[22:50:15] PROBLEM - Apache HTTP on mw1264 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50421 bytes in 0.016 second response time [22:56:25] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:57:53] (03PS2) 10Gergő Tisza: Add tags for group1 and group2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293502 (owner: 10Anomie) [23:00:05] RoanKattouw, ostriches, Krenair, MaxSem, awight, and Dereckson: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160609T2300). Please do the needful. [23:00:05] Dereckson, Eranroz, Dereckson, and matt_flaschen: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:28] Present [23:00:31] Hello, I can SWAT this evening, but after the previous deployment window is done. [23:00:34] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 1.629 second response time [23:01:50] tgr: thcipriani: could I get a ping when you're finished with train or AuthManager? [23:01:55] Dereckson: I'm available also, if you need a backup. [23:02:04] i think train is done, assuming no problems [23:02:09] :) [23:02:18] Dereckson: train is finished. AuthManager may still be in progress... [23:02:48] * aude got my train deployment sms alert :) [23:03:04] Dereckson: will do (jenkins has two more backports to merge so maybe 10-20 min?) [23:03:30] tgr: perfect, happy Zuul/Jenkins waiting time [23:05:21] (03PS3) 10Dereckson: Remove HiddenPrefs hack for turning off cross-wiki notifications [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288702 (https://phabricator.wikimedia.org/T135266) (owner: 10Catrope) [23:07:03] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/, ref HEAD..readonly/master). [23:07:41] Dereckson: i might have a swat patch for wikidata, if i can prepare it quick enough [23:08:12] aude: okay :) [23:08:26] or longer of the qunit tests continue acting up :( [23:18:32] !log tgr@tin Synchronized php-1.28.0-wmf.5/extensions/ConfirmEdit/FancyCaptcha/resources/ext.confirmEdit.fancyCaptcha.js: deploying [[gerrit:293637]] for AuthManager T135504 (duration: 00m 24s) [23:18:33] T135504: Enable AuthManager in WMF production - https://phabricator.wikimedia.org/T135504 [23:18:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:18:53] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. 
[23:19:55] !log tgr@tin Synchronized php-1.28.0-wmf.5/extensions/MobileFrontend/resources/skins.minerva.special.userlogin.styles/userlogin.less: deploying [[gerrit:293638]] for AuthManager T135504 (duration: 00m 25s) [23:19:56] T135504: Enable AuthManager in WMF production - https://phabricator.wikimedia.org/T135504 [23:19:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:20:52] !log tgr@tin Synchronized php-1.28.0-wmf.5/includes/specialpage/LoginSignupSpecialPage.php: deploying [[gerrit:293636]] for AuthManager T135504 (duration: 00m 25s) [23:20:53] T135504: Enable AuthManager in WMF production - https://phabricator.wikimedia.org/T135504 [23:20:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:21:25] (03CR) 10Gergő Tisza: [C: 032] Add tags for group1 and group2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293502 (owner: 10Anomie) [23:22:06] (03Merged) 10jenkins-bot: Add tags for group1 and group2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293502 (owner: 10Anomie) [23:22:17] (03PS2) 10Gergő Tisza: Enable AuthManager on group2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293491 (https://phabricator.wikimedia.org/T135504) [23:23:42] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.002 second response time [23:25:50] (03CR) 10Gergő Tisza: [C: 032] Enable AuthManager on group2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293491 (https://phabricator.wikimedia.org/T135504) (owner: 10Gergő Tisza) [23:26:33] (03Merged) 10jenkins-bot: Enable AuthManager on group2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293491 (https://phabricator.wikimedia.org/T135504) (owner: 10Gergő Tisza) [23:28:15] !log tgr@tin Synchronized dblists/group2.dblist: add dblist for group2 (duration: 00m 22s) [23:28:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:29:34] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 0.019 second response time [23:29:50] !log tgr@tin Synchronized wmf-config/CommonSettings.php: enable use of group1, group2 dblists in config (duration: 00m 23s) [23:29:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:31:05] !log tgr@tin Synchronized wmf-config/InitialiseSettings.php: enable AuthManager on group2 wikis T135504 (duration: 00m 24s) [23:31:06] T135504: Enable AuthManager in WMF production - https://phabricator.wikimedia.org/T135504 [23:31:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:31:24] anomie: dapatrick: ^ [23:31:32] I see it! [23:33:11] Dereckson: done [23:33:13] Hmm. Login failed first try, redirected me to Login form again with no message. Tried logging in again and it worked. [23:33:35] dapatrick: you need to investigate that? [23:33:38] dapatrick: did you open the login form before the deployment? [23:33:51] tgr Nope. [23:34:01] Cleared cookies and was sitting at the main page. [23:34:37] I opened a private session, tried to login on fr.wikip, worked like a charm. [23:34:37] I'm not freaking out about it. [23:35:03] seems to work for me, can you figure out how to reproduce? [23:36:46] Things are working fine now, after that initial weirdness. 
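(The group2 rollout above shows the dblist pattern: a dblist is just a plain-text file listing one wiki database per line, and once CommonSettings.php knows about it, the list name can be used as a key in InitialiseSettings.php so a single setting line enables AuthManager for every wiki in the list. A sketch of what such a file looks like; the entries are illustrative, not the actual contents of dblists/group2.dblist:)

    # dblists/group2.dblist -- one database name per line
    cat > dblists/group2.dblist <<'EOF'
    enwiki
    dewiki
    commonswiki
    EOF

The three syncs at 23:28-23:31 above land in exactly that order: the dblist itself, the CommonSettings.php change that registers it, and finally the InitialiseSettings.php change keyed on it.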
I can hit all sites as expected (used repro steps from T136989) [23:36:46] T136989: Enabling two-factor authentication disrupts SUL behavior - https://phabricator.wikimedia.org/T136989 [23:36:57] Trying to reproduce my initial problem now. [23:37:00] okay let's go matt_flaschen, we'll give back the window to tgr and dapatrick if they need a hotfix [23:37:08] (03PS4) 10Dereckson: Remove HiddenPrefs hack for turning off cross-wiki notifications [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288702 (https://phabricator.wikimedia.org/T135266) (owner: 10Catrope) [23:37:11] Okay [23:37:17] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288702 (https://phabricator.wikimedia.org/T135266) (owner: 10Catrope) [23:37:55] (03Merged) 10jenkins-bot: Remove HiddenPrefs hack for turning off cross-wiki notifications [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288702 (https://phabricator.wikimedia.org/T135266) (owner: 10Catrope) [23:38:24] matt_flaschen: live on mw1017 [23:38:30] Dereckson, tgr, seems fine. Thanks! [23:40:08] Dereckson, it doesn't affect testwiki, so I can't test there. [23:40:46] matt_flaschen: you can test where you want with https://wikitech.wikimedia.org/wiki/Debugging_in_production [23:41:24] if you use this extension - https://chrome.google.com/webstore/detail/wikimediadebug/binmakecefompkjggiklgjenddjoifbb - you can with one click send request to this server [23:42:21] Dereckson, thanks, I forgot about that. [23:42:44] there is also an extension for Firefox: https://addons.mozilla.org/en-US/firefox/addon/wikimedia-debug-header/ [23:46:10] Dereckson: added wikidata patch to the wiki [23:46:18] it will apply to wmf.5 core [23:47:02] Dereckson, looks good. [23:47:11] aude, matt_flaschen, okay [23:47:45] !log dereckson@tin Synchronized wmf-config/CommonSettings.php: Remove HiddenPrefs hack for turning off cross-wiki notifications (T135266) (duration: 00m 27s) [23:47:46] T135266: Gate cross-wiki preferences entirely (default off) - https://phabricator.wikimedia.org/T135266 [23:47:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:48:36] (03PS2) 10Dzahn: Remove Nik Everett's production access [puppet] - 10https://gerrit.wikimedia.org/r/291125 (https://phabricator.wikimedia.org/T130113) (owner: 10Greg Grossmeier) [23:48:50] (03PS3) 10Dereckson: Enable Flow beta feature on frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292198 (https://phabricator.wikimedia.org/T136684) (owner: 10Catrope) [23:48:59] (03CR) 10Dereckson: [C: 032] Enable Flow beta feature on frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292198 (https://phabricator.wikimedia.org/T136684) (owner: 10Catrope) [23:49:03] PROBLEM - puppet last run on cp2026 is CRITICAL: CRITICAL: puppet fail [23:49:54] (03PS3) 10Dzahn: Remove Nik Everett's production access [puppet] - 10https://gerrit.wikimedia.org/r/291125 (https://phabricator.wikimedia.org/T130113) (owner: 10Greg Grossmeier) [23:50:19] (03Merged) 10jenkins-bot: Enable Flow beta feature on frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292198 (https://phabricator.wikimedia.org/T136684) (owner: 10Catrope) [23:50:45] (03CR) 10Dzahn: [C: 032] Remove Nik Everett's production access [puppet] - 10https://gerrit.wikimedia.org/r/291125 (https://phabricator.wikimedia.org/T130113) (owner: 10Greg Grossmeier) [23:51:26] matt_flaschen: live on mw1017 [23:52:47] Dereckson, looks good. 
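(Dereckson's "Debugging in production" pointer above works because the browser extensions do nothing more than add a request header that routes the request to the debug appserver, mw1017 at the time. The header name is real; the value syntax below is an assumption for the 2016-era setup:)

    # send one request through the debug backend and inspect the response headers
    curl -sI -H 'X-Wikimedia-Debug: 1' 'https://fr.wikipedia.org/wiki/Special:BlankPage'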
[23:53:22] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Enable Flow beta feature on frwiki (T136684) (duration: 00m 27s) [23:53:23] T136684: Deploy Flow as a Beta Feature on French Wikipedia - https://phabricator.wikimedia.org/T136684 [23:53:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:54:05] I'm skipping Set import sources for he.wikipedia as there is a -1 and no news from Eranroz since. [23:54:26] (03PS1) 10Dzahn: admin: move manybubbles to absented users [puppet] - 10https://gerrit.wikimedia.org/r/293656 (https://phabricator.wikimedia.org/T130113) [23:54:36] (03PS2) 10Dereckson: Revert "Revert "Set Tamil projects to use uca-ta collation"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290686 (https://phabricator.wikimedia.org/T75453) [23:54:44] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290686 (https://phabricator.wikimedia.org/T75453) (owner: 10Dereckson) [23:55:32] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [23:55:47] (03PS2) 10Dzahn: admin: move manybubbles to absented users [puppet] - 10https://gerrit.wikimedia.org/r/293656 (https://phabricator.wikimedia.org/T130113) [23:55:53] (03Merged) 10jenkins-bot: Revert "Revert "Set Tamil projects to use uca-ta collation"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290686 (https://phabricator.wikimedia.org/T75453) (owner: 10Dereckson) [23:57:00] works on mw1017 [23:57:31] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Set Tamil projects to use uca-ta collation II (T75453) (duration: 00m 25s) [23:57:31] T75453: Tamil sort order - https://phabricator.wikimedia.org/T75453 [23:57:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:59:12] aude: you self-merged the fix [23:59:32] only on the branch [23:59:45] oh okay