[00:01:34] <icinga-wm>	 PROBLEM - puppet last run on restbase1007 is CRITICAL: CRITICAL: Puppet has 1 failures
[00:02:09] <wikibugs>	 06Operations, 10ops-codfw, 10media-storage, 13Patch-For-Review: rack/setup/deploy ms-be202[2-7] - https://phabricator.wikimedia.org/T136630#2387712 (10Papaul) @fgiunchedi the other ms-be* systems have Trusty installed; are we also installing Trusty on the new systems, or Jessie? Thanks
[00:02:14] <icinga-wm>	 PROBLEM - Apache HTTP on mw1133 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:03:13] <icinga-wm>	 PROBLEM - HHVM rendering on mw1133 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:03:20] <wikibugs>	 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate, 13Patch-For-Review: write Apache rewrite rules for  gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2387714 (10Danny_B) 05Open>03Resolved Rules are written. Deploying them is another task.
[00:03:54] <icinga-wm>	 PROBLEM - nutcracker port on mw1133 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:04:15] <icinga-wm>	 PROBLEM - nutcracker process on mw1133 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:04:24] <icinga-wm>	 PROBLEM - Check size of conntrack table on mw1133 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:04:33] <icinga-wm>	 PROBLEM - configured eth on mw1133 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:04:43] <icinga-wm>	 PROBLEM - DPKG on mw1133 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:04:53] <icinga-wm>	 PROBLEM - puppet last run on mw1133 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:05:03] <wikibugs>	 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate, 13Patch-For-Review: write Apache rewrite rules for  gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2387718 (10Paladox) The above works. git.wmflabs.org
[00:05:03] <icinga-wm>	 PROBLEM - Disk space on mw1133 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:05:04] <icinga-wm>	 PROBLEM - salt-minion processes on mw1133 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:05:34] <icinga-wm>	 PROBLEM - dhclient process on mw1133 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:05:44] <icinga-wm>	 PROBLEM - SSH on mw1133 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:06:03] <icinga-wm>	 PROBLEM - HHVM processes on mw1133 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:09:43] <wikibugs>	 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate, 13Patch-For-Review: write Apache rewrite rules for  gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2387721 (10Paladox)
[00:14:35] <grrrit-wm>	 (03PS1) 10Papaul: DHCP: Add MAC address entries for ms-be202[2-7] Bug:T136630 [puppet] - 10https://gerrit.wikimedia.org/r/294863 (https://phabricator.wikimedia.org/T136630) 
[00:19:10] <grrrit-wm>	 (03PS2) 10Papaul: DHCP: Add MAC address entries for ms-be202[2-7] Bug:T136630 [puppet] - 10https://gerrit.wikimedia.org/r/294863 (https://phabricator.wikimedia.org/T136630) 
[00:27:34] <icinga-wm>	 RECOVERY - puppet last run on restbase1007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[00:28:33] <icinga-wm>	 RECOVERY - DPKG on mw1133 is OK: All packages OK
[00:28:34] <icinga-wm>	 RECOVERY - puppet last run on mw1133 is OK: OK: Puppet is currently enabled, last run 57 minutes ago with 0 failures
[00:28:44] <icinga-wm>	 RECOVERY - Disk space on mw1133 is OK: DISK OK
[00:28:44] <icinga-wm>	 RECOVERY - salt-minion processes on mw1133 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[00:29:13] <grrrit-wm>	 (03PS1) 10Papaul: adding install params for ms-be202[2-7] Bug:T136630 [puppet] - 10https://gerrit.wikimedia.org/r/294866 (https://phabricator.wikimedia.org/T136630) 
[00:29:14] <icinga-wm>	 RECOVERY - dhclient process on mw1133 is OK: PROCS OK: 0 processes with command name dhclient
[00:29:26] <icinga-wm>	 RECOVERY - SSH on mw1133 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.7 (protocol 2.0)
[00:29:43] <icinga-wm>	 RECOVERY - HHVM processes on mw1133 is OK: PROCS OK: 6 processes with command name hhvm
[00:29:45] <icinga-wm>	 RECOVERY - nutcracker port on mw1133 is OK: TCP OK - 0.000 second response time on port 11212
[00:30:14] <icinga-wm>	 RECOVERY - nutcracker process on mw1133 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[00:30:24] <icinga-wm>	 RECOVERY - Check size of conntrack table on mw1133 is OK: OK: nf_conntrack is 0 % full
[00:30:24] <icinga-wm>	 RECOVERY - configured eth on mw1133 is OK: OK - interfaces up
[00:32:47] <wikibugs>	 06Operations, 10ops-codfw, 10media-storage, 13Patch-For-Review: rack/setup/deploy ms-be202[2-7] - https://phabricator.wikimedia.org/T136630#2387770 (10Papaul)
[00:37:24] <icinga-wm>	 PROBLEM - puppet last run on mw1133 is CRITICAL: CRITICAL: Puppet has 109 failures
[01:01:58] <grrrit-wm>	 (03PS1) 1020after4: Rewrite rules for git.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/294867 (https://phabricator.wikimedia.org/T137224) 
[01:03:37] <grrrit-wm>	 (03PS2) 1020after4: Rewrite rules for git.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/294867 (https://phabricator.wikimedia.org/T137224) 
[01:20:27] <grrrit-wm>	 (03PS2) 10Yuvipanda: Fix default 'type' behavior [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/294844 
[01:20:29] <grrrit-wm>	 (03PS5) 10Yuvipanda: Bump deb version [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/294741 
[02:00:10] <grrrit-wm>	 (03PS6) 10Yuvipanda: Bump deb version [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/294741 
[02:00:12] <grrrit-wm>	 (03PS1) 10Yuvipanda: Cleanup backend: field too when killing webservices [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/294872 
[02:03:51] <icinga-wm>	 PROBLEM - puppet last run on restbase1007 is CRITICAL: CRITICAL: Puppet has 1 failures
[02:24:34] <logmsgbot>	 !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.6) (duration: 09m 46s)
[02:24:39] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:26:32] <icinga-wm>	 RECOVERY - puppet last run on restbase1007 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures
[02:31:00] <logmsgbot>	 !log l10nupdate@tin ResourceLoader cache refresh completed at Fri Jun 17 02:31:00 UTC 2016 (duration 6m 26s)
[02:31:04] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:05:47] <icinga-wm>	 PROBLEM - HHVM jobrunner on mw2248 is CRITICAL: Connection timed out
[03:05:47] <icinga-wm>	 PROBLEM - HHVM jobrunner on mw2249 is CRITICAL: Connection timed out
[03:05:47] <icinga-wm>	 PROBLEM - HHVM jobrunner on mw2250 is CRITICAL: Connection timed out
[03:05:56] <icinga-wm>	 PROBLEM - nutcracker process on mw2250 is CRITICAL: Timeout while attempting connection
[03:05:56] <icinga-wm>	 PROBLEM - nutcracker process on mw2248 is CRITICAL: Timeout while attempting connection
[03:05:56] <icinga-wm>	 PROBLEM - nutcracker process on mw2249 is CRITICAL: Timeout while attempting connection
[03:06:16] <icinga-wm>	 PROBLEM - puppet last run on mw2250 is CRITICAL: Timeout while attempting connection
[03:06:16] <icinga-wm>	 PROBLEM - puppet last run on mw2249 is CRITICAL: Timeout while attempting connection
[03:06:16] <icinga-wm>	 PROBLEM - puppet last run on mw2248 is CRITICAL: Timeout while attempting connection
[03:06:46] <icinga-wm>	 PROBLEM - salt-minion processes on mw2249 is CRITICAL: Timeout while attempting connection
[03:06:46] <icinga-wm>	 PROBLEM - salt-minion processes on mw2250 is CRITICAL: Timeout while attempting connection
[03:06:46] <icinga-wm>	 PROBLEM - salt-minion processes on mw2248 is CRITICAL: Timeout while attempting connection
[03:07:06] <icinga-wm>	 PROBLEM - Check size of conntrack table on mw2248 is CRITICAL: Timeout while attempting connection
[03:07:06] <icinga-wm>	 PROBLEM - Check size of conntrack table on mw2249 is CRITICAL: Timeout while attempting connection
[03:07:06] <icinga-wm>	 PROBLEM - Check size of conntrack table on mw2250 is CRITICAL: Timeout while attempting connection
[03:07:16] <icinga-wm>	 PROBLEM - DPKG on mw2249 is CRITICAL: Timeout while attempting connection
[03:07:16] <icinga-wm>	 PROBLEM - DPKG on mw2250 is CRITICAL: Timeout while attempting connection
[03:07:16] <icinga-wm>	 PROBLEM - DPKG on mw2248 is CRITICAL: Timeout while attempting connection
[03:07:28] <icinga-wm>	 PROBLEM - Disk space on mw2248 is CRITICAL: Timeout while attempting connection
[03:07:28] <icinga-wm>	 PROBLEM - Disk space on mw2250 is CRITICAL: Timeout while attempting connection
[03:07:28] <icinga-wm>	 PROBLEM - Disk space on mw2249 is CRITICAL: Timeout while attempting connection
[03:08:08] <icinga-wm>	 PROBLEM - MD RAID on mw2248 is CRITICAL: Timeout while attempting connection
[03:08:08] <icinga-wm>	 PROBLEM - MD RAID on mw2249 is CRITICAL: Timeout while attempting connection
[03:08:08] <icinga-wm>	 PROBLEM - MD RAID on mw2250 is CRITICAL: Timeout while attempting connection
[03:08:57] <icinga-wm>	 PROBLEM - configured eth on mw2250 is CRITICAL: Timeout while attempting connection
[03:08:57] <icinga-wm>	 PROBLEM - configured eth on mw2248 is CRITICAL: Timeout while attempting connection
[03:08:57] <icinga-wm>	 PROBLEM - configured eth on mw2249 is CRITICAL: Timeout while attempting connection
[03:09:16] <icinga-wm>	 PROBLEM - dhclient process on mw2249 is CRITICAL: Timeout while attempting connection
[03:09:16] <icinga-wm>	 PROBLEM - dhclient process on mw2248 is CRITICAL: Timeout while attempting connection
[03:09:16] <icinga-wm>	 PROBLEM - dhclient process on mw2250 is CRITICAL: Timeout while attempting connection
[03:09:26] <icinga-wm>	 PROBLEM - mediawiki-installation DSH group on mw2248 is CRITICAL: Host mw2248 is not in mediawiki-installation dsh group
[03:09:26] <icinga-wm>	 PROBLEM - mediawiki-installation DSH group on mw2249 is CRITICAL: Host mw2249 is not in mediawiki-installation dsh group
[03:09:26] <icinga-wm>	 PROBLEM - mediawiki-installation DSH group on mw2250 is CRITICAL: Host mw2250 is not in mediawiki-installation dsh group
[03:09:47] <icinga-wm>	 PROBLEM - nutcracker port on mw2248 is CRITICAL: Timeout while attempting connection
[03:09:47] <icinga-wm>	 PROBLEM - nutcracker port on mw2249 is CRITICAL: Timeout while attempting connection
[03:09:47] <icinga-wm>	 PROBLEM - nutcracker port on mw2250 is CRITICAL: Timeout while attempting connection
[03:19:20] <grrrit-wm>	 (03PS1) 10KartikMistry: Deploy Compact Language Links as default (Stage 1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294874 (https://phabricator.wikimedia.org/T136677) 
[03:30:26] <icinga-wm>	 RECOVERY - Disk space on mw2248 is OK: DISK OK
[03:30:37] <icinga-wm>	 RECOVERY - nutcracker port on mw2248 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212
[03:30:56] <icinga-wm>	 RECOVERY - nutcracker process on mw2248 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker
[03:31:07] <icinga-wm>	 RECOVERY - MD RAID on mw2248 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[03:31:47] <icinga-wm>	 RECOVERY - salt-minion processes on mw2248 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[03:31:47] <icinga-wm>	 RECOVERY - salt-minion processes on mw2249 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[03:32:16] <icinga-wm>	 RECOVERY - configured eth on mw2249 is OK: OK - interfaces up
[03:32:16] <icinga-wm>	 RECOVERY - configured eth on mw2248 is OK: OK - interfaces up
[03:32:47] <icinga-wm>	 RECOVERY - dhclient process on mw2248 is OK: PROCS OK: 0 processes with command name dhclient
[03:33:26] <icinga-wm>	 RECOVERY - Check size of conntrack table on mw2249 is OK: OK: nf_conntrack is 0 % full
[03:33:36] <icinga-wm>	 RECOVERY - DPKG on mw2249 is OK: All packages OK
[03:33:48] <icinga-wm>	 RECOVERY - Disk space on mw2249 is OK: DISK OK
[03:34:26] <icinga-wm>	 RECOVERY - MD RAID on mw2249 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[03:34:37] <icinga-wm>	 RECOVERY - Check size of conntrack table on mw2248 is OK: OK: nf_conntrack is 0 % full
[03:34:56] <wikibugs>	 06Operations, 10ContentTranslation-Deployments, 10ContentTranslation-cxserver, 10MediaWiki-extensions-ContentTranslation, and 4 others: Package and test apertium for Jessie - https://phabricator.wikimedia.org/T107306#2387881 (10KartikMistry)
[03:34:56] <icinga-wm>	 RECOVERY - dhclient process on mw2249 is OK: PROCS OK: 0 processes with command name dhclient
[03:34:56] <icinga-wm>	 RECOVERY - DPKG on mw2248 is OK: All packages OK
[03:35:07] <icinga-wm>	 RECOVERY - nutcracker port on mw2249 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212
[03:35:27] <icinga-wm>	 RECOVERY - nutcracker process on mw2249 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker
[03:39:28] <icinga-wm>	 RECOVERY - HHVM jobrunner on mw2248 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.089 second response time
[03:40:16] <icinga-wm>	 RECOVERY - HHVM jobrunner on mw2249 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.101 second response time
[03:40:21] <grrrit-wm>	 (03CR) 10Yuvipanda: [C: 032] Exit when given unsupported parameters [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/294843 (owner: 10Yuvipanda)
[03:40:33] <grrrit-wm>	 (03CR) 10Yuvipanda: [C: 032] Fix default 'type' behavior [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/294844 (owner: 10Yuvipanda)
[03:40:46] <grrrit-wm>	 (03CR) 10Yuvipanda: [C: 032] Cleanup backend: field too when killing webservices [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/294872 (owner: 10Yuvipanda)
[03:40:59] <grrrit-wm>	 (03CR) 10Yuvipanda: [C: 032] Bump deb version [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/294741 (owner: 10Yuvipanda)
[03:41:58] <icinga-wm>	 PROBLEM - NTP on mw2248 is CRITICAL: NTP CRITICAL: Offset unknown
[03:41:58] <icinga-wm>	 PROBLEM - NTP on mw2249 is CRITICAL: NTP CRITICAL: Offset unknown
[03:42:07] <icinga-wm>	 PROBLEM - NTP on mw2250 is CRITICAL: NTP CRITICAL: Offset unknown
[03:43:16] <icinga-wm>	 RECOVERY - MD RAID on mw2250 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[03:43:27] <icinga-wm>	 RECOVERY - configured eth on mw2250 is OK: OK - interfaces up
[03:43:46] <icinga-wm>	 RECOVERY - dhclient process on mw2250 is OK: PROCS OK: 0 processes with command name dhclient
[03:43:57] <icinga-wm>	 RECOVERY - Check size of conntrack table on mw2250 is OK: OK: nf_conntrack is 0 % full
[03:44:07] <icinga-wm>	 RECOVERY - nutcracker port on mw2250 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212
[03:44:26] <icinga-wm>	 RECOVERY - nutcracker process on mw2250 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker
[03:44:28] <icinga-wm>	 RECOVERY - Disk space on mw2250 is OK: DISK OK
[03:44:47] <icinga-wm>	 RECOVERY - salt-minion processes on mw2250 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[03:46:08] <icinga-wm>	 RECOVERY - NTP on mw2248 is OK: NTP OK: Offset 0.002550601959 secs
[03:46:17] <icinga-wm>	 RECOVERY - DPKG on mw2250 is OK: All packages OK
[03:48:17] <icinga-wm>	 RECOVERY - NTP on mw2249 is OK: NTP OK: Offset -6.330013275e-05 secs
[03:51:07] <icinga-wm>	 RECOVERY - HHVM jobrunner on mw2250 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.082 second response time
[03:58:46] <icinga-wm>	 RECOVERY - NTP on mw2250 is OK: NTP OK: Offset -0.001456141472 secs
[04:26:21] <wikibugs>	 06Operations, 10Android-app-feature-Feeds, 10Mobile-Content-Service, 10RESTBase, and 3 others: Investigate Android app API request latency regression - https://phabricator.wikimedia.org/T138010#2387918 (10ori)
[04:30:37] <icinga-wm>	 PROBLEM - puppet last run on restbase1007 is CRITICAL: CRITICAL: Puppet has 1 failures
[04:57:08] <icinga-wm>	 RECOVERY - puppet last run on restbase1007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:57:50] <wikibugs>	 06Operations, 10Android-app-feature-Feeds, 10Mobile-Content-Service, 10RESTBase, and 3 others: Investigate Android app API request latency regression - https://phabricator.wikimedia.org/T138010#2387932 (10GWicke) Varnish -> RB GET p95 latency: {F4173733}
[05:35:44] <grrrit-wm>	 (03CR) 10Giuseppe Lavagetto: "@bblack it's my full intention to do that; I made this behave like the standard class if one doesn't specify a reason in enabling puppet." [puppet] - 10https://gerrit.wikimedia.org/r/294694 (owner: 10Giuseppe Lavagetto)
[05:41:33] <grrrit-wm>	 (03PS2) 10Giuseppe Lavagetto: conftool: add new jessie api appservers [puppet] - 10https://gerrit.wikimedia.org/r/294737 
[05:43:34] <grrrit-wm>	 (03PS1) 10Yuvipanda: Fix terrible typo in status check for restarts [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/294877 
[05:44:21] <grrrit-wm>	 (03PS1) 10Yuvipanda: Bump deb version [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/294878 
[05:48:58] <icinga-wm>	 PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:51:07] <icinga-wm>	 PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:51:19] <grrrit-wm>	 (03CR) 10Yuvipanda: [C: 032] Fix terrible typo in status check for restarts [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/294877 (owner: 10Yuvipanda)
[05:51:30] <grrrit-wm>	 (03CR) 10Yuvipanda: [C: 032] Bump deb version [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/294878 (owner: 10Yuvipanda)
[05:53:27] <icinga-wm>	 PROBLEM - zotero on sca1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:55:36] <grrrit-wm>	 (03CR) 10Giuseppe Lavagetto: [C: 032] conftool: add new jessie api appservers [puppet] - 10https://gerrit.wikimedia.org/r/294737 (owner: 10Giuseppe Lavagetto)
[05:57:47] <icinga-wm>	 PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad PMCID) is CRITICAL: Could not fetch url http://citoid.svc.eqiad.wmnet:1970/api: Timeout on connection while downloading http://citoid.svc.eqiad.wmnet:1970/api: /api (web site in alternative language) is CRITICAL: Could not fetch url http://citoid.svc.eqiad.wmnet:1970/api: Timeout on connection while downloading http://citoid.svc.eqiad.wmnet:1970/api: /api (Zot
[05:58:31] <logmsgbot>	 !log root@palladium conftool action : set/pooled=yes:weight=20; selector: cluster=api_appserver,name=mw127.*
[05:58:35] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[06:03:46] <icinga-wm>	 PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad PMCID) is CRITICAL: Could not fetch url http://citoid.svc.eqiad.wmnet:1970/api: Timeout on connection while downloading http://citoid.svc.eqiad.wmnet:1970/api: /api (web site in alternative language) is CRITICAL: Could not fetch url http://citoid.svc.eqiad.wmnet:1970/api: Timeout on connection while downloading http://citoid.svc.eqiad.wmnet:1970/api: /api (Zot
[06:04:01] <logmsgbot>	 !log root@palladium conftool action : set/weight=25; selector: cluster=api_appserver,name=mw127.*
[06:04:05] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[06:07:26] <icinga-wm>	 PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad URL) is CRITICAL: Could not fetch url http://citoid.svc.eqiad.wmnet:1970/api: Timeout on connection while downloading http://citoid.svc.eqiad.wmnet:1970/api: /api (Zotero alive) is CRITICAL: Could not fetch url http://citoid.svc.eqiad.wmnet:1970/api: Timeout on connection while downloading http://citoid.svc.eqiad.wmnet:1970/api
[06:08:05] <_joe_>	 uhm this is bad
[06:08:45] <_joe_>	 but also not true
[06:08:49] <_joe_>	 citoid is working
[06:09:03] <_joe_>	 so this is a problem with neon I'd say
[06:10:25] <_joe_>	 it's zotero that is down, apparently, looking
[06:12:19] <icinga-wm>	 PROBLEM - LVS HTTP IPv4 on zotero.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:14:05] <icinga-wm>	 PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad PMCID) is CRITICAL: Could not fetch url http://citoid.svc.eqiad.wmnet:1970/api: Timeout on connection while downloading http://citoid.svc.eqiad.wmnet:1970/api: /api (web site in alternative language) is CRITICAL: Could not fetch url http://citoid.svc.eqiad.wmnet:1970/api: Timeout on connection while downloading http://citoid.svc.eqiad.wmnet:1970/api: /api (Zot
[06:14:20] <icinga-wm>	 RECOVERY - LVS HTTP IPv4 on zotero.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.0 200 OK - 62 bytes in 0.013 second response time
[06:15:29] <akosiaris>	 [JavaScript Error: "uncaught exception: out of memory"]
[06:15:52] <akosiaris>	 virt 28.319g, resident 0.027t
[06:15:54] <akosiaris>	 restarting it
[06:16:02] <_joe_>	 yes
[06:16:09] <_joe_>	 that's sca1002?
[06:16:14] <_joe_>	 I just restarted it on 1001
[06:16:16] <akosiaris>	 ok
[06:16:19] <akosiaris>	 yup
[06:16:31] <akosiaris>	 ah that's why it's fine on sca1001 and it recovered
[06:16:40] <akosiaris>	 !log restarted zotero on sca1002
[06:16:44] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[06:16:45] <grrrit-wm>	 (03PS2) 10KartikMistry: Deploy Compact Language Links as default (Stage 1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294874 (https://phabricator.wikimedia.org/T136677) 
[06:16:45] <_joe_>	 yes, I was about to !log
[06:16:50] <akosiaris>	 !log _joe_ restarted zotero on sca1001
[06:16:54] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[06:16:58] <_joe_>	 thanks
[06:17:10] <_joe_>	 so the citoid LVS alert is actually working
[06:17:16] <icinga-wm>	 RECOVERY - zotero on sca1002 is OK: HTTP OK: HTTP/1.0 200 OK - 62 bytes in 0.011 second response time
[06:17:19] <grrrit-wm>	 (03CR) 10jenkins-bot: [V: 04-1] Deploy Compact Language Links as default (Stage 1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294874 (https://phabricator.wikimedia.org/T136677) (owner: 10KartikMistry)
[06:17:25] <icinga-wm>	 RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy
[06:17:25] <akosiaris>	 depends
[06:17:26] <icinga-wm>	 RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy
[06:17:26] <_joe_>	 It caught the problem well before the zotero lvs did
[06:17:36] <icinga-wm>	 RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy
[06:17:43] <akosiaris>	 it probably is that gov database again
[06:17:45] <_joe_>	 ofc it's a symptom
[06:17:51] <_joe_>	 akosiaris: an OOM?
[06:18:13] <_joe_>	 btw I don't know why pybal didn't depool at least one of the zotero hosts
[06:18:21] <akosiaris>	 no, it never ooms
[06:18:21] <_joe_>	 probably we don't do proper checks
[06:18:52] <akosiaris>	 oh, the check I've managed to do with zotero is quite simple
[06:19:00] <akosiaris>	 so https://ganglia.wikimedia.org/latest/graph.php?r=month&z=xlarge&c=Service+Cluster+A+eqiad&m=cpu_report&s=by+name&mc=2&g=mem_report
[06:19:17] <_joe_>	 sigh
[06:19:22] <akosiaris>	 I am this close to putting a cronjob to restart zotero once a week
[06:19:25] <_joe_>	 no I mean pybal checks
[06:19:36] <_joe_>	 akosiaris: see what I did for jobrunners :)
[06:19:41] <akosiaris>	 heh
[06:19:47] <_joe_>	 as far as cron restarts go
[06:19:55] <_joe_>	 I am thinking of extending that to the API cluster
[06:20:02] <grrrit-wm>	 (03PS3) 10KartikMistry: Deploy Compact Language Links as default (Stage 1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294874 (https://phabricator.wikimedia.org/T136677) 
[06:20:39] <akosiaris>	 ah the zotero pybal check is IdleConnection
[06:20:49] <akosiaris>	 ProxyFetch is basically impossible IIRC
[06:20:51] <_joe_>	 akosiaris: right
[06:20:54] <_joe_>	 why?
[06:20:55] <akosiaris>	 zotero only accepts POSTs
[06:21:02] <_joe_>	 well let me try :)
[06:21:04] <akosiaris>	 no GETs
[06:21:25] <_joe_>	 and we don't do POST in ProxyFetch
[06:21:31] <_joe_>	 well, we must add that
[06:21:41] <_joe_>	 if anyone opens a ticket on that :P
[06:22:39] <akosiaris>	 $USER1$/check_http -I $HOSTADDRESS$ -H $ARG1$ -p $ARG2$ -P '[{"itemType":"journalArticle"}]' -T 'application/json' -u "$ARG3$"
[06:22:43] <akosiaris>	 that's the zotero nagios check
[06:22:49] <akosiaris>	 that actually worked from what I see
[06:23:09] <_joe_>	 yes
[06:23:15] <akosiaris>	 (08:53:27 AM) icinga-wm: PROBLEM - zotero on sca1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:23:19] <akosiaris>	 and very very early
[06:23:23] <_joe_>	 yes
[06:23:35] <akosiaris>	 20 whole minutes before LVS paged 
[06:24:01] <_joe_>	 and 5 before the functional test on citoid lvs paged
[06:24:17] <_joe_>	 we should add more functional checks to LVSs
[06:24:22] <akosiaris>	 no that's the host check $USER1$/check_http -H $HOSTADDRESS$ -p $ARG1$ -P '[{"itemType":"journalArticle"}]' -T 'application/json' -u /export?format=wikipedia
[06:24:24] <_joe_>	 and make them page
[06:24:29] <akosiaris>	 the other one I pasted is the LVS check
[06:24:35] <akosiaris>	 not that they differ a lot
[06:24:48] <_joe_>	 nope
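The host check pasted above boils down to a POST with a small JSON body, which is why a GET-only fetch cannot probe zotero. A minimal Python sketch of the same probe follows, assuming only what the pasted command shows (the payload, the `application/json` content type, and the `/export?format=wikipedia` path). The function names, the host, and the port are hypothetical placeholders, and this is not how pybal or the Icinga plugin is actually implemented.

```python
import json
import urllib.error
import urllib.request

# Body taken from the check_http command pasted in the log.
PAYLOAD = [{"itemType": "journalArticle"}]


def build_request(host, port, path="/export?format=wikipedia"):
    """Build the POST request; host and port are supplied by the caller."""
    body = json.dumps(PAYLOAD).encode("utf-8")
    return urllib.request.Request(
        f"http://{host}:{port}{path}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def check(host, port, timeout=10.0):
    """Return (ok, message); ok is True only for an HTTP 200 reply."""
    req = build_request(host, port)
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status == 200, f"HTTP {resp.status}"
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status.
        return False, f"HTTP {exc.code}"
    except OSError as exc:
        # Covers URLError, refused connections and socket timeouts.
        return False, f"connection problem: {exc}"
```

The 10 second default mirrors the "Socket timeout after 10 seconds" seen in the alerts; a load balancer health check would likely want a shorter value.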
[06:25:33] <akosiaris>	 anyway, back to breakfast, bbl
[06:26:37] <_joe_>	 heh, I should do breakfast as well
[06:26:45] <mobrovac>	 zotero again?
[06:27:33] <akosiaris>	 mobrovac: yup. out of memory and the xul engine logged but that's all it does about it
[06:27:42] <mobrovac>	 lol
[06:28:01] <mobrovac>	 "fyi i'm out of mem, but i'll continue"
[06:30:57] <icinga-wm>	 PROBLEM - puppet last run on db2056 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:27] <icinga-wm>	 PROBLEM - puppet last run on labnet1002 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:31:47] <icinga-wm>	 PROBLEM - puppet last run on db2055 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:47] <icinga-wm>	 PROBLEM - puppet last run on mw1110 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:56] <icinga-wm>	 PROBLEM - puppet last run on mw2145 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:16] <icinga-wm>	 PROBLEM - puppet last run on mw2129 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:26] <icinga-wm>	 PROBLEM - puppet last run on lvs2002 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:32:30] <wikibugs>	 06Operations, 10Android-app-feature-Feeds, 10Mobile-Content-Service, 10RESTBase, and 3 others: Investigate Android app API request latency regression - https://phabricator.wikimedia.org/T138010#2386622 (10ori) From reading the schema, it looks like you're reporting an average for all requests made in the c...
[06:32:35] <icinga-wm>	 PROBLEM - puppet last run on mw2126 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:36] <icinga-wm>	 PROBLEM - puppet last run on mw2158 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:46] <icinga-wm>	 PROBLEM - puppet last run on mw2073 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:33:46] <icinga-wm>	 PROBLEM - puppet last run on mw2207 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:34:03] <grrrit-wm>	 (03PS1) 10Mobrovac: Revert "Change Prop: Disable transclusion update rules" [puppet] - 10https://gerrit.wikimedia.org/r/294880 
[06:40:19] <moritzm>	 !log installing apache update on palladium
[06:40:23] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[06:44:52] <icinga-wm>	 PROBLEM - puppet last run on mw2140 is CRITICAL: CRITICAL: puppet fail
[06:44:53] <icinga-wm>	 PROBLEM - puppet last run on dbstore1001 is CRITICAL: CRITICAL: puppet fail
[06:44:53] <icinga-wm>	 PROBLEM - puppet last run on es1013 is CRITICAL: CRITICAL: Puppet has 29 failures
[06:45:02] <icinga-wm>	 PROBLEM - puppet last run on acamar is CRITICAL: CRITICAL: Puppet has 48 failures
[06:45:02] <icinga-wm>	 PROBLEM - puppet last run on cp1072 is CRITICAL: CRITICAL: puppet fail
[06:45:03] <icinga-wm>	 PROBLEM - puppet last run on cp2007 is CRITICAL: CRITICAL: puppet fail
[06:45:12] <icinga-wm>	 PROBLEM - puppet last run on mw2238 is CRITICAL: CRITICAL: Puppet has 5 failures
[06:45:15] <wikibugs>	 06Operations, 06Collaboration-Team-Interested, 10DBA, 10Flow, 07WorkType-Maintenance: Setup separate logical External Store for Flow in production - https://phabricator.wikimedia.org/T107610#2387980 (10jcrespo) Do you know when Flow was enabled for the first time/what is the oldest content we will find?
[06:45:22] <icinga-wm>	 PROBLEM - puppet last run on mw2139 is CRITICAL: CRITICAL: puppet fail
[06:45:22] <icinga-wm>	 PROBLEM - puppet last run on restbase2005 is CRITICAL: CRITICAL: Puppet has 49 failures
[06:45:23] <icinga-wm>	 PROBLEM - puppet last run on californium is CRITICAL: CRITICAL: puppet fail
[06:45:32] <icinga-wm>	 PROBLEM - puppet last run on cp2021 is CRITICAL: CRITICAL: puppet fail
[06:45:33] <icinga-wm>	 PROBLEM - puppet last run on rdb1002 is CRITICAL: CRITICAL: puppet fail
[06:45:33] <icinga-wm>	 PROBLEM - puppet last run on mw2201 is CRITICAL: CRITICAL: puppet fail
[06:45:33] <icinga-wm>	 PROBLEM - puppet last run on mx2001 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:45:34] <icinga-wm>	 PROBLEM - puppet last run on restbase1010 is CRITICAL: CRITICAL: Puppet has 33 failures
[06:45:42] <icinga-wm>	 PROBLEM - puppet last run on db2069 is CRITICAL: CRITICAL: puppet fail
[06:45:52] <icinga-wm>	 PROBLEM - puppet last run on mw2183 is CRITICAL: CRITICAL: puppet fail
[06:45:53] <icinga-wm>	 PROBLEM - puppet last run on mw2194 is CRITICAL: CRITICAL: Puppet has 53 failures
[06:45:53] <icinga-wm>	 PROBLEM - puppet last run on wtp2007 is CRITICAL: CRITICAL: Puppet has 27 failures
[06:45:53] <icinga-wm>	 PROBLEM - puppet last run on restbase2007 is CRITICAL: CRITICAL: Puppet has 32 failures
[06:45:53] <icinga-wm>	 PROBLEM - puppet last run on elastic2022 is CRITICAL: CRITICAL: puppet fail
[06:45:54] <icinga-wm>	 PROBLEM - puppet last run on mw1161 is CRITICAL: CRITICAL: Puppet has 52 failures
[06:45:54] <icinga-wm>	 PROBLEM - puppet last run on mw2160 is CRITICAL: CRITICAL: puppet fail
[06:45:54] <icinga-wm>	 PROBLEM - puppet last run on mw2065 is CRITICAL: CRITICAL: Puppet has 88 failures
[06:46:02] <icinga-wm>	 PROBLEM - puppet last run on mw2089 is CRITICAL: CRITICAL: Puppet has 75 failures
[06:46:02] <icinga-wm>	 PROBLEM - puppet last run on mw1146 is CRITICAL: CRITICAL: Puppet has 92 failures
[06:46:03] <icinga-wm>	 PROBLEM - puppet last run on labsdb1004 is CRITICAL: CRITICAL: puppet fail
[06:46:03] <icinga-wm>	 PROBLEM - puppet last run on mw2170 is CRITICAL: CRITICAL: puppet fail
[06:46:03] <icinga-wm>	 PROBLEM - puppet last run on mw1221 is CRITICAL: CRITICAL: Puppet has 71 failures
[06:46:13] <icinga-wm>	 PROBLEM - puppet last run on mw2151 is CRITICAL: CRITICAL: Puppet has 60 failures
[06:46:13] <icinga-wm>	 PROBLEM - puppet last run on cp3034 is CRITICAL: CRITICAL: puppet fail
[06:46:22] <icinga-wm>	 PROBLEM - puppet last run on mc1007 is CRITICAL: CRITICAL: puppet fail
[06:46:22] <icinga-wm>	 PROBLEM - puppet last run on mw2099 is CRITICAL: CRITICAL: Puppet has 35 failures
[06:46:32] <icinga-wm>	 PROBLEM - puppet last run on mw1147 is CRITICAL: CRITICAL: Puppet has 90 failures
[06:46:33] <icinga-wm>	 PROBLEM - puppet last run on mw2102 is CRITICAL: CRITICAL: puppet fail
[06:46:33] <icinga-wm>	 PROBLEM - puppet last run on mw2225 is CRITICAL: CRITICAL: puppet fail
[06:46:34] <icinga-wm>	 PROBLEM - puppet last run on wtp1015 is CRITICAL: CRITICAL: Puppet has 32 failures
[06:46:43] <icinga-wm>	 PROBLEM - puppet last run on mw2189 is CRITICAL: CRITICAL: Puppet has 36 failures
[06:46:43] <icinga-wm>	 PROBLEM - puppet last run on maps-test2002 is CRITICAL: CRITICAL: Puppet has 4 failures
[06:46:43] <icinga-wm>	 PROBLEM - puppet last run on analytics1015 is CRITICAL: CRITICAL: Puppet has 38 failures
[06:46:43] <icinga-wm>	 PROBLEM - puppet last run on ms-be2007 is CRITICAL: CRITICAL: Puppet has 35 failures
[06:46:43] <icinga-wm>	 PROBLEM - puppet last run on mw1265 is CRITICAL: CRITICAL: Puppet has 136 failures
[06:46:53] <icinga-wm>	 PROBLEM - puppet last run on mw2179 is CRITICAL: CRITICAL: Puppet has 37 failures
[06:47:22] <icinga-wm>	 PROBLEM - puppet last run on mw2103 is CRITICAL: CRITICAL: Puppet has 60 failures
[06:47:23] <icinga-wm>	 PROBLEM - puppet last run on mw2197 is CRITICAL: CRITICAL: Puppet has 61 failures
[06:48:03] <icinga-wm>	 PROBLEM - puppet last run on mw1210 is CRITICAL: CRITICAL: Puppet has 33 failures
[06:50:13] <icinga-wm>	 RECOVERY - puppet last run on mw1221 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[06:56:23] <icinga-wm>	 RECOVERY - puppet last run on lvs2002 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures
[06:56:53] <icinga-wm>	 RECOVERY - puppet last run on db2056 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures
[06:57:03] <icinga-wm>	 RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[06:57:03] <icinga-wm>	 RECOVERY - puppet last run on mw2145 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:03] <icinga-wm>	 RECOVERY - puppet last run on labnet1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:13] <icinga-wm>	 RECOVERY - puppet last run on mw2207 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures
[06:57:23] <icinga-wm>	 RECOVERY - puppet last run on mw2073 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures
[06:57:42] <icinga-wm>	 RECOVERY - puppet last run on mw2129 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures
[06:57:53] <icinga-wm>	 RECOVERY - puppet last run on mw2158 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:02] <icinga-wm>	 RECOVERY - puppet last run on db2055 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:02] <icinga-wm>	 RECOVERY - puppet last run on mw1110 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:59:01] <grrrit-wm>	 (03CR) 10Giuseppe Lavagetto: [C: 032] Revert "Change Prop: Disable transclusion update rules" [puppet] - 10https://gerrit.wikimedia.org/r/294880 (owner: 10Mobrovac)
[07:02:56] <mobrovac>	 !log change-prop restarting it to apply https://gerrit.wikimedia.org/r/294880
[07:03:01] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[07:10:04] <icinga-wm>	 PROBLEM - DPKG on hafnium is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[07:10:05] <icinga-wm>	 PROBLEM - DPKG on bohrium is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[07:10:45] <icinga-wm>	 RECOVERY - puppet last run on mw1146 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures
[07:10:45] <icinga-wm>	 RECOVERY - puppet last run on restbase2007 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[07:10:46] <icinga-wm>	 RECOVERY - puppet last run on analytics1015 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures
[07:10:55] <icinga-wm>	 RECOVERY - puppet last run on maps-test2002 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures
[07:10:55] <icinga-wm>	 RECOVERY - puppet last run on mw1265 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures
[07:11:04] <icinga-wm>	 RECOVERY - puppet last run on mw2194 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures
[07:11:06] <grrrit-wm>	 (03PS2) 10Giuseppe Lavagetto: Fix sorting of hostnames [dns] - 10https://gerrit.wikimedia.org/r/293282 
[07:11:15] <icinga-wm>	 RECOVERY - puppet last run on restbase1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:11:15] <icinga-wm>	 RECOVERY - puppet last run on restbase2005 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[07:11:16] <icinga-wm>	 RECOVERY - puppet last run on mw2089 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures
[07:11:24] <icinga-wm>	 RECOVERY - puppet last run on es1013 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:11:25] <icinga-wm>	 RECOVERY - puppet last run on acamar is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[07:11:25] <icinga-wm>	 RECOVERY - puppet last run on dbstore1001 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[07:11:29] <mobrovac>	 !log restbase started mobile-sections dump on restbase1009 for T136964
[07:11:30] <stashbot>	 T136964: Pre-generate/purge mobile-sections endpoints to fix page links inside image captions - https://phabricator.wikimedia.org/T136964
[07:11:33] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[07:11:35] <icinga-wm>	 RECOVERY - puppet last run on mw2151 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures
[07:11:35] <icinga-wm>	 RECOVERY - puppet last run on mw2238 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures
[07:11:36] <icinga-wm>	 RECOVERY - puppet last run on wtp2007 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures
[07:11:44] <icinga-wm>	 RECOVERY - puppet last run on cp3034 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures
[07:11:45] <icinga-wm>	 RECOVERY - puppet last run on mw1161 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[07:11:55] <icinga-wm>	 RECOVERY - puppet last run on db2069 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[07:11:55] <icinga-wm>	 RECOVERY - puppet last run on mc1007 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
[07:12:05] <icinga-wm>	 RECOVERY - puppet last run on mw2197 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[07:12:05] <icinga-wm>	 RECOVERY - DPKG on hafnium is OK: All packages OK
[07:12:05] <icinga-wm>	 RECOVERY - puppet last run on mw2099 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures
[07:12:05] <icinga-wm>	 RECOVERY - DPKG on bohrium is OK: All packages OK
[07:12:16] <icinga-wm>	 RECOVERY - puppet last run on mw1147 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:12:16] <icinga-wm>	 RECOVERY - puppet last run on cp2021 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures
[07:12:16] <icinga-wm>	 RECOVERY - puppet last run on cp2007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:12:25] <icinga-wm>	 RECOVERY - puppet last run on mw2102 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures
[07:12:25] <icinga-wm>	 RECOVERY - puppet last run on mw2179 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:12:25] <icinga-wm>	 RECOVERY - puppet last run on rdb1002 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[07:12:26] <icinga-wm>	 RECOVERY - puppet last run on mx2001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[07:12:26] <grrrit-wm>	 (03CR) 10Giuseppe Lavagetto: [C: 032] Fix sorting of hostnames [dns] - 10https://gerrit.wikimedia.org/r/293282 (owner: 10Giuseppe Lavagetto)
[07:12:35] <icinga-wm>	 RECOVERY - puppet last run on cp1072 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:12:35] <icinga-wm>	 RECOVERY - puppet last run on wtp1015 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:12:36] <icinga-wm>	 RECOVERY - puppet last run on mw2225 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[07:12:44] <icinga-wm>	 RECOVERY - puppet last run on elastic2022 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:12:45] <icinga-wm>	 RECOVERY - puppet last run on mw2065 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:12:46] <icinga-wm>	 RECOVERY - puppet last run on mw2103 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures
[07:12:46] <icinga-wm>	 RECOVERY - puppet last run on mw2140 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[07:12:55] <icinga-wm>	 RECOVERY - puppet last run on mw2189 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[07:13:05] <icinga-wm>	 RECOVERY - puppet last run on ms-be2007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:13:05] <icinga-wm>	 RECOVERY - puppet last run on mw2160 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures
[07:13:05] <icinga-wm>	 RECOVERY - puppet last run on labsdb1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:13:24] <icinga-wm>	 RECOVERY - puppet last run on mw2139 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures
[07:13:25] <icinga-wm>	 RECOVERY - puppet last run on mw2201 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures
[07:13:25] <icinga-wm>	 RECOVERY - puppet last run on californium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:13:55] <icinga-wm>	 RECOVERY - puppet last run on mw2183 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures
[07:14:05] <icinga-wm>	 RECOVERY - puppet last run on mw1210 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:14:36] <icinga-wm>	 RECOVERY - puppet last run on mw2170 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:14:49] <grrrit-wm>	 (03PS1) 10Jcrespo: Depool db1072 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294887 
[07:16:30] <grrrit-wm>	 (03CR) 10Jcrespo: [C: 032] Depool db1072 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294887 (owner: 10Jcrespo)
[07:18:42] <logmsgbot>	 !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool db1072 for maintenance (duration: 00m 31s)
[07:18:45] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[07:19:39] <grrrit-wm>	 (03Abandoned) 10Giuseppe Lavagetto: puppet: install msgpack and allow switching it on/off [puppet] - 10https://gerrit.wikimedia.org/r/286141 (owner: 10Giuseppe Lavagetto)
[07:19:45] <icinga-wm>	 PROBLEM - puppet last run on achernar is CRITICAL: CRITICAL: puppet fail
[07:23:50] <jynus>	 !log backing up and reimaging db1072
[07:23:55] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[07:25:32] <grrrit-wm>	 (03CR) 10Giuseppe Lavagetto: "The reason this is not part of base::service_unit is that it's systemd only." [puppet] - 10https://gerrit.wikimedia.org/r/291949 (owner: 10Giuseppe Lavagetto)
[07:26:28] <grrrit-wm>	 (03PS4) 10Muehlenhoff: services firejail: make fs blacklist more obvious [puppet] - 10https://gerrit.wikimedia.org/r/293515 (owner: 10JanZerebecki)
[07:26:42] <grrrit-wm>	 (03PS3) 10Giuseppe Lavagetto: systemd: add systemd::sidekick [puppet] - 10https://gerrit.wikimedia.org/r/291949 
[07:29:38] <grrrit-wm>	 (03PS4) 10Giuseppe Lavagetto: systemd: add systemd::sidekick [puppet] - 10https://gerrit.wikimedia.org/r/291949 
[07:30:15] <grrrit-wm>	 (03CR) 10Nikerabbit: [C: 04-1] Deploy Compact Language Links as default (Stage 1) (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294874 (https://phabricator.wikimedia.org/T136677) (owner: 10KartikMistry)
[07:32:05] <grrrit-wm>	 (03PS4) 10Giuseppe Lavagetto: salt: add wmfpuppet module [puppet] - 10https://gerrit.wikimedia.org/r/294694 
[07:33:05] <icinga-wm>	 PROBLEM - puppet last run on restbase1007 is CRITICAL: CRITICAL: Puppet has 1 failures
[07:34:35] <wikibugs>	 06Operations, 06Collaboration-Team-Interested, 10DBA, 10Flow, 07WorkType-Maintenance: Setup separate logical External Store for Flow in production - https://phabricator.wikimedia.org/T107610#2388011 (10matthiasmullie) The very first Flow commit was Wed Jul 10 23:05:11 2013 +0100. It was enabled on labs o...
[07:40:40] <grrrit-wm>	 (03PS10) 10Yuvipanda: prometheus: add server support [puppet] - 10https://gerrit.wikimedia.org/r/280652 (https://phabricator.wikimedia.org/T126785) (owner: 10Filippo Giunchedi)
[07:40:52] <grrrit-wm>	 (03CR) 10Yuvipanda: [C: 032] "Gonna meeeerge!" [puppet] - 10https://gerrit.wikimedia.org/r/280652 (https://phabricator.wikimedia.org/T126785) (owner: 10Filippo Giunchedi)
[07:41:01] <grrrit-wm>	 (03CR) 10Yuvipanda: [V: 032] "Gonna meeeerge!" [puppet] - 10https://gerrit.wikimedia.org/r/280652 (https://phabricator.wikimedia.org/T126785) (owner: 10Filippo Giunchedi)
[07:42:40] <grrrit-wm>	 (03PS4) 10Yuvipanda: prometheus: add nginx reverse proxy [puppet] - 10https://gerrit.wikimedia.org/r/290479 (https://phabricator.wikimedia.org/T126785) (owner: 10Filippo Giunchedi)
[07:43:17] <grrrit-wm>	 (03PS5) 10Giuseppe Lavagetto: salt: add wmfpuppet module [puppet] - 10https://gerrit.wikimedia.org/r/294694 
[07:44:33] <grrrit-wm>	 (03PS1) 10Yuvipanda: tools: Restart kube2proxy when it dies [puppet] - 10https://gerrit.wikimedia.org/r/294888 
[07:45:00] <grrrit-wm>	 (03PS4) 10KartikMistry: Deploy Compact Language Links as default (Stage 1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294874 (https://phabricator.wikimedia.org/T136677) 
[07:45:35] <icinga-wm>	 RECOVERY - puppet last run on achernar is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures
[07:45:38] <grrrit-wm>	 (03CR) 10KartikMistry: Deploy Compact Language Links as default (Stage 1) (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294874 (https://phabricator.wikimedia.org/T136677) (owner: 10KartikMistry)
[07:46:36] <grrrit-wm>	 (03CR) 10Giuseppe Lavagetto: [C: 032] salt: add wmfpuppet module [puppet] - 10https://gerrit.wikimedia.org/r/294694 (owner: 10Giuseppe Lavagetto)
[07:46:40] <grrrit-wm>	 (03PS2) 10Yuvipanda: tools: Restart kube2proxy when it dies [puppet] - 10https://gerrit.wikimedia.org/r/294888 
[07:46:54] <yuvipanda>	 joe can you +1 ^? (just a Restart=always)
[07:47:39] <grrrit-wm>	 (03CR) 10Giuseppe Lavagetto: [C: 031] tools: Restart kube2proxy when it dies [puppet] - 10https://gerrit.wikimedia.org/r/294888 (owner: 10Yuvipanda)
[07:47:51] <yuvipanda>	 joe thanks
[07:48:02] <grrrit-wm>	 (03PS3) 10Yuvipanda: tools: Restart kube2proxy when it dies [puppet] - 10https://gerrit.wikimedia.org/r/294888 
[07:48:14] <grrrit-wm>	 (03CR) 10Yuvipanda: [C: 032 V: 032] tools: Restart kube2proxy when it dies [puppet] - 10https://gerrit.wikimedia.org/r/294888 (owner: 10Yuvipanda)
[07:55:31] <wikibugs>	 06Operations, 06Discovery, 06Services, 03Maps-Sprint, 13Patch-For-Review: Allow configuration of contact groups for monitoring of services - https://phabricator.wikimedia.org/T137891#2382756 (10mobrovac) I'm not a fan of this solution. The defaults come from hiera, so can't you just change the hiera valu...
[07:59:42] <grrrit-wm>	 (03PS1) 10Yuvipanda: Add a static-web [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/294889 
[08:00:01] <grrrit-wm>	 (03CR) 10jenkins-bot: [V: 04-1] Add a static-web [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/294889 (owner: 10Yuvipanda)
[08:00:03] <grrrit-wm>	 (03PS2) 10Yuvipanda: Add a static-web [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/294889 
[08:10:01] <grrrit-wm>	 (03CR) 10WMDE-Fisch: [C: 031] tools: Install jdk8 in trusty nodes [puppet] - 10https://gerrit.wikimedia.org/r/292960 (https://phabricator.wikimedia.org/T121279) (owner: 10Yuvipanda)
[08:11:14] <wikibugs>	 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate, 13Patch-For-Review: write Apache rewrite rules for  gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2368961 (10Paladox) But I think this task is also about uploaded the rewrites and then remove the server behind git.w...
[08:12:03] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:16:22] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy
[08:23:16] <wikibugs>	 06Operations: Create backup/restore scripts for etcd - https://phabricator.wikimedia.org/T135129#2388070 (10Joe) a:03Joe
[08:24:54] <wikibugs>	 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate, 13Patch-For-Review: write Apache rewrite rules for  gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2388071 (10greg) 05Resolved>03Open >>! In T137224#2387714, @Danny_B wrote: > Rules are written. Deploying them is...
[08:29:06] <hashar>	 !log Restarting Jenkins on gallium. Web interface at least is deadlocked somehow
[08:29:10] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:30:01] <jynus>	 probably it is not jenkins but git
[08:30:37] <hashar>	 yeah I have seen a defunct git child
[08:30:49] <hashar>	 and jstack did not produce anything helpful :(
[08:33:36] <wikibugs>	 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate, 13Patch-For-Review: write Apache rewrite rules for  gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2388125 (10greg) These, I believe, should be merged for this to be done: * https://gerrit.wikimedia.org/r/#/c/293789/...
[08:34:02] <icinga-wm>	 PROBLEM - puppet last run on mw1135 is CRITICAL: CRITICAL: Puppet has 57 failures
[08:36:32] <wikibugs>	 06Operations, 06Discovery, 06Maps, 07Epic: Epic: switch Maps to production status - https://phabricator.wikimedia.org/T133744#2388141 (10Gehel) Referrer check has been removed (T137848) which allow more experimentation to happen. I'd still like to validate our automation of setting up new nodes and the rel...
[08:40:08] <Amir1>	 gerrit seems to be down
[08:41:00] <Amir1>	 and labs
[08:41:19] <Amir1>	 DNS issues 
[08:41:22] <gehel>	 Amir1: seems to work for me...
[08:42:06] <gehel>	 Amir1: I just checked the gerrit web ui and pulling a repo, there might be other issues...
[08:42:15] <Amir1>	 gehel: maybe that's a MENA issue
[08:42:25] <Amir1>	 back up now
[08:42:35] <gehel>	 Amir1: MENA ?
[08:42:37] <Amir1>	 Lydia_WMDE and I couldn't connet 
[08:42:40] * gehel is not good with acronyms
[08:42:41] <Amir1>	 *connect 
[08:42:52] <icinga-wm>	 PROBLEM - Host mr1-esams.oob is DOWN: PING CRITICAL - Packet loss = 100%
[08:42:57] <Amir1>	 Middle East and North Africa
[08:43:14] <Nikerabbit>	 also having problem from europe, no dns or slow loading
[08:43:22] <Amir1>	 I'm in Berlin though now
[08:44:09] <paravoid>	 hey
[08:44:12] <paravoid>	 traceroutes?
[08:44:16] <paravoid>	 and your IP
[08:44:39] <paravoid>	 Amir1, Nikerabbit ^^
[08:44:56] <_joe_>	 no dns for what specifically?
[08:45:34] <Amir1>	 doing it right now
[08:45:42] <Nemo_bis>	 Failed to resolve host: No address associated with hostname
[08:45:47] <Nemo_bis>	 $ mtr -w -c 20 wmflabs.org
[08:45:51] <Nemo_bis>	 but now is back
[08:45:51] <paravoid>	 I need your IPs and traceroutes
[08:46:04] <Amir1>	 https://www.irccloud.com/pastebin/bry86UjK/
[08:46:42] <Amir1>	 my IP: 80.153.119.142
[08:46:45] <godog>	 there's v4 icmp failures from atlas too, https://atlas.ripe.net/measurements/1790945/#!map https://atlas.ripe.net/measurements/1791307/#!map https://atlas.ripe.net/measurements/1791210/#!map
[08:46:49] <Amir1>	 (WMDE office)
[08:47:19] <paravoid>	 seems to work now, right?
[08:47:34] <Amir1>	 sometimes it does, sometimes it doesn't
[08:47:42] <Amir1>	 let me check again
[08:48:21] <Amir1>	 https://www.irccloud.com/pastebin/5Ig9h6EK/
[08:48:32] <Amir1>	 This one looks different ^
[08:48:44] <icinga-wm>	 ACKNOWLEDGEMENT - puppet last run on ms-be2012 is CRITICAL: CRITICAL: Puppet has 1 failures Filippo Giunchedi sdk failed https://phabricator.wikimedia.org/T135975
[08:48:52] <Amir1>	 but I'm not super familiar with traceroute 
[08:49:13] <icinga-wm>	 RECOVERY - Host mr1-esams.oob is UP: PING WARNING - Packet loss = 50%, RTA = 81.76 ms
[08:49:39] <Amir1>	 Everything looks okay now to me
[08:50:02] <paravoid>	 nod
[08:50:10] <paravoid>	 looks like issues with Telia
[08:50:23] <icinga-wm>	 PROBLEM - puppet last run on mw1140 is CRITICAL: CRITICAL: Puppet has 14 failures
[08:50:54] <paladox>	 Hi, could someone remove the production branch from production "Track Only" in https://phabricator.wikimedia.org/diffusion/OPUP/manage/branches/ please
[08:51:37] <paladox>	 It will allow refs/changes/ to be processed
[08:51:42] <paladox>	 Please
[08:51:57] <paravoid>	 Amir1: looks like it has recovered -- let me know if you experience any trouble again
[08:52:12] <icinga-wm>	 PROBLEM - changeprop endpoints health on scb2002 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.192.48.43, port=7272): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused)))
[08:52:54] <Amir1>	 paravoid: sure, thank you :)
[08:53:13] <icinga-wm>	 PROBLEM - test icmp reachability to eqiad on ripe-atlas-eqiad is CRITICAL: CRITICAL - failed 26 probes of 388 (alerts on 19) - https://atlas.ripe.net/measurements/1790945/#!map
[08:53:20] <paravoid>	 hey that's nice
[08:53:33] <paravoid>	 a little late, but nice!
[08:54:56] <godog>	 hehe indeed, that's how I noticed earlier, perhaps it wasn't critical for long enough to notify on irc, it was in the web interface tho
[08:55:53] <icinga-wm>	 PROBLEM - changeprop endpoints health on scb1002 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.16.21, port=7272): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused)))
[08:55:55] <grrrit-wm>	 (03PS4) 10Filippo Giunchedi: DNS: Add mgmt DNS entries for ms-be2022 to ms-be2027 [dns] - 10https://gerrit.wikimedia.org/r/294543 (https://phabricator.wikimedia.org/T136630) (owner: 10Papaul)
[08:56:02] <grrrit-wm>	 (03CR) 10Filippo Giunchedi: [C: 032 V: 032] DNS: Add mgmt DNS entries for ms-be2022 to ms-be2027 [dns] - 10https://gerrit.wikimedia.org/r/294543 (https://phabricator.wikimedia.org/T136630) (owner: 10Papaul)
[08:56:52] <icinga-wm>	 RECOVERY - puppet last run on restbase1007 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[08:56:58] <paravoid>	 sigh at our icinga dashboard :(
[08:57:04] <paravoid>	 so many failures
[08:57:25] <grrrit-wm>	 (03PS3) 10Filippo Giunchedi: DHCP: Add MAC address entries for ms-be202[2-7] [puppet] - 10https://gerrit.wikimedia.org/r/294863 (https://phabricator.wikimedia.org/T136630) (owner: 10Papaul)
[08:57:38] <grrrit-wm>	 (03PS4) 10Filippo Giunchedi: DHCP: Add MAC address entries for ms-be202[2-7] [puppet] - 10https://gerrit.wikimedia.org/r/294863 (https://phabricator.wikimedia.org/T136630) (owner: 10Papaul)
[08:57:46] <grrrit-wm>	 (03CR) 10Filippo Giunchedi: [C: 032 V: 032] DHCP: Add MAC address entries for ms-be202[2-7] [puppet] - 10https://gerrit.wikimedia.org/r/294863 (https://phabricator.wikimedia.org/T136630) (owner: 10Papaul)
[08:58:33] <hashar>	 godog: Icinga checks ripe-atlas-eqiad every 5 minutes. On the first two failures it puts the check in SOFT state, which does not trigger a notification. The third failure makes it a HARD state, which does trigger a notification
[08:58:39] <hashar>	 godog: history https://icinga.wikimedia.org/cgi-bin/icinga/history.cgi?host=ripe-atlas-eqiad&service=test+icmp+reachability+to+eqiad
[08:59:09] <hashar>	 godog: with first encounter at 08:41 utc and  HARD state reached at 08:53 matching the time of the IRC message
[09:00:22] <godog>	 hashar: indeed, thanks! yeah the rest for codfw/ulsfo has been too brief to go in HARD state and notify
[09:00:32] <hashar>	 check states are described on http://docs.icinga.org/latest/en/statetypes.html
[09:00:37] <hashar>	 and that is often a source of confusion :(
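
The SOFT/HARD behaviour hashar describes above can be sketched as a small state machine. This is a minimal illustration, not Icinga's implementation: the `ServiceCheck` class, its `record` method, and the default of 3 attempts are hypothetical names and values chosen to match the "third failure is HARD" description. Re-notifications and check intervals are deliberately ignored.

```python
from dataclasses import dataclass


@dataclass
class ServiceCheck:
    """Toy model of SOFT vs HARD check states (assumed, simplified)."""

    max_check_attempts: int = 3  # assumption: third consecutive failure goes HARD
    attempt: int = 0             # consecutive failed checks seen so far
    state_type: str = "HARD"     # a healthy service sits in HARD OK
    ok: bool = True

    def record(self, check_ok: bool) -> bool:
        """Feed one check result; return True if a notification would be sent."""
        if check_ok:
            # Only a recovery from a HARD problem notifies; a SOFT recovery is silent.
            notify = (not self.ok) and self.state_type == "HARD"
            self.ok = True
            self.attempt = 0
            self.state_type = "HARD"
            return notify

        if not self.ok and self.state_type == "HARD":
            # Already in a HARD problem state; this sketch does not re-notify.
            return False

        self.ok = False
        self.attempt += 1
        if self.attempt >= self.max_check_attempts:
            self.state_type = "HARD"
            return True   # PROBLEM notification on reaching HARD
        self.state_type = "SOFT"
        return False      # still retrying, no notification yet
```

Under these assumptions, three consecutive failures produce one PROBLEM notification on the third check, while a failure that clears after one or two checks never notifies, which is consistent with the codfw/ulsfo failures being too brief to show up on IRC.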
[09:00:50] <grrrit-wm>	 (03PS1) 10Jcrespo: Install jessie on db1072 [puppet] - 10https://gerrit.wikimedia.org/r/294893 
[09:01:32] <grrrit-wm>	 (03PS2) 10Filippo Giunchedi: adding install params for ms-be202[2-7] Bug:T136630 [puppet] - 10https://gerrit.wikimedia.org/r/294866 (https://phabricator.wikimedia.org/T136630) (owner: 10Papaul)
[09:01:42] <grrrit-wm>	 (03PS3) 10Filippo Giunchedi: adding install params for ms-be202[2-7] [puppet] - 10https://gerrit.wikimedia.org/r/294866 (https://phabricator.wikimedia.org/T136630) (owner: 10Papaul)
[09:01:56] <grrrit-wm>	 (03CR) 10Filippo Giunchedi: [C: 032 V: 032] adding install params for ms-be202[2-7] [puppet] - 10https://gerrit.wikimedia.org/r/294866 (https://phabricator.wikimedia.org/T136630) (owner: 10Papaul)
[09:02:55] <grrrit-wm>	 (03PS2) 10Jcrespo: Install jessie on db1072 [puppet] - 10https://gerrit.wikimedia.org/r/294893 
[09:03:41] <icinga-wm>	 RECOVERY - test icmp reachability to eqiad on ripe-atlas-eqiad is OK: OK - failed 0 probes of 388 (alerts on 19) - https://atlas.ripe.net/measurements/1790945/#!map
[09:03:54] <wikibugs>	 06Operations, 06Discovery, 06Maps, 07Epic: Epic: switch Maps to production status - https://phabricator.wikimedia.org/T133744#2388214 (10Gehel)
[09:04:20] <grrrit-wm>	 (03PS2) 10Gehel: Adding gehel (Guillaume Lederrey) to icinga sms group [puppet] - 10https://gerrit.wikimedia.org/r/294486 
[09:06:19] <grrrit-wm>	 (03CR) 10Jcrespo: [C: 032] Install jessie on db1072 [puppet] - 10https://gerrit.wikimedia.org/r/294893 (owner: 10Jcrespo)
[09:06:44] <grrrit-wm>	 (03CR) 10Gehel: [C: 032] Adding gehel (Guillaume Lederrey) to icinga sms group [puppet] - 10https://gerrit.wikimedia.org/r/294486 (owner: 10Gehel)
[09:07:41] <grrrit-wm>	 (03PS3) 10Gehel: Adding gehel (Guillaume Lederrey) to icinga sms group [puppet] - 10https://gerrit.wikimedia.org/r/294486 
[09:07:42] <icinga-wm>	 RECOVERY - changeprop endpoints health on scb2002 is OK: All endpoints are healthy
[09:08:45] <wikibugs>	 06Operations, 10ops-codfw, 10media-storage, 13Patch-For-Review: rack/setup/deploy ms-be202[2-7] - https://phabricator.wikimedia.org/T136630#2388217 (10fgiunchedi) >>! In T136630#2387712, @Papaul wrote: > @fgiunchedi the other ms-be* systems have Trusty installed are we also installing Trusty on the new sys...
[09:08:51] <icinga-wm>	 RECOVERY - changeprop endpoints health on scb1002 is OK: All endpoints are healthy
[09:15:38] <icinga-wm>	 PROBLEM - Apache HTTP on mw1140 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:16:09] <icinga-wm>	 PROBLEM - HHVM rendering on mw1140 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:17:37] <icinga-wm>	 PROBLEM - Check size of conntrack table on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:17:57] <icinga-wm>	 PROBLEM - DPKG on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:19:07] <icinga-wm>	 PROBLEM - configured eth on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:19:19] <icinga-wm>	 PROBLEM - dhclient process on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:19:37] <icinga-wm>	 PROBLEM - nutcracker port on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:20:08] <icinga-wm>	 PROBLEM - salt-minion processes on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:21:08] <icinga-wm>	 PROBLEM - nutcracker process on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:25:37] <icinga-wm>	 PROBLEM - HHVM processes on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:27:28] <icinga-wm>	 PROBLEM - SSH on mw1140 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:27:28] <icinga-wm>	 PROBLEM - Disk space on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:30:08] <icinga-wm>	 PROBLEM - HHVM rendering on mw1135 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:30:34] <moritzm>	 !log rolling reboot of mw1153,mw1155,mw1156 into new kernels
[09:30:38] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:31:13] <_joe_>	 !log powercycling mw1140, OOMd
[09:31:16] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:31:18] <icinga-wm>	 PROBLEM - Apache HTTP on mw1135 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.016 second response time
[09:31:49] <icinga-wm>	 RECOVERY - puppet last run on mw1135 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures
[09:32:11] <grrrit-wm>	 (03PS3) 10Alexandros Kosiaris: networks::constants: use slice_network_constants [puppet] - 10https://gerrit.wikimedia.org/r/291819 
[09:32:13] <grrrit-wm>	 (03PS29) 10Alexandros Kosiaris: network: add $production_networks [puppet] - 10https://gerrit.wikimedia.org/r/260926 (https://phabricator.wikimedia.org/T122396) (owner: 10Faidon Liambotis)
[09:32:56] <icinga-wm>	 RECOVERY - SSH on mw1140 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.7 (protocol 2.0)
[09:33:15] <icinga-wm>	 RECOVERY - nutcracker port on mw1140 is OK: TCP OK - 0.000 second response time on port 11212
[09:33:17] <icinga-wm>	 RECOVERY - nutcracker process on mw1140 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[09:33:46] <icinga-wm>	 RECOVERY - HHVM rendering on mw1140 is OK: HTTP OK: HTTP/1.1 200 OK - 70860 bytes in 1.646 second response time
[09:33:46] <icinga-wm>	 RECOVERY - salt-minion processes on mw1140 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[09:34:07] <icinga-wm>	 RECOVERY - configured eth on mw1140 is OK: OK - interfaces up
[09:34:26] <icinga-wm>	 RECOVERY - dhclient process on mw1140 is OK: PROCS OK: 0 processes with command name dhclient
[09:34:36] <icinga-wm>	 RECOVERY - Apache HTTP on mw1140 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.077 second response time
[09:34:56] <icinga-wm>	 RECOVERY - Check size of conntrack table on mw1140 is OK: OK: nf_conntrack is 11 % full
[09:35:15] <icinga-wm>	 RECOVERY - DPKG on mw1140 is OK: All packages OK
[09:35:26] <icinga-wm>	 RECOVERY - Disk space on mw1140 is OK: DISK OK
[09:35:46] <icinga-wm>	 RECOVERY - HHVM processes on mw1140 is OK: PROCS OK: 6 processes with command name hhvm
[09:36:58] <wikibugs>	 06Operations, 10netops, 13Patch-For-Review: block labs IPs from sending data to prod ganglia - https://phabricator.wikimedia.org/T115330#2388249 (10akosiaris) So, this is happening due to carbon having a public IP and somewhat relax firewall rules. The premise is that hosts with public IPs have that due to t...
[09:37:26] <icinga-wm>	 RECOVERY - puppet last run on mw1140 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:41:52] <wikibugs>	 06Operations, 10netops, 13Patch-For-Review: block labs IPs from sending data to prod ganglia - https://phabricator.wikimedia.org/T115330#2388256 (10Krenair) phab-01.phabricator.eqiad.wmflabs should no longer be doing it
[09:50:07] <grrrit-wm>	 (03CR) 10Paladox: [C: 04-1] "Rewrite rules need to be updated to https://phabricator.wikimedia.org/P3262 please." [puppet] - 10https://gerrit.wikimedia.org/r/294784 (https://phabricator.wikimedia.org/T137224) (owner: 10JanZerebecki)
[09:53:13] <grrrit-wm>	 (03PS3) 10JanZerebecki: Rewrite rules for git.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/294867 (https://phabricator.wikimedia.org/T137224) (owner: 1020after4)
[09:53:16] <grrrit-wm>	 (03Abandoned) 10JanZerebecki: Add gitblit compatibility apache vhost to phabricator [puppet] - 10https://gerrit.wikimedia.org/r/294784 (https://phabricator.wikimedia.org/T137224) (owner: 10JanZerebecki)
[09:54:13] <grrrit-wm>	 (03CR) 10JanZerebecki: "Merged in some pieces of https://gerrit.wikimedia.org/r/#/c/294867/" [puppet] - 10https://gerrit.wikimedia.org/r/294867 (https://phabricator.wikimedia.org/T137224) (owner: 1020after4)
[09:55:48] <grrrit-wm>	 (03CR) 10Paladox: [C: 031] "Thankyou." [puppet] - 10https://gerrit.wikimedia.org/r/294867 (https://phabricator.wikimedia.org/T137224) (owner: 1020after4)
[10:02:02] <grrrit-wm>	 (03CR) 10Danny B.: [C: 031] Rewrite rules for git.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/294867 (https://phabricator.wikimedia.org/T137224) (owner: 1020after4)
[10:24:03] <wikibugs>	 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate, 13Patch-For-Review: write Apache rewrite rules for  gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2388330 (10mmodell) Greg: I guess https://gerrit.wikimedia.org/r/#/c/294867/ supersedes the others.  We also need to...
[10:32:54] <wikibugs>	 06Operations: setup syslog server in codfw - https://phabricator.wikimedia.org/T138073#2388340 (10fgiunchedi)
[10:38:14] <wikibugs>	 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate, 13Patch-For-Review: write Apache rewrite rules for  gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2388379 (10Paladox) @mmodell I guess we can ask ops to do the dns bit please.
[10:39:18] <wikibugs>	 06Operations, 10procurement: procure syslog hardware in codfw - https://phabricator.wikimedia.org/T138075#2388384 (10fgiunchedi)
[10:39:53] <grrrit-wm>	 (03CR) 10JanZerebecki: Rewrite rules for git.wikimedia.org (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/294867 (https://phabricator.wikimedia.org/T137224) (owner: 1020after4)
[10:47:03] <wikibugs>	 06Operations, 10hardware-requests: procure syslog hardware in codfw - https://phabricator.wikimedia.org/T138075#2388405 (10Peachey88)
[10:50:58] <wikibugs>	 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate, 13Patch-For-Review: write Apache rewrite rules for  gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2388406 (10JanZerebecki) >>! In T137224#2388330, @mmodell wrote: > Greg: I guess https://gerrit.wikimedia.org/r/#/c/2...
[10:54:14] <grrrit-wm>	 (03CR) 10JanZerebecki: "I have seen no one ask for it not being on iridium, so I think it is good that this is in the same patch. (If the tests were being execute" [puppet] - 10https://gerrit.wikimedia.org/r/293789 (https://phabricator.wikimedia.org/T137224) (owner: 10Dzahn)
[10:56:10] <wikibugs>	 06Operations, 10Dumps-Generation, 07HHVM, 13Patch-For-Review: Convert snapshot hosts to use HHVM and trusty - https://phabricator.wikimedia.org/T94277#2388427 (10ArielGlenn) https://github.com/facebook/hhvm/issues/7075 has just been closed after this commit: https://github.com/facebook/hhvm/commit/9d2be6c3...
[10:59:05] <grrrit-wm>	 (03CR) 10Paladox: [C: 031] varnish: git.wm.org to antimony, remove git-related config/tests [puppet] - 10https://gerrit.wikimedia.org/r/293789 (https://phabricator.wikimedia.org/T137224) (owner: 10Dzahn)
[11:00:11] <wikibugs>	 06Operations, 10Android-app-feature-Feeds, 10Mobile-Content-Service, 10RESTBase, and 3 others: Investigate Android app API request latency regression - https://phabricator.wikimedia.org/T138010#2388438 (10BBlack) I don't even understand the initial thing the ticket is complaining about.  Can you explain to...
[11:00:55] <wikibugs>	 06Operations: kvm on ganeti instances getting stuck - https://phabricator.wikimedia.org/T134242#2388442 (10akosiaris)
[11:00:57] <wikibugs>	 06Operations, 10vm-requests: EQIAD: (1) VM request for url-downloader - https://phabricator.wikimedia.org/T134496#2388441 (10akosiaris)
[11:05:42] <grrrit-wm>	 (03CR) 10JanZerebecki: [C: 04-1] varnish: git.wm.org to antimony, remove git-related config/tests (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/293789 (https://phabricator.wikimedia.org/T137224) (owner: 10Dzahn)
[11:10:51] <grrrit-wm>	 (03CR) 10JanZerebecki: varnish: git.wm.org to antimony, remove git-related config/tests (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/293789 (https://phabricator.wikimedia.org/T137224) (owner: 10Dzahn)
[11:13:26] <grrrit-wm>	 (03CR) 10Paladox: "@JanZerebecki would you be able to do ^^ please since @dzahn is on holiday." [puppet] - 10https://gerrit.wikimedia.org/r/293789 (https://phabricator.wikimedia.org/T137224) (owner: 10Dzahn)
[11:14:06] <moritzm>	 !log stopping puppet on hosts using service::node (restbase, sca, scb, aqs) for step-by-step rollout of two puppet patches for firejail/service::node
[11:14:11] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:16:59] <grrrit-wm>	 (03PS1) 10Jcrespo: Install jessie on all db eqiad hosts > db1050 [puppet] - 10https://gerrit.wikimedia.org/r/294904 
[11:17:03] <grrrit-wm>	 (03PS5) 10Muehlenhoff: services firejail: make fs blacklist more obvious [puppet] - 10https://gerrit.wikimedia.org/r/293515 (owner: 10JanZerebecki)
[11:19:24] <grrrit-wm>	 (03PS1) 10Jcrespo: Repool db1072 with low weight, depool db1073 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294905 
[11:20:20] <grrrit-wm>	 (03CR) 10Jcrespo: [C: 032] Install jessie on all db eqiad hosts > db1050 [puppet] - 10https://gerrit.wikimedia.org/r/294904 (owner: 10Jcrespo)
[11:20:26] <grrrit-wm>	 (03CR) 10Muehlenhoff: [C: 032 V: 032] services firejail: make fs blacklist more obvious [puppet] - 10https://gerrit.wikimedia.org/r/293515 (owner: 10JanZerebecki)
[11:20:33] <grrrit-wm>	 (03PS6) 10Muehlenhoff: services firejail: make fs blacklist more obvious [puppet] - 10https://gerrit.wikimedia.org/r/293515 (owner: 10JanZerebecki)
[11:20:42] <grrrit-wm>	 (03CR) 10Muehlenhoff: [V: 032] services firejail: make fs blacklist more obvious [puppet] - 10https://gerrit.wikimedia.org/r/293515 (owner: 10JanZerebecki)
[11:21:48] <jynus>	 I think there is a race condition here
[11:22:06] <moritzm>	 mine is merged now, so should be resolved :-)
[11:22:15] <jynus>	 no, that is the bad thing
[11:23:00] <jynus>	 look https://phabricator.wikimedia.org/P3266
[11:23:20] <jynus>	 it merges more than promised
[11:23:57] <jynus>	 and it doesn't warn there are 2 changes
[11:24:09] <moritzm>	 weird, puppet-merge only showed me my change
[11:24:32] <jynus>	 mmm
[11:24:41] <jynus>	 maybe you merged at the same time?
[11:25:14] <jynus>	 so running 2 is "transactional"?
[11:25:15] <icinga-wm>	 PROBLEM - check_mysql on fdb2001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 2224
[11:25:28] <moritzm>	 could be, given the timing of the gerrit-wm output
[11:25:50] <jynus>	 also, your merge did not complain
[11:25:55] <jynus>	 on gerrit
[11:28:13] <grrrit-wm>	 (03PS2) 10Muehlenhoff: Remove --tmpfs option in service::node and zotero [puppet] - 10https://gerrit.wikimedia.org/r/294483 (https://phabricator.wikimedia.org/T121756) 
[11:28:40] <moritzm>	 jynus: let's create a task, so that it can be re-checked with the new puppet next quarter?
[11:30:35] <grrrit-wm>	 (03CR) 10Muehlenhoff: [C: 032 V: 032] Remove --tmpfs option in service::node and zotero [puppet] - 10https://gerrit.wikimedia.org/r/294483 (https://phabricator.wikimedia.org/T121756) (owner: 10Muehlenhoff)
[11:32:53] <bblack>	 there's definitely no transactionality to what puppet-merge does
[11:34:07] <jynus>	 maybe add a .lock ?
[11:34:37] <jynus>	 but this issue is different: it is accepting one change and merging 2
[11:35:10] <jynus>	 so it is that, and if the changes accepted != the ones about to merge, quit
[11:35:15] <icinga-wm>	 RECOVERY - check_mysql on fdb2001 is OK: Uptime: 307176 Threads: 1 Questions: 3741336 Slow queries: 2072 Opens: 651 Flush tables: 2 Open tables: 572 Queries per second avg: 12.179 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0
[11:37:36] <grrrit-wm>	 (03CR) 10Alexandros Kosiaris: [C: 031] "I am seeing 24G for /srv, 80K for /var/lib/zuul, and /var/lib/jenkins is a 512M tempfs.." [puppet] - 10https://gerrit.wikimedia.org/r/293690 (https://phabricator.wikimedia.org/T80385) (owner: 10Muehlenhoff)
[12:04:04] <grrrit-wm>	 (03CR) 10Alexandros Kosiaris: [C: 04-1] "Actually no, /var/lib/jenkins/tmpfs is a tmpfs... still trying to run the numbers..." [puppet] - 10https://gerrit.wikimedia.org/r/293690 (https://phabricator.wikimedia.org/T80385) (owner: 10Muehlenhoff)
[12:08:05] <icinga-wm>	 PROBLEM - puppet last run on cp3015 is CRITICAL: CRITICAL: puppet fail
[12:08:15] <icinga-wm>	 PROBLEM - puppet last run on carbon is CRITICAL: CRITICAL: Puppet has 1 failures
[12:08:34] <grrrit-wm>	 (03PS5) 10Nikerabbit: Deploy Compact Language Links as default (Stage 1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294874 (https://phabricator.wikimedia.org/T136677) (owner: 10KartikMistry)
[12:09:49] <grrrit-wm>	 (03CR) 10Nikerabbit: [C: 031] "Code looks good, did not test." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294874 (https://phabricator.wikimedia.org/T136677) (owner: 10KartikMistry)
[12:14:29] <grrrit-wm>	 (03PS1) 10Ctrochalakis: zookeeper::jmxtrans: Expose statsd & group_prefix parameters [puppet/zookeeper] - 10https://gerrit.wikimedia.org/r/294906 
[12:15:18] <grrrit-wm>	 (03CR) 10Alexandros Kosiaris: "so," [puppet] - 10https://gerrit.wikimedia.org/r/293690 (https://phabricator.wikimedia.org/T80385) (owner: 10Muehlenhoff)
[12:16:38] <wikibugs>	 06Operations, 10Analytics-Cluster, 10Traffic: Respect X-Forwarded-For only from trustworthy sources - https://phabricator.wikimedia.org/T56783#2388494 (10BBlack) nice find in the old tickets! See also: T120121
[12:18:19] <akosiaris>	 hasharAway: got a question for you in https://gerrit.wikimedia.org/r/293690
[12:27:27] <icinga-wm>	 RECOVERY - HHVM rendering on mw1135 is OK: HTTP OK: HTTP/1.1 200 OK - 70840 bytes in 9.964 second response time
[12:27:36] <icinga-wm>	 RECOVERY - Apache HTTP on mw1133 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.431 second response time
[12:27:53] <moritzm>	 !log restarted hhvm on mw1133 and mw1135
[12:27:57] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:28:36] <icinga-wm>	 RECOVERY - Apache HTTP on mw1135 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.092 second response time
[12:29:45] <icinga-wm>	 RECOVERY - HHVM rendering on mw1133 is OK: HTTP OK: HTTP/1.1 200 OK - 71402 bytes in 0.904 second response time
[12:31:55] <icinga-wm>	 RECOVERY - puppet last run on mw1133 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures
[12:32:46] <icinga-wm>	 RECOVERY - puppet last run on cp3015 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[12:39:01] <grrrit-wm>	 (03PS1) 10Catrope: Enable Flow beta feature on frwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294909 (https://phabricator.wikimedia.org/T138064) 
[12:47:56] <grrrit-wm>	 (03CR) 10Jcrespo: [C: 032] Repool db1072 with low weight, depool db1073 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294905 (owner: 10Jcrespo)
[12:49:04] <moritzm>	 !log rolling reboot of mw1157-mw1160 into new kernels
[12:49:08] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:49:53] <logmsgbot>	 !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1072 with low weight, depool db1073 (duration: 00m 27s)
[12:49:57] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:54:38] <grrrit-wm>	 (03PS1) 10Jcrespo: Increase db1072 weight after repooling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294910 
[12:57:29] <jynus>	 !log stopping, backing up and reimaging db1073
[12:57:35] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:59:08] <wikibugs>	 06Operations, 10Analytics-Cluster, 10Traffic: Respect X-Forwarded-For only from trustworthy sources - https://phabricator.wikimedia.org/T56783#2388543 (10Yurik) We should move all proxies out of the zerowiki and into metawiki. This will allow a much more transparent management of the proxy ips, and won't as...
[13:02:20] <wikibugs>	 06Operations, 10Analytics-Cluster, 10Traffic: Respect X-Forwarded-For only from trustworthy sources - https://phabricator.wikimedia.org/T56783#2388560 (10BBlack) >>! In T56783#2388543, @Yurik wrote: > We should move all proxies out of the zerowiki and into metawiki. This will allow a much more transparent ma...
[13:02:36] <icinga-wm>	 PROBLEM - puppet last run on restbase1007 is CRITICAL: CRITICAL: Puppet has 1 failures
[13:03:47] <wikibugs>	 06Operations, 06Discovery, 06Maps, 03Maps-Sprint: Rack/Setup 4 map servers in eqiad - https://phabricator.wikimedia.org/T135018#2388568 (10Gehel)
[13:10:42] <grrrit-wm>	 (03CR) 10Alexandros Kosiaris: [C: 031] systemd: add systemd::sidekick [puppet] - 10https://gerrit.wikimedia.org/r/291949 (owner: 10Giuseppe Lavagetto)
[13:25:09] <grrrit-wm>	 (03PS2) 10JanZerebecki: varnish: git.wm.org to iridium, remove related config/tests/monitoring [puppet] - 10https://gerrit.wikimedia.org/r/293789 (https://phabricator.wikimedia.org/T137224) (owner: 10Dzahn)
[13:26:06] <grrrit-wm>	 (03CR) 10JanZerebecki: varnish: git.wm.org to iridium, remove related config/tests/monitoring (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/293789 (https://phabricator.wikimedia.org/T137224) (owner: 10Dzahn)
[13:27:25] <icinga-wm>	 RECOVERY - puppet last run on restbase1007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[13:42:47] <grrrit-wm>	 (03PS1) 10Gehel: Configuration for new maps cluster in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/294914 
[13:43:59] <grrrit-wm>	 (03CR) 10jenkins-bot: [V: 04-1] Configuration for new maps cluster in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/294914 (owner: 10Gehel)
[13:46:10] <grrrit-wm>	 (03PS2) 10Gehel: Configuration for new maps cluster in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/294914 
[13:47:24] <grrrit-wm>	 (03CR) 10Paladox: [C: 031] "Thanks" [puppet] - 10https://gerrit.wikimedia.org/r/293789 (https://phabricator.wikimedia.org/T137224) (owner: 10Dzahn)
[13:52:40] <grrrit-wm>	 (03PS1) 10Giuseppe Lavagetto: etcd: perform backups to /srv/backups/etcd, bacula [puppet] - 10https://gerrit.wikimedia.org/r/294916 (https://phabricator.wikimedia.org/T135129) 
[13:53:54] <grrrit-wm>	 (03CR) 10jenkins-bot: [V: 04-1] etcd: perform backups to /srv/backups/etcd, bacula [puppet] - 10https://gerrit.wikimedia.org/r/294916 (https://phabricator.wikimedia.org/T135129) (owner: 10Giuseppe Lavagetto)
[13:57:08] <grrrit-wm>	 (03PS2) 10Giuseppe Lavagetto: etcd: perform backups to /srv/backups/etcd, bacula [puppet] - 10https://gerrit.wikimedia.org/r/294916 (https://phabricator.wikimedia.org/T135129) 
[13:57:25] <_joe_>	 akosiaris: did I use the backup classes correctly?
[13:57:26] <wikibugs>	 06Operations: Automated service restarts for common low-level system services - https://phabricator.wikimedia.org/T135991#2388637 (10MoritzMuehlenhoff) >>! In T135991#2377176, @ArielGlenn wrote: > I'm fine with a daily salt-minion restart but let's make sure that it doesn't leave a duplicate (old) salt-minion ru...
[13:59:16] <wikibugs>	 06Operations: ntp restart sometimes unrealiable - https://phabricator.wikimedia.org/T126733#2388640 (10MoritzMuehlenhoff)
[13:59:18] <wikibugs>	 06Operations: Automated service restarts for common low-level system services - https://phabricator.wikimedia.org/T135991#2388639 (10MoritzMuehlenhoff)
[13:59:33] <wikibugs>	 06Operations: Restarts of ganglia-monitor are unreliable - https://phabricator.wikimedia.org/T135723#2388642 (10MoritzMuehlenhoff)
[13:59:35] <wikibugs>	 06Operations: Automated service restarts for common low-level system services - https://phabricator.wikimedia.org/T135991#2318329 (10MoritzMuehlenhoff)
[14:03:51] <grrrit-wm>	 (03PS1) 10KartikMistry: apertium-eo-fr: New upstream release and Jessie rebuild [debs/contenttranslation/apertium-eo-fr] - 10https://gerrit.wikimedia.org/r/294917 (https://phabricator.wikimedia.org/T107306) 
[14:07:40] <wikibugs>	 06Operations, 10ContentTranslation-Deployments, 10ContentTranslation-cxserver, 10MediaWiki-extensions-ContentTranslation, and 4 others: Package and test apertium for Jessie - https://phabricator.wikimedia.org/T107306#2388644 (10KartikMistry)
[14:07:53] <grrrit-wm>	 (03PS1) 10Gehel: Configuration for new elasticsearch servers in eqiad. [puppet] - 10https://gerrit.wikimedia.org/r/294918 
[14:08:37] <grrrit-wm>	 (03CR) 10Alexandros Kosiaris: [C: 04-1] etcd: perform backups to /srv/backups/etcd, bacula (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/294916 (https://phabricator.wikimedia.org/T135129) (owner: 10Giuseppe Lavagetto)
[14:08:46] <akosiaris>	 _joe_: I've commented
[14:10:04] <grrrit-wm>	 (03CR) 10Gehel: Configuration for new elasticsearch servers in eqiad. (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/294918 (owner: 10Gehel)
[14:11:16] <_joe_>	 thanks :)
[14:11:31] <grrrit-wm>	 (03CR) 10Alexandros Kosiaris: [C: 031] "+1 with a small minor comment" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/294914 (owner: 10Gehel)
[14:13:15] <gehel>	 akosiaris: where did you find the LVS IP for maps? (and where can I find it myself next time).
[14:13:24] <grrrit-wm>	 (03PS1) 10Faidon Liambotis: ripeatlas: add IPv6 measurements as well [puppet] - 10https://gerrit.wikimedia.org/r/294919 
[14:13:26] <grrrit-wm>	 (03PS1) 10Faidon Liambotis: ripeatlas: lower intervals to 5/1 from 10/5 [puppet] - 10https://gerrit.wikimedia.org/r/294920 
[14:14:40] <grrrit-wm>	 (03CR) 10Faidon Liambotis: [C: 032] ripeatlas: add IPv6 measurements as well [puppet] - 10https://gerrit.wikimedia.org/r/294919 (owner: 10Faidon Liambotis)
[14:14:43] <grrrit-wm>	 (03PS3) 10Gehel: Configuration for new maps cluster in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/294914 
[14:14:45] <grrrit-wm>	 (03CR) 10jenkins-bot: [V: 04-1] ripeatlas: add IPv6 measurements as well [puppet] - 10https://gerrit.wikimedia.org/r/294919 (owner: 10Faidon Liambotis)
[14:14:47] <grrrit-wm>	 (03CR) 10Faidon Liambotis: [C: 032] ripeatlas: lower intervals to 5/1 from 10/5 [puppet] - 10https://gerrit.wikimedia.org/r/294920 (owner: 10Faidon Liambotis)
[14:15:01] <grrrit-wm>	 (03CR) 10Jcrespo: [C: 032] Increase db1072 weight after repooling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294910 (owner: 10Jcrespo)
[14:15:02] <_joe_>	 jenkins doesn't like us
[14:15:11] <grrrit-wm>	 (03CR) 10jenkins-bot: [V: 04-1] ripeatlas: lower intervals to 5/1 from 10/5 [puppet] - 10https://gerrit.wikimedia.org/r/294920 (owner: 10Faidon Liambotis)
[14:15:14] <grrrit-wm>	 (03PS4) 10Alexandros Kosiaris: networks::constants: use slice_network_constants [puppet] - 10https://gerrit.wikimedia.org/r/291819 
[14:15:16] <grrrit-wm>	 (03PS30) 10Alexandros Kosiaris: network: add $production_networks [puppet] - 10https://gerrit.wikimedia.org/r/260926 (https://phabricator.wikimedia.org/T122396) (owner: 10Faidon Liambotis)
[14:15:19] <paravoid>	 well it's right :)
[14:15:38] <_joe_>	 paravoid: it's clearly biased 
[14:15:48] <_joe_>	 unconscious bias it is!
[14:15:56] <grrrit-wm>	 (03PS2) 10Faidon Liambotis: ripeatlas: lower intervals to 5/1 from 10/5 [puppet] - 10https://gerrit.wikimedia.org/r/294920 
[14:15:58] <grrrit-wm>	 (03PS2) 10Faidon Liambotis: ripeatlas: add IPv6 measurements as well [puppet] - 10https://gerrit.wikimedia.org/r/294919 
[14:16:14] <akosiaris>	 gehel: https://phabricator.wikimedia.org/diffusion/ODNS/browse/master/templates/wmnet$4213 and https://phabricator.wikimedia.org/diffusion/ODNS/browse/master/templates/10.in-addr.arpa$51
[14:16:22] <akosiaris>	 I've already reserved it for you :-)
[14:16:32] <gehel>	 akosiaris: thanks!
[14:16:41] <logmsgbot>	 !log jynus@tin Synchronized wmf-config/db-eqiad.php: Increase db1072 weight after repooling (duration: 00m 36s)
[14:16:44] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:19:47] <jynus>	 Could not start Service[isc-dhcp-server]: Execution of '/sbin/start isc-dhcp-server' returned 1
[14:20:57] <grrrit-wm>	 (03CR) 10Giuseppe Lavagetto: etcd: perform backups to /srv/backups/etcd, bacula (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/294916 (https://phabricator.wikimedia.org/T135129) (owner: 10Giuseppe Lavagetto)
[14:21:27] <grrrit-wm>	 (03PS3) 10Giuseppe Lavagetto: etcd: perform backups to /srv/backups/etcd, bacula [puppet] - 10https://gerrit.wikimedia.org/r/294916 (https://phabricator.wikimedia.org/T135129) 
[14:23:58] <jynus>	 I broke it
[14:24:02] <jynus>	 and I will fix it
[14:24:26] <grrrit-wm>	 (03CR) 10Faidon Liambotis: [C: 032] ripeatlas: add IPv6 measurements as well [puppet] - 10https://gerrit.wikimedia.org/r/294919 (owner: 10Faidon Liambotis)
[14:24:32] <grrrit-wm>	 (03CR) 10Faidon Liambotis: [C: 032] ripeatlas: lower intervals to 5/1 from 10/5 [puppet] - 10https://gerrit.wikimedia.org/r/294920 (owner: 10Faidon Liambotis)
[14:26:03] <jynus>	 this is the first time in a long time where logs were immediately useful
[14:26:16] <jynus>	 we should celebrate it!
[14:28:23] <grrrit-wm>	 (03PS1) 10Jcrespo: Fix missing '}' after entry [puppet] - 10https://gerrit.wikimedia.org/r/294921 
[14:29:02] <wikibugs>	 06Operations, 10Android-app-feature-Feeds, 10Mobile-Content-Service, 10RESTBase, and 3 others: Investigate Android app API request latency regression - https://phabricator.wikimedia.org/T138010#2388656 (10Mholloway) @ori, thank you, those are excellent points and I've created a new task to address them....
[14:29:31] <wikibugs>	 06Operations, 06Revision-Scoring-As-A-Service, 06Services, 07service-deployment-requests: New Service Request: ORES - https://phabricator.wikimedia.org/T117560#2388657 (10akosiaris) 05Open>03Resolved a:03akosiaris Resolving since ORES has been in production for the past 2 weeks
[14:30:26] <wikibugs>	 06Operations, 10Android-app-feature-Feeds, 10Mobile-Content-Service, 10RESTBase, and 3 others: Investigate Android app API request latency regression - https://phabricator.wikimedia.org/T138010#2388660 (10BBlack) @mholloway - any chance this has interactions with the lazy-images/lazy-refs experiments, whic...
[14:30:57] <grrrit-wm>	 (03PS5) 10Giuseppe Lavagetto: systemd: add systemd::sidekick [puppet] - 10https://gerrit.wikimedia.org/r/291949 
[14:31:08] <grrrit-wm>	 (03CR) 10Jcrespo: [C: 032] Fix missing '}' after entry [puppet] - 10https://gerrit.wikimedia.org/r/294921 (owner: 10Jcrespo)
[14:32:36] <icinga-wm>	 PROBLEM - puppet last run on restbase1007 is CRITICAL: CRITICAL: Puppet has 1 failures
[14:33:35] <grrrit-wm>	 (03PS1) 10Mobrovac: Revert "Revert "Change Prop: Disable transclusion update rules"" [puppet] - 10https://gerrit.wikimedia.org/r/294922 
[14:34:06] <icinga-wm>	 RECOVERY - puppet last run on carbon is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[14:34:29] <grrrit-wm>	 (03CR) 10Giuseppe Lavagetto: [C: 032] Revert "Revert "Change Prop: Disable transclusion update rules"" [puppet] - 10https://gerrit.wikimedia.org/r/294922 (owner: 10Mobrovac)
[14:34:48] <godog>	 heh "puppet fail" flapping on restbase1007 is https://phabricator.wikimedia.org/T137952, I've silenced the alarm now but I'm not sure how to best fix it, update the cassandra version in puppet perhaps
[14:35:32] <mobrovac>	 godog: perhaps have puppet ensure the version that should be applied on each node (2.1 vs 2.2)?
[14:36:31] <mobrovac>	 godog: oh, no, no no ensure, a custom 2.2 deb has been installed on rb1007
[14:37:29] <godog>	 mobrovac: ah yeah, now I get it, will fix in puppet
[14:40:29] <grrrit-wm>	 (03PS1) 10Filippo Giunchedi: cassandra: default to 2.2.6-wmf1 [puppet] - 10https://gerrit.wikimedia.org/r/294924 (https://phabricator.wikimedia.org/T137952) 
[14:46:51] <grrrit-wm>	 (03CR) 10Eevans: "LGTM, but could we remove the cassandra::version overrides in hieradata/regex.yaml as part of the same changeset?" [puppet] - 10https://gerrit.wikimedia.org/r/294924 (https://phabricator.wikimedia.org/T137952) (owner: 10Filippo Giunchedi)
[14:51:00] <grrrit-wm>	 (03CR) 10Thcipriani: [C: 031] Deploy Compact Language Links as default (Stage 1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294874 (https://phabricator.wikimedia.org/T136677) (owner: 10KartikMistry)
[14:55:08] <mobrovac>	 !log scb disabling puppet for stopping change-prop to clear transclusion queues
[14:55:12] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:56:25] <icinga-wm>	 RECOVERY - puppet last run on restbase1007 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures
[14:58:18] <grrrit-wm>	 (03PS2) 10Alexandros Kosiaris: Remove nas1001-a.eqiad.wmnet, nas1001-b.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/267056 (https://phabricator.wikimedia.org/T124156) 
[14:59:34] <grrrit-wm>	 (03CR) 10Alexandros Kosiaris: [C: 032] Remove nas1001-a.eqiad.wmnet, nas1001-b.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/267056 (https://phabricator.wikimedia.org/T124156) (owner: 10Alexandros Kosiaris)
[15:00:04] <wikibugs>	 06Operations, 10ops-eqiad, 06DC-Ops, 13Patch-For-Review: decomission the netapps in EQIAD: nas1001-a, nas1001-b - https://phabricator.wikimedia.org/T124156#2388690 (10akosiaris)
[15:00:06] <wikibugs>	 06Operations, 10Traffic: unix domain socket listening for varnish4 - https://phabricator.wikimedia.org/T138084#2388691 (10BBlack)
[15:00:55] <icinga-wm>	 PROBLEM - changeprop endpoints health on scb1001 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.0.16, port=7272): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused)))
[15:03:31] <icinga-wm>	 PROBLEM - changeprop endpoints health on scb1002 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.16.21, port=7272): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused)))
[15:03:31] <icinga-wm>	 RECOVERY - changeprop endpoints health on scb1001 is OK: All endpoints are healthy
[15:07:32] <icinga-wm>	 RECOVERY - changeprop endpoints health on scb1002 is OK: All endpoints are healthy
[15:07:41] <grrrit-wm>	 (03CR) 10Hashar: "All of gallium is in puppet/configuration management BUT the Jenkins configuration. The whole point is thus to backup /var/lib/jenkins" [puppet] - 10https://gerrit.wikimedia.org/r/293690 (https://phabricator.wikimedia.org/T80385) (owner: 10Muehlenhoff)
[15:08:03] <hashar>	 akosiaris: I have replied about gallium backup. In short, /var/lib/jenkins is what we want to back up, but we would want to exclude the builds history
[15:22:25] <grrrit-wm>	 (03PS1) 10Muehlenhoff: Restart exim daily on Monday to Friday [puppet] - 10https://gerrit.wikimedia.org/r/294929 (https://phabricator.wikimedia.org/T135991) 
[15:22:55] <grrrit-wm>	 (03CR) 10Hashar: "The bacula fileset configuration is expanded from modules/bacula/templates/bacula-dir-fileset.erb . The Exclude are defined with File keyw" [puppet] - 10https://gerrit.wikimedia.org/r/293690 (https://phabricator.wikimedia.org/T80385) (owner: 10Muehlenhoff)
[15:34:08] <wikibugs>	 06Operations, 06Collaboration-Team-Interested, 10DBA, 10Flow, 07WorkType-Maintenance: Setup separate logical External Store for Flow in production - https://phabricator.wikimedia.org/T107610#2388783 (10Mattflaschen-WMF) By timestamp:  Converted from UUID with: ``` Flow\Model\UUID::create( strtolower( '<H...
[15:38:51] <wikibugs>	 06Operations, 10Traffic, 06Community-Liaisons (Jul-Sep-2016): Help contact bot owners about the end of HTTP access to the API - https://phabricator.wikimedia.org/T136674#2388786 (10BBlack) New usernames in the past 24H: ``` Raboe001 ```  (I figure no point repeating the already-notified list every day, but c...
[15:44:52] <icinga-wm>	 ACKNOWLEDGEMENT - puppet last run on mw1299 is CRITICAL: CRITICAL: Puppet has 2 failures Giuseppe Lavagetto Both jobrunner and jobchron are masked while I wait to activate the jessie jobrunner
[15:54:05] <akosiaris>	 hasharAway: yeah looking right now
[15:54:58] <urandom>	 !log Restarting Cassandra on xenon.eqiad.wmnet to enable large pages : T137419
[15:54:59] <stashbot>	 T137419: Investigate aberrant disk read throughput in Cassandra 2.2.6 - https://phabricator.wikimedia.org/T137419
[15:55:01] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:55:56] <grrrit-wm>	 (03CR) 10Paladox: "There now done at https://gerrit.wikimedia.org/r/#/c/294867/" [puppet] - 10https://gerrit.wikimedia.org/r/293221 (https://phabricator.wikimedia.org/T137224) (owner: 10Dzahn)
[15:56:05] <grrrit-wm>	 (03CR) 10Paladox: [C: 04-1] git.wikimedia.org -> Diffusion redirects [puppet] - 10https://gerrit.wikimedia.org/r/293221 (https://phabricator.wikimedia.org/T137224) (owner: 10Dzahn)
[15:56:18] <wikibugs>	 06Operations, 06Collaboration-Team-Interested, 10DBA, 10Flow, 07WorkType-Maintenance: Setup separate logical External Store for Flow in production - https://phabricator.wikimedia.org/T107610#2388820 (10jcrespo) The question was because, unlike beta, the setup is a bit more complex: we have read-write and...
[15:59:30] <urandom>	 !log Starting html dumps from xenon.eqiad.wmnet and cerium.eqiad.wmnet : T137419
[15:59:31] <stashbot>	 T137419: Investigate aberrant disk read throughput in Cassandra 2.2.6 - https://phabricator.wikimedia.org/T137419
[15:59:33] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:59:51] <grrrit-wm>	 (03CR) 10Alexandros Kosiaris: "@hashar: I see what you mean. I 'll look into enabling the wilddir option" [puppet] - 10https://gerrit.wikimedia.org/r/293690 (https://phabricator.wikimedia.org/T80385) (owner: 10Muehlenhoff)
[16:00:59] <grrrit-wm>	 (03PS2) 10Filippo Giunchedi: cassandra: default to 2.2.6-wmf1 [puppet] - 10https://gerrit.wikimedia.org/r/294924 (https://phabricator.wikimedia.org/T137952) 
[16:03:13] <grrrit-wm>	 (03CR) 10Eevans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/294924 (https://phabricator.wikimedia.org/T137952) (owner: 10Filippo Giunchedi)
[16:06:30] <wikibugs>	 06Operations, 10ops-codfw, 10media-storage: codfw: rack/setup/deploy ms-be202[2-7] switch configuration - https://phabricator.wikimedia.org/T138052#2388842 (10RobH)
[16:08:54] <wikibugs>	 06Operations, 10ops-codfw, 10media-storage: codfw: rack/setup/deploy ms-be202[2-7] switch configuration - https://phabricator.wikimedia.org/T138052#2388846 (10RobH) 05Open>03Resolved switch config updated
[16:08:56] <wikibugs>	 06Operations, 10ops-codfw, 10media-storage, 13Patch-For-Review: rack/setup/deploy ms-be202[2-7] - https://phabricator.wikimedia.org/T136630#2388848 (10RobH)
[16:09:12] <grrrit-wm>	 (03PS1) 10Muehlenhoff: Move dataset ferm rules into the role [puppet] - 10https://gerrit.wikimedia.org/r/294930 
[16:10:06] <grrrit-wm>	 (03PS1) 10Jcrespo: Repool db1073 with low weight after reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294931 
[16:12:44] <grrrit-wm>	 (03CR) 10Filippo Giunchedi: [C: 032 V: 032] "PCC says yes: https://puppet-compiler.wmflabs.org/3142/" [puppet] - 10https://gerrit.wikimedia.org/r/294924 (https://phabricator.wikimedia.org/T137952) (owner: 10Filippo Giunchedi)
[16:13:01] <grrrit-wm>	 (03PS1) 10RobH: adding production dns entries for ms-be202[2-7] [dns] - 10https://gerrit.wikimedia.org/r/294932 
[16:13:03] <grrrit-wm>	 (03PS3) 10Filippo Giunchedi: cassandra: default to 2.2.6-wmf1 [puppet] - 10https://gerrit.wikimedia.org/r/294924 (https://phabricator.wikimedia.org/T137952) 
[16:13:09] <grrrit-wm>	 (03CR) 10Filippo Giunchedi: [C: 032 V: 032] cassandra: default to 2.2.6-wmf1 [puppet] - 10https://gerrit.wikimedia.org/r/294924 (https://phabricator.wikimedia.org/T137952) (owner: 10Filippo Giunchedi)
[16:14:20] <grrrit-wm>	 (03CR) 10Muehlenhoff: "PCC: http://puppet-compiler.wmflabs.org/3143/" [puppet] - 10https://gerrit.wikimedia.org/r/294930 (owner: 10Muehlenhoff)
[16:14:34] <grrrit-wm>	 (03CR) 10RobH: [C: 032] adding production dns entries for ms-be202[2-7] [dns] - 10https://gerrit.wikimedia.org/r/294932 (owner: 10RobH)
[16:16:29] <wikibugs>	 06Operations, 13Patch-For-Review: "puppet fail" flapping on restbase1007 - https://phabricator.wikimedia.org/T137952#2388869 (10fgiunchedi) 05Open>03Resolved a:03fgiunchedi fixed in https://gerrit.wikimedia.org/r/294924
[16:17:41] <wikibugs>	 06Operations, 10ops-codfw, 10media-storage, 13Patch-For-Review: rack/setup/deploy ms-be202[2-7] - https://phabricator.wikimedia.org/T136630#2388873 (10RobH) So these are HP with HW raid controllers.  We'll need @papaul to setup the raid10 of the primary spinning disks for these before we can proceed with i...
[16:21:00] <wikibugs>	 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate, 13Patch-For-Review: write Apache rewrite rules for  gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2388901 (10Paladox) I guess we remove https://github.com/wikimedia/operations-puppet/blob/52737634512bf43f8f98b757be4...
[16:21:22] <wikibugs>	 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate, 13Patch-For-Review: write Apache rewrite rules for  gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2388902 (10Paladox) But we will need to update git.wikimedia.org ip to use iridium.
[16:22:51] <icinga-wm>	 PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 2 failures
[16:25:23] <wikibugs>	 06Operations, 10ops-codfw, 10media-storage, 13Patch-For-Review: rack/setup/deploy ms-be202[2-7] - https://phabricator.wikimedia.org/T136630#2388907 (10fgiunchedi) @robh, yeah same as the others, namely raid configuration for ms-be is all disks in raid0 from the hw controller, similarly to https://pha...
[16:27:16] <wikibugs>	 06Operations, 10ops-codfw, 10media-storage, 13Patch-For-Review: rack/setup/deploy ms-be202[2-7] - https://phabricator.wikimedia.org/T136630#2388911 (10RobH) So I was wrong, no hw raid setup other than they need to present as individual disks to the OS.
[16:29:31] <moritzm>	 !log installing squid security updates on carbon
[16:29:35] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[16:31:52] <grrrit-wm>	 (03PS4) 10Gehel: Configuration for new maps cluster in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/294914 (https://phabricator.wikimedia.org/T135018) 
[16:33:52] <grrrit-wm>	 (03CR) 10Filippo Giunchedi: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/291710 (owner: 10Filippo Giunchedi)
[16:34:07] <grrrit-wm>	 (03PS1) 10Dereckson: Allow sysops to add to/remove from confirmed on ca.wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294933 (https://phabricator.wikimedia.org/T138069) 
[16:38:37] <wikibugs>	 06Operations, 06Discovery, 06Maps: Configure new maps servers in eqiad - https://phabricator.wikimedia.org/T138092#2388933 (10Gehel)
[16:39:04] <wikibugs>	 06Operations, 06Discovery, 06Maps, 03Maps-Sprint, 13Patch-For-Review: Rack/Setup 4 map servers in eqiad - https://phabricator.wikimedia.org/T135018#2285911 (10Gehel) Closing this. The configuration is tracked in T138092.
[16:39:27] <grrrit-wm>	 (03PS5) 10Gehel: Configuration for new maps cluster in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/294914 (https://phabricator.wikimedia.org/T138092) 
[16:44:55] <wikibugs>	 06Operations, 10MediaWiki-General-or-Unknown, 06Services, 10Traffic: Investigate query parameter normalization for MW/services - https://phabricator.wikimedia.org/T138093#2388961 (10BBlack)
[16:46:54] <grrrit-wm>	 (03PS5) 10Filippo Giunchedi: prometheus: add nginx reverse proxy [puppet] - 10https://gerrit.wikimedia.org/r/290479 (https://phabricator.wikimedia.org/T126785) 
[16:47:03] <icinga-wm>	 RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[16:47:26] <grrrit-wm>	 (03CR) 10jenkins-bot: [V: 04-1] prometheus: add nginx reverse proxy [puppet] - 10https://gerrit.wikimedia.org/r/290479 (https://phabricator.wikimedia.org/T126785) (owner: 10Filippo Giunchedi)
[16:47:48] <grrrit-wm>	 (03PS6) 10Filippo Giunchedi: prometheus: add nginx reverse proxy [puppet] - 10https://gerrit.wikimedia.org/r/290479 (https://phabricator.wikimedia.org/T126785) 
[16:49:15] <wikibugs>	 06Operations, 10MediaWiki-General-or-Unknown, 06Services, 10Traffic: Investigate query parameter normalization for MW/services - https://phabricator.wikimedia.org/T138093#2388975 (10BBlack) For reference, here's the raw data at the end of some munging of a 5-minute sample of received URLs in a single cache...
[16:49:58] <wikibugs>	 06Operations, 06Collaboration-Team-Interested, 10DBA, 10Flow, 07WorkType-Maintenance: Setup separate logical External Store for Flow in production - https://phabricator.wikimedia.org/T107610#2388976 (10Mattflaschen-WMF) Yeah.  I checked what the first use of External Store was just now:  Seems it was use...
[16:50:19] <grrrit-wm>	 (03PS4) 10Filippo Giunchedi: prometheus: add tools role [puppet] - 10https://gerrit.wikimedia.org/r/291710 
[16:56:58] <wikibugs>	 06Operations, 06Collaboration-Team-Interested, 10DBA, 10Flow, 07WorkType-Maintenance: Setup separate logical External Store for Flow in production - https://phabricator.wikimedia.org/T107610#2388983 (10jcrespo) That is good news, if I understand it correctly it means there is no revisions in cluster23 or...
[17:08:04] <grrrit-wm>	 (03CR) 10Luke081515: [C: 031] Allow sysops to add to/remove from confirmed on ca.wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294933 (https://phabricator.wikimedia.org/T138069) (owner: 10Dereckson)
[17:08:18] <grrrit-wm>	 (03CR) 1020after4: [C: 031] "the lists should probably be generated but this is fine for now." [puppet] - 10https://gerrit.wikimedia.org/r/294742 (https://phabricator.wikimedia.org/T110068) (owner: 10Thcipriani)
[17:09:27] <grrrit-wm>	 (03CR) 10Ori.livneh: prometheus: add nginx reverse proxy (037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/290479 (https://phabricator.wikimedia.org/T126785) (owner: 10Filippo Giunchedi)
[17:12:43] <grrrit-wm>	 (03CR) 10Ori.livneh: "actually could you get away with getting rid of the erb file altogether by using a $title-agnostic regexp, like location ~ /(\-|debug) ?" [puppet] - 10https://gerrit.wikimedia.org/r/290479 (https://phabricator.wikimedia.org/T126785) (owner: 10Filippo Giunchedi)
[17:14:58] <wikibugs>	 06Operations, 07Blocked-on-RelEng, 05Gitblit-Deprecate, 13Patch-For-Review: Phase out antimony.wikimedia.org (git.wikimedia.org / gitblit) - https://phabricator.wikimedia.org/T123718#2389056 (10JanZerebecki)
[17:18:48] <grrrit-wm>	 (03CR) 1020after4: [C: 031] varnish: git.wm.org to iridium, remove related config/tests/monitoring [puppet] - 10https://gerrit.wikimedia.org/r/293789 (https://phabricator.wikimedia.org/T137224) (owner: 10Dzahn)
[17:19:48] <grrrit-wm>	 (03CR) 1020after4: "this is good to go as soon as I67ad308f9e6373e5234cb2d83006457d6f467bf8 is deployed" [puppet] - 10https://gerrit.wikimedia.org/r/293789 (https://phabricator.wikimedia.org/T137224) (owner: 10Dzahn)
[17:22:01] <wikibugs>	 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate, 13Patch-For-Review: write Apache rewrite rules for  gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2389079 (10mmodell) So we need these two to merge in the listed order:  1. https://gerrit.wikimedia.org/r/#/c/293789/...
[17:23:05] <wikibugs>	 06Operations, 06Collaboration-Team-Interested, 10DBA, 10Flow, 07WorkType-Maintenance: Setup separate logical External Store for Flow in production - https://phabricator.wikimedia.org/T107610#2389080 (10matthiasmullie) Indeed:  ``` mysql:wikiadmin@10.64.16.18 [flowdb]> SELECT DISTINCT SUBSTR(rev_content,...
[17:24:32] <icinga-wm>	 PROBLEM - puppet last run on mw1169 is CRITICAL: CRITICAL: Puppet has 1 failures
[17:34:45] <grrrit-wm>	 (03Abandoned) 10Faidon Liambotis: interface: disable IPv6 autoconf on precise hosts [puppet] - 10https://gerrit.wikimedia.org/r/217317 (owner: 10Faidon Liambotis)
[17:38:42] <wikibugs>	 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate, 13Patch-For-Review: write Apache rewrite rules for  gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2389143 (10Paladox) @mmodell do we send an email out on wikitech-I saying git.wikimedia.org will be redirected soon.
[17:40:14] <grrrit-wm>	 (03CR) 10Faidon Liambotis: [C: 031] "Nice! I'd very much prefer to ditch these two variables but until we do so, this will do." [puppet] - 10https://gerrit.wikimedia.org/r/291819 (owner: 10Alexandros Kosiaris)
[17:42:10] <wikibugs>	 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate, 13Patch-For-Review: write Apache rewrite rules for  gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2389149 (10Aklapper) If that was a question the sentence has to end with a question mark. Always.
[17:42:56] <wikibugs>	 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate, 13Patch-For-Review: write Apache rewrite rules for  gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2389150 (10Paladox) @Aklapper ok sorry, done.
[17:43:19] <wikibugs>	 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate, 13Patch-For-Review: write Apache rewrite rules for  gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2389151 (10mmodell) lol
[17:50:12] <icinga-wm>	 RECOVERY - puppet last run on mw1169 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures
[17:53:46] <grrrit-wm>	 (03CR) 10Faidon Liambotis: [C: 04-1] network: add $production_networks (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/260926 (https://phabricator.wikimedia.org/T122396) (owner: 10Faidon Liambotis)
[18:00:41] <grrrit-wm>	 (03PS2) 10Paladox: Block access to jice.ddns.net instead of ip [puppet] - 10https://gerrit.wikimedia.org/r/277904 
[18:04:04] <grrrit-wm>	 (03PS1) 10Faidon Liambotis: otrs: add check_procs for clamd/freshclam [puppet] - 10https://gerrit.wikimedia.org/r/294939 (https://phabricator.wikimedia.org/T137188) 
[18:07:38] <wikibugs>	 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate, 13Patch-For-Review: write Apache rewrite rules for  gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2389616 (10greg) Yeah, we should announce before it happens.
[18:13:25] <wikibugs>	 06Operations, 13Patch-For-Review: Automated service restarts for common low-level system services - https://phabricator.wikimedia.org/T135991#2318329 (10faidon) A few of these could be quite problematic, unfortunately. I'd proceed with caution. - ntpd: we have Icinga checks for that I think, so they might trip...
[18:30:25] <grrrit-wm>	 (03CR) 10Jcrespo: [C: 032] Repool db1073 with low weight after reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294931 (owner: 10Jcrespo)
[18:32:06] <logmsgbot>	 !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1073 with low weight after reimage (duration: 00m 35s)
[18:32:10] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[18:53:20] <wikibugs>	 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate, 13Patch-For-Review: write Apache rewrite rules for  gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2390465 (10Paladox) We also need to remove https://github.com/wikimedia/operations-puppet/blob/bdd27ef834044a25c5a4e6...
[18:56:35] <urandom>	 !log Restarting Cassandra on xenon.eqiad.wmnet with -XX:+PreserveFramePointer : T137419
[18:56:36] <stashbot>	 T137419: Investigate aberrant disk read throughput in Cassandra 2.2.6 - https://phabricator.wikimedia.org/T137419
[18:56:39] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:29:14] <grrrit-wm>	 (03PS1) 1020after4: Phabricator: remove remote 'origin' from system-wide gitconfig [puppet] - 10https://gerrit.wikimedia.org/r/294945 
[19:29:37] <grrrit-wm>	 (03PS2) 1020after4: Phabricator: remove remote 'origin' from system-wide gitconfig [puppet] - 10https://gerrit.wikimedia.org/r/294945 
[19:31:24] <grrrit-wm>	 (03CR) 1020after4: [C: 031] "Phabricator has since fixed the issue that necessitated adding this in the first place so it's safe to remove." [puppet] - 10https://gerrit.wikimedia.org/r/294945 (owner: 1020after4)
[19:33:50] <grrrit-wm>	 (03CR) 10Paladox: [C: 031] ":)" [puppet] - 10https://gerrit.wikimedia.org/r/294945 (owner: 1020after4)
[19:35:08] <grrrit-wm>	 (03PS3) 1020after4: Phabricator: remove remote 'origin' from system-wide gitconfig [puppet] - 10https://gerrit.wikimedia.org/r/294945 (https://phabricator.wikimedia.org/T137819) 
[19:47:44] <wikibugs>	 06Operations, 10netops: Notify transits of new esams prefixes - https://phabricator.wikimedia.org/T81989#2390595 (10faidon) 05Open>03Resolved a:03faidon I verified via looking glasses that those routes are being successfully propagated by all of our Amsterdam transits but one. I've mailed them for comple...
[20:15:11] <icinga-wm>	 PROBLEM - check_puppetrun on rigel is CRITICAL: CRITICAL: puppet fail
[20:20:04] <wikibugs>	 06Operations, 10Android-app-feature-Feeds, 10Mobile-Content-Service, 10RESTBase, and 3 others: Investigate Android app API request latency regression - https://phabricator.wikimedia.org/T138010#2390668 (10Mholloway) Mystery solved: due to a session logging schema change, the Android app began sending sessi...
[20:20:11] <icinga-wm>	 RECOVERY - check_puppetrun on rigel is OK: OK: Puppet is currently enabled, last run 272 seconds ago with 0 failures
[20:20:21] <wikibugs>	 06Operations, 10Android-app-feature-Feeds, 10Mobile-Content-Service, 10RESTBase, and 3 others: Investigate Android app API request latency regression - https://phabricator.wikimedia.org/T138010#2390669 (10Mholloway) 05Open>03Resolved
[20:23:27] <logmsgbot>	 !log maxsem@tin Synchronized php-1.28.0-wmf.6/extensions/WikimediaEvents/: https://gerrit.wikimedia.org/r/#/c/294958/ (duration: 00m 33s)
[20:23:30] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:27:41] <icinga-wm>	 PROBLEM - puppet last run on mw2118 is CRITICAL: CRITICAL: puppet fail
[20:35:51] <urandom>	 !log Disabling puppet on xenon.eqiad.wmnet : T137419
[20:35:52] <stashbot>	 T137419: Investigate aberrant disk read throughput in Cassandra 2.2.6 - https://phabricator.wikimedia.org/T137419
[20:35:55] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:38:04] <wikibugs>	 06Operations, 06Labs, 13Patch-For-Review: Phase out the 'puppet' module with fire, make self hosted puppetmasters use the puppetmaster module - https://phabricator.wikimedia.org/T120159#2390675 (10hashar)
[20:39:11] <urandom>	 !log Restarting Cassandra on xenon.eqiad.wmnet to apply -XX:+PreserveFramePointer : T137419
[20:39:12] <stashbot>	 T137419: Investigate aberrant disk read throughput in Cassandra 2.2.6 - https://phabricator.wikimedia.org/T137419
[20:39:14] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:47:45] <hashar>	 !sal
[20:47:45] <wm-bot>	 https://wikitech.wikimedia.org/wiki/Server_Admin_Log  https://tools.wmflabs.org/sal/production   See it and you will know all you need.
[20:48:36] <grrrit-wm>	 (03PS1) 10BBlack: tlsproxy: drop ssl cache size back to 1G [puppet] - 10https://gerrit.wikimedia.org/r/295006 
[20:49:04] <hashar>	 MaxSem: the schema.geoFormat issue is gone for me  (  https://phabricator.wikimedia.org/T138078  )
[20:49:15] <MaxSem>	 yeah
[20:49:16] <hashar>	 MaxSem: probably want to reply a nice thing to the bug filler and close the task :)
[20:49:32] <grrrit-wm>	 (03CR) 10BBlack: [C: 032 V: 032] tlsproxy: drop ssl cache size back to 1G [puppet] - 10https://gerrit.wikimedia.org/r/295006 (owner: 10BBlack)
[20:49:33] <hashar>	 MaxSem: I guess thank you for the patch
[20:49:36] <MaxSem>	 "I suck" :P
[20:52:02] <icinga-wm>	 PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 5 failures
[20:57:09] <grrrit-wm>	 (03PS1) 10BBlack: cache_upload: experiment with 4h fe ttl cap [puppet] - 10https://gerrit.wikimedia.org/r/295007 (https://phabricator.wikimedia.org/T124954) 
[20:57:32] <icinga-wm>	 RECOVERY - puppet last run on mw2118 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[20:57:41] <grrrit-wm>	 (03CR) 10BBlack: [C: 032 V: 032] cache_upload: experiment with 4h fe ttl cap [puppet] - 10https://gerrit.wikimedia.org/r/295007 (https://phabricator.wikimedia.org/T124954) (owner: 10BBlack)
[21:00:03] <icinga-wm>	 PROBLEM - puppet last run on xenon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[21:00:27] <hashar>	 MaxSem:  I disagree! You noticed the task, figured out it was an easy fix, asked to deploy, got a review and got it pushed :)
[21:01:10] <hashar>	 get a nice word / summary and it is all set!
[21:02:52] <icinga-wm>	 PROBLEM - restbase endpoints health on xenon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[21:03:58] <icinga-wm>	 RECOVERY - restbase endpoints health on xenon is OK: All endpoints are healthy
[21:06:46] <hashar>	 MaxSem: sleep well :)
[21:08:09] <icinga-wm>	 PROBLEM - puppet last run on xenon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[21:11:39] <icinga-wm>	 PROBLEM - restbase endpoints health on xenon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[21:12:19] <icinga-wm>	 PROBLEM - puppet last run on xenon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[21:16:08] <icinga-wm>	 RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures
[21:21:45] <urandom>	 !log Reenabling puppet and resetting configuration on xenon.eqiad.wmnet : T137419
[21:21:46] <stashbot>	 T137419: Investigate aberrant disk read throughput in Cassandra 2.2.6 - https://phabricator.wikimedia.org/T137419
[21:21:49] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[21:24:08] <icinga-wm>	 RECOVERY - restbase endpoints health on xenon is OK: All endpoints are healthy
[21:24:49] <icinga-wm>	 RECOVERY - puppet last run on xenon is OK: OK: Puppet is currently enabled, last run 51 minutes ago with 0 failures
[21:30:21] <wikibugs>	 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate, 13Patch-For-Review: write Apache rewrite rules for  gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2390792 (10Paladox) @greg do you know who would send the email out?
[21:45:30] <grrrit-wm>	 (03PS1) 10Paladox: Only mirror refs/heads/ and refs/tags/ for mw core and operations/puppet [puppet] - 10https://gerrit.wikimedia.org/r/295011 
[21:47:44] <grrrit-wm>	 (03CR) 10Paladox: "We're switching mirrors on and this is required for the biggest repos." [puppet] - 10https://gerrit.wikimedia.org/r/295011 (owner: 10Paladox)
[21:52:08] <icinga-wm>	 PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 2 failures
[22:00:08] <icinga-wm>	 PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 610 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5297808 keys - replication_delay is 610
[22:04:29] <icinga-wm>	 RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5244813 keys - replication_delay is 0
[22:05:58] <wikibugs>	 06Operations, 10ops-eqiad, 13Patch-For-Review: Rack and Set up new application servers mw1284-1306 - https://phabricator.wikimedia.org/T134309#2390881 (10Southparkfan) @Joe mw1306 seems to have the same IP as mw1091 (although that one shows up as mw1091 in Ganglia, whereas mw1090 shows up as mw1305...). So I...
[22:10:14] <wikibugs>	 06Operations, 10Android-app-feature-Feeds, 10Mobile-Content-Service, 10RESTBase, and 3 others: Investigate Android app API request latency regression - https://phabricator.wikimedia.org/T138010#2390885 (10GWicke) This is the [updated latency graph](https://docs.google.com/spreadsheets/d/1ZcaXdxMhaAEFMferqi...
[22:11:56] <wikibugs>	 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate, 13Patch-For-Review: write Apache rewrite rules for  gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2390889 (10greg) One of me, Mukunda, or Chad, probably.
[22:18:05] <wikibugs>	 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate, 13Patch-For-Review: write Apache rewrite rules for  gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2390924 (10Paladox) Ok Thanks for replying.
[22:45:29] <icinga-wm>	 RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures
[23:29:55] <icinga-wm>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0; xe-5/0/2: down - Core: cr2-ulsfo:xe-1/3/0 (Zayo, OGYX/124337//ZYO, 38.8ms) {#11541} [10Gbps wave]
[23:29:55] <icinga-wm>	 PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0; xe-5/3/1: down - Transit: Zayo (IPYX/125449/003/ZYO) {#11542} [10Gbps]
[23:30:26] <icinga-wm>	 PROBLEM - Router interfaces on cr2-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 75, down: 1, dormant: 0, excluded: 0, unused: 0; xe-1/3/0: down - Core: cr1-codfw:xe-5/0/2 (Zayo, OGYX/124337//ZYO, 38.8ms) {#?} [10Gbps wave]
[23:38:35] <icinga-wm>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0
[23:38:35] <icinga-wm>	 RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0
[23:39:06] <icinga-wm>	 RECOVERY - Router interfaces on cr2-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 77, down: 0, dormant: 0, excluded: 0, unused: 0