[00:29:54] (PS1) Thcipriani: Stop executing on failure [tools/scap] - https://gerrit.wikimedia.org/r/239521
[00:42:19] operations, Annual-Report: redirect to older annual reports from annual.wikimedia.org - https://phabricator.wikimedia.org/T113113#1655304 (Dzahn) NEW
[00:43:14] operations, Annual-Report: redirect to older annual reports from annual.wikimedia.org - https://phabricator.wikimedia.org/T113113#1655311 (Dzahn)
[01:02:35] PROBLEM - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out
[01:03:22] =P
[01:03:44] not quite a false alarm but not enough of an alarm to do anything... unless you are brandon ;]
[01:04:16] RECOVERY - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 10512 bytes in 0.121 second response time
[01:11:14] Ops-Access-Requests, operations: Requesting access to stat1002 (Hue / Hive) for bmansurov - https://phabricator.wikimedia.org/T113069#1655355 (bmansurov) @Dzahn, I have signed L3. Thanks.
[01:12:40] Blocked-on-Operations, operations, Phabricator, Traffic: Phabricator needs to expose ssh - https://phabricator.wikimedia.org/T100519#1655357 (greg) >>! In T100519#1529708, @BBlack wrote: > Basically, yeah. I ran down a similar plan with @Chasemp and I think he's working on some patches for it. @bb...
[01:18:51] robh, yeah that's been happening for ages
[01:19:14] PROBLEM - LVS HTTP IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out
[01:19:17] yes paging all the ops who are awake on their phones each time =P
[01:19:30] Blocked-on-Operations, operations, Phabricator, Release-Engineering-Team, Traffic: Phabricator needs to expose ssh - https://phabricator.wikimedia.org/T100519#1655387 (greg)
[01:19:34] PROBLEM - IPsec on cp1047 is CRITICAL: Strongswan CRITICAL - ok: 23 not-conn: cp4012_v6
[01:20:54] RECOVERY - LVS HTTP IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 503 bytes in 1.003 second response time
[01:21:24] RECOVERY - IPsec on cp1047 is OK: Strongswan OK - 24 ESP OK
[01:21:45] PROBLEM - puppet last run on cp4015 is CRITICAL: CRITICAL: puppet fail
[01:25:25] Ops-Access-Requests, operations, Wikimedia-Blog: stat1003/EventLogging access for asherman - https://phabricator.wikimedia.org/T113118#1655394 (Tbayer) NEW a:Ottomata
[01:42:15] PROBLEM - LVS HTTP IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out
[01:44:04] RECOVERY - LVS HTTP IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 503 bytes in 3.007 second response time
[01:48:16] RECOVERY - puppet last run on cp4015 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[01:59:05] PROBLEM - puppet last run on mw1185 is CRITICAL: CRITICAL: Puppet has 1 failures
[02:20:20] !log l10nupdate@tin Synchronized php-1.26wmf23/cache/l10n: l10nupdate for 1.26wmf23 (duration: 06m 05s)
[02:20:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:23:29] !log l10nupdate@tin LocalisationUpdate completed (1.26wmf23) at 2015-09-19 02:23:29+00:00
[02:23:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:25:34] RECOVERY - puppet last run on mw1185 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[03:00:13] operations, Discovery, Maps: Support incubator - https://phabricator.wikimedia.org/T113122#1655482 (Revi) NEW
[03:04:36] PROBLEM - Disk space on labstore1002 is CRITICAL: DISK CRITICAL - /run/lock/storage-replicate-labstore-others/snapshot is not accessible: Permission denied
[03:23:55] RECOVERY - Disk space on labstore1002 is OK: DISK OK
[03:25:35] PROBLEM - puppet last run on maerlant is CRITICAL: CRITICAL: puppet fail
[03:53:55] RECOVERY - puppet last run on maerlant is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[03:59:15] PROBLEM - Host cp1046 is DOWN: PING CRITICAL - Packet loss = 100%
[04:01:26] PROBLEM - IPsec on cp2003 is CRITICAL: Strongswan CRITICAL - ok: 6 connecting: cp1046_v4, cp1046_v6
[04:01:35] PROBLEM - IPsec on cp3015 is CRITICAL: Strongswan CRITICAL - ok: 6 connecting: cp1046_v4, cp1046_v6
[04:01:44] PROBLEM - IPsec on cp3017 is CRITICAL: Strongswan CRITICAL - ok: 6 connecting: cp1046_v4, cp1046_v6
[04:01:44] PROBLEM - IPsec on cp3016 is CRITICAL: Strongswan CRITICAL - ok: 6 connecting: cp1046_v4, cp1046_v6
[04:01:55] PROBLEM - IPsec on cp2021 is CRITICAL: Strongswan CRITICAL - ok: 6 connecting: cp1046_v4, cp1046_v6
[04:01:56] PROBLEM - LVS HTTP IPv4 on mobile-lb.eqiad.wikimedia.org is CRITICAL: No route to host
[04:02:25] PROBLEM - IPsec on cp2015 is CRITICAL: Strongswan CRITICAL - ok: 6 connecting: cp1046_v4, cp1046_v6
[04:02:35] PROBLEM - IPsec on cp2009 is CRITICAL: Strongswan CRITICAL - ok: 6 connecting: cp1046_v4, cp1046_v6
[04:02:35] PROBLEM - IPsec on cp4012 is CRITICAL: Strongswan CRITICAL - ok: 6 connecting: cp1046_v4, cp1046_v6
[04:02:45] PROBLEM - IPsec on cp3018 is CRITICAL: Strongswan CRITICAL - ok: 6 connecting: cp1046_v4, cp1046_v6
[04:03:06] PROBLEM - IPsec on cp4011 is CRITICAL: Strongswan CRITICAL - ok: 6 connecting: cp1046_v4, cp1046_v6
[04:03:06] PROBLEM - IPsec on cp4020 is CRITICAL: Strongswan CRITICAL - ok: 6 connecting: cp1046_v4, cp1046_v6
[04:03:06] PROBLEM - IPsec on cp4019 is CRITICAL: Strongswan CRITICAL - ok: 6 connecting: cp1046_v4, cp1046_v6
[04:03:44] RECOVERY - LVS HTTP IPv4 on mobile-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 533 bytes in 0.009 second response time
[04:04:34] PROBLEM - Disk space on labstore1002 is CRITICAL: DISK CRITICAL - /run/lock/storage-replicate-labstore-maps/snapshot is not accessible: Permission denied
[04:05:05] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 9.09% of data above the critical threshold [500.0]
[04:08:55] PROBLEM - YARN NodeManager Node-State on analytics1035 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:10:35] RECOVERY - YARN NodeManager Node-State on analytics1035 is OK: OK: YARN NodeManager analytics1035.eqiad.wmnet:8041 Node-State: RUNNING
[04:15:35] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[04:23:05] PROBLEM - LVS HTTP IPv4 on mobile-lb.eqiad.wikimedia.org is CRITICAL: No route to host
[04:24:55] RECOVERY - LVS HTTP IPv4 on mobile-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 533 bytes in 0.008 second response time
[04:28:59] !log l10nupdate@tin ResourceLoader cache refresh completed at Sat Sep 19 04:28:59 UTC 2015 (duration 28m 57s)
[04:29:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[04:30:48] Oh that's why
[04:59:54] PROBLEM - puppet last run on pybal-test2002 is CRITICAL: CRITICAL: puppet fail
[05:05:25] PROBLEM - LVS HTTP IPv4 on mobile-lb.eqiad.wikimedia.org is CRITICAL: No route to host
[05:07:14] RECOVERY - LVS HTTP IPv4 on mobile-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 534 bytes in 0.001 second response time
[05:21:02] !log powercycling cp1046, dead on console
[05:21:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[05:22:47] !log pybal-depooling cp1046 from eqiad/mobile until further investigation
[05:22:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[05:23:44] RECOVERY - Host cp1046 is UP: PING WARNING - Packet loss = 37%, RTA = 1.16 ms
[05:23:45] RECOVERY - IPsec on cp3018 is OK: Strongswan OK - 8 ESP OK
[05:24:04] RECOVERY - IPsec on cp2003 is OK: Strongswan OK - 8 ESP OK
[05:24:06] RECOVERY - IPsec on cp4019 is OK: Strongswan OK - 8 ESP OK
[05:24:06] RECOVERY - IPsec on cp4011 is OK: Strongswan OK - 8 ESP OK
[05:24:14] RECOVERY - IPsec on cp4020 is OK: Strongswan OK - 8 ESP OK
[05:24:16] RECOVERY - IPsec on cp3015 is OK: Strongswan OK - 8 ESP OK
[05:24:16] RECOVERY - IPsec on cp3017 is OK: Strongswan OK - 8 ESP OK
[05:24:16] RECOVERY - IPsec on cp3016 is OK: Strongswan OK - 8 ESP OK
[05:24:25] RECOVERY - IPsec on cp2021 is OK: Strongswan OK - 8 ESP OK
[05:25:04] RECOVERY - IPsec on cp2015 is OK: Strongswan OK - 8 ESP OK
[05:25:06] RECOVERY - IPsec on cp2009 is OK: Strongswan OK - 8 ESP OK
[05:25:24] RECOVERY - IPsec on cp4012 is OK: Strongswan OK - 8 ESP OK
[05:29:54] RECOVERY - puppet last run on pybal-test2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[05:32:45] PROBLEM - Host cp1046 is DOWN: PING CRITICAL - Packet loss = 100%
[05:34:55] PROBLEM - IPsec on cp3015 is CRITICAL: Strongswan CRITICAL - ok: 6 connecting: cp1046_v4, cp1046_v6
[05:34:56] PROBLEM - IPsec on cp3017 is CRITICAL: Strongswan CRITICAL - ok: 6 connecting: cp1046_v4, cp1046_v6
[05:34:56] PROBLEM - IPsec on cp3016 is CRITICAL: Strongswan CRITICAL - ok: 6 connecting: cp1046_v4, cp1046_v6
[05:35:04] PROBLEM - IPsec on cp2021 is CRITICAL: Strongswan CRITICAL - ok: 6 connecting: cp1046_v4, cp1046_v6
[05:35:36] PROBLEM - IPsec on cp2015 is CRITICAL: Strongswan CRITICAL - ok: 6 connecting: cp1046_v4, cp1046_v6
[05:35:45] PROBLEM - IPsec on cp2009 is CRITICAL: Strongswan CRITICAL - ok: 6 connecting: cp1046_v4, cp1046_v6
[05:35:56] PROBLEM - IPsec on cp4012 is CRITICAL: Strongswan CRITICAL - ok: 6 connecting: cp1046_v4, cp1046_v6
[05:36:04] PROBLEM - IPsec on cp3018 is CRITICAL: Strongswan CRITICAL - ok: 6 connecting: cp1046_v4, cp1046_v6
[05:36:24] PROBLEM - IPsec on cp2003 is CRITICAL: Strongswan CRITICAL - ok: 6 connecting: cp1046_v4, cp1046_v6
[05:36:34] PROBLEM - IPsec on cp4011 is CRITICAL: Strongswan CRITICAL - ok: 6 connecting: cp1046_v4, cp1046_v6
[05:36:35] PROBLEM - IPsec on cp4019 is CRITICAL: Strongswan CRITICAL - ok: 6 connecting: cp1046_v4, cp1046_v6
[05:36:35] PROBLEM - IPsec on cp4020 is CRITICAL: Strongswan CRITICAL - ok: 6 connecting: cp1046_v4, cp1046_v6
[05:38:16] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 9.09% of data above the critical threshold [500.0]
[05:47:05] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[05:49:28] !log aaron@tin Synchronized php-1.26wmf23/extensions/TitleBlacklist: 80d3a21a51f9c54ed2d94 (duration: 00m 12s)
[05:49:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[05:55:08] (CR) Papaul: [C: 2] "looks good to me." [puppet] - https://gerrit.wikimedia.org/r/239513 (https://phabricator.wikimedia.org/T113079) (owner: Dzahn)
[05:55:27] operations, OTRS, Security, HTTPS: SSL-config of the OTRS is outdated - https://phabricator.wikimedia.org/T91504#1655521 (Dzahn) >>! In T91504#1653983, @Krenair wrote: > Which depends on {T111532} (actually, it looks to me like this was already done, will comment there in a sec) It is. It's "mendel...
[05:56:29] operations, OTRS: upgrade iodine to jessie or find a new host with jessie for OTRS - https://phabricator.wikimedia.org/T105125#1655522 (Dzahn) >>! In T105125#1653951, @Krenair wrote: > Does this block {T74109}? No, this task is outdated. It is now going to move to a VM. But ask @akosiaris to confirm.
[06:05:54] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 7 below the confidence bounds
[06:11:20] operations, Discovery, Maps: Support incubator - https://phabricator.wikimedia.org/T113122#1655530 (Dzahn)
[06:11:49] operations, Discovery, Maps: Support incubator - https://phabricator.wikimedia.org/T113122#1655482 (Dzahn) @tfinc said: //Operations will need to update the referrer in Varnish ERB. Similar to https://gerrit.wikimedia.org/r/#/c/239279///
[06:16:16] operations, Discovery, Maps: maps: support wikivoyages in incubator - https://phabricator.wikimedia.org/T113122#1655534 (Dzahn)
[06:17:06] operations, Discovery, Maps: maps: support wikivoyages in incubator - https://phabricator.wikimedia.org/T113122#1655482 (Dzahn) @Revi said // It is sad that incubator.wikimedia.org is not included. (Wikivoyages not ready for their own wikis resides on this domain under prefix Wy/**, so Korean one beco...
[06:19:27] mutante: that's included in my OP
[06:19:28] lol
[06:30:35] PROBLEM - puppet last run on cp2001 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:05] PROBLEM - puppet last run on mw1158 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:14] PROBLEM - puppet last run on lvs1003 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:55] PROBLEM - puppet last run on mw1110 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:32:06] PROBLEM - puppet last run on mw2018 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:35] PROBLEM - puppet last run on mw2126 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:14] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:35] PROBLEM - puppet last run on mw2050 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:40:55] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 8 below the confidence bounds
[06:44:57] Ops-Access-Requests, operations: Expand shell access for aklapper on Phabricator - https://phabricator.wikimedia.org/T113124#1655546 (Aklapper) NEW
[06:45:12] (PS4) Aklapper: Allow aklapper to reset user auths and delete accounts in Phab [puppet] - https://gerrit.wikimedia.org/r/219151 (https://phabricator.wikimedia.org/T113124)
[06:49:36] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 8 below the confidence bounds
[06:56:36] RECOVERY - puppet last run on cp2001 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[06:56:54] RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[06:57:06] RECOVERY - puppet last run on mw1158 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:14] RECOVERY - puppet last run on lvs1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:27] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures
[06:57:55] RECOVERY - puppet last run on mw2050 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[06:57:55] RECOVERY - puppet last run on mw1110 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:06] RECOVERY - puppet last run on mw2018 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:12:25] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected
[08:42:00] <_joe_> !log cp1046 dead on console again, powercycling to inspect it
[08:42:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:44:35] RECOVERY - Host cp1046 is UP: PING OK - Packet loss = 0%, RTA = 1.82 ms
[08:44:44] RECOVERY - IPsec on cp2015 is OK: Strongswan OK - 8 ESP OK
[08:44:54] RECOVERY - IPsec on cp2009 is OK: Strongswan OK - 8 ESP OK
[08:45:16] RECOVERY - IPsec on cp2003 is OK: Strongswan OK - 8 ESP OK
[08:45:24] RECOVERY - IPsec on cp4012 is OK: Strongswan OK - 8 ESP OK
[08:45:25] RECOVERY - IPsec on cp3018 is OK: Strongswan OK - 8 ESP OK
[08:45:44] RECOVERY - IPsec on cp2021 is OK: Strongswan OK - 8 ESP OK
[08:45:45] RECOVERY - IPsec on cp4020 is OK: Strongswan OK - 8 ESP OK
[08:45:46] RECOVERY - IPsec on cp4011 is OK: Strongswan OK - 8 ESP OK
[08:45:54] RECOVERY - IPsec on cp4019 is OK: Strongswan OK - 8 ESP OK
[08:45:55] RECOVERY - IPsec on cp3017 is OK: Strongswan OK - 8 ESP OK
[08:45:55] RECOVERY - IPsec on cp3016 is OK: Strongswan OK - 8 ESP OK
[08:45:56] RECOVERY - IPsec on cp3015 is OK: Strongswan OK - 8 ESP OK
[08:55:07] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 214, down: 1, dormant: 0, excluded: 0, unused: 0; xe-5/3/1: down - Transit: ! GTT/TiNet (Service 589, Circuit 02773-003-01) {#2013} [10Gbps]
[09:02:24] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 216, down: 0, dormant: 0, excluded: 0, unused: 0
[09:12:34] PROBLEM - puppet last run on cp3020 is CRITICAL: CRITICAL: puppet fail
[09:33:25] PROBLEM - puppet last run on mw2125 is CRITICAL: CRITICAL: puppet fail
[09:39:04] RECOVERY - puppet last run on cp3020 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[10:01:35] RECOVERY - puppet last run on mw2125 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures
[12:04:35] PROBLEM - puppet last run on ms-be1008 is CRITICAL: CRITICAL: Puppet has 1 failures
[12:06:15] PROBLEM - puppet last run on ganeti2006 is CRITICAL: CRITICAL: Puppet has 1 failures
[12:31:05] RECOVERY - puppet last run on ms-be1008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[12:32:45] RECOVERY - puppet last run on ganeti2006 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures
[14:21:06] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 7.69% of data above the critical threshold [500.0]
[14:33:36] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[16:32:44] operations, Database: Drop *_old database tables from Wikimedia wikis - https://phabricator.wikimedia.org/T54932#1656171 (Reedy) I just came across enwikisource.logging_pre_1_10, which has over 130k rows It wouldn't surprise me if there's similar tables on other wikis
[16:34:06] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [500.0]
[16:35:02] operations, Database: Drop *_old database tables from Wikimedia wikis - https://phabricator.wikimedia.org/T54932#1656174 (Reedy) >>! In T54932#1575467, @Krenair wrote: > devwikiinternal.old, rel13testwiki.old, zh_cnwiki.old > I'm not sure these ones should be deleted. These wikis are 'deleted' in the sense...
[16:44:54] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[16:55:48] (PS1) Reedy: Bundle jquery 1.9.1 [software/dbtree] - https://gerrit.wikimedia.org/r/239568 (https://phabricator.wikimedia.org/T96499)
[16:57:54] operations, WMF-Legal, Wikimedia-General-or-Unknown, Database, Patch-For-Review: dbtree loads third party resources (from jquery.com and google.com) - https://phabricator.wikimedia.org/T96499#1656213 (Reedy) >>! In T96499#1560208, @Krenair wrote: > It's not just jQuery, but also the Google Visual...
[17:13:41] (PS1) Andrew Bogott: Labs: Include python-openstackclient on the controller host. [puppet] - https://gerrit.wikimedia.org/r/239570
[17:41:36] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:43:15] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 61445 bytes in 0.419 second response time
[17:44:05] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 8.33% of data above the critical threshold [500.0]
[17:50:17] (CR) Addshore: Rsync api log archives from fluorine to stat1002 (2 comments) [puppet] - https://gerrit.wikimedia.org/r/238798 (https://phabricator.wikimedia.org/T112744) (owner: Addshore)
[17:50:52] (PS5) Addshore: Rsync api log archives from fluorine to stat1002 [puppet] - https://gerrit.wikimedia.org/r/238798 (https://bugzilla.wikimedia.org/112744)
[17:54:35] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[18:34:17] !log reactivating BGP with GTT @ eqiad
[18:34:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[18:38:35] !log pooling back cp1046 to pybal eqiad/mobile, has stayed stable
[18:38:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[18:39:18] <_joe_> :)
[18:39:29] <_joe_> I was about to ask you if we should've repooled it
[18:41:35] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:42:33] operations, Traffic, Pybal: pybal fails to detect dead servers under production lb IPs for port 80 - https://phabricator.wikimedia.org/T113151#1656273 (faidon) NEW
[18:44:28] <_joe_> oh should I restart gitblit or let it rot as it deserves?
[18:45:28] <_joe_> !log restarted gitblit. I will now substitute myself with a clever perl one-liner.
[18:45:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[18:52:06] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 61475 bytes in 0.356 second response time
[18:54:00] operations, Traffic: LVS HTTPS IPv6 on mobile-lb.eqiad alert occasionally flapping - https://phabricator.wikimedia.org/T113154#1656300 (faidon) NEW
[19:04:12] !log salt rm /etc/systemd/system/txstatsd.service from all cp*, leftover because of ::txstatsd::decommission (removed with 4a1d4e) missing it
[19:04:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:04:55] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:07:31] (PS1) Milimetric: [WIP] Add statistics mount [puppet] - https://gerrit.wikimedia.org/r/239577 (https://phabricator.wikimedia.org/T111845)
[19:08:01] (CR) Milimetric: [C: -1] [WIP] Add statistics mount [puppet] - https://gerrit.wikimedia.org/r/239577 (https://phabricator.wikimedia.org/T111845) (owner: Milimetric)
[19:15:24] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 61475 bytes in 0.404 second response time
[19:19:44] RECOVERY - Disk space on labstore1002 is OK: DISK OK
[19:21:16] PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 12.50% of data above the critical threshold [5000000.0]
[19:26:36] RECOVERY - Kafka Broker Replica Max Lag on kafka1018 is OK: OK: Less than 1.00% above the threshold [1000000.0]
[19:29:54] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:42:15] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 61475 bytes in 0.478 second response time
[19:53:15] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:56:00] !log restarting once more gitblit, last chance
[19:56:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:56:36] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 61475 bytes in 0.203 second response time
[20:05:55] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:16:16] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 61476 bytes in 0.556 second response time
[20:29:40] (CR) Jforrester: "As I6e6ce9f03 is merged, can this be abandoned?" [puppet] - https://gerrit.wikimedia.org/r/232728 (https://phabricator.wikimedia.org/T109710) (owner: Yurik)
[21:18:19] (CR) QChris: Make gerrit offer newer key exchange algorithms for new sshs (2 comments) [puppet] - https://gerrit.wikimedia.org/r/237753 (https://phabricator.wikimedia.org/T112025) (owner: QChris)
[21:48:03] operations, Traffic: LVS HTTPS IPv6 on mobile-lb.eqiad alert occasionally flapping - https://phabricator.wikimedia.org/T113154#1656455 (BBlack) FWIW, I think this pre-dated IPSec and probably isn't related to it. In earlier investigations it looked like a monitoring failure of some kind and not real. La...
[21:48:15] PROBLEM - puppet last run on mw2004 is CRITICAL: CRITICAL: puppet fail
[21:58:55] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:05:56] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 61461 bytes in 0.314 second response time
[22:11:25] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:16:26] RECOVERY - puppet last run on mw2004 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures
[22:21:54] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 61461 bytes in 0.101 second response time
[22:30:10] operations, HHVM: /var/cache/hhvm/cli.hhbc.sq3 owned by root on some mw hosts - https://phabricator.wikimedia.org/T112517#1656458 (bd808) MediaWiki-Vagrant added some Puppet rules to fix this sort of thing up when it happens (https://github.com/wikimedia/mediawiki-vagrant/blob/87fd342c2df861a2e76c7dff6c04c...
[22:36:16] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:46:44] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 61462 bytes in 0.204 second response time
[22:50:32] (CR) MarcoAurelio: [C: -1] "Lacks community consensus per standard eswiki practices. See phabricator ticket." (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/239455 (https://phabricator.wikimedia.org/T113096) (owner: Platonides)
[22:54:39] (CR) MarcoAurelio: "Perfect. Thanks." [mediawiki-config] - https://gerrit.wikimedia.org/r/239308 (https://phabricator.wikimedia.org/T72829) (owner: Alex Monk)
[23:01:04] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:04:25] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 61461 bytes in 0.054 second response time
[23:08:26] !log beginning Cassandra repair on restbase1004 (nodetool repair -pr)
[23:08:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:08:36] error 503, it.wiki from Italy
[23:08:40] need a report?
[23:09:24] (CR) Alex Monk: "Looks like has_echo and has_flaggedrevs might be working differently. That said, the perl script clearly hasn't been run for a while." [software] - https://gerrit.wikimedia.org/r/227505 (https://phabricator.wikimedia.org/T107094) (owner: Alex Monk)
[23:10:06] Vito, continuously?
[23:10:08] or just a single request?
[23:10:25] lemme refresh
[23:10:37] only twice in a row
[23:10:48] so let's see
[23:11:35] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:12:04] !log beginning Cassandra repair on restbase1005 (nodetool repair -pr)
[23:12:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:22:04] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 61461 bytes in 0.062 second response time
[23:34:35] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:51:05] (CR) Yurik: "From the looks of it, that other patch breaks water polygons pretty badly. I might be wrong about the cause of it, but it has happened rig" [puppet] - https://gerrit.wikimedia.org/r/232728 (https://phabricator.wikimedia.org/T109710) (owner: Yurik)