[00:29:54] (PS1) Thcipriani: Stop executing on failure [tools/scap] - https://gerrit.wikimedia.org/r/239521
[00:42:19] operations, Annual-Report: redirect to older annual reports from annual.wikimedia.org - https://phabricator.wikimedia.org/T113113#1655304 (Dzahn) NEW
[00:43:14] operations, Annual-Report: redirect to older annual reports from annual.wikimedia.org - https://phabricator.wikimedia.org/T113113#1655311 (Dzahn)
[01:02:35] PROBLEM - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out
[01:03:22] =P
[01:03:44] not quite a false alarm but not enough of an alarm to do anything... unless you are brandon ;]
[01:04:16] RECOVERY - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 10512 bytes in 0.121 second response time
[01:11:14] Ops-Access-Requests, operations: Requesting access to stat1002 (Hue / Hive) for bmansurov - https://phabricator.wikimedia.org/T113069#1655355 (bmansurov) @Dzahn, I have signed L3. Thanks.
[01:12:40] Blocked-on-Operations, operations, Phabricator, Traffic: Phabricator needs to expose ssh - https://phabricator.wikimedia.org/T100519#1655357 (greg) >>! In T100519#1529708, @BBlack wrote: > Basically, yeah. I ran down a similar plan with @Chasemp and I think he's working on some patches for it. @bb...
[01:18:51] robh, yeah that's been happening for ages
[01:19:14] PROBLEM - LVS HTTP IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out
[01:19:17] yes paging all the ops who are awake on their phones each time =P
[01:19:30] Blocked-on-Operations, operations, Phabricator, Release-Engineering-Team, Traffic: Phabricator needs to expose ssh - https://phabricator.wikimedia.org/T100519#1655387 (greg)
[01:19:34] PROBLEM - IPsec on cp1047 is CRITICAL: Strongswan CRITICAL - ok: 23 not-conn: cp4012_v6
[01:20:54] RECOVERY - LVS HTTP IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 503 bytes in 1.003 second response time
[01:21:24] RECOVERY - IPsec on cp1047 is OK: Strongswan OK - 24 ESP OK
[01:21:45] PROBLEM - puppet last run on cp4015 is CRITICAL: CRITICAL: puppet fail
[01:25:25] Ops-Access-Requests, operations, Wikimedia-Blog: stat1003/EventLogging access for asherman - https://phabricator.wikimedia.org/T113118#1655394 (Tbayer) NEW a:Ottomata
[01:42:15] PROBLEM - LVS HTTP IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out
[01:44:04] RECOVERY - LVS HTTP IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 503 bytes in 3.007 second response time
[01:48:16] RECOVERY - puppet last run on cp4015 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[01:59:05] PROBLEM - puppet last run on mw1185 is CRITICAL: CRITICAL: Puppet has 1 failures
[02:20:20] !log l10nupdate@tin Synchronized php-1.26wmf23/cache/l10n: l10nupdate for 1.26wmf23 (duration: 06m 05s)
[02:20:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:23:29] !log l10nupdate@tin LocalisationUpdate completed (1.26wmf23) at 2015-09-19 02:23:29+00:00
[02:23:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:25:34] RECOVERY - puppet last run on mw1185 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[03:00:13] operations, Discovery, Maps: Support incubator - https://phabricator.wikimedia.org/T113122#1655482 (Revi) NEW
[03:04:36] PROBLEM - Disk space on labstore1002 is CRITICAL: DISK CRITICAL - /run/lock/storage-replicate-labstore-others/snapshot is not accessible: Permission denied
[03:23:55] RECOVERY - Disk space on labstore1002 is OK: DISK OK
[03:25:35] PROBLEM - puppet last run on maerlant is CRITICAL: CRITICAL: puppet fail
[03:53:55] RECOVERY - puppet last run on maerlant is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[03:59:15] PROBLEM - Host cp1046 is DOWN: PING CRITICAL - Packet loss = 100%
[04:01:26] PROBLEM - IPsec on cp2003 is CRITICAL: Strongswan CRITICAL - ok: 6 connecting: cp1046_v4, cp1046_v6
[04:01:35] PROBLEM - IPsec on cp3015 is CRITICAL: Strongswan CRITICAL - ok: 6 connecting: cp1046_v4, cp1046_v6
[04:01:44] PROBLEM - IPsec on cp3017 is CRITICAL: Strongswan CRITICAL - ok: 6 connecting: cp1046_v4, cp1046_v6
[04:01:44] PROBLEM - IPsec on cp3016 is CRITICAL: Strongswan CRITICAL - ok: 6 connecting: cp1046_v4, cp1046_v6
[04:01:55] PROBLEM - IPsec on cp2021 is CRITICAL: Strongswan CRITICAL - ok: 6 connecting: cp1046_v4, cp1046_v6
[04:01:56] PROBLEM - LVS HTTP IPv4 on mobile-lb.eqiad.wikimedia.org is CRITICAL: No route to host
[04:02:25] PROBLEM - IPsec on cp2015 is CRITICAL: Strongswan CRITICAL - ok: 6 connecting: cp1046_v4, cp1046_v6
[04:02:35] PROBLEM - IPsec on cp2009 is CRITICAL: Strongswan CRITICAL - ok: 6 connecting: cp1046_v4, cp1046_v6
[04:02:35] PROBLEM - IPsec on cp4012 is CRITICAL: Strongswan CRITICAL - ok: 6 connecting: cp1046_v4, cp1046_v6
[04:02:45] PROBLEM - IPsec on cp3018 is CRITICAL: Strongswan CRITICAL - ok: 6 connecting: cp1046_v4, cp1046_v6
[04:03:06] PROBLEM - IPsec on cp4011 is CRITICAL: Strongswan CRITICAL - ok: 6 connecting: cp1046_v4, cp1046_v6
[04:03:06] PROBLEM - IPsec on cp4020 is CRITICAL: Strongswan CRITICAL - ok: 6 connecting: cp1046_v4, cp1046_v6
[04:03:06] PROBLEM - IPsec on cp4019 is CRITICAL: Strongswan CRITICAL - ok: 6 connecting: cp1046_v4, cp1046_v6
[04:03:44] RECOVERY - LVS HTTP IPv4 on mobile-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 533 bytes in 0.009 second response time
[04:04:34] PROBLEM - Disk space on labstore1002 is CRITICAL: DISK CRITICAL - /run/lock/storage-replicate-labstore-maps/snapshot is not accessible: Permission denied
[04:05:05] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 9.09% of data above the critical threshold [500.0]
[04:08:55] PROBLEM - YARN NodeManager Node-State on analytics1035 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:10:35] RECOVERY - YARN NodeManager Node-State on analytics1035 is OK: OK: YARN NodeManager analytics1035.eqiad.wmnet:8041 Node-State: RUNNING
[04:15:35] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[04:23:05] PROBLEM - LVS HTTP IPv4 on mobile-lb.eqiad.wikimedia.org is CRITICAL: No route to host
[04:24:55] RECOVERY - LVS HTTP IPv4 on mobile-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 533 bytes in 0.008 second response time
[04:28:59] !log l10nupdate@tin ResourceLoader cache refresh completed at Sat Sep 19 04:28:59 UTC 2015 (duration 28m 57s)
[04:29:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[04:30:48] Oh that's why
[04:59:54] PROBLEM - puppet last run on pybal-test2002 is CRITICAL: CRITICAL: puppet fail
[05:05:25] PROBLEM - LVS HTTP IPv4 on mobile-lb.eqiad.wikimedia.org is CRITICAL: No route to host
[05:07:14] RECOVERY - LVS HTTP IPv4 on mobile-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 534 bytes in 0.001 second response time
[05:21:02] !log powercycling cp1046, dead on console
[05:21:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[05:22:47] !log pybal-depooling cp1046 from eqiad/mobile until further investigation
[05:22:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[05:23:44] RECOVERY - Host cp1046 is UP: PING WARNING - Packet loss = 37%, RTA = 1.16 ms
[05:23:45] RECOVERY - IPsec on cp3018 is OK: Strongswan OK - 8 ESP OK
[05:24:04] RECOVERY - IPsec on cp2003 is OK: Strongswan OK - 8 ESP OK
[05:24:06] RECOVERY - IPsec on cp4019 is OK: Strongswan OK - 8 ESP OK
[05:24:06] RECOVERY - IPsec on cp4011 is OK: Strongswan OK - 8 ESP OK
[05:24:14] RECOVERY - IPsec on cp4020 is OK: Strongswan OK - 8 ESP OK
[05:24:16] RECOVERY - IPsec on cp3015 is OK: Strongswan OK - 8 ESP OK
[05:24:16] RECOVERY - IPsec on cp3017 is OK: Strongswan OK - 8 ESP OK
[05:24:16] RECOVERY - IPsec on cp3016 is OK: Strongswan OK - 8 ESP OK
[05:24:25] RECOVERY - IPsec on cp2021 is OK: Strongswan OK - 8 ESP OK
[05:25:04] RECOVERY - IPsec on cp2015 is OK: Strongswan OK - 8 ESP OK
[05:25:06] RECOVERY - IPsec on cp2009 is OK: Strongswan OK - 8 ESP OK
[05:25:24] RECOVERY - IPsec on cp4012 is OK: Strongswan OK - 8 ESP OK
[05:29:54] RECOVERY - puppet last run on pybal-test2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[05:32:45] PROBLEM - Host cp1046 is DOWN: PING CRITICAL - Packet loss = 100%
[05:34:55] PROBLEM - IPsec on cp3015 is CRITICAL: Strongswan CRITICAL - ok: 6 connecting: cp1046_v4, cp1046_v6
[05:34:56] PROBLEM - IPsec on cp3017 is CRITICAL: Strongswan CRITICAL - ok: 6 connecting: cp1046_v4, cp1046_v6
[05:34:56] PROBLEM - IPsec on cp3016 is CRITICAL: Strongswan CRITICAL - ok: 6 connecting: cp1046_v4, cp1046_v6
[05:35:04] PROBLEM - IPsec on cp2021 is CRITICAL: Strongswan CRITICAL - ok: 6 connecting: cp1046_v4, cp1046_v6
[05:35:36] PROBLEM - IPsec on cp2015 is CRITICAL: Strongswan CRITICAL - ok: 6 connecting: cp1046_v4, cp1046_v6
[05:35:45] PROBLEM - IPsec on cp2009 is CRITICAL: Strongswan CRITICAL - ok: 6 connecting: cp1046_v4, cp1046_v6
[05:35:56] PROBLEM - IPsec on cp4012 is CRITICAL: Strongswan CRITICAL - ok: 6 connecting: cp1046_v4, cp1046_v6
[05:36:04] PROBLEM - IPsec on cp3018 is CRITICAL: Strongswan CRITICAL - ok: 6 connecting: cp1046_v4, cp1046_v6
[05:36:24] PROBLEM - IPsec on cp2003 is CRITICAL: Strongswan CRITICAL - ok: 6 connecting: cp1046_v4, cp1046_v6
[05:36:34] PROBLEM - IPsec on cp4011 is CRITICAL: Strongswan CRITICAL - ok: 6 connecting: cp1046_v4, cp1046_v6
[05:36:35] PROBLEM - IPsec on cp4019 is CRITICAL: Strongswan CRITICAL - ok: 6 connecting: cp1046_v4, cp1046_v6
[05:36:35] PROBLEM - IPsec on cp4020 is CRITICAL: Strongswan CRITICAL - ok: 6 connecting: cp1046_v4, cp1046_v6
[05:38:16] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 9.09% of data above the critical threshold [500.0]
[05:47:05] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[05:49:28] !log aaron@tin Synchronized php-1.26wmf23/extensions/TitleBlacklist: 80d3a21a51f9c54ed2d94 (duration: 00m 12s)
[05:49:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[05:55:08] (CR) Papaul: [C: 2] "looks good to me." [puppet] - https://gerrit.wikimedia.org/r/239513 (https://phabricator.wikimedia.org/T113079) (owner: Dzahn)
[05:55:27] operations, OTRS, Security, HTTPS: SSL-config of the OTRS is outdated - https://phabricator.wikimedia.org/T91504#1655521 (Dzahn) >>! In T91504#1653983, @Krenair wrote: > Which depends on {T111532} (actually, it looks to me like this was already done, will comment there in a sec) It is. It's "mendel...
[05:56:29] operations, OTRS: upgrade iodine to jessie or find a new host with jessie for OTRS - https://phabricator.wikimedia.org/T105125#1655522 (Dzahn) >>! In T105125#1653951, @Krenair wrote: > Does this block {T74109}? No, this task is outdated. It is now going to move to a VM. But ask @akosiaris to confirm.
[06:05:54] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 7 below the confidence bounds
[06:11:20] operations, Discovery, Maps: Support incubator - https://phabricator.wikimedia.org/T113122#1655530 (Dzahn)
[06:11:49] operations, Discovery, Maps: Support incubator - https://phabricator.wikimedia.org/T113122#1655482 (Dzahn) @tfinc said: //Operations will need to update the referrer in Varnish ERB. Similar to https://gerrit.wikimedia.org/r/#/c/239279///
[06:16:16] operations, Discovery, Maps: maps: support wikivoyages in incubator - https://phabricator.wikimedia.org/T113122#1655534 (Dzahn)
[06:17:06] operations, Discovery, Maps: maps: support wikivoyages in incubator - https://phabricator.wikimedia.org/T113122#1655482 (Dzahn) @Revi said // It is sad that incubator.wikimedia.org is not included. (Wikivoyages not ready for their own wikis resides on this domain under prefix Wy/**, so Korean one beco...
[06:19:27] mutante: that's included in my OP
[06:19:28] lol
[06:30:35] PROBLEM - puppet last run on cp2001 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:05] PROBLEM - puppet last run on mw1158 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:14] PROBLEM - puppet last run on lvs1003 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:55] PROBLEM - puppet last run on mw1110 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:32:06] PROBLEM - puppet last run on mw2018 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:35] PROBLEM - puppet last run on mw2126 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:14] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:35] PROBLEM - puppet last run on mw2050 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:40:55] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 8 below the confidence bounds
[06:44:57] Ops-Access-Requests, operations: Expand shell access for aklapper on Phabricator - https://phabricator.wikimedia.org/T113124#1655546 (Aklapper) NEW
[06:45:12] (PS4) Aklapper: Allow aklapper to reset user auths and delete accounts in Phab [puppet] - https://gerrit.wikimedia.org/r/219151 (https://phabricator.wikimedia.org/T113124)
[06:49:36] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 8 below the confidence bounds
[06:56:36] RECOVERY - puppet last run on cp2001 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[06:56:54] RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[06:57:06] RECOVERY - puppet last run on mw1158 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:14] RECOVERY - puppet last run on lvs1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:27] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures
[06:57:55] RECOVERY - puppet last run on mw2050 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[06:57:55] RECOVERY - puppet last run on mw1110 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:06] RECOVERY - puppet last run on mw2018 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:12:25] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected
[08:42:00] <_joe_> !log cp1046 dead on console again, powercycling to inspect it
[08:42:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:44:35] RECOVERY - Host cp1046 is UP: PING OK - Packet loss = 0%, RTA = 1.82 ms
[08:44:44] RECOVERY - IPsec on cp2015 is OK: Strongswan OK - 8 ESP OK
[08:44:54] RECOVERY - IPsec on cp2009 is OK: Strongswan OK - 8 ESP OK
[08:45:16] RECOVERY - IPsec on cp2003 is OK: Strongswan OK - 8 ESP OK
[08:45:24] RECOVERY - IPsec on cp4012 is OK: Strongswan OK - 8 ESP OK
[08:45:25] RECOVERY - IPsec on cp3018 is OK: Strongswan OK - 8 ESP OK
[08:45:44] RECOVERY - IPsec on cp2021 is OK: Strongswan OK - 8 ESP OK
[08:45:45] RECOVERY - IPsec on cp4020 is OK: Strongswan OK - 8 ESP OK
[08:45:46] RECOVERY - IPsec on cp4011 is OK: Strongswan OK - 8 ESP OK
[08:45:54] RECOVERY - IPsec on cp4019 is OK: Strongswan OK - 8 ESP OK
[08:45:55] RECOVERY - IPsec on cp3017 is OK: Strongswan OK - 8 ESP OK
[08:45:55] RECOVERY - IPsec on cp3016 is OK: Strongswan OK - 8 ESP OK
[08:45:56] RECOVERY - IPsec on cp3015 is OK: Strongswan OK - 8 ESP OK
[08:55:07] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 214, down: 1, dormant: 0, excluded: 0, unused: 0; xe-5/3/1: down - Transit: ! GTT/TiNet (Service 589, Circuit 02773-003-01) {#2013} [10Gbps]
[09:02:24] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 216, down: 0, dormant: 0, excluded: 0, unused: 0
[09:12:34] PROBLEM - puppet last run on cp3020 is CRITICAL: CRITICAL: puppet fail
[09:33:25] PROBLEM - puppet last run on mw2125 is CRITICAL: CRITICAL: puppet fail
[09:39:04] RECOVERY - puppet last run on cp3020 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[10:01:35] RECOVERY - puppet last run on mw2125 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures
[12:04:35] PROBLEM - puppet last run on ms-be1008 is CRITICAL: CRITICAL: Puppet has 1 failures
[12:06:15] PROBLEM - puppet last run on ganeti2006 is CRITICAL: CRITICAL: Puppet has 1 failures
[12:31:05] RECOVERY - puppet last run on ms-be1008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[12:32:45] RECOVERY - puppet last run on ganeti2006 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures
[14:21:06] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 7.69% of data above the critical threshold [500.0]
[14:33:36] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[16:32:44] operations, Database: Drop *_old database tables from Wikimedia wikis - https://phabricator.wikimedia.org/T54932#1656171 (Reedy) I just came across enwikisource.logging_pre_1_10, which has over 130k rows It wouldn't surprise me if there's similar tables on other wikis
[16:34:06] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [500.0]
[16:35:02] operations, Database: Drop *_old database tables from Wikimedia wikis - https://phabricator.wikimedia.org/T54932#1656174 (Reedy) >>! In T54932#1575467, @Krenair wrote: > devwikiinternal.old, rel13testwiki.old, zh_cnwiki.old > I'm not sure these ones should be deleted. These wikis are 'deleted' in the sense...
[16:44:54] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[16:55:48] (PS1) Reedy: Bundle jquery 1.9.1 [software/dbtree] - https://gerrit.wikimedia.org/r/239568 (https://phabricator.wikimedia.org/T96499)
[16:57:54] operations, WMF-Legal, Wikimedia-General-or-Unknown, Database, Patch-For-Review: dbtree loads third party resources (from jquery.com and google.com) - https://phabricator.wikimedia.org/T96499#1656213 (Reedy) >>! In T96499#1560208, @Krenair wrote: > It's not just jQuery, but also the Google Visual...
[17:13:41] (PS1) Andrew Bogott: Labs: Include python-openstackclient on the controller host. [puppet] - https://gerrit.wikimedia.org/r/239570
[17:41:36] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:43:15] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 61445 bytes in 0.419 second response time
[17:44:05] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 8.33% of data above the critical threshold [500.0]
[17:50:17] (CR) Addshore: Rsync api log archives from fluorine to stat1002 (2 comments) [puppet] - https://gerrit.wikimedia.org/r/238798 (https://phabricator.wikimedia.org/T112744) (owner: Addshore)
[17:50:52] (PS5) Addshore: Rsync api log archives from fluorine to stat1002 [puppet] - https://gerrit.wikimedia.org/r/238798 (https://bugzilla.wikimedia.org/112744)
[17:54:35] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[18:34:17] !log reactivating BGP with GTT @ eqiad
[18:34:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[18:38:35] !log pooling back cp1046 to pybal eqiad/mobile, has stayed stable
[18:38:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[18:39:18] <_joe_> :)
[18:39:29] <_joe_> I was about to ask you if we should've repooled it
[18:41:35] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:42:33] operations, Traffic, Pybal: pybal fails to detect dead servers under production lb IPs for port 80 - https://phabricator.wikimedia.org/T113151#1656273 (faidon) NEW
[18:44:28] <_joe_> oh should I restart gitblit or let it rot as it deserves?
[18:45:28] <_joe_> !log restarted gitblit. I will now substitute myself with a clever perl one-liner.
[18:45:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[18:52:06] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 61475 bytes in 0.356 second response time
[18:54:00] operations, Traffic: LVS HTTPS IPv6 on mobile-lb.eqiad alert occasionally flapping - https://phabricator.wikimedia.org/T113154#1656300 (faidon) NEW
[19:04:12] !log salt rm /etc/systemd/system/txstatsd.service from all cp*, leftover because of ::txstatsd::decommission (removed with 4a1d4e) missing it
[19:04:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:04:55] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:07:31] (PS1) Milimetric: [WIP] Add statistics mount [puppet] - https://gerrit.wikimedia.org/r/239577 (https://phabricator.wikimedia.org/T111845)
[19:08:01] (CR) Milimetric: [C: -1] [WIP] Add statistics mount [puppet] - https://gerrit.wikimedia.org/r/239577 (https://phabricator.wikimedia.org/T111845) (owner: Milimetric)
[19:15:24] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 61475 bytes in 0.404 second response time
[19:19:44] RECOVERY - Disk space on labstore1002 is OK: DISK OK
[19:21:16] PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 12.50% of data above the critical threshold [5000000.0]
[19:26:36] RECOVERY - Kafka Broker Replica Max Lag on kafka1018 is OK: OK: Less than 1.00% above the threshold [1000000.0]
[19:29:54] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:42:15] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 61475 bytes in 0.478 second response time
[19:53:15] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:56:00] !log restarting once more gitblit, last chance
[19:56:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:56:36] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 61475 bytes in 0.203 second response time
[20:05:55] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:16:16] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 61476 bytes in 0.556 second response time
[20:29:40] (CR) Jforrester: "As I6e6ce9f03 is merged, can this be abandoned?" [puppet] - https://gerrit.wikimedia.org/r/232728 (https://phabricator.wikimedia.org/T109710) (owner: Yurik)
[21:18:19] (CR) QChris: Make gerrit offer newer key exchange algorithms for new sshs (2 comments) [puppet] - https://gerrit.wikimedia.org/r/237753 (https://phabricator.wikimedia.org/T112025) (owner: QChris)
[21:48:03] operations, Traffic: LVS HTTPS IPv6 on mobile-lb.eqiad alert occasionally flapping - https://phabricator.wikimedia.org/T113154#1656455 (BBlack) FWIW, I think this pre-dated IPSec and probably isn't related to it. In earlier investigations it looked like a monitoring failure of some kind and not real. La...
[21:48:15] PROBLEM - puppet last run on mw2004 is CRITICAL: CRITICAL: puppet fail
[21:58:55] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:05:56] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 61461 bytes in 0.314 second response time
[22:11:25] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:16:26] RECOVERY - puppet last run on mw2004 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures
[22:21:54] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 61461 bytes in 0.101 second response time
[22:30:10] operations, HHVM: /var/cache/hhvm/cli.hhbc.sq3 owned by root on some mw hosts - https://phabricator.wikimedia.org/T112517#1656458 (bd808) MediaWiki-Vagrant added some Puppet rules to fix this sort of thing up when it happens (https://github.com/wikimedia/mediawiki-vagrant/blob/87fd342c2df861a2e76c7dff6c04c...
[22:36:16] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:46:44] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 61462 bytes in 0.204 second response time
[22:50:32] (CR) MarcoAurelio: [C: -1] "Lacks community consensus per standard eswiki practices. See phabricator ticket." (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/239455 (https://phabricator.wikimedia.org/T113096) (owner: Platonides)
[22:54:39] (CR) MarcoAurelio: "Perfect. Thanks." [mediawiki-config] - https://gerrit.wikimedia.org/r/239308 (https://phabricator.wikimedia.org/T72829) (owner: Alex Monk)
[23:01:04] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:04:25] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 61461 bytes in 0.054 second response time
[23:08:26] !log beginning Cassandra repair on restbase1004 (nodetool repair -pr)
[23:08:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:08:36] error 503, it.wiki from Italy
[23:08:40] need a report?
[23:09:24] (CR) Alex Monk: "Looks like has_echo and has_flaggedrevs might be working differently. That said, the perl script clearly hasn't been run for a while." [software] - https://gerrit.wikimedia.org/r/227505 (https://phabricator.wikimedia.org/T107094) (owner: Alex Monk)
[23:10:06] Vito, continuously?
[23:10:08] or just a single request?
[23:10:25] lemme refresh
[23:10:37] only twice in a row
[23:10:48] so let's see
[23:11:35] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:12:04] !log beginning Cassandra repair on restbase1005 (nodetool repair -pr)
[23:12:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:22:04] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 61461 bytes in 0.062 second response time
[23:34:35] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:51:05] (CR) Yurik: "From the looks of it, that other patch breaks water polygons pretty badly. I might be wrong about the cause of it, but it has happened rig" [puppet] - https://gerrit.wikimedia.org/r/232728 (https://phabricator.wikimedia.org/T109710) (owner: Yurik)