[02:40:07] PROBLEM - Varnish traffic drop between 30min ago and now at esams on icinga1001 is CRITICAL: 48.8 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[02:43:19] RECOVERY - Varnish traffic drop between 30min ago and now at esams on icinga1001 is OK: (C)60 le (W)70 le 92.67 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[07:36:27] PROBLEM - ElasticSearch unassigned shard check - 9243 on search.svc.eqiad.wmnet is CRITICAL: CRITICAL - eswiki_content_1521891951[6](2019-08-15T03:43:12.536Z), enwiki_content_1546970425[2](2019-08-15T03:43:02.394Z) https://wikitech.wikimedia.org/wiki/Search%23Administration
[08:03:33] ACKNOWLEDGEMENT - Host elastic1017 is DOWN: PING CRITICAL - Packet loss = 100% Effie Mouzeli Host will be retired - T230518
[08:22:53] (CR) Arturo Borrero Gonzalez: "This LGTM in general." (7 comments) [puppet] - https://gerrit.wikimedia.org/r/530580 (https://phabricator.wikimedia.org/T223907) (owner: Jhedden)
[08:28:25] !log running `_cluster/reroute?pretty&explain=true&retry_failed` on eqiad production-search cluster to force allocation of shards
[08:28:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:42:54] (CR) Alexandros Kosiaris: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/530712 (https://phabricator.wikimedia.org/T113114) (owner: Alexandros Kosiaris)
[09:53:16] Operations, ops-eqiad, DC-Ops, Epic, cloud-services-team (Kanban): relocate/reimage cloudvirt1023 with 10G interfaces - https://phabricator.wikimedia.org/T229871 (Andrew) I did a manual install-console on this host and it's doing its initial puppet run now.
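Note on the 08:28:25 !log entry: the call named there looks like the Elasticsearch cluster reroute endpoint with `retry_failed`, which asks the cluster to retry shards whose allocation previously failed. Below is a minimal sketch of how such a request might be built in Python. The `https` scheme, the explicit `=true` values, the empty POST body, and both function names are assumptions for illustration; the host and port are taken from the shard-check alert above, and the log does not show how the command was actually issued.

```python
import urllib.request
from urllib.parse import urlencode, urlunsplit


def reroute_url(host, port, scheme="https"):
    """Build the reroute URL with the query parameters named in the !log entry.

    The scheme is an assumption; the log only shows the path and query string.
    """
    query = urlencode({"pretty": "true", "explain": "true", "retry_failed": "true"})
    return urlunsplit((scheme, f"{host}:{port}", "/_cluster/reroute", query, ""))


def retry_failed_allocations(host, port):
    """Hypothetical helper: POST to the reroute endpoint and return the response text.

    Sending an empty body is an assumption; not called here, since it needs
    network access to the cluster.
    """
    request = urllib.request.Request(reroute_url(host, port), data=b"", method="POST")
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Only prints the URL; no request is sent.
    print(reroute_url("search.svc.eqiad.wmnet", 9243))
```

With `explain=true` the response should describe why each reroute decision was taken, which can help when shards remain unassigned after the retry.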
[09:53:35] PROBLEM - ensure kvm processes are running on cloudvirt1023 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:54:13] ACKNOWLEDGEMENT - ensure kvm processes are running on cloudvirt1023 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 Arturo Borrero Gonzalez rebuilding https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[10:11:29] RECOVERY - Check the Netbox report-s- puppetdb for fail status. on netmon1002 is OK: puppetdb.PuppetDB OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[10:23:41] (PS3) Andrew Bogott: cloud recursors: alias 'puppet' to the new in-labs puppetmaster [puppet] - https://gerrit.wikimedia.org/r/530341 (https://phabricator.wikimedia.org/T171188)
[10:23:43] (PS2) Andrew Bogott: labpuppetmaster1001/1002: Clean up after moving puppetmasters to the cloud [puppet] - https://gerrit.wikimedia.org/r/530382 (https://phabricator.wikimedia.org/T171188)
[10:23:45] (PS1) Andrew Bogott: cloudvirt1023: update nic name for Stretch [puppet] - https://gerrit.wikimedia.org/r/530763
[10:24:42] (CR) Andrew Bogott: [C: +2] cloudvirt1023: update nic name for Stretch [puppet] - https://gerrit.wikimedia.org/r/530763 (owner: Andrew Bogott)
[10:25:22] RECOVERY - ElasticSearch unassigned shard check - 9243 on search.svc.eqiad.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration
[10:50:15] (PS1) Alex Monk: Add missing cloudinfra contact group [puppet] - https://gerrit.wikimedia.org/r/530765 (https://phabricator.wikimedia.org/T230674)
[11:03:46] (PS4) Andrew Bogott: cloud recursors: alias 'puppet' to the new in-labs puppetmaster [puppet] - https://gerrit.wikimedia.org/r/530341 (https://phabricator.wikimedia.org/T171188)
[11:03:48] (PS3) Andrew Bogott: labpuppetmaster1001/1002: Clean up after moving puppetmasters to the cloud [puppet] - https://gerrit.wikimedia.org/r/530382 (https://phabricator.wikimedia.org/T171188)
[11:03:50] (PS1) Andrew Bogott: cloudvirt1023: change network names again [puppet] - https://gerrit.wikimedia.org/r/530766
[11:04:48] (CR) Andrew Bogott: [C: +2] cloudvirt1023: change network names again [puppet] - https://gerrit.wikimedia.org/r/530766 (owner: Andrew Bogott)
[11:07:54] RECOVERY - ensure kvm processes are running on cloudvirt1023 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[11:33:25] (PS1) MarcoAurelio: WIP [mediawiki-config] - https://gerrit.wikimedia.org/r/530769
[11:36:20] (PS2) MarcoAurelio: Change language code for punjabiwikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/530769 (https://phabricator.wikimedia.org/T230680)
[11:39:42] PROBLEM - MegaRAID on db1063 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[11:39:43] ACKNOWLEDGEMENT - MegaRAID on db1063 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T230682 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[11:39:47] Operations, ops-eqiad: Degraded RAID on db1063 - https://phabricator.wikimedia.org/T230682 (ops-monitoring-bot)
[11:41:02] PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) is CRITICAL: Test retrieve title of the featured article for April 29, 2016 returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[11:42:38] RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[11:47:24] PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRITICAL: Test retrieve the most-read articles for January 1, 2016 (with aggregated=true) returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[11:48:58] RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[12:02:35] Operations, Acme-chief, Traffic: Provide the three cert types (chain-only, cert only and chained) as soon as we get the certificate issued - https://phabricator.wikimedia.org/T229096 (Krenair) Can we close this now?
[12:02:59] Operations, Acme-chief, Traffic: acme-chief staging time not working as expected - https://phabricator.wikimedia.org/T225945 (Krenair) Is it working as expected now?
[12:47:13] Operations, Acme-chief, Traffic: Decide/document criteria needed to serve acme-chief LE issued unified certificate to end users - https://phabricator.wikimedia.org/T230687 (Krenair)
[12:55:23] (PS1) Krinkle: hieradata: Move beta 'cache::app_directors' from Horizon to Puppet [puppet] - https://gerrit.wikimedia.org/r/530771 (https://phabricator.wikimedia.org/T158837)
[13:16:49] (PS1) Krinkle: hieradata: Add 'performance.wikimedia.beta.wmflabs.org' routing [puppet] - https://gerrit.wikimedia.org/r/530773 (https://phabricator.wikimedia.org/T158837)
[13:19:52] Krenair: ^
[13:20:13] to replace performance-beta.wmflabs.org web proxy
[13:20:24] looks like wildcard routing puts it on text vcl already
[13:20:28] so probably will just work?
[13:25:19] bd808: Which one is meant to win? Horizon puppet hiera or puppet.git puppet hiera?
[13:27:31] (CR) Urbanecm: [C: +1] "LGTM!" [mediawiki-config] - https://gerrit.wikimedia.org/r/530769 (https://phabricator.wikimedia.org/T230680) (owner: MarcoAurelio)
[13:29:09] (CR) Krinkle: [C: -1] "Help? Cherry-picking on beta puppet master causes a compilation failure." [puppet] - https://gerrit.wikimedia.org/r/530773 (https://phabricator.wikimedia.org/T158837) (owner: Krinkle)
[13:29:22] akosiaris: could use some help from someone who knows VCL better. any recommendations?
[13:30:13] (PS2) Andrew Bogott: Add missing cloudinfra contact group [puppet] - https://gerrit.wikimedia.org/r/530765 (https://phabricator.wikimedia.org/T230674) (owner: Alex Monk)
[13:31:16] (CR) Andrew Bogott: [C: +2] Add missing cloudinfra contact group [puppet] - https://gerrit.wikimedia.org/r/530765 (https://phabricator.wikimedia.org/T230674) (owner: Alex Monk)
[13:32:22] (CR) Andrew Bogott: [C: +2] "Thanks for the fix!" [puppet] - https://gerrit.wikimedia.org/r/530765 (https://phabricator.wikimedia.org/T230674) (owner: Alex Monk)
[13:48:59] (CR) Alex Monk: "(See T171188)" [puppet] - https://gerrit.wikimedia.org/r/530344 (owner: Alex Monk)
[13:49:08] (CR) Alex Monk: [C: -1] "(See T171188)" [puppet] - https://gerrit.wikimedia.org/r/530371 (owner: Alex Monk)
[13:56:54] PROBLEM - Host cp2004 is DOWN: PING CRITICAL - Packet loss = 100%
[14:04:58] PROBLEM - IPsec on cp3033 is CRITICAL: Strongswan CRITICAL - ok: 34 connecting: cp2004_v4, cp2004_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[14:05:06] PROBLEM - IPsec on cp5010 is CRITICAL: Strongswan CRITICAL - ok: 34 connecting: cp2004_v4, cp2004_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[14:05:08] PROBLEM - IPsec on cp1085 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp2004_v4, cp2004_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[14:05:10] PROBLEM - IPsec on cp1081 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp2004_v4, cp2004_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[14:05:16] PROBLEM - IPsec on cp5007 is CRITICAL: Strongswan CRITICAL - ok: 34 connecting: cp2004_v4, cp2004_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[14:05:20] PROBLEM - IPsec on cp5011 is CRITICAL: Strongswan CRITICAL - ok: 34 connecting: cp2004_v4, cp2004_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[14:05:32] PROBLEM - IPsec on cp3043 is CRITICAL: Strongswan CRITICAL - ok: 34 connecting: cp2004_v4, cp2004_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[14:05:34] PROBLEM - IPsec on cp4032 is CRITICAL: Strongswan CRITICAL - ok: 34 connecting: cp2004_v4, cp2004_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[14:05:36] PROBLEM - IPsec on cp3042 is CRITICAL: Strongswan CRITICAL - ok: 34 connecting: cp2004_v4, cp2004_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[14:05:36] PROBLEM - IPsec on cp3032 is CRITICAL: Strongswan CRITICAL - ok: 34 connecting: cp2004_v4, cp2004_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[14:05:44] PROBLEM - IPsec on cp1077 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp2004_v4, cp2004_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[14:05:50] PROBLEM - IPsec on cp4031 is CRITICAL: Strongswan CRITICAL - ok: 34 connecting: cp2004_v4, cp2004_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[14:05:50] PROBLEM - IPsec on cp4030 is CRITICAL: Strongswan CRITICAL - ok: 34 connecting: cp2004_v4, cp2004_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[14:05:50] PROBLEM - IPsec on cp1089 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp2004_v4, cp2004_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[14:05:54] PROBLEM - IPsec on cp5008 is CRITICAL: Strongswan CRITICAL - ok: 34 connecting: cp2004_v4, cp2004_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[14:06:00] PROBLEM - IPsec on cp4028 is CRITICAL: Strongswan CRITICAL - ok: 34 connecting: cp2004_v4, cp2004_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[14:06:02] PROBLEM - IPsec on cp1083 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp2004_v4, cp2004_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[14:06:06] PROBLEM - IPsec on cp4027 is CRITICAL: Strongswan CRITICAL - ok: 34 connecting: cp2004_v4, cp2004_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[14:06:06] PROBLEM - IPsec on cp4029 is CRITICAL: Strongswan CRITICAL - ok: 34 connecting: cp2004_v4, cp2004_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[14:06:08] PROBLEM - IPsec on cp1087 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp2004_v4, cp2004_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[14:06:14] PROBLEM - IPsec on cp3030 is CRITICAL: Strongswan CRITICAL - ok: 34 connecting: cp2004_v4, cp2004_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[14:06:14] PROBLEM - IPsec on cp3040 is CRITICAL: Strongswan CRITICAL - ok: 34 connecting: cp2004_v4, cp2004_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[14:06:16] PROBLEM - IPsec on cp5009 is CRITICAL: Strongswan CRITICAL - ok: 34 connecting: cp2004_v4, cp2004_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[14:06:18] PROBLEM - IPsec on cp5012 is CRITICAL: Strongswan CRITICAL - ok: 34 connecting: cp2004_v4, cp2004_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[14:06:18] PROBLEM - IPsec on cp3041 is CRITICAL: Strongswan CRITICAL - ok: 34 connecting: cp2004_v4, cp2004_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[14:06:22] PROBLEM - IPsec on cp1079 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp2004_v4, cp2004_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[14:06:32] PROBLEM - IPsec on cp1075 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp2004_v4, cp2004_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[14:14:41] okay so that's all just because cp2004 went off I guess
[14:46:26] yeah that's typical with the IPsec alerts
[14:46:55] AIUI, in the future ATS world, there won't be a need for IPsec
[14:51:17] cdanis, just like TLS between everything I guess?
[14:51:58] is it done with IPsec right now because varnish and TLS don't mix?
[14:53:14] that's right Krenair
[14:53:33] I believe it's specifically for "Varnish needs to call other Varnish" case, as that can't use TLS
[14:54:30] kind of surprised we can't stick nginx in that path too
[14:54:33] but ok
[14:55:46] I'm not sure if it's "can't", or "not worth it for something temporary"
[14:58:21] right
[14:58:26] makes sense. would be more overhead too
[14:59:51] yeah, plus it'd be a nontrivial nginx configuration, and I know _j.oe_ has encountered some issues elsewhere when using nginx as a TLS-adding reverse proxy
[15:00:37] well, we have some prior art :)
[15:00:40] but sure
[17:51:38] (PS9) Daimona Eaytoy: Rename globals and rights in AbuseFilter config [mediawiki-config] - https://gerrit.wikimedia.org/r/480074
[21:19:39] (CR) Urbanecm: [C: +1] Add `WS` and `CAT` as aliases for zhwikisource namespaces [mediawiki-config] - https://gerrit.wikimedia.org/r/530413 (https://phabricator.wikimedia.org/T230548) (owner: DannyS712)