[00:03:15] PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:15:23] PROBLEM - MariaDB Slave Lag: s3 on db2098 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 920.24 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[00:41:38] !log decommissioning Cassandra, restbase-dev1005-a -- T224554
[00:41:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:41:41] T224554: Migrate Restbase-dev cluster to Stretch - https://phabricator.wikimedia.org/T224554
[00:42:02] Operations, Cassandra, RESTBase, Core Platform Team Workboards (Clinic Duty Team), User-Eevans: Migrate Restbase-dev cluster to Stretch - https://phabricator.wikimedia.org/T224554 (Eevans)
[00:57:52] !log krinkle@deploy1001: Deploy performance/navtiming f2a0863b9e4774140463a79d08051814c0d21116 - T226539
[00:57:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:57:56] T226539: Update the by-country navtiming breakdown (June 2019) - https://phabricator.wikimedia.org/T226539
[00:59:11] !log krinkle@deploy1001 Started deploy [performance/navtiming@f2a0863]: (no justification provided)
[00:59:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:59:16] !log krinkle@deploy1001 Finished deploy [performance/navtiming@f2a0863]: (no justification provided) (duration: 00m 05s)
[00:59:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:03:27] (CR) Krinkle: [C: +1] "Yep, LGTM."
[mediawiki-config] - https://gerrit.wikimedia.org/r/534889 (https://phabricator.wikimedia.org/T232042) (owner: Subramanya Sastry)
[01:09:14] (CR) Krinkle: Variant configuration: Read from JSON, not serialised PHP (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/533593 (https://phabricator.wikimedia.org/T223602) (owner: Jforrester)
[01:34:42] Operations, Analytics, Analytics-Cluster: notebook1004 - /srv is full - https://phabricator.wikimedia.org/T232068 (Groceryheist) Hey @Ottomata, it turns out that I think stat1006 is a better fit for my purposes since it has ORES dependencies (mainly hunspell) that were missing on the notebook machin...
[01:54:11] RECOVERY - MariaDB Slave Lag: s3 on db2098 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[02:13:19] (PS8) Andrew Bogott: labpuppetmaster1001/1002: Clean up after moving puppetmasters to the cloud [puppet] - https://gerrit.wikimedia.org/r/530382 (https://phabricator.wikimedia.org/T171188)
[02:13:21] (PS1) Andrew Bogott: labs.yaml: remove profile::base::certificates::puppet_ca_content [puppet] - https://gerrit.wikimedia.org/r/535329 (https://phabricator.wikimedia.org/T171188)
[02:14:25] (CR) Andrew Bogott: [C: +2] labs.yaml: remove profile::base::certificates::puppet_ca_content [puppet] - https://gerrit.wikimedia.org/r/535329 (https://phabricator.wikimedia.org/T171188) (owner: Andrew Bogott)
[02:19:21] PROBLEM - snapshot of s5 in codfw on db1115 is CRITICAL: snapshot for s5 at codfw taken more than 4 days ago: Most recent backup 2019-09-06 01:53:37 https://wikitech.wikimedia.org/wiki/MariaDB/Backups
[02:34:03] RECOVERY - netbox Postgres on netboxdb2001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB netbox (host:localhost) 96 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[02:47:09] RECOVERY - Check for VMs leaked by the nova-fullstack test on
cloudcontrol1003 is OK: 0 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting%23Nova-fullstack
[02:52:39] (PS1) CRusnov: postgres::slave: Fix various minor issues with slaves and replication. [puppet] - https://gerrit.wikimedia.org/r/535340
[02:57:34] (PS2) CRusnov: postgres::slave: Fix various minor issues with slaves [puppet] - https://gerrit.wikimedia.org/r/535340
[03:01:17] (CR) CRusnov: "noops on known-working pg secondaries, and fixes issues that came up on the sync replica in netboxdb2001" [puppet] - https://gerrit.wikimedia.org/r/535340 (owner: CRusnov)
[03:01:26] (CR) CRusnov: [C: +2] postgres::slave: Fix various minor issues with slaves [puppet] - https://gerrit.wikimedia.org/r/535340 (owner: CRusnov)
[03:12:20] Operations, Product-Analytics, Wikidata, Wikidata-Query-Service, and 4 others: MIgrate WDQS to new logging pipeline - https://phabricator.wikimedia.org/T232184 (Mathew.onipe)
[03:19:08] (PS1) Brennen Bearnes: mediawiki-dev: port 8080; apache entrypoint [deployment-charts] - https://gerrit.wikimedia.org/r/535342 (https://phabricator.wikimedia.org/T222494)
[03:21:11] Operations, Puppet, Cloud-Services, Patch-For-Review, cloud-services-team (Kanban): Move the main WMCS puppetmaster into the Labs realm - https://phabricator.wikimedia.org/T171188 (Andrew)
[03:22:13] Operations, Puppet, Cloud-Services, cloud-services-team (Kanban): cloud-vps puppet cert cleaner not working properly - https://phabricator.wikimedia.org/T232427 (Andrew)
[03:23:44] Operations, Puppet, Cloud-Services, cloud-services-team (Kanban): Resolve local commits on cloud-puppetmaster-01.cloudinfra.eqiad.wmflabs and cloud-puppetmaster-02.cloudinfra.eqiad.wmflabs - https://phabricator.wikimedia.org/T232428 (Andrew)
[03:25:14] Operations, Puppet, Cloud-Services, cloud-services-team (Kanban): Create in-cloud, cloud-vps-wide cumin masters -
https://phabricator.wikimedia.org/T232429 (Andrew)
[03:51:29] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[03:53:07] PROBLEM - traffic_server tls process restarted on cp5001 is CRITICAL: 2 ge 2 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=eqsin+prometheus/ops&var-instance=cp5001&var-layer=tls
[04:02:51] (PS1) Mathew.onipe: wdqs: setup new logging pipeline [puppet] - https://gerrit.wikimedia.org/r/535345 (https://phabricator.wikimedia.org/T232184)
[04:04:49] (CR) jerkins-bot: [V: -1] wdqs: setup new logging pipeline [puppet] - https://gerrit.wikimedia.org/r/535345 (https://phabricator.wikimedia.org/T232184) (owner: Mathew.onipe)
[04:14:52] (PS6) Marostegui: mariadb: Promote db1109 to s8 master [puppet] - https://gerrit.wikimedia.org/r/531189 (https://phabricator.wikimedia.org/T230762)
[04:14:54] (CR) Marostegui: mariadb: Promote db1109 to s8 master [puppet] - https://gerrit.wikimedia.org/r/531189 (https://phabricator.wikimedia.org/T230762) (owner: Marostegui)
[04:14:56] (PS5) Marostegui: wmnet: Update s8-master record [dns] - https://gerrit.wikimedia.org/r/531455 (https://phabricator.wikimedia.org/T230762)
[04:14:58] (CR) Marostegui: wmnet: Update s8-master record [dns] - https://gerrit.wikimedia.org/r/531455 (https://phabricator.wikimedia.org/T230762) (owner: Marostegui)
[04:15:26] (PS2) Mathew.onipe: wdqs: setup new logging pipeline [puppet] - https://gerrit.wikimedia.org/r/535345 (https://phabricator.wikimedia.org/T232184)
[04:18:45] !log Start s8 (wikidata) pre switchover steps T230762
[04:18:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:18:48] T230762: Switchover s8 (wikidata) primary database master db1104
-> db1109 - 10th Sept @05:00 UTC - https://phabricator.wikimedia.org/T230762
[04:22:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set db1109 with weight 0 and depool it from API T230762', diff saved to https://phabricator.wikimedia.org/P9068 and previous config saved to /var/cache/conftool/dbconfig/20190910-042243-marostegui.json
[04:22:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:23:40] !log Start topology changes on s8, connect everything under db1109 - T230762
[04:23:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:32:48] (CR) Jcrespo: [C: +1] mariadb: Promote db1109 to s8 master [puppet] - https://gerrit.wikimedia.org/r/531189 (https://phabricator.wikimedia.org/T230762) (owner: Marostegui)
[04:37:46] (CR) Marostegui: [C: +2] mariadb: Promote db1109 to s8 master [puppet] - https://gerrit.wikimedia.org/r/531189 (https://phabricator.wikimedia.org/T230762) (owner: Marostegui)
[04:44:30] Operations, Traffic, Patch-For-Review: Investigate segfaults on ats-tls running on cp5001 - https://phabricator.wikimedia.org/T232298 (Vgutierrez) So it looks like we are still leaking memory with ATS 8.0.5-1wm6: ` (gdb) bt #0 0x00002adfd1a16fff in raise () from /lib/x86_64-linux-gnu/libc.so.6 #1 0...
[04:46:37] !log depool cp5001 for memory leak debugging on ATS - T232298
[04:46:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:46:40] T232298: Investigate segfaults on ats-tls running on cp5001 - https://phabricator.wikimedia.org/T232298
[04:49:01] RECOVERY - traffic_server tls process restarted on cp5001 is OK: (C)2 ge (W)2 ge 1 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=eqsin+prometheus/ops&var-instance=cp5001&var-layer=tls
[04:55:11] PROBLEM - Disk space on restbase-dev1005 is CRITICAL: DISK CRITICAL - free space: /srv 107751 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=restbase-dev1005&var-datasource=eqiad+prometheus/ops
[04:56:44] In 4 minutes we are starting the wikidata master switchover
[05:00:05] marostegui and jynus: My dear minions, it's time we take the moon! Just kidding. Time for s8 database master failover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190910T0500).
[05:00:07] jynus: ready?
[05:00:10] ok
[05:00:16] let's go then!
[05:00:24] !log Starting s8 failover from db1104 to db1109 - T227062
[05:00:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:00:26] T227062: Failover s8 (wikidatawiki) db primary master db1071 to db1104 (read-only required) - https://phabricator.wikimedia.org/T227062
[05:00:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set s8 as read-only for maintenance T230762', diff saved to https://phabricator.wikimedia.org/P9069 and previous config saved to /var/cache/conftool/dbconfig/20190910-050046-marostegui.json
[05:00:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:00:54] T230762: Switchover s8 (wikidata) primary database master db1104 -> db1109 - 10th Sept @05:00 UTC - https://phabricator.wikimedia.org/T230762
[05:01:11] ro confirmed
[05:01:17] same
[05:02:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote db1109 to s8 master and remove read-only from s8 T227062', diff saved to https://phabricator.wikimedia.org/P9070 and previous config saved to /var/cache/conftool/dbconfig/20190910-050213-marostegui.json
[05:02:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:02:18] RO off
[05:02:32] topology looks good
[05:02:34] I can edit fine
[05:02:50] same
[05:03:02] <_joe_> wow
[05:03:05] let's monitor things
[05:03:13] I see no errors?
[05:03:15] that is strange
[05:03:52] last time we had no errors was beacuse the job queue broke :-D
[05:04:08] (not because of us, it was already broken)
[05:04:40] _joe_: could you double check mediawiki looks ok in general- e.g.
the job queue is doing its thing
[05:04:43] (CR) Marostegui: [C: +2] wmnet: Update s8-master record [dns] - https://gerrit.wikimedia.org/r/531455 (https://phabricator.wikimedia.org/T230762) (owner: Marostegui)
[05:05:10] <_joe_> jynus: ack
[05:05:17] _joe_: normally there is some minimal amount of read only noise
[05:06:16] <_joe_> jynus: and maybe some video editing that will fail later
[05:06:25] sure
[05:06:30] everything looks good I think
[05:06:52] marostegui: you know I am worried when things look too good
[05:06:59] Me too XD
[05:07:25] no more mw deploys on eqiad?
[05:07:41] as in db-eqiad.php ?
[05:07:57] <_joe_> nothing worrisome
[05:07:58] those lines are all gone jynus
[05:08:01] <_joe_> in logstash
[05:08:11] cdanis: go to sleep
[05:08:14] :-D
[05:08:32] don’t worry I’m up playing games
[05:08:37] _joe_: that is what worries me, I expected a few failures on logs
[05:08:49] even from the mw read only
[05:10:08] _joe_: for example, there are certain extensions that are not or cannot be compatible with read only
[05:10:53] ok, I see some at least of those
[05:11:20] "wikidatawiki: [8fa9ed3dd1a8a9480626d900] [no req] Wikimedia\Rdbms\DBReadOnlyError from line 973 of /srv/mediawiki/php-1.34.0-wmf.21/includes/libs/rdbms/database/Database.php: Database is read-only: The master database server is running in read-only mode.·
[05:11:28] Operations, DBA: Switchover s8 (wikidata) primary database master db1104 -> db1109 - 10th Sept @05:00 UTC - https://phabricator.wikimedia.org/T230762 (Marostegui) This was done. read-only start: Tue Sep 10 05:00:47 UTC 2019 read-only stop: Tue Sep 10 05:02:14 UTC 2019 Total read-only time: 1 minute 27 s...
[05:11:47] <_joe_> yeah I see some read-only errors
[05:11:56] <_joe_> like 7?
[05:11:57] those were expected
[05:12:02] <_joe_> sure
[05:12:09] <_joe_> hence my comment I see nothing worrisome
[05:12:09] I just didn't see no one at first
[05:12:16] <_joe_> they also lasted a short time
[05:12:21] Operations, Wikimedia-Mailing-lists: mass AOL bounces mailman - https://phabricator.wikimedia.org/T232417 (Aklapper) Always helpful when Yahoo does not list that error code on [Yahoo's page that is supposed to list their error codes](https://help.yahoo.com/kb/postmaster/smtp-error-codes-sln23996.html)...
[05:15:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1104 into API T230762', diff saved to https://phabricator.wikimedia.org/P9071 and previous config saved to /var/cache/conftool/dbconfig/20190910-051529-marostegui.json
[05:15:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:15:33] T230762: Switchover s8 (wikidata) primary database master db1104 -> db1109 - 10th Sept @05:00 UTC - https://phabricator.wikimedia.org/T230762
[05:19:32] Operations, ops-eqiad, DC-Ops: b3-eqiad pdu refresh (Tuesday 9/17 @11am UTC) - https://phabricator.wikimedia.org/T227539 (Marostegui)
[05:19:54] Operations, DBA: Switchover s8 (wikidata) primary database master db1104 -> db1109 - 10th Sept @05:00 UTC - https://phabricator.wikimedia.org/T230762 (Marostegui) Open→Resolved
[05:19:56] Operations, ops-eqiad, DC-Ops: b3-eqiad pdu refresh (Tuesday 9/17 @11am UTC) - https://phabricator.wikimedia.org/T227539 (Marostegui)
[05:21:16] (PS2) Marostegui: mariadb: Promote db1135 to m1 master [puppet] - https://gerrit.wikimedia.org/r/534386 (https://phabricator.wikimedia.org/T231403)
[05:27:41] (PS2) Marostegui: mariadb: Decommission db2047 [puppet] - https://gerrit.wikimedia.org/r/535203 (https://phabricator.wikimedia.org/T231852)
[05:29:11] (CR) Marostegui: [C: +2] mariadb: Decommission db2047 [puppet] - https://gerrit.wikimedia.org/r/535203 (https://phabricator.wikimedia.org/T231852)
(owner: Marostegui)
[05:30:31] Operations, DBA, Patch-For-Review: Decommission db2047.codfw.wmnet - https://phabricator.wikimedia.org/T231852 (Marostegui)
[05:33:02] !log decommissioning Cassandra, restbase-dev1005-b -- T224554
[05:33:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:33:05] T224554: Migrate Restbase-dev cluster to Stretch - https://phabricator.wikimedia.org/T224554
[05:35:09] !log Remove db2047 from tendril and zarcillo - T231852
[05:35:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:35:13] T231852: Decommission db2047.codfw.wmnet - https://phabricator.wikimedia.org/T231852
[05:35:48] !log Stop MySQL on db2047 T231852
[05:35:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:36:48] Operations, Cassandra, RESTBase, Core Platform Team Workboards (Clinic Duty Team), User-Eevans: Migrate Restbase-dev cluster to Stretch - https://phabricator.wikimedia.org/T224554 (Eevans) I've started the decommission of -dev1005-b quite late in my evening; It should be complete by EU mornin...
[05:36:56] Operations, ops-codfw, DC-Ops, decommission: Decommission db2047.codfw.wmnet - https://phabricator.wikimedia.org/T231852 (Marostegui) a:Marostegui→RobH
[05:37:20] Operations, ops-codfw, DC-Ops, decommission: Decommission db2047.codfw.wmnet - https://phabricator.wikimedia.org/T231852 (Marostegui) This host is ready for #dc-ops to decommission
[05:37:31] Operations, DBA: Decommission db2043-db2069 - https://phabricator.wikimedia.org/T228258 (Marostegui)
[05:38:55] (CR) Marostegui: [C: +1] "Let's try this today with the m1 failover, as we have to move replicas there too" [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/521232 (owner: Jcrespo)
[05:41:48] (PS1) Marostegui: db-eqiad,db-codfw.php: Remove db1073 from config [mediawiki-config] - https://gerrit.wikimedia.org/r/535348 (https://phabricator.wikimedia.org/T231892)
[05:43:27] (CR) Marostegui: [C: +2] db-eqiad,db-codfw.php: Remove db1073 from config [mediawiki-config] - https://gerrit.wikimedia.org/r/535348 (https://phabricator.wikimedia.org/T231892) (owner: Marostegui)
[05:44:29] (Merged) jenkins-bot: db-eqiad,db-codfw.php: Remove db1073 from config [mediawiki-config] - https://gerrit.wikimedia.org/r/535348 (https://phabricator.wikimedia.org/T231892) (owner: Marostegui)
[05:45:52] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Remove db1073 from config T231892 (duration: 00m 55s)
[05:45:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:45:57] T231892: Decommission db1073.eqiad.wmnet - https://phabricator.wikimedia.org/T231892
[05:46:24] (CR) jenkins-bot: db-eqiad,db-codfw.php: Remove db1073 from config [mediawiki-config] - https://gerrit.wikimedia.org/r/535348 (https://phabricator.wikimedia.org/T231892) (owner: Marostegui)
[05:46:55] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Remove db1073 from config T231892 (duration: 00m 54s)
[05:46:57] Logged the
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:54:57] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 230, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:55:01] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 54, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:02:55] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 232, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:02:57] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 56, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:08:51] (CR) Marostegui: [WIP] Add optional sanity checks to check mediawiki configuration (1 comment) [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/534805 (owner: Jcrespo)
[06:49:45] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 230, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:49:53] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 54, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:57:08] !log upgrading snapshot* to PHP 7.2.22 T230024
[06:57:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:57:13] T230024: Update component/php72 to 7.2.22 - https://phabricator.wikimedia.org/T230024
[07:00:39] the two interfaces down are related to telia scheduled maintenance
between eqiad and eqord afaics
[07:02:27] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 232, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:02:33] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 56, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:04:05] (CR) Jcrespo: [WIP] Add optional sanity checks to check mediawiki configuration (1 comment) [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/534805 (owner: Jcrespo)
[07:05:13] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[07:05:18] (PS5) Muehlenhoff: Decom iron [puppet] - https://gerrit.wikimedia.org/r/531867 (https://phabricator.wikimedia.org/T220505)
[07:06:43] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[07:10:33] (CR) Muehlenhoff: [C: +2] Decom iron [puppet] - https://gerrit.wikimedia.org/r/531867 (https://phabricator.wikimedia.org/T220505) (owner: Muehlenhoff)
[07:20:30] (CR) Marostegui: [WIP] Add optional sanity checks to check mediawiki configuration (1 comment) [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/534805 (owner: Jcrespo)
[07:21:47] (CR) Jcrespo: [WIP] Add optional sanity checks to check mediawiki configuration (1 comment) [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/534805 (owner: Jcrespo)
[07:21:54] !log iron.wikimedia.org is no longer a bastion host
[07:21:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:25:45] Operations, ops-eqiad, Cloud-VPS, decommission, Patch-For-Review: Decommission iron - https://phabricator.wikimedia.org/T220505 (MoritzMuehlenhoff)
[07:27:00] Operations, ops-eqiad, Cloud-VPS, decommission, Patch-For-Review: Decommission iron - https://phabricator.wikimedia.org/T220505 (MoritzMuehlenhoff) a:MoritzMuehlenhoff→RobH
[07:30:58] (PS16) Mathew.onipe: lvs: allow access to wdqs lvs on port 8888 [puppet] - https://gerrit.wikimedia.org/r/529053 (https://phabricator.wikimedia.org/T176875)
[07:39:35] (CR) Vgutierrez: [C: +2] lvs: allow access to wdqs lvs on port 8888 [puppet] - https://gerrit.wikimedia.org/r/529053 (https://phabricator.wikimedia.org/T176875) (owner: Mathew.onipe)
[07:39:46] (PS17) Vgutierrez: lvs: allow access to wdqs lvs on port 8888 [puppet] - https://gerrit.wikimedia.org/r/529053 (https://phabricator.wikimedia.org/T176875) (owner: Mathew.onipe)
[07:42:13] !log reimaging mw2231 after hardware maintenance T231192
[07:42:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:42:16] T231192: mw2231 is down and unable to reboot - https://phabricator.wikimedia.org/T231192
[07:54:45] RECOVERY - snapshot of s5 in codfw on db1115 is OK: snapshot for s5 at codfw taken less than 4 days ago and larger than 90 GB: Last one 2019-09-10 06:56:38 from db2099.codfw.wmnet:3315 (646 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups
[07:57:11] (PS1) Marostegui: change_mw_mysql_pass.sh: Add to repo [software] - https://gerrit.wikimedia.org/r/535505
[07:58:25] (PS2) Marostegui: change_mw_mysql_pass.sh: Add to repo [software] - https://gerrit.wikimedia.org/r/535505
[07:59:00] (CR) Marostegui: [C: +2] change_mw_mysql_pass.sh: Add to repo [software] - https://gerrit.wikimedia.org/r/535505 (owner: Marostegui)
[07:59:18] (PS1) Vgutierrez: Revert "lvs: allow access to wdqs lvs on port 8888" [puppet] - https://gerrit.wikimedia.org/r/535506
[08:01:24] (CR) Vgutierrez: [C: +2] Revert "lvs: allow access to wdqs lvs on port 8888" [puppet] - https://gerrit.wikimedia.org/r/535506 (owner: Vgutierrez)
[08:01:26] (PS1) Marostegui: Revert "wmnet: Decrease m5-master TTL to 1M" [dns] - https://gerrit.wikimedia.org/r/535508
[08:01:35] (CR) jerkins-bot: [V: -1] Revert "wmnet: Decrease m5-master TTL to 1M" [dns] - https://gerrit.wikimedia.org/r/535508 (owner: Marostegui)
[08:01:41] (Abandoned) Marostegui: Revert "wmnet: Decrease m5-master TTL to 1M" [dns] - https://gerrit.wikimedia.org/r/535508 (owner: Marostegui)
[08:04:02] (PS1) Marostegui: wmnet: Restore 5M TTL for m5-master [dns] - https://gerrit.wikimedia.org/r/535509 (https://phabricator.wikimedia.org/T229657)
[08:05:27] (CR) Marostegui: [C: +2] wmnet: Restore 5M TTL for m5-master [dns] - https://gerrit.wikimedia.org/r/535509 (https://phabricator.wikimedia.org/T229657) (owner: Marostegui)
[08:06:41] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime
[08:06:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:06:47] (CR) Marostegui: [WIP] Add optional sanity checks to check mediawiki configuration (1 comment) [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/534805 (owner: Jcrespo)
[08:06:59] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/wdqs on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/wdqs is broken https://wikitech.wikimedia.org/wiki/Confd
[08:08:01] PROBLEM - Confd template for /srv/config-master/pybal/codfw/wdqs on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/wdqs is broken https://wikitech.wikimedia.org/wiki/Confd
[08:08:31] uh
[08:08:42] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[08:08:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:11:23] the revert messed somehow with wdqs :/
[08:13:37] (PS2)
Filippo Giunchedi: grafana: use Prometheus swift metrics for dashboard [puppet] - https://gerrit.wikimedia.org/r/535180 (https://phabricator.wikimedia.org/T205870)
[08:15:06] (CR) Filippo Giunchedi: [C: +2] grafana: use Prometheus swift metrics for dashboard [puppet] - https://gerrit.wikimedia.org/r/535180 (https://phabricator.wikimedia.org/T205870) (owner: Filippo Giunchedi)
[08:17:23] (PS2) Elukey: Add DNS entries for an-presto nodes [dns] - https://gerrit.wikimedia.org/r/535221 (https://phabricator.wikimedia.org/T225128) (owner: Ottomata)
[08:17:38] vgutierrez: is it back to normal?
[08:17:45] nope
[08:17:46] do we need to rerun puppet somewhere?
[08:18:14] so TBH dunno what's the issue
[08:20:15] I see wdqs-queries here https://config-master.wikimedia.org/pybal/codfw/
[08:20:37] and here: https://config-master.wikimedia.org/pybal/eqiad/
[08:21:02] https://config-master.wikimedia.org/pybal/eqiad/wdqs-heavy-queries
[08:21:12] PROBLEM - Confd template for /srv/config-master/pybal/codfw/wdqs-heavy-queries on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/wdqs-heavy-queries is broken https://wikitech.wikimedia.org/wiki/Confd
[08:21:49] I only see wdqs and wdqs-internal
[08:22:20] but it looks like removing the wdqs-heavy-queries definition from the puppet repo is not enough
[08:23:47] (PS3) Filippo Giunchedi: swift: port alerts to Prometheus [puppet] - https://gerrit.wikimedia.org/r/535182 (https://phabricator.wikimedia.org/T205870)
[08:23:49] (PS1) Filippo Giunchedi: swift: stop relaying to statsd/statsite [puppet] - https://gerrit.wikimedia.org/r/535515 (https://phabricator.wikimedia.org/T205870)
[08:26:02] (CR) Filippo Giunchedi: [C: +2] swift: port alerts to Prometheus [puppet] - https://gerrit.wikimedia.org/r/535182 (https://phabricator.wikimedia.org/T205870) (owner: Filippo Giunchedi)
[08:29:00] vgutierrez: do we need to manually remove:
https://config-master.wikimedia.org/pybal/eqiad/wdqs-heavy-queries
[08:29:08] errr
[08:29:09] (PS2) Jbond: wmcs::nfs: add ipv6 mapped address [puppet] - https://gerrit.wikimedia.org/r/531241 (https://phabricator.wikimedia.org/T102099)
[08:29:11] that's a 404 for me onimisionipe
[08:29:18] weird
[08:29:31] I can see that
[08:30:03] right.. config-master.wm.o resolves to esams for you, right?
[08:30:23] there might be swift alerts coming up after merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/535182 FYI
[08:30:56] (CR) Jbond: [C: +2] wmcs::nfs: add ipv6 mapped address [puppet] - https://gerrit.wikimedia.org/r/531241 (https://phabricator.wikimedia.org/T102099) (owner: Jbond)
[08:31:00] its esams for me
[08:31:07] yep
[08:31:14] I can reproduce with curl --resolve config-master.wikimedia.org:443:91.198.174.192
[08:31:14] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/wdqs-heavy-queries on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/wdqs-heavy-queries is broken https://wikitech.wikimedia.org/wiki/Confd
[08:32:03] (CR) Elukey: "Andrew: I have updated the IPs that were causing issues with jenkins, and also added the change to the mgmt records. I am still not clear " [dns] - https://gerrit.wikimedia.org/r/535221 (https://phabricator.wikimedia.org/T225128) (owner: Ottomata)
[08:34:26] vgutierrez: its 404 for me now
[08:34:43] yup.. I manually cleaned those two on puppetmaster1001
[08:34:47] but that doesn't solve the issue
[08:35:12] there's an unmerged repo alerts in icinga
[08:35:20] uh?
[08:35:59] Unmerged changes on repository puppet
[08:36:18] jbond42: ^^
[08:37:08] not sure.. vgutierrez did you do a puppet merge for the revert?
[08:37:16] yes onimisionipe
[08:37:24] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/wdqs on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd
[08:37:29] hmm
[08:37:35] there's our recovery
[08:37:38] RECOVERY - Confd template for /srv/config-master/pybal/codfw/wdqs on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd
[08:38:14] yeah.. I had to clean /var/run/confd-template errors manually
[08:39:35] vgutierrez: thanks got destracted :)
[08:40:34] i just got the following from a puppet-mereg
[08:40:35] conftool::yaml_log_error: Error parsing yaml file /etc/conftool/etcdrc: [Errno 2] No such file or directory: '/etc/conftool/etcdrc'
[08:40:40] is that expected?
[08:41:49] hmm I've been seeing that one for the last 2 years...
[08:41:50] so I hope so
[08:42:21] ok :), obvioulsy just looking closer today
[08:43:19] hmmm...there's still alerts on icinga that wont go away
[08:44:58] vgutierrez: from icinga: `File not found: /srv/config-master/pybal/codfw/wdqs-heavy-queries`
[08:45:24] awesome
[08:45:46] jbond42: can you change the 'Ops Clinic Duty' to your nick for this channel?
[08:45:52] onimisionipe: can we fix that ferm rules? :)
[08:46:34] onimisionipe: i dont think i have that privalge, if you do please go ahead, if not i can ask n m_sec
[08:46:50] Operations: Integrate Buster 10.1 point update - https://phabricator.wikimedia.org/T232310 (MoritzMuehlenhoff)
[08:47:05] jbond42: I don't have. akosiaris ^
[08:47:10] I'll do it
[08:47:17] onimisionipe: so that check have been cleared after puppet ran in icinga1001
[08:47:20] FFS :)
[08:47:23] moritzm: alright thanks
[08:47:34] thanks moritzm
[08:47:50] vgutierrez: Ok. thanks!
[08:48:06] hold on for ferm rule change
[08:48:47] Operations, ops-eqiad, Analytics, Analytics-Kanban, and 2 others: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN.
- https://phabricator.wikimedia.org/T225128 (10elukey) @Cmjohnson should the following descriptions be updated as well with their `an-presto` equivalent... [08:52:53] (03CR) 10Elukey: "Found it! The hosts still show their cloudvirtan names:" [dns] - 10https://gerrit.wikimedia.org/r/535221 (https://phabricator.wikimedia.org/T225128) (owner: 10Ottomata) [08:53:39] (03PS2) 10Marostegui: mediawiki: Add rebuildItemTerms for Wikidata [puppet] - 10https://gerrit.wikimedia.org/r/534790 (https://phabricator.wikimedia.org/T225056) (owner: 10Ladsgroup) [08:53:55] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Kanban, and 2 others: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (10MoritzMuehlenhoff) The old host definitions for cloudviran are still in debmonitor, puppetdb and site.pp a... [08:56:08] (03PS1) 10Mathew.onipe: lvs: allow access to wdqs lvs on port 8888 [puppet] - 10https://gerrit.wikimedia.org/r/535520 (https://phabricator.wikimedia.org/T176875) [08:56:21] vgutierrez: ^ [08:56:25] (03CR) 10Marostegui: [C: 03+2] mediawiki: Add rebuildItemTerms for Wikidata [puppet] - 10https://gerrit.wikimedia.org/r/534790 (https://phabricator.wikimedia.org/T225056) (owner: 10Ladsgroup) [08:56:42] jbond42: interestingly enough I don't have access to give you moar access on this channel [08:57:34] akosiaris: thats probably for the best :) [09:03:06] 10Operations, 10Traffic, 10netops, 10IPv6, 10Patch-For-Review: Fix IPv6 autoconf issues once and for all, across the fleet. - https://phabricator.wikimedia.org/T102099 (10MoritzMuehlenhoff) I just reimaged mw2231 for unrelated reasons (broken hardware, system got swapped with a different server) and the... [09:03:20] (03CR) 10Jcrespo: "Thanks!" 
[software] - 10https://gerrit.wikimedia.org/r/535505 (owner: 10Marostegui) [09:03:40] I probably neither (and even if I have no idea how :-) [09:05:45] (03CR) 10Gehel: [C: 04-1] "as discussed on IRC" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/535520 (https://phabricator.wikimedia.org/T176875) (owner: 10Mathew.onipe) [09:05:57] (03PS1) 10Ladsgroup: Set items term store on write both for all of Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535521 (https://phabricator.wikimedia.org/T225055) [09:06:17] (03PS1) 10Alexandros Kosiaris: Introduce wikifees LVS IPs [dns] - 10https://gerrit.wikimedia.org/r/535522 (https://phabricator.wikimedia.org/T170455) [09:06:19] (03PS1) 10Alexandros Kosiaris: Add discovery RRs for wikifeeds [dns] - 10https://gerrit.wikimedia.org/r/535523 (https://phabricator.wikimedia.org/T170455) [09:06:49] moritzm: actually you might do. Chanserv says you are a manager, whereas I am a simple OP [09:07:01] I don't know if you should take that as a compliment though :P [09:09:40] (03PS1) 10Alexandros Kosiaris: Add kubernetes wikifeeds stanzas [puppet] - 10https://gerrit.wikimedia.org/r/535525 (https://phabricator.wikimedia.org/T170455) [09:09:47] (03PS1) 10Ladsgroup: mediawiki: Start rebuildItermTerms for wikidatawiki [puppet] - 10https://gerrit.wikimedia.org/r/535526 (https://phabricator.wikimedia.org/T225056) [09:10:11] (03PS2) 10Alexandros Kosiaris: Introduce wikifeeds LVS IPs [dns] - 10https://gerrit.wikimedia.org/r/535522 (https://phabricator.wikimedia.org/T170455) [09:10:13] (03PS2) 10Alexandros Kosiaris: Add discovery RRs for wikifeeds [dns] - 10https://gerrit.wikimedia.org/r/535523 (https://phabricator.wikimedia.org/T170455) [09:12:08] someone clearly made a prank there :-) [09:12:33] (03PS2) 10Mathew.onipe: lvs: allow access to wdqs lvs on port 8888 [puppet] - 10https://gerrit.wikimedia.org/r/535520 (https://phabricator.wikimedia.org/T176875) [09:12:35] (03PS1) 10Mathew.onipe: wdqs: allow port 8888 for domain 
networks [puppet] - 10https://gerrit.wikimedia.org/r/535528 (https://phabricator.wikimedia.org/T176875) [09:14:56] 10Operations, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: mw2231 is down and unable to reboot - https://phabricator.wikimedia.org/T231192 (10MoritzMuehlenhoff) 05Open→03Resolved I reimaged mw2231 and repooled it. [09:18:54] (03PS1) 10Jbond: ip6_mapped: add missing nodes [puppet] - 10https://gerrit.wikimedia.org/r/535529 (https://phabricator.wikimedia.org/T102099) [09:23:42] (03PS2) 10Mathew.onipe: wdqs: allow port 8888 for domain networks [puppet] - 10https://gerrit.wikimedia.org/r/535528 (https://phabricator.wikimedia.org/T176875) [09:23:45] (03PS3) 10Mathew.onipe: lvs: allow access to wdqs lvs on port 8888 [puppet] - 10https://gerrit.wikimedia.org/r/535520 (https://phabricator.wikimedia.org/T176875) [09:33:01] 10Operations, 10Diffusion, 10Packaging, 10Patch-For-Review, and 4 others: Cannot connect to vcs@git-ssh.wikimedia.org (since move from phab1001 to phab1003) - https://phabricator.wikimedia.org/T224677 (10LucasWerkmeister) [09:36:24] (03PS4) 10Mathew.onipe: lvs: allow access to wdqs lvs on port 8888 [puppet] - 10https://gerrit.wikimedia.org/r/535520 (https://phabricator.wikimedia.org/T176875) [09:36:26] (03PS3) 10Mathew.onipe: wdqs: allow port 8888 for domain networks [puppet] - 10https://gerrit.wikimedia.org/r/535528 (https://phabricator.wikimedia.org/T176875) [09:36:55] PROBLEM - Number of mw swift objects in codfw greater than eqiad on icinga1001 is CRITICAL: account=mw-media class=temp https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?var-DC=codfw [09:37:22] !log added jbond as chanserv ops for #wikimedia-operations [09:37:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:35] hmm [09:37:40] that swift aler [09:37:42] alert* [09:39:04] godog: ^ ? 
[09:41:13] yup sort-of expected after merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/535182 I'm taking a look but it isn't worrying [09:42:42] yeah filing a task, thanks for the heads up [09:43:28] (03CR) 10Gehel: [C: 03+1] "Looks reasonable to me, but my understanding of our LVS config is limited" [puppet] - 10https://gerrit.wikimedia.org/r/535520 (https://phabricator.wikimedia.org/T176875) (owner: 10Mathew.onipe) [09:44:30] ACKNOWLEDGEMENT - Disk space on restbase-dev1005 is CRITICAL: DISK CRITICAL - free space: /srv 4837 MB (0% inode=99%): Mobrovac decomm for T224554 https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=restbase-dev1005&var-datasource=eqiad+prometheus/ops [09:47:12] (03CR) 10Gehel: [C: 03+1] "LGTM, provided we can confirm that we don't have a better solution." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/535528 (https://phabricator.wikimedia.org/T176875) (owner: 10Mathew.onipe) [09:47:35] (03PS4) 10Mathew.onipe: wdqs: allow port 8888 for domain networks [puppet] - 10https://gerrit.wikimedia.org/r/535528 (https://phabricator.wikimedia.org/T176875) [09:47:38] (03PS5) 10Mathew.onipe: lvs: allow access to wdqs lvs on port 8888 [puppet] - 10https://gerrit.wikimedia.org/r/535520 (https://phabricator.wikimedia.org/T176875) [09:49:19] (03PS1) 10Alexandros Kosiaris: Add wikifeeds namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/535531 (https://phabricator.wikimedia.org/T170455) [09:49:21] (03PS1) 10Alexandros Kosiaris: Fix a typo in helmfile services examples [deployment-charts] - 10https://gerrit.wikimedia.org/r/535532 [09:49:23] (03PS1) 10Alexandros Kosiaris: Add wikifeeds helmfile stanzas [deployment-charts] - 10https://gerrit.wikimedia.org/r/535533 (https://phabricator.wikimedia.org/T170455) [09:50:42] !log installing ghostscript security updates on jessie [09:50:43] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:21] (03PS1) 10Filippo Giunchedi: swift: exclude temp objects when checking eqiad/codfw difference [puppet] - 10https://gerrit.wikimedia.org/r/535535 [09:52:16] (03PS2) 10Filippo Giunchedi: swift: exclude temp objects when checking eqiad/codfw difference [puppet] - 10https://gerrit.wikimedia.org/r/535535 (https://phabricator.wikimedia.org/T232448) [09:53:21] (03CR) 10Filippo Giunchedi: [C: 03+2] swift: exclude temp objects when checking eqiad/codfw difference [puppet] - 10https://gerrit.wikimedia.org/r/535535 (https://phabricator.wikimedia.org/T232448) (owner: 10Filippo Giunchedi) [09:54:17] RECOVERY - mediawiki-installation DSH group on mw2231 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [09:55:50] akosiaris: merging your private change too, puppet-merge is mentioning it [09:55:59] labs/private that is [09:56:04] the "public private" repo [09:56:05] hah... [09:56:12] I forgot about that [09:56:13] thanks! [09:56:21] and I was the one complaining I would forget it [09:56:36] !log restart archiva on archiva1001 - UI not working (probably due to connections to maven central being stuck) [09:56:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:49] np! 
yeah me too, I was wondering why puppet-merge was showing your change and not mine, turns out that's labs/private [09:58:29] (03PS1) 10Pmiazga: Bump MobileWebUIActionsTracking sampling rate to 10 percent [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535536 (https://phabricator.wikimedia.org/T220016) [09:59:03] TIL puppet-merge can have a configuration file [10:06:36] (03PS2) 10Jbond: ip6_mapped: add missing nodes [puppet] - 10https://gerrit.wikimedia.org/r/535529 (https://phabricator.wikimedia.org/T102099) [10:08:18] 10Operations, 10Mail: Consider Postfix as MTA for our MXes (and OTRS/Mailman/Phab) - https://phabricator.wikimedia.org/T232343 (10jbond) p:05Triage→03Normal [10:09:22] 10Operations, 10MassMessage, 10WMF-JobQueue, 10Core Platform Team Workboards (Clinic Duty Team), 10Patch-For-Review: Massmessages not going through, log looks fine - https://phabricator.wikimedia.org/T232362 (10jbond) p:05Triage→03Normal [10:11:07] 10Operations, 10Mail, 10Wikimedia-Mailing-lists: mass AOL bounces mailman - https://phabricator.wikimedia.org/T232417 (10jbond) p:05Triage→03Normal [10:13:08] (03PS3) 10Jbond: ip6_mapped: add missing nodes [puppet] - 10https://gerrit.wikimedia.org/r/535529 (https://phabricator.wikimedia.org/T102099) [10:15:17] (03CR) 10Muehlenhoff: ip6_mapped: add missing nodes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/535529 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [10:17:57] (03PS2) 10Alexandros Kosiaris: Add kubernetes wikifeeds stanzas [puppet] - 10https://gerrit.wikimedia.org/r/535525 (https://phabricator.wikimedia.org/T170455) [10:18:31] (03PS4) 10Jbond: ip6_mapped: add missing nodes [puppet] - 10https://gerrit.wikimedia.org/r/535529 (https://phabricator.wikimedia.org/T102099) [10:18:39] (03CR) 10Alexandros Kosiaris: [C: 03+2] Add kubernetes wikifeeds stanzas [puppet] - 10https://gerrit.wikimedia.org/r/535525 (https://phabricator.wikimedia.org/T170455) (owner: 10Alexandros Kosiaris) [10:18:41] (03CR) 10Jbond: 
ip6_mapped: add missing nodes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/535529 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [10:18:57] (03PS5) 10Jbond: ip6_mapped: add missing nodes [puppet] - 10https://gerrit.wikimedia.org/r/535529 (https://phabricator.wikimedia.org/T102099) [10:23:46] (03CR) 10Muehlenhoff: [C: 03+1] ip6_mapped: add missing nodes [puppet] - 10https://gerrit.wikimedia.org/r/535529 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [10:24:06] (03CR) 10Jbond: [C: 03+2] ip6_mapped: add missing nodes [puppet] - 10https://gerrit.wikimedia.org/r/535529 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [10:28:20] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] Add wikifeeds namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/535531 (https://phabricator.wikimedia.org/T170455) (owner: 10Alexandros Kosiaris) [10:28:32] (03PS2) 10Muehlenhoff: Enable puppetdb1002/2002 as puppetdb hosts [puppet] - 10https://gerrit.wikimedia.org/r/534403 [10:32:42] !log repool cp5001 with ats-tls collecting memory usage details every hour - T232298 [10:32:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:46] T232298: Investigate segfaults on ats-tls running on cp5001 - https://phabricator.wikimedia.org/T232298 [10:34:07] !log @ helmfile [STAGING] Ran 'sync' command on namespace 'kube-system' for release 'rbac-deploy-clusterrole' . [10:34:07] !log @ helmfile [STAGING] Ran 'sync' command on namespace 'kube-system' for release 'coredns' . [10:34:07] !log @ helmfile [STAGING] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' . 
[10:34:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:33] 10Operations, 10Traffic: Cookies and misc services caching - https://phabricator.wikimedia.org/T232453 (10ema) [10:34:35] !log akosiaris@ helmfile [STAGING] Ran 'apply' command on namespace 'kube-system' for release 'rbac-deploy-clusterrole' . [10:34:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:36] !log akosiaris@ helmfile [STAGING] Ran 'apply' command on namespace 'kube-system' for release 'coredns' . [10:34:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:39] !log akosiaris@ helmfile [STAGING] Ran 'apply' command on namespace 'kube-system' for release 'calico-policy-controller' . [10:34:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:40] 10Operations, 10Traffic: Cookies and misc services caching - https://phabricator.wikimedia.org/T232453 (10ema) p:05Triage→03Normal [10:35:10] 10Operations, 10Analytics, 10Analytics-Kanban, 10Analytics-Wikistats, and 2 others: Piwik JS isn't cached - https://phabricator.wikimedia.org/T230772 (10ema) 05Open→03Resolved >>! In T230772#5464981, @Gilles wrote: > This should probably be its own task, though, it's not specific to piwik.js Agreed, I... 
[10:35:19] 10Operations, 10Analytics, 10Traffic: Cookies and misc services caching - https://phabricator.wikimedia.org/T232453 (10ema) [10:39:53] (03PS1) 10Ema: ATS: use TLS to connect to etherpad [puppet] - 10https://gerrit.wikimedia.org/r/535540 (https://phabricator.wikimedia.org/T210411) [10:42:11] (03CR) 10Muehlenhoff: [C: 03+2] Enable puppetdb1002/2002 as puppetdb hosts [puppet] - 10https://gerrit.wikimedia.org/r/534403 (owner: 10Muehlenhoff) [10:45:06] !log akosiaris@ helmfile [CODFW] Ran 'apply' command on namespace 'kube-system' for release 'rbac-deploy-clusterrole' . [10:45:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:07] !log akosiaris@ helmfile [CODFW] Ran 'apply' command on namespace 'kube-system' for release 'coredns' . [10:45:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:59] !log akosiaris@ helmfile [CODFW] Ran 'apply' command on namespace 'kube-system' for release 'calico-policy-controller' . [10:52:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:36] (03PS1) 10Jbond: ip6_mapped: add ip6_mapped to profile::standard [puppet] - 10https://gerrit.wikimedia.org/r/535544 (https://phabricator.wikimedia.org/T102099) [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: Dear deployers, time to do the European Mid-day SWAT(Max 6 patches) deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190910T1100). [11:00:05] Amir1 and raynor: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:18] o/ [11:00:21] o/ [11:00:54] I can SWAT today! 
[11:01:12] mine is not testable [11:01:12] o/ [11:01:26] Ok Amir1 [11:01:31] (03CR) 10Urbanecm: [C: 03+2] Set items term store on write both for all of Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535521 (https://phabricator.wikimedia.org/T225055) (owner: 10Ladsgroup) [11:02:39] (03CR) 10Muehlenhoff: ip6_mapped: add ip6_mapped to profile::standard (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/535544 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [11:02:47] (03Merged) 10jenkins-bot: Set items term store on write both for all of Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535521 (https://phabricator.wikimedia.org/T225055) (owner: 10Ladsgroup) [11:02:50] similar with mine, it's just a bump of sampling rate [11:03:05] (03CR) 10jenkins-bot: Set items term store on write both for all of Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535521 (https://phabricator.wikimedia.org/T225055) (owner: 10Ladsgroup) [11:03:07] (03CR) 10Urbanecm: [C: 03+2] Bump MobileWebUIActionsTracking sampling rate to 10 percent [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535536 (https://phabricator.wikimedia.org/T220016) (owner: 10Pmiazga) [11:03:10] okay raynor [11:03:57] (03PS2) 10Urbanecm: Bump MobileWebUIActionsTracking sampling rate to 10 percent [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535536 (https://phabricator.wikimedia.org/T220016) (owner: 10Pmiazga) [11:04:03] (03CR) 10Urbanecm: [C: 03+2] Bump MobileWebUIActionsTracking sampling rate to 10 percent [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535536 (https://phabricator.wikimedia.org/T220016) (owner: 10Pmiazga) [11:04:29] side question, do we not have a train conductor for this week? [11:04:43] or was it just not entered when this week’s deployment calendar was set up? 
[11:05:20] greg-g: ^^ [11:05:36] Amir1: syncing [11:05:59] dear deployers [11:06:02] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 6afe963: Set items term store on write both for all of Wikidata (T225055) (duration: 00m 55s) [11:06:03] just keep in mind that [11:06:04] https://phabricator.wikimedia.org/T227541 [11:06:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:04] T225055: Switch `tmpItemTermsMigrationStages` to MIGRATION_WRITE_BOTH - https://phabricator.wikimedia.org/T225055 [11:06:05] (03Merged) 10jenkins-bot: Bump MobileWebUIActionsTracking sampling rate to 10 percent [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535536 (https://phabricator.wikimedia.org/T220016) (owner: 10Pmiazga) [11:06:22] tl;dr [11:06:48] !log cp1075: set weight in etcd back to 100 [11:06:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:50] How many job runners we have? I thought it's less [11:06:54] (03CR) 10jenkins-bot: Bump MobileWebUIActionsTracking sampling rate to 10 percent [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535536 (https://phabricator.wikimedia.org/T220016) (owner: 10Pmiazga) [11:07:00] Urbanecm: thanks [11:07:02] we are doing some maint work on a rack with some jobservers, app/api server (don't recall) and memcached ser ers [11:07:11] !log ema@puppetmaster1001 conftool action : set/weight=100; selector: service=ats-be,dc=eqiad,name=cp1075.eqiad.wmnet [11:07:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:17] Lucas_WMDE: It's probably ha.shar [11:07:27] https://phabricator.wikimedia.org/T220747 [11:08:15] raynor: syncing [11:08:35] Urbanecm, thx [11:08:53] yw raynor [11:08:59] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: c780fa4: Bump MobileWebUIActionsTracking sampling rate to 10 percent (T220016) (duration: 00m 55s) [11:09:01] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:02] T220016: Create, and deploy working MobileWebUIActionsTracking schema - https://phabricator.wikimedia.org/T220016 [11:09:03] done [11:09:12] Fast swat today, it seems :) [11:09:16] !log EU SWAT done [11:09:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:21] nice ^^ [11:10:33] Amir1: we don't have that many jobrunners [11:11:28] I thought it's around 15 per dc [11:11:59] Urbanecm, the new sampling rate is in action, thx, works as expected [11:12:06] cool raynor ! [11:12:26] 23 in eqiad and 31 in codfw [11:12:44] It was a super fast SWAT :) [11:12:53] it was less when we had separate job runners and video scalers, but the numbers increased when the clusters were merged [11:13:09] I see [11:15:51] (03PS1) 10Alexandros Kosiaris: Fix typo for mathoid namespace in helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/535549 [11:15:54] moritzm: I was trying to keep the mystery alive [11:19:11] Happy Reminder! Today is PDU swap in B6 [11:20:43] effie: :-) [11:20:54] !log swapping the PDU in rack B6 eqiad T227541 [11:20:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:58] T227541: b6-eqiad pdu refresh (Tuesday 9/10 @11am UTC) - https://phabricator.wikimedia.org/T227541 [11:21:03] cmjohnson1: thank you! 
[11:29:27] (03PS2) 10Jbond: ip6_mapped: add ip6_mapped to profile::standard [puppet] - 10https://gerrit.wikimedia.org/r/535544 (https://phabricator.wikimedia.org/T102099) [11:34:13] (03CR) 10Jbond: ip6_mapped: add ip6_mapped to profile::standard (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/535544 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [11:39:22] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/535544 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [11:39:57] (03CR) 10Muehlenhoff: ip6_mapped: add ip6_mapped to profile::standard (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/535544 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [11:44:25] PROBLEM - PHP7 rendering on mw2231 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2879 bytes in 0.076 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:46:00] looking [11:47:35] RECOVERY - PHP7 rendering on mw2231 is OK: HTTP OK: HTTP/1.1 200 OK - 76186 bytes in 0.332 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:48:52] 10Operations, 10Wikimedia-Mailing-lists: Please create private "testeng" team mailing list - https://phabricator.wikimedia.org/T232178 (10zeljkofilipin) 05Resolved→03Open [11:52:15] 10Operations, 10Wikimedia-Mailing-lists, 10Release-Engineering-Team-TODO (201909): Please create private "testeng" team mailing list - https://phabricator.wikimedia.org/T232178 (10zeljkofilipin) a:05jbond→03Jrbranaa The list is public. @Jrbranaa looks like you have to make it private: https://lists.wikim... 
[11:54:34] 10Operations, 10Wikimedia-Mailing-lists, 10Release-Engineering-Team-TODO (201909): Please create private "testeng" team mailing list - https://phabricator.wikimedia.org/T232178 (10zeljkofilipin) p:05Triage→03Normal [11:55:21] PROBLEM - Host mw1287.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:55:52] PROBLEM - Host mc1026.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:55:52] PROBLEM - Host mc1024.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:55:53] PROBLEM - Host mw1284.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:59:07] 10Operations, 10Wikimedia-Mailing-lists, 10Release-Engineering-Team-TODO (201909): Please create private "testeng" team mailing list - https://phabricator.wikimedia.org/T232178 (10jbond) @zeljkofilipin I have [now] set the subscription model to `Require approval`. I leave it to the admins to change the oth... [12:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190910T1200) [12:01:03] RECOVERY - Host mw1287.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.82 ms [12:01:13] PROBLEM - Host ps1-b6-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [12:01:33] RECOVERY - Host mc1024.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.07 ms [12:01:33] RECOVERY - Host mc1026.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.11 ms [12:01:33] RECOVERY - Host mw1284.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.80 ms [12:05:04] 10Operations, 10Dumps-Generation, 10hardware-requests: Get a third dumpsdata server - https://phabricator.wikimedia.org/T219768 (10Cmjohnson) [12:06:20] (03PS2) 10Alexandros Kosiaris: Fix typo for mathoid namespace in helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/535549 [12:06:26] (03PS2) 10Alexandros Kosiaris: Add wikifeeds helmfile stanzas [deployment-charts] - 10https://gerrit.wikimedia.org/r/535533 (https://phabricator.wikimedia.org/T170455) [12:07:01] 10Operations, 10hardware-requests, 10Discovery-Search 
(Current work): Replace elastic1017-1031 - https://phabricator.wikimedia.org/T221636 (10Cmjohnson) [12:07:25] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] Fix a typo in helmfile services examples [deployment-charts] - 10https://gerrit.wikimedia.org/r/535532 (owner: 10Alexandros Kosiaris) [12:07:32] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] Fix typo for mathoid namespace in helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/535549 (owner: 10Alexandros Kosiaris) [12:08:45] (03PS1) 10Filippo Giunchedi: swift: escape exclamation mark in alerts [puppet] - 10https://gerrit.wikimedia.org/r/535561 [12:09:28] (03CR) 10Filippo Giunchedi: [C: 03+2] swift: escape exclamation mark in alerts [puppet] - 10https://gerrit.wikimedia.org/r/535561 (owner: 10Filippo Giunchedi) [12:10:00] ema: I am seeing intermittent failures to fetch https://releases.wikimedia.org/charts/raw-0.2.0.tgz on deploy1001 [12:10:17] If you report this error to the Wikimedia System Administrators, please include the details below.Request from 2620:0:861:103:10:64:32:16 via cp1075.eqiad.wmnet, ATS/8.0.5
Error: 502, connect failed at 2019-09-10 12:09:16 GMT [12:10:51] (03CR) 10jerkins-bot: [V: 04-1] swift: escape exclamation mark in alerts [puppet] - 10https://gerrit.wikimedia.org/r/535561 (owner: 10Filippo Giunchedi) [12:10:59] lemme paste the entire curl output on phab [12:14:47] !log removing power from ps1-b6 side B...mgmt should not be affected [12:14:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:55] ema: https://phabricator.wikimedia.org/P9072 [12:17:57] (03PS2) 10Filippo Giunchedi: swift: escape exclamation mark in alerts [puppet] - 10https://gerrit.wikimedia.org/r/535561 [12:18:26] and there we go WARNING: SNI (releases.wikimedia.org) not in certificate. Action=Terminate server=releases.discovery.wmnet(10.64.0.88) [12:19:08] (03CR) 10Jbond: ip6_mapped: add ip6_mapped to profile::standard (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/535544 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [12:19:12] (03CR) 10Jbond: [C: 03+2] ip6_mapped: add ip6_mapped to profile::standard [puppet] - 10https://gerrit.wikimedia.org/r/535544 (https://phabricator.wikimedia.org/T102099) (owner: 10Jbond) [12:19:17] (03PS3) 10Jbond: ip6_mapped: add ip6_mapped to profile::standard [puppet] - 10https://gerrit.wikimedia.org/r/535544 (https://phabricator.wikimedia.org/T102099) [12:19:54] (03PS1) 10Jbond: puppetmaster1003: move mw1261 and mwdebug1001 to new pmaster [puppet] - 10https://gerrit.wikimedia.org/r/535564 (https://phabricator.wikimedia.org/T228657) [12:20:47] PROBLEM - Juniper alarms on asw2-b-eqiad is CRITICAL: JNX_ALARMS CRITICAL - 1 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [12:20:49] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1004 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Application_servers 
https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1 [12:23:59] RECOVERY - Juniper alarms on asw2-b-eqiad is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [12:24:03] ema: I 've staged a diff in the puppet private repo to fix the releases certificate, but I 'd rather not go and regenerate the certificate with cergen without you acking it [12:24:09] so stalling it [12:24:17] 10Operations, 10Wikimedia-Mailing-lists, 10Release-Engineering-Team-TODO (201909): Please create private "testeng" team mailing list - https://phabricator.wikimedia.org/T232178 (10zeljkofilipin) 05Open→03Resolved Thanks! I do remember now that that's how other lists work, there's just one password. [12:25:32] (03CR) 10Alexandros Kosiaris: [C: 03+1] ATS: use TLS to connect to etherpad [puppet] - 10https://gerrit.wikimedia.org/r/535540 (https://phabricator.wikimedia.org/T210411) (owner: 10Ema) [12:25:58] 10Operations, 10Wikimedia-Mailing-lists, 10Release-Engineering-Team-TODO (201909): Please create private "testeng" team mailing list - https://phabricator.wikimedia.org/T232178 (10zeljkofilipin) 05Resolved→03Open Oops, looks like I've resolved this to quickly. List archives are still public: https://lis... 
[12:29:32] (03PS2) 10Jbond: puppetmaster1003: move mw1261 and mwdebug1001 to new pmaster [puppet] - 10https://gerrit.wikimedia.org/r/535564 (https://phabricator.wikimedia.org/T228657) [12:30:15] (03CR) 10Jbond: [C: 03+2] puppetmaster1003: move mw1261 and mwdebug1001 to new pmaster [puppet] - 10https://gerrit.wikimedia.org/r/535564 (https://phabricator.wikimedia.org/T228657) (owner: 10Jbond) [12:33:38] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, thanks for taking a look" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535262 (https://phabricator.wikimedia.org/T211124) (owner: 10Krinkle) [12:33:48] (03CR) 10Filippo Giunchedi: [C: 03+1] logging: Remove unused 'wmgLogstashUseCee' variable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535263 (https://phabricator.wikimedia.org/T211124) (owner: 10Krinkle) [12:36:13] (03PS1) 10Noa wmde: Configure a feature flag for Wikibase Tainted References [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535567 (https://phabricator.wikimedia.org/T232191) [12:37:38] (03CR) 10jerkins-bot: [V: 04-1] Configure a feature flag for Wikibase Tainted References [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535567 (https://phabricator.wikimedia.org/T232191) (owner: 10Noa wmde) [12:39:06] (03PS3) 10Filippo Giunchedi: swift: escape exclamation mark in alerts [puppet] - 10https://gerrit.wikimedia.org/r/535561 [12:39:07] is anybody checking the mediawiki alerts? [12:39:08] https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1 [12:39:36] jouncebot: now [12:39:36] For the next 0 hour(s) and 20 minute(s): Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190910T1200) [12:39:53] so hmm [12:39:55] time to cut the branch [12:39:58] elukey: I am not :\ [12:40:03] PROBLEM - Check the Netbox report-s- librenms for fail status.
on netmon1002 is CRITICAL: librenms.LibreNMS CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [12:40:31] !log the new pdus are racked in b6 [12:40:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:33] hm, that’s a lot of errors, mostly on Wikidata, apparently starting right after EU SWAT… [12:40:36] Amir1: ^ [12:41:01] Lucas_WMDE: where is it? [12:41:11] mediawiki-NEW-errors in logstash [12:41:13] elukey: mybad [12:41:20] deadlocks in item terms, it looks like :/ [12:41:36] https://logstash.wikimedia.org/goto/62762fb692707b01da8facc4224bfc8c [12:41:38] (03CR) 10Filippo Giunchedi: [C: 03+2] swift: escape exclamation mark in alerts [puppet] - 10https://gerrit.wikimedia.org/r/535561 (owner: 10Filippo Giunchedi) [12:41:50] Lucas_WMDE: we started write both on all of wikidata for new term store [12:42:04] this is sorta expected as long as it's not super big [12:42:11] filtered fatal monitor in logstash shows the lock problem as well [12:42:13] is it expected to go down again? [12:42:23] 10Operations, 10ops-eqiad, 10DC-Ops: b6-eqiad pdu refresh (Tuesday 9/10 @11am UTC) - https://phabricator.wikimedia.org/T227541 (10Cmjohnson) [12:42:26] Amir1: will it recover? [12:42:39] is there something we can do from our end? [12:43:09] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1004 is OK: OK: Less than 70.00% above the threshold [25.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1 [12:43:17] It will decrease for sure, since new terms need to go to the database (the duplicated values) [12:43:27] but I'm not sure how much [12:43:50] it seems down to regular levels now [12:44:18] the jobqueue one is scary though. 
The ruwiki should not has anything with the change [12:44:47] 10Operations, 10ops-eqiad, 10DC-Ops: b6-eqiad pdu refresh (Tuesday 9/10 @11am UTC) - https://phabricator.wikimedia.org/T227541 (10Cmjohnson) The PDU has been swapped and the new pdus are in netbox. @robh can you help with the setup for serial console please. [12:44:52] "Error from line 130 of /srv/mediawiki/php-1.34.0-wmf.21/extensions/Graph/includes/ApiGraph.php" [12:45:54] Am I the only one experiencing slowness with gerrit rn? [12:46:12] Getting several 502s [12:47:36] if cutting a branch is in progress right now then it is an unfortunate but known side effect (gerrit slow/unresponsive) [12:47:49] RECOVERY - Number of mw swift objects in codfw greater than eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?var-DC=codfw [12:47:53] Ah thanks, never noticed that [12:47:55] speaking of which is there a task for it ? [12:48:09] godog: T231872 I think [12:48:17] T231872: Gerrit GC thrashing during branch cut - https://phabricator.wikimedia.org/T231872 [12:48:30] Lucas_WMDE: thanks! that's it [12:48:39] Nice, ty [12:49:55] although gerrit looks down altogether from here [12:50:31] PROBLEM - Gerrit JSON on gerrit.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Gerrit%23Monitoring [12:50:36] Same [12:50:43] Timing out now [12:51:52] I'll take a look, unless someone has a better idea/lead ? [12:52:16] hashar: was branch cutting in progress already? 
eliminating causes for gerrit slow/down [12:53:01] godog: let's wait for releng to have a look [12:53:10] before we start restarting gerrir [12:53:12] gerrit* [12:53:46] sounds good effie [12:57:20] godog: branch is usually cut in the morning, before train window, but I don't really know hashar's schedule :) [12:57:43] godog: that is the branch cut yes [12:57:51] I was busy with other CI stuff this morning :\ [12:59:39] ack, what's the SOP when gerrit is down due to branch cut ? [13:00:04] TRAINCONDUCTOR: Dear deployers, time to do the MediaWiki train - European version deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190910T1300). [13:01:49] !log copied prometheus-jmx-exporter to buster-wikimedia (from stretch-wikimedia, just a package with some jars) [13:01:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:43] PROBLEM - Check systemd state on puppetdb1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:04:29] ^ those are in setup, I'm setting them to downtime [13:06:47] hashar zeljkof anything I can do to help with gerrit ? [13:07:39] gerrit is down because of branch cut? [13:08:02] godog: hashar is on train duty, he'll know [13:08:31] it is possible gerrit is down due to branch cut yes, that's my hunch though [13:08:37] I think in this instance we may need a restart :( [13:09:36] branch cut causes large allocations and gc can't keep up seemingly. When gc is this time-consuming I don't think it'll catch up and branch cut will take a long time as well [13:09:50] I cant even find the bug report :-\ [13:10:07] T231872 [13:10:07] T231872: Gerrit GC thrashing during branch cut - https://phabricator.wikimedia.org/T231872 [13:10:29] so that would because too many pack files are opened and kept in the heap/java memory or whatever? 
[13:10:42] I think last week I have noticed a large amount of opened file descriptor on the java process [13:10:43] via lsof [13:10:58] (and sorry I have forgot about that issue this week :-\\\ ) [13:11:20] !log Gerrit experimenting difficulty due to ongoing wmf branch cut - T231872 [13:11:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:31] fatal: unable to access 'https://gerrit.wikimedia.org/r/mediawiki/extensions/CharInsert/': Operation timed out after 300017 milliseconds with 0 out of 0 bytes received [13:11:45] hashar: I'm going to restart gerrit, you should be able to use --continue-from for make-wmf-branch to continue branch cut after it's back [13:11:48] a few times [13:12:00] and apparently make wmf branch is smart enough to detect the error and sleep(5) [13:12:16] !log restarting gerrit [13:12:16] so I guess it would keep retrying while gerrit restart [13:12:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:33] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1004 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1 [13:12:38] 10Operations, 10ops-eqiad: (Aug 30th, 2019) rack/setup/install elastic10[53-67].eqiad.wmnet - https://phabricator.wikimedia.org/T230746 (10Cmjohnson) [13:15:10] Amir1: Lucas_WMDE ^ [13:15:19] there are wikidata fatals again [13:15:33] RECOVERY - Gerrit JSON on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 26247 bytes in 0.064 second response time https://wikitech.wikimedia.org/wiki/Gerrit%23Monitoring [13:15:34] sorry, exceptions [13:16:50] thanks thcipriani, looks like gerrit is coming back [13:18:47] godog: good deal. We have been running near the memory limit for the past few days. Pause times getting longer. 
Branch cut causes a bunch of humongous allocations and then the GC pause makes the service look dead :( [13:19:32] (03PS2) 10Jbond: puppetmaster1003: move cp1075-77 to new pmaster [puppet] - 10https://gerrit.wikimedia.org/r/535570 (https://phabricator.wikimedia.org/T228657) [13:19:41] (03CR) 10Jbond: [C: 03+2] puppetmaster1003: move cp1075-77 to new pmaster [puppet] - 10https://gerrit.wikimedia.org/r/535570 (https://phabricator.wikimedia.org/T228657) (owner: 10Jbond) [13:20:26] hashar: gerrit should be back. make-wmf-branch -c 'extensions/[last successful branched extension]' should get branch cut back on track, I think. [13:20:39] sorry for the interruption :( [13:22:56] ah it gives up after sometime :] [13:23:41] !log ./make-wmf-branch -n 1.34.0-wmf.22 -o master -c extensions/CharInsert # T220747 [13:23:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:44] T220747: 1.34.0-wmf.22 deployment blockers - https://phabricator.wikimedia.org/T220747 [13:23:50] thcipriani: works like a charm [13:24:20] thcipriani, hashar: Just to make sure: mediawiki/tools/release is only meant to support PHP7+, right? [13:24:59] Daimona: Iwould guess hhvm as well [13:25:22] Daimona: I don't know really, it has various scripts [13:25:28] (03CR) 10Noa wmde: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535567 (https://phabricator.wikimedia.org/T232191) (owner: 10Noa wmde) [13:25:34] (03PS2) 10Muehlenhoff: Remove roentgenium/tureis [puppet] - 10https://gerrit.wikimedia.org/r/534017 (https://phabricator.wikimedia.org/T224559) [13:25:55] I'm asking because it has several scalar typehints. I'm updating PHPCS, but I want to make sure it really shouldn't support HHVM [13:26:04] See e.g. 
https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/tools/release/+/8e8f8bf7fc03709f5dd004480817e5b40b6c3321/make-tarball-release/src/Control.php [13:26:28] This file is PHP71+, I want to be sure the same applies to the others [13:27:29] Although I see make-tarball-release has its own config file, so alright [13:28:14] thcipriani: ack, thanks for the context! I wondering if it is also the rate of operations we're asking gerrit that overwhelms the jvm? IOW if doing "thinks" slower when cutting branches might help alleviate the problem? [13:28:20] shooting from the hip here heh [13:31:09] (03PS1) 10Filippo Giunchedi: thumbor: stop relaying to statsd/statsite [puppet] - 10https://gerrit.wikimedia.org/r/535591 (https://phabricator.wikimedia.org/T205870) [13:34:43] !log reboot stat1005 to clear incosistent process state after tensorflow tests [13:34:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:33] godog: that's possible. It seems like a very large amount of memory to allocate for adding a few refs. I do wonder if the gerrit api for adding refs might be easier than what we currently do (which is a bit in inefficient to say the least :)) [13:35:53] (03PS2) 10Noa wmde: TR: Configure a feature flag for Wikibase Tainted References [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535567 (https://phabricator.wikimedia.org/T232191) [13:36:02] (03PS1) 10Elukey: aptrepo: create component for new version of ROCm [puppet] - 10https://gerrit.wikimedia.org/r/535592 [13:37:34] (03CR) 10jerkins-bot: [V: 04-1] TR: Configure a feature flag for Wikibase Tainted References [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535567 (https://phabricator.wikimedia.org/T232191) (owner: 10Noa wmde) [13:38:10] Daimona: that repo is pretty ad-hoc. I don't really know if we even use the make-tarball-release script anymore. It may have been (partially?) supplanted by a python script. 
make-wmf-branch is the only PHP I'm familiar with in that repo and no specific efforts have been made to make it php7 compliant; although, it's a glorified shell script and not using any esoteric language features. [13:40:40] I think it would be a good idea to re-develop make-wmf-branch in a way that involves also developing a test suite that tests handling error situations [13:41:57] thcipriani. Thanks. I'm trying to keep the status quo, i.e. allow HHVM for anything other than make-tarball-release. [13:44:38] (03PS1) 10Muehlenhoff: Drop symlink for /etc/puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/535593 [13:45:11] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/535592 (owner: 10Elukey) [13:45:25] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1004 is OK: OK: Less than 70.00% above the threshold [25.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1 [13:53:23] (03CR) 10Elukey: [C: 03+2] aptrepo: create component for new version of ROCm [puppet] - 10https://gerrit.wikimedia.org/r/535592 (owner: 10Elukey) [13:53:36] !log scap prep 1.34.0-wmf.22 # T220747 [13:53:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:39] T220747: 1.34.0-wmf.22 deployment blockers - https://phabricator.wikimedia.org/T220747 [13:54:16] (03CR) 10Jbond: "may need to change the entries in /etc/default/puppetdb as well" [puppet] - 10https://gerrit.wikimedia.org/r/535593 (owner: 10Muehlenhoff) [13:55:29] (03PS2) 10Ottomata: Increase default EventGate max_body_size to 10mb [deployment-charts] - 10https://gerrit.wikimedia.org/r/535286 (https://phabricator.wikimedia.org/T232362) [13:56:38] !log Applied security patches to 1.34.0-wmf.22 # T220747 [13:56:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:37] (03PS1) 10Hashar: Group0 to 1.34.0-wmf.22 [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/535598 (https://phabricator.wikimedia.org/T220747) [13:58:55] !log hashar@deploy1001 Started scap: testwiki to php-1.34.0-wmf.22 and rebuild l10n cache # T220747 [13:58:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:58] T220747: 1.34.0-wmf.22 deployment blockers - https://phabricator.wikimedia.org/T220747 [13:59:23] that is where I have to idle for an hour or so why the code is being copied :-\ [14:07:31] effie: sorry, was at meeting [14:08:46] has anyone filled a bug for the mediawiki error log spam ? [14:09:07] PHP Warning: [data-update-failed]: A data update callback triggered an exception (Wikimedia\Rdbms\Database::makeList: empty input for field wbxl_text_id) [Called from Wikibase\Repo\Content\DataUpdateAdapter::doUpdate in /srv/mediawiki/php-1.34.0-wmf.21/extensions/Wikibase/repo/includes/Content/DataUpdateAdapter.php at line 65] [14:09:09] among others [14:09:26] hashar: let's just revert it for now [14:09:58] (03PS3) 10Alexandros Kosiaris: Add wikifeeds helmfile stanzas [deployment-charts] - 10https://gerrit.wikimedia.org/r/535533 (https://phabricator.wikimedia.org/T170455) [14:10:00] (03PS1) 10Alexandros Kosiaris: Add wikifeeds to admin environments [deployment-charts] - 10https://gerrit.wikimedia.org/r/535601 (https://phabricator.wikimedia.org/T170455) [14:10:24] Amir1: if that is revertable, yes please :] [14:10:24] hashar: Can you revert it? I need to leave for lunch. https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/535521 [14:10:48] but it is 4pm ! 
:] [14:10:50] I will [14:10:50] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] Add wikifeeds to admin environments [deployment-charts] - 10https://gerrit.wikimedia.org/r/535601 (https://phabricator.wikimedia.org/T170455) (owner: 10Alexandros Kosiaris) [14:11:28] (03PS1) 10Hashar: Revert "Set items term store on write both for all of Wikidata" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535602 [14:11:34] (03PS2) 10Hashar: Revert "Set items term store on write both for all of Wikidata" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535602 [14:11:35] exactly [14:11:38] (03CR) 10Hashar: "Per Ladsgroup." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535602 (owner: 10Hashar) [14:11:40] (03PS3) 10Ottomata: Increase default EventGate max_body_size to 10mb [deployment-charts] - 10https://gerrit.wikimedia.org/r/535286 (https://phabricator.wikimedia.org/T232362) [14:12:14] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Increase default EventGate max_body_size to 10mb [deployment-charts] - 10https://gerrit.wikimedia.org/r/535286 (https://phabricator.wikimedia.org/T232362) (owner: 10Ottomata) [14:13:46] sorry, I am in a meeting [14:14:01] * hashar presses enter to speed up scap sync [14:14:10] !log akosiaris@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp1075.eqiad.wmnet,dc=eqiad,cluster=cache_text,service=ats-be [14:14:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:14] Amir1: so yeah will revert as soon as testwiki got promoted to the new mw version :-] [14:14:22] Amir1: have a good lunch [14:14:32] !log depool cp1075 ats-be to test helmfile sync [14:14:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:15] !log increasing max_body_size to 10mb for all eventgate services - T232362 [14:18:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:17] T232362: Massmessages not going through, log looks fine - 
https://phabricator.wikimedia.org/T232362 [14:18:20] !log @ helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-analytics' for release 'analytics' . [14:18:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:40] (03PS2) 10Muehlenhoff: Drop symlink for /etc/puppetdb and update default file [puppet] - 10https://gerrit.wikimedia.org/r/535593 [14:19:57] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/535593 (owner: 10Muehlenhoff) [14:20:33] !log @ helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-analytics' for release 'analytics' . [14:20:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:05] (03PS3) 10Muehlenhoff: Remove roentgenium/tureis [puppet] - 10https://gerrit.wikimedia.org/r/534017 (https://phabricator.wikimedia.org/T224559) [14:22:16] (03PS4) 10Alexandros Kosiaris: Add wikifeeds helmfile stanzas [deployment-charts] - 10https://gerrit.wikimedia.org/r/535533 (https://phabricator.wikimedia.org/T170455) [14:22:18] (03PS1) 10Alexandros Kosiaris: admin: switch eqiad envs.yaml to a symlink as well [deployment-charts] - 10https://gerrit.wikimedia.org/r/535606 [14:23:12] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] admin: switch eqiad envs.yaml to a symlink as well [deployment-charts] - 10https://gerrit.wikimedia.org/r/535606 (owner: 10Alexandros Kosiaris) [14:24:07] (03CR) 10Muehlenhoff: [C: 03+2] Remove roentgenium/tureis [puppet] - 10https://gerrit.wikimedia.org/r/534017 (https://phabricator.wikimedia.org/T224559) (owner: 10Muehlenhoff) [14:26:39] !log @ helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-analytics' for release 'analytics' . 
[14:26:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:14] (03PS1) 10Jbond: puppetmaster1003: move mw appserver, api and lvs server to new pmaster [puppet] - 10https://gerrit.wikimedia.org/r/535609 (https://phabricator.wikimedia.org/T228657) [14:27:17] 10Operations, 10Analytics, 10Code-Stewardship-Reviews, 10Tools, 10Wikimedia-IRC-RC-Server: IRC RecentChanges feed: code stewardship request - https://phabricator.wikimedia.org/T185319 (10elukey) There is now a task to track the work: T232483 [14:27:51] 14:27:42 Finished sync-apaches (duration: 13m 51s) [14:27:53] not too bad :] [14:27:58] (03PS2) 10Jbond: puppetmaster1003: move mw appserver, api and lvs server to new pmaster [puppet] - 10https://gerrit.wikimedia.org/r/535609 (https://phabricator.wikimedia.org/T228657) [14:28:38] (03CR) 10Jbond: [C: 03+2] puppetmaster1003: move mw appserver, api and lvs server to new pmaster [puppet] - 10https://gerrit.wikimedia.org/r/535609 (https://phabricator.wikimedia.org/T228657) (owner: 10Jbond) [14:29:23] !log @ helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-main' for release 'main' . [14:29:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:51] scap cdb rebuild almost complete [14:31:59] !log @ helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-main' for release 'main' . 
[14:32:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:29] * hashar waits on mwdebug1002 to complete the cdb rebuild [14:32:59] !log hashar@deploy1001 Finished scap: testwiki to php-1.34.0-wmf.22 and rebuild l10n cache # T220747 (duration: 34m 03s) [14:33:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:01] T220747: 1.34.0-wmf.22 deployment blockers - https://phabricator.wikimedia.org/T220747 [14:33:08] (duration: 34m 03s) [14:34:11] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text site=esams https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [14:34:34] hmm [14:34:38] maybe train related bah :-\\\ [14:35:06] w.config.set({"wgBackendResponseTime":37771,"wgHostname":"mw1256"});} [14:35:19] bla bla opcode cache priming [14:35:23] PROBLEM - Check the Netbox report-s- puppetdb for fail status. on netmon1002 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [14:35:45] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [14:36:16] (03PS1) 10Alexandros Kosiaris: Add SubjAltName releases.wikimedia.org to certificate [puppet] - 10https://gerrit.wikimedia.org/r/535613 [14:36:26] (03PS3) 10Muehlenhoff: Drop symlink for /etc/puppetdb and update default file [puppet] - 10https://gerrit.wikimedia.org/r/535593 [14:36:33] !log @ helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-main' for release 'main' . 
[14:36:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:03] (03CR) 10Muehlenhoff: "PS3 just avoids some whitespace changes for the existing 4.x config." [puppet] - 10https://gerrit.wikimedia.org/r/535593 (owner: 10Muehlenhoff) [14:37:18] (03CR) 10Ema: [C: 03+1] "looks flawless" [puppet] - 10https://gerrit.wikimedia.org/r/535613 (owner: 10Alexandros Kosiaris) [14:37:49] (03CR) 10Alexandros Kosiaris: [C: 03+2] Add SubjAltName releases.wikimedia.org to certificate [puppet] - 10https://gerrit.wikimedia.org/r/535613 (owner: 10Alexandros Kosiaris) [14:38:22] testwiki promoted [14:38:43] and opcode / bytecode caches should be primed hopefully by now [14:39:16] flawless victory? https://www.youtube.com/watch?v=-4nSZ79fXO8 [14:39:58] (03PS1) 10Pmiazga: Disable AMC Outreach modal [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535615 (https://phabricator.wikimedia.org/T231436) [14:40:10] bblack: yes! [14:40:23] (03PS1) 10Ayounsi: Depool ulsfo for DC power work [dns] - 10https://gerrit.wikimedia.org/r/535616 [14:40:25] 10Operations, 10MassMessage, 10WMF-JobQueue, 10Core Platform Team Workboards (Clinic Duty Team), 10Patch-For-Review: Massmessages not going through, log looks fine - https://phabricator.wikimedia.org/T232362 (10Ottomata) Ok, max_body_size increased to 10mb. 
[14:41:50] (03CR) 10Hashar: [C: 03+2] Group0 to 1.34.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535598 (https://phabricator.wikimedia.org/T220747) (owner: 10Hashar) [14:42:25] (03CR) 10Ayounsi: [C: 03+2] Depool ulsfo for DC power work [dns] - 10https://gerrit.wikimedia.org/r/535616 (owner: 10Ayounsi) [14:44:09] !log depool ulsfo for DC UPS power maintenance (see maint-announce) [14:44:10] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp1075.eqiad.wmnet,dc=eqiad,cluster=cache_text,service=ats-be [14:44:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:19] (03PS4) 10Muehlenhoff: Drop symlink for /etc/puppetdb and update default file [puppet] - 10https://gerrit.wikimedia.org/r/535593 [14:44:23] (03CR) 10Ottomata: "Thanks Luca! Didn't have time yesterday to follow up on the errors." [dns] - 10https://gerrit.wikimedia.org/r/535221 (https://phabricator.wikimedia.org/T225128) (owner: 10Ottomata) [14:45:01] (03Merged) 10jenkins-bot: Group0 to 1.34.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535598 (https://phabricator.wikimedia.org/T220747) (owner: 10Hashar) [14:45:51] !log repool cp1075 ats-be, releases cert updated [14:45:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:29] (03CR) 10jenkins-bot: Group0 to 1.34.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535598 (https://phabricator.wikimedia.org/T220747) (owner: 10Hashar) [14:47:10] (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1001/18228/" [puppet] - 10https://gerrit.wikimedia.org/r/535593 (owner: 10Muehlenhoff) [14:48:34] (03PS2) 10Ottomata: Prep for installing an-presto nodes [puppet] - 10https://gerrit.wikimedia.org/r/535209 (https://phabricator.wikimedia.org/T225128) [14:48:39] !log hashar@deploy1001 scap failed: average error rate on 3/11 
canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/db09a36be5ed3e81155041f7d46ad040 for details) [14:48:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:53] and this train is dieing [14:49:06] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Prep for installing an-presto nodes [puppet] - 10https://gerrit.wikimedia.org/r/535209 (https://phabricator.wikimedia.org/T225128) (owner: 10Ottomata) [14:49:15] canary checks failed :D [14:49:48] PHP Warning: curl_multi_setopt():Invalid curl multi configuration option [14:49:50] yupi [14:50:07] That's probably an AaronSchulz bug [14:50:22] Got a file/line? [14:50:47] * hashar reverts [14:51:43] Reedy: yeah will fill it soon [14:51:45] (03PS1) 10Hashar: Revert "Group0 to 1.34.0-wmf.22" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535619 (https://phabricator.wikimedia.org/T220747) [14:51:51] see https://logstash.wikimedia.org/goto/db09a36be5ed3e81155041f7d46ad040 for details) [14:52:00] (03CR) 10Hashar: [C: 03+2] Revert "Group0 to 1.34.0-wmf.22" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535619 (https://phabricator.wikimedia.org/T220747) (owner: 10Hashar) [14:52:06] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/535593 (owner: 10Muehlenhoff) [14:52:20] Like, the last couple of patches potentially caused that [14:52:20] https://github.com/wikimedia/mediawiki/commit/46531d62852239f620f7b7c0af1e5747a9006228 [14:52:47] (03PS1) 10Alexandros Kosiaris: admin: Readd wikifeeds for eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/535620 [14:54:20] (03CR) 10Alexandros Kosiaris: [C: 03+1] Add DNS entries for an-presto nodes [dns] - 10https://gerrit.wikimedia.org/r/535221 (https://phabricator.wikimedia.org/T225128) (owner: 10Ottomata) [14:54:37] Reedy: filled as https://phabricator.wikimedia.org/T232487 [14:54:38] PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on icinga1001 is 
CRITICAL: 43.74 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [14:54:53] (03CR) 10Ottomata: [C: 03+2] Add DNS entries for an-presto nodes [dns] - 10https://gerrit.wikimedia.org/r/535221 (https://phabricator.wikimedia.org/T225128) (owner: 10Ottomata) [14:54:57] (03PS3) 10Ottomata: Add DNS entries for an-presto nodes [dns] - 10https://gerrit.wikimedia.org/r/535221 (https://phabricator.wikimedia.org/T225128) [14:55:03] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] admin: Readd wikifeeds for eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/535620 (owner: 10Alexandros Kosiaris) [14:55:08] (03PS2) 10Alexandros Kosiaris: admin: Readd wikifeeds for eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/535620 [14:55:11] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] admin: Readd wikifeeds for eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/535620 (owner: 10Alexandros Kosiaris) [14:55:17] hashar: Yeah, it's that patch [14:56:50] (03PS2) 10Ema: ATS: use TLS to connect to etherpad [puppet] - 10https://gerrit.wikimedia.org/r/535540 (https://phabricator.wikimedia.org/T210411) [14:57:20] Reedy: :]]] [14:57:23] https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/535622/ reverts it in .22 [14:57:58] (03CR) 10Ema: [C: 03+2] ATS: use TLS to connect to etherpad [puppet] - 10https://gerrit.wikimedia.org/r/535540 (https://phabricator.wikimedia.org/T210411) (owner: 10Ema) [14:59:41] the thing is [14:59:53] it is not the first time we introduce a curl parameter which is not supported on the cluster [15:01:51] (03PS1) 10CRusnov: wikimedia.org: switch netbox. 
alias to netbox inst [dns] - 10https://gerrit.wikimedia.org/r/535623 [15:01:59] I have lost myself in all the patchsets I have [15:02:38] (03CR) 10CRusnov: [C: 04-1] "-1ing to hold this until we're ready" [dns] - 10https://gerrit.wikimedia.org/r/535623 (owner: 10CRusnov) [15:02:41] (03Merged) 10jenkins-bot: Revert "Group0 to 1.34.0-wmf.22" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535619 (https://phabricator.wikimedia.org/T220747) (owner: 10Hashar) [15:03:14] (03CR) 10jenkins-bot: Revert "Group0 to 1.34.0-wmf.22" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535619 (https://phabricator.wikimedia.org/T220747) (owner: 10Hashar) [15:04:00] (03PS4) 10Ottomata: Add DNS entries for an-presto nodes [dns] - 10https://gerrit.wikimedia.org/r/535221 (https://phabricator.wikimedia.org/T225128) [15:04:09] so reverting group0 entirely [15:04:19] notably tyhe canaries [15:04:26] (03CR) 10Ayounsi: [C: 03+1] "The change itself lgtm" [dns] - 10https://gerrit.wikimedia.org/r/535623 (owner: 10CRusnov) [15:04:55] mobrovac: if you would like to test, I guess you can bump the wikis to 1.34.0-wmf.22 by editing wikiversions.php on mwdebug1001 ? :) [15:05:06] (03CR) 10jerkins-bot: [V: 04-1] Add DNS entries for an-presto nodes [dns] - 10https://gerrit.wikimedia.org/r/535221 (https://phabricator.wikimedia.org/T225128) (owner: 10Ottomata) [15:05:26] hashar: i fugred out what the problem is, that option was introduced in php 7.0.7 [15:05:33] back now [15:05:42] hashar: Is there anything I can do [15:05:48] mobrovac: So hhvm is lolno? [15:05:54] well [15:05:57] hashar: so yeah let's go ahead with the revert [15:06:02] a test would have caught that surely? :] [15:06:11] Reedy: yup, no hhvm love here [15:06:43] (03PS5) 10Ottomata: Add DNS entries for an-presto nodes [dns] - 10https://gerrit.wikimedia.org/r/535221 (https://phabricator.wikimedia.org/T225128) [15:06:45] if we were only on php 7... 
we'd just be ok as MW wants 7.0.13 [15:07:13] (03CR) 10jerkins-bot: [V: 04-1] Add DNS entries for an-presto nodes [dns] - 10https://gerrit.wikimedia.org/r/535221 (https://phabricator.wikimedia.org/T225128) (owner: 10Ottomata) [15:08:12] PROBLEM - Host an-worker1079 is DOWN: PING CRITICAL - Packet loss = 100% [15:08:24] PROBLEM - Host netmon1002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:09:34] PROBLEM - Host es1012.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:09:42] PROBLEM - Host kafka-main1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:09:42] PROBLEM - Host cloudelastic1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:09:42] PROBLEM - Host cloudvirt1003.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:09:43] PROBLEM - Host cloudvirt1016.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:09:44] PROBLEM - Host cloudvirt1005.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:09:46] PROBLEM - Host ps1-a4-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [15:09:46] PROBLEM - Host ps1-a3-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [15:09:46] PROBLEM - Host ps1-b3-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [15:09:46] PROBLEM - Host restbase1021.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:09:50] (03PS6) 10Ottomata: Add DNS entries for an-presto nodes [dns] - 10https://gerrit.wikimedia.org/r/535221 (https://phabricator.wikimedia.org/T225128) [15:09:50] PROBLEM - Host ps1-b1-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [15:09:52] PROBLEM - Host ps1-b5-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [15:10:00] PROBLEM - Host elastic1032.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:10:01] PROBLEM - Host prometheus1004.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:10:01] PROBLEM - Host ores1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:10:01] PROBLEM - Host prometheus1003.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:10:02] PROBLEM - Host msw1-eqiad is DOWN: PING CRITICAL 
- Packet loss = 100% [15:10:10] PROBLEM - Host pc1007.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:10:11] PROBLEM - Host puppetmaster1003.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:10:11] PROBLEM - Host db1103.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:10:11] PROBLEM - Host elastic1030.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:10:11] PROBLEM - Host restbase1019.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:10:11] PROBLEM - Host restbase1022.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:10:12] PROBLEM - Host labsdb1012.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:10:12] PROBLEM - Host kubestage1002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:10:13] PROBLEM - Host puppetmaster1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:10:13] PROBLEM - Host restbase-dev1004.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:10:14] PROBLEM - Host rhodium.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:10:14] PROBLEM - Host relforge1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:10:14] ^ PDU maintenance? [15:10:15] PROBLEM - Host torrelay1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:10:15] PROBLEM - Host tungsten.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:10:16] Huh?! 
[15:10:16] PROBLEM - Host sodium.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:10:16] PROBLEM - Host scb1002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:10:17] PROBLEM - Host snapshot1005.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:10:18] PROBLEM - Host thumbor1002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:10:26] PROBLEM - Host ps1-b4-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [15:10:30] PROBLEM - Host db1086.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:10:32] PROBLEM - Host ps1-a8-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [15:10:32] PROBLEM - Host cloudvirtan1002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:10:34] I don't see anything being logged in SAL? [15:10:36] PROBLEM - Host ps1-a2-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [15:10:36] PROBLEM - Host ps1-a1-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [15:10:39] p+_P [15:10:41] o_O* [15:10:48] PROBLEM - Host ps1-a6-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [15:10:48] PROBLEM - Host ps1-a7-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [15:10:50] PROBLEM - Host cp1078.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:10:56] Wait, that looks like a whole row? 
[15:10:57] PROBLEM - Host cloudvirt1028.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:10:57] PROBLEM - Host ms-be1019.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:10:57] PROBLEM - Host cloudvirt1021.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:10:57] PROBLEM - Host snapshot1008.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:10:57] PROBLEM - Host labweb1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:10:58] PROBLEM - Host stat1007.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:10:59] PROBLEM - Host cloudvirt1006.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:11:00] PROBLEM - Host cloudnet1004.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:11:00] PROBLEM - Host mw1271.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:11:01] PROBLEM - Host cloudvirtan1005.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:11:02] PROBLEM - Host cloudvirt1029.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:11:02] PROBLEM - Host cp1076.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:11:03] PROBLEM - Host cp1077.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:11:04] PROBLEM - Host contint1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:11:04] PROBLEM - Host conf1004.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:11:05] PROBLEM - Host dbproxy1013.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:11:05] PROBLEM - Host dbstore1003.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:11:06] PROBLEM - Host mc1021.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:11:06] PROBLEM - Host mc1022.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:11:07] PROBLEM - Host phab1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:11:08] PROBLEM - Host db1127.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:11:08] PROBLEM - Host db1129.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:11:08] PROBLEM - Host db1076.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:11:09] cmjohnson1: [15:11:09] 
PROBLEM - Host db1107.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:11:10] PROBLEM - Host db1112.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:11:11] PROBLEM - Host dbstore1004.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:11:11] PROBLEM - Juniper alarms on cr1-eqiad is CRITICAL: JNX_ALARMS CRITICAL - 4 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [15:11:11] they are management hostnames [15:11:12] def not down, looks like icinga [15:11:14] PROBLEM - Host druid1004.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:11:16] PROBLEM - Host elastic1034.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:11:16] PROBLEM - Host elastic1028.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:11:16] PROBLEM - Host elastic1036.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:11:16] PROBLEM - Host elastic1031.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:11:16] PROBLEM - Host elastic1033.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:11:18] PROBLEM - Host elastic1035.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:11:18] PROBLEM - Host elastic1037.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:11:18] PROBLEM - Host elastic1045.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:11:18] PROBLEM - Host elastic1038.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:11:19] PROBLEM - Host elastic1048.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:11:19] oh mgmt [15:11:20] PROBLEM - Host elastic1044.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:11:21] PROBLEM - Host elastic1039.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:11:21] PROBLEM - Host elastic1049.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:11:21] PROBLEM - Host ms-be1030.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:11:22] PROBLEM - Host es1011.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:11:24] PROBLEM - Host db1098.mgmt is DOWN: PING CRITICAL - Packet loss = 100% 
[15:11:24] PROBLEM - Host ganeti1005.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:11:32] PROBLEM - Host ganeti1007.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:11:32] PROBLEM - Juniper alarms on cr2-eqiad is CRITICAL: JNX_ALARMS CRITICAL - 6 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [15:11:34] PROBLEM - Juniper alarms on asw2-a-eqiad is CRITICAL: JNX_ALARMS CRITICAL - 5 red alarms, 3 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [15:11:34] PROBLEM - Host mw1268.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:11:34] PROBLEM - Host mw1309.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:11:36] PROBLEM - Host helium.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:11:36] PROBLEM - Host kafka-jumbo1003.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:11:36] PROBLEM - Host kafka-jumbo1002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:11:36] PROBLEM - Host lvs1013.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:11:36] PROBLEM - Host kubestage1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:11:37] PROBLEM - Host mc1020.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:11:37] bad but not affecting live operations [15:11:42] PROBLEM - Router interfaces on mr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.199, interfaces up: 35, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:11:44] PROBLEM - Juniper alarms on asw2-b-eqiad is CRITICAL: JNX_ALARMS CRITICAL - 5 red alarms, 3 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [15:11:44] PROBLEM - Host restbase1017.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:11:45] PROBLEM - Host labmon1002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:11:45] PROBLEM - Host kafka1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:11:45] PROBLEM - Host 
maps1002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:11:46] PROBLEM - Host ms-be1044.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:11:46] PROBLEM - Host ms-be1046.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:11:46] PROBLEM - Host ms-be1045.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:11:47] PROBLEM - Host maps1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:11:47] PROBLEM - Host ms-be1018.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:11:48] PROBLEM - Host mw1277.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:11:48] PROBLEM - Host ms-be1016.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:11:49] PROBLEM - Host ms-be1017.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:11:49] PROBLEM - Host ms-be1029.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:11:50] PROBLEM - Host wdqs1004.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:11:51] PROBLEM - Host weblog1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:11:51] PROBLEM - Host ms-be1023.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:11:51] not all of them all of them [15:11:52] PROBLEM - Host wtp1025.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:11:52] PROBLEM - Host ms-be1028.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:11:53] PROBLEM - Host ms-be1022.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:11:54] PROBLEM - Host ms-fe1006.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:11:54] PROBLEM - Host mw1274.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:11:54] PROBLEM - Host mw1272.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:11:55] PROBLEM - Host mw1283.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:11:55] PROBLEM - Host mw1270.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:11:58] PROBLEM - Host an-conf1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:11:58] PROBLEM - Host mw1278.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:12:02] RECOVERY - Host cp1078.mgmt is UP: PING 
WARNING - Packet loss = 50%, RTA = 176.49 ms [15:12:04] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 230, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:12:04] PROBLEM - Host an-worker1079.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:12:06] RECOVERY - Host ps1-a6-eqiad is UP: PING OK - Packet loss = 0%, RTA = 2.08 ms [15:12:06] RECOVERY - Host ps1-b3-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.87 ms [15:12:06] RECOVERY - Host ps1-a2-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.76 ms [15:12:06] RECOVERY - Host ps1-b1-eqiad is UP: PING OK - Packet loss = 0%, RTA = 2.16 ms [15:12:08] RECOVERY - Host ps1-b4-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.49 ms [15:12:11] hmm indeed not all of them [15:12:14] PROBLEM - Host an-worker1079 is DOWN: PING CRITICAL - Packet loss = 100% [15:12:15] can't ssh into an-worker1079 [15:12:20] that's a good counter to see how many hosts we actually have at eqiad [15:12:24] Host contint1001.mgmt is DOWN :-\ [15:12:30] RECOVERY - Host ps1-a8-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.66 ms [15:12:53] it is affecting actual hosts, too [15:12:54] PROBLEM - Host mw1287.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:12:59] but they have redundancy [15:13:04] RECOVERY - Host ganeti1005.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.80 ms [15:13:04] !log hashar@deploy1001 rebuilt and synchronized wikiversions files: Revert group0 to 1.34.0-wmf.22 # T220747 [15:13:06] and I forgot to press enter when running scap [15:13:08] RECOVERY - Host elastic1030.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.91 ms [15:13:08] RECOVERY - Juniper alarms on cr2-eqiad is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [15:13:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:09] 
T220747: 1.34.0-wmf.22 deployment blockers - https://phabricator.wikimedia.org/T220747 [15:13:14] RECOVERY - Host msw1-eqiad is UP: PING OK - Packet loss = 0%, RTA = 0.80 ms [15:13:16] RECOVERY - Host ps1-a1-eqiad is UP: PING OK - Packet loss = 0%, RTA = 2.30 ms [15:13:18] RECOVERY - Router interfaces on mr1-eqiad is OK: OK: host 208.80.154.199, interfaces up: 37, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:13:20] RECOVERY - Juniper alarms on asw2-b-eqiad is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [15:13:24] RECOVERY - Host cp1076.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.05 ms [15:13:26] RECOVERY - Host mc1022.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.82 ms [15:13:26] RECOVERY - Host mw1278.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.17 ms [15:13:38] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 232, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:13:56] ok let's give it another minute [15:13:58] Operations, CPT Initiatives (PHP7 (TEC4)), HHVM, MW-1.34-notes (1.34.0-wmf.22; 2019-09-10), and 3 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (Reedy) [15:13:59] and see what is left [15:14:19] cmjohnson1: did someone bump into a cable or something like that? 
[15:14:20] RECOVERY - Juniper alarms on cr1-eqiad is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [15:14:30] RECOVERY - Host an-worker1079 is UP: PING WARNING - Packet loss = 93%, RTA = 687.97 ms [15:14:44] RECOVERY - Juniper alarms on asw2-a-eqiad is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [15:14:48] RECOVERY - Host ms-be1022.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.80 ms [15:14:52] RECOVERY - Host elastic1039.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.81 ms [15:14:56] RECOVERY - Host restbase1022.mgmt is UP: PING OK - Packet loss = 0%, RTA = 6.47 ms [15:14:56] RECOVERY - Host es1011.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.84 ms [15:15:08] RECOVERY - Host scb1002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 2.19 ms [15:15:08] RECOVERY - Host kafka-jumbo1003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.08 ms [15:15:11] XioNoX. No, I’m not even there right now. [15:15:14] RECOVERY - Host es1012.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.81 ms [15:15:18] RECOVERY - Host ms-be1045.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.08 ms [15:15:22] RECOVERY - Host kafka-main1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.12 ms [15:15:22] RECOVERY - Host cloudvirt1016.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.39 ms [15:15:23] RECOVERY - Host cloudvirt1003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.58 ms [15:15:23] RECOVERY - Host cloudelastic1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.40 ms [15:15:24] RECOVERY - Host cloudvirt1005.mgmt is UP: PING OK - Packet loss = 0%, RTA = 2.15 ms [15:15:25] Interesting. 
[15:15:26] RECOVERY - Host ps1-a4-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.39 ms [15:15:26] RECOVERY - Host ps1-a3-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.35 ms [15:15:26] RECOVERY - Host restbase1021.mgmt is UP: PING OK - Packet loss = 0%, RTA = 5.66 ms [15:15:30] RECOVERY - Host netmon1002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.68 ms [15:15:32] RECOVERY - Host ps1-b5-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.73 ms [15:15:42] RECOVERY - Host prometheus1004.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.60 ms [15:15:42] RECOVERY - Host elastic1032.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.31 ms [15:15:42] RECOVERY - Host ores1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.57 ms [15:15:42] RECOVERY - Host prometheus1003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 5.98 ms [15:15:46] RECOVERY - Host an-conf1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.12 ms [15:15:52] RECOVERY - Host pc1007.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.08 ms [15:15:52] RECOVERY - Host puppetmaster1003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.09 ms [15:15:52] RECOVERY - Host restbase1019.mgmt is UP: PING OK - Packet loss = 0%, RTA = 5.63 ms [15:15:52] RECOVERY - Host db1103.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.12 ms [15:15:53] RECOVERY - Host labsdb1012.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.13 ms [15:15:53] RECOVERY - Host kubestage1002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.89 ms [15:15:54] RECOVERY - Host puppetmaster1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.07 ms [15:15:54] RECOVERY - Host rhodium.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.49 ms [15:15:55] RECOVERY - Host restbase-dev1004.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.03 ms [15:15:56] RECOVERY - Host torrelay1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 3.47 ms [15:15:56] RECOVERY - Host sodium.mgmt is UP: PING OK - Packet loss = 0%, RTA = 5.24 ms [15:15:56] RECOVERY - Host snapshot1005.mgmt is UP: PING 
OK - Packet loss = 0%, RTA = 1.72 ms [15:15:57] RECOVERY - Host relforge1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.68 ms [15:15:57] RECOVERY - Host tungsten.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.61 ms [15:16:01] RECOVERY - Host thumbor1002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.14 ms [15:16:12] RECOVERY - Host db1086.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.73 ms [15:16:24] RECOVERY - Host maps1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.79 ms [15:16:26] RECOVERY - Host wtp1025.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.80 ms [15:16:26] RECOVERY - Host cloudvirtan1002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.85 ms [15:16:31] RECOVERY - Host ps1-a7-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.85 ms [15:16:36] RECOVERY - Host elastic1044.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.80 ms [15:16:36] Odd. [15:16:39] RECOVERY - Host cloudvirt1028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.04 ms [15:16:39] RECOVERY - Host cloudvirt1021.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.86 ms [15:16:39] RECOVERY - Host ms-be1019.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.78 ms [15:16:40] RECOVERY - Host snapshot1008.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.04 ms [15:16:40] RECOVERY - Host stat1007.mgmt is UP: PING OK - Packet loss = 0%, RTA = 5.68 ms [15:16:41] RECOVERY - Host cloudvirt1006.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.98 ms [15:16:41] RECOVERY - Host labweb1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.96 ms [15:16:41] RECOVERY - Host mw1271.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.88 ms [15:16:42] RECOVERY - Host cloudnet1004.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.98 ms [15:16:43] RECOVERY - Host cloudvirt1029.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.08 ms [15:16:44] RECOVERY - Host cloudvirtan1005.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.34 ms [15:16:44] RECOVERY - Host conf1004.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.04 ms 
[15:16:45] RECOVERY - Host contint1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.14 ms [15:16:46] RECOVERY - Host cp1077.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.04 ms [15:16:46] RECOVERY - Host dbproxy1013.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.04 ms [15:16:47] RECOVERY - Host dbstore1003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.04 ms [15:16:47] RECOVERY - Host mc1021.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.72 ms [15:16:48] RECOVERY - Host phab1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.60 ms [15:16:52] RECOVERY - Host db1127.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.10 ms [15:16:52] RECOVERY - Host db1129.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.04 ms [15:16:52] RECOVERY - Host db1076.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.81 ms [15:16:52] RECOVERY - Host db1107.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.84 ms [15:16:54] RECOVERY - Host dbstore1004.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.14 ms [15:16:54] RECOVERY - Host db1112.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.85 ms [15:16:58] RECOVERY - Host druid1004.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.30 ms [15:17:00] RECOVERY - Host elastic1028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.01 ms [15:17:00] RECOVERY - Host elastic1034.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.32 ms [15:17:00] RECOVERY - Host elastic1033.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.05 ms [15:17:00] RECOVERY - Host elastic1036.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.83 ms [15:17:00] RECOVERY - Host elastic1031.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.88 ms [15:17:02] RECOVERY - Host elastic1045.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.32 ms [15:17:02] RECOVERY - Host elastic1048.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.60 ms [15:17:02] RECOVERY - Host elastic1038.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.31 ms [15:17:02] RECOVERY - Host elastic1035.mgmt is UP: PING OK - Packet loss = 0%, RTA = 
1.58 ms [15:17:03] RECOVERY - Host elastic1037.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.04 ms [15:17:04] RECOVERY - Host elastic1049.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.09 ms [15:17:04] RECOVERY - Host ms-be1030.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.10 ms [15:17:05] Did it start with an-worker1079? [15:17:08] RECOVERY - Host db1098.mgmt is UP: PING OK - Packet loss = 16%, RTA = 0.94 ms [15:17:14] RECOVERY - Host ganeti1007.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.85 ms [15:17:16] RECOVERY - Host mw1268.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.89 ms [15:17:16] RECOVERY - Host mw1309.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.82 ms [15:17:18] RECOVERY - Host helium.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.55 ms [15:17:18] RECOVERY - Host mc1020.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.26 ms [15:17:18] RECOVERY - Host kafka-jumbo1002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.34 ms [15:17:18] RECOVERY - Host kubestage1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.12 ms [15:17:18] RECOVERY - Host lvs1013.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.89 ms [15:17:18] Interesting to see the large variation in RTAs [15:17:26] RECOVERY - Host restbase1017.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.99 ms [15:17:27] RECOVERY - Host labmon1002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.20 ms [15:17:27] RECOVERY - Host kafka1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.27 ms [15:17:27] RECOVERY - Host maps1002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.95 ms [15:17:28] RECOVERY - Host ms-be1044.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.13 ms [15:17:28] RECOVERY - Host ms-be1046.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.11 ms [15:17:28] RECOVERY - Host ms-be1017.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.25 ms [15:17:28] RECOVERY - Host ms-be1016.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.19 ms [15:17:29] RECOVERY - Host ms-be1029.mgmt is UP: PING OK - Packet 
loss = 0%, RTA = 1.28 ms [15:17:29] RECOVERY - Host mw1277.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.27 ms [15:17:30] RECOVERY - Host wdqs1004.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.22 ms [15:17:31] RECOVERY - Host ms-be1018.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.15 ms [15:17:32] RECOVERY - Host weblog1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.13 ms [15:17:32] RECOVERY - Host ms-be1028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.20 ms [15:17:32] RECOVERY - Host ms-be1023.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.18 ms [15:17:36] RECOVERY - Host ms-fe1006.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.56 ms [15:17:36] RECOVERY - Host mw1274.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.55 ms [15:17:36] RECOVERY - Host mw1283.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.56 ms [15:17:36] RECOVERY - Host mw1272.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.56 ms [15:17:36] RECOVERY - Host mw1270.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.54 ms [15:17:46] RECOVERY - Host an-worker1079.mgmt is UP: PING OK - Packet loss = 0%, RTA = 2.18 ms [15:18:09] Critical Alert for device ps1-a7-eqiad.mgmt.eqiad.wmnet - Device rebooted [15:18:13] (CR) Ottomata: [C: +2] Add DNS entries for an-presto nodes [dns] - https://gerrit.wikimedia.org/r/535221 (https://phabricator.wikimedia.org/T225128) (owner: Ottomata) [15:18:34] RECOVERY - Host mw1287.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.09 ms [15:18:58] Uh, rebooted? [15:19:05] :O [15:19:53] totally unrelated; was about to reimage some hosts...should I wait? [15:21:00] things seem mostly ok...proceeding [15:21:57] SPF|Cloud: Bsadowski1 and others, that may have looked scary, you may have lost the updates, but we had redundancy on power so failures with seems to have worked nicely [15:22:06] *which [15:22:57] Redundant power should prevent this though? 
[15:23:11] Not every device can have redundant power [15:23:18] redundant power prevented this [15:23:18] (some switches etc) [15:23:53] mutante: yt? q about https://wikitech.wikimedia.org/w/index.php?title=Server_Lifecycle&type=revision&diff=1808685&oldid=1808684 [15:23:56] SPF|Cloud: The hosts didn't go down, just access to their management interfaces [15:24:02] Ah :) [15:24:02] wmf-aut-reimage giving me unrecognized arguments: --rename --rename-mgmt [15:24:33] wmf-auto-reimage* [15:24:41] Reedy: oh I understand that, didn't know that doesn't apply to mgmt [15:25:05] Which is understandable [15:25:15] production hosts just complained about lack of redundancy [15:25:33] when one power supply got powered off [15:26:00] Operations, Performance-Team, SRE-Access-Requests: Request access to 'deployment' user group for phedenskog - https://phabricator.wikimedia.org/T232489 (Krinkle) [15:26:30] Operations, ops-eqiad, Analytics, Analytics-Kanban, and 2 others: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (ops-monitoring-bot) Script wmf-auto-reimage was launched by otto on cumin1001.eqiad.wmnet for hosts: ` ['a... [15:26:58] ottomata: what was the full command? [15:27:49] Ah well, at least it's confirming redundancy does work as intended! [15:27:57] elukey: the one I just ran was sudo -i wmf-auto-reimage --new -p T225128 an-presto1001.eqiad.wmnet [15:27:57] T225128: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 [15:28:03] the one i was trying to run was [15:28:07] Operations, ops-eqiad, DC-Ops: a7-eqiad pdu refresh - https://phabricator.wikimedia.org/T227143 (Agusbou2015) This task seems to be done. 
[15:28:09] C̶r̶i̶t̶i̶c̶a̶l̶ Device ps1-a7-eqiad.mgmt.eqiad.wmnet recovered from Device rebooted [15:28:11] sudo -i wmf-auto-reimage --rename --rename-mgmt -p T225128 an-presto1001.eqiad.wmnet [15:28:52] ottomata: --rename and --rename-mgmt seem to require arguments, at least checking from the --help.. maybe the wiki was incomplete? [15:30:23] (PS9) Andrew Bogott: labpuppetmaster1001/1002: Clean up after moving puppetmasters to the cloud [puppet] - https://gerrit.wikimedia.org/r/530382 (https://phabricator.wikimedia.org/T171188) [15:30:25] (PS1) Andrew Bogott: wmf_sink: skip host key checking when cleaning certs [puppet] - https://gerrit.wikimedia.org/r/535626 (https://phabricator.wikimedia.org/T232427) [15:30:55] elukey: ? [15:30:59] i don't see any of those in --help [15:31:10] (CR) jerkins-bot: [V: -1] wmf_sink: skip host key checking when cleaning certs [puppet] - https://gerrit.wikimedia.org/r/535626 (https://phabricator.wikimedia.org/T232427) (owner: Andrew Bogott) [15:31:51] ottomata: I did sudo wmf-auto-reimage-host --help [15:31:57] --rename RENAME FQDN of the new name to rename this host to while [15:32:00] reimaging [15:32:02] --rename-mgmt RENAME_MGMT [15:32:05] FQDN of the new name management interface, see [15:32:08] --rename [15:32:21] ah wait you are using auto-reimage, not the -host one [15:32:27] AH [15:32:27] wmf-auto-reimage-host [15:32:29] uh huh. 
[15:33:03] !log reedy@deploy1001 Synchronized php-1.34.0-wmf.22/includes/libs/http/MultiHttpClient.php: T232487 (duration: 00m 55s) [15:33:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:06] hashar: ^ if you want to try again with .22 [15:33:06] T232487: 1.34.0-wmf.22 PHP Warning: curl_multi_setopt():Invalid curl multi configuration option - https://phabricator.wikimedia.org/T232487 [15:33:25] Reedy: yeah deploying that [15:34:17] trying that on 1002 [15:34:22] 1001 is still waiting for reboot [15:34:45] Operations, ops-eqiad, Analytics, Analytics-Kanban, and 2 others: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (ops-monitoring-bot) Script wmf-auto-reimage was launched by otto on cumin1001.eqiad.wmnet for hosts: ` clo... [15:34:47] elukey: i think rename won't work here. [15:34:48] Unable to run wmf-auto-reimage-host: Signed cert on Puppet not found for hosts [15:34:49] Operations, ops-eqiad, Analytics, Analytics-Kanban, and 2 others: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudvirtan1002.eqiad.wmnet'] ` Of which those *... [15:34:53] we've already decommed the host to re-rack them. 
[15:35:22] hmm or maybe i need --new [15:35:23] trying [15:35:37] !log hashar@deploy1001 Synchronized php-1.34.0-wmf.22/includes/libs/http/MultiHttpClient.php: Revert "Improve MultiHttpClient connection concurrency and reuse" - T232487 (duration: 00m 55s) [15:35:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:45] (PS2) Andrew Bogott: wmf_sink: skip host key checking when cleaning certs [puppet] - https://gerrit.wikimedia.org/r/535626 (https://phabricator.wikimedia.org/T232427) [15:35:47] (PS10) Andrew Bogott: labpuppetmaster1001/1002: Clean up after moving puppetmasters to the cloud [puppet] - https://gerrit.wikimedia.org/r/530382 (https://phabricator.wikimedia.org/T171188) [15:36:04] ok i think that is working... [15:36:06] Operations, ops-eqiad, Analytics, Analytics-Kanban, and 2 others: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (ops-monitoring-bot) Script wmf-auto-reimage was launched by otto on cumin1001.eqiad.wmnet for hosts: ` clo... 
[15:37:05] !log Start pre-switchover for m1 steps T231403 [15:37:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:08] T231403: Switchover m1 primary master: db1063 to db1135: Tuesday 10th September at 16:00 UTC - https://phabricator.wikimedia.org/T231403 [15:37:17] (PS1) Hashar: Group0 to 1.34.0-wmf.22 [mediawiki-config] - https://gerrit.wikimedia.org/r/535629 (https://phabricator.wikimedia.org/T220747) [15:37:32] (CR) Hashar: [C: +2] Group0 to 1.34.0-wmf.22 [mediawiki-config] - https://gerrit.wikimedia.org/r/535629 (https://phabricator.wikimedia.org/T220747) (owner: Hashar) [15:38:35] (CR) Marostegui: mariadb: Promote db1135 to m1 master [puppet] - https://gerrit.wikimedia.org/r/534386 (https://phabricator.wikimedia.org/T231403) (owner: Marostegui) [15:38:44] canaries passed :] [15:38:54] (Merged) jenkins-bot: Group0 to 1.34.0-wmf.22 [mediawiki-config] - https://gerrit.wikimedia.org/r/535629 (https://phabricator.wikimedia.org/T220747) (owner: Hashar) [15:39:22] (CR) jenkins-bot: Group0 to 1.34.0-wmf.22 [mediawiki-config] - https://gerrit.wikimedia.org/r/535629 (https://phabricator.wikimedia.org/T220747) (owner: Hashar) [15:39:32] !log hashar@deploy1001 rebuilt and synchronized wikiversions files: group0 to 1.34.0-wmf.22 [15:39:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:23] (CR) Jcrespo: [C: +1] mariadb: Promote db1135 to m1 master [puppet] - https://gerrit.wikimedia.org/r/534386 (https://phabricator.wikimedia.org/T231403) (owner: Marostegui) [15:41:36] Operations, Wikimedia-General-or-Unknown: Numerous people reporting issues saving edits - https://phabricator.wikimedia.org/T232491 (Reedy) [15:41:40] (PS3) Marostegui: mariadb: Promote db1135 to m1 master [puppet] - https://gerrit.wikimedia.org/r/534386 (https://phabricator.wikimedia.org/T231403) [15:42:03] Operations, Wikimedia-General-or-Unknown: Numerous people 
reporting issues saving edits - https://phabricator.wikimedia.org/T232491 (10Reedy) [15:42:25] (03CR) 10Marostegui: [C: 03+2] mariadb: Promote db1135 to m1 master [puppet] - 10https://gerrit.wikimedia.org/r/534386 (https://phabricator.wikimedia.org/T231403) (owner: 10Marostegui) [15:42:32] this time that looks better [15:44:46] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Kanban, and 2 others: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by otto on cumin1001.eqiad.wmnet for hosts: ` clo... [15:45:01] 10Operations, 10Wikimedia-General-or-Unknown: Numerous people reporting issues saving edits - https://phabricator.wikimedia.org/T232491 (10jcrespo) Could be related to wikidata contention? https://logstash.wikimedia.org/goto/cfd48f0d1bd1d040a0c7ce8f76ec6169 [15:45:32] 10Operations, 10Wikimedia-General-or-Unknown: Numerous people reporting issues saving edits - https://phabricator.wikimedia.org/T232491 (10Reedy) p:05Triage→03High [15:51:22] 10Operations, 10Wikimedia-General-or-Unknown: Numerous people reporting issues saving edits - https://phabricator.wikimedia.org/T232491 (10Marostegui) That last link matches this SAL entry: ` 11:06 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: 6afe963: Set items term store on write... [15:51:25] 10Puppet, 10ORES, 10Scoring-platform-team (Current): Include git-lfs in ores::base role - https://phabricator.wikimedia.org/T232494 (10Halfak) [15:51:39] 10Operations, 10Wikimedia-General-or-Unknown: Numerous people reporting issues saving edits - https://phabricator.wikimedia.org/T232491 (10John_of_Reading) I'm seeing ten seconds of "Sending request to en.wikipedia.org..." and no other response, not only from "publish changes" but also "show preview" and "show... 
[15:53:05] (03PS1) 10Halfak: Adds git-lfs requirement to ores::base [puppet] - 10https://gerrit.wikimedia.org/r/535631 [15:53:22] 10Operations, 10Wikimedia-General-or-Unknown: Numerous people reporting issues saving edits - https://phabricator.wikimedia.org/T232491 (10Ahecht) This seems to affect creating and editing pages, across multiple wikis (I've tried enwiki and meta), and across multiple accounts (I've tried both my main account a... [15:53:44] (03PS2) 10Halfak: Adds git-lfs requirement to ores::base [puppet] - 10https://gerrit.wikimedia.org/r/535631 (https://phabricator.wikimedia.org/T232494) [15:53:44] 10Puppet, 10ORES, 10Scoring-platform-team (Current): Include git-lfs in ores::base role - https://phabricator.wikimedia.org/T232494 (10Halfak) a:03Halfak [15:53:46] (03PS3) 10Andrew Bogott: wmf_sink: skip host key checking when cleaning certs [puppet] - 10https://gerrit.wikimedia.org/r/535626 (https://phabricator.wikimedia.org/T232427) [15:54:28] (03CR) 10Ladsgroup: [C: 03+2] Revert "Set items term store on write both for all of Wikidata" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535602 (owner: 10Hashar) [15:56:09] (03CR) 10Andrew Bogott: [C: 03+2] wmf_sink: skip host key checking when cleaning certs [puppet] - 10https://gerrit.wikimedia.org/r/535626 (https://phabricator.wikimedia.org/T232427) (owner: 10Andrew Bogott) [15:56:13] (03Merged) 10jenkins-bot: Revert "Set items term store on write both for all of Wikidata" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535602 (owner: 10Hashar) [15:56:38] (03CR) 10jenkins-bot: Revert "Set items term store on write both for all of Wikidata" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535602 (owner: 10Hashar) [15:57:05] 10Puppet, 10ORES, 10Patch-For-Review, 10Scoring-platform-team (Current): Require git-lfs in ores::base puppet role - https://phabricator.wikimedia.org/T232494 (10Halfak) [15:57:10] 10Puppet, 10ORES, 10Patch-For-Review, 10Scoring-platform-team (Current): Require git-lfs in 
ores::base puppet role - https://phabricator.wikimedia.org/T232494 (10Halfak) @akosiaris, I noticed that a new worker node I started in labs didn't have git-lfs do I couldn't deploy. I wonder if we had manually ins... [15:58:28] !log restarting gerrit (again) https://grafana.wikimedia.org/d/Bw2mQ3iWz/gerrit-javamelody?orgId=1&from=1568109359163&to=1568130959163&var-Application=&var-Window=30m due to T224448 [15:58:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:32] T224448: Gerrit http threads stuck behind sendemail thread - https://phabricator.wikimedia.org/T224448 [15:58:37] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Revert "Set items term store on write both for all of Wikidata" (duration: 01m 02s) [15:58:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:04] godog and _joe_: (Dis)respected human, time to deploy Puppet SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190910T1600). Please do the needful. [16:00:04] No GERRIT patches in the queue for this window AFAICS. [16:00:04] marostegui, jynus, akosiaris, mutante, and xionox: How many deployers does it take to do m1 database master failover deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190910T1600). [16:00:12] you guys around? [16:00:19] what? [16:00:33] ah [16:00:36] yes [16:00:39] XD [16:00:40] loool [16:00:44] akosiaris, jynus mutante ? 
[16:00:44] XioNoX: that is Deployers random alert [16:01:14] marostegui: ahahha <3 [16:04:32] let me know what I have to do [16:04:57] XioNoX: nothing really, just check librenms works fine after the failover [16:05:24] 10Operations, 10Traffic, 10Wikimedia-General-or-Unknown: Numerous people reporting issues saving edits - https://phabricator.wikimedia.org/T232491 (10Reedy) [16:06:16] 10Operations, 10Traffic, 10Wikimedia-General-or-Unknown: Numerous people reporting issues saving edits and viewing previes/diffs - https://phabricator.wikimedia.org/T232491 (10Daimona) [16:06:57] 10Operations, 10Traffic, 10Wikimedia-General-or-Unknown: Numerous people reporting issues saving edits and viewing previews/diffs - https://phabricator.wikimedia.org/T232491 (10Reedy) [16:07:18] (03PS1) 10Elukey: aptrepo: change the amd-rocm27 component to amd-rocm271 [puppet] - 10https://gerrit.wikimedia.org/r/535646 [16:07:38] RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on icinga1001 is OK: (C)60 le (W)70 le 73.6 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [16:10:18] !log Failover m1 from db1063 to db1135 - T231403 [16:10:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:24] T231403: Switchover m1 primary master: db1063 to db1135: Tuesday 10th September at 16:00 UTC - https://phabricator.wikimedia.org/T231403 [16:10:50] jynus: all done [16:11:30] now it is time for checking, mutante akosiaris XioNoX [16:11:34] XioNoX: can you check librenms? [16:11:36] Ladies and Gentlemen, I'm happy to announce that LibreNMS is still working as expected. [16:11:45] etherpad is down, we probably have to restart it [16:11:46] I will do that [16:11:50] yeah [16:11:58] let's go over the others [16:12:09] (I will) [16:12:45] etherpad back up [16:12:47] (03CR) 10Awight: "The change looks like it does as advertised. 
However, it might be more correct to fail if the package isn't available? If ORES will ever" [puppet] - 10https://gerrit.wikimedia.org/r/535631 (https://phabricator.wikimedia.org/T232494) (owner: 10Halfak) [16:13:15] XioNoX: :** [16:13:31] jynus: racktables works for me [16:13:32] I don't remember my rt pass [16:13:42] let me check [16:13:44] I can loging there fine [16:13:51] and I can read [16:13:51] but you know, the website is up [16:14:01] puppet we don't care [16:14:05] yep [16:14:08] 10Operations, 10Puppet, 10Cloud-Services, 10Patch-For-Review, 10cloud-services-team (Kanban): Move the main WMCS puppetmaster into the Labs realm - https://phabricator.wikimedia.org/T171188 (10Andrew) [16:14:09] so everything is confirmed working [16:14:11] 10Operations, 10Puppet, 10Cloud-Services, 10Patch-For-Review, 10cloud-services-team (Kanban): cloud-vps puppet cert cleaner not working properly - https://phabricator.wikimedia.org/T232427 (10Andrew) 05Open→03Resolved [16:14:25] I will try to schedule a restore? [16:14:34] just to be sure [16:14:36] yeah, something small maybe [16:14:56] tendril restore [16:15:00] I am going to do the clean up stuff [16:15:03] puppet and all that [16:15:14] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is CRITICAL: 58.75 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [16:20:53] 10Operations, 10DBA, 10User-notice: Switchover m1 primary master: db1063 to db1135: Tuesday 10th September at 16:00 UTC - https://phabricator.wikimedia.org/T231403 (10Marostegui) This was done. Read-only starts: Tue Sep 10 16:10:39 UTC 2019 Read-only stops: Tue Sep 10 16:10:45 UTC 2019 Total read-only time:... 
[16:22:11] 10Operations, 10DBA, 10User-notice: Switchover m1 primary master: db1063 to db1135: Tuesday 10th September at 16:00 UTC - https://phabricator.wikimedia.org/T231403 (10Marostegui) 05Open→03Resolved Thanks everyone who helped out, closing this! [16:22:14] 10Operations, 10DBA: Decommission db1061-db1073 - https://phabricator.wikimedia.org/T217396 (10Marostegui) [16:24:13] !log disabling reserved space on restbase-dev1005:/dev/mapper/restbase--dev1005--vg-srv -- T224554 [16:24:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:17] T224554: Migrate Restbase-dev cluster to Stretch - https://phabricator.wikimedia.org/T224554 [16:29:25] PROBLEM - Check systemd state on analytics-tool1001 is CRITICAL: connect to address 10.64.36.110 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:29:29] PROBLEM - DPKG on analytics-tool1001 is CRITICAL: connect to address 10.64.36.110 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [16:29:29] PROBLEM - Check whether ferm is active by checking the default input chain on analytics-tool1001 is CRITICAL: connect to address 10.64.36.110 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [16:29:49] PROBLEM - dhclient process on analytics-tool1001 is CRITICAL: connect to address 10.64.36.110 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [16:33:53] PROBLEM - puppet last run on analytics-tool1001 is CRITICAL: connect to address 10.64.36.110 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:35:05] !log reboot analytics-tool1001 via ganeti gnt - not reachable via ssh [16:35:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:35] PROBLEM - Check size of conntrack table on analytics-tool1001 is CRITICAL: connect to address 10.64.36.110 port 5666: Connection 
refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [16:37:33] RECOVERY - Check systemd state on analytics-tool1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:37:37] RECOVERY - Check whether ferm is active by checking the default input chain on analytics-tool1001 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [16:37:37] RECOVERY - DPKG on analytics-tool1001 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [16:38:01] RECOVERY - dhclient process on analytics-tool1001 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [16:38:03] RECOVERY - Check size of conntrack table on analytics-tool1001 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [16:39:21] RECOVERY - puppet last run on analytics-tool1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:40:59] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Kanban, 10netops: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-presto1002.eqiad.wmnet'] ` Of which those **FAI... 
[16:41:41] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is OK: (C)60 le (W)70 le 103.7 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [16:45:09] any idea what those are ^^ [16:45:10] 10Operations, 10MassMessage, 10WMF-JobQueue, 10Core Platform Team Workboards (Clinic Duty Team): Massmessages not going through, log looks fine - https://phabricator.wikimedia.org/T232362 (10Rmaung) Awesome-- thank you all! I'm resending the failed messages and haven't had an issue yet today. [16:45:16] oh ulsfo [16:45:24] depooled, repooled it seems [16:45:47] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Kanban, 10netops: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-presto1001.eqiad.wmnet'] ` Of which those **FAI... [16:56:39] 10Operations, 10ops-codfw: Degraded RAID on db2060 - https://phabricator.wikimedia.org/T232505 (10ops-monitoring-bot) [16:59:07] 10Operations, 10SRE-Access-Requests, 10Security-Team, 10Patch-For-Review: (2019-09) Create secteam groups in admin.yaml and define permissions - https://phabricator.wikimedia.org/T223463 (10sbassett) [17:00:04] cscott, arlolra, subbu, halfak, and accraze: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Services – Graphoid / Parsoid / Citoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190910T1700). 
[17:00:12] no parsoid deploys today [17:03:48] (03PS1) 10Mforns: analytics::refinery::job::data_purge.pp: Correct checksum for drop-unsanitized-events [puppet] - 10https://gerrit.wikimedia.org/r/535650 (https://phabricator.wikimedia.org/T229436) [17:05:21] PROBLEM - Host helium is DOWN: PING CRITICAL - Packet loss = 100% [17:05:31] (03CR) 10Elukey: analytics::refinery::job::data_purge.pp: Correct checksum for drop-unsanitized-events (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/535650 (https://phabricator.wikimedia.org/T229436) (owner: 10Mforns) [17:05:43] (03CR) 10jerkins-bot: [V: 04-1] analytics::refinery::job::data_purge.pp: Correct checksum for drop-unsanitized-events [puppet] - 10https://gerrit.wikimedia.org/r/535650 (https://phabricator.wikimedia.org/T229436) (owner: 10Mforns) [17:05:46] akosiaris: I guess you are with that? ^ [17:05:55] yup [17:06:03] marostegui: unrelated though to the master change [17:06:14] that one went fine, it's everything else that is not so good [17:06:32] yep :( [17:08:01] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_upload site=ulsfo https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [17:09:35] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [17:09:46] (03PS2) 10Mforns: analytics::refinery::job::data_purge.pp: Correct checksum [puppet] - 10https://gerrit.wikimedia.org/r/535650 (https://phabricator.wikimedia.org/T229436) [17:12:51] (03PS3) 10Ottomata: Remove EventBusRCFeedEngine eventServiceName. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/534236 (https://phabricator.wikimedia.org/T229863) (owner: 10Ppchelko) [17:12:52] OK, I'm grabbing the conch. [17:13:00] (03CR) 10Jforrester: [C: 03+2] Variant configuration: Write to static (JSON) as well as serialised cache for testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533592 (https://phabricator.wikimedia.org/T223602) (owner: 10Jforrester) [17:13:28] James_F: lemme know when you are done, i have some config no-ops to scap [17:13:47] ottomata: I can do them for you if you want? [17:13:51] (03CR) 10Elukey: [C: 03+2] analytics::refinery::job::data_purge.pp: Correct checksum [puppet] - 10https://gerrit.wikimedia.org/r/535650 (https://phabricator.wikimedia.org/T229436) (owner: 10Mforns) [17:14:02] sure! [17:14:06] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/534236 [17:14:06] and [17:14:12] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/534637 [17:14:48] (03PS11) 10Andrew Bogott: labpuppetmaster1001/1002: Clean up after moving puppetmasters to the cloud [puppet] - 10https://gerrit.wikimedia.org/r/530382 (https://phabricator.wikimedia.org/T171188) [17:14:49] Cool. [17:14:51] RECOVERY - Host helium is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [17:15:04] thanks [17:15:35] ottomata: The second one doesn't work. [17:15:51] ? [17:16:07] Aha, link corrupted by my IRC client, no worries. [17:16:11] oh k [17:16:12] 10Operations, 10Wikimedia-Mailing-lists: Please create engprod@lists.wikimedia.org - https://phabricator.wikimedia.org/T232177 (10greg) Hey @jbond just noting the list wasn't created as private. I'm fixing now, but if there's a switch for whatever list creation script you use for future lists that'd be good to... [17:16:26] 10Puppet, 10ORES, 10Patch-For-Review, 10Scoring-platform-team (Current): Require git-lfs in ores::base puppet role - https://phabricator.wikimedia.org/T232494 (10akosiaris) We never installed git-lfs via any ores puppet code. 
It was always done via scap. Relevant commits are c01b8bdd0a3e82fe6aed564dd6060f... [17:16:58] (03CR) 10Andrew Bogott: [C: 03+2] labpuppetmaster1001/1002: Clean up after moving puppetmasters to the cloud [puppet] - 10https://gerrit.wikimedia.org/r/530382 (https://phabricator.wikimedia.org/T171188) (owner: 10Andrew Bogott) [17:16:59] * James_F twiddles thumbs waiting for CI. [17:17:04] James_F, you swatting config patches? [17:17:19] Yes. Have another? [17:17:29] yes, i put one up for evening swat, but could go now as well. [17:17:42] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/534889 ? [17:17:44] should be a no-op effectively for production. https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/534889 [17:17:44] yes. [17:17:51] Cool. [17:17:56] ty [17:18:45] (03CR) 10Nuria: [C: 03+1] analytics::refinery::job::data_purge.pp: Correct checksum [puppet] - 10https://gerrit.wikimedia.org/r/535650 (https://phabricator.wikimedia.org/T229436) (owner: 10Mforns) [17:22:54] Oh good grief. [17:22:57] * James_F pokes. 
[17:24:59] (03Merged) 10jenkins-bot: Variant configuration: Write to static (JSON) as well as serialised cache for testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533592 (https://phabricator.wikimedia.org/T223602) (owner: 10Jforrester) [17:26:41] (03CR) 10jenkins-bot: Variant configuration: Write to static (JSON) as well as serialised cache for testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533592 (https://phabricator.wikimedia.org/T223602) (owner: 10Jforrester) [17:29:30] !log jforrester@deploy1001 Synchronized multiversion/MWConfigCacheGenerator.php: Variant configuration: Be able to write to static (JSON) as well as serialised cache (duration: 01m 03s) [17:29:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:12] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: Variant configuration: Write to static (JSON) as well as serialised cache for testwiki T223602 (duration: 01m 02s) [17:31:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:22] T223602: Define variant Wikimedia production config in compiled, static files - https://phabricator.wikimedia.org/T223602 [17:31:53] Hmm. [17:32:03] Scap just broke on finishing. [17:32:16] "IOError: [Errno 32] Broken pipe" [17:32:34] All seems fine, though? [17:33:00] 10Operations, 10MassMessage, 10WMF-JobQueue, 10Core Platform Team Workboards (Clinic Duty Team): Massmessages not going through, log looks fine - https://phabricator.wikimedia.org/T232362 (10Pchelolo) 05Open→03Resolved a:03Pchelolo I'm resolving this ticket. Filed T232392 for a followup. [17:33:48] (03CR) 10Jforrester: [C: 03+2] Remove EventBusRCFeedEngine eventServiceName. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/534236 (https://phabricator.wikimedia.org/T229863) (owner: 10Ppchelko) [17:33:56] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: Re-sync for safety after scap errored with a broken pipe (duration: 01m 03s) [17:33:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:05] That one worked just fine. [17:34:12] * James_F shrugs. [17:35:01] (03PS3) 10Jforrester: Remove references to eventlogging-service. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534637 (https://phabricator.wikimedia.org/T232122) (owner: 10Ppchelko) [17:35:18] James_F: in my recent experience scap takes 2 tries anyway :/ [17:35:22] to really get synced everywhere [17:35:27] It really shouldn't. [17:36:02] (03CR) 10Jforrester: [C: 03+2] Remove references to eventlogging-service. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534637 (https://phabricator.wikimedia.org/T232122) (owner: 10Ppchelko) [17:36:07] (03CR) 10Jforrester: [C: 03+2] Direct Parsoid/PHP rt-testing log events to a different target [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534889 (https://phabricator.wikimedia.org/T232042) (owner: 10Subramanya Sastry) [17:37:58] 10Operations, 10Puppet, 10Cloud-Services, 10cloud-services-team (Kanban): cloud-puppetmasters: move some hiera settings from Horizon to git/gerrit - https://phabricator.wikimedia.org/T232509 (10Andrew) [17:41:38] (03PS4) 10Jforrester: Remove EventBusRCFeedEngine eventServiceName. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/534236 (https://phabricator.wikimedia.org/T229863) (owner: 10Ppchelko) [17:41:45] (03CR) 10Jforrester: [C: 03+2] "…" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534236 (https://phabricator.wikimedia.org/T229863) (owner: 10Ppchelko) [17:43:49] (03CR) 10Jforrester: Variant configuration: Read from JSON, not serialised PHP (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533593 (https://phabricator.wikimedia.org/T223602) (owner: 10Jforrester) [17:48:20] Krinkle: BTW, JSON write (but not read) is live for testwiki. [17:50:51] PROBLEM - MegaRAID on helium is CRITICAL: CRITICAL: 1 failed LD(s) (Partially Degraded) https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [17:51:59] RECOVERY - Check systemd state on an-coord1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:52:21] (03Merged) 10jenkins-bot: Remove references to eventlogging-service. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534637 (https://phabricator.wikimedia.org/T232122) (owner: 10Ppchelko) [17:52:25] (03Merged) 10jenkins-bot: Direct Parsoid/PHP rt-testing log events to a different target [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534889 (https://phabricator.wikimedia.org/T232042) (owner: 10Subramanya Sastry) [17:52:37] (03CR) 10jenkins-bot: Remove references to eventlogging-service. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/534637 (https://phabricator.wikimedia.org/T232122) (owner: 10Ppchelko) [17:54:40] (03CR) 10jenkins-bot: Direct Parsoid/PHP rt-testing log events to a different target [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534889 (https://phabricator.wikimedia.org/T232042) (owner: 10Subramanya Sastry) [17:55:09] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: T232122 Remove use of eventlogging-service (duration: 01m 03s) [17:55:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:24] T232122: Decomission eventlogging-service-eventbus and clean up related configs and code - https://phabricator.wikimedia.org/T232122 [17:56:45] !log jforrester@deploy1001 Synchronized wmf-config/ProductionServices.php: T232122 Stop setting production value for eventlogging-service (duration: 01m 00s) [17:56:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:57:09] (03CR) 10Alexandros Kosiaris: [C: 04-1] "See comment in task" [puppet] - 10https://gerrit.wikimedia.org/r/535631 (https://phabricator.wikimedia.org/T232494) (owner: 10Halfak) [17:57:11] (03PS5) 10Jforrester: Remove EventBusRCFeedEngine eventServiceName. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534236 (https://phabricator.wikimedia.org/T229863) (owner: 10Ppchelko) [17:57:27] James_F: cool, ok. We could do read for testwiki as well. But beyond that I'd rather get the testing in place. Let me know if you need to bounce ideas etc. [17:58:01] Sure. Have been thinking about what tests we could reasonably do. Do you have thoughts? 
[17:58:07] (03CR) 10Jforrester: [C: 03+2] "…" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534236 (https://phabricator.wikimedia.org/T229863) (owner: 10Ppchelko) [17:58:17] !log jforrester@deploy1001 Synchronized wmf-config/logging.php: T232042 Direct Parsoid/PHP rt-testing log events to a different target (duration: 01m 02s) [17:58:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:20] T232042: Direct Parsoid/PHP scandium logs to a different channel - https://phabricator.wikimedia.org/T232042 [17:58:30] subbu: Yours is done. [17:58:40] ottomata: One of yours done, the other is hopefully going to merge soon. [17:58:43] ty! [18:00:09] James_F: Hm.. lots of ideas but as a first pass, I suppose we could traverse all values to be scalar. Another one, implicitly, is that just by being able to compile IS at all, we know it has no dependency on classes constants etc as otherwise they'd cause errors. Right now we depend on wgConf and Defines.php, possibly more, which we'll need to fix. [18:00:30] Yeah. [18:00:39] k [18:01:29] RECOVERY - Disk space on restbase-dev1005 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=restbase-dev1005&var-datasource=eqiad+prometheus/ops [18:02:11] I'm slightly worried about being overly-prescriptive if we test anything about the content of the config. 
[18:02:35] (03PS1) 10Mforns: analytics::refinery::job::data_purge.pp: fix checksum again [puppet] - 10https://gerrit.wikimedia.org/r/535658 (https://phabricator.wikimedia.org/T229436) [18:02:55] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 46.67% of data above the critical threshold [140.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [18:06:37] (03CR) 10Elukey: [C: 03+2] analytics::refinery::job::data_purge.pp: fix checksum again [puppet] - 10https://gerrit.wikimedia.org/r/535658 (https://phabricator.wikimedia.org/T229436) (owner: 10Mforns) [18:06:57] 10Operations, 10ops-eqiad: Degraded RAID on helium - https://phabricator.wikimedia.org/T224794 (10akosiaris) >>! In T224794#5295608, @Volans wrote: > The automatic gathering times out because megacli takes ~3 minutes to return the status of the disks, it blocks at PD7 (the one broken) and takes very long time... [18:10:46] !log test add static route on bast3002 to force advmss [18:10:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:55] (03Merged) 10jenkins-bot: Remove EventBusRCFeedEngine eventServiceName. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534236 (https://phabricator.wikimedia.org/T229863) (owner: 10Ppchelko) [18:13:10] (03CR) 10jenkins-bot: Remove EventBusRCFeedEngine eventServiceName. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/534236 (https://phabricator.wikimedia.org/T229863) (owner: 10Ppchelko) [18:15:16] !log rollback test add static route on bast3002 to force advmss [18:15:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:52] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: T229863 Remove EventBusRCFeedEngine eventServiceName (duration: 01m 05s) [18:15:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:55] T229863: Refactor EventBus mediawiki configuration - https://phabricator.wikimedia.org/T229863 [18:16:53] ottomata: All done. Finally. [18:16:57] 10Operations, 10ops-codfw: Degraded RAID on db2060 - https://phabricator.wikimedia.org/T232505 (10wiki_willy) a:03Cmjohnson [18:17:31] James_F: thank you! [18:18:08] (03PS3) 10Jforrester: Variant configuration: Never write to serialised PHP, drop support [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533594 (https://phabricator.wikimedia.org/T223602) [18:18:38] (03CR) 10jerkins-bot: [V: 04-1] Variant configuration: Never write to serialised PHP, drop support [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533594 (https://phabricator.wikimedia.org/T223602) (owner: 10Jforrester) [18:21:06] 10Operations, 10ops-codfw: Degraded RAID on db2060 - https://phabricator.wikimedia.org/T232505 (10wiki_willy) a:05Cmjohnson→03Papaul [18:21:31] 10Operations, 10Cassandra, 10RESTBase, 10Core Platform Team Workboards (Clinic Duty Team), 10User-Eevans: Migrate Restbase-dev cluster to Stretch - https://phabricator.wikimedia.org/T224554 (10Eevans) restbase-dev1005 has been decommissioned and is ready to be reimaged. `lang=shell-session $ ssh restbas... [18:22:36] 10Operations, 10ops-codfw: Degraded RAID on db2060 - https://phabricator.wikimedia.org/T232505 (10wiki_willy) Looks like the warranty expired on Jan. 14, 2018. 
@Papaul - let me know if you have any spares lying around or if we need to purchase a new disk. Thanks, Willy [18:22:54] (03PS1) 10Ottomata: Add $ensure params with defaults for eventlogging service - no op [puppet] - 10https://gerrit.wikimedia.org/r/535664 (https://phabricator.wikimedia.org/T232122) [18:24:48] (03CR) 10jerkins-bot: [V: 04-1] Add $ensure params with defaults for eventlogging service - no op [puppet] - 10https://gerrit.wikimedia.org/r/535664 (https://phabricator.wikimedia.org/T232122) (owner: 10Ottomata) [18:27:06] (03PS2) 10Jforrester: Variant configuration: Write JSON config for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535257 [18:27:08] (03PS5) 10Jforrester: Variant configuration: Read from JSON, not serialised PHP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533593 (https://phabricator.wikimedia.org/T223602) [18:27:10] (03PS4) 10Jforrester: Variant configuration: Never write to serialised PHP, drop support [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533594 (https://phabricator.wikimedia.org/T223602) [18:27:12] (03PS1) 10Jforrester: Turn InitialiseSettings into a static array return for testability [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535665 [18:29:03] (03PS2) 10Ottomata: Add $ensure params with defaults for eventlogging service - no op [puppet] - 10https://gerrit.wikimedia.org/r/535664 (https://phabricator.wikimedia.org/T232122) [18:29:50] 10Operations, 10ops-eqiad, 10Discovery-Search (Current work): Degraded RAID on elastic1046 - https://phabricator.wikimedia.org/T228606 (10wiki_willy) @Cmjohnson - could be the drive is seated securely or possibly a loose cable /connection [18:30:59] (03CR) 10jerkins-bot: [V: 04-1] Add $ensure params with defaults for eventlogging service - no op [puppet] - 10https://gerrit.wikimedia.org/r/535664 (https://phabricator.wikimedia.org/T232122) (owner: 10Ottomata) [18:31:47] (03PS3) 10Ottomata: Add $ensure params with defaults for eventlogging service - no op 
[puppet] - 10https://gerrit.wikimedia.org/r/535664 (https://phabricator.wikimedia.org/T232122) [18:33:32] 10Operations, 10ops-eqiad, 10cloud-services-team: (OoW) cloudvirt1006 - RAID battery failed - https://phabricator.wikimedia.org/T222950 (10wiki_willy) @Cmjohnson - just following up to see if we have the correct part [18:34:43] (03PS4) 10Herron: prometheus: aggregate ipsec_status and add alert [puppet] - 10https://gerrit.wikimedia.org/r/533563 (https://phabricator.wikimedia.org/T230236) [18:37:07] (03CR) 10jerkins-bot: [V: 04-1] Variant configuration: Write JSON config for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535257 (owner: 10Jforrester) [18:37:31] (03CR) 10jerkins-bot: [V: 04-1] Variant configuration: Read from JSON, not serialised PHP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533593 (https://phabricator.wikimedia.org/T223602) (owner: 10Jforrester) [18:37:50] (03CR) 10jerkins-bot: [V: 04-1] Turn InitialiseSettings into a static array return for testability [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535665 (owner: 10Jforrester) [18:37:57] (03CR) 10jerkins-bot: [V: 04-1] Variant configuration: Never write to serialised PHP, drop support [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533594 (https://phabricator.wikimedia.org/T223602) (owner: 10Jforrester) [18:38:15] (03PS4) 10Ottomata: Add $ensure params with defaults for eventlogging service - no op [puppet] - 10https://gerrit.wikimedia.org/r/535664 (https://phabricator.wikimedia.org/T232122) [18:39:40] (03CR) 10Ottomata: "https://puppet-compiler.wmflabs.org/compiler1001/18232/kafka-main1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/535664 (https://phabricator.wikimedia.org/T232122) (owner: 10Ottomata) [18:40:18] (03CR) 10Ottomata: [C: 03+2] Add $ensure params with defaults for eventlogging service - no op [puppet] - 10https://gerrit.wikimedia.org/r/535664 (https://phabricator.wikimedia.org/T232122) (owner: 10Ottomata) [18:42:54] !log 
rolling out "Aggregate IPsec Tunnel Status” icinga check, please disregard for the time being if it alerts [18:42:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:44] (03PS1) 10Ottomata: Remove LVS, discovery, and secondary monitoring of eventbus service [puppet] - 10https://gerrit.wikimedia.org/r/535669 (https://phabricator.wikimedia.org/T232122) [18:49:59] (03PS5) 10Herron: prometheus: aggregate ipsec_status and add alert [puppet] - 10https://gerrit.wikimedia.org/r/533563 (https://phabricator.wikimedia.org/T230236) [18:52:20] (03CR) 10Herron: [C: 03+2] prometheus: aggregate ipsec_status and add alert [puppet] - 10https://gerrit.wikimedia.org/r/533563 (https://phabricator.wikimedia.org/T230236) (owner: 10Herron) [18:53:00] (03CR) 10Ppchelko: Remove LVS, discovery, and secondary monitoring of eventbus service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/535669 (https://phabricator.wikimedia.org/T232122) (owner: 10Ottomata) [18:55:47] 10Operations, 10ops-codfw: Degraded RAID on db2060 - https://phabricator.wikimedia.org/T232505 (10Marostegui) 05Open→03Declined There is no need to replace this disk. 
This host is pending DC-Ops steps for decommissioning T231625 [18:56:19] (03PS2) 10Jforrester: Turn InitialiseSettings into a static array return for testability [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535665 [18:56:21] (03PS3) 10Jforrester: Variant configuration: Write JSON config for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535257 [18:56:23] (03PS6) 10Jforrester: Variant configuration: Read from JSON, not serialised PHP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533593 (https://phabricator.wikimedia.org/T223602) [18:56:25] (03PS5) 10Jforrester: Variant configuration: Never write to serialised PHP, drop support [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533594 (https://phabricator.wikimedia.org/T223602) [18:57:34] (03PS1) 10Andrew Bogott: cloud cumin: don't use a bastion if cumin is already running in the cloud [puppet] - 10https://gerrit.wikimedia.org/r/535670 (https://phabricator.wikimedia.org/T232429) [18:57:45] (03CR) 10jerkins-bot: [V: 04-1] Turn InitialiseSettings into a static array return for testability [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535665 (owner: 10Jforrester) [18:57:51] (03CR) 10jerkins-bot: [V: 04-1] Variant configuration: Write JSON config for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535257 (owner: 10Jforrester) [18:58:21] 10Operations, 10ops-codfw: Degraded RAID on db2060 - https://phabricator.wikimedia.org/T232505 (10Papaul) 05Declined→03Open Resolving this . 
Host will be decom in T231625 [18:58:22] (03CR) 10jerkins-bot: [V: 04-1] Variant configuration: Read from JSON, not serialised PHP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533593 (https://phabricator.wikimedia.org/T223602) (owner: 10Jforrester) [18:58:26] 10Operations, 10ops-codfw: Degraded RAID on db2060 - https://phabricator.wikimedia.org/T232505 (10wiki_willy) Thanks @Marostegui [18:58:32] (03CR) 10jerkins-bot: [V: 04-1] Variant configuration: Never write to serialised PHP, drop support [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533594 (https://phabricator.wikimedia.org/T223602) (owner: 10Jforrester) [18:58:51] (03CR) 10Alex Monk: [C: 03+1] cloud cumin: don't use a bastion if cumin is already running in the cloud [puppet] - 10https://gerrit.wikimedia.org/r/535670 (https://phabricator.wikimedia.org/T232429) (owner: 10Andrew Bogott) [18:59:23] (03CR) 10Alex Monk: [C: 03+1] "though come the thought of it I'm not sure about the extra indentation - maybe just make it an AND with the current condition?" [puppet] - 10https://gerrit.wikimedia.org/r/535670 (https://phabricator.wikimedia.org/T232429) (owner: 10Andrew Bogott) [19:00:12] (03CR) 10Alex Monk: cloud cumin: don't use a bastion if cumin is already running in the cloud [puppet] - 10https://gerrit.wikimedia.org/r/535670 (https://phabricator.wikimedia.org/T232429) (owner: 10Andrew Bogott) [19:01:34] (03PS2) 10Ottomata: Remove LVS, discovery, and secondary monitoring of eventbus service [puppet] - 10https://gerrit.wikimedia.org/r/535669 (https://phabricator.wikimedia.org/T232122) [19:01:49] 10Puppet, 10ORES, 10Patch-For-Review, 10Scoring-platform-team (Current): Require git-lfs in ores::base puppet role - https://phabricator.wikimedia.org/T232494 (10Halfak) In WMFLabs, we use fabric to do deployments. Hence why I ran into this issue. 
[19:03:04] (03PS2) 10Andrew Bogott: cloud cumin: don't use a bastion if cumin is already running in the cloud [puppet] - 10https://gerrit.wikimedia.org/r/535670 (https://phabricator.wikimedia.org/T232429) [19:04:42] (03CR) 10Ppchelko: [C: 03+1] "The parts of this I understand seem reasonable." [puppet] - 10https://gerrit.wikimedia.org/r/535669 (https://phabricator.wikimedia.org/T232122) (owner: 10Ottomata) [19:12:03] 10Operations, 10ops-codfw: Degraded RAID on db2060 - https://phabricator.wikimedia.org/T232505 (10Marostegui) 05Open→03Declined >>! In T232505#5480473, @Papaul wrote: > Resolving this . Host will be decom in T231625 You just reopened! :P [19:18:36] (03PS1) 10Andrew Bogott: cloud cumin: add cloud-cumin-01.cloudinfra as a cumin master [puppet] - 10https://gerrit.wikimedia.org/r/535677 (https://phabricator.wikimedia.org/T232429) [19:23:58] (03CR) 10Alex Monk: [C: 03+1] cloud cumin: add cloud-cumin-01.cloudinfra as a cumin master [puppet] - 10https://gerrit.wikimedia.org/r/535677 (https://phabricator.wikimedia.org/T232429) (owner: 10Andrew Bogott) [19:27:55] (03CR) 10Andrew Bogott: [C: 03+2] cloud cumin: don't use a bastion if cumin is already running in the cloud [puppet] - 10https://gerrit.wikimedia.org/r/535670 (https://phabricator.wikimedia.org/T232429) (owner: 10Andrew Bogott) [19:28:11] (03CR) 10Andrew Bogott: [C: 03+2] cloud cumin: add cloud-cumin-01.cloudinfra as a cumin master [puppet] - 10https://gerrit.wikimedia.org/r/535677 (https://phabricator.wikimedia.org/T232429) (owner: 10Andrew Bogott) [19:30:17] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [19:31:51] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [19:35:34] (03PS1) 10Nuria: Add config for wmf_netflow to Turnilo [puppet] - 10https://gerrit.wikimedia.org/r/535681 (https://phabricator.wikimedia.org/T232226) [19:47:01] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS6939/IPv6: Active, AS6939/IPv4: Connect https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:52:07] PROBLEM - mobileapps endpoints health on scb2006 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) is CRITICAL: Test retrieve the most read articles for January 1, 2016 returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/random/title (retrieve a random article title) is CRITICAL: Test retrieve a random article title returned the unexpected status [19:52:07] 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [19:53:41] RECOVERY - mobileapps endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [19:58:26] 10Operations, 10ops-eqiad: Degraded RAID on helium - https://phabricator.wikimedia.org/T224794 (10wiki_willy) Talked to @akosiaris, who will open up a new task to replace the newly failed drive. We ordered a few of them last time, so hopefully we'll have more spares lying around. 
[20:01:03] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [20:13:59] (03PS3) 10Jforrester: Turn InitialiseSettings into a static array return for testability [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535665 [20:20:28] !log add MSS clamp on archiva1001 - T232456 [20:20:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:50] (03CR) 10jerkins-bot: [V: 04-1] Turn InitialiseSettings into a static array return for testability [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535665 (owner: 10Jforrester) [20:21:21] 10Operations, 10Puppet, 10Cloud-Services, 10Patch-For-Review, 10cloud-services-team (Kanban): Create in-cloud, cloud-vps-wide cumin masters - https://phabricator.wikimedia.org/T232429 (10Andrew) 05Open→03Resolved There's now a cumin master on cloud-cumon-01.cloudinfra.eqiad.wmflabs that seems to work... 
[20:21:24] 10Operations, 10Puppet, 10Cloud-Services, 10cloud-services-team (Kanban): Move the main WMCS puppetmaster into the Labs realm - https://phabricator.wikimedia.org/T171188 (10Andrew) [20:24:12] !log add MSS clamp on install1002 - T2324563 [20:24:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:25] (03PS1) 10Herron: prometheus: add alert for widespread systemd failed units [puppet] - 10https://gerrit.wikimedia.org/r/535697 (https://phabricator.wikimedia.org/T230570) [20:33:34] 10Operations, 10ops-eqiad, 10Cloud-Services, 10cloud-services-team (Kanban): rack/setup/install cloudcephmon100[123] - https://phabricator.wikimedia.org/T228102 (10Jclark-ctr) a:05Jclark-ctr→03Cmjohnson [20:35:20] (03CR) 10jerkins-bot: [V: 04-1] prometheus: add alert for widespread systemd failed units [puppet] - 10https://gerrit.wikimedia.org/r/535697 (https://phabricator.wikimedia.org/T230570) (owner: 10Herron) [20:36:34] (03PS2) 10Herron: prometheus: add alert for widespread systemd failed units [puppet] - 10https://gerrit.wikimedia.org/r/535697 (https://phabricator.wikimedia.org/T230570) [20:40:51] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_upload site=ulsfo https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [20:42:27] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [20:50:05] 10Operations, 10Traffic, 10Wikimedia-General-or-Unknown: Numerous people reporting issues saving edits and viewing previews/diffs - https://phabricator.wikimedia.org/T232491 (10Ahecht) https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#Cannot_save_edit_on_pages_when_using_cellular_network_with_... [20:51:14] (03PS4) 10Jforrester: Turn InitialiseSettings into a static array return for testability [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535665 [20:51:16] (03PS1) 10Jforrester: Commit results of `composer update --no-dev` [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535703 [20:51:18] (03PS1) 10Jforrester: Commit results of `composer update` [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535704 [20:51:20] (03PS1) 10Jforrester: composer: Upgrade minus-x from 0.3.1 to 0.3.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535705 [20:51:22] (03PS1) 10Jforrester: composer: Upgrade mediawiki-codesniffer from 18.0.0 to 26.0.0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535706 [20:57:12] (03PS2) 10Jforrester: composer: Upgrade minus-x from 0.3.1 to 0.3.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535705 [20:57:14] (03PS2) 10Jforrester: composer: Upgrade mediawiki-codesniffer from 18.0.0 to 26.0.0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535706 [20:57:16] (03PS5) 10Jforrester: Turn InitialiseSettings into a static array return for testability [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535665 [20:57:18] (03PS1) 10Jforrester: composer: Upgrade php-parallel-lint from 0.9.2 to 1.0.0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535707 [20:58:09] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [20:58:10] 10Operations, 10Traffic, 10Wikimedia-General-or-Unknown, 10netops: Numerous people reporting issues saving edits and viewing previews/diffs - https://phabricator.wikimedia.org/T232491 (10Ahecht) [20:59:21] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 46.15% of data above the critical threshold [140.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [21:02:22] (03CR) 10jerkins-bot: [V: 04-1] Turn InitialiseSettings into a static array return for testability [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535665 (owner: 10Jforrester) [21:03:39] (03CR) 10jerkins-bot: [V: 04-1] Commit results of `composer update --no-dev` [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535703 (owner: 10Jforrester) [21:04:12] (03CR) 10jerkins-bot: [V: 04-1] Commit results of `composer update` [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535704 (owner: 10Jforrester) [21:04:46] (03CR) 10jerkins-bot: [V: 04-1] composer: Upgrade minus-x from 0.3.1 to 0.3.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535705 (owner: 10Jforrester) [21:05:19] (03CR) 10jerkins-bot: [V: 04-1] composer: Upgrade minus-x from 0.3.1 to 0.3.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535705 (owner: 10Jforrester) [21:05:35] (03CR) 10jerkins-bot: [V: 04-1] composer: Upgrade mediawiki-codesniffer from 18.0.0 to 26.0.0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535706 (owner: 10Jforrester) [21:06:15] PROBLEM - Check the last execution of netbox_ganeti_eqiad_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_eqiad_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [21:06:54] (03CR) 10jerkins-bot: [V: 04-1] composer: Upgrade php-parallel-lint from 0.9.2 to 1.0.0 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/535707 (owner: 10Jforrester) [21:07:10] (03CR) 10Jforrester: "instantiator is bumping phpunit to an HHVM-incompatible version. What fun." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535703 (owner: 10Jforrester) [21:07:12] (03CR) 10jerkins-bot: [V: 04-1] Turn InitialiseSettings into a static array return for testability [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535665 (owner: 10Jforrester) [21:24:49] (03CR) 10Jforrester: "@hashar, thoughts about disabling the HHVM job for this repo? I feel rather uneasy about it, but I don't see a reasonable way forward…" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535703 (owner: 10Jforrester) [21:25:39] RECOVERY - BGP status on cr4-ulsfo is OK: BGP OK - up: 85, down: 9, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:31:39] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [21:51:09] (03PS1) 10Ayounsi: Inject device hostname [software/homer] - 10https://gerrit.wikimedia.org/r/535720 [22:01:09] PROBLEM - Check systemd state on analytics-tool1001 is CRITICAL: connect to address 10.64.36.110 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:01:11] PROBLEM - Check whether ferm is active by checking the default input chain on analytics-tool1001 is CRITICAL: connect to address 10.64.36.110 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [22:01:35] PROBLEM - DPKG on analytics-tool1001 is CRITICAL: connect to address 10.64.36.110 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [22:03:03] PROBLEM - dhclient process on analytics-tool1001 is CRITICAL: connect to address 10.64.36.110 
port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [22:03:07] PROBLEM - SSH on analytics-tool1001 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [22:03:33] PROBLEM - configured eth on analytics-tool1001 is CRITICAL: connect to address 10.64.36.110 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [22:05:29] PROBLEM - puppet last run on analytics-tool1001 is CRITICAL: connect to address 10.64.36.110 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [22:12:35] PROBLEM - Check size of conntrack table on analytics-tool1001 is CRITICAL: connect to address 10.64.36.110 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [22:13:33] RECOVERY - SSH on analytics-tool1001 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [22:13:51] (03PS1) 10Ayounsi: Make the the devices.yaml config stanza optional [software/homer] - 10https://gerrit.wikimedia.org/r/535722 [22:14:14] 10Operations, 10Traffic, 10Wikimedia-General-or-Unknown, 10netops: Numerous people reporting issues saving edits and viewing previews/diffs - https://phabricator.wikimedia.org/T232491 (10Izno) I have issues saving in Chrome (Win10) on my work computer but no issue at home on Firefox (Win10). I use WTE2017.... 
[22:14:33] PROBLEM - Disk space on analytics-tool1001 is CRITICAL: connect to address 10.64.36.110 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=analytics-tool1001&var-datasource=eqiad+prometheus/ops [22:23:37] PROBLEM - Check the NTP synchronisation status of timesyncd on analytics-tool1001 is CRITICAL: connect to address 10.64.36.110 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/NTP [22:27:14] !log moving netbox -> netbox instances [22:28:29] RECOVERY - Check systemd state on analytics-tool1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:28:31] RECOVERY - Check whether ferm is active by checking the default input chain on analytics-tool1001 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [22:28:39] RECOVERY - Disk space on analytics-tool1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=analytics-tool1001&var-datasource=eqiad+prometheus/ops [22:29:01] RECOVERY - DPKG on analytics-tool1001 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [22:29:10] !log restarted nagios-nrpe-servec on analytics-tool1001 [22:29:11] RECOVERY - dhclient process on analytics-tool1001 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [22:29:43] RECOVERY - configured eth on analytics-tool1001 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [22:29:53] RECOVERY - Check size of conntrack table on analytics-tool1001 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [22:31:23] chaomodus it appears stashbot left before you logged that [22:31:37] so it does [22:32:05] 
oh well they are in people's backscroll at least :) [22:33:19] RECOVERY - puppet last run on analytics-tool1001 is OK: OK: Puppet is currently enabled, last run 59 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [22:34:07] PROBLEM - netbox HTTPS on netmon1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 547 bytes in 0.009 second response time https://wikitech.wikimedia.org/wiki/Netbox [22:34:16] 10Operations, 10Traffic, 10Wikimedia-General-or-Unknown, 10netops: Numerous people reporting issues saving edits and viewing previews/diffs - https://phabricator.wikimedia.org/T232491 (10Addshore) [22:38:51] 10Operations, 10Traffic, 10Wikimedia-General-or-Unknown, 10netops: Numerous people reporting issues saving edits and viewing previews/diffs - https://phabricator.wikimedia.org/T232491 (10ayounsi) Thanks for the reports, we have narrowed down the cause to a [[ https://en.wikipedia.org/wiki/Maximum_transmiss... [22:41:33] PROBLEM - Check systemd state on netmon2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:42:55] (03CR) 10Krinkle: Turn InitialiseSettings into a static array return for testability (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/535665 (owner: 10Jforrester) [22:45:27] (03PS2) 10CRusnov: wikimedia.org: switch netbox. alias to netbox inst [dns] - 10https://gerrit.wikimedia.org/r/535623 [22:46:03] (03CR) 10CRusnov: [C: 03+2] wikimedia.org: switch netbox. 
alias to netbox inst [dns] - 10https://gerrit.wikimedia.org/r/535623 (owner: 10CRusnov) [22:48:11] 08Warning Alert for device cr1-eqiad.wikimedia.org - Memory over 85% [22:50:51] PROBLEM - SSH wtp1031.mgmt on wtp1031.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:53:15] 10Operations, 10Puppet, 10Cloud-Services, 10Patch-For-Review, 10cloud-services-team (Kanban): Create in-cloud, cloud-vps-wide cumin masters - https://phabricator.wikimedia.org/T232429 (10Krenair) 05Resolved→03Open It seems to work within cloudinfra but I think we have a little bit of config tweaking... [22:53:20] 10Operations, 10Puppet, 10Cloud-Services, 10cloud-services-team (Kanban): Move the main WMCS puppetmaster into the Labs realm - https://phabricator.wikimedia.org/T171188 (10Krenair) [22:53:27] (03PS1) 10Alex Monk: Labs cumin masters: Only set project filter if we're a project-specific cumin master [puppet] - 10https://gerrit.wikimedia.org/r/535727 (https://phabricator.wikimedia.org/T232429) [22:54:11] RECOVERY - Check the NTP synchronisation status of timesyncd on analytics-tool1001 is OK: OK: synced at Tue 2019-09-10 22:54:10 UTC. 
https://wikitech.wikimedia.org/wiki/NTP [22:54:33] PROBLEM - traffic_server tls process restarted on cp5001 is CRITICAL: 2 ge 2 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=eqsin+prometheus/ops&var-instance=cp5001&var-layer=tls [22:55:16] (03CR) 10jerkins-bot: [V: 04-1] Labs cumin masters: Only set project filter if we're a project-specific cumin master [puppet] - 10https://gerrit.wikimedia.org/r/535727 (https://phabricator.wikimedia.org/T232429) (owner: 10Alex Monk) [22:58:54] (03PS2) 10Alex Monk: Labs cumin masters: Only set openstack project filter if we're project-specific [puppet] - 10https://gerrit.wikimedia.org/r/535727 (https://phabricator.wikimedia.org/T232429) [22:58:56] (03PS1) 10Alex Monk: Labs cumin masters: Remove config associated with proxying via bastion [puppet] - 10https://gerrit.wikimedia.org/r/535733 (https://phabricator.wikimedia.org/T232429) [22:59:59] 10Operations, 10Puppet, 10Cloud-Services, 10Patch-For-Review, 10cloud-services-team (Kanban): Create in-cloud, cloud-vps-wide cumin masters - https://phabricator.wikimedia.org/T232429 (10Krenair) https://gerrit.wikimedia.org/r/535727 should make it behave like the existing cumin master, https://gerrit.wi... [23:00:04] MaxSem, RoanKattouw, Niharika, and Urbanecm: That opportune time is upon us again. Time for a Evening SWAT (Max 6 patches) deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190910T2300). [23:00:04] subbu: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:26] ah .. i should take that off the calendar since it has already been deployed. 
:) [23:04:12] 10Operations, 10Puppet, 10Cloud-Services, 10cloud-services-team (Kanban): Move the main WMCS puppetmaster into the Labs realm - https://phabricator.wikimedia.org/T171188 (10Krenair) [23:14:10] 10Operations, 10Puppet, 10Cloud-Services, 10Patch-For-Review, 10cloud-services-team (Kanban): Create in-cloud, cloud-vps-wide cumin masters - https://phabricator.wikimedia.org/T232429 (10Krenair) (I tried applying the first of those manually on the new instance, ran `cumin '*' id` and saw `814 hosts will... [23:17:30] 10Operations, 10Traffic, 10Wikimedia-Logstash, 10observability, 10User-Addshore: Changing Kibana filters is ridiculously slow - https://phabricator.wikimedia.org/T189333 (10Krinkle) I re-ran my analysis today, and oddly enough the total number of fields it not only similar but equal to the number of fiel... [23:38:33] RECOVERY - Check systemd state on netmon2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:51:21] RECOVERY - SSH wtp1031.mgmt on wtp1031.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:55:34] (03CR) 10Andrew Bogott: [C: 03+2] Labs cumin masters: Only set openstack project filter if we're project-specific [puppet] - 10https://gerrit.wikimedia.org/r/535727 (https://phabricator.wikimedia.org/T232429) (owner: 10Alex Monk) [23:56:24] (03CR) 10Andrew Bogott: "Going to sit on this one for a bit, since currently some VMs can be reached by the old cumin but not by the new one. Once we have most VM" [puppet] - 10https://gerrit.wikimedia.org/r/535733 (https://phabricator.wikimedia.org/T232429) (owner: 10Alex Monk)