[00:00:15] RECOVERY - Check systemd state on netflow2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:00:15] RECOVERY - Check systemd state on netflow3001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:00:31] RECOVERY - Check systemd state on netflow1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:00:41] RECOVERY - Check systemd state on netflow5001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:00:49] RECOVERY - Check systemd state on netflow4001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:12:47] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_eventgate_main_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:14:39] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:38:47] (PS1) Andrew Bogott: Added fake passwords for proxy_domain_passwords [labs/private] - https://gerrit.wikimedia.org/r/609609 (https://phabricator.wikimedia.org/T256276)
[00:40:47] (PS1) Andrew Bogott: Horizon: Include PROXY_DOMAIN_DICT [puppet] - https://gerrit.wikimedia.org/r/609610 (https://phabricator.wikimedia.org/T256276)
[00:47:41] PROBLEM - Postgres Replication Lag on maps2001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 71483864 and 2 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:49:33] RECOVERY - Postgres Replication Lag on maps2001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 913960 and 85 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:55:12] (PS2) Andrew Bogott: Horizon: Include PROXY_DOMAIN_DICT [puppet] - https://gerrit.wikimedia.org/r/609610 (https://phabricator.wikimedia.org/T256276)
[00:59:59] (PS2) Andrew Bogott: Added fake passwords for proxy_zone_passwords [labs/private] - https://gerrit.wikimedia.org/r/609609 (https://phabricator.wikimedia.org/T256276)
[01:00:20] (CR) Andrew Bogott: [V: +2 C: +2] Added fake passwords for proxy_zone_passwords [labs/private] - https://gerrit.wikimedia.org/r/609609 (https://phabricator.wikimedia.org/T256276) (owner: Andrew Bogott)
[01:23:39] (PS1) Andrew Bogott: Correct profile::openstack::codfw1dev::horizon::proxy_zone_passwords key name [labs/private] - https://gerrit.wikimedia.org/r/609612
[01:24:28] (CR) Andrew Bogott: [V: +2 C: +2] Correct profile::openstack::codfw1dev::horizon::proxy_zone_passwords key name [labs/private] - https://gerrit.wikimedia.org/r/609612 (owner: Andrew Bogott)
[01:26:35] (PS3) Andrew Bogott: Horizon: Include PROXY_DOMAIN_DICT [puppet] - https://gerrit.wikimedia.org/r/609610
(https://phabricator.wikimedia.org/T256276)
[01:29:52] (PS4) Andrew Bogott: Horizon: Include PROXY_DOMAIN_DICT [puppet] - https://gerrit.wikimedia.org/r/609610 (https://phabricator.wikimedia.org/T256276)
[01:32:26] (PS5) Andrew Bogott: Horizon: Include PROXY_DOMAIN_DICT [puppet] - https://gerrit.wikimedia.org/r/609610 (https://phabricator.wikimedia.org/T256276)
[02:17:33] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 57 probes of 569 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[02:29:13] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 50 probes of 569 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[03:20:05] (CR) DannyS712: [C: -1] "Please post a request at https://meta.wikimedia.org/wiki/Talk:Interwiki_map" [mediawiki-config] - https://gerrit.wikimedia.org/r/577740 (https://phabricator.wikimedia.org/T227053) (owner: Fomafix)
[04:16:44] Operations, DBA, User-Kormat: Add alert for prometheus-mysql-exporter failing to scrape mysql - https://phabricator.wikimedia.org/T257056 (Marostegui) I guess this is the known bug and restarting the exporter fixed it or is there something else?
[04:34:57] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 269, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:42:50] (PS1) Marostegui: db1089: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/609613 (https://phabricator.wikimedia.org/T254462)
[04:43:57] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 133, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:45:03] (CR) Marostegui: [C: +2] db1089: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/609613 (https://phabricator.wikimedia.org/T254462) (owner: Marostegui)
[04:45:47] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 135, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:49:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1089', diff saved to https://phabricator.wikimedia.org/P11737 and previous config saved to /var/cache/conftool/dbconfig/20200706-044908-marostegui.json
[04:49:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:51:05] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 59 probes of 569 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[04:54:26] Operations, ops-eqiad, DBA, User-Kormat: Degraded RAID on db1077 - https://phabricator.wikimedia.org/T256939 (Marostegui) Open→Resolved I have tried to check the history for this host/service on icinga to see if it recovered for a while and then got triggered again for some reason (not th...
[04:55:54] Puppet, DBA, cloud-services-team (Kanban): labtestpuppetmaster2001 is failing to backup - https://phabricator.wikimedia.org/T256846 (Marostegui) p:Triage→Medium
[04:56:55] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 45 probes of 569 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[05:03:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1089', diff saved to https://phabricator.wikimedia.org/P11738 and previous config saved to /var/cache/conftool/dbconfig/20200706-050347-marostegui.json
[05:03:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:06:15] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 133, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:13:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1089', diff saved to https://phabricator.wikimedia.org/P11739 and previous config saved to /var/cache/conftool/dbconfig/20200706-051333-marostegui.json
[05:13:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:18:32] RECOVERY - Host re0.cr3-eqsin is UP: PING OK - Packet loss = 0%, RTA = 251.15 ms
[05:22:08] Operations, DBA, CAS-SSO, Patch-For-Review, User-jbond: Request new database for idp-test.wikimedia.org - https://phabricator.wikimedia.org/T256120 (Marostegui) Resolved→Open >>! In T256120#6273923, @jbond wrote: > Thanks @jcrespo Thanks for helping this is all set up and ready for th...
[05:22:10] Operations, CAS-SSO, Patch-For-Review, User-jbond: CAS Store U2f tokens in a database - https://phabricator.wikimedia.org/T256113 (Marostegui)
[05:26:26] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 77 probes of 569 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[05:27:21] (CR) KartikMistry: [C: +2] Update cxserver to 2020-07-01-044435-production [deployment-charts] - https://gerrit.wikimedia.org/r/609058 (https://phabricator.wikimedia.org/T254143) (owner: KartikMistry)
[05:28:26] (Merged) jenkins-bot: Update cxserver to 2020-07-01-044435-production [deployment-charts] - https://gerrit.wikimedia.org/r/609058 (https://phabricator.wikimedia.org/T254143) (owner: KartikMistry)
[05:30:04] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 135, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:31:46] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 50 probes of 569 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[05:32:00] !log kartik@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'cxserver' for release 'staging' .
[05:32:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:36:35] !log kartik@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'cxserver' for release 'production' .
[05:36:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:37:16] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 271, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:40:31] !log kartik@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'cxserver' for release 'production' .
[05:40:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:45:18] !log Updated cxserver to 2020-07-01-044435-production (T254143)
[05:45:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:45:23] T254143: Recommendation api always returns 404 when seed article is not supplied - https://phabricator.wikimedia.org/T254143
[05:47:03] (PS2) Ayounsi: Depool eqsin [dns] - https://gerrit.wikimedia.org/r/609571 (https://phabricator.wikimedia.org/T257154) (owner: Giuseppe Lavagetto)
[05:54:42] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 269, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:56:32] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 271, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:01:14] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:01:14] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[06:03:22] this seems to be the Telia transport to codfw --^
[06:06:50] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:06:50] RECOVERY - BFD status
on cr1-eqiad is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[06:07:05] ok I see maintenance scheduled, it is just not on the gcal
[06:14:10] !log elukey@cumin1001 START - Cookbook sre.hadoop.stop-cluster
[06:14:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:14:19] (test cluster)
[06:21:36] !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.stop-cluster (exit_code=0)
[06:21:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:22:33] !log elukey@cumin1001 START - Cookbook sre.hadoop.change-distro
[06:22:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:36:50] (CR) JMeybohm: [C: -1] "Nice, thanks!" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/609403 (owner: Alexandros Kosiaris)
[06:50:11] (PS2) JMeybohm: Add helm-charts discovery record [dns] - https://gerrit.wikimedia.org/r/609165 (https://phabricator.wikimedia.org/T253843)
[06:52:47] (PS3) JMeybohm: Add helm-charts discovery record [dns] - https://gerrit.wikimedia.org/r/609165 (https://phabricator.wikimedia.org/T253843)
[06:53:29] (CR) JMeybohm: Add helm-charts discovery record (1 comment) [dns] - https://gerrit.wikimedia.org/r/609165 (https://phabricator.wikimedia.org/T253843) (owner: JMeybohm)
[06:53:49] (PS4) JMeybohm: Add helm-charts discovery record [dns] - https://gerrit.wikimedia.org/r/609165 (https://phabricator.wikimedia.org/T253843)
[06:54:18] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hadoop.change-distro (exit_code=99)
[06:54:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:54:36] (CR) Ayounsi: [C: +2] Depool eqsin [dns] - https://gerrit.wikimedia.org/r/609571 (https://phabricator.wikimedia.org/T257154) (owner: Giuseppe Lavagetto)
[06:54:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1089', diff saved to
https://phabricator.wikimedia.org/P11740 and previous config saved to /var/cache/conftool/dbconfig/20200706-065437-marostegui.json
[06:54:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:55:19] buuuuuu
[06:55:24] !log depool eqsin for cr3-eqsin reboot/investigation - T257154
[06:55:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:55:28] T257154: cr3-eqsin disk 1 failure - https://phabricator.wikimedia.org/T257154
[07:01:04] (CR) Giuseppe Lavagetto: [C: -1] "Sadly there are more places where we need to guard against this change. I'm asking myself if we should just go with filtering the services" [puppet] - https://gerrit.wikimedia.org/r/609403 (owner: Alexandros Kosiaris)
[07:02:26] (PS2) Muehlenhoff: Failover url downloaders for reboots [dns] - https://gerrit.wikimedia.org/r/609412
[07:05:46] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is CRITICAL: 46.91 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[07:09:04] (CR) Muehlenhoff: [V: +2 C: +2] Failover url downloaders for reboots [dns] - https://gerrit.wikimedia.org/r/609412 (owner: Muehlenhoff)
[07:11:54] !log reboot cr3-eqsin - T257154
[07:11:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:11:59] T257154: cr3-eqsin disk 1 failure - https://phabricator.wikimedia.org/T257154
[07:12:01] Operations, Analytics-Clusters, Analytics-Kanban, Patch-For-Review: Create a profile to standardize the deployment of JVM packages and configurations - https://phabricator.wikimedia.org/T253553 (elukey) The goal of this task has been reached, there are some remaining systems to be migrated (CI, c...
[07:12:05] Operations, DBA: db1097 (m1 master) crashed due to memory issues.
- https://phabricator.wikimedia.org/T256717 (Marostegui) @akosiaris @jcrespo let's replace this master on Wednesday at 08:00 AM UTC?
[07:12:11] Operations, Analytics-Clusters, Analytics-Kanban, Patch-For-Review: Create a profile to standardize the deployment of JVM packages and configurations - https://phabricator.wikimedia.org/T253553 (elukey)
[07:12:56] Operations, DBA: db1097 (m1 master) crashed due to memory issues. - https://phabricator.wikimedia.org/T256717 (akosiaris) Fine by me.
[07:13:08] Operations, DBA, User-Kormat: Add alert for prometheus-mysql-exporter failing to scrape mysql - https://phabricator.wikimedia.org/T257056 (Kormat) Yep, that's it. It's not so much the bug that bothers me as us not being aware of it in some cases for 2 months.
[07:14:11] Operations, DBA, User-Kormat: Add alert for prometheus-mysql-exporter failing to scrape mysql - https://phabricator.wikimedia.org/T257056 (Marostegui) Completely agree - I was just asking in case it was something else what made it fail
[07:16:14] PROBLEM - Host re0.cr3-eqsin is DOWN: PING CRITICAL - Packet loss = 100%
[07:21:43] (CR) Marostegui: [C: +1] piwik: add binlog to database config.
[puppet] - https://gerrit.wikimedia.org/r/609421 (https://phabricator.wikimedia.org/T234826) (owner: Elukey)
[07:22:08] RECOVERY - Host re0.cr3-eqsin is UP: PING OK - Packet loss = 0%, RTA = 251.09 ms
[07:23:18] (CR) Marostegui: [C: +2] Add echo_push_subscription to $private_tables [puppet] - https://gerrit.wikimedia.org/r/609228 (https://phabricator.wikimedia.org/T246716) (owner: Mholloway)
[07:28:43] RECOVERY - Host cr3-eqsin is UP: PING OK - Packet loss = 0%, RTA = 233.49 ms
[07:28:46] PROBLEM - Persistent high iowait on labstore1006 is CRITICAL: 19.83 ge 10 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/dashboard/db/labs-monitoring
[07:29:14] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:29:26] RECOVERY - OSPF status on mr1-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:29:30] RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 80, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:32:15] ^ I brought back cr3-eqsin on its backup disk, everything looks fine so far, will wait a bit then repool eqsin
[07:32:28] RECOVERY - Host cr3-eqsin IPv6 is UP: PING OK - Packet loss = 0%, RTA = 232.31 ms
[07:33:26] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[07:33:34] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 108 probes of 569 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[07:38:04] (CR) Elukey: [C: +2] piwik: add binlog to database config. [puppet] - https://gerrit.wikimedia.org/r/609421 (https://phabricator.wikimedia.org/T234826) (owner: Elukey)
[07:39:24] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 48 probes of 569 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[07:40:25] Operations, netops: cr3-eqsin disk 1 failure - https://phabricator.wikimedia.org/T257154 (ayounsi) ` root@cr3-eqsin> request vmhost snapshot recovery warning: Existing data on the target may be lost Proceed ? [yes,no] (no) yes warning: Proceeding with vmhost snapshot Current root details, De...
[07:41:40] RECOVERY - Persistent high iowait on labstore1006 is OK: (C)10 ge (W)5 ge 4.835 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/dashboard/db/labs-monitoring
[07:42:13] Operations, DBA: db1097 (m1 master) crashed due to memory issues. - https://phabricator.wikimedia.org/T256717 (jcrespo) Ok.
[07:44:27] (PS1) Ayounsi: Revert "Depool eqsin" [dns] - https://gerrit.wikimedia.org/r/609627
[07:44:44] (PS2) Ayounsi: Revert "Depool eqsin" [dns] - https://gerrit.wikimedia.org/r/609627 (https://phabricator.wikimedia.org/T257154)
[07:44:56] (PS3) Ayounsi: Revert "Depool eqsin" [dns] - https://gerrit.wikimedia.org/r/609627 (https://phabricator.wikimedia.org/T257154)
[07:46:14] (CR) Ayounsi: [C: +2] Revert "Depool eqsin" [dns] - https://gerrit.wikimedia.org/r/609627 (https://phabricator.wikimedia.org/T257154) (owner: Ayounsi)
[07:46:38] !log repool eqsin - T257154
[07:46:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:46:42] T257154: cr3-eqsin disk 1 failure - https://phabricator.wikimedia.org/T257154
[07:51:33] !log enable binlog on matomo's database on matomo1002
[07:51:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:53:57] (PS1) Majavah: Add arxiv.org to commonswiki wgCopyUploadsDomains [mediawiki-config] - https://gerrit.wikimedia.org/r/609700 (https://phabricator.wikimedia.org/T257036)
[07:55:15] Operations, DBA, User-Kormat: Port DBA dashboards to thanos - https://phabricator.wikimedia.org/T256730 (Kormat)
[07:56:00] (PS1) Elukey: sre.hadoop.change-distro.py: add a log for better user readability [cookbooks] - https://gerrit.wikimedia.org/r/609729
[07:56:55] (CR) Elukey: [C: +2] Bump AQS druid snapshot to 2020-06 [puppet] - https://gerrit.wikimedia.org/r/609566 (owner: Joal)
[07:57:23] (CR) jerkins-bot: [V: -1] sre.hadoop.change-distro.py: add a log for better user readability [cookbooks] - https://gerrit.wikimedia.org/r/609729 (owner: Elukey)
[07:57:54] PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on icinga1001 is CRITICAL: 44.76 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[07:58:28] !log Disable puppet on gerrit1002 (gerrit-test) to deploy Gerrit UI updates there to gather more feedback
[07:58:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:05:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1076', diff saved to https://phabricator.wikimedia.org/P11742 and previous config saved to /var/cache/conftool/dbconfig/20200706-080509-marostegui.json
[08:05:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:07:57] (PS3) Giuseppe Lavagetto: restbase: install the service proxy along with the tls termination [puppet] - https://gerrit.wikimedia.org/r/609152 (https://phabricator.wikimedia.org/T255133)
[08:07:59] (PS4) Giuseppe Lavagetto: wmflib::service: introduce get_url function [puppet] - https://gerrit.wikimedia.org/r/609153
[08:08:01] (PS5) Giuseppe Lavagetto: restbase: switch to using get_url [puppet] - https://gerrit.wikimedia.org/r/609154 (https://phabricator.wikimedia.org/T255133)
[08:08:03] (PS5) Giuseppe Lavagetto: restbase: use the services proxy for everything but parsoid-php [puppet] - https://gerrit.wikimedia.org/r/609155 (https://phabricator.wikimedia.org/T255133)
[08:08:27] (PS1) Kormat: realms.pp: Add sysop_itwiki to private_wikis [puppet] - https://gerrit.wikimedia.org/r/609739 (https://phabricator.wikimedia.org/T257125)
[08:09:09] !log roll restart aqs on aqs100[4-9] to pick up new druid settings
[08:09:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:13:18] (PS2) Elukey: sre.hadoop.change-distro.py: add a log for better user readability [cookbooks] - https://gerrit.wikimedia.org/r/609729
[08:16:22] (CR) Elukey: [C: +2] sre.hadoop.change-distro.py: add a log for better user readability [cookbooks] - https://gerrit.wikimedia.org/r/609729 (owner: Elukey)
[08:17:13] (CR) Alexandros Kosiaris: [C: +1] "LGTM, but it's dependent on
https://gerrit.wikimedia.org/r/c/operations/puppet/+/609403 being merged and propagated to auth dns servers fi" [dns] - https://gerrit.wikimedia.org/r/609165 (https://phabricator.wikimedia.org/T253843) (owner: JMeybohm)
[08:17:57] (CR) Marostegui: [C: +1] "https://puppet-compiler.wmflabs.org/compiler1002/23684/db1124.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/609739 (https://phabricator.wikimedia.org/T257125) (owner: Kormat)
[08:18:20] (CR) Kormat: [C: +2] realms.pp: Add sysop_itwiki to private_wikis [puppet] - https://gerrit.wikimedia.org/r/609739 (https://phabricator.wikimedia.org/T257125) (owner: Kormat)
[08:19:36] (Abandoned) Urbanecm: Add sysop_itwiki to private_wikis [puppet] - https://gerrit.wikimedia.org/r/609548 (https://phabricator.wikimedia.org/T256545) (owner: Urbanecm)
[08:24:07] !log restarting all mariadb instances on sanitarium hosts T256545
[08:24:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:24:14] T256545: Create private wiki sysop_itwiki - https://phabricator.wikimedia.org/T256545
[08:25:46] RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[08:33:03] (CR) JMeybohm: [C: +2] add chartmuseum[12]001 to dhcp netboot.cfg and site.pp [puppet] - https://gerrit.wikimedia.org/r/609449 (https://phabricator.wikimedia.org/T256970) (owner: JMeybohm)
[08:36:53] (CR) Giuseppe Lavagetto: [C: +1] "https://puppet-compiler.wmflabs.org/compiler1002/23685/restbase1025.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/609152 (https://phabricator.wikimedia.org/T255133) (owner: Giuseppe Lavagetto)
[08:38:15] (CR) Alexandros Kosiaris: [C: +1] Add cumin alias for chartmuseum hosts [puppet] - https://gerrit.wikimedia.org/r/609450 (https://phabricator.wikimedia.org/T256970) (owner: JMeybohm)
[08:38:41] (CR) JMeybohm: [C: +2] Add cumin alias for chartmuseum hosts [puppet] - https://gerrit.wikimedia.org/r/609450 (https://phabricator.wikimedia.org/T256970) (owner: JMeybohm)
[08:39:13] (CR) Ema: [C: -1] vcl: public_clouds_shutdown: ratelimit API reqs as well (1 comment) [puppet] - https://gerrit.wikimedia.org/r/609480 (owner: CDanis)
[08:44:38] !log cr3-ulsfo> request vmhost snapshot - T257153
[08:44:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:44:43] T257153: Investigate Junos vmhost snapshot - https://phabricator.wikimedia.org/T257153
[08:51:45] !log cr1-codfw> request vmhost snapshot routing-engine both - T257153
[08:51:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:51:50] T257153: Investigate Junos vmhost snapshot - https://phabricator.wikimedia.org/T257153
[08:57:22] (CR) JMeybohm: [C: -1] "LGTM overall, but I would suggest renaming "allowed_listeners" to "enabled_listeners" to make it's use more clear."
[puppet] - https://gerrit.wikimedia.org/r/609152 (https://phabricator.wikimedia.org/T255133) (owner: Giuseppe Lavagetto)
[08:58:39] Operations, netops: Investigate Junos vmhost snapshot - https://phabricator.wikimedia.org/T257153 (ayounsi) Works as expected at least on single RE devices: ` cr3-ulsfo> show vmhost snapshot UEFI Version: CBEP_P_SUM1_00.13.01 Secondary Disk, Snapshot Time: Wed Sep 19 23:22:47 UTC 2018 Version: set p...
[09:00:56] (PS3) Marostegui: db-eqiad.php: Depool cluster27 (es5) from writes. [mediawiki-config] - https://gerrit.wikimedia.org/r/606663 (https://phabricator.wikimedia.org/T255755)
[09:01:04] (PS2) Marostegui: mariadb: Promote es1024 to es5 master [puppet] - https://gerrit.wikimedia.org/r/607236 (https://phabricator.wikimedia.org/T255755)
[09:01:47] (PS3) QChris: gerrit: Add table for Zuul CI results underneath the commit message [puppet] - https://gerrit.wikimedia.org/r/609519 (https://phabricator.wikimedia.org/T256575)
[09:01:49] (PS3) QChris: gerrit: Add SonarQube results to CI table [puppet] - https://gerrit.wikimedia.org/r/609520 (https://phabricator.wikimedia.org/T256575)
[09:01:51] (PS1) QChris: gerrit: Drop bot name, check date, and check duration from CI table [puppet] - https://gerrit.wikimedia.org/r/609742 (https://phabricator.wikimedia.org/T256575)
[09:01:53] (PS1) QChris: gerrit: For Zuul, bring only 'Main test build' jobs to CI table [puppet] - https://gerrit.wikimedia.org/r/609743 (https://phabricator.wikimedia.org/T256575)
[09:01:55] (PS1) QChris: gerrit: For SonarQube, bring only overall result to CI table as 'Sonar Cloud' [puppet] - https://gerrit.wikimedia.org/r/609744 (https://phabricator.wikimedia.org/T256575)
[09:01:57] (PS1) QChris: gerrit: Add header to CI table [puppet] - https://gerrit.wikimedia.org/r/609745
[09:01:59] (PS1) QChris: gerrit: Add marker for empty CI table [puppet] - https://gerrit.wikimedia.org/r/609746
[09:03:18] (PS1) QChris:
gerrit: Remove no longer comment from CI table code [puppet] - https://gerrit.wikimedia.org/r/609747
[09:06:35] Operations, netops: Investigate Junos vmhost snapshot - https://phabricator.wikimedia.org/T257153 (ayounsi) and dual RE: ` cr1-codfw> show vmhost snapshot invoke-on all-routing-engines re0: -------------------------------------------------------------------------- UEFI Version: NGRE_v00.53.00.01 Secon...
[09:06:39] Operations, SRE-Access-Requests: Requesting access to production infrastructure services for jgiannelos - https://phabricator.wikimedia.org/T257187 (Jgiannelos)
[09:07:01] Operations, SRE-Access-Requests: Requesting access to production infrastructure services for jgiannelos - https://phabricator.wikimedia.org/T257187 (Jgiannelos) Adding my manager @dcipoletti in the loop
[09:07:22] (CR) QChris: "> The Sonar links all point to the same link, it seems confusing to duplicate that 5 or 6 times. Maybe limit it it to one entry for "Son" [puppet] - https://gerrit.wikimedia.org/r/609520 (https://phabricator.wikimedia.org/T256575) (owner: QChris)
[09:10:13] (CR) QChris: gerrit: Add table for Zuul CI results underneath the commit message (3 comments) [puppet] - https://gerrit.wikimedia.org/r/609519 (https://phabricator.wikimedia.org/T256575) (owner: QChris)
[09:11:36] Operations, SRE-Access-Requests: Requesting access to production infrastructure services for jgiannelos - https://phabricator.wikimedia.org/T257187 (Jgiannelos) Regarding `Requested group membership`: I am not sure which are the specific groups. @mateusbs17 do you know if there is any standard set of gr...
[09:18:21] Operations, SRE-Access-Requests: Requesting access to production infrastructure services for jgiannelos - https://phabricator.wikimedia.org/T257187 (jcrespo) @thcipriani I believe you will be the manager approving this access. Can you work with the requester (or assign one of your direct reports to work...
[09:18:43] Operations, SRE-Access-Requests: Requesting access to production infrastructure services for jgiannelos - https://phabricator.wikimedia.org/T257187 (RhinosF1) >>! In T257187#6280766, @Jgiannelos wrote: > Regarding `Requested group membership`: > > I am not sure which are the specific groups. > @mateusbs...
[09:19:24] Operations, SRE-Access-Requests: Requesting access to production infrastructure services for jgiannelos - https://phabricator.wikimedia.org/T257187 (jcrespo)
[09:22:07] Operations, SRE-Access-Requests: Requesting access to production infrastructure services for jgiannelos - https://phabricator.wikimedia.org/T257187 (Jgiannelos) @RhinosF1 I updated the username.
[09:22:57] (PS1) Awight: Update help URL [mediawiki-config] - https://gerrit.wikimedia.org/r/609748 (https://phabricator.wikimedia.org/T256623)
[09:29:48] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_eventgate_main_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:29:51] Operations, SRE-Access-Requests: Requesting access to production infrastructure services for jgiannelos - https://phabricator.wikimedia.org/T257187 (jcrespo)
[09:31:40] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:35:54] !log urbanecm@deploy1001 Synchronized private/PrivateSettings.php: Update T250887 mitigations (duration: 00m 58s) [09:35:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:24] 10Operations, 10SRE-Access-Requests: Requesting access to production infrastructure services for jgiannelos - https://phabricator.wikimedia.org/T257187 (10jcrespo) Not checking yet: > full reasoning for access As, while everything required has been provided, the groups requested are not 100% set on stone, we... [09:40:46] 10Operations, 10SRE-Access-Requests: Requesting access to RESOURCE for USER[S] - https://phabricator.wikimedia.org/T257192 (10jcrespo) [09:40:49] (03PS1) 10Jbond: jpa: drop jpa support in order to build a new build for production [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/609749 [09:41:23] 10Operations, 10SRE-Access-Requests: Requesting access to RESOURCE for USER[S] - https://phabricator.wikimedia.org/T257192 (10jcrespo) 05Open→03Invalid Created wrong task by accident. [09:41:36] (03Abandoned) 10Jbond: jpa: drop jpa support in order to build a new build for production [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/609749 (owner: 10Jbond) [09:42:01] 10Operations, 10SRE-Access-Requests: Requesting access to production infrastructure services for jgiannelos - https://phabricator.wikimedia.org/T257187 (10jcrespo) [09:43:35] 10Operations, 10SRE-Access-Requests: Requesting access to production infrastructure services for jgiannelos - https://phabricator.wikimedia.org/T257187 (10jcrespo) [09:45:16] 10Operations, 10SRE-Access-Requests: Requesting access to production infrastructure services for jgiannelos - https://phabricator.wikimedia.org/T257187 (10Jgiannelos) Thanks @jcrespo for helping with the ticket. 
I will also consult my manager and @MSantos who helped me with onboarding about the specific groups. [09:45:48] yes [09:45:52] (03PS1) 10Jbond: jpa: drop jpa support in order to build a new build for production [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/609751 [09:48:03] (03PS3) 10Jbond: apereo_cas: login page redirect frames [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/609399 (https://phabricator.wikimedia.org/T251513) [09:48:12] (03PS2) 10Jbond: jpa: drop jpa support in order to build a new build for production [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/609751 [09:49:01] (03PS4) 10Jbond: apereo_cas: login page redirect frames [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/609399 (https://phabricator.wikimedia.org/T251513) [09:49:11] (03PS3) 10Jbond: jpa: drop jpa support in order to build a new build for production [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/609751 [09:49:13] (03PS1) 10Jcrespo: [WIP] Add Jgiannelos production access [puppet] - 10https://gerrit.wikimedia.org/r/609752 (https://phabricator.wikimedia.org/T257187) [09:49:46] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Add Jgiannelos production access [puppet] - 10https://gerrit.wikimedia.org/r/609752 (https://phabricator.wikimedia.org/T257187) (owner: 10Jcrespo) [09:50:15] (03PS4) 10Jbond: jpa: drop jpa support in order to build a new build for production [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/609751 [09:50:24] (03CR) 10Jcrespo: "Please double check the ssh key correctness before deployment." 
[puppet] - 10https://gerrit.wikimedia.org/r/609752 (https://phabricator.wikimedia.org/T257187) (owner: 10Jcrespo) [09:50:27] (03CR) 10Jbond: [V: 03+2 C: 03+2] apereo_cas: login page redirect frames [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/609399 (https://phabricator.wikimedia.org/T251513) (owner: 10Jbond) [09:52:04] (03PS2) 10Jcrespo: [WIP] Add Jgiannelos production access [puppet] - 10https://gerrit.wikimedia.org/r/609752 (https://phabricator.wikimedia.org/T257187) [09:52:58] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to production infrastructure services for jgiannelos - https://phabricator.wikimedia.org/T257187 (10jcrespo) p:05Triage→03Medium [09:53:07] 10Operations, 10OTRS, 10User-notice: Update OTRS to the latest stable version (6.0.x) - https://phabricator.wikimedia.org/T187984 (10akosiaris) [09:55:45] 10Operations, 10SRE-Access-Requests: Requesting access to restricted production access and analytics-privatedata-users for Nahid Sultan - https://phabricator.wikimedia.org/T256971 (10jcrespo) p:05Triage→03High [09:57:45] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to production servers for Ahmon Dancy - RelEng - https://phabricator.wikimedia.org/T256770 (10jcrespo) p:05Triage→03High [09:59:24] (03PS1) 10Jbond: idp: failover to codfw for upgrade [dns] - 10https://gerrit.wikimedia.org/r/609756 [09:59:41] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users and nda groups for edtadros - https://phabricator.wikimedia.org/T256435 (10jcrespo) p:05Triage→03Medium [10:00:22] (03CR) 10Jbond: [C: 03+2] idp: failover to codfw for upgrade [dns] - 10https://gerrit.wikimedia.org/r/609756 (owner: 10Jbond) [10:01:26] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10Patch-For-Review, 10Scoring-platform-team (Current): Production shell access for Chris Albon - https://phabricator.wikimedia.org/T256412 (10jcrespo) 
p:05Triage→03High [10:07:04] (03CR) 10Hashar: [C: 03+1] gerrit: Drop bot name, check date, and check duration from CI table [puppet] - 10https://gerrit.wikimedia.org/r/609742 (https://phabricator.wikimedia.org/T256575) (owner: 10QChris) [10:07:44] (03CR) 10Hashar: [C: 03+1] gerrit: Add header to CI table [puppet] - 10https://gerrit.wikimedia.org/r/609745 (owner: 10QChris) [10:09:32] (03CR) 10Hashar: [C: 03+1] gerrit: Add marker for empty CI table (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/609746 (owner: 10QChris) [10:10:14] (03CR) 10Jbond: [C: 03+1] "lgtm, minor nits" (032 comments) [homer/public] - 10https://gerrit.wikimedia.org/r/609426 (https://phabricator.wikimedia.org/T240658) (owner: 10Ayounsi) [10:14:36] (03CR) 10Hashar: [C: 03+1] Switch CI to profile::java (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/605886 (https://phabricator.wikimedia.org/T253553) (owner: 10Muehlenhoff) [10:15:12] (03CR) 10Jbond: [C: 03+2] puppetmaster::frontend: add hiera calls and type validation [puppet] - 10https://gerrit.wikimedia.org/r/608565 (https://phabricator.wikimedia.org/T256721) (owner: 10Jbond) [10:15:16] (03CR) 10Jbond: [C: 03+2] puppetmaster::frontend: manage ca_cert.pem [puppet] - 10https://gerrit.wikimedia.org/r/609186 (https://phabricator.wikimedia.org/T256721) (owner: 10Jbond) [10:16:18] (03PS7) 10Hashar: Switch CI to profile::java [puppet] - 10https://gerrit.wikimedia.org/r/605886 (https://phabricator.wikimedia.org/T253553) (owner: 10Muehlenhoff) [10:20:34] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=idp site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:22:15] 10Operations, 10ORES, 10Patch-For-Review, 10Scoring-platform-team (Current): ORES uwsgi consumes a large amount of memory and CPU when shutting down (as part of a restart) - https://phabricator.wikimedia.org/T242705 (10akosiaris) 
https://grafana.wikimedia.org/d/000000607/cluster-overview?panelId=87&edit&fu... [10:23:40] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:25:47] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/605886 (https://phabricator.wikimedia.org/T253553) (owner: 10Muehlenhoff) [10:28:42] !log rebooting idp1001 for kernel update [10:28:44] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [10:28:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:58] 10Operations, 10vm-requests, 10Patch-For-Review: Site: eqiad/codwf each 1 VM for helm-charts.wikimedia.org (chartmuseum) - https://phabricator.wikimedia.org/T256970 (10JMeybohm) 05Open→03Resolved VMs created, installed and ran puppet insetup role successfully. Both came up fine after reboot. [10:29:47] 10Operations, 10DBA, 10CAS-SSO, 10Patch-For-Review, 10User-jbond: Request new database for idp-test.wikimedia.org - https://phabricator.wikimedia.org/T256120 (10jbond) @Marostegui sorry for the confusion in the initial request. To clarify we would like this database to be available at all times but wit... [10:30:04] jan_drewniak: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200706T1030). 
[10:31:09] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [10:31:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:56] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=idp site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:32:39] (03PS1) 10JMeybohm: Switch role for chartmuseum hosts to chartmuseum [puppet] - 10https://gerrit.wikimedia.org/r/609760 (https://phabricator.wikimedia.org/T253843) [10:35:32] (03PS5) 10Jbond: jpa: Add JPA support for u2f tokens [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/609751 [10:36:59] 10Operations, 10ops-codfw, 10observability: ps1-c3-codfw icinga checks UNKNOWN - https://phabricator.wikimedia.org/T256953 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi This is complete, was indeed related to PDU upgrades [10:37:08] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] Horizon: Include PROXY_DOMAIN_DICT [puppet] - 10https://gerrit.wikimedia.org/r/609610 (https://phabricator.wikimedia.org/T256276) (owner: 10Andrew Bogott) [10:40:44] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:40:52] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "> Patch Set 3: Code-Review-1" [puppet] - 10https://gerrit.wikimedia.org/r/609152 (https://phabricator.wikimedia.org/T255133) (owner: 10Giuseppe Lavagetto) [10:41:08] (03PS4) 10Giuseppe Lavagetto: restbase: install the service proxy along with the tls termination [puppet] - 10https://gerrit.wikimedia.org/r/609152 (https://phabricator.wikimedia.org/T255133) [10:41:10] (03PS5) 10Giuseppe Lavagetto: wmflib::service: introduce get_url function [puppet] - 10https://gerrit.wikimedia.org/r/609153 [10:41:12] (03PS6) 10Giuseppe Lavagetto: restbase: switch to using get_url [puppet] - 10https://gerrit.wikimedia.org/r/609154 (https://phabricator.wikimedia.org/T255133) [10:41:14] (03PS6) 10Giuseppe Lavagetto: restbase: use the services proxy for everything but parsoid-php [puppet] - 10https://gerrit.wikimedia.org/r/609155 (https://phabricator.wikimedia.org/T255133) [10:45:19] (03CR) 10JMeybohm: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/609152 (https://phabricator.wikimedia.org/T255133) (owner: 10Giuseppe Lavagetto) [10:46:24] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609762 (https://phabricator.wikimedia.org/T128546) [10:47:25] (03CR) 10Jdrewniak: [C: 03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609762 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:48:08] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609762 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:51:05] !log jdrewniak@deploy1001 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:609762| Bumping portals to master (609762)]] (duration: 00m 56s) [10:51:08] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:03] !log jdrewniak@deploy1001 Synchronized portals: Wikimedia Portals Update: [[gerrit:609762| Bumping portals to master (609762)]] (duration: 00m 57s) [10:52:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:52] (03CR) 10Hashar: "Puppet compiler result: https://puppet-compiler.wmflabs.org/compiler1002/466/" [puppet] - 10https://gerrit.wikimedia.org/r/605886 (https://phabricator.wikimedia.org/T253553) (owner: 10Muehlenhoff) [10:53:05] (03CR) 10Giuseppe Lavagetto: [C: 03+2] restbase: install the service proxy along with the tls termination [puppet] - 10https://gerrit.wikimedia.org/r/609152 (https://phabricator.wikimedia.org/T255133) (owner: 10Giuseppe Lavagetto) [10:55:12] (03PS5) 10Giuseppe Lavagetto: restbase: install the service proxy along with the tls termination [puppet] - 10https://gerrit.wikimedia.org/r/609152 (https://phabricator.wikimedia.org/T255133) [10:56:23] (03PS1) 10QChris: gerrit: Add Code Review logo as favicon [puppet] - 10https://gerrit.wikimedia.org/r/609764 [10:57:47] kormat: I might look at https://phabricator.wikimedia.org/T256545#6281085 after lunch so can you confirm what shard it's on? s3? [10:59:31] 10Operations, 10DBA, 10CAS-SSO, 10Patch-For-Review, 10User-jbond: Request new database for idp-test.wikimedia.org - https://phabricator.wikimedia.org/T256120 (10Marostegui) Let's move this to a "more stable" place then. This host is not guaranteed to be up really, we use it for many things, it can fail,... [10:59:58] RhinosF1: yep, s3 please [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: May I have your attention please! European mid-day backport window(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200706T1100) [11:00:05] Majavah and kostajh: A patch you scheduled for European mid-day backport window(Max 6 patches) is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:09] o/ [11:00:45] I can deploy today! [11:01:10] \o [11:01:23] (03CR) 10Urbanecm: [C: 03+2] Remove "Create a book" link from sidebar on Finnish Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609472 (https://phabricator.wikimedia.org/T257073) (owner: 10Majavah) [11:02:10] (03Merged) 10jenkins-bot: Remove "Create a book" link from sidebar on Finnish Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609472 (https://phabricator.wikimedia.org/T257073) (owner: 10Majavah) [11:03:10] Majavah: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/609472 is at mwdebug1001, could you check? [11:03:31] 10Operations, 10DBA, 10CAS-SSO, 10Patch-For-Review, 10User-jbond: Request new database for idp-test.wikimedia.org - https://phabricator.wikimedia.org/T256120 (10jbond) @Marostegui that sounds good to me, no need to copy the current data, just let me know when its in place and ill update my config. 
Thanks [11:03:49] Urbanecm: working at mwdebug1001 [11:03:55] thanks, syncing [11:05:09] ty kormat [11:05:31] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 3bc1b46: Remove "Create a book" link from sidebar on Finnish Wikipedia (T257073) (duration: 00m 56s) [11:05:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:36] T257073: Remove "Create a book" link from sidebar on Finnish Wikipedia - https://phabricator.wikimedia.org/T257073 [11:05:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1076', diff saved to https://phabricator.wikimedia.org/P11744 and previous config saved to /var/cache/conftool/dbconfig/20200706-110544-marostegui.json [11:05:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:59] (03PS2) 10Urbanecm: Add arxiv.org to commonswiki wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609700 (https://phabricator.wikimedia.org/T257036) (owner: 10Majavah) [11:06:06] (03CR) 10Urbanecm: [C: 03+2] Add arxiv.org to commonswiki wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609700 (https://phabricator.wikimedia.org/T257036) (owner: 10Majavah) [11:06:19] (03CR) 10Jcrespo: [C: 04-2] [WIP] Add Jgiannelos production access [puppet] - 10https://gerrit.wikimedia.org/r/609752 (https://phabricator.wikimedia.org/T257187) (owner: 10Jcrespo) [11:06:53] (03Merged) 10jenkins-bot: Add arxiv.org to commonswiki wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609700 (https://phabricator.wikimedia.org/T257036) (owner: 10Majavah) [11:07:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1107 T254462', diff saved to https://phabricator.wikimedia.org/P11745 and previous config saved to /var/cache/conftool/dbconfig/20200706-110723-marostegui.json [11:07:27] Urbanecm: not sure if I'm able to test that, I'm not autopatrolled on commons [11:07:27] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:28] T254462: Compress enwiki InnoDB tables - https://phabricator.wikimedia.org/T254462 [11:07:42] Majavah: I'll just sync that :) [11:08:17] (03PS1) 10Marostegui: db1107: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/609765 (https://phabricator.wikimedia.org/T254462) [11:08:35] 10Operations: FY2020-2021 Q1 codfw -> eqiad switchback - https://phabricator.wikimedia.org/T243318 (10Marostegui) [11:09:02] kostajh: can you confirm the survey link should point to https://meta.wikimedia.org/wiki/Data_retention_guidelines#Exceptions_to_these_guidelines from now on? [11:09:04] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: f4b5001: Add arxiv.org to commonswiki wgCopyUploadsDomains (T257036) (duration: 00m 56s) [11:09:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:09] T257036: Add arxiv.org to the wgCopyUploadsDomains whitelist of Wikimedia Commons - https://phabricator.wikimedia.org/T257036 [11:09:30] (03CR) 10Alexandros Kosiaris: [C: 03+1] Switch role for chartmuseum hosts to chartmuseum [puppet] - 10https://gerrit.wikimedia.org/r/609760 (https://phabricator.wikimedia.org/T253843) (owner: 10JMeybohm) [11:09:45] Urbanecm: that's correct [11:09:45] !log Compress InnoDB on db1107 T254462 [11:09:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:52] (03CR) 10Marostegui: [C: 03+2] db1107: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/609765 (https://phabricator.wikimedia.org/T254462) (owner: 10Marostegui) [11:10:17] (03Abandoned) 10Awight: Enable TechWishes survey for dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607773 (https://phabricator.wikimedia.org/T253112) (owner: 10Awight) [11:10:25] thanks kostajh. 
To me, it seems kinda weird, because that page says `When we conduct a survey or other research, we will provide you with a privacy statement specifying the term of retention for information`. [11:11:07] Urbanecm: the relevant bullet point is "New editor research" [11:11:29] the idea is that we no longer require each community to translate a pretty lengthy legal document in order to get GrowthExperiments on their wiki [11:11:32] aha, thanks! Going ahead then :) [11:11:39] (03PS2) 10QChris: gerrit: Add marker for empty CI table [puppet] - 10https://gerrit.wikimedia.org/r/609746 [11:11:41] (03PS2) 10QChris: gerrit: Remove no longer comment from CI table code [puppet] - 10https://gerrit.wikimedia.org/r/609747 [11:11:45] (03PS2) 10Urbanecm: GrowthExperiments: Remove overrides to welcome survey privacy policy URL [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608297 (https://phabricator.wikimedia.org/T252572) (owner: 10Kosta Harlan) [11:11:49] (03CR) 10Urbanecm: [C: 03+2] GrowthExperiments: Remove overrides to welcome survey privacy policy URL [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608297 (https://phabricator.wikimedia.org/T252572) (owner: 10Kosta Harlan) [11:12:06] Urbanecm: T252572 has some more info [11:12:06] T252572: Scale: update survey privacy link - https://phabricator.wikimedia.org/T252572 [11:12:10] thanks [11:12:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1129', diff saved to https://phabricator.wikimedia.org/P11746 and previous config saved to /var/cache/conftool/dbconfig/20200706-111221-marostegui.json [11:12:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:57] !log Deploy schema changes on db1129 [11:13:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:20] (03Merged) 10jenkins-bot: GrowthExperiments: Remove overrides to welcome survey privacy policy URL [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608297 
(https://phabricator.wikimedia.org/T252572) (owner: 10Kosta Harlan) [11:13:44] kostajh: could you test via mwdebug1001, please? [11:13:54] Urbanecm: yes, looking [11:14:37] Urbanecm: lgtm! [11:14:44] kostajh: thank you, syncing [11:15:03] (the beta part should apply automatically within 30 minutes) [11:15:13] 10Puppet, 10DBA, 10cloud-services-team (Kanban): labtestpuppetmaster2001 is failing to backup - https://phabricator.wikimedia.org/T256846 (10aborrero) [11:15:22] cool, thanks [11:16:22] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 5d971dc: GrowthExperiments: Remove overrides to welcome survey privacy policy URL (T252572) (duration: 00m 56s) [11:16:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:34] kostajh: done :). [11:17:49] !log EU B&C window was done [11:17:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:13] (03PS1) 10Urbanecm: Initial configuration for sysop_itwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609768 (https://phabricator.wikimedia.org/T256545) [11:21:24] 10Operations, 10Wikimedia-Logstash, 10observability: Logstash stops processing messages if a single output becomes blocked - https://phabricator.wikimedia.org/T223483 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi Boldly resolving, it is indeed the case that a blocked logstash output exerts backpressu... [11:21:57] (03PS2) 10Ssingh: admin: add Ahmon Dancy to production shell users [puppet] - 10https://gerrit.wikimedia.org/r/609225 (https://phabricator.wikimedia.org/T256770) [11:23:34] (03CR) 10QChris: gerrit: Add marker for empty CI table (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/609746 (owner: 10QChris) [11:24:09] (03CR) 10Ssingh: "I rebased this on top of production and this is ready to merge." 
[puppet] - 10https://gerrit.wikimedia.org/r/609225 (https://phabricator.wikimedia.org/T256770) (owner: 10Ssingh) [11:24:52] (03PS2) 10Urbanecm: Initial configuration for sysop_itwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609768 (https://phabricator.wikimedia.org/T256545) [11:25:22] 10Operations, 10observability, 10Patch-For-Review, 10User-fgiunchedi: Provide authenticated access to Thanos native web interface - https://phabricator.wikimedia.org/T151009 (10fgiunchedi) [11:26:18] 10Operations, 10observability, 10Patch-For-Review, 10User-fgiunchedi: Provide authenticated access to Thanos native web interface - https://phabricator.wikimedia.org/T151009 (10fgiunchedi) Taking over this issue to provide access to Thanos instead, which provides a unified query interface. [11:31:52] 10Operations, 10observability: exim paniclog on $HOST has non-zero size - https://phabricator.wikimedia.org/T224399 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi Resolving in favor of {T257016} [11:31:55] 10Operations, 10Patch-For-Review: Tracking and Reducing cron-spam to root@ - https://phabricator.wikimedia.org/T132324 (10fgiunchedi) [11:33:39] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to production infrastructure services for jgiannelos - https://phabricator.wikimedia.org/T257187 (10dcipoletti) Approving access request for @Jgiannelos [11:36:59] 10Operations, 10Phatality, 10observability: Deploying "Phatality" plugin for Kibana invokes oom-killer on logstash::collector nodes - https://phabricator.wikimedia.org/T237706 (10fgiunchedi) AFAIK this hasn't recurred, but we might have not had Phatality deployments since then @mmodell ? 
[11:40:05] (03PS1) 10Jbond: haveged: install haveged on idp server [puppet] - 10https://gerrit.wikimedia.org/r/609771 [11:40:07] (03PS1) 10Jbond: haveged: install haveged on VM's by default [puppet] - 10https://gerrit.wikimedia.org/r/609772 [11:45:15] (03PS2) 10Jbond: haveged: install haveged on idp server [puppet] - 10https://gerrit.wikimedia.org/r/609771 [11:46:15] 10Operations, 10Wikimedia-Logstash, 10observability: Select a standard log shipping solution to use with applications that cannot be configured to send log events directly to Logstash and/or fluorine - https://phabricator.wikimedia.org/T97297 (10fgiunchedi) 05Open→03Invalid We have the logging pipeline i... [11:49:07] 10Operations, 10Discovery-Search, 10Elasticsearch, 10Icinga, and 2 others: Remove elasticsearch icinga checks from logstash collectors - https://phabricator.wikimedia.org/T218691 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi AFAICT we've been running all elasticsearch checks in all clusters and we'... [11:51:49] (03CR) 10Jbond: [C: 03+1] Netflow: send as little options templates as possible (032 comments) [homer/public] - 10https://gerrit.wikimedia.org/r/609426 (https://phabricator.wikimedia.org/T240658) (owner: 10Ayounsi) [11:53:27] (03PS1) 10Jbond: profile::base: noop style changes [puppet] - 10https://gerrit.wikimedia.org/r/609774 [12:01:24] (03CR) 10Jbond: [C: 03+2] profile::base: noop style changes [puppet] - 10https://gerrit.wikimedia.org/r/609774 (owner: 10Jbond) [12:02:45] (03PS3) 10Jbond: haveged: install haveged on idp server [puppet] - 10https://gerrit.wikimedia.org/r/609771 [12:05:46] 10Operations, 10observability, 10Sustainability (Incident Prevention): prometheus: upgrade to >= 2.12 - https://phabricator.wikimedia.org/T222113 (10fgiunchedi) [12:07:05] (03PS4) 10Jbond: haveged: install haveged on idp server [puppet] - 10https://gerrit.wikimedia.org/r/609771 [12:07:07] (03PS2) 10Jbond: haveged: install haveged on VM's by default [puppet] - 
10https://gerrit.wikimedia.org/r/609772 [12:11:32] (03CR) 10Muehlenhoff: "haveged would work around this, but fundamentally the problem is already solved on a lower level: The Buster kernel sets CONFIG_RANDOM_TRU" [puppet] - 10https://gerrit.wikimedia.org/r/609771 (owner: 10Jbond) [12:14:24] 10Operations, 10observability, 10cloud-services-team (Kanban): Deprecate Diamond collectors in Cloud VPS - https://phabricator.wikimedia.org/T210993 (10fgiunchedi) [12:25:36] (03Abandoned) 10Jbond: haveged: install haveged on idp server [puppet] - 10https://gerrit.wikimedia.org/r/609771 (owner: 10Jbond) [12:25:44] (03PS3) 10Jbond: haveged: install haveged on VM's by default [puppet] - 10https://gerrit.wikimedia.org/r/609772 [12:26:57] (03CR) 10jerkins-bot: [V: 04-1] haveged: install haveged on VM's by default [puppet] - 10https://gerrit.wikimedia.org/r/609772 (owner: 10Jbond) [12:27:06] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [12:29:00] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [12:30:11] (03PS4) 10Jbond: haveged: install haveged on VM's by default [puppet] - 10https://gerrit.wikimedia.org/r/609772 [12:31:27] (03CR) 10jerkins-bot: [V: 04-1] haveged: install haveged on VM's by default [puppet] - 10https://gerrit.wikimedia.org/r/609772 (owner: 10Jbond) [12:32:16] (03PS1) 10Privacybatm: tranasfer.py: Refactor split_target function [software/transferpy] - 10https://gerrit.wikimedia.org/r/609778 (https://phabricator.wikimedia.org/T248256) [12:37:40] (03PS2) 10Privacybatm: transfer.py: Refactor split_target function [software/transferpy] - 10https://gerrit.wikimedia.org/r/609778 (https://phabricator.wikimedia.org/T248256) [12:41:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1129', diff saved to https://phabricator.wikimedia.org/P11747 and previous config saved to /var/cache/conftool/dbconfig/20200706-124105-marostegui.json [12:41:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1074', diff saved to https://phabricator.wikimedia.org/P11748 and previous config saved to /var/cache/conftool/dbconfig/20200706-124237-marostegui.json [12:42:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:44] 10Operations, 10DBA, 10User-Kormat: Port DBA dashboards to thanos - https://phabricator.wikimedia.org/T256730 (10Kormat) [12:44:07] (03CR) 10CDanis: vcl: public_clouds_shutdown: ratelimit API reqs as well (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/609480 (owner: 10CDanis) [12:45:46] (03PS4) 10Privacybatm: Transferer.py: Calculate source checksum parallel to the data transfer [software/transferpy] - 10https://gerrit.wikimedia.org/r/608640 (https://phabricator.wikimedia.org/T254979) [12:46:17] (03CR) 10QChris: Gerrit: 
Rename ssh_host_key to ssh_host_rsa_key (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/556265 (https://phabricator.wikimedia.org/T240266) (owner: 10Paladox) [12:49:20] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [12:49:40] (03CR) 10Privacybatm: "This patch set is for rebase." [software/transferpy] - 10https://gerrit.wikimedia.org/r/608640 (https://phabricator.wikimedia.org/T254979) (owner: 10Privacybatm) [12:51:10] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [12:53:46] !log kill hanging lsof processes on an-airflow to reduce cpu load [12:53:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:04] (03PS1) 10JMeybohm: Add fake secrets for chartmuseum [labs/private] - 10https://gerrit.wikimedia.org/r/609781 (https://phabricator.wikimedia.org/T253843) [12:54:53] (03CR) 10JMeybohm: [C: 03+2] Add fake secrets for chartmuseum [labs/private] - 10https://gerrit.wikimedia.org/r/609781 (https://phabricator.wikimedia.org/T253843) (owner: 10JMeybohm) [12:54:57] (03PS5) 10Jbond: haveged: install haveged on VM'si debian < buster by default [puppet] - 10https://gerrit.wikimedia.org/r/609772 [12:55:10] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] Add fake secrets for chartmuseum [labs/private] - 10https://gerrit.wikimedia.org/r/609781 (https://phabricator.wikimedia.org/T253843) (owner: 10JMeybohm) [13:03:01] !log force umount/mount of /mnt/hdfs on an-airflow1001 to unblock dpkg checks (fuse misbehaving, all checks hanging) [13:03:04] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:08] (03CR) 10Alexandros Kosiaris: [C: 04-1] "A full PCC at https://puppet-compiler.wmflabs.org/compiler1001/23686/puppetmaster2001.codfw.wmnet/index.html identified a couple of more p" [puppet] - 10https://gerrit.wikimedia.org/r/609403 (owner: 10Alexandros Kosiaris) [13:08:54] 10Operations, 10SRE-Access-Requests: Requesting access to centralauth database for Jennifer Wang - https://phabricator.wikimedia.org/T255836 (10jcrespo) > I am a data analyst for AHT team. I need the access to query centralauth database With that short access request description, this looks to me that you don... [13:23:10] 10Operations, 10DBA, 10User-Kormat: Port DBA dashboards to thanos - https://phabricator.wikimedia.org/T256730 (10Kormat) [13:23:20] (03PS1) 10Jbond: java: update java.security to use urandom on buster [puppet] - 10https://gerrit.wikimedia.org/r/609783 [13:23:22] (03PS1) 10Jbond: profile::idp: enable java::Security [puppet] - 10https://gerrit.wikimedia.org/r/609784 [13:24:40] (03CR) 10QChris: "> Patch Set 15:" [puppet] - 10https://gerrit.wikimedia.org/r/556270 (https://phabricator.wikimedia.org/T240266) (owner: 10Paladox) [13:26:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Reduce es1024 weight in preparation for tomorrow's switchover T255755', diff saved to https://phabricator.wikimedia.org/P11750 and previous config saved to /var/cache/conftool/dbconfig/20200706-132634-marostegui.json [13:26:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:40] T255755: Switchover es5 master from es1023 to es1024 - https://phabricator.wikimedia.org/T255755 [13:29:31] (03CR) 10QChris: [C: 04-1] "Nit: in `hieradata/cloud/eqiad1/devtools/common.yaml` there is also this line" [puppet] - 10https://gerrit.wikimedia.org/r/556265 (https://phabricator.wikimedia.org/T240266) (owner: 10Paladox) [13:30:06] PROBLEM - Host ms-be1056.mgmt is DOWN: CRITICAL - Time to live exceeded 
(10.65.5.15) [13:30:17] (03PS2) 10Jbond: java: update java.security to use urandom on buster [puppet] - 10https://gerrit.wikimedia.org/r/609783 [13:30:39] !log Deploy schema change on s5 codfw master T253276 [13:30:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:44] T253276: Normalise MW Core database language fields length - https://phabricator.wikimedia.org/T253276 [13:35:26] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job={pdu,pdu_sentry4} site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:35:40] PROBLEM - Router interfaces on mr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.199, interfaces up: 35, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:36:53] (03PS2) 10Ppchelko: Disable HTCP purging everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607593 (https://phabricator.wikimedia.org/T250781) [13:38:01] the pdu reduced availability is expected [13:39:28] I would never have thought of downtiming it :) [13:40:01] (03PS2) 10JMeybohm: Switch role for chartmuseum hosts to chartmuseum [puppet] - 10https://gerrit.wikimedia.org/r/609760 (https://phabricator.wikimedia.org/T253843) [13:40:34] hehhe yeah that's fine, can be left as is [13:41:04] also because downtiming the alert for now means all other jobs get downtimed too [13:43:46] PROBLEM - Check systemd state on ms-be2025 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:45:57] (03PS1) 10JMeybohm: Fix chartmuseum hiera path (common/profile -> role/common) [labs/private] - 10https://gerrit.wikimedia.org/r/609785 (https://phabricator.wikimedia.org/T253843) [13:46:20] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] Fix chartmuseum hiera path (common/profile -> role/common) [labs/private] - 10https://gerrit.wikimedia.org/r/609785 (https://phabricator.wikimedia.org/T253843) (owner: 10JMeybohm) [13:47:39] (03CR) 10Muehlenhoff: "This will fail since java::security is a legacy class only adapted for Java 8 (it has an explicit required on the openjdk-8-jdk package, w" [puppet] - 10https://gerrit.wikimedia.org/r/609784 (owner: 10Jbond) [13:49:22] (03CR) 10QChris: [C: 04-1] Gerrit: Add ed25519 and ecdsa ssh host keys (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/556270 (https://phabricator.wikimedia.org/T240266) (owner: 10Paladox) [13:49:25] (03CR) 10JMeybohm: [C: 03+2] Switch role for chartmuseum hosts to chartmuseum [puppet] - 10https://gerrit.wikimedia.org/r/609760 (https://phabricator.wikimedia.org/T253843) (owner: 10JMeybohm) [13:51:49] PROBLEM - MariaDB Replica SQL: s7 #page on db1079 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [13:51:52] PROBLEM - MD RAID on ms-be2025 is CRITICAL: CRITICAL: State: degraded, Active: 3, Working: 3, Failed: 1, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [13:51:53] ACKNOWLEDGEMENT - MD RAID on ms-be2025 is CRITICAL: CRITICAL: State: degraded, Active: 3, Working: 3, Failed: 1, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T257214 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [13:51:56] 10Operations, 10ops-codfw: Degraded RAID on ms-be2025 - 
https://phabricator.wikimedia.org/T257214 (10ops-monitoring-bot) [13:52:13] RECOVERY - Host ms-be1056.mgmt is UP: PING WARNING - Packet loss = 33%, RTA = 1.10 ms [13:52:17] 👋 [13:52:22] PROBLEM - mysqld processes #page on db1079 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [13:52:23] checking [13:52:32] * apergos peeks in [13:52:38] * godog here too [13:52:43] PROBLEM - MariaDB read only s7 on db1079 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [13:52:45] RECOVERY - Router interfaces on mr1-eqiad is OK: OK: host 208.80.154.199, interfaces up: 37, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:52:52] PROBLEM - MariaDB Replica IO: s7 #page on db1079 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [13:53:03] <_joe_> marostegui: want someone to depool it? [13:53:07] yes please [13:53:18] mw errors spike, yep [13:53:23] host got rebooted [13:53:25] _joe_: on it? [13:53:28] <_joe_> jynus: are you doing it? [13:53:29] I am betting BBU broke [13:53:33] <_joe_> you'll be faster than me [13:54:21] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:54:30] !log jynus@cumin1001 dbctl commit (dc=all): 'depool db1079', diff saved to https://phabricator.wikimedia.org/P11751 and previous config saved to /var/cache/conftool/dbconfig/20200706-135430-jynus.json [13:54:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:36] date=07/06/2020 [13:54:37] time=13:45 [13:54:37] description=POST Error: 313-HPE Smart Storage Battery 1 Failure - Battery Shutdown Event Code: 0x0400. Action: Restart system. Contact HPE support if condition persists. [13:54:49] checking if s7 api needs further readjustments [13:54:50] old hosts getting their BBU broken...classic [13:55:01] cough, HP hosts, cough [13:55:21] yeah, BBU issues, VSP has a Battery Shutdown Event [13:55:30] Action: restart system [13:55:33] errors down? can confirm _joe_ good at app later? [13:55:40] or someone else [13:55:47] while I keep tuning loads [13:55:48] I have downtimed the host, going to upgrade and reboot it again [13:55:51] (03PS2) 10Jbond: profile::idp: enable java::Security [puppet] - 10https://gerrit.wikimedia.org/r/609784 [13:55:52] <_joe_> it looks like it yes [13:55:55] (they seem good to me) [13:55:56] moritzm: yeah, classic on HP.. 
[13:56:09] will keep tuning load now for performance [13:56:10] !log Downtime and reboot db1079 after BBU crash [13:56:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:34] interestingly: it alerted 20 mins before it fully went down: Smart Storage Battery failure: Action: Gather AHS log and contact Support [13:56:52] I am glad we are replacing these hosts in a few months [13:56:55] it should not have lots of user impact [13:57:01] except the ongoing queries [13:57:09] although lots of log spam [13:57:48] fyi, eqiad mgmt network is back to normal [13:57:49] the mgmt alert issue doesn't help debugging (not anyone's fault) [13:58:23] we shouldn't have any performance issues on s7, so I am going to leave this host depooled till tomorrow [13:58:25] I re-forced the check, they should clear quickly [13:58:28] I placed a new host in s7 a few weeks ago [13:58:32] yeah, no prov [13:58:37] *prob [13:59:00] (03PS3) 10Jbond: profile::idp: enable java::Security [puppet] - 10https://gerrit.wikimedia.org/r/609784 [13:59:01] PROBLEM - Disk space on ms-be2025 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdb3 is not accessible: Input/output error https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be2025&var-datasource=codfw+prometheus/ops [13:59:16] (03CR) 10Jbond: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/609784 (owner: 10Jbond) [13:59:45] 10Operations, 10DBA, 10User-Kormat: Port DBA dashboards to thanos - https://phabricator.wikimedia.org/T256730 (10Kormat) [13:59:56] 10Operations, 10DBA, 10User-Kormat: Port DBA dashboards to thanos - https://phabricator.wikimedia.org/T256730 (10Kormat) 05Open→03Resolved All done.
[14:00:53] I am going to reduce db1136 general traffic weight to balance load [14:02:17] !log jynus@cumin1001 dbctl commit (dc=all): 'depool db1136 from main traffic as it is the only s7 api host right now', diff saved to https://phabricator.wikimedia.org/P11752 and previous config saved to /var/cache/conftool/dbconfig/20200706-140217-jynus.json [14:02:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:52] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/609784 (owner: 10Jbond) [14:03:24] 10Operations, 10DBA: db1079 BBU crashed host rebooted - https://phabricator.wikimedia.org/T257216 (10Marostegui) [14:03:53] RECOVERY - MariaDB read only s7 on db1079 is OK: Version 10.1.44-MariaDB, Uptime 179s, read_only: True, read_only: True, 25.31 QPS, connection latency: 0.003266s, query latency: 0.001501s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [14:04:00] RECOVERY - MariaDB Replica IO: s7 #page on db1079 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [14:04:26] ms-be2025 is a big issue? [14:04:39] or can be ignored for now? [14:04:56] RECOVERY - MariaDB Replica SQL: s7 #page on db1079 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [14:05:52] (03PS2) 10QChris: gerrit: Add Code Review logo as favicon [puppet] - 10https://gerrit.wikimedia.org/r/609764 (https://phabricator.wikimedia.org/T257218) [14:05:54] (03CR) 10Andrew Bogott: [C: 03+2] profile::wmcs::novaproxy: add use_wmflabs_root option [puppet] - 10https://gerrit.wikimedia.org/r/609586 (https://phabricator.wikimedia.org/T256276) (owner: 10Andrew Bogott) [14:05:56] things looking on [14:05:58] *ok [14:06:02] should I create a ticket? 
[14:06:10] 10Operations, 10DBA: db1079 BBU crashed host rebooted - https://phabricator.wikimedia.org/T257216 (10Marostegui) @wiki_willy @Jclark-ctr do we have spare BBUs around? [14:06:12] jynus: check above, the ticket is created [14:06:15] (03CR) 10Andrew Bogott: [C: 03+2] profile::wmcs::novaproxy: add acme_certname [puppet] - 10https://gerrit.wikimedia.org/r/609587 (https://phabricator.wikimedia.org/T256276) (owner: 10Andrew Bogott) [14:06:16] oh, thank you [14:06:17] 10Operations, 10observability: add monitoring to alert on hosts without RAID - https://phabricator.wikimedia.org/T206131 (10fgiunchedi) 05Open→03Declined With the standard partman recipes being implemented essentially everywhere it also means we get (software) raid "by default". I'm going to boldly resolve... [14:06:28] (03PS2) 10Andrew Bogott: profile::wmcs::novaproxy: add acme_certname [puppet] - 10https://gerrit.wikimedia.org/r/609587 (https://phabricator.wikimedia.org/T256276) [14:07:25] (03PS1) 10Marostegui: db1079: Add broken BBU status [puppet] - 10https://gerrit.wikimedia.org/r/609788 (https://phabricator.wikimedia.org/T257216) [14:08:12] 10Operations, 10DBA, 10Patch-For-Review: db1079 BBU crashed host rebooted - https://phabricator.wikimedia.org/T257216 (10Jclark-ctr) @Marostegui yes believe we still have some. I will be on site in a few hours if we wanted to change it today [14:08:23] 10Operations, 10DBA, 10Patch-For-Review: db1079 BBU crashed host rebooted - https://phabricator.wikimedia.org/T257216 (10jcrespo) db1079 was depooled: P11751 Main traffic removed from db1136 as it is currently the only s7 API host on eqiad: P11752 Both should be removed or taken into account if host is rep... [14:08:49] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [14:08:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:00] 10Operations, 10DBA, 10Patch-For-Review: db1079 BBU crashed host rebooted - https://phabricator.wikimedia.org/T257216 (10Marostegui) >>! 
In T257216#6281956, @Jclark-ctr wrote: > @Marostegui yes believe we still have some. I will be on site in a few hours if we wanted to change it today Excellent, I am goi... [14:09:08] !log Stop MySQL and poweroff db1079 T257216 [14:09:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:13] T257216: db1079 BBU crashed host rebooted - https://phabricator.wikimedia.org/T257216 [14:09:25] (03CR) 10Marostegui: [C: 03+2] db1079: Add broken BBU status [puppet] - 10https://gerrit.wikimedia.org/r/609788 (https://phabricator.wikimedia.org/T257216) (owner: 10Marostegui) [14:10:00] 10Operations, 10Icinga, 10observability, 10User-CDanis: CLI script for manual paging - https://phabricator.wikimedia.org/T82937 (10fgiunchedi) [14:11:32] 10Operations, 10observability, 10Sustainability (Incident Prevention): Add alerts for Logstash rates in production - https://phabricator.wikimedia.org/T199479 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi We have icinga alerts for mediawiki errors rates nowadays, based on Prometheus metrics (via logs... [14:14:10] 10Operations, 10observability, 10Patch-For-Review: mpt raid controller not detected as fact on maps-test2* - https://phabricator.wikimedia.org/T179078 (10fgiunchedi) 05Open→03Declined The old hosts have been eventually decom'd! 
[14:14:19] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:14:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:49] (03PS6) 10Andrew Bogott: Horizon: Include PROXY_DOMAIN_DICT [puppet] - 10https://gerrit.wikimedia.org/r/609610 (https://phabricator.wikimedia.org/T256276) [14:14:51] (03PS1) 10Andrew Bogott: dynamic proxy: don't treat "" as a boolean for $acme_certname [puppet] - 10https://gerrit.wikimedia.org/r/609790 (https://phabricator.wikimedia.org/T256276) [14:15:33] 10Operations, 10Icinga, 10observability, 10Patch-For-Review, 10Tor: Icinga check for Tor - https://phabricator.wikimedia.org/T148614 (10fgiunchedi) 05Open→03Declined Tor has been retired in {T243288} [14:16:46] (03CR) 10jerkins-bot: [V: 04-1] dynamic proxy: don't treat "" as a boolean for $acme_certname [puppet] - 10https://gerrit.wikimedia.org/r/609790 (https://phabricator.wikimedia.org/T256276) (owner: 10Andrew Bogott) [14:18:13] (03PS2) 10Andrew Bogott: dynamic proxy: don't treat "" as a boolean for $acme_certname [puppet] - 10https://gerrit.wikimedia.org/r/609790 (https://phabricator.wikimedia.org/T256276) [14:18:15] (03PS7) 10Andrew Bogott: Horizon: Include PROXY_DOMAIN_DICT [puppet] - 10https://gerrit.wikimedia.org/r/609610 (https://phabricator.wikimedia.org/T256276) [14:19:58] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=chartmuseum site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:20:16] (03CR) 10Andrew Bogott: [C: 03+2] dynamic proxy: don't treat "" as a boolean for $acme_certname [puppet] - 10https://gerrit.wikimedia.org/r/609790 (https://phabricator.wikimedia.org/T256276) (owner: 10Andrew Bogott) [14:20:24] 10Operations, 10OTRS, 10User-notice: Update OTRS to the latest stable version (6.0.x) - https://phabricator.wikimedia.org/T187984 (10akosiaris) p:05Low→03Medium a:03akosiaris Bumping to 
normal, I am starting work on this. [14:22:42] 10Operations, 10LDAP-Access-Requests: Add Conny Kawohl to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T257038 (10jcrespo) Please read what is needed to file a valid NDA at the similar ticket T256201#6266045 and reference this ticket so we SREs are notified when it is ready or you t... [14:22:59] 10Operations, 10LDAP-Access-Requests: Add Conny Kawohl to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T257038 (10jcrespo) p:05Triage→03Medium [14:23:00] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:25:28] jouncebot: now [14:25:28] No deployments scheduled for the next 2 hour(s) and 34 minute(s) [14:25:31] jouncebot: next [14:25:31] In 2 hour(s) and 34 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200706T1700) [14:25:38] PROBLEM - ChartMuseum HTTP on chartmuseum2001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - string true not found on https://helm-charts.discovery.wmnet:443/health - 244 bytes in 1.153 second response time https://wikitech.wikimedia.org/wiki/ChartMuseum [14:25:57] this is me [14:28:50] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=chartmuseum site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:28:58] !log powercycle ms-be2025, no ssh available - T257214 [14:29:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:02] T257214: Degraded RAID on ms-be2025 - https://phabricator.wikimedia.org/T257214 [14:29:42] PROBLEM - Host ms-be2025 is DOWN: PING CRITICAL - Packet loss = 100% [14:30:41] (03CR) 10Giuseppe Lavagetto: [C: 03+2]
wmflib::service: introduce get_url function [puppet] - 10https://gerrit.wikimedia.org/r/609153 (owner: 10Giuseppe Lavagetto) [14:30:51] (03PS6) 10Giuseppe Lavagetto: wmflib::service: introduce get_url function [puppet] - 10https://gerrit.wikimedia.org/r/609153 [14:31:26] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:32:42] RECOVERY - Host ms-be2025 is UP: PING OK - Packet loss = 0%, RTA = 36.08 ms [14:34:02] RECOVERY - MD RAID on ms-be2025 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [14:34:10] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=chartmuseum site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:34:12] (03CR) 10Mholloway: [C: 03+2] Wikifeeds: Update to 2020-07-02-214619-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/609246 (https://phabricator.wikimedia.org/T255198) (owner: 10Mholloway) [14:35:28] (03Merged) 10jenkins-bot: Wikifeeds: Update to 2020-07-02-214619-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/609246 (https://phabricator.wikimedia.org/T255198) (owner: 10Mholloway) [14:36:38] !log reboot ms-be2025 for hw raid software upgrade - T257214 [14:36:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:43] T257214: Degraded RAID on ms-be2025 - https://phabricator.wikimedia.org/T257214 [14:37:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1074', diff saved to https://phabricator.wikimedia.org/P11753 and previous config saved to /var/cache/conftool/dbconfig/20200706-143754-marostegui.json [14:37:58] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:44] 10Operations, 10LDAP-Access-Requests, 10WMF-Legal: Add Guergana Tzatchkova to the ldap/wmde group - https://phabricator.wikimedia.org/T256201 (10jcrespo) The records SRE have access to do not show yet an NDA agreement for guergana.tzatchkova. @guergana.tzatchkova Could you confirm this was sent and it is a d... [14:38:54] PROBLEM - Host ms-be2025 is DOWN: PING CRITICAL - Packet loss = 100% [14:39:16] 10Operations, 10LDAP-Access-Requests, 10WMF-Legal: Add Guergana Tzatchkova to the ldap/wmde group - https://phabricator.wikimedia.org/T256201 (10jcrespo) [14:39:18] RECOVERY - Host ms-be2025 is UP: PING OK - Packet loss = 0%, RTA = 36.12 ms [14:39:38] RECOVERY - Disk space on ms-be2025 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be2025&var-datasource=codfw+prometheus/ops [14:39:47] 10Operations, 10OTRS, 10serviceops, 10User-notice: Update OTRS to the latest stable version (6.0.x) - https://phabricator.wikimedia.org/T187984 (10akosiaris) Below is a draft plan for the upgrade: [] Obtain a new, Debian Buster host (has already been done, otrs1001) [] Obtain a point in time snapshot of... 
[14:39:56] 10Operations, 10ops-codfw: Degraded RAID on ms-be2025 - https://phabricator.wikimedia.org/T257214 (10fgiunchedi) 05Open→03Invalid Host came back clean, I've updated the hw raid firmware while I was at it [14:39:58] RECOVERY - Check systemd state on ms-be2025 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:41:33] (03PS1) 10JMeybohm: chartmuseum: Ensure envoy connects via IPv4 [puppet] - 10https://gerrit.wikimedia.org/r/609792 (https://phabricator.wikimedia.org/T253843) [14:41:36] 10Operations, 10LDAP-Access-Requests, 10observability, 10serviceops, 10Patch-For-Review: Grant Access to Logstash to Peter(peter.ovchyn@speedandfunction.com) - https://phabricator.wikimedia.org/T249037 (10jcrespo) FYI @AMooney I don't see Peter on our shared records with legal. Sorry for the delay, but t... [14:42:42] !log installing PHP 7.0 security updates [14:42:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:15] !log mholloway-shell@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'wikifeeds' for release 'staging' . [14:44:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:40] !log mholloway-shell@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'wikifeeds' for release 'production' . [14:46:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:25] (03PS3) 10Jbond: java: update java.security to use urandom on buster [puppet] - 10https://gerrit.wikimedia.org/r/609783 [14:50:00] !log mholloway-shell@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'wikifeeds' for release 'production' . 
[14:50:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:37] 10Operations, 10OTRS, 10serviceops, 10User-notice: Update OTRS to the latest stable version (6.0.x) - https://phabricator.wikimedia.org/T187984 (10jcrespo) As this seems more snapshotting related than databases, I may take care myself of the db preparation needed. If db1077 is freed because of T256120#628... [14:55:07] 10Operations, 10Wikimedia-Mailing-lists, 10Patch-For-Review: Puppetize mailman3 - https://phabricator.wikimedia.org/T256536 (10jcrespo) p:05Triage→03Medium [14:55:13] (03CR) 10Hashar: [C: 03+1] "That relies on the success/failure message defined in Zuul configuration but that looks good enough ;)" [puppet] - 10https://gerrit.wikimedia.org/r/609743 (https://phabricator.wikimedia.org/T256575) (owner: 10QChris) [14:56:34] (03CR) 10Hashar: [C: 03+1] gerrit: For SonarQube, bring only overall result to CI table as 'Sonar Cloud' [puppet] - 10https://gerrit.wikimedia.org/r/609744 (https://phabricator.wikimedia.org/T256575) (owner: 10QChris) [14:59:16] 10Operations, 10CommRel-Specialists-Support (Jul-Sep-2020): CommRel support for FY2020-2021 Q1 DC switchover - https://phabricator.wikimedia.org/T244808 (10Trizek-WMF) I'm fully available to prepare and handle September 1 event. 
[14:59:32] 10Operations: serve our production ssh known_hosts file over public HTTPS - https://phabricator.wikimedia.org/T257219 (10CDanis) [15:02:00] !log removing old snapshots for x1 on dbprov[12]002 [15:02:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:22] 10Operations, 10ops-eqiad, 10netops: (Need by: 2019-09-30) upgrade msw1-eqiad from EX4200 to EX4300 - https://phabricator.wikimedia.org/T225121 (10ayounsi) The switch has been replaced successfully, next steps: * Update Netbox * Wipe/decom old switch [15:09:34] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [15:11:24] 10Operations, 10netops, 10Wikimedia-Incident: Investigate Juniper storm control - https://phabricator.wikimedia.org/T245192 (10ayounsi) > msw1-eqiad is replaced with T225121 This is done, and is running the same storm-control config as codfw. [15:11:26] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [15:11:44] 10Operations, 10OTRS, 10serviceops, 10User-notice: Update OTRS to the latest stable version (6.0.x) - https://phabricator.wikimedia.org/T187984 (10Marostegui) db1077 can be used yep [15:19:16] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to production infrastructure services for jgiannelos - https://phabricator.wikimedia.org/T257187 (10thcipriani) >>! In T257187#6280782, @jcrespo wrote: > @thcipriani I believe you will be the service owner approving this access. Can y... 
[15:19:53] (03PS1) 10Jbond: SSHFP: add a text file with the SSHFB of all hosts [puppet] - 10https://gerrit.wikimedia.org/r/609796 [15:19:56] (03CR) 10Jcrespo: [C: 03+1] admin: add Ahmon Dancy to production shell users [puppet] - 10https://gerrit.wikimedia.org/r/609225 (https://phabricator.wikimedia.org/T256770) (owner: 10Ssingh) [15:20:39] (03PS16) 10Addshore: Commons: Define entity sources configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/569260 (https://phabricator.wikimedia.org/T256906) (owner: 10WMDE-leszek) [15:20:46] (03PS5) 10Addshore: Wikidata client wikis: Define entity sources configuration (take 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608839 (https://phabricator.wikimedia.org/T254315) [15:20:51] (03PS2) 10Addshore: Wikibase: stop using wmgUseEntitySourceBasedFederation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608944 (https://phabricator.wikimedia.org/T241975) [15:20:57] (03PS15) 10Addshore: Wikibase: Removed config option wmgUseEntitySourceBasedFederation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/569263 (https://phabricator.wikimedia.org/T241975) (owner: 10WMDE-leszek) [15:21:03] (03PS16) 10Addshore: Wikibase: Remove config option wmgUseEntitySourceBasedFederation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/569263 (https://phabricator.wikimedia.org/T241975) (owner: 10WMDE-leszek) [15:21:46] (03PS2) 10Jbond: SSHFP: add a text file with the SSHFB of all hosts [puppet] - 10https://gerrit.wikimedia.org/r/609796 [15:22:20] 10Operations, 10CAS-SSO, 10Patch-For-Review: icinga Blocked by X-Frame-Options Policy - https://phabricator.wikimedia.org/T251513 (10MoritzMuehlenhoff) With the login page redirect merged, after expiring the session, the login page is now correctly shown in the full window (and no longer in the Icinga UI fra... 
[15:24:56] (03CR) 10Jcrespo: [C: 03+2] admin: add Ahmon Dancy to production shell users [puppet] - 10https://gerrit.wikimedia.org/r/609225 (https://phabricator.wikimedia.org/T256770) (owner: 10Ssingh) [15:27:38] 10Operations, 10MediaWiki-extensions-Score, 10Security-Team, 10Wikimedia-General-or-Unknown, and 2 others: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (10XanonymusX) Has this been fixed? It looks like the errors are gone now and Score is working again. [15:29:19] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to production servers for Ahmon Dancy - RelEng - https://phabricator.wikimedia.org/T256770 (10jcrespo) 05Open→03Resolved User correctly deployed to production: ` Notice: /Stage[main]/Admin/Admin::Hashuser[dancy]/Admin::User[dancy]... [15:29:29] 10Operations, 10MediaWiki-extensions-Score, 10Security-Team, 10Wikimedia-General-or-Unknown, and 2 others: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (10Reedy) [15:29:46] (03PS3) 10Jbond: SSHFP: add a text file with the SSHFB of all hosts [puppet] - 10https://gerrit.wikimedia.org/r/609796 [15:30:48] (03CR) 10Muehlenhoff: java: update java.security to use urandom on buster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/609783 (owner: 10Jbond) [15:31:00] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to production servers for Ahmon Dancy - RelEng - https://phabricator.wikimedia.org/T256770 (10jcrespo) @thcipriani given it is your direct report, I think your team will be able to take care of Gerrit membership needs as new deployer,... 
[15:32:20] (03PS4) 10Jbond: SSHFP: add a text file with the SSHFB of all hosts [puppet] - 10https://gerrit.wikimedia.org/r/609796 [15:34:34] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to production servers for Ahmon Dancy - RelEng - https://phabricator.wikimedia.org/T256770 (10thcipriani) >>! In T256770#6282207, @jcrespo wrote: > @thcipriani given it is your direct report, I think your team will be able to take car... [15:42:04] PROBLEM - Host restbase2009.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:42:50] (03PS5) 10Jbond: SSHFP: add a text file with the SSHFB of all hosts [puppet] - 10https://gerrit.wikimedia.org/r/609796 [15:45:50] 10Operations, 10OTRS, 10serviceops, 10User-notice: Update OTRS to the latest stable version (6.0.x) - https://phabricator.wikimedia.org/T187984 (10akosiaris) >>! In T187984#6282101, @jcrespo wrote: > As this seems more snapshotting related than databases, I may take care myself of the db preparation needed... [15:47:15] (03PS6) 10Jbond: SSHFP: add a text file with the SSHFB of all hosts [puppet] - 10https://gerrit.wikimedia.org/r/609796 (https://phabricator.wikimedia.org/T257219) [15:47:44] (03CR) 10Andrew Bogott: [C: 03+2] Horizon: Include PROXY_DOMAIN_DICT [puppet] - 10https://gerrit.wikimedia.org/r/609610 (https://phabricator.wikimedia.org/T256276) (owner: 10Andrew Bogott) [15:47:58] RECOVERY - Host restbase2009.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.45 ms [15:51:20] (03PS7) 10Jbond: SSHFP: add a text file with the SSHFB of all hosts [puppet] - 10https://gerrit.wikimedia.org/r/609796 (https://phabricator.wikimedia.org/T257219) [15:52:33] (03CR) 10Jbond: "ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/609796 (https://phabricator.wikimedia.org/T257219) (owner: 10Jbond) [15:54:01] (03PS8) 10Jbond: SSHFP: add a text file with the SSHFB of all hosts [puppet] - 10https://gerrit.wikimedia.org/r/609796 (https://phabricator.wikimedia.org/T257219) [15:54:41] (03PS3) 10Cicalese: 
DO NOT MERGE Remove temporary logging for mediamoderation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606239 (https://phabricator.wikimedia.org/T245595) [15:54:51] (03PS4) 10Cicalese: DO NOT MERGE Remove temporary logging for mediamoderation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606239 (https://phabricator.wikimedia.org/T245595) [15:57:18] (03CR) 10Jbond: "PCC: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/23710" [puppet] - 10https://gerrit.wikimedia.org/r/609796 (https://phabricator.wikimedia.org/T257219) (owner: 10Jbond) [16:01:54] PROBLEM - Host restbase2009.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:02:34] (03PS1) 10Ottomata: Revert "Revert "Migrate SearchSatisfaction from EventLogging to EventGate on group1"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609633 [16:02:40] (03PS2) 10Ottomata: Revert "Revert "Migrate SearchSatisfaction from EventLogging to EventGate on group1"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609633 [16:04:52] 10Operations, 10MediaWiki-extensions-Score, 10Security-Team, 10Wikimedia-General-or-Unknown, and 3 others: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (10Trizek-WMF) [16:06:19] (03CR) 10Ottomata: [C: 03+2] Revert "Revert "Migrate SearchSatisfaction from EventLogging to EventGate on group1"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609633 (owner: 10Ottomata) [16:07:09] (03Merged) 10jenkins-bot: Revert "Revert "Migrate SearchSatisfaction from EventLogging to EventGate on group1"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609633 (owner: 10Ottomata) [16:07:20] 10Operations, 10MediaWiki-extensions-Score, 10Security-Team, 10Wikimedia-General-or-Unknown, and 3 others: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (10Reedy) >>! In T257066#6282186, @XanonymusX wrote: > Has this been fixed? It looks like the errors are... 
[16:07:48] RECOVERY - Host restbase2009.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.37 ms [16:08:22] PROBLEM - Check systemd state on stat1008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:08:52] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:09:44] !log otto@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Migrate SearchSatisfaction from EventLogging to EventGate on group1 - T249261 (duration: 00m 58s) [16:09:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:52] T249261: Vertical: Migrate SearchSatisfaction EventLogging event stream to Event Platform - https://phabricator.wikimedia.org/T249261 [16:10:46] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:15:54] PROBLEM - Host restbase2009.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:19:46] (03CR) 10JMeybohm: [C: 03+2] chartmuseum: Ensure envoy connects via IPv4 [puppet] - 10https://gerrit.wikimedia.org/r/609792 (https://phabricator.wikimedia.org/T253843) (owner: 10JMeybohm) [16:30:28] (03PS1) 10Andrew Bogott: Horizon: Set default domain for proxy UI [puppet] - 10https://gerrit.wikimedia.org/r/609804 (https://phabricator.wikimedia.org/T256276) [16:31:50] RECOVERY - ChartMuseum HTTP on chartmuseum2001 is OK: HTTP OK: HTTP/1.1 200 OK - 260 bytes in 1.155 second response time https://wikitech.wikimedia.org/wiki/ChartMuseum [16:32:36] 10Operations, 10LDAP-Access-Requests, 10WMF-Legal: Add Guergana Tzatchkova to the ldap/wmde group - https://phabricator.wikimedia.org/T256201 (10guergana.tzatchkova) Should I send my personal data? Or will it be sent in by wmde? [16:35:56] 10Operations, 10MediaWiki-extensions-Score, 10Security-Team, 10Wikimedia-General-or-Unknown, and 3 others: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (10XanonymusX) Well, some output did not work two days ago, but it does now. Caching problems, I guess,... [16:39:29] 10Operations, 10MediaWiki-extensions-Score, 10Security-Team, 10Wikimedia-General-or-Unknown, and 3 others: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (10Reedy) >>! In T257066#6282419, @XanonymusX wrote: > Well, some output did not work two days ago, but... 
[16:40:45] 10Operations, 10observability, 10Patch-For-Review: Leverage Grafana annotations to show events in graphs - https://phabricator.wikimedia.org/T222826 (10colewhite) [16:41:46] RECOVERY - Check systemd state on stat1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:44:00] 10Operations, 10ops-codfw, 10RESTBase: restbase2009 down - https://phabricator.wikimedia.org/T256863 (10Papaul) The server is dead. It has a main board problem. [16:46:00] RECOVERY - Host restbase2009.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.50 ms [16:47:25] (03PS1) 10Hnowlan: api-gateway: Basic envoy chart WIP [deployment-charts] - 10https://gerrit.wikimedia.org/r/609808 (https://phabricator.wikimedia.org/T254906) [16:47:41] (03CR) 10jerkins-bot: [V: 04-1] api-gateway: Basic envoy chart WIP [deployment-charts] - 10https://gerrit.wikimedia.org/r/609808 (https://phabricator.wikimedia.org/T254906) (owner: 10Hnowlan) [16:48:13] (03PS2) 10Hnowlan: api-gateway: Basic envoy chart WIP [deployment-charts] - 10https://gerrit.wikimedia.org/r/609808 (https://phabricator.wikimedia.org/T254906) [16:48:28] (03CR) 10jerkins-bot: [V: 04-1] api-gateway: Basic envoy chart WIP [deployment-charts] - 10https://gerrit.wikimedia.org/r/609808 (https://phabricator.wikimedia.org/T254906) (owner: 10Hnowlan) [16:49:18] (03CR) 10VulpesVulpes825: "Sorry for late reply." 
(031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609559 (https://phabricator.wikimedia.org/T257112) (owner: 10VulpesVulpes825) [16:49:39] (03CR) 10RLazarus: [C: 03+1] simplelamp2: do not purge unmanaged config files [puppet] - 10https://gerrit.wikimedia.org/r/597052 (https://phabricator.wikimedia.org/T169368) (owner: 10Dzahn) [16:52:38] (03CR) 10Ppchelko: "It would've been cool if this was broken into 2 separate change sets with one copy-pasting all the scaffolding, helpers and shared stuff a" [deployment-charts] - 10https://gerrit.wikimedia.org/r/609808 (https://phabricator.wikimedia.org/T254906) (owner: 10Hnowlan) [16:54:33] (03PS3) 10Hnowlan: api-gateway: Basic envoy chart WIP [deployment-charts] - 10https://gerrit.wikimedia.org/r/609808 (https://phabricator.wikimedia.org/T254906) [16:55:39] (03CR) 10jerkins-bot: [V: 04-1] api-gateway: Basic envoy chart WIP [deployment-charts] - 10https://gerrit.wikimedia.org/r/609808 (https://phabricator.wikimedia.org/T254906) (owner: 10Hnowlan) [16:59:34] PROBLEM - Check systemd state on deneb is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:00:05] gehel and onimisionipe: I, the Bot under the Fountain, allow thee, The Deployer, to do Wikidata Query Service weekly deploy deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200706T1700). [17:01:55] (03CR) 10Dzahn: [C: 03+2] "re: ru.wikimedia.org yea, i just changed that recently in an attempt to fix it, the previous version was also broken. 
I'll remove it then" [puppet] - 10https://gerrit.wikimedia.org/r/609565 (owner: 10Amire80) [17:02:37] (03PS2) 10Dzahn: Remove three entries from the Russian Planet [puppet] - 10https://gerrit.wikimedia.org/r/609565 (https://phabricator.wikimedia.org/T168459) (owner: 10Amire80) [17:03:17] 10Operations, 10DBA: db1079 BBU crashed host rebooted - https://phabricator.wikimedia.org/T257216 (10Jclark-ctr) @Marostegui BBU replaced host is powering up now [17:05:19] (03PS1) 10ArielGlenn: add a README about the content of the commons structured data dumps [puppet] - 10https://gerrit.wikimedia.org/r/609823 (https://phabricator.wikimedia.org/T221917) [17:05:52] (03CR) 10Dzahn: "for some reason it seems to be about chainsaws nowadays. thanks for this patch" [puppet] - 10https://gerrit.wikimedia.org/r/609564 (owner: 10Amire80) [17:06:33] (03PS3) 10Dzahn: Remove englishwikisource.tumblr.com from Planet Wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/609564 (https://phabricator.wikimedia.org/T168459) (owner: 10Amire80) [17:07:58] (03CR) 10Ppchelko: "Just some random comments, didn't go too deep yet" (0310 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/609808 (https://phabricator.wikimedia.org/T254906) (owner: 10Hnowlan) [17:08:07] (03CR) 10Dzahn: [C: 03+2] Remove englishwikisource.tumblr.com from Planet Wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/609564 (https://phabricator.wikimedia.org/T168459) (owner: 10Amire80) [17:08:48] (03PS1) 10Ottomata: EventLogging - use EventGate on all wikis for SearchSatisfaction [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609634 [17:08:57] (03CR) 10jerkins-bot: [V: 04-1] EventLogging - use EventGate on all wikis for SearchSatisfaction [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609634 (owner: 10Ottomata) [17:09:36] (03CR) 10Dzahn: [C: 03+2] gerrit: Add SonarQube results to CI table [puppet] - 10https://gerrit.wikimedia.org/r/609520 (https://phabricator.wikimedia.org/T256575) (owner: 10QChris) 
[17:11:14] (03CR) 10Dzahn: [C: 03+2] gerrit: Add table for Zuul CI results underneath the commit message [puppet] - 10https://gerrit.wikimedia.org/r/609519 (https://phabricator.wikimedia.org/T256575) (owner: 10QChris) [17:11:49] (03PS4) 10Dzahn: gerrit: Add SonarQube results to CI table [puppet] - 10https://gerrit.wikimedia.org/r/609520 (https://phabricator.wikimedia.org/T256575) (owner: 10QChris) [17:11:59] (03PS2) 10Ottomata: EventLogging - use EventGate on all wikis for SearchSatisfaction [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609634 (https://phabricator.wikimedia.org/T249261) [17:15:34] (03CR) 10Andrew Bogott: [C: 03+2] Horizon: Set default domain for proxy UI [puppet] - 10https://gerrit.wikimedia.org/r/609804 (https://phabricator.wikimedia.org/T256276) (owner: 10Andrew Bogott) [17:17:08] (03PS3) 10Ottomata: Collect eventgate error.validation topics into logstash [puppet] - 10https://gerrit.wikimedia.org/r/589597 (https://phabricator.wikimedia.org/T116719) [17:33:02] (03PS1) 10Ottomata: Provide multiple kafka truststore passwords to profile::logstash::collector [labs/private] - 10https://gerrit.wikimedia.org/r/609828 [17:36:53] (03PS4) 10Ottomata: Collect eventgate error.validation topics into logstash [puppet] - 10https://gerrit.wikimedia.org/r/589597 (https://phabricator.wikimedia.org/T116719) [17:37:25] (03CR) 10Ottomata: "labs-private change with truststore dummy pws here:" [puppet] - 10https://gerrit.wikimedia.org/r/589597 (https://phabricator.wikimedia.org/T116719) (owner: 10Ottomata) [17:38:17] (03CR) 10jerkins-bot: [V: 04-1] Collect eventgate error.validation topics into logstash [puppet] - 10https://gerrit.wikimedia.org/r/589597 (https://phabricator.wikimedia.org/T116719) (owner: 10Ottomata) [17:39:35] (03PS5) 10Ottomata: Collect eventgate error.validation topics into logstash [puppet] - 10https://gerrit.wikimedia.org/r/589597 (https://phabricator.wikimedia.org/T116719) [17:40:49] (03PS6) 10Ottomata: Collect eventgate error.validation 
topics into logstash [puppet] - 10https://gerrit.wikimedia.org/r/589597 (https://phabricator.wikimedia.org/T116719) [17:40:51] (03CR) 10jerkins-bot: [V: 04-1] Collect eventgate error.validation topics into logstash [puppet] - 10https://gerrit.wikimedia.org/r/589597 (https://phabricator.wikimedia.org/T116719) (owner: 10Ottomata) [17:42:58] (03PS1) 10JMeybohm: prometheus: Switch chartmuseum scrape target to HTTPS [puppet] - 10https://gerrit.wikimedia.org/r/609829 (https://phabricator.wikimedia.org/T253843) [17:44:55] (03CR) 10JMeybohm: [C: 03+2] prometheus: Switch chartmuseum scrape target to HTTPS [puppet] - 10https://gerrit.wikimedia.org/r/609829 (https://phabricator.wikimedia.org/T253843) (owner: 10JMeybohm) [17:48:08] (03CR) 10Ottomata: [C: 03+2] EventLogging - use EventGate on all wikis for SearchSatisfaction [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609634 (https://phabricator.wikimedia.org/T249261) (owner: 10Ottomata) [17:48:25] 10Operations, 10Analytics, 10Analytics-Kanban, 10observability: systemd::syslog conf should use :programname equals instead of startswith - https://phabricator.wikimedia.org/T251606 (10Nuria) 05Open→03Resolved [17:48:59] (03Merged) 10jenkins-bot: EventLogging - use EventGate on all wikis for SearchSatisfaction [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609634 (https://phabricator.wikimedia.org/T249261) (owner: 10Ottomata) [17:50:29] !log otto@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Migrate SearchSatisfaction from EventLogging to EventGate on all wikis - T249261 (duration: 00m 56s) [17:50:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:50:35] T249261: Vertical: Migrate SearchSatisfaction EventLogging event stream to Event Platform - https://phabricator.wikimedia.org/T249261 [17:50:40] 10Operations, 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: Create a profile to standardize the deployment of JVM packages and configurations - 
https://phabricator.wikimedia.org/T253553 (10Nuria) 05Open→03Resolved [17:51:41] (03CR) 10Krinkle: [C: 03+1] gerrit: For Zuul, bring only 'Main test build' jobs to CI table [puppet] - 10https://gerrit.wikimedia.org/r/609743 (https://phabricator.wikimedia.org/T256575) (owner: 10QChris) [17:51:52] (03CR) 10Krinkle: [C: 03+1] gerrit: For SonarQube, bring only overall result to CI table as 'Sonar Cloud' [puppet] - 10https://gerrit.wikimedia.org/r/609744 (https://phabricator.wikimedia.org/T256575) (owner: 10QChris) [17:54:04] !log otto@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Migrate SearchSatisfaction from EventLogging to EventGate on all wikis - T249261 - take 2 (duration: 00m 56s) [17:54:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:07] (03CR) 10Krinkle: gerrit: Add header to CI table (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/609745 (owner: 10QChris) [18:00:04] RoanKattouw, Niharika, and Urbanecm: May I have your attention please! Morning backport window(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200706T1800) [18:00:04] ProcReader, Huji, Pchelolo, Addshore, Pchelolo, and MatmaRex: A patch you scheduled for Morning backport window(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:17] o/ [18:00:22] I can deploy today! [18:00:26] I did not know about the sticker thingy [18:00:36] Can I push an updated patch that woudl actually break things? [18:00:36] busy day today, Urbanecm :) [18:00:51] * addshore can do mine (at the end), also fine to not get to the last ones [18:01:03] addshore: okay, I'll ping you once I'm done :) [18:01:07] ty! 
[18:01:42] hello [18:02:39] (03CR) 10Urbanecm: [C: 03+2] Add 'abusefilter-view' as a default right for the CU log user [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608222 (https://phabricator.wikimedia.org/T255506) (owner: 10ProcrastinatingReader) [18:02:47] (03CR) 10Urbanecm: Add arbcom group to plwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608440 (https://phabricator.wikimedia.org/T256572) (owner: 10ProcrastinatingReader) [18:02:52] ProcReader: could you have a look? :-) [18:02:56] (at my comment) [18:03:22] (03Merged) 10jenkins-bot: Add 'abusefilter-view' as a default right for the CU log user [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608222 (https://phabricator.wikimedia.org/T255506) (owner: 10ProcrastinatingReader) [18:03:30] saw, will update [18:03:57] should I update to this commit? or create a separate one? [18:03:57] huji: your patch is at mwdebug1001 for testing. Could you have a look, please? [18:04:03] ProcReader: update this one, please [18:04:29] (and no, please let's not intentionally break stuff :)) [18:04:43] !log andrew@deploy1001 Started deploy [horizon/deploy@bb176c2]: update proxy UI to support multiple pre-set domains [18:04:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:28] Urbanecm: I'll do mine myself if you don't mind after you're done [18:05:35] sure, will ping you :) [18:05:38] (03PS2) 10ProcrastinatingReader: Add arbcom group to plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608440 (https://phabricator.wikimedia.org/T256572) [18:05:43] (you're reading my mind, was just going to ask :D) [18:05:46] (03CR) 10ProcrastinatingReader: Add arbcom group to plwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608440 (https://phabricator.wikimedia.org/T256572) (owner: 10ProcrastinatingReader) [18:05:50] (03CR) 10BryanDavis: Pywikibot container (031 comment) [docker-images/toollabs-images] - 
10https://gerrit.wikimedia.org/r/603652 (https://phabricator.wikimedia.org/T249787) (owner: 10BryanDavis) [18:06:04] (03PS3) 10Urbanecm: Add arbcom group to plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608440 (https://phabricator.wikimedia.org/T256572) (owner: 10ProcrastinatingReader) [18:06:10] (03CR) 10Urbanecm: [C: 03+2] "B&C" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608440 (https://phabricator.wikimedia.org/T256572) (owner: 10ProcrastinatingReader) [18:07:09] huji: how is testing going? :-) [18:07:20] (03Merged) 10jenkins-bot: Add arbcom group to plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608440 (https://phabricator.wikimedia.org/T256572) (owner: 10ProcrastinatingReader) [18:08:20] one sec [18:08:22] got on a call [18:08:22] !log andrew@deploy1001 Finished deploy [horizon/deploy@bb176c2]: update proxy UI to support multiple pre-set domains (duration: 03m 39s) [18:08:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:39] (03CR) 10QChris: gerrit: Add header to CI table (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/609745 (owner: 10QChris) [18:08:58] sure, waiting [18:10:39] ok done with the call. give me 3-5 minutes [18:10:46] sure [18:13:08] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:14:22] hmmm [18:14:27] so nothing broke [18:14:32] but nothing got fixed either [18:14:36] let me think again [18:15:09] (03PS1) 10Jdlrobson: Enable sidebar instrumentation on test wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609831 (https://phabricator.wikimedia.org/T256992) [18:15:57] ProcReader: any thoughts? [18:16:10] huji: I made sure the patch is applied correctly. 
[18:16:20] we gave abusefilter-view to the anonymous user, yet the logs still appear redacted [18:16:40] if it did nothing, not sure. I'd think perhaps we shouldn't add to array but instead set it like I did in an earlier patchset (=, rather than []), but that wouldn't be proper either (+ the default is [], so adding to it shouldn't be an issue) [18:16:43] I'm keeping the patch at mwdebug1001, but unsynced for now. [18:17:03] oh shit [18:17:07] I was looking at the wrong place [18:17:10] (pardon my language) [18:17:11] my test locally of adding to the array, with the other fix it depends on, worked fine [18:17:13] ah [18:17:39] Urbanecm: rookie mistake; i was using the wrong browser, without the wikimedia extension [18:17:56] heh, happens sometimes :) [18:18:15] ok, let me reinstall the extension. what was it called again? [18:18:31] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Provide multiple kafka truststore passwords to profile::logstash::collector [labs/private] - 10https://gerrit.wikimedia.org/r/609828 (owner: 10Ottomata) [18:18:40] huji: https://wikitech.wikimedia.org/wiki/X-Wikimedia-Debug has the links [18:19:07] (03CR) 10Herron: [C: 03+1] "jftr" [labs/private] - 10https://gerrit.wikimedia.org/r/609828 (owner: 10Ottomata) [18:19:15] ProcReader: your patch is at mwdebug1002 [18:19:34] Urbanecm: thank you :) - testing [18:20:07] Urbanecm: looks good! [18:20:14] thanks, syncing! [18:20:23] (03CR) 10JJMC89: Pywikibot container (031 comment) [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/603652 (https://phabricator.wikimedia.org/T249787) (owner: 10BryanDavis) [18:20:28] Urbanecm: how do I verify I am on the correct server? Special:Version? 
[18:21:19] look at the server header, it should say server: mwdebug1001.eqiad.wmnet [18:21:36] (03PS2) 10Bartosz Dziewoński: Enable validation of new signatures [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608619 (https://phabricator.wikimedia.org/T248632) [18:21:52] confirmed [18:22:31] Urbanecm: it works [18:22:42] thanks, syncing! [18:22:42] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 1398171: Add arbcom group to plwiki (T256572) (duration: 00m 56s) [18:22:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:47] T256572: Creation of a new user group on plwikipedia - https://phabricator.wikimedia.org/T256572 [18:23:32] Urbanecm: confirming good post-sync (for 608440) [18:23:38] thanks [18:23:58] thanks for deploying :) [18:24:18] !log urbanecm@deploy1001 Synchronized wmf-config/abusefilter.php: 8878c60: Add `abusefilter-view` as a default right for the CU log user (T255506) (duration: 00m 55s) [18:24:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:23] T255506: Identify how abuse log details were purged from the CU logs - https://phabricator.wikimedia.org/T255506 [18:24:25] huji: deployed [18:24:47] MatmaRex: your patch is next :). [18:25:11] (03CR) 10Urbanecm: [C: 03+2] Enable validation of new signatures [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608619 (https://phabricator.wikimedia.org/T248632) (owner: 10Bartosz Dziewoński) [18:25:21] thanks [18:26:23] (03Merged) 10jenkins-bot: Enable validation of new signatures [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608619 (https://phabricator.wikimedia.org/T248632) (owner: 10Bartosz Dziewoński) [18:27:34] MatmaRex: your patch is available for testing at mwdebug1002, could you have a look? [18:27:45] yep [18:27:51] thanks [18:28:43] Urbanecm: seems good! [18:28:45] Thanks again, Urbanecm; if nothing else is there for me to do, I am going to drop off. 
[18:29:14] huji: your patch is done now, see you later! [18:29:18] happy to help [18:29:22] MatmaRex: thanks, syncing [18:30:51] (03PS13) 10CRusnov: customscripts/interface_automation.py: Add PuppetDB Importer [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/588036 (https://phabricator.wikimedia.org/T244153) [18:30:56] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: adffbe6: Enable validation of new signatures (T248632) (duration: 00m 57s) [18:31:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:31:01] T248632: Implement new signature requirements - https://phabricator.wikimedia.org/T248632 [18:31:02] MatmaRex: done [18:31:11] addshore: Pchelolo: I'm done now. Not sure which patches are more urgent, so pinging both :) [18:31:20] (03CR) 10jerkins-bot: [V: 04-1] customscripts/interface_automation.py: Add PuppetDB Importer [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/588036 (https://phabricator.wikimedia.org/T244153) (owner: 10CRusnov) [18:31:22] addshore: you go ahead, I have a meeting now [18:31:25] thanks Urbanecm! [18:31:31] happy to help! [18:32:13] (03PS1) 10Ottomata: Keep profile::logstash::collector::input_kafka_ssl_truststore_password for now [labs/private] - 10https://gerrit.wikimedia.org/r/609833 [18:32:22] Pchelolo: ack! 
[18:32:34] (03PS14) 10CRusnov: customscripts/interface_automation.py: Add PuppetDB Importer [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/588036 (https://phabricator.wikimedia.org/T244153) [18:32:39] (03PS2) 10Ottomata: Keep profile::logstash::collector::input_kafka_ssl_truststore_password for now [labs/private] - 10https://gerrit.wikimedia.org/r/609833 [18:32:47] (03PS17) 10Addshore: Commons: Define entity sources configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/569260 (https://phabricator.wikimedia.org/T256906) (owner: 10WMDE-leszek) [18:32:51] (03CR) 10Addshore: [C: 03+2] Commons: Define entity sources configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/569260 (https://phabricator.wikimedia.org/T256906) (owner: 10WMDE-leszek) [18:33:05] oooh, i have a whole 30 mins, lovely! [18:33:08] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Keep profile::logstash::collector::input_kafka_ssl_truststore_password for now [labs/private] - 10https://gerrit.wikimedia.org/r/609833 (owner: 10Ottomata) [18:33:40] (03Merged) 10jenkins-bot: Commons: Define entity sources configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/569260 (https://phabricator.wikimedia.org/T256906) (owner: 10WMDE-leszek) [18:35:46] (03CR) 10Herron: [C: 03+1] "LGTM! 
We should roll this out slowly to the logstash collectors" [puppet] - 10https://gerrit.wikimedia.org/r/589597 (https://phabricator.wikimedia.org/T116719) (owner: 10Ottomata) [18:38:41] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T256906 T256907 T256909 T254315 [[gerrit:569260]] Commons: Define entity sources configuration (duration: 00m 56s) [18:38:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:50] T256907: Call to a member function getDatabaseName() on null, when deploying entity source config to wikidata clients - https://phabricator.wikimedia.org/T256907 [18:38:50] T254315: entitysources: Directly create entitySources config for WMF production wikis - https://phabricator.wikimedia.org/T254315 [18:38:50] T256906: No namespace configured for MediaInfo entities, when deploying entity source config to wikidata clients - https://phabricator.wikimedia.org/T256906 [18:38:51] T256909: Call to a member function getSourceName() on null, when deploying entity source config to wikidata clients - https://phabricator.wikimedia.org/T256909 [18:39:34] (03PS6) 10Addshore: Wikidata client wikis: Define entity sources configuration (take 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608839 (https://phabricator.wikimedia.org/T254315) [18:40:13] James_F: around at all? 
[18:40:25] (03PS1) 10Andrew Bogott: Galera: use mariadb service name rather than mysql [puppet] - 10https://gerrit.wikimedia.org/r/609834 (https://phabricator.wikimedia.org/T257231) [18:40:40] (03CR) 10Ottomata: [C: 03+2] Collect eventgate error.validation topics into logstash [puppet] - 10https://gerrit.wikimedia.org/r/589597 (https://phabricator.wikimedia.org/T116719) (owner: 10Ottomata) [18:41:03] (03CR) 10Ottomata: [C: 03+2] Collect eventgate error.validation topics into logstash (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/589597 (https://phabricator.wikimedia.org/T116719) (owner: 10Ottomata) [18:41:09] (03CR) 10Addshore: [C: 03+2] Wikidata client wikis: Define entity sources configuration (take 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608839 (https://phabricator.wikimedia.org/T254315) (owner: 10Addshore) [18:42:01] (03Merged) 10jenkins-bot: Wikidata client wikis: Define entity sources configuration (take 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608839 (https://phabricator.wikimedia.org/T254315) (owner: 10Addshore) [18:44:25] * addshore stares at the logs...... 
[18:45:14] !log addshore@deploy1001 Synchronized wmf-config: T254315 Wikidata client wikis: Define entity sources configuration (take 2) [[gerrit:608839]] (duration: 00m 58s) [18:45:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:20] T254315: entitysources: Directly create entitySources config for WMF production wikis - https://phabricator.wikimedia.org/T254315 [18:47:03] !log addshore@deploy1001 Synchronized dblists/wikidataclient.dblist: T254315 Wikidata client wikis: Define entity sources configuration (take 2) [[gerrit:608839]] (duration: 00m 56s) [18:47:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:20] (03PS3) 10Addshore: Wikibase: stop using wmgUseEntitySourceBasedFederation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608944 (https://phabricator.wikimedia.org/T241975) [18:47:34] (03CR) 10Addshore: [C: 03+2] Wikibase: stop using wmgUseEntitySourceBasedFederation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608944 (https://phabricator.wikimedia.org/T241975) (owner: 10Addshore) [18:48:57] addshore: Hey, sorry, yes. [18:49:04] (03Merged) 10jenkins-bot: Wikibase: stop using wmgUseEntitySourceBasedFederation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608944 (https://phabricator.wikimedia.org/T241975) (owner: 10Addshore) [18:49:20] tis okay, I was going to ask if you thought something might break for a reaosn, but turns out it did not! [18:49:27] Good-o. [18:49:54] Production being down isn't technically my department any more (as of last Wednesday, it's now AbstractWikiLambdaPedia), but I still care. :-) [18:50:32] I see! 
I am also interested in that ;) [18:51:12] (03PS17) 10Addshore: Wikibase: Remove config option wmgUseEntitySourceBasedFederation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/569263 (https://phabricator.wikimedia.org/T241975) (owner: 10WMDE-leszek) [18:51:12] !log addshore@deploy1001 Synchronized wmf-config/Wikibase.php: [[gerrit:608944]] T241975 Wikibase: stop using wmgUseEntitySourceBasedFederation (duration: 00m 56s) [18:51:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:16] T241975: entitysources: Remove old MultiRepository & PerRepository Service containers and config - https://phabricator.wikimedia.org/T241975 [18:51:19] (03PS2) 10Dzahn: gerrit: Drop bot name, check date, and check duration from CI table [puppet] - 10https://gerrit.wikimedia.org/r/609742 (https://phabricator.wikimedia.org/T256575) (owner: 10QChris) [18:51:29] (03CR) 10Addshore: [C: 03+2] Wikibase: Remove config option wmgUseEntitySourceBasedFederation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/569263 (https://phabricator.wikimedia.org/T241975) (owner: 10WMDE-leszek) [18:51:52] I almost don't believe i have finally got these config changes out [18:52:02] 10Operations, 10Wikimedia-Mailing-lists: "Uncaught bounce notification" from Yahoo and AOL - https://phabricator.wikimedia.org/T257241 (10Gaurav) [18:52:14] James_F: is there an IRC channel for that btw, or is it all happening somewhere secret? ;) [18:52:23] (03Merged) 10jenkins-bot: Wikibase: Remove config option wmgUseEntitySourceBasedFederation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/569263 (https://phabricator.wikimedia.org/T241975) (owner: 10WMDE-leszek) [18:52:24] addshore: Congratulations. [18:52:47] We just had the first ever team meeting half an hour ago. One thing to discuss is what channels we want to have and where. [18:52:52] xD [18:52:59] I'll remember to ping you whenever we work that out. :-) [18:53:41] lovely! 
[18:54:16] !log addshore@deploy1001 sync-file aborted: [[gerrit:569263]] (duration: 00m 00s) [18:54:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:54:35] (03CR) 10Dzahn: [C: 03+2] gerrit: Drop bot name, check date, and check duration from CI table [puppet] - 10https://gerrit.wikimedia.org/r/609742 (https://phabricator.wikimedia.org/T256575) (owner: 10QChris) [18:54:38] (pasted a new line...) [18:54:38] Sorry, shouldn't distract you whilst you're deploying. ;-) [18:55:10] (03PS2) 10Dzahn: gerrit: For Zuul, bring only 'Main test build' jobs to CI table [puppet] - 10https://gerrit.wikimedia.org/r/609743 (https://phabricator.wikimedia.org/T256575) (owner: 10QChris) [18:55:25] !log addshore@deploy1001 Synchronized wmf-config: [[gerrit:569263]] T241975 Wikibase: Remove config option wmgUseEntitySourceBasedFederation (duration: 00m 58s) [18:55:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:55:29] MatmaRex: want to squeeze yours out of the door? [18:55:55] 10Operations, 10Wikimedia-Mailing-lists: "Uncaught bounce notification" from Yahoo and AOL - https://phabricator.wikimedia.org/T257241 (10Gaurav) [18:56:06] addshore: it already went out, unless we're thinking about different things [18:56:18] oh! :P (sorry I saw it at the end of list) awesome! 
[18:56:33] !log backport / deploy window done [18:56:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:57:11] (03CR) 10Dzahn: [C: 03+2] gerrit: For Zuul, bring only 'Main test build' jobs to CI table [puppet] - 10https://gerrit.wikimedia.org/r/609743 (https://phabricator.wikimedia.org/T256575) (owner: 10QChris) [18:57:27] (03PS1) 10CRusnov: puppetdb microservice: Change allowed hosts [puppet] - 10https://gerrit.wikimedia.org/r/609835 [18:58:10] (03PS2) 10Dzahn: gerrit: For SonarQube, bring only overall result to CI table as 'Sonar Cloud' [puppet] - 10https://gerrit.wikimedia.org/r/609744 (https://phabricator.wikimedia.org/T256575) (owner: 10QChris) [18:59:19] (03CR) 10Dzahn: [C: 03+2] gerrit: For SonarQube, bring only overall result to CI table as 'Sonar Cloud' [puppet] - 10https://gerrit.wikimedia.org/r/609744 (https://phabricator.wikimedia.org/T256575) (owner: 10QChris) [19:00:32] (03CR) 10CRusnov: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/609835 (owner: 10CRusnov) [19:01:00] (03PS2) 10Dzahn: gerrit: Add header to CI table [puppet] - 10https://gerrit.wikimedia.org/r/609745 (owner: 10QChris) [19:07:11] (03CR) 10Dzahn: [C: 03+2] "I am merging this because i strongly agree with Chris' comment above." [puppet] - 10https://gerrit.wikimedia.org/r/609745 (owner: 10QChris) [19:08:24] (03CR) 10Dzahn: [C: 03+2] gerrit: Add marker for empty CI table [puppet] - 10https://gerrit.wikimedia.org/r/609746 (owner: 10QChris) [19:08:30] (03PS3) 10Dzahn: gerrit: Add marker for empty CI table [puppet] - 10https://gerrit.wikimedia.org/r/609746 (owner: 10QChris) [19:10:42] 10Operations, 10LDAP-Access-Requests, 10WMF-Legal: Add Guergana Tzatchkova to the ldap/wmde group - https://phabricator.wikimedia.org/T256201 (10KFrancis) @guergana.tzatchkova Hello! This request is pending receipt of the following: -Full legal name -Mailing address -Email address -Specifics about the type... 
[19:16:07] (03CR) 10Herron: [C: 03+1] logstash: decom check_procs [puppet] - 10https://gerrit.wikimedia.org/r/609397 (https://phabricator.wikimedia.org/T234854) (owner: 10Filippo Giunchedi) [19:16:40] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10Patch-For-Review, 10Scoring-platform-team (Current): Production shell access for Chris Albon - https://phabricator.wikimedia.org/T256412 (10Nuria) Approved on my end [19:16:57] (03CR) 10Herron: [C: 03+1] mtail: remove component and upgrade mtail to 3.0.0-rc35-3~wmf2 across the fleet [puppet] - 10https://gerrit.wikimedia.org/r/608721 (https://phabricator.wikimedia.org/T255776) (owner: 10Cwhite) [19:17:02] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 52 probes of 568 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [19:18:04] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10Patch-For-Review, 10Scoring-platform-team (Current): Production shell access for Chris Albon - https://phabricator.wikimedia.org/T256412 (10Nuria) @calbon once this goes through please try: ssh to stat1007.eqiad.wmnet @ssingh chris is also g... 
[19:22:56] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 46 probes of 568 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [19:32:25] (03CR) 10Dzahn: [C: 03+2] gerrit: Remove no longer comment from CI table code [puppet] - 10https://gerrit.wikimedia.org/r/609747 (owner: 10QChris) [19:32:31] (03PS3) 10Dzahn: gerrit: Remove no longer comment from CI table code [puppet] - 10https://gerrit.wikimedia.org/r/609747 (owner: 10QChris) [19:33:35] (03PS1) 10Legoktm: mediawiki: Create /etc/firejail/mediawiki.local [puppet] - 10https://gerrit.wikimedia.org/r/609840 [19:37:34] (03CR) 10Dzahn: [C: 03+2] gerrit: Add Code Review logo as favicon [puppet] - 10https://gerrit.wikimedia.org/r/609764 (https://phabricator.wikimedia.org/T257218) (owner: 10QChris) [19:38:32] 10Operations, 10netops: cr3-eqsin disk 1 failure - https://phabricator.wikimedia.org/T257154 (10ayounsi) JTAC says that we need to RMA the whole chassis. I added @RobH to the email thread to take care of shipping. [19:41:45] !log Enabling puppet on gerrit1002 again to catch up with puppetmaster. 
[19:41:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:53:37] (03PS2) 10Ssingh: admin: add Chris Albon to production shell users [puppet] - 10https://gerrit.wikimedia.org/r/609217 (https://phabricator.wikimedia.org/T256412) [19:54:30] (03Abandoned) 10Paladox: Scap: Fix target to be able to set manage_user to false [puppet] - 10https://gerrit.wikimedia.org/r/565713 (owner: 10Paladox) [19:55:21] (03Abandoned) 10Paladox: Remove 4 plugins [software/gerrit] (wmf/stable-2.16) - 10https://gerrit.wikimedia.org/r/593852 (owner: 10Paladox) [19:55:45] (03Abandoned) 10Paladox: WIP: Update gerrit to 2.16.13 [software/gerrit] (deploy/wmf/stable-2.16) - 10https://gerrit.wikimedia.org/r/495012 (owner: 10Paladox) [20:00:05] halfak and accraze: May I have your attention please! Services – Graphoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200706T2000) [20:02:57] 10Operations, 10netops: cr3-eqsin disk 1 failure - https://phabricator.wikimedia.org/T257154 (10RobH) [20:03:23] 10Operations, 10ops-eqsin, 10netops: cr3-eqsin disk 1 failure - https://phabricator.wikimedia.org/T257154 (10RobH) [20:13:17] (03PS1) 10Ottomata: Remove unused profile::logstash::collector::input_kafka_ssl_truststore_password [labs/private] - 10https://gerrit.wikimedia.org/r/609844 [20:15:20] (03CR) 10Ssingh: "Rebasing patch on top of production; no code change." 
[puppet] - 10https://gerrit.wikimedia.org/r/609217 (https://phabricator.wikimedia.org/T256412) (owner: 10Ssingh) [20:15:24] (03CR) 10Ssingh: [C: 03+2] admin: add Chris Albon to production shell users [puppet] - 10https://gerrit.wikimedia.org/r/609217 (https://phabricator.wikimedia.org/T256412) (owner: 10Ssingh) [20:16:44] (03PS1) 10Paladox: gerrit: (Devtools) Add deploy-1002 to gerrit::servers [puppet] - 10https://gerrit.wikimedia.org/r/609635 [20:17:41] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10Patch-For-Review, 10Scoring-platform-team (Current): Production shell access for Chris Albon - https://phabricator.wikimedia.org/T256412 (10ssingh) [20:17:44] (03PS2) 10Paladox: gerrit: (Devtools) Add deploy-1002 to gerrit::servers [puppet] - 10https://gerrit.wikimedia.org/r/609635 [20:17:52] (03CR) 10Paladox: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/609635 (owner: 10Paladox) [20:19:21] (03CR) 10Dzahn: [C: 04-1] "nah, that's not a gerrit server, that's a deployment server. what are we fixing?" [puppet] - 10https://gerrit.wikimedia.org/r/609635 (owner: 10Paladox) [20:20:10] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10Patch-For-Review, 10Scoring-platform-team (Current): Production shell access for Chris Albon - https://phabricator.wikimedia.org/T256412 (10ssingh) >>! In T256412#6283027, @Nuria wrote: > @calbon once this goes through please try: ssh to stat... 
[20:20:25] (03PS3) 10Paladox: gerrit: (Devtools) Add deploy-1002 to gerrit::servers [puppet] - 10https://gerrit.wikimedia.org/r/609635 [20:20:56] (03CR) 10Paladox: "@Dzahn updated commit message" [puppet] - 10https://gerrit.wikimedia.org/r/609635 (owner: 10Paladox) [20:24:36] 10Operations, 10Commons, 10ConfirmEdit (CAPTCHA extension), 10Editing-team, and 5 others: GenerateFancyCaptchas.php crashes with "FormatJson.php: File not found in" after 1000 iterations - https://phabricator.wikimedia.org/T230245 (10Krinkle) a:03aaron [20:29:31] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: 2020-06-20) rack/setup/install cloudcephosd10[04-15].wikimedia.org - https://phabricator.wikimedia.org/T251619 (10Cmjohnson) [20:30:14] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: 2020-06-20) rack/setup/install cloudcephosd10[04-15].wikimedia.org - https://phabricator.wikimedia.org/T251619 (10Cmjohnson) The network switches still need to be connected to the network, in the meantime, everything will be c... [20:34:59] 10Operations, 10MediaWiki-General, 10serviceops-radar, 10Performance-Team (Radar), and 3 others: Move MainStash out of Redis to a simpler multi-dc aware solution - https://phabricator.wikimedia.org/T212129 (10Krinkle) [20:35:13] 10Operations, 10MediaWiki-General, 10serviceops-radar, 10Performance-Team (Radar), and 3 others: Move MainStash out of Redis to a simpler multi-dc aware solution - https://phabricator.wikimedia.org/T212129 (10Krinkle) >>! In T229062#6186134, @Krinkle wrote: > Telemetry on current data size and read/write a... [20:36:14] 10Operations, 10MediaWiki-General, 10serviceops-radar, 10Performance-Team (Radar), and 3 others: Move MainStash out of Redis to a simpler multi-dc aware solution - https://phabricator.wikimedia.org/T212129 (10Krinkle) 05Open→03Stalled This is pending hardware for the mini extdb cluster for mainstash/db... 
[20:36:17] 10Operations, 10Core Platform Team, 10Performance-Team, 10Goal: Decommission the "session redis" cluster - https://phabricator.wikimedia.org/T243520 (10Krinkle) [20:51:16] PROBLEM - MegaRAID on db1131 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [20:51:18] ACKNOWLEDGEMENT - MegaRAID on db1131 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T257253 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [20:51:20] 10Operations, 10ops-eqiad: Degraded RAID on db1131 - https://phabricator.wikimedia.org/T257253 (10ops-monitoring-bot) [20:58:22] (03PS1) 10Paladox: devtools: Set deployment_hosts to deploy1001 ip [puppet] - 10https://gerrit.wikimedia.org/r/609636 [20:59:44] (03PS2) 10Paladox: devtools: Set deployment_hosts to deploy-1002 ip [puppet] - 10https://gerrit.wikimedia.org/r/609636 [21:00:02] (03PS3) 10Paladox: devtools: Set deployment_hosts to deploy-1002 ip [puppet] - 10https://gerrit.wikimedia.org/r/609636 [21:00:04] Reedy and sbassett: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Weekly Security deployment window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200706T2100). [21:00:24] (03CR) 10Paladox: "This change is ready for review." 
[puppet] - 10https://gerrit.wikimedia.org/r/609636 (owner: 10Paladox) [21:01:08] (03Abandoned) 10Paladox: gerrit: (Devtools) Add deploy-1002 to gerrit::servers [puppet] - 10https://gerrit.wikimedia.org/r/609635 (owner: 10Paladox) [21:02:55] (03CR) 10Dzahn: [C: 03+2] devtools: Set deployment_hosts to deploy-1002 ip [puppet] - 10https://gerrit.wikimedia.org/r/609636 (owner: 10Paladox) [21:10:22] (03CR) 10Dzahn: "This change broke puppetmasters in cloud VPS:" [puppet] - 10https://gerrit.wikimedia.org/r/602732 (https://phabricator.wikimedia.org/T254480) (owner: 10Jbond) [21:14:32] 10Operations: Update annual.wikimedia.org redirect to point to latest Annual Report - https://phabricator.wikimedia.org/T257257 (10spatton) [21:15:29] (03PS1) 10Paladox: devtools: Fix setting puppetmaster::servers [puppet] - 10https://gerrit.wikimedia.org/r/609637 [21:16:25] (03PS2) 10Paladox: devtools: Fix setting puppetmaster::servers [puppet] - 10https://gerrit.wikimedia.org/r/609637 [21:17:23] (03CR) 10Dzahn: [C: 03+2] devtools: Fix setting puppetmaster::servers [puppet] - 10https://gerrit.wikimedia.org/r/609637 (owner: 10Paladox) [21:17:35] (03CR) 10Paladox: "This change is ready for review." 
[puppet] - 10https://gerrit.wikimedia.org/r/609637 (owner: 10Paladox) [21:19:13] (03CR) 10Dzahn: "> Patch Set 4:" [puppet] - 10https://gerrit.wikimedia.org/r/602732 (https://phabricator.wikimedia.org/T254480) (owner: 10Jbond) [21:25:43] (03CR) 10Jbond: "> Patch Set 4:" [puppet] - 10https://gerrit.wikimedia.org/r/602732 (https://phabricator.wikimedia.org/T254480) (owner: 10Jbond) [21:35:44] (03Abandoned) 10Addshore: Wikidata/Wikibase: use entity source Wikibase setting for all wikibase-enabled wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/569261 (https://phabricator.wikimedia.org/T242087) (owner: 10WMDE-leszek) [21:37:19] !log importing jenkins 2.235.1 into APT repo for both stretch and buster T256980 [21:37:21] 10Operations, 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: Create a profile to standardize the deployment of JVM packages and configurations - https://phabricator.wikimedia.org/T253553 (10hashar) Thank you @elukey for the new Puppet Java profile and for taking in accounts suggestions for t... [21:37:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:37:25] T256980: Please import Jenkins Debian package 2.235.1 in apt.wikimedia.org - https://phabricator.wikimedia.org/T256980 [21:37:46] mutante: Danke! :) [21:38:10] hashar: on releases i am about to just upgrade it as well .. on contint i let you do it? [21:38:36] i can already see the new version when simulating apt-get upgrade [21:39:50] I think I wanted to pair it with others from #releng [21:40:04] then I haven't set such arrangement so ;D [21:40:23] releases1001 hosts the Jenkins that runs the branch cut but it should just work. 
The job is super simple [21:40:40] and contint hosts yeah I will do them tomorrow morning when CI is not too busy [21:41:05] mutante: we got a deal :] [21:41:47] !log upgrading jenkins on releases1001 and releases2001 (T256980) [21:41:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:41:56] oh they are under `main` but should be `thirdparty/ci` [21:42:03] hashar: c'est fait [21:42:08] which supposedly reprepro should take care of ? [21:42:38] hashar: i think we had that discussion last time as well but this is how it works? [21:42:45] i followed my own docs :p [21:43:10] I have no idea how reprepro work, I just have the doc at https://wikitech.wikimedia.org/wiki/Jenkins#Upgrading [21:43:22] * hashar amends to drop jessie [21:43:32] ah no I can't login to the wikis bah [21:45:25] "You can search for the .deb package and re-include it in the repository, in the right component" [21:45:48] hashar: i repeated it with the component [21:45:58] -C thirdparty/ci parameter [21:46:39] could you update the wiki doc with whatever command works please? ;) [21:48:38] done. https://wikitech.wikimedia.org/w/index.php?title=Jenkins&type=revision&diff=1872399&oldid=1863363 [21:49:01] (03PS1) 10Jdrewniak: Enable Quicksurveys for Desktop Improvements Project. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609850 (https://phabricator.wikimedia.org/T246977) [21:49:32] ohh [21:49:45] reprepro is unable to do the right thing out of the box is it? [21:50:06] yea, well.. you need to tell it the name of the component [21:50:17] 10Operations: Please import Jenkins Debian package 2.235.1 in apt.wikimedia.org - https://phabricator.wikimedia.org/T256980 (10hashar) 05Open→03Resolved a:03Dzahn Done by @Dzahn [21:50:27] but if you skip it it does not mean you can't install it.. 
it still works [21:51:22] ok ok ;) [21:52:49] !log Upgraded Jenkins on releases1002 and releases2002 # T256978 [21:52:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:52:54] T256978: Upgrade Jenkins instances to 2.235.1 - https://phabricator.wikimedia.org/T256978 [21:55:05] !log ryankemper@deploy1001 Started deploy [wdqs/wdqs@65502b2]: 0.3.40 [21:55:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:58:06] (03CR) 10Bstorm: "Is it possible to test this in codfw first?" [puppet] - 10https://gerrit.wikimedia.org/r/609834 (https://phabricator.wikimedia.org/T257231) (owner: 10Andrew Bogott) [21:59:33] (03CR) 10Andrew Bogott: "I haven't tested the puppetization, but I tested stopping mysql, starting mariadb, and then confirming that the host re-joined the existin" [puppet] - 10https://gerrit.wikimedia.org/r/609834 (https://phabricator.wikimedia.org/T257231) (owner: 10Andrew Bogott) [21:59:58] (03CR) 10Dzahn: "> there is a sane default for in the cloud hiera data" [puppet] - 10https://gerrit.wikimedia.org/r/602732 (https://phabricator.wikimedia.org/T254480) (owner: 10Jbond) [22:03:15] (03CR) 10Bstorm: [C: 03+1] "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/609834 (https://phabricator.wikimedia.org/T257231) (owner: 10Andrew Bogott) [22:03:27] (03CR) 10Dzahn: "If it's a proper systemd service I would recommend to use systemd::service, maybe in a follow-up." [puppet] - 10https://gerrit.wikimedia.org/r/609834 (https://phabricator.wikimedia.org/T257231) (owner: 10Andrew Bogott) [22:04:19] (03CR) 10Bstorm: [C: 03+1] "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/609834 (https://phabricator.wikimedia.org/T257231) (owner: 10Andrew Bogott) [22:06:34] (03CR) 10Andrew Bogott: "Yeah, I'm confused -- I thought that systemd::service was made to provision/configure the service, not just start and stop. 
Won't it clob" [puppet] - 10https://gerrit.wikimedia.org/r/609834 (https://phabricator.wikimedia.org/T257231) (owner: 10Andrew Bogott) [22:09:31] (03CR) 10Dzahn: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/609834 (https://phabricator.wikimedia.org/T257231) (owner: 10Andrew Bogott) [22:13:07] (03CR) 10Dzahn: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/609834 (https://phabricator.wikimedia.org/T257231) (owner: 10Andrew Bogott) [22:14:03] !log ryankemper@deploy1001 Finished deploy [wdqs/wdqs@65502b2]: 0.3.40 (duration: 18m 58s) [22:14:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:14:37] (03CR) 10Dzahn: "even if sticking to just service{} i think at least the "provider =>" parameter would have to be added and set to systemd" [puppet] - 10https://gerrit.wikimedia.org/r/609834 (https://phabricator.wikimedia.org/T257231) (owner: 10Andrew Bogott) [22:21:36] 10Operations, 10LDAP-Access-Requests, 10observability, 10serviceops, 10Patch-For-Review: Grant Access to Logstash to Peter(peter.ovchyn@speedandfunction.com) - https://phabricator.wikimedia.org/T249037 (10KFrancis) @amooney @jcrespo Hi all, I checked with Jim Buatti (Sr. Legal Counsel) to get an update.... 
[22:29:06] 10Operations, 10Wikimedia-Mailing-lists, 10Cloud-VPS (Project-requests): Request creation of mailman VPS project - https://phabricator.wikimedia.org/T257270 (10Ladsgroup) [22:29:17] 10Operations, 10Wikimedia-Mailing-lists, 10Cloud-VPS (Project-requests): Request creation of mailman VPS project - https://phabricator.wikimedia.org/T257270 (10Ladsgroup) [22:29:23] 10Operations, 10Security-Team, 10Wikimedia-Mailing-lists: Upgrade GNU Mailman from 2.1 to 3.3 - https://phabricator.wikimedia.org/T52864 (10Ladsgroup) [22:38:41] (03CR) 10Bstorm: [C: 03+1] "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/609834 (https://phabricator.wikimedia.org/T257231) (owner: 10Andrew Bogott) [22:51:40] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [22:52:21] (03PS1) 10Andrew Bogott: Dynamic proxy api: in update_mapping, accept everything as JSON [puppet] - 10https://gerrit.wikimedia.org/r/609853 (https://phabricator.wikimedia.org/T140391) [22:53:30] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [22:54:32] (03PS2) 10Andrew Bogott: Dynamic proxy api: in update_mapping, accept everything as JSON [puppet] - 10https://gerrit.wikimedia.org/r/609853 (https://phabricator.wikimedia.org/T140391) [22:54:49] (03CR) 10Ladsgroup: "This broke commons" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/569260 (https://phabricator.wikimedia.org/T256906) (owner: 10WMDE-leszek) [22:55:42] (03CR) 10Andrew Bogott: [C: 03+2] Dynamic proxy api: in update_mapping, accept everything as JSON [puppet] - 10https://gerrit.wikimedia.org/r/609853 (https://phabricator.wikimedia.org/T140391) (owner: 10Andrew Bogott) [23:00:04] RoanKattouw, Niharika, and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy Evening backport window(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200706T2300). [23:00:04] VulpesVulpes825 and Jdlrobson: A patch you scheduled for Evening backport window(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:01:17] I chose a bad time to go for a walk :/ hoping someone else can deploy [23:05:04] here [23:05:12] if anyone is available for deploy:) [23:05:15] not end of world if not [23:07:20] 10Operations, 10LDAP-Access-Requests, 10WMF-Legal: Add Guergana Tzatchkova to the ldap/wmde group - https://phabricator.wikimedia.org/T256201 (10KFrancis) @jcrespo @guergana.tzatchkova Hi there, after reviewing the request again, I have most of the information I need for the NDA... all that is missing is Gue... [23:08:19] Sorry for being a little bit late. I am currently waiting for SWAT deployment of the patches I scheduled. 
[23:08:53] (03PS1) 10BryanDavis: Remove --canonical argument to webservice [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/609855 (https://phabricator.wikimedia.org/T234617) [23:08:55] (03PS1) 10BryanDavis: kubernetes: remove legacy ingress generation [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/609856 (https://phabricator.wikimedia.org/T234617) [23:08:57] (03PS1) 10BryanDavis: Remove $HOME/.webservicerc support [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/609857 (https://phabricator.wikimedia.org/T257229) [23:09:10] 10Operations, 10LDAP-Access-Requests: Add Conny Kawohl to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T257038 (10KFrancis) @jcrespo after reviewing the request again, all I need is Nina Cornelia Kawohl's WMDE email address. Please provide that ASAP and I'll get the NDA processed! Tha... [23:13:19] VulpesVulpes825: I went for a walk at the wrong time so I'm far away from my computer, and it doesn't seem like anyone else is around :/ [23:13:35] I'll walk back home but it'll take me about half an hour [23:13:44] I can do it [23:13:52] Oh thanks Reedy [23:14:53] Reedy & RoanKattouw: Great! 
[23:15:01] (03PS2) 10Reedy: Change the remaining namespace default aliases from Chinese to English for Zh Projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609508 (https://phabricator.wikimedia.org/T257101) (owner: 10VulpesVulpes825) [23:15:08] (03CR) 10Reedy: [C: 03+2] Change the remaining namespace default aliases from Chinese to English for Zh Projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609508 (https://phabricator.wikimedia.org/T257101) (owner: 10VulpesVulpes825) [23:15:15] (03PS2) 10Dave Pifke: Swift object servers: enable object-expirer [puppet] - 10https://gerrit.wikimedia.org/r/601429 (https://phabricator.wikimedia.org/T229584) [23:15:52] (03Merged) 10jenkins-bot: Change the remaining namespace default aliases from Chinese to English for Zh Projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609508 (https://phabricator.wikimedia.org/T257101) (owner: 10VulpesVulpes825) [23:16:51] Reedy: Do you want me to test this patch? [23:18:30] (03PS2) 10Reedy: Don't index NS_USER on hywiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609559 (https://phabricator.wikimedia.org/T257112) (owner: 10VulpesVulpes825) [23:18:45] (03CR) 10Reedy: [C: 03+2] Don't index NS_USER on hywiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609559 (https://phabricator.wikimedia.org/T257112) (owner: 10VulpesVulpes825) [23:19:29] (03Merged) 10jenkins-bot: Don't index NS_USER on hywiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609559 (https://phabricator.wikimedia.org/T257112) (owner: 10VulpesVulpes825) [23:20:27] VulpesVulpes825: they don't need testing. it's fairly obvious :) [23:20:56] Reedy: Great, thank you so much for your help! [23:22:40] !log reedy@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Change some zh canonical namespaces. 
Don't index NS_USER on hywiki (duration: 00m 58s) [23:22:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:24:06] hey Reedy no testing necessary on mine really [23:24:08] 10Operations, 10SRE-OnFire, 10Sustainability (Incident Prevention): Investigate whether we can automatically share incident status docs with WMDE - https://phabricator.wikimedia.org/T244395 (10Ladsgroup) Ping :) [23:24:57] (03PS2) 10Reedy: Enable sidebar instrumentation on test wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609831 (https://phabricator.wikimedia.org/T256992) (owner: 10Jdlrobson) [23:25:59] (03CR) 10Reedy: [C: 03+2] Enable sidebar instrumentation on test wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609831 (https://phabricator.wikimedia.org/T256992) (owner: 10Jdlrobson) [23:26:47] (03Merged) 10jenkins-bot: Enable sidebar instrumentation on test wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609831 (https://phabricator.wikimedia.org/T256992) (owner: 10Jdlrobson) [23:26:51] (03PS2) 10BryanDavis: Remove --canonical argument to webservice [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/609855 (https://phabricator.wikimedia.org/T234617) [23:26:53] (03PS2) 10BryanDavis: kubernetes: remove legacy ingress generation [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/609856 (https://phabricator.wikimedia.org/T234617) [23:26:55] (03PS2) 10BryanDavis: Remove $HOME/.webservicerc support [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/609857 (https://phabricator.wikimedia.org/T257229) [23:29:24] thx Reedy [23:32:37] (03PS3) 10BryanDavis: Remove --canonical argument to webservice [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/609855 (https://phabricator.wikimedia.org/T234617) [23:32:39] (03PS3) 10BryanDavis: kubernetes: remove legacy ingress generation [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/609856 
(https://phabricator.wikimedia.org/T234617) [23:32:41] !log reedy@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable sidebar instrumentation on test wikipedia (duration: 00m 56s) [23:32:41] (03PS3) 10BryanDavis: Remove $HOME/.webservicerc support [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/609857 (https://phabricator.wikimedia.org/T257229) [23:32:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:33:42] (03CR) 10CRusnov: "Hello! In order to verify if we are able to proceed as is, is it okay that reth0-2140.pfw3-codfw will gain an A record after this change i" [dns] - 10https://gerrit.wikimedia.org/r/607741 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans) [23:49:32] (03CR) 10Dzahn: "> package has a backward compatibility layer with sysv in it." [puppet] - 10https://gerrit.wikimedia.org/r/609834 (https://phabricator.wikimedia.org/T257231) (owner: 10Andrew Bogott) [23:50:30] (03CR) 10Bstorm: [C: 03+1] "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/609834 (https://phabricator.wikimedia.org/T257231) (owner: 10Andrew Bogott) [23:54:20] 10Operations, 10LDAP-Access-Requests: Add Conny Kawohl to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T257038 (10Dzahn) @KFrancis It's conny.kawohl (at) wikimedia.de