[00:00:32] PROBLEM - Host ganeti5002 is DOWN: PING CRITICAL - Packet loss = 100% [00:15:24] RECOVERY - Packet loss ratio for UDP on logstash1007 is OK: (C)0.1 ge (W)0.05 ge 0.03748 https://wikitech.wikimedia.org/wiki/Logstash https://grafana.wikimedia.org/dashboard/db/logstash [00:32:04] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:32:12] PROBLEM - BFD status on cr2-eqdfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [00:32:22] PROBLEM - BFD status on cr3-knams is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [00:32:24] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [00:32:56] PROBLEM - OSPF status on cr3-knams is CRITICAL: OSPFv2: 2/4 UP : OSPFv3: 2/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:33:04] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [00:33:26] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:34:00] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [00:38:48] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [00:40:24] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [01:20:10] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:31:38] ACKNOWLEDGEMENT - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 1 CDanis GTT maintenance #5429586 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [01:31:38] ACKNOWLEDGEMENT - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP CDanis GTT maintenance #5429586 https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:31:38] ACKNOWLEDGEMENT - BFD status on cr2-eqdfw is CRITICAL: CRIT: Down: 1 CDanis GTT maintenance #5429586 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [01:31:38] ACKNOWLEDGEMENT - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP CDanis GTT maintenance #5429586 https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:31:38] ACKNOWLEDGEMENT - BFD status on cr3-knams is CRITICAL: CRIT: Down: 2 CDanis GTT maintenance #5429586 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [01:31:38] ACKNOWLEDGEMENT - OSPF status on cr3-knams is CRITICAL: OSPFv2: 2/4 UP : OSPFv3: 2/4 UP CDanis GTT maintenance #5429586 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:44:44] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:58:03] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 2020-09-15) rack/setup/install db1150 (see note on hostname) - https://phabricator.wikimedia.org/T260817 (10wiki_willy) a:05Jclark-ctr→03Cmjohnson Hi @jcrespo - the host is racked, and the ETA for completion by @Cmjohnson and @RobH is next Wednesday... [01:58:42] 10Operations, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: (Need By: 2020-09-30) rack/setup/install frmx1001 & frdata1002 - https://phabricator.wikimedia.org/T260181 (10wiki_willy) a:05Jclark-ctr→03Cmjohnson [02:00:32] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 2020-08-31) rack/setup/install es10[26-34].eqiad.wmnet - https://phabricator.wikimedia.org/T260370 (10wiki_willy) a:05Jclark-ctr→03Cmjohnson Confirmed with @Cmjohnson and @RobH today, that these es1026-1034 hosts will be ready for you by end of Octo... 
[02:01:40] 10Operations, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: (Need By: 2020-09-30) rack/setup/install frdb1004.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T260379 (10wiki_willy) a:05Jclark-ctr→03Cmjohnson [02:02:14] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 130, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [02:02:54] PROBLEM - Router interfaces on cr2-eqdfw is CRITICAL: CRITICAL: host 208.80.153.198, interfaces up: 62, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [02:12:00] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:12:16] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:12:28] RECOVERY - BFD status on cr2-eqdfw is OK: OK: UP: 8 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [02:12:40] RECOVERY - BFD status on cr3-knams is OK: OK: UP: 8 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [02:13:10] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [02:13:12] RECOVERY - OSPF status on cr3-knams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:14:16] !log ppchelko@deploy1001 Started deploy [restbase/deploy@4eaad8f]: eqiad-only, T263798 [02:14:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:14:25] T263798: /feed/onthisday/selected latency is very high - https://phabricator.wikimedia.org/T263798 [02:20:24] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@4eaad8f]: 
eqiad-only, T263798 (duration: 06m 09s) [02:20:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:20:31] T263798: /feed/onthisday/selected latency is very high - https://phabricator.wikimedia.org/T263798 [02:20:39] !log ppchelko@deploy1001 Started deploy [restbase/deploy@4eaad8f]: new codfw, T263798 [02:20:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:27:42] !log andrew@deploy1001 Started deploy [horizon/deploy@7b61460]: (no justification provided) [02:27:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:27:50] !log andrew@deploy1001 Finished deploy [horizon/deploy@7b61460]: (no justification provided) (duration: 00m 07s) [02:27:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:27:58] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={swagger_check_cxserver_cluster_codfw,swagger_check_wikifeeds_codfw} site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [02:28:36] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:28:38] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:29:44] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@4eaad8f]: new codfw, T263798 (duration: 09m 05s) [02:29:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:29:51] T263798: /feed/onthisday/selected latency is very high - 
https://phabricator.wikimedia.org/T263798 [02:30:00] (03CR) 10DannyS712: "@Ahmon Dancy - rebasing after giving CR+2 results in Jenkins running the tests and giving V+2, but not submitting. Should this have been s" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/621589 (owner: 10Ahmon Dancy) [02:30:04] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:30:06] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:32:44] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [02:39:18] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:39:30] PROBLEM - BFD status on cr2-eqdfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [02:39:44] PROBLEM - BFD status on cr3-knams is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [02:40:12] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [02:40:18] PROBLEM - OSPF status on cr3-knams is CRITICAL: OSPFv2: 2/4 UP : OSPFv3: 2/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:40:42] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:45:32] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:45:44] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 5/5 
UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:45:56] RECOVERY - BFD status on cr2-eqdfw is OK: OK: UP: 8 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [02:46:12] RECOVERY - BFD status on cr3-knams is OK: OK: UP: 8 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [02:46:38] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [02:46:42] RECOVERY - OSPF status on cr3-knams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [04:21:12] PROBLEM - Backup freshness on backup1001 is CRITICAL: All failures: 3 (install3001, ...), Fresh: 101 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [05:12:30] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:12:54] RECOVERY - Router interfaces on cr2-eqdfw is OK: OK: host 208.80.153.198, interfaces up: 64, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:21:50] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 104 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [05:33:26] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 130, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:33:32] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [05:36:40] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 132, down: 0, dormant: 0, excluded: 0, 
unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:36:46] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [05:43:34] PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [05:43:34] PROBLEM - Router interfaces on cr2-eqdfw is CRITICAL: CRITICAL: host 208.80.153.198, interfaces up: 62, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:48:22] RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [05:48:24] RECOVERY - Router interfaces on cr2-eqdfw is OK: OK: host 208.80.153.198, interfaces up: 64, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:55:15] ok so there seems to be maintenance in codfw cyrus one [05:55:37] let's keep an eye on all these alerts though [06:12:46] (03PS14) 10Elukey: profile::hadoop::common: get the datanode mountpoints from facter [puppet] - 10https://gerrit.wikimedia.org/r/629647 [06:29:51] (03PS15) 10Elukey: profile::hadoop::common: get the datanode mountpoints from facter [puppet] - 10https://gerrit.wikimedia.org/r/629647 [06:30:57] (03CR) 10jerkins-bot: [V: 04-1] profile::hadoop::common: get the datanode mountpoints from facter [puppet] - 10https://gerrit.wikimedia.org/r/629647 (owner: 10Elukey) [06:32:08] 10Operations, 10MW-on-K8s, 10TechCom-RFC, 10serviceops, 10Patch-For-Review: RFC: PHP microservice for containerized shell execution - https://phabricator.wikimedia.org/T260330 (10tstarling) I uploaded the Score changes to give you an idea of what a moderately complex caller looks like in practice. It's l... 
[06:33:36] (03PS16) 10Elukey: profile::hadoop::common: get the datanode mountpoints from facter [puppet] - 10https://gerrit.wikimedia.org/r/629647 [06:34:07] !log reboot stat1004 to pick up kernel settings [06:34:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:40:36] !log powercycle ganeti5002 (no instances running on it, mgmt console shows no tty usable) [06:40:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:44:04] (03PS17) 10Elukey: profile::hadoop::common: get the datanode mountpoints from facter [puppet] - 10https://gerrit.wikimedia.org/r/629647 [06:44:48] RECOVERY - Host ganeti5002 is UP: PING OK - Packet loss = 0%, RTA = 225.28 ms [06:45:32] PROBLEM - puppet last run on ganeti5002 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:50:41] !log shutdown ganeti5002 (mistakenly powercycled it without seeing T261130) [06:50:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:50:48] T261130: ganeti5002 was down / powered off, machine check entries in SEL - https://phabricator.wikimedia.org/T261130 [06:51:06] RECOVERY - puppet last run on ganeti5002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:53:16] 10Operations, 10ops-eqsin, 10serviceops: ganeti5002 was down / powered off, machine check entries in SEL - https://phabricator.wikimedia.org/T261130 (10elukey) Added a week of downtime, sorry for the powercycle :( [06:55:38] PROBLEM - Host mr1-codfw.oob is DOWN: PING CRITICAL - Packet loss = 100% [06:57:01] (03PS1) 10Muehlenhoff: Remove obsolete nda_audit script [puppet] - 10https://gerrit.wikimedia.org/r/630024 (https://phabricator.wikimedia.org/T247364) [06:57:08] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 132, down: 1, dormant: 0, 
excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:57:34] PROBLEM - Router interfaces on cr2-eqdfw is CRITICAL: CRITICAL: host 208.80.153.198, interfaces up: 62, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:58:16] PROBLEM - Router interfaces on mr1-codfw is CRITICAL: CRITICAL: host 208.80.153.196, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200925T0700) [07:05:14] 10Operations, 10Traffic: backport ipvsadm>=1.30 to buster-wikimedia or buster-backports - https://phabricator.wikimedia.org/T263788 (10MoritzMuehlenhoff) JFTR; 1:1.31-1+deb10u1 is not an ideal version for an internal backport; better use 1:1.31-1~deb10u1 or 1:1.31-0+deb10u1 for future backports: If there's no... 
[07:05:16] (03CR) 10Elukey: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/25428/" [puppet] - 10https://gerrit.wikimedia.org/r/629647 (owner: 10Elukey) [07:08:25] (03CR) 10Muehlenhoff: [C: 03+2] Remove stretch-backports from bootstrapvz config [puppet] - 10https://gerrit.wikimedia.org/r/610121 (https://phabricator.wikimedia.org/T256881) (owner: 10Muehlenhoff) [07:10:24] PROBLEM - BFD status on cr2-eqdfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [07:10:50] PROBLEM - BFD status on cr3-knams is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [07:11:12] PROBLEM - OSPF status on cr3-knams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:11:38] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:12:00] RECOVERY - BFD status on cr2-eqdfw is OK: OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [07:13:05] this should still be the maintenance on cyrus one, all links seem to be related to eqdfw [07:16:24] (03PS5) 10Muehlenhoff: reboot-groups [cookbooks] - 10https://gerrit.wikimedia.org/r/625597 [07:16:27] (03PS6) 10Muehlenhoff: reboot-groups [cookbooks] - 10https://gerrit.wikimedia.org/r/625597 [07:24:02] RECOVERY - OSPF status on cr3-knams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:24:28] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:25:16] RECOVERY - BFD status on cr3-knams is OK: OK: UP: 8 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [07:43:32] (03PS1) 10Elukey: Re-add host level hiera settings for analytics1057 [puppet] - 
10https://gerrit.wikimedia.org/r/630040 [07:43:58] (03CR) 10Elukey: [C: 03+2] Re-add host level hiera settings for analytics1057 [puppet] - 10https://gerrit.wikimedia.org/r/630040 (owner: 10Elukey) [07:50:15] (03PS1) 10Elukey: cdh: sort list of yarn/hdfs partitions in erb templates [puppet] - 10https://gerrit.wikimedia.org/r/630041 [07:57:01] (03PS5) 10Gehel: Extracting obvious reporting code to a Reporter class. [software/cumin] - 10https://gerrit.wikimedia.org/r/626660 (https://phabricator.wikimedia.org/T212783) [07:57:01] (03PS3) 10Gehel: Enable cumin EventHandler to disable output. [software/cumin] - 10https://gerrit.wikimedia.org/r/628315 (https://phabricator.wikimedia.org/T212783) [07:58:40] (03PS1) 10Muehlenhoff: Enable managed sources.list on sretest1001 [puppet] - 10https://gerrit.wikimedia.org/r/630066 (https://phabricator.wikimedia.org/T156562) [07:59:38] (03PS2) 10Elukey: cdh: sort list of yarn partitions in its erb template [puppet] - 10https://gerrit.wikimedia.org/r/630041 [08:01:47] (03CR) 10Elukey: [C: 03+2] cdh: sort list of yarn partitions in its erb template [puppet] - 10https://gerrit.wikimedia.org/r/630041 (owner: 10Elukey) [08:11:16] 10Operations, 10Cloud-Services, 10Traffic: cloudweb2001-dev: add TLS termination - https://phabricator.wikimedia.org/T263829 (10ema) [08:12:18] 10Operations, 10Continuous-Integration-Infrastructure, 10Traffic: contint.wikimedia.org: add TLS termination - https://phabricator.wikimedia.org/T263830 (10ema) [08:13:03] 10Operations, 10Product-Analytics, 10Wikimedia-General-or-Unknown, 10Readers-Web-Backlog (Needs Product Owner Decisions), 10SEO: Yoruba Language Wikipedia not being indexed by search engines - https://phabricator.wikimedia.org/T236241 (10Aklapper) >>! In T236241#6188649, @Formatierer wrote: > But Google... 
[08:15:25] 10Operations, 10Traffic, 10serviceops: puppetmaster[12]001: add TLS termination - https://phabricator.wikimedia.org/T263831 (10ema) [08:24:20] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review, 10User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by kormat on cumin1001.eqiad.wmnet for hosts: ` ['db2125.codfw.wmnet'] ` The log can... [08:24:47] (03CR) 10Muehlenhoff: [C: 03+2] Enable managed sources.list on sretest1001 [puppet] - 10https://gerrit.wikimedia.org/r/630066 (https://phabricator.wikimedia.org/T156562) (owner: 10Muehlenhoff) [08:25:43] 10Operations, 10Traffic, 10HTTPS, 10codfw-rollout: HTTPS for internal service traffic - https://phabricator.wikimedia.org/T108580 (10ema) >>! In T108580#6488253, @BBlack wrote: > Do we need to clean these up in some new subtasks Yup, tasks created! > and/or implement some check to prevent adding new http... [08:26:20] 10Operations, 10Cloud-Services, 10Traffic: cloudweb2001-dev: add TLS termination - https://phabricator.wikimedia.org/T263829 (10ema) [08:26:22] 10Operations, 10Traffic, 10HTTPS, 10codfw-rollout: HTTPS for internal service traffic - https://phabricator.wikimedia.org/T108580 (10ema) [08:26:44] 10Operations, 10Continuous-Integration-Infrastructure, 10Traffic: contint.wikimedia.org: add TLS termination - https://phabricator.wikimedia.org/T263830 (10ema) [08:26:46] 10Operations, 10Traffic, 10HTTPS, 10codfw-rollout: HTTPS for internal service traffic - https://phabricator.wikimedia.org/T108580 (10ema) [08:27:21] 10Operations, 10Traffic, 10serviceops: puppetmaster[12]001: add TLS termination - https://phabricator.wikimedia.org/T263831 (10ema) [08:27:23] 10Operations, 10Traffic, 10HTTPS, 10codfw-rollout: HTTPS for internal service traffic - https://phabricator.wikimedia.org/T108580 (10ema) [08:27:57] (03PS1) 10Elukey: profile::hadoop::common: cleanup after new TLS settings [puppet] - 
10https://gerrit.wikimedia.org/r/630078 (https://phabricator.wikimedia.org/T253957) [08:29:01] (03CR) 10jerkins-bot: [V: 04-1] profile::hadoop::common: cleanup after new TLS settings [puppet] - 10https://gerrit.wikimedia.org/r/630078 (https://phabricator.wikimedia.org/T253957) (owner: 10Elukey) [08:30:41] (03PS2) 10Elukey: profile::hadoop::common: cleanup after new TLS settings [puppet] - 10https://gerrit.wikimedia.org/r/630078 (https://phabricator.wikimedia.org/T253957) [08:34:21] (03PS1) 10ZPapierski: Add dsh groups config for wdqs [puppet] - 10https://gerrit.wikimedia.org/r/630081 (https://phabricator.wikimedia.org/T252124) [08:37:39] (03PS3) 10Elukey: profile::hadoop::common: cleanup after new TLS settings [puppet] - 10https://gerrit.wikimedia.org/r/630078 (https://phabricator.wikimedia.org/T253957) [08:40:54] (03CR) 10Arturo Borrero Gonzalez: OpenStack: add initial manifests for OpenStack Barbican, a secrets API (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/629472 (https://phabricator.wikimedia.org/T263680) (owner: 10Andrew Bogott) [08:41:05] (03PS4) 10Elukey: profile::hadoop::common: cleanup after new TLS settings [puppet] - 10https://gerrit.wikimedia.org/r/630078 (https://phabricator.wikimedia.org/T253957) [08:42:37] (03PS1) 10Ema: cache: upgrade Varnish to v6 in eqsin [puppet] - 10https://gerrit.wikimedia.org/r/630083 (https://phabricator.wikimedia.org/T263557) [08:42:59] (03CR) 10jerkins-bot: [V: 04-1] cache: upgrade Varnish to v6 in eqsin [puppet] - 10https://gerrit.wikimedia.org/r/630083 (https://phabricator.wikimedia.org/T263557) (owner: 10Ema) [08:44:22] (03PS2) 10Ema: cache: upgrade Varnish to v6 in eqsin [puppet] - 10https://gerrit.wikimedia.org/r/630083 (https://phabricator.wikimedia.org/T263557) [08:44:39] (03CR) 10Ema: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/630083 (https://phabricator.wikimedia.org/T263557) (owner: 10Ema) [08:45:24] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "PCC 
https://puppet-compiler.wmflabs.org/compiler1002/25435/" [puppet] - 10https://gerrit.wikimedia.org/r/629724 (owner: 10Arturo Borrero Gonzalez) [08:46:12] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review, 10User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by kormat on cumin1001.eqiad.wmnet for hosts: ` ['db2125.codfw.wmnet'] ` The log can... [08:47:25] 10Operations, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team, 10Jenkins: Upload Jenkins LTS v2.46.2 to jessie-wikimedia/third-party - https://phabricator.wikimedia.org/T157429 (10hashar) https://issues.jenkins-ci.org/browse/INFRA-165 has been declined by upstream. Seems that reprepr... [08:50:10] (03CR) 10Ema: [C: 03+2] cache: upgrade Varnish to v6 in eqsin [puppet] - 10https://gerrit.wikimedia.org/r/630083 (https://phabricator.wikimedia.org/T263557) (owner: 10Ema) [08:52:11] (03PS1) 10Kormat: installer_server: Update MAC for db2125 [puppet] - 10https://gerrit.wikimedia.org/r/630088 (https://phabricator.wikimedia.org/T260670) [08:52:43] (03CR) 10Elukey: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/25436/" [puppet] - 10https://gerrit.wikimedia.org/r/630078 (https://phabricator.wikimedia.org/T253957) (owner: 10Elukey) [08:53:33] (03CR) 10Kormat: [C: 03+2] installer_server: Update MAC for db2125 [puppet] - 10https://gerrit.wikimedia.org/r/630088 (https://phabricator.wikimedia.org/T260670) (owner: 10Kormat) [08:56:02] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/623662 (owner: 10Dzahn) [08:56:21] 10Operations, 10Continuous-Integration-Infrastructure, 10Jenkins, 10Release-Engineering-Team (CI & Testing services), and 2 others: Review process to fetch Jenkins Debian package from upstream - https://phabricator.wikimedia.org/T260282 (10hashar) a:03hashar TLDR: * Remove reference to Jessie/Stretch ini... 
[08:58:41] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/630024 (https://phabricator.wikimedia.org/T247364) (owner: 10Muehlenhoff) [09:02:03] !log text@eqsin: rolling varnish upgrade to 6.0.6-1wm1 T263557 [09:02:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:02:10] T263557: Upgrade production cache nodes to Varnish 6 - https://phabricator.wikimedia.org/T263557 [09:04:48] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review, 10User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by kormat on cumin1001.eqiad.wmnet for hosts: ` ['db2125.codfw.wmnet'] ` The log can... [09:06:51] (03CR) 10Jbond: "LGTM" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/629647 (owner: 10Elukey) [09:08:40] (03PS1) 10Muehlenhoff: Create a role for the sretest servers and apply to sretest100[12] [puppet] - 10https://gerrit.wikimedia.org/r/630093 (https://phabricator.wikimedia.org/T245754) [09:09:38] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:10:56] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:13:58] 10Operations, 10Continuous-Integration-Infrastructure, 10Jenkins, 10Release-Engineering-Team (CI & Testing services), and 2 others: Review process to fetch Jenkins Debian package from upstream - https://phabricator.wikimedia.org/T260282 (10MoritzMuehlenhoff) This doesn't solve the issue that the LTS releas... 
[09:15:38] (03CR) 10Elukey: profile::hadoop::common: get the datanode mountpoints from facter (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/629647 (owner: 10Elukey) [09:16:56] (03PS1) 10Muehlenhoff: Remove jenkins update targets for jessie/stretch and rename the update config [puppet] - 10https://gerrit.wikimedia.org/r/630095 (https://phabricator.wikimedia.org/T260282) [09:20:48] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime [09:20:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:21:42] (03CR) 10Gehel: [C: 04-1] "LGTM, except for the minor comment inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/630081 (https://phabricator.wikimedia.org/T252124) (owner: 10ZPapierski) [09:22:20] !log upload@eqsin: rolling varnish upgrade to 6.0.6-1wm1 T263557 [09:22:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:22:26] T263557: Upgrade production cache nodes to Varnish 6 - https://phabricator.wikimedia.org/T263557 [09:22:49] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [09:22:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:23:20] (03CR) 10Hnowlan: [C: 03+2] restbase102[89]/restbase1030: add cassandra hosts for new nodes [dns] - 10https://gerrit.wikimedia.org/r/629723 (https://phabricator.wikimedia.org/T261512) (owner: 10Hnowlan) [09:25:55] PROBLEM - Webrequests Varnishkafka log producer on cp5001 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [09:26:09] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 134, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:26:55] RECOVERY - Webrequests Varnishkafka log producer on cp5001 is OK: PROCS OK: 1 process with args 
/usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [09:28:08] ah ok varnish 6 rollout [09:28:21] elukey: yup! [09:28:35] RECOVERY - Host mr1-codfw.oob is UP: PING OK - Packet loss = 0%, RTA = 37.68 ms [09:29:45] (03PS1) 10Elukey: Introduce a role for Hadoop workers with GPUs [puppet] - 10https://gerrit.wikimedia.org/r/630099 [09:29:47] (03PS2) 10ZPapierski: Add dsh groups config for wdqs [puppet] - 10https://gerrit.wikimedia.org/r/630081 (https://phabricator.wikimedia.org/T252124) [09:30:39] (03CR) 10ZPapierski: Add dsh groups config for wdqs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/630081 (https://phabricator.wikimedia.org/T252124) (owner: 10ZPapierski) [09:31:07] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review, 10User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db2125.codfw.wmnet'] ` and were **ALL** successful. [09:31:35] (03PS2) 10Elukey: Introduce a role for Hadoop workers with GPUs [puppet] - 10https://gerrit.wikimedia.org/r/630099 [09:32:24] (03CR) 10jerkins-bot: [V: 04-1] Introduce a role for Hadoop workers with GPUs [puppet] - 10https://gerrit.wikimedia.org/r/630099 (owner: 10Elukey) [09:33:49] RECOVERY - Router interfaces on cr2-eqdfw is OK: OK: host 208.80.153.198, interfaces up: 64, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:33:59] RECOVERY - Router interfaces on mr1-codfw is OK: OK: host 208.80.153.196, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:34:28] (03CR) 10Muehlenhoff: "Can you also amend the Cumin aliases with the patch?" 
(1 comment) [puppet] - https://gerrit.wikimedia.org/r/630099 (owner: Elukey)
[09:36:18] (CR) Elukey: "Hey John, so this would probably be a solution but it wouldn't share all the config that I currently have for role::analytics_cluster::had" [puppet] - https://gerrit.wikimedia.org/r/630099 (owner: Elukey)
[09:38:01] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:38:24] (CR) Elukey: Introduce a role for Hadoop workers with GPUs (1 comment) [puppet] - https://gerrit.wikimedia.org/r/630099 (owner: Elukey)
[09:39:15] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:40:13] (CR) Muehlenhoff: Introduce a role for Hadoop workers with GPUs (1 comment) [puppet] - https://gerrit.wikimedia.org/r/630099 (owner: Elukey)
[09:41:09] (CR) Elukey: Introduce a role for Hadoop workers with GPUs (1 comment) [puppet] - https://gerrit.wikimedia.org/r/630099 (owner: Elukey)
[09:41:48] (PS3) Elukey: Introduce a role for Hadoop workers with GPUs [puppet] - https://gerrit.wikimedia.org/r/630099
[09:42:52] (CR) jerkins-bot: [V: -1] Introduce a role for Hadoop workers with GPUs [puppet] - https://gerrit.wikimedia.org/r/630099 (owner: Elukey)
[09:44:53] PROBLEM - MegaRAID on es2026 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[09:44:56] ACKNOWLEDGEMENT - MegaRAID on es2026 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T263837 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[09:45:02] Operations,
ops-codfw: Degraded RAID on es2026 - https://phabricator.wikimedia.org/T263837 (ops-monitoring-bot)
[09:45:24] (PS1) Hnowlan: restbase: add restbase102[89]/restbase1030 [puppet] - https://gerrit.wikimedia.org/r/630106 (https://phabricator.wikimedia.org/T261512)
[09:46:48] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:47:54] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 132, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:50:05] (PS4) Elukey: Introduce a role for Hadoop workers with GPUs [puppet] - https://gerrit.wikimedia.org/r/630099
[09:50:07] (PS1) Elukey: profile::hadoop::spark2: set default from hiera [puppet] - https://gerrit.wikimedia.org/r/630108
[09:51:12] (CR) jerkins-bot: [V: -1] Introduce a role for Hadoop workers with GPUs [puppet] - https://gerrit.wikimedia.org/r/630099 (owner: Elukey)
[09:53:11] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 134, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:53:13] (CR) Elukey: "https://puppet-compiler.wmflabs.org/compiler1003/25439/" [puppet] - https://gerrit.wikimedia.org/r/630108 (owner: Elukey)
[09:53:25] (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/630093 (https://phabricator.wikimedia.org/T245754) (owner: Muehlenhoff)
[09:55:01] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:55:23] PROBLEM - Varnish HTTP text-frontend - port 3126 on cp5009 is CRITICAL: connect to address 10.132.0.109 and port 3126: Connection refused https://wikitech.wikimedia.org/wiki/Varnish
[09:55:55] the
cp5009 alert is me, ignore ^
[09:56:07] PROBLEM - Host mr1-codfw.oob is DOWN: PING CRITICAL - Packet loss = 100%
[09:57:19] (CR) Klausman: [C: +2] profile::hadoop::spark2: set default from hiera [puppet] - https://gerrit.wikimedia.org/r/630108 (owner: Elukey)
[09:57:41] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:58:07] !log restarting archiva to pick up Java security update
[09:58:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:58:23] RECOVERY - Varnish HTTP text-frontend - port 3126 on cp5009 is OK: HTTP OK: HTTP/1.1 200 OK - 412 bytes in 0.450 second response time https://wikitech.wikimedia.org/wiki/Varnish
[09:59:00] (CR) Jbond: profile::hadoop::common: get the datanode mountpoints from facter (2 comments) [puppet] - https://gerrit.wikimedia.org/r/629647 (owner: Elukey)
[10:00:42] (CR) Elukey: profile::hadoop::common: get the datanode mountpoints from facter (1 comment) [puppet] - https://gerrit.wikimedia.org/r/629647 (owner: Elukey)
[10:01:11] RECOVERY - Host mr1-codfw.oob is UP: PING OK - Packet loss = 0%, RTA = 36.12 ms
[10:02:17] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[10:02:53] RECOVERY - HTTPS-planet on en.planet.wikimedia.org is OK: SSL OK - Certificate *.wikipedia.org valid until 2020-12-17 10:00:19 +0000 (expires in 82 days) https://wikitech.wikimedia.org/wiki/Planet.wikimedia.org
[10:05:21] (PS2) Muehlenhoff: Create a role for the sretest servers and apply to sretest100[12] [puppet] - https://gerrit.wikimedia.org/r/630093 (https://phabricator.wikimedia.org/T245754)
[10:06:58] (PS1) Elukey: Add fake hadoop keytabs for new Hadoop workers [labs/private] - https://gerrit.wikimedia.org/r/630112
[10:07:21] (CR) Elukey: [V: +2 C: +2] Add fake hadoop keytabs for new
Hadoop workers [labs/private] - https://gerrit.wikimedia.org/r/630112 (owner: Elukey)
[10:09:41] (CR) Muehlenhoff: [C: +2] Create a role for the sretest servers and apply to sretest100[12] [puppet] - https://gerrit.wikimedia.org/r/630093 (https://phabricator.wikimedia.org/T245754) (owner: Muehlenhoff)
[10:13:13] RECOVERY - HTTPS-wmfusercontent on phab.wmfusercontent.org is OK: SSL OK - Certificate *.wikipedia.org valid until 2020-12-17 10:00:19 +0000 (expires in 82 days) https://phabricator.wikimedia.org/tag/phabricator/
[10:25:30] (PS2) Cparle: Generation of json dumps for wikimedia commons [puppet] - https://gerrit.wikimedia.org/r/629121 (https://phabricator.wikimedia.org/T259067)
[10:26:58] RECOVERY - Check systemd state on es1025 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:28:26] !log reimaging sretest1002 to validate puppetised sources.list with a new installation T158562
[10:28:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:28:34] T158562: Manage apt sources via puppet - https://phabricator.wikimedia.org/T158562
[10:32:57] (CR) Cparle: "> I was thinking that some functions could be abstracted out of dumpwikidatajson.sh into the script dupwikibasejson.sh which would be used" [puppet] - https://gerrit.wikimedia.org/r/629121 (https://phabricator.wikimedia.org/T259067) (owner: Cparle)
[10:40:05] !log jmm@cumin1001 START - Cookbook sre.hosts.downtime
[10:40:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:42:06] !log jmm@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[10:42:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:51:16] (CR) ArielGlenn: "> > If there are a bunch of settings that are the same between rdf and json for the two projects, maybe there should be a generic wikidata" [puppet] - https://gerrit.wikimedia.org/r/629121
(https://phabricator.wikimedia.org/T259067) (owner: Cparle)
[10:52:47] Operations, Editing-team, MassMessage, WMF-JobQueue, Platform Team Workboards (Clinic Duty Team): Same MassMessage is being sent more than once - https://phabricator.wikimedia.org/T93049 (Esanders) >>! In T93049#6451786, @greg wrote: > Adding #editing-team per https://www.mediawiki.org/wiki/D...
[11:10:15] !log reimaging sretest1001 to validate puppetised sources.list with a new installation T158562
[11:10:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:10:22] T158562: Manage apt sources via puppet - https://phabricator.wikimedia.org/T158562
[11:23:22] !log jmm@cumin1001 START - Cookbook sre.hosts.downtime
[11:23:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:25:18] !log jmm@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[11:25:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:38:41] Operations, Wikidata, Wikidata-Termbox, serviceops, User-Addshore: Plan to scale up termbox service to be able to render the termbox for desktop pageviews - https://phabricator.wikimedia.org/T261486 (Addshore)
[11:42:08] PROBLEM - MariaDB Replica SQL: s5 #page on db2137 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1062, Errmsg: Error Duplicate entry 62.60.63.20-0-0-0 for key ipb_address on query. Default database: enwikivoyage. [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[11:42:21] PROBLEM - MariaDB Replica SQL: s5 on db1100 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1062, Errmsg: Error Duplicate entry 62.60.63.20-0-0-0 for key ipb_address on query. Default database: enwikivoyage.
[Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[11:42:24] PROBLEM - MariaDB Replica SQL: s5 #page on db2089 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1062, Errmsg: Error Duplicate entry 62.60.63.20-0-0-0 for key ipb_address on query. Default database: enwikivoyage. [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[11:42:27] yo
[11:42:31] PROBLEM - MariaDB Replica SQL: s5 on db2099 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1062, Errmsg: Error Duplicate entry 62.60.63.20-0-0-0 for key ipb_address on query. Default database: enwikivoyage. [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[11:42:50] marostegui, kormat, jayme ^
[11:42:51] PROBLEM - MariaDB Replica SQL: s5 on db2139 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1062, Errmsg: Error Duplicate entry 62.60.63.20-0-0-0 for key ipb_address on query. Default database: enwikivoyage. [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[11:42:51] * apergos peeks in
[11:43:11] duplicate entry, ugh. yeah that's yell for help from a dba all right
[11:44:36] I'm rallying the troops.
[11:46:27] (CR) Jbond: "See comments inline" (9 comments) [cookbooks] - https://gerrit.wikimedia.org/r/625597 (owner: Muehlenhoff)
[11:46:35] (PS4) Gehel: Enable cumin EventHandler to disable output.
[software/cumin] - https://gerrit.wikimedia.org/r/628315 (https://phabricator.wikimedia.org/T212783)
[11:46:52] Operations, Wikidata, Wikidata-Termbox, serviceops, User-Addshore: Plan to scale up termbox service to be able to render the termbox for desktop pageviews - https://phabricator.wikimedia.org/T261486 (Addshore)
[11:47:30] this doesn't look good
[11:47:43] no, and it's not like we can depool them all
[11:48:08] i'm here
[11:48:17] duplicate key, help!
[11:48:24] i agree
[11:48:30] (CR) jerkins-bot: [V: -1] Enable cumin EventHandler to disable output. [software/cumin] - https://gerrit.wikimedia.org/r/628315 (https://phabricator.wikimedia.org/T212783) (owner: Gehel)
[11:48:45] how may we help?
[11:49:32] (PS5) Gehel: Enable cumin EventHandler to disable output. [software/cumin] - https://gerrit.wikimedia.org/r/628315 (https://phabricator.wikimedia.org/T212783)
[11:50:46] we need an actual DBA, not a pretend one.
[11:51:10] sobanski is rining jynus
[11:51:12] *ringing
[11:51:16] ok
[11:51:31] ETA 5 minutes
[11:52:03] Do we know what the impact of this is?
[11:52:20] sobanski: edits of all wikis hosted on s5 will not show up
[11:52:57] it's bad, but not sure it's so bad. I think mediawiki will depool those hosts. It's not all of the s5 ones, are they?
[11:52:59] PROBLEM - MariaDB Replica Lag: s5 #page on db2137 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 803.79 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[11:53:09] https://github.com/wikimedia/operations-mediawiki-config/blob/master/dblists/s5.dblist
[11:53:10] PROBLEM - MariaDB Replica Lag: s5 on db1144 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 815.39 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[11:53:24] PROBLEM - MariaDB Replica Lag: s5 on db1082 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 830.41 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[11:53:33] some pretty big wikis btw
[11:53:39] * akosiaris updating topic
[11:53:40] PROBLEM - MariaDB Replica Lag: s5 on db2139 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 845.62 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[11:53:44] PROBLEM - MariaDB Replica Lag: s5 on db1145 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 849.35 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[11:54:00] it's 4 of 7 replicas I think
[11:54:04] in codfw
[11:54:46] can someone open a task for this?
[11:54:51] I'm around if I can help
[11:55:02] PROBLEM - MariaDB Replica Lag: s5 on db2099 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 927.14 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[11:55:15] kormat: I think it's an incident doc that we need
[11:55:25] I'll take IC
[11:55:26] can we have an incident coordinator and open up a doc?
[11:55:28] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:55:29] yeah that ^
[11:55:32] akosiaris: good point, thanks
[11:55:49] this is the full error from db2099: Error 'Duplicate entry '62.60.63.20-0-0-0' for key 'ipb_address'' on query. Default database: 'enwikivoyage'. Query: 'UPDATE /* MediaWiki\Block\DatabaseBlockStore::updateBlock */ `ipblocks` SET ipb_address = '62.60.63.20',ipb_user = 0,ipb_timestamp = '20200925110538',ipb_auto = 0,ipb_anon_only = 0,ipb_create_account = 1,ipb_enable_autoblock = 0,ipb_expiry = '20201225113933',ipb_range_start
[11:55:49] = '3E3C3F14',ipb_range_end = '3E3C3F14',ipb_deleted = 0,ipb_block_email = 0,ipb_allow_usertalk = 1,ipb_parent_block_id = 0,ipb_sitewide = 1,ipb_reason_id = '1348933',ipb_by_actor = 380 WHERE ipb_id = 16326'
[11:56:02] let me put that in a paste
[11:56:28] https://phabricator.wikimedia.org/P12796
[11:56:30] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:57:10] PROBLEM - MariaDB Replica Lag: s5 on db1110 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1056.18 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[11:57:30] here if i can help with anything
[11:57:47] same
[11:58:02] PROBLEM - MariaDB Replica Lag: s5 on db1096 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1107.37 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[12:03:10] is there a public task for people asking for updates?
[12:03:18] (CR) Jbond: Introduce a role for Hadoop workers with GPUs (1 comment) [puppet] - https://gerrit.wikimedia.org/r/630099 (owner: Elukey)
[12:03:41] Majavah: not yet
[12:04:04] PROBLEM - MariaDB Replica Lag: s5 on db1130 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1468.46 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[12:04:17] Majavah: no, not yet, but I'll create one
[12:07:52] PROBLEM - MariaDB Replica Lag: s5 on db1113 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1697.84 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[12:10:25] Operations, DBA: S5 replication issue, affecting watchlist and probably recentchanges - https://phabricator.wikimedia.org/T263842 (akosiaris)
[12:10:33] Operations, DBA: S5 replication issue, affecting watchlist and probably recentchanges - https://phabricator.wikimedia.org/T263842 (akosiaris) p:Triage→Unbreak!
[12:10:51] thanks akosiaris!
[12:11:02] Operations, DBA: S5 replication issue, affecting watchlist and probably recentchanges - https://phabricator.wikimedia.org/T263842 (akosiaris) This is being internally tracked as there is some PII, but feel free to use this task for updates from the SRE team
[12:12:05] Operations, DBA: S5 replication issue, affecting watchlist and probably recentchanges - https://phabricator.wikimedia.org/T263842 (akosiaris) List of affected wikis ` apiportalwiki avkwiki cebwiki dewiki enwikivoyage jawikivoyage lldwiki mgwiktionary mhwiktionary muswiki shwiki srwiki thankyouwiki `
[12:13:33] !log kormat@cumin1001 dbctl commit (dc=all): 'Add db2113 to various groups T263842', diff saved to https://phabricator.wikimedia.org/P12797 and previous config saved to /var/cache/conftool/dbconfig/20200925-121332-kormat.json
[12:13:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:13:40] T263842: S5 replication issue, affecting watchlist and probably recentchanges - https://phabricator.wikimedia.org/T263842
[12:15:58] I’m doing some experiments on mwdebug2001
[12:16:13] (looks like nobody else is online there so I think it should be fine)
[12:16:39] RECOVERY - MariaDB Replica SQL: s5 #page on db2089 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[12:17:39] Operations, DBA: S5 replication issue, affecting watchlist and probably recentchanges - https://phabricator.wikimedia.org/T263842 (LSobanski) DBA are testing a recovery action prior to applying it broadly.
[12:18:21] oh, looks like I’m not allowed to edit files directly on mwdebug…
[12:18:38] Operations, DBA, Wikimedia-Incident: S5 replication issue, affecting watchlist and probably recentchanges - https://phabricator.wikimedia.org/T263842 (Majavah)
[12:19:05] in that case, is it okay if I edit the file on the deployment host and then `scap pull` it to mwdebug?
[12:19:13] Lucas_WMDE: we are in an s5 outage, could you please pause
[12:19:14] ?
[12:19:18] oh dear
[12:19:21] ok
[12:19:25] thanks
[12:19:25] didn’t know sorry
[12:19:40] RECOVERY - MariaDB Replica Lag: s5 on db2099 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[12:19:45] RECOVERY - MariaDB Replica Lag: s5 #page on db2137 is OK: OK slave_sql_lag Replication lag: 0.37 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[12:19:51] * jayme around now if I can help with something
[12:19:52] RECOVERY - MariaDB Replica SQL: s5 on db2099 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[12:20:00] PROBLEM - MariaDB Replica Lag: s5 on dbstore1003 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 2426.33 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[12:20:15] RECOVERY - MariaDB Replica SQL: s5 #page on db2137 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[12:20:28] RECOVERY - MariaDB Replica SQL: s5 on db2139 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[12:20:44] RECOVERY - MariaDB Replica Lag: s5 on db2139 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[12:21:50] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime
[12:21:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:21:54] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[12:21:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:24:32] (CR) Gehel: [C: +1] "LGTM, we can deploy this on Monday."
[puppet] - https://gerrit.wikimedia.org/r/630081 (https://phabricator.wikimedia.org/T252124) (owner: ZPapierski)
[12:27:10] Operations, DBA, Wikimedia-Incident: S5 replication issue, affecting watchlist and probably recentchanges - https://phabricator.wikimedia.org/T263842 (LSobanski) A fix was applied and users of affected wikis should be seeing recovery now.
[12:29:14] Operations, DBA, Wikimedia-Incident: S5 replication issue, affecting watchlist and probably recentchanges - https://phabricator.wikimedia.org/T263842 (jcrespo) p:Unbreak!→High This should be fixed now for end-users. Removing unbreak now. Please report any strange things you may find (should be...
[12:33:34] Operations, Traffic: backport ipvsadm>=1.30 to buster-wikimedia or buster-backports - https://phabricator.wikimedia.org/T263788 (CDanis) Ah, sorry, I didn't think that hard about sort order. It's been many, many years now since I had an @debian.org email address.
[12:41:42] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime
[12:41:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:41:46] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[12:41:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:44:13] Operations, Patch-For-Review, User-jbond: Manage apt sources via puppet - https://phabricator.wikimedia.org/T158562 (MoritzMuehlenhoff) I did a test installation with the new setting as I had a hunch there would be issues in early install and turns out I was right: The installer writes out an initial...
[12:45:11] Lucas_WMDE: I think you are good to go again. Thanks for waiting
[12:50:01] Operations, DBA, Wikimedia-Incident: S5 replication issue, affecting watchlist and probably recentchanges - https://phabricator.wikimedia.org/T263842 (jcrespo) a:jcrespo This needs research, it is weird this happened, especially after T260042 was done prior to switchover.
[12:58:40] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={burrow,swagger_check_restbase_esams} site={codfw,esams} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[13:00:06] PROBLEM - Prometheus prometheus2004/ops restarted: beware possible monitoring artifacts on prometheus2004 is CRITICAL: instance=127.0.0.1 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/ops
[13:00:16] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[13:00:44] (CR) Hashar: "While at it, there is a leftover comment that we can most probably drop related to reprepro not supporting double redirects." (1 comment) [puppet] - https://gerrit.wikimedia.org/r/630095 (https://phabricator.wikimedia.org/T260282) (owner: Muehlenhoff)
[13:07:17] akosiaris: thanks! I’ll try again later today (and addshore told me how to edit files on mwdebug ^^)
[13:07:33] glad to hear the issue was resolved
[13:08:13] (CR) Elukey: Introduce a role for Hadoop workers with GPUs (1 comment) [puppet] - https://gerrit.wikimedia.org/r/630099 (owner: Elukey)
[13:18:06] Operations, Product-Infrastructure-Data, Epic, Goal, Patch-For-Review: automatically collect network error reports from users' browsers (Network Error Logging API) - https://phabricator.wikimedia.org/T257527 (CDanis)
[13:23:56] RECOVERY - Prometheus prometheus2004/ops restarted: beware possible monitoring artifacts on prometheus2004 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/ops
[13:29:26] Operations, Traffic, Patch-For-Review: Discarded VCL files stuck in auto/busy state cause high number of backend probe requests - https://phabricator.wikimedia.org/T236754 (ema) The upgrade to Varnish 6 (T263557) seems to have fixed this, or at least I could not reproduce the problem by issuing vari...
[13:41:25] (PS7) Kormat: bsection: Script for binary-searching log files. [puppet] - https://gerrit.wikimedia.org/r/627841
[13:41:44] (PS2) Kormat: mariadb: Promote db1104 to s8 master [puppet] - https://gerrit.wikimedia.org/r/629707 (https://phabricator.wikimedia.org/T239238)
[13:48:02] (CR) Muehlenhoff: reboot-groups (8 comments) [cookbooks] - https://gerrit.wikimedia.org/r/625597 (owner: Muehlenhoff)
[14:01:42] (PS1) Muehlenhoff: Have the puppetised sources.list depend on the wikimedia repository [puppet] - https://gerrit.wikimedia.org/r/630179 (https://phabricator.wikimedia.org/T158562)
[14:08:34] (PS2) Muehlenhoff: Remove jenkins update targets for jessie/stretch and rename the update config [puppet] - https://gerrit.wikimedia.org/r/630095 (https://phabricator.wikimedia.org/T260282)
[14:08:38] (CR) Muehlenhoff: Remove jenkins update targets for jessie/stretch and rename the update config (1 comment) [puppet] - https://gerrit.wikimedia.org/r/630095 (https://phabricator.wikimedia.org/T260282) (owner: Muehlenhoff)
[14:20:19] (PS1) Elukey: Add role::analytics_cluster::hadoop::worker to an-worker1096 [puppet] - https://gerrit.wikimedia.org/r/630185
[14:21:40] (CR) Elukey: "The alternative is a simple https://gerrit.wikimedia.org/r/c/operations/puppet/+/630185 that uses regex.yaml.."
[puppet] - https://gerrit.wikimedia.org/r/630099 (owner: Elukey)
[14:22:52] (PS1) Kormat: Split static methods out of WMFMariaDB.py into dbutil.py [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/630187
[14:24:04] (PS2) Elukey: Add role::analytics_cluster::hadoop::worker to an-worker1096 [puppet] - https://gerrit.wikimedia.org/r/630185
[14:25:45] (CR) Jcrespo: "More than happy about this, but coordinated merge and deployment should be done on dependent code on wmfbackups (and elsewhere?)." [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/630187 (owner: Kormat)
[14:28:56] (PS3) Elukey: Add role::analytics_cluster::hadoop::worker to an-worker1096 [puppet] - https://gerrit.wikimedia.org/r/630185
[14:33:52] (PS1) Kormat: Update WMFMariaDB to dbutils [software/wmfbackups] - https://gerrit.wikimedia.org/r/630193
[14:34:23] (PS2) Kormat: Update WMFMariaDB to dbutils [software/wmfbackups] - https://gerrit.wikimedia.org/r/630193
[14:34:25] (CR) jerkins-bot: [V: -1] Update WMFMariaDB to dbutils [software/wmfbackups] - https://gerrit.wikimedia.org/r/630193 (owner: Kormat)
[14:34:48] (CR) jerkins-bot: [V: -1] Update WMFMariaDB to dbutils [software/wmfbackups] - https://gerrit.wikimedia.org/r/630193 (owner: Kormat)
[14:35:34] (CR) Kormat: "> Patch Set 1:" [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/630187 (owner: Kormat)
[14:36:09] (CR) Jcrespo: "Thanks for taking the time to fix that. This will fail until the other is merged."
(1 comment) [software/wmfbackups] - https://gerrit.wikimedia.org/r/630193 (owner: Kormat)
[14:36:17] (Abandoned) Elukey: Introduce a role for Hadoop workers with GPUs [puppet] - https://gerrit.wikimedia.org/r/630099 (owner: Elukey)
[14:37:09] (CR) Jcrespo: Update WMFMariaDB to dbutils (1 comment) [software/wmfbackups] - https://gerrit.wikimedia.org/r/630193 (owner: Kormat)
[14:37:46] (CR) Jcrespo: [C: +1] Update WMFMariaDB to dbutils [software/wmfbackups] - https://gerrit.wikimedia.org/r/630193 (owner: Kormat)
[14:40:17] (CR) Jcrespo: "> Patch Set 1:" [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/630187 (owner: Kormat)
[14:45:44] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[14:47:22] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[14:47:48] (CR) Muehlenhoff: "Looks good to me, a few comments inline" (5 comments) [puppet] - https://gerrit.wikimedia.org/r/623608 (https://phabricator.wikimedia.org/T244792) (owner: Jbond)
[14:54:52] !log install linux-image-4.19-amd64 on an-worker1096 + reboot
[14:54:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:02:22] Operations, Scap, Wikidata, Wikidata-Query-Service, and 2 others: Scap configuration for WDQS should get server groups from a known source or truth - https://phabricator.wikimedia.org/T252124 (Zbyszko) @thcipriani I have the change in puppet (https://gerrit.wikimedia.org/r/630081). What else woul...
[15:16:30] (CR) Ahmon Dancy: [C: +2] "> Patch Set 3:" [mediawiki-config] - https://gerrit.wikimedia.org/r/621589 (owner: Ahmon Dancy)
[15:17:59] Operations, CheckUser, Traffic: Log source port for anonymous users and expose it for sysops/checkusers - https://phabricator.wikimedia.org/T181368 (eranroz) It is desirable when there are trolls using ISPs which use CGN (maybe other cases) - I think this is quite rare case - but when it is required...
[15:23:39] (PS4) Elukey: Add role::analytics_cluster::hadoop::worker to an-worker1096 [puppet] - https://gerrit.wikimedia.org/r/630185
[15:23:48] !log fixing enwikivoyage ipblocks inconsistency cluster-wide T263842
[15:23:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:23:55] T263842: S5 replication issue, affecting watchlist and probably recentchanges - https://phabricator.wikimedia.org/T263842
[15:25:32] (PS5) Elukey: Add role::analytics_cluster::hadoop::worker to an-worker1096 [puppet] - https://gerrit.wikimedia.org/r/630185
[15:25:36] (CR) Andrew Bogott: OpenStack: add initial manifests for OpenStack Barbican, a secrets API (1 comment) [puppet] - https://gerrit.wikimedia.org/r/629472 (https://phabricator.wikimedia.org/T263680) (owner: Andrew Bogott)
[15:26:48] RECOVERY - MariaDB Replica SQL: s5 on db1100 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[15:30:28] (CR) Elukey: [C: +2] Add role::analytics_cluster::hadoop::worker to an-worker1096 [puppet] - https://gerrit.wikimedia.org/r/630185 (owner: Elukey)
[15:30:32] (PS6) Elukey: Add role::analytics_cluster::hadoop::worker to an-worker1096 [puppet] - https://gerrit.wikimedia.org/r/630185
[15:30:57] (PS1) Gehel: [wip] adding some type annotations [software/cumin] - https://gerrit.wikimedia.org/r/630202
[15:31:55] (PS1) Jdlrobson: Enable search in header A/B test for logged in users
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/630206 (https://phabricator.wikimedia.org/T263032) [15:31:59] (03PS1) 10Jdlrobson: Move search in header for anons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/630207 (https://phabricator.wikimedia.org/T263032) [15:32:04] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:32:53] (03CR) 10jerkins-bot: [V: 04-1] [wip] adding some type annotations [software/cumin] - 10https://gerrit.wikimedia.org/r/630202 (owner: 10Gehel) [15:33:11] (03PS1) 10Jdlrobson: Make all section `collapsible` during server side rendering [extensions/MobileFrontend] (wmf/1.36.0-wmf.10) - 10https://gerrit.wikimedia.org/r/630065 (https://phabricator.wikimedia.org/T263832) [15:33:40] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:35:03] 10Operations, 10Product-Analytics, 10Wikimedia-General-or-Unknown, 10Readers-Web-Backlog (Needs Product Owner Decisions), 10SEO: Yoruba Language Wikipedia not being indexed by search engines - https://phabricator.wikimedia.org/T236241 (10Formatierer) For me google only shows one result in the mentioned... 
[15:35:44] RECOVERY - MariaDB Replica Lag: s5 on db1145 is OK: OK slave_sql_lag Replication lag: 0.47 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:35:50] RECOVERY - MariaDB Replica Lag: s5 on db1096 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:35:56] RECOVERY - MariaDB Replica Lag: s5 on db1130 is OK: OK slave_sql_lag Replication lag: 0.31 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:35:58] RECOVERY - MariaDB Replica Lag: s5 on db1113 is OK: OK slave_sql_lag Replication lag: 0.15 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:36:14] RECOVERY - MariaDB Replica Lag: s5 on db1110 is OK: OK slave_sql_lag Replication lag: 0.07 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:36:29] (03CR) 10Jbond: "Thanks updated" (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/623608 (https://phabricator.wikimedia.org/T244792) (owner: 10Jbond) [15:36:32] (03PS12) 10Jbond: role:mx: add script to generate otrs aliases [puppet] - 10https://gerrit.wikimedia.org/r/623608 (https://phabricator.wikimedia.org/T244792) [15:36:42] RECOVERY - MariaDB Replica Lag: s5 on db1144 is OK: OK slave_sql_lag Replication lag: 0.49 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:36:42] RECOVERY - MariaDB Replica Lag: s5 on dbstore1003 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:48:12] RECOVERY - MariaDB Replica Lag: s5 on db1082 is OK: OK slave_sql_lag Replication lag: 0.50 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:52:01] Jdlrobson: are you deploying the backport of peter's fix as soon as it merges? 
[15:53:17] ah, I see it's not +2ed yet [15:53:39] if you want we can verify the fix on beta first by waiting for the next WPT run there [15:54:52] we might have to wait a while for that to happen naturally, though [15:57:43] (03CR) 10CRusnov: [C: 03+1] "THank you Moritz 😊" [puppet] - 10https://gerrit.wikimedia.org/r/630024 (https://phabricator.wikimedia.org/T247364) (owner: 10Muehlenhoff) [15:58:57] RECOVERY - Check whether microcode mitigations for CPU vulnerabilities are applied on stat1004 is OK: OK - All expected CPU flags found https://wikitech.wikimedia.org/wiki/Microcode [15:59:32] I rebooted this morning --^ :P [16:01:45] 10Operations, 10DBA, 10Wikimedia-Incident: S5 replication issue, affecting watchlist and probably recentchanges - https://phabricator.wikimedia.org/T263842 (10jcrespo) p:05High→03Medium After discussing proposed fix of table inconsistency with enwikivoyage admins, an old block, that was only applied on c... [16:02:03] (03CR) 10Eevans: [C: 03+1] restbase: add restbase102[89]/restbase1030 [puppet] - 10https://gerrit.wikimedia.org/r/630106 (https://phabricator.wikimedia.org/T261512) (owner: 10Hnowlan) [16:03:36] I’m going to try something out on mwdebug2001 (make a minor config change and see how it affects the site) [16:03:46] let me know if I should stop for any reason (like the s5 outage earlier) [16:06:33] (03CR) 10Andrew Bogott: [C: 03+2] trafficserver: update to use a .wikimedia.cloud dns name [puppet] - 10https://gerrit.wikimedia.org/r/626466 (https://phabricator.wikimedia.org/T260614) (owner: 10Andrew Bogott) [16:13:09] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission-hardware: decommission es2012.codfw.wmnet - https://phabricator.wikimedia.org/T263613 (10Papaul) ` papaul@asw-c-codfw# show | compare [edit interfaces interface-range vlan-private1-c-codfw] - member ge-1/0/0; [edit interfaces interface-range disabled]... 
[16:17:03] 10Operations, 10ops-codfw, 10decommission-hardware: decommission es2018.codfw.wmnet - https://phabricator.wikimedia.org/T263615 (10Papaul) ` papaul@asw-b-codfw# show | compare [edit interfaces interface-range vlan-private1-b-codfw] - member ge-6/0/19; [edit interfaces interface-range disabled] member... [16:17:37] 10Operations, 10OTRS, 10serviceops, 10Patch-For-Review, 10User-notice: Update OTRS to the latest stable version (6.0.x) - https://phabricator.wikimedia.org/T187984 (10jcrespo) db1077 should now be available to be put back on test-* section, I don't think it is needed anymore as an m2 (otrs test) host. @M... [16:18:03] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:19:05] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:20:39] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:20:40] 10Operations, 10DBA, 10Wikimedia-Incident: S5 replication issue, affecting watchlist and probably recentchanges - https://phabricator.wikimedia.org/T263842 (10RhinosF1) > After discussing proposed fix of table inconsistency with enwikivoyage admins Was this public anywhere for the sake of transparency? Could... [16:22:25] 10Operations, 10DBA, 10Wikimedia-Incident: S5 replication issue, affecting watchlist and probably recentchanges - https://phabricator.wikimedia.org/T263842 (10jcrespo) > Was this public anywhere for the sake of transparency? Could a log / page be linked to? Yes, it was on their Village pump. https://en.wiki... 
[16:23:27] I’m done with my mwdebug2001 experiments (did a `scap pull` just to ensure I didn’t leave anything behind) [16:23:32] !log pt1979@cumin2001 START - Cookbook sre.dns.netbox [16:23:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:44] 10Operations, 10DBA, 10Wikimedia-Incident: S5 replication issue, affecting watchlist and probably recentchanges - https://phabricator.wikimedia.org/T263842 (10RhinosF1) Thanks for the quick reply @jcrespo [16:26:26] 10Operations, 10DBA, 10Wikimedia-Incident: S5 replication issue, affecting watchlist and probably recentchanges - https://phabricator.wikimedia.org/T263842 (10jcrespo) Effectively no block was applied or removed by me, only metadata was made consistent by "merging" 2 other partially applied blocks. Logs wher... [16:28:05] (03PS2) 10Reedy: Prepare ExtensionDistributor for REL1_35 stable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/629854 [16:28:11] (03CR) 10Reedy: [C: 03+2] Prepare ExtensionDistributor for REL1_35 stable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/629854 (owner: 10Reedy) [16:28:50] 10Operations, 10ops-codfw: elastic2037 DIMM errors logged in racadm getsel - https://phabricator.wikimedia.org/T263714 (10Papaul) a:03Papaul [16:28:56] (03Merged) 10jenkins-bot: Prepare ExtensionDistributor for REL1_35 stable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/629854 (owner: 10Reedy) [16:29:26] !log pt1979@cumin2001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:29:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:03] 10Operations, 10ops-codfw, 10decommission-hardware: decommission es2018.codfw.wmnet - https://phabricator.wikimedia.org/T263615 (10Papaul) [16:30:11] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:31:10] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission-hardware: decommission es2012.codfw.wmnet - https://phabricator.wikimedia.org/T263613 (10Papaul) [16:31:47] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:33:56] !log reedy@deploy1001 Synchronized wmf-config/CommonSettings.php: Promote 1.35.0 to stable in extensiondistributor (duration: 00m 57s) [16:34:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:47] !log dcausse@deploy1001 Started deploy [wikimedia/discovery/analytics@94c8e6a]: fixed start data for wikidata ttl import [16:34:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:57] !log dcausse@deploy1001 Finished deploy [wikimedia/discovery/analytics@94c8e6a]: fixed start data for wikidata ttl import (duration: 01m 10s) [16:35:59] 10Operations, 10ops-codfw: elastic2037 DIMM errors logged in racadm getsel - https://phabricator.wikimedia.org/T263714 (10Papaul) Create Dispatch: Success You have successfully submitted request SR1037734369. Your dispatch request has been successfully created and will be reviewed by our team. You can monito... [16:36:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:09] 10Operations, 10ops-codfw: Degraded RAID on es2026 - https://phabricator.wikimedia.org/T263837 (10Papaul) p:05Triage→03Medium a:03Papaul [16:49:49] 10Operations, 10ops-codfw: Degraded RAID on es2026 - https://phabricator.wikimedia.org/T263837 (10Papaul) Create Dispatch: Success You have successfully submitted request SR1037735666. Your dispatch request has been successfully created and will be reviewed by our team. You can monitor its progress on your D... 
[16:50:19] 10Operations, 10ops-codfw: elastic2037 DIMM errors logged in racadm getsel - https://phabricator.wikimedia.org/T263714 (10Papaul) p:05Triage→03Medium [17:02:23] gilles: hey still there? [17:03:20] if somebody can backport the patch I can test, but I can't backport it myself... I [17:03:34] but I also have a doctor's visit in about 30 mins :/ [17:05:28] 10Operations, 10homer, 10netops: Homer: Netbox driven switch interfaces - https://phabricator.wikimedia.org/T250429 (10crusnov) [17:05:30] 10Operations, 10netbox: Netbox: fill network topology - https://phabricator.wikimedia.org/T205897 (10crusnov) [17:09:57] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:11:33] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:22:18] Jdlrobson: is this about the UBN? [17:22:30] I could pull it onto mwdebug2001 for you [17:22:32] Lucas_WMDE: yeh [17:22:37] Lucas_WMDE: that would be great [17:22:40] if it’s not too late for the doctor’s visit now [17:22:43] ok [17:22:53] i have about 15 mins hopefully [17:24:04] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "I trust Jdlrobson, this was already merged on master, and it should fix an UBN!, so backporting." 
[extensions/MobileFrontend] (wmf/1.36.0-wmf.10) - 10https://gerrit.wikimedia.org/r/630065 (https://phabricator.wikimedia.org/T263832) (owner: 10Jdlrobson) [17:24:29] gah, Zuul anticipates 22 minutes… [17:25:39] Jdlrobson: I pulled it onto mwdebug2001 ahead of time, please test [17:26:15] testing [17:27:16] Lucas_WMDE: yup that looks fixed to me [17:27:21] ok [17:27:29] not sure if I should go ahead with a `scap sync` while it’s still going through CI [17:27:32] or wait [17:28:09] probably wiser to wait, but I can then sync it without you since you already tested it [17:28:30] if you can wait that's great. thanks for taking care of this Lucas_WMDE [17:28:36] i think this should also take care of your wikidata problem [17:28:38] 10Operations, 10LDAP-Access-Requests: Access to Superset - https://phabricator.wikimedia.org/T263868 (10JRabah) [17:28:44] ok, thanks :) [17:29:01] then I guess I can look into that on mwdebug while I wait for CI [17:29:28] (03PS1) 10Elukey: role::analytics_test_cluster::coordinator: add analytics users without ssh keys [puppet] - 10https://gerrit.wikimedia.org/r/630218 (https://phabricator.wikimedia.org/T262660) [17:30:58] (03PS2) 10Elukey: role::analytics_test_cluster::coordinator: add analytics users without ssh keys [puppet] - 10https://gerrit.wikimedia.org/r/630218 (https://phabricator.wikimedia.org/T262660) [17:32:32] hm, I still get collapsed sections on that Wikidata issue, even on mwdebug2001 [17:35:11] !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@ae3c936]: Deploy glent 0.2.3 [17:35:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:37:13] !log ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@ae3c936]: Deploy glent 0.2.3 (duration: 02m 01s) [17:37:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:44:46] (03Merged) 10jenkins-bot: Make all section `collapsible` during server side rendering [extensions/MobileFrontend] 
(wmf/1.36.0-wmf.10) - 10https://gerrit.wikimedia.org/r/630065 (https://phabricator.wikimedia.org/T263832) (owner: 10Jdlrobson) [17:45:07] ok, syncing that change [17:47:13] !log lucaswerkmeister-wmde@deploy1001 Synchronized php-1.36.0-wmf.10/extensions/MobileFrontend/: Backport: [[gerrit:630065|Make all section `collapsible` during server side rendering (T263832)]] (duration: 00m 59s) [17:47:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:21] T263832: Major performance regression on mobile site associated with 1.36.0-wmf.10 - https://phabricator.wikimedia.org/T263832 [17:50:04] anyone from performance team still online who could check whether T263832 is looking better now? [17:50:22] the task has lots of dashboard screenshots but no links and I haven’t found them so far [17:53:42] Hi, on otrs interface I see following message: "The OTRS background task is not launched. Please contact your administrator." should i create a task for it or somebody is playing on the host? [17:53:56] Framawiki: please create a task [17:54:11] Lucas_WMDE: it's possibly https://grafana.wikimedia.org/d/000000218/navigation-timing-by-browser but i can't quite reproduce [17:54:42] cdanis: that looks similar at least, thanks [17:56:04] Framawiki reference T187984 on the ticket, it may be something related to the upgrade? [17:56:05] T187984: Update OTRS to the latest stable version (6.0.x) - https://phabricator.wikimedia.org/T187984 [17:56:16] Lucas_WMDE: also I pinged in #wikimedia-perf [17:56:24] ah, so that’s the channel :) thanks [17:56:33] (tried -performance and -performance-team and then gave up ^^) [17:56:35] g.illes was around half an hour ago, so maybe still is [17:58:05] thanks jynus cdanis , created https://phabricator.wikimedia.org/T263873 [17:59:03] Lucas_WMDE: if I’m reading the backscroll correctly you’ve deployed the backport to production, right? 
[17:59:09] 10Operations, 10Traffic, 10Platform Team Initiatives (API Gateway), 10Platform Team Sprints Board (Sprint 4), and 2 others: Client Developer has a cookie-free API call - https://phabricator.wikimedia.org/T258748 (10WDoranWMF) [17:59:13] gilles: yes [17:59:38] j.dlrobson confirmed it worked on mwdebug2001 but had to leave before CI finished and I synced the change everywhere [17:59:40] The effect won’t be immediate on RUM metrics (navigation timing), because affected articles will need to get purged/edited for the bad HTML to go away [17:59:46] ok [18:00:16] In the meantime I will purge one of the articles we have synthetic tests for and we should see the effect on the next test run [18:00:34] (03CR) 10Dzahn: [C: 03+2] "Yep, per IRC and confirmed, I also could not see it in openstack-browser or across the repo. Thanks for the cleanup change." [puppet] - 10https://gerrit.wikimedia.org/r/629863 (owner: 10Jeena Huneidi) [18:01:00] Incidentally dpifke is currently fixing a disk space problem on the test host, but should be back to normal shortly [18:01:49] in that case I’ll just watch mediawiki-errors for a bit more and otherwise hope it works [18:02:35] Should be fine, imho, I verified it on Beta earlier [18:02:47] I’ll check the next synthetic test runs when I go home later [18:02:53] ok thanks [18:03:08] then I think I can go home as well soon :) [18:03:48] back! thanks Lucas_WMDE gilles [18:03:51] (03CR) 10Dzahn: [C: 03+2] Remove jenkins update targets for jessie/stretch and rename the update config [puppet] - 10https://gerrit.wikimedia.org/r/630095 (https://phabricator.wikimedia.org/T260282) (owner: 10Muehlenhoff) [18:04:28] hope everything went well! [18:05:07] Done moving files around, WPT host should be healthy now. 
[18:08:33] 10Operations, 10Traffic, 10serviceops, 10Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (10hashar) [18:08:35] 10Operations, 10Continuous-Integration-Infrastructure, 10Traffic: contint.wikimedia.org: add TLS termination - https://phabricator.wikimedia.org/T263830 (10hashar) [18:22:11] (03PS2) 10Dzahn: bastionhost::pop: remove tftp from bastions [puppet] - 10https://gerrit.wikimedia.org/r/629496 (https://phabricator.wikimedia.org/T252526) [18:26:30] (03PS2) 10Dzahn: maps: hiera()->lookup(), add data types [puppet] - 10https://gerrit.wikimedia.org/r/629439 [18:27:53] (03PS2) 10Dzahn: service::uwsgi: replace hiera() with lookup() [puppet] - 10https://gerrit.wikimedia.org/r/629437 [18:29:06] 10Operations, 10Continuous-Integration-Infrastructure, 10Jenkins, 10Release-Engineering-Team (CI & Testing services), and 2 others: Review process to fetch Jenkins Debian package from upstream - https://phabricator.wikimedia.org/T260282 (10hashar) >>! In T260282#6493671, @MoritzMuehlenhoff wrote: > This do... 
[18:33:11] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/25446/" [puppet] - 10https://gerrit.wikimedia.org/r/629437 (owner: 10Dzahn) [18:37:46] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/25448/" [puppet] - 10https://gerrit.wikimedia.org/r/629425 (owner: 10Dzahn) [18:37:53] (03PS2) 10Dzahn: statistics::web: hiera->lookup, add data type [puppet] - 10https://gerrit.wikimedia.org/r/629425 [18:40:29] !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@303eaf3]: Enable icutoknorm in glent m0 and m1 [18:40:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:25] 10Operations, 10Continuous-Integration-Infrastructure, 10Traffic: contint.wikimedia.org: add TLS termination - https://phabricator.wikimedia.org/T263830 (10Dzahn) This was basically all done: https://gerrit.wikimedia.org/r/c/operations/puppet/+/591000 tlsproxy::envoy: allow limiting firewall srange https://... [18:49:02] (03CR) 10Dzahn: "confirmed noop on thorium" [puppet] - 10https://gerrit.wikimedia.org/r/629425 (owner: 10Dzahn) [18:52:35] 10Operations, 10DBA, 10Wikimedia-Incident: S5 replication issue, affecting watchlist and probably recentchanges - https://phabricator.wikimedia.org/T263842 (10Marostegui) >>! In T263842#6493968, @akosiaris wrote: > List of affected wikis > > ` > apiportalwiki > avkwiki > cebwiki > dewiki > enwikivoyage > ja... [19:02:01] 10Operations, 10DBA, 10Wikimedia-Incident: S5 replication issue, affecting watchlist and probably recentchanges - https://phabricator.wikimedia.org/T263842 (10Marostegui) >>! In T263842#6494987, @Marostegui wrote: >>>! In T263842#6493968, @akosiaris wrote: >> List of affected wikis >> >> ` >> apiportalwiki... 
[19:10:07] (03CR) 10Dzahn: [V: 03+1] swift::proxy: convert role to profile, fix various lint issues (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/628970 (owner: 10Dzahn) [19:10:21] (03PS3) 10Dzahn: swift: convert roles to profiles, fix various lint issues [puppet] - 10https://gerrit.wikimedia.org/r/628970 [19:10:55] (03CR) 10Dzahn: [V: 03+1] "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/628970 (owner: 10Dzahn) [19:11:24] (03CR) 10jerkins-bot: [V: 04-1] swift: convert roles to profiles, fix various lint issues [puppet] - 10https://gerrit.wikimedia.org/r/628970 (owner: 10Dzahn) [19:28:13] (03PS4) 10Dzahn: swift: convert roles to profiles, fix various lint issues [puppet] - 10https://gerrit.wikimedia.org/r/628970 [19:29:13] (03CR) 10jerkins-bot: [V: 04-1] swift: convert roles to profiles, fix various lint issues [puppet] - 10https://gerrit.wikimedia.org/r/628970 (owner: 10Dzahn) [19:30:34] (03CR) 10Dzahn: [C: 04-1] "stalled" [puppet] - 10https://gerrit.wikimedia.org/r/620729 (https://phabricator.wikimedia.org/T180761) (owner: 10Dzahn) [19:34:27] !log ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@303eaf3]: Enable icutoknorm in glent m0 and m1 (duration: 53m 58s) [19:34:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:39:21] (03PS1) 10Dzahn: cumin: remove alias wikireplicas-analytics [puppet] - 10https://gerrit.wikimedia.org/r/630256 [19:45:19] (03PS1) 10Dzahn: phabricator: don't create chk_phuser shell script from erb (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/630257 [19:46:20] (03CR) 10jerkins-bot: [V: 04-1] phabricator: don't create chk_phuser shell script from erb (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/630257 (owner: 10Dzahn) [20:02:21] (03PS3) 10Dzahn: base/monitoring: move monitor_screens to proper profile parameter [puppet] - 10https://gerrit.wikimedia.org/r/628152 [20:02:23] (03CR) 10Dzahn: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/628152 (owner: 
10Dzahn) [20:03:23] (03CR) 10jerkins-bot: [V: 04-1] base/monitoring: move monitor_screens to proper profile parameter [puppet] - 10https://gerrit.wikimedia.org/r/628152 (owner: 10Dzahn) [20:09:37] (03PS4) 10Dzahn: base/monitoring: move monitor_screens to proper profile parameter [puppet] - 10https://gerrit.wikimedia.org/r/628152 [20:10:18] (03PS1) 10Ppchelko: Eventstreams: expose revision-visibility-change events. [deployment-charts] - 10https://gerrit.wikimedia.org/r/630262 (https://phabricator.wikimedia.org/T262479) [20:10:40] (03CR) 10jerkins-bot: [V: 04-1] base/monitoring: move monitor_screens to proper profile parameter [puppet] - 10https://gerrit.wikimedia.org/r/628152 (owner: 10Dzahn) [20:17:18] (03PS6) 10ArielGlenn: showcrcs: util to write out crc information from a bzip2 file [dumps/mwbzutils] - 10https://gerrit.wikimedia.org/r/490299 (https://phabricator.wikimedia.org/T216009) [20:17:20] (03PS1) 10ArielGlenn: new util to display info about revisions for one or more pages from XML input [dumps/mwbzutils] - 10https://gerrit.wikimedia.org/r/630267 [20:18:06] (03PS2) 10ArielGlenn: new util to display info about revisions for one or more pages from XML input [dumps/mwbzutils] - 10https://gerrit.wikimedia.org/r/630267 (https://phabricator.wikimedia.org/T263319) [20:18:08] (03CR) 10Dzahn: "I do not currently see why this "gets struct" instead of a Boolean. Looks like a Boolean in hiera to me." [puppet] - 10https://gerrit.wikimedia.org/r/628152 (owner: 10Dzahn) [20:24:47] 10Operations, 10Wikimedia-Mailing-lists: Create arbcom-ru@wikimedia.org - https://phabricator.wikimedia.org/T262525 (10Adamant.pwn) 05Stalled→03Resolved arbcom-ru@wikimedia.org was established. [20:25:18] 10Operations, 10Scap, 10Wikidata, 10Wikidata-Query-Service, and 2 others: Scap configuration for WDQS should get server groups from a known source or truth - https://phabricator.wikimedia.org/T252124 (10thcipriani) >>! 
In T252124#6494323, @Zbyszko wrote: > @thcipriani I have the change in puppet (https://g... [20:26:51] !log installing memcached 1.4.33-1+deb9u1 on mwdebug1001 [20:26:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:22] Hey, I can't log in on mulwikisource (wikisource.org) - the auto login isn't working (even though I'm logged in on other wikis) and despite trying to submit the form repeatedly I get "There seems to be a problem with your login session; this action has been canceled as a precaution against session hijacking. Please resubmit the form." [20:33:51] it works for me, i am logged in on wikisource.org with SAL account after clicking login (it wasn't fully automatic though, i had to click). could be Safari/iPad like in https://phabricator.wikimedia.org/T157592 [20:35:26] I'm in chrome and other wikis work fine, with automatic, and sometimes if automatic doesn't work automatically I have to manually log in, but it won't let me log in manually [20:36:35] visit https://wikisource.org/wiki/Special:UserLogin - put in username and password - click login - doesn't work (session hijacking) - fails with both "keep me logged in" enabled and not enabled [20:37:44] I guess I'll just try again in a few hours? [20:39:05] DannyS712: " You may receive this message if you are blocking cookies." 
i'd say more like try with different browser [20:39:57] see the other tickets when searching phabricator for "precaution against session hijacking" [20:40:09] The message I'm seeing doesn't say anything about cookies, and nothing has changed from other wikis and from previous times I've logged in [20:41:27] please create a new ticket then [20:41:28] (03CR) 10MusikAnimal: Enable managed sources.list on sretest1001 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/630066 (https://phabricator.wikimedia.org/T156562) (owner: 10Muehlenhoff) [20:44:11] filed T263888 [20:44:12] T263888: Unable to login on sourcewiki - `sessionfailure` - https://phabricator.wikimedia.org/T263888 [20:44:23] is there a specific tag I should add for WMF login / operations? [20:44:32] also, anything I can do to debug or provide more info? [20:47:32] DannyS712: thanks. it seems you already found a tag specifically for MW-user-login, seems good [20:48:47] 10Operations, 10Traffic, 10Wikimedia-General-or-Unknown, 10User-DannyS712: Pages whose title ends with semicolon (;) are intermittently inaccessible - https://phabricator.wikimedia.org/T238285 (10Krinkle) Simple repro: ` krinkle@people1002$ echo -e "Hello world.\n" > 'foo;' krinkle@people1002$ cat foo\; H... [20:54:21] (03CR) 10Dave Pifke: "I need to work with Timo on testing and rolling out the mediawiki-config patch, which should land first. 
I'm out next week but will pick " [puppet] - 10https://gerrit.wikimedia.org/r/620729 (https://phabricator.wikimedia.org/T180761) (owner: 10Dzahn) [20:54:24] (03PS2) 10Effie Mouzeli: WIP mcrouter: install ohhost memcached on MediaWiki servers [puppet] - 10https://gerrit.wikimedia.org/r/629830 (https://phabricator.wikimedia.org/T244340) [20:55:33] (03CR) 10jerkins-bot: [V: 04-1] WIP mcrouter: install ohhost memcached on MediaWiki servers [puppet] - 10https://gerrit.wikimedia.org/r/629830 (https://phabricator.wikimedia.org/T244340) (owner: 10Effie Mouzeli) [21:11:53] !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@d999f76]: adding debug info to deployment [21:11:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:23:27] !log ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@d999f76]: adding debug info to deployment (duration: 11m 33s) [21:23:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:29:57] (03CR) 10Dzahn: [C: 04-1] "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/620729 (https://phabricator.wikimedia.org/T180761) (owner: 10Dzahn) [21:33:47] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:35:23] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:58:07] (03PS1) 10Dzahn: mariadb::core_test: hiera->lookup, add data types [puppet] - 10https://gerrit.wikimedia.org/r/630301 [21:58:58] (03CR) 10jerkins-bot: [V: 04-1] mariadb::core_test: hiera->lookup, add data types [puppet] - 10https://gerrit.wikimedia.org/r/630301 (owner: 10Dzahn) [22:03:52] (03PS2) 10Dzahn: mariadb::core_test: hiera->lookup, add data types [puppet] - 10https://gerrit.wikimedia.org/r/630301 [22:04:23] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:04:37] (03CR) 10jerkins-bot: [V: 04-1] mariadb::core_test: hiera->lookup, add data types [puppet] - 10https://gerrit.wikimedia.org/r/630301 (owner: 10Dzahn) [22:05:57] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:06:48] !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@d1a619f]: increase airflow_variable debugging verbosity [22:06:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:09:09] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:12:23] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:17:30] !log ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@d1a619f]: increase airflow_variable debugging verbosity (duration: 10m 42s) [22:17:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:23:05] (03PS15) 10Dzahn: prometheus: replace remaining hiera() with lookup() [puppet] - 10https://gerrit.wikimedia.org/r/623666 [22:24:07] (03CR) 10jerkins-bot: [V: 04-1] prometheus: replace remaining hiera() with lookup() [puppet] - 10https://gerrit.wikimedia.org/r/623666 (owner: 10Dzahn) [22:29:03] (03PS16) 10Dzahn: prometheus: replace remaining hiera() with lookup() [puppet] - 10https://gerrit.wikimedia.org/r/623666 [22:30:04] (03CR) 10jerkins-bot: [V: 04-1] prometheus: replace remaining hiera() with lookup() [puppet] - 10https://gerrit.wikimedia.org/r/623666 (owner: 10Dzahn) [22:36:10] !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@a135388]: correct scap variable refernce in airflow_variables [22:36:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:41:58] (03CR) 10Dzahn: labstore: add data types and some other style fixes (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/622666 (owner: 10Dzahn) [22:42:14] (03PS9) 10Dzahn: labstore: add data types and some other style fixes [puppet] - 10https://gerrit.wikimedia.org/r/622666 [22:43:16] (03CR) 10jerkins-bot: [V: 04-1] labstore: add data types and some other style fixes [puppet] - 10https://gerrit.wikimedia.org/r/622666 (owner: 10Dzahn) [22:52:07] (03PS1) 10Dzahn: facilities: replace Stdlib::Ip_address with IP::Address [puppet] - 10https://gerrit.wikimedia.org/r/630308 [22:53:57] (03PS1) 10Dzahn: phabricator: replace Stdlib::Ip_address with IP::Address [puppet] - 10https://gerrit.wikimedia.org/r/630309 [22:56:04] (03PS1) 10Dzahn: tor: use Stdlib::Host to match 
FQDN or IP [puppet] - https://gerrit.wikimedia.org/r/630310 [23:03:08] !log ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@a135388]: correct scap variable refernce in airflow_variables (duration: 26m 57s) [23:03:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:22:39] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:24:11] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:29:12] (PS1) Dzahn: cassandra: add data types, remove validation code [puppet] - https://gerrit.wikimedia.org/r/630312 [23:29:38] (CR) jerkins-bot: [V: -1] cassandra: add data types, remove validation code [puppet] - https://gerrit.wikimedia.org/r/630312 (owner: Dzahn) [23:39:12] (PS2) Dzahn: cassandra: add data types, remove validation code [puppet] - https://gerrit.wikimedia.org/r/630312 [23:42:34] (PS1) Dzahn: add testvm5001 to test install5001 [dns] - https://gerrit.wikimedia.org/r/630313 (https://phabricator.wikimedia.org/T252526) [23:44:19] (PS1) CDanis: geoip VCL: init/free functions are now reusable [puppet] - https://gerrit.wikimedia.org/r/630314 (https://phabricator.wikimedia.org/T263496) [23:44:21] (PS1) CDanis: geoip VCL: add a 'which' param to get_geo_xcip [puppet] - https://gerrit.wikimedia.org/r/630315 (https://phabricator.wikimedia.org/T263496) [23:44:23] (PS1) CDanis: VCL: Attach a variety of GeoIP info as bereq headers [puppet] - https://gerrit.wikimedia.org/r/630316 (https://phabricator.wikimedia.org/T263496) [23:44:25] (PS10) Dzahn: labstore: add data types and some other style
fixes [puppet] - https://gerrit.wikimedia.org/r/622666 [23:46:13] (PS17) Dzahn: prometheus: replace remaining hiera() with lookup() [puppet] - https://gerrit.wikimedia.org/r/623666 [23:48:02] (PS3) Dzahn: mariadb::core_test: hiera->lookup, add data types [puppet] - https://gerrit.wikimedia.org/r/630301 [23:48:47] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:48:48] (CR) jerkins-bot: [V: -1] mariadb::core_test: hiera->lookup, add data types [puppet] - https://gerrit.wikimedia.org/r/630301 (owner: Dzahn) [23:50:19] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:52:17] (PS1) Dzahn: mariadb::core_test: convert role to profile [puppet] - https://gerrit.wikimedia.org/r/630317 [23:53:28] (CR) jerkins-bot: [V: -1] mariadb::core_test: convert role to profile [puppet] - https://gerrit.wikimedia.org/r/630317 (owner: Dzahn) [23:55:27] (PS4) Dzahn: mariadb::core_test: hiera->lookup, add data types [puppet] - https://gerrit.wikimedia.org/r/630301 [23:56:14] (CR) jerkins-bot: [V: -1] mariadb::core_test: hiera->lookup, add data types [puppet] - https://gerrit.wikimedia.org/r/630301 (owner: Dzahn) [23:56:29] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:57:59] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:58:45] (PS5) Dzahn: swift: convert roles to profiles, fix various lint issues [puppet] - https://gerrit.wikimedia.org/r/628970