[00:00:32] <icinga-wm>	 PROBLEM - Host ganeti5002 is DOWN: PING CRITICAL - Packet loss = 100%
[00:15:24] <icinga-wm>	 RECOVERY - Packet loss ratio for UDP on logstash1007 is OK: (C)0.1 ge (W)0.05 ge 0.03748 https://wikitech.wikimedia.org/wiki/Logstash https://grafana.wikimedia.org/dashboard/db/logstash
[00:32:04] <icinga-wm>	 PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[00:32:12] <icinga-wm>	 PROBLEM - BFD status on cr2-eqdfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[00:32:22] <icinga-wm>	 PROBLEM - BFD status on cr3-knams is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[00:32:24] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:32:56] <icinga-wm>	 PROBLEM - OSPF status on cr3-knams is CRITICAL: OSPFv2: 2/4 UP : OSPFv3: 2/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[00:33:04] <icinga-wm>	 PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[00:33:26] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[00:34:00] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:38:48] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:40:24] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:20:10] <icinga-wm>	 PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:31:38] <icinga-wm>	 ACKNOWLEDGEMENT - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 1 CDanis GTT maintenance #5429586 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[01:31:38] <icinga-wm>	 ACKNOWLEDGEMENT - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP CDanis GTT maintenance #5429586 https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:31:38] <icinga-wm>	 ACKNOWLEDGEMENT - BFD status on cr2-eqdfw is CRITICAL: CRIT: Down: 1 CDanis GTT maintenance #5429586 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[01:31:38] <icinga-wm>	 ACKNOWLEDGEMENT - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP CDanis GTT maintenance #5429586 https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:31:38] <icinga-wm>	 ACKNOWLEDGEMENT - BFD status on cr3-knams is CRITICAL: CRIT: Down: 2 CDanis GTT maintenance #5429586 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[01:31:38] <icinga-wm>	 ACKNOWLEDGEMENT - OSPF status on cr3-knams is CRITICAL: OSPFv2: 2/4 UP : OSPFv3: 2/4 UP CDanis GTT maintenance #5429586 https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:44:44] <icinga-wm>	 RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:58:03] <wikibugs>	 Operations, ops-eqiad, DBA, DC-Ops: (Need By: 2020-09-15) rack/setup/install db1150 (see note on hostname) - https://phabricator.wikimedia.org/T260817 (wiki_willy) a:Jclark-ctr→Cmjohnson Hi @jcrespo - the host is racked, and the ETA for completion by @Cmjohnson and @RobH is next Wednesday...
[01:58:42] <wikibugs>	 Operations, ops-eqiad, DC-Ops, fundraising-tech-ops: (Need By: 2020-09-30) rack/setup/install frmx1001 & frdata1002 - https://phabricator.wikimedia.org/T260181 (wiki_willy) a:Jclark-ctr→Cmjohnson
[02:00:32] <wikibugs>	 Operations, ops-eqiad, DBA, DC-Ops: (Need By: 2020-08-31) rack/setup/install es10[26-34].eqiad.wmnet - https://phabricator.wikimedia.org/T260370 (wiki_willy) a:Jclark-ctr→Cmjohnson Confirmed with @Cmjohnson and @RobH today, that these es1026-1034 hosts will be ready for you by end of Octo...
[02:01:40] <wikibugs>	 Operations, ops-eqiad, DC-Ops, fundraising-tech-ops: (Need By: 2020-09-30) rack/setup/install frdb1004.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T260379 (wiki_willy) a:Jclark-ctr→Cmjohnson
[02:02:14] <icinga-wm>	 PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 130, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:02:54] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqdfw is CRITICAL: CRITICAL: host 208.80.153.198, interfaces up: 62, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:12:00] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[02:12:16] <icinga-wm>	 RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[02:12:28] <icinga-wm>	 RECOVERY - BFD status on cr2-eqdfw is OK: OK: UP: 8 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[02:12:40] <icinga-wm>	 RECOVERY - BFD status on cr3-knams is OK: OK: UP: 8 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[02:13:10] <icinga-wm>	 RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[02:13:12] <icinga-wm>	 RECOVERY - OSPF status on cr3-knams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[02:14:16] <logmsgbot>	 !log ppchelko@deploy1001 Started deploy [restbase/deploy@4eaad8f]: eqiad-only, T263798
[02:14:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:14:25] <stashbot>	 T263798: /feed/onthisday/selected latency is very high - https://phabricator.wikimedia.org/T263798
[02:20:24] <logmsgbot>	 !log ppchelko@deploy1001 Finished deploy [restbase/deploy@4eaad8f]: eqiad-only, T263798 (duration: 06m 09s)
[02:20:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:20:31] <stashbot>	 T263798: /feed/onthisday/selected latency is very high - https://phabricator.wikimedia.org/T263798
[02:20:39] <logmsgbot>	 !log ppchelko@deploy1001 Started deploy [restbase/deploy@4eaad8f]: new codfw, T263798
[02:20:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:27:42] <logmsgbot>	 !log andrew@deploy1001 Started deploy [horizon/deploy@7b61460]: (no justification provided)
[02:27:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:27:50] <logmsgbot>	 !log andrew@deploy1001 Finished deploy [horizon/deploy@7b61460]: (no justification provided) (duration: 00m 07s)
[02:27:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:27:58] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={swagger_check_cxserver_cluster_codfw,swagger_check_wikifeeds_codfw} site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:28:36] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[02:28:38] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[02:29:44] <logmsgbot>	 !log ppchelko@deploy1001 Finished deploy [restbase/deploy@4eaad8f]: new codfw, T263798 (duration: 09m 05s)
[02:29:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:29:51] <stashbot>	 T263798: /feed/onthisday/selected latency is very high - https://phabricator.wikimedia.org/T263798
[02:30:00] <wikibugs>	 (CR) DannyS712: "@Ahmon Dancy - rebasing after giving CR+2 results in Jenkins running the tests and giving V+2, but not submitting. Should this have been s" [mediawiki-config] - https://gerrit.wikimedia.org/r/621589 (owner: Ahmon Dancy)
[02:30:04] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[02:30:06] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[02:32:44] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:39:18] <icinga-wm>	 PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[02:39:30] <icinga-wm>	 PROBLEM - BFD status on cr2-eqdfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[02:39:44] <icinga-wm>	 PROBLEM - BFD status on cr3-knams is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[02:40:12] <icinga-wm>	 PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[02:40:18] <icinga-wm>	 PROBLEM - OSPF status on cr3-knams is CRITICAL: OSPFv2: 2/4 UP : OSPFv3: 2/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[02:40:42] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[02:45:32] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[02:45:44] <icinga-wm>	 RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[02:45:56] <icinga-wm>	 RECOVERY - BFD status on cr2-eqdfw is OK: OK: UP: 8 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[02:46:12] <icinga-wm>	 RECOVERY - BFD status on cr3-knams is OK: OK: UP: 8 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[02:46:38] <icinga-wm>	 RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[02:46:42] <icinga-wm>	 RECOVERY - OSPF status on cr3-knams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[04:21:12] <icinga-wm>	 PROBLEM - Backup freshness on backup1001 is CRITICAL: All failures: 3 (install3001, ...), Fresh: 101 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[05:12:30] <icinga-wm>	 RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:12:54] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqdfw is OK: OK: host 208.80.153.198, interfaces up: 64, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:21:50] <icinga-wm>	 RECOVERY - Backup freshness on backup1001 is OK: Fresh: 104 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[05:33:26] <icinga-wm>	 PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 130, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:33:32] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:36:40] <icinga-wm>	 RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:36:46] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:43:34] <icinga-wm>	 PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:43:34] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqdfw is CRITICAL: CRITICAL: host 208.80.153.198, interfaces up: 62, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:48:22] <icinga-wm>	 RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:48:24] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqdfw is OK: OK: host 208.80.153.198, interfaces up: 64, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:55:15] <elukey>	 ok so there seems to be maintenance in codfw CyrusOne
[05:55:37] <elukey>	 let's keep an eye on all these alerts though 
[06:12:46] <wikibugs>	 (PS14) Elukey: profile::hadoop::common: get the datanode mountpoints from facter [puppet] - https://gerrit.wikimedia.org/r/629647
[06:29:51] <wikibugs>	 (PS15) Elukey: profile::hadoop::common: get the datanode mountpoints from facter [puppet] - https://gerrit.wikimedia.org/r/629647
[06:30:57] <wikibugs>	 (CR) jerkins-bot: [V: -1] profile::hadoop::common: get the datanode mountpoints from facter [puppet] - https://gerrit.wikimedia.org/r/629647 (owner: Elukey)
[06:32:08] <wikibugs>	 Operations, MW-on-K8s, TechCom-RFC, serviceops, Patch-For-Review: RFC: PHP microservice for containerized shell execution - https://phabricator.wikimedia.org/T260330 (tstarling) I uploaded the Score changes to give you an idea of what a moderately complex caller looks like in practice. It's l...
[06:33:36] <wikibugs>	 (PS16) Elukey: profile::hadoop::common: get the datanode mountpoints from facter [puppet] - https://gerrit.wikimedia.org/r/629647
[06:34:07] <elukey>	 !log reboot stat1004 to pick up kernel settings
[06:34:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:40:36] <elukey>	 !log powercycle ganeti5002 (no instances running on it, mgmt console shows no tty usable)
[06:40:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:44:04] <wikibugs>	 (PS17) Elukey: profile::hadoop::common: get the datanode mountpoints from facter [puppet] - https://gerrit.wikimedia.org/r/629647
[06:44:48] <icinga-wm>	 RECOVERY - Host ganeti5002 is UP: PING OK - Packet loss = 0%, RTA = 225.28 ms
[06:45:32] <icinga-wm>	 PROBLEM - puppet last run on ganeti5002 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[06:50:41] <elukey>	 !log shutdown ganeti5002 (mistakenly powercycled it without seeing T261130)
[06:50:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:50:48] <stashbot>	 T261130: ganeti5002 was down / powered off, machine check entries in SEL - https://phabricator.wikimedia.org/T261130
[06:51:06] <icinga-wm>	 RECOVERY - puppet last run on ganeti5002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[06:53:16] <wikibugs>	 Operations, ops-eqsin, serviceops: ganeti5002 was down / powered off, machine check entries in SEL - https://phabricator.wikimedia.org/T261130 (elukey) Added a week of downtime, sorry for the powercycle :(
[06:55:38] <icinga-wm>	 PROBLEM - Host mr1-codfw.oob is DOWN: PING CRITICAL - Packet loss = 100%
[06:57:01] <wikibugs>	 (PS1) Muehlenhoff: Remove obsolete nda_audit script [puppet] - https://gerrit.wikimedia.org/r/630024 (https://phabricator.wikimedia.org/T247364)
[06:57:08] <icinga-wm>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 132, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:57:34] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqdfw is CRITICAL: CRITICAL: host 208.80.153.198, interfaces up: 62, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:58:16] <icinga-wm>	 PROBLEM - Router interfaces on mr1-codfw is CRITICAL: CRITICAL: host 208.80.153.196, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:00:04] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200925T0700)
[07:05:14] <wikibugs>	 Operations, Traffic: backport ipvsadm>=1.30 to buster-wikimedia or buster-backports - https://phabricator.wikimedia.org/T263788 (MoritzMuehlenhoff) JFTR; 1:1.31-1+deb10u1 is not an ideal version for an internal backport; better use 1:1.31-1~deb10u1 or 1:1.31-0+deb10u1 for future backports: If there's no...
[07:05:16] <wikibugs>	 (CR) Elukey: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1002/25428/" [puppet] - https://gerrit.wikimedia.org/r/629647 (owner: Elukey)
[07:08:25] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Remove stretch-backports from bootstrapvz config [puppet] - https://gerrit.wikimedia.org/r/610121 (https://phabricator.wikimedia.org/T256881) (owner: Muehlenhoff)
[07:10:24] <icinga-wm>	 PROBLEM - BFD status on cr2-eqdfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[07:10:50] <icinga-wm>	 PROBLEM - BFD status on cr3-knams is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[07:11:12] <icinga-wm>	 PROBLEM - OSPF status on cr3-knams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:11:38] <icinga-wm>	 PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:12:00] <icinga-wm>	 RECOVERY - BFD status on cr2-eqdfw is OK: OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[07:13:05] <elukey>	 this should still be the maintenance on CyrusOne, all links seem to be related to eqdfw
[07:16:24] <wikibugs>	 (PS5) Muehlenhoff: reboot-groups [cookbooks] - https://gerrit.wikimedia.org/r/625597
[07:16:27] <wikibugs>	 (PS6) Muehlenhoff: reboot-groups [cookbooks] - https://gerrit.wikimedia.org/r/625597
[07:24:02] <icinga-wm>	 RECOVERY - OSPF status on cr3-knams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:24:28] <icinga-wm>	 RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:25:16] <icinga-wm>	 RECOVERY - BFD status on cr3-knams is OK: OK: UP: 8 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[07:43:32] <wikibugs>	 (PS1) Elukey: Re-add host level hiera settings for analytics1057 [puppet] - https://gerrit.wikimedia.org/r/630040
[07:43:58] <wikibugs>	 (CR) Elukey: [C: +2] Re-add host level hiera settings for analytics1057 [puppet] - https://gerrit.wikimedia.org/r/630040 (owner: Elukey)
[07:50:15] <wikibugs>	 (PS1) Elukey: cdh: sort list of yarn/hdfs partitions in erb templates [puppet] - https://gerrit.wikimedia.org/r/630041
[07:57:01] <wikibugs>	 (PS5) Gehel: Extracting obvious reporting code to a Reporter class. [software/cumin] - https://gerrit.wikimedia.org/r/626660 (https://phabricator.wikimedia.org/T212783)
[07:57:01] <wikibugs>	 (PS3) Gehel: Enable cumin EventHandler to disable output. [software/cumin] - https://gerrit.wikimedia.org/r/628315 (https://phabricator.wikimedia.org/T212783)
[07:58:40] <wikibugs>	 (PS1) Muehlenhoff: Enable managed sources.list on sretest1001 [puppet] - https://gerrit.wikimedia.org/r/630066 (https://phabricator.wikimedia.org/T156562)
[07:59:38] <wikibugs>	 (PS2) Elukey: cdh: sort list of yarn partitions in its erb template [puppet] - https://gerrit.wikimedia.org/r/630041
[08:01:47] <wikibugs>	 (CR) Elukey: [C: +2] cdh: sort list of yarn partitions in its erb template [puppet] - https://gerrit.wikimedia.org/r/630041 (owner: Elukey)
[08:11:16] <wikibugs>	 Operations, Cloud-Services, Traffic: cloudweb2001-dev: add TLS termination - https://phabricator.wikimedia.org/T263829 (ema)
[08:12:18] <wikibugs>	 Operations, Continuous-Integration-Infrastructure, Traffic: contint.wikimedia.org: add TLS termination - https://phabricator.wikimedia.org/T263830 (ema)
[08:13:03] <wikibugs>	 Operations, Product-Analytics, Wikimedia-General-or-Unknown, Readers-Web-Backlog (Needs Product Owner Decisions), SEO: Yoruba Language Wikipedia not being indexed by search engines - https://phabricator.wikimedia.org/T236241 (Aklapper) >>! In T236241#6188649, @Formatierer wrote: > But Google...
[08:15:25] <wikibugs>	 Operations, Traffic, serviceops: puppetmaster[12]001: add TLS termination - https://phabricator.wikimedia.org/T263831 (ema)
[08:24:20] <wikibugs>	 Operations, ops-codfw, DBA, Patch-For-Review, User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (ops-monitoring-bot) Script wmf-auto-reimage was launched by kormat on cumin1001.eqiad.wmnet for hosts: ` ['db2125.codfw.wmnet'] ` The log can...
[08:24:47] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Enable managed sources.list on sretest1001 [puppet] - https://gerrit.wikimedia.org/r/630066 (https://phabricator.wikimedia.org/T156562) (owner: Muehlenhoff)
[08:25:43] <wikibugs>	 Operations, Traffic, HTTPS, codfw-rollout: HTTPS for internal service traffic - https://phabricator.wikimedia.org/T108580 (ema) >>! In T108580#6488253, @BBlack wrote: > Do we need to clean these up in some new subtasks  Yup, tasks created!  > and/or implement some check to prevent adding new http...
[08:26:20] <wikibugs>	 Operations, Cloud-Services, Traffic: cloudweb2001-dev: add TLS termination - https://phabricator.wikimedia.org/T263829 (ema)
[08:26:22] <wikibugs>	 Operations, Traffic, HTTPS, codfw-rollout: HTTPS for internal service traffic - https://phabricator.wikimedia.org/T108580 (ema)
[08:26:44] <wikibugs>	 Operations, Continuous-Integration-Infrastructure, Traffic: contint.wikimedia.org: add TLS termination - https://phabricator.wikimedia.org/T263830 (ema)
[08:26:46] <wikibugs>	 Operations, Traffic, HTTPS, codfw-rollout: HTTPS for internal service traffic - https://phabricator.wikimedia.org/T108580 (ema)
[08:27:21] <wikibugs>	 Operations, Traffic, serviceops: puppetmaster[12]001: add TLS termination - https://phabricator.wikimedia.org/T263831 (ema)
[08:27:23] <wikibugs>	 Operations, Traffic, HTTPS, codfw-rollout: HTTPS for internal service traffic - https://phabricator.wikimedia.org/T108580 (ema)
[08:27:57] <wikibugs>	 (PS1) Elukey: profile::hadoop::common: cleanup after new TLS settings [puppet] - https://gerrit.wikimedia.org/r/630078 (https://phabricator.wikimedia.org/T253957)
[08:29:01] <wikibugs>	 (CR) jerkins-bot: [V: -1] profile::hadoop::common: cleanup after new TLS settings [puppet] - https://gerrit.wikimedia.org/r/630078 (https://phabricator.wikimedia.org/T253957) (owner: Elukey)
[08:30:41] <wikibugs>	 (PS2) Elukey: profile::hadoop::common: cleanup after new TLS settings [puppet] - https://gerrit.wikimedia.org/r/630078 (https://phabricator.wikimedia.org/T253957)
[08:34:21] <wikibugs>	 (PS1) ZPapierski: Add dsh groups config for wdqs [puppet] - https://gerrit.wikimedia.org/r/630081 (https://phabricator.wikimedia.org/T252124)
[08:37:39] <wikibugs>	 (PS3) Elukey: profile::hadoop::common: cleanup after new TLS settings [puppet] - https://gerrit.wikimedia.org/r/630078 (https://phabricator.wikimedia.org/T253957)
[08:40:54] <wikibugs>	 (CR) Arturo Borrero Gonzalez: OpenStack: add initial manifests for OpenStack Barbican, a secrets API (1 comment) [puppet] - https://gerrit.wikimedia.org/r/629472 (https://phabricator.wikimedia.org/T263680) (owner: Andrew Bogott)
[08:41:05] <wikibugs>	 (PS4) Elukey: profile::hadoop::common: cleanup after new TLS settings [puppet] - https://gerrit.wikimedia.org/r/630078 (https://phabricator.wikimedia.org/T253957)
[08:42:37] <wikibugs>	 (PS1) Ema: cache: upgrade Varnish to v6 in eqsin [puppet] - https://gerrit.wikimedia.org/r/630083 (https://phabricator.wikimedia.org/T263557)
[08:42:59] <wikibugs>	 (CR) jerkins-bot: [V: -1] cache: upgrade Varnish to v6 in eqsin [puppet] - https://gerrit.wikimedia.org/r/630083 (https://phabricator.wikimedia.org/T263557) (owner: Ema)
[08:44:22] <wikibugs>	 (PS2) Ema: cache: upgrade Varnish to v6 in eqsin [puppet] - https://gerrit.wikimedia.org/r/630083 (https://phabricator.wikimedia.org/T263557)
[08:44:39] <wikibugs>	 (CR) Ema: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/630083 (https://phabricator.wikimedia.org/T263557) (owner: Ema)
[08:45:24] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +2] "PCC https://puppet-compiler.wmflabs.org/compiler1002/25435/" [puppet] - https://gerrit.wikimedia.org/r/629724 (owner: Arturo Borrero Gonzalez)
[08:46:12] <wikibugs>	 Operations, ops-codfw, DBA, Patch-For-Review, User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (ops-monitoring-bot) Script wmf-auto-reimage was launched by kormat on cumin1001.eqiad.wmnet for hosts: ` ['db2125.codfw.wmnet'] ` The log can...
[08:47:25] <wikibugs>	 Operations, Continuous-Integration-Infrastructure, Release-Engineering-Team, Jenkins: Upload Jenkins LTS v2.46.2  to  jessie-wikimedia/third-party - https://phabricator.wikimedia.org/T157429 (hashar) https://issues.jenkins-ci.org/browse/INFRA-165 has been declined by upstream.  Seems that reprepr...
[08:50:10] <wikibugs>	 (CR) Ema: [C: +2] cache: upgrade Varnish to v6 in eqsin [puppet] - https://gerrit.wikimedia.org/r/630083 (https://phabricator.wikimedia.org/T263557) (owner: Ema)
[08:52:11] <wikibugs>	 (PS1) Kormat: installer_server: Update MAC for db2125 [puppet] - https://gerrit.wikimedia.org/r/630088 (https://phabricator.wikimedia.org/T260670)
[08:52:43] <wikibugs>	 (CR) Elukey: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1002/25436/" [puppet] - https://gerrit.wikimedia.org/r/630078 (https://phabricator.wikimedia.org/T253957) (owner: Elukey)
[08:53:33] <wikibugs>	 (CR) Kormat: [C: +2] installer_server: Update MAC for db2125 [puppet] - https://gerrit.wikimedia.org/r/630088 (https://phabricator.wikimedia.org/T260670) (owner: Kormat)
[08:56:02] <wikibugs>	 (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/623662 (owner: Dzahn)
[08:56:21] <wikibugs>	 Operations, Continuous-Integration-Infrastructure, Jenkins, Release-Engineering-Team (CI & Testing services), and 2 others: Review process to fetch Jenkins Debian package from upstream - https://phabricator.wikimedia.org/T260282 (hashar) a:hashar TLDR: * Remove reference to Jessie/Stretch ini...
[08:57:28] <jbond42>	 ••
[08:58:41] <wikibugs>	 (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/630024 (https://phabricator.wikimedia.org/T247364) (owner: Muehlenhoff)
[09:02:03] <ema>	 !log text@eqsin: rolling varnish upgrade to 6.0.6-1wm1 T263557
[09:02:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:02:10] <stashbot>	 T263557: Upgrade production cache nodes to Varnish 6 - https://phabricator.wikimedia.org/T263557
[09:04:48] <wikibugs>	 Operations, ops-codfw, DBA, Patch-For-Review, User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (ops-monitoring-bot) Script wmf-auto-reimage was launched by kormat on cumin1001.eqiad.wmnet for hosts: ` ['db2125.codfw.wmnet'] ` The log can...
[09:06:51] <wikibugs>	 (CR) Jbond: "LGTM" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/629647 (owner: Elukey)
[09:08:40] <wikibugs>	 (PS1) Muehlenhoff: Create a role for the sretest servers and apply to sretest100[12] [puppet] - https://gerrit.wikimedia.org/r/630093 (https://phabricator.wikimedia.org/T245754)
[09:09:38] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:10:56] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:13:58] <wikibugs>	 Operations, Continuous-Integration-Infrastructure, Jenkins, Release-Engineering-Team (CI & Testing services), and 2 others: Review process to fetch Jenkins Debian package from upstream - https://phabricator.wikimedia.org/T260282 (MoritzMuehlenhoff) This doesn't solve the issue that the LTS releas...
[09:15:38] <wikibugs>	 (CR) Elukey: profile::hadoop::common: get the datanode mountpoints from facter (2 comments) [puppet] - https://gerrit.wikimedia.org/r/629647 (owner: Elukey)
[09:16:56] <wikibugs>	 (PS1) Muehlenhoff: Remove jenkins update targets for jessie/stretch and rename the update config [puppet] - https://gerrit.wikimedia.org/r/630095 (https://phabricator.wikimedia.org/T260282)
[09:20:48] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime
[09:20:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:21:42] <wikibugs>	 (03CR) 10Gehel: [C: 04-1] "LGTM, except for the minor comment inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/630081 (https://phabricator.wikimedia.org/T252124) (owner: 10ZPapierski)
[09:22:20] <ema>	 !log upload@eqsin: rolling varnish upgrade to 6.0.6-1wm1 T263557
[09:22:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:22:26] <stashbot>	 T263557: Upgrade production cache nodes to Varnish 6 - https://phabricator.wikimedia.org/T263557
[09:22:49] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[09:22:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:23:20] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] restbase102[89]/restbase1030: add cassandra hosts for new nodes [dns] - 10https://gerrit.wikimedia.org/r/629723 (https://phabricator.wikimedia.org/T261512) (owner: 10Hnowlan)
[09:25:55] <icinga-wm>	 PROBLEM - Webrequests Varnishkafka log producer on cp5001 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[09:26:09] <icinga-wm>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 134, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:26:55] <icinga-wm>	 RECOVERY - Webrequests Varnishkafka log producer on cp5001 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[09:28:08] <elukey>	 ah ok varnish 6 rollout
[09:28:21] <ema>	 elukey: yup!
[09:28:35] <icinga-wm>	 RECOVERY - Host mr1-codfw.oob is UP: PING OK - Packet loss = 0%, RTA = 37.68 ms
[09:29:45] <wikibugs>	 (03PS1) 10Elukey: Introduce a role for Hadoop workers with GPUs [puppet] - 10https://gerrit.wikimedia.org/r/630099
[09:29:47] <wikibugs>	 (03PS2) 10ZPapierski: Add dsh groups config for wdqs [puppet] - 10https://gerrit.wikimedia.org/r/630081 (https://phabricator.wikimedia.org/T252124)
[09:30:39] <wikibugs>	 (03CR) 10ZPapierski: Add dsh groups config for wdqs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/630081 (https://phabricator.wikimedia.org/T252124) (owner: 10ZPapierski)
[09:31:07] <wikibugs>	 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review, 10User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db2125.codfw.wmnet'] `  and were **ALL** successful.
[09:31:35] <wikibugs>	 (03PS2) 10Elukey: Introduce a role for Hadoop workers with GPUs [puppet] - 10https://gerrit.wikimedia.org/r/630099
[09:32:24] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Introduce a role for Hadoop workers with GPUs [puppet] - 10https://gerrit.wikimedia.org/r/630099 (owner: 10Elukey)
[09:33:49] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqdfw is OK: OK: host 208.80.153.198, interfaces up: 64, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:33:59] <icinga-wm>	 RECOVERY - Router interfaces on mr1-codfw is OK: OK: host 208.80.153.196, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:34:28] <wikibugs>	 (03CR) 10Muehlenhoff: "Can you also amend the Cumin aliases with the patch?" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/630099 (owner: 10Elukey)
[09:36:18] <wikibugs>	 (03CR) 10Elukey: "Hey John, so this would probably be a solution but it wouldn't share all the config that I currently have for role::analytics_cluster::had" [puppet] - 10https://gerrit.wikimedia.org/r/630099 (owner: 10Elukey)
[09:38:01] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:38:24] <wikibugs>	 (03CR) 10Elukey: Introduce a role for Hadoop workers with GPUs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/630099 (owner: 10Elukey)
[09:39:15] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:40:13] <wikibugs>	 (03CR) 10Muehlenhoff: Introduce a role for Hadoop workers with GPUs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/630099 (owner: 10Elukey)
[09:41:09] <wikibugs>	 (03CR) 10Elukey: Introduce a role for Hadoop workers with GPUs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/630099 (owner: 10Elukey)
[09:41:48] <wikibugs>	 (03PS3) 10Elukey: Introduce a role for Hadoop workers with GPUs [puppet] - 10https://gerrit.wikimedia.org/r/630099
[09:42:52] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Introduce a role for Hadoop workers with GPUs [puppet] - 10https://gerrit.wikimedia.org/r/630099 (owner: 10Elukey)
[09:44:53] <icinga-wm>	 PROBLEM - MegaRAID on es2026 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[09:44:56] <icinga-wm>	 ACKNOWLEDGEMENT - MegaRAID on es2026 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T263837 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[09:45:02] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on es2026 - https://phabricator.wikimedia.org/T263837 (10ops-monitoring-bot)
[09:45:24] <wikibugs>	 (03PS1) 10Hnowlan: restbase: add restbase102[89]/restbase1030 [puppet] - 10https://gerrit.wikimedia.org/r/630106 (https://phabricator.wikimedia.org/T261512)
[09:46:48] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:47:54] <icinga-wm>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 132, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:50:05] <wikibugs>	 (03PS4) 10Elukey: Introduce a role for Hadoop workers with GPUs [puppet] - 10https://gerrit.wikimedia.org/r/630099
[09:50:07] <wikibugs>	 (03PS1) 10Elukey: profile::hadoop::spark2: set default from hiera [puppet] - 10https://gerrit.wikimedia.org/r/630108
[09:51:12] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Introduce a role for Hadoop workers with GPUs [puppet] - 10https://gerrit.wikimedia.org/r/630099 (owner: 10Elukey)
[09:53:11] <icinga-wm>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 134, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:53:13] <wikibugs>	 (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1003/25439/" [puppet] - 10https://gerrit.wikimedia.org/r/630108 (owner: 10Elukey)
[09:53:25] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/630093 (https://phabricator.wikimedia.org/T245754) (owner: 10Muehlenhoff)
[09:55:01] <icinga-wm>	 PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:55:23] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3126 on cp5009 is CRITICAL: connect to address 10.132.0.109 and port 3126: Connection refused https://wikitech.wikimedia.org/wiki/Varnish
[09:55:55] <ema>	 the cp5009 alert is me, ignore ^ 
[09:56:07] <icinga-wm>	 PROBLEM - Host mr1-codfw.oob is DOWN: PING CRITICAL - Packet loss = 100%
[09:57:19] <wikibugs>	 (03CR) 10Klausman: [C: 03+2] profile::hadoop::spark2: set default from hiera [puppet] - 10https://gerrit.wikimedia.org/r/630108 (owner: 10Elukey)
[09:57:41] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:58:07] <moritzm>	 !log restarting archiva to pick up Java security update
[09:58:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:58:23] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3126 on cp5009 is OK: HTTP OK: HTTP/1.1 200 OK - 412 bytes in 0.450 second response time https://wikitech.wikimedia.org/wiki/Varnish
[09:59:00] <wikibugs>	 (03CR) 10Jbond: profile::hadoop::common: get the datanode mountpoints from facter (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/629647 (owner: 10Elukey)
[10:00:42] <wikibugs>	 (03CR) 10Elukey: profile::hadoop::common: get the datanode mountpoints from facter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/629647 (owner: 10Elukey)
[10:01:11] <icinga-wm>	 RECOVERY - Host mr1-codfw.oob is UP: PING OK - Packet loss = 0%, RTA = 36.12 ms
[10:02:17] <icinga-wm>	 RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[10:02:53] <icinga-wm>	 RECOVERY - HTTPS-planet on en.planet.wikimedia.org is OK: SSL OK - Certificate *.wikipedia.org valid until 2020-12-17 10:00:19 +0000 (expires in 82 days) https://wikitech.wikimedia.org/wiki/Planet.wikimedia.org
[10:05:21] <wikibugs>	 (03PS2) 10Muehlenhoff: Create a role for the sretest servers and apply to sretest100[12] [puppet] - 10https://gerrit.wikimedia.org/r/630093 (https://phabricator.wikimedia.org/T245754)
[10:06:58] <wikibugs>	 (03PS1) 10Elukey: Add fake hadoop keytabs for new Hadoop workers [labs/private] - 10https://gerrit.wikimedia.org/r/630112
[10:07:21] <wikibugs>	 (03CR) 10Elukey: [V: 03+2 C: 03+2] Add fake hadoop keytabs for new Hadoop workers [labs/private] - 10https://gerrit.wikimedia.org/r/630112 (owner: 10Elukey)
[10:09:41] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Create a role for the sretest servers and apply to sretest100[12] [puppet] - 10https://gerrit.wikimedia.org/r/630093 (https://phabricator.wikimedia.org/T245754) (owner: 10Muehlenhoff)
[10:13:13] <icinga-wm>	 RECOVERY - HTTPS-wmfusercontent on phab.wmfusercontent.org is OK: SSL OK - Certificate *.wikipedia.org valid until 2020-12-17 10:00:19 +0000 (expires in 82 days) https://phabricator.wikimedia.org/tag/phabricator/
[10:25:30] <wikibugs>	 (03PS2) 10Cparle: Generation of json dumps for wikimedia commons [puppet] - 10https://gerrit.wikimedia.org/r/629121 (https://phabricator.wikimedia.org/T259067)
[10:26:58] <icinga-wm>	 RECOVERY - Check systemd state on es1025 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:28:26] <moritzm>	 !log reimaging sretest1002 to validate puppetised sources.list with a new installation T158562
[10:28:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:28:34] <stashbot>	 T158562: Manage apt sources via puppet - https://phabricator.wikimedia.org/T158562
[10:32:57] <wikibugs>	 (03CR) 10Cparle: "> I was thinking that some functions could be abstracted out of dumpwikidatajson.sh into the script dupwikibasejson.sh which would be used" [puppet] - 10https://gerrit.wikimedia.org/r/629121 (https://phabricator.wikimedia.org/T259067) (owner: 10Cparle)
[10:40:05] <logmsgbot>	 !log jmm@cumin1001 START - Cookbook sre.hosts.downtime
[10:40:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:42:06] <logmsgbot>	 !log jmm@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[10:42:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:51:16] <wikibugs>	 (03CR) 10ArielGlenn: "> > If there are a bunch of settings that are the same between rdf and json for the two projects, maybe there should be a generic wikidata" [puppet] - 10https://gerrit.wikimedia.org/r/629121 (https://phabricator.wikimedia.org/T259067) (owner: 10Cparle)
[10:52:47] <wikibugs>	 10Operations, 10Editing-team, 10MassMessage, 10WMF-JobQueue, 10Platform Team Workboards (Clinic Duty Team): Same MassMessage is being sent more than once - https://phabricator.wikimedia.org/T93049 (10Esanders) >>! In T93049#6451786, @greg wrote: > Adding #editing-team per https://www.mediawiki.org/wiki/D...
[11:10:15] <moritzm>	 !log reimaging sretest1001 to validate puppetised sources.list with a new installation T158562
[11:10:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:10:22] <stashbot>	 T158562: Manage apt sources via puppet - https://phabricator.wikimedia.org/T158562
[11:23:22] <logmsgbot>	 !log jmm@cumin1001 START - Cookbook sre.hosts.downtime
[11:23:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:25:18] <logmsgbot>	 !log jmm@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[11:25:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:38:41] <wikibugs>	 10Operations, 10Wikidata, 10Wikidata-Termbox, 10serviceops, 10User-Addshore: Plan to scale up termbox service to be able to render the termbox for desktop pageviews - https://phabricator.wikimedia.org/T261486 (10Addshore)
[11:42:08] <icinga-wm>	 PROBLEM - MariaDB Replica SQL: s5 #page on db2137 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1062, Errmsg: Error Duplicate entry 62.60.63.20-0-0-0 for key ipb_address on query. Default database: enwikivoyage. [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[11:42:21] <icinga-wm>	 PROBLEM - MariaDB Replica SQL: s5 on db1100 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1062, Errmsg: Error Duplicate entry 62.60.63.20-0-0-0 for key ipb_address on query. Default database: enwikivoyage. [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[11:42:24] <icinga-wm>	 PROBLEM - MariaDB Replica SQL: s5 #page on db2089 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1062, Errmsg: Error Duplicate entry 62.60.63.20-0-0-0 for key ipb_address on query. Default database: enwikivoyage. [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[11:42:27] <XioNoX>	 yo
[11:42:31] <icinga-wm>	 PROBLEM - MariaDB Replica SQL: s5 on db2099 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1062, Errmsg: Error Duplicate entry 62.60.63.20-0-0-0 for key ipb_address on query. Default database: enwikivoyage. [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[11:42:50] <XioNoX>	 marostegui, kormat, jayme ^
[11:42:51] <icinga-wm>	 PROBLEM - MariaDB Replica SQL: s5 on db2139 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1062, Errmsg: Error Duplicate entry 62.60.63.20-0-0-0 for key ipb_address on query. Default database: enwikivoyage. [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[11:42:51] * apergos peeks in
[11:43:11] <apergos>	 duplicate entry, ugh. yeah that's yell for help from a dba all right
[11:44:36] <sobanski>	 I'm rallying the troops.
[11:46:27] <wikibugs>	 (03CR) 10Jbond: "See comments inline" (039 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/625597 (owner: 10Muehlenhoff)
[11:46:35] <wikibugs>	 (03PS4) 10Gehel: Enable cumin EventHandler to disable output. [software/cumin] - 10https://gerrit.wikimedia.org/r/628315 (https://phabricator.wikimedia.org/T212783)
[11:46:52] <wikibugs>	 10Operations, 10Wikidata, 10Wikidata-Termbox, 10serviceops, 10User-Addshore: Plan to scale up termbox service to be able to render the termbox for desktop pageviews - https://phabricator.wikimedia.org/T261486 (10Addshore)
[11:47:30] <akosiaris>	 this doesn't look good
[11:47:43] <apergos>	 no, and it's not like we can depool them all
[11:48:08] <kormat>	 i'm here
[11:48:17] <apergos>	 duplicate key, help!
[11:48:24] <kormat>	 i agree
[11:48:30] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Enable cumin EventHandler to disable output. [software/cumin] - 10https://gerrit.wikimedia.org/r/628315 (https://phabricator.wikimedia.org/T212783) (owner: 10Gehel)
[11:48:45] <akosiaris>	 how may we help?
[11:49:32] <wikibugs>	 (03PS5) 10Gehel: Enable cumin EventHandler to disable output. [software/cumin] - 10https://gerrit.wikimedia.org/r/628315 (https://phabricator.wikimedia.org/T212783)
[11:50:46] <kormat>	 we need an actual DBA, not a pretend one.
[11:51:10] <kormat>	 sobanski is rining jynus
[11:51:12] <kormat>	 *ringing
[11:51:16] <akosiaris>	 ok
[11:51:31] <sobanski>	 ETA 5 minutes
[11:52:03] <sobanski>	 Do we know what the impact of this is?
[11:52:20] <kormat>	 sobanski: edits of all wikis hosted on s5 will not show up
[11:52:57] <akosiaris>	 it's bad, but not sure it's so bad. I think mediawiki will depool those hosts. It's not all of the s5 ones, are they?
[11:52:59] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s5 #page on db2137 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 803.79 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[11:53:09] <kormat>	 https://github.com/wikimedia/operations-mediawiki-config/blob/master/dblists/s5.dblist
[11:53:10] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s5 on db1144 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 815.39 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[11:53:24] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s5 on db1082 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 830.41 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[11:53:33] <akosiaris>	 some pretty big wikis btw
[11:53:39] * akosiaris updating topic
[11:53:40] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s5 on db2139 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 845.62 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[11:53:44] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s5 on db1145 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 849.35 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[11:54:00] <apergos>	 it's 4 of 7 replicas I think
[11:54:04] <apergos>	 in codfw
[11:54:46] <kormat>	 can someone open a task for this?
[11:54:51] <paravoid>	 I'm around if I can help
[11:55:02] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s5 on db2099 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 927.14 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[11:55:15] <akosiaris>	 kormat: I think it's an incident doc that we need
[11:55:25] <akosiaris>	 I 'll take IC
[11:55:26] <paravoid>	 can we have an incident coordinator and open up a doc?
[11:55:28] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:55:29] <paravoid>	 yeah that ^
[11:55:32] <kormat>	 akosiaris: good point, thanks
[11:55:49] <kormat>	 this is the full error from db2099: Error 'Duplicate entry '62.60.63.20-0-0-0' for key 'ipb_address'' on query. Default database: 'enwikivoyage'. Query: 'UPDATE /* MediaWiki\Block\DatabaseBlockStore::updateBlock  */  `ipblocks` SET ipb_address = '62.60.63.20',ipb_user = 0,ipb_timestamp = '20200925110538',ipb_auto = 0,ipb_anon_only = 0,ipb_create_account = 1,ipb_enable_autoblock = 0,ipb_expiry = '20201225113933',ipb_range_start = '3E3C3F14',ipb_range_end = '3E3C3F14',ipb_deleted = 0,ipb_block_email = 0,ipb_allow_usertalk = 1,ipb_parent_block_id = 0,ipb_sitewide = 1,ipb_reason_id = '1348933',ipb_by_actor = 380 WHERE ipb_id = 16326'
[11:56:02] <kormat>	 let me put that in a paste
[11:56:28] <kormat>	 https://phabricator.wikimedia.org/P12796
[11:56:30] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:57:10] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s5 on db1110 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1056.18 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[11:57:30] <jbond42>	 here if i can help with anything
[11:57:47] <moritzm>	 same
[11:58:02] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s5 on db1096 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1107.37 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[12:03:10] <Majavah>	 is there a public task for people asking for updates?
[12:03:18] <wikibugs>	 (03CR) 10Jbond: Introduce a role for Hadoop workers with GPUs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/630099 (owner: 10Elukey)
[12:03:41] <moritzm>	 Majavah: not yet
[12:04:04] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s5 on db1130 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1468.46 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[12:04:17] <akosiaris>	 Majavah: no, not yet, but I 'll create one
[12:07:52] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s5 on db1113 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1697.84 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[12:10:25] <wikibugs>	 10Operations, 10DBA: S5 replication issue, affecting watchlist and probably recentchanges - https://phabricator.wikimedia.org/T263842 (10akosiaris)
[12:10:33] <wikibugs>	 10Operations, 10DBA: S5 replication issue, affecting watchlist and probably recentchanges - https://phabricator.wikimedia.org/T263842 (10akosiaris) p:05Triage→03Unbreak!
[12:10:51] <Majavah>	 thanks akosiaris!
[12:11:02] <wikibugs>	 10Operations, 10DBA: S5 replication issue, affecting watchlist and probably recentchanges - https://phabricator.wikimedia.org/T263842 (10akosiaris) This is being internally tracked as there is some PII, but feel free to use this task for updates from the SRE team
[12:12:05] <wikibugs>	 10Operations, 10DBA: S5 replication issue, affecting watchlist and probably recentchanges - https://phabricator.wikimedia.org/T263842 (10akosiaris) List of affected wikis  ` apiportalwiki avkwiki cebwiki dewiki enwikivoyage jawikivoyage lldwiki mgwiktionary mhwiktionary muswiki shwiki srwiki thankyouwiki `
[12:13:33] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'Add db2113 to various groups T263842', diff saved to https://phabricator.wikimedia.org/P12797 and previous config saved to /var/cache/conftool/dbconfig/20200925-121332-kormat.json
[12:13:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:13:40] <stashbot>	 T263842: S5 replication issue, affecting watchlist and probably recentchanges - https://phabricator.wikimedia.org/T263842
[12:15:58] <Lucas_WMDE>	 I’m doing some experiments on mwdebug2001
[12:16:13] <Lucas_WMDE>	 (looks like nobody else is online there so I think it should be fine)
[12:16:39] <icinga-wm>	 RECOVERY - MariaDB Replica SQL: s5 #page on db2089 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[12:17:39] <wikibugs>	 10Operations, 10DBA: S5 replication issue, affecting watchlist and probably recentchanges - https://phabricator.wikimedia.org/T263842 (10LSobanski) DBA are testing a recovery action prior to applying it broadly.
[12:18:21] <Lucas_WMDE>	 oh, looks like I’m not allowed to edit files directly on mwdebug…
[12:18:38] <wikibugs>	 10Operations, 10DBA, 10Wikimedia-Incident: S5 replication issue, affecting watchlist and probably recentchanges - https://phabricator.wikimedia.org/T263842 (10Majavah)
[12:19:05] <Lucas_WMDE>	 in that case, is it okay if I edit the file on the deployment host and then `scap pull` it to mwdebug?
[12:19:13] <akosiaris>	 Lucas_WMDE: we are in an s5 outage, could you please pause
[12:19:14] <akosiaris>	 ?
[12:19:18] <Lucas_WMDE>	 oh dear
[12:19:21] <Lucas_WMDE>	 ok
[12:19:25] <akosiaris>	 thanks
[12:19:25] <Lucas_WMDE>	 didn’t know sorry
[12:19:40] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s5 on db2099 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[12:19:45] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s5 #page on db2137 is OK: OK slave_sql_lag Replication lag: 0.37 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[12:19:51] * jayme around now if I can help with something
[12:19:52] <icinga-wm>	 RECOVERY - MariaDB Replica SQL: s5 on db2099 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[12:20:00] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s5 on dbstore1003 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 2426.33 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[12:20:15] <icinga-wm>	 RECOVERY - MariaDB Replica SQL: s5 #page on db2137 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[12:20:28] <icinga-wm>	 RECOVERY - MariaDB Replica SQL: s5 on db2139 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[12:20:44] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s5 on db2139 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[12:21:50] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime
[12:21:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:21:54] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[12:21:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:24:32] <wikibugs>	 (03CR) 10Gehel: [C: 03+1] "LGTM, we can deploy this on Monday." [puppet] - 10https://gerrit.wikimedia.org/r/630081 (https://phabricator.wikimedia.org/T252124) (owner: 10ZPapierski)
[12:27:10] <wikibugs>	 10Operations, 10DBA, 10Wikimedia-Incident: S5 replication issue, affecting watchlist and probably recentchanges - https://phabricator.wikimedia.org/T263842 (10LSobanski) A fix was applied and users of affected wikis should be seeing recovery now.
[12:29:14] <wikibugs>	 10Operations, 10DBA, 10Wikimedia-Incident: S5 replication issue, affecting watchlist and probably recentchanges - https://phabricator.wikimedia.org/T263842 (10jcrespo) p:05Unbreak!→03High This should be fixed now for end-users. removing unbreak now. Please report any strange things you may find (should be...
[12:33:34] <wikibugs>	 10Operations, 10Traffic: backport ipvsadm>=1.30 to buster-wikimedia or buster-backports - https://phabricator.wikimedia.org/T263788 (10CDanis) Ah, sorry, I didn't think that hard about sort order.  It's been many, many years now since I had an @debian.org email address.
[12:41:42] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime
[12:41:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:41:46] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[12:41:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:44:13] <wikibugs>	 10Operations, 10Patch-For-Review, 10User-jbond: Manage apt sources via puppet - https://phabricator.wikimedia.org/T158562 (10MoritzMuehlenhoff) I did a test installation with the new setting as I had a hunch there would be issues in early install and turns out I was right: The installer writes out an initial...
[12:45:11] <akosiaris>	 Lucas_WMDE: I think you are good to go again. Thanks for waiting
[12:50:01] <wikibugs>	 10Operations, 10DBA, 10Wikimedia-Incident: S5 replication issue, affecting watchlist and probably recentchanges - https://phabricator.wikimedia.org/T263842 (10jcrespo) a:03jcrespo This needs research, it is weird this happened, specially after T260042 was done prior to switchover.
[12:58:40] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={burrow,swagger_check_restbase_esams} site={codfw,esams} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[13:00:06] <icinga-wm>	 PROBLEM - Prometheus prometheus2004/ops restarted: beware possible monitoring artifacts on prometheus2004 is CRITICAL: instance=127.0.0.1 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/ops
[13:00:16] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[13:00:44] <wikibugs>	 (03CR) 10Hashar: "While at it, there is a leftover comment that we can most probably drop related to reprepro not supporting double redirects." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/630095 (https://phabricator.wikimedia.org/T260282) (owner: 10Muehlenhoff)
[13:07:17] <Lucas_WMDE>	 akosiaris: thanks! I’ll try again later today (and addshore told me how to edit files on mwdebug ^^)
[13:07:33] <Lucas_WMDE>	 glad to hear the issue was resolved
[13:08:13] <wikibugs>	 (03CR) 10Elukey: Introduce a role for Hadoop workers with GPUs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/630099 (owner: 10Elukey)
[13:18:06] <wikibugs>	 10Operations, 10Product-Infrastructure-Data, 10Epic, 10Goal, 10Patch-For-Review: automatically collect network error reports from users' browsers (Network Error Logging API) - https://phabricator.wikimedia.org/T257527 (10CDanis)
[13:23:56] <icinga-wm>	 RECOVERY - Prometheus prometheus2004/ops restarted: beware possible monitoring artifacts on prometheus2004 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/ops
[13:29:26] <wikibugs>	 10Operations, 10Traffic, 10Patch-For-Review: Discarded VCL files stuck in auto/busy state cause high number of  backend probe requests - https://phabricator.wikimedia.org/T236754 (10ema) The upgrade to Varnish 6 (T263557) seems to have fixed this, or at least I could not reproduce the problem by issuing vari...
[13:41:25] <wikibugs>	 (03PS7) 10Kormat: bsection: Script for binary-searching log files. [puppet] - 10https://gerrit.wikimedia.org/r/627841
[13:41:44] <wikibugs>	 (03PS2) 10Kormat: mariadb: Promote db1104 to s8 master [puppet] - 10https://gerrit.wikimedia.org/r/629707 (https://phabricator.wikimedia.org/T239238)
[13:48:02] <wikibugs>	 (03CR) 10Muehlenhoff: reboot-groups (038 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/625597 (owner: 10Muehlenhoff)
[14:01:42] <wikibugs>	 (03PS1) 10Muehlenhoff: Have the puppetised sources.list depend on the wikimedia repository [puppet] - 10https://gerrit.wikimedia.org/r/630179 (https://phabricator.wikimedia.org/T158562)
[14:08:34] <wikibugs>	 (03PS2) 10Muehlenhoff: Remove jenkins update targets for jessie/stretch and rename the update config [puppet] - 10https://gerrit.wikimedia.org/r/630095 (https://phabricator.wikimedia.org/T260282)
[14:08:38] <wikibugs>	 (03CR) 10Muehlenhoff: Remove jenkins update targets for jessie/stretch and rename the update config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/630095 (https://phabricator.wikimedia.org/T260282) (owner: 10Muehlenhoff)
[14:20:19] <wikibugs>	 (03PS1) 10Elukey: Add role::analytics_cluster::hadoop::worker to an-worker1096 [puppet] - 10https://gerrit.wikimedia.org/r/630185
[14:21:40] <wikibugs>	 (03CR) 10Elukey: "The alternative is a simple https://gerrit.wikimedia.org/r/c/operations/puppet/+/630185 that uses regex.yaml.." [puppet] - 10https://gerrit.wikimedia.org/r/630099 (owner: 10Elukey)
[14:22:52] <wikibugs>	 (03PS1) 10Kormat: Split static methods out of WMFMariaDB.py into dbutil.py [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/630187
[14:24:04] <wikibugs>	 (03PS2) 10Elukey: Add role::analytics_cluster::hadoop::worker to an-worker1096 [puppet] - 10https://gerrit.wikimedia.org/r/630185
[14:25:45] <wikibugs>	 (03CR) 10Jcrespo: "More than happy about this, but coordinated merge and deployment should be done on dependent code on wmfbackups (and elsewhere?)." [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/630187 (owner: 10Kormat)
[14:28:56] <wikibugs>	 (03PS3) 10Elukey: Add role::analytics_cluster::hadoop::worker to an-worker1096 [puppet] - 10https://gerrit.wikimedia.org/r/630185
[14:33:52] <wikibugs>	 (03PS1) 10Kormat: Update WMFMariaDB to dbutils [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/630193
[14:34:23] <wikibugs>	 (03PS2) 10Kormat: Update WMFMariaDB to dbutils [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/630193
[14:34:25] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Update WMFMariaDB to dbutils [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/630193 (owner: 10Kormat)
[14:34:48] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Update WMFMariaDB to dbutils [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/630193 (owner: 10Kormat)
[14:35:34] <wikibugs>	 (03CR) 10Kormat: "> Patch Set 1:" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/630187 (owner: 10Kormat)
[14:36:09] <wikibugs>	 (03CR) 10Jcrespo: "Thanks for taking the time to fix that. This will fail until the other is merged." (031 comment) [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/630193 (owner: 10Kormat)
[14:36:17] <wikibugs>	 (03Abandoned) 10Elukey: Introduce a role for Hadoop workers with GPUs [puppet] - 10https://gerrit.wikimedia.org/r/630099 (owner: 10Elukey)
[14:37:09] <wikibugs>	 (03CR) 10Jcrespo: Update WMFMariaDB to dbutils (031 comment) [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/630193 (owner: 10Kormat)
[14:37:46] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+1] Update WMFMariaDB to dbutils [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/630193 (owner: 10Kormat)
[14:40:17] <wikibugs>	 (03CR) 10Jcrespo: "> Patch Set 1:" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/630187 (owner: 10Kormat)
[14:45:44] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[14:47:22] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[14:47:48] <wikibugs>	 (03CR) 10Muehlenhoff: "Looks good to me, a few comments inline" (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/623608 (https://phabricator.wikimedia.org/T244792) (owner: 10Jbond)
[14:54:52] <elukey>	 !log install linux-image-4.19-amd64 on an-worker1096 + reboot
[14:54:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:02:22] <wikibugs>	 10Operations, 10Scap, 10Wikidata, 10Wikidata-Query-Service, and 2 others: Scap configuration for WDQS should get server groups from a known source or truth - https://phabricator.wikimedia.org/T252124 (10Zbyszko) @thcipriani I have the change in puppet (https://gerrit.wikimedia.org/r/630081). What else woul...
[15:16:30] <wikibugs>	 (03CR) 10Ahmon Dancy: [C: 03+2] "> Patch Set 3:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/621589 (owner: 10Ahmon Dancy)
[15:17:59] <wikibugs>	 10Operations, 10CheckUser, 10Traffic: Log source port for anonymous users and expose it for sysops/checkusers - https://phabricator.wikimedia.org/T181368 (10eranroz) It is desirable when there are trolls using ISPs which use CGN (maybe other cases) - I think this is quite rare case  - but when it is required...
[15:23:39] <wikibugs>	 (03PS4) 10Elukey: Add role::analytics_cluster::hadoop::worker to an-worker1096 [puppet] - 10https://gerrit.wikimedia.org/r/630185
[15:23:48] <jynus>	 !log fixing enwikivoyage ipblocks inconsistency cluster-wide T263842
[15:23:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:23:55] <stashbot>	 T263842: S5 replication issue, affecting watchlist and probably recentchanges - https://phabricator.wikimedia.org/T263842
[15:25:32] <wikibugs>	 (03PS5) 10Elukey: Add role::analytics_cluster::hadoop::worker to an-worker1096 [puppet] - 10https://gerrit.wikimedia.org/r/630185
[15:25:36] <wikibugs>	 (03CR) 10Andrew Bogott: OpenStack: add initial manifests for OpenStack Barbican, a secrets API (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/629472 (https://phabricator.wikimedia.org/T263680) (owner: 10Andrew Bogott)
[15:26:48] <icinga-wm>	 RECOVERY - MariaDB Replica SQL: s5 on db1100 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[15:30:28] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] Add role::analytics_cluster::hadoop::worker to an-worker1096 [puppet] - 10https://gerrit.wikimedia.org/r/630185 (owner: 10Elukey)
[15:30:32] <wikibugs>	 (03PS6) 10Elukey: Add role::analytics_cluster::hadoop::worker to an-worker1096 [puppet] - 10https://gerrit.wikimedia.org/r/630185
[15:30:57] <wikibugs>	 (03PS1) 10Gehel: [wip] adding some type annotations [software/cumin] - 10https://gerrit.wikimedia.org/r/630202
[15:31:55] <wikibugs>	 (03PS1) 10Jdlrobson: Enable search in header A/B test for logged in users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/630206 (https://phabricator.wikimedia.org/T263032)
[15:31:59] <wikibugs>	 (03PS1) 10Jdlrobson: Move search in header for anons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/630207 (https://phabricator.wikimedia.org/T263032)
[15:32:04] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[15:32:53] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] [wip] adding some type annotations [software/cumin] - 10https://gerrit.wikimedia.org/r/630202 (owner: 10Gehel)
[15:33:11] <wikibugs>	 (03PS1) 10Jdlrobson: Make all section `collapsible` during server side rendering [extensions/MobileFrontend] (wmf/1.36.0-wmf.10) - 10https://gerrit.wikimedia.org/r/630065 (https://phabricator.wikimedia.org/T263832)
[15:33:40] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[15:35:03] <wikibugs>	 10Operations, 10Product-Analytics, 10Wikimedia-General-or-Unknown, 10Readers-Web-Backlog (Needs Product Owner Decisions), 10SEO: Yoruba Language Wikipedia not being indexed by search engines - https://phabricator.wikimedia.org/T236241 (10Formatierer)  For me google only shows one result in the mentioned...
[15:35:44] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s5 on db1145 is OK: OK slave_sql_lag Replication lag: 0.47 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[15:35:50] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s5 on db1096 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[15:35:56] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s5 on db1130 is OK: OK slave_sql_lag Replication lag: 0.31 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[15:35:58] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s5 on db1113 is OK: OK slave_sql_lag Replication lag: 0.15 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[15:36:14] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s5 on db1110 is OK: OK slave_sql_lag Replication lag: 0.07 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[15:36:29] <wikibugs>	 (03CR) 10Jbond: "Thanks updated" (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/623608 (https://phabricator.wikimedia.org/T244792) (owner: 10Jbond)
[15:36:32] <wikibugs>	 (03PS12) 10Jbond: role:mx: add script to generate otrs aliases [puppet] - 10https://gerrit.wikimedia.org/r/623608 (https://phabricator.wikimedia.org/T244792)
[15:36:42] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s5 on db1144 is OK: OK slave_sql_lag Replication lag: 0.49 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[15:36:42] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s5 on dbstore1003 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[15:48:12] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s5 on db1082 is OK: OK slave_sql_lag Replication lag: 0.50 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[15:52:01] <gilles>	 Jdlrobson: are you deploying the backport of peter's fix as soon as it merges?
[15:53:17] <gilles>	 ah, I see it's not +2ed yet
[15:53:39] <gilles>	 if you want we can verify the fix on beta first by waiting for the next WPT run there
[15:54:52] <gilles>	 we might have to wait a while for that to happen naturally, though
[15:57:43] <wikibugs>	 (03CR) 10CRusnov: [C: 03+1] "Thank you Moritz 😊" [puppet] - 10https://gerrit.wikimedia.org/r/630024 (https://phabricator.wikimedia.org/T247364) (owner: 10Muehlenhoff)
[15:58:57] <icinga-wm>	 RECOVERY - Check whether microcode mitigations for CPU vulnerabilities are applied on stat1004 is OK: OK - All expected CPU flags found https://wikitech.wikimedia.org/wiki/Microcode
[15:59:32] <elukey>	 I rebooted this morning --^ :P
[16:01:45] <wikibugs>	 10Operations, 10DBA, 10Wikimedia-Incident: S5 replication issue, affecting watchlist and probably recentchanges - https://phabricator.wikimedia.org/T263842 (10jcrespo) p:05High→03Medium After discussing proposed fix of table inconsistency with enwikivoyage admins, an old block, that was only applied on c...
[16:02:03] <wikibugs>	 (03CR) 10Eevans: [C: 03+1] restbase: add restbase102[89]/restbase1030 [puppet] - 10https://gerrit.wikimedia.org/r/630106 (https://phabricator.wikimedia.org/T261512) (owner: 10Hnowlan)
[16:03:36] <Lucas_WMDE>	 I’m going to try something out on mwdebug2001 (make a minor config change and see how it affects the site)
[16:03:46] <Lucas_WMDE>	 let me know if I should stop for any reason (like the s5 outage earlier)
[16:06:33] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] trafficserver: update to use a .wikimedia.cloud dns name [puppet] - 10https://gerrit.wikimedia.org/r/626466 (https://phabricator.wikimedia.org/T260614) (owner: 10Andrew Bogott)
[16:13:09] <wikibugs>	 10Operations, 10ops-codfw, 10DC-Ops, 10decommission-hardware: decommission es2012.codfw.wmnet - https://phabricator.wikimedia.org/T263613 (10Papaul) ` papaul@asw-c-codfw# show | compare [edit interfaces interface-range vlan-private1-c-codfw] -    member ge-1/0/0; [edit interfaces interface-range disabled]...
[16:17:03] <wikibugs>	 10Operations, 10ops-codfw, 10decommission-hardware: decommission es2018.codfw.wmnet - https://phabricator.wikimedia.org/T263615 (10Papaul) ` papaul@asw-b-codfw# show | compare [edit interfaces interface-range vlan-private1-b-codfw] -    member ge-6/0/19; [edit interfaces interface-range disabled]      member...
[16:17:37] <wikibugs>	 10Operations, 10OTRS, 10serviceops, 10Patch-For-Review, 10User-notice: Update OTRS to the latest stable version (6.0.x) - https://phabricator.wikimedia.org/T187984 (10jcrespo) db1077 should now be available to be put back on test-* section, I don't think it is needed anymore as an m2 (otrs test) host. @M...
[16:18:03] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:19:05] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[16:20:39] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[16:20:40] <wikibugs>	 10Operations, 10DBA, 10Wikimedia-Incident: S5 replication issue, affecting watchlist and probably recentchanges - https://phabricator.wikimedia.org/T263842 (10RhinosF1) > After discussing proposed fix of table inconsistency with enwikivoyage admins Was this public anywhere for the sake of transparency? Could...
[16:22:25] <wikibugs>	 10Operations, 10DBA, 10Wikimedia-Incident: S5 replication issue, affecting watchlist and probably recentchanges - https://phabricator.wikimedia.org/T263842 (10jcrespo) > Was this public anywhere for the sake of transparency? Could a log / page be linked to?  Yes, it was on their Village pump. https://en.wiki...
[16:23:27] <Lucas_WMDE>	 I’m done with my mwdebug2001 experiments (did a `scap pull` just to ensure I didn’t leave anything behind)
[16:23:32] <logmsgbot>	 !log pt1979@cumin2001 START - Cookbook sre.dns.netbox
[16:23:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:23:44] <wikibugs>	 10Operations, 10DBA, 10Wikimedia-Incident: S5 replication issue, affecting watchlist and probably recentchanges - https://phabricator.wikimedia.org/T263842 (10RhinosF1) Thanks for the quick reply @jcrespo
[16:26:26] <wikibugs>	 10Operations, 10DBA, 10Wikimedia-Incident: S5 replication issue, affecting watchlist and probably recentchanges - https://phabricator.wikimedia.org/T263842 (10jcrespo) Effectively no block was applied or removed by me, only metadata was made consistent by "merging" 2 other partially applied blocks. Logs wher...
[16:28:05] <wikibugs>	 (03PS2) 10Reedy: Prepare ExtensionDistributor for REL1_35 stable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/629854
[16:28:11] <wikibugs>	 (03CR) 10Reedy: [C: 03+2] Prepare ExtensionDistributor for REL1_35 stable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/629854 (owner: 10Reedy)
[16:28:50] <wikibugs>	 10Operations, 10ops-codfw: elastic2037 DIMM errors logged in racadm getsel - https://phabricator.wikimedia.org/T263714 (10Papaul) a:03Papaul
[16:28:56] <wikibugs>	 (03Merged) 10jenkins-bot: Prepare ExtensionDistributor for REL1_35 stable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/629854 (owner: 10Reedy)
[16:29:26] <logmsgbot>	 !log pt1979@cumin2001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:29:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:30:03] <wikibugs>	 10Operations, 10ops-codfw, 10decommission-hardware: decommission es2018.codfw.wmnet - https://phabricator.wikimedia.org/T263615 (10Papaul)
[16:30:11] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[16:31:10] <wikibugs>	 10Operations, 10ops-codfw, 10DC-Ops, 10decommission-hardware: decommission es2012.codfw.wmnet - https://phabricator.wikimedia.org/T263613 (10Papaul)
[16:31:47] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[16:33:56] <logmsgbot>	 !log reedy@deploy1001 Synchronized wmf-config/CommonSettings.php: Promote 1.35.0 to stable in extensiondistributor (duration: 00m 57s)
[16:34:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:34:47] <logmsgbot>	 !log dcausse@deploy1001 Started deploy [wikimedia/discovery/analytics@94c8e6a]: fixed start data for wikidata ttl import
[16:34:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:35:57] <logmsgbot>	 !log dcausse@deploy1001 Finished deploy [wikimedia/discovery/analytics@94c8e6a]: fixed start data for wikidata ttl import (duration: 01m 10s)
[16:35:59] <wikibugs>	 10Operations, 10ops-codfw: elastic2037 DIMM errors logged in racadm getsel - https://phabricator.wikimedia.org/T263714 (10Papaul)  Create Dispatch: Success You have successfully submitted request SR1037734369.  Your dispatch request has been successfully created and will be reviewed by our team. You can monito...
[16:36:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:44:09] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on es2026 - https://phabricator.wikimedia.org/T263837 (10Papaul) p:05Triage→03Medium a:03Papaul
[16:49:49] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on es2026 - https://phabricator.wikimedia.org/T263837 (10Papaul)  Create Dispatch: Success You have successfully submitted request SR1037735666.  Your dispatch request has been successfully created and will be reviewed by our team. You can monitor its progress on your D...
[16:50:19] <wikibugs>	 10Operations, 10ops-codfw: elastic2037 DIMM errors logged in racadm getsel - https://phabricator.wikimedia.org/T263714 (10Papaul) p:05Triage→03Medium
[17:02:23] <Jdlrobson>	 gilles: hey still there?
[17:03:20] <Jdlrobson>	 if somebody can backport the patch I can test, but I can't backport it myself...
[17:03:34] <Jdlrobson>	 but I also have a doctor's visit in about 30 mins :/
[17:05:28] <wikibugs>	 10Operations, 10homer, 10netops: Homer: Netbox driven switch interfaces - https://phabricator.wikimedia.org/T250429 (10crusnov)
[17:05:30] <wikibugs>	 10Operations, 10netbox: Netbox: fill network topology - https://phabricator.wikimedia.org/T205897 (10crusnov)
[17:09:57] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:11:33] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:22:18] <Lucas_WMDE>	 Jdlrobson: is this about the UBN?
[17:22:30] <Lucas_WMDE>	 I could pull it onto mwdebug2001 for you
[17:22:32] <Jdlrobson>	 Lucas_WMDE: yeh
[17:22:37] <Jdlrobson>	 Lucas_WMDE: that would be great
[17:22:40] <Lucas_WMDE>	 if it’s not too late for the doctor’s visit now
[17:22:43] <Lucas_WMDE>	 ok
[17:22:53] <Jdlrobson>	 i have about 15 mins hopefully
[17:24:04] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "I trust Jdlrobson, this was already merged on master, and it should fix an UBN!, so backporting." [extensions/MobileFrontend] (wmf/1.36.0-wmf.10) - 10https://gerrit.wikimedia.org/r/630065 (https://phabricator.wikimedia.org/T263832) (owner: 10Jdlrobson)
[17:24:29] <Lucas_WMDE>	 gah, Zuul anticipates 22 minutes…
[17:25:39] <Lucas_WMDE>	 Jdlrobson: I pulled it onto mwdebug2001 ahead of time, please test
[17:26:15] <Jdlrobson>	 testing
[17:27:16] <Jdlrobson>	 Lucas_WMDE: yup that looks fixed to me
[17:27:21] <Lucas_WMDE>	 ok
[17:27:29] <Lucas_WMDE>	 not sure if I should go ahead with a `scap sync` while it’s still going through CI
[17:27:32] <Lucas_WMDE>	 or wait
[17:28:09] <Lucas_WMDE>	 probably wiser to wait, but I can then sync it without you since you already tested it
[17:28:30] <Jdlrobson>	 if you can wait that's great. thanks for taking care of this Lucas_WMDE 
[17:28:36] <Jdlrobson>	 i think this should also take care of your wikidata problem
[17:28:38] <wikibugs>	 10Operations, 10LDAP-Access-Requests: Access to Superset - https://phabricator.wikimedia.org/T263868 (10JRabah)
[17:28:44] <Lucas_WMDE>	 ok, thanks :)
[17:29:01] <Lucas_WMDE>	 then I guess I can look into that on mwdebug while I wait for CI
[17:29:28] <wikibugs>	 (03PS1) 10Elukey: role::analytics_test_cluster::coordinator: add analytics users without ssh keys [puppet] - 10https://gerrit.wikimedia.org/r/630218 (https://phabricator.wikimedia.org/T262660)
[17:30:58] <wikibugs>	 (03PS2) 10Elukey: role::analytics_test_cluster::coordinator: add analytics users without ssh keys [puppet] - 10https://gerrit.wikimedia.org/r/630218 (https://phabricator.wikimedia.org/T262660)
[17:32:32] <Lucas_WMDE>	 hm, I still get collapsed sections on that Wikidata issue, even on mwdebug2001
[17:35:11] <logmsgbot>	 !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@ae3c936]: Deploy glent 0.2.3
[17:35:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:37:13] <logmsgbot>	 !log ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@ae3c936]: Deploy glent 0.2.3 (duration: 02m 01s)
[17:37:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:44:46] <wikibugs>	 (03Merged) 10jenkins-bot: Make all section `collapsible` during server side rendering [extensions/MobileFrontend] (wmf/1.36.0-wmf.10) - 10https://gerrit.wikimedia.org/r/630065 (https://phabricator.wikimedia.org/T263832) (owner: 10Jdlrobson)
[17:45:07] <Lucas_WMDE>	 ok, syncing that change
[17:47:13] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1001 Synchronized php-1.36.0-wmf.10/extensions/MobileFrontend/: Backport: [[gerrit:630065|Make all section `collapsible` during server side rendering (T263832)]] (duration: 00m 59s)
[17:47:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:47:21] <stashbot>	 T263832: Major performance regression on mobile site associated with 1.36.0-wmf.10  - https://phabricator.wikimedia.org/T263832
[17:50:04] <Lucas_WMDE>	 anyone from performance team still online who could check whether T263832 is looking better now?
[17:50:22] <Lucas_WMDE>	 the task has lots of dashboard screenshots but no links and I haven’t found them so far
[17:53:42] <Framawiki>	 Hi, on the OTRS interface I see the following message: "The OTRS background task is not launched. Please contact your administrator." Should I create a task for it, or is somebody playing on the host?
[17:53:56] <cdanis>	 Framawiki: please create a task
[17:54:11] <cdanis>	 Lucas_WMDE: it's possibly https://grafana.wikimedia.org/d/000000218/navigation-timing-by-browser but i can't quite reproduce
[17:54:42] <Lucas_WMDE>	 cdanis: that looks similar at least, thanks
[17:56:04] <jynus>	 Framawiki: reference T187984 on the ticket, it may be something related to the upgrade
[17:56:05] <stashbot>	 T187984: Update OTRS to the latest stable version (6.0.x) - https://phabricator.wikimedia.org/T187984
[17:56:16] <cdanis>	 Lucas_WMDE: also I pinged in #wikimedia-perf
[17:56:24] <Lucas_WMDE>	 ah, so that’s the channel :) thanks
[17:56:33] <Lucas_WMDE>	 (tried -performance and -performance-team and then gave up ^^)
[17:56:35] <cdanis>	 g.illes was around half an hour ago, so maybe still is
[17:58:05] <Framawiki>	 thanks jynus cdanis , created https://phabricator.wikimedia.org/T263873
[17:59:03] <gilles>	 Lucas_WMDE: if I’m reading the backscroll correctly you’ve deployed the backport to production, right?
[17:59:09] <wikibugs>	 10Operations, 10Traffic, 10Platform Team Initiatives (API Gateway), 10Platform Team Sprints Board (Sprint 4), and 2 others: Client Developer has a cookie-free API call - https://phabricator.wikimedia.org/T258748 (10WDoranWMF)
[17:59:13] <Lucas_WMDE>	 gilles: yes
[17:59:38] <Lucas_WMDE>	 j.dlrobson confirmed it worked on mwdebug2001 but had to leave before CI finished and I synced the change everywhere
[17:59:40] <gilles>	 The effect won’t be immediate on RUM metrics (navigation timing), because affected articles will need to get purged/edited for the bad HTML to go away
[17:59:46] <Lucas_WMDE>	 ok
[18:00:16] <gilles>	 In the meantime I will purge one of the articles we have synthetic tests for and we should see the effect on the next test run
[18:00:34] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "Yep, per IRC and confirmed, I also could not see it in openstack-browser or across the repo. Thanks for the cleanup change." [puppet] - 10https://gerrit.wikimedia.org/r/629863 (owner: 10Jeena Huneidi)
[18:01:00] <gilles>	 Incidentally dpifke is currently fixing a disk space problem on the test host, but it should be back to normal shortly
[18:01:49] <Lucas_WMDE>	 in that case I’ll just watch mediawiki-errors for a bit more and otherwise hope it works
[18:02:35] <gilles>	 Should be fine, imho, I verified it on Beta earlier
[18:02:47] <gilles>	 I’ll check the next synthetic test runs when I go home later
[18:02:53] <Lucas_WMDE>	 ok thanks
[18:03:08] <Lucas_WMDE>	 then I think I can go home as well soon :)
[18:03:48] <Jdlrobson>	 back! thanks Lucas_WMDE gilles 
[18:03:51] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] Remove jenkins update targets for jessie/stretch and rename the update config [puppet] - 10https://gerrit.wikimedia.org/r/630095 (https://phabricator.wikimedia.org/T260282) (owner: 10Muehlenhoff)
[18:04:28] <Lucas_WMDE>	 hope everything went well!
[18:05:07] <dpifke>	 Done moving files around, WPT host should be healthy now.
[18:08:33] <wikibugs>	 10Operations, 10Traffic, 10serviceops, 10Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (10hashar)
[18:08:35] <wikibugs>	 10Operations, 10Continuous-Integration-Infrastructure, 10Traffic: contint.wikimedia.org: add TLS termination - https://phabricator.wikimedia.org/T263830 (10hashar)
[18:22:11] <wikibugs>	 (03PS2) 10Dzahn: bastionhost::pop: remove tftp from bastions [puppet] - 10https://gerrit.wikimedia.org/r/629496 (https://phabricator.wikimedia.org/T252526)
[18:26:30] <wikibugs>	 (03PS2) 10Dzahn: maps: hiera()->lookup(), add data types [puppet] - 10https://gerrit.wikimedia.org/r/629439
[18:27:53] <wikibugs>	 (03PS2) 10Dzahn: service::uwsgi: replace hiera() with lookup() [puppet] - 10https://gerrit.wikimedia.org/r/629437
[18:29:06] <wikibugs>	 10Operations, 10Continuous-Integration-Infrastructure, 10Jenkins, 10Release-Engineering-Team (CI & Testing services), and 2 others: Review process to fetch Jenkins Debian package from upstream - https://phabricator.wikimedia.org/T260282 (10hashar) >>! In T260282#6493671, @MoritzMuehlenhoff wrote: > This do...
[18:33:11] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/25446/" [puppet] - 10https://gerrit.wikimedia.org/r/629437 (owner: 10Dzahn)
[18:37:46] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/25448/" [puppet] - 10https://gerrit.wikimedia.org/r/629425 (owner: 10Dzahn)
[18:37:53] <wikibugs>	 (03PS2) 10Dzahn: statistics::web: hiera->lookup, add data type [puppet] - 10https://gerrit.wikimedia.org/r/629425
[18:40:29] <logmsgbot>	 !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@303eaf3]: Enable icutoknorm in glent m0 and m1
[18:40:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:43:25] <wikibugs>	 10Operations, 10Continuous-Integration-Infrastructure, 10Traffic: contint.wikimedia.org: add TLS termination - https://phabricator.wikimedia.org/T263830 (10Dzahn) This was basically all done:  https://gerrit.wikimedia.org/r/c/operations/puppet/+/591000 tlsproxy::envoy: allow limiting firewall srange https://...
[18:49:02] <wikibugs>	 (03CR) 10Dzahn: "confirmed noop on thorium" [puppet] - 10https://gerrit.wikimedia.org/r/629425 (owner: 10Dzahn)
[18:52:35] <wikibugs>	 10Operations, 10DBA, 10Wikimedia-Incident: S5 replication issue, affecting watchlist and probably recentchanges - https://phabricator.wikimedia.org/T263842 (10Marostegui) >>! In T263842#6493968, @akosiaris wrote: > List of affected wikis >  > ` > apiportalwiki > avkwiki > cebwiki > dewiki > enwikivoyage > ja...
[19:02:01] <wikibugs>	 10Operations, 10DBA, 10Wikimedia-Incident: S5 replication issue, affecting watchlist and probably recentchanges - https://phabricator.wikimedia.org/T263842 (10Marostegui) >>! In T263842#6494987, @Marostegui wrote: >>>! In T263842#6493968, @akosiaris wrote: >> List of affected wikis >>  >> ` >> apiportalwiki...
[19:10:07] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1] swift::proxy: convert role to profile, fix various lint issues (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/628970 (owner: 10Dzahn)
[19:10:21] <wikibugs>	 (03PS3) 10Dzahn: swift: convert roles to profiles, fix various lint issues [puppet] - 10https://gerrit.wikimedia.org/r/628970
[19:10:55] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1] "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/628970 (owner: 10Dzahn)
[19:11:24] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] swift: convert roles to profiles, fix various lint issues [puppet] - 10https://gerrit.wikimedia.org/r/628970 (owner: 10Dzahn)
[19:28:13] <wikibugs>	 (03PS4) 10Dzahn: swift: convert roles to profiles, fix various lint issues [puppet] - 10https://gerrit.wikimedia.org/r/628970
[19:29:13] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] swift: convert roles to profiles, fix various lint issues [puppet] - 10https://gerrit.wikimedia.org/r/628970 (owner: 10Dzahn)
[19:30:34] <wikibugs>	 (03CR) 10Dzahn: [C: 04-1] "stalled" [puppet] - 10https://gerrit.wikimedia.org/r/620729 (https://phabricator.wikimedia.org/T180761) (owner: 10Dzahn)
[19:34:27] <logmsgbot>	 !log ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@303eaf3]: Enable icutoknorm in glent m0 and m1 (duration: 53m 58s)
[19:34:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:39:21] <wikibugs>	 (PS1) Dzahn: cumin: remove alias wikireplicas-analytics [puppet] - https://gerrit.wikimedia.org/r/630256
[19:45:19] <wikibugs>	 (PS1) Dzahn: phabricator: don't create chk_phuser shell script from erb (WIP) [puppet] - https://gerrit.wikimedia.org/r/630257
[19:46:20] <wikibugs>	 (CR) jerkins-bot: [V: -1] phabricator: don't create chk_phuser shell script from erb (WIP) [puppet] - https://gerrit.wikimedia.org/r/630257 (owner: Dzahn)
[20:02:21] <wikibugs>	 (PS3) Dzahn: base/monitoring: move monitor_screens to proper profile parameter [puppet] - https://gerrit.wikimedia.org/r/628152
[20:02:23] <wikibugs>	 (CR) Dzahn: "recheck" [puppet] - https://gerrit.wikimedia.org/r/628152 (owner: Dzahn)
[20:03:23] <wikibugs>	 (CR) jerkins-bot: [V: -1] base/monitoring: move monitor_screens to proper profile parameter [puppet] - https://gerrit.wikimedia.org/r/628152 (owner: Dzahn)
[20:09:37] <wikibugs>	 (PS4) Dzahn: base/monitoring: move monitor_screens to proper profile parameter [puppet] - https://gerrit.wikimedia.org/r/628152
[20:10:18] <wikibugs>	 (PS1) Ppchelko: Eventstreams: expose revision-visibility-change events. [deployment-charts] - https://gerrit.wikimedia.org/r/630262 (https://phabricator.wikimedia.org/T262479)
[20:10:40] <wikibugs>	 (CR) jerkins-bot: [V: -1] base/monitoring: move monitor_screens to proper profile parameter [puppet] - https://gerrit.wikimedia.org/r/628152 (owner: Dzahn)
[20:17:18] <wikibugs>	 (PS6) ArielGlenn: showcrcs: util to write out crc information from a bzip2 file [dumps/mwbzutils] - https://gerrit.wikimedia.org/r/490299 (https://phabricator.wikimedia.org/T216009)
[20:17:20] <wikibugs>	 (PS1) ArielGlenn: new util to display info about revisions for one or more pages from XML input [dumps/mwbzutils] - https://gerrit.wikimedia.org/r/630267
[20:18:06] <wikibugs>	 (PS2) ArielGlenn: new util to display info about revisions for one or more pages from XML input [dumps/mwbzutils] - https://gerrit.wikimedia.org/r/630267 (https://phabricator.wikimedia.org/T263319)
[20:18:08] <wikibugs>	 (CR) Dzahn: "I do not currently see why this "gets struct" instead of a Boolean. Looks like a Boolean in hiera to me." [puppet] - https://gerrit.wikimedia.org/r/628152 (owner: Dzahn)
[20:24:47] <wikibugs>	 Operations, Wikimedia-Mailing-lists: Create arbcom-ru@wikimedia.org - https://phabricator.wikimedia.org/T262525 (Adamant.pwn) Stalled→Resolved arbcom-ru@wikimedia.org was established.
[20:25:18] <wikibugs>	 Operations, Scap, Wikidata, Wikidata-Query-Service, and 2 others: Scap configuration for WDQS should get server groups from a known source or truth - https://phabricator.wikimedia.org/T252124 (thcipriani) >>! In T252124#6494323, @Zbyszko wrote: > @thcipriani I have the change in puppet (https://g...
[20:26:51] <effie>	 !log installing memcached 1.4.33-1+deb9u1 on mwdebug1001
[20:26:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:27:22] <DannyS712>	 Hey, I can't log in on mulwikisource (wikisource.org) - the auto login isn't working (even though I'm logged in on other wikis) and despite trying to submit the form repeatedly I get "There seems to be a problem with your login session; this action has been canceled as a precaution against session hijacking. Please resubmit the form."
[20:33:51] <mutante>	 it works for me, i am logged in on wikisource.org with SAL account after clicking login (it wasn't fully automatic though, i had to click). could be Safari/iPad like in https://phabricator.wikimedia.org/T157592
[20:35:26] <DannyS712>	 I'm in chrome and other wikis work fine, with automatic, and sometimes if automatic doesn't work automatically I have to manually log in, but it won't let me log in manually
[20:36:35] <DannyS712>	 visit https://wikisource.org/wiki/Special:UserLogin - put in username and password - click login - doesn't work (session hijacking) - fails with both "keep me logged in" enabled and not enabled
[20:37:44] <DannyS712>	 I guess I'll just try again in a few hours?
[20:39:05] <mutante>	 DannyS712: " You may receive this message if you are blocking cookies." i'd say more like try with different browser
[20:39:57] <mutante>	 see the other tickets when searching phabricator for "precaution against session hijacking"
[20:40:09] <DannyS712>	 The message I'm seeing doesn't say anything about cookies, and nothing has changed from other wikis and from previous times I've logged in
[20:41:27] <mutante>	 please create a new ticket then
[20:41:28] <wikibugs>	 (CR) MusikAnimal: Enable managed sources.list on sretest1001 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/630066 (https://phabricator.wikimedia.org/T156562) (owner: Muehlenhoff)
[20:44:11] <DannyS712>	 filed T263888
[20:44:12] <stashbot>	 T263888: Unable to login on sourcewiki - `sessionfailure` - https://phabricator.wikimedia.org/T263888
[20:44:23] <DannyS712>	 is there a specific tag I should add for WMF login / operations?
[20:44:32] <DannyS712>	 also, anything I can do to debug or provide more info?
[20:47:32] <mutante>	 DannyS712: thanks. it seems you already found a tag specifically for MW-user-login, seems good
[20:48:47] <wikibugs>	 Operations, Traffic, Wikimedia-General-or-Unknown, User-DannyS712: Pages whose title ends with semicolon (;) are intermittently inaccessible - https://phabricator.wikimedia.org/T238285 (Krinkle) Simple repro:  ` krinkle@people1002$ echo -e "Hello world.\n" > 'foo;' krinkle@people1002$ cat foo\; H...
[20:54:21] <wikibugs>	 (CR) Dave Pifke: "I need to work with Timo on testing and rolling out the mediawiki-config patch, which should land first.  I'm out next week but will pick " [puppet] - https://gerrit.wikimedia.org/r/620729 (https://phabricator.wikimedia.org/T180761) (owner: Dzahn)
[20:54:24] <wikibugs>	 (PS2) Effie Mouzeli: WIP mcrouter: install ohhost memcached on MediaWiki servers [puppet] - https://gerrit.wikimedia.org/r/629830 (https://phabricator.wikimedia.org/T244340)
[20:55:33] <wikibugs>	 (CR) jerkins-bot: [V: -1] WIP mcrouter: install ohhost memcached on MediaWiki servers [puppet] - https://gerrit.wikimedia.org/r/629830 (https://phabricator.wikimedia.org/T244340) (owner: Effie Mouzeli)
[21:11:53] <logmsgbot>	 !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@d999f76]: adding debug info to deployment
[21:11:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:23:27] <logmsgbot>	 !log ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@d999f76]: adding debug info to deployment (duration: 11m 33s)
[21:23:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:29:57] <wikibugs>	 (CR) Dzahn: [C: -1] "> Patch Set 2:" [puppet] - https://gerrit.wikimedia.org/r/620729 (https://phabricator.wikimedia.org/T180761) (owner: Dzahn)
[21:33:47] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[21:35:23] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[21:58:07] <wikibugs>	 (PS1) Dzahn: mariadb::core_test: hiera->lookup, add data types [puppet] - https://gerrit.wikimedia.org/r/630301
[21:58:58] <wikibugs>	 (CR) jerkins-bot: [V: -1] mariadb::core_test: hiera->lookup, add data types [puppet] - https://gerrit.wikimedia.org/r/630301 (owner: Dzahn)
[22:03:52] <wikibugs>	 (PS2) Dzahn: mariadb::core_test: hiera->lookup, add data types [puppet] - https://gerrit.wikimedia.org/r/630301
[22:04:23] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:04:37] <wikibugs>	 (CR) jerkins-bot: [V: -1] mariadb::core_test: hiera->lookup, add data types [puppet] - https://gerrit.wikimedia.org/r/630301 (owner: Dzahn)
[22:05:57] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:06:48] <logmsgbot>	 !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@d1a619f]: increase airflow_variable debugging verbosity
[22:06:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:09:09] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:12:23] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:17:30] <logmsgbot>	 !log ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@d1a619f]: increase airflow_variable debugging verbosity (duration: 10m 42s)
[22:17:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:23:05] <wikibugs>	 (PS15) Dzahn: prometheus: replace remaining hiera() with lookup() [puppet] - https://gerrit.wikimedia.org/r/623666
[22:24:07] <wikibugs>	 (CR) jerkins-bot: [V: -1] prometheus: replace remaining hiera() with lookup() [puppet] - https://gerrit.wikimedia.org/r/623666 (owner: Dzahn)
[22:29:03] <wikibugs>	 (PS16) Dzahn: prometheus: replace remaining hiera() with lookup() [puppet] - https://gerrit.wikimedia.org/r/623666
[22:30:04] <wikibugs>	 (CR) jerkins-bot: [V: -1] prometheus: replace remaining hiera() with lookup() [puppet] - https://gerrit.wikimedia.org/r/623666 (owner: Dzahn)
[22:36:10] <logmsgbot>	 !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@a135388]: correct scap variable reference in airflow_variables
[22:36:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:41:58] <wikibugs>	 (CR) Dzahn: labstore: add data types and some other style fixes (6 comments) [puppet] - https://gerrit.wikimedia.org/r/622666 (owner: Dzahn)
[22:42:14] <wikibugs>	 (PS9) Dzahn: labstore: add data types and some other style fixes [puppet] - https://gerrit.wikimedia.org/r/622666
[22:43:16] <wikibugs>	 (CR) jerkins-bot: [V: -1] labstore: add data types and some other style fixes [puppet] - https://gerrit.wikimedia.org/r/622666 (owner: Dzahn)
[22:52:07] <wikibugs>	 (PS1) Dzahn: facilities: replace Stdlib::Ip_address with IP::Address [puppet] - https://gerrit.wikimedia.org/r/630308
[22:53:57] <wikibugs>	 (PS1) Dzahn: phabricator: replace Stdlib::Ip_address with IP::Address [puppet] - https://gerrit.wikimedia.org/r/630309
[22:56:04] <wikibugs>	 (PS1) Dzahn: tor: use Stdlib::Host to match FQDN or IP [puppet] - https://gerrit.wikimedia.org/r/630310
[23:03:08] <logmsgbot>	 !log ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@a135388]: correct scap variable reference in airflow_variables (duration: 26m 57s)
[23:03:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:22:39] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[23:24:11] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[23:29:12] <wikibugs>	 (PS1) Dzahn: cassandra: add data types, remove validation code [puppet] - https://gerrit.wikimedia.org/r/630312
[23:29:38] <wikibugs>	 (CR) jerkins-bot: [V: -1] cassandra: add data types, remove validation code [puppet] - https://gerrit.wikimedia.org/r/630312 (owner: Dzahn)
[23:39:12] <wikibugs>	 (PS2) Dzahn: cassandra: add data types, remove validation code [puppet] - https://gerrit.wikimedia.org/r/630312
[23:42:34] <wikibugs>	 (PS1) Dzahn: add testvm5001 to test install5001 [dns] - https://gerrit.wikimedia.org/r/630313 (https://phabricator.wikimedia.org/T252526)
[23:44:19] <wikibugs>	 (PS1) CDanis: geoip VCL: init/free functions are now reusable [puppet] - https://gerrit.wikimedia.org/r/630314 (https://phabricator.wikimedia.org/T263496)
[23:44:21] <wikibugs>	 (PS1) CDanis: geoip VCL: add a 'which' param to get_geo_xcip [puppet] - https://gerrit.wikimedia.org/r/630315 (https://phabricator.wikimedia.org/T263496)
[23:44:23] <wikibugs>	 (PS1) CDanis: VCL: Attach a variety of GeoIP info as bereq headers [puppet] - https://gerrit.wikimedia.org/r/630316 (https://phabricator.wikimedia.org/T263496)
[23:44:25] <wikibugs>	 (PS10) Dzahn: labstore: add data types and some other style fixes [puppet] - https://gerrit.wikimedia.org/r/622666
[23:46:13] <wikibugs>	 (PS17) Dzahn: prometheus: replace remaining hiera() with lookup() [puppet] - https://gerrit.wikimedia.org/r/623666
[23:48:02] <wikibugs>	 (PS3) Dzahn: mariadb::core_test: hiera->lookup, add data types [puppet] - https://gerrit.wikimedia.org/r/630301
[23:48:47] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[23:48:48] <wikibugs>	 (CR) jerkins-bot: [V: -1] mariadb::core_test: hiera->lookup, add data types [puppet] - https://gerrit.wikimedia.org/r/630301 (owner: Dzahn)
[23:50:19] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[23:52:17] <wikibugs>	 (PS1) Dzahn: mariadb::core_test: convert role to profile [puppet] - https://gerrit.wikimedia.org/r/630317
[23:53:28] <wikibugs>	 (CR) jerkins-bot: [V: -1] mariadb::core_test: convert role to profile [puppet] - https://gerrit.wikimedia.org/r/630317 (owner: Dzahn)
[23:55:27] <wikibugs>	 (PS4) Dzahn: mariadb::core_test: hiera->lookup, add data types [puppet] - https://gerrit.wikimedia.org/r/630301
[23:56:14] <wikibugs>	 (CR) jerkins-bot: [V: -1] mariadb::core_test: hiera->lookup, add data types [puppet] - https://gerrit.wikimedia.org/r/630301 (owner: Dzahn)
[23:56:29] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[23:57:59] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[23:58:45] <wikibugs>	 (PS5) Dzahn: swift: convert roles to profiles, fix various lint issues [puppet] - https://gerrit.wikimedia.org/r/628970